# DS107 Big Data: Lesson Two Companion Notebook

### Table of Contents <a class="anchor" id="DS107L2_toc"></a>

* [Table of Contents](#DS107L2_toc)
    * [Page 1 - Introduction](#DS107L2_page_1)
    * [Page 2 - Windows Installation of Virtual Box](#DS107L2_page_2)
    * [Page 3 - Install a Hortonworks Sandbox](#DS107L2_page_3)
    * [Page 4 - Windows Installation - Enable Virtualization on Your Computer](#DS107L2_page_4)
    * [Page 5 - Mac and Linux Installations](#DS107L2_page_5)
    * [Page 6 - Install a Hortonworks Sandbox](#DS107L2_page_6)
    * [Page 7 - Introduction to Ambari](#DS107L2_page_7)
    * [Page 8 - Introduction to HDFS](#DS107L2_page_8)
    * [Page 9 - Interacting with HDFS](#DS107L2_page_9)
    * [Page 10 - Windows Connecting to your Cluster via Command Prompt](#DS107L2_page_10)
    * [Page 11 - Mac and Linux Connecting to your Cluster via Command Prompt](#DS107L2_page_11)
    * [Page 12 - Linux System Basics](#DS107L2_page_12)
    * [Page 13 - Vi](#DS107L2_page_13)
    * [Page 14 - Using HDFS from the Command Prompt](#DS107L2_page_14)
    * [Page 15 - Exiting your Virtual Machine](#DS107L2_page_15)
    * [Page 16 - Key Terms](#DS107L2_page_16)
    * [Page 17 - Lesson 2 Hands-On](#DS107L2_page_17)
    * [Page 18 - Lesson 2 Hands-On Solution - Alternative Assignment](#DS107L2_page_18)
    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Introduction<a class="anchor" id="DS107L2_page_1"></a>

[Back to Top](#DS107L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

```python
from IPython.display import VimeoVideo
# Tutorial Video Name: Getting Started with Hadoop
VimeoVideo('388550429', width=720, height=480)
```

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO107L02overview.zip)**.

# Introduction

In this lesson, you will install a virtual machine that will run Hadoop. You'll then learn how to play around with the basics of Hadoop, Ambari, HDFS, and Linux. By the end of this lesson, you should be able to: 

* Install VirtualBox 
* Install a Hortonworks Sandbox that contains a Hadoop environment
* Enable virtualization on your computer if necessary
* Navigate through Ambari
* Understand the structure of HDFS
* Interact with HDFS through Ambari and the command line
* Utilize basic Linux commands
* Be familiar with vi commands
* Exit your virtual machine safely

This lesson will culminate in a hands-on in which you utilize HDFS to load and remove data from your cluster.


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Windows Installation of Virtual Box<a class="anchor" id="DS107L2_page_2"></a>

[Back to Top](#DS107L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Windows Installation of Virtual Box

The next three pages contain the Hadoop installation instructions for Windows.  If you are on a Mac or Linux machine, please skip the next three pages. 

The first installation you will complete is `VirtualBox`. `VirtualBox` is a free and open-source tool that allows you to create, manage, and run virtual machines or virtual computers. Visit the VirtualBox website **[here](https://www.virtualbox.org/wiki/Downloads)** to download it. Choose your operating system, and the VirtualBox installer will automatically start downloading, as seen below:

![Snapshot of a page of the virtual box. Below the logo, the left panel displays about, screenshots, downloads, documentation, end-user docs, technical docs, contribute, and community. The page covers VirtualBox binaries. The provided sub-topics are VirtualBox 5.2.2 platform packages, VirtualBox 5.2.2 Oracle VM VirtualBox extension pack, and VirtualBox 5.2.2 software developer kit. The hyperlinks under VirtualBox 5.2.2 platform packages are windows hosts, OS X hosts, Linux distributions, and Solaris hosts. The above mentioned five hyperlinks are highlighted with an arrow.](Media/VirtualBox.png)

Once the download finishes, run the installer. You want the defaults, so just click next through the first two pages.  Next, you will see a page like this: 

![A window labeled Oracle VM VirtualBox 6.0.12, on the right x is present. A storage box is present. Two commit buttons are labeled Yes and No, where Yes is highlighted in blue.](Media/install3.png)

Don't panic! All this is telling you is that it will briefly disconnect you from your internet and then reconnect you. Typically this is no big deal, but if you have someone else sharing your internet connection that is doing something critical, or you yourself have something downloading that you don't want to interrupt, you may want to wait until you're done.  Otherwise, go ahead and proceed.  Then click install: 

![A window labeled Oracle VM VirtualBox 6.0.12 Setup, on the right x is present. Below captioned. Three commit buttons are labeled <Back, Install and Cancel, where Install is highlighted in blue.](Media/install4.png)

If you see any warnings such as this (you may or may not), it's totally fine to proceed: 

![A dialog box labeled Window Security contains the Main instruction, content area, checkbox, footnote area captioned. And two commit buttons labeled Install and Don’t install where the second button Don’t install is highlighted in blue.](Media/install5.png)

You should now reach the screen below. Make sure that the box is unchecked before you click finish, as you don't want to start this up yet.

![A window labeled Oracle VM VirtualBox 6.0.12 Setup, on the right x is present. A storage box is present. Main instruction and checkbox are present and Three commit buttons are labeled, <Back, Finish and Cancel, where Finish is highlighted in blue.](Media/install6.png)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Install a Hortonworks Sandbox <a class="anchor" id="DS107L2_page_3"></a>

[Back to Top](#DS107L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Install a Hortonworks Sandbox

These directions are for installing Hadoop on a Windows computer.  If you have a Mac or Linux machine, please skip ahead to Page 5.

Next you want to actually install Hadoop, and you will do that through a distribution called Hortonworks. Using the Hortonworks distribution means that a lot of things come pre-installed, including Hadoop, so there is less work for you! 

**[Here is the website to download the Hortonworks Sandbox](https://www.cloudera.com/downloads/hortonworks-sandbox.html)**

You want to go with the option for HDP, which is on the left: 

![A webpage from the cloudera website displays a message that reads, get started with Horton works sandbox. Hortonworks sandbox can help you get started learning, developing, testing and trying out new features on HDP and HDF. It also displays two separate panels for downloading Hortonworks HDP and Hortonworks HDF with separate download now buttons.](Media/install7.png)

Then click download, and you should reach a screen looking something like the screenshot below.  Make sure that you choose the installation type of Virtualbox, since that is the virtual machine software you just installed.  

![A webpage from the cloudera website displays a message that reads, Hortonworks data platform on Hortonworks sandbox. It also displays a dropdown list box labeled Virtualbox and a button labeled let's go.](Media/install8.png)

Next you'll get a popup that is asking for some contact information.  Don't worry, the program is still free, and you DO NOT need to select any of the info sharing options at the bottom to be able to continue.  You do, however, need to fill in every field: 

![The sign in or complete our product interest form displays a sign in button and a few text fields and a dropdown list box. The dropdown list box is labeled as why are you downloading this product? The text fields are labeled first name, last name, business email, company, job title, and phone. It also has three checkboxes to confirm about the privacy and data policy. The page has a button labeled continue at the end of the page.](Media/install9.png)

You will also need to agree to the terms on the next page: 

![A window has a caption that reads please read and accept our terms. The window has scroll bars and x on the right corner. The window displays the acceptance of terms of use. Below contains a checkbox ticked and a button labeled Submit.](Media/install10.png)

Then click submit, and you will be brought to this page.  You want to choose version 2.5, because it requires fewer computer resources. Using this older version reduces the chance that you won't be able to follow along with the curriculum because your computer doesn't have enough memory.  Big data processing is very intense!

![A window captioned along with two points and an Orange button labeled HDP Sandbox 3.0.1 open bracket Latest close bracket.](Media/install11.png)

Click on 2.5 to start downloading, and then settle in for the long haul.  This will take some time to download, especially on a slower connection or an older computer. 

Your computer may ask you what program you'd like to open the file you downloaded. It could look something like this: 

![A window captioned contains three icons, a down arrow, a checkbox, and an OK button. Where the first icon captioned is highlighted.](Media/install12.png)

You will want to choose the VirtualBoxManager option if this happens.

Then press the import button. 

It may take a little while, since again, that Hortonworks Distribution is relatively large.

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>If you are having trouble with the download, and are using Chrome, try using a different browser. </p>
    </div>
</div>

Once that is done, your VirtualBox should look something like this, though you probably will only have one system on the left.  Regardless, go ahead and click on the one that says ```Hortonworks Docker``` and then click the big green start button at the top.

![A window labeled Oracle VM VirtualBox Manager contains x, square, minus, scroll bar on the right corner, below three menu options, are present and fifteen icons are present and captioned with a black screen labeled Hortonworks Docker Sandbox.](Media/install13.png)
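The same start action is also available from a terminal through VirtualBox's `VBoxManage` tool. The sketch below only builds the command, so it is safe to run anywhere. It assumes `VBoxManage` is on your PATH (on Windows it often is not by default) and that your imported machine is named `Hortonworks Docker Sandbox`; the helper name `start_vm_command` is made up for illustration, so check the machine name shown in your own VirtualBox Manager.

```python
def start_vm_command(vm_name, headless=False):
    """Build the VBoxManage command that starts a VM, like the green start button."""
    # "gui" opens the usual VM window; "headless" runs the VM without a window.
    vm_type = "headless" if headless else "gui"
    return ["VBoxManage", "startvm", vm_name, "--type", vm_type]


if __name__ == "__main__":
    # Assumed VM name: replace it with the name listed by `VBoxManage list vms`.
    command = start_vm_command("Hortonworks Docker Sandbox")
    print(" ".join(command))
    # To actually start the VM, pass `command` to subprocess.run(command).
```

Clicking the start button in the VirtualBox Manager does the same job, so treat this purely as an optional alternative.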

Once you click start, one of two things will happen: Either your virtual machine will begin to boot up, or you will get an error that looks something like this: 

![A dialog box labeled VirtualBox- Error with x and a question mark on the right, below has a red circle labeled X and captioned on the side, two commit buttons labeled OK and Copy are present.](Media/install14.png)

Now don't panic! The first thing to do is to look at the details tab, which will display the type of error you are having.  Most likely, you will have the same one as here: 

![A dialog box labeled VirtualBox- Error with x and a question mark on the right, below has a red circle labeled X and captioned on the side, divided into three paragraphs captioned where the third paragraph is highlighted in a grey box, two commit buttons labeled OK and Copy are present.](Media/install15.png)

The way to fix this particular error is to enable virtualization on your computer.  How do you do that? Glad you asked! Please proceed to the next page if you received this error.
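If you would like to see what Windows itself reports before rebooting, the built-in `systeminfo` command often prints a `Virtualization Enabled In Firmware` line near the end of its output. The sketch below parses that text. It assumes an English-language Windows installation, and the helper name `virtualization_enabled` is hypothetical. The line can be missing entirely (for example, when another hypervisor is already active), in which case the function returns `None` rather than guessing.

```python
import subprocess


def virtualization_enabled(systeminfo_text):
    """Return True/False from the firmware virtualization line, or None if it is absent."""
    marker = "virtualization enabled in firmware"
    for line in systeminfo_text.splitlines():
        lowered = line.lower()
        if marker in lowered:
            # The value is whatever follows the last colon on that line.
            return lowered.rpartition(":")[2].strip() == "yes"
    return None


if __name__ == "__main__":
    try:
        # systeminfo exists on Windows only; elsewhere this raises FileNotFoundError.
        output = subprocess.run(["systeminfo"], capture_output=True, text=True).stdout
    except FileNotFoundError:
        output = ""
    print(virtualization_enabled(output))
```

A result of `False` matches the error above and means the BIOS change on the next page is needed; `None` simply means this check could not tell.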


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Windows Installation - Enable Virtualization on Your Computer<a class="anchor" id="DS107L2_page_4"></a>

[Back to Top](#DS107L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Windows Installation - Enable Virtualization on Your Computer

Many computers do not come pre-setup with the ability to host a virtual machine, so you will have to play with your computer's settings to get this to work. To do this, you need to change something called the ```BIOS``` settings.  Changing these settings can only be done in the middle of starting your computer, which can be a pain, and every computer will look a little bit different.  There are screenshots and actual screen pictures provided on this page, but if things don't match up exactly, don't worry.  There is a lot of variation, so use your own best judgement.

The easiest way to access ```BIOS``` is to do a special restart of your computer that can be accessed through the ```Settings``` menu. However, you may not have this option, so a second option will be provided as well to interrupt your normal start processes.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Some of these directions WILL NOT be viewable on your computer while you follow them, since you will NOT be able to access your normal computer system.  Make sure you can reach the LMS from a phone, tablet, or second computer, or print these directions out. </p>
    </div>
</div>

---

## Advanced Startup Restart

Go to your ```Settings``` on your computer: 

![A window labeled Settings contains a search bar captioned, fourteen icons labeled System, Devices, Phone, Network & Internet, Personalization, Apps, Accounts, Time & language, Gaming, Ease of Access, Search, Cortana, Privacy, and Update & Security.](Media/install21.png)

Then go to the ```Update & Security``` tab:

![A window labeled Settings on the left contains an icon labeled Home, a search bar captioned, ten icons labeled Windows Update, Delivery optimization, Windows Security, Backup, Troubleshoot, Recovery, Activation, Find my device, For developers, and Windows Insider Program. On the right five icons are present and the button is labeled Check for updates.](Media/install22.png)

Then choose the ```Recovery``` option from the left hand side:

![A window labeled Settings on the left contains an icon labeled Home, a search bar captioned, ten icons labeled Windows Update, Delivery optimization, Windows Security, Backup, Troubleshoot, Recovery, Activation, Find my device, For developers, and Windows Insider Program. On the right three-button are present and captioned.](Media/install23.png)

Then click on the ```Restart Now``` button under the ```Advanced Startup``` section.  This will bring up some advanced settings for your computer as you restart. You want to choose the ```Troubleshoot``` option:

![A snapshot of a screen captioned, with Three icons labeled Continue, Troubleshoot, and Turn off your PC, where the first icon is highlighted in white.](Media/install24.jpg)

Then under ```Troubleshoot```, you will choose the ```Advanced options``` menu:

![A snapshot of a screen captioned, with Three icons labeled Reset this PC, Recovery Tool, and Advanced options, where the first icon is highlighted in white.](Media/install25.jpg)

You will then have an option to click on ```UEFI Firmware Settings```, which will bring you to ```BIOS```.

Please skip the next subsection and proceed to ```Changing BIOS Settings```.

---

## Interrupt Normal Startup

The second way to access ```BIOS``` is to restart your computer as normal, but when it reboots, press a key (usually F1 or F2, though it may tell you) when you get a message like ```Interrupt Normal Startup```. 

---

## Changing BIOS Settings

Either way, you should see a menu that looks something like this:

![A snapshot of a laptop screen viewing ThinkPad Setup caption divided into blue and black, bottom eight options are present with respective keys.](Media/install26.jpg)

Though it is hard to see in the picture, there should be a menu at the top labeled ```Security``` that you can navigate to.  You can't use a mouse now, so you will need to use your right arrow key to move sections, and will need to hit ```Enter``` to select. Once you get to the ```Security``` section, you will want to choose the option for virtualization:

![A snapshot of a laptop screen viewing ThinkPad Setup. Six menus are present where the fourth menu is selected. Nine bullet points where the sixth point is highlighted in white, bottom eight options are present with respective keys.](Media/install28.jpg)

And here are the virtualization options:

![A snapshot of a laptop screen viewing ThinkPad Setup. the menu labeled Security is selected, below captioned in blue and black, bottom eight options are present with respective keys.](Media/install27.jpg)

Note that right now, your computer states that Virtualization Technology is ```Disabled```.  To change that, hit ```F9```, which allows you to toggle to Enabled.  You'll get a confirmation page that looks like this: 

![A snapshot of a laptop screen viewing ThinkPad Setup. the menu labeled Security is selected, below captioned in blue and black, a dialog box appears labeled Setup confirmation with Yes and No options, bottom eight options are present with respective keys.](Media/install29.jpg)

With that, you now have the option to choose ```Enabled```.  Use the arrow keys to select it and then hit ```Enter```.

![A snapshot of a laptop screen viewing ThinkPad Setup. the menu labeled Security is selected, below captioned in blue and black, a blue dialog box appears with Disabled and Enable options, bottom eight options are present with respective keys.](Media/install30.jpg)

Then to save your changes and exit out, press ```F10```. It will bring up one more confirmation window to navigate:

![A snapshot of a laptop screen viewing ThinkPad Setup. the menu labeled Security is selected, below captioned in blue and black, a dialog box appears labeled Setup confirmation with Yes and No options, bottom eight options are present with respective keys.](Media/install31.jpg)

And then your computer will proceed to restart as normal! When you are done, you can open up VirtualBox one more time and try to start Hortonworks with the big green arrow button:

![A window labeled Oracle VM VirtualBox Manager contains x, square, minus, scroll bar on the right corner, below three menu options, are present and fifteen icons are present and captioned with a black screen labeled Hortonworks Docker Sandbox.](Media/install13.png)

The virtual machine should now begin to boot up. You'll see a little red hat icon at the bottom in your tool bar, and when you click on it (if it doesn't automatically pop up), you will see it starting to boot with varying levels of purple loading bars, like this: 

![A window labeled Hortonworks Docker Sandbox [Running]- Oracle VM Virt..., contains X, square, and minus on the right. Below six menu options are present. Black background with two dialog boxes captioned and contains two icons each. A combination of white, grey, and purple are present on the bottom left, Bottom right nine icons are present.](Media/install32.png)

You will know it's done and you are good to go when the purple bars disappear and are replaced by this message:

![A window labeled Hortonworks Docker Sandbox [Running]- Oracle VM Virt..., contains X, square, and minus on the right. Below six menu options are present. Black background with two dialog boxes captioned and contains two icons each. The command prompt is present. Bottom right nine icons are present.](Media/install33.png)

---

## Uh Oh, My Computer Refuses to Cooperate!

If you absolutely can't get your Virtual Machine launched, or can't reach the Ambari GUI covered later in this lesson, and you have already worked with a mentor or instructor to get help, then don't sweat it. Big data takes a lot of processing power and is very finicky, so your computer may just not cooperate.  If that happens, you can read through everything and do the alternative assignments for the hands-ons, which don't require you to actually use Hadoop.  While it's always fun to get your hands dirty, most corporations using big data rely on their own proprietary systems, which won't operate the same way as what you're learning here anyway.  So as long as you get a good grasp of the theoretical basics, you will still be a valuable asset in the workforce!

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Mac and Linux Installations<a class="anchor" id="DS107L2_page_5"></a>

[Back to Top](#DS107L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Mac and Linux Installations

This page contains the installation instructions for Mac and Linux users. If you are on a Windows machine, please skip ahead two pages. The instructions below are shown on a Mac. The steps are the same on Linux, though the screens may look slightly different.

The first installation you will need is `VirtualBox`. `VirtualBox` is a free and open-source tool that allows you to create, manage, and run virtual machines or virtual computers. 

Visit the VirtualBox website **[here](https://www.virtualbox.org/wiki/Downloads)** to download it. Choose your operating system and the VirtualBox installer will automatically start downloading:

![Snapshot of a page of the virtual box. Below the logo, the left panel displays about, screenshots, downloads, documentation, end-user docs, technical docs, contribute, and community. The page covers VirtualBox binaries. The provided sub-topics are VirtualBox 5.2.2 platform packages, VirtualBox 5.2.2 Oracle VM VirtualBox extension pack, and VirtualBox 5.2.2 software developer kit. The hyperlinks under VirtualBox 5.2.2 platform packages are windows hosts, OS X hosts, Linux distributions, and Solaris hosts. The above mentioned five hyperlinks are highlighted with an arrow.](Media/VirtualBox.png)

Once you click the link, you will click on your download, and then run this from the applications folder, like this:

![A window labeled Virtual box displays number one and two highlighted and captioned four icons labeled VirtualBox.pkg, Applications, UserManual.pdf, VirtualBox_Uninstall.tool. in between a cubic are present labeled VirtualBox on two sides and a symbol on the top.](Media/1.png)

Then, you will get a prompt to run a program to see if the software can be installed.  You will want to click `Continue`:

![A window labeled Install Oracle VM VirtualBox on the left three circle middle is yellow colored and on the right, a lock icon is present below displays a dialog box captioned with two commit buttons Cancel and Continue.](Media/2.png)

Next, it will tell you how much space it will take up on your computer.  You will want to click `Install` on this screen:

![A window labeled Install Oracle VM VirtualBox on the left three circle middle is yellow colored and on the right, a lock icon is present below left five options quoted where the third option is highlighted and on the background a cube and disk are interlinked. On the right bottom, four commit buttons are present labeled Change Install Location, Customize, Go back, and Install.](Media/3.png)

You will next need to change your system preferences so that VirtualBox can access some things. The prompt to do so looks something like this:

![A window labeled Accessibility Access (Events) below captioned, on the left present, a lock with gold-colored and the bottom presents two commit buttons labeled Open System Preferences and Deny.](Media/4.png)

When you click the `Open System Preferences` button, you will see the screen below. Click `Allow`:

![A window labeled Security & Privacy, on the left present three circles first is red colored and the middle is yellow colored and on the right search bar is present in which the cursor is present inside below four menu boxes present where the first menu box is highlighted in pink, below four commit buttons captioned and four check box and circles present.](Media/5.png)

You'll know this worked when you get this screen:

![A window labeled Install Oracle VM VirtualBox on the left three circles first is the red colored middle is yellow colored and on the right, a lock icon is present below left five options quoted where the fifth option is highlighted and on the background a cube and disk are interlinked. On the right green tick indicates success bottom, two commit buttons labeled Go back, and Close are present.](Media/6.png)
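As an optional cross-check from a terminal, you can ask VirtualBox's `VBoxManage` tool for its version. This is a minimal sketch, assuming the installer placed `VBoxManage` on your PATH; the helper name `virtualbox_version` and its `executable` parameter are hypothetical and exist only for illustration.

```python
import shutil
import subprocess


def virtualbox_version(executable="VBoxManage"):
    """Return the version string reported by VBoxManage, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == "__main__":
    version = virtualbox_version()
    print(version if version else "VBoxManage was not found on PATH")
```

If the tool is not found, that does not necessarily mean the installation failed; the success screen above is the more reliable sign.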

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Install a Hortonworks Sandbox<a class="anchor" id="DS107L2_page_6"></a>

[Back to Top](#DS107L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Install a Hortonworks Sandbox

These directions are for installing Hadoop on a Mac or Linux computer.  If you have a Windows machine, please skip to the next page.

Next you want to actually install Hadoop, and you will do that through a distribution called Hortonworks. Using the Hortonworks distribution means that a lot of things come pre-installed, including Hadoop, so there is less work for you! 

**[Here is the website to download the Hortonworks Sandbox](https://www.cloudera.com/downloads/hortonworks-sandbox.html)**

You want to go with the option for HDP, which is on the left: 

![A window displays a web browser, there are four menus labeled Why Cloudera, Products, Solutions, Services & Support, are present with three icons on the right corner. Below captioned and has two boxes labeled Hortonworks HDP and Hortonworks HDF both of them have a download button.](Media/install7.png)

Then click download, and you should reach a screen looking something like the screenshot below.  Make sure that you choose the installation type of Virtualbox, since that is the virtual machine software you just installed.  

![A window displays a web browser, with green background captioned on the left, and on the right, a dialog box labeled Installation type Is present in that VirtualBox is selected. and a commit button labeled Let’s Go is present. Below two icons in green and blue labeled HDP on Sandbox. The first icon represents three options and the second icon represents two options.](Media/install8.png)

Next you'll get a popup that is asking for some contact information.  Don't worry, the program is still free, and you DO NOT need to select any of the info sharing options at the bottom to be able to continue.  You do, however, need to fill in every field: 

![A snapshot of a window contains a sign-in button, below one dialog box in which an option is selected. Six boxes labeled First Name, Last Name, Business Email, Company, Job title, and Phone with USA flag icon are present, Three checkboxes are present and captioned. Commit button Continue is present.](Media/install9.png)

You will also need to agree to the terms on the next page: 

![A window with scroll bars, x on the right corner, captioned. Below contains a checkbox ticked and a blue button labeled Submit.](Media/install10.png)

Then click submit, and you will be brought to this page.  You want to choose version 2.5, because it requires fewer computer resources. Using this older version reduces the chance that you won't be able to follow along with the curriculum because your computer doesn't have enough memory.  Big data processing is very intense!

![A window captioned along with two points and an Orange button labeled HDP Sandbox 3.0.1 (Latest).](Media/install11.png)

Click on 2.5 to start downloading, and then settle in for the long haul.  This will take some time to download, especially on a slower connection or an older computer. 

Your computer may ask you what program you'd like to open the file you downloaded. You will want to choose the VirtualBoxManager option if this happens. Then press the import button. It may take a little while, since again, that Hortonworks Distribution is relatively large.

Once that is done, your VirtualBox should look something like this, though you probably will only have one system on the left.  Regardless, go ahead and click on the one that says ```Hortonworks Docker``` and then click the big green start button at the top.

![A window labeled Oracle VM VirtualBox Manager contains x, square, minus, scroll bar on the right corner, below three menu options, are present and fifteen icons are present and captioned, with a black screen labeled Hortonworks Docker Sandbox.](Media/install13.png)

---

## Uh Oh, My Computer Refuses to Cooperate!

**If you absolutely can't get your Virtual Machine launched, or can't reach the Ambari GUI on the next page, and you have already worked with a mentor or instructor to get help, then don't sweat it. Big data takes a lot of processing power and is very finicky, so your computer may just not cooperate.  If that happens, you can read through everything and do the alternative assignments for the hands-ons, which don't require you to actually use Hadoop.  While it's always fun to get your hands dirty, most corporations using big data rely on their own proprietary systems, which won't operate the same way as what you're learning here anyway.  So as long as you get a good grasp of the theoretical basics, you will still be a valuable asset in the workforce!**

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Introduction to Ambari<a class="anchor" id="DS107L2_page_7"></a>

[Back to Top](#DS107L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Introduction to Ambari

Ambari is a browser interface that allows you to interact with your Hadoop cluster directly. 

---

## Connecting to the Ambari GUI

Once your virtual machine is all the way booted up, you can access Ambari for the first time by going to: **[http://127.0.0.1:8888/](http://127.0.0.1:8888/)**. 

Once there, you will see this screen: 

![A window displays a web browser contains three elephants different in size, a Get help button, below three icons captioned and two buttons labeled Launch Dashboard and Quick Links are present.](Media/install34.png)

Click on the left hand side to launch the dashboard, and ensure that you have popup blockers disabled. That should then take you to a login screen:

![A window displays a web browser consists of a box labeled sign in and in that two boxes one for Username and the other for the password, a green commit button labeled sign-in is present.](Media/install35.png)

You will use the login ```maria_dev``` with the password ```maria_dev```. 

In the future, if you want to go directly to Ambari, you can do so by typing in **[http://127.0.0.1:8080/](http://127.0.0.1:8080/)**.
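If the pages above do not load, it can help to check whether anything is listening on those two ports yet, since the virtual machine takes a while to finish booting. The sketch below uses only Python's standard `socket` module; the helper name `port_is_open` and the two-second timeout are arbitrary choices for illustration.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 8888 serves the sandbox welcome page and 8080 serves Ambari, per this lesson.
    for port in (8888, 8080):
        status = "open" if port_is_open("127.0.0.1", port) else "not reachable yet"
        print(f"127.0.0.1:{port} is {status}")
```

An open port only shows that something is answering; the browser is still the real test that Ambari is ready.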

---

## Dashboard

And here is the dashboard. 

![A window displays a web browser consists of five menu options and an icon and a button. On the left has twenty-three options with the button labeled action below. On the right three options are present in which first is selected, two buttons are present below and twenty-two boxes captioned and present in the middle.](Media/install36.png)

The first thing your eye is probably drawn to is the main window of squares that have various moving colored pieces. This is where you will understand the health of your Hadoop cluster. A lot of this is not active, because you are only running your cluster on one node, not multiple, but there is still some interesting stuff to digest! From upper left to bottom right, some of the things to highlight are: 

* **HDFS Disk Usage:** This is the percentage of space taken up by your data and your cluster in total.
* **DataNodes Live:** This tells you the number of nodes you have active right now. It is particularly handy for determining if one is not working properly.

You'll understand some of the others in more depth later, as you learn more about Hadoop's architecture.
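As a rough illustration of the HDFS Disk Usage figure described above, a usage percentage is just used space divided by total capacity. The function name and the numbers below are made up for the example and are not read from Ambari.

```python
def disk_usage_percent(used_bytes, capacity_bytes):
    """Return used space as a percentage of total capacity."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return 100.0 * used_bytes / capacity_bytes


if __name__ == "__main__":
    # Hypothetical numbers: 12 GB used out of a 40 GB total.
    gigabyte = 1024 ** 3
    print(disk_usage_percent(12 * gigabyte, 40 * gigabyte))
```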

Note that you can customize your dashboard to display the important metrics that you need to see, and you can even change the coloring of those metrics, etc. For more details and to play around, hit the ```Metric Actions``` button at the top of the dashboard.

![A snapshot of three options labeled Metrics, Heatmaps, and Config History below has two buttons Metric Actions and Last 1 hour the metric Actions is selected and pops two options Add and Edit with icons.](Media/install37.png)

---

## Heatmaps

If you flip to the next tab to the right, you will see these metrics by heatmap as well.  When you actually have things running, it can provide you with a nice at-a-glance view of status at a slightly more granular level.  You'll examine heatmaps more later when you actually have some things running on your cluster.

![A window displays a web browser consists of five menu options and an icon and a button. On the left has twenty-three options with the button labeled action below. On the right three options are present in which the second is selected, Seven Colors are present Green indicates 0-20, light green indicates 20-40, yellow indicates 40-60, orange indicates 60-80, red indicates 80-100, and the other two colors are captioned. A box with 100 percentage is presently labeled as Maximum.](Media/install38.png)

---

## Config History

Remember that in most corporate situations, you will have more than one node and you will not be the only one who has access to your Hadoop cluster! Therefore it can be nice to have a configuration log, to see what changes have been made to the system.  Luckily, on the last tab to the right, you will find ```Config History```, which looks like this:

![A window displays a web browser consists of five menu options and an icon and a button. On the left has twenty-three options with the button labeled action below. On the right three options are present in which the third is selected. Categorized into five columns and eleven rows captioned.](Media/install39.png)

---

## List of Services

Stepping out of the main square of information, on the far left of your screen, you will see a list of services.  These services come pre-installed with Hortonworks, which is wonderful! It has saved you so much time.  The ones with green icons are active and running, while the ones with briefcases next to them have yet to be started. You can (and will later) click on any of them to see the status of the service and restart it if needed.  

You can also start or stop all services using the ```Actions``` button at the bottom, and when you log in as admin (still to come), you will be able to add new services there as well.

![A snapshot of Twenty- three options labeled with icons HDFS, YARN, MapReduce2, Tez, Hive, HBase, Pig, Sqoop, Oozie, ZooKeeper, Falcon, Storm, Flume, Ambari Infra, Ambari Metrics, Atlas, Kafka, Knox, Ranger, Spark, Spark2, Zeppelin Notebook, and Slider. Below with the button labeled action below has three options with icons labeled Start all, Stop all, and Restart all required are present.](Media/install40.png)

You can also access your services by clicking on the ```Services``` menu on the top navigator bar.

---

## Views

The last important part of Ambari is the view toggle. The small group of squares towards the top right of the screen provides you with different views for interacting with your cluster, including a view for HDFS, YARN, Hive, Pig, Storm, and Tez. You will use most of these in time.

![A snapshot of five menu options labeled Dashboard, Services, Hosts, Alert, Admin, an icon, and a button labeled maria_dev pops up six options are YARN Queue Manager, Files View, Hive View, Pig View, Storm View, and Tez View.](Media/install41.png)

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Introduction to HDFS<a class="anchor" id="DS107L2_page_8"></a>

[Back to Top](#DS107L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Introduction to HDFS

HDFS is the main place where your data gets stored in Hadoop, and it is also responsible for distributing your data across the cluster so that you can access it quickly and reliably.  It handles big files by breaking them into blocks, which are 128 MB each by default, though you can change that setting.  If you have less than 128 MB of data, then you only need one data block. Because the blocks of a file can be spread across many machines, HDFS can store files that are bigger than what an individual hard drive can hold. If you need more space, you can always add *commodity computers*, which is just a fancy way of saying inexpensive, off-the-shelf machines that you can buy or rent.  
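
To make the block arithmetic concrete, here is a minimal sketch in Python. It only assumes the 128 MB default mentioned above; the function name `blocks_needed` and the sample file sizes are made up for illustration and are not part of Hadoop.

```python
import math

BLOCK_SIZE_MB = 128  # the default HDFS block size described above


def blocks_needed(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return how many blocks a file of the given size would be split into."""
    if file_size_mb <= 0:
        return 0  # an empty file needs no data blocks
    # Any leftover data smaller than a full block still needs its own block.
    return math.ceil(file_size_mb / block_size_mb)


for size in (50, 128, 129, 1024):
    print(f"{size} MB file -> {blocks_needed(size)} block(s)")
```

Notice that a 129 MB file needs two blocks, because the last 1 MB does not fit in the first block.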

---

## HDFS Architecture

To get an idea of what is going on behind the scenes, examine the image below, focusing right now on the HDFS layer on the bottom:

![High level architecture of Hadoop. The master node has task tracker, job tracker, name node, and data node. There are two slave nodes and they have a task tracker and a data node. A horizontal dotted line divides the task tracker and data node in all the three nodes. The above portion of the horizontal line is labeled map reduce layer and below portion is labeled HDFS layer. The job tracker on the master node is connected to all the task trackers and the name node in the master node is connected to all the data nodes.](Media/bigData5.png)

This is showing how HDFS is broken up.  There is a *name node*, which is in charge of keeping track of everything.  You can think of it as a version of a master node.  The name node contains a large table with all the different file names you have in your cluster, as well as an edit log, so that it is tracking where each file is at any one time and what has happened to it.

Then you have the *data node*, which is just HDFS' name for a little worker bee.  The data node is where the file block is actually stored.  In the case of big data, there is ordinarily more than one data node, and each of the data nodes can talk to each other to maintain copies of the other blocks as well.

---

### Reading a File

Beneath the surface, whenever you read or work with a file, your node (the *client node*) will send a message to the name node, which will state where the file is, and then the client node will hop over to the data node(s) and retrieve the data required for the particular operation. 

---

### Writing a File

When writing a file, beneath the surface, here is what Hadoop is doing: the client node will reach out to the name node and ask it to create a new entry.  Then the client node will go to the data node(s) and tell them that new data is being written.  The data nodes will then communicate with each other to create a backup of all the data blocks on different data nodes, to ensure reliability.  Then that information is sent back to the name node so that it knows what just happened and where everything is located.
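
If it helps to see the read and write steps as code, below is a toy, in-memory sketch in Python. It is not how Hadoop is implemented: the class name `ToyCluster`, the tiny 4-byte block size, and the rule for picking which nodes hold a copy are all assumptions chosen to keep the example short. It only mirrors the idea that the name node tracks where blocks live while the data nodes hold the blocks and their backups.

```python
class ToyCluster:
    """A toy stand-in for one name node plus several data nodes (not real HDFS)."""

    def __init__(self, node_names, block_size=4, replication=2):
        self.block_size = block_size      # bytes per block (tiny, for illustration)
        self.replication = replication    # copies kept of every block
        self.data_nodes = {name: {} for name in node_names}  # node -> {block id: bytes}
        self.name_node = {}               # file name -> [(block id, [nodes with a copy])]

    def write(self, file_name, data):
        names = list(self.data_nodes)
        entries = []
        for start in range(0, len(data), self.block_size):
            index = start // self.block_size
            block_id = f"{file_name}#{index}"
            # Pick `replication` different nodes, rotating the starting node per block.
            holders = [names[(index + r) % len(names)] for r in range(self.replication)]
            for node in holders:
                self.data_nodes[node][block_id] = data[start:start + self.block_size]
            entries.append((block_id, holders))
        # The name node records which blocks make up the file and where they are.
        self.name_node[file_name] = entries

    def read(self, file_name):
        parts = []
        # Ask the name node where each block is, then fetch it from a node that has it.
        for block_id, holders in self.name_node[file_name]:
            for node in holders:
                if block_id in self.data_nodes[node]:
                    parts.append(self.data_nodes[node][block_id])
                    break
            else:
                raise IOError(f"all copies of {block_id} are lost")
        return b"".join(parts)


cluster = ToyCluster(["node1", "node2", "node3"])
cluster.write("books.csv", b"id,title\n1,Emma\n")
cluster.data_nodes["node1"].clear()  # pretend one data node lost everything
print(cluster.read("books.csv"))     # the backup copies still rebuild the file
```

Because every block was copied to a second data node, the read still succeeds after `node1` is wiped, which is the reliability benefit described above.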

---

## HDFS Resilience

Now, you may be concerned about the fact that there is only one name node.  How can you have fault tolerance when all the controls rest in one node's hands? Well, listed below, are a number of ways in which you can work with Hadoop to deal with this issue.

* **Constant metadata backups:** You can back up your metadata constantly, so that you can always restore from a recent edit log.  You may lose a little bit of work, but it shouldn't be much.
* **Secondary name node:** A secondary name node will maintain a merged copy of the primary name node's edit log.
* **HDFS Federation:** When you have so much data that one name node isn't enough, you can create an HDFS federation that spreads the name node's records across multiple name nodes.  This means that if a name node goes down, you only lose the information on that one name node, not all of it.
* **HDFS High Availability:** Multiple name nodes trade off their work and are always on alert if the other goes down.  This is managed through ZooKeeper.
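
To picture what "merging" an edit log means, here is a small Python sketch. It is a simplified illustration rather than Hadoop's actual on-disk format: the `apply_edits` function, the `create` and `delete` operation names, and the example paths are all hypothetical. The idea is that a saved snapshot of the file table plus a list of recorded changes can be replayed into an up-to-date copy.

```python
def apply_edits(snapshot, edit_log):
    """Replay edit-log entries on top of a metadata snapshot and return the merged table."""
    merged = dict(snapshot)  # work on a copy so the old snapshot stays intact
    for operation, path in edit_log:
        if operation == "create":
            merged[path] = True
        elif operation == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return merged


# A snapshot taken earlier, plus everything that has happened since then.
snapshot = {"/user/maria_dev/books_data": True}
edit_log = [
    ("create", "/user/maria_dev/books_data/books.csv"),
    ("create", "/user/maria_dev/books_data/bookIDs.csv"),
    ("delete", "/user/maria_dev/books_data/books.csv"),
]

checkpoint = apply_edits(snapshot, edit_log)
print(sorted(checkpoint))
```

If the name node's table were lost, a recent merged copy like `checkpoint` is what you would restore from, and only edits made after it would be missing.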

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Interacting with HDFS<a class="anchor" id="DS107L2_page_9"></a>

[Back to Top](#DS107L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Interacting with HDFS

There are many, many ways that you as a user can interact with HDFS to read and write files to your cluster.  They can include: 

* Ambari 
* Command line
* Through a proxy server (website) that sits between your machine and HDFS
* Java
* Network File System (NFS) gateway, which basically allows you to mount a remote file system on a server, so that you can interact with programs that weren't actually designed to work within the Hadoop ecosystem

The last three are quite complex, and you won't learn about them here, though you should know that they are possible.  Instead, you'll focus on interacting with HDFS first through Ambari and then through the command line.

---

## HDFS through Ambari

In order to access HDFS through Ambari, go to the ```Files View```:

![A window displays a web browser consists of five menu options and an icon and a button labeled maria_dev pops up six options are YARN Queue Manager, Files View, Hive View, Pig View, Storm View, and Tez View. On the left has twenty-three options with the button labeled action below. On the right three options are present in which first is selected, two buttons are present below and twenty-two boxes captioned and present in the middle.](Media/install42.png)

This should bring up a page that looks something like this:

![A window displays a web browser consists of five menu options and an icon and a button. On the right three blue buttons and one yellow button on the center are present and captioned. A search bar is present, Categorized into six columns and thirteen rows captioned.](Media/install43.png)

You can navigate into ```user``` then ```maria_dev``` to get to the home view for your access point.  You'll note that there's nothing in it yet, but that will change! You can hit the ```New Folder``` button in the top right:

![A window displays a web browser consists of five menu options and an icon and a button. On the right three blue buttons and one yellow button on the center are present and captioned. A search bar is present, Categorized into six columns captioned and one row contains the icon.](Media/install44.png)

This will then prompt you to name your new folder and add it: 

![A window displays a web browser consists of five menu options and an icon and a button. On the right three blue buttons and one yellow button on the center are present and captioned. A search bar is present, Categorized into six columns captioned and one row contains the icon A window is present and contains Name and a dialog box, two commit buttons Cancel and Add with X in the corner right.](Media/install45.png)

Name this folder ```books_data```.  You will add **[a dataset about books](https://repo.exeterlms.com/documents/V2/DataScience/Big-Data/books.zip)** and **[a dataset that maps the book IDs to their titles](https://repo.exeterlms.com/documents/V2/DataScience/Big-Data/bookIDs.zip)** to HDFS for usage throughout the portion of the module on Hadoop.

To do so, save those files on your local computer, and then you can click into the ```books_data``` folder and then use the ```Upload``` button in the top right to add files. It's as easy as drag and drop, or you can click in the box outline to navigate to the files. 

![A window displays a web browser consists of five menu options and an icon and a button. On the right three blue buttons and one yellow button on the center are present and captioned. A search bar is present, Categorized into six columns captioned and one row contains the icon A window is present and labeled with icon Upload file to /user/maria_dev/books_data, below contains an icon for drag and upload files one commit buttons Cancel with X in the corner right is present.](Media/install46.png)

While you will have to upload the files one at a time, this is still a pretty nifty way to get files into your Hadoop cluster! Of course, it only works with things that are small enough to be stored on your local machine, a single computer. 

Once you've uploaded both, if you select a file, you see that you get a whole new set of options appearing in dark blue along the top left. You can open the file, rename it, set permissions, copy it, remove it, or download the file back to your local machine.

![A window displays a web browser consists of five menu options and an icon and a button. On the right three blue buttons and one yellow button on the center are present and captioned. Below eight options labeled Open, Rename, Permissions, Delete, Copy, Move, Download, Concatenate with icons are present. A search bar is present, Categorized into six columns captioned and two rows captioned.](Media/install47.png)

If you hold down ```control``` and select both files at once, you'll see the last icon at the end light up in blue as well, ```concatenate```.  This lets you download both files together as one, though it won't be in a nice user-friendly way - they'll just be smooshed together.

Go ahead and delete these files; you will add them again into HDFS through the command prompt a little later.

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 10 - Windows Connecting to your Cluster via Command Prompt<a class="anchor" id="DS107L2_page_10"></a>

[Back to Top](#DS107L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Windows Connecting to your Cluster via Command Prompt

These directions are for Windows users on how to connect to your cluster via the command prompt.  If you are a Mac or Linux user, please proceed to the next page.

---

## Install PuTTY

PuTTY is a free and open-source terminal emulator, serial console and network file transfer application. It allows you to connect via command prompt with your virtual machine.  Mac and Linux machines come with an SSH client already included, which is why they don't need to install PuTTY.

First go to the **[PuTTY website](https://www.chiark.greenend.org.uk/~sgtatham/putty)**, or Google PuTTY and select it.  It will take you to a straightforward documentation page, most of which you can ignore.  At the top, after Download, there is ```Stable```.  For the most part, it is better to use the Stable version as opposed to the latest, because Stable means that it was the last version that was fully debugged. If you opt for the latest version, who knows what you could run into!

![A snapshot labeled Putty: a free SSh and Telnet client has twelve options present in which twelve are highlighted underlined in blue one highlighted underlined in yellow and two in black.](Media/107.16.png)

Easy, now the next page is where you are downloading things.  You want to get ```putty.exe``` and ```puttygen.exe```. You will only need ```putty.exe``` right now, but you'll use ```puttygen.exe``` later. Just make sure you select the 64-bit if your computer runs on 64-bit or 32-bit if it doesn’t.

![A window displays alternative binary files. It also displays a message that reads, the installer packages above will provide all of these, but you can download them one by one if you prefer. The putty.exe from 64-bit category is highlighted.](Media/107.17.png)

---

## Run PuTTY

Now you run ```putty.exe```!

![A window labeled Putty configuration. On the left categories are mentioned on the right eight checkboxes, four dialog boxes captioned, and six commit buttons are present. where the Load button and HDP from the fourth dialog box are highlighted.](Media/install48.png)

Make sure the port is 2222 and the Connection type is SSH.  Now you have to put in your Host Name!  It will be ```maria_dev@127.0.0.1```. 

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>Once you've typed in the info, you can save this login and give it a name (the one in the picture is called "HDP"). That way you can easily access it and call it up without needing to type in the information every time.</p>
    </div>
</div>

---

## Sign into Your Virtual Machine

Click `Open` and it will take you to the PuTTY window to sign into the server. Then enter the password (which should be `maria_dev`) and you're in! Now that you are signed in to the server on your machine, you don't need the VirtualBox window, so you can just minimize that window.

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>You may not be able to see your password as it is being typed in. This is for privacy, so type carefully on faith and then hit enter.</p>
    </div>
</div>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 11 - Mac and Linux Connecting to your Cluster via Command Prompt<a class="anchor" id="DS107L2_page_11"></a>

[Back to Top](#DS107L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Mac and Linux Connecting to your Cluster via Command Prompt

These directions are for Mac and Linux users.  If you are a Windows user, please proceed to the next page.

First, open up the terminal on your machine. Then run the command: 

```bash
ssh maria_dev@127.0.0.1 -p 2222
```

This is the command that allows you to sign in to the server on your local machine through your own terminal. `ssh` is the command needed, `maria_dev` is the username you are signing in as (this will change if you have different users) and `127.0.0.1` is the IP address.  ```-p``` specifies the port you are using, which is ```2222``` in this case.

Lastly, enter the password for your user (which should be `maria_dev`) and you're in! You don't need the Virtual Box window, so you can just minimize it.

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>You may not be able to see your password as it is being typed in. This is for privacy, so type carefully on faith and then hit enter.</p>
    </div>
</div>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 12 - Linux System Basics<a class="anchor" id="DS107L2_page_12"></a>

[Back to Top](#DS107L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Linux System Basics

Now that you are successfully signed into your cluster, it is important to understand the basics of working on a Linux system, which is what your virtual machine is running. 

---

## Tree Structure

The tree structure of the file system is a great place to start when learning Linux. If you have ever worked in a Terminal or Command Line on your computer, you may recognize some of the commands for navigating throughout files.

Have your VirtualBox command line up and running and make sure you are signed in. You will be trying some of these commands as you learn them.

### `~`

When you sign into the server on VirtualBox, it will default to your `home` directory. This is represented with `~` . If you look at your signed-in virtual machine, you should see `[maria_dev@sandbox ~]` which tells you that you are in your home directory.

### `pwd`

```pwd``` stands for "print working directory" and is the command to see the file route to where you are located within the file structure. If you run ```pwd``` in Virtual Box, you should see that you are located within the home directory and are on user `maria_dev` like below:

![A snapshot of command prompt labeled maria_dev@sandbox, six lines captioned with numbers, letters, and special characters at the last the cursor is highlighted in green.](Media/install50.png)

### `cd`

```cd``` stands for "change directory" and is used to navigate to and from different folders. First, if you run the command ```cd ..``` that will bring you back one step from where you are located. From running ```pwd```, you saw that you were in the /home/maria_dev directory. Run ```cd ..``` to navigate up to the /home directory. If you run ```pwd``` again, you will see that you are no longer within the maria_dev user directory but are just in /home. See below:

![A snapshot of command prompt maria_dev@sandbox, four lines captioned at the last the cursor is highlighted in green.](Media/install51.png)

To return to the user directory, use ```cd``` again, but this time instead of the ``..``, you just need to type the directory that you want to be in. So in this case, you would run the command ```cd maria_dev``` to return to where you were, as shown below:

![A snapshot of command prompt maria_dev@sandbox, four lines captioned at the last the cursor is highlighted in green.](Media/install52.png)

Now, if you were deep into the file structure and wanted to easily return to the home directory (/home/maria_dev) without running ```cd ..``` a bunch of times, you can use the ```~``` symbol and that will automatically return you to the home directory. Go ahead and try it. Go back one step (```cd ..```), run ```pwd``` to see your directory (which should be /home), and then run ```cd ~```. Once that is complete, run ```pwd``` again and you will see that you are back in the home directory for your user. See below for the commands run one right after another:

![A snapshot of command prompt maria_dev@sandbox, seven lines captioned at the last the cursor is highlighted in green.](Media/install53.png)

### `ls`

To list the files in your home directory, you can use the following command: `ls`. This will show a listing of files located in that directory. If you are currently in the `~` directory and you run `ls`, nothing will show up because you have no files yet on this user. But if you cd back one (```cd ..```) and run `ls`, you should now see all the awesome things that your Hortonworks comes with, including the `maria_dev` user within the home directory. See below:

![A snapshot of command prompt maria_dev@sandbox, seven columns and six rows in blue captioned, and at the last the cursor is highlighted in green.](Media/install54.png)

---

## Case Sensitivity

When working with Linux, be careful because it is _very_ case sensitive. For example, if you ```cd``` back to ```/home``` and then try to ```cd``` into ```maria_dev``` with a capital ```M```, it will give an error. See below:

![A snapshot of command prompt maria_dev@sandbox, three lines captioned, and at the last, the cursor is highlighted in green.](Media/install55.png)

Just remember to be very sensitive when it comes to case!
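
If you'd like to reason about these paths without touching the virtual machine, here is a small Python sketch using the standard `pathlib` module. It only manipulates path text and never looks at a real file system; the variable names are made up for illustration.

```python
from pathlib import PurePosixPath

home = PurePosixPath("/home/maria_dev")  # what `~` stands for when signed in as maria_dev
parent = home.parent                      # where `cd ..` would take you
back = parent / "maria_dev"               # where `cd maria_dev` would take you from /home
wrong_case = parent / "Maria_dev"         # a different name as far as Linux is concerned

print(parent)              # /home
print(back == home)        # True
print(wrong_case == home)  # False
```

The last line is the case sensitivity rule in action: `Maria_dev` and `maria_dev` are treated as two different names.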

---

## Permissions

Permissions are very important when working with many different users. You may not want to have certain users able to edit sensitive information such as passwords. All users can change the permissions for the files and folders they create and own, while the root user can change any file or folder permission.

When adding or removing permissions, you need to define which permissions they are allowed to have with three different keys: `r`, `w`, `x`. `r` stands for "read," which would give the user permissions to read files. `w` stands for "write," which gives the user permissions to write and edit files. `x` stands for "execute," which would allow the users to execute the files available.
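
As a rough illustration of how the three keys fit together, here is a short Python sketch. It relies on two standard Linux conventions that are not covered above, so treat them as background assumptions: each key has a number (`r` is 4, `w` is 2, `x` is 1), and a file carries three sets of keys, one each for its owner, its group, and everyone else. The function names are made up for this example.

```python
def rwx(bits):
    """Turn one octal digit (0-7) into its r/w/x letters, using '-' for a missing permission."""
    return "".join(letter if bits & value else "-"
                   for letter, value in (("r", 4), ("w", 2), ("x", 1)))


def permission_string(mode):
    """Describe the owner, group, and everyone-else permissions packed into a mode number."""
    return "".join(rwx((mode >> shift) & 0b111) for shift in (6, 3, 0))


print(rwx(6))                    # rw-
print(permission_string(0o754))  # rwxr-xr--
```

Reading `rwxr-xr--` from left to right: the owner can read, write, and execute; the group can read and execute; everyone else can only read.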

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 13 - Vi<a class="anchor" id="DS107L2_page_13"></a>

[Back to Top](#DS107L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Vi

`Vi` is a text editor you can run from your terminal/command line. Sometimes, configuration is needed within a project located on a server and not on your local machine. `Vi` is a text editor that takes the place of text editors that are used on different systems. This section will go over the basics of `Vi`. `Nano` is another text editor that is available, but `Vi` is the more commonly used one.

---

## Vi Modes

`Vi` has two basic modes: `insert` and `normal`. `Insert` mode is when you are able to write text or code as if you were in a normal text editor. `Normal` mode allows you to navigate through and manipulate the text or code. Many versions of `Vi` (such as Vim) show an indicator like `-- INSERT --` at the bottom of the editor while you are in insert mode, so you can tell which mode you are in.

When you are changing between modes, use the `Esc` key for normal mode and `i` for insert mode.

---

## Find Text

You are able to find text as well. Using `f` followed by the character you are searching for will find and move to the next occurrence of that particular character on the current line. For example, `fd` will find and move you to the next `d` character. You are also able to combine numbers with `f`. If you type `3fk`, that will find and move you to the third occurrence of the character `k`. If you want to jump to the previous occurrence of a particular character, you would use `F` followed by the character itself. Typing `Fd` will find the previous occurrence of `d`.

If you need to search for certain text or characters within the entire file, press `/` followed by the text you are looking for. You can then repeat that search for the next occurrence using `n` and for the previous occurrence using `N`.

The `t` command followed by a character will jump the cursor to before the next occurrence of that particular character. So if you want to jump to before the next occurrence of `p`, you would type `tp`. If you want to find the previous occurrence of `p`, you would type `Tp`. And no, this doesn't work in the bathroom if you've run out of toilet paper...

---

## Insert Text

When using `Vi`, you are able to insert text by putting it into `insert` mode by pressing `i`.

### Insert Text Repeatedly

You also have the option to insert text multiple times without having to write it out yourself. For example, if you wanted to create an underline with 30 dashes `-`, you don't need to type out 30 dashes. You can use `30i-Esc` and that will do it for you. The `30` defines how many times you want the word/character repeated, the `i` will put `Vi` into insert mode, the `-` is the character you want to be repeated, and the `Esc` will return to normal mode and execute the insertion of the 30 dashes.

### Replace Text

You can use `r` to replace a character without changing to insert mode. So if you move your cursor over a particular character and press `r`, you will be able to replace that character with another character.

---

## Exiting Vi

When you are done updating your file, you can run different commands to save and quit `Vi`.

* `:w` will save the file, but not quit
* `:wq` or `:x` will save the file, then quit
* `:q` will quit, but will fail if there are unsaved changes
* `:q!` will quit and will throw away any unsaved changes

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 14 - Using HDFS from the Command Prompt<a class="anchor" id="DS107L2_page_14"></a>

[Back to Top](#DS107L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Using HDFS from the Command Prompt

There are times when you will want to access HDFS through your command prompt instead of using Ambari. In order to do that, you'll practice with the same books data that you did before with Ambari, so make sure that you deleted the data from the Ambari HDFS view first.

Next, log into your virtual machine via command line and then you will always use the following command to access HDFS: 

```bash
hadoop fs
```

This tells your cluster that you want to access the file system (fs).  Then you can add additional commands.  For instance, if you wanted to list out what was in HDFS, you could add -ls onto the end: 

```bash
hadoop fs -ls
```

---

## Make a Directory

You can do the same thing here as you did in Ambari, but with a line of code.  First, set up a folder for the data:

```bash
hadoop fs -mkdir books_data
```

And check to see if it's really there:

```bash
hadoop fs -ls
```

---

## Download Data

It is! Now you can upload data into that folder. You'll pull it down from our resources webpage, using the command ```wget```:

```bash
wget https://repo.exeterlms.com/documents/V2/DataScience/Big-Data/books.zip
```

You will also want to pull the other data file down:

```bash
wget https://repo.exeterlms.com/documents/V2/DataScience/Big-Data/bookIDs.zip
```

---

## Unzip Data

And then you'll need to unzip them both using the command ```unzip```:

```bash
unzip books.zip
```

And then: 

```bash
unzip bookIDs.zip
```

---

## Upload to HDFS

Now, they are on your virtual machine and unzipped, but they are not in HDFS yet.  To get them there, you will use your HDFS command to ensure Hadoop knows you want to work in the file system, and then utilize ```-copyFromLocal```.  Then specify the file name, and where you want to place it:

```bash
hadoop fs -copyFromLocal books.csv books_data/books.csv
```

Go ahead and do the other one while you are at it:

```bash
hadoop fs -copyFromLocal bookIDs.csv books_data/bookIDs.csv
```

Now to make sure that it is actually there:

```bash
hadoop fs -ls books_data
```

Tada! Your files have successfully been uploaded to HDFS.
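
If you ever want to script these steps instead of typing them one at a time, here is a hedged Python sketch of the same sequence. The helper name `hdfs` is made up, and the sketch assumes that `bookIDs.zip` unpacks to a file called `bookIDs.csv`. As written it only prints the commands; actually running them is left as a commented-out line, because that requires a machine (such as your sandbox) where `hadoop` is installed.

```python
import shlex


def hdfs(*args):
    """Build a `hadoop fs` command as a list of arguments."""
    return ["hadoop", "fs", *args]


steps = [
    hdfs("-mkdir", "books_data"),
    hdfs("-copyFromLocal", "books.csv", "books_data/books.csv"),
    # Assumes bookIDs.zip unpacked to a file named bookIDs.csv.
    hdfs("-copyFromLocal", "bookIDs.csv", "books_data/bookIDs.csv"),
    hdfs("-ls", "books_data"),
]

for step in steps:
    print(shlex.join(step))
    # To really run a step where `hadoop` is available, add `import subprocess`
    # at the top and call: subprocess.run(step, check=True)
```

Keeping each command as a list of arguments, rather than one long string, means file names with spaces would still be passed along correctly.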

---

## Removing Files from HDFS

There may also be times when you need to remove files from Hadoop, so go ahead and practice that as well! You'll make use of the ```-rm``` option, which mirrors the Linux ```rm``` command, to remove files:

```bash
hadoop fs -rm books_data/books.csv
```

And once you do that, you should get an acknowledgement that the files have been moved to the trash.  Go ahead and do the other one too, just to keep things consistent:

```bash
hadoop fs -rm books_data/bookIDs.csv
```

And now the ```bookIDs``` file is no longer there either! Want to triple check? You can always make use of ```-ls``` once more:

```bash
hadoop fs -ls books_data
```

It should return nothing, as there are no longer any files in the ```books_data``` directory.

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 15 - Exiting your Virtual Machine<a class="anchor" id="DS107L2_page_15"></a>

[Back to Top](#DS107L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Exiting your Virtual Machine

It's important to exit and shut down your virtual machine in the appropriate way so that you do not lose any information or mess up the system.  Just like pressing the power button on your laptop is not a good idea unless it's frozen, the same thing applies here. You'd rather tell the computer to shut down than force it to, because it goes through different procedures to keep itself running happily in the future.

Once you are done in the terminal, you can use the command ```exit``` to close your command line connection to the virtual machine.  Then, you can go to the Red Hat icon that shows the status of your virtual machine, and select ```Machine``` from the menu bar, then ```ACPI Shutdown``` at the very bottom. 

![A window labeled hortonworks docker sandbox 1 open square bracket running - Oracle VM has six menus in the menu bar. The menus are labeled file, machine, view, input, devices, and help. The machine menu is selected and it lists a few options.](Media/install49.png)

It will take a few minutes, but you will know it has completed when the Red Hat icon disappears and closes. You can now close the VirtualBox window if you'd like to.

---

## Summary

A lot has been accomplished in this one lesson! You now have a Hadoop cluster running, know how to access it through both the command line and Ambari, and understand the basics of Linux! You also have in-depth knowledge of HDFS and can utilize it to add or remove data from your cluster.  Bravo! You have taken one of the hardest steps into big data - just getting started - and are now ready to take on the wide Hadoop world!

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 16 - Key Terms<a class="anchor" id="DS107L2_page_16"></a>

[Back to Top](#DS107L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>VirtualBox</td>
        <td>A program to run virtual machines on your computer.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Hortonworks Sandbox</td>
        <td>A virtual machine that comes pre-installed with Hadoop and some of its ecosystem.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Ambari</td>
        <td>A browser interface that allows you to directly interact with your cluster.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>HDFS</td>
        <td>Main data storage on Hadoop; stands for Hadoop Distributed File System.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Commodity Computer</td>
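        <td>An inexpensive, widely available, off-the-shelf computer; Hadoop clusters are built from many of these rather than from specialized hardware.</td>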
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Name Node</td>
        <td>HDFS's version of a master node, which keeps track of where every file and block is stored.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Data Node</td>
        <td>HDFS' version of a slave node, which actually processes and stores data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Client Node</td>
        <td>The computer that is interacting with the cluster.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Secondary Name Node</td>
        <td>A helper node that periodically merges the name node's edit log into a checkpoint; it supports recovery but is not a live standby.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>HDFS Federation</td>
        <td>Multiple name nodes, each managing part of the file system, used when there is too much data for a single name node to track.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>HDFS High Availability</td>
        <td>A second, live standby name node that is kept in sync and can take over if the active name node fails.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>PuTTY</td>
        <td>A program that emulates your terminal for Windows so that you can connect to your Linux virtual machine.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>vi</td>
        <td>A text editor for Linux.</td>
    </tr>
</table>

## Key Linux Commands

<table class="table table-striped">
    <tr>
        <th>Command</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>~</td>
        <td>Indicates your home directory.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>pwd</td>
        <td>Shows you where you are located in your file system.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>cd</td>
        <td>Changes your directory to the directory you specify.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>cd ..</td>
        <td>Moves you up a directory level.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>ls</td>
        <td>Lists everything contained within a directory.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>hadoop fs</td>
        <td>Allows you to access HDFS.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>mkdir</td>
        <td>Creates a directory.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>wget</td>
        <td>Downloads files from the web.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>unzip</td>
        <td>Unzips .zip files.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>copyFromLocal</td>
        <td>Copies data from your local file system into HDFS; used with hadoop fs.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>rm</td>
        <td>Removes files.</td>
    </tr>
</table>
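
The commands in the table can be tried together in one short session. The following is a minimal sketch, not the lesson's official walkthrough: it works in a throwaway directory created with ```mktemp```, and the names ```practice```, ```practice_hdfs```, and ```notes.txt``` are made up for illustration. The HDFS part only runs where the ```hadoop``` command is available (for example, inside the sandbox) and assumes your user has a home directory in HDFS. ```wget``` and ```unzip``` are left out because they need a network connection and a file to download.

```shell
echo ~                      # ~ expands to the path of your home directory

scratch=$(mktemp -d)        # create a throwaway directory to practice in
cd "$scratch"               # cd: move into it
pwd                         # pwd: print where you are now

mkdir practice              # mkdir: create a directory
cd practice
echo "hello" > notes.txt    # make a small file to work with
ls                          # ls: list what is in the current directory
listing=$(ls)               # keep the listing so it can be checked later

# HDFS equivalents go through "hadoop fs"; skipped if Hadoop is not installed.
if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -mkdir practice_hdfs                      # directory in HDFS
    hadoop fs -copyFromLocal notes.txt practice_hdfs/   # local file -> HDFS
    hadoop fs -ls practice_hdfs                         # list it in HDFS
    hadoop fs -rm practice_hdfs/notes.txt               # remove the file
    hadoop fs -rmdir practice_hdfs                      # remove the empty directory
fi

cd ..                       # cd ..: move up one level
rm practice/notes.txt       # rm: remove the file
remaining=$(ls practice)    # the directory is now empty
rm -r practice              # rm -r removes a directory and its contents
cd ..
rm -r "$scratch"            # clean up the throwaway directory
```

Because everything happens inside the temporary directory, nothing in your real home directory is changed.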

---

## Key vi Commands

<table class="table table-striped">
    <tr>
        <th>Command</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>i</td>
        <td>Allows you to insert text.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Esc</td>
        <td>Stops inserting text and returns you to command mode.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>:wq or :x</td>
        <td>Saves your changes and quits vi.</td>
    </tr>
</table>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 17 - Lesson 2 Hands-On<a class="anchor" id="DS107L2_page_17"></a>

[Back to Top](#DS107L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">



In this lesson, you've installed Hadoop and begun to learn about HDFS. For this Hands-On, you will upload the **["crimes-sample-2.csv" file](https://repo.exeterlms.com/documents/V2/DataScience/Big-Data/crimes-sample-2.zip)** to HDFS and then delete it, in whichever manner you choose. Take screenshots to demonstrate that both steps have been completed.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Alternative Assignment if You Can't Run Hadoop and/or Ambari

If your computer cannot run Hadoop and/or Ambari, **[here](https://repo.exeterlms.com/documents/V2/DataScience/Big-Data/L2exam.zip)** is an alternative exam to test your understanding of the material. Please complete it and attach it instead.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>



<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 18 - Lesson 2 Hands-On Solution - Alternative Assignment<a class="anchor" id="DS107L2_page_18"></a>

[Back to Top](#DS107L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Lesson 2 Hands-On Solution - Alternative Assignment

This exam serves as the assessment for those students who cannot utilize the Hadoop system and/or Ambari GUI. Answers are shown in bold.

1.	Where is the list of services located in Ambari? 
    **a.	Left panel**
    b.	Right panel
    c.	Top menu
    d.	Along the bottom

2.	Which node is the “worker bee” in HDFS?
    a.	Name node
    **b.	Data node**

3.	True or False? "Adding data to your cluster can be as easy as dragging and dropping."
    **a.	True**
    b.	False

4.	What code will print your working directory in Linux?
    **a.	pwd** 
    b.	cd
    c.	ls
    d.	lsd