# DS107 Big Data : Lesson One Companion Notebook

### Table of Contents <a class="anchor" id="DS107L1_toc"></a>

* [Table of Contents](#DS107L1_toc)
    * [Page 1 - Module Overview](#DS107L1_page_1)
    * [Page 2 - What is Big Data?](#DS107L1_page_2)
    * [Page 3 - Introduction to Amazon Web Services](#DS107L1_page_3)
    * [Page 4 - Introduction to Hadoop](#DS107L1_page_4)
    * [Page 5 - The Hadoop Ecosystem](#DS107L1_page_5)
    * [Page 6 - Key Terms](#DS107L1_page_6)
    * [Page 7 - Lesson 1 Hands-On](#DS107L1_page_7)
    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Module Overview<a class="anchor" id="DS107L1_page_1"></a>

[Back to Top](#DS107L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

```python
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to Big Data
VimeoVideo('251886621', width=720, height=480)
```

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO107L01overview.zip)**.

# Module Overview

Big data refers to both the storage and analysis of large collections of data. For perspective, as of 2013, **[90% of the world's data](https://www.sciencedaily.com/releases/2013/05/130522085217.htm)** had been generated in the previous two years. This growth is expected to continue as computing becomes cheaper. In this module, you will learn how to process large datasets on your local machine, how to distribute work to multiple computers for better performance, and how to prevent and recover from errors. Over the course of this module, you will:

* Learn the fundamentals of Big Data
* Utilize MapReduce in Python
* Learn how to maximize computing efficiency
* Learn how Hadoop can be used to solve Big Data problems
* Get your feet wet with cloud computing by spinning up a virtual machine with AWS
* Use PySpark to analyze large data
* Learn how to recover from data processing errors

---

## Lesson Overview

In this lesson, you will get an introduction to big data and the Hadoop ecosystem.

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/456037958"> recorded live workshop on the concepts in this lesson. </a> </p>
    </div>
</div>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - What is Big Data?<a class="anchor" id="DS107L1_page_2"></a>

[Back to Top](#DS107L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# What is Big Data?

Data is usually categorized as "big" when you can no longer process it efficiently on one computer. However, that threshold differs by both dataset and computer setup. There is no single cutoff that lets someone say "my data is big"; no set number of columns or rows pushes you over the edge into "big data" territory. You could have a relatively small number of records (rows) but a lot of columns. You could have images, audio, or text that are space- and processing-intensive. You could have a relatively small set of variables in key-value pair format that is updated incredibly frequently. All of these may or may not be instances of big data, depending on their particular characteristics AND on the computer you're using. A ten-year-old system running an i5 processor is going to be able to handle much less data than a brand-new computer with a solid-state drive and an i9 processor.
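
To make the "it depends on the data AND the computer" point concrete, here is a rough back-of-the-envelope sketch. The function names and the 8-bytes-per-value figure are assumptions for illustration only; real datasets with text, images, or audio can take far more space per value.

```python
def estimated_size_gb(rows: int, columns: int, bytes_per_value: int = 8) -> float:
    """Rough in-memory size of a purely numeric table, in gigabytes (10**9 bytes)."""
    return rows * columns * bytes_per_value / 1e9


def fits_in_memory(rows: int, columns: int, ram_gb: float, bytes_per_value: int = 8) -> bool:
    """True when the estimated table size is below the machine's RAM."""
    return estimated_size_gb(rows, columns, bytes_per_value) < ram_gb


# Few rows with many columns can be just as heavy as many rows with few columns.
wide = estimated_size_gb(rows=100_000, columns=50_000)
tall = estimated_size_gb(rows=500_000_000, columns=10)
print(wide, tall)  # 40.0 40.0

# The same dataset is "big" on one machine and manageable on another.
print(fits_in_memory(100_000, 50_000, ram_gb=16))  # False
print(fits_in_memory(100_000, 50_000, ram_gb=64))  # True
```

The estimate ignores everything else competing for memory, so treat it as a first sanity check rather than a guarantee.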

---

## Example

If ACME Groceries has collected both the dates and times of each item sold in their store, they're empowered to ask questions concerning inventory. This sounds simple, right? The variables you're interested in are only date, time, and item.  How can that possibly become big data? Well, think about how many items a store has.  Think about how many customers PER DAY they may receive.  Think about how many items at a time a customer may purchase.  Is it starting to look a bit bigger now? What if you wanted just a little more information, and you increased data collection to include multiple ACME Groceries in the chain? Now maybe you need store location as well, and you have increased the number of dates, times, and items tenfold. The amount of data generated and stored becomes monumental! 

With fine detail and vast records of customer purchase history, companies can spot patterns and predict customer behavior. Big data makes large-scale insights possible. For instance, consider the following questions that can be answered with just those few variables and the actions that can be taken with those answers:

* What month has the highest produce sales? 
  * _Action:_ Mail out advertisements and coupons the week prior to additionally incentivize produce purchase.
* Which two items are frequently purchased together?
  * _Action:_ Place these items adjacent to each other to enhance the likelihood they will both be purchased.
* What are the five most frequently purchased items between the hours of 6:00 AM and 8:00 AM?
  * _Action:_ Place the items in a popup kiosk near the front of the store for quick access in the morning.
* Which brand of toothpaste generates the most money?
  * _Action:_ Move the best seller to the middle shelf (eye-level) to encourage purchases.
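
As a small illustration of how questions like these turn into code, the sketch below answers two of them on a handful of made-up sales records. The records, the `receipt_id` field, and the variable names are all hypothetical; a real store would have millions of rows, which is exactly where a single computer starts to struggle.

```python
from collections import Counter
from itertools import combinations

# Made-up sales records: (receipt_id, hour_of_day, item)
sales = [
    (1, 7, "coffee"), (1, 7, "bagel"),
    (2, 7, "coffee"), (2, 7, "bagel"), (2, 7, "banana"),
    (3, 12, "bread"), (3, 12, "milk"),
    (4, 6, "coffee"), (4, 6, "bagel"),
    (5, 18, "bread"), (5, 18, "milk"), (5, 18, "coffee"),
]

# Most frequently purchased items between 6:00 AM and 8:00 AM.
morning_counts = Counter(item for _, hour, item in sales if 6 <= hour < 8)
top_morning = morning_counts.most_common(2)

# Group items by receipt, then count every pair of items bought together.
baskets = {}
for receipt_id, _, item in sales:
    baskets.setdefault(receipt_id, set()).add(item)

pair_counts = Counter()
for items in baskets.values():
    pair_counts.update(combinations(sorted(items), 2))

top_pair = pair_counts.most_common(1)[0]
print(top_morning)
print(top_pair)  # (('bagel', 'coffee'), 3)
```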

---

## Properties of Big Data

The properties of big data are all denoted with a V. The three main Vs of Big Data are _Volume_, _Variety_, and _Velocity_.

1.  **Volume** refers to the size of the datasets being stored. According to an **[article](https://www.northeastern.edu/graduate/blog/how-much-data-produced-every-day/)** by Northeastern University, the total amount of data in the world was projected to rise steeply to 44 zettabytes by 2020. For those who are unfamiliar with the **[metric prefixes](https://www.nist.gov/pml/weights-and-measures/metric-si-prefixes)**, 44 zettabytes expanded is 44,000,000,000,000,000,000,000 bytes! 

    <div class="panel panel-success">
        <div class="panel-heading">
            <h3 class="panel-title">Additional Info!</h3>
        </div>
        <div class="panel-body">
            <p>To really get a perspective on how much data volume has changed, think about this: NASA traveled to the moon and back on only 4kb of memory! For more details about the early Apollo computing system, <a href="https://www.metroweekly.com/2014/07/to-the-moon-and-back-on-4kb-of-memory/"> check this article out! </a></p>
        </div>
    </div>

2.  **Variety** refers to the different forms the data can take, such as video, audio, and text. These types of data often take more space to store and require longer processing times as well. Some systems will have a lot of variety, but others may be a little less dispersed. For example, Twitter's _variety_ dimension may consist mostly of text while YouTube's _variety_ dimension is primarily video.

3.  **Velocity** refers to the speed of data generation and storage. This typically ties into the "real-time" rate of data transmission. With large-scale or global companies that are dealing with transactional or monitoring data, the velocity can be incredibly fast. For instance, **[back in 2013 Twitter recorded a peak of 143,199 Tweets sent in a single second](https://blog.twitter.com/engineering/en_us/a/2013/new-tweets-per-second-record-and-how.html)**.

In addition, as big data has gotten more popular, other Vs have been added.  Consider also noting: 

4. **Veracity:** How trustworthy the data is. As data is coming in, is it likely to create discrepancies on your cluster? Are you likely to have duplicate records? What happens in the case of computer failure?

5. **Value:** It don't mean a thing if it ain't got that swing! Big data has to provide useful information and insights; otherwise, why are you throwing all this computer processing power and storage at it?
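
The volume and velocity figures quoted above are easier to appreciate with a little arithmetic. The sketch below is only illustrative: the one-terabyte drive is an arbitrary yardstick, and the per-day tweet figure assumes the quoted per-second rate were sustained for a full day, which is a hypothetical and not a reported statistic.

```python
ZETTABYTE = 10 ** 21   # the SI prefix "zetta" means 10**21
TERABYTE = 10 ** 12

# Volume: 44 zettabytes written out in bytes.
total_bytes = 44 * ZETTABYTE
print(f"{total_bytes:,} bytes")  # 44,000,000,000,000,000,000,000 bytes

# How many one-terabyte drives would it take to hold that much data?
drives_needed = total_bytes // TERABYTE
print(f"{drives_needed:,} one-terabyte drives")  # 44,000,000,000 one-terabyte drives

# Velocity: what the quoted per-second rate would add up to over one day.
tweets_per_second = 143_199
seconds_per_day = 24 * 60 * 60
tweets_per_day = tweets_per_second * seconds_per_day
print(f"{tweets_per_day:,} tweets per day at that rate")  # 12,372,393,600 tweets per day at that rate
```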

---

## Big Data Computing

A fundamental part of big data analysis is being able to process all of the necessary data, and because of the nature of big data, you're going to need more than one computer; otherwise, the processing time and cost are too great to justify the usefulness of the data. What if the work could be shared between two computers? Ideally, that would cut the time in half. What if a third computer were added? As you may guess, the more computers that are added, the less time is required to process the same amount of data. This is called *scalability*: the ability to keep adding computers in a roughly linear fashion, so that the more data you have, the more computers you add.
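
The "two computers cut the time in half" idea can be written as a one-line formula. This is a sketch of the ideal case only: the function name and the 120-hour job are made up, and real clusters lose some time to coordination and moving data around, so actual speedups are usually smaller.

```python
def ideal_hours(total_hours: float, nodes: int) -> float:
    """Processing time if the work divides perfectly across the nodes."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return total_hours / nodes


# A hypothetical job that takes 120 hours on a single computer.
for nodes in (1, 2, 3, 10):
    print(nodes, "node(s):", ideal_hours(120, nodes), "hours")
```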

When you're harnessing multiple computers to do one job, those computers together are referred to as a *cluster*. Each individual computer in a cluster is typically referred to as a *node*. One main node runs the whole shebang, and the others just do the work. The node that is in charge is called the *master node*, or sometimes the *manager*, and the computers that are working are the *slave nodes*, or sometimes the *workers*.

![A laptop labeled Master Manager has one vertical and two horizontal lines in between connects six laptops, three in row one and three in row two.](Media/bigData1.png)
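
The manager-and-workers pattern can be imitated on a single machine. In the sketch below, threads from Python's standard `concurrent.futures` module stand in for worker nodes; that substitution, the function names, and the toy dataset are assumptions for illustration, since a real cluster spreads the chunks across separate computers.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Manager step: divide the records into roughly equal chunks."""
    size = -(-len(records) // n_chunks)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def worker_sum(chunk):
    """Worker step: each worker processes only its own chunk."""
    return sum(chunk)


records = list(range(1, 1001))                   # stand-in for a large dataset
chunks = split_into_chunks(records, n_chunks=6)  # six workers, as in the diagram

with ThreadPoolExecutor(max_workers=6) as pool:
    partial_sums = list(pool.map(worker_sum, chunks))

total = sum(partial_sums)  # the manager combines the partial results
print(len(chunks), total)  # 6 500500
```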

Note that your workplace may or may not actually physically own the computers in the cluster. There are many services out there that allow you to rent out clusters.  This then becomes an endeavor in *virtual machines* as well.  For instance, depending on the processing needs of your data, you may be renting anywhere from a small partition on just one of the small silver boxes below to several silver boxes (each a computer). 

![A server room.](Media/bigData2.jpg)

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Introduction to Amazon Web Services<a class="anchor" id="DS107L1_page_3"></a>

[Back to Top](#DS107L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Introduction to Amazon Web Services

There are many different ways and places from which to spin up virtual machines and rent clusters. One of them is *Amazon Web Services (AWS)*, which you will be utilizing later in this module. In order to access AWS for free, and to take advantage of their free student services, like training and job boards, you will create an AWS Educate account. Below are the directions to create this account.

---

## What is AWS?

From Amazon's website:

> "Amazon Web Services (AWS) is a secure cloud services platform, offering compute power, database storage, content delivery and other functionality to help businesses scale and grow. Explore how millions of customers are currently leveraging AWS cloud products and solutions to build sophisticated applications with increased flexibility, scalability and reliability." - [Amazon AWS](https://aws.amazon.com/what-is-aws/).

The process to obtain an AWS account is described below.

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>If you already have an AWS Educate account, you can skip to the end of the page.</p>
    </div>
</div>

# Amazon Web Services (AWS) Account

You will need an AWS account to proceed with this course and deploy web apps.

**If you already have an AWS account or already applied for an AWS Educate account, you can skip this section and head to the section on accessing your AWS Educate account.**

AWS Educate is a program Amazon offers that will allow you to obtain an AWS account for free (with limitations). To apply for an account, click the link below:

**[AWS Educate](https://aws.amazon.com/education/awseducate/apply/)**

---

## Step 1 - Check Email for AWS Educate Application Process
You should receive an email from AWS Educate Support providing a link to complete the `AWS Educate application process`.   **If you did not receive an email, please inform your mentor right away**.

![A snapshot of an email from AWS Educate Support where four options and one icon are highlighted in blue, below salutations and email content is mentioned.](Media/sign-up_email-one.png "AWS Educate Application Process Step 1")

---

## Step 2 - Fill In Your Details

Once you click the `here` option, you should be taken to the next page to enter your details. You should see something similar to the screenshot below:

![A snapshot of a page from AWS educate displaying the second step for the registration process. There are ten text fields labeled school or institution name, first name, email, birth month, birth year, country, last name, graduation month, graduation year, and promo code. A checkbox labeled I'm not a robot is placed below the text fields. A button labeled next is also placed at the bottom of the page.](Media/signup-details.png "AWS Educate Apply Step 2")

You need to enter `Woz U Education Holdings, LLC` into the __Institution Name__ field. As you begin typing, it should begin to filter the list of institutions, giving you the option to select it from the list.

You'll also need to enter the __country__, __city__, and __state__ of the institution, which are the next three fields near the institution name:

* Country: **United States**
* City: **Phoenix**
* State: **Arizona**

Once you've entered the above values, fill in the remainder of the fields with the values that are appropriate for you. All fields require a value in order to move to the next page. Please ensure that first name, last name, birth month and year, and graduation month and year are all filled in. __Do not enter a promo code__.

You may also need to solve a CAPTCHA.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>It is very important that you choose the correct institution! Make sure you select Woz U Education Holdings, LLC.</p>
    </div>
</div>

Click "NEXT" to proceed.

---

## Step 3 - Check for Confirmation Email 

Once you have completed filling out your account details, AWS Educate Support will send a __"Thank you for submitting your AWS Educate Account application"__ email. 

![A snapshot of an email from AWS Educate Support where four options and one icon are highlighted in blue, below salutations and email content with a link is mentioned. On the bottom right, a rectangular box is captioned.](Media/signup-confirmation.png "AWS Educate Apply Step 3")

Your email should look similar to the image above. Once you've accessed the email, click on the link provided to confirm your email address and complete the application process. Once you click the link provided in the email, you will see the following screen. 

![A snapshot of a website labeled AWS educate below two blue backgrounds divided is captioned. In the first blue background, a rectangular box is captioned.](Media/signup-verified.png "AWS Educate Apply Step 4")

---

## Step 4 - Set AWS Password 

Once your application has been verified, you will receive the following email to set your password. 

![A snapshot of an email from AWS Educate Support, where four options and one icon are highlighted in blue, below salutations and email content, links are mentioned, where Click here is highlighted in yellow.](Media/signup-password.png "AWS Educate Apply Step 5")

Select the `Click here` option to set your account password. 

![A snapshot of a website labeled AWS educate below captioned has two boxes with eight dots present in each. A yellow button labeled Set Password, In bold five roman letters mentioned in points are captioned. The fifth roman letter contains ten special characters.](Media/signup-createpassword.png "AWS Educate Apply Step 6")

Once you click `Set Password`, you will be redirected to your AWS account, shown below.

---

## Step 5 - Access AWS Account 

![A window labeled AWS educate on the right six options is present where a red arrow points to the fifth option which is highlighted in yellow, below three boxes are captioned where the third box header is highlighted in blue and below a blue button is present.](Media/signup-account.png "AWS Educate Apply Step 7")

Click `AWS Account` in the upper right corner of your dashboard. You will be directed to another page where you will be given the option to `Create Starter Account`.

![A window labeled AWS educate on the left corner, on the right corner six options are present where the fifth option is underlined in blue. Below on the left portrays a man carries a laptop in the air, on the right header captioned in yellow and the rest in black, below a yellow button is present in a black circle.](Media/signup-starterAcct.png "AWS Educate Apply Step 8")

You will then be redirected to use your AWS Educate Starter Account to access the AWS Console and resources to begin building. 

![A window labeled AWS educate on the left corner, on the right corner six options are present. Below on the left portrays a man carries a laptop in the air, on the right header captioned in yellow and the rest in black, in between a yellow button is present in a red circle.](Media/signup-console.png "AWS Educate Apply Step 9")

Once you have clicked the `AWS Educate Starter Account` button, you will be able to view your AWS Account Status, where you will want to select the `AWS Console` option. Finally, you will be directed to your AWS Management Console.

![A web page labeled AWS management console displays a text field labeled find services. The web page is divided into five panels. The first panel is labeled as build a solution, the second panel is labeled as learn to build, the third panel is labeled stay connected to your AWS resources, the fourth panel is labeled as explore AWS, and the fifth panel is labeled as have a feedback.](Media/signup-management.png "AWS Educate Apply Step 10")

The AWS Console is what you'll need for the course work. There will be additional material on how to use the AWS Console later in the program.

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Introduction to Hadoop<a class="anchor" id="DS107L1_page_4"></a>

[Back to Top](#DS107L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Introduction to Hadoop

The system to know in the world of big data is *Hadoop*. Hadoop is, at its heart, a way to distribute and manipulate data across a cluster. What separates Hadoop from its predecessors is the ability to interact with your cluster as if it were a single computer; Hadoop does most of the distribution work behind the scenes so you don't have to worry about it. Hadoop is not the hardware of big data; the machines you are using in your cluster can come from anywhere. Instead, it is the software that allows your data files to get broken up, stored, accessed, and used seamlessly.

![Logo of Apache Hadoop](Media/bigData3.png)

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Fun Fact!</h3>
    </div>
    <div class="panel-body">
        <p>Hadoop is named after the developer's child's stuffed yellow elephant. Knowing this, does the logo make more sense? </p>
    </div>
</div>

---

## Fault Tolerance

One of the most important things about big data in general, and Hadoop in particular, is that failure recovery is key. If any part of the cluster goes down, you don't want to lose access to that data or any processing you've done with it! Think about the devastating effects of running out of computer battery and not being able to save your homework assignment, and the hours it might take you to redo something. Then think about what would happen if an error like that took place on one node of a cluster, and it brought down all the others! You might lose billions of records and thousands of hours' worth of work. A catastrophic loss like that is obviously out of the question, so when working with big data it's important not only to have a backup, which is called *redundancy*, but also to keep the work going even if you lose one node. This ability to keep going even when things go to heck in a handbasket is called *fault tolerance*. Hadoop automatically keeps a backup of your work, and even further, has built-in programming so that if one node goes down, other nodes can be added or utilized more fully to keep your jobs running smoothly.
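
Redundancy and fault tolerance can be pictured with a toy model. The sketch below keeps three copies of each block of data on different nodes, which mirrors HDFS's default replication factor of 3. The round-robin placement, the node and block names, and both function names are simplifying assumptions and not how Hadoop actually decides where copies go.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` different nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one copy on a healthy node."""
    return {b for b, holders in placement.items() if set(holders) - set(failed_nodes)}


nodes = ["node1", "node2", "node3", "node4"]
blocks = ["block_a", "block_b", "block_c", "block_d", "block_e"]

# With three copies of each block, losing one or even two nodes loses no data.
placement = place_replicas(blocks, nodes)
print(readable_blocks(placement, {"node2"}) == set(blocks))           # True
print(readable_blocks(placement, {"node1", "node2"}) == set(blocks))  # True

# With only one copy of each block, a single failed node loses data.
single_copy = place_replicas(blocks, nodes, replication=1)
lost = sorted(set(blocks) - readable_blocks(single_copy, {"node1"}))
print(lost)  # ['block_a', 'block_e']
```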

---

## Core Hadoop Components

There are three components that go into the main Hadoop system, although many other programs have now been built to sit upon and work with these main components.  These three main components in Hadoop are HDFS, YARN, and MapReduce: 

![Three boxes present, the first box labeled MapReduce, the second box labeled YARN, and the third box labeled HDFS.](Media/bigData4.png)

* **Hadoop Distributed File System (HDFS):** HDFS is the start of it all! HDFS allows you to distribute data storage across the cluster, provides a backup system, and makes your cluster look and feel like one computer with which you can interact.

* **Yet Another Resource Negotiator (YARN):** YARN sits on top of HDFS and manages the resources in your cluster. In big data, *resources* typically refers to node processing power, so it is YARN's job to determine what nodes are available for extra work and to keep your cluster up and running.

* **MapReduce:** MapReduce is a programming model that allows you to process your data across the entire cluster.  These days, it is old and slow, so you'll mainly interact with other programs that are harnessing the power of MapReduce instead of actually using MapReduce yourself.  Nevertheless, the function of MapReduce allows for all the distributed analytics you will run.
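
To give a feel for the MapReduce programming model, here is a minimal word-count sketch in plain Python. It runs on one machine and the function names are illustrative; it only shows the map, shuffle, and reduce steps, whereas Hadoop runs the same kind of steps spread across the nodes of a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


lines = ["big data is big", "data is everywhere"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Because each line is mapped independently and each key is reduced independently, both phases could be handed to different workers without changing the result.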

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - The Hadoop Ecosystem<a class="anchor" id="DS107L1_page_5"></a>

[Back to Top](#DS107L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# The Hadoop Ecosystem

With those Hadoop components at the core, other programs can work on top of them and/or connect with them to increase processing speed, functionality, or data availability. The wide world of programs that interact with Hadoop in some way is known as the *Hadoop ecosystem*. Here is an image of the complete Hadoop ecosystem: 

![Three boxes are labeled as MySQL, Cassandra, and mongoDB and they are grouped as external data storage. A table titled Ambari. The fields of the table are labeled zookeeper, oozie, pig, hive, map reduce, TEZ, Spark, HBase, Storm, YARN, Mesos, HDFS, and data ingestion. Another set of boxes are labeled drill, phoenix, hue, presto, and zeppelin are grouped and labeled query engines.](Media/bigData6.png)

Now that you have a broad idea of all the pieces and technologies possible, you will learn a little about each and where they fit.  Keep in mind that this module will not cover every single component listed here, as many are redundant.  However, you will get a high-level overview of the systems so that you will be able to recognize them and their function should you encounter them outside this course.

You'll ingest this diagram starting at the center and working your way up and out. 

---

##  Analytics Programs

Sitting on top of HDFS are both YARN and Mesos.

* **Mesos:** Mesos is an alternative to YARN. Its function as a resource negotiator is nearly identical to YARN's.

Sitting on top of YARN is TEZ.

* **TEZ:** TEZ speeds up the MapReduce capabilities considerably, and can be used in conjunction with Hive and Pig for faster processing.

Sitting on top of either YARN or Mesos: 

* **Spark:** Apache Spark uses the basics of MapReduce but takes them much further. It is used for data querying, analysis, machine learning, and streaming in real time. You can write Spark programs in Python, Scala, or Java.

There are two programs that sit on top of MapReduce: 

* **Pig:** Pig is a SQL-like query program that runs off MapReduce and has the option of using TEZ.  

    <div class="panel panel-success">
        <div class="panel-heading">
            <h3 class="panel-title">Fun Fact!</h3>
        </div>
        <div class="panel-body">
            <p>The official language that Pig uses is Pig Latin! Not a joke!</p>
        </div>
    </div>

* **Hive:** Hive allows you to make SQL queries against a database stored on your cluster. It also has the option of using TEZ to improve processing speed.

---

## Built-in Transactional Database

On top of HDFS, there is HBase. 

* **HBase:** HBase is a transactional database using key-value pairs (think NoSQL!) that is built into Hadoop.  It can provide fast results to query systems like Spark, Pig, and Hive.

---

## Real-time Data Processing Program

Alongside this analysis setup, to the right, are Storm and the programs used for *data ingestion*, or getting data into your cluster.

* **Storm:** Apache Storm allows you to process data in real-time as it comes in. You might use this when processing sensor data, which often comes in by the second.

---

## Data Ingestion Programs

* **Sqoop:** Sqoop is a data ingestion program that allows you to tie in legacy SQL databases, such as a MySQL or Oracle setup, via a *database connection (DBC)*.

* **Flume:** Flume is a data ingestion program that can transport web logs into your cluster in real time.

* **Kafka:** Kafka is a broad data ingestion program that can collect data of any sort from a cluster of computers and broadcast it to your Hadoop cluster.

---

## Cluster Management Programs

To the left of the analysis setup are two programs that help you manage and maintain your Hadoop cluster: Oozie and Zookeeper.

* **Oozie:** Oozie allows you to schedule jobs on your cluster, and integrates with other programs in your Hadoop ecosystem so that they can be passed from one to another with ease.

* **Zookeeper:** Zookeeper helps you coordinate your cluster's nodes and allows you to maintain a stable system even if your master node goes down. Its primary use is for failure recovery.

---

## External Data Storage Programs

There are also programs that don't layer within the Hadoop ecosystem; they reach into it to interact but otherwise stay separate. There are two overarching groups: programs that function as external storage outside the Hadoop cluster and programs that function as query engines outside the Hadoop cluster. You'll learn about the external data storage programs first:

* **MySQL:** Really, this includes MySQL as well as any other external SQL database.  You can import data from a SQL database to process on your cluster, and then export the results back to the database if desired.

* **Cassandra:** This is a non-relational database (NoSQL) that can sit between your real-time website data and your Hadoop cluster.

* **MongoDB:** Another non-relational database you can utilize to connect to the Hadoop cluster.

---

## Query Engines

* **Drill:** Apache Drill is a query system that allows you to conduct SQL queries on non-relational databases such as Cassandra and MongoDB.

* **Phoenix:** Apache Phoenix provides a SQL layer on top of HBase, so a non-relational store looks and feels relational when you query it.

* **Hue:** An interactive query system with visualizations that allows you to use Hive and/or HBase.  Primarily used in Cloudera distributions of Hadoop.

* **Presto:** Another way to execute queries across your entire cluster.

* **Zeppelin:** Apache Zeppelin is a notebook approach to querying that is similar in nature to Jupyter Notebook.

---

## The GUI

Overlying all of this is Apache Ambari, which is the graphical user interface for your Hadoop cluster. There will be a lot you can accomplish through Ambari, like adding and deleting files, using Hive and Pig, and turning on and off services and clusters, with some point-and-click or easy coding functionality. Though you will still need to use the command line at times, Apache Ambari is a lifesaver!

You will go into these with much more detail at a later date, though again not everything will be covered, as there is quite a bit of redundancy.

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Key Terms<a class="anchor" id="DS107L1_page_6"></a>

[Back to Top](#DS107L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Cluster</td>
        <td>A group of connected computers harnessed for the same purpose.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>3 Vs of Big Data</td>
        <td>Volume, variety, and velocity.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Volume</td>
        <td>Size of the data being stored.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Variety</td>
        <td>Different types of data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Velocity</td>
        <td>Speed of data generation and storage.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Veracity</td>
        <td>Trustworthiness of the data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Value</td>
        <td>Usefulness of the data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Scalability</td>
        <td>The ability to continue adding computers to deal with more data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Node</td>
        <td>A computer in the cluster.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Master Node</td>
        <td>AKA Manager.  The node controlling the other computers in the cluster.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Slave Node</td>
        <td>AKA Worker.  The computers doing the work in the cluster.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Amazon Web Services (AWS)</td>
        <td>A system for spinning up virtual machines to use in big data processing.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Hadoop</td>
        <td>A program that harnesses a cluster for big data processing but allows the user to feel like they're interacting with one computer instead of many.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Redundancy</td>
        <td>Ensuring data backup.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Fault Tolerance</td>
        <td>Ability to continue processing even after a part of the cluster has crashed.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Data Ingestion</td>
        <td>Getting data into your cluster.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Database Connection (DBC)</td>
        <td>A way of connecting to SQL databases.</td>
    </tr>
</table>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Lesson 1 Hands-On<a class="anchor" id="DS107L1_page_7"></a>

[Back to Top](#DS107L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Lesson 1 Hands-On

This Hands-On **will** be graded, so make sure you complete each part. When you are done, please submit one document with all of your findings for grading.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

Pick one aspect of the Hadoop Ecosystem that you are most interested in and read a little further about it. Write down a few interesting key points about that program and how it operates. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>