# DS107 Big Data : Lesson Four Companion Notebook

### Table of Contents <a class="anchor" id="DS107L4_toc"></a>

* [Table of Contents](#DS107L4_toc)
    * [Page 1 - Introduction](#DS107L4_page_1)
    * [Page 2 - This Little Piggy went to Hadoop!](#DS107L4_page_2)
    * [Page 3 - Time to Use Pig!](#DS107L4_page_3)
    * [Page 4 - Pig Queries](#DS107L4_page_4)
    * [Page 5 - Pig Latin Commands Available](#DS107L4_page_5)
    * [Page 6 - HBase](#DS107L4_page_6)
    * [Page 7 - Using HBase](#DS107L4_page_7)
    * [Page 8 - Lesson 4 Hands-On ](#DS107L4_page_8)
    * [Page 9 - Lesson 4 Hands-On Solution](#DS107L4_page_9)
    
    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Overview of this Module<a class="anchor" id="DS107L4_page_1"></a>

[Back to Top](#DS107L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [1]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Pig and HBase
VimeoVideo('388626004', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO107L04overview.zip)**.

# Introduction

In this lesson, you will learn about Pig, a SQL-like query system, and HBase, Hadoop's own non-relational database.

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - This Little Piggy went to Hadoop! <a class="anchor" id="DS107L4_page_2"></a>

[Back to Top](#DS107L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# This Little Piggy went to Hadoop!

*Pig* is another query system built on top of Hadoop and MapReduce, similar to Hive.  However, while Hive uses SQL (strictly speaking, a dialect called HiveQL), Pig uses its own SQL-like language, called Pig Latin. No joke! You have the option to run it with either Hadoop's built-in MapReduce engine, or with TEZ to make things go faster. Just like Hive, Pig is extremely flexible, and you can create your own user-defined functions if you'd like. Pig calls datasets *relations*.  They behave in much the same manner as Hive tables, but of course everyone needs to express their individuality with their own naming conventions!

---

## How to Run Pig

There are three main ways to run Pig: 

* Ambari with Pig View
* Grunt, a command prompt interface
* Pig script file to run through the command prompt

You will be using the lovely, user-friendly query window in the Pig View of Ambari.  Keep it simple!

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Time to Use Pig!<a class="anchor" id="DS107L4_page_3"></a>

[Back to Top](#DS107L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Step up to the Feeding Trough! It's Time to Use Pig!

Now that you have a general understanding of what Pig is, you will get some experience with Pig.

--- 

## Pig View

To access ```Pig View``` in Ambari, go to the ```Views``` section in the right hand corner of Ambari and select ```Pig View```.  When you do so, this is what you will see:

![A window labeled Ambari Sandbox has five tabs on top. The tabs are labeled dashboard, services, hosts, alerts, and admin. The left panel of the screen displays three options labeled scripts, UDFs, and history. The right panel displays four columns labeled name, lastExecuted, lastResults, and actions.](Media/pig1.png)

As it suggests, click on the ```New Script``` button and then give that bad boy a name!

![A window labeled Ambari Sandbox has five tabs on top. The tabs are labeled dashboard, services, hosts, alerts, and admin. The left panel of the screen displays three options labeled scripts, UDFs, and history. A prompt box labeled new script has two text fields for name and script HDFS location. Two buttons labeled cancel and create are placed at the bottom of the script.](Media/pig2.png)

Then click the ```Create``` button, and you are taken to a Pig query window: 

![A browser displays a webpage labeled Ambari Sandbox that has five tabs on top. The tabs are labeled dashboard, services, hosts, alerts, and admin. The left panel of the screen displays icons for scripts, UDFs, and history. PiggingOut option is selected from a menu placed in the left panel. The right panel of the screen has two tabs labeled script and history. The option script is selected.](Media/pig3.png)

When you have run a script, you can access the ```History``` tab to see previous results and/or the Logs.  

![A window with three tabs labeled script, history, and pig attempt two - completed. The data is provided in four columns and five rows. The column headings are labeled date, status, duration, and actions.](Media/pig13.png)

Clicking on the ```Logs``` button will bring you to something that looks like this:

![A page displays the logs with a download button on top of the page. The warning message reads, use yarn jar to launch YARN applications.](Media/pig14.png)

If your Pig query doesn't run, or runs with a little red exclamation point after it, you'll need to comb the logs to see what went wrong and how to fix it.

---

## TEZ

There is a little miracle button on Pig that you will never want to ignore:

![A window with three tabs labeled script, history, and pig attempt two - completed. The script tab is selected. The screen displays PigAttempt2 and a checkbox labeled execute on Tez and a button labeled execute.](Media/pig15.png)

You'll recall that TEZ can take the place of MapReduce for both Hive and Pig and tries to figure out what can be done simultaneously, to make your jobs process faster.  Hive utilizes TEZ automatically, but with Pig, you need to tell it to make use of TEZ with that magic button.  Once you select it, it stays selected until you exit out of ```Pig View```. If you doubt TEZ's power, go ahead and try to run some of the queries on the next page with and without the TEZ button.  Results can be up to 15x faster!

---

### How Does TEZ Work?

TEZ utilizes *directed acyclic graphs (DAGs)* to plot the most efficient route for the work you've assigned it. By graphing it all out, TEZ can eliminate unnecessary steps and dependencies, and can even run independent steps at the same time.  This means that when you use TEZ, you are optimizing the flow of data through your job, and TEZ runs on top of YARN, Hadoop's resource manager, to get the resources each step needs.
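
To make the DAG idea concrete, here is a minimal Python sketch. It is not how TEZ is implemented; it only shows how knowing each step's dependencies lets a scheduler find steps that can run at the same time. The step names are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job: each step maps to the set of steps it depends on.
dependencies = {
    "load_ratings": set(),
    "load_titles": set(),
    "average_ratings": {"load_ratings"},
    "filter_high": {"average_ratings"},
    "join_titles": {"filter_high", "load_titles"},
}

def parallel_stages(deps):
    """Group steps into stages; every step in a stage can run at the same time."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose inputs are all finished
        stages.append(ready)
        sorter.done(*ready)
    return stages

stages = parallel_stages(dependencies)
for number, stage in enumerate(stages, start=1):
    print(f"Stage {number}: {', '.join(stage)}")
```

In this made-up job, the two load steps have no dependencies, so they land in the same stage and could run in parallel, while the join has to wait for both the filter and the title load.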

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Pig Queries<a class="anchor" id="DS107L4_page_4"></a>

[Back to Top](#DS107L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Pig Queries

Out of the frying pan and into the fire with this bacon! Time to really get going with some querying in Pig. You will need to re-upload your `books` file to HDFS to follow along with this lesson. 

---

## Find All High Rated Books

Your first task with Pig is to find all the books that are highly rated.

The first step in this is to bring in your data:

```pig
ratings = LOAD '/user/maria_dev/books_data/books.csv' USING PigStorage(',')
	AS (bookID:int, authors:chararray, average_rating:float, isbn:chararray, isbn13:chararray, language_code: chararray, num_pages:int, ratings_count: int, text_reviews_count:int);

metadata = LOAD '/user/maria_dev/books_data/bookIDs.csv' USING PigStorage(',')
	AS (bookID: int, title: chararray);
```

You use the ```LOAD``` operator to specify exactly where your data is stored in HDFS, and then tell Pig how to parse it with ```USING PigStorage()```.  ```','``` goes in the parentheses in this case because your data is comma delimited; if your data used a different delimiter, such as a tab, you would put that delimiter there instead.

Next, you use the ```AS``` clause.  ```AS``` is how you define the column names and data types in this case.  

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>If you would like to see all the different data types Pig has available, check <a href="https://www.folkstalk.com/2013/07/pig-data-types-primitive-and-complex.html"> here! </a></p>
    </div>
</div>
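
If it helps to see what ```LOAD ... USING PigStorage(',') AS (...)``` is doing conceptually, here is a rough Python sketch. It is only an analogy, not Pig's actual implementation: the `load` function, the `SCHEMA` list, and the sample rows are all made up for illustration.

```python
# Hypothetical schema mirroring an AS clause: (field name, converter).
SCHEMA = [("bookID", int), ("title", str)]

def load(lines, delimiter=",", schema=SCHEMA):
    """Split each line on the delimiter and cast each field to its declared type."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        row = {}
        for (name, convert), raw in zip(schema, fields):
            try:
                row[name] = convert(raw)
            except ValueError:
                # Assumption: mirrors Pig, which produces a null when a value
                # does not fit the declared type.
                row[name] = None
        rows.append(row)
    return rows

# Made-up sample lines standing in for a comma-delimited file.
sample = ["1,Book A\n", "2,Book B\n", "oops,Bad ID\n"]
metadata = load(sample)
print(metadata)
```

Note that a plain split on `,` has no notion of quoted fields, so a title that itself contains a comma would be broken into extra fields. That is worth keeping in mind whenever a delimiter can also appear inside your data.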

The code above brings in both datasets - the one that has information about the books and the one that contains the titles of the books. Next, you create a relation that will give you back the title and the book ID for every row.  This is done using the ```FOREACH``` and ```GENERATE``` commands, which you can think of like a Pig version of a for loop.  It takes the same "for this" format, but instead of "in that," you have "do this."  So the next line of code below says, "for each row of metadata, keep the book ID and the title (renaming the title field to bookStuff), and call the result nameLookup."

```pig    
nameLookup = FOREACH metadata GENERATE bookID, title AS bookStuff;
```
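
For comparison, here is roughly the same step written as a Python list comprehension. This is an analogy only, using made-up rows; the point to notice is that ```AS bookStuff``` renames just the field directly in front of it, while the whole result is stored under the name on the left of the equals sign.

```python
# Made-up rows standing in for the metadata relation.
metadata = [
    {"bookID": 1, "title": "Book A"},
    {"bookID": 2, "title": "Book B"},
]

# FOREACH metadata GENERATE bookID, title AS bookStuff;
name_lookup = [
    {"bookID": row["bookID"], "bookStuff": row["title"]}  # title is renamed
    for row in metadata
]
print(name_lookup)
```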

Then you want to group the first dataset, ```ratings```, by ```bookID```, using the ```GROUP``` function:

```pig
ratingsByBook = GROUP ratings by bookID;
```

And then you can get the average ratings, by again using ```FOREACH / GENERATE```.  The function ```AVG``` is also utilized:

```pig
avgRatings = FOREACH ratingsByBook GENERATE group AS bookID, AVG(ratings.average_rating) AS avgRating;
```

Remember that this isn't strictly necessary for this particular data, since it only has one rating per book, but would be helpful to do if you had multiple book ratings. Then, you can use the ```FILTER``` function to look only at ratings above a 4:

```pig
fiveStarBooks = FILTER avgRatings by avgRating > 4.0;
```

And then you want to ```JOIN``` the data you've filtered down from the original ```ratings``` file with the data from the ```nameLookup``` you created to be a crosswalk:

```pig
fiveStarsWithData = JOIN fiveStarBooks BY bookID, nameLookup BY bookID;
```

And lastly, you can ```DUMP``` the data, which is Pig's way of printing or displaying the data.

```pig
DUMP fiveStarsWithData;
```

This should produce results something like this:

![A browser displays a webpage labeled Ambari Sandbox that has five tabs on top. The tabs are labeled dashboard, services, hosts, alerts, and admin. The left panel of the screen displays icons for scripts, UDFs, and history. PiggingOut option is selected from a menu placed in the left panel. The right panel of the screen has three tabs labeled script, history, and PiggingOut - completed. The piggingout - completed option is selected.](Media/pig11.png)
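
If you want to check your understanding of the whole pipeline, the sketch below walks through the same group, average, filter, and join steps in plain Python on a few made-up rows. It is an analogy rather than what Pig runs internally, and the joined output is simplified: Pig would keep the `bookID` column from both sides of the join.

```python
from collections import defaultdict
from statistics import mean

# Made-up data standing in for the ratings and nameLookup relations.
ratings = [
    {"bookID": 1, "average_rating": 4.5},
    {"bookID": 2, "average_rating": 3.5},
    {"bookID": 3, "average_rating": 4.25},
]
name_lookup = {1: "Book A", 2: "Book B", 3: "Book C"}

# GROUP ratings BY bookID
ratings_by_book = defaultdict(list)
for row in ratings:
    ratings_by_book[row["bookID"]].append(row["average_rating"])

# FOREACH ratingsByBook GENERATE group AS bookID, AVG(...) AS avgRating
avg_ratings = {book_id: mean(values) for book_id, values in ratings_by_book.items()}

# FILTER avgRatings BY avgRating > 4.0
high_rated = {book_id: avg for book_id, avg in avg_ratings.items() if avg > 4.0}

# JOIN ... BY bookID (inner join: keep only IDs present on both sides)
joined = [
    (book_id, avg, name_lookup[book_id])
    for book_id, avg in sorted(high_rated.items())
    if book_id in name_lookup
]
print(joined)
```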

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Pig Latin Commands Available<a class="anchor" id="DS107L4_page_5"></a>

[Back to Top](#DS107L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Pig Latin Commands Available

<table class="table table-striped">
    <tr>
        <th>Pig Latin Command</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>LOAD</td>
        <td>Reads in your data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>STORE</td>
        <td>Saves your relation to storage, such as a file in HDFS.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>FILTER</td>
        <td>Allows you to look at subsets of your data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>DISTINCT</td>
        <td>Find only the unique values.</td>
    </tr>
        <tr>
        <td style="font-weight: bold;" nowrap>FOREACH / GENERATE</td>
        <td>Iterate over data, for loop style.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>MAPREDUCE</td>
        <td>Call explicit mappers and reducers.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>STREAM</td>
        <td>Sends your data through an external script or program using standard input and output.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>SAMPLE</td>
        <td>Create a random sample of your relation.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>JOIN</td>
        <td>Connect two relations.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>COGROUP</td>
        <td>A type of join that creates a separate tuple for each key. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>GROUP</td>
        <td>Aggregate data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>ORDER</td>
        <td>Sorts your relation by one or more fields.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>RANK</td>
        <td>Assign a rank order to each row without changing the order in the relation itself.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>LIMIT</td>
        <td>Allows you to view the first n rows of your relation.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>UNION</td>
        <td>Takes two relations and squishes them together.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>SPLIT</td>
        <td>Separates one relation into more than one.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>DESCRIBE</td>
        <td>Tells you what the schema is for your relation.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>EXPLAIN</td>
        <td>Tells you how Pig intends to execute your query.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>ILLUSTRATE</td>
        <td>Takes a sample of each piece of data and shows each step as it processes.</td>
    </tr>
</table>
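
A few of these commands have close cousins in everyday Python, which can make them easier to remember. The sketch below is only an analogy on made-up rows; Pig runs these operations across a cluster and, unlike this example, does not promise any particular row order unless you use ```ORDER```.

```python
# Made-up relation: each row is a (name, count) tuple.
rows = [("b", 2), ("a", 5), ("b", 2), ("c", 1)]

# DISTINCT: drop duplicate rows (sorted here only to make the output stable).
distinct = sorted(set(rows))

# ORDER distinct BY count DESC
ordered = sorted(distinct, key=lambda r: r[1], reverse=True)

# LIMIT ordered 2
limited = ordered[:2]

# UNION: stacks two relations together and keeps duplicates.
union = rows + [("d", 9)]

# SPLIT rows INTO low IF count < 2, high IF count >= 2
low = [r for r in rows if r[1] < 2]
high = [r for r in rows if r[1] >= 2]

print(distinct, limited, low, high, sep="\n")
```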

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - HBase<a class="anchor" id="DS107L4_page_6"></a>

[Back to Top](#DS107L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# HBase

*HBase* is a non-relational, transactional database built on top of HDFS.  There is no query language that goes with HBase, but there is an API that allows you to perform CRUD operations on your data. 

---

## HBase Architecture

HBase has two components to it: *region servers*, which contain groups of keys that can automatically grow and re-partition as needed, and *master servers*, which keep track of everything that's going on with the region servers.
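
One way to picture how keys are spread across region servers is as a sorted list of key ranges. The sketch below assumes each region covers a contiguous range of row keys beginning at a start key; the start keys and server names are hypothetical, and a real cluster tracks this layout for you.

```python
from bisect import bisect_right

# Hypothetical layout: each region begins at a start key and runs up to the next one.
region_start_keys = ["", "g", "n", "t"]  # "" means "from the very first key"
region_servers = ["server-1", "server-2", "server-1", "server-3"]

def find_region(row_key):
    """Return (region index, server name) for the region whose range holds row_key."""
    index = bisect_right(region_start_keys, row_key) - 1
    return index, region_servers[index]

for key in ["apple", "g", "melon", "zebra"]:
    print(key, "->", find_region(key))
```

Re-partitioning in this picture would simply mean inserting a new start key into the list, which splits one range into two.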

---

## Data Structure

Like other non-relational databases, in HBase, every row is referenced by a unique key.  However, HBase has the ability to store data in *column families* instead of just key-value pairs.  Each column family can contain multiple related individual columns.  For instance, you might have a column family named address, and stored inside are columns for street number, street name, city name, state name, and zip code. HBase excels at dealing with missing data, since it does not take up space within a column if no data populates that column.  

HBase also has something called a *cell*, which is the intersection of a row and a column.  You can store multiple versions of each cell to get an idea of the history if data is being updated.  History can include a timestamp as well, so you know exactly when a change occurred. 
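
The structure described above can be modeled with nested Python dictionaries. This is a toy sketch for building intuition, not how HBase stores data on disk; the `put` and `get` helpers and the sample values are invented for illustration.

```python
import time

# row key -> column family -> column -> list of (timestamp, value), newest first
table = {}

def put(row_key, family, column, value, timestamp=None):
    """Add a new version of a cell without discarding older versions."""
    ts = timestamp if timestamp is not None else int(time.time() * 1000)
    versions = table.setdefault(row_key, {}).setdefault(family, {}).setdefault(column, [])
    versions.append((ts, value))
    versions.sort(reverse=True)

def get(row_key, family, column, max_versions=1):
    """Return up to max_versions versions of a cell, newest first."""
    versions = table.get(row_key, {}).get(family, {}).get(column, [])
    return versions[:max_versions]

put("user1", "address", "city", "Denver", timestamp=1)
put("user1", "address", "city", "Boulder", timestamp=2)  # an update, not an overwrite
put("user2", "address", "zip", "80301", timestamp=3)     # user2 has no city at all

print(get("user1", "address", "city"))
print(get("user1", "address", "city", max_versions=2))
print(get("user2", "address", "city"))
```

Notice that `user2` simply has no `city` entry. Nothing is stored for the missing column, which is the sense in which missing data takes up no space.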

---

## How to Access HBase

You can access HBase from a wide variety of applications! From the easiest to the most complicated, you have:

* Hive
* Pig
* Spark
* HBase shell
* Java API that includes wrappers for other languages such as Python or Scala

Hive, Pig, and Spark also allow you to process data within their program and then output it to HBase, which is helpful.

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Using HBase<a class="anchor" id="DS107L4_page_7"></a>

[Back to Top](#DS107L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Using HBase

In order to make the best and most realistic use of HBase, you will be running a REST service on your virtual machine that interfaces between HBase on your cluster and a Python program on your actual computer.  This way, you are replicating the use case in which someone out there on the web wants to make a call to your data and pull out specific information from it.

---

## Open a New Port on Your Virtual Machine

The first thing you need to do to replicate this scenario is to create a new port on your virtual machine to which your regular computer can connect. In order to do that, open up your ```Oracle VM VirtualBox Manager```, and right click on your ```Hortonworks Docker Sandbox```. Then you'll want to choose the first option at the top, for ```Settings```: 

![A window for Oracle VM virtualbox manager has three menus labeled file, machine, and help. The option Hortonworks docker sandbox 1 is selected from the left panel. A right-click menu appears and settings option is selected from the menu.](Media/pig4.png)

Then go to the ```Network``` tab along the side: 

![A window for Hortonworks docker sandbox 1 - settings displays general, system, display, storage, audio, network, serial ports, USB, shared folders, and user interface on the left panel. The right panel has four tabs labeled adapter 1, adapter 2, adapter 3, and adapter 4. The tab adapter 1 is selected.](Media/pig5.png)

And expand the ```Advanced``` arrow on the bottom:

![A window for Hortonworks docker sandbox 1 - settings displays general, system, display, storage, audio, network, serial ports, USB, shared folders, and user interface on the left panel. The right panel has four tabs labeled adapter 1, adapter 2, adapter 3, and adapter 4. The tab adapter 1 is selected.](Media/pig6.png)

You then want to click on the ```Port Forwarding``` button on the bottom. You should end up seeing a screen something like this:

![A window for port forwarding rules. The data is provided in a few columns labeled protocol, host ip, host port, guest ip, and guest port. The ok and cancel buttons are placed below.](Media/pig7.png)

You will add a new port by using the green plus sign in the upper right hand corner.  This will give you a new row to enter data into. In the far left column, which you almost cannot see, you'll want to type in a name; something like ```HBase REST```.  That way you can identify it later.  Then, you will want to put in the ```Host IP``` as ```127.0.0.1```, and both the ```Host Port``` and the ```Guest Port``` should be numbered ```8000```.  The bottom line in the picture above is an example.

---

## Start HBase in Ambari

The next thing you'll need to take care of is starting HBase.  While some other services already come started, like Pig and Hive, you actually need to tell your Hadoop cluster that you'd like to use HBase.  To do so, log in to Ambari, and then click on the ```HBase``` service from the left-hand menu. You should see this:

![A browser displays a webpage from Ambari sandbox. It displays HDFS, YARN, MapReduce2, Tez, Hive, HBase, Pig, and sqoop on the left panel. The option Hbase is selected. The right panel has three tabs labeled summary, heatmaps, and configs. The summary option is selected. A right-click menu appears on the right side of the window. The start option is selected from the menu.](Media/pig8.png)

Click on the ```Service Actions``` menu in the right hand corner, and then click on ```Start```. You should see a confirmation window like this: 

![A confirmation message that reads, you are about to start HBase. A checkbox labeled turn off maintenance mode for Hbase is placed below the confirmation message. Two buttons labeled cancel and confirm start are placed at the bottom of the screen.](Media/pig9.png)

Make sure you click ```Confirm Start```.  Then you will get a window showing the progress of starting HBase.  You're done when it gets to 100% without error.

![A screen titled 1 background operation running displays the content in four columns labeled operations, start time, duration, and show.](Media/pig10.png)

Then you're good to go! You can click ```Ok```.

---

## Sign in as a Super User

Next, log in to your virtual machine via terminal (Mac / Linux) or PuTTY (Windows). Once you sign in, you'll want to change to the super user:

```bash
su root
```

It will then ask you for the password.  The default password is ```hadoop```. You may not be able to see to enter it, so type carefully, take it on faith, and then hit enter.

Next, it will tell you that you need to change the default password immediately, but you'll need to type in ```hadoop``` one more time first.  Then you can choose a new password.  Change it to whatever you'd like, but make sure you write it down somewhere, because you will be needing it again.

---

## Start the REST Server

Next, you'll need to start your REST server.  To do this, you'll use the following command:

```bash
/usr/hdp/current/hbase-master/bin/hbase-daemon.sh start rest -p 8000 --infoport 8001
```

This uses the HBase installation on your current Hadoop cluster to start your REST server on port 8000 (```-p```).  The ```--infoport``` option also serves information about the service on port 8001. Because you didn't forward port 8001 from your virtual machine, though, you won't be able to reach that information from your own computer. 
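
Once the REST server is running, a client talks to it over plain HTTP. The sketch below shows, under stated assumptions, what that conversation looks like from Python: it assumes a row is fetched at a URL of the form `/<table>/<row key>` and that the JSON reply base64-encodes row keys, column names, and values. The helper names and the sample reply are made up, and no network call is made here.

```python
import base64

def row_url(table, row_key, host="127.0.0.1", port=8000):
    """Build the URL for fetching one row from the REST server."""
    return f"http://{host}:{port}/{table}/{row_key}"

def decode_rows(payload):
    """Turn a base64-encoded JSON reply into plain Python dictionaries."""
    decoded = {}
    for row in payload.get("Row", []):
        key = base64.b64decode(row["key"]).decode("utf-8")
        cells = {}
        for cell in row.get("Cell", []):
            column = base64.b64decode(cell["column"]).decode("utf-8")
            cells[column] = base64.b64decode(cell["$"]).decode("utf-8")
        decoded[key] = cells
    return decoded

def b64(text):
    """Helper used only to build the sample reply below."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

# Hand-built sample in the shape the server is assumed to return.
sample_reply = {
    "Row": [
        {"key": b64("1"),
         "Cell": [{"column": b64("rating:20"), "timestamp": 0, "$": b64("10")}]}
    ]
}

print(row_url("ratings", "1"))
print(decode_rows(sample_reply))
```

A client library can hide this encoding and decoding from you, which is why the next section uses one rather than raw HTTP requests.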

---

## Set Up Your File

Next, you will need to set up a file on your local machine in Python. If you use Jupyter Notebook, it should have most of the packages you need already installed, so that version is recommended. You DO NOT need to understand all the details in the code below; it is enough to copy and paste the code into your editor and change the appropriate lines.

```python
from starbase import Connection

c=Connection("127.0.0.1", "8000")

# Get a handle for the ratings table (it is actually created further down)
ratings = c.table('ratings')

# Drop the table if it already exists
if (ratings.exists()):
	print("Dropping existing ratings table\n")
	ratings.drop()

# Create a column family
ratings.create('rating')

# Open the Data File
print("Parsing the data...")
ratingFile = open("C:/Users/meredith.dodd/Downloads/rating.csv", "r")

# Create a batch
batch = ratings.batch()

# Structure the Data
for line in ratingFile:
	# strip() removes the trailing newline so it is not stored with the rating
	(user_id, anime_id, rating) = line.strip().split(',')
	batch.update(user_id, {'rating': {anime_id: rating}})

# Close the File
ratingFile.close()

print("Committing ratings data to HBase via REST service\n")

# Commit the Data
batch.commit(finalize=True)

print("Get back ratings for some users...\n")
print("Ratings for user ID 1:\n")
print(ratings.fetch("1"))
print("Ratings for user ID 33:\n")
print(ratings.fetch("33"))
```

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>If starbase is not already installed, you can open your command prompt and type in pip install starbase (Windows machines) or pip3 install starbase (Mac / Linux machines).</p>
    </div>
</div>

The code above sets up a table named ```ratings```, dropping it first if it already exists, and then creates a column family named ```rating```.  Next, it opens the data file specified.  You will need to change the line that starts ```ratingFile = open``` so that the file path points to wherever you save **[this data](https://repo.exeterlms.com/documents/V2/DataScience/Big-Data/rating.zip)**. The code then batches things together and parses the data into the right format. It closes the file and commits the data. After that, you can ask it to retrieve things for you! In this case, you are asking it to give you all the ratings for given users - user 1 and user 33. 
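
If the nested dictionary passed to ```batch.update``` looks mysterious, the small sketch below shows the shape it builds up, using only plain Python and a few made-up lines in the same `user_id,anime_id,rating` layout. The `to_updates` helper is hypothetical and exists only for illustration; it does not talk to HBase.

```python
def to_updates(lines):
    """Collect lines into {row key: {column family: {column: value}}}."""
    updates = {}
    for line in lines:
        user_id, anime_id, rating = line.strip().split(",")
        # Each user is a row; each anime ID becomes a column in the rating family.
        updates.setdefault(user_id, {"rating": {}})["rating"][anime_id] = rating
    return updates

# Made-up sample lines, not taken from the real data file.
sample_lines = ["1,20,10\n", "1,24,8\n", "33,20,7\n"]
updates = to_updates(sample_lines)
print(updates)
```

Seen this way, each user ID is a row key, ```rating``` is the column family, and every anime the user rated becomes its own column within that family.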

This file may take a while to run.  If your computer is slow, it may take an hour or more.  But the end result should be that you can retrieve the different anime that each user has rated.

---

## Shut Things Down

Now that you have completed your work, you will want to stop the REST server:

```bash
/usr/hdp/current/hbase-master/bin/hbase-daemon.sh stop rest
```

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Lesson 4 Hands-On<a class="anchor" id="DS107L4_page_8"></a>

[Back to Top](#DS107L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">



This Hands-On will **not** be graded, but you are encouraged to complete it. The best way to become a great data scientist is to practice. Once you have submitted your project, you will be able to access the solution on the next page. Note that the solution will be slightly different from yours, but should look similar.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Description

Using the Book data files, write a Pig Query that will find all the books that have a 1-star rating. Include a file with your Pig Query and a screenshot of your results.

---

## Alternative Assignment if You Can't Run Hadoop and/or Ambari

If your computer refuses to run Hadoop and/or Ambari, **[here](https://repo.exeterlms.com/documents/V2/DataScience/Big-Data/L4exam.zip)** is an alternative exam to test your understanding of the material. Please attach it instead.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>



<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Lesson 4 Hands-On Solution<a class="anchor" id="DS107L4_page_9"></a>

[Back to Top](#DS107L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Lesson 4 Hands-On Solution

```pig
ratings = LOAD '/user/maria_dev/books_data/books.csv' USING PigStorage(',')
	AS (bookID:int, authors:chararray, average_rating:float, isbn:chararray, isbn13:chararray, language_code: chararray, num_pages:int, ratings_count: int, text_reviews_count:int);

metadata = LOAD '/user/maria_dev/books_data/bookIDs.csv' USING PigStorage(',')
	AS (bookID: int, title: chararray);
    
nameLookup = FOREACH metadata GENERATE bookID, title;

groupedRating = GROUP ratings by bookID;

averageRatings = FOREACH groupedRating GENERATE group AS bookID,
	AVG(ratings.average_rating) AS avgRating, COUNT(ratings.average_rating) AS numRatings;
    
badBooks = FILTER averageRatings BY avgRating < 2.0;

namedBadBooks = JOIN badBooks BY bookID, nameLookup BY bookID;

finalResults = FOREACH namedBadBooks GENERATE nameLookup::title AS bookName,
	badBooks::avgRating as avgRating, badBooks::numRatings as numRatings;

finalResultsSorted = ORDER finalResults By numRatings DESC;

DUMP finalResultsSorted;
```

---

## Lesson 4 Hands-on Solution  - Alternative Assignment

This exam serves as the assessment for those students who cannot utilize the Hadoop system and/or Ambari GUI. Answers are shown in bold.

1.	Explain the concept of TEZ and how it works.

    **TEZ makes jobs on Hadoop go faster. It runs off of directed acyclic graphs, or DAGs, which find the most efficient path for your work to be conducted.  It removes unnecessary steps and figures out what can be run in parallel.  These efficiencies save both time and money!**

2.	What is Pig’s version of the for loop? 
    a.  Group
    b.	Filter
    c.	Join
    **d.	For each / generate** 

3.	How is HBase structured? Describe or draw – whatever makes the most sense to you.

    **HBase uses key-value pairs, with each row having a unique key. But, it also makes use of column families, so that you can have a little more granularity. A cell is the intersection of a column and a row, and you can store multiple versions of cell data and even timestamp it!**