# DS107 Big Data : Lesson Three Companion Notebook

### Table of Contents <a class="anchor" id="DS107L3_toc"></a>

* [Table of Contents](#DS107L3_toc)
    * [Page 1 - Introduction](#DS107L3_page_1)
    * [Page 2 - MapReduce](#DS107L3_page_2)
    * [Page 3 - Hive Basics](#DS107L3_page_3)
    * [Page 4 - Using Hive](#DS107L3_page_4)
    * [Page 5 - Hive Queries](#DS107L3_page_5)
    * [Page 6 - Sqoop](#DS107L3_page_6)
    * [Page 7 - Setting up MySQL](#DS107L3_page_7)
    * [Page 8 - Using Sqoop](#DS107L3_page_8)
    * [Page 9 - Key Terms](#DS107L3_page_9)
    * [Page 10 - Lesson 3 Hands-On](#DS107L3_page_10)
    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Overview of this Module<a class="anchor" id="DS107L3_page_1"></a>

[Back to Top](#DS107L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [1]:
from IPython.display import VimeoVideo
# Tutorial Video Name: MapReduce, Hive, and Sqoop
VimeoVideo('388136496', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO107L03overview.zip)**.

# Introduction

In this lesson, you will learn about the foundation of big data processing on Hadoop - MapReduce.  Then you'll begin utilizing one of the modern applications built on top of MapReduce: Hive.  Hive allows you to use SQL queries to interact with your big data, and you can even utilize Sqoop to integrate Hive with traditional databases like MySQL.

By the end of this lesson, you should be able to complete the following tasks: 

* Understand the theory behind MapReduce 
* Understand the architecture of Hive
* Utilize Hive to upload data and create HiveQL queries
* Prepare MySQL for the use of Sqoop
* Use Sqoop to import and export files from MySQL

This lesson will culminate in a hands-on exercise in which you use Hive to find the most popular and the most highly rated anime shows.  

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/456790695"> recorded live workshop on the concepts in this lesson. </a> </p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - MapReduce<a class="anchor" id="DS107L3_page_2"></a>

[Back to Top](#DS107L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# MapReduce

_MapReduce_ is a programming model used to process large amounts of data across multiple computers in two steps: _map_ and _reduce_. The idea is to take a collection of data, such as the `crimes-sample.csv` file, and _map_ (or filter) the desired data to produce a list of results in key-value pair format. Those key-value pairs are then sent to the _reduce_ step, which aggregates the results. The mapper helps optimize your work as well: you throw away any data you aren't currently interested in, then extract and organize the things you do care about so that they can be aggregated.

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Fun Fact!</h3>
    </div>
    <div class="panel-body">
        <p>MapReduce was introduced in 2004 by two Google researchers in a research paper titled <a href="https://research.google.com/archive/mapreduce.html">"MapReduce: Simplified Data Processing on Large Clusters."</a></p>
    </div>
</div>

This concept is best understood by visualizing what's happening with real data and code.

---

## How MapReduce Works

Imagine you are handed the task of writing a Python program, using the `crimes-samples.csv` file as data, to count how many crime reports resulted in an arrest and how many did _not_ result in an arrest. Since MapReduce is a two-step process, the work will be split into two parts. The _map_ step will process the "Arrest" column in the `crimes-samples.csv` file and produce a list as shown below, where "Yes" means an arrest occurred and "No" means there was no arrest. The number one after the comma on each line counts that single instance; its purpose will make more sense in the reduce step.

---

### Example `Map` Results

Below are some example map results: 

```text
Yes,1
No,1
No,1
Yes,1
No,1
No,1
No,1
Yes,1
Yes,1
No,1
```

Remember, MapReduce is primarily used when data processing will be distributed to multiple computers. Therefore, imagine multiple computers process their own different crime files and produce a list of results similar to the one above. Once the _map_ step is done, the next step is to _reduce_ or aggregate all of these entries into a simpler representation. 
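To make the "multiple computers" idea concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code: the `map_arrests` function and the `node_a` / `node_b` names are made up for illustration, and the ten example values are simply split into two chunks to stand in for two machines.

```python
def map_arrests(arrest_values):
    """Map step: turn each arrest value into a (key, 1) pair."""
    return [(value, 1) for value in arrest_values]


# The ten example records, split between two hypothetical machines.
chunks = {
    "node_a": ["Yes", "No", "No", "Yes", "No"],
    "node_b": ["No", "No", "Yes", "Yes", "No"],
}

# Each machine maps only its own chunk; none of them needs the others' data.
mapped = {node: map_arrests(values) for node, values in chunks.items()}

for node, pairs in mapped.items():
    print(node, pairs)
```

Because each chunk is mapped independently, the same function could run on as many machines as there are chunks of data.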

---

### Example `Reduce` Results

Below are some example reduce results:

```text
Yes,4
No,6
```

Under the hood, ```Yes``` and ```No``` are treated like keys in a dictionary, and the 1 on each line is the value attached to that key. During the _reduce_ step, each line is processed and the values for each key are combined. The map step produced four "Yes" entries and six "No" entries, and the reduce step added them together.
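The dictionary idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how Hadoop implements it; the `shuffle` and `reduce_counts` function names are assumptions chosen for the example, and the input is the ten pairs from the example map results.

```python
from collections import defaultdict

# The ten key-value pairs from the example map results.
mapped_pairs = [
    ("Yes", 1), ("No", 1), ("No", 1), ("Yes", 1), ("No", 1),
    ("No", 1), ("No", 1), ("Yes", 1), ("Yes", 1), ("No", 1),
]


def shuffle(pairs):
    """Group all values that share a key, e.g. {"Yes": [1, 1, 1, 1], ...}."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: combine the values for each key by adding them up."""
    return {key: sum(values) for key, values in grouped.items()}


totals = reduce_counts(shuffle(mapped_pairs))
for key, total in totals.items():
    print(f"{key},{total}")
```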

---

## MapReduce Architecture

Here are the steps that take place under the hood when you run a MapReduce program on a Hadoop cluster:

* You (the client node) send in your code
* YARN keeps track of which machines need to be utilized and copies data to HDFS or another distributed file system as needed
* The MapReduce application master runs under a NodeManager and keeps track of all the nodes that are working to complete the MapReduce task
* The task takes place while talking to HDFS

The system tries to ensure that your mapper gets run as close to where the data is located as possible, so that efficiency is increased.

---

## Languages for MapReduce

While Hadoop was programmed in Java, you can utilize the concept of MapReduce through Python as well as many other languages.

---

## How MapReduce Handles Failure

MapReduce can be quite slow, and for that reason it is rarely used on its own anymore, but it is resilient: it handles failure well at a wide variety of levels.  Below, you will find a table of things that could go wrong and how MapReduce handles each situation: 

<table class="table table-striped">
    <tr>
        <th>Issue</th>
        <th>Solution</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Worker nodes have errors.</td>
        <td>You can restart the worker as needed, and/or shift processing to a different node.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Application master goes down.</td>
        <td>YARN will try to restart it.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Entire node goes down.</td>
        <td>The resource manager will try to restart the process on another node.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Resource manager goes down.</td>
        <td>MapReduce itself can't handle this, but you can set up high availability MapReduce and manage it through ZooKeeper.</td>
    </tr>
</table>
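The "shift processing to a different node" idea from the table can be sketched as a simple failover loop. This is a toy illustration only: YARN's real scheduling and recovery logic is far more involved, and `run_with_failover`, `count_arrests`, and the `worker-1` / `worker-2` names are all hypothetical.

```python
def run_with_failover(task, nodes):
    """Try the task on each node in turn and return the first success."""
    errors = []
    for node in nodes:
        try:
            return node, task(node)
        except RuntimeError as exc:
            # Remember the failure and move on to the next node.
            errors.append((node, str(exc)))
    raise RuntimeError(f"task failed on every node: {errors}")


def count_arrests(node):
    """Pretend task: worker-1 is broken, every other node succeeds."""
    if node == "worker-1":
        raise RuntimeError("worker-1 is down")
    return {"Yes": 4, "No": 6}


node_used, result = run_with_failover(count_arrests, ["worker-1", "worker-2"])
print(node_used, result)
```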

---

## How is MapReduce Used Today?

MapReduce is the backbone of other big data programs that sit in Hadoop, such as Hive and Pig, and newer engines such as Spark build on the same map-and-reduce ideas.  However, MapReduce is rarely used by itself anymore, because in comparison to newer technology it is clunky and slow.  Therefore, you will only learn the theory behind MapReduce, and not practice with it yourself.

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Hive Basics<a class="anchor" id="DS107L3_page_3"></a>

[Back to Top](#DS107L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Hive Basics

*Hive* allows you to write SQL queries and execute them on a Hadoop cluster, using a language called *HiveQL*, which is just another SQL variant, very similar to the SQL you have already learned.  It is interactive and scalable, allowing you to use it on an entire cluster of computers, no matter how many you have.  It's also pretty flexible, so you can use Hive to connect to other types of databases and write your own *user-defined functions* if you'd like.

Hive does have a few limitations, however.  Queries can take a few minutes to execute, and Hive is limited compared to other programs you'll learn later, such as Pig and Spark.  It is also not designed for real-time, row-by-row CRUD operations, because Hive is a batch query layer over files in Hadoop rather than a real-time transactional database.

---

## HiveQL vs. SQL

Most things really are the same between HiveQL and SQL.  However, views are worth a closer look. Some SQL databases can materialize a view, storing its results as a table that you can access later.  With HiveQL, the views you create in this lesson are just a construct: a saved query, not data actually stored anywhere.  The purpose of the view in HiveQL is to enable you to break up complicated queries into smaller individual ones.
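To see what "just a construct" means in practice, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in, since SQLite views also happen to be saved queries rather than stored data. It is not Hive itself, and the `highRated` view name and the third book's rating are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table books (bookid integer, average_rating real)")
conn.execute("insert into books values (1, 4.56), (2, 4.49)")

# The view stores only the query text, not a copy of the matching rows.
conn.execute(
    "create view highRated as "
    "select bookid from books where average_rating > 4.5"
)

before = conn.execute("select bookid from highRated order by bookid").fetchall()

# Add a new row to the base table after the view was created.
conn.execute("insert into books values (3, 4.90)")

# The view re-runs its query, so the new row shows up automatically.
after = conn.execute("select bookid from highRated order by bookid").fetchall()

print(before, after)
```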

---

## How Hive Works

Hive uses a *schema on read* approach: the data is stored as-is, and structure is only applied to it when it is being read and processed. This is done to speed up processing; it would be clunky and slow if you had to maintain a data structure (think tables) all the time. The opposite of schema on read is *schema on write*, which is how a traditional SQL table system works.
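A rough way to picture schema on read is shown below. This is a conceptual sketch in plain Python, not how Hive is implemented; the `read_with_schema` function and the three-column layout are assumptions for the example. The raw lines are stored as plain text, and column names and types are only applied when the data is read.

```python
# Raw data as it might sit in a file: plain text, no types attached.
raw_lines = ["1,4.56,652", "2,4.49,870"]

# The schema lives separately and is only consulted at read time.
schema = [("bookid", int), ("average_rating", float), ("num_pages", int)]


def read_with_schema(lines, schema):
    """Apply column names and types to each line as it is read."""
    for line in lines:
        fields = line.split(",")
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


rows = list(read_with_schema(raw_lines, schema))
print(rows)
```

Changing the schema here means changing one list, not rewriting the stored data, which is the appeal of the approach.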

---

## Ways to Run Hive

While you can run Hive queries through the command line, the easiest way is simply to utilize the ```Hive View``` in Ambari, which will provide you with an interactive query editor window. You'll start working with Hive in earnest on the next page!

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Using Hive<a class="anchor" id="DS107L3_page_4"></a>

[Back to Top](#DS107L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Using Hive

The easiest way to access Hive is through the Ambari Hive View.  To access it, go to the upper right hand corner and select ```Hive View```:

!["A window displays a web browser consists of five menu options and an icon and a button labeled maria_dev pops up six options are YARN Queue Manager, Files View, Hive View, Pig View, Storm View, and Tez View.
  On the left has eighteen options. On the right three options are present in which first is selected, two buttons are present below and fifteen boxes captioned and present in the middle."](Media/hive1.png)

Here's what the initial interface looks like when you click on it, once it goes through its service checks:

![A window labeled Ambari Sandbox has five tabs on top. The tabs are labeled dashboard, services, hosts, alerts, and admin. The bar also has a button for categories. The next bar has six tabs labeled hive, query, saved queries, history, UDF's and upload table. The query tab is selected.](Media/hive3.png)

This main page is the ```Query Editor```.  It's where you'll type in your SQL code and get information about the query that ran, etc. To the left, you can see the databases that you are currently connected to, in the ```Database Explorer``` section, and to the right, you can see some buttons that will give you more info and tools to go along with your query.

---

## Load Data into Hive Using Upload Table Wizard

The first thing you need to do is to load data into Hive.  This is done in the ```Upload Table``` section of the ```Hive View```:

![A window labeled Ambari Sandbox has five tabs on top. The tabs are labeled dashboard, services, hosts, alerts, and admin. The bar also has a button for categories. The next bar has six tabs labeled hive, query, saved queries, history, UDF's and upload table. The upload table option is selected. The screen displays two radio buttons labeled upload from local and upload from HDFS. A dropdown list box to choose the file type and a button labeled browse is used to select from local.](Media/hive8.png)

You can load data in from either your local machine (if it's a relatively small file) or from HDFS (for larger data; assuming you have a system in place for getting large data into HDFS). In this case, you will just upload it from your local machine, so go ahead and leave the ```Upload from Local``` radio button selected, and then hit the ```Browse``` button on the right hand side, to search for the file.  You will be utilizing the two files ```books``` and ```bookids``` you have been using. Start with ```books```.  Once you drag and drop it in, you will see another view pop up:

![A window labeled Ambari Sandbox has five tabs on top. The tabs are labeled dashboard, services, hosts, alerts, and admin. The bar also has a button for categories. The next bar has six tabs labeled hive, query, saved queries, history, UDF's and upload table. The upload table option is selected. The screen displays two radio buttons labeled upload from local and upload from HDFS. Three dropdown list boxes to choose the file type, database, and stored as and a button labeled browse is used to select from local. A text field is used to indicate the table name. The table details are provided below.](Media/hive12.png)

Here, you can give names to all of the columns and set the data types.  Note that because your data has headers, it all comes in as ```STRING``` data, which is not necessarily correct.  You'll need to change ```average_rating``` to ```FLOAT```, and ``` # num_pages```, ```ratings_count```, and ```text_reviews_count``` to ```INT```. You may have trouble changing this - try hitting the very left side of the arrow. 

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>You will need to change the default column names of column1, etc. to the values in the first row, so that you can follow along with the queries.</p>
    </div>
</div>

Here's what the end result should look like approximately: 

![A window with three tabs labeled script, history, and pig attempt two - completed. The data is provided in four columns and five rows. The column headings are labeled date, status, duration, and actions.](Media/hive13.png)

Once you're satisfied, click the ```Upload Table``` button on the right side. It will kick off several background steps, but you'll know your data is ready to roll once the ```Upload Progress``` bar goes away and you get a green message in the right hand corner of your screen that indicates a successful upload.  Then you can flip back into the ```Query``` section, and if you click the refresh button on the right side of the ```Database Explorer``` and then look into the ```default``` database, you should see your data!

![Database explorer with a dropdown list box labeled default and a box labeled search tables. The list of databases is provided below the search box.](Media/hive14.png)

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Hive Queries<a class="anchor" id="DS107L3_page_5"></a>

[Back to Top](#DS107L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Hive Queries

Now it's on to the fun part of actually running some Hive queries! You can run all the normal SQL commands through here, but the one thing that is special to Hive is how the function ```create view``` gets used, so both queries you'll practice here make use of ```view```.  For the first one, you'll count up the number of ratings each book has. 

---

## Count up the Number of Book Ratings

The first thing you'll want to start with is ```create view```.  You are creating a view named ```topBooks```, and then you need to define it in the next few lines with the ```as```. Into ```topBooks``` you want to pick out a couple of columns, so make use of that good ol' ```select``` statement.  You can even count the number of book ids you have by wrapping the column ```bookid``` in the ```count``` function, and give it a name, too: ```ratingCount```.  Then make sure you tell Hive where these columns are coming from, using that trusty ```from``` statement.  You can then ```group by``` and ```order by``` to ensure things are going to make the most logical sense. ```desc``` tells Hive that you want to sort in descending order, with the largest on top.

The next part of the query joins the view you just created together with the book titles, so that you can get a list of book titles, and not just IDs, since that wouldn't be very practical. Although it's not necessary, to keep things simple you can give each table an alias. The ```bookids``` table becomes ```b``` and the ```topBooks``` view becomes ```t```, which allows you to easily and quickly specify what you're joining and from whence it comes.

```sql
create view topBooks as
select bookid, count(bookid) as ratingCount
from books
group by bookid
order by ratingCount desc;

select b.title, ratingCount
from topBooks t join bookids b on t.bookid = b.bookid;
```

And here are the results you receive from this query: 

![A bar labeled 100-percent. The screen displays the query process results with the status as succeeded. The page has two tabs labeled logs and results. The option results is chosen. It displays the title and the corresponding ratingcount.](Media/hive4.png)

Results will appear on the bottom of your screen, under the ```Results``` tab.  If you need to know anything about the query you just processed, you can also check the ```Logs``` tab. In this case, it looks like all the book ids are unique (each is used only once), so you get a count of 1 for every book.  The query would be more helpful where an id isn't unique, such as a situation in which a whole bunch of users rate the same book.
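Here is a small sketch of that non-unique case, using Python's built-in `sqlite3` module as a stand-in for Hive so it can run anywhere. The `ratings` table, the user ids, and the book titles are all invented for the example; only the shape of the view and the join mirrors the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical data: several users rating the same books.
conn.execute("create table ratings (userid integer, bookid integer)")
conn.executemany(
    "insert into ratings values (?, ?)",
    [(1, 10), (2, 10), (3, 10), (1, 20), (2, 20), (3, 30)],
)
conn.execute("create table bookids (bookid integer, title text)")
conn.executemany(
    "insert into bookids values (?, ?)",
    [(10, "Book A"), (20, "Book B"), (30, "Book C")],
)

# Same idea as topBooks: count how many times each book id appears.
conn.execute(
    "create view topBooks as "
    "select bookid, count(bookid) as ratingCount "
    "from ratings group by bookid"
)

rows = conn.execute(
    "select b.title, t.ratingCount "
    "from topBooks t join bookids b on t.bookid = b.bookid "
    "order by t.ratingCount desc"
).fetchall()
print(rows)
```

In this sketch the `order by` sits on the final `select` rather than inside the view, so the sort order of the joined result is explicit.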

It's important that you remove the view when you're done, since only one view with a given name can exist at a time. To drop the view, simply use this code: 

```sql
drop view topBooks;
```

If you try to run the query a second time without dropping the view, or even make some changes and run it again, you will get an error.  It flashes by really quickly, but don't worry - you can always go back and read it in more depth in the notifications section, which is denoted by the little envelope icon on the right menu at the bottom. 

In case you encounter this issue, here is the error you will get:

![An error message that reads, java.sql.SQLException: Error while processing statement. FAILED: Execution error, return code 1 from org apache hadoop hive.sql.exec.DDLTask.AlreadyExistsException open bracket message: Table topBooks already exists close bracket.](Media/hive2.png)

The ```AlreadyExistsException``` is the giveaway that tells you this error is about a repeat view, and it will go on to tell you exactly which view you already have.
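The same "already exists" behavior, and one way around it, can be tried safely with Python's built-in `sqlite3` module. This is a stand-in sketch, not Hive: the error type and message differ from Hive's, but HiveQL also accepts `drop view if exists`, which is the pattern shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table books (bookid integer)")
conn.execute("create view topBooks as select bookid from books")

# Creating a view with a name that is already taken fails.
try:
    conn.execute("create view topBooks as select bookid from books")
    duplicate_failed = False
except sqlite3.OperationalError:
    duplicate_failed = True

# Dropping first (only if it exists) makes the create safe to repeat.
conn.execute("drop view if exists topBooks")
conn.execute("create view topBooks as select bookid from books")

view_names = [
    row[0]
    for row in conn.execute("select name from sqlite_master where type = 'view'")
]
print(duplicate_failed, view_names)
```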

---

## Saving your Queries

You can save your Hive queries by clicking on the ```Save as``` button at the bottom, which will pop up this window and enable you to give the query a name and save: 

![A window labeled Ambari Sandbox has five tabs on top. The tabs are labeled dashboard, services, hosts, alerts, and admin. The bar also has a button for categories. The next bar has six tabs labeled hive, query, saved queries, history, UDF's and upload table. The query tab is selected. A prompt box labeled saving item has a dropdown list box and two buttons labeled close and ok.](Media/hive5.png)

It can then be accessed on the ```Saved Queries``` tab along the top:

![A window labeled Ambari Sandbox has five tabs on top. The tabs are labeled dashboard, services, hosts, alerts, and admin. The bar also has a button for categories. The next bar has six tabs labeled hive, query, saved queries, history, UDF's and upload table. The saved queries tab is selected. The screen displays four fields labeled preview, title, database, and owner. A button labeled clear filters is placed at the end of the fields.](Media/hive6.png)

You can access a new query window by clicking on the ```New Worksheet``` button on the bottom while on the ```Query``` tab.  Go ahead and do this now, as you'll attempt a new query to find the highest rated book. 

---

## Find the Highest Rated Book

How about trying to find the highest rated book(s)? You can keep all the count info in, but you will also ```select``` the ```average_rating``` column and use the function ```avg``` on it.  Since you learned above that the books are only listed once, performing an aggregate function is a little redundant, but it's good to know in case you face a situation in which you have non-unique IDs, so you'll just roll with it for now. 

Then you can order by the ```ratingAvg``` instead of by ```ratingCount```, and perform the same ```join``` as before.  You'll be able to see which books had the highest ratings!

```sql
create view topBooks2 as 
select bookid, avg(average_rating) as ratingAvg, count(bookid) as ratingCount
from books
group by bookid
order by ratingAvg desc;

select b.title, ratingAvg
from topBooks2 t join bookids b on t.bookid = b.bookid;
```

And here are the results of that query:

![A screen displays the query process results with the status as succeeded. The page has two tabs labeled logs and results. The option results is chosen. It displays the title and the corresponding ratingavg.](Media/hive7.png)

---

## Is Your Query Done?

Sometimes things can take a while to run, and sometimes you may be left wondering if your computer is finished processing your Hive Query.  If you ever wonder whether something ran or not, you can check in several places: 

* In the Query tab, if the button at the bottom is green and says ```Execute```, then something has finished running, whether you have results or you got an error.
* In the notifications section of the Query tab, you will see a message that will state whether an error occurred.
* In the History tab, you can see what was processed and run successfully or errored out.  Don't worry, you can see how many times even your instructor strikes out here! Learning is a trial-and-error process, so that comes with a lot of errors before you hit success! 

    ![A window labeled Ambari Sandbox has five tabs on top. The tabs are labeled dashboard, services, hosts, alerts, and admin. The bar also has a button for categories. The next bar has six tabs labeled hive, query, saved queries, history, UDF's and upload table. The history tab is selected. It has five fields labeled title, status, 10/03/2019, 10/08/2019, and time scale. It also displays two buttons for refresh and clear filters.](Media/hive15.png)

---

## Have you Forgotten to Drop Views?

Views take up a lot of space, and because you're only running one node on a virtual machine, you are probably a little short on space.  If you get anything like the error below, remember to drop all your views and try again.  Chances are, you've just run out of working memory!

![An error message that reads, java.sql.SQLException: Error while processing statement. FAILED: Execution error, return code 2 from org apache hadoop hive.ql.exec.tez.TezTask.](Media/hive16.png)

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Sqoop<a class="anchor" id="DS107L3_page_6"></a>

[Back to Top](#DS107L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Sqoop

*Sqoop* is a program that transfers data between Hadoop and relational databases such as MySQL, allowing you to integrate the processing power of Hadoop with your already created database systems. It kicks off MapReduce, but uses only mappers to talk with HDFS and doesn't utilize reducers.  Once the data is in Hadoop, you can use Hive or Pig on it if you'd like. Sqoop does not have an Ambari interface, so in order to make use of this connector, you'll need to work with the command line.

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Fun Fact!</h3>
    </div>
    <div class="panel-body">
        <p>The name Sqoop came about as SQL and Hadoop squished together.</p>
    </div>
</div>



<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Setting up MySQL<a class="anchor" id="DS107L3_page_7"></a>

[Back to Top](#DS107L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Setting up MySQL

Your goal is to learn how to use Sqoop to "scoop out" data from MySQL and process it on your Hadoop cluster. However, you'll first need to set up MySQL for this.  Amongst other things, you'll create a data table in MySQL.  In real life, you'd already have data to connect to, but since you're just practicing, you don't. 

---

## Log in to MySQL

Luckily, Hortonworks already comes with MySQL, so you don't need to install it! Just log in using the ```mysql``` command, with ```-u root``` to specify that you are logging in as the ```root``` user and ```-p``` to ask for a password prompt: 

```bash
mysql -u root -p
```

Just like other things, the default password for Hortonworks MySQL is ```hadoop```, so type that in when prompted.  Remember that you might not be able to see what you type, but go for it anyway and hit enter.

You will know you are into MySQL when you get the ```mysql>``` prompt in the command line.

---

## Create a Database

Now you can go ahead and create your database, using the simple command ```create database```!

```sql
create database books;
```

Here, ```books``` is the name of the database you are creating. You should receive a message like this one when complete: 

```text
Query OK, 1 row affected (0.00 sec)
```

To double check that it worked, you can also use the command ```show databases;``` which will provide you a list of all databases in MySQL:

```sql
show databases;
```

This should be your result:

```text
+--------------------+
| Database           |
+--------------------+
| information_schema |
| books              |
| hive               |
| mysql              |
| performance_schema |
| ranger             |
+--------------------+
6 rows in set (0.01 sec)
```

---

## Getting Data into MySQL

You can then run the following statements one at a time to create a few entries for the ```books1``` table. First tell MySQL to use the `books` database: 

```sql
USE books;
```

Then tell MySQL to ```BEGIN;```, which starts a transaction:

```sql
BEGIN;
```

Then ensure that you drop the table if you already have one:

```sql
DROP TABLE IF EXISTS books1;
```

Then here comes your ```create table``` command:

```sql
CREATE TABLE books1 (
  id integer NOT NULL,
  authors varchar(255),
  average_rating integer NOT NULL,
  language_code varchar(255),
  num_pages integer NOT NULL,
  ratings_count integer NOT NULL,
  text_reviews_count integer NOT NULL
);
```

And then you will enter some data into that table, making use of ```insert into```.  Note that this is not the entirety of the dataset; just the first five rows and a few of the columns to practice with.  It would be pretty time consuming to manually enter all of this data this way, and you are assuming that in the real world your MySQL database would be connected to data entered by a user of some sort. Also note that because ```average_rating``` was declared as an integer, MySQL rounds the decimal ratings to whole numbers when it stores them.

```sql
INSERT INTO books1 VALUES (1,'J.K. Rowling', 4.56, 'eng', 652, 1944099, 26249),(2,'J.K.Rowling', 4.49, 'eng', 870, 1996446,27613),(3,'J.K. Rowling', 4.47, 'eng', 320,5629932,70390),(4,'J.K. Rowling', 4.41, 'eng', 352, 6267,272),(5,'J.K. Rowling', 4.55, 'eng', 435,2149872,33964);
```

---

## Make Sure it can Use UTF8

Now there are just a few little formatting things to add to the settings for MySQL.  You will set both ```names``` and ```character set``` to the ```utf8``` format.

```sql
set names 'utf8';
```

```sql
set character set utf8;
```

---

## Examine the Data You Brought in

Now that you've inserted some data into your table, take a look-see to ensure that it came in as expected.  Go ahead and tell MySQL to look at the ```books``` database:

```sql
use books;
```

Then ask it to show all the tables:

```sql
show tables;
```

And lastly, ```select``` everything from the table ```books1```:

```sql
select * from books1;
```

It should all be there and ready to roll!

---

## Set Privileges 

The last thing you will need to do in MySQL is to set privileges so that you can run Sqoop on your local host and connect to MySQL:

```sql
grant all privileges on books.* to ''@'localhost';
```

Once that's done, you'll exit out of MySQL.

---

## Exiting MySQL

If you want to exit out of MySQL, it's as simple as saying so: 

```bash
exit
```

MySQL politely tells you "Bye!"

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Using Sqoop<a class="anchor" id="DS107L3_page_8"></a>

[Back to Top](#DS107L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Using Sqoop

Now you have MySQL all set up, it's time to finally utilize Sqoop!

---

## Suck out Data from MySQL to HDFS using Sqoop

There are two options for where to put the data from MySQL: it can go directly into HDFS (Files View), or it can go into Hive.  You'll start by putting data into HDFS.

The line below uses the ```sqoop import``` command to connect to MySQL.  ```--connect``` gives the JDBC address of the ```books``` database, ```--driver``` names the JDBC driver class to use, ```--table``` names the MySQL table to import (the data lands in an HDFS folder of the same name), and ```-m 1``` tells Sqoop to use a single mapper:

```bash
sqoop import --connect jdbc:mysql://localhost/books --driver com.mysql.jdbc.Driver --table books1 -m 1
```
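If you ever want to assemble this command from a script, one option is to build it as a list of arguments in Python so that each flag sits next to a comment explaining it. The sketch below only builds and prints the command; it does not run Sqoop. Handing the list to `subprocess.run` would be one way to execute it on a machine where `sqoop` is installed, which is an assumption about your setup rather than something this lesson requires.

```python
import shlex

sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://localhost/books",  # JDBC address of the books database
    "--driver", "com.mysql.jdbc.Driver",          # JDBC driver class to use
    "--table", "books1",                          # MySQL table to import
    "-m", "1",                                    # number of mappers to use
]

# Join the pieces into a single shell-safe command line (Python 3.8+).
command = shlex.join(sqoop_import)
print(command)
```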

A lot goes on under the hood, including kicking off MapReduce jobs to get the data in. You can confirm that the file has been created by going into your HDFS Files View.  Sqoop will also provide you with a ```_SUCCESS``` file.

![A window labeled Ambari Sandbox has five tabs on top. The tabs are labeled dashboard, services, hosts, alerts, and admin. The bar also has a button for categories. The screen displays the data in name, size, last modified date, owner, group, and permission. It has three buttons labeled select all, new folder, and upload on top.](Media/hive9.png)

Then go to `user`, then `maria_dev`, then `books1`, and click to open the `part-m-00000` file. You should see this:

![A window for file preview. The location is given as /user/maria_dev/books1/part-m-00000. It displays five data. Two buttons labeled cancel and download are placed at the bottom of the page.](Media/hive10.png)

---

## Suck out Data from MySQL to Hive using Sqoop

You can also put data from MySQL directly into Hive. It's simple - all you need to add is ```--hive-import``` to the end, and Sqoop will take care of it!

```bash
sqoop import --connect jdbc:mysql://localhost/books --driver com.mysql.jdbc.Driver --table books1 -m 1 --hive-import
```

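This may take a while, and it may seem stuck on a particular part, but don't panic! It should move on OK. If it instead stops with an error saying that the output directory already exists, delete the ```books1``` folder that the previous import created in HDFS and run the command again.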

Then you can check to see if it's there by going to the ```Hive View```, looking under the ```default``` database, and finding your table. 

![A window labeled Ambari Sandbox has five tabs on top. The tabs are labeled dashboard, services, hosts, alerts, and admin. The bar also has a button for categories. The next bar has six tabs labeled hive, query, saved queries, history, UDF's and upload table. The query tab is selected.](Media/hive11.png)

---

## Put Data Back into MySQL from Hive

You can also go the other way - take data that you've used in Hive and put it into a MySQL database.  

Data used in Hive lives in the ```Files View```, under the ```apps``` directory, then ```hive```, and then the ```warehouse``` directory. You should see that there is a ```books1``` folder there. That is what you'll be transferring back.  

---

### Create a New Table in MySQL

Before you do anything else, however, you need to create a table that will contain your data. Start by signing into MySQL again:

```bash
mysql -u root -p
```

Your password will, of course, be ```hadoop```.  Then you can use the ```books``` database you created:

```sql
use books;
```

And then finally create a table:

```sql
create table exported_books3 (bookid INTEGER, authors VARCHAR(255), average_rating INTEGER, language_code VARCHAR(255), num_pages INTEGER, ratings_count INTEGER, text_reviews_count INTEGER);
```

---

### Export Data Using Sqoop

Now exit MySQL, and then run the export: 

```bash
sqoop export --connect jdbc:mysql://localhost/books -m 1 --driver com.mysql.jdbc.Driver --table exported_books3 --export-dir /apps/hive/warehouse/books1 --input-fields-terminated-by '\0001'
```
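The ```--input-fields-terminated-by '\0001'``` option tells Sqoop that the fields in the Hive warehouse files are separated by the Ctrl-A character (octal 001), which is Hive's default field delimiter. The short Python sketch below is only an illustration of what such a line looks like when it is split apart. The sample row mirrors the books data used in this lesson, and `parse_hive_line` is a hypothetical helper written for this example, not part of Sqoop or Hive.

```python
# Hive's default field delimiter is Ctrl-A (octal 001, hex 01).
HIVE_FIELD_DELIMITER = "\x01"

# Column names match the exported_books3 table created above.
COLUMNS = ["bookid", "authors", "average_rating", "language_code",
           "num_pages", "ratings_count", "text_reviews_count"]


def parse_hive_line(line):
    """Split one Ctrl-A delimited line into a dict keyed by column name.

    Hypothetical helper for illustration only.
    """
    values = line.rstrip("\n").split(HIVE_FIELD_DELIMITER)
    return dict(zip(COLUMNS, values))


# Build a sample line the way it would appear in a warehouse file.
sample_line = HIVE_FIELD_DELIMITER.join(
    ["1", "J.K. Rowling", "5", "eng", "652", "1944099", "26249"]) + "\n"

row = parse_hive_line(sample_line)
print(row["authors"], row["num_pages"])
```

If the delimiter passed to Sqoop does not match the one used in the files, each line is treated as a single field and the export will typically fail or load the wrong values.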

Now, to check to see if it worked, you will want to log back into MySQL, then change to the ```books``` database:

```sql
use books;
```

Then you can look at everything in the ```exported_books3``` table:

```sql
select * from exported_books3;
```

If it worked properly, the table that was empty should now be populated with the data, just like below!

```text
+--------+--------------+----------------+---------------+-----------+---------------+--------------------+
| bookid | authors      | average_rating | language_code | num_pages | ratings_count | text_reviews_count |
+--------+--------------+----------------+---------------+-----------+---------------+--------------------+
|      1 | J.K. Rowling |              5 | eng           |       652 |       1944099 |              26249 |
|      2 | J.K.Rowling  |              4 | eng           |       870 |       1996446 |              27613 |
|      3 | J.K. Rowling |              4 | eng           |       320 |       5629932 |              70390 |
|      4 | J.K. Rowling |              4 | eng           |       352 |          6267 |                272 |
|      5 | J.K. Rowling |              5 | eng           |       435 |       2149872 |              33964 |
+--------+--------------+----------------+---------------+-----------+---------------+--------------------+
5 rows in set (0.00 sec)
```

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Key Terms<a class="anchor" id="DS107L3_page_9"></a>

[Back to Top](#DS107L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>MapReduce</td>
        <td>A programming model used to process big data in parallel using a map procedure to filter and process, and a reduce procedure to perform data aggregation and summarization.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Hive</td>
        <td>A program that allows you to write SQL queries for your Hadoop cluster.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>HiveQL</td>
        <td>Hive's dialect of SQL.  It differs from standard SQL primarily in how views are used.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>User-Defined Functions</td>
        <td>Functions you, the user, create.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Schema on Read</td>
        <td>Storing unstructured data and only giving the data a structure when you go to use it.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Schema on Write</td>
        <td>Storing data in structured tables.  A traditional SQL storage system.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Sqoop</td>
        <td>A program that transfers data between Hadoop (HDFS and Hive) and traditional relational databases like MySQL.</td>
    </tr>
</table>
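
To make the MapReduce entry above concrete, here is a minimal sketch in plain Python, assuming a tiny in-memory list of made-up `(show_id, rating)` records. It is not Hadoop code, and the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical labels for the three stages. It counts how many times each show was rated.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (key, 1) pair for every rating record."""
    for show_id, _rating in records:
        yield (show_id, 1)


def shuffle(pairs):
    """Shuffle: group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: aggregate each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Made-up (show_id, rating) records, for illustration only.
ratings = [(20, 8), (24, 7), (20, 9), (79, 6), (20, 10), (24, 8)]

counts = reduce_phase(shuffle(map_phase(ratings)))
print(counts)
```

On a real cluster, the map and reduce steps run in parallel across many machines, and the framework handles the grouping step for you. A Hive `GROUP BY` query is translated into jobs of this general shape.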

---

## Key SQL Commands

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>create view</td>
        <td>Creates a view.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>drop view</td>
        <td>Removes a view.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>create database</td>
        <td>Sets up an empty database.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>show databases</td>
        <td>Lists all the databases that are available.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>drop table if exists</td>
        <td>Removes a table if it already exists, so you can recreate it without an error.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>use</td>
        <td>Sets the database you're working in.</td>
    </tr>
</table>

---

## Key Command Line Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>mysql -u root -p</td>
        <td>Connection sequence to get into MySQL.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sqoop import</td>
        <td>Imports data from a database connection into Hadoop.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>--hive-import</td>
        <td>A specifier to sqoop import that allows you to import data from a database connection directly into Hive.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sqoop export</td>
        <td>Exports data from Hadoop into a database connection.</td>
    </tr>
</table>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 10 - Lesson 3 Hands-On<a class="anchor" id="DS107L3_page_10"></a>

[Back to Top](#DS107L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In this lesson, you've learned how to use Hive and Sqoop. Now it's time to practice with a Hands-On project! This Hands-On **will** be graded, so make sure you complete each part. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

**[Here](https://repo.exeterlms.com/documents/V2/DataScience/Big-Data/rating.zip)** is a dataset of ratings for anime shows, and a **[corresponding dataset](https://repo.exeterlms.com/documents/V2/DataScience/Big-Data/anime.zip)** that has their titles and other information about them.  Please upload these data files into Hive and then complete the following:

* Determine which anime show has been rated the most times.
* Find out which anime show is the highest rated.
* Do the highest ratings differ when you consider only anime shows that have been rated by more than ten people? 
* Find out which anime show is the highest rated among only those shows in the "Slice of Life" genre. 

Then, practice using Sqoop by creating a 5-row table of your own design in MySQL and importing it directly into HDFS and Hive.  Then, export the data back from Hive into a new MySQL table.  Take screenshots along the way to demonstrate this has been done.

When you have completed the assignment, please submit a document containing your HiveQL queries, screenshots of your results, and a description of the process you followed with Sqoop.

---

## Alternative Assignment if You Can't Run Hadoop and/or Ambari

If your computer refuses to run Hadoop and/or Ambari, **[here](https://repo.exeterlms.com/documents/V2/DataScience/Big-Data/L3exam.zip)** is an alternative exam to test your understanding of the material. Please attach it instead.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>

