# Instructions and Exercises

## Introduction

These notes contain instructions and questions for the labs portion of the "Big Data: tools and statistics" course. Within this document, command-line steps are presented as follows:

In [None]:
hdfs dfs -text /example/wordcountout/part-00000

All commands will be in a separate grey "cell" (as above).
<br><br>
Exercises will be listed as a bulleted item and italicized. For example:

<ul><li><i>Create a new directory in your home directory called sample. Upload data.csv into the sample directory on HDFS.</i></li></ul>

In this lab you will be expected to achieve the following:
<ol>
<li>Connect to the Big Data infrastructure and download data from GitHub;
<li>Execute a WordCount MapReduce job;
<li>Create and execute a LineCount MapReduce job;
<li>Create and execute a Map Reduce job with a Combiner;
<li>Create and execute a Map Reduce job to search for specific data;
<li>Create and execute a Map Reduce job to join two datasets.
</ol>

## Connecting to the cluster

To connect to the Hadoop cluster from a Windows PC, follow these instructions:
<ol>
<li>Start PuTTY (type putty into the search bar on the Start menu).</li>
<li>In the hostname textbox, type the HOSTNAME provided on the printout, ensuring that Connection type is set to SSH.</li>
<li>Click Open.</li>
<li>On first connection, you will be asked a question about connection security. Please click "yes".</li>
<li>You will be asked for your username and password.</li>
<li>Your username is trainingN (with N replaced by your allocated number); your password is provided on the printout.</li>
</ol>

To connect to the Hadoop cluster from a Mac, follow these instructions:
<ol>
<li>Start Terminal (type terminal into the Spotlight search bar).</li>
<li>Type the following command, followed by enter:
<br>ssh USERNAME@HOSTNAME </li>
<li>On first connection, you will be asked a question about connection security. Please type "yes".</li>
<li>You will be asked for your username and password.</li>
<li>Your username is trainingN (with N replaced by your allocated number); your password is provided on the printout.</li>
</ol>

## Exercises

### Exercise 1

Hadoop is available from the command line via the following command:

In [None]:
hadoop

Executing this command will display a help message on the console. Please take a look at the printed help, as you may wish to consult such help during the exercises.

You are now going to download all of the supporting material for this lab. This will be done using git. If you would like to learn more about git, please consult [Git basics](https://git-scm.com/book/en/v2/Getting-Started-Git-Basics). However, the instructions within this document (and subsequent documents) should be sufficient for your needs.

Type the following command to download the course material:

In [None]:
git clone https://github.com/markbriers/bd-sp-2017.git

After a short delay (whilst the material is downloaded), you should be able to change your working directory to bd-sp-2017 via the following command:

In [None]:
cd bd-sp-2017

In [None]:
ll

The second command will list the contents of the folder. There are four directories.

You have now connected to the Hadoop cluster, have taken a first look at the available Hadoop commands, and have cloned the git repository that we will use throughout the module.

### Exercise 2

You are now going to interact with the Hadoop Distributed Filesystem (HDFS). HDFS help can be obtained by executing the following command:

In [None]:
hdfs dfs

We will now list the contents of the data directory to be used on this course, using the following command:

In [None]:
hdfs dfs -ls /rss-data

After executing this command, you should see five data files listed in HDFS.

You will now create a new folder in HDFS (via -mkdir), and move (-mv) the heathrowdata.txt file into this new folder.

In [None]:
hdfs dfs -mkdir /rss-data/temperatureData

In [None]:
hdfs dfs -mv /rss-data/heathrowdata.txt /rss-data/temperatureData/heathrowdata.txt

Complete the following activities:

<ul><li><i>Move the wickairportdata.txt file into the /rss-data/temperatureData folder.</i></li></ul>

<ul><li><i>List the contents of the /rss-data/temperatureData/ folder.</i></li></ul>

### Exercise 3

You will now execute a Map Reduce job that performs a word count. First, change your working directory to exercise3:

In [None]:
cd exercise3

In [None]:
ll

You will see that there are two files in this folder: a mapper and a reducer. You will need to update these files so that they have the correct line endings and execute permissions:

In [None]:
perl -pi -e 's/\r\n/\n/g' mapper.py
perl -pi -e 's/\r\n/\n/g' reducer.py
chmod +x *.py

The first two commands modify the line endings, to ensure that they are compatible with Hadoop streaming code. The third command modifies the permissions on all python files (with .py extension) so that they can be executed via the bash shell. <b>Note that these commands will need to be executed for all exercises, but will not be listed again.</b>

Examine the mapper.py and reducer.py files using your favourite Linux text editor (instructions for vim can be found [here](https://www.engadget.com/2012/07/10/vim-how-to/); instructions for nano can be found [here](https://www.howtogeek.com/howto/42980/the-beginners-guide-to-nano-the-linux-command-line-text-editor/)):

In [None]:
nano mapper.py

In [None]:
nano reducer.py

<ul><li><i>Check that you understand the contents of each Python file.</i></li></ul>
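
For orientation, the sketch below shows the general shape of a streaming word count. It is an illustration only, not the contents of the repository's mapper.py and reducer.py, which may differ. The function names `map_words` and `reduce_counts` are hypothetical, and `sorted()` stands in for the shuffle and sort phase so that the logic can be tried locally without the cluster.

```python
from itertools import groupby


def map_words(lines):
    """Mapper: emit one 'word<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield "{}\t{}".format(word, 1)


def reduce_counts(sorted_records):
    """Reducer: sum the counts of consecutive records that share a key.

    Records reach a streaming reducer sorted by key, so all records for
    one word arrive next to each other.
    """
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield "{}\t{}".format(word, total)


# Local stand-in for the cluster: sorted() plays the role of shuffle and sort.
sample = ["the cat sat", "the cat ran"]
for record in reduce_counts(sorted(map_words(sample))):
    print(record)
```

In a real streaming job the mapper and reducer are separate scripts that read from standard input and write to standard output; here they are plain functions so that the data flow is visible in one place.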

You are now going to execute a Hadoop job across the sample.txt data file, using the following command:

In [None]:
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-files mapper.py,reducer.py \
-mapper mapper.py \
-reducer reducer.py \
-input /rss-data/sample.txt \
-output /rss-results/wordcountout

After execution, results will be output to the /rss-results/wordcountout/ folder in HDFS.

<ul><li><i>List the contents of the wordcountout folder.<br>
How many part files are produced?<br>
What does this number indicate about the number of reducer processes that were executed?<br>
What should we have done to improve the output? [HINT: Look at the keyspace and punctuation.]</i></li></ul>

One way to view the results is to pipe the contents of a part file into the Linux less paging utility, as follows:

In [None]:
hdfs dfs -text /rss-results/wordcountout/part-00000 | less

Use your space bar to page through the results, and press q to quit.

### Exercise 4

Change your working directory to exercise4.

In [None]:
cd ../exercise4

<ul><li><i>Using the stub mapper.py and reducer.py files in the exercise4 working directory, write and execute a Map Reduce program that:
<ol><li>computes the length of each line during the map phase, outputting (lineLength, 1) as the (key, value) pair;</li>
<li>counts the number of occurrences of each lineLength during the reduce phase, outputting (lineLength, count) as the (key, value) pair.</li></ol>
View the results using the cat and less utilities.<br>What do you notice about the results?</i></li></ul>

### Exercise 5

<ul><li><i>Change (cd) to the exercise5 directory and list (ll) the files in this directory.</i></li></ul>

As you will see, the folder contains three Python files: mapper.py, reducer.py and combiner.py. You will now execute a distributed word count Map Reduce job that uses a combiner. The combiner aggregates each mapper's output before the shuffle and sort phase, which reduces the amount of data sent across the network. The job is executed via the following command:

In [None]:
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-files mapper.py,reducer.py,combiner.py \
-mapper mapper.py \
-reducer reducer.py \
-combiner combiner.py \
-input /rss-data/sample.txt \
-output /rss-results/wordcountout-combiner
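
To see why a combiner helps, the following sketch counts how many records would leave the map side with and without local pre-aggregation. It is a toy model run in a single Python process, not the repository's combiner.py; the input splits and the function names `map_task` and `combine` are made up for illustration.

```python
from collections import Counter


def map_task(lines):
    """Emit a (word, 1) pair for every token seen by one map task."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Pre-aggregate one map task's output before it leaves the node."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


# Two hypothetical input splits, each handled by its own map task.
splits = [["to be or not to be"], ["to be is to do"]]
map_outputs = [map_task(split) for split in splits]
combined = [combine(output) for output in map_outputs]

# Number of records that would be shuffled in each case.
records_without = sum(len(output) for output in map_outputs)
records_with = sum(len(output) for output in combined)
print(records_without, records_with)
```

Note that a combiner must emit records in the same (key, value) format as the mapper, because Hadoop may apply it zero, one or several times to a given map output.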

### Exercise 6

<ul>
<li><i>Change (cd) to the exercise6 directory.</i></li>
<li><i>Using the pre-populated mapper.py and reducer.py files, write a combiner (in combiner.py) that improves the efficiency of the reduce phase of the Map Reduce program. Execute this Map Reduce job across the files in the textData directory in HDFS.</i></li></ul>

### Exercise 7

<ul><li><i>Change (cd) to the exercise7 directory.</i></li></ul>

The temperatureData folder on HDFS contains two text files that are populated with monthly meteorological data from Heathrow airport (London) and Wick airport (Scotland) from 1948 to 2015. The columns within each data file correspond to the following headers:<br><br>
<i>Year ~ Month ~ MaxTemp ~ MinTemp ~ Rainfall</i><br><br>
A typical SQL operation performed within a relational database is to join two (or more) tables, using a specified join key (this is the same as the merge function in R). In the case of the temperature data, one could join the data on Year-Month, listing the Heathrow airport values and then the Wick airport values on the same line. That is, a line from the joined data (by Year-Month) would be formatted as follows:<br><br>
<i>Year ~ Month ~ HMaxTemp ~ HMinTemp ~ HRainfall ~ WMaxTemp ~ WMinTemp ~ WRainfall</i><br><br>
where the H prefix corresponds to data from Heathrow airport, and the W prefix corresponds to data from Wick airport, for the same Year and Month.
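
The join itself can be illustrated in plain Python on a few in-memory rows. This sketch shows only what the joined output should look like, not how to implement it as a Map Reduce job. The values are made up and are not taken from the course data files, and treating the join as an inner join (keeping only the Year-Month keys present in both datasets) is an assumption.

```python
# Hypothetical rows keyed by (year, month), with values
# (MaxTemp, MinTemp, Rainfall). The numbers are made up for illustration.
heathrow = {
    (1948, 1): (8.9, 3.3, 85.0),
    (1948, 2): (7.9, 2.2, 26.0),
}
wick = {
    (1948, 1): (6.1, 1.2, 101.0),
    (1948, 2): (6.4, 1.0, 54.0),
    (1948, 3): (9.0, 2.5, 40.0),
}


def inner_join(left, right):
    """Join two keyed tables, keeping only keys present in both."""
    joined = []
    for key in sorted(set(left) & set(right)):
        # Row layout: Year, Month, left values, then right values.
        joined.append(key + left[key] + right[key])
    return joined


for row in inner_join(heathrow, wick):
    print(row)
```

The row for (1948, 3) is dropped because it has no Heathrow counterpart in this made-up sample.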

<ul><li><i>Write a Map Reduce program that joins the two temperature datasets (/rss-data/temperatureData/heathrowdata.txt and /rss-data/temperatureData/wickairportdata.txt in HDFS) using Year-Month as the join key. The output should be consistent with the joined line defined directly above.</i></li></ul>

### Exercise 8

<ul>
<li><i>Change (cd) to the exercise7 directory.</i></li>
<li><i>Write a Map Reduce program that outputs data for September only, using the joined results from Exercise 7. Get your results from HDFS and place them in the local filesystem. Using R or matplotlib in Python (start R by typing R, or Python by typing python, at the command-line prompt), plot two time series of maximum temperature data, sorted by year, one for each location (Heathrow airport and Wick airport). What do you notice about the results?</i></li></ul>

### Exercise 9 (Optional)

<ul>
<li><i>Change (cd) to the exercise8 directory.</i></li>
<li><i>Write a Map Reduce program to compute the parameters of an ordinary least squares model. Apply the program to the "maximum temperature" data, where y = Heathrow airport data and x = Wick airport data.</i></li></ul>
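
One possible starting point, offered as a hint rather than a prescribed solution: for a simple linear model y = intercept + slope * x (the single-predictor form is an assumption here), the least squares estimates depend on the data only through five sums, and sums can be computed per partition and then added together. The sketch below checks that idea locally on made-up points; the function names are hypothetical.

```python
def partial_sums(pairs):
    """Summarise one partition of (x, y) pairs as (n, Sx, Sy, Sxx, Sxy)."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def merge(a, b):
    """Combine two partial summaries by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))


def ols_from_sums(sums):
    """Return (intercept, slope) of y = intercept + slope * x."""
    n, sx, sy, sxx, sxy = sums
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Two hypothetical partitions of points lying exactly on y = 1 + 2x.
part_a = [(0.0, 1.0), (1.0, 3.0)]
part_b = [(2.0, 5.0), (3.0, 7.0)]
total = merge(partial_sums(part_a), partial_sums(part_b))
intercept, slope = ols_from_sums(total)
print(intercept, slope)
```

Because the partial summaries are merged by addition, the same merge step could in principle serve as both a combiner and a reducer.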