# Lab 2 Instructions and Exercises (Big Data in Statistics)

## Introduction

These notes contain instructions and questions for the lab portion of the Big Data in Statistics module. Within this document, command-line steps are presented as follows:

In [None]:
hadoop fs -put data /user/mark/repository/data

All commands will be in a separate grey "cell" (as above).
<br><br>
Each exercise is listed as an italicized bulleted item. For example:

<ul><li><i>Create a new directory in your HDFS home directory called sample. Upload data.csv into the sample directory on HDFS.</i></li></ul>

To follow real-world development practices, you will use the version control software git and internet-based repositories on <a href="http://github.com">github.com</a>. Instructions on how to use these tools will be provided during the exercises.
<br><br>
<b>For these exercises, please remember to start Xming before using PuTTY to connect to the compute cluster.</b>

## Objectives

In this lab you will be expected to achieve the following:
<ol>
<li>Create and execute a Map Reduce job with a Combiner
<li>Create and execute a Map Reduce job to search for specific data
<li>Create and execute a Map Reduce job to join two datasets
</ol>

## Exercises

### Exercise 1

Make sure that you are in the main folder for the exercises by running the following command:

In [None]:
cd ~/ltcc-2017

You should now change your working directory to the w2e1 subfolder.

In [None]:
cd w2e1

In [None]:
chmod +x *.py

In [None]:
ls -la

The final command lists the contents of this folder. As you will see, the folder contains three Python files: <i>mapper.py</i>, <i>reducer.py</i> and <i>combiner.py</i>.
You will now execute a distributed word count Map Reduce job. It uses a combiner to pre-aggregate the output of each mapper, which reduces the amount of data sent over the network in the shuffle and sort phase of the Map Reduce process. The job is executed via the following command:

In [None]:
hadoop jar $HADOOP_STR/hadoop-0.20.2-dev-streaming.jar \
-libjars $HADOOP_STR/hadoop-0.20.2-dev-streaming.jar \
-input textData/* \
-output w2e1-output \
-mapper "python mapper.py" \
-file /home/USERNAME/bd-sp-2017/w2e1/mapper.py \
-reducer "python reducer.py" \
-file /home/USERNAME/bd-sp-2017/w2e1/reducer.py \
-combiner "python combiner.py" \
-file /home/USERNAME/bd-sp-2017/w2e1/combiner.py

Note that <b>USERNAME</b> must be replaced with your compute cluster username, <i>userN</i>. Each <i>-file</i> path must point to the location of the corresponding script on the local filesystem.
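
As a rough illustration of what a combiner buys you, the sketch below simulates the map, combine and reduce stages locally in plain Python. It is not the content of the supplied scripts: the function names (`map_words`, `sum_counts`, `run_map_task`) and the sample sentences are hypothetical, and it assumes the mapper emits `(word, 1)` pairs while the combiner and reducer both sum the counts per word.

```python
from collections import defaultdict

def map_words(line):
    """Mapper: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def sum_counts(pairs):
    """Combiner and reducer logic: sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

def run_map_task(lines, use_combiner):
    """Run one map task; optionally pre-aggregate its output locally."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return sum_counts(pairs) if use_combiner else pairs

# Two hypothetical input splits, each handled by its own map task.
splits = [["the cat and the hat", "the cat sat"], ["the hat and the mat"]]

for use_combiner in (False, True):
    # Everything the map tasks emit has to cross the network in the shuffle.
    shuffled = [pair for split in splits
                for pair in run_map_task(split, use_combiner)]
    label = "with combiner" if use_combiner else "without combiner"
    print("%s: %d records shuffled, result %s"
          % (label, len(shuffled), sum_counts(shuffled)))
```

With this toy input the number of shuffled records falls from 13 to 9 while the final counts are identical. Reusing the reducer logic as the combiner works here only because summation is associative and commutative.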

### Exercise 2

You should now change your working directory to w2e2:

In [None]:
cd ../w2e2

In [None]:
chmod +x *.py

In [None]:
ls -la

<ul>
<li><i>Using the pre-populated mapper.py and reducer.py scripts, write a combiner (in combiner.py) that reduces the amount of data the reducer has to process. The Map Reduce job should be executed across the files in the textData directory in HDFS.</i>
</ul>

### Exercise 3

You should now change your working directory to w2e3:

In [None]:
cd ../w2e3

In [None]:
chmod +x *.py

In [None]:
ls -la

The temperatureData folder on HDFS contains two text files that are populated with monthly meteorological data from Heathrow airport (London) and Wick airport (Scotland) from 1948 to 2015. The columns within each data file correspond to the following headers:

Year ~ Month ~ MaxTemp ~ MinTemp ~ Rainfall

A typical SQL operation performed within a relational database is to join two (or more) tables using a specified join key (this is equivalent to the merge function in R). In the case of the temperature data, one could join the data on Year-Month, listing the Heathrow airport values and then the Wick airport values on the same line. That is, a line from the joined data (by Year-Month) would be formatted as follows:

Year ~ Month ~ HMaxTemp ~ HMinTemp ~ HRainfall ~ WMaxTemp ~ WMinTemp ~ WRainfall

where the <i>H</i> prefix corresponds to data from Heathrow airport, and the <i>W</i> prefix corresponds to data from Wick airport, for the same Year and Month. 
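
To make the target format concrete, here is a small in-memory version of such a join in plain Python. It is only an illustration of the operation, not a Map Reduce job: the helper name `join_on_year_month` is hypothetical, the rows are made-up placeholder values rather than real observations, and it assumes an inner join (months missing from either dataset are dropped).

```python
def join_on_year_month(heathrow_rows, wick_rows):
    """Inner-join two lists of (year, month, max, min, rain) rows on (year, month)."""
    # Index the Wick values by their join key for constant-time lookup.
    wick_by_key = {(row[0], row[1]): row[2:] for row in wick_rows}
    joined = []
    for row in heathrow_rows:
        key = (row[0], row[1])
        if key in wick_by_key:  # keep only months present in both datasets
            joined.append(row + wick_by_key[key])
    return joined

# Placeholder values for illustration only (not real measurements).
heathrow = [(1948, 1, 8.9, 3.3, 85.0), (1948, 2, 7.9, 2.2, 26.0)]
wick = [(1948, 1, 5.0, 1.0, 90.0), (1948, 3, 9.0, 3.0, 40.0)]

for row in join_on_year_month(heathrow, wick):
    print(row)
```

Only January 1948 appears in both placeholder lists, so a single joined row is printed, with the Heathrow values followed by the Wick values.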

<ul>
<li><i>Write a Map Reduce program that joins the two temperature datasets (temperatureData/heathrowdata.txt and temperatureData/wickairportdata.txt in HDFS) using Year-Month as the join key. The output should be consistent with the header line defined directly above.</i>
</ul>

In this exercise, you will need to establish which data file the mapper is processing, so that you can label the values appropriately for use in the reducer, where the join is performed. To extract the filename within the mapper, use the following Python code:

In [None]:
import os
fname = os.environ["map_input_file"]
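
One possible way to use the filename is sketched below: the mapper prefixes each record's values with a one-letter source tag so that the two datasets can be told apart later. This is a sketch under assumptions, not the required solution. The function name `tag_record`, the tag letters, the whitespace-separated input, the tab-separated key/value output and the sample line are all hypothetical; adapt them to the actual file layout.

```python
import os

def tag_record(line, fname):
    """Return 'Year-Month<TAB>tag,MaxTemp,MinTemp,Rainfall', or None if unusable."""
    fields = line.split()
    if len(fields) < 5 or not fields[0].isdigit():
        return None  # skip header or malformed lines
    # Assumption: the Heathrow file is the one with "heathrow" in its path.
    tag = "H" if "heathrow" in fname.lower() else "W"
    year, month = fields[0], fields[1]
    return "%s-%s\t%s,%s" % (year, month, tag, ",".join(fields[2:5]))

# On the cluster the filename comes from the environment; the fallback
# value is only there so that this sketch also runs locally.
fname = os.environ.get("map_input_file", "temperatureData/heathrowdata.txt")
print(tag_record("1948 1 8.9 3.3 85.0", fname))
```

Because both files emit the same Year-Month key, the shuffle and sort phase delivers the tagged Heathrow and Wick records for a given month to the same reducer call.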

### Exercise 4

You should now change your working directory to w2e4:

In [None]:
cd ../w2e4

In [None]:
chmod +x *.py

In [None]:
ls -la

This folder contains two Python files (<i>mapper.py</i> and <i>reducer.py</i>) that do not contain any code. You are required to write the Python code for the following exercise:

<ul>
<li><i>Write a Map Reduce program that outputs data for <b>September only</b> using the joined results from Exercise 3. Get your results from HDFS and place them in the local filesystem. Using R or matplotlib in Python (start R by typing R at the command-line prompt, or Python by typing python), plot two time series of maximum temperature, one for each location (Heathrow airport and Wick airport), sorted by year. What do you notice about the results?</i>
</ul>
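
Once the results are on the local filesystem, the data need to be filtered and ordered before plotting. The sketch below shows one way this could look in plain Python. It assumes that each joined line holds eight whitespace-separated fields in the header order given in Exercise 3, that the month is stored as a number, and that all values are numeric; the function name `september_max_temps` and the sample lines are hypothetical placeholders.

```python
def september_max_temps(joined_lines):
    """Return (year, HMaxTemp, WMaxTemp) tuples for September, sorted by year."""
    rows = []
    for line in joined_lines:
        fields = line.split()
        if len(fields) != 8:
            continue  # skip lines that do not match the assumed layout
        year, month = int(fields[0]), int(fields[1])
        if month == 9:
            # Fields 2 and 5 are the Heathrow and Wick maximum temperatures.
            rows.append((year, float(fields[2]), float(fields[5])))
    return sorted(rows)

# Placeholder lines for illustration only (not real measurements).
sample = ["1949 9 20.0 11.0 30.0 15.0 8.0 60.0",
          "1948 9 19.0 10.0 40.0 14.0 7.0 70.0",
          "1948 10 15.0 8.0 50.0 11.0 5.0 80.0"]
print(september_max_temps(sample))
```

The returned list can be unpacked into a year axis and two temperature series for plotting.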

### Exercise 5 (Optional)

You should now change your working directory to w2e5:

In [None]:
cd ../w2e5

In [None]:
chmod +x *.py

In [None]:
ls -la

This folder contains two Python files (<i>mapper.py</i> and <i>reducer.py</i>) that do not contain any code. You are required to write the Python code for the following exercise:

<ul>
<li><i>Write a Map Reduce program to compute the parameters of an ordinary least squares model. Apply the program to the "maximum temperature" data, where y = Heathrow airport data and x = Wick airport data.</i>
</ul>
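
A common way to fit a simple linear model y = a + b·x in a distributed setting is to have each map task reduce its records to a handful of sums, which the reducer then adds together before solving for a and b. The sketch below illustrates that idea locally in plain Python. It is one possible approach rather than the required solution: the function names (`partial_sums`, `ols_from_sums`) are hypothetical and the data points are synthetic.

```python
def partial_sums(pairs):
    """Map/combine step: reduce a chunk of (x, y) pairs to five sums."""
    n = sx = sy = sxy = sxx = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxy += x * y
        sxx += x * x
    return (n, sx, sy, sxy, sxx)

def ols_from_sums(all_sums):
    """Reduce step: add the partial sums and solve for intercept and slope."""
    n, sx, sy, sxy, sxx = [sum(column) for column in zip(*all_sums)]
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# Synthetic points lying exactly on y = 2 + 3x, split into two chunks
# to mimic two map tasks.
chunks = [[(0.0, 2.0), (1.0, 5.0)], [(2.0, 8.0), (3.0, 11.0)]]
intercept, slope = ols_from_sums([partial_sums(chunk) for chunk in chunks])
print("intercept = %.3f, slope = %.3f" % (intercept, slope))
```

Since the five sums are additive, the same summing logic could also serve as a combiner, so that only one small record per map task is shuffled.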