# Lab 1 Instructions and Exercises (Big Data in Statistics)

## Introduction

These notes contain instructions and questions for the labs portion of the Big Data in Statistics module. Within this document, command-line steps are presented as follows:

In [None]:
hadoop fs -put data /user/mark/repository/data

All commands will be in a separate grey "cell" (as above).
<br><br>
Exercises are listed as italicized bullet points. For example:

<ul><li><i>Create a new directory in your HDFS home directory called sample. Upload data.csv into the sample directory on HDFS.</i></li></ul>

To follow real-world development practices, you will use the version control software Git and internet-based repositories hosted on <a href="http://github.com">github.com</a>. Instructions on how to use these tools are provided during the exercises.

## Objectives

In this lab you will be expected to achieve the following:
<ol>
<li>Connect to the Big Data infrastructure and download data from github; </li>
<li>Move data into and out of HDFS;</li>
<li>Execute a WordCount MapReduce job;</li>
<li>Create and execute a LineCount MapReduce job.</li>
</ol>

## Exercises

### Exercise 1

To connect to the cluster, follow these instructions:
<ol>
<li>Start Xming (type xming into the search bar on the start menu).</li>
<li>Start PuTTY (type putty into the search bar on the start menu).</li>
<li>In the hostname textbox, type bazooka.ma.ic.ac.uk, ensuring that Connection type is set to SSH.</li>
<li>Navigate to Connection -> SSH -> X11, and ensure “Enable X11 forwarding” is selected.</li>
<li>Click Open.</li>
<li>On first connection to bazooka, you may be presented with a dialogue box asking a “host key” related question. Please click “Yes” in response to this question.</li>
<li>You will be asked for your username and password.</li>
<li>Usernames are of the form userN (with N replaced by your allocated number). Log in with your initial password.</li>
<li>On first login, you will be asked to change your password. At the current password prompt, enter the initial password and press Enter. You will then be asked to type your new password, and then to confirm it. You will be notified if your password has been changed successfully.</li>
<li>Close PuTTY and start again from instruction 2. This time, enter your new password; you will then be able to continue with the exercises below.</li>
</ol>

You are now connected to the Hadoop environment, where you will perform all lab-based exercises and your coursework for this module.
<br><br>
Hadoop is available from the command line via the following command:

In [None]:
hadoop

Executing this command will display a help message on the console. Please take a look at the printed help, as you may wish to consult such help during the exercises.
<br><br>
You are now going to download all of the supporting material for this lab using git. If you would like to learn more about git, please consult <a href="https://git-scm.com/book/en/v2/Getting-Started-Git-Basics">Git basics</a>. However, the instructions within this document (and subsequent documents) should be sufficient for your needs.
<br><br>
Type the following command to download the course material:

In [None]:
git clone https://github.com/mbriers/ltcc-2017.git

After a short delay (whilst the material is downloaded), you should be able to change your working directory to ltcc-2017 via the following command:

In [None]:
cd ltcc-2017

In [None]:
ls -la

The second command will list the contents of the folder ltcc-2017. There are four directories.
<br><br>
You have now connected to the Hadoop cluster, have taken a first look at the available Hadoop commands, and have cloned the git repository that we will use throughout the module.

### Exercise 2

You should now change your working directory to the data subfolder:

In [None]:
cd data

In [None]:
ls -la

As listed, there are four files in this directory: <i>sample.txt</i>, <i>textData.txt</i>, <i>heathrowdata.txt</i> and <i>wickairportdata.txt</i>.<br>
You are now going to interact with the Hadoop Distributed Filesystem (HDFS). HDFS help can be obtained by executing the following command:

In [None]:
hadoop fs

You will now put the <i>sample.txt</i> file from your local machine into your home folder on HDFS, renaming the file to be <i>sampledata.txt</i> on HDFS:

In [None]:
hadoop fs -put sample.txt sampledata.txt

In [None]:
hadoop fs -ls

The second command lists all files and folders in your HDFS home directory. You should see the <i>sampledata.txt</i> file listed as existing in HDFS.<br>
The reverse operation, to get data from HDFS onto your local disk, is executed as follows:

In [None]:
hadoop fs -get sampledata.txt .

The period at the end of the command indicates that you want to place the file <i>sampledata.txt</i> into your current working directory without renaming it. <b>NB: Downloading data from HDFS can cause problems; data in HDFS is often very large, so please be careful when using the get command.</b>
<br><br>
You will now make a new directory inside your HDFS home directory using the following command:

In [None]:
hadoop fs -mkdir temperatureData

In [None]:
hadoop fs -ls

The second command lists all files and folders in your HDFS home directory. You should see the <i>temperatureData</i> directory listed as existing in HDFS.
<br><br>
You will now upload <i>heathrowdata.txt</i> and <i>wickairportdata.txt</i> into the <i>temperatureData</i> directory in HDFS:

In [None]:
hadoop fs -put heathrowdata.txt temperatureData/heathrowdata.txt

In [None]:
hadoop fs -put wickairportdata.txt temperatureData/wickairportdata.txt

In [None]:
hadoop fs -ls temperatureData

You should see the new files in HDFS now listed.

<ul>
<li><i>Create a new folder in HDFS called textData. Put the file textData.txt into this new folder on HDFS. Confirm that the upload has been successful by listing the contents of the textData directory.</i></li>
</ul>

### Exercise 3

You will now execute a MapReduce job that performs a word count. First, change your working directory to w1e3 (the folder naming convention is <b>w</b>eek <b>1</b> <b>e</b>xercise <b>3</b>):

In [None]:
cd ../w1e3

In [None]:
ls -la

You will see that there are two files in this folder: a mapper and a reducer. You will need to update the permissions of these files so that they can be executed:

In [None]:
chmod +x *.py

This command modifies the permissions on all Python files (with the <i>.py</i> extension) so that they can be executed via the bash shell.<br>
Examine these files using your favourite Linux text editor (instructions for vim can be found <a href="http://www.engadget.com/2012/07/10/vim-how-to/">here</a>; instructions for nano can be found <a href="https://www.howtogeek.com/howto/42980/the-beginners-guide-to-nano-the-linux-command-line-text-editor/">here</a>):

In [None]:
nano mapper.py

In [None]:
nano reducer.py

Check that you understand the contents of each Python file.
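As a point of comparison while reading the files, a Hadoop Streaming word count is often structured along the following lines. This is a minimal sketch, not the contents of the repository's mapper.py and reducer.py, which may differ. The function names <i>map_words</i> and <i>reduce_counts</i> are hypothetical, and the tab-separated key/value record format is assumed to be the one in use.

```python
from itertools import groupby


def map_words(lines):
    """Map step: emit one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield "{}\t{}".format(word, 1)


def reduce_counts(sorted_records):
    """Reduce step: sum the values of consecutive records that share a key.

    Assumes the records arrive sorted by key, which is what Hadoop
    provides between the map and reduce phases.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield "{}\t{}".format(word, total)
```

In a streaming job, each function would be fed the lines of standard input and its records printed to standard output, one per line.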

You are now going to execute a Hadoop job across the files in the <i>textData</i> directory, using the following command:

In [None]:
hadoop jar $HADOOP_STR/hadoop-0.20.2-dev-streaming.jar \
-libjars $HADOOP_STR/hadoop-0.20.2-dev-streaming.jar \
-input textData/* \
-output w1e3-output \
-mapper "python mapper.py" \
-file /home/USERNAME/ltcc-2017/w1e3/mapper.py \
-reducer "python reducer.py" \
-file /home/USERNAME/ltcc-2017/w1e3/reducer.py

Note that <b>USERNAME</b> will need to be replaced with your compute cluster username, <i>userN</i>.

After execution, results will be output to the w1e3-output folder on HDFS.

<ul>
<li><i>List the contents of the w1e3-output folder.<br>
How many part files are produced?<br>
What does this number indicate about the number of reducer processes that were executed?<br>
What should we have done to improve the output? [HINT: Look at the keyspace and punctuation.]</i></li>
</ul>

One way to view the results is to pipe the output of <i>cat</i> into the Linux <i>less</i> pager, as follows:

In [None]:
hadoop fs -cat w1e3-output/part* | less

Use your space bar to page through the results.

### Exercise 4

Change your working directory to w1e4:

In [None]:
cd ../w1e4

In [None]:
chmod +x *.py

In [None]:
ls -la

<ul>
<li><i>Using the stub mapper.py and reducer.py files in the w1e4 working directory, write and execute a MapReduce program that:<br>
<ol>
<li>computes the length of each line during the map phase, outputting (lineLength, 1) as the (key, value) pair;</li>
<li>counts the number of occurrences of each lineLength during the reduce phase, outputting (lineLength, count) as the (key, value) pair.</li>
</ol>
View the results using the cat and less utilities. What do you notice about the results?</i></li>
</ul>

At this point, you may have introduced an error in your Python code. Hadoop produces an unhelpful error message (Error: NA) when you have a coding problem. A good testing strategy is to run a subset of the data through your Python code on the local machine. To do this, execute the following command:

In [None]:
cat ../data/textData.txt | python mapper.py | sort -k1,1 | python reducer.py
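
The <i>sort -k1,1</i> stage in the pipeline above stands in for Hadoop's shuffle, which delivers records to the reducer grouped by key. The sketch below illustrates why that matters, assuming a reducer that sums values over runs of consecutive identical keys (a common streaming pattern; your reducer.py may be written differently). The function name <i>sum_consecutive</i> and the sample records are hypothetical.

```python
def sum_consecutive(records):
    """Sum values over runs of consecutive identical keys.

    Each record is a 'key<TAB>value' string. An output record is emitted
    whenever the key changes, so correct totals require the input to be
    sorted by key.
    """
    current_key, total = None, 0
    for record in records:
        key, value = record.rstrip("\n").split("\t", 1)
        if current_key is not None and key != current_key:
            yield "{}\t{}".format(current_key, total)
            total = 0
        current_key = key
        total += int(value)
    if current_key is not None:
        yield "{}\t{}".format(current_key, total)


# Hypothetical mapper output, in the order the mapper produced it.
mapper_output = ["b\t1", "a\t1", "b\t1", "a\t1"]

# Without sorting, the same key is reported more than once.
unsorted_result = list(sum_consecutive(mapper_output))

# With sorting (the role played by `sort -k1,1`), each key is reported once.
sorted_result = list(sum_consecutive(sorted(mapper_output)))

print(unsorted_result)
print(sorted_result)
```

Python's <i>sorted</i> compares whole records rather than only the first field, but for the purpose of bringing identical keys together the effect is the same.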