This notebook will be collected automatically at **6pm on Monday** from the `/home/data_scientist/assignments/Week12` directory on the course JupyterHub server. If you work on this assignment on the course JupyterHub server, just make sure that you save your work; the instructors will pull your notebooks automatically after the deadline. If you work on this assignment locally, note that the only way to submit is via JupyterHub: you must place the notebook file in the correct directory with the correct file name before the deadline.

1. Make sure everything runs as expected. First, restart the kernel (in the menubar, select `Kernel` → `Restart`) and then run all cells (in the menubar, select `Cell` → `Run All`).
2. Make sure you fill in any place that says `YOUR CODE HERE`. Do not write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed by the autograder.
3. Do not change the file path or the file name of this notebook.
4. Make sure that you save your work (in the menubar, select `File` → `Save and Checkpoint`).

## Problem 12.1. MapReduce.

In this problem, we will use Hadoop Streaming to execute MapReduce code written in Python.

In [None]:
import os
from nose.tools import assert_equal, assert_true

We will use the [airline on-time performance data](http://stat-computing.org/dataexpo/2009/), but before proceeding, recall that the data set is encoded in `latin-1`. However, the Python 3 interpreter expects the standard input and output to be in `utf-8` encoding. Thus, we have to explicitly state that the Python interpreter should use `latin-1` for all IO operations, which we can do by setting the Python environment variable `PYTHONIOENCODING` equal to `latin-1`. We can set the environment variables of the current IPython kernel by modifying the `os.environ` dictionary.

In [None]:
os.environ['PYTHONIOENCODING'] = 'latin-1'
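To see why the encoding matters, the following sketch encodes a made-up string containing a non-ASCII character as `latin-1` and then tries to decode the resulting bytes as `utf-8`. The sample string is hypothetical and is not taken from the data set.

```python
# Hypothetical sample text; "\u00e9" (e with acute accent) is the single
# byte 0xE9 in latin-1.
sample = 'Montr\u00e9al'
raw = sample.encode('latin-1')

# Decoding with the matching codec recovers the original text.
decoded = raw.decode('latin-1')

# Decoding the same bytes as utf-8 fails, because 0xE9 must be followed
# by continuation bytes in utf-8, and the next byte here is a plain 'a'.
try:
    raw.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(raw, utf8_failed)
```

This is the kind of error a script reading `latin-1` data from `STDIN` can hit if the interpreter assumes `utf-8`.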

Let's use the shell to check if the variable is set correctly. If you are not familiar with the following syntax (i.e., Python variable = ! shell command), [this notebook](https://github.com/UI-DataScience/info490-fa15/blob/master/Week4/assignment/unix_ipython.ipynb) from the previous semester might be useful.

In [None]:
python_io_encoding = ! echo $PYTHONIOENCODING
assert_equal(python_io_encoding.s, 'latin-1')
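The shell sees the variable because child processes receive the environment of the process that starts them. The sketch below makes a similar check from plain Python with `subprocess`, assuming a Python 3 interpreter is available at `sys.executable`. It passes a copy of the environment with `PYTHONIOENCODING` set and asks the child which encoding it uses for standard output.

```python
import codecs
import os
import subprocess
import sys

# Copy the current environment and set the variable in the copy.
env = dict(os.environ, PYTHONIOENCODING='latin-1')

# Ask a child interpreter which encoding it uses for standard output.
result = subprocess.run(
    [sys.executable, '-c', 'import sys; print(sys.stdout.encoding)'],
    env=env, stdout=subprocess.PIPE, check=True,
)
child_encoding = result.stdout.decode('ascii').strip()

# Codec names have aliases, so compare their canonical forms.
same_codec = (codecs.lookup(child_encoding).name
              == codecs.lookup('latin-1').name)
print(same_codec)
```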

## Mapper

Write a Python script that
  - Reads data from `STDIN`,
  - Skips the first line (the first line of `2001.csv` is the header that has the column titles),
  - Outputs to `STDOUT` the `Origin` and `DepDelay` columns separated with a tab.

In [None]:
%%writefile mapper.py
#!/usr/bin/env python3

import sys

# YOUR CODE HERE

We need to make the file executable.

In [None]:
! chmod u+x mapper.py

Before testing the mapper code on the entire data set, let's first create a small file and test our code on this small data set.

In [None]:
! head -n 50 $HOME/data/2001.csv > 2001.csv.head
map_out_head = ! ./mapper.py < 2001.csv.head
print('\n'.join(map_out_head))

In [None]:
assert_equal(
    map_out_head,
    ['BWI\t-4', 'BWI\t-5', 'BWI\t11', 'BWI\t-3', 'BWI\t0',
     'BWI\t-3', 'BWI\t-8', 'BWI\t-6', 'BWI\t2', 'BWI\t2',
     'BWI\t2', 'BWI\t-6', 'BWI\t-8', 'BWI\t-3', 'BWI\t-5',
     'PHL\t20', 'PHL\t100', 'PHL\t1', 'PHL\t-2', 'PHL\t-7',
     'PHL\tNA', 'PHL\t4', 'PHL\t3', 'PHL\t-4', 'PHL\t-5',
     'PHL\t-4', 'PHL\t17', 'PHL\t-5', 'PHL\t0', 'PHL\t-2',
     'PHL\t97', 'PHL\t3', 'PHL\t-4', 'PHL\tNA', 'PHL\t17',
     'PHL\tNA', 'PHL\t2', 'PHL\t27', 'PHL\t3', 'PHL\t-6',
     'PHL\t-3', 'PHL\t-3', 'PHL\t-5', 'PHL\t-2', 'PHL\t-3',
     'PHL\t1', 'CLT\t32', 'CLT\t18', 'CLT\t38']
    )

## Reducer

Write a Python script that

  - Reads key-value pairs from `STDIN`,
  - Computes the minimum and maximum departure delays at each airport,
  - Outputs to `STDOUT` the airports and the minimum and maximum departure delays at each airport, separated with tabs.
  
For example,

```shell
$ ./mapper.py < 2001.csv.head | sort -n -k 1 | ./reducer.py
```

should give

```
BWI	-8	11
CLT	18	38
PHL	-7	100
```
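The pipeline above relies on `sort` placing all lines with the same key next to each other before the reducer reads them. The following sketch simulates that grouping step in memory with `itertools.groupby` on a few made-up key-value lines. The airport codes and delays are hypothetical, and `min_max_by_key` is an illustrative helper rather than part of the assignment; it shows one possible approach, not the required solution.

```python
from itertools import groupby

def min_max_by_key(lines):
    """Yield 'key<TAB>min<TAB>max' for sorted 'key<TAB>value' lines."""
    pairs = (line.rstrip('\n').split('\t') for line in lines)
    # groupby only merges adjacent items, hence the need for sorted input.
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        # Skip missing values, which appear as the string 'NA'.
        values = [int(value) for _, value in group if value != 'NA']
        if values:
            yield '{}\t{}\t{}'.format(key, min(values), max(values))

# Hypothetical mapper output, deliberately out of order.
mapped = ['ORD\t5', 'SFO\t-2', 'ORD\tNA', 'SFO\t40', 'ORD\t-3']
reduced = list(min_max_by_key(sorted(mapped)))
print(reduced)
```

Without the `sorted` call, the two `ORD` runs would be reported as separate groups, which is why the shell pipeline sorts between the mapper and the reducer.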

In [None]:
%%writefile reducer.py
#!/usr/bin/env python3

import sys

# YOUR CODE HERE

In [None]:
! chmod u+x reducer.py

In [None]:
red_head_out = ! ./mapper.py < 2001.csv.head | sort -n -k 1 | ./reducer.py
print('\n'.join(red_head_out))

In [None]:
assert_equal(red_head_out, ['BWI\t-8\t11', 'CLT\t18\t38', 'PHL\t-7\t100'])

If the previous tests on the smaller data set were successful, we can run the MapReduce job on the entire data set.

In [None]:
mapred_out = ! ./mapper.py < $HOME/data/2001.csv | sort -n -k 1 | ./reducer.py
print('\n'.join(mapred_out[:10]))

In [None]:
assert_equal(len(mapred_out), 231)
assert_equal(mapred_out[:5], ['ABE\t-30\t666', 'ABI\t-19\t285', 'ABQ\t-30\t576', 'ACT\t-22\t234', 'ACY\t106\t106'])
assert_equal(mapred_out[-5:], ['TYS\t-15\t757', 'VPS\t-14\t389', 'WRG\t-52\t494', 'XNA\t-20\t813', 'YAK\t-28\t396'])

## HDFS: Reset

We will do some cleaning up before we run Hadoop Streaming. Let's first stop the [namenode and datanodes](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html).

In [None]:
! $HADOOP_PREFIX/sbin/stop-dfs.sh
! $HADOOP_PREFIX/sbin/stop-yarn.sh

If there are any temporary files created during the previous Hadoop operation, we want to clean them up.

In [None]:
! rm -rf /tmp/*

We will simply [format the namenode](https://wiki.apache.org/hadoop/GettingStartedWithHadoop#Formatting_the_Namenode) and delete all files in our HDFS. Note that our HDFS is in an ephemeral Docker container, so all data will be lost anyway when the Docker container is shut down.

In [None]:
! echo "Y" | $HADOOP_PREFIX/bin/hdfs namenode -format 2> /dev/null

After formatting the namenode, we restart the namenode and datanodes.

In [None]:
! $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh
! $HADOOP_PREFIX/sbin/start-dfs.sh
! $HADOOP_PREFIX/sbin/start-yarn.sh

Sometimes when the namenode is restarted, it enters Safe Mode, which does not allow any changes to the file system. We do want to make changes, so we manually leave Safe Mode.

In [None]:
! $HADOOP_PREFIX/bin/hdfs dfsadmin -safemode leave

## HDFS: Create directory

- Create a new directory in HDFS at `/user/data_scientist`.

In [None]:
# Create a new directory in HDFS at /user/data_scientist.

# YOUR CODE HERE

In [None]:
ls_user = ! $HADOOP_PREFIX/bin/hdfs dfs -ls /user/
print('\n'.join(ls_user))

In [None]:
assert_true('/user/data_scientist' in ls_user.s)

- Create a new directory in HDFS at `/user/data_scientist/wc/in`.

In [None]:
# Create a new directory in HDFS at `/user/data_scientist/wc/in`

# YOUR CODE HERE

In [None]:
ls_wc = ! $HADOOP_PREFIX/bin/hdfs dfs -ls wc
print('\n'.join(ls_wc))

In [None]:
assert_true('wc/in' in ls_wc.s)

## HDFS: Copy

- Copy `/home/data_scientist/data/2001.csv` from the local file system into our new HDFS directory `wc/in`.

In [None]:
# Copy `/home/data_scientist/data/2001.csv` from the local file system into our new HDFS directory `wc/in`.

# YOUR CODE HERE

In [None]:
ls_wc_in = ! $HADOOP_PREFIX/bin/hdfs dfs -ls wc/in
print('\n'.join(ls_wc_in))

In [None]:
assert_true('wc/in/2001.csv' in ls_wc_in.s)

## Python Hadoop Streaming

- Run `mapper.py` and `reducer.py` via Hadoop Streaming.
- Use `/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar`.
- We need to pass the `PYTHONIOENCODING` environment variable to our Hadoop Streaming task. To find out how to set `PYTHONIOENCODING` to `latin-1` in a Hadoop Streaming task, use the `--help` and `-info` options.

In [None]:
# Run Python code via Hadoop streaming

# YOUR CODE HERE

In [None]:
ls_wc_out = ! $HADOOP_PREFIX/bin/hdfs dfs -ls wc/out
print('\n'.join(ls_wc_out))

In [None]:
assert_true('wc/out/_SUCCESS' in ls_wc_out.s)
assert_true('wc/out/part-00000' in ls_wc_out.s)

In [None]:
stream_out = ! $HADOOP_PREFIX/bin/hdfs dfs -cat wc/out/part-00000
print('\n'.join(stream_out[:10]))

In [None]:
assert_equal(mapred_out, stream_out)

## Cleanup

In [None]:
! $HADOOP_PREFIX/bin/hdfs dfs -rm -r -f -skipTrash wc/out