# Tutorial 2: Hadoop and HDFS

In this tutorial, you will:

* Import a spam dataset into HDFS
* Run a MapReduce job using the Hadoop "Streaming" API (Python)
* Use Mahout's naive Bayes implementation to build a simple spam filter

Remember:

* If you are not on the wired network, you will need to connect to the VPN.
* You do not have to use Jupyter.  If you prefer, you can do everything in the PuTTY terminal.  However, if you do use Jupyter, you should bind the service to `0.0.0.0` on port `8888`, and enter the token as the password.

## Start a new terminal
Besides notebooks, Jupyter can also run a terminal.  From the `Home` page, start a new terminal by clicking on `New` -> `Terminal`.  Do this now, so that you can run interactive bash commands.

## Check that Hadoop is running

The first thing to do is to check that Hadoop is installed.  In the terminal, run `hadoop version`, which should report that you have Hadoop 2.7.4.
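
If you would rather do this check from a notebook cell, the following is a minimal sketch. It assumes the first line of the `hadoop version` output has the form `Hadoop <version>`; the helper name `parse_hadoop_version` is made up for this example.

```python
import re
import shutil
import subprocess


def parse_hadoop_version(output):
    """Return the version number from `hadoop version` output, or None."""
    match = re.search(r"^Hadoop\s+(\d+(?:\.\d+)*)", output, flags=re.MULTILINE)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Only call the CLI if it is actually on the PATH
    if shutil.which("hadoop") is None:
        print("hadoop not found on PATH")
    else:
        result = subprocess.run(
            ["hadoop", "version"], capture_output=True, text=True
        )
        print("Detected Hadoop version:", parse_hadoop_version(result.stdout))
```

If the function returns `None`, the command ran but its output did not look as expected, which is worth investigating before going further.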

In [None]:
%%bash

# Download and unpack the YouTube spam dataset, then list the CSV files
wget https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip
unzip YouTube-Spam-Collection-v1.zip
ls -lh *.csv

## MapReduce

Having downloaded the data, we want to be able to run a machine learning algorithm over it.  To do this, we will use the Hadoop Streaming API, which lets us write the map and reduce steps in Python.  When we call Hadoop, we pass two Python files to the command: one which maps, and one which reduces.
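
To make the two roles concrete, here is a small in-memory sketch of the map, shuffle and reduce steps for word counting. It does not use Hadoop at all; the function names are invented for illustration, and the shuffle step (which Hadoop performs for you between the two scripts) is imitated with a dictionary.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield (word, 1)


def shuffle(pairs):
    """Group all values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def run_local(records):
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    print(run_local(["talk to me", "talk to them"]))
```

The mapper never needs to see more than one record at a time, and the reducer never needs to see more than one key at a time, which is what lets Hadoop spread the work across machines.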

Let's look at the data:

In [None]:
%%bash

head -n 10 Youtube04-Eminem.csv

## Word counting

First, we will set up a MapReduce job that counts how many times each individual word appears in the `comment` field (the fourth column of the CSV).

In [None]:
# MAPPER

import csv
import sys

# For now, read the CSV directly so that the cell can be tested in the notebook
input_text = open('Youtube04-Eminem.csv', 'r')
# When we move to the actual MapReduce job, we will need to read from STDIN instead
# input_text = sys.stdin

reader = csv.reader(input_text)
# Skip the column header
next(reader)
for row in reader:
    # The comment text is the fourth column
    tokens = row[3].split(' ')
    for t in tokens:
        # Print tab-delimited key/value pairs,
        # which will be the input for the reducer
        print('%s\t%d' % (t, 1))
    # Only do it for the first record for now
    break

input_text.close()
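
One way to make the mapper easier to test is to put the logic in a function that accepts any file-like object. The sketch below is an assumption about how you might structure it, not the tutorial's own `mapper.py`: the name `map_comments` and the sample row are invented, and unlike the cell above it drops empty tokens and skips rows that are too short.

```python
import csv
import io


def map_comments(stream, text_column=3):
    """Yield (token, 1) for each space-separated token in the text column."""
    reader = csv.reader(stream)
    next(reader, None)  # skip the header row, if there is one
    for row in reader:
        if len(row) <= text_column:
            continue  # skip short or malformed rows
        for token in row[text_column].split(' '):
            if token:  # ignore empty strings produced by repeated spaces
                yield (token, 1)


if __name__ == "__main__":
    # Demo on a made-up in-memory sample; a real mapper script would
    # pass sys.stdin to map_comments instead.
    sample = io.StringIO(
        'COL_A,COL_B,COL_C,COL_D\n'
        'x,y,z,lovely girl talk to me\n'
    )
    for token, count in map_comments(sample):
        print('%s\t%d' % (token, count))
```

Because the function only depends on its `stream` argument, the same code can be driven by a file, by `sys.stdin`, or by a test string.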

In [None]:
# REDUCER

import sys
# Keep simple example in for now, switch to stdin later
# input_text = ['+447935454150', 'lovely', 'girl', 'talk', 'to', 'me', 'xxx\ufeff']

input_text = [
    '+447935454150\t1',
    'lovely\t1',
    'girl\t1',
    'talk\t1',
    'to\t1',
    'me\t1',
    'xxx\ufeff\t1'
]

# input_text = sys.stdin
words = {}

for line in input_text:
    # Each line has the form "<word>\t<count>"
    word, count = line.split('\t', 1)
    print('word: %s count: %s' % (word, count))

    # Convert count to an integer
    try:
        count = int(count)
    except ValueError:
        # We can safely ignore, so keep calm and carry on
        continue

    # Add the emitted count rather than a fixed 1, so that the totals
    # stay correct even if the counts are not all 1
    if word in words:
        words[word] += count
    else:
        words[word] = count

for w in words:
    print('%s\t%s' % (w, words[w]))
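
The dictionary above holds every distinct word in memory. Hadoop Streaming sorts the mapper output by key before it reaches the reducer, so an alternative is to sum each run of identical keys as it goes past. The following is a sketch of that approach under the assumption that the input really is sorted; `parse_pairs` and `reduce_sorted` are names invented for this example.

```python
import io
from itertools import groupby


def parse_pairs(stream):
    """Yield (word, count) from tab-delimited lines, skipping malformed ones."""
    for line in stream:
        word, sep, count = line.rstrip('\n').partition('\t')
        if not sep:
            continue  # no tab on this line
        try:
            yield (word, int(count))
        except ValueError:
            continue  # count was not an integer


def reduce_sorted(stream):
    """Sum the counts for each word, assuming the input is sorted by word."""
    pairs = parse_pairs(stream)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield (word, sum(count for _, count in group))


if __name__ == "__main__":
    # Made-up, already-sorted sample standing in for sys.stdin
    sample = io.StringIO('me\t1\ntalk\t1\ntalk\t1\nto\t1\n')
    for word, total in reduce_sorted(sample):
        print('%s\t%d' % (word, total))
```

When testing locally without Hadoop, the same ordering can be obtained by putting `sort` between the mapper and the reducer in the pipeline.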
    


Does it work in principle?  We can test the mapper and reducer without Hadoop.


In [None]:
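%%bash
# Local test: pipe the CSV through the mapper and reducer
# (uncomment once both scripts have been saved and made executable)
# cat Youtube04-Eminem.csv | ./mapper.py | ./reducer.py

# List the running JVM processes and their PIDs
jps
# To stop a stale process, pass its PID from the jps output to kill.
# The PIDs below come from one particular session and will differ on your machine.
# kill 11620
# kill 11783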

## Setting up HDFS

There are a few more things you need to make Hadoop work nicely.  We are going to set up pseudo-distributed mode, which requires passwordless SSH.  To do this, we need to run the following commands:

    ssh-keygen -t rsa -P ''
    ssh localhost
    
For the first command, leave all the options as default (press Enter at each prompt).  The second checks that you are able to SSH into your own machine, localhost.  Normally you would SSH into a different computer, but here we connect to localhost.  You will see output like the following text.  Type `yes`, because you do still want to connect.


    The authenticity of host 'localhost (127.0.0.1)' can't be established.
    ECDSA key fingerprint is 18:6e:42:bd:0c:8c:35:bc:d9:e8:3c:c6:a3:08:56:43.
    Are you sure you want to continue connecting (yes/no)? yes
    
Now, we need to set up some configuration for the various parts of Hadoop.  First, download the following file and make it executable.  It's good practice to check what unknown files from the Internet do before running them, so have a read of the code.

In [None]:
%%bash
# Remove any earlier copies; -f avoids an error if there are none
rm -f hadoop-config*
wget https://raw.githubusercontent.com/huwf/data-science-vm/master/hadoop-config.sh
cat hadoop-config.sh
chmod 755 hadoop-config.sh

In a terminal, run the following command to execute the file.  **YOU WILL NEED TO RUN THIS AS SUDO:**

    sudo ./hadoop-config.sh

We should now have HDFS configured for pseudo-distributed mode.  Next, we need to format the NameNode, which initialises the HDFS filesystem using the configuration we just set:

In [None]:
%%bash 
hdfs namenode -format
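
The contents of `hadoop-config.sh` are not reproduced here, so the following is only a sketch of what a pseudo-distributed configuration commonly contains. It assumes the values used in the Apache Hadoop single-node setup guide (`fs.defaultFS` set to `hdfs://localhost:9000` and `dfs.replication` set to 1); the script you downloaded may use different values. The helper name `hadoop_site_xml` is invented for this example.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Build the XML text of a Hadoop *-site.xml file from a dict."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumed values, taken from the standard single-node setup guide
    print(hadoop_site_xml({"fs.defaultFS": "hdfs://localhost:9000"}))  # core-site.xml
    print(hadoop_site_xml({"dfs.replication": 1}))  # hdfs-site.xml
```

Comparing this output with the files the script actually wrote is a quick way to see which address the NameNode is expected to listen on.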

## Starting services

Now we need to start the different services, and then we can get to work!  Run the commands in the following cell to start DFS and YARN:

In [None]:
%%bash

start-dfs.sh
echo "started dfs"
start-yarn.sh
echo "started yarn"
# See what Hadoop (JVM) processes we have running on the VM
jps
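
In pseudo-distributed mode, `start-dfs.sh` normally starts the NameNode, DataNode and SecondaryNameNode, and `start-yarn.sh` starts the ResourceManager and NodeManager. As a rough sketch, the `jps` output can be checked for these names programmatically; the helper `missing_daemons` and the sample PIDs below are made up for illustration.

```python
EXPECTED_DAEMONS = {
    "NameNode",
    "DataNode",
    "SecondaryNameNode",
    "ResourceManager",
    "NodeManager",
}


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the sorted names of expected daemons absent from jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Each useful jps line looks like "<pid> <process name>"
        if len(parts) >= 2:
            running.add(parts[1])
    return sorted(set(expected) - running)


if __name__ == "__main__":
    # Made-up sample in which only the DFS daemons are running
    sample = "101 NameNode\n102 DataNode\n103 SecondaryNameNode\n104 Jps\n"
    print("Missing:", missing_daemons(sample))
```

An empty list means every expected daemon appeared in the output; anything else usually points to a problem reported in the Hadoop log files.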

In [None]:
%%bash

# start-dfs.sh
# start-yarn.sh

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-files ./mapper.py,./reducer.py \
-input Youtube04-Eminem.csv \
-mapper ./mapper.py \
-reducer ./reducer.py \
-output output \