# Tutorial: Hadoop and Hadoop Distributed File System (HDFS)

In this tutorial, you will:

* Create a MapReduce task using the Hadoop "Streaming" API (Python)
* Import a spam dataset into the Hadoop Distributed File System (HDFS)
* Run a MapReduce task using HDFS

## Setup
* This tutorial expects you to be using the COMP6235 Virtual Machine for VirtualBox. No support is provided for other solutions. Setup instructions are available at http://edshare.soton.ac.uk/id/document/324163
* Run "run-jupyter" to start Jupyter Notebook
* Download the .ipynb file at http://edshare.soton.ac.uk/19650/ and import it into Jupyter

# Refresher: What is Hadoop?

Apache Hadoop is an open-source software framework for distributing the processing of large amounts of data across multiple machines. It has an emphasis on fault-tolerant processing of data on large clusters. Hadoop has three important components:

**Hadoop Distributed File System** - A distributed file-system that stores data and facilitates the sharing of data between different machines in a Hadoop cluster (group of machines).

**Hadoop YARN** - A platform for managing the computing resources available to Hadoop, notably performing the task of scheduling jobs to run on other machines.

**Hadoop MapReduce** - Support for the MapReduce programming model for large-scale data processing.

All of these are already set up and (mostly) configured in the Virtual Machine, though this tutorial will walk you through starting and using these tools.

## Firstly: Start a new terminal
In addition to running Notebooks, Jupyter is also capable of running a terminal, an interactive text-based interface to the virtual machine. On the main menu on the `Home` page, you can start a new terminal by clicking on `New` -> `Terminal`.

We'll be using this to run some of the commands necessary to configure Hadoop. 

## Hadoop Modes of Operation

Hadoop has three main modes of operation:

**Standalone Mode** - This is the default mode used by Hadoop. It's localised to the current machine, and doesn't use HDFS, instead reading files from the local filesystem. It's primarily used for debugging.

**Pseudo-Distributed Mode** - This is where Hadoop uses a cluster consisting of only a single machine, with every Hadoop daemon (a type of program that sits there doing work in the background) running on that machine. This is mainly used for testing the Hadoop setup.

**Fully-Distributed Mode** - This is where data and processing are split between multiple machines. This enables Hadoop to scale horizontally and leverage the resources of multiple machines. This is the main mode used by Hadoop in production.

In this tutorial, we'll only be using Standalone and Pseudo-Distributed modes.

## MapReduce

MapReduce is a programming model used by Hadoop to process large amounts of data in parallel. It accepts input data in the form of a set of key-value pairs <key1, value1>. It divides this set into individual chunks and assigns them as tasks to be processed on individual machines. It works in two phases: A Map phase and a Reduce phase.

The Map phase takes these key-value pairs in the form <key1, value1> and maps (processes them) into other, intermediate key-value pairs <key2, value2>.

These pairs are then sorted by their key, and passed into the Reduce phase.

The Reduce phase takes these sorted intermediate pairs and produces a third (smaller) set of key-value pairs <key3, value3>, combining the values of the intermediate pairs that share a common key.

In summary:

**<key1, value1>** is *mapped* to **<key2, value2>** which is *reduced* to a smaller set of **<key3, value3>**.

Don't worry if it's all a bit abstract - there'll be examples in the rest of the tutorial.
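To make the three stages concrete, the sketch below runs a word count entirely in memory, without Hadoop. The function names (`map_phase`, `reduce_phase`, `word_count`) and the use of line numbers as the first key are illustrative assumptions, not part of any Hadoop API.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(records):
    """Map each <line_number, line> pair to intermediate <word, 1> pairs."""
    for _line_number, line in records:
        for word in line.split():
            yield (word, 1)


def reduce_phase(sorted_pairs):
    """Combine intermediate pairs that share a key into <word, total> pairs."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _word, count in group))


def word_count(lines):
    records = enumerate(lines)                  # <key1, value1>
    intermediate = sorted(map_phase(records))   # <key2, value2>, sorted by key
    return list(reduce_phase(intermediate))     # <key3, value3>


print(word_count(["to be or not", "to be"]))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

The sort between the two phases is what lets the reduce step see all pairs for one key next to each other.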

## Importing data

The first thing we're going to do is download some data. We will store it on our VMs, but it could equally represent data held remotely, for example in a datacentre. Run the following code:

In [1]:
%%bash

wget https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip \
-O YouTube-Spam-Collection-v1.zip

unzip -o YouTube-Spam-Collection-v1.zip

ls -lh *.csv

Archive:  YouTube-Spam-Collection-v1.zip
  inflating: Youtube01-Psy.csv       
  inflating: __MACOSX/._Youtube01-Psy.csv  
  inflating: Youtube02-KatyPerry.csv  
  inflating: __MACOSX/._Youtube02-KatyPerry.csv  
  inflating: Youtube03-LMFAO.csv     
  inflating: __MACOSX/._Youtube03-LMFAO.csv  
  inflating: Youtube04-Eminem.csv    
  inflating: __MACOSX/._Youtube04-Eminem.csv  
  inflating: Youtube05-Shakira.csv   
  inflating: __MACOSX/._Youtube05-Shakira.csv  
-rw-r--r-- 1 comp6235 comp6235 57K Mar 26  2017 Youtube01-Psy.csv
-rw-r--r-- 1 comp6235 comp6235 63K Mar 26  2017 Youtube02-KatyPerry.csv
-rw-r--r-- 1 comp6235 comp6235 63K Mar 26  2017 Youtube03-LMFAO.csv
-rw-r--r-- 1 comp6235 comp6235 81K Mar 26  2017 Youtube04-Eminem.csv
-rw-r--r-- 1 comp6235 comp6235 72K Mar 26  2017 Youtube05-Shakira.csv


--2018-12-04 21:42:17--  https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.249
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 163567 (160K) [application/zip]
Saving to: ‘YouTube-Spam-Collection-v1.zip’

     0K .......... .......... .......... .......... .......... 31% 62.8K 2s
    50K .......... .......... .......... .......... .......... 62%  135K 1s
   100K .......... .......... .......... .......... .......... 93%  185K 0s
   150K .........                                             100% 19.9M=1.4s

2018-12-04 21:42:19 (111 KB/s) - ‘YouTube-Spam-Collection-v1.zip’ saved [163567/163567]



Having downloaded the data, we want to be able to do a MapReduce task on it.  To do this, we will use the Hadoop Streaming API, which allows us to write Python code rather than the usual Java.  

When we call the Hadoop process, we pass two Python files to the command - one which maps, and one which reduces.

First, let's look at the data:

In [23]:
%%bash

head -n 10 Youtube04-Eminem.csv

COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
z12rwfnyyrbsefonb232i5ehdxzkjzjs2,Lisa Wellas,,+447935454150 lovely girl talk to me xxx﻿,1
z130wpnwwnyuetxcn23xf5k5ynmkdpjrj04,jason graham,2015-05-29T02:26:10.652000,I always end up coming back to this song<br />﻿,0
z13vsfqirtavjvu0t22ezrgzyorwxhpf3,Ajkal Khan,,"my sister just received over 6,500 new <a rel=""nofollow"" class=""ot-hashtag"" href=""https://plus.google.com/s/%23active"">#active</a> youtube views Right now. The only thing she used was pimpmyviews. com﻿",1
z12wjzc4eprnvja4304cgbbizuved35wxcs,Dakota Taylor,2015-05-29T02:13:07.810000,Cool﻿,0
z13xjfr42z3uxdz2223gx5rrzs3dt5hna,Jihad Naser,,Hello I&#39;am from Palastine﻿,1
z133yfmjdur4dvyjr04ceh2osl2fvngrqi4,Darrion Johnson,2015-05-29T01:27:30.360000,Wow this video almost has a billion views! Didn&#39;t know it was so popular ﻿,0
z12zgrw5furdsn0sc233hfwavnznyhicq,kyeman13,,Go check out my rapping video called Four Wheels please ❤️﻿,1
z12vxdzzds2kzzrzq04cdjc4ozq2szuyl5o,Damax,2015-05-29T00:4

## Check that Hadoop is running

The next thing to do is to check that we have Hadoop installed and running.  Open a terminal, and type in: 

    hadoop version
    
This should show you that the version you have is Hadoop 2.8.5.  

## Word counting

Now we have our CSV files, let's get started processing them.

The first thing we want to do is to set up a MapReduce job which will allow us to count the number of occurrences of each individual word in the `CONTENT` field of the file.

The Streaming API works with streams: the map process receives the contents of our CSV files as its input, and data is passed from the map process to the reduce process by printing it to stdout.

The Streaming API provides a stream of data to the program's "standard input", more commonly called "stdin". In this case, we'll get each of the lines of our CSV file as the input to the mapper. The mapper will then process this and write its results to "standard output", or "stdout". This will then be used as the input to the reduce process, and so on.

It helps to first think of what the inputs and outputs of each stage of the process are. For word counting, we could do something like this:

**Line of CSV Data** is mapped to **<Word, 1>** which is reduced to **<Word, Count>**

By default, the Streaming API uses a *tab character* as a separator between the key and the value. The output of your map function might look something like:

    "Banana\t1" or "Banana    1"

Some code has been provided to you below, including the libraries used in our answer. However, other solutions are possible that do not use these libraries.

The cells below use the %%writefile magic keyword to write their contents to a file, instead of executing them.
If you wish to execute them, comment this out with a `#`.

In [1]:
%%writefile mapper.py
#!/usr/bin/env python2.7
# MAPPER

import csv
import sys
import re

lines = sys.stdin.readlines()

csvreader = csv.reader(lines)
# YOUR CODE GOES BELOW

# Create a list of ONLY the comments (the fourth column, CONTENT) using a list comprehension
comments = [row[3] for row in csvreader]

# Iterate over each of the comments
for comment in comments:
    # Split the comment string into words, using every whitespace character as a divider.
    tokens = re.split(r"\s", comment)
    for token in tokens:
        # Print the key-value pair <token, 1> to stdout, separated by a tab
        print(token + "\t1")
        

Overwriting mapper.py
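The mapper above reads all of stdin into memory with `readlines()` before doing any work. As a hedged alternative, the sketch below processes one CSV row at a time. It assumes Python 3, and `map_comments` is a hypothetical helper name. Unlike the cell above, it skips rows with fewer than four columns instead of raising an error.

```python
import csv
import io
import re


def map_comments(stream):
    """Yield one 'token<TAB>1' line per whitespace-separated token in column 4."""
    for row in csv.reader(stream):
        if len(row) < 4:
            # Assumption: malformed rows are skipped rather than treated as errors.
            continue
        for token in re.split(r"\s", row[3]):
            yield token + "\t1"


# Made-up rows in the same column order as the dataset.
sample = io.StringIO('id1,Alice,,hello big world,0\nid2,Bob,,"hello, again",1\n')
for line in map_comments(sample):
    print(line)
```

In a real mapper script you would pass `sys.stdin` instead of the `io.StringIO` sample.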


In [2]:
%%writefile reducer.py
#!/usr/bin/env python2.7
# REDUCER

import sys
from collections import defaultdict
# A small hard-coded example for testing; it is replaced by stdin further down

input_pairs = [
    '+447935454150	1',
    'lovely	1',
    'lovely	1',
    'girl	1',
    'talk	1',
    'to	1',
    'me	1'
    #'xxx	1',
     #Add an extra one to test that it works
    #'to\t1'
]  # This list is only used for testing
# Read the real input from stdin. Comment this line out to test with the list above.
input_pairs = sys.stdin.readlines()

# YOUR CODE GOES BELOW

# Create a default dictionary. 
# This is a key-value store (dictionary) which returns a default value if the key hasn't been added.
# Here, we use it to store <word, count> pairs.
accumulator = defaultdict(lambda: 0)

for row in input_pairs:
    # Split the line into our key value pair.
    key_value_pair = row.split("\t", 1)  # Split on the first tab only, giving at most 2 parts
    
    # If we don't have a pair, ignore the line, as something has gone wrong.
    if len(key_value_pair) != 2:
        continue
        
    word = key_value_pair[0]
    # Strip removes whitespace at the start and end of a string. In this case, making sure we have just a number.
    # We also convert it to an integer here.
    count = int(key_value_pair[1].strip())
    
    # Retrieve the count of that word we've seen so far, add to it, then store the result.
    accumulator[word] = accumulator[word] + count
    
for (key, value) in accumulator.items():
    print(key + "\t" + str(value))

Overwriting reducer.py
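The reducer above keeps every word in a dictionary. Because the pairs reach the Reduce phase sorted by key, another common approach is to sum runs of identical keys as they arrive. The sketch below shows that idea under the assumption that the input really is sorted; `parse` and `reduce_sorted` are hypothetical names. When testing locally without Hadoop, this variant would need a `sort` between the mapper and the reducer.

```python
from itertools import groupby


def parse(lines):
    """Turn 'word<TAB>count' lines into (word, count) tuples, skipping bad lines."""
    for line in lines:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) == 2:
            yield parts[0], int(parts[1])


def reduce_sorted(lines):
    """Sum the counts of consecutive pairs that share the same key."""
    for word, pairs in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _word, count in pairs)


sample = ["girl\t1\n", "lovely\t1\n", "lovely\t1\n", "to\t1\n"]
for word, total in reduce_sorted(sample):
    print(word + "\t" + str(total))
```

Only one key's running total is held at a time, which matters when the set of distinct words is too large for memory.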


Ensure the above files have been written to two files: `mapper.py` and `reducer.py`. The easiest way to do this is make sure the `%%writefile mapper/reducer.py` lines are uncommented, then run the cell.

Since the mapper and reducer accept a stream into their stdin and output to stdout, we can test whether the scripts above work as a pipeline without using Hadoop!

The command below reads the .csv file, then pipes the output to `mapper.py`'s stdin. The mapper's output is then piped to `reducer.py`, and the reducer's output is piped to `sort`.

In [3]:
%%bash
chmod a+x mapper.py reducer.py
cat Youtube04-Eminem.csv | ./mapper.py | ./reducer.py | sort

,0	1
=	1
;-)	1
;)﻿	1
:(﻿	1
:))))﻿	1
:))	1
:	1
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!﻿	1
!!!!!	1
!!﻿	1
!﻿	1
?	1
...	1
(:,	1
┏━━━┓┏┓╋┏┓┏━━━┓┏━━━┓┏┓╋╋┏┓	1
┗━━━┛┗┛╋┗┛┗┛╋┗┛┗━━━┛╋╋┗┛	1
┗━━┓┃┃┏━┓┃┃┗━┛┃╋┃┃┃┃╋┗┓┏┛	1
┃┗━━┓┃┗━┛┃┃┃╋┃┃╋┃┃┃┃┗┓┗┛┏	1
┃┏━┓┃┃┃╋┃┃┃┏━┓┃┗┓┏┓┃┃┗┓┏┛┃	1
┃┗━┛┃┃┃╋┃┃┃┏━┓┃┏┛┗┛┃╋╋┃┃	1
◄◄••­••	1
❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤	1
❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤	1
❤❤❤❤❤❤❤﻿	1
😻✌💓😻👏﻿	1
❤❤❤❤❤❤❤	1
❤️❤️❤️﻿	1
💗💗💗💗﻿	1
♥♥♥♥	1
❤❤❤	1
❤️﻿	1
💜﻿	1
🙌﻿	1
❤﻿	1
♥﻿	1
1,000,000,000	1
1.000.000.000	1
1,000,000	1
1000	1
1,00	1
100%	2
10-15	1
#1	1
1﻿	1
11	2
12year	1
1337	2
.	140
14	2
15	2
(16	2
16	3
﻿	17
17	1
17yr	2
1	8
18	2
19	2
1990	1
1billion	1
~	2
=)	2
;)	2
:*	2
!!!	2
!!	2
..﻿	2
▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌	2
200	4
2005.	1
2008-2010	1
200k	2
2010:(﻿	1
2010?	1
2013!	1
2013.	1
2013	3
2014	1
#2015	1
2015?﻿	1
2015.﻿	1
2015﻿	3
2015	5
21	1
25	1
2	7
-	28
2nd	1
$3,000+	20
30	1
	305
;3﻿	1
3	1
31st	1
365	3
&#39;Beware&#39;	1
3RD	1
:)﻿	4
!	40
4000	4
4000DOLLARS	1
41

Now that we've tested that our pipeline works, it's time to integrate it into Hadoop. The commands below clear the output folder, ensure Hadoop is in standalone mode, then run our pipeline.

The parameters are as follows:

`-files` - Ensures these files are provided to every machine in our cluster.

`-input` - The data sources to be passed to the pipeline.

`-mapper` - The mapper to use.

`-reducer` - The reducer to use.

`-output` - The output folder.
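If you later want to launch jobs from Python, the same parameters can be assembled into an argument list. The sketch below only builds and prints the command; it does not run Hadoop. `build_streaming_command` is a hypothetical helper, and the jar name is a placeholder for the real path under `$HADOOP_HOME`.

```python
def build_streaming_command(jar, files, inputs, mapper, reducer, output):
    """Return the argument list for a Hadoop Streaming job."""
    # -files is given first, followed by the streaming options.
    command = ["hadoop", "jar", jar, "-files", ",".join(files)]
    for path in inputs:
        # One -input option per data source.
        command += ["-input", path]
    command += ["-mapper", mapper, "-reducer", reducer, "-output", output]
    return command


command = build_streaming_command(
    jar="hadoop-streaming.jar",  # placeholder, not the real jar path
    files=["mapper.py", "reducer.py"],
    inputs=["Youtube01-Psy.csv", "Youtube04-Eminem.csv"],
    mapper="./mapper.py",
    reducer="./reducer.py",
    output="output",
)
print(" ".join(command))
```

The resulting list could be handed to `subprocess.call` once the jar path is filled in.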

Test out your pipeline by running the command below!

In [4]:
%%bash

rm -rf output

hadoop-standalone-mode.sh

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-files mapper.py,reducer.py \
-input Youtube04-Eminem.csv \
-mapper ./mapper.py \
-reducer ./reducer.py \
-output output

Hadoop switched to standalone mode.


It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
18/12/10 13:09:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/12/10 13:09:32 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
18/12/10 13:09:32 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
18/12/10 13:09:32 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
18/12/10 13:09:33 INFO mapred.FileInputFormat: Total input files to process : 1
18/12/10 13:09:33 INFO mapreduce.JobSubmitter: number of splits:1
18/12/10 13:09:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1623317647_0001
18/12/10 13:09:34 INFO mapred.LocalDistributedCacheManager: Localized file:/home/comp6235/Notebooks/mapper.py as file:/tmp/hadoop-comp6235/mapred/

In [5]:
%%bash
# cd output
# cat part-00000
cat output/part-00000

	305
EVERYONE	1
EMİNEM	1
EMINEM	5
Does	1
like?	1
EMINEM&lt;3 <br	1
hate	5
up,	1
ROULETTE	1
phenomenallyricshere	9
seriously	1
up!	1
href="https://plus.google.com/s/%23Eminem">#Eminem</a>	1
MUCH	2
hermann	1
sorry	5
Tube	2
Support.	2
Industry	2
smack	1
Naperone	1
GO	2
today.	1
bringing	1
..&quot;﻿	1
look	5
Pakistan﻿	1
MAY	1
upload	1
Looplab	1
regret?	1
Go	4
lyrically,	1
relationship.	1
stars!	2
Waratel	1
enjoy	4
her.	1
her,	1
me,	3
me.	1
DON&#39;T	1
eminmem	1
second	3
lies.....﻿	1
much﻿	2
Zesty	1
monster	2
Dongs	1
Pun	1
great﻿	1
Eggmode	1
goot﻿	1
Put	1
new	12
ever	4
told	4
Made	1
never	8
here	12
Cool﻿	1
harbor.	2
famous,	1
don&#39;t,	1
Little	1
NOT	2
me:)	1
NOW	1
conveying.	1
video..	1
criticism	4
though﻿	1
FUCK	1
Also	3
cheats	1
Sick	2
breath?	1
Wiry	1
Toogit	1
Skizzle	1
criminals	2
Thailand♧]﻿	1
Accidental	1
motivate	1
chooses	1
SPARE	1
YouTube:<br	2
polish.	1
Gameplay	1
WHY	1
holy	1
successful	1
brings	1
yahoo	1
aware	1
sleeping...	1
feeling.	1
adult	1
...	1
video﻿	1
annoying.	1
must	

## Setting up HDFS

Now you've had a chance to use Hadoop in standalone mode, it's time we set it to pseudo-distributed mode and set up HDFS.

To speed things up, some commands have been provided to easily configure Hadoop. Start by running in your terminal:

    hadoop-pseudo-distributed-mode.sh
    
If you're curious how this works, feel free to have a read of the code. You'll find it in `~/vm_creation/scripts`.

We should now have HDFS configured for pseudo-distributed mode. We will now need to format a new HDFS filesystem, which will use the configuration we just set:

    hdfs namenode -format

## Starting services

Now we need to start the different services and we can get to work!  Run the following command in the terminal to start the HDFS:

    start-dfs.sh

You'll also need to start YARN in order to run any MapReduce jobs, so let's do that now:

    start-yarn.sh

To see what this has left you with, you can list the Java processes which are running by using the `jps` command:

In [None]:
%%bash
# hadoop-pseudo-distributed-mode.sh
# hdfs namenode -format
# start-dfs.sh
# start-yarn.sh

In [1]:
%%bash 
jps

bash: line 1: jps: command not found


You should see something similar to the following:

```
XXXX ResourceManager
XXXX SecondaryNameNode
XXXX NameNode
XXXX DataNode
XXXX NodeManager
```

If any of these aren't running, double check that you've run all of the above commands. If any are still missing, you may encounter errors later, so please contact one of the demonstrators.
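A quick way to check the list is to compare the `jps` output against the five expected daemon names. The sketch below does this on a pasted string; `missing_daemons` is a hypothetical helper and the process IDs in `sample` are made up for illustration.

```python
EXPECTED_DAEMONS = {
    "ResourceManager",
    "SecondaryNameNode",
    "NameNode",
    "DataNode",
    "NodeManager",
}


def missing_daemons(jps_output):
    """Return the expected daemon names that do not appear in the jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2:  # each line is expected to be "<pid> <name>"
            running.add(parts[1])
    return sorted(EXPECTED_DAEMONS - running)


# Made-up output in which YARN has not been started.
sample = """4211 NameNode
4398 DataNode
4620 SecondaryNameNode
5012 Jps
"""
print(missing_daemons(sample))
# ['NodeManager', 'ResourceManager']
```

An empty list means all five daemons were found.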

Now that we have a HDFS disk, and the appropriate Hadoop services running in pseudo-distributed mode, we can start to import the data into the new HDFS filesystem and run the MapReduce task there. 

Fully-distributed mode runs on the exact same principles described below, so we could apply MapReduce over various machines.

In summary, we need to: 

* Create a directory for the input
* Import the data from the local file to the HDFS datanode
* Run the MapReduce job
* View the output

Commands on HDFS are similar to standard Linux CLI commands, except that they are prefixed by either `hadoop fs` or `hdfs dfs`.

The `hadoop fs` command is more general, as it can cope with different types of filesystem, such as the one on the local disk. The `hdfs dfs` command is intended for operations on HDFS, so it is the better choice for commands relating solely to HDFS.

The command to create a directory is `-mkdir`.  Create a directory `/input` on the HDFS system.  Use the `hdfs dfs` command below to achieve this.

In [7]:
%%bash
# YOUR CODE HERE
hdfs dfs -mkdir /input


It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
18/12/10 13:15:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
mkdir: Cannot create directory /input. Name node is in safe mode.


Next, we need to import our data into HDFS. Here, we are dealing with two different filesystems: the local system and the HDFS node so we will use `hadoop fs`, with the `-copyFromLocal` command. This command copies files from the local filesystem to HDFS, accepting two arguments: file source and destination.

HDFS filesystems are defined by a URI prefixed by `hdfs://`, and the `hdfs dfs` and `hadoop fs` commands will normally expect to see them.

If they are not specified, the default location of the filesystem is specified in `core-site.xml`, which is one of the config files we imported earlier.  The value can be seen from the following command:

In [12]:
%%bash 
cat $HADOOP_HOME/etc/hadoop/core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/comp6235/hadooptmp</value>
    <description>A base for other temporary directories</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost

If you are interested in learning more about the configuration options we have specified for Hadoop, check out the documentation for Hadoop, as well as the `~/vm_creation/hadoop` folder.
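If you would rather read the value programmatically than scan the XML by eye, a sketch along these lines may help. The `SAMPLE_CORE_SITE` string is a shortened stand-in for the real file, and its `fs.defaultFS` value is assumed from the `hdfs://localhost:9000/` URI used in this tutorial. `get_property` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for core-site.xml; the fs.defaultFS value is an assumption.
SAMPLE_CORE_SITE = """<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/comp6235/hadooptmp</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>"""


def get_property(xml_text, name):
    """Return the <value> of the <property> with the given <name>, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


print(get_property(SAMPLE_CORE_SITE, "fs.defaultFS"))
```

To use it on the VM, you would read the real file's contents and pass them in place of the sample string.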

For the `-copyFromLocal` we can either specify `hdfs://localhost:9000/` or leave it out, instead using `hdfs:///`. For example, `hdfs://localhost:9000/input` and `hdfs:///input` refer to the same location.
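The equivalence of these forms can be illustrated with a small resolver. This is only a simplified sketch of the idea, assuming Python 3; Hadoop's own resolution handles more cases, such as relative paths. `qualify` and its `default_fs` argument are hypothetical names, and the default value is an assumption based on the URI above.

```python
from urllib.parse import urlparse


def qualify(path, default_fs="hdfs://localhost:9000"):
    """Fill in a missing scheme and host from the default filesystem."""
    parsed = urlparse(path)
    default = urlparse(default_fs)
    if parsed.scheme and parsed.scheme != default.scheme:
        return path  # a different filesystem, so leave it unchanged
    if not parsed.path.startswith("/"):
        raise ValueError("this sketch only handles absolute paths")
    authority = parsed.netloc or default.netloc
    return "{}://{}{}".format(default.scheme, authority, parsed.path)


for candidate in ["hdfs://localhost:9000/input", "hdfs:///input", "/input"]:
    print(candidate, "->", qualify(candidate))
```

All three candidates resolve to `hdfs://localhost:9000/input` under the assumed default.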

The local file can be specified with a relative path, leaving the import command as one of the following three. Pick one and execute it in the cell below.

In [13]:
%%bash
# With fully specified URI
hadoop fs -copyFromLocal *.csv hdfs://localhost:9000/input

# Explicit HDFS, but with the default host
# hadoop fs -copyFromLocal *.csv hdfs:///input

# Implied URI based on default
# hadoop fs -copyFromLocal *.csv /input



It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
18/12/04 22:07:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Next, we'll check that the files have been successfully imported.

In [14]:
%%bash
hadoop fs -ls /input

Found 5 items
-rw-r--r--   1 comp6235 supergroup      57438 2018-12-04 22:07 /input/Youtube01-Psy.csv
-rw-r--r--   1 comp6235 supergroup      64279 2018-12-04 22:07 /input/Youtube02-KatyPerry.csv
-rw-r--r--   1 comp6235 supergroup      64419 2018-12-04 22:07 /input/Youtube03-LMFAO.csv
-rw-r--r--   1 comp6235 supergroup      82896 2018-12-04 22:07 /input/Youtube04-Eminem.csv
-rw-r--r--   1 comp6235 supergroup      72706 2018-12-04 22:07 /input/Youtube05-Shakira.csv


It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
18/12/04 22:08:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Perform the same for the `mapper.py` and `reducer.py` files we created for the MapReduce task earlier, keeping those in the `input` directory as well.

You may need to add `-p` and `-f` as options. These options preserve file permissions, and force the new files to overwrite any existing files, respectively. 

In [15]:
%%bash
# YOUR CODE HERE
hdfs dfs -copyFromLocal -p -f mapper.py reducer.py /input


It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
18/12/04 22:11:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Now we run the hadoop command again, this time sourcing our files from HDFS instead of the local filesystem.

Note: If you run this command more than once, Hadoop will throw an error due to the output directory already existing. You may need to erase the existing directory or output to one with a different name.

In [16]:
%%bash
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-files hdfs:///input/mapper.py,hdfs:///input/reducer.py \
-input hdfs:///input/Youtube04-Eminem.csv \
-mapper ./mapper.py \
-reducer ./reducer.py \
-output hdfs://localhost:9000/output_2

packageJobJar: [/tmp/hadoop-unjar6388094368782989186/] [] /tmp/streamjob6092706763412020625.jar tmpDir=null


It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
18/12/04 22:14:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/12/04 22:14:04 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/12/04 22:14:04 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/12/04 22:14:05 INFO mapred.FileInputFormat: Total input files to process : 1
18/12/04 22:14:05 INFO mapreduce.JobSubmitter: number of splits:2
18/12/04 22:14:05 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1543960749609_0001
18/12/04 22:14:06 INFO impl.YarnClientImpl: Submitted application application_1543960749609_0001
18/12/04 22:14:06 INFO mapreduce.Job: The url to track the job: http://comp6235:8088/proxy/application_1543960749609_0001/
18/12/04 22:14:06 INFO mapreduce.Job: Running job: job_1543960749609_0001
18/12/04 22:14:22 INFO mapred

In the cell below, write a command to view the files listed in the `/output_2` directory.

In [17]:
%%bash
# YOUR CODE HERE
hdfs dfs -ls /output_2


It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
18/12/04 22:15:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
ls: `/output_16': No such file or directory


The `_SUCCESS` file indicates that the job was a success, which is good. The other file, `part-00000`, contains the result. Write code in the cell below to get the output (from HDFS).

In [None]:
%%bash
# YOUR CODE HERE
hdfs dfs -cat /output_2/part-00000
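Once you have the contents of `part-00000`, a few lines of Python can pick out the most frequent words. The sketch below assumes the `word<TAB>count` format produced by the reducer; `top_words` is a hypothetical helper, and the sample lines reuse a handful of counts from the standalone output shown earlier.

```python
from collections import Counter


def top_words(lines, n=3):
    """Return the n most frequent words from 'word<TAB>count' lines."""
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) != 2 or not parts[0]:
            # Skip malformed lines and the empty-string token.
            continue
        counts[parts[0]] += int(parts[1])
    return counts.most_common(n)


sample = ["new\t12\n", "never\t8\n", "hate\t5\n", "\t305\n"]
print(top_words(sample, n=2))
# [('new', 12), ('never', 8)]
```

The empty token is skipped because it comes from splitting on consecutive whitespace rather than from a real word.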

You can include multiple `-input` parameters to operate on more than one file.  Update the streaming command above to include all 5 files in the cell below.  Make sure you include a new output directory!

In [None]:
%%bash

# Update this command to include multiple files
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-files hdfs:///input/mapper.py,hdfs:///input/reducer.py \
-input hdfs://localhost:9000/input/Youtube04-Eminem.csv \
-mapper ./mapper.py \
-reducer ./reducer.py \
-output hdfs://localhost:9000/output_8

## Summary

In this tutorial, you have started to use Hadoop and HDFS.  You created a MapReduce task using the Hadoop streaming framework, and then set up your Hadoop instance to work in pseudo-distributed mode.

### Where next?
You might want to look at Mahout, and try to put the data here into a format like the one in the [video](https://www.youtube.com/watch?v=TWl6AIZIVps), which can be used to train a naive Bayes algorithm to classify the spam data. Alternatively, you can try to find the [SpamAssassin dataset](http://csmining.org/index.php/spam-email-datasets-.html) and import that yourself.

Hadoop over multiple machines is difficult to configure. You might look at systems which assist with this, such as Ambari or Cloudera, and try setting one up yourself. If you don't have multiple computers available to practise with, give the [Caochong](https://github.com/weiqingy/caochong) library a try, which sets up Hadoop in Docker containers on a single computer.