# Hadoop
(and Hadoop streaming)

<center>
<img src='http://www.opensourceforu.efytimes.com/wp-content/uploads/2012/03/hadoop-database-590x321.jpg'>
</center>

**MapReduce** is a completely different paradigm 

* Solving a certain subset of parallelizable problems 
    - around the bottleneck of ingesting input data from disk
* Traditional parallelism brings the data to the computing machine
    - Map/reduce does the opposite, it brings the compute to the data
* Input data is not stored on a separate storage system
* Data exists in little pieces 
    - and is permanently stored on each computing node

MapReduce is the programming paradigm that allows massive scalability across thousands of servers.

Its open source server implementation is the *Hadoop* cluster.

Also always keep in mind that ***HDFS*** is fundamental to Hadoop 

* it provides the data chunking distribution across compute elements 
* necessary for map/reduce applications to be efficient

# Word count
The '`Hello World`' for MapReduce

Among the simplest of full Hadoop jobs you can run

<img src='http://www.glennklockwood.com/data-intensive/hadoop/wordcount-schematic.png'
width='700'>
<small>Reading ***Moby Dick*** </small>

### How it works
* The **MAP step** will take the raw text and convert it to key/value pairs
    - Each key is a word
    - All keys (words) will have a value of 1


* The **REDUCE step** will combine all duplicate keys 
    - By adding up their values (sum)
    - Every key (word) has a value of 1 (Map)
    - Output is reduced to a list of unique keys
    - Each key’s value corresponding to key's (word's) count

<center>
<img src='http://disco.readthedocs.org/en/latest/_images/map_shuffle_reduce.png' width=800>
</center>

### Map function:
processes data and generates a set of  intermediate key/value pairs.


### Reduce function:
merges all intermediate values  associated with the same intermediate key.

## A WordCount example 
*(with Java)*

Consider doing a word count of the following file using  MapReduce:
```
Hello World Bye World
Hello Hadoop Goodbye Hadoop
```

The map function reads in words one at a time outputs (“word”, 1) for each parsed input word

```
(Hello, 1)
(World, 1)
(Bye, 1)
(World, 1)
(Hello, 1)
(Hadoop, 1)
(Goodbye, 1)
(Hadoop, 1)
```

The shuffle phase between map and reduce creates a  list of values associated with each key
```
(Bye, (1))
(Goodbye, (1))
(Hadoop, (1, 1))
(Hello, (1, 1))
(World, (1, 1))
```

The reduce function sums the numbers in the list for each  key and outputs (word, count) pairs
```
(Bye, 1)
(Goodbye, 1)
(Hadoop, 2)
(Hello, 2)
(World, 2)
```

## How can you do this with Java?
(the Hadoop framework native language)

``` Java
// Imports
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.*

// Create JAVA class
public class WordCount {
```

``` Java
//Mapper function
  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }
```

``` Java
//Reducer function
  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
    
```

<small>
``` Java
//Main function
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
```
</small>

We can test the Java code here. *Live*.

In [1]:
%env HADOOP_EXAMPLES /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar
%env HADOOP_STREAMING /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar

env: HADOOP_EXAMPLES=/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar
env: HADOOP_STREAMING=/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar


In [2]:
%env HADOOP_EXAMPLES

'/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar'

In [3]:
# Hadoop available examples
! hadoop jar $HADOOP_EXAMPLES | grep word

  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  multifilewc: A job that counts words from several files.
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.


In [4]:
# Check wordcount
! hadoop jar $HADOOP_EXAMPLES wordcount

Usage: wordcount <in> [<in>...] <out>


In [5]:
%%bash
# Note for my self: run this and next three cells 'live'

########################
# Preprocess with HDFS

# Create input directory
hdfs dfs -mkdir myinput
# Save one file inside
file="/data/lectures/data/books/twolines.txt"
hdfs dfs -put $file myinput/file01
# Remove output or Hadoop will give error if existing
hdfs dfs -rm -r -f myoutput

In [6]:
# Test wordcount with real hadoop on our system
! hadoop jar $HADOOP_EXAMPLES wordcount myinput myoutput

16/03/15 16:20:05 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/03/15 16:20:06 INFO input.FileInputFormat: Total input paths to process : 1
16/03/15 16:20:06 INFO mapreduce.JobSubmitter: number of splits:1
16/03/15 16:20:06 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1458056148609_0001
16/03/15 16:20:07 INFO impl.YarnClientImpl: Submitted application application_1458056148609_0001
16/03/15 16:20:07 INFO mapreduce.Job: The url to track the job: http://mapreduce:8088/proxy/application_1458056148609_0001/
16/03/15 16:20:07 INFO mapreduce.Job: Running job: job_1458056148609_0001
16/03/15 16:20:15 INFO mapreduce.Job: Job job_1458056148609_0001 running in uber mode : false
16/03/15 16:20:15 INFO mapreduce.Job:  map 0% reduce 0%
16/03/15 16:20:20 INFO mapreduce.Job:  map 100% reduce 0%
16/03/15 16:20:27 INFO mapreduce.Job:  map 100% reduce 100%
16/03/15 16:20:28 INFO mapreduce.Job: Job job_1458056148609_0001 completed successfully
16/03/15 16:20:28 INF

In [7]:
# This was our input
! cat /data/lectures/data/books/twolines.txt

Hello World Bye World
Hello Hadoop Goodbye Hadoop

In [8]:
# This is our output
! hadoop fs -cat myoutput/part*

Bye	1
Goodbye	1
Hadoop	2
Hello	2
World	2


# Recap
<img src='https://pbs.twimg.com/media/B2RlCy-IIAEFCLC.jpg' width=700>

# Exercise

Write your first Java class to count letters inside a text file

*...just kidding!*

> You may try to use hadoop commands within the notebook instead.

# Hadoop like `pipes` in Unix Bash

Prepare a data sample

In [2]:
# Variables for python and bash
myfile = '/tmp/ngs.sam'
%env myfile $myfile

env: myfile=/tmp/ngs.sam


In [3]:
%%bash

# Download compressed NGS data from a link
wget -q "http://bit.ly/ngs_sample_data" -O $myfile.bz2 && echo "downloaded"
# Decompress the file
bunzip2 $myfile.bz2 && echo "decompressed"

downloaded
decompressed


In [4]:
# Check if the file is there
! ls $myfile

/tmp/ngs.sam


In [12]:
%%bash
# Bash piping our own MapReduce with Unix commands
head -2000 $myfile | tail -n 10 | awk '{ print $3":"$4 }' | sort | uniq -c

      3 chr1:142803456
      3 chr1:142803458
      1 chr1:142803465
      1 chr1:142803470
      2 chr1:142803471


### Understanding better

Splitting the command:

1. `head -2000 data/ngs/input.sam | tail -n 10`

1. `awk '{ print $3":"$4 }’`

1. `sort`

1. `uniq -c`


**INPUT STREAM**
`head -2000 data/ngs/input.sam | tail -n 10`

**MAPPER**
`awk '{ print $3":"$4 }'`

**SHUFFLE**
`sort`

**REDUCER**
`uniq -c`

**OUTPUT STREAM**
`<STDOUT>`

Let's see step by step what it's happening

In [13]:
# STREAMING THE FILE
! head -2000 $myfile | tail -n 10
# note: 
# taking the last 10 lines of the first 2000
# to skip headers lines

HSCAN:421:C47DAACXX:3:1206:5660:99605	99	chr1	142803456	10	75M	=	142803517	136	CTGTGCATTCTTATGATTTTAATATTCTGTACATTTATTATTGATTTAAAATGCATTTTACCTTTTTCTTTAATA	@@CDDDD?DBFHFHIHIJJJJGHEIFHIIIHGDFEGEIAHAHGIIGIIIC?CBFCGHFCEG?@DGIIIGHIGGHC	X0:i:3	X1:i:1	XA:Z:chr21,+9746045,75M,0;chr1,+143355186,75M,0;chr1,-143233123,75M,1;	MD:Z:75	RG:Z:1	XG:i:0	AM:i:0	NM:i:0	SM:i:0	XM:i:0	XO:i:0	MQ:i:18	XT:A:R
HSCAN:421:C47DAACXX:3:1207:14092:152623	99	chr1	142803456	0	75M	=	142803560	180	CTGTGCATTCTTATGATTTTAATATTCTGTACATTTATTATTGATTTAAAATGCATTTTACCTTTTTCTTTAATA	CCCDFFFFHHHHGJJIJJJJJIJJJJJJJHIJJJJJIJJIJJJJJJJJIIIFHHEHJJIGHHIIJJJIIIJIIJG	X0:i:3	X1:i:1	XA:Z:chr21,+9746045,75M,0;chr1,+143355186,75M,0;chr1,-143233123,75M,1;	MD:Z:75	RG:Z:1	XG:i:0	AM:i:0	NM:i:0	SM:i:0	XM:i:0	XO:i:0	MQ:i:0	XT:A:R
HSCAN:421:C47DAACXX:3:1301:4054:177529	99	chr1	142803456	12	75M	=	142803494	113	CTGTGCATTCTTATGATTTTAATATTCTGTACATTTATTATTGATTTAAAATGCATTTTACCTTTTTCTTTAATA	CCCFFFFFHHHHHJJHHJJIIGJJJJJJJIIGGIJJJJJJJJHIJJIIHIFFHIIJJJJJIGIJJH

In [14]:
# MAPPING
! head -2000 $myfile | tail -n 10 | awk '{ print $3":"$4 }'

chr1:142803456
chr1:142803456
chr1:142803456
chr1:142803458
chr1:142803458
chr1:142803458
chr1:142803465
chr1:142803470
chr1:142803471
chr1:142803471


In [15]:
# SHUFFLING
! head -2000 $myfile | tail -n 10 | awk '{ print $3":"$4 }' | sort

chr1:142803456
chr1:142803456
chr1:142803456
chr1:142803458
chr1:142803458
chr1:142803458
chr1:142803465
chr1:142803470
chr1:142803471
chr1:142803471


In [16]:
# REDUCER
! head -2000 $myfile | tail -n 10 | awk '{ print $3":"$4 }' | sort | uniq -c

      3 chr1:142803456
      3 chr1:142803458
      1 chr1:142803465
      1 chr1:142803470
      2 chr1:142803471


# Exercise
<br>

<big>
Play *a little bit* with bash pipes. 
</big>

Try to understand how MapReduce process data

### Considerations with bash pipes as simulation of MapReduce

* Serial steps
* No file distribution
* Single node
* Single mapper
* Single reducer
* Can we add a Combiner?

# Exercise

Do you know how to substitute the awk/mapper with a python script?

If yes, create one to count the most recurring words inside the Divine Comedy.

# Exercise

Create a python script to count the vowels inside the Divine Comedy.

(with bash pipes)

# Hadoop streaming
### Concepts and mechanisms

Hadoop streaming is a utility 

* It comes bundled with the Hadoop distribution
* It allows creating and running Map/Reduce jobs 
    - with any executable or script as the mapper and/or the reducer

Protocol steps

* Create a Map/Reduce job
* Submit the job to an appropriate cluster
* Monitor the progress of the job until it completes
* Links to Hadoop HDFS job directories

### Why?

One of the most unappetizing aspects of Hadoop to users of traditional HPC is that it is written in Java. 

* Java is not originally designed to be a high-performance language
* Learning Java is very difficult for domain scientists

This is why Hadoop allows you to write map/reduce code in any language you want using the Hadoop Streaming interface

* It means turning an existing Python or Perl script into a Hadoop job
* Does not require learning any Java at all

### MapReduce streaming with binaries/executables

* Executables are specified for mappers and reducers!
    - each mapper task run as a separate process 
* Inputs converted into lines and feed to the `STDIN` of the process
* The mapper collects `STDOUT` of the process 
    - each line is a key/value pair **separated by TAB**
    - e.g. ”this is the key\tvalue is the rest\n”

warning: If there is no tab character in the line, then entire line is considered as key and the value is null (!)

<big> 
Let's go live
</big> 

A streaming command line example:
``` bash
$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc
```

A streaming command line example **for python**:
``` bash
$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -files mapper.py,reducer.py
    -input input_dir/ \
    -output output_dir/ \
    -mapper mapper.py \
    -reducer reducer.py \
```

Before submitting the Hadoop Streaming job:

* Make sure your scripts have no errors
* Do mapper and reducer scripts actually work?

This is just a matter of running them through pipes on a **little bit** of sample data,

like `cat` or `head` linux bash commands, with pipes, as seen before.

```
# Simulating hadoop streaming with bash pipes
$ cat $file | python mapper.py | sort | python reducer.py
```

First approach: split

In [17]:
%%writefile mapper.py

# -*- coding: utf-8 -*-
import sys

for line in sys.stdin:
    line = line.strip()
    pieces = line.split('\t')
    print(pieces) 


Writing mapper.py


In [18]:
! head -n 150 $myfile | python mapper.py

['@HD', 'VN:1.4', 'GO:none', 'SO:coordinate']
['@SQ', 'SN:chrM', 'LN:16571']
['@SQ', 'SN:chr1', 'LN:249250621']
['@SQ', 'SN:chr2', 'LN:243199373']
['@SQ', 'SN:chr3', 'LN:198022430']
['@SQ', 'SN:chr4', 'LN:191154276']
['@SQ', 'SN:chr5', 'LN:180915260']
['@SQ', 'SN:chr6', 'LN:171115067']
['@SQ', 'SN:chr7', 'LN:159138663']
['@SQ', 'SN:chr8', 'LN:146364022']
['@SQ', 'SN:chr9', 'LN:141213431']
['@SQ', 'SN:chr10', 'LN:135534747']
['@SQ', 'SN:chr11', 'LN:135006516']
['@SQ', 'SN:chr12', 'LN:133851895']
['@SQ', 'SN:chr13', 'LN:115169878']
['@SQ', 'SN:chr14', 'LN:107349540']
['@SQ', 'SN:chr15', 'LN:102531392']
['@SQ', 'SN:chr16', 'LN:90354753']
['@SQ', 'SN:chr17', 'LN:81195210']
['@SQ', 'SN:chr18', 'LN:78077248']
['@SQ', 'SN:chr19', 'LN:59128983']
['@SQ', 'SN:chr20', 'LN:63025520']
['@SQ', 'SN:chr21', 'LN:48129895']
['@SQ', 'SN:chr22', 'LN:51304566']
['@SQ', 'SN:chrX', 'LN:155270560']
['@SQ', 'SN:chrY', 'LN:59373566']
['@SQ', 'SN:chr1_gl000191_random', 'LN:106433']
['@SQ', 'SN:chr1_gl000192_rand

# Exercise

Count symbols (everything that is not a letter of the alphabet) inside a text file.

Back to the example:

Skip header lines.

In [19]:
%%writefile mapper.py

# -*- coding: utf-8 -*-
TAB = "\t"
import sys

# Cycle current streaming data
for line in sys.stdin:

    # Clean input
    line = line.strip()
    # Skip SAM/BAM headers
    if line[0] == "@":
        continue

    # Use data
    pieces = line.split(TAB)
    mychr = pieces[2]
    mystart = int(pieces[3])
    myseq = pieces[9]
    print(mychr,mystart.__str__())
    sys.exit(1)

Overwriting mapper.py


In [20]:
! head -n 100 $myfile | python mapper.py

chrM 14


Produce an output that can be sorted and then used by reducer

In [6]:
%%writefile mapper.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
TAB = "\t"
SEP = ':'
import sys

# Cycle current streaming data
for line in sys.stdin:
    # Clean input
    line = line.strip()
    # Skip SAM/BAM headers
    if line[0] == "@":
        continue
    
    # Use data
    pieces = line.split(TAB)
    mychr = pieces[2]
    mystart = int(pieces[3])
    myseq = pieces[9]

    mystop = mystart + len(myseq)

    # Each element with coverage
    for i in range(mystart,mystop):
        results = [mychr+SEP+i.__str__(), "1"]
        print(TAB.join(results))


Overwriting mapper.py


In [22]:
! head -n 100 $myfile | python mapper.py

chrM:14	1
chrM:15	1
chrM:16	1
chrM:17	1
chrM:18	1
chrM:19	1
chrM:20	1
chrM:21	1
chrM:22	1
chrM:23	1
chrM:24	1
chrM:25	1
chrM:26	1
chrM:27	1
chrM:28	1
chrM:29	1
chrM:30	1
chrM:31	1
chrM:32	1
chrM:33	1
chrM:34	1
chrM:35	1
chrM:36	1
chrM:37	1
chrM:38	1
chrM:39	1
chrM:40	1
chrM:41	1
chrM:42	1
chrM:43	1
chrM:44	1
chrM:45	1
chrM:46	1
chrM:47	1
chrM:48	1
chrM:49	1
chrM:50	1
chrM:51	1
chrM:52	1
chrM:53	1
chrM:54	1
chrM:55	1
chrM:56	1
chrM:57	1
chrM:58	1
chrM:59	1
chrM:60	1
chrM:61	1
chrM:62	1
chrM:63	1
chrM:64	1
chrM:65	1
chrM:66	1
chrM:67	1
chrM:68	1
chrM:69	1
chrM:70	1
chrM:71	1
chrM:72	1
chrM:73	1
chrM:74	1
chrM:75	1
chrM:76	1
chrM:77	1
chrM:78	1
chrM:79	1
chrM:80	1
chrM:81	1
chrM:82	1
chrM:83	1
chrM:84	1
chrM:85	1
chrM:86	1
chrM:87	1
chrM:88	1
chrM:14	1
chrM:15	1
chrM:16	1
chrM:17	1
chrM:18	1
chrM:19	1
chrM:20	1
chrM:21	1
chrM:22	1
chrM:23	1
chrM:24	1
chrM:25	1
chrM:26	1
chrM:27	1
chrM:28	1
chrM:29	1

### Shuffle step 

<br><big>
A lot happens, transparent to the developer
</big>

* Mappers’s output is transformed and distributed to the reducers
* All key/value pairs are sorted before sent to reducer function
* Pairs sharing the same key are sent to the same reducer
* If you encounter a key that is different from the last key you processed
    - *you know that previous key will never appear again*
* If your keys are all the same
    - only use one reducer and gain no parallelization
    - come up with a more unique key if this happens

### Reducer

In [7]:
%%writefile reducer.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
TAB = "\t"
SEP = ':'
import sys
last_value = ""
value_count = 1
for line in sys.stdin:
    value, count = line.strip().split(TAB)
    # if this is the first iteration
    if not last_value:
        last_value = value
    # if they're the same, log it
    if value == last_value:
        value_count += int(count)
    else:
        # state change
        try: 
            print(TAB.join([last_value, str(value_count)]))
        except:
            pass
        last_value = value
        value_count = 1
# LAST ONE after all records have been received
print(TAB.join([last_value, str(value_count)]))

Overwriting reducer.py


In [24]:
%%bash
# needs ~ 5 seconds for running
time head -n 10000 $myfile | python mapper.py | sort | python reducer.py | head -n 5

chr1:10000	3
chr1:10001	2
chr1:10002	2
chr1:10003	2
chr1:10004	2



real	0m3.880s
user	0m3.800s
sys	0m0.070s


In [25]:
%%bash
time cat $myfile | python mapper.py | sort | python reducer.py > out.txt


real	2m24.110s
user	2m20.610s
sys	0m2.050s


In [26]:
! grep "\s185" out.txt

chr21:11181667	185
chr21:11181668	185
chr21:11181669	185
chr7:151970834	185
chr7:151970867	185


# Exercise

Write a mapper and a reducer in python to group words of "The Prince" based on their length.

The book is available at the following path: `/data/lectures/data/books/prince.txt`

Find out how many are longer than ten letters or shorter than five.

# Switching to real Hadoop

<big>
A working python code tested on pipes **should work** with Hadoop Streaming
</big>

* To make this happen we need to handle copy of input and output files inside the Hadoop FS
* Also the job tracker logs will be found inside HDFS
* We are going to use bash scripting inside the notebook to make our workflow

## Preprocessing

HDFS commands to interact with Hadoop file system use the same syntax:

```
hdfs dfs -command
```

`command` are like bash commands for file

e.g.

```
hadoop fs -mkdir hdfs:///dir
hadoop fs -put file_on_host hdfs:///path/to/file
hadoop fs -ls
```

<small>
Note: we have seen this in action with the Java example
</small>

<big>
Hadoop Streaming needs “binaries” to execute
</big>

You need to specify interpreter at the beginning of your scripts:
```
#!/usr/bin/env python
```

Make also the script executables:
```
chmod +x hs*.py
```

In [27]:
! chmod +x mapper.py reducer.py

In [28]:
%env HADOOP_STREAMING

'/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar'

In [29]:
%%bash
# Launch streaming
hadoop jar $HADOOP_STREAMING

Usage: $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar [options]
Options:
  dumptb <glob-pattern> Dumps all files that match the given pattern to 
                        standard output as typed bytes.
  loadtb <path> Reads typed bytes from standard input and stores them in
                a sequence file in the specified path
  [streamjob] <args> Runs streaming job with given arguments


No Arguments Given!


In [5]:
%%bash
# Preprocess with HDFS
hdfs dfs -rm -r -f myinput
hdfs dfs -mkdir myinput
# Save one file inside
file="/tmp/ngs.sam"
hdfs dfs -put $file myinput/file01
# Remove output or Hadoop will give error if existing
hdfs dfs -rm -r -f myoutput

Deleted myinput


16/03/15 17:04:11 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.


Final launch via bash command for using Hadoop streaming

In [None]:
%%bash 
# A real Hadoop Streaming run
time hadoop jar $HADOOP_STREAMING \
    -D mapreduce.job.mapper=12 -D mapreduce.job.reducers=4  \
    -files mapper.py,reducer.py \
    -input myinput -output myoutput \
    -mapper mapper.py -reducer reducer.py

<small>
**Note 1**:

if this comand takes minutes, you may go to see what is happening in your Hadoop JobTracker, if you have it, e.g.
http://localhost:8088/cluster

<br>


**Note 2**:

this command cannot be executed from the cloud version of the notebooks, as the port is not opened
</small>

Hadoop streaming is **difficult to debug**.
Just like real Java Hadoop.

If you did a typical setup mistake, you may end receiving unrelated errors stacktrace from the Java virtual machine.

So before googling those stacktrace, make sure that:

* Python files (mapper and reducer) exists
* They are provided inside the main bash command also as **files** list
* They are executables and contain as first line the hashbang
* Your input directory exists on HDFS
* Files inside your input directory are not corrupted
    - e.g. bad decompression

In [None]:
# OUTPUT: Check directory
! hdfs dfs -ls myoutput

In [None]:
# OUTPUT:  Copy file and go see it
! rm -rf hs.*.txt && hdfs dfs -get myoutput/part-00000 hs.out.txt

In [None]:
! head hs.out.txt

# Exercise

Run your previously written python Mapper/Reducer with Hadoop Streaming.

(*Good luck*)

## Final thoughts on Hadoop Streaming 


* Provides options to write MapReduce jobs in other languages
* Even executables can be used to work as a MapReduce job
* One of the best examples of flexibility available to MapReduce
* Fast
* Simpler than Java
* Also close to the original standard Java API Hadoop power


### Where it really works

* When the developer do not have knowhow of Java 
* Write Mapper/Reducer in any scripting language 

### Disadvantages

* Force scripts in a Java VM
    - Although almost free overhead
* The program/executable have to take input from STDIN 
    - and produce output at STDOUT
* Restrictions on the input/output formats
    - Does not take care of input and output file and directory preparation
    - User have to implement hdfs commands “hand-made”

### Where it falls short

* No pythonic way to work the MapReduce code

(Because it was not written specifically for python)

## Recap 

* Hadoop streaming handles Hadoop in almost a classic manner
* Wrap any executable (and script)
    - Also python scripts
* Runnable on a cluster using a non-interactive, all-encapsulated job

# End of Chapter