# Hadoop
(and Hadoop streaming)

<center>
<img src='http://www.datameer.com/images/technology/hadoop-pic1.png'>
</center>

**MapReduce** is a completely different paradigm 

* Solving a certain subset of parallelizable problems 
    - around the bottleneck of ingesting input data from disk
* Traditional parallelism brings the data to the computing machine
    - Map/reduce does the opposite, it brings the compute to the data
* Input data is not stored on a separate storage system
* Data exists in little pieces 
    - and is permanently stored on each computing node

MapReduce is the programming paradigm that allows massive scalability across thousands of servers.

Its open source server implementation is the *Hadoop* cluster.

Also always keep in mind that ***HDFS*** is fundamental to Hadoop 

* it provides the data chunking distribution across compute elements 
* necessary for map/reduce applications to be efficient

# Word count
The '`Hello World`' for MapReduce

Among the simplest of full Hadoop jobs you can run

<img src='http://www.glennklockwood.com/data-intensive/hadoop/wordcount-schematic.png'
width='700'>
<small>Reading ***Moby Dick*** </small>

### How it works
* The **MAP step** will take the raw text and convert it to key/value pairs
    - Each key is a word
    - All keys (words) will have a value of 1


* The **REDUCE step** will combine all duplicate keys 
    - By adding up their values (sum)
    - Every key (word) has a value of 1 (Map)
    - Output is reduced to a list of unique keys
    - Each key’s value corresponding to key's (word's) count

<img src='http://disco.readthedocs.org/en/latest/_images/map_shuffle_reduce.png'>

### Map function
processes data and generates a set of  intermediate key/value pairs.
### Reduce function 
merges all intermediate values  associated with the same intermediate key.

## A WordCount example 
*(with Java)*

Consider doing a word count of the following file using  MapReduce:
```
Hello World Bye World
Hello Hadoop Goodbye Hadoop
```

The map function reads in words one at a time outputs (“word”, 1) for each parsed input word

```
(Hello, 1)
(World, 1)
(Bye, 1)
(World, 1)
(Hello, 1)
(Hadoop, 1)
(Goodbye, 1)
(Hadoop, 1)
```

The shuffle phase between map and reduce creates a  list of values associated with each key
```
(Bye, (1))
(Goodbye, (1))
(Hadoop, (1, 1))
(Hello, (1, 1))
(World, (1, 1))
```

The reduce function sums the numbers in the list for each  key and outputs (word, count) pairs
```
(Bye, 1)
(Goodbye, 1)
(Hadoop, 2)
(Hello, 2)
(World, 2)
```

## How can you do this with Java?
(the Hadoop framework native language)

``` Java
// Imports
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.*

// Create JAVA class
public class WordCount {
```

``` Java
//Mapper function
  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }
```

``` Java
//Reducer function
  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
    
```  

``` Java
//Main function
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
```

In [82]:
# Hadoop available examples
! hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar 

An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writ

In [84]:
# Test wordcount
! hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount

Usage: wordcount <in> [<in>...] <out>


In [85]:
# Test wordcount
! hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount dft dft-output

15/10/12 11:55:59 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/10/12 11:56:02 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/root/.staging/job_1444641399800_0001
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ipyhadoop:9000/user/root/dft
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:321)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:385)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:597)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:614)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.

```
$ hadoop jar hadoop-examples.jar wordcount dft dft-output

10/03/16 11:40:51 INFO mapred.FileInputFormat: Total input paths to process : 1 10/03/16 11:40:51 INFO mapred.JobClient: Running job: job_201003161102_0002 10/03/16 11:40:52 INFO mapred.JobClient: map 0% reduce 0%
10/03/16 11:40:55 INFO mapred.JobClient: map 9% reduce 0% 10/03/16 11:40:56 INFO mapred.JobClient: map 27% reduce 0%
10/03/16 11:40:58 INFO mapred.JobClient: map 45% reduce 0% 10/03/16 11:40:59 INFO mapred.JobClient: map 81% reduce 0% 

10/03/16 11:41:01 INFO mapred.JobClient: map 100% reduce 0% 10/03/16 11:41:09 INFO mapred.JobClient: Job complete: job_201003161102_0002
```

```
$ hadoop dfs -cat dft-output/part-00000 | less 

“Information" 1 
"J" 1 
“Plain" 2 
“Project" 5 
```

**IMAGE**: cutting sandwiches with hadoop

# Exercise

Write a Java class to count letters inside a text file

*...just kidding!*

# Hadoop Streaming

Prepare a data sample

In [12]:
myfile = '/data/worker/ngs.sam'
%env myfile $myfile

env: myfile=/data/worker/ngs.sam


In [16]:
%%bash
wget -q "http://bit.ly/ngs_sample_data" -O $myfile.bz2 && echo "downloaded"
bunzip2 $myfile.bz2 && echo "decompressed"

downloaded
decompressed


```
head -2000 data/ngs/input.sam | tail -n 10 \
| awk '{ print $3":"$4 }' | sort | uniq -c

      3 chr1:142803456
      3 chr1:142803458
      1 chr1:142803465
      1 chr1:142803470
      2 chr1:142803471
```

In [17]:
%%bash
head -2000 $myfile | tail -n 10 | awk '{ print $3":"$4 }' | sort | uniq -c

      3 chr1:142803456
      3 chr1:142803458
      1 chr1:142803465
      1 chr1:142803470
      2 chr1:142803471


Splitting the command:

1. `head -2000 data/ngs/input.sam | tail -n 10`

1. `awk '{ print $3":"$4 }’`

1. `sort`

1. `uniq -c`


**INPUT STREAM**
`head -2000 data/ngs/input.sam | tail -n 10`

**MAPPER**
`awk '{ print $3":"$4 }'`

**SHUFFLE**
`sort`

**REDUCER**
`uniq -c`

**OUTPUT STREAM**
`<STDOUT>`

```
# head -2000 data/ngs/input.sam | tail -n 10
HSCAN:421:C47DAACXX:3:1206:5660:99605	99	chr1	142803456	10	75M	=	142803517	136	CTGTGCATTCTTATGATTTTAATATTCTGTACATTTATTATTGATTTAAAATGCATTTTACCTTTTTCTTTAATA	@@CDDDD?DBFHFHIHIJJJJGHEIFHIIIHGDFEGEIAHAHGIIGIIIC?CBFCGHFCEG?@DGIIIGHIGGHC	X0:i:3	X1:i:1	XA:Z:chr21,+9746045,75M,0;chr1,+143355186,75M,0;chr1,-143233123,75M,1;	MD:Z:75	RG:Z:1	XG:i:0	AM:i:0	NM:i:0	SM:i:0	XM:i:0	XO:i:0	MQ:i:18	XT:A:R

[…]

# head -2000 data/ngs/input.sam | tail -n 10 | awk '{ print $3":"$4 }'
chr1:142803471
chr1:142803456
chr1:142803465
chr1:142803458
chr1:142803456
chr1:142803471
chr1:142803458
chr1:142803456
chr1:142803470
chr1:142803458

# head -2000 data/ngs/input.sam | tail -n 10 | awk '{ print $3":"$4 }' | sort
chr1:142803456
chr1:142803456
chr1:142803456
chr1:142803458
chr1:142803458
chr1:142803458
chr1:142803465
chr1:142803470
chr1:142803471
chr1:142803471

# head -2000 data/ngs/input.sam | tail -n 10 | awk '{ print $3":"$4 }' | sort | uniq -c
      3 chr1:142803456
      3 chr1:142803458
      1 chr1:142803465
      1 chr1:142803470
      2 chr1:142803471
```

### Considerations with bash pipes as simulation of MapReduce

* Serial steps
* No file distribution
* Single node
* Single mapper
* Single reducer
* Can we add a Combiner?

# Hadoop streaming
### Concepts and mechanisms

Hadoop streaming is a utility 
comes with the Hadoop distribution
Allows to create and run Map/Reduce jobs 
with any executable or script as the mapper and/or the reducer
Protocol steps:
Create a Map/Reduce job
Submit the job to an appropriate cluster
Monitor the progress of the job until it completes
Links to Hadoop HDFS job directories

### Why?
One of the unappetizing aspects of Hadoop to users of traditional HPC is that it is written in Java. 
Java is not originally designed to be a high-performance language
Learning it is not easy for domain scientists
 Hadoop allows you to write map/reduce code in any language you want using the Hadoop Streaming interface
It means turning an existing Python or Perl script into a Hadoop job
Does not require learning any Java at all

### MapReduce with binaries/executables
Executable is specified for mappers and reducers
each mapper task will launch the executable as a separate process 
converts inputs into lines and feed to the stdin of the process
the mapper collects the line oriented outputs from the stdout of the process 
converts each line into a key/value pair
By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) will be the value
e.g. ”this is the key\tvalue is the rest\n”
If there is no tab character in the line, then entire line is considered as key and the value is null (!)

A streaming command line example:
``` bash
$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc
```

A streaming command line example **for python**:
``` bash
$ hadoop jar hadoop-streaming-1.2.1.jar \
    -input input_dir/ \
    -output output_dir/ \
    -mapper mapper.py \
    -file mapper.py \
    -reducer reducer.py \
    -file reducer.py 

```

Before submitting the Hadoop Streaming job:

* Make sure your scripts have no errors
* Do mapper and reducer scripts actually work?

This is just a matter of running them through pipes on a **little bit** of sample data,

like `cat` or `head` linux bash commands, with pipes, as seen before.

```
# Simulating hadoop streaming with bash pipes
$ cat $file | python mapper.py | sort | python reducer.py
```

First approach: split

In [18]:
%%writefile mapper.py

# -*- coding: utf-8 -*-
import sys

for line in sys.stdin:
    line = line.strip()
    pieces = line.split('\t')
    print(pieces) 


Overwriting mapper.py


In [19]:
! head -n 10 $myfile | python mapper.py

['@HD', 'VN:1.4', 'GO:none', 'SO:coordinate']
['@SQ', 'SN:chrM', 'LN:16571']
['@SQ', 'SN:chr1', 'LN:249250621']
['@SQ', 'SN:chr2', 'LN:243199373']
['@SQ', 'SN:chr3', 'LN:198022430']
['@SQ', 'SN:chr4', 'LN:191154276']
['@SQ', 'SN:chr5', 'LN:180915260']
['@SQ', 'SN:chr6', 'LN:171115067']
['@SQ', 'SN:chr7', 'LN:159138663']
['@SQ', 'SN:chr8', 'LN:146364022']


In [22]:
%%writefile mapper.py

# -*- coding: utf-8 -*-
TAB = "\t"
import sys

# Cycle current streaming data
for line in sys.stdin:

    # Clean input
    line = line.strip()
    # Skip SAM/BAM headers
    if line[0] == "@":
        continue

    # Use data
    pieces = line.split(TAB)
    mychr = pieces[2]
    mystart = int(pieces[3])
    myseq = pieces[9]
    print(mychr,mystart.__str__())
    sys.exit(1)

Overwriting mapper.py


In [27]:
! head -n 100 $myfile | python mapper.py

chrM 14


In [76]:
%%writefile mapper.py

# -*- coding: utf-8 -*-
TAB = "\t"
SEP = ':'
import sys

# Cycle current streaming data
for line in sys.stdin:
    # Clean input
    line = line.strip()
    # Skip SAM/BAM headers
    if line[0] == "@":
        continue
    
    # Use data
    pieces = line.split(TAB)
    mychr = pieces[2]
    mystart = int(pieces[3])
    myseq = pieces[9]

    mystop = mystart + len(myseq)

    # Each element with coverage
    for i in range(mystart,mystop):
        results = [mychr+SEP+i.__str__(), "1"]
        print(TAB.join(results))


Overwriting mapper.py


In [77]:
! head -n 100 $myfile | python mapper.py

chrM:14	1
chrM:15	1
chrM:16	1
chrM:17	1
chrM:18	1
chrM:19	1
chrM:20	1
chrM:21	1
chrM:22	1
chrM:23	1
chrM:24	1
chrM:25	1
chrM:26	1
chrM:27	1
chrM:28	1
chrM:29	1
chrM:30	1
chrM:31	1
chrM:32	1
chrM:33	1
chrM:34	1
chrM:35	1
chrM:36	1
chrM:37	1
chrM:38	1
chrM:39	1
chrM:40	1
chrM:41	1
chrM:42	1
chrM:43	1
chrM:44	1
chrM:45	1
chrM:46	1
chrM:47	1
chrM:48	1
chrM:49	1
chrM:50	1
chrM:51	1
chrM:52	1
chrM:53	1
chrM:54	1
chrM:55	1
chrM:56	1
chrM:57	1
chrM:58	1
chrM:59	1
chrM:60	1
chrM:61	1
chrM:62	1
chrM:63	1
chrM:64	1
chrM:65	1
chrM:66	1
chrM:67	1
chrM:68	1
chrM:69	1
chrM:70	1
chrM:71	1
chrM:72	1
chrM:73	1
chrM:74	1
chrM:75	1
chrM:76	1
chrM:77	1
chrM:78	1
chrM:79	1
chrM:80	1
chrM:81	1
chrM:82	1
chrM:83	1
chrM:84	1
chrM:85	1
chrM:86	1
chrM:87	1
chrM:88	1
chrM:14	1
chrM:15	1
chrM:16	1
chrM:17	1
chrM:18	1
chrM:19	1
chrM:20	1
chrM:21	1
chrM:22	1
chrM:23	1
chrM:24	1
chrM:25	1
chrM:26	1
chrM:27	1
chrM:28	1
chrM:29	1

### Shuffle step 

A lot happens, transparent to the developer
Mappers’s output is transformed and distributed to the reducers
All key/value pairs are sorted before sent to reducer function
Pairs sharing the same key are sent to the same reducer
If you encounter a key that is different from the last key you processed, you know that previous key will never appear again
If your keys are all the same
only use one reducer and gain no parallelization
come up with a more unique key if this happens

### Reducer

In [79]:
%%writefile reducer.py

# -*- coding: utf-8 -*-
TAB = "\t"
SEP = ':'
import sys
last_value = ""
value_count = 1
# Cycle current streaming data
for line in sys.stdin:

    # Clean input
    line = line.strip()
    value, count = line.split(TAB)
    count = int(count)

    # if this is the first iteration
    if not last_value:
        last_value = value

    # if they're the same, log it
    if value == last_value:
        value_count += count
    else:
        # state change
        try: 
            print(TAB.join([last_value, str(value_count)]))
        except:
            pass
        last_value = value
        value_count = 1

# LAST ONE after all records have been received
print(TAB.join([last_value, str(value_count)]))

Overwriting reducer.py


In [80]:
%%bash
# needs ~ 10 seconds for running
time head -n 10000 worker/ngs.sam | python mapper.py | sort | python reducer.py | head -n 5

chr1:10000	3
chr1:10001	2
chr1:10002	2
chr1:10003	2
chr1:10004	2



real	0m5.921s
user	0m5.340s
sys	0m0.150s


In [78]:
%%bash
time cat worker/ngs.sam | python mapper.py | sort | python reducer.py > worker/out.txt


real	3m30.950s
user	3m8.630s
sys	0m10.660s


# Switching to real Hadoop

A working python code tested on pipes should work with Hadoop Streaming

To make this work we need to handle copy of input and output file 
inside the Hadoop FS
Also the job tracker logs will be found inside HDFS
We are going to build a bash script to make our workflow

## Preprocessing

In [None]:
HDFS commands to interact with Hadoop file system

Create dir 
hadoop fs -mkdir
Copy file
hadoop fs -put
Check if file is there
hadoop fs -ls
Remove recursively data
hadoop fs -rmr

In [None]:
Hadoop Streaming needs “binaries” to execute

You need to specify interpreter inside the script
#!/usr/bin/env python
Make the script executable
chmod +x hs*.py

---

Final launch via bash command for using Hadoop streaming

---

## Final thoughts on Hadoop Streaming 

Provides options to write MapReduce jobs in other languages
One of the best examples of flexibility available to MapReduce
Fast
Simple
Close to the original standard Java API power

Even executables can be used to work as a MapReduce job (!)

Where it really works

When the developer do not have knowhow of Java 
Write Mapper/Reducer in any scripting language 
Faster

Cons

Force scripts in a Java VM
Although free overhead
The program/executable should be able to take input from STDIN and produce output at STDOUT
Restrictions on the input/output formats
Does not take care of input and output file and directory preparation
User have to implement hdfs commands “hand-made”

Where it falls short

No pythonic way to work the MapReduce code
Because it was not written specifically for python

Recap 

Hadoop streaming handles Hadoop in almost a classic manner
Wrap any executable (and script)
Wrap python scripts
Runnable on a cluster
using a non-interactive, all-encapsulated job

* Docker lightweight virtualization
* Two environments 
* Schema