Pierre Navaro - [Institut de Recherche Mathématique de Rennes](https://irmar.univ-rennes1.fr) - [CNRS](http://www.cnrs.fr/)

# References
- [Outils pour le Big Data - Pierre Nerzic 🇫🇷](https://perso.univ-rennes1.fr/pierre.nerzic/Hadoop/)
- [Writing an Hadoop MapReduce Program in Python - Michael G. Noll](http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/)
- [Hadoop MapReduce Framework Tutorials with Examples - Matthew Rathbone](https://blog.matthewrathbone.com/2013/01/05/a-quick-guide-to-hadoop-map-reduce-frameworks.html)
- [Python course: Lambda, filter, reduce and map](http://www.python-course.eu/lambda.php)
- [Mastering Python for Data Science - Samir Madhavan](https://www.packtpub.com/big-data-and-business-intelligence/mastering-python-data-science)
- [Implementing MapReduce with multiprocessing](https://pymotw.com/2/multiprocessing/mapreduce.html)
- [Parallel MapReduce in Python in Ten Minutes](https://mikecvet.wordpress.com/2010/07/02/parallel-mapreduce-in-python/)

# Data processing through MapReduce

![MapReduce](http://mm-tom.s3.amazonaws.com/blog/MapReduce.png)

# Python Map Reduce

We will compute the Euclidean norm of a vector with this process.

In [3]:
V = [4,1,2,3]

- The Python function `map(func, seq)` applies the function `func` to every element of the sequence `seq`. In Python 3 it returns an iterator over the transformed elements, not a list.

- The function `reduce(func, seq)` applies `func` cumulatively to the elements of `seq`, from left to right, and returns a single value. For example, `reduce(f, [1, 2, 3, 4, 5])` computes `f(f(f(f(1,2),3),4),5)`.
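
To make the evaluation order concrete, here is a small sketch. The helper `show` is a hypothetical name introduced only for this illustration: it builds a string instead of computing a number, so the nesting performed by `reduce` becomes visible.

```python
from functools import reduce

def show(a, b):
    # Build a textual trace of the call instead of a numeric result
    return "f({},{})".format(a, b)

# map is lazy in Python 3; list() materializes the results
squares = list(map(lambda x: x * x, [1, 2, 3, 4, 5]))

# reduce combines the elements pairwise, from left to right
trace = reduce(show, [1, 2, 3, 4, 5])

print(squares)  # [1, 4, 9, 16, 25]
print(trace)    # f(f(f(f(1,2),3),4),5)
```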

In [4]:
from operator import add
from functools import reduce
from math import sqrt

f = lambda x: x*x   # function applied to each element
L = map(f, V)       # map returns an iterator
s = reduce(add, L)  # reduce computes the sum of the squares
sqrt(s) == sqrt(sum(map(f,V)))

True

# Wordcount Example

[WordCount](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0) is a simple application that counts the number of occurrences of each word in a given input set.

The input and the output are text files; each line of the output contains a word and its count, separated by a tab.

Each mapper takes a line of text as input and breaks it into words. It then emits a key/value pair made of the word and 1 (separated by a tab). Each reducer sums the counts for each word and emits a single key/value pair with the word and its sum.
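
The whole flow can be simulated in memory before involving Hadoop. The following is a minimal sketch, not the Hadoop implementation: the functions `mapper` and `reducer` and the sample `lines` are assumptions made for this illustration, and the sort stands in for the shuffle phase that Hadoop performs between the two steps.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit one (word, 1) pair per word of the line
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    # Sum all the counts emitted for one word
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: apply the mapper to every line and flatten the result
pairs = [pair for line in lines for pair in mapper(line)]

# Shuffle: sort by key so that equal words become adjacent
pairs.sort(key=itemgetter(0))

# Reduce: one call per distinct word
result = [reducer(word, (count for _, count in group))
          for word, group in groupby(pairs, key=itemgetter(0))]

print(result)
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```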

This Python code uses the [Hadoop Streaming API](http://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html) to pass data between our Map and Reduce code via Python’s sys.stdin (standard input) and sys.stdout (standard output). 

In [5]:
%mkdir -p hadoop

# Map 

The following Python code reads data from `sys.stdin`, splits it into words and writes lines mapping each word to its (intermediate) count to `sys.stdout`. For every word it immediately outputs a `<word> 1` pair.


In [93]:
from lorem import text
t = text()

with open("hadoop/sample.txt", "w") as sample:
    sample.write(t)

print(t)

Modi eius quaerat voluptatem sed adipisci quisquam. Modi aliquam porro est voluptatem aliquam voluptatem neque. Quisquam tempora quisquam aliquam amet consectetur labore modi. Sit adipisci quaerat ut. Quisquam dolor tempora dolore amet eius. Quaerat sit velit eius non dolor velit.

Eius porro non ut ipsum aliquam. Tempora ut etincidunt velit quisquam dolor modi velit. Sit quisquam non dolorem dolor quisquam. Numquam est etincidunt quiquia consectetur porro magnam neque. Dolorem est amet sed tempora. Labore aliquam porro numquam dolorem numquam labore dolorem. Numquam non quaerat dolorem adipisci.

Sed quaerat velit aliquam. Porro amet est dolore. Numquam ut sit amet dolore. Quisquam sit neque quaerat dolorem non. Est dolor ipsum dolore. Voluptatem dolorem etincidunt etincidunt. Etincidunt dolorem neque dolore labore consectetur.

Porro etincidunt eius adipisci dolorem velit. Est adipisci labore dolore etincidunt. Est labore ipsum est labore quisquam quaerat. Eius adipisci dolorem magna

In [103]:
%%file hadoop/mapper.py
#!/usr/bin/env python3
import sys
import string

# translation table that deletes every punctuation character (Python 3)
table = str.maketrans('', '', string.punctuation)

# input comes from standard input
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # strip punctuation
    line = line.translate(table)

    # split the line into words
    words = line.split()
    # emit one pair per word
    for word in words:
        # write the results to standard output;
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

Overwriting hadoop/mapper.py


In [104]:
!chmod +x hadoop/mapper.py 

# Reduce 

The following code reads the results of mapper.py, sums the occurrences of each word into a final count, and then writes its results to `sys.stdout`.
Remember that Hadoop sorts the map output by key, so all the lines for a given word arrive consecutively and can be counted in a single pass.
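
The dependence on sorted input can be checked on a toy example. In this sketch, `reduce_stream` is a hypothetical generator that follows the same running-count logic as a streaming reducer, and `mapped` is made-up mapper output; running it on unsorted and on sorted lines shows the difference.

```python
def reduce_stream(lines):
    """Sum consecutive counts per word from 'word<TAB>count' lines."""
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            # the key changed: emit the finished word, start a new one
            if current_word is not None:
                yield current_word, current_count
            current_word, current_count = word, int(count)
    # emit the last word, if there was any input
    if current_word is not None:
        yield current_word, current_count

mapped = ["sit\t1", "amet\t1", "sit\t1", "amet\t1", "sit\t1"]

unsorted_result = list(reduce_stream(mapped))
sorted_result = list(reduce_stream(sorted(mapped)))

print(unsorted_result)  # the same word is emitted several times
print(sorted_result)    # [('amet', 2), ('sit', 3)]
```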


In [105]:
%%file hadoop/reducer.py
#!/usr/bin/env python3
import sys


current_word = None
current_count = 0
word = None

# input lines
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    
    # parse the input we got from mapper.py and convert
    # count (currently a string) to int
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        # the line was malformed or count was not a number,
        # so silently ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to sys.stdout
            print ('{}\t{}'.format(current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word, if there was any input
if current_word is not None:
    print('{}\t{}'.format(current_word, current_count))

Overwriting hadoop/reducer.py


In [106]:
!chmod +x hadoop/reducer.py

# Test

In [107]:
!cat hadoop/sample.txt | ./hadoop/mapper.py | sort | ./hadoop/reducer.py



# Multiprocessing version

The `Pool` class of the `multiprocessing` module provides a `map` method. It automatically partitions the input and distributes it to a user-specified function running in a pool of worker processes.
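
As a rough sketch of this idea, the following example counts words over several chunks of text. The names `count_words`, `merge` and `parallel_wordcount`, the sample chunks and the choice of two worker processes are all assumptions made for this illustration.

```python
from collections import Counter
from multiprocessing import Pool

def count_words(text):
    # Map step: count the words of one chunk of text
    return Counter(text.split())

def merge(counters):
    # Reduce step: add the partial counts together
    total = Counter()
    for partial in counters:
        total.update(partial)
    return total

def parallel_wordcount(chunks, processes=2):
    # Pool.map distributes the chunks among the worker processes
    # and returns the partial results in the original order
    with Pool(processes) as pool:
        partial_counts = pool.map(count_words, chunks)
    return merge(partial_counts)

if __name__ == "__main__":
    chunks = ["sit amet sit", "amet dolor", "sit dolor dolor"]
    print(parallel_wordcount(chunks))
```

The worker function must be picklable, which is why `count_words` is defined at module level. Depending on the platform's process start method, functions defined interactively may not be visible to the workers, so it can be safer to keep them in an importable module.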

## Convert mapper to a Python function named `words`

The function takes the path of a text file as argument and returns a list of key-value pairs (word, 1).



In [116]:
import string

def words(file):
    """
    Read a text file and return a list of (word, 1) values.
    """
    # translation table that deletes every punctuation character
    table = str.maketrans('', '', string.punctuation)
    output = []
    with open(file) as f:
        for line in f:
            line = line.strip()
            line = line.translate(table)
            for word in line.split():
                output.append((word, 1))
    return output

In [117]:
words('hadoop/sample.txt')



The [defaultdict](https://docs.python.org/3.6/library/collections.html#collections.defaultdict) class from the `collections` module is convenient for grouping the (word, 1) pairs by word.
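
One possible use of `defaultdict` here is sketched below, under the assumption that the pairs come as a list of `(word, 1)` tuples. The helper name `partition` and the sample pairs are hypothetical and only serve the illustration.

```python
from collections import defaultdict

def partition(pairs):
    # Group the values by key; a missing key starts as an empty list
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

pairs = [("sit", 1), ("amet", 1), ("sit", 1), ("dolor", 1), ("sit", 1)]
groups = partition(pairs)

# Reduce each group to a single count
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'sit': 3, 'amet': 1, 'dolor': 1}
```

Unlike the streaming reducer, this grouping does not need sorted input, but it keeps every key in memory.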



In [12]:


# Deploying the MapReduce code on Hadoop

Download some books
* The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
* The Notebooks of Leonardo Da Vinci
* Ulysses by James Joyce
* The Art of War by 6th cent. B.C. Sunzi
* The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle
* The Devil’s Dictionary by Ambrose Bierce
* Encyclopaedia Britannica, 11th Edition, Volume 4, Part 3




* Copy books to HDFS
* Run the WordCount MapReduce

Makefile

In [13]:
%%file hadoop/Makefile
HADOOP_TOOLS=/usr/local/Cellar/hadoop/2.8.0/libexec/share/hadoop/tools/lib/
HDFS_DIR=/user/${USER}
default:
	echo "coucou"
run_with_hadoop:
	hadoop jar ${HADOOP_TOOLS}/hadoop-streaming-2.8.0.jar \
    -file  ${PWD}/mapper.py  -mapper  ${PWD}/mapper.py \
    -file  ${PWD}/reducer.py -reducer ${PWD}/reducer.py \
    -input ${HDFS_DIR}/books/* -output ${HDFS_DIR}/output-hadoop

run_with_yarn:
	yarn jar ${HADOOP_TOOLS}/hadoop-streaming-2.8.0.jar \
	-file  ${PWD}/mapper.py  -mapper  ${PWD}/mapper.py \
	-file  ${PWD}/reducer.py -reducer ${PWD}/reducer.py \
	-input ${HDFS_DIR}/books/* -output ${HDFS_DIR}/output-yarn


Writing hadoop/Makefile


### Run

```bash
$ make run_with_hadoop

$ make run_with_yarn
```


In [None]:
# Py