# A more pythonic MapReduce

With Hadoop Streaming we switched MapReduce from Java to Python.

How could we improve further?

What problems do we deal with?

We have to deal with HDFS directly
* move input files
* recover outputs
* humans make mistakes

Debugging is not easy

* logs via jobtracker
* errors are Java stack traces, often unrelated to the real problem

We have to write two different files
* one for the mapper
* one for the reducer
* not a **module**

Is there a part of the Python language that can help represent a MapReduce task?

In [8]:
# A python class

class myclass(object):
    
    a_property = 42
    
    def a_method(self):
        pass
    def another_method(self, var):
        print(var)


In [11]:
# Create instance of our class
instance = myclass()
# Use it
instance.a_method()
instance.another_method("test")
instance.a_property

test


42

In [12]:
class MapReduce(object):
    """ A MapReduce class prototype """
    
    def mapper(self, line):
        pass
    def reducer(self, sorted_line):
        pass

# MRJob 
A more pythonic MapReduce library from Yelp

<img src='https://avatars1.githubusercontent.com/u/49071?v=3&s=400' width=300>

> “Easiest route to Python programs that run on Hadoop”

Install with: 
```bash
pip install mrjob
```

**Running modes**
* Test your code locally without installing Hadoop 
* or run it on a cluster of your choice!
    - Integrates with Amazon **Elastic MapReduce** (EMR)
    - the same code runs locally, on Hadoop, and on EMR
    - as easy to run your job in the cloud as on your laptop

### How does mrjob work?

* Python module built on top of Hadoop Streaming
    - the streaming jar opens a subprocess to your code
    - sends it input via stdin
    - gathers results via stdout
* Wraps HDFS pre- and post-processing if Hadoop exists
* Offers a consistent interface across every environment it supports
* Automatically serializes/deserializes the data flowing in and out of each task
    - JSON: `json.loads()` and `json.dumps()`
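
As a rough illustration of the serialization step, the sketch below mimics how a key-value pair can travel between tasks as a single line of text: key and value are each encoded with `json.dumps()`, joined by a tab, and later decoded with `json.loads()`. The helper names `encode_pair` and `decode_pair` are hypothetical and not part of mrjob; this is only a stdlib approximation of what a JSON-based protocol does.

```python
import json


def encode_pair(key, value):
    # Hypothetical helper: one output line = JSON key, a tab, JSON value
    return json.dumps(key) + "\t" + json.dumps(value)


def decode_pair(line):
    # Hypothetical helper: split on the first tab only, then decode both halves
    raw_key, raw_value = line.split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


line = encode_pair("word", 3)
key, value = decode_pair(line)
print(repr(line), key, value)
```

Because everything goes through JSON, keys and values should be limited to types JSON can represent (strings, numbers, lists, dictionaries, booleans, null).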

## Getting hands dirty

* A job is defined by a class that extends `MRJob`
* It contains methods that define the steps of a Hadoop job
* A “step” consists of a mapper, a combiner, and a reducer
* All of those are optional, though you must have at least one

In [None]:
class myjob(MRJob):
    def mapper(self, _, line):
        pass
    def combiner(self, key, values):
        pass
    def reducer(self, key, values):
        pass
    def steps(self):
        return [ MRStep(mapper=self.mapper, … ), … ]

## WordCount

### Mapper

* The `mapper()` method takes a key and a value as arguments
    - e.g. the key is ignored and a single line of text input is the value
* It yields as many key-value pairs as it likes
* Warning: `yield` != `return`
    - a function that uses `yield` returns a generator, the kind of object you usually consume with `for i in generator: print(i)`

Example
```python
def mygen():
    for i in range(1, 10):
        # THIS IS WHAT HAPPENS INSIDE MAPPER
        yield i, "value"

for key, value in mygen():
    print(key, value)
```
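
To make the `yield` versus `return` warning concrete, here is a minimal sketch of a word-count style mapper written as a plain function, outside of mrjob. Calling it only creates a generator object; the pairs are produced when something iterates over it. The input string and the variable names are illustrative assumptions.

```python
import types


def mapper(_, line):
    # Emit one (word, 1) pair per word; the body runs only during iteration
    for word in line.split():
        yield word.lower(), 1


gen = mapper(None, "The quick the")
# Calling the function returned a generator, not a list of pairs
is_generator = isinstance(gen, types.GeneratorType)

# Iterating (here via list) is what actually produces the pairs
pairs = list(gen)
print(is_generator, pairs)
```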

### Reducer

* The `reducer()` method takes a key and an iterator of values
* It also yields as many key-value pairs as it likes
    - e.g. it sums the values for each key
    - the sums then represent the numbers of characters, words, and lines in the initial input

Example
```python
def mygen():
    for i in range(1, 10):
        yield i, "value"

for key, value in mygen():
    # THIS IS WHAT HAPPENS INSIDE A REDUCER
    print(key, value)
```
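
The reducer only ever sees one key together with all of its values, because the framework sorts and groups the mapper output first. The sketch below imitates that grouping locally with `itertools.groupby`. The function `shuffle_and_reduce` is a hypothetical stand-in for what Hadoop does between the map and reduce phases, not an mrjob API.

```python
from itertools import groupby
from operator import itemgetter


def reducer(word, occurrences):
    # Receives one key and an iterator over all of its values
    yield word, sum(occurrences)


def shuffle_and_reduce(pairs):
    # Hypothetical local stand-in for the sort/shuffle phase:
    # sort by key, group equal keys, then call the reducer once per key
    results = []
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = (value for _, value in group)
        results.extend(reducer(key, values))
    return results


mapped = [("the", 1), ("quick", 1), ("the", 1)]
print(shuffle_and_reduce(mapped))
```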

In [None]:
%%writefile word_count.py
from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, key, line):
        # The input key is ignored; emit (word, 1) for every word in the line
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, occurrences):
        # Sum all the 1s emitted for the same word
        yield word, sum(occurrences)


if __name__ == '__main__':
    MRWordCount.run()

## Input

By default, output will be written to stdout.
$ python my_job.py input.txt

You can pass input via stdin, but be aware that mrjob will just dump it to a file first:
$ python my_job.py < input.txt

You can pass multiple input files, mixed with stdin (using the `-` character):
$ python my_job.py input1.txt input2.txt - < input3.txt

In [None]:
%%bash

# Download compressed NGS data from a link
wget -q "http://bit.ly/ngs_sample_data" -O $myfile.bz2 && echo "downloaded"
# Decompress the file
bunzip2 $myfile.bz2 && echo "decompressed"

In [None]:
! python word_count.py data/txt/2261.txt.utf-8 

mrjob can be configured to run multiple steps.

For each step you can specify which parts (mapper, combiner, reducer) have to be executed and which method of your class implements each of them.


In [None]:
def steps(self):
    return [
        MRStep(
            mapper=self.mapper_get_words,
            combiner=self.combiner_count_words,
            reducer=self.reducer_count_words),
        MRStep(reducer=self.reducer_find_max_word)
    ]
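
To show how the output of one step becomes the input of the next, here is a pure-Python sketch that chains two steps locally. The function names follow the `steps()` definition above, but their bodies are assumptions (the originals are not shown), the combiner is left out, and `run_step` and `run_job` are hypothetical helpers rather than mrjob functions.

```python
from collections import defaultdict


def mapper_get_words(_, line):
    # Step 1 mapper: one (word, 1) pair per word
    for word in line.split():
        yield word.lower(), 1


def reducer_count_words(word, counts):
    # Step 1 reducer: emit everything under the same key (None)
    # so that a single reducer call in step 2 sees all (count, word) pairs
    yield None, (sum(counts), word)


def reducer_find_max_word(_, count_word_pairs):
    # Step 2 reducer: the largest (count, word) tuple is the most used word
    yield max(count_word_pairs)


def run_step(pairs, reducer):
    # Hypothetical helper: group values by key, then reduce each group
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    for key, values in grouped.items():
        yield from reducer(key, values)


def run_job(lines):
    # Hypothetical helper: feed the output of step 1 into step 2
    mapped = (pair for line in lines for pair in mapper_get_words(None, line))
    step1 = run_step(mapped, reducer_count_words)
    step2 = run_step(step1, reducer_find_max_word)
    return list(step2)


print(run_job(["the quick brown", "the lazy dog"]))
```

The second step has no mapper, which matches `MRStep(reducer=self.reducer_find_max_word)`: the pairs emitted by the first reducer go straight to the second one.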

By default, mrjob will run your job in your local (normal) environment, in a single Python process.

You can change the way the job is run with the `-r/--runner` option:

```
-r inline, -r local, -r hadoop, or -r emr
```

Also use the `--verbose` option to show all the steps.