# A more pythonic MapReduce

With Hadoop Streaming we switched MapReduce from Java to Python.

How could we improve further?

What problems do we deal with?

We have to deal with HDFS directly
* move input files
* retrieve outputs
* humans make mistakes

Debugging is not easy

* logs via jobtracker
* errors are Java stack traces, often unrelated to the real problem

We have to write two different files
* one for the mapper
* one for the reducer
* not a **module**

Is there a part of the Python language that can help represent a MapReduce task?

In [None]:
# A python class

class myclass(object):
    
    a_property = 42
    
    def a_method(self):
        pass
    def another_method(self, var):
        print(var)


In [None]:
# Create instance of our class
instance = myclass()
# Use it
instance.a_method()
instance.another_method("test")
instance.a_property

In [None]:
class MapReduce(object):
    """ A MapReduce class prototype """
    
    def mapper(self, line):
        pass
    def reducer(self, sorted_line):
        pass

# MRJob 
A more pythonic MapReduce library from Yelp

<img src='https://avatars1.githubusercontent.com/u/49071?v=3&s=400' width=300>

> “Easiest route to Python programs that run on Hadoop”

Install with: 
```bash
pip install mrjob
```

**Running modes**
* Test your code locally without installing Hadoop 
* or run it on a cluster of your choice!
    - Integrates with Amazon **Elastic MapReduce** (EMR)
    - same code for local, Hadoop, and EMR runs
    - as easy to run your job in the cloud as on your laptop

### How does MrJob work?

* A Python module built on top of Hadoop Streaming
    - the streaming jar opens a subprocess running your code
    - sends it input via stdin
    - gathers results via stdout
* Wraps HDFS pre- and post-processing if Hadoop exists
* Offers a consistent interface across every environment it supports
* Automatically serializes/deserializes the data flowing in and out of each task
    - JSON: `json.loads()` and `json.dumps()`
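
To make the serialization step concrete, here is a minimal sketch of how one key-value pair could travel as a line of text between tasks. It assumes a tab-separated layout with the key and the value each encoded as JSON, like the lines visible in the job output later in this chapter. The helper names `encode_pair` and `decode_pair` are hypothetical and are not part of mrjob.

```python
import json


def encode_pair(key, value):
    """Encode one key-value pair as two JSON documents separated by a tab."""
    return json.dumps(key) + "\t" + json.dumps(value)


def decode_pair(line):
    """Decode a line produced by encode_pair back into (key, value)."""
    # json.dumps escapes tab characters inside strings, so the first
    # literal tab in the line is always the separator.
    raw_key, raw_value = line.rstrip("\n").split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


line = encode_pair("hello", 2)
print(line)
print(decode_pair(line))
```

Because both sides are plain JSON, keys and values can be strings, numbers, lists, or dictionaries without any extra work in the mapper or reducer.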

## Getting hands dirty

* A job is defined by a class that extends `MRJob`
* It contains methods that define the steps of a Hadoop job
* A “step” consists of a mapper, a combiner, and a reducer
* All of those are optional, though you must have at least one

In [None]:
class myjob(MRJob):
    def mapper(self, _, line):
        pass
    def combiner(self, key, values):
        pass
    def reducer(self, key, values):
        pass
    def steps(self):
        return [ MRStep(mapper=self.mapper, … ), … ]

## WordCount

### Mapper

* The `mapper()` method takes a key and a value as arguments
    - e.g. the key is ignored and a single line of text input is the value
* It yields as many key-value pairs as it likes

Warning: `yield` != `return`. A function that uses `yield` returns a generator, the kind of object you usually consume with a loop such as `for i in generator: print(i)`.

Example
```python
def mygen():
    for i in range(1, 10):
        # THIS IS WHAT HAPPENS INSIDE A MAPPER
        yield i, "value"

for key, value in mygen():
    print(key, value)
```

### Reducer

* The `reducer()` method takes a key and an iterator of values
* It also yields as many key-value pairs as it likes
    - e.g. it sums the values for each key
    - e.g. the sums can represent the number of characters, words, and lines in the initial input

Example
```python
def mygen():
    for i in range(1, 10):
        yield i, "value"

for key, value in mygen():
    # THIS IS WHAT HAPPENS INSIDE A REDUCER
    print(key, value)
```
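
To see how the mapper and reducer generators fit together, the whole map, sort, and reduce cycle can be imitated in memory with plain Python. This is only a sketch of the idea, not how mrjob or Hadoop actually schedule work; the function `run_locally` and the two sample lines are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def mapper(_, line):
    """Emit (word, 1) for every word in the line; the key is ignored."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, occurrences):
    """Sum the occurrences collected for a single word."""
    yield word, sum(occurrences)


def run_locally(lines):
    """Imitate map -> sort -> reduce in memory (illustration only)."""
    mapped = [pair for line in lines for pair in mapper(None, line)]
    mapped.sort(key=itemgetter(0))  # stands in for the shuffle/sort phase
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        values = (value for _, value in group)  # iterator of values per key
        results.extend(reducer(key, values))
    return results


sample = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
for word, count in run_locally(sample):
    print(word, count)
```

The reducer receives an iterator rather than a list, which is why it can be consumed only once; `sum()` does exactly one pass.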

## Let's write our job

In [13]:
# Little configuration

mydir = "mymrjob"
%env mydir = $mydir
myinput = "/data/worker/books/twolines.txt"
%env myinput $myinput
myscript = mydir + "/wordcount.py"
%env myscript $myscript

%system mkdir -p $mydir
%env myoutput $mydir/out.txt
%env mylog $mydir/out.log

env: mydir=mymrjob
env: myinput=/data/worker/books/twolines.txt
env: myscript=mymrjob/wordcount.py
env: myoutput=mymrjob/out.txt
env: mylog=mymrjob/out.log


Create the job file

In [14]:
%%writefile $myscript

from mrjob.job import MRJob
class MRWordCount(MRJob):
    """ Wordcount with MapReduce in a pythonic way"""

    def mapper(self, key, line):
        for word in line.split(' '):
             yield word.lower(), 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCount.run()

Overwriting mymrjob/wordcount.py


## I/O

You can pass input via stdin but be aware that mrjob will just dump it to a file first:
```bash
$ python my_job.py < input.txt
```

You can pass multiple input files, mixed with stdin (using the `-` character):
```bash
$ python my_job.py input1.txt input2.txt - < input3.txt
```

By default, output will be written to stdout.
```bash
$ python my_job.py input.txt
```


In [15]:
# Execute MrJob
! python $myscript $myinput 1> $myoutput 2> $mylog

Note again: if this command takes minutes, you can check what is happening in your Hadoop JobTracker:
http://localhost:8088/cluster

In [41]:
%cat $myoutput

"bye"	1
"goodbye"	1
"hadoop"	2
"hello"	2
"world"	2


In [42]:
%cat $mylog

Unexpected option hdfs_tmp_dir
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/wordcount.root.20151013.105600.184640
writing wrapper script to /tmp/wordcount.root.20151013.105600.184640/setup-wrapper.sh
Using Hadoop version 2.6.0
Copying local files into hdfs:///user/root/tmp/mrjob/wordcount.root.20151013.105600.184640/files/
HADOOP: packageJobJar: [/tmp/hadoop-unjar7755496883310791297/] [] /tmp/streamjob2462321277901270862.jar tmpDir=null
HADOOP: Connecting to ResourceManager at /0.0.0.0:8032
HADOOP: Connecting to ResourceManager at /0.0.0.0:8032
HADOOP: Total input paths to process : 1
HADOOP: number of splits:2
HADOOP: Submitting tokens for job: job_1444732540385_0001
HADOOP: Submitted application application_1444732540385_0001
HADOOP: The url to track the job: http://ipyhadoop:8088/proxy/application_1444732540385_0001/
HADOOP: Running job: job_1444732540385_0001
HADOOP: Jo

Here's an empty **template** to copy and paste as a starting point

In [None]:
%%writefile SPECIFY_A_FILENAME.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
""" MapReduce easily with Python """

from mrjob.job import MRJob
from mrjob.step import MRStep

class job(MRJob):
    def mapper(self, _, line):
        pass
    def reducer(self, key, values):
        pass
    def steps(self):
        return [ MRStep(mapper=self.mapper, reducer=self.reducer)]

if __name__ == "__main__":
    job.run()

## Exercise

Convert your vowel-counting exercise on the Divine Comedy into an MRJob class

## Running on Hadoop

By default, mrjob will run your job in your local (normal) environment, in a single Python process.

You change the way the job is run with the `-r/--runner` option:

```
-r inline, -r local, -r hadoop, or -r emr
```

Also use the `--verbose` option to show all the steps.

So we just need to add `-r hadoop`.

<small>Note: The `capture` *magic* is another way we could handle output.</small>

In [52]:
%%capture hadoop_out
# Execute MrJob on Hadoop and let the magic handle outputs
! python $myscript $myinput -r hadoop 2> $mylog

In [53]:
hadoop_out.show()

"bye"	1
"goodbye"	1
"hadoop"	2
"hello"	2
"world"	2


In [54]:
%cat $mylog

Unexpected option hdfs_tmp_dir
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/wordcount.root.20151013.130840.424615
writing wrapper script to /tmp/wordcount.root.20151013.130840.424615/setup-wrapper.sh
Using Hadoop version 2.6.0
Copying local files into hdfs:///user/root/tmp/mrjob/wordcount.root.20151013.130840.424615/files/
HADOOP: packageJobJar: [/tmp/hadoop-unjar6938543500754848653/] [] /tmp/streamjob3313232432800080753.jar tmpDir=null
HADOOP: Connecting to ResourceManager at /0.0.0.0:8032
HADOOP: Connecting to ResourceManager at /0.0.0.0:8032
HADOOP: Total input paths to process : 1
HADOOP: number of splits:2
HADOOP: Submitting tokens for job: job_1444732540385_0005
HADOOP: Submitted application application_1444732540385_0005
HADOOP: The url to track the job: http://ipyhadoop:8088/proxy/application_1444732540385_0005/
HADOOP: Running job: job_1444732540385_0005
HADOOP: Jo

## An example of creating notebook extensions

A pattern recurs often:

* You have to write a file
* You have to execute mrjob 
    - with or without hadoop parameter
* You have to bring back the output

Can we do this with an ipython magic?

### A MrJob dedicated extension

Since there is none, we created an extension to help users with MapReduce in Jupyter Notebook.

In [3]:
# Load the extension
%reload_ext mrjobmagic
# note: reload_ext does not raise an error if you load the extension twice

How it works:

* Cell extension

```
%%mapreduce LOCAL_INPUT [inline,hadoop]

MAPPER function
[COMBINER function]
REDUCER function
```

* Line extension

```
%mapreduce LOCAL_INPUT MR_FILE [inline,hadoop]
```
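
The implementation of the extension is not shown here. As a rough sketch of the "write a file, then build the command" part of the pattern, the cell body could be wrapped into a job class and turned into a command line like this. The helper names `build_job_source` and `build_command`, the class name `CellJob`, and the script path are all hypothetical; registering the result as an IPython magic is left out.

```python
import textwrap

# Skeleton of the generated script; the cell functions become its methods.
TEMPLATE = '''from mrjob.job import MRJob


class CellJob(MRJob):
{body}

if __name__ == "__main__":
    CellJob.run()
'''


def build_job_source(cell_body):
    """Indent the functions typed in the cell so they become class methods."""
    methods = textwrap.indent(textwrap.dedent(cell_body).strip("\n"), " " * 4)
    return TEMPLATE.format(body=methods)


def build_command(script, local_input, runner="inline"):
    """Assemble a command line like the one printed by the magic's log."""
    return ["python3", script, "-r", runner, local_input]


cell = """
def mapper(self, _, line):
    yield "lines", 1

def reducer(self, key, values):
    yield key, sum(values)
"""
print(build_job_source(cell))
print(" ".join(build_command("jobs/example.py", "input.txt", "hadoop")))
```

Keeping the source generation separate from the execution makes it possible to save the script first and rerun it later with a different runner.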

In [4]:
%mapreduce?

In [5]:
myfile = '/data/worker/books/prince.txt'

In [6]:
%%mapreduce $myfile

def mapper(self, _, line):
    pass
def reducer(self, key, line):
    pass

Input file is /data/worker/books/prince.txt
Saving jobs/script_000090887.py
Executing python3 jobs/script_000090887.py -r inline /data/worker/books/prince.txt
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/script_000090887.root.20151013.131851.841947
writing to /tmp/script_000090887.root.20151013.131851.841947/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /tmp/script_000090887.root.20151013.131851.841947/step-0-mapper-sorted
> sort /tmp/script_000090887.root.20151013.131851.841947/step-0-mapper_part-00000
writing to /tmp/script_000090887.root.20151013.131851.841947/step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
Moving /tmp/script_000090887.root.20151013.131851.841947/step-0-reducer_part-00000 -> /tmp/script_000090887.root.20151013.131851.841947/output/part-00000
Streaming final output from /tmp/script_000090887.root.20151013.131851.841947/outpu

'jobs/script_000090887.py'

---

Executing on the existing Hadoop cluster

In [None]:
%%mapreduce $file hadoop

def mapper(self, _, line):
    pass
def reducer(self, key, line):
    pass

What if I want to modify the file directly?

In [None]:
# Line execution, give a file you created or modified 
%mapreduce $file

## More than a single step

mrjob can be configured to run multiple steps.

For each step you can specify which parts (mapper, combiner, reducer) have to be executed
and which method of the class you wrote to use for each part.


In [None]:
def steps(self):
    return [
        MRStep(
            mapper=self.mapper_get_words,
            combiner=self.combiner_count_words,
            reducer=self.reducer_count_words),
        MRStep(reducer=self.reducer_find_max_word)
    ]
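
The method names in `steps()` suggest a job that counts words in the first step and picks the most frequent one in the second. The sketch below imitates that data flow in plain Python, without mrjob, so the chaining of steps is visible. The function bodies are a guess at what such methods might do, and `shuffle` and `run_steps` are hypothetical stand-ins for what the framework does between phases.

```python
import re

WORD_RE = re.compile(r"[\w']+")


def mapper_get_words(_, line):
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def combiner_count_words(word, counts):
    # Pre-aggregate counts before they reach the reducer
    yield word, sum(counts)


def reducer_count_words(word, counts):
    # Use a single key (None) so one reducer sees every (count, word) pair
    yield None, (sum(counts), word)


def reducer_find_max_word(_, count_word_pairs):
    # Each item is (count, word), so max() picks the highest count
    yield max(count_word_pairs)


def shuffle(pairs):
    """Group values by key, as the framework does between phases."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped.items()


def run_steps(lines):
    # Step 1: mapper -> combiner -> reducer
    mapped = [p for line in lines for p in mapper_get_words(None, line)]
    combined = [o for k, v in shuffle(mapped) for o in combiner_count_words(k, v)]
    counted = [o for k, v in shuffle(combined) for o in reducer_count_words(k, v)]
    # Step 2: reducer only, fed by the output of step 1
    return [o for k, v in shuffle(counted) for o in reducer_find_max_word(k, v)]


sample = ["hello world bye world", "hello hadoop goodbye hadoop hadoop"]
print(run_steps(sample))
```

On a real cluster the combiner runs on each mapper's partial output rather than on the whole dataset, which is why it must produce the same kind of pairs that the mapper emits.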

A quick note

* With MrJob you cannot connect to a **remote** Hadoop cluster. 
    - Hadoop does not allow job submissions (class or executables) from outside.
* In contrast, EMR on Amazon can be accessed from your laptop.
    - The [boto api](http://boto.readthedocs.org/en/latest/ref/emr.html) makes this possible.

# Recap with Mrjob

You should really read the [documentation of the latest version](http://mrjob.readthedocs.org/en/latest/).

It covers every need without getting too complicated.
There are many other options and advanced behaviours to discover.

<small>
Note 1: We are using the most recent version (*release* **v.0.5.0dev**) because it's the first one to support Python 3.

Note 2: The developers are there to help you; see my case at https://github.com/Yelp/mrjob/issues/1142
</small>



**Pros**

* More documentation than any other framework or library
* Write code in a single class (per Hadoop job)
    * Map and Reduce are single methods
    * Very clean and simple
* Advanced configuration
    * Configure multiple steps
    * Handle command line options inside the python code (see docs)
* Easily wrap input/output 
    * No data copy required with HDFS
* Hadoop logs and errors go directly into the script output
* **Switch environment without changing the code...!**

**Cons**

* Doesn’t give you the same level of access to Hadoop APIs 
    - Dumbo and Pydoop do better here
    - Other libraries can be faster if you use typedbytes

### Comparison
<img src='http://blog.cloudera.com/wp-content/uploads/2013/01/features.png'>

### Performance
<img src='http://blog.cloudera.com/wp-content/uploads/2013/01/performance.png'>

source: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/

# End of Chapter