# A more pythonic MapReduce

With Hadoop Streaming we switched MapReduce from Java to Python.

How could we improve further?

What problems do we deal with?

We have to deal with HDFS directly
* move input files
* retrieve outputs
* humans make mistakes

Debugging is not easy

* logs via the JobTracker
* errors are Java stack traces, often unrelated to the real problem

We have to write two different files
* one for the mapper
* one for the reducer
* not a **module**

How could we shape the MapReduce job into Python syntax?

In [3]:
class MapReduce():
    
    def mapper(self, line):
        pass
    def reducer(self, sorted_line):
        pass
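
As a rough sketch of what such a class would need from a framework, the plain-Python driver below feeds lines to `mapper`, sorts the emitted pairs by key (standing in for Hadoop's shuffle), and hands each key's values to `reducer`. The names `WordCount` and `run_local` are hypothetical, and the reducer here takes a key plus its values rather than a sorted line; this is an illustration only, not how any real library is implemented.

```python
from itertools import groupby
from operator import itemgetter


class WordCount:
    """Hypothetical job: the class only describes the map and reduce logic."""

    def mapper(self, line):
        # Emit a (word, 1) pair for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, key, values):
        # Called once per key with all the values emitted for that key.
        yield key, sum(values)


def run_local(job, lines):
    """Hypothetical driver: map, sort by key (the "shuffle"), then reduce."""
    pairs = [pair for line in lines for pair in job.mapper(line)]
    pairs.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = (value for _, value in group)
        results.extend(job.reducer(key, values))
    return results


counts = run_local(WordCount(), ["Hello World", "hello Hadoop"])
print(counts)
```

Everything outside the class is plumbing that a library could provide, which is the gap the next section fills.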

# MRJob 
A more pythonic MapReduce library from Yelp

<img src='https://avatars1.githubusercontent.com/u/49071?v=3&s=400' width=300>

> “Easiest route to Python programs that run on Hadoop”

Install with: 
```bash
pip install mrjob
```

### How does mrjob work?

* Python module built **on top of Hadoop Streaming**
* Wraps HDFS pre- and post-processing when Hadoop is available
* Automatically serializes/deserializes the data flowing in and out of each task
    - e.g. JSON: json.loads() and json.dumps()
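
To illustrate the serialization step, here is a minimal sketch that encodes a key-value pair as one line of text, with the key and the value each JSON-encoded and separated by a tab. The helper names `encode_pair` and `decode_pair` are hypothetical and are not part of mrjob.

```python
import json


def encode_pair(key, value):
    # Hypothetical helper: JSON-encode key and value, join them with a tab.
    return json.dumps(key) + "\t" + json.dumps(value)


def decode_pair(line):
    # Hypothetical helper: split on the first tab, then JSON-decode both halves.
    raw_key, raw_value = line.split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


line = encode_pair("hello", 2)
print(line)
print(decode_pair(line))
```

Splitting on the first tab is safe here because `json.dumps()` escapes any tab inside a string as `\t`.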

A job is defined by a class that extends `MRJob`

* It contains methods that define the steps of a Hadoop job
* A “step” consists of a mapper, a combiner, and a reducer
* All of those are optional, though you must have at least one


In [4]:
from mrjob.job import MRJob
from mrjob.step import MRStep

In [5]:
class myjob(MRJob):
    def mapper(self, _, line):
        pass
    def combiner(self, key, values):
        pass
    def reducer(self, key, values):
        pass
    def steps(self):
        return [ MRStep(mapper=self.mapper, combiner=self.combiner, reducer=self.reducer) ]

### Running modes

Test your code locally without installing Hadoop 

...or run it on a cluster of your choice!

The same code can run locally, on Hadoop, or on Amazon **Elastic MapReduce** (EMR)


## WordCount

### Mapper

The mapper() method takes a key and a value as args

```
    def mapper(self, _, line):
        pass
```

* E.g. key is ignored and a single line of text input is the value

* Yields as many key-value pairs as it likes

### Reducer

The reducer() method takes a key and an iterator of values

```
    def reducer(self, key, values):
        pass
```

* Also yields as many key-value pairs as it likes
    * E.g. it sums the values for each key
* In a word count, the values are the occurrence counts emitted by the mapper for each word



In [7]:
%%writefile wordcount.py

from mrjob.job import MRJob
class MRWordCount(MRJob):
    """ Wordcount with MapReduce in a pythonic way"""

    def mapper(self, key, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCount.run()

Overwriting mymrjob/wordcount.py


Note!
    
Thanks to mrjob and generators, the reducer does not need to check when the key changes: it is called once per key, with a generator over that key's values.
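
For comparison, a reducer written for plain Hadoop Streaming receives sorted `key<TAB>value` lines and has to detect the key change itself. The sketch below shows that bookkeeping on an in-memory list; the function name `streaming_reducer` and the sample lines are made up for illustration.

```python
def streaming_reducer(sorted_lines):
    """Sum counts per key from sorted 'key<TAB>count' lines, Streaming style."""
    current_key = None
    current_sum = 0
    for line in sorted_lines:
        key, count = line.rstrip("\n").split("\t")
        if key != current_key:
            # The key changed: flush the total of the previous key, if any.
            if current_key is not None:
                yield current_key, current_sum
            current_key = key
            current_sum = 0
        current_sum += int(count)
    # Do not forget the last key once the input is exhausted.
    if current_key is not None:
        yield current_key, current_sum


lines = ["bye\t1", "hello\t1", "hello\t1", "world\t1"]
totals = list(streaming_reducer(lines))
print(totals)
```

With mrjob this bookkeeping disappears, since the reducer receives one key together with all of its values.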

## I/O

You can pass input via stdin but be aware that mrjob will just dump it to a file first:
```bash
$ python my_job.py < input.txt
```

You can pass multiple input files, mixed with stdin (using the `-` character):
```bash
$ python my_job.py input1.txt input2.txt - < input3.txt
```

By default, output will be written to stdout.
```bash
$ python my_job.py input.txt
```


In [None]:
# How you should execute in Batch Mode
# Note: MrJob cannot run interactively inside a cell

! python MYPYTHONFILE.py MYINPUT.txt 1> OUTPUT.txt 2> STDERR.log
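
Once the job has written its output file, each line holds a JSON-encoded key and value separated by a tab (with the default settings used in this chapter). A small sketch of loading such output back into Python follows; the function name `parse_output` is hypothetical, and the sample text stands in for the contents of `OUTPUT.txt`.

```python
import json


def parse_output(text):
    """Turn 'json_key<TAB>json_value' lines into a dict."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        raw_key, raw_value = line.split("\t", 1)
        result[json.loads(raw_key)] = json.loads(raw_value)
    return result


# Sample text standing in for the contents of OUTPUT.txt
sample = '"bye"\t1\n"goodbye"\t1\n"hadoop"\t2\n"hello"\t2\n"world"\t2\n'
counts = parse_output(sample)

# Most frequent words first, ties broken alphabetically
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(top)
```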

Here's an empty **template** to copy and paste:

```python
# -*- coding: utf-8 -*-
from mrjob.job import MRJob
from mrjob.step import MRStep

class job(MRJob):
    def mapper(self, _, line):
        pass
    def reducer(self, key, values):
        pass
    def steps(self):
        return [MRStep(mapper=self.mapper, reducer=self.reducer)]

if __name__ == "__main__":
    job.run()
```

Note: with `MRStep` you define the steps of the job, so a single job can chain several map/reduce iterations.
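
As an illustration of chaining steps, the plain-Python sketch below simulates two of them: the first counts words, the second consumes the first step's output to find the most frequent word. The helper `run_step` and the mapper/reducer functions are hypothetical stand-ins that only mimic the map, sort, reduce flow; in a real job each stage would be an `MRStep` returned from `steps()`.

```python
from itertools import groupby
from operator import itemgetter


def run_step(pairs, mapper, reducer):
    """Hypothetical single step: map every pair, sort by key, reduce per key."""
    mapped = [out for key, value in pairs for out in mapper(key, value)]
    mapped.sort(key=itemgetter(0))
    reduced = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        reduced.extend(reducer(key, [value for _, value in group]))
    return reduced


# Step 1: classic word count.
def count_mapper(_, line):
    for word in line.split():
        yield word.lower(), 1


def count_reducer(word, ones):
    yield word, sum(ones)


# Step 2: send every (count, word) to one shared key and keep the maximum.
def max_mapper(word, count):
    yield "max", (count, word)


def max_reducer(_, count_word_pairs):
    count, word = max(count_word_pairs)
    yield count, word


lines = [(None, "hello world"), (None, "hello hadoop hello")]
step1 = run_step(lines, count_mapper, count_reducer)
step2 = run_step(step1, max_mapper, max_reducer)
print(step2)
```

The output of one step is simply the input of the next, which is why the second mapper receives `(word, count)` pairs instead of raw lines.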

## Exercise

Count the occurrences of each vowel in the Divine Comedy using an MRJob class

## Running on Hadoop

By default, mrjob runs your job in your local (normal) environment, in a single Python process.

You change the way the job is run with the `-r/--runner` option:

```
-r inline, -r local, -r hadoop, or -r emr
```

Also use the `--verbose` option to show all the steps

So we just need to add `-r hadoop`.

<small>Note: The `capture` *magic* is another way we could handle output.</small>

In [11]:
%%capture hadoop_out
# Execute MrJob on Hadoop and let the magic handle outputs
! python $myscript $myinput -r hadoop 2> $mylog

In [12]:
hadoop_out.show()

"bye"	1
"goodbye"	1
"hadoop"	2
"hello"	2
"world"	2


In [13]:
%cat $mylog

Using configs in /etc/mrjob.conf
Using Hadoop version 2.6.0
Copying local files to hdfs:///user/jovyan/tmp/mrjob/wordcount.jovyan.20160315.231008.131608/files/...
Running step 1 of 1...
  packageJobJar: [/tmp/hadoop-unjar8386463698946503097/] [] /tmp/streamjob581799815579415374.jar tmpDir=null
  Connecting to ResourceManager at /0.0.0.0:8032
  Connecting to ResourceManager at /0.0.0.0:8032
  Total input paths to process : 1
  number of splits:2
  Submitting tokens for job: job_1458083121911_0001
  Submitted application application_1458083121911_0001
  The url to track the job: http://76752dc90450:8088/proxy/application_1458083121911_0001/
  Running job: job_1458083121911_0001
  Job job_1458083121911_0001 running in uber mode : false
   map 0% reduce 0%
   map 100% reduce 0%
   map 100% reduce 100%
  Job job_1458083121911_0001 completed successfully
  Output directory: hdfs:///user/jovyan/tmp/mrjob/wordcount.jovyan.20160315.231008.131608/output
Counters: 49
	File Inp

## Protocols

mrjob adds many conveniences!

Some of them can be expensive for heavy computations.

For example, the data transferred between MapReduce tasks is serialized in JSON format.

In [None]:
import mrjob.job
import mrjob.protocol

class MyMRJob(mrjob.job.MRJob):

    # these are the defaults
    INPUT_PROTOCOL = mrjob.protocol.RawValueProtocol
    INTERNAL_PROTOCOL = mrjob.protocol.JSONProtocol
    OUTPUT_PROTOCOL = mrjob.protocol.JSONProtocol


Let's see what it means

In [14]:
%%writefile protocol.py
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, key, line):
        for word in line.split(' '):
            yield word.lower(), 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCount.run()

Writing protocol.py


In [15]:
! python protocol.py /data/lectures/data/books/twolines.txt 2> /dev/null

"bye"	1
"goodbye"	1
"hadoop"	2
"hello"	2
"world"	2


In [16]:
%%writefile protocol.py
from mrjob.job import MRJob
from mrjob.protocol import PickleProtocol

class MRWordCount(MRJob):

    # Optimization on internal protocols
    INTERNAL_PROTOCOL = PickleProtocol
    OUTPUT_PROTOCOL = PickleProtocol
    
    def mapper(self, key, line):
        for word in line.split(' '):
            yield word.lower(), 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCount.run()

Overwriting protocol.py


In [17]:
! python protocol.py /data/lectures/data/books/twolines.txt 2> /dev/null

\x80\x03X\x03\x00\x00\x00byeq\x00.	\x80\x03K\x01.
\x80\x03X\x05\x00\x00\x00helloq\x00.	\x80\x03K\x02.
\x80\x03X\x05\x00\x00\x00worldq\x00.	\x80\x03K\x02.
\x80\x03X\x06\x00\x00\x00hadoopq\x00.	\x80\x03K\x02.
\x80\x03X\x07\x00\x00\x00goodbyeq\x00.	\x80\x03K\x01.
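
The escaped text above is simply the pickled form of each key and value. As a sketch, the snippet below encodes the same pair with JSON and with pickle, then decodes the `bye` line of that output. It assumes the escaped text corresponds to the raw bytes shown, and it uses only the standard library rather than mrjob's protocol classes.

```python
import json
import pickle

key, value = "bye", 1

# JSON: human-readable text that other tools can parse
print(json.dumps(key), json.dumps(value))

# Pickle: binary and Python-specific
print(pickle.dumps(key, protocol=3), pickle.dumps(value, protocol=3))

# Decoding the "bye" line of the output above (assumed raw bytes)
raw_key = b"\x80\x03X\x03\x00\x00\x00byeq\x00."
raw_value = b"\x80\x03K\x01."
decoded = (pickle.loads(raw_key), pickle.loads(raw_value))
print(decoded)
```

Because pickled output can only be read back by Python, it is arguably a better fit for `INTERNAL_PROTOCOL` than for output that other tools or people need to read.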


### Runners

http://mrjob.readthedocs.org/en/latest/guides/runners.html

You cannot use the programmatic runner functionality in the same file as your job class.


In [None]:
## RUNNER: JUST AN EXAMPLE

# Load class for mapreduce
from mypackage import SomeClassWeWrote
#from job import MRcoverage
from mrjob.util import log_to_stream

if __name__ == '__main__':

    # Create the job object
    job = SomeClassWeWrote(args=[
        '-r', 'inline',
        # '-r', 'local',
        # '-r', 'hadoop',
        # '--jobconf=mapreduce.job.maps=10',
        # '--jobconf=mapreduce.job.reduces=4',
        # '--jobconf=stream.recordreader.compression=bz2',
    ])

    # Run and handle output
    with job.make_runner() as runner:

        # Redirect hadoop logs to stderr
        log_to_stream("mrjob")
        # Execute the job
        runner.run()

        # Do something with stream output (e.g. file, database, etc.)
        for line in runner.stream_output():
            key, value = job.parse_output_line(line)
            print(key, value)

# Recap with mrjob

You should really read the [documentation of the latest version](http://mrjob.readthedocs.org/en/latest/).

It covers every need without getting too complicated.
There are many other options and advanced behaviours to discover.

<small>
Note 1: We are using the most recent version (*release* **v.0.5.0dev**) because it's the first one to support Python 3.

Note 2: The developers are there to help you; see my case in https://github.com/Yelp/mrjob/issues/1142
</small>



**Pros**

* More documentation than any other framework or library
* Write code in a single class (per Hadoop job)
    * Map and Reduce are single methods
    * Very clean and simple
* Advanced configuration
    * Configure multiple steps
    * Handle command line options inside the python code (see docs)
* Easily wrap input/output
    * No manual data copying to or from HDFS
* Hadoop logs and errors go directly into the script output
* **Switch environment without changing the code...!**

**Cons**

* Doesn’t give you the same level of access to Hadoop APIs 
    - Better: Dumbo and Pydoop
    - Other libraries can be faster if you use typedbytes

### Comparison
<img src='http://blog.cloudera.com/wp-content/uploads/2013/01/features.png'>

### Performance
<img src='http://blog.cloudera.com/wp-content/uploads/2013/01/performance.png'>

source: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/

*One final note*: 
> Open source is a great thing

The community listens to you:
https://github.com/Yelp/mrjob/issues/1142

# End of Chapter

In [4]:
%reload_ext watermark
%watermark -a "Paolo D." -d -v -m

Paolo D. 2016-04-19 

CPython 3.4.3
IPython 4.1.2

compiler   : GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)
system     : Darwin
release    : 15.5.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit
