# A more pythonic MapReduce

With Hadoop Streaming we switched MapReduce from Java to Python.

How could we improve further?

What problems do we deal with?

We have to directly deal with HDFS
* move input files
* recover outputs
* humans make mistakes

Not easy debugging

* logs via jobtracker
* errors are Java stack traces, often unrelated to the real problem

We have to write two different files
* one for the mapper
* one for the reducer
* not a **module**

Is there a part of the Python language that can help represent a MapReduce task?

In [1]:
# A python class

class myclass(object):
    
    a_property = 42
    
    def a_method(self):
        pass
    def another_method(self, var):
        print(var)


In [2]:
# Create instance of our class
instance = myclass()
# Use it
instance.a_method()
instance.another_method("test")
instance.a_property

test


42

In [3]:
class MapReduce(object):
    """ A MapReduce class prototype """
    
    def mapper(self, line):
        pass
    def reducer(self, sorted_line):
        pass
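
To make the prototype concrete, here is a minimal sketch of how such a class could be filled in and driven locally. It assumes the reducer receives a key and its grouped values (a signature the prototype above does not have), and the names `WordCountPrototype` and `run_locally` are hypothetical, used only for illustration.

```python
from itertools import groupby
from operator import itemgetter


class WordCountPrototype(object):
    """Hypothetical fill-in of the MapReduce prototype, for illustration only."""

    def mapper(self, line):
        # Emit one (word, 1) pair per word in the line
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, key, values):
        # Sum all the counts collected for one key
        yield key, sum(values)


def run_locally(job, lines):
    """Simulate map -> sort -> reduce in a single process."""
    mapped = [pair for line in lines for pair in job.mapper(line)]
    mapped.sort(key=itemgetter(0))  # stands in for Hadoop's shuffle/sort
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        values = (value for _, value in group)
        results.extend(job.reducer(key, values))
    return results


print(run_locally(WordCountPrototype(), ["Hello World", "Bye World"]))
```

The point of the sketch is that a single class can hold both phases, while the driver takes care of everything in between.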

# MRJob 
A more pythonic MapReduce library from Yelp

<img src='https://avatars1.githubusercontent.com/u/49071?v=3&s=400' width=300>

> “Easiest route to Python programs that run on Hadoop”

Install with: 
```bash
pip install mrjob
```

**Running modes**
* Test your code locally without installing Hadoop 
* or run it on a cluster of your choice!
    - Integrates with Amazon **Elastic MapReduce** (EMR)
    - same code with local, Hadoop, EMR
    - as easy to run your job in the cloud as on your laptop

### How does MrJob work?

* Python module built **on top of Hadoop Streaming**
    - HS jar opens a subprocess to your code
    - sends it input via stdin
    - gathers results via stdout
* Wraps HDFS pre- and post-processing when Hadoop is available
* a consistent interface across every environment it supports
* automatically serializes/deserializes the data flowing in and out of each task
    - e.g. JSON: json.loads() and json.dumps()
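
The serialization step can be pictured with the standard `json` module alone. This is a minimal sketch, assuming each key-value pair travels as one line with a tab between the JSON-encoded key and value (which matches the job output shown later in this chapter). The helper names `encode_pair` and `decode_line` are hypothetical and not part of mrjob.

```python
import json


def encode_pair(key, value):
    # One line per pair: JSON key, a tab, JSON value
    return json.dumps(key) + "\t" + json.dumps(value)


def decode_line(line):
    # Reverse of encode_pair: split on the first tab, then parse both sides
    raw_key, raw_value = line.rstrip("\n").split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


line = encode_pair("hello", 2)
print(line)
print(decode_line(line))
```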

## Getting hands dirty

In [4]:
from mrjob.job import MRJob

A job is defined by a class that extends `MRJob`

* Contains methods that define the steps of a Hadoop job
* A “step” consists of a mapper, a combiner, and a reducer. 
* All of those are optional, though you must have at least one.


In [5]:
from mrjob.step import MRStep

class myjob(MRJob):
    def mapper(self, _, line):
        pass
    def combiner(self, key, values):
        pass
    def reducer(self, key, values):
        pass
    def steps(self):
        return [ MRStep(mapper=self.mapper, combiner=self.combiner, reducer=self.reducer) ]
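
The role of the combiner is easier to see on a tiny input. The sketch below uses plain functions rather than an `MRJob` class so that it runs without mrjob or Hadoop. `combine_locally` is a hypothetical helper standing in for what the framework would do with one mapper's output.

```python
from itertools import groupby
from operator import itemgetter


def mapper(_, line):
    # One (word, 1) pair per word
    for word in line.split():
        yield word.lower(), 1


def combiner(word, counts):
    # Partial sum computed on the mapper's side, before the shuffle
    yield word, sum(counts)


def combine_locally(pairs):
    """Hypothetical helper: apply the combiner to one mapper's output."""
    combined = []
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        combined.extend(combiner(word, (count for _, count in group)))
    return combined


mapped = list(mapper(None, "to be or not to be"))
combined = combine_locally(mapped)
print(len(mapped), "pairs before combining")
print(len(combined), "pairs after combining:", combined)
```

Fewer pairs leave the mapper side, and the reducer still computes the same totals because partial sums can be summed again.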

## WordCount

### Mapper

The mapper() method takes a key and a value as args

```
    def mapper(self, _, line):
        pass
```

* E.g. key is ignored and a single line of text input is the value

* Yields as many key-value pairs as it likes

### Yield?

**Warning**: `yield` != `return`

> A function that contains `yield` returns a generator, the kind of object you usually consume with `for i in generator: print(i)`
    
Example
```python
def mygen():
    for i in range(1, 10):
        # THIS IS WHAT HAPPENS INSIDE THE MAPPER/REDUCER
        yield i, "value"

for key, value in mygen():
    print(key, value)
```

### Reducer

The reducer() method takes a key and an iterator of values

```
    def reducer(self, key, values):
        pass
```

* Also yields as many key-value pairs as it likes
    * E.g. it sums the values for each key
* In the mrjob documentation example, the sums represent the numbers of characters, words, and lines in the initial input; in WordCount they are the occurrences of each word
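
Because `values` is an iterator, it can be consumed only once. A minimal sketch of that behaviour, using a plain function and a hand-built iterator in place of what the framework would pass in:

```python
def reducer(key, values):
    # Sum the values for one key; this consumes the iterator
    yield key, sum(values)


values = iter([1, 1, 1])
first_pass = list(reducer("hello", values))
second_pass = list(values)  # nothing left after the reducer has run

print(first_pass)
print(second_pass)
```

If a reducer needs the values twice (for example for a sum and a maximum), it would have to store them in a list first, at the cost of memory.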



## Let's write our job

In [6]:
# Little configuration

mydir = "mymrjob"
%env mydir = $mydir
myinput = "/data/lectures/data/books/twolines.txt"
%env myinput $myinput
myscript = mydir + "/wordcount.py"
%env myscript $myscript

%system mkdir -p $mydir
%env myoutput $mydir/out.txt
%env mylog $mydir/out.log

env: mydir=mymrjob
env: myinput=/data/lectures/data/books/twolines.txt
env: myscript=mymrjob/wordcount.py
env: myoutput=mymrjob/out.txt
env: mylog=mymrjob/out.log


Create the job file

In [7]:
%%writefile $myscript

from mrjob.job import MRJob
class MRWordCount(MRJob):
    """ Wordcount with MapReduce in a pythonic way"""

    def mapper(self, key, line):
        for word in line.split(' '):
             yield word.lower(), 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCount.run()

Writing mymrjob/wordcount.py


Note!

Thanks to mrjob and generators, the reducer does not have to check when the key changes: it receives each key once, together with an iterator over all of its values.
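
For contrast, a reducer written directly for Hadoop Streaming receives a flat, sorted stream of lines and has to detect key changes itself. The sketch below assumes tab-separated `word<TAB>count` lines. The name `streaming_reducer` is hypothetical.

```python
def streaming_reducer(sorted_lines):
    """Sum counts per word from a sorted stream of 'word<TAB>count' lines."""
    current_word, current_count = None, 0
    for line in sorted_lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            # The key changed: emit the finished word before starting a new one
            if current_word is not None:
                yield current_word, current_count
            current_word, current_count = word, 0
        current_count += int(count)
    # Do not forget the last word in the stream
    if current_word is not None:
        yield current_word, current_count


lines = ["bye\t1", "hello\t1", "hello\t1", "world\t1"]
print(list(streaming_reducer(lines)))
```

All of this bookkeeping disappears in the `reducer(self, word, occurrences)` method of the job above.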

## I/O

You can pass input via stdin but be aware that mrjob will just dump it to a file first:
```bash
$ python my_job.py < input.txt
```

You can pass multiple input files, mixed with stdin (using the `-` character):
```bash
$ python my_job.py input1.txt input2.txt - < input3.txt
```

By default, output will be written to stdout.
```bash
$ python my_job.py input.txt
```


In [8]:
# Execute MrJob
! python $myscript $myinput 1> $myoutput 2> $mylog

In [9]:
%cat $myoutput

"bye"	1
"goodbye"	1
"hadoop"	2
"hello"	2
"world"	2


In [10]:
%cat $mylog

using configs in /etc/mrjob.conf
creating tmp directory /tmp/wordcount.jovyan.20151215.064342.962245
writing to /tmp/wordcount.jovyan.20151215.064342.962245/step-0-mapper_part-00000
Counters from step 1:
  (none found)
writing to /tmp/wordcount.jovyan.20151215.064342.962245/step-0-mapper-sorted
> sort /tmp/wordcount.jovyan.20151215.064342.962245/step-0-mapper_part-00000
writing to /tmp/wordcount.jovyan.20151215.064342.962245/step-0-reducer_part-00000
Counters from step 1:
  (none found)
Moving /tmp/wordcount.jovyan.20151215.064342.962245/step-0-reducer_part-00000 -> /tmp/wordcount.jovyan.20151215.064342.962245/output/part-00000
Streaming final output from /tmp/wordcount.jovyan.20151215.064342.962245/output
removing tmp directory /tmp/wordcount.jovyan.20151215.064342.962245


Here's an empty **template** to copy and paste

In [None]:
%%writefile SPECIFY_A_FILENAME.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
""" MapReduce easily with Python """

from mrjob.job import MRJob
from mrjob.step import MRStep

class job(MRJob):
    def mapper(self, _, line):
        pass
    def reducer(self, key, values):
        pass
    def steps(self):
        return [ 
            MRStep(mapper=self.mapper, reducer=self.reducer)
        ]

if __name__ == "__main__":
    job.run()

With `MRStep` you define the steps (iterations) of the job

## Exercise

Convert your vowel-counting exercise on the Divine Comedy into an MRJob class

## Running on Hadoop

By default, mrjob will run your job in your local (normal) environment, in a single Python process.

You change the way the job is run with the `-r/--runner` option:

```
-r inline, -r local, -r hadoop, or -r emr
```

Also use the `--verbose` option to show all the steps

So we just need to add `-r hadoop`.

<small>Note: The `capture` *magic* is another way we could handle output.</small>

In [11]:
%%capture hadoop_out
# Execute MrJob on Hadoop and let the magic handle outputs
! python $myscript $myinput -r hadoop 2> $mylog

In [12]:
hadoop_out.show()

"bye"	1
"goodbye"	1
"hadoop"	2
"hello"	2
"world"	2


In [13]:
%cat $mylog

using configs in /etc/mrjob.conf
Using Hadoop version 2.6.0
Copying local files into hdfs:///user/jovyan/tmp/mrjob/wordcount.jovyan.20151215.064553.617771/files/
HADOOP: packageJobJar: [/tmp/hadoop-unjar5834422501600369026/] [] /tmp/streamjob4169325867795510076.jar tmpDir=null
HADOOP: Connecting to ResourceManager at /0.0.0.0:8032
HADOOP: Connecting to ResourceManager at /0.0.0.0:8032
HADOOP: Total input paths to process : 1
HADOOP: number of splits:2
HADOOP: Submitting tokens for job: job_1450159525228_0002
HADOOP: Submitted application application_1450159525228_0002
HADOOP: The url to track the job: http://sparker:8088/proxy/application_1450159525228_0002/
HADOOP: Running job: job_1450159525228_0002
HADOOP: Job job_1450159525228_0002 running in uber mode : false
HADOOP:  map 0% reduce 0%
HADOOP:  map 100% reduce 0%
HADOOP:  map 100% reduce 100%
HADOOP: Job job_1450159525228_0002 completed successfully
Parsing counters from hadoop output
HADOOP: Output directory: hdf

### Runners

http://mrjob.readthedocs.org/en/latest/guides/runners.html

You cannot use the programmatic runner functionality in the same file as your job class.


In [None]:
## RUNNER EXAMPLE

# Load class for mapreduce
from job import MRcoverage
from mrjob.util import log_to_stream

if __name__ == '__main__':

    # Create object
    mrjob = MRcoverage(args=[
        '-r', 'inline',
        #'-r', 'local',
        # '-r', 'hadoop',
        # '--jobconf=mapreduce.job.maps=10',
        # '--jobconf=mapreduce.job.reduces=4'
        #'--jobconf=stream.recordreader.compression=bz2'
        ])

    # Run and handle output
    with mrjob.make_runner() as runner:

        # Redirect hadoop logs to stderr
        log_to_stream("mrjob")
        # Execute the job
        runner.run()

        # Do something with stream output (e.g. file, database, etc.)
        for line in runner.stream_output():
            key, value = mrjob.parse_output_line(line)
            print(key, value)

### Protocols

http://mrjob.readthedocs.org/en/latest/guides/writing-mrjobs.html#protocols



MRJob adds many conveniences!

Some of them may prove expensive on heavy computations.

For example, the data transferred between MapReduce tasks is serialized in JSON format.

In [None]:
import mrjob.job
import mrjob.protocol

class MyMRJob(mrjob.job.MRJob):

    # these are the defaults
    INPUT_PROTOCOL = mrjob.protocol.RawValueProtocol
    INTERNAL_PROTOCOL = mrjob.protocol.JSONProtocol
    OUTPUT_PROTOCOL = mrjob.protocol.JSONProtocol


Let's see what it means

In [33]:
%%writefile protocol.py
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, key, line):
        for word in line.split(' '):
             yield word.lower(), 1

                
    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCount.run()

Overwriting protocol.py


In [34]:
! python protocol.py /data/lectures/data/books/twolines.txt 2> /dev/null

"bye"	1
"goodbye"	1
"hadoop"	2
"hello"	2
"world"	2


In [35]:
%%writefile protocol.py
from mrjob.job import MRJob
from mrjob.protocol import PickleProtocol

class MRWordCount(MRJob):

    # Optimization on internal protocols
    INTERNAL_PROTOCOL = PickleProtocol
    OUTPUT_PROTOCOL = PickleProtocol
    
    def mapper(self, key, line):
        for word in line.split(' '):
             yield word.lower(), 1

                
    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCount.run()

Overwriting protocol.py


In [36]:
! python protocol.py /data/lectures/data/books/twolines.txt 2> /dev/null

\x80\x03X\x03\x00\x00\x00byeq\x00.	\x80\x03K\x01.
\x80\x03X\x05\x00\x00\x00helloq\x00.	\x80\x03K\x02.
\x80\x03X\x05\x00\x00\x00worldq\x00.	\x80\x03K\x02.
\x80\x03X\x06\x00\x00\x00hadoopq\x00.	\x80\x03K\x02.
\x80\x03X\x07\x00\x00\x00goodbyeq\x00.	\x80\x03K\x01.
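
The escaped output above is no longer human-readable, but it can still be decoded. This is a minimal sketch with the standard `json` and `pickle` modules only, assuming the printed escapes stand for the raw bytes written below (copied from the first output line). It says nothing about how mrjob itself reads the data back.

```python
import json
import pickle

key, value = "bye", 1

# JSON keeps the pair as human-readable text
json_line = json.dumps(key) + "\t" + json.dumps(value)
print(json_line)

# Pickle produces bytes, which is why the job output looks escaped
pickled_key = pickle.dumps(key)
print(type(pickled_key).__name__)

# Bytes copied from the first output line (assumption: the printed
# escapes correspond to these raw bytes)
raw_key = b"\x80\x03X\x03\x00\x00\x00byeq\x00."
raw_value = b"\x80\x03K\x01."
print(pickle.loads(raw_key), pickle.loads(raw_value))
```

The trade-off is readability: pickled output suits intermediate data or a downstream Python consumer, while JSON output can be inspected directly.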


In [17]:
myfile = "/data/lectures/data/books/prince.txt"

In [19]:
%%mapreduce $myfile
def mapper(self, key, line):
    for word in line.split():
        yield word.lower(), 1

def reducer(self, word, occurrences):
    yield word, sum(occurrences)

Input file is: /data/lectures/data/books/prince.txt
Saving the python class in: jobs/script_000040541.py 
Executing Mrjob.
Done!

Files:
jobs/script_000040541.py      jobs/script_000040541.py.out
jobs/script_000040541.py.err


{'show.': 2,
 'rash': 4,
 'campaign': 2,
 'smaller,': 1,
 'entity': 3,
 'steps': 2,
 'incensed': 1,
 'individual': 6,
 'bold': 2,
 'year;': 1,
 'forte': 1,
 'also': 53,
 'leads': 6,
 'wherein,': 1,
 'quartered': 1,
 'domination,': 1,
 '31st': 1,
 'vice': 1,
 'start:': 1,
 'freer': 1,
 'treatise': 1,
 'empire.': 2,
 'motives,': 1,
 'wicked,\\': 1,
 'sheltered': 1,
 'servants': 2,
 'ran': 1,
 'benefices,': 1,
 'usual': 5,
 '1461.': 1,
 'italy,': 18,
 'magistrates,': 1,
 'misfortunes': 1,
 'likely': 1,
 'obtained': 10,
 'works.': 7,
 '.': 9,
 'labours': 1,
 'discussing': 3,
 'then,': 5,
 'indignity': 1,
 'done,': 5,
 'remarkable': 5,
 'way': 49,
 'duchy.': 2,
 'waste': 1,
 'changes.': 1,
 'dying': 1,
 'born': 26,
 'loyalty,': 1,
 'famous.': 1,
 'e': 3,
 'dexterity.': 1,
 'advance': 4,
 'vile': 1,
 'entire': 2,
 'further;': 1,
 'securing': 1,
 'apply': 2,
 '1.f.1.': 1,
 'lets': 1,
 'sundry': 1,
 'imagined.': 1,
 'victorious,\\': 1,
 'downwards,': 1,
 'answered:': 5,
 'uberti,': 1,
 '1515;'

In [20]:
# Recover last cell output
data = _

In [21]:
# Sort in descending order by count
sorted(data.items(), key=lambda x: x[1], reverse=True) 
# note: values were already converted into integers

[('the', 3068),
 ('to', 2099),
 ('and', 1905),
 ('of', 1796),
 ('in', 987),
 ('he', 904),
 ('a', 774),
 ('that', 692),
 ('his', 638),
 ('by', 504),
 ('it', 500),
 ('with', 487),
 ('not', 486),
 ('be', 467),
 ('they', 437),
 ('for', 436),
 ('is', 428),
 ('have', 386),
 ('was', 376),
 ('this', 336),
 ('which', 334),
 ('who', 328),
 ('as', 324),
 ('had', 290),
 ('are', 284),
 ('but', 277),
 ('or', 271),
 ('him', 267),
 ('their', 266),
 ('one', 263),
 ('from', 233),
 ('i', 209),
 ('at', 208),
 ('you', 201),
 ('will', 201),
 ('on', 199),
 ('so', 195),
 ('those', 194),
 ('were', 190),
 ('all', 185),
 ('them', 180),
 ('if', 178),
 ('when', 176),
 ('would', 165),
 ('has', 162),
 ('been', 159),
 ('more', 150),
 ('because', 147),
 ('prince', 142),
 ('being', 135),
 ('should', 133),
 ('other', 132),
 ('any', 127),
 ('than', 126),
 ('men', 125),
 ('no', 112),
 ('there', 109),
 ('may', 108),
 ('such', 107),
 ('these', 107),
 ('having', 107),
 ('can', 106),
 ('castruccio', 101),
 ('himself', 98),
 (

In [23]:
%cat jobs/script_000040541.py.err

using configs in /etc/mrjob.conf
creating tmp directory /tmp/script_000040541.jovyan.20151215.065013.858128
writing to /tmp/script_000040541.jovyan.20151215.065013.858128/step-0-mapper_part-00000
Counters from step 1:
  (none found)
writing to /tmp/script_000040541.jovyan.20151215.065013.858128/step-0-mapper-sorted
> sort /tmp/script_000040541.jovyan.20151215.065013.858128/step-0-mapper_part-00000
writing to /tmp/script_000040541.jovyan.20151215.065013.858128/step-0-reducer_part-00000
Counters from step 1:
  (none found)
Moving /tmp/script_000040541.jovyan.20151215.065013.858128/step-0-reducer_part-00000 -> /tmp/script_000040541.jovyan.20151215.065013.858128/output/part-00000
Streaming final output from /tmp/script_000040541.jovyan.20151215.065013.858128/output
removing tmp directory /tmp/script_000040541.jovyan.20151215.065013.858128


Executing on the existing Hadoop cluster

In [24]:
%%mapreduce $myfile hadoop

def mapper(self, _, line):
    pass
def reducer(self, key, line):
    pass

Input file is: /data/lectures/data/books/prince.txt
Saving the python class in: jobs/script_000019470.py 
Executing Mrjob.
Done!

Files:
jobs/script_000019470.py      jobs/script_000019470.py.out
jobs/script_000019470.py.err


{}

What if I want to modify the file directly?

Can I use the extension without writing the file, only executing my script with mrjob?

> Yes

In [5]:
# Line execution, give a file you created or modified 
%mapreduce $myfile jobs/script_000061748.py hadoop

Input file is: /data/worker/books/prince.txt
File provided by user
Executing Mrjob.
Done!

Files:
jobs/script_000061748.py      jobs/script_000061748.py.out
jobs/script_000061748.py.err


{'': 2016,
 'sight': 4,
 'fight.': 1,
 'worth': 1,
 'continual': 2,
 'excelled': 6,
 '1513,': 2,
 'gentleman.': 1,
 'comes': 12,
 'guido,': 1,
 'extraordinary': 6,
 'saddles': 1,
 'arbiter,': 1,
 'canzoni,': 1,
 'discover': 4,
 'enlarged': 1,
 '1502;': 3,
 '1457.': 1,
 'almost': 8,
 'enriching': 2,
 'infant.': 1,
 'cry': 4,
 'inquiries': 1,
 'fast.': 1,
 'possente,': 2,
 'sharp': 2,
 'mere': 2,
 'xenophon.': 1,
 'cost,': 2,
 '4': 1,
 '252': 1,
 'think': 10,
 'free.': 2,
 'scanty': 1,
 'merciful': 2,
 'reproached': 2,
 'agreeable,': 1,
 'ghibellines,': 2,
 'council': 3,
 'thirdly,': 1,
 '1.c.': 1,
 '\\"an': 2,
 'he': 904,
 'building.': 1,
 'above-named': 4,
 'bind': 5,
 'croce.': 1,
 'ready': 11,
 'durable,': 1,
 'now': 30,
 'carnascialeschi.': 1,
 'accepted,': 1,
 'pitigliano;': 1,
 'appearance': 1,
 'state.': 8,
 'sinigalia,': 9,
 'different': 7,
 'middle-aged': 1,
 'affirm': 1,
 'italian,': 1,
 'future': 7,
 'scruple': 1,
 'past,': 1,
 'settled,': 1,
 'fruit': 1,
 'correspondencies(*

## More than a single step

mrjob can be configured to run multiple steps.

For each step you specify which parts (mapper, combiner, reducer) are executed
and which method of your class implements each of them.


In [36]:
def steps(self):
    return [
        MRStep(
            mapper=self.mapper_get_words,
            combiner=self.combiner_count_words,
            reducer=self.reducer_count_words),
        MRStep(reducer=self.reducer_find_max_word)
    ]

This could be an improvement for our mrjob extension:

- if a `steps` method is provided, it overrides the default one written by the extension
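
Returning to the two-step `steps()` definition above, here is a sketch of what the referenced methods might do, written as plain functions so that it runs without mrjob or Hadoop. The method bodies are assumptions, and `run_step` is a hypothetical local driver that chains the steps. The combiner is left out because it would only pre-aggregate the counts.

```python
import re
from itertools import groupby
from operator import itemgetter

WORD_RE = re.compile(r"[\w']+")


def mapper_get_words(_, line):
    # Step 1 mapper: one (word, 1) pair per word
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer_count_words(word, counts):
    # Step 1 reducer: use the key None so every pair reaches one reducer
    yield None, (sum(counts), word)


def reducer_find_max_word(_, count_word_pairs):
    # Step 2 reducer: each value is (count, word), so max() picks the top word
    yield max(count_word_pairs)


def run_step(pairs, mapper=None, reducer=None):
    """Hypothetical driver: run one step over a list of (key, value) pairs."""
    pairs = list(pairs)
    if mapper is not None:
        pairs = [out for key, value in pairs for out in mapper(key, value)]
    if reducer is not None:
        # Sort on repr() so that None keys can be ordered as well
        pairs.sort(key=lambda pair: repr(pair[0]))
        grouped = groupby(pairs, key=itemgetter(0))
        pairs = [out for key, group in grouped
                 for out in reducer(key, (value for _, value in group))]
    return pairs


lines = [(None, "hello world"), (None, "bye world")]
step1 = run_step(lines, mapper=mapper_get_words, reducer=reducer_count_words)
step2 = run_step(step1, reducer=reducer_find_max_word)
print(step2)
```

The output of the first step becomes the input of the second, which is what chaining `MRStep` objects expresses.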

A quick note

* With mrjob you cannot connect to a **remote** Hadoop cluster.
    - Hadoop does not allow job submissions (classes or executables) from outside.
* By contrast, EMR on Amazon is accessible from your laptop.
    - The [boto API](http://boto.readthedocs.org/en/latest/ref/emr.html) solves the issue.

In [None]:
%%writefile runner.py

# Load class for mapreduce
from job import MRcoverage
from mrjob.util import log_to_stream

if __name__ == '__main__':

    # Create object
    mrjob = MRcoverage(args=[
        '-r', 'inline',
        #'-r', 'local',
        # '-r', 'hadoop',
        # '--jobconf=mapreduce.job.maps=10',
        # '--jobconf=mapreduce.job.reduces=4'
        #'--jobconf=stream.recordreader.compression=bz2'
        ])

    # Run and handle output
    with mrjob.make_runner() as runner:

        # Redirect hadoop logs to stderr
        log_to_stream("mrjob")
        # Execute the job
        runner.run()

        # Do something with stream output (e.g. file, database, etc.)
        for line in runner.stream_output():
            key, value = mrjob.parse_output_line(line)
            print(key, value)


# Recap with Mrjob

You should really read the [documentation of the latest version](http://mrjob.readthedocs.org/en/latest/).

It covers every need without getting too complicated.
There are many other options and advanced behaviours to discover.

<small>
Note 1: We are using the most recent version (*release* **v.0.5.0dev**) because it's the first one to support Python 3.

Note 2: The developers are there to help you, see my case in https://github.com/Yelp/mrjob/issues/1142
</small>



**Pros**

* More documentation than any other framework or library
* Write code in a single class (per Hadoop job)
    * Map and Reduce are single methods
    * Very clean and simple
* Advanced configuration
    * Configure multiple steps
    * Handle command line options inside the python code (see docs)
* Easily wrap input/output 
    * No data copy required with HDFS
* Hadoop logs or errors directly into the script output
* **Switch environment without changing the code...!**

**Cons**

* Doesn’t give you the same level of access to Hadoop APIs 
    - Better: Dumbo and Pydoop
    - Other libraries can be faster if you use typedbytes

### Comparison
<img src='http://blog.cloudera.com/wp-content/uploads/2013/01/features.png'>

### Performance
<img src='http://blog.cloudera.com/wp-content/uploads/2013/01/performance.png'>

source: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/

*One final note*: 
> Open source is a great thing

The community listens to you:
https://github.com/Yelp/mrjob/issues/1142

# End of Chapter