## 1.4 Introduction to Computation at Scale

We are going to use the python [mrjob](https://github.com/Yelp/mrjob) package developed at Yelp.

This package allows us to develop and test map reduce jobs locally and when ready deploy them to a hadoop cluster with hadoop streaming enabled.  We are going to use it to run jobs locally.

To write a map reduce job we need to implement mapper() and reducer() functions.  The mrjob package takes care of the orchestration of the job.  Here is a first example that will count words in a file:

In [4]:
%%file wordcounter.py 
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "words", len(line.split())

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()

Writing wordcounter.py


The key points to note:

* We inherit from the class MRJob and provide at least one mapper, reducer or combiner method implementation
* All python methods take `self` as their first argument - this is normal - not mrjob specific
* The mappers will be called once for each line by of the input file specified on the command line
* The mappers must yield a key value pair - the emitted key value pairs will be sent to combiners and reducers
* The reducers will be called once for each key and value emitted by the mappers
* The reducers must also output key and value pairs

Here we can count the words in the bike-items data we were using earlier:

In [5]:
! python wordcounter.py data/bike-items.txt > out.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/wordcounter.csumb.20160209.061521.363793

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /tmp/wordcounter.csumb.20160209.061521.363793/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /tmp/wordcounter.csumb.20160209.061521.363793/step-0-mapper-sorted
> sort /tmp/wordcounter.csumb.20160209.061521.363793/step-0-mapper_part-00000
writing to /tmp/wordcounter.csumb.20160209.061521.363793/step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
Moving /tmp/wordcounter.csumb.20160209.061521.363793/step-0-reducer_part-00000 -> /tmp/wordcounter.csumb.20160209.061521.363793/output/part-00000
Streaming final output f

The process runs and the output is dumped into the file out.txt.  In this case there is just a single line:

In [6]:
! cat out.txt

"words"	755154


Here we have one pass through the file and have computed just the number of words.  We can have more elaborate jobs that compute multiple statistics.  Here we count characters, word and line count - the mapper emits three key value pairs for each line:


In [9]:
%%file wordcounter.py 
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1
        

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()

Overwriting wordcounter.py


In [7]:
! python wordcounter.py data/bike-items.txt > out.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/wordcounter.csumb.20160209.061535.032858

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /tmp/wordcounter.csumb.20160209.061535.032858/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /tmp/wordcounter.csumb.20160209.061535.032858/step-0-mapper-sorted
> sort /tmp/wordcounter.csumb.20160209.061535.032858/step-0-mapper_part-00000
writing to /tmp/wordcounter.csumb.20160209.061535.032858/step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
Moving /tmp/wordcounter.csumb.20160209.061535.032858/step-0-reducer_part-00000 -> /tmp/wordcounter.csumb.20160209.061535.032858/output/part-00000
Streaming final output f

In [10]:
! cat out.txt

"words"	755154


## Term Frequency in Map Reduce

In [25]:
%%file term-frequency.py 
from mrjob.job import MRJob
import re

class MRTermFrequencyCount(MRJob):

    def mapper(self, _, line):
        rx = re.compile(r'^\"([a-zA-Z0-9\s]*){1}\"')
                
        for term in title.split():
            yield term, 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRTermFrequencyCount.run()

TypeError: <module '__main__' (built-in)> is a built-in class

In [74]:
import re
s = '"AUCT_TITL","ITEM_DESC_TXT"'
m = re.match(r'\"([\w\s]*)\".*',s)
print(m)
print(m.group(1))

<_sre.SRE_Match object at 0x7f542178a0a8>
AUCT_TITL


In [15]:
! head data/bike-items.txt

"AUCT_TITL","ITEM_DESC_TXT"
"ZIPP VUKA CARBON AERO BASE BAR AND EXTENSIONS COMPLETE TRIATHLON TT TRI CYCLING","carbon bar for 31.8 stems measurements on the carbon aero base bar: 40cm center to center / 42cm outside to outside measurements for the carbon aero extensions: 32cm 50mm riser blocks included everything looks great is ready to ride PAYPAL ONLY / CONTINENTAL US ONLY / NO ""0"" FEEDBACK BIDDERS"
"Cycling Bicycle MTB Bike Fixie Gloss 3K Carbon Fiber Riser Bar Handlebar 31.8mm","Description Feature: Easy to use Made of high quality carbon fiber With the special design, can save for a long time The carbon fiber handlebar is made of high quality carbon fiber.So that you can use it relieved This Quick disassembling Carbon Fiber handlebar is easy to use,and one of the best gifts to your friends Specification: Material: Carbon fiber Color: Black Handlebar Clamp Diameter: 31.8 mm Length: 700mm, 680mm, 660mm, 640mm, 620mm, 600mm Package Included: 1 x Cycling Carbon Fiber Rise"
"BI

In [23]:
! python term-frequency.py data/bike-items.txt > out.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/term-frequency.csumb.20160209.063203.425899

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /tmp/term-frequency.csumb.20160209.063203.425899/step-0-mapper_part-00000
Traceback (most recent call last):
  File "term-frequency.py", line 16, in <module>
    MRTermFrequencyCount.run()
  File "/home/csumb/anaconda2/lib/python2.7/site-packages/mrjob/job.py", line 461, in run
    mr_job.execute()
  File "/home/csumb/anaconda2/lib/python2.7/site-packages/mrjob/job.py", line 479, in execute
    super(MRJob, self).execute()
  File "/home/csumb/anaconda2/lib/python2.7/site-packages/mrjob/launch.py", line 153, in execute
    self.run_job()
  File "/home/csumb/an

In [18]:
! tail out.txt

"~MATERIAL:"	1
"~N\",\"sale4women"	1
"~O(\u2229_\u2229)O~"	1
"~\",\"Bicycledreamer"	1
"~\",\"Specialized"	1
"~~"	1
"~~~~~\""	1
"~~~~~~~\""	1
"~~~~~~~~~~"	1
"~~~~~~~~~~*PLEASE"	1
