# DATASCI W261: Machine Learning at Scale

# Write some words to a file

In [1]:
!echo foo foo quux labs foo bar quux > WordCount.txt

# MrJob class for wordcount

In [2]:
%%writefile WordCount.py
from mrjob.job import MRJob
import re
 
WORD_RE = re.compile(r"[\w']+")
 
class MRWordFreqCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every token found in the input line
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        # Pre-aggregate counts locally before the shuffle
        yield word, sum(counts)

    def reducer(self, word, counts):
        # Sum the partial counts to get the final total per word
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordFreqCount.run()

Overwriting WordCount.py


The code above is straightforward. The mapper emits a (word, 1) key-value pair for each word, the combiner sums the counts locally on the mapper's output, and finally the reducer sums the partial counts to produce the total for each word.
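To make the data flow concrete, here is a minimal in-memory sketch of the same map, combine, sort, and reduce sequence in plain Python, without mrjob. The `simulate` helper and its `lines_per_split` parameter are hypothetical names introduced only for illustration; real runners decide on their own how input is split and when (or whether) the combiner runs.

```python
import re
from itertools import groupby
from operator import itemgetter

WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Emit (word, 1) for every token in the line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def sum_by_key(pairs):
    """Sort pairs by key, then sum the values for each key.

    Used for both the combine and the reduce phase, mirroring the
    identical combiner and reducer bodies in the job class.
    """
    ordered = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(ordered, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def simulate(lines, lines_per_split=1):
    """Hypothetical helper: run map -> combine -> sort -> reduce in memory."""
    combined = []
    for start in range(0, len(lines), lines_per_split):
        split = lines[start:start + lines_per_split]
        mapped = [pair for line in split for pair in mapper(line)]
        combined.extend(sum_by_key(mapped))  # combiner: local sums per split
    return list(sum_by_key(combined))        # reducer: global sums


print(simulate(["foo foo quux labs foo bar quux"]))
# [('bar', 1), ('foo', 3), ('labs', 1), ('quux', 2)]
```

Because addition is associative and commutative, summing per split first and then summing the partial results gives the same totals as summing everything in the reducer, which is why the combiner can reuse the reducer's logic here.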

# Run the code from the command line

In [3]:
!python WordCount.py WordCount.txt

"bar"	1
"foo"	3
"labs"	1
"quux"	2


using configs in C:\Anaconda\mrjob.conf
creating tmp directory c:\users\liang.dai\appdata\local\temp\WordCount.liang.dai.20150524.134758.767000
writing to step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to c:\users\liang.dai\appdata\local\temp\WordCount.liang.dai.20150524.134758.767000\step-0-mapper-sorted
> sort 'c:\users\liang.dai\appdata\local\temp\WordCount.liang.dai.20150524.134758.767000\step-0-mapper_part-00000'
writing to step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
Moving c:\users\liang.dai\appdata\local\temp\WordCount.liang.dai.20150524.134758.767000\step-0-reducer_part-00000 -> c:\users\liang.dai\appdata\local\temp\WordCount.liang.dai.20150524.134758.767000\output\part-00000
Streaming final output from c:\users\liang.dai\appdata\local\temp\WordCount.liang.dai.20150524.134758.767000\output
removing tmp directory c:\users\liang.dai\appdata\local\temp\WordCount.liang.dai.20150524.134758.767000


# Run the code through a Python driver

####  Reminder: You cannot use the programmatic runner functionality in the same file as your job class. That is because the file with the job class is sent to Hadoop to be run. Therefore, the job file cannot attempt to start the Hadoop job, or you would be recursively creating Hadoop jobs!

Use make_runner() to run an MRJob from Python:
1. Separate the driver from the MapReduce job.
2. The job can then be run from within a Python notebook.
3. In Python, each class typically lives in its own file. Each mrjob job is a separate class and should be in a separate file.

In [7]:
from WordCount import MRWordFreqCount
mr_job = MRWordFreqCount(args=['WordCount.txt'])
with mr_job.make_runner() as runner: 
    runner.run()
    # stream_output() yields the job's raw output lines
    for line in runner.stream_output():
        # parse_output_line() decodes a raw line into a (key, value) pair
        print(mr_job.parse_output_line(line))

(u'bar', 1)
(u'foo', 3)
(u'labs', 1)
(u'quux', 2)
