### Total Sort MRJob Example

#### Megan Jasek, 6/18/16

This is an example of using MRJob to do a secondary sort.  I was unable to get sorting on an additional field (for example, to break ties) to work.

The key things about this solution are as follows:
1.  MRJob.SORT_VALUES = True  
Note from the docs: when you set this value to True, MRJob automatically sets these 4 values:  
stream.num.map.output.key.fields=2  
mapred.text.key.partitioner.options=k1,1  
blank out: mapred.output.key.comparator.class (to prevent interference from mrjob.conf.)  
blank out: mapred.text.key.comparator.options (to prevent interference from mrjob.conf.)  

2.  INTERNAL_PROTOCOL = RawProtocol  
This is what I have been using.  I didn't actually try the job without it, so you might want to try that first: once you set this value, you have to change how you pass data between steps.  I made sure all of my keys and values are strings, and it works.

3.  What is yielded from the mapper  
Yield something like this from your mapper.  The value must come before the word, and I think the two need to be separated by a tab as well.  
yield label, str(value)+'\t'+str(word)  

4.  Set these values in a jobconf  
'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',  
'mapred.text.key.comparator.options': '-k2,2nr',  
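
To make the effect of these four points concrete, here is a minimal pure-Python sketch of what the shuffle is being asked to do. It does not use Hadoop or mrjob. It assumes that with RawProtocol each mapper output line is the label, value, and word joined by tabs, that partitioning uses only the first field, and that '-k2,2nr' orders lines by the second field as a number, largest first. The function name simulate_shuffle and the choice of sample records are made up for illustration, and real Hadoop behavior may differ in details such as how ties are ordered.

```python
from collections import defaultdict


def simulate_shuffle(mapper_output):
    """Group tab-separated lines by field 1, then sort each group by
    field 2 as a number, largest first (a stand-in for '-k2,2nr')."""
    partitions = defaultdict(list)
    for label, value in mapper_output:
        # RawProtocol-style line: key, tab, value -> "label\tvalue\tword"
        line = label + '\t' + value
        fields = line.split('\t')
        # Partition on field 1 only (assumed role of the partitioner option).
        partitions[fields[0]].append(fields)
    for label in partitions:
        # Python's sort is stable, so ties keep their input order here;
        # Hadoop makes no such promise.
        partitions[label].sort(key=lambda f: int(f[1]), reverse=True)
    return dict(partitions)


# Hypothetical mapper output in the (label, "value\tword") shape used above.
sample = [
    ('a', '13\tA City by the Sea'),
    ('b', '22\tA Case Study in Government'),
    ('a', '18\tA Case Study of Female'),
    ('b', '29\tA BILL FOR ESTABLISHING RELIGIOUS'),
]

for label, rows in sorted(simulate_shuffle(sample).items()):
    for _, value, word in rows:
        print(label, value, word)
```

Because the value is the second tab-separated field, it is the part of the line the comparator looks at, which is why the value has to be yielded before the word.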

This code works, but there are many other ways to do this.  Either MRJob has several bugs in this area, or I am missing some special parameter somewhere.  Here are a few examples of what's weird:

1.  When SORT_VALUES is set to True, these parameters seem to be ignored:  
            'stream.num.map.output.key.field': 3,  
            'stream.map.output.field.separator':'\t',  
            And the '-k3,3' part here:  'mapreduce.partition.keycomparator.options': '-k2,2nr -k3,3',  

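2.  When SORT_VALUES is set to False, but INTERNAL_PROTOCOL = RawProtocol, these parameters don't work:  
            'mapreduce.job.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',  
            'stream.num.map.output.key.field': 3,  
            'stream.map.output.field.separator':'\t',  
            'mapreduce.partition.keycomparator.options': '-k2,2nr -k3,3',  

One thing worth checking: the Hadoop streaming property is spelled stream.num.map.output.key.fields (plural, as in the list of values that SORT_VALUES sets above), so the singular spelling used in these attempts may not have been recognized at all.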
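
Since the '-k3,3' tie-break appears to be ignored at the shuffle level, one possible workaround (not something tested in this notebook) is to break ties inside the reducer instead. The sketch below assumes the reducer receives 'value\tword' strings that are already sorted by value in descending order, as in the job below, and it orders the words alphabetically within each run of equal values. The function name break_ties is hypothetical.

```python
from itertools import groupby


def break_ties(sorted_values):
    """Yield (word, value) pairs, ordering words alphabetically within
    each run of equal values. Input must already be grouped by value."""
    pairs = (vp.split('\t', 1) for vp in sorted_values)
    for value, run in groupby(pairs, key=lambda p: p[0]):
        # Only the tied records for this value are held in memory.
        for _, word in sorted(run, key=lambda p: p[1]):
            yield word, value


# Hypothetical reducer input, already sorted by value (descending).
values = [
    '29\tA Circumstantial Narrative of the',
    '29\tA BILL FOR ESTABLISHING RELIGIOUS',
    '25\tA Biography of General George',
]
for word, value in break_ties(values):
    print(word, value)
```

In reducer_sort_chars, this would mean looping over break_ties(value_pair) instead of over value_pair directly. The trade-off is that each run of tied values is buffered in memory, which is fine for small ties but not for a value shared by a very large number of records.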


In [1]:
%%writefile 5gram_test.txt
A BILL FOR ESTABLISHING RELIGIOUS	59	59	54
A Biography of General George	92	90	74
A Case Study in Government	102	102	78
A Case Study of Female	447	447	327
A Case Study of Limited	55	55	43
A Child's Christmas in Wales	1099	1061	866
A Circumstantial Narrative of the	62	62	50
A City by the Sea	62	60	49
A Collection of Fairy Tales	123	117	80
A Collection of Forms of	116	103	82

Writing 5gram_test.txt


In [4]:
%%writefile EDALongest5gram_test.py
from mrjob.job import MRJob
from mrjob.protocol import RawProtocol
from mrjob.step import MRStep
import re
    
# This class outputs the 5-grams from the input file in descending order based
# on the number of characters in the 5-gram.
class MREDALongest5gram(MRJob):
    # Set on this job class (not on the MRJob base class) so only this job is affected.
    SORT_VALUES = True
    INTERNAL_PROTOCOL = RawProtocol

    def mapper_count_chars(self, _, line):
        # Read the next line from the file and output its first field (the 5-gram) with the count
        # of its characters.
        record = line.strip().split('\t')
        # Remove spaces and apostrophes
        ngram = re.sub("[' ]", '', record[0])
        yield record[0], str(len(ngram))
    
    def reducer_sum_chars(self, ngram, counts):
        # output each ngram (key) that is input to the reducer plus the sum of the counts.
        c_sum = 0
        for c in counts:
            c_sum += int(c)
        yield ngram, str(c_sum)

    def mapper_sort_chars(self, word, value):
        # add a label to each input for sorting based on the value
        if int(value) < 20:
            label = 'a'
        else:
            label = 'b'
        yield label, str(value)+'\t'+str(word)
    
    def reducer_sort_chars(self, label, value_pair):
        # Output what is received.  The values are sorted at this point.
        for vp in value_pair:
            v, w = vp.split('\t')
            yield w, v
        
    def steps(self):
        JOBCONF_STEP1 = {        
            'mapreduce.job.reduces': '2'
        }
        # Define the steps of this MR job.  JOBCONF_STEP2 tells Hadoop how to handle the data during
        # the Hadoop shuffle for the 2nd step.  In this case the data should be sorted in reverse
        # numeric order by the 2nd key field (the character counts).
        JOBCONF_STEP2 = {        
            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
            'mapred.text.key.comparator.options': '-k2,2nr',
            'mapreduce.job.maps': '4',
            'mapreduce.job.reduces': '2'
        }
        return [
            MRStep(jobconf=JOBCONF_STEP1,            # STEP 1:  count the characters
                   mapper=self.mapper_count_chars,   
                   reducer=self.reducer_sum_chars),
            MRStep(jobconf=JOBCONF_STEP2,            # STEP 2:  sort by character count
                   mapper=self.mapper_sort_chars,
                   reducer=self.reducer_sort_chars)  
        ]

if __name__ == '__main__':
    MREDALongest5gram.run()

Overwriting EDALongest5gram_test.py
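
The character-counting logic in mapper_count_chars can be checked without Hadoop. The sketch below repeats the same split and regular expression in a standalone helper (the name count_chars is made up here) and applies it to one line of 5gram_test.txt. It returns the count as an integer, whereas the job converts it to a string because RawProtocol needs string values.

```python
import re


def count_chars(line):
    """Return (ngram, length), where length ignores spaces and apostrophes."""
    ngram = line.strip().split('\t')[0]
    stripped = re.sub("[' ]", '', ngram)
    return ngram, len(stripped)


# One record from 5gram_test.txt: the 5-gram, then three tab-separated counts.
line = "A Child's Christmas in Wales\t1099\t1061\t866"
print(count_chars(line))  # ("A Child's Christmas in Wales", 23)
```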


In [5]:
# Test the program on the small dataset
!python EDALongest5gram_test.py -r hadoop 5gram_test.txt 

No configs found; falling back on auto-configuration
Creating temp directory /tmp/EDALongest5gram_test.hadoop.20160618.161244.782700
Looking for hadoop binary in /usr/local/hadoop/bin...
Found hadoop binary: /usr/local/hadoop/bin/hadoop
Using Hadoop version 2.7.1
Copying local files to hdfs:///user/hadoop/tmp/mrjob/EDALongest5gram_test.hadoop.20160618.161244.782700/files/...
Looking for Hadoop streaming jar in /usr/local/hadoop...
Found Hadoop streaming jar: /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar
Detected hadoop configuration property names that do not match hadoop version 2.7.1:
The have been translated as follows
 mapred.output.key.comparator.class: mapreduce.job.output.key.comparator.class
mapred.text.key.comparator.options: mapreduce.partition.keycomparator.options
mapred.text.key.partitioner.options: mapreduce.partition.keypartitioner.options
Running step 1 of 2...
  Unable to load native-hadoop library for your platform... using builtin-java classes w