# MIDS W261 Machine Learning At Scale

To preserve AWS output in notebooks, our team has decided to submit selected runs in each notebook. This notebook contains part of HW 5.3. Redundant answers have been removed from this notebook.


In [1]:
# Turn on autoreload for easier troubleshooting.
# This extension makes IPython reload modules before executing code, which
#      is useful because we will be updating the MRJob code while troubleshooting.
%load_ext autoreload
%autoreload 2

<b>HW 5.3</b>

For the remainder of this assignment you will work with a large subset 
of the Google n-grams dataset,

https://aws.amazon.com/datasets/google-books-ngrams/

which we have placed in a bucket on s3:

s3://filtered-5grams/

In particular, this bucket contains about 200 files, and each line in them has the format:

	(ngram) \t (count) \t (pages_count) \t (books_count)
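A minimal sketch of reading one record in this format, assuming the four fields are separated by single tab characters. The helper name `parse_record` and the sample line are made up for illustration and are not taken from the bucket.

```python
def parse_record(line):
    """Split one tab-delimited record into its four typed fields."""
    ngram, count, pages_count, books_count = line.rstrip("\n").split("\t")
    return ngram, int(count), int(pages_count), int(books_count)

# Hypothetical record, invented for illustration only.
sample = "A BILL FOR ESTABLISHING RELIGIOUS\t59\t59\t54\n"
record = parse_record(sample)
print(record)
```

Converting the three counts to `int` up front keeps later arithmetic (for example count/pages_count) free of string handling.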

Do some EDA on this dataset using mrjob, e.g., 

- Longest 5-gram (number of characters)
- Top 10 most frequent words (count), i.e., unigrams
- Most/Least densely appearing words (count/pages_count) sorted in decreasing order of relative frequency (Hint: save to PART-000* and take the head -n 1000)
- Distribution of 5-gram sizes (counts) sorted in decreasing order of relative frequency. (Hint: save to PART-000* and take the head -n 1000)

OPTIONAL Question:
- Plot the log-log plot of the frequency distribution of unigrams. Does it follow a power-law distribution?

For more background see:
https://en.wikipedia.org/wiki/Log%E2%80%93log_plot
https://en.wikipedia.org/wiki/Power_law
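One rough way to approach the optional question, sketched under assumptions: if frequency falls off as a power of rank, the points are roughly collinear on log-log axes, so the least-squares slope of log(frequency) against log(rank) estimates the exponent. The function name `loglog_slope` and the synthetic data below are hypothetical. A straight line on a log-log plot is suggestive rather than conclusive evidence of a power law.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) versus log(rank)."""
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic frequencies proportional to 1/rank (made up, not from the dataset).
synthetic = [1000.0 / rank for rank in range(1, 201)]
slope = loglog_slope(synthetic)
print("estimated slope: %.3f" % slope)
```

On real unigram counts the fit would be applied to the counts produced by the word-frequency job, and the head and tail of the distribution may deviate from the line.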

#### Longest 5-gram

In [21]:
%%writefile longest5Gram.py
#!/Library/Frameworks/Python.framework/Versions/2.7/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from mrjob.protocol import RawValueProtocol

class longest5Gram(MRJob):
    
    OUTPUT_PROTOCOL = RawValueProtocol
    
    def jobconf(self):
        # Start from mrjob's default jobconf and add a comparator so that
        # keys (n-gram lengths) reach the reducer in reverse numeric order.
        combined_jobconf = super(longest5Gram, self).jobconf()
        combined_jobconf.update({
            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
            'mapred.text.key.comparator.options': '-k1rn',
        })
        return combined_jobconf

    def steps(self):
        return [MRStep(mapper = self.mapper, 
                       reducer_init = self.reducer_init,
                       reducer = self.reducer)]

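    def mapper(self, _, line):
        # Each record is: ngram \t count \t pages_count \t books_count
        ngram, count, pages, books = re.split("\t", line.strip())
        # Key on the character length so the sort brings the longest first
        yield len(ngram), ngram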

    # reducer_init runs once per reducer task. The counter it sets up limits
    # output to the first five keys that task receives, which the comparator
    # in jobconf is meant to deliver in descending order of length.
    def reducer_init(self):
        self.first = 0
        
    def reducer(self, length, values):
        if self.first < 5:
            self.first += 1
            # Map every n-gram of this length to the length itself
            data = {}
            for ngram in values:
                data[ngram] = length
            yield None, data

        

if __name__ == '__main__':
    longest5Gram.run()

Overwriting longest5Gram.py
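Before launching on EMR, the expected answer can be checked on a small local sample. The following is a plain-Python sketch of the same idea (track the maximum n-gram length and every n-gram that reaches it), not the MRJob code itself. The name `longest_ngrams` and the sample records are hypothetical.

```python
def longest_ngrams(lines):
    """Return (max_length, [ngrams with that length]) for tab-delimited records."""
    best_len, best = 0, []
    for line in lines:
        ngram = line.rstrip("\n").split("\t")[0]
        if len(ngram) > best_len:
            best_len, best = len(ngram), [ngram]
        elif len(ngram) == best_len:
            best.append(ngram)
    return best_len, best

# Made-up records in the (ngram, count, pages_count, books_count) layout.
sample_lines = [
    "a b c d e\t10\t9\t8",
    "alpha beta gamma delta epsilon\t3\t3\t2",
    "one two three four five\t7\t7\t5",
]
best_len, best = longest_ngrams(sample_lines)
print("%d\t%s" % (best_len, best))
```

A single pass like this only works when the sample fits on one machine. The MRJob version instead relies on the shuffle sort to bring the longest keys to the reducer first.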


In [22]:
!chmod +x longest5Gram.py

In [24]:
!aws s3 mb s3://ucb-mids-mls-christopher-llop-hw5/longest5Gram/

make_bucket failed: s3://ucb-mids-mls-christopher-llop-hw5/longest5Gram/ A client error (BucketAlreadyOwnedByYou) occurred when calling the CreateBucket operation: Your previous request to create the named bucket succeeded and you already own it.



In [25]:
!./longest5Gram.py s3://filtered-5grams/ -r emr \
    --output-dir=s3://ucb-mids-mls-christopher-llop-hw5/longest5Gram \
    --no-output

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
using existing scratch bucket mrjob-03e94e1f06830625
using s3://mrjob-03e94e1f06830625/tmp/ as our scratch dir on S3
creating tmp directory /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/longest5Gram.cjllop.20151007.035139.869156
writing master bootstrap script to /var/folders/6f/lzgvrmnn68b1pdz9lfyc7gnh0000gn/T/longest5Gram.cjllop.20151007.035139.869156/b.py

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

Copying non-input files into s3://mrjob-03e94e1f06830625/tmp/longest5Gram.cjllop.20151007.035139.869156/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-JO6XBAWVTCEI
Created new job flow j-JO6XBAWVTCEI
Job launched 30.5s 

In [26]:
!aws s3 cp s3://ucb-mids-mls-christopher-llop-hw5/longest5Gram/part-00000 53longest5Gram.txt

download: s3://ucb-mids-mls-christopher-llop-hw5/longest5Gram/part-00000 to ./53longest5Gram.txt


In [31]:
# Print result to screen
!cat ./53longest5Gram.txt | head -1

{'ROPLEZIMPREDASTRODONBRASLPKLSON YHROACLMPARCHEYXMMIOUDAVESAURUS PIOFPILOCOWERSURUASOGETSESNEGCP TYRAVOPSIFENGOQUAPIALLOBOSKENUO OWINFUYAIOKENECKSASXHYILPOYNUAT': 159, 'AIOPJUMRXUYVASLYHYPSIBEMAPODIKR UFRYDIUUOLBIGASUAURUSREXLISNAYE RNOONDQSRUNSUBUNOUGRABBERYAIRTC UTAHRAPTOREDILEIPMILBDUMMYUVERI SYEVRAHVELOCYALLOSAURUSLINROTSR': 159}	
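Each output line is a printed Python dict mapping n-grams to their character length, so it can be read back with `ast.literal_eval`. The sketch below assumes that shape. The helper `read_result_line` and the example line are made up for illustration and are much shorter than the real output.

```python
import ast

def read_result_line(line):
    """Parse one printed-dict output line into (ngram, length) pairs, longest first."""
    lengths = ast.literal_eval(line.strip())
    return sorted(lengths.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical line in the same shape as the job output; values are invented.
example_line = "{'aa bb cc dd ee': 14, 'a b c d e': 9}\t\n"
pairs = read_result_line(example_line)
for ngram, length in pairs:
    print("%d\t%s" % (length, ngram))
```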
