MIDS UC Berkeley, Machine Learning at Scale  
DATSCIW261 ASSIGNMENT #5  
Version 2016-02-12 (FINAL)  
[Natarajan Krishnaswami](mailto:natarajan@krishnaswami.org)  
2016 Feb 14

# INSTRUCTIONS for SUBMISSIONS


**SPECIAL INSTRUCTIONS**: This week's homework is a group exercise. Your team assignments for completing this HW are located at:

https://docs.google.com/spreadsheets/d/1ncFQl5Tovn-16slD8mYjP_nzMTPSfiGeLLzW8v_sMjg/edit?usp=sharing

See the column "Team assignment for Homeworks" in the tab "Teams for HW Assignments".

Please submit your homework (one per team) going forward via this form (and not through the ISVC):

https://docs.google.com/forms/d/1ZOr9RnIe_A06AcZDB6K1mJN4vrLeSmS2PD6Xm3eOiis/viewform?usp=send_form

Please follow the instructions for submissions carefully.

# Week 5 ASSIGNMENTS
---
## HW 5.0
1. What is a data warehouse?
2. What is a Star schema?
3. When is it used?

**Answers**:
1. A data warehouse is an enterprise's repository of all relevant information, be it structured, semi-structured, or unstructured, needed to monitor and predict business performance and needs.
2. A star schema is one in which rows of a central "fact" table reference, by ID, rows of (flat) "dimension" tables, potentially along with some descriptive columns.  It is called a star because the fact table's ID columns fan out to the dimension tables.
3. If the fact entries do not vary much in the dimensions they need (e.g., varying by location in a hierarchy), a star schema can be a good fit: it permits straightforward, easy-to-optimize queries and a natural visualization of the entity relationships. Snowflake schemas can be a better fit if more flexibility or variability is needed in the dimensions, at the cost of more complex processing.

---
## HW 5.1
1. In the database world, what is 3NF?
1. Does machine learning use data in 3NF?
  1. If so why? 
1. In what form does ML consume data?
1. Why would one use log files that are denormalized?

**Answers**:
1. In Codd's hierarchy of normal forms, third normal form (3NF) requires that every non-key column depend only on the candidate keys, with no transitive dependencies through other non-key columns. This removes redundancy of non-key columns within and across rows, which is (almost) sufficient to keep the data consistent during modification.
2. ML algorithms do not generally use normalized data.
3. ML algorithms generally use highly denormalized data transformed into a suitable feature space.
4. If log files were normalized, the various pieces of each record would need to be located and joined before the record could be processed. Denormalized log records are self-contained, so each line can be processed on its own.
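
As a small illustration of answers 3 and 4, the sketch below joins normalized tables into self-contained rows of the kind an ML pipeline or a log consumer would read. The table contents and the `denormalize` helper are made up for this example.

```python
# Hypothetical normalized tables: each fact is stored once and referenced by ID.
users = {1: {"country": "US"}, 2: {"country": "IN"}}
pages = {10: {"section": "support"}, 11: {"section": "news"}}
visits = [
    {"user_id": 1, "page_id": 10},
    {"user_id": 2, "page_id": 11},
    {"user_id": 1, "page_id": 11},
]

def denormalize(visit):
    """Join one visit with its user and page rows into a flat record."""
    row = dict(visit)
    row.update(users[visit["user_id"]])
    row.update(pages[visit["page_id"]])
    return row

# Each resulting row can be processed without looking anything else up.
feature_rows = [denormalize(visit) for visit in visits]
```

The cost is redundancy: `"country": "US"` is repeated in every row for user 1, which is exactly what 3NF is designed to avoid.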

---
## HW 5.2
Using MRJob, implement a hashside join (memory-backed map-side) for left, 
right and inner joins. Run your code on the data used in HW 4.4. (Recall HW 4.4: Find the most frequent visitor of each page using mrjob and the output of 4.2 (i.e., the transformed log file).  
In this output please include the webpage URL, webpage ID and visitor ID.)

1. Justify which table you chose as the Left table in this hashside join.
1. Please report the number of rows resulting from:
  1. Left joining Table Left with Table Right
  1. Right joining Table Left with Table Right
  1. Inner joining Table Left with Table Right
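
Before the MRJob version, the idea of a memory-backed join can be sketched in plain Python: the small (left) table is loaded into a dictionary, and the large (right) table is streamed past it. The function name `hash_join` and the sample rows are made up for illustration and are not part of the homework data.

```python
def hash_join(left_rows, right_rows, how="inner"):
    """Join (key, value) rows; left_rows is the small table held in memory."""
    if how not in ("left", "right", "inner"):
        raise ValueError("Unknown join type " + how)
    # Build the in-memory hash table from the small (left) table.
    left = {}
    for key, value in left_rows:
        left.setdefault(key, []).append(value)

    matched = set()
    out = []
    # Stream the large (right) table past the hash table.
    for key, value in right_rows:
        if key in left:
            matched.add(key)
            for left_value in left[key]:
                out.append((key, left_value, value))
        elif how == "right":
            # A right join keeps right rows that have no left match.
            out.append((key, None, value))
    if how == "left":
        # A left join keeps left rows that no right row matched.
        for key in sorted(set(left) - matched):
            for left_value in left[key]:
                out.append((key, left_value, None))
    return out

# Made-up sample: (page ID, URL) and (page ID, visitor ID).
urls = [(1000, "/regwiz"), (1001, "/support"), (1002, "/athome")]
visits = [(1000, 10001), (1001, 10001), (1000, 10002), (1099, 10003)]
inner = hash_join(urls, visits, "inner")
left = hash_join(urls, visits, "left")
right = hash_join(urls, visits, "right")
```

In this sample the inner join has 3 rows, and the left and right joins have 4 each: the left join adds the unvisited page 1002, and the right join adds the visit to the unknown page 1099.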

In [24]:
%%writefile hw52.py
#!/opt/anaconda/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep

class HW52Job(MRJob):
    def configure_options(self):
        super(HW52Job,self).configure_options()
        self.add_passthrough_option("--join_type",type='string',help='[left|right|inner]', default='left')

    urlmap={}
    seen=set()
    def load_urlmap(self):
        with open('hw4.2-urls.txt', 'r') as urls:
            for row in urls:
                fields=row.strip().split(',')
                self.urlmap[int(fields[1])]=fields[3].strip('"')
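    @staticmethod
    def split_visit(line):
        # Returns (urlid, userid), or None for lines that are not visit records
        fields=line.strip().split(',')
        if len(fields) > 4:
            return int(fields[1]), int(fields[4])
    
    # Every mapper yields urlid, (userid, url); None marks the missing side.
    def left_join(self, _, line):
        visit=self.split_visit(line)
        if visit is None:
            return
        urlid, userid=visit
        if urlid in self.urlmap:
            self.seen.add(urlid)
            yield urlid, (userid, self.urlmap[urlid])
    def left_join_final(self):
        # URLs that no visit matched still appear in a left join.
        # This is exact only when a single mapper sees all of the visits.
        for urlid in set(self.urlmap)-self.seen:
            yield urlid, (None, self.urlmap[urlid])
    
    def right_join(self, _, line):
        visit=self.split_visit(line)
        if visit is None:
            return
        urlid, userid=visit
        yield urlid, (userid, self.urlmap.get(urlid, None))

    def inner_join(self, _, line):
        visit=self.split_visit(line)
        if visit is None:
            return
        urlid, userid=visit
        url=self.urlmap.get(urlid, None)
        if url:
            yield urlid, (userid, url)
        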
        

    def id_reducer(self, key, values):
        for value in values:
            yield key, value
    
    def steps(self):
        if self.options.join_type.lower()=='left':
            mapper=self.left_join
            mapper_final=self.left_join_final
        elif self.options.join_type.lower()=='right':
            mapper=self.right_join
            mapper_final=None
        elif self.options.join_type.lower()=='inner':
            mapper=self.inner_join
            mapper_final=None
        else:
            raise ValueError('Unknown join type '+ self.options.join_type)
        return [MRStep(
                mapper_init=self.load_urlmap,
                mapper=mapper,
                mapper_final=mapper_final,
                reducer=self.id_reducer
            )]

if __name__=="__main__":
    HW52Job().run()

Overwriting hw52.py


In [27]:
%%bash
export HADOOP_HOME=/opt/hadoop-2.7.1
export PATH=$HADOOP_HOME/bin:$PATH
export HADOOP_ROOT_LOGGER=INFO,console

for jointype in left right inner; do
    rm -rf hw5.2-${jointype}-output
    ./hw52.py --join_type=${jointype} \
      -r inline --no-output \
      --file ../hw4/hw4.2-urls.txt \
      --output=hw5.2-${jointype}-output \
      ../hw4/hw4.2-visits.txt
done

using configs in /home/nkrishna/.mrjob.conf
creating tmp directory /tmp/hw52.nkrishna.20160215.003154.042726
writing to /tmp/hw52.nkrishna.20160215.003154.042726/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /tmp/hw52.nkrishna.20160215.003154.042726/step-0-mapper-sorted
> sort /tmp/hw52.nkrishna.20160215.003154.042726/step-0-mapper_part-00000
writing to /tmp/hw52.nkrishna.20160215.003154.042726/step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
Moving /tmp/hw52.nkrishna.20160215.003154.042726/step-0-reducer_part-00000 -> hw5.2-left-output/part-00000
removing tmp directory /tmp/hw52.nkrishna.20160215.003154.042726
using configs in /home/nkrishna/.mrjob.conf
creating tmp directory /tmp/hw52.nkrishna.20160215.003200.262725
writing to /tmp/hw52.nkrishna.20160215.003200.262725/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /tmp/hw52.nkrishna.20160215.003200.262725/step-0-mapper-sorted
> sort /tmp/hw52.nkri

**Answers**:
1. I chose the URLs list as the left table since it is much smaller than the visits list, so it is the one that makes sense to load into RAM.

In [36]:
%%bash
echo "2. A:" $(< hw5.2-left-output/part-00000 wc -l)
echo "   B:" $(< hw5.2-right-output/part-00000 wc -l)
echo "   C:" $(< hw5.2-inner-output/part-00000 wc -l)


2. A: 98663
   B: 98654
   C: 98654


--- 
## HW 5.3 For the remainder of this assignment you will work with two datasets:

### 1: unit/systems test data set: SYSTEMS TEST DATASET
Three terms, A,B,C and their corresponding strip-docs of co-occurring terms
```
DocA {X:20, Y:30, Z:5}
DocB {X:100, Y:20}
DocC {M:5, N:20, Z:5}
```

### 2: A large subset of the Google n-grams dataset

https://aws.amazon.com/datasets/google-books-ngrams/

which we have placed in a bucket/folder on Dropbox and on S3:

   https://www.dropbox.com/sh/tmqpc4o0xswhkvz/AACUifrl6wrMrlK6a3X3lZ9Ea?dl=0 

   s3://filtered-5grams/

For each of HW 5.3-5.5, please unit test and system test your code with the SYSTEMS TEST DATASET and show the results.  
Please compute the expected answer by hand and show your hand calculations. Then show the results you get with your system.  
Finally, show your results on the Google n-grams dataset.
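
As a reference for the hand calculations, here is a minimal sketch that computes two measures directly on the systems test stripes. Jaccard (on the sets of co-occurring terms) and cosine similarity are picked here only as illustrations; the function names are not from the assignment.

```python
from math import sqrt

stripes = {
    "DocA": {"X": 20, "Y": 30, "Z": 5},
    "DocB": {"X": 100, "Y": 20},
    "DocC": {"M": 5, "N": 20, "Z": 5},
}

def jaccard(a, b):
    """|A & B| / |A | B| on the sets of terms, ignoring the counts."""
    keys_a, keys_b = set(a), set(b)
    return len(keys_a & keys_b) / float(len(keys_a | keys_b))

def cosine(a, b):
    """Dot product of the count vectors divided by the product of their norms."""
    dot = sum(a[term] * b[term] for term in set(a) & set(b))
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b)
```

By hand: Jaccard(DocA, DocB) = 2/3, Jaccard(DocA, DocC) = 1/5, and Jaccard(DocB, DocC) = 0. For cosine, DocA·DocB = 20·100 + 30·20 = 2600 and the squared norms are 1325 and 10400, so cosine(DocA, DocB) = 2600/√(1325·10400) ≈ 0.700; cosine(DocA, DocC) = 25/√(1325·450) ≈ 0.032; cosine(DocB, DocC) = 0.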


In particular, this bucket contains (~200) files (10Meg each) in the format:
```
	(ngram) \t (count) \t (pages_count) \t (books_count)
```
Do some EDA on this dataset using mrjob, e.g., 

1. Longest 5-gram (number of characters)
2. Top 10 most frequent words (please use the count information), i.e., unigrams
3. Most/Least densely appearing words (count/pages_count) sorted in decreasing order of relative frequency (Hint: save to PART-000* and take the `head -n 1000`)
4. Distribution of 5-gram sizes (using counts info.) sorted in decreasing order of relative frequency.
5. OPTIONAL Question: Plot the log-log plot of the frequency distribution of unigrams. Does it follow a power law distribution?

For more background see:  
https://en.wikipedia.org/wiki/Log%E2%80%93log_plot  
https://en.wikipedia.org/wiki/Power_law
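
For the optional power-law question, one rough check is the slope of a straight-line fit in log-log space: a power law `frequency ∝ rank^(-s)` appears as a line of slope `-s`. The sketch below is a plain least-squares fit on ranked frequencies; `loglog_slope` and the synthetic input are assumptions for illustration, not results from the n-grams data.

```python
from math import log

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    ranked = sorted(frequencies, reverse=True)
    xs = [log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [log(freq) for freq in ranked]
    n = float(len(xs))
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic frequencies that follow frequency = 1000 / rank exactly.
synthetic = [1000.0 / rank for rank in range(1, 101)]
slope = loglog_slope(synthetic)
```

On the synthetic input the slope is -1 up to rounding. On real word counts, a roughly straight log-log plot is suggestive of a power law but is not a proof of one.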

In [29]:
with open('hw5.3-test.txt','w') as f:
    print >>f, 'DocA\t{X:20, Y:30, Z:5}'
    print >>f, 'DocB\t{X:100, Y:20}'
    print >>f, 'DocC\t{M:5, N:20, Z:5}'

In [101]:
%%writefile hw5.3-1.py
#!/usr/bin/env python

from collections import namedtuple
from mrjob.job import MRJob
from mrjob.step import MRStep

class HW53Job(MRJob):
    JOBCONF = {
        'mapred.output.key.comparator.class':
          'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
        'mapred.text.key.comparator.options': '-k1,1nr',
    }
    Row=namedtuple('Row',['ngram', 'count', 'pages_count', 'books_count'])
    @staticmethod
    def split_line(line):
        fields=line.strip().split('\t')
        return HW53Job.Row(fields[0],*[int(field) for field in fields[1:]])

    # Find the longest 5-gram
    ## Mapper: keep track of/update the longest 5-gram.
    #          finally yield the longest one seen.
    def longest_map_init(self):
        self.longest=''
    def longest_map(self, _, line):
        row=HW53Job.split_line(line)
        if len(row.ngram) > len(self.longest):
            self.longest=row.ngram
    def longest_map_final(self):
        yield len(self.longest), self.longest
        
    ## Reducer: keep track of/update the list of longest 5-grams.
    #          finally yield each of the longest 5-grams with its length.
    def longest_red_init(self):
        self.longest_len=0
        self.longest=[]
    def longest_red(self, key, values):
        # key is a 5-gram length; replace the list whenever a greater length
        # arrives, so the result does not depend on the order of the keys
        if key > self.longest_len:
            self.longest_len=key
            self.longest=list(values)
    def longest_red_final(self):
        for val in self.longest:
            yield self.longest_len, val
            
    def steps(self):
        return [MRStep(
                mapper_init=self.longest_map_init,
                mapper=self.longest_map,
                mapper_final=self.longest_map_final,
                reducer_init=self.longest_red_init,
                reducer=self.longest_red,
                reducer_final=self.longest_red_final,
            )]
if __name__=='__main__':
    HW53Job().run()
    exit(0)

Overwriting hw5.3-1.py


In [135]:
%%bash
/usr/bin/ssh root@50.22.252.4 bash -xs <<'EOF'
cd hw5
prog=hw5.3-1
hdfs=hdfs://master:9000
input=$hdfs/filtered-5grams
output=$hdfs/$prog-output
HADOOP_ROOT_LOGGER=WARN,console
hdfs dfs -rm -r -f $output
time ./$prog.py -q -r hadoop --no-bootstrap-mrjob \
  --output $output \
  $input
EOF

Deleted hdfs://master:9000/hw5.3-1-output
159	"ROPLEZIMPREDASTRODONBRASLPKLSON YHROACLMPARCHEYXMMIOUDAVESAURUS PIOFPILOCOWERSURUASOGETSESNEGCP TYRAVOPSIFENGOQUAPIALLOBOSKENUO OWINFUYAIOKENECKSASXHYILPOYNUAT"
159	"AIOPJUMRXUYVASLYHYPSIBEMAPODIKR UFRYDIUUOLBIGASUAURUSREXLISNAYE RNOONDQSRUNSUBUNOUGRABBERYAIRTC UTAHRAPTOREDILEIPMILBDUMMYUVERI SYEVRAHVELOCYALLOSAURUSLINROTSR"


+ cd hw5
+ prog=hw5.3-1
+ hdfs=hdfs://master:9000
+ input=hdfs://master:9000/filtered-5grams
+ output=hdfs://master:9000/hw5.3-1-output
+ HADOOP_ROOT_LOGGER=WARN,console
+ hdfs dfs -rm -r -f hdfs://master:9000/hw5.3-1-output
16/02/16 23:24:39 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
+ ./hw5.3-1.py -q -r hadoop --no-bootstrap-mrjob --output hdfs://master:9000/hw5.3-1-output hdfs://master:9000/filtered-5grams

real	0m57.575s
user	0m25.311s
sys	0m1.572s


In [149]:
%%writefile hw5.3-2.py
#!/usr/bin/env python

from collections import namedtuple
from mrjob.job import MRJob
from mrjob.step import MRStep
import sys

class HW53Job(MRJob):
    Row=namedtuple('Row',['ngram', 'count', 'pages_count', 'books_count'])
    @staticmethod
    def split_line(line):
        fields=line.strip().split('\t')
        return HW53Job.Row(fields[0],*[int(field) for field in fields[1:]])

    """Find the top ten unigrams"""
    def get_wordcounts(self, _, line):
        """Mapper: split the 5-grams, and yield the count with each word."""
        row=HW53Job.split_line(line)
        for word in row.ngram.split():
            yield word, row.count
    def sum_wordcounts(self, key, values):
        """Combiner: sum the counts as in usual word count"""
        yield key, sum(values)
    def sum_swap_wordcounts(self, key, values):
        """Reducer: sum the counts as in usual word count and swap key/val for sorting"""
        yield sum(values), key

    ## dummy map/red steps to cause another sort
    def map_id(self, key, val):
        yield key, val
    def red_id(self, key, vals):
        for x in vals:
            yield key, x 
        
    def steps(self):
        return [
            MRStep(
                mapper=self.get_wordcounts,
                combiner=self.sum_wordcounts,
                reducer=self.sum_swap_wordcounts,
            ),
            MRStep(
                mapper=self.map_id,
                reducer=self.red_id,
                jobconf={
                    'mapred.output.key.comparator.class':
                       'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                    'mapred.text.key.comparator.options': '-k1,1nr',
                },
            ),
        ]

if __name__=='__main__':
    HW53Job().run()
    exit(0)

Overwriting hw5.3-2.py


In [156]:
%%bash
/usr/bin/ssh root@50.22.252.4 bash -xs <<'EOF'
cd hw5
prog=hw5.3-2
hdfs=hdfs://master:9000
input=${hdfs}/filtered-5grams
output=${hdfs}/${prog}-output
HADOOP_ROOT_LOGGER=INFO,console
hdfs dfs -rm -r -f ${output}
time ./${prog}.py -r hadoop --strict-protocols --no-bootstrap-mrjob \
  --no-output \
  --output ${output} \
   ${input}
hdfs dfs -cat ${output}/part-00000 | head -10
EOF

Deleted foo
5375699242	"the"
3691308874	"of"
2221164346	"to"
1387638591	"in"
1342195425	"a"
1135779433	"and"
798553959	"that"
756296656	"is"
688053106	"be"
481373389	"as"


+ cd hw5
+ cat out
+ cat err
16/02/17 08:24:14 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/hw5.root.20160217.142415.557030
writing wrapper script to /tmp/hw5.root.20160217.142415.557030/setup-wrapper.sh
Using Hadoop version 2.7.2
Copying local files into hdfs:///user/root/tmp/mrjob/hw5.root.20160217.142415.557030/files/
HADOOP: packageJobJar: [] [/opt/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar] /tmp/streamjob1443581881176190507.jar tmpDir=null
HADOOP: Connecting to ResourceManager at master/10.108.114.214:8032
HADOOP: Connecting to ResourceManager at master/10.108.114.214:8032
HADOOP: Total input paths to process : 190
HADOOP: number of splits:190
HADOOP: Submitting tokens for job: job_1455677115242_0032
HADOOP: Submitted application application_145567711

In [171]:
%%writefile hw5.3-3.py
#!/usr/bin/env python

from collections import namedtuple
from mrjob.job import MRJob
from mrjob.step import MRStep
import sys

class HW53Job(MRJob):
    Row=namedtuple('Row',['ngram', 'count', 'pages_count', 'books_count'])
    @staticmethod
    def split_line(line):
        fields=line.strip().split('\t')
        return HW53Job.Row(fields[0],*[int(field) for field in fields[1:]])

    """Produce the word densities and sort them"""
    def get_counts(self, _, line):
        """Mapper: split the 5-grams, and yield the count and pages_count with each word."""
        row=HW53Job.split_line(line)
        for word in row.ngram.split():
            yield word, (row.count, row.pages_count)
    def sum_counts(self, key, values):
        """Combiner: sum the counts and the page counts for each word"""
        count, page_count = 0,0
        for val in values:
            count+=val[0]
            page_count+=val[1]
        yield key, (count, page_count)
    def calc_freqs(self, key, values):
        """Reducer: sum as in the combiner, then yield the density (count/pages_count) as the key for sorting"""
        for _,(count, page_count) in self.sum_counts(key, values):
            yield 1.0*count/page_count, key

    ## dummy map/red steps to cause another sort
    def map_id(self, key, val):
        yield key, val
    def red_id(self, key, vals):
        for x in vals:
            yield key, x 
        
    def steps(self):
        return [
            MRStep(
                mapper=self.get_counts,
                combiner=self.sum_counts,
                reducer=self.calc_freqs,
            ),
            MRStep(
                mapper=self.map_id,
                reducer=self.red_id,
                jobconf={
                    'mapred.output.key.comparator.class':
                       'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                    'mapred.text.key.comparator.options': '-k1,1nr',
                },
            ),
        ]

if __name__=='__main__':
    HW53Job().run()
    exit(0)

Overwriting hw5.3-3.py


In [176]:
%%bash
/usr/bin/ssh root@50.22.252.4 bash -xs <<'EOF'
cd hw5
prog=hw5.3-3
hdfs=hdfs://master:9000
input=$hdfs/filtered-5grams
output=$hdfs/$prog-output
HADOOP_ROOT_LOGGER=WARN,console
hdfs dfs -rm -r -f $output
time ./$prog.py -q -r hadoop --no-bootstrap-mrjob \
  --no-output --output $output \
  $input
hdfs dfs -cat ${output}/* | head -1000
EOF

11.557291666666666	"xxxx"
10.161726044782885	"NA"
8.074159907300116	"blah"
7.533333333333333	"nnn"
6.561143644505684	"nd"
5.40736428467472	"ND"
4.921875	"oooooooooooooooo"
4.7272727272727275	"PIC"
4.511627906976744	"llll"
4.349498327759197	"LUTHER"
4.207237859573151	"oooooo"
4.0908402725208175	"NN"
3.9492846924177396	"ooooo"
3.9313725490196076	"OOOOOO"
3.7877030162412995	"IIII"
3.7624521072796937	"lillelu"
3.6570701447431206	"OOOOO"
3.6065625	"Sc"
3.576923076923077	"Pfeffermann"
3.576923076923077	"Madarassy"
3.56	"Meteoritical"
3.536491677336748	"Undecided"
3.505639097744361	"Lib"
3.5	"xxxxxxxx"
3.4791318864774623	"ri"
3.375068493150685	"Vir"
3.2390171258376768	"DREAM"
3.229038854805726	"beep"
3.188679245283019	"Latha"
3.188317505823329	"MARTIN"
3.1699346405228757	"Lis"
3.1147458480120784	"Ac"
3.037142857142857	"OUTPUT"
3.022222222222222	"HENNESSY"
3.0	"ALLIS"
2.9191176470588234	"IYENGAR"
2.869891270467005	"ft"
2.8432451923076925	"Adapted"
2.825	"counterfeiteth"
2.81981981981982	"nonmo

+ cd hw5
+ prog=hw5.3-3
+ hdfs=hdfs://master:9000
+ input=hdfs://master:9000/filtered-5grams
+ output=hdfs://master:9000/hw5.3-3-output
+ head -1000
+ hdfs dfs -cat 'hdfs://master:9000/hw5.3-3-output/*'
cat: Unable to write to output stream.


---
## HW 5.4  (over 2Gig of Data)
In this part of the assignment we will focus on developing methods
for detecting synonyms, using the Google 5-grams dataset. To accomplish
this you must script two main tasks using MRJob:

1. Build stripes for the most frequent 10,000 words using cooccurrence information based on
the words ranked from 1,001 to 10,000 as a basis/vocabulary (drop stopword-like terms),
and output to a file in your bucket on s3 (bigram analysis, though the words are non-contiguous).
1. Using two (symmetric) comparison methods of your choice 
(e.g., correlations, distances, similarities), pairwise compare 
all stripes (vectors), and output to a file in your bucket on s3.

**Design notes for (1)**:  
For this task you will be able to modify the pattern we used in HW 3.2
(feel free to use the solution as reference). To total the word counts 
across the 5-grams, output the support from the mappers using the total 
order inversion pattern:
```
<word,count>
```
to ensure that the support arrives before the cooccurrences.

In addition to ensuring the determination of the total word counts,
the mapper must also output co-occurrence counts for the pairs of
words inside of each 5-gram. Treat these words as a basket,
as we have in HW 3, but count all stripes or pairs in both orders,
i.e., count both orderings: `(word1,word2)`, and `(word2,word1)`, to preserve
symmetry in our output for (2).
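
The design notes for (1) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the reference solution: each 5-gram is treated as a set of distinct vocabulary words (a "basket"), the support of a word is the summed count of the 5-grams containing it, and `build_stripes` and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def build_stripes(rows, vocab):
    """Return (support, stripes) from (ngram, count) rows, counting both orders."""
    support = Counter()
    stripes = defaultdict(Counter)
    for ngram, count in rows:
        # Assumption: the basket holds each vocabulary word at most once.
        basket = set(word for word in ngram.split() if word in vocab)
        for word in basket:
            support[word] += count
            for other in basket:
                if other != word:
                    # Both (word, other) and (other, word) are counted,
                    # because the outer loop visits every word of the basket.
                    stripes[word][other] += count
    return support, stripes

# Made-up rows and vocabulary.
rows = [("a b c d e", 10), ("b c x y z", 5)]
support, stripes = build_stripes(rows, vocab={"b", "c", "x"})
```

Here `stripes["b"]` is `{"c": 15, "x": 5}` and `stripes["c"]` is `{"b": 15, "x": 5}`, so the co-occurrence counts are symmetric. In a MapReduce job the support would be emitted under a key that sorts first (order inversion) rather than returned alongside the stripes.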

**Design notes for (2)**:  
For this task you will have to determine a method of comparison.
Here are a few that you might consider:

* Jaccard
* Cosine similarity
* Spearman correlation
* Euclidean distance
* Taxicab (Manhattan) distance
* Shortest path graph distance (a graph, because our data is symmetric!)
* Pearson correlation
* Kendall correlation

However, be cautioned that some comparison methods are more difficult to
parallelize than others, and do not perform more associations than is necessary, 
since your choice of association will be symmetric.

Please use the inverted index (discussed in live session #5) based pattern to compute the pairwise (term-by-term) similarity matrix. 
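
A minimal in-memory sketch of the inverted-index pattern, assuming Jaccard similarity on the sets of co-occurring terms as the measure: the stripes are first inverted into postings lists, and only words that share at least one term are ever compared. The function name `pairwise_jaccard` is made up for this example.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_jaccard(stripes):
    """Jaccard similarity for every pair of stripes sharing at least one term."""
    # Step 1: invert the stripes into term -> list of words containing it.
    index = defaultdict(list)
    for word, stripe in stripes.items():
        for term in stripe:
            index[term].append(word)
    # Step 2: every pair in a postings list shares that term.
    # Sorting makes each unordered pair appear once, as (smaller, larger).
    shared = defaultdict(int)
    for postings in index.values():
        for word1, word2 in combinations(sorted(postings), 2):
            shared[(word1, word2)] += 1
    # Step 3: |A & B| / (|A| + |B| - |A & B|)
    return dict(
        (pair, n / float(len(stripes[pair[0]]) + len(stripes[pair[1]]) - n))
        for pair, n in shared.items()
    )

test_stripes = {
    "DocA": {"X": 20, "Y": 30, "Z": 5},
    "DocB": {"X": 100, "Y": 20},
    "DocC": {"M": 5, "N": 20, "Z": 5},
}
similarities = pairwise_jaccard(test_stripes)
```

On the systems test dataset this yields 2/3 for (DocA, DocB) and 1/5 for (DocA, DocC). The pair (DocB, DocC) never appears because the two stripes share no term, which is the saving the pattern provides on sparse data.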

In [229]:
%%writefile hw5.4-1.py
#!/usr/bin/env python

from collections import namedtuple, defaultdict, Counter
from mrjob.job import MRJob
from mrjob.step import MRStep
import sys

class HW54Job(MRJob):
    Row=namedtuple('Row',['ngram', 'count', 'pages_count', 'books_count'])
    @staticmethod
    def split_line(line):
        fields=line.strip().split('\t')
        return HW54Job.Row(fields[0],*[int(field) for field in fields[1:]])

    """Build the cooccurrence stripes"""
    def coocurrence_init(self):
        # I produced the vocabulary in a prior step, so I don't actually need the
        # total counts for ranking/filtering in the reducer.
        with open('hw5.4-terms-1k-10k.txt') as vocab_file:
            self.vocab=set(term.strip('\n"') for term in vocab_file)
    def coocurrence(self, _, line):
        """Mapper: split the 5-grams, and yield the coocurrence stripes for each."""
        row=HW54Job.split_line(line)
        # here I filter both co-occuring terms to be in rank 1k to rank 10k.
        # This corresponds to option "B" in the sync session
        terms=[term for term in row.ngram.split() if term in self.vocab]
        # I use slices to omit the position under consideration from the inner
        # loop(s), so a term is not paired with its own occurrence. A word that is
        # repeated within a 5-gram is still counted against its other occurrences.
        for idx,term in enumerate(terms):
            counts=defaultdict(int)
            for co in terms[:idx]:
                counts[co]+=row.count
            for co in terms[idx+1:]:
                counts[co]+=row.count
            if counts:
                yield term, [row.count,counts]
    def sum_counts(self, key, values):
        """Combiner: sum the counts as in usual word count"""
        counts=Counter()
        total=0
        for value in values:
            total+=value[0]
            counts.update(value[1])
        yield key, [total, counts]

    def steps(self):
        return [
            MRStep(
                mapper_init=self.coocurrence_init,
                mapper=self.coocurrence,
                combiner=self.sum_counts,
                reducer=self.sum_counts,
            ),
        ]

if __name__=='__main__':
    HW54Job().run()
    exit(0)

Overwriting hw5.4-1.py


In [230]:
%%bash
(cd ../prov; vagrant rsync master)
/usr/bin/ssh root@50.22.252.4 bash -xs <<'EOF'
cd hw5
freqs=hdfs://master:9000/hw5.3-3-output
hdfs dfs -cat ${freqs}/\* | cut -f2 | head -10000 | tail -9000 > hw5.4-terms-1k-10k.txt

prog=hw5.4-1
hdfs=hdfs://master:9000
input=${hdfs}/filtered-5grams
output=${hdfs}/${prog}-output
HADOOP_ROOT_LOGGER=INFO,console
hdfs dfs -rm -r -f ${output}
time ./${prog}.py -r hadoop \
  --strict-protocols --no-bootstrap-mrjob \
  --no-output --output ${output} \
  --file  hw5.4-terms-1k-10k.txt \
  --hadoop-arg -Dmapreduce.job.maps=56 \
  --hadoop-arg -Dmapreduce.job.reduces=56 \
  ${input}
hdfs dfs -cat ${output}/\* | head -100
EOF

==> master: Rsyncing folder: /media/sf_berkeley/w261/hw/hw5/hw5/ => /root/hw5
==> master: Rsyncing folder: /media/sf_berkeley/w261/hw/hw5/prov/ => /vagrant
Deleted hdfs://master:9000/hw5.4-1-output
"ADV"	[159, {"Router": 159, "Seq": 159}]
"Adjutant"	[137, {"Received": 137}]
"Agronomic"	[94, {"Chania": 94}]
"Ar"	[1944, {"aq": 47, "Tau": 408, "Cap": 448, "abs": 47, "Ar": 2270}]
"Arxiu"	[174, {"Ciutat": 174}]
"Beverlacense"	[55, {"Dunelmense": 55, "Sanctuarium": 110}]
"Biosensors"	[190, {"Sensors": 190}]
"Bowen"	[280, {"J": 280}]
"Chichimec"	[499, {"Ripples": 499}]
"Clinics"	[18195, {"Respiratory": 67, "Immunology": 546, "Nursing": 156, "Maxillofacial": 226, "Multidisciplinary": 88, "Otolaryngologic": 162, "Allergy": 546, "Anesthesiology": 244, "Radiologic": 1607, "Gynecology": 447, "Radiological": 273, "Orthopedic": 302, "Pediatric": 12112, "Medicine": 1788, "Surgery": 226, "Metabolic": 86, "Obstetrics": 447, "Mortality": 91}]
"Colliding"	[99, {"Instrumentation": 99, "Beam": 99}]
"Consti

+ cd hw5
+ freqs=hdfs://master:9000/hw5.3-3-output
+ hdfs dfs -cat 'hdfs://master:9000/hw5.3-3-output/*'
+ cut -f2
+ head -10000
+ tail -9000
cat: Unable to write to output stream.
+ prog=hw5.4-1
+ hdfs=hdfs://master:9000
+ input=hdfs://master:9000/filtered-5grams
+ output=hdfs://master:9000/hw5.4-1-output
+ HADOOP_ROOT_LOGGER=INFO,console
+ hdfs dfs -rm -r -f hdfs://master:9000/hw5.4-1-output
16/02/18 00:16:05 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
+ ./hw5.4-1.py -r hadoop --strict-protocols --no-bootstrap-mrjob --no-output --output hdfs://master:9000/hw5.4-1-output --file hw5.4-terms-1k-10k.txt --hadoop-arg -Dmapreduce.job.maps=56 --hadoop-arg -Dmapreduce.job.reduces=56 hdfs://master:9000/filtered-5grams
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
Using Hadoop version 2.7.2
Copying local files into hdfs:///user/root/tmp/mrjob/hw5.root.20160218.

In [None]:
%%writefile hw5.4-2.py
#!/usr/bin/env python

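from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.protocol import JSONProtocol

class HW54Job(MRJob):
    """Compare words using coocurrences.

    Sketch of the inverted-index pattern with Jaccard similarity (one of the
    symmetric measures listed above) over the sets of co-occurring terms.
    """
    # The input is the JSON key/value output of hw5.4-1.py
    INPUT_PROTOCOL = JSONProtocol

    def invert(self, key, stripe):
        """Mapper: post the word to the inverted index of each term in its stripe"""
        # stripe is [total, {co-occurring term: count}]
        size=len(stripe[1])
        for co in stripe[1]:
            yield co, [key, size]

    def pairs(self, co, postings):
        """Reducer: yield every pair of words that share the term co"""
        postings=sorted(postings)
        for idx,(word1, size1) in enumerate(postings):
            # emit each unordered pair once, since the measure is symmetric
            for word2, size2 in postings[idx+1:]:
                yield [word1, word2], [1, size1, size2]

    def sum_shared(self, pair, values):
        """Combiner: sum the number of shared terms for a pair"""
        shared=0
        for count, size1, size2 in values:
            shared+=count
        yield pair, [shared, size1, size2]

    def jaccard(self, pair, values):
        """Reducer: Jaccard similarity, shared/(size1+size2-shared)"""
        for _,(shared, size1, size2) in self.sum_shared(pair, values):
            yield pair, 1.0*shared/(size1+size2-shared)

    def steps(self):
        return [
            MRStep(
                mapper=self.invert,
                reducer=self.pairs,
            ),
            MRStep(
                combiner=self.sum_shared,
                reducer=self.jaccard,
            ),
        ]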

if __name__=='__main__':
    HW54Job().run()
    exit(0)

In [231]:
sorted(['a', 'z', 'cc', None])

[None, 'a', 'cc', 'z']

In [None]:
%%bash
(cd ../prov; vagrant rsync master)
/usr/bin/ssh root@50.22.252.4 bash -xs <<'EOF'
cd hw5
prog=hw5.4-2
hdfs=hdfs://master:9000
input=${hdfs}/hw5.4-1-output
output=${hdfs}/${prog}-output
HADOOP_ROOT_LOGGER=INFO,console
hdfs dfs -rm -r -f ${output}
time ./${prog}.py -r hadoop \
  --strict-protocols --no-bootstrap-mrjob \
  --no-output --output ${output} \
  --file  hw5.4-terms-1k-10k.txt \
  --hadoop-arg -Dmapreduce.job.maps=56 \
  --hadoop-arg -Dmapreduce.job.reduces=56 \
  ${input}
hdfs dfs -cat ${output}/\* | head -100
EOF

---
## HW 5.5
In this part of the assignment you will evaluate the success of your synonym detector.
Take the top 1,000 closest/most similar/correlative pairs of words as determined
by your measure in (2), and use the synonyms function in the accompanying
python code:

nltk_synonyms.py

Note: This will require installing the python nltk package:

http://www.nltk.org/install.html

and downloading its data with `nltk.download()`.

For each `(word1,word2)` pair,
* check to see if `word1` is in the list, 
`synonyms(word2)`, and vice-versa.
* If one of the two is a synonym of the other, then consider this pair a 'hit', and then report the precision, recall, and F1 measure  of your detector across your 1,000 best guesses.
* Report the macro averages of these measures.
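
The evaluation can be sketched as below. Several things here are assumptions made for illustration: the `synonyms` dictionary stands in for the `synonyms` function of `nltk_synonyms.py`, recall is measured against every synonym pair that exists within the vocabulary, and the words are made up.

```python
from itertools import combinations

def is_hit(word1, word2, synonyms):
    """A pair is a hit if either word is listed as a synonym of the other."""
    return word1 in synonyms.get(word2, ()) or word2 in synonyms.get(word1, ())

def evaluate(predicted_pairs, vocabulary, synonyms):
    """Return (precision, recall, F1) for a list of predicted synonym pairs."""
    hits = sum(1 for w1, w2 in predicted_pairs if is_hit(w1, w2, synonyms))
    # Assumption: the relevant set is every synonym pair within the vocabulary.
    relevant = sum(1 for w1, w2 in combinations(sorted(vocabulary), 2)
                   if is_hit(w1, w2, synonyms))
    precision = hits / float(len(predicted_pairs)) if predicted_pairs else 0.0
    recall = hits / float(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical synonym lists and predictions.
synonyms = {"big": {"large"}, "large": {"big"}, "fast": {"quick"}}
vocabulary = {"big", "large", "fast", "quick", "red"}
predicted = [("big", "large"), ("fast", "red")]
precision, recall, f1 = evaluate(predicted, vocabulary, synonyms)
```

In this toy case one of the two predictions is a hit and one of the two synonym pairs in the vocabulary is found, so precision, recall, and F1 are all 0.5.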

---
## HW 5.5.1 (optional)
There is also a corpus of stopwords, that is, high-frequency words like "the", "to" and "also" that we sometimes want to filter out of a document before further processing. Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts. Python's nltk comes with a prebuilt list of stopwords (see below). Using this stopword list, filter out these tokens from your analysis, rerun the experiments in 5.5, and discuss the results with and without the stopword list.
```
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
```
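
A minimal sketch of the filtering step, using a small hand-copied subset of the list above so that it runs without nltk installed. Lowercasing each token before the lookup is an assumption, since the list is all lowercase; `drop_stopwords` is a hypothetical helper.

```python
# A few entries copied from the nltk list above; in practice use
# set(stopwords.words('english')) instead.
STOPWORDS = frozenset(["i", "the", "to", "of", "a", "and", "in", "is", "be", "as", "that"])

def drop_stopwords(ngram, stopwords=STOPWORDS):
    """Return the words of an n-gram that are not stopwords."""
    return [word for word in ngram.split() if word.lower() not in stopwords]

kept = drop_stopwords("The cat is in the hat")
```

Here `kept` is `["cat", "hat"]`. Converting the list to a set first matters at this scale, because the membership test runs once per token of every 5-gram.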

---
## HW 5.6 (optional)
There are many good ways to build our synonym detectors, so for optional homework, 
measure co-occurrence by (left/right/all) consecutive words only, 
or make stripes according to word co-occurrences with the accompanying 
2-, 3-, or 4-grams (note here that your output will no longer 
be interpretable as a network) inside of the 5-grams.
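
The consecutive-words variant can be sketched as follows, with a hypothetical `side` parameter selecting right neighbors, left neighbors, or both. The helper name and the sample row are made up for illustration.

```python
from collections import Counter, defaultdict

def adjacent_stripes(rows, side="all"):
    """Build stripes from consecutive words only; side is 'left', 'right' or 'all'."""
    if side not in ("left", "right", "all"):
        raise ValueError("Unknown side " + side)
    stripes = defaultdict(Counter)
    for ngram, count in rows:
        words = ngram.split()
        # zip pairs each word with the word that directly follows it.
        for first, second in zip(words, words[1:]):
            if side in ("right", "all"):
                stripes[first][second] += count   # right neighbor of first
            if side in ("left", "all"):
                stripes[second][first] += count   # left neighbor of second
    return stripes

rows = [("a b c d e", 2)]
right_only = adjacent_stripes(rows, "right")
both = adjacent_stripes(rows, "all")
```

With `side="right"` the last word of a 5-gram gets no stripe from that 5-gram, while `side="all"` gives `b` the stripe `{"a": 2, "c": 2}`. Only the `"all"` variant is symmetric.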

--- 
## HW 5.7 (optional)
Once again, benchmark your top 10,000 associations (as in 5.5), this time for your
results from 5.6. Has your detector improved?