#DATASCI W261: Machine Learning at Scale

#Assignment: Week 5

- Juanjo Carin
- [juanjose.carin@ischool.berkeley.edu](mailto:juanjose.carin@ischool.berkeley.edu)
- W261-2
- Week 05
- Submission date: 10/06/2015

##Errata

Here I will upload any **minor corrections** I may make to the assignment after I submit it:

[https://www.dropbox.com/s/59p1i0sb90qw8cp/HW5-Errata.txt?dl=0](https://www.dropbox.com/s/59p1i0sb90qw8cp/HW5-Errata.txt?dl=0)

#HW5.0

1. **What is a data warehouse?**

2. **What is a Star schema?**

3. **When is it used?**

1. A **data warehouse** is a central *repository of* integrated *data* from multiple sources; these (current and historical) data can then be used for reporting and data analysis.

2. The **star schema** is a data schema that consists of a central *fact table* (each row records a *fact*: a transaction, an event, a log entry...) referencing multiple *dimension tables* (or simply *dimensions*; e.g., business objects and their attributes). We can therefore slice and dice the facts over those dimensions using simple queries. For example, the fact table in the center could contain information about a sale (price, quantity, time...), and reference dimensions such as product (name, brand, category...), store (address, ID...), etc.

3. It is **used when**:

    * data are not necessarily normalized,
    * we just need simple queries and fast aggregation, and/or
    * we want to feed OLAP cubes efficiently.
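
As a minimal sketch of the idea (plain Python; the data and the names `products`, `stores`, `sales`, and `total_by` are made up for illustration), the fact table holds the measures plus foreign keys, and each dimension is a lookup table keyed by its ID. Aggregating along any dimension attribute is then a single pass over the facts:

```python
from collections import defaultdict

# Dimension tables: one row per business object, keyed by ID (hypothetical data)
products = {1: {"name": "pen", "category": "office"},
            2: {"name": "desk", "category": "furniture"},
            3: {"name": "stapler", "category": "office"}}
stores = {10: {"city": "Berkeley"}, 11: {"city": "Oakland"}}

# Fact table: one row per sale, with measures and foreign keys to the dimensions
sales = [
    {"product_id": 1, "store_id": 10, "quantity": 3, "price": 2.0},
    {"product_id": 2, "store_id": 10, "quantity": 1, "price": 150.0},
    {"product_id": 3, "store_id": 11, "quantity": 2, "price": 8.0},
    {"product_id": 1, "store_id": 11, "quantity": 5, "price": 2.0},
]

def total_by(dimension, key, attribute):
    """Sum revenue grouped by one attribute of one dimension table."""
    totals = defaultdict(float)
    for sale in sales:
        group = dimension[sale[key]][attribute]
        totals[group] += sale["quantity"] * sale["price"]
    return dict(totals)

# "Slice and dice": the same facts aggregated over two different dimensions
revenue_by_category = total_by(products, "product_id", "category")
revenue_by_city = total_by(stores, "store_id", "city")
```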

#HW5.1

1. **In the database world, what is 3NF?**
2. **Does machine learning use data in 3NF? If so why?**
3. **In what form does ML consume data?**
4. **Why would one use log files that are denormalized?**

1. **3NF** (Third Normal Form) is a normalization (i.e., an organization of data into columns, or attributes, and tables, or relations, that minimizes redundancy by decomposing a flat table into smaller relational tables) in which:
    * the relation table is in 2NF:
        * every non-prime attribute of the table (i.e., one that does not belong to any candidate key of the table) is dependent on the whole of every candidate key;
    * every non-prime attribute of the relation table is non-transitively (i.e., directly) dependent on every candidate key of the table.

  Requiring existence of "the key" ensures that the table is in 1NF; requiring that non-key attributes be dependent on "the whole key" ensures 2NF; further requiring that non-key attributes be dependent on "nothing but the key" ensures 3NF.

2. To solve **Machine Learning** problems we usually do not use data in **3NF**, because the information in each table alone does not give the "full picture": we need to **denormalize** the **data** first, joining or aggregating tables, to be able to answer typical questions from a Machine Learning perspective (which involve all the dimensions at hand).

3. As mentioned in the previous point, ML algorithms use **denormalized data**. This is because most of those algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent), so each observation must carry all the information (features) the algorithm needs.

4. For the reason given above: denormalized log files include, for each observation (i.e., each log entry), all the information (variables) we are going to use when applying an ML algorithm.
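
To make points 2 to 4 concrete, here is a small sketch (plain Python; the tables `customers`, `zip_codes`, and `orders` and the helper `denormalize` are hypothetical). The city depends on the ZIP code, which in turn depends on the customer, so a 3NF design keeps it in its own table; an ML algorithm would rather consume one flat record per observation:

```python
# Normalized (3NF-style) tables: each fact is stored once (hypothetical data)
customers = {1: {"name": "Ann", "zip_code": "94720"},
             2: {"name": "Bob", "zip_code": "94607"}}
zip_codes = {"94720": {"city": "Berkeley"}, "94607": {"city": "Oakland"}}
orders = [{"order_id": 100, "customer_id": 1, "amount": 25.0},
          {"order_id": 101, "customer_id": 2, "amount": 40.0},
          {"order_id": 102, "customer_id": 1, "amount": 10.0}]

def denormalize(orders, customers, zip_codes):
    """Join the three tables into one flat record per order."""
    rows = []
    for order in orders:
        customer = customers[order["customer_id"]]
        city = zip_codes[customer["zip_code"]]["city"]
        rows.append({"order_id": order["order_id"],
                     "amount": order["amount"],
                     "customer": customer["name"],
                     "zip_code": customer["zip_code"],
                     "city": city})
    return rows

# Each row now repeats customer and city data, but is self-contained
flat_rows = denormalize(orders, customers, zip_codes)
```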

#HW5.2

**Using MRJob, implement a hashside join (memory-backed map-side) for left, right and inner joins. Run your code on the  data used in HW 4.4: (Recall HW 4.4: Find the most frequent visitor of each page using mrjob and the output of 4.2 (i.e., transformed log file). In this output please include the webpage URL, webpageID and Visitor ID.)**

**Justify which table you chose as the Left table in this hashside join.**

**Please report the number of rows resulting from:**

1. **Inner joining Table Left with Table Right**

2. **Right joining Table Left with Table Right**

3. **Left joining Table Left with Table Right**

(I've reversed the order mentioned in the Instructions, so each new join adds a bit of complexity over the previous).

### Create Left and Right Tables
Since I included the URLs in the transformed log file, I will generate both tables from scratch.

Recall that the lines in the original file have this form:

    ...
    A,1100,1,"MS in Education","/education"
    A,1210,1,"SNA Support","/snasupport"
    C,"10001",10001
    V,1000,1
    V,1001,1
    V,1002,1
    C,"10002",10002
    V,1001,1
    V,1003,1
    ...

I.e., all the webpage IDs (the primary key) are listed with their URLs, and then each visitor ID, followed by the webpages he or she visited.

In [1]:
import urllib2
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/anonymous/' +\
    'anonymous-msweb.data'
import os
os.chdir('/home/hduser/Dropbox/W261/HW5')
# Two counters to keep track of number of distinct webpages and visitors
A = 0
C = 0

with open('TableLeft.txt', 'w') as TL, open('TableRight.txt', 'w') as TR:
    for line in urllib2.urlopen(url):
        record = line.strip().split(',')
        record = [x.strip('"') for x in record]
        # If the record corresponds to an attribute, link webpage ID with URL
        if record[0] == 'A':
            A += 1
            key = record[1] # webpage ID
            value = record[4] # webpage URL
            TL.write(key + ',' + value + '\n')
        # If the record corresponds to a case (visitor), save that info...
        elif record[0] == 'C':
            C += 1
            value = record[1]
        # ... and pass it to the Vroot (i.e., link visitor ID and webpage ID)
        elif record[0] == 'V':
            key = record[1]
            TR.write(key + ',' + value + '\n')
            
print 'Training Instances  {}'.format(C)
print 'Attributes  {}'.format(A)

Training Instances  32711
Attributes  294


According to [https://kdd.ics.uci.edu/databases/msweb/msweb.data.html](https://kdd.ics.uci.edu/databases/msweb/msweb.data.html) there were:

`Training Instances  32711
Attributes  294`

Exploratory analysis of the 2 tables:

In [2]:
# Number of lines in TableLeft.txt
!echo "Number of webpages:         "$(cat TableLeft.txt | wc -l)
# Number of unique visitor IDs in TableRight.txt
!echo "Number of visitors:         "$(cat TableRight.txt | cut -d',' -f2 | \
                                      uniq | wc -l)
# Number of unique webpage IDs in TableRight.txt
    # (sort before finding unique values: they have to be adjacent)
!echo "Number of webpages visited: "$(cat TableRight.txt | cut -d',' -f1 | \
                                      sort | uniq | wc -l)
# Number of lines in TableRight.txt
!echo "Number of visits:           "$(cat TableRight.txt | wc -l)

Number of webpages:         294
Number of visitors:         32711
Number of webpages visited: 285
Number of visits:           98654


As we already saw in HW4, 9 webpages were not visited.

##HW5.2.1: Inner

### Create MRJob task for Inner Joins

In [3]:
%%writefile HashSideInnerJoin.py
from mrjob.job import MRJob
import csv
    
class HashSideInnerJoin(MRJob):

    def mapper_init(self):
        # Load left-side table in memory as dictionary
        self.TL = {}
        # The absolute path will be passed as argument when calling MRJob
        for key, value in csv.reader(open("TableLeft.txt", "r")):
            # key = webpage ID, value = webpage URL
            self.TL[key] = value   
        
    def mapper(self, _, line):
        # Iterate over the right-side table, a record at a time
        TRrecord = line.split(",")
        key = TRrecord[0]
        value_visitor = TRrecord[1]
        # Look for each record, in the left-side table (in-memory)
        if key in self.TL:
            yield key, (self.TL[key], value_visitor)
    
    # The reducer is optional. If not specified, I found out records are not 
        # sorted by webpage ID
    def reducer(self, key, value):
        for val_url, val_visitor in value:
            yield key, (val_url, val_visitor)
            
if __name__ == '__main__':
    HashSideInnerJoin.run()

Overwriting HashSideInnerJoin.py


### Create Python script to execute any of the 3 types of Join

In [19]:
%%writefile HW52.py
#!/home/hduser/anaconda/bin/python
import sys
JoinType = sys.argv[1]

# Import the class

if JoinType == 'Inner':
    from HashSideInnerJoin import HashSideInnerJoin
    JoinClass = 'HashSideInnerJoin'
    output = 'InnerJoinTable.txt'
elif JoinType == 'Right':
    from HashSideRightJoin import HashSideRightJoin
    JoinClass = 'HashSideRightJoin'
    output = 'RightJoinTable.txt'
elif JoinType == 'Left':
    from HashSideLeftJoin import HashSideLeftJoin
    JoinClass = 'HashSideLeftJoin'
    output = 'LeftJoinTable.txt'
else:
    raise ValueError('USE Inner, Right, OR Left AS ARGUMENTS')
    
# Stream the right-side table; ship the left-side table with --file so that
# mapper_init can load it into memory
mr_job = eval(JoinClass)(args=['TableRight.txt', '--file=TableLeft.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    # Create the join table
    with open(output,'w') as result:
        for line in runner.stream_output():
            # Parse each output line once: key = webpage ID,
            # value = (webpage URL, visitor ID)
            webpageID, (webpageURL, visitorID) = mr_job.parse_output_line(line)
            # str() also covers the None placeholders of the outer joins
            result.write(str(webpageID) + ',' + str(webpageURL) + ',' +
                         str(visitorID) + '\n')

Overwriting HW52.py


###Call Python Script with Inner Join

In [20]:
!chmod a+x HW52.py
!./HW52.py Inner

###Create bash script for EDA of the joint table

In [39]:
%%writefile EDA_HW52.sh
joinTable=$1

echo "Number of webpage IDs:                            "\
    $(cut -d, -f1 $joinTable | grep -v None | sort | uniq | wc -l)
echo "Number of webpage URLs:                           "\
    $(cut -d, -f1,2 $joinTable | grep -v None | sort  | uniq | wc -l)
echo "Number of webpages with no associated webpage URL:"\
    $(cut -d, -f1,2 $joinTable | grep None | sort | uniq | wc -l)
echo "Number of webpages visited:                       "\
    $(cut -d, -f1,3 $joinTable | grep -v None | cut -d, -f1 | sort | uniq | \
      wc -l)
echo "Number of records:                                "\
    $(wc -l < $joinTable)
echo "Number of visits:                                 "\
    $(cut -d, -f3 $joinTable | grep -v None | wc -l)
echo "Number of webpages with no associated visitor ID: "\
    $(cut -d, -f1,3 $joinTable | grep None | sort | uniq | wc -l)
echo "Number of visitors (IDs):                         "\
    $(cut -d, -f3 $joinTable | grep -v None | sort | uniq | wc -l)
if [ $(grep None $joinTable | wc -l) != 0 ]; then \
    echo -e "Webpages with no visits or URL:\n$(grep None $joinTable | \
    sort | sed 's/^/\t/')"; fi

Overwriting EDA_HW52.sh


###EDA and output of the inner join

Exploratory analysis of the joint table:

In [40]:
!chmod a+x EDA_HW52.sh
!./EDA_HW52.sh InnerJoinTable.txt

Number of webpage IDs:                             285
Number of webpage URLs:                            285
Number of webpages with no associated webpage URL: 0
Number of webpages visited:                        285
Number of records:                                 98654
Number of visits:                                  98654
Number of webpages with no associated visitor ID:  0
Number of visitors (IDs):                          32711


Since this is an inner join, every webpage URL in the output is matched with a visitor ID, and vice versa.

In [8]:
!head -50 InnerJoinTable.txt

1000,/regwiz,10001
1000,/regwiz,10010
1000,/regwiz,10039
1000,/regwiz,10073
1000,/regwiz,10087
1000,/regwiz,10101
1000,/regwiz,10132
1000,/regwiz,10141
1000,/regwiz,10154
1000,/regwiz,10162
1000,/regwiz,10166
1000,/regwiz,10201
1000,/regwiz,10218
1000,/regwiz,10220
1000,/regwiz,10324
1000,/regwiz,10348
1000,/regwiz,10376
1000,/regwiz,10384
1000,/regwiz,10409
1000,/regwiz,10429
1000,/regwiz,10454
1000,/regwiz,10457
1000,/regwiz,10471
1000,/regwiz,10497
1000,/regwiz,10511
1000,/regwiz,10520
1000,/regwiz,10541
1000,/regwiz,10564
1000,/regwiz,10599
1000,/regwiz,10752
1000,/regwiz,10756
1000,/regwiz,10861
1000,/regwiz,10935
1000,/regwiz,10943
1000,/regwiz,10969
1000,/regwiz,11027
1000,/regwiz,11050
1000,/regwiz,11410
1000,/regwiz,11429
1000,/regwiz,11440
1000,/regwiz,11490
1000,/regwiz,11501
1000,/regwiz,11528
1000,/regwiz,11539
1000,/regwiz,11544
1000,/regwiz,11685
1000,/regwiz,11695
1000,/regwiz,11723
1000,/regwiz,11766
1000,/regwiz,11774


##HW5.2.2: Right

### Create MRJob task for Right Joins

In [9]:
%%writefile HashSideRightJoin.py
from mrjob.job import MRJob
import csv
    
class HashSideRightJoin(MRJob):

    def mapper_init(self):
        # Load left-side table in memory as dictionary
        self.TL = {}
        # The absolute path will be passed as argument when calling MRJob
        for key, value in csv.reader(open("TableLeft.txt", "r")):
            # key = webpage ID, value = webpage URL
            self.TL[key] = value   
        
    def mapper(self, _, line):
        # Iterate over the right-side table, a record at a time
        TRrecord = line.split(",")
        key = TRrecord[0]
        value_visitor = TRrecord[1]
        # Look for each record, in the left-side table (in-memory)
        if key in self.TL:
            yield key, (self.TL[key], value_visitor)
        # And if there's no match, include the visitor info anyway
        else:
            yield key, (None, value_visitor)
    
    # The reducer is optional. If not specified, I found out records are not 
        # sorted by webpage ID
    def reducer(self, key, value):
        for val_url, val_visitor in value:
            yield key, (val_url, val_visitor)
            
if __name__ == '__main__':
    HashSideRightJoin.run()

Overwriting HashSideRightJoin.py


###Call Python Script with Right Join

In [10]:
!python HW52.py Right

###EDA and output of the right join

Exploratory analysis of the joint table:

In [37]:
!./EDA_HW52.sh RightJoinTable.txt

Number of webpage IDs:                             285
Number of webpage URLs:                            285
Number of webpages visited:                        285
Number of webpages with no associated webpage URL: 0
Number of records:                                 98654
Number of visits:                                  98654
Number of webpages with no associated visitor ID:  0
Number of visitors (IDs):                          32711


Since this is a right join, we might have found some visits not matched with any URL, but that's not the case, because all the webpage IDs (the primary key) appear in the left-side table.

In [12]:
!head -50 RightJoinTable.txt 

1000,/regwiz,10001
1000,/regwiz,10010
1000,/regwiz,10039
1000,/regwiz,10073
1000,/regwiz,10087
1000,/regwiz,10101
1000,/regwiz,10132
1000,/regwiz,10141
1000,/regwiz,10154
1000,/regwiz,10162
1000,/regwiz,10166
1000,/regwiz,10201
1000,/regwiz,10218
1000,/regwiz,10220
1000,/regwiz,10324
1000,/regwiz,10348
1000,/regwiz,10376
1000,/regwiz,10384
1000,/regwiz,10409
1000,/regwiz,10429
1000,/regwiz,10454
1000,/regwiz,10457
1000,/regwiz,10471
1000,/regwiz,10497
1000,/regwiz,10511
1000,/regwiz,10520
1000,/regwiz,10541
1000,/regwiz,10564
1000,/regwiz,10599
1000,/regwiz,10752
1000,/regwiz,10756
1000,/regwiz,10861
1000,/regwiz,10935
1000,/regwiz,10943
1000,/regwiz,10969
1000,/regwiz,11027
1000,/regwiz,11050
1000,/regwiz,11410
1000,/regwiz,11429
1000,/regwiz,11440
1000,/regwiz,11490
1000,/regwiz,11501
1000,/regwiz,11528
1000,/regwiz,11539
1000,/regwiz,11544
1000,/regwiz,11685
1000,/regwiz,11695
1000,/regwiz,11723
1000,/regwiz,11766
1000,/regwiz,11774


##HW5.2.3: Left

### Create MRJob task for Left Joins

In [23]:
%%writefile HashSideLeftJoin.py
from mrjob.job import MRJob
import csv
    
class HashSideLeftJoin(MRJob):

    def __init__(self, *args, **kwargs):
        super(HashSideLeftJoin, self).__init__(*args, **kwargs)
        self.TLkeys = []

    def mapper_init(self):
        # Load left-side table in memory as dictionary
        self.TL = {}
        # The absolute path will be passed as argument when calling MRJob
        for key, value in csv.reader(open("TableLeft.txt", "r")):
            # key = webpage ID, value = webpage URL
            self.TL[key] = value   
            self.TLkeys.append(key)
        
    def mapper(self, _, line):
        # Iterate over the right-side table, a record at a time
        TRrecord = line.split(",")
        key = TRrecord[0]
        value_visitor = TRrecord[1]
        # Look for each record, in the left-side table (in-memory)
        if key in self.TL:
            try:
                self.TLkeys.remove(key)
            except ValueError:
                pass
            yield key, (self.TL[key], value_visitor)
    
    def mapper_final(self):
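        # Emit the left-side records that no right-side record matched.
        # NOTE: this is only correct if a single mapper processes the whole
        # right-side table; with several mappers, each one would also emit
        # the keys matched by the others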
        for key in self.TLkeys:
            yield key, (self.TL[key], None)
    
    def reducer(self, key, value):
        for val_url, val_visitor in value:
            yield key, (val_url, val_visitor)
            
if __name__ == '__main__':
    HashSideLeftJoin.run()

Overwriting HashSideLeftJoin.py


###Call Python Script with Left Join

In [24]:
!python HW52.py Left

###EDA and output of the left join

Exploratory analysis of the joint table:

In [38]:
!./EDA_HW52.sh LeftJoinTable.txt

Number of webpage IDs:                             294
Number of webpage URLs:                            294
Number of webpages visited:                        285
Number of webpages with no associated webpage URL: 0
Number of records:                                 98663
Number of visits:                                  98654
Number of webpages with no associated visitor ID:  9
Number of visitors (IDs):                          32711
Webpages with no visits or URL:
	1287,/autoroute,None
	1288,/library,None
	1289,/masterchef,None
	1290,/devmovies,None
	1291,/news,None
	1292,/northafrica,None
	1293,/encarta,None
	1294,/bookshelf,None
	1297,/centroam,None


As expected, this joint table contains 9 more records than the other two, since 9 URLs are not matched with any visitor IDs.
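
The three variants differ only in what they do with unmatched keys. The following single-process sketch (plain Python, not MRJob; the toy data is partly made up, and `hash_join` is a hypothetical helper) shows the same logic in one place:

```python
# Left-side table (small, held in memory): webpage ID -> URL
left = {"1000": "/regwiz", "1001": "/support", "1287": "/autoroute"}
# Right-side table (streamed): (webpage ID, visitor ID); "9999" has no URL
right = [("1000", "10001"), ("1001", "10001"), ("1000", "10002"),
         ("9999", "10003")]

def hash_join(left, right, how="inner"):
    """Join streamed `right` records against the in-memory `left` dict."""
    rows, matched = [], set()
    for key, visitor in right:
        if key in left:
            matched.add(key)
            rows.append((key, left[key], visitor))
        elif how == "right":
            # Keep right-side records that have no match on the left
            rows.append((key, None, visitor))
    if how == "left":
        # Add left-side keys that no right-side record referenced
        for key in sorted(set(left) - matched):
            rows.append((key, left[key], None))
    return rows

inner_rows = hash_join(left, right, "inner")
right_rows = hash_join(left, right, "right")
left_rows = hash_join(left, right, "left")
```

Tracking the matched keys in a set plays the role that `self.TLkeys` plays in `HashSideLeftJoin`; like that job, the sketch assumes that one process sees the whole right-side table.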

In [16]:
!head -50 LeftJoinTable.txt

1000,/regwiz,10001
1000,/regwiz,10010
1000,/regwiz,10039
1000,/regwiz,10073
1000,/regwiz,10087
1000,/regwiz,10101
1000,/regwiz,10132
1000,/regwiz,10141
1000,/regwiz,10154
1000,/regwiz,10162
1000,/regwiz,10166
1000,/regwiz,10201
1000,/regwiz,10218
1000,/regwiz,10220
1000,/regwiz,10324
1000,/regwiz,10348
1000,/regwiz,10376
1000,/regwiz,10384
1000,/regwiz,10409
1000,/regwiz,10429
1000,/regwiz,10454
1000,/regwiz,10457
1000,/regwiz,10471
1000,/regwiz,10497
1000,/regwiz,10511
1000,/regwiz,10520
1000,/regwiz,10541
1000,/regwiz,10564
1000,/regwiz,10599
1000,/regwiz,10752
1000,/regwiz,10756
1000,/regwiz,10861
1000,/regwiz,10935
1000,/regwiz,10943
1000,/regwiz,10969
1000,/regwiz,11027
1000,/regwiz,11050
1000,/regwiz,11410
1000,/regwiz,11429
1000,/regwiz,11440
1000,/regwiz,11490
1000,/regwiz,11501
1000,/regwiz,11528
1000,/regwiz,11539
1000,/regwiz,11544
1000,/regwiz,11685
1000,/regwiz,11695
1000,/regwiz,11723
1000,/regwiz,11766
1000,/regwiz,11774


#HW5.3

**For the remainder of this assignment you will work with a large subset of the Google n-grams dataset, [https://aws.amazon.com/datasets/google-books-ngrams/](https://aws.amazon.com/datasets/google-books-ngrams/), which we have placed in a bucket on s3: [s3://filtered-5grams/](s3://filtered-5grams/).**

**In particular, this bucket contains (~200) files in the format:**

	(ngram) \t (count) \t (pages_count) \t (books_count)

**Do some EDA on this dataset using mrjob, e.g.,**

- **Longest 5-gram (number of characters)**
- **Top 10 most frequent words (count), i.e., unigrams**
- **Most/Least densely appearing words (count/pages_count) sorted in decreasing order of relative frequency (Hint: save to PART-000* and take the head -n 1000)**
- **Distribution of 5-gram sizes (counts) sorted in decreasing order of relative frequency. (Hint: save to PART-000* and take the head -n 1000)**

**OPTIONAL Question:**

- **Plot the log-log plot of the frequency distribution of unigrams. Does it follow a power law distribution?**

**For more background see:**

**[https://en.wikipedia.org/wiki/Log%E2%80%93log_plot](https://en.wikipedia.org/wiki/Log%E2%80%93log_plot)**

**[https://en.wikipedia.org/wiki/Power_law](https://en.wikipedia.org/wiki/Power_law)**

In [3]:
!aws s3 mb s3://ucb-mids-mls-juanjocarin/
!aws s3 cp gbooks_filtered_sample.txt \
    s3://ucb-mids-mls-hw5/gbooks_filtered_sample.txt

make_bucket: s3://ucb-mids-mls-juanjocarin/
upload: ./gbooks_filtered_sample.txt to s3://ucb-mids-mls-hw5/gbooks_filtered_sample.txt


##HW5.3.1: Longest 5-gram

Since the lengths of the n-grams do not have to be aggregated, only passed on to the reducer, we can use the length as the key and have Hadoop sort the keys numerically in decreasing order.

In addition, to reduce the traffic between the mappers and the reducer, as well as the size of the output file, each mapper (and then the reducer) only emits an n-gram if it is at least as long as the longest one it has seen so far.
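
The core of the computation can be sketched in a single process (plain Python, no Hadoop; the helper names and the sample lines are made up for illustration). Unlike the job, this version counts characters word by word instead of subtracting 4 spaces, and it keeps every n-gram tied for the maximum:

```python
def ngram_length(ngram):
    """Number of characters in an n-gram, not counting separating spaces."""
    return sum(len(word) for word in ngram.split())

def longest_ngrams(lines):
    """Return (length, [n-grams]) for the longest n-grams in the input."""
    longest, winners = 0, []
    for line in lines:
        # Input format: (ngram) \t (count) \t (pages_count) \t (books_count)
        ngram = line.rstrip("\n").split("\t")[0]
        length = ngram_length(ngram)
        if length > longest:
            longest, winners = length, [ngram]
        elif length == longest:
            winners.append(ngram)
    return longest, winners

sample = ["a b c d e\t10\t8\t3\n",
          "this is a longer one\t7\t7\t2\n",
          "that is a longer two\t5\t4\t1\n"]
result = longest_ngrams(sample)
```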

In [4]:
%%writefile Longest.py
#!/home/hduser/anaconda/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

class Longest(MRJob):
    
    # Define a global variable that captures the longest n-gram found 'til now
    longest = 0 
    
    def jobconf(self):
        orig_jobconf = super(Longest, self).jobconf()        
        custom_jobconf = {
            'mapred.output.key.comparator.class': 
                'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
            'mapred.text.key.comparator.options': '-k1rn',
        }
        combined_jobconf = orig_jobconf
        combined_jobconf.update(custom_jobconf)
        self.jobconf = combined_jobconf
        return combined_jobconf
        
    def steps(self):
        return [MRStep(mapper = self.mapper, reducer = self.reducer)]
    
    def mapper(self, _, line):
        line = line.strip()
        [ngram,count,pages,books] = re.split("\t",line)
        length = len(ngram)-4 # Length of the n-gram excluding (n-1) spaces
        # Only yield results if current n-gram is equal or longer than previous 
            # ones
        # This part is optional, but dramatically reduces the mappers' outputs
        if length >= self.longest:
            self.longest = length
            yield int(length),ngram
    
    def reducer(self,length,values):
        # Again, compare with previous n-grams
        if int(length) >= self.longest:
            self.longest = int(length)
            for ngram in values:
                yield length, ngram
        
if __name__ == '__main__':
    Longest.run()

Overwriting Longest.py


In [5]:
!chmod +x Longest.py

In [10]:
%%writefile Longest_driver.py
#!/home/hduser/anaconda/bin/python
from Longest import Longest
mr_job = Longest(args=
                 ['s3://filtered-5grams/',
                  '-r', 'emr'])
with mr_job.make_runner() as runner: 
    runner.run()
    # stream_output: get access of the output 
    for line in runner.stream_output():
        length,ngram = mr_job.parse_output_line(line)
        print str(length) + "\t" + ngram

Overwriting Longest_driver.py


In [11]:
!chmod +x Longest_driver.py

In [12]:
!python Longest_driver.py
# No need to take the head of the output: only the longest 5-grams are emitted

155	ROPLEZIMPREDASTRODONBRASLPKLSON YHROACLMPARCHEYXMMIOUDAVESAURUS PIOFPILOCOWERSURUASOGETSESNEGCP TYRAVOPSIFENGOQUAPIALLOBOSKENUO OWINFUYAIOKENECKSASXHYILPOYNUAT
155	AIOPJUMRXUYVASLYHYPSIBEMAPODIKR UFRYDIUUOLBIGASUAURUSREXLISNAYE RNOONDQSRUNSUBUNOUGRABBERYAIRTC UTAHRAPTOREDILEIPMILBDUMMYUVERI SYEVRAHVELOCYALLOSAURUSLINROTSR


##HW5.3.2: Top 10 most frequent words (unigrams)

Now we have to add the number of times a unigram occurs, for all the 5-grams that include that unigram, so we cannot use count as the key. To reduce the traffic load between the mappers and the reducer, we can use combiners. And in the final step, keep a dictionary of 10 items (10 arbitrary keys whose values are all zero), that is updated with the most frequent unigrams.

Hence, the output of the reducer will only contain the top 10 most frequent words. Again, we just have to print the whole output; in this case we must also sort it, but that won't consume many resources (it contains only 10 elements).
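
As an aside, the "keep only the top 10" step can also be expressed with the standard library. This is a single-process sketch, not the MRJob code used below; `top_unigrams` and the sample lines are hypothetical:

```python
import heapq
from collections import Counter

def top_unigrams(lines, k=10):
    """Sum the count of every word over all n-grams and return the k largest."""
    totals = Counter()
    for line in lines:
        ngram, count = line.rstrip("\n").split("\t")[:2]
        # A word repeated inside an n-gram is counted once per occurrence
        for unigram in ngram.split():
            totals[unigram] += int(count)
    # nlargest returns the k best (word, total) pairs, already sorted
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])

sample = ["the cat sat on the\t4\t4\t2\n",
          "on the mat of a\t3\t3\t1\n",
          "a cat of the house\t2\t2\t1\n"]
top3 = top_unigrams(sample, k=3)
```

Because `heapq.nlargest` returns its result in decreasing order, no separate sort is needed in this variant.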

In [99]:
%%writefile FreqUnigrams.py
#!/home/hduser/anaconda/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

class FreqUnigrams(MRJob):
     
    def steps(self):
        return [MRStep(mapper = self.mapper, combiner = self.combiner, 
                       reducer_init = self.reducer_init, 
                       reducer = self.reducer, 
                       reducer_final = self.reducer_final)]
    
    def mapper(self, _, line):
        line = line.strip()
        [ngram,count,pages,books] = re.split("\t",line)
        # Output the count for each word in the 5-gram
        for unigram in ngram.split():
            yield unigram,int(count)

    # Aggregate partial results before passing to reducer
    def combiner(self, unigram, count):
        partial = sum(c for c in count)
        yield unigram,int(partial)
            
    # Initialize a dictionary with top10 (initially 10 placeholder keys 
        # whose value is 0). The placeholders are integers so that they can 
        # never collide with a real unigram such as "a"
    def reducer_init(self):
        self.top = {}
        for i in range(10):
            self.top[i]=0

    def reducer(self,unigram,partial):
        # Aggregate counts
        total = sum(p for p in partial)
        # If higher than what's already in the Top10...
        if total > min(self.top.values()):
            # remove minimum value
            self.top.pop(min(self.top, key=self.top.get))
            # Substitute with new one
            self.top[unigram] = total

    # Output only Top10
    def reducer_final(self):
        for k,v in self.top.iteritems():
            yield v,k
    
if __name__ == '__main__':
    FreqUnigrams.run()

Overwriting FreqUnigrams.py


In [100]:
!chmod +x FreqUnigrams.py

In [101]:
%%writefile FreqUnigrams_driver.py
#!/home/hduser/anaconda/bin/python
from FreqUnigrams import FreqUnigrams
mr_job = FreqUnigrams(args=
                 ['s3://filtered-5grams/',
                  '-r', 'emr'])
with mr_job.make_runner() as runner: 
    runner.run()
    # stream_output: get access of the output 
    for line in runner.stream_output():
        freq,unigram = mr_job.parse_output_line(line)
        print str(freq) + "\t" + unigram

Overwriting FreqUnigrams_driver.py


In [102]:
!chmod +x FreqUnigrams_driver.py

In [106]:
!./FreqUnigrams_driver.py | sort -rn # Sort the 10 items in reverse order

5375699242	the
3691308874	of
2221164346	to
1387638591	in
1342195425	a
1135779433	and
798553959	that
756296656	is
688053106	be
481373389	as


##HW5.3.3: Most/Least densely appearing words (count/pages_count) sorted in decreasing order of relative frequency
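
The density of a unigram is taken here as the sum of `count` over all the 5-grams that contain it, divided by the sum of `pages_count` over the same 5-grams. A minimal single-process sketch of that ratio (plain Python; `unigram_density` and the sample lines are made up for illustration):

```python
from collections import defaultdict

def unigram_density(lines):
    """Return {unigram: total count / total pages_count} over all n-grams."""
    counts = defaultdict(int)
    pages = defaultdict(int)
    for line in lines:
        ngram, count, pages_count = line.rstrip("\n").split("\t")[:3]
        for unigram in ngram.split():
            counts[unigram] += int(count)
            pages[unigram] += int(pages_count)
    return dict((word, counts[word] / float(pages[word])) for word in counts)

sample = ["blah blah blah blah blah\t12\t3\t1\n",
          "one word on each page\t5\t5\t2\n"]
density = unigram_density(sample)
# Decreasing order of relative frequency, as requested
ranked = sorted(density.items(), key=lambda item: item[1], reverse=True)
```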

In [66]:
%%writefile Density.py
#!/home/hduser/anaconda/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

class Density(MRJob):
     
    def steps(self):
        return [MRStep(mapper = self.mapper, combiner = self.combiner, 
                       reducer = self.reducer)]
    
    def mapper(self, _, line):
        line = line.strip()
        [ngram,count,pages,books] = re.split("\t",line)
        # Output the count for each word in the 5-gram
        for unigram in ngram.split():
            # Value: count & pages
            yield unigram,[int(count),int(pages)]

    # Aggregate partial results before passing to reducer
    def combiner(self, unigram, duple):
        partial_count = 0
        partial_pages = 0
        for count,pages in duple:
            partial_count += count
            partial_pages += pages
        yield unigram,(int(partial_count),int(partial_pages))

    def reducer(self,unigram,duple):
        # Aggregate results
        total_count = 0
        total_pages = 0
        for count,pages in duple:
            total_count += count
            total_pages += pages
        # Calculate density (minimum value will be 1.0)
        density = float(total_count)/total_pages
        yield density,unigram
        
if __name__ == '__main__':
    Density.run()

Writing Density.py


In [67]:
!chmod +x Density.py

In [68]:
%%writefile Density_driver.py
#!/home/hduser/anaconda/bin/python
from Density import Density
import os
mr_job = Density(args=[
        's3://filtered-5grams/','-r', 'emr',
        '--output-dir=s3://ucb-mids-mls-juanjocarin/Density_output',
        '--no-output'
    ])

with mr_job.make_runner() as runner: 
    runner.run()
os.system("aws s3 cp s3://ucb-mids-mls-juanjocarin/Density_output/part-00000 \
    s3://ucb-mids-mls-juanjocarin/DenseUnigrams.txt")
os.system("aws s3 rm s3://ucb-mids-mls-juanjocarin/Density_output/part-00000")
os.system("aws s3 rm s3://ucb-mids-mls-juanjocarin/Density_output/_SUCCESS")

Writing Density_driver.py


In [69]:
!chmod +x Density_driver.py

In [70]:
!./Density_driver.py 

copy: s3://ucb-mids-mls-juanjocarin/Density_output/part-00000 to s3://ucb-mids-mls-juanjocarin/DenseUnigrams.txt
delete: s3://ucb-mids-mls-juanjocarin/Density_output/part-00000
delete: s3://ucb-mids-mls-juanjocarin/Density_output/_SUCCESS


In [79]:
!aws s3 cp s3://ucb-mids-mls-juanjocarin/DenseUnigrams.txt ./DenseUnigrams.txt
!aws s3 rm s3://ucb-mids-mls-juanjocarin/DenseUnigrams.txt

download: s3://ucb-mids-mls-juanjocarin/DenseUnigrams.txt to ./DenseUnigrams.txt
delete: s3://ucb-mids-mls-juanjocarin/DenseUnigrams.txt


###Most densely appearing words sorted in decreasing order of relative frequency (count/pages_count)

In [107]:
!sort -rn DenseUnigrams.txt | head -50

11.557291666666666	"xxxx"
10.161726044782885	"NA"
8.0741599073001158	"blah"
7.5333333333333332	"nnn"
6.5611436445056839	"nd"
5.4073642846747196	"ND"
4.921875	"oooooooooooooooo"
4.7272727272727275	"PIC"
4.5116279069767442	"llll"
4.3494983277591972	"LUTHER"
4.2072378595731514	"oooooo"
4.0908402725208175	"NN"
3.9492846924177396	"ooooo"
3.9313725490196076	"OOOOOO"
3.7877030162412995	"IIII"
3.7624521072796937	"lillelu"
3.6570701447431206	"OOOOO"
3.6065624999999999	"Sc"
3.5769230769230771	"Pfeffermann"
3.5769230769230771	"Madarassy"
3.5600000000000001	"Meteoritical"
3.5364916773367479	"Undecided"
3.505639097744361	"Lib"
3.5	"xxxxxxxx"
3.4791318864774623	"ri"
3.3750684931506849	"Vir"
3.2390171258376768	"DREAM"
3.2290388548057258	"beep"
3.1886792452830188	"Latha"
3.1883175058233291	"MARTIN"
3.1699346405228757	"Lis"
3.1147458480120784	"Ac"
3.0371428571428569	"OUTPUT"
3.0222222222222221	"HENNESSY"
3.0	"ALLIS"
2.9191176470588234	"IYENGAR"
2.8698912704670052	"ft

###Least densely appearing words sorted in decreasing order of relative frequency (count/pages_count)

It's not worth printing many of them, since more than 166K unigrams have a relative frequency of exactly 1.

In [98]:
!echo -e "Number of unigrams that never appear more than once in a page: "\
    $(grep $'1.0\t' DenseUnigrams.txt | wc -l)"\n"
!sort -rn DenseUnigrams.txt | tail -20

Number of unigrams that never appear more than once in a page: 166114

1.0	"Aana"
1.0	"AAN"
1.0	"Aan"
1.0	"aame"
1.0	"AAMC"
1.0	"Aaltonen"
1.0	"AAL"
1.0	"aahs"
1.0	"AAHPERD"
1.0	"aahed"
1.0	"aah"
1.0	"Aagje"
1.0	"AAFES"
1.0	"AAE"
1.0	"Aadam"
1.0	"AACVPR"
1.0	"AACP"
1.0	"AAAE"
1.0	"AAAA"
1.0	"aA"


#HW5.4

**In this part of the assignment we will focus on developing methods for detecting synonyms, using the Google 5-grams dataset. To accomplish this you must script two main tasks using MRJob:**

1. **Build stripes of word co-occurrence for the top 10,000 most frequently appearing words across the entire set of 5-grams, and output to a file in your bucket on s3 (bigram analysis, though the words are non-contiguous).**

2. **Using two (symmetric) comparison methods of your choice (e.g., correlations, distances, similarities), pairwise compare  all stripes (vectors), and output to a file in your bucket on s3.**

> Design notes for (1)

> For this task you will be able to modify the pattern we used in HW 3.2 (feel free to use the solution as reference). To total the word counts across the 5-grams, output the support from the mappers using the total order inversion pattern:

    > <*word,count>

> to ensure that the support arrives before the cooccurrences.

> In addition to ensuring the determination of the total word counts, the mapper must also output co-occurrence counts for the pairs of words inside of each 5-gram. Treat these words as a basket, as we have in HW 3, but count all stripes or pairs in both orders, i.e., count both orderings: (word1,word2), and (word2,word1), to preserve symmetry in our output for (2).

> Design notes for (2)

> For this task you will have to determine a method of comparison. Here are a few that you might consider:

> - Spearman correlation
> - Euclidean distance
> - Taxicab (Manhattan) distance
> - Shortest path graph distance (a graph, because our data is symmetric!)
> - Pearson correlation
> - Cosine similarity
> - Kendall correlation
> - ...

> However, be cautioned that some comparison methods are more difficult to parallelize than others, and do not perform more associations than is necessary, since your choice of association will be symmetric.


I've chosen the **Manhattan** (or **Taxicab**) **distance**. The Euclidean distance could be implemented in the same way, by squaring each difference and taking the square root of the sum of those squares. The Manhattan distance is defined as:

$$d_1(\mathbf{p},\mathbf{q})=\left \|\mathbf{p}-\mathbf{q}  \right \|_1=
\sum_{i=1}^N\left | p_i-q_i \right |$$
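
As a minimal sketch in plain Python (not the MRJob implementation used below), both distances can be computed for two equal-length vectors as follows. The function names `manhattan` and `euclidean` and the sample vectors are illustrative assumptions.

```python
from math import sqrt

def manhattan(p, q):
    """Sum of absolute coordinate differences (L1 norm of p - q)."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def euclidean(p, q):
    """Square root of the sum of squared differences (L2 norm of p - q)."""
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Illustrative vectors (not taken from the 5-grams data)
p = [7, 8, 5]
q = [8, 4, 1]
d1 = manhattan(p, q)  # 1 + 4 + 4 = 9
d2 = euclidean(p, q)  # sqrt(1 + 16 + 16) = sqrt(33)
```

Both functions are symmetric in their arguments, so each pair of stripes only needs to be compared once.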

##HW5.4.1

In [67]:
%%writefile HW541_TopN.py
#!/home/hduser/anaconda/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from operator import itemgetter
from mrjob.compat import get_jobconf_value

class HW541_TopN(MRJob):
    
    def configure_options(self):
        super(HW541_TopN, self).configure_options()
        # The number of most frequent unigrams can be configured by
            # the user as an argument
        self.add_passthrough_option('--number_unigrams',  
                                    dest='number_unigrams', type='int', 
                                    default=10)
    
    def steps(self):
        return [MRStep(mapper = self.mapper, combiner = self.combiner,
                       reducer_init = self.reducer_init, 
                       reducer = self.reducer, 
                       reducer_final = self.reducer_final)]
    
    def mapper(self, _, line):
        line = line.strip()
        [ngram,count,pages,books] = re.split("\t",line)
        # Output the count for each word in the 5-gram
        unigrams = ngram.split()
        for unigram in unigrams:
            yield unigram, int(count)

    def combiner(self, unigram, count):
        yield unigram, sum(count)

    def reducer_init(self):
        self.top = {}

    def reducer(self, unigram, count):
        total = sum(count)
        # If we have not exceeded max size of the dictionary yet
        if len(self.top.keys()) < self.options.number_unigrams:
            self.top[unigram] = total
        # If exceeded, include new unigram only if more frequent than
                # another one previously stored
        else:
            if total > min(self.top.values()):
                # Remove unigram not so frequent
                self.top.pop(min(self.top, key = self.top.get))
                # Add new unigram
                self.top[unigram] = total
    
    def reducer_final(self):
        for unigram in self.top.keys():
            yield unigram, self.top[unigram]

if __name__ == '__main__':
    HW541_TopN.run()

Overwriting HW541_TopN.py
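
The reducer above keeps a bounded dictionary and scans it for the minimum every time a candidate arrives. A possible alternative, shown here only as a sketch outside MRJob, is a min-heap of size N, where each update costs O(log N) instead of a full scan. The helper `top_n` and the sample counts are hypothetical.

```python
import heapq

def top_n(counts, n):
    """Return the n (word, total) pairs with the largest totals."""
    heap = []  # min-heap of (total, word); the root is the weakest candidate
    for word, total in counts:
        if len(heap) < n:
            heapq.heappush(heap, (total, word))
        elif total > heap[0][0]:
            # Replace the least frequent word kept so far
            heapq.heapreplace(heap, (total, word))
    # Most frequent first
    return sorted(((w, t) for t, w in heap), key=lambda x: -x[1])

# Made-up totals, for illustration only
sample = [("the", 50), ("of", 30), ("xxxx", 2), ("and", 40), ("a", 35)]
top3 = top_n(sample, 3)
```

Note that, in either version, if the job runs with more than one reducer, each reducer would keep its own top N, so the partial results would still need to be merged.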


In [68]:
!chmod +x HW541_TopN.py

In [None]:
!./HW541_TopN.py s3://filtered-5grams/ -r emr --number_unigrams=100 > Top.txt

using configs in /home/hduser/.mrjob.conf
using existing scratch bucket mrjob-03e94e1f06830625
using s3://mrjob-03e94e1f06830625/tmp/ as our scratch dir on S3
creating tmp directory /tmp/HW541_TopN.hduser.20151008.185915.107662
writing master bootstrap script to /tmp/HW541_TopN.hduser.20151008.185915.107662/b.py
Copying non-input files into s3://mrjob-03e94e1f06830625/tmp/HW541_TopN.hduser.20151008.185915.107662/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-F8DJDNV76J81
Created new job flow j-F8DJDNV76J81
Job launched 31.2s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 62.3s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 93.3s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 124.3s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 155.3s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 186.3s ago, status STARTING: Configuri

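Keep only the words, not their counts (writing to a temporary file first, here arbitrarily named `Top_words.txt`, because redirecting `cut` straight back into `Top.txt` would truncate the file before it is read):

In [77]:
!cut -f1 Top.txt > Top_words.txt && mv Top_words.txt Top.txt && cat Top.txt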

"all"
"think"
"His"
"had"
"him"
"to"
"must"
"Although"
"struck"
"has"
"ought"
"do"
"his"
"me"
"were"
"know"
"they"
"not"
"he"
"this"
"told"
"From"
"For"
"see"
"been"
"are"
"what"
"for"
"may"
"He"
"be"
"we"
"University"
"never"
"by"
"on"
"her"
"could"
"place"
"or"
"first"
"Even"
"one"
"long"
"should"
"your"
"from"
"would"
"there"
"But"
"doubt"
"certain"
"meeting"
"that"
"took"
"American"
"with"
"History"
"And"
"myself"
"these"
"was"
"will"
"can"
"of"
"my"
"and"
"Court"
"give"
"God"
"is"
"am"
"it"
"an"
"How"
"as"
"at"
"have"
"in"
"seen"
"if"
"no"
"After"
"when"
"also"
"you"
"A"
"shall"
"I"
"upon"
"man"
"a"
"All"
"An"
"thought"
"As"
"so"
"At"
"time"
"the"


In [27]:
%%writefile HW541.py
#!/home/hduser/anaconda/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from itertools import combinations 
from operator import itemgetter
from mrjob.compat import get_jobconf_value

class HW541(MRJob):

    def configure_options(self):
        super(HW541, self).configure_options()
        # The number of most frequent unigrams can be configured by
            # the user as an argument
        self.add_passthrough_option('--number_unigrams',  
                                    dest='number_unigrams', type='int', 
                                    default=20)
    
    def steps(self):
        return [MRStep(mapper_init = self.mapper_init, mapper = self.mapper, 
                       mapper_final = self.mapper_final, 
                       reducer_init = self.reducer_init, reducer = self.reducer, 
                       reducer_final = self.reducer_final)]

    def mapper_init(self):
        ## Initialize the co-occurrence counts hash array
        self.countsM = {}
        self.supportM = {} # Total count for each unigram

    def mapper(self, _, line):
        line = line.strip()
        [ngram,count,pages,books] = re.split("\t",line)
        # Output the count for each word in the 5-gram
        unigrams = ngram.lower().split()
        for unigram in unigrams:
            # Add to dictionary
            self.supportM.setdefault(unigram,0)
            # Update value
            self.supportM[unigram] += int(count)
        # Get all of the 2-sets
        combs = list(combinations(unigrams,2))
        for combination in combs:
            unigram1,unigram2 = sorted(combination)
            # Add to dictionary
            self.countsM.setdefault(unigram1,{})
            # Initialize the other word of the bigram
            self.countsM[unigram1].setdefault(unigram2,0)
            # Add the support of unigram1 & unigram2
            self.countsM[unigram1][unigram2] += int(count)
        
    def mapper_final(self):
        # Place a * in for support order inversion in sort
        for unigram in self.supportM.keys():
            yield "*"+unigram, str(self.supportM[unigram])
        # Emit co-occurrence counts to stream
        for unigram in self.countsM.keys():
            yield unigram, self.countsM[unigram]
            
    def reducer_init(self):
        # Initialize similar dictionaries
        self.countsR = {}
        self.supportR = {}

    def reducer(self,unigram,stripe):
        N = self.options.number_unigrams 
            # To keep final dict size bound to the required value
        # Check to see if this is a support line
        if re.match("\*",unigram):
            unigram = unigram[1:]
            count = 0
            for s in stripe:
                count += int(s)
            # If we have not exceeded max size of the dictionary yet
            if len(self.supportR.keys()) < N:
                self.supportR[unigram] = count
            # If exceeded, include new unigram only if more frequent than
                # another one previously stored
            else:
                if count > min(self.supportR.values()):
                    # Remove unigram not so frequent
                    self.supportR.pop(min(self.supportR, 
                        key = self.supportR.get))
                    # Add new unigram
                    self.supportR[unigram] = count
        else: # if it's a stripe
            unigram1 = unigram
            # Add stripe only if it corresponds to a frequent unigram
            if unigram1 in self.supportR.keys():
                self.countsR[unigram1]={}
                for s in stripe:
                    for unigram2 in s.keys():
                        # Include bigram only if both words are frequent 
                        if unigram2 in self.supportR.keys():
                            self.countsR[unigram1].setdefault(unigram2,0)
                            self.countsR[unigram1][unigram2] += \
                                int(s[unigram2])
                    
    # Now we have a kind of triangular matrix, but some values are missing
        # (when there is no co-occurrence) -- we have to fill them with
        # zeros, and mirror the stored values into the other half of the matrix
    
    def reducer_final(self):
        for unigram1 in sorted(self.supportR.keys()):
            # If not in the hash table (the way the mapper was built,
                # the existence of self.countsR[A][B] made the existence
                # of self.countsR[B][A] unnecessary)
            if unigram1 not in self.countsR.keys():
                # Add it
                self.countsR[unigram1]={}
        # Double loop to build the complete matrix and fulfill missing values
        for unigram1 in sorted(self.supportR.keys()):
            for unigram2 in sorted(self.supportR.keys()):
                if unigram2 not in self.countsR[unigram1].keys():
                    if unigram1 in self.countsR[unigram2].keys():                        
                        self.countsR[unigram1][unigram2] = \
                            self.countsR[unigram2][unigram1]
                    else:
                        self.countsR[unigram1][unigram2] = 0
        # OUTPUT: a row of the confidence matrix, for each unigram
        for unigram1 in sorted(self.supportR.keys()):
            # Calculate the cooccurrences of each bigram
            yield unigram1,','.join([
                    str(self.countsR[unigram1][unigram2]) for unigram2 in \
                        sorted(self.supportR.keys())])

if __name__ == '__main__':
    HW541.run()

Overwriting HW541.py


In [28]:
!chmod +x HW541.py

Let's make a test with a sample file:

In [29]:
!./HW541.py gbooks_filtered_sample.txt --number_unigrams=100 > hw541_sample_output

using configs in /home/hduser/.mrjob.conf
creating tmp directory /tmp/HW541.hduser.20151008.040959.243053
writing to /tmp/HW541.hduser.20151008.040959.243053/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /tmp/HW541.hduser.20151008.040959.243053/step-0-mapper-sorted
> sort /tmp/HW541.hduser.20151008.040959.243053/step-0-mapper_part-00000
writing to /tmp/HW541.hduser.20151008.040959.243053/step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
Moving /tmp/HW541.hduser.20151008.040959.243053/step-0-reducer_part-00000 -> /tmp/HW541.hduser.20151008.040959.243053/output/part-00000
Streaming final output from /tmp/HW541.hduser.20151008.040959.243053/output
removing tmp directory /tmp/HW541.hduser.20151008.040959.243053


In [30]:
!head -5 hw541_sample_output

"a"	"5042,896,717,2617,891,1016,550,1627,419,1623,10445,770,8903,108380,2272,1491,7070,1905,2224,990,1395,629,550,239,911,0,719,247,15903,5247,73,8427,1126,11904,2660,5500,28994,1503,441,6506,943,1683,889,38533,347,14592,14543,2337,1809,2314,7060,1075,314,104283,1763,420,1152,54,554,8678,52,2587,165017,3043,416,1393,0,382,524,372,745,107,665,540,367,43,5402,129225,2363,1324,397,710,1943,2454,3365,12092,0,95,0,287,11444,928,1136,435,766,803,8364,1213,3252,62"
"about"	"896,0,0,632,0,167,0,0,64,0,555,48,86,43,105,0,126,181,0,0,218,0,60,0,0,0,0,0,749,120,66,0,0,0,0,180,433,0,121,54,0,0,103,5303,0,0,374,3997,67,0,0,0,42,0,0,0,0,0,0,0,0,0,691,76,395,0,0,0,0,0,0,0,0,0,0,0,79,1421,88,0,0,106,1021,4094,262,735,0,0,0,0,4169,749,0,126,0,0,0,0,0,0"
"according"	"717,0,0,0,0,0,0,0,0,52,290,45,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,54,0,93,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8137,0,206,0,0,0,0,0,0,0,0,0,0,0,0,8792,0,0,0,0,195,0,0,10297,0,0,0,0,45,0,0,0,0,43,0,0,0,0"
"after"	"261

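As seen above, each row holds the raw co-occurrence counts of one unigram with every selected unigram (in alphabetical order), and the matrix is symmetric: for example, both the second entry of the row for "a" and the first entry of the row for "about" equal 896. These counts are not normalized into confidences, so the diagonal entries are not 1.0.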

In [31]:
%%writefile HW541_driver.py
#!/home/hduser/anaconda/bin/python
from HW541 import HW541

import os

mr_job = HW541(args=['s3://filtered-5grams/', '-r', 'emr',
                     '--number_unigrams=10000',
                     '--output-dir=s3://ucb-mids-mls-juanjocarin/Confidence',
                     '--no-output'])
with mr_job.make_runner() as runner: 
    runner.run()
os.system("aws s3 cp s3://ucb-mids-mls-juanjocarin/Confidence/part-00000 \
    s3://ucb-mids-mls-juanjocarin/Confidence.txt")
os.system("aws s3 rm s3://ucb-mids-mls-juanjocarin/Confidence/part-00000")
os.system("aws s3 rm s3://ucb-mids-mls-juanjocarin/Confidence/_SUCCESS")

Overwriting HW541_driver.py


In [32]:
!chmod +x HW541_driver.py

In [37]:
!./HW541_driver.py 


KeyboardInterrupt


##HW5.4.2

Let's suppose that our coordinates are as follows (small integers, for simplicity):

$\begin{pmatrix}
7 & 8 & 5\\
8 & 4 & 1\\ 
5 & 1 & 9
\end{pmatrix}$

The (Manhattan) distance matrix is easy to calculate (and of course the elements in the diagonal will be null):

$\begin{pmatrix}
0 & 9 & 13\\
9 & 0 & 14\\ 
13 & 14 & 0
\end{pmatrix}$

With the first row, 
$\begin{pmatrix}
7 & 8 & 5
\end{pmatrix}$, corresponding to the 1st component, we can calculate $\mid p_1 - q_1 \mid$ for all possible combinations of $\mathbf{p}$ and $\mathbf{q}$: $\begin{pmatrix}
0 & 1 & 2
\end{pmatrix}$ if $\mathbf{q}$ is the unigram corresponding to the first column, $\begin{pmatrix}
1 & 0 & 3
\end{pmatrix}$ if $\mathbf{q}$ is the unigram corresponding to the second column, and so on. If we proceed the same way for all rows, for the unigram in the first column we could obtain the following matrix:

$\begin{pmatrix}
0 & 1 & 2\\
0 & 4 & 7\\ 
0 & 4 & 4
\end{pmatrix}$

Adding the rows of this matrix together (i.e., summing down each column) gives the first row of our distance matrix, $\begin{pmatrix}
0 & 9 & 13
\end{pmatrix}$, which contains the distance between the first unigram and itself, between the first unigram and the second, and between the first unigram and the third.
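
The worked example can be checked with a short sketch in plain Python. It mirrors the idea of the MRJob code that follows (each row contributes one flattened vector of absolute differences, and the contributions are added up), but the function names `partial_differences` and `manhattan_matrix` are illustrative and not part of the job itself.

```python
def partial_differences(row):
    """For one component (row), |row[i] - row[j]| for every pair (i, j), flattened."""
    return [abs(a - b) for a in row for b in row]

def manhattan_matrix(rows):
    """Accumulate the per-component contributions into an N x N distance matrix."""
    n = len(rows)
    totals = [0] * (n * n)
    for row in rows:
        # Each row adds its own |p_k - q_k| term to every (p, q) pair
        totals = [t + d for t, d in zip(totals, partial_differences(row))]
    # Reshape the flat list of N*N sums into N rows of N distances
    return [totals[i:i + n] for i in range(0, n * n, n)]

# The 3x3 example from the text
coords = [[7, 8, 5],
          [8, 4, 1],
          [5, 1, 9]]
distances = manhattan_matrix(coords)
```

Since every row produces N x N partial differences, the intermediate data grows quadratically with the number of unigrams, which is worth keeping in mind for N = 10,000.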

In [33]:
%%writefile HW542.py
#!/home/hduser/anaconda/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from itertools import combinations 
from operator import itemgetter
from math import sqrt

class HW542(MRJob):
    
    def configure_options(self):
        super(HW542, self).configure_options()
        # The number of most frequent unigrams can be configured by
            # the user as an argument
        self.add_passthrough_option('--number_unigrams',  
                                    dest='number_unigrams', type='int', 
                                    default=20)

    def steps(self):
        return [MRStep(mapper = self.mapper, 
                reducer = self.reducer)]
    
    def mapper(self, _, line):
        # The i-th line (corresponding to the i-th of the top 10,000 most
            # frequent unigrams) contains the i-th coordinate of all unigrams
        line = re.sub('\"', '', line)
        line = line.split()
        unigram = line[0]
        coords = line[1].split(',')
        # We have N (=10,000) coordinates and points
        # For each row (or vector) of N elements we're going to calculate N
            # other vectors, by subtracting the 1st, second, ... N-th element
            # and taking the absolute value
        # We also need the unigram, because (since they were ordered 
            # alphabetically) it will allow us to detect the value of "i"
        partial_diff = []
        for i in range(len(coords)):
            partial_diff.append([abs(int(coords[i])-int(x)) for x in coords])
        # Flatten to a single list
        partial_diff = [item for sublist in partial_diff for item in sublist]
        yield None, {unigram:partial_diff}
    
    def reducer(self, _, partial_diff):
        N = self.options.number_unigrams # number of dimensions and unigrams
        unigrams = []
        distances = [0] * N * N
        for p in partial_diff:
            unigrams.append(p.keys()[0]) # just need 1st key, each dictionary
                # only contains one, corresponding to one unigram
            # Add distances row-wise
            distances = [d+int(x) for d,x in zip(distances,p.values()[0])]
        # Convert a single row (size N*N) to N rows of a NxN matrix
        distance = []
        i=0
        while i<N*N:
            distance.append(distances[i:i+N])
            i += N
        # 1st row contains the names of the unigrams (sorted, to match the
            # order of the rows and columns below)
        yield None, sorted(unigrams)
        # Subsequent N rows contain the distances to the other unigrams
        for u,d in zip(sorted(unigrams),distance):
            yield u,d
        
if __name__ == '__main__':
    HW542.run()

Overwriting HW542.py


In [34]:
!chmod +x HW542.py

Let's continue our test with the confidence matrix of the sample file:

In [36]:
!./HW542.py hw541_sample_output --number_unigrams=100 | head -11

using configs in /home/hduser/.mrjob.conf
creating tmp directory /tmp/HW542.hduser.20151008.041033.665064
writing to /tmp/HW542.hduser.20151008.041033.665064/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /tmp/HW542.hduser.20151008.041033.665064/step-0-mapper-sorted
> sort /tmp/HW542.hduser.20151008.041033.665064/step-0-mapper_part-00000
writing to /tmp/HW542.hduser.20151008.041033.665064/step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
Moving /tmp/HW542.hduser.20151008.041033.665064/step-0-reducer_part-00000 -> /tmp/HW542.hduser.20151008.041033.665064/output/part-00000
Streaming final output from /tmp/HW542.hduser.20151008.041033.665064/output
null	["a", "about", "according", "after", "all", "also", "although", "am", "american", "an", "and", "are", "as", "at", "be", "been", "but", "by", "can", "certain", "could", "court", "did", "discussion", "do", "doubt", "even", "first", "for", "from", "get", "give", "god", "had", "has", "have",

As expected, $\text{distance}(\text{unigram}_i,\text{unigram}_i)=0.0 \text{ } \forall i \in \{1,\dots,N\}$.

In [147]:
%%writefile HW542_driver.py
#!/home/hduser/anaconda/bin/python
from HW542 import HW542

import os

mr_job = HW542(args=['s3://ucb-mids-mls-juanjocarin/Confidence.txt', 
                     '-r', 'emr', '--number_unigrams=10000',
                     '--output-dir=s3://ucb-mids-mls-juanjocarin/Distances',
                     '--no-output'])
with mr_job.make_runner() as runner:
    runner.run()
os.system("aws s3 cp s3://ucb-mids-mls-juanjocarin/Distances/part-00000 \
    s3://ucb-mids-mls-juanjocarin/Distances.txt")
os.system("aws s3 rm s3://ucb-mids-mls-juanjocarin/Distances/part-00000")
os.system("aws s3 rm s3://ucb-mids-mls-juanjocarin/Distances/_SUCCESS")    

Overwriting HW54_driver.py


In [148]:
!chmod +x HW542_driver.py

In [None]:
!./HW542_driver.py 

In [123]:
!aws s3 cp s3://ucb-mids-mls-juanjocarin/Distances.txt ./Distances.txt
!aws s3 rm s3://ucb-mids-mls-juanjocarin/Distances.txt
!head -25 Distances.txt

A client error (404) occurred when calling the HeadObject operation: Key "Stripe_commparison.txt" does not exist
Completed 1 part(s) with ... file(s) remaining
A client error (404) occurred when calling the HeadObject operation: Key "Stripe_commparison.txt" does not exist
Completed 1 part(s) with ... file(s) remaining
head: cannot open ‘Stripe_commparison.txt’ for reading: No such file or directory
