# DATSCIW261 ASSIGNMENT #5

Angela Gunn, Jing Xu

angela@egunn.com, jaling@gmail.com

W261-3


2/10/16

## **HW5.0**

**What is a data warehouse?**

A data warehouse is a central repository of integrated data from one or many sources used for reporting and data analysis. In an enterprise setting, it serves as the primary repository of data from sales transactions to product inventories. Modern data warehouses can store:  
- relational data  
- semi-structured data like query logs  
- unstructured data like tweets, titles of web pages

Data warehouses form a foundation for business intelligence and data science, and are leveraged to gain a competitive advantage in the marketplace through data mining.

**What is a Star schema? When is it used?**

A Star Schema is a type of data mart schema that consists of one or more fact tables referencing any number of dimension tables. It gets its name from the tendency of the physical model to resemble a star with the fact table in the center and dimension tables surrounding it.

<img src="starschema.png">

It is used for reporting and aggregation queries, because the join logic is usually simpler than the join logic needed to retrieve data from a highly normalized transactional schema. Compared to more normalized schemas, it also offers simpler business reporting logic, query performance gains, and faster aggregations.
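As an illustration of that join pattern, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_date`) and the rows are made up for this example and are not part of the assignment data.

```python
import sqlite3

# In-memory database with one fact table and two dimension tables (hypothetical names).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, amount REAL);

    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO dim_date    VALUES (10, 2015), (11, 2016);
    INSERT INTO fact_sales  VALUES (1, 10, 5.0), (1, 11, 7.0), (2, 11, 3.0);
""")

# Every join goes from the fact table straight to a dimension table (one hop each).
query = """
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date d    ON f.date_id = d.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
"""
rows = conn.execute(query).fetchall()
```

Each dimension is reached in a single join from the fact table, which is what keeps the reporting query short.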


## **HW5.1**

**In the database world, what is 3NF?**

3NF is third normal form, the third step in normalizing a database; it builds on the first and second normal forms. Normalization reduces the unnecessary duplication of data and, together with foreign keys, supports referential integrity (every value of a referencing attribute exists as a value of the referenced attribute in the same or a different table). A table is in 3NF when it is in second normal form and every non-prime attribute is determined only by the candidate keys of the table and not by any other non-prime attribute, i.e., there are no transitive dependencies. 3NF is used to improve transaction processing while minimizing storage costs, which is ideal for online transaction processing (OLTP) applications.
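A small sketch of removing a transitive dependency may help. The `orders_flat` rows and column names below are invented for illustration: `order_id` determines `zip`, and `zip` determines `city`, so `city` depends on the key only transitively.

```python
# Not in 3NF: city depends on zip, which is not a candidate key of this table.
orders_flat = [
    {"order_id": 1, "zip": "94704", "city": "Berkeley"},
    {"order_id": 2, "zip": "94704", "city": "Berkeley"},
    {"order_id": 3, "zip": "10001", "city": "New York"},
]

# 3NF decomposition: the zip -> city fact moves into its own table, keyed by zip.
zip_city = {row["zip"]: row["city"] for row in orders_flat}
orders = [{"order_id": row["order_id"], "zip": row["zip"]} for row in orders_flat]

# Joining on zip rebuilds the original rows, so the decomposition lost nothing.
rebuilt = [dict(order, city=zip_city[order["zip"]]) for order in orders]
```

After the split, "Berkeley" is stored once instead of once per order.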

**Does machine learning use data in 3NF? If so why?**

3NF is often used for machine learning - 3NF's structure is ideal for machine processing. The removal of transitive functional dependency avoids inputting features for machine learning that are redundant and non-independent. Relational databases almost always contain structured, normalized data with indexes, and machine learning in this realm uses 3NF data inputs. 

**In what form does ML consume data?**  

Although 3NF works well for ML, ML consumes data in a variety of additional forms. Not all ML applications depend on data normalization - 1NF and 2NF can be fed into ML algorithms, although the effectiveness will vary depending on the algorithm and type of data being worked with. ML algorithms are now sophisticated enough to learn from unstructured data as well. 

**Why would one use log files that are denormalized?**  

Denormalized log files can be more efficient for queries that draw information from several tables that are stored on disk and require complex joins to complete, depending on the size and type of data. Denormalization is the process of attempting to optimize the read performance of a database by adding redundant data or by grouping data - it is a tradeoff of redundancy/extra space for scalability and read performance. Hadoop MapReduce for example primarily uses denormalized data that is fully contained in a single record to avoid costly joins.
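The tradeoff can be sketched in a few lines of plain Python. The page IDs and URLs below mirror the join output shown later in this notebook; the record layout and the `map_record` helper are assumptions made for this example.

```python
# Normalized form: visits reference pages by ID, URLs live in a separate table.
urls = {"1000": "/regwiz", "1001": "/support"}
visits = [("10001", "1000"), ("10001", "1001"), ("10002", "1001")]

# Denormalize once, at write time: copy the URL into every visit record.
denormalized = [
    {"visitor": visitor, "page_id": page_id, "url": urls[page_id]}
    for visitor, page_id in visits
]

def map_record(record):
    """A mapper-style function: it needs only the record itself, no lookup table."""
    return record["url"], 1

# Count visits per URL without performing any join at read time.
counts = {}
for record in denormalized:
    key, one = map_record(record)
    counts[key] = counts.get(key, 0) + one
```

The URL string is stored repeatedly, which is the extra space paid for join-free reads.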


## **HW5.2**

Using MRJob, implement a hashside join (memory-backed map-side) for left, right and inner joins. Run your code on the data used in HW 4.4: (Recall HW 4.4: Find the most frequent visitor of each page using mrjob and the output of 4.2 (i.e., transformed log file). In this output please include the webpage URL, webpageID and Visitor ID.)
  
Justify which table you chose as the Left table in this hashside join.
  
Please report the number of rows resulting from:
  
(1) Left joining Table Left with Table Right  
(2) Right joining Table Left with Table Right  
(3) Inner joining Table Left with Table Right  

**Approach**

There are two tables: the log data and the URL data.  
anonymous-msweb.data.pp: first 5 lines  

V,1000,C,10001  
V,1001,C,10001  
V,1002,C,10001  
V,1001,C,10002  
V,1003,C,10002  

url.txt: first 2 lines

A,1287,1,"International AutoRoute","/autoroute"  
A,1288,1,"library","/library"  

The left table will be url.txt because it has far fewer rows and can more easily be held in memory.
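Before the MRJob versions, the idea can be sketched in plain Python: load the small (left) table into a dictionary, then stream the large (right) table past it. The function name `hash_side_join`, the sample rows, and page ID `9999` are hypothetical and serve only to show the three join types.

```python
import csv

def hash_side_join(small_lines, big_lines, how="inner"):
    """Join log rows (big, streamed) against URL rows (small, held in a dict)."""
    urls = {}
    for line in small_lines:
        row = next(csv.reader([line]))
        urls[row[1]] = row[4]                      # page_id -> url

    out, seen = [], set()
    for line in big_lines:
        row = next(csv.reader([line]))
        page_id, visitor = row[1], row[3]
        if page_id in urls:                        # matched rows appear in every join type
            seen.add(page_id)
            out.append((page_id, urls[page_id], visitor))
        elif how == "right":                       # right join keeps unmatched log rows
            out.append((page_id, None, visitor))
    if how == "left":                              # left join keeps unmatched URL rows
        out.extend((p, u, None) for p, u in sorted(urls.items()) if p not in seen)
    return out

# Illustrative rows only; 9999 is a made-up page ID with no URL entry.
url_rows = ['A,1000,1,"regwiz","/regwiz"', 'A,1287,1,"International AutoRoute","/autoroute"']
log_rows = ['V,1000,C,10001', 'V,9999,C,10002']

inner = hash_side_join(url_rows, log_rows, "inner")
left = hash_side_join(url_rows, log_rows, "left")
right = hash_side_join(url_rows, log_rows, "right")
```

Only the dictionary has to fit in memory, which is why the smaller table is the one that gets loaded.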

In [1]:
# Create a file with only URL(s), i.e. records starting with 'A'
!rm -v url.txt
!grep ^A anonymous-msweb.data > url.txt

url.txt


## HashSideInnerJoin

In [4]:
%%writefile hashsideinnerjoin_52.py
#!/usr/bin/python
## hashsideinnerjoin_52.py
## Author: Angela Gunn & Jing Xu
## Description:Inner Join

from mrjob.job import MRJob
from mrjob.step import MRJobStep
from mrjob.compat import get_jobconf_value

import csv

def csv_readline(line):
    """Given a string CSV line, return a list of strings."""
    for row in csv.reader([line]):
        return row

class innerjoin(MRJob):
    def steps(self):
        return [MRJobStep(mapper_init = self.mapper_init,
                     mapper=self.mapper)]
    
    def mapper_init(self):
        #store the URLs
        self.urls = {}
        with open('url.txt') as f:
            for line in f:
                cell = csv_readline(line)
                self.urls[cell[1]] = cell[4]
        
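    def mapper(self, _, line):
        #each input line is a log record: V,page_id,C,visitor_id
        cell = csv_readline(line)
        if cell[1] in self.urls: #inner join: skip log rows with no matching URL
            yield cell[1], (self.urls[cell[1]], cell[3]) #yield the matching rows.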

if __name__ == '__main__':
    innerjoin.run()

Overwriting hashsideinnerjoin_52.py


In [5]:
%reload_ext autoreload
%autoreload 2
from hashsideinnerjoin_52 import innerjoin
import os

# --file ships url.txt to every mapper so mapper_init can load it into memory

mr_job = innerjoin(args=['anonymous-msweb.data.pp', 
                        '--file', 'url.txt'])

output_file = "output_hw52_inner.txt"
try:
    os.remove(output_file)
except OSError:
    pass

with mr_job.make_runner() as runner, open(output_file, 'a') as f: 
    runner.run()
    count = 0
    # stream_output: get access of the output 
    for line in runner.stream_output():
        #print mr_job.parse_output_line(line)
        f.write(str(line))
        count += 1
print "\n"
print "There are %s records for inner join" %count





There are 98654 records for inner join


In [6]:
!echo "Number of results:"
!wc -l < output_hw52_inner.txt
!echo "-----"
!echo "first 10 rows"
!echo "page_id [url, user]"
!head -10 output_hw52_inner.txt

Number of results:
   98654
-----
first 10 rows
page_id [url, user]
"1000"	["/regwiz", "10001"]
"1001"	["/support", "10001"]
"1002"	["/athome", "10001"]
"1001"	["/support", "10002"]
"1003"	["/kb", "10002"]
"1001"	["/support", "10003"]
"1003"	["/kb", "10003"]
"1004"	["/search", "10003"]
"1005"	["/norge", "10004"]
"1006"	["/misc", "10005"]


## HashSideLeftJoin

In [13]:
%%writefile hashsideleftjoin_52.py
#!/usr/bin/python
## hashsideleftjoin_52.py
## Author: Angela Gunn & Jing Xu 
## Description:Left Join

from mrjob.job import MRJob
from mrjob.step import MRJobStep
from mrjob.compat import get_jobconf_value
import csv

def csv_readline(line):
    for row in csv.reader([line]):
        return row

class leftjoin(MRJob):
    
    def steps(self):
        return [MRJobStep(mapper_init = self.mapper_init,
                         mapper = self.mapper, mapper_final = self.mapper_final)]
    
    def mapper_init(self):
        self.urls = {} #initialize urls dictionary
         
        with open('url.txt') as f:
            for line in f: 
                cell = csv_readline(line)
                self.urls[cell[1]] = [cell[4],[]] #url, list of visitors

    def mapper(self, _, line):
        #these are the logs
        cell = csv_readline(line)
        key = cell[1]
        if key in self.urls: #left join keeps every URL; unmatched log rows are dropped
            self.urls[key][1].append(cell[3])

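    def mapper_final(self):
        #emit the join once this mapper has seen all of its log rows
        #caveat: with several mappers, each emits "NONE" for URLs absent from its own split
        for key, values in self.urls.iteritems():
            url, visitors = values
            if visitors:
                for u in visitors:
                    yield key, (url, u)
            else:
                yield key, (url, "NONE")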

if __name__ == '__main__':
    leftjoin.run()

Overwriting hashsideleftjoin_52.py


In [14]:
!chmod a+x hashsideleftjoin_52.py

In [15]:
%reload_ext autoreload
%autoreload 2
# Running mrjob using a Hadoop Runner in local cluster
from hashsideleftjoin_52 import leftjoin
import os

# --file ships url.txt to every mapper so mapper_init can load it into memory

mr_job = leftjoin(args=['anonymous-msweb.data.pp', 
                        '--file', 'url.txt'])

output_file = "output_hw52_left.txt"
try:
    os.remove(output_file)
except OSError:
    pass

with mr_job.make_runner() as runner, open(output_file, 'a') as f: 
    runner.run()
    count = 0
    # stream_output: get access of the output 
    for line in runner.stream_output():
        #print mr_job.parse_output_line(line)
        f.write(str(line))
        count+=1
print "\n"
print "There are %s records for left join" %count





There are 98663 records for left join


In [16]:
!echo "Number of results:"
!wc -l < output_hw52_left.txt
!echo "-----"
!echo "first 10 rows"
!echo "page_id [url, user]"
!head -10 output_hw52_left.txt

Number of results:
   98663
-----
first 10 rows
page_id [url, user]
"1142"	["/southafrica", "10372"]
"1142"	["/southafrica", "13352"]
"1142"	["/southafrica", "19019"]
"1142"	["/southafrica", "24124"]
"1142"	["/southafrica", "25638"]
"1142"	["/southafrica", "25798"]
"1142"	["/southafrica", "26342"]
"1142"	["/southafrica", "28044"]
"1142"	["/southafrica", "28821"]
"1142"	["/southafrica", "29837"]


## HashSideRightJoin

In [17]:
%%writefile hashsiderightjoin_52.py
#!/usr/bin/python
## hashsiderightjoin_52.py
## Author: Angela Gunn & Jing Xu
## Description:Right Join

from mrjob.job import MRJob
from mrjob.step import MRJobStep
from mrjob.compat import get_jobconf_value
 
import csv

def csv_readline(line):
    """Given a string CSV line, return a list of strings."""
    for row in csv.reader([line]):
        return row

class rightjoin(MRJob):
    
    def steps(self):
        return [MRJobStep(mapper_init = self.mapper_init,
                         mapper = self.mapper)]
    
    def mapper_init(self):
        self.urls = {} #initialize urls dictionary
        with open('url.txt') as f:
            for line in f: 
                cell = csv_readline(line)
                self.urls[cell[1]] = cell[4]

    def mapper(self, _, line):
        #these are the logs        
        cell = csv_readline(line)
        if cell[1] in self.urls: #dict membership test; no need to build the key list
            yield cell[1], (self.urls[cell[1]], cell[3]) #yield the matching rows.     
        else:
            yield None, (None , cell[3])

if __name__ == '__main__':
    rightjoin.run()

Overwriting hashsiderightjoin_52.py


In [18]:
!chmod a+x hashsiderightjoin_52.py

In [19]:
%reload_ext autoreload
%autoreload 2
# Running mrjob using a Hadoop Runner in local cluster
from hashsiderightjoin_52 import rightjoin
import os

# --file ships url.txt to every mapper so mapper_init can load it into memory

mr_job = rightjoin(args=['anonymous-msweb.data.pp', 
                        '--file', 'url.txt'])

output_file = "output_hw52_right.txt"
try:
    os.remove(output_file)
except OSError:
    pass

with mr_job.make_runner() as runner, open(output_file, 'a') as f: 
    runner.run()
    count = 0
    # stream_output: get access of the output 
    for line in runner.stream_output():
        #print mr_job.parse_output_line(line)
        f.write(str(line))
        count+=1
print "\n"
print "There are %s records for right join" %count





There are 98654 records for right join


In [20]:
!echo "Number of results:"
!wc -l < output_hw52_right.txt
!echo "-----"
!echo "first 10 rows"
!echo "page_id [url, user_id]"
!head -10 output_hw52_right.txt

Number of results:
   98654
-----
first 10 rows
page_id [url, user_id]
"1000"	["/regwiz", "10001"]
"1001"	["/support", "10001"]
"1002"	["/athome", "10001"]
"1001"	["/support", "10002"]
"1003"	["/kb", "10002"]
"1001"	["/support", "10003"]
"1003"	["/kb", "10003"]
"1004"	["/search", "10003"]
"1005"	["/norge", "10004"]
"1006"	["/misc", "10005"]


## HW 5.3 - 5.5

For the remainder of this assignment you will work with two datasets:

#### 1: unit/systems test data set: SYSTEMS TEST DATASET
Three terms, A, B, C, and their corresponding stripe-docs of co-occurring terms

DocA {X:20, Y:30, Z:5}
DocB {X:100, Y:20}
DocC {M:5, N:20, Z:5}


#### 2: A large subset of the Google n-grams dataset

https://aws.amazon.com/datasets/google-books-ngrams/

which we have placed in a bucket/folder on Dropbox and on s3:

   https://www.dropbox.com/sh/tmqpc4o0xswhkvz/AACUifrl6wrMrlK6a3X3lZ9Ea?dl=0 

   s3://filtered-5grams/

For each of HW 5.3 - 5.5, please unit test and system test your code with the SYSTEMS TEST DATASET and show the results. 
Please compute the expected answer by hand and show your hand calculations. Then show the results you get with your system.
Finally, show your results on the Google n-grams dataset.


In particular, this bucket contains (~200) files (10Meg each) in the format:

	(ngram) \t (count) \t (pages_count) \t (books_count)
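A minimal sketch of reading one record in this format follows. The helper name `parse_ngram_line` and the sample record are made up for illustration; the per-record quantities are the ones the EDA jobs below work with.

```python
def parse_ngram_line(line):
    """Split one tab-separated record into typed fields."""
    ngram, count, pages_count, books_count = line.rstrip("\n").split("\t")
    return ngram, int(count), int(pages_count), int(books_count)

# Made-up record for illustration; not taken from the dataset.
sample = "the quick brown fox jumps\t12\t10\t3\n"
ngram, count, pages_count, books_count = parse_ngram_line(sample)

length = len(ngram)                                   # characters in the 5-gram
density = float(count) / pages_count                  # occurrences per page
unigram_counts = {w: count for w in ngram.split()}    # each word inherits the 5-gram count
```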

## HW 5.3

Do some EDA on this dataset using mrjob, e.g., 

- Longest 5-gram (number of characters)
- Top 10 most frequent words (count), i.e., unigrams
- Most/Least densely appearing words (count/pages_count) sorted in decreasing order of relative frequency (Hint: save to PART-000* and take the head -n 1000)
- Distribution of 5-gram sizes (counts) sorted in decreasing order of relative frequency. (Hint: save to PART-000* and take the head -n 1000)
OPTIONAL Question:
- Plot the log-log plot of the frequency distribution of unigrams. Does it follow a power law distribution?

For more background see: https://en.wikipedia.org/wiki/Log%E2%80%93log_plot https://en.wikipedia.org/wiki/Power_law
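For the optional power-law question, one rough check is the slope of a least-squares line through the points (log rank, log frequency): a power law appears as a straight line on a log-log plot. The sketch below uses synthetic Zipf-like counts, not the n-gram results, and `loglog_slope` is a hypothetical helper name.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic counts with frequency proportional to 1/rank; not real n-gram data.
synthetic = [1000.0 / rank for rank in range(1, 101)]
slope = loglog_slope(synthetic)   # close to -1 for an ideal 1/rank distribution
```

A slope that is roughly constant over the whole rank range is consistent with a power law; a line fit alone does not prove one.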

In [23]:
%%writefile mrjob_longest_53.py
from mrjob.job import MRJob
from mrjob.step import MRJobStep
from mrjob.compat import get_jobconf_value

class longest_ngram(MRJob):
    long_ngram = None
    long_length = 0
    
    def steps(self):
        return [MRJobStep(mapper=self.mapper,
                         reducer=self.reducer,
                         reducer_final=self.reducer_final,
                        jobconf={
                            "mapred.map.tasks":4,
                            "mapred.reduce.tasks":1,
                            })]
        
    def mapper(self, _, line):
        #break out the lengths of the cells
        cell = line.split('\t')
        length = len(cell[0])
        yield cell[0], length
        
    def reducer(self, key, value):
        #Keep the longest n-gram seen so far; max() guards against an n-gram
        #that appears on several input lines (sum() would inflate its length)
        length = max(value)
        if length > self.long_length:
            self.long_ngram = key
            self.long_length = length
            
    def reducer_final(self):
        #output largest ngram and length
        yield self.long_ngram, self.long_length
        
if __name__ == '__main__':
    longest_ngram.run()

Overwriting mrjob_longest_53.py


In [24]:
!s3cmd put mrjob_longest_53.py s3://w261jing

upload: 'mrjob_longest_53.py' -> 's3://w261jing/mrjob_longest_53.py'  [1 of 1]
 1120 of 1120   100% in    0s     4.63 kB/s  done


In [25]:
%reload_ext autoreload
%autoreload 2
# Running mrjob using a Hadoop Runner in local cluster for systems/unit test
from mrjob_longest_53 import longest_ngram
import os

# No extra Hadoop Streaming parameters are passed here;
# the task settings live in the jobconf of steps()

mr_job = longest_ngram(args=['testngram/'])
#mr_job = longest_ngram(args=['ngramstest.txt'])

output_file = "output_hw53_length.txt"
#output_file = "output_hw53_test_length.txt"
try:
    os.remove(output_file)
except OSError:
    pass

with mr_job.make_runner() as runner, open(output_file, 'a') as f: 
    runner.run()
    # stream_output: get access of the output 
    for line in runner.stream_output():
        #print mr_job.parse_output_line(line)
        f.write(str(line))



In [26]:
!cat output_hw53_length.txt

"Ocherki istorii gosudarstvennykh uchrezhdenii dorevoliutsionnoi"	63


In [27]:
!s3cmd rm --recursive s3://w261jing/hw5/longest53/
!python mrjob_longest_53.py -r emr --conf-path mrjob_261jing.conf s3://filtered-5grams/ --output-dir=s3://w261jing/hw5/longest53 --no-output --no-strict-protocol

Got unexpected keyword arguments: ssh_tunnel
inferring aws_region from scratch bucket's region (us-west-1)
using s3://mrjob-0465390d52fc9db7/tmp/ as our scratch dir on S3
creating tmp directory /var/folders/zs/k144hqks281fbt0x68c_zj9m0000gp/T/mrjob_longest_53.JingXu.20160216.180723.482758
writing master bootstrap script to /var/folders/zs/k144hqks281fbt0x68c_zj9m0000gp/T/mrjob_longest_53.JingXu.20160216.180723.482758/b.py
Copying non-input files into s3://mrjob-0465390d52fc9db7/tmp/mrjob_longest_53.JingXu.20160216.180723.482758/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Can't access IAM API, trying default instance profile: EMR_EC2_DefaultRole
Can't access IAM API, trying default service role: EMR_DefaultRole
Job flow created with ID: j-3IAD38WCPFWGE
Created new job flow j-3IAD38WCPFWGE
Job launched 30.2s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 60.5s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 

In [28]:
!s3cmd get s3://w261jing/hw5/longest53/part* longest53_out.txt
!head longest53_out.txt

download: 's3://w261jing/hw5/longest53/part-00000' -> 'longest53_out.txt'  [1 of 1]
 166 of 166   100% in    0s     3.40 kB/s  done
"AIOPJUMRXUYVASLYHYPSIBEMAPODIKR UFRYDIUUOLBIGASUAURUSREXLISNAYE RNOONDQSRUNSUBUNOUGRABBERYAIRTC UTAHRAPTOREDILEIPMILBDUMMYUVERI SYEVRAHVELOCYALLOSAURUSLINROTSR"	159


In [33]:
%%writefile mrjob_frequency_53.py
#!/usr/bin/python
## mrjob_frequency_53.py
## Author: Angela Gunn & Jing Xu
## Description:Find the frequency of a word in the 5-gram

from mrjob.job import MRJob
from mrjob.step import MRJobStep
from mrjob.compat import get_jobconf_value

import re

WORD_RE = re.compile(r"[A-Za-z0-9]+")

class frequency(MRJob):
    #top10={}
    
    def steps(self):
        return [
             MRJobStep(mapper=self.mapper,
                   combiner=self.combiner,
                   reducer=self.reducer,
                   jobconf={
                            "mapred.map.tasks":16,
                            "mapred.reduce.tasks":8,
                            }),
             MRJobStep(mapper=self.mapper_frequent_unigrams,
                   reducer=self.reducer_frequent_unigrams,
                   jobconf={
                            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                            'mapred.text.key.comparator.options': '-k1,1rn',
                            "mapred.map.tasks":4,
                            "mapred.reduce.tasks":1,
                            }
                   )
        ]
    
    def mapper(self, _, line):
        #get the word, and count for output
        line = line.strip()
        cell = re.split("\t",line)
        unigrams = cell[0].split()
        count = int(cell[1])
        for unigram in unigrams:
            yield unigram, count
            
    def combiner(self, unigram, counts):
        yield unigram, sum(counts)
        
    def reducer(self, unigram, counts):
        #sum the partial counts for each unigram
        yield unigram, sum(counts)
        
    def mapper_frequent_unigrams(self, unigram, count):
        #Just passing along with count first so that they all get shuffled by the count
        yield count, unigram
        
    def reducer_frequent_unigrams(self, count, unigrams):
        #Printing.
        for unigram in unigrams:
            yield count, unigram
        
if __name__ == '__main__':
    frequency.run()

Overwriting mrjob_frequency_53.py


In [34]:
!chmod a+x mrjob_frequency_53.py

In [35]:
!s3cmd put mrjob_frequency_53.py s3://w261jing

upload: 'mrjob_frequency_53.py' -> 's3://w261jing/mrjob_frequency_53.py'  [1 of 1]
 2072 of 2072   100% in    0s    15.18 kB/s  done


In [36]:
# Running mrjob using a Hadoop Runner in local cluster for systems/unit test
from mrjob_frequency_53 import frequency
import os

# No extra Hadoop Streaming parameters are passed here;
# the sort and task settings live in the jobconf of steps()

mr_job = frequency(args=['testngram/'])
#mr_job = frequency(args=['small_grams.txt'])

output_file = "output_hw53_freq.txt"
try:
    os.remove(output_file)
except OSError:
    pass

with mr_job.make_runner() as runner, open(output_file, 'a') as f: 
    runner.run()
    # stream_output: get access of the output 
    for line in runner.stream_output():
        #print mr_job.parse_output_line(line)
        f.write(str(line))



In [37]:
!head -10 output_hw53_freq.txt

100	"Acquaintance"
100	"Aigina"
100	"Allowances"
100	"Antipater"
100	"Atalantis"
100	"BCD"
100	"Bembo"
100	"Bibelstudien"
100	"Bodleiana"
100	"Cloete"


In [None]:
!s3cmd rm --recursive s3://w261jing/hw5/frequency53/
!python mrjob_frequency_53.py -r emr s3://filtered-5grams --output-dir=s3://w261jing/hw5/frequency53 --no-output --no-strict-protocol

Got unexpected keyword arguments: ssh_tunnel
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
inferring aws_region from scratch bucket's region (us-west-1)
using s3://mrjob-0465390d52fc9db7/tmp/ as our scratch dir on S3
creating tmp directory /var/folders/zs/k144hqks281fbt0x68c_zj9m0000gp/T/mrjob_frequency_53.JingXu.20160216.234515.019133
writing master bootstrap script to /var/folders/zs/k144hqks281fbt0x68c_zj9m0000gp/T/mrjob_frequency_53.JingXu.20160216.234515.019133/b.py


In [None]:
!s3cmd get s3://w261jing/hw5/frequency53/part* frequency_out.txt
!head frequency_out.txt

In [39]:
!python -m mrjob.tools.emr.terminate_job_flow j-3BFS6FTHTMWLP

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
inferring aws_region from scratch bucket's region (us-west-1)
using s3://mrjob-0465390d52fc9db7/tmp/ as our scratch dir on S3
Terminated job flow j-3BFS6FTHTMWLP


In [None]:
## Most/Least Dense
#Most/Least densely appearing words (count/pages_count) sorted in decreasing order of relative frequency 
#(Hint: save to PART-000* and take the head -n 1000)

#(ngram) \t (count) \t (pages_count) \t (books_count)

In [39]:
%%writefile mrjob_density_53.py
#!/usr/bin/python
## mrjob_density_53.py
## Author: Angela Gunn & Jing Xu
## Description:Find the density of the words in 5gram

from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.compat import get_jobconf_value

class density_53(MRJob):

    
    def steps(self):
        return [MRStep(mapper=self.mapper,
                         combiner=self.combiner,
                         reducer=self.reducer, 
                   jobconf={
                    "mapred.map.tasks":16,
                    "mapred.reduce.tasks":8
                    }
                         ),
               MRStep(mapper=self.mapper_max_min,
                        reducer=self.reducer_max_min,
                   jobconf={
                            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                            'mapred.text.key.comparator.options': '-k1,1rn',
                            "mapred.map.tasks":4,
                            "mapred.reduce.tasks":1
                            }
                     )]
        
    def mapper(self, _, line):
        #output the density (count / pages)
        cell = line.split('\t')
        words = cell[0].split()
        density = round((int(cell[1]) * 1.0 / int(cell[2])), 3)
        for w in words:
            yield w.lower(), density
            
    def combiner(self, unigram, densities):
        #keep only the local extremes; the reducer needs nothing else
        densities = list(densities)
        yield unigram, min(densities) 
        yield unigram, max(densities)
        
    def reducer(self, unigram, densities):
        #emit the lowest and highest density seen for each word
        densities = list(densities)
        yield unigram, min(densities)
        yield unigram, max(densities)
        
    def mapper_max_min(self, unigram, density):
        #output with density first so grouping can happen
        yield density, unigram
        
    def reducer_max_min(self, density, unigrams):
        #final output
        for unigram in unigrams:
            yield density, unigram
            
        
if __name__ == '__main__':
    density_53.run()

Writing mrjob_density_53.py


In [40]:
!s3cmd put mrjob_density_53.py s3://w261jing

upload: 'mrjob_density_53.py' -> 's3://w261jing/mrjob_density_53.py'  [1 of 1]
 1169 of 1169   100% in    0s     2.50 kB/s  done


In [None]:
# Running mrjob using a Hadoop Runner in local cluster for systems/unit test
from mrjob_density_53 import density_53
import os

# Sort options for Hadoop Streaming: numeric, descending on the first key field.
# Unused in this local test; the same settings are set via jobconf in steps().

JOB_CONFIG = {'mapred.output.key.comparator.class':
      'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
  'mapred.text.key.comparator.options': '-k1,1nr'}


mr_job = density_53(args=['testngram/'])
#mr_job = density_53(args=['small_grams.txt'])
#mr_job.jobconf(JOB_CONFIG)

output_file = "output_hw53_density.txt"
try:
    os.remove(output_file)
except OSError:
    pass

with mr_job.make_runner() as runner, open(output_file, 'a') as f: 
    runner.run()
    # stream_output: get access of the output 
    for line in runner.stream_output():
        #print mr_job.parse_output_line(line)
        f.write(str(line))

In [None]:
!s3cmd rm --recursive s3://w261jing/hw5/density53/
!python mrjob_density_53.py -r emr s3://filtered-5grams/ --output-dir=s3://w261jing/hw5/density53 --no-output --no-strict-protocol    

In [None]:
!s3cmd get s3://w261jing/hw5/density53/part* density_out.txt
!head density_out.txt

In [None]:
#Distribution of 5-gram sizes (counts) sorted in decreasing order of relative frequency. (Hint: save to PART-000* and take the head -n 1000) 

In [41]:
%%writefile mrjob_distribution_53.py
#!/usr/bin/python
## mrjob_distribution_53.py
## Author: Angela Gunn & Jing Xu
## Description: Distribution of 5gram lengths

from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.compat import get_jobconf_value

class distribution_53(MRJob):
    
    def steps(self):
        return [MRStep(mapper=self.mapper,
                       combiner = self.combiner,
                         reducer=self.reducer, 
                   jobconf={
                            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                            'mapred.text.key.comparator.options': '-k1,1rn',
                            "mapred.map.tasks":8,
                            "mapred.reduce.tasks":1
                            }
                     )]
        
    def mapper(self, _, line):
        #get the character length of the 5-gram and its count
        cell = line.split('\t')
        ngram_length = len(cell[0])
        count = int(cell[1])
        yield ngram_length, count
            
    def combiner(self, length, count):
        #partial sum of counts per length
        yield length, sum(count)
        
    def reducer(self, length, count):
        #total count per length
        yield length, sum(count)
        
if __name__ == '__main__':
    distribution_53.run()

Writing mrjob_distribution_53.py


In [43]:
!s3cmd put mrjob_distribution_53.py s3://w261jing

upload: 'mrjob_distribution_53.py' -> 's3://w261jing/mrjob_distribution_53.py'  [1 of 1]
 994 of 994   100% in    0s  1879.06 B/s  done


In [None]:
# Running mrjob using a Hadoop Runner in local cluster for systems/unit test
from mrjob_distribution_53 import distribution_53
import os


#mr_job = distribution_53(args=['HW5/'])
mr_job = distribution_53(args=['testngram/'])

output_file = "output_hw53_distribution.txt"
try:
    os.remove(output_file)
except OSError:
    pass

with mr_job.make_runner() as runner, open(output_file, 'a') as f: 
    runner.run()
    # stream_output: get access of the output 
    for line in runner.stream_output():
        #print mr_job.parse_output_line(line)
        f.write(str(line))

In [None]:
!s3cmd rm --recursive s3://w261jing/hw5/distribution53/
!python mrjob_distribution_53.py -r emr s3://filtered-5grams/ --output-dir=s3://w261jing/hw5/distribution53 --no-output --no-strict-protocol    

In [None]:
!s3cmd get s3://w261jing/hw5/distribution53/part* distribution_out.txt
!echo "------------------------------"
!echo "ngram length   count of length"
!echo "------------------------------"
!head distribution_out.txt

In [None]:
%matplotlib inline

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = 16, 8  # plotsize 

df = pd.read_csv('HW5/distribution_out.txt',sep='\t',header=None)
df.columns = ['length','frequency']
df = df.sort_values('length')
df = df.set_index('length')
my_plot = df.plot(kind='bar',legend=None,title="5-gram character length distribution")
my_plot.set_xlabel("5-gram length")
my_plot.set_ylabel("frequency")

## HW 5.4

In this part of the assignment we will focus on developing methods
for detecting synonyms, using the Google 5-grams dataset. To accomplish
this you must script two main tasks using MRJob:

**(1) Build stripes of word co-occurrence for the top 10,000 most frequently appearing words across the entire set of 5-grams, using the words ranked from 9,001 to 10,000 as a basis, and output to a file in your bucket on s3 (bigram analysis, though the words are non-contiguous).**

**==Design notes for (1)==**
For this task you will be able to modify the pattern we used in HW 3.2
(feel free to use the solution as reference). To total the word counts 
across the 5-grams, output the support from the mappers using the total 
order inversion pattern:

<*word,count>

to ensure that the support arrives before the cooccurrences.

In addition to ensuring the determination of the total word counts,
the mapper must also output co-occurrence counts for the pairs of
words inside of each 5-gram. Treat these words as a basket,
as we have in HW 3, but count all stripes or pairs in both orders,
i.e., count both orderings: (word1,word2), and (word2,word1), to preserve
symmetry in our output for (2).
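The design notes above can be sketched for a single 5-gram as follows. This is a plain-Python illustration under stated assumptions, not the MRJob mapper itself: the function name `stripes_for_ngram` and the toy vocabulary are hypothetical, and words are split on whitespace.

```python
from collections import Counter

def stripes_for_ngram(ngram, count, vocab):
    """Emit (key, value) pairs for one 5-gram, treating its words as a basket."""
    words = [w for w in ngram.split() if w in vocab]
    pairs = []
    for i, word in enumerate(words):
        # Order inversion: the '*' prefix sorts before letters and digits,
        # so the support for a word arrives before any of its stripes.
        pairs.append(("*" + word, count))
        stripe = Counter()
        for j, other in enumerate(words):
            if i != j:                  # every other position, so both orderings get counted
                stripe[other] += count
        if stripe:
            pairs.append((word, dict(stripe)))
    return pairs

# Toy input: only "a" and "b" are in the vocabulary, and the 5-gram occurred twice.
emitted = stripes_for_ngram("a b c", 2, {"a", "b"})
shuffled = sorted(emitted, key=lambda kv: kv[0])
```

Sorting by key mimics the shuffle: both `*`-prefixed totals come out ahead of the stripes, and the pair (a, b) is counted once from each side.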


In [None]:
%%writefile createtest.py

with open('SYSTEMS_TEST_DATASET.txt','w') as f:
    f.write('DocA\t{"X":20,"Y":30,"Z":5}\n')
    f.write('DocB\t{"X":100,"Y":20}\n')
    f.write('DocC\t{"M":5,"N":20,"Z":5}\n')

with open('SYSTEMS_TEST_DATASET_freq.txt','w') as f:
    f.write('120\t"X"\n')
    f.write('150\t"Y"\n')
    f.write('10\t"Z"\n')
    f.write('5\t"M"\n')
    f.write('20\t"N"\n')

In [None]:
!python createtest.py
!cat SYSTEMS_TEST_DATASET.txt
!cat SYSTEMS_TEST_DATASET_freq.txt

In [None]:
!cat HW5/frequency_out.txt | sed -n 9001,10000p > HW5/topwords_touse.txt

In [None]:
# Use the words ranked 9,001-10,000 (1,000 words) as the vocabulary.
# Build stripes of co-occurrence on ALL 5-grams:  word [co1, co2, co3]
# The mapper outputs one such stripe for each vocabulary word in a 5-gram.
!head HW5/topwords_touse.txt
!wc -l HW5/topwords_touse.txt

In [32]:
%%writefile mrjob_bigram_occurrence.py
#!/usr/bin/python
## mrjob_bigram_occurrence.py
## Author: Angela Gunn & Jing Xu
## Description: Builds co-occurrence stripes for vocabulary words found in the 5-grams.


from mrjob.job import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[A-Za-z0-9]+")

class bigram_occurrence(MRJob):
    
    doc_dict={} #global list
    
    def steps(self):
        return [MRStep(mapper_init = self.mapper_init,
                       mapper=self.mapper_main,
                     combiner=self.combiner,
                      reducer=self.reducer, 
                   jobconf={
                            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                            'mapred.text.key.comparator.options': '-k1,1',
                            "mapred.map.tasks":32,
                            "mapred.reduce.tasks":16
                            })]
    
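    def mapper_init(self):
        # Load the vocabulary (count <TAB> "word" per line) from a file named
        # 'topwords', which must be present in the task's working directory.
        self.unigrams = {}
        with open('topwords','r') as f:
            for line in f:
                cells = line.strip().split('\t')
                word = cells[1].replace('"','').strip()
                self.unigrams[word] = int(cells[0])
                # order inversion: the '*' prefix sorts the support ahead of the stripes
                yield "*"+word, int(cells[0])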
                                
    def mapper_main(self, _, line):        
        cell = line.strip().split('\t')
        words = WORD_RE.findall(cell[0])
        # Keep only the words of the 5-gram that are in the vocabulary
        # (test membership on the dict itself rather than on a list of keys)
        words = [w for w in words if w in self.unigrams]
        w_len = len(words)
        for i in xrange(0, w_len): #for each word
            key = words[i]
            H = {}
            for j in xrange(0, w_len): #for every other word in the 5-gram
                w = words[j]
                if key != w: H[w] = H.get(w,0) + 1
            #emit the stripe; every word takes a turn as key, so both orderings are counted
            if len(H) > 0: yield key, H
                      
    def combiner(self, key, stripes):
        dic = {}
        key = key.replace('"','')
        if key[0] == '*':
            total = sum(stripes)
            yield key, total
        else:
            for s in stripes:
                for k, v in s.iteritems():
                    k = k.replace('"','')
                    dic[k] = dic.get(k,0) + int(v)
            yield key, dic

    def reducer(self, key, stripes):
        dic = {}
        key = key.replace('"','')
        if key[0] == '*':
            total = sum(stripes)
            yield key, total
        else:
            for s in stripes:
                for k, v in s.iteritems():
                    k = k.replace('"','')
                    dic[k] = dic.get(k,0) + int(v)
            yield key, dic
        

        
if __name__ == '__main__':
    bigram_occurrence.run()

Overwriting mrjob_bigram_occurrence.py


In [33]:
!s3cmd put mrjob_bigram_occurrence.py s3://w261jing

upload: 'mrjob_bigram_occurrence.py' -> 's3://w261jing/mrjob_bigram_occurrence.py'  [1 of 1]
 2615 of 2615   100% in    0s    13.85 kB/s  done


In [35]:
# Run the job locally through make_runner() (no -r option, so mrjob's default runner is used)
from mrjob_bigram_occurrence import bigram_occurrence
import os

mr_job = bigram_occurrence(args=['ngramstest.txt', '--file', '1000grams.txt']) #use ngramstest file to test python script

output_file = "bigram_test_54.out"
try:
    os.remove(output_file)
except OSError:
    pass

with mr_job.make_runner() as runner, open(output_file, 'a') as f: 
    runner.run()
    # stream_output: get access of the output 
    for line in runner.stream_output():
        #print mr_job.parse_output_line(line)
        f.write(str(line))



In [36]:
!head -10 bigram_test_54.out
!tail -1 bigram_test_54.out

"*aa"	63
"*aachen"	7
"*aan"	9
"*aba"	6
"*abate"	87
"*abated"	92
"*abbas"	8
"*abbreviated"	72
"*abbreviation"	55
"*abbreviations"	97
"wrought"	{"me": 1, "be": 2, "great": 1, "lived": 1, "that": 1, "well": 1, "could": 1, "upon": 1, "most": 1, "between": 1, "is": 1, "must": 2}


In [None]:
!s3cmd rm --recursive s3://w261jing/hw5/bigram54/
!python mrjob_bigram_occurrence.py -r emr s3://filtered-5grams/ --output-dir=s3://w261jing/hw5/bigram54 --no-output --no-strict-protocol    

Got unexpected keyword arguments: ssh_tunnel
inferring aws_region from scratch bucket's region (us-west-1)
using s3://mrjob-0465390d52fc9db7/tmp/ as our scratch dir on S3
creating tmp directory /var/folders/zs/k144hqks281fbt0x68c_zj9m0000gp/T/mrjob_bigram_occurrence.JingXu.20160216.171800.722951
writing master bootstrap script to /var/folders/zs/k144hqks281fbt0x68c_zj9m0000gp/T/mrjob_bigram_occurrence.JingXu.20160216.171800.722951/b.py

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

Copying non-input files into s3://mrjob-0465390d52fc9db7/tmp/mrjob_bigram_occurrence.JingXu.20160216.171800.722951/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Can't access IAM API, trying default instance profile: EMR_EC2_DefaultRole
Can't access IAM API, trying default ser

**(2) Using two (symmetric) comparison methods of your choice 
(e.g., correlations, distances, similarities), pairwise compare 
all stripes (vectors), and output to a file in your bucket on s3.**

**==Design notes for (2)==**  
For this task you will have to determine a method of comparison.
Here are a few that you might consider:

- Jaccard
- Cosine similarity
- Spearman correlation
- Euclidean distance
- Taxicab (Manhattan) distance
- Shortest path graph distance (a graph, because our data is symmetric!)
- Pearson correlation
- Kendall correlation
...

However, be cautioned that some comparison methods are more difficult to
parallelize than others, and do not perform more associations than is necessary, 
since your choice of association will be symmetric.

Please use the inverted index (discussed in live session #5) based pattern to compute the pairwise (term-by-term) similarity matrix.
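
The inverted-index pattern can be sketched in plain Python on the three-document test set used in this notebook. This is a single-process illustration under the assumption that the whole index fits in memory, not the MRJob implementation; the function name `cosine_via_inverted_index` is hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

# Same contents as SYSTEMS_TEST_DATASET.txt
stripes = {
    "DocA": {"X": 20, "Y": 30, "Z": 5},
    "DocB": {"X": 100, "Y": 20},
    "DocC": {"M": 5, "N": 20, "Z": 5},
}


def cosine_via_inverted_index(stripes):
    # Step 1: normalise each stripe to unit length and invert it,
    # so each term maps to {doc: normalised weight}.
    inverted = defaultdict(dict)
    for doc, stripe in stripes.items():
        norm = math.sqrt(sum(c * c for c in stripe.values()))
        for term, count in stripe.items():
            inverted[term][doc] = count / norm

    # Step 2: every posting list contributes one product per unordered pair;
    # summing the products over all terms gives the cosine similarity.
    scores = defaultdict(float)
    for postings in inverted.values():
        for a, b in combinations(sorted(postings), 2):
            scores[(a, b)] += postings[a] * postings[b]
    return dict(scores)


scores = cosine_via_inverted_index(stripes)
for pair in sorted(scores):
    print(pair, scores[pair])
```

Pairs that share no term never meet in a posting list, so they are simply absent from the result rather than reported with a score of zero.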

In [37]:
%%writefile mrjob_Cosine_Inverted_Index.py
#!/usr/bin/python
## mrjob_Cosine_Inverted_Index.py
## Author: Angela Gunn & Jing Xu
## Description: Finds cosine scores for inverted index


from mrjob.job import MRJob
from mrjob.step import MRStep
import json
import math


# Normalise each stripe to unit length, then invert it, e.g. for unit counts:
#   w1 {wordA:1/LA, wordB:1/LB}
#   w2 {wordA:1/LA, wordB:1/LB}
#   w3 {wordA:1/LA}
# where LA and LB are the Euclidean lengths of the stripes for wordA and wordB.
#cosine(wordA, wordB) = 1/LA*1/LB + 1/LA*1/LB + 1/LA*0


class Cosine_Inverted_Index(MRJob):

    global_doc_dict = {}
    def steps(self):
        return [
            MRStep(mapper=self.mapper, reducer= self.reducer,jobconf={
                    "mapred.map.tasks":16,
                    "mapred.reduce.tasks":8
                    }
                  ),
            MRStep(mapper=self.mapper2 ,combiner=self.combiner2, reducer=self.reducer2,
                   jobconf={
                    "mapred.map.tasks":8,
                    "mapred.reduce.tasks":4
                    }
                  )
               ]
    
    def mapper(self, _, line):
        key,terms = line.strip().split('\t')
        key = key.replace('"', '')
        if key[0] != '*':  # skip the '*word' support records
            docs = json.loads(terms)  # stripes are JSON-encoded; avoids eval on input data
            #normalise the counts for cosine similarity
            total_sq_cnt = 0
            for word, count in docs.iteritems():
                total_sq_cnt += count**2
            total_sqrt = math.sqrt(total_sq_cnt)
            for doc,count in docs.iteritems():
                yield doc,(key, 1.0*count/total_sqrt)
    
    def reducer(self,key,value):
        doc_list ={}
        for doc,dist in value:
            doc_list[doc]=dist
        yield key, doc_list
        
    def mapper2(self,key,value):
        keys = sorted(value.keys())
        for i, key1 in enumerate(keys):
            # cosine is symmetric, so emit each unordered pair only once
            for key2 in keys[i+1:]:
                multiplied_keys = value[key1]*value[key2]
                yield(key1,key2),multiplied_keys
    
    def combiner2(self,key,value):
        yield key,sum(value)
    
    def reducer2(self,key,value):
        yield key,sum(value)

        
if __name__ == '__main__':
    Cosine_Inverted_Index.run()

Overwriting mrjob_Cosine_Inverted_Index.py


In [None]:
!s3cmd put mrjob_Cosine_Inverted_Index.py s3://w261jing

**Test with SYSTEMS_TEST_DATASET**

In [38]:
from mrjob_Cosine_Inverted_Index import Cosine_Inverted_Index
import os

#mr_job = Cosine_Inverted_Index(args=['HW5/54_out.txt'])
mr_job = Cosine_Inverted_Index(args=['SYSTEMS_TEST_DATASET.txt'])

output_file = "SYSTEMS_TEST_COSINE.out"
try:
    os.remove(output_file)
except OSError:
    pass

with mr_job.make_runner() as runner, open(output_file, 'a') as f: 
    runner.run()
    # stream_output: get access of the output 
    for line in runner.stream_output():
        #print mr_job.parse_output_line(line)
        f.write(str(line))



13323


In [None]:
!cat SYSTEMS_TEST_COSINE.out

**Test with 5-grams**

In [None]:
#code to make things public for future functions
!s3cmd setacl s3://w261jing/hw5/bigram54/* --acl-public --recursive

In [None]:
!s3cmd rm --recursive s3://w261jing/hw5/cosine54/
!python mrjob_Cosine_Inverted_Index.py -r emr s3://w261jing/hw5/bigram54/part* --output-dir=s3://w261jing/hw5/cosine54 --no-output --no-strict-protocol    

In [None]:
!s3cmd get s3://w261jing/hw5/cosine54/part-00000 cosine54_out.txt

!head cosine54_out.txt

In [28]:
%%writefile mrjob_Jaccard_Index.py
#!/usr/bin/python
## mrjob_Jaccard_Index.py
## Author: Angela Gunn & Jing Xu
## Description: Finds Jaccard scores for inverted index


from mrjob.job import MRJob
from mrjob.step import MRStep
import json


class Jaccard_Index(MRJob):
    global_doc_dict = {}
    def steps(self):
        return [
            MRStep(mapper=self.mapper , reducer= self.reducer, 
                   jobconf={
                    "mapred.map.tasks":16,
                    "mapred.reduce.tasks":8
                    }),
            MRStep(mapper=self.mapper2 ,combiner=self.combiner2, reducer=self.reducer2,
                   jobconf={
                    "mapred.map.tasks":8,
                    "mapred.reduce.tasks":4
                    }
                  ),
             MRStep(reducer=self.jaccard_cal, 
                    jobconf={
                    "mapred.map.tasks":4,
                    "mapred.reduce.tasks":1
                    })
            
               ]
    
    def mapper(self, _, line):
        key,terms = line.strip().split('\t')
        key = key.replace('"', '')
        if key[0] != '*':  #* represents a word with count - not a word with dictionary
            docs = json.loads(terms).keys()  # stripes are JSON-encoded; avoids eval on input data
            for doc in docs:
                yield doc,key
    
    def reducer(self,key,value):
        doc_list ={}
        for v in value:
            doc_list[v]=1
        yield key, doc_list.keys()
        
    def mapper2(self,key,value):
        doc_list = sorted(value)
        for i, key1 in enumerate(doc_list):
            starkey = '*' + key1  #adding the * back... this seems redundant, but handled a strange error.
            yield (starkey, key1),1  #counts |key1|, the size of key1's stripe
            # Jaccard is symmetric, so emit each unordered pair only once
            for key2 in doc_list[i+1:]:
                yield(key1,key2),1
    
    def combiner2(self,key,value):
        yield key,sum(value)
    
    def reducer2(self,key,value):
        yield key,sum(value)
    
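    def jaccard_cal(self,key,value):
        docA,docB = key

        # Relies on a single reducer receiving the keys in sorted order, so that
        # every ('*doc', doc) size record arrives before the pairs that need it.
        if docA.startswith('*'): #|doc|
            self.global_doc_dict[docB] = sum(value)
        else:  #at this point we have all the |doc|
            ab = sum(value)
            # Jaccard = |A & B| / (|A| + |B| - |A & B|)
            calc = 1.0*ab / (self.global_doc_dict[docA] + self.global_doc_dict[docB] - ab)
            yield (docA,docB), calc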
        
if __name__ == '__main__':
    Jaccard_Index.run()

Overwriting mrjob_Jaccard_Index.py


In [4]:
!s3cmd put mrjob_Jaccard_Index.py s3://w261jing

upload: 'mrjob_Jaccard_Index.py' -> 's3://w261jing/mrjob_Jaccard_Index.py'  [1 of 1]
 2637 of 2637   100% in    0s    16.38 kB/s  done


**Test with SYSTEMS_TEST_DATASET**

In [None]:
from mrjob_Jaccard_Index import Jaccard_Index
import os

mr_job = Jaccard_Index(args=['SYSTEMS_TEST_DATASET.txt'])

output_file = "SYSTEMS_TEST_JACCARD.out"
try:
    os.remove(output_file)
except OSError:
    pass

with mr_job.make_runner() as runner, open(output_file, 'a') as f: 
    runner.run()
    # stream_output: get access of the output 
    for line in runner.stream_output():
        #print mr_job.parse_output_line(line)
        f.write(str(line))

In [None]:
!head SYSTEMS_TEST_JACCARD.out
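
As a sanity check for the test run, the Jaccard scores for the three test documents can also be computed directly from the key sets. This is a small reference sketch in plain Python, not part of the MRJob pipeline, and the names `jaccard` and `reference` are hypothetical.

```python
from itertools import combinations

# Same contents as SYSTEMS_TEST_DATASET.txt
stripes = {
    "DocA": {"X": 20, "Y": 30, "Z": 5},
    "DocB": {"X": 100, "Y": 20},
    "DocC": {"M": 5, "N": 20, "Z": 5},
}


def jaccard(stripe_a, stripe_b):
    """Jaccard similarity of the term sets of two stripes (counts are ignored)."""
    a, b = set(stripe_a), set(stripe_b)
    return len(a & b) / float(len(a | b))


reference = {
    (p, q): jaccard(stripes[p], stripes[q])
    for p, q in combinations(sorted(stripes), 2)
}
for pair in sorted(reference):
    print(pair, reference[pair])
```

Unlike this direct computation, the inverted-index job never sees pairs with an empty intersection, so a pair such as DocB and DocC should not appear in its output at all.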

**Test with 5-grams**

In [40]:
from mrjob_Jaccard_Index import Jaccard_Index
import os

mr_job = Jaccard_Index(args=['bigram_test_54.out'])

output_file = "bigram_test_Jaccard_54.out"
try:
    os.remove(output_file)
except OSError:
    pass

with mr_job.make_runner() as runner, open(output_file, 'a') as f: 
    runner.run()
    # stream_output: get access of the output 
    for line in runner.stream_output():
        #print mr_job.parse_output_line(line)
        f.write(str(line))



In [41]:
!s3cmd rm --recursive s3://w261jing/hw5/jaccard54/
!python mrjob_Jaccard_Index.py -r emr s3://w261jing/hw5/bigram54/part* --output-dir=s3://w261jing/hw5/jaccard54 --no-output --no-strict-protocol    

["abate", "between"]	0.003289473684210526
["abate", "black"]	0.021739130434782608
["abate", "brick"]	0.125
["abate", "building"]	0.016129032258064516
["abate", "cell"]	0.02127659574468085
["abate", "cells"]	0.024390243902439025
["abate", "first"]	0.003937007874015748
["abate", "flying"]	0.09090909090909091
["abate", "heat"]	0.023809523809523808
["abate", "is"]	0.0014005602240896359
["worked", "works"]	0.23376623376623376
["worked", "write"]	0.17857142857142858
["worked", "writing"]	0.18666666666666668
["worked", "wrought"]	0.1951219512195122
["works", "write"]	0.1650485436893204
["works", "writing"]	0.2222222222222222
["works", "wrought"]	0.09375
["write", "writing"]	0.17525773195876287
["write", "wrought"]	0.12121212121212122
["writing", "wrought"]	0.12280701754385964


In [None]:
!s3cmd get s3://w261jing/hw5/jaccard54/part-00000 jaccard54_out.txt

!head jaccard54_out.txt

## HW 5.5
In this part of the assignment you will evaluate the success of your synonym detector.
Take the top 1,000 closest/most similar/correlative pairs of words as determined
by your measure in (2), and use the synonyms function in the accompanying
python code:

nltk_synonyms.py

Note: This will require installing the python nltk package:

http://www.nltk.org/install.html

and downloading its data with nltk.download().

For each (word1,word2) pair, check to see if word1 is in the list, 
synonyms(word2), and vice-versa. If one of the two is a synonym of the other, 
then consider this pair a 'hit', and then report the precision, recall, and F1 measure of 
your detector across your 1,000 best guesses. Report the macro averages of these measures.
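
The three measures can be sketched as follows, assuming a gold set of true synonym pairs is available for the candidate vocabulary (for example, built from WordNet). The function name `evaluate` and both pair lists are hypothetical and serve only to illustrate the arithmetic.

```python
def evaluate(predicted_pairs, gold_pairs):
    """Precision, recall and F1 for unordered word pairs."""
    # frozenset makes (w1, w2) and (w2, w1) the same pair.
    predicted = {frozenset(p) for p in predicted_pairs}
    gold = {frozenset(p) for p in gold_pairs}

    tp = len(predicted & gold)   # predicted and truly synonyms
    fp = len(predicted - gold)   # predicted but not synonyms
    fn = len(gold - predicted)   # synonyms the detector missed

    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical detector output and gold standard
predicted = [("job", "task"), ("be", "that"), ("make", "work"), ("aching", "admittance")]
gold = [("task", "job"), ("work", "make"), ("try", "tried")]

p, r, f1 = evaluate(predicted, gold)
print(p, r, f1)
```

Precision only needs the predicted pairs, whereas recall depends on knowing which synonym pairs exist among all candidate words, including those outside the top 1,000.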

In [44]:
!cat bigram_test_Jaccard_54.out | sort -k3nr  > file_test_55.out
!head file_test_55.out
!head -1000 file_test_55.out > top1k_test_55.out
!head top1k_test_55.out
!wc -l top1k_test_55.out

["abdication", "accentuated"]	1.0
["aching", "admittance"]	1.0
["aching", "admonished"]	1.0
["admittance", "admonished"]	1.0
["be", "that"]	0.7340686274509803
["is", "that"]	0.7305389221556886
["be", "is"]	0.725925925925926
["asserted", "checked"]	0.6666666666666666
["checked", "hate"]	0.6666666666666666
["beaten", "bestowed"]	0.6
["abdication", "accentuated"]	1.0
["aching", "admittance"]	1.0
["aching", "admonished"]	1.0
["admittance", "admonished"]	1.0
["be", "that"]	0.7340686274509803
["is", "that"]	0.7305389221556886
["be", "is"]	0.725925925925926
["asserted", "checked"]	0.6666666666666666
["checked", "hate"]	0.6666666666666666
["beaten", "bestowed"]	0.6
    1000 top1k_test_55.out


In [4]:
import nltk
from nltk.corpus import wordnet as wn
import sys
import ast
# Return all WordNet lemma names (synonyms) across the synsets of a word
def synonyms(string):
    syndict = {}
    for i,j in enumerate(wn.synsets(string)):
        syns = j.lemma_names()
        for syn in syns:
            syndict.setdefault(syn,1)
    return syndict.keys()

total = 0
true_positives = 0
false_negatives = 0

# Load the top-1k file, indexed by its two tab-separated fields
# (t[0] is the JSON-encoded pair, t[1] is the similarity score)
dict_syn = {}
line_cnt = 0
with open('top1k_test_55.out', 'r') as f:
    for line in f:
        line_cnt += 1
        if line_cnt%100 == 0: print line_cnt
        t = line.strip().split('\t')
        w1 = t[0].lower()
        w2 = t[1].lower()
        dict_syn.setdefault(w1, []).append(w2)
        dict_syn.setdefault(w2, []).append(w1)

print "Length of Synonym Dict: {0}".format(len(dict_syn))

# Check if any of the top 1000 matches the synonym list
with open('top1k_test_55.out', 'r') as f:
    for line in f:
        total += 1
        t = line.strip().split('\t')
        pair = ast.literal_eval(t[0])
        syn0 = synonyms(pair[0].lower().strip(' '))
        syn1 = synonyms(pair[1].lower().strip(' '))
        # Hit (true positive): either word is in the other's WordNet synonym list
        if pair[1].lower() in syn0 or pair[0].lower() in syn1:
            print "MATCH: {0}".format(pair)
            true_positives += 1
        
        # Counted as a false negative if either word is a key of dict_syn
        if pair[0].lower() in dict_syn or pair[1].lower() in dict_syn:
            false_negatives += 1
        
            
print "\nTotal Count: {0}, TP: {1}, FP: {2}, FN: {3}".format(total, true_positives, true_positives - false_negatives, false_negatives)

p = round(float(true_positives) / total, 3)
r = round(float(true_positives) / (true_positives + false_negatives), 3)
f1 = round(2 * p * r / (p + r), 3)

print "\n### PRECISION: {0}".format(p)
print "\n### RECALL: {0}".format(r)
print "\n### F1 Score: {0}".format(f1)

100
200
300
400
500
600
700
800
900
1000
Length of Synonym Dict: 1457
MATCH: ['be', 'is']
MATCH: ['made', 'make']
MATCH: ['about', 'most']
MATCH: ['do', 'make']
MATCH: ['tried', 'try']
MATCH: ['do', 'made']
MATCH: ['job', 'task']
MATCH: ['make', 'work']

Total Count: 1000, TP: 8, FP: 8, FN: 0

### PRECISION: 0.008

### RECALL: 1.0

### F1 Score: 0.016


In [4]:
%%writefile inverse_index_54.py
#!/usr/bin/python
## inverse_index_54.py
## Author: Jing Xu 
## Description: Inverts an index of stripes.


from mrjob.job import MRJob
from mrjob.step import MRStep
import ast

class inverse_index(MRJob):
    
    def steps(self):
        # the aggregation must run as a reducer; a combiner alone is not guaranteed to see all values for a key
        return [MRStep(mapper=self.mapper,
                     reducer=self.reducer)]
    
    def mapper(self, _, line):        
        # each input line is: key <TAB> stripe (a dict literal, which may contain spaces)
        key, stripe = line.strip().split('\t', 1)
        stripes = ast.literal_eval(stripe)
        for word in stripes:
            yield word, (key,1)  #X, DocA
    
    def reducer(self, key, value):
        # collect every document (key) that contains this word
        dic = {}
        for w,c in value:dic[w] = c
            
        yield  key, dic
        
if __name__ == '__main__':
    inverse_index.run()

Overwriting inverse_index_54.py


In [5]:
from inverse_index_54 import inverse_index
mr_job = inverse_index(args=['SYSTEMS_TEST_DATASET.txt'])
with mr_job.make_runner() as runner: 
    runner.run()
    count = 0
    # stream_output: get access of the output 
    for line in runner.stream_output():
        key,value =  mr_job.parse_output_line(line)
        print key, value
        count = count + 1
print "\n"
print "There are %s records" %count



M {'DocC': 1}
N {'DocC': 1}
X {'DocB': 1, 'DocA': 1}
Y {'DocB': 1, 'DocA': 1}
Z {'DocC': 1, 'DocA': 1}


There are 5 records


In [8]:
!tail -1000 output_hw53_freq.txt > 1000grams.txt

In [9]:
!head 1000grams.txt

"asset"	547
"european"	5470
"reported"	5474
"bundle"	548
"college"	5481
"command"	5484
"notice"	5488
"hire"	549
"objects"	5494
"fell"	5495
