# HW4 DATASCI W261: Machine Learning at Scale 

* **Name:**  Megan Jasek
* **Email:**  meganjasek@ischool.berkeley.edu
* **Class Name:**  W261-2
* **Week Number:**  4
* **Date:**  6/7/16

### HW 4.0

What is MrJob? How is it different from Hadoop MapReduce? 

MrJob is a Python package for writing and running Hadoop streaming jobs.  It differs from Hadoop MapReduce in that it provides an abstracted MapReduce interface on top of Hadoop streaming.  With Hadoop MapReduce, the programmer interacts directly with that tool; with MrJob, the programmer interacts with the MrJob library, and the library in turn interacts with Hadoop MapReduce through Hadoop streaming.  MrJob is simple to install and use while providing reasonable performance.  It provides an object-oriented programming framework in which programmers subclass MRJob and override its methods.  One of the big advantages of using MrJob is that it assists with producing multistep jobs:  the programmer can create multiple MapReduce steps and chain them together using Python code.  Another benefit is its tight integration with Amazon Web Services (AWS), which makes it easy to run large jobs in the cloud and take advantage of the resources there.  Finally, there is a community of developers and contributors to the MrJob project, which makes it easier to get questions answered and resolve issues.

What are the mapper_init, mapper_final(), combiner_final(), reducer_final() methods? When are they called?

In MrJob, each mapper, combiner and reducer has an init and a final method.  These are:  mapper_init(), mapper_final(), combiner_init(), combiner_final(), reducer_init() and reducer_final().  The _init_ methods are called before the corresponding mapper, combiner or reducer method processes any of its input.  For example, the mapper_init method is called before the mapper method processes any input.  From the documentation, one use for this method is to initialize mapper-specific helper structures.  The _final_ methods are called after the mapper, combiner or reducer method has processed all of its input.  For example, the reducer_final method is called after the reducer reaches the end of input.
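
The call order described above can be illustrated with a small plain-Python mimic. This is only a sketch: it does not use the mrjob library, and `WordLengthMapper` and `run_mapper_task` are hypothetical names invented for the illustration. It assumes a single mapper task that sees two input lines.

```python
# Plain-Python mimic of the call order for one mapper task (not mrjob itself).
class WordLengthMapper(object):
    def mapper_init(self):
        # Runs once, before any input line is processed.
        self.calls = ['mapper_init']
        self.total_chars = 0

    def mapper(self, _, line):
        # Runs once per input line.
        self.calls.append('mapper')
        self.total_chars += len(line)
        yield line, len(line)

    def mapper_final(self):
        # Runs once, after the last input line has been processed.
        self.calls.append('mapper_final')
        yield 'TOTAL', self.total_chars


def run_mapper_task(task, lines):
    """Hypothetical driver: init, then mapper per line, then final."""
    output = []
    task.mapper_init()
    for line in lines:
        output.extend(task.mapper(None, line))
    output.extend(task.mapper_final())
    return output


task = WordLengthMapper()
pairs = run_mapper_task(task, ['ab', 'cde'])
print(task.calls)
print(pairs)
```

The same pattern applies to the combiner and reducer variants: state set up in the init method is available to every call of the main method, and the final method can emit a summary once the input is exhausted.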

### HW 4.1

What is serialization in the context of MrJob or Hadoop? When is it used in these frameworks?

Serialization in the context of MrJob or Hadoop is the process of converting keys and values to and from the strings (bytes) that are passed between the methods of a job.  MrJob calls the classes that do this conversion protocols.  There are 3 different instances when serialization is used:  input, internal and output.  _Input_ is how the data is read in to the mapper.  _Internal_ is how data is output from the mapper, input into the combiner, output from the combiner and input into the reducer.  _Output_ is how data is output from the reducer to the output stream.

What is the default serialization mode for input and outputs for MrJob?

The default serialization modes for MrJob are as follows.  The input mode is defined by INPUT_PROTOCOL.  The output mode is defined by OUTPUT_PROTOCOL.  And for completeness, the internal mode (as described above) is defined by INTERNAL_PROTOCOL.
* INPUT_PROTOCOL = mrjob.protocol.RawValueProtocol
* OUTPUT_PROTOCOL = mrjob.protocol.JSONProtocol
* INTERNAL_PROTOCOL = mrjob.protocol.JSONProtocol
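
To make these defaults concrete, here is a minimal sketch that imitates them with the standard `json` module rather than calling mrjob. The helper names `read_raw_value`, `write_json` and `read_json` are hypothetical, and the sketch assumes the raw-value style means "no key, the whole line is the value" and the JSON style means "JSON key, a tab, then JSON value".

```python
import json


def read_raw_value(line):
    # Raw-value style input: there is no key, the whole line is the value.
    return None, line.rstrip('\n')


def write_json(key, value):
    # JSON style: JSON-encoded key, a tab, then JSON-encoded value.
    return '%s\t%s' % (json.dumps(key), json.dumps(value))


def read_json(line):
    # Reverse of write_json: split on the first tab and decode both halves.
    raw_key, raw_value = line.split('\t', 1)
    return json.loads(raw_key), json.loads(raw_value)


key, value = read_raw_value('V,1000,1,C,10001\n')
encoded = write_json('1000', 1)
decoded = read_json(encoded)
print(encoded)
print(decoded)
```

This is why a mapper receives `None` as its key on raw input lines, and why the final output of the jobs in this notebook shows a quoted string key followed by a tab and a number.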

### HW 4.2

Recall the Microsoft logfiles data from the async lecture. The logfiles are described and located at:  
https://kdd.ics.uci.edu/databases/msweb/msweb.html  
http://archive.ics.uci.edu/ml/machine-learning-databases/anonymous/  
This dataset records which areas (Vroots) of www.microsoft.com each user visited in a one-week timeframe in February 1998.
Here, you must preprocess the data on a single node (i.e., not on a cluster of nodes) from the format:  
C,"10001",10001 #Visitor id 10001  
V,1000,1 #Visit by Visitor 10001 to page id 1000  
V,1001,1 #Visit by Visitor 10001 to page id 1001  
V,1002,1 #Visit by Visitor 10001 to page id 1002  
C,"10002",10002 #Visitor id 10002  
V  
Note: #denotes comments  
to the format:  
V,1000,1,C, 10001  
V,1001,1,C, 10001  
V,1002,1,C, 10001  

Write the python code to accomplish this.

In [2]:
'''
    This code reads the input file (infile), converts the data in it and writes the conversion
    to the output file (outfile).  The data is converted from writing customer information on
    a single line to writing customer information on the line corresponding its associated visit
    as per the instructions for HW4.2 above.
'''
infile = "anonymous-msweb.data"
outfile = "anonymous-msweb_converted.data"
with open(infile, 'r') as rf, open(outfile, 'w') as wf:
    # Holds the most recent customer ID; empty until the first 'C' line is seen
    cust_str = ''
    for line in rf:
        # Strip the trailing newline and split the line on commas
        items = line.strip().split(',')
        # If the line is a customer line, then save the customer ID for later use
        # Write the line to the output file
        if items[0] == 'C':
            cust_str = items[2]
            wf.write(line)
        # If the line is a visit line, then concatenate the original line with the
        # current customer information (the current value of cust_str) and write
        # it to the output file with an explicit newline
        elif items[0] == 'V':
            wf.write('%s,C,%s\n' % (line.strip(), cust_str))
        # All other lines write directly to the output file as is
        else:
            wf.write(line)

### HW 4.3

Find the 5 most frequently visited pages using MrJob from the output of 4.2 (i.e., transformed log file).

FIVE MOST FREQUENTLY VISITED PAGES:

| Page ID | Count |
| - | - |
| 1008 | 10,836 |
| 1034 | 9,383 |
| 1004 | 8,463 |
| 1018 | 5,330 |
| 1017 | 5,108 |


In [1]:
%%writefile MostFrequentVisits.py

from mrjob.job import MRJob
#from mrjob.step import MRJobStep
from mrjob.step import MRStep
import csv

def csv_readline(line):
    """Given a string CSV line, return a list of strings."""
    for row in csv.reader([line]):
        return row

class MRMostFrequentVisits(MRJob):

    def mapper_count_visits(self, _, line):
        # read the next line from the file and only if it is a visit record, denoted by
        # 'V', output its page_id plus a count of 1 for the reducer.
        record = csv_readline(line)
        if record[0] == 'V':
            yield record[1], 1
    
    def reducer_sum_visits(self, page_id, counts):
        # output each page_id (key) that is input to the reducer plus the sum of the counts
        # for that page.
        yield page_id, sum(counts)
    
    def reducer_sort_visits(self, page_id, counts):
        # output each page_id (key) that is input to the reducer plus the sum of the counts
        # for that page.  Note here that since the previous reducer 'reducer_sum_visits'
        # already summed all of the counts for each page_id, there should just be one
        # page_id with its total count coming to this method.
        yield page_id, sum(counts)
        
    def steps(self):
        # Define the steps of this MR job.  JOBCONF_STEP2 tells Hadoop how to handle the data during
        # the Hadoop shuffle for the 2nd step.  In this case the data should be sorted in reverse
        # numeric order by the 2nd field (the values or counts of the pages) and then, if there are
        # ties, by the 1st field (the key).
        JOBCONF_STEP2 = {        
            'mapreduce.job.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
            'stream.num.map.output.key.field': 2,
            'stream.map.output.field.separator':',',
            'mapreduce.partition.keycomparator.options': '-k2,2nr -k1,1',
            'mapreduce.job.reduces': '1'
        }
        return [
            MRStep(mapper=self.mapper_count_visits,   # STEP 1:  count the visits
                   reducer=self.reducer_sum_visits),
            MRStep(jobconf=JOBCONF_STEP2,
                    reducer=self.reducer_sort_visits)  # STEP 2:  sort the visits
        ]
    
if __name__ == '__main__':
    MRMostFrequentVisits.run()

Overwriting MostFrequentVisits.py


In [2]:
# There is a known bug, that step-level jobconf does not work in local and inline modes, so the line below will not work.
#!python MostFrequentVisits.py anonymous-msweb_converted.data
# The job must be run with args '-r hadoop' to enable step-level jobconf
!python MostFrequentVisits.py -r hadoop anonymous-msweb_converted.data

No configs found; falling back on auto-configuration
Creating temp directory /tmp/MostFrequentVisits.hadoop.20160607.175413.746860
Looking for hadoop binary in /usr/local/hadoop/bin...
Found hadoop binary: /usr/local/hadoop/bin/hadoop
Using Hadoop version 2.7.1
Copying local files to hdfs:///user/hadoop/tmp/mrjob/MostFrequentVisits.hadoop.20160607.175413.746860/files/...
Looking for Hadoop streaming jar in /usr/local/hadoop...
Found Hadoop streaming jar: /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar
Running step 1 of 2...
  Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  packageJobJar: [/tmp/hadoop-unjar5967164530666942398/] [] /tmp/streamjob8913338694064333338.jar tmpDir=null
  Connecting to ResourceManager at master/50.97.205.254:8032
  Connecting to ResourceManager at master/50.97.205.254:8032
  Total input paths to process : 1
  number of splits:2
  Submitting tokens for job: job_1463787494457_0366
  Subm

In [4]:
####### OUTPUT FOR HW4.3 #######
from MostFrequentVisits import MRMostFrequentVisits
# There is a known bug, that step-level jobconf does not work in local and inline modes so the below command will not work.
#mr_job = MRMostFrequentVisits (args=['anonymous-msweb_converted.data'])
# The job must be run with args '-r', 'hadoop' to enable step-level jobconf
mr_job = MRMostFrequentVisits (args=['anonymous-msweb_converted.data', '-r', 'hadoop'])
with mr_job.make_runner() as runner:
    runner.run()
    # stream_output and print each line of the output
    # only print the first 5 values.
    for counter, line in enumerate(runner.stream_output()):
        if counter < 5:
            print mr_job.parse_output_line(line)
        else:
            break

ERROR:mrjob.fs.hadoop:STDERR: 16/06/07 13:01:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable



(u'1008', 10836)
(u'1034', 9383)
(u'1004', 8463)
(u'1018', 5330)
(u'1017', 5108)
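
As a cross-check on the two-step job above, the same logic (count the `V` records per page, then order by count descending with ties broken by page ID) can be sketched in plain Python on a single node. This is only an illustrative sketch; `top_pages` and the five-line `sample` are hypothetical and are not part of the submitted job.

```python
from collections import Counter


def top_pages(lines, n=5):
    """Count 'V' records per page ID and return the n largest counts."""
    counts = Counter()
    for line in lines:
        fields = line.strip().split(',')
        if fields[0] == 'V':
            counts[fields[1]] += 1
    # Sort by count descending, then by page ID to break ties.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical miniature input in the converted format from HW 4.2.
sample = [
    'C,"10001",10001',
    'V,1000,1,C,10001',
    'V,1001,1,C,10001',
    'V,1000,1,C,10002',
    'V,1002,1,C,10002',
]
print(top_pages(sample, n=2))
```

Passing the lines of anonymous-msweb_converted.data to `top_pages` should give a quick way to compare against the Hadoop output, since the file is small enough to count in memory.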


### HW 4.4

Find the most frequent visitor of each page using MrJob and the output of 4.2 (i.e., transformed log file). In this output please include the webpage URL, webpageID and Visitor ID.

The most frequent visitors are listed below in the sections labeled 'OUTPUT FOR HW4.4'.  Here are the first 5 pages and their most frequent visitors:  

URL: www.microsoft.com/regwiz, Page ID: 1000, Visitor ID: 10001, # Page Visits: 1  
URL: www.microsoft.com/support, Page ID: 1001, Visitor ID: 10001, # Page Visits: 1  
URL: www.microsoft.com/athome, Page ID: 1002, Visitor ID: 10001, # Page Visits: 1  
URL: www.microsoft.com/kb, Page ID: 1003, Visitor ID: 10002, # Page Visits: 1  
URL: www.microsoft.com/search, Page ID: 1004, Visitor ID: 10003, # Page Visits: 1  

None of the pages have a visitor that visited more than one time.  All of the page visit counts per visitor are 1.
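
The idea behind the job (count each page/visitor pair, then keep the visitor with the largest count per page) can be sketched in plain Python. This is an assumption-laden illustration rather than the MrJob code itself: `most_frequent_visitors` and `sample` are hypothetical, and ties are resolved here by taking the smallest visitor ID in string order.

```python
from collections import defaultdict


def most_frequent_visitors(lines):
    """Map each page ID to (visitor ID, visit count) for its top visitor."""
    counts = defaultdict(int)
    for line in lines:
        fields = line.strip().split(',')
        if fields[0] == 'V':
            # Key on the (page ID, visitor ID) pair, as the mapper does.
            counts[(fields[1], fields[4].strip())] += 1
    best = {}
    for (page_id, visitor_id), n in sorted(counts.items()):
        # Keep the first visitor seen with the strictly largest count.
        if page_id not in best or n > best[page_id][1]:
            best[page_id] = (visitor_id, n)
    return best


# Hypothetical miniature input in the converted format from HW 4.2.
sample = [
    'V,1000,1,C,10001',
    'V,1000,1,C,10002',
    'V,1001,1,C,10002',
]
print(most_frequent_visitors(sample))
```

When every pair has a count of 1, as observed in this dataset, the "most frequent" visitor is decided entirely by the tie-breaking rule, which is worth keeping in mind when reading the output below.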

In [5]:
%%writefile MostFrequentVisitors.py

'''
    This code finds the most frequent visitor for each Page ID from the input file and
    prints out the full URL, the Page ID, the visitor ID and the number of Page Visits
    by that visitor ID.
'''
from mrjob.job import MRJob
from mrjob.step import MRStep
import csv

def csv_readline(line):
    """Given a string CSV line, return a list of strings."""
    for row in csv.reader([line]):
        return row

class MRMostFrequentVisitors(MRJob):
    # create two class-level attributes to keep track of the current page_id in the 
    # reducer and the vroots (in order to print the full URL in the output)
    reducer_current_pageid = ""
    vroots = {}
    
    def mapper_count_visits(self, _, line):
        # read the next line from the file and only if it is a visit record, denoted by
        # 'V', output page_visitor_pair with a count of 1.  A page_visitor_pair has this
        # format:  (page_id.visitor_id)
        record = csv_readline(line)
        if record[0] == 'V':
            page_id = record[1]
            visitor_id = record[4]
            page_visitor_pair = ('(%s.%s)' % (page_id, visitor_id))
            yield page_visitor_pair, 1
        
    def reducer_sum_visits(self, page_visitor_pairs, counts):
        # output each page_visitor_pair (key) that is input to the reducer plus the sum of the counts
        # for that page.
        yield page_visitor_pairs, sum(counts)
    
    def reducer_sort_init(self):
        # Read the data from the filename and store it in the self.vroots dictionary.  This
        # stores the base URL and the vroot labels for each vroot.  This enables pasting
        # the full URL together in reducer_sort_visits.
        filename = 'anonymous-msweb_converted.data'
        with open(filename, 'r') as f:
            for line in f.readlines():
                record = line.strip().split(',')
                if record[0] == 'I':
                    self.vroots['0'] = record[2].strip('"')
                elif record[0] == 'A':
                    page_id = record[1]
                    vroot = record[4].strip('"')
                    self.vroots[page_id] = vroot

    def reducer_sort_visits(self, page_visitor_pairs, counts):
        # receive page_visitor_pairs and their counts from a previous step.  The pairs
        # come in to this method sorted by their counts with the highest count first.
        # Create two output strings that concatenate the requested information.  The
        # first output string contains the full URL, the Page ID and the Visitor ID.
        # The second output string contains the number of page visits.
        page_id, visitor_id = page_visitor_pairs.strip('()').split('.', 1)
        if page_id != self.reducer_current_pageid:
            self.reducer_current_pageid = page_id
            total_visits = sum(counts)
            output_str_1 = ('URL: %s%s, Page ID: %s, Visitor ID: %s' % 
                          (self.vroots['0'], self.vroots[page_id], page_id, visitor_id))
            output_str_2 = ('# Page Visits: %d' % (total_visits))
            yield output_str_1, output_str_2

    def steps(self):
        # Define the steps of this MR job.  JOBCONF_STEP2 tells Hadoop how to handle the data during
        # the Hadoop shuffle for the 2nd step.  In this case the data should be sorted in reverse
        # numeric order by the 2nd field (the values or counts of the page pairs) and then, if there
        # are ties, by the 1st field (the key).
        JOBCONF_STEP2 = {        
            'mapreduce.job.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
            'stream.num.map.output.key.field': 2,
            'stream.map.output.field.separator':',',
            'mapreduce.partition.keycomparator.options': '-k2,2nr -k1,1',
            'mapreduce.job.reduces': '1'
        }
        return [
            MRStep(mapper=self.mapper_count_visits,   # STEP 1:  count the visits
                   reducer=self.reducer_sum_visits),
            MRStep(jobconf=JOBCONF_STEP2,
                   reducer_init=self.reducer_sort_init,
                   reducer=self.reducer_sort_visits)  # STEP 2:  sort the visits
        ]
                
if __name__ == '__main__':
    MRMostFrequentVisitors.run()

Overwriting MostFrequentVisitors.py


In [6]:
####### OUTPUT FOR HW4.4 #######
# The job must be run with args '-r hadoop' to enable step-level jobconf
#!python MostFrequentVisitors.py anonymous-msweb_converted_small.data --file=anonymous-msweb_converted_small.data
!python MostFrequentVisitors.py -r hadoop anonymous-msweb_converted.data --file=anonymous-msweb_converted.data

No configs found; falling back on auto-configuration
Creating temp directory /tmp/MostFrequentVisitors.hadoop.20160607.181232.677596
Looking for hadoop binary in /usr/local/hadoop/bin...
Found hadoop binary: /usr/local/hadoop/bin/hadoop
Using Hadoop version 2.7.1
Copying local files to hdfs:///user/hadoop/tmp/mrjob/MostFrequentVisitors.hadoop.20160607.181232.677596/files/...
Looking for Hadoop streaming jar in /usr/local/hadoop...
Found Hadoop streaming jar: /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar
Running step 1 of 2...
  Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  packageJobJar: [/tmp/hadoop-unjar8902090679732739008/] [] /tmp/streamjob8737633217055621182.jar tmpDir=null
  Connecting to ResourceManager at master/50.97.205.254:8032
  Connecting to ResourceManager at master/50.97.205.254:8032
  Total input paths to process : 1
  number of splits:2
  Submitting tokens for job: job_1463787494457_0372
  

In [7]:
####### OUTPUT FOR HW4.4 #######
from MostFrequentVisitors import MRMostFrequentVisitors
# The job must be run with args '-r', 'hadoop' to enable step-level jobconf
mr_job = MRMostFrequentVisitors (args=['anonymous-msweb_converted.data', '-r', 'hadoop', 
                                       '--file=anonymous-msweb_converted.data'])
 
with mr_job.make_runner() as runner:
    runner.run()
    # stream_output and print each line of the output
    for line in runner.stream_output():
        print(line)




"URL: www.microsoft.com/regwiz, Page ID: 1000, Visitor ID: 10001"	"# Page Visits: 1"

"URL: www.microsoft.com/support, Page ID: 1001, Visitor ID: 10001"	"# Page Visits: 1"

"URL: www.microsoft.com/athome, Page ID: 1002, Visitor ID: 10001"	"# Page Visits: 1"

"URL: www.microsoft.com/kb, Page ID: 1003, Visitor ID: 10002"	"# Page Visits: 1"

"URL: www.microsoft.com/search, Page ID: 1004, Visitor ID: 10003"	"# Page Visits: 1"

"URL: www.microsoft.com/norge, Page ID: 1005, Visitor ID: 10004"	"# Page Visits: 1"

"URL: www.microsoft.com/misc, Page ID: 1006, Visitor ID: 10005"	"# Page Visits: 1"

"URL: www.microsoft.com/ie_intl, Page ID: 1007, Visitor ID: 10007"	"# Page Visits: 1"

"URL: www.microsoft.com/msdownload, Page ID: 1008, Visitor ID: 10009"	"# Page Visits: 1"

"URL: www.microsoft.com/windows, Page ID: 1009, Visitor ID: 10009"	"# Page Visits: 1"

"URL: www.microsoft.com/vbasic, Page ID: 1010, Visitor ID: 10010"	"# Page Visits: 1"

"URL: www.microsoft.com/officedev, Page ID: 1011, Visi

ERROR:mrjob.fs.hadoop:STDERR: 16/06/07 13:17:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable




"URL: www.microsoft.com/corporate_solutions, Page ID: 1213, Visitor ID: 12472"	"# Page Visits: 1"

"URL: www.microsoft.com/finserv, Page ID: 1214, Visitor ID: 12515"	"# Page Visits: 1"

"URL: www.microsoft.com/developer, Page ID: 1215, Visitor ID: 12577"	"# Page Visits: 1"

"URL: www.microsoft.com/vrml, Page ID: 1216, Visitor ID: 12666"	"# Page Visits: 1"

"URL: www.microsoft.com/ireland, Page ID: 1217, Visitor ID: 12675"	"# Page Visits: 1"

"URL: www.microsoft.com/publishersupport, Page ID: 1218, Visitor ID: 12714"	"# Page Visits: 1"

"URL: www.microsoft.com/ads, Page ID: 1219, Visitor ID: 12746"	"# Page Visits: 1"

"URL: www.microsoft.com/macofficesupport, Page ID: 1220, Visitor ID: 12795"	"# Page Visits: 1"

"URL: www.microsoft.com/mstv, Page ID: 1221, Visitor ID: 12815"	"# Page Visits: 1"

"URL: www.microsoft.com/msofc, Page ID: 1222, Visitor ID: 12819"	"# Page Visits: 1"

"URL: www.microsoft.com/finland, Page ID: 1223, Visitor ID: 12828"	"# Page Visits: 1"

"URL: www.microsoft.co

### HW 4.5 Clustering Tweet Dataset

Here you will use a different dataset consisting of word-frequency distributions for 1,000 Twitter users. These Twitter users use language in very different ways, and were classified by hand according to the criteria:  
* 0: Human, where only basic human-human communication is observed.
* 1: Cyborg, where language is primarily borrowed from other sources (e.g., jobs listings, classifieds postings, advertisements, etc...).
* 2: Robot, where language is formulaically derived from unrelated sources (e.g., weather/seismology, police/fire event logs, etc...).
* 3: Spammer, where language is replicated to high multiplicity (e.g., celebrity obsessions, personal promotion, etc... )  

Check out the preprints of recent research, which spawned this dataset:  
http://arxiv.org/abs/1505.04342 http://arxiv.org/abs/1508.01843  

The main data lie in the accompanying file:  topUsers_Apr-Jul_2014_1000-words.txt  
and are of the form:  
USERID,CODE,TOTAL,WORD1_COUNT,WORD2_COUNT,...  
where  
* USERID = unique user identifier
* CODE = 0/1/2/3 class code
* TOTAL = sum of the word counts  

Using this data, you will implement a 1000-dimensional K-means algorithm in MrJob on the users by their 1000-dimensional word stripes/vectors using several centroid initializations and values of K.  Note that each "point" is a user as represented by 1000 words, and that word-frequency distributions are generally heavy-tailed power-laws (often called Zipf distributions), and are very rare in the larger class of discrete, random distributions. For each user you will have to normalize by its "TOTAL" column. Try several parameterizations and initializations:  
* (A) K=4 uniform random centroid-distributions over the 1000 words (generate 1000 random numbers and normalize the vectors)
* (B) K=2 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution
* (C) K=4 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution
* (D) K=4 "trained" centroids, determined by the sums across the classes. Use the (row-normalized) class-level aggregates as 'trained' starting centroids (i.e., the training is already done for you!).  
Note that you do not have to compute the aggregated distribution or the class-aggregated distributions, which are rows in the auxiliary file: topUsers_Apr-Jul_2014_1000-words_summaries.txt  
* Row 1: Words
* Row 2: Aggregated distribution across all classes
* Rows 3-6: Class-aggregated distributions for classes 0-3  
For (A), we select 4 users randomly from a uniform distribution [1,...,1,000].  For (B), (C), and (D) you will have to use data from the auxiliary file:  topUsers_Apr-Jul_2014_1000-words_summaries.txt  

This file contains 5 special word-frequency distributions:  
(1) The 1000-user-wide aggregate, which you will perturb for initializations in parts (B) and (C), and  
(2-5) The 4 class-level aggregates for each of the user-type classes (0/1/2/3)  
In parts (B) and (C), you will have to perturb the 1000-user aggregate (after initially normalizing by its sum, which is also provided). So if in (B) you want to create 2 perturbations of the aggregate, start with (1), normalize, and generate 1000 random numbers uniformly from the unit interval (0,1) twice (for two centroids), using:  
`from numpy import random`  
`numbers = random.sample(1000)`  
Take these 1000 numbers and add them (component-wise) to the 1000-user aggregate, and then renormalize to obtain one of your aggregate-perturbed initial centroids.  
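
The perturbation recipe above (normalize, add uniform noise component-wise, renormalize) can be sketched as follows. To keep the example dependency-free it uses the standard library's `random.Random` in place of `numpy.random.sample`, which is a substitution and draws from [0, 1). The function `perturbed_centroid` and the four-element `aggregate` are hypothetical stand-ins for the real 1000-word aggregate.

```python
import random


def perturbed_centroid(aggregate, rng):
    """Normalize, add uniform noise per component, then renormalize."""
    total = float(sum(aggregate))
    normalized = [x / total for x in aggregate]
    # One uniform random number per component, added component-wise.
    noisy = [x + rng.random() for x in normalized]
    noisy_total = sum(noisy)
    return [x / noisy_total for x in noisy]


rng = random.Random(0)        # fixed seed so the sketch is repeatable
aggregate = [50, 30, 15, 5]   # hypothetical word counts, not from the dataset
centroids = [perturbed_centroid(aggregate, rng) for _ in range(2)]
for c in centroids:
    print(c, sum(c))
```

Because the noise has the same scale in every component while the normalized aggregate is heavy-tailed, unscaled noise can dominate the rare-word components; that is one reason an implementation might choose to shrink the noise before adding it.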

For experiments A, B, C and D, iterate until a threshold (try 0.001) is reached. After convergence, print out a summary of the classes present in each cluster. In particular, report the composition as measured by the total portion of each class type (0-3) contained in each cluster, and discuss your findings and any differences in outcomes across parts A-D.
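
The iterate-until-threshold loop can be sketched on a single node in plain Python. This is a toy illustration under stated assumptions, not the MrJob implementation: `nearest` and `kmeans` are hypothetical names, the points are 2-dimensional instead of 1000-dimensional, and convergence is taken to mean that no centroid coordinate moved by more than the threshold.

```python
def nearest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))


def kmeans(points, centroids, threshold=0.001, max_iter=100):
    """Return (final centroids, number of iterations used)."""
    for iteration in range(1, max_iter + 1):
        # Assignment step: group each point with its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Update step: mean of the members; keep the old centroid if empty.
        new = []
        for old, members in zip(centroids, groups):
            if members:
                new.append([sum(col) / float(len(members)) for col in zip(*members)])
            else:
                new.append(list(old))
        # Largest movement of any single coordinate in this iteration.
        moved = max(abs(a - b) for o, n in zip(centroids, new) for a, b in zip(o, n))
        centroids = new
        if moved <= threshold:
            break
    return centroids, iteration


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
final, iterations = kmeans(points, [[0.0, 0.0], [10.0, 10.0]])
print(final, iterations)
```

In the MrJob version the assignment step corresponds to the mapper and the update step to the combiner and reducer, while the threshold check runs in the driver between job submissions.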

SUMMARY TABLE (each cell gives the percentage of that class's users assigned to the cluster, with the user count in parentheses; each class column sums to 100% within a part):

| Cluster          | Human (0)    | Cyborg (1)   | Robot (2)   | Spammer (3)   |
|:-----------------|:-------------|:-------------|:------------|:--------------|
| **HW4.5 Part A** |              |              |             |               |
| Cluster 0        | 2.26% (17)   | 0.00% (0)    | 9.26% (5)   | 62.14% (64)   |
| Cluster 1        | 97.61% (734) | 3.30% (3)    | 16.67% (9)  | 33.98% (35)   |
| Cluster 2        | 0.00% (0)    | 56.04% (51)  | 3.70% (2)   | 0.00% (0)     |
| Cluster 3        | 0.13% (1)    | 40.66% (37)  | 70.37% (38) | 3.88% (4)     |
|                  |              |              |             |               |
| **HW4.5 Part B** |              |              |             |               |
| Cluster 0        | 99.87% (751) | 3.30% (3)    | 25.93% (14) | 96.12% (99)   |
| Cluster 1        | 0.13% (1)    | 96.70% (88)  | 74.07% (40) | 3.88% (4)     |
|                  |              |              |             |               |
| **HW4.5 Part C** |              |              |             |               |
| Cluster 0        | 78.19% (588) | 1.10% (1)    | 0.00% (0)   | 24.27% (25)   |
| Cluster 1        | 0.27% (2)    | 0.00% (0)    | 3.70% (2)   | 55.34% (57)   |
| Cluster 2        | 21.41% (161) | 2.20% (2)    | 24.07% (13) | 16.50% (17)   |
| Cluster 3        | 0.13% (1)    | 96.70% (88)  | 72.22% (39) | 3.88% (4)     |
|                  |              |              |             |               |
| **HW4.5 Part D** |              |              |             |               |
| Cluster 0        | 99.60% (749) | 3.30% (3)    | 25.93% (14) | 36.89% (38)   |
| Cluster 1        | 0.00% (0)    | 56.04% (51)  | 0.00% (0)   | 0.00% (0)     |
| Cluster 2        | 0.13% (1)    | 40.66% (37)  | 74.07% (40) | 3.88% (4)     |
| Cluster 3        | 0.27% (2)    | 0.00% (0)    | 0.00% (0)   | 59.22% (61)   |
|                  |              |              |             |               |
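
The percentages in the table are column-wise: each count is divided by the total number of users in that class. A minimal sketch of that calculation follows, assuming the cluster assignments are available as (class code, cluster index) pairs. The function `composition` and the helper `fmt` are hypothetical names, and the input is rebuilt from the Part B counts in the table rather than from a real clustering run.

```python
from collections import Counter


def composition(assignments, k, classes=(0, 1, 2, 3)):
    """Return table[cluster][class] = (percent of the class, user count)."""
    assignments = list(assignments)
    pair_counts = Counter(assignments)
    class_totals = Counter(code for code, _ in assignments)
    table = []
    for cluster in range(k):
        row = []
        for code in classes:
            n = pair_counts[(code, cluster)]
            total = class_totals[code]
            pct = 100.0 * n / total if total else 0.0
            row.append((pct, n))
        table.append(row)
    return table


def fmt(cell):
    return '%.2f%% (%d)' % cell


# (class, cluster) -> user count, copied from the Part B rows of the table.
counts_b = {(0, 0): 751, (0, 1): 1, (1, 0): 3, (1, 1): 88,
            (2, 0): 14, (2, 1): 40, (3, 0): 99, (3, 1): 4}
assignments = [pair for pair, n in counts_b.items() for _ in range(n)]
for cluster, row in enumerate(composition(assignments, k=2)):
    print('Cluster %d: %s' % (cluster, ' | '.join(fmt(c) for c in row)))
```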

In [10]:
# Create a normalized version of the data topUsers_Apr-Jul_2014_1000-words.txt file
infile = "topUsers_Apr-Jul_2014_1000-words.txt"
outfile = "topUsers_Apr-Jul_2014_1000-words_normalized.txt"
with open(infile, 'r') as rf, open(outfile, 'w') as wf:
    for line in rf.readlines():
        splt = line.strip().split(',')
        total = float(splt[2])
        wf.write('%s,%s,%s,' % (splt[0], splt[1], splt[2]))
        for i in range(3,len(splt)-1):
            wf.write('%f,' % (float(splt[i])/total))
        wf.write('%f' % (float(splt[len(splt)-1])/total))
        wf.write('\n')

In [None]:
# Create a normalized version of the data topUsers_Apr-Jul_2014_1000-words.txt file
# Exclude the first 3 columns of the data.
infile = "topUsers_Apr-Jul_2014_1000-words.txt"
outfile = "topUsers_Apr-Jul_2014_1000-words_normalized_only.txt"
with open(infile, 'r') as rf, open(outfile, 'w') as wf:
    for line in rf.readlines():
        splt = line.strip().split(',')
        total = float(splt[2])
        for i in range(3,len(splt)-1):
            wf.write('%f,' % (float(splt[i])/total))
        wf.write('%f' % (float(splt[len(splt)-1])/total))
        wf.write('\n')

In [6]:
# Create a normalized version of the summaries file topUsers_Apr-Jul_2014_1000-words_summaries.txt
# Exclude the entire first row (the words) and the first 3 columns of the data.
infile = "topUsers_Apr-Jul_2014_1000-words_summaries.txt"
outfile = "topUsers_Apr-Jul_2014_1000-words_summaries_normalized_only.txt"
with open(infile, 'r') as rf, open(outfile, 'w') as wf:
    counter = 0
    for line in rf.readlines():
        if counter != 0:
            splt = line.strip().split(',')
            total = float(splt[2])
            for i in range(3,len(splt)-1):
                wf.write('%f,' % (float(splt[i])/total))
            wf.write('%f' % (float(splt[len(splt)-1])/total))
            wf.write('\n')
        counter += 1

In [11]:
%%writefile Kmeans.py
from numpy import argmin, array, random
from mrjob.job import MRJob
from mrjob.step import MRStep
from itertools import chain
import os

#Find the index of the nearest centroid for a data point 
def MinDist(datapoint, centroid_points):
    datapoint = array(datapoint)
    centroid_points = array(centroid_points)
    diff = datapoint - centroid_points 
    diffsq = diff*diff
    # Get the nearest centroid for each instance
    minidx = argmin(list(diffsq.sum(axis = 1)))
    return minidx

#Check whether centroids converge
def stop_criterion(centroid_points_old, centroid_points_new,T):
    oldvalue = list(chain(*centroid_points_old))
    newvalue = list(chain(*centroid_points_new))
    Diff = [abs(x-y) for x, y in zip(oldvalue, newvalue)]
    Flag = True
    for i in Diff:
        if(i>T):
            Flag = False
            break
    return Flag

class MRKmeans(MRJob):
    # create instance variables to store the centroid_points for the mapper and the
    # value of k for the reducer.
    centroid_points=[]
    k=4
    def steps(self):
        # Create the steps for the MRJob.  There is only one step here containing
        # a mapper, a combiner and a reducer.
        return [
            MRStep(mapper_init = self.mapper_init, mapper=self.mapper,
                   combiner = self.combiner,
                   reducer_init=self.reducer_init,reducer=self.reducer)
               ]
    
    def configure_options(self):
        # Configure a new command line option to enable the job to accept the value
        # of k required by the Kmeans algorithm.
        super(MRKmeans, self).configure_options()
        self.add_passthrough_option('--k', type='int', default=4)
    
    #load centroids info from file
    def mapper_init(self):
        #print "Current path:", os.path.dirname(os.path.realpath(__file__))        
        self.centroid_points = [map(float,s.split('\n')[0].split(',')) for s in open("Centroids.txt").readlines()]
        # This is the line that breaks things with multiple mappers, so it is commented out.
        #open('Centroids.txt', 'w').close()
        
    #load data and output the nearest centroid index and data point plus a count of 1. 
    def mapper(self, _, line):
        D = (map(float,line.split(',')))
        yield int(MinDist(D,self.centroid_points)), (D,1)
    
    #Combine sum of data points locally
    def combiner(self, idx, inputdata):
        num = 0
        sumD = [0.0]*1000
        for D,n in inputdata:
            num += n
            sumD = [x + y for x, y in zip(sumD,D)]
        yield idx,(sumD,num)
    
    # Set the value of k for the reducer from the command line argument passed in
    def reducer_init(self): 
        self.k = self.options.k
        
    #Aggregate sum for each cluster and then calculate the new centroids
    def reducer(self, idx, inputdata): 
        centroids = []
        num = [0]*self.k 
        for i in range(self.k):
            centroids.append([0.0]*1000)
        for D, n in inputdata:
            num[idx] = num[idx] + n
            centroids[idx] = [x + y for x, y in zip(centroids[idx],D)]
        centroids[idx] = [i / float(num[idx]) for i in centroids[idx]]        
        yield idx, centroids[idx]
      
if __name__ == '__main__':
    MRKmeans.run()

Overwriting Kmeans.py
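
The combiner in Kmeans.py emits partial sums and counts rather than partial averages, because averages of averages are not correct when the splits have different sizes. A small plain-Python sketch of that idea follows. The names `partial_sum` and `merge_to_centroid` are hypothetical, and the two "splits" are 2-dimensional toy data rather than the 1000-dimensional user vectors.

```python
def partial_sum(points):
    """What a combiner emits: (component-wise sum, number of points)."""
    return [sum(col) for col in zip(*points)], len(points)


def merge_to_centroid(partials):
    """What a reducer does: add the partial sums and counts, then divide."""
    dims = len(partials[0][0])
    total = [0.0] * dims
    count = 0
    for vec, n in partials:
        total = [t + v for t, v in zip(total, vec)]
        count += n
    return [t / float(count) for t in total]


# Two hypothetical input splits assigned to the same cluster.
split_1 = [[1.0, 2.0], [3.0, 4.0]]
split_2 = [[5.0, 6.0]]
centroid = merge_to_centroid([partial_sum(split_1), partial_sum(split_2)])
print(centroid)
```

The result equals the plain mean of all three points, which is what the reducer would compute with no combiner at all. Hadoop may run a combiner zero or more times, so this equivalence is what makes the combiner safe to use.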


In [12]:
# These are the functions for the different initialization methods for HW4.5
import re
from numpy import random

# Generate initial centroids FOR PART A.  Select k users at random from the input file
# and use those datapoints as the starting centroids.
def centroid_init_A(k, filename):
    centroid_points = []
    with open(filename, 'r') as f:
        lines = f.readlines()
        for i in range(k):
            # numpy's randint excludes the upper bound, so pass len(lines) to make
            # every row (including the last one) selectable.
            rint = random.randint(0, len(lines))
            rline = lines[rint].strip().split(',')
            # Skip the first 3 columns (USERID, CODE, TOTAL)
            centroid_points.append([float(s) for s in rline[3:]])
    with open('Centroids.txt', 'w+') as f:
        f.writelines(','.join(str(j) for j in i) + '\n' for i in centroid_points)
    return centroid_points

###################################################################################
## Generate random initial centroids around the global aggregate
## Part (B) and (C) of this question
###################################################################################
def centroid_init_BC(k):
    counter = 0
    for line in open("topUsers_Apr-Jul_2014_1000-words_summaries.txt").readlines():
        # Note correction from Kevin from Boulder
        if counter == 1:        
            data = re.split(",",line)
            globalAggregate = [float(data[i+3])/float(data[2]) for i in range(1000)]
        counter += 1
    ## perturb the global aggregate for the k initializations    
    centroids = []
    for i in range(k):
        rndpoints = random.sample(1000)
        peturpoints = [rndpoints[n]/10+globalAggregate[n] for n in range(1000)]
        centroids.append(peturpoints)
        total = 0
        for j in range(len(centroids[i])):
            total += centroids[i][j]
        for j in range(len(centroids[i])):
            centroids[i][j] = centroids[i][j]/total
    with open('Centroids.txt', 'w+') as f:
        f.writelines(','.join(str(j) for j in i) + '\n' for i in centroids)
    return centroids

# Use the row-normalized class aggregates from the summaries file as the starting
# centroids.
def centroid_init_D(k, filename):
    centroid_points = []
    with open(filename, 'r') as f:
        lines = f.readlines()
        for i in range(1,k+1):
            rline = lines[i].strip().split(',')
            centroid_points.append([float(s) for s in rline])
    with open('Centroids.txt', 'w+') as f:
        f.writelines(','.join(str(j) for j in i) + '\n' for i in centroid_points)
    return centroid_points
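
The perturb-then-renormalize step used for Parts B and C can be illustrated on a small vector. This is only a sketch under stated assumptions: it uses the standard-library `random` module with a fixed seed instead of `numpy.random`, and the helper name `perturb_and_normalize` and the four-element base vector are hypothetical.

```python
import random

def perturb_and_normalize(base, scale=0.1, rng=None):
    """Add uniform noise in [0, scale) to each entry, then rescale to sum to 1."""
    rng = rng or random.Random()
    noisy = [b + scale * rng.random() for b in base]
    total = sum(noisy)
    return [v / total for v in noisy]

# Hypothetical stand-in for the global aggregate (already sums to 1)
base = [0.4, 0.3, 0.2, 0.1]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
centroids = [perturb_and_normalize(base, rng=rng) for _ in range(2)]
```

Each generated centroid stays close to the base vector but differs from the others, which gives k-means distinct starting points.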


In [13]:
%reload_ext autoreload
%autoreload 2
from numpy import random
from Kmeans import MRKmeans, stop_criterion, MinDist

def write_centroids (centroids, hw, fnum, iter_num):
    # This function writes the centroids to a file that is labeled with the homework
    # part that we are working on plus a trial number.
    # The number of iterations that the k-means algorithm needed is stored on the first line.
    filename = 'centroid_results_' + hw + str(fnum) + '.txt'
    with open(filename, 'w+') as f:
        f.write('Number of Iterations: %d\n' % (iter_num))
        f.writelines(','.join(str(j) for j in i) + '\n' for i in centroids)

# Set which HW4.5 homework part that we are working on A, B, C or D.
HW_PART = 'D'
# Set the TRIAL_NUM variable to keep track of which trial number is being run.
# Since randomization is involved in parts A, B and C, multiple trials might be desired
# to compare results.
TRIAL_NUM = 1

# Set k and initialize the centroids according to the HW_PART.
k = 4
if HW_PART == 'A':
    centroid_points = centroid_init_A(k, 'topUsers_Apr-Jul_2014_1000-words_normalized.txt')
elif HW_PART == 'B':
    # Part B is the only part with k=2
    k = 2
    centroid_points = centroid_init_BC(k)
elif HW_PART == 'C':
    centroid_points = centroid_init_BC(k)
elif HW_PART == 'D':
    centroid_points = centroid_init_D(k, 'topUsers_Apr-Jul_2014_1000-words_summaries_normalized_only.txt')

# Run the KMeans MRJob
mr_job = MRKmeans(args=['topUsers_Apr-Jul_2014_1000-words_normalized_only.txt', '--file=Centroids.txt', '--k', str(k)])
    
# Update centroids iteratively
i = 0
while True:
    # Save the previous centroids to check for convergence
    centroid_points_old = centroid_points[:]
    print "iteration"+str(i)+":"
    with mr_job.make_runner() as runner: 
        runner.run()
        # stream_output gives access to the job's output, one line per centroid
        for line in runner.stream_output():
            key,value =  mr_job.parse_output_line(line)
            #print('key: %d, len: %d' % (key, len(value)))
            centroid_points[key] = value
        
        # Update the centroids for the next iteration
        with open('Centroids.txt', 'w') as f:
            # Use names other than `i`, which is the iteration counter in this loop
            f.writelines(','.join(str(v) for v in c) + '\n' for c in centroid_points)
    print "\n"
    i += 1
    if(stop_criterion(centroid_points_old,centroid_points,0.001)):
        break

# Write the centroid_points to a file
write_centroids(centroid_points, HW_PART, TRIAL_NUM, i)

iteration0:


iteration1:


iteration2:


iteration3:


iteration4:
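
The driver loop above stops when `stop_criterion` returns true for a threshold of 0.001. That function is imported from Kmeans.py and its body is not shown in this section, so the following is only a sketch of what such a check might look like, assuming it compares the old and new centroids coordinate by coordinate. The name `has_converged` and the sample values are hypothetical.

```python
def has_converged(old_centroids, new_centroids, threshold):
    """Return True if no coordinate of any centroid moved more than threshold."""
    for old, new in zip(old_centroids, new_centroids):
        for a, b in zip(old, new):
            if abs(a - b) > threshold:
                return False
    return True

old = [[0.10, 0.20], [0.30, 0.40]]
close = [[0.1005, 0.2000], [0.3000, 0.4002]]  # every coordinate moved < 0.001
far = [[0.15, 0.20], [0.30, 0.40]]            # one coordinate moved 0.05

converged_close = has_converged(old, close, 0.001)
converged_far = has_converged(old, far, 0.001)
```

A per-coordinate check like this is stricter than comparing an average shift, since a single coordinate that is still moving keeps the iterations going.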




In [14]:
from Kmeans import MinDist
from tabulate import tabulate

# Reads the centroids saved for homework part `hw` and trial number `trial`.
# Returns them as a list of lists, together with the recorded iteration count.
def read_centroids (hw, trial):
    filename = 'centroid_results_' + hw + str(trial) + '.txt'
    with open(filename) as f:
        lines = f.readlines()
        # The first line has the form 'Number of Iterations: N'
        num_iter = lines[0].strip().split()[3]
        centroids = [[float(v) for v in s.strip().split(',')] for s in lines[1:]]
    return centroids, num_iter

# Set the total number of data points contained in each class (0-3) for use
# in calculating percentages.
total_codes = [752.0,91.0,54.0,103.0]

# Initialize the data variable to store the data listed in the tabulate table
data = []
# Summarize data for each of the homework parts
for hw in ['A', 'B', 'C', 'D']:
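    # Read the centroids saved for trial 0 of this part (the trial number must
    # match a file previously written by write_centroids)
    centroid_points, num_iter = read_centroids(hw, 0)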

    # Summarize the class composition of each cluster.
    # res_a[cluster][code] counts the data points of class `code` assigned to `cluster`.
    res_a = []
    for i in range(len(centroid_points)):
        res_a.append([0]*len(total_codes))
        
    # Count the number of data points from each class that are in each cluster
    with open('topUsers_Apr-Jul_2014_1000-words_normalized.txt', 'r') as f:
        for line in f.readlines():
            rline = line.strip().split(',')
            userid = rline[0]
            code = int(rline[1])
            D = [float(s) for s in rline[3:]]
            cluster = MinDist(D, centroid_points)
            res_a[cluster][code] += 1

    # Calculate the percentage of each class that is in each cluster
    res_b = []
    data.append(['HW4.5 Part %s' % (hw), '', '', '', ''])
    for i in range(len(res_a)):
        res_b.append([(x / y)*100 for x, y in zip(res_a[i],total_codes)])
        # Append the information for the rows of the table
        data.append(['Cluster %d' % (i), '%.2f%% (%d)' % (res_b[i][0], res_a[i][0]), '%.2f%% (%d)' % (res_b[i][1], res_a[i][1]), 
                 '%.2f%% (%d)' % (res_b[i][2], res_a[i][2]), '%.2f%% (%d)' % (res_b[i][3], res_a[i][3])])
    data.append(['', '', '', '', ''])

# Create a variable 'headers' to store the headers of the table
headers = ['   ', 'Human (0)', 'Cyborg (1)', 'Robot (2)', 'Spammer (3)']

# Print the table and write the output to a file
print(tabulate(data, headers=headers))
#print(tabulate(data, headers=headers, tablefmt="pipe"))
with open('HW4-5_table.txt', 'w') as f:
    f.write(tabulate(data, headers=headers))

              Human (0)     Cyborg (1)    Robot (2)    Spammer (3)
------------  ------------  ------------  -----------  -------------
HW4.5 Part A
Cluster 0     2.26% (17)    0.00% (0)     9.26% (5)    62.14% (64)
Cluster 1     97.61% (734)  3.30% (3)     16.67% (9)   33.98% (35)
Cluster 2     0.00% (0)     56.04% (51)   3.70% (2)    0.00% (0)
Cluster 3     0.13% (1)     40.66% (37)   70.37% (38)  3.88% (4)

HW4.5 Part B
Cluster 0     99.87% (751)  3.30% (3)     25.93% (14)  96.12% (99)
Cluster 1     0.13% (1)     96.70% (88)   74.07% (40)  3.88% (4)

HW4.5 Part C
Cluster 0     78.19% (588)  1.10% (1)     0.00% (0)    24.27% (25)
Cluster 1     0.27% (2)     0.00% (0)     3.70% (2)    55.34% (57)
Cluster 2     21.41% (161)  2.20% (2)     24.07% (13)  16.50% (17)
Cluster 3     0.13% (1)     96.70% (88)   72.22% (39)  3.88% (4)

HW4.5 Part D
Cluster 0     99.60% (749)  3.30% (3)     25.93% (14)  36.89% (38)
Cluster 1     0.00% (0)     56.04% (51)   0.00% (0)    0.00% (0)
Cluster 2     0