# HW4 DATASCI W261: Machine Learning at Scale 

* **Name:**  Megan Jasek
* **Email:**  meganjasek@ischool.berkeley.edu
* **Class Name:**  W261-2
* **Week Number:**  4
* **Date:**  6/7/16

### HW 4.0

What is MrJob? How is it different from Hadoop MapReduce? What are the mapper_init(), mapper_final(), combiner_final(), and reducer_final() methods? When are they called?  
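
As a rough illustration of when the init and final hooks run, the sketch below simulates a single mapper task in plain Python. It does not use mrjob: `InMapperWordCount` and `run_mapper_task` are hypothetical names that only mimic the call order (init once before any input, the mapper once per line, final once after the last line). The same ordering applies to the combiner and reducer hooks.

```python
from collections import defaultdict


class InMapperWordCount(object):
    """Plain-Python stand-in for a job class (hypothetical; not an MRJob subclass)."""

    def mapper_init(self):
        # Runs once, before the task sees any input: set up per-task state.
        self.counts = defaultdict(int)

    def mapper(self, _, line):
        # Runs once per input line: accumulate locally instead of emitting.
        for word in line.split():
            self.counts[word] += 1

    def mapper_final(self):
        # Runs once, after the last input line: emit the accumulated totals.
        for word in sorted(self.counts):
            yield word, self.counts[word]


def run_mapper_task(job, lines):
    """Hypothetical driver that mimics the call order for one mapper task."""
    job.mapper_init()
    emitted = []
    for line in lines:
        result = job.mapper(None, line)
        if result is not None:
            emitted.extend(result)
    emitted.extend(job.mapper_final())
    return emitted


pairs = run_mapper_task(InMapperWordCount(), ["a b a", "b c"])
print(pairs)
```

Holding counts in memory until the final hook is the usual reason to use these methods: the task emits one record per distinct word rather than one per occurrence.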

### HW 4.1

What is serialization in the context of MrJob or Hadoop? When is it used in these frameworks? What is the default serialization mode for inputs and outputs in MrJob?
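
To make the idea concrete, here is a minimal sketch of the line format used when keys and values are serialized as JSON: a JSON-encoded key, a tab, then a JSON-encoded value. To my understanding this matches mrjob's default for intermediate and final records, while input lines are read as raw text. The helper names `encode_json_line` and `decode_json_line` are made up for this example and are not part of mrjob.

```python
import json


def encode_json_line(key, value):
    # Serialize a (key, value) pair as "<json key>\t<json value>".
    return json.dumps(key) + "\t" + json.dumps(value)


def decode_json_line(line):
    # Split on the first tab only, then decode each half as JSON.
    raw_key, raw_value = line.split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


line = encode_json_line("1008", 10836)
print(line)
print(decode_json_line(line))
```

Because every record passes through text, anything a step yields has to be JSON-serializable under this scheme: strings, numbers, lists and dicts are fine, while a tuple comes back as a list.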

### HW 4.2

Recall the Microsoft logfiles data from the async lecture. The logfiles are described and located at:
https://kdd.ics.uci.edu/databases/msweb/msweb.html http://archive.ics.uci.edu/ml/machine-learning-databases/anonymous/
This dataset records which areas (Vroots) of www.microsoft.com each user visited in a one-week timeframe in February 1998.
Here, you must preprocess the data on a single node (i.e., not on a cluster of nodes) from the format:  
C,"10001",10001 #Visitor id 10001  
V,1000,1 #Visit by Visitor 10001 to page id 1000  
V,1001,1 #Visit by Visitor 10001 to page id 1001  
V,1002,1 #Visit by Visitor 10001 to page id 1002  
C,"10002",10002 #Visitor id 10002  
V  
Note: # denotes comments  
to the format:  
V,1000,1,C, 10001  
V,1001,1,C, 10001  
V,1002,1,C, 10001  

Write the Python code to accomplish this.

In [2]:
'''
    This code reads the input file (infile), converts the data in it and writes the conversion
    to the output file (outfile).  The data is converted from writing customer information on
    a single line to writing customer information on the line corresponding to its associated visit
    as per the instructions for HW4.2 above.
'''
infile = "anonymous-msweb.data"
outfile = "anonymous-msweb_converted.data"
with open(infile, 'r') as rf, open(outfile, 'w') as wf:
    for line in rf:
        # Split the lines on commas
        items = line.split(',')
        # If the line is a customer line, then save the customer ID for later use
        # Write the line to the output file
        if items[0] == 'C':
            cust_str = items[2]
            wf.write(line)
        # If the line is a visit line, then concatenate, the original line with the
        # current customer information (the current value of cust_str) and write
        # it to the output file
        elif items[0] == 'V':
            wf.write('%s,C,%s' % (line.strip(), cust_str))
        # All other lines write directly to the output file as is
        else:
            wf.write(line)
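
The same transformation can be checked on a few in-memory lines. This is only a sketch: `attach_visitor_ids` is a hypothetical helper, and unlike the cell above it returns just the converted visit lines and drops the customer and header lines.

```python
def attach_visitor_ids(lines):
    """Append the most recent visitor id to each visit (V) line."""
    current_id = None
    converted = []
    for line in lines:
        fields = line.strip().split(',')
        if fields[0] == 'C':
            # Remember the visitor id for the visit lines that follow.
            current_id = fields[2]
        elif fields[0] == 'V' and len(fields) >= 3:
            converted.append('%s,C,%s' % (line.strip(), current_id))
    return converted


raw = [
    'C,"10001",10001',
    'V,1000,1',
    'V,1001,1',
    'C,"10002",10002',
    'V,1002,1',
]
for row in attach_visitor_ids(raw):
    print(row)
```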

### HW 4.3

Find the 5 most frequently visited pages using MrJob from the output of 4.2 (i.e., transformed log file).

In [86]:
%%writefile MostFrequentVisits.py

from mrjob.job import MRJob
#from mrjob.step import MRJobStep
from mrjob.step import MRStep
import csv

def csv_readline(line):
    """Given a string CSV line, return a list of strings."""
    for row in csv.reader([line]):
        return row

#            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
#            'mapred.text.key.comparator.options': '-k2,2nr',

class MRMostFrequentVisits(MRJob):

    def mapper_count_visits(self, _, line):
        record = csv_readline(line)
        if record[0] == 'V':
            yield record[1], 1
    
    def reducer_sum_visits(self, page_id, counts):
        yield page_id, sum(counts)
    
    def reducer_sort_visits(self, page_id, counts):
        yield page_id, sum(counts)
        
    def steps(self):
        JOBCONF_STEP2 = {        
            'mapreduce.job.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
            'stream.num.map.output.key.field': 2,
            'stream.map.output.field.separator':',',
            'mapreduce.partition.keycomparator.options': '-k2,2nr -k1,1',
            'mapreduce.job.reduces': '1'
        }
        return [
            MRStep(mapper=self.mapper_count_visits,   # STEP 1:  count the visits
                   reducer=self.reducer_sum_visits),
            MRStep(jobconf=JOBCONF_STEP2,
                    reducer=self.reducer_sort_visits)  # STEP 2:  sort the visits
        ]
    
if __name__ == '__main__':
    MRMostFrequentVisits.run()

Overwriting MostFrequentVisits.py
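
For a quick single-node cross-check of what the two steps compute, the sketch below counts visits per page and sorts by count descending, then page id ascending, which is the ordering that `-k2,2nr -k1,1` asks for. `top_pages` and the sample lines are assumptions made for this example; it is not a replacement for the MrJob job.

```python
from collections import Counter


def top_pages(lines, n=5):
    """Return the n most visited (page_id, count) pairs from converted log lines."""
    counts = Counter()
    for line in lines:
        fields = line.strip().split(',')
        if fields[0] == 'V':
            counts[fields[1]] += 1
    # Sort by count (descending), breaking ties by page id (ascending).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


sample = [
    'V,1000,1,C,10001',
    'V,1001,1,C,10001',
    'C,"10002",10002',
    'V,1000,1,C,10002',
    'V,1002,1,C,10002',
    'V,1000,1,C,10003',
    'V,1001,1,C,10003',
]
print(top_pages(sample, n=2))
```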


In [87]:
# There is a known bug, that step-level jobconf does not work in local and inline modes
#!python MostFrequentVisits.py anonymous-msweb_converted.data
# The job must be run with args '-r hadoop' to enable step-level jobconf
!python MostFrequentVisits.py -r hadoop anonymous-msweb_converted.data

No configs found; falling back on auto-configuration
Creating temp directory /tmp/MostFrequentVisits.hadoop.20160606.070103.450541
Looking for hadoop binary in /usr/local/hadoop/bin...
Found hadoop binary: /usr/local/hadoop/bin/hadoop
Using Hadoop version 2.7.1
Copying local files to hdfs:///user/hadoop/tmp/mrjob/MostFrequentVisits.hadoop.20160606.070103.450541/files/...
Looking for Hadoop streaming jar in /usr/local/hadoop...
Found Hadoop streaming jar: /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar
Running step 1 of 2...
  Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  packageJobJar: [/tmp/hadoop-unjar140944324081476772/] [] /tmp/streamjob2687987219726866806.jar tmpDir=null
  Connecting to ResourceManager at master/50.97.205.254:8032
  Connecting to ResourceManager at master/50.97.205.254:8032
  Total input paths to process : 1
  number of splits:2
  Submitting tokens for job: job_1463787494457_0332
  Submi

In [17]:
from MostFrequentVisits import MRMostFrequentVisits
# There is a known bug, that step-level jobconf does not work in local and inline modes
#mr_job = MRMostFrequentVisits (args=['anonymous-msweb_converted.data'])
# The job must be run with args '-r', 'hadoop' to enable step-level jobconf
mr_job = MRMostFrequentVisits (args=['anonymous-msweb_converted.data', '-r', 'hadoop'])
with mr_job.make_runner() as runner:
    runner.run()
    # stream_output and print each line of the output
    for counter, line in enumerate(runner.stream_output()):
        if counter < 50:
            print mr_job.parse_output_line(line)
        else:
            break

ERROR:mrjob.fs.hadoop:STDERR: 16/06/05 14:32:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable



(u'1008', 10836)
(u'1034', 9383)
(u'1004', 8463)
(u'1018', 5330)
(u'1017', 5108)
(u'1009', 4628)
(u'1001', 4451)
(u'1026', 3220)
(u'1003', 2968)
(u'1025', 2123)
(u'1035', 1791)
(u'1040', 1506)
(u'1041', 1500)
(u'1032', 1446)
(u'1037', 1160)
(u'1030', 1115)
(u'1038', 1110)
(u'1020', 1087)
(u'1000', 912)
(u'1007', 865)
(u'1052', 842)
(u'1036', 759)
(u'1002', 749)
(u'1014', 728)
(u'1295', 716)
(u'1010', 698)
(u'1058', 672)
(u'1053', 670)
(u'1046', 636)
(u'1070', 602)
(u'1074', 584)
(u'1031', 574)
(u'1067', 548)
(u'1024', 521)
(u'1027', 507)
(u'1045', 474)
(u'1078', 462)
(u'1076', 444)
(u'1075', 396)
(u'1130', 395)
(u'1060', 391)
(u'1021', 380)
(u'1123', 372)
(u'1119', 365)
(u'1039', 345)
(u'1049', 343)
(u'1054', 338)
(u'1022', 325)
(u'1064', 324)
(u'1065', 323)


### HW 4.4

Find the most frequent visitor of each page using MrJob and the output of 4.2 (i.e., transformed log file). In this output please include the webpage URL, webpageID and Visitor ID.

In [16]:
%%writefile MostFrequentVisitors.py

from mrjob.job import MRJob
from mrjob.step import MRStep
import csv

def csv_readline(line):
    """Given a string CSV line, return a list of strings."""
    for row in csv.reader([line]):
        return row

class MRMostFrequentVisitors(MRJob):
    reducer_current_pageid = ""
    vroots = {}
    #SORT_VALUES = True
    
    def mapper_count_visits(self, _, line):
        record = csv_readline(line)
        if record[0] == 'I':
            self.vroots['0'] = record[2]
        elif record[0] == 'A':
            page_id = record[1]
            vroot = record[4]
            self.vroots[page_id] = vroot
        elif record[0] == 'V':
            page_id = record[1]
            visitor_id = record[4]
            page_visitor_pair = ('(%s.%s)' % (page_id, visitor_id))
            yield page_visitor_pair, 1
        
    def reducer_sum_visits(self, page_visitor_pairs, counts):
        yield page_visitor_pairs, sum(counts)
    
    def reducer_sort_visits(self, page_visitor_pairs, counts):
        page_id, visitor_id = page_visitor_pairs.strip('()').split('.', 2)
        #print(page_id)
        #print(visitor_id)
        if page_id != self.reducer_current_pageid:
            self.reducer_current_pageid = page_id
            #total_visits = sum(counts)
            #output_str_1 = ('URL: %s%s, Page ID: %s, Visitor ID: %s' % 
            #              (self.vroots['0'], self.vroots[page_id], page_id, visitor_id))
            #output_str_2 = ('# Page Visits: %d' % (total_visits))
            #yield output_str_1, output_str_2
            yield page_visitor_pairs, sum(counts)

    def steps(self):
        JOBCONF_STEP2 = {        
            'mapreduce.job.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
            'stream.num.map.output.key.field': 2,
            'stream.map.output.field.separator':',',
            'mapreduce.partition.keycomparator.options': '-k2,2nr -k1,1',
            'mapreduce.job.reduces': '1'
        }
        return [
            MRStep(mapper=self.mapper_count_visits,   # STEP 1:  count the visits
                   reducer=self.reducer_sum_visits),
            MRStep(jobconf=JOBCONF_STEP2,
                   reducer=self.reducer_sort_visits)  # STEP 2:  sort the visits
        ]
                
if __name__ == '__main__':
    MRMostFrequentVisitors.run()

Overwriting MostFrequentVisitors.py
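
The intended logic can be sketched on a single node: count each (page, visitor) pair, then keep the visitor with the highest count for every page. `most_frequent_visitor` and the sample lines are hypothetical, ties go to the smallest visitor id string, and the URL lookup from the `A` lines is left out to keep the example short.

```python
from collections import Counter


def most_frequent_visitor(lines):
    """Map each page id to its (visitor_id, visit_count) with the highest count."""
    pair_counts = Counter()
    for line in lines:
        fields = line.strip().split(',')
        if fields[0] == 'V' and len(fields) >= 5:
            pair_counts[(fields[1], fields[4].strip())] += 1
    best = {}
    # Sorted iteration makes tie-breaking deterministic (smallest visitor id wins).
    for (page_id, visitor_id), count in sorted(pair_counts.items()):
        if page_id not in best or count > best[page_id][1]:
            best[page_id] = (visitor_id, count)
    return best


sample = [
    'V,1000,1,C,10001',
    'V,1000,1,C,10001',
    'V,1000,1,C,10002',
    'V,1001,1,C,10002',
]
print(most_frequent_visitor(sample))
```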


In [19]:
##### forget about sorting right now.  I think maybe the comma is messing it up because I have 2 commas there
##### try getting it working without sorting then add that back in fixing the comma issue
##### without sorting you can go back to local mode and things will run faster.

# The job must be run with args '-r hadoop' to enable step-level jobconf
#!python MostFrequentVisitors.py anonymous-msweb_converted_small.data
!python MostFrequentVisitors.py -r hadoop anonymous-msweb_converted_small.data

No configs found; falling back on auto-configuration
Creating temp directory /tmp/MostFrequentVisitors.hadoop.20160606.073016.889574
Looking for hadoop binary in /usr/local/hadoop/bin...
Found hadoop binary: /usr/local/hadoop/bin/hadoop
Using Hadoop version 2.7.1
Copying local files to hdfs:///user/hadoop/tmp/mrjob/MostFrequentVisitors.hadoop.20160606.073016.889574/files/...
Looking for Hadoop streaming jar in /usr/local/hadoop...
Found Hadoop streaming jar: /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar
Running step 1 of 2...
  Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  packageJobJar: [/tmp/hadoop-unjar4260441000237539642/] [] /tmp/streamjob3705776716557138787.jar tmpDir=null
  Connecting to ResourceManager at master/50.97.205.254:8032
  Connecting to ResourceManager at master/50.97.205.254:8032
  Total input paths to process : 1
  number of splits:2
  Submitting tokens for job: job_1463787494457_0342
  

In [14]:
from MostFrequentVisitors import MRMostFrequentVisitors
# The job must be run with args '-r', 'hadoop' to enable step-level jobconf
mr_job = MRMostFrequentVisitors (args=['anonymous-msweb_converted_small.data', '-r', 'hadoop'])
with mr_job.make_runner() as runner:
    runner.run()
    counter = 0
    # stream_output and print each line of the output
    for line in runner.stream_output():
        if counter < 5:
            print mr_job.parse_output_line(line)
        else:
            break
        counter += 1

ERROR:mrjob.hadoop:  Job not successful!
ERROR:mrjob.fs.hadoop:STDERR: 16/06/05 14:30:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

ERROR:mrjob.fs.hadoop:STDERR: 16/06/05 14:30:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
ERROR:mrjob.fs.hadoop:STDERR: ls: `hdfs:///tmp/hadoop-yarn/staging/userlogs/application_1463787494457_0316': No such file or directory


IOError: Could not check path hdfs:///tmp/hadoop-yarn/staging/userlogs/application_1463787494457_0316

### HW 4.5 Clustering Tweet Dataset

Here you will use a different dataset consisting of word-frequency distributions for 1,000 Twitter users. These Twitter users use language in very different ways, and were classified by hand according to the criteria:  
* 0: Human, where only basic human-human communication is observed.
* 1: Cyborg, where language is primarily borrowed from other sources (e.g., jobs listings, classifieds postings, advertisements, etc...).
* 2: Robot, where language is formulaically derived from unrelated sources (e.g., weather/seismology, police/fire event logs, etc...).
* 3: Spammer, where language is replicated to high multiplicity (e.g., celebrity obsessions, personal promotion, etc... )  

Check out the preprints of recent research, which spawned this dataset:  
http://arxiv.org/abs/1505.04342 http://arxiv.org/abs/1508.01843  

The main data lie in the accompanying file:  topUsers_Apr-Jul_2014_1000-words.txt  
and are of the form:  
USERID,CODE,TOTAL,WORD1_COUNT,WORD2_COUNT,...  
where  
* USERID = unique user identifier
* CODE = 0/1/2/3 class code
* TOTAL = sum of the word counts  

Using this data, you will implement a 1000-dimensional K-means algorithm in MrJob on the users by their 1000-dimensional word stripes/vectors using several centroid initializations and values of K.  Note that each "point" is a user as represented by 1000 words, and that word-frequency distributions are generally heavy-tailed power-laws (often called Zipf distributions), and are very rare in the larger class of discrete, random distributions. For each user you will have to normalize by its "TOTAL" column. Try several parameterizations and initializations:  
* (A) K=4 uniform random centroid-distributions over the 1000 words (generate 1000 random numbers and normalize the vectors)
* (B) K=2 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution
* (C) K=4 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution
* (D) K=4 "trained" centroids, determined by the sums across the classes. Use the (row-normalized) class-level aggregates as 'trained' starting centroids (i.e., the training is already done for you!).  
Note that you do not have to compute the aggregated distribution or the class-aggregated distributions, which are rows in the auxiliary file: topUsers_Apr-Jul_2014_1000-words_summaries.txt  
* Row 1: Words
* Row 2: Aggregated distribution across all classes
* Rows 3-6: Class-aggregated distributions for classes 0-3  
For (A), we select 4 users randomly from a uniform distribution [1,...,1,000].  For (B), (C), and (D) you will have to use data from the auxiliary file:  topUsers_Apr-Jul_2014_1000-words_summaries.txt  

This file contains 5 special word-frequency distributions:  
(1) The 1000-user-wide aggregate, which you will perturb for initializations in parts (B) and (C), and  
(2-5) The 4 class-level aggregates for each of the user-type classes (0/1/2/3)  
In parts (B) and (C), you will have to perturb the 1000-user aggregate (after initially normalizing by its sum, which is also provided). So if in (B) you want to create 2 perturbations of the aggregate, start with (1), normalize, and generate 1000 random numbers uniformly from the unit interval (0,1) twice (for two centroids), using:  
from numpy import random  
numbers = random.sample(1000)  
Take these 1000 numbers and add them (component-wise) to the 1000-user aggregate, and then renormalize to obtain one of your aggregate-perturbed initial centroids.  
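
The perturbation recipe above can be sketched as follows, using a tiny 4-word aggregate in place of the real 1000-word row. `perturbed_centroids` and its `seed` parameter are assumptions made for this example. The noise is added at full scale, as the text describes, whereas the notebook's `centroid_init_BC` divides it by 10.

```python
import numpy as np


def perturbed_centroids(aggregate_counts, k, seed=0):
    """Return k centroids: normalized aggregate + uniform noise, renormalized."""
    rng = np.random.RandomState(seed)
    base = np.asarray(aggregate_counts, dtype=float)
    base = base / base.sum()          # normalize the aggregate to sum to 1
    centroids = []
    for _ in range(k):
        noisy = base + rng.random_sample(base.shape[0])  # uniform on [0, 1)
        centroids.append(noisy / noisy.sum())            # renormalize
    return np.array(centroids)


toy = perturbed_centroids([5, 3, 1, 1], k=2)
print(toy.shape)
print(toy.sum(axis=1))
```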

For experiments A, B, C and D, iterate until a threshold (try 0.001) is reached. After convergence, print out a summary of the classes present in each cluster. In particular, report the composition as measured by the total portion of each class type (0-3) contained in each cluster, and discuss your findings and any differences in outcomes across parts A-D.

In [10]:
# Create a normalized version of the data topUsers_Apr-Jul_2014_1000-words.txt file
infile = "topUsers_Apr-Jul_2014_1000-words.txt"
outfile = "topUsers_Apr-Jul_2014_1000-words_normalized.txt"
with open(infile, 'r') as rf, open(outfile, 'w') as wf:
    for line in rf.readlines():
        splt = line.strip().split(',')
        total = float(splt[2])
        wf.write('%s,%s,%s,' % (splt[0], splt[1], splt[2]))
        for i in range(3,len(splt)-1):
            wf.write('%f,' % (float(splt[i])/total))
        wf.write('%f' % (float(splt[len(splt)-1])/total))
        wf.write('\n')

In [None]:
# Create a normalized version of the data topUsers_Apr-Jul_2014_1000-words.txt file
# Exclude the first 3 columns of the data.
infile = "topUsers_Apr-Jul_2014_1000-words.txt"
outfile = "topUsers_Apr-Jul_2014_1000-words_normalized_only.txt"
with open(infile, 'r') as rf, open(outfile, 'w') as wf:
    for line in rf.readlines():
        splt = line.strip().split(',')
        total = float(splt[2])
        for i in range(3,len(splt)-1):
            wf.write('%f,' % (float(splt[i])/total))
        wf.write('%f' % (float(splt[len(splt)-1])/total))
        wf.write('\n')

In [6]:
# Create a normalized version of the topUsers_Apr-Jul_2014_1000-words_summaries.txt file
# Exclude the entire first row and the first 3 columns of the data.
infile = "topUsers_Apr-Jul_2014_1000-words_summaries.txt"
outfile = "topUsers_Apr-Jul_2014_1000-words_summaries_normalized_only.txt"
with open(infile, 'r') as rf, open(outfile, 'w') as wf:
    counter = 0
    for line in rf.readlines():
        if counter != 0:
            splt = line.strip().split(',')
            total = float(splt[2])
            for i in range(3,len(splt)-1):
                wf.write('%f,' % (float(splt[i])/total))
            wf.write('%f' % (float(splt[len(splt)-1])/total))
            wf.write('\n')
        counter += 1

In [12]:
%%writefile Kmeans.py
from numpy import argmin, array, random
from mrjob.job import MRJob
from mrjob.step import MRStep
from itertools import chain
import os

# Find the nearest centroid for a data point
def MinDist(datapoint, centroid_points):
    datapoint = array(datapoint)
    #print(datapoint.shape)
    centroid_points = array(centroid_points)
    #print(centroid_points.shape)
    diff = datapoint - centroid_points 
    diffsq = diff*diff
    # Get the nearest centroid for each instance
    minidx = argmin(list(diffsq.sum(axis = 1)))
    return minidx

#Check whether centroids converge
def stop_criterion(centroid_points_old, centroid_points_new,T):
    oldvalue = list(chain(*centroid_points_old))
    newvalue = list(chain(*centroid_points_new))
    Diff = [abs(x-y) for x, y in zip(oldvalue, newvalue)]
    Flag = True
    for i in Diff:
        if(i>T):
            Flag = False
            break
    return Flag

class MRKmeans(MRJob):
    centroid_points=[]
    k=4
    def steps(self):
        return [
            MRStep(mapper_init = self.mapper_init, mapper=self.mapper,combiner = self.combiner,reducer=self.reducer)
               ]
    #load centroids info from file
    def mapper_init(self):
        print "Current path:", os.path.dirname(os.path.realpath(__file__))
        
        self.centroid_points = [map(float,s.split('\n')[0].split(',')) for s in open("Centroids.txt").readlines()]
        # This is the line that breaks things with multiple mappers
        #open('Centroids.txt', 'w').close()
        
        #print "Centroids: ", self.centroid_points
        
    #load data and output the nearest centroid index and data point 
    def mapper(self, _, line):
        D = (map(float,line.split(',')))
        yield int(MinDist(D,self.centroid_points)), (D,1)
    
    #Combine sum of data points locally
    def combiner(self, idx, inputdata):
        num = 0
        # ?? see if you can parameterize the 1000
        sumD = [0.0]*1000
        for D,n in inputdata:
            num += n
            sumD = [x + y for x, y in zip(sumD,D)]
        yield idx,(sumD,num)
        
    #Aggregate sum for each cluster and then calculate the new centroids
    def reducer(self, idx, inputdata): 
        # DOES THIS GET MESSED UP WITH MULTIPLE REDUCERS??
        #open('Centroids.txt', 'w').close()
        centroids = []
        num = [0]*self.k 
        # ?? see if you can parameterize the 1000
        for i in range(self.k):
            centroids.append([0.0]*1000)
        for D, n in inputdata:
            num[idx] = num[idx] + n
            centroids[idx] = [x + y for x, y in zip(centroids[idx],D)]
        centroids[idx] = [i / float(num[idx]) for i in centroids[idx]]
        
        #print 'centroids updates:', centroids
        
        with open('Centroids.txt', 'w') as f:
            #f.writelines(str(centroids[idx][0]) + ',' + str(centroids[idx][1]) + '\n')
            #f.writelines(','.join(str(j) for j in centroids[idx]) + '\n')
            f.writelines(','.join(str(j) for j in i) + '\n' for i in centroids)
        yield idx, centroids[idx]
      

if __name__ == '__main__':
    MRKmeans.run()

Overwriting Kmeans.py
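
For intuition, one K-means iteration can be written on a single node with NumPy. This sketch mirrors the job's structure (assign each point to its nearest centroid, then average the members of each cluster) and the element-wise threshold used by `stop_criterion`. The names `kmeans_step` and `converged` and the toy 2-D data are assumptions for illustration. A cluster with no members keeps its previous centroid.

```python
import numpy as np


def kmeans_step(points, centroids):
    """Assign points to the nearest centroid and recompute the centroids."""
    # Squared Euclidean distance from every point to every centroid: shape (n, k).
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        members = points[labels == j]
        if len(members):
            new_centroids[j] = members.mean(axis=0)
    return labels, new_centroids


def converged(old, new, tol=0.001):
    """True when no centroid coordinate moved by more than tol."""
    return bool(np.all(np.abs(old - new) <= tol))


points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
start = np.array([[0.0, 0.0], [10.0, 10.0]])
labels, first = kmeans_step(points, start)
_, second = kmeans_step(points, first)
print(labels, first, converged(first, second))
```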


In [11]:
import re
from numpy import random

# Generate initial centroids for part A
def centroid_init_A(k, filename):
    centroid_points = []
    with open(filename, 'r') as f:
        lines = f.readlines()
        for i in range(k):
            # numpy's randint excludes the upper bound, so this covers every line
            rint = random.randint(0, len(lines))
            #rint = random.randint(0, 9)
            rline = lines[rint].strip().split(',')
            centroid_points.append([float(s) for s in rline[3:len(rline)]])
    with open('Centroids.txt', 'w+') as f:
        f.writelines(','.join(str(j) for j in i) + '\n' for i in centroid_points)
    return centroid_points

###################################################################################
## Generate random initial centroids around the global aggregate
## Part (B) and (C) of this question
###################################################################################
def centroid_init_BC(k):
    counter = 0
    for line in open("topUsers_Apr-Jul_2014_1000-words_summaries.txt").readlines():
        # Note correction from Kevin from Boulder
        if counter == 1:        
            data = re.split(",",line)
            globalAggregate = [float(data[i+3])/float(data[2]) for i in range(1000)]
        counter += 1
    ## perturb the global aggregate for each of the k initializations
    centroids = []
    for i in range(k):
        rndpoints = random.sample(1000)
        peturpoints = [rndpoints[n]/10+globalAggregate[n] for n in range(1000)]
        centroids.append(peturpoints)
        total = 0
        for j in range(len(centroids[i])):
            total += centroids[i][j]
        for j in range(len(centroids[i])):
            centroids[i][j] = centroids[i][j]/total
    with open('Centroids.txt', 'w+') as f:
        f.writelines(','.join(str(j) for j in i) + '\n' for i in centroids)
    return centroids

def centroid_init_D(k, filename):
    centroid_points = []
    with open(filename, 'r') as f:
        lines = f.readlines()
        for i in range(1,k+1):
            rline = lines[i].strip().split(',')
            centroid_points.append([float(s) for s in rline])
    with open('Centroids.txt', 'w+') as f:
        f.writelines(','.join(str(j) for j in i) + '\n' for i in centroid_points)
    return centroid_points


In [15]:
%reload_ext autoreload
%autoreload 2
from numpy import random
from Kmeans import MRKmeans, stop_criterion, MinDist
mr_job = MRKmeans(args=['topUsers_Apr-Jul_2014_1000-words_normalized_only.txt', '--file=Centroids.txt'])
## how do I pass arguments to this??,  I need to pass k

def write_centroids (centroids, hw, fnum, iter_num):
    filename = 'centroid_results_' + hw + str(fnum) + '.txt'
    with open(filename, 'w+') as f:
        f.write('Number of Iterations: %d\n' % (iter_num))
        f.writelines(','.join(str(j) for j in i) + '\n' for i in centroids)

HW_PART = 'D'
k = 4
if HW_PART == 'A':
    centroid_points = centroid_init_A(k, 'topUsers_Apr-Jul_2014_1000-words_normalized.txt')
elif HW_PART == 'B':
    # Part B is the only part with k=2
    k = 2
    centroid_points = centroid_init_BC(k)
elif HW_PART == 'C':
    centroid_points = centroid_init_BC(k)
elif HW_PART == 'D':
    centroid_points = centroid_init_D(k, 'topUsers_Apr-Jul_2014_1000-words_summaries_normalized_only.txt')
    
# Update centroids iteratively
i = 0
while(1):
#while(i<=2):
    # save previous centroids to check convergence
    centroid_points_old = centroid_points[:]
    print "iteration"+str(i)+":"
    with mr_job.make_runner() as runner: 
        runner.run()
        # stream_output: get access of the output 
        for line in runner.stream_output():
            key,value =  mr_job.parse_output_line(line)
            print('key: %d, len: %d' % (key, len(value)))
            #print key, value
            centroid_points[key] = value
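    # Persist the updated centroids so the next iteration's mappers load them
    with open('Centroids.txt', 'w') as f:
        f.writelines(','.join(str(v) for v in c) + '\n' for c in centroid_points)
    print "\n"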
    i += 1
    if(stop_criterion(centroid_points_old,centroid_points,0.001)):
        break

#print "Centroids\n"
#for j in range(len(centroid_points)):
#    print centroid_points[j]
#    print

# Write the centroid_points to a file
write_centroids(centroid_points, HW_PART, 2, i)

iteration0:
Current path: /home/hadoop/w261-Assignments/hw4
Current path: /home/hadoop/w261-Assignments/hw4
key: 2, len: 1000
key: 3, len: 1000
key: 0, len: 1000
key: 1, len: 1000


iteration1:
Current path: /home/hadoop/w261-Assignments/hw4
Current path: /home/hadoop/w261-Assignments/hw4
key: 0, len: 1000
key: 3, len: 1000


iteration2:
Current path: /home/hadoop/w261-Assignments/hw4
Current path: /home/hadoop/w261-Assignments/hw4
key: 0, len: 1000
key: 3, len: 1000


iteration3:
Current path: /home/hadoop/w261-Assignments/hw4
Current path: /home/hadoop/w261-Assignments/hw4
key: 0, len: 1000
key: 3, len: 1000


iteration4:
Current path: /home/hadoop/w261-Assignments/hw4
Current path: /home/hadoop/w261-Assignments/hw4
key: 0, len: 1000
key: 3, len: 1000


iteration5:
Current path: /home/hadoop/w261-Assignments/hw4
Current path: /home/hadoop/w261-Assignments/hw4
key: 0, len: 1000
key: 3, len: 1000


iteration6:
Current path: /home/hadoop/w261-Assignments/hw4
Current path: /home/hadoop/w

In [20]:
from Kmeans import MinDist

# Reads centroids from a file called filename and returns them as a list of lists
def read_centroids (filename):
    with open(filename) as f:
        lines = f.readlines()
        num_iter = lines[0].strip().split()[3]
        centroids = [map(float,s.split('\n')[0].split(',')) for s in lines[1:]]
    return centroids, num_iter

centroid_points, num_iter = read_centroids('centroid_results_D2.txt')

# Summarize the class composition of each cluster
# Initialize a results array
total_codes = [752.0,91.0,54.0,103.0]
results_a = []
for i in range(len(centroid_points)):
    results_a.append([0]*len(total_codes))
with open('topUsers_Apr-Jul_2014_1000-words_normalized.txt', 'r') as f:
    for line in f.readlines():
        rline = line.strip().split(',')
        userid = rline[0]
        code = int(rline[1])
        D = [float(s) for s in rline[3:len(rline)]]
        cluster = MinDist(D, centroid_points)
        results_a[cluster][code] += 1

results_b = []
print('Number of Iterations: %s' % (num_iter))
for i in range(len(results_a)):
    results_b.append([x / y for x, y in zip(results_a[i],total_codes)])
    print('Cluster %d: ' % (i))
    print(results_a[i])
    print(results_b[i])


Number of Iterations: 9
Cluster 0: 
[1, 0, 12, 1]
[0.0013297872340425532, 0.0, 0.2222222222222222, 0.009708737864077669]
Cluster 1: 
[0, 52, 2, 0]
[0.0, 0.5714285714285714, 0.037037037037037035, 0.0]
Cluster 2: 
[1, 36, 38, 4]
[0.0013297872340425532, 0.3956043956043956, 0.7037037037037037, 0.038834951456310676]
Cluster 3: 
[750, 3, 2, 98]
[0.9973404255319149, 0.03296703296703297, 0.037037037037037035, 0.9514563106796117]


In [22]:
#### TEST CELL ####
filename = "topUsers_Apr-Jul_2014_1000-words_normalized.txt"
with open(filename, 'r') as f:
    for line in f.readlines():
        s = line.split(',')
        #print(len(s))

In [27]:
#### TEST CELL ####
l1 = [1, 2, 3, 4]
l2 = [5, 6, 7, 8]
l3 = [x + y for x, y in zip(l1,l2)]
print(l3)

[6, 8, 10, 12]


In [19]:
#### TEST CELL ####
from numpy import argmin, array, random
from itertools import chain
import os
from Kmeans import MRKmeans, stop_criterion

def MinDist(datapoint, centroid_points):
    datapoint = array(datapoint)
    print(datapoint.shape)
    #print(datapoint)
    centroid_points = array(centroid_points)
    print(centroid_points.shape)
    #print(centroid_points)
    diff = datapoint - centroid_points
    print(diff)
    diffsq = diff*diff
    # Get the nearest centroid for each instance
    minidx = argmin(list(diffsq.sum(axis = 1)))
    return minidx

# Generate initial centroids for part A
centroid_points = []
k = 4
with open('topUsers_Apr-Jul_2014_1000-words_normalized.txt', 'r') as f:
    lines = f.readlines()
    for i in range(k):
        # numpy's randint excludes the upper bound, so this covers every line
        rint = random.randint(0, len(lines))
        rline = lines[rint].strip().split(',')
        centroid_points.append([float(s) for s in rline[3:len(rline)]])
    rline = lines[0].strip().split(',')
    datapoint = [float(s) for s in rline[3:len(rline)]]

minidx = MinDist(datapoint, centroid_points)
print(minidx)

(1000,)
(4, 1000)
[[ -4.82270000e-02  -1.43480000e-02   3.33740000e-02 ...,   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  2.67210000e-02  -4.06280000e-02  -7.14600000e-03 ...,   0.00000000e+00
   -8.30000000e-05  -8.30000000e-05]
 [  4.30300000e-02  -5.04220000e-02   8.33900000e-03 ...,   0.00000000e+00
   -6.70000000e-04  -1.51000000e-04]
 [ -8.89080000e-02  -4.29300000e-03   3.34010000e-02 ...,   0.00000000e+00
    0.00000000e+00   0.00000000e+00]]
1


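In [None]:
#### TEST CELL ####
# Check that a '(page_id,visitor_id)' key string splits back into its two parts
page_visitor_pairs = '(%s,%s)' % ('1000', '10001')
page_id, visitor_id = page_visitor_pairs.strip('()').split(',', 1)
print(page_id)
print(visitor_id)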
