#DATASCI W261: Machine Learning at Scale

Nick Hamlin and Tigi Thomas  
nickhamlin@gmail.com, tgthomas@berkeley.edu   
Time of Submission: 9:23 PM EST, Wednesday, Feb 10, 2016  
W261-3, Spring 2016  
Week 4 Homework

###Submission Notes:
- For each problem, we've included a summary of the question as posed in the instructions.  In many cases, we have omitted the full text to keep the final submission as uncluttered as possible.  For reference, we've included a link to the original instructions in the "Useful References" section below.
- Problem statements are listed in *italics*, while our responses are shown in plain text. 
- We've included the full output of the hadoop jobs in our responses so that counter results are shown.  However, these don't always render nicely into PDF form.  In these situations, please reference [the complete rendered notebook on Github](https://github.com/nickhamlin/mids_261_homework/blob/master/HW3/MIDS-W261-2015-HWK-Week03-Hamlin-Thomas.ipynb)

###Useful References:
- **[Original Assignment Instructions](https://www.dropbox.com/sh/m0nxsf4vs5cyrp2/AACYOZQ3hRyGtHoPt33ny_Pza/HW4-Questions.txt?dl=0)**
- [Most frequent word example in mrjob](http://nbviewer.jupyter.org/urls/dl.dropbox.com/s/nd2wow1t3y77jqk/MrjobMostUsedWord.ipynb)
- [kmeans example in mrjob](http://nbviewer.jupyter.org/urls/dl.dropbox.com/s/5qwejmygaievrzt/MrJobKmeans.ipynb)
- [Microsoft anonymous web data background info](https://kdd.ics.uci.edu/databases/msweb/msweb.html)

##HW4.0.  
*What is MrJob? How is it different to Hadoop MapReduce?* 

MrJob is a convenient, easy-to-use MapReduce library implemented in Python. The MrJob library simplifies writing and running Hadoop Streaming jobs.

With the standard Hadoop MapReduce paradigm using Hadoop Streaming, one has to provide separate mapper and reducer scripts and invoke the streaming job with one such mapper and reducer at a time. Although this provides a great deal of control over the process, pipelining multiple map and reduce steps, or iteratively calling the same MapReduce tasks, becomes very cumbersome. MrJob simplifies this by allowing developers to write one MapReduce program with the mapper and reducer as different methods of a single class. This allows for very convenient testing and debugging, and considerably simplifies the creation and execution of MapReduce job pipelines.

MrJob can be executed even without installing Hadoop, which makes it a convenient platform for prototyping. The same code then runs within a Hadoop setting with no further code changes. MrJob also has extensive integration with Amazon Elastic MapReduce, and the same code can be run on Amazon EMR with just a few configuration settings. For more information, see the [MRJob source code](https://github.com/Yelp/mrjob) and the [corresponding docs](https://pythonhosted.org/mrjob/guides/why-mrjob.html).


*What are the mapper_init(), mapper_final(), combiner_final(), reducer_final() methods? When are they called?*

With MrJob, you write your mapper and reducer as methods of a subclass of MRJob. This script is then invoked once per task by Hadoop Streaming, which starts your script, feeds it stdin, reads stdout, and finally closes it. MrJob then invokes the mapper and reducer methods according to the steps you have defined.

However, it is common to need initialization or finalization code that runs before or after a mapper, combiner, or reducer processes its input. MrJob supports this through the *_init()* and *_final()* methods: mapper_init(), combiner_init(), and reducer_init() run once at the start of the corresponding task, before any input is processed, while mapper_final(), combiner_final(), and reducer_final() run once at the end, after the last input has been processed.

These methods can be used to load support files and/or write out intermediate files during the various map and reduce steps. This allows common files to be shared efficiently within a node while it processes different data chunks.
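
As a rough illustration of the call order, the following plain-Python sketch mimics a single map task without using mrjob itself. The `LineCounter` class and the `run_map_task` driver are hypothetical names invented for this example; only the method names `mapper_init`, `mapper`, and `mapper_final` mirror the hooks described above.

```python
class LineCounter(object):
    """Hypothetical job: counts lines in a task and emits once at the end."""

    def mapper_init(self):
        # Runs once, before the task sees any input.
        self.count = 0

    def mapper(self, _, line):
        # Runs once per input line; this mapper emits nothing per line.
        self.count += 1
        return iter(())

    def mapper_final(self):
        # Runs once, after the last input line has been processed.
        yield 'lines', self.count


def run_map_task(job, lines):
    """Hypothetical stand-in for how a framework drives one map task."""
    output = []
    job.mapper_init()
    for line in lines:
        output.extend(job.mapper(None, line))
    output.extend(job.mapper_final())
    return output


result = run_map_task(LineCounter(), ['a', 'b', 'c'])
print(result)
```

Holding state on `self` between the init, per-record, and final calls is what makes this pattern useful for in-task aggregation or for loading a lookup file once per task.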



##HW4.1

*What is serialization in the context of MrJob or Hadoop?*  

*When is it used in these frameworks?*  

*What is the default serialization mode for input and outputs for MrJob?*  

## HW 4.2

###Problem Statement
Preprocess the data on a single node (i.e., not on a cluster of nodes) from the format:

C,"10001",10001   #Visitor id 10001  
V,1000,1          #Visit by Visitor 10001 to page id 1000  
V,1001,1          #Visit by Visitor 10001 to page id 1001  
V,1002,1          #Visit by Visitor 10001 to page id 1002  
C,"10002",10002   #Visitor id 10002  

to the format:

V,1000,1,C, 10001  
V,1001,1,C, 10001  
V,1002,1,C, 10001

###Implementation
We can solve this problem simply by iterating through the file.  Because the rows are in order, every time we encounter a new visitor, we can save their ID and apply it to each subsequent view record until a new visitor record is reached.  Also, while it's not explicitly asked for in this problem, we'll run a second batch of code to save the clean URL data to its own file since we'll need this data for HW 4.4.
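
The carry-forward idea can be tried on the sample rows from the problem statement before touching the real file. This is only a sketch: `attach_visitor_ids` is a hypothetical helper name, and it works on an in-memory list rather than on `anonymous-msweb.data`.

```python
import csv


def attach_visitor_ids(lines):
    """Yield 'V,<page>,<n>,C,<visitor>' rows from raw C/V records."""
    visitor_id = None
    for row in csv.reader(lines):
        if not row:
            continue
        if row[0] == 'C':
            visitor_id = row[1]  # remember the current visitor
        elif row[0] == 'V' and visitor_id is not None:
            yield ','.join(row[:3] + ['C', visitor_id])


# Sample rows copied from the problem statement above.
sample = [
    'C,"10001",10001',
    'V,1000,1',
    'V,1001,1',
    'V,1002,1',
    'C,"10002",10002',
]
converted = list(attach_visitor_ids(sample))
print(converted)
```

Because the function is a generator, it never needs to hold more than one row in memory, which is the property that makes a single pass over an ordered file sufficient.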

In [None]:
%%writefile convert_msdata.py

from csv import reader
with open('anonymous-msweb.data','rb') as f:
    data=f.readlines()
    
for i in reader(data):
    if i[0]=='C':
        visitor_id=i[1] #Store visitor id
        continue
    if i[0]=='V':
        print i[0]+','+i[1]+','+i[2]+',C,'+visitor_id #Append visitor_id to each pageview

In [None]:
%%writefile create_urls.py

#Save only results from 'A' rows into their own file for easy URL access in the future
from csv import reader
with open('anonymous-msweb.data','rb') as f:
    data=f.readlines()
    
for i in reader(data):
    if i[0]=='A':
        print i[1]+','+i[3]+','+i[4]

In [None]:
#Make files executable, convert data, and view some example results to check that everything worked
#!chmod +x convert_msdata.py create_urls.py
!python convert_msdata.py > clean_msdata.txt
!cat clean_msdata.txt | head -10
!python create_urls.py > ms_urls.txt
!cat ms_urls.txt | head -10

## HW 4.3
*Find the 5 most frequently visited pages using MrJob from the output of 4.2 (i.e., transformed log file).*

In [None]:
%load_ext autoreload
%autoreload 2

In [621]:
%%writefile top_pages.py
"""
This program will take a CSV data file and output tab-separated lines of

    Vroot -> number of visits

To run:

    python top_pages.py anonymous-msweb.data

To store output:

    python top_pages.py anonymous-msweb.data > top_pages.out
"""
import csv

from mrjob.job import MRJob
from mrjob.step import MRStep


def csv_readline(line):
    """Given a string CSV line, return a list of strings."""
    for row in csv.reader([line]):
        return row

class TopPages(MRJob):
    
# Normally, we'd use the shuffle to do the sort.  However, the bug
# in comparators when running local MRJobs makes this an untenable solution
# so we'll settle for doing the sort in the second-stage reducer instead

#     def jobconf(self):
#         orig_jobconf = super(TopPages, self).jobconf()        
#         custom_jobconf = {  #key value pairs
#             'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
#             'mapred.text.key.comparator.options': '-k1,1nr',
#             'mapred.reduce.tasks': '1',
#         }
#         combined_jobconf = orig_jobconf
#         combined_jobconf.update(custom_jobconf)
#         self.jobconf = combined_jobconf
#         return combined_jobconf

    def mapper_extract_views(self, line_no, line):
        """Extracts the Vroot that was visited"""
        cell = csv_readline(line)
        if cell[0] == 'V':
            yield cell[1],1

    def reducer_sum_views(self, vroot, visit_counts):
        """Summarizes the visit counts by adding them together and yields the results"""
        
        total = sum(i for i in visit_counts)
        yield None,(total, vroot)
        
        
    # discard the key; it is just None
    def reducer_find_top_views(self,_, word_count_pairs):
        # each item of word_count_pairs is (count, word),
        # so yielding one results in key=counts, value=word

        output=sorted(word_count_pairs)[-5:]
        output.reverse()
        for i in output:
            yield (i[1],i[0])
        
        
    def steps(self):  #pipeline of Map-Reduce jobs
        return [
            MRStep(mapper=self.mapper_extract_views,       # STEP 1: view count step
                    reducer=self.reducer_sum_views) ,
            MRStep(reducer=self.reducer_find_top_views) # Step 2: sort and return top 5 results
        ]
        
if __name__ == '__main__':
    TopPages.run()

Overwriting top_pages.py


In [None]:
#Make file executable if it's not already
!chmod +x top_pages.py

In [622]:
from top_pages import TopPages
import csv

mr_job = TopPages(args=['clean_msdata.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        print mr_job.parse_output_line(line)



('1008', 10836)
('1034', 9383)
('1004', 8463)
('1018', 5330)
('1017', 5108)
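
The second-stage reducer sorts every (count, page) pair and then slices off the last five. As an alternative sketch (not what the job above does), `heapq.nlargest` from the standard library selects the top entries without sorting the whole list. The pairs below are the five counts reported above; `top_n` is a hypothetical helper name.

```python
import heapq

# (count, page_id) pairs taken from the top-five output above.
pairs = [(10836, '1008'), (9383, '1034'), (8463, '1004'),
         (5330, '1018'), (5108, '1017')]


def top_n(count_page_pairs, n):
    """Return the n largest entries as (page_id, count), biggest first."""
    best = heapq.nlargest(n, count_page_pairs)
    return [(page, count) for count, page in best]


# Feed the pairs in reverse order to show that input order does not matter.
top3 = top_n(reversed(pairs), 3)
print(top3)
```

For a few hundred pages the difference is negligible, but the selection approach keeps only `n` candidates in memory at a time, which matters more as the number of distinct keys grows.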


## HW 4.4

*Find the most frequent visitor of each page using MrJob and the output of 4.2  (i.e., transformed log file). In this output please include the webpage URL, webpageID and Visitor ID.*

In [None]:
%%writefile freq_visitor.py

import csv
from collections import Counter
from operator import itemgetter

from mrjob.job import MRJob
from mrjob.step import MRStep


def csv_readline(line):
    """Given a string CSV line, return a list of strings."""
    for row in csv.reader([line]):
        return row

class FreqVisitor(MRJob):
    
    def jobconf(self):
        orig_jobconf = super(FreqVisitor, self).jobconf()        
        custom_jobconf = {'upload_files': 'ms_urls.txt'}
        combined_jobconf = orig_jobconf
        combined_jobconf.update(custom_jobconf)
        self.jobconf = combined_jobconf
        return combined_jobconf

    def mapper_extract_views(self, line_no, line):
        """Extracts the visitor id and the vroot that was visited"""
        cell = csv_readline(line)
        if cell[0] == 'V':
            yield cell[4],cell[1]
    
    def reducer_load_urls(self):
        with open('ms_urls.txt','rb') as f:
            urls=csv.reader(f.readlines())
        self.url_dict={}
        for i in urls:
            self.url_dict[int(i[0])]=i[2]

    def reducer_sum_views_by_visitor(self, visitor, vroots):
        """Summarizes page counts for each visitor, 
        yields one record per visitor with the page containing 
        the most views by that visitor"""
        pages=Counter()
        for i in vroots:
            pages[i]+=1
        output= max(pages.iteritems(), key=itemgetter(1))[0]
        yield ('Visitor ID:'+str(visitor)),(output,pages[output],self.url_dict[int(output)])
   
    def steps(self):
        return [MRStep(mapper=self.mapper_extract_views,
                        reducer_init=self.reducer_load_urls,
                        reducer=self.reducer_sum_views_by_visitor)]
        
if __name__ == '__main__':
    FreqVisitor.run()

In [None]:
from freq_visitor import FreqVisitor
import csv

mr_job = FreqVisitor(args=['clean_msdata.txt','--file','ms_urls.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        print mr_job.parse_output_line(line)
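
The reducer above groups records by visitor and reports each visitor's most viewed page. The same counting pattern, keyed on the page instead, would give the most frequent visitor per page, which is how the problem statement is worded. The sketch below shows both groupings in plain Python; `most_common_by_key` is a hypothetical helper, and the visit records are invented for illustration rather than taken from the dataset.

```python
from collections import Counter, defaultdict


def most_common_by_key(pairs):
    """Map each key to (most frequent value, its count)."""
    counts = defaultdict(Counter)
    for key, value in pairs:
        counts[key][value] += 1
    return {key: c.most_common(1)[0] for key, c in counts.items()}


# Hypothetical (visitor_id, page_id) records, invented for illustration.
visits = [('10001', '1000'), ('10001', '1001'), ('10001', '1000'),
          ('10002', '1001'), ('10002', '1001')]

by_visitor = most_common_by_key(visits)
by_page = most_common_by_key((page, visitor) for visitor, page in visits)
print(by_visitor)
print(by_page)
```

In MapReduce terms, the only change between the two readings is which field the mapper emits as the key.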

## HW 4.5

###Problem Statement
Here you will use a different dataset consisting of word-frequency distributions 
for 1,000 Twitter users. These Twitter users use language in very different ways,
and were classified by hand according to the criteria:

0: Human, where only basic human-human communication is observed.

1: Cyborg, where language is primarily borrowed from other sources
(e.g., jobs listings, classifieds postings, advertisements, etc...).

2: Robot, where language is formulaically derived from unrelated sources
(e.g., weather/seismology, police/fire event logs, etc...).

3: Spammer, where language is replicated to high multiplicity
(e.g., celebrity obsessions, personal promotion, etc... )

Check out the preprints of our recent research,
which spawned this dataset:

http://arxiv.org/abs/1505.04342
http://arxiv.org/abs/1508.01843

The main data lie in the accompanying file:

topUsers_Apr-Jul_2014_1000-words.txt

and are of the form:

USERID,CODE,TOTAL,WORD1_COUNT,WORD2_COUNT,...
.
.

where

USERID = unique user identifier
CODE = 0/1/2/3 class code
TOTAL = sum of the word counts

Using this data, you will implement a 1000-dimensional K-means algorithm in MrJob on the users
by their 1000-dimensional word stripes/vectors using several 
centroid initializations and values of K.

Note that each "point" is a user as represented by 1000 words, and that
word-frequency distributions are generally heavy-tailed power-laws
(often called Zipf distributions), and are very rare in the larger class
of discrete, random distributions. For each user you will have to normalize
by its "TOTAL" column. Try several parameterizations and initializations:

(A) K=4 uniform random centroid-distributions over the 1000 words
(B) K=2 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution 
(C) K=4 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution 
(D) K=4 "trained" centroids, determined by the sums across the classes.

and iterate until a threshold (try 0.001) is reached.
After convergence, print out a summary of the classes present in each cluster.
In particular, report the composition as measured by the total
portion of each class type (0-3) contained in each cluster,
and discuss your findings and any differences in outcomes across parts A-D.

Note that you do not have to compute the aggregated distribution or the 
class-aggregated distributions, which are rows in the auxiliary file:

topUsers_Apr-Jul_2014_1000-words_summaries.txt
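
To make the row format and the normalization by TOTAL concrete, here is a minimal sketch. `parse_user_row` is a hypothetical helper, and the example row has only four word counts (real rows have 1000) and was invented for illustration.

```python
def parse_user_row(line):
    """Split 'USERID,CODE,TOTAL,WORD1_COUNT,...' into id, class, frequencies."""
    fields = line.strip().split(',')
    user_id, code, total = fields[0], int(fields[1]), float(fields[2])
    # Dividing by TOTAL turns raw counts into a distribution that sums to 1.
    freqs = [float(count) / total for count in fields[3:]]
    return user_id, code, freqs


# Hypothetical 4-word row, invented for illustration.
user_id, code, freqs = parse_user_row('42,0,10,5,3,2,0\n')
print(user_id, code, freqs)
```

After this step every user is a point on the same scale regardless of how much they tweeted, so distances between users reflect word choice rather than volume.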

### HW 4.5 - Setting up the mrjob
First, we modify the MRJob class to run a single iteration of the K-Means algorithm.  The mapper_init function takes a text file containing initial centroid positions and loads it into memory on each mapper.  

Next, the mapper method runs the expectation step and emits a record for each point in the main dataset with its corresponding cluster assignment based on the current centroid locations.  This is done using the helper function from the class example, which we must define in advance. The mapper's output also includes the point's actual class so that we can evaluate the results of our clustering at the end.  

The maximization step takes place in the reducer, which aggregates results for each predicted class and computes the new corresponding centroid location.  A combiner sits between the mapper and reducer to help with intermediate aggregation.  Finally, the new centroid locations are written back to disk. 
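
The division of labor between mapper, combiner, and reducer can be sketched in plain Python on a toy problem. This is an illustration only, not the MRJob class that follows: the function names are hypothetical, the points are two-dimensional and invented, and each "chunk" stands in for the input of one mapper.

```python
def nearest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))


def partial_sums(points, centroids):
    """Mapper + combiner: per-cluster (count, vector sum) for one data chunk."""
    acc = {}
    for pt in points:
        idx = nearest(pt, centroids)
        n, total = acc.get(idx, (0, [0.0] * len(pt)))
        acc[idx] = (n + 1, [t + x for t, x in zip(total, pt)])
    return acc


def merge_and_average(chunks):
    """Reducer: merge partial sums from every chunk, then divide by the count."""
    merged = {}
    for acc in chunks:
        for idx, (n, total) in acc.items():
            m, running = merged.get(idx, (0, [0.0] * len(total)))
            merged[idx] = (m + n, [r + t for r, t in zip(running, total)])
    return {idx: [t / n for t in total] for idx, (n, total) in merged.items()}


# Hypothetical 2-D points split across two chunks, invented for illustration.
centroids = [[0.0, 0.0], [10.0, 10.0]]
chunk_a = [[1.0, 1.0], [9.0, 9.0]]
chunk_b = [[3.0, 1.0], [11.0, 9.0]]
new_centroids = merge_and_average(
    [partial_sums(chunk_a, centroids), partial_sums(chunk_b, centroids)])
print(new_centroids)
```

Carrying the point count alongside each partial sum is what lets the combiner run safely: sums and counts can be merged in any order, whereas averages of averages would weight chunks incorrectly.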

In [None]:
%%writefile mrkmeans.py
from __future__ import division
from math import sqrt
from operator import itemgetter
from collections import Counter

import numpy as np

from mrjob.job import MRJob
from mrjob.step import MRStep

#Find the index of the nearest centroid for a data point
def MinDist(datapoint, centroid_points):
    datapoint = np.array(datapoint)
    centroid_points = np.array(centroid_points)
    diff = datapoint - centroid_points 
    diffsq = diff*diff
    # Get the nearest centroid for each instance
    minidx = np.argmin(list(diffsq.sum(axis = 1)))
    return minidx

class MRKmeans(MRJob):
    
    def __init__(self, *args, **kwargs):
        super(MRKmeans, self).__init__(*args, **kwargs)
        #Initializing these values here makes them available to the class as a whole
        self.k = 0 #Number of clusters to create
        self.centroid_points=[] #List of centroid vectors
    
    def steps(self):
        return [
            MRStep(
                mapper_init=self.mapper_init,
                mapper=self.mapper,
                combiner=self.combiner,
                reducer=self.reducer
            )
        ]
    
    def mapper_init(self):
        """
        Load locations of existing centroids into memory as a list with len=k of lists with len=1000
        """        
        self.centroid_points=[map(float,s.split('\n')[0].split(',')) for s in open('Centroids.txt').readlines()]
        open('Centroids.txt','w').close() #This wipes the file once we've loaded it so we can overwrite at the end
        self.k=len(self.centroid_points)
        
    def mapper(self,_,line):
        """
        For each point sent through the stream:
        - Normalize each point by the total number of words in the document
        - Calculate the closest centroid
        - Emit records where... 
            -Key=<index of the closest centroid>
            -Value=(<one-hot list marking the point's actual class>,1,<normalized vector for that point>)
        """
        
        line=line.split(',')
        line_id,cluster,total_words=int(line[0]),int(line[1]),float(line[2])
        D=(map(float,line[3:])) #Convert point to floats
        D=[i/total_words for i in D] #Normalize by total words
        idx=int(MinDist(D,self.centroid_points)) #Calculate closest centroid/cluster assignment
        class_counts=np.zeros(4) #Pass actual cluster assignment through (the array helps aggregation later)
        class_counts[cluster]+=1
        yield idx,(list(class_counts),1,D) #We convert the class_counts array to a list for serialization purposes
        
    def combiner(self,idx,inputdata):
        """
        For each row sent by the mapper, calculate partial sum for new centroid:
        - Initialize a blank 1000 element list
        - Add all intermediate values together for that list
        - Emit records where...
            -Key=Index of centroid that should be updated with the associated vector
            -Value=(<actual class counts>,<number of points represented in the vector>,<vector of partial sums>)
        """
        
        temp_row=np.zeros(1000) #Initialize aggregated vector
        num=0
        class_counts=np.zeros(4)
        for v in inputdata: #Calculate intermediate sums
            class_counts+=v[0] #records will come in with a real cluster id, we'll pass the lists through here
            num+=v[1]
            temp_row+=v[2]
        yield idx,(list(class_counts),num,list(temp_row))
    
    def reducer(self,idx,inputdata):
        """
        For each incoming row:
        - Calculate final sum of vector elements using the same approach as in the combiner
        - Divide by the number of points in the cluster to calculate the updated location of each new centroid
        - Store updated centroids to disk
        - Emit location of new centroids
        """
        centroid=np.zeros(1000)
        class_counts=np.zeros(4)
        num=0

        for v in inputdata:
            class_counts+=v[0] #Aggregate actual class assignments contained in each proposed cluster
            num+=v[1] #Track the number of points assigned to this cluster
            centroid+=v[2]
        
        centroid/=num #Average the summed vectors over the number of points in the cluster
        
        #Save new centroid locations to file
        with open('Centroids.txt','a') as f:
            f.writelines(','.join(map(str,centroid))+'\n')
        yield idx,(list(class_counts),list(centroid))

if __name__=='__main__':
    MRKmeans.run()

### HW 4.5 - Running iterative MRJobs
Once we've established our kmeans class, we need to set up a driver structure to run it and make sense of the results. First, we define a stopping criterion based on the class example that checks how much the centroids have moved from iteration to iteration. If this delta is above a threshold, we'll continue to iterate.  

Next, we set up a function to run the job itself that accepts a list of centroid points and a value for K.  This makes it possible to reuse our code to answer each of the four questions posed.  The main function saves the starting centroids to disk, then repeatedly calls the MRJob we defined above until our stopping criterion is met.  At that point, class summaries are calculated and displayed.
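
The shape of this driver loop can be sketched independently of MRJob. In the sketch below, `has_converged` applies the same rule as the stopping criterion described above (stop when no coordinate moved by more than the threshold), while `iterate`, its `max_iter` safety cap, and the `halfway` update rule are hypothetical stand-ins for the real job.

```python
def has_converged(old, new, threshold):
    """True when no centroid coordinate moved by more than threshold."""
    return all(abs(a - b) <= threshold
               for old_c, new_c in zip(old, new)
               for a, b in zip(old_c, new_c))


def iterate(step, centroids, threshold, max_iter=100):
    """Apply step() until the centroids stop moving; returns (centroids, iterations)."""
    for i in range(1, max_iter + 1):
        updated = step(centroids)
        if has_converged(centroids, updated, threshold):
            return updated, i
        centroids = updated
    return centroids, max_iter


def halfway(cs):
    """Hypothetical update rule: every coordinate moves halfway toward 1.0."""
    return [[(x + 1.0) / 2 for x in c] for c in cs]


final, n_iter = iterate(halfway, [[0.0, 0.0]], 0.001)
print(final, n_iter)
```

The iteration cap is an addition that the driver below does not have; it guards against a loop that never terminates if the centroids keep oscillating above the threshold.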

In [599]:
### K-Means Driver Code
from __future__ import division
from itertools import chain

import numpy as np #Needed for np.zeros in the class totals below
from numpy import random

from mrkmeans import MRKmeans

def stop_criterion(centroid_points_old, centroid_points_new,T):
    oldvalue = list(chain(*centroid_points_old))
    newvalue = list(chain(*centroid_points_new))
    Diff = [abs(x-y) for x, y in zip(oldvalue, newvalue)]
    Flag = True
    for i in Diff:
        if(i>T):
            Flag = False
            break
    return Flag

def run_kmeans_mrjob(centroid_points,k):
    source='topUsers_Apr-Jul_2014_1000-words.txt'

    #Set up job and save centroids to file
    mr_job=MRKmeans(args=[source,'--file', 'Centroids.txt'])
    with open('Centroids.txt','w+') as f:
        f.writelines(','.join(str(j) for j in i)+'\n' for i in centroid_points)

    #Update centroids iteratively
    i=0 #Track which iteration we're on
    while(1):
        output=[] #Initialize destination for our final results
        centroid_points_old=centroid_points[:]
        with mr_job.make_runner() as runner:
            runner.run() #stream output
            for line in runner.stream_output():
                key,value=mr_job.parse_output_line(line)
                output.append((key,value[0])) #Save our temp results.  These will only display once the algorithm converges
                centroid_points[key]=value[1] 
        i+=1

        #Check if stop criterion is satisfied.  
        if stop_criterion(centroid_points_old,centroid_points,0.001):
            
            #Calculate overall class totals
            totals=np.zeros(4)
            for v in output:
                for col,j in enumerate(v[1]):
                    totals[col]+=j
            
            #Print final results
            print "==========RESULTS============="
            print "k-means converged after {0} iterations\n".format(str(i))
            print "--------Class Counts by Cluster ------"
            for j in output:
                print str(j[0])+' | '+str(j[1][0])+' ('+str(round(j[1][0]/totals[0],3)*100)+'%) | '+str(j[1][1])+' ('+str(round(j[1][1]/totals[1],3)*100)+'%) | '+str(j[1][2])+' ('+str(round(j[1][2]/totals[2],3)*100)+'%) | '+str(j[1][3])+' ('+str(round(j[1][3]/totals[3],3)*100)+'%) | '
            break
    

### HW 4.5 - Part A
*K=4 uniform random centroid-distributions over the 1000 words (generate 1000 random numbers and normalize the vectors)*

Now that we've laid all the groundwork, we can run our jobs.  The only differences between the four parts of this problem are the value of K and the process we use to initialize our centroid locations.

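Note that the code below does not draw uniform random vectors.  Instead, it seeds the centroids with four randomly sampled users, each normalized by its total word count.  The earlier uniform-random initialization is left commented out at the end of the cell.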
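
For comparison, the recipe given in the Part A statement (generate 1000 random numbers and normalize the vectors) can be sketched as follows. This is not the initialization that produced the results in this section; `uniform_random_centroids` and its `seed` parameter are hypothetical names introduced here.

```python
import random


def uniform_random_centroids(k, dim, seed=None):
    """Return k vectors of dim uniform(0, 1) draws, each scaled to sum to 1."""
    rng = random.Random(seed)
    centroids = []
    for _ in range(k):
        raw = [rng.random() for _ in range(dim)]
        total = sum(raw)
        centroids.append([x / total for x in raw])
    return centroids


centroids = uniform_random_centroids(k=4, dim=1000, seed=0)
print(len(centroids), len(centroids[0]), sum(centroids[0]))
```

The result is a list of lists, the same shape that `run_kmeans_mrjob` accepts for its `centroid_points` argument.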

In [615]:
####### PART A ############
from csv import reader
import random as rand #avoid namespace collision

def run_part_a():
    k=4
    centroid_points=[] #Initialize list of lists to hold K starting centroids
    source='topUsers_Apr-Jul_2014_1000-words.txt'
    users=(rand.sample(list(open(source)),k))
    for line in reader(users):
        line_id,cluster,total_words=int(line[0]),int(line[1]),float(line[2])
        D=(map(float,line[3:]))
        D=[i/total_words for i in D]
        centroid_points.append(D)

    run_kmeans_mrjob(centroid_points,k)
    #for i in range(k): #THIS IS OLD
    #    centroid_points.append([random.uniform(-.01,.01) for i in range(centroid_dimensions)])

run_part_a()



k-means converged after 11 iterations

--------Class Counts by Cluster ------
0 | 1.0 (0.1%) | 88.0 (96.7%) | 38.0 (70.4%) | 4.0 (3.9%) | 
1 | 0.0 (0.0%) | 0.0 (0.0%) | 12.0 (22.2%) | 0.0 (0.0%) | 
2 | 83.0 (11.0%) | 0.0 (0.0%) | 3.0 (5.6%) | 62.0 (60.2%) | 
3 | 668.0 (88.8%) | 3.0 (3.3%) | 1.0 (1.9%) | 37.0 (35.9%) | 


### HW 4.5 - Part B
*K=2, with centroids based on random perturbations from the user-wide distribution*

Here, we use the initialization function provided in the updated problem statement, which returns K centroids based on random noise added to the overall distribution.  These are returned as a list of lists that we can then use in our main k-means function.
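
The perturb-then-renormalize idea is easier to see on a small vector. The sketch below follows the same recipe as the provided function (add uniform noise scaled by one tenth, then rescale so the entries sum to 1), but `perturbed_centroids`, its parameters, and the four-word base distribution are hypothetical and invented for illustration.

```python
import random


def perturbed_centroids(base, k, scale=0.1, seed=None):
    """Return k noisy copies of base, each renormalized to sum to 1."""
    rng = random.Random(seed)
    centroids = []
    for _ in range(k):
        noisy = [b + scale * rng.random() for b in base]
        total = sum(noisy)
        centroids.append([x / total for x in noisy])
    return centroids


# Hypothetical 4-word aggregate distribution, invented for illustration.
base = [0.4, 0.3, 0.2, 0.1]
centroids = perturbed_centroids(base, k=2, seed=1)
print(centroids)
```

Renormalizing after adding noise keeps each starting centroid a valid distribution, so it lives on the same scale as the normalized user vectors it will be compared against.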

In [603]:
import re
# Setup function for centroids for parts B and C
def startCentroidsBC(k):
    counter = 0
    for line in open("topUsers_Apr-Jul_2014_1000-words_summaries.txt").readlines():
        if counter == 2:        
            data = re.split(",",line)
            globalAggregate = [float(data[i+3])/float(data[2]) for i in range(1000)]
        counter += 1
    ## perturb the global aggregate for each of the k initializations    
    centroids = []
    for i in range(k):
        rndpoints = random.sample(1000)
        peturpoints = [rndpoints[n]/10+globalAggregate[n] for n in range(1000)]
        centroids.append(peturpoints)
        total = 0
        for j in range(len(centroids[i])):
            total += centroids[i][j]
        for j in range(len(centroids[i])):
            centroids[i][j] = centroids[i][j]/total
    return centroids

In [616]:
####### PART B ############
def run_part_b():
    k=2
    centroid_points=startCentroidsBC(k)
    run_kmeans_mrjob(centroid_points,k)

run_part_b()




k-means converged after 4 iterations

--------Class Counts by Cluster ------
0 | 1.0 (0.1%) | 88.0 (96.7%) | 40.0 (74.1%) | 4.0 (3.9%) | 
1 | 751.0 (99.9%) | 3.0 (3.3%) | 14.0 (25.9%) | 99.0 (96.1%) | 
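
In these tables, each row is a cluster and the four columns are the hand-labeled classes 0 to 3 (human, cyborg, robot, spammer). The percentage next to each count is that cluster's share of the class total, which can be reproduced from the Part B counts above with a short sketch (`class_shares` is a hypothetical helper name):

```python
# Rows are clusters; columns are classes 0-3, copied from the Part B output above.
counts = [[1, 88, 40, 4],
          [751, 3, 14, 99]]


def class_shares(counts):
    """Percent of each class (column) that landed in each cluster (row)."""
    totals = [sum(col) for col in zip(*counts)]
    return [[round(100.0 * c / t, 1) for c, t in zip(row, totals)]
            for row in counts]


shares = class_shares(counts)
print(shares)
```

Each column of shares sums to roughly 100%, so the table reads as "where did the members of this class end up" rather than "what is this cluster made of".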


### HW 4.5 - Part C
*K=4, with centroids based on random perturbations from the user-wide distribution*

We can recycle the same approach as in part B, and simply change the value of K

In [617]:
####### PART C ############
def run_part_c():
    k=4
    centroid_points=startCentroidsBC(k)
    run_kmeans_mrjob(centroid_points,k)

run_part_c()




k-means converged after 8 iterations

--------Class Counts by Cluster ------
0 | 0.0 (0.0%) | 51.0 (56.0%) | 0.0 (0.0%) | 0.0 (0.0%) | 
1 | 751.0 (99.9%) | 3.0 (3.3%) | 9.0 (16.7%) | 99.0 (96.1%) | 
2 | 0.0 (0.0%) | 0.0 (0.0%) | 7.0 (13.0%) | 0.0 (0.0%) | 
3 | 1.0 (0.1%) | 37.0 (40.7%) | 38.0 (70.4%) | 4.0 (3.9%) | 


### HW 4.5 - Part D
*K=4, "trained" centroids, determined by the sums across the classes*

This version involves pulling the initial centroids from the aggregated summary of the stats.  We read each in, normalize by the total words in the class, and output the result as our class centroid.

In [618]:
####### PART D ############

def run_part_d():
    k=4
    centroid_points=[] #Initialize list of lists to hold K starting centroids
    source='topUsers_Apr-Jul_2014_1000-words_summaries.txt'
    users=list(open(source))
    for line in reader(users[2:]): #Skip the first two lines since we only want the cluster-level data
        line_id,cluster,total_words=line[0],line[1],float(line[2])
        D=(map(float,line[3:]))
        D=[i/total_words for i in D]
        centroid_points.append(D)

    run_kmeans_mrjob(centroid_points,k)

run_part_d()



k-means converged after 5 iterations

--------Class Counts by Cluster ------
0 | 749.0 (99.6%) | 3.0 (3.3%) | 14.0 (25.9%) | 38.0 (36.9%) | 
1 | 0.0 (0.0%) | 51.0 (56.0%) | 0.0 (0.0%) | 0.0 (0.0%) | 
2 | 1.0 (0.1%) | 37.0 (40.7%) | 40.0 (74.1%) | 4.0 (3.9%) | 
3 | 2.0 (0.3%) | 0.0 (0.0%) | 0.0 (0.0%) | 61.0 (59.2%) | 


##HW 4.5 - Discussion of Results TODO


###End of Submission