## DATASCI W261: Machine Learning at Scale

# Assignment: Week 4

- Juanjo Carin
- [juanjose.carin@ischool.berkeley.edu](mailto:juanjose.carin@ischool.berkeley.edu)
- W261-2
- Week 04
- Submission date: 9/29/2015

## Errata

Here I will upload any **minor corrections** I may make to the assignment after I submit it:

[https://www.dropbox.com/s/lf9ijexdvqobtm6/HW4-Errata.txt?dl=0](https://www.dropbox.com/s/lf9ijexdvqobtm6/HW4-Errata.txt?dl=0)

# HW4.0

1. **What is MrJob? How is it different to Hadoop MapReduce?**

2. **What are the mapper_final(), combiner_final(), reducer_final() methods? When are they called?**

1. **MrJob** is a Python package for writing and running Hadoop streaming jobs, i.e., a Hadoop streaming framework. Its main **differences** from Hadoop MapReduce are:

    * It allows (or assists in) producing multistep jobs (be it a sequential pipeline or an iterative one).
    * Unlike Hadoop MapReduce, MRJob reads and writes only text formats (raw text and JSON), and it currently does not support binary serialization schemes. That makes it much slower than Hadoop streaming via Python, because deserialization and serialization of records incur a lot of CPU, storage, and network overhead...
    * ...but it is also very easy to write, maintain, and communicate with, and it can work seamlessly with EMR and complex objects, which compensates for that reduction in speed.

2. **`mapper_final()`, `combiner_final()`, and `reducer_final()`** are methods defined within a subclass of the MRJob class. Each one runs once per task, after the corresponding Mapper, Combiner, or Reducer has processed all of its input, so they implement the final (shutdown) phase of that task; a typical use is to emit values that were accumulated in memory. (`mapper_init()`, `combiner_init()`, and `reducer_init()` can optionally be defined too, in order to initialize those phases.) We never call them directly: the framework invokes them once the job is launched, either by running the `.py` file that contains the subclass definition or from a Python driver such as...

    ```python
    from file import MRJob_subclass
    mr_job = MRJob_subclass(args=...)
    with mr_job.make_runner() as runner:
        runner.run()
        ...
    ```

    ...the moment we run the process (`runner.run`). 
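
The call order of these hooks can be illustrated without a cluster. The following is a minimal plain-Python sketch, not mrjob itself: `WordCountTask` and `run_mapper_task` are hypothetical names, and the loop only mimics what the framework does for a single mapper task (the init hook once, `mapper` once per line, the final hook once at the end of the input).

```python
from collections import Counter


class WordCountTask(object):
    """Hypothetical stand-in for the mapper part of an MRJob subclass."""

    def mapper_init(self):
        # Runs once, before the task sees any input line
        self.counts = Counter()

    def mapper(self, _, line):
        # Runs once per input line; here it only accumulates in memory
        self.counts.update(line.split())
        return iter(())  # nothing is emitted yet

    def mapper_final(self):
        # Runs once, after the last input line: emit the accumulated totals
        for word, count in sorted(self.counts.items()):
            yield word, count


def run_mapper_task(task, lines):
    """Mimic the order in which the hooks of one mapper task are called."""
    output = []
    task.mapper_init()
    for line in lines:
        output.extend(task.mapper(None, line))
    output.extend(task.mapper_final())
    return output


pairs = run_mapper_task(WordCountTask(), ['a b a', 'b c'])
```

Because the totals are only emitted in the final hook, the task produces one pair per distinct word instead of one pair per occurrence, which is the usual motivation for defining such a method.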

# HW4.1

1. **What is serialization in the context of MrJob or Hadoop?**
2. **When is it used in these frameworks?**
3. **What is the default serialization mode for input and outputs for MrJob?**

1. **Serialization** is the process of converting structured objects (raw text, JSON) into a byte stream in order to store or transmit them. Its main purpose is to save the state of an object so that it can be recreated when needed.

2. In **Hadoop**, serialization is used heavily throughout the entire framework. In **MrJob** it is used in a limited fashion: neither the input nor the output can be in binary format, which slows down our pipelines because we have to convert the data beforehand.

3. Each job has an input protocol, an internal protocol, and an output protocol. By default, those protocols are raw values, JSON, and JSON again, respectively. So the **default serialization mode** is raw text for **inputs** (each line is read as a plain string) and JSON for **outputs** (each line is written as a JSON-encoded key and a JSON-encoded value separated by a tab character).
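
To make the default output layout concrete, here is a minimal sketch of a "JSON key, TAB, JSON value" line built with the standard `json` module only. `encode_line` and `decode_line` are hypothetical helpers written for this illustration; they imitate the layout described above and are not mrjob's actual protocol classes.

```python
import json


def encode_line(key, value):
    """Serialize a (key, value) pair as 'JSON<TAB>JSON'."""
    return json.dumps(key) + '\t' + json.dumps(value)


def decode_line(line):
    """Recover the (key, value) pair from a 'JSON<TAB>JSON' line."""
    # json.dumps escapes tab characters, so the first raw tab is the separator
    raw_key, raw_value = line.split('\t', 1)
    return json.loads(raw_key), json.loads(raw_value)


line = encode_line('1008', {'visits': 10836})
key, value = decode_line(line)
```

One consequence of a JSON round trip is that types JSON lacks are not preserved: a tuple, for instance, comes back as a list.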

# HW4.2

**Recall the Microsoft logfiles data from the async lecture. The logfiles are described and located at:**

**[https://kdd.ics.uci.edu/databases/msweb/msweb.html](https://kdd.ics.uci.edu/databases/msweb/msweb.html)**

**[http://archive.ics.uci.edu/ml/machine-learning-databases/anonymous/](http://archive.ics.uci.edu/ml/machine-learning-databases/anonymous/)**

**This dataset records which areas (Vroots) of www.microsoft.com each user visited in a one-week timeframe in February 1998.**

**Here, you must preprocess the data on a single node (i.e., not on a cluster of nodes) from the format:**

    C,"10001",10001   #Visitor id 10001
    V,1000,1          #Visit by Visitor 10001 to page id 1000
    V,1001,1          #Visit by Visitor 10001 to page id 1001
    V,1002,1          #Visit by Visitor 10001 to page id 1002
    C,"10002",10002   #Visitor id 10002
    V

**Note: `#` denotes comments.**

**to the format:**

    V,1000,1,C, 10001
    V,1001,1,C, 10001
    V,1002,1,C, 10001

**Write the python code to accomplish this.**

Since we have to report the URL in HW4.4, I will also include that information in the transformed logfile, having something like:

    V,1000,1,/regwiz,C,10001
    V,1001,1,/support,C,10001
    V,1002,1,/athome,C,10001

That could be reduced to:

    1000,/regwiz,10001
    1001,/support,10001
    1002,/athome,10001
    
eliminating all the `V`s in the first field, the `1`s in the third field, and the `C`s in the fifth one, keeping only the webpage ID, the webpage (relative) URL, and the Visitor ID, but I've kept the format we were asked for, just inserting the (relative) URL between the Vroot and the user.

In [1]:
import urllib2
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/anonymous/' +\
    'anonymous-msweb.data'
# A dictionary to link webpages URLs to IDs
Vroot = {}
# Two counters to keep track of number of distinct webpages and visitors
A = 0
C = 0
with open('preprocessed_anonymous-msweb.data', 'w') as output_file:
    for line in urllib2.urlopen(url):
        record = line.strip().split(',')
        record = [x.strip('"') for x in record]
        # If the record corresponds to an attribute, link the URL to the Vroot
            # All lines starting with 'A' are at the beginning of the file, so 
            # the whole dictionary is already created when lines starting with 
            # 'C' or 'V' are read
        if record[0] == 'A':
            A += 1
            Vroot[record[1]] = record[4]
        # If the record corresponds to a case (Visitor ID), save that info to 
            # pass it to the Vroot
        elif record[0] == 'C':
            C += 1
            case = record[:2]
        # If the line contains a vote line for a case, concatenate the user ID
        elif record[0] == 'V':
            output_file.write(','.join(record+[Vroot[record[1]]]+case) + '\n')

print 'Training Instances  {}'.format(C)
print 'Attributes  {}'.format(A)

Training Instances  32711
Attributes  294


According to [https://kdd.ics.uci.edu/databases/msweb/msweb.data.html](https://kdd.ics.uci.edu/databases/msweb/msweb.data.html) there were:

`Training Instances  32711
Attributes  294`

In [2]:
!echo "Number of lines (visits): \
    "$(cat preprocessed_anonymous-msweb.data | wc -l)
!echo "-------------------------------"
!echo "First 10 lines:"
!head preprocessed_anonymous-msweb.data

Number of lines (visits):     98654
-------------------------------
First 10 lines:
V,1000,1,/regwiz,C,10001
V,1001,1,/support,C,10001
V,1002,1,/athome,C,10001
V,1001,1,/support,C,10002
V,1003,1,/kb,C,10002
V,1001,1,/support,C,10003
V,1003,1,/kb,C,10003
V,1004,1,/search,C,10003
V,1005,1,/norge,C,10004
V,1006,1,/misc,C,10005
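
The same transformation can be tried on a handful of in-memory lines, without downloading the dataset. This is only a sketch: `preprocess` is a hypothetical helper, and the `A` records in `sample` are illustrative lines that assume the `A,id,1,"title","url"` layout that the code above relies on.

```python
def preprocess(lines):
    """Turn raw msweb records into 'V,page,1,url,C,visitor' lines."""
    urls = {}        # page ID -> relative URL, taken from the 'A' records
    visitor = None   # ID of the most recent 'C' record
    output = []
    for line in lines:
        record = [field.strip('"') for field in line.strip().split(',')]
        if record[0] == 'A':
            urls[record[1]] = record[4]
        elif record[0] == 'C':
            visitor = record[1]
        elif record[0] == 'V':
            # Append the URL of the page and the visitor it belongs to
            output.append(','.join(record + [urls[record[1]], 'C', visitor]))
    return output


sample = [
    'A,1000,1,"regwiz","/regwiz"',
    'A,1001,1,"Support Desktop","/support"',
    'C,"10001",10001',
    'V,1000,1',
    'V,1001,1',
    'C,"10002",10002',
    'V,1001,1',
]
result = preprocess(sample)
```

Each `V` line inherits the visitor of the closest preceding `C` line, which is why the records have to be processed in order on a single node.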


# HW4.3

**Find the 5 most frequently visited pages using mrjob from the output of 4.2 (i.e., transformed log file).**

## HW4.3.a
With my **first approach** the MRJob program just aggregates the number of visits of each webpage, and the Top5 is extracted using the **command line**. In HW4.3.b my **second approach** is shown, where **MRJob** does all the work.

In [3]:
%%writefile VisitsPerVroot.py

from mrjob.job import MRJob
from mrjob.step import MRStep

class MRVisitsPerVroot(MRJob):

    # We can re-use the code of the reducer as combiner
    def steps(self):
        return [
            MRStep(mapper = self.mapper, 
                   combiner = self.reducer,
                   reducer=self.reducer)
               ]

    def mapper(self, _, line):
        """Extracts the Vroot"""
        cell = line.split(',')
        yield cell[1],1

    def reducer(self, vroot, visit_counts):
        """Summarizes the visit counts by adding them together"""
        total = sum(visit_counts)
        yield vroot, total
        
if __name__ == '__main__':
    MRVisitsPerVroot.run()

Overwriting VisitsPerVroot.py


In [4]:
# Import the class
from VisitsPerVroot import MRVisitsPerVroot

# Use file generated in HW4.2
mr_job = MRVisitsPerVroot(args=['preprocessed_anonymous-msweb.data'])
with mr_job.make_runner() as runner:
    runner.run()
    # For every webpage in the dataset
    with open('Visits','w') as results:
        for line in runner.stream_output():
            # Add a line in the file containing the webpageID and the times it 
                # was visited
            results.writelines('\t'.join([str(x) for x in \
                                          mr_job.parse_output_line(line)])+'\n')

We've generated a text file with one line per Vroot, each one containing:

    Vroot_id TAB Visit_count

Note that the new file contains **285** lines (i.e., **Vroots**), while **the original file listed 294: 9 of the Vroots were not visited by any user**.

All the values in the second column should add up to the number of lines in the file generated in HW4.2 (`preprocessed_anonymous-msweb.data`).

Let's check what we've got:

In [5]:
!echo "Number of visits: "$(cat Visits | cut -f 2 | paste -sd+ -|bc)
!echo "Number of Vroots: "$(cat Visits | wc -l)
!echo "-----------------------"
!echo "First 10 lines:"
!head Visits

Number of visits: 98654
Number of Vroots: 285
-----------------------
First 10 lines:
1000	912
1001	4451
1002	749
1003	2968
1004	8463
1005	42
1006	135
1007	865
1008	10836
1009	4628


The 5 most frequently visited pages could be generated this way:

In [6]:
!echo Vroot'\t'Visits
!sort Visits -k2 -n -r | head -5

Vroot	Visits
1008	10836
1034	9383
1004	8463
1018	5330
1017	5108


Or from Python:

In [7]:
# Import numpy
import numpy as np
# Create a list of results
results = []
with open('Visits','r') as f:
    # For every line (i.e., webpage)
    for line in f:
        Vroot = line.split()
        # Create a list containing WebpageID and Visits
        Vroot = [int(x) for x in Vroot]
        # And add it to the whole list
        results.append(Vroot)
# Convert the list of lists into a nparray
    # where each row relates to a webpage
arrayResults = np.array(results)
# Order the indices of rows
index = np.argsort(arrayResults[:,1])
# Get the indices of the Top5
top5 = index[-5:][::-1]
# Show results
print 'Pos\tVroot\tVisits'
for i, ind in enumerate(top5):
    print str(i+1) + '\t' + '\t'.join([str(x) for x in arrayResults[ind,:]])


Pos	Vroot	Visits
1	1008	10836
2	1034	9383
3	1004	8463
4	1018	5330
5	1017	5108
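
As a further alternative to sorting everything, a heap-based selection keeps only the N largest entries. The sketch below uses the standard `heapq.nlargest` function on the ten `Visits` lines displayed earlier (not the full file), so it asks for a Top 3 rather than a Top 5.

```python
import heapq

# Visit counts copied from the first 10 lines of the Visits file shown above
visits = {
    '1000': 912, '1001': 4451, '1002': 749, '1003': 2968, '1004': 8463,
    '1005': 42, '1006': 135, '1007': 865, '1008': 10836, '1009': 4628,
}

# Select the 3 (Vroot, visits) pairs with the largest visit counts
top3 = heapq.nlargest(3, visits.items(), key=lambda item: item[1])
```

With only 285 Vroots the difference is negligible, but the same call avoids a full sort when the number of keys is large and N is small.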


## HW4.3.b
As mentioned, now we make **MRJob** do all the work.

In [8]:
%%writefile Top5Pages.py

from mrjob.job import MRJob
from mrjob.step import MRStep
from collections import Counter
from operator import itemgetter

class Top5Pages(MRJob):

    # IMPORTANT: if we're sure that only 1 reducer is run at the 1st MRJob step,
        # we could use just this 1st step:
            # combiner = self.reducer1
            # reducer = self.reducer2
        # (Or just don't import MRStep, delete steps and combiner methods, and
            # rename reducer1 as combiner and reducer2 as reducer)
    def steps(self):
        return [
            MRStep(mapper = self.mapper, 
                   combiner = self.combiner,
                   reducer=self.reducer1),
            MRStep(reducer = self.reducer2)
               ]

    def mapper(self, _, line):
        """Extracts the Vroot"""
        cell = line.split(',')
        yield cell[1],1

    def combiner(self, vroot, visit_counts):
        """Summarizes the visit counts by adding them together"""
        total = sum(visit_counts)
        yield vroot, total

    def reducer1(self, vroot, visit_counts):
        """Group all webpages using key=None; the values will be dictionaries
        with webpage as key and its visits as value"""
        total = sum(visit_counts)
        yield None, {vroot: total}

    def reducer2(self, _, dictionaries):
        """The 2nd reducer gets a list of dictionaries, one for each webpage"""
        # We start creating a special type of dictionary: Counter
        final_dict = Counter()
        # For every item in the list (of dictionaries), i.e., for every webpage
        for x in dictionaries:
            # Update number of occurrences for each key
                # (Also valid to include new keys)
            final_dict.update(x)
        # Now create 2 lists, one containing the webpageIDs, and the other the 
            # visits of each one
        webpages=[]
        visits=[]
        for k,v in final_dict.iteritems():
            webpages.append(k)
            visits.append(v)
        # For the Top 5
        for i in range(5):
            # Find the index of the webpage with most visits
            index, value = max(enumerate(visits), key=itemgetter(1))
            # Output its ID and number of visits, taking out of the list for the 
                # next iteration
            yield webpages.pop(index), visits.pop(index)
        
if __name__ == '__main__':
    Top5Pages.run()

Overwriting Top5Pages.py


In [9]:
# Import the class
from Top5Pages import Top5Pages

# Use file generated in HW4.2
mr_job = Top5Pages(args=['preprocessed_anonymous-msweb.data', '--no-strict-protocols'])
with mr_job.make_runner() as runner:
    runner.run()
    print 'Pos\tVroot\tVisits'
    # For every webpage in the Top 5
    for i, line in enumerate(runner.stream_output()):
        # Print position in ranking, webpageID, and number of visits
        print str(i+1) + '\t' + '\t'.join([str(x) for x in \
                                           mr_job.parse_output_line(line)])

Pos	Vroot	Visits
1	1008	10836
2	1034	9383
3	1004	8463
4	1018	5330
5	1017	5108


# HW4.4

**Find the most frequent visitor of each page using mrjob and the output of 4.2  (i.e., transformed log file). In this output please include the webpage URL, webpageID and Visitor ID.**

In [10]:
# Create a directory to put files (one for each webpage: 285)
# (Delete previous version if it existed)
!rm -rf MostFrequentVisitors
!mkdir MostFrequentVisitors

Each file will be of the form

    1074: /ntworkstation
    11520
    18559
    19498
    33089
    ...

where the first line has the form `webpageID: webpageURL` and each following line corresponds to a `visitorID`. There will be more than one in case of a tie, which happens for every page with more than one visitor in this dataset, since no user visits a webpage more than once.
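
The tie handling can be sketched in plain Python before turning it into a job. `most_frequent_visitors` and `toy_visits` are hypothetical names, and the repeated visit in the toy data is made up to show a page without a tie (it does not occur in the real dataset).

```python
from collections import Counter, defaultdict


def most_frequent_visitors(visits):
    """Map each page to the sorted list of visitors tied for most visits."""
    per_page = defaultdict(Counter)
    for page, visitor in visits:
        per_page[page][visitor] += 1
    result = {}
    for page, counts in per_page.items():
        maximum = max(counts.values())
        # Keep every visitor that reaches the maximum, not just one of them
        result[page] = sorted(v for v, n in counts.items() if n == maximum)
    return result


toy_visits = [
    ('1000', '10001'), ('1000', '10010'),                     # tie at 1 visit
    ('1001', '10002'), ('1001', '10002'), ('1001', '10003'),  # made-up repeat
]
winners = most_frequent_visitors(toy_visits)
```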

In [11]:
%%writefile MostFreqVisitor.py

from mrjob.job import MRJob
from collections import Counter

class MostFreqVisitor(MRJob):

    def mapper(self, _, line):
        # Extract information of each line
        cell = line.split(',')
        # Concatenate webpageID and webpageURL
            # These are relative URLs so all start with '/'
        page = cell[1]+': '+cell[3]
        # Output:
            # key: webpage 
            # value: a dict with key=visitorID and value=1
                # (Each line relates to a single visit)
        yield page, {cell[5]:1}

    def combiner(self, page, dictionary):
        # For each key (webpage) the combiner receives a list of dictionaries,
            # each one with a single key (visitorID) and value 1
        # The combiner has to aggregate count of visits for each visitor
            # in case there is more than one (not the case in this dataset)
        # We start creating a special type of dictionary: Counter
        agg_dictionary = Counter()
        # For every item in the list (of dictionaries)
        for x in dictionary:
            # Update number of occurrences for each key
                # (Also valid to include new keys)
            agg_dictionary.update(x)
        # Output:
            # key: webpage
            # value: a dict with key=visitorID and value=#visits of visitorID,
                # e.g. {visitor1: 1, visitor2: 2} instead of the incoming
                # [{visitor1: 1}, {visitor2: 1}, {visitor2: 1}]
        yield page, agg_dictionary

    def reducer(self, page, agg_dictionary):
        # Similar to combiner
        final_dict = Counter()
        for x in agg_dictionary:
            final_dict.update(x)
        # But now (assuming a single reducer) we have to filter, keeping only
            # the most frequent visitor
        maximum = max(final_dict.values())
        # Just in case there's a tie (which is the case in this dataset)
            # keep a list of most frequent visitorS
        MFvisitor_list = []
        for visitor, visits in final_dict.iteritems():
            # Only put most frequent visitor(s) in the list
            if visits == maximum:
                MFvisitor_list.append(visitor)
        # Sort visitors by ID
        MFvisitor_list.sort()
        # Output:
            # key: webpage (it contains ID and URL in a single string)
            # value: a list of most frequent visitors
                # (Could also report the (maximum) count of visits)
        yield page, MFvisitor_list
        
if __name__ == '__main__':
    MostFreqVisitor.run()

Overwriting MostFreqVisitor.py


In [12]:
# Import class
from MostFreqVisitor import MostFreqVisitor

# Use file generated in HW4.2
mr_job = MostFreqVisitor(args=['preprocessed_anonymous-msweb.data'])
with mr_job.make_runner() as runner:
    runner.run()
    # For every webpage in the dataset
    for line in runner.stream_output():
        # Create a file identified just by the webpageID (4 digits)
        page_id = mr_job.parse_output_line(line)[0][:4]
        # Write the file (in the folder created at the beginning)
        with open('MostFrequentVisitors/FrequentVisitors.'+page_id,'w') as f:
            # First line with webpage info, then one line per most frequent user
            f.writelines(mr_job.parse_output_line(line)[0] + '\n' + 
                        '\n'.join([x for x in \
                                   mr_job.parse_output_line(line)[1]]) + '\n')

Let's check what we've got:

In [13]:
# Check number of files (should be the number obtained in HW4.3)
!ls MostFrequentVisitors/FrequentVisitors.* | wc -l

285


In [14]:
# Check number of lines in all files: should be the number of visits since
    # all visitors are the most frequent in this dataset
# But delete 1st line in each file!!!
# Should be the number obtained in HW4.3
!cd MostFrequentVisitors; lines=$(find . -type f -exec cat {} + | wc -l); \
    files=$(ls | wc -l); echo $(echo $lines-$files | bc)

98654


**OUTPUT**: Since almost 100,000 lines seems excessive, I only show here the 4 most frequent visitors of a few webpages.

I've selected 10 webpageIDs: the first five (1000–1004) and 1120–1124 (because some of them had fewer than 4 visitors and hence the output is shorter).

In [15]:
for i in range(0,5):
    !head -5 MostFrequentVisitors/FrequentVisitors.100$i
for i in range(0,5):
    !head -5 MostFrequentVisitors/FrequentVisitors.112$i

1000: /regwiz
10001
10010
10039
10073
1001: /support
10001
10002
10003
10020
1002: /athome
10001
10019
10020
10031
1003: /kb
10002
10003
10006
10019
1004: /search
10003
10006
10008
10018
1120: /switch
10241
1121: /magazine
10241
10522
13153
13595
1122: /mindshare
10243
21553
21782
23581
1123: /germany
10254
10267
10326
10335
1124: /industry
10263
10602
10607
10697


# HW4.5

**Here you will use a different dataset consisting of word-frequency distributions for 1,000 Twitter users. These Twitter users use language in very different ways, and were classified by hand according to the criteria:**

**0: Human, where only basic human-human communication is observed.**

**1: Cyborg, where language is primarily borrowed from other sources (e.g., jobs listings, classifieds postings, advertisements, etc...).**

**2: Robot, where language is formulaically derived from unrelated sources (e.g., weather/seismology, police/fire event logs, etc...).**

**3: Spammer, where language is replicated to high multiplicity (e.g., celebrity obsessions, personal promotion, etc... )**

**The main data lie in the accompanying file: topUsers_Apr-Jul_2014_1000-words.txt, and are of the form:**

    USERID,CODE,TOTAL,WORD1_COUNT,WORD2_COUNT,...

**where**

    USERID = unique user identifier
    CODE = 0/1/2/3 class code
    TOTAL = sum of the word counts

**Using this data, you will implement a 1000-dimensional K-means algorithm on the users by their 1000-dimensional word stripes/vectors using several centroid initializations and values of K.**

**Note that each "point" is a user as represented by 1000 words, and that word-frequency distributions are generally heavy-tailed power-laws (often called Zipf distributions), and are very rare in the larger class of discrete, random distributions. For each user you will have to normalize by its "TOTAL" column. Try several parameterizations and initializations:**

**(A) K=4 uniform random centroid-distributions over the 1000 words**

**(B) K=2 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution**

**(C) K=4 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution** 

**(D) K=4 "trained" centroids, determined by the sums across the classes.**

**and iterate until a threshold (try 0.001) is reached.**

**After convergence, print out a summary of the classes present in each cluster. In particular, report the composition as measured by the total portion of each class type (0-3) contained in each cluster, and discuss your findings and any differences in outcomes across parts A-D.**

**Note that you do not have to compute the aggregated distribution or the class-aggregated distributions, which are rows in the auxiliary file: topUsers_Apr-Jul_2014_1000-words_summaries.txt**

First, let's normalize the data for each user (and print a summary of the dataset):

In [16]:
import numpy as np

num_users = 0

with open('topUsers_Apr-Jul_2014_1000-words.txt','r') as dataset:
    # For each user
    for ind,line in enumerate(dataset):
        num_users += 1
        line = line.split(',')
        if ind==0:
            num_coord = len(line)-3
        # 1st value is userID
        user = [int(line[0])]
        # Append 2nd value (class code)
        user.append(int(line[1]))
        # Append the 1000 coordinates normalized by the total
        for i in range(num_coord):
            user.append(float(line[i+3])/int(line[2]))
        # Convert to nparray
        user = np.array(user).reshape(1,num_coord+2)
        if ind==0:
            data=user
        else:
            data=np.append(data,user,axis=0)
np.savetxt('normalized_dataset.csv',data,delimiter = ",")

# Occurrences of each code
for i in range(4):
    occurrences = sum(data[:,1]==i)
    portion = 100 * float(occurrences) / num_users
    print 'User with code {0}: {1:3} ({2:4.1f}%)'.format(i, occurrences, 
                                                         portion)

User with code 0: 752 (75.2%)
User with code 1:  91 ( 9.1%)
User with code 2:  54 ( 5.4%)
User with code 3: 103 (10.3%)


Define the MRJob process that we'll run for each part (A-D):

In [17]:
%%writefile Kmeans.py
from numpy import argmin, array, random
from mrjob.job import MRJob
from mrjob.step import MRStep
from itertools import chain

#Find the nearest centroid for a data point
def MinDist(datapoint, centroid_points):
    datapoint = array(datapoint)
    centroid_points = array(centroid_points)
    diff = datapoint - centroid_points 
    diffsq = diff*diff
    # Get the nearest centroid for each instance
    minidx = argmin(list(diffsq.sum(axis = 1)))
    return minidx

#Check whether centroids converge
def stop_criterion(centroid_points_old, centroid_points_new,T):
    oldvalue = list(chain(*centroid_points_old))
    newvalue = list(chain(*centroid_points_new))
    Diff = [abs(x-y) for x, y in zip(oldvalue, newvalue)]
    Flag = True
    for i in Diff:
        if(i>T):
            Flag = False
            break
    return Flag

class MRKmeans(MRJob):
    centroid_points=[]
    k=4
            
    def steps(self):
        return [
            MRStep(mapper_init = self.mapper_init, mapper=self.mapper,
                   combiner = self.combiner, reducer = self.reducer)
               ]
    
    #load centroids info from file
    def mapper_init(self):
        self.centroid_points = [map(float,s.split('\n')[0].split(',')) \
                                for s in open('/tmp/Centroids2.txt').\
                                readlines()]
        open('/tmp/Centroids2.txt', 'w').close()
    
    #load data and output the nearest centroid index and data point 
    def mapper(self, _, line):
        # Get all components for each record
        D = (map(float,line.split(',')))
        # 1st component is the user ID, 2nd is the code, the rest are the 
            # coordinates
        coord = D[2:]
        # Get the code (class) of the user
        code = int(D[1])
        # Include in the value a binary vector of length 4 (one per class):
            # 1 for the corresponding code, 0 for the rest
        code_info = [0]*4
        code_info[code]=1
        # Append 1 (added 1 element to the cluster output as key)
        code_info.append(1)
        # Output:
            # key: cluster assigned
            # value: tuple with the thousand coordinates plus a 4-vector with 
                # the code/class plus 1
        yield int(MinDist(coord,self.centroid_points)), tuple(coord+code_info)
    
    #Combine sum of data points locally
    def combiner(self, idx, inputdata):
        sum_coord = [0]*1000
        count_code = [0]*4
        num = 0
        for x in inputdata:
            for i in range(-1,-5,-1):
                count_code[i] = count_code[i] + x[i-1]
            for i in range(1000):
                sum_coord[i] = sum_coord[i] + x[i]
            num = num + x[-1]
        yield idx,(sum_coord,count_code,num)

    #Aggregate sum for each cluster and then calculate the new centroids
    #(expects the (sum_coord, count_code, num) triples emitted by the combiner)
    def reducer(self, idx, inputdata): 
        centroids = []
        sum_coord = [0]*1000
        count_code = [0] * 4
        num = [0]*self.k
        for i in range(self.k):
            centroids.append([0]*1000)
        for x in inputdata:
            for i in range(-1,-5,-1):
                count_code[i] = count_code[i] + x[1][i]
            for i in range(1000):
                sum_coord[i] = sum_coord[i] + x[0][i]
            num[idx] = num[idx] + x[2]
        centroids[idx] = sum_coord
        for i in range(1000):
            centroids[idx][i] = centroids[idx][i]/num[idx]
        with open('/tmp/Centroids2.txt', 'a') as f:
            f.writelines(','.join([str(x) for x in centroids[idx]]) + '\n')
        yield idx,(centroids[idx],count_code)

if __name__ == '__main__':
    MRKmeans.run()

Overwriting Kmeans.py
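
The logic of one iteration can also be followed on a toy example in plain Python. This is a minimal sketch, not the job above: `nearest` and `kmeans_step` are hypothetical helpers, the points are made up, and keeping the old centroid for an empty cluster is an assumption of this sketch (the MRJob version simply emits nothing for such a cluster).

```python
def nearest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                 for centroid in centroids]
    return distances.index(min(distances))


def kmeans_step(points, centroids):
    """Assign every point to its nearest centroid and recompute the means."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in points:                      # "map": pick a cluster
        idx = nearest(point, centroids)
        counts[idx] += 1
        for i, value in enumerate(point):     # "combine/reduce": partial sums
            sums[idx][i] += value
    new_centroids = []
    for idx, total in enumerate(sums):
        if counts[idx] == 0:                  # assumption: keep old centroid
            new_centroids.append(list(centroids[idx]))
        else:
            new_centroids.append([s / counts[idx] for s in total])
    return new_centroids


points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
centroids = kmeans_step(points, [[1.0, 1.0], [9.0, 9.0]])
```

Repeating `kmeans_step` until no coordinate changes by more than the threshold reproduces the loop that the driver runs around the job.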


## (A) K=4 uniform random centroid-distributions over the 1000 words

In [30]:
from numpy import random

#Generate initial centroids
centroid_points=[]
k = 4
for i in range(k):
    numbers = random.sample(1000)
    numbers = numbers / sum(numbers)
    centroid_points.append(numbers)
with open('/tmp/Centroids2.txt', 'w+') as f:
        f.writelines(','.join(str(j) for j in i) + '\n' for i in \
                     centroid_points)

The driver will be common to the 4 approaches; the only thing that changes is the seeds we use as initial centroids (and the number of clusters). Hence, it seems appropriate to define a function to be called each time.

In [31]:
# Define a function that we'll run for each part (A-D)
def hw45(k):
    import numpy as np
    from Kmeans import MRKmeans, stop_criterion
    mr_job = MRKmeans(args=['normalized_dataset.csv'])

    # Update centroids iteratively
    it = 0 # count of iterations
    count_code=[[0]*4]*k
    while(1):
        # save previous centroids to check convergence
        centroid_points_old = centroid_points[:]
        with mr_job.make_runner() as runner: 
            runner.run()
            # stream_output: get access of the output 
            for i,line in enumerate(runner.stream_output()):
                key,value =  mr_job.parse_output_line(line)
                centroid_points[key] = value[0]
                count_code[key] = value[1]
        it = it + 1
        if(stop_criterion(centroid_points_old,centroid_points,0.001)):
            break

    # Print iterations
    print 'Iterations until convergence: {}\n'.format(it)

    # Summary of classes in each cluster
    count_code = np.array(count_code) # a k x 4 matrix
    # Total users of each class (sum over clusters, i.e., along axis 0)
    total_code = np.sum(count_code, axis = 0).astype(float)
    portion = count_code/total_code*100

    classifier =['  Class 0 ', '  Class 1 ', '  Class 2 ', '  Class 3 ']
    print '---------------------------------------------------------------------'
    print '|Percentage of users of |{}|{}|{}|{}|'.format(*classifier)
    print '---------------------------------------------------------------------'
    for j in range(k):
        portion_row = [j]
        portion_row.extend(list(portion[j,:]))
        print '|    in Cluster {}       |{:8.2f}% |{:8.2f}% |{:8.2f}% '\
            '|{:8.2f}% |'.format(*portion_row)
    print '---------------------------------------------------------------------'
    
    total_code = total_code.astype(int)
    # Total users in each cluster (sum over classes, i.e., along axis 1)
    total_cluster = list(np.sum(count_code, axis = 1).reshape(k,))
    print '\n'
    print '-----------------------------------------------------------------'
    print '|Number of users of |{}|{}|{}|{}|'.format(*classifier)
    print '----------------------------------------------------------------------------'
    for j in range(k):
        count_row = [j]
        count_row.extend(list(count_code[j,:]))
        count_row.extend([total_cluster[j]])

        print '|    in Cluster {}   |{:9} |{:9} |{:9} |{:9} |{:9} |'\
            .format(*count_row)
    print '----------------------------------------------------------------------------'
    print '                    |{:9} |{:9} |{:9} |{:9} |'.format(*total_code)
    print '                    ---------------------------------------------'
    
    # Purity
    majority = np.max(count_code, axis = 1)
    purity = float(np.sum(majority)) / np.sum(count_code)
    print '\nPURITY:    {:.3f}'.format(purity)
    
    # Since there are many more cases of class 0 (humans), search what's the 
        # cluster most of them are assigned to
    code0_cluster = np.argmax(count_code[:,0])
    # Estimate False and True Positives and Negatives, and send them as output 
        # of the function.
    # My reasoning is largely explained at the end of HW4.5
    
    # 1st column of the matrix count_code corresponds to class 0
    TP = count_code[code0_cluster,0]
    # Rest of the elements in the row are FPs (robots, spammers... assigned to 
        # the cluster where the majority of humans were assigned to)
    FP = np.sum(count_code[code0_cluster,:]) - TP
    # Rest of the elements in the 1st column are FNs (humans assigned to other 
        # clusters)
    FN = np.sum(count_code[:,0]) - TP
    TN = np.sum(count_code) - TP - FP - FN
    return [TP,FP,FN,TN]
    
[TP,FP,FN,TN] = hw45(k)

# Define a function that calculates Accuracy, Precision, and Recall, from 
    #previous output
def metrics(TP,FP,FN,TN):
    print 'ACCURACY:  {:.3f}'.format(float(TP+TN)/(TP+FP+FN+TN))
    print 'PRECISION: {:.3f}'.format(float(TP)/(TP+FP))
    print 'RECALL:    {:.3f}'.format(float(TP)/(TP+FN))

metrics(TP,FP,FN,TN)
# The results are commented at the end of HW4.5

Iterations until convergence: 5

---------------------------------------------------------------------
|Percentage of users of |  Class 0 |  Class 1 |  Class 2 |  Class 3 |
---------------------------------------------------------------------
|    in Cluster 0       |   98.14% |    3.30% |   20.37% |   96.12% |
|    in Cluster 1       |    1.73% |    0.00% |    9.26% |    0.00% |
|    in Cluster 2       |    0.00% |   56.04% |    0.00% |    0.00% |
|    in Cluster 3       |    0.13% |   40.66% |   70.37% |    3.88% |
---------------------------------------------------------------------


-----------------------------------------------------------------
|Number of users of |  Class 0 |  Class 1 |  Class 2 |  Class 3 |
----------------------------------------------------------------------------
|    in Cluster 0   |      738 |        3 |       11 |       99 |      851 |
|    in Cluster 1   |       13 |        0 |        5 |        0 |       18 |
|    in Cluster 2   |        0 |       51 

##(B) K=2 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution

Now we just have to generate the seeds and call the function previously defined. We'll call it inside another function, since parts (B) and (C) are almost identical.

In [32]:
from numpy import random
import numpy as np
with open('topUsers_Apr-Jul_2014_1000-words_summaries.txt','r') as summary:
    for line in summary:
        line = line.split(',')
        if line[0]=='ALL_CODES':
            coord = [float(x)/int(line[2]) for x in line[3:]]

def hw45BC(coord, k):
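    # Uniform noise in [0, 0.001), one row per centroid
    array_centroid_points = random.sample(k*1000).reshape(k,1000)*0.001
    # Perturb the aggregated (user-wide) distribution with that noise
    array_centroid_points = array_centroid_points + np.array(coord)

    # Renormalize each row so that it sums to 1
    array_centroid_points = (array_centroid_points.T /
                             array_centroid_points.sum(axis=1)).T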

    centroid_points = []
    for i in range(k):
        centroid_points.append(list(array_centroid_points[i,:]))

    with open('/tmp/Centroids2.txt', 'w+') as f:
        f.writelines(','.join(str(j) for j in i) + '\n'
                     for i in centroid_points)

    [TP,FP,FN,TN] = hw45(k)
    return [TP,FP,FN,TN]

[TP,FP,FN,TN] = hw45BC(coord, 2)

metrics(TP,FP,FN,TN)

Iterations until convergence: 4

---------------------------------------------------------------------
|Percentage of users of |  Class 0 |  Class 1 |  Class 2 |  Class 3 |
---------------------------------------------------------------------
|    in Cluster 0       |    0.13% |   96.70% |   74.07% |    3.88% |
|    in Cluster 1       |   99.87% |    3.30% |   25.93% |   96.12% |
---------------------------------------------------------------------


-----------------------------------------------------------------
|Number of users of |  Class 0 |  Class 1 |  Class 2 |  Class 3 |
----------------------------------------------------------------------------
|    in Cluster 0   |        1 |       88 |       40 |        4 |      133 |
|    in Cluster 1   |      751 |        3 |       14 |       99 |      867 |
----------------------------------------------------------------------------
                    |      752 |       91 |       54 |      103 |
                    -------------------

##(C) K=4 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution

In [33]:
[TP,FP,FN,TN] = hw45BC(coord, 4)

metrics(TP,FP,FN,TN)

Iterations until convergence: 8

---------------------------------------------------------------------
|Percentage of users of |  Class 0 |  Class 1 |  Class 2 |  Class 3 |
---------------------------------------------------------------------
|    in Cluster 0       |   26.20% |    2.20% |   25.93% |   24.27% |
|    in Cluster 1       |    0.13% |   40.66% |   70.37% |    3.88% |
|    in Cluster 2       |   73.67% |    1.10% |    0.00% |   71.84% |
|    in Cluster 3       |    0.00% |   56.04% |    3.70% |    0.00% |
---------------------------------------------------------------------


-----------------------------------------------------------------
|Number of users of |  Class 0 |  Class 1 |  Class 2 |  Class 3 |
----------------------------------------------------------------------------
|    in Cluster 0   |      197 |        2 |       14 |       25 |      238 |
|    in Cluster 1   |        1 |       37 |       38 |        4 |       80 |
|    in Cluster 2   |      554 |        1 

##(D) K=4 "trained" centroids, determined by the sums across the classes.

In [34]:
from numpy import random
import numpy as np

coord = []
with open('topUsers_Apr-Jul_2014_1000-words_summaries.txt','r') as summary:
    for line in summary:
        line = line.split(',')
        if line[0]=='CODE':
            coord.append([float(x)/int(line[2]) for x in line[3:]])

with open('/tmp/Centroids2.txt', 'w+') as f:
    f.writelines(','.join(str(j) for j in i) + '\n' for i in coord)

[TP,FP,FN,TN] = hw45(4)

metrics(TP,FP,FN,TN)

Iterations until convergence: 5

---------------------------------------------------------------------
|Percentage of users of |  Class 0 |  Class 1 |  Class 2 |  Class 3 |
---------------------------------------------------------------------
|    in Cluster 0       |   99.60% |    3.30% |   25.93% |   36.89% |
|    in Cluster 1       |    0.00% |   56.04% |    0.00% |    0.00% |
|    in Cluster 2       |    0.13% |   40.66% |   74.07% |    3.88% |
|    in Cluster 3       |    0.27% |    0.00% |    0.00% |   59.22% |
---------------------------------------------------------------------


-----------------------------------------------------------------
|Number of users of |  Class 0 |  Class 1 |  Class 2 |  Class 3 |
----------------------------------------------------------------------------
|    in Cluster 0   |      749 |        3 |       14 |       38 |      804 |
|    in Cluster 1   |        0 |       51 |        0 |        0 |       51 |
|    in Cluster 2   |        1 |       37 

Here's the summary for each approach:

(A) K=4 uniform random centroid-distributions over the 1000 words

    PURITY:    0.840
    ACCURACY:  0.873
    PRECISION: 0.867
    RECALL:    0.981

(B) K=2 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution 

    PURITY:    0.839
    ACCURACY:  0.883
    PRECISION: 0.866
    RECALL:    0.999

(C) K=4 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution

    PURITY:    0.840
    ACCURACY:  0.727
    PRECISION: 0.881
    RECALL:    0.737

(D) K=4 "trained" centroids, determined by the sums across the classes.

    PURITY:    0.901
    ACCURACY:  0.942
    PRECISION: 0.932
    RECALL:    0.996

The results do not differ too much, but they're definitely better when we use "trained" centroids.

**Purity** is not always a good indication of the "quality" of the clusters, especially when there is a large number of them, but that is not the case here. It's about 0.84 for the first three approaches, and noticeably larger, about 0.90, for the "trained" centroids.
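For reference, the purity computed in `hw45` can be written as follows, where $n_{kj}$ is the number of users of class $j$ assigned to cluster $k$ and $N$ is the total number of users (these symbols are introduced here for illustration only). The second line applies it to the counts reported for part (B):

$$\text{purity} = \frac{1}{N} \sum_{k} \max_{j} n_{kj}$$

$$\text{purity}_{(B)} = \frac{88 + 751}{1000} = 0.839$$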

To define **Accuracy**, **Precision**, and **Recall** I didn't follow the definitions given in "*Introduction to Information Retrieval*" but the typical definitions in classification problems, with humans (class 0) as the positive class: a true positive corresponds to a human properly assigned to the cluster where humans are the norm, a false negative to a human assigned to a cluster dominated by the other classes, a true negative to any cyborg, robot, or spammer assigned to a cluster where humans are the exception, and a false positive to the opposite situation (a cyborg, robot, or spammer assigned to the humans' cluster). I've followed this approach because my guess is that in this kind of problem (as in the spam vs. ham classification), a false negative (i.e., discarding a human message) is much worse or more inconvenient than a false positive (accepting undesired messages from the other classes). Hence, the most important metric would probably be the **Recall**.

I think it's not surprising that the 2 clusters obtained in part (B) have an almost perfect recall: human messages that, with more clusters, might have been spread across different clusters are in this case assigned to one "big cluster." This big cluster also contains messages from other classes, and hence its accuracy and precision are lower than those achieved in (D), as expected. The purity is also not too high, because only 2 classes (0: humans; and 1: cyborgs) can be dominant in the 2 clusters; robots are mostly assigned to the same cluster as cyborgs, and spammers are mainly assigned to the same cluster as humans. This happens in the 4 approaches.
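As a quick check of these definitions, here is a minimal sketch that recomputes the four counts and the three metrics from the contingency table reported for part (B). It uses plain Python lists instead of NumPy, and the names `confusion_counts` and `human_cluster` are hypothetical helpers introduced only for this illustration; the logic is meant to mirror what `hw45` does, not to replace it.

```python
# Contingency table reported for part (B):
# rows are clusters, columns are classes 0-3 (class 0 = humans)
count_code = [
    [1, 88, 40, 4],      # cluster 0
    [751, 3, 14, 99],    # cluster 1
]

def confusion_counts(table):
    """Return (TP, FP, FN, TN), treating class 0 (humans) as the positive class."""
    humans = [row[0] for row in table]
    # Cluster that holds most of the humans
    human_cluster = humans.index(max(humans))
    total = sum(sum(row) for row in table)
    tp = table[human_cluster][0]            # humans in the humans' cluster
    fp = sum(table[human_cluster]) - tp     # non-humans in the humans' cluster
    fn = sum(humans) - tp                   # humans in any other cluster
    tn = total - tp - fp - fn               # non-humans in any other cluster
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(count_code)
accuracy = float(tp + tn) / (tp + fp + fn + tn)
precision = float(tp) / (tp + fp)
recall = float(tp) / (tp + fn)

print('TP={} FP={} FN={} TN={}'.format(tp, fp, fn, tn))
print('ACCURACY:  {:.3f}'.format(accuracy))
print('PRECISION: {:.3f}'.format(precision))
print('RECALL:    {:.3f}'.format(recall))
```

With the (B) table this reproduces the values listed in the summary (0.883, 0.866, and 0.999): only one human falls outside the big cluster, which is why the recall is almost perfect.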

The poor results in (C) are not very surprising: we just added noise to the centroid of the single cluster that covers all classes, so the starting points are not dissimilar enough. I scaled down the perturbation by a factor of 0.001, but even when using an unscaled $U(0,1)$ distribution the results are pretty similar.

Overall, the results of the 4 approaches have surprised me, given the relatively high dimensionality of the problem: if we think of the clusters as hyperspheres, two members of a cluster in a high-dimensional space are very likely to be closer to the border of the hypersphere than to each other (or to its centroid).
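A rough way to quantify that intuition, under the idealized assumption (not derived from this data set) that the members of a cluster are spread uniformly inside a $d$-dimensional hypersphere of radius $R$: the fraction of points within a distance $rR$ of the center, with $0 \le r \le 1$, equals the ratio of the two volumes, which is $r^d$.

$$P(\lVert x \rVert \le rR) = \frac{\mathrm{Vol}_d(rR)}{\mathrm{Vol}_d(R)} = r^{d}, \qquad 0.99^{1000} \approx 4.3 \times 10^{-5}$$

So with $d = 1000$ words, practically all of the volume would lie in the outermost 1% shell of the hypersphere.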

In [5]:
!aws s3 mb s3://ucb-mids-mls-juanjocarin/tmp/HW74

make_bucket failed: s3://ucb-mids-mls-juanjocarin/tmp/HW74/ A client error (TooManyBuckets) occurred when calling the CreateBucket operation: You have attempted to create more buckets than allowed

