This notebook requires some nbextensions.

* [toc2](https://github.com/ipython-contrib/IPython-notebook-extensions/tree/master/nbextensions/usability/toc2) provides a button to create a floating table of contents
* [toggle_all_line_numbers](https://github.com/ipython-contrib/IPython-notebook-extensions/tree/master/nbextensions/usability/toggle_all_line_numbers) provides a button to see line numbers for all code cells
* [autosaveclasses](https://github.com/holatuwol/jupyter-magic/tree/master/nbextensions/autosaveclasses.js) avoids usage of `%%writefile` (cells with a class definition are saved to disk when run)

If they are not yet installed, run the following cell and restart the notebook server.

In [1]:
%%bash
nbextdl() {
    mkdir -p $(ipython locate)/nbextensions/$(dirname $2)
    curl --silent -L \
        "https://raw.githubusercontent.com/$1/master/nbextensions/$2" \
        > "$(ipython locate)/nbextensions/$2"
}

nbextdl ipython-contrib/IPython-notebook-extensions usability/toc2/main.js
nbextdl ipython-contrib/IPython-notebook-extensions usability/toc2/main.css
nbextdl ipython-contrib/IPython-notebook-extensions usability/toc2/icon.png
nbextdl ipython-contrib/IPython-notebook-extensions usability/toc2/image.png

nbextdl ipython-contrib/IPython-notebook-extensions usability/toggle_all_line_numbers/main.js
nbextdl ipython-contrib/IPython-notebook-extensions usability/toggle_all_line_numbers/icon.png

nbextdl holatuwol/jupyter-magic autosaveclasses.js

Load the extensions.

In [2]:
%%javascript
require(['base/js/utils'], function(utils) {
    utils.load_extensions('usability/toc2/main');
    utils.load_extensions('usability/toggle_all_line_numbers/main');
    utils.load_extensions('autosaveclasses');
});

<IPython.core.display.Javascript object>

Also set the base folder to use for storing files in HDFS.

In [3]:
hdfs_base_folder = '/tmp'

# HW 4.0. 

> What is MrJob? How is it different to Hadoop MapReduce? 

MRJob is a framework provided by Yelp as an alternate way to write Hadoop streaming jobs.

It allows you to write the mappers, combiners, and reducers as a single Python class. In doing so, it eliminates the boilerplate line parsing code found in Hadoop streaming jobs. It also provides a "map reduce step" (``MRStep``) abstraction that allows you to chain together a pre-determined set of map reduce steps.

MRJob is more limited than standard MapReduce in the sense that only those abstractions that have been implemented in the framework are available (for example, secondary sort does not work cleanly).

> What are the ``mapper_init()``, ``mapper_final()``, ``combiner_init()``, ``combiner_final()``, ``reducer_init()``, ``reducer_final()`` methods? When are they called?

``mapper_init``, ``combiner_init``, and ``reducer_init`` are called before the mapper, combiner, and reducer phases, respectively. ``mapper_final``, ``combiner_final``, and ``reducer_final`` are called after the mapper, combiner, and reducer phases, respectively.

These functions are used as part of the setup and tear-down of a MapReduce phase. They are equivalent to the commands that run before and after the stream iteration loop in a traditional Hadoop streaming job.

# HW 4.1

> What is serialization in the context of MrJob or Hadoop?

Serialization is the process of encoding the data between each of the different phases of a MapReduce job.

> When it used in these frameworks?

Serialization is used whenever data is written to persistent storage (such as in creating spill files or creating output files) or transmitted over over the network.

> What is the default serialization mode for input and outputs for MrJob?

* The default serialization mode for the input protocol (data sent during the first mapping phase) is ``RawValueProtocol``, which assumes that there are no keys and the whole line should be treated as the value.
* The default serialization mode for internal communication (data sent between phases) is ``JSONProtocol`` which encodes the data in JSON.
* The default serialization mode for the output protocol (the final step) is ``JSONProtocol`` which encodes the data in JSON.

# HW 4.2

> Recall the Microsoft logfiles data from the async lecture. The logfiles are described are located at:

> * https://kdd.ics.uci.edu/databases/msweb/msweb.html
> * http://archive.ics.uci.edu/ml/machine-learning-databases/anonymous/

> This dataset records which areas (Vroots) of www.microsoft.com each user visited in a one-week timeframe in Feburary 1998.

> Here, you must preprocess the data on a single node (i.e., not on a cluster of nodes) from the format:

    C,"10001",10001   #Visitor id 10001
    V,1000,1          #Visit by Visitor 10001 to page id 1000
    V,1001,1          #Visit by Visitor 10001 to page id 1001
    V,1002,1          #Visit by Visitor 10001 to page id 1002
    C,"10002",10002   #Visitor id 10001

> Note: #denotes comments

> to the format:

    V,1000,1,C,10001
    V,1001,1,C,10001
    V,1002,1,C,10001

> Write the python code to accomplish this.

## Download the file from Dropbox

In [4]:
!wget --quiet https://www.dropbox.com/sh/m0nxsf4vs5cyrp2/AADCHtrJ4CBCDO1po_OAWg0ia/anonymous-msweb.data

## Reformat the data file

In [5]:
from csv import reader

input_file_name = 'anonymous-msweb.data'

visitor_file = open(input_file_name + '.visitor', 'w')
webpage_file = open(input_file_name + '.webpage', 'w')

with open(input_file_name) as input_file:
    case_id = None

    # Update the case ID when we see a new case (line with a C).
    # Output the vote when we see a new vote (line with a V).
    # Skip all other lines.

    for row in reader(input_file):
        if row[0] == 'A':
            print >> webpage_file, ','.join(row)
            continue

        if row[0] == 'C':
            case_id = row[1]
            continue

        if row[0] == 'V':
            row.extend(['C', case_id])
            print >> visitor_file, ','.join(row)

visitor_file.close()
webpage_file.close()

In [6]:
!head anonymous-msweb.data.visitor

V,1000,1,C,10001
V,1001,1,C,10001
V,1002,1,C,10001
V,1001,1,C,10002
V,1003,1,C,10002
V,1001,1,C,10003
V,1003,1,C,10003
V,1004,1,C,10003
V,1005,1,C,10004
V,1006,1,C,10005


# HW 4.3

> Find the 5 most frequently visited pages using MrJob from the output of 4.2 (i.e., transformed log file).

## Create the job

In [7]:
#!/usr/bin/env python
from mrjob.job import MRJob
from mrjob.protocol import RawProtocol
from mrjob.step import MRStep

import csv
import sys

class mrjob_43(MRJob):
    
    """
    Mapper yields the vroot and a 1 (count).
    """
    def mapper_get_visits(self, _, line):
        row = csv.reader([line]).next()

        if len(row) != 5:
            return

        vroot_id = row[1]
        yield vroot_id, 1
    
    """
    Combiner sums the visit counts for the given vroot.
    """
    def combiner_get_visits(self, vroot_id, visits):
        total_visits = sum(visits)
        yield vroot_id, total_visits
    
    """
    Reducer sums the visit counts for the given vroot.
    """
    def reducer_get_visits(self, vroot_id, visits):
        total_visits = sum(visits)
        yield vroot_id, total_visits

    """
    Mapper pushes all the vroot-visits pairs to the same reducer, flipping the
    tuple. Pad the number with zeros so that MRJob reverse order works.
    """
    def mapper_top5(self, vroot_id, visits):
        padded_visits = '%06d' % (visits)
        yield None, (padded_visits, vroot_id)

    """
    Initialize the webpage URLs.
    """
    def reducer_top5_init(self):
        self.vroot_names = {}
        
        with open('anonymous-msweb.data.webpage', 'r') as webpage_file:
            for row in csv.reader(webpage_file):
                if len(row) != 5:
                    continue

                vroot_id = row[1]
                vroot_name = row[3].strip()

                self.vroot_names[vroot_id] = vroot_name
        
    """
    Reducer identifies the top 5 by relying on the job configuration's sort and
    simply emitting the first 5. Also make sure to remove the padded zeros.
    """
    def reducer_top5(self, _, pairs):
        emits_remaining = 5
        
        for pair in pairs:
            if emits_remaining == 0:
                continue
            
            emits_remaining -= 1

            visits = int(pair[0])

            vroot_id = pair[1]
            vroot_name = self.vroot_names[vroot_id]
            
            yield visits, vroot_name

    """
    Declare the two-step map reduce. Since we are using None as the key, it
    won't get emitted, and -k1 corresponds to the value.
    """
    def steps(self):
        return [
            MRStep(
                mapper = self.mapper_get_visits,
                combiner = self.combiner_get_visits,
                reducer = self.reducer_get_visits),
            MRStep(
                mapper = self.mapper_top5,
                reducer_init = self.reducer_top5_init,
                reducer = self.reducer_top5,
                jobconf = {
                    'stream.num.map.output.key.fields' : 2,
                    'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                    'mapred.text.key.comparator.options': '-k1r',
                    'mapred.reduce.tasks': 1
                })
        ]

if __name__ == '__main__' and sys.argv[0].find('ipykernel') == -1:
    mrjob_43().run()

## Run the job

In [None]:
!hdfs dfs -rm -r -f -skipTrash $hdfs_base_folder/mrjob_43_output > /dev/null

!python mrjob_43.py -r hadoop \
    --strict-protocols \
    --file=anonymous-msweb.data.webpage \
    --output-dir=$hdfs_base_folder/mrjob_43_output \
    --no-output \
    anonymous-msweb.data.visitor \
    > /dev/null 2>&1

## Confirm the output

In [None]:
!hdfs dfs -cat $hdfs_base_folder/mrjob_43_output/*

# HW 4.4

> Find the most frequent visitor of each page using MrJob and the output of 4.2  (i.e., transfromed log file). In this output please include the webpage URL, webpageID and Visitor ID.

## Create the job

In [None]:
#!/usr/bin/env python
from mrjob.job import MRJob
from mrjob.step import MRStep

import csv
import sys

class mrjob_44(MRJob):
    
    """
    Mapper yields the vroot and the visitor.
    """
    def mapper_top1(self, _, line):
        row = csv.reader([line]).next()

        vroot_id = row[1]
        visitor_id = row[4]
        
        yield vroot_id, visitor_id

    """
    Initialize the webpage URLs.
    """
    def reducer_top1_init(self):
        self.vroot_urls = {}
        
        with open('anonymous-msweb.data.webpage', 'r') as webpage_file:
            for row in csv.reader(webpage_file):
                if len(row) != 5:
                    continue

                vroot_id = row[1]
                vroot_url = row[4].strip()

                self.vroot_urls[vroot_id] = vroot_url
        
    """
    Reducer finds the most frequent visitor for the given vroot.
    """
    def reducer_top1(self, vroot_id, visitor_ids):
        top_visitor = { 'id': None, 'visits': 0 }
        current_visitor = { 'id': None, 'visits': 0 }
        
        for visitor_id in visitor_ids:
            if visitor_id == current_visitor['id']:
                current_visitor['visits'] += 1
                continue
            
            if current_visitor['visits'] > top_visitor['visits']:
                top_visitor = current_visitor
            
            current_visitor = { 'id': visitor_id, 'visits': 1 }
        
        if current_visitor['visits'] > top_visitor['visits']:
            top_visitor = current_visitor

        vroot_url = self.vroot_urls[vroot_id]
        yield (vroot_id, vroot_url), top_visitor['id']

    """
    Declare the one-step map reduce.
    """
    def steps(self):
        return [
            MRStep(
                mapper = self.mapper_top1,
                reducer_init = self.reducer_top1_init,
                reducer = self.reducer_top1,
                jobconf = {
                    'stream.num.map.output.key.fields' : 2,
                    'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                    'mapred.text.key.comparator.options': '-k1 -k2'
                })
        ]
        
if __name__ == '__main__' and sys.argv[0].find('ipykernel') == -1:
    mrjob_44().run()

## Run the job

In [None]:
!hdfs dfs -rm -r -f -skipTrash $hdfs_base_folder/mrjob_44_output > /dev/null

!python mrjob_44.py -r hadoop \
    --strict-protocols \
    --file=anonymous-msweb.data.webpage \
    --output-dir=$hdfs_base_folder/mrjob_44_output \
    --no-output \
    anonymous-msweb.data.visitor \
    > /dev/null 2>&1

## Confirm the output

In [None]:
!hdfs dfs -cat $hdfs_base_folder/mrjob_44_output/*

# HW 4.5

> Here you will use a different dataset consisting of word-frequency distributions 
for 1,000 Twitter users. These Twitter users use language in very different ways,
and were classified by hand according to the criteria:

> * 0: Human, where only basic human-human communication is observed.
> * 1: Cyborg, where language is primarily borrowed from other sources (e.g., jobs listings, classifieds postings, advertisements, etc...).
> * 2: Robot, where language is formulaically derived from unrelated sources (e.g., weather/seismology, police/fire event logs, etc...).
> * 3: Spammer, where language is replicated to high multiplicity (e.g., celebrity obsessions, personal promotion, etc... )

> Check out the preprints of our recent research,
which spawned this dataset:

> * http://arxiv.org/abs/1505.04342
> * http://arxiv.org/abs/1508.01843

> The main data lie in the accompanying file:

> `topUsers_Apr-Jul_2014_1000-words.txt`

> and are of the form:

    USERID,CODE,TOTAL,WORD1_COUNT,WORD2_COUNT,...

> where

> * USERID = unique user identifier
> * CODE = 0/1/2/3 class code
> * TOTAL = sum of the word counts

> Using this data, you will implement a 1000-dimensional K-means algorithm in MrJob on the users
by their 1000-dimensional word stripes/vectors using several 
centroid initializations and values of K.

> Note that each "point" is a user as represented by 1000 words, and that
word-frequency distributions are generally heavy-tailed power-laws
(often called Zipf distributions), and are very rare in the larger class
of discrete, random distributions. For each user you will have to normalize
by its "TOTAL" column.

## Download the files from Dropbox

In [None]:
!wget --quiet https://www.dropbox.com/sh/m0nxsf4vs5cyrp2/AACBUw7kflSmuQJ7jn-uBMV1a/topUsers_Apr-Jul_2014_1000-words.txt
!wget --quiet https://www.dropbox.com/sh/m0nxsf4vs5cyrp2/AAD5G0PHKMgdqTPC1w-2rR2ya/topUsers_Apr-Jul_2014_1000-words_summaries.txt

## Create the job

In [8]:
#!/usr/bin/env python
from mrjob.job import MRJob
from mrjob.step import MRStep

import csv
import math
import sys

class KMeansJob(MRJob):

    """
    Load the centroids.
    """
    def mapper_init(self):
        self.centroids = {}
        
        with open('centroids.txt', 'r') as centroids_file:
            for centroid_row in csv.reader(centroids_file):
                if len(centroid_row) <= 3:
                    continue
                    
                key = centroid_row[0]
                point = [float(x) for x in centroid_row[1:]]
                self.centroids[key] = point

    """
    Yield the cluster and the point associated with that prediction so that
    the next iteration can average it.
    """
    def mapper(self, _, line):
        row = csv.reader([line]).next()
        
        if len(row) <= 3:
            return

        total = int(row[2])
        point = [float(x) / total for x in row[3:]]
        centroid = self.get_closest_centroid(point)
        
        yield centroid, [1] + point
    
    """
    Combine the counts that are being emitted.
    """
    def combiner(self, centroid, data):
        yield centroid, self.get_totals(data)
    
    """
    Returns the sum of all the features described in pairs.
    """
    def get_totals(self, data):
        total = None

        for datum in data:
            if total is None:
                total = datum
            else:
                total = [x + y for x, y in zip(total, datum)]
        
        return total
    
    """
    Find the new point to use for the centroid.
    """
    def reducer(self, centroid, data):
        totals = self.get_totals(data)
        
        point_count = totals[0]
        point_sum = totals[1:]
        
        mean_point = [x / point_count for x in point_sum]

        yield centroid, mean_point
        
    """
    Return the ID of the centroid closest to the given row.
    """
    def get_closest_centroid(self, point):
        minimum_key = None
        minimum_distance = sys.maxint
        
        for key, centroid in self.centroids.iteritems():
            distance = self.get_distance(point, centroid)
            
            if distance < minimum_distance:
                minimum_key = key
                minimum_distance = distance
        
        return minimum_key        
    
    """
    Returns the distance between the two points. For now use Euclidean
    distance.
    """
    def get_distance(self, values1, values2):
        difference = [x - y for x, y in zip(values1, values2)]
        squared_difference = [x * x for x in difference]
        sum_squared_difference = sum(squared_difference)
        euclidean_distance = math.sqrt(sum_squared_difference)
        
        return euclidean_distance

    def steps(self):
        return [
            MRStep(
                mapper_init = self.mapper_init,
                mapper = self.mapper,
                combiner = self.combiner,
                reducer = self.reducer)
        ]

if __name__ == '__main__' and sys.argv[0].find('ipykernel') == -1:
    KMeansJob().run()

## Test the job

In [9]:
import numpy

# If we have just one counter, everything should categorize to it. Make sure
# the map/reduce job can handle this degenerate case.

with open('centroids.txt', 'w') as centroid_file:
    print >> centroid_file, ','.join([str(x) for x in numpy.repeat(0, 1001)])

In [11]:
!hdfs dfs -rm -r -f -skipTrash $hdfs_base_folder/mrjob_45_output

!python KMeansJob.py -r hadoop \
    --strict-protocols \
    --file=centroids.txt \
    --output-dir=$hdfs_base_folder/mrjob_45_output \
    --no-output \
    topUsers_Apr-Jul_2014_1000-words.txt \
    > /dev/null 2>&1

Deleted /tmp/mrjob_45_output


## Confirm the output

In [12]:
!hdfs dfs -ls $hdfs_base_folder/mrjob_45_output

Found 2 items
-rw-r--r--   1 ubuntu hadoop          0 2016-02-10 09:14 /tmp/mrjob_45_output/_SUCCESS
-rw-r--r--   1 ubuntu hadoop      23182 2016-02-10 09:14 /tmp/mrjob_45_output/part-00000


> Try several parameterizations and initializations and iterate until a threshold (try 0.001) is reached.

> Note that you do not have to compute the aggregated distribution or the 
class-aggregated distributions, which are rows in the auxiliary file:

> `topUsers_Apr-Jul_2014_1000-words_summaries.txt`

## Create the base driver

Create a base utility class that serves as a driver until convergence happens. Child classes are expected to extend it in order to provide the first guess at centroids.

In [19]:
#!/usr/bin/env python
import numpy
import shutil
import time

from KMeansJob import KMeansJob

class KMeansJobDriver:
    
    """
    Stores the given centroids to a file.
    """
    def store_centroids(self, file_name, centroids):            
        with open(file_name, 'w') as centroids_file:
            for key, point in centroids.iteritems():
                print >> centroids_file, key + ',' + ','.join([str(x) for x in point])
    
    """
    Iterates until the threshold for convergence has been satisfied.
    """
    def run(self, runner_type, output_folder, threshold):

        # Initialize the centroids.
        
        pre_centroids = None
        post_centroids = self.get_initial_centroids()

        # Iterate until we have converged.
        
        converged = False
        iteration = 0
        
        time_start = time.time()

        while not converged:
            iteration += 1
            pre_centroids = post_centroids

            # Write the centroids.txt file for the next iteration.
            
            iteration_file_name = 'centroids_%03d.txt' % iteration

            self.store_centroids(iteration_file_name, post_centroids)
            shutil.copyfile(iteration_file_name, 'centroids.txt')
            
            # Run the next iteration.
            
            iteration_output_folder = '%s/%03d' % (output_folder, iteration)

            mr_job = KMeansJob(args = [
                '-r', runner_type,
                '--strict-protocols',
                '--file=centroids.txt',
                '--output-dir=' + iteration_output_folder,
                'topUsers_Apr-Jul_2014_1000-words.txt'
            ])
            
            with mr_job.make_runner() as runner:
                runner.run()
                
                post_centroids = {}

                for line in runner.stream_output():
                    key, point = mr_job.parse_output_line(line)
                    post_centroids[key] = point
                
                # Account for missing centroids and leave them in
                # the same place they were before the iteration.
                
                for key, point in pre_centroids.iteritems():
                    if key not in post_centroids:
                        post_centroids[key] = point                

            self.store_centroids('centroids_next.txt', post_centroids)

            # Print iteration results and check for convergence.

            maximum_change = self.get_maximum_change(pre_centroids, post_centroids)
            print 'Iteration', iteration, 'has maximum change', maximum_change
            converged = maximum_change <= threshold
        
        time_end = time.time()
        duration = time_end - time_start
        
        print 'Converged in', iteration, 'iteration(s), which required', duration, 'second(s)'
            
    """
    Returns the maximum change for any given coordinate position in the centroids.
    Account for case where nothing winds up in some cluster.
    """
    def get_maximum_change(self, pre_centroids, post_centroids):
        best_point_difference = 0
        
        for key, pre_point in pre_centroids.iteritems():
            post_point = post_centroids[key]

            point_difference = numpy.array(pre_point) - numpy.array(post_point)
            max_point_difference = max(numpy.abs(point_difference))
            
            if max_point_difference > best_point_difference:
                best_point_difference = max_point_difference

        return best_point_difference

> After convergence, print out a summary of the classes present in each cluster.
In particular, report the composition as measured by the total
portion of each class type (0-3) contained in each cluster,
and discuss your findings and any differences in outcomes across parts A-D.

## Provide summary job

In [29]:
#!/usr/bin/env python
from mrjob.step import MRStep
from KMeansJob import KMeansJob

import csv
import sys

class KMeansSummaryJob(KMeansJob):

    """
    Emit the centroid and its corresponding code.
    """
    def mapper(self, _, line):
        row = csv.reader([line]).next()

        if len(row) <= 3:
            return

        # Update the centroid summary so that it remembers the code for
        # the point that belongs to it.

        code = row[1]

        total = int(row[2])
        point = [float(x) / total for x in row[3:]]
        centroid = self.get_closest_centroid(point)

        yield code, { centroid: 1 }

    """
    Emit the total counts for each code
    """
    def combiner(self, code, centroid_counts):
        yield code, self.get_totals(centroid_counts)
    
    """
    Emit the total counts for each code
    """
    def reducer(self, code, centroid_counts):
        totals = self.get_totals(centroid_counts)

        total = sum(totals.itervalues())
        proportions = { key : float(count) / total for key, count in totals.iteritems() }
        
        yield code, proportions

    """
    Compute the totals for each code as if it were a word counter.
    """
    def get_totals(self, centroid_counts):
        totals = {}
        
        for centroid_count in centroid_counts:
            for centroid, count in centroid_count.iteritems():
                if centroid in totals:
                    totals[centroid] += count
                else:
                    totals[centroid] = count
        
        return totals

    """
    Specify steps.
    """
    def steps(self):
        return [
            MRStep(
                mapper_init = self.mapper_init,
                mapper = self.mapper,
                combiner = self.combiner,
                reducer = self.reducer)
        ]


if __name__ == '__main__' and sys.argv[0].find('ipykernel') == -1:
    KMeansSummaryJob().run()

## Provide summary pretty print

In [44]:
import json
import pandas

"""
Create a summary table from the JSON output in the noted file name.
"""
def get_summary_table(file_name):
    summary_items = {}
    
    with open(file_name, 'r') as summary_file:
        for line in summary_file:

            split_line = line.strip().split('\t')
            code = split_line[0]
            summary = json.loads(split_line[1])
            
            summary_items[code] = { key: '{:.2%}'.format(value) for key, value in summary.iteritems() }
    
    df = pandas.DataFrame.from_dict(summary_items)
    df.sort_index(axis = 1, inplace = True)
    df.fillna('0.00%', inplace = True)

    return df

# HW4.5a

> (A) K=4 uniform random centroid-distributions over the 1000 words

## Create the driver

In [22]:
#!/usr/bin/env python
import csv
import random
import sys

from KMeansJobDriver import KMeansJobDriver

class driver_45a(KMeansJobDriver):

    """
    Choose 4 random indices and iterate over the CSV until we find the users
    at the specified indices.
    """
    def get_initial_centroids(self):

        row_indices = random.sample(xrange(1, 1001), 4)

        centroids = {}        
        row_number = 0
        
        with open('topUsers_Apr-Jul_2014_1000-words.txt', 'r') as user_file:
            for row in csv.reader(user_file):
                row_number += 1
                
                if row_number not in row_indices:
                    continue

                user_id = row[0]
                total = int(row[2])
                point = [float(x) / total for x in row[3:]]

                centroids[user_id] = point
        
        return centroids

if __name__ == '__main__' and sys.argv[0].find('ipykernel') == -1:
    driver_45a().run(sys.argv[1], sys.argv[2], 0.001)

## Run the driver

In [23]:
!rm -f centroids*.txt

!hdfs dfs -rm -r -f -skipTrash $hdfs_base_folder/mrjob_45a_output > /dev/null
!python driver_45a.py hadoop $hdfs_base_folder/mrjob_45a_output

Deleted /tmp/mrjob_45a_output
Iteration 1 has maximum change 0.0775211232155
Iteration 2 has maximum change 0.030128726541
Iteration 3 has maximum change 0.00762067793484
Iteration 4 has maximum change 0
Converged in 4 iteration(s), which required 187.624292135 second(s)


## Summarize the output

In [32]:
!rm -rf mrjob_45a_summary

!python KMeansSummaryJob.py -r local \
    --strict-protocols \
    --file=centroids.txt \
    --output-dir=mrjob_45a_summary \
    --no-output \
    topUsers_Apr-Jul_2014_1000-words.txt \
    > /dev/null 2>&1

!cat mrjob_45a_summary/* > mrjob_45a_summary.txt

In [45]:
get_summary_table('mrjob_45a_summary.txt')

Unnamed: 0,"""0""","""1""","""2""","""3"""
196249823,0.13%,87.91%,66.67%,3.88%
2300389891,0.27%,0.00%,0.00%,57.28%
42911831,0.00%,8.79%,7.41%,0.00%
453748967,99.60%,3.30%,25.93%,38.83%


# HW4.5b

> (B) K=2 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution 

## Create the driver

In [46]:
#!/usr/bin/env python
import csv
import numpy
import sys

from KMeansJobDriver import KMeansJobDriver

class driver_45b(KMeansJobDriver):

    """
    Load the class distributions from topUsers_Apr-Jul_2014_1000-words_summaries.txt
    """
    def __init__(self):
        distribution_file_name = 'topUsers_Apr-Jul_2014_1000-words_summaries.txt'

        with open(distribution_file_name, 'r') as distribution_file:
            reader = csv.reader(distribution_file)
            header_row = reader.next()

            # Load the row for the total population
            
            total_row = reader.next()

            total = int(total_row[2])
            point = [float(x) / total for x in total_row[3:]]
            
            self.centroid = point

    """
    Choose 2 centroids.
    """
    def get_initial_centroids(self):
        return { str(i) : self.get_perturbed_centroid() for i in range(0, 2) }
        
    """
    Choose a new centroid by adding a random perturbation to the all-class centroid.
    """
    
    def get_perturbed_centroid(self):

        # Choose random numbers between 0 and 0.01

        perturbation = numpy.random.random(len(self.centroid)) / 100

        # Create the perturbed point and re-normalize so that the
        # values still sum to 1.

        perturbed_point = numpy.array(self.centroid) + perturbation
        perturbed_point /= numpy.sum(perturbed_point)
        
        return perturbed_point        
        
if __name__ == '__main__' and sys.argv[0].find('ipykernel') == -1:
    driver_45b().run(sys.argv[1], sys.argv[2], 0.001)

## Run the driver

In [47]:
!rm -f centroids*.txt

!hdfs dfs -rm -r -f -skipTrash $hdfs_base_folder/mrjob_45b_output > /dev/null
!python driver_45b.py hadoop $hdfs_base_folder/mrjob_45b_output

Iteration 1 has maximum change 0.0443779813147
Iteration 2 has maximum change 0.048027636263
Iteration 3 has maximum change 0.0282316079689
Iteration 4 has maximum change 0.000689009986548
Converged in 4 iteration(s), which required 186.39784193 second(s)


## Summarize the output

In [48]:
!rm -rf mrjob_45b_summary

!python KMeansSummaryJob.py -r local \
    --strict-protocols \
    --file=centroids.txt \
    --output-dir=mrjob_45b_summary \
    --no-output \
    topUsers_Apr-Jul_2014_1000-words.txt \
    > /dev/null 2>&1

!cat mrjob_45b_summary/* > mrjob_45b_summary.txt

In [49]:
get_summary_table('mrjob_45b_summary.txt')

Unnamed: 0,"""0""","""1""","""2""","""3"""
0,0.13%,96.70%,74.07%,3.88%
1,99.87%,3.30%,25.93%,96.12%


# HW4.5c

> (C) K=4 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution

## Create the driver

In [50]:
#!/usr/bin/env python
import csv
import numpy
import sys

from driver_45b import driver_45b

class driver_45c(driver_45b):

    """
    Choose 4 centroids.
    """
    def get_initial_centroids(self):
        return { str(i) : self.get_perturbed_centroid() for i in range(0, 4) }

if __name__ == '__main__' and sys.argv[0].find('ipykernel') == -1:
    driver_45c().run(sys.argv[1], sys.argv[2], 0.001)

## Run the driver

In [51]:
!rm -f centroids*.txt

!hdfs dfs -rm -r -f -skipTrash $hdfs_base_folder/mrjob_45c_output > /dev/null
!python driver_45c.py hadoop $hdfs_base_folder/mrjob_45c_output

Iteration 1 has maximum change 0.105617448782
Iteration 2 has maximum change 0.0588165443685
Iteration 3 has maximum change 0.0121952665909
Iteration 4 has maximum change 0
Converged in 4 iteration(s), which required 188.59821105 second(s)


## Summarize the output

In [52]:
!rm -rf mrjob_45c_summary

!python KMeansSummaryJob.py -r local \
    --strict-protocols \
    --file=centroids.txt \
    --output-dir=mrjob_45c_summary \
    --no-output \
    topUsers_Apr-Jul_2014_1000-words.txt \
    > /dev/null 2>&1

In [53]:
!cat mrjob_45c_summary/* > mrjob_45c_summary.txt

get_summary_table('mrjob_45c_summary.txt')

Unnamed: 0,"""0""","""1""","""2""","""3"""
0,0.13%,40.66%,70.37%,3.88%
1,0.27%,0.00%,5.56%,56.31%
2,0.00%,56.04%,3.70%,0.00%
3,99.60%,3.30%,20.37%,39.81%


# HW4.5d

> (D) K=4 "trained" centroids, determined by the sums across the classes.

## Create the driver

In [54]:
#!/usr/bin/env python
import csv
import numpy
import sys

from KMeansJobDriver import KMeansJobDriver

class driver_45d(KMeansJobDriver):

    """
    Load the class distributions from topUsers_Apr-Jul_2014_1000-words_summaries.txt
    """
    def __init__(self):
        self.centroids = {}
        distribution_file_name = 'topUsers_Apr-Jul_2014_1000-words_summaries.txt'

        with open(distribution_file_name, 'r') as distribution_file:
            reader = csv.reader(distribution_file)
            header_row = reader.next()

            # Load the row for the total population
            
            total_row = reader.next()
            
            for row in reader:
                code = row[1]
                total = int(total_row[2])
                point = [float(x) / total for x in total_row[3:]]
                
                self.centroids[code] = point

    """
    Return the 'trained' centroids
    """
    def get_initial_centroids(self):
        return self.centroids

if __name__ == '__main__' and sys.argv[0].find('ipykernel') == -1:
    driver_45d().run(sys.argv[1], sys.argv[2], 0.001)

## Run the driver

In [55]:
!rm -f centroids*.txt

!hdfs dfs -rm -r -f -skipTrash $hdfs_base_folder/mrjob_45d_output > /dev/null
!python driver_45d.py hadoop $hdfs_base_folder/mrjob_45d_output

Iteration 1 has maximum change 0.0104099262664
Iteration 2 has maximum change 0.0741558818057
Iteration 3 has maximum change 0.0165985372109
Iteration 4 has maximum change 0.0134138361426
Iteration 5 has maximum change 0.00672879986323
Iteration 6 has maximum change 0.0101538237874
Iteration 7 has maximum change 0.0203770724963
Iteration 8 has maximum change 0.0241936258467
Iteration 9 has maximum change 0.0186661807583
Iteration 10 has maximum change 0.0215314932318
Iteration 11 has maximum change 0.00123803087056
Iteration 12 has maximum change 0
Converged in 12 iteration(s), which required 560.313692808 second(s)


## Summarize the output

In [56]:
!rm -rf mrjob_45d_summary

!python KMeansSummaryJob.py -r local \
    --strict-protocols \
    --file=centroids.txt \
    --output-dir=mrjob_45d_summary \
    --no-output \
    topUsers_Apr-Jul_2014_1000-words.txt \
    > /dev/null 2>&1

In [57]:
!cat mrjob_45c_summary/* > mrjob_45d_summary.txt

get_summary_table('mrjob_45d_summary.txt')

Unnamed: 0,"""0""","""1""","""2""","""3"""
0,0.13%,40.66%,70.37%,3.88%
1,0.27%,0.00%,5.56%,56.31%
2,0.00%,56.04%,3.70%,0.00%
3,99.60%,3.30%,20.37%,39.81%


# HW4.5 Discussion

> Report the composition as measured by the total
portion of each class type (0-3) contained in each cluster,
and discuss your findings and any differences in outcomes across parts A-D.

* For all initializations, class type 0 was predominantly in only one cluster.
* For random initialization and 2 perturbed centroids, class type 1 was predominantly in only one cluster. For 4 perturbed centroids and trained centroids, class type 1 had a 3/2 split between two clusters.
* For all initializations, class type 3 had pproximately 70% of all of its elements in one cluster.
* For 2 perturbed centroids, class type 4 was predominantly in only one cluster. For all other cases, class type 3 had a 3/2 split between clusters.