Name: Patrick Ng  
Email: patng@ischool.berkeley.edu  
Class: W261-2  
Week: 04  
Date of submission: Feb 11, 2016

## HW 4.0. 

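**What is MrJob? How is it different from Hadoop MapReduce?**
MRJob is a Python framework built on top of Hadoop Streaming. Hadoop MapReduce itself is a Java framework; Hadoop Streaming lets any executable act as a mapper or reducer by reading from stdin and writing to stdout. MRJob wraps that mechanism so that a whole job, including a multi-step one, can be written as a single Python class, which is easier than using Hadoop Streaming directly. The same job can also be run locally for testing, on a Hadoop cluster, or on Amazon EMR.  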

**What are the mapper_init, mapper_final(), combiner_final(), reducer_final() methods? When are they called?**
These are optional hook methods that the user can define. The mapper_init method is called once per mapper task, before the mapper processes any input. Each of the three final methods is called once per task, after the corresponding component (mapper, combiner or reducer) has finished processing all of its input.
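
The call order described above can be illustrated without Hadoop. The following is a minimal plain-Python sketch, not mrjob code: `WordLengthMapper` and `run_mapper_task` are hypothetical names made up for this illustration, and the driver function only mimics the order in which a single mapper task would invoke the hooks.

```python
class WordLengthMapper(object):
    """Hypothetical stand-in for an MRJob subclass that defines mapper hooks."""

    def mapper_init(self):
        # Runs once, before any input line is seen.
        self.total_chars = 0

    def mapper(self, _, line):
        # Runs once per input line.
        self.total_chars += len(line)
        yield "lines", 1

    def mapper_final(self):
        # Runs once, after the last input line has been processed.
        yield "chars", self.total_chars


def run_mapper_task(job, lines):
    """Mimic the hook order of one mapper task (illustration only)."""
    job.mapper_init()
    emitted = []
    for line in lines:
        emitted.extend(job.mapper(None, line))
    emitted.extend(job.mapper_final())
    return emitted


pairs = run_mapper_task(WordLengthMapper(), ["ab", "cde"])
print(pairs)
```

The pair emitted by `mapper_final` always comes last, which is why the final hooks are a convenient place to flush per-task state such as an in-memory counter.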


## HW 4.1
What is serialization in the context of MrJob or Hadoop? 
When is it used in these frameworks? 
What is the default serialization mode for input and outputs for MrJob?

### Answers ###

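In the context of MrJob or Hadoop, serialization refers to how keys and values are encoded when they are read as input, passed between steps, and written as output.  

It is used whenever the framework sends input to, or collects output from, the mappers, combiners and reducers.  

In MRJob, the default protocol for reading the job's input is RawValueProtocol: each input line is passed to the mapper as the value, with a key of None. The default protocol for internal (between-step) data and for the final output is JSONProtocol.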
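
To make the JSONProtocol format concrete, here is a small sketch, assuming the layout that mrjob documents for this protocol: the key and the value are each JSON-encoded and joined by a tab. The helper names `json_protocol_write` and `json_protocol_read` are hypothetical; they imitate the protocol with the standard `json` module and are not mrjob functions.

```python
import json


def json_protocol_write(key, value):
    """Encode one key/value pair as a single tab-separated line of JSON."""
    return json.dumps(key) + "\t" + json.dumps(value)


def json_protocol_read(line):
    """Decode a line produced by json_protocol_write back into (key, value)."""
    raw_key, raw_value = line.split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


line = json_protocol_write("1008", 10836)
print(repr(line))

key, value = json_protocol_read(line)
print(key, value)
```

Because both sides are JSON, a value can be a number, a string, or a nested list, and it comes back with the same type after decoding.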

In [14]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## HW 4.2: 

Recall the Microsoft logfiles data from the async lecture. The logfiles are described at:

https://kdd.ics.uci.edu/databases/msweb/msweb.html  
http://archive.ics.uci.edu/ml/machine-learning-databases/anonymous/

This dataset records which areas (Vroots) of www.microsoft.com each user visited in a one-week timeframe in February 1998.

 Here, you must preprocess the data on a single node (i.e., not on a cluster of nodes) from the format:
```
C,"10001",10001   #Visitor id 10001
V,1000,1          #Visit by Visitor 10001 to page id 1000
V,1001,1          #Visit by Visitor 10001 to page id 1001
V,1002,1          #Visit by Visitor 10001 to page id 1002
C,"10002",10002   #Visitor id 10002
V
Note: #denotes comments
to the format:

V,1000,1,C, 10001
V,1001,1,C, 10001
V,1002,1,C, 10001
```
Write the python code to accomplish this.

In [233]:
%%writefile transform.py
#!/usr/bin/python

import csv

curUser = None
with open("anonymous-msweb.data", "r") as csvfile:
    for fields in csv.reader(csvfile):
        recType = fields[0]

        if recType == "C":
            curUser = fields[2] # remember current user
        elif recType == "V":
            print ",".join(fields + ["C", curUser])

Overwriting transform.py


In [234]:
!python transform.py > "transformed-msweb.data"

## HW 4.3: 
Find the 5 most frequently visited pages using MrJob from the output of 4.2 (i.e., transformed log file).

In [2]:
%%writefile hw4_3.py

'''
Input format:
....
V,1000,1,C,10001
V,1001,1,C,10001
V,1002,1,C,10001
V,1001,1,C,10002
V,1003,1,C,10002
V,1001,1,C,10003
V,1003,1,C,10003
V,1004,1,C,10003
....
'''
from mrjob.job import MRJob, MRStep
import csv
import sys

def csv_readline(line):
    """Given a string CSV line, return a list of strings."""
    return csv.reader([line]).next()

class PageVisitHW4_3(MRJob):
    
    def mapper_get_visit_count(self, line_no, line):
        """Extracts the page id and visit count"""
        cell = csv_readline(line)
        yield cell[1], 1

    def reducer_get_visit_count(self, pageId, visit_counts):
        """Summarizes the visit counts by adding them together."""
        total = sum(i for i in visit_counts)
        yield pageId, total

    def reducer_find_top5_pages_init(self):
        self.printed = 0

    def reducer_find_top5_pages(self, pageId, visit_counts):
        """Print the top 5 pageId's and the counts"""
        if self.printed < 5:
            yield pageId, visit_counts.next()
            self.printed += 1
    
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_visit_count,
                   combiner=self.reducer_get_visit_count,
                   reducer=self.reducer_get_visit_count),
            MRStep(reducer_init=self.reducer_find_top5_pages_init,
                   reducer=self.reducer_find_top5_pages,
                   jobconf={
                    "stream.num.map.output.key.fields":"2",
                    "mapreduce.job.output.key.comparator.class":
                        "org.apache.hadoop.mapred.lib.KeyFieldBasedComparator",
                    "mapreduce.partition.keycomparator.options":"-k2,2nr"
                          })]

    
if __name__ == '__main__':
    PageVisitHW4_3.run()


Overwriting hw4_3.py


In [236]:
from hw4_3 import PageVisitHW4_3
mr_job = PageVisitHW4_3(args=['transformed-msweb.data', '-r', 'hadoop'])
with mr_job.make_runner() as runner: 
    runner.run()
    # stream_output: get access of the output 
    
    print "5 most frequently visited pages:"

    for line in runner.stream_output():
        print mr_job.parse_output_line(line)



5 most frequently visited pages:
('1008', 10836)
('1034', 9383)
('1004', 8463)
('1018', 5330)
('1017', 5108)

ERROR:mrjob.fs.hadoop:STDERR: 16/02/09 23:42:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
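
As a single-machine cross-check of the counting logic, the same "count per page, then take the largest" idea can be sketched with `collections.Counter`. This is an illustration only, not part of the MrJob solution: `top_pages` is a hypothetical helper, and the sample lines are the ones listed in the docstring of hw4_3.py rather than the full transformed file.

```python
from collections import Counter


def top_pages(lines, n=5):
    """Return the n most visited page ids as (page_id, count) pairs."""
    # Field 1 of each "V,<page>,1,C,<visitor>" line is the page id.
    counts = Counter(line.strip().split(",")[1] for line in lines)
    return counts.most_common(n)


sample_lines = [
    "V,1000,1,C,10001",
    "V,1001,1,C,10001",
    "V,1002,1,C,10001",
    "V,1001,1,C,10002",
    "V,1003,1,C,10002",
    "V,1001,1,C,10003",
    "V,1003,1,C,10003",
    "V,1004,1,C,10003",
]

top_two = top_pages(sample_lines, n=2)
print(top_two)
```

Passing an open file handle for `transformed-msweb.data` instead of `sample_lines` would, under the same assumptions about the line format, give counts to compare against the Hadoop output.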


## HW 4.4: 
Find the most frequent visitor of each page using MrJob and the output of 4.2 (i.e., transformed log file). In this output please include the webpage URL, webpageID and Visitor ID.

In [238]:
%%writefile transform_urls.py
#!/usr/bin/python

import csv

with open("anonymous-msweb.data", "r") as csvfile:
    for fields in csv.reader(csvfile):
        # "A" records describe a page: A,<page id>,<ignored>,"<title>","<url path>"
        if fields[0] == "A":
            print ",".join([fields[1], fields[4]])


Writing transform_urls.py


In [245]:
!python transform_urls.py > urls.data

In [278]:
%%writefile hw4_4.py

from mrjob.job import MRJob, MRStep
import mrjob
import csv
import sys

def csv_readline(line):
    """Given a string CSV line, return a list of strings."""
    return csv.reader([line]).next()

class PageVisitHW4_4(MRJob):
    
    BASE_URL = "http://www.microsoft.com"
    INTERNAL_PROTOCOL = mrjob.protocol.RawProtocol
    OUTPUT_PROTOCOL = mrjob.protocol.RawProtocol
    
    def mapper1(self, line_no, line):
        cell = csv_readline(line)
        # page,visitor \t count
        yield ",".join([cell[1], cell[4]]), "1"
        
    def reducer1(self, key, values):
        """Sum up the visit count per (page, visitor) pair."""
        total = sum([int(v) for v in values])
        fields = key.split(",")
        # page \t visitor \t total
        yield fields[0], "\t".join([fields[1], str(total)])

    def reducer2_init(self):
        # Build the dictionary of pageId:url
        self.urls = {}
        with open("urls.data", "r") as f:
            for fields in csv.reader(f):
                self.urls[fields[0]] = fields[1]
                
    def reducer2(self, key, values):
        # url \t pageId \t visitor \t total
        yield self.BASE_URL + self.urls[key] + "\t" + key, values.next()
    
    def steps(self):
        return [
            MRStep(mapper=self.mapper1,
                   reducer=self.reducer1,
                  ),
            MRStep(reducer_init=self.reducer2_init,
                   reducer=self.reducer2,
                   jobconf={
                    "stream.num.map.output.key.fields":"3",
                    "mapreduce.job.output.key.comparator.class":
                        "org.apache.hadoop.mapred.lib.KeyFieldBasedComparator",
                    "mapreduce.partition.keycomparator.options":"-k3,3nr"
                          })]

    
if __name__ == '__main__':
    PageVisitHW4_4.run()


Overwriting hw4_4.py


In [280]:
!python ./hw4_4.py \
-r hadoop \
--file urls.data \
transformed-msweb.data

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/dm/nsw7wjf91f1c74hgl17ldw040000gn/T/hw4_4.patrickng.20160209.163650.965338
writing wrapper script to /var/folders/dm/nsw7wjf91f1c74hgl17ldw040000gn/T/hw4_4.patrickng.20160209.163650.965338/setup-wrapper.sh
Using Hadoop version 2.7.1
Copying local files into hdfs:///user/patrickng/tmp/mrjob/hw4_4.patrickng.20160209.163650.965338/files/

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

HADOOP: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
HADOOP: packageJobJar: [/var/folders/dm/nsw7wjf91f1c74hgl17ldw040000gn/T/hadoop-unjar1265772579474090499/] [] /var/folders/dm/nsw7wjf91f1c74hgl17ldw040000gn/T/stream
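
For comparison, the "most frequent visitor per page" logic can be sketched on one machine in plain Python. This is only an illustration under stated assumptions: `top_visitor_per_page` is a hypothetical helper, the sample lines are made up (including a repeated visit so that the counts differ), and ties are broken in favour of the smallest visitor id, which is an arbitrary choice for this sketch.

```python
from collections import Counter


def top_visitor_per_page(lines):
    """Return {page_id: (visitor_id, visit_count)} for the busiest visitor of each page."""
    pair_counts = Counter()
    for line in lines:
        fields = line.strip().split(",")
        # fields[1] is the page id, fields[4] is the visitor id.
        pair_counts[(fields[1], fields[4])] += 1

    best = {}
    # Sorting makes tie-breaking deterministic: the smallest visitor id wins a tie.
    for (page, visitor), count in sorted(pair_counts.items()):
        if page not in best or count > best[page][1]:
            best[page] = (visitor, count)
    return best


# Made-up sample in the transformed format of HW 4.2.
sample = [
    "V,1000,1,C,10001",
    "V,1000,1,C,10002",
    "V,1000,1,C,10002",
    "V,1001,1,C,10001",
]

best = top_visitor_per_page(sample)
for page in sorted(best):
    print("%s\t%s\t%d" % (page, best[page][0], best[page][1]))
```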

## HW 4.5 Clustering Tweet Dataset
```
Here you will use a different dataset consisting of word-frequency distributions 
for 1,000 Twitter users. These Twitter users use language in very different ways,
and were classified by hand according to the criteria:

0: Human, where only basic human-human communication is observed.

1: Cyborg, where language is primarily borrowed from other sources
(e.g., jobs listings, classifieds postings, advertisements, etc...).

2: Robot, where language is formulaically derived from unrelated sources
(e.g., weather/seismology, police/fire event logs, etc...).

3: Spammer, where language is replicated to high multiplicity
(e.g., celebrity obsessions, personal promotion, etc... )

Check out the preprints of  recent research,
which spawned this dataset:

http://arxiv.org/abs/1505.04342
http://arxiv.org/abs/1508.01843

The main data lie in the accompanying file:

topUsers_Apr-Jul_2014_1000-words.txt

and are of the form:

USERID,CODE,TOTAL,WORD1_COUNT,WORD2_COUNT,...
.
.

where

USERID = unique user identifier
CODE = 0/1/2/3 class code
TOTAL = sum of the word counts

Using this data, you will implement a 1000-dimensional K-means algorithm in MrJob on the users
by their 1000-dimensional word stripes/vectors using several 
centroid initializations and values of K.

Note that each "point" is a user as represented by 1000 words, and that
word-frequency distributions are generally heavy-tailed power-laws
(often called Zipf distributions), and are very rare in the larger class
of discrete, random distributions. 

** For each user you will have to normalize by its "TOTAL" column. 

Try several parameterizations and initializations:

(A) K=4 uniform random centroid-distributions over the 1000 words (generate 1000 random numbers and normalize the vectors)
(B) K=2 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution 
(C) K=4 perturbation-centroids, randomly perturbed from the aggregated (user-wide) distribution 
(D) K=4 "trained" centroids, determined by the sums across the classes. Use the 
(row-normalized) class-level aggregates as 'trained' starting centroids (i.e., the training is already done for you!).
Note that you do not have to compute the aggregated distribution or the 
class-aggregated distributions, which are rows in the auxiliary file:

topUsers_Apr-Jul_2014_1000-words_summaries.txt

Row 1: Words
Row 2: Aggregated distribution across all classes
Row 3-6 class-aggregated distributions for classes 0-3
For (A),  we select 4 users randomly from a uniform distribution [1,...,1,000]
For (B), (C), and (D)  you will have to use data from the auxiliary file: 

topUsers_Apr-Jul_2014_1000-words_summaries.txt

This file contains 5 special word-frequency distributions:

(1) The 1000-user-wide aggregate, which you will perturb for initializations
in parts (B) and (C), and

(2-5) The 4 class-level aggregates for each of the user-type classes (0/1/2/3)


In parts (B) and (C), you will have to perturb the 1000-user aggregate 
(after initially normalizing by its sum, which is also provided).
So if in (B) you want to create 2 perturbations of the aggregate, start
with (1), normalize, and generate 1000 random numbers uniformly 
from the unit interval (0,1) twice (for two centroids), using:

from numpy import random
numbers = random.sample(1000)

Take these 1000 numbers and add them (component-wise) to the 1000-user aggregate,
and then renormalize to obtain one of your aggregate-perturbed initial centroids.


###################################################################################
## Generate random initial centroids around the global aggregate
## Part (B) and (C) of this question
###################################################################################
def startCentroidsBC(k):
    counter = 0
    for line in open("topUsers_Apr-Jul_2014_1000-words_summaries.txt").readlines():
        if counter == 2:        
            data = re.split(",",line)
            globalAggregate = [float(data[i+3])/float(data[2]) for i in range(1000)]
        counter += 1
    ## perturb the global aggregate for the four initializations    
    centroids = []
    for i in range(k):
        rndpoints = random.sample(1000)
        peturpoints = [rndpoints[n]/10+globalAggregate[n] for n in range(1000)]
        centroids.append(peturpoints)
        total = 0
        for j in range(len(centroids[i])):
            total += centroids[i][j]
        for j in range(len(centroids[i])):
            centroids[i][j] = centroids[i][j]/total
    return centroids



——
For experiments A, B, C and D, iterate until a threshold (try 0.001) is reached.
After convergence, print out a summary of the classes present in each cluster.
In particular, report the composition as measured by the total
portion of each class type (0-3) contained in each cluster,
and discuss your findings and any differences in outcomes across parts A-D.

```

### Normalize the user input file

In [14]:
%%writefile hw4_5_normalize_input.py
from __future__ import division # Use Python 3-style division
import csv
import sys

'''
USERID,CODE,TOTAL,WORD1_COUNT,WORD2_COUNT,...
.
.

where

USERID = unique user identifier
CODE = 0/1/2/3 class code
TOTAL = sum of the word counts
'''

with open("topUsers_Apr-Jul_2014_1000-words.txt", "r") as csvfile:
    for fields in csv.reader(csvfile):
        output = [fields[1]] # Only need to use the Code for this assignment.
        total = int(fields[2])
        for i in range(3,1003):
            output.append(str(int(fields[i])/total))
            
        print ",".join(output)
        

Overwriting hw4_5_normalize_input.py


In [15]:
!python hw4_5_normalize_input.py > normalized_1000-words.txt

### The MRJob

In [13]:
%%writefile MRKmeans_4_5.py
from numpy import argmin, array, random
import re
import numpy as np
from mrjob.job import MRJob, MRStep
from itertools import chain

random.seed(10)

# Find the index of the nearest centroid for a data point
def MinDist(datapoint, centroid_points):
    datapoint = array(datapoint)
    centroid_points = array(centroid_points)
    diff = datapoint - centroid_points 
    diffsq = diff*diff
    # Get the nearest centroid for each instance
    minidx = argmin(list(diffsq.sum(axis = 1)))
    return minidx

# Check whether centroids converge
def stop_criterion(centroid_points_old, centroid_points_new, T):
    oldvalue = list(chain(*centroid_points_old))
    newvalue = list(chain(*centroid_points_new))
    Diff = [abs(x-y) for x, y in zip(oldvalue, newvalue)]
    Flag = True
    for i in Diff:
        if(i>T):
            Flag = False
            break
    return Flag

# For part B and C
def startCentroidsBC(k):
    counter = 1 # It was originally 0, but I think that was a bug in the provided code.
    for line in open("topUsers_Apr-Jul_2014_1000-words_summaries.txt").readlines():
        if counter == 2:        
            data = re.split(",",line)
            globalAggregate = [float(data[i+3])/float(data[2]) for i in range(1000)]
        counter += 1
    ## perturb the global aggregate once for each of the k initial centroids
    centroids = []
    for i in range(k):
        rndpoints = random.sample(1000)
        peturpoints = [rndpoints[n]/10+globalAggregate[n] for n in range(1000)]
        centroids.append(peturpoints)
        total = 0
        for j in range(len(centroids[i])):
            total += centroids[i][j]
        for j in range(len(centroids[i])):
            centroids[i][j] = centroids[i][j]/total
    return centroids

class MRKmeans(MRJob):
    centroid_points = []
    num_class = 4
    
    def configure_options(self):
        super(MRKmeans, self).configure_options()
        self.add_passthrough_option(
            '-k', type='int', default=4, help='The number of centroids.')
        
    def load_options(self, args):
        super(MRKmeans, self).load_options(args)
        self.k = self.options.k
        
    def steps(self):
        return [
            MRStep(mapper_init = self.mapper_init, 
                   mapper=self.mapper,
                   combiner = self.combiner,
                   reducer=self.reducer)
               ]

    #load centroids info from file
    def mapper_init(self):
        self.centroid_points = [map(float,s.split('\n')[0].split(',')) 
                                for s in open("Centroids.txt").readlines()]
        
    #load data and output the nearest centroid index and data point 
    def mapper(self, _, line):
        '''
        CODE,NORMALIZED_WORD1_COUNT,NORMALIZED_WORD2_COUNT,...
        .

        where

        CODE = 0/1/2/3 class code
        '''
        fields = line.split(',')
        
        code = int(fields[0])
        codeCounts = [0]*self.num_class
        codeCounts[code] = 1
        
        D = (map(float, fields[1:]))
        yield int(MinDist(D,self.centroid_points)), (D, 1, codeCounts)
        
    # Combine sum of data points locally
    def combiner(self, idx, inputdata):
        num = 0
        Ds = np.array([0.0]*1000)
        codeCountsAll = np.array([0]*self.num_class)
        
        for D, n, codeCounts in inputdata:
            num += n
            Ds += D
            codeCountsAll += codeCounts
            
        yield idx, (Ds.tolist(), num, codeCountsAll.tolist())
        
    # Aggregate sum for each cluster and then calculate the new centroids
    def reducer(self, idx, inputdata): 
        centroid = np.array([0.0]*1000)
        num = 0 
        codeCountsAll = np.array([0]*self.num_class)
        
        for D, n, codeCounts in inputdata: 
            num += n
            centroid += D
            codeCountsAll += codeCounts
            
        centroid /= num

        yield idx, (centroid.tolist(), codeCountsAll.tolist())
      
if __name__ == '__main__':
    MRKmeans.run()

Overwriting MRKmeans_4_5.py


### HW4.5 (A) - Driver
Generate random initial centroids  
New Centroids = initial centroids  
While(1):
+ Calculate new centroids
+ Stop if new centroids are close to old centroids
+ Update centroids
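
The loop outlined above can be sketched in plain Python on a toy 2-D dataset. This is only an in-memory illustration of the assign / average / compare cycle, using the same per-coordinate threshold idea as `stop_criterion`; the function names `nearest`, `kmeans_step` and `converged` and the toy points are assumptions made up for this example, and no MapReduce is involved.

```python
def nearest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))


def kmeans_step(points, centroids):
    """One iteration: assign each point, then average the points per cluster."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for pt in points:
        idx = nearest(pt, centroids)
        counts[idx] += 1
        for d in range(dim):
            sums[idx][d] += pt[d]
    new_centroids = []
    for idx, cen in enumerate(centroids):
        if counts[idx] == 0:
            new_centroids.append(list(cen))  # empty cluster: keep the old centroid
        else:
            new_centroids.append([s / counts[idx] for s in sums[idx]])
    return new_centroids


def converged(old, new, tol):
    """True when no coordinate of any centroid moved by more than tol."""
    return all(abs(a - b) <= tol for o, n in zip(old, new) for a, b in zip(o, n))


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids = [[1.0, 1.0], [9.0, 9.0]]  # arbitrary starting guesses
iterations = 0
while True:
    new_centroids = kmeans_step(points, centroids)
    iterations += 1
    done = converged(centroids, new_centroids, 0.001)
    centroids = new_centroids
    if done:
        break

print(iterations, centroids)
```

In the MrJob version the assignment step corresponds to the mapper, the per-cluster sums to the combiner and reducer, and the convergence check stays in the driver.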

In [19]:
from numpy import random
from MRKmeans_4_5 import MRKmeans, stop_criterion

k = 4
mr_job = MRKmeans(args=['normalized_1000-words.txt', 
                        '--file', 'Centroids.txt',
                        '-k', str(k)])

random.seed(10)

# Generate initial centroids
centroid_points = []

for i in range(k):
    # generate 1000 random numbers and normalize the vectors
    vector = random.random(1000)
    vector = vector / vector.sum()
    centroid_points.append(vector.tolist())

def writeCentroidsFile():
    with open('Centroids.txt', 'w+') as f:
        f.writelines(','.join(str(j) for j in i) + '\n' for i in centroid_points)

writeCentroidsFile()

# Update centroids iteratively
i = 0
while(1):
    # save previous centroids to check convergence
    centroid_points_old = centroid_points[:]
    
    codeCounts = [None] * k
    
    print "iteration"+str(i)+":"
    with mr_job.make_runner() as runner: 
        runner.run()
        
        for line in runner.stream_output():
            key, value =  mr_job.parse_output_line(line)
            centroid_points[key] = value[0]
            codeCounts[key] = value[1]
    
    writeCentroidsFile()        
    
    i = i + 1
    if(stop_criterion(centroid_points_old, centroid_points, 0.001)):
        break
        
print "Classes distribution in each centroid:\n"
for i in range(k):
    print str(i) + ": " + str(codeCounts[i])



iteration0:
iteration1:
iteration2:
iteration3:
iteration4:
iteration5:
iteration6:

Classes distribution in each centroid:

0: [78, 0, 4, 69]
1: [13, 0, 11, 1]
2: [660, 3, 2, 29]
3: [1, 88, 37, 4]


### HW4.5 (B) - Driver


In [15]:
from numpy import random
from MRKmeans_4_5 import MRKmeans, stop_criterion, startCentroidsBC

k = 2
mr_job = MRKmeans(args=['normalized_1000-words.txt', 
                        '--file', 'Centroids.txt',
                        '-k', str(k)])

random.seed(10)

# Generate initial centroids
centroid_points = startCentroidsBC(k)

def writeCentroidsFile():
    with open('Centroids.txt', 'w+') as f:
        f.writelines(','.join(str(j) for j in i) + '\n' for i in centroid_points)

writeCentroidsFile()

# Update centroids iteratively
i = 0
while(1):
    # save previous centroids to check convergence
    centroid_points_old = centroid_points[:]
    
    codeCounts = [None] * k
    
    print "iteration"+str(i)+":"
    with mr_job.make_runner() as runner: 
        runner.run()
        
        for line in runner.stream_output():
            key, value =  mr_job.parse_output_line(line)
            centroid_points[key] = value[0]
            codeCounts[key] = value[1]
    
    writeCentroidsFile()        
    
    i = i + 1
    if(stop_criterion(centroid_points_old, centroid_points, 0.001)):
        break
        
print "Classes distribution in each centroid:"
for i in range(k):
    print str(i) + ": " + str(codeCounts[i])



iteration0:
iteration1:
iteration2:
iteration3:
iteration4:

Classes distribution in each centroid:
0: [1, 88, 40, 4]
1: [751, 3, 14, 99]


### HW 4.5 (C) - Driver

In [17]:
from numpy import random
from MRKmeans_4_5 import MRKmeans, stop_criterion, startCentroidsBC

k = 4
mr_job = MRKmeans(args=['normalized_1000-words.txt', 
                        '--file', 'Centroids.txt',
                        '-k', str(k)])

random.seed(10)

# Generate initial centroids
centroid_points = startCentroidsBC(k)

def writeCentroidsFile():
    with open('Centroids.txt', 'w+') as f:
        f.writelines(','.join(str(j) for j in i) + '\n' for i in centroid_points)

writeCentroidsFile()

# Update centroids iteratively
i = 0
while(1):
    # save previous centroids to check convergence
    centroid_points_old = centroid_points[:]
    
    codeCounts = [None] * k
    
    print "iteration"+str(i)+":"
    with mr_job.make_runner() as runner: 
        runner.run()
        
        for line in runner.stream_output():
            key, value =  mr_job.parse_output_line(line)
            centroid_points[key] = value[0]
            codeCounts[key] = value[1]
    
    writeCentroidsFile()        
    
    i = i + 1
    if(stop_criterion(centroid_points_old, centroid_points, 0.001)):
        break
        
print "Classes distribution in each centroid:"
for i in range(k):
    print str(i) + ": " + str(codeCounts[i])



iteration0:
iteration1:
iteration2:
iteration3:
iteration4:
iteration5:
iteration6:

Classes distribution in each centroid:
0: [29, 0, 3, 72]
1: [13, 0, 12, 1]
2: [709, 3, 2, 26]
3: [1, 88, 37, 4]


### HW 4.5 (D) - Driver

In [18]:
from numpy import random
import re
from MRKmeans_4_5 import MRKmeans, stop_criterion

k = 4
mr_job = MRKmeans(args=['normalized_1000-words.txt', 
                        '--file', 'Centroids.txt',
                        '-k', str(k)])

random.seed(10)

def startCentroidsD():
    counter = 1
    centroids = []
    for line in open("topUsers_Apr-Jul_2014_1000-words_summaries.txt").readlines():
        if counter >= 3:        
            data = re.split(",",line)
            centroid = [float(data[i+3])/float(data[2]) for i in range(1000)]
            centroids.append(centroid)
        counter += 1
        
    return centroids

def writeCentroidsFile():
    with open('Centroids.txt', 'w+') as f:
        f.writelines(','.join(str(j) for j in i) + '\n' for i in centroid_points)

# Generate initial centroids
centroid_points = startCentroidsD()

writeCentroidsFile()

# Update centroids iteratively
i = 0
while(1):
    # save previous centroids to check convergence
    centroid_points_old = centroid_points[:]
    
    codeCounts = [None] * k
    
    print "iteration"+str(i)+":"
    with mr_job.make_runner() as runner: 
        runner.run()
        
        for line in runner.stream_output():
            key, value =  mr_job.parse_output_line(line)
            centroid_points[key] = value[0]
            codeCounts[key] = value[1]
    
    writeCentroidsFile()        
    
    i = i + 1
    if(stop_criterion(centroid_points_old, centroid_points, 0.001)):
        break
        
print "Classes distribution in each centroid:"
for i in range(k):
    print str(i) + ": " + str(codeCounts[i])



iteration0:
iteration1:
iteration2:
iteration3:
iteration4:

Classes distribution in each centroid:
0: [749, 3, 14, 38]
1: [0, 51, 0, 0]
2: [1, 37, 40, 4]
3: [2, 0, 0, 61]
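
The assignment asks for the composition of each cluster as a portion of each class. A minimal sketch of that calculation from the counts printed above follows. The `results` dictionary is copied by hand from the four driver outputs; `class_shares` and `purity` are hypothetical helpers, and purity (the fraction of users that fall in the majority class of their cluster) is an extra summary measure chosen for this illustration, not something the assignment specifies.

```python
# Class counts per cluster, copied from the driver outputs above.
# Each inner list is [class 0, class 1, class 2, class 3] for one cluster.
results = {
    "A": [[78, 0, 4, 69], [13, 0, 11, 1], [660, 3, 2, 29], [1, 88, 37, 4]],
    "B": [[1, 88, 40, 4], [751, 3, 14, 99]],
    "C": [[29, 0, 3, 72], [13, 0, 12, 1], [709, 3, 2, 26], [1, 88, 37, 4]],
    "D": [[749, 3, 14, 38], [0, 51, 0, 0], [1, 37, 40, 4], [2, 0, 0, 61]],
}


def class_shares(clusters):
    """For each cluster, the fraction of every class's users that landed in it."""
    class_totals = [sum(col) for col in zip(*clusters)]
    return [[count / float(total) for count, total in zip(row, class_totals)]
            for row in clusters]


def purity(clusters):
    """Fraction of users that belong to the majority class of their cluster."""
    total = sum(sum(row) for row in clusters)
    return sum(max(row) for row in clusters) / float(total)


for name in sorted(results):
    clusters = results[name]
    print("Experiment %s: purity = %.3f" % (name, purity(clusters)))
    for idx, shares in enumerate(class_shares(clusters)):
        print("  cluster %d: %s" % (idx, ["%.2f" % s for s in shares]))
```

On these counts the purity works out to 0.839 for (A) and (B), 0.882 for (C) and 0.901 for (D), so by this measure the class-aggregate initialization in (D) separates the classes most cleanly. Class 0 accounts for 752 of the 1,000 users, which is worth keeping in mind when reading any of these figures.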
