# DATASCI W261: Machine Learning at Scale
## Assignment Week 3
Miki Seltzer (miki.seltzer@berkeley.edu)<br>
W261-2, Spring 2016<br>
Submission: 

## HW3.0
### What is a merge sort? Where is it used in Hadoop?
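Merge sort is a divide-and-conquer sorting algorithm: it splits the input into smaller runs, sorts each run, and then merges the sorted runs. The merge step is efficient because it only needs to repeatedly take the smallest head element among the pre-sorted lists, which is a single linear pass over the data. Hadoop uses it during the shuffle: each mapper sorts its output by key, sorted spill files are merged on the map side, and each reducer merges the sorted outputs it fetches from the mappers into one sorted stream.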
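
As a small illustration of the merge step, here is a minimal sketch using Python's standard `heapq.merge`. The function name `merge_sorted_runs` and the sample runs are made up for this example, and it uses Python 3 `print()` rather than the Python 2 style of the scripts in this notebook. It is not Hadoop's implementation, only the same idea on in-memory lists.

```python
import heapq

def merge_sorted_runs(*runs):
    """Merge any number of already-sorted lists into one sorted list."""
    # heapq.merge repeatedly yields the smallest head element among the runs
    return list(heapq.merge(*runs))

# Two hypothetical sorted runs of (key, value) pairs, e.g. two sorted spill files
run_a = [("bar", 1), ("foo", 1), ("quux", 1)]
run_b = [("foo", 1), ("labs", 1)]

merged = merge_sorted_runs(run_a, run_b)
print(merged)
```

Because each run is already sorted, the merge never needs to look past the first unread element of each run.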

### What is a combiner function in the context of Hadoop?
A combiner function performs local pre-aggregation of mapper output before the key-value pairs are sent from the mappers to the reducers. It is similar to a reducer, but there are some key differences. One difference is that Hadoop treats the combiner as an optional optimization: it may run 0, 1, or many times for a given map task. Thus, we cannot count on Hadoop actually using a combiner we have included in the job, the job must produce the same result with or without it, and the combiner's output key-value types must match the mapper's output types.

### Give an example where it can be used and justify why it should be used in the context of this problem.
In the classic word count example, a document is scanned, and each word is paired with the value of 1. A combiner can be used to combine values of 1 with the same key (word) before they are shuffled to reducers. This reduces the amount of data that is shuffled between the mapper and reducer, and increases efficiency.
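
The saving can be seen in a small local simulation. This is only a sketch: `map_words` and `combine` are hypothetical helper names, the code runs in plain Python 3 rather than under Hadoop, and the input line is the one used later in HW3.2.

```python
from collections import Counter

def map_words(line):
    """Emit one (word, 1) pair per word, as a word count mapper would."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Sum the counts per word locally, as a combiner would on one mapper's output."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

line = "foo foo quux labs foo bar quux"
mapped = map_words(line)
combined = combine(mapped)

# Number of records that would be shuffled without and with the combiner
print(len(mapped), len(combined))
print(combined)
```

The reducer still sums the values it receives, so the final counts are the same whether or not the combiner ran.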

### What is the Hadoop shuffle?
The Hadoop shuffle is the process by which mapper output is partitioned, sorted, and transferred to the reducers. It ensures that all records with the same key are sent to the same reducer and that each reducer receives its keys in sorted order.
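
The partition, sort, and group steps can be mimicked locally. The sketch below is an assumption-laden stand-in, not Hadoop's code: `partition` uses `zlib.crc32` in place of Hadoop's hash partitioner so the result is deterministic, and `shuffle` is a hypothetical helper that works on in-memory lists (Python 3).

```python
import zlib
from itertools import groupby
from operator import itemgetter

def partition(key, num_reducers):
    """Pick a reducer for a key (stand-in for Hadoop's hash partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Return, per reducer, a key-sorted list of (key, [values]) groups."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in map_outputs:
        buckets[partition(key, num_reducers)].append((key, value))

    grouped = []
    for bucket in buckets:
        # Sort by key only, so values keep their arrival order within a key
        bucket.sort(key=itemgetter(0))
        grouped.append([(key, [value for _, value in group])
                        for key, group in groupby(bucket, key=itemgetter(0))])
    return grouped

map_outputs = [("foo", 1), ("bar", 1), ("foo", 1), ("quux", 1)]
for reducer_id, groups in enumerate(shuffle(map_outputs, 2)):
    print(reducer_id, groups)
```

Which reducer a key lands on depends on the hash, but every occurrence of a key always lands on the same one.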

## HW3.1: Use Counters to do EDA (exploratory data analysis) and to monitor progress

In [343]:
# I am running this locally, so make sure that the Hadoop streaming API is in this folder.
!wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-streaming/2.7.1/hadoop-streaming-2.7.1.jar
    
# Create a folder on HDFS for this week's assignment, strip the header line from Consumer_Complaints.csv
!echo "$(tail -n +2 Consumer_Complaints.csv)" > Consumer_Complaints.csv
!hdfs dfs -mkdir /user/miki/week03
!hdfs dfs -put Consumer_Complaints.csv /user/miki/week03

mkdir: `/user/miki/week03': File exists


In [344]:
%%writefile mapper_31.py
#!/usr/bin/python
## mapper.py
## Author: Miki Seltzer
## Description: mapper code for HW3.1

import sys
from csv import reader

# Our input comes from STDIN (standard input)
for line in reader(sys.stdin):
    product = line[1]
    if product == "Debt collection": sys.stderr.write("reporter:counter:Product,Debt,1\n")
    elif product == "Mortgage": sys.stderr.write("reporter:counter:Product,Mortgage,1\n")
    else: sys.stderr.write("reporter:counter:Product,Other,1\n")
    print line
    

Writing mapper_31.py


In [345]:
%%writefile reducer_31.py
#!/usr/bin/python
## reducer.py
## Author: Miki Seltzer
## Description: reducer code for HW3.1

import sys
from operator import itemgetter
from csv import reader

# Our input comes from STDIN (standard input)
for line in sys.stdin:
    # Each line already ends with a newline, so write it as-is to avoid blank lines
    sys.stdout.write(line)
    

Writing reducer_31.py


In [346]:
# Change permissions on mapper and reducer
!chmod +x mapper_31.py
!chmod +x reducer_31.py

# If output folder already exists, delete it
!hdfs dfs -rm -r /user/miki/week03/hw3_1_output

# Run job
!hadoop jar hadoop-streaming-2.7.1.jar \
-mapper /home/cloudera/Documents/W261-Fall2016/Week03/mapper_31.py \
-reducer /home/cloudera/Documents/W261-Fall2016/Week03/reducer_31.py \
-input /user/miki/week03/Consumer_Complaints.csv \
-output /user/miki/week03/hw3_1_output

Deleted /user/miki/week03/hw3_1_output
packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.5.0.jar] /tmp/streamjob7547497993532515688.jar tmpDir=null


#### Screenshot

![Custom Counters](Counters.png)

## HW3.2: Analyze the performance of your Mappers, Combiners and Reducers using Counters


In [18]:
%%writefile HW3_2_input.txt
foo foo quux labs foo bar quux

Overwriting HW3_2_input.txt


In [19]:
!hdfs dfs -put HW3_2_input.txt /user/miki/week03

In [347]:
%%writefile mapper_32a.py
#!/usr/bin/python
## mapper.py
## Author: Miki Seltzer
## Description: mapper code for HW3.2

import sys
from csv import reader

# Increment mapper counter
sys.stderr.write("reporter:counter:Custom_Counter,Mapper,1\n")

# Our input comes from STDIN (standard input)
for line in sys.stdin:
    line = line.split()
    for word in line:
        print '%s\t%s' % (word, 1)
    

Writing mapper_32a.py


In [348]:
%%writefile reducer_32a.py
#!/usr/bin/python
## reducer.py
## Author: Miki Seltzer
## Description: reducer code for HW3.2

import sys
from operator import itemgetter

# Increment reducer counter
sys.stderr.write("reporter:counter:Custom_Counter,Reducer,1\n")

# Our input comes from STDIN (standard input)
for line in sys.stdin:
    # Each line already ends with a newline, so write it as-is to avoid blank lines
    sys.stdout.write(line)
    

Writing reducer_32a.py


In [349]:
# Change permissions on mapper and reducer
!chmod +x mapper_32a.py
!chmod +x reducer_32a.py

# If output folder already exists, delete it
!hdfs dfs -rm -r /user/miki/week03/hw3_2a_output

# Run job
!hadoop jar hadoop-streaming-2.7.1.jar \
-D mapred.map.tasks=1 \
-D mapred.reduce.tasks=4 \
-mapper /home/cloudera/Documents/W261-Fall2016/Week03/mapper_32a.py \
-reducer /home/cloudera/Documents/W261-Fall2016/Week03/reducer_32a.py \
-input /user/miki/week03/HW3_2_input.txt \
-output /user/miki/week03/hw3_2a_output

Deleted /user/miki/week03/hw3_2a_output
packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.5.0.jar] /tmp/streamjob7785257004505429063.jar tmpDir=null


### What is the value of your user defined Mapper Counter, and Reducer Counter after completing this word count job? The answer should be 1 and 4 respectively. Please explain.

Each counter is incremented once at the top of its script, outside the input loop, so it counts how many times Hadoop launches the script (that is, the number of map or reduce tasks), not the number of records processed.

I had to specify the number of map tasks and reduce tasks to get 1 and 4, since the defaults produced counters of 2 and 1 respectively.
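
The difference between a per-task counter and a per-record counter comes down to where the increment sits. The sketch below wraps the `reporter:counter:<group>,<counter>,<amount>` line used in the scripts above in a helper. `increment_counter`, `run_task`, and the `Records` counter are hypothetical names introduced for illustration, and an in-memory buffer stands in for stderr so the result can be inspected (Python 3).

```python
import io
import sys

def increment_counter(group, counter, amount=1, stream=None):
    """Write a Hadoop streaming counter update line (to stderr by default)."""
    stream = sys.stderr if stream is None else stream
    stream.write("reporter:counter:%s,%s,%d\n" % (group, counter, amount))

def run_task(lines, stream=None):
    """Simulate one launch of a mapper script over its input split."""
    # Outside the loop: incremented once per task, like the counters above
    increment_counter("Custom_Counter", "Mapper", stream=stream)
    for _ in lines:
        # Inside the loop: incremented once per input record (hypothetical counter)
        increment_counter("Custom_Counter", "Records", stream=stream)

buf = io.StringIO()
run_task(["line one", "line two", "line three"], stream=buf)
print(buf.getvalue())
```

Hadoop adds up the amounts reported by all tasks, so the first counter ends up equal to the number of tasks launched.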

## HW3.2b: Perform a word count analysis of the Issue column of the Consumer Complaints Dataset 


In [302]:
%%writefile mapper_32b.py
#!/usr/bin/python
## mapper.py
## Author: Miki Seltzer
## Description: mapper code for HW3.2b

import sys
from csv import reader
import string

# Increment mapper counter
sys.stderr.write("reporter:counter:Custom_Counter,Mapper,1\n")

# Initialize variables
total = 0

# Our input comes from STDIN (standard input)
for line in reader(sys.stdin):
    # Format our line
    issue = line[3].lower()
    issue = issue.replace(',',' ').replace('/',' ')
    
    for word in issue.split():
        if len(word) > 0:
            print '%s\t%s' % (word, 1)
            total += 1

# Print total words
print '%s\t%s' % ('*total', total)

Writing mapper_32b.py


In [303]:
%%writefile reducer_32b.py
#!/usr/bin/python
## reducer.py
## Author: Miki Seltzer
## Description: reducer code for HW3.2b

import sys
from operator import itemgetter

# Increment reducer counter
sys.stderr.write("reporter:counter:Custom_Counter,Reducer,1\n")

# Initialize variables
prev_word = None
prev_count = 0

# Our input comes from STDIN (standard input)
for line in sys.stdin:

    # Split line
    word, count = line.split('\t')

    # Convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # Count wasn't an int, so move on
        continue

    if prev_word == word:
        # We haven't moved to a new word
        prev_count += count
    
    else:
        if prev_word:
            print '%s\t%s' % (prev_word, prev_count)

        prev_count = count
        prev_word = word

# Output the last word (skipped if this reducer received no input)
if prev_word:
    print '%s\t%s' % (prev_word, prev_count)

Writing reducer_32b.py


In [350]:
# Change permissions on mapper and reducer
!chmod +x mapper_32b.py
!chmod +x reducer_32b.py

# If output folder already exists, delete it
!hdfs dfs -rm -r /user/miki/week03/hw3_2b_output

# Run job
!hadoop jar hadoop-streaming-2.7.1.jar \
-D mapred.reduce.tasks=4 \
-mapper /home/cloudera/Documents/W261-Fall2016/Week03/mapper_32b.py \
-reducer /home/cloudera/Documents/W261-Fall2016/Week03/reducer_32b.py \
-input /user/miki/week03/Consumer_Complaints.csv \
-output /user/miki/week03/hw3_2b_output

Deleted /user/miki/week03/hw3_2b_output
packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.5.0.jar] /tmp/streamjob1465294680810041979.jar tmpDir=null


### What is the value of your user defined Mapper Counter, and Reducer Counter after completing your word count job?

After completing this job, the counters show the following values:
- Mapper: 2
- Reducer: 4 (this is explicitly set when running the job)

## HW3.2c: Perform a word count analysis of the Issue column of the Consumer Complaints Dataset (ADD: standalone combiner)

Because summing counts is associative and commutative, we can reuse the reducer logic as the combiner. The only change is that the script increments the combiner counter instead of the reducer counter.


In [305]:
%%writefile combiner_32c.py
#!/usr/bin/python
## combiner.py
## Author: Miki Seltzer
## Description: combiner code for HW3.2c

import sys
from operator import itemgetter

# Increment combiner counter
sys.stderr.write("reporter:counter:Custom_Counter,Combiner,1\n")

prev_word = None
prev_count = 0

# Our input comes from STDIN (standard input)
for line in sys.stdin:
    word, count = line.split('\t')

    # Convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # Count wasn't an int, so move on
        continue

    # Check if we've moved to a new word
    if prev_word == word:
        prev_count += count
    else:
        if prev_word:
            # We are at a new word, need to print previous word sum
            print '%s\t%s' % (prev_word, prev_count)
        prev_count = count
        prev_word = word

# Output the last word (skipped if this combiner received no input)
if prev_word:
    print '%s\t%s' % (prev_word, prev_count)
    

Writing combiner_32c.py


In [351]:
# Change permissions on mapper and reducer
!chmod +x mapper_32b.py
!chmod +x combiner_32c.py
!chmod +x reducer_32b.py

# If output folder already exists, delete it
!hdfs dfs -rm -r /user/miki/week03/hw3_2c_output

# Run job
!hadoop jar hadoop-streaming-2.7.1.jar \
-D mapred.reduce.tasks=4 \
-mapper /home/cloudera/Documents/W261-Fall2016/Week03/mapper_32b.py \
-combiner /home/cloudera/Documents/W261-Fall2016/Week03/combiner_32c.py \
-reducer /home/cloudera/Documents/W261-Fall2016/Week03/reducer_32b.py \
-input /user/miki/week03/Consumer_Complaints.csv \
-output /user/miki/week03/hw3_2c_output

Deleted /user/miki/week03/hw3_2c_output
packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.5.0.jar] /tmp/streamjob2612824031763828089.jar tmpDir=null


### What is the value of your user defined Mapper Counter, Combiner Counter and Reducer Counter after completing your word count job?

After completing this job, the counters show the following values:
- Mapper: 2
- Combiner: 8
- Reducer: 4 (this is explicitly set when running the job)

## HW3.2d: Using a single reducer, present frequency and relative frequency of top 50 and bottom 10 terms

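For this section, the mapper only swaps each (word, count) pair to (count, word) so that Hadoop can sort by count, and the single reducer divides each count by the total to get the relative frequency.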
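
The same ordering and relative-frequency calculation can be checked locally on a small input. This is a sketch under stated assumptions: `relative_frequencies` is a hypothetical helper, the counts are those of the HW3.2 test line rather than the complaints data, and ties are broken alphabetically (Python 3).

```python
def relative_frequencies(counts):
    """Return (word, count, relative frequency) rows, highest count first."""
    # The special *total key carries the denominator
    total = float(counts["*total"])
    rows = [(word, count) for word, count in counts.items() if word != "*total"]
    # Sort by descending count, then alphabetically to break ties
    rows.sort(key=lambda row: (-row[1], row[0]))
    return [(word, count, count / total) for word, count in rows]

counts = {"*total": 7, "foo": 3, "quux": 2, "bar": 1, "labs": 1}
for word, count, rel_freq in relative_frequencies(counts):
    print("%s\t%d\t%.4f" % (word, count, rel_freq))
```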

In [309]:
%%writefile mapper_32d.py
#!/usr/bin/python
## mapper.py
## Author: Miki Seltzer
## Description: mapper code for HW3.2d

import sys
from csv import reader
import string

# Increment mapper counter
sys.stderr.write("reporter:counter:Custom_Counter,Mapper,1\n")

# Our input comes from STDIN (standard input)
for line in sys.stdin:
    word, count = line.replace('\n','').split('\t')
    print '%s\t%s' % (count, word)

Overwriting mapper_32d.py


In [310]:
%%writefile reducer_32d.py
#!/usr/bin/python
## reducer.py
## Author: Miki Seltzer
## Description: reducer code for HW3.2d

import sys
from operator import itemgetter

# Increment reducer counter
sys.stderr.write("reporter:counter:Custom_Counter,Reducer,1\n")

# Initialize variables
total = 0

# Our input comes from STDIN (standard input)
for line in sys.stdin:
    fields = line.replace('\n','').split('\t')
    count = fields[0]
    word = fields[1]
    
    try:
        count = int(count)
    except ValueError:
        continue
        
    # *total has the largest count, so the descending numeric sort delivers it first; save it as the denominator
    if word == '*total': total = float(count)
    else: print '%s\t%s\t%s' % (word, count, count/total)

Overwriting reducer_32d.py


In [352]:
# Change permissions on mapper and reducer
!chmod +x mapper_32d.py
!chmod +x reducer_32d.py

# If output folder already exists, delete it
!hdfs dfs -rm -r /user/miki/week03/hw3_2d_output

# Run job
!hadoop jar hadoop-streaming-2.7.1.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.partitioner.options=-k1,1 \
-D stream.num.map.output.key.fields=2 \
-D mapred.text.key.comparator.options='-k1,1nr -k2,2n' \
-D mapred.reduce.tasks=1 \
-mapper /home/cloudera/Documents/W261-Fall2016/Week03/mapper_32d.py \
-reducer /home/cloudera/Documents/W261-Fall2016/Week03/reducer_32d.py \
-input /user/miki/week03/hw3_2b_output/part* \
-output /user/miki/week03/hw3_2d_output

Deleted /user/miki/week03/hw3_2d_output
packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.5.0.jar] /tmp/streamjob6548003399415369612.jar tmpDir=null


In [353]:
!rm hw3_2d_output.txt
!hdfs dfs -copyToLocal /user/miki/week03/hw3_2d_output/part-00000 hw3_2d_output.txt

In [354]:
# Function to pretty print:
# - the top x and bottom y items
# - unique items
def print_results(file, x=50, y=10):
    words = []
    special_words = []
    
    with open(file,'r') as myfile:
        for line in myfile:
            fields = line.replace('\n','').split('\t')
            if fields[0][0] != '*': words.append(fields)
            else: special_words.append(fields)

    print '      {:16s}{:>8s}{:>15s}'.format('word', 'count', 'relative freq')
    print '--------------------------------------------'
    for i in range(x):
        print '[{:3d}] {:16s}{:8,d}{:15.2%}'.format(i+1, 
                                                    words[i][0], 
                                                    int(words[i][1]), 
                                                    float(words[i][2]))
    print '...'
    for i in range(y):
        # Index of the i-th of the last y items
        j = len(words) - y + i
        print '[{:3d}] {:16s}{:8,d}{:15.2%}'.format(j+1, 
                                                    words[j][0], 
                                                    int(words[j][1]), 
                                                    float(words[j][2]))
    
    print '\n-------------------------------------------'
    print '      {:16s}{:>8,d}'.format('Unique words', len(words))
    for item in special_words:
        name = item[0][1:].replace('_',' ')
        print '      {:16s}{:>8,d}'.format(name, int(item[1]))

print_results('hw3_2d_output.txt')

      word               count  relative freq
--------------------------------------------
[  1] loan             119,630          8.87%
[  2] collection        72,394          5.37%
[  3] foreclosure       70,487          5.23%
[  4] modification      70,487          5.23%
[  5] account           57,448          4.26%
[  6] credit            55,251          4.10%
[  7] or                40,508          3.00%
[  8] payments          39,993          2.97%
[  9] escrow            36,767          2.73%
[ 10] servicing         36,767          2.73%
[ 11] report            34,903          2.59%
[ 12] incorrect         29,133          2.16%
[ 13] information       29,069          2.16%
[ 14] on                29,069          2.16%
[ 15] debt              27,874          2.07%
[ 16] closing           19,000          1.41%
[ 17] not               18,477          1.37%
[ 18] attempts          17,972          1.33%
[ 19] cont'd            17,972          1.33%
[ 20] collect           17,972     

## HW3.3: Shopping Cart Analysis Exploratory Data Analysis

We can reuse the reducer from HW3.2b, but the mapper needs two small changes:
- Skip the lower-casing and punctuation stripping, since we assume the product IDs are already clean
- Keep track of the largest basket size as we loop through baskets

In [282]:
!hdfs dfs -put ProductPurchaseData.txt /user/miki/week03

In [325]:
%%writefile mapper_33a.py
#!/usr/bin/python
## mapper.py
## Author: Miki Seltzer
## Description: mapper code for HW3.3a

import sys

# Increment mapper counter
sys.stderr.write("reporter:counter:Custom_Counter,Mapper,1\n")

# Initialize variables
total = 0
basket_size = 0
largest_basket_size = 0

# Our input comes from STDIN (standard input)
for line in sys.stdin:
    # Split our line into products
    for product in line.replace('\n','').split():
        print '%s\t%s' % (product, 1)
        basket_size += 1
        total += 1
    if basket_size > largest_basket_size:
        largest_basket_size = basket_size
    
    basket_size = 0

# Print the total number of products and the largest basket seen by this mapper
print '%s\t%s' % ('*total', total)
print '%s\t%s' % ('*largest_basket', largest_basket_size)

Overwriting mapper_33a.py
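
One thing to watch when reusing a summing reducer with this mapper: if more than one map task runs, each task emits its own `*largest_basket` record, and a reducer that sums values per key would add them together instead of keeping the maximum. A possible way around this is sketched below. `reduce_counts` and its `max_keys` parameter are hypothetical names, the sample records are made up, and the code is Python 3 operating on a list of lines rather than on stdin.

```python
def reduce_counts(sorted_lines, max_keys=("*largest_basket",)):
    """Sum values per key, except for keys in max_keys, which keep their maximum."""
    results = []
    prev_key, prev_value = None, 0
    for line in sorted_lines:
        key, value = line.rstrip("\n").split("\t")
        value = int(value)
        if key == prev_key:
            if key in max_keys:
                prev_value = max(prev_value, value)
            else:
                prev_value += value
        else:
            # New key: flush the previous one, if any
            if prev_key is not None:
                results.append((prev_key, prev_value))
            prev_key, prev_value = key, value
    # Flush the final key (nothing to flush on empty input)
    if prev_key is not None:
        results.append((prev_key, prev_value))
    return results

# Made-up records as two map tasks might emit them, already sorted by key
sample = ["*largest_basket\t30\n", "*largest_basket\t37\n",
          "*total\t100\n", "*total\t50\n",
          "PROD1\t1\n", "PROD1\t1\n"]
print(reduce_counts(sample))
```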


In [326]:
# Change permissions on mapper and reducer
!chmod +x mapper_33a.py

# If output folder already exists, delete it
!hdfs dfs -rm -r /user/miki/week03/hw3_3a_output

# Run job
!hadoop jar hadoop-streaming-2.7.1.jar \
-mapper /home/cloudera/Documents/W261-Fall2016/Week03/mapper_33a.py \
-reducer /home/cloudera/Documents/W261-Fall2016/Week03/reducer_32b.py \
-input /user/miki/week03/ProductPurchaseData.txt \
-output /user/miki/week03/hw3_3a_output

Deleted /user/miki/week03/hw3_3a_output
packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.5.0.jar] /tmp/streamjob4662501162287957069.jar tmpDir=null


### Using a single reducer: Report your findings such as number of unique products; largest basket; report the top 50 most frequently purchased items, their frequency, and their relative frequency (break ties by sorting the products in alphabetical order) etc. using Hadoop Map-Reduce.

We can use the mapper and reducer from HW3.2d to get the sorted frequencies and relative frequencies of the products.

In [327]:
# If output folder already exists, delete it
!hdfs dfs -rm -r /user/miki/week03/hw3_3b_output

# Run job
!hadoop jar hadoop-streaming-2.7.1.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.partitioner.options=-k1,1 \
-D stream.num.map.output.key.fields=2 \
-D mapred.text.key.comparator.options='-k1,1nr -k2,2' \
-D mapred.reduce.tasks=1 \
-mapper /home/cloudera/Documents/W261-Fall2016/Week03/mapper_32d.py \
-reducer /home/cloudera/Documents/W261-Fall2016/Week03/reducer_32d.py \
-input /user/miki/week03/hw3_3a_output/part* \
-output /user/miki/week03/hw3_3b_output

Deleted /user/miki/week03/hw3_3b_output
packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.5.0.jar] /tmp/streamjob7461671848105384853.jar tmpDir=null


In [330]:
!rm hw3_3b_output.txt
!hdfs dfs -copyToLocal /user/miki/week03/hw3_3b_output/part-00000 hw3_3b_output.txt

In [341]:
print_results('hw3_3b_output.txt', 50, 0)

      word               count  relative freq
--------------------------------------------
[  1] DAI62779           6,667          1.75%
[  2] FRO40251           3,881          1.02%
[  3] ELE17451           3,875          1.02%
[  4] GRO73461           3,602          0.95%
[  5] SNA80324           3,044          0.80%
[  6] ELE32164           2,851          0.75%
[  7] DAI75645           2,736          0.72%
[  8] SNA45677           2,455          0.64%
[  9] FRO31317           2,330          0.61%
[ 10] DAI85309           2,293          0.60%
[ 11] ELE26917           2,292          0.60%
[ 12] FRO80039           2,233          0.59%
[ 13] GRO21487           2,115          0.56%
[ 14] SNA99873           2,083          0.55%
[ 15] GRO59710           2,004          0.53%
[ 16] GRO71621           1,920          0.50%
[ 17] FRO85978           1,918          0.50%
[ 18] GRO30386           1,840          0.48%
[ 19] ELE74009           1,816          0.48%
[ 20] GRO56726           1,784     

## HW3.4: Pairs

Suppose we want to recommend new products to the customer based on the products they have already browsed on the online website. Write a map-reduce program to find products which are frequently browsed together. Fix the support count (co-occurrence count) to s = 100 (i.e. product pairs need to occur together at least 100 times to be considered frequent) and find pairs of items (sometimes referred to as itemsets of size 2 in association rule mining) that have a support count of 100 or more.
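
Before moving to Hadoop, the pairs logic can be tried on a few in-memory baskets. This is a minimal sketch: `frequent_pairs` is a hypothetical helper, the baskets and the support threshold of 2 are toy values, and the code is Python 3.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Count co-occurring item pairs and keep those meeting the support count."""
    pair_counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair is counted once per basket
        # and always appears in the same (alphabetical) order
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: count for pair, count in pair_counts.items()
            if count >= min_support}

baskets = [["A", "B", "C"], ["A", "B"], ["B", "C", "B"], ["A", "B", "D"]]
print(frequent_pairs(baskets, min_support=2))
```

In the map-reduce version, the inner loop corresponds to the mapper emitting one record per pair, and the counting and filtering happen on the reduce side.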

In [360]:
%%writefile mapper_34a.py
#!/usr/bin/python
## mapper.py
## Author: Miki Seltzer
## Description: mapper code for HW3.4a

import sys
import itertools

# Increment mapper counter
sys.stderr.write("reporter:counter:Custom_Counter,Mapper,1\n")

# Initialize variables
total = 0

# Our input comes from STDIN (standard input)
for line in sys.stdin:
    # Split our line into products
    products = line.replace('\n','').split()
    
    # Get all combinations of products:
    #  - Use a set to remove duplicate products
    #  - Combinations finds tuples of length 2 with no repeats
    pairs = list(itertools.combinations(set(products), 2))
    
    # For each pair, sort the pair alphabetically, then emit
    for pair in pairs:
        sorted_pair = sorted(pair)
        print '%s\t%s\t%s' % (sorted_pair[0], sorted_pair[1], 1)
    
    # Increment total number of baskets
    total += 1
        
# Print total number of baskets
print '%s\t%s\t%s' % ('*total', '', total)

Overwriting mapper_34a.py


In [365]:
%%writefile reducer_34a.py
#!/usr/bin/python
## reducer.py
## Author: Miki Seltzer
## Description: reducer code for HW3.4a

import sys
from operator import itemgetter

# Increment reducer counter
sys.stderr.write("reporter:counter:Custom_Counter,Reducer,1\n")

# Initialize variables
prev_pair = []
prev_count = 0
total = 0

# Our input comes from STDIN (standard input)
for line in sys.stdin:
    # Define our key and value
    fields = line.replace('\n','').split('\t')
    pair = [fields[0], fields[1]]
    count = fields[2]

    # Convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # Count wasn't an int, so move on
        continue

    # Check if we've moved to a new pair
    if prev_pair == pair:
        prev_count += count
    else:
        if len(prev_pair) > 0:
            # We are at a new pair, need to print previous pair sum
            print '%s\t%s\t%s' % (prev_pair[0], prev_pair[1], prev_count)
        prev_count = count
        prev_pair = pair

# Output the last pair (skipped if this reducer received no input)
if len(prev_pair) > 0:
    print '%s\t%s\t%s' % (prev_pair[0], prev_pair[1], prev_count)

Overwriting reducer_34a.py


In [373]:
# Change permissions on mapper and reducer
!chmod +x mapper_34a.py
!chmod +x reducer_34a.py

# If output folder already exists, delete it
!hdfs dfs -rm -r /user/miki/week03/hw3_4a_output

# Run job
!hadoop jar hadoop-streaming-2.7.1.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.partitioner.options=-k1,1 \
-D stream.num.map.output.key.fields=2 \
-D mapred.text.key.comparator.options='-k1,1 -k2,2' \
-mapper /home/cloudera/Documents/W261-Fall2016/Week03/mapper_34a.py \
-reducer /home/cloudera/Documents/W261-Fall2016/Week03/reducer_34a.py \
-input /user/miki/week03/ProductPurchaseData.txt \
-output /user/miki/week03/hw3_4a_output

Deleted /user/miki/week03/hw3_4a_output
packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.5.0.jar] /tmp/streamjob4258626330607127192.jar tmpDir=null


In [379]:
%%writefile mapper_34b.py
#!/usr/bin/python
## mapper.py
## Author: Miki Seltzer
## Description: mapper code for HW3.4b

import sys

# Increment mapper counter
sys.stderr.write("reporter:counter:Custom_Counter,Mapper,1\n")

# Initialize variables
total = 0

# Our input comes from STDIN (standard input)
for line in sys.stdin:
    fields = line.replace('\n','').split('\t')
    if int(fields[2]) >= 100:
        print '%s\t%s\t%s' % (fields[2], fields[0], fields[1])

Overwriting mapper_34b.py


In [381]:
%%writefile reducer_34b.py
#!/usr/bin/python
## reducer.py
## Author: Miki Seltzer
## Description: reducer code for HW3.4b

import sys
from operator import itemgetter

# Increment reducer counter
sys.stderr.write("reporter:counter:Custom_Counter,Reducer,1\n")

# Initialize variables
total = 0

# Our input comes from STDIN (standard input)
for line in sys.stdin:
    fields = line.replace('\n','').split('\t')
    count = fields[0]
    item1 = fields[1]
    item2 = fields[2]
    
    try:
        count = int(count)
    except ValueError:
        continue
        
    # The first word should be *total, save this as total
    if item1 == '*total': total = float(count)
    else: print '%s\t%s\t%s\t%s' % (item1, item2, count, count/total)

Writing reducer_34b.py


In [383]:
# Change permissions on mapper and reducer
!chmod +x mapper_34b.py
!chmod +x reducer_34b.py

# If output folder already exists, delete it
!hdfs dfs -rm -r /user/miki/week03/hw3_4b_output

# Run job
!time hadoop jar hadoop-streaming-2.7.1.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.partitioner.options=-k1,1 \
-D stream.num.map.output.key.fields=3 \
-D mapred.text.key.comparator.options='-k1,1nr -k2,2 -k3,3' \
-D mapred.reduce.tasks=1 \
-mapper /home/cloudera/Documents/W261-Fall2016/Week03/mapper_34b.py \
-reducer /home/cloudera/Documents/W261-Fall2016/Week03/reducer_34b.py \
-input /user/miki/week03/hw3_4a_output/part* \
-output /user/miki/week03/hw3_4b_output

Deleted /user/miki/week03/hw3_4b_output
packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.5.0.jar] /tmp/streamjob2299545763059091234.jar tmpDir=null

real	0m26.758s
user	0m4.982s
sys	0m0.345s


In [386]:
!hdfs dfs -copyToLocal /user/miki/week03/hw3_4b_output/part-00000 hw3_4b_output.txt

In [394]:
# Function to pretty print the top x pairs
# (y is accepted for compatibility with the earlier version but is not used)
def print_results(file, x=50, y=10):
    words = []
    special_words = []
    
    with open(file,'r') as myfile:
        for line in myfile:
            fields = line.replace('\n','').split('\t')
            if fields[0][0] != '*': words.append(fields)
            else: special_words.append(fields)

    print '      {:10s}{:10s}{:>8s}{:>15s}'.format('item1', 'item2', 'count', 'relative freq')
    print '------------------------------------------------------'
    for i in range(x):
        print '[{:3d}] {:10s}{:10s}{:8,d}{:15.2%}'.format(i+1, 
                                                          words[i][0], 
                                                          words[i][1],
                                                          int(words[i][2]), 
                                                          float(words[i][3]))  

### List the top 50 product pairs

In [395]:
print_results('hw3_4b_output.txt', 50, 0)

      item1     item2        count  relative freq
------------------------------------------------------
[  1] DAI62779  ELE17451     1,592          5.12%
[  2] FRO40251  SNA80324     1,412          4.54%
[  3] DAI75645  FRO40251     1,254          4.03%
[  4] FRO40251  GRO85051     1,213          3.90%
[  5] DAI62779  GRO73461     1,139          3.66%
[  6] DAI75645  SNA80324     1,130          3.63%
[  7] DAI62779  FRO40251     1,070          3.44%
[  8] DAI62779  SNA80324       923          2.97%
[  9] DAI62779  DAI85309       918          2.95%
[ 10] ELE32164  GRO59710       911          2.93%
[ 11] DAI62779  DAI75645       882          2.84%
[ 12] FRO40251  GRO73461       882          2.84%
[ 13] DAI62779  ELE92920       877          2.82%
[ 14] FRO40251  FRO92469       835          2.68%
[ 15] DAI62779  ELE32164       832          2.68%
[ 16] DAI75645  GRO73461       712          2.29%
[ 17] DAI43223  ELE32164       711          2.29%
[ 18] DAI62779  GRO30386       709          2

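### Report the compute time for the Pairs job.
The `time` command wrapped around the second job (filtering by support and sorting) reports the following; the first job (pair counting) was not timed: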
```
real	0m26.758s
user	0m4.982s
sys 	0m0.345s
```
### Describe the computational setup used (E.g., single computer; dual core; linux, number of mappers, number of reducers)
Cloudera QuickStart VM: single computer, 2 processors, 2 mappers (default), 1 reducer

### How many times is each mapper and reducer called?
- Mapper: 2
- Reducer: 1

## HW3.5: Stripes
Repeat 3.4 using the stripes design pattern for finding co-occurring pairs.

In [398]:
%%writefile mapper_35a.py
#!/usr/bin/python
## mapper.py
## Author: Miki Seltzer
## Description: mapper code for HW3.5a

import sys
import itertools

# Increment mapper counter
sys.stderr.write("reporter:counter:Custom_Counter,Mapper,1\n")

# Initialize variables
total = 0
stripes = {}

# Our input comes from STDIN (standard input)
for line in sys.stdin:
    # Split our line into products
    products = line.replace('\n','').split()
    
    # Deduplicate and sort the products so each pair is emitted exactly once,
    # under the alphabetically smaller item
    items = sorted(set(products))

    # For each item, emit a stripe of all items that sort after it
    for i in range(len(items)-1):
        stripes = {}
        for j in range(i+1, len(items)):
            stripes[items[j]] = 1
        print '%s\t%s' % (items[i], stripes)
        
    # Increment total number of baskets
    total += 1
        
# Print total number of baskets
print '%s\t%s' % ('*total', total)

Overwriting mapper_35a.py


In [399]:
# Change permissions on mapper and reducer
!chmod +x mapper_35a.py
#!chmod +x reducer_35a.py

# If output folder already exists, delete it
!hdfs dfs -rm -r /user/miki/week03/hw3_5a_output

# Run job
!hadoop jar hadoop-streaming-2.7.1.jar \
-mapper /home/cloudera/Documents/W261-Fall2016/Week03/mapper_35a.py \
-reducer /bin/cat \
-input /user/miki/week03/ProductPurchaseData.txt \
-output /user/miki/week03/hw3_5a_output

Deleted /user/miki/week03/hw3_5a_output
packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.5.0.jar] /tmp/streamjob1564090710484002419.jar tmpDir=null
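
The job above passes the stripes through an identity reducer (`/bin/cat`). A stripes reducer would typically merge all stripes for the same item element-wise and then apply the support threshold. The following is only a sketch of that step, not the notebook's reducer: `reduce_stripes` is a hypothetical name, it assumes each stripe arrives as the Python dict text printed by the mapper, it skips special records such as `*total`, and it is written in Python 3 against a list of lines instead of stdin.

```python
import ast
from collections import Counter
from itertools import groupby

def reduce_stripes(sorted_lines, min_support):
    """Merge stripes per item and yield (item, other, count) for frequent pairs."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for item, group in groupby(parsed, key=lambda record: record[0]):
        if item.startswith("*"):
            continue  # special records such as *total are not stripes
        merged = Counter()
        for _, payload in group:
            # Parse the dict text safely and add its counts element-wise
            merged.update(ast.literal_eval(payload))
        for other in sorted(merged):
            if merged[other] >= min_support:
                yield (item, other, merged[other])

# Made-up mapper output, already sorted by key as the shuffle would deliver it
sample_lines = ["*total\t3\n",
                "A\t{'B': 1, 'C': 1}\n",
                "A\t{'B': 1}\n",
                "B\t{'C': 1}\n"]
print(list(reduce_stripes(sample_lines, min_support=2)))
```

Compared with the pairs approach, far fewer records cross the shuffle, at the cost of larger values and more memory per key in the reducer.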
