#DATASCI W261, Machine Learning at Scale
--------
####Assignement:  week \#3
####Lei Yang (leiyang@berkeley.edu)
####Due: 2016-02-02, 8AM PST

###*HW3.0.* Q&A

####What is a merge sort? Where is it used in Hadoop?
Merge sort is a sorting algorithm which quickly combines two sorted lists into a single list of items. Merge sort benefits from distributable in its least efficient step, which is the sorting of the child lists. The merging of child lists into a single sorted list is done in linear time. Merge sorting is used in the shuffle stage of Hadoop to rearrange keys prior to sending them to the reducer. Key-value pairs from different mappers are sorted at their mappers, and then distributed across the reducers in a sorted form.

####How is a combiner function used in the context of Hadoop?
Combiners are used for local aggregation during the mapper processes of Hadoop. They are run when the incomplete output from the mapper becomes too large to fit within memory and "spills over" onto disk. The combiner is responsible for shrinking the data back down so that the mapper can run faster by keeping data in memory and so that the network operations in the partitioner are kept to a bare minimum. Depending on the size and scope of the problem, Hadoop will run combiners any number of times including zero with no input from the user. For this reason, it is critical that the combiner is able to receive records in the format of the mapper's output and emit data in the same format. The combining operation must also be associative and commutative so that the variable number of runs will not affect the result.

####Give an example where it can be used and justify why it should be used in the context of this problem
Combiners can be used in long word-count operations. A typical mapper output for a word-count problem will be greater than the size of the document since it emits each individual word and the number associated with it. Transferring this data across the network can drastically reduce the performance of this operation, as well as making the subsequent sorting operation take much longer. Adding a combiner can reduce the size of the mapper output from being tied to the size of the document to being tied to the size of the vocabulary. 

####What is the Hadoop shuffle?
Shuffle happens after all mapper processes complete and before reducer starts, all key-value pairs are sorted by key, and the same key is guaranteed to be delivered to the same reducer.

####What is the Apriori algorithm? Describe an example use in your domain of expertise. Define confidence and lift.


###start yarn, hdfs, and job history

In [7]:
!/usr/local/Cellar/hadoop/2*/sbin/start-yarn.sh
!/usr/local/Cellar/hadoop/2*/sbin/start-dfs.sh
!/usr/local/Cellar/hadoop/2*/sbin/mr-jobhistory-daemon.sh --config /usr/local/Cellar/hadoop/2*/libexec/etc/hadoop/ start historyserver 

starting yarn daemons
resourcemanager running as process 644. Stop it first.
localhost: nodemanager running as process 746. Stop it first.
Starting namenodes on [localhost]
localhost: namenode running as process 894. Stop it first.
localhost: datanode running as process 987. Stop it first.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: secondarynamenode running as process 1105. Stop it first.
historyserver running as process 1213. Stop it first.


###*HW3.1.* Use Counters to do EDA (exploratory data analysis and to monitor progress)
Counters are lightweight objects in Hadoop that allow you to keep track of system progress in both the map and reduce stages of processing. By default, Hadoop defines a number of standard counters in "groups"; these show up in the jobtracker webapp, giving you information such as "Map input records", "Map output records", etc.

While processing information/data using MapReduce job, it is a challenge to monitor the progress of parallel threads running across nodes of distributed clusters. Moreover, it is also complicated to distinguish between the data that has been processed and the data which is yet to be processed. The MapReduce Framework offers a provision of user-defined Counters, which can be effectively utilized to monitor the progress of data across nodes of distributed clusters.

**Use the Consumer Complaints  Dataset provide [here](https://www.dropbox.com/s/vbalm3yva2rr86m/Consumer_Complaints.csv?dl=0) to complete this question:**

- The consumer complaints dataset consists of diverse consumer complaints, which have been reported across the United States regarding various types of loans. The dataset consists of records of the form:
 - Complaint ID,Product,Sub-product,Issue,Sub-issue,State,ZIP code,Submitted via,Date received,Date sent to company,Company,Company response,Timely response?,Consumer disputed?

**Here’s is the first few lines of the  of the Consumer Complaints  Dataset:**

- Complaint ID,Product,Sub-product,Issue,Sub-issue,State,ZIP code,Submitted via,Date received,Date sent to company,Company,Company response,Timely response?,Consumer disputed?
- 1114245,Debt collection,Medical,Disclosure verification of debt,Not given enough info to verify debt,FL,32219,Web,11/13/2014,11/13/2014,"Choice Recovery, Inc.",Closed with explanation,Yes,
- 1114488,Debt collection,Medical,Disclosure verification of debt,Right to dispute notice not received,TX,75006,Web,11/13/2014,11/13/2014,"Expert Global Solutions, Inc.",In progress,Yes,
- 1114255,Bank account or service,Checking account,Deposits and withdrawals,,NY,11102,Web,11/13/2014,11/13/2014,"FNIS (Fidelity National Information Services, Inc.)",In progress,Yes,
- 1115106,Debt collection,"Other (phone, health club, etc.)",Communication tactics,Frequent or repeated calls,GA,31721,Web,11/13/2014,11/13/2014,"Expert Global Solutions, Inc.",In progress,Yes,

**User-defined Counters**

- Now, let’s use Hadoop Counters to identify the number of complaints pertaining to *debt collection*, *mortgage* and *other* categories (all other categories get lumped into this one) in the consumer complaints dataset. Basically produce the distribution of the Product column in this dataset using counters (limited to 3 counters here).
- Hadoop offers Job Tracker, an UI tool to determine the status and statistics of all jobs. Using the job tracker UI, developers can view the Counters that have been created. Screenshot your  job tracker UI as your job completes and include it here. Make sure that your user defined counters are visible.



###<span style="color:red">HW3.1 Answer:</span>


###Mapper
- as the shuffler will do the sorting, mapper just need to emit word with integer as the key

In [2]:
%%writefile mapper.py
#!/usr/bin/python
import sys
for line in sys.stdin:  
    # extract the column values
    parts = line.strip().split(',')
    # product is in second column
    prod = parts[1].strip().lower()
    # emit product name as key, no need for value as we are only count product name
    print "%s\t%s" %(prod, 'na')

Overwriting mapper.py


###Reducer

In [3]:
%%writefile reducer.py
#!/usr/bin/python
import sys

for line in sys.stdin:
    # product name
    prod = line.split('\t')[0].strip()
    
    # compare with what we want to count and adjust the counter
    if prod == 'debt collection':
        sys.stderr.write("reporter:counter:HW3_1,debt,1\n")
    elif prod == 'mortgage':
        sys.stderr.write("reporter:counter:HW3_1,mortgage,1\n")
    else:
        sys.stderr.write("reporter:counter:HW3_1,others,1\n")
    

Overwriting reducer.py


###Run the job with Hadoop Streaming
- add parameter *-D mapred.reduce.tasks=2* to specify number of reducers

In [4]:
!hdfs dfs -rm -r results
!hadoop jar /usr/local/Cellar/hadoop/2.*/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D mapred.reduce.tasks=2 \
-files mapper.py,reducer.py \
-mapper mapper.py \
-reducer reducer.py \
-input /user/lei/Consumer_Complaints.csv \
-output results

Deleted results
packageJobJar: [/var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/hadoop-unjar3709839099986008331/] [] /var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/streamjob3447758072235333814.jar tmpDir=null


###Check counter value

![Image 1](HW3_1.png)

###*HW3.2.*  Analyze the performance of your Mappers, Combiners and Reducers using Counters

**For this brief study the Input file will be one record (the next line only):**

*foo foo quux labs foo bar quux*

- Perform a word count analysis of this single record dataset using a Mapper and Reducer based WordCount (i.e., no combiners are used here) using user defined Counters to count up how many time the mapper and reducer are called. What is the value of your user defined Mapper Counter, and Reducer Counter after completing this word count job. The answer  should be 1 and 4 respectively. Please explain.

###Mapper

In [5]:
%%writefile mapper.py
#!/usr/bin/python
import sys, re, string

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_2,Mapper_cnt,1\n")

# input comes from STDIN (standard input)
for line in sys.stdin:
    
    # split the line into words
    words = line.split()
    
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)

Overwriting mapper.py


###Reducer

In [6]:
%%writefile reducer.py
#!/usr/bin/python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# increase counter for reducer being called
sys.stderr.write("reporter:counter:HW3_2,Reducer_cnt,1\n")

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # print out count
            print '%s\t%s' %(current_word, current_count)
        current_count = count
        current_word = word

# do not forget to print the last word count if needed!
if current_word == word:    
    print '%s\t%s' %(current_word, current_count)    


Overwriting reducer.py


###Write the file and put on HDFS

In [85]:
%%writefile wordcount.txt
foo foo quux labs foo bar quux

Overwriting wordcount.txt


In [86]:
!hdfs dfs -rm /user/lei/wordcount.txt
!hdfs dfs -put wordcount.txt /user/lei

Deleted /user/lei/wordcount.txt


###Run the job with Hadoop streaming

In [7]:
!hdfs dfs -rm -r results
!hadoop jar /usr/local/Cellar/hadoop/2.*/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D mapred.map.tasks=1 \
-D mapred.reduce.tasks=4 \
-files mapper.py,reducer.py \
-mapper mapper.py \
-reducer reducer.py \
-input '/user/lei/wordcount.txt' \
-output results

Deleted results
packageJobJar: [/var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/hadoop-unjar7556678054711269180/] [] /var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/streamjob3086524328830560864.jar tmpDir=null


###<span style="color:red">HW3.2 Results:</span>
![Image 2](HW3_2_1.png)

###*HW3.2*  Exploratory analysis on consumer complaint data
**Please use mulitple mappers and reducers for these jobs (at least 2 mappers and 2 reducers).**

- Perform a word count analysis of the Issue column of the Consumer Complaints  Dataset using a Mapper and Reducer based WordCount (i.e., no combiners used anywhere)  using user defined Counters to count up how many time the mapper and reducer are called. What is the value of your user defined Mapper Counter, and Reducer Counter after completing your word count job.

###Mapper 

In [8]:
%%writefile mapper.py
#!/usr/bin/python
import sys, re, string
# define regex for punctuation removal
regex = re.compile('[%s]' % re.escape(string.punctuation))

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_2,Mapper_cnt,1\n")

for line in sys.stdin:      
    # extract the column values
    parts = line.strip().split(',')
    # issue is in 4th column
    issue = parts[3].strip().lower()
    # emit issue as key, and 1 as count
    print "%s,%s" %(regex.sub('', issue), '1')

Overwriting mapper.py


###Reducer

In [9]:
%%writefile reducer.py
#!/usr/bin/python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# increase counter for reducer being called
sys.stderr.write("reporter:counter:HW3_2,Reducer_cnt,1\n")

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split(',', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # print out count
            print '%s,%s' %(current_word, current_count)
        current_count = count
        current_word = word

# do not forget to print the last word count if needed!
if current_word == word:    
    print '%s,%s' %(current_word, current_count)    


Overwriting reducer.py


###Run the job with Hadoop streaming

In [10]:
!hdfs dfs -rm -r results
!hadoop jar /usr/local/Cellar/hadoop/2.*/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D mapred.map.tasks=4 \
-D mapred.reduce.tasks=2 \
-files mapper.py,reducer.py \
-mapper mapper.py \
-reducer reducer.py \
-input '/user/lei/Consumer_Complaints.csv' \
-output results 

Deleted results
packageJobJar: [/var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/hadoop-unjar5125320870294825413/] [] /var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/streamjob2595898526152944027.jar tmpDir=null


###<span style="color:red">HW3.2 Results:</span>
we can see that the counter values are consistent with our specification of times for mapper and reducer to be called.
![Image 2](HW3_2_2.png)

And the issue counts are below:

In [11]:
!hdfs dfs -cat /user/leiyang/results/part-0000* 

account terms and changes,350	
application processing delay,243	
application,8625	
apr or interest rate,3431	
billing disputes,6938	
billing statement,1220	
cant contact lender,221	
closingcancelling account,2795	
collection practices,1003	
convenience checks,75	
credit card protection  debt protection,1343	
credit determination,1490	
customer service  customer relations,1367	
dealing with my lender or servicer,1944	
delinquent account,1061	
deposits and withdrawals,10555	
disclosure verification of debt,5214	
health club,12545	
improper contact or sharing of info,2832	
incorrectmissing disclosures or info,64	
late fee,1797	
loan modification,70487	
loan servicing,36767	
makingreceiving payments,3226	
managing the loan or lease,4560	
money was not available when promised,274	
other fee,1075	
other transaction issues,387	
other,6273	
payoff process,1155	
privacy,240	
repaying your loan,3844	
rewards,1002	
shopping for a line of credit,137	
taking out th

###*HW3.2*  Exploratory analysis on consumer complaint data
**Please use mulitple mappers and reducers for these jobs (at least 2 mappers and 2 reducers).**

- Perform a word count analysis of the Issue column of the Consumer Complaints  Dataset using a Mapper, Reducer, and standalone combiner (i.e., not an in-memory combiner) based WordCount using user defined Counters to count up how many time the mapper, combiner, reducer are called. What is the value of your user defined Mapper Counter, and Reducer Counter after completing your word count job.


The definitions of mapper and reducer don't need to change in this case, we can just use the reducer as a standalone combiner, specified by Hadoop-streaming parameter (-combiner)

In [12]:
!hdfs dfs -rm -r results
!hadoop jar /usr/local/Cellar/hadoop/2.*/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D mapred.map.tasks=4 \
-D mapred.reduce.tasks=2 \
-files mapper.py,reducer.py \
-mapper mapper.py \
-reducer reducer.py \
-combiner reducer.py \
-input '/user/lei/Consumer_Complaints.csv' \
-output results 

Deleted results
packageJobJar: [/var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/hadoop-unjar2110887529466242773/] [] /var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/streamjob4709300790161981780.jar tmpDir=null


###<span style="color:red">HW3.2 Results:</span>
We can see that the reducer.py was called 8 times during map step as **combiner**, and 2 times during reduce step as **reducer**. 
![Image 2](HW3_2_3.png)
And the issue counts from two reducers are below:

In [13]:
!hdfs dfs -cat /user/leiyang/results/part-0000* 

account terms and changes,350	
application processing delay,243	
application,8625	
apr or interest rate,3431	
billing disputes,6938	
billing statement,1220	
cant contact lender,221	
closingcancelling account,2795	
collection practices,1003	
convenience checks,75	
credit card protection  debt protection,1343	
credit determination,1490	
customer service  customer relations,1367	
dealing with my lender or servicer,1944	
delinquent account,1061	
deposits and withdrawals,10555	
disclosure verification of debt,5214	
health club,12545	
improper contact or sharing of info,2832	
incorrectmissing disclosures or info,64	
late fee,1797	
loan modification,70487	
loan servicing,36767	
makingreceiving payments,3226	
managing the loan or lease,4560	
money was not available when promised,274	
other fee,1075	
other transaction issues,387	
other,6273	
payoff process,1155	
privacy,240	
repaying your loan,3844	
rewards,1002	
shopping for a line of credit,137	
taking out th

###*HW3.2*  Exploratory analysis on consumer complaint data

- Using a single reducer: What are the top 50 most frequent terms in your word count analysis? 
- Present the top 50 terms and their frequency and their relative frequency. 
- If there are ties please sort the tokens in alphanumeric/string order. Present bottom 10 tokens (least frequent items).

**Notes:**
- for a single reducer (job) to get list of relative frequencies, we need to implement **order inversion** to get total count first.
- **mapper** will emit **'dummy_sort_key, issue_name / \*, count'**, as it is impossible to sort count with secondary sorting if we use the issue name as partitioner option.
- we need to sort numerically of the count, and in the mean time guarantee the emits for total calculation **(key, \*, count)** arrive first, thus we define *-inf* as the dummy sort key for those emits, as other counts are always postive.
- **reducer** will get total count first, then joint count for each word, and finally relative frequency. It needs to be a generic process such that if the combiner is not called, the final results would still be correct.
- specify secondary sort on issue name

### Mapper

In [14]:
%%writefile mapper.py
#!/usr/bin/python
import sys, re, string
# define regex for punctuation removal
regex = re.compile('[%s]' % re.escape(string.punctuation))

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_2,Mapper_cnt,1\n")

for line in sys.stdin:      
    # get issue count and name
    issue, count = line.strip().split(',')
    # emit issue as key, and 1 as count
    print "%s,%s,%s" %(count, issue, count)
    # for order inversion, to calculate total count
    print '%s,%s,%s' %('-3', '*', count)
    
# test for tie-break
#print '%s,%s,%s' %(1, 'zzz', 3)
#print '%s,%s,%s' %(1, 'oko', 3)
#print '%s,%s,%s' %(1, 'ccc', 3)

Overwriting mapper.py


###Reducer

In [15]:
%%writefile reducer.py
#!/usr/bin/python
from operator import itemgetter
import sys


# buffer for top and bottom
n_bottom, n_top = 10, 50
bottom, top = [], []
n_total = 0

# input comes from STDIN
for line in sys.stdin:
    dummy, issue, count = line.strip().split(',', 2)
    
    # skip bad count
    try:
        count = int(count)
    except ValueError:
        continue
    
    # get total count
    if '*' == issue:
        n_total += count        
        continue
    
    # calculate relative frequency
    rf = 1.0*count/n_total
    
    # buffer top and bottom
    if len(bottom) < n_bottom:
        bottom.append([issue, count, rf])
                
    if len(top) < n_top:
        top.append([issue, count, rf])
    else:
        top = top[1:] + [[issue, count, rf]]
        
# print results:
top.reverse()
print '\ntop %d issues:' %n_top
for rec in top:
    print '%.2f%%\t%d\t%s' %(100*rec[2], rec[1], rec[0])

print '\nbottom %d issues:' %n_bottom
for rec in bottom:
    print '%.2f%%\t%d\t%s' %(100*rec[2], rec[1], rec[0])

Overwriting reducer.py


###Run the job with Hadoop streaming

In [16]:
# assuming count results are available
!hdfs dfs -rm -r results2
!hadoop jar /usr/local/Cellar/hadoop/2.*/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D map.output.key.field.separator=, \
-D map.output.key.value.fields.spec=0-1:0- \
-D mapred.text.key.comparator.options='-k1,1n -k2,2' \
-D mapred.map.tasks=2 \
-D mapred.reduce.tasks=1 \
-files mapper.py,reducer.py \
-mapper mapper.py \
-reducer reducer.py \
-input /user/leiyang/results/part-0000* \
-output results2

Deleted results2
packageJobJar: [/var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/hadoop-unjar1115560675774446028/] [] /var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/streamjob3811758649004660161.jar tmpDir=null


###The sorted top and bottom issues are:

In [17]:
!hdfs dfs -cat /user/leiyang/results2/part-0000* 

	
top 50 issues:	
22.53%	70487	loan modification
11.75%	36767	loan servicing
9.29%	29069	incorrect information on credit report
5.18%	16205	account opening
4.01%	12545	health club
3.79%	11848	contd attempts collect debt not owed
3.37%	10555	deposits and withdrawals
2.76%	8625	application
2.22%	6938	billing disputes
2.21%	6920	communication tactics
2.00%	6273	other
1.81%	5663	problems caused by my funds being low
1.67%	5214	disclosure verification of debt
1.55%	4858	credit reporting companys investigation
1.46%	4560	managing the loan or lease
1.39%	4357	unable to get credit reportcredit score
1.39%	4350	settlement process and costs
1.23%	3844	repaying your loan
1.22%	3821	problems when you are unable to pay
1.10%	3431	apr or interest rate
1.05%	3276	identity theft  fraud  embezzlement
1.03%	3226	makingreceiving payments
0.91%	2832	improper contact or sharing of info
0.89%	2795	closingcancelling account
0.89%	2774	credit decision  underwriting
0.80%	2508	false 

###*3.2.1 OPTIONAL - * Using 2 reducers: 
- What are the top 50 most frequent terms in your word count analysis? 
- Present the top 50 terms and their frequency and their relative frequency. 
- If there are ties please sort the tokens in alphanumeric/string order. Present bottom 10 tokens (least frequent items).

###*HW3.3.* Shopping Cart Analysis
Product Recommendations: 
- The action or practice of selling additional products or services
to existing customers is called cross-selling. 
- Giving product recommendation is one of the examples of cross-selling that are frequently used by online retailers.
- One simple method to give product recommendations is to recommend products that are frequently
browsed together by the customers.

For this homework use the online browsing behavior dataset [here](https://www.dropbox.com/s/zlfyiwa70poqg74/ProductPurchaseData.txt?dl=0):

- Each line in this dataset represents a browsing session of a customer.
- On each line, each string of 8 characters represents the id of an item browsed during that session.
- The items are separated by spaces.

- Here are the first few lines of the ProductPurchaseData
 - FRO11987 ELE17451 ELE89019 SNA90258 GRO99222
 - GRO99222 GRO12298 FRO12685 ELE91550 SNA11465 ELE26917 ELE52966 FRO90334 SNA30755 ELE17451 FRO84225 SNA80192
 - ELE17451 GRO73461 DAI22896 SNA99873 FRO86643
 - ELE17451 ELE37798 FRO86643 GRO56989 ELE23393 SNA11465
 - ELE17451 SNA69641 FRO86643 FRO78087 SNA11465 GRO39357 ELE28573 ELE11375 DAI54444

**Do some exploratory data analysis of this dataset.**

- How many unique items are available from this supplier?
- **Using a single reducer:**
 - Report your findings such as number of unique products; largest basket; 
 - Report the top 50 most frequently purchased items,  their frequency,  and their relative frequency (break ties by sorting the products alphabetical order) etc. using Hadoop Map-Reduce.

###Mapper - pair count
- where $size$ is the basket size for each new session, otherwise zero to minimize data transfer

In [21]:
%%writefile mapper.py
#!/usr/bin/python
import sys

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_3,Mapper_cnt,1\n")

for line in sys.stdin:   
    # get all products
    products = line.strip().split(' ')
    size = len(products)
    if size==0:
        continue
    for i in range(size):                
        # emit word key
        print '%s,%s,%s' %(products[i], 1, size if i==0 else 0)


Overwriting mapper.py


###Reducer - pair count

In [106]:
%%writefile reducer.py
#!/usr/bin/python
import sys
import numpy as np

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_3,Reducer_cnt,1\n")

max_size = 0
current_prod = None
current_count = 0

for line in sys.stdin:   
    # get mapper output
    prod, count, size = line.strip().split(',', 2)
    
    # skip bad counts
    try:
        count = int(count)
        size = int(size)
    except ValueError:
        continue
    
    # handle basket size
    max_size = max(max_size, size)
        
    # count unique and get frequency
    if current_prod == prod:
        current_count += count
    else:
        # one product just finishes streaming
        if current_prod:            
            # emit product count
            print '%s,%s' %(current_prod, current_count)            
                    
        # reset for new prod
        current_prod = prod
        current_count = count

#print 'max basket size: %d' %max_size

Overwriting reducer.py


###Mapper - relative frequency & sort
- use **order inversion** for reletive frequency, for each word emit $(dummy\_sort, *, count)$ and $(dummy\_sort, product, count)$

In [107]:
%%writefile mapper_s.py
#!/usr/bin/python
import sys

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_3s,Mapper_cnt,1\n")

for line in sys.stdin:      
    # get product and count
    prod, count = line.strip().split(',')
    # emit prod as key, and count
    print "%s,%s,%s" %(count, prod, count)
    # for order inversion, to calculate total count
    print '%d,%s,%s' %(1e+10, '*', count)

Overwriting mapper_s.py


###Reducer - relative frequency & sort
- get the top 50 pairs with most count
- obtain unique products

In [108]:
%%writefile reducer_s.py
#!/usr/bin/python
import sys

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_3,Reducer_cnt,1\n")

n_total = 0
n_top = 50
n_unique = 0

for line in sys.stdin:   
    
    dummy, product, count = line.strip().split(',', 2)
    
    try:
        count = int(count)
    except ValueError:
        continue
       
    # handle total
    if product == '*':
        n_total += count
        continue
    
    # get relative frequency
    n_unique += 1
    if n_unique <= n_top:
        print '%s\t%s\t%.4f%%' %(product, count, 100.0*count/n_total)
    
print 'total browsing items: %d' %n_total
print 'unique product: %d' %n_unique

Overwriting reducer_s.py


###MapReducing

In [109]:
# job 1 - pair count
!hdfs dfs -rm -r results
!hadoop jar /usr/local/Cellar/hadoop/2.*/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D map.output.key.field.separator=, \
-D map.output.key.value.fields.spec=0:1- \
-D mapred.text.key.comparator.options='-k1,1' \
-D mapred.map.tasks=3 \
-D mapred.reduce.tasks=1 \
-files mapper.py,reducer.py \
-mapper mapper.py \
-reducer reducer.py \
-input /user/leiyang/ProductPurchaseData.txt \
-output results

# job 2 - relative frequency & sort with order inversion
!hdfs dfs -rm -r results2
!hadoop jar /usr/local/Cellar/hadoop/2.*/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D map.output.key.field.separator=, \
-D map.output.key.value.fields.spec=0-1:0- \
-D mapred.text.key.comparator.options='-k1,1nr -k2,2' \
-D mapred.map.tasks=3 \
-D mapred.reduce.tasks=1 \
-files mapper_s.py,reducer_s.py \
-mapper mapper_s.py \
-reducer reducer_s.py \
-input /user/leiyang/results/part-0000* \
-output results2

Deleted results
packageJobJar: [/var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/hadoop-unjar6113682566909254738/] [] /var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/streamjob2500948326253885932.jar tmpDir=null
Deleted results2
packageJobJar: [/var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/hadoop-unjar2055398774848871732/] [] /var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/streamjob4358016637415128494.jar tmpDir=null


In [110]:
!hdfs dfs -cat results2/part-0*

DAI62779	6667	1.7507%
FRO40251	3881	1.0191%
ELE17451	3875	1.0175%
GRO73461	3602	0.9458%
SNA80324	3044	0.7993%
ELE32164	2851	0.7486%
DAI75645	2736	0.7184%
SNA45677	2455	0.6447%
FRO31317	2330	0.6118%
DAI85309	2293	0.6021%
ELE26917	2292	0.6019%
FRO80039	2233	0.5864%
GRO21487	2115	0.5554%
SNA99873	2083	0.5470%
GRO59710	2004	0.5262%
GRO71621	1920	0.5042%
FRO85978	1918	0.5036%
GRO30386	1840	0.4832%
ELE74009	1816	0.4769%
GRO56726	1784	0.4685%
DAI63921	1773	0.4656%
GRO46854	1756	0.4611%
ELE66600	1713	0.4498%
DAI83733	1712	0.4496%
FRO32293	1702	0.4469%
ELE66810	1697	0.4456%
SNA55762	1646	0.4322%
DAI22177	1627	0.4272%
FRO78087	1531	0.4020%
ELE99737	1516	0.3981%
ELE34057	1489	0.3910%
GRO94758	1489	0.3910%
FRO35904	1436	0.3771%
FRO53271	1420	0.3729%
SNA93860	1407	0.3695%
SNA90094	1390	0.3650%
GRO38814	1352	0.3550%
ELE56788	1345	0.3532%
GRO61133	1321	0.3469%
DAI88807	1316	0.3456%
ELE74482	1316	0.3456%
ELE59935	1311	0.3443%
SNA96271	1295	0.3401%
DAI43223	12

###*3.3.1 OPTIONAL* - Using 2 reducers:  
- Report your findings such as number of unique products; largest basket; 
- Report the top 50 most frequently purchased items,  their frequency,  and their relative frequency (break ties by sorting the products alphabetical order) etc. using Hadoop Map-Reduce.

**Notes:**
- the challenge is from total calculation since we have multiple reducers, as only one will get the ***** key and be able to calculate the marginal.
- possible solution: 
 - write a customer partitioner, emit two dummy pairs, dispatch one for each reducer

###*HW3.4.* (Computationally prohibitive but then again Hadoop can handle this) Pairs

- Suppose we want to recommend new products to the customer based on the products they
have already browsed on the online website. 
- Write a map-reduce program to find products which are frequently browsed together. 
- Fix the support count (cooccurence count) to s = 100
(i.e. product pairs need to occur together at least 100 times to be considered frequent),
and find pairs of items (sometimes referred to itemsets of size 2 in association rule mining) that have a support count of 100 or more.

**List the top 50 product pairs with corresponding support count (aka frequency)**, and relative frequency or support (number of records where they coccur, the number of records where they coccur/the number of baskets in the dataset)  in decreasing order of support  for frequent (100>count) itemsets of size 2.

Use the Pairs pattern (lecture 3)  to  extract these frequent itemsets of size 2. Free free to use combiners if they bring value. Instrument your code with counters for count the number of times your mapper, combiner and reducers are called.

<img src="Pairs.png" alt="Drawing" style="width: 600px;"/>

Please output records of the following form for the top 50 pairs (itemsets of size 2):

      item1, item2, support count, support



Fix the ordering of the pairs lexicographically (left to right),
and break ties in support (between pairs, if any exist)
by taking the first ones in lexicographically increasing order.

Report  the compute time for the Pairs job. Describe the computational setup used (E.g., single computer; dual core; linux, number of mappers, number of reducers)
 
|Spec | Value|
|---|:---:|
| Computer | single |
| OS  | OS X El Capitan |
| Processor | 2.2 GHz Intel Core i7  |
| Memory | 16 GB 1600 MHz DDR3|
Instrument your mapper, combiner, and reducer to count how many times each is called using Counters and report these counts.


###Mapper
- for each session (row), use pair pattern with **order inversion**, emit $((w_i\_w_j),1)$ for all pairs, and one $(*,1)$ for the session (for total session count).
- the fourth field of every emit is used to indicate basket size for every session (row), size is postive for *only* one emit, and zero for rest of the emit, so we minimize data transfer

In [59]:
%%writefile mapper.py
#!/usr/bin/python
import sys

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_4,Mapper_cnt,1\n")

for line in sys.stdin:   
    # get all products from the session
    products = line.strip().split(' ')
    size = len(products)
    if size==0:
        continue
    
    # sort products the pair is lexicographically sound
    products.sort()
    
    # get pairs of products
    pairs = [[products[i], products[j]] for i in range(size) for j in range(i+1, size)]
    
    # emit dummy record
    print '%s,%s' %('*', 1)
    
    # emit product pairs
    for pair in pairs:
        print '%s_%s,%s' %(pair[0], pair[1], 1)

Overwriting mapper.py


###Combiner
- local aggregation for count

In [60]:
%%writefile combiner.py
#!/usr/bin/python
import sys

# increase counter for reducer being called
sys.stderr.write("reporter:counter:HW3_4,Combiner_cnt,1\n")

current_pair = None
current_count = 0

for line in sys.stdin:       
    # get all products from the session
    pair, count = line.strip().split(',', 1)
    
    # skip bad count
    try:
        count = int(count)
    except ValueError:
        continue
        
    # accumulate counts for whatever keys it receives
    if current_pair == pair:
        current_count += count
    else:
        # previous pair finishes streaming, emit results
        if current_pair:            
            print '%s,%s' %(current_pair, current_count)
        # reset new pair
        current_pair = pair
        current_count = count

Overwriting combiner.py


###Reducer
- count number of basket based on $(*, 1)$ emits
- get suport and relative frequency for each pair in the stream

In [61]:
%%writefile reducer.py
#!/usr/bin/python
import sys

# increase counter for reducer being called
sys.stderr.write("reporter:counter:HW3_4,Reducer_cnt,1\n")

n_basket = 0
min_support = 100
current_pair = None
current_count = 0

for line in sys.stdin:       
    # get all products from the session
    pair, count = line.strip().split(',', 1)
    
    # skip bad count
    try:
        count = int(count)
    except ValueError:
        continue
        
    # get total sessions/baskets
    if pair == '*':
        n_basket += count
        continue
        
    # get pair count
    if current_pair == pair:
        current_count += count
    else:
        # previous pair finishes streaming
        if current_pair and current_count > min_support:
            # get relative freq
            rf = 100.0*current_count/n_basket
            # emit
            print '%s,%s,%.4f%%' %(current_pair, current_count, rf)
        # reset new pair
        current_pair = pair
        current_count = count

#print '\ntotal basket: %d' %n_basket


Overwriting reducer.py


###Mapper for sort (or use identity mapper)

In [62]:
%%writefile mapper_s.py
#!/usr/bin/python
import sys

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_4,Mapper_s_cnt,1\n")

for line in sys.stdin:   
    # just emit
    print line.strip()

Overwriting mapper_s.py


###Reducer for sort

In [63]:
%%writefile reducer_s.py
#!/usr/bin/python
import sys

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_4,Reducer_s_cnt,1\n")

n_out = 0
n_top = 50

print 'top %d pairs: ' %n_top

for line in sys.stdin:   
    # parse mapper output  
    pair, count, rf = line.strip().split(',', 2)
    n_out += 1
    if n_out <= n_top:
        w1, w2 = pair.split('_')
        print '%s\t%s\t%s\t%s' %(w1, w2, count, rf)

Overwriting reducer_s.py


###MapReducing without combiner

In [64]:
# job 1 - count
!hdfs dfs -rm -r results
!hadoop jar /usr/local/Cellar/hadoop/2.*/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D map.output.key.field.separator=, \
-D map.output.key.value.fields.spec=0:1- \
-D mapred.text.key.comparator.options='-k1,1' \
-D mapred.map.tasks=3 \
-D mapred.reduce.tasks=1 \
-files mapper.py,reducer.py \
-mapper mapper.py \
-reducer reducer.py \
-input /user/leiyang/ProductPurchaseData.txt \
-output results

# job 2 - sort relative frequency
!hdfs dfs -rm -r results2
!hadoop jar /usr/local/Cellar/hadoop/2.*/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D map.output.key.field.separator=',' \
-D map.output.key.value.fields.spec=0-1:2- \
-D mapred.text.key.comparator.options='-k2,2nr -k1,1' \
-D mapred.map.tasks=2 \
-D mapred.reduce.tasks=1 \
-files mapper_s.py,reducer_s.py \
-mapper mapper_s.py \
-reducer reducer_s.py \
-input /user/leiyang/results/part-0000* \
-output results2

Deleted results
packageJobJar: [/var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/hadoop-unjar6313500269707918499/] [] /var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/streamjob6247307708596164046.jar tmpDir=null
Deleted results2
packageJobJar: [/var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/hadoop-unjar2314263346417839441/] [] /var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/streamjob201592152313437354.jar tmpDir=null


###HW3.4 results without combiner
- 3 mappers, 1 reducer

<img src="HW3_4.counter.png" alt="Drawing" style="width: 880px;"/>
<img src="HW3_4.time.png" alt="Drawing" style="width: 400px;"/>

In [65]:
!hdfs dfs -cat results2/part-0*

top 50 pairs: 	
DAI62779	ELE17451	1592	5.1188%
FRO40251	SNA80324	1412	4.5400%
DAI75645	FRO40251	1254	4.0320%
FRO40251	GRO85051	1213	3.9002%
DAI62779	GRO73461	1139	3.6623%
DAI75645	SNA80324	1130	3.6333%
DAI62779	FRO40251	1070	3.4404%
DAI62779	SNA80324	923	2.9678%
DAI62779	DAI85309	918	2.9517%
ELE32164	GRO59710	911	2.9292%
DAI62779	DAI75645	882	2.8359%
FRO40251	GRO73461	882	2.8359%
DAI62779	ELE92920	877	2.8198%
FRO40251	FRO92469	835	2.6848%
DAI62779	ELE32164	832	2.6752%
DAI75645	GRO73461	712	2.2893%
DAI43223	ELE32164	711	2.2861%
DAI62779	GRO30386	709	2.2797%
ELE17451	FRO40251	697	2.2411%
DAI85309	ELE99737	659	2.1189%
DAI62779	ELE26917	650	2.0900%
GRO21487	GRO73461	631	2.0289%
DAI62779	SNA45677	604	1.9421%
ELE17451	SNA80324	597	1.9196%
DAI62779	GRO71621	595	1.9131%
DAI62779	SNA55762	593	1.9067%
DAI62779	DAI83733	586	1.8842%
ELE17451	GRO73461	580	1.8649%
GRO73461	SNA80324	562	1.8070%
DAI62779	GRO59710	561	1.8038%
DAI62779	FRO80039	550	1.7684%
DAI75645	ELE174

###MapReducing with combiner

In [66]:
# job 1 - add combiner below
!hdfs dfs -rm -r results
!hadoop jar /usr/local/Cellar/hadoop/2.*/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D map.output.key.field.separator=, \
-D map.output.key.value.fields.spec=0:1- \
-D mapred.text.key.comparator.options='-k1,1' \
-D mapred.map.tasks=3 \
-D mapred.reduce.tasks=1 \
-files mapper.py,reducer.py,combiner.py \
-mapper mapper.py \
-reducer reducer.py \
-combiner combiner.py \
-input /user/leiyang/ProductPurchaseData.txt \
-output results

# job 2 - sort relative frequency
!hdfs dfs -rm -r results2
!hadoop jar /usr/local/Cellar/hadoop/2.*/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D map.output.key.field.separator=',' \
-D map.output.key.value.fields.spec=0-1:2- \
-D mapred.text.key.comparator.options='-k2,2nr -k1,1' \
-D mapred.map.tasks=2 \
-D mapred.reduce.tasks=1 \
-files mapper_s.py,reducer_s.py \
-mapper mapper_s.py \
-reducer reducer_s.py \
-input /user/leiyang/results/part-0000* \
-output results2

Deleted results
packageJobJar: [/var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/hadoop-unjar3295224449230340302/] [] /var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/streamjob7434701589508937802.jar tmpDir=null
Deleted results2
packageJobJar: [/var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/hadoop-unjar943447964334990827/] [] /var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/streamjob5402224175670776010.jar tmpDir=null


###HW3.4 Results with combiner
- 3 mappers, 1 reducer
- the combiner was called 1 time by each map process, total 3 times

<img src="HW3_4.combiner.counter.png" alt="Drawing" style="width: 880px;"/>
<img src="HW3_4.combiner.time.png" alt="Drawing" style="width: 400px;"/>

In [68]:
!hdfs dfs -cat results2/part-0*

top 50 pairs: 	
DAI62779	ELE17451	1592	5.1188%
FRO40251	SNA80324	1412	4.5400%
DAI75645	FRO40251	1254	4.0320%
FRO40251	GRO85051	1213	3.9002%
DAI62779	GRO73461	1139	3.6623%
DAI75645	SNA80324	1130	3.6333%
DAI62779	FRO40251	1070	3.4404%
DAI62779	SNA80324	923	2.9678%
DAI62779	DAI85309	918	2.9517%
ELE32164	GRO59710	911	2.9292%
DAI62779	DAI75645	882	2.8359%
FRO40251	GRO73461	882	2.8359%
DAI62779	ELE92920	877	2.8198%
FRO40251	FRO92469	835	2.6848%
DAI62779	ELE32164	832	2.6752%
DAI75645	GRO73461	712	2.2893%
DAI43223	ELE32164	711	2.2861%
DAI62779	GRO30386	709	2.2797%
ELE17451	FRO40251	697	2.2411%
DAI85309	ELE99737	659	2.1189%
DAI62779	ELE26917	650	2.0900%
GRO21487	GRO73461	631	2.0289%
DAI62779	SNA45677	604	1.9421%
ELE17451	SNA80324	597	1.9196%
DAI62779	GRO71621	595	1.9131%
DAI62779	SNA55762	593	1.9067%
DAI62779	DAI83733	586	1.8842%
ELE17451	GRO73461	580	1.8649%
GRO73461	SNA80324	562	1.8070%
DAI62779	GRO59710	561	1.8038%
DAI62779	FRO80039	550	1.7684%
DAI75645	ELE174

###*HW3.5*: Stripes
- Repeat 3.4 using the stripes design pattern for finding cooccuring pairs.
- Report  the compute times for stripes job versus the Pairs job. 
- Describe the computational setup used (E.g., single computer; dual core; linux, number of mappers, number of reducers)

- Instrument your mapper, combiner, and reducer to count how many times each is called using Counters and report these counts. 
- Discuss the differences in these counts between the Pairs and Stripes jobs

<img src="Stripes.png" alt="Drawing" style="width: 600px;"/>

###Mapper
- build associative array for each session, and do local in-memory aggregation
- for the associative array, we implement the rule that *any key will only have words that alphabetically behind it in the associative array*, to have unique pairs

In [71]:
%%writefile mapper.py
#!/usr/bin/python
import sys

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_5,Mapper_cnt,1\n")

# composite associative array
H = {}

for line in sys.stdin:   
    # get all products from the session
    products = line.strip().split(' ')
    size = len(products)
    if size==0:
        continue
    
    # sort products so the pair is lexicographically sound
    products.sort()
    
    # get pairs of products
    pairs = [[products[i], products[j]] for i in range(size) for j in range(i+1, size)]
    
    # emit dummy record
    print '%s\t%s' %('*', 1)
    
    # build associative arrays
    for w1, w2 in pairs:
        # each pair is lexicographically in order        
        if w1 not in H:
            # if w1 is new, add an associative array for it
            H[w1] = {}
            H[w1][w2] = 1            
        elif w2 not in H[w1]:
            # w1 is not new, but it doesn't have key for w2
            H[w1][w2] = 1
        else:
            # both are there, increase it
            H[w1][w2] += 1
        
# emit associative arrays
for h in H:
    print '%s\t%s' %(h, str(H[h]))

Overwriting mapper.py


###Reducer
- element-wise sum

In [74]:
%%writefile reducer.py
#!/usr/bin/python

# function to combine associative array
def elementSum(H1, H2):    
    # make sure H1 is the long one
    if len(H1)<len(H2):
        H0 = H2
        H2 = H1
        H1 = H0
    # merge shorter one H2 into longer one H1
    for h in H2:
        if h not in H1:
            H1[h] = H2[h]
        else:
            H1[h] += H2[h]        
    # return
    return H1

import sys
import numpy as np

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_5,Reducer_cnt,1\n")

min_support = 100
current_word = None
current_aArray = None
n_total = 0

for line in sys.stdin:
    # parse out keyword and the associative array
    word, aArray = line.strip().split('\t', 1)
    
    # get total basket
    if word == '*':
        n_total += int(aArray)
        continue
    
    # get array into variable
    cmdStr = 'aArray = ' + aArray
    exec cmdStr
        
    # merge the associative array
    if current_word == word:
        current_aArray = elementSum(current_aArray, aArray)           
    else:
        # finish one word merge
        if current_word:
            # get the top pairs with heap
            for p in current_aArray:
                if current_aArray[p] > min_support:                    
                    # get relative freq
                    rf = 100.0*current_aArray[p]/n_total
                    print '%s,%s,%s,%.4f%%' %(current_word, p, current_aArray[p], rf)
        # reset for a new word
        current_word = word
        current_aArray = aArray

#print '\ntotal basket: %d' %n_total


Overwriting reducer.py


###Mapper to sort

In [79]:
%%writefile mapper_s.py
#!/usr/bin/python
import sys

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_5,Mapper_s_cnt,1\n")

for line in sys.stdin:   
    # just emit
    print line.strip()

Overwriting mapper_s.py


###Reducer to sort

In [80]:
%%writefile reducer_s.py
#!/usr/bin/python
import sys

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_5,Reducer_s_cnt,1\n")

n_out = 0
n_top = 50

print 'top %d pairs: ' %n_top

for line in sys.stdin:   
    # parse mapper output  
    n_out += 1
    if n_out <= n_top:        
        print line.strip().replace(',', '\t')

Overwriting reducer_s.py


###MapReducing

In [84]:
# job 1 - count
!hdfs dfs -rm -r results
!hadoop jar /usr/local/Cellar/hadoop/2.*/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D mapred.map.tasks=3 \
-D mapred.reduce.tasks=1 \
-files mapper.py,reducer.py,combiner.py \
-mapper mapper.py \
-reducer reducer.py \
-input /user/leiyang/ProductPurchaseData.txt \
-output results

# job 2 - sort relative frequency
!hdfs dfs -rm -r results2
!hadoop jar /usr/local/Cellar/hadoop/2.*/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D map.output.key.field.separator=',' \
-D map.output.key.value.fields.spec=0-2:3- \
-D mapred.text.key.comparator.options='-k3,3nr -k1,1 -k2,2' \
-D mapred.map.tasks=2 \
-D mapred.reduce.tasks=1 \
-files mapper_s.py,reducer_s.py \
-mapper mapper_s.py \
-reducer reducer_s.py \
-input /user/leiyang/results/part-0000* \
-output results2

Deleted results
packageJobJar: [/var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/hadoop-unjar3782570145270895412/] [] /var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/streamjob870881278814839645.jar tmpDir=null
Deleted results2
packageJobJar: [/var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/hadoop-unjar5736039512355960039/] [] /var/folders/tx/5ldq67q511q8wqwqkvptnxd00000gn/T/streamjob3937477619845631898.jar tmpDir=null


###HW3.5 Results
- 3 mappers, 1 reducer
- the combiner was called 1 time by each map process, total 3 times
- with the same configure, the execution time is reduced to 15 sec.  from 23 sec. of pair approach, about **33%** improvement

<img src="HW3_5.counter.png" alt="Drawing" style="width: 880px;"/>
<img src="HW3_5.time.png" alt="Drawing" style="width: 400px;"/>

In [85]:
!hdfs dfs -cat results2/part-0*

top 50 pairs: 	
DAI62779	ELE17451	1592	5.1188%
FRO40251	SNA80324	1412	4.5400%
DAI75645	FRO40251	1254	4.0320%
FRO40251	GRO85051	1213	3.9002%
DAI62779	GRO73461	1139	3.6623%
DAI75645	SNA80324	1130	3.6333%
DAI62779	FRO40251	1070	3.4404%
DAI62779	SNA80324	923	2.9678%
DAI62779	DAI85309	918	2.9517%
ELE32164	GRO59710	911	2.9292%
DAI62779	DAI75645	882	2.8359%
FRO40251	GRO73461	882	2.8359%
DAI62779	ELE92920	877	2.8198%
FRO40251	FRO92469	835	2.6848%
DAI62779	ELE32164	832	2.6752%
DAI75645	GRO73461	712	2.2893%
DAI43223	ELE32164	711	2.2861%
DAI62779	GRO30386	709	2.2797%
ELE17451	FRO40251	697	2.2411%
DAI85309	ELE99737	659	2.1189%
DAI62779	ELE26917	650	2.0900%
GRO21487	GRO73461	631	2.0289%
DAI62779	SNA45677	604	1.9421%
ELE17451	SNA80324	597	1.9196%
DAI62779	GRO71621	595	1.9131%
DAI62779	SNA55762	593	1.9067%
DAI62779	DAI83733	586	1.8842%
ELE17451	GRO73461	580	1.8649%
GRO73461	SNA80324	562	1.8070%
DAI62779	GRO59710	561	1.8038%
DAI62779	FRO80039	550	1.7684%
DAI75645	ELE174

###OPTIONAL: all HW below this are optional

** Preliminary information **

Much of this homework beyond this point will focus on the Apriori algorithm for frequent itemset  mining and the additional step for extracting association rules from these frequent itemsets.
Please acquaint yourself with the background information (below)
before approaching the remaining  assignments.

** Apriori background information **

Some background material for the  Apriori algorithm is located at:

 - Slides in Live Session #3
 - https://en.wikipedia.org/wiki/Apriori_algorithm
 - https://www.dropbox.com/s/k2zm4otych279z2/Apriori-good-slides.pdf?dl=0
 - http://snap.stanford.edu/class/cs246-2014/slides/02-assocrules.pdf

Association Rules are frequently used for Market Basket Analysis (MBA) by retailers to
understand the purchase behavior of their customers. This information can be then used for
many different purposes such as cross-selling and up-selling of products, sales promotions,
loyalty programs, store design, discount plans and many others.
Evaluation of item sets: Once you have found the frequent itemsets of a dataset, you need
to choose a subset of them as your recommendations. Commonly used metrics for measuring
significance and interest for selecting rules for recommendations are: confidence; lift; and conviction.

###*HW3.6*
What is the Apriori algorithm? Describe an example use in your domain of expertise and what kind of . Define confidence and lift.

NOTE:
For the remaining homework use the online browsing behavior dataset located at (same dataset as used above):

       https://www.dropbox.com/s/zlfyiwa70poqg74/ProductPurchaseData.txt?dl=0

Each line in this dataset represents a browsing session of a customer.
On each line, each string of 8 characters represents the id of an item browsed during that session.
The items are separated by spaces.

Here are the first few lines of the ProductPurchaseData:

- FRO11987 ELE17451 ELE89019 SNA90258 GRO99222
- GRO99222 GRO12298 FRO12685 ELE91550 SNA11465 ELE26917 ELE52966 FRO90334 SNA30755 ELE17451 FRO84225 SNA80192
- ELE17451 GRO73461 DAI22896 SNA99873 FRO86643
- ELE17451 ELE37798 FRO86643 GRO56989 ELE23393 SNA11465
- ELE17451 SNA69641 FRO86643 FRO78087 SNA11465 GRO39357 ELE28573 ELE11375 DAI54444


###Answer:
- Aprior algorithm is used to find frequent itemsets, each iteration has two scans of data and a filtering in between:
 1. generate a condidate set $C_k$ for itemsets of size $k$, based on the output of previous iteration $L_{k-1}$
 2. remove all members from the set whose support is less than the user specified threshold $s_i$
 3. generate the final set $L_k$ for frequent itemset of size $k$, based on output after filtering
 
 
- For example, to find itemsets of size $k$ from a basket set, the process is:
 1. count all single words from all baskets, output $C_1$
 2. remove all words with support below threshold, output $L_1$
 3. using $L_1$, generate candidate set for frequent pair set $C_2$
 5. remove all pairs with support below threshold, get $L_2$
 6. using $L_2$, generate candidate set for frequent triple set $C_3$
 7. remove all triples with support below threshold, get $L_3$

###*HW3.7.* Shopping Cart Analysis
Product Recommendations: The action or practice of selling additional products or services
to existing customers is called cross-selling. Giving product recommendation is
one of the examples of cross-selling that are frequently used by online retailers.
One simple method to give product recommendations is to recommend products that are frequently
browsed together by the customers.

Suppose we want to recommend new products to the customer based on the products they
have already browsed on the online website

- Write a program using the A-priori algorithm to find products which are frequently browsed together. 
- Fix the support to s = 100 (i.e. product sets need to occur together at least 100 times to be considered frequent)
and find itemsets of size 2 and 3.

Then extract association rules from these frequent items. A rule is of the form:

- (item1, item5) ⇒ item2.

List the top 10 discovered rules in descreasing order of confidence in the following format

- (item1, item5) ⇒ item2, supportCount ,support, confidence

**Implementation Notes:**
- each MapReduce job perform one round of APrior processing:
 - mapper: construct candidate set $C_k$
 - reducer: filter $C_k$ to get frequent item set $L_k$
- to find itemsets of size 3, we will need 3 jobs


###Mapper 1: get $C_1$
- emit singleton

In [10]:
%%writefile mapper_1.py
#!/usr/bin/python
import sys

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_7,Mapper_1_cnt,1\n")

for line in sys.stdin:   
    # get words and emit
    for prod in line.strip().split(' '):
        print '%s\t%d' %(prod, 1)

Overwriting mapper_1.py


###Reducer 1: get $L_1$
- only emit words whose frequency is above the support threshold (100)
- can be used as **combiner** too

In [11]:
%%writefile reducer_1.py
#!/usr/bin/python
import sys

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_7,Reducer_1_cnt,1\n")

current_prod = None
current_count = 0
min_support = 100

for line in sys.stdin:   
    # get k-v pair
    prod, count = line.strip().split('\t', 1)
    
    # skip bad count
    try:
        count = int(count)
    except ValueError:
        continue
        
    # get count
    if current_prod == prod:
        current_count += count
    else:
        if current_prod and current_count > min_support:
            # emit prod above min support
            print '%s\t%d' %(current_prod, current_count)
        # reset
        current_prod = prod
        current_count = count
    

Overwriting reducer_1.py


###Mapper 2: get $C_2$

In [12]:
%%writefile mapper_2.py
#!/usr/bin/python
import sys, subprocess 

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_7,Mapper_2_cnt,1\n")

singleton = []
cat = subprocess.Popen(['hdfs', 'dfs', '-cat', 'results1/part-0*'], stdout=subprocess.PIPE)
for line in cat.stdout:
    singleton.append(line.strip().split('\t')[0])

# read the input data
for line in sys.stdin:   
    # debug code
    # print line.strip()
    # continue
    
    line = line.strip()
    # if it's job1 output, rehydrate singleton buffer
    if '\t' in line:
        singleton.append(line.split('\t', 1)[0])
        continue
    
    # get words for each session
    prod = line.strip().split(' ')
        
    # keep product from singleton set only
    products = [val for val in prod if val in singleton]
    products.sort()
    
    # get pairs to emit
    size = len(products)
    pairs = [products[i] + '_' + products[j] for i in range(size) for j in range(i+1, size)]
    for p in pairs:
        print '%s\t%d' %(p, 1)
        

Overwriting mapper_2.py


###Reducer 2: get $L_2$ 
- same as Reducer 1, since we have identical k-v format (%s\t%d) from the mapper

In [13]:
### same as reducer_1.py

###Mapper 3: get $C_3$

In [37]:
%%writefile mapper_3.py
#!/usr/bin/python
import sys, subprocess 

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_7,Mapper_3_cnt,1\n")

# load the frequent pairs given by Job 2
pair = []
cat = subprocess.Popen(['hdfs', 'dfs', '-cat', 'results2/part-0*'], stdout=subprocess.PIPE)
for line in cat.stdout:
    pair.append(line.strip().split('\t')[0])

# still read frequent pairs first, then session data to generate triples
for line in sys.stdin:   
    line = line.strip()
            
    # get products from each session
    prod = line.split(' ')
    prod.sort()
    n = len(prod)
    
    # generate triples from the session, in the format of a_b_c, alphabetically sorted
    triples = [[prod[i],prod[j],prod[k]] for i in range(n) for j in range(i+1,n) for k in range(i+2,n)]
    
    # processing triples
    for tri in triples:
        # from each triple a_b_c: check if the 3 child-pairs (a_b, b_c, a_c) are in the frequent pair set
        if tri[0]+'_'+tri[1] in pair and tri[1]+'_'+tri[2] in pair and tri[0]+'_'+tri[2] in pair:
            # if so, emit 3 dummies a_b_*, b_c_*, a_c_*, and the triple itself a_b_c
            print '%s_%s_*\t%d' %(tri[0], tri[1], 1)
            print '%s_%s_*\t%d' %(tri[0], tri[2], 1)
            print '%s_%s_*\t%d' %(tri[1], tri[2], 1)
            print '%s_%s_%s\t%d' %(tri[0], tri[1], tri[2], 1)
    

Overwriting mapper_3.py


###Reducer 3: get $L_3$
- use order inversion to get confidence

In [38]:
%%writefile reducer_3.py
#!/usr/bin/python
import sys

# increase counter for mapper being called
sys.stderr.write("reporter:counter:HW3_7,Reducer_3_cnt,1\n")

current_prod = None
current_dummy = None
current_count = 0
min_support = 100
marginal = 0

for line in sys.stdin:   
    # debug code
    # print line.strip()
    # continue
    
    # get k-v pair
    prod, count = line.strip().split('\t', 1)
    
    # skip bad count
    try:
        count = int(count)
    except ValueError:
        continue
    
    # handle marginal with dummy key
    if '*' in prod:        
        if current_dummy == prod:
            # accumulate marginal
            marginal += count
        else:
            # reset marginal for new dummy key
            current_dummy = prod
            marginal = count
        continue
        
    # processing triple and emit rules
    if current_prod == prod:
        current_count += count
    else:
        if current_prod and current_count > min_support:   ### bug here when new word comes, current_count is the old value
            # emit triples for the rule
            w1,w2,w3 = current_prod.split('_')
            conf = 100.0*current_count/marginal
            print '(%s, %s) => %s, %d, %d, %.2f%% \t %s' %(w1, w2, w3, current_count, marginal, conf, current_dummy)
            
        # reset for new triple
        current_prod = prod
        current_count = count
        
    

Overwriting reducer_3.py


###MapReducing

In [16]:
# job 1 - get L_1 for frequent singletons
!hdfs dfs -rm -r results1
!hadoop jar /usr/local/Cellar/hadoop/2.*/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D mapred.map.tasks=3 \
-D mapred.reduce.tasks=1 \
-files mapper_1.py,reducer_1.py \
-mapper mapper_1.py \
-reducer reducer_1.py \
-combiner reducer_1.py \
-input /user/leiyang/ProductPurchaseData.txt \
-output results1

Deleted results1


In [18]:
# job 2 - get L_2 for frequent pairs
!hdfs dfs -rm -r results2
!hadoop jar /usr/local/Cellar/hadoop/2.*/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D mapred.map.tasks=3 \
-D mapred.reduce.tasks=1 \
-files mapper_2.py,reducer_1.py \
-mapper mapper_2.py \
-reducer reducer_1.py \
-input /user/leiyang/ProductPurchaseData.txt \
-output results2

Deleted results2


In [39]:
# job 3 - get L_3 and calculate association rules
!hdfs dfs -rm -r results3
!hadoop jar /usr/local/Cellar/hadoop/2.*/libexec/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D mapred.map.tasks=3 \
-D mapred.reduce.tasks=1 \
-files mapper_3.py,reducer_3.py \
-mapper mapper_3.py \
-reducer reducer_3.py \
-input /user/leiyang/ProductPurchaseData.txt \
-output results3

Deleted results3


In [40]:
!hdfs dfs -cat results3/part-0*

(DAI22896, DAI62779) => GRO73461, 101, 719, 14.05% 	 DAI22896_DAI62779_*
(DAI23334, DAI62779) => ELE92920, 143, 231, 61.90% 	 DAI31081_DAI43223_*
(DAI31081, DAI62779) => ELE17451, 103, 923, 11.16% 	 DAI31081_DAI62779_*
(DAI31081, DAI75645) => FRO40251, 122, 593, 20.57% 	 DAI31081_DAI75645_*
(DAI31081, ELE32164) => GRO59710, 112, 633, 17.69% 	 DAI31081_ELE32164_*
(DAI31081, FRO40251) => GRO85051, 102, 795, 12.83% 	 DAI31081_FRO40251_*
(DAI31081, FRO40251) => SNA80324, 103, 365, 28.22% 	 DAI31081_FRO53271_*
(DAI42083, DAI62779) => DAI92600, 105, 287, 36.59% 	 DAI42083_DAI62779_*
(DAI42083, DAI92600) => ELE17451, 117, 384, 30.47% 	 DAI42083_DAI92600_*
(DAI42493, DAI62779) => ELE17451, 112, 991, 11.30% 	 DAI42493_DAI62779_*
(DAI42493, DAI62779) => ELE92920, 112, 991, 11.30% 	 DAI42493_DAI62779_*
(DAI42493, DAI62779) => SNA18336, 109, 991, 11.00% 	 DAI42493_DAI62779_*
(DAI43223, DAI62779) => ELE17451, 227, 1272, 17.85% 	 DAI43223_DAI62779_*
(DAI43223, DAI62779) => ELE32164, 287

###*HW3.8*

Benchmark your results using the pyFIM implementation of the Apriori algorithm
(Apriori - Association Rule Induction / Frequent Item Set Mining implemented by Christian Borgelt).
You can download pyFIM from here:

http://www.borgelt.net/pyfim.html

Comment on the results from both implementations (your Hadoop MapReduce of apriori versus pyFIM)
in terms of results and execution times.

###*HW3.8* (Conceptual Exercise)

Suppose that you wished to perform the Apriori algorithm once again,
though this time now with the goal of listing the top 5 rules with corresponding confidence scores
in decreasing order of confidence score for itemsets of size 3 using Hadoop MapReduce.
A rule is now of the form:

(item1, item2) ⇒ item3

Recall that the Apriori algorithm is iterative for increasing itemset size,
working off of the frequent itemsets of the previous size to explore
ONLY the NECESSARY subset of a large combinatorial space.
Describe how you might design a framework to perform this exercise.

In particular, focus on the following:
  — map-reduce steps required
  - enumeration of item sets and filtering for frequent candidates

###stop yarn, hdfs, and job history

In [45]:
!/usr/local/Cellar/hadoop/2*/sbin/stop-yarn.sh
!/usr/local/Cellar/hadoop/2*/sbin/stop-dfs.sh
!/usr/local/Cellar/hadoop/2*/sbin/mr-jobhistory-daemon.sh --config /usr/local/Cellar/hadoop/2*/libexec/etc/hadoop/ stop historyserver 

stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
localhost: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
no proxyserver to stop
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
stopping historyserver
