# Clustering Case Study 2: Apply Association Rules to the customer segments from Case Study 1 to create a recommendation engine 

## Overview of Association Rules and the Apriori algorithm behind them

Association rules uncover which items in a dataset occur together. Within the context of our ecommerce dataset, this means finding products that customers normally purchase together.

KDnuggets gives a quick overview [here](https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html). For a more mathematical overview, see [pg 497 of ESL by Hastie, Tibshirani and Friedman](https://web.stanford.edu/~hastie/Papers/ESLII.pdf).

Association rules are particularly useful for stock transaction data and provide a good starting point for building recommendation engines.
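
As a concrete illustration of the measures that appear later in this notebook (support, confidence, lift and leverage), the sketch below computes them by hand for a single rule. The five baskets and the helper names `support` and `rule_metrics` are made up for illustration; they are not part of the case-study data or of any library.

```python
# Toy baskets, invented for illustration only
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter"},
    {"jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(antecedent, consequent, transactions):
    """Standard measures for the rule antecedent -> consequent."""
    s_a = support(antecedent, transactions)
    s_c = support(consequent, transactions)
    s_ac = support(set(antecedent) | set(consequent), transactions)
    confidence = s_ac / s_a            # P(consequent | antecedent)
    return {
        "support": s_ac,               # P(antecedent and consequent)
        "confidence": confidence,
        "lift": confidence / s_c,      # confidence relative to the consequent's base rate
        "leverage": s_ac - s_a * s_c,  # observed minus expected co-occurrence
    }

metrics = rule_metrics({"bread"}, {"butter"}, transactions)
print(metrics)
```

A lift above 1 means the two items appear together more often than they would if they were purchased independently.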

## Implementing Association Rules on ecommerce data 

1. Read in the cleaned dataset you saved in Case Study 1
2. This dataset is not ready for Association Rules yet. Therefore, reshape the data so that each row is an invoice number and each column is a product
![alt text](stockcode.png)

In [45]:
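import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Load the cleaned transactions saved in Case Study 1
data_clean = pd.read_pickle('./data/clean.pkl')
# One row per invoice, one column per product; cells count the matching line items
df_mod = pd.crosstab(data_clean['InvoiceNo'], data_clean['StockCode'])
# Binarize: 1 if the product appears on the invoice, 0 otherwise
df_mod[df_mod > 0] = 1
data_clean.head(5)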

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


In [5]:
df_mod

InvoiceNo,10002,10080,10120,10123C,10124A,10124G,10125,10133,10135,11001,...,90214V,90214W,90214Y,90214Z,BANK CHARGES,C2,DOT,M,PADS,POST
536365,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536366,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536367,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536368,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536369,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536370,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
536371,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536372,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536373,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536374,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
np.shape(df_mod)

(18532, 3665)
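
The reshaping step can be tried on a tiny example before running it on all 18532 invoices. The sketch below builds the same invoice-by-product layout from a made-up six-row table and stores it as booleans instead of 0/1 integers. The boolean encoding is an assumption on my part: some `mlxtend` releases warn about non-boolean input, so it may be worth checking the behaviour of your installed version. The invoice and product codes are invented.

```python
import pandas as pd

# Invented line items: invoice A1 contains product X twice
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A1", "A2", "A2", "A3"],
    "StockCode": ["X", "Y", "X", "X", "Z", "Y"],
})

# Count line items per (invoice, product), then keep only presence/absence
counts = pd.crosstab(toy["InvoiceNo"], toy["StockCode"])
basket = counts > 0

print(basket)
```

Comparing with `> 0` gives the same presence/absence information as setting positive counts to 1, only with a `bool` dtype.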

3. Apply the apriori algorithm to the dataset generated above to get the frequent itemsets. You may find the `mlxtend` library useful.
4. Apply association rules to the frequent itemsets from step 3 to generate confidence, support and lift measures for the data.
5. What happens when you change the `min_threshold` parameter? 
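
To make step 3 less of a black box, here is a simplified, pure-Python sketch of the level-wise idea behind Apriori: an itemset can only be frequent if all of its subsets are frequent, so candidates of size k+1 are built only from frequent itemsets of size k. This is not how `mlxtend` implements `apriori`; the function name `frequent_itemsets_sketch` and the toy baskets are made up for illustration.

```python
from itertools import combinations

def frequent_itemsets_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    result = {}
    k = 1
    while candidates:
        # Keep the size-k candidates that are frequent enough
        level = {}
        for cand in candidates:
            s = sum(cand <= t for t in transactions) / n
            if s >= min_support:
                level[cand] = s
        result.update(level)
        # Join frequent size-k itemsets into size-(k+1) candidates
        k += 1
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Apriori pruning: drop candidates that have an infrequent subset
        candidates = [
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return result

# Toy baskets, invented for illustration only
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter"},
    {"jam"},
]
found = frequent_itemsets_sketch(transactions, min_support=0.4)
for itemset, s in found.items():
    print(sorted(itemset), s)
```

Raising `min_support` shrinks the result, which is why a low value such as 0.01 is needed on sparse retail data where any single product appears on only a few percent of invoices.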

In [12]:
frequent_itemsets = apriori(df_mod, min_support=0.01, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(20712),(22386),0.026549,0.047,0.010738,0.404472,8.605817,0.00949,1.60026
1,(22386),(20712),0.047,0.026549,0.010738,0.228473,8.605817,0.00949,1.26172
2,(20712),(85099B),0.026549,0.086337,0.014138,0.53252,6.167917,0.011846,1.954444
3,(85099B),(20712),0.086337,0.026549,0.014138,0.16375,6.167917,0.011846,1.164067
4,(20713),(85099B),0.022178,0.086337,0.0109,0.491484,5.692616,0.008985,1.796725
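
On question 5: with `metric="lift"`, `association_rules` keeps only the rules whose lift is at least `min_threshold`, so raising the threshold can only remove rules. The sketch below mimics that filter on the five rules displayed above; `filter_by_lift` is a hypothetical helper, not an `mlxtend` function.

```python
# (antecedent, consequent, lift) copied from the five rules displayed above
shown_rules = [
    ("20712", "22386", 8.605817),
    ("22386", "20712", 8.605817),
    ("20712", "85099B", 6.167917),
    ("85099B", "20712", 6.167917),
    ("20713", "85099B", 5.692616),
]

def filter_by_lift(rules, min_threshold):
    """Keep the rules whose lift is at least `min_threshold`."""
    return [rule for rule in rules if rule[2] >= min_threshold]

# Number of surviving rules for a few thresholds
counts = {t: len(filter_by_lift(shown_rules, t)) for t in (1, 6, 8)}
print(counts)
```

Running the same experiment on the full `rules` table by re-calling `association_rules` with different `min_threshold` values should show the same monotone shrinkage.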


In [58]:
rules.sort_values(by='support',ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
369,(85099B),(22386),0.086337,0.047000,0.029463,0.341250,7.260672,0.025405,1.446680
368,(22386),(85099B),0.047000,0.086337,0.029463,0.626866,7.260672,0.025405,2.448616
460,(22699),(22697),0.042251,0.037287,0.029193,0.690932,18.530185,0.027617,3.114894
461,(22697),(22699),0.037287,0.042251,0.029193,0.782923,18.530185,0.027617,4.412029
484,(22727),(22726),0.047324,0.042575,0.028599,0.604333,14.194548,0.026584,2.419774
485,(22726),(22727),0.042575,0.047324,0.028599,0.671736,14.194548,0.026584,2.902169
52,(20725),(22384),0.069555,0.050237,0.028221,0.405741,8.076466,0.024727,1.598230
53,(22384),(20725),0.050237,0.069555,0.028221,0.561762,8.076466,0.024727,2.123147
51,(22383),(20725),0.056281,0.069555,0.028006,0.497603,7.154057,0.024091,1.852011
50,(20725),(22383),0.069555,0.056281,0.028006,0.402638,7.154057,0.024091,1.579810


### Creating tailored recommendations by applying Association Rules to the customer segments produced from Case Study 1

1. In the previous notebook, we fitted a Gaussian mixture model (GMM) that clustered customers into n segments. Apply association rules to each segment from your chosen model.
2. Do the results differ between segments?

In [23]:
labels = pd.read_pickle('./data/labels')
labels.head(5)

CustomerID,cluster
12363,2
12379,2
12386,0
12393,2
12434,2


In [29]:
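# astype returns a new Series, so assign it back to change the column's dtype
data_clean['CustomerID'] = data_clean['CustomerID'].astype(int)
# One DataFrame of transactions per cluster label (0, 1, 2)
groups = []
for ii in range(3):
    members = labels[labels['cluster'] == ii]
    groups.append(pd.merge(data_clean, members, how='inner', left_on='CustomerID', right_index=True))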
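
The split into segments can be checked on a toy example. The sketch below uses invented orders and labels to show two points: `Series.astype` returns a new Series, so the result has to be assigned back, and the inner merge silently drops customers that have no cluster label. Deriving the cluster ids from the labels table instead of hard-coding `range(3)` is my own variation, not part of the original notebook.

```python
import pandas as pd

# Invented orders; CustomerID is stored as float, as in the cleaned data
orders = pd.DataFrame({
    "InvoiceNo": ["A1", "A2", "A3", "A4"],
    "CustomerID": [1.0, 1.0, 2.0, 3.0],
})
# Invented labels, indexed by CustomerID; customer 3 has no label
toy_labels = pd.DataFrame(
    {"cluster": [0, 1]},
    index=pd.Index([1, 2], name="CustomerID"),
)

# Assign the converted column back, otherwise the dtype stays float
orders["CustomerID"] = orders["CustomerID"].astype(int)

# One DataFrame of orders per cluster id found in the labels
toy_groups = {
    int(c): orders.merge(
        toy_labels[toy_labels["cluster"] == c],
        how="inner",
        left_on="CustomerID",
        right_index=True,
    )
    for c in sorted(toy_labels["cluster"].unique())
}

for cluster_id, frame in toy_groups.items():
    print(cluster_id, len(frame))
```

Comparing the total number of rows across the groups with the number of rows in the original table is a quick way to see how many transactions belong to unlabelled customers.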

In [49]:
from IPython.display import display

def group_rules():
    """Mine association rules separately for each customer segment in `groups`."""
    ass_rules=[]
    for group in groups:
        # Invoice-by-product indicator matrix for this segment
        df = pd.crosstab(group['InvoiceNo'],group['StockCode'])
        df[df>0]=1
        frequent_itemsets = apriori(df, min_support=0.01, use_colnames=True)
        rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
        display(rules.head(5))
        ass_rules.append(rules)
    return ass_rules
ass_rules=group_rules()


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(20914),(85123A),0.030129,0.097561,0.010043,0.333333,3.416667,0.007104,1.353659
1,(85123A),(20914),0.097561,0.030129,0.010043,0.102941,3.416667,0.007104,1.081168
2,(22749),(20970),0.025825,0.011478,0.010043,0.388889,33.881944,0.009747,1.617582
3,(20970),(22749),0.011478,0.025825,0.010043,0.875,33.881944,0.009747,7.7934
4,(84879),(21136),0.071736,0.020086,0.014347,0.2,9.957143,0.012906,1.224892


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(20723),(20724),0.019084,0.024809,0.013359,0.7,28.215385,0.012885,3.250636
1,(20724),(20723),0.024809,0.019084,0.013359,0.538462,28.215385,0.012885,2.125318
2,(20723),(22356),0.019084,0.017176,0.01145,0.6,34.933333,0.011123,2.457061
3,(22356),(20723),0.017176,0.019084,0.01145,0.666667,34.933333,0.011123,2.942748
4,(20723),(23204),0.019084,0.017176,0.01145,0.6,34.933333,0.011123,2.457061


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(20725),(20726),0.03762,0.03937,0.013123,0.348837,8.860465,0.011642,1.475253
1,(20726),(20725),0.03937,0.03762,0.013123,0.333333,8.860465,0.011642,1.44357
2,(20725),(20727),0.03762,0.049869,0.014873,0.395349,7.927785,0.012997,1.571371
3,(20727),(20725),0.049869,0.03762,0.014873,0.298246,7.927785,0.012997,1.371391
4,(20725),(20728),0.03762,0.052493,0.014873,0.395349,7.531395,0.012898,1.56703


In [56]:
for ass_rule in ass_rules:
    display(ass_rule.sort_values(by='leverage',ascending=False).head(5))

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
76,(22698),(22697),0.032999,0.043042,0.031564,0.956522,22.223188,0.030144,22.010043
77,(22697),(22698),0.043042,0.032999,0.031564,0.733333,22.223188,0.030144,3.626255
78,(22699),(22697),0.038737,0.043042,0.02726,0.703704,16.349383,0.025592,3.229735
79,(22697),(22699),0.043042,0.038737,0.02726,0.633333,16.349383,0.025592,2.621625
48,(22138),(22617),0.071736,0.031564,0.025825,0.36,11.405455,0.023561,1.513181


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
13,(85123A),(21733),0.074427,0.026718,0.024809,0.333333,12.47619,0.022821,1.459924
12,(21733),(85123A),0.026718,0.074427,0.024809,0.928571,12.47619,0.022821,12.958015
23,(22578),(22579),0.026718,0.017176,0.017176,0.642857,37.428571,0.016717,2.751908
22,(22579),(22578),0.017176,0.026718,0.017176,1.0,37.428571,0.016717,inf
33,(85099B),(85099C),0.078244,0.030534,0.019084,0.243902,7.987805,0.016695,1.282197


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
39,(20728),(22384),0.052493,0.04287,0.027122,0.516667,12.052041,0.024871,1.98027
38,(22384),(20728),0.04287,0.052493,0.027122,0.632653,12.052041,0.024871,2.579323
149,(22470),(22469),0.048994,0.046369,0.026247,0.535714,11.553235,0.023975,2.053974
148,(22469),(22470),0.046369,0.048994,0.026247,0.566038,11.553235,0.023975,2.191449
37,(22383),(20728),0.048994,0.052493,0.024497,0.5,9.525,0.021925,1.895013
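
One way to approach question 2 is to measure how many rules two segments share. The sketch below computes the Jaccard overlap between the top-five-by-leverage rules of each segment, copied from the three tables above. It covers only those fifteen displayed rules, not the full rule sets, and `jaccard` is a hypothetical helper.

```python
from itertools import combinations

# (antecedent, consequent) pairs copied from the three tables above,
# keyed by segment in display order
top_rules = {
    0: {("22698", "22697"), ("22697", "22698"), ("22699", "22697"),
        ("22697", "22699"), ("22138", "22617")},
    1: {("85123A", "21733"), ("21733", "85123A"), ("22578", "22579"),
        ("22579", "22578"), ("85099B", "85099C")},
    2: {("20728", "22384"), ("22384", "20728"), ("22470", "22469"),
        ("22469", "22470"), ("22383", "20728")},
}

def jaccard(a, b):
    """Shared rules divided by all distinct rules in the two sets."""
    return len(a & b) / len(a | b)

overlaps = {
    (i, j): jaccard(top_rules[i], top_rules[j])
    for i, j in combinations(sorted(top_rules), 2)
}
print(overlaps)
```

The same comparison could be run on the full tables by building each set from the `antecedents` and `consequents` columns of the corresponding rules DataFrame.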


In [59]:
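# Compare each segment's mean leverage with that of the rules mined from all invoices.
# `leverage` is a column of the rules tables, not of the transaction groups.
for ass_rule in ass_rules:
    print(ass_rule['leverage'].mean() / rules['leverage'].mean())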
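
To turn a segment's rules into recommendations, one simple approach is to look up every rule whose antecedent is already in the customer's basket and suggest the consequents, strongest rule first. The sketch below is a minimal illustration of that idea under my own assumptions (ranking by lift, a hypothetical `recommend` helper); it uses the five highest-leverage rules shown above for the first segment.

```python
# (antecedent, consequent, lift) for the first segment's five
# highest-leverage rules, copied from the output above
segment_rules = [
    ({"22698"}, {"22697"}, 22.223188),
    ({"22697"}, {"22698"}, 22.223188),
    ({"22699"}, {"22697"}, 16.349383),
    ({"22697"}, {"22699"}, 16.349383),
    ({"22138"}, {"22617"}, 11.405455),
]

def recommend(basket, rules, top_n=3):
    """Suggest products not yet in the basket, ranked by the best matching lift."""
    basket = set(basket)
    best_lift = {}
    for antecedent, consequent, lift in rules:
        # A rule applies when the basket contains its whole antecedent
        if antecedent <= basket:
            for item in consequent - basket:
                best_lift[item] = max(best_lift.get(item, 0.0), lift)
    ranked = sorted(best_lift, key=lambda item: -best_lift[item])
    return ranked[:top_n]

print(recommend({"22697"}, segment_rules))
print(recommend({"22138", "22699"}, segment_rules))
```

Because each segment has its own rule table, looking up the customer's cluster first and then calling the helper with that segment's rules would give segment-specific suggestions.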