# Clustering Case Study 2: Apply Association Rules to the customer segments from Case Study 1 to create a recommendation engine 

## Overview of Association Rules and the Apriori algorithm behind them

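Association Rules uncover which items in a dataset occur together. Within the context of our ecommerce dataset, if customers normally purchase one product together with another, a rule of the form {A} → {B} captures that pattern along with measures of how strong it is.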

KDNuggets gives a quick overview [here](https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html). For a more mathematical overview, see [pg 497 of ESL by Hastie, Tibshirani and Friedman](https://web.stanford.edu/~hastie/Papers/ESLII.pdf).

Association Rules are particularly useful for stock transaction data and provide a good starting point for building recommendation engines.
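
To make the three usual measures concrete, here is a minimal sketch on five made-up baskets. The item names and the baskets are illustrative assumptions, not taken from the ecommerce data. Support is the share of baskets containing an itemset, confidence is the share of antecedent baskets that also contain the consequent, and lift compares that confidence with the consequent's baseline popularity.

```python
# Five toy baskets; item names are made up for illustration.
transactions = [
    {"teacup", "saucer", "teapot"},
    {"teacup", "saucer"},
    {"teacup", "candle"},
    {"saucer", "teapot"},
    {"candle"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


# Rule under inspection: {teacup} -> {saucer}
antecedent, consequent = {"teacup"}, {"saucer"}

supp_rule = support(antecedent | consequent)   # both items in the same basket
confidence = supp_rule / support(antecedent)   # P(saucer | teacup)
lift = confidence / support(consequent)        # confidence relative to baseline

print(f"support={supp_rule:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items appear together more often than they would if purchases were independent.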

## Implementing Association Rules on ecommerce data 

1. Read in the cleaned dataset you saved in Case Study 1
2. This dataset is not ready for Association Rules yet. Therefore, reshape the data so that each row is an invoice number and each column is a product
![alt text](stockcode.png)

In [2]:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from sklearn.preprocessing import OneHotEncoder

In [1]:
# import mlxtend
# import pandas as pd
# df = pd.read_csv('data/derived/data2.csv')
# df = df.iloc[:,:2]

# from sklearn.preprocessing import OneHotEncoder
# enc = OneHotEncoder()
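# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (scikit-learn >= 1.0) replaces it
# df2 = pd.DataFrame(enc.fit_transform(df['StockCode'].values.reshape(-1, 1)).toarray(),columns = [name.split('_', 1)[1] for name in enc.get_feature_names_out()])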
# df2.index = df['InvoiceNo']

# df3 = df2.groupby('InvoiceNo').sum()
# df3 = df3.clip(0,1)

# df3.to_csv('data/derived/data3.csv')
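
As an alternative to the one-hot-encode, group and clip sequence in the cell above, the same invoice-by-product matrix can be sketched with `pd.crosstab`. The toy frame below is an assumption standing in for the cleaned dataset; only the `InvoiceNo` and `StockCode` column names come from the notebook.

```python
import pandas as pd

# Toy stand-in for the cleaned transactions (values are illustrative only).
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536367", "536367", "536367"],
    "StockCode": ["85123A", "22423", "85123A", "22423", "22423", "POST"],
})

# Count each (invoice, product) pair, then collapse counts to a presence flag.
basket = pd.crosstab(toy["InvoiceNo"], toy["StockCode"]) > 0

print(basket)
```

The result is a boolean frame with one row per invoice and one column per product. A repeated product on the same invoice still yields a single `True`, which is what the `clip(0,1)` step achieves in the original cell.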

In [2]:
df3 = pd.read_csv('data/derived/data3.csv',index_col = 0)
df3.head()

| InvoiceNo | 10002 | 10080 | 10120 | 10123C | 10124A | 10124G | 10125 | 10133 | 10135 | 11001 | ... | 90214V | 90214W | 90214Y | 90214Z | BANK CHARGES | C2 | DOT | M | PADS | POST |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 536365 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 536366 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 536367 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 536368 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 536369 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


3. Apply the apriori algorithm to the dataset generated above to get the frequent itemsets. You may find the `mlxtend` library useful.
4. Apply association rules to the frequent itemsets from step 3 to generate confidence, support and lift measures for the data.
5. What happens when you change the `min_threshold` parameter?

In [3]:
supp = apriori(df3, min_support=0.01, use_colnames=True, n_jobs = -1)
supp.sort_values('support',ascending = False).head(10)

| | support | itemsets |
|---|---|---|
| 620 | 0.106734 | (85123A) |
| 237 | 0.091895 | (22423) |
| 617 | 0.086337 | (85099B) |
| 545 | 0.074412 | (47566) |
| 592 | 0.074196 | (84879) |
| 18 | 0.069555 | (20725) |
| 326 | 0.061839 | (22720) |
| 624 | 0.059303 | (POST) |
| 464 | 0.058278 | (23203) |
| 20 | 0.056767 | (20727) |
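
To make the frequent-itemset step less of a black box, here is a minimal pure-Python sketch of the level-wise idea behind Apriori. It is not how `mlxtend` implements it; the function name `frequent_itemsets` and the toy baskets are assumptions for illustration. The key property is that an itemset can only be frequent if all of its subsets are frequent, so candidates are grown one item at a time from itemsets that already passed the support threshold.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: grow candidates only from itemsets known to be frequent."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    result = {}
    k = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result


# Toy baskets; item names are made up for illustration.
toy_baskets = [
    {"teacup", "saucer", "teapot"},
    {"teacup", "saucer"},
    {"teacup", "candle"},
    {"saucer", "teapot"},
    {"candle"},
]

found = frequent_itemsets(toy_baskets, min_support=0.4)
for itemset, supp in sorted(found.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), supp)
```

In this toy run the three-item set is never counted against the data, because one of its pairs already failed the threshold. That pruning is what keeps the search tractable on a catalogue with thousands of stock codes.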


In [4]:
rconf = association_rules(supp, metric='confidence', min_threshold=0.8, support_only=False)
print(rconf.shape)
rconf.sort_values('confidence').head()

(19, 9)


| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 4 | (22745) | (22748) | 0.017052 | 0.01867 | 0.013706 | 0.803797 | 43.05195 | 0.013388 | 5.001615 |
| 3 | (22746) | (22745) | 0.013598 | 0.017052 | 0.011062 | 0.813492 | 47.707705 | 0.01083 | 5.270277 |
| 1 | (22579) | (22578) | 0.014893 | 0.023365 | 0.012195 | 0.818841 | 35.04562 | 0.011847 | 5.391025 |
| 2 | (22698) | (22697) | 0.030002 | 0.037287 | 0.024822 | 0.827338 | 22.188466 | 0.023703 | 5.575714 |
| 0 | (21086) | (21094) | 0.015379 | 0.017537 | 0.012735 | 0.82807 | 47.217835 | 0.012465 | 5.714324 |
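
The derived columns can be recomputed from the three support values. The sketch below does this for the (22745) → (22748) row, using the standard definitions of confidence, lift, leverage and conviction. The inputs are the rounded figures printed above, so the results agree with the table only to a few decimal places.

```python
# Values copied from the (22745) -> (22748) row above.
antecedent_support = 0.017052
consequent_support = 0.01867
support = 0.013706

# confidence: share of antecedent invoices that also contain the consequent
confidence = support / antecedent_support
# lift: confidence relative to the consequent's overall frequency
lift = confidence / consequent_support
# leverage: observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support
# conviction: how much more often the rule would fail if the items were independent
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 4), round(lift, 2), round(leverage, 6), round(conviction, 2))
```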


In [5]:
rconf = association_rules(supp, metric='confidence', min_threshold=0.5, support_only=False)
print(rconf.shape)
rconf.sort_values('confidence').head()

(323, 9)


| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 85 | (22730) | (22728) | 0.025254 | 0.033078 | 0.012627 | 0.5 | 15.115824 | 0.011791 | 1.933844 |
| 16 | (21977) | (21212) | 0.0361 | 0.055526 | 0.018077 | 0.500747 | 9.018319 | 0.016072 | 1.891777 |
| 150 | (20725, 20728) | (20726) | 0.024768 | 0.044248 | 0.012411 | 0.501089 | 11.324619 | 0.011315 | 1.915678 |
| 46 | (22411) | (85099B) | 0.042629 | 0.086337 | 0.021368 | 0.501266 | 5.805911 | 0.017688 | 1.831964 |
| 43 | (22662) | (22383) | 0.032269 | 0.056281 | 0.016188 | 0.501672 | 8.913701 | 0.014372 | 1.893772 |


In [6]:
rconf = association_rules(supp, metric='confidence', min_threshold=0.9, support_only=False)
print(rconf.shape)
rconf.sort_values('confidence').head()

(3, 9)


| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (23172) | (23171) | 0.012087 | 0.014569 | 0.0109 | 0.901786 | 61.895899 | 0.010724 | 10.033475 |
| 2 | (22423, 22699, 22698) | (22697) | 0.0143 | 0.037287 | 0.012897 | 0.901887 | 24.187795 | 0.012363 | 9.812269 |
| 1 | (22746, 22745) | (22748) | 0.011062 | 0.01867 | 0.010037 | 0.907317 | 48.596532 | 0.00983 | 10.58803 |
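
The three runs above return 323, 19 and 3 rules for thresholds of 0.5, 0.8 and 0.9. With `metric='confidence'`, `min_threshold` acts as a lower bound on the confidence column, so raising it can only shrink the rule set. The sketch below illustrates this with a handful of rows copied from the outputs above (only some of the printed rows, not the full rule sets); the helper name `rules_at` is an assumption for illustration.

```python
# Confidence of rule rows displayed in the three outputs above
# (a subset of the printed rows, not the full rule sets).
shown_rules = {
    "(22730) -> (22728)": 0.5,
    "(21977) -> (21212)": 0.500747,
    "(22745) -> (22748)": 0.803797,
    "(22746) -> (22745)": 0.813492,
    "(22579) -> (22578)": 0.818841,
    "(22698) -> (22697)": 0.827338,
    "(21086) -> (21094)": 0.82807,
    "(23172) -> (23171)": 0.901786,
    "(22423, 22699, 22698) -> (22697)": 0.901887,
    "(22746, 22745) -> (22748)": 0.907317,
}


def rules_at(threshold):
    """Keep the rules whose confidence is at least `threshold`."""
    return {rule for rule, conf in shown_rules.items() if conf >= threshold}


for threshold in (0.5, 0.8, 0.9):
    print(threshold, len(rules_at(threshold)))

# Each stricter rule set is contained in the looser one.
assert rules_at(0.9) <= rules_at(0.8) <= rules_at(0.5)
```

A low threshold gives more candidate recommendations at the cost of weaker rules; a high threshold keeps only near-certain pairings, which here are mostly variants of the same product line.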


### Creating tailored recommendations by applying Association Rules to the customer segments produced from Case Study 1

1. In the previous notebook, we fitted a Gaussian Mixture Model (GMM) that clustered customers into n segments. Apply association rules to each segment from your chosen model.
2. Do results for each segment differ from each other? 
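
Once rules exist, turning them into recommendations can be as simple as matching antecedents against a basket. The sketch below is one possible approach, not a prescribed one: the function name `recommend` and the ranking by lift are assumptions, and the three rules are copied from the whole-dataset outputs earlier in this notebook. In practice the rule list would come from the segment the customer belongs to.

```python
# Rules as (antecedents, consequents, lift); values copied from rows shown earlier.
rules = [
    (frozenset({"22745"}), frozenset({"22748"}), 43.05195),
    (frozenset({"22746"}), frozenset({"22745"}), 47.707705),
    (frozenset({"22746", "22745"}), frozenset({"22748"}), 48.596532),
]


def recommend(basket, rules, top_n=3):
    """Suggest items implied by rules whose antecedents are all in the basket."""
    basket = set(basket)
    best_lift = {}
    for antecedents, consequents, lift in rules:
        if antecedents <= basket:
            # Only suggest items the customer does not already have.
            for item in consequents - basket:
                best_lift[item] = max(lift, best_lift.get(item, 0.0))
    # Strongest association first.
    return sorted(best_lift, key=best_lift.get, reverse=True)[:top_n]


print(recommend({"22746"}, rules))
print(recommend({"22746", "22745"}, rules))
print(recommend({"85123A"}, rules))
```

A basket that matches no antecedent gets an empty list, so a production system would need a fallback such as the segment's most frequent items.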

In [1]:
import pandas as pd
df = pd.read_csv('data/derived/data2.csv')
df4 = pd.read_csv('data/derived/gmm.csv',index_col=0)

In [8]:
#cluster0
cluster0 = df4.loc[df4['Cluster'] == 0,:]
print('Number of rows in cluster:',cluster0.shape[0])

cluster0_df = df.loc[df['CustomerID'].isin(cluster0.index.tolist()),:]
cluster0_df = cluster0_df.iloc[:,:2]

#Convert data to Invoice-StockCode matrix
enc = OneHotEncoder()
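# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (scikit-learn >= 1.0) replaces it
cluster0_df2 = pd.DataFrame(enc.fit_transform(cluster0_df['StockCode'].values.reshape(-1, 1)).toarray(),columns = [name.split('_', 1)[1] for name in enc.get_feature_names_out()])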
cluster0_df2.index = cluster0_df['InvoiceNo']

cluster0_df3 = cluster0_df2.groupby('InvoiceNo').sum()
cluster0_df3 = cluster0_df3.clip(0,1)

#Association Rule Mining
supp0 = apriori(cluster0_df3, min_support=0.01, use_colnames=True,n_jobs=-1)
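# metric and min_threshold written out (these are the defaults) to match the other clusters
rconf0 = association_rules(supp0, metric='confidence', min_threshold=0.8, support_only=False)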

print('Number of rules above threshold:',rconf0.shape[0])
rconf0.sort_values('lift',ascending=False).head()

Number of rows in cluster: 631
Number of rules above threshold: 84


| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 16 | (23290) | (23292) | 0.012262 | 0.012058 | 0.010014 | 0.816667 | 67.727966 | 0.009866 | 5.388774 |
| 15 | (23292) | (23290) | 0.012058 | 0.012262 | 0.010014 | 0.830508 | 67.727966 | 0.009866 | 5.827652 |
| 53 | (23175, 23173) | (23174) | 0.011036 | 0.014715 | 0.010423 | 0.944444 | 64.18287 | 0.010261 | 17.735132 |
| 50 | (23170, 23172) | (23171) | 0.012876 | 0.015124 | 0.012058 | 0.936508 | 61.923423 | 0.011863 | 15.511803 |
| 52 | (23172) | (23170, 23171) | 0.014306 | 0.013693 | 0.012058 | 0.842857 | 61.553731 | 0.011862 | 6.276499 |


<span style="color:#003366"><b> {SPACEBOY CHILDRENS BOWL(23290), SPACE BOY CHILDRENS CUP(23292)} </b></span><br/>
<span style="color:#003366"><b> {REGENCY TEAPOT ROSES(23173),REGENCY SUGAR BOWL GREEN(23174),REGENCY MILK JUG PINK(23175)}</b></span><br/>
<span style="color:#003366"><b> {REGENCY TEA PLATE ROSES(23170),REGENCY TEA PLATE GREEN(23171),REGENCY TEA PLATE PINK(23172)}</b></span>
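
The same preparation steps are repeated for every cluster, so they could be folded into one helper. The sketch below is one way to do that, assuming the column names used in this notebook (`InvoiceNo`, `StockCode`, `CustomerID`, and a `Cluster` column indexed by customer ID). The function name `basket_for_cluster` and the toy frames are hypothetical, and `pd.crosstab` is swapped in for the one-hot encoder.

```python
import pandas as pd


def basket_for_cluster(transactions, clusters, cluster_id):
    """Boolean invoice x product matrix for the customers in one cluster.

    `transactions` needs InvoiceNo, StockCode and CustomerID columns;
    `clusters` is indexed by CustomerID and has a Cluster column.
    """
    customer_ids = clusters.index[clusters["Cluster"] == cluster_id]
    subset = transactions[transactions["CustomerID"].isin(customer_ids)]
    return pd.crosstab(subset["InvoiceNo"], subset["StockCode"]) > 0


# Toy stand-ins for data2.csv and gmm.csv (values are illustrative only).
toy_transactions = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B1"],
    "StockCode": ["23290", "23292", "23290", "22423", "22697"],
    "CustomerID": [1, 1, 2, 3, 3],
})
toy_clusters = pd.DataFrame({"Cluster": [0, 0, 1]}, index=[1, 2, 3])

basket0 = basket_for_cluster(toy_transactions, toy_clusters, 0)
print(basket0)
```

The returned frame can be passed to `apriori` in place of `cluster0_df3`, and looping over the cluster labels would replace the five near-identical cells.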

In [9]:
#cluster1
cluster1 = df4.loc[df4['Cluster'] == 1,:]
print('Number of rows in cluster:',cluster1.shape[0])

cluster1_df = df.loc[df['CustomerID'].isin(cluster1.index.tolist()),:]
cluster1_df = cluster1_df.iloc[:,:2]

#Convert data to Invoice-StockCode matrix
enc = OneHotEncoder()
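# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (scikit-learn >= 1.0) replaces it
cluster1_df2 = pd.DataFrame(enc.fit_transform(cluster1_df['StockCode'].values.reshape(-1, 1)).toarray(),columns = [name.split('_', 1)[1] for name in enc.get_feature_names_out()])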
cluster1_df2.index = cluster1_df['InvoiceNo']

cluster1_df3 = cluster1_df2.groupby('InvoiceNo').sum()
cluster1_df3 = cluster1_df3.clip(0,1)

#Association Rule Mining
supp1 = apriori(cluster1_df3, min_support=0.01, use_colnames=True,n_jobs=-1)
rconf1 = association_rules(supp1, metric='confidence', min_threshold=0.8, support_only=False)

print('Number of rules above threshold:',rconf1.shape[0])
rconf1.sort_values('lift',ascending=False).head()

Number of rows in cluster: 1129
Number of rules above threshold: 29


| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (21124) | (21122) | 0.013286 | 0.016829 | 0.0124 | 0.933333 | 55.459649 | 0.012177 | 14.747564 |
| 4 | (22635) | (22634) | 0.015943 | 0.015943 | 0.013286 | 0.833333 | 52.268519 | 0.013032 | 5.90434 |
| 5 | (22634) | (22635) | 0.015943 | 0.015943 | 0.013286 | 0.833333 | 52.268519 | 0.013032 | 5.90434 |
| 9 | (23254) | (23256) | 0.016829 | 0.017715 | 0.015058 | 0.894737 | 50.507895 | 0.014759 | 9.331709 |
| 10 | (23256) | (23254) | 0.017715 | 0.016829 | 0.015058 | 0.85 | 50.507895 | 0.014759 | 6.554473 |


<span style="color:#003366"><b> {SET/10 BLUE POLKADOT PARTY CANDLES(21124), SET/10 PINK POLKADOT PARTY CANDLES(21122)} </b></span><br/>
<span style="color:#003366"><b> {CHILDS BREAKFAST SET SPACEBOY(22634),CHILDS BREAKFAST SET DOLLY GIRL(22635)}</b></span><br/>
<span style="color:#003366"><b> {CHILDRENS CUTLERY DOLLY GIRL(23254),CHILDRENS CUTLERY SPACEBOY(23256)}</b></span>

In [10]:
#cluster2
cluster2 = df4.loc[df4['Cluster'] == 2,:]
print('Number of rows in cluster:',cluster2.shape[0])

cluster2_df = df.loc[df['CustomerID'].isin(cluster2.index.tolist()),:]
cluster2_df = cluster2_df.iloc[:,:2]

#Convert data to Invoice-StockCode matrix
enc = OneHotEncoder()
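# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (scikit-learn >= 1.0) replaces it
cluster2_df2 = pd.DataFrame(enc.fit_transform(cluster2_df['StockCode'].values.reshape(-1, 1)).toarray(),columns = [name.split('_', 1)[1] for name in enc.get_feature_names_out()])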
cluster2_df2.index = cluster2_df['InvoiceNo']

cluster2_df3 = cluster2_df2.groupby('InvoiceNo').sum()
cluster2_df3 = cluster2_df3.clip(0,1)

#Association Rule Mining
supp2 = apriori(cluster2_df3, min_support=0.01, use_colnames=True,n_jobs=-1)
rconf2 = association_rules(supp2, metric='confidence', min_threshold=0.8, support_only=False)

print('Number of rules above threshold:',rconf2.shape[0])
rconf2.sort_values('lift',ascending=False).head()

Number of rows in cluster: 980
Number of rules above threshold: 7


| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 3 | (23254) | (23256) | 0.012612 | 0.013988 | 0.010089 | 0.8 | 57.193443 | 0.009913 | 4.930062 |
| 0 | (22302) | (22303) | 0.011924 | 0.015363 | 0.010319 | 0.865385 | 56.327497 | 0.010136 | 7.314443 |
| 1 | (22579) | (22578) | 0.012382 | 0.023618 | 0.010319 | 0.833333 | 35.283172 | 0.010026 | 5.858289 |
| 6 | (22699, 22698) | (22697) | 0.015822 | 0.027975 | 0.014446 | 0.913043 | 32.637562 | 0.014004 | 11.178285 |
| 4 | (22423, 22698) | (22697) | 0.011695 | 0.027975 | 0.010548 | 0.901961 | 32.241401 | 0.010221 | 9.914653 |


<span style="color:#003366"><b> {CHILDRENS CUTLERY DOLLY GIRL(23254),CHILDRENS CUTLERY SPACEBOY(23256)}</b></span><br/>
<span style="color:#003366"><b> {COFFEE MUG PEARS  DESIGN(22302), COFFEE MUG APPLES DESIGN (22303)} </b></span><br/>
<span style="color:#003366"><b> {WOODEN STAR CHRISTMAS SCANDINAVIAN(22578),WOODEN TREE CHRISTMAS SCANDINAVIAN(22579)}</b></span><br/>
<span style="color:#003366"><b> {GREEN REGENCY TEACUP AND SAUCER(22697), PINK REGENCY TEACUP AND SAUCER(22698), ROSES REGENCY TEACUP AND SAUCER(22699)} </b></span><br/>
<span style="color:#003366"><b> {REGENCY CAKESTAND 3 TIER(22423), PINK REGENCY TEACUP AND SAUCER(22698), ROSES REGENCY TEACUP AND SAUCER(22699)} </b></span><br/>

In [11]:
#cluster3
cluster3 = df4.loc[df4['Cluster'] == 3,:]
print('Number of rows in cluster:',cluster3.shape[0])

cluster3_df = df.loc[df['CustomerID'].isin(cluster3.index.tolist()),:]
cluster3_df = cluster3_df.iloc[:,:2]

#Convert data to Invoice-StockCode matrix
enc = OneHotEncoder()
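# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (scikit-learn >= 1.0) replaces it
cluster3_df2 = pd.DataFrame(enc.fit_transform(cluster3_df['StockCode'].values.reshape(-1, 1)).toarray(),columns = [name.split('_', 1)[1] for name in enc.get_feature_names_out()])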
cluster3_df2.index = cluster3_df['InvoiceNo']

cluster3_df3 = cluster3_df2.groupby('InvoiceNo').sum()
cluster3_df3 = cluster3_df3.clip(0,1)

#Association Rule Mining
supp3 = apriori(cluster3_df3, min_support=0.01, use_colnames=True,n_jobs=-1)
rconf3 = association_rules(supp3, metric='confidence', min_threshold=0.8, support_only=False)

print('Number of rules above threshold:',rconf3.shape[0])
rconf3.sort_values('lift',ascending=False).head()

Number of rows in cluster: 815
Number of rules above threshold: 77


| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 36 | (22917) | (22916, 22918) | 0.010753 | 0.01012 | 0.01012 | 0.941176 | 93.0 | 0.010011 | 16.827957 |
| 75 | (22916) | (22917, 22920, 22918) | 0.010753 | 0.01012 | 0.01012 | 0.941176 | 93.0 | 0.010011 | 16.827957 |
| 73 | (22917) | (22920, 22916, 22918) | 0.010753 | 0.01012 | 0.01012 | 0.941176 | 93.0 | 0.010011 | 16.827957 |
| 70 | (22920, 22916) | (22917, 22918) | 0.010753 | 0.01012 | 0.01012 | 0.941176 | 93.0 | 0.010011 | 16.827957 |
| 37 | (22916) | (22917, 22918) | 0.010753 | 0.01012 | 0.01012 | 0.941176 | 93.0 | 0.010011 | 16.827957 |


<span style="color:#003366"><b> {HERB MARKER THYME(22916), HERB MARKER ROSEMARY(22917), HERB MARKER PARSLEY(22918)} </b></span><br/>
<span style="color:#003366"><b> {HERB MARKER THYME(22916), HERB MARKER ROSEMARY(22917), HERB MARKER PARSLEY(22918), HERB MARKER BASIL(22920)} </b></span><br/>

In [12]:
#cluster4
cluster4 = df4.loc[df4['Cluster'] == 4,:]
print('Number of rows in cluster:',cluster4.shape[0])

cluster4_df = df.loc[df['CustomerID'].isin(cluster4.index.tolist()),:]
cluster4_df = cluster4_df.iloc[:,:2]

#Convert data to Invoice-StockCode matrix
enc = OneHotEncoder()
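# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (scikit-learn >= 1.0) replaces it
cluster4_df2 = pd.DataFrame(enc.fit_transform(cluster4_df['StockCode'].values.reshape(-1, 1)).toarray(),columns = [name.split('_', 1)[1] for name in enc.get_feature_names_out()])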
cluster4_df2.index = cluster4_df['InvoiceNo']

cluster4_df3 = cluster4_df2.groupby('InvoiceNo').sum()
cluster4_df3 = cluster4_df3.clip(0,1)

#Association Rule Mining
supp4 = apriori(cluster4_df3, min_support=0.01, use_colnames=True,n_jobs=-1)
rconf4 = association_rules(supp4, metric='confidence', min_threshold=0.8, support_only=False)

print('Number of rules above threshold:',rconf4.shape[0])
rconf4.sort_values('lift',ascending=False).head()

Number of rows in cluster: 277
Number of rules above threshold: 31


| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 10 | (23175) | (23174) | 0.011591 | 0.011591 | 0.010537 | 0.909091 | 78.429752 | 0.010403 | 10.872497 |
| 11 | (23174) | (23175) | 0.011591 | 0.011591 | 0.010537 | 0.909091 | 78.429752 | 0.010403 | 10.872497 |
| 24 | (23170, 23171) | (23172) | 0.013699 | 0.011591 | 0.011591 | 0.846154 | 73.0 | 0.011432 | 6.424658 |
| 25 | (23172) | (23170, 23171) | 0.011591 | 0.013699 | 0.011591 | 1.0 | 73.0 | 0.011432 | inf |
| 9 | (23174) | (23173) | 0.011591 | 0.013699 | 0.010537 | 0.909091 | 66.363636 | 0.010379 | 10.849315 |


<span style="color:#003366"><b> {REGENCY TEAPOT ROSES(23173),REGENCY SUGAR BOWL GREEN(23174)} </b></span><br/>
<span style="color:#003366"><b> {REGENCY SUGAR BOWL GREEN(23174),REGENCY MILK JUG PINK(23175)}</b></span><br/>
<span style="color:#003366"><b> {REGENCY TEA PLATE ROSES(23170),REGENCY TEA PLATE GREEN(23171),REGENCY TEA PLATE PINK(23172)}</b></span>
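
One rough way to check whether the segments differ is to compare the products that appear in each cluster's strongest rules. The sketch below uses only the stock codes from the five highest-lift rows printed for each cluster above, so it is a partial view rather than a comparison of the full rule sets; the variable names are assumptions for illustration.

```python
from itertools import combinations

# Stock codes appearing in the five highest-lift rules printed for each cluster.
top_items = {
    0: {"23290", "23292", "23170", "23171", "23172", "23173", "23174", "23175"},
    1: {"21122", "21124", "22634", "22635", "23254", "23256"},
    2: {"22302", "22303", "22423", "22578", "22579",
        "22697", "22698", "22699", "23254", "23256"},
    3: {"22916", "22917", "22918", "22920"},
    4: {"23170", "23171", "23172", "23173", "23174", "23175"},
}

# Shared products for every pair of clusters.
overlaps = {
    (a, b): top_items[a] & top_items[b]
    for a, b in combinations(sorted(top_items), 2)
}

for pair, shared in overlaps.items():
    if shared:
        print(pair, sorted(shared))
```

On these rows, clusters 0 and 4 share the Regency tea plate and tea set codes, clusters 1 and 2 share the children's cutlery pair, and cluster 3's herb markers appear nowhere else.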
