# Clustering Case Study 2: Apply Association Rules to the customer segments from Case Study 1 to create a recommendation engine 

## Overview of Association Rules and the Apriori algorithm behind it 

Association rules uncover which items in a dataset tend to occur together. Within the context of our ecommerce dataset, if customers normally purchase 

KDNuggets gives a quick overview [here](https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html). For a more mathematical overview, see [pg 497 of ESL by Hastie, Tibshirani and Friedman](https://web.stanford.edu/~hastie/Papers/ESLII.pdf). 

Association rules are particularly useful for stock transaction data and provide a good starting point for building recommendation engines. 
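To make the three measures used later in this notebook concrete, here is a minimal sketch that computes support, confidence and lift by hand. The transactions and the helper names (`support`, `rule_metrics`) are made up for illustration and are not part of `mlxtend`.

```python
# Toy baskets, invented for illustration only.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"jam"},
    {"butter", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    s_a = support(antecedent, transactions)
    s_c = support(consequent, transactions)
    s_ac = support(set(antecedent) | set(consequent), transactions)
    confidence = s_ac / s_a      # how often the consequent appears given the antecedent
    lift = confidence / s_c      # > 1 means the items co-occur more than chance would suggest
    return {"support": s_ac, "confidence": confidence, "lift": lift}


metrics = rule_metrics({"bread"}, {"butter"}, transactions)
print(metrics)
```

A lift above 1 suggests the two items are bought together more often than if they were independent, which is why the rules below are ranked by lift.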

In [11]:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.frequent_patterns import association_rules

## Implementing Association Rules on ecommerce data 

1. Read in the cleaned dataset you saved in Case Study 1
2. This dataset is not ready for Association Rules yet. Therefore, reshape the data so that each row is an invoice number and each column is a product
![alt text](stockcode.png)

In [2]:
df = pd.read_csv('data/clean_data.csv', encoding='ISO-8859-1')

In [3]:
df1 = pd.crosstab(df.InvoiceNo,df.StockCode)

In [4]:
df1[df1>1]=1
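The reshape and binarise steps above can be sketched on a toy frame. The data here is invented; only the column names `InvoiceNo` and `StockCode` come from the notebook. Converting to booleans with `> 0` is an alternative to capping counts at 1, and, as far as I know, recent `mlxtend` versions prefer boolean input.

```python
import pandas as pd

# Toy transactions: invoice A1 contains product X twice.
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A1", "A2", "A2", "A3"],
    "StockCode": ["X", "X", "Y", "Y", "Z", "X"],
})

# One row per invoice, one column per product, cells hold line counts.
counts = pd.crosstab(toy["InvoiceNo"], toy["StockCode"])

# Collapse counts to purchased / not purchased flags.
basket = counts > 0

print(counts)
print(basket)
```

Either form gives one flag per invoice and product, so repeated lines for the same product within an invoice do not inflate its support.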

3. Apply the apriori algorithm to the dataset generated above to get the frequent itemsets. You may find the `mlxtend` library useful
4. Apply association rules on the frequent itemsets from 3 to generate confidence, support and lift measures for the data 
5. What happens when you change the `min_threshold` parameter? 

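__Changing `min_threshold`__
- Increasing `min_threshold` decreases the number of rules returned, because only rules whose chosen metric (here lift) is at least the threshold are kept. The number of frequent itemsets is controlled separately by `min_support` in `apriori`.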
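The effect of the threshold can be illustrated without `mlxtend`. This is a hand-rolled stand-in, assuming toy transactions and pairwise rules only; `pair_rules_with_lift` and `count_rules` are hypothetical helpers, not library functions.

```python
from itertools import combinations

# Toy baskets, invented for illustration only.
transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"C"},
]


def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)


def pair_rules_with_lift():
    """All single-item -> single-item rules with their lift."""
    items = sorted(set().union(*transactions))
    rules = []
    for x, y in combinations(items, 2):
        joint = support({x, y})
        if joint == 0:
            continue  # the pair never co-occurs, so there is no rule to report
        for a, c in ((x, y), (y, x)):
            lift = joint / (support({a}) * support({c}))
            rules.append((a, c, lift))
    return rules


def count_rules(min_threshold):
    """Number of rules whose lift is at least `min_threshold`."""
    return sum(lift >= min_threshold for _, _, lift in pair_rules_with_lift())


counts = {t: count_rules(t) for t in (0.5, 1.0, 1.2)}
print(counts)
```

Raising the threshold can only remove rules, never add them, so the count is non-increasing as `min_threshold` grows.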

In [9]:
frequent_itemsets=apriori(df1, min_support=0.01,use_colnames=True)
# x=[]
# for i in range(0,df1.values.shape[1]):
#    x.append(df1.values[i].sum()/df1.values.shape[1])
# dfx= pd.DataFrame(x,columns=['x'])
# sns.distplot(dfx.x,kde=False)

In [12]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)

In [24]:
rules.sort_values('lift',ascending=False).head(10)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 1102 | (23172) | (23170, 23171) | 0.012179 | 0.012342 | 0.010059 | 0.825893 | 66.915513 | 0.009908 | 5.672701 |
| 1099 | (23170, 23171) | (23172) | 0.012342 | 0.012179 | 0.010059 | 0.814978 | 66.915513 | 0.009908 | 5.338936 |
| 1103 | (23171) | (23170, 23172) | 0.014572 | 0.010711 | 0.010059 | 0.690299 | 64.446549 | 0.009903 | 3.19433 |
| 1098 | (23170, 23172) | (23171) | 0.010711 | 0.014572 | 0.010059 | 0.939086 | 64.446549 | 0.009903 | 16.17745 |
| 539 | (23171) | (23172) | 0.014572 | 0.012179 | 0.010983 | 0.753731 | 61.886727 | 0.010806 | 4.011151 |
| 538 | (23172) | (23171) | 0.012179 | 0.014572 | 0.010983 | 0.901786 | 61.886727 | 0.010806 | 10.033453 |
| 1093 | (22745, 22748) | (22746) | 0.01381 | 0.013702 | 0.010113 | 0.732283 | 53.445069 | 0.009924 | 3.684115 |
| 1096 | (22746) | (22745, 22748) | 0.013702 | 0.01381 | 0.010113 | 0.738095 | 53.445069 | 0.009924 | 3.765451 |
| 545 | (23175) | (23174) | 0.01468 | 0.014463 | 0.011092 | 0.755556 | 52.24127 | 0.010879 | 4.031743 |
| 544 | (23174) | (23175) | 0.014463 | 0.01468 | 0.011092 | 0.766917 | 52.24127 | 0.010879 | 4.227339 |
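As a sanity check on how the columns relate, the derived measures for the (23171) -> (23172) rule can be recomputed from its three support values. The inputs are the rounded figures printed above, so the results should match the table only up to rounding.

```python
# Values copied from the (23171) -> (23172) row of the output above.
antecedent_support = 0.014572
consequent_support = 0.012179
support = 0.010983

confidence = support / antecedent_support
lift = confidence / consequent_support
leverage = support - antecedent_support * consequent_support
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.2f} "
      f"leverage={leverage:.6f} conviction={conviction:.3f}")
```

Because both products appear in only about 1% of invoices, even a modest joint support produces a very large lift.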


### Creating tailored recommendations by applying Association Rules to the customer segments produced from Case Study 1

1. In the previous notebook, we created a GMM model that clustered customers into n segments. Apply association rules to each segment from your chosen model. 
2. Do results for each segment differ from each other? 

In [25]:
gmm_df= pd.read_csv('data/gmm_df.csv', sep=',',encoding='ISO-8859-1')

In [26]:
gmm_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4323 entries, 0 to 4322
Data columns (total 9 columns):
CustomerID               4323 non-null int64
NoOfInvoices             4323 non-null float64
NoOfUniqueItems          4323 non-null float64
TotalQuantity            4323 non-null float64
UnitPriceMean            4323 non-null float64
UnitPriceStd             4323 non-null float64
QuantityPerInvoice       4323 non-null float64
UniqueItemsPerInvoice    4323 non-null float64
Clusters                 4323 non-null int64
dtypes: float64(7), int64(2)
memory usage: 304.0 KB


In [29]:
gmm_df_0= gmm_df[gmm_df['Clusters']==0]
gmm_df_1= gmm_df[gmm_df['Clusters']==1]

In [34]:
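def getRules(df):
    # df must hold transaction rows (InvoiceNo, StockCode), not the per-customer gmm_df
    df1 = pd.crosstab(df.InvoiceNo, df.StockCode)  # invoice x product counts
    df1[df1 > 1] = 1  # cap counts at 1 so each cell is a purchased / not purchased flag
    frequent_itemsets = apriori(df1, min_support=0.01, use_colnames=True)
    rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
    return rules.sort_values('lift', ascending=False)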

| | CustomerID | NoOfInvoices | NoOfUniqueItems | TotalQuantity | UnitPriceMean | UnitPriceStd | QuantityPerInvoice | UniqueItemsPerInvoice | Clusters |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 12350 | 0.0 | 0.008959 | 0.001109 | 0.001806 | 0.011176 | 0.062889 | 0.076782 | 0 |
| 5 | 12353 | 0.0 | 0.00168 | 0.000108 | 0.002905 | 0.005407 | 0.006096 | 0.017367 | 0 |
| 7 | 12355 | 0.0 | 0.006719 | 0.001353 | 0.001984 | 0.004148 | 0.076686 | 0.058501 | 0 |
| 10 | 12358 | 0.004785 | 0.006719 | 0.001398 | 0.003986 | 0.013749 | 0.039466 | 0.028793 | 0 |
| 12 | 12360 | 0.009569 | 0.058231 | 0.006587 | 0.001662 | 0.007663 | 0.12428 | 0.159049 | 0 |
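`gmm_df` holds one row per customer and has no `InvoiceNo` or `StockCode` columns, so `getRules` needs the transaction rows of each segment instead. One possible way to get them is sketched below on toy data. It assumes the cleaned transaction data carries a `CustomerID` column (not shown in this notebook), and `split_transactions_by_cluster` is a hypothetical helper.

```python
import pandas as pd


def split_transactions_by_cluster(transactions, segments):
    """Return {cluster label: transaction rows of the customers in that cluster}."""
    # Attach each customer's cluster label to their transaction rows.
    labelled = transactions.merge(
        segments[["CustomerID", "Clusters"]], on="CustomerID", how="inner"
    )
    return {
        int(label): group.drop(columns="Clusters")
        for label, group in labelled.groupby("Clusters")
    }


# Toy stand-ins for the transaction data and for gmm_df.
toy_transactions = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A3", "A3"],
    "StockCode": ["X", "Y", "X", "Y", "Z"],
    "CustomerID": [1, 1, 2, 3, 3],
})
toy_segments = pd.DataFrame({"CustomerID": [1, 2, 3], "Clusters": [0, 1, 0]})

per_cluster = split_transactions_by_cluster(toy_transactions, toy_segments)
for label, rows in per_cluster.items():
    print(label, len(rows))
```

With the real data, each value of the returned dictionary could be passed to `getRules`, for example `getRules(per_cluster[0])`, and the resulting rule tables compared across segments.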
