# Association Analysis

## Apriori Algorithm


The Apriori algorithm finds associations between products in users' purchase data. It reports the product combinations whose support meets a minimum threshold chosen by the analyst, and association rules can then be derived from those combinations.
  
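Support(X, Y) = Freq(X, Y) / N -- (the probability that X and Y are purchased together; N is the total number of transactions)  
Confidence(X, Y) = Freq(X, Y) / Freq(X) -- (the probability that Y is purchased given that X is purchased)  
Lift(X, Y) = Support(X, Y) / (Support(X) * Support(Y)) -- (how many times more likely Y is to be purchased when X is purchased, compared with Y's overall purchase rate; a value above 1 indicates a positive association)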
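The three formulas above can be checked by hand on a tiny example. The sketch below uses four invented baskets and hypothetical helper names (`support`, `confidence`, `lift`); it is only meant to illustrate the definitions, not how mlxtend computes them internally.

```python
# Toy transactions, invented for illustration only
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Cheese"},
    {"Bread", "Milk", "Cheese"},
    {"Milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(x, y, baskets):
    """Support(X, Y) / Support(X), which equals Freq(X, Y) / Freq(X)."""
    return support(set(x) | set(y), baskets) / support(x, baskets)

def lift(x, y, baskets):
    """Support(X, Y) / (Support(X) * Support(Y))."""
    return support(set(x) | set(y), baskets) / (support(x, baskets) * support(y, baskets))

x, y = {"Bread"}, {"Milk"}
print("support   :", support(x | y, baskets))
print("confidence:", confidence(x, y, baskets))
print("lift      :", lift(x, y, baskets))
```

In this toy data the lift of Bread and Milk comes out below 1, meaning the two products appear together less often than they would if they were purchased independently.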

In [1]:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

In [2]:
df = pd.read_csv('retail_dataset.csv') # load the dataset
df.head() # each row is a shopping basket; the columns hold the products purchased in that basket

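|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | Bread | Wine | Eggs | Meat | Cheese | Pencil | Diaper |
| 1 | Bread | Cheese | Meat | Diaper | Wine | Milk | Pencil |
| 2 | Cheese | Meat | Eggs | Milk | Wine | NaN | NaN |
| 3 | Cheese | Meat | Eggs | Milk | Wine | NaN | NaN |
| 4 | Meat | Pencil | Wine | NaN | NaN | NaN | NaN |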


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 315 entries, 0 to 314
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       315 non-null    object
 1   1       285 non-null    object
 2   2       245 non-null    object
 3   3       187 non-null    object
 4   4       133 non-null    object
 5   5       71 non-null     object
 6   6       41 non-null     object
dtypes: object(7)
memory usage: 17.4+ KB


In [8]:
# one-hot encode the baskets so the dataset can be used by the apriori algorithm
items = pd.Series(df.values.ravel()).dropna().unique() # unique products across all columns, not only the first one
encoded_vals = []
for index, row in df.iterrows(): 
    labels = {}
    uncommons = list(set(items) - set(row))
    commons = list(set(items).intersection(row))
    for uc in uncommons:
        labels[uc] = 0
    for com in commons:
        labels[com] = 1
    encoded_vals.append(labels)

encoded_vals[0]

{'Milk': 0,
 'Bagel': 0,
 'Pencil': 1,
 'Cheese': 1,
 'Wine': 1,
 'Eggs': 1,
 'Diaper': 1,
 'Meat': 1,
 'Bread': 1}

In [9]:
ohe_df = pd.DataFrame(encoded_vals) # transform the encoded values to a pandas dataframe
ohe_df.head()

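|   | Milk | Bagel | Pencil | Cheese | Wine | Eggs | Diaper | Meat | Bread |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 2 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| 3 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| 4 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |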
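The same encoding can be written more compactly. The sketch below is one possible alternative, not the notebook's method: it assumes a small invented DataFrame shaped like the retail data (one basket per row, missing values in short baskets), and the names `raw`, `transactions`, `all_items` and `ohe_sketch` are hypothetical. mlxtend's `TransactionEncoder`, imported in the first cell, serves the same purpose.

```python
import pandas as pd

# Hypothetical mini-dataset: one basket per row, None where a basket is shorter
raw = pd.DataFrame([
    ["Bread", "Wine", "Eggs"],
    ["Cheese", "Meat", None],
    ["Meat", None, None],
])

# Turn each row into a set of products, ignoring missing values
transactions = [set(row.dropna()) for _, row in raw.iterrows()]

# Collect every product that occurs in any column
all_items = sorted(set().union(*transactions))

# One boolean column per product: True if the basket contains it
ohe_sketch = pd.DataFrame(
    [{item: item in basket for item in all_items} for basket in transactions],
    columns=all_items,
)
print(ohe_sketch)
```

Boolean columns are used here because they state the meaning directly (the product is in the basket or it is not); the 0/1 integers produced by the loop above carry the same information.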


In [13]:
freq_items = apriori(ohe_df, min_support=0.2, use_colnames=True, verbose=1) # find the itemsets whose support is at least 0.2

Processing 138 combinations | Sampling itemset size 3


In [14]:
freq_items

|   | support | itemsets |
|---|---|---|
| 0 | 0.501587 | (Milk) |
| 1 | 0.425397 | (Bagel) |
| 2 | 0.361905 | (Pencil) |
| 3 | 0.501587 | (Cheese) |
| 4 | 0.438095 | (Wine) |
| 5 | 0.438095 | (Eggs) |
| 6 | 0.406349 | (Diaper) |
| 7 | 0.47619 | (Meat) |
| 8 | 0.504762 | (Bread) |
| 9 | 0.225397 | (Milk, Bagel) |
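To make the level-wise search behind a frequent-itemset table like this one more concrete, here is a minimal pure-Python sketch of the Apriori idea: start from frequent single items, join them into larger candidates, and prune any candidate that has an infrequent subset. The function name `apriori_sketch` and the toy baskets are invented for illustration; mlxtend's actual implementation is different and far more efficient.

```python
from itertools import combinations

def apriori_sketch(baskets, min_support):
    """Return {frozenset: support} for every itemset with support >= min_support."""
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items
    items = {item for basket in baskets for item in basket}
    current = {frozenset([item]) for item in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Toy transactions, invented for illustration only
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Cheese"},
    {"Bread", "Milk", "Cheese"},
    {"Milk"},
]
result = apriori_sketch(baskets, min_support=0.5)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are frequent, so most large combinations never need to be counted.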


In [19]:
association_rules(freq_items, metric="confidence", min_threshold=0.6)

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (Bagel) | (Bread) | 0.425397 | 0.504762 | 0.279365 | 0.656716 | 1.301042 | 0.064641 | 1.44265 |
| 1 | (Cheese) | (Milk) | 0.501587 | 0.501587 | 0.304762 | 0.607595 | 1.211344 | 0.053172 | 1.270148 |
| 2 | (Milk) | (Cheese) | 0.501587 | 0.501587 | 0.304762 | 0.607595 | 1.211344 | 0.053172 | 1.270148 |
| 3 | (Wine) | (Cheese) | 0.438095 | 0.501587 | 0.269841 | 0.615942 | 1.227986 | 0.050098 | 1.297754 |
| 4 | (Eggs) | (Cheese) | 0.438095 | 0.501587 | 0.298413 | 0.681159 | 1.358008 | 0.07867 | 1.563203 |
| 5 | (Cheese) | (Meat) | 0.501587 | 0.47619 | 0.32381 | 0.64557 | 1.355696 | 0.084958 | 1.477891 |
| 6 | (Meat) | (Cheese) | 0.47619 | 0.501587 | 0.32381 | 0.68 | 1.355696 | 0.084958 | 1.55754 |
| 7 | (Eggs) | (Meat) | 0.438095 | 0.47619 | 0.266667 | 0.608696 | 1.278261 | 0.05805 | 1.338624 |
| 8 | (Cheese, Meat) | (Milk) | 0.32381 | 0.501587 | 0.203175 | 0.627451 | 1.250931 | 0.040756 | 1.337845 |
| 9 | (Cheese, Milk) | (Meat) | 0.304762 | 0.47619 | 0.203175 | 0.666667 | 1.4 | 0.05805 | 1.571429 |


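Some inferences from the rules:  
- Cheese and meat are sold together in about 32% of baskets (support 0.324); eggs and meat are sold together in about 27% (support 0.267).
- The probability of selling bread when a bagel is sold is about 66% (confidence 0.657).
- Buying meat and milk together makes a cheese purchase about 1.66 times more likely (lift 1.657); this rule appears in the filtered output that follows.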
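The columns of the rules table are related by simple arithmetic, which makes readings like these easy to verify. The sketch below recomputes confidence, lift, leverage and conviction for the (Bagel) -> (Bread) rule from its three support values, using the standard definitions of those metrics (leverage and conviction are not defined earlier in this notebook, so their formulas are stated here as an addition). The variable names are only for illustration.

```python
import math

# Values copied from the (Bagel) -> (Bread) row of the rules table
antecedent_support = 0.425397  # Support(Bagel)
consequent_support = 0.504762  # Support(Bread)
support = 0.279365             # Support(Bagel, Bread)

# Confidence: share of Bagel baskets that also contain Bread
confidence = support / antecedent_support

# Lift: confidence relative to Bread's overall purchase rate
lift = confidence / consequent_support

# Leverage: observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support

# Conviction: (1 - Support(Y)) / (1 - Confidence(X, Y))
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 4), round(lift, 4), round(leverage, 4), round(conviction, 4))
```

The recomputed values should agree with the table to about four decimal places; small differences come from the rounding of the inputs.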

In [18]:
# keep the rules whose support is below 0.3 and whose confidence is above 0.7
df_ar = association_rules(freq_items, metric = "confidence", min_threshold=0.6)
df_ar[(df_ar.support < 0.3) & (df_ar.confidence > 0.7)]

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 8 | (Milk, Meat) | (Cheese) | 0.244444 | 0.501587 | 0.203175 | 0.831169 | 1.657077 | 0.080564 | 2.952137 |
| 11 | (Cheese, Eggs) | (Meat) | 0.298413 | 0.47619 | 0.215873 | 0.723404 | 1.519149 | 0.073772 | 1.893773 |
| 13 | (Eggs, Meat) | (Cheese) | 0.266667 | 0.501587 | 0.215873 | 0.809524 | 1.613924 | 0.082116 | 2.616667 |
