In [61]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# Association Rule for Store Dataset

In this case study, we will explore how association rules can be used to analyze items that are usually purchased together.

You can refer to these articles to learn about apriori and association rules:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use a dataset of transactions from a store. You can get the dataset here:
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [62]:
# load the dataset and show the first five transactions

df = pd.read_csv('https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv')

df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


## Get Unique Products

Get the set of unique products that have been purchased.

In [63]:
products = df.values.flatten()
unique_products = set(item for item in products if not pd.isna(item))
print(unique_products)

{'Eggs', 'Diaper', 'Bagel', 'Meat', 'Cheese', 'Wine', 'Pencil', 'Milk', 'Bread'}


## Preprocess Data

In this step, we will transform the dataset into a one-hot encoding: one column per product, with each row indicating whether that product appears in the transaction.

In [64]:
# replace empty cells with the placeholder string 'NaN', then turn each row into a list of items
df = df.fillna('NaN')
transactions = df.values.tolist()

# one-hot encode the transactions
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df_encoded = pd.DataFrame(te_ary, columns=te.columns_)

In [65]:
# convert the boolean encoding to 0/1 integers in a new dataframe
df_new = df_encoded.astype(int)
# show the new dataframe
df_new.head()

Unnamed: 0,Bagel,Bread,Cheese,Diaper,Eggs,Meat,Milk,NaN,Pencil,Wine
0,0,1,1,1,1,1,0,0,1,1
1,0,1,1,1,0,1,1,0,1,1
2,0,0,1,0,1,1,1,1,0,1
3,0,0,1,0,1,1,1,1,0,1
4,0,0,0,0,0,1,0,1,1,1


The encoded dataframe contains a `NaN` column, created by the placeholder we used for empty cells. It does not represent a product, so we will drop it.

In [66]:
df_new = df_new.drop('NaN', axis=1)
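
The placeholder column can also be avoided by filtering out missing values before encoding. The following is a minimal sketch of that idea, not the approach used in this notebook: it uses the first five transactions shown by `df.head()` as sample data, the names `raw`, `items`, and `encoded` are illustrative, and the one-hot step is written in plain pandas so the snippet runs on its own. The cleaned `transactions` list has the same list-of-lists shape that `TransactionEncoder` accepts.

```python
import pandas as pd

# Sample data: the first five transactions shown by df.head(); None marks an empty cell.
raw = pd.DataFrame([
    ["Bread", "Wine", "Eggs", "Meat", "Cheese", "Pencil", "Diaper"],
    ["Bread", "Cheese", "Meat", "Diaper", "Wine", "Milk", "Pencil"],
    ["Cheese", "Meat", "Eggs", "Milk", "Wine", None, None],
    ["Cheese", "Meat", "Eggs", "Milk", "Wine", None, None],
    ["Meat", "Pencil", "Wine", None, None, None, None],
])

# Keep only the real items in each row, so no placeholder ever reaches the encoder.
transactions = [
    [item for item in row if pd.notna(item)]
    for row in raw.values.tolist()
]

# One column per product, 1 if the product is in the transaction, else 0.
items = sorted({item for t in transactions for item in t})
encoded = pd.DataFrame(
    [[int(item in t) for item in items] for t in transactions],
    columns=items,
)
print(encoded)
```

With this approach there is no `NaN` column to drop afterwards.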

## Apriori Algorithm

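We will use the Apriori algorithm to find the products and product combinations that are purchased frequently.
For this case study, we will use min_support=0.2, so only itemsets that appear in at least 20% of the transactions are kept.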

In [67]:
frequent_itemsets = apriori(df_new, min_support=0.2, use_colnames=True)
frequent_itemsets



Unnamed: 0,support,itemsets
0,0.425397,(Bagel)
1,0.504762,(Bread)
2,0.501587,(Cheese)
3,0.406349,(Diaper)
4,0.438095,(Eggs)
5,0.47619,(Meat)
6,0.501587,(Milk)
7,0.361905,(Pencil)
8,0.438095,(Wine)
9,0.279365,"(Bagel, Bread)"
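
To make the support values concrete, here is a minimal sketch that computes support by hand. It assumes a toy table made of only the first five encoded rows shown earlier (with the `NaN` column dropped), so the numbers differ from the full-dataset results above. The helper `support` is a hypothetical name used for illustration and is not part of mlxtend.

```python
import pandas as pd

# Toy one-hot table: the first five encoded transactions, NaN column removed.
onehot = pd.DataFrame(
    [
        [0, 1, 1, 1, 1, 1, 0, 1, 1],
        [0, 1, 1, 1, 0, 1, 1, 1, 1],
        [0, 0, 1, 0, 1, 1, 1, 0, 1],
        [0, 0, 1, 0, 1, 1, 1, 0, 1],
        [0, 0, 0, 0, 0, 1, 0, 1, 1],
    ],
    columns=["Bagel", "Bread", "Cheese", "Diaper", "Eggs",
             "Meat", "Milk", "Pencil", "Wine"],
).astype(bool)


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return onehot[list(itemset)].all(axis=1).mean()


print(support(["Cheese"]))           # Cheese is in 4 of the 5 rows
print(support(["Bread", "Cheese"]))  # both are in 2 of the 5 rows
```

An itemset is "frequent" when this fraction is at least `min_support`.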


Then, we will generate association rules from the frequent itemsets, keeping only the rules with a confidence of at least 0.6.

In [68]:
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265,0.402687
1,(Eggs),(Cheese),0.438095,0.501587,0.298413,0.681159,1.358008,0.07867,1.563203,0.469167
2,(Meat),(Cheese),0.47619,0.501587,0.32381,0.68,1.355696,0.084958,1.55754,0.500891
3,(Cheese),(Meat),0.501587,0.47619,0.32381,0.64557,1.355696,0.084958,1.477891,0.526414
4,(Cheese),(Milk),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148,0.350053
5,(Milk),(Cheese),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148,0.350053
6,(Wine),(Cheese),0.438095,0.501587,0.269841,0.615942,1.227986,0.050098,1.297754,0.330409
7,(Eggs),(Meat),0.438095,0.47619,0.266667,0.608696,1.278261,0.05805,1.338624,0.387409
8,"(Meat, Cheese)",(Eggs),0.32381,0.438095,0.215873,0.666667,1.521739,0.074014,1.685714,0.507042
9,"(Meat, Eggs)",(Cheese),0.266667,0.501587,0.215873,0.809524,1.613924,0.082116,2.616667,0.518717
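
Because the rules come back as a pandas dataframe, they can be filtered and sorted like any other table. The sketch below rebuilds four of the rules above by hand so that it runs on its own; the thresholds (lift above 1.3, confidence of at least 0.65) are arbitrary values chosen for illustration, not recommendations.

```python
import pandas as pd

# Four rules copied from the output above (rows 0, 4, 8 and 9).
rules = pd.DataFrame({
    "antecedents": [frozenset({"Bagel"}), frozenset({"Cheese"}),
                    frozenset({"Meat", "Cheese"}), frozenset({"Meat", "Eggs"})],
    "consequents": [frozenset({"Bread"}), frozenset({"Milk"}),
                    frozenset({"Eggs"}), frozenset({"Cheese"})],
    "support": [0.279365, 0.304762, 0.215873, 0.215873],
    "confidence": [0.656716, 0.607595, 0.666667, 0.809524],
    "lift": [1.301042, 1.211344, 1.521739, 1.613924],
})

# Keep rules that pass both illustrative thresholds, strongest lift first.
strong = rules[(rules["lift"] > 1.3) & (rules["confidence"] >= 0.65)]
strong = strong.sort_values("lift", ascending=False)
print(strong)
```

The same two lines of filtering and sorting can be applied directly to the dataframe returned by `association_rules`.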


Explanation of __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__ and __conviction__:

Antecedent Support:

What it measures: The fraction of transactions that contain the antecedent, the item or items in the "if" part of the rule.

Example: In the rule Bagel -> Bread, the antecedent support is 0.425397, meaning Bagel appears in about 42.5% of all transactions.



Consequent Support:

What it measures: The fraction of transactions that contain the consequent, the item or items in the "then" part of the rule.

Example: In the rule Bagel -> Bread, the consequent support is 0.504762, meaning Bread appears in about 50.5% of all transactions.



Support:

What it measures: The fraction of transactions that contain every item in a particular itemset. For a rule, the itemset is the antecedent and the consequent taken together.

Example: Consider the itemset {Milk, Cheese, Bread}. If this itemset appears in 20 out of 100 transactions, the support for {Milk, Cheese, Bread} is 20/100, which is 0.2 or 20%.



Confidence:

What it measures: The fraction of transactions containing the antecedent that also contain the consequent. It is computed as support divided by antecedent support.

Example: In the rule Bagel -> Bread, the confidence is 0.656716, meaning about 65.7% of the transactions that contain Bagel also contain Bread.



Lift:

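What it measures: How much more likely the consequent is to appear when the antecedent is present, compared to how often the consequent appears across all transactions. It is computed as confidence divided by consequent support.

Interpretation: Lift greater than 1 means the items appear together more often than expected if they were independent; lift equal to 1 means they are independent; lift less than 1 means they appear together less often than expected.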



Leverage:

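What it measures: The difference between how often the antecedent and consequent actually appear together and how often we would expect them to if they were independent. It is computed as support minus the product of antecedent support and consequent support.

Example: Positive leverage means they appear together more often than expected; negative means less often; zero means they are independent.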



Conviction:

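What it measures: How strongly the consequent depends on the antecedent. It is computed as (1 - consequent support) / (1 - confidence), which compares how often the rule would fail if the items were independent with how often it actually fails.
Interpretation: Conviction equal to 1 means the items are independent. Values greater than 1 mean the rule fails less often than independence would predict, so the consequent depends on the antecedent; a rule that always holds (confidence of 1) has infinite conviction.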
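
As a check on these definitions, the sketch below recomputes the metrics for the rule Bagel -> Bread from the three support values reported in row 0 of the rules table. The formulas are the standard definitions of these metrics; the variable names are illustrative. Because the inputs are rounded to six decimals, the results agree with the table only to about four or five decimal places.

```python
# Inputs taken from row 0 of the rules table (Bagel -> Bread).
antecedent_support = 0.425397  # share of transactions containing Bagel
consequent_support = 0.504762  # share of transactions containing Bread
support = 0.279365             # share containing both Bagel and Bread

# Of the transactions with Bagel, the share that also contain Bread.
confidence = support / antecedent_support

# Confidence relative to how common Bread is overall.
lift = confidence / consequent_support

# Observed joint support minus the joint support expected under independence.
leverage = support - antecedent_support * consequent_support

# Expected failure rate under independence divided by the observed failure rate.
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence = {confidence:.4f}")  # table reports 0.656716
print(f"lift       = {lift:.4f}")        # table reports 1.301042
print(f"leverage   = {leverage:.4f}")    # table reports 0.064641
print(f"conviction = {conviction:.4f}")  # table reports 1.44265
```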