In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

! pip install mlxtend

Collecting mlxtend
  Downloading mlxtend-0.23.0-py3-none-any.whl (1.4 MB)
Installing collected packages: mlxtend
Successfully installed mlxtend-0.23.0


# Association Rule for Store Dataset

In this case study, we explore how association rules can be used to analyze which items are usually purchased together.

You can refer to these articles to learn about Apriori and association rules:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use a dataset of transactions from a store. You can get the dataset here:
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [2]:
# Load the dataset and show the first five transactions
# (pandas is already imported as pd in the first cell)
url = "https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv"
df = pd.read_csv(url)

# Show the first five transactions
print("First five transactions:")
df.head()

First five transactions:


Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


# Get the set of products that have been purchased


Get the unique products that have been purchased.

In [8]:

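# Stack all item columns into one Series and drop the NaN padding,
# so the set contains product names only
purchased_products_set = set(df.stack().dropna())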

unique_products = df.stack().dropna().unique()

print("\nUnique products that have been purchased:")
print(unique_products)


Unique products that have been purchased:
['Bread' 'Wine' 'Eggs' 'Meat' 'Cheese' 'Pencil' 'Diaper' 'Milk' 'Bagel']


## Preprocess Data

In this step, we transform the dataset into a one-hot encoding of the purchased products: one row per transaction and one column per product.
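
To make the target format concrete, here is a minimal pure-Python sketch of one-hot encoding. The three toy transactions are hypothetical and are not taken from the retail dataset; the notebook cell itself uses pandas for the same idea.

```python
# Toy transactions (hypothetical, not taken from the retail dataset).
transactions = [
    ["Bread", "Milk"],
    ["Bread", "Eggs", "Cheese"],
    ["Milk"],
]

# Column order: every distinct item, sorted alphabetically.
items = sorted({item for basket in transactions for item in basket})

# One row per transaction: 1 if the item is in the basket, else 0.
encoded_rows = [
    [1 if item in basket else 0 for item in items]
    for basket in transactions
]

print(items)
for row in encoded_rows:
    print(row)
```

Each row has the same length regardless of basket size, which is the shape the frequent-pattern functions expect.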

In [11]:
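# Create an itemset (list of purchased products) for each transaction
itemset = df.apply(lambda row: list(row.dropna()), axis=1)

# One-hot encode the items, then aggregate back to one row per transaction.
# groupby(level=0).sum() replaces the deprecated .sum(level=0).
encoded_df = pd.get_dummies(itemset.apply(pd.Series).stack()).groupby(level=0).sum()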

print("\nEncoded DataFrame:")
print(encoded_df)





Encoded DataFrame:
     Bagel  Bread  Cheese  Diaper  Eggs  Meat  Milk  Pencil  Wine
0        0      1       1       1     1     1     0       1     1
1        0      1       1       1     0     1     1       1     1
2        0      0       1       0     1     1     1       0     1
3        0      0       1       0     1     1     1       0     1
4        0      0       0       0     0     1     0       1     1
..     ...    ...     ...     ...   ...   ...   ...     ...   ...
310      0      1       1       0     1     0     0       0     0
311      0      0       0       0     0     1     1       1     0
312      0      1       1       1     1     1     0       1     1
313      0      0       1       0     0     1     0       0     0
314      1      1       0       0     1     1     0       0     1

[315 rows x 9 columns]




In [12]:
# Create a new DataFrame by joining the raw transactions with the encoded features
new_df = pd.concat([df, encoded_df], axis=1)

# Show the new DataFrame
print("\nNew DataFrame with Encoded Features:")
print(new_df)


New DataFrame with Encoded Features:
          0       1       2       3       4       5       6  Bagel  Bread  \
0     Bread    Wine    Eggs    Meat  Cheese  Pencil  Diaper      0      1   
1     Bread  Cheese    Meat  Diaper    Wine    Milk  Pencil      0      1   
2    Cheese    Meat    Eggs    Milk    Wine     NaN     NaN      0      0   
3    Cheese    Meat    Eggs    Milk    Wine     NaN     NaN      0      0   
4      Meat  Pencil    Wine     NaN     NaN     NaN     NaN      0      0   
..      ...     ...     ...     ...     ...     ...     ...    ...    ...   
310   Bread    Eggs  Cheese     NaN     NaN     NaN     NaN      0      1   
311    Meat    Milk  Pencil     NaN     NaN     NaN     NaN      0      0   
312   Bread  Cheese    Eggs    Meat  Pencil  Diaper    Wine      0      1   
313    Meat  Cheese     NaN     NaN     NaN     NaN     NaN      0      0   
314    Eggs    Wine   Bagel   Bread    Meat     NaN     NaN      1      1   

     Cheese  Diaper  Eggs  Meat  Milk

Since the combined DataFrame can contain an empty column, we select all columns other than the first and drop any column that is entirely NaN.

In [13]:
# Select all columns other than the first, then drop columns that are entirely NaN
new_df_cleaned = new_df.iloc[:, 1:].dropna(axis=1, how='all')

# Display the cleaned DataFrame
print("\nCleaned DataFrame:")
print(new_df_cleaned)


Cleaned DataFrame:
          1       2       3       4       5       6  Bagel  Bread  Cheese  \
0      Wine    Eggs    Meat  Cheese  Pencil  Diaper      0      1       1   
1    Cheese    Meat  Diaper    Wine    Milk  Pencil      0      1       1   
2      Meat    Eggs    Milk    Wine     NaN     NaN      0      0       1   
3      Meat    Eggs    Milk    Wine     NaN     NaN      0      0       1   
4    Pencil    Wine     NaN     NaN     NaN     NaN      0      0       0   
..      ...     ...     ...     ...     ...     ...    ...    ...     ...   
310    Eggs  Cheese     NaN     NaN     NaN     NaN      0      1       1   
311    Milk  Pencil     NaN     NaN     NaN     NaN      0      0       0   
312  Cheese    Eggs    Meat  Pencil  Diaper    Wine      0      1       1   
313  Cheese     NaN     NaN     NaN     NaN     NaN      0      0       1   
314    Wine   Bagel   Bread    Meat     NaN     NaN      1      1       0   

     Diaper  Eggs  Meat  Milk  Pencil  Wine  
0        

## Apriori Algorithm

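We will use the Apriori algorithm to find the frequently purchased product combinations (frequent itemsets).
For this case study, we will use min_support=0.2, so an itemset is kept only if it appears in at least 20% of the transactions.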
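
As a rough illustration of what the support threshold does, the following pure-Python sketch runs a simplified level-wise Apriori search on four hypothetical baskets. The helper names `support` and `frequent_itemsets` are made up for this example, and the code is not how mlxtend implements the algorithm.

```python
from itertools import combinations


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: candidates are built only from frequent itemsets."""
    baskets = [frozenset(b) for b in transactions]
    # Level 1: single items that meet the threshold.
    singles = {frozenset([item]) for basket in baskets for item in basket}
    current = {s for s in singles if support(s, baskets) >= min_support}
    result = {}
    size = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset, baskets)
        size += 1
        # Join frequent itemsets of the previous level into larger candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune: every (size - 1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        }
        current = {c for c in candidates if support(c, baskets) >= min_support}
    return result


# Hypothetical baskets, not taken from the retail dataset.
toy = [
    {"Bread", "Milk"},
    {"Bread", "Milk", "Eggs"},
    {"Bread", "Eggs"},
    {"Milk"},
]
found = frequent_itemsets(toy, min_support=0.5)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The pruning step is the key idea: an itemset can only be frequent if all of its subsets are frequent, so most combinations never need to be counted.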

In [25]:
from mlxtend.frequent_patterns import apriori

min_support = 0.2
frequent_itemsets_apriori = apriori(encoded_df, min_support=min_support, use_colnames=True)

print(frequent_itemsets_apriori)

     support              itemsets
0   0.425397               (Bagel)
1   0.504762               (Bread)
2   0.501587              (Cheese)
3   0.406349              (Diaper)
4   0.438095                (Eggs)
5   0.476190                (Meat)
6   0.501587                (Milk)
7   0.361905              (Pencil)
8   0.438095                (Wine)
9   0.279365        (Bread, Bagel)
10  0.225397         (Milk, Bagel)
11  0.238095       (Bread, Cheese)
12  0.231746       (Bread, Diaper)
13  0.206349         (Bread, Meat)
14  0.279365         (Milk, Bread)
15  0.200000       (Pencil, Bread)
16  0.244444         (Wine, Bread)
17  0.200000      (Cheese, Diaper)
18  0.298413        (Eggs, Cheese)
19  0.323810        (Meat, Cheese)
20  0.304762        (Milk, Cheese)
21  0.200000      (Pencil, Cheese)
22  0.269841        (Wine, Cheese)
23  0.234921        (Wine, Diaper)
24  0.266667          (Eggs, Meat)
25  0.244444          (Eggs, Milk)
26  0.241270          (Eggs, Wine)
27  0.244444        



Next, we generate association rules from the frequent itemsets, keeping only the rules whose confidence is at least 0.6.
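
The idea behind this step can be sketched in plain Python: split each frequent itemset into an antecedent and a consequent, compute the confidence, and keep the rule if it passes the threshold. The supports below are hypothetical toy values, and `rules_by_confidence` is an illustrative helper, not part of mlxtend.

```python
from itertools import combinations

# Hypothetical supports for a toy example (not from the retail dataset).
# Every subset of a frequent itemset is itself frequent, so it has an entry.
supports = {
    frozenset({"Bread"}): 0.8,
    frozenset({"Milk"}): 0.6,
    frozenset({"Eggs"}): 0.4,
    frozenset({"Bread", "Milk"}): 0.4,
    frozenset({"Bread", "Eggs"}): 0.4,
}


def rules_by_confidence(supports, min_confidence):
    """Split each itemset into antecedent -> consequent; keep confident rules."""
    rules = []
    for itemset, itemset_support in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for k in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), k)):
                consequent = itemset - antecedent
                confidence = itemset_support / supports[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, confidence))
    return rules


toy_rules = rules_by_confidence(supports, min_confidence=0.6)
for antecedent, consequent, confidence in toy_rules:
    print(sorted(antecedent), "->", sorted(consequent), round(confidence, 3))
```

Note that a rule and its reverse can have different confidence values, because each is divided by the support of its own antecedent.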

In [28]:
from mlxtend.frequent_patterns import association_rules

# Set the confidence threshold
min_confidence = 0.6

# Generate association rules
rules = association_rules(frequent_itemsets_apriori, metric="confidence", min_threshold=min_confidence)

# Display the association rules
print("Association Rules:")
print(rules)

Association Rules:
       antecedents consequents  antecedent support  consequent support  \
0          (Bagel)     (Bread)            0.425397            0.504762   
1           (Eggs)    (Cheese)            0.438095            0.501587   
2           (Meat)    (Cheese)            0.476190            0.501587   
3         (Cheese)      (Meat)            0.501587            0.476190   
4           (Milk)    (Cheese)            0.501587            0.501587   
5         (Cheese)      (Milk)            0.501587            0.501587   
6           (Wine)    (Cheese)            0.438095            0.501587   
7           (Eggs)      (Meat)            0.438095            0.476190   
8     (Eggs, Meat)    (Cheese)            0.266667            0.501587   
9   (Eggs, Cheese)      (Meat)            0.298413            0.476190   
10  (Meat, Cheese)      (Eggs)            0.323810            0.438095   
11    (Milk, Meat)    (Cheese)            0.244444            0.501587   
12  (Milk, Cheese) 

Provide an explanation of __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__ and __conviction__.

In [None]:
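- Antecedent Support:
The fraction of all transactions that contain the antecedent (the itemset on the left-hand side of the rule), whether or not other items are also present.

- Consequent Support:
The fraction of all transactions that contain the consequent (the itemset on the right-hand side of the rule), whether or not other items are also present.

- Support:
The fraction of all transactions that contain both the antecedent and the consequent.

- Confidence:
How likely the consequent is to be bought given that the antecedent is bought: support / antecedent support.

- Lift:
How much more often the antecedent and the consequent are bought together than we would expect if they were independent: confidence / consequent support. A value of 1 indicates independence; a value above 1 indicates a positive association.

- Leverage:
The difference between how often the two itemsets are bought together and how often we would expect them to be bought together by chance: support - (antecedent support x consequent support). A value of 0 indicates independence.

- Conviction:
How much more often the rule would fail (the antecedent bought without the consequent) if the two itemsets were independent, compared with how often it actually fails: (1 - consequent support) / (1 - confidence). Values above 1 indicate that the consequent depends on the antecedent; the value is infinite when confidence is 1.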
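
As a worked check, the sketch below derives the remaining metrics for the rule (Bagel) -> (Bread) from the three supports printed in the outputs above (0.425397, 0.504762 and 0.279365). The function `rule_metrics` is a hypothetical helper written for this example; the formulas follow the standard definitions of these metrics.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Derive rule metrics from three supports (all fractions of transactions)."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # Conviction is infinite when the rule never fails (confidence == 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Supports for (Bagel) -> (Bread) as printed in the notebook outputs.
metrics = rule_metrics(
    antecedent_support=0.425397,
    consequent_support=0.504762,
    support=0.279365,
)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The confidence comes out above the 0.6 threshold, which is consistent with this rule appearing in the rules table, and the lift above 1 suggests that bagels and bread are bought together more often than independence would predict.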