In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import OneHotEncoder

!pip install mlxtend==0.23.1

Collecting mlxtend==0.23.1
  Downloading mlxtend-0.23.1-py3-none-any.whl.metadata (7.3 kB)
Downloading mlxtend-0.23.1-py3-none-any.whl (1.4 MB)
Installing collected packages: mlxtend
  Attempting uninstall: mlxtend
    Found existing installation: mlxtend 0.23.3
    Uninstalling mlxtend-0.23.3:
      Successfully uninstalled mlxtend-0.23.3
Successfully installed mlxtend-0.23.1


# Association Rule for Store Dataset

In this case study, we will explore how association rules can be used to analyze which items are usually purchased together.

You can refer to these articles to learn about Apriori and association rules:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use a dataset of transactions from a store. You can get the dataset here:
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [2]:
df=pd.read_csv('https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv')
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


## Get the set of products that have been purchased


In [50]:
unique_product = set(df.values.flatten())
print(unique_product)

{'Wine', nan, 'Eggs', 'Diaper', 'Cheese', 'Bread', 'Milk', 'Meat', 'Bagel', 'Pencil'}




## Preprocess Data

In this step, we will transform our dataset so that we will have a one hot encoding based on the purchased products.

In [51]:
# Build an itemset for the unique products, with every value initialised to 0
itemset = {item: 0 for item in unique_product}

# Iterate over each item in the first row of the DataFrame
for item in df.iloc[0]:  # df.iloc[0] accesses the first row
    if item in itemset:  # check whether the item is in the itemset
        itemset[item] = 1  # set the value to 1 if the item appears in the first row

# Display the updated itemset
itemset




{'Wine': 1,
 nan: 0,
 'Eggs': 1,
 'Diaper': 1,
 'Cheese': 1,
 'Bread': 1,
 'Milk': 0,
 'Meat': 1,
 'Bagel': 0,
 'Pencil': 1}

In [64]:

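# Start with an all-zero frame: one row per transaction, one column per product
encoded_df = pd.DataFrame(0, index=range(len(df)), columns=list(itemset))

# Iterate over the rows of the DataFrame
for i, row in df.iterrows():
    for item in row:  # iterate over every item in the row
        # Empty cells are NaN, so they mark the NaN column, which is dropped in the next cell
        encoded_df.loc[i, item] = 1

# Show the first 5 rows of encoded_df
encoded_df.head()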




Unnamed: 0,Wine,NaN,Eggs,Diaper,Cheese,Bread,Milk,Meat,Bagel,Pencil
0,1,0,1,1,1,1,0,1,0,1
1,1,0,0,1,1,1,1,1,0,1
2,1,1,1,0,1,0,1,1,0,0
3,1,1,1,0,1,0,1,1,0,0
4,1,1,0,0,0,0,0,1,0,1


In [65]:
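# Drop the NaN column by label rather than by position, because the column
# order comes from a set and is not guaranteed to be the same on every run
encoded_df = encoded_df.loc[:, encoded_df.columns.notna()]

# Show the first 5 rows after the NaN column is removed
encoded_df.head()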



Unnamed: 0,Wine,Eggs,Diaper,Cheese,Bread,Milk,Meat,Bagel,Pencil
0,1,1,1,1,1,0,1,0,1
1,1,0,1,1,1,1,1,0,1
2,1,1,0,1,0,1,1,0,0
3,1,1,0,1,0,1,1,0,0
4,1,0,0,0,0,0,1,0,1


Because some transactions have empty cells, the encoded dataframe contains a NaN column. We drop that column (the second column in the first encoded output above) so that only real products remain.
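
The loop-based encoding above can also be written more compactly. The following is a minimal sketch on a small made-up frame (`toy` and its contents are assumptions for illustration, not the retail dataset). It filters out empty cells before encoding, so no NaN column is created, and it sorts the product names so the column order is the same on every run.

```python
import pandas as pd

# Hypothetical miniature of the retail data: one row per transaction,
# with None standing in for an empty cell
toy = pd.DataFrame({
    "0": ["Bread", "Cheese", "Meat"],
    "1": ["Wine", "Meat", "Pencil"],
    "2": ["Eggs", None, "Wine"],
})

# Turn each row into a list of purchased items, skipping empty cells
transactions = [
    [item for item in row if pd.notna(item)]
    for row in toy.values.tolist()
]

# Sorted product names give a deterministic column order
items = sorted({item for t in transactions for item in t})

# One boolean column per product: True if the transaction contains it
encoded = pd.DataFrame(
    [[item in t for item in items] for t in transactions],
    columns=items,
)
print(encoded)
```

Boolean columns carry the same information as the 0/1 columns used in this notebook.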

## Apriori Algorithm

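We will use the Apriori algorithm to find the frequently purchased products (the frequent itemsets).
For this case study, we will use min_support=0.2, so an itemset is kept only if it appears in at least 20% of the transactions.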
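
To make the role of the support threshold concrete, here is a small sketch that counts support by brute force for every 1-item and 2-item itemset in a made-up list of transactions. The transactions, the `support` helper, and the threshold of 0.4 are assumptions for illustration. This is plain enumeration rather than Apriori itself, which avoids checking supersets of itemsets already known to be infrequent.

```python
from itertools import combinations

# Hypothetical transactions, each stored as a set of product names
transactions = [
    {"Bread", "Wine", "Eggs"},
    {"Bread", "Cheese"},
    {"Cheese", "Wine", "Eggs"},
    {"Wine", "Eggs"},
    {"Bread", "Wine"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


min_support = 0.4  # illustrative threshold, not the notebook's 0.2
items = sorted(set().union(*transactions))

# Keep every itemset of size 1 or 2 whose support reaches the threshold
frequent = {}
for size in (1, 2):
    for combo in combinations(items, size):
        s = support(set(combo), transactions)
        if s >= min_support:
            frequent[combo] = s

for combo, s in frequent.items():
    print(combo, s)
```

Raising `min_support` shrinks the result, and lowering it lets rarer combinations through.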

In [66]:
from mlxtend.frequent_patterns import apriori, association_rules
frequent_itemsets = apriori(encoded_df, min_support=0.2, use_colnames=True)
frequent_itemsets



Unnamed: 0,support,itemsets
0,0.438095,(Wine)
1,0.438095,(Eggs)
2,0.406349,(Diaper)
3,0.501587,(Cheese)
4,0.504762,(Bread)
5,0.501587,(Milk)
6,0.47619,(Meat)
7,0.425397,(Bagel)
8,0.361905,(Pencil)
9,0.24127,"(Wine, Eggs)"


In [67]:
association_rules_df = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.6)
association_rules_df



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Wine),(Cheese),0.438095,0.501587,0.269841,0.615942,1.227986,0.050098,1.297754,0.330409
1,(Eggs),(Cheese),0.438095,0.501587,0.298413,0.681159,1.358008,0.07867,1.563203,0.469167
2,(Eggs),(Meat),0.438095,0.47619,0.266667,0.608696,1.278261,0.05805,1.338624,0.387409
3,(Milk),(Cheese),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148,0.350053
4,(Cheese),(Milk),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148,0.350053
5,(Meat),(Cheese),0.47619,0.501587,0.32381,0.68,1.355696,0.084958,1.55754,0.500891
6,(Cheese),(Meat),0.501587,0.47619,0.32381,0.64557,1.355696,0.084958,1.477891,0.526414
7,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265,0.402687
8,"(Eggs, Meat)",(Cheese),0.266667,0.501587,0.215873,0.809524,1.613924,0.082116,2.616667,0.518717
9,"(Meat, Cheese)",(Eggs),0.32381,0.438095,0.215873,0.666667,1.521739,0.074014,1.685714,0.507042


The association rules above are generated from the frequent itemsets using confidence as the metric, with a threshold of 0.6.

Provide an explanation of __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__, __conviction__, and __zhangs_metric__, and interpret the results from the case above (please use a text section).

The table above lists association rules derived from the frequent itemsets. Each metric is explained below, using the rule (Wine) -> (Cheese) as the running example.

__Antecedent support__: The proportion of transactions that contain the antecedent (the item or items on the left side of the rule). For the rule (Wine) -> (Cheese), the antecedent support is 0.438095, so about 43.81% of transactions include wine.

__Consequent support__: The proportion of transactions that contain the consequent (the item or items on the right side of the rule). For the rule (Wine) -> (Cheese), the consequent support is 0.501587, so about 50.16% of transactions include cheese.

__Support__: The proportion of transactions that contain both the antecedent and the consequent, that is, the number of transactions containing every item in the rule divided by the total number of transactions. For (Wine) -> (Cheese), the support is 0.269841, so about 26.98% of transactions contain both wine and cheese.

__Confidence__: The estimated probability that the consequent appears in a transaction given that the antecedent is present, computed as support divided by antecedent support. For (Wine) -> (Cheese), this is 0.269841 / 0.438095 = 0.615942, so 61.59% of the transactions that contain wine also contain cheese.

__Lift__: The ratio of the confidence to the consequent support, which compares how often the consequent appears with the antecedent to how often it appears overall. A lift of 1 means the two sides are independent, a lift above 1 indicates a positive association, and a lift below 1 indicates a negative one. For (Wine) -> (Cheese), the lift is 0.615942 / 0.501587 = 1.227986, so cheese is about 1.23 times as likely in transactions with wine as it would be if the two items were independent.

__Leverage__: The difference between the observed support and the support expected if the antecedent and consequent were independent (antecedent support times consequent support). A value of 0 indicates independence. For (Wine) -> (Cheese), the leverage is 0.269841 - 0.438095 x 0.501587 = 0.050098, so wine and cheese appear together in about 5 percentage points more transactions than independence would predict.

__Conviction__: Computed as (1 - consequent support) / (1 - confidence), conviction compares how often the rule would be wrong (antecedent present, consequent absent) under independence with how often it is actually wrong. A value of 1 indicates independence, and the value grows without bound as the confidence approaches 1. For (Wine) -> (Cheese), the conviction is 1.297754, meaning the rule would fail about 1.30 times as often if wine and cheese were independent.

__Zhang’s metric__: A measure of association that ranges from -1 to 1. Positive values indicate association, negative values indicate dissociation, and 0 indicates independence. It compares the confidence of the rule with the confidence of the consequent when the antecedent is absent, divided by the larger of the two. For (Wine) -> (Cheese), the Zhang’s metric is 0.330409, which indicates a positive association.
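
As a check on the definitions above, the sketch below recomputes each metric for (Wine) -> (Cheese) from the three support values in the rules table. The variable names are chosen here for illustration, and because the inputs are rounded to six decimals the results match the table only approximately.

```python
# Supports copied from the (Wine) -> (Cheese) row of the rules table
antecedent_support = 0.438095  # share of transactions containing wine
consequent_support = 0.501587  # share of transactions containing cheese
support = 0.269841             # share of transactions containing both

confidence = support / antecedent_support
lift = confidence / consequent_support
leverage = support - antecedent_support * consequent_support
conviction = (1 - consequent_support) / (1 - confidence)

# Zhang's metric: compare the rule's confidence with the confidence of
# cheese in transactions that do NOT contain wine
confidence_without = (consequent_support - support) / (1 - antecedent_support)
zhangs_metric = (confidence - confidence_without) / max(confidence, confidence_without)

for name, value in [
    ("confidence", confidence),
    ("lift", lift),
    ("leverage", leverage),
    ("conviction", conviction),
    ("zhangs_metric", zhangs_metric),
]:
    print(f"{name:>13}: {value:.4f}")
```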

Interpretation:
For the rule (Wine) -> (Cheese):

The antecedent support (43.81%) and consequent support (50.16%) show that wine and cheese are each bought in roughly half of all transactions. These values describe each item separately and do not by themselves say how often the two are bought together.
The support of 26.98% shows that about 27% of transactions contain both wine and cheese.
The confidence of 61.59% shows that cheese is bought in 61.59% of the transactions that contain wine.
The lift of 1.23 indicates a positive association: cheese is about 23% more likely to be bought when wine is bought than its overall rate of 50.16% would suggest.
The leverage of 0.05 shows that the pair occurs in about 5 percentage points more transactions than expected under independence, a modest positive association.
The conviction of 1.30 is above 1, which is consistent with a positive association, though a weaker one than for rules such as (Eggs, Meat) -> (Cheese), whose conviction is 2.62.
The Zhang’s metric of 0.33 is positive, suggesting a moderate association on its scale from -1 to 1.
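
Applying the Threshold (Confidence ≥ 0.6):
Because the rules were generated with a confidence threshold of 0.6, every rule in the table has a confidence of at least 0.6. Among them, the rules with cheese as the consequent stand out:

Rule (Eggs) -> (Cheese) has a confidence of 0.681159.
Rule (Eggs, Meat) -> (Cheese) has a confidence of 0.809524 and the highest lift among the rows shown (1.613924).
Rule (Meat, Milk) -> (Cheese) has a confidence of 0.831169 (this rule is not among the ten rows displayed above).
These rules indicate strong associations: transactions that contain the antecedent items are considerably more likely to contain cheese than the overall rate of 50.16%. They describe co-occurrence and do not establish that buying the antecedent items causes the purchase of cheese.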
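
Once rules are in a dataframe, narrowing them down is ordinary pandas filtering. The sketch below rebuilds four rows of the rules table by hand, keeping only the columns it needs and writing the antecedents as plain strings for brevity (the frame `rules` and the stricter threshold of 0.65 are assumptions for illustration). It then keeps the rules above the threshold and ranks them by lift.

```python
import pandas as pd

# Four rows copied from the rules table above (subset of columns only)
rules = pd.DataFrame({
    "antecedents": ["Wine", "Eggs", "Bagel", "Eggs, Meat"],
    "consequents": ["Cheese", "Cheese", "Bread", "Cheese"],
    "confidence": [0.615942, 0.681159, 0.656716, 0.809524],
    "lift": [1.227986, 1.358008, 1.301042, 1.613924],
})

min_confidence = 0.65  # hypothetical stricter threshold

# Keep rules at or above the threshold, strongest lift first
strong = (
    rules[rules["confidence"] >= min_confidence]
    .sort_values("lift", ascending=False)
)
print(strong.to_string(index=False))
```

Ranking by lift rather than confidence helps avoid favouring rules whose consequent is simply a very common product.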