In [30]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import OneHotEncoder

!pip install mlxtend==0.23.1



# Association Rule for Store Dataset

In this case study, we will explore how association rules can be used to analyze items that are usually purchased together.

You can refer to these articles to learn more about Apriori and association rules:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use a dataset of transactions from a store. You can get the dataset here:
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [31]:
# load the dataset and show the first five transactions
df = pd.read_csv('https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv')

df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


# Get the set of products that have been purchased


In [None]:
itemlist = pd.unique(df.values.flatten())
itemlist

array(['Bread', 'Wine', 'Eggs', 'Meat', 'Cheese', 'Pencil', 'Diaper',
       'Milk', nan, 'Bagel'], dtype=object)

## Preprocess Data

In this step, we will transform the dataset into a one-hot encoded table with one row per transaction and one column per purchased product.

In [46]:
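# Flatten the table into one long array of items (NaN marks empty cells)
itemlist = df.values.flatten()

# Distinct items, including NaN
unique = set(itemlist)

# Illustration only: flag which items appear in the first transaction
presence_map = {item: 0 for item in unique}
first_row = df.iloc[0].values
for item in first_row:
    presence_map[item] = 1

# One list of items per transaction, then a single flat list of all items
reshape = df.values.tolist()
flat_items = [item for sublist in reshape for item in sublist]

# OneHotEncoder expects a 2-D array; np.array turns NaN into the string 'nan'
flat_items_array = np.array(flat_items).reshape(-1, 1)

encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(flat_items_array).astype(int)
encoded_df = pd.DataFrame(encoded_data, columns=encoder.categories_[0])

# Label every encoded item with the index of the transaction it came from
customer_ids = []
for i, row in enumerate(reshape):
    customer_ids.extend([i] * len(row))
encoded_df['customer_id'] = customer_ids

# Sum per transaction: one row per transaction, one column per item
transformed_df = encoded_df.groupby('customer_id').sum()

presence_map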

{'Milk': 0,
 'Diaper': 1,
 'Bagel': 0,
 'Wine': 1,
 'Meat': 1,
 'Bread': 1,
 'Pencil': 1,
 nan: 0,
 'Cheese': 1,
 'Eggs': 1}

In [47]:
# create new dataframe from the encoded features
new_df = transformed_df

# show the new dataframe
new_df.head()

customer_id,Bagel,Bread,Cheese,Diaper,Eggs,Meat,Milk,Pencil,Wine,nan
0,0,1,1,1,1,1,0,1,1,0
1,0,1,1,1,0,1,1,1,1,0
2,0,0,1,0,1,1,1,0,1,2
3,0,0,1,0,1,1,1,0,1,2
4,0,0,0,0,0,1,0,1,1,4


In [48]:
# The encoded dataframe contains a "nan" column created from the empty cells, so we drop it by name (dropping it by column index would also work).
new_df = new_df.drop(columns=["nan"])

new_df.head()


customer_id,Bagel,Bread,Cheese,Diaper,Eggs,Meat,Milk,Pencil,Wine
0,0,1,1,1,1,1,0,1,1
1,0,1,1,1,0,1,1,1,1
2,0,0,1,0,1,1,1,0,1
3,0,0,1,0,1,1,1,0,1
4,0,0,0,0,0,1,0,1,1


Because the encoded dataframe contains a nan column created from the empty cells, we drop it by name. Since nan is the last column, selecting every column except the last one would work as well.
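As an alternative sketch, the same kind of one-hot table can be built with pandas alone, which avoids creating the nan column in the first place. The small `toy` frame below is made-up sample data standing in for `df`, and the variable names are illustrative; this is one possible approach, not the method used in the cells above.

```python
import pandas as pd

# Toy stand-in for the transaction table (hypothetical data; None marks empty cells)
toy = pd.DataFrame([
    ["Bread", "Wine", None],
    ["Wine", None, None],
    ["Bread", "Milk", "Wine"],
])

# Stack into one long Series of items indexed by (transaction, position),
# and drop the empty cells explicitly
items = toy.stack().dropna()

# One dummy column per item, then collapse back to one row per transaction.
# max() keeps the result binary even if an item repeats within a transaction.
onehot = pd.get_dummies(items).groupby(level=0).max().astype(bool)

print(onehot)
```

Note that a transaction with no items at all would disappear from `onehot` in this sketch, so it assumes every row contains at least one product.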

## Apriori Algorithm

We will use the Apriori algorithm to find the frequently purchased products.
For this case study, we will use min_support=0.2.

In [50]:
# Set the threshold value used when computing support
from mlxtend.frequent_patterns import apriori, association_rules

apriori(new_df, min_support=0.2, use_colnames=True)




Unnamed: 0,support,itemsets
0,0.425397,(Bagel)
1,0.504762,(Bread)
2,0.501587,(Cheese)
3,0.406349,(Diaper)
4,0.438095,(Eggs)
5,0.47619,(Meat)
6,0.501587,(Milk)
7,0.361905,(Pencil)
8,0.438095,(Wine)
9,0.279365,"(Bagel, Bread)"
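To make the meaning of support and min_support concrete, here is a minimal brute-force sketch on a made-up one-hot table. It checks every possible itemset, whereas Apriori additionally prunes candidates whose subsets are already infrequent, so this illustrates the definition rather than the algorithm itself. The data, the `support` helper, and the threshold of 0.3 are all assumptions chosen for illustration.

```python
from itertools import combinations

import pandas as pd

# Hypothetical one-hot table: rows are transactions, columns are items
onehot = pd.DataFrame(
    {
        "Bagel":  [1, 1, 0, 1, 0],
        "Bread":  [1, 1, 1, 0, 0],
        "Cheese": [0, 1, 1, 1, 1],
    }
).astype(bool)


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return onehot[list(itemset)].all(axis=1).mean()


min_support = 0.3  # illustrative threshold, not the 0.2 used in the case study
frequent = {}
for size in range(1, len(onehot.columns) + 1):
    for itemset in combinations(onehot.columns, size):
        s = support(itemset)
        if s >= min_support:
            frequent[itemset] = s

for itemset, s in frequent.items():
    print(itemset, round(s, 2))
```

In this toy table the three-item set appears in only one of five transactions, so it falls below the threshold and is left out.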


Then we will generate association rules from the frequent itemsets, keeping only the rules whose confidence is at least the threshold of 0.6.

In [51]:
frequent_itemsets = apriori(new_df, min_support=0.2, use_colnames=True)

confidence_threshold = 0.6
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=confidence_threshold)
rules.drop(columns=['zhangs_metric'], inplace=True)
rules



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265
1,(Eggs),(Cheese),0.438095,0.501587,0.298413,0.681159,1.358008,0.07867,1.563203
2,(Cheese),(Meat),0.501587,0.47619,0.32381,0.64557,1.355696,0.084958,1.477891
3,(Meat),(Cheese),0.47619,0.501587,0.32381,0.68,1.355696,0.084958,1.55754
4,(Milk),(Cheese),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148
5,(Cheese),(Milk),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148
6,(Wine),(Cheese),0.438095,0.501587,0.269841,0.615942,1.227986,0.050098,1.297754
7,(Eggs),(Meat),0.438095,0.47619,0.266667,0.608696,1.278261,0.05805,1.338624
8,"(Cheese, Eggs)",(Meat),0.298413,0.47619,0.215873,0.723404,1.519149,0.073772,1.893773
9,"(Cheese, Meat)",(Eggs),0.32381,0.438095,0.215873,0.666667,1.521739,0.074014,1.685714
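The table contains (Bagel) → (Bread) but not the reverse rule. A short sketch, using the supports printed in the apriori output earlier, suggests why: confidence depends on the direction of the rule, and only one direction reaches the 0.6 threshold. The variable names are illustrative.

```python
# Supports copied from the apriori output shown earlier
support_bagel = 0.425397
support_bread = 0.504762
support_both = 0.279365

confidence_threshold = 0.6

# confidence(A -> B) = support(A and B) / support(A)
conf_bagel_to_bread = support_both / support_bagel
conf_bread_to_bagel = support_both / support_bread

for name, conf in [("Bagel -> Bread", conf_bagel_to_bread),
                   ("Bread -> Bagel", conf_bread_to_bagel)]:
    kept = conf >= confidence_threshold
    print(f"{name}: confidence={conf:.3f}, kept={kept}")
```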


Provide an explanation of __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__, __conviction__ and an interpretation of the case above (please use a text section).

Antecedent Support : 
Definition: Antecedent support indicates the proportion of transactions in which the item(s) in the antecedent (left-hand side of the rule) appear. It helps assess the popularity of the item(s) involved in the rule's "if" part.
Interpretation: For the rule (Bagel) → (Bread), the antecedent support for Bagel is 0.425, meaning that Bagel is bought in about 42.5% of all transactions.
Kaggle Insight: Higher antecedent support suggests the item is frequently purchased, providing a larger potential pool for finding related rules. However, low antecedent support might imply that the antecedent is less relevant for generating rules due to its infrequent occurrence in transactions.

Consequent Support :
Definition: Consequent support is the proportion of transactions that contain the consequent (right-hand side of the rule), showing how often the item(s) in the "then" part of the rule appear in the dataset.
Interpretation: For the rule (Bagel) → (Bread), the consequent support for Bread is 0.505, meaning Bread appears in 50.5% of all transactions.
Kaggle Insight: High consequent support indicates that the consequent item is popular and could be involved in many rules. If the consequent support is very high, it might indicate that the item is often bought on its own or with various other products, making it a common target for association rule mining.

Support :
Definition: Support of the itemset (antecedent and consequent together) is the proportion of transactions that contain both the antecedent and consequent. It measures how often the combination of items appears in the dataset.
Interpretation: For the rule (Bagel) → (Bread), the support for Bagel and Bread appearing together is 0.279, meaning about 28% of transactions contain both items.
Kaggle Insight: Higher support means the combination is more frequent, not that the relationship is stronger; strength is judged with confidence and lift. Frequent itemsets are of greater interest in rule generation because rules built on them apply to many transactions, whereas itemsets with low support may not yield useful or actionable rules.

Confidence :
Definition: Confidence is the probability that the consequent (right-hand side) will be purchased when the antecedent (left-hand side) is purchased. It shows the likelihood of the consequent occurring given the antecedent.
Interpretation: For the rule (Bagel) → (Bread), the confidence is 0.657, meaning that if a customer buys Bagel, there’s a 65.7% chance they will also buy Bread.
Kaggle Insight: A high confidence value indicates a strong association between the antecedent and consequent. In practice, confidence helps businesses determine which items are frequently purchased together, making it useful for promotional bundling, cross-selling, or store placement strategies.

Lift :
Definition: Lift measures how much more likely the consequent is to be purchased when the antecedent is bought, compared to when the two items are independent. It quantifies the strength of the relationship between the antecedent and consequent.
Interpretation: For the rule (Bagel) → (Bread), the lift is 1.30, meaning that Bread is about 30% more likely to be purchased when Bagel is purchased than would be expected if the two items were bought independently.
Kaggle Insight: A lift value greater than 1 indicates a positive correlation between the items, suggesting a strong association. Lift can be particularly useful for identifying product pairings that have a meaningful relationship beyond random co-occurrence.

Leverage :
Definition: Leverage quantifies the difference between the observed frequency of the antecedent and consequent appearing together and the expected frequency if the two items were independent. It shows how much more or less likely the items are to be purchased together than if there were no association.
Interpretation: For the rule (Bagel) → (Bread), the leverage is 0.065, meaning that Bagel and Bread appear together in about 6.5 percentage points more transactions than would be expected if the two were independent.
Kaggle Insight: Positive leverage indicates a stronger relationship between the antecedent and consequent, suggesting they are more likely to be bought together than by chance. Leverage helps identify significant associations and can be used to filter out weak rules.

Conviction :
Definition: Conviction compares how often the rule would be wrong if the antecedent and consequent were independent with how often it is actually wrong: (1 - consequent support) / (1 - confidence). A value of 1 means the items are independent, and larger values mean the consequent depends more strongly on the antecedent.
Interpretation: For the rule (Bagel) → (Bread), the conviction is 1.44, meaning that the rule would fail (Bagel bought without Bread) about 1.44 times as often if Bagel and Bread were independent as it does in the data.
Kaggle Insight: Conviction helps assess the strength of the association beyond just the likelihood of the consequent. Higher conviction values indicate that the antecedent strongly predicts the consequent, making the rule more actionable for decision-making.
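The definitions above can be checked with a few lines of arithmetic. The sketch below recomputes the metrics for (Bagel) → (Bread) from the three support values in the rules table, using the formulas given in the mlxtend user guide; the variable names are illustrative.

```python
# Values copied from row 0 of the rules table: (Bagel) -> (Bread)
antecedent_support = 0.425397   # share of transactions containing Bagel
consequent_support = 0.504762   # share of transactions containing Bread
support = 0.279365              # share containing both Bagel and Bread

# Probability of Bread given Bagel
confidence = support / antecedent_support

# Confidence relative to how common Bread is overall (1 means independent)
lift = confidence / consequent_support

# Observed joint support minus the joint support expected under independence
leverage = support - antecedent_support * consequent_support

# Expected failure rate under independence divided by the observed failure rate
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.3f}")
print(f"lift={lift:.3f}")
print(f"leverage={leverage:.3f}")
print(f"conviction={conviction:.3f}")
```

The results should agree with the corresponding columns of the rules table up to rounding.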




