Association Rules

Association rule analysis is widely used in the retail industry, where it is also known as market basket analysis. It is also used for recommendations; the personalized recommendations in applications such as Spotify, Netflix, and YouTube are often given as examples. One of the best-known examples of association rule analysis is the link between beer and diapers. When Walmart, a chain store in the United States, studied the shopping behavior of its customers, the study showed that diapers and beer are often bought together. The explanation usually given is that fathers are often tasked with shopping while mothers stay with the baby.

Apriori Algorithm

The Apriori algorithm, used for the first phase of association rule mining, is the most popular and classical algorithm for frequent itemset mining. It works on Boolean data, where each transaction either contains a product or does not, and the rules it leads to are therefore called Boolean association rules. The algorithm first finds the sets of products that occur together frequently, and then looks for strong relationships between these products and other products.
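
The level-wise search described above can be sketched in plain Python. This is a minimal illustration, not the mlxtend implementation: the function name apriori_sketch and the toy baskets are made up for this example. It builds candidates of size k from the frequent itemsets of size k-1 and keeps only those that reach the minimum support.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= basket for basket in baskets) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items
    current = keep_frequent({frozenset([item]) for basket in baskets for item in basket})
    frequent = dict(current)

    k = 2
    while current:
        # Join: merge frequent (k-1)-itemsets into candidates of size k
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Made-up toy baskets, for illustration only
toy = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]
result = apriori_sketch(toy, min_support=0.6)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With these toy baskets every single item and every pair reaches the 60% threshold, while the triple A, B, C appears in only 2 of 5 baskets and is dropped.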

The importance of an association rule can be assessed with three measures that quantify the strength of the rule, namely:

    Support
    Confidence
    Lift



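Let X and Y represent sets of products (itemsets) in the market, and let N represent the total number of transactions (baskets).

    Support(X) = (number of transactions containing X) / N
    Confidence(X -> Y) = Support(X, Y) / Support(X)
    Lift(X -> Y) = Support(X, Y) / (Support(X) * Support(Y))

Support : The fraction of all transactions that contain the itemset, i.e. the probability that it occurs.

Confidence : The conditional probability that a transaction contains Y given that it contains X.

Lift : The support of X and Y occurring together, divided by the product of the supports of the antecedent and the consequent, which is the support expected if they were independent of each other. A lift above 1 means the items occur together more often than expected under independence; a lift below 1 means less often.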
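
As a small worked example of the three measures, the sketch below computes them by hand. The four baskets and the helper names support, confidence, and lift are made up for illustration and are not part of the grocery data set used later.

```python
# Made-up baskets, for illustration only
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
]
N = len(baskets)  # total number of transactions


def support(*items):
    # Fraction of baskets that contain all the given items
    wanted = set(items)
    return sum(wanted <= basket for basket in baskets) / N


def confidence(x, y):
    # Share of the baskets containing x that also contain y
    return support(x, y) / support(x)


def lift(x, y):
    # Confidence relative to how common y is overall
    return confidence(x, y) / support(y)


print("support(bread)          =", support("bread"))
print("support(bread, milk)    =", support("bread", "milk"))
print("confidence(bread->milk) =", round(confidence("bread", "milk"), 4))
print("lift(bread->milk)       =", round(lift("bread", "milk"), 4))
```

Here bread appears in 3 of 4 baskets and bread with milk in 2 of 4, so the confidence of bread -> milk is 2/3. Because milk alone is in 3 of 4 baskets, the lift is below 1.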


# Import the Libraries and Dataset


In [1]:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules

In [2]:
df = pd.read_csv(r"C:\Users\hidayat\Desktop\Association_project\GroceryStoreDataSet.csv", names = ['products'], sep = ',')
df.head()

   products
0  MILK,BREAD,BISCUIT
1  BREAD,MILK,BISCUIT,CORNFLAKES
2  BREAD,TEA,BOURNVITA
3  JAM,MAGGI,BREAD,MILK
4  MAGGI,TEA,BISCUIT


Let's examine the shape of the data set:


In [3]:
df.shape

(20, 1)

# Split the Products into a List Called 'data'


In [4]:
data = list(df["products"].apply(lambda x:x.split(",") ))
data

[['MILK', 'BREAD', 'BISCUIT'],
 ['BREAD', 'MILK', 'BISCUIT', 'CORNFLAKES'],
 ['BREAD', 'TEA', 'BOURNVITA'],
 ['JAM', 'MAGGI', 'BREAD', 'MILK'],
 ['MAGGI', 'TEA', 'BISCUIT'],
 ['BREAD', 'TEA', 'BOURNVITA'],
 ['MAGGI', 'TEA', 'CORNFLAKES'],
 ['MAGGI', 'BREAD', 'TEA', 'BISCUIT'],
 ['JAM', 'MAGGI', 'BREAD', 'TEA'],
 ['BREAD', 'MILK'],
 ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'],
 ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'],
 ['COFFEE', 'SUGER', 'BOURNVITA'],
 ['BREAD', 'COFFEE', 'COCK'],
 ['BREAD', 'SUGER', 'BISCUIT'],
 ['COFFEE', 'SUGER', 'CORNFLAKES'],
 ['BREAD', 'SUGER', 'BOURNVITA'],
 ['BREAD', 'COFFEE', 'SUGER'],
 ['BREAD', 'COFFEE', 'SUGER'],
 ['TEA', 'MILK', 'COFFEE', 'CORNFLAKES']]

Apriori Algorithm and One-Hot Encoding

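The apriori function expects a one-hot encoded table in which each column is a product and each value is True/False (or 1/0).
Using TransactionEncoder, we convert the list of transactions into such a one-hot encoded Boolean table.
A product that the customer bought in a transaction is marked True (1), and a product that was not bought is marked False (0).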


In [5]:
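# Transform the list of transactions with one-hot encoding
from mlxtend.preprocessing import TransactionEncoder
a = TransactionEncoder()
# Boolean array: one row per transaction, one column per product
a_data = a.fit(data).transform(data)
df = pd.DataFrame(a_data, columns=a.columns_)
# Display False as 0; True values stay as they are (see the output below)
df = df.replace(False, 0)
df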

   BISCUIT  BOURNVITA  BREAD  COCK  COFFEE  CORNFLAKES  JAM   MAGGI  MILK  SUGER  TEA
0  True     0          True   0     0       0           0     0      True  0      0
1  True     0          True   0     0       True        0     0      True  0      0
2  0        True       True   0     0       0           0     0      0     0      True
3  0        0          True   0     0       0           True  True   True  0      0
4  True     0          0      0     0       0           0     True   0     0      True
5  0        True       True   0     0       0           0     0      0     0      True
6  0        0          0      0     0       True        0     True   0     0      True
7  True     0          True   0     0       0           0     True   0     0      True
8  0        0          True   0     0       0           True  True   0     0      True
9  0        0          True   0     0       0           0     0      True  0      0
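
To see what the encoding step produces, the same kind of table can be built by hand. The sketch below uses only the first three transactions from the list above and plain Python; it illustrates the idea and is not how TransactionEncoder is implemented internally.

```python
# First three transactions from the data set
transactions = [
    ["MILK", "BREAD", "BISCUIT"],
    ["BREAD", "MILK", "BISCUIT", "CORNFLAKES"],
    ["BREAD", "TEA", "BOURNVITA"],
]

# One column per distinct product, in alphabetical order
columns = sorted({item for t in transactions for item in t})

# One row per transaction: True if the product is in the basket
rows = [[item in t for item in columns] for t in transactions]

print(columns)
for row in rows:
    print(row)
```

Each row has one Boolean per product, so the first basket is True for BISCUIT, BREAD, and MILK and False for everything else.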


Applying Apriori and Results

The next step is to create the Apriori model. The apriori function in the mlxtend package has several parameters that we can change.
For this model I use the minimum support parameter.
I set min_support to a threshold of 20%, so only itemsets that appear in at least 20% of the transactions are kept, and print the result.


In [6]:
# Set a minimum support threshold and compute the support of each frequent itemset.
df = apriori(df, min_support = 0.2, use_colnames = True, verbose = 1)
df


Processing 72 combinations | Sampling itemset size 2
Processing 42 combinations | Sampling itemset size 3




   support  itemsets
0  0.35     (BISCUIT)
1  0.2      (BOURNVITA)
2  0.65     (BREAD)
3  0.4      (COFFEE)
4  0.3      (CORNFLAKES)
5  0.25     (MAGGI)
6  0.25     (MILK)
7  0.3      (SUGER)
8  0.35     (TEA)
9  0.2      (BREAD, BISCUIT)
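
The support values above can be checked by hand. The sketch below counts how many of the 20 transactions contain each product and applies the same 20% threshold. It uses only the standard library, and the variable names are chosen for this example.

```python
from collections import Counter

# The 20 transactions of the data set, as printed earlier
raw = [
    "MILK,BREAD,BISCUIT",
    "BREAD,MILK,BISCUIT,CORNFLAKES",
    "BREAD,TEA,BOURNVITA",
    "JAM,MAGGI,BREAD,MILK",
    "MAGGI,TEA,BISCUIT",
    "BREAD,TEA,BOURNVITA",
    "MAGGI,TEA,CORNFLAKES",
    "MAGGI,BREAD,TEA,BISCUIT",
    "JAM,MAGGI,BREAD,TEA",
    "BREAD,MILK",
    "COFFEE,COCK,BISCUIT,CORNFLAKES",
    "COFFEE,COCK,BISCUIT,CORNFLAKES",
    "COFFEE,SUGER,BOURNVITA",
    "BREAD,COFFEE,COCK",
    "BREAD,SUGER,BISCUIT",
    "COFFEE,SUGER,CORNFLAKES",
    "BREAD,SUGER,BOURNVITA",
    "BREAD,COFFEE,SUGER",
    "BREAD,COFFEE,SUGER",
    "TEA,MILK,COFFEE,CORNFLAKES",
]
transactions = [set(line.split(",")) for line in raw]
n = len(transactions)

# Count the transactions containing each product, then divide by n
counts = Counter(item for t in transactions for item in t)
support = {item: count / n for item, count in counts.items()}

# Keep only the products that reach the 20% threshold
frequent = {item: s for item, s in support.items() if s >= 0.2}
for item in sorted(frequent):
    print(item, frequent[item])

# Support of a pair: fraction of transactions containing both products
pair_support = sum({"BREAD", "BISCUIT"} <= t for t in transactions) / n
print("BREAD + BISCUIT", pair_support)
```

JAM (2 of 20 transactions) and COCK (3 of 20) fall below the threshold, which is why they do not appear in the table of frequent itemsets.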


I chose a minimum confidence of 50%. In other words, a rule X -> Y is kept only if at least 50% of the transactions that contain product X also contain product Y.


In [7]:
# View the rules and their metrics using the association_rules function.
df_ar = association_rules(df, metric = "confidence", min_threshold = 0.5)
df_ar

   antecedents   consequents   antecedent support  consequent support  support  confidence  lift      leverage  conviction  zhangs_metric
0  (BISCUIT)     (BREAD)       0.35                0.65                0.2      0.571429    0.879121  -0.0275   0.816667    -0.174603
1  (MILK)        (BREAD)       0.25                0.65                0.2      0.8         1.230769  0.0375    1.75        0.25
2  (SUGER)       (BREAD)       0.3                 0.65                0.2      0.666667    1.025641  0.005     1.05        0.035714
3  (TEA)         (BREAD)       0.35                0.65                0.2      0.571429    0.879121  -0.0275   0.816667    -0.174603
4  (COFFEE)      (CORNFLAKES)  0.4                 0.3                 0.2      0.5         1.666667  0.08      1.4         0.666667
5  (CORNFLAKES)  (COFFEE)      0.3                 0.4                 0.2      0.666667    1.666667  0.08      1.8         0.571429
6  (COFFEE)      (SUGER)       0.4                 0.3                 0.2      0.5         1.666667  0.08      1.4         0.666667
7  (SUGER)       (COFFEE)      0.3                 0.4                 0.2      0.666667    1.666667  0.08      1.8         0.571429
8  (TEA)         (MAGGI)       0.35                0.25                0.2      0.571429    2.285714  0.1125    1.75        0.865385
9  (MAGGI)       (TEA)         0.25                0.35                0.2      0.8         2.285714  0.1125    3.25        0.75
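
Rows 8 and 9 show that confidence depends on the direction of the rule, while lift does not. The short sketch below recomputes both from the support values in those rows; the variable names are chosen for this example.

```python
# Support values taken from rows 8 and 9 of the rules table
supp_tea = 0.35
supp_maggi = 0.25
supp_both = 0.20

# Confidence divides by the support of the antecedent, so it is directional
conf_tea_to_maggi = supp_both / supp_tea
conf_maggi_to_tea = supp_both / supp_maggi

# Lift divides by both supports, so it is the same in either direction
lift_pair = supp_both / (supp_tea * supp_maggi)

print("TEA -> MAGGI confidence:", round(conf_tea_to_maggi, 6))
print("MAGGI -> TEA confidence:", round(conf_maggi_to_tea, 6))
print("lift (either direction):", round(lift_pair, 6))

# Both rules pass the 50% minimum confidence used above
min_confidence = 0.5
kept = [c >= min_confidence for c in (conf_tea_to_maggi, conf_maggi_to_tea)]
print(kept)
```

MAGGI is the rarer product, so a basket with MAGGI is more likely to also contain TEA than the other way round.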


antecedents: (BISCUIT)

consequents: (BREAD)

antecedent support: 0.35

consequent support: 0.65 

support: 0.2

confidence: 0.571429

lift: 0.879121

leverage: -0.0275

conviction: 0.816667

zhangs_metric: -0.174603

This rule describes the association between purchasing "BISCUIT" and "BREAD" together. As the values below show, the association is in fact weak and slightly negative.

    Antecedents and Consequents:
        Antecedents are the items on the "if" side of the rule. No purchase order is implied.
        Consequents are the items on the "then" side of the rule.
        In this case, the rule reads: if someone buys "BISCUIT" (antecedent), then they also buy "BREAD" (consequent). The metrics below show how reliable this is.

    Support:
        Support for "BISCUIT" is 0.35, meaning that 35% of all transactions include the purchase of "BISCUIT."
        Support for "BREAD" is 0.65, indicating that 65% of all transactions include the purchase of "BREAD."
        Support for both together is 0.2, meaning that 20% of transactions include both "BISCUIT" and "BREAD."

    Confidence:
        Confidence is 0.571429, implying that 57.14% of customers who buy "BISCUIT" also buy "BREAD."

    Lift:
        Lift is 0.879121, which is less than 1. This suggests that "BISCUIT" and "BREAD" appear together slightly less often than would be expected if they were independent, i.e. a weak negative association.

    Leverage:
        Leverage is -0.0275, indicating that the co-occurrence of "BISCUIT" and "BREAD" is slightly less than what would be expected if they were independent.

    Conviction:
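        Conviction is 0.816667. Conviction compares how often the rule would be wrong if "BISCUIT" and "BREAD" were independent with how often it is actually wrong: (1 - 0.65) / (1 - 0.571429) = 0.816667. A value below 1 means the rule fails more often than it would under independence.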

    Zhang's Metric:
        Zhang's Metric is -0.174603. The metric ranges from -1 to 1, so this indicates a weak negative association between "BISCUIT" and "BREAD."

In practical terms, this association rule says that when a customer purchases "BISCUIT," there's a 57.14% chance that they will also purchase "BREAD." However, the lift and conviction values below 1, together with the negative leverage and Zhang's metric, show that the two products are bought together slightly less often than if they were independent items. The 57.14% confidence is lower than the 65% of all transactions that contain "BREAD" anyway.
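
The metrics of this rule can be recomputed from its three support values. The sketch below uses the commonly cited formulas for each metric, which reproduce the values in the table for this rule; the function name rule_metrics is made up for this example, and edge cases such as a zero denominator in Zhang's metric are not handled.

```python
def rule_metrics(supp_x, supp_y, supp_xy):
    """Metrics of the rule X -> Y from the supports of X, Y, and X with Y."""
    confidence = supp_xy / supp_x
    lift = confidence / supp_y
    # Difference between observed and expected joint support
    leverage = supp_xy - supp_x * supp_y
    # Ratio of expected to observed failure rate; infinite if the rule never fails
    conviction = float("inf") if confidence == 1 else (1 - supp_y) / (1 - confidence)
    # Zhang's metric: leverage scaled into the range -1 to 1
    zhang = leverage / max(supp_xy * (1 - supp_x), supp_x * (supp_y - supp_xy))
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
        "zhangs_metric": zhang,
    }


# BISCUIT -> BREAD: supports taken from the rule above
metrics = rule_metrics(supp_x=0.35, supp_y=0.65, supp_xy=0.20)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Passing the supports of another rule, for example 0.25, 0.65, and 0.2 for MILK -> BREAD, gives a lift above 1 and positive leverage.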


