Association Rule Learning is rule-based learning for identifying the association between different variables in a database.

Apriori algorithm finds the most frequent itemsets or elements in a transaction database and identifies association rules between the items just like the above-mentioned example.



To construct association rules between elements or items, the algorithm considers 3 important factors which are, support, confidence and lift

The support of item I is defined as the ratio between the number of transactions containing the item I by the total number of transactions 

Confidence is measured by the proportion of transactions with item I1, in which item I2 also appears. The confidence between two items I1 and I2, in a transaction is defined as the total number of transactions containing both items I1 and I2 divided by the total number of transactions containing I1.

Lift is the ratio between the confidence and support.

We will use the mlxtend library to implement Apriori.

This library wants the dataset in the following format.

    transaction_id    Cake   Ballon   Caps
    1                  0       1       1
    2                  1       0       0
    3                  1       1       1
    4                  0       0       0
   
Where 1 indicates that the item was bought in the particular transaction.

Now we will import our libraries and the dataset.

In [1]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [2]:
dp = pd.read_csv('https://raw.githubusercontent.com/harshit5674/DATA-MINING/main/datasets/apriori1.csv', encoding="ISO-8859-1")
dp=dp.drop(columns=['Unnamed: 0'])
dp.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536527,22809,SET OF 6 T-LIGHTS SANTA,6,12/1/2010 13:04,2.95,12662.0,Germany
1,536527,84347,ROTATING SILVER ANGELS T-LIGHT HLDR,6,12/1/2010 13:04,2.55,12662.0,Germany
2,536527,84945,MULTI COLOUR SILVER T-LIGHT HOLDER,12,12/1/2010 13:04,0.85,12662.0,Germany
3,536527,22242,5 HOOK HANGER MAGIC TOADSTOOL,12,12/1/2010 13:04,1.65,12662.0,Germany
4,536527,22244,3 HOOK HANGER MAGIC GARDEN,12,12/1/2010 13:04,1.95,12662.0,Germany


The above dataset contains objects bought(description) and transaction ID(InvoiceNo)

In [3]:
dp['Description'] = dp['Description'].str.strip()

The above statement is to remove the spaces in the description.

In [4]:
#some of transaction quantity is negative which can not be possible remove them.
d = dp[dp.Quantity >0]

In [5]:
table = pd.pivot_table(data=dp,index='InvoiceNo',columns='Description',values='Quantity', aggfunc='sum',fill_value=0)

The above statement converts our dataset into the format described above.

In [6]:
table.head()

Description,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 IVORY ROSE PEG PLACE SETTINGS,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE SKULLS,...,YULETIDE IMAGES GIFT WRAP SET,ZINC HEART T-LIGHT HOLDER,ZINC STAR T-LIGHT HOLDER,ZINC BOX SIGN HOME,ZINC FOLKART SLEIGH BELLS,ZINC HEART LATTICE T-LIGHT HOLDER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536527,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536840,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536861,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536967,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536983,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


You will notice that this would contain quantities of each item in each transaction, according to the above format we just need binary 0 or 1, so we will make that change.

In [7]:
def convert_into_binary(x):
    if x > 0:
        return 1
    else:
        return 0

In [8]:
table = table.applymap(convert_into_binary)

In [9]:
# remove postage item as it is just a seal which almost all transaction contains. 
table.drop(columns=['POSTAGE'],inplace=True)

In [10]:
# call apriori function and pass minimum support here we are passing 4%. 
# means 4 times in total number of transaction the item should be present.
frequent_itemsets = apriori(table, min_support=0.04, use_colnames=True)



In [11]:
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.077944,(6 RIBBONS RUSTIC CHARM)
1,0.053068,(ALARM CLOCK BAKELIKE PINK)
2,0.049751,(CHARLOTTE BAG APPLES DESIGN)
3,0.046434,(COFFEE MUG APPLES DESIGN)
4,0.048093,(FAWN BLUE HOT WATER BOTTLE)
5,0.054726,(GUMBALL COAT RACK)
6,0.043118,(IVORY KITCHEN SCALES)
7,0.048093,(JAM JAR WITH PINK LID)
8,0.069652,(JAM MAKING SET PRINTED)
9,0.046434,(JUMBO BAG APPLES)


First step in generation of association rules is to get all the frequent itemsets on which binary partitions can be performed to get the antecedent and the consequent.

Frequent itemsets are the ones which occur at least a minimum number of times in the transactions. Technically, these are the itemsets for which support value (fraction of transactions containing the itemset) is above a minimum threshold — min_support. We have kept min_support=0.09 in our above notebook.

In [12]:
# We would apply association rules on frequent itemset. 
# here we are setting based on lift and keeping minimum lift as 1

rules_mlxtend = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules_mlxtend.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(PLASTERS IN TIN CIRCUS PARADE),(PLASTERS IN TIN WOODLAND ANIMALS),0.087894,0.104478,0.05141,0.584906,5.598383,0.042227,2.157395
1,(PLASTERS IN TIN WOODLAND ANIMALS),(PLASTERS IN TIN CIRCUS PARADE),0.104478,0.087894,0.05141,0.492063,5.598383,0.042227,1.795709
2,(ROUND SNACK BOXES SET OF4 WOODLAND),(PLASTERS IN TIN CIRCUS PARADE),0.185738,0.087894,0.043118,0.232143,2.641173,0.026793,1.187859
3,(PLASTERS IN TIN CIRCUS PARADE),(ROUND SNACK BOXES SET OF4 WOODLAND),0.087894,0.185738,0.043118,0.490566,2.641173,0.026793,1.598366
4,(PLASTERS IN TIN SPACEBOY),(PLASTERS IN TIN WOODLAND ANIMALS),0.08126,0.104478,0.046434,0.571429,5.469388,0.037945,2.089552


From a list of all possible candidate rules, we aim to identify rules that fall above a minimum threshold level (like min_confidence or min_lift).

In [13]:
rules_mlxtend[ (rules_mlxtend['lift'] >= 3) & (rules_mlxtend['confidence'] >= 0.6) ].head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
9,(RED RETROSPOT CHARLOTTE BAG),(WOODLAND CHARLOTTE BAG),0.053068,0.096186,0.044776,0.84375,8.772091,0.039672,5.784411
10,(ROUND SNACK BOXES SET OF 4 FRUITS),(ROUND SNACK BOXES SET OF4 WOODLAND),0.119403,0.185738,0.099502,0.833333,4.486607,0.077325,4.885572
13,(SPACEBOY LUNCH BOX),(ROUND SNACK BOXES SET OF4 WOODLAND),0.077944,0.185738,0.053068,0.680851,3.665653,0.038591,2.551354


antecedents and consequents -> The IF component of an association rule is known as the antecedent. The THEN component is known as the consequent. The antecedent and the consequent are disjoint; they have no items in common.

antecedent support -> This measure gives an idea of how frequent antecedent is in all the transactions.

consequent support -> This measure gives an idea of how frequent consequent is in all the transactions.
