**Fundamental concepts in association rules**

Support(X) (defined for an itemset, not a rule) is the fraction of all transactions that contain X.

Support(X→Y) is the probability that X and Y occur together in a transaction, measured over all transactions.

Confidence(X→Y) is the probability that Y occurs given that X is present.

Lift(X→Y) is the ratio of Confidence(X→Y) to Support(Y): how much more likely Y is to be bought when X is present than it is overall, which takes the popularity of Y into account.

Conviction(X→Y) is a measure of implication. A value > 1 indicates that Y depends on X; the larger the value, the stronger the dependence.
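The definitions above can be written compactly as formulas. This is a sketch using the standard textbook forms; the symbols $N$ (total number of transactions) and $t$ (a single transaction) are introduced here for illustration and do not appear in the original text.

```latex
\begin{aligned}
\mathrm{support}(X \to Y) &= \frac{\left|\{\, t : X \cup Y \subseteq t \,\}\right|}{N} \\[4pt]
\mathrm{confidence}(X \to Y) &= \frac{\mathrm{support}(X \to Y)}{\mathrm{support}(X)} \\[4pt]
\mathrm{lift}(X \to Y) &= \frac{\mathrm{confidence}(X \to Y)}{\mathrm{support}(Y)} \\[4pt]
\mathrm{conviction}(X \to Y) &= \frac{1 - \mathrm{support}(Y)}{1 - \mathrm{confidence}(X \to Y)}
\end{aligned}
```

Written this way, lift and conviction both compare the rule's confidence against the baseline frequency of Y, which is why both equal 1 when X and Y are independent.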

So association rule mining is essentially applied probability and statistics. It is a simple but useful decision-making tool with a wide range of uses, such as market basket analysis, customer relationship management, recommender systems, marketing activities, network traffic analysis, intrusion detection (fraud and malware detection), and bioinformatics.

In [1]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [2]:
data = {'ID':[1,2,3,4,5,6],
       'Onion':[1,0,0,1,1,1],
       'Potato':[1,1,0,1,1,1],
       'Burger':[1,1,0,0,1,1],
       'Milk':[0,1,1,1,0,1],
       'Beer':[0,0,1,0,1,0]}

In [3]:
df = pd.DataFrame(data)
df = df[['ID', 'Onion', 'Potato', 'Burger', 'Milk', 'Beer' ]]
df

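| | ID | Onion | Potato | Burger | Milk | Beer |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 0 | 0 |
| 1 | 2 | 0 | 1 | 1 | 1 | 0 |
| 2 | 3 | 0 | 0 | 0 | 1 | 1 |
| 3 | 4 | 1 | 1 | 0 | 1 | 0 |
| 4 | 5 | 1 | 1 | 1 | 0 | 1 |
| 5 | 6 | 1 | 1 | 1 | 1 | 0 |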


Then, we can generate frequent itemsets based on support.
Here we need to set the minimum support to a value in the range [0, 1]. Using min_support = 0.5 means we keep only itemsets that appear in at least 50% of the transactions.

apriori(df, min_support=0.5, use_colnames=False, max_len=None)

In [4]:
frequent_itemsets = apriori(df[['Onion', 'Potato', 'Burger', 'Milk', 'Beer' ]], 
                            min_support=0.50, use_colnames=True)

In [5]:
frequent_itemsets

| | support | itemsets |
|---|---|---|
| 0 | 0.666667 | (Onion) |
| 1 | 0.833333 | (Potato) |
| 2 | 0.666667 | (Burger) |
| 3 | 0.666667 | (Milk) |
| 4 | 0.666667 | (Potato, Onion) |
| 5 | 0.5 | (Burger, Onion) |
| 6 | 0.666667 | (Potato, Burger) |
| 7 | 0.5 | (Milk, Potato) |
| 8 | 0.5 | (Potato, Burger, Onion) |
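The support values can be checked by hand on a dataset this small. The following is a minimal sketch in plain Python, not the Apriori algorithm itself: it enumerates every candidate itemset by brute force, which is only practical for toy data. The helper names `support` and `frequent_itemsets_bruteforce` are hypothetical and not part of mlxtend; the baskets are the six rows of the example, rewritten as sets.

```python
from itertools import combinations

# The six toy baskets from the example, rewritten as sets of item names.
transactions = [
    {"Onion", "Potato", "Burger"},
    {"Potato", "Burger", "Milk"},
    {"Milk", "Beer"},
    {"Onion", "Potato", "Milk"},
    {"Onion", "Potato", "Burger", "Beer"},
    {"Onion", "Potato", "Burger", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def frequent_itemsets_bruteforce(transactions, min_support):
    """Keep every itemset whose support is >= min_support.

    Exhaustive enumeration (hypothetical helper for illustration);
    it does none of the candidate pruning that Apriori performs.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result


frequent = frequent_itemsets_bruteforce(transactions, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 6))
```

Beer appears in only two of the six baskets, so it falls below the threshold and drops out, along with every itemset that contains it.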


**Final step: generate the rules with their corresponding support, confidence, and lift (plus leverage and conviction):
association_rules(df, metric='confidence', min_threshold=0.8)**

Here, df is the frequent_itemsets DataFrame returned by apriori.

metric is the measure used to decide whether a rule is kept. You can set it to one of the five metrics: support, confidence, lift, leverage, or conviction.

min_threshold is the minimum value of the chosen metric that a rule must reach to be kept.

In [6]:
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)

In [7]:
rules

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (Potato) | (Onion) | 0.833333 | 0.666667 | 0.666667 | 0.8 | 1.2 | 0.111111 | 1.666667 |
| 1 | (Onion) | (Potato) | 0.666667 | 0.833333 | 0.666667 | 1.0 | 1.2 | 0.111111 | inf |
| 2 | (Burger) | (Onion) | 0.666667 | 0.666667 | 0.5 | 0.75 | 1.125 | 0.055556 | 1.333333 |
| 3 | (Onion) | (Burger) | 0.666667 | 0.666667 | 0.5 | 0.75 | 1.125 | 0.055556 | 1.333333 |
| 4 | (Potato) | (Burger) | 0.833333 | 0.666667 | 0.666667 | 0.8 | 1.2 | 0.111111 | 1.666667 |
| 5 | (Burger) | (Potato) | 0.666667 | 0.833333 | 0.666667 | 1.0 | 1.2 | 0.111111 | inf |
| 6 | (Potato, Burger) | (Onion) | 0.666667 | 0.666667 | 0.5 | 0.75 | 1.125 | 0.055556 | 1.333333 |
| 7 | (Potato, Onion) | (Burger) | 0.666667 | 0.666667 | 0.5 | 0.75 | 1.125 | 0.055556 | 1.333333 |
| 8 | (Burger, Onion) | (Potato) | 0.5 | 0.833333 | 0.5 | 1.0 | 1.2 | 0.083333 | inf |
| 9 | (Potato) | (Burger, Onion) | 0.833333 | 0.5 | 0.5 | 0.6 | 1.2 | 0.083333 | 1.25 |
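To see where the columns of the rules table come from, the metrics can be recomputed directly from the baskets. The sketch below is illustrative only: `support` and `rule_metrics` are hypothetical helper names, not mlxtend functions, and the formulas are the standard textbook definitions, assumed here to match what `association_rules` reports. Exact fractions are used internally so that a confidence of exactly 1 is detected reliably.

```python
import math
from fractions import Fraction

# The six toy baskets from the example, rewritten as sets of item names.
transactions = [
    {"Onion", "Potato", "Burger"},
    {"Potato", "Burger", "Milk"},
    {"Milk", "Beer"},
    {"Onion", "Potato", "Milk"},
    {"Onion", "Potato", "Burger", "Beer"},
    {"Onion", "Potato", "Burger", "Milk"},
]


def support(itemset, transactions):
    """Exact fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return Fraction(hits, len(transactions))


def rule_metrics(antecedent, consequent, transactions):
    """Metrics for the rule antecedent -> consequent.

    Hypothetical helper; assumes the antecedent occurs at least once,
    otherwise confidence would divide by zero.
    """
    sup_a = support(antecedent, transactions)
    sup_c = support(consequent, transactions)
    sup_ac = support(set(antecedent) | set(consequent), transactions)
    confidence = sup_ac / sup_a
    lift = confidence / sup_c
    leverage = sup_ac - sup_a * sup_c
    # With perfect confidence the denominator 1 - confidence is 0,
    # so conviction is reported as infinity.
    if confidence == 1:
        conviction = math.inf
    else:
        conviction = float((1 - sup_c) / (1 - confidence))
    return {
        "support": float(sup_ac),
        "confidence": float(confidence),
        "lift": float(lift),
        "leverage": float(leverage),
        "conviction": conviction,
    }


print(rule_metrics({"Potato"}, {"Onion"}, transactions))
print(rule_metrics({"Onion"}, {"Potato"}, transactions))
```

The two calls correspond to rows 0 and 1 of the table: the same pair of items gives the same lift and leverage in both directions, while confidence and conviction depend on which side is the antecedent.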


In [8]:
rules[(rules['lift'] > 1.125) & (rules['confidence'] > 0.8)]

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 1 | (Onion) | (Potato) | 0.666667 | 0.833333 | 0.666667 | 1.0 | 1.2 | 0.111111 | inf |
| 5 | (Burger) | (Potato) | 0.666667 | 0.833333 | 0.666667 | 1.0 | 1.2 | 0.111111 | inf |
| 8 | (Burger, Onion) | (Potato) | 0.5 | 0.833333 | 0.5 | 1.0 | 1.2 | 0.083333 | inf |



Filtering on lift and confidence returns the rules whose items are relatively strongly associated in this data.

We can see that:

If Onion or Burger is in a user's basket, it is highly likely that the user will buy Potato as well.

If Burger and Onion are both in a user's basket, it is highly likely that the user will also buy Potato.

**Some notes on Lift, Conviction & Leverage:**

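**Lift(X→Y):** how much more likely Y is to be bought when X is present than it is overall, which takes the popularity of Y into account.

When Lift = 1, X and Y are independent: the presence of X has no effect on Y.

When Lift > 1, X and Y are positively associated: Y is more likely when X is present.

When Lift < 1, Y is less likely when X is present.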

**Conviction(X→Y):** a measure of implication. As with lift, the value is 1 if the items are independent.
A high conviction value means that the consequent depends strongly on the antecedent. For instance, with a perfect confidence score of 1, the denominator of the conviction formula, 1 - confidence, becomes 0 (1 - 1), and the conviction score is defined as 'inf'.

**Leverage(X→Y):** the difference between the observed frequency of X and Y appearing together and the frequency that would be expected if X and Y were independent. A leverage value of 0 indicates independence.
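As a worked example, leverage can be written as a formula and evaluated for the rule Burger → Onion from the toy data, where Burger and Onion each appear in 4 of the 6 baskets and appear together in 3. The formula is the standard definition, assumed here to be what the leverage column reports.

```latex
\begin{aligned}
\mathrm{leverage}(X \to Y) &= \mathrm{support}(X \to Y) - \mathrm{support}(X)\,\mathrm{support}(Y) \\[4pt]
\mathrm{leverage}(\text{Burger} \to \text{Onion}) &= \frac{3}{6} - \frac{4}{6}\cdot\frac{4}{6}
  = \frac{1}{2} - \frac{4}{9} = \frac{1}{18} \approx 0.0556
\end{aligned}
```

The positive result says the pair occurs together slightly more often than independence would predict, consistent with its lift of 1.125.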