**Fundamental concepts in association rules**

Support(X) (a property of an itemset, not of a rule) is the fraction of all transactions that contain X. (%)

Support(X→Y) is the probability that X and Y occur together in a transaction, over all data. (%)

Confidence(X→Y) is the probability that Y occurs given that X is present. (%)

Lift(X→Y) is the confidence of X→Y divided by the support of Y: how much more likely Y is to be bought when X is present than the popularity of Y alone would predict.

Conviction(X→Y) is a measure of implication. A value of 1 means X and Y are independent; the further the value rises above 1, the more strongly Y depends on X.

In essence, this is basic probability and statistics: a simple but useful decision-making tool with a wide range of uses, such as market basket analysis, customer relationship management, recommender systems, marketing activities, network traffic analysis, intrusion detection (fraud and malware detection) and bioinformatics.

In [1]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

**Some notes on Lift, Conviction & Leverage:**

**Lift(X→Y) :** the likelihood of Y being bought when X is present, taking into account the popularity of Y as well.

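When Lift = 1, X and Y are independent: the presence of X has no effect on Y.

When Lift > 1, X and Y appear together more often than independence would predict (a positive association). When Lift < 1, they appear together less often than expected.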

**Conviction(X→Y):** a measure of implication. Similar to lift, conviction is 1 if the items are independent.
A high conviction value means that the consequent depends strongly on the antecedent. With a perfect confidence score of 1, the denominator becomes 0 (1 - 1), and the conviction score is then defined as 'inf'.

**Leverage(X→Y):** the difference between the observed frequency of X and Y appearing together and the frequency that would be expected if X and Y were independent. A leverage value of 0 indicates independence.
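The verbal definitions above can be written compactly as formulas. This is a summary sketch: the symbol T for the set of all transactions is introduced here for illustration, and support(X ∪ Y) denotes the support of the combined itemset.

```latex
\begin{aligned}
\mathrm{support}(X) &= \frac{|\{t \in T : X \subseteq t\}|}{|T|} \\
\mathrm{support}(X \to Y) &= \mathrm{support}(X \cup Y) \\
\mathrm{confidence}(X \to Y) &= \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)} \\
\mathrm{lift}(X \to Y) &= \frac{\mathrm{confidence}(X \to Y)}{\mathrm{support}(Y)} \\
\mathrm{leverage}(X \to Y) &= \mathrm{support}(X \cup Y) - \mathrm{support}(X)\,\mathrm{support}(Y) \\
\mathrm{conviction}(X \to Y) &= \frac{1 - \mathrm{support}(Y)}{1 - \mathrm{confidence}(X \to Y)}
\end{aligned}
```

As a made-up numeric illustration: with support(X) = 0.5, support(Y) = 0.4 and support(X ∪ Y) = 0.3, confidence is 0.3 / 0.5 = 0.6, lift is 0.6 / 0.4 = 1.5, leverage is 0.3 - 0.5 × 0.4 = 0.1, and conviction is (1 - 0.4) / (1 - 0.6) = 1.5.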

In [2]:
retail_shopping_basket = {'ID':[1,2,3,4,5,6],
                         'Basket':[['Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'],
                                   ['Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'],
                                   ['Soda', 'Chips', 'Milk'],
                                   ['Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'],
                                   ['Soda', 'Coffee', 'Milk', 'Bread'],
                                   ['Beer', 'Chips']
                                  ]
                         }

In [3]:
retail = pd.DataFrame(retail_shopping_basket)
retail = retail[['ID', 'Basket']]
pd.options.display.max_colwidth=100

In [4]:
retail

Unnamed: 0,ID,Basket
0,1,"[Beer, Diaper, Pretzels, Chips, Aspirin]"
1,2,"[Diaper, Beer, Chips, Lotion, Juice, BabyFood, Milk]"
2,3,"[Soda, Chips, Milk]"
3,4,"[Soup, Beer, Diaper, Milk, IceCream]"
4,5,"[Soda, Coffee, Milk, Bread]"
5,6,"[Beer, Chips]"


**First, one-hot encode the basket. But how?**

In [5]:
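# Join each basket into one comma-separated string, then split it into 0/1 indicator columns.
# drop(columns=...) replaces the positional axis argument, which was removed in pandas 2.0.
retail = retail.drop(columns='Basket').join(retail.Basket.str.join(',').str.get_dummies(','))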

In [6]:
retail

Unnamed: 0,ID,Aspirin,BabyFood,Beer,Bread,Chips,Coffee,Diaper,IceCream,Juice,Lotion,Milk,Pretzels,Soda,Soup
0,1,1,0,1,0,1,0,1,0,0,0,0,1,0,0
1,2,0,1,1,0,1,0,1,0,1,1,1,0,0,0
2,3,0,0,0,0,1,0,0,0,0,0,1,0,1,0
3,4,0,0,1,0,0,0,1,1,0,0,1,0,0,1
4,5,0,0,0,1,0,1,0,0,0,0,1,0,1,0
5,6,0,0,1,0,1,0,0,0,0,0,0,0,0,0
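The join-and-split approach above relies on no item name containing a comma. As a possible alternative, here is a minimal sketch that uses `Series.explode` and `pd.get_dummies` on the same six baskets; the variable names `baskets`, `exploded` and `onehot` are chosen here for illustration and are not part of the original notebook.

```python
import pandas as pd

# The six baskets from the notebook, indexed by transaction ID.
baskets = pd.Series(
    [
        ['Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'],
        ['Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'],
        ['Soda', 'Chips', 'Milk'],
        ['Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'],
        ['Soda', 'Coffee', 'Milk', 'Bread'],
        ['Beer', 'Chips'],
    ],
    index=pd.Index([1, 2, 3, 4, 5, 6], name='ID'),
    name='Basket',
)

# One row per (transaction, item); the transaction ID is repeated in the index.
exploded = baskets.explode()

# One indicator column per item, then collapse back to one row per transaction.
onehot = pd.get_dummies(exploded).groupby(level=0).max().astype(bool)

print(onehot.astype(int))
```

The result is a boolean table with one row per transaction, so no string separator is involved at any point.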


In [7]:
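# Drop the ID column by keyword (the positional axis argument was removed in pandas 2.0)
# and cast the 0/1 flags to bool before mining.
frequent_itemsets_2 = apriori(retail.drop(columns='ID').astype(bool), min_support=0.01, use_colnames=True)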

In [8]:
frequent_itemsets_2.head()

Unnamed: 0,support,itemsets
0,0.166667,(Aspirin)
1,0.166667,(BabyFood)
2,0.666667,(Beer)
3,0.166667,(Bread)
4,0.666667,(Chips)


In [9]:
frequent_itemsets_2.sort_values("support", axis = 0, ascending = False)

Unnamed: 0,support,itemsets
10,0.666667,(Milk)
2,0.666667,(Beer)
4,0.666667,(Chips)
6,0.500000,(Diaper)
24,0.500000,"(Chips, Beer)"
25,0.500000,"(Beer, Diaper)"
38,0.333333,"(Chips, Milk)"
85,0.333333,"(Milk, Beer, Diaper)"
77,0.333333,"(Chips, Beer, Diaper)"
12,0.333333,(Soda)
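To show where these support values come from, here is a simplified level-wise sketch of the Apriori idea in plain Python. It is an illustration under stated assumptions, not mlxtend's implementation: the function name `apriori_sketch` is hypothetical, and the threshold of 0.5 is chosen only to keep the output short.

```python
from itertools import combinations

transactions = [
    {'Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'},
    {'Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'},
    {'Soda', 'Chips', 'Milk'},
    {'Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'},
    {'Soda', 'Coffee', 'Milk', 'Bread'},
    {'Beer', 'Chips'},
]

def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item.
    current = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    frequent = {}
    k = 1
    while current:
        # Count support for each candidate and keep the frequent ones.
        level = {}
        for candidate in current:
            s = sum(candidate <= t for t in transactions) / n
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-item subset of a candidate must itself be frequent.
        current = [
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return frequent

frequent = apriori_sketch(transactions, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(s, 6))
```

With a threshold of 0.5 the sketch keeps the four single items Milk, Beer, Chips and Diaper plus the pairs {Beer, Chips} and {Beer, Diaper}, which are the first six rows of the sorted table above.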


Judging by support alone, {Beer, Chips} and {Beer, Diaper} are the two frequent multi-item baskets of interest, each with a support of 0.5.

But which one is more correlated than the other?
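One way to answer this before turning to the library is to compute the rule metrics by hand for both pairs. The sketch below uses only the standard library; the helper names `support` and `rule_metrics` are introduced here for illustration.

```python
baskets = [
    {'Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'},
    {'Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'},
    {'Soda', 'Chips', 'Milk'},
    {'Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'},
    {'Soda', 'Coffee', 'Milk', 'Bread'},
    {'Beer', 'Chips'},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= b for b in baskets) / len(baskets)

def rule_metrics(x, y):
    """Confidence, lift, leverage and conviction for the rule x -> y."""
    s_x, s_y = support(x), support(y)
    s_xy = support(set(x) | set(y))
    confidence = s_xy / s_x
    lift = confidence / s_y
    leverage = s_xy - s_x * s_y
    # Conviction is defined as infinity when confidence is exactly 1.
    conviction = float('inf') if confidence == 1 else (1 - s_y) / (1 - confidence)
    return {'support': s_xy, 'confidence': confidence, 'lift': lift,
            'leverage': leverage, 'conviction': conviction}

beer_chips = rule_metrics({'Beer'}, {'Chips'})
beer_diaper = rule_metrics({'Beer'}, {'Diaper'})
print('Beer -> Chips ', beer_chips)
print('Beer -> Diaper', beer_diaper)
```

Both rules have the same support and confidence, so only lift, leverage and conviction separate them, and all three are higher for Beer → Diaper.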

In [10]:
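# metric='lift' with the default min_threshold (0.8) keeps rules whose lift is at least 0.8
assoc = association_rules(frequent_itemsets_2, metric='lift')
# Record how many items sit on each side of the rule
assoc["antecedents_len"] = assoc["antecedents"].apply(len)
assoc["consequents_len"] = assoc["consequents"].apply(len)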

assoc.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedents_len,consequents_len
0,(Beer),(Aspirin),0.666667,0.166667,0.166667,0.25,1.5,0.055556,1.111111,1,1
1,(Aspirin),(Beer),0.166667,0.666667,0.166667,1.0,1.5,0.055556,inf,1,1
2,(Chips),(Aspirin),0.666667,0.166667,0.166667,0.25,1.5,0.055556,1.111111,1,1
3,(Aspirin),(Chips),0.166667,0.666667,0.166667,1.0,1.5,0.055556,inf,1,1
4,(Aspirin),(Diaper),0.166667,0.5,0.166667,1.0,2.0,0.083333,inf,1,1


In [11]:
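# Exploratory report on the rule table.
# Note: the pandas_profiling package has since been renamed ydata-profiling.
import pandas_profiling
profile = pandas_profiling.ProfileReport(assoc)
profile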



In [12]:
high_association = assoc[ (assoc['lift'] >= 1) & (assoc['confidence'] >= 0.5) & (assoc['conviction'] > 1)]


In [13]:
final = high_association.sort_values(["support", "confidence", "lift"], axis = 0, ascending = [False, False, False])
final

Unnamed: 0,antecedents,consequents,antecedent_support,consequent_support,support,confidence,lift,leverage,conviction,antecedents_len,consequents_len
23,(Diaper),(Beer),0.500000,0.666667,0.500000,1.000000,1.500000,0.166667,inf,1,1
22,(Beer),(Diaper),0.666667,0.500000,0.500000,0.750000,1.500000,0.166667,2.000000,1,1
20,(Chips),(Beer),0.666667,0.666667,0.500000,0.750000,1.125000,0.055556,1.333333,1,1
21,(Beer),(Chips),0.666667,0.666667,0.500000,0.750000,1.125000,0.055556,1.333333,1,1
246,"(Milk, Beer)",(Diaper),0.333333,0.500000,0.333333,1.000000,2.000000,0.166667,inf,2,1
74,(Soda),(Milk),0.333333,0.666667,0.333333,1.000000,1.500000,0.111111,inf,1,1
205,"(Chips, Diaper)",(Beer),0.333333,0.666667,0.333333,1.000000,1.500000,0.111111,inf,2,1
247,"(Milk, Diaper)",(Beer),0.333333,0.666667,0.333333,1.000000,1.500000,0.111111,inf,2,1
251,(Diaper),"(Milk, Beer)",0.500000,0.333333,0.333333,0.666667,2.000000,0.166667,2.000000,1,2
204,"(Chips, Beer)",(Diaper),0.500000,0.500000,0.333333,0.666667,1.333333,0.083333,1.500000,2,1


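**What can you discover from the two rules?**

Of the two high-support pairs, {Diaper, Beer} is clearly the more strongly associated in this data: its lift is 1.5 against 1.125 for {Beer, Chips}, and Diaper → Beer has a confidence of 1.0.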

# **If a customer buys the items listed under 'antecedents', the recommendation model below suggests the items under 'consequents', favouring rules with higher 'confidence', 'lift', 'leverage' and 'conviction'**


# For example

In [19]:
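# Returns no rows: no rule with antecedent {Beer, Diaper} survived the high_association filter
# (for instance, {Beer, Diaper} -> Milk has conviction 1, which fails conviction > 1).
final.loc[final.antecedents==frozenset({'Beer', 'Diaper'}) ].head()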


Unnamed: 0,antecedents,consequents,antecedent_support,consequent_support,support,confidence,lift,leverage,conviction,antecedents_len,consequents_len


In [20]:
final.loc[final.antecedents==frozenset({'Beer'}) ].head()


Unnamed: 0,antecedents,consequents,antecedent_support,consequent_support,support,confidence,lift,leverage,conviction,antecedents_len,consequents_len
22,(Beer),(Diaper),0.666667,0.5,0.5,0.75,1.5,0.166667,2.0,1,1
21,(Beer),(Chips),0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333,1,1
208,(Beer),"(Chips, Diaper)",0.666667,0.333333,0.333333,0.5,1.5,0.111111,1.333333,1,2
250,(Beer),"(Milk, Diaper)",0.666667,0.333333,0.333333,0.5,1.5,0.111111,1.333333,1,2


In [21]:
final.loc[final.antecedents==frozenset({'Milk'}) ].head()


Unnamed: 0,antecedents,consequents,antecedent_support,consequent_support,support,confidence,lift,leverage,conviction,antecedents_len,consequents_len
75,(Milk),(Soda),0.666667,0.333333,0.333333,0.5,1.5,0.111111,1.333333,1,1


In [22]:
final.loc[final.antecedents==frozenset({'Lotion', 'Juice', 'BabyFood'}) ].head()


Unnamed: 0,antecedents,consequents,antecedent_support,consequent_support,support,confidence,lift,leverage,conviction,antecedents_len,consequents_len
1968,"(BabyFood, Lotion, Juice)","(Milk, Chips, Beer)",0.166667,0.166667,0.166667,1.0,6.0,0.138889,inf,3,3
2092,"(BabyFood, Lotion, Juice)","(Milk, Chips, Diaper)",0.166667,0.166667,0.166667,1.0,6.0,0.138889,inf,3,3
2273,"(BabyFood, Lotion, Juice)","(Milk, Chips, Beer, Diaper)",0.166667,0.166667,0.166667,1.0,6.0,0.138889,inf,3,4
1366,"(BabyFood, Lotion, Juice)","(Milk, Beer)",0.166667,0.333333,0.166667,1.0,3.0,0.111111,inf,3,2
1398,"(BabyFood, Lotion, Juice)","(Chips, Diaper)",0.166667,0.333333,0.166667,1.0,3.0,0.111111,inf,3,2


In [23]:
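# frozensets are unordered, so this is the same query as the previous cell with the items listed in a different order
final.loc[final.antecedents==frozenset({'Juice', 'Lotion', 'BabyFood'}) ].head()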


Unnamed: 0,antecedents,consequents,antecedent_support,consequent_support,support,confidence,lift,leverage,conviction,antecedents_len,consequents_len
1968,"(BabyFood, Lotion, Juice)","(Milk, Chips, Beer)",0.166667,0.166667,0.166667,1.0,6.0,0.138889,inf,3,3
2092,"(BabyFood, Lotion, Juice)","(Milk, Chips, Diaper)",0.166667,0.166667,0.166667,1.0,6.0,0.138889,inf,3,3
2273,"(BabyFood, Lotion, Juice)","(Milk, Chips, Beer, Diaper)",0.166667,0.166667,0.166667,1.0,6.0,0.138889,inf,3,4
1366,"(BabyFood, Lotion, Juice)","(Milk, Beer)",0.166667,0.333333,0.166667,1.0,3.0,0.111111,inf,3,2
1398,"(BabyFood, Lotion, Juice)","(Chips, Diaper)",0.166667,0.333333,0.166667,1.0,3.0,0.111111,inf,3,2
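The lookups above match one exact antecedent at a time. As a hedged sketch of how they could be wrapped into a small recommender, the hypothetical helper `recommend` below returns consequent items for every rule whose antecedent is contained in a customer's basket, ranked by confidence and then lift. The small `rules` table is copied by hand from four rows of `final` so that the block runs on its own; in the notebook, `final` itself could be passed instead.

```python
import pandas as pd

# Four rules copied from the `final` table above (rows 23, 22, 21 and 74).
rules = pd.DataFrame({
    'antecedents': [frozenset({'Diaper'}), frozenset({'Beer'}),
                    frozenset({'Beer'}), frozenset({'Soda'})],
    'consequents': [frozenset({'Beer'}), frozenset({'Diaper'}),
                    frozenset({'Chips'}), frozenset({'Milk'})],
    'confidence': [1.0, 0.75, 0.75, 1.0],
    'lift': [1.5, 1.5, 1.125, 1.5],
})

def recommend(rules, basket, top_n=3):
    """Suggest up to top_n items not already in the basket."""
    basket = frozenset(basket)
    # Keep rules whose antecedent is fully contained in the basket.
    applicable = rules[rules['antecedents'].apply(lambda a: a <= basket)]
    ranked = applicable.sort_values(['confidence', 'lift'], ascending=False)
    suggestions = []
    for consequent in ranked['consequents']:
        for item in sorted(consequent):
            # Skip items the customer already has or that were already suggested.
            if item not in basket and item not in suggestions:
                suggestions.append(item)
    return suggestions[:top_n]

print(recommend(rules, {'Beer'}))
print(recommend(rules, {'Diaper'}))
```

Matching on subsets rather than exact equality is a design choice made for this sketch: it lets a basket such as {Beer, Bread} still trigger the rules for {Beer}.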
