<a id='libraries'></a>
<h1 style="color:navy" >Association Rules? What's that!</h1> 


Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.In any given transaction with a variety of items, association rules are meant to discover the rules that determine how or why certain items are connected.

Based on the concept of strong rules, Rakesh Agrawal, Tomasz Imieliński and Arun Swam introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions,potatoes} ->{burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as, e.g., promotional pricing or product placements.

In addition to the above example from market basket analysis association rules are employed today in many application areas including Web usage mining, intrusion detection, continuous production, and bioinformatics. In contrast with sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions. 


- **Support:** Support is an indication of how frequently the itemset appears in the dataset.

Support is a so-called frequency constraint. Its main feature is that it possesses the property of down-ward closure which means that all sub sets of a frequent set (support > min. support threshold) are also frequent. This property (actually, the fact that no super set of a infrequent set can be frequent) is used to prune the search space (usually a tree of item sets with increasing size) in level-wise algorithms (e.g., the APRIORI algorithm). The disadvantage of support is the rare item problem. Items that occur very infrequently in the data set are pruned although they would still produce interesting and potentially valuable rules.

- **Confidence:** Confidence is the percentage of all transactions satisfying X that also satisfy Y.

Confidence is not down-ward closed and was developed together with support (the so-called support-confidence framework). While support is used to prune the search space and only leave potentially interesting rules, confidence is used in a second step to filter rules that exceed a min. confidence threshold. A problem with confidence is that it is sensitive to the frequency of the consequent (Y) in the data set. Caused by the way confidence is calculated, Ys with higher support will automatically produce higher confidence values even if they exists no association between the items.

- **Lift:**  The ratio of the observed support to that expected if X and Y were independent.

Leverage measures the difference of X and Y appearing together in the data set and what would be expected if X and Y where statistically dependent. The rational in a sales setting is to find out how many more units (items X and Y together) are sold than expected from the independent sells. Using min. leverage thresholds at the same time incorporates an implicit frequency constraint. E.g., for setting a min. leverage thresholds to 0.01% (corresponds to 10 occurrence in a data set with 100,000 transactions) one first can use an algorithm to find all itemsets with min. support of 0.01% and then filter the found item sets using the leverage constraint. Because of this property leverage also can suffer from the rare item problem.

- **Conviction:** The ratio of the expected frequency that X occurs without Y (that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were independent divided by the observed frequency of incorrect predictions.

Conviction compares the probability that X appears without Y if they were dependent with the actual frequency of the appearance of X without Y. In that respect it is similar to lift, however, it contrast to lift it is a directed measure. Furthermore, conviction is monotone in confidence and lift.

- **Leverage:** Leverage measures the difference of X and Y appearing together in the data set and what would be expected if X and Y where statistically dependent.

The rational in a sales setting is to find out how many more units (items X and Y together) are sold than expected from the independent sells. Using min. leverage thresholds at the same time incorporates an implicit frequency constraint. E.g., for setting a min. leverage thresholds to 0.01% (corresponds to 10 occurrence in a data set with 100,000 transactions) one first can use an algorithm to find all itemsets with min. support of 0.01% and then filter the found item sets using the leverage constraint. Because of this property leverage also can suffer from the rare item problem.




In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import mlxtend as ml

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [2]:
df = pd.read_excel('../input/chocolate-ingredients/chocolate_ingredients.xlsx',header=None)
df.columns = ['Ingredients']
df

Unnamed: 0,Ingredients
0,"sugar,cocoa butter,skimmed milk powder,cocoa m..."
1,"sugar,coconut oil,milk powder,glucose syrup,wh..."
2,"skimmed milk powder,sugar,palm oil,sunflower o..."
3,"sugar,wheat flour,coconut oil,palm oil,hazelnu..."
4,"sugar,cocoa butter,whole milk powder,cocoa mas..."
...,...
74,"sugar,palm oil,sunflower oil,cotton oil,cocoa ..."
75,"sugar,glucose-fructose syrup,edible beef gelat..."
76,"sugar,skimmed milk powder,palm oil,clarified b..."
77,"sugar,peanuts,cocoa mass,whole milk powder,lac..."


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79 entries, 0 to 78
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Ingredients  79 non-null     object
dtypes: object(1)
memory usage: 760.0+ bytes


In [4]:
data = list(df["Ingredients"].apply(lambda x:x.split(',')))

In [5]:
te = TransactionEncoder()
te_data = te.fit(data).transform(data)
df = pd.DataFrame(te_data,columns=te.columns_)
df

Unnamed: 0,wheat flour,albumin,almond,almond chunks,almonds,ammonia phostatines,ammonium bicarbonate,ammonium carbonate,ammonium hydrogen carbonate,ammonium phosphates,...,wheat malt,wheat protein,wheat semolina,wheat starch,whey powder,whole milk,whole milk powder,whole wheat flour,whole wheat grains,xanthan gum
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False
1,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,True,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,True,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74,True,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
75,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
76,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
77,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False


In [6]:
df1 = apriori(df, min_support=0.30, use_colnames= True, verbose = 1)
df1[1:10]

Processing 55 combinations | Sampling itemset size 5


Unnamed: 0,support,itemsets
1,0.708861,(cocoa mass)
2,0.417722,(flavoring)
3,0.531646,(palm oil)
4,0.582278,(salt)
5,0.506329,(skimmed milk powder)
6,0.544304,(soy lecithin)
7,0.949367,(sugar)
8,0.417722,(sunflower lecithin)
9,0.329114,(wheat flour)


In [7]:
df1.sort_values(by="support",ascending=False)

Unnamed: 0,support,itemsets
7,0.949367,(sugar)
0,0.759494,(cocoa butter)
17,0.734177,"(cocoa butter, sugar)"
1,0.708861,(cocoa mass)
25,0.683544,"(cocoa mass, sugar)"
...,...,...
31,0.303797,"(flavoring, soy lecithin)"
75,0.303797,"(cocoa mass, whole milk powder, soy lecithin)"
78,0.303797,"(palm oil, salt, flavoring)"
88,0.303797,"(salt, sugar, wheat flour)"


In [8]:
df1['length'] = df1['itemsets'].apply(lambda x:len(x))
df1

Unnamed: 0,support,itemsets,length
0,0.759494,(cocoa butter),1
1,0.708861,(cocoa mass),1
2,0.417722,(flavoring),1
3,0.531646,(palm oil),1
4,0.582278,(salt),1
...,...,...,...
99,0.367089,"(cocoa butter, cocoa mass, sugar, whole milk p...",4
100,0.303797,"(cocoa butter, skimmed milk powder, whole milk...",4
101,0.316456,"(cocoa butter, sugar, whole milk powder, soy l...",4
102,0.303797,"(cocoa mass, sugar, whole milk powder, soy lec...",4


In [9]:
frequent_itemsets = apriori(df,min_support= 0.15, use_colnames = True)
rules1 = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.30)

In [10]:
rules1 = rules1.sort_values(['confidence','lift'],ascending = False)
rules1[1:11]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
9300,"(palm oil, sodium hydrogen carbonate, flavoring)","(salt, wheat flour)",0.151899,0.316456,0.151899,1.0,3.16,0.10383,inf
9841,"(sugar, sodium hydrogen carbonate, flavoring)","(salt, wheat flour)",0.151899,0.316456,0.151899,1.0,3.16,0.10383,inf
988,"(sodium hydrogen carbonate, flavoring)",(wheat flour),0.164557,0.329114,0.164557,1.0,3.038462,0.110399,inf
4052,"(palm oil, sodium hydrogen carbonate, flavoring)",(wheat flour),0.151899,0.329114,0.151899,1.0,3.038462,0.101907,inf
4249,"(salt, sodium hydrogen carbonate, flavoring)",(wheat flour),0.164557,0.329114,0.164557,1.0,3.038462,0.110399,inf
4420,"(sodium hydrogen carbonate, flavoring, sugar)",(wheat flour),0.151899,0.329114,0.151899,1.0,3.038462,0.101907,inf
9289,"(palm oil, salt, sodium hydrogen carbonate, fl...",(wheat flour),0.151899,0.329114,0.151899,1.0,3.038462,0.101907,inf
9830,"(salt, sugar, sodium hydrogen carbonate, flavo...",(wheat flour),0.151899,0.329114,0.151899,1.0,3.038462,0.101907,inf
5320,"(cocoa butter, cocoa powder, palm oil)","(cocoa mass, salt)",0.151899,0.367089,0.151899,1.0,2.724138,0.096138,inf
10750,"(cocoa butter, cocoa powder, sugar, palm oil)","(cocoa mass, salt)",0.151899,0.367089,0.151899,1.0,2.724138,0.096138,inf


In [11]:
print(len(rules1))

14382


In [12]:
frequent_itemsets = apriori(df,min_support= 0.10, use_colnames = True)
rules2 = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.30)

In [13]:
rules2 = rules1.sort_values(['confidence','lift'],ascending = False)
rules2[1:11]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
9300,"(palm oil, sodium hydrogen carbonate, flavoring)","(salt, wheat flour)",0.151899,0.316456,0.151899,1.0,3.16,0.10383,inf
9841,"(sugar, sodium hydrogen carbonate, flavoring)","(salt, wheat flour)",0.151899,0.316456,0.151899,1.0,3.16,0.10383,inf
988,"(sodium hydrogen carbonate, flavoring)",(wheat flour),0.164557,0.329114,0.164557,1.0,3.038462,0.110399,inf
4052,"(palm oil, sodium hydrogen carbonate, flavoring)",(wheat flour),0.151899,0.329114,0.151899,1.0,3.038462,0.101907,inf
4249,"(salt, sodium hydrogen carbonate, flavoring)",(wheat flour),0.164557,0.329114,0.164557,1.0,3.038462,0.110399,inf
4420,"(sodium hydrogen carbonate, flavoring, sugar)",(wheat flour),0.151899,0.329114,0.151899,1.0,3.038462,0.101907,inf
9289,"(palm oil, salt, sodium hydrogen carbonate, fl...",(wheat flour),0.151899,0.329114,0.151899,1.0,3.038462,0.101907,inf
9830,"(salt, sugar, sodium hydrogen carbonate, flavo...",(wheat flour),0.151899,0.329114,0.151899,1.0,3.038462,0.101907,inf
5320,"(cocoa butter, cocoa powder, palm oil)","(cocoa mass, salt)",0.151899,0.367089,0.151899,1.0,2.724138,0.096138,inf
10750,"(cocoa butter, cocoa powder, sugar, palm oil)","(cocoa mass, salt)",0.151899,0.367089,0.151899,1.0,2.724138,0.096138,inf


In [14]:
print(len(rules2))

14382


In [15]:
frequent_itemsets = apriori(df,min_support= 0.08, use_colnames = True)
rules3 = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.3)

In [16]:
rules3 = rules3.sort_values(['confidence','lift'],ascending = False)
rules3[1:11]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
204661,"(cocoa powder, salt, sugar, polyglycerol polyr...","(cocoa butter, soy lecithin, cocoa mass, palm ...",0.088608,0.101266,0.088608,1.0,9.875,0.079635,inf
204809,"(cocoa powder, salt, polyglycerol polyricinole...","(cocoa butter, sugar, soy lecithin, cocoa mass...",0.088608,0.101266,0.088608,1.0,9.875,0.079635,inf
14742,"(cocoa mass, sodium hydrogen carbonate, wheat ...","(palm oil, ammonium hydrogen carbonate)",0.088608,0.113924,0.088608,1.0,8.777778,0.078513,inf
14982,"(cocoa mass, sodium hydrogen carbonate, wheat ...","(ammonium hydrogen carbonate, soy lecithin)",0.088608,0.113924,0.088608,1.0,8.777778,0.078513,inf
42750,"(cocoa mass, salt, sodium hydrogen carbonate, ...","(palm oil, ammonium hydrogen carbonate)",0.088608,0.113924,0.088608,1.0,8.777778,0.078513,inf
42775,"(cocoa mass, sodium hydrogen carbonate, wheat ...","(palm oil, salt, ammonium hydrogen carbonate)",0.088608,0.113924,0.088608,1.0,8.777778,0.078513,inf
42983,"(cocoa mass, sodium hydrogen carbonate, wheat ...","(palm oil, ammonium hydrogen carbonate)",0.088608,0.113924,0.088608,1.0,8.777778,0.078513,inf
42990,"(cocoa mass, sodium hydrogen carbonate, palm o...","(ammonium hydrogen carbonate, soy lecithin)",0.088608,0.113924,0.088608,1.0,8.777778,0.078513,inf
43008,"(cocoa mass, sodium hydrogen carbonate, wheat ...","(palm oil, ammonium hydrogen carbonate, soy le...",0.088608,0.113924,0.088608,1.0,8.777778,0.078513,inf
43010,"(palm oil, sodium hydrogen carbonate, ammonium...","(cocoa mass, wheat flour, soy lecithin)",0.088608,0.113924,0.088608,1.0,8.777778,0.078513,inf


In [17]:
print(len(rules3))

205848
