## Association rules

In [2]:
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

Association rules are usually written as {Diapers} -> {Beer}, which means that customers who purchased diapers also tended to purchase beer in the same transaction.

In the example above, {Diapers} is the __antecedent__ and {Beer} is the __consequent__. Both antecedents and consequents can contain multiple items. In other words, {Diapers, Gum} -> {Beer, Chips} is a valid rule.

__Support__ is the relative frequency with which an itemset (for a rule, the items of the antecedent and consequent together) appears in the transactions. In many cases you will want high support, to make sure the relationship covers enough transactions to be useful. However, low support can still be useful if you are trying to find “hidden” relationships.
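
To make the definition concrete, support can be counted by hand on the five transactions defined at the top. This is only a minimal sketch: the `support` helper is a hypothetical name introduced for illustration and is not part of mlxtend.

```python
# Transactions copied from the dataset defined at the top of this notebook.
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]


def support(itemset, transactions):
    """Hypothetical helper: fraction of transactions containing every item of `itemset`."""
    items = set(itemset)
    # `items <= set(transaction)` is True when all items occur in the transaction.
    hits = sum(1 for transaction in transactions if items <= set(transaction))
    return hits / len(transactions)


print(support({'Kidney Beans'}, dataset))   # 1.0 (all 5 transactions)
print(support({'Eggs'}, dataset))           # 0.8 (4 of 5 transactions)
print(support({'Eggs', 'Onion'}, dataset))  # 0.6 (3 of 5 transactions)
```

Converting each transaction to a set means the duplicated 'Onion' in the last transaction is counted only once.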

In [3]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
transaction_encoder = TransactionEncoder()
transaction_encoder_array = transaction_encoder.fit_transform(dataset)
df = pd.DataFrame(transaction_encoder_array, columns=transaction_encoder.columns_)
df

Unnamed: 0,Apple,Corn,Dill,Eggs,Ice cream,Kidney Beans,Milk,Nutmeg,Onion,Unicorn,Yogurt
0,False,False,False,True,False,True,True,True,True,False,True
1,False,False,True,True,False,True,False,True,True,False,True
2,True,False,False,True,False,True,True,False,False,False,False
3,False,True,False,False,False,True,True,False,False,True,True
4,False,True,False,True,True,True,False,False,True,False,False


In [5]:
df.astype(int)

Unnamed: 0,Apple,Corn,Dill,Eggs,Ice cream,Kidney Beans,Milk,Nutmeg,Onion,Unicorn,Yogurt
0,0,0,0,1,0,1,1,1,1,0,1
1,0,0,1,1,0,1,0,1,1,0,1
2,1,0,0,1,0,1,1,0,0,0,0
3,0,1,0,0,0,1,1,0,0,1,1
4,0,1,0,1,1,1,0,0,1,0,0


In [5]:
from mlxtend.frequent_patterns import apriori

apriori(df, min_support = 0.6)

Unnamed: 0,support,itemsets
0,0.8,(3)
1,1.0,(5)
2,0.6,(6)
3,0.6,(8)
4,0.6,(10)
5,0.8,"(3, 5)"
6,0.6,"(8, 3)"
7,0.6,"(5, 6)"
8,0.6,"(8, 5)"
9,0.6,"(10, 5)"


By default, apriori returns the column indices of the items, which may be useful in downstream operations such as association rule mining. For better readability, we can set use_colnames=True to convert these integer values into the respective item names: 

In [6]:
result = apriori(df, min_support = 0.6, use_colnames=True)
result

Unnamed: 0,support,itemsets
0,0.8,(Eggs)
1,1.0,(Kidney Beans)
2,0.6,(Milk)
3,0.6,(Onion)
4,0.6,(Yogurt)
5,0.8,"(Kidney Beans, Eggs)"
6,0.6,"(Eggs, Onion)"
7,0.6,"(Kidney Beans, Milk)"
8,0.6,"(Kidney Beans, Onion)"
9,0.6,"(Kidney Beans, Yogurt)"


In [7]:
result.sort_values('support', ascending=False, inplace=True)
result

Unnamed: 0,support,itemsets
1,1.0,(Kidney Beans)
0,0.8,(Eggs)
5,0.8,"(Kidney Beans, Eggs)"
2,0.6,(Milk)
3,0.6,(Onion)
4,0.6,(Yogurt)
6,0.6,"(Eggs, Onion)"
7,0.6,"(Kidney Beans, Milk)"
8,0.6,"(Kidney Beans, Onion)"
9,0.6,"(Kidney Beans, Yogurt)"


#### Challenge !!!
The advantage of working with pandas DataFrames is that we can use their convenient features to filter the results. For instance, let's assume we are only interested in itemsets of length 2 that have a support of at least 80 percent. First, we create the frequent itemsets via apriori and add a new column that stores the length of each itemset:

In [9]:
frequent_itemset = apriori(df, min_support = 0.6, use_colnames=True)
# frequent_itemset['length'] = frequent_itemset.itemsets.apply(lambda x: len(x))
frequent_itemset['length'] = [len(item) for item in frequent_itemset.itemsets]

frequent_itemset

Unnamed: 0,support,itemsets,length
0,0.8,(Eggs),1
1,1.0,(Kidney Beans),1
2,0.6,(Milk),1
3,0.6,(Onion),1
4,0.6,(Yogurt),1
5,0.8,"(Kidney Beans, Eggs)",2
6,0.6,"(Eggs, Onion)",2
7,0.6,"(Kidney Beans, Milk)",2
8,0.6,"(Kidney Beans, Onion)",2
9,0.6,"(Kidney Beans, Yogurt)",2


In [10]:
# Python's `and` cannot combine two boolean Series (their truth value is ambiguous),
# so the masks are combined element-wise with `&`; each comparison needs its own parentheses.
# Element-wise alternative: [f and s for f, s in zip(frequent_itemset.support >= 0.8, frequent_itemset.length == 2)]
frequent_itemset[(frequent_itemset['length'] == 2) & (frequent_itemset['support'] >= 0.8)]

Unnamed: 0,support,itemsets,length
5,0.8,"(Kidney Beans, Eggs)",2


Note that the entries in the "itemsets" column are of type frozenset, a built-in Python type that is similar to a Python set but immutable, which makes it hashable and more efficient for certain query or comparison operations.
Since frozensets are sets, the item order does not matter. That is, a query for the itemset {'Eggs', 'Kidney Beans'} returns the same result as the last cell, regardless of the order in which the items are written.
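
The order independence can be checked with plain Python, without pandas. The sketch below uses only built-in types; the `support_by_itemset` dictionary is a hypothetical stand-in for two rows of the table above.

```python
# An itemset as it appears in the "itemsets" column.
itemset = frozenset(['Kidney Beans', 'Eggs'])

# Equality ignores item order and also holds against a regular set.
print(itemset == frozenset(['Eggs', 'Kidney Beans']))  # True
print(itemset == {'Eggs', 'Kidney Beans'})             # True

# Tuples, by contrast, are order-sensitive.
print(('Kidney Beans', 'Eggs') == ('Eggs', 'Kidney Beans'))  # False

# Because frozensets are immutable they are hashable, so they can be dict keys.
support_by_itemset = {
    frozenset(['Kidney Beans', 'Eggs']): 0.8,
    frozenset(['Eggs', 'Onion']): 0.6,
}
print(support_by_itemset[frozenset(['Eggs', 'Kidney Beans'])])  # 0.8
```

In the notebook, the same comparison could be applied row by row, for example with `frequent_itemset['itemsets'].apply(lambda s: s == {'Eggs', 'Kidney Beans'})` as a boolean mask.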

### Association Rule Metrics
__support__: $\text{support}(A \rightarrow C) = \text{support}(A \cup C)$, range: $[0, 1]$:


__The support metric is defined for itemsets, not association rules__. The table produced by the association rule mining algorithm contains three different support metrics: 'antecedent support', 'consequent support', and 'support'. Here, 'antecedent support' is the proportion of transactions that contain the antecedent A, and 'consequent support' is the support of the consequent itemset C. The 'support' metric is then the support of the combined itemset A ∪ C. Note that 'support' can never exceed min('antecedent support', 'consequent support').
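
As a rough illustration of the three support values, the sketch below recomputes them for the rule {Eggs} -> {Onion} directly from the transactions and checks the upper bound. The `support` helper is again a hypothetical name, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction

# Transactions copied from the dataset defined at the top of this notebook.
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]


def support(itemset):
    """Hypothetical helper: exact fraction of transactions containing `itemset`."""
    hits = sum(1 for transaction in dataset if set(itemset) <= set(transaction))
    return Fraction(hits, len(dataset))


antecedent, consequent = {'Eggs'}, {'Onion'}
antecedent_support = support(antecedent)            # 4/5
consequent_support = support(consequent)            # 3/5
rule_support = support(antecedent | consequent)     # 3/5, support of A ∪ C

print(float(antecedent_support), float(consequent_support), float(rule_support))
# The combined itemset cannot be more frequent than either of its parts.
print(rule_support <= min(antecedent_support, consequent_support))  # True
```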
    
__confidence__: $\text{confidence}(A \rightarrow C) = \frac{\text{support}(A \rightarrow C)}{\text{support}(A)}$, range: $[0, 1]$:

The confidence of a rule A->C is the probability of seeing the consequent in a transaction given that it also contains the antecedent. Note that the metric is not symmetric, i.e., it is directed: the confidence of A->C can differ from the confidence of C->A. The confidence is 1 (maximal) for a rule A->C if every transaction that contains the antecedent also contains the consequent.
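
The asymmetry is easy to see on the Eggs/Onion pair. The following sketch plugs in the support values found by apriori above (0.8 for Eggs, 0.6 for Onion, 0.6 for both together); the `confidence` function and the `support` lookup table are hypothetical names used for illustration.

```python
from fractions import Fraction

# Supports of the frequent itemsets found above, as exact fractions of 5 transactions.
support = {
    frozenset({'Eggs'}): Fraction(4, 5),
    frozenset({'Onion'}): Fraction(3, 5),
    frozenset({'Eggs', 'Onion'}): Fraction(3, 5),
}


def confidence(antecedent, consequent):
    """confidence(A -> C) = support(A ∪ C) / support(A)."""
    a, c = frozenset(antecedent), frozenset(consequent)
    return support[a | c] / support[a]


print(float(confidence({'Eggs'}, {'Onion'})))  # 0.75
print(float(confidence({'Onion'}, {'Eggs'})))  # 1.0
```

Every transaction with Onion also contains Eggs, so {Onion} -> {Eggs} reaches the maximum of 1, while the reverse rule does not.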

__lift__: $\text{lift}(A \rightarrow C) = \frac{\text{confidence}(A \rightarrow C)}{\text{support}(C)}$, range: $[0, \infty)$:

The lift metric is commonly used to measure how much more often the antecedent and consequent of a rule A->C occur together than we would expect if they were statistically independent. If A and C are independent, the lift score will be exactly 1. Values above 1 mean the two occur together more often than expected, and values below 1 mean less often.
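
As a small worked example, lift can be computed by hand for two rules using the support values found above. The variable names are illustrative only.

```python
from fractions import Fraction

# {Eggs} -> {Onion}: supports taken from the frequent itemsets above.
support_a = Fraction(4, 5)    # support(Eggs)
support_c = Fraction(3, 5)    # support(Onion)
support_ac = Fraction(3, 5)   # support(Eggs ∪ Onion)

confidence = support_ac / support_a   # 3/4
lift = confidence / support_c         # 5/4
print(float(lift))  # 1.25

# {Kidney Beans} -> {Eggs}: Kidney Beans is in every transaction, so knowing
# it is present says nothing extra about Eggs and the lift is exactly 1.
lift_kidney_beans_eggs = (Fraction(4, 5) / Fraction(1)) / Fraction(4, 5)
print(float(lift_kidney_beans_eggs))  # 1.0
```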
    
__leverage__: $\text{leverage}(A \rightarrow C) = \text{support}(A \rightarrow C) - \text{support}(A) \times \text{support}(C)$, range: $[-1, 1]$:

Leverage computes the difference between the observed frequency of A and C appearing together and the frequency that would be expected if A and C were independent. A leverage value of 0 indicates independence.
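
The same two rules can serve as a quick check of the leverage formula. This is a sketch with illustrative variable names, using exact fractions of the five transactions.

```python
from fractions import Fraction

# {Eggs} -> {Onion}
support_a = Fraction(4, 5)    # support(Eggs)
support_c = Fraction(3, 5)    # support(Onion)
support_ac = Fraction(3, 5)   # support(Eggs ∪ Onion)

expected_if_independent = support_a * support_c   # 12/25
leverage = support_ac - expected_if_independent   # 3/25
print(float(leverage))  # 0.12

# {Kidney Beans} -> {Eggs}: observed and expected frequencies coincide.
leverage_kidney_beans_eggs = Fraction(4, 5) - Fraction(1) * Fraction(4, 5)
print(float(leverage_kidney_beans_eggs))  # 0.0
```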
    
__conviction__: $\text{conviction}(A \rightarrow C) = \frac{1 - \text{support}(C)}{1 - \text{confidence}(A \rightarrow C)}$, range: $[0, \infty]$:

A high conviction value means that the consequent is highly dependent on the antecedent. For instance, in the case of a perfect confidence score, the denominator becomes 0 (due to 1 - 1), for which the conviction score is defined as 'inf'. Similar to lift, if items are independent, the conviction is 1.
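
A short sketch of the conviction formula, including the special case of perfect confidence. The `conviction` function is a hypothetical helper written for this example, not mlxtend's implementation.

```python
import math
from fractions import Fraction


def conviction(support_c, confidence):
    """conviction(A -> C) = (1 - support(C)) / (1 - confidence(A -> C))."""
    if confidence == 1:
        # The denominator would be 0, so conviction is defined as infinity.
        return math.inf
    return (1 - support_c) / (1 - confidence)


# {Eggs} -> {Onion}: support(Onion) = 3/5, confidence = 3/4
print(float(conviction(Fraction(3, 5), Fraction(3, 4))))  # 1.6

# {Onion} -> {Eggs}: confidence is 1, so conviction is infinite
print(conviction(Fraction(4, 5), Fraction(1)))  # inf

# {Kidney Beans} -> {Eggs}: confidence equals support(Eggs), the independent case
print(float(conviction(Fraction(4, 5), Fraction(4, 5))))  # 1.0
```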

In [12]:
from mlxtend.frequent_patterns import association_rules
rules = association_rules(frequent_itemset, metric="confidence", min_threshold=0.5)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Kidney Beans),(Eggs),1.0,0.8,0.8,0.8,1.0,0.0,1.0
1,(Eggs),(Kidney Beans),0.8,1.0,0.8,1.0,1.0,0.0,inf
2,(Eggs),(Onion),0.8,0.6,0.6,0.75,1.25,0.12,1.6
3,(Onion),(Eggs),0.6,0.8,0.6,1.0,1.25,0.12,inf
4,(Kidney Beans),(Milk),1.0,0.6,0.6,0.6,1.0,0.0,1.0
5,(Milk),(Kidney Beans),0.6,1.0,0.6,1.0,1.0,0.0,inf
6,(Kidney Beans),(Onion),1.0,0.6,0.6,0.6,1.0,0.0,1.0
7,(Onion),(Kidney Beans),0.6,1.0,0.6,1.0,1.0,0.0,inf
8,(Kidney Beans),(Yogurt),1.0,0.6,0.6,0.6,1.0,0.0,1.0
9,(Yogurt),(Kidney Beans),0.6,1.0,0.6,1.0,1.0,0.0,inf


In [14]:
rules[rules['lift'] > 1.0]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
2,(Eggs),(Onion),0.8,0.6,0.6,0.75,1.25,0.12,1.6
3,(Onion),(Eggs),0.6,0.8,0.6,1.0,1.25,0.12,inf
10,"(Kidney Beans, Eggs)",(Onion),0.8,0.6,0.6,0.75,1.25,0.12,1.6
11,"(Kidney Beans, Onion)",(Eggs),0.6,0.8,0.6,1.0,1.25,0.12,inf
14,(Eggs),"(Kidney Beans, Onion)",0.8,0.6,0.6,0.75,1.25,0.12,1.6
15,(Onion),"(Kidney Beans, Eggs)",0.6,0.8,0.6,1.0,1.25,0.12,inf


*:)*