# Lab 8: Frequent Itemset Mining

In this lab, we explore transactional data and use frequent itemset mining techniques to extract association rules. First, we need to install a package named [mlxtend](http://rasbt.github.io/mlxtend/).

In [4]:
# install packages
import sys

# !conda install --yes --prefix {sys.prefix} -c anaconda nltk 
!conda install --yes --prefix {sys.prefix} -c conda-forge mlxtend


In [15]:
import numpy as np
import pandas as pd
import logging
from mlxtend.frequent_patterns import apriori
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import association_rules

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Apriori Algorithm
[Apriori](https://www.it.uu.se/edu/course/homepage/infoutv/ht08/vldb94_rj.pdf) is a popular algorithm for extracting frequent itemsets, with applications in association rule learning. It is designed to operate on databases of transactions, such as purchases by customers of a store. An itemset is considered "frequent" if it meets a user-specified support threshold, where the support of an itemset is the fraction of transactions that contain all of its items. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is a set of items that occur together in at least 50% of all transactions in the database.
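To make the support threshold concrete, here is a minimal pure-Python sketch of a level-wise search in the spirit of Apriori. The toy transactions, the helper names `support` and `apriori_sketch`, and the 0.5 threshold are all assumptions chosen for illustration; this is not how mlxtend implements the algorithm.

```python
from itertools import combinations

# Hypothetical toy transactions, used only for this illustration.
transactions = [
    {'bread', 'milk'},
    {'bread', 'milk', 'eggs'},
    {'milk', 'eggs'},
    {'bread', 'milk'},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def apriori_sketch(transactions, min_support):
    """Level-wise search: size-k candidates are built only from frequent size-(k-1) sets."""
    items = sorted(set().union(*transactions))
    level = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while level:
        # Keep only the candidates that meet the support threshold.
        kept = {}
        for candidate in level:
            s = support(candidate, transactions)
            if s >= min_support:
                kept[candidate] = s
        frequent.update(kept)
        # Join step: merge pairs of frequent sets into candidates one item larger.
        k += 1
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune step: a candidate survives only if all of its (k-1)-subsets are frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in kept for sub in combinations(c, k - 1))]
    return frequent

frequent = apriori_sketch(transactions, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

In this toy example, {bread, milk, eggs} is pruned without its support being counted, because its subset {bread, eggs} is not frequent. That pruning is the property that gives the algorithm its efficiency.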



### Generating Frequent Itemsets
Suppose we have the following transaction data.

In [8]:
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']
          ]

First, we can transform the data into the right format via the TransactionEncoder API.

In [9]:
# one-hot encode the transactions into a boolean array
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)

df = pd.DataFrame(te_ary, columns=te.columns_)
df.head()

Unnamed: 0,Apple,Corn,Dill,Eggs,Ice cream,Kidney Beans,Milk,Nutmeg,Onion,Unicorn,Yogurt
0,False,False,False,True,False,True,True,True,True,False,True
1,False,False,True,True,False,True,False,True,True,False,True
2,True,False,False,True,False,True,True,False,False,False,False
3,False,True,False,False,False,True,True,False,False,True,True
4,False,True,False,True,True,True,False,False,True,False,False
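The encoding step can be pictured as plain one-hot encoding: one boolean column per distinct item, one row per transaction. The sketch below reproduces that layout in pure Python. The function name `one_hot_encode` is an assumption for illustration, and the sketch makes no claim about how `TransactionEncoder` works internally.

```python
# Same transactions as in the lab, repeated here so the sketch is self-contained.
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

def one_hot_encode(transactions):
    """Return (columns, rows): sorted item names and one boolean row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = []
    for t in transactions:
        present = set(t)  # duplicates such as the repeated 'Onion' collapse here
        rows.append([column in present for column in columns])
    return columns, rows

columns, rows = one_hot_encode(dataset)
print(columns)
for row in rows:
    print(row)
```

Note that the repeated 'Onion' in the last transaction has no effect: the encoding records only whether an item is present, not how many times.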


### Finding Frequent Itemsets

Now, let us return the items and itemsets with at least 60% support:

In [12]:
apriori(df, min_support=0.6, use_colnames=True)

Unnamed: 0,support,itemsets
0,0.8,(Eggs)
1,1.0,(Kidney Beans)
2,0.6,(Milk)
3,0.6,(Onion)
4,0.6,(Yogurt)
5,0.8,"(Eggs, Kidney Beans)"
6,0.6,"(Eggs, Onion)"
7,0.6,"(Kidney Beans, Milk)"
8,0.6,"(Onion, Kidney Beans)"
9,0.6,"(Kidney Beans, Yogurt)"
10,0.6,"(Eggs, Onion, Kidney Beans)"


### Selecting and Filtering Results
We can also apply filters to select the subset of frequent itemsets of interest. 

In [14]:
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)

# filter based on both length and support.
frequent_itemsets[ 
    (frequent_itemsets['length'] == 2) &
    (frequent_itemsets['support'] >= 0.8) 
]

Unnamed: 0,support,itemsets,length
5,0.8,"(Eggs, Kidney Beans)",2


-------

## Association Rules Generation from Frequent Itemsets

Rule generation is a common task in the mining of frequent patterns. An association rule is an implication expression of the form X→Y, where X and Y are disjoint itemsets. A more concrete example based on consumer behaviour would be {Diapers}→{Beer} suggesting that people who buy diapers are also likely to buy beer. To evaluate the "interest" of such an association rule, different metrics have been developed. 

**Lift**: The lift metric measures how much more often the antecedent and consequent of a rule A→C occur together than we would expect if they were statistically independent. If A and C are independent, the lift is exactly 1; values above 1 indicate that the two occur together more often than chance would predict.
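As a worked example, the sketch below computes lift for the rule {Onion, Kidney Beans}→{Eggs} directly from the lab's transactions, along with confidence, leverage, and conviction. The formulas used are the standard definitions, stated here as an assumption since the lab text only describes lift: confidence(A→C) = support(A∪C) / support(A), lift = confidence / support(C), leverage = support(A∪C) − support(A)·support(C), and conviction = (1 − support(C)) / (1 − confidence).

```python
import math

# Same transactions as in the lab, stored as sets so duplicates collapse.
transactions = [
    {'Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'},
    {'Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'},
    {'Milk', 'Apple', 'Kidney Beans', 'Eggs'},
    {'Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'},
    {'Corn', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs'},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

antecedent = {'Onion', 'Kidney Beans'}
consequent = {'Eggs'}

s_a = support(antecedent)                 # 3 of 5 transactions
s_c = support(consequent)                 # 4 of 5 transactions
s_ac = support(antecedent | consequent)   # 3 of 5 transactions

confidence = s_ac / s_a
lift = confidence / s_c
leverage = s_ac - s_a * s_c
# Conviction divides by (1 - confidence), so a rule that always holds gets infinity.
conviction = math.inf if confidence == 1.0 else (1 - s_c) / (1 - confidence)

print(f"confidence={confidence:.2f} lift={lift:.2f} "
      f"leverage={leverage:.2f} conviction={conviction}")
```

A lift of 1.25 means that eggs appear in transactions containing both onion and kidney beans 1.25 times as often as they would if the two sides were independent.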

In [19]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
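Conceptually, rule generation splits each frequent itemset into an antecedent and a consequent in every possible way and keeps the splits that pass the threshold. The sketch below does this by hand for a single itemset, using a confidence threshold of 0.7. The helper names `count` and `rules_from_itemset` are assumptions for illustration, and this is not a description of mlxtend's implementation.

```python
from itertools import combinations

# Same transactions as in the lab, stored as sets so duplicates collapse.
transactions = [
    {'Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'},
    {'Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'},
    {'Milk', 'Apple', 'Kidney Beans', 'Eggs'},
    {'Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'},
    {'Corn', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs'},
]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def rules_from_itemset(itemset, min_confidence):
    """Enumerate antecedent -> consequent splits of one itemset that occurs at least once."""
    itemset = frozenset(itemset)
    total = count(itemset)
    rules = []
    # Antecedents are the non-empty proper subsets; the consequent is the remainder.
    for size in range(1, len(itemset)):
        for subset in combinations(sorted(itemset), size):
            antecedent = frozenset(subset)
            # Ratio of raw counts equals support(itemset) / support(antecedent).
            confidence = total / count(antecedent)
            if confidence >= min_confidence:
                rules.append((antecedent, itemset - antecedent, confidence))
    return rules

candidate_rules = rules_from_itemset({'Eggs', 'Kidney Beans', 'Onion'}, min_confidence=0.7)
for antecedent, consequent, confidence in candidate_rules:
    print(sorted(antecedent), '->', sorted(consequent), f"confidence={confidence:.2f}")
```

Of the six possible splits of this itemset, only {Kidney Beans}→{Eggs, Onion} is dropped: kidney beans appear in every transaction, so the rule's confidence is just the support of the whole itemset (0.6).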

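If you are interested in rules according to a different metric, you can simply adjust the `metric` and `min_threshold` arguments. For example, passing `metric="lift"` and `min_threshold=1.2` keeps only rules with a lift score of at least 1.2. Because the result is an ordinary pandas DataFrame, you can also filter it on several columns at once. The following cell selects rules with at least two antecedent items, a confidence above 0.75, and a lift above 1.2: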

In [20]:
rules["antecedent_len"] = rules["antecedents"].apply(len)

rules[ (rules['antecedent_len'] >= 2) &
       (rules['confidence'] > 0.75) &
       (rules['lift'] > 1.2) ]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len
9,"(Onion, Kidney Beans)",(Eggs),0.6,0.8,0.6,1.0,1.25,0.12,inf,2


# End of lab 8