First, install the Python package "mlxtend" by running **pip install mlxtend** in a command prompt or terminal.

Visit **http://rasbt.github.io/mlxtend/#examples** for more information

# Data Preparation

In [3]:
!pip install mlxtend

Collecting mlxtend
  Downloading mlxtend-0.8.0-py2.py3-none-any.whl (1.3MB)
    100% |████████████████████████████████| 1.3MB 392kB/s eta 0:00:01
Installing collected packages: mlxtend
Successfully installed mlxtend-0.8.0


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import csv

from mlxtend.preprocessing import OnehotTransactions

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [2]:
df = pd.read_csv("data/toydataset.csv", header=None)
df

Unnamed: 0,0,1,2
0,apple,banana,carrot
1,banana,,
2,apple,,
3,apple,carrot,diet coke
4,banana,carrot,
5,banana,carrot,


There are a total of six transactions in the dataset.

In [3]:
df = df.fillna('')
df

Unnamed: 0,0,1,2
0,apple,banana,carrot
1,banana,,
2,apple,,
3,apple,carrot,diet coke
4,banana,carrot,
5,banana,carrot,


Convert the DataFrame to a list of lists

In [4]:
data = df.values.tolist()
data

[['apple', 'banana', 'carrot'],
 ['banana', '', ''],
 ['apple', '', ''],
 ['apple', 'carrot', 'diet coke'],
 ['banana', 'carrot', ''],
 ['banana', 'carrot', '']]

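Filter out the empty strings from each transaction

In [5]:
# list() is needed on Python 3, where filter() returns a lazy iterator
dataset = [list(filter(None, a)) for a in data]
dataset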

[['apple', 'banana', 'carrot'],
 ['banana'],
 ['apple'],
 ['apple', 'carrot', 'diet coke'],
 ['banana', 'carrot'],
 ['banana', 'carrot']]

In [6]:
oht = OnehotTransactions()
oht_ary = oht.fit(dataset).transform(dataset)
df = pd.DataFrame(oht_ary, columns=oht.columns_)
df

Unnamed: 0,apple,banana,carrot,diet coke
0,1,1,1,0
1,0,1,0,0
2,1,0,0,0
3,1,0,1,1
4,0,1,1,0
5,0,1,1,0
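To make the encoding step less of a black box, here is a minimal hand-written sketch of the same one-hot transformation in plain Python. It is not mlxtend's implementation; the names `columns` and `onehot_rows` are made up for this illustration, and the transactions are copied from the toy dataset above.

```python
# Hand-rolled one-hot encoding of transactions (illustrative only;
# this is NOT how mlxtend implements it internally).
dataset = [
    ['apple', 'banana', 'carrot'],
    ['banana'],
    ['apple'],
    ['apple', 'carrot', 'diet coke'],
    ['banana', 'carrot'],
    ['banana', 'carrot'],
]

# Sorted list of every distinct item; these become the column names.
columns = sorted({item for transaction in dataset for item in transaction})

# One row per transaction: 1 if the item is present, 0 otherwise.
onehot_rows = [
    [1 if item in transaction else 0 for item in columns]
    for transaction in dataset
]

print(columns)
for row in onehot_rows:
    print(row)
```

Each row lines up with a row of the encoded DataFrame shown above. If the installed mlxtend version no longer ships `OnehotTransactions`, the `TransactionEncoder` class in `mlxtend.preprocessing` fills the same role.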


# Frequent Item Set Mining

Apriori is a popular algorithm for extracting frequent itemsets, with applications in association rule learning. The Apriori algorithm is designed to operate on databases containing transactions, such as purchases by customers of a store. An itemset is considered "frequent" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of items that occur together in at least 50% of all transactions in the database.
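The notion of support can be checked by hand on the toy dataset. The sketch below is a brute-force enumeration written for illustration, not mlxtend's `apriori`; the function name `frequent_itemsets_bruteforce` is hypothetical. It borrows only one idea from Apriori: if no itemset of a given size is frequent, no larger itemset can be frequent either, so the search can stop.

```python
from itertools import combinations

# Toy transactions from this notebook, stored as sets for subset tests.
transactions = [
    {'apple', 'banana', 'carrot'},
    {'banana'},
    {'apple'},
    {'apple', 'carrot', 'diet coke'},
    {'banana', 'carrot'},
    {'banana', 'carrot'},
]

def frequent_itemsets_bruteforce(transactions, min_support):
    """Return {itemset tuple: support} for every itemset meeting min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            # Count the transactions that contain every item of the candidate.
            count = sum(1 for t in transactions if set(candidate) <= t)
            support = count / n
            if support >= min_support:
                result[candidate] = support
                found_any = True
        if not found_any:
            # No frequent itemset of this size, so none of a larger size exists.
            break
    return result

frequent = frequent_itemsets_bruteforce(transactions, min_support=0.2)
for itemset, support in frequent.items():
    print(itemset, round(support, 6))
```

With `min_support=0.2` this keeps the same five itemsets that the `apriori` calls in this notebook report. Brute force is only practical for a handful of items; the real algorithm prunes candidates far more aggressively.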



In [7]:
apriori(df, min_support=0.2)

Unnamed: 0,support,itemsets
0,0.5,[0]
1,0.666667,[1]
2,0.666667,[2]
3,0.333333,"[0, 2]"
4,0.5,"[1, 2]"


By default, apriori returns the column indices of the items, which may be useful in downstream operations such as association rule mining. For better readability, we can set use_colnames=True to convert these integer values into the respective item names:

In [8]:
apriori(df, min_support=0.2, use_colnames=True)

Unnamed: 0,support,itemsets
0,0.5,[apple]
1,0.666667,[banana]
2,0.666667,[carrot]
3,0.333333,"[apple, carrot]"
4,0.5,"[banana, carrot]"


### Top Products

In [9]:
apriori(df, min_support=0.2, use_colnames=True).sort_values(['support'],ascending=False)

Unnamed: 0,support,itemsets
1,0.666667,[banana]
2,0.666667,[carrot]
0,0.5,[apple]
4,0.5,"[banana, carrot]"
3,0.333333,"[apple, carrot]"


* banana & carrot appear together in **three** of the six transactions
* apple appears in **three** transactions
* banana and carrot are the two most popular individual items, each appearing in **four** transactions

Banana and carrot are therefore both the most popular individual items and the most frequently co-purchased pair.

In [10]:
res=apriori(df, min_support=0.2, use_colnames=True)
res.to_csv("data/freq_df.csv", index=False)

The advantage of working with pandas DataFrames is that we can use its convenient features to filter the results. For instance, let's assume we are only interested in itemsets of length 2 that have a support of at least 40 percent. First, we create the frequent itemsets via apriori and add a new column that stores the length of each itemset:

In [11]:
frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets

Unnamed: 0,support,itemsets,length
0,0.5,[apple],1
1,0.666667,[banana],1
2,0.666667,[carrot],1
3,0.333333,"[apple, carrot]",2
4,0.5,"[banana, carrot]",2


In [12]:
frequent_itemsets[ (frequent_itemsets['length'] == 2) &
                   (frequent_itemsets['support'] >= 0.4) ]

Unnamed: 0,support,itemsets,length
4,0.5,"[banana, carrot]",2


# Association Rules Mining

Support = Number of rows containing both A and B / Total number of rows
<br>
<br>
Confidence = Number of rows containing both A and B / Number of rows containing A
<br>
<br>
Expected Confidence = Number of rows containing B / Total number of rows
<br>
<br>
Lift = Confidence / Expected Confidence
- A lift greater than 1: A and B appear together more often than expected; the occurrence of A makes B more likely, i.e. A and B are positively correlated.
- A lift smaller than 1: A and B appear together less often than expected; the occurrence of A makes B less likely, i.e. A and B are negatively correlated.
- A lift near 1: A and B appear together about as often as expected; the occurrence of A has almost no effect on the occurrence of B, i.e. A and B are close to independent.
- Lift ranges from 0 to infinity.
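As a worked check of these definitions, the following sketch computes the four quantities for the rule carrot → banana directly from the toy transactions. The helper names `support` and `rule_metrics` are invented for this example and are not part of mlxtend.

```python
# Toy transactions from this notebook, stored as sets for subset tests.
transactions = [
    {'apple', 'banana', 'carrot'},
    {'banana'},
    {'apple'},
    {'apple', 'carrot', 'diet coke'},
    {'banana', 'carrot'},
    {'banana', 'carrot'},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rule_metrics(antecedent, consequent):
    """Support, confidence, expected confidence and lift for A -> B."""
    joint = support(set(antecedent) | set(consequent))   # rows with A and B
    confidence = joint / support(antecedent)             # joint / rows with A
    expected_confidence = support(consequent)            # rows with B
    lift = confidence / expected_confidence
    return {
        'support': joint,
        'confidence': confidence,
        'expected_confidence': expected_confidence,
        'lift': lift,
    }

metrics = rule_metrics({'carrot'}, {'banana'})
print(metrics)
```

For carrot → banana this gives a confidence of 0.75 and a lift of 1.125, matching the rule tables later in this notebook. The joint support computed here is 0.5, whereas the `support` column in those tables shows 0.666667 for the same rule, which equals the support of the antecedent alone; that column therefore appears to report antecedent support in the mlxtend version used here, rather than the joint support defined above.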


The association_rules() function allows you to specify (1) your metric of interest and (2) the corresponding threshold. Currently implemented measures are confidence and lift. Let's say you are interested in rules derived from the frequent itemsets only if the level of confidence is at least 50 percent (min_threshold=0.5):

In [13]:
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

Unnamed: 0,antecedants,consequents,support,confidence,lift
0,(carrot),(banana),0.666667,0.75,1.125
1,(banana),(carrot),0.666667,0.75,1.125
2,(carrot),(apple),0.666667,0.5,1.0
3,(apple),(carrot),0.5,0.666667,1.0


In [14]:
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5).sort_values(['confidence'],ascending=False)

Unnamed: 0,antecedants,consequents,support,confidence,lift
0,(carrot),(banana),0.666667,0.75,1.125
1,(banana),(carrot),0.666667,0.75,1.125
3,(apple),(carrot),0.5,0.666667,1.0
2,(carrot),(apple),0.666667,0.5,1.0


### Top Cross-Selling Products

In [15]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.1)
rules

Unnamed: 0,antecedants,consequents,support,confidence,lift
0,(carrot),(banana),0.666667,0.75,1.125
1,(banana),(carrot),0.666667,0.75,1.125


Pandas DataFrames make it easy to filter the results further. Let's say we are only interested in rules that satisfy the following criteria:

1. at least 2 antecedents
2. a confidence > 0.75
3. a lift score > 1.1

We could compute the antecedent length as follows:

In [16]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
rules["antecedant_len"] = rules["antecedants"].apply(lambda x: len(x))
rules

Unnamed: 0,antecedants,consequents,support,confidence,lift,antecedant_len
0,(carrot),(banana),0.666667,0.75,1.125,1
1,(banana),(carrot),0.666667,0.75,1.125,1
2,(carrot),(apple),0.666667,0.5,1.0,1
3,(apple),(carrot),0.5,0.666667,1.0,1


In [17]:
rules[ (rules['antecedant_len'] >= 2) &
       (rules['confidence'] > 0.75) &
       (rules['lift'] > 1.1) ]

Unnamed: 0,antecedants,consequents,support,confidence,lift,antecedant_len


No rules satisfy the above criteria: every antecedent here contains a single item, and no rule has a confidence strictly greater than 0.75.

# References

- http://analyticstrainings.com/?p=151
- http://rstatistics.net/association-mining-with-r/
- http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/