# Apriori and Eclat (Python)
Yang Xi <br>
01 Aug, 2020

<br>

- Problem Statement
    - Toy Problem
    - Up-Sell and Cross-Sell
    - Face Validation
    - Pairwise Association Rule
- Apriori
    - Build Transactions Table
    - Pairwise Rules for Item X and Interpretation
    - Pairwise Rules for Item Y and Interpretation
- Eclat
    - Build Transactions Table
    - Pairwise Rules for Item X
    - Pairwise Rules for Item Y
- Visualize Association Rules with Network Plot


# Problem Statement
### Toy Problem

**Association rule models** are commonly used in market basket analysis and recommendation systems. <br>
In this notebook, I will demonstrate the behavior of the **Apriori** and **Eclat** models using a toy problem. <br>
<br>
A **toy problem** is a purposely constructed problem with known patterns. It helps us understand how a model works by comparing the model's output with prior knowledge. <br>
<br>
In this toy problem, the transactions table contains 8 items purchased by 10 customers:

|customer|items| | | | | | | |
|--------|-|-|-|-|-|-|-|-|
|u01     |X|Y|B|C|D|E|F| |
|u02     |X|Y|B|C|D|E| | |
|u03     |X|Y| |C|D|E| | |
|u04     |X|Y| |C|D| | | |
|u05     |X|Y| |C|D| | | |
|u06     | |Y|B| | | | | |
|u07     | |Y|B| | |E| | |
|u08     | |Y|B| | | | | |
|u09     | |Y|B| | | | | |
|u10     | |Y|B|C| | | |G|

I will use an association rule model to answer the following questions:
- Which items are **associated** with item X?
- Which item has the **strongest association** with item X?
- What are the answers regarding **item Y**?
<br>
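
To make the toy table easy to experiment with, it can be written down directly in code. This is a minimal sketch: the names `baskets` and `rows` are illustrative only, and the long (user, item) layout is assumed to match the two columns that `dfl.head()` displays when the CSV file is loaded in the Apriori section.

```python
# Toy transactions copied from the table above; each string lists one customer's items.
baskets = {
    "u01": "XYBCDEF",
    "u02": "XYBCDE",
    "u03": "XYCDE",
    "u04": "XYCD",
    "u05": "XYCD",
    "u06": "YB",
    "u07": "YBE",
    "u08": "YB",
    "u09": "YB",
    "u10": "YBCG",
}

# Long format: one (user, item) row per purchase.
rows = [(user, item) for user, items in baskets.items() for item in items]

n_customers = len(baskets)
n_items = len({item for _, item in rows})
print(n_customers, n_items, len(rows))  # 10 8 39
```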


### Up-Sell and Cross-Sell
An item-category mapping is also created to illustrate how up-sell and cross-sell opportunities can be identified:

|item|category|
|----|--------|
|X|essentials|
|D|essentials|
|E|essentials|
|Y|signature|
|B|signature|
|C|occasionals|
|F|occasionals|
|G|occasionals|

In a later section, I will use a network plot to visualize the association rules, as well as the up-sell and cross-sell baskets.

### Face Validation
Without any model, we can already observe some patterns with "common sense":
- **Item D** should be **most associated** with **item X**, because customers always purchase them together.
- **Items B and G** should be **dissociated** from **item X**, because customers who buy X are unlikely to buy B or G.
- Customers always purchase **item Y** regardless of what else they purchase. We will see how the models handle such an item.
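
These observations can be checked by simple counting before any model is fitted. The sketch below is illustrative only (the `baskets` list and the variable names are my own, not part of the notebook's pipeline): it compares how often each item is bought by customers who buy X with how often it is bought overall.

```python
# Toy transactions from the table in the "Toy Problem" section (u01 to u10).
baskets = ["XYBCDEF", "XYBCDE", "XYCDE", "XYCD", "XYCD",
           "YB", "YBE", "YB", "YB", "YBCG"]

x_buyers = [items for items in baskets if "X" in items]

# Share of X buyers who also bought each item, next to the overall share.
share_with_x = {}
share_overall = {}
for item in "BCDEFGY":
    share_with_x[item] = sum(item in items for items in x_buyers) / len(x_buyers)
    share_overall[item] = sum(item in items for items in baskets) / len(baskets)

for item in sorted(share_with_x, key=share_with_x.get, reverse=True):
    print(item, share_with_x[item], share_overall[item])
```

Item D is bought by every X buyer but by only half of all customers, while item B is bought less often by X buyers than overall and item G is never bought by them.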



### Pairwise Association Rule
An association rule model takes a **transactions table** as input and outputs **all frequent item sets (rules)** that fulfil the user-specified criteria.<br>
Note that the frequent item sets **can contain multiple items**. For example, "customers who buy items X and Y together are likely to buy items C and D as well".<br>
<br>
For this toy problem, I will use **pairwise association rules**, each of which contains exactly two items.<br>
Pairwise association rules are easy to interpret, and from the results you can still construct bigger baskets with multiple items.<br>
It can be formulated with the following criteria:
- minimum support = 0
- minimum confidence = 0
- minimum set length = 2
- maximum set length = 2

**Leverage** will be used to rank the rules, because it measures both association and dissociation between two items.

$$ leverage (A→C) = support(A→C) − support(A) × support(C) $$

Leverage lies in the range [-1, 1]. A larger leverage value indicates a stronger association, a negative value indicates dissociation, and a value of 0 indicates independence.
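
As a worked example, leverage can be computed by hand from the toy transactions. This is a minimal sketch under my own naming (`baskets`, `support` and `leverage` are illustrative helpers, not library functions); exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction

# Toy transactions from the table in the "Toy Problem" section (u01 to u10).
baskets = ["XYBCDEF", "XYBCDE", "XYCDE", "XYCD", "XYCD",
           "YB", "YBE", "YB", "YB", "YBCG"]

def support(*items):
    """Share of transactions that contain every item in `items`."""
    hits = sum(all(item in basket for item in items) for basket in baskets)
    return Fraction(hits, len(baskets))

def leverage(a, c):
    """leverage(A -> C) = support(A -> C) - support(A) * support(C)."""
    return support(a, c) - support(a) * support(c)

for c in "DBY":
    print(f"X -> {c}: leverage = {float(leverage('X', c)):+.2f}")
# X -> D: leverage = +0.25
# X -> B: leverage = -0.15
# X -> Y: leverage = +0.00
```

The three cases cover association (positive), dissociation (negative) and independence (zero).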

Refer to [this webpage](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) for the definitions of the association rule measures (*support*, *confidence*, *lift*, *leverage*, etc.).


# Apriori
I will use `apriori` from the `mlxtend` package.<br>
It requires the input transactions table to be encoded as a user-item boolean matrix.

### Build Transactions Table

In [1]:
import pandas as pd
pd.set_option('mode.chained_assignment', None)

dfl = pd.read_csv('data/toy_problem.csv')
dfl.head()

| |user|item|
|-|----|----|
|0|u01|X|
|1|u01|Y|
|2|u01|B|
|3|u01|C|
|4|u01|D|


In [2]:
from mlxtend.preprocessing import TransactionEncoder

se_items = dfl.groupby('user')['item'].apply(list)
print(se_items)

te = TransactionEncoder()
te_fit = te.fit(se_items)
te_ary = te_fit.transform(se_items)

df = pd.DataFrame(te_ary, columns=te.columns_)
df

user
u01    [X, Y, B, C, D, E, F]
u02       [X, Y, B, C, D, E]
u03          [X, Y, C, D, E]
u04             [X, Y, C, D]
u05             [X, Y, C, D]
u06                   [Y, B]
u07                [Y, B, E]
u08                   [Y, B]
u09                   [Y, B]
u10             [Y, B, C, G]
Name: item, dtype: object


| |B|C|D|E|F|G|X|Y|
|-|-|-|-|-|-|-|-|-|
|0|True|True|True|True|True|False|True|True|
|1|True|True|True|True|False|False|True|True|
|2|False|True|True|True|False|False|True|True|
|3|False|True|True|False|False|False|True|True|
|4|False|True|True|False|False|False|True|True|
|5|True|False|False|False|False|False|False|True|
|6|True|False|False|True|False|False|False|True|
|7|True|False|False|False|False|False|False|True|
|8|True|False|False|False|False|False|False|True|
|9|True|True|False|False|False|True|False|True|


### Pairwise Rules for Item X and Interpretation
I will first calculate the support of each item set using the `apriori` algorithm.<br>
Then I will focus on the pairwise rules related to **item X**.<br>

In [3]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

freq_sets = apriori(df, min_support=0.01, max_len=2, use_colnames=True)
print(freq_sets.shape)
freq_sets.sample(5)

# Note that each item set is unordered

(32, 2)


| |support|itemsets|
|-|-------|--------|
|14|0.7|(Y, B)|
|18|0.1|(G, C)|
|4|0.1|(F)|
|25|0.1|(F, E)|
|21|0.3|(E, D)|


In [4]:
rules = association_rules(freq_sets, metric='confidence', min_threshold=0)

rules.to_csv('output/df_rules_apriori.csv', index=False)
print(rules.shape)
print(rules.columns)

(48, 9)
Index(['antecedents', 'consequents', 'antecedent support',
       'consequent support', 'support', 'confidence', 'lift', 'leverage',
       'conviction'],
      dtype='object')


In [5]:
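# Keep one-to-one rules whose antecedent is item X, ranked by leverage.
# Sorting is chained instead of done in place on a filtered slice.
is_pair = (rules['antecedents'].apply(len)==1) & (rules['consequents'].apply(len)==1)
rules_pair_X = rules[is_pair & (rules['antecedents']=={'X'})].sort_values('leverage', ascending=False)
print(rules_pair_X.shape)
rules_pair_X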

(6, 9)


| |antecedents|consequents|antecedent support|consequent support|support|confidence|lift|leverage|conviction|
|-|-|-|-|-|-|-|-|-|-|
|30|(X)|(D)|0.5|0.5|0.5|1.0|2.0|0.25|inf|
|22|(X)|(C)|0.5|0.6|0.5|1.0|1.666667|0.2|inf|
|36|(X)|(E)|0.5|0.4|0.3|0.6|1.5|0.1|1.5|
|40|(X)|(F)|0.5|0.1|0.1|0.2|2.0|0.05|1.125|
|46|(X)|(Y)|0.5|1.0|0.5|1.0|1.0|0.0|inf|
|10|(X)|(B)|0.5|0.7|0.2|0.4|0.571429|-0.15|0.5|


The model results align with the earlier face validation:
- **Item D** is **most associated** with **item X**.
- **Item B** is **dissociated** from **item X** (negative leverage value).
- **Item Y** is **independent** of **item X** (leverage value of 0).
- Leverage also reflects the **association strength** of the items.

The results also demonstrate how **leverage** carries more information than the simple **support** or **confidence** measures.<br>
In addition, **item F** demonstrates how the **lift** measure is biased towards items with very small support.<br>
<br>
Note: Item G is not shown in this table, because items X and G are never bought together. This is due to a recent upgrade of the `mlxtend` package, which no longer mines item sets with 0 support.
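
The lift bias can be reproduced by hand. In this minimal sketch (the helper names `support`, `lift` and `leverage` are my own, not `mlxtend` functions), the rules X → D and X → F receive the same lift even though F appears in a single transaction, while leverage keeps them apart.

```python
from fractions import Fraction

# Toy transactions from the table in the "Toy Problem" section (u01 to u10).
baskets = ["XYBCDEF", "XYBCDE", "XYCDE", "XYCD", "XYCD",
           "YB", "YBE", "YB", "YB", "YBCG"]

def support(*items):
    """Share of transactions that contain every item in `items`."""
    hits = sum(all(item in basket for item in items) for basket in baskets)
    return Fraction(hits, len(baskets))

def lift(a, c):
    """lift(A -> C) = support(A -> C) / (support(A) * support(C))."""
    return support(a, c) / (support(a) * support(c))

def leverage(a, c):
    """leverage(A -> C) = support(A -> C) - support(A) * support(C)."""
    return support(a, c) - support(a) * support(c)

for c in "DF":
    print(c, float(support("X", c)), float(lift("X", c)), float(leverage("X", c)))
# D 0.5 2.0 0.25
# F 0.1 2.0 0.05
```

Lift is a ratio, so a single co-purchase of a rare item is enough to produce a high value. Leverage is a difference of supports, so it stays small when the joint support is small.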

### Pairwise Rules for Item Y and Interpretation

In [6]:
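# Keep one-to-one rules whose antecedent is item Y, ranked by leverage.
# Sorting is chained instead of done in place on a filtered slice.
is_pair = (rules['antecedents'].apply(len)==1) & (rules['consequents'].apply(len)==1)
rules_pair_Y = rules[is_pair & (rules['antecedents']=={'Y'})].sort_values('leverage', ascending=False)
print(rules_pair_Y.shape)
rules_pair_Y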

(7, 9)


| |antecedents|consequents|antecedent support|consequent support|support|confidence|lift|leverage|conviction|
|-|-|-|-|-|-|-|-|-|-|
|12|(Y)|(B)|1.0|0.7|0.7|0.7|1.0|0.0|1.0|
|24|(Y)|(C)|1.0|0.6|0.6|0.6|1.0|0.0|1.0|
|32|(Y)|(D)|1.0|0.5|0.5|0.5|1.0|0.0|1.0|
|38|(Y)|(E)|1.0|0.4|0.4|0.4|1.0|0.0|1.0|
|42|(Y)|(F)|1.0|0.1|0.1|0.1|1.0|0.0|1.0|
|44|(Y)|(G)|1.0|0.1|0.1|0.1|1.0|0.0|1.0|
|47|(Y)|(X)|1.0|0.5|0.5|0.5|1.0|0.0|1.0|


Item Y is always purchased by customers regardless of other items.<br>The model reflects this with a **leverage value of 0** in all sets: item Y is **independent** of all other items.
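
The zero leverage follows directly from the definition. As a short derivation, assume only that item Y appears in every transaction, so that support(Y) = 1 and adding Y to any item set leaves its support unchanged. Then, for any other item C:

$$ leverage (Y→C) = support(Y→C) − support(Y) × support(C) = support(C) − 1 × support(C) = 0 $$

The same substitution gives a lift of 1 and a confidence equal to support(C), which matches every row of the table above.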

# Eclat
The performance of `Apriori` deteriorates badly when the problem scales up.<br>
One alternative is the `Eclat` algorithm, which uses a depth-first search to avoid repeated scanning of the data.<br>
I will use the `eclat` function from the `fim` package ([pyfim](https://borgelt.net/pyfim.html)).<br>
This package also supports a weighted Eclat algorithm.
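
To illustrate the idea behind Eclat, here is a minimal pure-Python sketch. It is not the `fim` implementation, and the function name `eclat_sketch` and its parameters are my own. The data is scanned once to build a "vertical" layout (item → set of transaction ids); the support of any longer item set is then obtained by intersecting these id sets during a depth-first search.

```python
def eclat_sketch(baskets, min_count=1, max_len=2):
    """Depth-first frequent item set search on a vertical (item -> tid set) layout."""
    # A single pass over the data builds the vertical layout.
    tidsets = {}
    for tid, items in enumerate(baskets):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    counts = {}

    def extend(prefix, candidates):
        for i, (item, tids) in enumerate(candidates):
            if len(tids) < min_count:
                continue
            itemset = prefix + (item,)
            counts[itemset] = len(tids)
            if len(itemset) < max_len:
                # The support of a longer set is the size of a tid-set intersection,
                # so the raw transactions are never scanned again.
                narrowed = [(other, tids & other_tids)
                            for other, other_tids in candidates[i + 1:]]
                extend(itemset, narrowed)

    extend((), sorted(tidsets.items()))
    return counts

# Toy transactions from the table in the "Toy Problem" section (u01 to u10).
baskets = ["XYBCDEF", "XYBCDE", "XYCDE", "XYCD", "XYCD",
           "YB", "YBE", "YB", "YB", "YBCG"]

counts = eclat_sketch(baskets)
print(len(counts))           # 32: 8 single items + 24 pairs bought together at least once
print(counts[("D", "X")])    # 5
```

Item sets are stored as tuples in alphabetical order, and absolute counts are returned rather than relative supports. Both are choices made for this sketch only.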

### Build Transactions Table

In [7]:
import pandas as pd
pd.set_option('mode.chained_assignment', None)

dfl = pd.read_csv('data/toy_problem.csv')
dfl.head()

| |user|item|
|-|----|----|
|0|u01|X|
|1|u01|Y|
|2|u01|B|
|3|u01|C|
|4|u01|D|


In [8]:
se_items = dfl.groupby('user')['item'].apply(list)
print(se_items)

user
u01    [X, Y, B, C, D, E, F]
u02       [X, Y, B, C, D, E]
u03          [X, Y, C, D, E]
u04             [X, Y, C, D]
u05             [X, Y, C, D]
u06                   [Y, B]
u07                [Y, B, E]
u08                   [Y, B]
u09                   [Y, B]
u10             [Y, B, C, G]
Name: item, dtype: object


### Pairwise Rules for Item X
Notes on the `report` parameter of the `eclat` function:
- `y`: relative head item support
- `x`: relative body set support
- `s`: relative item set support
- `c`: confidence
- `l`: lift
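
The order of the letters in `report` determines the order of the reported values. The small helper below is hypothetical (it is not part of `fim`): it maps the codes listed above to the column names used in this notebook. It assumes that each rule is returned as the head item, then the body, then one value per report code in order, which is how the next cell unpacks the result.

```python
# Hypothetical helper, not part of the fim package.
REPORT_COLUMNS = {
    "y": "supp_con",    # relative head item support (the consequent)
    "x": "supp_ant",    # relative body set support (the antecedent)
    "s": "support",     # relative item set support
    "c": "confidence",
    "l": "lift",
}

def rule_columns(report):
    """Column names for rules laid out as (head, body, one value per report code)."""
    return ["consequents", "antecedents"] + [REPORT_COLUMNS[code] for code in report]

print(rule_columns("yxscl"))
# ['consequents', 'antecedents', 'supp_con', 'supp_ant', 'support', 'confidence', 'lift']
```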

In [9]:
from fim import eclat

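# supp=0 and conf=0 impose no minimum support or confidence; zmin=zmax=2 keeps pairwise rules only
rules = eclat(se_items, target='r', supp=0, conf=0, zmin=2, zmax=2, report='yxscl')
# Each rule is unpacked as (head, body, y, x, s, c, l): the head is the consequent, the body the antecedent
df_rules = pd.DataFrame(rules, columns=['consequents','antecedents','supp_con','supp_ant','support','confidence','lift'])
df_rules = df_rules[['antecedents','consequents','supp_ant','supp_con','support','confidence','lift']]
# Leverage is not among the reported values, so derive it from the three supports
df_rules['leverage'] = df_rules['support'] - df_rules['supp_con']*df_rules['supp_ant']
# The body is a one-item tuple; unpack it to a plain item label
df_rules['antecedents'] = df_rules['antecedents'].apply(lambda t:t[0])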

df_rules.to_csv('output/df_rules_eclat.csv', index=False)

rules_pair_X = df_rules[df_rules['antecedents']=='X'].sort_values('leverage', ascending=False)
rules_pair_X

| |antecedents|consequents|supp_ant|supp_con|support|confidence|lift|leverage|
|-|-|-|-|-|-|-|-|-|
|19|X|D|0.5|0.5|0.5|1.0|2.0|0.25|
|10|X|C|0.5|0.6|0.5|1.0|1.666667|0.2|
|27|X|E|0.5|0.4|0.3|0.6|1.5|0.1|
|37|X|F|0.5|0.1|0.1|0.2|2.0|0.05|
|6|X|Y|0.5|1.0|0.5|1.0|1.0|0.0|
|8|X|B|0.5|0.7|0.2|0.4|0.571429|-0.15|


### Pairwise Rules for Item Y

In [10]:
rules_pair_Y = df_rules[df_rules['antecedents']=='Y'].sort_values('leverage', ascending=False)
rules_pair_Y

| |antecedents|consequents|supp_ant|supp_con|support|confidence|lift|leverage|
|-|-|-|-|-|-|-|-|-|
|1|Y|B|1.0|0.7|0.7|0.7|1.0|0.0|
|3|Y|C|1.0|0.6|0.6|0.6|1.0|0.0|
|7|Y|X|1.0|0.5|0.5|0.5|1.0|0.0|
|13|Y|D|1.0|0.5|0.5|0.5|1.0|0.0|
|21|Y|E|1.0|0.4|0.4|0.4|1.0|0.0|
|31|Y|F|1.0|0.1|0.1|0.1|1.0|0.0|
|43|Y|G|1.0|0.1|0.1|0.1|1.0|0.0|


# Visualize Association Rules with Network Plot
I will use R to visualize the association rules, because I previously developed several key utilities in R:
- utilities to identify loops in the network, which represent "baskets of items"
- utilities for network plotting

In [11]:
import subprocess

subprocess.check_output(['Rscript','plot_association_rules.r'])


b'[1] "Association Rules Visualized."\n'

### Basket with Item X

<img src="output/association_network_plot_X.jpg" style="width:600px;height:400px;">

From the visualization of the association rules related to item X, items X, C and D tend to form a "basket": these three items are bought together by a customer. This also demonstrates how pairwise association rules can **reconstruct bigger baskets** with more than 2 items.<br>
<br>
This result also identifies an **up-sell opportunity** between items X and D (same category), and a **cross-sell opportunity** between items X and C (different categories).
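
The labelling used in the paragraph above can be expressed with the item-category mapping from the "Up-Sell and Cross-Sell" section. This is a minimal sketch under that reading (same category means up-sell, different categories mean cross-sell); the `category` dictionary and the `opportunity` function are illustrative names, not part of the R utilities.

```python
# Item-category mapping copied from the "Up-Sell and Cross-Sell" table.
category = {
    "X": "essentials", "D": "essentials", "E": "essentials",
    "Y": "signature", "B": "signature",
    "C": "occasionals", "F": "occasionals", "G": "occasionals",
}

def opportunity(item_a, item_b):
    """Label an associated pair as up-sell (same category) or cross-sell (different)."""
    return "up-sell" if category[item_a] == category[item_b] else "cross-sell"

print(opportunity("X", "D"))  # up-sell
print(opportunity("X", "C"))  # cross-sell
```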


### Baskets from All Items

<img src="output/association_network_plot_All.jpg" style="width:600px;height:600px;">

By visualizing all pairwise association rules with a sufficient leverage value, we can observe that:
- items X, C, D and E form a big basket, where these 4 items are likely to be bought together.
- items E and F form another, weaker basket.
- items G, B and Y are independent items, but the reasons behind their independence differ:
    - item G is rarely bought
    - items B and Y are (almost) always bought
