# Data Science and Visualization (RUC F2023)

## Lecture 9: Association Rules

# Apriori Algorithm for Association Rule Mining

This notebook demonstrates how to use the **mlxtend** library to mine association rules.

To install mlxtend, run the following in Anaconda Prompt:

pip install mlxtend

## 0. Setup and data loading

In [18]:
import numpy as np
import pandas as pd

# header=None makes pandas treat the first row as data rather than as column names
dataset = pd.read_csv('C:/Data/store_data.csv', header=None)
dataset.shape

(7501, 20)

Let's take a look at the format of the dataset:

In [19]:
dataset.head()

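|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | shrimp | almonds | avocado | vegetables mix | green grapes | whole weat flour | yams | cottage cheese | energy drink | tomato juice | low fat yogurt | green tea | honey | salad | mineral water | salmon | antioxydant juice | frozen smoothie | spinach | olive oil |
| 1 | burgers | meatballs | eggs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | chutney | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | turkey | avocado | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | mineral water | milk | energy bar | whole wheat rice | green tea | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |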


Each row represents a transaction. The columns carry no meaning here: the values in the same column do NOT represent the same thing or the same type.

We can also see that most rows contain many NULL (missing) values. Let's count them per row:

In [4]:
pd.set_option('display.max_rows', None)

dataset.isnull().sum(axis=1)

0        0
1       17
2       19
3       18
4       15
5       19
6       18
7       17
8       17
9       19
10      18
11      19
12      15
13      17
14      18
15      19
16      13
17      18
18      10
19      15
20      13
21      18
22      14
23      12
24      19
25      14
26      11
27      19
28      15
29      16
30      16
31      16
32      17
33      17
34      19
35      16
36      18
37      17
38      17
39      17
40      19
41      12
42      19
43      16
44      19
45      12
46      16
47      18
48      18
49      17
50      15
51      17
52      18
53      18
54      17
55      18
56       9
57      18
58      13
59      14
60      19
61      19
62      19
63      15
64      18
65      19
66      18
67      13
68      19
69      18
70      15
71      17
72      19
73      19
74      18
75      19
76      17
77      18
78      15
79      19
80      17
81      15
82      16
83      14
84      19
85      18
86      18
87      19
88      18
89      16
90      19

In [21]:
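# Caution: column 0 holds the first item of every transaction (see dataset.head() above),
# so deleting it drops one item per row; the outputs below were produced with this line in place
del dataset[dataset.columns[0]]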

dataset.head()

Let's see how many items are in each row/transaction:

In [26]:
dataset.count(axis='columns')

0       19
1        2
2        0
3        1
4        4
5        0
6        1
7        2
8        2
9        0
10       1
11       0
12       4
13       2
14       1
15       0
16       6
17       1
18       9
19       4
20       6
21       1
22       5
23       7
24       0
25       5
26       8
27       0
28       4
29       3
30       3
31       3
32       2
33       2
34       0
35       3
36       1
37       2
38       2
39       2
40       0
41       7
42       0
43       3
44       0
45       7
46       3
47       1
48       1
49       2
50       4
51       2
52       1
53       1
54       2
55       1
56      10
57       1
58       6
59       5
60       0
61       0
62       0
63       4
64       1
65       0
66       1
67       6
68       0
69       1
70       4
71       2
72       0
73       0
74       1
75       0
76       2
77       1
78       4
79       0
80       2
81       4
82       3
83       5
84       0
85       1
86       1
87       0
88       1
89       3
90       0

## 1. Data preparation

We first transform the dataset into a transaction table in the format of *a list of lists*. A small example is given below. Notice the two levels of '[ ]' pairs:

[['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],

['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],

['Milk', 'Apple', 'Kidney Beans', 'Eggs'],

['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],

['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

This intermediate format is needed by the function to be used in subsequent operations.

In [30]:
transactions = []

# We loop on all rows, i.e., transactions, in the original dataset
for i in range(0, dataset.shape[0]):
    # We transform the current transaction/row
    transaction = []
    # We loop on all values on the current row
    for j in range(0, dataset.shape[1]):
        # We get the current value on this row
        item = dataset.iloc[i, j]
        # Items are left-aligned within a row, so the first NULL marks the end of the transaction
        if pd.isnull(item):
            break
        else:
            # Otherwise, we append this item to the current transaction
            transaction.append(item)
        
    transactions.append(transaction)

# Let's see the 2nd transaction in the transaction table
print(transactions[1])

['meatballs', 'eggs']
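
The nested loop above can also be written with pandas' own missing-value handling. The following is a minimal sketch on a small stand-in frame (`toy` and `toy_transactions` are hypothetical names, not part of the store data). Note that `dropna()` removes every missing value in a row, which matches breaking at the first NULL only when the items are left-aligned, as they are in this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the store data: one transaction per row, padded with NaN
toy = pd.DataFrame([
    ["burgers", "meatballs", "eggs"],
    ["chutney", np.nan, np.nan],
    ["turkey", "avocado", np.nan],
])

# For each row, drop the missing values and keep the remaining items as a list
toy_transactions = [row.dropna().tolist() for _, row in toy.iterrows()]

print(toy_transactions)
```

On the real data, the same comprehension would be applied to `dataset` instead of `toy`.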


The **apriori()** function from **mlxtend.frequent_patterns** expects the data as a pandas DataFrame in a special format, where
* the columns are all the distinct items
* each transaction is a row of boolean values: True (or False) indicates that the corresponding item is present in (or absent from) the transaction.

Therefore we need to transform the transaction table. This is done with **TransactionEncoder**:

In [31]:
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
# The two steps, fit and transform, are merged into one call below; the two-step equivalent is:
#te_ary = te.fit(transactions).transform(transactions)
te_ary = te.fit_transform(transactions)

# We generate a DataFrame for the transformed dataset
df = pd.DataFrame(te_ary, columns=te.columns_)
df

|  | `' asparagus'` | almonds | antioxydant juice | asparagus | avocado | babies food | bacon | barbecue sauce | black tea | blueberries | ... | turkey | vegetables mix | water spray | white wine | whole weat flour | whole wheat pasta | whole wheat rice | yams | yogurt cake | zucchini |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | True | True | False | True | False | False | False | False | False | ... | False | True | False | False | True | False | False | True | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | True | False | False | False |
| 5 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 6 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 7 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 8 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 9 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
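
To make the encoding concrete, here is a rough pure-Python sketch of the same idea on three toy transactions. It only illustrates the output format (sorted item columns, one boolean row per transaction) and is not mlxtend's implementation; the names `toy_transactions`, `items` and `encoded` are made up for this example.

```python
# Toy transactions (hypothetical data, not taken from store_data.csv)
toy_transactions = [
    ["milk", "eggs"],
    ["eggs"],
    ["avocado", "milk"],
]

# The columns are all distinct items, in sorted order
items = sorted({item for t in toy_transactions for item in t})

# One row per transaction: True if the item occurs in it, otherwise False
encoded = [[item in t for item in items] for t in toy_transactions]

print(items)
for row in encoded:
    print(row)
```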


In [32]:
len(df.columns)

120

In the transformed data, the columns make sense: they're just the item names:

In [33]:
df.columns

Index([' asparagus', 'almonds', 'antioxydant juice', 'asparagus', 'avocado',
       'babies food', 'bacon', 'barbecue sauce', 'black tea', 'blueberries',
       ...
       'turkey', 'vegetables mix', 'water spray', 'white wine',
       'whole weat flour', 'whole wheat pasta', 'whole wheat rice', 'yams',
       'yogurt cake', 'zucchini'],
      dtype='object', length=120)
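
Note that the column list contains both `' asparagus'` (with a leading space) and `'asparagus'`, so the same product is counted as two different items. One possible fix, sketched here on made-up data, is to strip surrounding whitespace from every item before encoding. Whether this cleaning step suits the real dataset is an assumption, and the variable names are hypothetical.

```python
# Hypothetical transactions in which one item has a stray leading space
raw_transactions = [
    [" asparagus", "eggs"],
    ["asparagus", "milk"],
]

# Strip leading/trailing whitespace from each item name
clean_transactions = [[item.strip() for item in t] for t in raw_transactions]

# Count the distinct items before and after cleaning
n_before = len({item for t in raw_transactions for item in t})
n_after = len({item for t in clean_transactions for item in t})
print(n_before, n_after)
```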

## 2. Generating Frequent Itemsets

Now let us find the frequent itemsets, i.e., those with a support of at least 1%:

In [34]:
from mlxtend.frequent_patterns import apriori

frequent_itemsets = apriori(df, min_support=0.01, use_colnames=True)

frequent_itemsets

|  | support | itemsets |
|---|---|---|
| 0 | 0.018931 | (almonds) |
| 1 | 0.02573 | (avocado) |
| 2 | 0.010399 | (barbecue sauce) |
| 3 | 0.013065 | (black tea) |
| 4 | 0.011332 | (body spray) |
| 5 | 0.030663 | (brownies) |
| 6 | 0.010399 | (burgers) |
| 7 | 0.023197 | (butter) |
| 8 | 0.067991 | (cake) |
| 9 | 0.014931 | (carrots) |
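
The support of an itemset is the fraction of transactions that contain all of its items. The sketch below computes supports level by level on the small example from Section 1 and only extends itemsets whose items are still frequent, which is the core idea of Apriori. It is a simplified illustration, not the mlxtend implementation: candidates are generated from the items that survive the previous level, and `example_transactions`, `support`, `frequent` and the 0.6 threshold are chosen for this example.

```python
from itertools import combinations

example_transactions = [
    ['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
    ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
    ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
    ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
    ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs'],
]
# Sets ignore duplicate items within a transaction
baskets = [set(t) for t in example_transactions]


def support(itemset):
    """Fraction of baskets that contain every item of `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)


min_support = 0.6
frequent = {}  # maps frozenset -> support

# Level 1: frequent single items
level = [frozenset([i]) for i in sorted(set().union(*baskets))]
level = [s for s in level if support(s) >= min_support]

k = 1
while level:
    for s in level:
        frequent[s] = support(s)
    k += 1
    # Candidates of size k use only items that occur in a frequent itemset of
    # size k-1: an itemset with an infrequent subset cannot be frequent itself
    items = sorted(set().union(*level))
    candidates = [frozenset(c) for c in combinations(items, k)]
    level = [c for c in candidates if support(c) >= min_support]

for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```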


## 3. Generating Association Rules

We obtain the inferred rules as a DataFrame via the function **association_rules()**:

In [35]:
from mlxtend.frequent_patterns import association_rules

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

rules

|  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (cake) | (eggs) | 0.067991 | 0.142514 | 0.017064 | 0.25098 | 1.761089 | 0.007375 | 1.144811 |
| 1 | (eggs) | (cake) | 0.142514 | 0.067991 | 0.017064 | 0.119738 | 1.761089 | 0.007375 | 1.058786 |
| 2 | (cake) | (french fries) | 0.067991 | 0.138382 | 0.014265 | 0.209804 | 1.516126 | 0.004856 | 1.090386 |
| 3 | (french fries) | (cake) | 0.138382 | 0.067991 | 0.014265 | 0.103083 | 1.516126 | 0.004856 | 1.039125 |
| 4 | (cake) | (green tea) | 0.067991 | 0.119184 | 0.012932 | 0.190196 | 1.595817 | 0.004828 | 1.08769 |
| 5 | (green tea) | (cake) | 0.119184 | 0.067991 | 0.012932 | 0.108501 | 1.595817 | 0.004828 | 1.045441 |
| 6 | (cake) | (milk) | 0.067991 | 0.105453 | 0.011598 | 0.170588 | 1.617677 | 0.004429 | 1.078532 |
| 7 | (milk) | (cake) | 0.105453 | 0.067991 | 0.011598 | 0.109987 | 1.617677 | 0.004429 | 1.047186 |
| 8 | (cake) | (mineral water) | 0.067991 | 0.161445 | 0.020264 | 0.298039 | 1.846071 | 0.009287 | 1.194589 |
| 9 | (mineral water) | (cake) | 0.161445 | 0.067991 | 0.020264 | 0.125516 | 1.846071 | 0.009287 | 1.065782 |


For **leverage** and **conviction**, please refer to http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/
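
As a quick check of how the rule metrics relate to each other, the sketch below recomputes confidence, lift, leverage and conviction for rule 0, (cake) → (eggs), from the three support values in the table above. The formulas are the standard definitions described in the mlxtend user guide; the variable names are chosen for this example, and small deviations from the table come from the rounded inputs.

```python
# Support values copied from rule 0 in the table above: (cake) -> (eggs)
support_a = 0.067991   # antecedent support, P(cake)
support_c = 0.142514   # consequent support, P(eggs)
support_ac = 0.017064  # support of the rule, P(cake and eggs)

# Confidence: how often eggs appear in transactions that contain cake
confidence = support_ac / support_a
# Lift: confidence relative to the baseline frequency of eggs (> 1: positive association)
lift = confidence / support_c
# Leverage: observed joint support minus the support expected under independence
leverage = support_ac - support_a * support_c
# Conviction: how much more often the rule would fail if cake and eggs were independent
conviction = (1 - support_c) / (1 - confidence)

print(confidence, lift, leverage, conviction)
```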

We can sort the rules by confidence and then by lift, both in descending order:

In [36]:
rules = rules.sort_values(['confidence', 'lift'], ascending=[False, False])
rules.head()

|  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 175 | (soup) | (mineral water) | 0.040128 | 0.161445 | 0.018531 | 0.461794 | 2.860377 | 0.012052 | 1.558056 |
| 138 | (ground beef) | (mineral water) | 0.069191 | 0.161445 | 0.030129 | 0.435453 | 2.697218 | 0.018959 | 1.485358 |
| 214 | (milk, spaghetti) | (mineral water) | 0.029329 | 0.161445 | 0.012665 | 0.431818 | 2.674705 | 0.00793 | 1.475857 |
| 207 | (ground beef, spaghetti) | (mineral water) | 0.029063 | 0.161445 | 0.012532 | 0.431193 | 2.670831 | 0.00784 | 1.474234 |
| 195 | (eggs, milk) | (mineral water) | 0.025997 | 0.161445 | 0.011199 | 0.430769 | 2.668208 | 0.007001 | 1.473137 |
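
Sorting is often combined with filtering. As a sketch, the snippet below rebuilds three of the rules shown in this notebook as a small DataFrame and keeps those above chosen thresholds for confidence and lift. `toy_rules` is a hypothetical stand-in for the real `rules` frame: it has only some of the columns and stores the itemsets as plain strings rather than frozensets, and the thresholds 0.43 and 2.5 are arbitrary.

```python
import pandas as pd

# Stand-in for `rules`, with values copied from the tables in this notebook
toy_rules = pd.DataFrame({
    "antecedents": ["soup", "ground beef", "cake"],
    "consequents": ["mineral water", "mineral water", "mineral water"],
    "confidence": [0.461794, 0.435453, 0.298039],
    "lift": [2.860377, 2.697218, 1.846071],
})

# Keep only rules that meet both thresholds
strong = toy_rules[(toy_rules["confidence"] >= 0.43) & (toy_rules["lift"] >= 2.5)]

print(strong)
```

The same boolean mask can be applied to the real `rules` DataFrame, since it has `confidence` and `lift` columns as well.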


### References
* http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
* https://www.geeksforgeeks.org/implementing-apriori-algorithm-in-python/
* http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/