# Association Rules

#### In this section, we will learn how to use association rules in Python and how to filter data based on different association rule metrics.


The following libraries are used for association rules:
- pandas
- numpy
- matplotlib
- mlxtend

In [1]:
# Import necessary modules

import numpy as np
import pandas as pd
import csv
from matplotlib import pyplot as plt

# Import FP-growth and Apriori modules, TransactionEncoder module and association module from mlxtend

from mlxtend.frequent_patterns import apriori
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import association_rules as arule
from mlxtend.frequent_patterns import fpgrowth

### Read data from Repair.csv

We use the Repair.csv file as the data set. The FP-growth and Apriori algorithms are used to find frequent itemsets in it.


In [2]:
# Read file 'Repair.csv' and change the data format for applying algorithms

data_set = []

with open("Repair.csv") as csvFile:
    reader = csv.reader(csvFile)
    for row in reader:
        data_set.append(row)


### FP-growth algorithm

The FP-growth algorithm is used here to find frequent itemsets in the Repair.csv data set.

In [3]:
# Use TransactionEncoder to convert the list of transactions into a one-hot encoded DataFrame for the FP-growth algorithm in mlxtend

te = TransactionEncoder()
te_ary = te.fit(data_set).transform(data_set)
data = pd.DataFrame(te_ary, columns = te.columns_)
data.tail(5)




,Analyze Defect,Archive Repair,Inform User,Register,Repair (Complex),Repair (Simple),Restart Repair,Test Repair
1099,True,True,True,True,True,False,False,True
1100,True,True,True,True,True,False,False,True
1101,True,True,True,True,True,False,False,True
1102,True,True,True,True,True,False,False,True
1103,True,True,True,True,False,True,True,True
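
To make the encoding step concrete, here is a minimal sketch of what a one-hot transaction encoding does, written with the standard library only. It is not mlxtend's implementation; the helper name `one_hot_encode` and the two toy transactions are hypothetical, with activity names borrowed from the table above. Columns are sorted alphabetically, which matches the column order seen in that table.

```python
# Minimal sketch of one-hot encoding a list of transactions (standard library only).
# `one_hot_encode` is a hypothetical helper, not part of mlxtend.
def one_hot_encode(transactions):
    """Return (columns, rows): sorted item names and one boolean row per transaction."""
    columns = sorted({item for transaction in transactions for item in transaction})
    rows = []
    for transaction in transactions:
        present = set(transaction)
        rows.append([item in present for item in columns])
    return columns, rows


# Hypothetical toy transactions, not taken from Repair.csv.
toy_transactions = [
    ["Register", "Analyze Defect", "Repair (Simple)"],
    ["Register", "Analyze Defect", "Repair (Complex)", "Test Repair"],
]

columns, rows = one_hot_encode(toy_transactions)
print(columns)
for row in rows:
    print(row)
```

Each row answers, for every known item, whether that item occurs in the transaction, which is the shape the frequent itemset functions expect.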


In [4]:
# Mine frequent itemsets with FP-growth (minimum support 0.3)
frequent_itemsets = fpgrowth(data, min_support=0.3, use_colnames=True)
print(frequent_itemsets)

# Add the number of items in each itemset, then keep itemsets with more than 3 items and support above 0.3
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets_filtered = frequent_itemsets.loc[(frequent_itemsets['length'] > 3) & (frequent_itemsets['support'] > 0.3)]
frequent_itemsets_filtered

     support                                           itemsets
0   1.000000                                         (Register)
1   1.000000                                   (Analyze Defect)
2   0.998188                                      (Test Repair)
3   0.998188                                      (Inform User)
4   0.905797                                   (Archive Repair)
..       ...                                                ...
90  0.386775  (Repair (Simple), Archive Repair, Analyze Defe...
91  0.386775  (Repair (Simple), Archive Repair, Register, Te...
92  0.386775  (Repair (Simple), Archive Repair, Register, An...
93  0.386775  (Repair (Simple), Archive Repair, Register, An...
94  0.386775  (Repair (Simple), Archive Repair, Register, An...

[95 rows x 2 columns]


,support,itemsets,length
17,0.998188,"(Test Repair, Analyze Defect, Inform User, Reg...",4
28,0.905797,"(Test Repair, Archive Repair, Analyze Defect, ...",4
29,0.905797,"(Test Repair, Archive Repair, Inform User, Reg...",4
30,0.905797,"(Archive Repair, Analyze Defect, Inform User, ...",4
31,0.905797,"(Test Repair, Archive Repair, Analyze Defect, ...",4
32,0.905797,"(Archive Repair, Register, Analyze Defect, Tes...",5
48,0.595109,"(Analyze Defect, Inform User, Repair (Complex)...",4
49,0.595109,"(Test Repair, Analyze Defect, Repair (Complex)...",4
50,0.550725,"(Archive Repair, Analyze Defect, Repair (Compl...",4
51,0.595109,"(Test Repair, Analyze Defect, Inform User, Rep...",4
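
The `support` column reports the fraction of transactions that contain every item of an itemset. The sketch below computes that fraction directly; the function name `support` and the toy transactions are hypothetical and are not taken from Repair.csv.

```python
# Minimal sketch: support of an itemset = share of transactions containing all its items.
def support(itemset, transactions):
    """Return the fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    wanted = frozenset(itemset)
    hits = sum(1 for transaction in transactions if wanted <= set(transaction))
    return hits / len(transactions)


# Hypothetical toy transactions using activity names from the tables above.
toy_transactions = [
    ["Register", "Analyze Defect", "Repair (Simple)", "Test Repair"],
    ["Register", "Analyze Defect", "Repair (Complex)", "Test Repair"],
    ["Register", "Analyze Defect", "Repair (Complex)"],
    ["Register", "Analyze Defect", "Repair (Simple)", "Test Repair"],
]

print(support({"Register"}, toy_transactions))                        # 1.0
print(support({"Test Repair"}, toy_transactions))                     # 0.75
print(support({"Repair (Simple)", "Test Repair"}, toy_transactions))  # 0.5
```

An itemset can never have higher support than any of its subsets, which is why the longer itemsets in the table have support no larger than the single activities they contain.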


### Apriori algorithm

The Apriori algorithm is used here to find frequent itemsets in the same Repair.csv data set.

In [5]:
# Use TransactionEncoder to convert the list of transactions into a one-hot encoded DataFrame for the Apriori algorithm in mlxtend
# (data_set was already read from 'Repair.csv' above, so this repeats the encoding step from the FP-growth section)


te = TransactionEncoder()
te_ary = te.fit(data_set).transform(data_set)
data = pd.DataFrame(te_ary, columns = te.columns_)
data.tail(5)


,Analyze Defect,Archive Repair,Inform User,Register,Repair (Complex),Repair (Simple),Restart Repair,Test Repair
1099,True,True,True,True,True,False,False,True
1100,True,True,True,True,True,False,False,True
1101,True,True,True,True,True,False,False,True
1102,True,True,True,True,True,False,False,True
1103,True,True,True,True,False,True,True,True


In [6]:
frequent_itemsets = apriori(data, min_support = 0.3, use_colnames = True)
frequent_itemsets

,support,itemsets
0,1.000000,(Analyze Defect)
1,0.905797,(Archive Repair)
2,0.998188,(Inform User)
3,1.000000,(Register)
4,0.596920,(Repair (Complex))
...,...,...
90,0.438406,"(Repair (Simple), Register, Analyze Defect, Te..."
91,0.550725,"(Archive Repair, Register, Test Repair, Inform..."
92,0.386775,"(Repair (Simple), Archive Repair, Register, Te..."
93,0.550725,"(Archive Repair, Register, Analyze Defect, Tes..."
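
For intuition about how Apriori arrives at such a table, the following is a simplified level-wise sketch in plain Python. It is an illustration under stated assumptions, not mlxtend's implementation: the function name `apriori_sketch` and the toy transactions are hypothetical, and an itemset counts as frequent when its support is greater than or equal to `min_support`.

```python
from itertools import combinations


# Simplified level-wise Apriori sketch (hypothetical helper, not mlxtend's code).
def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for itemsets with support >= min_support."""
    transaction_sets = [frozenset(t) for t in transactions]
    n = len(transaction_sets)
    if n == 0:
        return {}

    def support_of(candidate):
        return sum(1 for t in transaction_sets if candidate <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transaction_sets for item in t}
    current = {}
    for item in items:
        candidate = frozenset([item])
        s = support_of(candidate)
        if s >= min_support:
            current[candidate] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for candidate in candidates:
            s = support_of(candidate)
            if s >= min_support:
                current[candidate] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy transactions, not taken from Repair.csv.
toy_transactions = [
    ["Register", "Analyze Defect", "Repair (Simple)"],
    ["Register", "Analyze Defect"],
    ["Register", "Repair (Simple)"],
    ["Register", "Analyze Defect", "Repair (Simple)"],
]

result = apriori_sketch(toy_transactions, min_support=0.6)
for itemset, s in sorted(result.items(), key=lambda pair: (len(pair[0]), sorted(pair[0]))):
    print(sorted(itemset), s)
```

With these toy transactions the pair (Analyze Defect, Repair (Simple)) has support 0.5 and is dropped, so the three-item candidate is pruned without its support ever being counted.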


### Filtering data based on metrics of association rules
In Python, you can filter the association rules generated from frequent itemsets by metrics such as support, confidence, and lift.

In [7]:
# Generate association rules with mlxtend and keep those whose lift is at least 0.8 (filtering on one metric).

rules_association = arule(frequent_itemsets, metric = 'lift', min_threshold = 0.8)
rules_association

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Archive Repair),(Analyze Defect),0.905797,1.000000,0.905797,1.000000,1.000000,0.000000,inf
1,(Analyze Defect),(Archive Repair),1.000000,0.905797,0.905797,0.905797,1.000000,0.000000,1.000000
2,(Analyze Defect),(Inform User),1.000000,0.998188,0.998188,0.998188,1.000000,0.000000,1.000000
3,(Inform User),(Analyze Defect),0.998188,1.000000,0.998188,1.000000,1.000000,0.000000,inf
4,(Analyze Defect),(Register),1.000000,1.000000,1.000000,1.000000,1.000000,0.000000,inf
...,...,...,...,...,...,...,...,...,...
1019,(Archive Repair),"(Repair (Simple), Register, Analyze Defect, Te...",0.905797,0.438406,0.386775,0.427000,0.973983,-0.010331,0.980095
1020,(Register),"(Repair (Simple), Archive Repair, Analyze Defe...",1.000000,0.386775,0.386775,0.386775,1.000000,0.000000,1.000000
1021,(Analyze Defect),"(Repair (Simple), Archive Repair, Register, Te...",1.000000,0.386775,0.386775,0.386775,1.000000,0.000000,1.000000
1022,(Test Repair),"(Repair (Simple), Archive Repair, Register, An...",0.998188,0.386775,0.386775,0.387477,1.001815,0.000701,1.001146
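
The metric columns in this table can be recomputed from the three support values of a rule. The sketch below uses the standard definitions of confidence, lift, leverage, and conviction; the helper name `rule_metrics` is hypothetical, and the input numbers are copied from row 1019 of the table above.

```python
import math


# Minimal sketch: derive rule metrics from support values (hypothetical helper).
def rule_metrics(antecedent_support, consequent_support, joint_support):
    """Compute confidence, lift, leverage and conviction for a rule A -> C."""
    confidence = joint_support / antecedent_support
    lift = confidence / consequent_support
    leverage = joint_support - antecedent_support * consequent_support
    # Conviction is infinite when the rule always holds (confidence of 1).
    if confidence >= 1.0:
        conviction = math.inf
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Support values from row 1019 of the table above.
metrics = rule_metrics(
    antecedent_support=0.905797,
    consequent_support=0.438406,
    joint_support=0.386775,
)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Up to rounding, the printed values agree with the confidence, lift, leverage, and conviction reported for that row. A lift of 1 and a leverage of 0 mean the antecedent and consequent occur together exactly as often as expected if they were independent, which is the case for most rules here because several activities appear in nearly every transaction.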


#### Question: Try other metrics, such as confidence and support, and investigate how the table changes.

### Finding qualified frequent itemsets using association rules
The minimum support threshold determines which itemsets qualify as frequent, so it should be chosen to suit the dataset.

#### Question: Find the frequent itemsets with a minimum support of 0.2. Store them in the variable frequent_itemsets.


In [8]:
# Answer
frequent_itemsets = apriori(data, min_support = 0.2, use_colnames = True)
frequent_itemsets

,support,itemsets
0,1.000000,(Analyze Defect)
1,0.905797,(Archive Repair)
2,0.998188,(Inform User)
3,1.000000,(Register)
4,0.596920,(Repair (Complex))
...,...,...
138,0.219203,"(Repair (Simple), Register, Restart Repair, Te..."
139,0.550725,"(Archive Repair, Register, Analyze Defect, Tes..."
140,0.386775,"(Repair (Simple), Archive Repair, Register, An..."
141,0.240036,"(Archive Repair, Register, Analyze Defect, Res..."


### Filtering itemsets based on length and metrics of association rules
In this section, you will learn how to filter frequent itemsets based on their length.

In [9]:
# Add a column named 'length' to 'frequent_itemsets' that holds the number of items in each frequent itemset.

frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)

# Keep only the frequent itemsets with more than 2 items and a support greater than 0.3.

# Store the selected itemsets in the variable 'frequent_itemsets_filtered'.

frequent_itemsets_filtered = frequent_itemsets.loc[(frequent_itemsets['length'] > 2) & (frequent_itemsets['support'] > 0.3)]
frequent_itemsets_filtered

,support,itemsets,length
34,0.905797,"(Archive Repair, Analyze Defect, Inform User)",3
35,0.905797,"(Archive Repair, Analyze Defect, Register)",3
36,0.550725,"(Archive Repair, Analyze Defect, Repair (Compl...",3
37,0.386775,"(Archive Repair, Analyze Defect, Repair (Simple))",3
39,0.905797,"(Test Repair, Archive Repair, Analyze Defect)",3
...,...,...,...
131,0.438406,"(Repair (Simple), Register, Analyze Defect, Te...",5
135,0.550725,"(Archive Repair, Register, Test Repair, Inform...",5
136,0.386775,"(Repair (Simple), Archive Repair, Register, Te...",5
139,0.550725,"(Archive Repair, Register, Analyze Defect, Tes...",6


### Displaying selected metrics of association rules in one table
In this section, you will learn how to show selected metrics of association rules in one table.

In [10]:
# Mine association rules from the frequent itemsets stored in the variable 'frequent_itemsets', with a minimum confidence of 0.5.

# Store the discovered rules in the variable 'rules_association'.

rules_association = arule(frequent_itemsets, metric = 'confidence', min_threshold = 0.5)

# Keep only the rules with lift greater than 1 and support greater than 0.4, and store them in the variable 'filtered_rules'.

filtered_rules = rules_association.loc[(rules_association['lift'] > 1) & (rules_association['support'] > 0.4)]

# Show the columns 'support', 'confidence' and 'lift' of 'filtered_rules'
# (add 'antecedents' and 'consequents' to the list to see which rule each row describes).

filtered_rules[['support', 'confidence', 'lift']]

,support,confidence,lift
12,0.905797,1.000000,1.001815
13,0.905797,0.907441,1.001815
16,0.550725,0.608000,1.018561
17,0.550725,0.922610,1.018561
20,0.905797,0.907441,1.001815
...,...,...,...
1135,0.550725,0.925419,1.021662
1136,0.550725,0.608000,1.021662
1139,0.550725,0.551724,1.001815
1140,0.550725,0.551724,1.001815
