# Apriori


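The Apriori algorithm mines frequent itemsets and derives association rules from a transactional database. It is controlled by two parameters, “support” and “confidence”. The support of an itemset is the fraction of transactions that contain it. The confidence of a rule A -> B is the conditional probability that a transaction containing A also contains B, that is, support(A ∪ B) / support(A).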
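
As a small illustration of the two measures, the sketch below computes support and confidence by hand. The transactions and the helper names `support` and `confidence` are made up for this example; they are not part of the basket dataset or of `mlxtend`.

```python
# Toy transactions (hypothetical data, not taken from the basket dataset).
transactions = [
    {"beer", "cannedveg", "frozenmeal"},
    {"beer", "cannedveg"},
    {"wine", "confectionery"},
    {"beer", "frozenmeal"},
    {"fruitveg", "fish"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) as support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# beer appears in 3 of the 5 transactions
print(support({"beer"}, transactions))
# 2 of the 3 beer transactions also contain cannedveg
print(confidence({"beer"}, {"cannedveg"}, transactions))
```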

A key concept in the Apriori algorithm is the anti-monotonicity of the support measure: adding items to an itemset can never increase its support. It follows that

1. All subsets of a frequent itemset must be frequent.
2. Similarly, for any infrequent itemset, all its supersets must be infrequent too.
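
The property can be checked on a toy example. The sketch below uses hypothetical transactions and two hypothetical helpers, `support_count` and `subsets_at_least_as_frequent`, to confirm that every subset of an itemset occurs at least as often as the itemset itself.

```python
from itertools import combinations

# Toy transactions (hypothetical, for illustration only).
transactions = [
    {"beer", "cannedveg", "frozenmeal"},
    {"beer", "cannedveg"},
    {"beer", "frozenmeal"},
    {"wine", "confectionery"},
]


def support_count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def subsets_at_least_as_frequent(itemset, transactions):
    """Check that no non-empty proper subset is rarer than `itemset`."""
    base = support_count(itemset, transactions)
    items = sorted(itemset)
    return all(
        support_count(subset, transactions) >= base
        for size in range(1, len(items))
        for subset in combinations(items, size)
    )


print(subsets_at_least_as_frequent({"beer", "cannedveg", "frozenmeal"}, transactions))
```

This is why Apriori can discard every superset of an infrequent itemset without counting it.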


### Algorithm
The following are the main steps of the algorithm:

1. Calculate the support of itemsets of size k = 1 in the transactional database (support is the frequency of occurrence of an itemset). This is called generating the candidate set.
2. Prune the candidate set by eliminating itemsets whose support is below the given threshold.
3. Join the frequent itemsets to form candidate sets of size k + 1, and repeat the above steps until no more frequent itemsets can be formed. This happens when every newly formed set has a support below the given threshold.
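
The three steps can be sketched in plain Python as follows. This is a minimal illustration under simplifying assumptions (small in-memory data, no optimisation), not the implementation used by `mlxtend`; the function name `apriori_sketch` and the toy transactions are hypothetical.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 1: candidate itemsets of size k = 1.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Step 2: prune candidates whose support is below the threshold.
        current = {}
        for candidate in candidates:
            s = support(candidate)
            if s >= min_support:
                current[candidate] = s
        frequent.update(current)

        # Step 3: join frequent k-itemsets into candidates of size k + 1,
        # keeping only those whose k-subsets are all frequent.
        k += 1
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent


# Toy transactions (hypothetical data).
transactions = [
    {"beer", "cannedveg", "frozenmeal"},
    {"beer", "cannedveg", "frozenmeal"},
    {"beer", "cannedveg"},
    {"wine", "confectionery"},
    {"fish", "fruitveg"},
]
frequent = apriori_sketch(transactions, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The loop stops once a pass produces no frequent itemsets, because no larger candidates can then be joined.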

### Libraries useful in Apriori are listed below

### Install the mlxtend library, which provides the Apriori implementation, using:
!pip install mlxtend

In [1]:
#import warnings
#warnings.filterwarnings('ignore')
#!pip3 install mlxtend

In [101]:
# import libraries
import pandas as pd
from sklearn import preprocessing
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

### Load the "basket" data

In [3]:
# Load dataset and display first five rows.
df = pd.read_csv("Basket Dataset/BASKETS1n")
df.head()

Unnamed: 0,cardid,value,pmethod,sex,homeown,income,age,fruitveg,freshmeat,dairy,cannedveg,cannedmeat,frozenmeal,beer,wine,softdrink,fish,confectionery
0,39808,42.7123,CHEQUE,M,NO,27000,46,F,T,T,F,F,F,F,F,F,F,T
1,67362,25.3567,CASH,F,NO,30000,28,F,T,F,F,F,F,F,F,F,F,T
2,10872,20.6176,CASH,M,NO,13200,36,F,F,F,T,F,T,T,F,F,T,F
3,26748,23.6883,CARD,F,NO,12200,26,F,F,T,F,F,F,F,T,F,F,F
4,91609,18.8133,CARD,M,YES,11000,24,F,F,F,F,F,F,F,F,F,F,F


### Perform pre-processing (if required)

In [4]:
# Keep only the product columns and encode the T/F flags as 1/0.
df = df.drop(columns=["cardid", "value", "pmethod", "sex", "homeown", "income", "age"])
cols = ["fruitveg", "freshmeat", "dairy", "cannedveg", "cannedmeat",
        "frozenmeal", "beer", "wine", "softdrink", "fish", "confectionery"]
for col in cols:
    # LabelEncoder sorts the labels, so "F" maps to 0 and "T" maps to 1.
    df[col] = preprocessing.LabelEncoder().fit_transform(df[col])
df

Unnamed: 0,fruitveg,freshmeat,dairy,cannedveg,cannedmeat,frozenmeal,beer,wine,softdrink,fish,confectionery
0,0,1,1,0,0,0,0,0,0,0,1
1,0,1,0,0,0,0,0,0,0,0,1
2,0,0,0,1,0,1,1,0,0,1,0
3,0,0,1,0,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,1,0,0,0,0,0,0,0
996,0,0,0,1,0,0,0,0,0,1,0
997,0,1,0,0,0,0,0,0,0,0,0
998,1,0,0,0,0,0,0,1,0,0,1


### Q1. Find frequent itemsets in the dataset using Apriori

In [6]:
# apriori with min support 0.1 (confidence is applied later, when rules are generated)
freqitems = apriori(df, min_support=0.1, use_colnames=True)
freqitems

Unnamed: 0,support,itemsets
0,0.299,(fruitveg)
1,0.183,(freshmeat)
2,0.177,(dairy)
3,0.303,(cannedveg)
4,0.204,(cannedmeat)
5,0.302,(frozenmeal)
6,0.293,(beer)
7,0.287,(wine)
8,0.184,(softdrink)
9,0.292,(fish)


### Q2. Find the association rules in the dataset having min confidence 10%

In [8]:
# find rules
rules = association_rules(freqitems, metric="confidence", min_threshold=0.1)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(fruitveg),(fish),0.299,0.292,0.145,0.48495,1.660787,0.057692,1.374623
1,(fish),(fruitveg),0.292,0.299,0.145,0.496575,1.660787,0.057692,1.392463
2,(frozenmeal),(cannedveg),0.302,0.303,0.173,0.572848,1.890586,0.081494,1.631736
3,(cannedveg),(frozenmeal),0.303,0.302,0.173,0.570957,1.890586,0.081494,1.626877
4,(beer),(cannedveg),0.293,0.303,0.167,0.569966,1.881075,0.078221,1.620802
5,(cannedveg),(beer),0.303,0.293,0.167,0.551155,1.881075,0.078221,1.575154
6,(frozenmeal),(beer),0.302,0.293,0.17,0.562914,1.921208,0.081514,1.61753
7,(beer),(frozenmeal),0.293,0.302,0.17,0.580205,1.921208,0.081514,1.662715
8,(confectionery),(wine),0.276,0.287,0.144,0.521739,1.817906,0.064788,1.490818
9,(wine),(confectionery),0.287,0.276,0.144,0.501742,1.817906,0.064788,1.453063
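
The remaining columns of the rules table can be recomputed from the three support values. The sketch below does this for rule 0, (fruitveg) -> (fish), using the numbers copied from the output above; the variable names are chosen for this example only.

```python
# Values copied from rule 0 in the table above: (fruitveg) -> (fish).
antecedent_support = 0.299   # support(fruitveg)
consequent_support = 0.292   # support(fish)
support = 0.145              # support(fruitveg and fish)

# confidence: how often the consequent appears given the antecedent
confidence = support / antecedent_support
# lift: observed co-occurrence relative to what independence would predict
lift = support / (antecedent_support * consequent_support)
# leverage: observed co-occurrence minus the value expected under independence
leverage = support - antecedent_support * consequent_support
# conviction: expected error rate of the rule under independence / observed error rate
conviction = (1 - consequent_support) / (1 - confidence)

print(confidence, lift, leverage, conviction)
```

A lift above 1, as here, indicates that the two items occur together more often than they would if they were independent.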


### Q3. Find association rules having minimum antecedent_len 2 & confidence greater than 0.75

In [18]:
# rules having minimum antecedent_len 2 and confidence greater than 0.75
rules["antecedent_len"] = rules["antecedents"].apply(len)
rules[(rules["antecedent_len"] > 1) & (rules["confidence"] > 0.75)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len
10,"(frozenmeal, beer)",(cannedveg),0.17,0.303,0.146,0.858824,2.834401,0.09449,4.937083,2
11,"(frozenmeal, cannedveg)",(beer),0.173,0.293,0.146,0.843931,2.880309,0.095311,4.530037,2
12,"(beer, cannedveg)",(frozenmeal),0.167,0.302,0.146,0.874251,2.894873,0.095566,5.550762,2


### Load the "zoo" data

In [43]:
# load the dataset and display first five rows
colnames = ['name', 'hair', 'feathers', 'eggs', 'mammal', 'airborne', 'aquatic', 'predator',
            'toothed', 'backbone','breathes', 'venomous','fins', 'legs', 'tail', 'domestic', 'catsize', 'type' ]
df= pd.read_csv("Zoo Dataset/zoo.data", header=None, names= colnames)
df.head()

Unnamed: 0,name,hair,feathers,eggs,mammal,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs,tail,domestic,catsize,type
0,aardvark,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
1,antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
2,bass,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
3,bear,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
4,boar,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1


### Q4. Perform pre-processing (if required)

In [44]:
#dropping first column - name
df = df.drop(columns = ["name"])

#one hot encoding column - legs
df = pd.get_dummies(df, prefix = ['legs'], columns = ['legs'])

#replacing class type and one hot encoding it
df = pd.get_dummies(df, prefix = ['type'], columns = ['type'])
df

Unnamed: 0,hair,feathers,eggs,mammal,airborne,aquatic,predator,toothed,backbone,breathes,...,legs_5,legs_6,legs_8,type_1,type_2,type_3,type_4,type_5,type_6,type_7
0,1,0,0,1,0,0,1,1,1,1,...,0,0,0,1,0,0,0,0,0,0
1,1,0,0,1,0,0,0,1,1,1,...,0,0,0,1,0,0,0,0,0,0
2,0,0,1,0,0,1,1,1,1,0,...,0,0,0,0,0,0,1,0,0,0
3,1,0,0,1,0,0,1,1,1,1,...,0,0,0,1,0,0,0,0,0,0
4,1,0,0,1,0,0,1,1,1,1,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,1,0,0,1,0,0,0,1,1,1,...,0,0,0,1,0,0,0,0,0,0
97,1,0,1,0,1,0,0,0,0,1,...,0,1,0,0,0,0,0,0,1,0
98,1,0,0,1,0,0,1,1,1,1,...,0,0,0,1,0,0,0,0,0,0
99,0,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1


### Q5. Find frequent itemsets in zoo dataset having min support 0.5 

In [45]:
# apriori with min support 0.5 (confidence is applied later, when rules are generated)
freqitems = apriori(df, min_support=0.5, use_colnames=True)
freqitems

Unnamed: 0,support,itemsets
0,0.584158,(eggs)
1,0.554455,(predator)
2,0.60396,(toothed)
3,0.821782,(backbone)
4,0.792079,(breathes)
5,0.742574,(tail)
6,0.60396,"(toothed, backbone)"
7,0.514851,"(tail, toothed)"
8,0.683168,"(breathes, backbone)"
9,0.732673,"(tail, backbone)"


### Q6. Find association rules having min confidence 0.5

In [46]:
# Find and display rules
rules = association_rules(freqitems, metric="confidence", min_threshold=0.5)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(toothed),(backbone),0.60396,0.821782,0.60396,1.0,1.216867,0.107637,inf
1,(backbone),(toothed),0.821782,0.60396,0.60396,0.73494,1.216867,0.107637,1.494149
2,(tail),(toothed),0.742574,0.60396,0.514851,0.693333,1.147978,0.066366,1.291433
3,(toothed),(tail),0.60396,0.742574,0.514851,0.852459,1.147978,0.066366,1.744774
4,(breathes),(backbone),0.792079,0.821782,0.683168,0.8625,1.049548,0.032252,1.29613
5,(backbone),(breathes),0.821782,0.792079,0.683168,0.831325,1.049548,0.032252,1.232673
6,(tail),(backbone),0.742574,0.821782,0.732673,0.986667,1.200643,0.122439,13.366337
7,(backbone),(tail),0.821782,0.742574,0.732673,0.891566,1.200643,0.122439,2.374037
8,(tail),(breathes),0.742574,0.792079,0.60396,0.813333,1.026833,0.015783,1.113861
9,(breathes),(tail),0.792079,0.742574,0.60396,0.7625,1.026833,0.015783,1.083898


### Q7. Convert the dataset into two classes "Mammal" and "others"

In [114]:
# Take mammal class column as the class column and drop others.
Y = df["mammal"]
col = df.columns.tolist()
col.remove("mammal")
X = df[col]
X

Unnamed: 0,hair,feathers,eggs,airborne,aquatic,predator,toothed,backbone,breathes,venomous,...,legs_5,legs_6,legs_8,type_1,type_2,type_3,type_4,type_5,type_6,type_7
0,1,0,0,0,0,1,1,1,1,0,...,0,0,0,1,0,0,0,0,0,0
1,1,0,0,0,0,0,1,1,1,0,...,0,0,0,1,0,0,0,0,0,0
2,0,0,1,0,1,1,1,1,0,0,...,0,0,0,0,0,0,1,0,0,0
3,1,0,0,0,0,1,1,1,1,0,...,0,0,0,1,0,0,0,0,0,0
4,1,0,0,0,0,1,1,1,1,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,1,0,0,0,0,0,1,1,1,0,...,0,0,0,1,0,0,0,0,0,0
97,1,0,1,1,0,0,0,0,1,1,...,0,1,0,0,0,0,0,0,1,0
98,1,0,0,0,0,1,1,1,1,0,...,0,0,0,1,0,0,0,0,0,0
99,0,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1


### Q8. Partition the dataset into training and testing part (70:30)

In [115]:
#partition the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)
df_train = pd.concat([X_train, Y_train], axis = 1)

### Q9. Generate association rules for "mammal" class (training data) with min support 0.4 and confidence as 1

In [116]:
# frequent itemsets 
freqitems = apriori(df_train, min_support=0.4, use_colnames=True)
freqitems

Unnamed: 0,support,itemsets
0,0.428571,(hair)
1,0.571429,(eggs)
2,0.571429,(predator)
3,0.628571,(toothed)
4,0.8,(backbone)
5,0.757143,(breathes)
6,0.7,(tail)
7,0.414286,(catsize)
8,0.4,(type_1)
9,0.4,(mammal)


In [117]:
# find rules with confidence 1
rules = association_rules(freqitems, metric="confidence", min_threshold=1)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(hair),(breathes),0.428571,0.757143,0.428571,1.0,1.320755,0.104082,inf
1,(toothed),(backbone),0.628571,0.800000,0.628571,1.0,1.250000,0.125714,inf
2,(type_1),(toothed),0.400000,0.628571,0.400000,1.0,1.590909,0.148571,inf
3,(mammal),(toothed),0.400000,0.628571,0.400000,1.0,1.590909,0.148571,inf
4,(type_1),(backbone),0.400000,0.800000,0.400000,1.0,1.250000,0.080000,inf
...,...,...,...,...,...,...,...,...,...
110,"(toothed, type_1)","(breathes, backbone, mammal)",0.400000,0.400000,0.400000,1.0,2.500000,0.240000,inf
111,"(breathes, mammal)","(toothed, backbone, type_1)",0.400000,0.400000,0.400000,1.0,2.500000,0.240000,inf
112,"(mammal, toothed)","(breathes, backbone, type_1)",0.400000,0.400000,0.400000,1.0,2.500000,0.240000,inf
113,(type_1),"(breathes, backbone, mammal, toothed)",0.400000,0.400000,0.400000,1.0,2.500000,0.240000,inf


In [118]:
# selecting rules having consequents as class mammal
rules = rules[rules['consequents'] == {'mammal'}]
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
9,(type_1),(mammal),0.4,0.4,0.4,1.0,2.5,0.24,inf
26,"(toothed, type_1)",(mammal),0.4,0.4,0.4,1.0,2.5,0.24,inf
37,"(backbone, type_1)",(mammal),0.4,0.4,0.4,1.0,2.5,0.24,inf
41,"(breathes, type_1)",(mammal),0.4,0.4,0.4,1.0,2.5,0.24,inf
62,"(toothed, backbone, type_1)",(mammal),0.4,0.4,0.4,1.0,2.5,0.24,inf
71,"(breathes, toothed, type_1)",(mammal),0.4,0.4,0.4,1.0,2.5,0.24,inf
82,"(breathes, backbone, type_1)",(mammal),0.4,0.4,0.4,1.0,2.5,0.24,inf
93,"(breathes, backbone, toothed, type_1)",(mammal),0.4,0.4,0.4,1.0,2.5,0.24,inf


### Q10. Test the rules generated on testing dataset and find precision and recall for the rule based classifier

In [119]:
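# Apply the rules to the test data. Every mined rule with consequent (mammal)
# has type_1 in its antecedent, so the simplest rule, (type_1) -> (mammal), is used:
# predict 1 (mammal) when type_1 is set, otherwise 0.
Y_pred = (X_test["type_1"] == 1).astype(int).tolist()
Y_pred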

[0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1]
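
The same idea extends to rules with longer antecedents: a rule fires only when every item of its antecedent is present in the row. The sketch below is a hypothetical helper, `predict_with_rule`, applied to made-up rows represented as plain dictionaries rather than DataFrame rows.

```python
def predict_with_rule(row, antecedent):
    """Return 1 if every item in `antecedent` equals 1 in `row`, else 0."""
    return int(all(row[item] == 1 for item in antecedent))


# Made-up rows with the same 0/1 encoding as the zoo features.
rows = [
    {"type_1": 1, "toothed": 1, "backbone": 1},
    {"type_1": 0, "toothed": 1, "backbone": 1},
    {"type_1": 0, "toothed": 0, "backbone": 0},
]

# Antecedent of the rule (toothed, type_1) -> (mammal).
antecedent = {"toothed", "type_1"}
predictions = [predict_with_rule(row, antecedent) for row in rows]
print(predictions)  # [1, 0, 0]
```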

In [121]:
# evaluation measures
# print classification report
print(classification_report(Y_test, Y_pred))
print(accuracy_score(Y_test, Y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        18
           1       1.00      1.00      1.00        13

    accuracy                           1.00        31
   macro avg       1.00      1.00      1.00        31
weighted avg       1.00      1.00      1.00        31

1.0
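
For reference, precision and recall for the positive class can be computed by hand from the true and predicted labels. The sketch below uses made-up labels and a hypothetical helper, `precision_recall`; it only illustrates the definitions behind the report above.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall of the `positive` class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```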


### Q11. Apply decision tree on the dataset and calculate the performance evaluation measures

In [122]:
# Select the independent variables and target column
# Apply decision tree
dtree = DecisionTreeClassifier()
dtree.fit(X_train, Y_train)

DecisionTreeClassifier()

In [123]:
# Find predictions by decision tree
Y_pred = dtree.predict(X_test)

In [125]:
# Evaluation measures and classification report
print(classification_report(Y_test, Y_pred))
print(accuracy_score(Y_test, Y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        18
           1       1.00      1.00      1.00        13

    accuracy                           1.00        31
   macro avg       1.00      1.00      1.00        31
weighted avg       1.00      1.00      1.00        31

1.0


### Q12. Which of the two classifiers performs better?

In [126]:
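# Name of the classifier with accuracy value.
# Both the rule-based classifier and the decision tree reach an accuracy of 1.0
# on the test set, so neither performs better than the other.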