## Lab1: MBA Solutions

##### Overview
We will perform market basket analysis on a transaction dataset using the Apriori algorithm.

The transaction dataset consists of 1000 transactions and may be found in Lab1MBA.csv. There are 50 items available for purchase, labelled 1 through 50. Each transaction is limited to 5 items or fewer.

Apriori has a few parameters that must be tuned for each new dataset: minimum support, minimum confidence, and minimum lift. All of these parameters are expressed in unitless form here (as fractions rather than in units of baskets). The idea is to start these parameters high, keeping minimum support greater than minimum confidence, and minimum lift greater than 1.
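To make the unitless convention concrete, here is a minimal sketch that converts between a fractional support and a basket count for the 1000-transaction dataset described above. The helper names `support_to_count` and `min_count_for_support` are illustrative only and are not part of mlxtend.

```python
import math

N_TRANSACTIONS = 1000  # dataset size stated in the overview


def support_to_count(support, n_transactions=N_TRANSACTIONS):
    """Convert a fractional (unitless) support into a number of baskets."""
    return round(support * n_transactions)


def min_count_for_support(min_support, n_transactions=N_TRANSACTIONS):
    """Smallest basket count that satisfies a fractional minimum support."""
    # The small tolerance guards against floating-point noise such as 15.000000001.
    return math.ceil(min_support * n_transactions - 1e-9)


# A rule with support 0.019 appears in 19 of the 1000 baskets.
print(support_to_count(0.019))
# A minimum support of 0.0138 requires an itemset to appear in at least 14 baskets.
print(min_count_for_support(0.0138))
```

The same conversion gives the "count" column requested in Q3: multiply each rule's support by the number of transactions.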

These parameter constraints enable you to determine itemsets and their supports from association rules, the standard Apriori output. When a rule has an empty antecedent, then the support of the rule is the support of the consequent. This allows us to obtain frequent 1-itemsets from Apriori output. When the antecedent is not empty, the support of the rule is the support of the itemset formed by merging antecedent with consequent.
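The recovery of itemsets from rules can be sketched in a few lines. This example assumes rules are available as plain `(antecedent, consequent, support)` triples, which is a simplification for illustration and not the DataFrame format returned by mlxtend; the toy values are hypothetical.

```python
def itemsets_from_rules(rules):
    """Map each itemset to its support, given (antecedent, consequent, support) triples.

    An empty antecedent yields a 1-itemset (the consequent alone); otherwise the
    itemset is the union of antecedent and consequent.
    """
    itemsets = {}
    for antecedent, consequent, support in rules:
        itemset = frozenset(antecedent) | frozenset(consequent)
        # Rules such as A -> B and B -> A describe the same itemset and share one support.
        itemsets[itemset] = support
    return itemsets


# Hypothetical toy rules, not taken from the lab dataset.
toy_rules = [
    ((), (1,), 0.10),    # empty antecedent: support of item 1 alone
    ((1,), (2,), 0.02),  # rule 1 -> 2
    ((2,), (1,), 0.02),  # rule 2 -> 1, same itemset {1, 2}
]

recovered = itemsets_from_rules(toy_rules)
for itemset, support in sorted(recovered.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

Because both directions of a rule collapse onto one itemset, the number of distinct itemsets is generally smaller than the number of rules.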

The goal of parameter tuning for Apriori is to find as many interesting association rules as possible without excessive computational cost. Usually, one starts with thresholds strict enough that no association rules are returned, because an “interesting” rule must meet the minimum support, confidence, and lift all at once.

Here is one way to tune Apriori parameters:

1. Start with a very high lift threshold, such as 2. Set the minimum support higher than 1 − exp(−basket size / number of items), and the minimum confidence about 50% less than the minimum support. Run in this mode, Apriori should be fast on this dataset, and the output should contain few if any association rules.
2. Lower the minimum lift gradually toward 1, re-running Apriori each time and looking for association rules. If you find some, great!
3. If you find too many, raise the minimum support and minimum confidence proportionally, reset the minimum lift to 2, and repeat.
4. If no rules are produced, lower the minimum support and minimum confidence proportionally, reset the minimum lift to 2, and repeat.

Eventually, you should see some, but not too many, “interesting” association rules. Once this happens, adjust the tuning parameters more gradually and see what you find. There are many other ways to tune these parameters; please explore!
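The tuning procedure above can be sketched as a small search loop. This is only one possible reading of the description, under stated assumptions: `count_rules` is a hypothetical callback that you would supply (for example, a wrapper that runs Apriori and returns the number of rules), and the step sizes `lift_step` and `scale` are arbitrary choices, not values prescribed by the lab. For this dataset (basket size 5, 50 items) the starting minimum support works out to about 0.095.

```python
import math


def starting_thresholds(basket_size, n_items, initial_lift=2.0):
    """Initial thresholds suggested in the text."""
    min_support = 1 - math.exp(-basket_size / n_items)
    min_confidence = 0.5 * min_support  # "about 50% less than the minimum support"
    return min_support, min_confidence, initial_lift


def tune(count_rules, basket_size, n_items, lo, hi,
         lift_step=0.25, scale=1.25, max_rounds=50):
    """Search for thresholds whose rule count falls within [lo, hi].

    count_rules(min_support, min_confidence, min_lift) is a user-supplied
    (hypothetical) function returning the number of rules found.
    Returns (min_support, min_confidence, min_lift), or None if nothing fits.
    """
    support, confidence, initial_lift = starting_thresholds(basket_size, n_items)
    for _ in range(max_rounds):
        lift = initial_lift
        too_many = False
        # Lower the lift gradually down to 1, re-running the rule search each time.
        while lift >= 1.0:
            n_rules = count_rules(support, confidence, lift)
            if lo <= n_rules <= hi:
                return support, confidence, lift
            if n_rules > hi:
                too_many = True
                break
            lift -= lift_step
        # Too many rules: tighten support and confidence proportionally.
        # Too few rules: loosen them. Either way the lift restarts at its initial value.
        factor = scale if too_many else 1 / scale
        support *= factor
        confidence *= factor
    return None


print(starting_thresholds(basket_size=5, n_items=50))
```

Scaling support and confidence by the same factor keeps their ratio fixed, which matches the "proportionally" wording in the procedure.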

In [1]:
#installation of required libraries
# pip install numpy
# pip install pandas
# pip install mlxtend

#importing the libraries
import pandas as pd
import numpy as np

from mlxtend.preprocessing import TransactionEncoder 
from mlxtend.frequent_patterns import apriori, association_rules

In [2]:
#importing dataset file
data = pd.read_csv("./data/Lab1MBA.csv",sep = ",",header = None)
data.head(5)

   0   1   2     3     4
0  5   6  11  36.0  46.0
1  6  17  24  31.0  35.0
2  9  20  25  30.0   NaN
3  7  21  27  35.0  41.0
4  1  24  29  40.0  43.0


In [3]:
#creating list of transactions from dataset
#baskets with fewer than 5 items are padded with NaN by read_csv; skip those entries
transactions = []
for row in data.values:
    transactions.append([item for item in row if not pd.isna(item)])
print("Number of Transactions: ",len(transactions))

Number of Transactions:  1000


In [4]:
#instantiate transaction encoder
encoder = TransactionEncoder().fit(transactions)
one_hot = encoder.transform(transactions)
dataframe = pd.DataFrame(one_hot,columns = encoder.columns_)

In [5]:
#training apriori algorithm on the dataset
frequent_itemsets = apriori(dataframe, min_support = 0.0138 , max_len = 3, use_colnames=True)
print(len(frequent_itemsets))

112


**Q1. By trial and error, what are the values of the minimum support, minimum confidence, and minimum lift for which the number of interesting association rules is between 1 and 2 times the number of items?**

In [6]:
rules = association_rules(frequent_itemsets,metric = "lift",min_threshold = 2)
rules = rules[(rules['confidence'] > 0.5) & (rules['lift'] > 1.4)]
print("Min Support = 0.0138, Min Confidence = 0.5, Min Lift = 2")
print("Number of interesting rules obtained : ",len(rules))

rules = rules.sort_values(['confidence', 'lift'], ascending = [False, False])
pd.set_option('display.max_columns', 15)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)
print(rules.head(len(rules)))

Min Support = 0.0138, Min Confidence = 0.5, Min Lift = 2
Number of interesting rules obtained :  0
Empty DataFrame
Columns: [antecedents, consequents, antecedent support, consequent support, support, confidence, lift, leverage, conviction, zhangs_metric]
Index: []


**Q2. What relationship between these variables keeps the number of interesting association rules in this range?**

Support, confidence, and lift are three related measures of how strongly items are associated. Support measures the popularity of an itemset as the proportion of transactions in which it appears. Confidence measures how likely item B is to be purchased when item A is purchased, and is given by confidence(A → B) = support(A ∪ B) / support(A), where support(A ∪ B) is the support of the itemset containing both A and B. Confidence can be misleading, however, because it does not account for the popularity of the consequent. Lift measures how likely item B is to be purchased when item A is purchased while controlling for the popularity of item B, and is given by lift(A → B) = support(A ∪ B) / (support(A) × support(B)). A lift greater than 1 indicates that the rule predicts the consequent better than guessing from the consequent's overall popularity. A lift less than 1 indicates that the rule does worse than that baseline.
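As a quick check of these formulas, the sketch below recomputes confidence and lift for the rule (9) → (37), using the antecedent support, consequent support, and rule support reported for it in the Q3 output. The function names are illustrative.

```python
def confidence(support_ab, support_a):
    """confidence(A -> B) = support(A and B together) / support(A)."""
    return support_ab / support_a


def lift(support_ab, support_a, support_b):
    """lift(A -> B) = support(A and B together) / (support(A) * support(B))."""
    return support_ab / (support_a * support_b)


# Values for the rule (9) -> (37) as listed in the Q3 output.
support_a = 0.114   # antecedent support
support_b = 0.123   # consequent support
support_ab = 0.019  # support of the rule

print(round(confidence(support_ab, support_a), 6))       # 0.166667
print(round(lift(support_ab, support_a, support_b), 6))  # 1.355014
```

Both values agree with the confidence and lift columns printed for that rule, which confirms that the denominator of lift is the product of the two individual supports.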


When the minimum support is set to a suitable value relative to the basket size and the number of items, Apriori retains itemsets that occur often enough to be meaningful, and requiring a lift greater than 1 keeps only rules whose items appear together more often than they would if purchased independently. With the minimum support at 0.0138, only itemsets that appear in at least 1.38% of transactions are considered, and lift then identifies relationships among these items that yield meaningful association rules. Confidence gives the likelihood that the consequent is in the basket given that the basket already contains the antecedent. When the items involved have at least 1.38% support, consequents with a good conditional probability are more likely to be found, which gives interesting association rules between different items. Setting appropriate values for support, confidence, and lift together yields fewer but more interesting rules, keeping the number of association rules in the range of 50 to 100.


**Q3. Provide a table of the 75 most frequent itemsets for the following parameters: minimum support=0.015, minimum confidence=0.01, minimum lift=1. Order the itemsets by descending count. (hint: you may use the output rules to determine these itemsets). You may use the association rule format for this table, whose columns are: antecedent, consequent, support, confidence, lift, and count.**

In [7]:
rules = association_rules(frequent_itemsets, metric = "lift",min_threshold = 1)
rules = rules[(rules['support'] >= 0.015) & (rules['confidence'] >= 0.01) & (rules['lift'] >= 1)]
rules = rules.sort_values(['support'], ascending =False) #sort by descending support, which is proportional to count
pd.set_option('display.max_columns', 15)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)
print("Min Support = 0.015, Min Confidence = 0.01, Min Lift = 1")
print(rules.head(75)) #displays 75 most frequent itemsets

Min Support = 0.015, Min Confidence = 0.01, Min Lift = 1
    antecedents consequents  antecedent support  consequent support  support  confidence      lift  leverage  conviction  zhangs_metric
52        (9.0)      (37.0)               0.114               0.123    0.019    0.166667  1.355014  0.004978    1.052400       0.295711
53       (37.0)       (9.0)               0.123               0.114    0.019    0.154472  1.355014  0.004978    1.047865       0.298746
4         (8.0)       (2.0)               0.107               0.107    0.018    0.168224  1.572190  0.006551    1.073607       0.407553
5         (2.0)       (8.0)               0.107               0.107    0.018    0.168224  1.572190  0.006551    1.073607       0.407553
74       (35.0)      (14.0)               0.119               0.094    0.018    0.151261  1.609154  0.006814    1.067465       0.429688
75       (14.0)      (35.0)               0.094               0.119    0.018    0.191489  1.609154  0.006814    1.089658       

**Q4. For the same parameter settings as in Q3, determine the 50 most interesting association rules, as measured by lift. Present your findings in a table with similar format as in your answer to Q3, with rules arranged by descending lift.**

In [8]:
rules = association_rules(frequent_itemsets, metric = "lift",min_threshold = 1)
rules = rules[(rules['support'] >= 0.015) & (rules['confidence'] >= 0.01) & (rules['lift'] >= 1)]
rules = rules.sort_values(['lift'], ascending =False) #sort rules by descending lift
pd.set_option('display.max_columns', 15)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)
print("Min Support = 0.015, Min Confidence = 0.01, Min Lift = 1")
print(rules.head(50)) #displays 50 most interesting association rules

Min Support = 0.015, Min Confidence = 0.01, Min Lift = 1
    antecedents consequents  antecedent support  consequent support  support  confidence      lift  leverage  conviction  zhangs_metric
119      (42.0)      (41.0)               0.091               0.108    0.017    0.186813  1.729752  0.007172    1.096919       0.464117
118      (41.0)      (42.0)               0.108               0.091    0.017    0.157407  1.729752  0.007172    1.078813       0.472962
123      (43.0)      (48.0)               0.096               0.091    0.015    0.156250  1.717033  0.006264    1.077333       0.461947
122      (48.0)      (43.0)               0.091               0.096    0.015    0.164835  1.717033  0.006264    1.082421       0.459406
30       (18.0)       (6.0)               0.087               0.102    0.015    0.172414  1.690331  0.006126    1.085083       0.447317
31        (6.0)      (18.0)               0.102               0.087    0.015    0.147059  1.690331  0.006126    1.070414       