## Business understanding
Association rule mining looks for patterns of co-occurrence between variables. The goal of this assignment is to find relationships between purchases across multiple product groups.

## Data understanding
The data used contains 20 different product groups with column values of 1 or 0, where 1 indicates that a product was purchased from that group and 0 indicates that a purchase didn't include a product from that group. There are no missing values in the dataset.

## Data preparation
The data needed minimal preparation. The 'ID' column was dropped because it is not needed for the analysis, and all remaining values were converted to booleans (1 = True, 0 = False) because the mlxtend library expects boolean input.

In [1]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
df = pd.read_csv('data/drone_prod_groups.csv', sep=',')

# Data preprocessing
df = df.drop(columns='ID')
# Element-wise comparison: cells equal to 1 become True, everything else False
df = df.eq(1)
df.head()

Unnamed: 0,Prod1,Prod2,Prod3,Prod4,Prod5,Prod6,Prod7,Prod8,Prod9,Prod10,Prod11,Prod12,Prod13,Prod14,Prod15,Prod16,Prod17,Prod18,Prod19,Prod20
0,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,True
1,False,True,False,False,False,False,False,False,True,False,False,False,False,False,True,True,True,True,True,True
2,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,True
3,True,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,True
4,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,True,True


## Modeling
To find the relationships in the data we used the Apriori algorithm as implemented in the mlxtend library. Apriori is a popular algorithm for mining frequent itemsets for boolean association rules and is designed to operate on databases of transactions. It works level by level: it first finds the frequent single items, then uses a candidate generation step to build candidate itemsets of size k from the frequent itemsets of size k-1, and keeps only the candidates that meet the minimum support. The algorithm terminates when no more frequent itemsets can be found. \
We set only one hyperparameter: min_support. Support is the proportion of transactions in the database that contain the itemset; here an itemset is a set of product groups present in the same transaction. We decided that an itemset has to appear in at least 3% of the transactions to be relevant for our analysis.
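The level-wise search described above can be illustrated with a small pure-Python sketch. This is not mlxtend's implementation; the function name `apriori_sketch` and the four toy baskets are made up for illustration, and the 0.5 threshold is chosen only so that the tiny example produces a readable result.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Share of transactions that contain every item of the itemset
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items
    current = {}
    for item in {i for t in transactions for i in t}:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Keep only the candidates that meet the support threshold
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Toy baskets (hypothetical data, not taken from the assignment dataset)
baskets = [
    {"Prod9", "Prod15", "Prod20"},
    {"Prod9", "Prod15"},
    {"Prod19", "Prod20"},
    {"Prod9", "Prod19", "Prod20"},
]
result = apriori_sketch(baskets, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The loop stops as soon as a level produces no frequent itemsets, which corresponds to the termination condition mentioned above.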

In [2]:
freq = apriori(df, min_support=0.03, use_colnames=True)
# Optional: drop single-item itemsets for display. Left commented out because
# removing them breaks the next cell, which still needs their support values.
# freq = freq[freq['itemsets'].apply(lambda x: len(x) != 1)]
freq

Unnamed: 0,support,itemsets
0,0.10998,( Prod1)
1,0.13098,( Prod2)
2,0.03271,( Prod3)
3,0.03585,( Prod4)
4,0.10459,( Prod5)
5,0.13499,( Prod7)
6,0.16179,( Prod8)
7,0.19853,( Prod9)
8,0.09336,( Prod10)
9,0.10848,( Prod11)


In [3]:
rules = association_rules(freq, metric='lift', min_threshold=0.5).sort_values(by='lift', ascending=False)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
21,( Prod9),( Prod15),0.19853,0.1188,0.11145,0.561376,4.725388,1.0,0.087865,2.009011,0.983664,0.541335,0.502243,0.749754
20,( Prod15),( Prod9),0.1188,0.19853,0.11145,0.938131,4.725388,1.0,0.087865,12.954372,0.894663,0.541335,0.922806,0.749754
42,"( Prod9, Prod19)",( Prod20),0.04996,0.14798,0.03345,0.669536,4.524501,1.0,0.026057,2.578251,0.819946,0.203356,0.61214,0.44779
43,( Prod20),"( Prod9, Prod19)",0.14798,0.04996,0.03345,0.226044,4.524501,1.0,0.026057,1.227512,0.914276,0.203356,0.185344,0.44779
39,( Prod19),( Prod20),0.20626,0.14798,0.13476,0.65335,4.415125,1.0,0.104238,2.457869,0.974508,0.613997,0.593144,0.782007
38,( Prod20),( Prod19),0.14798,0.20626,0.13476,0.910664,4.415125,1.0,0.104238,8.884845,0.907849,0.613997,0.887449,0.782007
40,"( Prod9, Prod20)",( Prod19),0.03676,0.20626,0.03345,0.909956,4.411696,1.0,0.025868,8.81507,0.802842,0.159613,0.886558,0.536065
45,( Prod19),"( Prod9, Prod20)",0.20626,0.03676,0.03345,0.162174,4.411696,1.0,0.025868,1.14969,0.974286,0.159613,0.1302,0.536065
4,( Prod5),( Prod12),0.10459,0.15971,0.06683,0.638971,4.000822,1.0,0.050126,2.327488,0.837662,0.338431,0.570352,0.528709
5,( Prod12),( Prod5),0.15971,0.10459,0.06683,0.418446,4.000822,1.0,0.050126,1.539685,0.89261,0.338431,0.350516,0.528709


## Evaluation
The algorithm found 39 itemsets (21 if we exclude the single-item itemsets) whose product groups were present together in at least 3% of the transactions in the dataset. \
For the association rules we used the 'Lift' metric. Lift is a derived metric, the confidence of a rule divided by the support of its consequent, and it gives the factor by which the probability of the consequent being present increases when the antecedent is present. A lift above 1 means the antecedent makes the consequent more likely. For example, buying a product from group 9 increases the probability of buying a product from group 15 by a factor of 4.7.
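As a check on how these numbers relate, the metrics for the Prod9 → Prod15 rule can be recomputed by hand from the support values in the rules table. The variable names below are chosen for illustration; the formulas are the standard definitions of confidence, lift and leverage.

```python
# Values copied from the Prod9 -> Prod15 row of the rules table
support_both = 0.11145        # transactions containing Prod9 and Prod15
support_antecedent = 0.19853  # transactions containing Prod9
support_consequent = 0.1188   # transactions containing Prod15

# confidence = P(consequent | antecedent)
confidence = support_both / support_antecedent

# lift = confidence relative to the baseline probability of the consequent
lift = confidence / support_consequent

# leverage = observed joint support minus the support expected under independence
leverage = support_both - support_antecedent * support_consequent

print(f"confidence={confidence:.6f} lift={lift:.6f} leverage={leverage:.6f}")
```

The results agree with the confidence, lift and leverage columns of that row up to rounding.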

## Deployment
The rules can be applied wherever purchase data is available, for example when deciding how to place products in a store or what to recommend on an e-commerce platform. An online shop selling electronics could look at the lift metric to determine which other products to recommend when a customer adds a product to their checkout basket. A physical store could use the support metric to see which items are often purchased together and place them accordingly: either far apart, to keep the customer browsing and impulse buying for as long as possible, or close to each other for convenience.
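The online-shop idea could look roughly like the sketch below. The helper `recommend` is hypothetical, and the rules are written out as plain dictionaries using four rows from the rules table; with the real results, something like `rules.to_dict('records')` should give records of a similar shape.

```python
def recommend(rules, basket, top_n=3):
    """Suggest product groups not yet in the basket, ranked by best lift."""
    basket = frozenset(basket)
    scores = {}
    for rule in rules:
        # A rule applies when all of its antecedents are already in the basket
        if rule["antecedents"] <= basket:
            for item in rule["consequents"] - basket:
                scores[item] = max(scores.get(item, 0.0), rule["lift"])
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# A few rules copied from the rules table (lift values as reported there)
rules = [
    {"antecedents": frozenset({"Prod9"}), "consequents": frozenset({"Prod15"}), "lift": 4.725388},
    {"antecedents": frozenset({"Prod9", "Prod19"}), "consequents": frozenset({"Prod20"}), "lift": 4.524501},
    {"antecedents": frozenset({"Prod19"}), "consequents": frozenset({"Prod20"}), "lift": 4.415125},
    {"antecedents": frozenset({"Prod5"}), "consequents": frozenset({"Prod12"}), "lift": 4.000822},
]

suggestions = recommend(rules, {"Prod9", "Prod19"})
print(suggestions)
```

Ranking by the highest lift is only one possible design choice; a shop might instead rank by confidence, or require a minimum support so that rare combinations do not dominate the suggestions.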