# Data Fellowship Association Rules Group Project
Karl Merisalu, Goran Krajnovic, Christoph Wolff, Matthew Wallace

### 1) Importing relevant libraries

In [6]:
import pandas as pd
import numpy as np
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

### 2) Importing anonymised dataset and showing first rows of dataset

In [7]:
# raw string so the backslash in the Windows path is not treated as an escape character
amData = pd.read_csv(r'instacart\AMdata3.csv')
amData.head()

Unnamed: 0,client id,AssetClass
0,1,Equities
1,2,Multi Asset
2,3,Real Estate & Private Markets
3,4,Equities
4,5,Equities


### 3) Getting an overview of the frequency of each asset class

In [8]:
amData['AssetClass'].value_counts().rename("freq")

Equities                         309
Fixed Income                     215
Multi Asset                      148
Real Estate & Private Markets    122
Hedge Funds                       35
Distribution Partners              6
Money Market                       4
Name: freq, dtype: int64

### 4) Grouping data by client id to get an overview of each client id's exposure to asset classes

In [9]:
client_baskets = amData.groupby(['client id']).AssetClass.apply(np.array).reset_index()
client_baskets.head()

Unnamed: 0,client id,AssetClass
0,1,[Equities]
1,2,[Multi Asset]
2,3,[Real Estate & Private Markets]
3,4,[Equities]
4,5,[Equities]


### 5) Encoding grouped data as a one-hot (boolean) matrix

In [10]:
te = TransactionEncoder()
te_ary = te.fit(client_baskets['AssetClass']).transform(client_baskets['AssetClass'])
dataset = pd.DataFrame(te_ary, columns=te.columns_)
dataset.head()

Unnamed: 0,Distribution Partners,Equities,Fixed Income,Hedge Funds,Money Market,Multi Asset,Real Estate & Private Markets
0,False,True,False,False,False,False,False
1,False,False,False,False,False,True,False
2,False,False,False,False,False,False,True
3,False,True,False,False,False,False,False
4,False,True,False,False,False,False,False
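
As a rough illustration of what this encoding step produces, here is a minimal pure-Python sketch. It does not use mlxtend. The `baskets` list is hypothetical toy data rather than the real client data, and the alphabetical column order is an assumption chosen to mirror the output shown above.

```python
# Hypothetical baskets (not the real data): one list of asset classes per client id.
baskets = [
    ["Equities"],
    ["Multi Asset", "Equities"],
    ["Fixed Income"],
]

# The sorted union of all items gives the column order.
columns = sorted({item for basket in baskets for item in basket})

# One boolean row per basket: True where the basket contains that column's item.
encoded = [[col in basket for col in columns] for basket in baskets]

print(columns)
for row in encoded:
    print(row)
```

Each row corresponds to one client id and each column to one asset class, which is the shape that `apriori` expects as input.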


### 6) Using mlxtend's apriori method to generate a list of frequent itemsets

In [11]:
frequent_itemsets = apriori(dataset, min_support=0.01, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)

# To keep only itemsets of a certain length, we could filter like this:
# frequent_itemsets = frequent_itemsets[frequent_itemsets['length'] >= 2]

frequent_itemsets.sort_values('support', ascending=False)

Unnamed: 0,support,itemsets,length
0,0.397172,(Equities),1
1,0.27635,(Fixed Income),1
3,0.190231,(Multi Asset),1
4,0.156812,(Real Estate & Private Markets),1
2,0.044987,(Hedge Funds),1
5,0.032134,"(Fixed Income, Equities)",2
6,0.017995,"(Equities, Multi Asset)",2
8,0.014139,"(Fixed Income, Multi Asset)",2
7,0.012853,"(Equities, Real Estate & Private Markets)",2
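
The support values in this table are the share of client ids whose basket contains every item in the itemset. A minimal sketch of that calculation follows. The `support` helper and the toy `baskets` are hypothetical, for illustration only, and are not part of mlxtend.

```python
# Hypothetical baskets (not the real data): one set of asset classes per client id.
baskets = [
    {"Equities"},
    {"Equities", "Fixed Income"},
    {"Fixed Income"},
    {"Multi Asset"},
]

def support(itemset, baskets):
    """Share of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

print(support({"Equities"}, baskets))
print(support({"Equities", "Fixed Income"}, baskets))
```

An itemset is kept as "frequent" only if its support is at least `min_support`, which is why pairs seen in fewer than 1% of client ids do not appear in the table.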


### 7) Using mlxtend's association_rules method to generate rules

In [12]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.01)
rules["antecedent_len"] = rules["antecedents"].apply(len)
rules["consequent_len"] = rules["consequents"].apply(len)
rules.sort_values('confidence', ascending=False)

# To see only rules with specific antecedent or consequent lengths,
# we could filter like this:
# rules[(rules['antecedent_len'] == 1) &
#       (rules['consequent_len'] == 2)].sort_values('confidence', ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len,consequent_len
0,(Fixed Income),(Equities),0.27635,0.397172,0.032134,0.116279,0.292767,-0.077625,0.682147,1,1
3,(Multi Asset),(Equities),0.190231,0.397172,0.017995,0.094595,0.23817,-0.05756,0.66581,1,1
5,(Real Estate & Private Markets),(Equities),0.156812,0.397172,0.012853,0.081967,0.206377,-0.049428,0.656652,1,1
1,(Equities),(Fixed Income),0.397172,0.27635,0.032134,0.080906,0.292767,-0.077625,0.787352,1,1
7,(Multi Asset),(Fixed Income),0.190231,0.27635,0.014139,0.074324,0.26895,-0.038432,0.781754,1,1
6,(Fixed Income),(Multi Asset),0.27635,0.190231,0.014139,0.051163,0.26895,-0.038432,0.853433,1,1
2,(Equities),(Multi Asset),0.397172,0.190231,0.017995,0.045307,0.23817,-0.05756,0.848198,1,1
4,(Equities),(Real Estate & Private Markets),0.397172,0.156812,0.012853,0.032362,0.206377,-0.049428,0.871388,1,1
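
To make the metric columns easier to read, the sketch below recomputes confidence, lift, leverage and conviction for the (Fixed Income) -> (Equities) rule from its three support values. The inputs are copied from the table above and the formulas are the standard definitions of these metrics. The variable names are illustrative, and small differences from the table can arise because the inputs are rounded.

```python
# Values copied from the (Fixed Income) -> (Equities) row of the rules table.
antecedent_support = 0.27635   # support(Fixed Income)
consequent_support = 0.397172  # support(Equities)
joint_support = 0.032134       # support(Fixed Income and Equities)

# confidence: share of antecedent baskets that also contain the consequent
confidence = joint_support / antecedent_support

# lift: confidence relative to the consequent's baseline support
lift = confidence / consequent_support

# leverage: observed joint support minus the joint support expected under independence
leverage = joint_support - antecedent_support * consequent_support

# conviction: how often the rule would be wrong under independence vs. how often it is wrong
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 4), round(lift, 4), round(leverage, 4), round(conviction, 4))
```

Lift and leverage are symmetric in antecedent and consequent, which is why each pair of mirrored rules in the table shares the same values for those two columns.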


### 8) Analysis: 
<b>1) Individual asset class support #1:</b> Equities is the asset class that client ids are most frequently exposed to: 39.7% of all client ids have Equities exposure.

<b>2) Individual asset class support #2: </b>The most frequent combination of asset classes is Fixed Income and Equities: 3.2% of all client ids are exposed to both Fixed Income AND Equities.

<b>3) Antecedent-consequent confidence #1:</b> 11.6% of client ids that are exposed to Fixed Income are also exposed to Equities. This is the highest confidence among the generated rules, and so the first candidate to consider for fund recommendations.

<b>4) Antecedent-consequent confidence #2:</b> The second-highest confidence belongs to the rule "when Multi Asset --> (then) Equities": 9.5% of client ids exposed to Multi Asset are also exposed to Equities.

<b>5) Lift #1:</b> Looking at lift, however, we notice that every rule is below 1. A lift below 1 means the antecedent and consequent occur together less often than they would if they were independent, so the presence of the antecedent makes the consequent less likely. For example, only 11.6% of client ids exposed to Fixed Income are exposed to Equities, compared with 39.7% of all client ids. As such, we shouldn't rely on any of these rules for predictions. The cause is that a large majority of client ids are exposed to only one asset class, so a client id having no other exposure is the most likely situation.
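
The explanation for the low lift (most client ids hold a single asset class) can be checked by counting basket sizes. The following is a minimal sketch on hypothetical toy baskets rather than the real client data. The names `baskets`, `size_counts` and `single_share` are illustrative.

```python
from collections import Counter

# Hypothetical baskets (not the real data): one set of asset classes per client id.
baskets = [
    {"Equities"},
    {"Fixed Income"},
    {"Equities", "Fixed Income"},
    {"Multi Asset"},
    {"Equities"},
]

# Count how many client ids hold 1, 2, 3, ... asset classes.
size_counts = Counter(len(basket) for basket in baskets)

# Share of client ids with exactly one asset class.
single_share = size_counts[1] / len(baskets)

print(dict(size_counts))
print(f"single-asset-class share: {single_share:.0%}")
```

On the one-hot DataFrame from step 5, a similar check could likely be done with `dataset.sum(axis=1).value_counts()`, which counts the `True` values in each row.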
