<a href="https://colab.research.google.com/github/hollyemblem/Recommenders/blob/main/Association_Rules.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Association Rules Notebook

In this notebook, I explore some of the examples provided by Kim Falk in Practical Recommender Systems, specifically implementing association rules (chapters 5-6).

I use a dataset from Kaggle: https://www.kaggle.com/datasets/mkechinov/ecommerce-purchase-history-from-electronics-store?resource=download

It contains a list of ecommerce purchases, linked by order ID. I then look at building frequent itemsets for association rules in two ways:

- Linked by productID
- Linked by category code

The category code example proves more relevant.

### Challenges
A big challenge with this dataset is that the minimum support required to generate any itemsets is really low: 0.0025 in some of the cells below, and 0.00025 for the final rules. The dilemma this introduces is described here:

"If the frequencies of items vary highly we will encounter two problems: firstly, if minsupp is set too high, we will not find those rules that involve infrequent items or rare items in the data. Secondly, in order to find rules that involve both frequent and rare items, we have to set minsupp very low.

However, this may cause combination explosion, producing too many rules, because those frequent items will be associated with one another in all possible ways and many of them are meaningless. This dilemma is called the rare item problem"

The solution proposed in the paper is to mine for profit rather than support.

Source: http://www.joebm.com/vol4/454-MH0004.pdf
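
To make the dilemma concrete, here is a minimal sketch on made-up baskets (the items and thresholds are illustrative assumptions, not taken from the Kaggle data). It computes item support by hand and shows that a high threshold hides the rare items, while a low one lets everything through.

```python
from collections import Counter

# Toy baskets (hypothetical): "phone" is frequent, "blanket" and "washer" are rare.
transactions = [
    {"phone"}, {"phone", "case"}, {"phone"}, {"phone", "case"},
    {"phone"}, {"phone"}, {"case"}, {"phone"},
    {"blanket", "washer"}, {"phone"},
]

def item_support(baskets):
    """Fraction of baskets that contain each item."""
    counts = Counter(item for basket in baskets for item in basket)
    return {item: n / len(baskets) for item, n in counts.items()}

def frequent_items(support, min_support):
    """Items whose support meets the threshold, sorted by name."""
    return sorted(item for item, s in support.items() if s >= min_support)

support = item_support(transactions)
high = frequent_items(support, 0.5)    # only the dominant item survives
low = frequent_items(support, 0.05)    # rare items appear, but so does everything else

print(high)
print(low)
```

On real data the low threshold also multiplies the number of candidate combinations among the frequent items, which is the "combination explosion" the quote refers to.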

### Examples
Some further implementation examples are here:

https://www.datacamp.com/tutorial/market-basket-analysis-r - This one is interesting as they specify quite a low support.

https://mhahsler.github.io/arules/docs/measures - using leverage as opposed to min support, also suffers from the same rare item problem

https://core.ac.uk/download/pdf/81961775.pdf - Further guidance on the rare item problem

https://link.springer.com/article/10.1007/s40747-018-0085-9 - Rare pattern mining, challenges and future perspectives




### Loading in Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/MyDrive/Recommendation\ Engines

/content/drive/MyDrive/Recommendation Engines


#### Reading data in with Pandas


In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('kz.csv')

In [None]:
df.head()

Unnamed: 0,event_time,order_id,product_id,category_id,category_code,brand,price,user_id
0,2020-04-24 11:50:39 UTC,2294359932054536986,1515966223509089906,2.268105e+18,electronics.tablet,samsung,162.01,1.515916e+18
1,2020-04-24 11:50:39 UTC,2294359932054536986,1515966223509089906,2.268105e+18,electronics.tablet,samsung,162.01,1.515916e+18
2,2020-04-24 14:37:43 UTC,2294444024058086220,2273948319057183658,2.268105e+18,electronics.audio.headphone,huawei,77.52,1.515916e+18
3,2020-04-24 14:37:43 UTC,2294444024058086220,2273948319057183658,2.268105e+18,electronics.audio.headphone,huawei,77.52,1.515916e+18
4,2020-04-24 19:16:21 UTC,2294584263154074236,2273948316817424439,2.268105e+18,,karcher,217.57,1.515916e+18


In [None]:
df.count()

event_time       2633521
order_id         2633521
product_id       2633521
category_id      2201567
category_code    2021319
brand            2127516
price            2201567
user_id           564169
dtype: int64

Drop rows with missing values in the key columns (`product_id`, `category_id`, `order_id`, `brand`, `category_code`)

In [None]:
df = df.dropna(subset = ['product_id', 'category_id', 'order_id', 'brand', 'category_code'])

In [None]:
df.count()

event_time       1532175
order_id         1532175
product_id       1532175
category_id      1532175
category_code    1532175
brand            1532175
price            1532175
user_id           420718
dtype: int64

Average number of products and categories purchased per order. Both cells count rows per order, so the two averages are identical:

In [None]:
products_per_order = df.groupby('order_id')['product_id'].count()
products_per_order.mean()

1.3400189610266942

In [None]:
cats_per_order = df.groupby('order_id')['category_code'].count()
cats_per_order.mean()

1.3400189610266942
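
If the aim is the number of distinct products or categories per order, `nunique()` is the usual alternative to `count()`. A minimal sketch on a made-up frame (the values are illustrative and the variable names are my own):

```python
import pandas as pd

# Toy orders: order 1 contains a repeated product, order 3 spans two categories.
toy = pd.DataFrame({
    "order_id": [1, 1, 1, 2, 3, 3],
    "product_id": [10, 10, 11, 12, 13, 14],
    "category_code": ["a", "a", "a", "b", "c", "d"],
})

# count() tallies rows per order, regardless of repeats.
rows_per_order = toy.groupby("order_id")["product_id"].count()

# nunique() tallies distinct values per order.
distinct_products = toy.groupby("order_id")["product_id"].nunique()
distinct_categories = toy.groupby("order_id")["category_code"].nunique()

print(rows_per_order.mean())
print(distinct_products.mean())
print(distinct_categories.mean())
```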

### Adding Association Rules Engine

In [None]:
!pip install mlxtend --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth

Expected format for frequent itemsets: https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/

In [None]:
dataset = df.groupby('order_id')['category_code'].apply(list).tolist()



In [None]:
len(dataset)



1143398

Transforming the transactions into the one-hot encoded format that mlxtend expects.

I use the sparse representation because of the size of the dataset (over 1.1 million orders).

In [None]:
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
oht_ary = te.fit(dataset).transform(dataset, sparse = True)
prod_df = pd.DataFrame.sparse.from_spmatrix(oht_ary, columns=te.columns_)
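
For intuition, the encoding turns each basket into a row of booleans with one column per distinct item. The sketch below reproduces that idea in plain Python on made-up baskets. It illustrates the format only and is not mlxtend's implementation; `one_hot` is a hypothetical helper.

```python
def one_hot(baskets):
    """Return (sorted column names, boolean rows), one row per basket."""
    columns = sorted({item for basket in baskets for item in basket})
    rows = [[item in basket for item in columns] for basket in baskets]
    return columns, rows

toy_baskets = [
    ["electronics.smartphone", "electronics.audio.headphone"],
    ["electronics.tablet"],
    ["electronics.smartphone", "electronics.smartphone"],  # repeats collapse to one True
]

columns, rows = one_hot(toy_baskets)
print(columns)
for row in rows:
    print(row)
```

Most cells in such a matrix are False when orders contain only one or two items, which is why a sparse representation saves memory.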



Returning itemsets with _some_ support, trying both apriori and fpgrowth

In [None]:
apriori(prod_df, min_support=0.001)



Unnamed: 0,support,itemsets
0,0.011313,(0)
1,0.004070,(3)
2,0.001260,(8)
3,0.023329,(9)
4,0.006663,(10)
...,...,...
147,0.002013,"(33, 29, 23)"
148,0.001818,"(33, 30, 23)"
149,0.001649,"(33, 29, 30)"
150,0.002234,"(33, 92, 30)"


In [None]:
fpgrowth(prod_df, min_support=0.001)



Unnamed: 0,support,itemsets
0,0.016770,(89)
1,0.054661,(81)
2,0.036197,(99)
3,0.273765,(88)
4,0.063376,(30)
...,...,...
147,0.001379,"(20, 30)"
148,0.001429,"(20, 29, 23)"
149,0.001999,"(98, 99)"
150,0.001227,"(37, 30)"


Adding column names

In [None]:
apriori(prod_df, min_support=0.10, use_colnames=True)


## Findings
At this point, we can see that the dataset is arguably too sparse and varied to be represented well with association rules at a reasonable support threshold.

Instead, I am going to look for categories that are purchased together within the same order ID.

In [None]:
cat_dataset = df.groupby('order_id')['category_code'].apply(list).tolist()



In [None]:
te2 = TransactionEncoder()
oht_ary2 = te2.fit(cat_dataset).transform(cat_dataset, sparse = True)
cat_df = pd.DataFrame.sparse.from_spmatrix(oht_ary2, columns=te2.columns_)



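I have to set a tiny support value to get results 😔. I use fpgrowth here, which is designed for transactional data such as store purchases and is typically faster than Apriori because it does not generate candidate itemsets explicitly: https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/fpgrowth/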

In [None]:
fpgrowth(cat_df, min_support=0.0025)



Unnamed: 0,support,itemsets
0,0.016770,(89)
1,0.054661,(81)
2,0.036197,(99)
3,0.273765,(88)
4,0.063376,(30)
...,...,...
69,0.003879,"(30, 23)"
70,0.007814,"(29, 23)"
71,0.002695,"(33, 29)"
72,0.003255,"(29, 30)"


In [None]:
frequent_itemsets = fpgrowth(cat_df, min_support=0.00025, use_colnames = True)



### Example of subsetting by itemset length, e.g. itemsets of exactly 2 items

In [None]:
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))



In [None]:
frequent_itemsets_subset = frequent_itemsets[ (frequent_itemsets['length'] == 2) ]



In [None]:
frequent_itemsets_subset



Unnamed: 0,support,itemsets,length
84,0.000504,"(electronics.audio.headphone, electronics.tablet)",2
85,0.002099,"(electronics.smartphone, electronics.tablet)",2
86,0.000350,"(computers.notebook, electronics.tablet)",2
87,0.000558,"(electronics.video.tv, electronics.tablet)",2
89,0.008456,"(electronics.smartphone, electronics.audio.hea...",2
...,...,...,...
440,0.000422,"(computers.components.motherboard, computers.c...",2
441,0.000478,"(computers.components.motherboard, computers.c...",2
442,0.000330,"(computers.components.motherboard, computers.c...",2
463,0.000540,"(furniture.bedroom.blanket, appliances.kitchen...",2


In [None]:
from mlxtend.frequent_patterns import association_rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)



For my example, I want rules with only one antecedent (in bread -> wine, bread is the antecedent), a confidence greater than 10% and a lift greater than 1.

In [None]:
rules["antecedent_len"] = rules["antecedents"].apply(lambda x: len(x))
rules



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric,antecedent_len
0,"(electronics.smartphone, electronics.video.tv)",(electronics.tablet),0.011105,0.016770,0.000352,0.031737,1.892482,0.000166,1.015458,0.476890,2
1,"(electronics.smartphone, electronics.tablet)",(electronics.video.tv),0.002099,0.060645,0.000352,0.167917,2.768861,0.000225,1.128920,0.640184,2
2,"(electronics.video.tv, electronics.tablet)",(electronics.smartphone),0.000558,0.273765,0.000352,0.631661,2.307315,0.000200,1.971652,0.566912,2
3,(electronics.smartphone),"(electronics.video.tv, electronics.tablet)",0.273765,0.000558,0.000352,0.001287,2.307315,0.000200,1.000730,0.780182,1
4,(electronics.video.tv),"(electronics.smartphone, electronics.tablet)",0.060645,0.002099,0.000352,0.005812,2.768861,0.000225,1.003735,0.680084,1
...,...,...,...,...,...,...,...,...,...,...,...
1163,(computers.components.power_supply),"(computers.components.motherboard, computers.c...",0.001125,0.000296,0.000262,0.233281,786.824757,0.000262,1.303873,0.999854,1
1164,(furniture.bedroom.blanket),(appliances.kitchen.washer),0.000750,0.046336,0.000540,0.719953,15.537517,0.000505,3.405374,0.936341,1
1165,(appliances.kitchen.washer),(furniture.bedroom.blanket),0.046336,0.000750,0.000540,0.011646,15.537517,0.000505,1.011025,0.981100,1
1166,(appliances.kitchen.refrigerators),(furniture.bedroom.blanket),0.063376,0.000750,0.000335,0.005285,7.051688,0.000287,1.004560,0.916259,1
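
As a sanity check on how these columns relate, the metrics for one rule can be recomputed from its three support values. The sketch below uses the rounded figures shown for the `furniture.bedroom.blanket` -> `appliances.kitchen.washer` row, so the results match the table only approximately. `rule_metrics` is a hypothetical helper, and the formulas are the standard textbook definitions rather than code taken from mlxtend.

```python
def rule_metrics(support_ab, support_a, support_b):
    """Standard association-rule metrics for the rule A -> B."""
    confidence = support_ab / support_a              # estimate of P(B | A)
    lift = confidence / support_b                    # > 1 suggests a positive association
    leverage = support_ab - support_a * support_b    # excess over independence
    # Conviction is undefined when confidence is exactly 1 (division by zero).
    conviction = (1 - support_b) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

# Rounded values copied from the blanket -> washer row of the rules table.
metrics = rule_metrics(support_ab=0.000540, support_a=0.000750, support_b=0.046336)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

The high confidence here rests on a very small support (about 0.05% of orders), which is worth keeping in mind when filtering on confidence and lift alone.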


In [None]:
rules_dataframe = rules[ (rules['antecedent_len'] == 1) &
       (rules['confidence'] > 0.1) &
       (rules['lift'] > 1) ]



In [None]:
rules_dataframe



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric,antecedent_len
22,(computers.peripherals.monitor),(computers.peripherals.mouse),0.006118,0.030351,0.000750,0.122516,4.036672,0.000564,1.105034,0.756902,1
24,(computers.peripherals.monitor),(computers.peripherals.keyboard),0.006118,0.007140,0.001045,0.170836,23.926249,0.001001,1.197423,0.964103,1
25,(computers.peripherals.keyboard),(computers.peripherals.monitor),0.007140,0.006118,0.001045,0.146374,23.926249,0.001001,1.164307,0.965096,1
30,(computers.peripherals.monitor),(computers.notebook),0.006118,0.056649,0.000665,0.108649,1.917944,0.000318,1.058339,0.481554,1
52,(computers.peripherals.printer),(computers.notebook),0.009404,0.056649,0.001486,0.158002,2.789163,0.000953,1.120373,0.647559,1
...,...,...,...,...,...,...,...,...,...,...,...
1149,(computers.components.power_supply),"(computers.components.motherboard, computers.c...",0.001125,0.000330,0.000278,0.247278,749.967130,0.000278,1.328074,0.999791,1
1161,(computers.components.motherboard),"(computers.components.power_supply, computers....",0.000995,0.000292,0.000262,0.263620,902.464140,0.000262,1.357599,0.999887,1
1163,(computers.components.power_supply),"(computers.components.motherboard, computers.c...",0.001125,0.000296,0.000262,0.233281,786.824757,0.000262,1.303873,0.999854,1
1164,(furniture.bedroom.blanket),(appliances.kitchen.washer),0.000750,0.046336,0.000540,0.719953,15.537517,0.000505,3.405374,0.936341,1


### How would you implement this as a system?

- For each category available to a user, we want to show at least one other option to buy.
- Where possible, we serve recommendations from the association rules that meet our requirements for confidence, lift and support.
- Not every category in this example has a consequent to recommend. We therefore need to measure the percentage of categories that the rules can cover, and fall back to a generic top 10 for the rest.
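
A minimal sketch of that serving logic, assuming the filtered rules have already been reduced to a mapping from a single antecedent category to `(consequent, lift)` pairs. The `recommend` helper, the mapping and the `top_sellers` fallback list are all hypothetical; the lifts are rounded from the rules table and the fallback ordering is invented for illustration.

```python
def recommend(category, rules_by_antecedent, top_sellers, k=3):
    """Return up to k categories: rule consequents by lift, then popular fallbacks."""
    picks = []
    ranked = sorted(
        rules_by_antecedent.get(category, []),
        key=lambda pair: pair[1],
        reverse=True,
    )
    for consequent, _lift in ranked:
        if consequent != category and consequent not in picks:
            picks.append(consequent)
    # Fill any remaining slots from the generic list.
    for fallback in top_sellers:
        if len(picks) >= k:
            break
        if fallback != category and fallback not in picks:
            picks.append(fallback)
    return picks[:k]

rules_by_antecedent = {
    "computers.peripherals.monitor": [
        ("computers.peripherals.mouse", 4.04),
        ("computers.peripherals.keyboard", 23.93),
        ("computers.notebook", 1.92),
    ],
}
top_sellers = [
    "electronics.smartphone",
    "appliances.kitchen.refrigerators",
    "computers.notebook",
]

with_rules = recommend("computers.peripherals.monitor", rules_by_antecedent, top_sellers, k=2)
without_rules = recommend("electronics.tablet", rules_by_antecedent, top_sellers, k=2)
print(with_rules)
print(without_rules)
```

Ranking by lift is one choice among several; confidence or a blend of the two would be equally easy to swap in through the `key` function.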

In [None]:
rules_dataframe.count()



antecedents           140
consequents           140
antecedent support    140
consequent support    140
support               140
confidence            140
lift                  140
leverage              140
conviction            140
zhangs_metric         140
antecedent_len        140
dtype: int64

In [None]:
(rules_dataframe['antecedents']).nunique()



31

There are 31 category codes that have coverage (each antecedent here is a single category). How many unique categories are there in the original dataframe?

In [None]:
df['category_code'].nunique()



123

### Association Rules coverage.

About 25% of categories (31 of 123) have an associated rule. We can tune this number by adjusting the support, confidence and lift thresholds.

In [None]:
31/123



0.25203252032520324
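
The same ratio can be computed from the data rather than from hard-coded counts. A minimal sketch on made-up inputs; `rule_coverage` is a hypothetical helper, and the toy lists stand in for `rules_dataframe['antecedents']` and `df['category_code']`.

```python
def rule_coverage(antecedents, all_categories):
    """Share of categories that appear in the antecedent of at least one rule."""
    categories = set(all_categories)
    covered = {item for antecedent in antecedents for item in antecedent}
    return len(covered & categories) / len(categories)

# Toy stand-ins: two categories have rules, four categories exist in total.
toy_antecedents = [frozenset({"a"}), frozenset({"a"}), frozenset({"b"})]
toy_categories = ["a", "b", "c", "d", "a"]

coverage = rule_coverage(toy_antecedents, toy_categories)
print(f"{coverage:.0%} of categories have at least one rule")
```

Recomputing coverage this way after each change of threshold makes it easier to see the trade-off between rule quality and how many categories can be served.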

## Findings and Future Development

To understand the impact of association-rule recommendations compared with non-personalised ones, a future experiment could take products that have an association rule and compare serving the associated items against serving a generic top 10, measuring the uplift in metrics such as purchases or ARPU.

Future development could also include clustering items together, e.g. if you select a certain content type, we show you another item from that cluster.