# Env setup

In [8]:
import datetime as dt
from datetime import date
import random

from mlxtend import frequent_patterns
import pandas as pd


# Requirements

1. Implicit or explicit ratings.
    - Each rating needs a creation date so that time decay can be applied.

In [9]:
today = date.today()
ratings = [{
    "user_id": random.randint(1, 100), 
    "item_id": random.randint(2, 200), 
    "rating": random.random() * 10, 
    "created_date": today - dt.timedelta(days=random.randint(0, 200))
} for _ in range(1000)]

ratings_df = pd.DataFrame(ratings)
# Keep only the first rating for each (user_id, item_id) pair.
ratings_df = ratings_df.drop_duplicates(subset=["user_id", "item_id"])

# Static non-personalised recommenders

## Top 10

### All-time

In [11]:
ratings_df.groupby("item_id")["rating"].mean().sort_values(ascending=False).reset_index().head(10)

Unnamed: 0,item_id,rating
0,51,8.823776
1,64,8.540408
2,86,8.511595
3,119,8.363824
4,156,8.05357
5,90,7.612117
6,48,7.562261
7,106,7.341274
8,191,7.266507
9,137,7.204925


### Time-weighted

Applying a decay function adds a recency effect to the top 10: recent ratings count for more than old ones. Decay functions can take several forms (linear, exponential, etc.).

In [13]:
ratings_df["days_since"] = (today - ratings_df["created_date"]).apply(lambda delta: delta.days)
ratings_df["weighted_rating"] = ratings_df["rating"] / (1 + ratings_df["days_since"])
ratings_df.groupby("item_id")["weighted_rating"].mean().sort_values(ascending=False).reset_index().head(10)

Unnamed: 0,item_id,weighted_rating
0,7,1.216613
1,186,1.17081
2,147,0.844251
3,121,0.765872
4,178,0.689128
5,192,0.539768
6,21,0.469681
7,148,0.451014
8,67,0.439227
9,36,0.437203
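The cell above weights each rating by `1 / (1 + days_since)`, which is a hyperbolic decay. As a rough sketch of the other forms mentioned, the functions below show exponential and linear alternatives. The names `half_life_days` and `window_days`, and their default values, are hypothetical parameters chosen for illustration and are not part of the original notebook.

```python
def hyperbolic_decay(days_since: int) -> float:
    """Weight used in the notebook cell: 1 / (1 + days)."""
    return 1.0 / (1.0 + days_since)


def exponential_decay(days_since: int, half_life_days: float = 30.0) -> float:
    """Weight halves every `half_life_days` (assumed parameter)."""
    return 0.5 ** (days_since / half_life_days)


def linear_decay(days_since: int, window_days: int = 200) -> float:
    """Weight falls linearly from 1 to 0 over `window_days` (assumed parameter)."""
    return max(0.0, 1.0 - days_since / window_days)


# Compare the three weights for a rating made today, a month ago, and 200 days ago.
for days in (0, 30, 200):
    print(days, hyperbolic_decay(days), exponential_decay(days), linear_decay(days))
```

Any of these could be applied in place of the division above, for example with `ratings_df["days_since"].apply(exponential_decay)`. The hyperbolic form drops off very quickly in the first few days, while the exponential and linear forms give more control over how long a rating stays relevant.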


## Frequently bought together (FBT) recommendation 

Created using the [online retail dataset](https://archive.ics.uci.edu/dataset/352/online+retail) (last accessed 2023-10-14) from the UCI ML repository. The file must be accessible locally as `data/online_retail.xlsx`.


In [14]:
retail_df = pd.read_excel("../data/online_retail.xlsx")
retail_df = retail_df.rename(columns={
    "InvoiceNo": "invoice_id",
    "StockCode": "stock_id",
    "Description": "description",
    "Quantity": "quantity",
    "InvoiceDate": "invoiced_at",
    "UnitPrice": "unit_price",
    "CustomerID": "customer_id",
    "Country": "country"
})
retail_df = retail_df.drop(columns=["country", "quantity", "unit_price"])
retail_df["stock_id"] = retail_df["stock_id"].astype(str)
retail_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   invoice_id   541909 non-null  object        
 1   stock_id     541909 non-null  object        
 2   description  540455 non-null  object        
 3   invoiced_at  541909 non-null  datetime64[ns]
 4   customer_id  406829 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(3)
memory usage: 20.7+ MB


### Methods

- [Apriori algorithm](https://analyticslog.com/blog/2020/8/13/apriori-algorithm-items-frequently-bought-together-a-basic-explanation-of-how-it-works) (last accessed 2023-10-01)
- [FP-growth algorithm](https://hands-on.cloud/implementation-of-fp-growth-algorithm-using-python) (last accessed 2023-10-01)

FP-growth improves upon Apriori by being faster and computationally more efficient.

Over 7 runs, FP-growth took 2.06 s ± 18.8 ms (mean ± std. dev.) to find all n-item itemsets across 25900 sales and 4070 items.
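To give a feel for where Apriori spends its time, here is a minimal pure-Python sketch of its level-wise search on a toy dataset. It is an illustration only, not the `mlxtend` implementation; the function name `apriori_sketch` and the toy baskets are made up for this example. Note that every level needs another full pass over the transactions to count candidates, which is the step FP-growth avoids by compressing the transactions into a prefix tree.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # One full pass over all transactions per candidate.
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items.
    items = {item for basket in baskets for item in basket}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count step: keep only candidates that reach the support threshold.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


toy_transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "milk"},
]
frequent_itemsets = apriori_sketch(toy_transactions, min_support=0.4)
for itemset, s in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a support threshold of 0.4, `eggs` is dropped at the first level, so no candidate containing it is ever generated or counted.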

In [15]:
pvt_df = pd.crosstab(index=retail_df["invoice_id"], columns=retail_df["stock_id"]).map(bool)
pvt_df.shape

(25900, 4070)

In [16]:
#%%timeit

fg_result = frequent_patterns.fpgrowth(pvt_df, min_support=0.01, use_colnames=True)
ar_result = frequent_patterns.association_rules(fg_result, metric="lift", min_threshold=1)
ar_result.sort_values("confidence", ascending=False).head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
1092,"(22699, 22698, 22423)",(22697),0.013012,0.040811,0.011699,0.89911,22.031167,0.011168,9.507258,0.967194
1317,(23172),(23171),0.012124,0.014903,0.010888,0.898089,60.260387,0.010707,9.66626,0.995474
363,"(21080, 21086)",(21094),0.011429,0.020347,0.010232,0.89527,43.999051,0.009999,9.354101,0.98857
1105,"(22699, 22698)",(22697),0.023707,0.040811,0.021197,0.894137,21.909313,0.020229,9.060649,0.977531
1093,"(22697, 22698, 22423)",(22699),0.013359,0.043243,0.011699,0.875723,20.251084,0.011121,7.698554,0.963491


| Term | Definition |
| --- | --- |
| Antecedent (A) | First part of the conditional proposition (the "if" itemset) |
| Consequent (C) | Second part of the conditional proposition (the "then" itemset) |



### Formulas

T(X) = number of transactions that contain itemset X. N = total number of transactions.

$support(A \rightarrow C) = support(A \cup C) = \frac{T(A \cup C)}{N}$
- P(A and C are bought together).

$confidence(A \rightarrow C) = \frac{support(A \rightarrow C)}{support(A)} = \frac{T(A \cup C)}{T(A)}$ 
- P(C is bought when A is bought).

$lift(A \rightarrow C) = \frac{confidence(A \rightarrow C)}{support(C)} = \frac{support(A \rightarrow C)}{support(A) \cdot support(C)}$
- P(A and C are bought together) / P(A and C are bought together assuming independence). A value of 1 indicates independence; values above 1 mean A and C are bought together more often than expected, and values below 1 less often.

$leverage(A \rightarrow C) = support(A \rightarrow C) - support(A) \cdot support(C)$ 
- P(A and C are bought together) - P(A and C are bought together assuming independence). A value of 0 indicates independence; positive values mean A and C are bought together more often than expected, and negative values less often.
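As a small worked example, the sketch below computes the four metrics directly from their definitions for a single rule on a hand-made list of baskets. The baskets, the item names, and the helper `support` are invented for illustration; on real data these values come from `association_rules`.

```python
# Toy baskets, invented for this example.
transactions = [
    {"tea", "cup"},
    {"tea", "cup", "saucer"},
    {"tea"},
    {"cup", "saucer"},
    {"tea", "cup", "saucer"},
    {"saucer"},
    {"tea", "cup"},
    {"plate"},
]
n_transactions = len(transactions)


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in transactions) / n_transactions


antecedent = {"cup"}
consequent = {"saucer"}

support_a = support(antecedent)                # 5 of 8 baskets contain a cup
support_c = support(consequent)                # 4 of 8 baskets contain a saucer
support_ac = support(antecedent | consequent)  # 3 of 8 baskets contain both

confidence = support_ac / support_a
lift = confidence / support_c
leverage = support_ac - support_a * support_c

print(f"support={support_ac:.3f} confidence={confidence:.3f} "
      f"lift={lift:.3f} leverage={leverage:.4f}")
```

For the rule cup → saucer this gives support 0.375, confidence 0.6, lift 1.2 and leverage 0.0625. Lift is above 1 and leverage is above 0, so in this toy data the two items appear together somewhat more often than independence would predict.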
