# SAR 

Simple Algorithm for Recommendation (SAR) is a fast and scalable algorithm for personalized recommendations based on user transaction history. It produces recommendations that are easy to explain and interpret, and it handles "cold item" and "semi-cold user" scenarios. SAR is a neighborhood-based algorithm (as discussed in [Recommender Systems by Aggarwal](https://dl.acm.org/citation.cfm?id=2931100)) intended for ranking the top items for each user.

SAR recommends items that are most ***similar*** to the ones that the user already has an existing ***affinity*** for. Two items are ***similar*** if the users that interacted with one item are also likely to have interacted with the other. A user has an ***affinity*** to an item if they have interacted with it in the past.

### Advantages of SAR:
- High accuracy for an algorithm that is easy to train and deploy
- Fast training, requiring only simple counting to construct the matrices used at prediction time
- Fast scoring, involving only the multiplication of the similarity matrix with an affinity vector

### Disadvantages of SAR:
- Since it does not use item or user features, it can be at a disadvantage against algorithms that do.
- It is memory-hungry, requiring the creation of an $m \times m$ sparse square matrix (where $m$ is the number of items). This can also be a problem for many matrix factorization algorithms.
- SAR favors an implicit rating scenario and does not predict ratings.

## 0 Global Settings and Imports

In [1]:
import logging
import numpy as np
import pandas as pd

from recommenders.utils.timer import Timer
from recommenders.models.sar import SAR
from recommenders.evaluation.python_evaluation import (
    map_at_k,
    ndcg_at_k,
    precision_at_k,
    recall_at_k,
)


print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

NumPy version: 1.24.3
Pandas version: 1.5.3


## 1 Load Data

SAR is intended to be used on interactions with the following schema:
`<User ID>, <Item ID>, <Time>, [<Event Type>], [<Event Weight>]`.

These columns map directly onto the `recommenders` API: `rating` is the event weight and `timestamp` is the time.

In [2]:
# top k items to recommend
TOP_K = 10

In [3]:
train = pd.read_csv("../data/interim/ml-100k/train.csv")
test = pd.read_csv("../data/interim/ml-100k/test.csv")

train.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,1,31,3.0,875072144
1,1,39,4.0,875072173
2,1,163,4.0,875072442
3,1,226,3.0,878543176
4,1,169,5.0,878543541


## 2 Train the SAR Model

### 2.1 Instantiate the SAR algorithm and set the index

I will use the single-node implementation of SAR and specify the column names to match the dataset.

In [4]:
model = SAR(
    col_user="user_id",
    col_item="item_id",
    col_rating="rating",
    col_timestamp="timestamp",
    similarity_type="jaccard",
    time_decay_coefficient=30,
    timedecay_formula=True,
    normalize=True,
)
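
The `time_decay_coefficient=30` and `timedecay_formula=True` settings above control how quickly old interactions lose influence. As far as I understand the `recommenders` documentation, the coefficient is a half-life in days, so an event's weight is halved for every 30 days between the event and a reference time. The sketch below illustrates that idea in plain Python; the function name `decayed_affinity`, and the choice of the latest timestamp as the reference time, are my own assumptions for illustration and not the library's code.

```python
# Illustrative sketch of a half-life time decay; not the library's implementation.
SECONDS_PER_DAY = 24 * 60 * 60


def decayed_affinity(events, reference_time, half_life_days=30):
    """Sum event weights, halving each weight for every half-life elapsed.

    events: iterable of (weight, unix_timestamp) pairs for one user-item pair.
    reference_time: unix timestamp the decay is measured against
        (assumed here to be the latest timestamp in the data).
    """
    half_life = half_life_days * SECONDS_PER_DAY
    return sum(
        weight * 0.5 ** ((reference_time - timestamp) / half_life)
        for weight, timestamp in events
    )


t0 = 878543541  # latest timestamp among the sample rows shown earlier
fresh = decayed_affinity([(4.0, t0)], t0)
one_half_life_old = decayed_affinity([(4.0, t0 - 30 * SECONDS_PER_DAY)], t0)
print(fresh, one_half_life_old)  # 4.0 2.0
```

A rating of 4.0 given exactly 30 days before the reference time contributes 2.0, which is the behaviour a 30-day half-life should produce.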

### 2.2 Train the SAR model on our training data, and get the top-k recommendations for our testing data

#### How it works

1. SAR computes an item-to-item ***co-occurrence matrix***. The co-occurrence of two items is the number of users who interacted with both of them. This matrix is then rescaled into a ***similarity matrix*** $S$ (here using Jaccard similarity).

2. SAR computes an ***affinity matrix*** $A$ to capture the strength of the relationship between each user and each item. Affinity is driven by the *rating* and the *time* of the event.

3. Recommendations are produced by multiplying the affinity matrix $A$ and the similarity matrix $S$. The result is a ***recommendation score matrix*** $R$ from which the top-k recommendations can be extracted.
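
The three steps above can be sketched on a toy dataset in plain Python. This is only an illustration of the idea: the interactions, the helper names `jaccard` and `recommend`, and the use of dictionaries instead of sparse matrices are all my own assumptions, and details such as time decay and score normalization are left out.

```python
from collections import defaultdict

# Toy interactions: (user, item, affinity weight). Purely illustrative data.
interactions = [
    ("u1", "A", 1.0), ("u1", "B", 1.0),
    ("u2", "A", 1.0), ("u2", "B", 1.0), ("u2", "C", 1.0),
    ("u3", "B", 1.0), ("u3", "C", 1.0),
]

# Step 1: co-occurrence and Jaccard similarity from the set of users per item.
users_of = defaultdict(set)
for user, item, _ in interactions:
    users_of[item].add(user)
items = sorted(users_of)


def jaccard(i, j):
    both = len(users_of[i] & users_of[j])    # co-occurrence c_ij
    either = len(users_of[i] | users_of[j])  # c_ii + c_jj - c_ij
    return both / either


similarity = {(i, j): jaccard(i, j) for i in items for j in items}

# Step 2: affinity of each user for each item they interacted with.
affinity = defaultdict(dict)
for user, item, weight in interactions:
    affinity[user][item] = affinity[user].get(item, 0.0) + weight


# Step 3: score = affinity vector times similarity matrix, then take the top k.
def recommend(user, k=2, remove_seen=True):
    seen = affinity[user]
    scores = {
        j: sum(a * similarity[(i, j)] for i, a in seen.items())
        for j in items
        if not (remove_seen and j in seen)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


print(recommend("u1"))
```

User `u1` has interacted with `A` and `B`, so the only candidate is `C`, whose score is the sum of its similarity to `A` (1/3) and to `B` (2/3).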

In [5]:
with Timer() as train_time:
    model.fit(train)

print("Took {} seconds for training.".format(train_time.interval))

Took 0.2995656509992841 seconds for training.


In [6]:
with Timer() as test_time:
    top_k = model.recommend_k_items(test, top_k=TOP_K, remove_seen=True)

print("Took {} seconds for prediction.".format(test_time.interval))

Took 0.23474036500010698 seconds for prediction.


In [7]:
top_k.head(100)

Unnamed: 0,user_id,item_id,prediction
0,1,204,3.231405
1,1,89,3.199445
2,1,11,3.154097
3,1,367,3.113913
4,1,423,3.054493
...,...,...,...
95,10,172,3.941404
96,10,423,3.938111
97,10,318,3.898689
98,10,183,3.897613


In [8]:
top_k[["prediction"]].describe()

Unnamed: 0,prediction
count,9430.0
mean,3.058003
std,0.540984
min,0.896047
25%,2.694779
50%,3.112189
75%,3.451543
max,4.569483


### 2.3 Evaluate how well SAR performs

I am evaluating `SAR` using four ranking metrics (because, why not): `MAP@10` (Mean Average Precision), `NDCG@10` (Normalized Discounted Cumulative Gain), `Precision@10`, and `Recall@10`.

For the final results, though, I will use `Precision@10`, because it is the most important metric for my use case. I think it is better to recommend fewer items that the user is likely to enjoy than to recommend more items that the user will not like.
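
To make the difference between precision and recall concrete, here is a minimal sketch for a single user. The item lists are hypothetical and the helper `precision_recall_at_k` is my own; the library functions average such per-user values over all users and may differ in edge-case handling.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Compute precision@k and recall@k for one user.

    recommended: ranked list of recommended item ids.
    relevant: set of item ids the user actually interacted with in the test set.
    """
    hits = sum(1 for item in recommended[:k] if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


recommended = [204, 89, 11, 367, 423]  # hypothetical top-5 list for one user
relevant = {89, 423, 50, 100}          # hypothetical held-out items for that user
p, r = precision_recall_at_k(recommended, relevant, k=5)
print(p, r)  # 0.4 0.5
```

Two of the five recommendations are hits, so precision is 2/5, while recall is 2/4 because the user has four relevant items in total.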

In [9]:
# Ranking metrics
eval_map = map_at_k(
    test, top_k, col_user="user_id", col_item="item_id", col_rating="rating", k=TOP_K
)
eval_ndcg = ndcg_at_k(
    test, top_k, col_user="user_id", col_item="item_id", col_rating="rating", k=TOP_K
)
eval_precision = precision_at_k(
    test, top_k, col_user="user_id", col_item="item_id", col_rating="rating", k=TOP_K
)
eval_recall = recall_at_k(
    test, top_k, col_user="user_id", col_item="item_id", col_rating="rating", k=TOP_K
)

In [10]:
print(
    "Model:\t",
    "Top K:\t\t%d" % TOP_K,
    "MAP:\t\t%f" % eval_map,
    "NDCG:\t\t%f" % eval_ndcg,
    "Precision@K:\t%f" % eval_precision,
    "Recall@K:\t%f" % eval_recall,
    sep="\n",
)

Model:	
Top K:		10
MAP:		0.110591
NDCG:		0.382461
Precision@K:	0.330753
Recall@K:	0.176385


## 3 Saving the model checkpoint

In [11]:
import pickle


# Serialize the fitted model so it can be reloaded without retraining
with open("../models/sar-best.pkl", "wb") as f:
    pickle.dump(model, f)
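
Loading the checkpoint back is the mirror image of saving it. The sketch below shows the round trip with a stand-in dictionary and a temporary directory so that it runs without the `recommenders` package; with the real checkpoint, the same `pickle.load` call would be pointed at `../models/sar-best.pkl`, and `recommenders` would need to be importable in the loading environment.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the fitted SAR model so the sketch runs without the library.
stand_in_model = {"similarity_type": "jaccard", "top_k": 10}

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = Path(tmp_dir) / "sar-best.pkl"  # hypothetical temporary location

    # Save the object, then read it back from disk.
    with open(checkpoint, "wb") as f:
        pickle.dump(stand_in_model, f)
    with open(checkpoint, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)  # True
```

Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code.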

## References

- [microsoft-recommenders official example](https://github.com/recommenders-team/recommenders/blob/main/examples/00_quick_start/sar_movielens.ipynb)