# SAR Single Node on Player Mock dataset

This notebook applies the Simple Algorithm for Recommendation (SAR), which is designed to handle cold item and semi-cold user scenarios, to the Player Mock dataset.

SAR recommends items that are most ***similar*** to the ones for which the user already has an ***affinity***. Two items are ***similar*** if users who interacted with one item are also likely to have interacted with the other. A user has an ***affinity*** for an item if they have interacted with it in the past.
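
To make the two definitions concrete, here is a minimal sketch that counts how often two items appear in the same user's history and uses those counts to rank candidates for one user. The users and item labels are hypothetical and are not taken from the dataset used later, and the raw counting is only an illustration of the idea, not the library's implementation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical interaction histories: user -> set of items they interacted with.
histories = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"B", "C"},
    "u4": {"A"},
}

# Count how many users interacted with each unordered pair of items.
pair_counts = Counter()
for items in histories.values():
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

# A user has an affinity for every item in their history.
affinity_u4 = histories["u4"]

# Rank items u4 has not seen by how often they co-occur with items u4 already likes.
candidates = Counter()
for (a, b), n in pair_counts.items():
    if a in affinity_u4 and b not in affinity_u4:
        candidates[b] += n
    elif b in affinity_u4 and a not in affinity_u4:
        candidates[a] += n

print(pair_counts)
print(candidates.most_common())
```

In this toy data, item B co-occurs with A for two users and item C for one, so B ranks ahead of C for user `u4`.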

### Advantages of SAR:
- High accuracy for an algorithm that is easy to train and deploy
- Fast training, requiring only simple counting to construct the matrices used at prediction time
- Fast scoring, involving only the multiplication of the similarity matrix with an affinity vector

### Notes on using SAR properly:
- Since it does not use item or user features, it can be at a disadvantage against algorithms that do.
- It's memory-hungry, requiring the creation of an $m \times m$ sparse square matrix (where $m$ is the number of items). This can also be a problem for many matrix factorization algorithms.
- SAR favors an implicit rating scenario and it does not predict ratings.
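
As a rough illustration of the memory note above, the sketch below computes the size of a dense $m \times m$ matrix for a few item counts. The helper `dense_matrix_bytes` is hypothetical, and a dense matrix is only an upper bound: a sparse matrix stores just the non-zero entries, so the real footprint depends on how many item pairs actually co-occur.

```python
def dense_matrix_bytes(num_items: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to store a dense num_items x num_items matrix."""
    return num_items * num_items * bytes_per_value


# 51 is the number of unique items reported for this dataset; the others are hypothetical.
for m in (51, 10_000, 1_000_000):
    gib = dense_matrix_bytes(m) / 2**30
    print(f"m = {m:>9,d}: {gib:,.4f} GiB as dense float64")
```

The quadratic growth is the point: a catalogue of this notebook's size is negligible, while a catalogue with millions of items is not.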

## Set up

In [1]:
import pandas as pd
from sklearn.preprocessing import minmax_scale
from recommenders.datasets.python_splitters import python_stratified_split
from recommenders.models.sar import SAR
from recommenders.utils.timer import Timer
from recommenders.utils.python_utils import binarize
from recommenders.utils.notebook_utils import store_metadata
from recommenders.evaluation.python_evaluation import (
    map,
    ndcg_at_k,
    precision_at_k,
    recall_at_k,
    rmse,
    mae,
    logloss,
    rsquared,
    exp_var
)
import logging
import numpy as np
import datetime as dt


## 1. Load Data

Each row represents a single interaction between a user and an item.

### 1.1 Load data

In [2]:
match_df = pd.read_csv('datasets/player_data.csv')

print(match_df.head())
print('\nNumber of entries in dataset: ' + str(len(match_df)))

# top k items to recommend
TOP_K = 10

   itemId  userId  rating            timestamp              playerName  \
0      24     350       2  2020-07-06 07:22:44  Trent Alexander-Arnold   
1      22     270       3  2024-05-21 23:33:49  Trent Alexander-Arnold   
2       1     168       5  2018-05-18 09:32:21       Cristiano Ronaldo   
3       0     425       4  2015-07-14 18:44:49            Lionel Messi   
4       0      29       5  2019-02-24 03:26:47            Lionel Messi   

  playerTeam playerCountry  
0  Liverpool    Angleterre  
1    Arsenal    Angleterre  
2   Al-Nassr      Portugal  
3  Barcelona     Argentina  
4  Barcelona     Argentina  

Number of entries in dataset: 50000


### 1.2 Transform data

The **rating** column holds integer ratings (values such as 2, 3, 4 and 5 appear in the sample above). We cast it to 32-bit floats to reduce memory consumption.

The **timestamp** column needs to be a numeric value, so we add a new column **timestamp_seconds** that holds the number of seconds since the Unix epoch.
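
One way to convert timestamps to epoch seconds without depending on the column's internal resolution is sketched below. It is an alternative shown for illustration, not the cell this notebook runs; the two timestamps are copied from rows shown later in this notebook.

```python
import pandas as pd

# Two timestamps copied from the outputs in this notebook.
ts = pd.to_datetime(pd.Series(["2012-07-13 18:36:23", "2014-03-04 15:33:39"]))

# Subtract the Unix epoch and floor-divide by one second.
# This does not rely on the column being stored at nanosecond resolution.
epoch = pd.Timestamp("1970-01-01")
seconds = (ts - epoch) // pd.Timedelta(seconds=1)

print(seconds.tolist())
```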

In [3]:
# Cast ratings to 32-bit floats to reduce memory consumption
match_df['rating'] = match_df['rating'].astype(np.float32)

# Parse the timestamp strings, then convert them to whole seconds since the Unix epoch
# (integer division by 10**9 assumes the column is stored at nanosecond resolution)
match_df['timestamp'] = pd.to_datetime(match_df['timestamp'])
match_df['timestamp_seconds'] = match_df['timestamp'].astype('int64') // 10**9

print(match_df)

print(match_df.dtypes)

# Check number of null lines, should be 0
print("\nnull lines: ", match_df['timestamp_seconds'].isna().sum()) 

       itemId  userId  rating           timestamp              playerName  \
0          24     350     2.0 2020-07-06 07:22:44  Trent Alexander-Arnold   
1          22     270     3.0 2024-05-21 23:33:49  Trent Alexander-Arnold   
2           1     168     5.0 2018-05-18 09:32:21       Cristiano Ronaldo   
3           0     425     4.0 2015-07-14 18:44:49            Lionel Messi   
4           0      29     5.0 2019-02-24 03:26:47            Lionel Messi   
...       ...     ...     ...                 ...                     ...   
49995      17      84     2.0 2012-07-26 06:30:43              Toni Kroos   
49996      45     399     1.0 2016-03-16 18:37:04          Erling Haaland   
49997       8     247     1.0 2017-04-18 11:16:41         Kevin De Bruyne   
49998      38     445     4.0 2021-11-14 02:37:08             Luka Modrić   
49999      24     213     4.0 2024-03-03 19:03:02  Trent Alexander-Arnold   

              playerTeam playerCountry  timestamp_seconds  
0              

### 1.3 Split data using the Python stratified splitter

Split the dataset into training and test sets to evaluate the algorithm's performance.
All users in the test set must also exist in the training set, so we use `python_stratified_split`.
The function holds out a percentage of each user's interactions (25% here) and ensures that every user appears in both the `train` and `test` datasets.
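
The idea of a per-user hold-out can be sketched with plain pandas. This is an illustration on hypothetical toy data, not the implementation of `python_stratified_split`; the names `toy`, `toy_train` and `toy_test` are made up for this example.

```python
import pandas as pd

# Hypothetical toy interactions: 3 users with 4 interactions each.
toy = pd.DataFrame({
    "userId": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "itemId": [10, 11, 12, 13, 10, 12, 14, 15, 11, 13, 14, 15],
})

# Sample 75% of each user's rows for training; the remaining rows form the test set.
toy_train = toy.groupby("userId", group_keys=False).sample(frac=0.75, random_state=42)
toy_test = toy.drop(toy_train.index)

print(toy_train["userId"].value_counts().sort_index())
print(toy_test["userId"].value_counts().sort_index())
```

Because sampling happens within each user's group, every user keeps rows on both sides of the split, which is the property the evaluation below relies on.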

In [4]:
train, test = python_stratified_split(match_df, ratio=0.75, col_user="userId", col_item="itemId", seed=42)

In [5]:
print("""
Train:
Total Ratings: {train_total}
Unique Users: {train_users}
Unique Items: {train_items}

Test:
Total Ratings: {test_total}
Unique Users: {test_users}
Unique Items: {test_items}
""".format(
    train_total=len(train),
    train_users=len(train['userId'].unique()),
    train_items=len(train['itemId'].unique()),
    test_total=len(test),
    test_users=len(test['userId'].unique()),
    test_items=len(test['itemId'].unique()),
))


Train:
Total Ratings: 37505
Unique Users: 500
Unique Items: 51

Test:
Total Ratings: 12495
Unique Users: 500
Unique Items: 51



## 2. Train the SAR Model

### 2.1 Instantiate the SAR algorithm and set the index

In [6]:
logging.basicConfig(level=logging.DEBUG, 
                    format='%(asctime)s %(levelname)-8s %(message)s')

model = SAR(
    col_user="userId",
    col_item="itemId",
    col_rating="rating",
    col_timestamp="timestamp_seconds",
    similarity_type="jaccard", 
    time_decay_coefficient=30, 
    timedecay_formula=True,
    normalize=True
)
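
The effect of time decay can be sketched as follows. This assumes that `time_decay_coefficient` acts as a half-life expressed in days, which is my reading of the library's behaviour and should be checked against its documentation; the ratings and ages are hypothetical.

```python
import numpy as np

SECONDS_PER_DAY = 24 * 60 * 60
half_life_days = 30  # mirrors time_decay_coefficient=30 (assumed to be a half-life in days)

# Hypothetical ratings and their age in days relative to the most recent interaction.
ratings = np.array([5.0, 5.0, 5.0])
age_days = np.array([0.0, 30.0, 60.0])

# Exponential decay: the weight halves every `half_life_days`.
age_seconds = age_days * SECONDS_PER_DAY
weights = np.power(0.5, age_seconds / (half_life_days * SECONDS_PER_DAY))
decayed = ratings * weights

print(decayed)
```

If the coefficient is indeed a half-life in days, interactions that are years old (the sample rows in this notebook span 2012 to 2024) would carry weights very close to zero relative to the newest interaction.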

### 2.2 Train the SAR model on the training data, and get the top-k recommendations for the testing data

SAR first builds an item-to-item **co-occurrence matrix** (the number of users who interacted with both items), then derives an **item similarity matrix** by rescaling the co-occurrences with the chosen metric (Jaccard here).

We also compute an **affinity matrix** to capture the strength of the relationship between each user and each item.

We get the **recommendations** by multiplying the affinity matrix and the similarity matrix.
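
The three steps above can be sketched with NumPy on a tiny, hypothetical user-item matrix. This is a simplified illustration (binary interactions, no ratings, no time decay, no normalization), not the library's code.

```python
import numpy as np

# Hypothetical binary user-item matrix: 4 users (rows) x 3 items (columns).
interactions = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 0],
], dtype=float)

# Co-occurrence: entry (i, j) counts users who interacted with both item i and item j.
cooccurrence = interactions.T @ interactions

# Jaccard rescaling: c_ij / (c_ii + c_jj - c_ij).
diag = np.diag(cooccurrence)
similarity = cooccurrence / (diag[:, None] + diag[None, :] - cooccurrence)

# Here the affinity matrix is simply the interaction matrix.
affinity = interactions
scores = affinity @ similarity

# Mask items each user has already seen, then pick the best remaining item.
scores_unseen = np.where(interactions > 0, -np.inf, scores)
best_item_for_user_3 = int(np.argmax(scores_unseen[3]))

print(np.round(similarity, 3))
print(best_item_for_user_3)
```

User 3 has only interacted with item 0, so their scores are row 0 of the similarity matrix, and the most similar unseen item (item 1) is recommended.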

In [7]:
with Timer() as train_time:
    model.fit(train)

print("Training finished in {} seconds.".format(train_time.interval))

2025-01-28 18:59:59,259 INFO     Collecting user affinity matrix
2025-01-28 18:59:59,261 INFO     Calculating time-decayed affinities
2025-01-28 18:59:59,266 INFO     Creating index columns
2025-01-28 18:59:59,276 INFO     Calculating normalization factors
2025-01-28 18:59:59,280 INFO     Building user affinity sparse matrix
2025-01-28 18:59:59,282 INFO     Calculating item co-occurrence
2025-01-28 18:59:59,284 INFO     Calculating item similarity
2025-01-28 18:59:59,284 INFO     Using jaccard based similarity
2025-01-28 18:59:59,285 INFO     Done training


Training finished in 0.02944191699998555 seconds.


In [8]:
with Timer() as test_time:
    top_k = model.recommend_k_items(test, top_k=TOP_K, remove_seen=True)

print("Prediction finished in {} seconds.".format(test_time.interval))

2025-01-28 18:59:59,290 INFO     Calculating recommendation scores
2025-01-28 18:59:59,292 INFO     Removing seen items


Prediction finished in 0.006701333000023624 seconds.


In [9]:
top_k

Unnamed: 0,userId,itemId,prediction
0,1,23,0.046925
1,1,11,0.045363
2,1,31,0.044808
3,1,36,0.044477
4,1,46,0.044472
...,...,...,...
4993,500,43,0.002435
4994,500,1,0.002433
4995,500,13,0.002401
4996,500,35,0.002395


### 2.3 Evaluate SAR performance
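
Before computing the metrics, it may help to see what precision@k and recall@k measure for a single user. The function below is a hypothetical helper written for this illustration; the library functions used in this section work on whole DataFrames and average over users, so their details may differ.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user.

    recommended: ranked list of item ids (best first)
    relevant: set of item ids the user interacted with in the test set
    """
    top = recommended[:k]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical user: 5 recommendations, 4 relevant test items, 2 of them recommended.
p, r = precision_recall_at_k([23, 11, 31, 36, 46], {11, 46, 7, 3}, k=5)
print(p, r)
```

Precision asks how many of the recommended items were relevant, while recall asks how many of the relevant items were recommended.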


In [10]:
# Ranking metrics
eval_map = map(test, top_k, col_user="userId", col_item="itemId", col_rating="rating", k=TOP_K)
eval_ndcg = ndcg_at_k(test, top_k, col_user="userId", col_item="itemId", col_rating="rating", k=TOP_K)
eval_precision = precision_at_k(test, top_k, col_user="userId", col_item="itemId", col_rating="rating", k=TOP_K)
eval_recall = recall_at_k(test, top_k, col_user="userId", col_item="itemId", col_rating="rating", k=TOP_K)



In [11]:
# Rating metrics (SAR scores are not rating predictions, so expect poor values here)
eval_rmse = rmse(test, top_k, col_user="userId", col_item="itemId", col_rating="rating")
eval_mae = mae(test, top_k, col_user="userId", col_item="itemId", col_rating="rating")
eval_rsquared = rsquared(test, top_k, col_user="userId", col_item="itemId", col_rating="rating")
eval_exp_var = exp_var(test, top_k, col_user="userId", col_item="itemId", col_rating="rating")


In [12]:
positivity_threshold = 2
test_bin = test.copy()
test_bin["rating"] = binarize(test_bin["rating"], positivity_threshold)

top_k_prob = top_k.copy()
top_k_prob["prediction"] = minmax_scale(top_k_prob["prediction"].astype(float))

eval_logloss = logloss(
    test_bin, top_k_prob, col_user="userId", col_item="itemId", col_rating="rating"
)
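
The cell above turns ratings into binary labels, rescales the SAR scores to the range [0, 1], and computes log loss. A minimal NumPy sketch of those steps follows, assuming that ratings strictly above the threshold count as positive; the helpers `minmax` and `binary_log_loss` and all the numbers are hypothetical.

```python
import numpy as np


def minmax(values):
    """Rescale values linearly so the smallest becomes 0 and the largest becomes 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))


raw_scores = [0.002, 0.010, 0.046]  # hypothetical SAR scores
ratings = [1, 3, 5]                 # hypothetical true ratings
threshold = 2

labels = [1 if rating > threshold else 0 for rating in ratings]
probs = minmax(raw_scores)
loss = binary_log_loss(labels, probs)

print(labels, probs, loss)
```

Min-max scaling always produces an exact 0 and an exact 1, which is why the clipping step matters. The rescaled scores are not calibrated probabilities, so the resulting log loss is best read as a rough indicator.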

In [13]:
print("Model:\t",
      "Top K:\t%d" % TOP_K,
      "MAP:\t%f" % eval_map,
      "NDCG:\t%f" % eval_ndcg,
      "Precision@K:\t%f" % eval_precision,
      "Recall@K:\t%f" % eval_recall,
      "RMSE:\t%f" % eval_rmse,
      "MAE:\t%f" % eval_mae,
      "R2:\t%f" % eval_rsquared,
      "Exp var:\t%f" % eval_exp_var,
      "Logloss:\t%f" % eval_logloss,
      sep='\n')

Model:	
Top K:	10
MAP:	0.136550
NDCG:	0.728474
Precision@K:	0.471600
Recall@K:	0.189971
RMSE:	3.243994
MAE:	2.921707
R2:	-4.317032
Exp var:	-0.004000
Logloss:	1.720861


In [14]:
# Now let's look at the results for a specific user
user_id = 272

ground_truth = test[test["userId"] == user_id].sort_values(
    by="rating", ascending=False
)[:TOP_K]
prediction = model.recommend_k_items(
    pd.DataFrame(dict(userId=[user_id])), remove_seen=True
)
df = pd.merge(ground_truth, prediction, on=["userId", "itemId"], how="left")
df.head(10)

2025-01-28 18:59:59,620 INFO     Calculating recommendation scores
2025-01-28 18:59:59,621 INFO     Removing seen items


Unnamed: 0,itemId,userId,rating,timestamp,playerName,playerTeam,playerCountry,timestamp_seconds,prediction
0,17,272,5.0,2012-07-13 18:36:23,Toni Kroos,Bayern Munich,Allemagne,1342204583,
1,14,272,5.0,2014-03-04 15:33:39,Bruno Fernandes,Real Betis,Portugal,1393947219,
2,5,272,4.0,2013-02-16 15:42:00,Luka Modrić,Real Madrid,Croatie,1361029320,
3,11,272,4.0,2016-10-11 19:28:14,Mohamed Salah,Liverpool,Maroc,1476214094,0.124271
4,40,272,4.0,2021-05-17 09:19:56,Erling Haaland,Red Bull Salzburg,Norway,1621243196,
5,22,272,3.0,2012-01-27 09:21:06,Trent Alexander-Arnold,Arsenal,Angleterre,1327656066,
6,38,272,3.0,2021-07-12 18:45:48,Luka Modrić,Tottenham Hotspur,Croatie,1626115548,
7,16,272,3.0,2020-08-01 13:42:14,Sergio Busquets,Barcelona,Espagne,1596289334,
8,6,272,3.0,2015-09-10 20:12:58,Karim Benzema,Al-Ittihad,France,1441915978,
9,7,272,3.0,2023-07-11 05:00:22,Sergio Ramos,Real Madrid,Espagne,1689051622,
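
Most rows above have an empty `prediction` because the merge uses `how="left"`: every ground-truth row is kept, and only items that also appear in the recommendations receive a score. A small sketch with hypothetical values shows the same behaviour.

```python
import pandas as pd

# Hypothetical ground truth (test interactions) and recommendations for one user.
truth = pd.DataFrame({
    "userId": [272, 272, 272],
    "itemId": [17, 11, 40],
    "rating": [5.0, 4.0, 4.0],
})
recs = pd.DataFrame({
    "userId": [272, 272],
    "itemId": [11, 23],
    "prediction": [0.124, 0.101],
})

# A left merge keeps every ground-truth row; items that were not recommended get NaN.
merged = pd.merge(truth, recs, on=["userId", "itemId"], how="left")
n_hits = int(merged["prediction"].notna().sum())

print(merged)
print(n_hits)
```

Counting the non-missing predictions gives the number of test items that made it into the user's top-k list.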


In [15]:
# Record results for tests - ignore this cell
store_metadata("map", eval_map)
store_metadata("ndcg", eval_ndcg)
store_metadata("precision", eval_precision)
store_metadata("recall", eval_recall)
store_metadata("train_time", train_time.interval)
store_metadata("test_time", test_time.interval)
