# SAR Single Node on Player Mock dataset

Simple Algorithm for Recommendation (SAR) is a recommendation algorithm designed to handle cold-item and semi-cold-user scenarios.

SAR recommends items that are most ***similar*** to the ones that the user already has an existing ***affinity*** for. Two items are ***similar*** if the users that interacted with one item are also likely to have interacted with the other. A user has an ***affinity*** to an item if they have interacted with it in the past.

### Advantages of SAR:
- High accuracy for an easy to train and deploy algorithm
- Fast training, only requiring simple counting to construct matrices used at prediction time. 
- Fast scoring, only involving multiplication of the similarity matrix with an affinity vector
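
To make the scoring step concrete, here is a minimal sketch for a single user. The similarity matrix and affinity vector are invented for illustration and are not taken from the dataset; the masking and ranking steps only approximate what a `remove_seen`-style option would do and are not the library's implementation.

```python
import numpy as np

# Hypothetical 4-item similarity matrix (symmetric, values invented for illustration).
similarity = np.array([
    [1.0, 0.5, 0.0, 0.2],
    [0.5, 1.0, 0.3, 0.0],
    [0.0, 0.3, 1.0, 0.4],
    [0.2, 0.0, 0.4, 1.0],
])

# Affinity vector for one user: nonzero where the user has interacted with an item.
affinity = np.array([1.0, 0.0, 0.0, 2.0])

# Scoring is a single matrix-vector product.
scores = similarity @ affinity

# Mask items the user has already seen so they are not recommended again.
unseen_scores = np.where(affinity > 0, -np.inf, scores)

# Rank the remaining items from highest to lowest score and keep the best two.
top_2 = np.argsort(-unseen_scores)[:2]
print(scores, top_2)
```

Items 0 and 3 are excluded because the user already has an affinity for them, so only the unseen items 1 and 2 compete for the top positions.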

### Notes on using SAR properly:
- Since it does not use item or user features, it can be at a disadvantage against algorithms that do.
- It's memory-hungry, requiring the creation of an $m \times m$ sparse square matrix (where $m$ is the number of items). This can also be a problem for many matrix factorization algorithms.
- SAR favors an implicit rating scenario and it does not predict ratings.

## Set up

In [1]:
import pandas as pd
from recommenders.datasets.python_splitters import python_stratified_split
from recommenders.models.sar import SAR
from recommenders.utils.timer import Timer
import logging
import numpy as np
import datetime as dt


## 1. Load Data

Each row represents a single interaction between a user and an item.

### 1.1 Load data

In [2]:
match_df = pd.read_csv('datasets/player_data.csv')

print(match_df.head())
print('\nNumber of entries in dataset: ' + str(len(match_df)))

# top k items to recommend
TOP_K = 10

   itemId  userId  rating            timestamp       playerName  \
0      31     425       2  2015-11-20 07:29:05  Bruno Fernandes   
1      10     145       3  2011-06-17 20:09:38       Harry Kane   
2      34     761       3  2014-05-09 03:59:06       Harry Kane   
3       7     681       4  2018-02-08 14:50:37     Sergio Ramos   
4      26     940       1  2011-05-05 12:10:27     Riyad Mahrez   

           playerTeam playerCountry  
0             Udinese      Portugal  
1   Tottenham Hotspur    Angleterre  
2  Ridgeway Rovers FC    Angleterre  
3         Real Madrid       Espagne  
4            Le Havre         Maroc  

Number of entries in dataset: 50000


### 1.2 Transform data

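The **rating** column is already an integer in this dataset (see the `dtypes` output below), so it is used as-is. If the column were instead a boolean indicating whether the user placed a bet, it would first need to be converted to a numerical rating (0 or 1).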

The **timestamp** column needs to be a numeric value, so a new column **timestamp_seconds** holding the Unix time in seconds is added.

In [3]:
# Parse the timestamp strings into datetime64 values.
match_df['timestamp'] = pd.to_datetime(match_df['timestamp'])
# Convert to Unix time in seconds (int64 gives nanoseconds for datetime64[ns]).
match_df['timestamp_seconds'] = match_df['timestamp'].astype('int64') // 10**9


print(match_df.head())

print(match_df.dtypes)

# Check number of null lines, should be 0
print(match_df['timestamp_seconds'].isna().sum()) 

   itemId  userId  rating           timestamp       playerName  \
0      31     425       2 2015-11-20 07:29:05  Bruno Fernandes   
1      10     145       3 2011-06-17 20:09:38       Harry Kane   
2      34     761       3 2014-05-09 03:59:06       Harry Kane   
3       7     681       4 2018-02-08 14:50:37     Sergio Ramos   
4      26     940       1 2011-05-05 12:10:27     Riyad Mahrez   

           playerTeam playerCountry  timestamp_seconds  
0             Udinese      Portugal         1448004545  
1   Tottenham Hotspur    Angleterre         1308341378  
2  Ridgeway Rovers FC    Angleterre         1399607946  
3         Real Madrid       Espagne         1518101437  
4            Le Havre         Maroc         1304597427  
itemId                        int64
userId                        int64
rating                        int64
timestamp            datetime64[ns]
playerName                   object
playerTeam                   object
playerCountry                object
timestamp

### 1.3 Split data using the Python stratified splitter

Split the dataset into train and test sets to evaluate algorithm performance.
All users that are in the test set must also exist in the training set, so we use `python_stratified_split`.
The function holds out a percentage (25% here) of items from each user, but ensures all users are in both the `train` and `test` datasets.
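
The idea of a per-user hold-out can be sketched with plain pandas. This is only an illustration on invented toy data, not the implementation of `python_stratified_split`; the names `toy`, `toy_train`, and `toy_test` are hypothetical.

```python
import pandas as pd

# Toy interactions: 2 users with 4 interactions each (values invented for illustration).
toy = pd.DataFrame({
    "userId": [1, 1, 1, 1, 2, 2, 2, 2],
    "itemId": [10, 11, 12, 13, 10, 12, 14, 15],
})

# Sample 75% of each user's rows for training; the remainder becomes the test set.
toy_train = toy.groupby("userId").sample(frac=0.75, random_state=42)
toy_test = toy.drop(toy_train.index)

# Every user ends up on both sides of the split.
print(sorted(toy_train["userId"].unique()), sorted(toy_test["userId"].unique()))
```

Because the sampling happens within each user's group, no user can land entirely in one set as long as they have enough interactions to split.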

In [4]:
train, test = python_stratified_split(match_df, ratio=0.75, col_user="userId", col_item="itemId", seed=42)

In [5]:
print("""
Train:
Total Ratings: {train_total}
Unique Users: {train_users}
Unique Items: {train_items}

Test:
Total Ratings: {test_total}
Unique Users: {test_users}
Unique Items: {test_items}
""".format(
    train_total=len(train),
    train_users=len(train['userId'].unique()),
    train_items=len(train['itemId'].unique()),
    test_total=len(test),
    test_users=len(test['userId'].unique()),
    test_items=len(test['itemId'].unique()),
))


Train:
Total Ratings: 37503
Unique Users: 1000
Unique Items: 51

Test:
Total Ratings: 12497
Unique Users: 1000
Unique Items: 51



## 2. Train the SAR Model

### 2.1 Instantiate the SAR algorithm and set the index

In [6]:
logging.basicConfig(level=logging.DEBUG, 
                    format='%(asctime)s %(levelname)-8s %(message)s')

model = SAR(
    col_user="userId",
    col_item="itemId",
    col_rating="rating",
    col_timestamp="timestamp_seconds",
    similarity_type="cosine", 
    time_decay_coefficient=30, 
    timedecay_formula=True,
    normalize=True
)
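
The `time_decay_coefficient=30` and `timedecay_formula=True` settings above turn on time decay, so older interactions contribute less to a user's affinity. The sketch below assumes the decay is an exponential with a 30-day half-life, measured from a reference time such as the most recent timestamp; the ratings and ages are invented, and the code is an illustration rather than the library's implementation.

```python
import numpy as np

SECONDS_PER_DAY = 24 * 60 * 60
half_life_days = 30  # mirrors time_decay_coefficient=30 above

# Hypothetical events for one user-item pair: a rating and its age in days
# relative to the reference time (assumed to be the latest timestamp).
ratings = np.array([4.0, 4.0, 4.0])
age_days = np.array([0.0, 30.0, 60.0])
age_seconds = age_days * SECONDS_PER_DAY

# Exponential decay: weight = 2 ** (-age / half_life).
weights = 2.0 ** (-age_seconds / (half_life_days * SECONDS_PER_DAY))

# The affinity is the sum of the decayed ratings.
affinity = (ratings * weights).sum()
print(weights, affinity)
```

Under this assumption, an interaction that is 30 days old counts half as much as one at the reference time, and one that is 60 days old counts a quarter as much.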

### 2.2 Train the SAR model on the training data, and get the top-k recommendations for the testing data

SAR creates an item-to-item **co-occurrence matrix** (the number of times two items appear together for the same user), then computes an **item similarity matrix** by rescaling the co-occurrences according to the chosen metric (cosine here).

We also compute an **affinity matrix** to capture the strength of the relationship between each user and each item.

We get the **recommendations** by multiplying the affinity matrix and the similarity matrix.
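
The three matrices can be illustrated on a toy example. The interaction data below is invented, the cosine rescaling is assumed to be $c_{ij} / \sqrt{c_{ii} c_{jj}}$, and unweighted interactions stand in for the affinity matrix; treat it as a sketch of the idea rather than the library's exact computation.

```python
import numpy as np

# Toy binary interaction matrix: rows are 3 users, columns are 3 items (invented data).
interactions = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
])

# Co-occurrence: entry (i, j) counts the users who interacted with both items i and j.
cooccurrence = interactions.T @ interactions

# Cosine-style rescaling (assumed form): divide each count by sqrt(c_ii * c_jj).
item_counts = np.diag(cooccurrence)
similarity = cooccurrence / np.sqrt(np.outer(item_counts, item_counts))

# With unweighted interactions as the affinity matrix, scores are a matrix product.
affinity = interactions.astype(float)
scores = affinity @ similarity
print(cooccurrence)
print(similarity.round(3))
```

The diagonal of the co-occurrence matrix holds each item's own interaction count, which is why it can serve as the normalizer for the similarity.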

In [7]:
with Timer() as train_time:
    model.fit(train)

print("Training finished in {} seconds.".format(train_time.interval))

2025-01-28 19:55:27,004 INFO     Collecting user affinity matrix
2025-01-28 19:55:27,007 INFO     Calculating time-decayed affinities
2025-01-28 19:55:27,012 INFO     Creating index columns
2025-01-28 19:55:27,026 INFO     Calculating normalization factors
2025-01-28 19:55:27,032 INFO     Building user affinity sparse matrix
2025-01-28 19:55:27,033 INFO     Calculating item co-occurrence
2025-01-28 19:55:27,037 INFO     Calculating item similarity
2025-01-28 19:55:27,037 INFO     Using cosine similarity
2025-01-28 19:55:27,038 INFO     Done training


Training finished in 0.038646250000000215 seconds.


In [8]:
with Timer() as test_time:
    top_k = model.recommend_k_items(test, top_k=TOP_K, remove_seen=True)

print("Prediction finished in {} seconds.".format(test_time.interval))

2025-01-28 19:55:27,045 INFO     Calculating recommendation scores
2025-01-28 19:55:27,049 INFO     Removing seen items


Prediction finished in 0.010301082999999878 seconds.


In [9]:
top_k

      userId  itemId  prediction
0          1      40    0.006915
1          1      24    0.006845
2          1      37    0.006825
3          1      31    0.006718
4          1      49    0.006699
...      ...     ...         ...
9995    1000      37    0.017010
9996    1000       2    0.016829
9997    1000      27    0.016825
9998    1000       9    0.016796


### 2.3 Evaluate performance
