# SAR Single Node on Player Mock dataset

This notebook applies the Simple Algorithm for Recommendation (SAR), which can handle cold item and semi-cold user scenarios.

SAR recommends items that are most ***similar*** to the ones that the user already has an existing ***affinity*** for. Two items are ***similar*** if the users that interacted with one item are also likely to have interacted with the other. A user has an ***affinity*** to an item if they have interacted with it in the past.

### Advantages of SAR:
- High accuracy for an algorithm that is easy to train and deploy.
- Fast training, requiring only simple counting to construct the matrices used at prediction time.
- Fast scoring, involving only the multiplication of the similarity matrix with an affinity vector.

### Notes on using SAR properly:
- Since it does not use item or user features, it can be at a disadvantage against algorithms that do.
- It's memory-hungry, requiring the creation of an $m \times m$ sparse square matrix (where $m$ is the number of items). This can also be a problem for many matrix factorization algorithms.
- SAR favors an implicit rating scenario and it does not predict ratings.
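
To get a feel for the memory note above, the following back-of-the-envelope sketch estimates the size of a fully dense $m \times m$ matrix of 64-bit floats. It is only an upper bound under an assumed dense layout: SAR stores the matrix sparsely, so real usage depends on how many item pairs actually co-occur. The helper name `dense_matrix_gib` is hypothetical.

```python
def dense_matrix_gib(num_items, bytes_per_value=8):
    """Size in GiB of a dense num_items x num_items matrix (assumed float64)."""
    return num_items * num_items * bytes_per_value / 2**30


# Illustrative item counts, not taken from the dataset in this notebook.
for m in (1_024, 32_768, 1_000_000):
    print(f"m = {m:>9,}: about {dense_matrix_gib(m):,.2f} GiB if stored densely")
```

The quadratic growth is the point here: doubling the number of items quadruples the worst-case footprint.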

## Set up

In [10]:
import pandas as pd
from recommenders.datasets.python_splitters import python_stratified_split
from recommenders.models.sar import SAR
from recommenders.utils.timer import Timer
import logging
import numpy as np
import datetime as dt


## 1. Load Data

Each row represents a single interaction between a user and an item.

### 1.1 Load data

In [11]:
match_df = pd.read_csv('datasets/player_data.csv')

print(match_df.head())
print('\nNumber of entries in dataset: ' + str(len(match_df)))

# top k items to recommend
TOP_K = 10

                                 itemId              playerName  \
0  d1f43d86-d060-4f8b-af61-05e949e89aec            Riyad Mahrez   
1  9a3dc2e0-d405-48f4-82b1-c2aefeaeda1a  Trent Alexander-Arnold   
2  dba7f29b-1ebe-48ba-bad3-95ff55a53141              Harry Kane   
3  b8ac9683-2d09-4fe8-b656-62757f0f1f8f             Eden Hazard   
4  d32ed43c-c75d-4b65-a08b-1d5841bc991d         Virgil van Dijk   

          playerTeam playerCountry  userId userRiskType  betPlaced  betAmount  \
0  Borussia Dortmund         Maroc      22         High          1      52.89   
1            Arsenal    Angleterre      23       Medium          5      33.22   
2  Tottenham Hotspur    Angleterre      34         High          2      83.00   
3           Juventus      Belgique      39          Low          2      60.25   
4          Liverpool      Pays-Bas       7         High          1      56.97   

             timestamp  
0  2023-11-23 00:00:00  
1  2024-04-02 00:00:00  
2  2022-01-20 00:00:00  
3  2023-06

### 1.2 Transform data

The **betPlaced** column serves as the rating. SAR needs a numeric rating column, so we cast it to `float32`.

The **timestamp** column also needs to be numeric, so we add a new column, **timestamp_seconds**, holding the number of seconds since the Unix epoch.

In [12]:
epoch_time = dt.datetime(1970, 1, 1)

# SAR needs a numeric rating column, so cast betPlaced to float32.
match_df['betPlaced'] = match_df['betPlaced'].astype(np.float32)

# Parse the timestamps, then convert them to whole seconds since the Unix epoch.
# The integer division assumes nanosecond-resolution datetimes (datetime64[ns]).
match_df['timestamp'] = pd.to_datetime(match_df['timestamp'])
match_df['timestamp_seconds'] = match_df['timestamp'].astype('int64') // 10**9


print(match_df.head())

print(match_df.dtypes)

# Count missing timestamp values; this should be 0
print(match_df['timestamp_seconds'].isna().sum()) 

                                 itemId              playerName  \
0  d1f43d86-d060-4f8b-af61-05e949e89aec            Riyad Mahrez   
1  9a3dc2e0-d405-48f4-82b1-c2aefeaeda1a  Trent Alexander-Arnold   
2  dba7f29b-1ebe-48ba-bad3-95ff55a53141              Harry Kane   
3  b8ac9683-2d09-4fe8-b656-62757f0f1f8f             Eden Hazard   
4  d32ed43c-c75d-4b65-a08b-1d5841bc991d         Virgil van Dijk   

          playerTeam playerCountry  userId userRiskType  betPlaced  betAmount  \
0  Borussia Dortmund         Maroc      22         High        1.0      52.89   
1            Arsenal    Angleterre      23       Medium        5.0      33.22   
2  Tottenham Hotspur    Angleterre      34         High        2.0      83.00   
3           Juventus      Belgique      39          Low        2.0      60.25   
4          Liverpool      Pays-Bas       7         High        1.0      56.97   

   timestamp  timestamp_seconds  
0 2023-11-23         1700697600  
1 2024-04-02         1712016000  
2 2022-0

### 1.3 Split data using the Python stratified splitter

Split the dataset into train and test sets to evaluate the algorithm's performance.
All users in the test set must also exist in the training set, so we use `python_stratified_split`.
The function holds out a percentage (25% here) of items from each user, while ensuring that all users appear in both the `train` and `test` datasets.
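
The idea of a per-user hold-out can be sketched with plain pandas. This is a simplified illustration on made-up data, not the implementation of `python_stratified_split`; the helper `per_user_split` and the `toy_*` names are hypothetical.

```python
import numpy as np
import pandas as pd


def per_user_split(df, ratio=0.75, col_user="userId", seed=42):
    """Hold out roughly (1 - ratio) of each user's rows for testing."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for _, group in df.groupby(col_user):
        # Shuffle this user's interactions, then cut at the requested ratio.
        shuffled = group.iloc[rng.permutation(len(group))]
        n_train = int(round(ratio * len(group)))
        train_parts.append(shuffled.iloc[:n_train])
        test_parts.append(shuffled.iloc[n_train:])
    return pd.concat(train_parts), pd.concat(test_parts)


# Toy data: user 1 has 4 interactions, user 2 has 8.
toy = pd.DataFrame({
    "userId": [1] * 4 + [2] * 8,
    "itemId": list("abcdefghijkl"),
})
toy_train, toy_test = per_user_split(toy)

print(toy_train["userId"].value_counts().sort_index())
print(toy_test["userId"].value_counts().sort_index())
```

Because the cut is made inside each user's group, every user with enough interactions ends up on both sides of the split, which a purely random row-level split does not guarantee.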

In [13]:
train, test = python_stratified_split(match_df, ratio=0.75, col_user="userId", col_item="itemId", seed=42)

In [14]:
print("""
Train:
Total Ratings: {train_total}
Unique Users: {train_users}
Unique Items: {train_items}

Test:
Total Ratings: {test_total}
Unique Users: {test_users}
Unique Items: {test_items}
""".format(
    train_total=len(train),
    train_users=len(train['userId'].unique()),
    train_items=len(train['itemId'].unique()),
    test_total=len(test),
    test_users=len(test['userId'].unique()),
    test_items=len(test['itemId'].unique()),
))


Train:
Total Ratings: 379
Unique Users: 40
Unique Items: 379

Test:
Total Ratings: 121
Unique Users: 40
Unique Items: 121



## 2. Train the SAR Model

### 2.1 Instantiate the SAR algorithm and set the index

In [15]:
logging.basicConfig(level=logging.DEBUG, 
                    format='%(asctime)s %(levelname)-8s %(message)s')

model = SAR(
    col_user="userId",
    col_item="itemId",
    col_rating="betPlaced",
    col_timestamp="timestamp_seconds",
    similarity_type="cosine", 
    time_decay_coefficient=30, 
    timedecay_formula=True,
    normalize=True
)
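
As a rough illustration of what a 30-day setting means, the sketch below applies a plain exponential half-life decay to event timestamps given in seconds. It assumes that `time_decay_coefficient` is a half-life expressed in days and that age is measured from a reference time; it is not the library's code, and `decayed_weight` is a hypothetical helper.

```python
import numpy as np

SECONDS_PER_DAY = 24 * 60 * 60


def decayed_weight(event_ts, reference_ts, half_life_days=30):
    """Weight that halves for every half_life_days of age (assumed formula)."""
    age_days = (reference_ts - event_ts) / SECONDS_PER_DAY
    return np.power(0.5, age_days / half_life_days)


# Reference time: 2024-04-02 in epoch seconds, a value shown in the output above.
reference = 1712016000
events = np.array([
    reference,                         # same day
    reference - 30 * SECONDS_PER_DAY,  # 30 days earlier
    reference - 60 * SECONDS_PER_DAY,  # 60 days earlier
])
weights = decayed_weight(events, reference)
print(weights)
```

Under this assumption, an interaction that is several half-lives old contributes very little to a user's affinity, so the choice of coefficient matters when the data spans years.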

### 2.2 Train the SAR model on the training data, and get the top-k recommendations for the testing data

SAR creates an item-to-item **co-occurrence matrix** (how often two items appear together in a user's history), then computes an **item similarity matrix** by rescaling the co-occurrences according to a given metric (cosine similarity here).

We also compute an **affinity matrix** to capture the strength of the relationship between each user and each item.

We get the **recommendations** by multiplying the affinity matrix and the similarity matrix.
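
The three steps above can be sketched with NumPy on a tiny made-up dataset. This is a minimal illustration of the idea under simplifying assumptions (binary affinities, no time decay, no normalization), not the `SAR` class's implementation.

```python
import numpy as np

# Toy affinity matrix A: 3 users x 4 items, 1.0 means the user interacted.
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

# Co-occurrence: C[i, j] = number of users who interacted with both i and j.
B = (A > 0).astype(float)
C = B.T @ B

# Cosine rescaling: S[i, j] = C[i, j] / sqrt(C[i, i] * C[j, j]).
diag = np.diag(C)
S = C / np.sqrt(np.outer(diag, diag))

# Recommendation scores: affinity matrix times similarity matrix.
scores = A @ S

# Mask items each user has already seen, then pick the best remaining item.
scores[A > 0] = -np.inf
best = scores.argmax(axis=1)

print(np.round(S, 3))
print(best)
```

User 0 has interacted with items 0 and 1, which co-occur with item 2 through user 1, so item 2 is the top unseen recommendation for that user.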

In [16]:
with Timer() as train_time:
    model.fit(train)

print("Training finished in {} seconds.".format(train_time.interval))

2024-12-04 12:45:35,174 INFO     Collecting user affinity matrix
2024-12-04 12:45:35,175 INFO     Calculating time-decayed affinities
2024-12-04 12:45:35,177 INFO     Creating index columns
2024-12-04 12:45:35,178 INFO     Calculating normalization factors
2024-12-04 12:45:35,181 INFO     Building user affinity sparse matrix
2024-12-04 12:45:35,181 INFO     Calculating item co-occurrence
2024-12-04 12:45:35,182 INFO     Calculating item similarity
2024-12-04 12:45:35,182 INFO     Using cosine similarity
2024-12-04 12:45:35,184 INFO     Done training


Training finished in 0.012018958000005853 seconds.


In [17]:
with Timer() as test_time:
    top_k = model.recommend_k_items(test, top_k=TOP_K, remove_seen=True)

print("Prediction finished in {} seconds.".format(test_time.interval))

2024-12-04 12:45:35,189 INFO     Calculating recommendation scores
2024-12-04 12:45:35,191 INFO     Removing seen items


Prediction finished in 0.004530417000005116 seconds.


In [18]:
top_k

| | userId | itemId | prediction |
|---|---|---|---|
| 0 | 1 | 37f12a4f-9e3f-440e-bf06-fc1e70264752 | 1.791551e-11 |
| 1 | 1 | 0c174fe6-c5f3-4180-800c-93c8737777c7 | 1.791551e-11 |
| 2 | 1 | 8304ea3e-2c7b-4a1a-8bce-4450ca9f40ac | 1.791551e-11 |
| 3 | 1 | 6cae5b00-ddb8-4953-84a6-a75dc05ec493 | 1.791551e-11 |
| 4 | 1 | bc9d5aaf-9d2d-4e87-8717-06770a48076e | 1.791551e-11 |
| ... | ... | ... | ... |
| 395 | 40 | 6cae5b00-ddb8-4953-84a6-a75dc05ec493 | 1.791551e-11 |
| 396 | 40 | 688bbdaf-4dcf-4d4a-9425-8eb11b8e97ae | 1.791551e-11 |
| 397 | 40 | 79148248-f97d-49eb-ad26-6326bbf4bf09 | 1.791551e-11 |
| 398 | 40 | 37f12a4f-9e3f-440e-bf06-fc1e70264752 | 1.791551e-11 |


### 2.3 Evaluate performance
