# SAR Single Node on Player Mock dataset

This notebook applies the Simple Algorithm for Recommendation (SAR) to a mock player dataset to handle cold-item and semi-cold-user scenarios.

SAR recommends items that are most ***similar*** to the ones that the user already has an existing ***affinity*** for. Two items are ***similar*** if the users that interacted with one item are also likely to have interacted with the other. A user has an ***affinity*** to an item if they have interacted with it in the past.
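
As a rough illustration of what "similar" means here, the sketch below counts co-occurrences on a tiny, made-up interaction matrix and rescales them with Jaccard similarity. The data and variable names are hypothetical, and this is not the library's implementation.

```python
import numpy as np

# Hypothetical toy data: rows are users, columns are items; 1 means the user interacted.
interactions = np.array([
    [1, 1, 0],  # user 0 interacted with items 0 and 1
    [1, 1, 0],  # user 1 interacted with items 0 and 1
    [0, 1, 1],  # user 2 interacted with items 1 and 2
])

# Co-occurrence: entry (i, j) counts the users who interacted with both item i and item j.
cooccurrence = interactions.T @ interactions

# Jaccard similarity: users shared by both items divided by users who touched either item.
item_counts = np.diag(cooccurrence)
union = item_counts[:, None] + item_counts[None, :] - cooccurrence
jaccard = cooccurrence / union

print(cooccurrence)
print(jaccard.round(2))
```

Items 0 and 1 share two of their three distinct users, so they come out as more similar than items 0 and 2, which share none.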

### Advantages of SAR:
- High accuracy for an algorithm that is easy to train and deploy
- Fast training, requiring only simple counting to construct the matrices used at prediction time
- Fast scoring, involving only multiplication of the similarity matrix with an affinity vector

### Notes on using SAR properly:
- Since it does not use item or user features, it can be at a disadvantage against algorithms that do.
- It's memory-hungry, requiring the creation of an $m \times m$ sparse square matrix (where $m$ is the number of items). This can also be a problem for many matrix factorization algorithms.
- SAR favors an implicit rating scenario and it does not predict ratings.
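
To give a feel for the memory note, here is a back-of-the-envelope calculation of the worst case, where the $m \times m$ matrix is stored densely with 8-byte floats. The item counts and the helper name `dense_matrix_bytes` are hypothetical; a sparse matrix only stores its nonzero entries, so real usage depends on how many item pairs actually co-occur.

```python
def dense_matrix_bytes(num_items: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a dense num_items x num_items matrix (hypothetical helper)."""
    return num_items * num_items * bytes_per_value


# Hypothetical catalogue sizes, for illustration only.
for m in (1_000, 100_000):
    print(m, "items:", dense_matrix_bytes(m) / 1e9, "GB if stored densely")
```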

## Set up

In [41]:
import pandas as pd
from recommenders.datasets.python_splitters import python_stratified_split
from recommenders.models.sar import SAR
from recommenders.utils.timer import Timer
import logging


## 1. Load Data

Each row represents a single interaction between a user and an item.

### 1.1 Load data

In [42]:
match_df = pd.read_csv('datasets/player_data.csv')

print(match_df.head())
print('\nNumber of entries in dataset: ' + str(len(match_df)))

# top k items to recommend
TOP_K = 10

                                 itemId     playerName           playerTeam  \
0  c4493a2b-8c67-4345-a052-ddd8988be736  Karim Benzema          Real Madrid   
1  46f3b28a-d110-4f9a-8f85-f953cf283d9d         Neymar  Paris Saint Germain   
2  76b0dca2-0634-4df2-8904-3195b4bc4257         Neymar  Paris Saint Germain   
3  d2906d5f-c345-4811-a8df-3b7c2a937f6f  Mohamed Salah            Liverpool   
4  5d724afc-6f4c-4d7b-9b94-c3c5add68c88  Kylian Mbappé          Real Madrid   

  playerCountry  userId userRiskType  betPlaced  betAmount  \
0        France       8       Medium      False       0.00   
1        Brésil      11       Medium      False       0.00   
2        Brésil      11       Medium       True      90.12   
3         Maroc       1         High      False       0.00   
4       Espagne       7         High      False       0.00   

             timestamp  
0  2022-12-04 00:00:00  
1  2023-06-16 00:00:00  
2  2023-11-13 00:00:00  
3  2024-06-01 00:00:00  
4  2024-11-30 00:00:00  


### 1.2 Transform data

The **betPlaced** column serves as the rating. It is a boolean indicating whether the user placed a bet, so we convert it to numeric ratings: 0 and 1.

The **timestamp** column needs to be numeric, so we add a new column **timestamp_diff_secs** holding each timestamp as seconds since the Unix epoch.

In [43]:
print(match_df.head())

match_df['betPlaced'] = match_df['betPlaced'].astype(int)
match_df['timestamp'] = pd.to_datetime(match_df['timestamp'])
match_df['timestamp_diff_secs'] = match_df['timestamp'].apply(lambda x : x.timestamp())


print(match_df.head())

print(match_df.dtypes)

# Count null values in the new column; this should be 0
print(match_df['timestamp_diff_secs'].isna().sum())

                                 itemId     playerName           playerTeam  \
0  c4493a2b-8c67-4345-a052-ddd8988be736  Karim Benzema          Real Madrid   
1  46f3b28a-d110-4f9a-8f85-f953cf283d9d         Neymar  Paris Saint Germain   
2  76b0dca2-0634-4df2-8904-3195b4bc4257         Neymar  Paris Saint Germain   
3  d2906d5f-c345-4811-a8df-3b7c2a937f6f  Mohamed Salah            Liverpool   
4  5d724afc-6f4c-4d7b-9b94-c3c5add68c88  Kylian Mbappé          Real Madrid   

  playerCountry  userId userRiskType  betPlaced  betAmount  \
0        France       8       Medium      False       0.00   
1        Brésil      11       Medium      False       0.00   
2        Brésil      11       Medium       True      90.12   
3         Maroc       1         High      False       0.00   
4       Espagne       7         High      False       0.00   

             timestamp  
0  2022-12-04 00:00:00  
1  2023-06-16 00:00:00  
2  2023-11-13 00:00:00  
3  2024-06-01 00:00:00  
4  2024-11-30 00:00:00  
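
The epoch conversion can also be done on the whole column at once instead of row by row. The sketch below uses a small made-up frame (`sample` is a hypothetical name) in the same timestamp format; for timezone-naive timestamps it should give the same values as the `.apply` call above.

```python
import pandas as pd

# Hypothetical sample mirroring the timestamp format shown above.
sample = pd.DataFrame({"timestamp": ["2022-12-04 00:00:00", "2023-06-16 00:00:00"]})
sample["timestamp"] = pd.to_datetime(sample["timestamp"])

# Seconds since the Unix epoch, computed for the whole column in one step.
epoch = pd.Timestamp("1970-01-01")
sample["timestamp_diff_secs"] = (sample["timestamp"] - epoch) / pd.Timedelta(seconds=1)

print(sample)
```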
  

### 1.3 Split data using the Python stratified splitter

Split the dataset into train and test sets to evaluate algorithm performance.
All users in the test set must also exist in the training set, so we use `python_stratified_split`.
The function holds out a percentage (25% here) of each user's items, while ensuring that all users appear in both the `train` and `test` datasets.

In [44]:
train, test = python_stratified_split(match_df, ratio=0.75, col_user="userId", col_item="itemId", seed=42)
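
To make the idea of a per-user hold-out concrete, here is a minimal pandas sketch on made-up data. The helper `split_per_user` and the `toy` frame are hypothetical and are not how `python_stratified_split` is implemented; they only show the principle of sampling within each user.

```python
import pandas as pd


def split_per_user(df, ratio=0.75, col_user="userId", seed=42):
    """Hypothetical helper: keep roughly `ratio` of each user's rows for training."""
    train_parts = []
    for _, rows in df.groupby(col_user):
        # Sample within one user's rows so every user contributes to the training set.
        train_parts.append(rows.sample(frac=ratio, random_state=seed))
    train_rows = pd.concat(train_parts)
    # Everything not sampled for training becomes the test set.
    test_rows = df.drop(train_rows.index)
    return train_rows, test_rows


# Hypothetical toy data: two users with four interactions each.
toy = pd.DataFrame({
    "userId": [1, 1, 1, 1, 2, 2, 2, 2],
    "itemId": list("abcdefgh"),
})
toy_train, toy_test = split_per_user(toy)

print(len(toy_train), len(toy_test))
print(set(toy_test["userId"]) <= set(toy_train["userId"]))
```

With four rows per user and a ratio of 0.75, each user keeps three rows for training and one for testing, so both users appear in both sets.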

In [45]:
print("""
Train:
Total Ratings: {train_total}
Unique Users: {train_users}
Unique Items: {train_items}

Test:
Total Ratings: {test_total}
Unique Users: {test_users}
Unique Items: {test_items}
""".format(
    train_total=len(train),
    train_users=len(train['userId'].unique()),
    train_items=len(train['itemId'].unique()),
    test_total=len(test),
    test_users=len(test['userId'].unique()),
    test_items=len(test['itemId'].unique()),
))


Train:
Total Ratings: 149
Unique Users: 18
Unique Items: 149

Test:
Total Ratings: 51
Unique Users: 18
Unique Items: 51



# 2. Train the SAR Model

## 2.1 Instantiate the SAR algorithm and set the index

In [46]:
logging.basicConfig(level=logging.DEBUG, 
                    format='%(asctime)s %(levelname)-8s %(message)s')

model = SAR(
    col_user="userId",
    col_item="itemId",
    col_rating="betPlaced",
    col_timestamp="timestamp_diff_secs",
    similarity_type="jaccard", 
    time_decay_coefficient=30, 
    timedecay_formula=False,
    normalize=True
)
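
The configuration above sets `timedecay_formula=False`, so the 30-day coefficient should have no effect in this run. As a rough sketch of what a half-life style decay would do if it were enabled, the helper below halves a rating's weight every `half_life_days` days. The function `half_life_decay` and the reference time are hypothetical, and treating the coefficient as a half-life in days is an assumption rather than a statement about the library's exact formula.

```python
import numpy as np


def half_life_decay(rating, event_time, reference_time, half_life_days=30):
    """Hypothetical helper: halve a rating's weight every `half_life_days` days."""
    half_life_secs = half_life_days * 24 * 60 * 60
    age_secs = reference_time - event_time
    return rating * np.power(0.5, age_secs / half_life_secs)


day = 24 * 60 * 60
now = 1_700_000_000  # hypothetical reference time, in epoch seconds

print(half_life_decay(1.0, now, now))             # event at the reference time
print(half_life_decay(1.0, now - 30 * day, now))  # event one half-life earlier
print(half_life_decay(1.0, now - 60 * day, now))  # event two half-lives earlier
```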

## 2.2 Train the SAR model on the training data, and get the top-k recommendations for the testing data

SAR first builds an item-to-item **co-occurrence matrix**, where each entry counts the users who interacted with both items. The **item similarity matrix** is then obtained by rescaling the co-occurrences with the chosen similarity metric (Jaccard here).

We also compute an **affinity matrix** to capture the strength of the relationship between each user and each item.

We get the **recommendations** by multiplying the affinity matrix and the similarity matrix.
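
The scoring step can be sketched with plain NumPy. Both matrices below are made up for illustration, and the masking of seen items is a simplified stand-in for what `remove_seen=True` is meant to achieve, not the library's code.

```python
import numpy as np

# Hypothetical affinity matrix: rows are users, columns are items.
affinity = np.array([
    [1.0, 0.0, 0.0],  # user 0 has interacted with item 0 only
    [0.0, 1.0, 1.0],  # user 1 has interacted with items 1 and 2
])

# Hypothetical item similarity matrix (symmetric, ones on the diagonal).
similarity = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.25],
    [0.0, 0.25, 1.0],
])

# Recommendation scores: one row per user, one column per item.
scores = affinity @ similarity

# Mask items the user has already seen so they cannot be recommended again.
scores_unseen = np.where(affinity > 0, -np.inf, scores)

# Highest-scoring unseen item for each user.
best_item = scores_unseen.argmax(axis=1)

print(scores)
print(best_item)
```

User 0 is pointed to item 1, the unseen item most similar to item 0, and user 1 is pointed to item 0, the only item that user has not seen.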

In [47]:
with Timer() as train_time:
    model.fit(train)

print("Training finished in {} seconds.".format(train_time.interval))

2024-12-04 11:22:22,073 INFO     Collecting user affinity matrix
2024-12-04 11:22:22,073 INFO     Creating index columns
2024-12-04 11:22:22,075 INFO     Calculating normalization factors
2024-12-04 11:22:22,075 INFO     Building user affinity sparse matrix
2024-12-04 11:22:22,075 INFO     Calculating item co-occurrence
2024-12-04 11:22:22,076 INFO     Calculating item similarity
2024-12-04 11:22:22,077 INFO     Using jaccard based similarity
2024-12-04 11:22:22,077 INFO     Done training


Training finished in 0.005976916999998139 seconds.


In [48]:
with Timer() as test_time:
    top_k = model.recommend_k_items(test, top_k=TOP_K, remove_seen=True)

print("Took {} seconds for prediction.".format(test_time.interval))

2024-12-04 11:22:22,081 INFO     Calculating recommendation scores
2024-12-04 11:22:22,082 INFO     Removing seen items


Took 0.003283291000002464 seconds for prediction.



In [49]:
top_k

| | userId | itemId | prediction |
|---|---|---|---|
| 0 | 1 | e6b2d561-bbbc-4a60-a526-a6a72d7cbb6d | 0.0 |
| 1 | 1 | b3c5fe9f-797d-4331-b866-04468f37d084 | 0.0 |
| 2 | 1 | 2dde7a9b-e40d-4e76-b183-89c78464b0bb | 0.0 |
| 3 | 1 | 6515bec6-8e6a-423c-9fbc-bf99771d0d32 | 0.0 |
| 4 | 1 | 5d724afc-6f4c-4d7b-9b94-c3c5add68c88 | 0.0 |
| ... | ... | ... | ... |
| 166 | 17 | d671b837-4797-45b8-9193-305b0d6dd961 | 0.0 |
| 170 | 18 | 73d4af48-5f53-4711-a724-f578a9af0aa8 | 0.0 |
| 171 | 18 | 5d724afc-6f4c-4d7b-9b94-c3c5add68c88 | 0.0 |
| 172 | 18 | 6515bec6-8e6a-423c-9fbc-bf99771d0d32 | 0.0 |
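
One possible reading of the all-zero predictions, based on the split summary earlier (149 training ratings covering 149 unique items): if no item is shared between two users, items only co-occur with other items of the same user, so every unseen item has zero similarity to everything the user has interacted with. The toy example below reproduces that situation with made-up data; it ignores rating values and normalization, so it is a sketch of the effect rather than a diagnosis of this exact run.

```python
import numpy as np

# Hypothetical toy data: 2 users, 4 items, and every item belongs to exactly one user.
interactions = np.array([
    [1, 1, 0, 0],  # user 0 interacted with items 0 and 1
    [0, 0, 1, 1],  # user 1 interacted with items 2 and 3
])

# Co-occurrence and Jaccard similarity, computed as in a plain counting approach.
cooccurrence = interactions.T @ interactions
counts = np.diag(cooccurrence)
jaccard = cooccurrence / (counts[:, None] + counts[None, :] - cooccurrence)

# Scores for every user/item pair, then only the pairs the user has not seen.
scores = interactions @ jaccard
unseen_scores = scores[interactions == 0]

print(jaccard)
print(unseen_scores)
```

The similarity matrix is block-diagonal, so once seen items are removed, nothing with a nonzero score is left to recommend.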
