# Recommender Systems 2020/21


## Practice 4 - Building an ItemKNN Recommender From Scratch

This practice session guides students through building a recommender system from scratch: data loading, preprocessing, model creation, evaluation, hyperparameter tuning, and a sample submission to the competition. 

Outline:
- Data Loading with Pandas (MovieLens 10M, link: http://files.grouplens.org/datasets/movielens/ml-10m.zip)
- Data Preprocessing
- Dataset splitting in Train, Validation and Testing
- Similarity Measures
- Collaborative Item KNN
- Evaluation Metrics
- Evaluation Procedure
- Hyperparameter Tuning
- Submission to competition

In [1]:
__author__ = 'Fernando Benjamín Pérez Maurera'
__credits__ = ['Fernando Benjamín Pérez Maurera']
__license__ = 'MIT'
__version__ = '0.1.0'
__maintainer__ = 'Fernando Benjamín Pérez Maurera'
__email__ = 'fernandobenjamin.perez@polimi.it'
__status__ = 'Dev'

import os
from typing import Tuple, Callable, Dict, Optional, List

import numpy as np
import pandas as pd
import scipy.sparse as sp

from sklearn.model_selection import train_test_split


## Dataset Loading with pandas

The MovieLens 10M dataset is a collection of ratings given by users to items. The ratings are stored in a `.dat` text file that uses `::` as the separator between attributes, and every row follows this structure: `<user_id>::<item_id>::<rating>::<timestamp>`. 

The `read_csv` function from pandas provides a fast interface to load tabular data like this. For MovieLens we would pass the separator `::`, the column names `["user_id", "item_id", "ratings", "timestamp"]`, and the type of each attribute in the `dtype` parameter. The cell below applies the same idea to the competition file `challenge_data/data_train.csv`, which is comma-separated and uses the columns `user_id`, `item_id`, and `data`.
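
As a minimal sketch of the `::`-separated layout described above, the snippet below parses three made-up rows from an in-memory string instead of the real file. The sample rows and the name `toy_ratings` are invented for illustration. Passing `engine="python"` is needed because pandas treats separators longer than one character as regular expressions, which the default C engine does not support.

```python
import io

import numpy as np
import pandas as pd

# Three invented rows in the <user_id>::<item_id>::<rating>::<timestamp> layout.
sample = "1::122::5::838985046\n1::185::5::838983525\n2::110::4::868245777\n"

toy_ratings = pd.read_csv(
    io.StringIO(sample),
    sep="::",
    engine="python",  # multi-character separators require the Python engine
    header=None,
    names=["user_id", "item_id", "ratings", "timestamp"],
    dtype={"user_id": np.int32, "item_id": np.int32,
           "ratings": np.float64, "timestamp": np.int64},
)
print(toy_ratings)
```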

In [2]:
pd.read_csv?

In [7]:
def load_data():
    return pd.read_csv("challenge_data/data_train.csv", 
                       sep=",", 
                       dtype={"user_id": np.int32,
                               "item_id": np.int32,
                               "data": np.int32})

In [12]:
ratings = load_data()

In [13]:
ratings

Unnamed: 0,user_id,item_id,data
0,0,0,1
1,0,2,1
2,0,120,1
3,0,128,1
4,0,211,1
...,...,...,...
1764602,35735,37802,1
1764603,35735,37803,1
1764604,35735,37805,1
1764605,35735,38000,1


## Data Preprocessing

This section works with the previously loaded ratings dataset and extracts the number of users, the number of items, and the min/max user/item identifiers. Exploring and understanding the data is an essential step before fitting any recommender. 

In MovieLens 10M, for instance, item identifiers range from 1 to 65133, yet there are only 10677 distinct items (meaning that ~5/6 of the item identifiers are not present in the dataset). To ease further calculations, we create new contiguous user/item identifiers and assign exactly one of them to each user/item. To keep track of these mappings, we add them to the original dataframe using the `pd.merge` function.
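
To make the remapping concrete, here is a small sketch on invented data: four interactions whose item identifiers are sparse (10, 500, 42). The frame `toy` and its values are hypothetical; the steps mirror the `unique` plus `pd.merge` approach described above.

```python
import numpy as np
import pandas as pd

# Hypothetical interactions with non-contiguous item identifiers.
toy = pd.DataFrame({"user_id": [7, 7, 9, 9], "item_id": [10, 500, 10, 42]})

# One new contiguous identifier per distinct original identifier,
# assigned in order of first appearance.
unique_items = toy.item_id.unique()
mapping_item_id = pd.DataFrame({"mapped_item_id": np.arange(unique_items.size),
                                "item_id": unique_items})

# Attach the new identifier to every interaction.
toy = pd.merge(left=toy, right=mapping_item_id, how="inner", on="item_id")
print(toy)
```

Every original identifier ends up with exactly one new identifier in the range `0..num_items-1`, which is what allows the identifiers to be used directly as matrix indices later on.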

In [6]:
pd.merge?

In [14]:
def preprocess_data(ratings: pd.DataFrame):
    unique_users = ratings.user_id.unique()
    unique_items = ratings.item_id.unique()
    
    num_users, min_user_id, max_user_id = unique_users.size, unique_users.min(), unique_users.max()
    num_items, min_item_id, max_item_id = unique_items.size, unique_items.min(), unique_items.max()
    
    print(num_users, min_user_id, max_user_id)
    print(num_items, min_item_id, max_item_id)
    
    mapping_user_id = pd.DataFrame({"mapped_user_id": np.arange(num_users), "user_id": unique_users})
    mapping_item_id = pd.DataFrame({"mapped_item_id": np.arange(num_items), "item_id": unique_items})
    
    ratings = pd.merge(left=ratings, 
                       right=mapping_user_id,
                       how="inner",
                       on="user_id")
    
    ratings = pd.merge(left=ratings, 
                       right=mapping_item_id,
                       how="inner",
                       on="item_id")
    
    return ratings
    

In [15]:
ratings = preprocess_data(ratings)

35736 0 35735
38121 0 38120


In [16]:
ratings 

Unnamed: 0,user_id,item_id,data,mapped_user_id,mapped_item_id
0,0,0,1,0,0
1,5,0,1,5,0
2,6,0,1,6,0
3,7,0,1,7,0
4,9,0,1,9,0
...,...,...,...,...,...
1764602,34967,37744,1,34967,38120
1764603,35040,37744,1,35040,38120
1764604,35308,37744,1,35308,38120
1764605,35431,37744,1,35431,38120


## Dataset Splitting into Train, Validation, and Test

This is the last step before creating the recommender, and an *important* one: it is the basis for the training, hyperparameter optimization, and evaluation of the recommender(s).

Here we read the ratings (which we loaded and preprocessed before) and create the `train`, `validation`, and `test` User-Rating Matrices (URM). These must be disjoint to avoid leaking information from the train set into the validation/test sets. In our case it is safe to use the `train_test_split` function from `scikit-learn`, because the dataset contains only *one* datapoint for every `(user, item)` pair. We first create the `test` set and then obtain the `validation` set by splitting the remaining `train` set again, so `validation_percentage` is a fraction of what is left after removing the test set.


`train_test_split` takes an array (or several arrays) and divides it into `train` and `test` according to a given size (in our case `testing_percentage` and `validation_percentage`, each of which must be a float between 0 and 1).

Once we have the different splits, we create the *sparse URMs* with `csr_matrix` from `scipy.sparse`.
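
The following sketch illustrates both ideas on invented data: splitting several aligned arrays with `train_test_split` and building sparse matrices from the resulting triplets. The arrays, the `5 x 4` shape, and the names `toy_train`/`toy_test` are assumptions made for the example.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.model_selection import train_test_split

# Hypothetical (user, item, rating) triplets: 5 users x 4 items, one entry per pair.
users = np.repeat(np.arange(5), 4)
items = np.tile(np.arange(4), 5)
values = np.ones(users.size, dtype=np.int32)

# Passing several arrays splits them consistently, row by row.
(users_train, users_test,
 items_train, items_test,
 values_train, values_test) = train_test_split(users, items, values,
                                               test_size=0.25,
                                               shuffle=True,
                                               random_state=1234)

shape = (5, 4)
toy_train = sp.csr_matrix((values_train, (users_train, items_train)), shape=shape)
toy_test = sp.csr_matrix((values_test, (users_test, items_test)), shape=shape)

# Disjoint splits: no (user, item) cell is stored in both matrices.
overlap = toy_train.multiply(toy_test).nnz
print(toy_train.nnz, toy_test.nnz, overlap)
```

Checking that the element-wise product of two splits has no stored entries is a cheap way to confirm that they do not share any interaction.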

In [10]:
train_test_split?

In [19]:
def dataset_splits(ratings, num_users, num_items, validation_percentage: float, testing_percentage: float):
    seed = 1234
    
    (user_ids_training, user_ids_test,
     item_ids_training, item_ids_test,
     ratings_training, ratings_test) = train_test_split(ratings.mapped_user_id,
                                                        ratings.mapped_item_id,
                                                        ratings.data,
                                                        test_size=testing_percentage,
                                                        shuffle=True,
                                                        random_state=seed)
    
    # Reuse the same seed so that the validation split is reproducible as well.
    (user_ids_training, user_ids_validation,
     item_ids_training, item_ids_validation,
     ratings_training, ratings_validation) = train_test_split(user_ids_training,
                                                              item_ids_training,
                                                              ratings_training,
                                                              test_size=validation_percentage,
                                                              shuffle=True,
                                                              random_state=seed)
    
    urm_train = sp.csr_matrix((ratings_training, (user_ids_training, item_ids_training)), 
                              shape=(num_users, num_items))
    
    urm_validation = sp.csr_matrix((ratings_validation, (user_ids_validation, item_ids_validation)), 
                              shape=(num_users, num_items))
    
    urm_test = sp.csr_matrix((ratings_test, (user_ids_test, item_ids_test)), 
                              shape=(num_users, num_items))
    
    return urm_train, urm_validation, urm_test

In [None]:
urm_train, urm_validation, urm_test = dataset_splits(
    ratings,
    num_users=35736,
    num_items=38121,
    validation_percentage=0.10,
    testing_percentage=0.20,
)

In [22]:
urm_train

<35736x38121 sparse matrix of type '<class 'numpy.intc'>'
	with 1270516 stored elements in Compressed Sparse Row format>

In [23]:
urm_validation

<35736x38121 sparse matrix of type '<class 'numpy.intc'>'
	with 141169 stored elements in Compressed Sparse Row format>

In [24]:
urm_test

<35736x38121 sparse matrix of type '<class 'numpy.intc'>'
	with 352922 stored elements in Compressed Sparse Row format>

## Cosine Similarity

We can implement cosine similarity in different ways, some faster than others.

The simplest version loops over the items one by one and computes the similarity of every item pair, adding a `shrink` term to the denominator of the plain cosine:
$$ W_{i,j} 
= \frac{v_i \cdot v_j}{\lVert v_i \rVert \lVert v_j \rVert + shrink} 
= \frac{\sum_{u \in U}{URM_{u,i} \cdot URM_{u,j}}}{\sqrt{\sum_{u \in U}{URM_{u,i}^2}} \cdot \sqrt{\sum_{u \in U}{URM_{u,j}^2}} + shrink} $$
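
As a hand-checkable illustration of the formula, the sketch below computes the weight for two invented item profiles, with and without the shrink term. The vectors `v_i`, `v_j` and the value `shrink = 5` are arbitrary choices for the example.

```python
import math

import numpy as np

# Two hypothetical item profiles over four users (1 = interaction).
v_i = np.array([1, 0, 1, 1])
v_j = np.array([1, 1, 0, 1])
shrink = 5

numerator = v_i @ v_j                                       # 2 users in common
norm_product = np.linalg.norm(v_i) * np.linalg.norm(v_j)    # sqrt(3) * sqrt(3) = 3

plain_cosine = numerator / norm_product                     # 2 / 3
shrunk_weight = numerator / (norm_product + shrink)         # 2 / 8
print(plain_cosine, shrunk_weight)
```

The shrink term lowers the weight most for pairs whose norms are small, that is, items with few interactions, whose similarity estimate is the least reliable.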


In [17]:
def naive_similarity(urm: sp.csc_matrix, shrink: int):
    num_items = urm.shape[1]
    weights = np.empty(shape=(num_items, num_items))
    for item_i in range(num_items):
        item_i_profile = urm[:, item_i] # mx1 vector
        
        for item_j in range(num_items):
            item_j_profile = urm[:, item_j] # mx1 vector
            
            numerator = item_i_profile.T.dot(item_j_profile).todense()[0,0]
            denominator = (np.sqrt(np.sum(item_i_profile.power(2)))
                           * np.sqrt(np.sum(item_j_profile.power(2)))
                           + shrink
                           + 1e-6)
            
            weights[item_i, item_j] = numerator / denominator
    
    np.fill_diagonal(weights, 0.0)
    return weights
    
            

A faster version computes the similarities of one item against all items at once with a matrix-vector product:
$$ W_{i,I} 
= \frac{v_i \cdot URM_{I}}{\lVert v_i \rVert \, IW_{I} + shrink} $$

where 

$$ IW_{i} = \sqrt{\sum_{u \in U}{URM_{u,i}^2}}$$
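
A small sketch of these two quantities on an invented `3 x 3` URM: the vector of item norms `IW`, and the similarities of item 0 against every item obtained with one product. The matrix `toy_urm` and `shrink = 5` are assumptions for illustration, and the diagonal entry is left as computed (it is not zeroed here).

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical 3-user x 3-item URM.
toy_urm = sp.csc_matrix(np.array([[1, 0, 1],
                                  [1, 1, 0],
                                  [0, 1, 1]], dtype=np.float64))
shrink = 5

# IW: Euclidean norm of every item column.
item_norms = np.sqrt(np.asarray(toy_urm.power(2).sum(axis=0))).ravel()

# Similarities of item 0 against all items in a single product.
item_0 = toy_urm[:, 0]
numerator = toy_urm.T.dot(item_0).toarray().ravel()
row_0 = numerator / (item_norms[0] * item_norms + shrink)
print(item_norms, row_0)
```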

In [26]:
def vector_similarity(urm: sp.csc_matrix, shrink: int):
    item_weights = np.sqrt(
        np.sum(urm.power(2), axis=0)
    ).A.flatten()
    
    num_items = urm.shape[1]
    urm_t = urm.T
    weights = np.empty(shape=(num_items, num_items))
    for item_id in range(num_items):
        numerator = urm_t.dot(urm[:, item_id]).A.flatten()
        denominator = item_weights[item_id] * item_weights + shrink + 1e-6
        
        weights[item_id] = numerator / denominator
        
    np.fill_diagonal(weights, 0.0)
    return weights
    

Lastly, an even faster but more memory-intensive version computes all similarities at once with matrix products (the division is element-wise):
$$ W  
= \frac{URM^{t} \cdot URM}{IW^{t} IW + shrink} $$

In [25]:
def matrix_similarity(urm: sp.csc_matrix, shrink: int):
    item_weights = np.sqrt(
        np.sum(urm.power(2), axis=0)
    ).A
    
    numerator = urm.T.dot(urm)
    denominator = item_weights.T.dot(item_weights) + shrink + 1e-6
    weights = numerator / denominator
    np.fill_diagonal(weights, 0.0)
    
    return weights

In [18]:
urm_csc = urm_train.tocsc()
shrink = 5
slice_size = 100

In [19]:
%%time 
naive_weights = naive_similarity(urm_csc[:slice_size,:slice_size], shrink)
naive_weights

CPU times: user 8.08 s, sys: 67.5 ms, total: 8.14 s
Wall time: 8.28 s


array([[0.        , 0.36632423, 0.36526804, ..., 0.        , 0.        ,
        0.        ],
       [0.36632423, 0.        , 0.54985153, ..., 0.        , 0.03425119,
        0.        ],
       [0.36526804, 0.54985153, 0.        , ..., 0.03108656, 0.11382563,
        0.        ],
       ...,
       [0.        , 0.        , 0.03108656, ..., 0.        , 0.2717996 ,
        0.1006602 ],
       [0.        , 0.03425119, 0.11382563, ..., 0.2717996 , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.1006602 , 0.        ,
        0.        ]])

In [27]:
%%time
vector_weights = vector_similarity(urm_csc[:slice_size,:slice_size], shrink)
vector_weights

CPU times: user 60.8 ms, sys: 2.53 ms, total: 63.4 ms
Wall time: 62.5 ms


array([[0.        , 0.36632423, 0.36526804, ..., 0.        , 0.        ,
        0.        ],
       [0.36632423, 0.        , 0.54985153, ..., 0.        , 0.03425119,
        0.        ],
       [0.36526804, 0.54985153, 0.        , ..., 0.03108656, 0.11382563,
        0.        ],
       ...,
       [0.        , 0.        , 0.03108656, ..., 0.        , 0.2717996 ,
        0.1006602 ],
       [0.        , 0.03425119, 0.11382563, ..., 0.2717996 , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.1006602 , 0.        ,
        0.        ]])

In [30]:
%%time
matrix_weights = matrix_similarity(urm_csc[:slice_size,:slice_size], shrink)
matrix_weights

CPU times: user 4.71 ms, sys: 2.08 ms, total: 6.79 ms
Wall time: 5.22 ms


matrix([[0.        , 0.36632423, 0.36526804, ..., 0.        , 0.        ,
         0.        ],
        [0.36632423, 0.        , 0.54985153, ..., 0.        , 0.03425119,
         0.        ],
        [0.36526804, 0.54985153, 0.        , ..., 0.03108656, 0.11382563,
         0.        ],
        ...,
        [0.        , 0.        , 0.03108656, ..., 0.        , 0.2717996 ,
         0.1006602 ],
        [0.        , 0.03425119, 0.11382563, ..., 0.2717996 , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.1006602 , 0.        ,
         0.        ]])

In [28]:
np.array_equal(naive_weights, vector_weights)

True

In [31]:
np.array_equal(vector_weights, matrix_weights)

True

## Collaborative Filtering ItemKNN Recommender

This step creates a `CFItemKNN` class that represents a Collaborative Filtering ItemKNN Recommender. As mentioned in previous practice sessions, our recommenders have two main functions: `fit` and `recommend`. 

The first receives the similarity function and the dataset from which to compute the similarities, and stores the resulting similarities (`weights`) in the class instance. 

The second takes a user id, the train URM, the recommendation length, and a boolean flag that removes already-seen items from the user's recommendations. It returns a recommendation list for the user.
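
The scoring logic behind `recommend` can be sketched with dense arrays on invented data: a user profile over five items and a hand-written item-item weight matrix. All values and the cutoff `at = 2` are hypothetical.

```python
import numpy as np

# Hypothetical data: one user profile over 5 items and a 5x5 item-item weight matrix.
user_profile = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
weights = np.array([[0.0, 0.5, 0.2, 0.1, 0.0],
                    [0.5, 0.0, 0.3, 0.0, 0.4],
                    [0.2, 0.3, 0.0, 0.6, 0.1],
                    [0.1, 0.0, 0.6, 0.0, 0.0],
                    [0.0, 0.4, 0.1, 0.0, 0.0]])

# Score of an item = sum of its similarities to the items the user interacted with.
scores = user_profile @ weights

# Exclude items already in the profile, then keep the `at` best-scoring ones.
seen_items = np.flatnonzero(user_profile)
scores[seen_items] = -np.inf
at = 2
recommendations = np.argsort(-scores)[:at]
print(recommendations)
```

Setting the score of seen items to `-np.inf` pushes them to the end of the ranking, so they can only appear if `at` exceeds the number of unseen items.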

In [26]:
class CFItemKNN(object):
    def __init__(self, shrink: int):
        self.shrink = shrink
        self.weights = None
    
    
    def fit(self, urm_train: sp.csc_matrix, similarity_function):
        if not sp.isspmatrix_csc(urm_train):
            raise TypeError(f"We expected a CSC matrix, we got {type(urm_train)}")
        
        self.weights = similarity_function(urm_train, self.shrink)
        
    def recommend(self, user_id: int, urm_train: sp.csr_matrix, at: Optional[int] = None, remove_seen: bool = True):
        user_profile = urm_train[user_id]
        
        # np.asarray accepts both np.matrix and np.ndarray weights.
        ranking = np.asarray(user_profile.dot(self.weights)).flatten()
        
        if remove_seen:
            user_profile_start = urm_train.indptr[user_id]
            user_profile_end = urm_train.indptr[user_id+1]
            
            seen_items = urm_train.indices[user_profile_start:user_profile_end]
            
            ranking[seen_items] = -np.inf
            
        ranking = np.flip(np.argsort(ranking))
        return ranking[:at]

In [27]:
itemknn_recommender = CFItemKNN(shrink=50)
itemknn_recommender

<__main__.CFItemKNN at 0x25308e66250>

In [28]:
%%time

itemknn_recommender.fit(urm_train.tocsc(), matrix_similarity)

CPU times: total: 48.7 s
Wall time: 3min 4s


In [29]:
for user_id in range(10):
    print(itemknn_recommender.recommend(user_id=user_id, urm_train=urm_train, at=10, remove_seen=True))

[ 533  144  890  354   50  564   16  471 2036 1553]
[ 533  354  311  890  144  471  166  564 1553 1764]
[10307  4821  3466  8373  3884  3917  6059  3067  8317   198]
[1426  154  314 7698 5291  274   48  959  913  156]
[2036  354 1553 1182  623  541  151 1981   50  407]
[ 953  471  126  282 1111 1596  954 3606 1035 1717]
[ 533  564 1553 1182  623 3971  407 1676 1764 1556]
[ 890 1596 1764  880  471 1676 1035  864  282 4359]
[   4  876 1596  883  880 2477 2466  864 1764 2872]
[ 953  955   48  126  921 1717    4 4010  281  959]


## Evaluation Metrics

In this practice session we use the same evaluation metrics defined in Practice session 2, i.e., precision, recall, and mean average precision (MAP).
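
As a hand-checkable illustration of the three metrics, the sketch below evaluates one invented recommendation list against three invented relevant items. The arrays are hypothetical, and the per-user value computed last is an average precision, which becomes MAP once averaged over users.

```python
import math

import numpy as np

# Hypothetical recommendation list (best first) and held-out relevant items.
recommendations = np.array([1, 2, 3, 4, 5])
relevant_items = np.array([2, 5, 9])

is_relevant = np.isin(recommendations, relevant_items)       # hits at ranks 2 and 5

precision_score = is_relevant.sum() / recommendations.size   # 2 hits / 5 recommended
recall_score = is_relevant.sum() / relevant_items.size       # 2 hits / 3 relevant

# Precision at each rank, counted only at the ranks that hold a hit.
precision_at_k = is_relevant * np.cumsum(is_relevant) / (1 + np.arange(is_relevant.size))
average_precision = precision_at_k.sum() / min(relevant_items.size, is_relevant.size)
print(precision_score, recall_score, average_precision)
```

Here the hits at ranks 2 and 5 contribute `1/2` and `2/5`, so the average precision is `(0.5 + 0.4) / 3 = 0.3`.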

In [30]:
def recall(recommendations: np.ndarray, relevant_items: np.ndarray) -> float:
    # Fraction of the relevant items that appear in the recommendation list.
    is_relevant = np.isin(recommendations, relevant_items, assume_unique=True)
    
    recall_score = np.sum(is_relevant) / relevant_items.shape[0]
    
    return recall_score
    
    
def precision(recommendations: np.ndarray, relevant_items: np.ndarray) -> float:
    # Fraction of the recommended items that are relevant.
    is_relevant = np.isin(recommendations, relevant_items, assume_unique=True)
    
    precision_score = np.sum(is_relevant) / recommendations.shape[0]

    return precision_score

def mean_average_precision(recommendations: np.ndarray, relevant_items: np.ndarray) -> float:
    # Average precision of one user; the evaluator averages it over users to obtain MAP.
    is_relevant = np.isin(recommendations, relevant_items, assume_unique=True)
    
    # Precision at each rank, kept only at the ranks that hold a relevant item.
    precision_at_k = is_relevant * np.cumsum(is_relevant, dtype=np.float32) / (1 + np.arange(is_relevant.shape[0]))

    map_score = np.sum(precision_at_k) / np.min([relevant_items.shape[0], is_relevant.shape[0]])

    return map_score
    

## Evaluation Procedure

The evaluation procedure returns the accuracy scores (precision, recall, and MAP) averaged over all users that have at least one rating in the test set. It also counts the evaluated and skipped users. It receives a recommender instance and the train and test URMs.
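
The per-user lookup at the core of this procedure relies on the CSR layout of the test URM. The sketch below shows it on an invented `3 x 4` matrix in which the second user has no held-out items and is therefore skipped; `toy_test` and the list names are hypothetical.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical test URM: user 0 has two held-out items, user 1 none, user 2 one.
toy_test = sp.csr_matrix(np.array([[0, 1, 0, 1],
                                   [0, 0, 0, 0],
                                   [1, 0, 0, 0]]))

evaluated, skipped = [], []
for user_id in range(toy_test.shape[0]):
    # In CSR format, the column indices of row `user_id` sit between two pointers.
    start, end = toy_test.indptr[user_id], toy_test.indptr[user_id + 1]
    relevant_items = toy_test.indices[start:end]

    if relevant_items.size == 0:
        skipped.append(user_id)      # nothing to evaluate against
        continue
    evaluated.append((user_id, relevant_items.tolist()))

print(evaluated, skipped)
```

Slicing `indices` with `indptr` avoids materializing each row as a separate sparse matrix, which keeps the loop over tens of thousands of users cheap.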

In [31]:
def evaluator(recommender: object, urm_train: sp.csr_matrix, urm_test: sp.csr_matrix):
    recommendation_length = 10
    accum_precision = 0
    accum_recall = 0
    accum_map = 0
    
    num_users = urm_train.shape[0]
    
    num_users_evaluated = 0
    num_users_skipped = 0
    for user_id in range(num_users):
        user_profile_start = urm_test.indptr[user_id]
        user_profile_end = urm_test.indptr[user_id+1]
        
        relevant_items = urm_test.indices[user_profile_start:user_profile_end]
        
        if relevant_items.size == 0:
            num_users_skipped += 1
            continue
            
        recommendations = recommender.recommend(user_id=user_id, 
                                               at=recommendation_length, 
                                               urm_train=urm_train, 
                                               remove_seen=True)
        
        accum_precision += precision(recommendations, relevant_items)
        accum_recall += recall(recommendations, relevant_items)
        accum_map += mean_average_precision(recommendations, relevant_items)
        
        num_users_evaluated += 1
        
    
    accum_precision /= max(num_users_evaluated, 1)
    accum_recall /= max(num_users_evaluated, 1)
    accum_map /=  max(num_users_evaluated, 1)
    
    return accum_precision, accum_recall, accum_map, num_users_evaluated, num_users_skipped
    

In [32]:
%%time

accum_precision, accum_recall, accum_map, num_user_evaluated, num_users_skipped = evaluator(itemknn_recommender, 
                                                                                            urm_train, 
                                                                                            urm_test)

CPU times: total: 1min 3s
Wall time: 1min 32s


In [33]:
accum_precision, accum_recall, accum_map, num_user_evaluated, num_users_skipped

(0.06487420941671485, 0.08849688323797107, 0.04848522513375326, 35575, 161)

## Hyperparameter Tuning

This step is fundamental to get the best performance out of an algorithm: we train the `CFItemKNN` recommender with different values of its hyperparameters and select the best-performing configuration.

For this step to be meaningful (and to avoid overfitting on the test set), we evaluate each configuration using the `validation` URM as the test set.

This is the longest step to run in the entire pipeline when building a recommender.

In [34]:
def hyperparameter_tuning():
    shrinks = [0,1,5,10,50]
    results = []
    for shrink in shrinks:
        print(f"Currently trying shrink {shrink}")
        
        itemknn_recommender = CFItemKNN(shrink=shrink)
        itemknn_recommender.fit(urm_train.tocsc(), matrix_similarity)
        
        ev_precision, ev_recall, ev_map, _, _ = evaluator(itemknn_recommender, urm_train, urm_validation)
        
        results.append((shrink, (ev_precision, ev_recall, ev_map)))
        
    return results
    


In [35]:
%%time

hyperparameter_results = hyperparameter_tuning()

Currently trying shrink 0
Currently trying shrink 1
Currently trying shrink 5
Currently trying shrink 10
Currently trying shrink 50
CPU times: total: 8min 52s
Wall time: 23min 19s


In [36]:
hyperparameter_results

[(0, (0.02973637313341475, 0.09063398503116604, 0.03887138296220355)),
 (1, (0.029972961347020176, 0.09104874655050213, 0.03914929576506363)),
 (5, (0.030307871935110944, 0.09176682421317435, 0.03961139127158533)),
 (10, (0.030458428071041647, 0.09196546593035936, 0.039513069211595545)),
 (50, (0.028316843851782438, 0.08579451871289662, 0.03696762780159111))]
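
One way to pick a configuration from these results is to rank them by a single metric. The sketch below copies the tuples printed above and selects the shrink with the highest MAP; using MAP as the selection criterion is an assumption on our part, and a different metric can yield a different winner.

```python
# (shrink, (precision, recall, MAP)) tuples copied from the output above.
hyperparameter_results = [
    (0, (0.02973637313341475, 0.09063398503116604, 0.03887138296220355)),
    (1, (0.029972961347020176, 0.09104874655050213, 0.03914929576506363)),
    (5, (0.030307871935110944, 0.09176682421317435, 0.03961139127158533)),
    (10, (0.030458428071041647, 0.09196546593035936, 0.039513069211595545)),
    (50, (0.028316843851782438, 0.08579451871289662, 0.03696762780159111)),
]

# Rank configurations by MAP, the third entry of the metrics tuple.
best_by_map, best_metrics = max(hyperparameter_results, key=lambda result: result[1][2])

# For comparison: the configuration with the highest precision.
best_by_precision, _ = max(hyperparameter_results, key=lambda result: result[1][0])
print(best_by_map, best_by_precision)
```

With these numbers, MAP favors `shrink=5` while precision and recall favor `shrink=10`, so it is worth deciding on the target metric before tuning.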

## Submission to competition

This step mirrors what you will do when preparing a submission to the competition, once you have chosen and trained your recommender.

The best suggestion for this step is to select the best-performing configuration obtained in the hyperparameter tuning step and to train the recommender using both the `train` and `validation` sets. Remember that in the competition you *do not* have access to the test set.

We first simulate the list of users to generate recommendations for by randomly selecting 100 users from the original identifiers. In the competition you are most likely to be provided with this list; the cell below reads it from `challenge_data/data_target_users_test.csv`, which replaces the random sample. 

Another consideration is that, for easier and faster calculations, we replaced the user/item identifiers with new ones in the preprocessing step. For the competition, you are required to generate recommendations using the dataset's original identifiers. For this reason, this step also maps the new identifiers back to the ones originally found in the dataset.

Lastly, this step creates a function that writes the recommendations for all users into a single file, one line per user, in the following format: 
```csv
<user_id>,<item_id_1> <item_id_2> <item_id_3> <item_id_4> <item_id_5> <item_id_6> <item_id_7> <item_id_8> <item_id_9> <item_id_10>
```

Always verify the competition's submission file format, as it might differ from the one presented here.
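
A minimal sketch of the two operations described in this section, mapping contiguous identifiers back to the original ones and writing one line per user, using invented data and an in-memory buffer in place of the real file. The dictionaries `mapping_to_item_id` and `toy_recommendations` hold hypothetical values.

```python
import io

# Hypothetical mapping from contiguous item ids back to original ids,
# and hypothetical recommendations expressed in contiguous ids.
mapping_to_item_id = {0: 10, 1: 500, 2: 42, 3: 77}
toy_recommendations = {3: [2, 0], 8: [1, 3]}   # original user_id -> mapped item ids

buffer = io.StringIO()   # stands in for the submission file
for user_id, mapped_items in sorted(toy_recommendations.items()):
    # Translate every recommended item back to its original identifier.
    original_items = [mapping_to_item_id[item] for item in mapped_items]
    buffer.write(f"{user_id},{' '.join(str(item) for item in original_items)}\n")

print(buffer.getvalue())
```

The items are written in the order in which they were recommended, so the ranking produced by the recommender is preserved in the file.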

In [37]:
best_shrink = 5
urm_train_validation = urm_train + urm_validation


In [38]:
best_recommender = CFItemKNN(shrink=best_shrink)
best_recommender.fit(urm_train_validation.tocsc(), matrix_similarity)

In [40]:
users_to_recommend = np.random.choice(ratings.user_id.unique(), size=100, replace=False)
users_to_recommend

users_to_recommend = pd.read_csv(
    "challenge_data/data_target_users_test.csv",
    sep=",",
    dtype={"user_id": np.int32},
)
users_to_recommend = users_to_recommend.user_id.values
users_to_recommend

array([    0,     1,     2, ..., 35731, 35734, 35735])

In [41]:
mapping_to_item_id = dict(zip(ratings.mapped_item_id, ratings.item_id))

In [42]:
mapping_to_item_id

{0: 0,
 1: 2,
 2: 120,
 3: 128,
 4: 211,
 5: 232,
 6: 282,
 7: 453,
 8: 458,
 9: 491,
 10: 675,
 11: 711,
 12: 760,
 13: 884,
 14: 950,
 15: 1015,
 16: 1075,
 17: 1715,
 18: 2548,
 19: 2600,
 20: 2635,
 21: 2697,
 22: 2698,
 23: 2705,
 24: 2753,
 25: 2891,
 26: 2906,
 27: 2920,
 28: 2971,
 29: 2975,
 30: 3043,
 31: 3077,
 32: 3100,
 33: 3508,
 34: 3564,
 35: 3611,
 36: 3653,
 37: 3667,
 38: 3765,
 39: 3766,
 40: 3907,
 41: 3917,
 42: 3923,
 43: 4678,
 44: 4737,
 45: 4872,
 46: 4883,
 47: 5129,
 48: 6165,
 49: 6197,
 50: 6822,
 51: 7365,
 52: 7391,
 53: 7434,
 54: 7548,
 55: 7549,
 56: 7701,
 57: 7702,
 58: 8740,
 59: 9328,
 60: 9448,
 61: 9537,
 62: 9746,
 63: 9765,
 64: 9783,
 65: 11551,
 66: 11685,
 67: 12798,
 68: 12803,
 69: 12814,
 70: 13723,
 71: 13733,
 72: 13808,
 73: 14154,
 74: 14767,
 75: 14982,
 76: 14989,
 77: 15136,
 78: 15209,
 79: 15212,
 80: 15239,
 81: 15257,
 82: 15314,
 83: 15600,
 84: 15601,
 85: 16892,
 86: 16933,
 87: 17113,
 88: 17540,
 89: 18334,
 90: 19138,
 9

In [55]:
def prepare_submission(ratings: pd.DataFrame, users_to_recommend: np.ndarray, urm_train: sp.csr_matrix, recommender: object):
    # Original and mapped identifiers of the target users found in the ratings.
    users_ids_and_mappings = ratings[ratings.user_id.isin(users_to_recommend)][["user_id", "mapped_user_id"]].drop_duplicates()
    
    # Mapped item identifier -> original item identifier.
    mapping_to_item_id = dict(zip(ratings.mapped_item_id, ratings.item_id))
    
    recommendation_length = 10
    submission = []
    for idx, row in users_ids_and_mappings.iterrows():
        user_id = row.user_id
        mapped_user_id = row.mapped_user_id
        
        recommendations = recommender.recommend(user_id=mapped_user_id,
                                                urm_train=urm_train,
                                                at=recommendation_length,
                                                remove_seen=True)
        
        # NOTE: sorted() orders the items by identifier and discards the ranking order;
        # drop it if the competition metric is rank-sensitive (e.g. MAP).
        submission.append((user_id, list(sorted([mapping_to_item_id[item_id] for item_id in recommendations]))))

    submission.sort(key=lambda x: x[0])
        
    return submission
    

In [44]:
submission = prepare_submission(ratings, users_to_recommend, urm_train_validation, best_recommender)


In [49]:
ratings

Unnamed: 0,user_id,item_id,data,mapped_user_id,mapped_item_id
0,0,0,1,0,0
1,5,0,1,5,0
2,6,0,1,6,0
3,7,0,1,7,0
4,9,0,1,9,0
...,...,...,...,...,...
1764602,34967,37744,1,34967,38120
1764603,35040,37744,1,35040,38120
1764604,35308,37744,1,35308,38120
1764605,35431,37744,1,35431,38120


In [60]:
submission

[(0, [471, 512, 1075, 4462, 4689, 6348, 8505, 14888, 21470, 28958]),
 (1, [471, 516, 3074, 3417, 6348, 8505, 8685, 14748, 14888, 28958]),
 (2, [2752, 15740, 18553, 22551, 22589, 22705, 22714, 29640, 29963, 29964]),
 (3, [269, 3671, 6189, 7448, 11146, 11362, 15925, 16924, 16956, 25079]),
 (4, [471, 3472, 6546, 8612, 9812, 11753, 15223, 15606, 18647, 21146]),
 (5, [39, 186, 3130, 4137, 4253, 11150, 11151, 15648, 19208, 21443]),
 (6, [3381, 3408, 6348, 8505, 9742, 11753, 15223, 15503, 15606, 15677]),
 (7, [116, 186, 2652, 2676, 2861, 2892, 3074, 3697, 9866, 15104]),
 (8, [116, 117, 186, 211, 627, 2676, 2837, 2916, 2917, 3024]),
 (9, [2782, 3021, 3130, 3131, 6165, 6980, 7036, 11151, 11154, 11362]),
 (10, [29, 53, 211, 2591, 3697, 4253, 5354, 6321, 9919, 14955]),
 (11, [2310, 4112, 14789, 16288, 16294, 16300, 16305, 16320, 16336, 16359]),
 (12, [116, 281, 2847, 2861, 2871, 2874, 3875, 10110, 18576, 18580]),
 (13, [186, 928, 2496, 2591, 2782, 3697, 4137, 4253, 6356, 11151]),
 (14, [604, 1003

In [61]:
def write_submission(submissions):
    with open("./submission.csv", "w") as f:
        for user_id, items in submissions:
            f.write(f"{user_id},{' '.join([str(item) for item in items])}\n")
    

In [62]:
write_submission(submission)

## Exercises

In this lecture we saw the simplest version of Cosine Similarity, which only includes a shrink factor. There are several improvements we can make to it.

- Implement TopK Neighbors
- When calculating the cosine similarity we used `urm.T.dot(urm)` to calculate the numerator. However, depending on the dataset and the number of items, this matrix might not fit in memory. Implement a `block` version, faster than our `vector` version, that does not compute the full `urm.T.dot(urm)` beforehand.
- Implement Adjusted Cosine [Formula link](http://www10.org/cdrom/papers/519/node14.html)
- Implement Dice Similarity [Wikipedia Link](https://en.wikipedia.org/wiki/Sørensen–Dice_coefficient)
- Implement an implicit CF ItemKNN.
- Implement a CF UserKNN model