<a href="https://colab.research.google.com/github/gretamontera/DataMining1/blob/main/Collaborative_Filtering_Recommender_systems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Collaborative Filtering Recommender System
Adapted from [here](https://medium.com/@camilolgon/collaborative-filtering-based-recommender-system-from-scratch-38037932b877)

In [1]:
import numpy as np
import pandas as pd
import re

# Load the dataset
In this hands-on lecture, we will use the [MovieLens](https://grouplens.org/datasets/movielens/) dataset.<br/>
In particular, we will use the $\texttt{ml-latest-small}$ dataset. It contains about 100k ratings from about 600 users across more than 9k movies.

In [2]:
# download the dataset and unzip it
!wget https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
!unzip ml-latest-small.zip
!rm ml-latest-small.zip

--2025-03-10 15:18:38--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2025-03-10 15:18:38 (2.67 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]

Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


csv è un sample del dataset, usiamo pandas per lavorare su questo dataset

We load the dataset using [pandas](https://pandas.pydata.org/), a popular Python library for data analysis.

We start by loading the $\texttt{ratings.csv}$ file, which contains the full dataset.<br/>
Each row has the user ID, the movie ID and a rating from 0.5 to 5.

In [3]:
ratings = pd.read_csv(
    'ml-latest-small/ratings.csv',
    sep = ',',
    names = ['user_id', 'movie_id', 'rating', 'timestamp'], # the columns names in the dataset
    usecols = ['user_id', 'movie_id', 'rating'], # for this application, we don't use the timestamp
    skiprows = 1 # the first row contains the names of the columns
    )
ratings['inc_movie_id'] = pd.factorize(ratings['movie_id'])[0] # get an incremental index for movies. In questo modo gli id dei film sono aumentati uno dopo l'altro
ratings

Unnamed: 0,user_id,movie_id,rating,inc_movie_id
0,1,1,4.0,0
1,1,3,4.0,1
2,1,6,4.0,2
3,1,47,5.0,3
4,1,50,5.0,4
...,...,...,...,...
100831,610,166534,4.0,3120
100832,610,168248,5.0,2035
100833,610,168250,5.0,3121
100834,610,168252,5.0,1392


In [4]:
movies = pd.read_csv(
    'ml-latest-small/movies.csv', #contiene il mapping tra il movie id e il movie title
    sep = ',',
    names = ['movie_id', 'movie_title', 'genres'],
    skiprows = 1 #rimuoviamo year del film
)
# remove the year from movie_title
movies['movie_title'] = movies['movie_title'].apply(lambda title: re.sub("\(\d{4}\)", "", title).strip()) #ogni row ha un dataset
movies

Unnamed: 0,movie_id,movie_title,genres
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji,Adventure|Children|Fantasy
2,3,Grumpier Old Men,Comedy|Romance
3,4,Waiting to Exhale,Comedy|Drama|Romance
4,5,Father of the Bride Part II,Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic,Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero,Animation|Comedy|Fantasy
9739,193585,Flint,Drama
9740,193587,Bungo Stray Dogs: Dead Apple,Action|Animation


In [5]:
# Some useful statistics

n_users = ratings['user_id'].nunique()
n_movies = ratings['movie_id'].nunique()

print(f'Number of users: {n_users}')
print(f'Number of movies: {n_movies}')
print(f'Number of ratings: {len(ratings)}')
print()

print(f'User with most reviews: {ratings.mode()["user_id"][0]}') #chi ha fatto più reviews
most_reviews_movie_id = ratings.mode()['movie_id'][0]
print(f'Movie with most reviews: {movies[movies.movie_id == most_reviews_movie_id].movie_title.values[0]}')  #titolo del film che ha avuto più reviews
print()

print(f'Average rating: {ratings["rating"].mean():.2f}') #media per tutti i film
toy_story_id = movies[movies.movie_title == 'Toy Story'].movie_id.values[0]  #calcoliamo la media di punteggio che gli è stata data
print(f'Average rating for Toy Story: {ratings[ratings.movie_id == toy_story_id].rating.mean()}')

Number of users: 610
Number of movies: 9724
Number of ratings: 100836

User with most reviews: 414
Movie with most reviews: Forrest Gump

Average rating: 3.50
Average rating for Toy Story: 3.9209302325581397


## Train/Test split
Now we create the train and test sets.</br>
In this case, 80% of each user ratings will be used for training, and the remaining for testing.
Il train allena i modelli e il test valuta il modello su dati non ancora visti

In [6]:
seed = 42
test_perc = 0.2 #20 del sample di ogni utente,

# Initialize the train and test dataframes
train_set, test_set = pd.DataFrame(), pd.DataFrame()

# Check each user PER OGNI UTENTE VIENE FATTA QUESTA COSA
for user_id in ratings.user_id.unique():
    # select only samples from the current user and shuffle them
    user_df = ratings[ratings.user_id == user_id].sample( #shuffle the samples
        frac = 1, # 100% of the rows are returned, but shuffled
        random_state = seed
    )

    n_entries = len(user_df)
    n_test = int(round(test_perc * n_entries))

    # concatenate samples from current user to train and test sets
    test_set = pd.concat((test_set, user_df.tail(n_test))) #creiamo il testo
    train_set = pd.concat((train_set, user_df.head(n_entries - n_test))) #creiamo il test

# re-shuffle train and test sets again
train_set = train_set.sample(frac=1).reset_index(drop=True)
test_set = test_set.sample(frac=1).reset_index(drop=True)

print(f'Training set has {len(train_set)} samples')
print(f'Test set has {len(test_set)} samples')

Training set has 80672 samples
Test set has 20164 samples


# Memory-Based Collaborative Filtering
In this notebook, we implement a *memory-based* approach.<br/>
This means that the similarities between users (or items) are computed using the rating data from the dataset.
COMPARIAMO GLI UTENTI E POI GLI DIAMO DEI CONSIGLI IN BASE AGLI UTENTI A CUI ASSOMIGLANO DI PIù


To implement a **Collaborative Filtering** approach, we need to create the *interactions matrix*.<br/>
Each row contains the ratings given by a user to all items in a database.

In [7]:
def build_interactions_matrix(ratings_mat, n_users, n_items):
    iter_mat = np.zeros((n_users, n_items)) # initialize the iteration matrix with zeros values

    for _, user_id, _, rating, inc_movie_id in ratings_mat.itertuples(): #iteriamo sulle metriche e per ogni utente inseriamo il punteggio di quell'utente per quel certo numero
        # we remove 1 to make the ids start from 0
        iter_mat[user_id - 1, inc_movie_id] = rating

    return iter_mat #abbiamo la matrice risultante dal ciclo for

iter_m = build_interactions_matrix(ratings, n_users, n_movies)
iter_m.shape

(610, 9724)

From the interactions matrix, we can compute the **similarity** between two users.<br/>
It is computed via some distance measurement between the users' feature vectores *i.e.*, the rows in the interactions matrix.

There are different possible distance measurements to compute the similarity between user $A$ and user $B$.<br/>
For example:


*   **Pearson correlation**:
$r(A, B) = \frac{\sum (R_{A,i} - \bar{R}_A) (R_{B,i} - \bar{R}_B)}
{\sqrt{\sum (R_{A,i} - \bar{R}_A)^2} \cdot \sqrt{\sum (R_{B,i} - \bar{R}_B)^2}}
$

* **Cosine similarity**:
$
sim(A, B) = cos(A, B) = \frac{\sum R_{A,i} R_{B,i}}
{\sqrt{\sum R_{A,i}^2} \cdot \sqrt{\sum R_{B,i}^2}}
$

where $R_{A,i}$ is the rating value given by user $A$ to item $i$.


Using a distance measure, we can compute the **similarity matrix**.<br/>
It is a simmetric matrix, with values from 0 to 1. Each entry $sim\_mat_{i,j}$ represents the similarity between user (or item) $i$ and $j$.<br/>
The elements on the diagonal are *auto-similarities*, so they are equal to 1.

In this case, we use the cosine similarity measure.<br/>
In the function below, we can compute the similarity between users or items.

In [8]:
def build_similarity_matrix(interactions_matrix, kind="user", eps=1e-9): #avoid division by 0
    # takes rows as user features
    if kind == "user":
        similarity_matrix = interactions_matrix.dot(interactions_matrix.T)

    # takes columns as item features
    elif kind == "item":
        similarity_matrix = interactions_matrix.T.dot(interactions_matrix)

    norms = np.sqrt(similarity_matrix.diagonal()) + eps # eps is used for numerical stability
    return similarity_matrix / (norms[np.newaxis, :] * norms[:, np.newaxis])

user_sim = build_similarity_matrix(iter_m, kind="user")
item_sim = build_similarity_matrix(iter_m, kind="item")

print(f"User similarity matrix shape: {user_sim.shape}\nUser similarity matrix sample:\n{user_sim[:4, :4]}")
print("-" * 50)
print(f"Item similarity matrix shape: {item_sim.shape}\nItem similarity matrix sample:\n{item_sim[:4, :4]}")

User similarity matrix shape: (610, 610)
User similarity matrix sample:
[[1.         0.02728287 0.05972026 0.19439477]
 [0.02728287 1.         0.         0.00372587]
 [0.05972026 0.         1.         0.00225139]
 [0.19439477 0.00372587 0.00225139 1.        ]]
--------------------------------------------------
Item similarity matrix shape: (9724, 9724)
Item similarity matrix sample:
[[1.         0.2969169  0.37631587 0.4376586 ]
 [0.2969169  1.         0.28425686 0.27961401]
 [0.37631587 0.28425686 1.         0.46392609]
 [0.4376586  0.27961401 0.46392609 1.        ]]


## Make Predictions


### The Recommender

We create a class that represents the recommender system.<br/>
It takes in input the ratings matrix and it creates the interaction and similarity matrices.<br/>
It also computes the predictions using normalized weighted sum:

$$
pred_{u,i} = \frac{\sum\limits_{v \in N(u)} sim(u,v) R_{v,i}}{\sum\limits_{v \in N(u)} |sim(u,v)|}
$$

In [9]:
class Recommender: #calcola le previsioni sulle matrici
    def __init__(self, n_users, n_items, ratings_mat, kind="user", eps=1e-9):
        # store the values
        self.n_users = n_users
        self.n_items = n_items
        self.kind = kind
        self.eps = eps

        # create interactions matrix and similarity matrix
        self.inter_mat = build_interactions_matrix(ratings_mat, self.n_users, self.n_items)
        self.sim_mat = build_similarity_matrix(self.inter_mat, kind=self.kind) #con la matrice precedente computiamo quella di similarity

        # make predictions
        self.predictions = self._predict_all() #fa previsioni per tutti gli utenti

    def _predict_all(self):
        if self.kind == 'user':
            predictions = self.sim_mat.dot(self.inter_mat) / np.abs(self.sim_mat + self.eps).sum(axis=0)[:, np.newaxis]

        elif self.kind == 'item':
            predictions = self.inter_mat.dot(self.sim_mat) / np.abs(self.sim_mat + self.eps).sum(axis=0)[np.newaxis, :]

        return predictions

In [10]:
print("User-based predictions sample:")
print(Recommender(n_users, n_movies, train_set, kind="user").predictions[:4, :4]) #facciamo previsioni sull'utente e sull'item
print("-" * 50)
print("item-based predictions sample:")
print(Recommender(n_users, n_movies, train_set, kind="item").predictions[:4, :4])

#user-based:movie con id 0 sarà consigliato con valore 1.8

User-based predictions sample:
[[1.83521756 0.26945908 0.69753964 1.60462404]
 [1.33107597 0.11778826 0.41507675 1.25240043]
 [1.46579424 0.2356986  0.70228418 1.24934449]
 [1.73919347 0.2350114  0.55374472 1.47033147]]
--------------------------------------------------
item-based predictions sample:
[[0.25773399 0.22004065 0.28210378 0.25681174]
 [0.02019038 0.01163793 0.01870963 0.02272396]
 [0.00826171 0.00713194 0.0095411  0.00849455]
 [0.14852232 0.1120654  0.12919133 0.1450038 ]]


## Model evaluation

We can use the train and test sets we created before to evaluate the model's performance.<br/>
For the evaluation, we can use the *Mean Square Error* (MSE) error.

In [11]:
from sklearn.metrics import mean_squared_error

def build_predictions_df(preds_m, dataframe):
    preds_v = []
    for _, user_id, _, _, inc_movie_id in dataframe.itertuples():
        preds_v.append(preds_m[user_id-1, inc_movie_id])

    preds_df = pd.DataFrame(data={"user_id": dataframe.user_id, "movie_id": dataframe.inc_movie_id, "rating": preds_v}) #il nostro modello da testare sul test
    return preds_df

def get_mse(estimator, train_set, test_set):
    train_preds = build_predictions_df(estimator.predictions, train_set)
    test_preds = build_predictions_df(estimator.predictions, test_set)

    # evaluate the model
    train_mse = mean_squared_error(train_set.rating, train_preds.rating) #computiamo le differenze nel training set
    test_mse = mean_squared_error(test_set.rating, test_preds.rating) #computiamo le differenze nel test set

    return train_mse, test_mse

In [12]:
train_mse, test_mse = get_mse(
    Recommender(n_users, n_movies, train_set, kind="user"),
    train_set,
    test_set
)

print(f"User-based train MSE: {train_mse:.3f} -- User-based test MSE: {test_mse:.3f}")

User-based train MSE: 9.931 -- User-based test MSE: 10.556


# K-nearest neighbors

In the previous example, the predictions are computed using the ratings from all users. This means that even users with low similarity scores will have a weight in the final computation.

Instead, **k-nearest neighbors** algorithm uses only the subset of the most similar users to make predictions.

In the first algo we used all the users, anche gli utenti che sono veramente poco simili e che non ci servono per la raccomandazione. Usando questo nuovo algoritmo usiamo solo quelli che ci servono. Anzi che prenderli tutti prendiamo solo quelli più rilevanti, i k più rilevanti.

In [13]:
class KNN_Recommender:
    # k is the number of neighbors to use when computing the similarity scores
    def __init__(self, n_users, n_items, ratings_mat, k=40, kind="user", eps=1e-9): #prendiamo solo i 40 utenti più simili
        # store the values
        self.n_users = n_users
        self.n_items = n_items
        self.k = k
        self.kind = kind
        self.eps = eps

        # create interactions matrix and similarity matrix
        self.inter_mat = build_interactions_matrix(ratings_mat, self.n_users, self.n_items)
        self.sim_mat = build_similarity_matrix(self.inter_mat, kind=self.kind)

        # make predictions
        self.predictions = self._predict_all()


    def _predict_all(self):
        pred = np.empty_like(self.inter_mat) #questo è diverso dal primo algoirtmo perchè consideriamo che ora anzi ch eprendere tutti i similarity matrix ora prendiamo i top k element da quella matrice

        if self.kind == "user":
            # an user has the higher similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_mat)[:, 1:self.k+1] # take the highest k elements from similarity matrix, una volta che loi abbimao tutti e 40

            for user_id, k_users in enumerate(sorted_ids): #facciamo la stessa formula dello scorso algoritmo ma prendiamo solo i k top user.
                pred[user_id, :] = self.sim_mat[user_id, k_users].dot(self.inter_mat[k_users, :])
                pred[user_id, :] /= np.abs(self.sim_mat[user_id, k_users] + self.eps).sum()

        elif self.kind == "item":
            # an item has the higher similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_mat)[:, 1:self.k+1]

            for item_id, k_items in enumerate(sorted_ids):
                pred[:, item_id] = self.sim_mat[item_id, k_items].dot(self.inter_mat[:, k_items].T)
                pred[:, item_id] /= np.abs(self.sim_mat[item_id, k_items] + self.eps).sum()

        return pred

As we see from the evaluation, this algorithm obtains a lower MSE than before.

In [14]:
train_mse, test_mse = get_mse(
    KNN_Recommender(n_users, n_movies, train_set, kind="user"),
    train_set,
    test_set
)

print(f"KNN train MSE: {train_mse:.3f} -- KNN test MSE: {test_mse:.3f}")

KNN train MSE: 7.711 -- KNN test MSE: 8.201


## Bias subtraction

We can improve the prediction computation by considering the user's bias.<br/>
We subtract the average rating given by each user, so that we consider the *relative* difference in ratings instead of the absolute one.

In this case, the prediction is computed as:

$$
pred_{u,i} = \bar{R}_u + \frac{\sum\limits_{v \in N(u)} sim(u,v) (R_{v,i} - \bar{R}_v)}
{\sum\limits_{v \in N(u)} |sim(u,v)|}
$$
where $\bar{R}_u$ is the average rating for the user $u$.

Non prendiamo più il rate assoluto, prendiamo solo le distanze rilevanti. Se io do sempre 4 non indica allora che il film è bello perchè lo do sempre. Questa informazione ci interessa. Anzi che prendere la R che è il rate, prendo il rate - il valore del bias

In [16]:
class KNN_BiasSub_Recommender:
    # k is the number of neighbors to use when computing the similarity scores
    def __init__(self, n_users, n_items, ratings_mat, k=40, bias_sub = False, kind="user", eps=1e-9):  #variabili aggiunte che possono essere settate che può togliere il bias
        # store the values
        self.n_users = n_users
        self.n_items = n_items
        self.k = k
        self.bias_sub = bias_sub
        self.kind = kind
        self.eps = eps

        # create interactions matrix and similarity matrix
        self.inter_mat = build_interactions_matrix(ratings_mat, self.n_users, self.n_items)
        self.sim_mat = build_similarity_matrix(self.inter_mat, kind=self.kind)
        self.item_sim_mat = build_similarity_matrix(self.inter_mat, kind='item')

        # make predictions
        self.predictions = self._predict_all()

    def _predict_all(self):
        pred = np.empty_like(self.inter_mat)

        if self.kind == 'user':

            # computes the new interaction matrix if needed.
            # compute and remove the user bias
            inter_mat = self.inter_mat
            if self.bias_sub:  #la differenza: se tolgo il bias ho la media sull'asse 1 e tolgo il bias nella matrice.
                user_bias = self.inter_mat.mean(axis=1)[:, np.newaxis] #a ogni elemento della matrice viene tolto il bias
                inter_mat -= user_bias

            # an user has the higher similarity score with itself,
            # so we skip the first element
            sorted_ids = np.argsort(-self.sim_mat)[:, 1:self.k+1] # take the highest k elements from similarity matrix

            for user_id, k_users in enumerate(sorted_ids):
                pred[user_id, :] = self.sim_mat[user_id, k_users].dot(inter_mat[k_users, :])
                pred[user_id, :] /= \
                    np.abs(self.sim_mat[user_id, k_users] + self.eps).sum() + self.eps

            # if considering the user bias, re-add it at the end
            if self.bias_sub:
                pred += user_bias #se la usiamo aggiungiamo alla previsione il bias.

        elif self.kind == "item":

            # computes the new interaction matrix if needed.
            # compute and remove the user bias
            iter_m = self.iter_m
            if self.bias_sub:
                item_bias = self.iter_m.mean(axis=0)[np.newaxis, :]
                iter_m -= item_bias

            # an item has the higher similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_m)[:, 1:self.k+1]

            for item_id, k_items in enumerate(sorted_ids):
                pred[:, item_id] = self.sim_m[item_id, k_items].dot(iter_m[:, k_items].T)
                pred[:, item_id] /= \
                    np.abs(self.sim_m[item_id, k_items] + self.eps).sum() + self.eps

            # if considering the user bias, re-add it at the end
            if self.bias_sub:
                pred += item_bias

        return pred.clip(0, 5)

    def get_top_recommendations(self, item_id, n=6): #a function per computare le top n consigli agli utenti
        # Obtain the top n recommendation based on the provided item
        # "I like movie X, what are your recommendations?"

        # we take the row corresponding to our item
        # and we remove the first item -- the similarity with itself
        sim_row = self.item_sim_mat[item_id - 1, :]
        items_idxs = np.argsort(-sim_row)[1:n+1] #prendiamo le similarità e prendiamo le top n

        similarities = sim_row[items_idxs]
        return items_idxs + 1, similarities

In [17]:
train_mse, test_mse = get_mse(
    KNN_BiasSub_Recommender(n_users, n_movies, train_set, kind="user", bias_sub=True),
    train_set,
    test_set
)

print(f"KNN with bias train MSE: {train_mse} -- KNN with bias test MSE: {test_mse}")

KNN with bias train MSE: 7.576342723649772 -- KNN with bias test MSE: 8.065313322605636


Using this model, we can get recommendation based on a movie we like.<br/>
We will use a method that returns the *n* most similar items to a given one.

In [18]:
# definition of the model
model = KNN_BiasSub_Recommender(n_users, n_movies, ratings, k=20, kind='user', bias_sub=False) #usiamo 20 utenti e non usiamo la bis diminuzione

# auxiliary functions
# just a mapping between movie_id and movie_title
def title2id(movies, movie_title):
    return movies[movies.movie_title == movie_title].movie_id.values[0] #prendiamo il titolo anzi che l'id

def ids2titles(movies, ids):
    titles = []
    for inc_id in ids:
        id = ratings[ratings.inc_movie_id == inc_id].movie_id.values[0] # get the movie_id from the incremental id
        t = movies[movies.movie_id == id].movie_title.values[0] #prendiamo il titolo dii quel film corrispondente
        titles.append(t)
    return titles

def print_rec(model, movies, movie_title): #ci daà il modello i film e i titoli.
    movie_id = title2id(movies, movie_title)

    ids, similarities = model.get_top_recommendations(title2id(movies, movie_title), n = 10) #prendiamo le similarità e i primi 10 consigli per gli utenti
    titles = ids2titles(movies, ids) #poi prendiamo i titoli

    for title, sim in zip(titles, similarities):
        print(f'{title} -- {sim:.2f}')


In [19]:
print_rec(model, movies, 'Toy Story') #voglio guardare qualcosa simile a toy story, quali sono le cose che posso guardare dato che mi piace toy story?

Mission to Mars -- 0.57
Mrs. Doubtfire -- 0.57
She's the One -- 0.56
Pulp Fiction -- 0.56
Jungle Book, The -- 0.55
True Lies -- 0.54
Goodfellas -- 0.54
James and the Giant Peach -- 0.54
Back to the Future -- 0.53
Highlander -- 0.53
