# Collaborative Neighborhood-Based Filtering for Movie Recommendations (Incomplete)

## Introduction

Collaborative neighborhood-based filtering is a popular approach in recommendation systems that predicts a user's preferences from user-item interactions. It relies on the preferences of users who are similar to the target user and on the items that the target user has already rated. There are two main types of collaborative filtering:

- **User-Based Collaborative Filtering**: Recommends items based on the preferences of users who are similar to the target user.
- **Item-Based Collaborative Filtering**: Suggests items that are similar to those the user has already rated positively.

In this notebook, we will explore both approaches for movie recommendations.

## Dataset Description
We use two datasets for this analysis:
1. **Movies Dataset**:
   - `Movie_ID`: Unique identifier for each movie.
   - `Name`: Title of the movie.
   - `Year`: Release year of the movie.

2. **Ratings Dataset**:
   - `User_ID`: Unique identifier for each user.
   - `Movie_ID`: Identifier for the movie rated.
   - `Rating`: Numeric rating given by the user (e.g., on a scale of 1-5).

In [1]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Load the dataset
ratings = pd.read_csv("./data/Netflix_Dataset_Rating.csv")  # Columns: User_ID, Rating, Movie_ID
movies  = pd.read_csv("./data/Netflix_Dataset_Movie.csv")    # Columns: Movie_ID, Year, Name

### Formal Definition

- $U = \{u_1, u_2, \dots, u_n\}$ is the set of users.
- $U_i = \{u \in U \mid r_{u,i} \neq 0\}$ is the set of users who have rated item $i$
- $I = \{i_1, i_2, \dots, i_m\}$ is the set of items.
- $I_u = \{i \in I \mid r_{u,i} \neq 0\}$ is the set of items rated by user $u$
- $R = \{0, 1, \dots, 5\} \lor R = [0, 1]$ is the set of ratings.
- $r_{u,i}$ is the rating given by user $u$ for item $i$ (equal to 0 if not rated).
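
To make the notation concrete, the following minimal sketch builds the sets $I_u$ and $U_i$ from a small rating matrix. The matrix values and the helper names `items_rated_by` and `users_who_rated` are hypothetical and used only for illustration; they are not part of the dataset or of the notebook's code.

```python
import numpy as np

# Toy rating matrix (hypothetical values): rows = users, columns = items, 0 = not rated
M = np.array([
    [5, 0, 3, 0],
    [4, 2, 0, 0],
    [0, 0, 4, 1],
])

def items_rated_by(M, u):
    """I_u: indices of the items with a nonzero rating from user u."""
    return set(np.flatnonzero(M[u]).tolist())

def users_who_rated(M, i):
    """U_i: indices of the users with a nonzero rating for item i."""
    return set(np.flatnonzero(M[:, i]).tolist())

I_0 = items_rated_by(M, 0)    # items rated by user 0
U_2 = users_who_rated(M, 2)   # users who rated item 2
print(I_0, U_2)
```

Here user 0 rated items 0 and 2, and item 2 was rated by users 0 and 2, so both sets are `{0, 2}`.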

## User-Based Collaborative Filtering

### 1. Data Preprocessing
- **User-Item Matrix Creation**: Convert the ratings dataset into a user-item matrix, where rows represent users and columns represent movies. Missing ratings are filled with zeros. Each rating is represented by a number from 1 to 5.
  $$ M[u, i] = r_{u,i} \in R$$
  Where:
  - $u \in U$
  - $i \in I$.

- **Sparse Matrix Conversion**: The dense matrix is converted to a sparse format for memory optimization:
  $$M_{\text{sparse}} = \text{sparse}(M)$$
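
The two preprocessing steps can be tried on a few rows first. This is a minimal sketch on hypothetical ratings (the `toy_*` names and values are assumptions, not taken from the Netflix files); it also reports the matrix density, that is, the fraction of cells that hold a rating.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical ratings in the same long format as the Ratings dataset
toy_ratings = pd.DataFrame({
    "User_ID":  [1, 1, 2, 3, 3],
    "Movie_ID": [10, 30, 10, 20, 30],
    "Rating":   [5, 3, 4, 2, 4],
})

# Dense user-item matrix: one row per user, one column per movie, 0 = missing
toy_matrix = toy_ratings.pivot(index="User_ID", columns="Movie_ID", values="Rating").fillna(0)

# CSR format stores only the nonzero entries
toy_sparse = csr_matrix(toy_matrix.values)

density = toy_sparse.nnz / (toy_sparse.shape[0] * toy_sparse.shape[1])
print(toy_matrix)
print(f"shape={toy_sparse.shape}, stored ratings={toy_sparse.nnz}, density={density:.2f}")
```

The lower the density, the more memory the sparse format saves compared with the dense matrix.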

In [2]:
# Step 1: Create a user-item matrix
user_item_matrix = ratings.pivot(index='User_ID', columns='Movie_ID', values='Rating')

# Fill missing values with 0, which marks "not rated" (some algorithms expect NaN instead,
# but not the ones used in this notebook)
user_item_matrix.fillna(0, inplace=True)

# Convert the DataFrame to a sparse matrix
sparse_user_item = csr_matrix(user_item_matrix.values)

### 2. Collaborative Filtering Model
We use the `scikit-learn` library to implement a KNN-based model:

1. **Similarity Metric**: Cosine similarity is used to identify users with similar rating patterns. The formula for cosine similarity between two users $u$ and $v$ is:
   $$
   \text{sim}(u, v) = \frac{\vec r_u \cdot \vec r_v}{\|\vec r_u\| \cdot \|\vec r_v\|}
   $$
   Where:
   - $\vec r_u$ and $\vec r_v$ are the rating vectors of users $u$ and $v$ respectively (the $u$-th and $v$-th rows of the user-item matrix).
   - $\cdot$ represents the dot product.

2. **Nearest Neighbors**: The model identifies the top $k$ nearest neighbors for each user, based on the similarity scores:
   $$\large
   \mathcal{N}_u^k = \operatorname*{arg\,max}_{U' \subseteq U \setminus \{u\} \land |U'| = k} \sum_{v \in U'} \text{sim}(u, v)
   $$
   Where:
   - $\mathcal N_u^k$ is the set of the top $k$ nearest neighbors for user $u$.
   - $U'$ is a subset of users excluding $u$.
   - $k = 10$ (default value in this implementation).
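
As an illustration of the two steps above, the sketch below runs `NearestNeighbors` with the cosine metric on a hypothetical 4-user matrix. The variable names (`toy`, `knn`) and the ratings are assumptions made for this example only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy user-item matrix (hypothetical ratings, 0 = not rated)
toy = csr_matrix(np.array([
    [5, 4, 0, 0],   # user 0 (query user)
    [5, 5, 0, 0],   # user 1: rates like user 0
    [0, 0, 4, 5],   # user 2: no overlap with user 0
    [4, 0, 0, 1],   # user 3: partial overlap with user 0
], dtype=float))

knn = NearestNeighbors(metric="cosine", algorithm="brute")
knn.fit(toy)

# Ask for k + 1 neighbors because the query user is part of the fitted data
k = 2
distances, indices = knn.kneighbors(toy[0], n_neighbors=k + 1)

# Drop the query user explicitly instead of assuming it is returned first
mask = indices.ravel() != 0
neighbors = indices.ravel()[mask][:k]
similarities = 1.0 - distances.ravel()[mask][:k]   # cosine similarity = 1 - cosine distance

print(neighbors, similarities)
```

scikit-learn returns cosine distances, so similarities are recovered as `1 - distance`. Filtering the query user by index is one possible design choice; it avoids relying on the query always being ranked first when several users are at distance 0.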

In [10]:
# Step 2: Build the collaborative filtering model
model_knn_user = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=10)

### 3. Recommendation Function (User-Based)

In the user-based collaborative filtering approach, the recommendation function operates as follows:

1. **Get all relevant items**: The set of items rated by the top $k$ nearest neighbors for user $u$:
   $$\large
   I^k = \{i \in I \setminus I_u \mid \exists v \in \mathcal N_u^k : r_{v,i} \neq 0\}
   $$
   Where:
   - $\mathcal N_u^k$ is the set of the top $k$ nearest neighbors for user $u$.
   - $I^k$ is the set of items rated by the top $k$ nearest neighbors for user $u$ and not already rated by $u$.
   - $k = 10$ (default value in this implementation).

2. **Prediction Formula**: For a given user $u$, the set of predicted ratings for items $i \in I^k$ is calculated as:
   $$\large
   I^k_u = \{\hat{r}_{u,i} \mid i \in I^k\} \quad \text{where} \quad
   \hat{r}_{u,i} = \frac{\sum_{v \in \mathcal N_u^k} \text{sim}(u, v) \cdot r_{v,i}}{\sum_{v \in \mathcal N_u^k} \text{sim}(u, v)}
   $$
   Where:
   - $\hat{r}_{u,i}$ is the predicted rating for user $u$ on item $i$.
   - $r_{v,i}$ is the rating of neighbor $v \in \mathcal N_u^k$ for item $i$.
   - $\text{sim}(u, v)$ is the similarity between users $u$ and $v$.

This combination of steps allows the model to efficiently generate user-based collaborative filtering recommendations using nearest neighbors. This approach works well in cases where users with similar preferences exist in the dataset. For example, if two users have rated several movies similarly, the model assumes they will likely share preferences for other movies as well.
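
The candidate set $I^k$ and the prediction formula can be checked by hand on a small case. The sketch below assumes three hypothetical neighbors with made-up similarities and ratings; none of the values come from the dataset.

```python
import numpy as np

# Hypothetical data: ratings of k = 3 neighbors over 4 items (0 = not rated)
neighbor_ratings = np.array([
    [5, 0, 4, 0],
    [4, 0, 0, 2],
    [0, 0, 5, 0],
], dtype=float)
similarities = np.array([0.9, 0.6, 0.5])               # sim(u, v) for each neighbor
target_rated = np.array([True, False, False, False])   # u has already rated item 0

# Candidate set I^k: rated by at least one neighbor and not rated by u
candidates = np.flatnonzero((neighbor_ratings > 0).any(axis=0) & ~target_rated)

# Weighted average over all k neighbors, as in the prediction formula
predicted = (similarities @ neighbor_ratings) / similarities.sum()

for i in candidates:
    print(f"item {i}: predicted rating {predicted[i]:.2f}")
```

Item 2 gets $(0.9 \cdot 4 + 0.5 \cdot 5) / 2.0 = 3.05$, while item 3 gets $0.6 \cdot 2 / 2.0 = 0.6$. Because the denominator sums the similarities of all $k$ neighbors, a neighbor who did not rate an item contributes a rating of 0 and pulls the estimate down.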

In [38]:
# Step 3: Define a rating and a recommendation function
def recommend_movies_user_based(user_id, train_matrix, user_index, movie_index, movies, num_recommendations=None, k=10):
    """
    Recommend movies using user-based collaborative filtering.

    Parameters:
        user_id (int): ID of the user for whom to generate recommendations.
        train_matrix (csr_matrix): Sparse user-item matrix (rows = users, columns = movies).
        user_index (dict): Maps user IDs to row indices in the train_matrix.
        movie_index (dict): Maps column indices in the train_matrix to movie IDs.
        movies (pd.DataFrame): DataFrame with movie metadata (columns: Movie_ID, Year, Name).
        num_recommendations (int, optional): Number of recommendations to return. Defaults to None (all).
        k (int, optional): Number of nearest neighbors to consider. Defaults to 10.

    Returns:
        pd.DataFrame: DataFrame with recommended movies (columns: ['Movie_ID', 'Name', 'Year', 'Estimated_Rating']).
    """
    if user_id not in user_index:
        raise ValueError(f"User ID {user_id} not found in the dataset")

    # Identify user index and retrieve their vector
    user_idx = user_index[user_id]
    user_vector = train_matrix[user_idx]

    # Fit the KNN model and find neighbors
    model_knn_user.fit(train_matrix)
    distances, indices = model_knn_user.kneighbors(user_vector, n_neighbors=k + 1)

    # Process neighbors (exclude the user itself)
    neighbors, similarity_scores = indices.flatten()[1:], 1 - distances.flatten()[1:]

    # Compute weighted ratings and normalize
    similar_users_ratings = train_matrix[neighbors].toarray()
    weighted_ratings = similar_users_ratings.T @ similarity_scores
    normalization = similarity_scores.sum()
    mean_ratings = weighted_ratings / normalization

    # Filter out already rated movies
    mean_ratings[user_vector.indices] = 0

    # Rank movies by estimated ratings
    top_indices = np.argsort(mean_ratings)[::-1]
    if num_recommendations:
        top_indices = top_indices[:num_recommendations]

    # Map indices back to movie IDs and retrieve metadata
    recommendations = pd.DataFrame({
        'Movie_ID': [movie_index[idx] for idx in top_indices],
        'Estimated_Rating': mean_ratings[top_indices]
    })
    recommendations = recommendations.merge(movies, on='Movie_ID')[['Movie_ID', 'Name', 'Year', 'Estimated_Rating']]

    return recommendations

### 4. Model Testing (User-Based)
The function is tested with a sample user to generate personalized recommendations (User-Based).

In [39]:
# Step 4: Test the recommendation function
user_id_to_test = 305344  # User ID for which to generate recommendations
num_recommendations = 5

# Create dictionaries to map user IDs and movie IDs to matrix indices
user_index = {user_id: idx for idx, user_id in enumerate(user_item_matrix.index)}
movie_index = {idx: movie_id for idx, movie_id in enumerate(user_item_matrix.columns)}

try:
    # Generate recommendations for the specified user
    user_based_recommendations = recommend_movies_user_based(
        user_id=user_id_to_test,
        train_matrix=sparse_user_item,
        user_index=user_index,
        movie_index=movie_index,
        movies=movies,
        num_recommendations=num_recommendations,
        k=10
    )
    
    # Display recommendations
    print(f"Top {num_recommendations} movie recommendations for User {user_id_to_test}:")
    print(user_based_recommendations[['Name', 'Year', 'Estimated_Rating', 'Movie_ID']])
    
except ValueError as e:
    print(e)

Top 5 movie recommendations for User 305344:
             Name  Year  Estimated_Rating  Movie_ID
0  Doctor Zhivago  1965          4.211178      3153
1         Hackers  1995          3.834992      3680
2             Ray  2004          3.648459       886
3      The Rookie  1990          3.424174      2251
4        Sideways  2004          3.339992      3282


## Item-Based Collaborative Filtering

### 1. Data Preprocessing
- **Item-User Matrix Creation**: Transpose the user-item matrix $M$ to create a new matrix $M' = M^T$, where rows represent movies and columns represent users. Missing ratings are filled with zeros.
  $$ M'[i, u] = M[u, i] = r_{u,i} \in R$$
- **Sparse Matrix Conversion**: The dense matrix is converted to a sparse format to optimize memory usage.
  $$M'_{\text{sparse}} = \text{csr\_matrix}(M')$$

In [40]:
# Step 1: Transpose the user-item matrix
item_user_matrix = user_item_matrix.T

# Convert the DataFrame to a sparse matrix
sparse_item_user = csr_matrix(item_user_matrix.values)

### 2. Collaborative Filtering Model (Item-Based)
We use the `scikit-learn` library to implement a KNN-based model:

1. **Similarity Metric**: Cosine similarity is used to identify movies with similar rating patterns. The formula for cosine similarity between two movies $i$ and $j$ is:
   $$
   \text{sim}(i, j) = \frac{\vec r_i \cdot \vec r_j}{\|\vec r_i\| \cdot \|\vec r_j\|}
   $$
   Where:
   - $\vec r_i$ and $\vec r_j$ are the vectors of user ratings for movies $i$ and $j$ respectively (the $i$-th and $j$-th rows of the item-user matrix $M'$).
   - $\cdot$ represents the dot product.

2. **Nearest Neighbors**: The model identifies the top $k$ nearest neighbors for each movie $i \in I \setminus I_u$ based on similarity scores:
   $$\large
   \mathcal{N}_i^k = \operatorname*{arg\,max}_{I'_u \subseteq I_u \setminus \{i\} \land |I'_u| = k} \sum_{j \in I'_u} \text{sim}(i, j)
   $$
   Where:
   - $\mathcal N_i^k$ is the set of the $k$ movies most similar to movie $i$ among those rated by user $u$.
   - $I'_u$ is a subset of size $k$ of the items rated by user $u$ (it never contains $i$, since $i \notin I_u$).
   - $k = 10$ (default value in this implementation).

In [41]:
# Step 2: Build the item-based collaborative filtering model
model_knn_item = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=10)

# Create the sparse matrix for the item-based model
sparse_item_user = csr_matrix(item_user_matrix.values)

### 3. Recommendation Function (Item-Based)

In the item-based collaborative filtering approach, the recommendation function works as follows:

Predict ratings for the movies that the target user $u$ has not rated yet by aggregating the ratings that $u$ gave to the most similar movies among those already rated. For a given user $u$ and each item $i$ not yet rated by $u$, the predicted rating is calculated as:
   $$\large
   I^k_u = \{\hat{r}_{u,i} \mid i \in I \setminus I_u\} \quad \text{where} \quad
   \hat{r}_{u,i} = \frac{\sum_{j \in \mathcal N_i^k} \text{sim}(i, j) \cdot r_{u,j}}{\sum_{j \in \mathcal N_i^k} \text{sim}(i, j)}
   $$
   Where:
   - $\hat{r}_{u,i}$ is the predicted rating for user $u$ on item $i$.
   - $r_{u,j}$ is the rating of user $u$ for a similar item $j \in \mathcal N_i^k$.
   - $\text{sim}(i, j)$ is the similarity between items $i$ and $j$.
   - $\mathcal N_i^k$ is the set of the top $k$ most similar items to $i$.
   - $I^k_u$ is the set of predicted ratings for user $u$ on items not yet rated by $u$.

Thus, for each movie not rated by the target user, the model calculates a weighted average of the user's ratings for similar movies, weighted by their similarity scores. The candidates are then sorted and the top ones are returned.

For accuracy reasons, the candidates are ranked by the weighted sum $\sum_{j \in \mathcal N_i^k} \text{sim}(i, j) \cdot r_{u,j}$ and not by $\hat{r}_{u,i}$. Dividing by $\sum_{j \in \mathcal N_i^k} \text{sim}(i, j)$ removes the overall magnitude of the similarity scores from the ranking, which is not optimal because movies that are closer to the already rated ones should carry a higher weight. The normalized value $\hat{r}_{u,i}$ is still the one reported as the estimated rating.

This method leverages the user's own preferences and the similarity relationships between movies, providing relevant and personalized suggestions.
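
The difference between ranking by the weighted sum and ranking by the normalized estimate can be seen on two hypothetical candidates. In this sketch the similarity values, the ratings, and the candidate names `A` and `B` are made up for illustration.

```python
import numpy as np

# Hypothetical similarities between two unrated candidates and the k = 3 most
# similar movies the user already rated, plus the user's ratings of those movies
candidates = {
    "A": {"sims": np.array([0.9, 0.8, 0.7]), "ratings": np.array([4.0, 4.0, 4.0])},
    "B": {"sims": np.array([0.2, 0.1, 0.1]), "ratings": np.array([5.0, 5.0, 4.0])},
}

scores = {}
for name, c in candidates.items():
    weighted_sum = float(c["sims"] @ c["ratings"])    # ranking key described above
    estimate = weighted_sum / float(c["sims"].sum())  # normalized predicted rating
    scores[name] = (weighted_sum, estimate)

rank_by_sum = sorted(scores, key=lambda n: scores[n][0], reverse=True)
rank_by_estimate = sorted(scores, key=lambda n: scores[n][1], reverse=True)

print(scores)
print("ranked by weighted sum:", rank_by_sum)
print("ranked by estimate:    ", rank_by_estimate)
```

Candidate `B` has the higher normalized estimate (4.75 against 4.0) even though it is only weakly similar to anything the user rated, whereas the weighted sum puts `A` first (9.6 against 1.9). This is the effect that motivates ranking by the weighted sum.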

In [42]:
# Step 3: Recommendation Function (Item-Based)
from tqdm import tqdm

def recommend_movies_item_based(user_id, train_matrix, num_recommendations=5, k=10):
    """Recommend movies using item-based collaborative filtering for a given user."""
    if user_id not in user_item_matrix.index:
        raise ValueError("User ID not found in the dataset")

    # Get the user's ratings
    user_ratings = item_user_matrix.loc[:, user_id].to_numpy()
    rated_movies = np.where(user_ratings > 0)[0]  # Indices of movies the user has rated

    if len(rated_movies) == 0:
        raise ValueError("User has not rated any movies")

    # Initialize a dictionary to store weighted scores
    movie_scores = {}

    # Fit the model with the sparse item-user matrix limited to the rated movies
    sparse_item_user_u = train_matrix[rated_movies, :]
    model_knn_item.fit(sparse_item_user_u)

    # Get unseen movies
    not_rated_movies = item_user_matrix.index.difference(item_user_matrix.index[rated_movies])

    # Iterate over all movies not rated by the user
    for movie_id in tqdm(not_rated_movies, 
                         desc="Processing Movies", unit="movie"):
        # Get the vector for the movie
        movie_idx = item_user_matrix.index.get_loc(movie_id)
        movie_vector = train_matrix[movie_idx, :]

        # Find nearest neighbors for the movie
        distances, indices = model_knn_item.kneighbors(movie_vector, n_neighbors=k)

        # Calculate similarity scores
        similarity_scores = 1 - distances.flatten()

        # Get the indexes of the similar movies
        similar_movies = rated_movies[indices.flatten()]

        # Calculate the weighted average score for the movie using similarity scores
        user_ratings_for_similar = user_ratings[similar_movies]
        movie_scores[movie_id]   = [
            np.dot(similarity_scores, user_ratings_for_similar), # Weighted sum
            similarity_scores.sum() # Sum of similarity scores
        ]

    # Sort movies by aggregated score
    recommended_movies = sorted(movie_scores.items(), key=lambda x: x[1][0], reverse=True)[:num_recommendations]

    # Normalize the scores
    recommended_movies = [(movie_id, score[0] / score[1]) for movie_id, score in recommended_movies]

    # Map movie IDs to names
    recommended_movies_df = pd.DataFrame(sorted(recommended_movies, key=lambda x: x[1], reverse=True), columns=['Movie_ID', 'Estimated_Rating'])
    recommendations = recommended_movies_df.merge(movies, on='Movie_ID')[['Movie_ID', 'Name', 'Year', 'Estimated_Rating']]

    return recommendations

In [52]:
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd
from tqdm import tqdm

def recommend_movies_item_based(user_id, train_matrix, item_index, user_index, movies, num_recommendations=5, k=10):
    """
    Recommend movies using item-based collaborative filtering for a given user.
    This version ensures the same results as the original, but with optimized performance.
    """
    # Check that the user exists in the dataset
    if user_id not in user_index:
        raise ValueError("User ID not found in the dataset")

    # Get the user's internal index
    user_idx = user_index[user_id]

    # Retrieve the user's ratings as an array
    user_ratings = train_matrix[:, user_idx].toarray().flatten()
    rated_movie_indices = np.where(user_ratings > 0)[0]  # Indices of the movies rated by the user

    # Check that the user has rated at least one movie
    if len(rated_movie_indices) == 0:
        raise ValueError("User has not rated any movies")

    # Initialize a dictionary to store the weighted scores
    movie_scores = {}

    # KNN model used to find the most similar movies
    model_knn_item = NearestNeighbors(n_neighbors=k, metric='cosine')
    
    # Fit the model on the item-user matrix (restricted to the movies already rated)
    sparse_item_user_u = train_matrix[rated_movie_indices, :]
    model_knn_item.fit(sparse_item_user_u)
    
    # Find the movies not yet rated by the user
    not_rated_movie_indices = np.setdiff1d(np.arange(train_matrix.shape[0]), rated_movie_indices)

    # Iterate over the movies not rated by the user
    for movie_idx in tqdm(not_rated_movie_indices, desc="Processing Movies", unit="movie"):
        # Get the vector of the unrated movie
        movie_vector = train_matrix[movie_idx, :]

        # Find the rated movies most similar to the unrated movie
        distances, indices = model_knn_item.kneighbors(movie_vector, n_neighbors=k)

        # Compute the similarity scores
        similarity_scores = 1 - distances.flatten()  # Similarity is (1 - cosine distance)

        # Get the matrix indices of the similar movies that the user has rated
        similar_movies = rated_movie_indices[indices.flatten()]

        # Compute the weighted score for the movie
        user_ratings_for_similar = user_ratings[similar_movies]
        weighted_score = np.dot(similarity_scores, user_ratings_for_similar)
        sum_similarity_scores = similarity_scores.sum()

        if sum_similarity_scores > 0:
            movie_scores[movie_idx] = [weighted_score, sum_similarity_scores]

    # Sort the movies by weighted score and keep the best ones
    recommended_movie_indices = sorted(movie_scores.items(), key=lambda x: x[1][0], reverse=True)[:num_recommendations]

    # Convert the indices to movie IDs and normalize the scores
    recommended_movies = [(item_index[movie_idx], score[0] / score[1]) for movie_idx, score in recommended_movie_indices]

    # Build the recommendations DataFrame
    recommended_movies_df = pd.DataFrame(sorted(recommended_movies, key=lambda x: x[1], reverse=True), columns=['Movie_ID', 'Estimated_Rating'])
    recommendations = recommended_movies_df.merge(movies, on='Movie_ID')[['Movie_ID', 'Name', 'Year', 'Estimated_Rating']]

    return recommendations

### 4. Model Testing
The function is tested with a sample user to generate personalized recommendations.

In [43]:
# Step 4: Test the recommendation function
user_id_to_test = 774868  # Change as needed (e.g., 1331154)
num_recommendations = 5

try:
    item_based_recommendations = recommend_movies_item_based(user_id_to_test, sparse_item_user, num_recommendations=num_recommendations, k=10)
    print(f"Top {num_recommendations} movie recommendations for User {user_id_to_test} (Item-Based):")
    print(item_based_recommendations[['Name', 'Year', 'Estimated_Rating', 'Movie_ID']])
except ValueError as e:
    print(e)

Processing Movies: 100%|██████████| 1331/1331 [00:09<00:00, 145.36movie/s]

Top 5 movie recommendations for User 774868 (Item-Based):
                                                Name  Year  Estimated_Rating  \
0                          Finding Nemo (Widescreen)  2003          4.817077   
1  Pirates of the Caribbean: The Curse of the Bla...  2003          4.816355   
2      Lord of the Rings: The Fellowship of the Ring  2001          4.809336   
3                                    The Sixth Sense  1999          4.808670   
4                                    American Beauty  1999          4.761287   

   Movie_ID  
0      3962  
1      1905  
2      2452  
3      4306  
4       571  





In [53]:
# Step 4: Test the recommendation function (Item-Based)
user_id_to_test = 774868  # Change the user ID if needed
num_recommendations = 5

try:
    # Generate recommendations with the item-based method
    item_based_recommendations = recommend_movies_item_based(
        user_id=user_id_to_test,
        train_matrix=sparse_item_user,
        item_index=movie_index,
        user_index=user_index,
        movies=movies,
        num_recommendations=num_recommendations,
        k=10
    )

    # Print the recommendations
    print(f"Top {num_recommendations} movie recommendations for User {user_id_to_test} (Item-Based):")
    print(item_based_recommendations[['Name', 'Year', 'Estimated_Rating', 'Movie_ID']])

except ValueError as e:
    print(e)

Processing Movies: 100%|██████████| 1331/1331 [00:09<00:00, 146.46movie/s]

Top 5 movie recommendations for User 774868 (Item-Based):
                                                Name  Year  Estimated_Rating  \
0                          Finding Nemo (Widescreen)  2003          4.817077   
1  Pirates of the Caribbean: The Curse of the Bla...  2003          4.816355   
2      Lord of the Rings: The Fellowship of the Ring  2001          4.809336   
3                                    The Sixth Sense  1999          4.808670   
4                                    American Beauty  1999          4.761287   

   Movie_ID  
0      3962  
1      1905  
2      2452  
3      4306  
4       571  





### 5. Evaluation of the Model

To assess the effectiveness of the collaborative filtering approach, we hide part of a user's ratings and check whether the model recommends the hidden movies. The evaluation process, as implemented in the cells of this section, includes:

1. **Hold-Out of Relevant Movies**:
   - For the evaluated user, the movies rated higher than 3 are treated as relevant.
   - Their ratings are set to 0 in a copy of the rating matrix, so the model cannot see them.

2. **Recommendations and Metrics**:
   - The model generates the top $K$ recommendations for the user from the modified matrix.
   - **Precision @ K**, **Recall @ K**, and **Accuracy @ K** (computed as the Jaccard index of the recommended and relevant sets) compare the recommendations with the hidden relevant movies.
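
The evaluation functions in this section compare the set of recommended movies with the set of hidden relevant movies. The following minimal sketch shows the set-based metrics on their own; the helper name `overlap_metrics_at_k` and the example indices are hypothetical, and the third metric is labeled as a Jaccard index, which is what the `accuracy_at_k` expression in the notebook computes.

```python
def overlap_metrics_at_k(recommended, relevant):
    """Precision@K, Recall@K and Jaccard overlap between two collections of item indices."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)      # recommended and relevant
    union = len(recommended | relevant)     # recommended or relevant
    return {
        "precision_at_k": hits / len(recommended) if recommended else 0.0,
        "recall_at_k": hits / len(relevant) if relevant else 0.0,
        "jaccard_at_k": hits / union if union else 0.0,
    }

# Hypothetical example: 5 recommended items, 4 hidden relevant items, 2 in common
metrics = overlap_metrics_at_k(recommended=[1, 2, 3, 4, 5], relevant=[2, 5, 8, 9])
print(metrics)
```

With 2 hits, precision is 2/5, recall is 2/4, and the Jaccard index is 2/7. Both collections must use the same identifiers (either matrix indices or movie IDs) for the intersection to be meaningful.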


In [17]:
# Count the number of ratings for each user
user_ratings_count = user_item_matrix.apply(lambda row: (row > 0).sum(), axis=1)

# Sort the users by number of ratings, in descending order
top_users = user_ratings_count.sort_values(ascending=False)

# Show the 10 users with the most ratings
print(top_users.head(10))

User_ID
305344     1344
387418     1339
2439493    1324
2118461    1305
1664010    1257
716173     1169
1314869    1132
2056022    1072
1852040    1053
2606799    1039
dtype: int64


In [None]:
from sklearn.metrics import precision_score, recall_score

def evaluate_user_based_recommendation(user_id, train_matrix, user_index, movie_index, movies, num_recommendations=5, k=10):
    """
    Evaluate the recommendation model with precision, recall and accuracy @ K.
    Accuracy @ K is computed as the Jaccard index of the recommended and relevant sets.
    """
    # Check that the user exists in the dataset
    if user_id not in user_index:
        raise ValueError("User ID not found in the dataset")

    # Get the user's internal index
    user_idx = user_index[user_id]
    user_ratings = train_matrix[user_idx].toarray().flatten()  # User's ratings as an array
    rated_movie_indices = np.where(user_ratings > 0)[0]  # Indices of the movies rated by the user

    # Treat the movies rated higher than 3 as "relevant"
    relevant_movie_indices = rated_movie_indices[user_ratings[rated_movie_indices] > 3]

    # Hide the ratings of the relevant movies in the test matrix
    test_matrix = train_matrix.copy()
    test_matrix[user_idx, relevant_movie_indices] = 0

    # Generate recommendations for the user
    recommendations = recommend_movies_user_based(user_id, test_matrix, user_index, movie_index, movies, num_recommendations, k)

    # Get the IDs of the recommended movies
    recommended_movie_ids = recommendations['Movie_ID'].tolist()

    # Keep only the IDs that appear as keys of movie_index.
    # Note: the mapping passed as movie_index in this notebook goes from column index
    # to Movie_ID, so the lookup below interprets each recommended Movie_ID as a column
    # index. An inverse mapping (Movie_ID -> column index) would be needed for the
    # result to be directly comparable with relevant_movie_indices.
    recommended_movie_indices = [
        movie_index[movie_id] for movie_id in recommended_movie_ids if movie_id in movie_index
    ]

    # Compute precision, recall and accuracy @ K
    true_positives = len(set(recommended_movie_indices) & set(relevant_movie_indices))  # Recommended and relevant movies
    precision_at_k = true_positives / len(recommended_movie_indices) if len(recommended_movie_indices) > 0 else 0
    recall_at_k = true_positives / len(relevant_movie_indices) if len(relevant_movie_indices) > 0 else 0
    accuracy_at_k = true_positives / len(set(recommended_movie_indices) | set(relevant_movie_indices)) if len(set(recommended_movie_indices) | set(relevant_movie_indices)) > 0 else 0

    return {
        "precision_at_k": precision_at_k,
        "recall_at_k": recall_at_k,
        "accuracy_at_k": accuracy_at_k,
    }

# Test the evaluation function
user_id = 387418  # ID of an existing user
metrics = evaluate_user_based_recommendation(
    user_id=user_id,
    train_matrix=sparse_user_item,
    user_index=user_index,
    movie_index=movie_index,
    movies=movies,
    num_recommendations=1000
)

print(f"Precision @ K: {metrics['precision_at_k']}")
print(f"Recall @ K: {metrics['recall_at_k']}")
print(f"Accuracy @ K: {metrics['accuracy_at_k']}")

Precision @ K: 0.011904761904761904
Recall @ K: 0.09259259259259259
Accuracy @ K: 0.010660980810234541


In [66]:
from sklearn.metrics import precision_score, recall_score
import numpy as np

def evaluate_item_based_recommendation(user_id, train_matrix, user_index, item_index, movies, num_recommendations=5, k=10):
    """
    Evaluate the item-based recommendation model with precision, recall and accuracy @ K.
    Accuracy @ K is computed as the Jaccard index of the recommended and relevant sets.
    """
    # Check that the user exists in the dataset
    if user_id not in user_index:
        raise ValueError("User ID not found in the dataset")

    # Get the user's internal index
    user_idx = user_index[user_id]
    user_ratings = train_matrix[:, user_idx].toarray().flatten()  # User's ratings as an array
    rated_movie_indices = np.where(user_ratings > 0)[0]  # Indices of the movies rated by the user

    # Treat the movies rated higher than 3 as "relevant"
    relevant_movie_indices = rated_movie_indices[user_ratings[rated_movie_indices] > 3]

    # Hide the ratings of the relevant movies in the test matrix
    test_matrix = train_matrix.copy()
    test_matrix[relevant_movie_indices, user_idx] = 0

    # Generate recommendations for the user
    recommendations = recommend_movies_item_based(user_id, test_matrix, item_index, user_index, movies, num_recommendations, k)

    # Get the IDs of the recommended movies
    recommended_movie_ids = recommendations['Movie_ID'].tolist()

    # Keep only the IDs that appear as keys of item_index.
    # Note: the mapping passed as item_index in this notebook goes from row index
    # to Movie_ID, so the lookup below interprets each recommended Movie_ID as a row
    # index. An inverse mapping (Movie_ID -> row index) would be needed for the
    # result to be directly comparable with relevant_movie_indices.
    recommended_movie_indices = [
        item_index[movie_id] for movie_id in recommended_movie_ids if movie_id in item_index
    ]

    # Compute precision, recall and accuracy @ K
    true_positives = len(set(recommended_movie_indices) & set(relevant_movie_indices))  # Recommended and relevant movies
    precision_at_k = true_positives / len(recommended_movie_indices) if len(recommended_movie_indices) > 0 else 0
    recall_at_k = true_positives / len(relevant_movie_indices) if len(relevant_movie_indices) > 0 else 0
    accuracy_at_k = true_positives / len(set(recommended_movie_indices) | set(relevant_movie_indices)) if len(set(recommended_movie_indices) | set(relevant_movie_indices)) > 0 else 0

    return {
        "precision_at_k": precision_at_k,
        "recall_at_k": recall_at_k,
        "accuracy_at_k": accuracy_at_k,
    }

# Test the evaluation function
user_id = 387418  # ID of an existing user
metrics = evaluate_item_based_recommendation(
    user_id=user_id,
    train_matrix=sparse_item_user,
    user_index=user_index,
    item_index=movie_index,
    movies=movies,
    num_recommendations=1000
)

print(f"Precision @ K: {metrics['precision_at_k']}")
print(f"Recall @ K: {metrics['recall_at_k']}")
print(f"Accuracy @ K: {metrics['accuracy_at_k']}")

Processing Movies: 100%|██████████| 174/174 [01:18<00:00,  2.22movie/s]

Precision @ K: 0.09090909090909091
Recall @ K: 0.024539877300613498
Accuracy @ K: 0.019704433497536946





## Conclusions

This notebook demonstrated the implementation of a collaborative filtering approach for recommending movies. Key takeaways include:

- Both the user-based and the item-based models generate personalized recommendations for a sample user from similarity information alone.
- **Precision, recall, and accuracy at K**, measured on a single user's hidden high ratings, give only a first indication of recommendation quality; the values obtained here are low for both models.
- While collaborative filtering is powerful, it faces challenges such as:
  - **Cold Start Problem**: Difficulty in recommending movies for new users or items.
  - **Data Sparsity**: Limited interactions in the dataset can affect similarity computations.

### Future Improvements

To address the limitations, potential enhancements include:

- Implementing hybrid recommendation systems that combine collaborative and content-based filtering.
- Exploring matrix factorization techniques (e.g., Singular Value Decomposition).
- Integrating deep learning-based recommendation methods.

This collaborative filtering approach forms a solid foundation for building scalable and effective recommendation systems.