# Collaborative Neighborhood-Based Filtering for Movie Recommendations

## Introduction

Collaborative neighborhood-based filtering is a popular approach in recommendation systems. It uses user-item interactions to predict a user's preferences from two sources: the preferences of users who are similar to the target user, and the items that the target user has already rated. There are two main types of collaborative filtering:

- **User-Based Collaborative Filtering**: Recommends items based on the preferences of users who are similar to the target user.
- **Item-Based Collaborative Filtering**: Suggests items that are similar to those the user has already rated positively.

In this notebook, we will explore both approaches for movie recommendations.

## Dataset Description
We use two datasets for this analysis:
1. **Movies Dataset**:
   - `Movie_ID`: Unique identifier for each movie.
   - `Name`: Title of the movie.
   - `Year`: Release year of the movie.

2. **Ratings Dataset**:
   - `User_ID`: Unique identifier for each user.
   - `Movie_ID`: Identifier for the movie rated.
   - `Rating`: Numeric rating given by the user (e.g., on a scale of 1-5).

In [1]:
import sys
import numpy as np
import pandas as pd
from tqdm import tqdm
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Load the datasets
movies_file = "/kaggle/input/netflix-movie-rating-dataset/Netflix_Dataset_Movie.csv"
ratings_file = "/kaggle/input/netflix-movie-rating-dataset/Netflix_Dataset_Rating.csv"

# Read the CSV files into DataFrames
ratings = pd.read_csv(ratings_file)  # Columns: User_ID, Rating, Movie_ID
movies  = pd.read_csv(movies_file)    # Columns: Movie_ID, Year, Name

### Formal Definition

- $U = \{u_1, u_2, \dots, u_n\}$ is the set of users.
- $U_i = \{u \in U \mid r_{u,i} \neq 0\}$ is the set of users who have rated item $i$
- $I = \{i_1, i_2, \dots, i_m\}$ is the set of items.
- $I_u = \{i \in I \mid r_{u,i} \neq 0\}$ is the set of items rated by user $u$
- $R = \{0, 1, \dots, 5\}$ or $R = [0, 1]$ is the set of possible rating values, depending on the rating scale.
- $r_{u,i}$ is the rating given by user $u$ for item $i$ (equal to 0 if not rated).
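
The definitions above can be made concrete on a tiny example. The sketch below is illustrative only: the dictionary `ratings_toy` and the helper names `r`, `users_who_rated` and `items_rated_by` are hypothetical and are not used elsewhere in this notebook.

```python
# Toy ratings stored as (user, item) -> rating.
# A missing pair means "not rated", i.e. r_{u,i} = 0.
ratings_toy = {
    ("u1", "i1"): 5, ("u1", "i2"): 3,
    ("u2", "i1"): 4, ("u2", "i3"): 2,
    ("u3", "i2"): 1,
}

def r(u, i):
    """Rating r_{u,i}, equal to 0 when user u has not rated item i."""
    return ratings_toy.get((u, i), 0)

# U and I: all users and all items that appear in the toy data.
U = sorted({u for u, _ in ratings_toy})
I = sorted({i for _, i in ratings_toy})

def users_who_rated(i):
    """U_i: the users with a non-zero rating for item i."""
    return {u for u in U if r(u, i) != 0}

def items_rated_by(u):
    """I_u: the items with a non-zero rating from user u."""
    return {i for i in I if r(u, i) != 0}

print("U_i for i1:", sorted(users_who_rated("i1")))
print("I_u for u1:", sorted(items_rated_by("u1")))
```

Treating 0 as "not rated" only works because valid ratings start at 1, which is the convention the rest of the notebook follows.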

## User-Based Collaborative Filtering

### 1. Data Preprocessing
- **User-Item Matrix Creation**: Convert the ratings dataset into a user-item matrix, where rows represent users and columns represent movies. Missing ratings are filled with zeros. Each rating is represented by a number from 1 to 5.
  $$ M[u, i] = r_{u,i} \in R$$
  Where:
  - $u \in U$
  - $i \in I$.

- **Sparse Matrix Conversion**: The dense matrix is converted to a sparse format for memory optimization:
  $$M_{\text{sparse}} = \text{sparse}(M)$$
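
As a small illustration of these two steps, the sketch below builds a user-item matrix from a hand-made ratings table and converts it to CSR format. The `toy_ratings` data is invented for the example; only the column names match the ratings dataset described earlier.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical ratings table with the same columns as the ratings dataset.
toy_ratings = pd.DataFrame({
    "User_ID":  [1, 1, 2, 3, 3],
    "Movie_ID": [10, 20, 10, 20, 30],
    "Rating":   [5, 3, 4, 2, 1],
})

# Rows are users, columns are movies; unrated pairs become 0.
toy_matrix = toy_ratings.pivot(
    index="User_ID", columns="Movie_ID", values="Rating"
).fillna(0)

# CSR keeps only the non-zero entries in memory.
toy_sparse = csr_matrix(toy_matrix.values)

n_cells = toy_sparse.shape[0] * toy_sparse.shape[1]
density = toy_sparse.nnz / n_cells

print(toy_matrix)
print(f"stored entries: {toy_sparse.nnz} of {n_cells} (density {density:.2f})")
```

On real rating data the density is usually far lower than in this toy case, which is where the sparse format pays off.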

In [2]:
# Step 1: Create a user-item matrix
user_item_matrix = ratings.pivot(index='User_ID', columns='Movie_ID', values='Rating')

# Fill missing values with 0, which marks a movie as "not rated"
# (some algorithms keep NaN instead; the methods below expect 0)
user_item_matrix.fillna(0, inplace=True)

### 2. Collaborative Filtering Model
We use the `scikit-learn` library to implement a KNN-based model:

1. **Similarity Metric**: Cosine similarity is used to identify users with similar rating patterns. The formula for cosine similarity between two users $u$ and $v$ is:
   $$
   \text{sim}(u, v) = \frac{\vec r_u \cdot \vec r_v}{\|\vec r_u\| \cdot \|\vec r_v\|}
   $$
   Where:
   - $\vec r_u$ and $\vec r_v$ are the rating vectors of users $u$ and $v$ respectively, i.e. the $u$-th and $v$-th rows of the user-item matrix.
   - $\cdot$ represents the dot product.

2. **Nearest Neighbors**: The model identifies the top $k$ nearest neighbors for each user, based on the similarity scores:
   $$\large
   \mathcal{N}_u^k = \operatorname*{arg\,max}_{U' \subseteq U \setminus \{u\} \land |U'| = k} \sum_{v \in U'} \text{sim}(u, v)
   $$
   Where:
   - $\mathcal N_u^k$ is the set of the top $k$ nearest neighbors for user $u$.
   - $U'$ is a subset of users excluding $u$.
   - $k = 10$ (default value in this implementation).
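
The two definitions above can be checked by hand on a small matrix. The sketch below uses plain NumPy instead of `NearestNeighbors`, purely for illustration; `cosine_similarity_matrix`, `top_k_neighbors` and `M_toy` are hypothetical names. Because the objective is a sum of individual similarities, picking the $k$ users with the highest similarity is the same as maximizing over all subsets of size $k$.

```python
import numpy as np

def cosine_similarity_matrix(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # rows with no ratings get similarity 0
    unit = M / norms
    return unit @ unit.T

def top_k_neighbors(M, u, k):
    """Indices and similarities of the k rows most similar to row u."""
    sims = cosine_similarity_matrix(M)[u].copy()
    sims[u] = -np.inf  # exclude the user itself
    order = np.argsort(-sims, kind="stable")[:k]
    return order, sims[order]

# Toy user-item matrix: 4 users (rows) x 4 movies (columns), 0 = not rated.
M_toy = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

neighbors, sims = top_k_neighbors(M_toy, u=0, k=2)
print("neighbors of user 0:", neighbors)
print("similarities:", np.round(sims, 3))
```

With `metric='cosine'`, scikit-learn reports the cosine distance, which is one minus this similarity; that is why the recommendation code converts distances back with `1 - distances`.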

In [3]:
# Step 2: Build the collaborative filtering model
model_knn_user = NearestNeighbors(metric='cosine', algorithm='brute')

### 3. Recommendation Function (User-Based)

In the user-based collaborative filtering approach, the recommendation function operates as follows:

1. **Get all relevant items**: The set of items rated by the top $k$ nearest neighbors for user $u$:
   $$\large
   I^k = \{i \in I \setminus I_u \mid \exists v \in \mathcal N_u^k : r_{v,i} \neq 0\}
   $$
   Where:
   - $\mathcal N_u^k$ is the set of the top $k$ nearest neighbors for user $u$.
   - $I^k$ is the set of items rated by the top $k$ nearest neighbors for user $u$ and not already rated by $u$.
   - $k = 10$ (default value in this implementation).

2. **Prediction Formula**: For a given user $u$, the set of predicted ratings for items $i \in I^k$ is calculated as:
   $$\large
   I^k_u = \{\hat{r}_{u,i} \mid i \in I^k\} \quad \text{where} \quad
   \hat{r}_{u,i} = \frac{\sum_{v \in \mathcal N_u^k} \text{sim}(u, v) \cdot r_{v,i}}{\sum_{v \in \mathcal N_u^k} \text{sim}(u, v)}
   $$
   Where:
   - $\hat{r}_{u,i}$ is the predicted rating for user $u$ on item $i$.
   - $r_{v,i}$ is the rating of neighbor $v \in \mathcal N_u^k$ for item $i$.
   - $\text{sim}(u, v)$ is the similarity between users $u$ and $v$.

This combination of steps allows the model to efficiently generate user-based collaborative filtering recommendations using nearest neighbors. This approach works well in cases where users with similar preferences exist in the dataset. For example, if two users have rated several movies similarly, the model assumes they will likely share preferences for other movies as well.
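
To make the prediction formula tangible, here is a minimal worked example with two neighbors and three movies. The function name `predict_user_based` and the numbers are made up for the illustration; the arithmetic follows the formula above, with 0 standing for "not rated".

```python
import numpy as np

def predict_user_based(neighbor_ratings, similarities):
    """
    Similarity-weighted average of the neighbors' ratings.

    neighbor_ratings: array of shape (k, n_items), 0 = not rated
    similarities:     array of shape (k,), sim(u, v) for each neighbor v
    """
    neighbor_ratings = np.asarray(neighbor_ratings, dtype=float)
    similarities = np.asarray(similarities, dtype=float)
    total = similarities.sum()
    if total == 0:
        # No usable neighbor: fall back to an all-zero prediction.
        return np.zeros(neighbor_ratings.shape[1])
    return (similarities @ neighbor_ratings) / total

neighbor_ratings = [
    [5, 0, 4],  # neighbor v1
    [3, 2, 0],  # neighbor v2
]
similarities = [0.8, 0.2]

predicted = predict_user_based(neighbor_ratings, similarities)
print(predicted)
```

For the first movie this gives $(0.8 \cdot 5 + 0.2 \cdot 3) / 1.0 = 4.6$. Note that a neighbor who has not rated a movie contributes a 0 to the numerator while its similarity still counts in the denominator, so movies rated by only a few neighbors are pulled toward 0.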

In [4]:
# Step 3: Define a rating and a recommendation function
def recommend_movies_user_based(user_id, user_item_matrix, user_index, item_index, num_recommendations=None, k=10):
    """
    Recommend movies using user-based collaborative filtering.

    Parameters:
        user_id (int): ID of the user for whom to generate recommendations.
        user_item_matrix (pd.DataFrame): User-item matrix with ratings (rows: users, columns: movies).
        user_index (dict): Maps user IDs to row indices of user_item_matrix.
        item_index (dict): Maps column indices of user_item_matrix to movie IDs.
        num_recommendations (int, optional): Number of recommendations to return. Defaults to None (all).
        k (int, optional): Number of nearest neighbors to consider. Defaults to 10.

    Note:
        Relies on the notebook-level `model_knn_user` (KNN model) and `movies`
        (DataFrame with columns Movie_ID, Year, Name) defined in earlier cells.

    Returns:
        pd.DataFrame: DataFrame with recommended movies (columns: ['Movie_ID', 'Name', 'Year', 'Estimated_Rating']).
    """
    if user_id not in user_index:
        raise ValueError(f"User ID {user_id} not found in the dataset")
    
    # Transform into a sparse matrix
    train_matrix = csr_matrix(user_item_matrix.values)

    # Identify user index and retrieve their vector
    user_idx = user_index[user_id]
    user_vector = train_matrix[user_idx]

    # Fit the KNN model and find neighbors
    model_knn_user.fit(train_matrix)
    distances, indices = model_knn_user.kneighbors(user_vector, n_neighbors=k + 1)

    # Process neighbors (exclude the user itself)
    neighbors, similarity_scores = indices.flatten()[1:], 1 - distances.flatten()[1:]

    # Compute weighted ratings and normalize
    similar_users_ratings = train_matrix[neighbors].toarray()
    weighted_ratings = similar_users_ratings.T @ similarity_scores
    normalization = similarity_scores.sum()
    mean_ratings = weighted_ratings / normalization

    # Filter out already rated movies
    mean_ratings[user_vector.indices] = 0

    # Rank movies by estimated ratings
    top_indices = np.argsort(mean_ratings)[::-1]
    if num_recommendations:
        top_indices = top_indices[:num_recommendations]

    # Map indices back to movie IDs and retrieve metadata
    recommendations = pd.DataFrame({
        'Movie_ID': [item_index[idx] for idx in top_indices],
        'Estimated_Rating': mean_ratings[top_indices]
    })
    recommendations = recommendations.merge(movies, on='Movie_ID')[['Movie_ID', 'Name', 'Year', 'Estimated_Rating']]

    return recommendations

### 4. Model Testing (User-Based)
The function is tested with a sample user to generate personalized recommendations (User-Based).

In [5]:
# Step 4: Test the recommendation function
user_id_to_test = 305344  # User ID for which to generate recommendations
num_recommendations = 5

# Create dictionaries to map user IDs and movie IDs to matrix indices
user_index = {user_id: idx for idx, user_id in enumerate(user_item_matrix.index)}
movie_index = {idx: movie_id for idx, movie_id in enumerate(user_item_matrix.columns)}

# Generate recommendations for the specified user
user_based_recommendations = recommend_movies_user_based(
    user_id=user_id_to_test,
    user_item_matrix=user_item_matrix,
    user_index=user_index,
    item_index=movie_index,
    num_recommendations=num_recommendations,
    k=10
)

# Display recommendations
print(f"Top {num_recommendations} movie recommendations for User {user_id_to_test}:")
print(user_based_recommendations[['Name', 'Year', 'Estimated_Rating', 'Movie_ID']])

Top 5 movie recommendations for User 305344:
             Name  Year  Estimated_Rating  Movie_ID
0  Doctor Zhivago  1965          4.211178      3153
1         Hackers  1995          3.834992      3680
2             Ray  2004          3.648459       886
3      The Rookie  1990          3.424174      2251
4        Sideways  2004          3.339992      3282


## Item-Based Collaborative Filtering

### 1. Data Preprocessing
- **Item-User Matrix Creation**: Transpose the user-item matrix $M$ to create a new matrix $M' = M^T$, where rows represent movies and columns represent users. Missing ratings are filled with zeros.
  $$ M'[i, u] = M[u, i] = r_{u,i} \in R$$
- **Sparse Matrix Conversion**: The dense matrix is converted to a sparse format to optimize memory usage.
  $$M'_{\text{sparse}} = \text{csr\_matrix}(M')$$

### 2. Collaborative Filtering Model (Item-Based)
We use the `scikit-learn` library to implement a KNN-based model:

1. **Similarity Metric**: Cosine similarity is used to identify movies with similar rating patterns. The formula for cosine similarity between two movies $i$ and $j$ is:
   $$
   \text{sim}(i, j) = \frac{\vec r_i \cdot \vec r_j}{\|\vec r_i\| \cdot \|\vec r_j\|}
   $$
   Where:
   - $\vec r_i$ and $\vec r_j$ are column vectors of user ratings for movies $i$ and $j$ respectively (the $i$-th and $j$-th rows of the item-user matrix $M'$).
   - $\cdot$ represents the dot product.

2. **Nearest Neighbors**: The model identifies the top $k$ nearest neighbors for each movie $i \in I \setminus I_u$ based on similarity scores:
   $$\large
   \mathcal{N}_i^k = \operatorname*{arg\,max}_{I'_u \subseteq I_u \setminus \{i\} \land |I'_u| = k} \sum_{j \in I'_u} \text{sim}(i, j)
   $$
   Where:
   - $\mathcal N_i^k$ is the set of the top $k$ nearest movies (in the user $u$ rated movies set) for movie $i$.
   - $I'_u$ is a subset of size $k$ of the items rated by user $u$, excluding movie $i$ itself.
   - $k = 10$ (default value in this implementation).
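
The key difference from the user-based case is that neighbors are searched only among the movies the target user has rated. The sketch below illustrates this restriction with plain NumPy on a toy item-user matrix; `cosine`, `nearest_rated_items` and `item_user_toy` are hypothetical names introduced for the example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two rating vectors (0 if either is empty)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def nearest_rated_items(item_user, u, i, k):
    """Top-k items rated by user u that are most similar to item i."""
    n_items = item_user.shape[0]
    # Candidate neighbors: items rated by u, excluding i itself.
    rated = [j for j in range(n_items) if item_user[j, u] != 0 and j != i]
    sims = [(j, cosine(item_user[i], item_user[j])) for j in rated]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return sims[:k]

# Toy item-user matrix M': 4 movies (rows) x 3 users (columns), 0 = not rated.
item_user_toy = np.array([
    [5, 4, 0],  # movie 0
    [3, 0, 1],  # movie 1
    [0, 5, 4],  # movie 2 (not rated by user 0)
    [1, 4, 5],  # movie 3
], dtype=float)

neighbors_of_2 = nearest_rated_items(item_user_toy, u=0, i=2, k=2)
for j, s in neighbors_of_2:
    print(f"movie {j}: sim = {s:.3f}")
```

Here user 0 has rated movies 0, 1 and 3, so only those three are candidates when looking for the neighbors of movie 2.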

### 3. Recommendation Function (Item-Based)

In the item-based collaborative filtering approach, the recommendation function works as follows:

Predict ratings for the target user $u$ on movies they have not yet rated by aggregating the ratings that $u$ gave to similar movies. For a given user $u$, the set of predicted ratings over the items not yet rated by $u$ is calculated as:
   $$\large
   I^k_u = \{\hat{r}_{u,i} \mid i \in I \setminus I_u\} \quad \text{where} \quad
   \hat{r}_{u,i} = \frac{\sum_{j \in \mathcal N_i^k} \text{sim}(i, j) \cdot r_{u,j}}{\sum_{j \in \mathcal N_i^k} \text{sim}(i, j)}
   $$
   Where:
   - $\hat{r}_{u,i}$ is the predicted rating for user $u$ on item $i$.
   - $r_{u,j}$ is the rating of user $u$ for a similar item $j \in \mathcal N_i^k$.
   - $\text{sim}(i, j)$ is the similarity between items $i$ and $j$.
   - $\mathcal N_i^k$ is the set of the top $k$ most similar items to $i$.
   - $I^k_u$ is the set of predicted ratings for user $u$ on items not yet rated by $u$.

Thus, for each movie not rated by the target user, the model calculates a weighted average of similar movies' ratings based on their similarity scores. The predicted ratings are then sorted and the top ones are returned.

For accuracy reasons, the candidate movies are ranked by the weighted sum $\sum_{j \in \mathcal N_i^k} \text{sim}(i, j) \cdot r_{u,j}$ and not by $\hat{r}_{u,i}$. This is simply because dividing by $\sum_{j \in \mathcal N_i^k} \text{sim}(i, j)$ cancels out the magnitude of the similarity scores, which is not optimal: we want movies that are closer to the already rated ones to carry a higher weight in the ranking. The top candidates selected this way are then reported with their estimated rating $\hat{r}_{u,i}$.

This method leverages the user's own preferences and the similarity relationships between movies, providing relevant and personalized suggestions.
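
The effect of ranking by the weighted sum rather than by the normalized estimate is easiest to see on two hand-made candidates. In the sketch below, the names `item_scores` and `candidates` and all numbers are hypothetical: candidate A is very similar to movies the user rated 4, while candidate B is only weakly similar to movies the user rated 5.

```python
import numpy as np

def item_scores(similarities, user_ratings):
    """Return (weighted sum, normalized estimate) for one candidate movie."""
    s = np.asarray(similarities, dtype=float)
    r = np.asarray(user_ratings, dtype=float)
    weighted = float(s @ r)      # sum_j sim(i, j) * r_{u,j}
    total = float(s.sum())       # sum_j sim(i, j)
    estimate = weighted / total if total > 0 else 0.0
    return weighted, estimate

candidates = {
    "A": ([0.9, 0.8], [4, 4]),   # strong similarity, good ratings
    "B": ([0.1, 0.1], [5, 5]),   # weak similarity, top ratings
}
scores = {name: item_scores(*args) for name, args in candidates.items()}

by_weighted = sorted(scores, key=lambda n: scores[n][0], reverse=True)
by_estimate = sorted(scores, key=lambda n: scores[n][1], reverse=True)

print("ranked by weighted sum:", by_weighted)
print("ranked by estimate:    ", by_estimate)
```

Ranking by the weighted sum favors A, whose evidence comes from close neighbors, whereas ranking by the estimate alone would put B first even though its neighbors are barely related to it.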

In [6]:
# Step 3: Recommendation Function (Item-Based)

def recommend_movies_item_based(user_id, user_item_matrix, item_index, user_index, num_recommendations=None, k=10):
    """
    Recommend movies using item-based collaborative filtering for a given user.

    Args:
        user_id (int): The ID of the user for whom to make recommendations.
        user_item_matrix (pd.DataFrame): User-item matrix with ratings (rows: users, columns: movies).
        item_index (dict): Maps matrix indices of movies to movie IDs.
        user_index (dict): Maps user IDs to their index in the matrix.
        num_recommendations (int, optional): The number of movies to recommend. Defaults to None (all).
        k (int): The number of neighbors to consider in KNN. Default is 10.

    Returns:
        DataFrame: A DataFrame of recommended movies with their estimated ratings.

    Raises:
        ValueError: If the user ID is not found in the dataset or if the user has not rated any movies.
    """

    # Transform into a sparse item-user matrix (rows: movies, columns: users)
    train_matrix = csr_matrix(user_item_matrix.T.values)
    
    # Check if the user exists in the dataset
    if user_id not in user_index:
        raise ValueError("User ID not found in the dataset")

    # Get the internal index of the user
    user_idx = user_index[user_id]
    # Get the user's ratings as a flattened numpy array
    user_ratings = train_matrix[:, user_idx].toarray().flatten()
    # Get the indices of movies that the user has rated
    rated_movie_indices = np.where(user_ratings > 0)[0]

    # Check if the user has rated any movies
    if len(rated_movie_indices) == 0:
        raise ValueError("User has not rated any movies")

    # Initialize the KNN model
    model_knn_item = NearestNeighbors(n_neighbors=k, metric='cosine')

    # Train the model using only the rated items
    sparse_item_user_u = train_matrix[rated_movie_indices, :]
    model_knn_item.fit(sparse_item_user_u)

    # Find movies that the user has not rated
    not_rated_movie_indices = np.where(user_ratings == 0)[0]

    movie_scores = {} # Initialize the movie score dictionary

    # Iterate over movies that the user has not rated
    for movie_idx in tqdm(not_rated_movie_indices, desc="Processing Movies", unit="movie"):
        
        # Get the feature vector for the current movie
        movie_vector = train_matrix[movie_idx, :]
        # Find the nearest rated neighbors of the movie (at most as many as the user has rated)
        distances, indices = model_knn_item.kneighbors(movie_vector, n_neighbors=min(k, len(rated_movie_indices)))
        # Calculate the similarity scores (1 - cosine distance)
        similarity_scores = 1 - distances.flatten()
        # Get the indices of similar movies that the user has rated
        similar_movies = rated_movie_indices[indices.flatten()]
        # Get the ratings of similar movies from the user
        user_ratings_for_similar = user_ratings[similar_movies]
        
        # Calculate the weighted score by multiplying similarity scores with user ratings
        weighted_score = np.dot(similarity_scores, user_ratings_for_similar)
        # Sum of the similarity scores
        sum_similarity_scores = np.sum(similarity_scores)

        # If sum of the similarity score is > 0, add the weighted score and sum to the dictionary
        if sum_similarity_scores > 0:
            movie_scores[movie_idx] = [weighted_score, sum_similarity_scores]

    # Sort the movies based on their weighted score, and take the top num_recommendations
    recommended_movie_indices = sorted(movie_scores.items(), key=lambda x: x[1][0], reverse=True)

    # Get the top movies (if num_recommendations is specified)
    if num_recommendations:
        recommended_movie_indices = recommended_movie_indices[:num_recommendations]

    # Convert the movie indices to movie IDs and calculate the estimated ratings
    recommended_movies = [(item_index[movie_idx], score[0] / score[1]) for movie_idx, score in recommended_movie_indices]
    
    # Create a pandas DataFrame from the recommended movies
    recommended_movies_df = pd.DataFrame(sorted(recommended_movies, key=lambda x: x[1], reverse=True), columns=['Movie_ID', 'Estimated_Rating'])
    # Merge the recommended movie DataFrame with the movie DataFrame to get movie details
    recommendations = recommended_movies_df.merge(movies, on='Movie_ID')[['Movie_ID', 'Name', 'Year', 'Estimated_Rating']]

    return recommendations

### 4. Model Testing
The function is tested with a sample user to generate personalized recommendations.

In [7]:
# Step 4: Test the recommendation function (Item-Based)
user_id_to_test = 774868
num_recommendations = 5

# Generate recommendations using the item-based method
item_based_recommendations = recommend_movies_item_based(
    user_id=user_id_to_test,
    user_item_matrix=user_item_matrix,
    item_index=movie_index,
    user_index=user_index,
    num_recommendations=num_recommendations,
    k=10
)

# Print the recommendations
print(f"Top {num_recommendations} movie recommendations for User {user_id_to_test} (Item-Based):")
print(item_based_recommendations[['Name', 'Year', 'Estimated_Rating', 'Movie_ID']])

Processing Movies: 100%|██████████| 1331/1331 [00:11<00:00, 118.53movie/s]


Top 5 movie recommendations for User 774868 (Item-Based):
                                                Name  Year  Estimated_Rating  \
0                          Finding Nemo (Widescreen)  2003          4.817077   
1  Pirates of the Caribbean: The Curse of the Bla...  2003          4.816355   
2      Lord of the Rings: The Fellowship of the Ring  2001          4.809336   
3                                    The Sixth Sense  1999          4.808670   
4                                    American Beauty  1999          4.761287   

   Movie_ID  
0      3962  
1      1905  
2      2452  
3      4306  
4       571  


### 5. Evaluation of the Models

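To assess the effectiveness of the collaborative filtering approach, we hold out part of each sampled user's ratings and compare them with the model's predictions. The evaluation process includes:

1. **Train-Test Split**:  
   - Users with at least 50 ratings are sampled at random.
   - For each sampled user, 10% of their ratings are hidden and reserved for testing (`test_size=0.1` in the runs below).
   - The remaining 90% of their ratings, together with all other users' ratings, are used by the model.

2. **Predictions and Metrics**:  
   - For each sampled user, the model predicts ratings for the hidden movies based on the nearest neighbors identified in the remaining data.
   - **Mean Absolute Error (MAE)** is calculated as the primary metric to evaluate prediction accuracy, alongside a root-mean-square-style error (RMSE).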
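
For reference, the textbook definitions of the two error measures can be written in a few lines. This is a standalone sketch with made-up ratings; the function names `mae` and `rmse` are illustrative and are not the ones used by the evaluation code in this notebook.

```python
import numpy as np

def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted|."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the mean squared error."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical held-out ratings and the corresponding predictions.
actual = [5, 3, 4, 1]
predicted = [4, 3, 2, 2]

print(f"MAE:  {mae(actual, predicted):.4f}")
print(f"RMSE: {rmse(actual, predicted):.4f}")
```

With these definitions the RMSE is never smaller than the MAE on the same data, because squaring penalizes large errors more heavily.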


In [8]:
# Evaluate the recommendation system
def print_progress_bar(iteration, total, prefix='', suffix='', decimals=1, length=50, fill='█'):
    """
    Print a manual progress bar in the terminal.
    """
    percent = f"{100 * (iteration / float(total)):.{decimals}f}"
    filled_length = int(length * iteration // total)
    bar = fill * filled_length + '-' * (length - filled_length)
    sys.stdout.write(f'\r{prefix} |{bar}| {percent}% {suffix}')
    sys.stdout.flush()
    if iteration == total:
        print()  # Print a new line on completion


def evaluate_recommendation_system(recommendation_function, train_matrix, test_size=0.2, k=10, users_to_test=10):
    """
    Evaluate a recommendation system using MAE and an RMSE-style error on held-out ratings.
    Includes a manual progress bar for tracking progress.

    Parameters:
        recommendation_function (function): The recommendation function to evaluate.
        train_matrix (pd.DataFrame): The user-item matrix.
        test_size (float): The proportion of each user's ratings to hold out for testing. Defaults to 0.2.
        k (int): The number of nearest neighbors used by the recommendation function. Defaults to 10.
        users_to_test (int): Number of users to test. Defaults to 10.

    Note:
        Relies on the notebook-level `user_index`, `movie_index` and `user_item_matrix`.

    Returns:
        tuple: (MAE, RMSE), each averaged over the tested users.
    """
    def get_indexes_of_active_users(train_matrix):
        # Count non-zero ratings for each user
        user_counts = (train_matrix != 0).sum(axis=1)
        return user_counts[(user_counts >= 50)].index

    def get_random_user_indexes(users_to_test, train_matrix):
        # Randomly sample active users
        return np.random.choice(get_indexes_of_active_users(train_matrix), size=users_to_test, replace=False)

    def get_random_rated_movies(user, train_matrix):
        # Randomly sample rated movies for a given user
        rated_movies = train_matrix.loc[user][train_matrix.loc[user] > 0].index
        selected_rated_movies = np.random.choice(rated_movies, size=int(len(rated_movies) * test_size), replace=False)
        return {movie: train_matrix.loc[user, movie] for movie in selected_rated_movies}

    mae  = 0
    rmse = 0

    # Randomly select users and their rated movies
    random_user_item = {
        user_id: get_random_rated_movies(user_id, train_matrix) for user_id in get_random_user_indexes(users_to_test, train_matrix)
    }

    # Remove empty movie lists
    random_user_item = {key: value for key, value in random_user_item.items() if value}

    total_users = len(random_user_item)
    user_count = 0

    # Iterate over users
    for user_id, random_rated_movies in random_user_item.items():
        user_count += 1

        # Print the progress for users
        print_progress_bar(user_count, total_users, prefix="Users Processed", suffix="Complete")

        # Temporarily set ratings for selected movies to 0
        for movie_id in random_rated_movies.keys():
            user_idx = user_index[user_id]
            movie_idx = user_item_matrix.columns.get_loc(movie_id)
            train_matrix.iloc[user_idx, movie_idx] = 0

        # Generate recommendations for the user
        recommendations = recommendation_function(
            user_id=user_id,
            user_item_matrix=train_matrix,
            user_index=user_index,
            item_index=movie_index,
            k=k
        )

        # Restore the original train matrix
        for movie_id, rating in random_rated_movies.items():
            user_idx = user_index[user_id]
            movie_idx = user_item_matrix.columns.get_loc(movie_id)
            train_matrix.iloc[user_idx, movie_idx] = rating
        
        # Calculate the Mean Absolute Error (MAE)
        mae_tmp  = 0
        rmse_tmp = 0

        for movie_id, rating in random_rated_movies.items():
            # Get the estimated rating for the movie
            estimated_rating = recommendations.loc[recommendations['Movie_ID'] == movie_id, 'Estimated_Rating'].values[0]
            # Update the MAE_tmp
            mae_tmp  += np.abs(rating - estimated_rating)
            rmse_tmp += (rating - estimated_rating)**2

            #print(movie_id, rating, estimated_rating)

        # Normalize the MAE_tmp by the number of held-out movies
        mae_tmp /= len(random_rated_movies)
        mae += mae_tmp

        # Note: this is sqrt(sum of squared errors) / n, which is smaller than the
        # standard RMSE, sqrt(sum of squared errors / n), by a factor of sqrt(n)
        rmse_tmp = rmse_tmp**0.5 / len(random_rated_movies)
        rmse += rmse_tmp

    # Return the per-user averages of both error measures
    return mae / total_users, rmse / total_users

In [9]:
# Evaluate the User-Based Recommendation System

np.random.seed(42)

error_user_based = evaluate_recommendation_system(
    recommend_movies_user_based,
    train_matrix=user_item_matrix,
    test_size=0.1,
    k=10,
    users_to_test=5
)

print(f"MAE (User-Based): {error_user_based[0]}")
print(f"RMSE (User-Based): {error_user_based[1]}")

Users Processed |██████████████████████████████████████████████████| 100.0% Complete
MAE (User-Based): 1.9519338894924974
RMSE (User-Based): 0.7542495644030704


In [10]:
# Evaluate the Item-Based Recommendation System

np.random.seed(42)

error_item_based = evaluate_recommendation_system(
    recommend_movies_item_based,
    train_matrix=user_item_matrix,
    test_size=0.1,
    k=10,
    users_to_test=2
)

print(f"MAE (Item-Based): {error_item_based[0]}")
print(f"RMSE (Item-Based): {error_item_based[1]}")

Users Processed |█████████████████████████-------------------------| 50.0% Complete

Processing Movies: 100%|██████████| 1214/1214 [07:39<00:00,  2.64movie/s]

Users Processed |██████████████████████████████████████████████████| 100.0% Complete

Processing Movies: 100%|██████████| 1296/1296 [04:13<00:00,  5.11movie/s]

MAE (Item-Based): 0.7369147386173005
RMSE (Item-Based): 0.33006553104989816





In [11]:
# Offline Evaluation of Recommendation Systems
def evaluate_recommendation_metrics(recommendation_function, train_matrix, k=10, at_k=10, test_size=0.2, users_to_test=10):
    """
    Evaluate a recommendation system using Precision@K, Recall@K, and F1-Score@K.

    Parameters:
        recommendation_function (function): The recommendation function to evaluate.
        train_matrix (pd.DataFrame): The user-item matrix.
        k (int): Number of neighbors for the recommendation algorithm.
        at_k (int): Number of recommendations to consider for metrics.
        test_size (float): Proportion of rated items to hold out for testing.
        users_to_test (int): Number of users to test.

    Returns:
        dict: A dictionary containing the average Precision@K, Recall@K, and F1-Score@K.
    """

    def get_indexes_of_active_users(train_matrix):
        user_counts = (train_matrix != 0).sum(axis=1)
        return user_counts[(user_counts >= 50)].index

    def get_random_user_indexes(users_to_test, train_matrix):
        return np.random.choice(get_indexes_of_active_users(train_matrix), size=users_to_test, replace=False)

    def get_random_rated_movies(user, train_matrix):
        """
        Select the highest-rated movies for a given user as the test set.
        """
        rated_movies = train_matrix.loc[user][train_matrix.loc[user] > 0]
        rated_movies_sorted = rated_movies.sort_values(ascending=False)  # Sort by rating (descending)
        num_test_movies = max(1, int(len(rated_movies) * test_size))  # Ensure at least 1 test movie
        selected_rated_movies = np.random.choice(rated_movies_sorted.index[:num_test_movies], size=num_test_movies, replace=False)
        return {movie: train_matrix.loc[user, movie] for movie in selected_rated_movies}

    precision_sum = 0
    recall_sum = 0
    f1_sum = 0

    # Randomly select users and their rated movies
    random_user_item = {
        user_id: get_random_rated_movies(user_id, train_matrix) for user_id in get_random_user_indexes(users_to_test, train_matrix)
    }

    # Remove empty movie lists
    random_user_item = {key: value for key, value in random_user_item.items() if value}

    total_users = len(random_user_item)
    user_count = 0

    for user_id, random_rated_movies in random_user_item.items():
        user_count += 1

        # Print the progress for users
        print_progress_bar(user_count, total_users, prefix="Users Processed", suffix="Complete")

        # Temporarily set ratings for selected movies to 0
        for movie_id in random_rated_movies.keys():
            user_idx = user_index[user_id]
            movie_idx = user_item_matrix.columns.get_loc(movie_id)
            train_matrix.iloc[user_idx, movie_idx] = 0

        # Generate recommendations for the user
        recommendations = recommendation_function(
            user_id=user_id,
            user_item_matrix=train_matrix,
            user_index=user_index,
            item_index=movie_index,
            k=k
        )

        # Restore the original train matrix
        for movie_id, rating in random_rated_movies.items():
            user_idx = user_index[user_id]
            movie_idx = user_item_matrix.columns.get_loc(movie_id)
            train_matrix.iloc[user_idx, movie_idx] = rating

        # Use a per-user cutoff so that a short list does not shrink at_k for later users
        cutoff = min(at_k, len(recommendations))

        # Calculate Precision@K, Recall@K, and F1-score@K
        recommended_movies = set(recommendations['Movie_ID'][:cutoff])
        relevant_movies = set(random_rated_movies.keys())

        # True Positives
        true_positives = len(recommended_movies & relevant_movies)
        # Precision@K
        precision = true_positives / cutoff if cutoff > 0 else 0
        # Recall@K
        recall = true_positives / len(relevant_movies) if len(relevant_movies) > 0 else 0
        # F1-Score@K
        f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        precision_sum += precision
        recall_sum += recall
        f1_sum += f1

    # Calculate average metrics
    average_precision = precision_sum / total_users
    average_recall = recall_sum / total_users
    average_f1 = f1_sum / total_users

    # Free memory
    del train_matrix

    return {
        'Precision@K': average_precision,
        'Recall@K': average_recall,
        'F1-Score@K': average_f1
    }
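
The three ranking metrics computed above can be illustrated in isolation. The sketch below is a minimal standalone version for a single user; the function name `precision_recall_f1_at_k` and the movie IDs are hypothetical.

```python
def precision_recall_f1_at_k(ranked_items, relevant_items, k):
    """
    Precision@K, Recall@K and F1@K for one user.

    ranked_items:   recommended item IDs, best first
    relevant_items: held-out item IDs the user actually liked
    """
    top_k = list(ranked_items)[:k]
    relevant = set(relevant_items)
    hits = len(set(top_k) & relevant)  # true positives in the top K

    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

ranked = [10, 20, 30, 40, 50]   # recommendations, best first
relevant = {20, 50, 60}         # held-out movies

p, r, f1 = precision_recall_f1_at_k(ranked, relevant, k=4)
print(f"Precision@4: {p:.4f}  Recall@4: {r:.4f}  F1@4: {f1:.4f}")
```

Only one of the top four recommendations is relevant, so precision is 1/4, while recall is 1/3 because three movies were held out. With a large K and few held-out movies, precision is bounded well below 1 even for a perfect ranking, which helps explain low Precision@K values next to much higher Recall@K values.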

In [12]:
# Evaluate the User-Based Recommendation System

np.random.seed(42)

metrics_user_based = evaluate_recommendation_metrics(
    recommend_movies_user_based,
    train_matrix=user_item_matrix,
    k=10, # Number of neighbors for the recommendation algorithm
    at_k=100, # Number of recommendations to consider
    test_size=0.1,
    users_to_test=10
)

print(f"User-Based Recommendation Metrics:")
print(f"Precision@K: {metrics_user_based['Precision@K']:.4f}")
print(f"Recall@K: {metrics_user_based['Recall@K']:.4f}")
print(f"F1-Score@K: {metrics_user_based['F1-Score@K']:.4f}")

Users Processed |██████████████████████████████████████████████████| 100.0% Complete
User-Based Recommendation Metrics:
Precision@K: 0.0910
Recall@K: 0.7570
F1-Score@K: 0.1522


In [13]:
# Evaluate the Item-Based Recommendation System

np.random.seed(42)

metrics_item_based = evaluate_recommendation_metrics(
    recommend_movies_item_based,
    train_matrix=user_item_matrix,
    k=10,  # Number of neighbors for the recommendation algorithm
    at_k=100,  # Number of recommendations to consider
    test_size=0.1,
    users_to_test=2
)

print(f"Item-Based Recommendation Metrics:")
print(f"Precision@K: {metrics_item_based['Precision@K']:.4f}")
print(f"Recall@K: {metrics_item_based['Recall@K']:.4f}")
print(f"F1-Score@K: {metrics_item_based['F1-Score@K']:.4f}")

Users Processed |█████████████████████████-------------------------| 50.0% Complete

Processing Movies: 100%|██████████| 1214/1214 [07:43<00:00,  2.62movie/s]

Users Processed |██████████████████████████████████████████████████| 100.0% Complete

Processing Movies: 100%|██████████| 1296/1296 [03:57<00:00,  5.47movie/s]

Item-Based Recommendation Metrics:
Precision@K: 0.0450
Recall@K: 0.5000
F1-Score@K: 0.0812





## Conclusions

This notebook demonstrated the implementation of a collaborative filtering approach for recommending movies. Key takeaways include:

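- Both models generate plausible suggestions for a sample user: the user-based model relies on similarity between users, the item-based model on similarity between movies.
- The **MAE** and **RMSE** metrics quantify how far predicted ratings are from held-out ratings. Here they are computed on only a few sampled users (5 for user-based, 2 for item-based), so they are indicative rather than conclusive.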
- While collaborative filtering is powerful, it faces challenges such as:
  - **Cold Start Problem**: Difficulty in recommending movies for new users or items.
  - **Data Sparsity**: Limited interactions in the dataset can affect similarity computations.

From the data on MAE and RMSE, we observe that the Item-Based approach aligns most closely with the actual user ratings (MAE of about 0.74, against about 1.95 for the User-Based approach). On the other hand, the User-Based approach proves to be more suitable for producing top-K recommendation lists: at K = 100 it reaches a precision of 0.091, a recall of 0.757 and an F1-score of 0.152, compared with 0.045, 0.500 and 0.081 for the Item-Based approach. Therefore, depending on the use case, either approach can be the better choice.

### Future Improvements

To address the limitations, potential enhancements include:

- Implementing hybrid recommendation systems that combine collaborative and content-based filtering.
- Exploring matrix factorization techniques (e.g., Singular Value Decomposition).
- Integrating deep learning-based recommendation methods.

This collaborative filtering approach forms a solid foundation for building scalable and effective recommendation systems.