# Collaborative Filtering for Movie Recommendations

## Introduction

Collaborative filtering is a popular approach in recommendation systems that leverages user-item interactions to predict user preferences. There are two main types of collaborative filtering:

- **User-Based Collaborative Filtering**: Recommends items based on the preferences of users who are similar to the target user.
- **Item-Based Collaborative Filtering**: Suggests items that are similar to those the user has already rated positively.

In this notebook, we will explore both approaches for movie recommendations.

## Dataset Description
We use two datasets for this analysis:
1. **Movies Dataset**:
   - `Movie_ID`: Unique identifier for each movie.
   - `Name`: Title of the movie.
   - `Year`: Release year of the movie.

2. **Ratings Dataset**:
   - `User_ID`: Unique identifier for each user.
   - `Movie_ID`: Identifier for the movie rated.
   - `Rating`: Numeric rating given by the user (e.g., on a scale of 1-5).

In [1]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Load the dataset
ratings = pd.read_csv("./data/Netflix_Dataset_Rating.csv")  # Columns: User_ID, Rating, Movie_ID
movies  = pd.read_csv("./data/Netflix_Dataset_Movie.csv")    # Columns: Movie_ID, Year, Name

### Formal Definition

- $U = \{u_1, u_2, \dots, u_n\}$ is the set of users.
- $U_i = \{u \in U \mid r_{u,i} \neq 0\}$ is the set of users who have rated item $i$
- $I = \{i_1, i_2, \dots, i_m\}$ is the set of items.
- $I_u = \{i \in I \mid r_{u,i} \neq 0\}$ is the set of items rated by user $u$
- $R = \{0, 1, \dots, 5\} \lor R = [0, 1]$ is the set of ratings.
- $r_{u,i}$ is the rating given by user $u$ for item $i$ (equal to 0 if not rated).
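
To make the notation concrete, the following sketch derives $U_i$ and $I_u$ from a small rating matrix. The values are made up for illustration (they are not taken from the Netflix data), and the helper names `users_who_rated` and `items_rated_by` are hypothetical.

```python
import numpy as np

# Toy rating matrix: rows are users u_1..u_3, columns are items i_1..i_4.
# A value of 0 means "not rated", matching the convention r_{u,i} = 0.
R = np.array([
    [5, 0, 3, 0],
    [4, 0, 0, 1],
    [0, 2, 3, 0],
])

def users_who_rated(R, i):
    """Return U_i: indices of users with a non-zero rating for item i."""
    return set(np.flatnonzero(R[:, i]).tolist())

def items_rated_by(R, u):
    """Return I_u: indices of items with a non-zero rating from user u."""
    return set(np.flatnonzero(R[u, :]).tolist())

print(users_who_rated(R, 0))  # users who rated the first item
print(items_rated_by(R, 2))   # items rated by the third user
```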

## User-Based Collaborative Filtering

### 1. Data Preprocessing
- **User-Item Matrix Creation**: Convert the ratings dataset into a user-item matrix, where rows represent users and columns represent movies. Missing ratings are filled with zeros. Each rating is represented by a number from 1 to 5.
  $$ M[u, i] = r_{u,i} \in R$$
  Where:
  - $u \in U$ is a user.
  - $i \in I$ is a movie.
  - $r_{u,i}$ is the rating given by user $u$ for movie $i$.

- **Sparse Matrix Conversion**: The dense matrix is converted to a sparse format for memory optimization:
  $$M_{\text{sparse}} = \text{sparse}(M)$$
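
As a small illustration of these two steps, the sketch below pivots a handful of made-up ratings (same column names as the Ratings dataset) and converts the result to CSR format. The variable names `toy_ratings`, `toy_matrix`, and `toy_sparse` are assumptions introduced only for this example.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Made-up ratings in the same long format as the Ratings dataset.
toy_ratings = pd.DataFrame({
    "User_ID":  [1, 1, 2, 3, 3],
    "Movie_ID": [10, 30, 10, 20, 30],
    "Rating":   [5, 3, 4, 2, 3],
})

# Rows become users, columns become movies; unrated pairs are NaN until filled.
toy_matrix = toy_ratings.pivot(index="User_ID", columns="Movie_ID", values="Rating").fillna(0)

# CSR stores only the non-zero entries.
toy_sparse = csr_matrix(toy_matrix.values)

density = toy_sparse.nnz / (toy_sparse.shape[0] * toy_sparse.shape[1])
print(toy_matrix)
print(f"stored entries: {toy_sparse.nnz}, density: {density:.2f}")
```

On real rating data most user-movie pairs are unrated, so the density is typically far lower than in this toy case, which is where the sparse format is likely to pay off.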

In [2]:
# Step 1: Create a user-item matrix
user_item_matrix = ratings.pivot(index='User_ID', columns='Movie_ID', values='Rating')

# Fill missing values with 0, since pivot leaves NaN for every user-movie pair without a rating
# (some algorithms prefer to keep NaN instead)
user_item_matrix.fillna(0, inplace=True)

# Convert the DataFrame to a sparse matrix
sparse_user_item = csr_matrix(user_item_matrix.values)

### 2. Collaborative Filtering Model
We use the `scikit-learn` library to implement a KNN-based model:

1. **Similarity Metric**: Cosine similarity is used to identify users with similar rating patterns. The formula for cosine similarity between two users $u$ and $v$ is:
   $$
   \text{sim}(u, v) = \frac{\vec r_u \cdot \vec r_v}{\|\vec r_u\| \cdot \|\vec r_v\|}
   $$
   Where:
   - $\vec r_u$ and $\vec r_v$ are the rating vectors of users $u$ and $v$ respectively (the rows of the user-item matrix $M$ corresponding to $u$ and $v$).
   - $\cdot$ represents the dot product.

2. **Nearest Neighbors**: The model identifies the top $k$ nearest neighbors for each user, based on the similarity scores:
   $$\large
   \mathcal{N}_u^k = \operatorname*{arg\,max}_{U' \subseteq U \setminus \{u\} \land |U'| = k} \sum_{v \in U'} \text{sim}(u, v)
   $$
   Where:
   - $\mathcal N_u^k$ is the set of the top $k$ nearest neighbors for user $u$.
   - $U'$ is a subset of users excluding $u$.
   - $k = 10$ (default value in this implementation).
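
The sketch below checks the similarity formula on a toy matrix and compares it with what `NearestNeighbors` returns for `metric='cosine'`. The matrix values and the helper `cosine_sim` are made up for illustration, and $k = 2$ is used instead of 10 to keep the example small.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy user-item matrix (made-up values); each row is a rating vector r_u.
M = np.array([
    [5.0, 0.0, 3.0, 0.0],
    [4.0, 0.0, 3.0, 1.0],
    [0.0, 2.0, 0.0, 5.0],
    [5.0, 1.0, 4.0, 0.0],
])

def cosine_sim(a, b):
    """sim(a, b) = (a . b) / (||a|| * ||b||)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity of user 0 to every other user, computed from the formula.
sims = {v: cosine_sim(M[0], M[v]) for v in range(1, len(M))}

# NearestNeighbors with metric='cosine' works with distances, i.e. 1 - sim.
knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(M)
distances, indices = knn.kneighbors(M[[0]], n_neighbors=3)

# The first hit is user 0 itself (distance 0); the rest are its k = 2 neighbors.
neighbors = indices.flatten()[1:]
knn_sims = 1 - distances.flatten()[1:]
print(neighbors, knn_sims)
```

Because the query user is part of the fitted data, one extra neighbor is requested and the first result is dropped, which is the same trick the recommendation function uses.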

In [3]:
# Step 2: Build the collaborative filtering model
model_knn_user = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=10)

### 3. Recommendation Function (User-Based)

In the user-based collaborative filtering approach, the recommendation function operates as follows:

1. **Get all relevant items**: The set of items rated by the top $k$ nearest neighbors of user $u$ and not yet rated by $u$:
   $$
   I^k = \{i \in I \setminus I_u \mid \exists v \in \mathcal N_u^k : r_{v,i} \neq 0\}
   $$
   Where:
   - $\mathcal N_u^k$ is the set of the top $k$ nearest neighbors for user $u$.
   - $I^k$ is the set of items rated by at least one of the top $k$ nearest neighbors of user $u$ and not already rated by $u$.
   - $k = 10$ (default value in this implementation).

2. **Prediction Formula**: For a given user $u$, the set of predicted ratings for items $i \in I^k$ is calculated as:
   $$\large
   I^k_u = \{\hat{r}_{u,i} \mid i \in I^k\} \quad \text{where} \quad
   \hat{r}_{u,i} = \frac{\sum_{v \in \mathcal N_u^k} \text{sim}(u, v) \cdot r_{v,i}}{\sum_{v \in \mathcal N_u^k} \text{sim}(u, v)}
   $$
   Where:
   - $\hat{r}_{u,i}$ is the predicted rating for user $u$ on item $i$.
   - $r_{v,i}$ is the rating of neighbor $v \in \mathcal N_u^k$ for item $i$.
   - $\text{sim}(u, v)$ is the similarity between users $u$ and $v$.

This combination of steps allows the model to efficiently generate user-based collaborative filtering recommendations using nearest neighbors. This approach works well in cases where users with similar preferences exist in the dataset. For example, if two users have rated several movies similarly, the model assumes they will likely share preferences for other movies as well.
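
A worked example of the prediction formula, using hypothetical similarities and neighbor ratings (with $k = 3$ instead of 10), might look like this:

```python
import numpy as np

# Hypothetical neighborhood of k = 3 users with their similarity to the target user.
similarities = np.array([0.9, 0.6, 0.3])

# Neighbors' ratings for two candidate movies the target user has not rated
# (rows: neighbors, columns: movies). 0 means the neighbor did not rate the movie.
neighbor_ratings = np.array([
    [5, 0],
    [4, 3],
    [0, 5],
])

# Weighted sum of ratings divided by the sum of all k similarities.
predicted = similarities @ neighbor_ratings / similarities.sum()
print(predicted)
```

Note that the denominator sums the similarities of all $k$ neighbors, while neighbors who did not rate a movie contribute 0 to the numerator. Movies rated by only a few neighbors are therefore pulled toward 0, which may explain why user-based estimates can fall well below the neighbors' actual ratings.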

In [105]:
# Step 3: Define a rating and a recommendation function
def recommend_movies_user_based(user_id, train_matrix, num_recommendations=5):
    """Recommend movies using user-based collaborative filtering."""
    if user_id not in user_item_matrix.index:
        raise ValueError("User ID not found in the dataset")

    # Get the user vector
    user_idx = user_item_matrix.index.get_loc(user_id)
    user_vector = train_matrix[user_idx]

    # Train the model
    model_knn_user.fit(train_matrix)

    # Find nearest neighbors
    distances, indices = model_knn_user.kneighbors(user_vector, n_neighbors=10 + 1)  # +1 to exclude the user itself

    # Filter out the user itself (distance = 0)
    neighbors = indices.flatten()[1:]
    similarity_scores = 1 - distances.flatten()[1:]  # Convert distances to similarity scores

    # Get movies rated by similar users and calculate the weighted average rating
    similar_users_ratings = user_item_matrix.iloc[neighbors]
    similar_users_ratings = similar_users_ratings.loc[:, user_item_matrix.loc[user_id] == 0]  # Exclude movies rated by the user
    weighted_ratings = (similar_users_ratings.T * similarity_scores).T  # Apply weights
    mean_ratings = weighted_ratings.sum(axis=0) / similarity_scores.sum()

    # Sort by rating and get the top recommendations
    recommended_movies = mean_ratings.sort_values(ascending=False).head(num_recommendations)

    # Map movie IDs to names
    recommended_movies = recommended_movies.reset_index()
    recommended_movies.columns = ['Movie_ID', 'Estimated_Rating']
    recommendations = recommended_movies.merge(movies, on='Movie_ID')[['Movie_ID', 'Name', 'Year', 'Estimated_Rating']]

    return recommendations

### 4. Model Testing (User-Based)
The function is tested with a sample user to generate personalized recommendations (User-Based).

In [106]:
# Step 4: Test the recommendation function
user_id_to_test = 774868 #1331154  # Change as needed
num_recommendations = 5

try:
    user_based_recommendations = recommend_movies_user_based(user_id_to_test, sparse_user_item, num_recommendations=num_recommendations)
    print(f"Top {num_recommendations} movie recommendations for User {user_id_to_test}:")
    print(user_based_recommendations[['Name', 'Year', 'Estimated_Rating', 'Movie_ID']])
except ValueError as e:
    print(e)

Top 5 movie recommendations for User 774868:
                                                Name  Year  Estimated_Rating  \
0      Lord of the Rings: The Fellowship of the Ring  2001          3.916104   
1  Pirates of the Caribbean: The Curse of the Bla...  2003          3.209660   
2                               Bend It Like Beckham  2002          2.741444   
3                                  Princess Mononoke  1997          2.703218   
4                          Finding Nemo (Widescreen)  2003          2.697026   

   Movie_ID  
0      2452  
1      1905  
2      1470  
3       473  
4      3962  


## Item-Based Collaborative Filtering

### 1. Data Preprocessing
- **Item-User Matrix Creation**: Transpose the user-item matrix $M$ to create a new matrix $M' = M^T$, where rows represent movies and columns represent users. Missing ratings are filled with zeros.
  $$ M'[i, u] = M[u, i] = r_{u,i} \in R$$
- **Sparse Matrix Conversion**: The dense matrix is converted to a sparse format to optimize memory usage.
  $$M'_{\text{sparse}} = \text{csr\_matrix}(M')$$

In [7]:
# Step 1: Transpose the user-item matrix
item_user_matrix = user_item_matrix.T

# Convert the DataFrame to a sparse matrix
sparse_item_user = csr_matrix(item_user_matrix.values)

### 2. Collaborative Filtering Model (Item-Based)
We use the `scikit-learn` library to implement a KNN-based model:

1. **Similarity Metric**: Cosine similarity is used to identify movies with similar rating patterns. The formula for cosine similarity between two movies $i$ and $j$ is:
   $$
   \text{sim}(i, j) = \frac{\vec r_i \cdot \vec r_j}{\|\vec r_i\| \cdot \|\vec r_j\|}
   $$
   Where:
   - $\vec r_i$ and $\vec r_j$ are the vectors of user ratings for movies $i$ and $j$ respectively (the rows of the item-user matrix $M'$ corresponding to $i$ and $j$).
   - $\cdot$ represents the dot product.

2. **Nearest Neighbors**: The model identifies the top $k$ nearest neighbors for each movie based on similarity scores:
   $$\large
   \mathcal{N}_i^k = \operatorname*{arg\,max}_{I'_u \subseteq I_u \setminus \{i\} \land |I'_u| = k} \sum_{j \in I'_u} \text{sim}(i, j)
   $$
   Where:
   - $\mathcal N_i^k$ is the set of the top $k$ nearest neighbors of movie $i$ among the movies rated by user $u$.
   - $I'_u$ is a subset of size $k$ of the items rated by user $u$ (movie $i$ is excluded, which holds automatically when $u$ has not rated $i$).
   - $k = 10$ (default value in this implementation).
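
The neighbor search restricted to the user's rated movies can be sketched with plain NumPy as follows. The matrix values and the helper `top_k_rated_neighbors` are hypothetical, and $k = 2$ is used to keep the example small; the notebook itself relies on `NearestNeighbors` for this step.

```python
import numpy as np

# Toy item-user matrix M' (made-up values): rows are movies, columns are users.
M_items = np.array([
    [5.0, 4.0, 0.0],   # movie 0
    [4.0, 5.0, 1.0],   # movie 1
    [0.0, 1.0, 5.0],   # movie 2
    [5.0, 5.0, 0.0],   # movie 3
])

def top_k_rated_neighbors(M_items, i, rated, k):
    """Return the k items among `rated` most similar to item i, with their similarities."""
    rated = np.asarray(rated)
    norms = np.linalg.norm(M_items, axis=1)
    sims = M_items[rated] @ M_items[i] / (norms[rated] * norms[i])
    order = np.argsort(-sims)[:k]          # highest similarity first
    return rated[order], sims[order]

# Suppose the target user has rated movies 0, 1 and 2, and movie 3 is the candidate.
neighbors, sims = top_k_rated_neighbors(M_items, i=3, rated=[0, 1, 2], k=2)
print(neighbors, sims)
```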

In [8]:
# Step 2: Build the item-based collaborative filtering model
model_knn_item = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=10)

# Create the sparse matrix for the item-based model
sparse_item_user = csr_matrix(item_user_matrix.values)

### 3. Recommendation Function (Item-Based)

In the item-based collaborative filtering approach, the recommendation function works as follows:

Predict movie ratings for the target user by aggregating ratings from similar movies. For a given user $u$ and an item $i$ not yet rated by $u$, the predicted rating is calculated as:
   $$\large
   I^k_u = \{\hat{r}_{u,i} \mid i \in I \setminus I_u\} \quad \text{where} \quad
   \hat{r}_{u,i} = \frac{\sum_{j \in \mathcal N_i^k} \text{sim}(i, j) \cdot r_{u,j}}{\sum_{j \in \mathcal N_i^k} \text{sim}(i, j)}
   $$
   Where:
   - $\hat{r}_{u,i}$ is the predicted rating for user $u$ on item $i$.
   - $r_{u,j}$ is the rating of user $u$ for a similar item $j \in \mathcal N_i^k$.
   - $\text{sim}(i, j)$ is the similarity between items $i$ and $j$.
   - $\mathcal N_i^k$ is the set of the top $k$ most similar items to $i$.
   - $I^k_u$ is the set of predicted ratings for user $u$ on items not yet rated by $u$.

Thus, for each movie not rated by the target user, the model calculates a weighted average of the user's ratings for the most similar movies, weighted by their similarity scores. The predicted ratings are then sorted and the top-rated movies are returned.

This method leverages the user's own preferences and the similarity relationships between movies, providing relevant and personalized suggestions.
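
As a worked example of the item-based prediction formula, assume hypothetical similarities and ratings for $k = 3$ neighbor movies:

```python
import numpy as np

# Hypothetical similarities between a candidate movie i and its k = 3 nearest
# neighbors among the movies the user has rated.
similarities = np.array([0.8, 0.5, 0.2])

# The user's own ratings for those three neighbor movies.
user_ratings = np.array([5, 4, 2])

# Similarity-weighted average of the user's ratings.
predicted_rating = np.dot(similarities, user_ratings) / similarities.sum()
print(round(predicted_rating, 3))
```

Since every neighbor comes from the set of movies the user has rated, all ratings in the average are non-zero, and with non-negative similarities the prediction stays between the lowest and highest of those ratings.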

In [101]:
# Step 3: Recommendation Function (Item-Based)
from tqdm import tqdm

def recommend_movies_item_based(user_id, train_matrix, num_recommendations=5):
    """Recommend movies using item-based collaborative filtering for a given user."""
    if user_id not in user_item_matrix.index:
        raise ValueError("User ID not found in the dataset")

    # Get the user's ratings
    user_ratings = item_user_matrix.loc[:, user_id].to_numpy()
    rated_movies = np.where(user_ratings > 0)[0]  # Indices of movies the user has rated

    if len(rated_movies) == 0:
        raise ValueError("User has not rated any movies")

    # Initialize a dictionary to store weighted scores
    movie_scores = {}

    # Fit the model with the sparse item-user matrix limited to the rated movies
    sparse_item_user_u = train_matrix[rated_movies, :]
    model_knn_item.fit(sparse_item_user_u)

    # Get unseen movies
    not_rated_movies = item_user_matrix.index.difference(item_user_matrix.index[rated_movies])

    # Iterate over all movies not rated by the user
    for movie_id in tqdm(not_rated_movies, 
                         desc="Processing Movies", unit="movie"):
        # Get the vector for the movie
        movie_idx = item_user_matrix.index.get_loc(movie_id)
        movie_vector = train_matrix[movie_idx, :]

        # Find nearest neighbors for the movie (at most as many as the user has rated)
        distances, indices = model_knn_item.kneighbors(movie_vector, n_neighbors=min(10, len(rated_movies)))

        # Calculate similarity scores
        similarity_scores = 1 - distances.flatten()

        # Get the indexes of the similar movies
        similar_movies = rated_movies[indices.flatten()]

        # Calculate the weighted average score for the movie using similarity scores
        user_ratings_for_similar = user_ratings[similar_movies]
        movie_scores[movie_id] = np.dot(similarity_scores, user_ratings_for_similar) / similarity_scores.sum()

    # Sort movies by aggregated score
    recommended_movies = sorted(movie_scores.items(), key=lambda x: x[1], reverse=True)[:num_recommendations]

    # Scores are already normalized by the sum of similarities inside the loop

    # Map movie IDs to names
    recommended_movies_df = pd.DataFrame(recommended_movies, columns=['Movie_ID', 'Estimated_Rating'])
    recommendations = recommended_movies_df.merge(movies, on='Movie_ID')[['Movie_ID', 'Name', 'Year', 'Estimated_Rating']]

    return recommendations

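Alternative version: movies are ranked by the unnormalized weighted sum, and only the returned scores are normalized by the sum of similarities.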

In [107]:
# Step 3: Recommendation Function (Item-Based)
from tqdm import tqdm

def recommend_movies_item_based(user_id, train_matrix, num_recommendations=5):
    """Recommend movies using item-based collaborative filtering for a given user."""
    if user_id not in user_item_matrix.index:
        raise ValueError("User ID not found in the dataset")

    # Get the user's ratings
    user_ratings = item_user_matrix.loc[:, user_id].to_numpy()
    rated_movies = np.where(user_ratings > 0)[0]  # Indices of movies the user has rated

    if len(rated_movies) == 0:
        raise ValueError("User has not rated any movies")

    # Initialize a dictionary to store weighted scores
    movie_scores = {}

    # Fit the model with the sparse item-user matrix limited to the rated movies
    sparse_item_user_u = train_matrix[rated_movies, :]
    model_knn_item.fit(sparse_item_user_u)

    # Get unseen movies
    not_rated_movies = item_user_matrix.index.difference(item_user_matrix.index[rated_movies])

    # Iterate over all movies not rated by the user
    for movie_id in tqdm(not_rated_movies, 
                         desc="Processing Movies", unit="movie"):
        # Get the vector for the movie
        movie_idx = item_user_matrix.index.get_loc(movie_id)
        movie_vector = train_matrix[movie_idx, :]

        # Find nearest neighbors for the movie (at most as many as the user has rated)
        distances, indices = model_knn_item.kneighbors(movie_vector, n_neighbors=min(10, len(rated_movies)))

        # Calculate similarity scores
        similarity_scores = 1 - distances.flatten()

        # Get the indexes of the similar movies
        similar_movies = rated_movies[indices.flatten()]

        # Calculate the weighted average score for the movie using similarity scores
        user_ratings_for_similar = user_ratings[similar_movies]
        movie_scores[movie_id]    = []
        movie_scores[movie_id].append(np.dot(similarity_scores, user_ratings_for_similar))

        # Store the similarity sum for normalization
        movie_scores[movie_id].append(similarity_scores.sum())

    # Sort movies by aggregated score
    recommended_movies = sorted(movie_scores.items(), key=lambda x: x[1][0], reverse=True)[:num_recommendations]

    # Normalize the scores
    recommended_movies = [(movie_id, score[0] / score[1]) for movie_id, score in recommended_movies]

    # Map movie IDs to names
    recommended_movies_df = pd.DataFrame(recommended_movies, columns=['Movie_ID', 'Estimated_Rating'])
    recommendations = recommended_movies_df.merge(movies, on='Movie_ID')[['Movie_ID', 'Name', 'Year', 'Estimated_Rating']]

    return recommendations

### 4. Model Testing (Item-Based)
The function is tested with the same sample user to generate personalized recommendations (Item-Based).

In [108]:
# Step 4: Test the recommendation function
user_id_to_test = 774868#1331154  # Change as needed
num_recommendations = 5

try:
    item_based_recommendations = recommend_movies_item_based(user_id_to_test, sparse_item_user, num_recommendations=num_recommendations)
    print(f"Top {num_recommendations} movie recommendations for User {user_id_to_test} (Item-Based):")
    print(item_based_recommendations[['Name', 'Year', 'Estimated_Rating', 'Movie_ID']])
except ValueError as e:
    print(e)

Processing Movies: 100%|██████████| 1331/1331 [00:09<00:00, 136.26movie/s]

Top 5 movie recommendations for User 774868 (Item-Based):
                                                Name  Year  Estimated_Rating  \
0  Pirates of the Caribbean: The Curse of the Bla...  2003          4.816355   
1      Lord of the Rings: The Fellowship of the Ring  2001          4.809336   
2                                    The Sixth Sense  1999          4.808670   
3                                    American Beauty  1999          4.761287   
4                          Finding Nemo (Widescreen)  2003          4.817077   

   Movie_ID  
0      1905  
1      2452  
2      4306  
3       571  
4      3962  





### 5. Evaluation of the Model

To assess the effectiveness of the collaborative filtering approach, we perform a train-test split on the ratings data. The evaluation process includes:

1. **Train-Test Split**:  
   - 80% of the data is used for training the model.
   - 20% of the data is reserved for testing.

2. **Predictions and Metrics**:  
   - For each user in the test set, the model predicts ratings for movies based on the nearest neighbors identified in the training data.
   - **Mean Absolute Error (MAE)** is calculated as the primary metric to evaluate prediction accuracy.
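
For reference, MAE and RMSE (which the evaluation code also reports) can be computed by hand as in the sketch below. The rating values are made up for illustration.

```python
import numpy as np

# Made-up held-out ratings and the corresponding model estimates.
true_ratings = np.array([4.0, 3.0, 5.0, 2.0])
predictions  = np.array([3.5, 3.0, 4.0, 3.0])

errors = predictions - true_ratings

# MAE: average magnitude of the errors.
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error; penalizes large errors more.
rmse = np.sqrt(np.mean(errors ** 2))

print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```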


In [None]:
def user_based_single(user_id, movie_id, train_matrix):
    """Estimate the rating for a specific user and movie using user-based collaborative filtering."""
    if user_id not in user_item_matrix.index:
        raise ValueError("User ID not found in the dataset")
    if movie_id not in user_item_matrix.columns:
        raise ValueError("Movie ID not found in the dataset")

    # Get the user vector
    user_idx = user_item_matrix.index.get_loc(user_id)
    user_vector = train_matrix[user_idx]

    # Fit the model and find nearest neighbors (+1 because the user itself is returned)
    model_knn_user.fit(train_matrix)
    distances, indices = model_knn_user.kneighbors(user_vector, n_neighbors=10 + 1)

    # Drop the user itself and convert distances to similarity scores
    similar_users = indices.flatten()[1:]
    similarity_scores = 1 - distances.flatten()[1:]

    # Movies are the columns of the user-item matrix
    movie_idx = user_item_matrix.columns.get_loc(movie_id)

    # Calculate the weighted average of the neighbors' ratings for the movie
    user_ratings_for_similar = user_item_matrix.iloc[similar_users, movie_idx].to_numpy()
    weighted_ratings = (user_ratings_for_similar * similarity_scores).sum() / similarity_scores.sum()

    return weighted_ratings

def item_based_single(user_id, movie_id, train_matrix):
    """Estimate the rating for a specific user and movie using item-based collaborative filtering."""
    if user_id not in user_item_matrix.index:
        raise ValueError("User ID not found in the dataset")
    
    # Get the user's ratings
    user_ratings = item_user_matrix.loc[:, user_id].to_numpy()
    rated_movies = np.where(user_ratings > 0)[0]  # Indices of movies the user has rated

    if len(rated_movies) == 0:
        raise ValueError("User has not rated any movies")

    # Fit the model with the sparse item-user matrix limited to the rated movies
    sparse_item_user_u = train_matrix[rated_movies, :]
    model_knn_item.fit(sparse_item_user_u)

    # Get the vector for the movie
    movie_idx = item_user_matrix.index.get_loc(movie_id)
    movie_vector = train_matrix[movie_idx, :]

    # Find nearest neighbors for the movie (at most as many as the user has rated)
    distances, indices = model_knn_item.kneighbors(movie_vector, n_neighbors=min(10, len(rated_movies)))

    # Calculate similarity scores
    similarity_scores = 1 - distances.flatten()

    # Get the indexes of the similar movies
    similar_movies = rated_movies[indices.flatten()]

    # Calculate the weighted average score for the movie using similarity scores
    user_ratings_for_similar = user_ratings[similar_movies]

    return np.dot(similarity_scores, user_ratings_for_similar) / similarity_scores.sum()

In [126]:
# Step 5: Evaluation (Optional)
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from tqdm import tqdm

# Split the ratings data
print("Splitting the dataset...", end="")
train_data, test_data = train_test_split(
    ratings[:10000], # Keep only the first 10,000 rows to limit runtime
    test_size=0.2, 
    random_state=42
)
print("Done")

# Create the train matrices from the training ratings only, so that test ratings are not leaked
print("Creating train matrices...", end="")
user_item_matrix = train_data.pivot(index='User_ID', columns='Movie_ID', values='Rating').fillna(0)
sparse_user_item = csr_matrix(user_item_matrix.values)
item_user_matrix = user_item_matrix.T
sparse_item_user = csr_matrix(item_user_matrix.values)
print("Done")

def evaluate_model(predict_function, sparse_train_matrix, name='Model'):
    """Evaluate a single-rating prediction function on the test ratings using MAE and RMSE."""
    predictions = []
    true_ratings = []

    for row in tqdm(test_data.itertuples(index=False), total=len(test_data), desc="Evaluating " + name + " Model"):
        # Skip test ratings whose user or movie does not appear in the training data
        if row.User_ID not in user_item_matrix.index or row.Movie_ID not in user_item_matrix.columns:
            continue
        try:
            estimate = predict_function(row.User_ID, row.Movie_ID, sparse_train_matrix)
        except ValueError:
            continue  # e.g. fewer training samples than requested neighbors
        if np.isnan(estimate):
            continue  # all similarities were zero
        predictions.append(estimate)
        true_ratings.append(row.Rating)

    if not predictions:
        print(f"{name}: no test rating could be predicted")
        return

    mae = mean_absolute_error(true_ratings, predictions)
    # Take the square root explicitly (`squared=False` is deprecated in recent scikit-learn)
    rmse = np.sqrt(mean_squared_error(true_ratings, predictions))

    print(f"{name} - Mean Absolute Error (MAE): {mae}")
    print(f"{name} - Root Mean Squared Error (RMSE): {rmse}")

# Example Evaluation
evaluate_model(user_based_single, sparse_user_item, name='User-Based')
evaluate_model(item_based_single, sparse_item_user, name='Item-Based')

## Conclusions

This notebook demonstrated user-based and item-based collaborative filtering for recommending movies. Key takeaways include:

- Both models use similarity (between users or between movies) to make recommendations, as illustrated by the suggestions generated for a sample user.
- Predictive accuracy is measured on held-out ratings, with **MAE** as the primary metric.
- While collaborative filtering is powerful, it faces challenges such as:
  - **Cold Start Problem**: Difficulty in recommending movies for new users or items.
  - **Data Sparsity**: Limited interactions in the dataset can affect similarity computations.
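
The effect of data sparsity on similarity can be illustrated with a small, made-up matrix in which no two users have rated the same movie. The helpers `sparsity` and `cosine_similarity_matrix` are hypothetical names used only for this sketch.

```python
import numpy as np

# Made-up, very sparse user-item matrix (0 = not rated).
toy = np.array([
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
])

def sparsity(matrix):
    """Fraction of user-item pairs that have no rating."""
    return 1.0 - np.count_nonzero(matrix) / matrix.size

def cosine_similarity_matrix(matrix):
    """Pairwise cosine similarity between the rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / norms              # assumes no all-zero rows
    return unit @ unit.T

print(f"sparsity: {sparsity(toy):.2f}")
print(cosine_similarity_matrix(toy))
```

With no co-rated movies, every pair of distinct users has similarity 0, so the nearest-neighbor search has nothing meaningful to rank.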

### Future Improvements

To address the limitations, potential enhancements include:

- Implementing hybrid recommendation systems that combine collaborative and content-based filtering.
- Exploring matrix factorization techniques (e.g., Singular Value Decomposition).
- Integrating deep learning-based recommendation methods.
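
As a rough illustration of the matrix factorization idea, the sketch below applies a truncated SVD to a made-up rating matrix. It treats unrated entries as 0, which is a simplifying assumption; practical matrix factorization methods usually fit only the observed ratings. The choice of $k = 2$ latent factors is arbitrary.

```python
import numpy as np

# Made-up user-item matrix with two visible taste groups (0 = not rated).
M = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

# Full SVD, then keep only the k largest singular values (rank-k approximation).
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
M_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Entries of M_hat at positions where M is 0 can be read as predicted scores.
print(np.round(M_hat, 2))
```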

This collaborative filtering approach forms a solid foundation for building scalable and effective recommendation systems.