## User-Based Collaborative Filtering Recommender System - Netflix

**Nadia Aflatounian**

## Introduction

This Jupyter notebook focuses on the implementation of User-Based Collaborative Filtering (UBCF) using K-Nearest Neighbors (KNN) prediction and KNN classification techniques on the **Netflix dataset**. The primary goal is to address the challenges posed by the dataset's high sparsity, reaching around 98%, by exploring different methods for calculating similarities and optimizing the recommendation process accordingly.

**Key Objectives**:
- Implement User-Based Collaborative Filtering tailored to the high sparsity of the Netflix dataset.
- Investigate various similarity metrics and neighbor distances to enhance recommendation accuracy.
- Develop robust logics for both rating prediction and classification tasks.

**Methods Implemented**:
1. **KNN Prediction**: Predicting ratings based on user-item interactions while considering the unique characteristics of the Netflix dataset.
2. **KNN Classification**: Classifying users into groups based on their preferences, leveraging KNN principles adapted to the dataset's properties.

**Approach**:
- Given the high sparsity of the Netflix dataset, special attention is paid to methods that can effectively handle sparse data.

**Outcome**:
By the end of this notebook, insights into the effectiveness of various similarity measures and neighbor distance metrics specific to the Netflix dataset will be gained. Additionally, well-defined logics for predicting ratings and classifying users will be established, contributing to the enhancement of the recommendation system's performance in the context of the Netflix dataset's unique challenges.

In [1]:
import pandas as pd
import numpy as np
training_df = pd.read_csv('C:/Users/nafla/OneDrive/Documents/system development/training_data.csv')
training_df.head()

Unnamed: 0,MovieID,CustomerID,Rating,Date,YearOfRelease,MovieTitle,RatingYear,MovieAge,user_activity,AverageMovieAgeRated,user_average_rating,average_rating_per_movie,number_of_ratings_per_movie,scaled_movie_age
0,1,1488844,3,2005-09-06,2003,Dinosaur Planet,2005,2,1.473012,1.640503,3.253308,3.910534,1.010541,1.215054
1,1,822109,5,2005-05-13,2003,Dinosaur Planet,2005,2,1.031355,1.405855,4.083333,3.910534,1.010541,1.215054
2,1,885013,4,2005-10-19,2003,Dinosaur Planet,2005,2,1.077044,1.400853,3.873563,3.910534,1.010541,1.215054
3,1,30878,4,2005-12-26,2003,Dinosaur Planet,2005,2,1.275924,1.525706,3.634304,3.910534,1.010541,1.215054
4,1,823519,3,2004-05-03,2003,Dinosaur Planet,2004,1,1.139754,1.326786,3.917197,3.910534,1.010541,1.172043


In [2]:
# Calculate quantiles for user activity and item popularity
user_activity_quantiles = training_df['CustomerID'].value_counts().quantile([0.25, 0.5, 0.75])
item_popularity_quantiles = training_df['MovieID'].value_counts().quantile([0.25, 0.5, 0.75])
print(user_activity_quantiles)
print(item_popularity_quantiles)

0.25     8.0
0.50    24.0
0.75    64.0
Name: count, dtype: float64
0.25     192.0
0.50     552.5
0.75    2539.0
Name: count, dtype: float64


In [3]:
# converting column types
training_df['CustomerID'] = training_df['CustomerID'].astype(str)
training_df['MovieID'] = training_df['MovieID'].astype(str)
training_df['Rating'] = pd.to_numeric(training_df['Rating'], errors='coerce')  # Converts to float, makes non-numeric as NaN

In [43]:
# List of your columns to be rounded and converted
columns_to_round_and_convert = ['user_activity', 'AverageMovieAgeRated', 'user_average_rating']

# when trying to create similarity matrix containing user feature we ran to memory error
# Apply rounding and conversion to all specified columns to use less memory 
for column in columns_to_round_and_convert:
    training_df[column] = training_df[column].round().astype(int)

# Display the DataFrame to verify the changes
print(training_df.head())


  MovieID CustomerID  Rating        Date  YearOfRelease       MovieTitle  \
0       1    1488844       3  2005-09-06           2003  Dinosaur Planet   
1       1     822109       5  2005-05-13           2003  Dinosaur Planet   
2       1     885013       4  2005-10-19           2003  Dinosaur Planet   
3       1      30878       4  2005-12-26           2003  Dinosaur Planet   
4       1     823519       3  2004-05-03           2003  Dinosaur Planet   

   RatingYear  user_activity  AverageMovieAgeRated  user_average_rating  \
0        2005              1                     2                    3   
1        2005              1                     1                    4   
2        2005              1                     1                    4   
3        2005              1                     2                    4   
4        2004              1                     1                    4   

   scaled_movie_age  
0          1.215054  
1          1.215054  
2          1.215054  
3   

In [6]:
# drop columns we don't need
# List the names of the columns you want to drop
columns_to_drop = ['average_rating_per_movie', 'number_of_ratings_per_movie', 'MovieAge']

# Drop the specified columns from the DataFrame
training_df = training_df.drop(columns=columns_to_drop)

# Display the DataFrame to verify the changes
print(training_df.head())


  MovieID CustomerID  Rating        Date  YearOfRelease       MovieTitle  \
0       1    1488844       3  2005-09-06           2003  Dinosaur Planet   
1       1     822109       5  2005-05-13           2003  Dinosaur Planet   
2       1     885013       4  2005-10-19           2003  Dinosaur Planet   
3       1      30878       4  2005-12-26           2003  Dinosaur Planet   
4       1     823519       3  2004-05-03           2003  Dinosaur Planet   

   RatingYear  user_activity  AverageMovieAgeRated  user_average_rating  \
0        2005              1                     2                    3   
1        2005              1                     1                    4   
2        2005              1                     1                    4   
3        2005              1                     2                    4   
4        2004              1                     1                    4   

   scaled_movie_age  
0          1.215054  
1          1.215054  
2          1.215054  
3   

### Sampling Strategy

To address the high sparsity and memory constraints in the dataset, a filtering and sampling approach is applied. This involves several steps:

1. **Filtering Users and Movies**: Users and movies with fewer than 10 ratings are filtered out, aiming to reduce the dataset's sparsity and focus on more active users and popular movies.

2. **Random Sampling**: A random sampling technique is employed to further reduce the dataset size while retaining representative samples. The fraction of data to be sampled is specified as a parameter, with careful consideration for memory limitations.

### Note:

It's essential to acknowledge that this sampling strategy may result in the loss of some important trends or data patterns present in the original dataset. Additionally, it's recognized that the trained recommender system may not perform similarly on other datasets due to the reduced diversity resulting from the sampling process.

By implementing this sampling strategy, the goal is to strike a balance between reducing sparsity and memory usage while maintaining a dataset size that allows for effective training and modeling.

In [9]:
import pandas as pd

# Step 1: Filter users with more than 10 ratings
user_filtered_df = training_df.groupby('CustomerID').filter(lambda x: len(x) > 10)

# Step 2: Filter movies with more than 10 ratings
movie_filtered_df = user_filtered_df.groupby('MovieID').filter(lambda x: len(x) > 10)

# Step 3: Perform random sampling
# Replace 'fraction' with the fraction of data you want to sample. For example, 0.005 for 0.5%
fraction = 0.005
strat_sample_df = movie_filtered_df.sample(frac=fraction, random_state=42)  # Ensure reproducibility with random_state

# Display the shapes of the original, user-filtered, movie-filtered, and sampled DataFrames
print("Original DataFrame shape:", training_df.shape)
print("User-Filtered DataFrame shape:", user_filtered_df.shape)
print("Movie-Filtered DataFrame shape:", movie_filtered_df.shape)
print("Sampled DataFrame shape:", strat_sample_df.shape)

# 'strat_sample_df' now contains the randomly sampled data from both the users and movies with more than 10 ratings.


Original DataFrame shape: (24053575, 11)
User-Filtered DataFrame shape: (23343305, 11)
Movie-Filtered DataFrame shape: (23343305, 11)
Sampled DataFrame shape: (116717, 11)


In [10]:
num_sampled_rows = len(strat_sample_df)
print(f"Number of rows in the sampled DataFrame: {num_sampled_rows}")

Number of rows in the sampled DataFrame: 116717


# Splitting dataset to training, test

In [11]:
from sklearn.model_selection import train_test_split

# define user_ids to map to their index
user_ids = strat_sample_df['CustomerID'].unique()
movie_ids = strat_sample_df['MovieID'].unique()

# Create mappings based on the entire dataset
user_id_to_index = {user_id: index for index, user_id in enumerate(user_ids)}
movie_id_to_index = {movie_id: index for index, movie_id in enumerate(movie_ids)}

# spliting data to train and test
# later in k-fold validation we create validation set
training_data, testing_data = train_test_split(strat_sample_df, test_size=0.2, random_state=42)


In [12]:
# function to map index of users and movies for any data set in ordere to use in creating csr matrix and avoid inconsistencies between indices
def map_ids_to_indices(df, user_id_to_index, movie_id_to_index):
    """
    Map user IDs and movie IDs to their respective indices.

    Parameters:
    - df: DataFrame containing 'CustomerID', 'MovieID', and other columns.
    - user_id_to_index: Dictionary mapping user IDs to indices.
    - movie_id_to_index: Dictionary mapping movie IDs to indices.

    Returns:
    - DataFrame with added columns 'UserIndex' and 'MovieIndex' for the respective indices.
    """

    # Copy to avoid modifying the original DataFrame
    modified_data = df.copy()
    
    # Map 'CustomerID' to 'UserIndex'
    modified_data['UserIndex'] = modified_data['CustomerID'].map(user_id_to_index)
    
    # Map 'MovieID' to 'MovieIndex'
    modified_data['MovieIndex'] = modified_data['MovieID'].map(movie_id_to_index)
    
    # Optional: drop rows where either UserIndex or MovieIndex is NaN (i.e., ID wasn't found)
    modified_data.dropna(subset=['UserIndex', 'MovieIndex'], inplace=True)
    
    # Convert indices to integers (they might be floats due to NaN handling)
    modified_data['UserIndex'] = modified_data['UserIndex'].astype(int)
    modified_data['MovieIndex'] = modified_data['MovieIndex'].astype(int)
    
    return modified_data

In [13]:
# mapping training set to their index
mapped_training_data = map_ids_to_indices(training_data,user_id_to_index, movie_id_to_index)

In [14]:
# Calculate the size of each split
training_size = training_data.shape[0]  # Number of rows in the training data
training_size_mapp = mapped_training_data.shape[0]  # Number of rows in the training data
testing_size = testing_data.shape[0]  # Number of rows in the testing data

# Print the sizes
print(f"Training Data Size: {training_size}")
print(f"Training Data Size: {training_size_mapp}")
print(f"Testing Data Size: {testing_size}")

Training Data Size: 93373
Training Data Size: 93373
Testing Data Size: 23344


In [15]:
# Count unique MovieIDs in the final training data
unique_users_training = mapped_training_data['CustomerID'].nunique()

# Count unique MovieIDs in the testing data
unique_movies_testing = mapped_training_data['CustomerID'].nunique()

# Print the counts
print(f"Unique CustomerIDs in Training Data: {unique_users_training}")
print(f"Unique MovieIDs in Training Data: {unique_movies_testing}")


Unique CustomerIDs in Training Data: 71886
Unique MovieIDs in Testing Data: 21626


### Use of CSR Matrix Format

In this dataset, utilizing the Compressed Sparse Row (CSR) matrix format offers significant advantages:

1. **Memory Efficiency**: The CSR matrix format stores only the non-zero elements of the matrix, reducing memory usage substantially, especially in cases of sparse data where most entries are zero.

2. **Computational Efficiency**: Operations involving CSR matrices, such as matrix multiplication and addition, are optimized for sparse matrices, resulting in faster computations compared to dense matrices.

3. **Scalability**: With large datasets, using CSR matrices ensures scalability by efficiently handling sparse data structures without consuming excessive memory resources.

By leveraging the CSR matrix format, the memory-efficient representation of user-item ratings, along with additional feature matrices, can be effectively managed and processed, enabling robust analysis and modeling of the dataset while mitigating memory constraints.

In [17]:
# define values for user features
user_activity_values = mapped_training_data['user_activity'].values
AverageMovieAgeRated_values = mapped_training_data['AverageMovieAgeRated'].values
user_avg_rating_values = mapped_training_data['user_average_rating'].values


In [18]:
from scipy.sparse import csr_matrix, hstack

# Extract user indexes and movie indexes
user_indexes = mapped_training_data['UserIndex'].values
movie_indexes = mapped_training_data['MovieIndex'].values
ratings = mapped_training_data['Rating'].values

# Create the base user-item ratings CSR matrix
num_users = user_indexes.max() + 1
num_movies = movie_indexes.max() + 1
ratings_csr_matrix = csr_matrix((ratings, (user_indexes, movie_indexes)), shape=(num_users, num_movies))

# Create CSR matrices for features
user_activity_matrix = csr_matrix((user_activity_values, (user_indexes, np.zeros_like(user_indexes))), shape=(num_users, 1))
avg_movie_age_matrix = csr_matrix((AverageMovieAgeRated_values, (user_indexes, np.zeros_like(user_indexes))), shape=(num_users, 1))
user_avg_rating_matrix = csr_matrix((user_avg_rating_values, (user_indexes, np.zeros_like(user_indexes))), shape=(num_users, 1))


### Note 
I have created a full CSR matrix containing user features to mitigate sparsity and get better accuracy. but when calculating similarity matrix I enconter memory error so I will continue with simple rating CSR marix containing userid, movieid and ratings

In [19]:
# Horizontally stack the features matrices with the ratings CSR matrix
full_csr_matrix = hstack([ratings_csr_matrix, user_activity_matrix])


## Create similarity matrix 
### Use of Pre-Calculated Similarity Matrix

In the process of developing the recommendation system, I initially attempted to compute similarities between users using a for loop. However, this approach proved to be inefficient and led to a significant decrease in recommendation process speed. To address this issue and improve efficiency, I opted to use a pre-calculated similarity matrix.

#### Advantages of Pre-Calculated Similarity Matrix:
1. **Enhanced Performance**: By pre-calculating the similarity matrix, the recommendation process becomes much faster compared to computing similarities on-the-fly during each recommendation request.

2. **Optimized Resource Utilization**: Utilizing a pre-calculated similarity matrix optimizes resource utilization, as the computational overhead of similarity calculations is shifted to a one-time operation during the matrix creation phase.

#### Considerations and Limitations:
- **Memory Constraints**: While pre-calculating the similarity matrix addresses computational efficiency concerns, it may introduce memory constraints, especially with large datasets. Consequently, I had to be mindful of memory usage and may have skipped certain similarity methods, such as Manhattan distance, to avoid memory errors.

- **Challenges with Pearson Correlation**: Initially, I explored Pearson correlation as a similarity metric. However, due to its requirement for at least two common ratings for each pair of users, it often resulted in NaN (Not a Number) values for most data points. This limitation arises from the sparse nature of the dataset, where many user pairs lack sufficient common ratings for accurate correlation calculation.

- **Example of Missing Data Impact**: For instance, if two users rated only one movie in common, they might still share similar preferences. However, Pearson correlation cannot capture this similarity due to the lack of sufficient common ratings.

- **Sparse Data Challenges with Cosine Similarity**: Even with cosine similarity, which is well-suited for sparse data, there may be instances where multiple user pairs exhibit exactly the same similarity scores. This redundancy can potentially introduce complications in the recommendation process.


In [21]:
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix

cosine_similarity_matrix_csr = cosine_similarity(ratings_csr_matrix, dense_output=False)


## Predict ratings using similarities

The `predict_rating` function is designed to estimate the rating a target user would give to a specific movie based on the ratings of similar users. Below is a brief overview of the prediction process:

#### Steps:

1. **Identify Users Who Rated the Movie**:
   - The function first identifies users who have rated the movie of interest. This is done by extracting non-zero entries from the corresponding column of the user-item matrix.

2. **Extract Similarity Scores**:
   - Next, it extracts the similarity scores between the target user and all other users from the pre-calculated similarity matrix.

3. **Filter Similarities for Rated Users**:
   - The similarity scores are filtered to retain only those corresponding to users who have rated the movie.

4. **Select Top Similar Users**:
   - Among the users who have rated the movie, the function selects the top-k similar users based on their similarity scores.

5. **Retrieve Ratings of Top Similar Users**:
   - It retrieves the ratings of the movie from the top-k similar users in the user-item matrix.

6. **Calculate Weighted Average Rating**:
   - Using the similarity scores as weights, the function calculates a weighted sum of the ratings provided by the top-k similar users.

7. **Predict Rating**:
   - Finally, it predicts the rating for the target user by dividing the weighted sum by the sum of similarity scores. If the sum of similarity scores is zero, indicating no similar users, it defaults to the overall average rating of the movie.

#### Return:
- The function returns the predicted rating for the movie by the target user.

This prediction logic leverages the collaborative filtering approach, where recommendations are based on the preferences and behaviors of similar users. 

In [24]:
def predict_rating(csr_user_item_matrix, similarity_matrix, user_index, movie_index, k):
    """
    Predict the rating for a given movie by a target user, based on the ratings of top-k similar users.
    This function uses a pre-calculated similarity matrix.
    
    Parameters:
    - csr_user_item_matrix: CSR matrix representing the user-item matrix.
    - similarity_matrix: CSR matrix representing the similarity scores between users.
    - user_index: The index of the user for whom the rating is being predicted.
    - movie_index: The index of the movie for which the rating is being predicted.
    - k: Number of top similar users to consider for prediction.
    
    Returns:
    - Predicted rating for the movie by the target user.
    """
    # Step 1: Identify users who have rated the movie
    movie_rated_indices = csr_user_item_matrix[:, movie_index].nonzero()[0]
    
    # Check if user_index exists in the similarity matrix
    if user_index < 0 or user_index >= similarity_matrix.shape[0]:
        # If user_index does not exist in similarity matrix, return default prediction
        overall_average_rating = csr_user_item_matrix[:, movie_index].data.mean()
        return overall_average_rating if np.isfinite(overall_average_rating) else 3.0  # Assuming 3.0 as a neutral rating
    
    # Step 2: Extract similarity scores for the target user with all other users
    user_similarities = similarity_matrix.getrow(user_index).toarray().flatten()
    
    # Step 3: Filter the similarities for users who have rated the movie
    filtered_similarities = user_similarities[movie_rated_indices]
    
    # Step 4: Get indices of top k similar users among those who have rated the movie
    top_k_indices = np.argsort(filtered_similarities)[-k:]
    top_k_users_indices = movie_rated_indices[top_k_indices]
    top_k_similarities = filtered_similarities[top_k_indices]

    # Retrieve ratings for the movie from these top-k similar users
    top_k_ratings = csr_user_item_matrix[top_k_users_indices, movie_index].toarray().flatten()

    # Calculate the weighted average rating
    weighted_sum = np.dot(top_k_similarities, top_k_ratings)
    similarity_sum = np.sum(top_k_similarities)
    
    if similarity_sum > 0:
        predicted_rating = weighted_sum / similarity_sum
    else:
        # Use the overall average rating of the movie by all users as the default rating
        overall_average_rating = csr_user_item_matrix[:, movie_index].data.mean()
        predicted_rating = overall_average_rating if np.isfinite(overall_average_rating) else 3.0  # Assuming 3.0 as a neutral rating

    return predicted_rating


### Evaluation of Predictions

The `evaluate_predictions_csr` function assesses the performance of the recommendation system by predicting ratings for each user-movie pair in the validation set using a CSR matrix and a pre-computed similarity matrix and previous defined prediction function. The predictions are then compared to the actual ratings, and the root mean square error (RMSE) is calculated as a measure of prediction accuracy. Here's an overview of the evaluation process:

#### Steps:

1. **Iterate Over Validation Data**:
   - The function iterates over each row in the validation data, which typically contains 'UserIndex', 'MovieIndex', and 'Rating'.

2. **Predict Ratings**:
   - For each user-movie pair in the validation set, the function predicts the rating using the `predict_rating` function. This prediction is based on the pre-calculated CSR matrix representing the user-item matrix from the training set and the similarity matrix.

3. **Calculate Actual Ratings**:
   - The actual ratings from the validation data are stored for comparison.

4. **Compute RMSE**:
   - After obtaining both the actual and predicted ratings, the function calculates the root mean square error (RMSE) between them. RMSE is a commonly used metric to measure the average deviation of predicted ratings from the actual ratings.

#### Return:
- The function returns the RMSE value, representing the degree of error in the prediction of ratings compared to the ground truth ratings.


In [25]:

from sklearn.metrics import mean_squared_error
from math import sqrt
def evaluate_predictions_csr(validation_data, csr_user_item_matrix, similarity_matrix, k):
    """
    Evaluate the recommendation system by predicting ratings for each user-movie pair in the validation set
    using a CSR matrix and pre-computed similarity matrix, and comparing the predictions to the actual ratings using RMSE.

    Parameters:
    - validation_data: DataFrame containing 'UserIndex', 'MovieIndex', and 'Rating'.
    - csr_user_item_matrix: CSR matrix representing the user-item matrix from the training set.
    - similarity_matrix: Pre-computed similarity matrix as a CSR matrix.
    - k: The number of top similar users to consider when making predictions.
    
    Returns:
    - rmse: The root mean square error of the predicted ratings against the actual ratings.
    """
    actual_ratings = []
    predicted_ratings = []

    for _, row in validation_data.iterrows():
        user_index = row['UserIndex']
        movie_index = row['MovieIndex']
        actual_rating = row['Rating']
        
        # Check if the movie index is valid
        if movie_index in csr_user_item_matrix.indices:
            predicted_rating = predict_rating(csr_user_item_matrix, similarity_matrix, user_index, movie_index, k)
            actual_ratings.append(actual_rating)
            predicted_ratings.append(predicted_rating)
    
    # Calculate RMSE between actual and predicted ratings
    actual_ratings = np.array(actual_ratings)
    predicted_ratings = np.array(predicted_ratings)
    valid_mask = ~np.isnan(predicted_ratings)
    rmse = sqrt(mean_squared_error(actual_ratings[valid_mask], predicted_ratings[valid_mask]))

    return rmse


### Cross-Validation for Parameter Tuning

At this block we performs k-fold cross-validation to tune the hyperparameter 'k' for the K-Nearest Neighbors (KNN) recommendation system. 

#### Steps:

1. **Setup KFold Cross-Validation**:
   - Utilizes the `KFold` class from scikit-learn to create a k-fold cross-validation iterator. The data is split into 5 folds (`n_splits=5`) with shuffling enabled for randomization.

2. **Define Hyperparameters**:
   - Specifies a list of k values (`k_values`) to test. These represent the number of top similar users considered for prediction.

3. **Initialize Results Storage**:
   - Creates an empty list (`results`) to store the average RMSE for each k value over all folds.

4. **Cross-Validation Loop**:
   - Iterates over each k value.
   - Within each iteration, iterates over each fold generated by the KFold iterator.
   - For each fold, evaluates the predictions using the `evaluate_predictions_csr` function, passing the validation data, CSR matrix of user-item ratings, similarity matrix, and the current k value.
   - Computes the RMSE for each fold and stores it in `fold_rmses`.

5. **Average RMSE Calculation**:
   - Calculates the average RMSE for the current k value by taking the mean of all fold RMSEs.

6. **Results Storage**:
   - Appends a tuple `(k, avg_rmse)` to the `results` list, containing the k value and its corresponding average RMSE.

7. **Identify Best k Value**:
   - Finds the best k value based on the minimum average RMSE from the results list.
   - Prints the best k value along with its corresponding RMSE.

#### Return:
- The best k value along with its RMSE, providing insight into the optimal choice of hyperparameter for the KNN recommendation system.

This cross-validation procedure helps in selecting the most suitable k value that minimizes prediction errors and improves the overall performance of the recommendation system.

In [26]:
from sklearn.model_selection import KFold
import numpy as np

# Assuming k_values to test and your similarity matrix is already defined
k_values = [5, 20, 30, 50, 100, 200]
similarity_matrix = cosine_similarity_matrix_csr
# Setup KFold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Initialize results storage
results = []

for k in k_values:
    fold_rmses = []  # Store RMSEs for each fold

    for train_indices, test_indices in kf.split(ratings_csr_matrix):
        # Splitting your data: ratings_csr_matrix doesn't change, so you just map validation set indices
        validation_data_fold = mapped_training_data.iloc[test_indices]

        # Evaluate predictions on this fold's test set
        rmse = evaluate_predictions_csr(validation_data_fold, ratings_csr_matrix, similarity_matrix, k)
        fold_rmses.append(rmse)

    # Calculate average RMSE for this k over all folds
    avg_rmse = np.mean(fold_rmses)
    results.append((k, avg_rmse))
    print(f"k={k}, Average RMSE={avg_rmse}")

# Find the best k value based on average RMSE
best_k, best_rmse = min(results, key=lambda x: x[1])
print(f"Best k: {best_k} with RMSE: {best_rmse}")


k=5, Average RMSE=0.9508280757776708
k=20, Average RMSE=0.9585441076013355
k=30, Average RMSE=0.9618954676997873
k=50, Average RMSE=0.964437779735535
k=100, Average RMSE=0.9668930415522464
k=200, Average RMSE=0.9678785252100521
Best k: 5 with RMSE: 0.9508280757776708


In [27]:
# map testing data to their indecies
mapped_testing_data = map_ids_to_indices(testing_data, user_id_to_index, movie_id_to_index)

In [28]:
# calculate rmse with the best k on test set
similarity_matrix = cosine_similarity_matrix_csr
k = 5  # Example value for k
rmse = evaluate_predictions_csr(mapped_testing_data, ratings_csr_matrix, similarity_matrix, k)
print(f"RMSE: {rmse}")

RMSE: 1.0653943639743557


### Movie Recommendation Based on Predicted Ratings

Function `recommend_movies_prediction` generates movie recommendations for specified user(s) by predicting ratings and suggesting the top n movies with the highest predicted ratings. Here's a breakdown of its functionality:

#### Steps:

1. **Input Validation**:
   - Checks if the input `user_ids` is a list. If not, converts it to a list to handle both single and multiple user recommendations.

2. **Iterate Over User IDs**:
   - Loops through each user ID provided in the input list.

3. **User Index Retrieval**:
   - Retrieves the user index from the DataFrame (`df`) corresponding to the given user ID.

4. **Unrated Movies Detection**:
   - Finds the indices of movies that the user has not rated yet. This is achieved by comparing all movie indices against the nonzero indices of the user's row in the user-item matrix (`csr_user_item_matrix`).

5. **Predict Ratings for Unrated Movies**:
   - Iterates over the unrated movies' indices.
   - For each unrated movie, predicts the rating using the `predict_rating` function, passing the user index, movie index, K value, and similarity matrix.

6. **Top n Recommendations**:
   - Sorts the predicted ratings in descending order and selects the top n movies.
   - Constructs a list of tuples containing movie IDs and their corresponding predicted ratings.

7. **Return Recommendations**:
   - Constructs a dictionary where keys are user IDs and values are lists of top n movie recommendations with their predicted ratings.

#### Return:
- A dictionary containing user IDs as keys and lists of top n recommended movies with their predicted ratings as values.


In [29]:
import numpy as np

def recommend_movies_prediction(df, csr_user_item_matrix, similarity_matrix, user_ids, k, n):
    """
    Recommend top n movies for specified user(s) based on predicted ratings.
    Assumes 'UserIndex' and 'MovieIndex' are available in 'df'.
    """
    if not isinstance(user_ids, list):
        user_ids = [user_ids]

    recommendations = {}
    
    for user_id in user_ids:
        try:
            user_index = df[df['CustomerID'] == user_id]['UserIndex'].iloc[0]  # Assuming first matching UserIndex is representative
        except IndexError:
            print(f"User ID {user_id} not found.")
            continue

        unrated_movies_indices = np.setdiff1d(np.arange(csr_user_item_matrix.shape[1]),
                                               csr_user_item_matrix.getrow(user_index).nonzero()[1])
        
        predicted_ratings = []
        for movie_index in unrated_movies_indices:
            predicted_rating = predict_rating(csr_user_item_matrix, similarity_matrix, user_index, movie_index, k)
            movie_id = {v: k for k, v in movie_id_to_index.items()}[movie_index]  # Reverse lookup to get MovieID from MovieIndex
            predicted_ratings.append((movie_id, predicted_rating))

        top_n_recommendations = sorted(predicted_ratings, key=lambda x: x[1], reverse=True)[:n]
        recommendations[user_id] = top_n_recommendations

    return recommendations


In [30]:
# prepare main data for movie recommendation
main_mapped_data = map_ids_to_indices(strat_sample_df, user_id_to_index, movie_id_to_index)

In [31]:
# Extract the rows (user indices), columns (movie indices), and data (ratings) for the CSR matrix
rows = main_mapped_data['UserIndex'].values
cols = main_mapped_data['MovieIndex'].values
data = main_mapped_data['Rating'].values

# Determine the shape of the CSR matrix
# The shape is (max_user_index + 1, max_movie_index + 1) because indices start from 0
num_users = main_mapped_data['UserIndex'].max() + 1
num_movies = main_mapped_data['MovieIndex'].max() + 1

# Create the main CSR matrix
Main_csr_matrix = csr_matrix((data, (rows, cols)), shape=(num_users, num_movies))

In [32]:
# calculate the main similarity matrix
main_similarity_matrix = cosine_similarity(Main_csr_matrix, dense_output=False)

In [34]:
main_mapped_data.head()

Unnamed: 0,MovieID,CustomerID,Rating,Date,YearOfRelease,MovieTitle,RatingYear,user_activity,AverageMovieAgeRated,user_average_rating,scaled_movie_age,UserIndex,MovieIndex
8464880,1693,1851346,1,2002-12-15,1998,Sphere,2002,1,2,3,1.301075,0,0
6311316,1220,1710563,4,2004-12-03,2004,Man on Fire,2004,1,2,3,1.129032,1,1
17205447,3316,17864,3,2004-01-07,2002,Bartleby,2004,2,1,3,1.215054,2,2
22300142,4227,1673744,4,2000-02-26,1997,The Full Monty,2000,1,2,4,1.258065,3,3
6146126,1202,1321440,4,2004-06-25,1983,National Lampoon's Vacation,2004,1,2,4,2.032258,4,4


In [35]:
# testing recommendation function with a specific user
user_ids = ['1851346']  # Single user example
k = 5  # Number of similar users to consider
n = 5  # Number of recommendations to generate

# Generate recommendations
recommendations = recommend_movies_prediction(main_mapped_data, Main_csr_matrix, main_similarity_matrix, user_ids, k, n)

# Display the recommendations
for user_id in user_ids:
    print(f"Recommendations for User ID {user_id}:")
    if user_id in recommendations:
        for movie_id, predicted_rating in recommendations[user_id]:
            print(f"\tMovie ID: {movie_id}, Predicted Rating: {predicted_rating}")
    else:
        print("\tNo recommendations available.")


Recommendations for User ID 1851346:
	Movie ID: 30, Predicted Rating: 5.000000000000001
	Movie ID: 3122, Predicted Rating: 5.000000000000001
	Movie ID: 1495, Predicted Rating: 5.000000000000001
	Movie ID: 2780, Predicted Rating: 5.0
	Movie ID: 2342, Predicted Rating: 5.0


In [49]:
# Testing prediction function with one user to make sure system predicts all range of ratings (not just 5)
user_index = 2  # Replace with actual user ID
movie_index = 10  # Replace with actual movie ID
K = 200  # Number of neighbors
predicted_rating = predict_rating(Main_csr_matrix, main_similarity_matrix, user_index, movie_index, k)
print(f"Predicted rating for User ID {user_id} and Movie ID {movie_id} is: {predicted_rating} ")

Predicted rating for User ID 1851346 and Movie ID 886 is: 2.0 


### Movie Recommendation Based on Classification (Voting) Logic

In this section, we employ a classification approach to predict movie ratings for a target user. Unlike the previous method, which focused on predicting numerical ratings, the classification method treats ratings as categories (1 to 5 stars) and utilizes voting logic to determine the predicted rating. Here's a summary of the classification method:

#### Prediction Logic:

1. **Identify Users Who Rated the Movie**:
   - Determine the indices of users who have rated the target movie in the user-item matrix.

2. **Extract Similarity Scores**:
   - Retrieve the similarity scores between the target user and all other users from the pre-calculated similarity matrix.

3. **Filter Similarities for Rated Users**:
   - Filter the similarity scores to include only those users who have rated the movie of interest.

4. **Select Top K Similar Users**:
   - Identify the top K similar users among those who have rated the movie, based on their similarity scores.

5. **Retrieve Ratings and Calculate Votes**:
   - Retrieve the ratings of the movie from the top K similar users.
   - Count the votes for each rating category (1 to 5 stars), weighted by the similarity scores.

6. **Determine Predicted Rating**:
   - The predicted rating is determined by selecting the rating category with the highest sum of similarity-weighted votes.

7. **Handling Default Rating**:
   - If the predicted rating is not available (e.g., due to insufficient data), the overall average rating of the movie (nonzero) by all users is used as the default prediction.

#### Evaluation and Performance:

- The classification method results in predictions based on a voting scheme, where similar users contribute to the final prediction through their weighted votes.
- The Root Mean Square Error (RMSE) is used to evaluate the performance of the classification-based recommendation system.
- During evaluation, the best-performing value of K (number of top similar users) is determined empirically. In this case, the best K value was found to be 200, resulting in the lowest RMSE among tested values.

#### Conclusion:

- The classification method offers an alternative approach to predicting movie ratings, particularly suitable for scenarios where numerical predictions are challenging or less effective.
- Comparing to prediction method, classification method resulted in higher rmse with k=200

### Note
Since I have used the same functions and methods as before, I skip explaining each function again to respect the time of the reader.

In [36]:
from collections import defaultdict

def predict_rating_with_classification(csr_user_item_matrix, similarity_matrix, user_index, movie_index, k):
    """
    Predict the rating for a given movie by a target user, based on the ratings of top-k similar users using classification (voting) logic.
    This function uses a pre-calculated similarity matrix.

    Parameters:
    - csr_user_item_matrix: CSR matrix representing the user-item matrix.
    - similarity_matrix: CSR matrix representing the similarity scores between users.
    - user_index: The index of the user for whom the rating is being predicted.
    - movie_index: The index of the movie for which the rating is being predicted.
    - k: Number of top similar users to consider for prediction.

    Returns:
    - Predicted rating for the movie by the target user.
    """
    # Step 1: Identify users who have rated the movie
    movie_rated_indices = csr_user_item_matrix[:, movie_index].nonzero()[0]
    
    # Check if user_index exists in the similarity matrix
    if user_index < 0 or user_index >= similarity_matrix.shape[0]:
        # If user_index does not exist in similarity matrix, return default prediction
        non_zero_ratings = csr_user_item_matrix[:, movie_index][csr_user_item_matrix[movie_index] != 0]
        predicted_rating = non_zero_ratings.mean() if len(non_zero_ratings) > 0 else np.nan # calculate average just concidering non zero ratings 
  
    # Step 2: Extract similarity scores for the target user with all other users
    user_similarities = similarity_matrix.getrow(user_index).toarray().flatten()
    
    # Step 3: Filter the similarities for users who have rated the movie
    filtered_similarities = user_similarities[movie_rated_indices]
    
    # Step 4: Get indices of top k similar users among those who have rated the movie
    top_k_indices = np.argsort(filtered_similarities)[-k:]
    top_k_users_indices = movie_rated_indices[top_k_indices]
    top_k_similarities = filtered_similarities[top_k_indices]

    # Retrieve ratings for the movie from these top-k similar users
    top_k_ratings = csr_user_item_matrix[top_k_users_indices, movie_index].toarray().flatten()
    
    # Create a dictionary to hold the count of votes for each rating
    rating_votes = defaultdict(int)
    
   # Calculate weights based on similarities and count votes for each rating
    for similarity, rating in zip(top_k_similarities, top_k_ratings):
        if rating in [1, 2, 3, 4, 5]:
            rating_votes[rating] += similarity

    # Find the rating with the highest sum of similarity weights
    predicted_rating = max(rating_votes, key=rating_votes.get, default=np.nan)
    
    # Use the overall average rating of the movie by all users as the default rating
    if np.isnan(predicted_rating):
        non_zero_ratings = csr_user_item_matrix[:, movie_id][csr_user_item_matrix[movie_id] != 0]
        predicted_rating = non_zero_ratings.mean() if len(non_zero_ratings) > 0 else np.nan # calculate average just concidering non zero ratings 

    return predicted_rating


In [37]:

from sklearn.metrics import mean_squared_error
from math import sqrt
def evaluate_predictions_classification(validation_data, csr_user_item_matrix, similarity_matrix, k):
    """
    Evaluate the recommendation system by predicting ratings for each user-movie pair in the validation set
    using a CSR matrix and pre-computed similarity matrix, and comparing the predictions to the actual ratings using RMSE.

    Parameters:
    - validation_data: DataFrame containing 'UserIndex', 'MovieIndex', and 'Rating'.
    - csr_user_item_matrix: CSR matrix representing the user-item matrix from the training set.
    - similarity_matrix: Pre-computed similarity matrix as a CSR matrix.
    - k: The number of top similar users to consider when making predictions.
    
    Returns:
    - rmse: The root mean square error of the predicted ratings against the actual ratings.
    """
    actual_ratings = []
    predicted_ratings = []

    for _, row in validation_data.iterrows():
        user_index = row['UserIndex']
        movie_index = row['MovieIndex']
        actual_rating = row['Rating']
        
        # Check if the movie index is valid
        if movie_index in csr_user_item_matrix.indices:
            predicted_rating = predict_rating_with_classification(csr_user_item_matrix, similarity_matrix, user_index, movie_index, k)
            actual_ratings.append(actual_rating)
            predicted_ratings.append(predicted_rating)
    
    # Calculate RMSE between actual and predicted ratings
    actual_ratings = np.array(actual_ratings)
    predicted_ratings = np.array(predicted_ratings)
    valid_mask = ~np.isnan(predicted_ratings)
    rmse = sqrt(mean_squared_error(actual_ratings[valid_mask], predicted_ratings[valid_mask]))

    return rmse


In [38]:
from sklearn.model_selection import KFold
import numpy as np

# Assuming k_values to test and your similarity matrix is already defined
k_values = [5, 20, 30, 50, 100, 200]
similarity_matrix = cosine_similarity_matrix_csr
# Setup KFold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Initialize results storage
results = []

for k in k_values:
    fold_rmses = []  # Store RMSEs for each fold

    for train_indices, test_indices in kf.split(ratings_csr_matrix):
        # Splitting your data: ratings_csr_matrix doesn't change, so you just map validation set indices
        validation_data_fold = mapped_training_data.iloc[test_indices]

        # Evaluate predictions on this fold's test set
        rmse = evaluate_predictions_classification(validation_data_fold, ratings_csr_matrix, similarity_matrix, k)
        fold_rmses.append(rmse)

    # Calculate average RMSE for this k over all folds
    avg_rmse = np.mean(fold_rmses)
    results.append((k, avg_rmse))
    print(f"k={k}, Average RMSE={avg_rmse}")

# Find the best k value based on average RMSE
best_k, best_rmse = min(results, key=lambda x: x[1])
print(f"Best k: {best_k} with RMSE: {best_rmse}")


k=5, Average RMSE=1.1067832654649838
k=20, Average RMSE=1.1092461534740492
k=30, Average RMSE=1.1190036568259107
k=50, Average RMSE=1.1130927111297735
k=100, Average RMSE=1.0981874925760953
k=200, Average RMSE=1.0940047558440196
Best k: 200 with RMSE: 1.0940047558440196


In [42]:
# Usage example
similarity_matrix = cosine_similarity_matrix_csr
k = 200  # Example value for k
rmse = evaluate_predictions_classification(mapped_testing_data, ratings_csr_matrix, similarity_matrix, k)
print(f"RMSE: {rmse}")

RMSE: 1.4861601426380566


In [40]:
import numpy as np

def recommend_movies_classification(df, csr_user_item_matrix, similarity_matrix, user_ids, k, n):
    """
    Recommend top n movies for specified user(s) based on predicted ratings.
    Assumes 'UserIndex' and 'MovieIndex' are available in 'df'.
    """
    if not isinstance(user_ids, list):
        user_ids = [user_ids]

    recommendations = {}
    
    for user_id in user_ids:
        try:
            user_index = df[df['CustomerID'] == user_id]['UserIndex'].iloc[0]  # Assuming first matching UserIndex is representative
        except IndexError:
            print(f"User ID {user_id} not found.")
            continue

        unrated_movies_indices = np.setdiff1d(np.arange(csr_user_item_matrix.shape[1]),
                                               csr_user_item_matrix.getrow(user_index).nonzero()[1])
        
        predicted_ratings = []
        for movie_index in unrated_movies_indices:
            predicted_rating = predict_rating_with_classification(csr_user_item_matrix, similarity_matrix, user_index, movie_index, k)
            movie_id = {v: k for k, v in movie_id_to_index.items()}[movie_index]  # Reverse lookup to get MovieID from MovieIndex
            predicted_ratings.append((movie_id, predicted_rating))

        top_n_recommendations = sorted(predicted_ratings, key=lambda x: x[1], reverse=True)[:n]
        recommendations[user_id] = top_n_recommendations

    return recommendations


In [46]:
# testing recommendation function with one user 
user_ids = ['1851346']  # Single user example
k = 200  # Number of similar users to consider
n = 5  # Number of recommendations to generate

# Generate recommendations
recommendations = recommend_movies_classification(main_mapped_data, Main_csr_matrix, main_similarity_matrix, user_ids, k, n)

# Display the recommendations
for user_id in user_ids:
    print(f"Recommendations for User ID {user_id}:")
    if user_id in recommendations:
        for movie_id, predicted_rating in recommendations[user_id]:
            print(f"\tMovie ID: {movie_id}, Predicted Rating: {predicted_rating}")
    else:
        print("\tNo recommendations available.")

Recommendations for User ID 1851346:
	Movie ID: 2300, Predicted Rating: 5
	Movie ID: 1783, Predicted Rating: 5
	Movie ID: 2780, Predicted Rating: 5
	Movie ID: 3416, Predicted Rating: 5
	Movie ID: 886, Predicted Rating: 5


In [47]:
# Testing prediction function with one user to make sure system predicts all range of ratings (not just 5)
user_index = 1  # Replace with actual user ID
movie_index = 10  # Replace with actual movie ID
K = 200  # Number of neighbors
predicted_rating = predict_rating_with_classification (Main_csr_matrix, main_similarity_matrix, user_index, movie_index, k)
print(f"Predicted rating for User ID {user_id} and Movie ID {movie_id} is: {predicted_rating} ")

Predicted rating for User ID 1851346 and Movie ID 886 is: 3 
