## User-Based Collaborative Filtering Recommender System - MovieLens

**Nadia Aflatounian**

### Introduction

This Jupyter notebook focuses on the implementation of User-Based Collaborative Filtering (UBCF) using K-Nearest Neighbors (KNN) prediction and KNN classification techniques on the **MovieLens dataset**. The primary goal is to explore various similarity metrics and neighbor distances to optimize the recommendation process.

**Key Objectives**:
- Implement User-Based Collaborative Filtering.
- Explore different similarity metrics and neighbor distances.
- Develop robust logics for both rating prediction and classification tasks.

**Methods Implemented**:
1. **KNN Prediction**: Predicting ratings based on user-item interactions.
2. **KNN Classification**: Classifying users into groups based on their preferences.

**Approach**:
- For both KNN prediction and classification, different methods of similarity calculation and neighbor distance determination are thoroughly investigated.
- The aim is to identify the most suitable combination of similarity metric and neighbor distance that yields optimal results in terms of prediction accuracy and classification performance.

**Outcome**:
By the end of this notebook, insights into the effectiveness of various similarity measures and neighbor distance metrics will be gained. Additionally, well-defined logics for predicting ratings and classifying users will be established, contributing to the enhancement of the recommendation system's performance.


In [1]:
import pandas as pd
import numpy as np
training_df = pd.read_csv("C:/Users/nafla/Downloads/movielens.csv")
training_df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,num_genres,(no genres listed),Action,Adventure,Animation,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,1,4.0,2008-11-03 17:52:19,Toy Story (1995),5,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,1,5.0,1996-06-26 19:06:11,Toy Story (1995),5,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
2,7,1,4.0,2000-11-18 03:27:04,Toy Story (1995),5,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
3,10,1,3.0,2015-05-03 15:19:54,Toy Story (1995),5,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
4,12,1,5.0,1997-05-01 15:32:18,Toy Story (1995),5,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0


In [2]:
# change the column names according to code for integrity 
training_df.rename(columns={'userId': 'CustomerID', 'movieId': 'MovieID', 'rating': 'Rating'}, inplace=True)

In [3]:
# change data types
training_df['CustomerID'] = training_df['CustomerID'].astype(str)
training_df['MovieID'] = training_df['MovieID'].astype(str)
training_df['Rating'] = pd.to_numeric(training_df['Rating'], errors='coerce')  # Converts to float, makes non-numeric as NaN

In [4]:
# Calculate quantiles for user activity and item popularity
user_activity_quantiles = training_df['CustomerID'].value_counts().quantile([0.25, 0.5, 0.75])
item_popularity_quantiles = training_df['MovieID'].value_counts().quantile([0.25, 0.5, 0.75])
print(user_activity_quantiles)
print(item_popularity_quantiles)

0.25    15.0
0.50    32.0
0.75    95.0
Name: count, dtype: float64
0.25    1.0
0.50    2.0
0.75    8.0
Name: count, dtype: float64


In [5]:
num_sampled_rows = len(training_df)
print(f"Number of rows in the  DataFrame: {num_sampled_rows}")

Number of rows in the sampled DataFrame: 125351


In [40]:
print(training_df.dtypes)

CustomerID             object
MovieID                object
Rating                float64
timestamp              object
title                  object
num_genres              int64
(no genres listed)      int64
Action                  int64
Adventure               int64
Animation               int64
Children                int64
Comedy                  int64
Crime                   int64
Documentary             int64
Drama                   int64
Fantasy                 int64
Film-Noir               int64
Horror                  int64
IMAX                    int64
Musical                 int64
Mystery                 int64
Romance                 int64
Sci-Fi                  int64
Thriller                int64
War                     int64
Western                 int64
dtype: object


### Splitting dataset to training, test, validation

In [6]:
from sklearn.model_selection import train_test_split

# Split the  data into training, testing, and validation sets
train, testing_data = train_test_split(training_df, test_size=0.2, random_state=42)
training_data , validation_data = train_test_split(train, test_size=0.2, random_state=42)



In [7]:
# Calculate the size of each split
training_size = training_data.shape[0]  # Number of rows in the training data
validation_size = validation_data.shape[0]  # Number of rows in the validation data
testing_size = testing_data.shape[0]  # Number of rows in the testing data

# Print the sizes
print(f"Training Data Size: {training_size}")
print(f"Validation Data Size: {validation_size}")
print(f"Testing Data Size: {testing_size}")

Training Data Size: 80224
Validation Data Size: 20056
Testing Data Size: 25071


### Creating User - Item matrix

In [9]:
# Creating customer-movie matrix
user_item_matrix = training_data.pivot_table(index='CustomerID', columns='MovieID', values='Rating').fillna(0)

In [10]:
user_item_matrix.head()

MovieID,1,10,100,1000,100008,100044,100058,100083,100087,100106,...,99764,99809,99843,999,99910,99912,99917,99957,99964,99986
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1000,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1001,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Define similarity functions 

First all neccessary similarity functions are defined. Later these function will be used to find the best similarity method for rating prediction. Due to memory error issues precalculated similarity matrixes are not used for model training part ( memory error while calculation rmse for different values for k )

#### Cosine similarity function
- Input Parameters: This function takes two parameters:

user_item_matrix: DataFrame containing user-item matrix with ratings. Users are represented by rows, movies by columns, and ratings by values.
target_user_ratings: Series containing the target user's ratings, indexed by MovieID. This parameter represents the ratings given by the target user to various movies.

- Initialization:

The target_user_ratings Series is converted to a DataFrame (target_user_df) with the target user's ratings as a row. This step is necessary for compatibility with the cosine_similarity function.
The user_item_matrix DataFrame is aligned with target_user_df to match columns (MovieIDs). Missing values are filled with 0s.

- Calculating Cosine Similarities:

Cosine similarities are calculated between the target user's ratings and all other users' ratings using the cosine_similarity function. This function computes the cosine similarity scores between two arrays of ratings.

- Flattening and Indexing:

The resulting similarity array is flattened, and a Series (similarities_series) is created with user IDs as the index. Each value in the Series represents the cosine similarity score between the target user and a particular user in the user-item matrix.
Return:

\

In [15]:
from sklearn.metrics.pairwise import cosine_similarity

# This function computes cosine similarity scores between the target user's ratings and the ratings of all other users in the user-item matrix, 
# providing a measure of similarity between users based on their movie preferences.
def calculate_cosine_similarity(user_item_matrix, target_user_ratings):
    """
    Calculate cosine similarity scores between a target user's ratings and all other users.
    
    Parameters:
    - user_item_matrix: DataFrame with users as rows, movies as columns, and ratings as values.
    - target_user_ratings: Series containing the target user's ratings, indexed by MovieID.
    
    Returns:
    A Series with user IDs as the index and the cosine similarity scores as the values.
    """
    # Ensure target_user_ratings is a DataFrame row for compatibility with cosine_similarity
    target_user_df = pd.DataFrame(target_user_ratings).T.fillna(0)
    
    # Align user_item_matrix with target_user_df to match columns (MovieIDs)
    aligned_user_item_matrix = user_item_matrix.reindex(columns=target_user_df.columns, fill_value=0)
    
    # Calculate cosine similarities
    similarities = cosine_similarity(aligned_user_item_matrix, target_user_df)
    
    # Flatten the similarities array and create a Series with user IDs as index
    similarities_series = pd.Series(similarities.flatten(), index=aligned_user_item_matrix.index)
    
    return similarities_series



### Pearson similarity function 

- Input Parameters: This function takes two parameters:

user_item_matrix: DataFrame containing user-item matrix with ratings. Users are represented by rows, movies by columns, and ratings by values.
target_user_ratings: Series containing the target user's ratings, indexed by MovieID. This parameter represents the ratings given by the target user to various movies.

- Initialization:

The function initializes an empty dictionary similarities to store Pearson correlation coefficients between the target user and all other users.
The function replaces all 0s in the user_item_matrix with NaN values to consider only non-zero ratings for similarity calculation.

- Calculating Pearson Correlation:

The function iterates through each user in the user_item_matrix.
For each user, it finds the common movies rated by both the target user and the current user.
If the number of common movies is less than 2, indicating insufficient data for correlation calculation, a default similarity score of 0 is assigned, assuming no similarity.
If there are at least 2 common movies, it calculates the Pearson correlation coefficient between the ratings of the target user and the current user for these common movies using the pearsonr function from SciPy.
If the correlation coefficient is finite (not NaN), it is stored in the similarities dictionary.

- Converting to Series and Sorting:

The similarities dictionary is converted to a pandas Series (similarity_series) with user IDs as the index and Pearson correlation coefficients as the values.
The Series is sorted in descending order of correlation coefficients.
Return:



In [12]:

from scipy.stats import pearsonr

# defining function to calculate pearson correlation for pair of users
def calculate_pearson_similarity(user_item_matrix, target_user_ratings):
    """
    Calculate the Pearson correlation coefficient between a target user's ratings
    and all other users' ratings in the user-item matrix.
    
    Parameters:
    - user_item_matrix: DataFrame with users as rows, movies as columns, and ratings as values.
    - target_user_ratings: Series containing the target user's ratings, indexed by MovieID.
    
    Returns:
    - A Series with user IDs as the index and the Pearson correlation coefficients as the values.
    """
    similarities = {}
    
    user_item_matrix_replaced = user_item_matrix.replace(0, np.nan)

    # Loop through each user in the user-item matrix
    for user_id, user_ratings in user_item_matrix_replaced.iterrows():
        # Find common movies that both the target user and the current user have rated
        common_movies = user_ratings.dropna().index.intersection(target_user_ratings.dropna().index)
        
        if len(common_movies) < 2:
            # Assign a default similarity score of 0 for pairs with fewer than 2 common ratings assuming they don't have any similarity
            similarities[user_id] = 0 
            continue  # Skip further calculations for this pair
        correlation, _ = pearsonr(user_ratings.loc[common_movies], target_user_ratings.loc[common_movies])
        if np.isfinite(correlation):
            similarities[user_id] = correlation

    # Convert the similarities dictionary to a pandas Series
    similarity_series = pd.Series(similarities, name='Similarity').sort_values(ascending=False)
    
    return similarity_series


#### Manhattan similarity function 

- Input Parameters: This function takes two parameters:

user_item_matrix: DataFrame where rows represent users, columns represent items (movies), and values represent ratings given by users to items.
target_user_ratings: Series or dictionary containing the target user's movie ratings. The index represents movie IDs, and the values represent ratings.
- Initialization:

The function initializes an empty dictionary similarities to store Manhattan distance-based similarity scores between the target user and all other users.
- Calculating Manhattan Distance:

The function iterates through each user in the user_item_matrix.
For each user, it retrieves their ratings from the user_item_matrix.
It calculates the Manhattan distance between the ratings of the target user and the current user for the movies they both have rated.
If there are no common movies between the target user and the current user, indicating no data for comparison, a similarity score of 0 is assigned.
Otherwise, the Manhattan distance is transformed into a similarity score using the formula 1 / (1 + distance). This transformation ensures that higher distance values result in lower similarity scores and vice versa.
- Converting to Series and Sorting:

The similarities dictionary is converted to a pandas Series (similarity_series) with user IDs as the index and similarity scores as the values.
The Series is sorted in descending order of similarity scores.
Return:

The function returns the Series containing user IDs as the index and their corresponding Manhattan distance-based similarity scores as the values.

In [13]:
# defining function to calculate manhattan distance for ratings of each pair of users
def calculate_manhattan_similarity(user_item_matrix, target_user_ratings):
    """
    Calculate user similarities using Manhattan distance, comparing target user's ratings
    with those in the user_item_matrix.
    
    Parameters:
    - user_item_matrix: DataFrame where rows are users, columns are items, and values are ratings (training data).
    - target_user_ratings: Series or dict containing the target user's movie ratings.
    
    Returns:
    A Series with user IDs as the index and the similarity scores as the values.
    """
    similarities = {}

    for user_id in user_item_matrix.index:
        user_ratings = user_item_matrix.loc[user_id]
        
        # Calculate distance only for movies both have rated
        common_movies = user_ratings.index.intersection(target_user_ratings.index)
        if not common_movies.empty:
            distance = np.nansum(np.abs(user_ratings[common_movies] - target_user_ratings[common_movies]))
            similarity = 1 / (1 + distance) if distance != 0 else 0
        else:
            similarity = 0  # No common movies means no similarity
        
        similarities[user_id] = similarity

    similarity_series = pd.Series(similarities, name="Similarity").sort_values(ascending=False)
    return similarity_series




#### Predict ratings using similarities



The function takes in a user-item matrix, the target user's ratings, a movie ID, the number of similar users to consider (`k`), and a method to calculate user similarity.

**Similarity Calculation**: It calculates similarity scores between the target user and all other users using the specified method. the similarity method can be changed dynamicly

**Filtering Users**: It first filters users who have rated the specified movie from the user-item matrix for more accuracy.

**Retaining Similarity Scores**: retains similarity scores only for users who have rated the movie.

**Selecting Top-k Similar Users**: selects the top-k similar users based on the retained similarity scores.

**Retrieving Ratings**: It retrieves ratings for the specified movie from the top-k similar users.

**Calculating Weighted Average Rating**:
   - If ratings are available from the top-k users, it calculates a weighted average rating for the movie by considering both ratings and similarity scores of those users.
   - If no ratings are available from the top-k users, it handles this situation:
     - If the movie has been rated by any user, it returns the average rating for that movie.
     - If the movie has not been rated by anyone, it returns a default rating.

**Output**: The function returns the predicted rating for the movie by the target user.

In [14]:
# defining a function to predict rating for each user and movie based on user similarity similarity
def predict_rating(user_item_matrix, target_user_ratings, movie_id, k, similarity_method):

    """
    Predict the rating for a given movie by a target user, based on the ratings of top-k similar users.
    
    Parameters:
    - user_item_matrix: DataFrame with users as rows, movies as columns, and ratings as values.
    - target_user_id: The ID of the user for whom the rating is being predicted.
    - movie_id: The ID of the movie for which the rating is being predicted.
    - k: Number of top similar users to consider for prediction.
    - similarity_method: Function to calculate similarity scores between users.
    
    Returns:
    - Predicted rating for the movie by the target user.
    """

    
    # Calculate similarity scores between the target user and all others
    similarities = similarity_method(user_item_matrix, target_user_ratings)
    
   # Filter users who have rated the movie
    users_who_rated_movie = user_item_matrix.index[user_item_matrix[movie_id].notnull()]
    users_with_similarity_scores = similarities.index.intersection(users_who_rated_movie)

    # Retain similarity scores for users who have rated the movie
    similarities_filtered = similarities.loc[users_with_similarity_scores]
    
    # Filter top-k similar users from those who have rated the movie
    top_k_users = similarities_filtered.nlargest(k).index
    
    # Retrieve ratings for the movie from these top-k similar users
    top_k_ratings = user_item_matrix.loc[top_k_users, movie_id]
    
 # Calculate weighted average rating
    if not top_k_ratings.isnull().all():
        weighted_ratings = top_k_ratings * similarities.loc[top_k_users]
        predicted_rating = weighted_ratings.sum() / similarities.loc[top_k_users].sum()
    else:
        # Use the average rating for the movie if available
        if user_item_matrix[movie_id].notnull().any():
            non_zero_ratings = user_item_matrix[movie_id][user_item_matrix[movie_id] != 0]
            predicted_rating = non_zero_ratings.mean() if len(non_zero_ratings) > 0 else np.nan # calculate average just concidering non zero ratings 
        else:
            # Default rating if the movie has not been rated by anyone
            predicted_rating = 2.5 # concidering 2.5 as neutral 
            
    return predicted_rating




### Evaluation function 

now previous defined functions will be used to calculate RMSE using following function 

**Input Parameters**: The function takes in the validation data (DataFrame containing 'CustomerID', 'MovieID', and 'Rating'), the user-item matrix from the training set, the number of top similar users to consider (`k`), and a method to calculate user similarity.

**Actual and Predicted Ratings Initialization**: It initializes empty lists to store actual and predicted ratings.

**Grouping Validation Data by User**: It groups the validation data by 'CustomerID' and creates a mapping of user IDs to their corresponding ratings.

**Iterating Over Validation Data**: It iterates over each row in the validation data. For each row:
   - It extracts the user ID, movie ID, and actual rating.
   - It retrieves the user's ratings as a Series from the mapping created in step 3.
   - It checks if the movie exists in the training data and if the user has provided ratings.
   - If conditions are met, it predicts the rating for the movie using the `predict_rating` and `similarity_method` function.
   - It appends the actual and predicted ratings to their respective lists.

**RMSE Calculation**: Once all actual and predicted ratings are collected, it calculates the root mean square error (RMSE) between the actual and predicted ratings using the `mean_squared_error` function from scikit-learn.

**Output**: The function returns the RMSE value, indicating the accuracy of the predicted ratings against the actual ratings in the validation data.

In [15]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# defining a function to evaluate accuracy of prediced ratings for each pair of user and movie
def evaluate_predictions(validation_data, user_item_matrix, k, similarity_method):
    """
    Evaluate the recommendation system by predicting ratings for each user-movie pair in the validation set
    and comparing the predictions to the actual ratings using RMSE.
    
    Parameters:
    - validation_data: DataFrame containing 'CustomerID', 'MovieID', and 'Rating'.
    - user_item_matrix: DataFrame representing the user-item matrix from the training set.
    - k: The number of top similar users to consider when making predictions.
    - similarity_method: The function to calculate similarity scores between users.
    
    Returns:
    - rmse: The root mean square error of the predicted ratings against the actual ratings.
    - movie_evaluated_count: The number of unique movies for which predictions were made.
    """
    actual_ratings = []
    predicted_ratings = []

    user_ratings_map = validation_data.groupby('CustomerID').apply(lambda x: x.set_index('MovieID')['Rating'])

    # Iterate over each row in the validation data
    # Inside evaluate_predictions, before calling predict_rating:
    for _, row in validation_data.iterrows():
        user_id = row['CustomerID']
        movie_id = row['MovieID']
        actual_rating = row['Rating']
    
    # Prepare target_user_ratings as a Series
    # This assumes you have a way to extract all ratings for user_id from validation_data
    # Here's a placeholder for how you might do this, replace with your actual logic
        target_user_ratings = user_ratings_map.get(user_id, pd.Series(dtype='float64'))

    # Now call predict_rating with target_user_ratings instead of target_user_id
    if movie_id in user_item_matrix.columns and not target_user_ratings.empty:  # Check if movie exists in training data
        predicted_rating = predict_rating(user_item_matrix, target_user_ratings, movie_id, k, similarity_method)
        actual_ratings.append(actual_rating)
        predicted_ratings.append(predicted_rating)
        
    
    # Calculate RMSE between actual and predicted ratings
    rmse = sqrt(mean_squared_error(actual_ratings, predicted_ratings))
    
    return rmse




### Grid Search
Conductiong Grid Search for different combination of similarity methods and values fot K

In [16]:

# Define a dictionary to hold your similarity methods for easy access
similarity_methods = {
    'pearson': calculate_pearson_similarity,
    'cosine': calculate_cosine_similarity,
    'manhattan': calculate_manhattan_similarity  
}

# Define the range of k values you want to test
k_values = range(5, 300, 20)

# Placeholder for storing grid search results
grid_search_results = []

# Perform grid search
for k in k_values:
    for method_name, method_function in similarity_methods.items():
        # Evaluate the recommender system's performance for each combination of k and similarity method
        rmse = evaluate_predictions(validation_data, user_item_matrix, k, method_function)
        
        # Store the results
        grid_search_results.append({'method': method_name, 'k': k, 'rmse': rmse})
        
        # Optionally print the results for each iteration
        print(f"Evaluated {method_name} method with k={k}: RMSE = {rmse}")

# Find the best performing combination of k and similarity method based on RMSE
best_configuration = min(grid_search_results, key=lambda x: x['rmse'])

# Output the best combination found
print(f"Best Configuration: Method = {best_configuration['method']}, k = {best_configuration['k']}, RMSE = {best_configuration['rmse']}")




Evaluated pearson method with k=5: RMSE = 1.0
Evaluated cosine method with k=5: RMSE = 0.266099347709436
Evaluated manhattan method with k=5: RMSE = 0.31744765154224996




Evaluated pearson method with k=25: RMSE = 0.88
Evaluated cosine method with k=25: RMSE = 0.2544680613344197
Evaluated manhattan method with k=25: RMSE = 0.12164465837838923




Evaluated pearson method with k=45: RMSE = 0.6545487200724738
Evaluated cosine method with k=45: RMSE = 0.32094258461903447
Evaluated manhattan method with k=45: RMSE = 0.25736731649934164




Evaluated pearson method with k=65: RMSE = 0.552106540411931
Evaluated cosine method with k=65: RMSE = 0.37231580317879254
Evaluated manhattan method with k=65: RMSE = 0.3468774527642984




Evaluated pearson method with k=85: RMSE = 0.5453080590675472
Evaluated cosine method with k=85: RMSE = 0.3237758367520186
Evaluated manhattan method with k=85: RMSE = 0.40117386998915083




Evaluated pearson method with k=105: RMSE = 0.533404753725965
Evaluated cosine method with k=105: RMSE = 0.3684959649591414
Evaluated manhattan method with k=105: RMSE = 0.440401559041222




Evaluated pearson method with k=125: RMSE = 0.5314169890829787
Evaluated cosine method with k=125: RMSE = 0.4418894199833283
Evaluated manhattan method with k=125: RMSE = 0.4673793960297603




Evaluated pearson method with k=145: RMSE = 0.5314169890829785
Evaluated cosine method with k=145: RMSE = 0.4760509381998136
Evaluated manhattan method with k=145: RMSE = 0.4839459392009612




Evaluated pearson method with k=165: RMSE = 0.5314169890829787
Evaluated cosine method with k=165: RMSE = 0.50921925757295
Evaluated manhattan method with k=165: RMSE = 0.5298544943566895




Evaluated pearson method with k=185: RMSE = 0.5314169890829787
Evaluated cosine method with k=185: RMSE = 0.5491536480733257
Evaluated manhattan method with k=185: RMSE = 0.5760755542844112




Evaluated pearson method with k=205: RMSE = 0.5314169890829786
Evaluated cosine method with k=205: RMSE = 0.5753234027541527
Evaluated manhattan method with k=205: RMSE = 0.613823953169499




Evaluated pearson method with k=225: RMSE = 0.5314169890829787
Evaluated cosine method with k=225: RMSE = 0.6040526089176543
Evaluated manhattan method with k=225: RMSE = 0.6450854972192552




Evaluated pearson method with k=245: RMSE = 0.5314169890829787
Evaluated cosine method with k=245: RMSE = 0.6185319004984662
Evaluated manhattan method with k=245: RMSE = 0.6715049703881482




Evaluated pearson method with k=265: RMSE = 0.5314169890829787
Evaluated cosine method with k=265: RMSE = 0.6387312402118386
Evaluated manhattan method with k=265: RMSE = 0.6700150690837794




Evaluated pearson method with k=285: RMSE = 0.5314169890829785
Evaluated cosine method with k=285: RMSE = 0.639106138861109
Evaluated manhattan method with k=285: RMSE = 0.6911994795260131
Best Configuration: Method = manhattan, k = 25, RMSE = 0.12164465837838923


Now we calculate RMSE on Testing data with optimized K and similarity method

In [17]:
# Testing the model on unseen data
rmse_test = evaluate_predictions(testing_data, user_item_matrix, 25, calculate_manhattan_similarity)
print(f"RMSE on Testing Data: {rmse_test}")


RMSE on Testing Data: 2.3407007996073323


In [18]:
# Creating customer-movie matrix
main_user_item_matrix = training_df.pivot_table(index='CustomerID', columns='MovieID', values='Rating').fillna(0)
main_user_item_matrix.index = main_user_item_matrix.index.astype(str)

In [19]:
main_user_item_matrix.head()

MovieID,1,10,100,1000,100008,100044,100046,100058,100083,100087,...,99809,99843,999,99906,99910,99912,99917,99957,99964,99986
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1000,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1001,5.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Due to memory error problems, after finding the best similarity method and k we will use pre-calculated similarity matrix for movie recommendations 

In [20]:

from sklearn.metrics.pairwise import manhattan_distances

# function to calculate mahattan distance and return a similarity matrix
def create_similarity_matrix_manhattan(user_item_matrix):
    """
    Create a user-user similarity matrix using Manhattan distance.

    Parameters:
    - user_item_matrix: DataFrame representing the user-item matrix 
                        (users as rows, items as columns).

    Returns:
    - DataFrame representing the user-user similarity matrix.
    """
    # Replace NaN values with 0s for the distance calculation
    user_item_matrix_filled = user_item_matrix.fillna(0)
    
    # Calculate the Manhattan distances between users
    distances = manhattan_distances(user_item_matrix_filled, user_item_matrix_filled)
    
    # Convert distances to similarities. The +1 to avoid division by zero for identical users.
    similarities = 1 / (1 + distances)
    
    # Create a DataFrame for the similarity matrix
    similarity_matrix = pd.DataFrame(similarities, index=user_item_matrix.index, columns=user_item_matrix.index)
    
    return similarity_matrix



In [21]:
# creating main similarity matrix
similarity_matrix_manhattan = create_similarity_matrix_manhattan(main_user_item_matrix)

# Show the similarity matrix
print(similarity_matrix_manhattan.head())

CustomerID         1        10       100      1000      1001      1002  \
CustomerID                                                               
1           1.000000  0.001483  0.003205  0.002445  0.002010  0.003317   
10          0.001483  1.000000  0.001619  0.001606  0.001284  0.001631   
100         0.003205  0.001619  1.000000  0.003436  0.002789  0.009132   
1000        0.002445  0.001606  0.003436  1.000000  0.002153  0.003350   
1001        0.002010  0.001284  0.002789  0.002153  1.000000  0.003086   

CustomerID      1003      1004      1005      1006  ...       990       991  \
CustomerID                                          ...                       
1           0.002328  0.003160  0.003868  0.003868  ...  0.003263  0.002548   
10          0.001376  0.001610  0.001779  0.001754  ...  0.001629  0.001435   
100         0.003350  0.007663  0.013793  0.013793  ...  0.007782  0.004619   
1000        0.002307  0.003190  0.003992  0.003914  ...  0.003284  0.002721   
1001   

### Movie recommendation function

due to memory error this function uses pre calculater similarity matirx and doesn't use previous prediction function, instead the prediction logic is implemented inside the function 

1. **Input Parameters**: The function takes in the user ID for whom recommendations are to be generated, the user-item matrix (DataFrame representing users as rows and movies as columns), a pre-calculated similarity matrix (DataFrame representing similarities between users), a DataFrame or Series mapping MovieIDs to movie titles, and an optional parameter `N` indicating the number of movies to recommend.

2. **User ID Validation**: It ensures that the `user_id` parameter is converted to a string to match the index type in the user-item matrix.

3. **Finding Similar Users**: It identifies the top 25 most similar users to the target user based on the pre-calculated similarity matrix.

4. **Predicting Ratings**: For movies that the user hasn't seen (i.e., movies with NaN or 0 ratings in the user-item matrix), it predicts ratings by calculating a weighted average of ratings from the top similar users. It iterates over each movie:
   - For each similar user, it checks if they have rated the movie.
   - If so, it calculates the weighted sum of their rating using the similarity score as the weight.
   - It then predicts the rating for the movie by dividing the weighted sum by the total weight (sum of similarity scores).

5. **Selecting Top N Recommendations**: It sorts the predicted ratings and selects the top N movies with the highest predicted ratings.

6. **Fetching Movie Titles**: For each recommended movie, it fetches the corresponding movie title from the provided `movie_titles` DataFrame or Series.

7. **Output**: The function returns a list of tuples containing (MovieID, Movie Title, Predicted Rating) for the top N recommended movies.

In [30]:
# defining a new prediction function that use pre calculated similarity matrix
def recommend_movies(user_id, user_item_matrix, similarity_matrix, movie_titles, N=10):
    """
    Recommend top N movies for a given user using a pre-calculated similarity matrix.

    Parameters:
    - user_id: The ID of the user for whom to generate recommendations.
    - user_item_matrix: DataFrame representing the user-item matrix (users as rows, movies as columns).
    - similarity_matrix: DataFrame representing the pre-calculated similarities between users.
    - movie_titles: DataFrame or Series mapping MovieIDs to movie titles.
    - N: Number of movies to recommend.

    Returns:
    - A list of tuples with (MovieID, Movie Title, Predicted Rating) for the top N recommended movies.
    """
    # Ensure user_id is the correct type
    user_id = str(user_id)
    
    # Get the top 25 most similar users to the target user
    top_25_users = similarity_matrix.loc[user_id].sort_values(ascending=False).head(25).index
    
    # Predict ratings for movies the user hasn't seen
    predicted_ratings = {}
    for movie_id in user_item_matrix.columns:
        # Skip if the user has already rated this movie
        if not pd.isna(user_item_matrix.at[user_id, movie_id]) and user_item_matrix.at[user_id, movie_id] != 0:
            continue
        
        # Calculate the weighted average of ratings from the top 25 similar users
        total_weight = 0
        weighted_sum = 0
        for similar_user in top_25_users:
            # Check if the similar user has rated the movie
            if pd.isna(user_item_matrix.at[similar_user, movie_id]) or user_item_matrix.at[similar_user, movie_id] == 0:
                continue
            similarity_score = similarity_matrix.at[user_id, similar_user]
            rating = user_item_matrix.at[similar_user, movie_id]
            weighted_sum += similarity_score * rating
            total_weight += similarity_score
        
        # Predict the rating if there were any weights, otherwise default to 0
        predicted_rating = weighted_sum / total_weight if total_weight > 0 else 0
        predicted_ratings[movie_id] = predicted_rating
    
    # Sort the predicted ratings and select the top N
    top_n_recommendations = sorted(predicted_ratings.items(), key=lambda x: x[1], reverse=True)[:N]
    
    # Fetch the titles for the recommended movies
    recommendations = [(movie_id, movie_titles[movie_id], rating) for movie_id, rating in top_n_recommendations]
    
    return recommendations


#### Movie recommendation to a given user 

In [31]:
# Extract movie titles
movie_titles = training_df[['MovieID', 'title']].drop_duplicates().set_index('MovieID')['title']

# Use the recommend function for a specific user
user_id = '100'  

# Call the recommendation function
recommendations = recommend_movies(user_id, main_user_item_matrix, similarity_matrix_manhattan, movie_titles, N=20)

# Print the recommendations
for movie_id, title, rating in recommendations:
    print(f"{title} (MovieID: {movie_id}) - Predicted Rating: {rating:.2f}")

Perks of Being a Wallflower, The (2012) (MovieID: 96821) - Predicted Rating: 4.50
L.A. Confidential (1997) (MovieID: 1617) - Predicted Rating: 4.00
Die Hard: With a Vengeance (1995) (MovieID: 165) - Predicted Rating: 4.00
Dr. Dolittle (1998) (MovieID: 1911) - Predicted Rating: 4.00
Bug's Life, A (1998) (MovieID: 2355) - Predicted Rating: 4.00
Tank Girl (1995) (MovieID: 327) - Predicted Rating: 4.00
Carlito's Way (1993) (MovieID: 431) - Predicted Rating: 4.00
American Pie 2 (2001) (MovieID: 4718) - Predicted Rating: 4.00
Sabrina (1954) (MovieID: 915) - Predicted Rating: 3.50
Lethal Weapon 2 (1989) (MovieID: 2001) - Predicted Rating: 3.00
Dirty Dancing (1987) (MovieID: 1088) - Predicted Rating: 2.50
Dovlatov (2018) (MovieID: 184989) - Predicted Rating: 2.50
Shawshank Redemption, The (1994) (MovieID: 318) - Predicted Rating: 2.49
Star Wars: Episode IV - A New Hope (1977) (MovieID: 260) - Predicted Rating: 2.01
Deep Blue Sea (1999) (MovieID: 2722) - Predicted Rating: 2.00
Lara Croft: Tomb 

## KNN Classification 

In this section KNN classification method is implemented for predicting rating and making movie recommendations 

In [4]:
# creating categorical column with values (1, 2, 3, 4, 5 ) for ratings for classification purpose
training_df['CategoricalRating'] = training_df['Rating'].astype(int)

# Display the DataFrame to verify the changes
print(training_df.head())

  CustomerID MovieID  Rating            timestamp             title  \
0          1       1     4.0  2008-11-03 17:52:19  Toy Story (1995)   
1          2       1     5.0  1996-06-26 19:06:11  Toy Story (1995)   
2          7       1     4.0  2000-11-18 03:27:04  Toy Story (1995)   
3         10       1     3.0  2015-05-03 15:19:54  Toy Story (1995)   
4         12       1     5.0  1997-05-01 15:32:18  Toy Story (1995)   

   num_genres  (no genres listed)  Action  Adventure  Animation  ...  Horror  \
0           5                   0       0          1          1  ...       0   
1           5                   0       0          1          1  ...       0   
2           5                   0       0          1          1  ...       0   
3           5                   0       0          1          1  ...       0   
4           5                   0       0          1          1  ...       0   

   IMAX  Musical  Mystery  Romance  Sci-Fi  Thriller  War  Western  \
0     0        0      

In [5]:
# creating dataframe with required columns
knn_data = training_df[['CustomerID', 'MovieID', 'CategoricalRating']]

# Display the prepared DataFrame
print(knn_data.head())

  CustomerID MovieID  CategoricalRating
0          1       1                  4
1          2       1                  5
2          7       1                  4
3         10       1                  3
4         12       1                  5


In [8]:
from sklearn.model_selection import train_test_split
classification_train, classification_test = train_test_split(knn_data, test_size=0.2, random_state=42)
train_set, val_set = train_test_split(classification_train, test_size=0.2, random_state=42)

In [9]:
# Creating the user-item matrix
train_set_matrix = train_set.pivot_table(index='CustomerID', columns='MovieID', values='CategoricalRating')

# Fill NA values with 0 or a placeholder indicating unrated
train_set_matrix.fillna(0, inplace=True)

In [10]:
train_set_matrix.head()

MovieID,1,10,100,1000,100008,100044,100058,100083,100087,100106,...,99764,99809,99843,999,99910,99912,99917,99957,99964,99986
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1000,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1001,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Prediction Function 

1. Looping through Neighbors: This loop iterates over the indices of the nearest neighbors found in the indices array. It starts from index 1 to skip the user itself since we're considering K nearest neighbors.

2. Calculating Similarity Weighted Ratings: For each neighbor, it retrieves the similarity value from the similarity_matrix corresponding to the user and neighbor index. It then retrieves the rating given by that neighbor for the target movie from the user_item_matrix. If the neighbor has rated the movie with a rating between 1 and 5 (inclusive), it adds the similarity-weighted rating to the rating_votes dictionary.

3. Consideration of Ratings: The condition if neighbor_rating in [1, 2, 3, 4, 5] ensures that only ratings within the range of 1 to 5 are considered for prediction. This excludes ratings of 0 and any other non-numeric or NaN values.

4. Summing Up Similarity-Weighted Ratings: For each neighbor who has rated the movie, the similarity weight is multiplied by their rating and added to the corresponding entry in rating_votes.

5. Predicting the Rating: Finally, the predicted rating is determined by selecting the rating with the highest sum of similarity weights from the rating_votes dictionary. If no such rating is found (i.e., if none of the neighbors have rated the movie within the specified range), the default value returned is NaN.

In [36]:
import numpy as np
from collections import defaultdict

def predict_rating_classification_with_similarity(user_id, movie_id, user_item_matrix, similarity_matrix, K):
    
    """
    Predicts the rating for a given user and movie using collaborative filtering with similarity weights.

    Args:
        user_id (int): The ID of the user for whom the rating is to be predicted.
        movie_id (int): The ID of the movie for which the rating is to be predicted.
        user_item_matrix (DataFrame): The user-item matrix containing ratings.
        similarity_matrix (array-like): The similarity matrix between users.
        K (int): The number of nearest neighbors to consider for prediction.

    Returns:
        float: The predicted rating for the given user and movie.

    Classification of Ratings:
        - Only ratings 1 to 5 are considered for prediction.
        - If none of the K nearest neighbors have rated the movie, the function predicts the average rating of the movie based on non-zero ratings.
    """
    
    # Ensure the indices are integers
    user_index = int(user_item_matrix.index.get_loc(user_id))
    movie_index = int(user_item_matrix.columns.get_loc(movie_id))
    
    # Find K nearest neighbors, including the user itself
    indices = np.argsort(similarity_matrix[user_index])[::-1][:K+1]
    
    # Initialize a dictionary to hold the sum of weights for each rating
    rating_votes = defaultdict(float)

    
    # Use similarities to calculate weights, skipping the first one (user itself)
    for idx, similarity in zip(indices[1:], similarity_matrix[user_index, indices][1:]):
        neighbor_rating = user_item_matrix.iloc[idx, movie_index]
        
        # Only consider neighbors who have rated the movie
        if neighbor_rating  in [1, 2, 3, 4, 5]:
        # if neighbor_rating != 0 and not np.isnan(neighbor_rating):
            # similarity_weight = similarity
            rating_votes[neighbor_rating] += similarity



    # Find the rating with the highest sum of similarity weights
    predicted_rating = max(rating_votes, key=rating_votes.get, default=np.nan)
    
    # predict average rating if non of k neighbors have rated the movie 
    if np.isnan(predicted_rating):
        non_zero_ratings = user_item_matrix[movie_id][user_item_matrix[movie_id] != 0]
        predicted_rating = non_zero_ratings.mean() if len(non_zero_ratings) > 0 else np.nan # calculate average just concidering non zero ratings 

    return predicted_rating 


In [56]:
import numpy as np

# define the function to calculate rmse for predicted rating usine prediction function
def evaluate_rmse_classification(validation_data, user_item_matrix,similarity_matrix, K):
    """
    Evaluate RMSE (Root Mean Squared Error) for collaborative filtering recommender system.

    Args:
        validation_data (DataFrame): DataFrame containing columns 'user_id', 'movie_id', and 'rating'.
        user_item_matrix (DataFrame): User-item matrix containing ratings.
        similarity matrix : precalculated user-user similarity matrix based on user_item_matrix
        K (int): Number of neighbors for KNN.

    Returns:
        float: RMSE value.
    """

    squared_errors = []
    # iterate through each row of validation data to extract ids
    for _, row in validation_data.iterrows():
        user_id = row['CustomerID']
        movie_id = row['MovieID']
        actual_rating = row['CategoricalRating']
        
     # Check if user ID and movie ID exist in user-item matrix
        if user_id in user_item_matrix.index and movie_id in user_item_matrix.columns:
            # predicted_rating = predict_rating_classification1(user_id, movie_id, user_item_matrix, K)
            predicted_rating = predict_rating_classification_with_similarity(user_id, movie_id, user_item_matrix, similarity_matrix, K)
            
            # Check if predicted rating is not NaN
            if not np.isnan(predicted_rating):
                squared_errors.append((actual_rating - predicted_rating) ** 2)
    
    rmse = np.sqrt(np.mean(squared_errors))
    return rmse


In [58]:
# create cosine similarity matrix based on training set
cosine_similarity_matrix = cosine_similarity(train_set_matrix)

- Cross validation for hyper parameter tuning (k) using validation set 

In [None]:
# Define the range of k values you want to test
k_values = range(5, 200, 20)

# Placeholder for storing tuning results
rmse_results = []

# Perform tuning
for k in k_values:
        # Evaluate the recommender system's performance for each combination of k and similarity method
        rmse = evaluate_rmse_classification(val_set, train_set_matrix, cosine_similarity_matrix, k )
        
        # Store the results
        rmse_results.append({'k': k, 'rmse': rmse})
        
        # print the results for each iteration
        print(f"Evaluated model with k={k}: RMSE = {rmse}")

# Find the best performing combination of k and similarity method based on RMSE
best_configuration = min(rmse_results, key=lambda x: x['rmse'])

# Output the best combination found
print(f"Best k:  k = {best_configuration['k']}, RMSE = {best_configuration['rmse']}")


Evaluated model with k=5: RMSE = 1.2942671217816686
Evaluated model with k=25: RMSE = 1.2363270307673175
Evaluated model with k=45: RMSE = 1.202455711952015
Evaluated model with k=65: RMSE = 1.185552995347718
Evaluated model with k=85: RMSE = 1.1773851537005238
Evaluated model with k=105: RMSE = 1.1760958834779998
Evaluated model with k=125: RMSE = 1.1686325001796167
Evaluated model with k=145: RMSE = 1.1645456602986255
Evaluated model with k=165: RMSE = 1.1630353873660366
Evaluated model with k=185: RMSE = 1.1618697251171308
Best k:  k = 185, RMSE = 1.1618697251171308


Concidering no significant improvement in rmse afte k = 125 , this k is chosen as the best number 
now rmse will be calculated on test data to see if the model also performs good on unseen data

In [None]:
# Calculate RMSE on test data
k = 125
rmse = evaluate_rmse_classification(classification_test, train_set_matrix, cosine_similarity_matrix, k )

print("RMSE:", rmse)

RMSE: 1.172311387029719


As the RMSE on test data is very close to previous calculated rmse, so we continue with k = 125

In [13]:
# Creating the user-item matrix
main_classification_matrix = knn_data.pivot_table(index='CustomerID', columns='MovieID', values='CategoricalRating')

# Fill NA values with 0 or a placeholder indicating unrated
main_classification_matrix.fillna(0, inplace=True)

In [53]:
# create the final similarity matrix
main_cosine_similarity = cosine_similarity(main_classification_matrix)
print(main_cosine_similarity)

[[1.         0.204686   0.         ... 0.09684614 0.         0.0821847 ]
 [0.204686   1.         0.01982629 ... 0.13381938 0.01546474 0.16604516]
 [0.         0.01982629 1.         ... 0.03396021 0.         0.        ]
 ...
 [0.09684614 0.13381938 0.03396021 ... 1.         0.         0.11518879]
 [0.         0.01546474 0.         ... 0.         1.         0.        ]
 [0.0821847  0.16604516 0.         ... 0.11518879 0.         1.        ]]


### Recommendation Function 

This function essentially predicts ratings for unseen movies for a given user, selects the top n movies based on these predictions, and returns recommendations along with predicted ratings.

- A defaultdict named predicted_ratings is created to store predicted ratings for unseen movies. This dictionary will have movie IDs as keys and predicted ratings as values.
The index of the given user in the user-item matrix is retrieved.
Predicting Ratings:

- The function iterates over each movie in the user-item matrix.
For each movie, it checks if the user has not rated the movie or has rated it as 0.0. If so, it predicts the rating for the unseen movie using the predict_rating_classification_with_similarity function.
Predicted ratings for unseen movies are stored in the predicted_ratings dictionary.
Sorting and Selecting Top N Movies:

- The predicted ratings are sorted in descending order, and the top n movies with the highest predicted ratings are selected.
For each recommended movie, the movie ID, movie title (obtained from the movie_titles dictionary), and predicted rating are stored as tuples in the recommendations list.
Return: The function returns the list of recommendations. Each recommendation is a tuple containing movie ID, movie title, and predicted rating.



In [54]:
import numpy as np
from collections import defaultdict

def recommend_movies_for_user(user_id, user_item_matrix, similarity_matrix, movie_titles, K, n):
    """
    Recommends top n movies for a given user based on predicted ratings.

    Args:
        user_id (int): ID of the user for whom recommendations are to be made.
        user_item_matrix (DataFrame): User-item matrix containing ratings.
        similarity_matrix (array-like): Similarity matrix between users.
        K (int): Number of neighbors for KNN.
        n (int): Number of movies to recommend.

    Returns:
        list: List of tuples containing (movie_id, movie_title, predicted_rating).
    """
    # Initialize dictionary to store predicted ratings for unseen movies
    predicted_ratings = defaultdict(float)
    
    # Get the index of the user in the user-item matrix
    user_index = int(user_item_matrix.index.get_loc(user_id))
    
    # Iterate over movies
    for movie_id in user_item_matrix.columns:
        # Check if the user has not rated the movie
        if np.isnan(user_item_matrix.iloc[user_index][movie_id]) or user_item_matrix.iloc[user_index][movie_id] == 0.0:
            # Predict rating for the unseen movie
            predicted_rating = predict_rating_classification_with_similarity(user_id, movie_id, user_item_matrix, similarity_matrix, K)
            if not np.isnan(predicted_rating):
                # Store the predicted rating
                predicted_ratings[movie_id] = predicted_rating
    
    # Sort predicted ratings in descending order and select top n movies
    top_n_movies = sorted(predicted_ratings.items(), key=lambda x: x[1], reverse=True)[:n]
    
     # Fetch the titles for the recommended movies
    recommendations = [(movie_id, movie_titles[movie_id], predicted_rating) for movie_id, predicted_rating in top_n_movies]
    
    
    return recommendations


In [None]:
main_classification_matrix.head()

MovieID,1,10,100,1000,100008,100044,100046,100058,100083,100087,...,99809,99843,999,99906,99910,99912,99917,99957,99964,99986
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1000,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1001,5.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Testing prediction function to make sure it is working properly : 

In [50]:
# testing the prediction function on a specific user 
user_id = '1'  # Replace with actual user ID
movie_id = '1000'  # Replace with actual movie ID
K = 125  # Number of neighbors
predicted_rating = predict_rating_classification_with_similarity(user_id, movie_id, main_classification_matrix, main_cosine_similarity,  K)
print(f"Predicted rating for User ID {user_id} and Movie ID {movie_id} is: {predicted_rating} ")

Predicted rating for User ID 1 and Movie ID 1000 is: 4.0 


### Make movie recommendations to a given user

In [51]:
# Example user ID
user_id = '1'

# parameters 
K = 125  # Number of neighbors for KNN based on tuning proccess 
n = 5   # Number of movies to recommend

# Call the function to get movie recommendations for the user
recommendations = recommend_movies_for_user(user_id, main_classification_matrix, main_cosine_similarity,movie_titles, K, n)

# Print the recommendations
print("Recommended Movies:")
for movie_id, movie_title, predicted_rating in recommendations:
    print(f"Movie ID: {movie_id}, Title: {movie_title}, Predicted Rating: {predicted_rating}")


Recommended Movies:
Movie ID: 100138, Title: Simple Life, A (Tao jie) (2011), Predicted Rating: 5.0
Movie ID: 101074, Title: Legend of Sleepy Hollow, The (1949), Predicted Rating: 5.0
Movie ID: 1015, Title: Homeward Bound: The Incredible Journey (1993), Predicted Rating: 5.0
Movie ID: 101850, Title: Death on the Staircase (Soupçons) (2004), Predicted Rating: 5.0
Movie ID: 1025, Title: Sword in the Stone, The (1963), Predicted Rating: 5.0
