1. Business Understanding

In the age of digital streaming platforms, users are often overwhelmed by the sheer number of available movies. A well-designed movie recommendation system helps them discover movies they will enjoy without manually searching through thousands of options.

The goal of this project is to develop a personalized movie recommendation system using collaborative filtering, which predicts a user's preferences from their past ratings and the ratings of similar users. The system is valuable to several stakeholders, among them streaming services, movie retailers, content creators, and users.

The system applies data-driven insights to improve user experience and increase platform engagement. Several candidate models are tuned, and the one with the lowest cross-validated RMSE is selected automatically.

2. Business Problem: The goal is to predict user preferences and recommend top movies, helping businesses (e.g., streaming platforms) improve user engagement and retention.

3. Data understanding

This project uses the MovieLens dataset (https://grouplens.org/datasets/movielens/latest/) from the GroupLens research lab at the University of Minnesota. It contains explicit user ratings, making it well suited to collaborative filtering techniques.

Dataset Files

ratings.csv → Contains user-movie rating interactions.

movies.csv → Provides movie metadata (titles, genres).


The ratings.csv contains the following columns:
- userId - Unique identifier for each user
- movieId - Unique identifier for each movie
- rating - User's rating for the movie (0.5 - 5.0)
- timestamp - Time when the rating was given

The movies.csv contains the following columns:
- movieId - Unique identifier for each movie
- title - Name of the movie
- genres - Genre(s) of the movie

These datasets are ideal for predicting user preferences and generating personalized recommendations.

4. Data loading and importing libraries

In [52]:
#Load the libraries
import pandas as pd
import matplotlib.pyplot as plt
from surprise import Dataset, Reader, SVD, KNNBasic, accuracy
from surprise.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

In [53]:
#Load Data
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')

In [69]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1.0,1.0,4.0,964982703
1,1.0,3.0,4.0,964981247
2,1.0,6.0,4.0,964982224
3,1.0,47.0,5.0,964983815
4,1.0,50.0,5.0,964982931


In [68]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [70]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


##### No Missing Data: All columns have 9,742 non-null values.

In [71]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  float64
 1   movieId    100836 non-null  float64
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(3), int64(1)
memory usage: 3.1 MB


##### No Missing Data: All columns have 100,836 non-null values.

5. Data preprocessing

In [54]:
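#Handle Missing Values Using Pipeline
# Safeguard only: the info() output above reports no missing values, so the
# imputer changes no ratings. fit_transform returns floats, which is why
# userId and movieId appear as float64 afterwards.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent"))
])
ratings[['userId', 'movieId', 'rating']] = pipe.fit_transform(ratings[['userId', 'movieId', 'rating']])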

In [55]:
ratings['rating'].min()

0.5

The minimum rating score is 0.5

In [56]:
ratings['rating'].max()

5.0

The maximum rating score is 5.0

Data Formatting

In [57]:

#Convert Data for Surprise
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

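This code converts the ratings DataFrame into a format compatible with the Surprise library. The Reader declares the rating scale (0.5 to 5), and Dataset.load_from_df expects the columns in the order user, item, rating.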

6. Split Data

Split the dataset into train-test sets for model training

In [58]:
#Train-Test Split
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

The dataset is split into training and testing sets.

- test_size=0.2 → 20% of the dataset is held out for testing, while 80% is used for training.
- random_state=42 → Ensures reproducibility so the split remains the same each time the code runs.
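
As an illustration of what a seeded 80/20 split does, here is a minimal pure-Python sketch. It is not Surprise's implementation; the helper name `split_ratings` and the toy rating triples are hypothetical.

```python
import random

def split_ratings(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows with a seeded RNG, then hold out a test share."""
    rng = random.Random(seed)          # same seed -> same shuffle every run
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

# Hypothetical (userId, movieId, rating) triples
rows = [(u, m, 3.0) for u in range(1, 6) for m in range(1, 3)]
train_a, test_a = split_ratings(rows)
train_b, test_b = split_ratings(rows)

print(len(train_a), len(test_a))   # sizes of the two parts
print(test_a == test_b)            # identical split on every call
```

Because the random generator is seeded, repeated calls return the same held-out rows, which is the property random_state=42 provides in the notebook.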

7. Modelling and Hyperparameter Tuning with GridSearchCV

This section defines multiple models and sets up hyperparameter tuning using GridSearchCV to find the best performing model.

In [59]:
#Define Models & Hyperparameter Tuning with GridSearchCV
models = {
    'KNNBasic_Item': KNNBasic(sim_options={'name': 'cosine', 'user_based': False}),
    'KNNBasic_User': KNNBasic(sim_options={'name': 'cosine', 'user_based': True}),
    'SVD': SVD(),
}


Defines three recommendation models:

- KNNBasic_Item → Item-based collaborative filtering (compares movies)
- KNNBasic_User → User-based collaborative filtering (compares users)
- SVD → Matrix factorization technique (Singular Value Decomposition)
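
To make "compares movies" and "compares users" concrete, the sketch below computes a cosine similarity over the ratings two items have in common. It is a simplified illustration rather than Surprise's code; the function name `cosine_common` and the sample ratings are hypothetical.

```python
from math import sqrt

def cosine_common(ratings_a, ratings_b):
    """Cosine similarity restricted to the keys both rating dicts share."""
    common = ratings_a.keys() & ratings_b.keys()
    if not common:
        return 0.0                     # nothing in common: treat as dissimilar
    dot = sum(ratings_a[k] * ratings_b[k] for k in common)
    norm_a = sqrt(sum(ratings_a[k] ** 2 for k in common))
    norm_b = sqrt(sum(ratings_b[k] ** 2 for k in common))
    return dot / (norm_a * norm_b)

# Item-based view: each dict maps userId -> rating for one movie (toy data)
movie_x = {1: 4.0, 2: 5.0, 3: 3.0}
movie_y = {1: 4.0, 2: 5.0, 4: 1.0}
sim_items = cosine_common(movie_x, movie_y)
print(round(sim_items, 4))
```

For the user-based variant the same function applies with the roles swapped: each dict would map movieId to rating for one user.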

In [60]:
param_grid = {
    'SVD': {'n_factors': [50, 100], 'reg_all': [0.02, 0.1]},
    'KNNBasic_Item': {'k': [20, 40], 'min_k': [3, 5]},
    'KNNBasic_User': {'k': [20, 40], 'min_k': [3, 5]}
}

best_scores = {}
best_params = {}

Specifies hyperparameter values to be tested during tuning:
SVD parameters:
- n_factors: Number of latent factors (50, 100)
- reg_all: Regularization strength (0.02, 0.1)

KNNBasic_Item & KNNBasic_User parameters:
- k: Number of neighbors (20, 40)
- min_k: Minimum neighbors required for a prediction (3, 5)

Stores results of the best performing models:

- best_scores :Stores the best RMSE score for each model.
- best_params : Stores the best hyperparameters found for each model.
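
A grid search tries every combination of the listed values. The sketch below enumerates the SVD grid by hand to show how many configurations that is; `expand_grid` is a hypothetical helper, and the fold count of 5 is taken from the cv=5 argument used in the tuning loop.

```python
from itertools import product

def expand_grid(grid):
    """Return one dict per combination of the listed parameter values."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

svd_grid = {'n_factors': [50, 100], 'reg_all': [0.02, 0.1]}
combos = expand_grid(svd_grid)
n_fits = len(combos) * 5     # each combination is fitted once per CV fold

for combo in combos:
    print(combo)
print(n_fits)
```

Each of the three grids here has 2 x 2 = 4 combinations, so with 5-fold cross-validation every model is fitted 20 times during tuning.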

In [79]:
best_scores

{'KNNBasic_Item': 0.9351588161481544,
 'KNNBasic_User': 0.9344617176942439,
 'SVD': 0.870876506768887}

Lower RMSE (Root Mean Squared Error) is better because it indicates more accurate predictions.
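
For reference, RMSE is the square root of the mean squared difference between true and predicted ratings. A minimal sketch on hypothetical (true rating, estimate) pairs:

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Toy data: every estimate is off by exactly half a star
pairs = [(4.0, 3.5), (3.0, 3.5), (5.0, 4.5), (2.0, 2.5)]
print(rmse(pairs))
```

Since the errors are squared before averaging, a few large misses raise RMSE more than many small ones.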

In this case, SVD (0.8709) performs better than both neighborhood models:
- KNNBasic_User (0.9345)
- KNNBasic_Item (0.9352)

In [75]:
best_params

{'KNNBasic_Item': {'k': 20, 'min_k': 3},
 'KNNBasic_User': {'k': 20, 'min_k': 3},
 'SVD': {'n_factors': 100, 'reg_all': 0.1}}

KNNBasic_Item (Item-Based Collaborative Filtering)

- k = 20 → Uses 20 nearest neighbors for predictions.
- min_k = 3 → Requires at least 3 neighbors for a valid recommendation.

KNNBasic_User (User-Based Collaborative Filtering)

- k = 20 → Uses 20 nearest users to make predictions.
- min_k = 3 → Requires at least 3 similar users for a valid recommendation.
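
The roles of k and min_k can be sketched as a similarity-weighted average with a fallback. This is a simplified illustration, not the library's code: `knn_predict` and the `fallback` value are hypothetical (Surprise falls back to the global mean rating of the training set when a prediction is impossible).

```python
def knn_predict(neighbors, k=20, min_k=3, fallback=3.5):
    """Estimate a rating from (similarity, rating) pairs of neighbors.

    Uses at most k of the most similar neighbors with positive similarity.
    If fewer than min_k remain, returns the fallback instead.
    """
    usable = [(sim, r) for sim, r in neighbors if sim > 0]
    top = sorted(usable, key=lambda pair: pair[0], reverse=True)[:k]
    if len(top) < min_k:
        return fallback
    return sum(sim * r for sim, r in top) / sum(sim for sim, _ in top)

# Toy neighbors: (similarity to the target, rating given by the neighbor)
enough = [(1.0, 4.0), (0.5, 2.0), (0.5, 5.0)]
too_few = [(1.0, 4.0), (0.5, 2.0)]
print(knn_predict(enough))    # weighted average of three neighbors
print(knn_predict(too_few))   # falls back: only two neighbors available
```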

SVD (Singular Value Decomposition)

- n_factors = 100 → Uses 100 latent factors to represent user-item interactions.
- reg_all = 0.1 → Applies regularization strength to prevent overfitting.
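
As a rough sketch of what the latent factors are used for, an SVD-style model estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, clipped to the rating scale. All numbers below are hypothetical, and only two factors are used instead of 100 to keep the example readable.

```python
def svd_estimate(mu, b_u, b_i, p_u, q_i, low=0.5, high=5.0):
    """Global mean + user bias + item bias + factor dot product, clipped."""
    raw = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    return min(high, max(low, raw))

# Hypothetical learned values for one user and one movie
estimate = svd_estimate(mu=3.5, b_u=0.2, b_i=0.3, p_u=[0.1, -0.2], q_i=[0.5, 0.5])
clipped = svd_estimate(mu=4.8, b_u=0.5, b_i=0.5, p_u=[0.0, 0.0], q_i=[0.0, 0.0])
print(round(estimate, 2), clipped)
```

During training, reg_all penalizes large biases and factor values, which keeps the model from fitting individual ratings too closely.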

8. Training and Evaluation

In [None]:
# Train & Evaluate Each Model
for model_name, model in models.items():
    print(f"🔍 Tuning Hyperparameters for {model_name}...")

    # Perform Grid Search
    grid_search = GridSearchCV(algo_class=model.__class__, param_grid=param_grid[model_name], measures=['rmse'], cv=5)
    grid_search.fit(data)

    # Store Best RMSE Score
    best_scores[model_name] = grid_search.best_score['rmse']

    # Store Best Hyperparameters
    best_params[model_name] = grid_search.best_params

🔍 Tuning Hyperparameters for KNNBasic_Item...
Computing the msd similarity matrix...
Done computing similarity matrix.
[The same two lines repeat for each fold and parameter combination; the remaining output is truncated in the source.]

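This loop iterates over the three models (KNNBasic_Item, KNNBasic_User, SVD), performs hyperparameter tuning with GridSearchCV, and stores the best RMSE score and hyperparameters for each model. Because GridSearchCV receives only the algorithm class (model.__class__) and builds its own instances from param_grid, the sim_options set in the models dictionary are not applied during tuning. The log confirms this: it reports the default msd similarity rather than cosine, so both KNNBasic entries are tuned with the same default configuration.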
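
The tuning log reports msd similarity because GridSearchCV is given only the algorithm class, so the cosine and user_based settings from the models dictionary never reach it. One possible remedy, sketched here as an assumption rather than the notebook's actual configuration, is to nest sim_options inside the grid, which Surprise's GridSearchCV accepts as a dictionary of value lists. The name `knn_param_grid` is hypothetical.

```python
# Hypothetical replacement grids: every sim_options value is a list,
# just like the other grid entries, so GridSearchCV can expand it.
sim_item = {'name': ['cosine'], 'user_based': [False]}   # item-based
sim_user = {'name': ['cosine'], 'user_based': [True]}    # user-based

knn_param_grid = {
    'KNNBasic_Item': {'k': [20, 40], 'min_k': [3, 5], 'sim_options': sim_item},
    'KNNBasic_User': {'k': [20, 40], 'min_k': [3, 5], 'sim_options': sim_user},
}

for name, grid in knn_param_grid.items():
    print(name, grid['sim_options'])
```

With a grid like this, the best parameters returned for a KNN model would already include sim_options, so the final model could be built from them alone without passing sim_options a second time.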

In [62]:
#select the best model
best_model_name = min(best_scores, key=best_scores.get)
best_model_params = best_params[best_model_name]

Finds the model name with the lowest RMSE score (best performance).
 
Uses min() with key=best_scores.get to get the model with the smallest RMSE from best_scores.

Retrieves the best hyperparameters for the selected model.

Uses best_model_name as a key to access best_params.

In [73]:
best_model_name

'SVD'

In [72]:
best_model_params

{'n_factors': 100, 'reg_all': 0.1}

In [63]:
#Initialize the Best Model with Tuned Parameters & Train
if best_model_name == "SVD":
    best_model = SVD(**best_model_params)
elif best_model_name == "KNNBasic_Item":
    best_model = KNNBasic(sim_options={'name': 'cosine', 'user_based': False}, **best_model_params)
elif best_model_name == "KNNBasic_User":
    best_model = KNNBasic(sim_options={'name': 'cosine', 'user_based': True}, **best_model_params)

best_model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1f5a69f7910>

Chooses the model with the lowest RMSE (best_model_name).

Applies the best hyperparameters (best_model_params).

Ensures correct model initialization for SVD, KNNBasic_Item, and KNNBasic_User.

Fits the selected model (best_model) on the trainset data.

In [64]:
#Evaluate Model
predictions = best_model.test(testset)
rmse = accuracy.rmse(predictions)
print(f" Best Model: {best_model_name} with RMSE: {rmse:.4f}")

RMSE: 0.8775
 Best Model: SVD with RMSE: 0.8775


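Uses the trained model (best_model) to predict ratings for the test set (testset) and computes the RMSE of those predictions.

SVD was the best model based on hyperparameter tuning. Its RMSE of 0.8775 on the held-out 20% is close to the cross-validated score of 0.8709 from the grid search.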

In [None]:
from sklearn.metrics import classification_report

# Define a function to classify ratings as Liked (1) or Disliked (0)
def classify_rating(rating, threshold=3.5):
    return 1 if rating >= threshold else 0  # 1 = Liked, 0 = Disliked

# Define a dictionary to store classification reports for all models
model_reports = {}
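
Thresholding at 3.5 turns rating prediction into a liked/disliked classification task. The sketch below shows, on hypothetical (true rating, estimate) pairs, how precision and recall for the "Liked" class follow from that threshold; `precision_recall` is an illustrative helper, not part of the notebook.

```python
def precision_recall(pairs, threshold=3.5):
    """Precision and recall of the 'Liked' class for (true, estimate) pairs."""
    tp = fp = fn = 0
    for true_rating, estimate in pairs:
        actual = true_rating >= threshold
        predicted = estimate >= threshold
        if predicted and actual:
            tp += 1                    # correctly predicted as liked
        elif predicted:
            fp += 1                    # predicted liked, actually disliked
        elif actual:
            fn += 1                    # liked, but predicted disliked
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data: three liked movies, two of them predicted as liked
pairs = [(4.0, 4.2), (5.0, 3.0), (2.0, 3.8), (3.0, 2.5), (4.5, 3.6)]
precision, recall = precision_recall(pairs)
print(round(precision, 3), round(recall, 3))
```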

In [90]:
# Convert ratings to binary labels, skipping predictions without a true rating
rated = [pred for pred in predictions if pred.r_ui is not None]
true_labels = [classify_rating(pred.r_ui) for pred in rated]
predicted_labels = [classify_rating(pred.est) for pred in rated]

# Only compute metrics when at least one prediction has a true rating
if true_labels:
    print("\n📊 Classification Report:")
    print(classification_report(true_labels, predicted_labels))
else:
    print("⚠️ No valid ratings available for classification.")


⚠️ No valid ratings available for classification.

In the recorded run, this cell (In [90]) was executed after the recommendation cell (In [86]). By then, predictions held the per-movie estimates for user 100, which have r_ui=None, so every entry was filtered out and no report was printed. Running this cell directly after the evaluation cell (In [64]) gives it the test-set predictions, which carry true ratings.


In [86]:
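#Generate Top 5 Recommendations for a User
user_id = 100
movie_ids = ratings['movieId'].unique()
# A set gives fast membership checks when filtering out already-rated movies
user_rated_movies = set(ratings.loc[ratings['userId'] == user_id, 'movieId'])
movies_to_predict = [m for m in movie_ids if m not in user_rated_movies]

# Stored under a separate name so the test-set `predictions` from the
# evaluation cell are not overwritten
user_predictions = [best_model.predict(user_id, m) for m in movies_to_predict]
top_n = sorted(user_predictions, key=lambda x: x.est, reverse=True)[:5]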

Predicts and ranks the top 5 movie recommendations for a given user (user_id) using the trained model. The steps are:

Extracts a list of all unique movies in the dataset

Filters the ratings DataFrame to get all movies the user has rated

Finds movies that the user has not yet rated.

Uses the trained model (best_model) to predict ratings for the unseen movies.

Sorts the predictions by predicted rating in descending order.
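
The steps above can be sketched without the model by working on a plain dictionary of estimates. The helper `top_n_unseen` and the sample values are hypothetical; heapq.nlargest is used here as an alternative to sorting the full list when only a few items are needed.

```python
import heapq

def top_n_unseen(estimates, rated, n=5):
    """Return the n highest (movieId, estimate) pairs the user has not rated."""
    candidates = ((m, est) for m, est in estimates.items() if m not in rated)
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# Toy data: movieId -> predicted rating, and the movies already rated
estimates = {1: 4.1, 2: 3.2, 3: 4.8, 4: 2.9, 5: 4.5}
rated = {3}
recommendations = top_n_unseen(estimates, rated, n=2)
print(recommendations)
```

Movie 3 has the highest estimate but is excluded because the user has already rated it.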

In [87]:
top_n

[Prediction(uid=100, iid=1204.0, r_ui=None, est=4.636374021639172, details={'was_impossible': False}),
 Prediction(uid=100, iid=1172.0, r_ui=None, est=4.5424362256548365, details={'was_impossible': False}),
 Prediction(uid=100, iid=318.0, r_ui=None, est=4.52796265060145, details={'was_impossible': False}),
 Prediction(uid=100, iid=904.0, r_ui=None, est=4.51224304273186, details={'was_impossible': False}),
 Prediction(uid=100, iid=1276.0, r_ui=None, est=4.505852232571778, details={'was_impossible': False})]

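top_n is a list of Prediction objects containing the top 5 recommended movies for user_id = 100. Each entry holds the movie ID (iid) and the estimated rating (est); r_ui is None because the user has not rated these movies.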

In [84]:
#Display Recommendations
print(f" Top 5 Movie Recommendations for User {user_id}:")
titles, scores = [], []
for pred in top_n:
    title = movies[movies['movieId'] == pred.iid]['title'].values[0]
    print(f"{title} (Predicted Rating: {pred.est:.2f})")
    titles.append(title)
    scores.append(pred.est)

 Top 5 Movie Recommendations for User 1:
Lawrence of Arabia (1962) (Predicted Rating: 4.98)
Shawshank Redemption, The (1994) (Predicted Rating: 4.98)
Cinema Paradiso (Nuovo cinema Paradiso) (1989) (Predicted Rating: 4.95)
Rear Window (1954) (Predicted Rating: 4.93)
Philadelphia Story, The (1940) (Predicted Rating: 4.92)


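Retrieved and printed the actual movie titles for the top 5 recommendations. The output shown comes from an earlier run for user 1 (cell In [84] was executed before In [86] set user_id = 100), so its header and predicted ratings do not match the top_n list for user 100 shown above.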