### Introduction to the Notebook

In this Jupyter notebook we implement a recommender system using the Singular Value Decomposition (SVD) and SVD++ algorithms available in the Surprise package. The dataset used is the Netflix dataset. Because of its size, a manual SVD implementation would likely run into memory and compute limits, so we rely on the Surprise implementations instead.

### About Surprise Package

The Surprise package, short for "Simple Python RecommendatIon System Engine," is a Python scikit for building and analyzing recommender systems. It provides a simple and efficient way to implement collaborative filtering algorithms, including matrix factorization-based methods like SVD and its variations.
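
For intuition only, the kind of biased matrix factorization behind SVD-style recommenders can be sketched on a toy dataset. This is a minimal illustration, not Surprise's code: the ratings, the sizes, and the hyperparameter values below are all made up, and the prediction rule assumed is `mu + bu[u] + bi[i] + Q[i] @ P[u]` trained with stochastic gradient descent.

```python
import numpy as np

# Toy (user, item, rating) triples; purely illustrative, not Netflix data.
ratings = [
    (0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
    (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0),
]
n_users, n_items = 3, 3
n_factors, n_epochs, lr, reg = 2, 50, 0.01, 0.02  # assumed values

rng = np.random.default_rng(0)
mu = np.mean([r for _, _, r in ratings])        # global mean rating
bu = np.zeros(n_users)                          # user biases
bi = np.zeros(n_items)                          # item biases
P = rng.normal(0, 0.1, (n_users, n_factors))    # user latent factors
Q = rng.normal(0, 0.1, (n_items, n_factors))    # item latent factors


def predict(u, i):
    """Estimated rating of item i by user u."""
    return mu + bu[u] + bi[i] + Q[i] @ P[u]


def train_rmse():
    """RMSE of the current model on the toy training ratings."""
    return float(np.sqrt(np.mean([(r - predict(u, i)) ** 2 for u, i, r in ratings])))


rmse_before = train_rmse()
for _ in range(n_epochs):
    for u, i, r in ratings:
        err = r - predict(u, i)
        # Gradient steps on the regularized squared error
        bu[u] += lr * (err - reg * bu[u])
        bi[i] += lr * (err - reg * bi[i])
        pu_old = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * pu_old - reg * Q[i])
rmse_after = train_rmse()

print(f"training RMSE before: {rmse_before:.3f}, after: {rmse_after:.3f}")
```

The four knobs in this sketch (`n_factors`, `n_epochs`, `lr`, `reg`) play the same roles as the `n_factors`, `n_epochs`, `lr_all`, and `reg_all` parameters tuned later in the notebook.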

### How Surprise Package Works

1. **Algorithm Implementations**: Surprise offers ready-to-use implementations of various collaborative filtering algorithms, including matrix factorization techniques like SVD, SVD++, NMF, etc.

2. **Data Loading and Preparation**: You can load your dataset into Surprise using its built-in loaders. It reads delimited text files such as CSV (`Dataset.load_from_file`) as well as pandas DataFrames (`Dataset.load_from_df`).

3. **Model Training**: Once the data is loaded, you can choose an algorithm and train it on your dataset. Surprise handles the training process, including data splitting for cross-validation and parameter tuning.

4. **Evaluation**: Surprise provides evaluation metrics for assessing the performance of your recommender system, such as RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error).

5. **Prediction**: After training the model, you can make predictions for user-item pairs to recommend items to users. Surprise handles the prediction process efficiently.

6. **Hyperparameter Tuning**: Surprise offers tools for hyperparameter tuning, such as GridSearchCV, to find the best combination of parameters for your algorithm.
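
The two metrics named in step 4 are simple to compute by hand, which can help when reading the scores reported later. The sketch below uses hypothetical (true, predicted) rating pairs and plain Python rather than Surprise's own accuracy functions.

```python
import math


def rmse(errors):
    """Root Mean Squared Error: penalizes large errors more strongly."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(errors):
    """Mean Absolute Error: average size of the error."""
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical (true rating, predicted rating) pairs
pairs = [(4, 3.5), (2, 3.0), (5, 4.0), (3, 3.0)]
errors = [true - pred for true, pred in pairs]

print("RMSE:", rmse(errors))  # 0.75
print("MAE:", mae(errors))    # 0.625
```

RMSE is never smaller than MAE on the same errors, so a large gap between the two suggests a few predictions that are far off.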



In [1]:
import pandas as pd
import numpy as np
training_df = pd.read_csv('C:/Users/nafla/OneDrive/Documents/system development/training_data.csv')
training_df.head()

Unnamed: 0,MovieID,CustomerID,Rating,Date,YearOfRelease,MovieTitle,RatingYear,MovieAge,user_activity,AverageMovieAgeRated,user_average_rating,average_rating_per_movie,number_of_ratings_per_movie,scaled_movie_age
0,1,1488844,3,2005-09-06,2003,Dinosaur Planet,2005,2,1.473012,1.640503,3.253308,3.910534,1.010541,1.215054
1,1,822109,5,2005-05-13,2003,Dinosaur Planet,2005,2,1.031355,1.405855,4.083333,3.910534,1.010541,1.215054
2,1,885013,4,2005-10-19,2003,Dinosaur Planet,2005,2,1.077044,1.400853,3.873563,3.910534,1.010541,1.215054
3,1,30878,4,2005-12-26,2003,Dinosaur Planet,2005,2,1.275924,1.525706,3.634304,3.910534,1.010541,1.215054
4,1,823519,3,2004-05-03,2003,Dinosaur Planet,2004,1,1.139754,1.326786,3.917197,3.910534,1.010541,1.172043


In [2]:

from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise.accuracy import rmse

In [3]:
# Feature columns to round to two decimal places
columns_to_round = ['user_activity', 'AverageMovieAgeRated', 'user_average_rating', 'average_rating_per_movie', 'number_of_ratings_per_movie', 'scaled_movie_age']

# Round all specified columns in one step
training_df[columns_to_round] = training_df[columns_to_round].round(2)

# Display the DataFrame to verify the changes
print(training_df.head())

   MovieID  CustomerID  Rating        Date  YearOfRelease       MovieTitle  \
0        1     1488844       3  2005-09-06           2003  Dinosaur Planet   
1        1      822109       5  2005-05-13           2003  Dinosaur Planet   
2        1      885013       4  2005-10-19           2003  Dinosaur Planet   
3        1       30878       4  2005-12-26           2003  Dinosaur Planet   
4        1      823519       3  2004-05-03           2003  Dinosaur Planet   

   RatingYear  MovieAge  user_activity  AverageMovieAgeRated  \
0        2005         2           1.47                  1.64   
1        2005         2           1.03                  1.41   
2        2005         2           1.08                  1.40   
3        2005         2           1.28                  1.53   
4        2004         1           1.14                  1.33   

   user_average_rating  average_rating_per_movie  number_of_ratings_per_movie  \
0                 3.25                      3.91                 

## Stratified Sampling Method

To create a representative sample of our dataset, we employ a stratified sampling method that accounts for three key dimensions: Rating Distribution, User Activity, and Item Popularity. This approach ensures our sample maintains the diversity and characteristics of the entire dataset, facilitating more reliable model training and evaluation.

- User Activity is quantified by the number of ratings a user has provided.
- Item Popularity reflects the number of ratings an item has received.

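Finally, we combine User Activity, Item Popularity, and Rating into a composite stratification key for each record, so that sampling considers the distribution across all three axes. Note that the code cell below then overwrites this key with a two-part key (User Activity bin plus Rating); the `Strata` values shown later in the notebook, such as `high1`, reflect that two-part key.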
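
As a small, self-contained illustration of the idea, the sketch below builds a three-part stratification key on a made-up ratings table and samples from each stratum. The column names mirror the notebook, but the data, the two-bin split, the `rank(method="first")` tie-breaking, and the "half of each stratum, at least one row" rule are assumptions chosen to keep the example tiny.

```python
import pandas as pd

# Toy ratings table; values are invented for illustration.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 6],
    "MovieID":    [10, 11, 12, 13, 10, 11, 10, 10, 12, 13, 11, 10],
    "Rating":     [5, 4, 3, 5, 2, 3, 4, 1, 5, 4, 3, 2],
})

# Number of ratings per user and per movie, broadcast back to every row
user_counts = toy.groupby("CustomerID")["Rating"].transform("size")
item_counts = toy.groupby("MovieID")["Rating"].transform("size")

# Two quantile bins each; ranking first guarantees unique bin edges
toy["UserBin"] = pd.qcut(user_counts.rank(method="first"), q=2, labels=["low", "high"])
toy["ItemBin"] = pd.qcut(item_counts.rank(method="first"), q=2, labels=["low", "high"])

# Composite key: user activity bin, item popularity bin, and rating
toy["Strata"] = (
    toy["UserBin"].astype(str) + "_" + toy["ItemBin"].astype(str) + "_" + toy["Rating"].astype(str)
)

# Sample half of each stratum, but always keep at least one row
parts = [
    group.sample(n=max(1, len(group) // 2), random_state=0)
    for _, group in toy.groupby("Strata")
]
sample = pd.concat(parts).reset_index(drop=True)

print(sample[["CustomerID", "MovieID", "Rating", "Strata"]])
```

Because every stratum contributes at least one row, each combination of activity level, popularity level, and rating that occurs in the full table also occurs in the sample.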

In [4]:
# Assign each rating to a quartile bin based on how many ratings its user / its movie has
training_df['UserActivityBin'] = pd.qcut(training_df.groupby('CustomerID')['Rating'].transform('size'), 
                                q=[0, .25, .5, .75, 1], labels=['low', 'medium', 'medium-high', 'high'])

training_df['ItemPopularityBin'] = pd.qcut(training_df.groupby('MovieID')['Rating'].transform('size'), 
                                   q=[0, .25, .5, .75, 1], labels=['low', 'medium', 'medium-high', 'high'])

# Combine these with Rating to create a stratification key
training_df['Strata'] = training_df['UserActivityBin'].astype(str) + "_" + training_df['ItemPopularityBin'].astype(str) + "_" + training_df['Rating'].astype(str)
# NOTE: the next line overwrites the three-part key above, so the key actually used for sampling
# is UserActivityBin + Rating only (e.g. 'high1'). Remove it to stratify on all three axes.
training_df['Strata'] = training_df['UserActivityBin'].astype(str) + training_df['Rating'].astype(str)

# Perform stratified sampling:
# take 0.5% of each stratum; strata with 10 or fewer rows are kept whole (frac=1.0)
strat_sample_df = training_df.groupby('Strata').apply(lambda x: x.sample(frac=0.005 if len(x) > 10 else 1.0)).reset_index(drop=True)


## Recommender System with SVD

In [5]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV

### 1. Loading data 

In [30]:
# Defining rating scale
reader = Reader(rating_scale=(1, 5))
# Loading data to surprise data set
data = Dataset.load_from_df(strat_sample_df[['CustomerID', 'MovieID', 'Rating']], reader)

### 2. Hyperparameter Tuning with Grid Search in Surprise Package


1. **Define Hyperparameters**: The `param_grid` dictionary defines a grid of hyperparameters that we want to search over. It includes parameters such as `n_epochs`, `lr_all`, `reg_all`, and `n_factors`.

2. **Create GridSearchCV Object**: We create a `GridSearchCV` object named `grid_search`. This object is configured with:
   - The algorithm class (`SVD` in this case) that we want to tune.
   - The parameter grid defined earlier (`param_grid`).
   - Evaluation measures (`"rmse"` and `"mae"`).
   - Number of folds for cross-validation (`cv=5` for 5-fold cross-validation).

3. **Fit GridSearchCV Object**: We fit the `GridSearchCV` object to our data (`data`). This performs an exhaustive search over the hyperparameter grid and evaluates each combination using cross-validation.

4. **Get Best Hyperparameters**: After fitting, we read the best score and the corresponding hyperparameters from the `best_score` and `best_params` attributes, each of which is a dictionary keyed by measure name (here `"rmse"` or `"mae"`).

5. **Print Results**: Finally, we print out the best RMSE score achieved and the corresponding best hyperparameters.
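
To get a feel for the cost of an exhaustive search, the combinations in a grid can be enumerated with the standard library. The sketch below only counts combinations and model fits for the grid used in this notebook; it does not call Surprise, and the variable names `combos` and `n_fits` are illustrative.

```python
from itertools import product

# Same grid as in the notebook's grid-search cell
param_grid = {
    "n_epochs": [5, 10],
    "lr_all": [0.002, 0.005],
    "reg_all": [0.4, 0.6],
    "n_factors": [5, 10, 30, 50, 100, 150],
}
cv_folds = 5

# Every combination is one candidate configuration
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is trained and evaluated once per fold
n_fits = len(combos) * cv_folds

print("combinations:", len(combos))  # 48
print("model fits:", n_fits)         # 240
```

With 48 combinations and 5 folds, the search trains 240 models, which is one reason the notebook runs it on a small stratified sample.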




In [31]:
# Define params for grid search

# n_epochs: The total number of iterations over the entire dataset during the training process.
# lr_all: The learning rate for all parameters, controlling the step size in the gradient descent optimization.
# reg_all: The regularization term for all parameters, used to penalize larger parameter values to prevent overfitting.
# n_factors: The number of latent factors to use, which represents the dimensionality of the user/item feature space.

param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005], "reg_all": [0.4, 0.6], "n_factors": [5, 10, 30, 50, 100, 150]}

# Create a GridSearchCV object
grid_search = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=5)

# Fit the GridSearchCV object
grid_search.fit(data)

# Get the best hyperparameters
# best RMSE score
print("Best rmse:", grid_search.best_score["rmse"])

# combination of parameters that gave the best RMSE score
print("Best hyperparameters:", grid_search.best_params["rmse"])

Best rmse: 1.0277875931473905
Best hyperparameters: {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4, 'n_factors': 10}


### 3. Fitting the model

In [32]:
# best_estimator["rmse"] is an unfitted SVD instance configured with the best parameters,
# so train it on the full dataset before predicting
algo = grid_search.best_estimator["rmse"]
algo.fit(data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1c5d0792210>

### 4. Making rating predictions

In [42]:
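# Predict rating for a specific user-item pair.
# Note: Dataset.load_from_df keeps raw IDs with their DataFrame dtype (integers here), so the
# strings '1' and '1' are unknown to the model and the estimate falls back to the global mean.
# Pass integer IDs that occur in the training data to get a personalized estimate.
user_id = '1'
item_id = '1'
predicted_rating = algo.predict(user_id, item_id)
print("Predicted rating:", predicted_rating.est)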

Predicted rating: 3.5996141981724303


### 5. Defining movie recommendation function

The function `recommend_top_n_unseen_movies` recommends the top N unseen movies to a given user based on a trained model. It works as follows:

- **Parameters**:
  - `user_id`: ID of the user for whom recommendations are to be made.
  - `model`: Trained recommender system model (e.g., SVD or SVD++).
  - `movies_df`: DataFrame containing information about movies, including MovieID and MovieTitle.
  - `n`: Number of recommendations to be made (default is 10).

**Steps**:
1. **Get Rated Movie IDs**: Extract the movie IDs rated by the user from the dataset.
2. **Get All Movie IDs**: Retrieve all unique movie IDs available in the dataset.
3. **Identify Unseen Movies**: Find the movie IDs that the user has not rated (unseen movies).
4. **Predict Ratings**: Use the trained model to predict ratings for the unseen movies.
5. **Sort Predictions**: Sort the predicted ratings in descending order to identify the top N predictions.
6. **Get Movie Titles**: Retrieve the movie titles corresponding to the top N predicted ratings.
7. **Return Recommendations**: Return the top N recommended movies along with their IDs, titles, and predicted ratings.

**Output**:
The function returns a list of tuples, where each tuple contains the MovieID, MovieTitle, and predicted rating for one of the top N recommended movies.

This function simplifies the process of recommending movies to users by automating the steps involved in predicting ratings for unseen movies and selecting the top recommendations based on these ratings.
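
The core of steps 3 to 5 (filter out rated items, score the rest, keep the N best) can be sketched without any recommender library. In the example below, `top_n_unseen`, `fake_scores`, and the use of `heapq.nlargest` are illustrative assumptions; a dictionary lookup stands in for the trained model's prediction.

```python
import heapq


def top_n_unseen(all_item_ids, rated_item_ids, predict, n=3):
    """Return the n unseen items with the highest predicted score."""
    rated = set(rated_item_ids)  # set membership is O(1) per lookup
    scored = ((item, predict(item)) for item in all_item_ids if item not in rated)
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Hypothetical predicted ratings per item ID, standing in for model.predict(...).est
fake_scores = {1: 3.2, 2: 4.8, 3: 4.1, 4: 2.0, 5: 4.5}

top = top_n_unseen(fake_scores.keys(), rated_item_ids=[2], predict=fake_scores.get, n=2)
print(top)  # [(5, 4.5), (3, 4.1)]
```

Item 2 has the highest score but is excluded because the user already rated it. Note that the filter only works when the IDs being compared share a type: an integer `2` and a string `'2'` are different items as far as Python is concerned.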

In [38]:
def recommend_top_n_unseen_movies(user_id, model, movies_df, n=10):
    # Get all the movie IDs rated by the user.
    # Note: this reads the global strat_sample_df, and user_id must have the same dtype as the
    # CustomerID column (int); a string such as '1' matches no rows, so nothing is filtered out.
    rated_movie_ids = set(strat_sample_df.loc[strat_sample_df['CustomerID'] == user_id, 'MovieID'])
    
    # Get all the movie IDs in the dataset
    all_movie_ids = movies_df['MovieID'].unique()
    
    # Identify the unseen movies (movies not rated by the user); set membership keeps this fast
    unseen_movie_ids = [movie_id for movie_id in all_movie_ids if movie_id not in rated_movie_ids]
    
    # Predict ratings for unseen movies
    predictions = [(movie_id, model.predict(user_id, movie_id).est) for movie_id in unseen_movie_ids]
    
    # Sort the predictions based on predicted rating in descending order
    top_n_predictions = sorted(predictions, key=lambda x: x[1], reverse=True)[:n]
    
    # Map each MovieID to its title once instead of filtering the DataFrame for every movie
    titles = movies_df.drop_duplicates('MovieID').set_index('MovieID')['MovieTitle']
    top_n_movies = [(movie_id, titles[movie_id], rating) for movie_id, rating in top_n_predictions]
    
    return top_n_movies




### 6. Making movie recommendations

In [63]:
# recommending movie for one user:
user_id = '1'
top_n_recommendations = recommend_top_n_unseen_movies(user_id, algo, strat_sample_df, n=10)
for movie_id, movie_title, predicted_rating in top_n_recommendations:
    print(f"Movie ID: {movie_id}, Movie Title: {movie_title}, Predicted Rating: {predicted_rating}")

Movie ID: 2102, Movie Title: The Simpsons: Season 6, Predicted Rating: 4.388529479452689
Movie ID: 2452, Movie Title: Lord of the Rings: The Fellowship of the Ring, Predicted Rating: 4.351073301339274
Movie ID: 3962, Movie Title: Finding Nemo (Widescreen), Predicted Rating: 4.33947766718005
Movie ID: 1947, Movie Title: Gilmore Girls: Season 3, Predicted Rating: 4.3325438855746965
Movie ID: 3456, Movie Title: Lost: Season 1, Predicted Rating: 4.288582071208768
Movie ID: 1409, Movie Title: The O.C.: Season 1, Predicted Rating: 4.278009961641653
Movie ID: 4306, Movie Title: The Sixth Sense, Predicted Rating: 4.2580117489011675
Movie ID: 1357, Movie Title: Stargate SG-1: Season 7, Predicted Rating: 4.2575081374242405
Movie ID: 2114, Movie Title: Firefly, Predicted Rating: 4.24769236263177
Movie ID: 1476, Movie Title: Six Feet Under: Season 4, Predicted Rating: 4.2443280914137


## Recommender System with SVD++

In [44]:
# Add a new column for implicit feedback
# Here we consider watching (rating) a movie as implicit feedback; every row in the
# sample is a rated movie, so the column is 1 for all rows
strat_sample_df['ChoosingMovie'] = 1
strat_sample_df.head()

Unnamed: 0,MovieID,CustomerID,Rating,Date,YearOfRelease,MovieTitle,RatingYear,MovieAge,user_activity,AverageMovieAgeRated,user_average_rating,average_rating_per_movie,number_of_ratings_per_movie,scaled_movie_age,UserActivityBin,ItemPopularityBin,Strata,ChoosingMovie
0,3659,164845,1,2005-02-20,1968,The Horse in the Gray Flannel Suit,2005,37,1.34,1.64,2.08,3.6,1.02,2.72,high,low,high1,1
1,4008,507603,1,2005-08-27,2002,David Blaine: Fearless,2005,3,2.48,1.68,1.26,3.92,1.02,1.26,high,low,high1,1
2,2751,2636478,1,2003-06-05,1991,Naked Gun 2 1/2: The Smell of Fear,2003,12,1.48,1.94,3.61,3.54,1.36,1.65,high,medium,high1,1
3,2235,1035650,1,2005-08-29,2004,Undertow,2005,1,1.29,1.61,3.3,2.65,1.12,1.17,high,low,high1,1
4,4262,380354,1,2004-07-22,1999,Sleepy Hollow,2004,5,1.44,1.55,3.3,3.68,2.01,1.34,high,medium-high,high1,1


### 1. Defining and loading data in Surprise dataset

In [51]:
# Defining data for implicit and explicit rating
implicit_data = strat_sample_df[['CustomerID', 'MovieID', 'ChoosingMovie']]
explicit_data = strat_sample_df[['CustomerID', 'MovieID', 'Rating']]

# Rename columns so both frames share the same (user, item, rating) layout
implicit_data.columns = ['userID', 'itemID', 'rating']
explicit_data.columns = ['userID', 'itemID', 'rating']

# Explicit ratings lie between 1 and 5 and the implicit flag is 0 or 1,
# so the reader uses a single scale of 0 to 5 that covers both
reader = Reader(rating_scale=(0, 5))

# Concatenate implicit and explicit data.
# Caveat: every (user, movie) pair now appears twice, once with its real rating and once
# with rating 1. Surprise's SVDpp already treats the set of items a user rated as implicit
# feedback, so the explicit ratings alone would normally be enough.
data = pd.concat([implicit_data, explicit_data])

# Load the data into Surprise dataset
data = Dataset.load_from_df(data, reader)
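
One side effect of mixing the two kinds of rows is worth checking with simple arithmetic. The sketch below uses made-up ratings and is not part of the original pipeline: it shows that pairing every explicit rating with an extra rating of 1 moves the dataset mean from `m` to `(m + 1) / 2`.

```python
# Hypothetical explicit ratings on a 1-5 scale
explicit = [5, 4, 3, 1, 4]
# One implicit row per explicit row, always equal to 1 (as with ChoosingMovie)
implicit = [1] * len(explicit)

mean_explicit = sum(explicit) / len(explicit)
mean_combined = sum(explicit + implicit) / (len(explicit) + len(implicit))

print("mean of explicit ratings:", mean_explicit)
print("mean after concatenation:", mean_combined)
print("(mean_explicit + 1) / 2: ", (mean_explicit + 1) / 2)
```

This matches the predictions reported in this notebook for the unknown pair `('1', '1')`: the SVD model returns about 3.5996, and the SVD++ model trained on the concatenated data returns about 2.2998, which is (3.5996 + 1) / 2. The lower SVD++ ratings and higher RMSE therefore appear to come from the data construction rather than from the algorithm itself.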


### 2. Hyperparameter Tuning with Grid Search (same procedure as for SVD)

In [52]:
from surprise import SVDpp
# Define params for grid search

# n_epochs: The total number of iterations over the entire dataset during the training process.
# lr_all: The learning rate for all parameters, controlling the step size in the gradient descent optimization.
# reg_all: The regularization term for all parameters, used to penalize larger parameter values to prevent overfitting.
# n_factors: The number of latent factors to use, which represents the dimensionality of the user/item feature space.

param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005], "reg_all": [0.4, 0.6], "n_factors": [5, 10, 30, 50, 100, 150]}

# Create a GridSearchCV object
grid_search = GridSearchCV(SVDpp, param_grid, measures=["rmse", "mae"], cv=5)

# Fit the GridSearchCV object
grid_search.fit(data)

# Get the best hyperparameters
# best RMSE score
print("Best rmse:", grid_search.best_score["rmse"])

# combination of parameters that gave the best RMSE score
print("Best hyperparameters:", grid_search.best_params["rmse"])

Best rmse: 1.5131964310440733
Best hyperparameters: {'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.4, 'n_factors': 5}


### 3. Fitting the SVD++ model

In [53]:
# fit the algorithm that yields the best rmse:
algo_pp = grid_search.best_estimator["rmse"]
algo_pp.fit(data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVDpp at 0x1c606df6590>

### 4. Predicting a rating with the SVD++ model

In [70]:
# predict rating for one user as example
user_id = '1'
item_id = '1'
predicted_rating = algo_pp.predict(user_id, item_id)
print("Predicted rating:", predicted_rating.est)

Predicted rating: 2.299807099086215


### 5. Making movie recommendations with SVD++

In [71]:
# Example usage:
user_id = '10'
top_n_recommendations = recommend_top_n_unseen_movies(user_id, algo_pp, strat_sample_df, n=10)
for movie_id, movie_title, predicted_rating in top_n_recommendations:
    print(f"Movie ID: {movie_id}, Movie Title: {movie_title}, Predicted Rating: {predicted_rating}")

Movie ID: 2452, Movie Title: Lord of the Rings: The Fellowship of the Ring, Predicted Rating: 2.6682179380875297
Movie ID: 3962, Movie Title: Finding Nemo (Widescreen), Predicted Rating: 2.6619335528518615
Movie ID: 4306, Movie Title: The Sixth Sense, Predicted Rating: 2.6205785383111353
Movie ID: 3290, Movie Title: The Godfather  Part II, Predicted Rating: 2.6159148477378724
Movie ID: 2782, Movie Title: Braveheart, Predicted Rating: 2.603256314982738
Movie ID: 2942, Movie Title: Friends: Season 6, Predicted Rating: 2.596779186773781
Movie ID: 1905, Movie Title: Pirates of the Caribbean: The Curse of the Black Pearl, Predicted Rating: 2.594323946857848
Movie ID: 2862, Movie Title: The Silence of the Lambs, Predicted Rating: 2.5923590486685084
Movie ID: 2102, Movie Title: The Simpsons: Season 6, Predicted Rating: 2.5903985322698597
Movie ID: 1476, Movie Title: Six Feet Under: Season 4, Predicted Rating: 2.580848285638552
