# Implementing Recommender Systems - Lab

## Introduction

In this lab, you'll practice creating a recommender system model using `surprise`. You'll also get the chance to create a more complete recommender system pipeline to obtain the top recommendations for a specific user.


## Objectives

In this lab you will: 

- Use surprise's built-in reader class to process data to work with recommender algorithms 
- Obtain a prediction for a specific user for a particular item 
- Introduce a new user with ratings to a rating matrix and make recommendations for them 
- Create a function that will return the top n recommendations for a user 


For this lab, we will be using the famous MovieLens dataset, specifically the `ml-latest-small` version, which contains roughly 100,000 user ratings for about 9,700 movies. In the last lesson, you were exposed to working with `surprise` datasets. In this lab, you will also go through the process of reading a dataset into the `surprise` dataset format. To begin, load the dataset into a Pandas DataFrame. Determine which columns are necessary for your recommendation system and drop any extraneous ones.

In [1]:
import pandas as pd
df = pd.read_csv('./ml-latest-small/ratings.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [2]:
df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [3]:
df['rating'].value_counts(normalize = True)

rating
4.0    0.265957
3.0    0.198808
5.0    0.131015
3.5    0.130271
4.5    0.084801
2.0    0.074884
2.5    0.055040
1.0    0.027877
1.5    0.017762
0.5    0.013586
Name: proportion, dtype: float64

In [4]:
df.shape

(100836, 4)

In [5]:
# Drop unnecessary columns
new_df = df.drop("timestamp", axis = 1)
new_df.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


It's now time to transform the dataset into something compatible with `surprise`. In order to do this, you're going to need `Reader` and `Dataset` classes. There's a method in `Dataset` specifically for loading dataframes.

In [6]:
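# pip install scikit-surprise

from surprise import Reader, Dataset

# Reader() defaults to rating_scale=(1, 5). This dataset also contains 0.5-star
# ratings, so Reader(rating_scale=(0.5, 5)) would match the data more closely.
reader = Reader()
data = Dataset.load_from_df(new_df, reader)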
print(data)

<surprise.dataset.DatasetAutoFolds object at 0x000001FA7FD5FEE0>


Let's look at how many users and items we have in our dataset. If we use neighborhood-based methods, these counts help us decide whether to compute user-user or item-item similarity.

In [8]:
dataset = data.build_full_trainset()
print('Number of Users: ', dataset.n_users, '\n')

print('Number of items: ', dataset.n_items)


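# Neighborhood-based methods compute a similarity matrix over either users or items.
# There are far fewer users (610) than items (9724), so a user-user matrix is much
# smaller and cheaper to compute; the KNN models below therefore use user_based=True.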

Number of Users:  610 

Number of items:  9724
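To make the trade-off concrete, the size of each similarity matrix can be estimated from the two counts printed above. This is only a back-of-the-envelope sketch: the variable names are illustrative, and it ignores how `surprise` actually stores the matrix in memory.

```python
# Counts reported by dataset.n_users and dataset.n_items above
n_users = 610
n_items = 9724

# A similarity matrix has one entry per pair, so its size grows quadratically
user_user_entries = n_users ** 2
item_item_entries = n_items ** 2

print(f"user-user entries: {user_user_entries:,}")
print(f"item-item entries: {item_item_entries:,}")
print(f"item-item is about {item_item_entries / user_user_entries:.0f}x larger")
```

The quadratic growth is why the smaller of the two dimensions is usually the cheaper one to compare.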


## Determine the best model 

Now, compare the different models and see which ones perform best. For consistency's sake, use RMSE to evaluate models; a lower RMSE means better predictions. Remember to cross-validate! Can you get a model with a lower average RMSE on test data than 0.869?
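As a reminder of what the metric measures, here is a minimal plain-Python sketch of RMSE. The function name `rmse` and the sample ratings are made up for illustration; `surprise` computes the same quantity internally during cross-validation.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical true ratings and model estimates for four user-item pairs
true_ratings = [4.0, 3.0, 5.0, 2.0]
estimates = [3.5, 3.0, 4.0, 3.0]

print(rmse(true_ratings, estimates))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.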

In [9]:
# importing relevant libraries
from surprise.model_selection import cross_validate, GridSearchCV
from surprise.prediction_algorithms import SVD, KNNWithMeans, KNNBasic, KNNBaseline
import numpy as np

In [10]:
# perform the grid search with SVD..
grid_params = {'n_factors': [20, 50, 100],
               'reg_all': [0.02, 0.05, 0.1]}

g_s_svd = GridSearchCV(SVD, param_grid = grid_params, n_jobs = -1)

# fit the model 
g_s_svd.fit(data) 

In [11]:
# print out optimal parameters for SVD after GridSearch
print(g_s_svd.best_params)
# print(g_s_svd.best_score)

# n_factors: the number of latent factors in the factorization model. Latent factors are
# features learned from the data that represent underlying patterns in the ratings.
# reg_all: the regularization term applied to all parameters; it discourages overfitting
# by penalizing large values.

# RMSE is the square root of the mean squared prediction error; a lower value indicates a better fit.

{'rmse': {'n_factors': 20, 'reg_all': 0.05}, 'mae': {'n_factors': 20, 'reg_all': 0.05}}
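A grid search simply tries every combination of the listed values. The sketch below enumerates the same grid with `itertools.product` to show how much work that implies. The fold count of 5 is an assumption based on the default cross-validation setting, so adjust it if you pass a different `cv`.

```python
from itertools import product

# Same grid as the cell above
grid_params = {'n_factors': [20, 50, 100],
               'reg_all': [0.02, 0.05, 0.1]}

# Every combination of one value per parameter
names = list(grid_params)
combinations = [dict(zip(names, values)) for values in product(*grid_params.values())]

n_folds = 5  # assumption: the default number of cross-validation folds
print(len(combinations), "parameter combinations")
print(len(combinations) * n_folds, "model fits in total")
```

Adding a third parameter with three values would triple the number of fits, which is worth keeping in mind before widening the grid.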


In [12]:
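import numpy as np
# np.int was removed in NumPy 1.24, but some older surprise releases still reference it.
# Aliasing it to the built-in int is a temporary workaround for the resulting AttributeError.
np.int = int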


In [13]:
knn_basic = KNNBasic(sim_options = {'name': 'pearson', 'user_based': True})
cv_knn_basic = cross_validate(knn_basic, data, n_jobs = 1)

# for user_based: True indicates that the similarity would be computed between users

# for user_based: False indicates that the similarity would be computed between items


Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.



In [14]:
# here we print out each cross-validation result, then the average RMSE for the test set
for i in cv_knn_basic.items():
    print(i)
    
print("-----")
print(np.mean(cv_knn_basic['test_rmse']))

# 'test_rmse' holds one RMSE value per cross-validation fold; lower values indicate better performance.
# The final number printed is the mean RMSE across all folds, a single value summarizing the model.
# Exact values change slightly from run to run because the folds are split randomly.


('test_rmse', array([0.96864819, 0.97562488, 0.96110602, 0.98047319, 0.97267233]))
('test_mae', array([0.74759856, 0.75418134, 0.74335465, 0.7582442 , 0.74895639]))
('fit_time', (0.7313454151153564, 0.7004599571228027, 0.6833209991455078, 0.6999144554138184, 0.7013387680053711))
('test_time', (1.6001636981964111, 1.581686019897461, 1.6865253448486328, 1.501746416091919, 1.5758016109466553))
-----
0.9717049221414129


In [15]:
mean_rmse_basic = np.mean(cv_knn_basic['test_rmse'])
mean_rmse_basic

0.9717049221414129
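The mean alone hides how much the folds disagree with each other. As a small sketch using only the standard library, the per-fold values printed above can be summarized with both a mean and a standard deviation. The variable names here are illustrative.

```python
from statistics import mean, stdev

# Per-fold test RMSE values copied from the KNNBasic output above
fold_rmse = [0.96864819, 0.97562488, 0.96110602, 0.98047319, 0.97267233]

avg = mean(fold_rmse)
spread = stdev(fold_rmse)  # sample standard deviation across folds

print(f"mean RMSE: {avg:.4f}")
print(f"std across folds: {spread:.4f}")
```

A gap between two models that is much larger than this fold-to-fold spread is unlikely to be an artifact of how the folds were split.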

In [17]:
# cross validating with KNNBaseline
knn_base = KNNBaseline(sim_options = {'name':'pearson', 'user_based': True})

cv_knn_base = cross_validate(knn_base, data, n_jobs = 1)


Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.


In [18]:
# print out the average score for the test set
mean_rmse_base = np.mean(cv_knn_base['test_rmse'])
mean_rmse_base

0.8777358508325293

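Based on these outputs, `KNNBaseline` (mean RMSE of about 0.878) clearly outperforms `KNNBasic` (about 0.972). The SVD model tuned by the grid search is the intended best performer here; print `g_s_svd.best_score` to confirm its RMSE in your own run. The rest of this lab uses an SVD model with `n_factors = 50` and a regularization rate of 0.05. Note that the grid search run shown above selected `n_factors = 20`; the selected value can differ between runs because the folds and the initial factors are random. Use that model, or if you found one that performs better, feel free to use it to make some predictions.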

In [19]:
# Compare the two and determine which has better performance
if mean_rmse_base < mean_rmse_basic:
    print("cv_knn_base has better performance.")
else:
    print("cv_knn_basic has better performance.")

cv_knn_base has better performance.


## Making Recommendations

It's important that the output of the recommendation is interpretable to people. Rather than returning the `movieId` values, it would be far more valuable to return the actual title of the movie. As a first step, let's read the movies into a DataFrame and take a peek at what information we have about them.

In [20]:
df_movies = pd.read_csv('./ml-latest-small/movies.csv')


In [21]:
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [22]:
df_movies.shape

(9742, 3)
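Turning ids into titles is a plain lookup. The sketch below builds that lookup as a dictionary from the five rows previewed above; `title_by_id` and `title_for` are hypothetical names used only for illustration.

```python
# First rows of movies.csv as shown above: (movieId, title)
movie_rows = [
    (1, "Toy Story (1995)"),
    (2, "Jumanji (1995)"),
    (3, "Grumpier Old Men (1995)"),
    (4, "Waiting to Exhale (1995)"),
    (5, "Father of the Bride Part II (1995)"),
]

# Build the lookup once, then reuse it for every recommendation
title_by_id = dict(movie_rows)

def title_for(movie_id):
    """Return the title for a movieId, or a placeholder if the id is unknown."""
    return title_by_id.get(movie_id, f"<unknown movieId {movie_id}>")

print(title_for(4))
print(title_for(999999))
```

With the full DataFrame, an equivalent dictionary could be built with `df_movies.set_index('movieId')['title'].to_dict()`.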

In [23]:
dataset

<surprise.trainset.Trainset at 0x1fa0187e2b0>

## Making simple predictions
Just as a reminder, let's look at how you make a prediction for an individual user and item. First, we'll fit the SVD model we had from before.

In [24]:
svd = SVD(n_factors = 50, reg_all = 0.05)
svd.fit(dataset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1fa01a2b220>

In [25]:
svd_pred = svd.predict(3, 4)
svd_pred

Prediction(uid=3, iid=4, r_ui=None, est=1.9740254215465762, details={'was_impossible': False})

In [26]:
svd_pred_1 = svd.predict(2, 4)
svd_pred_1

# The predict method is used to predict the rating that a user would give to an item.
# 2 represents the user ID.
# 4 represents the item ID.

Prediction(uid=2, iid=4, r_ui=None, est=3.081374887568451, details={'was_impossible': False})

In [27]:
svd_pred_1[3]

3.081374887568451

The prediction is a named tuple, so each of its values can be accessed by index (as in `svd_pred_1[3]` above) or by field name (for example, `svd_pred_1.est`). Now let's put our knowledge of recommendation systems to use and do something interesting: making predictions for a new user!
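To see how that access works without fitting a model, here is a small stand-in built with `collections.namedtuple`. It only mirrors the field names visible in the output above; it is not the `surprise` class itself, and the `est` value is rounded for readability.

```python
from collections import namedtuple

# Stand-in with the same field names as the Prediction shown above
Prediction = namedtuple('Prediction', ['uid', 'iid', 'r_ui', 'est', 'details'])

pred = Prediction(uid=2, iid=4, r_ui=None, est=3.08, details={'was_impossible': False})

# Position 3 and the `est` field refer to the same value
print(pred[3])
print(pred.est)

# Tuple unpacking also works
uid, iid, r_ui, est, details = pred
print(uid, iid, est)
```

Field names tend to be easier to read than positions once predictions are passed around between functions.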

## Obtaining User Ratings 

It's great that we have working models and everything, but wouldn't it be nice to get recommendations specifically tailored to your preferences? That's what we'll be doing now. The first step is to create a function that presents randomly selected movies to a user and asks them to rate each one. If they have not seen the movie, they should be able to skip rating it. 

The function `movie_rater()` should take as parameters: 

* `movie_df`: DataFrame - a dataframe containing the movie ids, name of movie, and genres
* `num`: int - number of ratings
* `genre`: string - a specific genre from which to draw movies

The function returns:
* `rating_list`: list - a collection of dictionaries in the format of `{'userId': int, 'movieId': int, 'rating': float}`

#### This function is optional, but fun :) 

In [29]:
def movie_rater(movie_df, num, genre=None):
    userID = 1000  # id for the new user; assumed not to exist in the ratings data yet
    rating_list = []
    while num > 0:
        if genre:
            # sample one random movie from the requested genre
            movie = movie_df[movie_df['genres'].str.contains(genre)].sample(1)
        else:
            movie = movie_df.sample(1)
        print(movie)
        rating = input('How do you rate this movie on a scale of 1-5, press n if you have not seen :\n')
        if rating == 'n':
            continue
        else:
            # store the rating as a float so it matches the 'rating' column of the original data
            rating_one_movie = {'userId': userID, 'movieId': movie['movieId'].values[0], 'rating': float(rating)}
            rating_list.append(rating_one_movie)
            num -= 1
    return rating_list
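Calling `float()` directly on whatever the user types will raise an error on anything unexpected. One possible way to guard against that is a small parsing helper, sketched below. `parse_rating` is a hypothetical name, and the 1 to 5 bounds are an assumption taken from the prompt text in `movie_rater()`.

```python
def parse_rating(text, low=1.0, high=5.0):
    """Convert raw input text to a float rating.

    Returns None when the user skips the movie ('n') or types something
    that is not a number inside [low, high].
    """
    text = text.strip().lower()
    if text == 'n':
        return None
    try:
        value = float(text)
    except ValueError:
        return None
    return value if low <= value <= high else None

for raw in ['4', ' 3.5 ', 'n', 'great', '9']:
    print(repr(raw), '->', parse_rating(raw))
```

Inside the rating loop, a `None` result could be treated like a skip, so one bad keystroke does not end the session.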
        

In [30]:
# Note: this second definition replaces the interactive movie_rater() above.
# It filters an existing table and expects 'rating' and 'genre' columns, which
# movies.csv does not have (its column is 'genres'), so it only works on a toy
# DataFrame like the commented-out one below.
def movie_rater(df, rating_threshold, genre):
    """
    Filters movies in a DataFrame based on a rating threshold and genre.
    
    Parameters:
        df (pd.DataFrame): The DataFrame containing movie data.
        rating_threshold (float): The minimum rating threshold.
        genre (str): The genre to filter by.
        
    Returns:
        pd.DataFrame: A DataFrame containing movies that meet the criteria.
    """
    if 'rating' not in df.columns or 'genre' not in df.columns:
        raise KeyError("The DataFrame must contain 'rating' and 'genre' columns.")
    
    filtered_movies = df[(df['rating'] >= rating_threshold) & (df['genre'].str.contains(genre, case=False))]
    return filtered_movies


In [31]:
# import pandas as pd

# data = {
#     'title': ['Movie A', 'Movie B', 'Movie C'],
#     'rating': [4.5, 3.0, 4.0],
#     'genre': ['Comedy', 'Drama', 'Comedy'],
# }
# df_movies = pd.DataFrame(data)


In [32]:
# user_rating = movie_rater(df_movies, 4,'Comedy')
# print(user_rating)


     title  rating   genre
0  Movie A     4.5  Comedy
2  Movie C     4.0  Comedy


In [33]:
# # try out the new function here!
# user_rating = movie_rater(df_movies, 1, 'Drama')
# print(user_rating)

     title  rating  genre
1  Movie B     3.0  Drama


If you're struggling to come up with the above function, you can use this list of user ratings to complete the next segment

### Making Predictions With the New Ratings
Now that you have new ratings, you can use them to make predictions for this new user. The proper way this should work is:

* add the new ratings to the original ratings DataFrame, read into a `surprise` dataset 
* train a model using the new combined DataFrame
* make predictions for the user
* order those predictions from highest rated to lowest rated
* return the top n recommendations with the text of the actual movie (rather than just the index number) 
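The steps above can be sketched end to end on toy data. To keep the example runnable without `surprise`, a simple item-mean baseline stands in for the SVD model, so this illustrates the workflow only and not the lab's actual model. All ids, titles, and ratings below are made up.

```python
from collections import defaultdict

# Hypothetical toy data: (userId, movieId, rating)
ratings = [(1, 10, 5.0), (1, 20, 3.0), (2, 10, 4.0),
           (2, 30, 2.0), (3, 20, 4.0), (3, 30, 1.0)]
titles = {10: "Movie A", 20: "Movie B", 30: "Movie C"}
new_user_ratings = [(1000, 30, 4.5)]

# 1. add the new ratings to the original ratings
combined = ratings + new_user_ratings

# 2. "train": an item-mean baseline stands in for the SVD model here
by_item = defaultdict(list)
for _, movie_id, rating in combined:
    by_item[movie_id].append(rating)
item_mean = {movie_id: sum(r) / len(r) for movie_id, r in by_item.items()}

def top_n(user_id, n):
    """Return the n best (title, predicted_rating) pairs among unseen movies."""
    seen = {m for u, m, _ in combined if u == user_id}
    # 3. predict for every movie the user has not rated
    predictions = [(m, est) for m, est in item_mean.items() if m not in seen]
    # 4. order from highest to lowest predicted rating
    predictions.sort(key=lambda pair: pair[1], reverse=True)
    # 5. return titles rather than ids
    return [(titles[m], est) for m, est in predictions[:n]]

print(top_n(1000, 2))
```

With `surprise`, step 2 becomes fitting a model on the combined data and step 3 becomes a call to `predict()` per unseen movie; the filtering, sorting, and title lookup stay the same.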

In [35]:
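# user_rating is expected to be the list of dictionaries returned by the interactive movie_rater()
user_ratings = pd.DataFrame(user_rating, columns=['userId', 'movieId', 'rating'])
combined_ratings_df = pd.concat([new_df, user_ratings], axis=0, ignore_index=True)
combined_ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0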


In [36]:
from surprise import Dataset
from surprise import Reader

# Define the reader (specify the rating scale; ratings in this dataset run from 0.5 to 5)
reader = Reader(rating_scale=(0.5, 5))

# Load the combined DataFrame into a surprise dataset
data = Dataset.load_from_df(combined_ratings_df[['userId', 'movieId', 'rating']], reader)

In [37]:
from surprise import SVD

# Train on every available rating: the goal here is to make recommendations,
# not to evaluate the model, so no hold-out test set is needed
trainset = data.build_full_trainset()

# Train the model (e.g., SVD)
model = SVD()
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1fa08c244f0>

In [40]:
# Get a list of all unique items (movies) in the dataset
all_items = combined_ratings_df['movieId'].unique()

# Get the items the new user has already rated (userId 1000 is assigned in movie_rater())
new_user_id = 1000
user_rated_items = set(combined_ratings_df.loc[combined_ratings_df['userId'] == new_user_id, 'movieId'])

# Find items the user has NOT rated
items_to_predict = [item for item in all_items if item not in user_rated_items]

# Generate predictions for the new user, using the same id that appears in the training data
user_predictions = []
for item in items_to_predict:
    predicted_rating = model.predict(uid=new_user_id, iid=item).est
    user_predictions.append((item, predicted_rating))

In [43]:
new_ratings_df['genre']

0           NaN
1           NaN
2           NaN
3           NaN
4           NaN
          ...  
100832      NaN
100833      NaN
100834      NaN
100835      NaN
1         Drama
Name: genre, Length: 100837, dtype: object

In [53]:
## add the new ratings to the original ratings DataFrame
user_ratings = pd.DataFrame(user_rating)
new_ratings_df = pd.concat([new_df, user_ratings], axis=0)
# load_from_df() takes one DataFrame with exactly three columns (user, item, rating)
# plus the reader, so select those columns and drop rows missing any of them
new_data = Dataset.load_from_df(new_ratings_df[['userId', 'movieId', 'rating']].dropna(), reader)

In [48]:
print(new_ratings_df.head())

   userId  movieId  rating title genre
0     1.0      1.0     4.0   NaN   NaN
1     1.0      3.0     4.0   NaN   NaN
2     1.0      6.0     4.0   NaN   NaN
3     1.0     47.0     5.0   NaN   NaN
4     1.0     50.0     5.0   NaN   NaN


In [51]:
user_ratings

Unnamed: 0,title,rating,genre
1,Movie B,3.0,Drama


In [39]:
## add the new ratings to the original ratings DataFrame
user_ratings = pd.DataFrame(user_rating)
new_rating_df = pd.concat([new_df, user_ratings], axis = 0)
# keep only the three columns surprise expects, in (user, item, rating) order
new_data = Dataset.load_from_df(new_rating_df[['userId', 'movieId', 'rating']].dropna(), reader)

In [40]:
# train a model using the new combined DataFrame


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x11daeb898>

In [42]:
# make predictions for the user
# you'll probably want to create a list of tuples in the format (movie_id, predicted_score)


In [45]:
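# order the predictions from highest to lowest rated
# user_predictions is the list of (movie_id, predicted_score) tuples built earlier
ranked_movies = sorted(user_predictions, key=lambda x: x[1], reverse=True)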

For the final component of this challenge, it could be useful to create a function `recommended_movies()` that takes in the parameters:
* `user_ratings`: list - list of tuples formulated as (movie_id, predicted_rating), in order from best to worst for this individual
* `movie_title_df`: DataFrame 
* `n`: int - number of recommended movies 

The function should use a `for` loop to print out each of the *n* recommended movies in order from best to worst.

In [49]:
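# return the top n recommendations using the movie titles rather than the ids
def recommended_movies(user_ratings, movie_title_df, n):
    # user_ratings: list of (movie_id, predicted_score) tuples, ordered best to worst
    for idx, (movie_id, _) in enumerate((user_ratings or [])[:n], start=1):
        title = movie_title_df.loc[movie_title_df['movieId'] == int(movie_id), 'title']
        print('Recommendation # ', idx, ': ', title, '\n')

recommended_movies(ranked_movies, df_movies, 5)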

Recommendation #  1 :  277    Shawshank Redemption, The (1994)
Name: title, dtype: object 

Recommendation #  2 :  680    Philadelphia Story, The (1940)
Name: title, dtype: object 

Recommendation #  3 :  686    Rear Window (1954)
Name: title, dtype: object 

Recommendation #  4 :  602    Dr. Strangelove or: How I Learned to Stop Worr...
Name: title, dtype: object 

Recommendation #  5 :  926    Amadeus (1984)
Name: title, dtype: object 



## Level Up (Optional)

* Try and chain all of the steps together into one function that asks users for ratings for a certain number of movies, then all of the above steps are performed to return the top $n$ recommendations
* Make a recommender system that only returns items that come from a specified genre
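For the genre-restricted variant, one possible starting point is to filter an already ranked list of predictions. This is a sketch under assumptions: `has_genre` and `filter_by_genre` are hypothetical helpers, the ranked scores are invented, and genres are assumed to be pipe-separated strings as in `movies.csv`.

```python
def has_genre(genres, wanted):
    """True if `wanted` is one of the pipe-separated genres (case-insensitive)."""
    return wanted.lower() in (g.lower() for g in genres.split('|'))

def filter_by_genre(ranked, genres_by_id, wanted):
    """Keep (movie_id, score) pairs whose movie belongs to the wanted genre."""
    return [(m, s) for m, s in ranked if has_genre(genres_by_id.get(m, ''), wanted)]

# Genres copied from the movies.csv preview earlier in the lab
genres_by_id = {
    1: "Adventure|Animation|Children|Comedy|Fantasy",
    2: "Adventure|Children|Fantasy",
    3: "Comedy|Romance",
}

# Hypothetical ranked predictions: (movieId, predicted rating), best first
ranked = [(2, 4.6), (1, 4.2), (3, 3.9)]

print(filter_by_genre(ranked, genres_by_id, "comedy"))
```

Splitting on `|` and comparing whole genre names avoids accidental substring matches that a plain `str.contains` check could produce.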

## Summary

In this lab, you got the chance to implement a collaborative filtering model as well as retrieve recommendations from that model. You also got the opportunity to add your own ratings to the system to get new recommendations for yourself! Next, you will learn how to use Spark to make recommender systems.