# Movie Recommendation System

The goal of this notebook is to create a recommendation system that will give recommendations of movies based on user input.

## First look at the data

In [6]:
# Importing the datasets
import pandas as pd
import numpy as np
movies = pd.read_csv('ml-latest-small/movies.csv')
ratings = pd.read_csv('ml-latest-small/ratings.csv')

In [7]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [8]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


I'll be using surprise to do my prediction and modeling so I need to reduce the ratings data down to three columns. Time stamp is extraneous to begin with so I'll drop that column entirely.

In [9]:
ratings = ratings.drop(columns = 'timestamp')
ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


Perfect. And much less effort thanks in part to not having any data cleaning to do. Now, we need to do a little data model testing.

## Model Testing

In [14]:
# This is for checking the time it takes to run each individual model
# Running this cell might take a while
import time

from surprise.similarities import cosine, msd, pearson
from surprise import accuracy
from surprise import Reader, Dataset
from surprise.model_selection import train_test_split

#loading the .CSV file into surprise
reader = Reader()
data = Dataset.load_from_df(ratings,reader)
train, test = train_test_split(data, test_size=0.2)

rmseScores = []

from surprise.prediction_algorithms import knns
sim_pearson = {'name':'pearson', 'user_based':False}
basic_pearson = knns.KNNBasic(sim_options=sim_pearson)
start = time.time()
basic_pearson.fit(train)
predictions = basic_pearson.test(test)
thePrediction = f'KNNBasic: {accuracy.rmse(predictions)}'
end = time.time()
store = f'{thePrediction} | Time Elapsed: {np.round(end - start, 2)}/sec'
rmseScores.append(store)

knn_means = knns.KNNWithMeans(sim_options=sim_pearson)
start = time.time()
knn_means.fit(train)
predictions = knn_means.test(test)
thePrediction = f'KNNWithMeans: {accuracy.rmse(predictions)}'
end = time.time()
store = f'{thePrediction} | Time Elapsed: {np.round(end - start, 2)}/sec'
rmseScores.append(store)

knnZ = knns.KNNWithZScore(sim_options=sim_pearson)
start = time.time()
knnZ.fit(train)
predictions = knnZ.test(test)
thePrediction = f'KNNWithZScore: {accuracy.rmse(predictions)}'
end = time.time()
store = f'{thePrediction} | Time Elapsed: {np.round(end - start, 2)}/sec'
rmseScores.append(store)

sim_pearson = {'name':'pearson', 'user_based':False}
knn_baseline = knns.KNNBaseline(sim_options=sim_pearson)
start = time.time()
knn_baseline.fit(train)
predictions = knn_baseline.test(test)
thePrediction = f'KNNBaseline: {accuracy.rmse(predictions)}'
end = time.time()
store = f'{thePrediction} | Time Elapsed: {np.round(end - start, 2)}/sec'
rmseScores.append(store)

from surprise.prediction_algorithms import SVD
svd = SVD()
start = time.time()
svd.fit(train)
predictions = svd.test(test)
thePrediction = f'SVD: {accuracy.rmse(predictions)}'
end = time.time()
store = f'{thePrediction} | Time Elapsed: {np.round(end - start, 2)}/sec'
rmseScores.append(store)

from surprise import NormalPredictor
normPred = NormalPredictor()
start = time.time()
normPred.fit(train)
predictions = normPred.test(test)
thePrediction = f'NormalPredictor: {accuracy.rmse(predictions)}'
end = time.time()
store = f'{thePrediction} | Time Elapsed: {np.round(end - start, 2)}/sec'
rmseScores.append(store)

from surprise import BaselineOnly
baseline = BaselineOnly()
start = time.time()
baseline.fit(train)
predictions = baseline.test(test)
thePrediction = f'BaselineOnly: {accuracy.rmse(predictions)}'
end = time.time()
store = f'{thePrediction} | Time Elapsed: {np.round(end - start, 2)}/sec'
rmseScores.append(store)

from surprise.prediction_algorithms import NMF
NMF = NMF()
start = time.time()
NMF.fit(train)
predictions = NMF.test(test)
thePrediction = f'NMF: {accuracy.rmse(predictions)}'
end = time.time()
store = f'{thePrediction} | Time Elapsed: {np.round(end - start, 2)}/sec'
rmseScores.append(store)

from surprise.prediction_algorithms import SlopeOne
slopeOne = SlopeOne()
start = time.time()
slopeOne.fit(train)
predictions = slopeOne.test(test)
thePrediction = f'SlopeOne: {accuracy.rmse(predictions)}'
end = time.time()
store = f'{thePrediction} | Time Elapsed: {np.round(end - start, 2)}/sec'
rmseScores.append(store)

from surprise.prediction_algorithms import CoClustering
cluster = CoClustering()
start = time.time()
cluster.fit(train)
predictions = cluster.test(test)
thePrediction = f'CoClustering: {accuracy.rmse(predictions)}'
end = time.time()
store = f'{thePrediction} | Time Elapsed: {np.round(end - start, 2)}/sec'
rmseScores.append(store)

rmseScores

Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 0.9718
Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 0.9021
Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 0.9056
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 0.8812
RMSE: 0.8738
RMSE: 1.4118
Estimating biases using als...
RMSE: 0.8739
RMSE: 0.9170
RMSE: 0.8968
RMSE: 0.9397


['KNNBasic: 0.971846459980042 | Time Elapsed: 65.84/sec',
 'KNNWithMeans: 0.9020955674414711 | Time Elapsed: 66.18/sec',
 'KNNWithZScore: 0.905565420655282 | Time Elapsed: 65.3/sec',
 'KNNBaseline: 0.8812243426888469 | Time Elapsed: 67.97/sec',
 'SVD: 0.8737744850082054 | Time Elapsed: 12.0/sec',
 'NormalPredictor: 1.4118323446448224 | Time Elapsed: 0.56/sec',
 'BaselineOnly: 0.8739309821152166 | Time Elapsed: 0.46/sec',
 'NMF: 0.9169839223926739 | Time Elapsed: 12.84/sec',
 'SlopeOne: 0.8967589686237372 | Time Elapsed: 22.11/sec',
 'CoClustering: 0.9397130868455623 | Time Elapsed: 5.53/sec']

SVD and BaselineOnly seems to have the closest accuracy while maintaining a very short runtime. Although KNNBaseline has the second best score, all KNN models have a substantial runtime that holds it back. Looking at BaselineOnly, modifying the parameters seems to be missing some documentation.

Let's commit to SVD and run a grid search to narrow in what parameters might work best.

In [11]:
# Importing relevant libraries here just to get them out of the way
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV

In [31]:
# Parameters for the grid search and setting them
params = {'n_factors': [20, 50, 100], 'reg_all': [0.02, 0.05, 0.1]}
GSsvd = GridSearchCV(SVD, param_grid = params, n_jobs = -1)

In [15]:
#fitting it on the data
GSsvd.fit(data)

In [16]:
# print out optimal parameters for SVD after GridSearchbest_params
print(GSsvd.best_score)
print(GSsvd.best_params)

{'rmse': 0.8694254611359741, 'mae': 0.6686066963698798}
{'rmse': {'n_factors': 100, 'reg_all': 0.05}, 'mae': {'n_factors': 100, 'reg_all': 0.05}}


Perfect. Finally, let's build one solid function that takes in some input, runs our model, spits out some recommendations of movies.

In [28]:
# This function will take in the predictions and give the top recommendations.
# This function is separate because it'll be called in the next fucntion.
def recommended_movies(user_ratings,movie_title_df,n):
        for idx, rec in enumerate(user_ratings):
            title = movie_title_df.loc[movie_title_df['movieId'] == int(rec[0])]['title']
            print('Recommendation # ', idx+1, ': ', title, '\n')
            n -= 1
            if n == 0:
                break

In [29]:
def movie_recommender(movie_df, num_of_rated_movies, genre=None):
    userID = 1000
    rating_list = []
    print(f'Thank you for participating! In order to obtain your recommendations, please rate {num_of_rated_movies} movies.')
    
    # This portion grabs a random movie title and info and asks the user to rate it
    # Once the user gives a number 1-5 or n (for haven't seen), it will append it
    # To the rating_list.
    while num_of_rated_movies > 0:
        if genre:
            movie = movie_df[movie_df['genres'].str.contains(genre)].sample(1)
        else:
            movie = movie_df.sample(1)
        print(movie)
        rating = input('On a scale of 1 - 5, how would you rate this movie? press n if you have not seen this movie. Press enter to submit your answer: \n')
        if rating == 'n':
            continue
        else:
            rating_one_movie = {'userId':userID, 'movieId':movie['movieId'].values[0], 'rating':rating}
            rating_list.append(rating_one_movie)
            num_of_rated_movies -= 1
    
    # This portion will take the ratings list, and makes a prediction on it
    new_rating_df = ratings.append(rating_list, ignore_index = True)
    new_data = Dataset.load_from_df(new_rating_df, reader)
    svd = SVD(n_factors=100, n_epochs=10, lr_all=0.005, reg_all=0.4)
    svd.fit(new_data.build_full_trainset())
    predictions = svd.test(test)
    
    moviesList = []
    for m_id in ratings['movieId'].unique():
        moviesList.append((m_id, svd.predict(1000, m_id)[3]))
    
    # This portion takes the list of predictions and orders them in order
    # of most likely to be liked by the user to least.
    ranked_movies = sorted(moviesList, key=lambda x:x[1], reverse=True)
    
    # This takes in the list of predicted movies, the DataFrame of movies,
    # and a number reguarding how many movies to show (starting from the top)
    return recommended_movies(ranked_movies,movies,5), print(thePrediction)

In [30]:
# Running this cell will start to questionair.
# It takes in the DataFrame of movies, a number for how many user inputs
# it requires, and the genre you wish to be recommendations to be from
movie_recommender(movies, 4, 'Comedy')

Thank you for participating! In order to obtain your recommendations, please rate 4 movies.
      movieId                   title  genres
1074     1394  Raising Arizona (1987)  Comedy
On a scale of 1 - 5, how would you rate this movie? press n if you have not seen this movie. Press enter to submit your answer: 
4
      movieId                  title            genres
2288     3035  Mister Roberts (1955)  Comedy|Drama|War
On a scale of 1 - 5, how would you rate this movie? press n if you have not seen this movie. Press enter to submit your answer: 
4
      movieId                                 title  genres
9555   173253  Vir Das: Abroad Understanding (2017)  Comedy
On a scale of 1 - 5, how would you rate this movie? press n if you have not seen this movie. Press enter to submit your answer: 
4
      movieId                     title                genres
6400    50802  Because I Said So (2007)  Comedy|Drama|Romance
On a scale of 1 - 5, how would you rate this movie? press n if you ha

(None, None)

The use cases for this are ultimately pretty obvious. Movie streaming services are all but uncommon. Movie streaming services like Hulu and Netflix require user input to be able to recommend movies to users effectively. Even if it's as simple as recommendations base on movies similar to the most recent.

This system could work in tandem, using a 'likes' system to recommend movies upfront could help jumpstart streaming services give something accurate to begin with.