# Libraries Used

pandas, numpy are used for data frames and running scipy, scikit-surprise, scikit-learn (sklearn) respectively.

sklearn is used for various classification, regression and clustering algorithms. We will be using it for the respective memory and model based approaches for the Movie Recommendation System.

scipy is used for cosine simlarity matrices.

surprise is used for various model based approaches like KNN and SVD. We will ustilise the SVD for the prediction system and also cross_validate to check the RMSE and MAE values of various models to check for lowest error rates for the models.

In [2]:
import numpy as np
import pandas as pd

from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics import pairwise_distances
from sklearn.model_selection import train_test_split

from scipy.spatial.distance import cosine, correlation
from scipy.sparse.linalg import svds
from surprise import Reader, Dataset, SVD, NormalPredictor, KNNBasic
from surprise.model_selection import cross_validate

# Reading the datasets, Cleaning and forming the DataFrame

The movielens 100k dataset is used for this model.
MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.

This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th,
1997 through April 22nd, 1998. This data has been cleaned up - users
who had less than 20 ratings or did not have complete demographic
information were removed from this data set.

URL: https://www.kaggle.com/prajitdatta/movielens-100k-dataset

The cell below makes a single dataframe for all the required columns and dropping the once that are not needing for the respective Collaborative filtering method.

In [3]:
info = pd.read_csv('u.info' , sep=" ", header = None)
info.columns = ['Counts' , 'Type']

occupation = pd.read_csv('u.occupation' , header = None)
occupation.columns = ['Occupations']

items = pd.read_csv('u.item' , header = None , sep = "|" , encoding='latin-1')
items.columns = ['movie id' , 'movie title' , 'release date' , 'video release date' ,
              'IMDb URL' , 'unknown' , 'Action' , 'Adventure' , 'Animation' ,
              'Childrens' , 'Comedy' , 'Crime' , 'Documentary' , 'Drama' , 'Fantasy' ,
              'Film_Noir' , 'Horror' , 'Musical' , 'Mystery' , 'Romance' , 'Sci_Fi' ,
              'Thriller' , 'War' , 'Western']

data = pd.read_csv('u.data', header= None , sep = '\t')
user = pd.read_csv('u.user', header= None , sep = '|')
genre = pd.read_csv('u.genre', header= None , sep = '|' )

genre.columns = ['Genre' , 'genre_id']
data.columns = ['user id' , 'movie id' , 'rating' , 'timestamp']
user.columns = ['user id' , 'age' , 'gender' , 'occupation' , 'zip code']

data = data.merge(user , on='user id')
df = data.merge(items , on='movie id')

df.drop(columns = ['release date', 'video release date' , 'timestamp', 'IMDb URL', 'age' , 'gender' , 'occupation' , 'zip code'] , inplace = True)
items.drop(columns = ['release date', 'video release date' , 'IMDb URL'] , inplace = True)

df

Unnamed: 0,user id,movie id,rating,movie title,unknown,Action,Adventure,Animation,Childrens,Comedy,...,Fantasy,Film_Noir,Horror,Musical,Mystery,Romance,Sci_Fi,Thriller,War,Western
0,196,242,3,Kolya (1996),0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,305,242,5,Kolya (1996),0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,6,242,4,Kolya (1996),0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,234,242,4,Kolya (1996),0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,63,242,3,Kolya (1996),0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,863,1679,3,B. Monkey (1998),0,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
99996,863,1678,1,Mat' i syn (1997),0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
99997,863,1680,2,Sliding Doors (1998),0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
99998,896,1681,3,You So Crazy (1994),0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


# Preparing the data for Collaborative Filtering

Here all the parameters for Weighted Ratings have been inserted for every row of the dataframe for later use in standardizing the database.

### Weighted Rating Concept by IMDb

IMDb publishes weighted vote averages rather than raw data averages. The simplest way to explain it is that although we accept and consider all votes received by users, not all votes have the same impact (or ‘weight’) on the final rating. 

When unusual voting activity is detected, an different weighting calculation may be applied in order to preserve the reliability of our system. To ensure our rating mechanism remains effective, we don't disclose the exact method used to generate the rating. 

The following formula is used to calculate the Top Rated 250 titles. This formula provides a true 'Bayesian estimate', which takes into account the number of votes each title has received, minimum votes required to be on the list, and the mean vote for all titles:

weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C

Where:

R = average for the movie (mean) = (rating)

v = number of votes for the movie = (votes)

m = minimum votes required to be listed in the Top Rated list (currently 25,000)

C = the mean vote across the whole report

Please be aware that the Top Rated Movies Chart only includes theatrical features: shorts, TV movies, miniseries and documentaries are not included in the Top Rated Movies Chart. The Top Rated TV Shows Chart includes TV Series, but not TV episodes or Movies.

For more informmation: https://help.imdb.com/article/imdb/track-movies-tv/ratings-faq/G67Y87TFYYP6TWAV#

Weighted Ratings can be considered as the impact of each user on the rating and the rating_new adjusts for finding similar user later on which helps understand similar likes and dislikes. 

In [4]:
user_details = df.groupby('user id').size().reset_index()
user_details.columns = ['user id' , 'number of user ratings']
df = df.merge(user_details , on='user id')

movie_details = df.groupby('movie id').size().reset_index()
movie_details.columns = ['movie id' , 'number of movie ratings']
df = df.merge(movie_details , on='movie id')

user_details = df.groupby('user id')['rating'].agg('mean').reset_index()
user_details.columns = ['user id' , 'average of user ratings']
df = df.merge(user_details , on='user id')

movie_details = df.groupby('movie id')['rating'].agg('mean').reset_index()
movie_details.columns = ['movie id' , 'average of movie ratings']
df = df.merge(movie_details , on='movie id')

user_details = df.groupby('user id')['rating'].agg('std').reset_index()
user_details.columns = ['user id' , 'std of user ratings']
df = df.merge(user_details , on='user id')

movie_details = df.groupby('movie id')['rating'].agg('std').reset_index()
movie_details.columns = ['movie id' , 'std of movie ratings']
df = df.merge(movie_details , on='movie id')

In [5]:
df['weighted rating'] = (df['number of user ratings']*df['average of user ratings'] + df['number of movie ratings']*df['average of movie ratings'])/(df['number of movie ratings']+df['number of user ratings'])
df['rating_new'] = df['rating'] - df['weighted rating']

df

Unnamed: 0,user id,movie id,rating,movie title,unknown,Action,Adventure,Animation,Childrens,Comedy,...,War,Western,number of user ratings,number of movie ratings,average of user ratings,average of movie ratings,std of user ratings,std of movie ratings,weighted rating,rating_new
0,196,242,3,Kolya (1996),0,0,0,0,0,1,...,0,0,39,117,3.615385,3.991453,1.016065,0.995643,3.897436,-0.897436
1,305,242,5,Kolya (1996),0,0,0,0,0,1,...,0,0,222,117,3.409910,3.991453,1.079840,0.995643,3.610619,1.389381
2,6,242,4,Kolya (1996),0,0,0,0,0,1,...,0,0,211,117,3.635071,3.991453,1.039461,0.995643,3.762195,0.237805
3,234,242,4,Kolya (1996),0,0,0,0,0,1,...,0,0,480,117,3.122917,3.991453,0.920366,0.995643,3.293132,0.706868
4,63,242,3,Kolya (1996),0,0,0,0,0,1,...,0,0,93,117,3.118280,3.991453,0.987415,0.995643,3.604762,-0.604762
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,587,1624,2,Hush (1998),0,0,0,0,0,0,...,0,0,98,1,2.969388,2.000000,1.049831,,2.959596,-0.959596
99996,587,1625,4,Nightwatch (1997),0,0,0,0,0,0,...,0,0,98,1,2.969388,4.000000,1.049831,,2.979798,1.020202
99997,676,1654,1,Chairman of the Board (1998),0,0,0,0,0,1,...,0,0,77,1,3.584416,1.000000,1.463150,,3.551282,-2.551282
99998,381,1533,4,I Don't Want to Talk About It (De eso no se ha...,0,0,0,0,0,0,...,0,0,127,1,3.811024,4.000000,1.110786,,3.812500,0.187500


In [6]:
del movie_details
del user_details

# Collaborative Filtering Technique using Cosine Similarity

pivot_table_user gives a table between rating_new and movie that will help determine user to user similarity while pivot_table_movie will give ratings by each user to the movie giving movie to movie similarity.

In [7]:
pivot_table_user = pd.pivot_table(data=df,values='rating_new',index='user id',columns='movie id')
pivot_table_user = pivot_table_user.fillna(0)

pivot_table_user

movie id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
user id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.222376,-0.478908,0.533149,-0.584200,-0.536313,1.392617,0.278614,-2.782077,1.239930,-0.664820,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.142023,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,-1.781457,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.401914,-0.016340,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.051724,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
940,0.000000,0.000000,0.000000,-1.518987,0.000000,0.000000,0.274549,1.180982,-0.780788,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
941,1.113924,0.000000,0.000000,0.000000,0.000000,0.000000,0.188406,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
942,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
pivot_table_movie = pd.pivot_table(data=df,values='rating',index='user id',columns='movie id')
pivot_table_movie = pivot_table_movie.fillna(0)

pivot_table_movie

movie id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
user id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
940,0.0,0.0,0.0,2.0,0.0,0.0,4.0,5.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
941,5.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
942,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Performing cosine similarity on both pivot tables for user-based similarity and movie-based similarity

In [9]:
user_based_similarity = 1 - pairwise_distances(pivot_table_user.values, metric="cosine" )
user_based_similarity = pd.DataFrame(user_based_similarity)

user_based_similarity.columns = user_based_similarity.columns+1
user_based_similarity.index = user_based_similarity.index+1

user_based_similarity

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
1,1.000000,0.022914,-0.000792,-0.017528,0.082510,0.057246,0.076065,0.135139,0.014507,-0.013753,...,-0.011673,-0.080702,0.067824,-0.044104,0.031088,0.030045,0.004524,0.035073,-0.072962,-0.029471
2,0.022914,1.000000,0.009484,-0.074357,0.011523,0.034816,0.064244,0.013571,-0.051167,0.052224,...,0.005666,-0.007387,0.015419,0.109715,0.003306,0.033326,-0.048486,-0.102995,0.028680,0.013878
3,-0.000792,0.009484,1.000000,-0.127904,0.000926,0.021540,0.012488,0.017737,-0.008288,0.020971,...,-0.001243,-0.009551,-0.007620,0.094168,-0.042524,-0.029804,-0.028400,-0.049426,-0.041146,0.001773
4,-0.017528,-0.074357,-0.127904,1.000000,0.005933,-0.101571,-0.103798,0.073015,0.073387,-0.014266,...,0.010989,0.013906,-0.011664,-0.158610,0.048351,0.004503,0.133249,0.142051,0.040360,-0.007770
5,0.082510,0.011523,0.000926,0.005933,1.000000,0.011338,0.020022,0.074634,-0.038346,-0.010172,...,0.047061,-0.040748,0.009746,0.003820,0.036534,0.019813,-0.009385,0.041500,-0.004283,0.073846
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.030045,0.033326,-0.029804,0.004503,0.019813,-0.126214,0.007649,0.054655,0.037765,0.023054,...,0.024739,0.163323,0.009803,-0.114882,0.045549,1.000000,0.024615,-0.006177,-0.001633,0.002903
940,0.004524,-0.048486,-0.028400,0.133249,-0.009385,-0.054524,-0.051670,0.002545,0.013325,-0.014679,...,-0.051797,0.066068,-0.078044,-0.057918,-0.025679,0.024615,1.000000,0.064704,0.005732,0.026328
941,0.035073,-0.102995,-0.049426,0.142051,0.041500,-0.072210,-0.026722,-0.039821,0.121178,0.014939,...,-0.041045,-0.029142,0.026341,0.000027,-0.062382,-0.006177,0.064704,1.000000,0.025196,-0.003774
942,-0.072962,0.028680,-0.041146,0.040360,-0.004283,-0.040180,0.114694,-0.006817,0.014472,0.081717,...,0.096542,-0.003829,-0.044564,0.034905,-0.003309,-0.001633,0.005732,0.025196,1.000000,0.015303


In [10]:
movie_based_similarity = 1 - pairwise_distances(pivot_table_movie.T.values, metric="cosine" )
movie_based_similarity = pd.DataFrame(movie_based_similarity)

movie_based_similarity.columns = movie_based_similarity.columns+1
movie_based_similarity.index = movie_based_similarity.index+1

movie_based_similarity

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
1,1.000000,0.402382,0.330245,0.454938,0.286714,0.116344,0.620979,0.481114,0.496288,0.273935,...,0.035387,0.0,0.000000,0.000000,0.035387,0.0,0.0,0.0,0.047183,0.047183
2,0.402382,1.000000,0.273069,0.502571,0.318836,0.083563,0.383403,0.337002,0.255252,0.171082,...,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.078299,0.078299
3,0.330245,0.273069,1.000000,0.324866,0.212957,0.106722,0.372921,0.200794,0.273669,0.158104,...,0.000000,0.0,0.000000,0.000000,0.032292,0.0,0.0,0.0,0.000000,0.096875
4,0.454938,0.502571,0.324866,1.000000,0.334239,0.090308,0.489283,0.490236,0.419044,0.252561,...,0.000000,0.0,0.094022,0.094022,0.037609,0.0,0.0,0.0,0.056413,0.075218
5,0.286714,0.318836,0.212957,0.334239,1.000000,0.037299,0.334769,0.259161,0.272448,0.055453,...,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.094211
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1678,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,1.0,1.0,1.0,0.000000,0.000000
1679,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,1.0,1.0,1.0,0.000000,0.000000
1680,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,1.0,1.0,1.0,0.000000,0.000000
1681,0.047183,0.078299,0.000000,0.056413,0.000000,0.000000,0.051498,0.082033,0.057360,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,1.000000,0.000000


# Functions for finding similar movies

The function, rec_movies(), determines similar movies based on the the film we rpovide as the argument.

The function, rec_user(), determines all the similer users for a given particular user and user_rating() is responsible for finding approximate rating based on the patterns by similar users produced by rec_user().

In [11]:
def rec_movie(movie_id):
    temp_table = pd.DataFrame(columns = items.columns)
    movies = movie_based_similarity[movie_id].sort_values(ascending = False).index.tolist()[:21]
    for mov in movies:
        temp_table = temp_table.append(items[items['movie id'] == mov], ignore_index=True)
    return temp_table

display(rec_movie(197))

Unnamed: 0,movie id,movie title,unknown,Action,Adventure,Animation,Childrens,Comedy,Crime,Documentary,...,Fantasy,Film_Noir,Horror,Musical,Mystery,Romance,Sci_Fi,Thriller,War,Western
0,197,"Graduate, The (1967)",0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,135,2001: A Space Odyssey (1968),0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,1,1,0,0
2,191,Amadeus (1984),0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,357,One Flew Over the Cuckoo's Nest (1975),0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,174,Raiders of the Lost Ark (1981),0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,514,Annie Hall (1977),0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
6,216,When Harry Met Sally... (1989),0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
7,423,E.T. the Extra-Terrestrial (1982),0,0,0,0,1,0,0,0,...,1,0,0,0,0,0,1,0,0,0
8,134,Citizen Kane (1941),0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,483,Casablanca (1942),0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0


In [12]:
def rec_user(user_id):
    temp_table = pd.DataFrame(columns = user.columns)
    us = user_based_similarity[user_id].sort_values(ascending = False).index.tolist()[:500]
    for u in us:
        temp_table = temp_table.append(user[user['user id'] == u], ignore_index=True)
    return temp_table

display(rec_user(405))

def user_rating(x):
    similar_user = rec_user(x)
    similar_user.drop(columns= ['age' , 'gender' , 'occupation' , 'zip code'] , inplace = True)
    
    similar_user = similar_user.merge(pivot_table_movie , on = 'user id')
    similar_user = similar_user.set_index('user id')
    
    similar_user.replace(0, np.nan, inplace=True)
    u_ratings = similar_user[similar_user.index==x]
    similar_user.drop(similar_user.index[0] , inplace = True)
    
    return u_ratings.append(similar_user.mean(axis = 0 , skipna = True), ignore_index = True)

display(user_rating(405))

Unnamed: 0,user id,age,gender,occupation,zip code
0,405,22,F,healthcare,10019
1,16,21,M,entertainment,10309
2,632,18,M,student,55454
3,130,20,M,none,60115
4,330,35,F,educator,33884
...,...,...,...,...,...
495,597,23,M,other,84116
496,89,43,F,administrator,68106
497,311,32,M,technician,73071
498,454,57,M,other,97330


Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
0,,1.0,,4.0,4.0,,,4.0,,,...,,,,,,,,,,
1,4.021978,3.397727,3.263158,3.705882,3.393443,3.333333,3.885463,4.153846,3.981928,3.714286,...,3.0,4.0,3.0,2.0,3.0,1.0,3.0,2.0,3.0,


# Testing the Different Models in Surprise Library

The sup_data is cross validated for 3 models NormalPredictor(), SVD() and KNNBasic() to find the lowest RMSE and MAE for all the models.

SVD() having the lowest we used it to make the predictor.

In [13]:
reader = Reader(rating_scale=(1, 5))
sup_data = Dataset.load_from_df(df[['user id', 'movie title', 'rating']], reader)

In [14]:
npred = NormalPredictor()
cross_validate(npred, sup_data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm NormalPredictor on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.5222  1.5117  1.5174  1.5090  1.5094  1.5139  0.0051  
MAE (testset)     1.2205  1.2107  1.2205  1.2086  1.2122  1.2145  0.0051  
Fit time          0.11    0.12    0.12    0.18    0.18    0.14    0.03    
Test time         0.11    0.14    0.10    0.14    0.11    0.12    0.02    


{'test_rmse': array([1.52219203, 1.51173068, 1.51738025, 1.50898338, 1.50939148]),
 'test_mae': array([1.22048951, 1.21066564, 1.2205424 , 1.20855709, 1.21218938]),
 'fit_time': (0.10671615600585938,
  0.11868143081665039,
  0.11670136451721191,
  0.18450498580932617,
  0.1755204200744629),
 'test_time': (0.10671663284301758,
  0.13863086700439453,
  0.10172629356384277,
  0.1436154842376709,
  0.10970449447631836)}

In [15]:
svd = SVD()
cross_validate(svd, sup_data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9384  0.9308  0.9354  0.9344  0.9375  0.9353  0.0027  
MAE (testset)     0.7389  0.7337  0.7349  0.7389  0.7417  0.7376  0.0029  
Fit time          5.37    5.21    4.73    4.61    4.42    4.87    0.36    
Test time         0.15    0.16    0.10    0.18    0.10    0.14    0.03    


{'test_rmse': array([0.9384134 , 0.93078837, 0.93535639, 0.93444017, 0.93747655]),
 'test_mae': array([0.73894973, 0.73374258, 0.73485613, 0.73886739, 0.74173107]),
 'fit_time': (5.371689796447754,
  5.208059549331665,
  4.72832179069519,
  4.611710071563721,
  4.417186260223389),
 'test_time': (0.14858007431030273,
  0.1575772762298584,
  0.10471987724304199,
  0.17548871040344238,
  0.10372233390808105)}

In [16]:
knn = KNNBasic()
cross_validate(knn, sup_data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9834  0.9755  0.9761  0.9733  0.9804  0.9777  0.0036  
MAE (testset)     0.7755  0.7714  0.7695  0.7693  0.7752  0.7722  0.0027  
Fit time          0.40    0.44    0.49    0.48    0.43    0.45    0.03    
Test time         2.85    2.89    3.31    2.82    2.72    2.92    0.20    


{'test_rmse': array([0.98340242, 0.97553828, 0.97609183, 0.97332436, 0.98037124]),
 'test_mae': array([0.77545037, 0.77141652, 0.76949018, 0.76931758, 0.77517112]),
 'fit_time': (0.4029104709625244,
  0.440807580947876,
  0.4886598587036133,
  0.4767274856567383,
  0.42685556411743164),
 'test_time': (2.850389242172241,
  2.8862831592559814,
  3.3101465702056885,
  2.817471981048584,
  2.72370982170105)}

# Collaborative Filtering Technique using Single Value Decomposition Method

Parameters for SVD():	
* n_factors – The number of factors. Default is 100.
* n_epochs – The number of iteration of the SGD procedure. Default is 20.
* biased (bool) – Whether to use baselines (or biases). See note above. Default is True.
* init_mean – The mean of the normal distribution for factor vectors initialization. Default is 0.
* init_std_dev – The standard deviation of the normal distribution for factor vectors initialization. Default is 0.1.
* lr_all – The learning rate for all parameters. Default is 0.005.
* reg_all – The regularization term for all parameters. Default is 0.02.

In [17]:
sup_train = sup_data.build_full_trainset()
svd = SVD(n_factors = 200 , lr_all = 0.0025 , n_epochs = 40 , init_std_dev = 0.05)
svd.fit(sup_train)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x256e40aca60>

## The prediction_algo() extracts either estimated ratings for all the movies if the parameter fed in is user id and arranging in descending order for best recommendations or estimated ratings for all users for the movie name fed in the parameters arranged in the descending order of the ratings.

In [18]:
def prediction_algo(uid = None , iid = None):
    predictions = []
    
    if uid is None:
        for ui in sup_train.all_users():
            predictions.append(svd.predict(ui, iid))
        return predictions
    
    if iid is None:
        for ii in sup_train.all_items():
            ii = sup_train.to_raw_iid(ii)
            predictions.append(svd.predict(uid, ii))
        return predictions
    
    return predictions.append(svd.predict(uid,iid))

In [19]:
predictions = prediction_algo(iid = 'Star Wars (1977)')
predictions.sort(key=lambda x: x.est, reverse = True)

print('Estimated ratings for the user for the given movie')

for i, pred in enumerate(predictions, start = 1):
    print('{}.\tUser -> {} will most likely Score {}-> {}'.format(i, pred.uid, pred.iid, pred.est))

Estimated ratings for the user for the given movie
1.	User -> 4 will most likely Score Star Wars (1977)-> 5
2.	User -> 13 will most likely Score Star Wars (1977)-> 5
3.	User -> 16 will most likely Score Star Wars (1977)-> 5
4.	User -> 22 will most likely Score Star Wars (1977)-> 5
5.	User -> 24 will most likely Score Star Wars (1977)-> 5
6.	User -> 118 will most likely Score Star Wars (1977)-> 5
7.	User -> 130 will most likely Score Star Wars (1977)-> 5
8.	User -> 137 will most likely Score Star Wars (1977)-> 5
9.	User -> 152 will most likely Score Star Wars (1977)-> 5
10.	User -> 164 will most likely Score Star Wars (1977)-> 5
11.	User -> 173 will most likely Score Star Wars (1977)-> 5
12.	User -> 178 will most likely Score Star Wars (1977)-> 5
13.	User -> 189 will most likely Score Star Wars (1977)-> 5
14.	User -> 200 will most likely Score Star Wars (1977)-> 5
15.	User -> 242 will most likely Score Star Wars (1977)-> 5
16.	User -> 264 will most likely Score Star Wars (1977)-> 5
17.	

In [20]:
predictions = prediction_algo(uid = 197)
predictions.sort(key=lambda x: x.est, reverse = True)

print('Movies Recommended for the User from Best to Worst:')

for i, pred in enumerate(predictions, start = 1):
    print('{}.\tMovie -> {} with estimated Score-> {}'.format(i, pred.iid , pred.est))

Movies Recommended for the User from Best to Worst:
1.	Movie -> Star Wars (1977) with estimated Score-> 4.771682003116179
2.	Movie -> Empire Strikes Back, The (1980) with estimated Score-> 4.758261560429666
3.	Movie -> Return of the Jedi (1983) with estimated Score-> 4.690404021609606
4.	Movie -> Raiders of the Lost Ark (1981) with estimated Score-> 4.6275006062909245
5.	Movie -> Braveheart (1995) with estimated Score-> 4.505103028452529
6.	Movie -> Fugitive, The (1993) with estimated Score-> 4.493444751301805
7.	Movie -> It's a Wonderful Life (1946) with estimated Score-> 4.391904958634059
8.	Movie -> Alien (1979) with estimated Score-> 4.390513351672412
9.	Movie -> Jurassic Park (1993) with estimated Score-> 4.337144711298759
10.	Movie -> Terminator 2: Judgment Day (1991) with estimated Score-> 4.330724764069908
11.	Movie -> Hunt for Red October, The (1990) with estimated Score-> 4.317181330413534
12.	Movie -> Casablanca (1942) with estimated Score-> 4.300802819974517
13.	Movie -> Sh