Movie Recommendation System based on Item based Collaborative Filtering
By:

Vaishnavi Gupta (N019)
Khushi Vora (N078)
Shubham Nagar (C023)
Navin Sada (C054)
A recommendation system is a filtering program whose primary goal is to predict the “rating” or “preference” of a user for a domain-specific item or set of items.

There are mainly three types of recommendation systems:

1) Content-based recommendation system: The algorithm recommends items that are similar to the ones a user has liked in the past. This similarity (generally cosine similarity) is computed from the data we have about the items together with the user’s past preferences. For example, if a user likes ‘The Prestige’, we can recommend other movies starring ‘Christian Bale’, other movies in the ‘Thriller’ genre, or other movies directed by ‘Christopher Nolan’. Here the recommendation system checks the user’s past preferences and finds ‘The Prestige’, then uses the information available in the database, such as the lead actors, the director, the genre and the production house, to find movies similar to it.

2) Collaborative recommendation system: This filtering strategy compares the user’s behavior with the behavior of the other users in the database, so the history of all users plays an important role. The main difference between content-based filtering and collaborative filtering is that in the latter the interactions of all users with the items influence the recommendation, while content-based filtering takes only the concerned user’s data into account. There are multiple ways to implement collaborative filtering, but the main concept to grasp is that the recommendation is influenced by the data of many users and does not depend on a single user’s data.

Collaborative filtering is further classified into 2 categories:

a) User-based collaborative filtering: The basic idea here is to find users whose past preference patterns are similar to those of the user ‘A’ and then recommend to ‘A’ the items those similar users liked that ‘A’ has not encountered yet. This is achieved by building a matrix of the items each user has rated/viewed/liked/clicked (depending on the task at hand), computing the similarity score between users, and finally recommending items that the concerned user is not aware of but that similar users know and liked.

For example, if the user ‘A’ likes ‘Batman Begins’, ‘Justice League’ and ‘The Avengers’ while the user ‘B’ likes ‘Batman Begins’, ‘Justice League’ and ‘Thor’, then they have similar interests (here, all of these movies belong to the superhero genre). So there is a high probability that the user ‘A’ would like ‘Thor’ and the user ‘B’ would like ‘The Avengers’.

b) Item-based collaborative filtering: The concept in this case is to find similar movies instead of similar users and then recommend movies similar to those in the past preferences of ‘A’. This is done by finding every pair of items that were rated/viewed/liked/clicked by the same user, measuring the similarity of those ratings/views/likes/clicks across all users who rated/viewed/liked/clicked both, and finally recommending items based on the similarity scores.

Here, for example, we take two movies ‘A’ and ‘B’ and compare the ratings given by all users who have rated both. If most of these common users have rated ‘A’ and ‘B’ similarly, it is highly probable that ‘A’ and ‘B’ are similar; therefore, someone who has watched and liked ‘A’ should be recommended ‘B’, and vice versa.
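
The item-based idea described above can be sketched on a toy rating table. This is only a minimal illustration in plain Python, not the code used in this project: the ratings, the user names and the helper `item_similarity` are all made up, and cosine similarity over the users who rated both items is just one common choice of similarity measure.

```python
import math

# Hypothetical ratings: user -> {movie: rating on a 1-5 scale}
ratings = {
    "u1": {"A": 5, "B": 5, "C": 1},
    "u2": {"A": 4, "B": 4, "C": 2},
    "u3": {"A": 1, "B": 2, "C": 5},
    "u4": {"A": 5, "C": 1},          # has not rated B
}

def item_similarity(item_x, item_y, ratings):
    """Cosine similarity between two items over the users who rated both."""
    common = [u for u, r in ratings.items() if item_x in r and item_y in r]
    if not common:
        return 0.0
    x = [ratings[u][item_x] for u in common]
    y = [ratings[u][item_y] for u in common]
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

sim_ab = item_similarity("A", "B", ratings)
sim_ac = item_similarity("A", "C", ratings)
print(f"sim(A, B) = {sim_ab:.3f}")
print(f"sim(A, C) = {sim_ac:.3f}")

# u4 liked A; the movies u4 has not rated yet are the recommendation candidates.
unseen = [m for m in ("B", "C") if m not in ratings["u4"]]
print("candidates for u4:", unseen)
```

In this toy table ‘A’ and ‘B’ are rated alike by their common users, so ‘B’ would be the natural suggestion for a user who liked ‘A’ but has not rated ‘B’.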

In our case, we have used item-based collaborative filtering.

Dataset used: The dataset is available on Kaggle at: https://www.kaggle.com/code/shivamb/netflix-shows-and-movies-exploratory-analysis/data

The code used for the project is as follows:

In [None]:
#import all the packages
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,accuracy_score
import sklearn.metrics
import statsmodels.api as sm
import plotly.express as px #for plotting the scatter plot
import seaborn as sns #For plotting the dataset in seaborn
sns.set(style='whitegrid')
import warnings
warnings.filterwarnings('ignore')

In [None]:
#Read the CSV file
data=pd.read_csv('/content/netflix_titles_nov_2019 new.csv')
print(data.head(5))
print(data.describe())
print(data.columns)

Data cleaning

In [None]:
#Dropping the columns cast, date_added and listed_in as they are not used in the model.
#Although the cast may be a factor in whether a user watches a movie, the number of names in each entry would, on label encoding, lead to a very large number of categories.
#Thus we decided to drop it.
data=data.drop(['cast','date_added','listed_in'],axis=1)
data.columns

In [None]:
# creating a separate dataset named movie_title_desc which contains the show_id, title and description of the movie
movie_title_desc=pd.DataFrame()
movie_title_desc['title']=data['title']
movie_title_desc['description']=data['description']
movie_title_desc['show_id']=data['show_id']
movie_title_desc.head(5)

In [None]:
#dropping the columns: title and description from the main dataset as they have been added to the movie_title_desc dataframe
data=data.drop(['title','description'],axis=1)
data.head(5)

In [None]:
#checking for null values in the dataset:
data.isnull().any()

In [None]:
# dropping the rows which have entries with null values of director and country
data = data.dropna(axis = 0, how ='any',subset=['director','country'])
# each entry in the duration column had entries like '126 min'.
#Therefore, we had to remove the string 'min' from each entry
data['duration'] = data['duration'].replace(" min",'',regex=True)

print(data['duration'])
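
A side note on the cleaning step above: in the Kaggle Netflix data the `duration` column appears to hold season counts such as '1 Season' for TV shows as well as minute values, so stripping ' min' alone can leave non-numeric strings behind. Below is a minimal sketch of one way to tell the two cases apart, assuming only these two formats occur; the helper `parse_duration` is hypothetical and not part of the notebook.

```python
import re

def parse_duration(text):
    """Return (value, unit) for strings like '126 min' or '2 Seasons'; (None, None) otherwise."""
    match = re.fullmatch(r"\s*(\d+)\s*(min|Seasons?)\s*", str(text))
    if match is None:
        return None, None
    value = float(match.group(1))
    unit = "min" if match.group(2) == "min" else "season"
    return value, unit

for raw in ["126 min", "1 Season", "3 Seasons", None]:
    print(repr(raw), "->", parse_duration(raw))
```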


Label Encoding

In [None]:
# Label Encoding the values in director, rating, type and country as they have categorical data in them using the in-built function Label Encoder.
le = LabelEncoder()
data['director']= le.fit_transform(data['director'])
data['rating']= le.fit_transform(data['rating'])
data['type']= le.fit_transform(data['type'])
data['country']= le.fit_transform(data['country']) 
print(data.head(5))
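
To make the encoding step concrete, the following plain-Python sketch mimics what `LabelEncoder` does: it sorts the distinct values and replaces each value by its position in that sorted list. The country values are made up for illustration and the helper `label_encode` is hypothetical.

```python
def label_encode(values):
    """Map each distinct value to its rank in the sorted list of distinct values."""
    classes = sorted(set(values))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[v] for v in values], classes

countries = ["India", "United States", "Brazil", "India"]
codes, classes = label_encode(countries)
print(codes)     # integer code for each entry
print(classes)   # sorted distinct labels; the code is the position in this list

# Decoding needs the class list of that particular column.
decoded = [classes[c] for c in codes]
print(decoded)
```

Because the cell above reuses a single `le` object, `le.classes_` only holds the classes of the last column fitted; keeping one encoder per column would be needed to decode the other columns. The integer codes also carry an arbitrary order, which is worth keeping in mind because distance-based models treat them as magnitudes.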

Imputation

In [None]:
#Since ratings is a categorical feature data imputation is performed by replacing the null or nan values with the mode
data['rating'] = data['rating'].fillna(data['rating'].mode()[0])

# Converting each value in the duration column to float and storing the result; entries that are not numeric become NaN
data['duration'] = pd.to_numeric(data['duration'], errors='coerce')

#replacing null or nan values with mean
data['duration']=data['duration'].fillna(data['duration'].mean())
data.info()

Outlier removal

In [None]:
features=['show_id','duration','director','country','release_year','rating','type']
for feature in features:
    data[feature] = pd.to_numeric(data[feature], errors='coerce')
# plotting the distribution plot for show_id, release_year and duration
plt.figure(figsize=(16,5))
plt.subplot(1,3,1)
sns.histplot(data['show_id'], kde=True)

plt.subplot(1,3,2)
sns.histplot(data['release_year'], kde=True)

plt.subplot(1,3,3)
sns.histplot(data['duration'], kde=True)

In [None]:
#Finding the boundary values for each column as we are using the z score treatment for outlier removal
for column in data[features]:
  print("Highest allowed in {} is:{}".format(column,data[column].mean() + 3*data[column].std()))
  print("Lowest allowed in {} is:{}\n\n".format(column,data[column].mean() - 3*data[column].std()))

In [None]:
#Finding the entries in the dataset that are outliers:
data[(data['type'] > 0.0) | (data['type'] < 0.0)]
data[(data['rating'] > 12.907489545791607) | (data['rating'] < 0.011238369402738257)]
data[(data['release_year'] > 2041.0345281433606) | (data['release_year'] < 1984.2294022725132)]
data[(data['country'] > 609.7975659892998) | (data['country'] < -91.93347248563032)]
data[(data['director'] > 4007.76910860292) | (data['director'] < -1039.0183611171904)]
data[(data['show_id'] > 110702147.1412701) | (data['show_id'] < 41767871.9361966)]
data[(data['duration'] > 178.98291844704266) | (data['duration'] < 20.515314768505064)]

In [None]:
#trimming the outliers: creating a new dataframe new_df which contains only the rows that are outliers in at least one feature
#the limits are computed from the data as mean +/- 3 standard deviations

means = data[features].mean()
stds = data[features].std()
outlier_mask = ((data[features] > means + 3*stds) | (data[features] < means - 3*stds)).any(axis=1)
new_df = data[outlier_mask]
new_df

In [None]:
#capping the outliers:
upper_limit={}
lower_limit={}
for column in data[features]:
  upper_limit[column]=data[column].mean() + 3*data[column].std()
  lower_limit[column]=data[column].mean() - 3*data[column].std()
print(upper_limit)
print(lower_limit)

In [None]:
#applying the capping:
for column in data[features]:
  data[column] = data[column].clip(lower=lower_limit[column], upper=upper_limit[column])
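
The z-score limits and the capping step above can be illustrated on a toy column. This is a minimal plain-Python sketch with made-up durations, not the notebook's data; `statistics.stdev` is used because it is the sample standard deviation, which is also what pandas' `.std()` computes by default.

```python
import statistics

# Made-up durations in minutes: 19 typical values and one extreme value
durations = [100.0] * 19 + [1000.0]

mean = statistics.mean(durations)
std = statistics.stdev(durations)      # sample standard deviation (n - 1 in the denominator)
upper_limit = mean + 3 * std
lower_limit = mean - 3 * std

# Values outside [lower_limit, upper_limit] are treated as outliers ...
outliers = [d for d in durations if d > upper_limit or d < lower_limit]
# ... and capping pulls them back to the nearest limit instead of dropping them.
capped = [min(max(d, lower_limit), upper_limit) for d in durations]

print(f"limits: [{lower_limit:.2f}, {upper_limit:.2f}]")
print("outliers:", outliers)
print("largest value after capping:", round(max(capped), 2))
```

Note that with very few rows a single extreme value inflates the standard deviation so much that it may never cross the three-sigma limit, which is why the toy column uses twenty values.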

Feature Selection

In [None]:
# heatmap of correlation matrix with annotations in 2 different shades
columns=['show_id','duration','director','country','release_year','rating','type']
cor=data[columns].corr()
hm1 = sns.heatmap(cor, annot = True)
hm1.set(xlabel='\nFeatures', ylabel='Features\t', title = "Correlation matrix of data\n")
plt.show()

In [None]:
#Converting the dataset to SciPy's compressed sparse row (CSR) format using the csr_matrix function; this matrix is the input to the KNN model
csr_data = csr_matrix(data.values)
data.reset_index(inplace=True)
data.isnull().any()
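
For reference, `csr_matrix` stores a matrix in compressed sparse row (CSR) form, which keeps only the non-zero entries. The plain-Python sketch below builds the three arrays that CSR uses from a small dense matrix; the helper `to_csr` and the toy matrix are illustrative only, and SciPy's own implementation is far more optimised.

```python
def to_csr(dense):
    """Return (values, indices, indptr) for a dense matrix given as a list of rows."""
    values, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                values.append(value)     # the non-zero value itself
                indices.append(col)      # the column it sits in
        indptr.append(len(values))       # offset where the next row starts
    return values, indices, indptr

dense = [
    [0, 0, 3],
    [4, 0, 0],
    [0, 0, 0],
    [0, 5, 6],
]
values, indices, indptr = to_csr(dense)
print(values)
print(indices)
print(indptr)
```

Row `i` of the matrix is recovered from `values[indptr[i]:indptr[i+1]]` and the matching slice of `indices`; an empty row simply repeats the previous offset.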

KNN

In [None]:
#using the KNN (nearest neighbours) algorithm with cosine distance to find the movies closest to a given movie in the dataset.
#The input to the KNN model is the csr_matrix.

from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
knn.fit(csr_data)
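
What a neighbour search with `metric='cosine'` computes can be sketched in plain Python: the cosine distance is one minus the cosine similarity, and the nearest neighbours are the rows with the smallest distance to the query row. The feature rows below are invented, and `cosine_distance` and `nearest` are hypothetical helpers rather than scikit-learn's implementation.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest(query_idx, rows, k):
    """(index, distance) of the k rows closest to rows[query_idx], excluding itself."""
    dists = [(i, cosine_distance(rows[query_idx], r))
             for i, r in enumerate(rows) if i != query_idx]
    return sorted(dists, key=lambda pair: pair[1])[:k]

# Invented rows: [duration, release_year - 2000, rating_code]
rows = [
    [90.0, 10.0, 3.0],
    [95.0, 11.0, 3.0],
    [30.0, 18.0, 9.0],
    [180.0, 20.0, 6.0],   # exactly twice row 0
]
neighbours = nearest(0, rows, k=2)
for idx, dist in neighbours:
    print(idx, round(dist, 4))
```

Row 3 is exactly twice row 0, so its cosine distance to row 0 is zero even though the values differ: cosine distance compares only the direction of the feature vectors, which also means that a column with very large values can dominate the result.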

In [None]:
#function for getting movie recommendations.

def get_movie_recommendation(movie_name):
    n_movies_to_recommend = 10 #number of movies that we have to recommend
    #case-insensitive, literal (non-regex) match on the title; missing titles never match
    movie_list = movie_title_desc[movie_title_desc['title'].str.contains(movie_name, case=False, regex=False, na=False)]
    #keep only the matches whose show_id is still present in data after cleaning
    movie_list = movie_list[movie_list['show_id'].isin(data['show_id'])]
    if len(movie_list):
        show_id = movie_list.iloc[0]['show_id']
        movie_idx = data[data['show_id'] == show_id].index[0]
        distances, indices = knn.kneighbors(csr_data[movie_idx], n_neighbors=n_movies_to_recommend+1)
        #sort by distance (nearest first) and drop the first entry, which is the query movie itself
        rec_movie_indices = sorted(zip(indices.squeeze().tolist(), distances.squeeze().tolist()), key=lambda x: x[1])[1:]
        recommend_frame = []
        for val in rec_movie_indices:
            show_id = data.iloc[val[0]]['show_id']
            idx = movie_title_desc[movie_title_desc['show_id'] == show_id].index
            recommend_frame.append({'Title': movie_title_desc.loc[idx]['title'].values[0], 'Distance': val[1]})
        df = pd.DataFrame(recommend_frame, index=range(1, len(recommend_frame)+1))
        return df
    else:
        return "No movies found. Please check your input"
name=input("Please enter the name of your favourite or a recently watched movie that you loved:\n")
get_movie_recommendation(name)