<a href="https://colab.research.google.com/github/kshuravi/Netflix_Recommendation_Model/blob/main/Netflix_Recommendation_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

This project is based on the Kaggle dataset "Netflix Movies and TV Shows". Netflix's catalogue of shows and other content keeps growing along with its popularity. This notebook loads the dataset and builds a content-based recommendation system on top of it.

## Import necessary libraries and show the dataset

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
sns.set()
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [None]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle (1).json


{'kaggle.json': b'{"username":"kshuravi","key":"<redacted>"}'}

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets download -d shivamb/netflix-shows

netflix-shows.zip: Skipping, found more recently modified local copy (use --force to force download)


In [None]:
from zipfile import ZipFile
file_name="netflix-shows.zip"

with ZipFile(file_name,'r') as zipfile:
  zipfile.extractall()
  print('Done')

Done


Here is the dataset

In [None]:
netflix_overall = pd.read_csv('netflix_titles.csv')
netflix_overall.head()

| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | TV Show | 3% | | João Miguel, Bianca Comparato, Michel Gomes, R... | Brazil | August 14, 2020 | 2020 | TV-MA | 4 Seasons | International TV Shows, TV Dramas, TV Sci-Fi &... | In a future where the elite inhabit an island ... |
| 1 | s2 | Movie | 7:19 | Jorge Michel Grau | Demián Bichir, Héctor Bonilla, Oscar Serrano, ... | Mexico | December 23, 2016 | 2016 | TV-MA | 93 min | Dramas, International Movies | After a devastating earthquake hits Mexico Cit... |
| 2 | s3 | Movie | 23:59 | Gilbert Chan | Tedd Chan, Stella Chung, Henley Hii, Lawrence ... | Singapore | December 20, 2018 | 2011 | R | 78 min | Horror Movies, International Movies | When an army recruit is found dead, his fellow... |
| 3 | s4 | Movie | 9 | Shane Acker | Elijah Wood, John C. Reilly, Jennifer Connelly... | United States | November 16, 2017 | 2009 | PG-13 | 80 min | Action & Adventure, Independent Movies, Sci-Fi... | In a postapocalyptic world, rag-doll robots hi... |
| 4 | s5 | Movie | 21 | Robert Luketic | Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar... | United States | January 1, 2020 | 2008 | PG-13 | 123 min | Dramas | A brilliant group of students become card-coun... |


## Building Recommendation Model based on plot in the "description" feature

The TF-IDF (Term Frequency-Inverse Document Frequency) score of a word is its frequency within a document, down-weighted by the number of documents in which it occurs. This reduces the weight of words that appear in many plot overviews and, therefore, their influence on the final similarity score.
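
As a rough illustration of the weighting, here is a minimal pure-Python sketch on a made-up three-document corpus. It assumes the smoothed formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is how scikit-learn documents its default behaviour; tokenisation is plain whitespace splitting with no stop-word removal, so the numbers will not match `TfidfVectorizer` exactly. The helper name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

# Toy corpus (hypothetical descriptions, not taken from the dataset)
docs = [
    "a monster attacks the city",
    "a detective hunts a killer in the city",
    "a family moves to the city",
]

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Term frequency times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

vectors = tfidf_vectors(docs)
# "city" occurs in every document, "monster" in only one,
# so "monster" receives the larger weight in the first document.
print(vectors[0]["monster"] > vectors[0]["city"])
```

Words shared by every description, such as "city" here, end up with the smallest weights and so contribute least to the similarity between two titles.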

In [None]:
#removing stopwords
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
netflix_overall['description'] = netflix_overall['description'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(netflix_overall['description'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(7787, 17905)

The matrix has 7,787 rows, one per title (movies and TV shows), and 17,905 columns, one per distinct term found in the descriptions after stop-word removal.

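Cosine similarity is used to compare titles because it is independent of vector magnitude and cheap to compute. Since `TfidfVectorizer` L2-normalizes each row by default, the plain dot product computed by `linear_kernel` gives the same values as cosine similarity.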
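
A minimal sketch of why magnitude does not matter, and why a plain dot product is enough once vectors have been scaled to unit length (the default `norm='l2'` behaviour of `TfidfVectorizer`). The vectors and helper names are made up for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalise(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

# For unit-length vectors the dot product and the cosine coincide
same = math.isclose(dot(l2_normalise(a), l2_normalise(b)), cosine(a, b))
# Scaling a vector changes its magnitude but not the cosine
scale_free = math.isclose(cosine(a, b), cosine([10 * x for x in a], b))
print(same, scale_free)
```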

In [None]:
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [None]:
indices = pd.Series(netflix_overall.index, index=netflix_overall['title'])
indices = indices[~indices.index.duplicated(keep='first')]  # keep the first row for each duplicated title

In [None]:
def get_recommendations(title, cosine_sim=cosine_sim):
    idx = indices[title]

    # Get the pairwise similarity scores of all shows/movies with that show/movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the shows/movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar shows/movies
    sim_scores = sim_scores[1:11]

    # Get the show/movie indices
    show_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar shows/movies
    return netflix_overall['title'].iloc[show_indices]
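
The `[1:11]` slice in the function above assumes that the queried title itself sorts into first place. That usually holds, but if another row has an identical score (for example, an identical description), the queried title can land in a later position and be returned as its own recommendation. A small sketch of the ranking step on a made-up similarity row, with a hypothetical helper `top_k_similar` that filters by index instead:

```python
def top_k_similar(sim_row, idx, k=10):
    """Return indices of the k highest-scoring items, never including idx itself."""
    scored = [(i, s) for i, s in enumerate(sim_row) if i != idx]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:k]]

# Hypothetical similarity row for item 2; item 0 is an exact duplicate (score 1.0)
row = [1.0, 0.1, 1.0, 0.4, 0.0]

# Slice-based ranking, as in get_recommendations: drops position 0 of the sorted list
naive = [i for i, _ in sorted(enumerate(row), key=lambda p: p[1], reverse=True)[1:4]]
# Index-based ranking: drops the queried item wherever it sorts
robust = top_k_similar(row, idx=2, k=3)

print(naive)   # contains 2, the queried item itself
print(robust)  # does not contain 2
```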

Here are the top 10 recommendations based on the plot of the show/movie.

In [None]:
get_recommendations('Godzilla')

1685                               Defiance
3119           Jamtara - Sabka Number Ayega
4637                             One by Two
2459    GODZILLA City on the Edge of Battle
5399                 Sat Shri Akaal England
2867                           Hunt to Kill
1592                        Dancing Quietly
5670                           Slow Country
4111                                 Mine 9
1055                          Born in Syria
Name: title, dtype: object

In [None]:
get_recommendations('Stranger Things')

5289               Rowdy Rathore
5349             Sakho & Mangane
6098     The Autopsy of Jane Doe
907                Big Stone Gap
1468            Come and Find Me
2188                   FirstBorn
2625                 Hardy Bucks
6885                 The Society
5634             Sinister Circle
5618    Sin Senos sí Hay Paraíso
Name: title, dtype: object

In [None]:
get_recommendations('PK')

7321                    Unbroken
4045       Merku Thodarchi Malai
3164              Jhansi Ki Rani
165           A Clockwork Orange
5261                        ROMA
2627    Harishchandrachi Factory
1940          Ek Main Aur Ekk Tu
888      Bhavesh Joshi Superhero
6412                The Governor
6377           The Frozen Ground
Name: title, dtype: object

The first model returns some relevant titles (for example, another Godzilla title for 'Godzilla'), but many of its recommendations are only loosely related to the query. The next section therefore adds more features to the model to improve the results.

## Building Improved Recommendation Model - Content based filtering on multiple metrics

Content based filtering on the following factors:

* Title
* Cast
* Director
* Listed in
* Plot

Filling null values with empty string.

In [None]:
filledna=netflix_overall.fillna('')
filledna.head(2)

| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | TV Show | 3% | | João Miguel, Bianca Comparato, Michel Gomes, R... | Brazil | August 14, 2020 | 2020 | TV-MA | 4 Seasons | International TV Shows, TV Dramas, TV Sci-Fi &... | In a future where the elite inhabit an island ... |
| 1 | s2 | Movie | 7:19 | Jorge Michel Grau | Demián Bichir, Héctor Bonilla, Oscar Serrano, ... | Mexico | December 23, 2016 | 2016 | TV-MA | 93 min | Dramas, International Movies | After a devastating earthquake hits Mexico Cit... |


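Cleaning the data: lower-casing every value and removing spaces, so that a multi-word name such as a director or cast member becomes a single token. Note that the same function is later applied to the title and description as well, which collapses each description into a few very long tokens; the plot text therefore contributes little to the second model.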

In [None]:
def clean_data(x):
    # Remove spaces and lower-case, e.g. "Shane Acker" -> "shaneacker"
    return x.replace(" ", "").lower()
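
To see what this cleaning does to different kinds of fields, here is a small sketch using only the standard library. It assumes the default `token_pattern` of scikit-learn's `CountVectorizer` (words of two or more characters) and ignores stop-word removal; the `tokens` helper is hypothetical.

```python
import re

# Default token pattern of scikit-learn's CountVectorizer: words of 2+ characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def clean_data(x):
    return x.replace(" ", "").lower()

def tokens(text):
    return TOKEN_PATTERN.findall(text.lower())

cast = "Elijah Wood, John C. Reilly"
plot = "In a future where the elite inhabit an island"

# Names separated by commas stay separate; each full name becomes one token
print(tokens(clean_data(cast)))
# A sentence without punctuation collapses into a single token
print(len(tokens(plot)), len(tokens(clean_data(plot))))
```

Removing spaces is helpful for names, since it keeps "Elijah Wood" from matching every other "Elijah", but it is much less helpful for free text such as the description.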

Selecting the features the model will use and cleaning each of them.

In [None]:
features=['title','director','cast','listed_in','description']
filledna = filledna[features].copy()  # copy so the assignments below do not trigger SettingWithCopyWarning
for feature in features:
    filledna[feature] = filledna[feature].apply(clean_data)
    
filledna.head(2)

| | title | director | cast | listed_in | description |
|---|---|---|---|---|---|
| 0 | 3% | | joãomiguel,biancacomparato,michelgomes,rodolfo... | internationaltvshows,tvdramas,tvsci-fi&fantasy | inafuturewheretheeliteinhabitanislandparadisef... |
| 1 | 7:19 | jorgemichelgrau | demiánbichir,héctorbonilla,oscarserrano,azalia... | dramas,internationalmovies | afteradevastatingearthquakehitsmexicocity,trap... |


Creating a "soup" (a "bag of words") for each row: one string that concatenates all selected features.

In [None]:
def create_soup(x):
    return x['title']+ ' ' + x['director'] + ' ' + x['cast'] + ' ' +x['listed_in']+' '+ x['description']

In [None]:
filledna['soup'] = filledna.apply(create_soup, axis=1)

From here on, the code mirrors the first model, except that `CountVectorizer` is used instead of `TfidfVectorizer`, and `cosine_similarity` instead of `linear_kernel`.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(filledna['soup'])

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
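
Unlike TF-IDF rows, raw count vectors are not scaled to unit length, so a plain dot product would favour rows with more tokens. `cosine_similarity` normalises the vectors first. A minimal sketch of the difference on made-up "soup" strings, using hypothetical helpers and only the standard library:

```python
import math
from collections import Counter

def count_vector(text):
    return Counter(text.split())

def dot(u, v):
    return sum(u[t] * v[t] for t in u if t in v)

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

short = count_vector("dramas internationalmovies")
long_ = count_vector("dramas internationalmovies dramas internationalmovies")
other = count_vector("dramas horrormovies comedies thrillers")

# Raw dot products grow with the number of tokens ...
print(dot(short, long_), dot(short, other))
# ... while cosine only looks at the direction of the vectors
print(round(cosine(short, long_), 3), round(cosine(short, other), 3))
```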

In [None]:
filledna=filledna.reset_index()
indices = pd.Series(filledna.index, index=filledna['title'])
indices = indices[~indices.index.duplicated(keep='first')]  # one row per cleaned title

In [None]:
def get_recommendations_new(title, cosine_sim=cosine_sim2):
    # Clean the query the same way the titles were cleaned
    title=title.replace(' ','').lower()
    idx = indices[title]

    # Get the pairwise similarity scores of all shows/movies with that show/movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the shows/movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar shows/movies
    sim_scores = sim_scores[1:11]

    # Get the show/movie indices
    show_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar shows/movies
    return netflix_overall['title'].iloc[show_indices]

Here are the improved top 10 recommendations, based on the title, director, cast, genres (`listed_in`) and plot of the show/movie.

In [None]:
get_recommendations_new('Godzilla', cosine_sim2)

2460              GODZILLA The Planet Eater
2459    GODZILLA City on the Edge of Battle
970                                  BLAME!
3261                                      K
3213               JoJo's Bizarre Adventure
3612                                 Levius
3853           Magi: The Labyrinth of Magic
995                           Blue Exorcist
1165             Cagaster of an Insect Cage
2179                              Fireworks
Name: title, dtype: object

In [None]:
get_recommendations_new('Stranger Things', cosine_sim2)

876                  Beyond Stranger Things
6958                   The Umbrella Academy
2687                                  Helix
4470                            Nightflyers
6660                         The Messengers
7484                            Warrior Nun
1338         Chilling Adventures of Sabrina
6953    The Twilight Zone (Original Series)
6056                               The 4400
6974                    The Vampire Diaries
Name: title, dtype: object

In [None]:
get_recommendations_new('PK', cosine_sim2)

100                            3 Idiots
6585       The Legend of Michael Mishra
552                   Anthony Kaun Hai?
2571                             Haapus
5377                              Sanju
5954                   Taare Zameen Par
1261                    Chal Dhar Pakad
1271                    Chance Pe Dance
1831                            Dostana
1988    EMI: Liya Hai To Chukana Padega
Name: title, dtype: object

## Conclusion

Two models were built here, both producing the top 10 recommendations for a given show or movie. The second model, which adds title, director, cast and genres to the plot description, gives visibly more relevant results. Results could likely improve further if the dataset included audience ratings.
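
As a purely hypothetical sketch of how audience ratings might be folded in, candidates could be re-ranked by a weighted blend of content similarity and rating. The dataset used here has no such column; the function name, the 0-10 rating scale, the weight and all values below are assumptions made for illustration.

```python
def rerank_with_ratings(candidates, weight=0.3):
    """Re-rank (title, similarity, rating) triples by a blended score.

    similarity is in [0, 1]; rating is assumed to be on a 0-10 scale.
    """
    def blended(item):
        _, similarity, rating = item
        return (1 - weight) * similarity + weight * (rating / 10)
    return sorted(candidates, key=blended, reverse=True)

# Entirely made-up titles, similarities and ratings
candidates = [
    ("Title A", 0.60, 4.0),
    ("Title B", 0.55, 9.0),
    ("Title C", 0.30, 9.5),
]
print([title for title, _, _ in rerank_with_ratings(candidates)])
```

With this weighting, a slightly less similar but much better rated title can overtake the closest content match, while a highly rated but unrelated title still ranks last.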