# Netflix Recommendation System
In this notebook I follow a tutorial on how to build a Netflix recommendation system.

## 1. Import Libraries
First, import the pandas, NumPy, and scikit-learn libraries.

In [157]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import cosine_similarity

## 2. Import Dataset
To create the recommendation system, I will be using the "Netflix Movies and TV Shows 2021" dataset from Kaggle. This dataset contains 13 features and approximately 6,000 observations.

https://www.kaggle.com/datasets/satpreetmakhija/netflix-movies-and-tv-shows-2021

I am following a tutorial by Aman Kharwal on how to build this recommendation system.

https://thecleverprogrammer.com/2022/07/05/netflix-recommendation-system-using-python/

In [158]:
data = pd.read_csv("netflixData.csv")

## 3. Exploratory Data Analysis
First, take a look at the dataset to find the number of observations, the types of the variables, and any missing values.

In [159]:
data.shape

(5967, 13)

In [160]:
data.info()

In [161]:
data.isnull().sum()

Show Id                  0
Title                    0
Description              0
Director              2064
Genres                   0
Cast                   530
Production Country     559
Release Date             3
Rating                   4
Duration                 3
Imdb Score             608
Content Type             0
Date Added            1335
dtype: int64

The dataset has 13 variables and 5967 observations. Additionally, only 5 of the attributes have zero null values. For the recommender system I will only be using title, description, genres, and content type. There are no null values for these attributes. 

Now let's look more closely at these four attributes.

In [162]:
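# Select only the columns that we will be using for the recommendation system.
# .copy() makes this an independent DataFrame, so assigning to data["Title"]
# later does not trigger pandas' SettingWithCopyWarning.
data = data[["Title", "Description", "Genres", "Content Type"]].copy()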

# Confirm there are no null values for the selected attributes
data.isnull().sum()

Title           0
Description     0
Genres          0
Content Type    0
dtype: int64

Let's take a look at the first few rows of the dataset to see whether any obvious cleaning is needed before building the recommendation system.

In [163]:
data.head(n=10)

Unnamed: 0,Title,Description,Genres,Content Type
0,(Un)Well,This docuseries takes a deep dive into the luc...,Reality TV,TV Show
1,#Alive,"As a grisly virus rampages a city, a lone man ...","Horror Movies, International Movies, Thrillers",Movie
2,#AnneFrank - Parallel Stories,"Through her diary, Anne Frank's story is retol...","Documentaries, International Movies",Movie
3,#blackAF,Kenya Barris and his family navigate relations...,TV Comedies,TV Show
4,#cats_the_mewvie,This pawesome documentary explores how our fel...,"Documentaries, International Movies",Movie
5,#FriendButMarried,"Pining for his high school crush for years, a ...","Dramas, International Movies, Romantic Movies",Movie
6,#FriendButMarried 2,As Ayu and Ditto finally transition from best ...,"Dramas, International Movies, Romantic Movies",Movie
7,#realityhigh,When nerdy high schooler Dani finally attracts...,Comedies,Movie
8,#Rucker50,This documentary celebrates the 50th anniversa...,"Documentaries, Sports Movies",Movie
9,#Selfie,"Two days before their final exams, three teen ...","Comedies, Dramas, International Movies",Movie


As you can see, there are no null values, but there is some cleaning to do. First, the titles contain "#", which should be removed. I will also remove stop words from the titles so that they can be used to create the recommendation system.

## 4. Data Preparation
### 4a Clean Title
To use the Title of a show/movie in the recommender model, it needs to be cleaned. This means removing stop words and removing characters that can cause inaccuracies in the model. I will use the nltk package to remove stop words from the Title feature, and use regular expressions to remove non-alphabet characters such as brackets, numbers, and extra whitespace.

In [164]:
# Import nltk library, download stopwords, and set the language
# for stopwords to English

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
stop_words = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to /Users/kayla/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [165]:
import re
import string
stemmer = nltk.SnowballStemmer("english")

'''
Side note: the ? inside some of the expressions means non-greedy,
ie. match as few characters as possible, so if there is a smaller
set of text that matches the regex within a larger set of text that
also matches the regex, only the smaller set will be removed
(ie. fewer characters).
'''

# clean(text) consumes a string and removes all non-alphabet characters
# as well as hyperlinks.
# It also removes stopwords and stems words by removing affixes.


def clean(text):
    text = str(text).lower()                                            # convert text to lowercase
    text = re.sub(r'\[.*?\]', '', text)                                 # remove any text in square brackets (including the brackets)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                   # remove hyperlinks
    text = re.sub(r'<.*?>+', '', text)                                  # remove any text in <> brackets (including the brackets)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)     # remove punctuation
    text = re.sub(r'\n', '', text)                                      # remove newlines
    text = re.sub(r'\w*\d\w*', '', text)                                # remove any words that contain digits (at the beginning, middle or end)
    text = [word for word in text.split(' ') if word not in stop_words] # keep only the words that are not stop words
    text = [stemmer.stem(word) for word in text]                        # stem the remaining words
    text = " ".join(text)                                               # join the words back into a single string
    return text
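
The side note about non-greedy matching can be checked on a small made-up string. This is only an illustrative sketch; the `sample` text is hypothetical and does not come from the dataset.

```python
import re

# Hypothetical sample string with two bracketed groups.
sample = "a [b] c [d] e"

# Non-greedy (.*?): each bracketed group is matched and removed on its own.
non_greedy = re.sub(r"\[.*?\]", "", sample)

# Greedy (.*): a single match runs from the first "[" to the last "]",
# so the text between the two groups is removed as well.
greedy = re.sub(r"\[.*\]", "", sample)

print(repr(non_greedy))  # 'a  c  e'
print(repr(greedy))      # 'a  e'
```

With the greedy pattern the word between the two bracketed groups is lost, which is why the non-greedy form is the safer choice for titles.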


In [166]:
# apply clean function to title
data["Title"] = data["Title"].apply(clean)
data.head(n=10)

Unnamed: 0,Title,Description,Genres,Content Type
0,unwel,This docuseries takes a deep dive into the luc...,Reality TV,TV Show
1,aliv,"As a grisly virus rampages a city, a lone man ...","Horror Movies, International Movies, Thrillers",Movie
2,annefrank parallel stori,"Through her diary, Anne Frank's story is retol...","Documentaries, International Movies",Movie
3,blackaf,Kenya Barris and his family navigate relations...,TV Comedies,TV Show
4,catsthemewvi,This pawesome documentary explores how our fel...,"Documentaries, International Movies",Movie
5,friendbutmarri,"Pining for his high school crush for years, a ...","Dramas, International Movies, Romantic Movies",Movie
6,friendbutmarri,As Ayu and Ditto finally transition from best ...,"Dramas, International Movies, Romantic Movies",Movie
7,realityhigh,When nerdy high schooler Dani finally attracts...,Comedies,Movie
8,,This documentary celebrates the 50th anniversa...,"Documentaries, Sports Movies",Movie
9,selfi,"Two days before their final exams, three teen ...","Comedies, Dramas, International Movies",Movie


In [167]:
data.Title.sample(10)

273                     alia grace
5235                   surgeon cut
196                 action replayi
5470                       traitor
4693                         bridg
1721               one second next
398                         apostl
3869                           ray
2723    lil peep everybodi everyth
5642                 vampir knight
Name: Title, dtype: object

The clean function did a good job of removing characters that would cause problems for the recommender system. One problem I noticed is that the stemmer stems some words that don't need to be stemmed. For example, "alive" is stemmed to "aliv" and "chocolate" is stemmed to "chocol". For now I will keep the stemmer, but in the future I might experiment with other stemmers to compare.

### 4b Remove Blank Titles
Since stopwords, special characters and numbers are removed by the clean function, some of the titles may end up being blank. For example, the show "The 100" contains only a stopword and a number, so after applying the clean function we will end up with an empty string. These movies/shows will be useless for the recommendation system, so I will remove them from the dataset. 
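
To see why such titles come out blank, here is a minimal sketch that mimics only the punctuation, digit, and stop-word steps of the cleaning. `toy_clean` and `TOY_STOP_WORDS` are hypothetical stand-ins (the real function uses NLTK's English stop-word list and a stemmer).

```python
import re
import string

# Hypothetical, tiny stop-word set standing in for NLTK's English list.
TOY_STOP_WORDS = {"the", "a", "of"}

def toy_clean(title):
    """Simplified cleaning: lowercase, strip punctuation, drop words with digits, drop stop words."""
    title = title.lower()
    title = re.sub("[%s]" % re.escape(string.punctuation), "", title)
    title = re.sub(r"\w*\d\w*", "", title)
    words = [w for w in title.split(" ") if w not in TOY_STOP_WORDS]
    return " ".join(words)

print(repr(toy_clean("The 100")))    # ''
print(repr(toy_clean("#Rucker50")))  # ''
print(repr(toy_clean("#Alive")))     # 'alive'
```

"The 100" loses "100" to the digit rule and "the" to the stop-word filter, while "#Rucker50" becomes a single word containing digits and is removed entirely.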

In [168]:
# let's take a look at the titles that end up blank after applying clean

data.loc[data["Title"] == ""]

Unnamed: 0,Title,Description,Genres,Content Type
8,,This documentary celebrates the 50th anniversa...,"Documentaries, Sports Movies",Movie
19,,"After an awful accident, a couple admitted to ...","Horror Movies, International Movies",Movie
22,,"In this thought-provoking documentary, scholar...",Documentaries,Movie
27,,A farmer pens a confession admitting to his wi...,"Dramas, Thrillers",Movie
28,,"In this dark alt-history thriller, a naïve law...","Crime TV Shows, International TV Shows, TV Dramas",TV Show
29,,Archival video and new interviews examine Mexi...,"Crime TV Shows, Docuseries, International TV S...",TV Show
30,,"Seeking her independence, a young woman moves ...","Horror Movies, Independent Movies, Thrillers",Movie
34,,This intimate documentary follows rock star Ar...,"Documentaries, International Movies, Sports Mo...",Movie
36,,When a flood of natural disasters begins to de...,"Action & Adventure, Sci-Fi & Fantasy",Movie
37,,"In a social experiment, a group of daughters s...","British TV Shows, Reality TV",TV Show


Let's see how many blank titles there are in the dataset after cleaning the titles.

In [169]:
data["Title"].loc[data.Title == ''].count()

54

There are 54 blank titles, which will not provide good recommendations, so I will remove these observations from the dataset. 

In [170]:
data.drop(data[data["Title"] == ''].index, inplace=True)
data["Title"].loc[data.Title == ''].count()

0

### 4c Remove Duplicate Titles
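For the recommender, we want to remove duplicate rows so that the same movie is not recommended for itself and does not appear twice in the recommendations for another movie. The cells below count the duplicated cleaned titles and build `indices`, a lookup from title to row label. Note that `Series.drop_duplicates()` drops repeated values, which here are the row labels and are all unique, so the 224 duplicate titles remain in the lookup.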

In [171]:
data["Title"].duplicated().value_counts()

False    5689
True      224
Name: Title, dtype: int64

In [172]:
indices = pd.Series(data.index, index=data['Title']).drop_duplicates()
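
As a small sketch of the difference between dropping duplicate values and dropping duplicate titles, consider the toy frame below. The names `toy`, `lookup`, `by_value`, and `by_title` are hypothetical, and keeping the first row for each title is an assumption, not necessarily the right choice for this dataset.

```python
import pandas as pd

# Hypothetical toy data with one repeated cleaned title.
toy = pd.DataFrame({"Title": ["aliv", "friendbutmarri", "friendbutmarri", "selfi"]})
lookup = pd.Series(toy.index, index=toy["Title"])

# drop_duplicates() compares the values (0, 1, 2, 3), so nothing is removed.
by_value = lookup.drop_duplicates()

# Index.duplicated() compares the titles, so the second "friendbutmarri" goes.
by_title = lookup[~lookup.index.duplicated(keep="first")]

print(len(by_value))               # 4
print(len(by_title))               # 3
print(by_title["friendbutmarri"])  # 1
```

With the title-based version, a lookup for a repeated title returns a single row number rather than a Series of several.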

## 5. Building the Recommender System
I will use cosine similarity to recommend similar movies and TV shows to the user based on the Genres column.

Here I will use the TfidfVectorizer to convert the Genres column into a matrix of TF-IDF features. TF-IDF stands for term frequency-inverse document frequency. It weights each word by how often it appears in a given entry (term frequency) and down-weights words that appear in many entries across the corpus (inverse document frequency). The more entries a term appears in, the less it helps tell them apart, and hence the lower the TF-IDF score assigned to that word.
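
To make the weighting concrete, here is a hand-rolled sketch in plain Python on three made-up genre lists. It assumes the smoothed IDF formula, idf = ln((1 + n) / (1 + df)) + 1, and the L2 normalization that scikit-learn documents as its defaults; the data and the helper names `tfidf_vector` and `cosine` are hypothetical.

```python
import math

# Hypothetical tokenized genre strings.
docs = [
    ["horror", "movies", "thrillers"],
    ["dramas", "movies"],
    ["tv", "comedies"],
]
n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: in how many entries does each term appear?
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_vector(doc):
    """Term count times IDF for every vocabulary term, scaled to unit length."""
    weights = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

def cosine(u, v):
    """Cosine similarity; the vectors are already unit length, so a dot product suffices."""
    return sum(a * b for a, b in zip(u, v))

vectors = [tfidf_vector(doc) for doc in docs]

print(idf["movies"] < idf["horror"])          # True: "movies" is more common
print(cosine(vectors[0], vectors[1]) > 0)     # True: both contain "movies"
print(cosine(vectors[0], vectors[2]) == 0.0)  # True: no shared terms
```

The common term "movies" receives a lower weight than the rarer "horror", and two entries with no genre words in common get a similarity of zero.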


In [173]:
feature = data["Genres"].tolist()

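# Convert the genres column into a matrix of TF-IDF features.
# The documents are passed to fit_transform; the `input` parameter only accepts
# "filename", "file" or "content" (the default), so it is left at its default.
tfidf = text.TfidfVectorizer(stop_words="english")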
tfidf_matrix = tfidf.fit_transform(feature)
similarity = cosine_similarity(tfidf_matrix)

In [174]:
data["Title"].duplicated().value_counts()

False    5689
True      224
Name: Title, dtype: int64

Finally, I will write the function netflix_recommendation to recommend shows and movies to users based on a given show/movie. 

In [175]:
def netflix_recommendation(title, similarity=similarity):
    index = indices[title]
    similarity_scores = list(enumerate(similarity[index]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    similarity_scores = similarity_scores[0:11]
    movieindices = [i[0] for i in similarity_scores if i[0] != index]
    return data[['Title']].iloc[movieindices[:10]]
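
One thing to watch in a function like this is that `similarity` is indexed by row position, while `indices` stores row labels, and the two differ once rows have been dropped without resetting the index. The sketch below shows one possible way to keep them aligned on a made-up catalogue. The names `toy`, `toy_similarity`, `toy_positions`, and `recommend` are hypothetical, and this is not the tutorial's code.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue; the second row stands in for a title that cleaned to "".
toy = pd.DataFrame({
    "Title": ["aliv", "", "crimson peak", "big mouth", "horn"],
    "Genres": ["Horror Movies, Thrillers", "Documentaries",
               "Horror Movies, Thrillers", "TV Comedies", "Horror Movies"],
})

# Drop blank titles, then renumber the rows so labels equal positions 0..n-1.
toy = toy[toy["Title"] != ""].reset_index(drop=True)

toy_similarity = cosine_similarity(TfidfVectorizer().fit_transform(toy["Genres"]))

# Title -> row position, keeping the first row for any repeated title.
toy_positions = pd.Series(toy.index, index=toy["Title"])
toy_positions = toy_positions[~toy_positions.index.duplicated(keep="first")]

def recommend(title, n=2):
    """Return the n titles whose genres are most similar to the given cleaned title."""
    pos = toy_positions[title]
    ranked = toy_similarity[pos].argsort()[::-1]   # most similar first
    ranked = [i for i in ranked if i != pos][:n]   # skip the query itself
    return toy["Title"].iloc[ranked].tolist()

print(recommend("aliv"))  # ['crimson peak', 'horn']
```

Because the index is reset before the similarity matrix and the lookup are built, the position returned by the lookup refers to the same row in both.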

## 6. Use the Recommender to Recommend Shows and Movies
Now that the recommender is built, I will use it to recommend movies and TV shows for some example titles.

In [176]:
netflix_recommendation("witcher")

Unnamed: 0,Title
164,special love
311,alway mayb
1002,christma view
1624,final found someon
2295,isa pa feel
2444,way
2579,krishna leela
3265,amnesia girl
4057,sakal mage tayo
4848,girl allerg wifi


In [177]:
netflix_recommendation("elit")

Unnamed: 0,Title
465,await instruct
5844,winchest
2114,horn
1139,crimson peak
1195,dark light
5840,wildl
4516,sweetheart
209,aerial
5292,wander earth
1358,doom annihil


In [178]:
netflix_recommendation("manifest")

Unnamed: 0,Title
3,blackaf
285,washington
417,arrest develop
434,astronomi club sketch show
451,aunti donna big ol hous fun
656,big mouth
752,bojack horseman
805,brew brother
935,champion
937,chappell show


## 7. Conclusions
Overall, the recommender does an okay job of recommending movies and shows based on the given title, although there are some pitfalls. 

First, cleaning the titles makes it hard for the user to pass shows/movies to the recommender, because they have to know what the title becomes after cleaning. For example, the show "Elite" becomes "elit" after applying the clean function. In the future, I would like to fix this so that the user can type the title exactly as it is and get recommendations.
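
One possible direction, sketched here under the assumption that the original titles are kept alongside the cleaned ones, is a lookup keyed by a lightly normalized form of the original title. The list `original_titles` and the helpers `normalize` and `find_position` are hypothetical.

```python
# Hypothetical original titles, in the same row order as the cleaned data.
original_titles = ["Elite", "The Witcher", "Manifest"]

def normalize(title):
    """Lowercase and trim so that small typing differences do not matter."""
    return title.strip().lower()

# Normalized original title -> row position.
position_by_title = {normalize(t): pos for pos, t in enumerate(original_titles)}

def find_position(user_input):
    """Return the row position for a title as typed, or None if it is unknown."""
    return position_by_title.get(normalize(user_input))

print(find_position("Elite"))          # 0
print(find_position(" the witcher "))  # 1
print(find_position("Unknown Show"))   # None
```

The row position found this way could then be used to look up the similarity scores, so the user never needs to know the stemmed form of a title.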

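Second, the recommendations are not great, as they only look at the list of genres for each movie to find similar ones. In a future iteration I would like to use other features to recommend similar movies, such as title, description, director, and maybe cast. A further problem is that the blank titles were dropped without resetting the index, so the row labels stored in `indices` no longer match the row positions of `similarity`; as a result, a lookup can return the neighbors of a different title than the one requested.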