<a href="https://colab.research.google.com/github/harshitthakur1/Unsupervised-ML---Netflix-Movies-and-TV-Shows-Clustering/blob/main/Unsupervised_ML_Netflix_Movies_and_TV_Shows_Clustering_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Unsupervised ML - Netflix Movies and TV Shows Clustering
 By: Harshit Thakur



##### **Project Type**    - EDA, Clustering, Content based recommender system, Unsupervised.
##### **Contribution**    - Individual

# **Project Summary - Clustering Netflix Movies and TV Shows**


Netflix offers a vast and diverse library of movies and TV shows, making it essential to provide personalized recommendations to its users. Clustering is a powerful technique used to group similar content together, facilitating content discovery and user engagement. Here's an overview of how clustering is applied to Netflix content:

Data Collection and Preparation:

Netflix collects extensive metadata about each title, including genre, actors, directors, release year, and user interactions like viewing history and ratings.

Feature Extraction:

To apply clustering, relevant features are extracted from the dataset. These can include textual descriptions, user interactions, and metadata attributes.

Text Preprocessing:

Textual data, such as plot summaries and user reviews, undergo preprocessing, including tokenization, stop-word removal, and text vectorization using techniques like TF-IDF or word embeddings.

Clustering Algorithms:

Various clustering algorithms can be applied to group similar content. Common choices include K-Means, Hierarchical Clustering, and DBSCAN. The choice of algorithm depends on the dataset and specific goals.

Distance or Similarity Metrics:

Clustering relies on similarity or distance metrics, such as cosine similarity, Euclidean distance, or Jaccard similarity, to measure how similar or dissimilar content items are.
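To make the three metrics named above concrete, here is a minimal sketch in plain Python. The vectors and genre sets are hypothetical toy values, and the helper names (`cosine_sim`, `euclidean_dist`, `jaccard_sim`) are chosen for illustration only; they are not functions used elsewhere in this notebook.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_dist(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard_sim(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical toy data: vec_b points in the same direction as vec_a
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]
genres_a = {"Dramas", "Thrillers"}
genres_b = {"Dramas", "Comedies"}

print(cosine_sim(vec_a, vec_b))
print(euclidean_dist(vec_a, vec_b))
print(jaccard_sim(genres_a, genres_b))
```

The two vectors point in the same direction, so their cosine similarity is at its maximum even though their Euclidean distance is not zero. This is one reason cosine similarity is often preferred for text vectors of different lengths.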

Cluster Evaluation:

To assess the quality of clusters, internal validation metrics such as the silhouette score or the Davies-Bouldin index can be used. External validation metrics require ground-truth labels, which are not available for this dataset.
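As a rough illustration of how these two scores behave, the sketch below clusters synthetic, well-separated blobs with K-Means and reports both metrics. The data is generated with `make_blobs` and is not the Netflix dataset; the blob centres and parameters are assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic, well-separated 2-D blobs (toy data, not the Netflix dataset)
points, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)

sil = silhouette_score(points, labels)      # ranges from -1 to 1; higher is better
dbi = davies_bouldin_score(points, labels)  # non-negative; lower is better
print(f"silhouette={sil:.3f}, davies_bouldin={dbi:.3f}")
```

On real text data the scores are usually far less clear-cut than on these toy blobs, so they are best used to compare candidate cluster counts rather than read as absolute quality.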

User Feedback Incorporation:

Netflix continually collects user feedback, which can be used to improve clustering results. Feedback may include user ratings, watch history, and explicit likes and dislikes.

Dynamic Clustering:

Clusters need to be updated regularly to adapt to new content additions and changing user preferences. This requires periodic re-clustering and model retraining.

Recommendation Systems:

Once content items are clustered, they can be leveraged by recommendation systems. Users are presented with content from the same cluster they've previously enjoyed, enhancing their viewing experience.

Personalization:

Personalization is achieved by considering a user's history and preferences to make recommendations specific to their tastes and interests. Clustering enhances personalization by grouping content effectively.

In summary, clustering Netflix movies and TV shows is an integral part of the recommendation engine, contributing to a more engaging and personalized user experience. It enables users to discover content that aligns with their preferences while helping Netflix manage and organize its vast library efficiently. The dynamic nature of content and user preferences requires ongoing monitoring and adaptation of clustering models to ensure the most relevant and up-to-date recommendations.

# **GitHub Link -**

https://github.com/harshitthakur1

# **Problem Statement**


Netflix is the world's largest online streaming service provider, with over 220 million subscribers as of 2022-Q2. It is crucial that they effectively cluster the shows that are hosted on their platform in order to enhance the user experience, thereby preventing subscriber churn.

We will be able to understand the shows that are similar to and different from one another by creating clusters, which may be leveraged to offer the consumers personalized show suggestions depending on their preferences.

The goal of this project is to classify/group the Netflix shows into certain clusters such that the shows within a cluster are similar to each other and the shows in different clusters are dissimilar to each other.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# importing the required libraries
import re
import string
import unicodedata
import warnings
from datetime import datetime as dt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# text processing
import nltk
from bs4 import BeautifulSoup
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer
from nltk.tokenize import TweetTokenizer
from wordcloud import WordCloud, STOPWORDS

# vectorization, dimensionality reduction and clustering
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity
import scipy.cluster.hierarchy as shc

warnings.filterwarnings('ignore')

# NLTK resources used by the stop-word removal and lemmatization steps
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

%matplotlib inline
sns.set()


### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df = pd.read_csv("/content/drive/MyDrive/Alma projects/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv",encoding=('ISO-8859-1'),low_memory=False)


### Dataset First View

In [None]:
# top 5 rows
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
n_rows, n_columns = df.shape
print(f"Number of columns: {n_columns}\nNumber of rows: {n_rows}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Checking for duplicate records
df.duplicated().value_counts()

There are no duplicated records in the dataset.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10,5))
sns.heatmap(df.isnull(), cbar=False)

### What did you know about your dataset?

The director, cast, country, date_added, and rating columns contain missing values.

The missing values in the director, cast, and country attributes can be replaced with 'Unknown'.

The 10 records with missing values in the date_added column can be dropped.

The missing values in rating can be imputed with its mode, since this attribute is discrete.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

**show_id :** Unique ID for every Movie / Tv Show

**type :** Identifier - A Movie or TV Show

**title :** Title of the Movie / Tv Show

**director :** Director of the Movie

**cast :** Actors involved in the movie / show

**country :** Country where the movie / show was produced

**date_added :** Date it was added on Netflix

**release_year :** Actual release year of the movie / show

**rating :** TV Rating of the movie / show

**duration :** Total Duration - in minutes or number of seasons

**listed_in :** Genre

**description:** The Summary description

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Handling the missing values
df[['director','cast','country']] = df[['director','cast','country']].fillna('Unknown')
df['rating'] = df['rating'].fillna(df['rating'].mode()[0])
df.dropna(axis=0, inplace = True)

In [None]:
df.shape

We have successfully handled all the missing values in the dataset.

In [None]:
# Top countries
df.country.value_counts()


In [None]:
# Genre of shows
df.listed_in.value_counts()

Some movies / TV shows were filmed in multiple countries or have multiple genres associated with them.

To simplify the analysis, let's consider only the primary country (the first one listed) where the respective movie / TV show was filmed.

Also, let's consider only the primary genre (the first one listed) of the respective movie / TV show.

In [None]:
# Choosing the primary country and primary genre to simplify the analysis
df['country'] = df['country'].apply(lambda x: x.split(',')[0])
df['listed_in'] = df['listed_in'].apply(lambda x: x.split(',')[0])

In [None]:
# country in which a movie / TV show was produced
df.country.value_counts()

In [None]:
# genre of shows
df.listed_in.value_counts()

#### Typecasting 'duration' from string to integer

In [None]:
# Splitting the duration column, and changing the datatype to integer
df['duration'] = df['duration'].apply(lambda x: int(x.split()[0]))

In [None]:
# Number of seasons for tv shows
df[df['type']=='TV Show'].duration.value_counts()

In [None]:
# Movie length in minutes
df[df['type']=='Movie'].duration.unique()

In [None]:
# datatype of duration
df.duration.dtype
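The cast above raises an error if any `duration` value does not start with a number. A slightly more defensive variant, sketched here on a hypothetical three-row sample rather than the real dataset, extracts the digits with a regular expression and checks for failures before converting.

```python
import pandas as pd

# Hypothetical sample mimicking the raw 'duration' strings
raw = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show"],
    "duration": ["93 min", "1 Season", "4 Seasons"],
})

# Pull out the first run of digits; rows without one become NaN
# instead of raising, so they can be inspected before casting
number = raw["duration"].str.extract(r"(\d+)", expand=False)
assert number.notna().all(), "some duration values contain no number"
raw["duration"] = number.astype(int)

print(raw)
```

Note that the resulting integer still means minutes for movies and seasons for TV shows, so the two types should be analysed separately, as the following cells do.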

#### Typecasting 'date_added' from string to datetime

In [None]:
# Typecasting 'date_added' from string to datetime
# (stripping stray whitespace first so every value parses the same way)
df["date_added"] = pd.to_datetime(df['date_added'].str.strip())

# first and last date on which a show was added on Netflix
df.date_added.min(),df.date_added.max()

The shows were added on Netflix between 1st January 2008 and 16th January 2021.
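If the date strings carry stray whitespace or mixed layouts, parsing can fail or behave differently across pandas versions. The sketch below uses a hypothetical three-value sample; the `"%B %d, %Y"` format is an assumption about how the dates are written (for example "January 16, 2021") and should be checked against the actual column.

```python
import pandas as pd

# Hypothetical sample; note the leading space in the second value
dates = pd.Series(["January 1, 2008", " August 4, 2017", "January 16, 2021"])

# Strip whitespace first, then parse with an explicit format so that
# a malformed value raises instead of being silently mis-parsed
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y")

print(parsed.min().date(), parsed.max().date())
print(parsed.dt.month.tolist(), parsed.dt.year.tolist())
```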

In [None]:
# Adding new attributes month and year of date added

df['month_added'] = df['date_added'].dt.month
df['year_added'] = df['date_added'].dt.year
df.drop('date_added', axis=1, inplace=True)

#### Rating

In [None]:
# Age ratings for shows in the dataset
plt.figure(figsize=(10,5))
sns.countplot(x='rating',data=df)

The highest number of shows on Netflix are rated TV-MA, followed by TV-14 and TV-PG.

In [None]:
# Age ratings
df.rating.unique()

In [None]:
# Changing the values in the rating column
rating_map = {'TV-MA':'Adults',
              'R':'Adults',
              'PG-13':'Teens',
              'TV-14':'Young Adults',
              'TV-PG':'Older Kids',
              'NR':'Adults',
              'TV-G':'Kids',
              'TV-Y':'Kids',
              'TV-Y7':'Older Kids',
              'PG':'Older Kids',
              'G':'Kids',
              'NC-17':'Adults',
              'TV-Y7-FV':'Older Kids',
              'UR':'Adults'}

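# Assign the result back instead of using inplace=True on a column selection,
# which may not modify df in recent pandas versions
df['rating'] = df['rating'].replace(rating_map)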
df['rating'].unique()

In [None]:
# Age ratings for shows in the dataset
plt.figure(figsize=(10,5))
sns.countplot(x='rating',data=df)

Around 50% of the shows on Netflix are produced for an adult audience, followed by young adults, older kids, and kids. Shows produced specifically for teenagers form the smallest group.

# **Exploratory Data Analysis(EDA)**

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# **Univariate Analysis:**

#### Chart - 1

In [None]:
# Number of Movies and TV Shows in the dataset
plt.figure(figsize=(7,7))
df.type.value_counts().plot(kind='pie',autopct='%1.2f%%')
plt.ylabel('')
plt.title('Movies and TV Shows in the dataset')

There are more movies (69.14%) than TV shows (30.86%) in the dataset.

#### Chart - 2

In [None]:
# Top 10 directors in the dataset
plt.figure(figsize=(10,5))
df[~(df['director']=='Unknown')].director.value_counts().nlargest(10).plot(kind='barh')
plt.title('Top 10 directors by number of shows directed')

Raul Campos and Jan Suter have together directed 18 movies / TV shows, more than anyone else in the dataset.

#### Chart - 3

In [None]:
# Top 10 countries with the highest number movies / TV shows in the dataset
plt.figure(figsize=(10,5))
df[~(df['country']=='Unknown')].country.value_counts().nlargest(10).plot(kind='barh')
plt.title(' Top 10 countries with the highest number of shows')

The highest number of movies / TV shows were produced in the US, followed by India and the UK.

#### Chart - 4

In [None]:
# Percent share of movies/TV shows by the top 3 countries
top_3 = df.country.value_counts().nlargest(3).sum() / len(df) * 100

# Percent share of movies/TV shows by the top 10 countries
top_10 = df.country.value_counts().nlargest(10).sum() / len(df) * 100

# Data
countries = ['Top 3', 'Top 10']
percentages = [top_3, top_10]

# Create a bar chart
plt.bar(countries, percentages, color=['blue', 'green'])
plt.xlabel('Countries')
plt.ylabel('Percentage Share (%)')
plt.title('Percentage Share of Movies/TV Shows by Countries')
# Add percentage labels to the bars
for i, percentage in enumerate(percentages):
    plt.text(i, percentage, f'{percentage:.2f}%', ha='center', va='bottom')
plt.show()

The top 3 countries together account for about 56% of all movies and TV shows in the dataset.

This value increases to about 78% for top ten countries.
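The same top-n share calculation is reused for genres later on, so it can be wrapped in a small helper. This is a sketch on a hypothetical toy column; `top_n_share` is an illustrative name, not a function defined elsewhere in this notebook. Note that in the cell above 'Unknown' counts as a country unless it is filtered out first.

```python
import pandas as pd

def top_n_share(series: pd.Series, n: int) -> float:
    """Percentage of non-missing rows covered by the n most frequent values."""
    counts = series.value_counts()
    return counts.nlargest(n).sum() / counts.sum() * 100

# Hypothetical toy column: 10 titles from 4 countries
countries = pd.Series(["US"] * 5 + ["India"] * 3 + ["UK"] * 1 + ["Japan"] * 1)

print(top_n_share(countries, 1))
print(top_n_share(countries, 2))
```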

#### Chart - 5

In [None]:
# Visualizing the year in which the movie / tv show was released
plt.figure(figsize=(10,5))
sns.histplot(df['release_year'])
plt.title('distribution by released year')

Netflix has a greater number of recent movies / TV shows than older ones.

#### Chart - 6

In [None]:
# Top 10 genres
plt.figure(figsize=(10,5))
df.listed_in.value_counts().nlargest(10).plot(kind='barh')
plt.title('Top 10 genres')

Dramas is the most common genre, followed by comedies and documentaries.

#### Chart - 7

In [None]:
# Share of the top 3 genres
top_3_genres = df.listed_in.value_counts().nlargest(3).sum() / len(df) * 100

# Share of the top 10 genres
top_10_genres = df.listed_in.value_counts().nlargest(10).sum() / len(df) * 100

# Data
genre_categories = ['Top 3 Genres', 'Top 10 Genres']
genre_percentages = [top_3_genres, top_10_genres]

# Create a bar chart
plt.bar(genre_categories, genre_percentages, color=['blue', 'green'])
plt.xlabel('Genre Categories')
plt.ylabel('Percentage Share (%)')
plt.title('Percentage Share of Top Genres')

# Add percentage labels to the bars
for i, percentage in enumerate(genre_percentages):
    plt.text(i, percentage, f'{percentage:.2f}%', ha='center', va='bottom')

plt.show()

The top three genres account for about 41% of all movies and TV shows.
This value increases to about 82% for the top 10 genres.

#### Chart - 8

In [None]:
# Number of shows added on different months
plt.figure(figsize=(10, 5))
ax = sns.countplot(data=df, x='month_added')
plt.title('Shows added each month over the years')
ax.set_xticklabels(['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'], rotation=30)
plt.xlabel('Month')
plt.show()

I chose a countplot because it is well suited to visualizing the distribution of categorical data. Here it displays the number of shows added in each month, with the months as categories on the x-axis.

**Over the years a greater number of shows were added in the months of October, November, December, and January.**

#### Chart - 9

In [None]:
# Number of shows added over the years
plt.figure(figsize=(10, 5))
ax = sns.countplot(data=df, x='year_added')
plt.title('Number of shows added each year')
plt.xlabel('Year')

Netflix continues to add more shows to its platform over the years.

There is a decrease in the number of shows added in the year 2020, which might be attributed to the COVID-19-induced lockdowns, which halted the creation of shows.

We have Netflix data only up to 16th January 2021, hence fewer titles are recorded for that year.

#### Chart - 10

In [None]:
# Number of shows on Netflix for different age groups
plt.figure(figsize=(10,5))
df.rating.value_counts().plot(kind='barh')
plt.title('Number of shows on Netflix for different age groups')

The majority of the shows on Netflix cater to adult and young adult audiences.


# **Bivariate analysis:**

#### Chart - 11

In [None]:
# Number of shows released each year since 2008
order = range(2008,2022)
plt.figure(figsize=(10,5))
p = sns.countplot(x='release_year',data=df, hue='type',
                  order = order)
plt.title('Number of shows released each year since 2008 that are on Netflix')
plt.xlabel('')
for i in p.patches:
  p.annotate(format(i.get_height(), '.0f'), (i.get_x() + i.get_width() / 2., i.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Over the years, Netflix has consistently focused on adding more shows to its platform.

Though there was a decrease in the number of movies released in 2020, this pattern did not exist in the number of TV shows released in the same year.

This might signal that Netflix is increasingly concentrating on introducing more TV series to its platform rather than movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Seasons in each TV show
plt.figure(figsize=(10,5))
p = sns.countplot(x='duration',data=df[df['type']=='TV Show'])
plt.title('Number of seasons per TV show distribution')

for i in p.patches:
  p.annotate(format(i.get_height(), '.0f'), (i.get_x() + i.get_width() / 2., i.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')

In [None]:
# Calculate the percentage of TV shows with just 1 season
percentage = (len(df[(df['type'] == 'TV Show') & (df['duration'] == 1)]) / len(df[df['type'] == 'TV Show'])) * 100

# Data for the bar plot
categories = ['TV Shows with 1 Season', 'Other TV Shows']
values = [percentage, 100 - percentage]

# Create a bar plot
plt.figure(figsize=(8, 6))
bars = plt.bar(categories, values, color=['blue', 'gray'])
plt.ylabel('Percentage (%)')
plt.title('Percentage of TV Shows with Just 1 Season')

# Add percentage values on top of the bars
for bar, value in zip(bars, values):
    plt.text(bar.get_x() + bar.get_width() / 2, value, f'{value:.2f}%', ha='center', va='bottom')

plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

The TV series in the dataset have up to 16 seasons; however, the bulk of them have only one. This might mean that the majority of TV shows have only recently begun, and that further seasons are on the way.

There are very few TV shows that have more than 8 seasons.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# length of movie analysis
plt.figure(figsize=(10,5))
sns.histplot(x='duration',data=df[df['type']=='Movie'])
plt.title('Movie duration distribution')
# Movie statistics
df[df['type']== 'Movie'].duration.describe()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Movie lengths range from 3 to 312 minutes, and the distribution is approximately normal.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14

In [None]:
# Average movie length over the years
plt.figure(figsize=(10,5))
df[df['type']=='Movie'].groupby('release_year')['duration'].mean().plot(kind='line')
plt.title('Average movie length over the years')
plt.ylabel('Length of movie in minutes')
plt.xlabel('Year')
# Movie release year statistics
df[df['type']== 'Movie'].release_year.describe()

 What is/are the insight(s) found from the chart?

Netflix has several older movies on its site, including some released as far back as 1942.

As per the plot, movies made in the 1940s had a fairly short duration on average.

On average, movies made in the 1960s are the longest.
The average length of a movie has been decreasing continuously since the 2000s.

#### Chart - 15

In [None]:
# Top 10 genre for movies
plt.figure(figsize=(10,5))
df[df['type']=='Movie'].listed_in.value_counts().nlargest(10).plot(kind='barh')
plt.title('Top 10 genres for movies')

Dramas, comedies, and documentaries are the most common genres for movies on Netflix.

#### Chart - 16

In [None]:
# Top 10 genre for tv shows
plt.figure(figsize=(10,5))
df[df['type']=='TV Show'].listed_in.value_counts().nlargest(10).plot(kind='barh')
plt.title('Top 10 genres for TV Shows')

International, crime, and kids are the most common genres for TV shows on Netflix.

#### Chart - 17

In [None]:
# Top 10 movie directors
plt.figure(figsize=(10,5))
df[~(df['director']=='Unknown') & (df['type']=='Movie')].director.value_counts().nlargest(10).plot(kind='barh')
plt.title('Top 10 movie directors')

Raul Campos and Jan Suter have together directed 18 movies, more than any other director.

They are followed by Marcus Raboy, Jay Karas, and Cathy Garcia-Molina.

#### Chart - 18

In [None]:
# Top 10 TV show directors
plt.figure(figsize=(10,5))
df[~(df['director']=='Unknown') & (df['type']=='TV Show')].director.value_counts().nlargest(10).plot(kind='barh')
plt.title('Top 10 TV show directors')

Alastair Fothergill has directed three TV shows, the most of any director.

Only six directors have directed more than one television show.

#### Chart - 19

In [None]:
# Top actors for movies
plt.figure(figsize=(10,5))
df[~(df['cast']=='Unknown') & (df['type']=='Movie')].cast.value_counts().nlargest(5).plot(kind='barh')
plt.title('Actors who have appeared in highest number of movies')

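Because `value_counts()` is applied to the full `cast` string, these counts refer to titles whose entire cast list is identical. Samuel West is the sole listed cast member of 10 movies, followed by Jeff Dunham with 7 movies.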

#### Chart - 20

In [None]:
# Top actors for TV shows
plt.figure(figsize=(10,5))
df[~(df['cast']=='Unknown') & (df['type']=='TV Show')].cast.value_counts().nlargest(5).plot(kind='barh')
plt.title('Actors who have appeared in highest number of TV shows')

David Attenborough is the sole listed cast member of 13 TV shows. The next most frequent cast list is the ensemble of Michela Luci, Jamie Watson, Anna Claire Bartlam, Dante Zee, and Eric Peterson, which appears in 4 TV shows.
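The two charts above count whole cast strings, so an actor who appears alongside different co-stars is not credited across titles. To count appearances per individual actor, the cast list can be split and exploded first. This is a minimal sketch on a hypothetical toy frame with made-up names, not a result from the Netflix data.

```python
import pandas as pd

# Hypothetical toy frame with comma-separated cast lists
shows = pd.DataFrame({
    "title": ["A", "B", "C"],
    "cast": ["Actor One, Actor Two", "Actor Two", "Unknown"],
})

actor_counts = (
    shows.loc[shows["cast"] != "Unknown", "cast"]
    .str.split(",")   # one list of names per title
    .explode()        # one row per (title, actor) pair
    .str.strip()      # drop spaces left over from the split
    .value_counts()
)
print(actor_counts)
```

Applied to `df`, the same chain could be filtered by `type` to rank actors separately for movies and TV shows.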

#### Building a wordcloud

In [None]:
# Building a wordcloud for the movie descriptions
# Lowercase every description and join them into one long string
comment_words = " ".join(df['description'].astype(str).str.lower())

# Named wc_stopwords so that it does not shadow nltk's `stopwords` import
wc_stopwords = set(STOPWORDS)

wordcloud = WordCloud(width = 700, height = 700,
                background_color ='white',
                stopwords = wc_stopwords,
                min_font_size = 10).generate(comment_words)

# plot the WordCloud image
plt.figure(figsize = (10,5), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

Some keywords in Netflix show descriptions: life, family, new, love, young, world, group, death, man, woman, murder, son, girl, documentary, secret.
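A word cloud shows relative prominence but not exact counts. A simple frequency table can complement it, sketched here with the standard library only. The two descriptions and the tiny stop-word set are hypothetical placeholders; in the notebook, `df.description` and the `STOPWORDS` set from the `wordcloud` package could be used instead.

```python
import re
from collections import Counter

# Hypothetical descriptions and a tiny illustrative stop-word list
descriptions = [
    "A young woman uncovers a family secret.",
    "A family moves to a new town and finds love.",
]
stop_words = {"a", "and", "to", "the"}

tokens = []
for text in descriptions:
    # Lowercase, keep alphabetic runs only, and skip stop words
    words = re.findall(r"[a-z]+", text.lower())
    tokens.extend(w for w in words if w not in stop_words)

word_counts = Counter(tokens)
print(word_counts.most_common(3))
```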

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

# Modelling Approach:

Select the attributes based on which you want to cluster the shows

Text preprocessing: Remove all non-ascii characters, stopwords and punctuation marks, convert all textual data to lowercase.

Lemmatization to reduce inflected words to their base forms

Tokenization of corpus

Word vectorization

Dimensionality reduction

Use different algorithms to cluster the movies, obtain the optimal number of clusters using different techniques

Build optimal number of clusters and visualize the contents of each cluster using wordclouds.

### 1. Handling Missing Values

In [None]:
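# Handling Missing Values & Missing Value Imputation
# Work on a copy so the frame used for EDA stays untouched.
# Missing values were already handled during data wrangling; fillna('')
# is a safeguard so the text columns can be concatenated without NaN.
df1 = df.copy()
df1.fillna('',inplace=True)
df1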

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# Combining all the clustering attributes into a single column

df1['clustering_attributes'] = (df1['director'] + ' ' +
                                df1['cast'] +' ' +
                                df1['country'] +' ' +
                                df1['listed_in'] +' ' +
                                df1['description'])

df1['clustering_attributes'][40]

We have successfully added all the necessary data into a single column

### 4. Textual Data Preprocessing

(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Removing non-ASCII characters:

In [None]:
# function to remove non-ascii characters
def remove_non_ascii(words):
    """Function to remove non-ASCII characters"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

# remove non-ascii characters
df1['clustering_attributes'] = remove_non_ascii(df1['clustering_attributes'])
df1['clustering_attributes'][40]

We have successfully removed all non-ascii characters from the corpus.
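To see what this normalization does to individual strings, here is a small standalone sketch. The helper name `to_ascii` and the sample strings are illustrative assumptions; the logic mirrors the NFKD-then-ASCII approach of the cell above.

```python
import unicodedata

def to_ascii(text: str) -> str:
    """Decompose accented characters, then drop anything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Pokémon: Señor Café"))  # Pokemon: Senor Cafe
print(repr(to_ascii("東京 Stories")))     # ' Stories'
```

Accented Latin letters keep their base letter, but characters from non-Latin scripts are removed entirely. Names of cast members or directors written in such scripts would therefore vanish from the clustering text, which is worth keeping in mind when interpreting clusters for international titles.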

#### 2. Lower Casing/Removing Stopwords

In [None]:
# extracting the stopwords from nltk library
from nltk.corpus import stopwords
sw = set(stopwords.words('english'))

# function to remove stop words
# (named remove_stopwords so it does not overwrite the nltk `stopwords` import)
def remove_stopwords(text):
    '''Lowercase the text and drop English stop words'''
    # removing the stop words and lowercasing the selected words
    words = [word.lower() for word in text.split() if word.lower() not in sw]
    # joining the list of words with space separator
    return " ".join(words)

# Removing stop words
df1['clustering_attributes'] = df1['clustering_attributes'].apply(remove_stopwords)
print(df1['clustering_attributes'][40])


#### 3. Removing Punctuations

In [None]:
# function to remove punctuations
def remove_punctuation(text):
    '''a function for removing punctuation'''
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)
# Removing punctuation marks
df1['clustering_attributes'] = df1['clustering_attributes'].apply(remove_punctuation)
print(df1['clustering_attributes'][40])

#### 4. Lemmatization

Lemmatization is a text normalization process that involves reducing words to their base or root form, known as a "lemma." The goal of lemmatization is to reduce words to a common base or root form so that different inflected forms of a word are treated as the same word. This helps in text analysis and natural language processing tasks by simplifying the vocabulary and improving the accuracy of text analysis.

For example, in lemmatization:

The word "running" would be reduced to its lemma "run."

In [None]:
# function to lemmatize the corpus
lemmatizer = WordNetLemmatizer()

def lemmatize_verbs(text):
    """Lemmatize every whitespace-separated word of a document as a verb"""
    # WordNet looks up single words, so passing a whole document at once
    # would leave it unchanged; process the document word by word instead
    return " ".join(lemmatizer.lemmatize(word, pos='v') for word in text.split())

# Lemmatization
df1['clustering_attributes'] = df1['clustering_attributes'].apply(lemmatize_verbs)
print(df1['clustering_attributes'][40])

#### 5. Tokenization

In [None]:
# Initialize the TweetTokenizer
tokenizer = TweetTokenizer()

# Define a function to tokenize strings and handle non-string elements
def tokenize_text(x):
    if isinstance(x, str):
        return tokenizer.tokenize(x)
    else:
        return x

# Apply the tokenization function to the 'clustering_attributes' column
df1['clustering_attributes'] = df1['clustering_attributes'].apply(tokenize_text)

# Print a sample of the tokenized data
sample_size = 5
sample = df1['clustering_attributes'].sample(sample_size)
for tokens in sample:
    if isinstance(tokens, list):
        print(tokens)

#### 6. Text Vectorization

I have chosen to use TF-IDF vectorization for several reasons. Firstly, TF-IDF considers the importance of each term in the context of the document and the entire corpus. This enables me to focus on the most relevant terms within each document, making it an ideal choice for tasks that require identifying key terms or concepts.

Secondly, by combining TF-IDF with stop-word removal, I can eliminate common and less informative words, which further enhances the quality of the vectorization. This is crucial for ensuring that the resulting feature representation is content-bearing and meaningful.

Thirdly, I've limited the dimensionality of the vectorization using the max_features parameter to make clustering and other NLP tasks more efficient, especially when dealing with a large vocabulary.

In addition, TF-IDF provides normalized values, making it suitable for comparing and clustering documents of varying lengths. This normalization ensures that longer documents do not have an unfair advantage due to their length.
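As a rough illustration of the weighting and normalization described above, the sketch below computes TF-IDF by hand for a toy corpus. It assumes the smoothed IDF formula and L2 row normalization that scikit-learn's `TfidfVectorizer` applies by default. The helper `tfidf_rows` and the toy documents are made up for this example and are not part of the project code.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights for tokenized docs."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # document frequency: number of documents that contain each term
    df = Counter(tok for doc in docs for tok in set(doc))
    # smoothed inverse document frequency (assumed to match TfidfVectorizer defaults)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])  # every row ends up with unit length
    return vocab, rows

toy_docs = [
    ["crime", "drama", "detective"],
    ["crime", "comedy"],
    ["crime", "documentary", "nature", "nature"],
]
vocab, rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the first toy document, "crime" appears in every document and therefore receives a lower weight than "detective", which appears in only one. The unit-length rows are what keep long and short documents comparable.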

In [None]:
# Vectorizing text
# Tokenized documents to be clustered
clustering_data = df1['clustering_attributes']

# The documents are already tokenized, so the vectorizer's tokenizer passes them through unchanged
def identity_tokenizer(text):
    return text

# Using TF-IDF vectorizer to vectorize the corpus
# max_features=20000 caps the vocabulary size to keep memory usage manageable
tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False, max_features=20000)
X = tfidf.fit_transform(clustering_data)
# Type and shape of the sparse TF-IDF matrix
print(type(X))
print(X.shape)
# Convert X into a dense array for PCA and clustering
X = X.toarray()
print(X)
print(type(X))
print(X.shape)

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

In the context of clustering Netflix movies and TV shows, dimensionality reduction can be beneficial for several reasons:

High Dimensionality of Features: The TF-IDF step above produces up to 20,000 features per title. Working in that many dimensions slows clustering down and makes distances between points less informative. Dimensionality reduction makes the dataset more manageable for clustering.

Reduced Noise: Many TF-IDF terms are rare or irrelevant to how titles group together and mostly add noise. Dimensionality reduction keeps the directions that carry the most information and discards the rest.

Improved Interpretability: Dimensionality reduction can aid in making sense of the dataset and the clusters formed. By visualizing the data in a lower-dimensional space, patterns and similarities among movies and TV shows become more apparent.

Enhanced Cluster Quality: Clustering algorithms often perform better on lower-dimensional data, which can lead to more coherent and better-separated clusters.

Efficient Computation: Fewer features mean faster clustering runs, which matters when many models are fitted, for example one per candidate number of clusters.

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

I have utilized Principal Component Analysis (PCA) as the dimensionality reduction technique for the dataset. PCA is a popular choice for several reasons. Firstly, PCA is effective at reducing the dimensionality of the data while retaining the most important information. It does this by identifying the principal components, which are linear combinations of the original features, and arranging them in descending order of variance. By retaining a subset of these components, I can reduce the dataset's dimensionality while preserving as much of the data's variance as possible.

More than 80% of the variance is explained by the top 4000 components, so keeping only those components simplifies the model and reduces dimensionality while still capturing more than 80% of the variance. Note that the cell below sets n_components=100; adjust this value to change how many components are kept.
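One way to pick the number of components from the cumulative explained variance is sketched below with plain NumPy on synthetic data. The helper `components_for_variance`, the 80% default threshold, and the toy matrix are assumptions for illustration. On the real TF-IDF matrix the same idea can be applied to `pca.explained_variance_ratio_`.

```python
import numpy as np

def components_for_variance(X, threshold=0.80):
    """Smallest number of principal components whose cumulative
    explained variance ratio reaches `threshold`."""
    Xc = X - X.mean(axis=0)                       # PCA works on centered data
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values, largest first
    ratio = s ** 2 / np.sum(s ** 2)               # explained variance ratio per component
    cumulative = np.cumsum(ratio)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative

# synthetic data: five features with very different spreads
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5)) * np.array([10.0, 3.0, 1.0, 0.5, 0.1])

k80, cumulative = components_for_variance(X_toy, 0.80)
k95, _ = components_for_variance(X_toy, 0.95)
print(k80, k95)
print(np.round(cumulative, 3))
```

Raising the threshold can only keep the same number of components or add more, which is the trade-off the cumulative variance plot visualizes.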

In [None]:
# using PCA to reduce dimensionality
pca = PCA(n_components=100, random_state=42)  # Adjust the number of components as needed
pca.fit(X)

In [None]:
# Explained variance for different number of components
plt.figure(figsize=(10,5))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.title('PCA - Cumulative explained variance vs number of components')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')

In [None]:
# transformed features
x_pca = pca.transform(X)
# shape of transformed vectors
x_pca.shape

## ***7. ML Model Implementation***

### ML Model - 1(K-Means Clustering)

In [None]:
# Elbow method to find the optimal value of k
wcss=[]
for i in range(1,31):
  kmeans = KMeans(n_clusters=i,init='k-means++',random_state=33)
  kmeans.fit(x_pca)
  wcss_iter = kmeans.inertia_
  wcss.append(wcss_iter)

number_clusters = range(1,31)
plt.figure(figsize=(10,5))
plt.plot(number_clusters,wcss)
plt.title('The Elbow Method - KMeans clustering')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')

The within-cluster sum of squares (WCSS), which is the sum of squared distances between each point and the centroid of its cluster, decreases as the number of clusters increases.
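To make the WCSS quantity concrete, the sketch below computes it by hand for four toy points under two fixed labelings. The helper `wcss` and the points are invented for this example. The value it returns follows the same definition as the `inertia_` attribute used in the elbow loop above.

```python
import numpy as np

def wcss(X, labels):
    """Within-cluster sum of squares: squared distance of every point to its cluster centroid."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for cluster in np.unique(labels):
        members = X[labels == cluster]
        centroid = members.mean(axis=0)
        total += float(((members - centroid) ** 2).sum())
    return total

points = [[0, 0], [0, 2], [10, 0], [10, 2]]
one_cluster = wcss(points, [0, 0, 0, 0])    # single centroid at (5, 1)
two_clusters = wcss(points, [0, 0, 1, 1])   # centroids at (0, 1) and (10, 1)
print(one_cluster, two_clusters)  # 104.0 4.0
```

Splitting the points into their two natural groups drops WCSS from 104 to 4. Because WCSS keeps falling as clusters are added, the elbow method looks for the point where the drop levels off rather than for the minimum.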

In [None]:
# Plotting Silhouette score for different number of clusters
range_n_clusters = range(2,31)
silhouette_avg = []
for num_clusters in range_n_clusters:
  # initialize kmeans
  kmeans = KMeans(n_clusters=num_clusters,init='k-means++',random_state=33)
  kmeans.fit(x_pca)
  cluster_labels = kmeans.labels_

  # silhouette score
  silhouette_avg.append(silhouette_score(x_pca, cluster_labels))

plt.figure(figsize=(10,5))
plt.plot(range_n_clusters,silhouette_avg)
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k - KMeans clustering')
plt.show()

The highest Silhouette score is obtained for 6 clusters.
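For readers unfamiliar with the metric, here is a minimal NumPy sketch of how a mean silhouette score is computed. For each point, `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the nearest other cluster; the score is `(b - a) / max(a, b)`. The function `silhouette_mean` and the toy data are assumptions for illustration; the notebook itself relies on scikit-learn's `silhouette_score`.

```python
import numpy as np

def silhouette_mean(X, labels):
    """Mean silhouette coefficient with Euclidean distances (needs at least 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # pairwise Euclidean distance matrix
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself
        if not same.any():
            scores.append(0.0)               # convention for single-point clusters
            continue
        a = dists[i, same].mean()            # cohesion: own cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

toy = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette_mean(toy, [0, 0, 1, 1])   # tight, well-separated clusters
bad = silhouette_mean(toy, [0, 1, 0, 1])    # clusters that mix the two groups
print(round(good, 3), round(bad, 3))
```

Scores near 1 indicate compact, well-separated clusters, while negative scores suggest points sit closer to another cluster than to their own.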

In [None]:
# Clustering the data into 6 clusters
kmeans = KMeans(n_clusters=6,init='k-means++',random_state=33)
kmeans.fit(x_pca)

In [None]:
# Evaluation metrics - distortion, Silhouette score
kmeans_distortion = kmeans.inertia_
kmeans_silhouette_score = silhouette_score(x_pca, kmeans.labels_)

print((kmeans_distortion,kmeans_silhouette_score))

In [None]:
# Adding a kmeans cluster number attribute
df1['kmeans_cluster'] = kmeans.labels_
# Number of movies and tv shows in each cluster
plt.figure(figsize=(10,5))
q = sns.countplot(x='kmeans_cluster',data=df1, hue='type')
plt.title('Number of movies and TV shows in each cluster - Kmeans Clustering')
for i in q.patches:
  q.annotate(format(i.get_height(), '.0f'), (i.get_x() + i.get_width() / 2., i.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')

In [None]:
# Building a wordcloud for the movie descriptions
def kmeans_worldcloud(cluster_num):
  comment_words = ''
  stopwords = set(STOPWORDS)

  # iterate through the descriptions of the titles in this cluster
  for val in df1[df1['kmeans_cluster']==cluster_num].description.values:

      # typecast each val to string
      val = str(val)

      # split the value into tokens and convert each token to lowercase
      tokens = [token.lower() for token in val.split()]

      comment_words += " ".join(tokens)+" "

  wordcloud = WordCloud(width = 700, height = 700,
                  background_color ='white',
                  stopwords = stopwords,
                  min_font_size = 10).generate(comment_words)


  # plot the WordCloud image
  plt.figure(figsize = (10,5), facecolor = None)
  plt.imshow(wordcloud)
  plt.axis("off")
  plt.tight_layout(pad = 0)

In [None]:
# Wordcloud for cluster 0
kmeans_worldcloud(0)

Keywords observed in cluster 0: life, new, family, friend, save, help, discover, home, teen

In [None]:
# Wordcloud for cluster 1
kmeans_worldcloud(1)

Keywords observed in cluster 1: life, love, family, father, young, girl, man, woman, friend, daughter

In [None]:
# Wordcloud for cluster 2
kmeans_worldcloud(2)
# Keywords observed in cluster 2: young, world, girl, mysterious, humanity, life, student, school, battle, demon, force

# Wordcloud for cluster 3
kmeans_worldcloud(3)
# Keywords observed in cluster 3: love, life, family, romance, crime, murder, world, adventure

# Wordcloud for cluster 4
kmeans_worldcloud(4)
# Keywords observed in cluster 4: comedian, special, stand, comic, stage, sex, joke

# Wordcloud for cluster 5
kmeans_worldcloud(5)
# Keywords observed in cluster 5: documentary, world, life, filmmaker, american

### ML Model - 2(Hierarchical clustering)

In [None]:
# Building a dendrogram to decide on the number of clusters
plt.figure(figsize=(10, 7))
dend = shc.dendrogram(shc.linkage(x_pca, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Netflix Shows')
plt.ylabel('Distance')
plt.axhline(y= 3.8, color='r', linestyle='--')

Cutting the dendrogram at a distance of 3.8 units yields 12 clusters for the agglomerative clustering algorithm.
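Instead of counting branches by eye, the number of clusters at a given cut height can also be read from the linkage matrix. The sketch below uses SciPy's `fcluster` with `criterion='distance'` on a toy dataset. The helper `clusters_at`, the points, and the cut heights are illustrative assumptions and do not reproduce the 3.8 threshold used on the Netflix data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy data: three tight pairs of points that lie far apart from each other
points = np.array([[0, 0], [0, 1], [10, 0], [10, 1], [20, 0], [20, 1]], dtype=float)

# same linkage method as the dendrogram above
Z = linkage(points, method='ward')

def clusters_at(Z, height):
    """Number of flat clusters obtained by cutting the tree at `height`."""
    labels = fcluster(Z, t=height, criterion='distance')
    return len(set(labels))

for height in (0.5, 5, 100):
    print(height, clusters_at(Z, height))
```

Lower cut heights give more, smaller clusters and higher cuts give fewer, larger ones, so the chosen height directly sets the value passed as `n_clusters`.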

In [None]:
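# Fitting hierarchical clustering model
# Ward linkage always uses Euclidean distance, so no distance argument is passed
# (the 'affinity' parameter is deprecated in recent scikit-learn releases)
hierarchical = AgglomerativeClustering(n_clusters=12, linkage='ward')
hierarchical.fit_predict(x_pca)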

In [None]:
# Adding a hierarchical cluster number attribute
df1['hierarchical_cluster'] = hierarchical.labels_
# Number of movies and tv shows in each cluster
plt.figure(figsize=(10,5))
q = sns.countplot(x='hierarchical_cluster',data=df1, hue='type')
plt.title('Number of movies and tv shows in each cluster - Hierarchical Clustering')
for i in q.patches:
  q.annotate(format(i.get_height(), '.0f'), (i.get_x() + i.get_width() / 2., i.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')

Successfully built 12 clusters using the Agglomerative (hierarchical) clustering algorithm.

In [None]:
# Building a wordcloud for the movie descriptions
def hierarchical_worldcloud(cluster_num):
  comment_words = ''
  stopwords = set(STOPWORDS)

  # iterate through the descriptions of the titles in this cluster
  for val in df1[df1['hierarchical_cluster']==cluster_num].description.values:

      # typecast each val to string
      val = str(val)

      # split the value into tokens and convert each token to lowercase
      tokens = [token.lower() for token in val.split()]

      comment_words += " ".join(tokens)+" "

  wordcloud = WordCloud(width = 700, height = 700,
                  background_color ='white',
                  stopwords = stopwords,
                  min_font_size = 10).generate(comment_words)

  # plot the WordCloud image
  plt.figure(figsize = (10,5), facecolor = None)
  plt.imshow(wordcloud)
  plt.axis("off")
  plt.tight_layout(pad = 0)

In [None]:
# Wordcloud for cluster 0
hierarchical_worldcloud(0)
#Keywords observed in cluster 0: life, new, find, family, save, friend, young, teen, adventure

# Wordcloud for cluster 1
hierarchical_worldcloud(1)
#Keywords observed in cluster 1: love, family, life, student, romance, school, woman, master, father

# Wordcloud for cluster 2
hierarchical_worldcloud(2)
#Keywords observed in cluster 2: life, new, series, crime, world, murder, history, detective

# Wordcloud for cluster 3
hierarchical_worldcloud(3)
#Keywords observed in cluster 3: family, life, love, friend, teen, woman, man, young, world, wedding, secret

# Wordcloud for cluster 4
hierarchical_worldcloud(4)
#Keywords observed in cluster 4: documentary, music, world, team, interview, history, family, career, battle, death

# Wordcloud for cluster 5
hierarchical_worldcloud(5)
#Keywords observed in cluster 5: family, life, mexico, young, new, woman, man, secret, spain, death, singer

# Wordcloud for cluster 6
hierarchical_worldcloud(6)
#Keywords observed in cluster 6: young, life, girl, world, friend, mysterious, demon, student, school, father

# Wordcloud for cluster 7
hierarchical_worldcloud(7)
#Keywords observed in cluster 7: love, life, woman, new, student, family, korea, secret, detective, young

# Wordcloud for cluster 8
hierarchical_worldcloud(8)
#Keywords observed in cluster 8: woman, man, life, egypt, wealthy, money, young, love, revolution, struggling

# Wordcloud for cluster 9
hierarchical_worldcloud(9)
#Keywords observed in cluster 9: comedian, stand, life, comic, special, show, live, star, stage, hilarious, stories

# Wordcloud for cluster 10
hierarchical_worldcloud(10)
#Keywords observed in cluster 10: animal, nature, explore, planet, species, survive, natural, life, examine, earth

# Wordcloud for cluster 11
hierarchical_worldcloud(11)
#Keywords observed in cluster 11: love, man, woman, india, father, friend, girl, mumbai, city, learn, young

# **Content based recommender system**

We can build a simple content based recommender system based on the similarity of the shows.

If a person has watched a show on Netflix, the recommender system should be able to recommend a list of similar shows that they are likely to enjoy.

To get the similarity score of the shows, we can use cosine similarity.

The similarity between two vectors (A and B) is calculated by taking the dot product of the two vectors and dividing it by the product of their magnitudes, as shown in the equation below. Put simply, the cosine similarity score of two vectors increases as the angle between them decreases.

$$\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
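The cosine similarity described above can be checked on a few toy word-count vectors. This is a minimal sketch. The function `cosine_sim`, the three-word vocabulary, and the vectors are made up for illustration, and the function assumes neither vector is all zeros.

```python
import numpy as np

def cosine_sim(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy word-count vectors over the vocabulary ["crime", "drama", "comedy"]
show_a = [2, 1, 0]
show_b = [4, 2, 0]   # same direction as show_a, with twice the counts
show_c = [0, 0, 3]   # shares no words with show_a
show_d = [1, 0, 1]   # partial overlap with show_a

print(cosine_sim(show_a, show_b))
print(cosine_sim(show_a, show_c))
print(cosine_sim(show_a, show_d))
```

Because only the angle matters, `show_b` is treated as identical to `show_a` even though its counts are doubled, which is why cosine similarity suits descriptions of different lengths.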




In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# defining a new df for building a recommender system
recommender_df = df1.copy()
# keeping the original index as a show_id column before re-indexing by title
recommender_df['show_id'] = recommender_df.index
# converting tokens to string
def convert(lst):
  return ' '.join(lst)

recommender_df['clustering_attributes'] = recommender_df['clustering_attributes'].apply(convert)
# setting title of movies/TV shows as index
recommender_df.set_index('title',inplace=True)
# Count vectorizer
CV = CountVectorizer()
converted_matrix = CV.fit_transform(recommender_df['clustering_attributes'])
# Calculate the cosine similarity matrix
# (stored under its own name so the imported cosine_similarity function is not overwritten)
cosine_sim_matrix = cosine_similarity(converted_matrix)
print(cosine_sim_matrix.shape)
# Developing a function to get 10 recommendations for a show
indices = pd.Series(recommender_df.index)

def recommend_10(title, cosine_sim = cosine_sim_matrix):
  try:
    idx = indices[indices == title].index[0]
  except IndexError:
    # no title in the dataset matches the input exactly
    return 'Invalid Entry'
  series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
  # skip the first entry, which is normally the show itself
  top10 = list(series.iloc[1:11].index)
  # list with the titles of the best 10 matching movies
  recommend_content = [recommender_df.index[i] for i in top10]
  print("If you liked '"+title+"', you may also enjoy:\n")
  return recommend_content

# Recommendations for 'A Man Called God'
recommend_10('A Man Called God')

# Recommendations for 'Stranger Things'
recommend_10('Stranger Things')

# Recommendations for 'Peaky Blinders'
recommend_10('Peaky Blinders')

# Recommendations for 'Lucifer'
recommend_10('Lucifer')

# Recommendations for 'one piece'
recommend_10('one piece')

The call returns 'Invalid Entry' because no title in the dataset exactly matches 'one piece'; the lookup is an exact, case-sensitive match.
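Since the title lookup is an exact match, differences in letter case or stray spaces alone are enough to produce 'Invalid Entry'. A more forgiving lookup could normalize titles before comparing, as in the sketch below. The helper `find_title` and the sample titles are hypothetical and not part of the notebook; the result could be passed to `recommend_10` when it is not `None`.

```python
def find_title(query, titles):
    """Return the stored title that matches `query` ignoring case and
    surrounding whitespace, or None when there is no such title."""
    lookup = {t.strip().casefold(): t for t in titles}
    return lookup.get(query.strip().casefold())

sample_titles = ["Stranger Things", "Peaky Blinders", "Lucifer"]

print(find_title("  stranger things ", sample_titles))
print(find_title("one piece", sample_titles))
```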

# **Conclusion**

In this project, we worked on a text clustering problem in which we had to group the Netflix shows into clusters such that the shows within a cluster are similar to each other and the shows in different clusters are dissimilar to each other.

The dataset contained about 7787 records, and 11 attributes.

We began by dealing with the dataset's missing values and doing exploratory data analysis (EDA).

It was found that Netflix hosts more movies than TV shows on its platform, and that the total number of shows added to Netflix is growing exponentially. Also, the majority of the shows were produced in the United States, and the majority of the shows on Netflix were created for adult and young-adult audiences.

It was decided to cluster the data based on the attributes: director, cast, country, genre, and description. The values in these attributes were tokenized, preprocessed, and then vectorized using TFIDF vectorizer.

Through TFIDF Vectorization, we created a total of 20000 attributes.

We used Principal Component Analysis (PCA) to handle the curse of dimensionality. 4000 components were able to capture more than 80% of the variance, and hence the number of components was restricted to 4000.

We first built clusters using the k-means clustering algorithm, and the optimal number of clusters came out to be 6. This was obtained through the elbow method and Silhouette score analysis.

Then clusters were built using the Agglomerative clustering algorithm, and the optimal number of clusters came out to be 12. This was obtained after visualizing the dendrogram.

A content based recommender system was built using the similarity matrix obtained with cosine similarity. This recommender system makes 10 recommendations to the user based on a show they have watched.
