# **Project Name** - **Netflix Movies and TV Shows Clustering**



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual

# **Project Summary -**



The goal of this project is to analyze the Netflix catalog of movies and TV shows, sourced from the third-party search engine Flixable, and group the titles into relevant clusters. This will help enhance the user experience and reduce subscriber churn for Netflix, the world's largest online streaming service provider, which had over 220 million subscribers as of 2022-Q2. The dataset, which covers movies and TV shows available as of 2019, will be analyzed to uncover insights and trends in the rapidly growing world of streaming entertainment.

* There were 7787 records and 12 attributes in the dataset.

* We started by working on the missing values in the dataset and conducting exploratory data analysis (EDA).

* The following attributes were used to create the clusters: cast, country, genre, director, rating, and description. The TF-IDF vectorizer was used to tokenize, preprocess, and vectorize the values in these attributes.

* The problem of dimensionality was dealt with through the use of Principal Component Analysis (PCA).

* We built two clustering models, K-Means and Agglomerative Hierarchical clustering, and determined the optimal number of clusters using several methods, including the elbow method, the silhouette score, and the dendrogram.

* The similarity matrix generated by applying cosine similarity was used to construct a content-based recommender system. The user will receive ten recommendations from this recommender system based on the type of show they watched.
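The vectorize, reduce, and cluster steps listed above can be sketched on toy text as follows. This is a minimal illustration, not the project's actual code: the six documents, `n_components=3`, and the tested range of `k` are all assumptions chosen so the example runs quickly.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Toy documents standing in for the combined text attributes (assumption).
docs = [
    "space crew explores a distant planet",
    "astronauts stranded on a distant planet",
    "robot uprising on a space station",
    "a baker falls in love at christmas",
    "christmas romance in a small town bakery",
    "two friends find love in a small town",
]

# 1. Vectorize the text with TF-IDF.
X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# 2. Reduce dimensionality with PCA (3 components is an arbitrary toy choice).
X_reduced = PCA(n_components=3, random_state=0).fit_transform(X)

# 3. Fit K-Means for several k and score each fit with the silhouette coefficient.
scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_reduced)
    scores[k] = silhouette_score(X_reduced, labels)

# The k with the highest silhouette score is one candidate for the cluster count.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real data the silhouette score is usually read together with the elbow plot and the dendrogram rather than on its own.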

# **GitHub Link -**

https://github.com/padhilipika





# **Problem Statement**




Netflix is a streaming service that offers a wide variety of television shows and movies for viewers to watch at their convenience. With a monthly subscription, users have access to a vast library of content, including original series and films produced by Netflix. The platform also allows users to create multiple profiles, making it easy for family members or roommates to have their own personalized viewing experience. Additionally, Netflix allows users to download content to watch offline, making it a great option for those who travel frequently or have limited internet access. Overall, Netflix is a convenient and cost-effective way to access a wide variety of entertainment.

As of 2022-Q2, more than 220 million people had signed up for Netflix's online streaming service, making it the largest OTT provider worldwide. To improve the user experience and reduce subscriber churn, Netflix needs to efficiently cluster the shows hosted on its platform.

By creating clusters, we will be able to comprehend the shows that are alike and different from one another. These clusters can be used to provide customers with individualized show recommendations based on their preferences.

This project aims to group Netflix shows into clusters in such a way that shows in the same cluster are similar to one another and shows in different clusters are dissimilar.

#### **Define Your Business Objective?**

To enhance the user experience and reduce subscriber churn on Netflix by providing personalized show recommendations based on show clustering.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries


In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Word Cloud library
from wordcloud import WordCloud, STOPWORDS

# libraries used to process textual data
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

# libraries used to implement clusters
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as shc

# libraries that are used to construct a recommendation system
# (TfidfVectorizer is already imported above)
from sklearn.metrics.pairwise import cosine_similarity

# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import files
uploaded = files.upload()
df = pd.read_csv('NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look (Viewing the first 5 rows)
df.head()

In [None]:
# Dataset Last Look (Viewing the last 5 rows)
df.tail()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

In [None]:
print(f'number of rows : {df.shape[0]}  \nnumber of columns : {df.shape[1]}')

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
value = len(df[df.duplicated()])
print("The number of duplicate values in the data set is = ",value)

Hence, we found that there are no duplicate entries in the above data.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In [None]:
# Visualizing the missing values
import missingno as msno
msno.bar(df, color='green',sort='ascending', figsize=(10,3), fontsize=15)

In [None]:
# Visualizing the missing values using Heatmap
plt.figure(figsize=(12,4))
sns.heatmap(df.isna(), cmap = 'coolwarm')

### What did you know about your dataset?


The given dataset is from the online streaming industry. Our task is to examine the dataset, build clustering models, and build a content-based recommendation system.

Clustering is a technique used in machine learning and data mining to group similar data points together. A clustering algorithm is a method or technique used to identify clusters within a dataset. These clusters represent natural groupings of the data, and the goal of clustering is to discover these groupings without any prior knowledge of the groupings.

* There are 7787 rows and 12 columns in the dataset. The director, cast, country, date_added, and rating columns contain missing values. The dataset does not contain any duplicate rows.

* Every row describes a specific movie or TV show, so missing values cannot be meaningfully imputed from other rows. Because the dataset is small, we also want to avoid losing data where possible, so after analyzing each column we replace missing text values with a blank string in the following procedure.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

* **show_id :** Unique ID for every Movie/Show
* **type :** Identifier - Movie/Show
* **title :** Title of the Movie/Show
* **director :** Director of the Movie/Show
* **cast :** Actors involved in the Movie/Show
* **country :** Country where the Movie/Show was produced
* **date_added :** Date it was added on Netflix
* **release_year :** Actual Release year of the Movie/Show
* **rating :** TV Rating of the Movie/Show
* **duration :** Total Duration - in minutes or number of seasons
* **listed_in :** Genre
* **description :** The Summary description

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print("No. of unique values in",i,"is",df[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
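# Write your code to make your dataset analysis ready.
# display() is used because a notebook cell only renders its last expression
from IPython.display import display
# Dataset first look (viewing the first 5 rows)
display(df.head())
# Dataset last look (viewing the last 5 rows)
display(df.tail())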

### What all manipulations have you done and insights you found?

We are focusing on several key columns of our dataset, including 'type', 'title', 'director', 'cast', 'country', 'rating', 'listed_in', and 'description', as they contain a wealth of information.
By utilizing these features, we plan to create a cluster column and implement both K-means and Hierarchical clustering algorithms.
Additionally, we will be developing a content-based recommendation system that utilizes the information from these columns to provide personalized suggestions to users. This approach will allow us to gain valuable insights and group similar data points together, as well as provide personalized recommendations based on user preferences and viewing history.
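As a hedged illustration of the content-based idea described above, the sketch below combines text columns, vectorizes them with TF-IDF, and ranks titles by cosine similarity. The toy catalog, the `combined` column, and the `recommend` helper are hypothetical names introduced only for this example; they are not taken from the project's code.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalog (hypothetical titles and values, not rows from the dataset).
toy = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma", "Delta"],
    "listed_in": ["Sci-Fi, Action", "Sci-Fi, Thriller",
                  "Romantic Comedies", "Romantic Comedies, Dramas"],
    "description": [
        "A crew fights aliens in deep space",
        "A pilot uncovers an alien conspiracy in space",
        "Two strangers fall in love over one summer",
        "A couple rebuilds their love after a loss",
    ],
})

# Combine the text attributes into one string per title.
toy["combined"] = toy["listed_in"] + " " + toy["description"]

# TF-IDF vectors and the pairwise cosine-similarity matrix.
vectors = TfidfVectorizer(stop_words="english").fit_transform(toy["combined"])
similarity = cosine_similarity(vectors)


def recommend(title, n=10):
    """Return up to n titles most similar to `title`, excluding the title itself."""
    idx = toy.index[toy["title"] == title][0]
    ranked = similarity[idx].argsort()[::-1]  # most similar first
    return [toy.loc[i, "title"] for i in ranked if i != idx][:n]


print(recommend("Alpha", n=1))
```

The same pattern extends to more columns (for example cast or director) by appending them to the combined string before vectorizing.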

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### **EDA**
* EDA stands for Exploratory Data Analysis. It is a process of analyzing and understanding the data, which is an essential step in the data science process. The goal of EDA is to gain insights into the data, identify patterns, and discover relationships and trends. It is an iterative process that helps to identify outliers, missing values, and any other issues that may affect the analysis and modeling of the data.


#### Chart - 1

# **Column: 'type'**

In [None]:
# Chart - 1 visualization code
# number of values of different categories in 'type'
df['type'].value_counts()

In [None]:
fig,ax = plt.subplots(1,2, figsize=(14,5))

# barplot
graph = sns.countplot(x = 'type', data = df, ax=ax[0])
graph.set_title('Count of Values', size=20)

# piechart
df['type'].value_counts().plot(kind='pie', autopct='%1.2f%%', ax=ax[1], startangle=90)
ax[1].set_title('Percentage Distribution', size=20)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?


I used a bar chart because it is one of the most common ways to summarize and compare the frequency or count of different categories in a dataset.

I used a pie chart to show the same data as proportions of a whole, where each slice of the pie represents one category.

##### 2. What is/are the insight(s) found from the chart?


* There are more Movies than TV Shows in the dataset.
* About 31% of the titles are TV shows, while about 69% are movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.




Yes. The visualization shows the counts and the percentage distribution of TV shows and movies in the dataset, which helps in understanding the content mix on the platform.


#### Chart - 2

# **Column: 'title'**

In [None]:
# Chart - 2 visualization code
# number of unique values
df['title'].nunique()

In [None]:
# text documents
text = " ".join(word for word in df['title'])

# create the word cloud using WordCloud library
wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', min_font_size=15).generate(text)

# plot the word cloud
plt.imshow(wordcloud,  interpolation='bilinear')
plt.show()

##### 1. Why did you pick the specific chart?

A word cloud is a visual representation of text data in which words are displayed in different sizes, with the size of each word indicating its frequency or importance in the given text. The WordCloud library simplifies the process of creating word clouds and helps in understanding and communicating the key information within a textual dataset.
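To make the link between word size and frequency concrete, here is a small sketch that counts words the way a word cloud conceptually does. The titles and the tiny stop-word set are hypothetical; WordCloud applies its own tokenization and a larger `STOPWORDS` set, so its exact counts may differ.

```python
import re
from collections import Counter

# Hypothetical titles used only for illustration.
titles = ["Love Story", "A Christmas Story", "Love in the World", "The Last Man"]

# A small hand-picked stop-word set (assumption for this example).
stop_words = {"a", "in", "the"}

# Lower-case each title, split it into words, and drop stop words.
tokens = [
    word
    for title in titles
    for word in re.findall(r"[a-z']+", title.lower())
    if word not in stop_words
]

# A word cloud scales each word by a frequency like this one.
frequencies = Counter(tokens)
print(frequencies.most_common(3))
```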

##### 2. What is/are the insight(s) found from the chart?


* Words like 'Love', 'Christmas', 'Man', 'World', 'Life', 'Girl', and 'Story' appear frequently in the title column.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



Yes. The word cloud highlights the words that occur most frequently in the titles in our dataset, which gives a quick view of recurring themes.

#### Chart - 3

# **Column: 'director'**

In [None]:
# Chart - 3 visualization code
print(f'number of unique directors : {df.director.nunique()}')
print(f'null values in the column : {df.director.isna().sum()}')

In [None]:
# value_counts().sum() counts the titles that have a non-null director entry
print(f"Number of TV shows directed by directors are : { df[df['type']=='TV Show']['director'].value_counts().sum()}")
print(f"Number of Movies directed by directors are : { df[df['type']=='Movie']['director'].value_counts().sum()}")

In [None]:
fig,ax = plt.subplots(1,2, figsize=(14,5))

# top 10 directors who directed TV shows
tv_shows =df[df['type']=='TV Show']['director'].value_counts()[:10].plot(kind='barh', ax=ax[0])
tv_shows.set_title('Top 10 directors of TV Shows', size=15)

# top 10 directors who directed Movies
movies = df[df['type']=='Movie']['director'].value_counts()[:10].plot(kind='barh', ax=ax[1])
movies.set_title('Top 10 directors of Movies', size=15)

plt.tight_layout(pad=1.2, rect=[0, 0, 0.95, 0.95])
plt.show()

##### 1. Why did you pick the specific chart?


Bar plots are effective for comparing and visualizing the distribution of data across different categories. The length of each bar directly represents the magnitude of the data, making it easy to compare values among categories. The first subplot shows the top 10 directors of TV shows, and the second subplot shows the top 10 directors of movies. This layout gives a clear and straightforward view of the data and makes it easy to compare the number of titles per director.

##### 2. What is/are the insight(s) found from the chart?



* Alastair Fothergill directed three TV shows, the most of any TV show director in the dataset.
* Jan Suter and Raul Campos have directed 18 films, more than anyone else in the dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The visualization identifies the top 10 directors of TV shows and the top 10 directors of movies in our dataset, which shows whose work is most represented in the catalog.

#### Chart - 4

# **Column: 'cast'**

In [None]:
# Chart - 4 visualization code
df['cast']

In [None]:
# separating actors from the cast column
cast = df['cast'].str.split(', ', expand=True).stack()

# actors ordered by the number of movies/shows they appear in
cast.value_counts()

In [None]:
print(f"Number of TV Shows actors: {len(df[df['type']=='TV Show']['cast'].str.split(', ',expand=True).stack().value_counts())}")
print(f"Number of Movies actors: {len(df[df['type']=='Movie']['cast'].str.split(', ', expand=True).stack().value_counts())}")

In [None]:
fig,ax = plt.subplots(1,2, figsize=(14,5))

# separating TV show actors from the cast column
top_TVshows_actor = df[df['type']=='TV Show']['cast'].str.split(', ', expand=True).stack()
# plotting actor who appeared in highest number of TV Show
a = top_TVshows_actor.value_counts().head(10).plot(kind='barh', ax=ax[0])
a.set_title('Top 10 TV shows actors', size=15)

# separating movie actors from the cast column
top_movie_actor = df[df['type']=='Movie']['cast'].str.split(', ', expand=True).stack()
# plotting actor who appeared in highest number of Movie
b = top_movie_actor.value_counts().head(10).plot(kind='barh', ax=ax[1])
b.set_title('Top 10 Movie actors', size=15)

plt.tight_layout(pad=1.2, rect=[0, 0, 0.95, 0.95])
plt.show()

##### 1. Why did you pick the specific chart?



Horizontal bar plots are effective for comparing and visualizing the distribution of data across different categories. The length of each bar directly represents the magnitude of the data, making it easy to compare values among categories. The first plot shows the top 10 TV show actors, and the second plot shows the top 10 movie actors.

Placing the two charts side by side as subplots makes it easy to compare how often different actors appear in TV shows and in movies.

##### 2. What is/are the insight(s) found from the chart?



* Among movies, Anupam Kher, Shahrukh Khan, and Om Puri appear in the most titles.
* Among TV shows, Takahiro Sakurai, Yuki Kaji, and Daisuke Ono appear in the most titles.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes. The visualization identifies the top 10 TV show actors and the top 10 movie actors in our dataset, which shows which performers are most represented in the catalog.

#### Chart - 5

# **Column: 'country'**

In [None]:
# Chart - 5 visualization code
# number of unique values
df['country'].nunique()

In [None]:
fig,ax = plt.subplots(1,2, figsize=(15,5))
plt.suptitle('Top 10 countries with the highest number of movies/shows', weight='bold', size=15, y=1.01)

# univariate analysis
df['country'].value_counts().nlargest(10).plot(kind='barh', ax=ax[0])

# bivariate analysis
graph = sns.countplot(x="country", data=df, hue='type', order=df['country'].value_counts().index[0:10], ax=ax[1])
plt.xticks(rotation=90)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?



Horizontal bar plots are effective for comparing and visualizing the distribution of data across different categories. The length of each bar directly represents the magnitude of the data, making it easy to compare values among categories. Here the plot displays the top 10 countries with the highest number of movies and TV shows available on Netflix.

We use a grouped bar plot (countplot) to compare the counts of one categorical variable split by a second one. This type of plot is particularly useful when we want to observe how two categorical variables interact, which here means how the counts of movies and TV shows vary from country to country.
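The numbers behind a grouped countplot can also be inspected directly as a contingency table. The sketch below uses `pd.crosstab` on a small made-up frame; the rows are hypothetical and only mirror the `country` and `type` column names from the dataset.

```python
import pandas as pd

# Hypothetical rows that mirror the dataset's `country` and `type` columns.
toy = pd.DataFrame({
    "country": ["United States", "United States", "India",
                "India", "Japan", "United States"],
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show", "Movie"],
})

# One row per country, one column per type, each cell a count of titles.
counts = pd.crosstab(toy["country"], toy["type"])
print(counts)
```

Each cell in this table corresponds to the height of one bar in the grouped plot.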

##### 2. What is/are the insight(s) found from the chart?



* The United States produced the most movies and TV shows, followed by India and the United Kingdom.
* In India and the United States, there are more movies than TV shows.
* In the UK, Japan, and South Korea, there are more TV shows than movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



Yes. The visualization identifies the top 10 countries with the highest number of movies and TV shows in our dataset, and shows how the mix of movies and TV shows differs by country.

#### Chart - 6

# **Column: 'release_year'**

In [None]:
# Chart - 6 visualization code
# number of unique values
df['release_year'].nunique()

In [None]:
print(f'Oldest release year : {df.release_year.min()}')
print(f'Latest release year : {df.release_year.max()}')

In [None]:
fig,ax = plt.subplots(1,2, figsize=(15,6))

# Univariate analysis
hist = sns.histplot(df['release_year'], ax=ax[0])
hist.set_title('Distribution by released year', size=15)

# Bivariate analysis
count = sns.countplot(x="release_year", hue='type', data=df, order=df['release_year'].value_counts().index[0:15], ax=ax[1])
count.set_title('Movies/TV shows released in the top 15 years', size=15)
plt.xticks(rotation=90)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?



The first graph  is a univariate analysis plot, specifically a histogram, showing the distribution of the "release_year" variable from the dataset. The histogram displays the frequency distribution of release years, giving an overview of how many movies or TV shows were released in different years.

The second graph is a bivariate analysis plot, specifically a count plot, showing the count of movies and TV shows released in the top 15 years with the highest frequencies. The bars are grouped by the "type" variable, which distinguishes between movies and TV shows, and the count of each type is shown for the top 15 years.

##### 2. What is/are the insight(s) found from the chart?



* The catalog skews toward recent titles: far more of the Movies/TV shows were released in recent years than in older ones.
* Most Movies and TV shows available on Netflix were released between 2015 and 2020, with 2018 the most common release year.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



Yes. The visualization shows the distribution of titles by release year and the split between Movies and TV shows in the 15 most common release years in our dataset.

#### Chart - 7

# **Column: 'rating'**

In [None]:
# Chart - 7 visualization code
fig,ax = plt.subplots(1,2, figsize=(15,6))
plt.suptitle('Top 10 ratings given for movies and shows', weight='bold', y=1.02, size=15)

# univariate analysis
sns.countplot(x="rating", data=df, order=df['rating'].value_counts().index[0:10], ax=ax[0])

# bivariate analysis
graph = sns.countplot(x="rating", data=df, hue='type', order=df['rating'].value_counts().index[0:10], ax=ax[1])
plt.xticks(rotation=90)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?



Univariate Analysis:
The first graph  is a "Countplot" representing the frequency of each rating category for movies and TV shows in the dataset. The x-axis displays the "rating" categories, and the y-axis shows the count of occurrences for each rating. The order of the x-axis labels is set to display the top 10 most frequently occurring rating categories.

Bivariate Analysis:
The second graph  is another "Countplot" representing the frequency of each rating category for movies and TV shows, differentiated by the "type" of content. The x-axis displays the "rating" categories, and the y-axis shows the count of occurrences for each rating. The bars are further distinguished by color to represent movies and TV shows separately. Again, the order of the x-axis labels is set to display the top 10 most frequently occurring rating categories.

##### 2. What is/are the insight(s) found from the chart?



* The most common rating for Movies and TV shows is TV-MA, which is intended for mature audiences, followed by TV-14, which flags content that may be unsuitable for children under 14.
* Movies have higher counts than TV shows in the most common rating categories, which is expected given that movies outnumber TV shows, as we saw earlier in the type column.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



Yes. The visualization shows the 10 most common ratings for Movies and TV shows in our dataset, which indicates the audiences the catalog mostly targets.

#### Chart - 8

# **Column: 'listed_in'**

Because this column is a genre column, in order to count the genres, we must separate them.
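As a minimal sketch of this separation step on toy data, a comma-separated column can be split and flattened with `str.split` followed by `explode`. The three rows below are hypothetical, and this is an alternative to the `expand=True` plus `stack()` approach used in the cell that follows.

```python
import pandas as pd

# Hypothetical rows in the same comma-separated format as `listed_in`.
toy = pd.DataFrame({
    "listed_in": ["Dramas, International Movies", "Comedies, Dramas", "Dramas"]
})

# Split each string into a list, give every genre its own row, then count.
genre_counts = (
    toy["listed_in"]
    .str.split(", ")
    .explode()
    .value_counts()
)
print(genre_counts)
```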

In [None]:
# Chart - 8 visualization code
# separating genres from the listed_in column for analysis purposes
genres = df['listed_in'].str.split(', ', expand=True).stack()

# number of titles per genre, most frequent first
# (rename_axis/reset_index(name=...) gives the same column names across pandas versions)
genres = genres.value_counts().rename_axis('genre').reset_index(name='count')
genres.head()

In [None]:
# number of genres present in dataset
len(genres)

In [None]:
# plotting graph
fig,ax = plt.subplots(1,2, figsize=(15,6))

# Top 10 genres
top = sns.barplot(x='genre', y = 'count', data=genres[:10], ax=ax[0])
top.set_title('Top 10 genres present in Netflix', size=20)
plt.setp(top.get_xticklabels(), rotation=90)

# Last 10 genres
bottom = sns.barplot(x='genre', y = 'count', data=genres[-10:], ax=ax[1])
bottom.set_title('Last 10 genres present in Netflix', size=20)
plt.xticks(rotation=90)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?



The chart on the left  represents the top 10 genres present in Netflix, while the chart on the right represents the last 10 genres. The y-axis represents the count of each genre, and the x-axis shows the genre names. The bars' height indicates the count of each genre, allowing for easy comparison between the top and last genres.

The visualization is created using the seaborn library in Python, which provides a high-level interface for drawing attractive statistical graphics.

We use bar plots here to compare the count of titles in each genre. Placing the most common and least common genres side by side makes the contrast between them easy to see, which suits this kind of frequency comparison in exploratory data analysis.

##### 2. What is/are the insight(s) found from the chart?



* International Movies, Dramas, and Comedies are the most common genres.
* TV Shows, Classic and cult TV, TV thrillers, Stand-Up comedy, and Talk shows are the least common genres.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



Yes. The visualization shows the 10 most common and the 10 least common genres among the Movies and TV shows in our dataset, which shows where the catalog is concentrated and where it is thin.

#### Chart - 9

# **Column: 'description'**

In [None]:
# Chart - 9 visualization code
# text documents
text = " ".join(word for word in df['description'])

# create the word cloud using WordCloud library
wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', min_font_size=15).generate(text)

# plot the word cloud
plt.imshow(wordcloud,  interpolation='bilinear')
plt.show()

##### 1. Why did you pick the specific chart?



A word cloud is a visual representation of text data in which words are displayed in different sizes, with the size of each word indicating its frequency or importance in the given text. The WordCloud library simplifies the process of creating word clouds and helps in understanding and communicating the key information within a textual dataset.

##### 2. What is/are the insight(s) found from the chart?



* The most frequently used words in the description column are "family," "find," "life," "love," "new world," and "friend."

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes. The word cloud highlights the words that occur most frequently in the descriptions in our dataset, which gives a quick view of recurring themes.

#### Chart - 10

# **Data Cleaning**

#### **What is data cleaning?**
* Data cleaning is the process of identifying and correcting or removing inaccuracies, inconsistencies, and missing values in a dataset. It is an important step in the data preparation process that ensures that the data is accurate, complete, and in a format that can be easily analyzed. Data cleaning may include tasks such as removing duplicate records, filling in missing values, correcting errors, and standardizing data formats. The goal of data cleaning is to improve the quality of the data and make it suitable for further analysis and modeling.

**Duplicate Values**



In [None]:
# counting duplicate values
df.duplicated().sum()

## **Missing Values**

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In [None]:
# Chart - 10 visualization code
# Visualizing the missing values
import missingno as msno
msno.bar(df, color='green',sort='ascending', figsize=(10,3), fontsize=15)

In [None]:
# Missing Values Percentage
round(df.isna().sum()/len(df)*100, 2)

**Handling Missing Values**
* A blank string can be used to replace the missing values in the director, cast, and country attributes.
* Only a small percentage of values are null in the rating and date_added columns, so removing these rows will have little effect on the model's construction. As a result, rows with a NaN value in the rating or date_added column are simply dropped.

In [None]:
# Handling Missing Values & Missing Value Imputation
df[['director','cast','country']] = df[['director','cast','country']].fillna(' ')
df.dropna(axis=0, inplace=True)

In [None]:
# checking for null values after treating them.
df.isna().sum()

##### 1. Why did you pick the specific chart?





The graph created with the bar() function of the missingno library is a missing value bar chart.

It visualizes the completeness of each variable (column) in the dataframe, giving a quick overview of which columns have missing data and how much. This makes it an effective first check before deciding how to treat missing values. Comparing the columns side by side can also hint at systematic reasons for missing data.

##### 2. What is/are the insight(s) found from the chart?

Missing values occur in the director, cast, country, rating, and date_added columns. The rating and date_added columns have only a small percentage of nulls.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing which columns are incomplete, and by how much, lets us choose a suitable treatment for each one before modeling, so that the clusters and recommendations are not distorted by missing data.

#### Chart - 11

# **Handling Outliers**

In [None]:
# Chart - 11 visualization code

# Handling Outliers & Outlier treatments

# plotting graph
fig, ax = plt.subplots(1, 2, figsize=(15, 5))

# Display the distribution plot and the box plot of release_year.
# histplot(kde=True) is used in place of the deprecated sns.distplot.
sns.histplot(x=df['release_year'], kde=True, ax=ax[0])
sns.boxplot(x=df['release_year'], ax=ax[1])

##### 1. Why did you pick the specific chart?

Distribution Plot:
The distribution plot (sns.histplot() with a KDE curve) is used to visualize the distribution of a continuous variable, in this case, the 'release_year' feature of the dataset. It displays the frequency distribution of the data points, giving an overview of how the values are spread over the range of the variable. The curve is a kernel density estimate (KDE), a smoothed approximation of the data's probability density. This plot is useful for understanding the central tendency, skewness, and presence of outliers in the 'release_year' feature.

Box Plot:
The box plot (sns.boxplot()) is another type of visualization used to display the distribution of data and to detect outliers. It summarizes the data by its quartiles: the median, the interquartile range (IQR), and any potential outliers. The box in the plot represents the IQR (from the 25th percentile to the 75th percentile), and the whiskers extend to the most extreme values that lie within 1.5 times the IQR of the box. Any data points that fall outside the whiskers are considered outliers and are drawn as individual points.
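The 1.5 × IQR rule behind the box plot whiskers can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the function name `iqr_outliers` and the sample years are hypothetical, and the quartiles use linear interpolation (`method="inclusive"`), which may differ slightly from other quartile conventions.

```python
import statistics

def iqr_outliers(values):
    """Return (lower_fence, upper_fence, outliers) using the 1.5 * IQR rule."""
    # quartiles with linear interpolation between observed data points
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # anything outside the fences would be drawn as an individual point
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Hypothetical release years, not taken from the dataset
years = [1975, 2014, 2015, 2016, 2017, 2018, 2018, 2019, 2019, 2020]
lower, upper, outliers = iqr_outliers(years)
print(lower, upper, outliers)
```

With these made-up values, only the 1975 entry falls outside the fences.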

##### 2. What is/are the insight(s) found from the chart?

* Except for the release year, almost all of the data are in text format.
* The clustering model is built from the textual attributes, so there is no need to handle outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The plots make it easy to check the 'release_year' feature for outliers, which are data points that deviate significantly from the majority of the data and may indicate errors, data quality issues, or unusual observations. Based on the distribution and the box plot, an analyst can decide whether to remove outliers or apply an outlier treatment. In this project, no treatment is needed, because the clustering relies on the textual attributes.

#### Chart - 12

# **Textual Data Preprocessing**

**What is textual data preprocessing?**

* Textual data preprocessing is the process of preparing text data for analysis or modeling. It includes a series of steps that are applied to raw text data in order to clean, organize and standardize it so that it can be easily analyzed or used as input for natural language processing or machine learning models. The preprocessing steps typically include tokenization, stop-word removal, stemming or lemmatization, lowercasing, removing punctuation, and removing numbers. The goal of textual data preprocessing is to prepare the data for further analysis and modeling by removing irrelevant information and standardizing the format of the text. This can help improve the accuracy and effectiveness of the analysis or modeling.

**Modeling Approach**


1.   Choose the attributes that you want to cluster.
2.   Text Preprocessing: Change all textual data to lowercase, eliminate all punctuation marks, and remove stopwords, i.e. commonly occurring words such as "the", "and", and "a" that don't carry much meaning.
3.   Stemming or Lemmatization: Normalizing the words by reducing them to their base form.
4.   Tokenization and Vectorization: Breaking the text into smaller units, such as sentences or words, and converting them into numerical vectors.
5.   Dimensionality reduction.
6.   Make use of various algorithms to cluster the movies and various techniques to determine the optimal number of clusters.
7.   Build the optimal number of clusters and use wordclouds to display the contents of each cluster.
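Steps 2 and 4 of the list above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the tiny `STOP_WORDS` set and the sample sentence are hypothetical stand-ins (the notebook uses NLTK's English stop-word list and the real `text_data` column).

```python
import string

# Tiny illustrative stop-word list (an assumption; the notebook uses NLTK's list)
STOP_WORDS = {"a", "an", "the", "and", "is", "in", "of", "to"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, and drop stop words."""
    text = text.lower()
    # delete every ASCII punctuation character
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

# Hypothetical description text
tokens = preprocess("The Crown: A drama about the reign of Queen Elizabeth II.")
print(tokens)
```

Stripping punctuation before filtering stop words means a token such as "the," is still recognized as a stop word; filtering first would let it slip through.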

# **Selecting Attributes**

In [None]:
df.head(3)

We will cluster the Netflix movies and TV shows into groups based on the following textual characteristics:

* Director
* Cast
* Country
* Rating
* Listed in (genres)
* Description

In [None]:
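# creating a combined text column from all the text columns used for model building.
# the columns are joined with a space so that the last word of one column
# does not merge with the first word of the next.
df['text_data'] = df['director'] + ' ' + df['cast'] + ' ' + df['country'] + ' ' + \
                     df['rating'] + ' ' + df['listed_in'] + ' ' + df['description']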

In [None]:
# checking the first row
df['text_data'][0]

* We were able to successfully consolidate all of the required data into a single column.

# **Removing Stop words and Lower Casing.**

In natural language processing (NLP) tasks, removing stop words and lowercasing words are common pre-processing steps.

* **Stop words Removal:**  Words such as "a," "an," "the," and "is," are words that are commonly used in a language but do not convey much meaning. These words can add noise to the data and can sometimes affect the performance of NLP models, so they are often removed as a pre-processing step.

* **Lowercasing:** It is the process of converting all the words in a text to lowercase. This is useful in tasks such as information retrieval or text classification where case differences are not important, and it reduces the size of the vocabulary because "Love" and "love" are treated as the same word.

In [None]:
# create a set of English stop words (membership checks on a set are fast)
stop_words = set(stopwords.words('english'))

# displaying stopwords
np.array(sorted(stop_words))

In [None]:
# named remove_stopwords so that it does not shadow the NLTK `stopwords` module used above
def remove_stopwords(text):
    '''a function that lowercases each word and removes the stopwords'''
    text = [word.lower() for word in text.split() if word.lower() not in stop_words]
    # joining the list of words with space separator
    return " ".join(text)


In [None]:
# applying the remove_stopwords function.
df['text_data'] = df['text_data'].apply(remove_stopwords)

In [None]:
# checking the first row again
df['text_data'][0]

* We have successfully changed the corpus to lowercase and removed all stopwords.

# **Removing Punctuations**

Removing punctuation is the process of removing any punctuation marks (e.g., periods, commas, exclamation points, etc.) from text data. This is a common pre-processing step in natural language processing (NLP) tasks and text analysis, as punctuation marks often do not carry much meaning and can add noise to the data. It also reduces the size of the vocabulary, because tokens such as "love," and "love" are no longer counted as different words. It can be done using Python libraries such as string, re, and nltk.

In [None]:
# function to remove punctuations

def remove_punctuation(text):
    '''a function for removing punctuation'''
    import string
    # replacing the punctuations with no space, which in effect deletes the punctuation marks.
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)

In [None]:
# applying remove_punctuation function
df['text_data'] = df['text_data'].apply(remove_punctuation)

In [None]:
# checking the first row after the process
df['text_data'][0]

* We have effectively eliminated all the punctuation marks from the corpus.

# **Stemming**

Stemming is the process of reducing a word to its base or root form. This is a common pre-processing step in natural language processing (NLP) tasks and text analysis. The goal of stemming is to reduce words to their base form so that words with the same stem are treated as the same word, even if they are written in different forms. For example, stemming reduces "running" and "runs" to the stem "run." Stemmers work by stripping suffixes, so irregular forms such as "ran" are left unchanged (mapping them to "run" requires lemmatization), and a stem is not always a dictionary word (for example, "families" becomes "famili"). Stemming can be useful in tasks such as information retrieval or text classification where the specific form of a word is not important, and it can also help in reducing the size of the vocabulary. There are several stemmers available in Python, such as the Porter stemmer, Snowball stemmer, and Lancaster stemmer.

* We will use **SnowballStemmer** to reduce each word in the corpus to its stem.

In [None]:
# create an object of stemming function
stemmer = SnowballStemmer("english")

# define a function to apply stemming using SnowballStemmer
def stemming(text):
    '''a function which stems each word in the given text'''
    text = [stemmer.stem(word) for word in text.split()]
    return " ".join(text)

In [None]:
# applying stemming function
df['text_data'] = df['text_data'].apply(stemming)

In [None]:
# checking the first row after the process
df['text_data'][0]

* We have successfully applied stemming to the corpus.

# **Text Vectorization**

Text vectorization is the process of converting text data into numerical vectors or feature representations that can be used for machine learning or data analysis tasks. In simple terms, it transforms the text data into numerical data which can be easily processed by machine learning algorithms. There are several text vectorization techniques available, such as bag of words, TF-IDF, Word2Vec, and GloVe.

* We will be using the TF-IDF vectorizer, which stands for Term Frequency Inverse Document Frequency.
* TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document). The more often a word appears in a document, the higher its TF score.
* IDF(t) = log(Total number of documents / Number of documents containing term t). IDF measures how rare a word is across all the documents in the corpus. The rarer a word, the higher its IDF score. (scikit-learn's TfidfVectorizer uses a smoothed variant of this formula by default.)
* The product of TF and IDF is used to calculate the overall weight of a word in a document, which is known as the TF-IDF score. Words with high TF-IDF scores are considered to be more important and relevant to the document than words with low TF-IDF scores.
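To make the TF and IDF definitions concrete, here is a minimal sketch that computes the weights by hand, assuming the textbook form IDF(t) = log(N / df(t)). The function name `tf_idf` and the three toy documents are hypothetical. The numbers will not match scikit-learn's `TfidfVectorizer`, which by default smooths the IDF and L2-normalizes each document vector.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute plain TF-IDF weights for a list of tokenized documents."""
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenized documents
docs = [
    ["family", "drama", "love"],
    ["family", "comedy"],
    ["crime", "drama", "family"],
]
scores = tf_idf(docs)
print(scores[1])
```

Because "family" appears in every document, its IDF is log(1) = 0 and it receives no weight, while the rarer "comedy" gets the highest score in its document.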

In [None]:
# create the TF-IDF vectorizer object
# max_features = 10000 caps the vocabulary to keep memory use manageable
tfidf = TfidfVectorizer(stop_words='english', lowercase=False, max_features = 10000)

# fit the vectorizer using the text data
tfidf.fit(df['text_data'])

# collect the vocabulary items used in the vectorizer
dictionary = tfidf.vocabulary_.items()

In [None]:
print(len(dictionary)) # number of independent features created from the "text_data" column

In [None]:
# convert vector into array form for clustering
vector = tfidf.transform(df['text_data']).toarray()

# summarize encoded vector
print(vector)
print(f'shape of the vector : {vector.shape}')
print(f'datatype : {type(vector)}')

# **Dimensionality Reduction**

Dimensionality reduction is the process of reducing the number of features or dimensions in a dataset while retaining as much information as possible. The main goal of dimensionality reduction is to simplify the data while minimizing the loss of information. It is a crucial step in machine learning and data analysis as it can help to improve the performance of models, reduce overfitting, and make it easier to visualize and interpret the data.

* There are several techniques used for dimensionality reduction, such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), t-Distributed Stochastic Neighbor Embedding (t-SNE), autoencoders, and random projection.
* We will use Principal Component Analysis (PCA) to reduce the dimensionality of data.

In [None]:
# using PCA to reduce dimensionality
pca = PCA(random_state=42)
pca.fit(vector)

In [None]:
# Explained variance for different number of components
plt.figure(figsize=(10,5))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.title('PCA - Cumulative explained variance vs Number of components')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.axhline(y= 0.8, color='red', linestyle='--')
plt.axvline(x= 3000, color='green', linestyle='--')
plt.show()

##### 1. Why did you pick the specific chart?

A line plot of the cumulative explained variance shows how much information is retained as components are added. This helps in deciding how many principal components to keep for a good trade-off between dimensionality reduction and information preservation.

##### 2. What is/are the insight(s) found from the chart?

* We find that approximately 7500 components account for 100 percent of the variance.
* 3000 components alone account for more than 80% of the variance.
* Therefore, we can take the top 3000 components to reduce dimensionality and simplify the model while still capturing more than 80% of the variance.
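Instead of reading the component count off the plot, it can be computed from the explained-variance ratios. The sketch below is an illustration with hypothetical numbers; in the notebook the input would be `pca.explained_variance_ratio_`, and the helper name `components_for_variance` is an assumption.

```python
from itertools import accumulate

def components_for_variance(explained_variance_ratio, threshold=0.8):
    """Smallest number of leading components whose cumulative ratio reaches the threshold."""
    cumulative = accumulate(explained_variance_ratio)
    for n_components, total in enumerate(cumulative, start=1):
        if total >= threshold:
            return n_components
    # threshold never reached: keep every component
    return len(explained_variance_ratio)

# Hypothetical ratios, sorted in decreasing order as PCA returns them
ratios = [0.40, 0.25, 0.20, 0.10, 0.05]
n_keep = components_for_variance(ratios, threshold=0.8)
print(n_keep)
```

As far as I know, scikit-learn's `PCA` can also do this selection itself when `n_components` is given as a fraction between 0 and 1 with the full SVD solver, which avoids fitting the model twice.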

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Reducing the feature space from 10,000 to 3,000 dimensions while retaining more than 80% of the variance lowers the computational cost of clustering with a limited loss of information.

In [None]:
# reducing the dimensions to 3000 using pca
pca = PCA(n_components=3000, random_state=42)
pca.fit(vector)

In [None]:
# transformed features
X = pca.transform(vector)

# shape of transformed vectors
X.shape

#### Chart - 13

# **Model Implementation**

# **K-Means Clustering**

K-means clustering is a popular unsupervised machine learning technique used to group similar data points together. The goal of k-means clustering is to partition a dataset into k clusters, where each cluster contains similar data points and is represented by its centroid.

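The k-means algorithm works by first selecting k initial centroids, one for each cluster (at random in the basic algorithm; the `k-means++` initialization used below spreads them out). Then, it assigns each data point to the cluster whose centroid is closest to it and recomputes each centroid as the mean of the points assigned to it. These two steps are repeated until the assignment of data points to clusters no longer changes, or until a maximum number of iterations is reached.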
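The loop described above can be written out in plain Python. This is a teaching sketch of the basic algorithm with random initial centroids, not scikit-learn's `KMeans` implementation; the function names and the six 2-D points are hypothetical.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=42):
    """Basic k-means: random initial centroids, then assign/update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # assignment step: index of the nearest centroid for every point
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical 2-D points forming two loose groups
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

The result depends on the initial centroids, which is why the notebook fixes `random_state` when calling scikit-learn.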

* We will determine the best number of clusters for the K-means clustering algorithm by visualizing the elbow curve and silhouette score.

In [None]:
# Chart - 13 visualization code

'''Elbow method to find the optimal value of K'''

# Initialize a list to store the sum of squared errors for each value of K
SSE = []

for k in range(1, 16):
  # Initialize the k-means model with the current value of K
  kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
  # Fit the model to the data
  kmeans.fit(X)
  # Compute the sum of squared errors for the model
  SSE.append(kmeans.inertia_)

# Plot the values of SSE
plt.plot(range(1, 16), SSE)
plt.title('The Elbow Method - KMeans clustering')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared errors')
plt.show()

In [None]:
'''Silhouette score method to find the optimal value of k'''

# Initialize a list to store the silhouette score for each value of k
silhouette_avg = []

for k in range(2, 16):
  # Initialize the k-means model with the current value of k
  kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
  # Fit the model to the data
  kmeans.fit(X)
  # Predict the cluster labels for each point in the data
  labels = kmeans.labels_
  # Compute the silhouette score for the model
  score = silhouette_score(X, labels)
  silhouette_avg.append(score)

# Plot the Silhouette analysis
plt.plot(range(2,16), silhouette_avg)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k - KMeans clustering')
plt.show()

In [None]:
# Clustering the data into 6 clusters
kmeans = KMeans(n_clusters=6, init='k-means++', random_state=33)
kmeans.fit(X)

In [None]:
# Evaluation metrics - distortion, Silhouette score
kmeans_distortion = kmeans.inertia_
kmeans_silhouette_score = silhouette_score(X, kmeans.labels_)

print((kmeans_distortion, kmeans_silhouette_score))

In [None]:
# Adding a kmeans cluster number attribute
df['kmeans_cluster'] = kmeans.labels_

In [None]:
# Number of movies and tv shows in each cluster
plt.figure(figsize=(8,5))
graph = sns.countplot(x='kmeans_cluster',data=df, hue='type')
plt.title('Number of movies and TV shows in each cluster - Kmeans Clustering')

# adding value count on the top of bar
for p in graph.patches:
  graph.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))

# **Building wordclouds for different clusters in K-Means Clustering**

In [None]:
def kmeans_worldcloud(cluster_number, column_name):

  '''function for Building a wordcloud for the movie/shows'''

  df_wordcloud = df[['kmeans_cluster',column_name]].dropna()
  df_wordcloud = df_wordcloud[df_wordcloud['kmeans_cluster']==cluster_number]

  # text documents
  text = " ".join(word for word in df_wordcloud[column_name])

  # create the word cloud
  wordcloud = WordCloud(stopwords=set(STOPWORDS), background_color="white").generate(text)

  # Generate a word cloud image
  plt.imshow(wordcloud, interpolation='bilinear')
  plt.axis("off")
  plt.show()

In [None]:
## Word Cloud on "description" column for different cluster ##
for i in range(6):
  print(f'cluster {i}')
  kmeans_worldcloud(i,'description')

In [None]:
## Word Cloud on "cast" column for different cluster ##
for i in range(6):
  print(f'cluster {i}')
  kmeans_worldcloud(i,'cast')

In [None]:
## Word Cloud on "director" column for different cluster ##
for i in range(6):
  print(f'cluster {i}')
  kmeans_worldcloud(i,'director')

In [None]:
## Word Cloud on "listed_in" (genre) col for different cluster ##
for i in range(6):
  print(f'cluster {i}')
  kmeans_worldcloud(i,'listed_in')

In [None]:
## Word Cloud on "country" column for different cluster ##
for i in range(6):
  print(f'cluster {i}')
  kmeans_worldcloud(i,'country')

In [None]:
## Word Cloud on "title" column for different cluster ##
for i in range(6):
  print(f'cluster {i}')
  kmeans_worldcloud(i,'title')

##### 1. Why did you pick the specific chart?

The Elbow Method plot is a technique used to determine the optimal value of K (the number of clusters) for K-means clustering. K-means is an unsupervised machine learning algorithm used for clustering data points into K distinct clusters.
In the plot, the "elbow" point is where the SSE starts to level off. The optimal value of K is usually chosen at this point, as it offers a good trade-off between fitting the data closely (low SSE) and keeping the number of clusters small. However, the Elbow Method is not always definitive, so other evaluation methods, such as the silhouette score, should also be considered when finalizing the value of K.

##### 2. What is/are the insight(s) found from the chart?

* The sum of squared distances between each point and the centroid of its cluster decreases as the number of clusters increases.
* The highest Silhouette score is obtained for 6 clusters.
* We therefore build 6 clusters using the k-means clustering algorithm.
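The silhouette score used to pick the number of clusters can be computed by hand for a small example. This sketch follows the usual definition (for each point, `a` is the mean distance to its own cluster, `b` is the mean distance to the nearest other cluster, and the score is `(b - a) / max(a, b)`) with Euclidean distance. The function name and sample data are hypothetical, and it assumes at least two clusters and Python 3.8+ for `math.dist`.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself adds nothing to the sum)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical 2-D points with a sensible and a poor labeling
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
good = silhouette(points, [0, 0, 0, 1, 1, 1])
poor = silhouette(points, [0, 1, 0, 1, 0, 1])
print(good, poor)
```

A score near 1 means points sit much closer to their own cluster than to any other, so the labeling that matches the two visible groups scores far higher than the mixed one.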

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

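Yes. The elbow and silhouette analyses give an evidence-based choice of six clusters, and grouping similar titles together can support content recommendations and analysis of the catalog.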

#### Chart - 14 - Dendrogram

# **Hierarchical clustering**

Hierarchical clustering is a method of clustering data points into a tree-like structure. It is an alternative method to k-means clustering and it is used to group similar data points together in a hierarchical fashion.

There are two main types of Hierarchical clustering: Agglomerative and Divisive. Agglomerative is a bottom-up approach where each data point is considered as a separate cluster and the algorithm iteratively merges the closest clusters. On the other hand, Divisive is a top-down approach where all data points are considered as a single cluster and the algorithm iteratively splits the clusters.

The hierarchical clustering algorithm can be represented by a dendrogram which makes it easy to visualize the structure of the clusters.
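The bottom-up (agglomerative) procedure can be sketched directly. Note the assumptions: this toy version uses average linkage for brevity, whereas the notebook uses Ward linkage through SciPy and scikit-learn, and the function names and points are hypothetical. It is a naive implementation that is only practical for very small inputs.

```python
import math
from itertools import combinations

def average_linkage(cluster_a, cluster_b, points):
    """Mean pairwise Euclidean distance between two clusters of point indices."""
    total = sum(math.dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerative(points, n_clusters):
    """Start with one cluster per point and repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # find the pair of clusters with the smallest linkage distance
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], points),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # a < b, so position a is unaffected by the deletion
    return clusters

# Hypothetical 2-D points forming two loose groups
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
result = agglomerative(points, n_clusters=2)
print(result)
```

Each merge corresponds to one junction in a dendrogram; stopping at `n_clusters` is equivalent to cutting the tree at the height where that many branches remain.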

In [None]:
# Chart - 14 visualization code
# Building a dendrogram to decide the number of clusters
plt.figure(figsize=(10, 5))
dend = shc.dendrogram(shc.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Netflix Shows')
plt.ylabel('Distance')
plt.axhline(y= 4, color='r', linestyle='--')

In [None]:
# Fitting hierarchical clustering model
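# Ward linkage always uses Euclidean distance, so no distance argument is passed
# (the `affinity` parameter was deprecated and later removed from scikit-learn)
hierarchical = AgglomerativeClustering(n_clusters=7, linkage='ward')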
hierarchical.fit_predict(X)

In [None]:
# Adding a hierarchical cluster number attribute
df['hierarchical_cluster'] = hierarchical.labels_

In [None]:
df.sample(5)[['type', 'title', 'director', 'cast', 'country', 'rating', 'listed_in', 'description', 'hierarchical_cluster']]

In [None]:
# Number of movies and tv shows in each cluster
plt.figure(figsize=(10,5))
graph = sns.countplot(x='hierarchical_cluster',data=df, hue='type')
plt.title('Number of movies and tv shows in each cluster - Hierarchical Clustering')

# adding value count on the top of bar
for p in graph.patches:
   graph.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))

# **Building wordclouds for different clusters in hierarchical Clustering**

In [None]:
def hierarchical_worldcloud(cluster_number, column_name):

  '''function for Building a wordcloud for the movie/shows'''

  df_wordcloud = df[['hierarchical_cluster',column_name]].dropna()
  df_wordcloud = df_wordcloud[df_wordcloud['hierarchical_cluster']==cluster_number]

  # text documents
  text = " ".join(word for word in df_wordcloud[column_name])

  # create the word cloud
  wordcloud = WordCloud(stopwords=set(STOPWORDS), background_color="white").generate(text)

  # Generate a word cloud image
  plt.imshow(wordcloud, interpolation='bilinear')
  plt.axis("off")
  plt.show()

In [None]:
## Word Cloud on "title" column for different cluster ##
for i in range(7):
  print(f'cluster {i}')
  hierarchical_worldcloud(i,'title')

In [None]:
## Word Cloud on "description" column for different cluster ##
for i in range(7):
  print(f'cluster {i}')
  hierarchical_worldcloud(i,'description')

In [None]:
## Word Cloud on "country" column for different cluster ##
for i in range(7):
  print(f'cluster {i}')
  hierarchical_worldcloud(i,'country')

In [None]:
## Word Cloud on "listed_in (genre)" column for different cluster ##
for i in range(7):
  print(f'cluster {i}')
  hierarchical_worldcloud(i,'listed_in')

##### 1. Why did you pick the specific chart?

A word cloud is a visual representation of text data where words are displayed in different sizes, with the size of each word indicating its frequency or importance in the given text. The WordCloud library simplifies the process of creating visually appealing word clouds and aids in understanding and communicating the key information within a textual dataset.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

# **Recommendation System**

A content-based recommendation system is a type of recommendation system that suggests items to users based on their similarity to other items that the user has shown interest in. It uses the attributes or features of the items to determine the similarity between them.

* Based on how similar the movies and shows are, we can create a straightforward content-based recommender system.
* Given a show that a person has watched on Netflix, the recommender system should suggest a list of similar shows.
* We can use cosine similarity to determine the shows' similarity scores.
* The similarity between two vectors A and B is calculated by dividing their dot product by the product of their magnitudes. Simply put, the smaller the angle between two vectors, the higher the cosine similarity score.
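The cosine similarity calculation and the ranking step can be illustrated on a toy example. This is a sketch under stated assumptions: the three titles and their 3-dimensional vectors are hypothetical stand-ins for the notebook's PCA features, and the helper names `cosine_sim` and `most_similar` are not part of the notebook.

```python
import math

def cosine_sim(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(title, vectors, top_n=2):
    """Rank every other title by cosine similarity to the given one."""
    query = vectors[title]
    ranked = sorted(
        ((other, cosine_sim(query, vec)) for other, vec in vectors.items() if other != title),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:top_n]

# Hypothetical feature vectors
vectors = {
    "Show A": [1.0, 0.0, 1.0],
    "Show B": [0.9, 0.1, 1.1],
    "Show C": [0.0, 1.0, 0.0],
}
top = most_similar("Show A", vectors, top_n=1)
print(top)
```

"Show B" points in almost the same direction as "Show A", so it ranks first, while "Show C" is orthogonal to "Show A" and scores 0.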

In [None]:
# verifying index
df[['show_id', 'title', 'text_data']]

* Our dataframe has a total of 7770 rows, as shown above, and the last index is 7786 due to the deletion of some rows while treating null values.

* To build the content-based recommendation system, we look up the similarity scores for a title by its row index, using the features derived from the "text_data" column.

* The similarity matrix is indexed by position (0 to 7769), whereas the dataframe still carries its original index labels (up to 7786). If the index is not reset, a title's label can point to a different row of the similarity matrix, and the recommendations would be computed for the wrong show. To avoid this issue, we reset the index.

In [None]:
# defining a new dataframe for building the recommendation system
recommender_df = df.copy()

# resetting index
recommender_df.reset_index(inplace=True)

# checking whether the index was reset properly
recommender_df[['show_id', 'title', 'text_data']]

In [None]:
# dropping show-id and index column
recommender_df.drop(columns=['index', 'show_id'], inplace=True)

In [None]:
print(f"before reset index id for movie 'Zozo' : {df[df['title'] == 'Zozo'].index[0]}")
print(f"after reset index id for movie 'Zozo': {recommender_df[recommender_df['title'] == 'Zozo'].index[0]}")

In [None]:
# the transformed feature array created from the text_data column after PCA dimensionality reduction.
X

In [None]:
# calculate cosine similarity
similarity = cosine_similarity(X)
similarity

In [None]:
def recommend(movie):
    '''
    This function lists the top ten titles most similar to the given title, ranked by similarity score.
    '''
    # find the index position of the title; stop early if it is not in the dataset
    matches = recommender_df[recommender_df['title'] == movie].index
    if len(matches) == 0:
        print(f"'{movie}' was not found in the dataset.")
        return
    index = matches[0]

    print(f"If you liked '{movie}', you may also enjoy: \n")

    # sort all titles by their similarity score to the given title, highest first
    distances = sorted(list(enumerate(similarity[index])), reverse=True, key=lambda x:x[1])

    # list the top ten recommended titles (the first entry is normally the title itself, so it is skipped)
    for i in distances[1:11]:
        print(recommender_df.iloc[i[0]]['title'])

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve the business objective, the client can follow these steps:

Data Collection: Gather data on the Netflix shows, including attributes such as show titles, genres, descriptions, and ratings. The more data available, the better the clustering results.

Data Preprocessing: Clean and preprocess the data to handle missing values, remove irrelevant information, and convert textual data into a suitable format for analysis.

Feature Extraction: Extract relevant features from the data that can capture the essence of each show. This can include using natural language processing techniques to extract keywords from show descriptions or using numerical features such as ratings and duration.

Clustering Algorithm Selection: Choose appropriate clustering algorithms such as K-Means, hierarchical clustering, or DBSCAN, depending on the data and the desired characteristics of the clusters.

Hyperparameter Tuning: If using machine learning-based clustering algorithms, perform hyperparameter tuning to optimize the clustering performance.

Evaluation: Evaluate the quality of the clustering results using internal and/or external clustering evaluation metrics to ensure that the shows within each cluster are similar and different across clusters.

Personalization: Once the shows are grouped into clusters, use user viewing history and preferences to provide personalized show recommendations. This can be achieved through collaborative filtering or content-based filtering methods.


Continuous Improvement: Regularly update the clustering model and show recommendation algorithms based on user feedback and new data to continuously improve the user experience and retention.

# **Conclusion**

In this project, we tackled a text clustering problem in which we had to categorize and group Netflix shows into specific clusters in such a way that shows in the same cluster are similar to one another and shows in different clusters are not.

* There were approximately 7787 records and 11 attributes in the dataset.
* We started by working on the missing values in the dataset and conducting exploratory data analysis (EDA).
* It was discovered that Netflix hosts more movies than television shows on its platform, and the total number of shows added to Netflix is expanding at an exponential rate. Additionally, most of the shows were made in the United States.
* The following attributes were chosen as the basis for the **clustering of the data: cast, country, genre, director, rating, and description**. The TFIDF vectorizer was used to tokenize, preprocess, and vectorize the values in these attributes.
* **10000 attributes** in total were created by **TFIDF vectorization**.
* The problem of dimensionality was dealt with through the use of **Principal Component Analysis (PCA)**. Because **3000 components were able to account for more than 80% of the variance**, the total number of components was limited to 3000.
* Utilizing the **K-Means Clustering algorithm**, we first constructed clusters, and the **optimal number of clusters was determined to be 6**. The **elbow method and Silhouette score analysis** were used to get this.
* The **Agglomerative clustering algorithm** was then used to create clusters, and the **optimal number of clusters was determined to be 7**. This was obtained after visualizing the **dendrogram**.
* The similarity matrix generated by applying **cosine similarity** was used to construct a **content-based recommender system**. The user will receive ten recommendations from this recommender system based on the type of show they watched.
