<a href="https://colab.research.google.com/github/kajalwasnik/kajalwasnik/blob/main/Netflix_movies_and_tv_shows_clustering_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Netflix-Movies and Tv-shows-Clustering**



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual

**Name** - Kajal Purushottam Wasnik

# **Project Summary -**

Netflix is an online platform that offers subscription-based streaming services for entertainment, encompassing a diverse selection of content, mainly categorized into movies and television shows. Over the years, it has emerged as the most popular Over-The-Top (OTT) platform globally, accessible to individuals worldwide. Despite the option for customers to terminate their memberships at any time, maintaining user interest is crucial for the company. This underscores the importance of recommendation systems, which play a key role in providing relevant suggestions to users.

As a media distribution corporation, Netflix originated with DVD delivery by mail and has since evolved significantly, focusing primarily on video streaming. The content available on the platform includes licensed material as well as original productions.

While Netflix initially emphasized movies, television series have become a more prominent genre in recent times. Operating on a subscription model, Netflix grants customers unlimited access to its content for a fee.

This project involves working with Netflix data to discern recent trends and gain insights into the presented material, sourced from Flixable, a third-party Netflix search engine. Notably, an analysis by Flixable in 2018 revealed a significant increase in the number of TV series on Netflix since 2010, while the number of movies decreased. This prompted the need for a recommendation system, which we developed by evaluating the data and clustering similar content based on text-based attributes.

Key steps in the project include:

1. Exploring the dataset by examining its head and tail.

2. Describing the dataset by calculating mean, minimum, maximum, and data types of columns.

3. Extracting information about non-null counts in column values.

4. Counting distinct values in each column.

5. Determining the shape of the dataset (number of rows and columns).

6. Addressing null values: filling `country` with the mode, replacing missing `cast` entries with 'No cast', and dropping rows with a missing `date_added` or `rating`.

7. Plotting relevant graphs to extract information.

8. Implementing Natural Language Processing (NLP) techniques, such as tokenization, punctuation removal, stopword removal, and word stemming.

9. Employing clustering models like K-means Clustering and Agglomerative Clustering.

10. Using the Elbow method and Silhouette Score to choose the number of clusters for K-means, and a Dendrogram to choose it for Agglomerative Clustering.

11. Developing a recommendation system using K-means clustering results.
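
Steps 8 to 10 can be illustrated on a toy corpus. The sketch below is only an illustration under stated assumptions, not the notebook's actual pipeline: the six descriptions are invented, `TfidfVectorizer`'s built-in English stop-word list stands in for the NLTK preprocessing, and the candidate cluster counts are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical descriptions standing in for the real 'description' column
docs = [
    "A detective hunts a serial killer in a dark city",
    "Two detectives investigate a murder in a small town",
    "A police officer tracks a killer across the city",
    "A stand-up comedian jokes about family life",
    "A comedian shares funny stories about marriage and family",
    "Funny stand-up special about parenting and family",
]

# Turn text into TF-IDF vectors (stop-word removal only, no stemming here)
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Score a few candidate cluster counts with the silhouette coefficient
scores = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest silhouette score is preferred
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real data the same loop would typically run over a wider range of `k`, with the elbow plot of `KMeans.inertia_` used as a second opinion.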

The project successfully conducted Exploratory Data Analysis (EDA), revealing critical findings to address significant business challenges. The implemented models, particularly K-means clustering, performed well despite the data's volume and complexity.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The dataset we currently possess contains information about TV shows and movies accessible on Netflix as of 2019. This data was obtained from Flixable, an external Netflix search engine.

A compelling report from 2018 provided intriguing statistics about Netflix's content landscape. According to this report, the number of TV shows on Netflix had nearly tripled since 2010, while the count of movies had decreased by over 2,000 titles within the same timeframe. This shift underscores a significant transformation in the platform's content focus, with a notable emphasis on TV shows.

Further exploration of this dataset presents an exciting opportunity to uncover additional insights. By incorporating external datasets, such as IMDB ratings and Rotten Tomatoes scores, we can delve deeper into the quality and reception of the content offered by Netflix. This integration opens up avenues for discovering compelling findings and gaining a more comprehensive understanding of the platform's content offerings.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Importing important libraries
import math  # standard-library module; the numpy.math alias was removed in NumPy 2.0
import numpy as np
import pandas as pd

# Importing visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline
import matplotlib.ticker as mtick
from matplotlib.pyplot import figure
import plotly.graph_objects as go
import plotly.offline as py
import plotly.express as px

#importing stopwords
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
#for tokenization
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
#import stemmer
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split, KFold

# Importing Principal Component Analysis (PCA)
from sklearn.decomposition import PCA

# Importing Machine Learning algorithms
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as sch

#Here we imported path,Image,WordCloud,STOPWORDS,ImageColorGenerator
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset

from google.colab import drive
drive.mount('/content/drive')

In [None]:

df = pd.read_csv('/content/drive/MyDrive/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look

# first 5 rows of data
df.head()

In [None]:
# last 5 rows of data
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

print(f'Netflix = {df.shape[0]} Rows , {df.shape[1]} columns.')

### Dataset Information

In [None]:
# Dataset Info

In [None]:

df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

df.duplicated().sum()


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum().sort_values(ascending = False)

In [None]:
# Visualizing the missing values
import missingno as msno

In [None]:

msno.matrix(df)

In [None]:

#total null values
df.isnull().sum().sum()

### What did you know about your dataset?

Five columns in this dataset have missing values: director, cast, country, date_added, and rating.

The dataset contains 3,631 null values in total.
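
A per-column percentage often makes the missing-value picture easier to read than raw counts. The following is a minimal sketch on an invented three-column frame; only the column names echo the dataset, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; column names mirror the dataset
toy = pd.DataFrame({
    "director": ["A", np.nan, np.nan, "B"],
    "cast": ["X, Y", np.nan, "Z", "W"],
    "title": ["t1", "t2", "t3", "t4"],
})

# Count and percentage of missing values per column, largest first
missing = toy.isnull().sum().sort_values(ascending=False)
summary = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(1),
})

# Show only the columns that actually have gaps
print(summary[summary["missing"] > 0])
```

Replacing `toy` with `df` would give the same summary for the Netflix data.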

## ***2. Understanding Your Variables***

In [None]:

# Dataset Columns
df.keys()

In [None]:
# Dataset Describe
df.describe()

### Variables Description



*   show_id : Unique ID for every movie / TV show
*   type : Identifier - a Movie or TV Show
*   title : Title of the movie / TV show
*   director : Director of the movie
*   cast : Actors involved in the movie / show
*   country : Country where the movie / show was produced
*   date_added : Date it was added on Netflix
*   release_year : Actual release year of the movie / show
*   rating : TV rating of the movie / show
*   duration : Total duration - in minutes or number of seasons
*   listed_in : Genre
*   description : The summary description




### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Creating a copy of the original dataset to keep it safe
df_new = df.copy()


In [None]:
# Write your code to make your dataset analysis ready.
df['type'].value_counts()

Netflix has 5377 movies and 2410 TV shows, so there are more movies on Netflix than TV shows.
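
To express these counts as shares of the catalogue, a short sketch such as the one below can help. It uses only the two counts quoted above; the variable names are illustrative.

```python
import pandas as pd

# Counts quoted in the text above
counts = pd.Series({"Movie": 5377, "TV Show": 2410})

# Convert counts to percentage shares of the catalogue
shares = (counts / counts.sum() * 100).round(1)
print(shares)
```

On the actual frame, `df['type'].value_counts(normalize=True)` gives the same proportions directly.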

In [None]:
# Missing values in the 'cast' column are replaced with the placeholder 'No cast'.
# Assigning the result back avoids chained inplace modification, which newer pandas versions warn about.
df['cast'] = df['cast'].fillna('No cast')

In [None]:
# Missing values in the 'country' column are replaced with the mode (most frequent value).
# Assigning the result back avoids chained inplace modification, which newer pandas versions warn about.
df['country'] = df['country'].fillna(df['country'].mode()[0])


In [None]:
# Removing the rows with missing values in the 'date_added' and 'rating' columns
# If there are any missing values in these columns, the corresponding rows are dropped
df.dropna(subset=['date_added','rating'],inplace=True)

In [None]:
#Dropping Director Column
df.drop(['director'],axis=1,inplace=True)

In [None]:
#Assigning the Ratings into grouped categories
ratings = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
   'NC-17': 'Adults'
}

df['All Ages'] = df['rating'].replace(ratings)

In [None]:
# 'type' and 'All Ages' should be categorical columns
df['type'] = pd.Categorical(df['type'])
df['All Ages'] = pd.Categorical(df['All Ages'], categories=['Kids', 'Older Kids', 'Teens', 'Adults'])
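
One detail worth checking in the mapping step: `replace` leaves any rating code that is missing from the dictionary unchanged, and the later conversion to `pd.Categorical` then silently turns it into NaN. The sketch below shows one way to surface such codes first. It is an illustration only: the dictionary is a subset of the one above and `'66 min'` is an invented bad value.

```python
import pandas as pd

# Subset of the mapping used above
ratings = {"TV-MA": "Adults", "TV-14": "Teens", "TV-Y": "Kids", "PG": "Older Kids"}

# Hypothetical rating column; '66 min' is an invented value with no mapping
rating = pd.Series(["TV-MA", "PG", "TV-Y", "66 min"])

# map() returns NaN for keys missing from the dictionary,
# whereas replace() would pass the raw value through unchanged
age_group = rating.map(ratings)
unmapped = rating[age_group.isna()].unique()
print(unmapped)

# Values outside the listed categories become NaN in the categorical
age_group = pd.Categorical(
    age_group, categories=["Kids", "Older Kids", "Teens", "Adults"]
)
```

If `unmapped` is empty for the real `rating` column, the dictionary covers every code in the data.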

In [None]:

df.head()

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Bar chart (left) and pie chart (right) of the number of titles per type
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(20,6))
ax = df['type'].value_counts().plot(kind='bar',title="Type",ax=axes[0])
df['type'].value_counts().plot(kind='pie',title="Type",autopct='%1.1f%%',ax=axes[1], explode=[0.1,0.1])
ax.set_ylabel("Count")
ax.set_xlabel("Type")
fig.tight_layout()


##### 1. Why did you pick the specific chart?

I chose a bar chart because it clearly displays how the values of a categorical variable are distributed. The pie chart beside it shows the same split as percentages.

##### 2. What is/are the insight(s) found from the chart?

The movie collection is far larger than the TV show collection: 69.1% of the titles are movies and 30.9% are TV shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Netflix offers more movies than TV series, which may be good for the company if movies bring in more revenue than TV episodes. The dataset itself contains no revenue figures to confirm this.

#### Chart - 2

In [None]:
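# Chart - 2 visualization code
# Creating two subsets: one for TV shows and one for movies
# .copy() lets later cells add columns to these subsets without a SettingWithCopyWarning
tv_shows=df[df['type']=='TV Show'].copy()
movies=df[df['type']=='Movie'].copy()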



In [None]:
tv_ratings = tv_shows.groupby(['rating'])['show_id'].count().reset_index(name='count').sort_values(by='count',ascending=False)
fig_dims = (20,7)
fig, ax = plt.subplots(figsize=fig_dims)
sns.pointplot(x='rating',y='count',data=tv_ratings)
plt.title('TV Show Ratings',size='20')
plt.show()


In [None]:
#Movie Ratings based on All Ages Groups
plt.figure(figsize=(20,6))
plt.title('movie ratings')
sns.countplot(x=movies['rating'],hue=movies['All Ages'],data=movies,order=movies['rating'].value_counts().index)

##### 1. Why did you pick the specific chart?

A Seaborn point plot shows the number of TV shows per rating, and a Seaborn count plot shows the number of movies per rating, colored by age group. Both are effective ways to display how the values of a categorical variable are distributed.

##### 2. What is/are the insight(s) found from the chart?

TV-MA, an adult rating, is the most common rating.

In both charts, that is for TV shows and for movies, TV-MA accounts for the largest number of titles. The charts count titles rather than viewers, so they show what Netflix offers, not what is watched most.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The bulk of Netflix's programming is rated for older kids and above, with adult ratings the most common. This is helpful if the platform primarily targets adult viewers, because it improves the probability of drawing in and keeping a larger audience.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

In [None]:
#Creating a line chart to visualize the number of movies and TV shows released each year
#Extracting the count of movies and TV shows for each year
movies_year = movies['release_year'].value_counts().sort_index(ascending=False)
tvshows_year = tv_shows['release_year'].value_counts().sort_index(ascending=False)
#Creating a line plot using Seaborn
sns.set(style='whitegrid', font_scale=1.2)
fig, ax = plt.subplots(figsize=(20,7))

ax = sns.lineplot(x=movies_year.index, y=movies_year.values, color='red', label='Movies', linewidth=2.5, marker='o')
ax = sns.lineplot(x=tvshows_year.index, y=tvshows_year.values, color='blue', label='TV Shows', linewidth=2.5, marker='o')

#Customizing the plot
plt.xticks(rotation=90)
ax.set_xlabel('Release Year', fontsize=15)
ax.set_ylabel('Number of Titles', fontsize=15)
ax.set_title('Production Growth Yearly', fontsize=19, pad=15)
plt.legend(fontsize=15)

plt.show()

##### 1. Why did you pick the specific chart?

I chose this graph because it offers a fascinating look at how films and TV series have been distributed over time. The various lines for movies and TV shows make it simple to compare the two, and the line plot depicts the trend in the annual release of movies and TV shows. This chart also employs color coding to distinguish between movies and TV shows, which makes it easier to read and more aesthetically pleasing. Overall, this graph is a useful tool for examining the connection between the quantity of movies and TV shows released and the year they were first released.

##### 2. What is/are the insight(s) found from the chart?



*   The number of movies and TV shows released each year shows how the production of the catalogue's content has evolved over time.
*   Many more movies were made between the mid-2000s and 2020 than in earlier years.
*   The number of TV shows has also increased, although not as much as the number of movies.
*   The graph also reveals a decline in movie releases in 2020, which could be due to the COVID-19 pandemic.
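
The yearly counts behind such a line chart can also be tabulated in one step. The sketch below is a minimal example on invented rows; only the column names `type` and `release_year` come from the dataset.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "release_year": [2018, 2019, 2019, 2019, 2020, 2020],
})

# One row per year, one column per type, missing combinations filled with 0
per_year = (
    toy.groupby(["release_year", "type"])
    .size()
    .unstack(fill_value=0)
    .sort_index()
)
print(per_year)
```

A table like `per_year` can be passed to `per_year.plot()` to draw both lines at once, and it makes year-on-year changes easy to read off.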







##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can have a positive commercial impact by giving investors, streaming services, and content producers useful information. For instance, the rise in movie production may signal a change in consumer preferences toward movies, which can inform platform offerings and content creation. The decline in 2020, possibly linked to the COVID-19 pandemic, points the other way: fewer new releases may reduce the income of streaming services and content creators.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# bar plot for age group of audience
plt.figure(figsize=(20,7))
df['All Ages'].value_counts().plot(kind='bar')


##### 1. Why did you pick the specific chart?

A bar chart, drawn here through pandas' Matplotlib-based plotting, is an effective way to display how the values of a categorical variable are distributed.

##### 2. What is/are the insight(s) found from the chart?

Rather than specifically targeting children, Netflix largely concentrates on content that appeals to the interests and preferences of Adult and Teen audiences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since most of the catalogue is rated for adults and teenagers, the content is presumably well suited to the tastes of these audiences. If a substantial share of Netflix users belong to these groups, more of them are likely to find the content engaging.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

sns.set(rc={'figure.figsize':(20,7)})
ax = sns.countplot(data = df, x = 'All Ages', hue = 'type',palette = 'pastel')
for p in ax.patches:
   ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x()+0.15, p.get_height()+0.01))


##### 1. Why did you pick the specific chart?

A Seaborn countplot shows the number of observations in each category. With `hue`, it lets us compare movies and TV shows within each age group.

##### 2. What is/are the insight(s) found from the chart?

Compared with other age groups, Netflix offers much more content for Adults, while Kids have the least content available. The catalogue consists largely of adult-targeted films, while the number of TV shows is about evenly split between Adults and Teens.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Given that movies make up a larger portion of Netflix's overall content than TV series do, there is more content devoted to the movies area than to the TV shows section.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Analysing top 10 genres of the movies
plt.figure(figsize=(14,6))
plt.title('Top10 Genre of Movies',fontweight="bold")
sns.countplot(y=movies['listed_in'],data=movies,order=movies['listed_in'].value_counts().index[0:10])

In [None]:
#Analysing top10 genres of TVSHOWS
plt.figure(figsize=(14,6))
plt.title('Top10 Genre of TV Shows',fontweight="bold")
sns.countplot(y=tv_shows['listed_in'],data=tv_shows,order=tv_shows['listed_in'].value_counts().index[0:10])

##### 1. Why did you pick the specific chart?

A Seaborn countplot shows the number of observations in each category, which makes it easy to rank the most common genres.

##### 2. What is/are the insight(s) found from the chart?

1. The most common Netflix movie category is documentaries, followed by stand-up comedy, dramas, and international movies.

2. The most common Netflix TV show category is kids' TV.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Because international movies appeal to a large share of Netflix users, there is an appropriate amount of content in that category. Comedy is ideal for unwinding and having fun, while documentaries are excellent for learning, which helps explain why viewers are drawn to these genres.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
df_duration = df.groupby(['duration'])['show_id'].count().sort_values(ascending= False).reset_index()
df_duration


In [None]:
plt.figure(figsize = (20, 6))
# Plot only the 20 most frequent durations (df_duration is already sorted by count)
sns.barplot(data = df_duration.head(20), x = 'duration', y = 'show_id')
plt.title('Duration of the shows')
plt.xticks(rotation = 90)
plt.show()

##### 1. Why did you pick the specific chart?

A Seaborn barplot makes it easy to compare a numeric value, here the number of titles, across categories.

##### 2. What is/are the insight(s) found from the chart?



*   "1 Season" is the most frequent value in the `duration` column, so titles with a single season form the largest group in the data.
*   The duration of a TV show is recorded in seasons rather than in episodes or minutes, which reflects that Netflix series are organized and released as seasons rather than one episode at a time.
*   This could suggest that viewers are most interested in shows when they first launch, perhaps because Netflix invests heavily in marketing and promoting new releases. The dataset has no viewing figures, so this remains a hypothesis.
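
Because the `duration` column mixes minutes (movies) and seasons (TV shows), it can help to split it into a number and a unit before comparing values. The sketch below assumes entries shaped like `"90 min"` or `"2 Seasons"`; the four values are invented.

```python
import pandas as pd

# Hypothetical values in the same format as the 'duration' column
duration = pd.Series(["90 min", "1 Season", "2 Seasons", "125 min"])

# Split each entry into a number and a unit ('Season' also matches 'Seasons')
parts = duration.str.extract(r"(?P<value>\d+)\s*(?P<unit>min|Season)")
parts["value"] = parts["value"].astype(int)

# Separate movie runtimes from TV show season counts
minutes = parts.loc[parts["unit"] == "min", "value"]
seasons = parts.loc[parts["unit"] == "Season", "value"]
print(minutes.tolist(), seasons.tolist())
```

Keeping the two units apart avoids ranking "1 Season" against minute values on the same axis.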






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These observations could help guide Netflix's approach to producing and acquiring content. Knowing which durations dominate the catalogue can inform how original material is structured and released, and how resources are allocated to content that connects with viewers, so that better-informed decisions can be made.

#### Chart - 8

In [None]:
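# Chart - 8 visualization code
# Creating the actors plot: the 15 most frequent cast entries, ignoring the 'No cast' placeholder
sns.set(rc={'figure.figsize':(30,8)})
top_casts = df.loc[df['cast'] != 'No cast', 'cast'].value_counts().index[0:15]
ax = sns.countplot(data = df_new, x='cast',palette="Spectral",order=top_casts)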
plt.title('Actors on Netflix',fontsize = 25  )
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)


##### 1. Why did you pick the specific chart?

A Seaborn countplot shows the number of observations in each category, which makes it easy to rank the most frequent cast entries.

##### 2. What is/are the insight(s) found from the chart?

The plot shows that the most frequent cast entries are David Attenborough, Samuel West, Jeff Dunham, and Kevin Hart. They may well create material that appeals to audiences, but the dataset has no ratings or viewing figures to confirm it.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This information suggests that the performers who have appeared in the most Netflix programming are also well-known names. Users have access to a wide variety of content that stars these performers.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
sns.set(rc={'figure.figsize':(30,8)})
g = sns.countplot(data = df, x='country',palette="Paired_r",order=df['country'].value_counts().index[0:10],hue = 'rating', )
sns.move_legend(g, "upper left", bbox_to_anchor=(.90, .95), title='Country vs Rating')


##### 1. Why did you pick the specific chart?

A Seaborn countplot shows the number of observations in each category. With `hue`, it lets us compare the ratings within each country.

##### 2. What is/are the insight(s) found from the chart?

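The United States contributes the most titles and covers the widest range of ratings, possibly reflecting its diverse range of cultures. As a country's overall output falls, the variety of ratings it covers narrows as well. Note that missing `country` values were filled with the mode, which inflates the count of the most frequent country.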

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Netflix offers content across many different ratings, which is valuable for serving audiences as culturally diverse as those in the United States.

In [None]:

df

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Create a figure and set its size
plt.figure(figsize=(20, 7))

# Extract the duration values as integers using regex and plot a histogram
# (raw string avoids an invalid-escape warning; expand=False returns a Series)
sns.histplot(movies['duration'].str.extract(r'(\d+)', expand=False).astype(int), kde=True, color='blue')

# Set the title of the plot
plt.title('Distribution of Movie Durations', fontweight='bold')

# Set the x-axis label
plt.xlabel('Duration (minutes)')

# Set the y-axis label
plt.ylabel('Count')

# Show the plot
plt.show()

In [None]:
# Set the figure size
plt.figure(figsize=(20, 7))
# Extract the duration values as integers using regex
# (raw string avoids an invalid-escape warning; expand=False returns a Series)
movies['minute'] = pd.to_numeric(movies['duration'].str.extract(r'(\d+)', expand=False))

# Calculate the average movie duration by rating
duration_year = movies.groupby(['rating'])['minute'].mean()

# Create a DataFrame to store the results and sort by average duration
duration_net_df = pd.DataFrame(duration_year).sort_values('minute')

# Create a bar plot of the average movie duration by rating
ax = sns.barplot(x=duration_net_df.index, y=duration_net_df.minute)

# Set the title of the plot
plt.title("Average Movie Duration by Rating", fontweight='bold')

# Set the x-axis label
plt.xlabel("Rating")

# Set the y-axis label
plt.ylabel("Average Duration (minutes)")

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A Seaborn barplot makes it easy to compare a numeric value, here the average duration, across rating categories. The histogram shows how movie durations are distributed overall.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that NC-17 movies have the longest average runtime, maybe because these films frequently deal with mature issues that need more time to be adequately conveyed. The shortest average runtime is seen in movies with a TV-Y classification, which is appropriate for all children. This implies that films in this category tend to be shorter and have simpler themes and narratives that suit younger audiences. Content producers and distributors who want to understand consumer preferences and trends in the film business may find this information useful.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Businesses in the film industry and streaming platforms may benefit from this kind of analysis, since it helps them understand their audiences' tastes and provide more relevant content. For instance, they could decide to concentrate on producing longer, more mature material for adult audiences if they notice that movies with a mature classification typically have longer runtimes. Conversely, if a category appears to underperform, they could refrain from funding comparable projects in the future, which would reduce the range of content accessible to viewers. Ultimately, before making decisions that might influence their growth, organizations must carefully weigh the possible positive and negative implications of these insights.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Word Cloud library
from wordcloud import WordCloud, STOPWORDS


In [None]:
# text documents
text = " ".join(word for word in df['title'])

# create the word cloud using WordCloud library
wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', min_font_size=15).generate(text)

# plot the word cloud
plt.imshow(wordcloud,  interpolation='bilinear')
plt.show()



##### 1. Why did you pick the specific chart?

A word cloud is used because it shows at a glance which words occur most often in the titles, with more frequent words drawn larger.

##### 2. What is/are the insight(s) found from the chart?

The words 'Love', 'Christmas', 'Man', 'World', 'Life', 'Girl', and 'Story' appear frequently in the `title` column.
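
A word cloud conveys relative frequency visually, but exact counts are sometimes easier to compare. The sketch below counts words in a handful of invented titles; the small stop-word set is an assumption for illustration and is not the `wordcloud.STOPWORDS` list used above.

```python
from collections import Counter
import re

# Hypothetical titles standing in for the 'title' column
titles = ["Love Story", "A Christmas Prince", "Love in the World", "The Man"]

# Small illustrative stop-word list (not wordcloud.STOPWORDS)
stop_words = {"a", "the", "in", "of"}

# Lowercase each title, split it into words, and drop stop words
words = [
    w
    for title in titles
    for w in re.findall(r"[a-z']+", title.lower())
    if w not in stop_words
]

# Most frequent words with their counts
top_words = Counter(words).most_common(3)
print(top_words)
```

Applied to `df['title']`, the same counting would put numbers behind the words that stand out in the cloud.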

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Set the figure size to 20x7
plt.figure(figsize=(20,7))

# Create a countplot for the 'country' column
# Order the bars in descending order by their value counts
# Limit the plot to show only the top 15 countries
# Use different colors for 'TV Show' and 'Movie' categories
sns.countplot(x=df['country'], order=df['country'].value_counts().index[0:15], hue=df['type'])

# Rotate the x-axis tick labels by 50 degrees for better visibility
plt.xticks(rotation=50)

# Set the plot title with larger font size and bold text
plt.title('Top 15 Countries with Most Content', fontsize=15, fontweight='bold')

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

A Seaborn countplot shows the number of observations in each category. With `hue`, it lets us compare movies and TV shows within each country.

##### 2. What is/are the insight(s) found from the chart?

The list shows the top 15 nations that contributed to Netflix, with the United States having produced the most material there, followed by India.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Our data shows that the United States has the most titles accessible on Netflix, closely followed by India. Notably, of all the nations covered in the research, India has the most films.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(20,6))
# Split each comma-separated cast string into one row per actor,
# skipping movies whose cast was filled with the 'No cast' placeholder
cast_movies = movies[movies.cast != 'No cast'].set_index('title').cast.str.split(', ', expand=True).stack().reset_index(level=1, drop=True)
# Create a countplot of the ten most frequent movie actors
sns.countplot(y=cast_movies, order=cast_movies.value_counts().index[:10], palette='bright')
plt.title('Top 10 Actors in Movies on Netflix')
plt.show()


##### 1. Why did you pick the specific chart?

Based on the number of Netflix movies they appeared in, we visualize the top 10 actors in movies.

##### 2. What is/are the insight(s) found from the chart?

Indian actors make up the largest share of the top 10 actors in Netflix movies. Anupam Kher and Shah Rukh Khan are the top two actors by number of films they have appeared in.
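
Counting individual actors requires splitting each comma-separated `cast` string first. The sketch below shows one way to do this with `str.split` and `explode`, as an alternative to the `split(expand=True).stack()` pattern. The rows and actor names are invented; `'No cast'` mirrors the placeholder introduced during wrangling.

```python
import pandas as pd

# Hypothetical rows; 'No cast' mirrors the placeholder used during wrangling
toy = pd.DataFrame({
    "title": ["t1", "t2", "t3"],
    "cast": ["Actor A, Actor B", "Actor A", "No cast"],
})

# One row per (title, actor) pair, skipping the placeholder
actors = (
    toy.loc[toy["cast"] != "No cast", "cast"]
    .str.split(", ")
    .explode()
)

# Number of titles per actor, most frequent first
top_actors = actors.value_counts().head(10)
print(top_actors)
```

Filtering out the placeholder before counting matters, since otherwise `'No cast'` would itself rank among the top "actors".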

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The graph above helps the industry identify which actors appear most often in Netflix-hosted movies. Knowing which actors feature in the largest number of titles can inform casting and content acquisition decisions.

#### Chart - 13 (continued) - Popular Directors

In [None]:
# Popular directors visualization code
# Top directors on Netflix, taken from the untouched copy because 'director' was dropped from df
# Rows without a director are removed before splitting the comma-separated names
directors = df_new.dropna(subset=['director']).set_index('title').director.str.split(', ', expand=True).stack().reset_index(level=1, drop=True)
plt.figure(figsize=(20,6))
sns.countplot(y=directors, order=directors.value_counts().index[:10], palette='pastel')
plt.title('Top 10 Directors')
plt.show()



**1. Why did you pick the specific chart?**

We depict the most renowned or well-known Netflix directors.

**2. What is/are the insight(s) found from the chart?**

The top 10 Netflix directors by number of titles are mostly foreign. Jan Suter is the most prolific director on Netflix.

**3. Will the gained insights help creating a positive business impact?**

Are there any insights that lead to negative growth? Justify with specific reason.

This helps identify the 10 directors responsible for the most Netflix content. The insight can guide Netflix in approaching well-known directors for future content.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# count the titles per country and keep the ten largest groups
df['count'] = 1
heatmap = df.groupby('country')['count'].sum().sort_values(ascending=False).reset_index()[:10]
# drop the placeholder used for missing countries and keep only the country names
heatmap = heatmap.loc[heatmap['country'] != 'unknown', 'country']

# share of each age-rating group within each of those countries
corr = df.loc[df['country'].isin(heatmap)]
corr = pd.crosstab(corr['country'], corr['All Ages'], normalize = 'index').T
corr

countries =['United States', 'India', 'United Kingdom', 'Canada', 'Japan', 'France', 'South Korea', 'Spain', 'Egypt']
rating = ['Adults', 'Teens', 'Older Kids', 'Kids']

plt.figure(figsize=(20,6))
sns.heatmap(corr.loc[rating, countries], cmap='YlGnBu', linewidth=2.5, fmt='1.0%', annot_kws={'fontsize':12}, annot=True)
plt.show()


##### 1. Why did you pick the specific chart?

We display the relationship between the ratings of the streaming material on Netflix and the various nations.

##### 2. What is/are the insight(s) found from the chart?

The United States and the United Kingdom have similar distributions of age ratings, which suggests that audiences in both countries favour similar kinds of content.

#### Chart - 15 - Pair Plot

Since the dataset columns are all in string format and pair plots need numerical data to provide useful visualizations, we are unable to make a pair plot using the dataset.



## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.


### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

The United States has the most titles on Netflix, followed by India, and India has the most movies.



*   Null hypothesis H0 : The average number of movies available on Netflix in the United States is the same as that available in India.
*   Alternate hypothesis HA : The average number of Netflix movies in the US is higher than the average number of Netflix movies in India.



#### 2. Perform an appropriate statistical test.

In [None]:
# Importing libraries for hypothesis testing
from scipy.stats import uniform
from scipy.stats import norm
from scipy.stats import chisquare
from scipy.stats import chi2_contingency
from scipy.stats import t
from scipy.stats import f
from scipy.stats import ttest_ind
import scipy.stats as stats

In [None]:
# Perform Statistical Test to obtain P-Value
# Filter the movies DataFrame to create two new DataFrames:
# One containing only movies produced in the United States, and one containing only movies produced in India
us_movie_df = movies[movies.country == 'United States']
india_movie_df = movies[movies.country == 'India']

# Perform Welch's two-sample t-test on the release years of the two groups of movies
# (the results are not named t, which would shadow the imported scipy.stats.t)
t_stat, p_value = ttest_ind(us_movie_df['release_year'], india_movie_df['release_year'], equal_var=False)

# Set the significance level to 0.05
alpha = 0.05

# Print the results
print('t-statistic:', t_stat)
print('p-value:', p_value)


# Check if the calculated p-value is less than the significance level
if p_value < alpha:
  # If the p-value is less than the significance level, reject the null hypothesis
  print("We reject the null hypothesis.")
else:
  # If the p-value is greater than or equal to the significance level, fail to reject the null hypothesis
  print("We fail to reject the null hypothesis.")

# deleting the temporary dataframes used to calculate the p-value
del us_movie_df
del india_movie_df

##### Which statistical test have you done to obtain P-Value?

I used a two-sample t-test, also known as an independent samples t-test or unpaired t-test, through the ttest_ind function from the scipy.stats module. With equal_var=False this is Welch's t-test, which does not assume equal variances in the two groups. Note that the test as coded compares the mean release_year of US and Indian movies, so the p-value speaks to a difference in average release year between the two groups rather than to the number of titles each country has. The test is also two-sided by default, whereas the alternative hypothesis above is one-sided.

##### Why did you choose the specific statistical test?

I chose the two-sample t-test because it compares the means of two independent samples, here the US movies and the Indian movies on Netflix. A direct comparison of title counts between the two countries would need a test on counts or proportions instead.
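The behaviour of ttest_ind can be seen on a small synthetic example. This is only a sketch: the two arrays below are made-up release years, not the Netflix data, and the alternative argument assumes SciPy 1.6 or newer.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical release years for two groups of movies (synthetic data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=2014, scale=6, size=200)
group_b = rng.normal(loc=2011, scale=9, size=150)

# equal_var=False selects Welch's t-test (no equal-variance assumption)
t_stat, p_two_sided = ttest_ind(group_a, group_b, equal_var=False)

# One-sided alternative: "the mean of group_a is greater than the mean of group_b"
t_stat_1s, p_one_sided = ttest_ind(
    group_a, group_b, equal_var=False, alternative='greater'
)

print(f"t = {t_stat:.3f}")
print(f"two-sided p = {p_two_sided:.4g}, one-sided p = {p_one_sided:.4g}")
```

When the t-statistic has the sign predicted by the alternative, the one-sided p-value is half the two-sided one, which is why the direction of the hypothesis should be fixed before looking at the result.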

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Hypothesis testing to check if there is any relation between year_added and type:



*   **Null Hypothesis H0:** The type of material that is contributed to the platform has no relation to year_added.
*  **Alternative Hypothesis HA:** The kind of material that is contributed to the platform depends on year_added.


**Set significance level to 0.05.**

#### 2. Perform an appropriate statistical test.

In [None]:

# Perform Statistical Test to obtain P-Value
hypo_data = pd.crosstab(df['type'], df['date_added'], margins=False)
hypo_data



In [None]:
from scipy.stats import chisquare
from scipy.stats import chi2_contingency
stat, P, dof, expected = chi2_contingency(hypo_data)
# Set the significance level to 0.05
alpha = 0.05
# Print the results
print('p-value:', P)
# Check if the calculated p-value is less than the significance level
if P < alpha:

  # If the p-value is less than the significance level, reject the null hypothesis
  print("We reject the null hypothesis.")
else:
  # If the p-value is greater than or equal to the significance level, fail to reject the null hypothesis
  print("We fail to reject the null hypothesis.")


##### Which statistical test have you done to obtain P-Value?

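We used the chi-square test of independence (chi2_contingency from scipy.stats) on the contingency table of type against date_added. Since the p-value is less than the significance level, we reject the null hypothesis in favour of the alternative hypothesis.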

##### Why did you choose the specific statistical test?

The chi-square test of independence checks whether two categorical variables are associated, which fits the question of whether the type of content depends on when it was added.
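A minimal sketch of the same test on a toy table may make the mechanics clearer. The data frame below is hypothetical (made-up counts and a year_added column assumed for illustration); only crosstab and chi2_contingency mirror the calls used above.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: content type and the year it was added
toy = pd.DataFrame({
    'type': ['Movie'] * 60 + ['TV Show'] * 40,
    'year_added': [2018] * 40 + [2019] * 20 + [2018] * 10 + [2019] * 30,
})

# Rows are content types, columns are years, cells are title counts
table = pd.crosstab(toy['type'], toy['year_added'])

# expected holds the counts we would see if type and year were independent
stat, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2 = {stat:.2f}, dof = {dof}, p = {p_value:.4g}")
print(expected)
```

The test compares the observed table with the expected one; the larger the gap, the smaller the p-value.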

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isna().sum()


#### What all missing value imputation techniques have you used and why did you use those techniques?

Missing values were handled during the data wrangling stage.

### 2. Handling Outliers

##### What all outlier treatment techniques have you used and why did you use those techniques?

No outlier treatment was applied. The columns used for clustering are text, where numeric outlier detection does not apply, so outliers were not assessed.

### 3. Categorical Encoding

#### What all categorical encoding techniques have you used & why did you use those techniques?

No categorical encoding was applied. The text columns are combined into a single field and converted to numbers by vectorization instead.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
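# Expand Contraction
# Build one text field per title by combining the descriptive columns used for clustering
# (.copy() avoids pandas' SettingWithCopyWarning when the new column is added)
data = df[['title']].copy()
data['cluster'] = (df['description'] + ' ' + df['listed_in'] + ' '+ df['cast'] + ' ' + df['country'] + ' ' + df['rating']).astype(str)
data.set_index('title', inplace = True)
data.head()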


#### 2. Lower Casing

In [None]:
# Lower Casing
def to_lower(x):
  return x.lower()

# Apply the to_lower() function to the 'cluster' column of the DataFrame
data['cluster'] = data['cluster'].apply(to_lower)

# cross checking our result for the function created
# (.iloc[0] selects the first row by position, since the index now holds titles)
print(data['cluster'].iloc[0])


#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
# Creating function to remove all the punctuations
def remove_punctuation(text):
    '''a function for removing punctuation'''
    import string
    # Let's replace the punctuations with no space,
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)


In [None]:
#data['cluster'] = data['cluster'].apply(remove_punctuation)
data.head(10)


#### 4. Removing URLs & Removing words and digits containing digits.

In [None]:
# Remove URLs & Remove words and digits containing digits
# our cluster column does not have links, so only words containing digits are removed
data['cluster'] = data['cluster'].str.replace(r'\w*\d\w*', '', regex=True)

# cross checking our result
print(data['cluster'].iloc[0])

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
nltk.download('stopwords')

In [None]:
# Remove White spaces
# extracting the stopwords using nltk library
st = nltk.corpus.stopwords.words('english')
# Let's look at all the stopwords.
np.array(st)

In [None]:
print("Counts of stopwords: ", len(st))

#### 6. Rephrase Text

In [None]:
# Rephrase Text
def stopwords(text):
    '''a function for removing the stopword'''
    # deleting the stop words and lowering the case of the chosen words
    text = [word.lower() for word in text.split() if word.lower() not in st]
    # using a space separator to link the list of words
    return " ".join(text)

In [None]:
data['cluster'] = data['cluster'].apply(stopwords)
data.head(15)


#### 7. Tokenization

In [None]:
# Tokenization
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.tokenize import TweetTokenizer


In [None]:
# making a vectorizer count item
count_vectorizer = CountVectorizer()
# Using the text data, fit the count vectorizer
count_vectorizer.fit(data['cluster'])
# collect the vectorizer's vocabulary items
dictionary = count_vectorizer.vocabulary_.items()

In [None]:
dictionary

In [None]:
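# vocabulary_ maps each term to its column index, not to a frequency,
# so the corpus-wide counts are taken from the document-term matrix instead
term_matrix = count_vectorizer.transform(data['cluster'])
count = np.asarray(term_matrix.sum(axis=0)).ravel()
# order the terms by column index so they line up with the counts
vocab = sorted(count_vectorizer.vocabulary_, key=count_vectorizer.vocabulary_.get)
# Save the counts in a pandas Series using the terms as the index
vocab_bef_stem = pd.Series(count, index=vocab)
# sort the Series by count, most frequent term first
vocab_before_stem = vocab_bef_stem.sort_values(ascending=False)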

In [None]:
vocab_before_stem.head().T

In [None]:
vocab_before_stem.tail().T

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

def stemming(text):
    '''a function which stems each word in the given text'''
    text = [stemmer.stem(word) for word in text.split()]
    return " ".join(text)

In [None]:
data['cluster'] = data['cluster'].apply(stemming)
data.head(15)


##### Which text normalization technique have you used and why?

We used the Snowball Stemmer, also known as the Porter2 stemming algorithm. It is an improved version of the Porter Stemmer that fixes some of its flaws and generally stems words more consistently.
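As a small illustration of what stemming does to a sentence, the sketch below applies the same SnowballStemmer to a made-up string. The helper name stem_text and the sample text are assumptions for this example only.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def stem_text(text):
    """Stem every whitespace-separated word in the given text."""
    return " ".join(stemmer.stem(word) for word in text.split())

# Hypothetical sample text, not taken from the dataset
sample = "running movies about families"
print(stem_text(sample))
```

Stems are not always dictionary words; the aim is only that different forms of the same word collapse to one token before vectorization.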

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
# making the tfid vectorizer object
tfid_vectorizer = TfidfVectorizer()

# Using text data to fit the vectorizer
tfid_vectorizer.fit(data['cluster'])

# Gather the vocabulary items used in the vectorizer.
dictionary = tfid_vectorizer.vocabulary_.items()

In [None]:
# vocabulary_ holds column indices, so sum the TF-IDF weights per term instead
tfidf_matrix = tfid_vectorizer.transform(data['cluster'])
count = np.asarray(tfidf_matrix.sum(axis=0)).ravel()
# order the terms by column index so they line up with the weights
vocab = sorted(tfid_vectorizer.vocabulary_, key=tfid_vectorizer.vocabulary_.get)
# store the summed weights in a pandas Series with the terms as index
vocab_after_stem = pd.Series(count, index=vocab)
# sort the Series, highest total weight first
vocab_after_stem = vocab_after_stem.sort_values(ascending=False)
# plot the 20 terms with the highest total TF-IDF weight
top_vacab = vocab_after_stem.head(20)
top_vacab.plot(kind = 'barh', figsize=(15,10))

##### Which text vectorization technique have you used and why?

We used term frequency-inverse document frequency (TF-IDF), a technique for converting text to a numerical vector representation. It combines two concepts: term frequency (TF), how often a term occurs in a document, and inverse document frequency (IDF), which down-weights terms that occur in many documents.
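To make the weighting concrete, here is a minimal sketch on three made-up documents. With scikit-learn's default settings the weight of a term is its count in the document multiplied by idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term, and each row is then scaled to unit length. The corpus below is hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the dataset
docs = [
    "crime drama detective",
    "crime thriller detective",
    "kids cartoon comedy",
]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs)

# vocabulary_ maps each term to its column index in the matrix, not to a count
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# One row per document, one column per term
table = pd.DataFrame(weights.toarray(), columns=terms).round(3)
print(table)
```

In the first document "drama" receives a higher weight than "crime" because "drama" occurs in only one document while "crime" occurs in two.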

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?


##### Which all features you found important and why?


### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Applying TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words='english', lowercase=False, max_features=20000)
X = vectorizer.fit_transform(data['cluster'])


In [None]:
X.toarray()[5]

In [None]:
# shape of the vectorized data
print(X.shape)

In [None]:
X = X.toarray()

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

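Yes. The TF-IDF step produces up to 20,000 features, most of them zero for any given title, so clustering on the raw matrix is slow and distances become less informative. Dimensionality reduction replaces these features with a smaller set of derived components that keep most of the variance.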

In [None]:
# Dimensionality Reduction (If needed)

In [None]:
from sklearn.decomposition import PCA

In [None]:
# using PCA to reduce dimensionality
pca = PCA(random_state=42)
pca.fit(X)


In [None]:
# Explained variance for different number of components
plt.figure(figsize=(20,5))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.title('PCA - Cumulative explained variance vs Number of components')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.axhline(y= 0.8, color='red', linestyle='--')
plt.axvline(x= 3000, color='green', linestyle='--')
plt.show()

In [None]:
# keep the first 3000 principal components, the cut-off marked on the plot above
pca = PCA(n_components=3000, random_state=42)
pca.fit(X)

# transformed features
X = pca.transform(X)

# shape of transformed vectors
X.shape

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

We used Principal Component Analysis (PCA), a feature extraction technique, and kept the first 3000 components of the TF-IDF matrix.

Dimensionality reduction lowers the number of features in a dataset while retaining as much useful information as feasible. It is an approach for overcoming the curse of dimensionality, which refers to the problem of rising computing complexity and worse performance of machine learning models as the number of features rises.

Dimensionality reduction approaches are classified into two types: feature selection and feature extraction.

Feature selection picks a subset of the most relevant features from the original feature set, reducing dimensionality by deleting unnecessary and redundant features. Common feature selection techniques include:

*   Correlation-based feature selection
*   Mutual-information-based feature selection
*   Recursive feature elimination
*   SelectKBest

Feature extraction derives new features by combining or transforming existing ones, generating a feature space that is more compact and informative than the original. Commonly used feature extraction techniques include:

*   Principal Component Analysis (PCA)
*   Linear Discriminant Analysis (LDA)
*   Non-Negative Matrix Factorization (NMF)
*   Independent Component Analysis (ICA)
*   Autoencoders
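The number of principal components can also be read off the cumulative explained variance programmatically rather than from a plot. The sketch below does this on a synthetic matrix; the data, the 0.80 target and the variable names are assumptions for illustration, not values from this project.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a dense feature matrix: 200 samples, 50 features
# driven by 5 hidden factors plus a little noise
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 5))
features = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(200, 50))

# Fit PCA with all components and accumulate the explained variance ratios
pca = PCA(random_state=42).fit(features)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the target
target = 0.80
n_components = int(np.searchsorted(cumulative, target) + 1)
print(n_components, round(float(cumulative[n_components - 1]), 3))

# Shortcut: a float n_components asks PCA to choose that count itself
pca_auto = PCA(n_components=target, svd_solver='full').fit(features)
print(pca_auto.n_components_)
```

Stating the variance target in code keeps the chosen component count and the threshold drawn on the plot from drifting apart.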

## ***7. ML Model Implementation***

**Elbow Method for K-means**

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as sch


In [None]:
# Applying Elbow method to find optimal clusters
wcss = []
for i in range(1,26 ):
  kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
  kmeans.fit(X)
  wcss.append(kmeans.inertia_)
plt.plot(range(1,26), wcss)
plt.title("The Elbow Method")
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

**2. Silhouette Score for K-means**

In [None]:
# Visualizing evaluation Metric Score chart
# matplotlib.cm provides the colour map used in the silhouette plots below
import matplotlib.cm as cm
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score


In [None]:
range_n_clusters = [i for i in range(2,16)]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) /n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

      # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()

**1. Explain the ML Model used and its performance using Evaluation metric Score Chart.**

The highest average silhouette score, 0.010383798598527266 (about 0.0104), was obtained with 15 clusters. A score this close to 0 means the clusters overlap heavily, so this metric supports the choice of 15 only weakly.

**2. Explain each evaluation metric's indication towards business and the business impact of the ML model used.**

The silhouette score measures cluster quality for algorithms such as K-means. For each sample it compares the mean distance to the other members of its own cluster with the mean distance to the nearest neighbouring cluster, and it ranges from -1 to 1. A higher score means titles within a cluster resemble each other more than they resemble titles in other clusters, which matters for a recommendation system such as Netflix's because cluster-based suggestions are then more likely to be relevant. More relevant recommendations can in turn benefit the business.
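A compact way to compare candidate cluster counts is to compute the average silhouette score for each and keep the best. The sketch below uses synthetic blobs with four clearly separated groups; the centres, sample size and range of k are assumptions for illustration, not the Netflix features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (not the Netflix features)
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
points, _ = make_blobs(n_samples=400, centers=centers,
                       cluster_std=1.0, random_state=0)

# Average silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

On data this clean the best score is far from 0; scores near 0, as on sparse text features, indicate that the clusters are much less distinct.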

**3. Dendrogram for K-means**

In [None]:
from pylab import rcParams
rcParams['figure.figsize'] = 15, 10

# Use the dendrogram to find the optimal number of clusters
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Description and Listed In')
plt.ylabel('Euclidean Distances')
plt.show()



**1. Explain the ML Model used and its performance using Evaluation metric Score Chart.**

We used a dendrogram as a second way to choose the number of clusters for K-means. We located the longest vertical line that is not crossed by any horizontal line, drew a horizontal cut through it, and counted the vertical lines the cut intersects. According to the dendrogram, the ideal number of clusters for K-means is 15.
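The "longest uncrossed vertical line" rule can be approximated numerically as the largest jump between consecutive merge heights in the linkage matrix. The sketch below applies that idea to synthetic data with three separated groups; the data and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (not the Netflix features)
centers = [(-10, 0), (0, 10), (10, 0)]
points, _ = make_blobs(n_samples=90, centers=centers,
                       cluster_std=1.0, random_state=0)

# Each row of the linkage matrix is one merge; column 2 is the merge height
merges = linkage(points, method='ward')
heights = merges[:, 2]

# The largest jump between consecutive merge heights corresponds to the
# longest vertical stretch of the dendrogram that no horizontal line crosses
gaps = np.diff(heights)
n_clusters = len(points) - (int(np.argmax(gaps)) + 1)

# Cut the tree so that n_clusters flat clusters remain
labels = fcluster(merges, t=n_clusters, criterion='maxclust')
print(n_clusters, np.bincount(labels)[1:])
```

Reading the cut from the merge heights avoids counting lines by eye, which is hard on a dendrogram with thousands of leaves.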

**K-means Clustering with 18 clusters**

In [None]:
# K-means with 18 clusters, plotted on the first two principal components
from sklearn.cluster import KMeans

plt.figure(figsize=(20,6), dpi=120)

kmeans= KMeans(n_clusters=18, init= 'k-means++', random_state=9)

# Fit the model once and predict the cluster label of every title
label = kmeans.fit_predict(X)
# Let's check the unique labels
unique_labels = np.unique(label)

# plot each cluster in its own colour
for i in unique_labels:
    plt.scatter(X[label == i , 0] , X[label == i , 1] , label = i)
plt.legend()
plt.show()



**K-means Clustering with 15 clusters.**

In [None]:
# K-means with 15 clusters, plotted on the first two principal components
from sklearn.cluster import KMeans

plt.figure(figsize=(20,6), dpi=120)

kmeans= KMeans(n_clusters=15, init= 'k-means++', random_state=9)

# Fit the model once and predict the cluster label of every title
label = kmeans.fit_predict(X)
# let's check all the unique labels
unique_labels = np.unique(label)

# plot each cluster in its own colour
for i in unique_labels:
    plt.scatter(X[label == i , 0] , X[label == i , 1] , label = i)
plt.legend()
plt.show()

**Hierarchical Clustering:**

Hierarchical clustering is an unsupervised machine learning technique. It groups similar data points into clusters, producing a tree-like structure known as a dendrogram. The method works by iteratively merging or splitting clusters according to the distance between data points.

There are two main types of hierarchical clustering:

1. Agglomerative Clustering

2. Divisive Hierarchical Clustering

**Agglomerative Clustering**

In [None]:
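# Importing agglomerative clustering
from sklearn.cluster import AgglomerativeClustering
# Ward linkage always uses Euclidean distance, so the affinity argument
# (removed in recent scikit-learn releases) is left out
hc = AgglomerativeClustering(n_clusters = 15, linkage = 'ward')
y_hc = hc.fit_predict(X)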


In [None]:
# Visualizing the clusters on the first two principal components
plt.figure(figsize=(20,8))
# cluster labels run from 0 to 14; the legend shows them as 1 to 15
for cluster_id in range(15):
    plt.scatter(X[y_hc == cluster_id, 0], X[y_hc == cluster_id, 1], s = 100, label = str(cluster_id + 1))
plt.title('Clusters of content')

plt.legend()
plt.show()

**3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.**

Clustering on text-based features groups titles with similar content. Well-formed clusters improve the recommendation system by detecting patterns and providing customised recommendations, which enhances the platform's user experience and content relevance.

**2. Which ML model did you choose from the above created models as your final prediction model and why?**

We chose K-means with 15 clusters as the final model. K-means partitions the dataset into K separate groups, with each data point belonging to exactly one cluster, and 15 was the number of clusters indicated by both the silhouette score and the dendrogram.

**RECOMMENDATION**

In [None]:
# Here we imported path,Image,WordCloud,STOPWORDS,ImageColorGenerator
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

In [None]:
df['cluster_re'] = kmeans.labels_
df.head(5)


In [None]:
# Filtering the dataframe by cluster number and selected column
def func_select_Category(category_name,column_of_choice):
  df_word_cloud = df[['cluster_re',column_of_choice]].dropna()
  df_word_cloud = df_word_cloud[df_word_cloud['cluster_re']==category_name]
  # Concatenating the words in the selected column
  text = " ".join(word for word in df_word_cloud[column_of_choice])
  # Setting the stopwords for the word cloud
  stopwords = set(STOPWORDS)
  # Generating the word cloud image
  wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
  # Code to display the word cloud
  plt.imshow(wordcloud, interpolation='bilinear')
  # use the function argument rather than the loop variable of the calling cell
  plt.title(f'Cluster: {category_name}')
  plt.axis("off")
  plt.show()

In [None]:
for i in range(15):
  func_select_Category(i,'description')


In [None]:
# Function returning up to 10 titles from the same cluster and of the same type
def find_same_cluster_items(name_df):
  inp_df = df.loc[df['title'].str.lower() == name_df.lower()]
  if inp_df.empty:
    raise ValueError("Title not found: {}".format(name_df))
  num = inp_df.cluster_re.iloc[0]
  type_df = inp_df.type.iloc[0]
  temp_df = df.loc[(df['cluster_re'] == num) & (df['type']==type_df)]
  # sample at most 10 titles so that small clusters do not raise an error
  temp_df = temp_df.sample(min(10, len(temp_df)))
  print("The cluster number is {}".format(num))
  return list(temp_df['title'])

In [None]:
find_same_cluster_items('zodiac')


In [None]:
find_same_cluster_items('Thank You')

**Calculating the similarity**
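The idea behind this section can be sketched on a tiny, made-up catalogue: vectorize the text, compute pairwise cosine similarity, and return the most similar titles. Note that linear_kernel gives the same result as cosine_similarity only when every row has unit length, which holds for default TF-IDF output but not after PCA. The titles, texts and the recommend helper below are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini catalogue; titles and descriptions are made up
catalogue = pd.DataFrame({
    'title': ['Alpha Cops', 'Beta Detectives', 'Gamma Cartoons', 'Delta Toons'],
    'text': [
        'crime detective police city',
        'crime detective mystery city',
        'kids cartoon animals fun',
        'kids cartoon songs fun',
    ],
})

tfidf = TfidfVectorizer().fit_transform(catalogue['text'])
similarity = cosine_similarity(tfidf)

def recommend(title, top_n=2):
    """Return the top_n titles most similar to the given title."""
    position = catalogue.index[catalogue['title'] == title][0]
    # Rank all rows by similarity, highest first, and skip the title itself
    ranked = similarity[position].argsort()[::-1]
    ranked = [i for i in ranked if i != position][:top_n]
    return catalogue['title'].iloc[ranked].tolist()

print(recommend('Alpha Cops', top_n=1))
```

Excluding the query title by position, rather than dropping the first ranked item, stays correct even if another title happens to tie with it.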

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import linear_kernel


In [None]:
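# compute the cosine similarity matrix
# (linear_kernel equals cosine similarity only for unit-length rows, which no longer holds after PCA)
cosine_sim = cosine_similarity(X)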

In [None]:
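# Series mapping each title to its row position in df (and in cosine_sim)
indices = pd.Series(range(len(df)), index=df['title'])
# keep the first position when a title appears more than once
indices = indices[~indices.index.duplicated()]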



In [None]:
def get_recommendations(title, cosine_sim = cosine_sim):

    #get index of the matching title
    idx=indices[title]

    #get the similarity score of the similar titles
    sim_scores=list(enumerate(cosine_sim[idx]))

    #sort the movies based on the similarity score
    sim_scores=sorted(sim_scores, key=lambda x:x[1], reverse=True)

    #get the similarity score of top 10 movies
    sim_scores=sim_scores[1:11]

    #get the indices
    movie_indices = [i[0] for i in sim_scores]

    #return the top indices
    return df['title'].iloc[movie_indices]


In [None]:
# Iterate over a set of indices ranging from 0 to 499.
# Get the movie title from index i.
for i in range(0, 500):
  gg = df['title'].iloc[i]
  print(gg)


In [None]:
# Generating recommendations based on the title 'Power Rangers Dino Charge'
Movies = (get_recommendations('Power Rangers Dino Charge'))
Movies


In [None]:
# Generating recommendations based on the title 'The Rachel Divide'
Tv_shows = (get_recommendations('The Rachel Divide'))
Tv_shows

**Conclusion For EDA**

1. Movies outnumber TV shows on Netflix: 69.1% of titles are movies and 30.9% are TV series. In both cases TV-MA is the most prevalent rating, indicating a catalogue weighted toward adult audiences.

2. Rather than specifically targeting children, Netflix predominantly focuses on providing content that aligns with the interests of adult and teen audiences. Documentaries rank as the most popular Netflix category, followed by stand-up comedy, dramas, and foreign films. The leading category for Netflix TV programs is kids TV.

3. Viewers exhibit a preference for watching new episodes upon initial release, as evidenced by "Season 1" being the most common duration for TV shows on Netflix. This pattern may stem from Netflix's robust marketing efforts for new episodes or viewers' heightened interest in shows during their debut.

4. The United States contributes a substantial volume of content to Netflix due to its diverse cultural landscape. The availability of content decreases as a country's overall output decreases.

5. The data reveals that NC-17 movies tend to have longer runtimes, possibly because they often address mature themes requiring more extensive storytelling. Conversely, movies with a TV-Y classification, suitable for all youngsters, have the shortest average runtime, indicating simpler themes suitable for younger audiences. This information can be valuable for content producers and distributors seeking insights into consumer preferences and trends in the film industry.

6. The list of top 15 countries contributing to Netflix demonstrates the United States as the leading contributor, followed by India.

7. Notably, the top 10 Netflix directors with the most content are predominantly of foreign origin. Jan Suter stands out as the most prolific filmmaker on Netflix, having produced a substantial amount of content.


# **Conclusion** **for Machine Learning Model**

The Elbow graph enabled us to identify the number of clusters, ultimately determining it to be 18. This number was discerned by locating a pronounced bend in the curve or a significant deviation from a straight line on the graph. The point on the x-axis corresponding to this juncture was selected as the value indicating the number of clusters.
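The elbow can also be inspected numerically by looking at how much each extra cluster reduces the within-cluster sum of squares (WCSS). The sketch below uses synthetic blobs with four separated groups; the data and the relative-drop heuristic are assumptions for illustration, not the procedure used on the Netflix features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated groups (not the Netflix features)
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
points, _ = make_blobs(n_samples=400, centers=centers,
                       cluster_std=1.0, random_state=0)

# WCSS (inertia) for each candidate number of clusters
k_values = list(range(1, 9))
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    wcss.append(model.inertia_)

# Relative drop in WCSS gained by each extra cluster;
# the elbow sits where this drop collapses
drops = [(wcss[i - 1] - wcss[i]) / wcss[i - 1] for i in range(1, len(wcss))]
for k, drop in zip(k_values[1:], drops):
    print(f"k={k}: WCSS falls by {drop:.1%}")
```

On real text features the curve is usually much smoother than on this toy data, which is why the elbow is best read together with the silhouette score and the dendrogram.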

Additionally, we employed the Silhouette score method, and upon evaluating the Silhouette Score, we found that the optimal score, amounting to 0.010383798598527266, was achieved with 15 clusters.

To ascertain the ideal number of clusters for K-means from another perspective, we examined the dendrogram. This involved identifying the longest vertical distance that could be drawn without intersecting any other horizontal lines and noting the number of vertical lines crossed after traversing this distance. According to the dendrogram's perspective, 15 clusters emerged as the optimum number for K-means.

Clustering on text-based features groups titles with similar content. By recognizing patterns and offering personalized recommendations, well-defined clusters enhance the recommendation system, thereby improving the overall user experience and content relevance on the platform.
