# **Project Name**    - Netflix Movies and TV Shows Clustering




##### **Project Type**    - Unsupervised
##### **Contribution**    - Team
##### **Team Member 1 -** Prasanta Kumar Sahoo
##### **Team Member 2 -** Kartikeshwara Behera


# **Project Summary -**

The main objective of our project is to analyze the vast Netflix catalog of movies and TV shows, which is sourced from a third-party search engine called Flixable. Our goal is to group them into relevant clusters, which will ultimately enhance the user experience and prevent subscriber churn for Netflix, the world's largest online streaming service provider. With over 220 million subscribers as of 2022-Q2, it is crucial to provide a personalized and engaging streaming experience for each user. To achieve this, we will analyze the dataset that includes movies and TV shows as of 2019, which will help us uncover new insights and trends in the rapidly growing world of streaming entertainment.

The dataset contains 7,787 records and 12 attributes. We started by handling the missing values and then conducted exploratory data analysis (EDA) to better understand the dataset's structure.

To create clusters, we used various attributes such as cast, country, genre, director, rating, and description. These attributes were tokenized, preprocessed, and vectorized using the TF-IDF vectorizer. Because this produces a very high-dimensional feature matrix, we reduced its dimensionality with Principal Component Analysis (PCA).

Next, we employed two different clustering algorithms: K-Means Clustering and Agglomerative Hierarchical clustering, to construct two distinct types of clusters. To determine the optimal number of clusters, we used various methods such as the elbow method, silhouette score, dendrogram, and others.

By applying these techniques, we were able to group the Netflix catalog into clusters that were relevant and coherent. These clusters will enable Netflix to provide its subscribers with more personalized recommendations, which can lead to a better user experience and ultimately reduce subscriber churn. Furthermore, our analysis of the dataset revealed some interesting insights and trends in the streaming entertainment industry, which could be useful for content creators and providers.

# **GitHub Link -**

https://github.com/prasantsahoo107

# **Problem Statement**


Netflix is the world's largest online streaming service provider, with more than 220 million subscribers as of 2022-Q2. The task is to analyse its catalogue of movies and TV shows as of 2019, sourced from the third-party search engine Flixable, and to group the titles into meaningful clusters. Such clusters can support more personalised recommendations, which in turn may improve the user experience and help reduce subscriber churn.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Install NLTK and download its resources before importing from nltk
!pip install nltk
import nltk
!python3 -c "import nltk; nltk.download('all')"

# Core libraries
import pandas as pd
import numpy as np
import pylab as pl
import spacy
import sklearn
import en_core_web_sm

# Visualization
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import plotly.express as px
import seaborn as sns
from wordcloud import WordCloud
import warnings
warnings.filterwarnings('ignore')

# Text processing
from nltk.corpus import stopwords  # stopwords
from nltk import word_tokenize, sent_tokenize  # tokenizing
from nltk import ne_chunk  # named entity recognition (NER)
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
nltk.download('stopwords')
STOPWORDS = set(stopwords.words('english'))

# Vectorizers for creating the document-term matrix (DTM), and decomposition
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation, PCA
from sklearn.metrics.pairwise import linear_kernel

# Preprocessing
from sklearn.preprocessing import (StandardScaler, MinMaxScaler, LabelEncoder,
                                   MultiLabelBinarizer, OneHotEncoder)

# Clustering and evaluation
from sklearn import metrics
from sklearn.cluster import KMeans, AgglomerativeClustering, AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score
#from scipy.cluster.hierarchy import linkage

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
netflix = pd.read_csv('/content/drive/MyDrive/AlmaBetter Project/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')
netflix_df = netflix.copy()

### Dataset First View

In [None]:
# Dataset First Look
netflix_df.head()

In [None]:
netflix_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
netflix_df.count()  # Non-null value count for each column


In [None]:
netflix_df.shape

### Dataset Information

In [None]:
# Dataset Info
netflix_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
netflix_df[netflix_df.duplicated()].shape

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
netflix_df.isnull().sum().sort_values(ascending = False)

In [None]:
# Visualizing the missing values
import missingno as msno
msno.bar(netflix_df)

### What did you know about your dataset?

The dataset, provided through AlmaBetter, contains a total of 7,787 movies and TV shows.

Each row contains the following information: type (Movie or TV Show), title, director, cast, country, rating (ex. PG, PG-13, R, etc.), listed_in (genre), and plot description.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
list(netflix_df.columns)   

In [None]:
# Dataset Describe
netflix_df.describe()

### Variables Description 

* ***Show_id*** : Unique ID for every movie / TV show

* ***Type*** : Identifier - a Movie or TV Show

* ***Title*** : Title of the movie / TV show

* ***Director*** : Director of the movie / TV show

* ***Cast*** : Actors involved in the movie / show

* ***Country*** : Country where the movie / show was produced

* ***Date_added*** : Date it was added on Netflix

* ***Release_year*** : Actual release year of the movie / show

* ***Rating*** : TV rating of the movie / show

* ***Duration*** : Total duration, in minutes (movies) or number of seasons (TV shows)

* ***Listed_in*** : Genre

* ***Description*** : The summary description

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
netflix_df['type'].unique() 

In [None]:
netflix_df['title'].unique() 

In [None]:
netflix_df['director'].unique() 

In [None]:
netflix_df['cast'].unique() 

In [None]:
netflix_df['country'].unique() 

In [None]:
netflix_df['rating'].unique() 

In [None]:
netflix_df['listed_in'].unique() 

In [None]:
netflix_df['description'].unique() 

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready

# 1. Check for duplicate rows (an empty result means there are none to drop)
netflix_df[netflix_df.duplicated()]

There are no duplicates

In [None]:
#Checking Null Values
netflix_df.isnull().sum()

In [None]:
#total null values
netflix_df.isnull().sum().sum()

The dataset contains 3,631 null values: 2,389 in the director column, 718 in cast, 507 in country, 10 in date_added, and 7 in rating. These null values must therefore be handled before further analysis.
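
The per-column breakdown above is easier to judge when each count is shown next to the share of rows it represents. The following is a minimal sketch on a small, made-up DataFrame; `missing_summary` and the `toy` frame are hypothetical names used for illustration and are not part of the original notebook.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and percentages per column, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have missing values
    return summary[summary["null_count"] > 0].sort_values("null_count", ascending=False)

# Hypothetical toy frame standing in for netflix_df
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [None, "X", None, None],
    "rating": ["PG", None, "R", "TV-MA"],
})
summary = missing_summary(toy)
print(summary)
```

Applied to the real data, the same call would be `missing_summary(netflix_df)`. A column with a large missing share is a candidate for dropping, while a column with only a handful of missing rows can usually have those rows removed.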

In [None]:
#Handling Null Values
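# Assigning the result back avoids the chained in-place fillna, which newer pandas versions warn about
netflix_df['cast'] = netflix_df['cast'].fillna('No cast')
netflix_df['country'] = netflix_df['country'].fillna(netflix_df['country'].mode()[0])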

In [None]:
# 'date_added' and 'rating' have only a few missing values, so we drop the rows where they are missing
netflix_df.dropna(subset=['date_added','rating'],inplace=True)

In [None]:
#Dropping Director Column
netflix_df.drop(['director'],axis=1,inplace=True)

In [None]:
# Checking again whether any null values remain
netflix_df.isnull().sum()

All null values have been handled successfully.

### What all manipulations have you done and insights you found?

1. There are no duplicate rows.
2. The dataset contains 3,631 null values: 2,389 in director, 718 in cast, 507 in country, 10 in date_added, and 7 in rating.
3. Missing cast values were filled with 'No cast' and missing country values with the most frequent country. Rows missing date_added or rating were dropped, and the director column was removed, so no null values remain.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 **(Movies vs TV Shows)**

In [None]:
# Chart - 1 visualization code
netflix_df['type'].value_counts()

In [None]:
#countplot to visualize the number of movies and tv_shows in type column
sns.countplot(x="type", data=netflix_df)

##### 1. Why did you pick the specific chart?

The countplot is a good plot to use to see how category data is distributed, like how many movies and TV programmes are in the type column of the Netflix dataset. It makes it simple to evaluate the frequency of various groups by showing the count of records in each category as bars.

Netflix has 5372 movies and 2398 TV shows

##### 2. What is/are the insight(s) found from the chart?

The countplot shows that the number of movies is more than twice the number of TV shows. This gives a first view of Netflix's content distribution and suggests that the platform has a more extensive collection of movies than of TV shows.
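
To put a number on this comparison, the counts quoted above (5372 movies and 2398 TV shows) can be turned into shares and a ratio. This is a small illustrative sketch that hard-codes those two counts; on the real data the same Series would come from `netflix_df['type'].value_counts()`.

```python
import pandas as pd

# Counts quoted in the text above
type_counts = pd.Series({"Movie": 5372, "TV Show": 2398})

# Share of each type in percent, and the number of movies per TV show
shares = (type_counts / type_counts.sum() * 100).round(1)
ratio = type_counts["Movie"] / type_counts["TV Show"]

print(shares)
print(f"Movies per TV show: {ratio:.2f}")
```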

##### 3. Will the gained insights help creating a positive business impact? 

This information may be useful for content creators and distributors looking to partner with Netflix or understand audience preferences on the platform. It also provides an initial understanding of the type of content that Netflix users consume and may inform future content decisions by the platform.

#### Chart - 2 **(Movie Rating)**

In [None]:
netflix_df['rating']

In [None]:
#Assigning the Ratings into grouped categories
ratings = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}
netflix_df['target_ages'] = netflix_df['rating'].replace(ratings)

In [None]:
# type should be a categorical
netflix_df['type'] = pd.Categorical(netflix_df['type'])
netflix_df['target_ages'] = pd.Categorical(netflix_df['target_ages'], categories=['Kids', 'Older Kids', 'Teens', 'Adults'])
netflix_df

In [None]:
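# Splitting the data into two DataFrames, one for TV shows and one for movies
# (.copy() lets us add columns to them later without SettingWithCopyWarning)
tv_shows = netflix_df[netflix_df['type'] == 'TV Show'].copy()
movies = netflix_df[netflix_df['type'] == 'Movie'].copy()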

In [None]:
# Chart - 2 visualization code
#Rating based on rating system of all TV Shows
tv_ratings = tv_shows.groupby(['rating'])['show_id'].count().reset_index(name='count').sort_values(by='count',ascending=False)
fig_dims = (14,7)
fig, ax = plt.subplots(figsize=fig_dims)  
sns.pointplot(x='rating',y='count',data=tv_ratings)
plt.title('TV Show Ratings',size='20')
plt.show()

TV-MA, an adult rating, is the most common rating among TV shows.

##### 1. Why did you pick the specific chart?

I chose to use a point plot to visualize the TV show ratings because it is an effective way to show the distribution of a categorical variable (rating) and its corresponding count. The points on the plot represent the count of TV shows for each rating, and the lines connecting the points provide a clear visualization of how the count changes across different ratings. Additionally, the plot allows for easy comparison between the different ratings, making it an ideal choice for this type of analysis.

##### 2. What is/are the insight(s) found from the chart?

TV-MA, an adult rating, is the most common rating among TV shows.

##### 3. Will the gained insights help creating a positive business impact? 


This information can be used by Netflix to tailor their content strategy towards the adult demographic, potentially attracting more viewers and increasing their customer base. Additionally, this information can help Netflix optimize their marketing efforts by targeting the specific audience that prefers TV-MA rated shows.

#### Chart - 3 **(Movie Ratings based on Target Age Groups)**

In [None]:
# Chart - 3 visualization code
#Movie Ratings based on Target Age Groups
plt.figure(figsize=(14,6))
plt.title('movie ratings')
sns.countplot(x=movies['rating'],hue=movies['target_ages'],data=movies,order=movies['rating'].value_counts().index)

##### 1. Why did you pick the specific chart?

I chose to use a count plot to visualize the movie ratings and the target age groups because it is an effective way to show the distribution of a categorical variable (rating) and its corresponding count for each target age group. The count plot also allows for easy comparison between the different target age groups and their corresponding movie ratings. Additionally, the plot provides an ordered representation of the movie ratings based on their frequency, making it easier to see which ratings are the most common.

##### 2. What is/are the insight(s) found from the chart?

TV-MA is the most common rating for movies as well as for TV shows, so adult-rated content leads in both cases.

#### Chart - 4 **(Production_growth)**

In [None]:
movies_year =movies['release_year'].value_counts().sort_index(ascending=False)
movies_year

In [None]:
tvshows_year =tv_shows['release_year'].value_counts().sort_index(ascending=False)

In [None]:
# Chart - 4 visualization code
sns.set(font_scale=1.4)
movies_year.plot(figsize=(12, 8), linewidth=2.5, color='maroon',label="Movies / year",ms=3)
tvshows_year.plot(figsize=(12, 8), linewidth=2.5, color='blue',label="TV Shows / year")
plt.xlabel("Years", labelpad=15)
plt.ylabel("Number", labelpad=15)
plt.title("Production growth yearly", y=1.02, fontsize=22);

##### 1. Why did you pick the specific chart?

Since a line plot is a powerful tool for showing how the number of productions has changed over time, I decided to use one to illustrate the growth in production of movies and TV programmes based on their release year. The line plot makes it simple to compare the progress of movies and TV shows, and the use of distinct colours and a legend makes it easier to distinguish between the two. In addition, the plot offers a chronologically ordered representation of the years, which makes it simpler to understand how production has changed over time.

##### 2. What is/are the insight(s) found from the chart?

Production has grown steeply since around 2000, for both movies and TV shows.

##### 3. Will the gained insights help creating a positive business impact? 

Netflix can use this data to carefully plan their content creation efforts and investments in response to the expanding market for films and television programmes. Netflix can expand its customer base and make more money by concentrating on creating content that meets the rising demand of viewers. Additionally, being aware of the shifting patterns in production growth can help Netflix decide where to spend its money, whether it be in partnerships, new technology, or content acquisition. This may give the business a competitive edge in the market and spur further expansion.

#### Chart - 5 (**Release_per_year**)

In [None]:
# Chart - 5 visualization code
# Number of movies per release year, for the 20 release years with the most movies
plt.figure(figsize=(15,5))
sns.countplot(y='release_year',data=movies,order=movies['release_year'].value_counts().index[0:20])

In [None]:
# Number of TV shows per release year, for the 20 release years with the most TV shows
plt.figure(figsize=(15,5))
sns.countplot(y='release_year',data=tv_shows,order=tv_shows['release_year'].value_counts().index[0:20])

##### 1. Why did you pick the specific chart?

Because a countplot offers a simple-to-understand visual representation of the data, I decided to use it to analyse how many movies and TV shows were released in each of the 20 busiest release years. We can easily see the number of titles released each year and compare the results between years. In addition, because the bars are ordered by the number of titles released, the most active years are easy to identify.

##### 2. What is/are the insight(s) found from the chart?


*   The highest number of movies was released in 2017 and 2018.
*   The highest number of TV shows was released in 2020.
*   The number of movies on Netflix is growing significantly faster than the number of TV shows.
*   There was a large increase in the number of movies and TV shows after 2015.
*   There is a significant drop in the number of movies and TV shows produced after 2020.
*   It appears that Netflix has focused more on increasing movie content than TV shows. Movies have increased much more dramatically than TV shows.

#### Chart - 6 (**Release_month**)

In [None]:
# Adding a column with the month in which each title was added to Netflix

netflix_df['month'] = pd.DatetimeIndex(netflix_df['date_added']).month
netflix_df.head()

In [None]:
# Chart - 6 visualization code
# Countplot of titles by the month they were added to Netflix
plt.figure(figsize=(12,10))
ax=sns.countplot(x='month',data= netflix_df)

In [None]:
fig, ax = plt.subplots(figsize=(15,6))

sns.countplot(x='month', hue='type',lw=5, data=netflix_df, ax=ax)

##### 2. What is/are the insight(s) found from the chart?

* Most content, both movies and TV shows, was added to Netflix between October and January.

#### Chart - 7 (**Genre**)

In [None]:
# Chart - 7 visualization code
#Analysing top10 genre of the movies
plt.figure(figsize=(14,6))
plt.title('Top10 Genre of Movies',fontweight="bold")
sns.countplot(y=movies['listed_in'],data=movies,order=movies['listed_in'].value_counts().index[0:10])

In [None]:
#Analysing top10 genres of TV_SHOWS
plt.figure(figsize=(14,6))
plt.title('Top10 Genre of TV Shows',fontweight="bold")
sns.countplot(y=tv_shows['listed_in'],data=tv_shows,order=tv_shows['listed_in'].value_counts().index[0:10])

In [None]:
# Average movie duration (in minutes) for each rating
movies['minute'] = pd.to_numeric(movies['duration'].str.extract(r'(\d+)', expand=False))
duration_by_rating = movies.groupby('rating')['minute'].mean().sort_values()
plt.figure(figsize=(12,6))
ax=sns.barplot(x=duration_by_rating.index, y=duration_by_rating.values)

##### 2. What is/are the insight(s) found from the chart?

* Documentaries are the most common movie genre on Netflix, followed by stand-up comedy and by dramas combined with international movies.
* Kids' TV is the most common TV show genre on Netflix.
* Movies rated NC-17 have the longest average duration.
* Movies rated TV-Y have the shortest average duration.

#### Chart - 8 (**duration**)

In [None]:
# Chart - 8 visualization code
#Checking the distribution of Movie Durations
plt.figure(figsize=(10,7))
# Regex: \d matches a digit and + means "one or more", so (\d+) captures the number of minutes
movie_minutes = pd.to_numeric(movies['duration'].str.extract(r'(\d+)', expand=False))
# histplot replaces the deprecated distplot (assumes seaborn 0.11 or newer)
sns.histplot(movie_minutes, kde=False, color='red')
plt.title('Distribution of Movie Durations',fontweight="bold")
plt.show()

In [None]:
#Checking the distribution of TV SHOWS
plt.figure(figsize=(30,6))
plt.title("Distribution of TV Shows duration",fontweight='bold')
sns.countplot(x=tv_shows['duration'],data=tv_shows,order = tv_shows['duration'].value_counts().index)

##### 2. What is/are the insight(s) found from the chart?

* Most movies have a duration of between 50 and 150 minutes.
* The largest group of TV shows consists of a single season.

#### Chart - 9 (**Country**)

In [None]:
# Chart - 9 visualization code
#Analysing top15 countries with most content 
plt.figure(figsize=(18,5))
sns.countplot(x=netflix_df['country'],order=netflix_df['country'].value_counts().index[0:15],hue=netflix_df['type'])
plt.xticks(rotation=50)
plt.title('Top 15 countries with most contents', fontsize=15, fontweight='bold')
plt.show()

In [None]:
# Horizontal stacked bar plot of the Movie / TV Show split for the top 10 countries
country_order = netflix_df['country'].value_counts()[:10].index
content_data = netflix_df[['type', 'country']].groupby('country')['type'].value_counts().unstack().loc[country_order]
content_data['sum'] = content_data.sum(axis=1)
content_data_ratio = (content_data.T / content_data['sum']).T[['Movie', 'TV Show']].sort_values(by='Movie',ascending=False)[::-1]

# Plotting the barh
fig, ax = plt.subplots(1,1,figsize=(15, 8))

ax.barh(content_data_ratio.index, content_data_ratio['Movie'], 
        color='crimson', alpha=0.8, label='Movie')
ax.barh(content_data_ratio.index, content_data_ratio['TV Show'], left=content_data_ratio['Movie'], 
        color='black', alpha=0.8, label='TV Show')
# Show which colour corresponds to which content type
ax.legend()
plt.show()

##### 2. What is/are the insight(s) found from the chart?

* The United States has the largest amount of content on Netflix, followed by India.
* Among the top countries, India has the highest proportion of movies relative to TV shows.

#### Chart - 10 **(Originals)**

In [None]:
# Chart - 10 visualization code
netflix_df['date_added'] = pd.to_datetime(netflix_df['date_added'])
movies['year_added'] = netflix_df['date_added'].dt.year
netflix_df

Some movies and TV shows were released elsewhere first and added to Netflix later, while others were released on Netflix itself. As a simple heuristic, we label a title as a Netflix Original when its release year equals the year it was added to Netflix.



In [None]:
movies['originals'] = np.where(movies['release_year'] == movies['year_added'], 'Yes', 'No')
# Fix the order of the counts so that they always match the pie labels below
originals_counts = movies['originals'].value_counts().reindex(['No', 'Yes'], fill_value=0)
# pie plot showing percentage of originals and others in movies
fig, ax = plt.subplots(figsize=(5,5),facecolor="#363336")
ax.patch.set_facecolor('#363336')
explode = (0, 0.1)
ax.pie(originals_counts, explode=explode, autopct='%.2f%%', labels= ['Others', 'Originals'],
       shadow=True, startangle=90,textprops={'color':"black", 'fontsize': 20}, colors =['blue','#F5E9F5'])

#### Chart - 11 - **(Correlation Heatmap)**

In [None]:
# Preparing data for heatmap
netflix_df['count'] = 1
# The ten most frequent values in the 'country' column
data = netflix_df['country'].value_counts().index[:10]
df_heatmap = netflix_df.loc[netflix_df['country'].isin(data)]
df_heatmap = pd.crosstab(df_heatmap['country'],df_heatmap['target_ages'],normalize = "index").T
df_heatmap

In [None]:
# Plotting the heatmap
fig, ax = plt.subplots(1, 1, figsize=(12, 12))

country_order2 = ['United States', 'India', 'United Kingdom', 'Canada', 'Japan', 'France', 'South Korea', 'Spain',
       'Mexico']

age_order = ['Adults', 'Teens', 'Older Kids', 'Kids']

sns.heatmap(df_heatmap.loc[age_order,country_order2],cmap="YlGnBu",square=True, linewidth=2.5,cbar=False,
            annot=True,fmt='1.0%',vmax=.6,vmin=0.05,ax=ax,annot_kws={"fontsize":12})
plt.show()

##### 2. What is/are the insight(s) found from the chart?

The US and UK have closely aligned target-age profiles on Netflix, which differ sharply from those of, for example, India or Japan.

Mexico and Spain also have similar distributions of content across the age groups.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1

*   H0: Movies rated for kids and older kids are at least two hours long.
*   H1: Movies rated for kids and older kids are not at least two hours long.

#### 2. Perform an appropriate statistical test.

In [None]:
movies

In [None]:
# Making a copy of netflix_df
df_hypothesis=netflix_df.copy()
# First rows of df_hypothesis
df_hypothesis.head()

In [None]:
# Keeping only the rows where 'type' is Movie
df_hypothesis = df_hypothesis[df_hypothesis["type"] == "Movie"]

In [None]:
#with respect to each ratings assigning it into group of categories                 
ratings_ages = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}

df_hypothesis['target_ages'] = df_hypothesis['rating'].replace(ratings_ages)
#let's see unique target ages 
df_hypothesis['target_ages'].unique()

In [None]:
# target_ages is a categorical feature with 4 classes
df_hypothesis['target_ages'] = pd.Categorical(df_hypothesis['target_ages'], categories=['Kids', 'Older Kids', 'Teens', 'Adults'])
# Extracting the numeric part of 'duration' and converting it from object to numeric
df_hypothesis['duration']= df_hypothesis['duration'].str.extract(r'(\d+)', expand=False)
df_hypothesis['duration'] = pd.to_numeric(df_hypothesis['duration'])
# First rows of df_hypothesis
df_hypothesis.head(3)

In [None]:
#group_by duration and target_ages                 
group_by_= df_hypothesis[['duration','target_ages']].groupby(by='target_ages')
#mean of group_by variable
group=group_by_.mean().reset_index()
group

In [None]:
# Duration values for each of the two groups
A = group_by_.get_group('Kids')['duration']
B = group_by_.get_group('Older Kids')['duration']
# Mean and standard deviation for the Kids and Older Kids groups
M1 = A.mean()
S1 = A.std()

M2 = B.mean()
S2 = B.std()

print('Mean for movies rated for Kids {} \n Mean for movies rated for Older Kids {}'.format(M1,M2))
print('Std for movies rated for Older Kids {} \n Std for movies rated for Kids {}'.format(S2,S1))

In [None]:
#import stats 
from scipy import stats
#length of groups and DOF
n1 = len(A)
n2= len(B)
print(n1,n2)

dof = n1+n2-2
print('dof',dof)

# Pooled variance: each group's variance is weighted by its own degrees of freedom
sp_2 = ((n1-1)*S1**2  + (n2-1)*S2**2) / dof
print('SP_2 =',sp_2)

sp = np.sqrt(sp_2)
print('SP',sp)

#tvalue
t_val = (M1-M2)/(sp * np.sqrt(1/n1 + 1/n2))
print('tvalue',t_val)

In [None]:
#t-distribution
stats.t.ppf(0.025,dof)

In [None]:
#t-distribution
stats.t.ppf(0.975,dof)

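Because the t-value falls outside the range between the two critical values, the null hypothesis is rejected at the 5% significance level.

As a result, we conclude that movies rated for kids and older kids are not at least two hours long. Note that this two-sample test compares the mean durations of the Kids and Older Kids groups with each other, so what it establishes directly is a difference between those two means; the two-hour claim itself should be read against the group means computed above.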
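
The pooled test above compares two groups with each other. A more direct check of a "at least two hours" claim would be a one-sample, one-sided t-test against 120 minutes. The sketch below illustrates that idea on synthetic durations generated with NumPy, not on the Netflix data; the mean of 95 and standard deviation of 20 are arbitrary assumptions, and the `alternative` argument of `scipy.stats.ttest_1samp` requires SciPy 1.6 or newer.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic durations in minutes (an assumption for illustration, NOT the Netflix data)
durations = rng.normal(loc=95, scale=20, size=200)

# H0: mean duration >= 120 minutes; H1: mean duration < 120 minutes
t_stat, p_value = stats.ttest_1samp(durations, popmean=120, alternative="less")
reject_h0 = p_value < 0.05

print(f"t = {t_stat:.2f}, p = {p_value:.3g}, reject H0: {reject_h0}")
```

On the real data, `durations` could be the `duration` column of the Kids and Older Kids movies in `df_hypothesis`, with missing values dropped first.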

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

*   H0: Titles with a duration of more than 90 minutes are not movies.
*   H1: Titles with a duration of more than 90 minutes are movies.

#### 2. Perform an appropriate statistical test.

In [None]:
# Making a copy of netflix_df
df_hypothesis=netflix_df.copy()
#head of df_hypothesis
df_hypothesis.head()

In [None]:
# Extracting the numeric part of 'duration' and converting it to a number
df_hypothesis['duration']= df_hypothesis['duration'].str.extract(r'(\d+)', expand=False)
df_hypothesis['duration'] = pd.to_numeric(df_hypothesis['duration'])

In [None]:
df_hypothesis['type'] = pd.Categorical(df_hypothesis['type'], categories=['Movie','TV Show'])
df_hypothesis.head(3)

In [None]:
#group_by duration and TYPE                 
group_by_= df_hypothesis[['duration','type']].groupby(by='type')
#mean of group_by variable
group=group_by_.mean().reset_index()
group

In [None]:
# Duration values for each group
# Note: 'duration' is measured in minutes for movies but in seasons for TV shows
A = group_by_.get_group('Movie')['duration']
B = group_by_.get_group('TV Show')['duration']
# Mean and standard deviation of each group
M1 = A.mean()
S1 = A.std()

M2 = B.mean()
S2 = B.std()

print('Mean (Movie) {} \n Mean (TV Show) {}'.format(M1,M2))
print('Std (Movie) {} \n Std (TV Show) {}'.format(S1,S2))

In [None]:
#import stats 
from scipy import stats
#length of groups and DOF
n1 = len(A)
n2= len(B)
print(n1,n2)

dof = n1+n2-2
print('dof',dof)

# Pooled variance: each group's variance is weighted by its own degrees of freedom
sp_2 = ((n1-1)*S1**2  + (n2-1)*S2**2) / dof
print('SP_2 =',sp_2)

sp = np.sqrt(sp_2)
print('SP',sp)

#tvalue
t_val = (M1-M2)/(sp * np.sqrt(1/n1 + 1/n2))
print('tvalue',t_val)

In [None]:
#t-distribution
stats.t.ppf(0.025,dof)

In [None]:
#t-distribution
stats.t.ppf(0.975,dof)

Because the t-value falls outside the range between the two critical values, the null hypothesis is rejected at the 5% significance level.

As a result, we conclude that titles with a duration of more than 90 minutes are movies.
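
The manual calculation above yields a t-value and critical values but no p-value. As a cross-check, `scipy.stats.ttest_ind` returns both the statistic and the p-value directly. The sketch below uses synthetic samples (an assumption, not the Netflix data) to show that the pooled-variance formula agrees with `ttest_ind(..., equal_var=True)`, and also runs Welch's variant (`equal_var=False`), which does not assume equal variances.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(100, 25, size=120)  # synthetic group A
b = rng.normal(90, 25, size=80)    # synthetic group B

# Manual pooled-variance t statistic
n1, n2 = len(a), len(b)
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

# Library equivalents: pooled (Student) and Welch versions
t_pooled, p_pooled = stats.ttest_ind(a, b, equal_var=True)
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)

print(f"manual t = {t_manual:.4f}, scipy pooled t = {t_pooled:.4f}, p = {p_pooled:.4g}")
print(f"Welch t = {t_welch:.4f}, p = {p_welch:.4g}")
```

Welch's version is often the safer default when the two groups have clearly different sizes or spreads.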

## ***6. Feature Engineering & Data Pre-processing***

### Feature Engineering

In [None]:
netflix_df.dtypes

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
nltk.download('punkt')

In [None]:
# Assign the result back, otherwise the conversion has no effect
netflix_df['description'] = netflix_df['description'].astype(str)

In [None]:
# Making all the words in the description lowercase
netflix_df['description'] = netflix_df['description'].str.lower()

In [None]:
def remove_punctuation(text):
    '''a function for removing punctuation'''
    import string
    # replacing the punctuations with no space, 
    # which in effect deletes the punctuation marks 
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)
# applying above function on text feature
netflix_df['description']= netflix_df['description'].apply(remove_punctuation)

In [None]:
netflix_df['description'][0:10]

In [None]:
# using nltk library to download stopwords
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
# A set gives fast membership checks
sw = set(stopwords.words('english'))

# Named remove_stopwords so that it does not shadow nltk's stopwords module
def remove_stopwords(text):
    '''a function for removing the stopwords'''
    text = [word for word in text.split() if word not in sw]
    # joining the list of words with space separator
    return " ".join(text)
# applying above function on text feature
netflix_df['description']=netflix_df['description'].apply(remove_stopwords)
# this is how a description looks after removing stopwords
netflix_df['description'].iloc[0]

In [None]:
# importing TfidVectorizer from sklearn library
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
#Applying Tfidf Vectorizer
tfidfmodel = TfidfVectorizer(max_features=5000)
X_tfidf = tfidfmodel.fit_transform(netflix_df['description'])
X_tfidf.shape

In [None]:
# convert X into array form for clustering
X = X_tfidf.toarray() 
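
The project summary names PCA as the remedy for the high dimensionality of the TF-IDF matrix. A minimal sketch of that step on a toy corpus is shown here; the toy descriptions, the 80% variance threshold, and the variable names are illustrative assumptions, not values taken from this notebook.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the cleaned description column (illustrative only)
toy_descriptions = [
    "detective hunts serial killer city",
    "police detective investigates murder case",
    "family comedy kids summer holiday",
    "kids cartoon animals funny adventure",
    "documentary nature wildlife ocean",
    "ocean wildlife documentary series",
]

# Dense TF-IDF matrix, built the same way as X above
toy_tfidf = TfidfVectorizer().fit_transform(toy_descriptions).toarray()

# Keep the smallest number of components that explains at least 80% of the
# variance. The threshold is an assumption; svd_solver='full' is required
# when n_components is given as a fraction.
pca = PCA(n_components=0.80, svd_solver="full", random_state=42)
toy_reduced = pca.fit_transform(toy_tfidf)

print(toy_tfidf.shape, "->", toy_reduced.shape)
print("explained variance kept:", pca.explained_variance_ratio_.sum())
```

On the real data the same call would take `X` as input, and the reduced array could be passed to the clustering models in place of `X`.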

## ***7. ML Model Implementation***

## **1. K-Means**

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
from sklearn.cluster import KMeans
wcss_list = []  # within-cluster sum of squares (WCSS) for each k
max_k = 5       # only k = 1..5 is tried, to keep the run time short

# fit K-Means for each k and record its inertia (WCSS)
for k in range(1, max_k + 1):
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss_list.append(kmeans.inertia_)

plt.plot(range(1, max_k + 1), wcss_list)
plt.title('The Elbow Method Graph')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.show()


In [None]:
from sklearn.metrics import silhouette_score
# silhouette score for k = 2..5
sill = []
for i in range(2, 6):
    model = KMeans(n_clusters=i, init='k-means++', random_state=51)
    # fit the model and get the cluster label of each row in one step
    y1 = model.fit_predict(X)
    score = silhouette_score(X, y1)
    sill.append(score)
    print('cluster: %d \t Silhouette: %0.4f' % (i, score))

In [None]:
# plot the silhouette score against the number of clusters (k = 2..5)
plt.plot(range(2, 6), sill, 'bs--')
plt.xticks(range(2, 6))
plt.grid()
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()

In [None]:
#training the K-means model on a dataset  
kmeans = KMeans(n_clusters= 5, init='k-means++', random_state= 42)  
y_predict= kmeans.fit_predict(X) 

Evaluation:

In [None]:
#Predict the clusters and evaluate the silhouette score

score = silhouette_score(X, y_predict)
print("Silhouette score is {}".format(score))

In [None]:
#davies_bouldin_score of our clusters 
from sklearn.metrics import davies_bouldin_score
davies_bouldin_score(X, y_predict)

In [None]:
# adding a separate column for the cluster label
netflix_df["cluster"] = y_predict

In [None]:
netflix_df['cluster'].value_counts()

In [None]:
fig, ax = plt.subplots(figsize=(15,6))
sns.countplot(x='cluster', hue='type',lw=5, data=netflix_df, ax=ax)

Cluster 0 has the highest number of data points.

In [None]:
#SCATTER PLOT FOR CLUSTERS
fig = px.scatter(netflix_df, y="description", x="cluster",color="cluster")
fig.update_traces(marker_size=100)
fig.show()

**Dendrogram**

In [None]:
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(8, 8))
plt.title('Visualising the data')
# build the Ward linkage tree and draw it as a dendrogram
dendrogram = shc.dendrogram(shc.linkage(X, method='ward'))
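
A dendrogram is usually read by cutting the tree at some height and counting the branches that the cut crosses. SciPy's `fcluster` performs that cut programmatically. The following is a sketch on synthetic points; the toy data and the choice of three clusters are assumptions made for illustration.

```python
import numpy as np
import scipy.cluster.hierarchy as shc

rng = np.random.default_rng(42)
# Three tight, well-separated groups of 10 points each (illustrative data)
toy_points = np.vstack(
    [rng.normal(loc=center, scale=0.1, size=(10, 2)) for center in (0.0, 5.0, 10.0)]
)

# Same linkage method as in the dendrogram cell above
toy_linkage = shc.linkage(toy_points, method="ward")

# Cut the tree so that at most 3 flat clusters remain; labels start at 1
toy_labels = shc.fcluster(toy_linkage, t=3, criterion="maxclust")

print("cluster sizes:", np.bincount(toy_labels)[1:])
# The largest merge distances hint at where a natural cut lies
print("last merge distances:", toy_linkage[-3:, 2])
```

A large jump between consecutive merge distances suggests cutting the tree just below that jump.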

### ML Model - 2

## **2. Agglomerative Clustering**

In [None]:
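# fitting agglomerative (hierarchical) clustering
from sklearn.cluster import AgglomerativeClustering
# Ward linkage always uses Euclidean distance, so no distance argument is needed
# (the `affinity` parameter was deprecated and later removed from scikit-learn)
aggh = AgglomerativeClustering(n_clusters=6, linkage='ward')
# fit the model once and get the cluster label of each row
y_hc = aggh.fit_predict(X)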

In [None]:
df_hierarchical =netflix_df.copy()
#creating a column where each row is assigned to their separate cluster
df_hierarchical['cluster'] = aggh.labels_
df_hierarchical.head()

In [None]:
#Silhouette Coefficient
print("Silhouette Coefficient: %0.3f"%silhouette_score(X,y_hc, metric='euclidean'))

In [None]:
#davies_bouldin_score of our clusters 
from sklearn.metrics import davies_bouldin_score
davies_bouldin_score(X, y_hc)
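
The two metrics point in opposite directions: the silhouette score ranges from -1 to 1 and higher is better, while the Davies-Bouldin score is non-negative and lower is better. A small sketch that puts both models side by side on synthetic data is given here; the blob data, the choice of three clusters, and the `results` dictionary are assumptions made for illustration and do not reproduce the scores of this notebook.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic data with three groups (illustrative only)
X_toy, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)

models = {
    "KMeans": KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42),
    "Agglomerative": AgglomerativeClustering(n_clusters=3, linkage="ward"),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X_toy)
    results[name] = {
        "silhouette": silhouette_score(X_toy, labels),          # higher is better
        "davies_bouldin": davies_bouldin_score(X_toy, labels),  # lower is better
    }

for name, scores in results.items():
    print(f"{name:15s} silhouette={scores['silhouette']:.3f} "
          f"davies_bouldin={scores['davies_bouldin']:.3f}")
```

Comparing models this way is only fair when both are evaluated on the same feature matrix.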

# **Conclusion**

*   Guided by the elbow method and the silhouette score, K-Means was fit with 5 clusters and agglomerative clustering with 6. K-Means groups the titles better than hierarchical clustering, as the evaluation metrics also indicate. In the K-Means result, cluster 0 has the highest number of data points, and the rest are evenly distributed across the other clusters.
*   Netflix has 5372 movies and 2398 TV shows, so there are more movies than TV shows on the platform.

*   TV-MA, the adult rating, is the most common rating for TV shows.

*   The highest number of movies was released in 2017 and 2018, and the highest number of TV shows was released in 2020.
    The number of movies on Netflix is growing significantly faster than the number of TV shows, which suggests that Netflix has focused more on expanding its movie content.
    We saw a huge increase in the number of movies and television episodes after 2015.
    There is a significant drop in the number of movies and television episodes produced after 2020.

*   Most content is added to Netflix between October and January.

*   Documentaries are the top genre on Netflix, followed by stand-up comedy and by dramas and international movies.
*   Kids' TV is the top TV show genre on Netflix.


*   Most movies have a duration between 50 and 150 minutes.


*   Most TV shows consist of a single season.


*   Movies rated NC-17 have the longest average duration, while movies rated TV-Y have the shortest average runtime.


*   The United States has the most content on Netflix, followed by India.

*   India has the highest number of movies on Netflix.
*   30% of the movies were released on Netflix; the other 70% were first released through other channels and added to Netflix later.

