# **Project Name**    - Netflix Project



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Name**            - Lekshana Priya R


# **Project Summary -**

Netflix is a popular streaming service that provides its subscribers with a wide range of movies, TV shows, documentaries, and original content to watch on demand. The company was founded in 1997 as a DVD rental service but pivoted to an online streaming model in 2007. Since then, Netflix has grown into one of the most popular streaming services globally, with over 200 million subscribers in more than 190 countries as of 2021.

Netflix's success is due in part to its innovative business model and emphasis on creating original content. The company invests heavily in producing its own movies and TV shows, which have won numerous awards and attracted high-profile talent. Netflix also uses sophisticated algorithms to recommend content to its users, based on their viewing history and preferences.


This project aimed to identify patterns in the content available on the platform and group them into clusters based on similarities in their genres, sub-genres, release year, and other features. The project utilized machine learning algorithms such as K-means clustering and Hierarchical Clustering to cluster the data effectively.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Netflix Content Dataset (2019)**

**Dataset Overview:**
This dataset contains information about TV shows and movies available on Netflix as of 2019. The data was collected from Flixable, a third-party Netflix search engine.


**Content Growth Trends:**

The number of TV shows on Netflix has nearly tripled since 2010.

The number of movies has decreased by more than 2000 titles over the same period.

**Content Distribution:**

This dataset provides insights into the types of content available in different countries.

It reveals shifts in Netflix's content strategy, including a growing focus on TV shows over movies.

**Project Goals:**
In this project, the primary objectives are to:

**Perform Exploratory Data Analysis (EDA):**

Understand the distribution, popularity, and trends in the Netflix content library.

Identify key factors influencing content availability.

**Analyze Regional Content Availability:**

Explore the variety of content offered in different countries.

Identify regional content preferences and trends.

**Evaluate Content Focus:**
Assess whether Netflix has increasingly focused on TV shows rather than movies in recent years.

**Content Clustering:**

Cluster similar content by matching text-based features to identify potential content categories.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import missingno as msno

import datetime as dt

from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

from  sklearn.model_selection import train_test_split

from sklearn.preprocessing import MinMaxScaler, StandardScaler

from sklearn.decomposition import PCA

from sklearn.cluster import KMeans, SpectralClustering, AgglomerativeClustering
from sklearn.metrics import silhouette_samples, silhouette_score
import scipy.cluster.hierarchy as sch

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import files
uploaded = files.upload()

In [None]:
# reading the dataset
import pandas as pd

# The filename must match the file uploaded in the previous cell
data = pd.read_csv('NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(data[data.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

In [None]:
# total null values present in the data
data.isnull().sum().sum()


In [None]:
# Visualizing the missing values
msno.matrix(data, figsize = (15,8), fontsize =(12))

### What did you know about your dataset?

Answer Here

## ***2. Understanding  Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe()

In [None]:
# description of all the features.
data.describe(include = 'all').T

### Variables Description

show_id :- Unique ID for every movie/TV show

type :- Identifier: Movie or TV Show

title :- Title of the movie/show

director :- Director of the movie/show

cast :- Actors involved in the movie/show

country :- Country of production

date_added :- Date the title was added to Netflix

release_year :- Original release year of the movie/show

rating :- Maturity rating of the movie/show

duration :- Total duration in minutes (movies) or number of seasons (TV shows)

listed_in :- Genre

description :- Summary description
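Because `duration` mixes two units (minutes for movies, seasons for TV shows), it can help to split each value into a number and a unit before any numeric analysis. The following is a minimal sketch, assuming values look like `'90 min'` or `'2 Seasons'`; the helper name `parse_duration` and the sample values are hypothetical and not part of the notebook.

```python
def parse_duration(value: str) -> tuple[int, str]:
    """Split a duration string such as '90 min' or '2 Seasons' into (amount, unit)."""
    amount, unit = value.strip().split(" ", 1)
    # Normalise the unit so 'Season' and 'Seasons' compare equal;
    # anything else is treated as minutes (an assumption of this sketch).
    unit = "season" if unit.lower().startswith("season") else "min"
    return int(amount), unit


samples = ["90 min", "1 Season", "4 Seasons"]  # illustrative values only
parsed = [parse_duration(s) for s in samples]
print(parsed)
```

Keeping the unit next to the number avoids accidentally comparing minutes with season counts later on.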

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
variables_data = data.columns.to_list()
for item in variables_data:
  print('The Unique Values of', item, 'are:', data[item].unique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# copy of the data.
df = data.copy()

In [None]:
# handling null values.
df.isnull().sum()

In [None]:
# filling the null values of director and cast.
df.fillna({'director':'name absent', 'cast':'name missing'}, inplace = True)

# filling the null values of country and rating with their mode.
# (assigning the result back avoids chained in-place modification warnings)
df['country'] = df['country'].fillna(df['country'].mode()[0])
df['rating'] = df['rating'].fillna(df['rating'].mode()[0])

# dropping the observations with null values in date_added.
df = df[df['date_added'].notna()]

In [None]:
# checking the null value counts again.
df.isnull().sum().sum()

In [None]:
# checking the data type of data.
df.date_added.dtype

In [None]:
# Remove leading/trailing whitespace first
df['date_added'] = df['date_added'].str.strip()

# Convert to datetime, allowing mixed formats
df['date_added'] = pd.to_datetime(df['date_added'], format='mixed', errors='coerce')

# Extract day, month, year
df['added_day'] = df['date_added'].dt.day
df['added_month'] = df['date_added'].dt.month
df['added_year'] = df['date_added'].dt.year

In [None]:
# columns
df.columns

In [None]:
# rating value counts.
df['rating'].value_counts()

In [None]:
# converting rating into understandable format.

rename_rating = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Kids',
    'TV-Y7': 'Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}

df['rating'] =df['rating'].replace(to_replace = rename_rating)
df['rating'].unique()
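One thing worth checking after a mapping like this is whether any rating label was left out of the dictionary. `Series.replace` leaves unmatched values unchanged, so they are easy to miss, whereas `Series.map` turns them into `NaN`. The sketch below uses a small subset of the mapping and made-up sample values; `rating_groups` and the `'66 min'` entry are illustrative assumptions.

```python
import pandas as pd

# Subset of the rating mapping, for illustration only
rating_groups = {
    "TV-MA": "Adults",
    "R": "Adults",
    "TV-14": "Teens",
    "PG": "Older Kids",
    "TV-Y": "Kids",
}

# The last value is a deliberately unmapped label
ratings = pd.Series(["TV-MA", "PG", "TV-Y", "66 min"])

# map() returns NaN for labels missing from the dictionary
mapped = ratings.map(rating_groups)
unmapped = ratings[mapped.isna()].unique().tolist()
print(unmapped)
```

If `unmapped` is empty, every label in the column is covered by the dictionary.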

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
from matplotlib.pyplot import figure
import matplotlib.pyplot as plt
import seaborn as sns

# Set a custom color palette
sns.set_palette(["#FF6F61", "#5D9CEC"])

# Plot the type counts
plt.figure(figsize=(7, 6))
sns.countplot(x='type', data=df)

# Add title and labels
plt.title("Distribution of Content Types on Netflix", fontsize=16, fontweight='bold')
plt.xlabel("Content Type", fontsize=14)
plt.ylabel("Count", fontsize=14)

# Show the plot
plt.show()

I chose this chart to visualize the distribution of content types on Netflix because it gives a clear overview of the relative proportions of Movies and TV Shows in the platform's catalog. The custom color palette distinguishes the two categories at a glance. This distribution is a useful baseline for understanding Netflix's content strategy, including its increasing focus on original TV Shows in recent years. A single count plot does not show change over time, so shifts in the mix are examined in the later charts that break the counts down by year.

### **Top ten countries based on total production.**

In [None]:
# Count values and convert to DataFrame
top_countries = df['country'].value_counts().head(10).reset_index()
top_countries.columns = ['Country Name', 'Total number of production']  # Rename properly

# Plot with custom colors
plt.figure(figsize=(15, 7))
sns.barplot(data=top_countries, x='Country Name', y='Total number of production', palette=["#5DADE2", "#F39C12", "#58D68D", "#AF7AC5", "#EC7063", "#F4D03F", "#48C9B0", "#AAB7B8", "#D68910", "#C0392B"])
plt.title('Top 10 Countries by TV Shows and Movies Production', fontsize=18, fontweight='bold')
plt.xlabel('Country', fontsize=14)
plt.ylabel('Total Productions', fontsize=14)
plt.xticks(rotation=45)
plt.show()


This bar chart visualizes the Top 10 Countries by TV Shows and Movies Production on Netflix. The customized color palette was chosen to differentiate each country distinctly, highlighting the diverse content contributions from various regions. The chart effectively captures the global reach of Netflix's content library, emphasizing the dominant production hubs like the United States, India, and United Kingdom. This visualization provides valuable insights into the geographic distribution of Netflix's content, reflecting the company's strategy to expand its global footprint and cater to regional audiences.









**Total number of titles by added_year, added_month, added_day, release_year, rating, and type.**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Variables to visualize
var = ['added_year', 'added_month', 'added_day', 'release_year', 'rating', 'type']

# Custom color palette
custom_palette = sns.color_palette(["#FF6F61", "#5D9CEC", "#48C9B0", "#F4D03F", "#AF7AC5", "#EC7063"])

# Plot each variable with the custom color palette
for i, col in enumerate(var):
    plt.figure(figsize=(10, 6))
    # a single color per chart is passed via `color`; `palette` is meant for use with `hue`
    sns.countplot(x=col, data=df, color=custom_palette[i % len(custom_palette)])
    plt.title(f'Distribution of {col.capitalize()}', fontsize=18, fontweight='bold')
    plt.xlabel(col.capitalize(), fontsize=14)
    plt.ylabel('Count', fontsize=14)
    plt.xticks(rotation=45)
    plt.show()

Year of Addition:

The majority of Netflix's content was added between 2018 and 2020, with 2019 marking the peak year for content additions.

The drop in 2021 may reflect incomplete data for that year as well as the global COVID-19 pandemic, which disrupted production pipelines worldwide.

Month of Addition:

December stands out as the most active month for content additions, likely due to the holiday season when viewership typically spikes.

Other high-activity months include October, November, and January, aligning with major holiday seasons and year-end promotions.

Day of Addition:

Content is most frequently added on the 1st and 15th of each month, possibly reflecting standard content update cycles and contractual release schedules.

Year of Release:

The volume of content released each year saw a steady rise until 2021, where it faced a notable decline, again likely due to pandemic-related restrictions and production delays.

Content Ratings:

The majority of Netflix's catalog is adult-oriented, with fewer titles targeted at kids or family audiences, reflecting its focus on mainstream and mature content.

Content Type:

Movies dominate Netflix's library, with approximately 5000 titles, while TV shows account for around 2500 titles.



**Directors**

In [None]:
# count of unique director
df['director'].nunique()

In [None]:
# top directors.
df['director'].value_counts()

In [None]:
# visualization of the top directors.
# the most frequent value is the 'name absent' placeholder filled in earlier, so it is skipped.
plt.figure(figsize=(12,8))
sns.countplot(y=df['director'], data = df, order = df['director'].value_counts().index[1:15])

Raúl Campos, Jan Suter, Marcus Raboy, Jay Karas, and Cathy Garcia-Molina are among the top directors. More insight would have been possible if fewer values in the director column were missing.

**Titles**

In [None]:
# most frequent words used in titles.
#Importing wordcloud
from wordcloud import WordCloud, STOPWORDS

#Most occurred word in title
plt.subplots(figsize=(20,10))
# named title_stopwords so it does not shadow nltk's `stopwords` imported earlier
title_stopwords = set(STOPWORDS)
text = " ".join(df.title)
wordcloud = WordCloud(stopwords=title_stopwords,background_color='white',width=1000,height=800).generate(text)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

Christmas, love, world, story, life, and girl are among the most frequent words in the titles.

### **Genres**

In [None]:
# value counts.
df['listed_in'].value_counts()

In [None]:
# visualization of top ten genre combinations.
plt.figure(figsize = (12,8))
sns.countplot(y=df['listed_in'], data=df, order = df['listed_in'].value_counts().index[:10])

##  **Genres**

In [None]:
# separating the genres of every title and collecting them in a new genres list.
genres = df['listed_in']
genre_list=[]
for i in genres:
  i=i.split(",")
  for j in i:
    # strip surrounding whitespace so ' Dramas' and 'Dramas' count as one genre
    genre_list.append(j.strip())

In [None]:
# converting genre list into dataframe.
gen_df = pd.DataFrame(genre_list)
gen_df.rename(columns={0:'genres'}, inplace = True)
gen_df.value_counts().head(15)

In [None]:
# Visualization of top 15 genres on netflix.
plt.figure(figsize=(12,8))
sns.countplot(y=gen_df['genres'], data=gen_df, order = gen_df['genres'].value_counts().index[:15])

As the count plot above shows, International Movies, Dramas, Comedies, and Action & Adventure are among the top genres.
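The genre counting above can also be written with pandas string methods instead of explicit loops. This is a minimal sketch on made-up sample rows, not the notebook's data; it shows why trimming whitespace after the split matters, since otherwise `' Dramas'` and `'Dramas'` would be counted as different genres.

```python
import pandas as pd

# Made-up sample of the `listed_in` column
listed_in = pd.Series([
    "Dramas, International Movies",
    "Comedies, Dramas",
    "Dramas",
])

genre_counts = (
    listed_in.str.split(",")   # one list of genres per title
    .explode()                 # one row per (title, genre) pair
    .str.strip()               # remove the space left after each comma
    .value_counts()
)
print(genre_counts)
```

On the real column, the same chain would be applied to `df['listed_in']`.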

## **Cast**

In [None]:
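# extracting individual cast members from the cast column.
crew_list=[]
for i in df['cast']:
  i=i.split(',')
  for j in i:
    # strip surrounding whitespace so the same name is not counted under two spellings
    crew_list.append(j.strip())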

In [None]:
# converting the list into dataframe
crew_df = pd.DataFrame(crew_list)
crew_df.rename(columns = {0:'actr_actrs'}, inplace = True)
crew_df

In [None]:
# top 15 actors or actresses on netflix.
# the most frequent value is the 'name missing' placeholder filled in earlier, so it is skipped.
plt.figure(figsize=(12,8))
sns.countplot(y= crew_df['actr_actrs'], data = crew_df, order = crew_df['actr_actrs'].value_counts().index[1:16])

Anupam Kher, Takahiro Sakurai, Shah Rukh Khan, and Boman Irani are some of the top actors on Netflix. The cast column also contained some null values; more insight would have been possible if fewer of them were missing.

**Movies Duration**

In [None]:
# separating the movies data from type column and converting the duration (e.g. '90 min') to integer minutes.
movie_df = df[df['type']=='Movie']['duration'].apply(lambda x: int(x.split(" ")[0])).reset_index()
movie_df

In [None]:
# Visualization of movies duration distribution.
# histplot with kde=True replaces the deprecated distplot
plt.figure(figsize=(8,5))
sns.histplot(movie_df['duration'], kde=True, color = 'red')

Most movies on Netflix run between 50 and 150 minutes. A few run as long as about 300 minutes.

**TV Shows Duration**

In [None]:
# tv show df.
tv_df = df[df['type']=='TV Show']
tv_df['duration'].value_counts()
# df.loc[df['type']=='TV Show']['duration'].value_counts()

In [None]:
# visualization of tv show duration.
plt.figure(figsize=(10,8))
sns.countplot(y = tv_df['duration'])

Most TV shows have only 1 or 2 seasons. Only a few TV shows have more than 2 seasons.

**Counts with respect to content type**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Variables to visualize
var = ['added_year', 'added_month', 'added_day', 'release_year', 'rating', 'type']

# Set the figure size
plt.figure(figsize=(18, 30))

# Custom color palette for content types
custom_palette = ["#5DADE2", "#EC7063"]

# Plot each variable with the custom color palette
for idx, col in enumerate(var):
    plt.subplot(4, 2, idx + 1)
    sns.countplot(x=col, hue='type', data=df, palette=custom_palette)
    plt.title(f'Distribution of {col.capitalize()} by Content Type', fontsize=18, fontweight='bold')
    plt.xlabel(col.capitalize(), fontsize=14)
    plt.ylabel('Count', fontsize=14)
    plt.xticks(rotation=45)

# Adjust layout to avoid overlap
plt.tight_layout()
plt.show()



Insights Found.

Netflix has been increasingly focusing on TV shows rather than movies in recent years.

Most content is added in December and January, and movies make up the larger share of those additions.

The largest number of titles is added on the first day of the month.

The amount of content added to Netflix has increased in recent years.

Most movies and TV shows are for adults; very few titles are for kids.

### **Type content available in different countries.**

In [None]:
# total production of country wrt type of the content.
plt.figure(figsize=(15,8))
sns.countplot(x = df['country'], hue='type', data = df, order = df['country'].value_counts().index[:15])

The US has the most content for both movies and TV shows. India comes second for movies, while the UK comes second for TV shows, followed by South Korea and Canada. Other countries have far fewer titles on Netflix.

**Feature Engineering**

There are no outliers in the data, so no outlier handling is needed here. The data was already cleaned in the data wrangling section: the null values were handled, the ratings were converted into an understandable format, and the day, month, and year were extracted from the date_added column.

**Textual Data Preprocessing**

In [None]:
df.columns

In [None]:
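# combining the textual data to proceed with NLP.
# columns are joined with spaces so that words from adjacent columns do not fuse together.
df['text'] = df['cast']+' '+df['listed_in']+' '+df['rating']+' '+df['description']+' '+df['director']+' '+df['country']
df['text'][0]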

**Function to remove punctuation and stopwords.**

In [None]:
def remove_punctuation(text):
    '''a function for removing punctuation'''
    import string
    # replacing the punctuations with no space,
    # which in effect deletes the punctuation marks
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)

In [None]:
# applying function to remove punctuation.
df['text']=df['text'].apply(remove_punctuation)
df['text'][0]

**Removing stopwords**

In [None]:
# downloading stop words.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

In [None]:
# assigning the stopwords to a variable
import numpy as np
stop_words = stopwords.words('english')

# displaying stopwords
np.array(stop_words)

In [None]:
# function to remove stopwords from a given text
def remove_stopwords(text):
    '''a function for removing the stopword'''
    # removing the stop words and lowercasing the selected words
    text = [word.lower() for word in text.split() if word.lower() not in stop_words]
    # joining the list of words with space separator
    return " ".join(text)

In [None]:
# applying the function on text data.
df['text'] = df['text'].apply(remove_stopwords)
df['text'][0]

**Stemming operations**

Text Normalization

In [None]:
# function for stemming.
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")

def stemming(text):
    '''a function which stems each word in the given text'''
    text = [stemmer.stem(word) for word in text.split()]
    return " ".join(text)

In [None]:
# applying the function on text data.
df['text'] = df['text'].apply(stemming)
df['text'][0]

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# tfidf object initialization
tfidf = TfidfVectorizer(stop_words='english', lowercase=False, max_features=10000)  # max features = 10000 to prevent system from crashing

# fitting the vectorizer using the text data
tfidf.fit(df['text'])

# collecting the vocabulary items used in the vectorizer
dictionary = tfidf.vocabulary_.items()

In [None]:
# converting vector into array form for clustering
X = tfidf.transform(df['text']).toarray()

# summarize encoded vector
print(X)
print(f'shape of the vector : {X.shape}')
print(f'datatype : {type(X)}')

**Applying PCA to reduce dimension**

In [None]:
from sklearn.decomposition import PCA

# Fit PCA with all components kept so that the cumulative explained-variance
# curve plotted in the next cell spans the full range of components.
# X is used unscaled, matching the TruncatedSVD step further below.
# Note: this is slow and memory-hungry on the full TF-IDF matrix.
pca = PCA(random_state=42)
pca.fit(X)

# Check the explained variance ratio of the leading components
print(f"Explained variance ratio (first 10): {pca.explained_variance_ratio_[:10]}")

In [None]:
# Calculate the explained variance ratio for each component
explained_variance_ratio = pca.explained_variance_ratio_

# Plot the explained variance ratio
plt.figure(figsize=(12,6))
plt.plot(np.cumsum(explained_variance_ratio))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Cumulative Explained Variance Ratio vs Number of Components')
plt.axhline(y= 0.9, color='red', linestyle='--')
plt.axvline(x= 4000, color='green', linestyle='--')
plt.show()

The graph above shows that about 4000 components cover 90 percent of the variance. Approximately 7000 components would capture 100% of the variance, but that would increase the model complexity. With this information, I am going to reduce the dimension.
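Instead of reading the threshold off the plot, the number of components needed for a given share of variance can be computed from the cumulative explained-variance ratio. The sketch below uses a small synthetic matrix as a stand-in for the TF-IDF features; `X_demo` and the 0.90 target are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic stand-in: 200 samples, 20 features with decreasing spread
X_demo = rng.normal(size=(200, 20)) * np.linspace(5.0, 0.5, 20)

# Keep all components so the cumulative curve is complete
pca_demo = PCA(random_state=42).fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 0.90
n_components_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_components_90, round(float(cumulative[n_components_90 - 1]), 3))
```

scikit-learn's `PCA` also accepts a float `n_components` between 0 and 1 to select the component count by explained variance, which may be a convenient alternative.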

In [None]:
X.shape

In [None]:
from sklearn.decomposition import TruncatedSVD

# Use TruncatedSVD for sparse data
svd = TruncatedSVD(n_components=400, random_state=42)  # Try 400 instead of 4000 for faster results
X_reduced = svd.fit_transform(X)

print("New shape:", X_reduced.shape)

In [None]:
# X_reduced was computed with TruncatedSVD in the previous cell;
# work on a copy so the reduced matrix itself stays untouched
net_data = X_reduced.copy()

# **ML Model**

# **K-Means Clustering**

**Elbow Method**

In [None]:
# finding the k value through elbow method.
k_value = range(2,12)
ssd_value = []
for k in k_value:
  kmeans = KMeans(n_clusters=k, init = 'k-means++', random_state = 42)
  kmeans.fit(net_data)
  ssd_value.append(kmeans.inertia_)

In [None]:
# plotting the elbow curve.
plt.figure(figsize=(12,8))
plt.plot(k_value, ssd_value, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Sum of squared distance')
plt.title('Elbow Method')
plt.show()

The curve bends at about 6 clusters: beyond that point, the within-cluster sum of squares (WSS) decreases at a slower rate. So according to the elbow curve, the optimal value for k is 6. Let's check the silhouette score.

**Silhouette Score**
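The quantity plotted in the elbow curve is the K-means inertia: the sum of squared distances from each point to its assigned cluster centre. The sketch below recomputes it by hand on synthetic blobs to show what `inertia_` measures; the data from `make_blobs` and the range of k are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four groups, standing in for the reduced features
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo)
    inertias[k] = model.inertia_

# Recompute the inertia of the last fitted model (k=8) by hand:
# squared distance of every point to the centre of its own cluster
manual = ((X_demo - model.cluster_centers_[model.labels_]) ** 2).sum()
print(inertias)
print(manual)
```

Because adding clusters generally lowers the inertia, the elbow is read as the point where further clusters stop giving a large reduction.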

In [None]:
# range for k values.
n_cluster_range = range(2,12)

# empty lists to store silhouette scores and cluster labels
silhouette_avg_score = []
cluster_label_list = []

# running a loop to find the score for each k value.
for k in n_cluster_range:

  # initializing the instance with the current number of clusters
  kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)

  # predicting the cluster labels
  cluster_labels = kmeans.fit_predict(net_data)

  # cluster centres.
  centroids = kmeans.cluster_centers_

  # storing the cluster labels in the cluster label list
  cluster_label_list.append(cluster_labels)

  # finding silhouette score and storing them into silhouette average score list.
  silhouette_avg = silhouette_score(net_data, cluster_labels)
  silhouette_avg_score.append(silhouette_avg)

  # Plotting the data points with cluster labels
  plt.figure(figsize = (12,6))
  plt.scatter(net_data[:, 0], net_data[:, 1], c=cluster_labels)
  plt.title(f'Data Points with {k} Clusters')
  plt.show()


As we can see here, the data points are reasonably well separated with 6 clusters. The silhouette score visualization makes this clearer.

Visualization of silhouette score with different value of k.

In [None]:
# Plotting the silhouette scores as a function of number of clusters
plt.figure(figsize=(12,8))
plt.plot(n_cluster_range, silhouette_avg_score, marker='o')

# Add labels to the plot
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores for Different Number of Clusters')

# Show the plot
plt.show()

According to the line plot above, 6 clusters give a good silhouette score compared with the other numbers of clusters.
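Selecting k from silhouette scores can also be done programmatically rather than by reading the plot. This is a minimal sketch on synthetic, well-separated blobs; the data, the range of k, and the names `scores` and `best_k` are assumptions for illustration, not the notebook's results.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three clearly separated synthetic groups
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    # n_clusters must be set from the loop variable, otherwise every
    # iteration would fit the same default model
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# The k with the highest average silhouette score
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On real text features the scores are usually much closer together, so the elbow curve and the silhouette plot are best read side by side.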





**Final KMeans model.**

In [None]:
# final model.
kmeans= KMeans(n_clusters=6, max_iter=10000, tol = 1e-4, n_init = 1, random_state= 42)
# fit_predict fits the model and returns the cluster labels in one step
label = kmeans.fit_predict(net_data)

In [None]:
# model evaluation.
score = silhouette_score(net_data, label)
print("Silhouette score is {}".format(score))

In [None]:
# creating a new column named cluster with the cluster numbers.
df['cluster'] = kmeans.labels_

In [None]:
# first five rows.
df.head()

In [None]:
# value count of different clusters.
df['cluster'].value_counts()

In [None]:
# Size of clusters formed
plt.figure(figsize=(10,7))
sns.countplot(x =df['cluster'], data = df)

Clusters 5 and 0 have the most data points, while clusters 1 and 3 have fewer than 1000 data points each. Cluster 2 has approximately 1200 data points.

In [None]:
# plotting for cluster in each type
plt.figure(figsize=(10,6))
sns.countplot(x=df['type'],palette="BuPu",hue=df['cluster'])
plt.title("cluster in each type")
plt.show()


In every cluster, most of the content is movies.
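Since movies make up roughly two thirds of the catalog, raw counts per cluster will tend to favour movies everywhere. Looking at the share of each type within a cluster can make differences between clusters easier to see. The sketch below uses a tiny made-up table; the `demo` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up cluster assignments for six titles
demo = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "cluster": [0, 0, 0, 1, 1, 1],
})

# normalize="index" turns each row (cluster) into proportions that sum to 1
share = pd.crosstab(demo["cluster"], demo["type"], normalize="index")
print(share)
```

On the notebook's data, the same call would use `df['cluster']` and `df['type']`.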

**Dendrogram**




In [None]:
# Using the dendrogram to find the optimal number of clusters
import scipy.cluster.hierarchy as sch
plt.figure(figsize=(13,8))
dendrogram = sch.dendrogram(sch.linkage(net_data, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Content')
plt.ylabel('Euclidean Distances')
plt.show()

As per the dendrogram, cutting across the tallest vertical lines, at a Euclidean distance of about 5, gives 4 clusters.
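Cutting a dendrogram can also be done in code with SciPy's `fcluster`, either by asking for a fixed number of clusters or by cutting at a chosen height. This is a minimal sketch on four tight synthetic groups; the data and the cut height of 10.0 are assumptions for illustration and do not correspond to the notebook's dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Four tight synthetic groups of 25 points each
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
X_demo = np.vstack([c + rng.normal(scale=0.3, size=(25, 2)) for c in centers])

# Ward linkage, the same method used for the dendrogram
Z = linkage(X_demo, method="ward")

# Option 1: request a fixed number of clusters
labels_by_count = fcluster(Z, t=4, criterion="maxclust")

# Option 2: cut the tree at a chosen height (hypothetical value)
labels_by_height = fcluster(Z, t=10.0, criterion="distance")

print(np.unique(labels_by_count), np.unique(labels_by_height))
```

Both options return one label per sample, so the result can be compared directly with the labels from `AgglomerativeClustering`.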

**Hierarchical Clustering**

**Agglomerative Clustering**

In [None]:
from sklearn.cluster import AgglomerativeClustering

# Fitting hierarchical clustering
hc = AgglomerativeClustering(n_clusters=4, metric='euclidean', linkage='ward')
y_hc = hc.fit_predict(net_data)

In [None]:
# Visualizing the clusters (first two dimensions only)
plt.figure(figsize=(13,8))
cluster_colors = ['red', 'blue', 'green', 'magenta']
for cluster_id, color in enumerate(cluster_colors):
    mask = y_hc == cluster_id
    plt.scatter(net_data[mask, 0], net_data[mask, 1], s = 100, c = color, label = f'Cluster {cluster_id + 1}')

plt.title('Clusters of Contents')

plt.legend()
plt.show()

# **Conclusion**

This project provided a comprehensive analysis of Netflix's content library, revealing key trends and patterns in content addition, release, and types. The majority of Netflix's content was added between 2018 and 2020, with a peak in 2019, coinciding with the platform's growing popularity. However, content additions dropped significantly in 2021, which may reflect incomplete data for that year as well as the disruptions caused by the COVID-19 pandemic. Seasonal trends were also evident, with the highest number of content additions occurring in December, likely due to the holiday season.


The analysis showed that Netflix's content release volume increased steadily until 2021, after which it declined, reflecting the industry's response to the global pandemic. In terms of content types, Netflix's catalog is predominantly adult-oriented, with movies outnumbering TV shows. This reflects the platform's focus on catering to a broad, global audience with diverse content formats.


Additionally, the analysis highlighted the top-producing countries, demonstrating Netflix's global reach and investment in international markets. For the clustering itself, K-means was fitted with 6 clusters, chosen from the elbow curve and silhouette scores, and agglomerative clustering with 4 clusters, chosen from the dendrogram. Overall, these insights into Netflix's content strategy underscore how the platform has adapted to evolving market conditions and the changing dynamics of global production.