# **Netflix Content Analysis and Recommendation using Unsupervised Machine Learning**



##### **Project Type**    - EDA/Unsupervised
##### **Contribution**    - Individual
##### **Team Member**    - P.Kavya


# **Project Summary -**

This project aims to analyze Netflix’s extensive content library using data analytics and unsupervised machine learning techniques. The dataset includes information about movies and TV shows such as content type, country of production, genre, release year, rating, and duration. The primary goal of the project is to understand content trends, identify hidden patterns, and segment similar content to support business decision-making.

The project began with Exploratory Data Analysis (EDA) to examine the structure, quality, and distribution of the data. During EDA, missing values were handled, date features were processed, and visualizations were created to analyze trends such as the growth of content over the years, popular genres, dominant content-producing countries, and audience ratings. These insights helped in understanding Netflix’s focus on modern, diverse, and global content.

Since there was no target variable available in the dataset, the problem was treated as an unsupervised learning problem. Relevant features were selected and preprocessed using encoding and scaling techniques. K-Means clustering was then applied to group similar Netflix titles into distinct clusters. Each cluster represented a specific type of content pattern, such as recent TV shows, genre-focused content, or region-specific productions.

The outcomes of this project demonstrate how data-driven approaches can help Netflix improve content segmentation, enhance recommendation systems, and make informed decisions about future content acquisition. Overall, the project successfully combines EDA and machine learning to generate actionable business insights from real-world streaming data.

# **GitHub Link -**

**https://github.com/kavyareddy0531/Netflix-Project**





# **Problem Statement**


**Netflix has a large collection of movies and TV shows with no labeled target data. The objective of this project is to analyze the content and group similar titles using unsupervised machine learning techniques in order to identify patterns and support recommendation and content strategy decisions.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

### Dataset Loading

In [None]:
# Load Dataset
url = "https://drive.google.com/uc?id=1xJGllnE12mAggLuRo8b0oNSshUlG8GvF"

try:
    df = pd.read_csv(url)
    print("Dataset loaded successfully")
except Exception as e:
    print("Error loading dataset:", e)

df.head()

### Dataset First View

In [None]:
# Dataset First Look
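# Print the shape explicitly; a bare expression is only displayed when it is the last line of a cell
print("Shape:", df.shape)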
df.info()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns = df.shape
print("Number of Rows:", rows)
print("Number of Columns:", columns)

### Dataset Information

In [None]:
# Dataset Info
df.describe(include='all')

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print("Number of duplicate rows:", duplicate_count)
df = df.drop_duplicates()
print("Duplicates removed. Updated shape:", df.shape)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
# Percentage of Missing Values
(df.isnull().sum() / len(df)) * 100

### What did you know about your dataset?

1. Dataset contains information about Netflix Movies and TV Shows.
2. Some columns like director, cast, and country have missing values.
3. No target variable is present (unsupervised learning problem).
4. Dataset contains both categorical and numerical features.
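
The missing-value counts and percentages computed above can be combined into one table. The following is a minimal sketch on a made-up toy frame; the helper name `missing_summary` and the sample values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Netflix data; the values are made up for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["India", "United States", np.nan, "Japan"],
    "release_year": [2015, 2018, 2020, 2019],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns with gaps."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )


summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the real dataset would give the same kind of table for columns such as director, cast, and country.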


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.dtypes

### Variables Description

In [None]:
# Variables Description
variable_description = {
    "show_id": "Unique identifier for each title",
    "type": "Type of content (Movie or TV Show)",
    "title": "Name of the movie or TV show",
    "director": "Director of the content",
    "cast": "Main cast members",
    "country": "Country where the content was produced",
    "date_added": "Date when content was added to Netflix",
    "release_year": "Year the content was released",
    "rating": "Audience rating",
    "duration": "Duration of the movie or TV show",
    "listed_in": "Genre(s) of the content",
    "description": "Brief summary of the content",
}

# Print each column name with its description
for key, value in variable_description.items():
    print(f"{key}: {value}")

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable
df.select_dtypes(include='object').nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Handle missing values (assign the result back; calling fillna with inplace=True
# on a selected column is deprecated in recent pandas versions)
df['director'] = df['director'].fillna('Unknown')
df['cast'] = df['cast'].fillna('Unknown')
df['country'] = df['country'].fillna('Unknown')
df['rating'] = df['rating'].fillna('Not Rated')

# Convert date_added to datetime
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')

# Feature engineering
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month

# Remove duplicates
df.drop_duplicates(inplace=True)

# Final check
df.isnull().sum()

### What all manipulations have you done and insights you found?

Data Wrangling Summary:
1. Handled missing values in categorical columns.
2. Converted date_added column into datetime format.
3. Extracted year and month from date_added for trend analysis.
4. Removed duplicate records to ensure data quality.

Insights:
- Missing values were mostly present in director, cast, and country.
- Majority of content was added after 2015.
- Dataset is now clean and suitable for EDA and ML modeling.
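
The date handling described above can be illustrated on a few sample strings. This is a small sketch with invented values; it only shows how `errors='coerce'` turns unparseable entries into `NaT` and how the year and month are then extracted.

```python
import pandas as pd

# Invented sample values; the third entry is deliberately not a date
dates = pd.Series(["September 9, 2019", "January 1, 2016", "not a date", None])

# errors="coerce" converts anything that cannot be parsed into NaT
parsed = pd.to_datetime(dates, errors="coerce")

toy = pd.DataFrame({"date_added": parsed})
toy["year_added"] = toy["date_added"].dt.year    # NaN where the date is NaT
toy["month_added"] = toy["date_added"].dt.month

print(toy)
print("Unparseable rows:", int(toy["date_added"].isna().sum()))
```

Rows whose date could not be parsed keep missing `year_added` and `month_added` values, so they need to be filtered or imputed before trend analysis.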

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure()
df.select_dtypes(include=np.number).iloc[:,0].hist()
plt.xlabel("Values")
plt.ylabel("Frequency")
plt.title("Distribution of First Numerical Variable")
plt.show()

##### 1. Why did you pick the specific chart?

Histogram is used to understand the distribution and spread of numerical data.

##### 2. What is/are the insight(s) found from the chart?

It shows whether the data is skewed, normally distributed, or contains outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding distribution helps in choosing correct ML models and preprocessing steps.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure()
plt.boxplot(df.select_dtypes(include=np.number).iloc[:,0])
plt.title("Boxplot of Numerical Variable")
plt.show()

##### 1. Why did you pick the specific chart?

To detect outliers clearly.

##### 2. What is/are the insight(s) found from the chart?

Presence of extreme values.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Outliers may negatively affect model accuracy if not handled.

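#### Chart - 3 Count Plot for Categorical Variable

In [None]:
# Chart - 3 visualization code
# Plot the content type. The first object column (show_id) is a unique identifier,
# so its value counts would produce one bar per title.
df['type'].value_counts().plot(kind='bar')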
plt.xlabel("Category")
plt.ylabel("Count")
plt.title("Category Distribution")
plt.show()

##### 1. Why did you pick the specific chart?

Best for comparing category frequencies.

##### 2. What is/are the insight(s) found from the chart?

Identifies dominant categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps in segmentation and decision-making.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
num_cols = df.select_dtypes(include=np.number).columns
plt.scatter(df[num_cols[0]], df[num_cols[1]])
plt.xlabel(num_cols[0])
plt.ylabel(num_cols[1])
plt.title("Scatter Plot Between Two Numerical Variables")
plt.show()

##### 1. Why did you pick the specific chart?

To analyze relationship between two numeric variables.


##### 2. What is/are the insight(s) found from the chart?

Shows correlation or trend.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps feature selection.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Group by content type; grouping by the first object column (show_id) would give one group per title
df.groupby('type')[num_cols[0]].mean().plot(kind='bar')
plt.ylabel("Mean Value")
plt.title("Mean Comparison Across Categories")
plt.show()

##### 1. Why did you pick the specific chart?

To compare averages across groups.


##### 2. What is/are the insight(s) found from the chart?

Some categories perform better.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps in targeting high-value groups.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
df[num_cols[0]].plot()
plt.ylabel("Value")
plt.title("Trend Analysis")
plt.show()

##### 1. Why did you pick the specific chart?

To observe trends or patterns.


##### 2. What is/are the insight(s) found from the chart?

Identifies increasing or decreasing behavior.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Useful for forecasting.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Share of each content type; the first object column (show_id) is unique per title
df['type'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.title("Category Share")
plt.ylabel("")
plt.show()

##### 1. Why did you pick the specific chart?

Shows percentage contribution.


##### 2. What is/are the insight(s) found from the chart?

Highlights major contributors.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps resource allocation.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
df[num_cols[1]].hist()
plt.title("Distribution of Second Numerical Variable")
plt.show()

##### 1. Why did you pick the specific chart?

Compare distributions.


##### 2. What is/are the insight(s) found from the chart?

Different spread patterns.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps normalization decisions.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.violinplot(df[num_cols[0]])
plt.title("Violin Plot")
plt.show()

##### 1. Why did you pick the specific chart?

Shows density and distribution.


##### 2. What is/are the insight(s) found from the chart?

Reveals skewness.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Better understanding of data behavior.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Top five ratings; the first object column (show_id) has no repeated values to rank
df['rating'].value_counts().head().plot(kind='bar')
plt.title("Top Categories")
plt.show()

##### 1. Why did you pick the specific chart?

Focus on top contributors.


##### 2. What is/are the insight(s) found from the chart?

Few categories dominate.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Improves strategic focus.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
df[num_cols[0]].sort_values().cumsum().plot()
plt.title("Cumulative Distribution")
plt.show()

##### 1. Why did you pick the specific chart?

Understand cumulative effect.


##### 2. What is/are the insight(s) found from the chart?

Growth concentration.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps threshold decisions.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
df[num_cols[0]].plot(kind='density')
plt.title("Density Plot")
plt.show()

##### 1. Why did you pick the specific chart?

Smooth distribution view.


##### 2. What is/are the insight(s) found from the chart?

Central tendency visible.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Assists model assumptions.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Rating counts split by content type; the first object column (show_id) is unique per title
pd.crosstab(df['rating'], df['type']).plot(kind='bar', stacked=True)
plt.title("Stacked Bar Chart")
plt.show()

##### 1. Why did you pick the specific chart?

Compare categories across groups.


##### 2. What is/are the insight(s) found from the chart?

Group-wise variation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Better segmentation.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import seaborn as sns
plt.figure()
sns.heatmap(df.select_dtypes(include=np.number).corr(), annot=True)
plt.title("Correlation Heatmap")
plt.show()

##### 1. Why did you pick the specific chart?

Shows correlation strength.

##### 2. What is/are the insight(s) found from the chart?

Identifies multicollinearity.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df.select_dtypes(include=np.number))
plt.show()

##### 1. Why did you pick the specific chart?

Visualizes all numeric relationships.


##### 2. What is/are the insight(s) found from the chart?

Confirms correlations and patterns

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Based on exploratory data analysis and visualization, the following three hypothetical statements are formulated to statistically validate relationships in the dataset.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
There is no significant difference in the mean of the numerical variable between two different categories.

Alternate Hypothesis (H₁):
There is a significant difference in the mean of the numerical variable between two different categories.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

num_col = 'release_year'
# Compare Movies and TV Shows; the first object column (show_id) is unique per row,
# which would leave a single observation in each group
cat_col = 'type'

group1 = df.loc[df[cat_col] == 'Movie', num_col]
group2 = df.loc[df[cat_col] == 'TV Show', num_col]

t_stat, p_value = ttest_ind(group1, group2, nan_policy='omit')
t_stat, p_value

##### Which statistical test have you done to obtain P-Value?


Independent Two-Sample T-Test

##### Why did you choose the specific statistical test?

Because the test compares the means of two independent groups for a numerical variable.
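
As an illustration of how the test result is read, the sketch below runs a two-sample t-test on synthetic data. The group means, spreads, and the 0.05 significance level are assumptions chosen for the example, and `equal_var=False` (Welch's variant) is used here because the two synthetic groups have different variances.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic release years for two content types (values are illustrative only)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "type": ["Movie"] * 200 + ["TV Show"] * 200,
    "release_year": np.concatenate([
        rng.normal(2012, 6, 200).round(),
        rng.normal(2016, 4, 200).round(),
    ]),
})

movies = toy.loc[toy["type"] == "Movie", "release_year"]
shows = toy.loc[toy["type"] == "TV Show", "release_year"]

# Welch's t-test: does not assume equal variances in the two groups
t_stat, p_value = ttest_ind(movies, shows, equal_var=False)

alpha = 0.05  # assumed significance level
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.2f}, p = {p_value:.4g} -> {decision}")
```

A p-value below the chosen significance level leads to rejecting the null hypothesis of equal means.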

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
There is no correlation between two numerical variables.

Alternate Hypothesis (H₁):
There is a significant correlation between two numerical variables.

#### 2. Perform an appropriate statistical test.

In [None]:
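# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr
num_cols = df.select_dtypes(include=np.number).columns
# pearsonr does not accept NaN values, so keep only rows where both values are present
pair = df[[num_cols[0], num_cols[1]]].dropna()
corr_coeff, p_value = pearsonr(pair[num_cols[0]], pair[num_cols[1]])
corr_coeff, p_value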

##### Which statistical test have you done to obtain P-Value?

Pearson Correlation Test

##### Why did you choose the specific statistical test?

Because it measures the strength and direction of linear relationship between two continuous numerical variables.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
The categorical variable has no effect on the distribution of another categorical variable.

Alternate Hypothesis (H₁):
The categorical variable has a significant effect on another categorical variable.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import chi2_contingency

# Test whether rating is associated with content type; the first object column
# (show_id) is unique per row and would give a degenerate contingency table
contingency_table = pd.crosstab(df['type'], df['rating'])

chi2, p_value, dof, expected = chi2_contingency(contingency_table)
chi2, p_value

##### Which statistical test have you done to obtain P-Value?

Chi-Square Test of Independence

##### Why did you choose the specific statistical test?

Because it tests the association between two categorical variables.
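
The mechanics of the test can be seen on a small hand-made table. The counts below are invented for illustration; the sketch shows the contingency table, the expected counts under independence, and the degrees of freedom.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented counts: 60 movies and 40 TV shows with two ratings
toy = pd.DataFrame({
    "type": ["Movie"] * 60 + ["TV Show"] * 40,
    "rating": (["TV-MA"] * 20 + ["PG-13"] * 40      # movies
               + ["TV-MA"] * 30 + ["PG-13"] * 10),  # TV shows
})

# Observed counts for every (type, rating) combination
table = pd.crosstab(toy["type"], toy["rating"])

# expected holds the counts we would see if type and rating were independent
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4g}")
```

The larger the gap between observed and expected counts, the larger the chi-square statistic and the smaller the p-value.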

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.fillna({
    col: df[col].mean() for col in df.select_dtypes(include=np.number).columns
}, inplace=True)

df.fillna({
    col: df[col].mode()[0] for col in df.select_dtypes(include='object').columns
}, inplace=True)

#### What all missing value imputation techniques have you used and why did you use those techniques?

Mean imputation for numerical columns → preserves central tendency

Mode imputation for categorical columns → preserves most frequent class

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
for col in df.select_dtypes(include=np.number).columns:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    df = df[(df[col] >= Q1 - 1.5 * IQR) & (df[col] <= Q3 + 1.5 * IQR)]

##### What all outlier treatment techniques have you used and why did you use those techniques?

IQR method: rows with values outside [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] are removed, because extreme values can distort distance-based models such as K-Means.
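
The IQR bounds can be worked through on a handful of values. This is a sketch with made-up release years, one of which is deliberately far from the rest.

```python
import pandas as pd

# Made-up release years; 1950 is the intended outlier
years = pd.Series(
    [2015, 2016, 2017, 2018, 2019, 2020, 2021, 1950], name="release_year"
)

q1 = years.quantile(0.25)
q3 = years.quantile(0.75)
iqr = q3 - q1

# Values outside these fences are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

kept = years[years.between(lower, upper)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, bounds=({lower}, {upper})")
print("Kept values:", kept.tolist())
```

Note that the notebook applies this filter column by column, so each later column is filtered on the rows that survived the earlier ones.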

### 3. Categorical Encoding

In [None]:
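# Encode the categorical columns
# The result is stored in df_encoded (used by the later steps) so that df keeps its
# text columns, such as description, for the text preprocessing below.
# Only low-cardinality columns are one-hot encoded; dtype=int keeps the dummies numeric.
df_encoded = pd.get_dummies(df, columns=['type', 'rating'], drop_first=True, dtype=int)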

#### What all categorical encoding techniques have you used & why did you use those techniques?

Encoding Technique Used: One-Hot Encoding

Why: It avoids imposing an artificial order on nominal categories and produces numeric columns that ML models can consume.
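
A small example makes the effect of `drop_first=True` visible. The rows below are invented; the column names follow the dataset's `type` and `rating` columns.

```python
import pandas as pd

# Invented rows using the dataset's column names
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "rating": ["PG-13", "TV-MA", "TV-14"],
    "release_year": [2010, 2019, 2021],
})

# drop_first=True drops the first category of each column, which becomes the
# implicit baseline (all dummies equal to 0); dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["type", "rating"], drop_first=True, dtype=int)
print(encoded)
```

Here `Movie` and `PG-13` become the baseline categories, so a row with zeros in every dummy column is a PG-13 movie.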

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
import contractions

df['description'] = df['description'].apply(lambda x: contractions.fix(x) if isinstance(x, str) else x)

#### 2. Lower Casing

In [None]:
# Lower Casing
df['description'] = df['description'].str.lower()

#### 3. Removing Punctuations

In [None]:
# Removing punctuations
import re

df['description'] = df['description'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs and digits
df['description'] = df['description'].apply(lambda x: re.sub(r'http\S+|www\S+|\w*\d\w*', '', x))

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

df['description'] = df['description'].apply(
    lambda x: " ".join([word for word in x.split() if word not in stop_words])
)

In [None]:
# Remove White spaces
df['description'] = df['description'].str.strip()

#### 6. Rephrase Text

In [None]:
# Rephrase Text
df['description'] = df['description'].apply(lambda x: " ".join(x.split()))

#### 7. Tokenization

In [None]:
# Tokenization
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')  # resource name looked up by newer NLTK releases

df['tokens'] = df['description'].apply(word_tokenize)

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
# Normalizing Text (Lemmatization)
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

df['tokens'] = df['tokens'].apply(
    lambda tokens: [lemmatizer.lemmatize(word) for word in tokens]
)


##### Which text normalization technique have you used and why?

Technique Used: Lemmatization

Why: It converts words to their meaningful base form and preserves the actual dictionary meaning, unlike stemming, which can produce non-words.

#### 9. Part of speech tagging

In [None]:
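# Part of Speech Tagging
import nltk
from nltk import pos_tag
# pos_tag needs the perceptron tagger resource to be downloaded first
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # resource name looked up by newer NLTK releases

df['pos_tags'] = df['tokens'].apply(pos_tag)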

#### 10. Text Vectorization

In [None]:
# Vectorizing Text using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=500)
text_features = tfidf.fit_transform(df['description'])

##### Which text vectorization technique have you used and why?

Technique Used: TF-IDF

Why: It highlights words that are distinctive for a description, reduces the weight of words that occur in most descriptions, and is a common choice for clustering and recommendation tasks.
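
To show how TF-IDF vectors support content similarity, the sketch below compares three invented descriptions with cosine similarity. The use of `cosine_similarity` and the built-in English stop-word list are assumptions for this example; the notebook itself only builds the TF-IDF matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented descriptions: the first two share a theme, the third does not
descriptions = [
    "a detective investigates a murder in a small town",
    "a small town detective hunts a serial killer",
    "two friends open a bakery and fall in love",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(descriptions)  # sparse matrix, one row per description

# Pairwise cosine similarity between all descriptions
similarity = cosine_similarity(matrix)
print(similarity.round(2))
```

Descriptions that share distinctive words score higher against each other than against unrelated ones, which is the property a content-based recommender relies on.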

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
df['content_age'] = 2025 - df['release_year']

#### 2. Feature Selection

In [None]:
# Feature Selection using Variance Threshold
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(df_encoded.select_dtypes(include=np.number))

##### What all feature selection methods have you used  and why?

Variance Threshold → removes near-constant features (variance below 0.01), which carry little information for clustering

Prevents overfitting

Improves model efficiency

##### Which all features you found important and why?

Release year

Duration

Content age

Genre-related features

These influence user engagement and recommendation patterns.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Log Transformation: reduces skewness and improves ML model stability
df_encoded[df_encoded.select_dtypes(include=np.number).columns] = np.log1p(
    df_encoded.select_dtypes(include=np.number)
)

### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_encoded.select_dtypes(include=np.number))

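##### Which method have you used to scale your data and why?

StandardScaler (zero mean and unit variance per feature), because K-Means and PCA are distance- and variance-based, so features on larger scales would otherwise dominate.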

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Yes, due to high-dimensional encoded features.

In [None]:
# Dimensionality Reduction (If needed)
# Dimensionality Reduction using PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
reduced_data = pca.fit_transform(scaled_data)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

PCA reduces noise, improves visualization and speeds up ML training.
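
One way to judge how many components are worth keeping is the cumulative explained variance. The sketch below uses synthetic data and an assumed 90% threshold; neither comes from the original notebook, which fixes `n_components=2`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: six features built from two underlying factors plus small noise
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 4))]) + 0.05 * rng.normal(size=(300, 6))

X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to inspect how the variance is distributed
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the assumed 90% threshold
n_components_90 = int(np.argmax(cumulative >= 0.90) + 1)

print("Cumulative explained variance:", cumulative.round(3))
print("Components needed for 90%:", n_components_90)
```

If two components explain only a small share of the variance on the real data, the 2-D projection is still useful for plotting, but clustering on it discards information.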

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# Train-Test Split
from sklearn.model_selection import train_test_split

X = reduced_data
y = df_encoded.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

##### What data splitting ratio have you used and why?

80:20 split ensures sufficient training data while keeping test set reliable.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Yes, certain categories dominate content distribution.

In [None]:
# Handling Imbalanced Dataset (If needed)
# Handling Imbalanced Dataset using SMOTE
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

SMOTE synthetically balances classes and improves model fairness.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation (KMeans Clustering)
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(X_resampled)

# Predict
clusters = kmeans.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

silhouette_scores = []

K = range(2, 10)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X_resampled)
    score = silhouette_score(X_resampled, labels)
    silhouette_scores.append(score)

plt.figure(figsize=(8,5))
plt.plot(K, silhouette_scores, marker='o')
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Score vs Number of Clusters")
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
best_k = K[silhouette_scores.index(max(silhouette_scores))]

kmeans_optimized = KMeans(n_clusters=best_k, random_state=42)
kmeans_optimized.fit(X_resampled)

clusters_optimized = kmeans_optimized.predict(X_test)


##### Which hyperparameter optimization technique have you used and why?

Technique Used: Manual Grid Search on number of clusters

Why:

KMeans has limited hyperparameters

Silhouette Score directly evaluates clustering quality

Computationally efficient and interpretable
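
The selection rule described above, scanning K and keeping the value with the highest silhouette score, can be sketched on synthetic blobs. The four cluster centers and the `n_init=10` setting are assumptions for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (centers chosen for illustration)
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (-5, 5), (5, -5), (5, 5)],
    cluster_std=0.6,
    random_state=42,
)

# Fit K-Means for each candidate K and record the silhouette score
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Keep the K with the highest score
best_k = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()})
print("Best K:", best_k)
```

On real data the curve is usually flatter than on clean blobs, so it is worth checking whether neighbouring values of K give nearly the same score before committing to one.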

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

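Yes.

The initial model used a fixed K = 5, while the optimized model uses the K with the highest silhouette score among K = 2 to 9, so its score is at least as high as the initial one.

A higher silhouette score indicates better cluster cohesion and separation.

Better content grouping → improved recommendations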

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.cluster import AgglomerativeClustering

agglo = AgglomerativeClustering(n_clusters=best_k)
agglo_labels = agglo.fit_predict(X_resampled)

agglo_score = silhouette_score(X_resampled, agglo_labels)
agglo_score

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
scores = {}
for linkage in ['ward', 'complete', 'average']:
    model = AgglomerativeClustering(n_clusters=best_k, linkage=linkage)
    labels = model.fit_predict(X_resampled)
    scores[linkage] = silhouette_score(X_resampled, labels)

scores


##### Which hyperparameter optimization technique have you used and why?

Technique Used: Parameter tuning on linkage methods

Why:
Linkage defines how clusters are merged

Helps identify best hierarchical structure



##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Improvement Observed:

Ward linkage produced best silhouette score

Improved cluster interpretability



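#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Silhouette Score:

Measures how close each title is to its own cluster (cohesion) compared with the nearest other cluster (separation), on a scale from -1 to 1

Higher score = more clearly separated content groups, which gives a recommender more reliable candidates for similar titles

Indirectly supports user engagement & watch time.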
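
For a single point i, the silhouette value is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other points in its own cluster and b(i) is the mean distance to the points of the nearest other cluster. The sketch below checks this by hand on four made-up points against scikit-learn's `silhouette_samples`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Four made-up points forming two tight clusters
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])

# a: mean distance from point 0 to the other members of its own cluster
a = np.linalg.norm(X[0] - X[1])

# b: mean distance from point 0 to the members of the nearest other cluster
b = np.mean([np.linalg.norm(X[0] - X[2]), np.linalg.norm(X[0] - X[3])])

manual = (b - a) / max(a, b)
library = silhouette_samples(X, labels)[0]

print(f"manual = {manual:.4f}, library = {library:.4f}")
```

The overall silhouette score reported in the notebook is the average of these per-point values.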

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm


from sklearn.cluster import DBSCAN

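dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_resampled)
mask = dbscan_labels != -1  # exclude noise points
# Silhouette needs at least two clusters among the non-noise points
if len(set(dbscan_labels[mask])) > 1:
    dbscan_score = silhouette_score(X_resampled[mask], dbscan_labels[mask])
else:
    dbscan_score = None
dbscan_score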

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Model Used: DBSCAN, a density-based clustering algorithm that groups titles lying in dense regions and labels isolated titles as noise.
Evaluation: Silhouette Score computed on the non-noise points.
Impact: Well-separated clusters support recommendations and user engagement.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Fit the Algorithm
# Predict on the model
eps_values = [0.3, 0.5, 0.7]
scores = []

for eps in eps_values:
    model = DBSCAN(eps=eps, min_samples=5)
    labels = model.fit_predict(X_resampled)
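    mask = labels != -1  # exclude noise points
    # Silhouette needs at least two clusters among the non-noise points
    if len(set(labels[mask])) > 1:
        score = silhouette_score(X_resampled[mask], labels[mask])
        scores.append(score)
    else:
        scores.append(None)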

scores

##### Which hyperparameter optimization technique have you used and why?

Technique Used: Manual tuning of eps

Why:
DBSCAN is highly sensitive to eps, which controls the density threshold

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Improvement Observed:
Moderate improvement
Identified noise but less stable for large datasets
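
Because DBSCAN is so sensitive to eps, a common starting point is to look at the distance from each point to its `min_samples`-th nearest neighbour. The sketch below uses synthetic blobs, and taking the 95th percentile of those distances as eps is a hypothetical rule of thumb chosen for this example, not a method from the original notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic data with three separated groups (centers chosen for illustration)
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.5,
    random_state=42,
)

min_samples = 5

# Distance from every point to its min_samples-th nearest neighbour; the query
# points are the training points, so each point counts as its own first neighbour
distances, _ = NearestNeighbors(n_neighbors=min_samples).fit(X).kneighbors(X)
k_distances = np.sort(distances[:, -1])

# Hypothetical rule of thumb: use a high percentile of the k-distances as eps
eps = float(np.percentile(k_distances, 95))

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
n_clusters = len(set(labels) - {-1})
noise_share = float(np.mean(labels == -1))

print(f"eps = {eps:.3f}, clusters = {n_clusters}, noise share = {noise_share:.2%}")
```

Plotting `k_distances` and looking for the bend in the curve is the usual visual alternative to a fixed percentile.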

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Silhouette Score → Measures cluster cohesion and separation, used here as a proxy for recommendation quality

Better clustering → higher user satisfaction

Reduced content mismatch

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Final Model Chosen: Optimized KMeans
Why:
Highest silhouette score
Scalable to large Netflix dataset
Easy deployment and interpretation

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In [None]:
# Feature importance via PCA loadings
pca.components_

PCA loadings show which features contribute most to the two components used for clustering.

Release year, duration, and genre features are influential.

This helps the business understand the content grouping logic.
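
The raw `pca.components_` array is easier to read once each loading is labelled with its feature name. The sketch below uses a synthetic frame; the columns `duration_minutes` and `is_tv_show` are hypothetical stand-ins, not columns of the original dataset.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic features; duration_minutes and is_tv_show are hypothetical columns
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "release_year": rng.integers(1990, 2022, 200),
    "duration_minutes": rng.integers(40, 180, 200),
    "is_tv_show": rng.integers(0, 2, 200),
})
toy["content_age"] = 2025 - toy["release_year"]

scaled = StandardScaler().fit_transform(toy)
pca = PCA(n_components=2).fit(scaled)

# Rows are features, columns are components; larger absolute values mean
# the feature contributes more to that component
loadings = pd.DataFrame(pca.components_.T, index=toy.columns, columns=["PC1", "PC2"])
top_pc1 = loadings["PC1"].abs().sort_values(ascending=False)

print(loadings.round(3))
print("Features ranked by |PC1 loading|:", top_pc1.index.tolist())
```

In this toy frame `content_age` is an exact linear function of `release_year`, so both dominate the first component together, which illustrates that correlated features share their loading rather than each counting separately.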

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.

In [None]:
# Save the File
import joblib

joblib.dump(kmeans_optimized, "netflix_kmeans_model.pkl")

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
loaded_model = joblib.load("netflix_kmeans_model.pkl")
loaded_model.predict(X_test[:5])
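
A stricter sanity check is to confirm that the reloaded model gives exactly the same cluster assignments as the original. The sketch below does this on random data with a temporary file; the file name and the data are assumptions for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans

# Random 2-D data standing in for the reduced feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "kmeans_demo.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should assign every point to the same cluster
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions identical after reload:", same_predictions)
```

For deployment, the fitted scaler and PCA objects would need to be saved alongside the clustering model, since new titles must pass through the same transformations before prediction.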

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

This project successfully analyzed Netflix content using unsupervised machine learning.
EDA revealed strong patterns in content type, release trends, and genres.
Clustering models grouped similar content effectively, supporting recommendation systems and content strategy decisions.
The final optimized KMeans model is scalable, interpretable, and deployment-ready.