# **Project Name**    - 🎬 Netflix TV Shows & Movies EDA 📊






##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### Team Member 1 - Meghashyam Parab


# **Project Summary -**

📺 Netflix TV Shows & Movies EDA 📊

Uncovering Hidden Stories in Streaming Data

Welcome to an in-depth exploratory data analysis of Netflix’s vast catalog of movies and TV shows! This project dives into the patterns, trends, and hidden insights that shape what we stream—from binge-worthy series to timeless cinema.

----

🔍 Project Highlights:

1. 🧹 Cleaned & Preprocessed a dataset of 7,700+ Netflix titles

2. 📈 Visualized trends in release years, durations, and content types

3. 🌍 Mapped country-wise content distribution and director stats

4. 🧠 Performed Clustering to group similar content for smarter recommendations

5. ✨ Extracted features like title length, season count, and genre categories

-----



🎯 Objective:

To discover meaningful patterns that can enhance Netflix’s recommendation system, content tagging, and strategic planning using unsupervised learning and visual storytelling.

----

🔧 Tech Stack:

1. Python, Pandas, NumPy

2. Matplotlib, Seaborn, Plotly

3. Scikit-learn (Clustering: K-Means, DBSCAN)

4. Jupyter Notebook



# **GitHub Link -**

https://github.com/meghashyam123/-Netflix-TV-Shows-Movies-EDA-

# **Problem Statement**


"Clustering Netflix Titles Based on Numerical Attributes to Uncover Content Similarities"

Netflix has a vast and diverse content library that includes movies and TV shows from various countries, genres, durations, and years. To enhance content organization, recommendation engines, and user experience, it is beneficial to group similar titles based on measurable features.

The objective of this project is to:

1. Analyze and preprocess Netflix titles using attributes such as release_year, duration, and title_length.

2. Apply clustering algorithms (e.g., K-Means, DBSCAN, Hierarchical) to uncover hidden patterns or natural groupings in the data.

3. Interpret the clusters to understand common traits among grouped content (e.g., short TV shows from the 2010s, or long-duration movies with long titles).

4. Visualize the results for stakeholders to better understand Netflix's content structure.
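
As a rough illustration of the clustering objective above, the following sketch runs K-Means on a tiny made-up sample. The values, the choice of `k=2`, and the use of `StandardScaler` are assumptions for demonstration only; the column names mirror `release_year`, `duration_int`, and `title_length` as used in this notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Tiny made-up sample; the real analysis would use the cleaned Netflix frame.
sample = pd.DataFrame({
    "release_year": [2019, 2018, 2020, 1995, 1998, 1990],
    "duration_int": [95, 102, 88, 150, 165, 140],
    "title_length": [12, 9, 15, 30, 28, 35],
})

# Standardize so that no single feature dominates the Euclidean distance.
scaled = StandardScaler().fit_transform(sample)

# k=2 is an arbitrary choice for this toy data, not a tuned value.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
sample["cluster"] = kmeans.fit_predict(scaled)

# Per-cluster feature means give a first, rough description of each group.
print(sample.groupby("cluster").mean().round(1))
```

On real data the number of clusters would normally be chosen with a diagnostic such as the elbow method or silhouette score rather than fixed in advance.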

#### **Define Your Business Objective?**

✅ Key Goals:

1. Enhance User Experience:
By clustering content based on key attributes (e.g., duration, release year, and title length), we can enable smarter recommendation engines that suggest similar content to users.

2. Support Content Curation & Tagging:
Automatically discover patterns in content characteristics to assist editors and content strategists in organizing and tagging content efficiently.

3. Identify Content Trends:
Understand clusters of content types (e.g., short series from recent years vs. long classic films) to inform future production or licensing strategies.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt


### Dataset Loading

In [None]:
# Load Dataset

df = pd.read_csv('/content/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look

df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

df.shape

### Dataset Information

In [None]:
# Dataset Info

df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

num_duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

df.isnull().sum()

In [None]:
# Visualizing the missing values


# Heatmap
plt.figure(figsize=(10, 6))  # Adjust figure size if needed
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

# Bar chart
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0]  # Filter out columns with no missing values

plt.figure(figsize=(10, 6))
missing_values.plot(kind='bar')
plt.title('Missing Values by Column')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.show()

### What did you know about your dataset?

The dataset contains 7,787 entries and 12 columns, describing Netflix shows and movies. Here's a quick overview of the columns:

show_id: Unique identifier

type: Movie or TV Show

title: Title of the content

director, cast, country: People and origin information (some missing values)

date_added: When it was added to Netflix

release_year: Original release year

rating: Age rating (e.g., PG-13)

duration: For movies, it's minutes; for TV shows, it's seasons

listed_in: Genre categories

description: Brief synopsis
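
Since several columns are noted as having missing values, it can help to express them as a percentage of rows rather than raw counts. The sketch below uses a small made-up frame, not the real dataset, to show one way of doing that.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in two columns.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["India", "United States", np.nan, "Japan"],
})

# isnull().mean() gives the fraction of missing rows per column.
missing_pct = sample.isnull().mean().mul(100).round(1)

# Keep only columns that actually have gaps, largest first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```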



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

df.columns


In [None]:
# Dataset Describe

df.describe()

### Variables Description

| Column Name    | Description                                                        |
| -------------- | ------------------------------------------------------------------ |
| `show_id`      | Unique identifier for each show or movie.                          |
| `type`         | Indicates whether the content is a *Movie* or *TV Show*.           |
| `title`        | Title of the content on Netflix.                                   |
| `director`     | Name(s) of the director(s). Null for some TV shows.                |
| `cast`         | Main actors/actresses featured in the content.                     |
| `country`      | Country of origin or production.                                   |
| `date_added`   | Date the content was added to Netflix.                             |
| `release_year` | Original year of release.                                          |
| `rating`       | Age-based content rating (e.g., TV-MA, PG-13).                     |
| `duration`     | Runtime: in minutes for movies; number of seasons for TV shows.    |
| `listed_in`    | Genre/category tags assigned to the title.                         |
| `description`  | Short summary or synopsis of the content.                          |
| `duration_int` | **(Engineered)**: Numeric form of `duration` (minutes or seasons). |
| `title_length` | **(Engineered)**: Character count of the title.                    |


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

for column in df.columns:
    unique_values = df[column].unique()
    print(f"Column: {column}")
    print(f"Unique Values: {unique_values}\n")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Data wrangling: handle missing values, parse dates, and engineer features

# Handle missing values (example: fill with mode for 'country' and 'cast')
df['country'] = df['country'].fillna(df['country'].mode()[0])
df['cast'] = df['cast'].fillna(df['cast'].mode()[0])

# Remove rows with missing values in 'date_added' and 'rating' columns
df.dropna(subset=['date_added', 'rating'], inplace=True)

# Convert 'date_added' to datetime objects
# Strip stray leading/trailing whitespace first so valid dates are not lost;
# errors='coerce' sets anything that still fails to parse to NaT (Not a Time)
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), format='%B %d, %Y', errors='coerce')

# Drop rows with NaT in 'date_added'
df.dropna(subset=['date_added'], inplace=True)

# Extract year, month, and day from 'date_added'
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month
df['day_added'] = df['date_added'].dt.day

# Split 'duration' into a numeric value (duration_int) and its unit (duration_type)
# Raw strings avoid invalid-escape warnings; expand=False returns a Series
df['duration_int'] = df['duration'].str.extract(r'(\d+)', expand=False).astype(int)
df['duration_type'] = df['duration'].str.extract(r'(min|Seasons?)', expand=False)
df['duration_type'] = df['duration_type'].fillna(df['duration_type'].mode()[0])

# Drop unnecessary columns
df.drop(columns=['date_added', 'duration'], inplace=True)

# Feature engineering for 'listed_in' (one-hot encoding)
# Genres are separated by ', ' so split on that to avoid column names with a leading space
listed_in_dummies = df['listed_in'].str.get_dummies(sep=', ')
df = pd.concat([df, listed_in_dummies], axis=1)

# Remove the original 'listed_in' column
df = df.drop('listed_in', axis=1)

#Print the first few rows to show that the data wrangling worked
print(df.head())
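
One detail worth checking in the one-hot encoding step is the separator passed to `str.get_dummies`. The sketch below uses made-up genre strings in the same "label, label" shape as `listed_in` to show how the separator affects the resulting column names.

```python
import pandas as pd

# Made-up genre strings shaped like the `listed_in` column.
genres = pd.Series(["Dramas, Comedies", "Comedies", "Dramas, International Movies"])

# Splitting on "," alone keeps the space, so " Comedies" and "Comedies"
# end up as two different columns.
loose = genres.str.get_dummies(sep=",")

# Splitting on ", " yields one clean column per genre.
clean = genres.str.get_dummies(sep=", ")

print(list(loose.columns))
print(list(clean.columns))
```

With the clean version, summing a genre column counts every title tagged with that genre, regardless of where the tag appears in the list.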

### What all manipulations have you done and insights you found?

🛠️ Data Manipulations Done

1. Data Cleaning

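* Filled missing values in country and cast with each column's most frequent value, and dropped rows with a missing date_added or rating. The director column was left as-is.

* Standardized date_added into datetime format.

* Extracted year_added, month_added, and day_added from date_added.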

2. Feature Engineering

* duration_int: Converted text-based duration (e.g., "90 min", "2 Seasons") into numeric form for clustering.

* title_length: Calculated number of characters in each title.

* content_category: Tagged content as Short Film, Series, Feature Film, etc., based on duration and type.

3. Data Filtering

* Separated Movies and TV Shows to analyze duration more meaningfully.

* Focused on post-2000 content for most analyses (modern Netflix era).
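
The `content_category` tagging described above could look roughly like the sketch below. The `categorize` helper and the 40-minute cut-off for a short film are assumptions made for illustration; only the label names come from the list above.

```python
import pandas as pd

def categorize(row):
    """Assign a coarse label from type and numeric duration (threshold is assumed)."""
    if row["type"] == "TV Show":
        return "Series"
    # Hypothetical rule: movies under 40 minutes count as short films.
    return "Short Film" if row["duration_int"] < 40 else "Feature Film"

# Made-up rows using the same column names as the wrangled frame.
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show"],
    "duration_int": [25, 110, 1, 4],
})

sample["content_category"] = sample.apply(categorize, axis=1)
print(sample)
```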

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your DataFrame
content_type_counts = df['type'].value_counts()

plt.figure(figsize=(8, 6))
sns.countplot(x='type', data=df)
plt.title('Distribution of Content Type')
plt.xlabel('Content Type')
plt.ylabel('Count')
plt.show()

print(content_type_counts)

##### 1. Why did you pick the specific chart?

The bar chart was chosen for this visualization because it is the most effective way to compare categorical data, such as different content types—TV Shows vs Movies. It clearly shows:

1. The count of each category

2. The magnitude of difference between them

This makes it easy to observe that movies significantly outnumber TV shows in the dataset, which could have implications for content strategy, investment focus, or user preference analysis.



##### 2. What is/are the insight(s) found from the chart?

The chart shows that movies are significantly more prevalent than TV shows in the dataset. Specifically:

1. There are more than twice as many movies as TV shows.

2. This suggests a content catalog heavily skewed toward movies, indicating that either user demand favors movies or the platform prioritizes acquiring or producing them.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

1. Knowing that movies dominate the platform's content allows businesses to align promotion strategies (e.g., featuring more movies on landing pages, suggesting popular genres).

2. It helps in audience targeting, as marketing teams can focus on movie watchers, tailoring recommendations and ads accordingly.

Potential Negative Growth:

1. The skew toward movies could lead to content fatigue for users who prefer serialized or long-form content like TV shows.

2. Underrepresentation of TV shows may drive away subscribers who seek episodic content, especially binge-watchers.



#### Chart - 2

In [None]:
# Chart - 2 visualization code

# Calculate the length of each title
df['title_length'] = df['title'].apply(len)

# Create a histogram
plt.figure(figsize=(10, 6))
sns.histplot(df['title_length'], bins=20)  # Adjust bins as needed
plt.title('Distribution of Title Lengths')
plt.xlabel('Title Length (Number of Characters)')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

The histogram was chosen for this analysis because it is the best chart type for showing the distribution of continuous numerical data, in this case, the length of show titles. It clearly illustrates:

1. How frequently different title lengths occur

2. The shape of the distribution (e.g., skewness)

3. The central tendency and spread of title lengths

##### 2. What is/are the insight(s) found from the chart?

1. Most show titles are short, with the peak frequency between 10 and 20 characters, indicating a preference or trend toward concise, easily digestible titles.

2. As title length increases beyond 30 characters, the frequency drops sharply, and titles longer than 60 characters are rare.

3. The distribution is right-skewed, meaning shorter titles dominate the dataset.
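
The right skew read off the histogram can also be checked numerically. This is a minimal sketch on made-up titles: a positive skewness and a mean above the median are both consistent with a long tail of lengthy titles.

```python
import pandas as pd

# Made-up titles: mostly short, with one very long outlier.
titles = pd.Series([
    "Dark", "Ozark", "The Crown", "Narcos", "Stranger Things",
    "Black Mirror", "Roma",
    "A Very Long Documentary Title About Many Different Things",
])

lengths = titles.str.len()

# Positive skewness indicates a tail on the right (long titles).
print("skewness:", round(lengths.skew(), 2))
print("mean:", lengths.mean(), "median:", lengths.median())
```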



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

📈 Positive Business Impact:

1. Knowing that shorter titles are more common and possibly more effective for user recall and engagement, platforms can use this insight for marketing, SEO, and UI design.

2. New content can be titled more strategically to match proven popular length ranges (under 25 characters) to improve click-through rates and visibility.

⚠️ Negative Growth Indicator (if ignored):

1. Ignoring title length trends might lead to less effective promotion of content with very long or overly complex titles, which may reduce discoverability and viewer interest.

2. Overly short or cryptic titles could also miss the mark in conveying the content’s value, so it's essential to balance brevity with clarity.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

# Split the 'director' column by comma and explode to create individual rows for each director
director_counts = df['director'].str.split(', ').explode().value_counts()

# Get the top 10 directors
top_10_directors = director_counts.head(10)

# Create a bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x=top_10_directors.index, y=top_10_directors.values)
plt.title('Top 10 Directors by Number of Shows')
plt.xlabel('Director')
plt.ylabel('Number of Shows')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.show()

##### 1. Why did you pick the specific chart?

The bar chart was used here because it is the most effective way to compare discrete values—in this case, the number of shows directed by the top 10 directors. It allows for:

1. Clear comparison of each director’s output side by side.

2. Precise visual representation of count data (unlike a word cloud or pie chart).

3. Ease of interpretation, even when the differences are small.

4. Direct focus on quantity, which is ideal for identifying high-contribution individuals in a dataset.



##### 2. What is/are the insight(s) found from the chart?

✅ Insights Gained:

1. Jan Suter leads with the highest number of shows (21), followed closely by Raúl Campos and Marcus Raboy, indicating these directors are highly active or popular within the platform's content library.

2. Balanced Contribution: The chart reveals a relatively even distribution across top directors, suggesting no single director overwhelmingly dominates, which is good for content diversity.

3. Regional Representation: Names like Cathy Garcia-Molina (Philippines), Youssef Chahine (Egypt), and David Dhawan (India) suggest a mix of international directors, aligning with a global content strategy.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

📈 Positive Business Impact:

1. Data-Driven Content Investment: The chart can guide decisions on which directors’ work to promote or acquire more of, especially if their shows perform well with audiences.

2. Global Strategy Support: The inclusion of directors from multiple regions helps cater to diverse audience segments, encouraging broader reach and subscriber growth.

⚠️ Potential Negative Insight:

1. Underutilization of Top-Tier Talent: Spielberg and Scorsese have fewer shows despite their strong global reputation. This might be a missed opportunity to attract premium-tier viewers who prefer high-quality, critically acclaimed films.

2. Risk of Regional Saturation: If too much content comes from a small group of directors from one region, it could limit content variety, leading to viewer fatigue in that category.



#### Chart - 4

In [None]:
# Chart - 4 visualization code

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Count how often each cast member appears across all titles
actor_counts = df['cast'].str.split(', ').explode().value_counts()

# Convert actor counts to a dictionary for the word cloud
actor_dict = actor_counts.to_dict()

# Create the word cloud from the name frequencies
# (stopwords are ignored by generate_from_frequencies, so none are passed)
wordcloud = WordCloud(width=800, height=400, background_color='white',
                      min_font_size=10).generate_from_frequencies(actor_dict)

# Plot the word cloud
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.title('Top Actors by Number of Appearances (Word Cloud)')
plt.show()

##### 1. Why did you pick the specific chart?

1. Instant Visual Impact
2. Space-Efficient Summary
3. Comparative Emphasis
4. Engaging Format

##### 2. What is/are the insight(s) found from the chart?

The word cloud reveals that David Attenborough is the most frequently appearing cast member, indicating a strong presence of nature documentaries. It also highlights a significant representation of Indian actors like Shah Rukh Khan, Amitabh Bachchan, and Nawazuddin Siddiqui, suggesting a large volume of Bollywood content. Additionally, the presence of popular voice actors such as Tara Strong and Ashleigh Ball points to a high number of animated or dubbed shows, while the diversity of names across regions shows a broad and globally inclusive content library. Note that missing `cast` values were filled with the most frequent cast entry during data wrangling, so the prominence of the top name may be partly an artifact of that imputation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from the word cloud can help create a positive business impact by guiding strategic decisions. For example, the prominence of David Attenborough suggests high viewer interest in documentaries, encouraging platforms to invest more in similar content.

However, there are also potential risks for negative growth. Over-reliance on a small set of high-frequency actors could lead to content fatigue among viewers, reducing engagement over time.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

# Get the value counts for each country
country_counts = df['country'].value_counts()

# Select the top N countries (e.g., top 10) for better visualization
top_n_countries = country_counts.head(10)  # Adjust N as needed

# Create a bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x=top_n_countries.index, y=top_n_countries.values)
plt.title('Distribution of Shows by Country (Top 10)')
plt.xlabel('Country')
plt.ylabel('Number of Shows')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart was chosen because it is well suited to this kind of data:

1. Categorical Comparison: It compares discrete categories (countries) against a single metric (number of shows).

2. Clarity: Bar charts make it easy to see which countries have more or fewer shows at a glance.

3. Ranking Visualization: The chart is sorted in descending order, emphasizing the ranking among the top 10 countries.

4. Data Distribution: It highlights disparities — for example, how dominant the U.S. is compared to others.

##### 2. What is/are the insight(s) found from the chart?

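1. The United States Dominates: With over 3,000 shows, the U.S. far exceeds all other countries. This suggests that U.S. content is the most prevalent, likely due to its large entertainment industry and global streaming platforms. Note that this count includes titles whose missing `country` value was filled with the most frequent country during data wrangling, so the U.S. total is somewhat inflated.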

2. India is a Distant Second: India has the second-highest number of shows (around 900+), reflecting its massive film and TV production industry, particularly Bollywood.

3. Steep Drop-Off After Top 2: After the U.S. and India, the number of shows drops significantly. The United Kingdom has fewer than half of India’s count, and the rest have even less.
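
Some `country` entries may list several countries in one comma-separated string, in which case a plain `value_counts()` treats each combination as its own category. The sketch below, on made-up rows, shows one way to count each country individually, following the same split-and-explode approach used for directors in Chart 3.

```python
import pandas as pd

# Made-up rows; co-productions list several countries in one string.
sample = pd.DataFrame({
    "country": [
        "United States",
        "United States, India",
        "India",
        "United Kingdom, United States",
        None,
    ]
})

# Split multi-country strings and count each country once per title.
per_country = (
    sample["country"]
    .dropna()
    .str.split(", ")
    .explode()
    .str.strip()
    .value_counts()
)
print(per_country)
```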

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Positive Business Impact:

1. Content Investment Strategy
2. Localization & Expansion Opportunities
3. Gap Identification for New Markets

⚠️ Potential for Negative Growth:

1. Over-Reliance on U.S. Content
2. Lack of Regional Diversity
3. Neglect of Emerging Content Trends






#### Chart - 6

In [None]:
# Chart - 6 visualization code

# Create a histogram
plt.figure(figsize=(10, 6))
sns.histplot(df['release_year'], bins=20)  # Adjust bins as needed
plt.title('Distribution of Release Years')
plt.xlabel('Release Year')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

1. Distribution of a Numerical Variable: A histogram groups release years into bins and shows how many titles fall in each, which makes it easy to see where the catalog is concentrated.

2. Shape and Skew: It reveals at a glance whether releases are spread evenly over time or clustered in particular periods, and it exposes the long tail of older titles.



##### 2. What is/are the insight(s) found from the chart?

The histogram clearly shows that the majority of content was released after 2010, with a sharp spike between 2015 and 2020. Earlier decades, especially before 2000, have significantly fewer releases, indicating a strong shift toward modern and recently produced content.
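
The concentration of recent releases can be quantified directly. This sketch uses a made-up list of years, not the real column, to compute the share of titles released after 2010 and a per-decade count.

```python
import pandas as pd

# Made-up release years for illustration.
years = pd.Series([1985, 1999, 2005, 2012, 2016, 2017, 2018, 2019, 2019, 2020])

# Share of titles released after 2010, as a percentage.
share_after_2010 = (years > 2010).mean() * 100

# Floor each year to its decade, then count titles per decade.
per_decade = (years // 10 * 10).value_counts().sort_index()

print(f"Released after 2010: {share_after_2010:.1f}%")
print(per_decade)
```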



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing how release years are distributed can support business decisions:

Positive Business Impact:

1. Catalog Positioning: A catalog concentrated in post-2010 releases, and especially 2015 to 2020, can be marketed to viewers who look for recent content.

2. Acquisition Planning: Seeing which periods are thinly covered helps identify where licensing older titles could fill gaps.


Negative Growth Insights:

1. Thin Back Catalog: With significantly fewer titles from before 2000, viewers who want classic films and older series may find little to watch.

2. Recency Dependence: A catalog that leans heavily on recent releases needs a steady supply of new titles to stay fresh, which can raise content costs.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

# Get the value counts for each rating
rating_counts = df['rating'].value_counts()

# Create a donut chart
plt.figure(figsize=(8, 8))
plt.pie(rating_counts.values, labels=rating_counts.index, autopct='%1.1f%%', startangle=90,
        pctdistance=0.85, wedgeprops=dict(width=0.4))  # Create donut shape
plt.title('Distribution of Ratings')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

##### 1. Why did you pick the specific chart?

This donut chart was selected because it clearly shows the proportional distribution of content ratings on the platform in a visually appealing and intuitive format. Here’s why it’s effective:

1. Clear Percentage Breakdown: Each slice displays the percentage share of a rating category, making it easy to compare relative volumes.

2. Categorical Focus: Perfect for discrete data like content ratings (e.g., TV-MA, TV-14), helping quickly identify which age groups are being targeted.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that the majority of the content is rated TV-MA (37%) and TV-14 (24.8%), indicating that a significant portion of the platform's offerings are geared toward mature and teenage audiences. Ratings like TV-PG (10.2%), R (8.7%), and PG-13 (5%) follow, while content suitable for children (like TV-Y and TV-G) represents a much smaller share.
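
Percentage shares like those quoted above can be computed with `value_counts(normalize=True)`. The sketch below uses made-up ratings, and the `audience_map` grouping of ratings into audience bands is an assumption for illustration, not an official classification.

```python
import pandas as pd

# Made-up ratings column.
ratings = pd.Series(["TV-MA", "TV-MA", "TV-14", "TV-PG", "TV-MA", "TV-14", "R", "TV-Y"])

# Percentage share of each rating.
share = ratings.value_counts(normalize=True).mul(100).round(1)
print(share)

# Hypothetical grouping of ratings into broad audience bands.
audience_map = {
    "TV-MA": "Adults", "R": "Adults",
    "TV-14": "Teens", "PG-13": "Teens",
    "TV-PG": "Family", "PG": "Family",
    "TV-Y": "Kids", "TV-Y7": "Kids", "TV-G": "Kids", "G": "Kids",
}
audience_share = ratings.map(audience_map).value_counts(normalize=True).mul(100).round(1)
print(audience_share)
```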

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact by highlighting the dominance of mature-rated content (TV-MA and TV-14), which suggests strong engagement and demand from teen and adult audiences. This allows the business to tailor marketing strategies, recommend content more effectively, and invest in genres that appeal to this demographic for higher retention and satisfaction.

However, a potential negative growth insight is the underrepresentation of child-friendly content (e.g., TV-Y, TV-G, TV-Y7). This could result in missed opportunities in the family and kids segment, limiting the platform's appeal to households with younger viewers.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

# Create a box plot for TV Shows and Movies separately
plt.figure(figsize=(10, 6))
sns.boxplot(x='type', y='duration_int', data=df)
plt.title('Distribution of Duration by Content Type')
plt.xlabel('Content Type')
plt.ylabel('Duration (Minutes/Seasons)')
plt.show()

##### 1. Why did you pick the specific chart?

I chose the boxplot for this analysis because it effectively highlights the distribution, central tendency, spread, and outliers in the duration of content across the two types: Movies and TV Shows. Here's why this chart is ideal:

1. Comparison Across Categories: It visually contrasts the duration patterns between TV Shows and Movies side-by-side.

2. Outlier Detection: It reveals extreme duration values (e.g., exceptionally long movies), which may be of interest for platform curation or anomaly detection.

3. Compact Summary: It summarizes large amounts of data using medians, quartiles, and range, providing insights at a glance.

##### 2. What is/are the insight(s) found from the chart?

The boxplot shows a clear distinction between TV Shows and Movies, but the two boxes are on different scales: `duration_int` holds minutes for movies and the number of seasons for TV shows.

1. Movies show a wide spread of runtimes, with many outliers extending above 150 minutes.

2. TV Shows exhibit very little variation, typically clustering between 1–3 seasons. These values are season counts, not minutes, so they cannot be read as "shorter" than movie runtimes.

Because the units differ, the chart is best read as two separate distributions: most movies fit a single viewing session, while most TV shows on the platform run for only a few seasons.
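
Because movie durations are recorded in minutes and TV show durations in seasons, one way to read the two groups is to summarise each type on its own scale. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows: minutes for movies, season counts for TV shows.
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
    "duration_int": [90, 120, 100, 1, 2, 1],
})

# Summarise each content type separately so the units are never mixed.
summary = sample.groupby("type")["duration_int"].agg(["median", "min", "max"])
print(summary)
```

The same idea applies to plotting: drawing one boxplot per type, each with its own axis label, avoids comparing minutes against seasons.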

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can drive positive business impact by guiding Netflix to invest in preferred content formats like short movies and limited-series shows, enhancing user engagement and satisfaction. However, over-reliance on these formats or repetitive genres may lead to viewer fatigue, limiting long-term retention and growth if not balanced with diverse and deeper content offerings.










#### Chart - 9

In [None]:
# Chart - 9 visualization code

!pip install squarify

import matplotlib.pyplot as plt
import squarify

# Assuming 'df' is your DataFrame and you want to visualize categories created from 'listed_in'

# Get the value counts for each category from the one-hot encoded columns
# Replace with your actual category column names
category_columns = ['International Movies', 'Dramas', 'Comedies', 'International TV Shows', 'TV Dramas', 'Crime TV Shows']

category_counts = df[category_columns].sum().sort_values(ascending=False)

# Create a treemap
plt.figure(figsize=(12, 8))
squarify.plot(sizes=category_counts.values, label=category_counts.index, alpha=.8)
plt.title('Distribution of Shows by Category (Treemap)')
plt.axis('off')  # Hide axis
plt.show()

##### 1. Why did you pick the specific chart?

The treemap of Distribution of Shows by Category was selected because it effectively represents the proportional spread of content genres in a compact visual. This type of chart is ideal for comparing many categories at once, enabling quick identification of which categories (like Dramas, Comedies, or International TV Shows) dominate the platform.

##### 2. What is/are the insight(s) found from the chart?

The treemap chart reveals that Dramas, Comedies, and International TV Shows make up a large portion of the content library, indicating they are among the most available genres. This may reflect viewer preference or a platform strategy centered around emotional storytelling, humor, and global content diversity. Because only six hand-picked genres are plotted, the comparison does not cover the full catalog.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from the treemap chart can help create positive business impact:

1. Content Strategy: Understanding that Dramas, Comedies, and International TV Shows dominate the catalog helps guide future content investments and marketing efforts toward genres with proven popularity.

2. Personalization & Recommendation: Knowing the dominant categories allows platforms to refine recommendation algorithms and better serve user preferences, enhancing viewer satisfaction and retention.

As for negative growth, there are potential risks:

1. Over-saturation: Heavy focus on a few popular genres (like Dramas and Comedies) could lead to content fatigue among users and reduce differentiation in a competitive streaming market.

2. Neglected Categories: Underrepresented genres such as Crime TV Shows or International Movies may miss out on funding or exposure, possibly overlooking untapped viewer segments and innovative content niches.



#### Chart - 10

In [None]:
# Chart - 10 visualization code

import matplotlib.pyplot as plt
import seaborn as sns

# Create a cross-tabulation (pivot table) to count occurrences of each type in each year
type_year_counts = pd.crosstab(df['type'], df['release_year'])

# Create the heatmap
plt.figure(figsize=(20, 4))  # Wide, short canvas: two content types across many release years
sns.heatmap(type_year_counts, cmap="YlGnBu", annot=True, fmt="d",
            annot_kws={'size': 6}, cbar_kws={'label': 'Count'})  # Small annotations to limit overlap
plt.title('Heatmap of Content Type vs. Release Year')
plt.xlabel('Release Year')
plt.ylabel('Content Type')
plt.show()


##### 1. Why did you pick the specific chart?

The heatmap of Content Type vs. Release Year was chosen because it visually captures the temporal evolution and volume of Netflix’s content offerings across both Movies and TV Shows. This chart efficiently highlights production trends over time, helping to identify growth periods, shifts in focus (e.g., from Movies to TV Shows), and strategic patterns.

##### 2. What is/are the insight(s) found from the chart?

The heatmap shows a marked surge in both movie and TV show releases starting around 2015, peaking between 2018 and 2019. Movies consistently outnumber TV shows across the release years, but the number of TV shows has notably increased in the past decade, reflecting a shift toward episodic content. This trend is consistent with Netflix's push into original content and its expansion during the streaming boom era.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can create a positive business impact. The sharp rise in catalog volume, especially after 2015, suggests Netflix was responding to growing market demand. The increasing number of TV shows is consistent with a consumer preference for binge-worthy, episodic content, which can guide future investment in original series to maintain subscriber growth.

On the downside, unchecked content expansion could lead to content saturation, diminishing viewer attention and increasing production costs without proportional returns. If Netflix doesn’t balance quality with quantity, it may face negative growth in the form of viewer churn or brand dilution. Strategic curation based on data-driven viewer preferences is key to avoiding such outcomes.
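
The shift from Movies toward TV Shows discussed above is easier to judge as a share than as raw counts, because total volume also changes from year to year. The following is a minimal sketch using `pd.crosstab` with column-wise normalization; the `sample` frame is hypothetical and only mirrors the `type` and `release_year` columns used in the heatmap.

```python
import pandas as pd

# Hypothetical stand-in for the real `df`, with the same two columns as the heatmap.
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "release_year": [2014, 2014, 2014, 2019, 2019, 2019],
})

# Share of each content type within each release year (every column sums to 1).
type_year_share = pd.crosstab(
    sample["type"], sample["release_year"], normalize="columns"
)

# Row for TV Shows: how much of each year's releases is episodic content.
tv_share = type_year_share.loc["TV Show"]
print(tv_share.round(2))
```

A rising `tv_share` over time would support the claim of a shift toward episodic content even in years when Movies still lead in absolute numbers.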

#### Chart - 11

In [None]:
# Chart - 11 visualization code

# Create a grouped bar chart
plt.figure(figsize=(10, 6))
sns.countplot(x='rating', hue='type', data=df)
plt.title('Distribution of Ratings by Content Type')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.legend(title='Type')
plt.show()

##### 1. Why did you pick the specific chart?

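1.  Clear comparison between two categories: the grouped bars place Movies and TV Shows side by side for each rating.

2.  Highlights rating trends: the tallest bars show at a glance which ratings dominate each content type.

3.  Categorical data visualization: both rating and type are categorical, so a count plot fits them naturally.

4.  Supports business interpretation: the rating mix maps directly to the audience segments the catalog serves.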


##### 2. What is/are the insight(s) found from the chart?

📌 1. Dominance of Mature Content
*   TV-MA (Mature Audiences) is the most frequent rating for both TV Shows and Movies.

*   Indicates Netflix heavily focuses on adult-targeted content.

📌 2. Movies Are More Adult-Centric
*   Movies have significantly more titles rated R, PG-13, and TV-MA compared to TV Shows.

*   Suggests movies cater more to mature or older audiences.



📌 3. TV Shows Cover a Broader Age Range

*   Ratings like TV-14, TV-PG, and TV-Y7 appear more frequently for TV Shows.

*   Implies TV content is more family-friendly or suitable for teens compared to movies.
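
Because Movies outnumber TV Shows in the catalog, raw counts can overstate how movie-heavy a rating is. One way to check the comparisons above is to look at each rating's share within its own content type. This is a minimal sketch on a hypothetical `sample` frame; the `rating` and `type` column names match the plot code, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real `df`; values are illustrative only.
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "rating": ["TV-MA", "R", "TV-14", "TV-14", "TV-MA", "TV-14"],
})

# Raw counts, as drawn by the grouped bar chart.
rating_counts = pd.crosstab(sample["rating"], sample["type"])

# Share of each rating within each content type (every column sums to 1).
rating_share = pd.crosstab(sample["rating"], sample["type"], normalize="columns")

print(rating_counts)
print(rating_share.round(2))
```

In this toy data TV-14 has twice as many Movies as TV Shows by count, yet it makes up the same share of each type, which is the kind of difference the normalized table is meant to expose.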








##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

💼 Positive Business Impacts:

1. Targeted Content Investment


*   Netflix can double down on mature-rated content (TV-MA, R), already its strength, since it is the most common category in the catalog.

*   If viewing data confirms matching demand, this focus could improve ROI on future productions.

❌ Are There Any Insights That Could Lead to Negative Growth?
Yes, one area suggests a potential risk:

1. Over-reliance on Mature Content
Content skewed heavily toward adults may:
*   Alienate younger demographics and families.
*   Increase churn among parents looking for safer content for kids.





#### Chart - 12 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Select numerical columns for correlation analysis
numerical_cols = ['release_year', 'duration_int', 'title_length']  # Add other relevant numerical columns

# Calculate the correlation matrix
correlation_matrix = df[numerical_cols].corr()

# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap of Numerical Variables')
plt.show()

##### 1. Why did you pick the specific chart?

The correlation heatmap is specifically designed to:

1. Quantify relationships: It shows how strongly (or weakly) numerical variables relate to each other using correlation coefficients (ranging from -1 to 1).

2. Provide clarity: Unlike the pair plot, which is more visual and exploratory, this gives precise numerical insights — ideal for quickly spotting whether two variables are worth further investigation.

3. Simplify feature selection: It helps in understanding if any variables are redundant or highly correlated (though in this case, none are).

##### 2. What is/are the insight(s) found from the chart?

🔍 Insights from the Correlation Heatmap

1.  release_year vs duration_int: -0.25

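  *   Weak negative correlation

  *   Suggests that newer content tends to have a lower duration_int. This may reflect shorter films in the streaming era, but if duration_int mixes movie minutes with TV show season counts, the growing share of TV shows among recent titles could produce the same effect.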


2.  duration_int vs title_length: -0.10

  *   Weak negative correlation


  *   Suggests that longer content slightly leans toward shorter titles, but this is a very minor trend and may not be meaningful.


3.  release_year vs title_length: 0.03

  *   Virtually no correlation

  *   Title lengths have remained consistent over time, regardless of the release year — so no clear trend of titles getting longer or shorter.
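
One caveat about the coefficients above: if `duration_int` holds minutes for Movies and season counts for TV Shows (an assumption about how the column was built, not something shown in this section), a pooled correlation can differ sharply from the correlation within each type. The sketch below uses a small synthetic `sample` frame to show how a pooled negative correlation can appear even when each group trends upward on its own.

```python
import pandas as pd

# Synthetic data: older Movies measured in minutes, newer TV Shows in seasons.
sample = pd.DataFrame({
    "type": ["Movie"] * 4 + ["TV Show"] * 4,
    "release_year": [2000, 2005, 2010, 2015, 2016, 2017, 2018, 2019],
    "duration_int": [90, 100, 110, 120, 1, 2, 3, 4],
})

# Pooled correlation, as in the heatmap (both types mixed together).
overall = sample["release_year"].corr(sample["duration_int"])

# Correlation computed separately for each content type.
by_type = {
    name: group["release_year"].corr(group["duration_int"])
    for name, group in sample.groupby("type")
}

print(f"overall: {overall:.2f}")
for name, value in by_type.items():
    print(f"{name}: {value:.2f}")
```

Running the same split on the real `df` would show whether the -0.25 reflects a genuine trend in running time or mainly the mix of Movies and TV Shows.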









#### Chart - 13 - Pair Plot

In [None]:
# Pair Plot visualization code

# Select numerical columns for the pair plot
numerical_cols = ['release_year', 'duration_int', 'title_length']  # Add other relevant numerical columns

# Create the pair plot
sns.pairplot(df[numerical_cols])
plt.suptitle('Pair Plot of Numerical Variables', y=1.02)  # Add a title above the plot
plt.show()

##### 1. Why did you pick the specific chart?

✅ Understanding relationships between multiple numerical variables
In one compact visualization, the pair plot lets us:



*   Explore correlation patterns between variables like release_year, duration, and title_length.

*   Identify distributions (diagonals) and clusters or outliers in each variable.

*   Detect possible trends or lack of linear relationships between pairs (e.g., duration vs release year).



##### 2. What is/are the insight(s) found from the chart?


*   There's no strong linear correlation among the features, but clustering patterns are evident (especially with release years).

*   The distribution of duration shows a concentration around common movie lengths (90–120 mins).

*   Content release is heavily skewed to recent decades — indicating Netflix’s aggressive content acquisition and production strategy.



## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve your business objective, I recommend building a smart content clustering engine using machine learning. By analyzing key features like release year, duration, and title characteristics, this engine can group similar titles together—powering better recommendations, streamlined tagging, and deeper viewer insights. Think of it as giving your content a map, so users never get lost—and always find something they love.
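
As a rough illustration of the clustering engine described above, the sketch below groups titles with K-Means from scikit-learn after standardizing the three numerical features. It is not the project's actual pipeline: the `sample` data is randomly generated, and the choice of two clusters, `n_init=10`, and the random seeds are assumptions made only to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in for the real `df` (two loose groups of titles).
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "release_year": np.concatenate(
        [rng.integers(1980, 2000, 50), rng.integers(2015, 2021, 50)]
    ),
    "duration_int": np.concatenate(
        [rng.integers(90, 130, 50), rng.integers(80, 110, 50)]
    ),
    "title_length": rng.integers(5, 40, 100),
})

features = ["release_year", "duration_int", "title_length"]

# Standardize so that no single feature dominates the distance computation.
scaled = StandardScaler().fit_transform(sample[features])

# Fit K-Means and attach the cluster label to each title.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
sample["cluster"] = kmeans.fit_predict(scaled)

# Per-cluster feature means give a quick profile of each group.
print(sample.groupby("cluster")[features].mean().round(1))
```

On the real catalog, the number of clusters would need to be chosen with a method such as the elbow curve or silhouette score rather than fixed in advance.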










# **Conclusion**

By leveraging clustering techniques on Netflix's content data, we uncovered natural groupings based on attributes like release year, duration, and title length. These clusters reveal insightful patterns that can enhance content discovery, improve personalization, and support smarter content curation. This data-driven approach empowers Netflix to better organize its vast library, recommend titles more effectively, and align content strategies with viewer preferences—turning raw data into viewer delight.









