<a href="https://colab.research.google.com/github/kpallavi111/Ford-go-bike/blob/main/Netfilx_Movies_and_TV_shows.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Netflix Movies and TV Shows Clustering**



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

#### **Introduction**
Netflix has revolutionized the entertainment industry by providing a vast collection of movies and TV shows across different genres and regions. This project aims to perform **Exploratory Data Analysis (EDA)** on the Netflix dataset using Python to uncover insights about the content available on the platform. By leveraging Python libraries such as **pandas, matplotlib, seaborn, and numpy**, we will analyze trends, content distribution, and other key aspects of Netflix’s catalog.

#### **Objectives**
The primary objectives of this project are:
1. **Understanding Content Distribution** – Analyzing the proportion of TV shows vs. movies.
2. **Release Year Trends** – Identifying patterns in content production and addition.
3. **Country-wise Analysis** – Exploring the geographical diversity of Netflix’s content.
4. **Director and Cast Insights** – Finding the most frequent directors and actors.
5. **Genre Analysis** – Examining the most popular genres.
6. **Ratings and Duration** – Understanding the suitability and length of content.
7. **Handling Missing Values** – Cleaning and preprocessing the dataset for better analysis.

#### **Data Cleaning and Preprocessing**
Before performing EDA, we need to clean the dataset:
- **Handling Missing Values** – Using techniques like filling missing values with mode/median or dropping irrelevant rows.
- **Standardizing Formats** – Ensuring consistency in date formats and categorical values.
- **Removing Duplicates** – Checking for duplicate entries and eliminating them.
- **Feature Engineering** – Creating new features such as content age or popularity trends.
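
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a tiny hand-made frame: the rows, the date format, the `content_age` column name, and the reference year 2021 are all assumptions for the example, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical rows that mimic the dataset's columns (not real data)
sample = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "date_added": ["January 1, 2020", "January 1, 2020", " March 15, 2019", None],
    "release_year": [2018, 2018, 2019, 2010],
})

# Removing duplicates: drop rows that are identical in every column
clean = sample.drop_duplicates().reset_index(drop=True)

# Standardizing formats: strip stray whitespace, then parse to datetime;
# missing or unparseable entries become NaT instead of raising an error
clean["date_added"] = pd.to_datetime(
    clean["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# Feature engineering: content age relative to an assumed reference year
REFERENCE_YEAR = 2021
clean["content_age"] = REFERENCE_YEAR - clean["release_year"]

print(clean)
```

Parsing with `errors="coerce"` keeps rows with missing dates in the frame, so they can be handled explicitly later rather than silently dropped.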

#### **Exploratory Data Analysis (EDA)**
##### **1. Content Type Distribution**
Using a bar plot, we analyze the proportion of **movies vs. TV shows** to understand Netflix’s focus.

##### **2. Release Year Trends**
A histogram or line plot helps visualize the number of titles released each year, highlighting trends in content production.

##### **3. Country-wise Analysis**
A bar chart or world map visualization shows the **top countries** producing Netflix content.

##### **4. Director and Cast Insights**
By analyzing the frequency of directors and actors, we can identify the **most influential figures** in Netflix’s catalog.
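
One possible way to count how often each person appears is sketched below. It assumes that a single cell can hold several comma-separated names; the names in the sample are placeholders, not real data.

```python
import pandas as pd

# Hypothetical cast values (illustrative names, not real data)
sample = pd.DataFrame({
    "cast": ["Actor A, Actor B", "Actor B, Actor C", None, "Actor B"],
})

# Split each cell on commas, put one name per row, trim spaces, then count
cast_counts = (
    sample["cast"]
    .dropna()
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)

print(cast_counts.head(10))
```

The same pattern would apply to the `director` column.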

##### **5. Genre Analysis**
A word cloud or bar plot helps visualize the **most common genres** on Netflix.

##### **6. Ratings and Duration**
- A count plot or bar chart can show the **distribution of content ratings**, since ratings are categories rather than numbers.
- A histogram can analyze the **duration of movies and TV shows**.

##### **7. Missing Values Visualization**
Using **missingno** and heatmaps, we visualize missing values and decide on appropriate handling techniques.

#### **Visualization Techniques**
- **Bar Charts** – Content type distribution, country-wise analysis.
- **Histograms** – Release year trends, duration analysis.
- **Word Clouds** – Genre analysis.
- **Heatmaps** – Missing values visualization.
- **Box Plots** – Duration analysis (ratings are categorical, so bar charts suit them better).

#### **Insights and Findings**
Based on the analysis, we can derive key insights such as:
- Netflix has a **higher proportion of movies** compared to TV shows.
- Content production has **increased significantly after 2015**.
- The **USA dominates Netflix’s catalog**, followed by India and the UK.
- **Drama and Comedy** are the most popular genres.
- **TV-MA (Mature) rating** is the most common.
- **Certain directors and actors appear frequently**, indicating strong collaborations.


# **GitHub Link -**


[GITHUB LINK](https://github.com/kpallavi111/Ford-go-bike/blob/main/Netfilx_Movies_and_TV_shows.ipynb)



# **Problem Statement**


**The project aims to explore and analyze Netflix’s content catalog using Python-based Exploratory Data Analysis (EDA). It focuses on understanding trends in movies and TV shows, identifying popular genres, analyzing ratings and durations, and examining country-wise content distribution. By handling missing values, cleaning data, and visualizing key insights, this analysis will reveal patterns in Netflix’s content strategy, helping with recommendations, content planning, and a deeper understanding of streaming trends.**

#### **Define Your Business Objective?**

The primary business objective of this project is to **analyze Netflix’s content catalog** to uncover trends, patterns, and insights that can help improve content strategy, audience engagement, and platform growth. By performing **Exploratory Data Analysis (EDA)**, we aim to:  

1. **Optimize Content Strategy** – Identify the most popular genres, ratings, and content types to guide future content production.  
2. **Enhance User Engagement** – Understand viewing trends to recommend content that aligns with audience preferences.  
3. **Improve Regional Targeting** – Analyze country-wise content distribution to tailor offerings for different markets.  
4. **Identify Gaps in Content** – Detect underrepresented genres or regions to expand Netflix’s catalog strategically.  
5. **Support Business Growth** – Provide data-driven insights to help Netflix make informed decisions on acquisitions, partnerships, and content investments.  

By leveraging Python-based EDA techniques, this project will help Netflix refine its content offerings, improve user satisfaction, and strengthen its position in the competitive streaming industry.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
#mount drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# locate csv file

In [None]:
import os
os.listdir('/content/drive/MyDrive/')

In [None]:
# Load Dataset

In [None]:
df=pd.read_csv('/content/drive/MyDrive/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look

In [None]:
df

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

In [None]:
r=df.shape[0]
c=df.shape[1]
print("Number of rows: ",r)
print("Number of columns: ",c)

### Dataset Information

In [None]:
# Dataset Info

In [None]:
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

In [None]:
df.duplicated().value_counts()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

In [None]:
df.isnull().sum()

In [None]:
# Visualizing the missing values

In [None]:
df.isnull().sum().plot(kind='bar',figsize=(10,6),color='red')
plt.xlabel("Columns",fontweight="bold",fontsize=12)
plt.ylabel("Missing Values",fontweight="bold",fontsize=12)
plt.title("Missing values in Each column")

# get the bar container object
ax = plt.gca() # gca() gets the current axes object of the chart
bars = ax.containers # The 'containers' property gives you access to bar containers that hold the bar plot objects
#Iterate over each bar object in the containers
for bar in bars[0]:
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height(), str(int(bar.get_height())),
             ha='center', va='bottom', fontsize=12, fontweight='bold')

plt.show()

### What did you know about your dataset?

This is a Netflix dataset stored as a CSV file with 7,787 rows and 12 columns.
According to `df.info()`, every column has the object data type except `release_year`, which is an integer. The dataset does not contain any duplicate rows. Missing values occur in `director` (2,389), `cast` (718), `country` (507), `date_added` (10), and `rating` (7). This is the basic information I got from my dataset.
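
Raw counts are easier to judge as a share of all rows; for example, 2,389 missing directors out of 7,787 rows is about 30.7%. A small sketch of such a summary, using a hypothetical frame rather than the real dataset:

```python
import pandas as pd

# Hypothetical frame with gaps (not real data)
sample = pd.DataFrame({
    "director": ["X", None, None, "Y"],
    "rating": ["PG", "R", None, "TV-MA"],
    "release_year": [2018, 2019, 2020, 2021],
})

# Count and percentage of missing values per column, largest first
missing = pd.DataFrame({
    "missing": sample.isnull().sum(),
    "percent": (sample.isnull().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(missing)
```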

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

In [None]:
df.columns.to_list()

In [None]:
# Dataset Describe

In [None]:
df.describe()

### Variables Description

The dataset contains the following columns:
- **show_id** – Unique identifier for each title.
- **type** – Whether the content is a movie or TV show.
- **title** – Name of the content.
- **director** – Director(s) of the content.
- **cast** – Actors featured in the content.
- **country** – Country of origin.
- **date_added** – Date when the content was added to Netflix.
- **release_year** – Year of release.
- **rating** – Content rating (e.g., PG, R, TV-MA).
- **duration** – Length of the content (minutes for movies, seasons for TV shows).
- **listed_in** – Genre(s) of the content.
- **description** – Brief summary of the content.
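
Because `duration` mixes two units, it can help to store the number and the unit separately. The sketch below assumes strings shaped like "90 min" and "2 Seasons"; the exact format and the new column names `duration_value` and `duration_unit` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical duration strings (not real data)
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "duration": ["90 min", "1 Season", "123 min", "4 Seasons"],
})

# Capture the number and the unit as two named groups
parts = sample["duration"].str.extract(r"(?P<value>\d+)\s*(?P<unit>\w+)")
sample["duration_value"] = parts["value"].astype(int)

# Normalise "Season" and "Seasons" to a single label
sample["duration_unit"] = parts["unit"].str.lower().str.rstrip("s")

print(sample)
```

Keeping the unit makes it possible to analyse minutes and seasons on separate axes later.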

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

In [None]:
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

In [None]:
# Fill null values: use the mode for director, cast, and country; use "NA" for date_added and "Unknown" for rating.

In [None]:
df['director'] = df['director'].fillna(df['director'].mode()[0])

In [None]:
df['cast'] = df['cast'].fillna(df['cast'].mode()[0])

In [None]:
df['country'] = df['country'].fillna(df['country'].mode()[0])

In [None]:
df['date_added'] = df['date_added'].fillna("NA")

In [None]:
df['rating'] = df['rating'].fillna("Unknown")

In [None]:
df

In [None]:
df.isnull().sum()

### What all manipulations have you done and insights you found?

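I filled the null values in `director`, `cast`, and `country` with each column's mode, replaced missing `date_added` values with "NA", and replaced missing `rating` values with "Unknown". After these steps, no column contains null values.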
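
One side effect of mode filling is worth keeping in mind: every gap is credited to the most frequent value, which inflates that value's count in later frequency charts. The sketch below compares it with an explicit placeholder. The placeholder approach is an alternative shown for comparison, not the method used in this notebook, and the data is hypothetical.

```python
import pandas as pd

# Hypothetical director column with gaps (not real data)
directors = pd.Series(["Dir A", "Dir A", "Dir B", None, None, None])

# Approach used above: fill gaps with the most frequent value
filled_mode = directors.fillna(directors.mode()[0])

# Alternative for comparison: mark gaps with an explicit placeholder
filled_unknown = directors.fillna("Unknown")

print(filled_mode.value_counts())
print(filled_unknown.value_counts())
```

With the placeholder, the real counts stay unchanged and the missing share remains visible as its own category.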

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### **Univariate Analysis (Single Variable)**

#### Chart - 1

In [None]:
# Chart - 1 visualization code

Bar Chart - Content Type Distribution (Movies vs. TV Shows)

In [None]:
df['type'].value_counts().plot(kind='bar', color=['skyblue', 'lightcoral'])
plt.xlabel("Content Type")
plt.ylabel("Count")
plt.title("Distribution of Movies vs. TV Shows")
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is ideal for visualizing categorical variables with discrete values, making it easy to compare the number of movies vs. TV shows.

##### 2. What is/are the insight(s) found from the chart?

This graph highlights whether Netflix’s library is more focused on movies or TV shows, helping businesses understand content availability trends.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

Histogram - Release Year Distribution

In [None]:
df['release_year'].plot(kind='hist', bins=30, color='green', alpha=0.7)
plt.xlabel("Release Year")
plt.ylabel("Count")
plt.title("Number of Movies and Shows Released Over Time")
plt.show()

##### 1. Why did you pick the specific chart?

Histograms are perfect for analyzing continuous variables like release year, helping identify trends in content production over time.


##### 2. What is/are the insight(s) found from the chart?

Shows the distribution of content release across different years. Peaks indicate periods of high content production.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

Pie Chart - Ratings Distribution

In [None]:
df['rating'].value_counts().plot(kind='pie', autopct='%1.1f%%', colors=['red', 'blue', 'yellow', 'green'])
plt.title("Distribution of Ratings in Netflix")
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart helps visualize proportions, making it easy to see which ratings are most common on Netflix.

##### 2. What is/are the insight(s) found from the chart?

Provides an overview of the target audience preferences (e.g., more mature content or family-friendly).

#### Chart - 4

In [None]:
# Chart - 4 visualization code

Bar Chart - Top 10 Countries Producing Content

In [None]:
df['country'].value_counts().head(10).plot(kind='bar', color='purple')
plt.xlabel("Country")
plt.ylabel("Count")
plt.title("Top 10 Content-Producing Countries")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is useful for ranking categorical variables like country names.

##### 2. What is/are the insight(s) found from the chart?

Helps understand which countries contribute the most to Netflix’s catalog and geographical diversity in content.
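
If a title lists several countries in one cell, a plain `value_counts()` treats each combination as its own category. The sketch below assumes comma-separated country values and uses hypothetical rows to show the difference once each country is counted separately.

```python
import pandas as pd

# Hypothetical country values (not real data)
sample = pd.DataFrame({
    "country": ["United States", "United States, India", "India", "United Kingdom"],
})

# Raw counts: "United States, India" is treated as a separate category
raw_counts = sample["country"].value_counts()

# Per-country share (%) after splitting co-productions into one row per country
country_share = (
    sample["country"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts(normalize=True)
    .mul(100)
    .round(1)
)

print(raw_counts)
print(country_share)
```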

#### Chart - 5

In [None]:
# Chart - 5 visualization code

Histogram - Duration Distribution

In [None]:
# Extract numeric values from the 'duration' column, handle potential errors if the column is already numeric
try:
    # The raw string avoids an invalid-escape warning; expand=False returns a Series
    df['duration'] = df['duration'].str.extract(r'(\d+)', expand=False).astype(float)
except AttributeError:
    # If 'duration' is already numeric, skip the extraction and proceed
    print("Duration column is already numeric. Skipping extraction.")

# Create the histogram
df['duration'].plot(kind='hist', bins=20, color='cyan', alpha=0.7)
plt.xlabel("Duration")
plt.ylabel("Frequency")
plt.title("Distribution of Movie/TV Show Duration")
plt.show()

##### 1. Why did you pick the specific chart?

Histograms effectively show frequency distributions for numerical data like duration.


##### 2. What is/are the insight(s) found from the chart?

Helps analyze common durations of Netflix content. Because movie durations are measured in minutes and TV show durations in seasons, the two types should be read separately rather than compared directly on one axis.

### **Bivariate Analysis (Two Variables)**

#### Chart - 6

In [None]:
# Chart - 6 visualization code

Scatter Plot - Release Year vs. Duration

In [None]:
import seaborn as sns
sns.scatterplot(x='release_year', y='duration', data=df, color='blue', alpha=0.6)
plt.xlabel("Release Year")
plt.ylabel("Duration")
plt.title("Content Duration Across Release Years")
plt.show()

##### 1. Why did you pick the specific chart?

Scatter plots are useful for spotting trends or anomalies in numerical relationships.

##### 2. What is/are the insight(s) found from the chart?

Helps detect whether newer Netflix content tends to have longer or shorter durations.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

Count Plot - Ratings by Content Type

In [None]:
# 'rating' is categorical (e.g. TV-MA, PG), so count titles per rating instead of drawing a box plot
sns.countplot(x='rating', hue='type', data=df, palette='coolwarm')
plt.xlabel("Rating")
plt.ylabel("Count")
plt.title("Distribution of Ratings by Movies and TV Shows")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

A grouped count plot is suited to comparing category frequencies between groups. A box plot needs a numeric variable, and `rating` is a categorical label.


##### 2. What is/are the insight(s) found from the chart?

Shows which maturity ratings are most common for movies and for TV shows.
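
Since `rating` is a categorical label, a cross-tabulation is one way to compare movies and TV shows numerically. The sketch below uses hypothetical rows and shows each rating's share within its content type.

```python
import pandas as pd

# Hypothetical rows (not real data)
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "rating": ["TV-MA", "PG", "TV-MA", "TV-MA", "TV-14"],
})

# Rows: content type; columns: rating; cells: share within each type (%)
rating_share = (
    pd.crosstab(sample["type"], sample["rating"], normalize="index")
    .mul(100)
    .round(1)
)

print(rating_share)
```

Normalising by row removes the effect of one type simply having more titles than the other.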


#### Chart - 8

In [None]:
# Chart - 8 visualization code

Bar Chart - Average Duration by Country

In [None]:
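# Sort so the chart shows the 10 countries with the highest average, not the first 10 alphabetically
df.groupby('country')['duration'].mean().sort_values(ascending=False).head(10).plot(kind='bar', color='gold')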
plt.xlabel("Country")
plt.ylabel("Average Duration")
plt.title("Average Movie/TV Show Duration by Country")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts are effective for comparing average values across categories.

##### 2. What is/are the insight(s) found from the chart?

Highlights how content duration varies by country.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

Heatmap - Release Year vs. Rating

In [None]:
import seaborn as sns
sns.heatmap(df.pivot_table(index='release_year', columns='rating', aggfunc='size', fill_value=0), cmap='coolwarm')
plt.title("Heatmap of Release Year vs. Ratings")
plt.show()

##### 1. Why did you pick the specific chart?

Heatmaps help visualize large data matrices, making it easy to see trends.

##### 2. What is/are the insight(s) found from the chart?

Shows which ratings were more frequent in different years.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

Line Chart - Content Released by Year

In [None]:
df.groupby(['release_year', 'type']).size().unstack().plot(kind='line', figsize=(10,5))
plt.xlabel("Release Year")
plt.ylabel("Count")
plt.title("Movies vs. TV Shows Released Over the Years")
plt.show()

##### 1. Why did you pick the specific chart?

Line charts are useful for visualizing time-based trends.

##### 2. What is/are the insight(s) found from the chart?

Shows whether more movies or TV shows in the catalog were released in each year. Because the chart groups by `release_year`, it reflects when titles were released, not when Netflix added them.
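
The chart above groups by `release_year`. To look at when titles were added to the platform, the `date_added` column would need to be parsed first. The sketch below is one way to do that, assuming dates shaped like "January 1, 2020" and the "NA" placeholder used during data wrangling; the rows and the `year_added` column name are hypothetical.

```python
import pandas as pd

# Hypothetical rows (not real data)
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "date_added": ["January 1, 2020", "March 15, 2019", "NA", "July 4, 2020"],
})

# Parse the date; placeholders such as "NA" become NaT
added = pd.to_datetime(
    sample["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)
sample["year_added"] = added.dt.year

# Count titles per year added and content type, skipping rows without a date
added_per_year = (
    sample.dropna(subset=["year_added"])
    .astype({"year_added": int})
    .groupby(["year_added", "type"])
    .size()
    .unstack(fill_value=0)
)

print(added_per_year)
```

Calling `.plot(kind='line')` on the resulting table would give a chart comparable to the one above, but indexed by the year of addition.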

### **Multivariate Analysis (Three or More Variables)**

#### Chart - 11

In [None]:
# Chart - 11 visualization code

Bubble Chart - Release Year, Duration, and Rating

In [None]:
sns.scatterplot(x='release_year', y='duration', hue='rating', size='duration', data=df, alpha=0.6)
plt.xlabel("Release Year")
plt.ylabel("Duration")
plt.title("Bubble Chart of Release Year, Duration, and Ratings")
plt.show()

##### 1. Why did you pick the specific chart?

Bubble charts add a third dimension (size) for enhanced insights.

##### 2. What is/are the insight(s) found from the chart?

Shows how ratings, duration, and release year interact.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

Stacked Bar Chart - Content Type by Country and Rating

In [None]:
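# Keep only the 10 most frequent country values so the x-axis stays readable
top_countries = df['country'].value_counts().head(10).index
(df[df['country'].isin(top_countries)]
   .groupby(['country', 'type'])['rating'].count()
   .unstack()
   .plot(kind='bar', stacked=True, figsize=(10,5)))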
plt.xlabel("Country")
plt.ylabel("Count")
plt.title("Stacked Bar Chart of Content Type by Country and Rating")
plt.show()

##### 1. Why did you pick the specific chart?

Stacked bar charts are ideal for showing distribution across multiple dimensions.

##### 2. What is/are the insight(s) found from the chart?

Shows how different countries contribute to Netflix’s ratings and content types.

#### Chart - 13

In [None]:
# Chart - 13 visualization code

3D Scatter Plot - Release Year, Duration, and Country

In [None]:
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(df['release_year'], df['duration'], df['country'].astype('category').cat.codes, c='r', marker='o')
ax.set_xlabel('Release Year')
ax.set_ylabel('Duration')
ax.set_zlabel('Country')
plt.title("3D Scatter Plot of Release Year, Duration, and Country")
plt.show()

##### 1. Why did you pick the specific chart?

3D plots help visualize three variables in a single view.

##### 2. What is/are the insight(s) found from the chart?

Displays variation in content duration across years and countries.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Use the numeric columns of the loaded dataset (release_year and, after the
# extraction in Chart 5, duration) instead of overwriting df with sample data
numeric_df = df.select_dtypes(include='number')

# Compute correlation matrix
corr_matrix = numeric_df.corr()

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap of Netflix Dataset")
plt.show()

##### 1. Why did you pick the specific chart?

- Shows relationships between multiple numerical variables at once.
- Highlights strong positive or negative correlations (e.g., is duration related to release year?).
- Helps in feature selection for predictive modeling.


##### 2. What is/are the insight(s) found from the chart?

- Positive correlation (closer to +1) means variables increase together.
- Negative correlation (closer to -1) means one variable decreases as the other increases.
- Near-zero correlation means no strong relationship exists.
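
To make the interpretation concrete, the sketch below computes the Pearson coefficient by hand and compares it with the pandas result. The numbers are hypothetical and chosen only to illustrate the formula.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric values (not real data)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective means
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same coefficient (Pearson is the default method)
r_pandas = x.corr(y)

print(round(r_manual, 3), round(r_pandas, 3))
```

Each cell of a correlation heatmap holds one such coefficient for a pair of columns.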


#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Use the numeric columns of the loaded dataset instead of overwriting df
# with sample data; drop rows with gaps so every panel uses the same titles
numeric_df = df.select_dtypes(include='number').dropna()

# Create pairplot
sns.pairplot(numeric_df, diag_kind='kde', plot_kws={'alpha':0.6})
plt.suptitle("Pairplot of Netflix Dataset Variables", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

- Shows relationships between multiple numerical variables in a single visualization.
- Helps detect correlations between variables like release year, duration, and ratings.
- Displays both scatter plots and density plots, making it easy to analyze distributions.


##### 2. What is/are the insight(s) found from the chart?

- Positive correlation (upward trend in scatter plots) suggests variables increase together.
- Negative correlation (downward trend) indicates one variable decreases as the other increases.
- Density plots on the diagonal show the distribution of individual variables.


## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

### **Suggested Solutions to Achieve the Business Objective**  

To meet the **business objective** of optimizing Netflix’s content strategy, audience engagement, and platform growth through **Exploratory Data Analysis (EDA)**, here are key **solutions** the client can implement:

### **1. Data-Driven Content Strategy**
**Identify Popular Genres & Ratings:**  
   - Use data insights to determine the most-watched genres and highest-rated content.
   - Prioritize investment in trending genres to attract more subscribers.  
**Optimize Release Timing:**  
   - Analyze peak content release periods and align new releases with user demand patterns.  
**Country-Specific Content Creation:**  
   - Identify top-performing countries for localized content strategies.  

### **2. Enhancing User Engagement & Recommendations**
**Improve Personalized Recommendations:**  
   - Use machine learning models to suggest content based on user preferences.  
**Optimize Watch Time Analysis:**  
   - Analyze duration trends to create content that aligns with user engagement levels.  
**Enhance Ratings-Based Filtering:**  
   - Suggest content based on viewers’ ratings preferences.  

### **3. Addressing Missing Data & Data Quality**
**Handling Missing Information:**  
   - Use intelligent **data-filling techniques** to clean gaps in director, cast, and country data.  
**Standardizing Data Formats:**  
   - Ensure uniform data representation for better analytics and modeling.  

### **4. Strengthening Market Expansion Strategies**
**Target Emerging Markets:**  
   - Identify regions with growing demand and produce content suited for their audiences.  
**Partnerships with Leading Directors & Actors:**  
   - Use insights on **popular creators** to form collaborations for high-impact productions.  

### **5. Future AI & Predictive Modeling Integrations**
**Predict Content Popularity:**  
   - Develop predictive models using **historical engagement trends** to forecast content success.  
**Automated Metadata Tagging:**  
   - Use AI-based classification to enhance Netflix’s content categorization for search optimization.  


# **Conclusion**

### **Final Thoughts on the Netflix Dataset Project**  

Netflix is one of the biggest streaming platforms, offering a vast collection of movies and TV shows. This project helped break down its content by analyzing trends, genres, ratings, durations, and country-wise contributions.  

#### **What We Discovered**  
- **Movies vs. TV Shows** → Netflix has more movies than TV shows.  
- **Popular Genres** → Drama and Comedy dominate its catalog.  
- **Country Contributions** → The USA leads in content production, with India and the UK following.  
- **Ratings & Duration** → TV-MA (Mature) content is most common, and movies typically last 90–120 minutes.  

#### **Why This Matters**  
By understanding these patterns, Netflix can make **data-driven decisions** to improve recommendations, optimize content creation, and expand into new markets. Insights like **which genres perform best** or **which regions demand specific content** will shape future strategies.  

#### **What’s Next?**  
This analysis can be expanded further by using **machine learning** to predict content popularity, **enhancing personalized recommendations**, and **identifying emerging trends** to stay ahead in the competitive streaming industry.  

Netflix thrives by knowing what viewers love, and with the power of data it can keep delivering the best entertainment experience.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***

# **THANK YOU !!**