<a href="https://colab.research.google.com/github/kavyatejaswini24/Netflix_EDA/blob/main/Netflix_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  **Netflix Content Strategy Analysis and Clustering**



# **Project Summary -**
This project analyzes the Netflix content library using a dataset of movies and TV shows available up to 2019. The starting point is a reported strategic shift in Netflix's content acquisition: since 2010, the number of TV shows has increased significantly while the number of movie titles has decreased.

The project will address key strategic questions through a multi-stage approach:

1. **Exploratory Data Analysis (EDA):** Perform in-depth analysis to understand the content distribution, genres, ratings, and release trends within the dataset.

2. **Global Content Strategy:** Investigate and quantify the types of content available across different countries to uncover regional content preferences and strategic deployment patterns.

3. **Content Type Trend Analysis:** Analyze the reported trend of increasing focus on TV shows versus movies to validate this strategic shift and understand its implications.

4. **Content Clustering:** Apply machine learning techniques, such as clustering algorithms (e.g., K-Means or Hierarchical Clustering), to group similar content based on text-based features (like descriptions, cast, or listed genres). This clustering can reveal underlying, non-obvious content categories that may be useful for a recommendation engine or for future content acquisition strategies.

The results should provide actionable insights into Netflix's evolving content strategy, the composition of its global catalog, and the inherent similarities within its programming. These findings can be further combined with external data (like IMDb or Rotten Tomatoes ratings) for richer analysis.

# **GitHub Link -** https://github.com/kavyatejaswini24/Netflix_EDA


# **Problem Statement**
The core problem is to understand, quantify, and categorize the implications of Netflix's evolving content strategy on its catalog composition, particularly the reported shift from prioritizing movies to aggressively increasing the volume of TV shows.

Specifically, this project seeks to address the following knowledge gaps:

1. **Strategic Validation and Trend Analysis:** The initial report describes a significant shift (movies decreasing, TV shows tripling), but the exact dynamics of this trend, including the peak years of change and the relative volume mix, still need to be validated and analyzed through EDA.

2. **Global Content Strategy Efficacy:** Netflix's global content footprint is massive. This requires investigating which kinds of content (e.g., genre, type, rating) are concentrated in which countries, to determine whether localized content acquisition strategies are being pursued.

3. **Content Discoverability and Classification:** The existing genre classifications (listed_in) may be too broad or subjective. The final and most complex task is to apply unsupervised machine learning (clustering) to the text-based features of the content to discover inherent, non-obvious groupings of similar titles. Such groupings could help Netflix improve recommendation accuracy and guide future content investment by identifying niche genres that are currently underserved or highly successful.

In essence, the project aims to transform raw catalog data into actionable strategic intelligence regarding content acquisition, international market focus, and content classification.


# **General Guidelines -**
The project guidelines define the following four key phases that must be executed to meet the overall objective:

1. Exploratory Data Analysis (EDA)
   This initial phase is mandatory for cleaning the dataset and establishing foundational metrics.

   Data Preparation: Crucial steps include handling missing values (especially in categorical features like director, cast, and country), correcting data types, and creating new time-based features (e.g., year_added).

   Distribution Analysis: Analyzing the basic composition of the catalog, including the overall ratio of movies vs. TV shows, the most common ratings (e.g., TV-MA, PG-13), and the distribution of content duration.

   Categorical Deep Dive: Identifying the Top 10 Genres and the Top 10 Content-Producing Directors/Cast to understand where Netflix's content investment is focused.

2. Global Content Strategy Analysis
   This phase focuses on the geographic distribution of content to understand Netflix's international footprint.

   Country Contribution: Quantifying the total volume of content originating from key countries (e.g., US, India, UK).

   Content Type by Region: Analyzing the specific type of content (e.g., are US titles mostly TV Shows while Indian titles are mostly Movies?) to uncover regional acquisition strategies.

   Genre Concentration: Identifying the dominant genres within major international content sources.

3. Content Type Trend Analysis
   This is a direct validation of the initial hypothesis regarding Netflix's shifting strategy.

   Annual Acquisition Trend: Generating a time-series visualization (e.g., a line graph) of the number of Movies and TV Shows added to the platform each year since 2010.

   Validation of Shift: Checking visually whether, and from which year, the number of TV shows added begins to consistently approach or surpass the number of movies added, which would support the reported strategic shift.

4. Clustering Similar Content (Machine Learning)
   This is the advanced, machine learning component of the project, aimed at finding hidden patterns.

   Text Feature Preparation: Combining and preprocessing relevant text fields (description, listed_in, cast, director) into a single string for each title. This typically involves cleaning, stemming/lemmatizing, and removing stop words.

   Vectorization: Converting the text data into a numerical format suitable for clustering, most commonly using TF-IDF (Term Frequency-Inverse Document Frequency).

   Clustering Algorithm: Applying an unsupervised learning technique like K-Means, Hierarchical Clustering, or DBSCAN to group titles based on the similarity of their numerical feature vectors.

   Cluster Interpretation: Analyzing the resulting clusters to assign meaningful labels (e.g., "Gritty Sci-Fi Thrillers," "International Comedies") and derive strategic insights for content acquisition and recommendation.

These guidelines ensure the project delivers both descriptive statistical analysis and exploratory, strategy-oriented insights from unsupervised learning.
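
The clustering phase (text preparation, TF-IDF vectorization, K-Means, and a quality check) can be sketched on a handful of toy strings. This is a minimal illustration under stated assumptions, not the project's final pipeline: the `docs` list, the `clean_text` helper, and the choice of `k = 3` are all hypothetical, and the cleaning step is a simple stand-in for fuller stemming or lemmatization.

```python
import re

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy strings standing in for the combined text fields
# (description + listed_in + cast + director) of the real catalog.
docs = [
    "A detective hunts a serial killer in a dark crime thriller",
    "Police crime drama about a murder investigation and a killer",
    "Stand-up comedy special with jokes about family life",
    "A comedian performs a hilarious stand-up comedy set",
    "Nature documentary exploring ocean wildlife and coral reefs",
    "Wildlife documentary following animals across the ocean",
]


def clean_text(text):
    """Lowercase and keep letters only (a stand-in for fuller NLP cleaning)."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


# TF-IDF turns each cleaned string into a sparse numeric vector;
# common English stop words are dropped during tokenization.
vectorizer = TfidfVectorizer(preprocessor=clean_text, stop_words="english")
X = vectorizer.fit_transform(docs)

# k = 3 is an assumption for this toy corpus; on real data, compare several k.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Silhouette score ranges from -1 to 1; higher means better-separated clusters.
score = silhouette_score(X, labels)
print("Cluster labels:", labels)
print("Silhouette score:", round(float(score), 3))
```

On the real dataset, the same steps would run over one combined string per title, and the silhouette score (or an elbow plot of inertia) would be compared across several values of `k` before interpreting the clusters.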

# ***Let's Begin!***

## ***1. Know Your Data***

### Import Libraries

In [16]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# --- Libraries for Clustering (Machine Learning / NLP) ---
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA # For dimensionality reduction

# --- Libraries for Text Cleaning ---
# Note: when running locally, download the required NLTK data once:
# import nltk; nltk.download("stopwords"); nltk.download("wordnet")
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

### Dataset Loading

In [45]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [44]:
# Load Dataset
import pandas as pd

def load_and_clean_data(file_path):
    """
    Loads the Netflix dataset from a CSV file and performs essential initial cleaning.

    Args:
        file_path (str): The path to the CSV file.

    Returns:
        pd.DataFrame or None: The cleaned DataFrame or None if the file is not found.
    """
    print("--- 1. Data Loading and Cleaning ---")
    try:
        # Load the dataset using the pandas library
        df = pd.read_csv(file_path)
    except FileNotFoundError:
        print(f"Error: The file {file_path} was not found.")
        return None

    # Fill missing values for key categorical columns with 'Unknown'
    for col in ['director', 'cast', 'country']:
        df[col] = df[col].fillna('Unknown')

    # Drop rows where critical metadata is missing
    df.dropna(subset=['rating', 'date_added'], inplace=True)

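    # Feature Engineering: Convert date_added and extract year for trend analysis.
    # Strip any stray whitespace first so that every date string parses consistently;
    # values that still fail to parse become NaT and are dropped.
    df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')
    df.dropna(subset=['date_added'], inplace=True)
    df['year_added'] = df['date_added'].dt.year.astype(int)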

    print("Initial cleaning complete.")
    return df

# Example usage (assuming the file path is defined elsewhere in the project)
# FILE_PATH = 'NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv'
# netflix_df = load_and_clean_data(FILE_PATH)


### Dataset First View

In [19]:
# Dataset First Look
def get_dataset_first_view(df):
    """
    Displays the first 5 rows and the DataFrame information summary.
    This is essential for the initial EDA.
    """
    print("\n--- Initial Dataset View (First 5 Rows) ---")
    print(df.head())
    print("\n--- DataFrame Information (Data Types and Missing Values) ---")
    df.info()

### Dataset Rows & Columns count

In [20]:
# Dataset Rows & Columns count

### Dataset Information

In [21]:
# Dataset Info

#### Duplicate Values

In [22]:
# Dataset Duplicate Value Count
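
A duplicate check might look like the sketch below. It runs on a hypothetical mini-catalog (`toy_df`) rather than the real CSV, and the `title` and `type` column names are assumptions about the dataset's schema.

```python
import pandas as pd

# Hypothetical mini-catalog; column names are assumed, not taken from the real CSV.
toy_df = pd.DataFrame({
    "title": ["Show A", "Movie B", "Show A", "Movie C"],
    "type": ["TV Show", "Movie", "TV Show", "Movie"],
})

# duplicated() flags each repeat of an earlier row; the first copy is not flagged.
duplicate_count = int(toy_df.duplicated().sum())
print("Duplicate rows:", duplicate_count)

# drop_duplicates() keeps the first occurrence of each row by default.
deduped_df = toy_df.drop_duplicates()
print("Rows after de-duplication:", len(deduped_df))
```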

#### Missing Values/Null Values

In [23]:
# Missing Values/Null Values Count
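
Missing values can be summarized per column as both a count and a percentage. The sketch below uses a hypothetical `toy_df`; the `director`, `country`, and `rating` columns follow the names used in the loading function, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with a few deliberately missing entries.
toy_df = pd.DataFrame({
    "director": ["Jane Doe", np.nan, np.nan, "John Roe"],
    "country": ["India", "United States", np.nan, "United Kingdom"],
    "rating": ["TV-MA", "PG-13", "TV-14", "TV-MA"],
})

# isna().sum() counts missing cells per column; isna().mean() gives the fraction.
missing_summary = pd.DataFrame({
    "missing_count": toy_df.isna().sum(),
    "missing_pct": (toy_df.isna().mean() * 100).round(1),
}).sort_values("missing_count", ascending=False)
print(missing_summary)
```

Note that on the real data this check is most informative when run before `load_and_clean_data` fills `director`, `cast`, and `country` with 'Unknown', since those cells no longer count as missing afterwards.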

In [24]:
# Visualizing the missing values
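
One common way to visualize missing values is a heatmap of the null mask, where each cell is a row/column position in the DataFrame. The sketch below assumes a hypothetical `toy_df` and uses seaborn, which the notebook already imports; the colormap and figure size are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical rows with a few deliberately missing entries.
toy_df = pd.DataFrame({
    "director": ["Jane Doe", np.nan, np.nan, "John Roe"],
    "country": ["India", "United States", np.nan, "United Kingdom"],
    "rating": ["TV-MA", "PG-13", "TV-14", "TV-MA"],
})

# 1 where a value is missing, 0 otherwise; numeric input keeps the heatmap simple.
missing_mask = toy_df.isna().astype(int)

fig, ax = plt.subplots(figsize=(6, 3))
sns.heatmap(missing_mask, cbar=False, cmap="viridis", ax=ax)
ax.set_title("Missing values by column (toy data)")
ax.set_xlabel("Column")
ax.set_ylabel("Row index")
plt.tight_layout()
plt.show()
```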

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [25]:
# Dataset Columns

In [26]:
# Dataset Describe

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [27]:
# Check Unique Values for each variable.
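
A unique-value check can report the cardinality of every column and then list the actual values for the low-cardinality ones. The `toy_df` below is hypothetical, and `type`, `rating`, and `country` are assumed column names.

```python
import pandas as pd

# Hypothetical mini-catalog for illustration only.
toy_df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "rating": ["TV-MA", "TV-14", "TV-MA", "PG-13"],
    "country": ["India", "United States", "India", "India"],
})

# nunique() returns the number of distinct non-null values per column.
unique_counts = toy_df.nunique().sort_values()
print(unique_counts)

# For low-cardinality columns, the values themselves are more useful than the count.
for col in ["type", "rating"]:
    print(col, "->", toy_df[col].unique().tolist())
```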

## 3. ***Data Wrangling***

### Data Wrangling Code

In [28]:
# Write your code to make your dataset analysis ready.
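
A wrangling step that the country analysis is likely to need is splitting multi-valued fields. The sketch below assumes that `country` can hold several comma-separated names in one cell (an assumption about the dataset's format) and uses a hypothetical `toy_df` to show the split-and-explode pattern.

```python
import pandas as pd

# Hypothetical rows; the first title is a co-production listed under two countries.
toy_df = pd.DataFrame({
    "title": ["Movie A", "Show B", "Movie C"],
    "type": ["Movie", "TV Show", "Movie"],
    "country": ["United States, India", "United Kingdom", "India"],
})

# Split the comma-separated string into a list, then give each country its own row.
exploded = toy_df.assign(country=toy_df["country"].str.split(",")).explode("country")
exploded["country"] = exploded["country"].str.strip()

# Titles per country, broken down by content type.
country_type_counts = (
    exploded.groupby(["country", "type"]).size().unstack(fill_value=0)
)
print(country_type_counts)
```

With this approach a co-production is counted once for each country it lists, so per-country totals can sum to more than the number of titles. The same pattern would apply to `listed_in` and `cast`.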

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [29]:
# Chart - 1 visualization code
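
One candidate for this chart is the annual acquisition trend described in the guidelines: a line graph of titles added per year, split by content type. The sketch below is built on a hypothetical `toy_df`; `year_added` matches the feature created in `load_and_clean_data`, while `type` is an assumed column name and the counts are invented.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows: one row per title, with its type and the year it was added.
toy_df = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show", "Movie"],
    "year_added": [2016, 2016, 2016, 2017, 2017, 2018, 2018],
})

# Count titles per (year, type), then pivot the types into columns.
yearly_counts = (
    toy_df.groupby(["year_added", "type"]).size().unstack(fill_value=0)
)

ax = yearly_counts.plot(kind="line", marker="o", figsize=(7, 4))
ax.set_title("Titles added per year by content type (toy data)")
ax.set_xlabel("Year added")
ax.set_ylabel("Number of titles")
ax.set_xticks(yearly_counts.index)
plt.tight_layout()
plt.show()
```

On the real data, filtering to `year_added >= 2010` before grouping would match the period named in the guidelines.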

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 2

In [30]:
# Chart - 2 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 3

In [31]:
# Chart - 3 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 4

In [32]:
# Chart - 4 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 5

In [33]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 6

In [34]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 7

In [35]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 8

In [36]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 9

In [37]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 10

In [38]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 11

In [39]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 12

In [40]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 13

In [41]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [42]:
# Correlation Heatmap visualization code
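
A correlation heatmap needs numeric columns, which this catalog has few of. The sketch below assumes three hypothetical numeric features (`release_year`, `year_added`, and a `duration_minutes` column parsed from the duration text) with invented values; it shows the mechanics only and says nothing about the real correlations.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical numeric features; values are invented for illustration.
toy_df = pd.DataFrame({
    "release_year": [2010, 2015, 2018, 2019, 2012],
    "year_added": [2016, 2017, 2018, 2019, 2016],
    "duration_minutes": [95, 110, 88, 120, 101],
})

# Pairwise Pearson correlation (the pandas default) between the numeric columns.
corr = toy_df.corr()

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation between numeric features (toy data)")
plt.tight_layout()
plt.show()
```

Fixing `vmin=-1` and `vmax=1` keeps the color scale symmetric, so positive and negative correlations of equal strength get equally intense colors.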

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [43]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***