<a href="https://www.kaggle.com/code/harshbhanushali124/project-2-netflix-content-strategy-analysis?scriptVersionId=268364251" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

**Project 2: Netflix Content Strategy Analysis**

**Workshop:** Geeks for Geeks 21 Projects, 21 Days: ML, Deep Learning & GenAI

**Date:** October 15, 2025

**Author:** Harsh Bhanushali

**Objective:** Conduct an exploratory data analysis (EDA) on the Netflix dataset to investigate content trends, genres, ratings, and type distribution, addressing the five submission questions with Plotly visualizations and generating a ydata-profiling report.

**Step 1: Import Libraries**

Import libraries for data processing, visualization, and text analysis.

In [1]:
# Install packages that may not be preinstalled before importing them.
!pip install wordcloud ydata-profiling -q

import pandas as pd
import numpy as np
import plotly.express as px
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from ydata_profiling import ProfileReport

**Step 2: Load the Dataset**

Load the Netflix dataset from the specified GitHub repository.

In [2]:
!git clone "https://github.com/GeeksforgeeksDS/21-Days-21-Projects-Dataset" -q
netflix_df = pd.read_csv('/kaggle/working/21-Days-21-Projects-Dataset/Datasets/netflix_titles.csv')
print("Dataset Loaded")
print("\nFirst 5 Rows:")
display(netflix_df.head())
print("\nShape:", netflix_df.shape)

Dataset Loaded

First 5 Rows:


| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | TV Show | 3% | | João Miguel, Bianca Comparato, Michel Gomes, R... | Brazil | August 14, 2020 | 2020 | TV-MA | 4 Seasons | International TV Shows, TV Dramas, TV Sci-Fi &... | In a future where the elite inhabit an island ... |
| 1 | s2 | Movie | 7:19 | Jorge Michel Grau | Demián Bichir, Héctor Bonilla, Oscar Serrano, ... | Mexico | December 23, 2016 | 2016 | TV-MA | 93 min | Dramas, International Movies | After a devastating earthquake hits Mexico Cit... |
| 2 | s3 | Movie | 23:59 | Gilbert Chan | Tedd Chan, Stella Chung, Henley Hii, Lawrence ... | Singapore | December 20, 2018 | 2011 | R | 78 min | Horror Movies, International Movies | When an army recruit is found dead, his fellow... |
| 3 | s4 | Movie | 9 | Shane Acker | Elijah Wood, John C. Reilly, Jennifer Connelly... | United States | November 16, 2017 | 2009 | PG-13 | 80 min | Action & Adventure, Independent Movies, Sci-Fi... | In a postapocalyptic world, rag-doll robots hi... |
| 4 | s5 | Movie | 21 | Robert Luketic | Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar... | United States | January 1, 2020 | 2008 | PG-13 | 123 min | Dramas | A brilliant group of students become card-coun... |



Shape: (7787, 12)


**Step 3: Data Cleaning**

Clean the dataset and standardize formats.

    Missing Values: Fill 'director', 'cast', 'country', 'rating', and 'duration' with 'Unknown'; drop rows with a missing 'date_added'.

    Date Parsing: Convert 'date_added' to datetime, drop rows with NaT.

    Duration Extraction: Split 'duration' into number and unit.

In [3]:
netflix_df['director'] = netflix_df['director'].fillna('Unknown')
netflix_df['cast'] = netflix_df['cast'].fillna('Unknown')
netflix_df['country'] = netflix_df['country'].fillna('Unknown')
netflix_df['rating'] = netflix_df['rating'].fillna('Unknown')
netflix_df['duration'] = netflix_df['duration'].fillna('Unknown')
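# .copy() makes the later column assignments operate on an independent frame.
netflix_df = netflix_df.dropna(subset=['date_added']).copy()
netflix_df['date_added'] = pd.to_datetime(netflix_df['date_added'], errors='coerce')
netflix_df = netflix_df.dropna(subset=['date_added']).copy()  # Drop rows with invalid dates
# Raw strings avoid the invalid-escape warning for \d; expand=False returns a Series.
netflix_df['duration_num'] = netflix_df['duration'].str.extract(r'(\d+)', expand=False).astype(float)
netflix_df['duration_unit'] = netflix_df['duration'].str.extract(r'([a-zA-Z]+)', expand=False)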
print("\nCleaned Data Info:")
netflix_df.info()


Cleaned Data Info:
<class 'pandas.core.frame.DataFrame'>
Index: 7689 entries, 0 to 7786
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   show_id        7689 non-null   object        
 1   type           7689 non-null   object        
 2   title          7689 non-null   object        
 3   director       7689 non-null   object        
 4   cast           7689 non-null   object        
 5   country        7689 non-null   object        
 6   date_added     7689 non-null   datetime64[ns]
 7   release_year   7689 non-null   int64         
 8   rating         7689 non-null   object        
 9   duration       7689 non-null   object        
 10  listed_in      7689 non-null   object        
 11  description    7689 non-null   object        
 12  duration_num   7689 non-null   float64       
 13  duration_unit  7689 non-null   object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(11)
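
The cleaned frame has 7689 rows against 7787 in the raw file, so 98 rows were dropped across the two `dropna` calls. One possible cause of unparseable dates (an assumption, not verified on this dataset) is stray leading or trailing whitespace in `date_added`, which some pandas versions reject when a single format is inferred for the whole column. A minimal sketch on made-up values, stripping whitespace and passing an explicit format before parsing:

```python
import pandas as pd

# Toy values (hypothetical, not taken from the dataset): the second has a leading space.
raw_dates = pd.Series(["August 14, 2020", " August 4, 2017", None])

# Strip surrounding whitespace first; missing values stay missing.
cleaned = raw_dates.str.strip()

# An explicit format avoids relying on format inference.
parsed = pd.to_datetime(cleaned, format="%B %d, %Y", errors="coerce")

# Count values that still could not be parsed (here only the missing one).
n_unparsed = int(parsed.isna().sum())
print(parsed)
print("Unparsed rows:", n_unparsed)
```

Comparing `n_unparsed` with and without the `str.strip()` step on the real column would show whether whitespace accounts for any of the dropped rows.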

**Step 4: Feature Engineering**

Create features for submission question analysis.

    Year Added: From 'date_added'.
    
    Content Age: Difference between 'year_added' and 'release_year'.
    
    Primary Genre: First genre from 'listed_in'.
    
    Primary Country: First country from 'country'.

In [4]:
netflix_df['year_added'] = netflix_df['date_added'].dt.year.astype(int)
netflix_df['content_age'] = netflix_df['year_added'] - netflix_df['release_year']
netflix_df['primary_genre'] = netflix_df['listed_in'].str.split(',').str[0].str.strip()
netflix_df['primary_country'] = netflix_df['country'].str.split(',').str[0].str.strip()
print("\nFeatures Preview:")
display(netflix_df[['year_added', 'content_age', 'primary_genre', 'primary_country']].head())


Features Preview:


| | year_added | content_age | primary_genre | primary_country |
|---|---|---|---|---|
| 0 | 2020 | 0 | International TV Shows | Brazil |
| 1 | 2016 | 0 | Dramas | Mexico |
| 2 | 2018 | 7 | Horror Movies | Singapore |
| 3 | 2017 | 8 | Action & Adventure | United States |
| 4 | 2020 | 12 | Dramas | United States |
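
Because `content_age` is a difference of two years, it can come out negative when a title's `release_year` is later than the year it was added. A small sketch, on hypothetical rows, of how such cases could be counted before plotting:

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year_added": [2020, 2018, 2019],
    "release_year": [2020, 2011, 2020],
})
toy["content_age"] = toy["year_added"] - toy["release_year"]

# Rows where the title was added before its listed release year.
negative_age = toy[toy["content_age"] < 0]
print(negative_age[["title", "content_age"]])
print("Negative ages:", len(negative_age))
```

Whether to keep, clip, or drop such rows is a judgment call; the notebook as written keeps them.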


**Step 5: Submission Questions**

**5.1 Content Ratings Over Time**

Analyze changes in rating distribution.

In [5]:
ratings_over_time = netflix_df.groupby(['year_added', 'rating']).size().unstack().fillna(0)
fig = px.area(ratings_over_time, x=ratings_over_time.index, y=ratings_over_time.columns,
              title='Content Ratings Over Time',
              labels={'year_added': 'Year Added', 'value': 'Titles'})
fig.update_layout(title_x=0.5, xaxis_title='Year Added', yaxis_title='Titles')
fig.show()

**Finding:** TV-MA and TV-14 ratings have risen since 2015, indicating a focus on mature audiences.
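
Raw counts tend to rise for every rating as the catalog grows, so a share-based view can help separate a shift in the rating mix from overall growth. A minimal sketch using `pd.crosstab` with `normalize='index'` on hypothetical rows; the same call could be applied to `netflix_df['year_added']` and `netflix_df['rating']`:

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "year_added": [2015, 2015, 2016, 2016, 2016, 2016],
    "rating": ["TV-MA", "PG", "TV-MA", "TV-MA", "TV-14", "PG"],
})

# Each row sums to 1: the share of that year's additions carrying each rating.
rating_share = pd.crosstab(toy["year_added"], toy["rating"], normalize="index")
print(rating_share.round(2))
```

The resulting frame has the same wide shape as `ratings_over_time`, so it could be passed to `px.area` in the same way.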

**5.2 Content Age by Type**

Compare content age for movies and TV shows.

In [6]:
fig = px.box(netflix_df, x='type', y='content_age', color='type',
             title='Content Age by Type',
             labels={'type': 'Type', 'content_age': 'Age (Years)'},
             color_discrete_sequence=['#636EFA', '#EF553B'])
fig.update_layout(title_x=0.5, xaxis_title='Type', yaxis_title='Age (Years)')
fig.show()

**Finding:** TV shows tend to be added sooner after release than movies; the movie catalog includes more older titles.
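
A box plot shows the comparison visually; a grouped summary can put numbers on it. The sketch below uses hypothetical rows, and the same `groupby` could be run on `netflix_df`:

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "content_age": [0, 7, 12, 0, 1],
})

# Median is less sensitive than the mean to a few very old titles.
age_summary = toy.groupby("type")["content_age"].agg(["median", "max", "count"])
print(age_summary)
```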

**5.3 Production vs. Addition Trends**

Compare release year and year added.

In [7]:
fig = px.scatter(netflix_df, x='release_year', y='year_added', color='type',
                 title='Release Year vs. Year Added',
                 labels={'release_year': 'Release Year', 'year_added': 'Year Added'},
                 color_discrete_sequence=['#636EFA', '#EF553B'], opacity=0.6)
fig.update_layout(title_x=0.5, xaxis_title='Release Year', yaxis_title='Year Added')
fig.show()

**Finding:** Recent releases are added quickly; older content diversifies the library.
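
One way to quantify "added quickly" is the share of each year's additions that were released in the same or the previous year. This is a sketch on hypothetical rows; the one-year threshold is an arbitrary choice made for illustration:

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "release_year": [2016, 2010, 2019, 2020, 2008],
    "year_added":   [2016, 2016, 2020, 2020, 2020],
})

# Lag in years between release and addition.
lag = toy["year_added"] - toy["release_year"]

# True when the title was added within one year of release (inclusive bounds).
toy["added_within_1y"] = lag.between(0, 1)

# The mean of a boolean column is the fraction of True values per year.
quick_share = toy.groupby("year_added")["added_within_1y"].mean()
print(quick_share)
```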

**5.4 Common Word Pairs in Descriptions**

Identify frequent bigrams.

In [8]:
vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words='english', max_features=10)
bigrams = vectorizer.fit_transform(netflix_df['description'].dropna())
bigram_df = pd.DataFrame(bigrams.toarray(), columns=vectorizer.get_feature_names_out())
bigram_counts = bigram_df.sum().sort_values(ascending=False)
fig = px.bar(x=bigram_counts.index, y=bigram_counts.values,
             title='Top 10 Word Pairs in Descriptions',
             labels={'x': 'Word Pair', 'y': 'Frequency'},
             color_discrete_sequence=['#00CC96'])
fig.update_layout(title_x=0.5, xaxis_title='Word Pair', yaxis_title='Frequency', xaxis_tickangle=45)
fig.show()

**Finding:** Bigrams like "young woman" highlight youth and relationship themes.
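
One caveat when reading these pairs: `CountVectorizer` removes stop words before it forms n-grams, so a reported bigram can join two words that were separated by a stop word in the original text. The pure-Python sketch below mimics that order of operations on made-up sentences; the tiny stop-word set is a stand-in for scikit-learn's English list, and the tokenizer is a simplification.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word set (scikit-learn's built-in list is much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "her", "is", "by"}

def bigram_counts(texts):
    """Count adjacent word pairs after lowercasing and dropping stop words."""
    counts = Counter()
    for text in texts:
        # Keep tokens of two or more word characters, then drop stop words.
        tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower())
                  if t not in STOP_WORDS]
        # Pair each remaining token with its successor.
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts

# Made-up descriptions for illustration only.
descriptions = [
    "A young woman and her best friend travel the world.",
    "The young and ambitious woman meets her best friend.",
]
counts = bigram_counts(descriptions)
print(counts.most_common(3))
```

In the first sentence the pair "woman best" is counted even though "and her" sat between the two words, which is the effect to keep in mind when interpreting the chart.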

**5.5 Top Directors**

List leading directors.

In [9]:
top_directors = netflix_df[netflix_df['director'] != 'Unknown']['director'].value_counts().head(10)
fig = px.bar(x=top_directors.index, y=top_directors.values,
             title='Top 10 Directors',
             labels={'x': 'Director', 'y': 'Titles'},
             color_discrete_sequence=['#AB63FA'])
fig.update_layout(title_x=0.5, xaxis_title='Director', yaxis_title='Titles', xaxis_tickangle=45)
fig.show()

**Finding:** Raúl Campos and Jan Suter lead, focusing on international content.
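
The `director` column can hold several names in one comma-separated string, so `value_counts()` on the raw column treats each co-directing team as a single entry. If per-person counts are wanted instead, one option (a sketch on hypothetical names, assuming names are separated by commas) is to split and explode before counting:

```python
import pandas as pd

# Hypothetical names for illustration only.
toy = pd.DataFrame({
    "director": ["Ann Lee, Bo Park", "Ann Lee, Bo Park", "Ann Lee", "Unknown"],
})

# Exclude the placeholder used for missing directors.
known = toy.loc[toy["director"] != "Unknown", "director"]

# One row per individual name, with surrounding whitespace removed.
per_person = (
    known.str.split(",")
         .explode()
         .str.strip()
         .value_counts()
)
print(per_person)
```

Both views are defensible: team-level counts reflect recurring collaborations, while per-person counts credit each director individually.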