# Netflix Content Analysis - Part 1: Data Cleaning
## Load, Clean, and Save Dataset

DATA 606 - Capstone in Data Science  
**Objective:** Load the raw Netflix dataset, perform comprehensive data cleaning, and save the cleaned dataset for further analysis.

---

## 1. Import Required Libraries

In [None]:
# Install required packages
!pip install pandas numpy matplotlib seaborn scikit-learn

import pandas as pd
import numpy as np
import re
import warnings
from datetime import datetime

warnings.filterwarnings('ignore')

print("All libraries imported successfully!")
print(f"Execution Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

All libraries imported successfully!
Execution Date: 2025-12-09 02:52:23


## 2. Load Dataset

Upload the `netflix_titles.csv` file to Colab or mount Google Drive.

In [None]:

df = pd.read_csv('netflix_titles.csv')

print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print(f"Total records: {len(df):,}")

Dataset loaded successfully!
Dataset shape: (8807, 12)
Total records: 8,807


## 3. Initial Data Exploration

In [None]:
# Display column names
print("Column Names:")
print(list(df.columns))
print("\n" + "="*60)

Column Names:
['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description']



In [None]:
# Display first few rows
print("First 5 Rows:")
df.head()

First 5 Rows:


| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |


In [None]:
# Dataset information
print("Dataset Information:")
df.info()

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [None]:
# Content type distribution
print("Content Type Distribution:")
content_counts = df['type'].value_counts()
for content_type, count in content_counts.items():
    percentage = (count / len(df)) * 100
    print(f"   {content_type}: {count:,} titles ({percentage:.1f}%)")

Content Type Distribution:
   Movie: 6,131 titles (69.6%)
   TV Show: 2,676 titles (30.4%)
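
Before cleaning, it can also help to confirm that `show_id` uniquely identifies each row. The following is a minimal sketch on a made-up frame (the sample values are illustrative and not taken from the dataset); in the notebook the same calls would run on `df`.

```python
import pandas as pd

# Made-up sample; in the notebook this would be the loaded `df`
sample = pd.DataFrame({
    'show_id': ['s1', 's2', 's2', 's3'],
    'title': ['A', 'B', 'B', 'C'],
})

# Rows that are exact copies of an earlier row
exact_duplicates = int(sample.duplicated().sum())

# IDs that appear more than once, even if other columns differ
repeated_mask = sample.duplicated(subset='show_id', keep=False)
repeated_ids = sample.loc[repeated_mask, 'show_id'].unique()

print(f"Exact duplicate rows: {exact_duplicates}")
print(f"Repeated show_id values: {list(repeated_ids)}")

# Keep the first occurrence of each show_id
deduplicated = sample.drop_duplicates(subset='show_id', keep='first')
print(f"Rows after de-duplication: {len(deduplicated)}")
```

If both counts are zero on the real data, no de-duplication step is needed.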


## 4. Missing Values Analysis

In [None]:
# Check missing values
print("Missing Values Analysis:")
print("="*60)

missing_values = df.isnull().sum()
missing_percentage = (df.isnull().sum() / len(df)) * 100

missing_df = pd.DataFrame({
    'Column': missing_values.index,
    'Missing Count': missing_values.values,
    'Percentage': missing_percentage.values
})

# Display only columns with missing values
missing_df_filtered = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

if len(missing_df_filtered) > 0:
    print(missing_df_filtered.to_string(index=False))
else:
    print("No missing values found!")

print("\n" + "="*60)

Missing Values Analysis:
    Column  Missing Count  Percentage
  director           2634   29.908028
   country            831    9.435676
      cast            825    9.367549
date_added             10    0.113546
    rating              4    0.045418
  duration              3    0.034064



## 5. Data Cleaning

### 5.1 Handle Missing Values

In [None]:
print("Starting data cleaning process...")
print("="*60)

# Store original shape
original_shape = df.shape

# Fill missing values with appropriate defaults
df['director'] = df['director'].fillna('Unknown Director')
df['cast'] = df['cast'].fillna('Unknown Cast')
df['country'] = df['country'].fillna('Unknown Country')
df['date_added'] = df['date_added'].fillna('Unknown Date')
df['rating'] = df['rating'].fillna('Not Rated')
df['duration'] = df['duration'].fillna('Unknown Duration')
df['listed_in'] = df['listed_in'].fillna('Unknown Genre')
df['description'] = df['description'].fillna('No description available')

print("Missing values filled successfully!")

Starting data cleaning process...
Missing values filled successfully!
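
Filling `date_added` with the string `'Unknown Date'` keeps the column as text. If later analysis needs real dates, one option is to parse the values into a separate datetime column and let anything unparseable become `NaT`. The sketch below assumes the "Month D, YYYY" format seen in the first rows; the column names `date_added_parsed` and `year_added` are hypothetical, and the sample rows are made up.

```python
import pandas as pd

# Made-up values in the "Month D, YYYY" format, plus a placeholder and a gap
sample = pd.DataFrame({
    'date_added': ['September 25, 2021', ' August 4, 2017', 'Unknown Date', None],
})

# Strip stray whitespace, then parse; values that do not match the
# format become NaT instead of raising an error
sample['date_added_parsed'] = pd.to_datetime(
    sample['date_added'].str.strip(),
    format='%B %d, %Y',
    errors='coerce',
)

# Year as a numeric column (NaN where the date is unknown)
sample['year_added'] = sample['date_added_parsed'].dt.year
print(sample)
```

Keeping the original text column alongside the parsed one makes it easy to inspect which rows failed to parse.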


### 5.2 Clean Release Year

In [None]:
# Convert release_year to numeric and remove invalid entries
df['release_year'] = pd.to_numeric(df['release_year'], errors='coerce')

# Drop rows where release_year couldn't be converted
before_drop = len(df)
df = df.dropna(subset=['release_year'])
after_drop = len(df)

df['release_year'] = df['release_year'].astype(int)

print("Release year cleaned!")
print(f"   Removed {before_drop - after_drop} invalid records")
print(f"   Year range: {df['release_year'].min()} - {df['release_year'].max()}")

Release year cleaned!
   Removed 0 invalid records
   Year range: 1925 - 2021


### 5.3 Parse Duration

In [None]:
def parse_duration(duration_str, content_type):
    """
    Extract the leading number from a duration string.
    - Movies: the number is minutes (e.g. '90 min' -> 90)
    - TV Shows: the number is seasons (e.g. '2 Seasons' -> 2)
    Both formats are parsed the same way; content_type only tells
    the caller how to interpret the returned value.
    """
    if pd.isna(duration_str) or duration_str == 'Unknown Duration':
        return np.nan

    # Take the first run of digits in the string
    match = re.search(r'(\d+)', str(duration_str))
    return int(match.group(1)) if match else np.nan

# Apply duration parsing
df['duration_value'] = df.apply(lambda x: parse_duration(x['duration'], x['type']), axis=1)

print("Duration parsed successfully!")
print(f"   Movies with duration: {df[df['type']=='Movie']['duration_value'].notna().sum()}")
print(f"   TV Shows with seasons: {df[df['type']=='TV Show']['duration_value'].notna().sum()}")

Duration parsed successfully!
   Movies with duration: 6128
   TV Shows with seasons: 2676
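
The count of 6128 parsed movie durations is three short of the 6,131 movies, which matches the three missing `duration` values found in section 4. One possible cause worth checking, stated here as an assumption rather than something this notebook verifies, is that a duration string such as "74 min" was entered in the `rating` column instead. The sketch below uses made-up rows to show how such values could be detected and moved.

```python
import pandas as pd

# Made-up rows; the second one mimics a duration stored in `rating`
sample = pd.DataFrame({
    'type': ['Movie', 'Movie', 'TV Show'],
    'rating': ['PG-13', '74 min', 'TV-MA'],
    'duration': ['90 min', None, '2 Seasons'],
})

# Rows where duration is missing and rating looks like "<number> min"
looks_like_minutes = sample['rating'].str.fullmatch(r'\d+ min', na=False)
misplaced = sample['duration'].isna() & looks_like_minutes

# Move the value into `duration` and mark the rating as missing
sample.loc[misplaced, 'duration'] = sample.loc[misplaced, 'rating']
sample.loc[misplaced, 'rating'] = pd.NA

print(f"Rows repaired: {int(misplaced.sum())}")
print(sample)
```

In this notebook the check would have to run before the missing values are filled in section 5.1, or test for the `'Unknown Duration'` placeholder instead of `isna()`.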


### 5.4 Create Additional Features

In [None]:
# Create decade feature
df['decade'] = (df['release_year'] // 10) * 10

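# Create content age categories
# current_year is hard-coded; a title released after it would get a
# negative age and fall outside every bin (NaN)
current_year = 2024
df['age'] = current_year - df['release_year']
# include_lowest=True keeps titles with age 0 in the 'Recent' bin
df['age_category'] = pd.cut(df['age'],
                           bins=[0, 5, 15, 30, float('inf')],
                           labels=['Recent', 'Modern', 'Classic', 'Vintage'],
                           include_lowest=True)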

# Create a combined text field for recommendations
df['combined_features'] = (df['listed_in'].fillna('') + ' ' +
                          df['description'].fillna('') + ' ' +
                          df['director'].fillna('') + ' ' +
                          df['cast'].fillna(''))

print("Additional features created:")
print("   - decade")
print("   - age")
print("   - age_category")
print("   - combined_features")

Additional features created:
   - decade
   - age
   - age_category
   - combined_features
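
By default, `pd.cut` treats each bin as open on the left and closed on the right, so with bins that start at 0 an age of exactly 0 falls outside the first bin and becomes `NaN`. The sketch below compares the default behaviour with `include_lowest=True` on a few hand-picked ages that sit on the bin edges.

```python
import pandas as pd

# Hand-picked ages, including values that sit exactly on bin edges
ages = pd.Series([0, 3, 5, 6, 15, 30, 31])
bins = [0, 5, 15, 30, float('inf')]
labels = ['Recent', 'Modern', 'Classic', 'Vintage']

# Default: intervals are (0, 5], (5, 15], (15, 30], (30, inf]
default_cut = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True makes the first interval include its left edge
inclusive_cut = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

comparison = pd.DataFrame({
    'age': ages,
    'default': default_cut,
    'include_lowest': inclusive_cut,
})
print(comparison)
```

Negative ages (a release year later than `current_year`) fall outside the bins in both variants.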


## 6. Data Quality Check

In [None]:
print("Final Data Quality Check:")
print("="*60)

# Check for remaining missing values
remaining_missing = df.isnull().sum().sum()
print(f"Remaining missing values: {remaining_missing}")

# Display data types
print("\nData Types:")
print(df.dtypes)

# Display summary statistics
print("\nSummary Statistics:")
print(f"   Total records: {len(df):,}")
print(f"   Movies: {len(df[df['type'] == 'Movie']):,}")
print(f"   TV Shows: {len(df[df['type'] == 'TV Show']):,}")
print(f"   Unique countries: {df['country'].nunique()}")
print(f"   Unique directors: {df['director'].nunique()}")
print(f"   Unique ratings: {df['rating'].nunique()}")

Final Data Quality Check:
Remaining missing values: 3

Data Types:
show_id                object
type                   object
title                  object
director               object
cast                   object
country                object
date_added             object
release_year            int64
rating                 object
duration               object
listed_in              object
description            object
duration_value        float64
decade                  int64
age                     int64
age_category         category
combined_features      object
dtype: object

Summary Statistics:
   Total records: 8,807
   Movies: 6,131
   TV Shows: 2,676
   Unique countries: 749
   Unique directors: 4529
   Unique ratings: 18
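
The figure of 749 unique countries is probably inflated: if `country` holds several comma-separated names per title, as `cast` and `listed_in` visibly do in the rows shown earlier, each distinct combination counts as its own value. The sketch below assumes comma-separated values and uses made-up rows to show one way of counting individual countries instead.

```python
import pandas as pd

# Made-up rows; 'Unknown Country' is the placeholder used in section 5.1
sample = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'country': ['United States, India', 'India', 'Unknown Country'],
})

# One row per (title, country) pair
countries = (
    sample.assign(country=sample['country'].str.split(','))
          .explode('country')
)
countries['country'] = countries['country'].str.strip()

# Drop the placeholder and any empty strings left by stray commas
countries = countries[~countries['country'].isin(['', 'Unknown Country'])]

print(f"Individual countries: {countries['country'].nunique()}")
print(countries['country'].value_counts())
```

The same pattern would apply to `director`, `cast`, and `listed_in` if they are stored as comma-separated lists.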


In [None]:
# Display sample of cleaned data
print("\nSample of Cleaned Data:")
df[['title', 'type', 'release_year', 'duration', 'duration_value',
    'listed_in', 'rating', 'decade', 'age_category']].head(10)


Sample of Cleaned Data:


| | title | type | release_year | duration | duration_value | listed_in | rating | decade | age_category |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Dick Johnson Is Dead | Movie | 2020 | 90 min | 90.0 | Documentaries | PG-13 | 2020 | Recent |
| 1 | Blood & Water | TV Show | 2021 | 2 Seasons | 2.0 | International TV Shows, TV Dramas, TV Mysteries | TV-MA | 2020 | Recent |
| 2 | Ganglands | TV Show | 2021 | 1 Season | 1.0 | Crime TV Shows, International TV Shows, TV Act... | TV-MA | 2020 | Recent |
| 3 | Jailbirds New Orleans | TV Show | 2021 | 1 Season | 1.0 | Docuseries, Reality TV | TV-MA | 2020 | Recent |
| 4 | Kota Factory | TV Show | 2021 | 2 Seasons | 2.0 | International TV Shows, Romantic TV Shows, TV ... | TV-MA | 2020 | Recent |
| 5 | Midnight Mass | TV Show | 2021 | 1 Season | 1.0 | TV Dramas, TV Horror, TV Mysteries | TV-MA | 2020 | Recent |
| 6 | My Little Pony: A New Generation | Movie | 2021 | 91 min | 91.0 | Children & Family Movies | PG | 2020 | Recent |
| 7 | Sankofa | Movie | 1993 | 125 min | 125.0 | Dramas, Independent Movies, International Movies | TV-MA | 1990 | Vintage |
| 8 | The Great British Baking Show | TV Show | 2021 | 9 Seasons | 9.0 | British TV Shows, Reality TV | TV-14 | 2020 | Recent |
| 9 | The Starling | Movie | 2021 | 104 min | 104.0 | Comedies, Dramas | PG-13 | 2020 | Recent |


## 7. Save Cleaned Dataset

In [None]:
# Save cleaned dataset
import os

output_filename = 'netflix_cleaned.csv'
df.to_csv(output_filename, index=False)

print("Cleaned dataset saved successfully!")
print("="*60)
print(f"Filename: {output_filename}")
print(f"Total records: {len(df):,}")
print(f"Total columns: {len(df.columns)}")
# Report the size of the file on disk, not the in-memory DataFrame size
print(f"File size: {os.path.getsize(output_filename) / 1024**2:.2f} MB")
print("\nData cleaning completed successfully!")
print("Ready for EDA in the next notebook!")
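
CSV files do not store dtypes, so a categorical column such as `age_category` comes back as plain text when the file is reloaded. The sketch below runs a round trip through a temporary file with made-up rows and then restores the ordered categories; the file name `sample_cleaned.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

labels = ['Recent', 'Modern', 'Classic', 'Vintage']

# Made-up rows with an ordered categorical, like the one pd.cut produces
sample = pd.DataFrame({
    'title': ['A', 'B'],
    'release_year': [2020, 1993],
    'age_category': pd.Categorical(['Recent', 'Vintage'],
                                   categories=labels, ordered=True),
})

# Write to a temporary directory, measure the file, and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_cleaned.csv')
    sample.to_csv(path, index=False)
    size_kb = os.path.getsize(path) / 1024
    reloaded = pd.read_csv(path)

print(f"File size on disk: {size_kb:.2f} KB")
print(reloaded.dtypes)

# Restore the ordered categorical after loading
reloaded['age_category'] = pd.Categorical(
    reloaded['age_category'], categories=labels, ordered=True
)
print(reloaded['age_category'].dtype)
```

A format that preserves dtypes, such as Parquet or pickle, would avoid the restore step, at the cost of a less portable file.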

In [None]:
# Download file (optional - only works in Google Colab)
try:
    from google.colab import files
    files.download(output_filename)
except ImportError:
    print("Not running in Google Colab; skipping download.")