# Netflix Data Cleaning and Preprocessing

This notebook performs data cleaning and preprocessing on the Netflix titles dataset (`data/netflix_titles.csv`). The goal is to handle missing values, standardize formats, remove duplicates, and prepare the data for exploratory data analysis.

**Steps Covered:**
- Data Loading and Initial Inspection
- Missing Value Handling
- Date Conversion and Temporal Feature Extraction
- Duration Standardization
- Data Validation and Saving

In [11]:
import pandas as pd
import numpy as np

## 1. Data Loading and Initial Inspection

Load the dataset and perform initial checks for shape, data types, and missing values.

In [12]:
df = pd.read_csv('data/netflix_titles.csv')
print(df.shape)
df.head()

(8807, 12)


| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |


In [13]:
print("\nMissing value rates before imputation:")
for column in df.columns:
    null_counts = df[column].isnull().sum()
    null_rate = null_counts / len(df) * 100
    if null_rate > 0:
        print(f"{column} null rate: {round(null_rate, 2)}%, null counts: {null_counts}")


Missing value rates before imputation:
director null rate: 29.91%, null counts: 2634
cast null rate: 9.37%, null counts: 825
country null rate: 9.44%, null counts: 831
date_added null rate: 0.11%, null counts: 10
rating null rate: 0.05%, null counts: 4
duration null rate: 0.03%, null counts: 3
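
The per-column loop above can also be written without an explicit loop. The following is a minimal sketch, not the notebook's method: `missing_summary` is a hypothetical helper name and `toy` is made-up data standing in for the real dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and null rates (%) for columns with at least one null."""
    counts = frame.isnull().sum()
    rates = (frame.isnull().mean() * 100).round(2)
    summary = pd.DataFrame({"null_count": counts, "null_rate_pct": rates})
    # Keep only columns that actually have nulls, worst first.
    return summary[summary["null_count"] > 0].sort_values("null_count", ascending=False)


# Hypothetical toy data, used only to illustrate the helper.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["US", "IN", np.nan, "FR"],
})
summary = missing_summary(toy)
print(summary)
```

Returning a DataFrame rather than printing each line makes the result easy to sort, filter, or compare before and after imputation.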


## 2. Missing Value Handling

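Impute missing values with a strategy suited to each column: `country` is filled with its mode, `date_added` falls back to January 1 of the title's `release_year`, and `cast`, `director`, and `rating` receive the placeholder `'No Data'`. Rows that still contain a null afterwards (the three titles with no `duration`) are dropped, and duplicate titles are removed.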

In [14]:
mode_country = df['country'].mode()
if not mode_country.empty:
    df['country'] = df['country'].fillna(mode_country[0])
else:
    df['country'] = df['country'].fillna('Unknown')

# Impute date_added with 'January 1, ' + release_year
df['date_added'] = df['date_added'].fillna('January 1, ' + df['release_year'].astype(str))

df['cast'] = df['cast'].replace([np.nan, None, ''], 'No Data')
df['director'] = df['director'].replace([np.nan, None, ''], 'No Data')
df['rating'] = df['rating'].replace([np.nan, None, ''], 'No Data')

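# Drop rows that still contain a null (only `duration` has any left at this point)
df = df.dropna()
# Keep the first occurrence of each title
df = df.drop_duplicates(subset=['title'], keep='first')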

# Verify no missing values remain
print("Missing values after imputation:")
print(df.isnull().sum())

Missing values after imputation:
show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64
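
The effect of each imputation step is easier to see on a few rows. The sketch below applies the same ideas to a hypothetical frame (`toy`, with made-up values); it uses `fillna` for the placeholder step, which is an assumption of this example rather than the call used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that mimics the missing-value patterns in the dataset.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["India", np.nan, "India", "Japan"],
    "date_added": ["March 3, 2020", np.nan, "July 9, 2019", "May 1, 2021"],
    "release_year": [2019, 2015, 2018, 2021],
    "director": ["X", np.nan, "Y", np.nan],
    "duration": ["90 min", "2 Seasons", np.nan, "1 Season"],
})

# Mode imputation for a categorical column.
toy["country"] = toy["country"].fillna(toy["country"].mode()[0])

# Fallback date built from release_year; fillna aligns the two Series by index.
toy["date_added"] = toy["date_added"].fillna("January 1, " + toy["release_year"].astype(str))

# Placeholder label for a free-text column.
toy["director"] = toy["director"].fillna("No Data")

# Rows that still contain a null (here: the missing duration) are dropped.
cleaned = toy.dropna()
print(cleaned)
```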


In [15]:
# Confirm that no missing date_added values remain after imputation
# (the column still holds strings here; it is converted to datetime in Section 3)
nat_rows = df[df['date_added'].isna()]
print("Number of rows with NaT in date_added:", len(nat_rows))
if len(nat_rows) > 0:
    print("Sample rows with NaT:")
    print(nat_rows[['title', 'date_added', 'release_year']].head(10))
else:
    print("No NaT values found.")

print('---')
print(df.info())

Number of rows with NaT in date_added: 0
No NaT values found.
---
<class 'pandas.core.frame.DataFrame'>
Index: 8804 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8804 non-null   object
 1   type          8804 non-null   object
 2   title         8804 non-null   object
 3   director      8804 non-null   object
 4   cast          8804 non-null   object
 5   country       8804 non-null   object
 6   date_added    8804 non-null   object
 7   release_year  8804 non-null   int64 
 8   rating        8804 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8804 non-null   object
 11  description   8804 non-null   object
dtypes: int64(1), object(11)
memory usage: 894.2+ KB
None


## 3. Date Conversion and Temporal Feature Extraction

Convert `date_added` to datetime format and extract temporal components (month, year) for analysis.

In [16]:
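# Convert date_added to datetime using dateutil for more lenient parsing
from dateutil import parser

def safe_parse(date_str):
    """Parse a date string, returning NaT for missing or unparseable values."""
    if pd.isna(date_str):
        return pd.NaT
    try:
        return parser.parse(date_str)
    except (ValueError, OverflowError, TypeError):
        # Catch only parsing failures rather than every possible exception
        return pd.NaT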

df['date_added'] = df['date_added'].apply(safe_parse)
# Ensure it's datetime64
df['date_added'] = pd.to_datetime(df['date_added'])

# Extract temporal components from date_added for analysis
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month
df['month_name_added'] = df['date_added'].dt.month_name()
df['day_added'] = df['date_added'].dt.day

print("Temporal columns added. Sample:")
print(df[['date_added', 'year_added', 'month_added', 'month_name_added', 'day_added']].head())

print("Data types after conversion:")
print(df.dtypes)

Temporal columns added. Sample:
  date_added  year_added  month_added month_name_added  day_added
0 2021-09-25        2021            9        September         25
1 2021-09-24        2021            9        September         24
2 2021-09-24        2021            9        September         24
3 2021-09-24        2021            9        September         24
4 2021-09-24        2021            9        September         24
Data types after conversion:
show_id                     object
type                        object
title                       object
director                    object
cast                        object
country                     object
date_added          datetime64[ns]
release_year                 int64
rating                      object
duration                    object
listed_in                   object
description                 object
year_added                   int32
month_added                  int32
month_name_added            object
day_added                    int32
dtype: object
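
A vectorized alternative to the row-wise `dateutil` parse is `pd.to_datetime` with an explicit format. This is a sketch under the assumption that every value follows the `Month D, YYYY` pattern seen in the sample rows; the strings in `raw` are hypothetical, and the stray leading space and the invalid entry are included only to show how they are handled.

```python
import pandas as pd

# Hypothetical sample strings in the "Month D, YYYY" pattern.
raw = pd.Series(["September 25, 2021", " August 4, 2017", "January 1, 2015", "not a date"])

# Strip whitespace, parse with an explicit format, and coerce failures to NaT.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

# Derive the same kind of temporal features as the cell above.
features = pd.DataFrame({
    "date_added": parsed,
    "year_added": parsed.dt.year,
    "month_added": parsed.dt.month,
    "month_name_added": parsed.dt.month_name(),
})
print(features)
```

With `errors="coerce"`, unparseable strings become `NaT` instead of raising, so they can be counted and inspected afterwards.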

In [17]:
# Confirm that date_added contains no NaT values after conversion to datetime64
print(df.isnull().sum())
df[df['date_added'].isnull()][['title', 'release_year', 'date_added', 'year_added', 'month_added', 'month_name_added', 'day_added']].head()

show_id             0
type                0
title               0
director            0
cast                0
country             0
date_added          0
release_year        0
rating              0
duration            0
listed_in           0
description         0
year_added          0
month_added         0
month_name_added    0
day_added           0
dtype: int64


| | title | release_year | date_added | year_added | month_added | month_name_added | day_added |
|---|---|---|---|---|---|---|---|


## 4. Duration Standardization

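Extract the leading number from `duration` into two nullable integer columns: `duration_minutes` for movies and `duration_seasons` for TV shows. Values that do not apply to a title's type are then filled with the column median, so in the sample output TV shows carry the median movie runtime (98) and movies carry the median season count (1). Filter by `type` before analyzing either column.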

In [18]:
# Extract numeric duration separately for movies and TV shows
# For movies: duration in minutes
df['duration_minutes'] = df.apply(lambda x: int(x['duration'].split()[0]) if pd.notna(x['duration']) and 'min' in x['duration'] else np.nan, axis=1).astype('Int64')

# For TV shows: duration in seasons
df['duration_seasons'] = df.apply(lambda x: int(x['duration'].split()[0]) if pd.notna(x['duration']) and 'Season' in x['duration'] else np.nan, axis=1).astype('Int64')

# Fill NaN with median for each
median_minutes = df['duration_minutes'].median()
df['duration_minutes'] = df['duration_minutes'].fillna(median_minutes)

median_seasons = df['duration_seasons'].median()
df['duration_seasons'] = df['duration_seasons'].fillna(median_seasons)

print("Duration columns standardized. Sample:")
print(df[['type', 'duration', 'duration_minutes', 'duration_seasons']].head(10))

Duration columns standardized. Sample:
      type   duration  duration_minutes  duration_seasons
0    Movie     90 min                90                 1
1  TV Show  2 Seasons                98                 2
2  TV Show   1 Season                98                 1
3  TV Show   1 Season                98                 1
4  TV Show  2 Seasons                98                 2
5  TV Show   1 Season                98                 1
6    Movie     91 min                91                 1
7    Movie    125 min               125                 1
8  TV Show  9 Seasons                98                 9
9    Movie    104 min               104                 1
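
If the median fill is not wanted, the two columns can instead be left as `<NA>` wherever the unit does not apply. The sketch below is one possible alternative, not the approach used in this notebook; `toy` is hypothetical data in the same format as the `duration` column.

```python
import pandas as pd

# Hypothetical sample rows in the same format as the `duration` column.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie"],
    "duration": ["90 min", "2 Seasons", "1 Season", "125 min"],
})

# Pull the leading integer out of every duration string in one pass.
digits = toy["duration"].str.extract(r"^(\d+)", expand=False)
number = pd.to_numeric(digits).astype("Int64")

# Keep the number only where the unit matches the title type; leave <NA> elsewhere.
is_movie = toy["type"].eq("Movie")
toy["duration_minutes"] = number.where(is_movie)
toy["duration_seasons"] = number.where(~is_movie)
print(toy)
```

Leaving the non-applicable values missing means that aggregates such as `toy["duration_minutes"].mean()` describe movies only, without an extra filter on `type`.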


## 5. Data Validation and Saving

Validate the cleaned dataset and save it for use in subsequent notebooks.

In [19]:
# Data Validation
print("Final dataset shape:", df.shape)
print("\nFinal data types:")
print(df.dtypes)
print("\nFinal missing values:")
print(df.isnull().sum())
print("\nSample of cleaned data:")
print(df.head())

Final dataset shape: (8804, 18)

Final data types:
show_id                     object
type                        object
title                       object
director                    object
cast                        object
country                     object
date_added          datetime64[ns]
release_year                 int64
rating                      object
duration                    object
listed_in                   object
description                 object
year_added                   int32
month_added                  int32
month_name_added            object
day_added                    int32
duration_minutes             Int64
duration_seasons             Int64
dtype: object

Final missing values:
show_id             0
type                0
title               0
director            0
cast                0
country             0
date_added          0
release_year        0
rating              0
duration            0
listed_in           0
description         0
year_added          0
month_added         0
month_name_added    0
day_added           0
duration_minutes    0
duration_seasons    0
dtype: int64
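
The printed checks above can also be expressed as explicit tests that fail loudly. A minimal sketch, assuming a hypothetical helper named `validate_cleaned` and made-up frames (`good`, `bad`); the specific checks are illustrative choices, not requirements stated in this notebook.

```python
import pandas as pd


def validate_cleaned(frame: pd.DataFrame, required: list[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    missing_cols = [c for c in required if c not in frame.columns]
    if missing_cols:
        problems.append(f"missing columns: {missing_cols}")
    null_cols = frame.columns[frame.isnull().any()].tolist()
    if null_cols:
        problems.append(f"columns with nulls: {null_cols}")
    if "title" in frame.columns and frame["title"].duplicated().any():
        problems.append("duplicate titles present")
    if "date_added" in frame.columns and not pd.api.types.is_datetime64_any_dtype(frame["date_added"]):
        problems.append("date_added is not a datetime column")
    return problems


# Hypothetical frames: one that passes every check and one that fails several.
good = pd.DataFrame({
    "title": ["A", "B"],
    "date_added": pd.to_datetime(["2021-09-25", "2021-09-24"]),
})
bad = pd.DataFrame({
    "title": ["A", "A"],
    "date_added": ["September 25, 2021", None],
})

problems_good = validate_cleaned(good, ["title", "date_added"])
problems_bad = validate_cleaned(bad, ["title", "date_added", "duration"])
print(problems_good)
print(problems_bad)
```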

In [20]:
# Save cleaned dataset
df.to_csv('data/netflix_titles_cleaned.csv', index=False)
print("Cleaned dataset saved as 'data/netflix_titles_cleaned.csv'")

Cleaned dataset saved as 'data/netflix_titles_cleaned.csv'
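
CSV does not store dtypes, so `datetime64` and nullable `Int64` columns come back as plain strings and integers unless they are restored on load. The sketch below shows one way to do that; it round-trips a small hypothetical frame through an in-memory buffer rather than the real file.

```python
import io

import pandas as pd

# Hypothetical cleaned frame with a datetime column and a nullable integer column.
cleaned = pd.DataFrame({
    "title": ["A", "B"],
    "date_added": pd.to_datetime(["2021-09-25", "2021-09-24"]),
    "duration_minutes": pd.array([90, 98], dtype="Int64"),
})

# Write to an in-memory buffer instead of disk to keep the example self-contained.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the dtypes explicitly when reading the data back.
restored = pd.read_csv(
    buffer,
    parse_dates=["date_added"],
    dtype={"duration_minutes": "Int64"},
)
print(restored.dtypes)
```

The same `parse_dates` and `dtype` arguments can be passed when a later notebook reads `data/netflix_titles_cleaned.csv`.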


## Summary

The dataset has been cleaned and preprocessed:
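- Missing values imputed (`country`, `date_added`) or filled with `'No Data'` (`cast`, `director`, `rating`); the three rows without a `duration` were dropped, leaving 8,804 of the original 8,807 titles.
- Dates converted to datetime with temporal features (year, month, month name, day) extracted.
- Duration standardized into separate numeric columns (`duration_minutes`, `duration_seasons`).
- Duplicate titles removed (the row count shows none were present once the null rows were dropped).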

The cleaned data is saved as `data/netflix_titles_cleaned.csv` and ready for exploratory data analysis in the next notebook.

**Next Steps:**
- Proceed to `exploratory_data_analysis_and_visualization.ipynb` for insights and visualizations.
- Use the new columns (e.g., `duration_minutes`, `month_added`) for advanced analysis.