# TMDb Data Cleaning

**Goal:** Clean the raw TMDb dataset to prepare it for feature engineering.

**Input:**
- `data/tmdb_features.csv`
- `data/tmdb_overviews.csv`

**Output:** `data/tmdb_clean.csv`

**Steps:**
1. Load Data
2. Merge Features and Overviews
3. Handle Duplicates
4. Data Type Conversion (Budget, Revenue to numeric)
5. Create Indicator Flags (`has_budget`, `has_revenue`)
6. Handle Missing Values (Overview, Budget, Revenue, Director Names)
7. Export Clean Data

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

# Paths
DATA_DIR = Path('../data')

print("Setup complete!")

Setup complete!


## 1. Load Data

In [2]:
# Load raw TMDb data
features = pd.read_csv(DATA_DIR / 'tmdb_features.csv')
overviews = pd.read_csv(DATA_DIR / 'tmdb_overviews.csv')

print(f"Features shape: {features.shape}")
print(f"Overviews shape: {overviews.shape}")

display(features.head(2))
display(overviews.head(2))

Features shape: (43995, 8)
Overviews shape: (44571, 3)


|   | imdbId | id | title | overview | budget | revenue | directors | director_names |
|---|--------|----|-------|----------|--------|---------|-----------|----------------|
| 0 | tt0114709 | 862 | Toy Story | Led by Woody, Andy's toys live happily in his ... | 30000000 | 373554033.0 | ['John Lasseter'] | John Lasseter |
| 1 | tt0113497 | 8844 | Jumanji | When siblings Judy and Peter discover an encha... | 65000000 | 262797249.0 | ['Joe Johnston'] | Joe Johnston |


|   | imdbId | title | overview |
|---|--------|-------|----------|
| 0 | tt0114709 | Toy Story | Led by Woody, Andy's toys live happily in his ... |
| 1 | tt0113497 | Jumanji | When siblings Judy and Peter discover an encha... |
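
The two inputs differ in row count (43,995 vs. 44,571), so it can help to inspect the join key before merging. The sketch below is an optional check, not part of the original notebook run: `key_report` is a hypothetical helper, and the toy frames are made up for illustration. It counts duplicate `imdbId` values on each side and uses `indicator=True` on an outer join of the unique keys to count keys present in only one frame.

```python
import pandas as pd


def key_report(left: pd.DataFrame, right: pd.DataFrame, key: str) -> dict:
    """Summarize duplicate and unmatched join keys (hypothetical helper)."""
    # Compare unique keys only, so duplicates do not distort the match counts
    left_keys = left[[key]].drop_duplicates()
    right_keys = right[[key]].drop_duplicates()
    matched = left_keys.merge(right_keys, on=key, how="outer", indicator=True)
    return {
        "left_duplicates": int(left[key].duplicated().sum()),
        "right_duplicates": int(right[key].duplicated().sum()),
        "left_only": int((matched["_merge"] == "left_only").sum()),
        "right_only": int((matched["_merge"] == "right_only").sum()),
        "both": int((matched["_merge"] == "both").sum()),
    }


# Toy data, made up for illustration (not taken from the TMDb files)
toy_features = pd.DataFrame({"imdbId": ["tt1", "tt2", "tt3"], "budget": [10, 0, 5]})
toy_overviews = pd.DataFrame({
    "imdbId": ["tt1", "tt2", "tt2", "tt3", "tt4"],
    "overview": ["a", "b", "b again", "c", "d"],
})

report = key_report(toy_features, toy_overviews, "imdbId")
print(report)

# A duplicated key on the right side repeats the matching left row
inner = toy_features.merge(toy_overviews, on="imdbId", how="inner")
print(f"features rows: {len(toy_features)}, inner-join rows: {len(inner)}")
```

On the real data, the same call would be `key_report(features, overviews, 'imdbId')`. A non-zero duplicate count on either side means an inner join can return more rows than the smaller input.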


## 2. Merge & Clean

In [4]:
# Merge on imdbId
# Note: both frames have an 'overview' column, so we pass explicit suffixes to tell them apart
df = pd.merge(features, overviews[['imdbId', 'overview']], on='imdbId', how='inner', suffixes=('_feat', '_ov'))

print(f"Merged shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")

# Resolve overview column
# We prioritize the one from 'overviews' file (_ov), falling back to 'features' (_feat) if null
df['overview'] = df['overview_ov'].fillna(df['overview_feat'])
df = df.drop(columns=['overview_ov', 'overview_feat'])

# Check for duplicates
print(f"Duplicates in imdbId: {df['imdbId'].duplicated().sum()}")
df = df.drop_duplicates(subset=['imdbId'], keep='first')
print(f"Shape after deduplication: {df.shape}")

# Coerce budget and revenue to numeric (unparseable values become NaN)
df['budget'] = pd.to_numeric(df['budget'], errors='coerce')
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')

# Create indicator flags BEFORE filling with 0
# Justification: ~80% of budget and ~83% of revenue values are 0 or missing.
# Filling missing values with 0 without a flag is problematic because:
#   1. log1p(0) = 0, so 80% of movies get identical log_budget values
#   2. The model cannot distinguish "unknown budget" from "actually $0 budget"
#   3. This dilutes the predictive signal from the ~20% that DO have budget data
# These flags allow us to:
#   - Use them as features (model learns "has_budget" matters)
#   - Filter to only movies with budget data for cleaner experiments
df['has_budget'] = (df['budget'] > 0).astype(int)
df['has_revenue'] = (df['revenue'] > 0).astype(int)

print(f"Movies with budget data: {df['has_budget'].sum()} ({df['has_budget'].mean()*100:.1f}%)")
print(f"Movies with revenue data: {df['has_revenue'].sum()} ({df['has_revenue'].mean()*100:.1f}%)")

# Now fill missing values: 0 for numeric columns, placeholders for text columns
df['budget'] = df['budget'].fillna(0)
df['revenue'] = df['revenue'].fillna(0)
df['overview'] = df['overview'].fillna("")
df['director_names'] = df['director_names'].fillna("Unknown")

# Ensure imdbId is string and looks correct (starts with 'tt')
df['imdbId'] = df['imdbId'].astype(str)
df = df[df['imdbId'].str.startswith('tt')]

print(f"Final clean shape: {df.shape}")
display(df.head())

Merged shape: (44057, 9)
Columns: ['imdbId', 'id', 'title', 'overview_feat', 'budget', 'revenue', 'directors', 'director_names', 'overview_ov']
Duplicates in imdbId: 62
Shape after deduplication: (43995, 8)
Movies with budget data: 8821 (20.1%)
Movies with revenue data: 7376 (16.8%)
Final clean shape: (43995, 10)


|   | imdbId | id | title | budget | revenue | directors | director_names | overview | has_budget | has_revenue |
|---|--------|----|-------|--------|---------|-----------|----------------|----------|------------|-------------|
| 0 | tt0114709 | 862 | Toy Story | 30000000 | 373554033.0 | ['John Lasseter'] | John Lasseter | Led by Woody, Andy's toys live happily in his ... | 1 | 1 |
| 1 | tt0113497 | 8844 | Jumanji | 65000000 | 262797249.0 | ['Joe Johnston'] | Joe Johnston | When siblings Judy and Peter discover an encha... | 1 | 1 |
| 2 | tt0113228 | 15602 | Grumpier Old Men | 0 | 0.0 | ['Howard Deutch'] | Howard Deutch | A family wedding reignites the ancient feud be... | 0 | 0 |
| 3 | tt0114885 | 31357 | Waiting to Exhale | 16000000 | 81452156.0 | ['Forest Whitaker'] | Forest Whitaker | Cheated on, mistreated and stepped on, the wom... | 1 | 1 |
| 4 | tt0113041 | 11862 | Father of the Bride Part II | 0 | 76578911.0 | ['Charles Shyer'] | Charles Shyer | Just when George Banks has recovered from his ... | 0 | 1 |
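
The cleaning cell chains several steps, so it can be useful to replay them on a few rows where every intermediate result is easy to inspect. The following is a minimal sketch on made-up data: the toy values and the `log_budget` column are assumptions for illustration (the notebook's comments mention `log_budget` as a downstream feature, but this notebook does not create it).

```python
import numpy as np
import pandas as pd

# Toy inputs, made up for illustration
toy_features = pd.DataFrame({
    "imdbId": ["tt1", "tt2", "tt3"],
    "overview": ["feat 1", "feat 2", None],
    "budget": ["100", "0", "n/a"],
})
toy_overviews = pd.DataFrame({
    "imdbId": ["tt1", "tt2", "tt2", "tt3"],
    "overview": ["ov 1", None, "ov 2 dup", None],
})

# Merge: the shared 'overview' column is split into overview_feat / overview_ov
merged = pd.merge(
    toy_features, toy_overviews, on="imdbId", how="inner", suffixes=("_feat", "_ov")
)

# Coalesce: prefer the overviews file, fall back to the features file
toy = merged.copy()
toy["overview"] = toy["overview_ov"].fillna(toy["overview_feat"])
toy = toy.drop(columns=["overview_ov", "overview_feat"])

# Deduplicate on the key, keeping the first row per imdbId
toy = toy.drop_duplicates(subset=["imdbId"], keep="first").reset_index(drop=True)

# Coerce to numeric, flag, and only then fill
toy["budget"] = pd.to_numeric(toy["budget"], errors="coerce")
toy["has_budget"] = (toy["budget"] > 0).astype(int)  # NaN > 0 evaluates to False
toy["budget"] = toy["budget"].fillna(0)
toy["overview"] = toy["overview"].fillna("")

# Hypothetical downstream feature: zero and missing budgets both map to 0.0
toy["log_budget"] = np.log1p(toy["budget"])

print(merged)
print(toy)
```

Two things are worth noticing in the printed frames. First, rows with a zero budget and rows with an unparseable budget end up with the same `budget` and `log_budget`, and both get `has_budget == 0`, so the flag marks "no usable budget" rather than separating the two cases. Second, `keep='first'` retains whichever `tt2` row comes first after the merge, which need not be the row whose overview came from the overviews file; sorting before deduplicating is one option if that matters.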


## 3. Export Clean Data

In [5]:
output_path = DATA_DIR / 'tmdb_clean.csv'
df.to_csv(output_path, index=False)
print(f"Clean data exported to: {output_path}")

Clean data exported to: ../data/tmdb_clean.csv
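
One caveat when the exported file is loaded again: by default `pd.read_csv` parses empty fields as `NaN`, so overviews that were filled with `""` come back as missing. The sketch below shows this round trip on a made-up frame, using an in-memory buffer in place of `tmdb_clean.csv`; it is an illustration of the behavior, not a step from the original notebook.

```python
import io

import pandas as pd

# Toy frame, made up for illustration
toy = pd.DataFrame({
    "imdbId": ["tt0000001", "tt0000002"],
    "overview": ["Some plot", ""],
    "has_budget": [1, 0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: the empty overview field is parsed as NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)
print(default_read["overview"].isna().sum())

# keep_default_na=False: empty fields stay empty strings
buffer.seek(0)
strict_read = pd.read_csv(buffer, keep_default_na=False)
print(strict_read["overview"].tolist())
```

Note that `keep_default_na=False` applies to every column, so any column that still contains genuinely missing values would be read back as empty strings. Calling `fillna("")` on the text columns again after a default read is an alternative.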
