# Data Pipeline - IMDb + TMDB

**Goal:** Create two clean datasets for the Floportop movie rating prediction model.

| Output | Description | Rows |
|--------|-------------|------|
| `movies_full_298k.csv` | All IMDb movies, with TMDB data where available | ~298k |
| `movies_rich_39k.csv` | Only movies that have TMDB data (plots, etc.) | ~38k |

## Data Sources

| Dataset | Kaggle Link | Files We Use |
|---------|-------------|---------------|
| IMDb | [ashirwadsangwan/imdb-dataset](https://www.kaggle.com/datasets/ashirwadsangwan/imdb-dataset) | `title.basics.tsv`, `title.ratings.tsv` |
| TMDB | [rounakbanik/the-movies-dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset) | `movies_metadata.csv`, `credits.csv` |

**What each file provides:**

| File | Content |
|------|--------|
| `title.basics.tsv` | title, year, runtime, genres |
| `title.ratings.tsv` | averageRating, numVotes |
| `movies_metadata.csv` | imdb_id, title, overview (plot), budget, revenue |
| `credits.csv` | cast, crew (we extract directors) |

---
## 1. Setup

In [8]:
import pandas as pd
import numpy as np
import os
from pathlib import Path

# Display settings
pd.set_option("display.max_columns", None)

# === PATHS ===
DATA_DIR = Path("../../data")
RAW_DIR = DATA_DIR / "raw"
OUTPUT_DIR = DATA_DIR / "processed"

print("Setup complete!")
print(f"  RAW_DIR: {RAW_DIR}")
print(f"  OUTPUT_DIR: {OUTPUT_DIR}")

Setup complete!
  RAW_DIR: ../../data/raw
  OUTPUT_DIR: ../../data/processed


---
## 2. Download Raw Data

Requires the Kaggle API: save your API token file `kaggle.json` as `~/.kaggle/kaggle.json`.
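
Before running the download cell, it can help to confirm that the token file is where the notebook expects it. The following is a minimal sketch; `kaggle_credentials_found` is a hypothetical helper name, and it only checks that the file exists, not that the token is valid.

```python
from pathlib import Path


def kaggle_credentials_found(config_dir=None):
    """Return True if a kaggle.json file exists in the given directory."""
    # Default location used by this notebook: ~/.kaggle/kaggle.json
    if config_dir is None:
        config_dir = Path.home() / ".kaggle"
    return (Path(config_dir) / "kaggle.json").is_file()


if not kaggle_credentials_found():
    print("kaggle.json not found in ~/.kaggle; the download step may fail")
```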

In [9]:
# Kaggle dataset identifiers
IMDB_DATASET = "ashirwadsangwan/imdb-dataset"
TMDB_DATASET = "rounakbanik/the-movies-dataset"

# Check what exists
imdb_exists = (RAW_DIR / "title.basics.tsv").exists() and (RAW_DIR / "title.ratings.tsv").exists()
tmdb_exists = (RAW_DIR / "movies_metadata.csv").exists() and (RAW_DIR / "credits.csv").exists()

print("Checking raw data...")
print(f"  IMDb: {'exists' if imdb_exists else 'MISSING'}")
print(f"  TMDB: {'exists' if tmdb_exists else 'MISSING'}")

# Download if needed
if not imdb_exists:
    print(f"\nDownloading IMDb...")
    os.system(f"kaggle datasets download -d {IMDB_DATASET} -p {RAW_DIR} --unzip")

if not tmdb_exists:
    print(f"\nDownloading TMDB...")
    os.system(f"kaggle datasets download -d {TMDB_DATASET} -p {RAW_DIR} --unzip")

# Show raw files with sizes
print("\n--- Raw files ---")
for f in sorted(RAW_DIR.iterdir()):
    if f.is_file():
        size_mb = f.stat().st_size / 1024 / 1024
        print(f"  {f.name}: {size_mb:.1f} MB")

Checking raw data...
  IMDb: exists
  TMDB: exists

--- Raw files ---
  credits.csv: 181.1 MB
  keywords.csv: 5.9 MB
  links.csv: 0.9 MB
  links_small.csv: 0.2 MB
  movies_metadata.csv: 32.8 MB
  name.basics.tsv: 886.8 MB
  ratings.csv: 676.7 MB
  ratings_small.csv: 2.3 MB
  title.akas.tsv: 2634.2 MB
  title.basics.tsv: 1010.8 MB
  title.principals.tsv: 4147.5 MB
  title.ratings.tsv: 27.1 MB


---
## 3. IMDb Cleaning

**Steps:**
1. Load `title.basics.tsv` + `title.ratings.tsv`
2. Filter to movies only (drop TV shows, shorts, etc.)
3. Merge basics + ratings
4. Drop rows with missing `runtimeMinutes` or `genres`
5. Fix data types
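
The steps above can be tried on a tiny in-memory example before running them on the full files. This is a sketch with made-up rows (the `tt01`...`tt05` ids and titles are not real IMDb data); it mirrors the column names and the `\N` missing-value marker used by the notebook.

```python
import io
import pandas as pd

# Tiny stand-ins for title.basics.tsv and title.ratings.tsv (made-up rows)
basics_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\truntimeMinutes\tgenres\n"
    "tt01\tmovie\tAlpha\t1999\t90\tDrama\n"
    "tt02\tshort\tBeta\t2001\t12\tComedy\n"
    "tt03\tmovie\tGamma\t2005\t\\N\tAction\n"
    "tt04\tmovie\tDelta\t\\N\t101\tDrama\n"
    "tt05\tmovie\tEpsilon\t2010\t95\tHorror\n"
)
ratings_tsv = (
    "tconst\taverageRating\tnumVotes\n"
    "tt01\t7.1\t1200\n"
    "tt02\t6.0\t50\n"
    "tt03\t5.5\t300\n"
    "tt04\t6.8\t800\n"
)

# Step 1: load, treating \N as missing
basics = pd.read_csv(io.StringIO(basics_tsv), sep="\t", na_values="\\N")
ratings = pd.read_csv(io.StringIO(ratings_tsv), sep="\t", na_values="\\N")

# Step 2: movies only (drops the short tt02)
movies = basics[basics["titleType"] == "movie"]

# Step 3: inner join keeps only rated movies (drops tt05)
merged = movies.merge(ratings, on="tconst", how="inner")

# Step 4: require runtime and genres (drops tt03)
complete = merged.dropna(subset=["runtimeMinutes", "genres"]).copy()

# Step 5: integer types; a missing year becomes 0, as in the notebook
complete["runtimeMinutes"] = complete["runtimeMinutes"].astype(int)
complete["startYear"] = complete["startYear"].fillna(0).astype(int)

print(complete[["tconst", "startYear", "runtimeMinutes"]])
```

Note that `startYear` is not part of the `dropna` subset, so a movie with an unknown year survives with `startYear == 0`; downstream code may want to treat 0 as "unknown".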

In [10]:
# Load IMDb data
# Note: IMDb uses \N for missing values

print("Loading IMDb data...")

df_basics = pd.read_csv(
    RAW_DIR / "title.basics.tsv",
    sep="\t",
    na_values="\\N",
    low_memory=False
)
print(f"  title.basics: {len(df_basics):,} rows")

df_ratings = pd.read_csv(
    RAW_DIR / "title.ratings.tsv",
    sep="\t",
    na_values="\\N"
)
print(f"  title.ratings: {len(df_ratings):,} rows")

Loading IMDb data...
  title.basics: 12,240,026 rows
  title.ratings: 1,627,920 rows


In [11]:
# Clean IMDb data

# 1. Filter to movies only
print("Filtering to movies only...")
df_movies = df_basics[df_basics["titleType"] == "movie"].copy()
print(f"  {len(df_basics):,} → {len(df_movies):,} (movies only)")

# 2. Drop irrelevant columns
df_movies = df_movies.drop(columns=["endYear", "titleType"])

# 3. Merge with ratings (inner join: only movies that have ratings)
print("\nMerging with ratings...")
df_imdb = pd.merge(df_movies, df_ratings, on="tconst", how="inner")
print(f"  {len(df_movies):,} → {len(df_imdb):,} (with ratings)")

# 4. Drop rows with missing runtimeMinutes or genres
print("\nDropping rows with missing runtime/genres...")
before = len(df_imdb)
df_imdb = df_imdb.dropna(subset=["runtimeMinutes", "genres"])
print(f"  {before:,} → {len(df_imdb):,} (complete data)")

# 5. Fix data types
df_imdb["runtimeMinutes"] = df_imdb["runtimeMinutes"].astype(int)
df_imdb["startYear"] = df_imdb["startYear"].fillna(0).astype(int)

# Final result
print(f"\n✓ IMDb cleaned: {len(df_imdb):,} movies")
df_imdb.head(3)

Filtering to movies only...
  12,240,026 → 738,208 (movies only)

Merging with ratings...
  738,208 → 338,468 (with ratings)

Dropping rows with missing runtime/genres...
  338,468 → 298,616 (complete data)

✓ IMDb cleaned: 298,616 movies


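| | tconst | primaryTitle | originalTitle | isAdult | startYear | runtimeMinutes | genres | averageRating | numVotes |
|---|--------|--------------|---------------|---------|-----------|----------------|--------|---------------|----------|
| 0 | tt0000009 | Miss Jerry | Miss Jerry | 0 | 1894 | 45 | Romance | 5.2 | 232 |
| 1 | tt0000147 | The Corbett-Fitzsimmons Fight | The Corbett-Fitzsimmons Fight | 0 | 1897 | 100 | Documentary,News,Sport | 5.3 | 584 |
| 2 | tt0000335 | Soldiers of the Cross | Soldiers of the Cross | 0 | 1900 | 40 | Biography,Drama | 5.4 | 67 |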


---
## 4. TMDB Cleaning

**Raw files:** `movies_metadata.csv`, `credits.csv`

**Steps:**
1. Load raw TMDB files
2. Clean movies_metadata (get imdb_id directly)
3. Extract directors from crew (parse the stringified crew list)
4. Merge into single dataframe
5. Create indicator flags (`has_budget`, `has_revenue`)
6. Handle missing values
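
The director-extraction step can be sketched in isolation. This assumes, as the notebook's use of `ast.literal_eval` suggests, that each `crew` value is a stringified Python list of dicts with `job` and `name` keys; the sample string below is made up for illustration (only the name "John Lasseter" appears in the notebook output).

```python
import ast


def extract_directors(crew_str):
    """Return a comma-separated string of director names, or "" if none can be parsed."""
    try:
        crew = ast.literal_eval(crew_str)
    except (ValueError, SyntaxError, TypeError):
        return ""  # malformed or missing input (for example NaN)
    if not isinstance(crew, list):
        return ""
    names = [
        person["name"]
        for person in crew
        if isinstance(person, dict)
        and person.get("job") == "Director"
        and "name" in person
    ]
    return ", ".join(names)


# Made-up sample in the same shape as a credits.csv crew cell
sample = "[{'job': 'Director', 'name': 'John Lasseter'}, {'job': 'Editor', 'name': 'Jane Doe'}]"
print(extract_directors(sample))
print(repr(extract_directors("not a list")))
```

Catching only the exceptions that `ast.literal_eval` is expected to raise keeps unrelated errors (such as a keyboard interrupt) from being silently swallowed.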

In [12]:
# Load raw TMDB files
import ast

print("Loading TMDB data...")

# Movies metadata (overview, budget, revenue, imdb_id)
movies_meta = pd.read_csv(RAW_DIR / "movies_metadata.csv", low_memory=False)
print(f"  movies_metadata: {len(movies_meta):,} rows")

# Credits (cast, crew - we extract directors)
credits = pd.read_csv(RAW_DIR / "credits.csv")
print(f"  credits: {len(credits):,} rows")

Loading TMDB data...
  movies_metadata: 45,466 rows
  credits: 45,476 rows


In [13]:
# Clean movies_metadata
print("Cleaning movies_metadata...")

# Some rows have non-numeric ids (bad data) - filter them out
movies_meta = movies_meta[movies_meta["id"].apply(lambda x: str(x).isdigit())].copy()
movies_meta["id"] = movies_meta["id"].astype(int)
print(f"  After removing bad ids: {len(movies_meta):,} rows")

# Get imdb_id directly from the file (no need for links.csv!)
# Filter to rows with valid imdb_id (starts with 'tt')
movies_meta["imdb_id"] = movies_meta["imdb_id"].astype(str)
movies_meta = movies_meta[movies_meta["imdb_id"].str.startswith("tt")].copy()
print(f"  After filtering valid imdb_id: {len(movies_meta):,} rows")

# Rename to match our convention
movies_meta = movies_meta.rename(columns={"imdb_id": "imdbId"})

# Keep only columns we need
movies_meta = movies_meta[["id", "imdbId", "title", "overview", "budget", "revenue"]].copy()

# Convert budget/revenue to numeric
movies_meta["budget"] = pd.to_numeric(movies_meta["budget"], errors="coerce").fillna(0)
movies_meta["revenue"] = pd.to_numeric(movies_meta["revenue"], errors="coerce").fillna(0)

movies_meta.head(2)

Cleaning movies_metadata...
  After removing bad ids: 45,463 rows
  After filtering valid imdb_id: 45,446 rows


| | id | imdbId | title | overview | budget | revenue |
|---|----|--------|-------|----------|--------|---------|
| 0 | 862 | tt0114709 | Toy Story | Led by Woody, Andy's toys live happily in his ... | 30000000 | 373554033.0 |
| 1 | 8844 | tt0113497 | Jumanji | When siblings Judy and Peter discover an encha... | 65000000 | 262797249.0 |


In [14]:
# Extract directors from credits
print("Extracting directors from credits...")

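def extract_directors(crew_str):
    """Parse the stringified crew list and extract director names."""
    try:
        crew = ast.literal_eval(crew_str)
        directors = [p["name"] for p in crew if p.get("job") == "Director"]
        return ", ".join(directors) if directors else ""
    except (ValueError, SyntaxError, TypeError, KeyError, AttributeError):
        # Malformed or missing crew data: treat as "no director known"
        return ""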

credits["id"] = credits["id"].astype(int)
credits["director_names"] = credits["crew"].apply(extract_directors)

# Keep only id and director_names
directors_df = credits[["id", "director_names"]].copy()
print(f"  Extracted directors for {len(directors_df):,} movies")

directors_df.head(2)

Extracting directors from credits...
  Extracted directors for 45,476 movies


| | id | director_names |
|---|----|----------------|
| 0 | 862 | John Lasseter |
| 1 | 8844 | Joe Johnston |


In [15]:
# Merge all TMDB data together
print("Merging TMDB data...")

# Start with movies_metadata (already has imdbId)
df_tmdb = movies_meta.copy()

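# Add directors
# Note: this left join adds rows (45,446 → 45,522), which means some ids are
# repeated in credits.csv; duplicates are dropped on imdbId before the final merge
df_tmdb = df_tmdb.merge(directors_df, on="id", how="left")
print(f"  After adding directors: {len(df_tmdb):,}")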

# Create indicator flags
# Missing budget/revenue values are stored as 0, so these flags let the model
# tell "no value reported" apart from a real, positive amount
df_tmdb["has_budget"] = (df_tmdb["budget"] > 0).astype(int)
df_tmdb["has_revenue"] = (df_tmdb["revenue"] > 0).astype(int)

print(f"\n  Movies with budget: {df_tmdb['has_budget'].sum():,} ({df_tmdb['has_budget'].mean()*100:.1f}%)")
print(f"  Movies with revenue: {df_tmdb['has_revenue'].sum():,} ({df_tmdb['has_revenue'].mean()*100:.1f}%)")

# Handle missing values
df_tmdb["overview"] = df_tmdb["overview"].fillna("")
df_tmdb["director_names"] = df_tmdb["director_names"].fillna("")

# Final result
print(f"\n✓ TMDB cleaned: {len(df_tmdb):,} movies")
df_tmdb.head(3)

Merging TMDB data...
  After adding directors: 45,522

  Movies with budget: 8,910 (19.6%)
  Movies with revenue: 7,428 (16.3%)

✓ TMDB cleaned: 45,522 movies


| | id | imdbId | title | overview | budget | revenue | director_names | has_budget | has_revenue |
|---|----|--------|-------|----------|--------|---------|----------------|------------|-------------|
| 0 | 862 | tt0114709 | Toy Story | Led by Woody, Andy's toys live happily in his ... | 30000000 | 373554033.0 | John Lasseter | 1 | 1 |
| 1 | 8844 | tt0113497 | Jumanji | When siblings Judy and Peter discover an encha... | 65000000 | 262797249.0 | Joe Johnston | 1 | 1 |
| 2 | 15602 | tt0113228 | Grumpier Old Men | A family wedding reignites the ancient feud be... | 0 | 0.0 | Howard Deutch | 0 | 0 |


---
## 5. Merge IMDb + TMDB

**Strategy:** LEFT JOIN IMDb with TMDB on `tconst` = `imdbId`
- All 298k IMDb movies are kept
- TMDB data is added where available (~38k movies end up with a TMDB overview)
- Movies without TMDB data get empty strings (text columns) or zeros (numeric columns and flags)
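
The join strategy above can be sketched on toy data. The example below uses two pandas `merge` options that the notebook itself does not use: `validate="many_to_one"` to fail fast if `imdbId` is duplicated on the TMDB side, and `indicator=True` to count true key matches separately from rows that merely have a non-empty overview. The frames and ids are made up.

```python
import pandas as pd

# Made-up rows standing in for df_imdb and the deduplicated TMDB frame
imdb = pd.DataFrame({
    "tconst": ["tt01", "tt02", "tt03"],
    "averageRating": [7.1, 6.0, 5.5],
})
tmdb = pd.DataFrame({
    "imdbId": ["tt01", "tt03", "tt99"],
    "overview": ["A plot.", None, "Unmatched."],
})

merged = imdb.merge(
    tmdb,
    left_on="tconst",
    right_on="imdbId",
    how="left",
    validate="many_to_one",  # raises MergeError if imdbId repeats in tmdb
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Every IMDb row is kept; tt99 (TMDB only) is not added
matched = int((merged["_merge"] == "both").sum())
with_overview = int(merged["overview"].fillna("").ne("").sum())
print(f"rows: {len(merged)}, matched: {matched}, with overview: {with_overview}")
```

In this toy case two rows match on the key but only one has an overview, which is why counting non-empty overviews can understate the number of TMDB matches.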

In [16]:
# Prepare TMDB columns for merge
tmdb_for_merge = df_tmdb[[
    "imdbId", "overview", "budget", "revenue", 
    "has_budget", "has_revenue", "director_names"
]].copy()

# Remove duplicates (keep first)
tmdb_for_merge = tmdb_for_merge.drop_duplicates(subset=["imdbId"], keep="first")
print(f"TMDB unique movies: {len(tmdb_for_merge):,}")

# LEFT JOIN: IMDb + TMDB
print("\nMerging IMDb + TMDB...")
df_full = df_imdb.merge(
    tmdb_for_merge,
    left_on="tconst",
    right_on="imdbId",
    how="left"
)

# Fill missing TMDB values
df_full["overview"] = df_full["overview"].fillna("")
df_full["budget"] = df_full["budget"].fillna(0)
df_full["revenue"] = df_full["revenue"].fillna(0)
df_full["has_budget"] = df_full["has_budget"].fillna(0).astype(int)
df_full["has_revenue"] = df_full["has_revenue"].fillna(0).astype(int)
df_full["director_names"] = df_full["director_names"].fillna("")

# Drop redundant imdbId column (we have tconst)
df_full = df_full.drop(columns=["imdbId"])

# Count matches (proxy: a non-empty overview; TMDB rows with a blank overview are not counted)
has_tmdb = (df_full["overview"] != "").sum()
print(f"\n✓ Merged dataset: {len(df_full):,} movies")
print(f"  With TMDB data: {has_tmdb:,} ({has_tmdb/len(df_full)*100:.1f}%)")
print(f"  Without TMDB data: {len(df_full) - has_tmdb:,}")

df_full.head(3)

TMDB unique movies: 45,416

Merging IMDb + TMDB...

✓ Merged dataset: 298,616 movies
  With TMDB data: 38,240 (12.8%)
  Without TMDB data: 260,376


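| | tconst | primaryTitle | originalTitle | isAdult | startYear | runtimeMinutes | genres | averageRating | numVotes | overview | budget | revenue | has_budget | has_revenue | director_names |
|---|--------|--------------|---------------|---------|-----------|----------------|--------|---------------|----------|----------|--------|---------|------------|-------------|----------------|
| 0 | tt0000009 | Miss Jerry | Miss Jerry | 0 | 1894 | 45 | Romance | 5.2 | 232 | | 0.0 | 0.0 | 0 | 0 | |
| 1 | tt0000147 | The Corbett-Fitzsimmons Fight | The Corbett-Fitzsimmons Fight | 0 | 1897 | 100 | Documentary,News,Sport | 5.3 | 584 | | 0.0 | 0.0 | 0 | 0 | |
| 2 | tt0000335 | Soldiers of the Cross | Soldiers of the Cross | 0 | 1900 | 40 | Biography,Drama | 5.4 | 67 | | 0.0 | 0.0 | 0 | 0 | |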


---
## 5.1 Title Exploration

Before exporting, let's explore the title columns to understand what cleaning might be needed.

In [17]:
# === TITLE EXPLORATION ===
print("=== Title Columns ===")
print(f"primaryTitle nulls: {df_full['primaryTitle'].isna().sum()}")
print(f"originalTitle nulls: {df_full['originalTitle'].isna().sum()}")

# Are they different?
different = (df_full['primaryTitle'] != df_full['originalTitle']).sum()
print(f"\nDifferent from originalTitle: {different:,} ({different/len(df_full)*100:.1f}%)")

# Sample some titles
print("\n--- Sample titles ---")
df_full[['primaryTitle', 'originalTitle']].sample(10)

=== Title Columns ===
primaryTitle nulls: 0
originalTitle nulls: 0

Different from originalTitle: 70,169 (23.5%)

--- Sample titles ---


| | primaryTitle | originalTitle |
|---|--------------|---------------|
| 45132 | Megaforce | Megaforce |
| 255920 | Flora | Flora |
| 101070 | A Taste of... Greece! | Gefsi apo... Ellada! |
| 223067 | A Strange Course of Events | A Strange Course of Events |
| 140476 | Before Dawn | Before Dawn |
| 219642 | Stephen | Stephen |
| 180551 | Lusitania Illusion | Fantasia Lusitana |
| 172219 | The Diamonds of Metro Valley | The Diamonds of Metro Valley |
| 34581 | Office Girls | Erotik im Beruf - Was jeder Personalchef gern ... |
| 171083 | Taste of Desperation | Taste of Desperation |


In [18]:
# Title length analysis
df_full['title_len'] = df_full['primaryTitle'].str.len()
df_full['title_word_count'] = df_full['primaryTitle'].str.split().str.len()

print("=== Title Length Stats ===")
print(f"Character length: min={df_full['title_len'].min()}, max={df_full['title_len'].max()}, mean={df_full['title_len'].mean():.1f}")
print(f"Word count: min={df_full['title_word_count'].min()}, max={df_full['title_word_count'].max()}, mean={df_full['title_word_count'].mean():.1f}")

# Show longest titles
print("\n--- Longest titles (potential issues) ---")
df_full.nlargest(5, 'title_len')[['primaryTitle', 'title_len']]

=== Title Length Stats ===
Character length: min=1, max=242, mean=17.1
Word count: min=1, max=47, mean=3.0

--- Longest titles (potential issues) ---


| | primaryTitle | title_len |
|---|--------------|-----------|
| 291616 | I Saw a Little Bird Flying Over a Psychiatric ... | 242 |
| 187969 | Night of the Day of the Dawn of the Son of the... | 208 |
| 124602 | Night of the Day of the Dawn of the Son of the... | 196 |
| 230992 | You Had to Be There: How the Toronto Godspell ... | 175 |
| 39680 | Las poquianchis (De los pormenores y otros suc... | 165 |


In [19]:
# Check for patterns and special characters
import re

print("=== Title Patterns ===")

# Contains numbers (potential sequels)
has_number = df_full['primaryTitle'].str.contains(r'\d', regex=True).sum()
print(f"Contains numbers: {has_number:,} ({has_number/len(df_full)*100:.1f}%)")

# Contains colon (often subtitles)
has_colon = df_full['primaryTitle'].str.contains(':').sum()
print(f"Contains colon: {has_colon:,} ({has_colon/len(df_full)*100:.1f}%)")

# Starts with "The"
starts_the = df_full['primaryTitle'].str.startswith('The ').sum()
print(f"Starts with 'The': {starts_the:,} ({starts_the/len(df_full)*100:.1f}%)")

# Check for non-ASCII characters
has_non_ascii = df_full['primaryTitle'].str.contains(r'[^\x00-\x7F]', regex=True).sum()
print(f"Contains non-ASCII: {has_non_ascii:,} ({has_non_ascii/len(df_full)*100:.1f}%)")

# Sample titles with non-ASCII
print("\n--- Sample non-ASCII titles ---")
non_ascii_mask = df_full['primaryTitle'].str.contains(r'[^\x00-\x7F]', regex=True)
df_full[non_ascii_mask][['primaryTitle']].sample(min(10, non_ascii_mask.sum()))

=== Title Patterns ===
Contains numbers: 12,756 (4.3%)
Contains colon: 21,284 (7.1%)
Starts with 'The': 39,286 (13.2%)
Contains non-ASCII: 15,911 (5.3%)

--- Sample non-ASCII titles ---


| | primaryTitle |
|---|--------------|
| 226995 | Ögretmen |
| 26071 | Åsa-Nisse som polis |
| 129961 | Abrígate |
| 211150 | Testigo involuntario. Nicolás Redondo |
| 138735 | Lasse Månsson fra Skaane |
| 44086 | El héroe desconocido |
| 291249 | Doble Sesión |
| 292081 | La Démocratie des crédules |
| 187517 | Conspiração Fatal |
| 79680 | Puntos suspensivos o Esperando a los bárbaros |
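
The non-ASCII check used above can be reproduced on plain strings, which makes the regex easier to reason about. This sketch uses a few titles from the samples in this section; `ascii_fold` is a hypothetical helper showing one possible normalization, not something the pipeline applies.

```python
import re
import unicodedata

# Same character class as the notebook: anything outside 7-bit ASCII
NON_ASCII = re.compile(r"[^\x00-\x7F]")

titles = ["Megaforce", "Ögretmen", "Doble Sesión", "Before Dawn"]
flags = {title: bool(NON_ASCII.search(title)) for title in titles}

# str.isascii() (Python 3.7+) gives the same answer without a regex
assert all(flags[title] == (not title.isascii()) for title in titles)


def ascii_fold(text):
    """Strip accents by decomposing characters and dropping non-ASCII marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(flags)
print(ascii_fold("Doble Sesión"))
```

Folding of this kind loses information (and drops characters that have no ASCII decomposition), so it is probably better suited to derived features than to the stored titles.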


In [20]:
# Clean up exploration columns (we'll create proper features in feature_engineering)
df_full = df_full.drop(columns=['title_len', 'title_word_count'])
print("Dropped temporary exploration columns")

Dropped temporary exploration columns


---
## 6. Export Final Datasets

| Dataset | Description |
|---------|-------------|
| `movies_full_298k.csv` | All movies (TMDB data where available) |
| `movies_rich_39k.csv` | Only movies with TMDB data (has overview) |

In [21]:
# Create the two final datasets

# 1. Full dataset (all movies)
df_movies_full = df_full.copy()

# 2. Rich dataset (only movies with TMDB overview)
df_movies_rich = df_full[df_full["overview"] != ""].copy()

print(f"movies_full: {len(df_movies_full):,} rows")
print(f"movies_rich: {len(df_movies_rich):,} rows")

movies_full: 298,616 rows
movies_rich: 38,240 rows


In [22]:
# Export to CSV
print("Exporting datasets...")

# Ensure output directory exists
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Export full dataset
full_path = OUTPUT_DIR / "movies_full_298k.csv"
df_movies_full.to_csv(full_path, index=False)
print(f"✓ {full_path.name}: {len(df_movies_full):,} rows")

# Export rich dataset
rich_path = OUTPUT_DIR / "movies_rich_39k.csv"
df_movies_rich.to_csv(rich_path, index=False)
print(f"✓ {rich_path.name}: {len(df_movies_rich):,} rows")

print("\nDone! Files saved to:", OUTPUT_DIR)

Exporting datasets...
✓ movies_full_298k.csv: 298,616 rows
✓ movies_rich_39k.csv: 38,240 rows

Done! Files saved to: ../../data/processed
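
One thing to watch when loading these files later: the pipeline stores missing text as empty strings, but a default `pd.read_csv` reads empty CSV fields back as `NaN`. The sketch below shows the round trip on a made-up two-row frame and one way to restore the convention; the column names match the exported files, the rows do not.

```python
import io
import pandas as pd

# Made-up rows using the exported column names
df = pd.DataFrame({
    "tconst": ["tt01", "tt02"],
    "overview": ["", "A plot."],
    "director_names": ["", "Jane Doe"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
nan_after_reload = reloaded["overview"].isna().tolist()
print(nan_after_reload)  # the empty string is read back as NaN

# Restore the empty-string convention for text columns
text_cols = ["overview", "director_names"]
reloaded[text_cols] = reloaded[text_cols].fillna("")
```

Without the `fillna("")` step, a filter such as `df["overview"] != ""` would also keep the rows whose overview is `NaN`.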


---
## Notes: Data Quality Investigation

During development, we compared our outputs with the previous v1 cleaning files:

| Dataset | Old (v1) | New | Difference |
|---------|----------|-----|------------|
| IMDb (full) | 298,616 | 298,616 | ✓ Same |
| TMDB (rich) | 44,101 | 38,240 | -5,861 |

**Why the TMDB difference?**

The old `tmdb_clean.csv` was standalone TMDB data. The new `movies_rich_39k.csv` is a subset of IMDb movies that also have TMDB data.

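| Stage | Count |
|-------|-------|
| Raw TMDB with valid imdb_id | 45,446 |
| Unique TMDB `imdbId` values after deduplication | 45,416 |
| TMDB movies in cleaned IMDb with a non-empty overview | 38,240 |
| TMDB movies NOT in `movies_rich_39k.csv` | 7,176 |

The ~7k "missing" movies are TMDB entries that:
- Don't exist in IMDb at all, or are not typed as `movie` there
- Exist in IMDb but have no ratings (we require ratings)
- Exist but are missing runtime/genres (we dropped those)
- Match a cleaned IMDb movie but have an empty TMDB overview (the rich filter requires one)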
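
The IMDb-side reasons in the list above can be checked with plain set arithmetic. This is a sketch with made-up ids; in the notebook the sets would be built from the `tconst` / `imdbId` columns of the intermediate frames, and the partition only holds because each IMDb filter is applied to the output of the previous one.

```python
# Made-up ids standing in for the id columns of the intermediate frames
tmdb_ids = {"tt01", "tt02", "tt03", "tt04", "tt05"}
imdb_movie_ids = {"tt01", "tt02", "tt03", "tt04"}   # titleType == "movie"
imdb_rated_ids = {"tt01", "tt02", "tt03"}           # movies that have ratings
imdb_clean_ids = {"tt01", "tt02"}                   # runtime and genres present

# Each TMDB id falls into exactly one bucket
not_in_imdb = tmdb_ids - imdb_movie_ids
no_ratings = (tmdb_ids & imdb_movie_ids) - imdb_rated_ids
incomplete = (tmdb_ids & imdb_rated_ids) - imdb_clean_ids
kept = tmdb_ids & imdb_clean_ids

# Sanity check: the buckets account for every TMDB id
total = len(not_in_imdb) + len(no_ratings) + len(incomplete) + len(kept)
assert total == len(tmdb_ids)

print(f"not in IMDb: {len(not_in_imdb)}, no ratings: {len(no_ratings)}, "
      f"incomplete: {len(incomplete)}, kept: {len(kept)}")
```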

**Conclusion:** The new dataset is cleaner. Every row in `movies_rich_39k.csv` has complete IMDb data AND TMDB data, making it immediately usable for modeling.