# Model Training v3 (Combined Datasets)

**Goal:** Train and compare models using features from both IMDb and TMDb.

**Inputs:**
- `data/movies_wide.csv` (IMDb features, ~298k movies)
- `data/tmdb_wide.csv` (TMDb features + Plot PCA, ~44k movies)

**Strategy:**
1. Merge datasets (Left Join on IMDb ID).
2. **Full Dataset Experiments** (298k movies, about 87% without TMDb data):
   - Experiment 1: Baseline (IMDb features only)
   - Experiment 2: Add Plot PCA features (fill missing with 0)
3. **Subset Experiments** (the ~38k movies WITH TMDb data after the merge):
   - Experiment 3: Baseline on subset
   - Experiment 4: Baseline + PCA on subset → **True plot impact**

**Why subset experiments matter:**
Testing on all 298k movies dilutes the plot signal because about 87% have no plot data (filled with 0s).
The subset experiments show the real R² improvement when a user actually provides a plot.
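
The merge, flag, and zero-fill steps of the strategy above can be sketched on toy data. The frames and values below are made up for illustration; only the column names `tconst`, `imdbId`, `pca_0`, and `has_tmdb` follow this notebook. The `validate` argument is an optional extra guard, not something the notebook itself uses.

```python
import pandas as pd

# Toy stand-ins for movies_wide.csv and tmdb_wide.csv (values are made up).
imdb_toy = pd.DataFrame({
    "tconst": ["tt01", "tt02", "tt03", "tt04"],
    "startYear": [1994, 2001, 2010, 2019],
})
tmdb_toy = pd.DataFrame({
    "imdbId": ["tt02", "tt04"],
    "pca_0": [0.12, -0.30],
})

# Left join keeps every IMDb row; validate raises if a TMDb key is duplicated.
merged_toy = imdb_toy.merge(
    tmdb_toy, left_on="tconst", right_on="imdbId",
    how="left", validate="many_to_one",
)

# Flag coverage BEFORE zero-filling: afterwards a missing plot and a
# genuine 0.0 component would be indistinguishable.
merged_toy["has_tmdb"] = merged_toy["imdbId"].notna().astype(int)
merged_toy["pca_0"] = merged_toy["pca_0"].fillna(0)

# The subset experiments only use rows that actually matched.
subset_toy = merged_toy[merged_toy["has_tmdb"] == 1]

print(merged_toy)
print(f"Coverage: {merged_toy['has_tmdb'].mean():.0%}")
```

In this toy example half the rows are zero-filled; in the real merge the share is far higher, which is why the full-dataset comparison says little about the plot features themselves.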

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error

# Paths
DATA_DIR = Path('../data')

print("Setup complete!")

Setup complete!


## 1. Load and Merge Data

In [2]:
# Load datasets
imdb = pd.read_csv(DATA_DIR / 'movies_clean.csv') # Need tconst for merging
imdb_wide = pd.read_csv(DATA_DIR / 'movies_wide.csv')
tmdb_wide = pd.read_csv(DATA_DIR / 'tmdb_wide.csv')

print(f"IMDb wide shape: {imdb_wide.shape}")
print(f"TMDb wide shape: {tmdb_wide.shape}")

# Note: movies_wide.csv lost tconst, so it is re-attached by row position from
# movies_clean.csv. This is only valid if both files keep the same row order;
# a matching length is a necessary check, not proof of alignment.
if len(imdb) != len(imdb_wide):
    # Without tconst the merge below cannot run, so stop here instead of warning.
    raise ValueError("Length mismatch between movies_clean.csv and movies_wide.csv")

print("Lengths match. Attaching tconst to imdb_wide...")
imdb_wide['tconst'] = imdb['tconst'].values  # positional assignment

# Merge TMDb features
# Left join: Keep all IMDb movies, add TMDb info where available
merged = imdb_wide.merge(tmdb_wide, left_on='tconst', right_on='imdbId', how='left')

# Create has_tmdb flag BEFORE filling with 0s
# A movie has TMDb data if imdbId is not null after the merge
merged['has_tmdb'] = merged['imdbId'].notna().astype(int)

print(f"Movies with TMDb data: {merged['has_tmdb'].sum():,} ({merged['has_tmdb'].mean()*100:.1f}%)")

# Fill missing TMDb features with 0
# Identify new columns (those from tmdb_wide)
tmdb_cols = [c for c in tmdb_wide.columns if c != 'imdbId']
merged[tmdb_cols] = merged[tmdb_cols].fillna(0)

print(f"Merged dataset shape: {merged.shape}")
display(merged.head())

IMDb wide shape: (298616, 32)
TMDb wide shape: (43995, 26)
Lengths match. Attaching tconst to imdb_wide...
Movies with TMDb data: 37,905 (12.7%)
Merged dataset shape: (298616, 60)


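| | averageRating | isAdult | startYear | numVotes | genre_count | decade | movie_age | runtimeMinutes_capped | log_numVotes | hit | ... | pca_11 | pca_12 | pca_13 | pca_14 | pca_15 | pca_16 | pca_17 | pca_18 | pca_19 | has_tmdb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.2 | 0 | 1894.0 | 232 | 1 | 1890.0 | 132.0 | 45 | 5.451038 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 1 | 5.3 | 0 | 1897.0 | 584 | 3 | 1890.0 | 129.0 | 100 | 6.371612 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 2 | 5.4 | 0 | 1900.0 | 67 | 2 | 1900.0 | 126.0 | 40 | 4.219508 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 3 | 6.0 | 0 | 1906.0 | 1046 | 3 | 1900.0 | 120.0 | 70 | 6.953684 | 1 | ... | 0.013349 | -0.123792 | -0.023552 | 0.113217 | -0.136703 | 0.022376 | -0.090953 | 0.011291 | 0.062831 | 1 |
| 4 | 4.8 | 0 | 1907.0 | 37 | 1 | 1900.0 | 119.0 | 90 | 3.637586 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |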
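
Because `tconst` is attached by row position, a length match alone does not prove the two files are aligned. A minimal sketch of a stronger check follows, assuming some columns (here `startYear` and `numVotes`) exist in both `movies_clean.csv` and `movies_wide.csv`; that overlap is an assumption, and `rows_align` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def rows_align(clean: pd.DataFrame, wide: pd.DataFrame, cols) -> bool:
    """Return True if both frames have equal length and identical values
    in every column of `cols`, compared row by row (NaN counts as equal to NaN)."""
    if len(clean) != len(wide):
        return False
    for col in cols:
        left = clean[col].reset_index(drop=True)
        right = wide[col].reset_index(drop=True)
        same = (left == right) | (left.isna() & right.isna())
        if not same.all():
            return False
    return True

# Toy frames standing in for the two CSV files (values are made up).
clean_toy = pd.DataFrame({
    "tconst": ["tt01", "tt02", "tt03"],
    "startYear": [1894.0, 1897.0, 1900.0],
    "numVotes": [232, 584, 67],
})
wide_toy = clean_toy.drop(columns="tconst")
shuffled_toy = wide_toy.iloc[::-1].reset_index(drop=True)

print(rows_align(clean_toy, wide_toy, ["startYear", "numVotes"]))      # True
print(rows_align(clean_toy, shuffled_toy, ["startYear", "numVotes"]))  # False
```

A check like this could run just before `tconst` is attached; keeping `tconst` in `movies_wide.csv` when it is written would remove the need for it.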


## 2. Define Feature Sets

In [None]:
# Identify feature groups
target = 'averageRating'
# director_names would need processing before it can be used as a feature
ignore_cols = ['tconst', 'imdbId', 'director_names', target]

# 1. Base Features (from IMDb)
base_features = [c for c in imdb_wide.columns if c not in ignore_cols]

# 2. PCA Features
pca_features = [c for c in merged.columns if c.startswith('pca_')]

# 3. Budget/Revenue
money_features = ['log_budget', 'log_revenue']

# Print feature sets clearly
print("=" * 60)
print("FEATURE SETS")
print("=" * 60)

print(f"\n📊 BASE FEATURES ({len(base_features)} columns):")
print("-" * 40)
for i, f in enumerate(base_features):
    print(f"  {i+1:2}. {f}")

print(f"\n🎬 PCA FEATURES ({len(pca_features)} columns):")
print("-" * 40)
print(f"  pca_0 to pca_{len(pca_features)-1} (from plot embeddings)")

print(f"\n💰 MONEY FEATURES ({len(money_features)} columns):")
print("-" * 40)
for f in money_features:
    print(f"  - {f}")

print("\n" + "=" * 60)

## 3. Train/Test Split

In [4]:
# Drop rows with NaN in target or base features (22 rows in this run)
df_model = merged.dropna(subset=[target] + base_features)

X = df_model.drop(columns=[target, 'tconst', 'imdbId', 'director_names'])
y = df_model[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Train size: {len(X_train):,}")
print(f"Test size: {len(X_test):,}")

Train size: 209,015
Test size: 89,579


## 4. Experiment 1: Baseline on Full Dataset (IMDb Only)

In [5]:
print("Training Baseline Model...")
rf_base = RandomForestRegressor(n_estimators=50, max_depth=10, n_jobs=-1, random_state=42)
rf_base.fit(X_train[base_features], y_train)

y_pred_base = rf_base.predict(X_test[base_features])
r2_base = r2_score(y_test, y_pred_base)
mae_base = mean_absolute_error(y_test, y_pred_base)

print(f"Baseline R²: {r2_base:.4f}")
print(f"Baseline MAE: {mae_base:.4f}")

Training Baseline Model...
Baseline R²: 0.3117
Baseline MAE: 0.8535



## 5. Experiment 2: Baseline + Plot PCA on Full Dataset (0-filled)

In [6]:
features_v2 = base_features + pca_features
print(f"Training Model with Plots ({len(features_v2)} features)...")

rf_pca = RandomForestRegressor(n_estimators=50, max_depth=10, n_jobs=-1, random_state=42)
rf_pca.fit(X_train[features_v2], y_train)

y_pred_pca = rf_pca.predict(X_test[features_v2])
r2_pca = r2_score(y_test, y_pred_pca)
mae_pca = mean_absolute_error(y_test, y_pred_pca)

print(f"PCA Model R²: {r2_pca:.4f}")
print(f"PCA Model MAE: {mae_pca:.4f}")

Training Model with Plots (51 features)...
PCA Model R²: 0.3119
PCA Model MAE: 0.8535



## 6. Subset Experiments (Movies WITH TMDb Data Only)

**Why this matters:** The full dataset experiments above are diluted because about 87% of movies have no plot data (PCA features = 0).

These experiments filter to only the 37,905 movies that have TMDb data, showing the **true impact** of plot features when a user actually provides a plot.

In [7]:
# Filter to movies WITH TMDb data
df_subset = merged[merged['has_tmdb'] == 1].copy()
print(f"Subset size: {len(df_subset):,} movies")

# Train/test split on subset
X_sub = df_subset.drop(columns=[target, 'tconst', 'imdbId', 'director_names', 'has_tmdb'])
y_sub = df_subset[target]

X_train_sub, X_test_sub, y_train_sub, y_test_sub = train_test_split(
    X_sub, y_sub, test_size=0.3, random_state=42
)

print(f"Subset Train size: {len(X_train_sub):,}")
print(f"Subset Test size: {len(X_test_sub):,}")

# Experiment 3: Baseline on Subset
print("\n--- Experiment 3: Baseline on Subset ---")
rf_sub_base = RandomForestRegressor(n_estimators=50, max_depth=10, n_jobs=-1, random_state=42)
rf_sub_base.fit(X_train_sub[base_features], y_train_sub)

y_pred_sub_base = rf_sub_base.predict(X_test_sub[base_features])
r2_sub_base = r2_score(y_test_sub, y_pred_sub_base)
mae_sub_base = mean_absolute_error(y_test_sub, y_pred_sub_base)

print(f"Subset Baseline R²: {r2_sub_base:.4f}")
print(f"Subset Baseline MAE: {mae_sub_base:.4f}")

# Experiment 4: Baseline + PCA on Subset
print("\n--- Experiment 4: Baseline + PCA on Subset ---")
features_sub_pca = base_features + pca_features
rf_sub_pca = RandomForestRegressor(n_estimators=50, max_depth=10, n_jobs=-1, random_state=42)
rf_sub_pca.fit(X_train_sub[features_sub_pca], y_train_sub)

y_pred_sub_pca = rf_sub_pca.predict(X_test_sub[features_sub_pca])
r2_sub_pca = r2_score(y_test_sub, y_pred_sub_pca)
mae_sub_pca = mean_absolute_error(y_test_sub, y_pred_sub_pca)

print(f"Subset + PCA R²: {r2_sub_pca:.4f}")
print(f"Subset + PCA MAE: {mae_sub_pca:.4f}")

print(f"\n>>> Plot feature improvement on subset: {r2_sub_pca - r2_sub_base:.4f} R² <<<")

Subset size: 37,905 movies
Subset Train size: 26,533
Subset Test size: 11,372

--- Experiment 3: Baseline on Subset ---
Subset Baseline R²: 0.4757
Subset Baseline MAE: 0.5766

--- Experiment 4: Baseline + PCA on Subset ---
Subset + PCA R²: 0.4818
Subset + PCA MAE: 0.5729

>>> Plot feature improvement on subset: 0.0061 R² <<<

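The +0.0061 R² gain above comes from a single split and a single forest seed, so it is hard to tell how much of it is noise. A minimal sketch of repeating the comparison across several seeds follows. It uses synthetic data from `make_regression` as a stand-in for the subset, with the first five columns playing the role of the base features and all eight the role of base + PCA; none of this reproduces the notebook's numbers.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the subset (not the movie data).
X, y = make_regression(n_samples=600, n_features=8, n_informative=8,
                       noise=10.0, random_state=0)
base_idx = list(range(5))   # plays the role of base_features
full_idx = list(range(8))   # plays the role of base_features + pca_features

def r2_gain(seed: int) -> float:
    """R² of the wider feature set minus R² of the base set for one split and seed."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    scores = []
    for idx in (base_idx, full_idx):
        rf = RandomForestRegressor(n_estimators=30, max_depth=10, random_state=seed)
        rf.fit(X_tr[:, idx], y_tr)
        scores.append(r2_score(y_te, rf.predict(X_te[:, idx])))
    return scores[1] - scores[0]

gains = np.array([r2_gain(seed) for seed in range(5)])
print(f"Mean gain: {gains.mean():.4f}, std across seeds: {gains.std(ddof=1):.4f}")
```

If the spread across seeds on the real subset turned out to be comparable to 0.006, the plot features' contribution would need more evidence than Experiment 4 alone provides.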

## 7. Summary of Results

In [8]:
results = pd.DataFrame({
    'Experiment': [
        '1. Full Dataset - Baseline',
        '2. Full Dataset - Baseline + PCA',
        '3. TMDb Subset - Baseline',
        '4. TMDb Subset - Baseline + PCA'
    ],
    'Dataset': ['298k', '298k', '38k', '38k'],  # subset is the 37,905 matched movies
    'Features': ['IMDb only', 'IMDb + PCA (0-filled)', 'IMDb only', 'IMDb + PCA (real)'],
    'R2': [r2_base, r2_pca, r2_sub_base, r2_sub_pca],
    'MAE': [mae_base, mae_pca, mae_sub_base, mae_sub_pca]
})

# Calculate improvements
results['vs_Baseline'] = results['R2'] - r2_base
results['vs_Subset_Baseline'] = [None, None, 0, r2_sub_pca - r2_sub_base]

print("=" * 80)
print("RESULTS SUMMARY")
print("=" * 80)
display(results)

print("\n" + "=" * 80)
print("KEY INSIGHT:")
print("=" * 80)
print(f"On FULL dataset (87% without plots): PCA adds only +{r2_pca - r2_base:.4f} R²")
print(f"On SUBSET (movies WITH plots):       PCA adds +{r2_sub_pca - r2_sub_base:.4f} R²")
print("\n→ When a user provides a plot, the R² gain is larger than on the full dataset, but still small.")

RESULTS SUMMARY


| | Experiment | Dataset | Features | R2 | MAE | vs_Baseline | vs_Subset_Baseline |
|---|---|---|---|---|---|---|---|
| 0 | 1. Full Dataset - Baseline | 298k | IMDb only | 0.311746 | 0.853518 | 0.0 | |
| 1 | 2. Full Dataset - Baseline + PCA | 298k | IMDb + PCA (0-filled) | 0.311876 | 0.853489 | 0.00013 | |
| 2 | 3. TMDb Subset - Baseline | 38k | IMDb only | 0.475729 | 0.576596 | 0.163983 | 0.0 |
| 3 | 4. TMDb Subset - Baseline + PCA | 38k | IMDb + PCA (real) | 0.481849 | 0.572896 | 0.170103 | 0.00612 |



KEY INSIGHT:
On FULL dataset (87% without plots): PCA adds only +0.0001 R²
On SUBSET (movies WITH plots):       PCA adds +0.0061 R²

→ When a user provides a plot, the R² gain is larger than on the full dataset, but still small.
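
One caveat when reading the table: the subset experiments change both the training set and the test set, so the jump from about 0.31 to about 0.48 R² between Experiments 1 and 3 may largely reflect a different population of movies rather than any feature. A hedged way to separate the two effects is to score the full-data model on only the covered test rows. The sketch below does this on synthetic data; the data-generating process is made up, and only the column names `pca_0` and `has_tmdb` follow the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "base": rng.normal(size=n),
    "pca_0": rng.normal(size=n),
    "has_tmdb": (rng.random(n) < 0.15).astype(int),
})
# The target depends on the plot signal for every row, but the feature is
# only observed for covered rows; uncovered rows are zero-filled as in the merge.
df["y"] = df["base"] + df["pca_0"] + rng.normal(scale=0.5, size=n)
df.loc[df["has_tmdb"] == 0, "pca_0"] = 0.0

train, test = train_test_split(df, test_size=0.3, random_state=42)
features = ["base", "pca_0"]

rf = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42)
rf.fit(train[features], train["y"])
pred = pd.Series(rf.predict(test[features]), index=test.index)

# Same model, two evaluation populations.
covered = test["has_tmdb"] == 1
r2_all = r2_score(test["y"], pred)
r2_covered = r2_score(test.loc[covered, "y"], pred[covered])
print(f"R² on all test rows:     {r2_all:.4f}")
print(f"R² on covered rows only: {r2_covered:.4f}")
```

Applied to the real data, the same masking on `X_test['has_tmdb']` would show how the Experiment 1 and 2 models perform on movies with plots, which is a closer match to the "user provides a plot" scenario than the full-test-set scores.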


## 9. Future Ideas

**Director Rating Feature:**
- Find a database with director quality/reputation scores
- If a user provides a director name, check if they're in our dataset
- If found, use their historical average rating to modify the prediction
- Hypothesis: Known directors have predictable quality patterns (e.g., Spielberg → higher expected rating)
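
As a rough illustration of the historical-average idea, the sketch below computes a per-director mean rating from training rows only and falls back to the global mean for unknown directors. It assumes one director per row in `director_names`, which may not hold in the real column, and the names `director_avg_rating` and `director_known` are hypothetical.

```python
import pandas as pd

# Toy training and test rows (made-up values).
train = pd.DataFrame({
    "director_names": ["A", "A", "B", "C", "C", "C"],
    "averageRating": [7.0, 8.0, 5.0, 6.0, 6.5, 7.0],
})
test = pd.DataFrame({"director_names": ["A", "C", "Z"]})

# Statistics come from the training split only, so test ratings never leak in.
global_mean = train["averageRating"].mean()
director_mean = train.groupby("director_names")["averageRating"].mean()

# Known directors get their historical average; unknown ones get the global mean.
test["director_avg_rating"] = (
    test["director_names"].map(director_mean).fillna(global_mean)
)
test["director_known"] = test["director_names"].isin(director_mean.index).astype(int)

print(test)
```

Directors with very few films would get noisy averages under this scheme, so some smoothing toward the global mean might be worth considering before using it as a feature.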

## 8. API Design Decision: numVotes as Optional

**Problem:** The current API requires `numVotes` as input, but this is **data leakage** for new movies.

If the use case is *"Predict the rating of a movie BEFORE it's released"*, then `numVotes` won't exist yet. The user can't provide a value they don't have.

**Solution:** Make `numVotes` optional with **median imputation**.
- If user provides numVotes → use it (e.g., predicting for an existing movie)
- If user doesn't provide numVotes → use the median from training data

**Why median instead of mean?**
- `numVotes` is heavily right-skewed (few blockbusters with millions of votes)
- Median is more robust to outliers than mean
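
The optional-input logic described above could look roughly like this. `resolve_num_votes` is a hypothetical helper, the value of `MEDIAN_NUM_VOTES` is a placeholder rather than the real training median, and `log1p` is an assumption about how `log_numVotes` is derived (it is consistent with the sample rows shown earlier, e.g. 232 votes → 5.451038).

```python
from typing import Optional

import numpy as np

# Placeholder only: the real value would be the median of numVotes in the training data.
MEDIAN_NUM_VOTES = 100

def resolve_num_votes(num_votes: Optional[int], median: int = MEDIAN_NUM_VOTES) -> dict:
    """Return the numVotes-derived model inputs, imputing the median when no value is given."""
    imputed = num_votes is None
    value = median if imputed else num_votes
    return {
        "numVotes": value,
        # Derived feature must be recomputed from the (possibly imputed) value.
        "log_numVotes": float(np.log1p(value)),
        # Hypothetical flag so the API can report that the value was imputed.
        "numVotes_imputed": int(imputed),
    }

print(resolve_num_votes(None))
print(resolve_num_votes(232))

# Why median: one blockbuster drags the mean far away from a typical movie.
votes = np.array([10, 20, 30, 40, 1_000_000])
print(f"mean={np.mean(votes):,.0f}, median={np.median(votes):,.0f}")
```

Whatever value is imputed, any feature derived from `numVotes` needs to be rebuilt from it, otherwise the model would see inconsistent inputs.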

In [None]:
# Calculate numVotes statistics to justify median imputation
print("numVotes Distribution Analysis")
print("=" * 50)

# Use the training split only, so the imputation value does not depend on test rows
num_votes = X_train['numVotes']
print(f"Mean:   {num_votes.mean():,.0f}")
print(f"Median: {num_votes.median():,.0f}  ← USE THIS FOR IMPUTATION")
print(f"Min:    {num_votes.min():,}")
print(f"Max:    {num_votes.max():,}")
print(f"Std:    {num_votes.std():,.0f}")

print("\nPercentiles:")
for p in [25, 50, 75, 90, 95, 99]:
    print(f"  {p}th: {num_votes.quantile(p/100):,.0f}")

print(f"\n→ MEDIAN_NUM_VOTES = {int(num_votes.median())}")
print("  This value will be used in preprocessing.py when numVotes is not provided.")