# Feature Engineering

**Goal:** Transform clean data into model-ready features.

**Input:** `data/processed/movies_full_298k.csv`

**Outputs:**

| File | Rows | Description |
|------|------|-------------|
| `movies_full_wide.csv` | ~298k | All movies with IMDb features (TMDB features where available) |
| `movies_rich_wide.csv` | ~38k | Only movies with a non-empty TMDB overview, so the plot embeddings are meaningful for every row |

## Data Sources

This notebook builds on the cleaned data from `data_pipeline.ipynb`.

**Features we'll create:**

| Source | Features |
|--------|----------|
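| IMDb | `movie_age`, `decade`, `runtimeMinutes_capped`, `log_numVotes`, `hit`, `genre_count`, `Genre_*` (one-hot), plus `isAdult` passed through unchanged |
| TMDB | `log_budget`, `log_revenue`, `pca_0` to `pca_19` (plot embeddings), plus `has_budget` and `has_revenue` passed through unchanged |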

---
## 1. Setup

In [18]:
import pandas as pd
import numpy as np
from pathlib import Path

# Display settings
pd.set_option("display.max_columns", None)

# Paths
DATA_DIR = Path("../../data")
INPUT_DIR = DATA_DIR / "processed"
OUTPUT_DIR = DATA_DIR / "processed"

print("Setup complete!")

Setup complete!


---
## 2. Load Data

In [19]:
# Load the full dataset (298k movies with TMDB data where available)
df = pd.read_csv(INPUT_DIR / "movies_full_298k.csv")

print(f"Loaded {len(df):,} movies")
print(f"Columns: {list(df.columns)}")

# Check TMDB coverage (missing overviews load as NaN, so normalize before comparing)
has_tmdb = (df["overview"].fillna("").astype(str) != "").sum()
print(f"\nMovies with TMDB data: {has_tmdb:,} ({has_tmdb/len(df)*100:.1f}%)")

df.head(3)

Loaded 298,616 movies
Columns: ['tconst', 'primaryTitle', 'originalTitle', 'isAdult', 'startYear', 'runtimeMinutes', 'genres', 'averageRating', 'numVotes', 'overview', 'budget', 'revenue', 'has_budget', 'has_revenue', 'director_names']

Movies with TMDB data: 38,240 (12.8%)


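|  | tconst | primaryTitle | originalTitle | isAdult | startYear | runtimeMinutes | genres | averageRating | numVotes | overview | budget | revenue | has_budget | has_revenue | director_names |
|---|--------|--------------|---------------|---------|-----------|----------------|--------|---------------|----------|----------|--------|---------|------------|-------------|----------------|
| 0 | tt0000009 | Miss Jerry | Miss Jerry | 0 | 1894 | 45 | Romance | 5.2 | 232 |  | 0.0 | 0.0 | 0 | 0 |  |
| 1 | tt0000147 | The Corbett-Fitzsimmons Fight | The Corbett-Fitzsimmons Fight | 0 | 1897 | 100 | Documentary,News,Sport | 5.3 | 584 |  | 0.0 | 0.0 | 0 | 0 |  |
| 2 | tt0000335 | Soldiers of the Cross | Soldiers of the Cross | 0 | 1900 | 40 | Biography,Drama | 5.4 | 67 |  | 0.0 | 0.0 | 0 | 0 |  |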


---
## 3. IMDb Features

Features derived from IMDb data (available for all 298k movies).

### 3.1 Temporal Features

In [20]:
# Movie age (years since release)
CURRENT_YEAR = 2026
df["movie_age"] = CURRENT_YEAR - df["startYear"]

# Decade (e.g., 1995 -> 1990)
df["decade"] = (df["startYear"] // 10 * 10).astype("Int64")

print("Temporal features created:")
print(f"  movie_age: min={df['movie_age'].min()}, max={df['movie_age'].max()}")
print(f"  decade: {sorted(df['decade'].dropna().unique())}")

Temporal features created:
  movie_age: min=0, max=2026
  decade: [0, 1890, 1900, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020]
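
The `max=2026` age and the `0` decade in the output above suggest that some rows carry `startYear == 0`, most likely a placeholder for an unknown release year (an assumption, since the raw rows are not shown here). A minimal sketch of masking such rows before deriving the temporal features, using a toy frame in place of `df`:

```python
import pandas as pd

CURRENT_YEAR = 2026

# Toy stand-in for df; the 0 mimics a placeholder for an unknown release year.
toy = pd.DataFrame({"startYear": [1894, 1995, 0, 2026]})

# Treat non-positive years as missing (assumption: 0 is never a real year here).
year = toy["startYear"].where(toy["startYear"] > 0).astype("Int64")

toy["movie_age"] = CURRENT_YEAR - year
toy["decade"] = year // 10 * 10

print(toy)
```

With the nullable `Int64` dtype the masked row stays `<NA>` in both derived columns instead of turning into an age of 2026 or a decade of 0.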


### 3.2 Runtime Capping

In [21]:
# Cap runtime at 300 minutes (5 hours) to handle outliers
RUNTIME_CAP = 300

outliers = (df["runtimeMinutes"] > RUNTIME_CAP).sum()
df["runtimeMinutes_capped"] = df["runtimeMinutes"].clip(upper=RUNTIME_CAP)

print(f"Runtime capped at {RUNTIME_CAP} minutes")
print(f"  Movies affected: {outliers:,}")

Runtime capped at 300 minutes
  Movies affected: 226


### 3.3 Popularity Features

In [22]:
# Log transform numVotes (highly skewed)
df["log_numVotes"] = np.log1p(df["numVotes"])

# Hit flag: top 20% by votes
threshold_80 = df["numVotes"].quantile(0.80)
df["hit"] = (df["numVotes"] >= threshold_80).astype(int)

print("Popularity features created:")
print(f"  log_numVotes: range [{df['log_numVotes'].min():.2f}, {df['log_numVotes'].max():.2f}]")
print(f"  hit threshold: {threshold_80:,.0f} votes")
print(f"  hit movies: {df['hit'].sum():,} ({df['hit'].mean()*100:.1f}%)")

Popularity features created:
  log_numVotes: range [1.79, 14.96]
  hit threshold: 648 votes
  hit movies: 59,778 (20.0%)
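
A toy illustration of the two popularity features: `np.log1p` compresses the long right tail of vote counts, and the 80th-percentile cut marks roughly the top fifth of titles as hits. The vote counts below are invented for the example.

```python
import numpy as np
import pandas as pd

# Invented vote counts with a long right tail.
votes = pd.Series([5, 12, 40, 90, 150, 300, 648, 2_000, 50_000, 3_000_000])

log_votes = np.log1p(votes)        # log(1 + x), defined at x = 0
threshold = votes.quantile(0.80)   # linear interpolation between order statistics
hit = (votes >= threshold).astype(int)

print(f"threshold: {threshold:,.0f}")
print(f"hits: {hit.sum()} of {len(votes)}")
```

Because the comparison is `>=`, ties at the threshold can push the hit share slightly above 20% on real data.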


### 3.4 Genre Encoding

In [23]:
# Count genres per movie
df["genre_count"] = df["genres"].str.split(",").str.len()

# Get all genres and their counts
all_genres = df["genres"].str.split(",").explode()
genre_counts = all_genres.value_counts()

# Keep genres with >= 1000 occurrences
MIN_GENRE_COUNT = 1000
valid_genres = genre_counts[genre_counts >= MIN_GENRE_COUNT].index.tolist()

print(f"Total unique genres: {len(genre_counts)}")
print(f"Genres with >= {MIN_GENRE_COUNT} occurrences: {len(valid_genres)}")
print(f"Dropped: {sorted(set(genre_counts.index) - set(valid_genres))}")

Total unique genres: 27
Genres with >= 1000 occurrences: 22
Dropped: ['Film-Noir', 'Game-Show', 'News', 'Reality-TV', 'Talk-Show']


In [24]:
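# One-hot encode valid genres on exact genre tokens
# (a substring test would also flag "Musical" movies as "Music")
genre_dummies = df["genres"].str.get_dummies(sep=",")
for genre in valid_genres:
    df[f"Genre_{genre}"] = genre_dummies[genre].astype(int)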

genre_cols = [col for col in df.columns if col.startswith("Genre_")]
print(f"Created {len(genre_cols)} genre columns")

Created 22 genre columns
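
Genre names can be substrings of one another (`Music` is contained in `Musical`), so the encoding needs to compare whole comma-separated tokens rather than search within the string. A small sketch of the difference on made-up rows:

```python
import pandas as pd

# Made-up rows in the same comma-separated format as the `genres` column.
genres = pd.Series(["Musical,Romance", "Music,Drama", "Drama"])

# Substring test: also fires on "Musical".
substring_flag = genres.str.contains("Music", regex=False).astype(int)

# Token test: split on commas and compare whole genre names.
token_flag = genres.str.get_dummies(sep=",")["Music"]

print(pd.DataFrame({"genres": genres, "substring": substring_flag, "token": token_flag}))
```

Only the token-based column leaves the `Musical,Romance` row at 0 for `Music`.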


---
## 4. TMDB Features

Features derived from TMDB data. Overviews are available for only ~38k movies, and positive budget or revenue values for fewer than 9k.

### 4.1 Budget & Revenue

In [25]:
# Log transform budget and revenue
df["log_budget"] = np.log1p(df["budget"])
df["log_revenue"] = np.log1p(df["revenue"])

print("Budget/Revenue features created:")
print(f"  Movies with budget > 0: {(df['budget'] > 0).sum():,}")
print(f"  Movies with revenue > 0: {(df['revenue'] > 0).sum():,}")

Budget/Revenue features created:
  Movies with budget > 0: 8,451
  Movies with revenue > 0: 7,354
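
Since `np.log1p(0)` is exactly `0`, every movie without a recorded budget lands at `log_budget = 0`, and a flag column is what lets a model tell "missing" apart from "very small". The toy check below assumes `has_budget` simply marks `budget > 0`; the notebook receives that column from the upstream pipeline, so its exact definition is not shown here.

```python
import numpy as np
import pandas as pd

# Invented budgets; 0.0 mimics "no budget recorded".
toy = pd.DataFrame({"budget": [0.0, 1.0, 1_000_000.0]})

# Assumed definition of the flag (not confirmed by the source notebook).
toy["has_budget"] = (toy["budget"] > 0).astype(int)
toy["log_budget"] = np.log1p(toy["budget"])

print(toy)
```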


### 4.2 Plot Embeddings

Generate embeddings from movie overviews using SentenceTransformer, then reduce with PCA.

**Important:** We fit PCA only on movies with real overviews (~38k), then transform all 298k. This way the components capture variation between plots rather than the split between movies that have an overview and those that don't.

In [26]:
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

# Load model
print("Loading SentenceTransformer model...")
model = SentenceTransformer("all-MiniLM-L6-v2")
print("Model loaded.")

Loading SentenceTransformer model...
Model loaded.


In [27]:
# Generate embeddings for ALL movies
# Note: an empty overview is encoded like any other string, so movies without
# an overview share one placeholder embedding (it is not a zero vector)
print("Generating embeddings for all movies...")
print("(Movies without overview will share one placeholder embedding)")

overviews = df["overview"].fillna("").astype(str).tolist()
embeddings = model.encode(overviews, show_progress_bar=True, batch_size=64)

print(f"\nEmbeddings shape: {embeddings.shape}")

Generating embeddings for all movies...
(Movies without overview will share one placeholder embedding)



Embeddings shape: (298616, 384)
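
A sentence encoder typically returns an ordinary, non-zero embedding for the empty string, so rows without an overview need explicit handling if true zero vectors are wanted. A minimal sketch of overwriting those rows after encoding follows; the random array stands in for the real `embeddings`, and `toy_overviews` is invented for the example.

```python
import numpy as np
import pandas as pd

toy_overviews = pd.Series(["A heist goes wrong.", None, "", "Two friends travel."])

# Stand-in for model.encode(...): one 384-dimensional row per movie.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(len(toy_overviews), 384)).astype(np.float32)

# Boolean mask of rows with a real overview.
has_text = (toy_overviews.fillna("").astype(str) != "").to_numpy()

# Overwrite the rows without text so they are exactly zero.
toy_embeddings[~has_text] = 0.0

print(np.linalg.norm(toy_embeddings, axis=1).round(2))
```

Either way, identical inputs map to one shared point after `pca.transform`, so the `pca_*` values of movies without an overview carry no plot information.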


In [28]:
# Apply PCA to reduce dimensions (384 -> 20)
# FIT only on movies with real overviews, TRANSFORM all movies
N_COMPONENTS = 20

# Identify movies with real overviews
# (positional indices, so this stays correct even if df has a non-default index)
has_overview = df["overview"].fillna("").astype(str) != ""
rich_indices = np.flatnonzero(has_overview.to_numpy())
print(f"Fitting PCA on {len(rich_indices):,} movies with real overviews...")

# Fit PCA on rich subset only
pca = PCA(n_components=N_COMPONENTS)
pca.fit(embeddings[rich_indices])
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2%}")

# Transform ALL embeddings using the fitted PCA
print(f"Transforming all {len(embeddings):,} movies...")
embeddings_pca = pca.transform(embeddings)

# Add PCA columns to dataframe
pca_cols = [f"pca_{i}" for i in range(N_COMPONENTS)]
for i, col in enumerate(pca_cols):
    df[col] = embeddings_pca[:, i]

print(f"Added {len(pca_cols)} PCA columns")

Fitting PCA on 38,240 movies with real overviews...
Explained variance: 31.87%
Transforming all 298,616 movies...
Added 20 PCA columns
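
The fit-on-subset, transform-all pattern can be tried on synthetic data. In this sketch the array and mask are random stand-ins for the embeddings and the overview flag, and `random_state` is added only to make the toy run repeatable.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(1_000, 384))   # stand-in for the embedding matrix
mask = rng.random(1_000) < 0.13     # stand-in for "has a real overview"

pca = PCA(n_components=20, random_state=0)
pca.fit(X[mask])                    # learn components from the subset only
X_reduced = pca.transform(X)        # project every row with those components

print(X_reduced.shape)
print(f"explained variance on the subset: {pca.explained_variance_ratio_.sum():.2%}")
```

The reported explained variance describes only the rows used for fitting; it says nothing about how well the remaining rows are represented.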


---
## 5. Create Wide Tables

In [29]:
# Define feature columns
imdb_features = [
    "movie_age", "decade", "runtimeMinutes_capped", 
    "log_numVotes", "hit", "genre_count", "isAdult"
] + genre_cols

tmdb_features = [
    "log_budget", "log_revenue", "has_budget", "has_revenue"
] + pca_cols

# Columns to keep
id_cols = ["tconst"]
target_col = ["averageRating"]
all_features = imdb_features + tmdb_features

print(f"ID columns: {len(id_cols)}")
print(f"Target: {target_col}")
print(f"IMDb features: {len(imdb_features)}")
print(f"TMDB features: {len(tmdb_features)}")
print(f"Total features: {len(all_features)}")

ID columns: 1
Target: ['averageRating']
IMDb features: 29
TMDB features: 24
Total features: 53


In [30]:
# Create full wide table (298k movies)
df_full_wide = df[id_cols + target_col + all_features].copy()

# Create rich wide table (only movies with a non-empty TMDB overview)
has_overview = df["overview"].fillna("").astype(str) != ""
df_rich_wide = df.loc[has_overview, id_cols + target_col + all_features].copy()

print(f"movies_full_wide: {len(df_full_wide):,} rows x {len(df_full_wide.columns)} cols")
print(f"movies_rich_wide: {len(df_rich_wide):,} rows x {len(df_rich_wide.columns)} cols")

movies_full_wide: 298,616 rows x 55 cols
movies_rich_wide: 38,240 rows x 55 cols


---
## 6. Export

In [31]:
# Export both wide tables
full_path = OUTPUT_DIR / "movies_full_wide.csv"
rich_path = OUTPUT_DIR / "movies_rich_wide.csv"

df_full_wide.to_csv(full_path, index=False)
print(f"✓ {full_path.name}: {len(df_full_wide):,} rows")

df_rich_wide.to_csv(rich_path, index=False)
print(f"✓ {rich_path.name}: {len(df_rich_wide):,} rows")

print(f"\nDone! Files saved to: {OUTPUT_DIR}")

✓ movies_full_wide.csv: 298,616 rows
✓ movies_rich_wide.csv: 38,240 rows

Done! Files saved to: ../../data/processed
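
One quick way to confirm an export is to read the file back and compare its shape and key uniqueness against the frame that was written. A sketch with a temporary directory and a toy frame (the file name and data are placeholders, not the real outputs):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with the same ID and target columns as the wide tables.
toy = pd.DataFrame({"tconst": ["tt0000009", "tt0000147"], "averageRating": [5.2, 5.3]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_wide.csv"
    toy.to_csv(path, index=False)   # same call pattern as the export cell
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
unique_ids = reloaded["tconst"].is_unique
print(f"same shape: {same_shape}, unique tconst: {unique_ids}")
```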


In [32]:
# Final verification
print("=" * 50)
print("FEATURE ENGINEERING COMPLETE")
print("=" * 50)
print(f"\nFull wide table: {len(df_full_wide):,} movies")
print(f"Rich wide table: {len(df_rich_wide):,} movies")
print(f"\nFeatures ({len(all_features)} total):")
print(f"  IMDb: {imdb_features}")
print(f"  TMDB: {tmdb_features[:4]} + {len(pca_cols)} PCA columns")

FEATURE ENGINEERING COMPLETE

Full wide table: 298,616 movies
Rich wide table: 38,240 movies

Features (53 total):
  IMDb: ['movie_age', 'decade', 'runtimeMinutes_capped', 'log_numVotes', 'hit', 'genre_count', 'isAdult', 'Genre_Drama', 'Genre_Comedy', 'Genre_Documentary', 'Genre_Romance', 'Genre_Action', 'Genre_Crime', 'Genre_Thriller', 'Genre_Horror', 'Genre_Adventure', 'Genre_Mystery', 'Genre_Family', 'Genre_Biography', 'Genre_Fantasy', 'Genre_History', 'Genre_Music', 'Genre_Sci-Fi', 'Genre_Musical', 'Genre_War', 'Genre_Animation', 'Genre_Western', 'Genre_Sport', 'Genre_Adult']
  TMDB: ['log_budget', 'log_revenue', 'has_budget', 'has_revenue'] + 20 PCA columns
