# TMDb Feature Engineering

**Goal:** Generate plot embeddings, apply PCA, and process metadata from the clean TMDb dataset.

**Input:** `data/tmdb_clean.csv`
**Output:** `data/tmdb_wide.csv`

**Steps:**
1. Load Data
2. Generate Plot Embeddings (SentenceTransformer)
3. Apply PCA (20 components)
4. Process Budget & Revenue (`log1p` transform; pass through the `has_budget` / `has_revenue` indicator flags)
5. Export Features

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

# Paths
DATA_DIR = Path('../data')

print("Setup complete!")

Setup complete!


## 1. Load Data

In [2]:
df = pd.read_csv(DATA_DIR / 'tmdb_clean.csv')
print(f"Loaded {len(df):,} movies")
display(df.head(2))

Loaded 43,995 movies


| Unnamed: 0 | imdbId | id | title | budget | revenue | directors | director_names | overview | has_budget | has_revenue |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0114709 | 862 | Toy Story | 30000000 | 373554033.0 | ['John Lasseter'] | John Lasseter | Led by Woody, Andy's toys live happily in his ... | 1 | 1 |
| 1 | tt0113497 | 8844 | Jumanji | 65000000 | 262797249.0 | ['Joe Johnston'] | Joe Johnston | When siblings Judy and Peter discover an encha... | 1 | 1 |


## 2. Generate Embeddings

In [3]:
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Model loaded.")

print("Encoding overviews (this may take a few minutes)...")
# Ensure overviews are strings
overviews = df['overview'].fillna("").astype(str).tolist()
embeddings = model.encode(overviews, show_progress_bar=True, batch_size=64)

print(f"Embeddings shape: {embeddings.shape}")

Model loaded.
Encoding overviews (this may take a few minutes)...



Embeddings shape: (43995, 384)
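
Because missing overviews are filled with `""` before encoding, those rows still receive an embedding (that of the empty string). A minimal sketch, assuming one wanted a `has_overview` flag in the same spirit as `has_budget` and `has_revenue`; the flag name and the toy data are hypothetical and not part of the original pipeline:

```python
import pandas as pd

# Toy stand-in for df['overview']; the real column comes from tmdb_clean.csv.
demo = pd.DataFrame({"overview": ["Led by Woody, Andy's toys live happily", None, "   "]})

# Same normalisation as the notebook: missing values become empty strings.
overviews = demo["overview"].fillna("").astype(str)

# Hypothetical flag: 1 if the overview has any non-whitespace text, else 0.
demo["has_overview"] = overviews.str.strip().ne("").astype(int)

print(demo["has_overview"].tolist())
```

Downstream models could then tell a real plot embedding apart from the placeholder one.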


## 3. PCA Reduction

In [4]:
N_COMPONENTS = 20
pca = PCA(n_components=N_COMPONENTS)
embeddings_pca = pca.fit_transform(embeddings)

# Sum of the per-component ratios = share of total variance kept by all N_COMPONENTS
print(f"Explained variance ratio: {pca.explained_variance_ratio_.sum():.4f}")

# Create PCA dataframe
pca_cols = [f'pca_{i}' for i in range(N_COMPONENTS)]
pca_df = pd.DataFrame(embeddings_pca, columns=pca_cols)
# Assign by position; rows of embeddings_pca are in the same order as df
pca_df['imdbId'] = df['imdbId'].values  # Add ID for merging

Explained variance ratio: 0.3171
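
A value of 0.3171 means the 20 components keep roughly 32% of the variance in the 384-dimensional embeddings. One way to see how that share changes with the number of components is to look at the cumulative explained variance. The sketch below uses random synthetic data as a stand-in for the real embeddings, and `components_for` is a hypothetical helper, so the printed number says nothing about the TMDb data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the (n_movies, 384) embedding matrix; illustrative only.
rng = np.random.default_rng(0)
demo_embeddings = rng.normal(size=(500, 64))

# Fit PCA with all components so the full variance spectrum is available.
pca_full = PCA().fit(demo_embeddings)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

def components_for(threshold: float) -> int:
    """Smallest number of components whose cumulative variance reaches threshold."""
    n = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n, len(cumulative))  # guard against floating-point shortfall at 1.0

n_for_half = components_for(0.5)
print(f"Components needed for 50% variance (synthetic data): {n_for_half}")
```

Running the same check on the real `embeddings` array would show how many components a given variance target costs.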


## 4. Metadata Features

In [5]:
# Log transform budget and revenue
df['log_budget'] = np.log1p(df['budget'])
df['log_revenue'] = np.log1p(df['revenue'])

# Select final metadata columns (including indicator flags from cleaning step)
meta_cols = ['imdbId', 'log_budget', 'log_revenue', 'has_budget', 'has_revenue', 'director_names']
meta_df = df[meta_cols]

print(f"Metadata columns: {meta_cols}")

Metadata columns: ['imdbId', 'log_budget', 'log_revenue', 'has_budget', 'has_revenue', 'director_names']
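
`np.log1p(x)` computes `log(1 + x)`, so a value of 0 maps to 0 instead of negative infinity, and `np.expm1` inverts it. A small sketch on toy values, assuming (this is not confirmed by the notebook) that unknown budgets are stored as 0 and distinguished only by `has_budget`:

```python
import numpy as np
import pandas as pd

# Toy budgets; 0 stands in for an "unknown" budget (an assumption, see above).
demo = pd.DataFrame({"budget": [0, 30_000_000, 65_000_000]})

# Forward transform used in the notebook.
demo["log_budget"] = np.log1p(demo["budget"])

# Inverse transform recovers the original dollar amounts (rounded to remove float noise).
demo["budget_back"] = np.expm1(demo["log_budget"]).round()

print(demo)
```

This round trip is useful when a user-facing tool needs to convert a raw dollar amount to the model's scale and back.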


## 5. Export

In [6]:
# Merge PCA features with Metadata
final_tmdb_wide = pd.merge(meta_df, pca_df, on='imdbId')

output_path = DATA_DIR / 'tmdb_wide.csv'
final_tmdb_wide.to_csv(output_path, index=False)
print(f"TMDb wide table exported to: {output_path}")
print(f"Shape: {final_tmdb_wide.shape}")

TMDb wide table exported to: ../data/tmdb_wide.csv
Shape: (43995, 26)
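
The exported shape matches the input row count (43,995) and the expected 6 metadata + 20 PCA columns. If `imdbId` ever contained duplicates, an inner merge would silently multiply rows. A hedged sketch of guarding against that with the `validate` argument of `pd.merge`, using small made-up frames in place of `meta_df` and `pca_df`:

```python
import pandas as pd

# Toy stand-ins for meta_df and pca_df.
meta_demo = pd.DataFrame({"imdbId": ["tt1", "tt2", "tt3"], "log_budget": [0.0, 17.2, 18.0]})
pca_demo = pd.DataFrame({"imdbId": ["tt1", "tt2", "tt3"], "pca_0": [0.1, -0.2, 0.3]})

# validate="one_to_one" raises MergeError if either side has repeated keys.
merged = pd.merge(meta_demo, pca_demo, on="imdbId", validate="one_to_one")

# Demonstrate the failure mode with a duplicated key on the right side.
dup = pd.concat([pca_demo, pca_demo.iloc[[0]]], ignore_index=True)
try:
    pd.merge(meta_demo, dup, on="imdbId", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(merged.shape, duplicate_detected)
```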


TODO: Consider offering budget and revenue as categories (e.g. expensive vs. cheap movie) so users can pick one rather than enter numbers. Make sure all features can be supplied by users.
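
One possible way to approach the note above is to bin raw budgets into named tiers with `pd.cut`. This is only a sketch: the thresholds, the tier labels, and the `budget_tier` column name are hypothetical and would need to be chosen from the actual budget distribution.

```python
import pandas as pd

# Toy budgets in dollars.
demo = pd.DataFrame({"budget": [500_000, 30_000_000, 65_000_000, 200_000_000]})

# Hypothetical cut points: up to 5M = cheap, up to 50M = mid, above = expensive.
bins = [0, 5_000_000, 50_000_000, float("inf")]
labels = ["cheap", "mid", "expensive"]

# include_lowest=True keeps a budget of exactly 0 inside the first bin.
demo["budget_tier"] = pd.cut(demo["budget"], bins=bins, labels=labels, include_lowest=True)

print(demo)
```

With these bins a budget of 0 lands in the "cheap" tier, so rows where `has_budget` is 0 would need to be handled separately before a user-facing tier is shown.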
