# Feature Engineering — TMDB Movie Revenue & Popularity Prediction

**Purpose:** Transform raw TMDB movie data into a clean, target-agnostic master feature dataset that supports both supervised and semi-supervised modeling for revenue and popularity prediction.

**Key Principles:**
- **Target-agnostic master dataset:** We create `data_features_master` containing ALL pre-release features plus raw revenue, popularity, and budget — without filtering for any specific target.
- **Zero correction:** Budget and revenue values of 0 are treated as missing (replaced with NaN) and tracked with binary flags.
- **No data leakage:** Post-release metrics (popularity, vote_average, vote_count) are excluded from feature matrices at modeling time, not deleted from the master.
- **Semi-supervised ready:** Rows with missing revenue remain in the dataset as unlabeled observations (y = -1) for SSL classifiers.
- **Revenue tiers:** Computed only from labeled (non-NaN revenue) observations using quantile-based binning.

**Feature Categories:**
1. Talent Features (cast & director signals)
2. Content Features (genres, keywords, language)
3. Temporal Features (release timing & seasonality)
4. Production Features (budget, runtime)

**Output Datasets:**
- `data_features_master.csv` — neutral master (all rows, all features, raw targets)
- `data_supervised_revenue.csv` — labeled rows only, no post-release features
- `data_supervised_popularity.csv` — labeled rows only, no revenue features
- `data_ssl_revenue.csv` — all rows with y_ssl column for semi-supervised learning

In [53]:
# ============================================================
# Imports
# ============================================================
import ast
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

## 1. Data Loading

Load the raw TMDB dataset and drop the unnamed index column carried over from the CSV export.

In [54]:
# ============================================================
# Load raw data
# ============================================================
df = pd.read_csv("../data/movies_2010_2025.csv")
df = df.drop("Unnamed: 0", axis=1)

print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

Dataset shape: (9290, 51)
Columns: ['movie_id', 'title', 'release_date', 'runtime', 'original_language', 'popularity', 'vote_average', 'vote_count', 'budget', 'revenue', 'status', 'overview', 'genres', 'keywords', 'director_id', 'director_name', 'director_gender', 'director_popularity', 'director_department', 'actor1_id', 'actor1_name', 'actor1_character', 'actor1_gender', 'actor1_popularity', 'actor1_department', 'actor2_id', 'actor2_name', 'actor2_character', 'actor2_gender', 'actor2_popularity', 'actor2_department', 'actor3_id', 'actor3_name', 'actor3_character', 'actor3_gender', 'actor3_popularity', 'actor3_department', 'actor4_id', 'actor4_name', 'actor4_character', 'actor4_gender', 'actor4_popularity', 'actor4_department', 'actor5_id', 'actor5_name', 'actor5_character', 'actor5_gender', 'actor5_popularity', 'actor5_department', 'cast_pop_mean', 'cast_pop_max']


## 2. Helper Functions

Reusable parsing utilities inherited from the EDA phase. The `safe_parse_list` function handles the genres and keywords columns, which are stored as string representations of Python lists in the CSV.

In [55]:
# ============================================================
# Helper: safely parse stringified lists from CSV
# ============================================================
def safe_parse_list(x):
    """
    Convert stringified lists (e.g., "['Action', 'Drama']") back into
    actual Python lists. Handles NaN, empty strings, and edge cases.
    """
    if isinstance(x, list):
        return x
    if isinstance(x, float) and pd.isna(x):
        return []
    if isinstance(x, str):
        x = x.strip()
        if x == "" or x == "[]":
            return []
        try:
            return ast.literal_eval(x)
        except (ValueError, SyntaxError):
            return []
    return []
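
One edge case worth keeping in mind: `ast.literal_eval` succeeds on any valid Python literal, so a cell such as `"5"` would come back as an int rather than a list. Below is a minimal sketch of a stricter variant, assuming non-list literals should also map to an empty list. The name `parse_list_strict` and the sample values are hypothetical and not part of the original pipeline.

```python
import ast

import pandas as pd


def parse_list_strict(x):
    """Hypothetical stricter variant: anything that is not a list literal becomes []."""
    if isinstance(x, list):
        return x
    if not isinstance(x, str):
        return []  # NaN, None, numbers
    try:
        parsed = ast.literal_eval(x.strip())
    except (ValueError, SyntaxError):
        return []
    # literal_eval may return an int, str, dict, ...; only lists are accepted here
    return parsed if isinstance(parsed, list) else []


# Sample cells, invented for illustration
samples = pd.Series(["['Action', 'Drama']", "[]", "", "5", "not a list", float("nan")])
parsed = samples.apply(parse_list_strict)
print(parsed.tolist())
```

Only the first sample yields a non-empty list; every other cell falls back to `[]`, which keeps downstream `len()` and membership checks safe.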

## 3. Column Classification — Pre-Release vs Post-Release

We classify every column as pre-release (available before the movie opens) or post-release (only known after release). This classification drives all downstream dataset construction.

**Pre-release features (valid for modeling):**
- Temporal: release_year, release_month, release_quarter, is_summer_release, is_holiday_release
- Cast: actor1-5 popularity, cast_pop_mean, cast_pop_max, star_count, cast_popularity_std, cast_gender_ratio
- Director: director_popularity, director_is_female
- Content: genre_* (one-hot), num_genres, keyword_count, lang_* encodings, is_english
- Production: runtime, has_budget, log_budget
- Text signal: has_overview, overview_length

**Post-release variables (kept in master, excluded from feature matrices):**
- popularity, vote_average, vote_count

**Raw target variables (kept in master for target construction):**
- revenue, budget (raw values preserved alongside engineered versions)

**Identifier / non-predictive columns (dropped from master):**
- movie_id, title, status, overview, all name/character/department/ID columns

In [56]:
# ============================================================
# Classify columns: post-release metrics to keep in master
# but exclude from feature matrices during modeling
# ============================================================
post_release_cols = ["popularity", "vote_average", "vote_count"]
raw_target_cols = ["revenue", "budget"]

print("Post-release columns (kept in master, excluded from features):", post_release_cols)
print("Raw target columns (kept in master for target construction):", raw_target_cols)

Post-release columns (kept in master, excluded from features): ['popularity', 'vote_average', 'vote_count']
Raw target columns (kept in master for target construction): ['revenue', 'budget']


In [57]:
# ============================================================
# Define columns to exclude from the MASTER dataset entirely
# (identifiers, free text, high-cardinality names, near-zero variance departments)
# ============================================================
id_cols = ["movie_id", "title"]

non_predictive_cols = [
    "status", "overview",
    "director_id", "director_name", "director_department",
    "actor1_id", "actor1_name", "actor1_character", "actor1_department",
    "actor2_id", "actor2_name", "actor2_character", "actor2_department",
    "actor3_id", "actor3_name", "actor3_character", "actor3_department",
    "actor4_id", "actor4_name", "actor4_character", "actor4_department",
    "actor5_id", "actor5_name", "actor5_character", "actor5_department",
]

# These are dropped from the master dataset entirely
all_excluded_from_master = id_cols + non_predictive_cols

print(f"Total columns excluded from master: {len(all_excluded_from_master)}")
print(f"Excluded: {all_excluded_from_master}")
print(f"\nNOTE: revenue, budget, popularity, vote_average, vote_count are KEPT in the master.")

Total columns excluded from master: 27
Excluded: ['movie_id', 'title', 'status', 'overview', 'director_id', 'director_name', 'director_department', 'actor1_id', 'actor1_name', 'actor1_character', 'actor1_department', 'actor2_id', 'actor2_name', 'actor2_character', 'actor2_department', 'actor3_id', 'actor3_name', 'actor3_character', 'actor3_department', 'actor4_id', 'actor4_name', 'actor4_character', 'actor4_department', 'actor5_id', 'actor5_name', 'actor5_character', 'actor5_department']

NOTE: revenue, budget, popularity, vote_average, vote_count are KEPT in the master.


## 4. Parse List Columns

The `genres` and `keywords` columns are stored as stringified Python lists in the CSV. We parse them into actual lists before engineering features from them.

In [58]:
# ============================================================
# Parse genres and keywords from strings to lists
# ============================================================
df["genres"] = df["genres"].apply(safe_parse_list)
df["keywords"] = df["keywords"].apply(safe_parse_list)

# Quick verification
print("Sample genres:", df["genres"].iloc[0])
print("Sample keywords:", df["keywords"].iloc[0])

Sample genres: ['Action', 'Science Fiction', 'Adventure']
Sample keywords: ['rescue', 'mission', 'dreams', 'airplane', 'paris, france', 'virtual reality', 'kidnapping', 'philosophy', 'spy', 'allegory', 'manipulation', 'car crash', 'heist', 'memory', 'architecture', 'los angeles, california', 'death', 'dream world', 'subconscious', 'dream']


## 5. Temporal Features — Release Timing & Seasonality

Movie success is heavily influenced by **when** it is released. Summer blockbuster season, holiday releases, and award-season timing are well-known industry patterns.

**Features engineered:**
- `release_month` — captures monthly seasonality (e.g., June/July for summer blockbusters, Nov/Dec for awards and holidays)
- `release_year` — captures long-term industry trends (e.g., streaming era shifts)
- `release_quarter` — a coarser seasonal signal (Q1-Q4)
- `is_summer_release` — binary flag for the peak blockbuster window (May-July)
- `is_holiday_release` — binary flag for the holiday corridor (November-December)

In [59]:
# ============================================================
# Temporal features from release_date
# ============================================================
df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")

# Month and year
df["release_month"] = df["release_date"].dt.month
df["release_year"] = df["release_date"].dt.year

# Quarter (1-4)
df["release_quarter"] = df["release_date"].dt.quarter

# Summer release: May (5), June (6), July (7)
df["is_summer_release"] = df["release_month"].isin([5, 6, 7]).astype(int)

# Holiday release: November (11), December (12)
df["is_holiday_release"] = df["release_month"].isin([11, 12]).astype(int)

# Handle the 1 missing release_date: fill temporal features with the median.
# Note: the two flags above were computed before this imputation, so they stay 0 for that row.
for col in ["release_month", "release_year", "release_quarter"]:
    df[col] = df[col].fillna(df[col].median())

print("Temporal features created:")
print(df[["release_month", "release_year", "release_quarter",
          "is_summer_release", "is_holiday_release"]].describe())

Temporal features created:
       release_month  release_year  release_quarter  is_summer_release  \
count    9290.000000   9290.000000      9290.000000        9290.000000   
mean        6.752530   2017.456512         2.587406           0.218837   
std         3.463668      4.623364         1.136100           0.413480   
min         1.000000   2010.000000         1.000000           0.000000   
25%         4.000000   2013.000000         2.000000           0.000000   
50%         7.000000   2017.000000         3.000000           0.000000   
75%        10.000000   2021.000000         4.000000           0.000000   
max        12.000000   2025.000000         4.000000           1.000000   

       is_holiday_release  
count         9290.000000  
mean             0.164155  
std              0.370436  
min              0.000000  
25%              0.000000  
50%              0.000000  
75%              0.000000  
max              1.000000  
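
In the cell above, the two season flags are derived before the missing month is imputed, so the imputed row keeps flags of 0 even when its filled month falls inside a flagged window. A minimal sketch of the alternative ordering, run on a toy frame invented for illustration:

```python
import pandas as pd

# Toy data, invented for illustration; one date is unparseable.
toy = pd.DataFrame({"release_date": ["2015-06-12", "2019-12-20", "not a date", "2021-07-02"]})
toy["release_date"] = pd.to_datetime(toy["release_date"], errors="coerce")

# Impute the month first ...
toy["release_month"] = toy["release_date"].dt.month
toy["release_month"] = toy["release_month"].fillna(toy["release_month"].median())

# ... then derive the flags, so they agree with the imputed month.
toy["is_summer_release"] = toy["release_month"].isin([5, 6, 7]).astype(int)
toy["is_holiday_release"] = toy["release_month"].isin([11, 12]).astype(int)
print(toy)
```

With a single missing date the practical effect is tiny, but the ordering matters more if the share of missing dates grows.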


## 6. Talent Features — Cast & Director Signals

The star power of a film's cast and director is one of the strongest pre-release indicators of audience interest. The dataset already provides individual actor popularities (actor1-5) and aggregated stats (`cast_pop_mean`, `cast_pop_max`). We engineer additional features to capture richer talent signals.

**Features engineered:**
- `director_popularity` — already exists in the data, retained as-is
- `director_is_female` — binary flag (TMDB gender encoding: 1 = Female, 2 = Male)
- `star_count` — number of actors (out of top 5 billed) whose popularity exceeds the 75th percentile of all non-zero actor popularities in the dataset; captures how many recognizable names are attached
- `cast_popularity_std` — population standard deviation (`np.std`) of the top 5 actors' non-zero popularities; high std = one big name + unknowns, low std = balanced ensemble
- `cast_gender_ratio` — proportion of female actors in the top 5 billed cast
- `cast_pop_mean`, `cast_pop_max` — already exist, retained as-is

In [60]:
# ============================================================
# Talent features
# ============================================================

# --- Actor popularity columns ---
actor_pop_cols = [f"actor{i}_popularity" for i in range(1, 6)]
actor_gender_cols = [f"actor{i}_gender" for i in range(1, 6)]

# --- Star count ---
# Define "star" threshold as 75th percentile of all non-zero actor popularities
all_actor_pops = df[actor_pop_cols].values.flatten()
all_actor_pops_nonzero = all_actor_pops[all_actor_pops > 0]
star_threshold = np.percentile(all_actor_pops_nonzero, 75)
print(f"Star threshold (75th percentile of non-zero actor popularity): {star_threshold:.4f}")

# Count how many of the 5 actors exceed the star threshold per movie
df["star_count"] = (df[actor_pop_cols] > star_threshold).sum(axis=1)

# --- Cast popularity standard deviation ---
# Measures whether the cast is balanced or relies on a single star
df["cast_popularity_std"] = df[actor_pop_cols].apply(
    lambda row: np.std([x for x in row if x > 0]) if any(x > 0 for x in row) else 0,
    axis=1
)

# --- Cast gender ratio (proportion female) ---
# TMDB gender encoding: 1 = Female, 2 = Male, 0 = Unknown/Not set
def calc_female_ratio(row):
    known = [g for g in row if g in (1.0, 2.0)]
    if len(known) == 0:
        return 0.5  # default to balanced when unknown
    return sum(1 for g in known if g == 1.0) / len(known)

df["cast_gender_ratio"] = df[actor_gender_cols].apply(calc_female_ratio, axis=1)

# --- Director is female ---
df["director_is_female"] = (df["director_gender"] == 1.0).astype(int)

print("\nTalent features summary:")
print(df[["star_count", "cast_popularity_std", "cast_gender_ratio",
          "director_is_female", "director_popularity",
          "cast_pop_mean", "cast_pop_max"]].describe())

Star threshold (75th percentile of non-zero actor popularity): 2.9868

Talent features summary:
        star_count  cast_popularity_std  cast_gender_ratio  \
count  9290.000000          9290.000000        9290.000000   
mean      1.147578             1.260020           0.412983   
std       1.364494             1.767775           0.259384   
min       0.000000             0.000000           0.000000   
25%       0.000000             0.576811           0.200000   
50%       1.000000             1.017833           0.400000   
75%       2.000000             1.516547           0.600000   
max       5.000000            84.167164           1.000000   

       director_is_female  director_popularity  cast_pop_mean  cast_pop_max  
count         9290.000000          9290.000000    9290.000000   9290.000000  
mean             0.116685             1.209358       2.023336      4.085244  
std              0.321061             1.361564       1.876921      4.903623  
min              0.000000        
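
The row-wise `apply` used for `cast_popularity_std` can be slow on larger frames. The sketch below shows a vectorized formulation that should give the same values, assuming zeros and NaN both mean "no usable popularity". The toy matrix is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy popularity matrix, invented for illustration (rows = movies, columns = billed actors).
pops = pd.DataFrame(
    {
        "actor1_popularity": [10.0, 2.0, 0.0, 3.0],
        "actor2_popularity": [1.0, 2.0, 0.0, np.nan],
        "actor3_popularity": [0.0, 2.0, np.nan, np.nan],
    }
)

# Mask zeros and NaN, then take the population std (ddof=0, matching np.std).
valid = pops.where(pops > 0)
std_vectorized = valid.std(axis=1, ddof=0).fillna(0.0)

# Row-wise reference with the same logic as the apply-based cell.
std_rowwise = pops.apply(
    lambda row: np.std([x for x in row if x > 0]) if any(x > 0 for x in row) else 0.0,
    axis=1,
)
print(pd.DataFrame({"vectorized": std_vectorized, "rowwise": std_rowwise}))
```

The `ddof=0` argument matters: pandas defaults to the sample standard deviation (`ddof=1`), which would return NaN for movies with a single known actor.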

## 7. Content Features — Genre Encoding

Genres are a core signal for audience interest. Since a movie can belong to multiple genres, we use **multi-hot encoding** — each genre becomes a binary column (1 if the movie belongs to that genre, 0 otherwise).

We also create `num_genres` to capture the breadth of a movie's genre classification. Movies spanning many genres may have broader audience appeal.

In [61]:
# ============================================================
# Multi-hot encode genres
# ============================================================

# Get all unique genres across the dataset
all_genres = sorted(set(g for genres_list in df["genres"] for g in genres_list))
print(f"Total unique genres found: {len(all_genres)}")
print(f"Genres: {all_genres}")

# Create binary columns for each genre
for genre in all_genres:
    df[f"genre_{genre}"] = df["genres"].apply(lambda x, g=genre: 1 if g in x else 0)

# Number of genres per movie
df["num_genres"] = df["genres"].apply(len)

print(f"\nGenre feature columns created: {len(all_genres)} binary + 1 count")
print(f"num_genres distribution:\n{df['num_genres'].describe()}")

Total unique genres found: 19
Genres: ['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Mystery', 'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western']

Genre feature columns created: 19 binary + 1 count
num_genres distribution:
count    9290.000000
mean        1.988482
std         1.108486
min         0.000000
25%         1.000000
50%         2.000000
75%         3.000000
max         8.000000
Name: num_genres, dtype: float64
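
As an alternative to one `apply` per genre, multi-hot encoding can be sketched with `explode` and `get_dummies`. This is a hedged illustration on toy data invented for the example, not the notebook's own method.

```python
import pandas as pd

# Toy data, invented for illustration; the second movie has no genres.
toy = pd.DataFrame({"genres": [["Action", "Drama"], [], ["Drama"]]})

# One row per (movie, genre); an empty list becomes a single NaN row.
exploded = toy["genres"].explode()

multi_hot = (
    pd.get_dummies(exploded, prefix="genre")  # NaN rows get all zeros
    .groupby(level=0)
    .max()                                    # collapse back to one row per movie
    .astype(int)
)
print(multi_hot)
```

Movies without any genre end up as an all-zero row, the same result the loop-based cell produces.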


## 8. Content Features — Keyword Count

Rather than encoding individual keywords (which would create thousands of extremely sparse columns), we use the **keyword count** as a proxy for how richly tagged a movie is. The working assumption is that movies with more keywords have more developed marketing and metadata, which may in turn correlate with production scale and audience reach.

In [62]:
# ============================================================
# Keyword count
# ============================================================
df["keyword_count"] = df["keywords"].apply(len)

print("keyword_count distribution:")
print(df["keyword_count"].describe())

keyword_count distribution:
count    9290.000000
mean        5.189020
std         7.172168
min         0.000000
25%         0.000000
50%         2.000000
75%         8.000000
max       101.000000
Name: keyword_count, dtype: float64


## 9. Content Features — Original Language Encoding

The EDA showed that English (`en`) dominates the dataset, with a long tail of other languages. We use a **top-K encoding** strategy:

- The top 5 most frequent languages each get their own binary column.
- All remaining languages are grouped into `lang_other`.
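- A simple `is_english` flag captures the most important language split for global popularity prediction. Because `en` is among the top 5 languages in this dataset, the flag is identical to `lang_en`, so keeping both adds a perfectly collinear column.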

In [63]:
# ============================================================
# Language encoding (top-K + is_english flag)
# ============================================================

# Identify top 5 languages by frequency
top_languages = df["original_language"].value_counts().head(5).index.tolist()
print(f"Top 5 languages: {top_languages}")

# Create binary column for each top language
for lang in top_languages:
    df[f"lang_{lang}"] = (df["original_language"] == lang).astype(int)

# Catch-all for less common languages
df["lang_other"] = (~df["original_language"].isin(top_languages)).astype(int)

# Simple is_english flag
df["is_english"] = (df["original_language"] == "en").astype(int)

print(f"\nis_english distribution:\n{df['is_english'].value_counts()}")

Top 5 languages: ['en', 'fr', 'es', 'ja', 'de']

is_english distribution:
is_english
1    5748
0    3542
Name: count, dtype: int64
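
The top-K strategy can be wrapped in a small reusable function. The sketch below is illustrative only: `top_k_one_hot` is a hypothetical helper name and the toy series is invented.

```python
import pandas as pd


def top_k_one_hot(series, k=5, prefix="lang"):
    """Hypothetical helper: one binary column per top-k value plus an `other` column."""
    top = series.value_counts().head(k).index.tolist()
    out = pd.DataFrame(index=series.index)
    for value in top:
        out[f"{prefix}_{value}"] = (series == value).astype(int)
    # Everything outside the top-k (including missing values) lands in `other`
    out[f"{prefix}_other"] = (~series.isin(top)).astype(int)
    return out


# Toy data, invented for illustration.
langs = pd.Series(["en", "en", "en", "fr", "fr", "ja", "ko"])
encoded = top_k_one_hot(langs, k=2)
print(encoded)
```

Each row has exactly one active column, so the encoded block sums to 1 per movie. That is a convenient sanity check after encoding.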


## 10. Production Features — Budget, Revenue Zero Correction & Runtime

**Zero Correction:** In TMDB, `budget = 0` and `revenue = 0` mean "not reported", not truly $0. We:
1. Replace `budget == 0` → `NaN` and `revenue == 0` → `NaN`
2. Create `budget_missing_flag` and `revenue_missing_flag` (1 = missing, 0 = present)
3. **Do NOT drop rows** with missing revenue — these define unlabeled observations for semi-supervised learning

**Budget features:**
- `has_budget` — binary flag (1 if budget is known)
- `log_budget` — log1p transform (0 for missing budgets)
- `budget_missing_flag` — 1 if budget was 0/NaN

**Revenue handling:**
- `revenue_missing_flag` — 1 if revenue was 0/NaN
- Raw `revenue` is preserved in the master for target construction

**Runtime:** Retained as-is.

In [64]:
# ============================================================
# Production features: Budget & Revenue zero correction, Runtime
# ============================================================

# --- Revenue: replace 0 with NaN ---
revenue_zero_count = (df["revenue"] == 0).sum()
df["revenue"] = df["revenue"].replace(0, np.nan)
df["revenue_missing_flag"] = df["revenue"].isna().astype(int)
print(f"Revenue: {revenue_zero_count} zeros replaced with NaN")
print(f"revenue_missing_flag distribution:\n{df['revenue_missing_flag'].value_counts().to_dict()}")

# --- Budget: replace 0 with NaN ---
budget_zero_count = (df["budget"] == 0).sum()
df["budget"] = df["budget"].replace(0, np.nan)
df["budget_missing_flag"] = df["budget"].isna().astype(int)
print(f"\nBudget: {budget_zero_count} zeros replaced with NaN")
print(f"budget_missing_flag distribution:\n{df['budget_missing_flag'].value_counts().to_dict()}")

# --- Budget engineered features ---
df["has_budget"] = (df["budget"].notna()).astype(int)
df["log_budget"] = np.log1p(df["budget"].fillna(0))

print(f"\nhas_budget distribution: {df['has_budget'].value_counts().to_dict()}")
print(f"log_budget stats (where budget known):")
print(df.loc[df["budget"].notna(), "log_budget"].describe())

# --- Runtime ---
print(f"\nRuntime stats:\n{df['runtime'].describe()}")

Revenue: 6686 zeros replaced with NaN
revenue_missing_flag distribution:
{1: 6686, 0: 2604}

Budget: 6527 zeros replaced with NaN
budget_missing_flag distribution:
{1: 6527, 0: 2763}

has_budget distribution: {0: 6527, 1: 2763}
log_budget stats (where budget known):
count    2763.000000
mean       15.833550
std         3.021456
min         0.693147
25%        14.978662
50%        16.705882
75%        17.727534
max        20.009712
Name: log_budget, dtype: float64

Runtime stats:
count    9290.000000
mean       81.390635
std        44.067500
min         0.000000
25%        67.000000
50%        92.000000
75%       107.000000
max       950.000000
Name: runtime, dtype: float64
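
The zero-correction steps can be traced on a few rows. This is a minimal sketch on toy budgets invented for illustration, using `Series.mask` in place of `replace`:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; 0 stands for "not reported".
toy = pd.DataFrame({"budget": [0, 1_000_000, 0, 250_000_000]})

budget = toy["budget"].mask(toy["budget"] == 0)      # 0 -> NaN
toy["budget_missing_flag"] = budget.isna().astype(int)
toy["has_budget"] = budget.notna().astype(int)
toy["log_budget"] = np.log1p(budget.fillna(0))       # missing -> log1p(0) = 0
toy["budget"] = budget

print(toy)
```

Note that `has_budget` is the exact complement of `budget_missing_flag`, so the two columns carry the same information. Either one also tells a model when a `log_budget` of 0 means "unknown" rather than a tiny budget.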


## 11. Text Signal — Overview Availability

The movie overview (synopsis) is free text that would require NLP to fully exploit. For this pipeline, we extract a simple signal: **whether an overview exists and how long it is (in characters)**. The assumption is that longer, more detailed overviews go with more developed marketing and distribution, which makes overview length a lightweight proxy for production effort.

In [65]:
# ============================================================
# Overview-based features
# ============================================================

# Whether the movie has an overview at all
df["has_overview"] = df["overview"].notna().astype(int)

# Length of overview (character count)
df["overview_length"] = df["overview"].fillna("").apply(len)

print("Overview features:")
print(f"  has_overview: {df['has_overview'].value_counts().to_dict()}")
print(f"  overview_length stats:\n{df['overview_length'].describe()}")

Overview features:
  has_overview: {1: 9089, 0: 201}
  overview_length stats:
count    9290.000000
mean      259.595048
std       164.951892
min         0.000000
25%       146.000000
50%       220.000000
75%       339.000000
max       999.000000
Name: overview_length, dtype: float64


## 12. Assemble Neutral Master Feature Dataset

Create `data_features_master` — a target-agnostic dataset containing:
- All pre-release engineered features
- Raw revenue and raw popularity (for target construction)
- Raw budget (preserved alongside log_budget and has_budget)
- Missing flags (revenue_missing_flag, budget_missing_flag)
- Post-release metrics (vote_average, vote_count) — kept in master, excluded at modeling time

**No rows are dropped. No target-specific filtering.**

In [66]:
# ============================================================
# Assemble the neutral master feature dataset
# ============================================================

# Drop only non-predictive/identifier columns and intermediate raw columns
cols_to_drop_from_master = (
    all_excluded_from_master
    + ["release_date", "original_language", "genres", "keywords",
       "director_gender",                             # replaced by director_is_female
       "actor1_gender", "actor2_gender", "actor3_gender",
       "actor4_gender", "actor5_gender",              # replaced by cast_gender_ratio
       ]
)

# Only drop columns that actually exist
cols_to_drop_existing = [c for c in cols_to_drop_from_master if c in df.columns]

data_features_master = df.drop(columns=cols_to_drop_existing)

print(f"Master dataset shape: {data_features_master.shape}")
print(f"\nAll columns ({len(data_features_master.columns)}):")
print(list(data_features_master.columns))
print(f"\n--- Confirming key columns are present ---")
for col in ["revenue", "popularity", "budget", "vote_average", "vote_count",
            "revenue_missing_flag", "budget_missing_flag"]:
    present = col in data_features_master.columns
    print(f"  {col}: {'YES' if present else 'MISSING!'}")

Master dataset shape: (9290, 57)

All columns (57):
['runtime', 'popularity', 'vote_average', 'vote_count', 'budget', 'revenue', 'director_popularity', 'actor1_popularity', 'actor2_popularity', 'actor3_popularity', 'actor4_popularity', 'actor5_popularity', 'cast_pop_mean', 'cast_pop_max', 'release_month', 'release_year', 'release_quarter', 'is_summer_release', 'is_holiday_release', 'star_count', 'cast_popularity_std', 'cast_gender_ratio', 'director_is_female', 'genre_Action', 'genre_Adventure', 'genre_Animation', 'genre_Comedy', 'genre_Crime', 'genre_Documentary', 'genre_Drama', 'genre_Family', 'genre_Fantasy', 'genre_History', 'genre_Horror', 'genre_Music', 'genre_Mystery', 'genre_Romance', 'genre_Science Fiction', 'genre_TV Movie', 'genre_Thriller', 'genre_War', 'genre_Western', 'num_genres', 'keyword_count', 'lang_en', 'lang_fr', 'lang_es', 'lang_ja', 'lang_de', 'lang_other', 'is_english', 'revenue_missing_flag', 'budget_missing_flag', 'has_budget', 'log_budget', 'has_overview', 'ov

## 13. Create Revenue Tier

Revenue tiers are computed **only from rows where revenue is not NaN** (labeled observations). Quantile thresholds are calculated exclusively from labeled data.

**Tiers:** Low, Medium, High, Blockbuster (based on quartiles)
**Unlabeled rows:** revenue_tier = NaN

In [67]:
# ============================================================
# Create revenue_tier from LABELED rows only
# ============================================================

# Step 1: Identify labeled rows (revenue is not NaN)
df_labeled = data_features_master[data_features_master["revenue"].notna()].copy()
print(f"Labeled rows (revenue not NaN): {len(df_labeled)}")
print(f"Unlabeled rows (revenue is NaN): {len(data_features_master) - len(df_labeled)}")

# Step 2: Compute quantile thresholds from labeled data ONLY
thresholds = df_labeled["revenue"].quantile([0.25, 0.50, 0.75])
q25, q50, q75 = thresholds.iloc[0], thresholds.iloc[1], thresholds.iloc[2]

print(f"\nRevenue quantile thresholds (from labeled data only):")
print(f"  Q25 (Low/Medium boundary):       ${q25:,.0f}")
print(f"  Q50 (Medium/High boundary):       ${q50:,.0f}")
print(f"  Q75 (High/Blockbuster boundary):  ${q75:,.0f}")

# Step 3: Assign revenue_tier using defined bins
bins = [-np.inf, q25, q50, q75, np.inf]
labels_tier = ["Low", "Medium", "High", "Blockbuster"]

# Apply pd.cut to the full revenue column — NaN revenues automatically get NaN tiers
data_features_master["revenue_tier"] = pd.cut(
    data_features_master["revenue"],
    bins=bins,
    labels=labels_tier
)

# Step 4: Store thresholds for reproducibility
revenue_tier_thresholds = {"Q25": q25, "Q50": q50, "Q75": q75}
print(f"\nRevenue tier distribution (labeled rows only):")
print(data_features_master["revenue_tier"].value_counts())
print(f"\nUnlabeled rows (revenue_tier = NaN): {data_features_master['revenue_tier'].isna().sum()}")

Labeled rows (revenue not NaN): 2604
Unlabeled rows (revenue is NaN): 6686

Revenue quantile thresholds (from labeled data only):
  Q25 (Low/Medium boundary):       $3,553,760
  Q50 (Medium/High boundary):       $33,749,242
  Q75 (High/Blockbuster boundary):  $140,441,440

Revenue tier distribution (labeled rows only):
revenue_tier
Low            651
Medium         651
High           651
Blockbuster    651
Name: count, dtype: int64

Unlabeled rows (revenue_tier = NaN): 6686
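
The tier logic can be checked on a small series. In this sketch (toy revenues invented for illustration), thresholds come from the labeled values only, and `pd.cut` leaves NaN revenues without a tier.

```python
import numpy as np
import pandas as pd

# Toy revenues, invented for illustration; NaN marks unlabeled rows.
revenue = pd.Series([1e5, 5e6, 4e7, 2e8, np.nan, 8e5, 9e6, 7e7, 6e8, np.nan])

# Quantiles from labeled rows only
labeled = revenue.dropna()
q25, q50, q75 = labeled.quantile([0.25, 0.50, 0.75])

tier = pd.cut(
    revenue,
    bins=[-np.inf, q25, q50, q75, np.inf],   # right-closed: a value equal to q25 is "Low"
    labels=["Low", "Medium", "High", "Blockbuster"],
)
print(tier.value_counts(sort=False).to_dict())
print("unlabeled:", int(tier.isna().sum()))
```

Because the bins are right-closed by default, a revenue exactly equal to a threshold falls into the lower tier. The result is an ordered categorical, so the tiers compare as Low < Medium < High < Blockbuster.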


## 14. Define Pre-Release Feature Columns

Define the list of columns that constitute valid pre-release features. These will be used in all feature matrices for modeling. Post-release variables (popularity, vote_average, vote_count) and raw targets (revenue, budget) are explicitly excluded from feature matrices.

In [68]:
# ============================================================
# Define pre-release feature columns
# ============================================================

# Columns that must NOT be used as features when predicting revenue
post_release_exclude = ["popularity", "vote_average", "vote_count"]
target_exclude = ["revenue", "budget", "revenue_tier", "revenue_missing_flag"]
# revenue_missing_flag is derived from the target, so it is excluded as well.
# budget_missing_flag is a valid feature (it's known pre-release),
# but raw budget is replaced by log_budget + has_budget.

# All pre-release feature columns
all_master_cols = list(data_features_master.columns)
non_feature_cols = post_release_exclude + target_exclude

pre_release_features = [c for c in all_master_cols if c not in non_feature_cols]

print(f"Pre-release feature columns ({len(pre_release_features)}):")
print(pre_release_features)

Pre-release feature columns (51):
['runtime', 'director_popularity', 'actor1_popularity', 'actor2_popularity', 'actor3_popularity', 'actor4_popularity', 'actor5_popularity', 'cast_pop_mean', 'cast_pop_max', 'release_month', 'release_year', 'release_quarter', 'is_summer_release', 'is_holiday_release', 'star_count', 'cast_popularity_std', 'cast_gender_ratio', 'director_is_female', 'genre_Action', 'genre_Adventure', 'genre_Animation', 'genre_Comedy', 'genre_Crime', 'genre_Documentary', 'genre_Drama', 'genre_Family', 'genre_Fantasy', 'genre_History', 'genre_Horror', 'genre_Music', 'genre_Mystery', 'genre_Romance', 'genre_Science Fiction', 'genre_TV Movie', 'genre_Thriller', 'genre_War', 'genre_Western', 'num_genres', 'keyword_count', 'lang_en', 'lang_fr', 'lang_es', 'lang_ja', 'lang_de', 'lang_other', 'is_english', 'budget_missing_flag', 'has_budget', 'log_budget', 'has_overview', 'overview_length']
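
A small guard can catch target-derived columns before they reach a model. The sketch below assumes `revenue_missing_flag` counts as leaky, since it is computed from the target. `assert_no_leakage` is a hypothetical helper, and the feature list is shortened for illustration.

```python
def assert_no_leakage(feature_cols, forbidden_cols):
    """Hypothetical guard: raise if any forbidden column appears among the features."""
    leaked = sorted(set(feature_cols) & set(forbidden_cols))
    if leaked:
        raise ValueError(f"Leaky columns in feature list: {leaked}")


# Column names taken from this notebook; the feature list is shortened for illustration.
forbidden = ["popularity", "vote_average", "vote_count",
             "revenue", "budget", "revenue_tier", "revenue_missing_flag"]
features = ["runtime", "log_budget", "has_budget", "release_month"]

assert_no_leakage(features, forbidden)  # passes silently

try:
    assert_no_leakage(features + ["revenue_missing_flag"], forbidden)
except ValueError as err:
    message = str(err)
    print(message)
```

Calling such a check right before each `fit` makes the pre-release/post-release split enforceable rather than a convention.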


## 15. Dataset Builder Functions

Two functions to construct task-specific datasets from the master:

1. **`build_supervised_dataset(master_df, target)`** — For supervised models (revenue or popularity)
2. **`build_ssl_dataset(master_df)`** — For semi-supervised revenue classification (all rows, y_ssl = tier or -1)

In [None]:
# ============================================================
# Dataset builder: Supervised
# ============================================================
def build_supervised_dataset(master_df, target="revenue"):
    """
    Build a supervised dataset from the master.
    
    Parameters:
        master_df: the master feature dataframe
        target: "revenue" or "popularity"
    
    Returns:
        X (features DataFrame), y (target Series)
    """
    if target == "revenue":
        # Only rows where revenue is not NaN
        df_sub = master_df[master_df["revenue"].notna()].copy()
        y = df_sub["revenue"].copy()
        
        # Exclude post-release + raw targets from features
        # revenue_missing_flag is derived from the target, so it is excluded too
        exclude = ["popularity", "vote_average", "vote_count",
                    "revenue", "budget", "revenue_tier", "revenue_missing_flag"]
        X = df_sub.drop(columns=[c for c in exclude if c in df_sub.columns])
        
    elif target == "popularity":
        # Only rows where popularity is not NaN
        df_sub = master_df[master_df["popularity"].notna()].copy()
        y = df_sub["popularity"].copy()
        
        # Exclude revenue + other post-release from features
        # revenue_missing_flag is only known post-release, so it is excluded too
        exclude = ["popularity", "vote_average", "vote_count",
                    "revenue", "budget", "revenue_tier", "revenue_missing_flag"]
        X = df_sub.drop(columns=[c for c in exclude if c in df_sub.columns])
        
    else:
        raise ValueError(f"Unknown target: {target}. Use 'revenue' or 'popularity'.")
    
    print(f"Supervised dataset for '{target}':")
    print(f"  X shape: {X.shape}")
    print(f"  y shape: {y.shape}")
    print(f"  Features: {list(X.columns)}")
    return X, y


# ============================================================
# Dataset builder: Semi-Supervised (SSL) for Revenue
# ============================================================
def build_ssl_dataset(master_df):
    """
    Build a semi-supervised dataset for revenue tier classification.
    
    - All rows included
    - y_ssl = revenue_tier label (encoded as int) if revenue is not NaN
    - y_ssl = -1 if revenue is NaN (unlabeled)
    - No post-release features in X
    
    Returns:
        X (features DataFrame), y_ssl (Series with int labels or -1)
    """
    df_ssl = master_df.copy()
    
    # revenue_tier is an ordered Categorical created by pd.cut, so
    # .cat.codes yields 0=Low, 1=Medium, 2=High, 3=Blockbuster and -1 for NaN.
    # Using the codes avoids fillna(-1), which raises a TypeError on a
    # Categorical because -1 is not one of its categories.
    df_ssl["y_ssl"] = df_ssl["revenue_tier"].cat.codes.astype(int)
    
    y_ssl = df_ssl["y_ssl"].copy()
    
    # Exclude post-release + raw targets from features
    exclude = ["popularity", "vote_average", "vote_count",
                "revenue", "budget", "revenue_tier", "y_ssl"]
    X = df_ssl.drop(columns=[c for c in exclude if c in df_ssl.columns])
    
    print(f"SSL dataset for revenue tier:")
    print(f"  X shape: {X.shape}")
    print(f"  y_ssl shape: {y_ssl.shape}")
    print(f"  Labeled:   {(y_ssl != -1).sum()}")
    print(f"  Unlabeled: {(y_ssl == -1).sum()}")
    print(f"  y_ssl distribution:\n{y_ssl.value_counts().sort_index().to_dict()}")
    print(f"  Features: {list(X.columns)}")
    return X, y_ssl


print("Dataset builder functions defined: build_supervised_dataset(), build_ssl_dataset()")

Dataset builder functions defined: build_supervised_dataset(), build_ssl_dataset()
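
The `-1` encoding used by `build_ssl_dataset()` can be checked on a toy series. The following is a minimal sketch, assuming hypothetical bin edges and revenue values (the real thresholds come from the labeled data, not from these numbers). It shows that `.cat.codes` maps missing tiers to `-1`, whereas `fillna(-1)` on the Categorical is rejected.

```python
import numpy as np
import pandas as pd

# Hypothetical revenues; NaN marks films with unknown revenue.
revenue = pd.Series([1e5, np.nan, 5e6, 2e8, np.nan, 9e8])

# Hypothetical bin edges, for illustration only.
tiers = pd.cut(
    revenue,
    bins=[0, 1e6, 5e7, 5e8, np.inf],
    labels=["Low", "Medium", "High", "Blockbuster"],
)

# Integer codes follow the category order; NaN becomes -1.
codes = tiers.cat.codes.astype(int)
print(codes.tolist())

# fillna(-1) fails because -1 is not an existing category.
try:
    tiers.fillna(-1)
    fillna_failed = False
except (TypeError, ValueError):  # exception type differs across pandas versions
    fillna_failed = True
print("fillna(-1) rejected:", fillna_failed)
```

Because the codes depend on category order, the tiers must stay an ordered Categorical with `Low` first for `0..3` to keep their meaning.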


## 16. Scaling Preparation for Graph-Based SSL

LabelPropagation and LabelSpreading build their similarity graph from pairwise distances (RBF kernel by default), so they are sensitive to feature scales. A `StandardScaler` pipeline is prepared here. **Fit the scaler on the training split only, after the train/test split,** to prevent test-set statistics from leaking into the model.
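
To make the scale sensitivity concrete, the sketch below compares how much each feature contributes to the squared Euclidean distance (the quantity inside an RBF kernel) before and after standardization. The six rows are made-up values for `overview_length` and `is_english`, and `feature_shares` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# Hypothetical rows: [overview_length (characters), is_english (0/1)]
X = np.array([
    [300.0, 1.0],
    [320.0, 0.0],
    [150.0, 1.0],
    [600.0, 0.0],
    [450.0, 1.0],
    [280.0, 0.0],
])

def feature_shares(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

# Column-wise standardization with the population std (ddof=0),
# the same convention StandardScaler uses.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

raw_share = feature_shares(X[0], X[1])
scaled_share = feature_shares(X_scaled[0], X_scaled[1])
print("raw shares:   ", raw_share.round(4))
print("scaled shares:", scaled_share.round(4))
```

For the first two rows, nearly all of the raw distance comes from a 20-character difference in overview length, while the language flag barely registers. After standardization the language difference carries most of the distance, which is the behaviour a kernel over mixed-unit features usually needs.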

In [70]:
# ============================================================
# Scaling pipeline for graph-based SSL (LabelPropagation / LabelSpreading)
# ============================================================
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.semi_supervised import LabelPropagation, LabelSpreading

def create_ssl_scaling_pipeline(model_class=LabelSpreading, **model_kwargs):
    """
    Create a Pipeline with StandardScaler + SSL classifier.
    
    IMPORTANT: Apply this AFTER train/test split to prevent leakage.
    The scaler fits only on training data.
    
    Usage:
        pipe = create_ssl_scaling_pipeline(LabelSpreading, kernel='rbf', gamma=20)
        pipe.fit(X_train, y_train_ssl)
        y_pred = pipe.predict(X_test)
    """
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("ssl_model", model_class(**model_kwargs))
    ])
    return pipe

print("Scaling pipeline function defined: create_ssl_scaling_pipeline()")
print("Example: pipe = create_ssl_scaling_pipeline(LabelSpreading, kernel='rbf', gamma=20)")

Scaling pipeline function defined: create_ssl_scaling_pipeline()
Example: pipe = create_ssl_scaling_pipeline(LabelSpreading, kernel='rbf', gamma=20)
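
A possible end-to-end use of this kind of pipeline is sketched below on synthetic data. Everything here is an assumption made for illustration: the two Gaussian clusters, the 70% share of hidden labels, and `gamma=1.0`. The pipeline is rebuilt inline so the block runs on its own. The point is the order of operations: split first, then fit scaler and model together on the training split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: two clusters whose
# second feature lives on a much larger scale than the first.
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=[1.0, 1000.0], size=(100, 2)),
    rng.normal(loc=[6.0, 6000.0], scale=[1.0, 1000.0], size=(100, 2)),
])
y = np.repeat([0, 1], 100)

# 1. Split BEFORE any scaling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 2. Hide most training labels to mimic unlabeled rows (y = -1).
y_train_ssl = y_train.copy()
hidden = rng.random(len(y_train_ssl)) < 0.7
y_train_ssl[hidden] = -1

# 3. Scaler and SSL model are fit together on the training split only.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("ssl_model", LabelSpreading(kernel="rbf", gamma=1.0)),
])
pipe.fit(X_train, y_train_ssl)
y_pred = pipe.predict(X_test)

print("test accuracy:", (y_pred == y_test).mean())
```

Since the scaler lives inside the pipeline, `pipe.predict(X_test)` reuses the training means and standard deviations, so no test-set statistics enter the fit.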


## 17. Dataset Validation Checks

Before exporting, validate:
- Labeled vs unlabeled observation counts
- Revenue tier class distribution
- Missing values per feature
- Confirm no post-release columns in SSL feature matrix
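
The leakage checks listed above can also be made fail-fast, so that a forbidden column stops the run instead of printing a warning. The helper below is a hypothetical sketch (`assert_no_leakage` is not defined elsewhere in this notebook), and the column lists are illustrative.

```python
def assert_no_leakage(feature_columns, forbidden):
    """Raise ValueError if any forbidden column appears among the features."""
    leaked = sorted(set(feature_columns) & set(forbidden))
    if leaked:
        raise ValueError(f"Leakage detected: {leaked}")

forbidden = ["popularity", "vote_average", "vote_count", "revenue", "revenue_tier"]
clean_cols = ["runtime", "log_budget", "release_month"]

# A clean feature list passes silently.
assert_no_leakage(clean_cols, forbidden)

# A post-release column triggers the error.
try:
    assert_no_leakage(clean_cols + ["vote_count"], forbidden)
    caught = None
except ValueError as err:
    caught = str(err)
print(caught)
```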

In [72]:
# ============================================================
# Validation checks
# ============================================================
print("=" * 60)
print("DATASET VALIDATION")
print("=" * 60)

# 1. Labeled vs Unlabeled revenue observations
n_labeled = data_features_master["revenue"].notna().sum()
n_unlabeled = data_features_master["revenue"].isna().sum()
print(f"\n1. Revenue observations:")
print(f"   Labeled (revenue known):   {n_labeled}")
print(f"   Unlabeled (revenue NaN):   {n_unlabeled}")
print(f"   Total:                     {n_labeled + n_unlabeled}")

# 2. Revenue tier distribution
print(f"\n2. Revenue tier distribution:")
print(data_features_master["revenue_tier"].value_counts())
print(f"   NaN (unlabeled): {data_features_master['revenue_tier'].isna().sum()}")

# 3. Missing values per feature
print(f"\n3. Missing values per column (top 15):")
missing = data_features_master.isnull().sum().sort_values(ascending=False)
missing_pct = (missing / len(data_features_master) * 100).round(2)
missing_report = pd.DataFrame({"count": missing, "pct": missing_pct})
print(missing_report[missing_report["count"] > 0].head(15))

# 4. Build SSL dataset and verify no post-release leakage
print(f"\n4. SSL feature matrix leakage check:")
X_ssl_check, y_ssl_check = build_ssl_dataset(data_features_master)
forbidden_in_ssl = ["popularity", "vote_average", "vote_count", "revenue", "revenue_tier"]
leakage_found = [c for c in forbidden_in_ssl if c in X_ssl_check.columns]
if leakage_found:
    print(f"   WARNING: Post-release leakage detected: {leakage_found}")
else:
    print(f"   PASSED - No post-release columns in SSL feature matrix")

# 5. Supervised dataset check
print(f"\n5. Supervised revenue dataset check:")
X_rev_check, y_rev_check = build_supervised_dataset(data_features_master, target="revenue")
leakage_rev = [c for c in forbidden_in_ssl if c in X_rev_check.columns]
if leakage_rev:
    print(f"   WARNING: Leakage detected: {leakage_rev}")
else:
    print(f"   PASSED - No post-release columns in supervised revenue features")

# 6. Supervised popularity dataset check
print(f"\n6. Supervised popularity dataset check:")
X_pop_check, y_pop_check = build_supervised_dataset(data_features_master, target="popularity")
leakage_pop = [c for c in ["popularity", "revenue", "revenue_tier"] if c in X_pop_check.columns]
if leakage_pop:
    print(f"   WARNING: Leakage detected: {leakage_pop}")
else:
    print(f"   PASSED - No target leakage in supervised popularity features")

print(f"\n{'=' * 60}")
print("Revenue tier thresholds (for reproducibility):")
for k, v in revenue_tier_thresholds.items():
    print(f"  {k}: ${v:,.0f}")
print("=" * 60)

DATASET VALIDATION

1. Revenue observations:
   Labeled (revenue known):   2604
   Unlabeled (revenue NaN):   6686
   Total:                     9290

2. Revenue tier distribution:
revenue_tier
Low            651
Medium         651
High           651
Blockbuster    651
Name: count, dtype: int64
   NaN (unlabeled): 6686

3. Missing values per column (top 15):
              count    pct
revenue        6686  71.97
revenue_tier   6686  71.97
budget         6527  70.26

4. SSL feature matrix leakage check:


TypeError: Cannot setitem on a Categorical with a new category (-1), set the categories first

## 18. Export Clean Outputs

Export four datasets:
1. **data_features_master.csv** — neutral master (all rows, all features + raw targets)
2. **data_supervised_revenue.csv** — labeled rows, pre-release features + revenue target
3. **data_supervised_popularity.csv** — labeled rows, pre-release features + popularity target
4. **data_ssl_revenue.csv** — all rows, pre-release features + y_ssl column

In [None]:
# ============================================================
# Export all datasets
# ============================================================
import os

output_dir = "../data"
os.makedirs(output_dir, exist_ok=True)  # create the folder if it does not exist yet

# 1. Master dataset
master_path = os.path.join(output_dir, "data_features_master.csv")
data_features_master.to_csv(master_path, index=False)
print(f"[1/4] Saved: {master_path}  shape={data_features_master.shape}")

# 2. Supervised revenue dataset
X_rev, y_rev = build_supervised_dataset(data_features_master, target="revenue")
sup_rev = X_rev.copy()
sup_rev["revenue"] = y_rev.values
sup_rev_path = os.path.join(output_dir, "data_supervised_revenue.csv")
sup_rev.to_csv(sup_rev_path, index=False)
print(f"[2/4] Saved: {sup_rev_path}  shape={sup_rev.shape}")

# 3. Supervised popularity dataset
X_pop, y_pop = build_supervised_dataset(data_features_master, target="popularity")
sup_pop = X_pop.copy()
sup_pop["popularity"] = y_pop.values
sup_pop_path = os.path.join(output_dir, "data_supervised_popularity.csv")
sup_pop.to_csv(sup_pop_path, index=False)
print(f"[3/4] Saved: {sup_pop_path}  shape={sup_pop.shape}")

# 4. SSL revenue dataset
X_ssl, y_ssl = build_ssl_dataset(data_features_master)
ssl_rev = X_ssl.copy()
ssl_rev["y_ssl"] = y_ssl.values
ssl_rev_path = os.path.join(output_dir, "data_ssl_revenue.csv")
ssl_rev.to_csv(ssl_rev_path, index=False)
print(f"[4/4] Saved: {ssl_rev_path}  shape={ssl_rev.shape}")

print(f"\n{'='*60}")
print("ALL EXPORTS COMPLETE")
print(f"{'='*60}")
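
One caveat when reloading the exported master: CSV does not store the Categorical dtype, so `revenue_tier` comes back as plain text and `.cat.codes` is unavailable until the dtype is restored. The sketch below shows one way to handle this, assuming the tier order Low < Medium < High < Blockbuster and using an in-memory buffer in place of the real file.

```python
import io
import numpy as np
import pandas as pd

tier_order = ["Low", "Medium", "High", "Blockbuster"]  # assumed tier order
df = pd.DataFrame({
    "revenue_tier": pd.Categorical(
        ["High", np.nan, "Low", "Blockbuster"],
        categories=tier_order,
        ordered=True,
    )
})

# Round-trip through CSV (in memory, standing in for the exported file).
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
dtype_after_reload = reloaded["revenue_tier"].dtype  # no longer categorical

# Restore the ordered Categorical so the integer codes are meaningful again.
tier_dtype = pd.CategoricalDtype(categories=tier_order, ordered=True)
reloaded["revenue_tier"] = reloaded["revenue_tier"].astype(tier_dtype)
codes = reloaded["revenue_tier"].cat.codes.astype(int).tolist()
print(codes)
```

Passing the categories explicitly matters: letting pandas infer them from the text would sort them alphabetically and change which integer each tier maps to.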

## Final System Structure

| Component | Description |
|---|---|
| **data_features_master.csv** | Neutral master dataset — all rows, all pre-release features, raw targets, missing flags |
| **data_supervised_revenue.csv** | Labeled rows only, pre-release features + revenue |
| **data_supervised_popularity.csv** | Labeled rows only, pre-release features + popularity |
| **data_ssl_revenue.csv** | All rows, pre-release features + y_ssl (tier label or -1) |
| **build_supervised_dataset()** | Function to construct supervised X, y for revenue or popularity |
| **build_ssl_dataset()** | Function to construct SSL X, y_ssl with -1 for unlabeled |
| **create_ssl_scaling_pipeline()** | StandardScaler + SSL model pipeline (apply after split) |

**Guarantees:**
- No target leakage (post-release variables excluded from feature matrices)
- Revenue tiers computed only from labeled observations
- Zero values in budget/revenue corrected to NaN with flags
- All pre-release variables preserved
- Missing revenue rows retained as unlabeled for SSL
- Reproducible quantile thresholds documented