# Advanced Feature Engineering (TMDB 2010-2025)

This notebook builds on the base feature engineering work and adds **advanced features** that can improve
model performance for predicting movie success/popularity.

**New features created:**
- Holiday & seasonal release flags
- Release competition density (movies released in the same month and week)
- Director historical track record (avg revenue, avg rating, avg popularity)
- Franchise / sequel indicators
- Budget tier categorization
- Cast gender diversity index
- Overview text features (word length, punctuation, sentence count)
- Genre interaction features
- Popularity-to-votes ratio

**Output:** `../data/data_advanced_features.csv`

In [1]:
import pandas as pd
import numpy as np
import ast
import re
from collections import Counter

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 200)

In [2]:
# Load the cleaned/engineered dataset from the base pipeline
df = pd.read_csv("../data/data_cleaned_engineered.csv")
print(f"Dataset shape: {df.shape}")
df.head(3)

Dataset shape: (9290, 122)


Unnamed: 0,movie_id,title,release_date,runtime,original_language,popularity,vote_average,vote_count,budget,revenue,status,overview,genres,keywords,director_id,director_name,director_gender,director_popularity,director_department,actor1_id,actor1_name,actor1_character,actor1_gender,actor1_popularity,actor1_department,actor2_id,actor2_name,actor2_character,actor2_gender,actor2_popularity,actor2_department,actor3_id,actor3_name,actor3_character,actor3_gender,actor3_popularity,actor3_department,actor4_id,actor4_name,actor4_character,actor4_gender,actor4_popularity,actor4_department,actor5_id,actor5_name,actor5_character,actor5_gender,actor5_popularity,actor5_department,cast_pop_mean,cast_pop_max,release_year,release_month,release_quarter,release_dayofweek,release_weekofyear,is_weekend_release,runtime_missing,budget_missing,revenue_missing,director_popularity_missing,overview_len,overview_word_count,genres_count,keywords_count,genre_drama,genre_comedy,genre_thriller,genre_action,genre_horror,genre_romance,genre_adventure,genre_science_fiction,genre_crime,genre_fantasy,genre_family,genre_animation,genre_documentary,genre_mystery,genre_history,primary_genre,kw_woman_director,kw_based_on_novel_or_book,kw_short_film,kw_sequel,kw_based_on_true_story,kw_duringcreditsstinger,kw_murder,kw_lgbt,kw_aftercreditsstinger,kw_biography,kw_coming_of_age,kw_revenge,kw_superhero,kw_gay_theme,kw_friendship,kw_new_york_city,kw_based_on_comic,kw_family,kw_anime,kw_amused,kw_remake,kw_love,kw_dystopia,kw_dark_comedy,kw_absurd,actor_pop_mean,actor_pop_max,actor_pop_min,actor_pop_std,cast_size,gender_female_count,gender_male_count,gender_nonbinary_count,gender_unknown_count,has_female_director,log_budget,log_revenue,roi,has_financials,success_revenue,success_roi_1_5
0,27205,Inception,2010-07-15,148.0,en,32.8952,8.37,38655,160000000.0,839030630.0,Released,"Cobb, a skilled thief who commits corporate es...","['Action', 'Science Fiction', 'Adventure']","['rescue', 'mission', 'dreams', 'airplane', 'p...",525.0,Christopher Nolan,2.0,8.2813,Directing,6193.0,Leonardo DiCaprio,Dom Cobb,2.0,12.2774,Acting,24045.0,Joseph Gordon-Levitt,Arthur,2.0,5.6445,Acting,3899.0,Ken Watanabe,Saito,2.0,4.5824,Acting,2524.0,Tom Hardy,Eames,2.0,9.6156,Acting,27578.0,Elliot Page,Ariadne,3.0,4.8289,Acting,7.38976,12.2774,2010.0,7.0,3.0,3.0,28,0,0,0,0,0,280,44,3,20,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,Action,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7.38976,12.2774,4.5824,3.403256,5,0,5,1,0,0,18.890684,20.547758,5.243941,True,1,1
1,38757,Tangled,2010-11-24,100.0,en,19.876,7.61,12179,260000000.0,592461732.0,Released,"Feisty teenager Rapunzel, who has long and mag...","['Animation', 'Family', 'Adventure']","['princess', 'magic', 'hostage', 'fairy tale',...",76595.0,Byron Howard,2.0,5.1258,Directing,16855.0,Mandy Moore,Rapunzel (voice),1.0,2.1981,Acting,69899.0,Zachary Levi,Flynn Rider (voice),2.0,1.978,Acting,2517.0,Donna Murphy,Mother Gothel (voice),1.0,1.8252,Acting,2372.0,Ron Perlman,Stabbington Brother (voice),2.0,4.9563,Acting,22132.0,M.C. Gainey,Captain of the Guard (voice),2.0,2.1777,Acting,2.62706,4.9563,2010.0,11.0,4.0,2.0,47,0,0,0,0,0,286,50,3,21,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,Animation,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.62706,4.9563,1.8252,1.311063,5,2,4,0,0,0,19.376192,20.199797,2.278699,True,1,1
2,10138,Iron Man 2,2010-04-28,124.0,en,13.79,6.85,22057,200000000.0,623933331.0,Released,With the world now aware of his dual life as t...,"['Adventure', 'Action', 'Science Fiction']","['technology', 'superhero', 'malibu', 'based o...",15277.0,Jon Favreau,2.0,4.0024,Acting,3223.0,Robert Downey Jr.,Tony Stark,2.0,9.2587,Acting,12052.0,Gwyneth Paltrow,Pepper Potts,1.0,5.5278,Acting,1896.0,Don Cheadle,Lt. Col. James 'Rhodey' Rhodes,2.0,3.4058,Acting,1245.0,Scarlett Johansson,Natalie Rushman / Natasha Romanoff,1.0,17.8153,Acting,6807.0,Sam Rockwell,Justin Hammer,2.0,4.4974,Acting,8.101,17.8153,2010.0,4.0,2.0,2.0,17,0,0,0,0,0,372,63,3,10,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,Adventure,0,0,0,1,0,0,0,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,8.101,17.8153,3.4058,5.860036,5,2,4,0,0,0,19.113828,20.251554,3.119667,True,1,1


## 1. Seasonality & Holiday Release Features

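Release timing often changes box-office dynamics: the summer blockbuster season, the Thanksgiving-to-Christmas
holiday corridor, the Valentine's Day and Halloween windows, and the January "dump month" each behave differently.
We create a binary flag for each window. Films with a missing or unparseable release date get 0 on every flag.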

In [3]:
df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
df["release_month"] = df["release_date"].dt.month
df["release_day"] = df["release_date"].dt.day

# Summer blockbuster season (May-August)
df["is_summer_release"] = df["release_month"].isin([5, 6, 7, 8]).astype(int)

# Holiday season (Nov 15 - Dec 31)
df["is_holiday_release"] = (
    ((df["release_month"] == 11) & (df["release_day"] >= 15)) |
    (df["release_month"] == 12)
).astype(int)

# Valentine's Day window (Feb 7-14)
df["is_valentines_release"] = (
    (df["release_month"] == 2) & (df["release_day"].between(7, 14))
).astype(int)

# Halloween window (Oct 15-31)
df["is_halloween_release"] = (
    (df["release_month"] == 10) & (df["release_day"] >= 15)
).astype(int)

# January "dump month" (studios release weaker films in Jan)
df["is_dump_month"] = (df["release_month"] == 1).astype(int)

print("Seasonal flags distribution:")
for col in ["is_summer_release", "is_holiday_release", "is_valentines_release", "is_halloween_release", "is_dump_month"]:
    print(f"  {col}: {df[col].sum()} movies ({df[col].mean()*100:.1f}%)")

Seasonal flags distribution:
  is_summer_release: 2754 movies (29.6%)
  is_holiday_release: 1079 movies (11.6%)
  is_valentines_release: 202 movies (2.2%)
  is_halloween_release: 560 movies (6.0%)
  is_dump_month: 814 movies (8.8%)
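
As a quick sanity check on the window logic, the sketch below applies the same Nov 15 to Dec 31 rule to a few toy dates. The `toy` frame and its values are hypothetical; it mainly illustrates that a date coerced to `NaT` ends up with a flag of 0 rather than a missing value.

```python
import pandas as pd

# Hypothetical toy dates, including one value that cannot be parsed.
toy = pd.DataFrame(
    {"release_date": ["2015-11-14", "2015-11-15", "2015-12-25", "not a date"]}
)
toy["release_date"] = pd.to_datetime(toy["release_date"], errors="coerce")

month = toy["release_date"].dt.month
day = toy["release_date"].dt.day

# Same rule as is_holiday_release: Nov 15 through Dec 31.
toy["is_holiday_release"] = (((month == 11) & (day >= 15)) | (month == 12)).astype(int)

# NaT yields NaN for month/day; comparisons with NaN are False, so the flag is 0.
holiday_flags = toy["is_holiday_release"].tolist()
print(holiday_flags)
```

If unknown dates should stay distinguishable from "not in the window", a separate missing-date indicator would be needed; the flags alone cannot tell the two cases apart.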


## 2. Release Competition Density

How many movies in the dataset were released in the same calendar month or the same week? Each count includes the film itself, so the minimum value is 1. High competition can dilute box-office performance.

In [4]:
# Count movies released in the same year-month
df["release_year"] = df["release_date"].dt.year
df["year_month"] = df["release_date"].dt.to_period("M").astype(str)

month_counts = df.groupby("year_month").size().rename("monthly_competition")
df = df.merge(month_counts, left_on="year_month", right_index=True, how="left")

# Competition within the same week
df["year_week"] = df["release_date"].dt.strftime("%Y-W%U")
week_counts = df.groupby("year_week").size().rename("weekly_competition")
df = df.merge(week_counts, left_on="year_week", right_index=True, how="left")

print(f"Monthly competition — mean: {df['monthly_competition'].mean():.1f}, max: {df['monthly_competition'].max()}")
print(f"Weekly competition  — mean: {df['weekly_competition'].mean():.1f}, max: {df['weekly_competition'].max()}")

# Drop helper columns
df.drop(columns=["year_month", "year_week"], inplace=True)

Monthly competition — mean: 51.7, max: 98.0
Weekly competition  — mean: 12.8, max: 41.0
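
If a count of *other* releases is preferred, one possible variant is to broadcast the group size with `transform("size")` and subtract one, which also avoids the merge. This is a sketch on a hypothetical `toy` frame; the column name `monthly_rivals` is an assumption and is not created elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy data: three May 2012 releases and one June 2012 release.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_date": pd.to_datetime(
        ["2012-05-04", "2012-05-18", "2012-05-25", "2012-06-01"]
    ),
})

year_month = toy["release_date"].dt.to_period("M")

# transform("size") broadcasts each group's row count back onto its rows.
toy["monthly_competition"] = toy.groupby(year_month)["title"].transform("size")

# Subtract 1 so the film does not count as its own competitor.
toy["monthly_rivals"] = toy["monthly_competition"] - 1

print(toy[["title", "monthly_competition", "monthly_rivals"]])
```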


## 3. Director Historical Track Record

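A director's past performance can be a strong signal for future movies. For each director we compute an
expanding mean of revenue, vote_average, and popularity over all earlier films in the dataset, excluding
the current film to avoid leakage. A director's first film in the dataset has no history, so these features
are NaN for it and `is_debut_director` is 1; "debut" here means first appearance in this dataset, not
necessarily a career debut.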

In [5]:
# Sort by director and release date
df = df.sort_values(["director_name", "release_date"]).reset_index(drop=True)

# For each director, compute expanding mean of past movies (excluding current)
def director_rolling_features(group):
    """Compute expanding mean of past movies for each director."""
    result = pd.DataFrame(index=group.index)
    
    for col in ["revenue", "vote_average", "popularity"]:
        if col in group.columns:
            # Shift by 1 to exclude the current row, then expanding mean
            result[f"director_hist_{col}"] = group[col].shift(1).expanding().mean()
    
    # Count of prior films by this director
    result["director_film_count"] = range(len(group))
    
    return result

director_features = df.groupby("director_name", group_keys=False).apply(director_rolling_features)
df = pd.concat([df, director_features], axis=1)

# Flag for first-time directors (no track record)
df["is_debut_director"] = (df["director_film_count"] == 0).astype(int)

print(f"Debut directors: {df['is_debut_director'].sum()} ({df['is_debut_director'].mean()*100:.1f}%)")
print(f"Director historical revenue — mean: {df['director_hist_revenue'].mean():.0f}")
print(f"Director historical rating  — mean: {df['director_hist_vote_average'].mean():.2f}")

Debut directors: 7311 (78.7%)
Director historical revenue — mean: 175610668
Director historical rating  — mean: 5.96
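
The shift-then-expand pattern can be checked on a small example. The sketch below is illustrative only: the `toy` frame is hypothetical, and it groups by `director_id` instead of `director_name` on the assumption that IDs are less likely to merge two different people who share a name.

```python
import pandas as pd

# Hypothetical toy data: director 1 has three films, director 2 has one.
toy = pd.DataFrame({
    "director_id": [1, 1, 1, 2],
    "release_date": pd.to_datetime(
        ["2010-01-01", "2012-01-01", "2014-01-01", "2011-01-01"]
    ),
    "revenue": [100.0, 300.0, 500.0, 50.0],
}).sort_values(["director_id", "release_date"])

# shift(1) drops the current film, so the expanding mean only sees earlier ones.
toy["director_hist_revenue"] = toy.groupby("director_id")["revenue"].transform(
    lambda s: s.shift(1).expanding().mean()
)

# cumcount() numbers each director's films 0, 1, 2, ... in release order.
toy["director_film_count"] = toy.groupby("director_id").cumcount()

hist = toy["director_hist_revenue"].tolist()
film_count = toy["director_film_count"].tolist()
print(toy)
```

The third film of director 1 gets the mean of 100 and 300, and never sees its own revenue of 500, which is the leakage guarantee this section relies on.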


## 4. Franchise & Sequel Detection

Sequels and franchise films generally have higher built-in audiences. We detect them using
title patterns and keywords.

In [6]:
def parse_list_column(x):
    """Safely parse string representations of lists."""
    if isinstance(x, list):
        return x
    if pd.isna(x):
        return []
    try:
        return ast.literal_eval(str(x))
    except Exception:
        return []

# Parse keywords if stored as strings
if "keywords" in df.columns:
    df["keywords"] = df["keywords"].map(parse_list_column)

# Sequel detection via keywords
sequel_keywords = {"sequel", "franchise", "series", "trilogy", "prequel", "reboot", "remake", "spin-off"}
df["is_franchise_keyword"] = df["keywords"].map(
    lambda kws: int(any(k.lower() in sequel_keywords for k in kws))
)

# Sequel detection via title patterns (e.g., "Part 2", "Chapter 3", Roman numerals)
sequel_patterns = [
    r'\b(part|chapter|vol\.?|volume)\s*\d+',
    r'\b[IVX]{2,}\b',           # Roman numerals (II, III, IV, etc.)
    r'\d{1,2}\s*$',             # Ends in 1-2 digits ("Toy Story 3"); also matches titles like "1917"
    r':\s*.+$',                  # Subtitle after a colon; too broad, so excluded by the [:3] slice below
]

def detect_sequel_title(title):
    if pd.isna(title):
        return 0
    for pattern in sequel_patterns[:3]:  # Use only clear sequel patterns
        if re.search(pattern, str(title), re.IGNORECASE):
            return 1
    return 0

df["is_sequel_title"] = df["title"].map(detect_sequel_title)

# Combined franchise flag
df["is_franchise"] = ((df["is_franchise_keyword"] == 1) | (df["is_sequel_title"] == 1)).astype(int)

print(f"Franchise/sequel movies: {df['is_franchise'].sum()} ({df['is_franchise'].mean()*100:.1f}%)")
print(f"  - By keyword: {df['is_franchise_keyword'].sum()}")
print(f"  - By title pattern: {df['is_sequel_title'].sum()}")

Franchise/sequel movies: 610 (6.6%)
  - By keyword: 458
  - By title pattern: 247
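
The trailing-number pattern matches any title whose last one or two characters are digits, which includes bare years. The sketch below compares it with a hypothetical stricter variant (`STRICT_TRAILING_NUMBER` is an assumption, not part of this notebook) that requires a non-digit character and whitespace before the number. It still produces false positives for titles such as "Apollo 13", so it narrows the problem without solving it.

```python
import re

# Pattern used in the cell above: any title ending in 1-2 digits.
ORIGINAL_TRAILING_NUMBER = re.compile(r"\d{1,2}\s*$")

# Hypothetical stricter variant: a non-digit, non-space character, then
# whitespace, then a 1-2 digit number at the end of the title.
STRICT_TRAILING_NUMBER = re.compile(r"[^\d\s]\s+\d{1,2}\s*$")

titles = ["Toy Story 3", "Iron Man 2", "1917", "2012", "Apollo 13"]

for title in titles:
    original_hit = bool(ORIGINAL_TRAILING_NUMBER.search(title))
    strict_hit = bool(STRICT_TRAILING_NUMBER.search(title))
    print(f"{title!r}: original={original_hit}, strict={strict_hit}")
```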


## 5. Budget Tier Categorization

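Raw budget values span orders of magnitude. Binning into tiers (micro < $1M, low < $15M, medium < $50M,
high < $150M, blockbuster $150M+) can capture non-linear effects. `categorize_budget` returns `unknown` for
missing or non-positive budgets, but no film receives that tier in the output below, so every budget is
already positive at this stage; the `budget_missing` flag from the base pipeline is presumably the only
record of which values were originally absent.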

In [7]:
def categorize_budget(budget):
    """Bin a budget (USD) into five tiers; the thresholds are this notebook's choice."""
    if pd.isna(budget) or budget <= 0:
        return "unknown"
    elif budget < 1_000_000:
        return "micro"          # < $1M
    elif budget < 15_000_000:
        return "low"            # $1M - $15M
    elif budget < 50_000_000:
        return "medium"         # $15M - $50M
    elif budget < 150_000_000:
        return "high"           # $50M - $150M
    else:
        return "blockbuster"    # $150M+

df["budget_tier"] = df["budget"].map(categorize_budget)

# One-hot encode budget tiers
budget_dummies = pd.get_dummies(df["budget_tier"], prefix="budget_tier")
df = pd.concat([df, budget_dummies], axis=1)

print("Budget tier distribution:")
print(df["budget_tier"].value_counts().sort_index())

Budget tier distribution:
budget_tier
blockbuster     217
high            512
low             810
medium         7315
micro           436
Name: count, dtype: int64
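
The same binning could also be expressed with `pd.cut`, which keeps the thresholds in one list and returns an ordered categorical. This is a sketch on hypothetical budget values, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical budgets, including a zero and a missing value.
budget = pd.Series(
    [0.0, 500_000, 1_000_000, 14_999_999, 50_000_000, 200_000_000, np.nan]
)

bins = [0, 1_000_000, 15_000_000, 50_000_000, 150_000_000, np.inf]
labels = ["micro", "low", "medium", "high", "blockbuster"]

# right=False makes bins closed on the left, matching the `budget < threshold` chain.
# Non-positive budgets are masked to NaN first so they fall outside every bin.
tier = pd.cut(budget.where(budget > 0), bins=bins, labels=labels, right=False)

# Missing and masked values become an explicit "unknown" category.
tier = tier.cat.add_categories("unknown").fillna("unknown")

print(tier.tolist())
```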


## 6. Cast Diversity Index

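Gender diversity among the key people on a film may carry signal. We compute a normalized Shannon entropy
over the gender codes of the top 5 actors + director, skipping codes that are missing or 0 (treated as unknown).
The entropy is divided by log2 of the number of distinct genders observed, so the index is 0 when only one
gender appears and 1 when the observed genders are equally represented.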

In [8]:
gender_cols = ["director_gender", "actor1_gender", "actor2_gender", 
               "actor3_gender", "actor4_gender", "actor5_gender"]
existing_gender_cols = [c for c in gender_cols if c in df.columns]

def compute_gender_diversity(row):
    """Shannon entropy-based diversity index for gender representation."""
    genders = [row[c] for c in existing_gender_cols if pd.notna(row[c]) and row[c] > 0]
    if len(genders) == 0:
        return 0.0
    counts = Counter(genders)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    # Shannon entropy (normalized to [0,1])
    entropy = -sum(p * np.log2(p) for p in probs if p > 0)
    max_entropy = np.log2(len(counts)) if len(counts) > 1 else 1.0
    return entropy / max_entropy if max_entropy > 0 else 0.0

df["cast_gender_diversity"] = df.apply(compute_gender_diversity, axis=1)

# Female representation ratio
if "gender_female_count" in df.columns and "cast_size" in df.columns:
    total_people = df["cast_size"] + 1  # include director
    df["female_ratio"] = df["gender_female_count"] / total_people.replace(0, np.nan)

print(f"Cast gender diversity — mean: {df['cast_gender_diversity'].mean():.3f}, std: {df['cast_gender_diversity'].std():.3f}")
print(f"Female ratio          — mean: {df['female_ratio'].mean():.3f}")

Cast gender diversity — mean: 0.744, std: 0.330
Female ratio          — mean: 0.302
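
To make the index concrete, the sketch below recomputes it for three hand-made lists of gender codes. The function name `normalized_gender_entropy` and the example lists are hypothetical; the arithmetic is intended to mirror `compute_gender_diversity`.

```python
import math
from collections import Counter


def normalized_gender_entropy(codes):
    """Shannon entropy of the gender codes, scaled by log2(number of distinct codes)."""
    # Skip missing values and code 0, as the notebook cell does.
    known = [c for c in codes if c is not None and c > 0]
    counts = Counter(known)
    if len(counts) < 2:
        return 0.0  # zero or one gender observed: no diversity
    total = len(known)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return entropy / math.log2(len(counts))


single_gender = normalized_gender_entropy([2, 2, 2, 2, 2, 2])
even_split = normalized_gender_entropy([1, 1, 1, 2, 2, 2])
five_to_one = normalized_gender_entropy([2, 2, 2, 2, 2, 1])

print(single_gender, even_split, round(five_to_one, 3))
```

Because the denominator uses only the genders that actually appear, a 3/3 split between two genders already scores 1.0. Dividing by log2 of the number of possible categories instead would be a different design choice.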


## 7. Overview Text Features

Extract additional signals from the movie overview text: average word length, the share of long words
(more than 7 characters), sentence count, exclamation marks, and question marks (which may signal mystery/thriller).

In [9]:
def text_features(text):
    """Extract text-based features from overview."""
    if pd.isna(text) or str(text).strip() == "":
        return pd.Series({
            "avg_word_length": 0,
            "long_word_ratio": 0,
            "has_question": 0,
            "exclamation_count": 0,
            "sentence_count": 0,
        })
    
    text = str(text)
    words = text.split()
    word_lengths = [len(w.strip('.,!?;:"')) for w in words]
    
    return pd.Series({
        "avg_word_length": np.mean(word_lengths) if word_lengths else 0,
        "long_word_ratio": sum(1 for l in word_lengths if l > 7) / max(len(word_lengths), 1),
        "has_question": int("?" in text),
        "exclamation_count": text.count("!"),
        "sentence_count": len(re.split(r'[.!?]+', text.strip())),
    })

if "overview" in df.columns:
    text_feats = df["overview"].apply(text_features)
    df = pd.concat([df, text_feats], axis=1)
    
    print("Text features summary:")
    for col in text_feats.columns:
        print(f"  {col} — mean: {df[col].mean():.3f}")

Text features summary:
  avg_word_length — mean: 4.712
  long_word_ratio — mean: 0.157
  has_question — mean: 0.048
  exclamation_count — mean: 0.028
  sentence_count — mean: 3.203
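
One caveat on `sentence_count`: `re.split` returns a trailing empty string when the text ends with punctuation, so `len(...)` counts one more than the number of sentences for most overviews. A possible correction, sketched here with a hypothetical helper `count_sentences`, is to count only non-empty segments. Abbreviations such as "Dr." would still be split.

```python
import re


def count_sentences(text: str) -> int:
    """Count non-empty segments between runs of sentence-ending punctuation."""
    parts = re.split(r"[.!?]+", text.strip())
    return sum(1 for part in parts if part.strip())


example = "A thief steals secrets. He is caught."

# The approach in the cell above: the trailing empty string is counted too.
original_count = len(re.split(r"[.!?]+", example.strip()))
fixed_count = count_sentences(example)

print(original_count, fixed_count)
```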


## 8. Genre Interaction Features

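Some genre pairings may behave differently from either genre alone (for example, action-comedy versus pure
action). We create binary interaction features that are 1 only when a film carries both genres of a pair.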

In [10]:
# Define meaningful genre interaction pairs
genre_interactions = [
    ("genre_action", "genre_comedy", "genre_action_x_comedy"),
    ("genre_action", "genre_science_fiction", "genre_action_x_scifi"),
    ("genre_horror", "genre_comedy", "genre_horror_x_comedy"),
    ("genre_drama", "genre_romance", "genre_drama_x_romance"),
    ("genre_action", "genre_adventure", "genre_action_x_adventure"),
    ("genre_animation", "genre_family", "genre_animation_x_family"),
    ("genre_crime", "genre_thriller", "genre_crime_x_thriller"),
]

for g1, g2, name in genre_interactions:
    if g1 in df.columns and g2 in df.columns:
        df[name] = (df[g1] * df[g2]).astype(int)

print("Genre interaction features created:")
for _, _, name in genre_interactions:
    if name in df.columns:
        print(f"  {name}: {df[name].sum()} movies")

Genre interaction features created:
  genre_action_x_comedy: 273 movies
  genre_action_x_scifi: 346 movies
  genre_horror_x_comedy: 176 movies
  genre_drama_x_romance: 633 movies
  genre_action_x_adventure: 433 movies
  genre_animation_x_family: 273 movies
  genre_crime_x_thriller: 407 movies
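
The pairs above are hand-picked. One way to see which pairings are frequent enough to be worth a feature is a co-occurrence matrix of the genre flags, sketched below on a hypothetical `genres` frame: each off-diagonal cell counts the films that carry both genres, and the diagonal counts the films per genre.

```python
import pandas as pd

# Hypothetical 0/1 genre flags for four films.
genres = pd.DataFrame({
    "genre_action":    [1, 1, 0, 1],
    "genre_comedy":    [0, 1, 1, 0],
    "genre_adventure": [1, 0, 0, 1],
})

# X^T X: entry (i, j) sums flag_i * flag_j over films, i.e. the pair count.
cooccurrence = genres.T.dot(genres)

print(cooccurrence)
```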


## 9. Popularity-to-Votes Ratio

This ratio can indicate whether a movie's visibility (popularity) is driven by actual viewer
engagement (votes) or by marketing/hype alone.

In [11]:
if "popularity" in df.columns and "vote_count" in df.columns:
    # Avoid division by zero
    df["popularity_per_vote"] = df["popularity"] / df["vote_count"].replace(0, np.nan)
    df["log_vote_count"] = np.log1p(df["vote_count"])
    
    # High-hype flag: high popularity but low votes
    pop_median = df["popularity"].median()
    vote_median = df["vote_count"].median()
    df["is_high_hype_low_engagement"] = (
        (df["popularity"] > pop_median) & (df["vote_count"] < vote_median)
    ).astype(int)
    
    print(f"Popularity per vote — mean: {df['popularity_per_vote'].mean():.4f}")
    print(f"High hype, low engagement: {df['is_high_hype_low_engagement'].sum()} movies")

Popularity per vote — mean: 0.7387
High hype, low engagement: 1470 movies


## 10. Summary & Export

Review the new features and save the enhanced dataset.

In [12]:
# List all new features added in this notebook
base_cols = pd.read_csv("../data/data_cleaned_engineered.csv", nrows=0).columns.tolist()
new_cols = [c for c in df.columns if c not in base_cols]

print(f"\nTotal features added: {len(new_cols)}")
print("\nNew features:")
for i, col in enumerate(new_cols, 1):
    print(f"  {i:2d}. {col}")

print(f"\nFinal dataset shape: {df.shape}")


Total features added: 39

New features:
   1. release_day
   2. is_summer_release
   3. is_holiday_release
   4. is_valentines_release
   5. is_halloween_release
   6. is_dump_month
   7. monthly_competition
   8. weekly_competition
   9. director_hist_revenue
  10. director_hist_vote_average
  11. director_hist_popularity
  12. director_film_count
  13. is_debut_director
  14. is_franchise_keyword
  15. is_sequel_title
  16. is_franchise
  17. budget_tier
  18. budget_tier_blockbuster
  19. budget_tier_high
  20. budget_tier_low
  21. budget_tier_medium
  22. budget_tier_micro
  23. cast_gender_diversity
  24. female_ratio
  25. avg_word_length
  26. long_word_ratio
  27. has_question
  28. exclamation_count
  29. sentence_count
  30. genre_action_x_comedy
  31. genre_action_x_scifi
  32. genre_horror_x_comedy
  33. genre_drama_x_romance
  34. genre_action_x_adventure
  35. genre_animation_x_family
  36. genre_crime_x_thriller
  37. popularity_per_vote
  38. log_vote_count
  39. is_h

In [13]:
# Save enhanced dataset
out_path = "../data/data_advanced_features.csv"
df.to_csv(out_path, index=False)
print(f"Saved enhanced dataset to: {out_path}")
print(f"Shape: {df.shape}")

Saved enhanced dataset to: ../data/data_advanced_features.csv
Shape: (9290, 161)
