# Feature Engineering — TMDB Movie Success Prediction

**Purpose:** Transform raw TMDB movie data into a clean, model-ready feature matrix for predicting movie popularity from information available before release.

**Key Principles:**
- **No data leakage:** We exclude any variable that would not be available before a movie's release (e.g., revenue, vote_average, vote_count).
- **Target variable:** `popularity` (retained separately, never used as an input feature).
- **Grounded in EDA findings:** Feature decisions are informed by the distributions, missingness patterns, and correlations identified in the EDA phase.

**Feature Categories:**
1. Talent Features (cast & director signals)
2. Content Features (genres, keywords, language)
3. Temporal Features (release timing & seasonality)
4. Production Features (budget, runtime)

In [81]:
# ============================================================
# Imports
# ============================================================
import ast
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

## 1. Data Loading

Load the raw TMDB dataset and drop the unnamed index column carried over from the CSV export.

In [82]:
# ============================================================
# Load raw data
# ============================================================
df = pd.read_csv("../data/movies_2010_2025.csv")
df = df.drop("Unnamed: 0", axis=1)

print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

Dataset shape: (9290, 51)
Columns: ['movie_id', 'title', 'release_date', 'runtime', 'original_language', 'popularity', 'vote_average', 'vote_count', 'budget', 'revenue', 'status', 'overview', 'genres', 'keywords', 'director_id', 'director_name', 'director_gender', 'director_popularity', 'director_department', 'actor1_id', 'actor1_name', 'actor1_character', 'actor1_gender', 'actor1_popularity', 'actor1_department', 'actor2_id', 'actor2_name', 'actor2_character', 'actor2_gender', 'actor2_popularity', 'actor2_department', 'actor3_id', 'actor3_name', 'actor3_character', 'actor3_gender', 'actor3_popularity', 'actor3_department', 'actor4_id', 'actor4_name', 'actor4_character', 'actor4_gender', 'actor4_popularity', 'actor4_department', 'actor5_id', 'actor5_name', 'actor5_character', 'actor5_gender', 'actor5_popularity', 'actor5_department', 'cast_pop_mean', 'cast_pop_max']
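
An `Unnamed: 0` column typically appears when a DataFrame was written to CSV together with its index and then read back without `index_col`. The sketch below reproduces this on a small in-memory CSV (the `toy_csv` contents are made up for illustration) and shows two ways to get rid of the column; it is an illustration, not a replacement for the loading cell above.

```python
import io
import pandas as pd

# Hypothetical two-row CSV that mimics an export written with its index.
toy_csv = ",movie_id,title\n0,101,Movie A\n1,102,Movie B\n"

# Option 1: read, then drop the stray column.
# errors="ignore" keeps this from failing if the column is absent.
df_a = pd.read_csv(io.StringIO(toy_csv)).drop(columns="Unnamed: 0", errors="ignore")

# Option 2: tell pandas that the first column is the index.
df_b = pd.read_csv(io.StringIO(toy_csv), index_col=0)

print(list(df_a.columns))
print(list(df_b.columns))
```

Either option leaves only the real data columns; the second avoids hard-coding the column name.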


## 2. Helper Functions

Reusable parsing utilities inherited from the EDA phase. The `safe_parse_list` function handles the genres and keywords columns, which are stored as string representations of Python lists in the CSV.

In [83]:
# ============================================================
# Helper: safely parse stringified lists from CSV
# ============================================================
def safe_parse_list(x):
    """
    Convert stringified lists (e.g., "['Action', 'Drama']") back into
    actual Python lists. Handles NaN, empty strings, and edge cases.
    """
    if isinstance(x, list):
        return x
    if isinstance(x, float) and pd.isna(x):
        return []
    if isinstance(x, str):
        x = x.strip()
        if x == "" or x == "[]":
            return []
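        try:
            parsed = ast.literal_eval(x)
        except (ValueError, SyntaxError):
            return []
        # Guard against strings that parse to a non-list literal (e.g. a dict or a number)
        return parsed if isinstance(parsed, list) else []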
    return []
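
To see why the helper wraps `ast.literal_eval` in a `try`/`except`, the following sketch runs the same parse-with-fallback logic on three hand-written strings, the last of which is deliberately malformed. The sample strings are assumptions for illustration and are not taken from the dataset.

```python
import ast

# Two well-formed stringified lists and one with a missing closing bracket.
samples = ["['Action', 'Drama']", "[]", "['Action', 'Drama'"]

results = []
for s in samples:
    try:
        parsed = ast.literal_eval(s)
    except (ValueError, SyntaxError):
        parsed = []  # same fallback as safe_parse_list
    results.append(parsed)
    print(repr(s), "->", parsed)
```

Unlike `eval`, `ast.literal_eval` only accepts Python literals, so a malformed or unexpected cell value raises an exception instead of executing code.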

## 3. Leakage Control — Separate Target and Exclude Post-Release Variables

This is the most critical step. Since we are building a **pre-release** prediction model, we must be strict about what information would actually be available before a movie opens.

**Target (extracted separately):**
- `popularity` — this is what we are predicting

**Excluded from features (post-release or non-predictive):**
- `vote_average`, `vote_count` — audience ratings only exist after release
- `revenue` — box office results are post-release
- `movie_id`, `title` — identifiers, not predictive features
- `status` — nearly all entries are "Released" (no variance)
- `overview` — free-text synopsis; we extract `has_overview` and `overview_length` as simple signals instead of full NLP
- Name columns (`director_name`, `actor*_name`) — high cardinality; predictive power is captured by popularity scores
- Character columns (`actor*_character`) — role names are not predictive
- Department columns (`director_department`, `actor*_department`) — EDA showed near-zero variance (almost all "Directing"/"Acting")
- ID columns (`director_id`, `actor*_id`) — identifiers, not features

In [84]:
# ============================================================
# Extract target variable
# ============================================================
target = df["popularity"].copy()
print(f"Target (popularity) shape: {target.shape}")
print(f"Target stats:\n{target.describe()}")

Target (popularity) shape: (9290,)
Target stats:
count    9290.000000
mean        5.788687
std         7.689172
min         2.414700
25%         3.885000
50%         4.501300
75%         5.628275
max       378.004500
Name: popularity, dtype: float64


In [85]:
# ============================================================
# Define columns to exclude from features
# ============================================================
leakage_cols = ["popularity", "vote_average", "vote_count", "revenue"]

id_cols = ["movie_id", "title"]

non_predictive_cols = [
    "status", "overview",
    "director_id", "director_name", "director_department",
    "actor1_id", "actor1_name", "actor1_character", "actor1_department",
    "actor2_id", "actor2_name", "actor2_character", "actor2_department",
    "actor3_id", "actor3_name", "actor3_character", "actor3_department",
    "actor4_id", "actor4_name", "actor4_character", "actor4_department",
    "actor5_id", "actor5_name", "actor5_character", "actor5_department",
]

all_excluded = leakage_cols + id_cols + non_predictive_cols
print(f"Total columns excluded: {len(all_excluded)}")
print(f"Excluded: {all_excluded}")

Total columns excluded: 31
Excluded: ['popularity', 'vote_average', 'vote_count', 'revenue', 'movie_id', 'title', 'status', 'overview', 'director_id', 'director_name', 'director_department', 'actor1_id', 'actor1_name', 'actor1_character', 'actor1_department', 'actor2_id', 'actor2_name', 'actor2_character', 'actor2_department', 'actor3_id', 'actor3_name', 'actor3_character', 'actor3_department', 'actor4_id', 'actor4_name', 'actor4_character', 'actor4_department', 'actor5_id', 'actor5_name', 'actor5_character', 'actor5_department']
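
The exclusion list above is written out by hand. As a possible alternative, the per-person columns could be selected with shell-style patterns such as `actor*_id`. The sketch below uses the standard-library `fnmatch` on a toy subset of the column names; the `patterns` list is an assumption chosen for illustration.

```python
from fnmatch import fnmatch

# Toy subset of the column names printed in the loading step.
columns = ["movie_id", "title", "actor1_id", "actor1_name", "actor1_popularity",
           "actor2_character", "actor2_gender", "director_department", "runtime"]

# Hypothetical patterns for identifier, name, character, and department columns.
patterns = ["actor*_id", "actor*_name", "actor*_character",
            "*_department", "director_id", "director_name"]

excluded = [c for c in columns if any(fnmatch(c, p) for p in patterns)]
kept = [c for c in columns if c not in excluded]

print(excluded)
print(kept)
```

Patterns keep the list short if more billed actors are added later, at the cost of being less explicit than the hand-written version.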


## 4. Parse List Columns

The `genres` and `keywords` columns are stored as stringified Python lists in the CSV. We parse them into actual lists before engineering features from them.

In [86]:
# ============================================================
# Parse genres and keywords from strings to lists
# ============================================================
df["genres"] = df["genres"].apply(safe_parse_list)
df["keywords"] = df["keywords"].apply(safe_parse_list)

# Quick verification
print("Sample genres:", df["genres"].iloc[0])
print("Sample keywords:", df["keywords"].iloc[0])

Sample genres: ['Action', 'Science Fiction', 'Adventure']
Sample keywords: ['rescue', 'mission', 'dreams', 'airplane', 'paris, france', 'virtual reality', 'kidnapping', 'philosophy', 'spy', 'allegory', 'manipulation', 'car crash', 'heist', 'memory', 'architecture', 'los angeles, california', 'death', 'dream world', 'subconscious', 'dream']


## 5. Temporal Features — Release Timing & Seasonality

Movie success is heavily influenced by **when** it is released. Summer blockbuster season, holiday releases, and award-season timing are well-known industry patterns.

**Features engineered:**
- `release_month` — captures monthly seasonality (e.g., June/July for summer blockbusters, Nov/Dec for awards and holidays)
- `release_year` — captures long-term industry trends (e.g., streaming era shifts)
- `release_quarter` — a coarser seasonal signal (Q1-Q4)
- `is_summer_release` — binary flag for the peak blockbuster window (May-July)
- `is_holiday_release` — binary flag for the holiday corridor (November-December)

In [87]:
# ============================================================
# Temporal features from release_date
# ============================================================
df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")

# Month and year
df["release_month"] = df["release_date"].dt.month
df["release_year"] = df["release_date"].dt.year

# Quarter (1-4)
df["release_quarter"] = df["release_date"].dt.quarter

# Summer release: May (5), June (6), July (7)
df["is_summer_release"] = df["release_month"].isin([5, 6, 7]).astype(int)

# Holiday release: November (11), December (12)
df["is_holiday_release"] = df["release_month"].isin([11, 12]).astype(int)

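# Handle the 1 missing release_date: fill month, year, and quarter with the column median.
# Note: the two binary flags above were computed before this fill, so that row keeps 0 for both.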
for col in ["release_month", "release_year", "release_quarter"]:
    df[col] = df[col].fillna(df[col].median())

print("Temporal features created:")
print(df[["release_month", "release_year", "release_quarter",
          "is_summer_release", "is_holiday_release"]].describe())

Temporal features created:
       release_month  release_year  release_quarter  is_summer_release  \
count    9290.000000   9290.000000      9290.000000        9290.000000   
mean        6.752530   2017.456512         2.587406           0.218837   
std         3.463668      4.623364         1.136100           0.413480   
min         1.000000   2010.000000         1.000000           0.000000   
25%         4.000000   2013.000000         2.000000           0.000000   
50%         7.000000   2017.000000         3.000000           0.000000   
75%        10.000000   2021.000000         4.000000           0.000000   
max        12.000000   2025.000000         4.000000           1.000000   

       is_holiday_release  
count         9290.000000  
mean             0.164155  
std              0.370436  
min              0.000000  
25%              0.000000  
50%              0.000000  
75%              0.000000  
max              1.000000  
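
The temporal derivation can be checked on a toy frame. The sketch below differs from the cell above in one respect: it imputes the missing month first and derives the quarter and both flags from the filled value, so the row with the unparseable date gets mutually consistent features. This ordering is a suggested variant, not what the notebook ran, and the dates are made up.

```python
import pandas as pd

# Toy frame; the last date is deliberately unparseable.
toy = pd.DataFrame({"release_date": ["2010-07-16", "2014-11-07", "2019-02-01", "not a date"]})
toy["release_date"] = pd.to_datetime(toy["release_date"], errors="coerce")

month = toy["release_date"].dt.month
# Impute first (the median of 7, 11, 2 is 7), then derive everything from the filled month.
toy["release_month"] = month.fillna(month.median()).astype(int)
toy["release_quarter"] = (toy["release_month"] - 1) // 3 + 1
toy["is_summer_release"] = toy["release_month"].isin([5, 6, 7]).astype(int)
toy["is_holiday_release"] = toy["release_month"].isin([11, 12]).astype(int)

print(toy.drop(columns="release_date"))
```

With a single missing date in the real data the practical difference is negligible, but the ordering matters more if missingness grows.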


## 6. Talent Features — Cast & Director Signals

The star power of a film's cast and director is one of the strongest pre-release indicators of audience interest. The dataset already provides individual actor popularities (actor1-5) and aggregated stats (`cast_pop_mean`, `cast_pop_max`). We engineer additional features to capture richer talent signals.

**Features engineered:**
- `director_popularity` — already exists in the data, retained as-is
- `director_is_female` — binary flag (TMDB gender encoding: 1 = Female, 2 = Male)
- `star_count` — number of actors (out of top 5 billed) whose popularity exceeds the 75th percentile of all non-zero actor popularities in the dataset; captures how many recognizable names are attached
- `cast_popularity_std` — population standard deviation of the non-zero popularities among the top 5 actors; high std = one big name + unknowns, low std = balanced ensemble
- `cast_gender_ratio` — proportion of female actors in the top 5 billed cast
- `cast_pop_mean`, `cast_pop_max` — already exist, retained as-is

In [88]:
# ============================================================
# Talent features
# ============================================================

# --- Actor popularity columns ---
actor_pop_cols = [f"actor{i}_popularity" for i in range(1, 6)]
actor_gender_cols = [f"actor{i}_gender" for i in range(1, 6)]

# --- Star count ---
# Define "star" threshold as 75th percentile of all non-zero actor popularities
all_actor_pops = df[actor_pop_cols].values.flatten()
all_actor_pops_nonzero = all_actor_pops[all_actor_pops > 0]
star_threshold = np.percentile(all_actor_pops_nonzero, 75)
print(f"Star threshold (75th percentile of non-zero actor popularity): {star_threshold:.4f}")

# Count how many of the 5 actors exceed the star threshold per movie
df["star_count"] = (df[actor_pop_cols] > star_threshold).sum(axis=1)

# --- Cast popularity standard deviation ---
# Measures whether the cast is balanced or relies on a single star
df["cast_popularity_std"] = df[actor_pop_cols].apply(
    lambda row: np.std([x for x in row if x > 0]) if any(x > 0 for x in row) else 0,
    axis=1
)

# --- Cast gender ratio (proportion female) ---
# TMDB gender encoding: 1 = Female, 2 = Male, 0 = Unknown/Not set
def calc_female_ratio(row):
    known = [g for g in row if g in (1.0, 2.0)]
    if len(known) == 0:
        return 0.5  # default to balanced when unknown
    return sum(1 for g in known if g == 1.0) / len(known)

df["cast_gender_ratio"] = df[actor_gender_cols].apply(calc_female_ratio, axis=1)

# --- Director is female ---
df["director_is_female"] = (df["director_gender"] == 1.0).astype(int)

print("\nTalent features summary:")
print(df[["star_count", "cast_popularity_std", "cast_gender_ratio",
          "director_is_female", "director_popularity",
          "cast_pop_mean", "cast_pop_max"]].describe())

Star threshold (75th percentile of non-zero actor popularity): 2.9868

Talent features summary:
        star_count  cast_popularity_std  cast_gender_ratio  \
count  9290.000000          9290.000000        9290.000000   
mean      1.147578             1.260020           0.412983   
std       1.364494             1.767775           0.259384   
min       0.000000             0.000000           0.000000   
25%       0.000000             0.576811           0.200000   
50%       1.000000             1.017833           0.400000   
75%       2.000000             1.516547           0.600000   
max       5.000000            84.167164           1.000000   

       director_is_female  director_popularity  cast_pop_mean  cast_pop_max  
count         9290.000000          9290.000000    9290.000000   9290.000000  
mean             0.116685             1.209358       2.023336      4.085244  
std              0.321061             1.361564       1.876921      4.903623  
min              0.000000        
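
The row-wise `apply` used for `cast_popularity_std` can be slow on larger frames. A vectorized sketch of the same two quantities follows, run on a toy table with three films and three billed actors. The popularity values and the fixed `star_threshold` are assumptions for illustration; the notebook derives its threshold from a percentile.

```python
import numpy as np
import pandas as pd

# Toy popularity table (values are made up); NaN stands for a missing actor.
pops = pd.DataFrame({
    "actor1_popularity": [10.0, 2.0, 0.0],
    "actor2_popularity": [2.0, 2.0, 0.0],
    "actor3_popularity": [0.0, 2.0, np.nan],
})

star_threshold = 3.0  # hypothetical cutoff for this toy example
star_count = (pops > star_threshold).sum(axis=1)

# Mask non-positive entries, then take the population std (ddof=0, matching np.std).
positive = pops.where(pops > 0)
cast_std = positive.std(axis=1, ddof=0).fillna(0.0)

print(star_count.tolist())
print(cast_std.tolist())
```

Rows with no positive popularity produce NaN from `std`, which `fillna(0.0)` maps to the same default of 0 that the lambda returns.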

## 7. Content Features — Genre Encoding

Genres are a core signal for audience interest. Since a movie can belong to multiple genres, we use **multi-hot encoding** — each genre becomes a binary column (1 if the movie belongs to that genre, 0 otherwise).

We also create `num_genres` to capture the breadth of a movie's genre classification. Movies spanning many genres may have broader audience appeal.

In [89]:
# ============================================================
# Multi-hot encode genres
# ============================================================

# Get all unique genres across the dataset
all_genres = sorted(set(g for genres_list in df["genres"] for g in genres_list))
print(f"Total unique genres found: {len(all_genres)}")
print(f"Genres: {all_genres}")

# Create binary columns for each genre
for genre in all_genres:
    df[f"genre_{genre}"] = df["genres"].apply(lambda x, g=genre: 1 if g in x else 0)

# Number of genres per movie
df["num_genres"] = df["genres"].apply(len)

print(f"\nGenre feature columns created: {len(all_genres)} binary + 1 count")
print(f"num_genres distribution:\n{df['num_genres'].describe()}")

Total unique genres found: 19
Genres: ['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Mystery', 'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western']

Genre feature columns created: 19 binary + 1 count
num_genres distribution:
count    9290.000000
mean        1.988482
std         1.108486
min         0.000000
25%         1.000000
50%         2.000000
75%         3.000000
max         8.000000
Name: num_genres, dtype: float64
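
The loop above creates one column per genre with a separate `apply` call. A more compact sketch of the same multi-hot encoding uses the pandas string accessor, which also works on list-valued cells. It assumes the separator character (`|` here) never occurs inside a genre name; the toy rows are made up.

```python
import pandas as pd

toy = pd.DataFrame({"genres": [["Action", "Drama"], ["Drama"], []]})

# Join each list with a separator, then let pandas split it into indicator columns.
genre_dummies = (
    toy["genres"].str.join("|").str.get_dummies(sep="|").add_prefix("genre_")
)
toy = toy.join(genre_dummies)
toy["num_genres"] = toy["genres"].str.len()

print(toy)
```

A movie with an empty genre list ends up with zeros in every indicator column and `num_genres` of 0, which matches the behavior of the loop.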


## 8. Content Features — Keyword Count

Rather than encoding individual keywords (which would create thousands of extremely sparse columns), we use the **keyword count** as a proxy for how richly tagged a movie is. A larger keyword count may indicate more developed marketing and metadata, which could in turn correlate with production scale and audience reach.

In [90]:
# ============================================================
# Keyword count
# ============================================================
df["keyword_count"] = df["keywords"].apply(len)

print("keyword_count distribution:")
print(df["keyword_count"].describe())

keyword_count distribution:
count    9290.000000
mean        5.189020
std         7.172168
min         0.000000
25%         0.000000
50%         2.000000
75%         8.000000
max       101.000000
Name: keyword_count, dtype: float64
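
The sparsity argument for not encoding individual keywords can be illustrated with a simple vocabulary count. The sketch below uses made-up keyword lists and the standard-library `Counter`; on the real column the same logic would show how many keywords occur only once.

```python
from collections import Counter

# Toy keyword lists (made up); the real column holds one such list per movie.
keyword_lists = [["heist", "dream", "spy"], ["spy", "car crash"], [], ["dream", "memory"]]

# Per-movie count, as used for the keyword_count feature.
keyword_count = [len(kws) for kws in keyword_lists]

# Vocabulary-level view: how many distinct keywords, and how many occur only once.
vocab = Counter(kw for kws in keyword_lists for kw in kws)
singletons = [kw for kw, n in vocab.items() if n == 1]

print(keyword_count)
print(len(vocab), "distinct keywords,", len(singletons), "appear only once")
```

A keyword that appears in a single movie would yield a column with one non-zero entry, which is why the count is used as a summary instead.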


## 9. Content Features — Original Language Encoding

The EDA showed that English (`en`) dominates the dataset, with a long tail of other languages. We use a **top-K encoding** strategy:

- The top 5 most frequent languages each get their own binary column.
- All remaining languages are grouped into `lang_other`.
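- A simple `is_english` flag captures the most important language split for global popularity prediction. Because English is the most frequent language, this flag is identical to `lang_en`; one of the two can be dropped before fitting models that are sensitive to collinear inputs.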

In [91]:
# ============================================================
# Language encoding (top-K + is_english flag)
# ============================================================

# Identify top 5 languages by frequency
top_languages = df["original_language"].value_counts().head(5).index.tolist()
print(f"Top 5 languages: {top_languages}")

# Create binary column for each top language
for lang in top_languages:
    df[f"lang_{lang}"] = (df["original_language"] == lang).astype(int)

# Catch-all for less common languages
df["lang_other"] = (~df["original_language"].isin(top_languages)).astype(int)

# Simple is_english flag
df["is_english"] = (df["original_language"] == "en").astype(int)

print(f"\nis_english distribution:\n{df['is_english'].value_counts()}")

Top 5 languages: ['en', 'fr', 'es', 'ja', 'de']

is_english distribution:
is_english
1    5748
0    3542
Name: count, dtype: int64
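
The top-K strategy can also be expressed by first collapsing rare languages into a single `other` category and then one-hot encoding the reduced column. The sketch below uses K = 2 on a made-up series (the notebook uses K = 5) and is one possible formulation, not the code that produced the output above.

```python
import pandas as pd

lang = pd.Series(["en", "en", "fr", "en", "ja", "fr", "ko"], name="original_language")
top_k = lang.value_counts().head(2).index.tolist()  # K = 2 for this toy example

# Collapse rare languages into "other", then one-hot encode the reduced categories.
reduced = lang.where(lang.isin(top_k), "other")
lang_dummies = pd.get_dummies(reduced, prefix="lang", dtype=int)

print(top_k)
print(lang_dummies)
```

Each row has exactly one active indicator, so the columns sum to 1; this is worth keeping in mind for linear models with an intercept.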


## 10. Production Features — Budget & Runtime

**Budget:** A strong pre-release signal — higher-budget films draw more audience attention via marketing and production quality. The EDA revealed many movies have budget = 0, meaning "not reported" rather than truly $0. We handle this with:
1. `has_budget` — binary flag distinguishing known vs unknown budgets
2. `log_budget` — log1p transformation normalizing the extreme right skew observed in the EDA
3. `log_budget` is 0 for unreported budgets, since log1p(0) = 0 (the `has_budget` flag lets the model distinguish this from a genuinely small budget)

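**Runtime:** Retained as-is. Per the summary statistics below, most films fall between roughly 70 and 110 minutes (median 92), but the column has a minimum of 0, which likely means "not reported", and a maximum of 950 minutes. These values are not treated in this pipeline.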

In [92]:
# ============================================================
# Production features: budget and runtime
# ============================================================

# --- Budget ---
df["has_budget"] = (df["budget"] > 0).astype(int)
df["log_budget"] = np.log1p(df["budget"])

print("Budget features:")
print(f"  has_budget distribution: {df['has_budget'].value_counts().to_dict()}")
print(f"  log_budget stats (where budget > 0):")
print(df.loc[df["budget"] > 0, "log_budget"].describe())

# --- Runtime ---
print(f"\nRuntime stats:\n{df['runtime'].describe()}")

Budget features:
  has_budget distribution: {0: 6527, 1: 2763}
  log_budget stats (where budget > 0):
count    2763.000000
mean       15.833550
std         3.021456
min         0.693147
25%        14.978662
50%        16.705882
75%        17.727534
max        20.009712
Name: log_budget, dtype: float64

Runtime stats:
count    9290.000000
mean       81.390635
std        44.067500
min         0.000000
25%        67.000000
50%        92.000000
75%       107.000000
max       950.000000
Name: runtime, dtype: float64
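
A short worked example of the `log1p` transform, using only the standard library. The three budgets are chosen for illustration: 0 (unreported), 1 (the smallest positive value), and a large budget. Note that the minimum `log_budget` of 0.693147 in the output above equals log1p(1), which suggests that at least one row has a reported budget of 1 and may be a placeholder.

```python
import math

budgets = [0, 1, 160_000_000]  # unreported, smallest positive value, a large budget

rows = []
for b in budgets:
    has_budget = int(b > 0)
    log_budget = math.log1p(b)          # log(1 + b), defined at b = 0
    recovered = math.expm1(log_budget)  # inverse transform back to the raw amount
    rows.append((b, has_budget, log_budget, round(recovered)))

for row in rows:
    print(row)
```

Because `expm1` inverts `log1p`, predictions or feature values on the log scale can be mapped back to the original units when needed.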


## 11. Text Signal — Overview Availability

The movie overview (synopsis) is free text that would require NLP to fully exploit. For this pipeline, we extract a simple but meaningful signal: **whether an overview exists and how long it is**. Movies with longer, more detailed overviews tend to have more developed marketing and distribution — a lightweight proxy for production effort.

In [93]:
# ============================================================
# Overview-based features
# ============================================================

# Whether the movie has an overview at all
df["has_overview"] = df["overview"].notna().astype(int)

# Length of overview (character count)
df["overview_length"] = df["overview"].fillna("").apply(len)

print("Overview features:")
print(f"  has_overview: {df['has_overview'].value_counts().to_dict()}")
print(f"  overview_length stats:\n{df['overview_length'].describe()}")

Overview features:
  has_overview: {1: 9089, 0: 201}
  overview_length stats:
count    9290.000000
mean      259.595048
std       164.951892
min         0.000000
25%       146.000000
50%       220.000000
75%       339.000000
max       999.000000
Name: overview_length, dtype: float64
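
`notna()` treats an empty or whitespace-only string as a present overview. If that matters, a stricter variant could strip whitespace before testing. The sketch below compares both definitions on a made-up series; `has_overview_strict` is a hypothetical name and not a column in the notebook.

```python
import pandas as pd

overview = pd.Series(["A thief enters dreams.", "", "   ", None])

# Definition used in the cell above: anything that is not NaN counts.
has_overview_notna = overview.notna().astype(int)

# Hypothetical stricter variant: require at least one non-whitespace character.
cleaned = overview.fillna("").str.strip()
has_overview_strict = cleaned.ne("").astype(int)
overview_length = cleaned.str.len()

print(has_overview_notna.tolist())
print(has_overview_strict.tolist())
print(overview_length.tolist())
```

In the output above, the minimum `overview_length` of 0 is consistent with the 201 missing overviews, so the two definitions may well agree on this dataset.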


## 12. Assemble Final Feature Matrix

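We now drop all excluded and intermediate columns, keeping the engineered features together with the raw numeric columns that are retained as-is (`runtime` and the popularity scores). The result is a clean, fully numeric matrix ready for modeling.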

**Final feature groups:**
- **Temporal:** release_month, release_year, release_quarter, is_summer_release, is_holiday_release
- **Talent:** actor1-5 popularity, cast_pop_mean, cast_pop_max, cast_popularity_std, star_count, cast_gender_ratio, director_popularity, director_is_female
- **Content:** genre_* (multi-hot), num_genres, keyword_count, lang_* (top-K), is_english
- **Production:** runtime, has_budget, log_budget
- **Text signal:** has_overview, overview_length

In [94]:
# ============================================================
# Assemble the final feature matrix
# ============================================================

# Columns to drop: all excluded + intermediate/raw columns
cols_to_drop = (
    all_excluded
    + ["release_date", "original_language", "genres", "keywords",
       "budget",                                      # replaced by log_budget + has_budget
       "director_gender",                             # replaced by director_is_female
       "actor1_gender", "actor2_gender", "actor3_gender",
       "actor4_gender", "actor5_gender"]              # replaced by cast_gender_ratio
    # "overview" is already in all_excluded; has_overview + overview_length replace it
)

# Only drop columns that actually exist in the dataframe
cols_to_drop_existing = [c for c in cols_to_drop if c in df.columns]

# Build feature matrix
features = df.drop(columns=cols_to_drop_existing)

print(f"Final feature matrix shape: {features.shape}")
print(f"\nFeature columns ({len(features.columns)}):")
print(list(features.columns))

Final feature matrix shape: (9290, 50)

Feature columns (50):
['runtime', 'director_popularity', 'actor1_popularity', 'actor2_popularity', 'actor3_popularity', 'actor4_popularity', 'actor5_popularity', 'cast_pop_mean', 'cast_pop_max', 'release_month', 'release_year', 'release_quarter', 'is_summer_release', 'is_holiday_release', 'star_count', 'cast_popularity_std', 'cast_gender_ratio', 'director_is_female', 'genre_Action', 'genre_Adventure', 'genre_Animation', 'genre_Comedy', 'genre_Crime', 'genre_Documentary', 'genre_Drama', 'genre_Family', 'genre_Fantasy', 'genre_History', 'genre_Horror', 'genre_Music', 'genre_Mystery', 'genre_Romance', 'genre_Science Fiction', 'genre_TV Movie', 'genre_Thriller', 'genre_War', 'genre_Western', 'num_genres', 'keyword_count', 'lang_en', 'lang_fr', 'lang_es', 'lang_ja', 'lang_de', 'lang_other', 'is_english', 'has_budget', 'log_budget', 'has_overview', 'overview_length']


In [95]:
# ============================================================
# Verify: no leakage columns remain in features
# ============================================================
leakage_check = [c for c in leakage_cols if c in features.columns]
assert len(leakage_check) == 0, f"LEAKAGE DETECTED: {leakage_check}"
print("Leakage check PASSED - no target or post-release variables in feature matrix.")

# Verify: no object (string) columns remain
object_cols = features.select_dtypes(include=["object"]).columns.tolist()
assert len(object_cols) == 0, f"Non-numeric columns remain: {object_cols}"
print("Data type check PASSED - all features are numeric.")

Leakage check PASSED - no target or post-release variables in feature matrix.
Data type check PASSED - all features are numeric.
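
Beyond the leakage and dtype assertions, one further check that may be useful is a scan for exactly duplicated feature columns, since `lang_en` and `is_english` are built from the same condition. The sketch below runs the comparison on a toy frame; the values are made up, and the pairwise loop assumes numeric columns without missing values.

```python
from itertools import combinations
import pandas as pd

# Toy feature frame; lang_en and is_english carry identical values.
toy = pd.DataFrame({
    "lang_en":    [1, 0, 1, 1],
    "is_english": [1, 0, 1, 1],
    "has_budget": [1, 0, 0, 1],
})

# Compare every pair of columns element-wise and keep the pairs that match everywhere.
duplicate_pairs = [
    (a, b) for a, b in combinations(toy.columns, 2)
    if (toy[a] == toy[b]).all()
]

print(duplicate_pairs)
```

The pairwise loop is quadratic in the number of columns, which is fine for a matrix of this width.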


## 13. Feature Matrix Summary

Final look at the engineered features: shape, data types, missing values, and basic statistics.

In [96]:
# ============================================================
# Final summary
# ============================================================
print("=" * 60)
print("FEATURE ENGINEERING COMPLETE")
print("=" * 60)
print(f"\nRows:     {features.shape[0]}")
print(f"Features: {features.shape[1]}")
print(f"Target:   popularity ({target.shape[0]} values)")

print(f"\n--- Missing values ---")
missing = features.isnull().sum()
if missing.sum() == 0:
    print("None - all features fully populated.")
else:
    print(missing[missing > 0])

print(f"\n--- Data types ---")
print(features.dtypes.value_counts())

print(f"\n--- Feature statistics ---")
features.describe()

FEATURE ENGINEERING COMPLETE

Rows:     9290
Features: 50
Target:   popularity (9290 values)

--- Missing values ---
None - all features fully populated.

--- Data types ---
int64      36
float64    14
Name: count, dtype: int64

--- Feature statistics ---


,runtime,director_popularity,actor1_popularity,actor2_popularity,actor3_popularity,actor4_popularity,actor5_popularity,cast_pop_mean,cast_pop_max,release_month,...,lang_fr,lang_es,lang_ja,lang_de,lang_other,is_english,has_budget,log_budget,has_overview,overview_length
count,9290.0,9290.0,9290.0,9290.0,9290.0,9290.0,9290.0,9290.0,9290.0,9290.0,...,9290.0,9290.0,9290.0,9290.0,9290.0,9290.0,9290.0,9290.0,9290.0,9290.0
mean,81.390635,1.209358,2.674343,2.167565,1.829108,1.60309,1.417261,2.023336,4.085244,6.75253,...,0.051776,0.046286,0.039182,0.031324,0.212702,0.61873,0.297417,4.70916,0.978364,259.595048
std,44.0675,1.361564,3.484476,3.548816,2.4009,2.389784,1.819295,1.876921,4.903623,3.463668,...,0.221587,0.210116,0.194038,0.174201,0.409241,0.485725,0.457146,7.423386,0.1455,164.951892
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,67.0,0.2486,0.5197,0.400125,0.3115,0.232275,0.15985,0.83688,1.985875,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,146.0
50%,92.0,0.65645,1.777,1.3837,1.119,0.90705,0.7329,1.55343,3.2549,7.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,220.0
75%,107.0,1.74545,3.612125,3.078675,2.73615,2.48615,2.214825,2.775495,4.9931,10.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,13.815512,1.0,339.0
max,950.0,20.8778,88.154,224.15,85.498,145.219,24.694,56.784,224.15,12.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,20.009712,1.0,999.0


In [97]:
# ============================================================
# Preview the final feature matrix
# ============================================================
features.head(10)

,runtime,director_popularity,actor1_popularity,actor2_popularity,actor3_popularity,actor4_popularity,actor5_popularity,cast_pop_mean,cast_pop_max,release_month,...,lang_fr,lang_es,lang_ja,lang_de,lang_other,is_english,has_budget,log_budget,has_overview,overview_length
0,148,8.2813,12.2774,5.6445,4.5824,9.6156,4.8289,7.38976,12.2774,7.0,...,0,0,0,0,0,1,1,18.890684,1,280
1,100,5.1258,2.1981,1.978,1.8252,4.9563,2.1777,2.62706,4.9563,11.0,...,0,0,0,0,0,1,1,19.376192,1,286
2,124,4.0024,9.2587,5.5278,3.4058,17.8153,4.4974,8.101,17.8153,4.0,...,0,0,0,0,0,1,1,19.113828,1,372
3,95,2.0111,5.8392,2.9334,4.8379,0.7743,2.4183,3.36062,5.8392,7.0,...,0,0,0,0,0,1,1,18.049617,1,287
4,146,1.4636,7.0936,9.4552,4.3204,3.1401,4.5939,5.72064,9.4552,11.0,...,0,0,0,0,0,1,1,19.336971,1,298
5,138,6.5116,12.2774,6.3304,5.2239,2.7838,3.4178,6.00666,12.2774,2.0,...,0,0,0,0,0,1,1,18.197537,1,219
6,104,4.0504,1.4482,1.032,1.935,1.0299,0.4301,1.17504,1.935,1.0,...,0,0,0,0,1,0,0,0.0,1,318
7,108,5.1379,2.3812,13.44,13.9576,4.5939,2.892,7.45294,13.9576,3.0,...,0,0,0,0,0,1,1,19.113828,1,139
8,102,1.4612,8.1709,6.0004,3.5183,1.8064,3.2926,4.55772,8.1709,6.0,...,0,0,0,0,0,1,1,18.197537,1,138
9,107,0.9931,4.8916,3.2321,2.3868,6.5325,1.3633,3.68126,6.5325,7.0,...,0,0,0,0,0,1,1,17.50439,1,118


In [98]:
# Export
out_path_full = "../data/data_cleaned_engineered.csv"

features.to_csv(out_path_full, index=False)


print("Saved:", out_path_full)


Saved: ../data/data_cleaned_engineered.csv


In [99]:
features.shape


(9290, 50)

In [100]:
features.columns

Index(['runtime', 'director_popularity', 'actor1_popularity',
       'actor2_popularity', 'actor3_popularity', 'actor4_popularity',
       'actor5_popularity', 'cast_pop_mean', 'cast_pop_max', 'release_month',
       'release_year', 'release_quarter', 'is_summer_release',
       'is_holiday_release', 'star_count', 'cast_popularity_std',
       'cast_gender_ratio', 'director_is_female', 'genre_Action',
       'genre_Adventure', 'genre_Animation', 'genre_Comedy', 'genre_Crime',
       'genre_Documentary', 'genre_Drama', 'genre_Family', 'genre_Fantasy',
       'genre_History', 'genre_Horror', 'genre_Music', 'genre_Mystery',
       'genre_Romance', 'genre_Science Fiction', 'genre_TV Movie',
       'genre_Thriller', 'genre_War', 'genre_Western', 'num_genres',
       'keyword_count', 'lang_en', 'lang_fr', 'lang_es', 'lang_ja', 'lang_de',
       'lang_other', 'is_english', 'has_budget', 'log_budget', 'has_overview',
       'overview_length'],
      dtype='object')

In [None]:
# Export modeling dataset with target
out_path_model = "../data/data_model_with_target.csv"

model_df = features.copy()
model_df["popularity"] = target.values
model_df.to_csv(out_path_model, index=False)

print("Saved:", out_path_model)
print("model_df shape:", model_df.shape)
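
A downstream modeling notebook would typically read this file back and separate the inputs from the target. The sketch below shows that split on an in-memory stand-in for the exported CSV; the two rows and the reduced column set are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for data_model_with_target.csv (two made-up rows, three columns).
csv_text = "runtime,log_budget,popularity\n148,18.89,30.1\n100,0.0,4.2\n"
model_df = pd.read_csv(io.StringIO(csv_text))

# Separate the target from the inputs so it is never passed to the model as a feature.
y = model_df["popularity"]
X = model_df.drop(columns="popularity")

print(X.shape, y.shape)
```

Keeping the split explicit at load time preserves the leakage guarantee established earlier in this notebook.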
