## Feature Engineering and Modeling

### Feature Engineering

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, root_mean_squared_error
import statsmodels.api as sm


In [2]:
df = pd.read_csv("data/movie_processed.csv", index_col=0)
df = df.dropna(subset=["critic_rating"]) # drop rows with null target value
df.head()

Unnamed: 0,movie_id,movie_title,movie_info,rating,genre,directors,in_theaters_date,on_streaming_date,runtime_in_minutes,critic_rating,critic_count,audience_rating,audience_count
0,1,Percy Jackson & the Olympians: The Lightning T...,A teenager discovers he's the descendant of a ...,PG,"Action & Adventure, Comedy, Drama, Science Fic...",Chris Columbus,2010-02-12,2010-06-29,83.0,49,144,53.0,254287.0
1,2,Please Give,Kate has a lot on her mind. There's the ethics...,R,Comedy,Nicole Holofcener,2010-04-30,2010-10-19,90.0,86,140,64.0,11567.0
2,3,10,Blake Edwards' 10 stars Dudley Moore as George...,R,"Comedy, Romance",Blake Edwards,1979-10-05,1997-08-27,118.0,68,22,53.0,14670.0
3,4,12 Angry Men (Twelve Angry Men),"A Puerto Rican youth is on trial for murder, a...",,"Classics, Drama",Sidney Lumet,1957-04-13,2001-03-06,95.0,100,51,97.0,105000.0
4,5,"20,000 Leagues Under The Sea","This 1954 Disney version of Jules Verne's 20,0...",G,"Action & Adventure, Drama, Kids & Family",Richard Fleischer,1954-01-01,2003-05-20,127.0,89,27,74.0,68860.0


In [3]:
# Train/Test Split
df["in_theaters_date"] = pd.to_datetime(df["in_theaters_date"], errors="coerce") # ensure datetime format

# split into train (released before 2010) and test (released in 2010 or later)
train_df = df[df["in_theaters_date"].dt.year < 2010].copy().reset_index(drop=True)
test_df = df[df["in_theaters_date"].dt.year >= 2010].copy().reset_index(drop=True)

X_train = train_df.drop(columns="critic_rating")
y_train = train_df["critic_rating"]
X_test = test_df.drop(columns="critic_rating")
y_test = test_df["critic_rating"]

print("Training set size:", len(train_df))
print("Test set size:", len(test_df))

Training set size: 9764
Test set size: 6059


In [4]:
# Drop critic_rating (target) related columns to avoid leakage
leak_cols = ["critic_count", "audience_rating", "audience_count", "on_streaming_date"]
X_train = X_train.drop(columns=leak_cols)
X_test = X_test.drop(columns=leak_cols)

print("X_train and X_test columns:", X_train.columns)

X_train and X_test columns: Index(['movie_id', 'movie_title', 'movie_info', 'rating', 'genre', 'directors',
       'in_theaters_date', 'runtime_in_minutes'],
      dtype='object')


In [5]:
# Create some new features on top of X_train

# movie_title, runtime_in_minutes
X_train_new = X_train[["movie_title", "runtime_in_minutes"]].copy()

# kid_friendly
X_train_new["kid_friendly"] = X_train["rating"].isin(["G", "PG"]).astype(int)

# genre dummy variables
df = X_train.copy()
df["genre_list"] = df["genre"].str.split(",").apply(
    lambda lst: [g.strip() for g in lst] if isinstance(lst, list) else []
) # split into lists of genres
df_exploded = df.explode("genre_list")
genre_dummies = pd.get_dummies(df_exploded["genre_list"], prefix="genre", dtype=int)
genre_dummies = genre_dummies.groupby(df_exploded.index).max() # combine dummies back to one row per movie
X_train_new = pd.concat([X_train_new, genre_dummies], axis=1)


In [6]:
# Create more new features on top of X_train

# release_season
def season_from_month(m):
    if pd.isna(m):
        return np.nan
    m = int(m)
    if m in [12, 1, 2]: return "winter"
    if m in [3, 4, 5]:  return "spring"
    if m in [6, 7, 8]:  return "summer"
    return "fall"

release_season = X_train["in_theaters_date"].dt.month.apply(season_from_month)
season_dummies = pd.get_dummies(release_season, prefix="release_season", dtype=int)
X_train_new = pd.concat([X_train_new, season_dummies], axis=1)

# director_movies_count
director_counts = X_train["directors"].value_counts()
X_train_new["director_movies_count"] = X_train["directors"].map(director_counts).fillna(1)

# movie_genres_count
X_train_new["movie_genres_count"] = (
    X_train["genre"]
    .fillna("")
    .apply(lambda x: len([g.strip() for g in x.split(",") if g.strip() != ""]))
)

X_train_new

Unnamed: 0,movie_title,runtime_in_minutes,kid_friendly,genre_Action & Adventure,genre_Animation,genre_Anime & Manga,genre_Art House & International,genre_Classics,genre_Comedy,genre_Cult Movies,...,genre_Special Interest,genre_Sports & Fitness,genre_Television,genre_Western,release_season_fall,release_season_spring,release_season_summer,release_season_winter,director_movies_count,movie_genres_count
0,10,118.0,0,0,0,0,0,0,1,0,...,0,0,0,0,1,0,0,0,27.0,2
1,12 Angry Men (Twelve Angry Men),95.0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,29.0,2
2,"20,000 Leagues Under The Sea",127.0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,16.0,3
3,"10,000 B.C.",109.0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,8.0,3
4,The 39 Steps,87.0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,36.0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9759,Zoolander,105.0,0,0,0,0,0,0,1,0,...,1,0,0,0,1,0,0,0,4.0,2
9760,Zoom,88.0,1,1,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,4.0,3
9761,Zorba the Greek,142.0,0,1,0,0,1,1,0,0,...,0,0,0,0,0,0,0,1,1.0,4
9762,Zulu,139.0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,2.0,2


### Modeling

In [7]:
# Create the same new features at test_df
# TODO: align genre and release_season columns between train_df and test_df

# movie_title, runtime_in_minutes
X_test_new = X_test[["movie_title", "runtime_in_minutes"]].copy()

# kid_friendly
X_test_new["kid_friendly"] = X_test["rating"].isin(["G", "PG"]).astype(int)

# genre dummy variables
df = X_test.copy()
df["genre_list"] = df["genre"].str.split(",").apply(
    lambda lst: [g.strip() for g in lst] if isinstance(lst, list) else []
) # split into lists of genres
df_exploded = df.explode("genre_list")
genre_dummies = pd.get_dummies(df_exploded["genre_list"], prefix="genre", dtype=int)
genre_dummies = genre_dummies.groupby(df_exploded.index).max() # combine dummies back to one row per movie
X_test_new = pd.concat([X_test_new, genre_dummies], axis=1)

# release_season
release_season = X_test["in_theaters_date"].dt.month.apply(season_from_month)
season_dummies = pd.get_dummies(release_season, prefix="release_season", dtype=int)
X_test_new = pd.concat([X_test_new, season_dummies], axis=1)

# director_movies_count
director_counts = X_test["directors"].value_counts()
X_test_new["director_movies_count"] = X_test["directors"].map(director_counts).fillna(1)

# movie_genres_count
X_test_new["movie_genres_count"] = (
    X_test["genre"]
    .fillna("")
    .apply(lambda x: len([g.strip() for g in x.split(",") if g.strip() != ""]))
)
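
The TODO at the top of this cell (aligning the genre and release_season dummy columns between train and test) could be handled with a column reindex. The following is a minimal sketch on toy frames, not the notebook's actual data: `train_feats`, `test_feats`, and the `genre_Faith` column are hypothetical names chosen for illustration.

```python
import pandas as pd

# Toy stand-ins for the engineered frames; names and values are illustrative only.
train_feats = pd.DataFrame({
    "runtime_in_minutes": [118.0, 95.0],
    "genre_Comedy": [1, 0],
    "genre_Western": [0, 1],
})
test_feats = pd.DataFrame({
    "runtime_in_minutes": [83.0, 90.0],
    "genre_Comedy": [1, 1],
    "genre_Faith": [0, 1],  # hypothetical level that never appears in train
})

# Keep exactly the training columns, in training order:
# columns missing from test are created and filled with 0,
# columns that exist only in test are dropped.
test_aligned = test_feats.reindex(columns=train_feats.columns, fill_value=0)

print(test_aligned)
```

Applied to the real frames, the same call would presumably go right after the test features are built, so that every later column selection sees the same columns in both sets.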

In [8]:
# Drop non-numeric columns (e.g. movie_title)
X_train_new = X_train_new.select_dtypes(include=["number"])
X_test_new = X_test_new.select_dtypes(include=["number"])

# Impute missing values by median (e.g. runtime_in_minutes)
print("X_train columns with missing values: ", X_train_new.columns[X_train_new.isnull().any()])
print("X_test columns with missing values: ", X_test_new.columns[X_test_new.isnull().any()])

imputer = SimpleImputer(strategy="median")
X_train_new = pd.DataFrame(imputer.fit_transform(X_train_new), columns=X_train_new.columns)
X_test_new = pd.DataFrame(imputer.transform(X_test_new), columns=X_test_new.columns)
print("Imputed")

X_train columns with missing values:  Index(['runtime_in_minutes'], dtype='object')
X_test columns with missing values:  Index(['runtime_in_minutes'], dtype='object')
Imputed


I kept only the numeric columns so that the models in the next part can be fit directly, and imputed missing values with the median to preserve as much data as possible. Note that the new features were created with the following rules so that no null values remain:

- runtime_in_minutes is set to the median runtime_in_minutes of the training set if it is null in the original dataset (the imputer is fit on the training data and then applied to both train and test).

- kid_friendly = 0 if rating is null in the original dataset.

- director_movies_count = 1 if director is null in the original dataset.

- movie_genres_count = 0 if genre is null in the original dataset.
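
The null-handling rules above can be checked on a few made-up rows. This is only a sketch: the toy values are assumptions, and it covers just the runtime_in_minutes and kid_friendly rules.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up rows for illustration; not taken from the movie dataset.
train = pd.DataFrame({"runtime_in_minutes": [90.0, 100.0, np.nan, 120.0]})
test = pd.DataFrame({"runtime_in_minutes": [np.nan, 200.0]})

imputer = SimpleImputer(strategy="median")
train_imp = imputer.fit_transform(train)  # median is learned from the training rows only
test_imp = imputer.transform(test)        # the same training median fills the test gap

print(imputer.statistics_)  # the learned training median
print(test_imp.ravel())

# A null rating is not in ["G", "PG"], so kid_friendly becomes 0 for it.
rating = pd.Series(["G", None, "R"])
kid_friendly = rating.isin(["G", "PG"]).astype(int)
print(kid_friendly.tolist())
```

Fitting the imputer on the training frame and only calling `transform` on the test frame keeps test-set statistics out of the features.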

In [9]:
# Fit 3 linear regression models:
# Model 1: Use only runtime_in_minutes
# Model 2: Use runtime_in_minutes and kid_friendly
# Model 3: Use runtime_in_minutes, kid_friendly and the dummy columns

# Define train and test set
# Model 1: only runtime_in_minutes
X_train_m1 = X_train_new[["runtime_in_minutes"]]
X_test_m1  = X_test_new[["runtime_in_minutes"]]

# Model 2: runtime_in_minutes + kid_friendly
X_train_m2 = X_train_new[["runtime_in_minutes", "kid_friendly"]]
X_test_m2  = X_test_new[["runtime_in_minutes", "kid_friendly"]]

# Model 3: runtime_in_minutes + kid_friendly + all genre dummies
genre_cols = [c for c in X_train_new.columns if c.startswith("genre_")]
X_train_m3 = X_train_new[["runtime_in_minutes", "kid_friendly"] + genre_cols]
X_test_m3  = X_test_new[["runtime_in_minutes", "kid_friendly"] + genre_cols]

# Fit
linreg1 = LinearRegression()
linreg1.fit(X_train_m1, y_train)
y_pred_m1 = linreg1.predict(X_test_m1)

linreg2 = LinearRegression()
linreg2.fit(X_train_m2, y_train)
y_pred_m2 = linreg2.predict(X_test_m2)

linreg3 = LinearRegression()
linreg3.fit(X_train_m3, y_train)
y_pred_m3 = linreg3.predict(X_test_m3)


In [10]:
# Evaluate
def evaluate_regression(y_true, y_pred):
    """Print and return R2, MAE, and RMSE for a regression model."""
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = root_mean_squared_error(y_true, y_pred)

    print(f"R2   : {r2:.4f}")
    print(f"MAE  : {mae:.4f}")
    print(f"RMSE : {rmse:.4f}")
    print("-" * 30)

    return {"R2": r2, "MAE": mae, "RMSE": rmse}

print("Model 1: runtime_in_minutes")
metrics_m1 = evaluate_regression(y_test, y_pred_m1)

print("Model 2: runtime_in_minutes + kid_friendly")
metrics_m2 = evaluate_regression(y_test, y_pred_m2)

print("Model 3: runtime_in_minutes + kid_friendly + genre dummies")
metrics_m3 = evaluate_regression(y_test, y_pred_m3)

Model 1: runtime_in_minutes
R2   : 0.0016
MAE  : 24.3635
RMSE : 28.3115
------------------------------
Model 2: runtime_in_minutes + kid_friendly
R2   : 0.0014
MAE  : 24.4012
RMSE : 28.3153
------------------------------
Model 3: runtime_in_minutes + kid_friendly + genre dummies
R2   : 0.1518
MAE  : 22.0284
RMSE : 26.0950
------------------------------


In [11]:
# Feature importance using statsmodels
X3 = sm.add_constant(X_train_m3)
y3 = y_train

ols3 = sm.OLS(y3, X3).fit()
print(ols3.summary())

                            OLS Regression Results                            
Dep. Variable:          critic_rating   R-squared:                       0.228
Model:                            OLS   Adj. R-squared:                  0.226
Method:                 Least Squares   F-statistic:                     125.2
Date:                Wed, 03 Dec 2025   Prob (F-statistic):               0.00
Time:                        15:53:59   Log-Likelihood:                -45359.
No. Observations:                9764   AIC:                         9.077e+04
Df Residuals:                    9740   BIC:                         9.094e+04
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
const     

In [12]:
# Fit 3 more linear regression models

genre_cols = [c for c in X_train_new.columns if c.startswith("genre_")]
season_cols = [c for c in X_train_new.columns if c.startswith("release_season_")]

# Model 4: runtime + kid_friendly + genre dummies + season dummies
features_m4 = ["runtime_in_minutes", "kid_friendly"] + genre_cols + season_cols
X_train_m4 = X_train_new[features_m4]
X_test_m4  = X_test_new[features_m4]

lm4 = LinearRegression()
lm4.fit(X_train_m4, y_train)
y_pred_m4 = lm4.predict(X_test_m4)

print("Model 4: runtime + kid_friendly + genre dummies + season dummies")
metrics_m4 = evaluate_regression(y_test, y_pred_m4)


# Model 5: Model 3 features + director_movies_count
features_m5 = ["runtime_in_minutes", "kid_friendly", "director_movies_count"] + genre_cols
X_train_m5 = X_train_new[features_m5]
X_test_m5  = X_test_new[features_m5]

lm5 = LinearRegression()
lm5.fit(X_train_m5, y_train)
y_pred_m5 = lm5.predict(X_test_m5)

print("Model 5: runtime + kid_friendly + genre dummies + director_movies_count")
metrics_m5 = evaluate_regression(y_test, y_pred_m5)


# Model 6: Model 3 features + movie_genres_count
features_m6 = ["runtime_in_minutes", "kid_friendly", "movie_genres_count"] + genre_cols
X_train_m6 = X_train_new[features_m6]
X_test_m6  = X_test_new[features_m6]

lm6 = LinearRegression()
lm6.fit(X_train_m6, y_train)
y_pred_m6 = lm6.predict(X_test_m6)

print("Model 6: runtime + kid_friendly + genre dummies + movie_genres_count")
metrics_m6 = evaluate_regression(y_test, y_pred_m6)

Model 4: runtime + kid_friendly + genre dummies + season dummies
R2   : 0.1508
MAE  : 22.0319
RMSE : 26.1103
------------------------------
Model 5: runtime + kid_friendly + genre dummies + director_movies_count
R2   : 0.1417
MAE  : 22.2563
RMSE : 26.2508
------------------------------
Model 6: runtime + kid_friendly + genre dummies + movie_genres_count
R2   : 0.1518
MAE  : 22.0284
RMSE : 26.0950
------------------------------


Among the first three models, model 3 performs best. runtime_in_minutes, kid_friendly, genre_Action & Adventure, genre_Animation, genre_Classics, and genre_Comedy are significant and important predictors. However, none of the three models performs well: even model 3 reaches a test R^2 of only 0.1518.

For model 4, I included the season dummy variables to test whether they should be part of the model, comparing its performance with model 3. The test R^2 turned out to be lower (0.1508 vs. 0.1518), so I decided not to include the season dummy variables.

For model 5, I added director_movies_count to test whether it should be part of the model, comparing its performance with model 3. The test R^2 turned out to be lower (0.1417 vs. 0.1518), so I decided not to include director_movies_count.
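
One thing that may be worth checking before ruling out director_movies_count: in the cell that builds the test features, the counts come from the test set's own `value_counts()`, so the feature is measured against a different pool of movies than in training. A minimal sketch of reusing the training counts instead, with toy director lists that are assumptions for illustration:

```python
import pandas as pd

# Toy director columns; the real notebook would use X_train["directors"] and X_test["directors"].
train_directors = pd.Series(["Blake Edwards", "Sidney Lumet", "Sidney Lumet", None])
test_directors = pd.Series(["Sidney Lumet", "Chris Columbus", None])

# Count movies per director on the training data only (nulls are excluded by default).
train_counts = train_directors.value_counts()

# Look up each test director in the training counts; unseen or null directors
# fall back to 1, mirroring the fillna(1) rule used in the notebook.
test_director_movies_count = test_directors.map(train_counts).fillna(1)

print(test_director_movies_count.tolist())
```

Whether this changes the model 5 result is not something the sketch can show; it only keeps the feature definition consistent between train and test.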

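For model 6, I added movie_genres_count to test whether it should be part of the model, comparing its performance with model 3. The test R^2 turned out to be identical (0.1518), most likely because movie_genres_count is the sum of the genre dummy variables and so adds no new information to a linear model. I therefore decided not to include movie_genres_count.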
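
The identical metrics for the model with movie_genres_count are what one would expect if that column is a row sum of the genre dummies, since a linear model gains nothing from a feature that is a linear combination of the others. A small synthetic check of that idea (random data, not the movie dataset; all names here are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# Three synthetic 0/1 indicator columns standing in for genre dummies.
dummies = rng.integers(0, 2, size=(n, 3)).astype(float)
y = dummies @ np.array([5.0, -3.0, 2.0]) + rng.normal(0, 1, size=n)

# A count column that is exactly the row sum of the dummies,
# analogous to movie_genres_count.
count = dummies.sum(axis=1, keepdims=True)
with_count = np.hstack([dummies, count])

pred_base = LinearRegression().fit(dummies, y).predict(dummies)
pred_count = LinearRegression().fit(with_count, y).predict(with_count)

# Compare the two sets of fitted values up to floating-point tolerance.
print(np.allclose(pred_base, pred_count, atol=1e-6))
```

If the real movie_genres_count ever differed from the dummy row sum (for example, through duplicated genre labels in a row), the two models could diverge slightly.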

Out of the 6 models, model 3 performs best; it includes only runtime_in_minutes, kid_friendly, and the genre dummy variables. Genre dummy variables such as genre_Animation, genre_Classics, and genre_Comedy seem to be strong predictors because they have very low p-values and adding them raised the test R^2 from 0.0014 (model 2) to 0.1518 (model 3).

3 ways to improve the model:

- Incorporate textual features from movie descriptions (e.g., TF-IDF or sentiment from synopses).

- Use Ridge or Lasso regression to handle the large number of genre dummy variables and reduce overfitting.

- Try tree-based models like Random Forest or Gradient Boosting that could capture interactions and non-linear effects.
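
As a rough illustration of the first two ideas, the sketch below combines TF-IDF features from a synopsis column with a numeric column and fits a Ridge regression. The toy rows, the made-up ratings, and `alpha=1.0` are all placeholders rather than tuned or real values.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# Toy rows; the texts and ratings are made up for illustration.
toy = pd.DataFrame({
    "movie_info": [
        "A teenager discovers a hidden world of gods",
        "A jury debates the fate of a young defendant",
        "A submarine crew explores the deep sea",
        "Two friends plan a chaotic road trip",
    ],
    "runtime_in_minutes": [83.0, 95.0, 127.0, 90.0],
    "critic_rating": [49, 100, 89, 60],
})

features = ColumnTransformer([
    # A single column name (not a list) passes 1-D text to the vectorizer.
    ("text", TfidfVectorizer(), "movie_info"),
    # Numeric column is passed through unchanged.
    ("num", "passthrough", ["runtime_in_minutes"]),
])

model = Pipeline([
    ("features", features),
    ("ridge", Ridge(alpha=1.0)),  # alpha is an untuned placeholder
])

X_toy = toy[["movie_info", "runtime_in_minutes"]]
model.fit(X_toy, toy["critic_rating"])
preds = model.predict(X_toy)
print(preds.shape)
```

On the real data, the regularization strength would need to be chosen by cross-validation on the training set, and the pipeline would be evaluated on the same post-2010 test split used above.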