# Modeling and Validation (05)

This notebook evaluates whether engineered features contain predictive signal
for a rating-based hit label.

The goal is validation and interpretability, not deployment-level performance.


In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/netflix_tv_features_v1.csv")
print(df.shape)
df.head()

(2020, 78)


Unnamed: 0,id,hit,years_since_release,is_us,is_korea,covid_period_covid,covid_period_post_covid,covid_period_pre_covid,covid_period_unknown,covid_period_nan,...,primary_country_SE,primary_country_SG,primary_country_SN,primary_country_TH,primary_country_TR,primary_country_TW,primary_country_US,primary_country_VN,primary_country_ZA,primary_country_nan
0,66732,1,10.0,1,0,False,False,True,False,False,...,False,False,False,False,False,False,True,False,False,False
1,238458,1,1.0,0,1,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,4656,1,33.0,1,0,False,False,True,False,False,...,False,False,False,False,False,False,True,False,False,False
3,63174,1,10.0,1,0,False,False,True,False,False,...,False,False,False,False,False,False,True,False,False,False
4,71912,1,7.0,1,0,False,False,True,False,False,...,False,False,False,False,False,False,True,False,False,False


In [2]:
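y = df["hit"]
# Note: "Unnamed: 0" (apparently the row index saved with the CSV) is not
# dropped here, so it stays in X as a feature.
X = df.drop(columns=["hit", "id"])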

print("Positive rate:", y.mean())

Positive rate: 0.09059405940594059


The hit label exhibits moderate class imbalance, with approximately 9% of titles classified as hits.
Accordingly, accuracy is not a reliable evaluation metric in this setting, and model performance is assessed using **ROC-AUC** and class-specific precision/recall.
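
To make the point about accuracy concrete, consider a trivial baseline that labels every title a non-hit. The sketch below uses the counts implied by the outputs above (2020 rows and a positive rate of 0.0906, hence 183 hits); the variable names are illustrative and not part of the notebook.

```python
# Counts implied by df.shape and y.mean() above: 2020 titles, 183 hits.
n_total = 2020
n_hits = 183
n_non_hits = n_total - n_hits

# A baseline that predicts "non-hit" for every title.
true_negatives = n_non_hits    # every non-hit is labeled correctly
false_negatives = n_hits       # every hit is missed
true_positives = 0             # no title is ever predicted as a hit

baseline_accuracy = true_negatives / n_total
baseline_recall = true_positives / (true_positives + false_negatives)

print(f"positive rate:     {n_hits / n_total:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall:   {baseline_recall:.4f}")
```

The baseline is right about 91% of the time while finding no hits at all, which is why ranking metrics and per-class recall are more informative here.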

In [13]:
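from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hold out a stratified 25% validation set so both splits keep the ~9% hit rate.
X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.25,
    stratify=y,
    random_state=42
)

# Impute and scale inside the pipeline so both steps are fit on training rows only.
pipe_lr = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(
        class_weight="balanced",
        max_iter=1000,
        random_state=42
    ))
])

pipe_lr.fit(X_train, y_train)

# Predicted probability of the positive (hit) class
y_prob = pipe_lr.predict_proba(X_val)[:, 1]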

In [14]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_val, y_prob)

0.8329070758738277

In [15]:
nan_cols = X.isna().mean().sort_values(ascending=False)
nan_cols[nan_cols > 0].head(30)

years_since_release    0.060396
dtype: float64

Only `years_since_release` contains missing values, in approximately 6% of rows.
Rather than dropping observations, which would disproportionately affect the minority hit class,
missing values are imputed using the median within the training pipeline.
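
The effect of imputing inside the pipeline can be illustrated on toy data: the median is learned from the training rows and then reused for the validation rows. This is a minimal sketch with hypothetical values, not the notebook's data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy stand-ins for a training and a validation column (hypothetical values).
train_col = np.array([[1.0], [3.0], [np.nan], [10.0]])
val_col = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(train_col)                    # median learned from training rows only

train_median = imputer.statistics_[0]
val_filled = imputer.transform(val_col)   # validation gaps get the training median

print("median learned from training data:", train_median)
print("validation column after imputation:", val_filled.ravel())
```

Because the validation rows never influence the fitted median, the validation score is not contaminated by information from the held-out set.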

In [18]:
coef = pd.Series(
    pipe_lr.named_steps["lr"].coef_[0],
    index=X.columns
).sort_values(key=abs, ascending=False)

coef.head(15)

covid_period_unknown                -1.235099
primary_country_IN                  -0.787415
primary_genre_grouped_unknown       -0.752718
primary_genre_grouped_Documentary   -0.668215
years_since_release                  0.569405
primary_country_GB                   0.557989
primary_genre_grouped_Drama          0.536299
primary_country_US                   0.529834
is_us                                0.529834
primary_country_TH                  -0.520755
primary_country_PL                  -0.498857
primary_genre_grouped_Reality       -0.498541
primary_country_ZA                  -0.490635
primary_country_TR                  -0.476032
primary_country_JP                   0.459457
dtype: float64
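
Because the pipeline standardizes its inputs, each coefficient is a change in log-odds per one standard deviation of the feature. Exponentiating gives an odds ratio, which is often easier to read. The sketch below uses a few values copied from the table above; the dictionary is illustrative only.

```python
import math

# Coefficients copied from the table above (log-odds per one standard
# deviation of each feature, since the pipeline standardizes inputs).
coefficients = {
    "covid_period_unknown": -1.235099,
    "primary_country_IN": -0.787415,
    "years_since_release": 0.569405,
    "primary_country_GB": 0.557989,
}

# exp(coef) is the multiplicative change in the odds of a hit.
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:<24} odds ratio = {ratio:.3f}")
```

Ratios below 1 correspond to lower odds of a hit and ratios above 1 to higher odds, holding the other features fixed.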

In [22]:
df.drop(columns=['is_us', 'is_korea'])

Unnamed: 0,id,hit,years_since_release,covid_period_covid,covid_period_post_covid,covid_period_pre_covid,covid_period_unknown,covid_period_nan,primary_genre_grouped_Action & Adventure,primary_genre_grouped_Animation,...,primary_country_SE,primary_country_SG,primary_country_SN,primary_country_TH,primary_country_TR,primary_country_TW,primary_country_US,primary_country_VN,primary_country_ZA,primary_country_nan
0,66732,1,10.0,False,False,True,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
1,238458,1,1.0,False,True,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
2,4656,1,33.0,False,False,True,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
3,63174,1,10.0,False,False,True,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
4,71912,1,7.0,False,False,True,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2015,292533,0,,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2016,307530,0,,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2017,205186,0,4.0,False,True,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
2018,308788,0,5.0,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


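Some engineered features encode overlapping information. For example, `is_us` and `primary_country_US`
receive exactly the same coefficient (0.529834), which suggests the two columns are identical.
While regularized logistic regression is robust to such redundancy, redundant variables can be
removed to improve interpretability, with little expected effect on predictive performance.
Note that the cell above only previews the reduced frame: `df.drop` returns a copy and leaves `df` unchanged.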
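
One way to detect this kind of overlap is to look for columns whose values are identical in every row. The helper below, `find_identical_columns`, is a hypothetical name introduced for illustration, and the toy frame only mirrors the notebook's column names with made-up values.

```python
import pandas as pd

def find_identical_columns(frame):
    """Return pairs of column names whose values match in every row."""
    numeric = frame.astype(float)          # booleans become 0.0 / 1.0
    names = list(numeric.columns)
    pairs = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            # NaN never compares equal, so columns with gaps are not flagged.
            if (numeric[left] == numeric[right]).all():
                pairs.append((left, right))
    return pairs

# Toy frame with hypothetical values; the column names mirror the notebook's.
toy = pd.DataFrame({
    "is_us": [1, 0, 1, 0],
    "primary_country_US": [True, False, True, False],
    "primary_country_GB": [False, True, False, False],
})

duplicates = find_identical_columns(toy)
print(duplicates)
```

The pairwise comparison is quadratic in the number of columns, which is acceptable for a feature matrix with a few dozen columns.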


In [24]:
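from sklearn.base import clone
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Keep only the indicator columns; this excludes years_since_release
# and the "Unnamed: 0" column.
X_cat_only = X.filter(regex="covid_period|primary_genre|primary_country|is_")

X_train2, X_val2, y_train2, y_val2 = train_test_split(
    X_cat_only, y,
    test_size=0.25,
    stratify=y,
    random_state=42
)

# Fit an unfitted copy so the pipeline trained on the full feature set is preserved.
pipe_lr_cat = clone(pipe_lr)
pipe_lr_cat.fit(X_train2, y_train2)
y_prob2 = pipe_lr_cat.predict_proba(X_val2)[:, 1]
roc_auc_score(y_val2, y_prob2)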

0.8356067064506962

Restricted to the categorical indicator features, the logistic regression model achieves a ROC-AUC
of 0.836 on the held-out validation set, essentially the same as the 0.833 obtained with the full
feature set. Both results indicate substantial separability between hit and non-hit titles.
Given the label construction and feature design, this result should be interpreted
as evidence of strong structural signal rather than deployable predictive performance.
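
A single 25% split holds only a few dozen hits (roughly 46 of 505 rows at a 9% positive rate), so the ROC-AUC estimate carries noticeable sampling noise. One possible check, which this notebook does not run, is stratified k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in; the fold count and data parameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly 9% positives (not the notebook's data).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.91, 0.09], random_state=42,
)

# Same pipeline structure as in the notebook.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(class_weight="balanced", max_iter=1000,
                              random_state=42)),
])

# Stratified folds keep the hit rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="roc_auc")

print("per-fold ROC-AUC:", [round(float(s), 3) for s in auc_scores])
print(f"mean = {auc_scores.mean():.3f}, std = {auc_scores.std():.3f}")
```

The spread across folds gives a rough sense of how much a single-split score could move from sampling variation alone.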

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.25,
    stratify=y,
    random_state=42
)

pipe_lr.fit(X_train, y_train)

y_prob = pipe_lr.predict_proba(X_val)[:, 1]
roc_auc_score(y_val, y_prob)

0.8329070758738277

## Modeling Results

A regularized logistic regression model achieves a ROC-AUC of approximately 0.83
on a held-out validation set, indicating strong separability between hit and non-hit titles.

Predictive signal is primarily driven by temporal context (release period),
genre composition, and country of origin. Missing metadata categories are
consistently associated with lower hit probability.

Given that the hit label is constructed from ratings, results are
interpreted as structural associations rather than causal drivers of success.