# 03 – Modeling: Predicting High-Growth YouTube Videos

This notebook trains and evaluates machine learning models to predict whether a
YouTube video will become a **high-growth video** based on its early performance
metrics and Google Trends signals.

We use the final feature set produced in `02_feature_engineering.ipynb`, where
YouTube metadata has been merged with category-level Google Trends scores.


## 1. Imports

In [None]:
# 03_modeling.ipynb
# High-growth video prediction using YouTube and Google Trends features

from xgboost import XGBClassifier
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import (
    accuracy_score,
    f1_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

plt.style.use("ggplot")

print("Modeling notebook ready.")

## 2. Load Features With Google Trends

We load the processed feature set `features_with_trends.csv` from the
`../data/processed/` directory. This dataset already contains:

- YouTube video metrics (views, likes, comments, etc.)
- Engineered ratios (like/view, comment/view)
- Time-based features (publish hour, etc.)
- Category-level Google Trends scores and rolling averages


In [None]:
# Path to the final feature set (merged with Google Trends)
features_path = "../data/processed/features_with_trends.csv"

df = pd.read_csv(
    features_path,
    parse_dates=["trending_date", "publish_date"],
)

print("Features shape:", df.shape)
df.head()

## 3. Feature Selection and Missing Value Handling

We define the list of features used for modeling. Note that we intentionally
exclude future-dependent variables such as `view_growth` or `growth_rate` from
the input features to avoid data leakage.

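Since Google Trends data is merged with a left join, some rows may not have a
trend value. To ensure the models can be trained, we impute missing numeric
values with the **median** of each feature. Note that the medians are computed
over the full dataset before the split in Section 4, so they include values
from validation and test rows; this is a mild form of leakage to keep in mind
when reading the results.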


In [None]:
# Inspect distribution of the target variable
print("High-growth label distribution (fraction):")
print(df["high_growth"].value_counts(normalize=True))

# Features to be used in the models
feature_cols = [
    "views",
    "likes",
    "dislikes",
    "comment_count",
    "like_view_ratio",
    "comment_view_ratio",
    "publish_hour",
    "category_id",
    "trend_score",
    "trend_score_3d_mean",
    "trend_score_7d_mean",
]

# Check for missing values
print("\nNumber of NaNs per feature before imputation:")
print(df[feature_cols].isna().sum())

# Simple and robust imputation strategy:
# - numeric features -> median
# - non-numeric features (if any) -> -1 as a dummy category
for col in feature_cols:
    if df[col].dtype.kind in "iuf":  # int/unsigned/float
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(-1)

print("\nNumber of NaNs per feature after imputation:")
print(df[feature_cols].isna().sum())

X = df[feature_cols].copy()
y = df["high_growth"].astype(int)

X.head()
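
A stricter variant of the imputation above learns the medians from the training
rows only and reuses them for later rows. The sketch below shows the idea on a
toy frame; the values and the names `toy`, `train_part`, and `test_part` are
illustrative assumptions, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the real features (values are made up).
toy = pd.DataFrame(
    {
        "views": [100.0, 200.0, np.nan, 400.0, 10_000.0, np.nan],
        "trend_score": [10.0, np.nan, 30.0, 40.0, np.nan, 90.0],
    }
)

# Assume rows are already sorted by time: first 4 = train, last 2 = test.
train_part = toy.iloc[:4]
test_part = toy.iloc[4:]

# Fit medians on the training rows only, then apply them to both parts.
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(
    imputer.fit_transform(train_part),
    columns=toy.columns,
    index=train_part.index,
)
test_filled = pd.DataFrame(
    imputer.transform(test_part),
    columns=toy.columns,
    index=test_part.index,
)

print("Medians learned from train:", dict(zip(toy.columns, imputer.statistics_)))
print(test_filled)
```

In this toy case the large `views` value in the test rows does not influence
the value used to fill gaps, because the median comes from the training rows.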

## 4. Time-Based Train / Validation / Test Split

Instead of using a random split, we respect the temporal nature of the data:

- The dataset is sorted by `trending_date`.
- The oldest **60%** of the rows are used for **training**.
- The next **20%** are reserved for **validation** (not used in this notebook,
  but available for hyperparameter search or threshold selection).
- The most recent **20%** are used for **testing**.

This setup mimics a realistic scenario: we train models on past data and
evaluate them on future videos.


In [None]:
# Time-based train/validation/test split
df_sorted = df.sort_values("trending_date").reset_index(drop=True)

n = len(df_sorted)
train_end = int(n * 0.6)
val_end = int(n * 0.8)

train = df_sorted.iloc[:train_end]
val = df_sorted.iloc[train_end:val_end]
test = df_sorted.iloc[val_end:]

print(f"Train: {len(train)}, Val: {len(val)}, Test: {len(test)}")

X_train = train[feature_cols]
y_train = train["high_growth"].astype(int)

X_val = val[feature_cols]
y_val = val["high_growth"].astype(int)

X_test = test[feature_cols]
y_test = test["high_growth"].astype(int)

print("\nLabel distribution (train, val, test):")
for split_name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{split_name}: {labels.value_counts(normalize=True).to_dict()}")
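
Splitting by row position after sorting can place videos that share a
`trending_date` on both sides of a boundary. As a possible alternative, the
sketch below cuts on date values so that each calendar date belongs to exactly
one subset. The helper `time_split` and the toy data are hypothetical and not
part of the notebook.

```python
import pandas as pd


def time_split(frame, date_col, train_end=0.6, val_end=0.8):
    """Split on date cutoffs so that no calendar date spans two subsets."""
    dates = frame[date_col].sort_values()
    # Cutoff dates taken at the requested row fractions.
    train_cut = dates.iloc[int(round(len(dates) * train_end))]
    val_cut = dates.iloc[int(round(len(dates) * val_end))]

    train_part = frame[frame[date_col] < train_cut]
    val_part = frame[(frame[date_col] >= train_cut) & (frame[date_col] < val_cut)]
    test_part = frame[frame[date_col] >= val_cut]
    return train_part, val_part, test_part


# Toy data: 10 days with 5 rows per day (made-up values).
toy = pd.DataFrame(
    {"trending_date": pd.date_range("2024-01-01", periods=10).repeat(5)}
)
toy["high_growth"] = [i % 2 for i in range(len(toy))]

train_part, val_part, test_part = time_split(toy, "trending_date")
print(len(train_part), len(val_part), len(test_part))
```

With date cutoffs the subset sizes may deviate slightly from 60/20/20 when
many rows share a date; that is the trade-off for clean date boundaries.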

## 5. Models and Evaluation Metrics

We evaluate three supervised classification models:

1. **Logistic Regression** – a simple linear baseline.
2. **Random Forest** – a non-linear ensemble model.
3. **XGBoost** – a gradient-boosted tree ensemble.

For each model we report:

- **Accuracy**
- **F1-score**
- **ROC-AUC**
- A full classification report (precision, recall, F1 per class).


In [None]:
def evaluate_classifier(model, X_tr, y_tr, X_te, y_te, name="Model"):
    """Print test-set metrics for a fitted model and return them as a dict.

    X_tr and y_tr are currently unused; only X_te and y_te are evaluated.
    """
    y_pred = model.predict(X_te)
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(X_te)[:, 1]
        roc = roc_auc_score(y_te, y_proba)
    else:
        y_scores = model.decision_function(X_te)
        roc = roc_auc_score(y_te, y_scores)

    acc = accuracy_score(y_te, y_pred)
    f1 = f1_score(y_te, y_pred)

    print(f"\n==== {name} ====")
    print("Accuracy:", acc)
    print("F1-score:", f1)
    print("ROC-AUC:", roc)
    print("\nClassification report:\n", classification_report(y_te, y_pred))

    return {"accuracy": acc, "f1": f1, "roc_auc": roc}
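
Accuracy is hard to interpret without a reference point when the `high_growth`
label is imbalanced (Section 3 prints its distribution). A majority-class
baseline gives that reference. The sketch below uses synthetic labels with an
assumed 20% positive rate, which is not taken from the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: 1,000 rows, exactly 20% positives (assumed rate).
y_toy = np.zeros(1000, dtype=int)
y_toy[:200] = 1
rng.shuffle(y_toy)
X_toy = rng.normal(size=(1000, 3))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_pred = baseline.predict(X_toy)
y_proba = baseline.predict_proba(X_toy)[:, 1]

baseline_metrics = {
    "accuracy": accuracy_score(y_toy, y_pred),
    "f1": f1_score(y_toy, y_pred, zero_division=0),
    "roc_auc": roc_auc_score(y_toy, y_proba),
}
print(baseline_metrics)
```

The baseline reaches high accuracy by never predicting the positive class,
while its F1-score and ROC-AUC show that it has no discriminative power. This
is why the notebook reports all three metrics.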

In [None]:
# Baseline model: Logistic Regression
# n_jobs is omitted: it does not speed up a binary problem with the default solver.
log_reg = LogisticRegression(
    max_iter=2000,
)

log_reg.fit(X_train, y_train)

metrics_lr = evaluate_classifier(
    log_reg, X_train, y_train, X_test, y_test, name="Logistic Regression"
)
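
The baseline above is fitted on raw features whose scales differ by orders of
magnitude (for example `views` versus `like_view_ratio`), which can slow
convergence and affects the regularized coefficients. One common remedy,
sketched here on synthetic data, is to standardize features inside a pipeline
so the scaler is fitted on training rows only. The data and the name
`scaled_lr` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one column is inflated to mimic raw view counts.
X_toy, y_toy = make_classification(
    n_samples=500,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)
X_toy[:, 0] *= 1_000_000

X_tr, y_tr = X_toy[:400], y_toy[:400]
X_te, y_te = X_toy[400:], y_toy[400:]

# The scaler learns its mean and variance from the training rows only.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
scaled_lr.fit(X_tr, y_tr)

proba = scaled_lr.predict_proba(X_te)[:, 1]
print("Mean predicted probability:", proba.mean())
```

Because the pipeline exposes `predict` and `predict_proba`, it could be passed
to `evaluate_classifier` in place of `log_reg`.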

In [None]:
# Non-linear model: Random Forest
rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    random_state=42,
    n_jobs=-1,
)

rf.fit(X_train, y_train)

metrics_rf = evaluate_classifier(
    rf, X_train, y_train, X_test, y_test, name="Random Forest"
)

In [None]:
# XGBoost Classifier
xgb = XGBClassifier(
    n_estimators=400,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.9,
    colsample_bytree=0.9,
    eval_metric="logloss",
    random_state=42,
)

xgb.fit(X_train, y_train)

metrics_xgb = evaluate_classifier(
    xgb, X_train, y_train, X_test, y_test, name="XGBoost"
)

metrics_xgb
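
None of the cells above use the validation split from Section 4. One possible
use is to choose a hyperparameter on validation data and leave the test set
untouched until the end. The sketch below does this for `max_depth` on
synthetic data; the candidate values and the data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data, split by position into train and validation parts.
X_toy, y_toy = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, y_tr = X_toy[:360], y_toy[:360]
X_va, y_va = X_toy[360:480], y_toy[360:480]

# Score each candidate depth on the validation part only.
val_scores = {}
for depth in [3, 6, None]:
    model = RandomForestClassifier(
        n_estimators=50, max_depth=depth, random_state=42
    )
    model.fit(X_tr, y_tr)
    val_scores[depth] = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

best_depth = max(val_scores, key=val_scores.get)
print("Validation ROC-AUC per depth:", val_scores)
print("Selected max_depth:", best_depth)
```

After selection, the chosen configuration would be refitted and evaluated once
on the test set, so the reported test metrics stay unbiased by tuning.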


## 6. Random Forest Confusion Matrix and Feature Importance

Random Forest can capture non-linear relationships and interactions between
features that a linear model such as Logistic Regression cannot.

We inspect:

- The confusion matrix on the test set.
- Feature importances estimated by the Random Forest model.

This helps us understand which signals (e.g., early views, engagement ratios,
Google Trends scores) contribute the most to predicting high-growth videos.


In [None]:
# Confusion matrix for Random Forest
cm = confusion_matrix(y_test, rf.predict(X_test))

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax)
ax.set_xlabel("Predicted")
ax.set_ylabel("True")
ax.set_title("Confusion Matrix - Random Forest")
plt.show()
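
The four cells of a binary confusion matrix map directly onto precision and
recall for the positive class. The sketch below shows the arithmetic on a
small hand-made example; the labels are invented for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made labels (1 = high-growth), for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```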

In [None]:
# Feature importances
importances = rf.feature_importances_
fi = pd.Series(importances, index=feature_cols).sort_values(ascending=False)

plt.figure(figsize=(8, 4))
fi.plot(kind="bar")
plt.title("Feature Importances (Random Forest)")
plt.ylabel("Importance")
plt.tight_layout()
plt.show()

fi
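
`feature_importances_` is computed from impurity reductions on the training
data and can favor features with many distinct values. Permutation importance
on held-out rows is a common cross-check. The sketch below uses synthetic data
in which only the column named `signal` carries information; all names and
values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n_rows = 600

# Synthetic features: one informative column and two pure-noise columns.
toy = pd.DataFrame(
    {
        "signal": rng.normal(size=n_rows),
        "noise_a": rng.normal(size=n_rows),
        "noise_b": rng.normal(size=n_rows),
    }
)
toy_y = (toy["signal"] + 0.1 * rng.normal(size=n_rows) > 0).astype(int)

X_tr, X_te = toy.iloc[:400], toy.iloc[400:]
y_tr, y_te = toy_y.iloc[:400], toy_y.iloc[400:]

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each column on held-out rows and measure the drop in ROC-AUC.
result = permutation_importance(
    model, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=42
)
perm_importance = pd.Series(
    result.importances_mean, index=X_te.columns
).sort_values(ascending=False)
print(perm_importance)
```

Applied to this notebook, the same call would take `rf`, `X_test`, and
`y_test`; a feature that ranks high in both views is a more reliable signal.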

## 7. Impact of Google Trends Features

To quantify the added value of Google Trends, we compare two Random Forest models:

- **RF WITHOUT Trends**: trained on YouTube features only  
  (`views`, `likes`, `comment_view_ratio`, `publish_hour`, `category_id`, etc.).

- **RF WITH Trends**: trained on the full feature set, including  
  `trend_score`, `trend_score_3d_mean`, `trend_score_7d_mean`.

By comparing their Accuracy, F1-score, and ROC-AUC on the same test set, we can
assess whether Google Trends provides additional predictive power.


In [None]:
# Trend-related columns
trend_cols = ["trend_score", "trend_score_3d_mean", "trend_score_7d_mean"]

base_features = [c for c in feature_cols if c not in trend_cols]
trend_features = feature_cols  # all features


def train_eval_rf(feature_list, name=""):
    X_train_f = train[feature_list]
    X_test_f = test[feature_list]
    y_train_f = train["high_growth"].astype(int)
    y_test_f = test["high_growth"].astype(int)

    model = RandomForestClassifier(
        n_estimators=300,
        max_depth=None,
        random_state=42,
        n_jobs=-1,
    )
    model.fit(X_train_f, y_train_f)

    y_pred = model.predict(X_test_f)
    y_proba = model.predict_proba(X_test_f)[:, 1]

    acc = accuracy_score(y_test_f, y_pred)
    f1 = f1_score(y_test_f, y_pred)
    roc = roc_auc_score(y_test_f, y_proba)

    print(f"\n==== {name} ====")
    print("Accuracy:", acc)
    print("F1:", f1)
    print("ROC-AUC:", roc)

    return {"accuracy": acc, "f1": f1, "roc_auc": roc}


metrics_rf_no_trend = train_eval_rf(base_features, name="RF WITHOUT Trends")
metrics_rf_with_trend = train_eval_rf(trend_features, name="RF WITH Trends")

summary = pd.DataFrame(
    {
        "model": ["RF_without_trends", "RF_with_trends"],
        "accuracy": [metrics_rf_no_trend["accuracy"], metrics_rf_with_trend["accuracy"]],
        "f1": [metrics_rf_no_trend["f1"], metrics_rf_with_trend["f1"]],
        "roc_auc": [metrics_rf_no_trend["roc_auc"], metrics_rf_with_trend["roc_auc"]],
    }
)

summary
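
A single pair of test-set scores does not show whether the difference between
the two models exceeds sampling noise. One rough way to check is a bootstrap
interval for the ROC-AUC difference on the same test rows. The helper
`bootstrap_auc_diff` and the synthetic scores below are hypothetical; rows in
time-ordered data are not independent, so the interval is only indicative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auc_diff(y_true, score_a, score_b, n_boot=500, seed=0):
    """Return a 95% percentile interval for AUC(score_b) - AUC(score_a)."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    score_a = np.asarray(score_a)
    score_b = np.asarray(score_b)
    n = len(y_true)

    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        if np.unique(y_true[idx]).size < 2:
            continue  # ROC-AUC is undefined without both classes
        auc_a = roc_auc_score(y_true[idx], score_a[idx])
        auc_b = roc_auc_score(y_true[idx], score_b[idx])
        diffs.append(auc_b - auc_a)

    lower, upper = np.percentile(diffs, [2.5, 97.5])
    return float(lower), float(upper)


# Synthetic example: score_b separates the classes better than score_a.
rng = np.random.default_rng(1)
y_toy = rng.integers(0, 2, size=400)
score_a = 0.5 * y_toy + rng.normal(size=400)
score_b = 1.5 * y_toy + rng.normal(size=400)

lower, upper = bootstrap_auc_diff(y_toy, score_a, score_b)
print(f"95% interval for the AUC difference: [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero suggests the improvement is not just noise; in
this notebook the two score vectors would be the `predict_proba` outputs of the
models trained with and without trend features.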

## 8. Conclusion

- The Random Forest model typically outperforms Logistic Regression on the test
  set, suggesting non-linear relationships between features and the `high_growth`
  label.
- Adding Google Trends features often leads to an improvement in overall
  performance (especially ROC-AUC and/or F1-score), indicating that external
  trend signals carry useful information beyond YouTube-internal metrics.
- The most important features usually include early views, engagement ratios
  (like/view, comment/view), and category-level trend scores.
- The final model can be exported locally as a `.pkl` file for deployment or
  further analysis.

> **Note:** Model artifacts (`.pkl` files) are intentionally not stored in the
> GitHub repository to avoid large binary files. They can be regenerated by
> rerunning this notebook, or stored in external storage if needed.


## 9. (Optional) Save the Trained Model Locally

If you want to save the trained Random Forest model for later use (outside of
this notebook), you can uncomment and run the following cell. This will create
a `models/` directory one level above the notebook's working directory (if it
does not exist) and store the model as a `.pkl` file **on your local machine
only**.

Remember **not to commit or push** such large model files to GitHub; instead,
add `models/` and `*.pkl` to your `.gitignore`.


In [None]:
# import joblib
# import os

# os.makedirs("../models", exist_ok=True)
# model_path = "../models/final_rf_model.pkl"
# joblib.dump(rf, model_path)
# print(f"Random Forest model saved to: {model_path}")
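
To reuse a saved model, it can be loaded back with `joblib.load`. The sketch
below performs a save and load round trip in a temporary directory with a
small model trained on synthetic data; the data and the temporary path are
assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_toy, y_toy)

# Save to a temporary directory and load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "final_rf_model.pkl")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(
    model.predict_proba(X_toy), restored.predict_proba(X_toy)
)
print("Predictions identical after reload:", same_predictions)
```

Pickle-based files should only be loaded from trusted sources, and ideally
with the same scikit-learn version that was used to save them.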