# Ethiopia Food Prices – Modelling & Model Comparison

## Summary – Ethiopia food price forecasting experiments (Naive vs ARIMA/XGB)

**Goal**

Explore different time series models for Ethiopia food prices and decide which one to use as the **operational model** for a ~3-month planning horizon, with model tracking in MLflow.

---

### Data & setup

- Data: Ethiopia Tier A panel (`ethiopia_foodprices_model_panel_tierA.parquet`), monthly prices.
- Series unit: `(admin_1, product)` pairs.
- Main target: `value_imputed` → renamed to `y` for modelling.
- Two main data “views”:
  - **StatsForecast panel:** `unique_id` (admin_1 · product), `ds` (month-end), `y`.
  - **Feature-based panel:** built via `staples_model_core.py`:
    - `impute_features` → handles rain/FAO/PTM/population covariates and adds missingness flags.
    - `build_features` → lags (`y_lag1`, `y_lag3`, `y_lag6`, `y_lag12`), rolling stats, month number, seasonal encodings (`mo_sin`, `mo_cos`).
    - `encode_ids`, `pick_features`, then `time_split` for the train/test split.
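
`build_features` lives in `staples_model_core.py` and is not shown in this notebook. As a rough illustration of the kind of features listed above, the sketch below builds per-series lags and a cyclical month encoding on a toy panel. The helper name `add_basic_features` and the toy data are hypothetical; only the column naming (`y_lag*`, `mo_sin`, `mo_cos`) follows the list above.

```python
import numpy as np
import pandas as pd

def add_basic_features(df, lags=(1, 3, 6, 12)):
    """Hypothetical helper: per-series lags plus a cyclical month encoding."""
    out = df.sort_values(["admin_1", "product", "month"]).copy()
    grp = out.groupby(["admin_1", "product"])["y"]
    for k in lags:
        # Shift within each series so lags never leak across series
        out[f"y_lag{k}"] = grp.shift(k)
    mo = out["month"].dt.month
    # Map month 1..12 onto a circle so December and January are neighbours
    out["mo_sin"] = np.sin(2 * np.pi * mo / 12)
    out["mo_cos"] = np.cos(2 * np.pi * mo / 12)
    return out

# Toy panel: two regions, one product, four months each (made-up prices)
toy = pd.DataFrame({
    "admin_1": ["Oromia"] * 4 + ["Tigray"] * 4,
    "product": ["Maize"] * 8,
    "month": list(pd.date_range("2023-01-01", periods=4, freq="MS")) * 2,
    "y": [10.0, 11.0, 12.0, 13.0, 20.0, 21.0, 22.0, 23.0],
})
feat = add_basic_features(toy, lags=(1, 3))
print(feat[["admin_1", "month", "y", "y_lag1", "y_lag3", "mo_sin", "mo_cos"]])
```

The first row of each series has no `y_lag1`, which is why the real pipeline needs missingness handling before the features reach the model.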

---

### Models evaluated

**1. StatsForecast baselines (panel models)**  
Trained on per-series history in `df_sf`:

- `sf_Naive`: random walk (forecast = last observed y).
- `sf_WindowAverage(12)`: simple 12-month moving average (included only in the first baseline run, not in the h3/h6 comparison).
- `sf_ARIMA`: `AutoARIMA(seasonal=False, alias="ARIMA")`.
- `sf_SARIMA`: `AutoARIMA(season_length=12, alias="SARIMA")`.
- `sf_NaiveDrift`: custom “Naive + drift” model:
  - Per series, compute average recent monthly change (last 6 diffs).
  - Forecast: `y_t + h * drift`.
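
To make the drift rule concrete, here is a minimal single-series sketch with made-up numbers. The function name `naive_drift_forecast` is illustrative; the notebook's own panel version is `forecast_naive_drift`.

```python
import numpy as np

def naive_drift_forecast(y, h, drift_window=6):
    """Forecast h steps as last value + k * mean of the last `drift_window` diffs."""
    y = np.asarray(y, dtype=float)
    diffs = np.diff(y)
    # With fewer than two observations there are no diffs, so fall back to zero drift
    drift = diffs[-drift_window:].mean() if len(diffs) else 0.0
    return np.array([y[-1] + k * drift for k in range(1, h + 1)])

# Toy series: the last six monthly changes are +2, +2, +4, 0, +2, +2 (mean = 2);
# the earlier -10 change falls outside the 6-diff window and is ignored
history = [100, 90, 92, 94, 98, 98, 100, 102]
fc = naive_drift_forecast(history, h=3)
print(fc)  # [104. 106. 108.]
```

Because the drift is extrapolated linearly, any error in the drift estimate grows with `h`, which is one plausible reason this variant degrades faster than plain Naive at longer horizons.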

**2. Global feature-based model (XGBoost)**

- Common engineered feature set from `staples_model_core.py`:
  - Lags, rolling stats, month encodings, rain/FAO/PTM/population, etc.
- Global XGBoost regressor (one model across all series):
  - Hyperparameters tuned with Optuna on sMAPE (log-space target).
  - `fit_xgb_compat` handles early stopping on the tail of the training period.
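
The "log-space target" means the regressor is trained on `log1p(y)` and its predictions are mapped back with `expm1` before metrics are computed. The sketch below uses made-up prices and a pretend prediction error, with no actual model, just to show the round trip and why a constant error in log space behaves like a roughly constant relative error in level space.

```python
import numpy as np

y = np.array([0.0, 12.5, 48.0, 310.0])   # toy prices in level space
y_log = np.log1p(y)                       # training target in log space

# Pretend the model is off by +/-0.05 in log space (illustrative values only)
pred_log = y_log + np.array([0.05, -0.05, 0.05, -0.05])
pred_level = np.expm1(pred_log)           # back-transform before computing metrics

# Relative error in level space, skipping the zero price
rel_err = (pred_level - y)[1:] / y[1:]
print(np.round(rel_err, 3))
```

`log1p`/`expm1` are used rather than `log`/`exp` so that zero prices remain valid inputs.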

**3. Hybrid residual-corrected model**

- Base: global XGB prediction `y_pred_global` (in level).
- Residuals: `resid = y - y_pred_global`.
- For each **product**, fit a small Ridge regression on the residuals:
  - Features: `[y_pred_global]` only.
  - `residual_corrector(product)`.
- Final forecast:  
  `y_pred_hybrid = y_pred_global + correction(product, y_pred_global)`.

Later, this logic was generalized into a `GroupResidualCorrector` class that can, in principle, correct any base model per product or per (admin_1, product). In practice, the hybrid did **not** beat Naive / SARIMA for this dataset and horizon.
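
For intuition, the per-product correction can be sketched in NumPy alone. This is not the notebook's `GroupResidualCorrector`: it is a closed-form ridge fit of the residuals on one standardized feature, with a penalized slope and an unpenalized intercept. The function name `fit_residual_ridge` and the toy numbers are assumptions for illustration.

```python
import numpy as np

def fit_residual_ridge(y_true, y_pred, alpha=2.0):
    """Closed-form ridge of residuals on a single standardized feature (y_pred)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    mu, sd = y_pred.mean(), y_pred.std()      # assumes y_pred is not constant (sd > 0)
    z = (y_pred - mu) / sd
    slope = (z @ (resid - resid.mean())) / (z @ z + alpha)   # shrunk towards zero
    intercept = resid.mean()                                   # not penalized
    return lambda p: intercept + slope * (np.asarray(p, dtype=float) - mu) / sd

# Toy product whose base model under-predicts by 10%
y_pred_global = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = 1.10 * y_pred_global
correction = fit_residual_ridge(y, y_pred_global)
y_pred_hybrid = y_pred_global + correction(y_pred_global)

print("MAE base  :", np.mean(np.abs(y - y_pred_global)))
print("MAE hybrid:", np.mean(np.abs(y - y_pred_hybrid)))
```

The improvement here is in-sample on a deliberately biased toy series; as noted above, no such gain appeared out of sample on the Ethiopia panel.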

---

### Horizons tested

Two horizons were explicitly compared using StatsForecast models:

- **h = 3 months** (`h3`) – the main planning horizon.
- **h = 6 months** (`h6`) – to check if longer-horizon behaviour changes which model is best.

For each horizon, the last `h` months per series were used as test; the rest as train.
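
A minimal sketch of that hold-out rule on a toy panel is shown below. The helper name `holdout_last_h` and the data are illustrative; the notebook's own helper is `make_train_test_statsforecast`.

```python
import pandas as pd

def holdout_last_h(df, h):
    """Split a long panel into (train, test), holding out the last h rows per series."""
    df = df.sort_values(["unique_id", "ds"]).reset_index(drop=True)
    # 0 = most recent row of each series, 1 = the one before it, and so on
    rev_pos = df.groupby("unique_id").cumcount(ascending=False)
    return df[rev_pos >= h], df[rev_pos < h]

months = pd.to_datetime(
    ["2023-01-31", "2023-02-28", "2023-03-31", "2023-04-30", "2023-05-31"]
)
# Series B is one month shorter than series A
toy = pd.DataFrame({
    "unique_id": ["A"] * 5 + ["B"] * 4,
    "ds": list(months) + list(months[:4]),
    "y": list(range(9)),
})
train, test = holdout_last_h(toy, h=3)
print(train.groupby("unique_id").size().to_dict())  # {'A': 2, 'B': 1}
print(test.groupby("unique_id").size().to_dict())   # {'A': 3, 'B': 3}
```

Because the split is per series, the test window can cover different calendar months for series that end at different dates.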

---

### Key results

#### Horizon h = 3 (short-term, operational horizon)

Panel-level metrics (approximate):

- **sf_Naive_h3**  
  - MAE ≈ **5.79**  
  - RMSE ≈ **8.66**  
  - sMAPE ≈ **6.03%**  ← **Best model**
- **sf_SARIMA_h3**  
  - MAE ≈ 7.07, RMSE ≈ 9.94, sMAPE ≈ 7.30%
- **sf_ARIMA_h3**  
  - MAE ≈ 7.94, RMSE ≈ 12.17, sMAPE ≈ 7.67%
- **sf_NaiveDrift_h3**  
  - MAE ≈ 8.82, RMSE ≈ 14.47, sMAPE ≈ 7.68%

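- Global XGB and hybrid XGB were **worse** than Naive on all metrics (XGB metrics are computed only on test rows whose series has at least 12 months of training history).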
- Naive+Drift also performed **worse** than plain Naive.
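
The panel-level metrics are pooled over every (series, month) pair in the test window for which both the observed price $y_i$ and the forecast $\hat{y}_i$ are present. As implemented in `compute_panel_metrics`, with $N$ the number of such pairs:

$$
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2},
\qquad
\mathrm{sMAPE} = \frac{100}{N}\sum_{i=1}^{N} \frac{\left| \hat{y}_i - y_i \right|}{\left( |y_i| + |\hat{y}_i| \right)/2}
$$

The code adds $10^{-9}$ to $|y_i| + |\hat{y}_i|$ before halving, which only matters when both values are zero. MAE and RMSE are in price units, so high-priced products dominate them; sMAPE is scale-free.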

**Conclusion for h=3:**  
Ethiopian food prices are highly volatile, and the only strong autocorrelation is at lag 1, so a simple **Naive (last-value) model** is the most accurate and robust at this horizon. Extra structure (ARIMA, SARIMA, XGB, drift) does **not** improve short-horizon forecasts.

#### Horizon h = 6 (longer horizon check)

- **sf_Naive_h6**: MAE ≈ 11.90, RMSE ≈ 18.87, sMAPE ≈ 10.51%
- **sf_ARIMA_h6**: MAE ≈ **11.01**, RMSE ≈ 16.92, sMAPE ≈ 10.55%
- **sf_SARIMA_h6**: MAE ≈ 11.29, RMSE ≈ **16.73**, sMAPE ≈ 10.94%
- **sf_NaiveDrift_h6**: clearly worse than all others.

**Conclusion for h=6:**  
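At 6 months, **ARIMA/SARIMA slightly outperform Naive** on MAE and RMSE (ARIMA has the lowest MAE, SARIMA the lowest RMSE), while Naive still has the marginally lowest sMAPE. This suggests some weak medium-term structure, but the gains are modest.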
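
To put "modest" into numbers, the snippet below computes the relative gap to Naive from the h = 6 metrics reported above. It only re-uses those reported values and runs no model.

```python
# Reported panel metrics at h = 6 (copied from the list above)
h6 = {
    "Naive":  {"mae": 11.90, "rmse": 18.87, "smape": 10.51},
    "ARIMA":  {"mae": 11.01, "rmse": 16.92, "smape": 10.55},
    "SARIMA": {"mae": 11.29, "rmse": 16.73, "smape": 10.94},
}

def pct_change_vs_naive(model, metric):
    """Relative difference to Naive in percent (negative = better than Naive)."""
    base = h6["Naive"][metric]
    return 100.0 * (h6[model][metric] - base) / base

for model in ("ARIMA", "SARIMA"):
    gaps = {m: round(pct_change_vs_naive(model, m), 1) for m in ("mae", "rmse", "smape")}
    print(model, gaps)
```

In relative terms, ARIMA and SARIMA reduce MAE by roughly 5 to 8% and RMSE by roughly 10 to 11%, while their sMAPE is slightly higher than Naive's.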

---

### Final decision for the dashboard

Given the planning need is **3 months ahead**, and Ethiopian prices are highly volatile:

- **Chosen production model:** `sf_Naive_h3`  
  → Forecast for each `(admin_1, product, month+h)` = **last observed price** for that series.

- Feature-based XGB + hybrid and ARIMA/SARIMA remain:
  - Logged in MLflow for tracking and future experiments.
  - Potentially useful later for:
    - longer horizons,
    - richer exogenous data,
    - scenario / sensitivity analysis.

For now, the dashboard’s operational forecast adopts the **Naive model with horizon = 3 months**.


In [None]:
# ========================= 1. Imports & basic setup =========================

import os
os.chdir('/Users/nataschajademinnitt/Documents/5_data/food_security/')

import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

# Time-series packages
from statsforecast import StatsForecast
from statsforecast.models import Naive, WindowAverage, AutoARIMA
from utilsforecast.losses import mae, mape, rmse, smape
from utilsforecast.evaluation import evaluate

# ML & feature-based modelling
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# MLflow
import mlflow

# Your custom ETL / modelling utilities
from etl.staples_model_core import (
    impute_features,
    build_features,
    encode_ids,
    pick_features,
    time_split,
    tune_xgb,
    fit_xgb_compat,
    smape as smape_vec,
    rmse as rmse_vec,
    TEST_HORIZON_MONTHS,
    N_TRIALS,
    SEED,
)

pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

mlflow.set_experiment("ethiopia_food_prices")


In [None]:
# ========================= 2. Helper functions & classes =========================

from typing import List, Optional, Union, Dict, Tuple


def compute_panel_metrics(y_true, y_pred):
    """
    Flat MAE, RMSE, sMAPE over all series & horizons.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = ~np.isnan(y_true) & ~np.isnan(y_pred)
    y_true = y_true[mask]
    y_pred = y_pred[mask]

    if len(y_true) == 0:
        return {"mae": np.nan, "rmse": np.nan, "smape": np.nan}

    mae_val = mean_absolute_error(y_true, y_pred)
    rmse_val = np.sqrt(mean_squared_error(y_true, y_pred))
    denom = (np.abs(y_true) + np.abs(y_pred) + 1e-9) / 2.0
    smape_val = np.mean(np.abs(y_pred - y_true) / denom) * 100.0

    return {"mae": mae_val, "rmse": rmse_val, "smape": smape_val}


def log_run_to_mlflow(model_name, metrics, params=None, tags=None):
    """
    Helper to log a single model run in MLflow.
    """
    with mlflow.start_run(run_name=model_name):
        if params:
            mlflow.log_params(params)
        if tags:
            mlflow.set_tags(tags)
        mlflow.log_metrics(metrics)


class GroupResidualCorrector:
    """
    Generic group-wise residual corrector.

    - You give it:
        * group_cols: columns defining a time series group (e.g. ["product"] or ["admin_1", "product"])
        * main_pred_col: name of the base model prediction column (used to define residuals)
        * feature_cols: columns used as features for residual model (defaults to [main_pred_col])
        * target_col: name of true target column
    - It fits a small regression model per group to predict residuals:
        residual = y_true - y_pred_main
    - At prediction time, it adds:
        corrected = y_pred_main + predicted_residual
    """

    def __init__(
        self,
        group_cols: Union[str, List[str]],
        main_pred_col: str,
        feature_cols: Optional[List[str]] = None,
        target_col: str = "y",
        min_n: int = 12,
        base_estimator_factory=None,
    ):
        self.group_cols = [group_cols] if isinstance(group_cols, str) else list(group_cols)
        self.main_pred_col = main_pred_col
        self.feature_cols = feature_cols
        self.target_col = target_col
        self.min_n = min_n

        if base_estimator_factory is None:
            def default_factory():
                return Pipeline([
                    ("scaler", StandardScaler(with_mean=True, with_std=True)),
                    ("ridge", Ridge(alpha=2.0, fit_intercept=True, random_state=SEED)),
                ])
            self.base_estimator_factory = default_factory
        else:
            self.base_estimator_factory = base_estimator_factory

        self.models_: Dict[Tuple, Pipeline] = {}

    def fit(self, df: pd.DataFrame) -> "GroupResidualCorrector":
        if self.feature_cols is None:
            self.feature_cols = [self.main_pred_col]

        required = set(self.group_cols + [self.target_col, self.main_pred_col] + self.feature_cols)
        missing = required - set(df.columns)
        if missing:
            raise ValueError(f"DataFrame is missing required columns: {missing}")

        self.models_.clear()

        for g_key, sub in df.groupby(self.group_cols, observed=False):
            if len(sub) < self.min_n:
                continue

            y_true = sub[self.target_col].to_numpy(dtype=float)
            y_base = sub[self.main_pred_col].to_numpy(dtype=float)
            residuals = y_true - y_base

            X = sub[self.feature_cols].to_numpy(dtype=float)
            model = self.base_estimator_factory()
            model.fit(X, residuals)
            self.models_[g_key] = model

        return self

    def predict(
        self,
        df: pd.DataFrame,
        new_col: str = "y_pred_corrected",
        inplace: bool = False,
    ) -> pd.DataFrame:
        if not self.models_:
            raise RuntimeError("You must call .fit() before .predict().")

        if self.feature_cols is None:
            self.feature_cols = [self.main_pred_col]

        required = set(self.group_cols + [self.main_pred_col] + self.feature_cols)
        missing = required - set(df.columns)
        if missing:
            raise ValueError(f"DataFrame is missing required columns: {missing}")

        out = df if inplace else df.copy()
        out[new_col] = out[self.main_pred_col].astype(float)

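        # Iterate the groupby exactly as in fit() so the group keys have the same shape.
        # (Recent pandas yields 1-tuples when iterating over a single-column list,
        # while `.groups` can yield scalars, which would never match self.models_.)
        for g_key, sub in out.groupby(self.group_cols, observed=False):
            model = self.models_.get(g_key)
            if model is None or sub.empty:
                continue
            X = sub[self.feature_cols].to_numpy(dtype=float)
            base_vals = sub[self.main_pred_col].to_numpy(dtype=float)
            out.loc[sub.index, new_col] = base_vals + model.predict(X)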

        return out
    
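def forecast_naive_drift(train_sf, test_sf, h, drift_window=6, col_name="NaiveDrift"):
    """
    Naive + drift model:
      - per unique_id, estimate a constant drift = mean of the last `drift_window` diffs
      - forecast k steps ahead as y_t + k * drift
    The number of steps forecast per series is the number of rows that series has
    in `test_sf`; `h` is accepted for a uniform call signature but is not used.
    Returns a DataFrame with columns ['unique_id', 'ds', col_name].
    """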
    preds = []

    for uid, g_train in train_sf.groupby("unique_id", observed=False):
        g_train = g_train.sort_values("ds")
        g_test = (
            test_sf.loc[test_sf["unique_id"] == uid]
                   .sort_values("ds")
        )
        if g_test.empty:
            continue

        y_train = g_train["y"].to_numpy()
        last_y = y_train[-1]

        diffs = np.diff(y_train)
        if len(diffs) == 0:
            drift = 0.0
        else:
            drift = diffs[-drift_window:].mean()

        horizon = len(g_test)
        y_forecast = [last_y + (i + 1) * drift for i in range(horizon)]

        tmp = g_test[["unique_id", "ds"]].copy()
        tmp[col_name] = y_forecast
        preds.append(tmp)

    if not preds:
        return pd.DataFrame(columns=["unique_id", "ds", col_name])

    return pd.concat(preds, ignore_index=True)

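def make_train_test_statsforecast(df_sf, horizon):
    """Hold out the last `horizon` rows of each series as test; the rest is train."""
    counts = df_sf.groupby("unique_id")["ds"].count()
    ok_ids = counts[counts > horizon].index
    df_sub = (
        df_sf[df_sf["unique_id"].isin(ok_ids)]
        .sort_values(["unique_id", "ds"])
        .reset_index(drop=True)
    )

    # Position of each row counted from the end of its series (0 = most recent).
    # This avoids groupby.apply, whose handling of the grouping column differs
    # across pandas versions.
    rev_pos = df_sub.groupby("unique_id").cumcount(ascending=False)
    test = df_sub[rev_pos < horizon].reset_index(drop=True)
    train = df_sub[rev_pos >= horizon].reset_index(drop=True)
    return train, test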

def run_sf_models_for_horizon(df_sf_train, df_sf_test, horizon, label_suffix):
    models_sf = [
        Naive(),
        AutoARIMA(seasonal=False, alias='ARIMA'),
        AutoARIMA(season_length=12, alias='SARIMA'),
    ]

    sf = StatsForecast(models=models_sf, freq="M", n_jobs=-1)
    sf.fit(df=df_sf_train)

    preds = sf.predict(h=horizon)
    eval_df = df_sf_test.merge(preds, on=["unique_id", "ds"], how="left")

    model_cols = [c for c in eval_df.columns if c not in ["unique_id", "ds", "y"]]

    rows = []
    for col in model_cols:
        model_name = f"sf_{col}_{label_suffix}"
        metrics = compute_panel_metrics(eval_df["y"], eval_df[col])
        print(model_name, metrics)
        rows.append({"model": model_name, **metrics})

        log_run_to_mlflow(
            model_name=model_name,
            metrics=metrics,
            params={"family": "statsforecast", "horizon": horizon},
            tags={"kind": f"baseline_{label_suffix}"}
        )

    # Naive+drift for this horizon
    nd_preds = forecast_naive_drift(df_sf_train, df_sf_test, h=horizon,
                                    drift_window=6, col_name="NaiveDrift")
    eval_nd = df_sf_test.merge(nd_preds, on=["unique_id", "ds"], how="left")
    metrics_nd = compute_panel_metrics(eval_nd["y"], eval_nd["NaiveDrift"])

    nd_name = f"sf_NaiveDrift_{label_suffix}"
    print(nd_name, metrics_nd)
    rows.append({"model": nd_name, **metrics_nd})
    log_run_to_mlflow(
        model_name=nd_name,
        metrics=metrics_nd,
        params={"family": "naive_drift", "horizon": horizon, "drift_window": 6},
        tags={"kind": f"baseline_plus_{label_suffix}"}
    )

    return rows


In [None]:
# ========================= 3. Load data & build panels =========================

PARQUET_PATH = "data/processed/ethiopia_foodprices_model_panel_tierA.parquet"
panel = pd.read_parquet(PARQUET_PATH)

print("Top-level columns:", list(panel.columns))

# --- A) StatsForecast panel (unique_id, ds, y) ---

H = TEST_HORIZON_MONTHS  # test/forecast horizon in months

df_sf = (
    panel.rename(columns={"month": "ds", "value_imputed": "y"})
          .assign(
              ds=lambda d: pd.to_datetime(d["ds"], errors="coerce")
                             .dt.to_period("M").dt.to_timestamp("M"),
              unique_id=lambda d: d["admin_1"].astype(str) + " · " + d["product"].astype(str),
          )[["unique_id", "ds", "y"]]
          .groupby(["unique_id", "ds"], as_index=False)["y"].mean()
          .sort_values(["unique_id", "ds"])
)

# Remove series with very short history (need at least H+1 points)
counts = df_sf.groupby("unique_id")["ds"].count()
ok_ids = counts[counts > H].index
df_sf = df_sf[df_sf["unique_id"].isin(ok_ids)].reset_index(drop=True)

# Hold out the last H months of each series, using the helper defined in section 2
train_sf, test_sf = make_train_test_statsforecast(df_sf, H)

print("StatsForecast panel:", df_sf.shape, "train:", train_sf.shape, "test:", test_sf.shape)


# --- B) Feature-based panel for global XGB (your existing pipeline) ---

# Feature engineering
staples = impute_features(panel)
df_feat = build_features(staples)
df_feat = encode_ids(df_feat)
df_feat = df_feat.sort_values(["month", "admin_1", "product"]).reset_index(drop=True)

feats = pick_features(df_feat)
df_feat[feats] = df_feat[feats].apply(pd.to_numeric, errors="coerce").astype("float32")

# Split into train/test (time-based)
train_df, test_df = time_split(df_feat, horizon=TEST_HORIZON_MONTHS)

X_train, X_test = train_df[feats], test_df[feats]
y_train_log = np.log1p(train_df["y"])
y_test_true = test_df["y"].to_numpy()

print("Feature-based panel train/test:", train_df.shape, test_df.shape)


In [None]:
# ========================= 4. StatsForecast baselines =========================

results = []   # will collect summary metrics for all models

models_sf = [
    Naive(),
    WindowAverage(window_size=12),
    AutoARIMA(seasonal=False, alias='ARIMA'),
    AutoARIMA(season_length=12, alias='SARIMA'),
]

sf = StatsForecast(models=models_sf, freq="M", n_jobs=-1)
sf.fit(df=train_sf)

preds_sf = sf.predict(h=H)  # columns: unique_id, ds, <model cols>
eval_sf = test_sf.merge(preds_sf, on=["unique_id", "ds"], how="left")

model_cols = [c for c in eval_sf.columns if c not in ["unique_id", "ds", "y"]]

for col in model_cols:
    model_name = f"sf_{col}"
    y_true = eval_sf["y"].values
    y_pred = eval_sf[col].values

    metrics = compute_panel_metrics(y_true, y_pred)
    results.append({"model": model_name, **metrics})

    log_run_to_mlflow(
        model_name=model_name,
        metrics=metrics,
        params={"family": "statsforecast", "horizon": H},
        tags={"kind": "baseline"}
    )

# --- Naive + drift model on the same horizon H ---

nd_preds = forecast_naive_drift(train_sf, test_sf, h=H, drift_window=6, col_name="NaiveDrift")
eval_nd = test_sf.merge(nd_preds, on=["unique_id", "ds"], how="left")

metrics_nd = compute_panel_metrics(eval_nd["y"], eval_nd["NaiveDrift"])
print("Naive+drift metrics:", metrics_nd)

results.append({"model": "sf_NaiveDrift", **metrics_nd})

log_run_to_mlflow(
    model_name="sf_NaiveDrift",
    metrics=metrics_nd,
    params={"family": "naive_drift", "horizon": H, "drift_window": 6},
    tags={"kind": "baseline_plus"}
)

pd.DataFrame(results)


In [None]:
# ========================= 4b. StatsForecast rolling-origin cross-validation =========================

# StatsForecast, Naive, WindowAverage, AutoARIMA, the loss functions and `evaluate`
# are already imported in section 1; only SeasonalNaive is new here.
from statsforecast.models import SeasonalNaive

H_LIST = [3, 6]  # horizons you want to compare

def evaluate_statsforecast_models(df_sf, h, n_windows=3):
    """
    Run StatsForecast cross-validation for baseline models at horizon h
    and return a tidy metrics DataFrame (one row per model).
    """
    models = [
        Naive(),  # last value
        SeasonalNaive(season_length=12, alias="SeasonalNaive_12"),
        WindowAverage(window_size=3, alias="WindowAvg_3"),
        AutoARIMA(seasonal=False, alias="ARIMA"),
        AutoARIMA(season_length=12, alias="SARIMA"),
    ]
    
    sf = StatsForecast(models=models, freq="M", n_jobs=-1)
    
    cv_df = sf.cross_validation(
        h=h,
        df=df_sf,
        n_windows=n_windows,
        step_size=h,
        refit=True,
    )
    
    # utilsforecast.evaluation.evaluate returns a *wide* table:
    # columns: ['unique_id', 'metric', 'Naive', 'SeasonalNaive_12', ...]
    cv_eval = evaluate(
        cv_df.drop(["cutoff"], axis=1),
        metrics=[mae, mape, rmse, smape],
    )
    
    # Melt to long: one row per (unique_id, metric, model)
    id_vars = [c for c in ["unique_id", "metric"] if c in cv_eval.columns]
    cv_long = cv_eval.melt(
        id_vars=id_vars,
        var_name="model",
        value_name="value",
    )
    
    # Average across series: one row per (model, metric)
    out = (
        cv_long
        .groupby(["model", "metric"], observed=False)["value"]
        .mean()
        .reset_index()
        .pivot(index="model", columns="metric", values="value")
        .reset_index()
    )
    out["horizon"] = h
    return out

all_results = []
for h in H_LIST:
    res_h = evaluate_statsforecast_models(df_sf, h=h, n_windows=3)
    all_results.append(res_h)

sf_results = pd.concat(all_results, ignore_index=True)
sf_results = sf_results.set_index(["model", "horizon"]).sort_index()
sf_results


In [None]:
# ========================= 5. StatsForecast Compare Horizons =========================

H_SHORT = TEST_HORIZON_MONTHS   # operational horizon
H_LONG = 6                      # longer-horizon check

train_sf_short, test_sf_short = make_train_test_statsforecast(df_sf, H_SHORT)
train_sf_long,  test_sf_long  = make_train_test_statsforecast(df_sf, H_LONG)

print("Short horizon:", train_sf_short.shape, test_sf_short.shape)
print("Long horizon :", train_sf_long.shape,  test_sf_long.shape)

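# Start a fresh results list; this discards the rows collected in section 4.
results = []

# Derive the label from the horizon so it cannot drift from TEST_HORIZON_MONTHS
rows_short = run_sf_models_for_horizon(train_sf_short, test_sf_short, H_SHORT, label_suffix=f"h{H_SHORT}")
rows_long  = run_sf_models_for_horizon(train_sf_long,  test_sf_long,  H_LONG,  label_suffix=f"h{H_LONG}")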

results.extend(rows_short)
results.extend(rows_long)

results_df = (
    pd.DataFrame(results)
    .set_index("model")
    .sort_values(["smape", "rmse"])
)
display(results_df)



In [None]:
# ========================= 6. Global XGBoost model (feature-based) =========================

# How many distinct months of history in the training period?
hist_month_counts = (
    train_df
    .groupby(["admin_1", "product"])["month"]
    .nunique()
    .rename("n_hist_months")
    .reset_index()
)

test_df = test_df.merge(hist_month_counts, on=["admin_1", "product"], how="left")
test_df["has_12m_hist"] = test_df["n_hist_months"].fillna(0) >= 12

print("Number of test rows with ≥12 months history:", test_df["has_12m_hist"].sum())

print(f"Tuning XGB global model ({N_TRIALS} trials)…")
best_params = tune_xgb(X_train, y_train_log, n_trials=N_TRIALS, seed=SEED)
print("Best params:", best_params)

# Simple validation split for early stopping: rows from the cutoff month onward.
# Note: an offset of 2 combined with `>=` keeps the last three calendar months.
last_train_m = train_df["month"].max()
val_cut = (last_train_m.to_period("M") - 2).to_timestamp()

val_mask = train_df["month"] >= val_cut
X_tr_es, y_tr_es = X_train[~val_mask], y_train_log[~val_mask]
X_va_es, y_va_es = X_train[val_mask],  y_train_log[val_mask]

if len(X_va_es) == 0:
    X_tr_es, y_tr_es = X_train, y_train_log
    X_va_es = y_va_es = None

final_model = fit_xgb_compat(
    best_params,
    X_tr_es,
    y_tr_es,
    X_va_es,
    y_va_es,
    early_rounds=200,
)

# Base predictions (in level space)
train_df["y_pred_xgb"] = np.expm1(final_model.predict(X_train))
test_df["y_pred_xgb"] = np.expm1(final_model.predict(X_test))

# Restrict evaluation to rows with adequate history
mask_ok = test_df["has_12m_hist"]
eval_xgb = test_df[mask_ok].copy()

valid_mask_base = eval_xgb["y_pred_xgb"].notna() & eval_xgb["y"].notna()

metrics_xgb_base = compute_panel_metrics(
    eval_xgb.loc[valid_mask_base, "y"],
    eval_xgb.loc[valid_mask_base, "y_pred_xgb"],
)
print("Global XGB base metrics:", metrics_xgb_base)

results.append({"model": "xgb_global_base", **metrics_xgb_base})

log_run_to_mlflow(
    model_name="xgb_global_base",
    metrics=metrics_xgb_base,
    params={
        "family": "global_ml",
        "horizon": TEST_HORIZON_MONTHS,
        "n_features": len(feats),
        **{f"xgb_{k}": v for k, v in best_params.items()},
    },
    tags={"kind": "candidate", "uses_residual_correction": False},
)


In [None]:
# ========================= 7. Hybrid / residual-corrected XGB =========================

# Fit residual correctors per product
corrector_xgb = GroupResidualCorrector(
    group_cols="product",                 # or ["admin_1", "product"]
    main_pred_col="y_pred_xgb",
    feature_cols=["y_pred_xgb"],         # simple correction, can add more features later
    target_col="y",
    min_n=12,
)

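# Note: these residuals are in-sample (the XGB model was trained on train_df),
# so they are likely smaller than the errors the corrector will see on the test set.
corrector_xgb.fit(train_df)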

# Apply to test set (only rows with ≥12m history are evaluated)
eval_xgb = corrector_xgb.predict(eval_xgb, new_col="y_pred_xgb_hybrid", inplace=True)

valid_mask_hybrid = eval_xgb["y_pred_xgb_hybrid"].notna() & eval_xgb["y"].notna()

metrics_xgb_hybrid = compute_panel_metrics(
    eval_xgb.loc[valid_mask_hybrid, "y"],
    eval_xgb.loc[valid_mask_hybrid, "y_pred_xgb_hybrid"],
)
print("Global XGB hybrid metrics:", metrics_xgb_hybrid)

results.append({"model": "xgb_global_hybrid", **metrics_xgb_hybrid})

log_run_to_mlflow(
    model_name="xgb_global_hybrid",
    metrics=metrics_xgb_hybrid,
    params={
        "family": "global_ml",
        "horizon": TEST_HORIZON_MONTHS,
        "n_features": len(feats),
        "residual_group": "product",
        "residual_features": "y_pred_xgb",
        **{f"xgb_{k}": v for k, v in best_params.items()},
    },
    tags={"kind": "production_candidate", "uses_residual_correction": True},
)


In [None]:
# ========================= 8. Final model comparison table =========================

results_df = (
    pd.DataFrame(results)
    .drop_duplicates(subset=["model"])
    .set_index("model")
    .sort_values("smape")   # smaller is better
)

print("Model ranking (lower is better):")
display(results_df)

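# Note: the table can mix horizons and, for XGB, a filtered test set (≥12 months of
# history), so the top row is indicative rather than a like-for-like winner.
best_model_name = results_df.index[0]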
best_model_metrics = results_df.iloc[0].to_dict()

print("\nBest model:", best_model_name)
print("Metrics:", best_model_metrics)
