# Model Evaluation & Validation: Train Test Split

In [1]:

from modules import utils
utils.configure_plotly_template(showlegend=True)


## Overview

## Data

In [2]:
import pandas as pd

df = pd.read_parquet('../../../data/statsmodels/AirPassengers.parquet').asfreq('ME')
df.columns = ['values']

df

Unnamed: 0,values
1949-01-31,112
1949-02-28,118
...,...
1960-11-30,390
1960-12-31,432


In [3]:
import numpy as np
df['values_log'] = np.log(df['values'])

df

Unnamed: 0,values,values_log
1949-01-31,112,4.718499
1949-02-28,118,4.770685
...,...,...
1960-11-30,390,5.966147
1960-12-31,432,6.068426


In [4]:
df_base = df.copy()

In [5]:
series = df["values_log"]
series

1949-01-31    4.718499
1949-02-28    4.770685
                ...   
1960-11-30    5.966147
1960-12-31    6.068426
Freq: ME, Name: values_log, Length: 144, dtype: float64

## Previous Lesson: Overfitting Problem

1. The model was evaluated on the same series it was trained on.
2. This hides overfitting: a model can reproduce the historical series well and still forecast the future poorly.
3. Businesses depend on predicting the future, not the past.

In [6]:
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import root_mean_squared_error

model = SARIMAX(series, order=(0, 1, 1), seasonal_order=(0, 1, 1, 12))
model_fit = model.fit()

df['predictions_log'] = model_fit.predict()
df['predictions_log_exp'] = np.exp(df['predictions_log'])

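# Skip the first 13 observations: with d=1 and D=1 (period 12), the model
# needs 12 + 1 points for differencing, so the earliest predictions are unreliable.
idx = 12 + 1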

score = root_mean_squared_error(df['values'][idx:], df['predictions_log_exp'][idx:])
score

10.714860511741556
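
The score above is computed on the original scale: the model predicts the log series, the predictions are mapped back with `np.exp`, and only then compared with the observed values. A minimal sketch of that calculation, using a hand-written `rmse` helper (a hypothetical stand-in for `root_mean_squared_error`) and made-up toy numbers rather than the notebook's data:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values (illustrative only, not taken from the model above).
actual = np.array([112.0, 118.0, 132.0])
pred_log = np.log(np.array([110.0, 120.0, 130.0]))  # predictions on the log scale

# Back-transform to the original scale before scoring.
score = rmse(actual, np.exp(pred_log))
print(score)
```

Scoring after the back-transform keeps the error in the same units as the original series, which makes it comparable with models that were never log-transformed.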

## Train Test Split: Detecting Overfitting

### Split Data

In [7]:
from sklearn.model_selection import train_test_split

df = df[['values', 'values_log']]
df_train, df_test = train_test_split(df, shuffle=False, test_size=0.3)

In [8]:
df_train

Unnamed: 0,values,values_log
1949-01-31,112,4.718499
1949-02-28,118,4.770685
...,...,...
1957-03-31,356,5.874931
1957-04-30,348,5.852202


In [9]:
df_test

Unnamed: 0,values,values_log
1957-05-31,355,5.872118
1957-06-30,422,6.045005
...,...,...
1960-11-30,390,5.966147
1960-12-31,432,6.068426
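
With `shuffle=False`, the split is purely chronological: the oldest observations go to train and the most recent ones to test. The sketch below reproduces the same 100/44 split by slicing, assuming the test count is rounded up (which matches the outputs above). The series is a synthetic placeholder and the month-start index is used only for portability across pandas versions.

```python
import math
import pandas as pd

# Synthetic stand-in for the 144-month series (placeholder values, not the real data).
index = pd.date_range("1949-01-01", periods=144, freq="MS")
series = pd.Series(range(144), index=index, name="values")

test_size = 0.3
n_test = math.ceil(len(series) * test_size)  # test share rounded up
n_train = len(series) - n_test

train = series.iloc[:n_train]  # oldest observations
test = series.iloc[n_train:]   # most recent observations

print(len(train), len(test))
print(train.index[-1].date(), test.index[0].date())
```

Every training date precedes every test date, so the model never sees information from the period it is evaluated on.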


### Evaluate Model on Test Data

In [10]:
model = SARIMAX(
    df_train['values_log'],
    order=(0, 1, 1),
    seasonal_order=(0, 1, 1, 12),
    enforce_invertibility=False,
    enforce_stationarity=False,
)
model_fit = model.fit()

In [11]:
df = df_test.copy()

start, end = df.index[[0,-1]]
df["predictions_log"] = model_fit.predict(start=start, end=end)
df["predictions_log_exp"] = np.exp(df["predictions_log"])

idx = 12 + 1

score = root_mean_squared_error(df["values"][idx:], df["predictions_log_exp"][idx:])
score

43.43902943845554

In [12]:
df_test = df.copy()

### Evaluate Model on Train Data

In [13]:
df = df_train.copy()
df

Unnamed: 0,values,values_log
1949-01-31,112,4.718499
1949-02-28,118,4.770685
...,...,...
1957-03-31,356,5.874931
1957-04-30,348,5.852202


In [14]:
start, end = df.index[[0,-1]]
df["predictions_log"] = model_fit.predict(start=start, end=end)
df["predictions_log_exp"] = np.exp(df["predictions_log"])

idx = 12 + 1

score = root_mean_squared_error(df["values"][idx:], df["predictions_log_exp"][idx:])
score

8.545512270973711
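
Comparing the two scores reported in this section makes the gap explicit. The snippet below only does arithmetic on the numbers printed above. The ratio is an informal indicator, not a formal test, and what counts as "too large" is a judgment call.

```python
# RMSE values copied from the outputs above.
rmse_train = 8.545512270973711
rmse_test = 43.43902943845554

# How many times larger the out-of-sample error is than the in-sample error.
ratio = rmse_test / rmse_train
print(f"test/train RMSE ratio: {ratio:.2f}")
```

A test error roughly five times the train error suggests the in-sample score overstates how well the model forecasts unseen data.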

In [15]:
df_train = df.copy()

### Visualize Overfitting

In [18]:
df_pred = pd.DataFrame({
    'train': df_train['values'],
    'test': df_test['values'],
    'train_forecast_sarima': df_train['predictions_log_exp'],
    'test_forecast_sarima': df_test['predictions_log_exp'],
})

df_pred

Unnamed: 0,train,test,train_forecast_sarima,test_forecast_sarima
1949-01-31,112.0,,1.0,
1949-02-28,118.0,,112.0,
...,...,...,...,...
1960-11-30,,390.0,,438.001046
1960-12-31,,432.0,,499.663157


In [19]:
df_pred[12+1:].plot()

## Model Comparison: SARIMA vs ETS vs Prophet

In [20]:
from modules import utils

In [21]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df_base['values'], test_size=0.3, shuffle=False)

In [22]:
configs = {
    'sarima': {
        'model_params': {
            'order': (0, 1, 1),
            'seasonal_order': (0, 1, 1, 12),
            'enforce_invertibility': False,
            'enforce_stationarity': False,
        },
        'log_transform': True,
    },
    'ets': {
        'model_params': {
            'trend': 'add',
            'seasonal': 'mul',
            'damped_trend': False,
        },
        'log_transform': False,
    },
    'prophet': {
        'model_params': {
            'seasonality_mode': 'multiplicative',
            'yearly_seasonality': True,
        },
        'log_transform': True,
    },
}


data = {
    'train': train,
    'test': test,
}
tf = utils.TimeSeriesForecaster(**data, freq='ME')

d = []
for model_name, config in configs.items():

    forecaster = getattr(tf, model_name)
    f_train, f_test = forecaster(**config)
    forecast = {
        'train': f_train,
        'test': f_test,
    }
    
    for split in ['train', 'test']:
        data_real = data[split]
        data_forecast = forecast[split]
        d.append({
            'model': model_name,
            'split': split,
            'rmse': root_mean_squared_error(data_real, data_forecast),
        })

df = pd.DataFrame(d)
df.style

15:47:19 - cmdstanpy - INFO - Chain [1] start processing
15:47:19 - cmdstanpy - INFO - Chain [1] done processing


Unnamed: 0,model,split,rmse
0,sarima,train,263.058878
1,sarima,test,38.558619
2,ets,train,6.722249
3,ets,test,31.860423
4,prophet,train,5.751788
5,prophet,test,46.292891


In [23]:
dfp = df.pivot(index=["split"], columns="model", values="rmse")
dfp.style.background_gradient(cmap="Greens_r", axis=None).format(precision=2)

model,ets,prophet,sarima
split,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
test,31.86,46.29,38.56
train,6.72,5.75,263.06


| Goal                                      | Trust more... | Rationale                                                      |
| ----------------------------------------- | ------------- | -------------------------------------------------------------- |
| **Immediate forecast (few steps ahead)**  | Test split    | You optimize empirical performance                             |
| **Stable, reusable, interpretable model** | Diagnostics   | You make sure the model captures the underlying structure well |
| **Long multistep forecast**               | Diagnostics   | Misspecified models degrade as the horizon grows               |


**In-sample errors (residuals on the training set) alone cannot establish a model's predictive quality.** They **are useful, however, for validating the model's structure**.

---

### 🔍 Two distinct uses:

#### ✅ **In-sample residuals are useful for:**

* Checking that the model is well specified (no autocorrelation, constant variance).
* Making sure no unexplained patterns remain.
* Validating that the model's assumptions hold.
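
One common way to check residuals for leftover autocorrelation is the Ljung-Box statistic. The sketch below computes it by hand with NumPy on synthetic data, so `ljung_box_q` is a hypothetical helper written for illustration and the series are simulated, not the residuals of the models above. It compares the statistic with the 5% critical value of a chi-square distribution with 12 degrees of freedom (about 21.03). For residuals of a fitted ARMA model the degrees of freedom are usually reduced by the number of estimated parameters, which this sketch ignores.

```python
import numpy as np

def ljung_box_q(resid, lags=12):
    """Ljung-Box Q = n(n+2) * sum_{k=1..lags} rho_k^2 / (n - k)."""
    x = np.asarray(resid, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.sum(x ** 2)
    total = 0.0
    for k in range(1, lags + 1):
        rho_k = np.sum(x[k:] * x[:-k]) / denom  # sample autocorrelation at lag k
        total += rho_k ** 2 / (n - k)
    return float(n * (n + 2) * total)

rng = np.random.default_rng(0)
white = rng.normal(size=200)  # what well-behaved residuals should resemble

# Residuals with leftover structure: a strongly autocorrelated AR(1) series.
ar = np.zeros(200)
for t in range(1, 200):
    ar[t] = 0.9 * ar[t - 1] + white[t]

CHI2_95_DF12 = 21.026  # 5% critical value, chi-square with 12 degrees of freedom
q_white = ljung_box_q(white)
q_ar = ljung_box_q(ar)
print(f"white noise Q={q_white:.1f}, autocorrelated Q={q_ar:.1f}, threshold={CHI2_95_DF12}")
```

A statistic far above the threshold indicates structure the model has not captured. In practice a library routine such as statsmodels' `acorr_ljungbox` would typically be used instead of a hand-written helper.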

#### ❌ **Reporting in-sample residuals as prediction error is misleading:**

* You are measuring fit, not the ability to generalize.
* It is a common mistake that yields models with low in-sample error that fail out of sample (overfitting).

---

### 📌 Concrete example:

* An `ARIMA(12,1,1)` can have very small in-sample residuals.
* But if the model is overfit or does not generalize, its out-of-sample error will be high.
* Only by evaluating RMSE on the test set can you confirm its predictive power.
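
The effect described above can be illustrated with a simpler stand-in: plain autoregressions fitted by least squares on a simulated AR(1) series, instead of an actual `ARIMA(12,1,1)`. The helper names (`lagged_design`, `rmse`) and all data are made up for this sketch. Because the AR(12) fit nests the AR(1) fit on the same training rows, its in-sample RMSE can never be higher. Whether its test RMSE is better or worse depends on the data, which is exactly why the test set is needed.

```python
import numpy as np

def lagged_design(y, p, start):
    """Design matrix for rows t = start..len(y)-1: intercept, y[t-1], ..., y[t-p]."""
    n = len(y)
    cols = [np.ones(n - start)]
    for k in range(1, p + 1):
        cols.append(y[start - k : n - k])
    return np.column_stack(cols)

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Simulated AR(1) series (synthetic data, not the AirPassengers series).
rng = np.random.default_rng(42)
noise = rng.normal(size=120)
y = np.zeros(120)
for t in range(1, 120):
    y[t] = 0.6 * y[t - 1] + noise[t]

n_train, p_max = 80, 12
y_train = y[:n_train]

scores = {}
for p in (1, 12):
    # Fit both orders on the same training rows so the fits are comparable.
    X_train = lagged_design(y_train, p, p_max)
    coef, *_ = np.linalg.lstsq(X_train, y_train[p_max:], rcond=None)
    # One-step-ahead predictions on the held-out rows, using observed lags.
    X_test = lagged_design(y, p, n_train)
    scores[p] = {
        "train": rmse(y_train[p_max:], X_train @ coef),
        "test": rmse(y[n_train:], X_test @ coef),
    }

for p, s in scores.items():
    print(f"AR({p}): train RMSE={s['train']:.3f}  test RMSE={s['test']:.3f}")
```

The train column always favors the larger model, so it cannot be used to choose between them. Only the test column carries information about generalization.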

---

### ✅ Clear conclusion:

> **In-sample diagnostics tell you whether the model makes sense. Out-of-sample error tells you whether it is useful.**
