# Model validation for time series data

#### 🎯 Learning Goals

1. **Model validation** for time series data
2. Proper **data preprocessing** and feature engineering

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit
from statsmodels.tsa.seasonal import seasonal_decompose
import statsmodels.api as sm

## Validation Strategies

### Time Series specific challenges

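Time series data presents unique challenges for model validation because the observations are ordered in time. Standard K-Fold cross-validation, which assigns observations to folds without regard to that order, needs to be adapted for two reasons:

1. Observations are serially correlated, so the training and validation sets are not independent. The validation error then understates the true out-of-sample error, which can hide overfitting.
2. The model may be trained on observations that occur after the validation data, that is, on future information, which leads to unrealistically good performance metrics.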
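The second point can be made concrete with a small sketch. It assumes 12 time-ordered toy observations (not the course data) and a hypothetical helper `trains_on_future`; it compares scikit-learn's shuffled `KFold` with `TimeSeriesSplit`, which always keeps the training indices before the test indices.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Toy data (assumption): 12 observations indexed in time order
X = np.arange(12).reshape(-1, 1)

def trains_on_future(splitter, X):
    """Return True if any fold trains on observations later than its earliest test point."""
    return any(train.max() > test.min() for train, test in splitter.split(X))

kfold_leaks = trains_on_future(KFold(n_splits=3, shuffle=True, random_state=0), X)
tss_leaks = trains_on_future(TimeSeriesSplit(n_splits=3), X)

print("KFold trains on future data:", kfold_leaks)
print("TimeSeriesSplit trains on future data:", tss_leaks)
```

With `KFold`, the fold that tests on the earliest observation necessarily trains on later ones, whereas `TimeSeriesSplit` never does.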

### Out-of-sample Evaluation (OOS)

OOS evaluation is a widely used and intuitive approach for time series. It reserves test data that occur strictly after the training data, which mimics how the model is used for forecasting. We explore two methods: expanding window and sliding (rolling) window validation.

#### Multiple Train-Test Splits

The expanding window validation starts with a fixed base training set. In the first iteration, the model is tested on the initial test set. For each subsequent iteration, the previous test set is added to the training data, and a new test set of the same size is introduced. This process continues until the data is exhausted.

<img src="img/expanding_window_validation.jpg" alt="Expanding Window Validation" width="600"/>
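The expanding window scheme can be sketched as a small index generator. The function name `expanding_window_splits` and the series length of 1000 are assumptions made for illustration only.

```python
def expanding_window_splits(n_obs, initial_train, test_size):
    """Yield (train, test) index ranges; the training set grows by test_size each step."""
    train_end = initial_train
    while train_end + test_size <= n_obs:
        # Training always starts at 0; the previous test block is absorbed into it
        yield range(0, train_end), range(train_end, train_end + test_size)
        train_end += test_size

# Assumed setup: 1000 observations, 750 for the base training set, test blocks of 50
splits = list(expanding_window_splits(n_obs=1000, initial_train=750, test_size=50))
for train, test in splits:
    print(f"train [{train.start}, {train.stop})  test [{test.start}, {test.stop})")
```

Under these assumptions the generator produces five splits, with the training set growing from 750 to 950 observations.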

In the rolling or sliding window validation, the training data size remains constant. As the validation progresses, the start point of the training data is moved forward.

<img src="img/rolling_window_validation.jpg" alt="Rolling Window Validation" width="600"/>
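A matching sketch for the rolling window is shown here. The name `sliding_window_splits` and the series length of 1000 are again illustrative assumptions; the only difference from an expanding window is that the start of the training range moves forward with each step.

```python
def sliding_window_splits(n_obs, train_size, test_size):
    """Yield (train, test) index ranges with a fixed-length training window."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train_end = start + train_size
        yield range(start, train_end), range(train_end, train_end + test_size)
        start += test_size  # shift the whole window forward by one test block

# Assumed setup: 1000 observations, training window of 750, test blocks of 50
splits = list(sliding_window_splits(n_obs=1000, train_size=750, test_size=50))
for train, test in splits:
    print(f"train [{train.start}, {train.stop})  test [{test.start}, {test.stop})")
```

Every training range has exactly 750 observations, so older data is dropped as newer data enters.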

Performance is evaluated by aggregating the loss (for example, averaging the MSE) over the predictions on all test segments.
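As a minimal illustration of this aggregation, the sketch below uses randomly generated, hypothetical forecast errors for 5 test segments of 50 observations each. It computes one MSE per segment and then averages them; with equally sized segments this equals the MSE over all pooled test observations.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical forecast errors: 5 test segments (rows) of 50 observations each
errors = rng.normal(size=(5, 50))

fold_mse = (errors**2).mean(axis=1)  # one MSE per test segment
avg_mse = fold_mse.mean()            # average of the per-segment MSEs
pooled_mse = (errors**2).mean()      # MSE over all test observations at once

print(fold_mse.round(3))
print(avg_mse, pooled_mse)
```

If the segments had different lengths, the two aggregates would generally differ, and one would have to decide whether each segment or each observation should carry equal weight.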

____
#### ➡️ ✏️ Task 1


We want to fit a model to the time series data. In particular, we want to fit an AR(p) model for stationary $\{y_t\}$. We are unsure about $p$ and hence wish to perform model validation.

+ Check model performance for $p \in \{1, 2, \ldots, 10\}$. Use the first 750 observations for training and the remaining data for testing.
+ Use `sm.tsa.ARIMA` to initialize an appropriate object. 
+ Use `statsmodels.tsa.arima.model.ARIMAResults.forecast` to forecast the values in the testing data.
+ Record the mean-squared error (MSE) of your forecasts.
+ Which model do you select?
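For reference, the AR(p) model and the forecast MSE are commonly written as follows. The symbols $c$, $\phi_i$, $\varepsilon_t$, $T$ and $H$ are notation introduced here, with $T$ the number of training observations and $H$ the number of test observations.

$$
y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i} + \varepsilon_t
$$

$$
\mathrm{MSE} = \frac{1}{H} \sum_{h=1}^{H} \left( y_{T+h} - \hat{y}_{T+h \mid T} \right)^2
$$

Here $\hat{y}_{T+h \mid T}$ denotes the $h$-step-ahead forecast made with information up to time $T$.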

In [16]:
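series1 = pd.read_csv("data/series1.csv")

# One row of MSE values, one column per AR order p = 1, ..., 10
results = pd.DataFrame(np.zeros((1, 10)), columns=np.arange(1, 11))
training = series1[:750]   # first 750 observations
testing = series1[750:]    # remaining observations

for p in range(1, 11):
    # An AR(p) model is an ARIMA(p, 0, 0) model
    mod = sm.tsa.ARIMA(training, order=(p, 0, 0))
    res = mod.fit()
    forecast = res.forecast(steps=len(testing), signal_only=False)
    # Flatten both arrays: an (n, 1) column minus an (n,) vector would broadcast to (n, n)
    errors = testing.values.ravel() - forecast.values.ravel()
    results.loc[0, p] = (errors**2).mean()
results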

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.971271 | 2.956747 | 2.957917 | 2.956365 | 2.953334 | 2.952907 | 2.953426 | 2.953159 | 2.953736 | 2.953729 |
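One detail is worth checking in MSE computations of this kind. `pd.read_csv` returns a DataFrame, so `.values` on a one-column slice is two-dimensional, while the forecast values are one-dimensional. The NumPy sketch below uses made-up numbers to show how subtracting the two broadcasts to a full matrix of pairwise differences instead of element-wise errors.

```python
import numpy as np

y_true = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1), like DataFrame.values for one column
y_pred = np.array([1.0, 2.0, 3.0])        # shape (3,), like Series.values

pairwise = y_true - y_pred                # broadcasts to shape (3, 3)
elementwise = y_true.ravel() - y_pred     # shape (3,), the intended errors

mse_pairwise = (pairwise**2).mean()
mse_elementwise = (elementwise**2).mean()

print(pairwise.shape, mse_pairwise)
print(elementwise.shape, mse_elementwise)
```

Although the forecasts match the targets exactly in this toy case, the pairwise version reports a non-zero "MSE", so flattening both arrays first (for example with `.ravel()`) is the safer pattern.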


____
#### ➡️ ✏️ Task 2

Redo your analysis, this time using an expanding window with 5 iterations in your validation process.

+ Start with the first 750 observations for training and the next 50 observations to evaluate your model.
+ Expand the window appropriately. 
+ Record the mean-squared error (MSE) of your forecasts.
+ Which model do you select?

In [17]:
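# Your solution

# Accumulate the MSE over folds; one column per AR order p = 1, ..., 10
results = pd.DataFrame(np.zeros((1, 10)), columns=np.arange(1, 11))
n_iter = 5

for i in range(n_iter):
    lim = 750 + 50 * i     # end of the growing training window
    lim_up = lim + 50      # end of the 50-observation test window
    training = series1[:lim]
    testing = series1[lim:lim_up]
    for p in range(1, 11):
        mod = sm.tsa.ARIMA(training, order=(p, 0, 0))
        res = mod.fit()
        forecast = res.forecast(steps=50, signal_only=False)
        # Flatten both arrays so the subtraction is element-wise
        errors = testing.values.ravel() - forecast.values.ravel()
        results.loc[0, p] += (errors**2).mean()
results = results / n_iter  # average MSE across the folds
results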

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.958453 | 2.923054 | 2.925235 | 2.922448 | 2.914851 | 2.913718 | 2.913722 | 2.913965 | 2.9149 | 2.915048 |


____
#### ➡️ ✏️ Task 3

Redo your analysis, this time using a rolling window with 5 iterations in your validation process.

+ Start with the first 750 observations for training and the next 50 observations to evaluate your model.
+ Shift the window appropriately. 
+ Record the mean-squared error (MSE) of your forecasts.
+ Which model do you select?

In [18]:
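# Your solution

# Accumulate the MSE over folds; one column per AR order p = 1, ..., 10
results = pd.DataFrame(np.zeros((1, 10)), columns=np.arange(1, 11))
n_iter = 5

for i in range(n_iter):
    start = 50 * i         # start of the training window moves forward each fold
    lim = 750 + 50 * i     # end of the fixed-length (750) training window
    lim_up = lim + 50      # end of the 50-observation test window
    training = series1[start:lim]
    testing = series1[lim:lim_up]
    for p in range(1, 11):
        mod = sm.tsa.ARIMA(training, order=(p, 0, 0))
        res = mod.fit()
        forecast = res.forecast(steps=50, signal_only=False)
        # Flatten both arrays so the subtraction is element-wise
        errors = testing.values.ravel() - forecast.values.ravel()
        results.loc[0, p] += (errors**2).mean()
results = results / n_iter  # average MSE across the folds
results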

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.947801 | 2.912059 | 2.914602 | 2.911747 | 2.903073 | 2.902098 | 2.902533 | 2.904302 | 2.904866 | 2.905075 |
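To answer "Which model do you select?" for all three schemes at once, one option is to collect the recorded MSE values and take the order with the smallest value per scheme. The sketch below copies the numbers from the three output tables in this notebook; the row labels and the name `recorded_mse` are introduced here for illustration.

```python
import pandas as pd

# MSE values copied from the three output tables (columns are the AR order p)
recorded_mse = pd.DataFrame(
    [
        [2.971271, 2.956747, 2.957917, 2.956365, 2.953334,
         2.952907, 2.953426, 2.953159, 2.953736, 2.953729],
        [2.958453, 2.923054, 2.925235, 2.922448, 2.914851,
         2.913718, 2.913722, 2.913965, 2.9149, 2.915048],
        [2.947801, 2.912059, 2.914602, 2.911747, 2.903073,
         2.902098, 2.902533, 2.904302, 2.904866, 2.905075],
    ],
    index=["single split", "expanding window", "rolling window"],
    columns=range(1, 11),
)

# Order with the lowest recorded MSE for each validation scheme
best_p = recorded_mse.idxmin(axis=1)
print(best_p)
```

On these recorded values the minimum falls at $p = 6$ for every scheme, although the differences among the higher orders are small, so a more parsimonious order may also be defensible.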
