# Forecasting as Supervised Learning

**Approximate Learning Time**: Up to 4 hours

---


In this module, we will cast the problem of time series forecasting as a **supervised learning task**.
Supervised learning is a type of machine learning where a model learns from labeled data. In this approach, the data consists of input-output pairs, where the input is a set of features and the output is the correct answer or label (often called the "target").

The model's goal is to find patterns in the input data that can predict the correct output. It does this by learning from many examples during the training phase, where the model is shown both the input and the correct output. Once trained, the model can make predictions on new, unseen data.

For example, in a time series forecasting problem, the input might be past values of a time series (e.g., temperature over the past few days), and the output would be the predicted value (e.g., tomorrow's temperature).
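To make the input-output framing concrete, here is a minimal sketch on made-up temperature values. The helper `make_supervised` and the numbers are hypothetical and only serve as an illustration; they are not part of the tutorial's `utils` module.

```python
import numpy as np

def make_supervised(series, n_lags):
    """Turn a 1-D series into (inputs, targets) using the previous n_lags values."""
    series = np.asarray(series, dtype=float)
    # Each input row holds n_lags consecutive values; the target is the value that follows
    inputs = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    targets = series[n_lags:]
    return inputs, targets

temperatures = [20.0, 21.0, 19.0, 22.0, 23.0, 21.0]  # made-up daily temperatures
X_toy, y_toy = make_supervised(temperatures, n_lags=3)
for features, target in zip(X_toy, y_toy):
    print(features, "->", target)
```

Each printed line is one labeled example: three past values as the input and the following value as the target.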

Before the advent of advanced deep learning techniques like RNNs and LSTMs, the machine learning community approached forecasting as a supervised learning problem. This method involves crafting features for each time step, allowing a predictor to learn patterns and make future predictions. The goal is to make patterns in the data more explicit, facilitating easier learning for the model.

Thus, we will cover the following in this notebook:

- How to convert time series data into features suitable for supervised training.
- How to perform train-validation-test splits in a supervised setup for time series forecasting.
- How to apply the **XGBoost** algorithm to forecast future time steps.

We will utilize the following libraries: `sklearn`, `xgboost`, and `pandas`.

---
Let's load the log daily returns of exchange rates!

**Note**: We do **not** split the data at this stage. Splitting is done **after** the dataset has been featurized, ensuring that none of the future values are used as features in the training process.


In [None]:
import pathlib
import numpy as np
import pandas as pd
from termcolor import colored

# model builder 
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline # https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
from sklearn.neural_network import MLPRegressor

MAX_PAST_TIME_STEPS_FOR_FEATURES = 4 # time steps in the past to use for computing lag features and difference features

## WARNING: To compare different models on the same horizon, keep these settings the same across the notebooks
import sys; sys.path.append("../")
import utils

FORECASTING_HORIZON = [4, 8, 12] # weeks 
MAX_FORECASTING_HORIZON = max(FORECASTING_HORIZON)

SEQUENCE_LENGTH = 2 * MAX_FORECASTING_HORIZON
PREDICTION_LENGTH = MAX_FORECASTING_HORIZON

DIRECTORY_PATH_TO_SAVE_RESULTS = pathlib.Path('../results/DIY/').resolve()
MODEL_NAME = "XGBoost"

RESULTS_DIRECTORY = DIRECTORY_PATH_TO_SAVE_RESULTS / MODEL_NAME
if RESULTS_DIRECTORY.exists():
    print(colored(f'Directory {str(RESULTS_DIRECTORY)} already exists.'
           '\nThis notebook will overwrite results in the same directory.'
           '\nYou can also create a new directory if you want to keep this directory untouched.'
           ' Just change the `MODEL_NAME` in this notebook.\n', "red" ))
else:
    RESULTS_DIRECTORY.mkdir(parents=True)

data, transformed_data = utils.load_tutotrial_data(dataset='exchange_rate', log_transform=True)
data = transformed_data

%load_ext autoreload
%autoreload 2

--- 

## Feature Engineering

Assuming a univariate time series is defined as $\{x_t\}_{t=1}^{T} = \{x_1, x_2, \dots, x_T\}$, we aim to convert this into a supervised learning task, where the data can be described as $\mathcal{D} = \{X_t, Y_t\}_{t=1}^{T'}$. Here, $X_t$ represents the input features, and $Y_t$ is the target or label.

In a time series forecasting task, the goal is to predict the value at the next time step. Thus, $Y_t = x_{t+1}$. The challenge is to map $x_t \rightarrow X_t$, effectively transforming each time step into features that can be used by a predictive machine learning model.

In this tutorial, we will create four types of features. More feature engineering ideas will be left as an exercise for you to explore:

1. **Shift/Lag Features**: Features formed from the values of previous time steps.
2. **Difference/Jump Features**: The differences between consecutive time steps.
3. **Rolling Statistics**: Features based on rolling averages, standard deviations, etc., computed over a window of past values.
4. **Categorical Temporal Features**: Features derived from the time-related properties of the data, such as the day of the week, week of the year, etc.

### Shift/Lag Features

In this approach, features are formed from the previous time steps, i.e., $ X_t = \{x_{t-1}, x_{t-2}, \dots \} $. The maximum number of past time steps we want to include as features must be specified.

In Pandas, the `shift` operator shifts the values in a column by the amount specified in the argument. A negative value shifts the values backward, while a positive value shifts them forward. 

A **backward shift** is problematic for supervised learning tasks because it would include future values as features, which violates the concept of forecasting. Instead, we use a **forward shift** to ensure that, at any time step, we are using only past values of the time series.
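The difference between the two shift directions is easiest to see on a tiny example. The series below is made up purely for illustration.

```python
import pandas as pd

toy = pd.Series([10, 20, 30, 40], name="x")
shift_demo = pd.DataFrame({
    "x": toy,
    "forward_shift(1)": toy.shift(1),     # value from the previous step: safe as a feature
    "backward_shift(-1)": toy.shift(-1),  # value from the next step: a future value
})
print(shift_demo)
```

The forward shift leaves a `NaN` in the first row, while the backward shift leaves a `NaN` in the last row and exposes the next value at every other row.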

### Difference/Jump Features

To maximize the predictive power of the model, we aim to provide as many relevant features as possible. One set of features is the **differences** or **jumps** between consecutive time steps. These might seem redundant if past values are already used as features, but whether they add value is an empirical question that can only be answered by evaluating the model's performance on a validation dataset.

In this tutorial, we will use these difference features and leave it as an exercise for you to determine whether the jump features improve model performance.

Thus, $ X_t = \{x_t - x_{t-1}, x_{t-1} - x_{t-2}, \dots \} = \{\Delta_t, \Delta_{t-1}, \dots \} $.

Again, we need to specify the maximum number of past time steps to consider.

**Note**: We will set the maximum number of past time steps for features to `MAX_PAST_TIME_STEPS_FOR_FEATURES` (denoted as $ K $ in mathematical notation below). As a result, our feature set becomes:

$$
X_t = \{x_t, x_{t-1}, \dots, x_{t-K}, \Delta_t, \Delta_{t-1}, \dots, \Delta_{t-K+1}\}
$$

Consequently, the number of shift features will be one more than the number of delta features.
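As a quick sanity check on these counts, the sketch below builds the value and difference features for a toy series with $K = 2$. The names `K_TOY`, `toy_series`, and the column labels are hypothetical and chosen only for this illustration.

```python
import pandas as pd

K_TOY = 2  # number of past time steps (K) in this toy example
toy_series = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0], name="x")

toy_features = {"x_t": toy_series}  # the current value
for k in range(1, K_TOY + 1):
    toy_features[f"x_t-{k}"] = toy_series.shift(k)  # lagged value x_{t-k}
    # difference between x_{t-k+1} and x_{t-k}
    toy_features[f"delta_t-{k - 1}"] = toy_series.shift(k - 1) - toy_series.shift(k)
toy_features = pd.DataFrame(toy_features)

value_cols = [c for c in toy_features.columns if c.startswith("x_")]
delta_cols = [c for c in toy_features.columns if c.startswith("delta_")]
print(toy_features)
print(len(value_cols), "value features,", len(delta_cols), "difference features")
```

With the current value included, there are $K + 1$ value features and $K$ difference features.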



Let's see these features in code! 

In [None]:
shifted = data.shift(1)['0'].to_frame('0-shifted-by-1')
diff = data['0'] - data.shift(1)['0']
Y_t = data['0'].shift(-1).to_frame('Y_t')
pd.concat([data['0'], shifted, diff.to_frame('0-diff'), Y_t], axis=1)

**What do you observe?**

- The second column, `0-shifted-by-1`, contains the values of the first column but shifted forward by 1 time step, which results in a NaN in the first row. Therefore, if we use only the previous value as a feature, we must discard the first row to avoid confusing the model with `NaN`. Similarly, if we use multiple previous values (lags) as features, we will need to discard the corresponding number of rows from the beginning of the dataset.

- The last row of `Y_t` is `NaN` due to the backward shift: there is no next value to use as a target for the final time step, so this row must be discarded as well.

--- 

### Rolling Statistics as Features

Another set of features is derived from statistics of the time series based on the current and past $ K $ observations. Assuming $ f_i(x_{t}, x_{t-1}, \dots, x_{t-K}) $ is the statistical function (such as mean, median, or standard deviation), we can define the feature set as:

$$
X_{t} = \{f_i(x_t, x_{t-1}, x_{t-2}, \dots, x_{t-K})\}_{i=1}^{M}
$$

Where $ M $ represents the number of statistical features, each calculated over a **rolling window** of size $ K+1 $.

**Note**: Since we are considering `MAX_PAST_TIME_STEPS_FOR_FEATURES` for lagged features, we can use a rolling window of size `MAX_PAST_TIME_STEPS_FOR_FEATURES + 1`, as this includes the current time step in the calculation of rolling statistics. However, this is a design choice, and you may choose to experiment with excluding the current time step from the rolling window.
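If you want to experiment with this design choice, one possible way to exclude the current time step is to shift the series forward by one step before applying the rolling window. The sketch below uses a made-up series and a hypothetical window size of 3.

```python
import pandas as pd

toy_series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
window = 3

# Window covers x_t, x_{t-1}, x_{t-2}
including_current = toy_series.rolling(window=window).mean()
# Window covers x_{t-1}, x_{t-2}, x_{t-3}: shift first, then roll
excluding_current = toy_series.shift(1).rolling(window=window).mean()

print(pd.DataFrame({
    "x": toy_series,
    "including_current": including_current,
    "excluding_current": excluding_current,
}))
```

Excluding the current step costs one additional row at the start of the dataset.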

Let's see it in code!

In [None]:
# rolling statistics 
WINDOW_SIZE = 3
rolling_mean = data['0'].rolling(window=WINDOW_SIZE).mean()

# compare floats with a tolerance rather than exact equality
assert np.isclose(rolling_mean.iloc[WINDOW_SIZE-1], data['0'].iloc[:WINDOW_SIZE].mean()), "Rolling mean is not correctly computed"

pd.concat([data['0'], shifted, diff.to_frame('0-diff'), rolling_mean.to_frame('rolling_mean'), Y_t], axis=1)

**Exercise**: Why are the first two rows of the `rolling_mean` column `NaN`?

--- 
### Categorical Temporal Features

Finally, we can encode the date and timestamp into various categorical features such as the time of day, day of the week, week of the year, and other time-related attributes. These features help capture seasonal or periodic patterns in the time series data.

Let's implement this in the code:

In [None]:
quarters = pd.DataFrame(data.index.quarter.values, index=data.index, columns=['quarter'])
months = pd.DataFrame(data.index.month.values, index=data.index, columns=['month']) 
weeks = pd.DataFrame(data.index.isocalendar().week.values, index=data.index, columns=['week'])
week_of_months = data.index.to_series().apply(lambda x: (x.day - 1) // 7 + 1).to_frame('week_of_month')

pd.concat([quarters, months, weeks, week_of_months], axis=1)

--- 

Finally, let's implement these features in the dataset. As you proceed, please ensure that no future time steps have been used as features. This is crucial to avoid data leakage.


In [None]:
# Shifts and Jumps
shifted_data, jumps_data = [], []
for shift in range(1, MAX_PAST_TIME_STEPS_FOR_FEATURES + 1):
    x = data.shift(shift) # x_{t-shift}
    diff = data.shift(shift-1) - x # x_{t-shift+1} - x_{t-shift}
    x.columns = [f"{col}-shifted-by-{shift}" for col in x.columns]
    diff.columns = [f"{col}-delta-{shift}" for col in data.columns]
    shifted_data.append(x)
    jumps_data.append(diff)


# rolling statistics
WINDOW_SIZE = MAX_PAST_TIME_STEPS_FOR_FEATURES+1
rolling_data = data.rolling(window=WINDOW_SIZE)

rolling_mean = rolling_data.mean()
rolling_mean.columns = [f"{col}-rolling_mean" for col in rolling_mean.columns]

rolling_std = rolling_data.std()
rolling_std.columns = [f"{col}-rolling_std" for col in rolling_std.columns]

# categorical temporal features
quarters = pd.DataFrame(data.index.quarter.values, index=data.index, columns=['quarter'])
months = pd.DataFrame(data.index.month.values, index=data.index, columns=['month']) 
weeks = pd.DataFrame(data.index.isocalendar().week.values, index=data.index, columns=['week'])
week_of_months = data.index.to_series().apply(lambda x: (x.day - 1) // 7 + 1).to_frame('week_of_month')

# target
Y_t = data.shift(-1)
Y_t.columns = [f"{col}-Y_t" for col in Y_t.columns]

# putting it all together
df = pd.concat([data, *shifted_data, *jumps_data, rolling_mean, rolling_std, quarters, months, weeks, week_of_months, Y_t], axis=1)

# collect column names for easy referencing
CATEGORICAL_TEMPORAL_COLUMN_NAMES = ['quarter', 'month', 'week', 'week_of_month']
COLUMN_NAMES = {
    er: {
        # startswith avoids accidental matches such as '0-' inside '10-shifted-by-1'
        'input': [er] + [x for x in df.columns if x.startswith(f'{er}-') and 'Y_t' not in x],
        'target': [f"{er}-Y_t"]
    } 
    for er in data.columns # each column corresponds to one country in the data
}

df[COLUMN_NAMES['0']['input'] + CATEGORICAL_TEMPORAL_COLUMN_NAMES + COLUMN_NAMES['0']['target']].head()


**Exercise**: Can you explain why some rows have `NaN`s?

**Removing invalid rows**

Finally, we need to remove rows containing `NaN` values. In our case, the first `MAX_PAST_TIME_STEPS_FOR_FEATURES` rows will contain `NaN` values due to the use of past time steps as features. Additionally, the last row will have no future time step available, resulting in a `NaN` for the target column $ Y_t $.

In total, we should expect to remove `MAX_PAST_TIME_STEPS_FOR_FEATURES + 1` rows from the dataset.
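The row count can be checked on a toy frame. This is a minimal sketch with a hypothetical `K_TOY = 3` and made-up data; it uses only lag features and the target, which is enough to reproduce the count.

```python
import numpy as np
import pandas as pd

K_TOY = 3  # stands in for MAX_PAST_TIME_STEPS_FOR_FEATURES
toy = pd.DataFrame({"x": np.arange(10, dtype=float)})
for k in range(1, K_TOY + 1):
    toy[f"x-shifted-by-{k}"] = toy["x"].shift(k)  # NaN in the first k rows
toy["x-Y_t"] = toy["x"].shift(-1)                 # NaN in the last row

n_before = len(toy)
toy_clean = toy.dropna(axis=0)
n_removed = n_before - len(toy_clean)
print(f"Removed {n_removed} rows (expected {K_TOY + 1})")
```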


In [None]:
print("Original number of rows: ", len(df))
df = df.dropna(axis=0) # axis=0 remove entire rows
print("New number of rows: ", len(df))

---

## Train-Test Split


After computing these features, we split the dataset into **training**, **validation**, and **testing** subsets in a manner appropriate for time series forecasting.

Since this is a forecasting task, the split between the training and testing subsets should be **chronological**, rather than random, which is more common in typical supervised learning tasks. This ensures that the model is only trained on past data and tested on future data, avoiding data leakage.

However, within the training subset, unless we are explicitly evaluating the model's forecasting ability, we can treat it as a purely supervised learning setup. Here, we can randomly partition the training data into training and validation sets. We then fit the model on the training subset and evaluate the performance using the objective function over the validation subset. The validation performance is used to pick the best hyperparameters. 

**Note**: If we want to leave a testing subset with `MAX_FORECASTING_HORIZON` future steps for prediction (as we’ve done in other models throughout this tutorial), we need to ensure that the test split is aligned with this. Thus, we split the data such that the first row of the resulting testing subset corresponds to predicting the first time step in the test data, consistent with the approach used in other notebooks. To achieve this, we split the data at `-MAX_FORECASTING_HORIZON - 1`.
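The alignment described in this note can be verified on a toy series. The sketch below assumes a hypothetical horizon `H_TOY = 3` and made-up daily data; only the target column is built, since that is all the split logic depends on.

```python
import pandas as pd

H_TOY = 3  # stands in for MAX_FORECASTING_HORIZON
raw = pd.Series(
    range(10),
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
    name="x",
    dtype=float,
)

features = raw.to_frame()
features["Y_t"] = raw.shift(-1)  # target: the next value
features = features.dropna()     # the last row has no target

# Chronological split at -H_TOY - 1 of the raw index
split_point = raw.index[-H_TOY - 1]
toy_train = features[features.index < split_point]
toy_test = features[features.index >= split_point]

print(toy_test)
```

The test subset has exactly `H_TOY` rows, and its targets are the last `H_TOY` values of the raw series.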


In [16]:
# form training and test data
train_val_data = df[df.index < data.index[-MAX_FORECASTING_HORIZON-1]]
test_data = df[df.index >= data.index[-MAX_FORECASTING_HORIZON-1]]

--- 

## Preprocessing Data


Before we proceed with model building, we need to preprocess the data. Specifically, we aim to do the following:

- **Normalize the numerical columns** using `StandardScaler` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)): This involves subtracting the **mean** of each column and dividing by the **standard deviation**. This process ensures that each numerical feature has a mean of 0 and a standard deviation of 1, making the features more comparable in magnitude.
  
- **Convert the categorical columns** using ordinal encoding: We have categorical columns related to temporal features. These categories have an order. For example, March comes after February. Thus, we need to encode these columns as ordinal features. We leverage `OrdinalEncoder` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)) to do so. This is necessary because without it, the model may interpret the raw integer values (e.g., [3, 4, 5]) as having numerical significance. Ordinal encoding transforms these values into `[0, 1, 2]`, where only the order matters, not the magnitude.

Finally, to prevent data leakage (i.e., allowing information from the test set to influence the training set), we define a preprocessor using `ColumnTransformer` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)) and fit it only on the **training data**. The preprocessor will then be applied to both the training and test data. This ensures that the scaling and encoding are learned from the training data and not influenced by the test or validation sets.

Although this transformation will be automatically applied by the model during training (as we’ll see in the final model training section), for now, we'll demonstrate this preprocessing on the combined training and validation data.

Let’s see how this looks in practice.

In [None]:
# Determine input and target columns
NUMERICAL_X_COLUMNS = [x  for er in data.columns for x in COLUMN_NAMES[er]['input']]
X_COLUMNS = NUMERICAL_X_COLUMNS + CATEGORICAL_TEMPORAL_COLUMN_NAMES
Y_COLUMNS = [x  for er in data.columns for x in COLUMN_NAMES[er]['target']]

X, Y = train_val_data[X_COLUMNS], train_val_data[Y_COLUMNS]


# define the preprocessing steps
preprocessor = ColumnTransformer(
    transformers = [
        ('cat', OrdinalEncoder(), CATEGORICAL_TEMPORAL_COLUMN_NAMES), # all the categorical columns are ordinally encoded
        ('num', StandardScaler(), NUMERICAL_X_COLUMNS) # all the numerical columns are scaled using mean and standard deviation
    ]
)


print("Shape of X before transformation: ", X.shape)
transformed_data = preprocessor.fit_transform(X)
print("Shape of transformed data: ", transformed_data.shape)
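
To see the "fit on training data only" idea in isolation, here is a minimal sketch on made-up numbers. The arrays are hypothetical; the point is that the statistics and category codes come from the training values and are then reused unchanged on new data.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

train_values = np.array([[1.0], [2.0], [3.0]])
test_values = np.array([[10.0]])

# The scaler learns mean and standard deviation from the training values only
scaler = StandardScaler().fit(train_values)
scaled_train = scaler.transform(train_values)
scaled_test = scaler.transform(test_values)  # reuses the training statistics
print("train mean used for scaling:", scaler.mean_[0])
print("scaled test value:", scaled_test[0, 0])

# The encoder maps the sorted categories seen during fit to 0, 1, 2, ...
encoder = OrdinalEncoder().fit(np.array([[3], [4], [5]]))
codes = encoder.transform(np.array([[3], [4], [5]])).ravel()
print("ordinal codes:", codes)
```

The test value is scaled with the training mean, so it can land far outside the range seen during training. This is expected and is exactly what happens at inference time.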




--- 

## eXtreme Gradient Boosting (XGBoost) 

While there are many supervised learning algorithms that can be applied to this task, we will focus on the XGBoost algorithm. Exploring other models is left as an exercise for you to experiment with.

<ins>**What is XGBoost?**</ins>

XGBoost stands for eXtreme Gradient Boosting. It's an implementation of a machine learning technique called Gradient Boosting, which builds an ensemble (a collection) of decision trees to make predictions.

<ins>**What does XGBoost do?**</ins>

Imagine you’re trying to guess someone's age based on some clues, and you have multiple people giving you hints. Instead of one person trying to guess the age in one go, each person takes turns making a better guess than the last one. The first person might make a rough guess, and then the next person improves on that by correcting some mistakes. Over time, the combined effort of all these guesses leads to a much better final prediction. This is similar to how XGBoost works:

- Build Trees One by One: XGBoost starts with a simple model (often a single tree), and each subsequent tree is built to improve the errors made by the previous ones.

- Focus on Mistakes: Each new tree tries to correct the mistakes made by the earlier trees. This gradual improvement helps create a strong, accurate final model.

- Combine Trees: The final model is made up of many small trees, each contributing to the prediction. This combination of many models (trees) results in more accurate predictions.
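The three ideas above can be sketched with plain `sklearn` decision trees. This is a simplified illustration of gradient boosting with squared error on made-up data, not XGBoost's actual algorithm, which additionally uses second-order gradient information and regularization. The learning rate, depth, and number of rounds are arbitrary choices for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y_toy, y_toy.mean())  # start from a constant guess
errors = [np.mean((y_toy - prediction) ** 2)]

for _ in range(20):
    residuals = y_toy - prediction                  # mistakes of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)  # add a small correction
    errors.append(np.mean((y_toy - prediction) ** 2))

print(f"training MSE: {errors[0]:.4f} -> {errors[-1]:.4f}")
```

Each small tree is fit to the residuals left by the trees before it, and the training error shrinks as trees are added.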

For a more formal discussion, refer to this [tutorial](https://xgboost.readthedocs.io/en/stable/tutorials/model.html). In this tutorial, we will use the `xgboost` ([documentation](https://xgboost.readthedocs.io/en/stable/)) library to implement a regression task using `xgboost.XGBRegressor`.

<ins>**Parameters for `xgboost.XGBRegressor`**</ins>

For a full list of parameters, refer to this [documentation](https://xgboost.readthedocs.io/en/stable/parameter.html). For our purposes, we will use the following **XGBoost parameters**:

- `objective='reg:absoluteerror'`: This defines the objective function that XGBoost will minimize. In this case, it minimizes the Mean Absolute Error (MAE) during training.
  
- `enable_categorical=True`: This enables XGBoost's native support for categorical features. Note that XGBoost identifies categorical features from the pandas `category` dtype (or from explicitly supplied `feature_types`). Because `ColumnTransformer` returns a NumPy array by default, the ordinal codes produced by `sklearn` most likely reach XGBoost as ordinary numerical columns in this pipeline, so the flag matters mainly for setups that pass `category`-typed data.

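- `eval_metric='mae'`: This specifies the evaluation metric that XGBoost computes after each boosting iteration on any evaluation data supplied through `eval_set` in `fit`. Combined with `early_stopping_rounds`, it can be used to stop training once performance stops improving. XGBoost does not create a validation set on its own, and since this notebook passes neither `eval_set` nor `early_stopping_rounds`, the setting does not change training here.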

---

## K-fold Cross-Validation & Hyperparameter Tuning


<ins>**Cross Validation**</ins>

Cross-validation refers to the practice of training a model on a subset of the dataset and evaluating its performance on the remaining data. There are various cross-validation strategies, depending on the dataset and task. For our regression task, we can choose between methods such as `KFold` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)) or, if you prefer to fix the train and validation split from the start, `PredefinedSplit` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.PredefinedSplit.html)).

In **KFold cross-validation**, the dataset is split into K subsets (folds). The model is then trained K times—each time using K-1 folds for training and the remaining fold for validation. The overall performance is the average of the model's performance across all K validation folds.
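The K-fold procedure can be written out by hand. The sketch below uses made-up data and a plain linear model as a stand-in for the real pipeline, just to show the loop over folds and the final averaging.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = X_toy @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=50)

fold_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_toy):
    # Train on K-1 folds, validate on the held-out fold
    model = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    fold_mae = mean_absolute_error(y_toy[val_idx], model.predict(X_toy[val_idx]))
    fold_scores.append(fold_mae)

cv_mae = float(np.mean(fold_scores))  # overall performance: average across folds
print("MAE per fold:", np.round(fold_scores, 3))
print("Cross-validated MAE:", round(cv_mae, 3))
```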


<ins>**Hyperparameter Search**</ins>

As with any model, some parameters cannot be learned during training and must be selected beforehand—these are known as hyperparameters. To find the best hyperparameters, we perform a hyperparameter search. `sklearn` provides methods for this, and we will use `GridSearchCV` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)), which systematically evaluates all specified combinations of parameters and selects the best-performing one.

Key parameters for `GridSearchCV` include:

- `refit=True`: Ensures that once the best hyperparameters are found, the model is refit on the entire training and validation dataset using the optimal parameters. This is important for consistency, as we followed this procedure in previous notebooks.

- `cv=5`: Specifies the number of folds for cross-validation. Here, we split the dataset into 5 folds, training 5 models, each on 4 folds (80% of the data) and validating on the remaining fold (20%). The average score across validation folds is considered for each parameter set. While this provides robust results, it can be computationally expensive. You can reduce the number of folds, but this may affect the stability of the results (why?). Using `n_jobs=-1` leverages all available CPUs to speed up the process.

- `param_grid`: Specifies the hyperparameter combinations to be explored in the grid search.

- `scoring='neg_mean_absolute_error'`: Defines the evaluation metric for the validation data. Ideally, we would use MASE (as defined earlier), but implementing a custom scoring function is more involved and beyond the scope of this tutorial. We leave that as an extension exercise.
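Before running the full search, it can help to see these parameters on a small problem. The sketch below swaps in a `Ridge` regressor and made-up data so that it runs in a moment; the grid values are arbitrary. Note the `regressor__` prefix, which routes a parameter to the pipeline step with that name, and that the reported score is the negated MAE, so values closer to zero are better.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([1.0, 0.5, -1.0]) + rng.normal(scale=0.1, size=60)

toy_pipeline = Pipeline([("scaler", StandardScaler()), ("regressor", Ridge())])
toy_grid = GridSearchCV(
    toy_pipeline,
    param_grid={"regressor__alpha": [0.01, 1.0, 100.0]},  # 3 candidates
    cv=5,                                # 5 fits per candidate
    scoring="neg_mean_absolute_error",   # negated so that higher is better
    refit=True,                          # refit the best candidate on all data
)
toy_grid.fit(X_toy, y_toy)

print("Best params:", toy_grid.best_params_)
print("Best score (negative MAE):", toy_grid.best_score_)
```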

---

**Important Notes**:

- With `refit=True`, the preprocessor will be refit on the combined training and validation data once the best model is selected. However, during cross-validation, the preprocessor is fit **only on the training folds** for each split, which mimics how the model will be applied at inference time on the test set.

- While XGBoost provides an internal cross-validation method that is faster, we avoid using it here for simplicity. This is left as an exercise for further exploration.

- If you have limited CPU resources, consider reducing the number of grid parameters or the number of folds to speed up the training process.


In [None]:
# Determine input and target columns
NUMERICAL_X_COLUMNS = [x  for er in data.columns for x in COLUMN_NAMES[er]['input']]
X_COLUMNS = NUMERICAL_X_COLUMNS + CATEGORICAL_TEMPORAL_COLUMN_NAMES
Y_COLUMNS = [x  for er in data.columns for x in COLUMN_NAMES[er]['target']]

X, Y = train_val_data[X_COLUMNS], train_val_data[Y_COLUMNS]

# define the model
xgb = XGBRegressor(objective='reg:absoluteerror', random_state=42, enable_categorical=True, eval_metric='mae')

# define the preprocessing steps
preprocessor = ColumnTransformer(
    transformers = [
        ('cat', OrdinalEncoder(), CATEGORICAL_TEMPORAL_COLUMN_NAMES),
        ('num', StandardScaler(), NUMERICAL_X_COLUMNS)
    ]
)

# define the pipeline of preprocessing and regression
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', xgb)
])

# define the parameters to sweep over
xgb_param_grid = {
    'regressor__n_estimators': [100, 200],
    'regressor__learning_rate': [0.01, 0.1],
    'regressor__max_depth': [6, 9],
    'regressor__subsample': [0.8, 1.0]
}

# define the CV strategy
xgb_grid = GridSearchCV(pipeline, xgb_param_grid, cv=5, scoring='neg_mean_absolute_error', verbose=3, n_jobs=-1, refit=True, return_train_score=True)

# perform the model selection and fitting to the best found parameters
xgb_grid.fit(X, Y)

print(f"\n\nBest params: {xgb_grid.best_params_}\t Best score: {xgb_grid.best_score_}")

--- 
## Forecast 

<ins>**Wrong Way to Forecast**</ins>


It might be tempting to compute all predictions for the test data in one pass and treat them as final predictions. However, this approach is incorrect. 

In real-world scenarios or during inference, we only receive data **progressively over time**. Therefore, we need to iteratively generate predictions, one step at a time. At each step, the model's prediction is added back into the dataset, and new features are constructed based on this updated information to predict the **next** step.

This iterative process ensures that the model only uses information available up to the current point in time, just as it would in a real-world forecasting scenario.

In [None]:
# wrong way to forecast using ML model 
print("Wrong way to forecast using ML model")
best_xgb_model = xgb_grid.best_estimator_
test_predictions = best_xgb_model.predict(test_data[X_COLUMNS]) # WRONG INFERENCE ON TEST SUBSET
test_predictions_df = pd.DataFrame(test_predictions, columns=data.columns)
test_predictions_df.head()

---

<ins>**Correct Way to Evaluate the Model**</ins>

In the following cell, we will build the features iteratively, incorporating the predictions generated by the model at each step. This ensures that new features are constructed based on the updated dataset, reflecting the real-world scenario of receiving data progressively over time.


In [None]:
# more rigorous evaluation
online_test_predictions = []
for idx in range(test_data.shape[0]):
    if idx == 0: # form a data frame at the first iteration
        x_input = pd.concat([train_val_data.iloc[-MAX_PAST_TIME_STEPS_FOR_FEATURES:], test_data.iloc[idx: idx+1]], axis=0)

    # make predictions
    preds = best_xgb_model.predict(x_input[X_COLUMNS].iloc[-1:]) # only the last sample is given as an input
    online_test_predictions.append(preds)

    if len(online_test_predictions) == len(test_data):
        break # we have predictions for the entire test data; stop here

    # add this observation to the dataframe and form features
    new_index = test_data.index[idx+1: idx+2]
    preds = pd.DataFrame(preds, columns=data.columns, index=new_index)
    x_raw = pd.concat([x_input, preds])

    # Make features
    # categorical temporal features
    x_raw.loc[new_index, CATEGORICAL_TEMPORAL_COLUMN_NAMES] = test_data.loc[new_index, CATEGORICAL_TEMPORAL_COLUMN_NAMES]
    for col in data.columns:
        # shift and delta features
        for shift in range(1,MAX_PAST_TIME_STEPS_FOR_FEATURES + 1):
            x_raw.loc[new_index, f"{col}-shifted-by-{shift}"] = x_raw[col].iloc[-shift-1] 
            x_raw.loc[new_index, f"{col}-delta-{shift}"] = x_raw[col].iloc[-shift] - x_raw[f"{col}-shifted-by-{shift}"].iloc[-1] 

        # rolling statistics
        x_raw.loc[new_index, f"{col}-rolling_mean"] = np.mean(x_raw[col].iloc[-WINDOW_SIZE:])
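        # ddof=1 matches the sample standard deviation used by pandas' rolling().std() during training
        x_raw.loc[new_index, f"{col}-rolling_std"] = np.std(x_raw[col].iloc[-WINDOW_SIZE:], ddof=1)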

    x_input = x_raw


# save them to the directory
AUGMENTED_COL_NAMES = [f"{MODEL_NAME}_{col}_mean" for col in data.columns]
test_predictions_df = pd.DataFrame(np.array(online_test_predictions).squeeze(1), columns=AUGMENTED_COL_NAMES, index=data.index[-MAX_FORECASTING_HORIZON:])

test_predictions_df.to_csv(f"{str(RESULTS_DIRECTORY)}/predictions.csv", index=True)
print(test_predictions_df.shape)
test_predictions_df.head()
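
The same recursive idea can be stripped down to a toy series with a single lag feature. The sketch below uses a made-up series and a linear model instead of the tuned XGBoost pipeline; it only illustrates how each prediction is fed back as the input for the next step.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy series following x_{t+1} = 0.5 * x_t + 1 (a made-up rule for illustration)
history = [0.0]
for _ in range(30):
    history.append(0.5 * history[-1] + 1.0)
history = np.array(history)

# One-lag supervised dataset: input x_t, target x_{t+1}
toy_model = LinearRegression().fit(history[:-1].reshape(-1, 1), history[1:])

# Recursive forecast: each prediction becomes the input for the next step
last_value = history[-1]
recursive_forecast = []
for _ in range(5):
    next_value = float(toy_model.predict(np.array([[last_value]]))[0])
    recursive_forecast.append(next_value)
    last_value = next_value  # feed the prediction back in

print(recursive_forecast)
```

No true future value is ever used inside the loop, which mirrors the constraint in the cell above.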

--- 

## Evaluate 

Let's compute the metrics by comparing the predictions with the target data. Note that we have to rename the columns of the predictions dataframe to match the column names expected by the function.
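For intuition about what a MASE value means, here is a minimal sketch that assumes the commonly used definition: the MAE of the forecast divided by the in-sample MAE of a one-step naive forecast. The function `mase` below is hypothetical, and `utils.get_mase_metrics` may differ in its details.

```python
import numpy as np

def mase(y_true, y_pred, y_train):
    """MAE of the forecast scaled by the in-sample MAE of a one-step naive forecast."""
    y_true, y_pred, y_train = (np.asarray(a, dtype=float) for a in (y_true, y_pred, y_train))
    naive_mae = np.mean(np.abs(np.diff(y_train)))  # error of predicting x_t with x_{t-1}
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

toy_train = [1.0, 2.0, 4.0, 7.0]  # naive one-step errors: 1, 2, 3
toy_true = [8.0, 10.0]
toy_pred = [9.0, 9.0]             # absolute errors: 1, 1
toy_mase = mase(toy_true, toy_pred, toy_train)
print(toy_mase)
```

Under this definition, a value below 1 means the forecast errors are smaller on average than those of the naive forecast on the historical data.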

In [None]:
# compute MASE metrics
target_data = data.iloc[-MAX_FORECASTING_HORIZON:]
model_metrics, records = utils.get_mase_metrics(
    historical_data=train_val_data,
    test_predictions=test_predictions_df.rename(
        # "XGBoost_0_mean" -> "0": recover the original column names
        columns={x: x.split("_")[1] for x in test_predictions_df.columns}
    ),
    target_data=target_data,
    forecasting_horizons=FORECASTING_HORIZON,
    columns=data.columns, 
    model_name='XGBoost',
)

records = pd.DataFrame(records)

records.to_csv(f"{str(RESULTS_DIRECTORY)}/metrics.csv", index=False)
records[['col', 'horizon', 'mase']].pivot(index=['horizon'], columns='col')

--- 

## Compare Models

In [None]:
utils.display_results(path=DIRECTORY_PATH_TO_SAVE_RESULTS, metric='mase')

---

## Plot Forecasts

In [None]:
target_data = data.iloc[-MAX_FORECASTING_HORIZON:]
fig, axs = utils.plot_forecasts(
    historical_data=train_val_data,
    forecast_directory_path=DIRECTORY_PATH_TO_SAVE_RESULTS,
    target_data=target_data,
    columns=data.columns,
    n_history_to_plot=10, 
    forecasting_horizon=MAX_FORECASTING_HORIZON,
    dpi=200,
    plot_se=False
)

--- 

## Conclusion

In this module, we learned how to featurize a time series to convert it into a supervised learning task, how to split the data into training, validation, and test subsets in a way that is appropriate for time series forecasting, and finally, how to use XGBoost for time series forecasting.

--- 

## Exercises

- Explore new feature engineering strategies to improve forecasting accuracy, such as seasonal indicators, cumulative sums, or lag-based interaction terms.
  
- Analyze whether the inclusion of difference (jump) features adds value to the model by comparing the performance with and without them.

- Implement a custom scoring function in `GridSearchCV` to better align with the model's objectives. ([Here is the documentation for creating a custom scorer](https://scikit-learn.org/stable/modules/model_evaluation.html#implementing-your-own-scoring-object)).

- Perform faster hyperparameter tuning by leveraging XGBoost's internal cross-validation mechanism. ([Demo on how to use cross-validation in XGBoost](https://xgboost.readthedocs.io/en/stable/python/examples/cross_validation.html)).

- Explore alternative regression algorithms, such as Support Vector Regression, Linear Regression, Random Forest Regressor, K-Nearest Neighbors.


---

## Next steps

To learn about deep-learning-based approaches, check out module 5 (LSTM-based models), module 6 (Transformer-based models), or module 7 (LLM-based models).

---