# Transformers for Forecasting: Informer

**Approximate Learning Time**: Up to 4 hours

---

In this notebook, we will briefly touch on Transformers and their applicability for time series forecasting. Additionally, we will explore a specific type of transformer called Informer. We will use the Hugging Face implementation of Informer and leverage the Lightning framework to train our models.

Training transformers can be computationally intensive, so I have reduced the network parameters to create a lightweight version that can be trained with limited computational resources.


---

## Transformers & Informer Model

<ins>**What is a Transformer?**</ins>

The **Transformer** ([Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762)) is a deep learning architecture that was originally introduced for tasks in natural language processing (NLP), particularly for sequence-to-sequence problems like translation. The key innovation in transformers is the **self-attention mechanism**, which allows the model to weigh the importance of different elements in a sequence, regardless of their distance from each other. This removes the need for sequential data processing (as in RNNs or LSTMs), allowing transformers to process entire sequences in parallel and capture long-range dependencies more efficiently.

For a deeper introduction to Transformers, see resources such as [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/).
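To make the self-attention mechanism described above concrete, here is a minimal single-head sketch in NumPy. The projection matrices `w_q`, `w_k`, `w_v` and the toy dimensions are illustrative assumptions; a real transformer adds multiple heads, positional information, residual connections, and layer normalization.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (length, d_model) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (length, length) pairwise similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
length, d_model = 6, 4                               # toy sizes, chosen for illustration
x = rng.normal(size=(length, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)
```

Every row of `weights` says how much one position draws on every other position, regardless of how far apart they are. All rows are computed in one matrix product, which is why the sequence can be processed in parallel.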

<ins>**How are Transformers Suitable for Forecasting?**</ins>

While transformers were first developed for NLP tasks, their ability to model long-range dependencies makes them well-suited for **time series forecasting**. Time series data often contain complex relationships across various time scales, which transformers can capture effectively due to their self-attention mechanism. Additionally, with an explicit mask marking which values were observed, transformers can accommodate missing data, and they extend naturally to multivariate series, making them versatile for forecasting tasks.

However, the cost of standard self-attention grows quadratically with sequence length, so transformers become resource-intensive on long sequences. This has led to specialized transformer models tailored for time series.

<ins>**What is the Informer Model?**</ins>

The **Informer** model is a transformer architecture specifically designed for long-sequence time series forecasting. Proposed by [Zhou et al. (2021)](https://arxiv.org/abs/2012.07436), Informer tackles the scalability issues of traditional transformers by introducing two key innovations:
- **ProbSparse self-attention**: Reduces the cost of attention from quadratic to roughly O(L log L) in the sequence length L by letting only the most informative queries attend to the keys, rather than computing attention for every query.
- **Distilling**: Shortens the sequence between encoder layers, compressing it into a more compact representation that keeps the dominant features. This keeps the model lightweight and fast on long time series.

These improvements make the Informer model particularly well-suited for handling long-range dependencies in large-scale time series data while maintaining efficiency and accuracy.
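The two ideas above can be sketched in a few lines of NumPy. This is a simplified illustration, not the Hugging Face or reference implementation: it computes the full score matrix (the paper estimates the query ranking from a sample of keys), uses the mean of the values for inactive queries as in non-causal attention, and replaces the convolution-plus-pooling distilling block with a plain stride-2 max-pool. The function names and the `factor` parameter are assumptions made for this sketch.

```python
import numpy as np

def probsparse_attention(q, k, v, factor=1.0):
    """Simplified ProbSparse attention: only the top-u 'active' queries attend."""
    n_queries, d = q.shape
    scores = q @ k.T / np.sqrt(d)                               # (L_Q, L_K)
    # Sparsity measurement per query: max score minus mean score.
    sparsity = scores.max(axis=1) - scores.mean(axis=1)
    u = min(n_queries, max(1, int(np.ceil(factor * np.log(n_queries)))))
    active = np.argsort(sparsity)[-u:]                          # indices of the top-u queries

    # Inactive ("lazy") queries fall back to the mean of the values.
    out = np.tile(v.mean(axis=0), (n_queries, 1))
    s = scores[active]
    s = s - s.max(axis=1, keepdims=True)
    w = np.exp(s)
    w = w / w.sum(axis=1, keepdims=True)
    out[active] = w @ v                                         # full attention for active queries only
    return out, active

def distil(x):
    """Halve the sequence length with a stride-2 max-pool over time."""
    length = x.shape[0] - x.shape[0] % 2
    return x[:length].reshape(length // 2, 2, -1).max(axis=1)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
out, active = probsparse_attention(q, k, v)
print(len(active), out.shape, distil(out).shape)
```

With 16 queries, only `ceil(ln 16) = 3` of them receive a full attention computation, and one distilling step halves the sequence from 16 to 8 positions.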

<ins>**What Are Other Transformer Models for Time Series Forecasting?**</ins>

There have been many recent proposals aimed at improving transformer architectures for time series forecasting. A non-exhaustive list includes:

- **Autoformer** by Wu et al. (2021): This model introduces decomposition-based forecasting, breaking time series into trend and seasonal components, allowing the model to focus on learning these distinct patterns.
  
- **PatchTST** by Nie et al. (2022): This model leverages the concept of patching, similar to vision transformers, to capture more localized patterns in time series data.
  
- **Crossformer** by Zhang et al. (2023): A transformer-based model designed to explicitly capture cross-dimension dependencies in multivariate series.

You can find implementations of these models and many more in several time series forecasting libraries:

- [**GluonTS**](https://ts.gluon.ai/stable/index.html): A comprehensive library offering several [models](https://ts.gluon.ai/stable/getting_started/models.html) widely used in the academic community.
- [**PyTorch Forecasting**](https://pytorch-forecasting.readthedocs.io/en/stable/index.html): A popular library built on top of PyTorch, designed specifically for time series forecasting.
- [**Neuralforecast**](https://github.com/Nixtla/neuralforecast): A library that provides several deep learning-based forecasting algorithms.
- [**HuggingFace Transformers**](https://huggingface.co/docs/transformers/index): Offers a wide range of transformer models and utilities. We will be using this library in the current notebook to implement the **Informer** model.

**References**:

[(Vaswani et al. 2017) Attention is all you need](https://arxiv.org/abs/1706.03762)

[(Zhou et al. 2021) Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436)

[(Wu et al. 2021) Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008)

[(Nie et al. 2022) A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://arxiv.org/abs/2211.14730)

[(Zhang et al. 2023) Crossformer: Transformer Utilizing Cross-Dimension Dependency for Multivariate Time Series Forecasting](https://openreview.net/forum?id=vSVLM2j9eie)

--- 

## Probabilistic Forecasting 


So far, we have focused on **point forecasting techniques**, where models predict a single value for future time steps. However, understanding the **uncertainty** around these estimates is equally important, as it provides confidence in the predictions. This is where **probabilistic forecasting** comes in.

In probabilistic forecasting, instead of predicting a single value, the model estimates the **distribution** of possible outcomes. For example, a common approach is to assume a **normal distribution** for the forecast. The model's task is then to predict the parameters of this distribution—specifically, the **mean** and **variance**. 

Once the model predicts these parameters, we can compute the **likelihood** of observing the actual target value under the predicted distribution. In the case of distributions like the **normal** or **Student's t-distribution**, calculating the likelihood is relatively straightforward. The model then minimizes the **negative log-likelihood (NLL)** of the target values, which serves as the loss function.

Because these likelihood functions are smooth and differentiable, **backpropagation** can be applied to update the model weights. This allows the model to learn and improve its predictions over time using standard gradient-based optimization techniques.
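As a small worked example of the loss described above, the sketch below evaluates the negative log-likelihood of a normal distribution by hand. The helper `gaussian_nll` and the toy targets are illustrative assumptions, not part of the notebook's pipeline.

```python
import math

def gaussian_nll(y, mean, std):
    """Negative log-likelihood of observation y under Normal(mean, std**2)."""
    return 0.5 * math.log(2 * math.pi * std**2) + (y - mean) ** 2 / (2 * std**2)

targets = [1.0, 2.0, 3.0]

def average_nll(offset, std):
    # Predict `target + offset` as the mean, with a fixed predicted standard deviation.
    return sum(gaussian_nll(y, mean=y + offset, std=std) for y in targets) / len(targets)

confident_and_right = average_nll(offset=0.0, std=0.1)
confident_and_wrong = average_nll(offset=1.0, std=0.1)
hedged_and_wrong = average_nll(offset=1.0, std=1.0)
print(confident_and_right, confident_and_wrong, hedged_and_wrong)
```

The loss rewards narrow distributions when the mean is accurate and penalizes overconfidence heavily when it is not. This is what pushes the model to learn a meaningful variance rather than only a good mean.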

---

We will train the **Informer model**, which by default is implemented to compute the **negative log-likelihood (NLL)** assuming one of the several available distributions provided by Hugging Face. You can explore the available distributions for time series forecasting in the [Hugging Face documentation](https://huggingface.co/docs/transformers/main/en/internal/time_series_utils).


In [None]:
# setup
import torch

import lightning as L 
from lightning.pytorch import seed_everything
from lightning.pytorch.callbacks import ModelCheckpoint, EarlyStopping

import pathlib
import numpy as np
import pandas as pd

# gluon's import 
from gluonts.time_feature import time_features_from_frequency_str

from gluonts.dataset.field_names import FieldName 
from gluonts.transform import (
    AddAgeFeature,
    AddTimeFeatures, 
    AddObservedValuesIndicator,
    Chain,
    RemoveFields,
    VstackFeatures,
    RenameFields,
    AsNumpyArray,
    InstanceSplitter,
    ExpectedNumInstanceSampler,
    TestSplitSampler,
    ValidationSplitSampler, 
)

from gluonts.itertools import Cyclic
from gluonts.dataset.loader import as_stacked_batches
from gluonts.dataset.pandas import PandasDataset

from transformers import InformerConfig, InformerForPrediction
from termcolor import colored
import sys; sys.path.append("../") # make the shared `utils` module importable
import utils

LAGS_SEQUENCE = [1, 2, 4, 12, 24] # lagged target values that are fed to the model as input features

## WARNING: To compare different models on the same horizon, keep these values the same across the notebooks
FORECASTING_HORIZON = [4, 8, 12] # weeks 
MAX_FORECASTING_HORIZON = max(FORECASTING_HORIZON)

SEQUENCE_LENGTH = 2 * MAX_FORECASTING_HORIZON
PREDICTION_LENGTH = MAX_FORECASTING_HORIZON

DIRECTORY_PATH_TO_SAVE_RESULTS = pathlib.Path('../results/DIY/').resolve()
MODEL_NAME = "Informer"

RESULTS_DIRECTORY = DIRECTORY_PATH_TO_SAVE_RESULTS / MODEL_NAME
if RESULTS_DIRECTORY.exists():
    print(colored(f'Directory {str(RESULTS_DIRECTORY)} already exists.'
           '\nThis notebook will overwrite results in the same directory.'
           '\nYou can also create a new directory if you want to keep this directory untouched.'
           ' Just change the `MODEL_NAME` in this notebook.\n', "red" ))
else:
    RESULTS_DIRECTORY.mkdir(parents=True)

data, transformed_data = utils.load_tutotrial_data(dataset='exchange_rate', log_transform=True)

data = transformed_data
train_val_data = data.iloc[:-MAX_FORECASTING_HORIZON]
train_data, val_data = train_val_data.iloc[:-MAX_FORECASTING_HORIZON], train_val_data.iloc[-MAX_FORECASTING_HORIZON:]
test_data = data.iloc[-MAX_FORECASTING_HORIZON:]
print(f"Number of steps in training data: {len(train_data)}\nNumber of steps in validation data: {len(val_data)}\nNumber of steps in test data: {len(test_data)}")

%load_ext autoreload
%autoreload 2

--- 

## Transform & Batch Data using GluonTS

This section builds on concepts from **Module 1 (Notebook 1.3)** and **Module 2 (Notebook 2.0)**. Specifically, we will:
- Leverage the **transformation capabilities** in **GluonTS** that were covered in Module 1/Notebook 1.3, and
- Implement a **sampling strategy** to create training, validation, and testing subsets, as introduced in Module 2/Notebook 2.0.

--- 

Let's first convert the data into a `PandasDataset` to match GluonTS's internal dataset representation. Apply the following transformations using GluonTS:

- Remove unused fields from the dataset.
- Convert data into NumPy arrays.
- Add an observed-values indicator field (`FieldName.OBSERVED_VALUES`), with 1s where a target value is present and 0s where it is missing.
- Add time and age features to the dataset, and stack them into a single field, `FieldName.FEAT_TIME`.
- Finally, rename the columns to match the format expected by the Hugging Face transformer.

**Note**: The **age feature** is particularly important because it provides the transformer with information about the **position** of each input in the sequence, helping the model understand the temporal context. In **NLP tasks**, transformers use **positional encodings** to capture the order of words in a sentence, as they process input in parallel and do not inherently understand sequence order. Similarly, in time series forecasting, the age feature acts as a positional signal, ensuring that the model understands where each input falls within the time sequence.
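To see what such a positional signal looks like, here is a minimal sketch of a log-scaled age feature. The formula `log10(2 + t)` reflects my understanding of what `AddAgeFeature` computes with `log_scale=True`; treat it as an assumption and check the GluonTS source if the exact values matter.

```python
import math

def age_feature(length, log_scale=True):
    """Monotonically increasing position signal, one value per time step."""
    if log_scale:
        return [math.log10(2.0 + t) for t in range(length)]
    return [float(t) for t in range(length)]

age = age_feature(5)
print([round(a, 3) for a in age])
```

Because the values strictly increase with time, two otherwise identical inputs at different positions in the series become distinguishable to the model.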


In [2]:
# convert data into PandasDataset
gluon_train_data = PandasDataset(train_data, target=data.columns)
gluon_train_val_data = PandasDataset(train_val_data, target=data.columns)
gluon_data = PandasDataset(data, target=data.columns)

# Define transformations
remove_field_names=[FieldName.FEAT_STATIC_REAL, FieldName.FEAT_DYNAMIC_REAL, FieldName.FEAT_STATIC_CAT]
transformation = Chain(
    [RemoveFields(field_names=remove_field_names)]
    + [
        AsNumpyArray(
            field=FieldName.TARGET,
            expected_ndim=2,
        ),

        AddObservedValuesIndicator(
            target_field=FieldName.TARGET,
            output_field=FieldName.OBSERVED_VALUES,
        ),
        AddTimeFeatures(
            start_field=FieldName.START,
            target_field=FieldName.TARGET,
            output_field=FieldName.FEAT_TIME,
            time_features=time_features_from_frequency_str(data.index.freq),
            pred_length=MAX_FORECASTING_HORIZON,
        ),
        AddAgeFeature(
            target_field=FieldName.TARGET,
            output_field=FieldName.FEAT_AGE,
            pred_length=MAX_FORECASTING_HORIZON,
            log_scale=True,
        ),
        VstackFeatures(
            output_field=FieldName.FEAT_TIME,
            input_fields=[FieldName.FEAT_TIME, FieldName.FEAT_AGE]
        ),
        RenameFields(
            mapping={
                FieldName.FEAT_STATIC_CAT: "static_categorical_features",
                FieldName.FEAT_STATIC_REAL: "static_real_features",
                FieldName.FEAT_TIME: "time_features",
                FieldName.TARGET: "values",
                FieldName.OBSERVED_VALUES: "observed_mask",
            }
        )

    ]
)

After this, we will use `InstanceSplitter` to define how to split the time series. We want each sample to contain a total of `SEQUENCE_LENGTH + max(LAGS_SEQUENCE)` elements. While the input to the Informer will be `SEQUENCE_LENGTH` long for past time steps, we need `max(LAGS_SEQUENCE)` extra steps to create lagged features for the last time step in the input. Essentially, `InstanceSplitter` samples only the required time steps for featurizing, not the full input.

Additionally, it will sample the target over the next `PREDICTION_LENGTH` time steps. The values are passed with separate keys, prefixed with `past_` for input and `future_` for the target. In cases where there aren’t enough observations to fill up `SEQUENCE_LENGTH`, the `observed_values` indicator will fill those positions with 0. The `observed_values` field is processed similarly to the values themselves.
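The window arithmetic above can be checked with a toy series whose value equals its time index. The helper `lagged_features` is a hypothetical illustration of why the extra `max(LAGS_SEQUENCE)` steps are needed. It is not the code path used inside GluonTS or Hugging Face.

```python
SEQUENCE_LENGTH = 24
LAGS_SEQUENCE = [1, 2, 4, 12, 24]
past_length = SEQUENCE_LENGTH + max(LAGS_SEQUENCE)   # steps sampled per instance

def lagged_features(window, lags, context_length):
    """For each of the last `context_length` steps, collect the values `lag` steps earlier."""
    start = len(window) - context_length
    return [[window[t - lag] for lag in lags] for t in range(start, len(window))]

window = list(range(past_length))   # toy series: value == time index
features = lagged_features(window, LAGS_SEQUENCE, SEQUENCE_LENGTH)
print(past_length, len(features))
print(features[0])    # lags for the first context step (t = 24)
print(features[-1])   # lags for the last context step (t = 47)
```

The first context step already needs the value from 24 steps earlier, which is index 0 of the window. Without the extra history, the largest lag could not be computed for the early part of the context.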

The `InstanceSplitter` requires an `InstanceSampler`, which controls which indices to sample. For example:
- `ValidationSplitSampler` iterates over indices for validation.
- `ExpectedNumInstanceSampler` performs rebalancing to ensure time steps are equally represented in the output.

Additionally, we will use the function `as_stacked_batches` ([documentation here](https://ts.gluon.ai/dev/api/gluonts/gluonts.dataset.loader.html?highlight=as_stac#gluonts.dataset.loader.as_stacked_batches)), which serves a similar purpose to `torch.utils.data.DataLoader` by batching the data instances. It also provides other functionalities, such as retaining specific field names during batching.

In [None]:
PREDICTION_INPUT_NAMES = [
    "past_time_features",
    "past_values",
    "past_observed_mask",
    "future_time_features",

]

TRAINING_INPUT_NAMES = PREDICTION_INPUT_NAMES + [
        "future_values",
        "future_observed_mask",
    ]

# TRAINING DATA
transformed_data = transformation.apply(gluon_train_data, is_train=True)
stream = Cyclic(transformed_data).stream() # never stop serving training data
instance_splitter = InstanceSplitter(
    target_field="values",
    is_pad_field=FieldName.IS_PAD,
    start_field=FieldName.START,
    forecast_start_field=FieldName.FORECAST_START,
    instance_sampler=ExpectedNumInstanceSampler(num_instances=1.0, min_future=MAX_FORECASTING_HORIZON),
    past_length=SEQUENCE_LENGTH + max(LAGS_SEQUENCE),
    future_length=PREDICTION_LENGTH,
    time_series_fields=["time_features", "observed_mask"],
)
training_instances = instance_splitter.apply(stream) # applies the above specified logic in the instance splitter to stream

batch_size=64
train_loader = as_stacked_batches(
        training_instances,
        batch_size=batch_size,
        field_names=TRAINING_INPUT_NAMES,
        output_type=torch.tensor,
        num_batches_per_epoch=100
    )

# VALIDATION DATA
transformed_data = transformation.apply(gluon_train_val_data, is_train=True)
instance_splitter = InstanceSplitter(
    target_field="values",
    is_pad_field=FieldName.IS_PAD,
    start_field=FieldName.START,
    forecast_start_field=FieldName.FORECAST_START,
    instance_sampler=ValidationSplitSampler(min_future=MAX_FORECASTING_HORIZON),
    past_length=SEQUENCE_LENGTH + max(LAGS_SEQUENCE),
    future_length=PREDICTION_LENGTH,
    time_series_fields=["time_features", "observed_mask"],
)
val_instances = instance_splitter.apply(transformed_data, is_train=True)
val_loader = as_stacked_batches(
        val_instances,
        batch_size=batch_size,
        field_names=TRAINING_INPUT_NAMES,
        output_type=torch.tensor,
    )

batch = next(iter(val_loader))
print(batch.keys())
for k, v in batch.items():
    print(k, v.shape, v.type())

**Exercise**: Check the field names and the corresponding tensor shapes. Do they match what you expected?

---

## Informer model using 🤗 Transformers


🤗 Transformers provides an API that makes it easy to download and train state-of-the-art transformer models. The library also includes an implementation of the Informer model ([documentation here](https://huggingface.co/docs/transformers/main/en/model_doc/informer)), which we will use.

The implementation exposes three main classes:
- `InformerConfig`: Defines the model configuration.
- `InformerModel`: The actual implementation of the model, initialized based on the configuration.
- `InformerForPrediction`: The Informer model with a distribution head on top. It returns the NLL loss when `future_values` are provided and exposes `generate` for sampling forecasts at inference time.

We will define the configuration and train the model using the Lightning framework. Note that the default implementation of Informer is trained by minimizing the negative log-likelihood of a predicted probability distribution, rather than the L2 or L1 loss used in the previous modules.

Additionally, we will perform a dry run with a batch from the data loader to ensure the batches are being processed correctly.

In [None]:
# Initializing an Informer configuration with PREDICTION_LENGTH (12) time steps for prediction
configuration = InformerConfig(
    prediction_length=PREDICTION_LENGTH,
    context_length=SEQUENCE_LENGTH,
    lags_sequence=LAGS_SEQUENCE,
    num_time_features=3,
    distribution_output='student_t', # can be `normal` as well 
    input_size=8,
    d_model=32, # reduced to make it lightweight 
    encoder_layers=2, # reduced to make it lightweight 
    decoder_layers=2,
    dropout=0.1,
    num_parallel_samples=100, 
)

# Randomly initializing a model (with random weights) from the configuration
model = InformerForPrediction(configuration)

batch = next(iter(train_loader))
outputs = model(
    past_values=batch["past_values"],
    past_time_features=batch["past_time_features"],
    future_values=batch["future_values"],
    future_time_features=batch["future_time_features"],
    future_observed_mask=batch["future_observed_mask"],
    past_observed_mask=batch["past_observed_mask"],
    output_hidden_states=True,
)

outputs.loss

Great! We now have a working implementation of the Informer model, and we’re able to compute the negative log-likelihood of our outputs. The next step is to begin training the model and learning the weights.

--- 

## Training the model

Following the same structure as the previous notebooks, we will define a `LightningModule` and a `Trainer` from Lightning and let them run the training and validation loops.

Here, we will experiment by looping over two different values of `n_encoders` (the number of encoder layers) and train the model for a few steps, keeping in mind the relatively slow training speed of transformers. If training is too slow on your system, feel free to reduce the model parameters even further.

**Note**: The method `_shared_eval` has been added to the module for this implementation. This method is used exclusively during **inference** and has no role during training. We will revisit its purpose and functionality in the **Forecast** section.


In [5]:
class LightningInformer(L.LightningModule):
    def __init__(self, informer_config):
        super().__init__()
        self.save_hyperparameters()
        self.informer = InformerForPrediction(informer_config)

    def forward(self, batch):
        return self.informer(
            past_values=batch["past_values"],
            past_time_features=batch["past_time_features"],
            future_values=batch["future_values"],
            future_time_features=batch["future_time_features"],
            future_observed_mask=batch["future_observed_mask"],
            past_observed_mask=batch["past_observed_mask"],
            output_hidden_states=True,
        )

    def training_step(self, batch, batch_idx):
        outputs = self(batch)
        loss = outputs.loss
        self.log('train_loss', loss, prog_bar=True)
        return loss 

    def validation_step(self, batch, batch_idx):
        # Note: one can evaluate using the median and self.informer.generate by using _shared_eval
        # For computational efficiency, we will stick to the loss function here
        outputs = self(batch)
        loss = outputs.loss
        self.log('val_loss', loss, prog_bar=True)
        return loss
    
    def _shared_eval(self, batch, device):
        self.informer.eval()

        return self.informer.to(device).generate(
            past_time_features=batch["past_time_features"].to(device),
            past_values=batch["past_values"].to(device),
            future_time_features=batch["future_time_features"].to(device),
            past_observed_mask=batch["past_observed_mask"].to(device),
        )

    def configure_optimizers(self):
        # optimize this module's own parameters rather than relying on a global `model` variable
        return torch.optim.AdamW(self.parameters(), lr=6e-4, betas=(0.9, 0.95), weight_decay=1e-1)


In [None]:
best_val_loss, best_params = np.inf, []
for n_encoders in [2, 4]:
    seed_everything(42, workers=True)


    # define callbacks 
    early_stopping = EarlyStopping(monitor='val_loss', mode='min', verbose=True, patience=10)

    # save the best checkpoint; this will be used to evaluate test metrics
    best_checkpoint = ModelCheckpoint(save_top_k=1, monitor='val_loss',
                                    mode='min', 
                                    filename='{epoch:02d}-{global_step}-{val_loss:.5f}')
    # save the last checkpoint in case you want to resume training
    # (note: this callback is not passed to the Trainer below; add it to `callbacks` to enable it)
    last_checkpoint = ModelCheckpoint(save_top_k=1, monitor='global_step',
                                    mode='max',
                                    filename='{epoch:02d}-{global_step}')

    configuration.encoder_layers = n_encoders
    model = LightningInformer(configuration)
    trainer = L.Trainer(
        max_epochs=500, 
        val_check_interval=20,
        callbacks=[early_stopping, best_checkpoint], 
        deterministic=True,
        )

    trainer.fit(model, train_loader, val_loader)

    if best_val_loss > best_checkpoint.best_model_score:
        best_val_loss = best_checkpoint.best_model_score
        best_params = (n_encoders, )


print(f"\n\nBest params: encoder_layers: {best_params[0]}. Val loss: {best_val_loss}")


**Visualize the training trajectory**

Run the following command on the command line and open the port on which TensorBoard is launched. Then navigate to the `Scalars` tab to see how `train_loss` and `val_loss` change during training.

```bash
tensorboard --logdir=lightning_logs
```
--- 

## Refit on Train-Val Subset

To evaluate the model's performance on the test data, we will first retrain the model using the best hyperparameters identified earlier, this time utilizing the combined train-validation dataset for training.

In [None]:
# refit on train and val dataset
seed_everything(42, workers=True)


# training data
transformed_data = transformation.apply(gluon_train_val_data, is_train=True)
stream = Cyclic(transformed_data).stream()
instance_splitter = InstanceSplitter(
    target_field="values",
    is_pad_field=FieldName.IS_PAD,
    start_field=FieldName.START,
    forecast_start_field=FieldName.FORECAST_START,
    instance_sampler=ExpectedNumInstanceSampler(num_instances=1.0, min_future=MAX_FORECASTING_HORIZON),
    past_length=SEQUENCE_LENGTH + max(LAGS_SEQUENCE),
    future_length=PREDICTION_LENGTH,
    time_series_fields=["time_features", "observed_mask"],
)
train_val_instances = instance_splitter.apply(stream)

batch_size=64
train_val_loader = as_stacked_batches(
        train_val_instances,
        batch_size=batch_size,
        field_names=TRAINING_INPUT_NAMES,
        output_type=torch.tensor,
        num_batches_per_epoch=100
    )

# define the best obtained configuration
configuration.encoder_layers = best_params[0]
model = LightningInformer(configuration)
trainer = L.Trainer(
    max_epochs=30, 
    callbacks=[EarlyStopping(monitor='train_loss', mode='min', verbose=True, patience=5)], 
    deterministic=True
    )

trainer.fit(model, train_val_loader)


--- 

## Forecast

As we observed in the previous module, predictions in this model are also generated iteratively. The `_shared_eval` method defined earlier handles this by calling the model's `generate` method, which produces the forecast one step at a time.


Finally, although we are learning a **distribution**, at test time the outputs are **sampled from the predicted distribution**. This sampling is performed iteratively.
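The iterative sampling described above can be sketched with a toy model. Here `predict` is a hypothetical stand-in for the trained network: it maps the latest value to the mean and standard deviation of the next step. The sketch assumes a normal distribution for simplicity and only illustrates the feed-back loop, not the internals of `generate`.

```python
import random

def sample_paths(last_value, horizon, num_samples, predict, seed=0):
    """Ancestral sampling: draw each step from the predicted distribution, then feed it back."""
    rng = random.Random(seed)
    paths = []
    for _ in range(num_samples):
        value, path = last_value, []
        for _ in range(horizon):
            mean, std = predict(value)      # distribution parameters for the next step
            value = rng.gauss(mean, std)    # sample, then condition the next step on the sample
            path.append(value)
        paths.append(path)
    return paths

def toy_predict(value):
    # Hypothetical model: a damped AR(1) process with fixed noise.
    return 0.9 * value, 0.1

paths = sample_paths(1.0, horizon=12, num_samples=100, predict=toy_predict)
print(len(paths), len(paths[0]))
```

Each path is one plausible future trajectory. The spread across paths at a given step reflects the forecast uncertainty at that horizon.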

For evaluation, we will use the **median** of the sampled forecasts as the point estimate, since it is less sensitive than the mean to extreme samples. However, feel free to experiment with using the **mean** or any other statistic to evaluate the performance of the model.
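As a sketch of how sampled forecasts reduce to point estimates and intervals, the snippet below uses random draws as a stand-in for the model's samples. The array `samples` and its shape `(num_samples, prediction_length, n_series)` are assumptions chosen to mirror this notebook's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for sampled forecasts: (num_samples, prediction_length, n_series).
samples = rng.standard_t(df=3, size=(100, 12, 8))

forecast_median = np.median(samples, axis=0)               # point estimate used in this notebook
forecast_mean = samples.mean(axis=0)                       # alternative point estimate
lower, upper = np.quantile(samples, [0.1, 0.9], axis=0)    # 80% prediction interval
print(forecast_median.shape, forecast_mean.shape, lower.shape, upper.shape)
```

All statistics are taken over the sample axis, so each result has one value per forecast step and series.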


In [None]:
# load the refit model from the checkpoint saved by the last trainer
model = LightningInformer.load_from_checkpoint(
    checkpoint_path=trainer.checkpoint_callback.best_model_path)
model.eval()

transformed_data = transformation.apply(gluon_data, is_train=False)

instance_splitter = InstanceSplitter(
    target_field="values",
    is_pad_field=FieldName.IS_PAD,
    start_field=FieldName.START,
    forecast_start_field=FieldName.FORECAST_START,
    instance_sampler=TestSplitSampler(),
    past_length=SEQUENCE_LENGTH + max(LAGS_SEQUENCE),
    future_length=PREDICTION_LENGTH,
    time_series_fields=["time_features", "observed_mask"],
)
test_instances = instance_splitter.apply(transformed_data, is_train=False)

batch_size=64
test_loader = as_stacked_batches(
        test_instances,
        batch_size=batch_size,
        field_names=PREDICTION_INPUT_NAMES,
        output_type=torch.tensor,
    )

batch = next(iter(test_loader))

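# iterative prediction; after dropping the batch dimension, the samples have shape
# (num_parallel_samples, PREDICTION_LENGTH, number of series)
forecasts = model._shared_eval(batch, device=torch.device('cpu'))[0].squeeze(0)

# the model outputs sampled trajectories, so we reduce them to a point estimate (median over samples).
forecast_median = np.median(forecasts.cpu().numpy(), axis=0)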

AUGMENTED_COL_NAMES = [f"{MODEL_NAME}_{col}_mean" for col in data.columns]
test_predictions_df = pd.DataFrame(forecast_median, columns=AUGMENTED_COL_NAMES, index=test_data.index)

# save them to the directory
test_predictions_df.to_csv(f"{str(RESULTS_DIRECTORY)}/predictions.csv", index=True)
print(test_predictions_df.shape)
test_predictions_df.head()

--- 

## Evaluate 

Let's compute the metrics by comparing the predictions with the target data. Note that we have to rename the columns of the dataframe to match the column names the function expects.
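For reference, here is a minimal sketch of the textbook MASE definition: the forecast's mean absolute error divided by the in-sample mean absolute error of a naive forecast. The function `mase` and its `seasonality` parameter are illustrative assumptions; `utils.get_mase_metrics` may differ in details such as how it handles horizons.

```python
import numpy as np

def mase(history, target, forecast, seasonality=1):
    """Mean absolute scaled error: forecast MAE over in-sample (seasonal) naive MAE."""
    history, target, forecast = (np.asarray(a, dtype=float) for a in (history, target, forecast))
    naive_mae = np.mean(np.abs(history[seasonality:] - history[:-seasonality]))
    return np.mean(np.abs(target - forecast)) / naive_mae

history = [1.0, 2.0, 4.0, 7.0]      # naive one-step errors: 1, 2, 3 -> MAE of 2
target = [8.0, 10.0]
score = mase(history, target, forecast=[7.0, 7.0])   # forecast MAE: (1 + 3) / 2 = 2
print(score)
```

A value below 1 means the forecast beats the naive baseline on average, and a value above 1 means it does worse.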

In [None]:
# evaluate metrics
target_data = data[-MAX_FORECASTING_HORIZON:]
model_metrics, records = utils.get_mase_metrics(
    historical_data=train_val_data,
    test_predictions=test_predictions_df.rename(
            columns={x:x.split("_")[1] for x in test_predictions_df.columns
        }),
    target_data=target_data,
    forecasting_horizons=FORECASTING_HORIZON,
    columns=data.columns, 
    model_name=MODEL_NAME
)
records = pd.DataFrame(records)

records.to_csv(f"{str(RESULTS_DIRECTORY)}/metrics.csv", index=False)
records[['col', 'horizon', 'mase']].pivot(index=['horizon'], columns='col')

---

## Compare Models

In [None]:
utils.display_results(path=DIRECTORY_PATH_TO_SAVE_RESULTS, metric='mase')

--- 

## Plot Forecasts

In [None]:
fig, axs = utils.plot_forecasts(
    historical_data=train_val_data,
    forecast_directory_path=DIRECTORY_PATH_TO_SAVE_RESULTS,
    target_data=target_data,
    columns=data.columns,
    n_history_to_plot=10, 
    forecasting_horizon=MAX_FORECASTING_HORIZON,
    dpi=200,
    exclude_models=['LSTM', 'ExpSmooth', 'VAR'],
    plot_se=False,
)

--- 

## Conclusion

We gained an understanding of how transformers work and their applicability to probabilistic time series forecasting. Specifically, we explored the Informer model, and successfully trained one using the Hugging Face Transformers API and the Lightning framework.

---

## Exercises

- Find MASE using the mean instead of the median. What differences do you observe in the results?
  
- Plot the 1-standard deviation confidence intervals around the predictions.
  
- Change the output distribution type and analyze how it affects the predictions.
  
- Select a few hyperparameters and perform hyperparameter optimization to improve the model.
  
- Explore other time series transformers available in Hugging Face and compare their performance.
  
- Add more features to the model and assess their impact on forecasting accuracy.
  
- Apply a normalization procedure (e.g., **min-max scaling**) to the data, ensuring that only the training data is used for fitting the scaler. Perform the modeling process on the normalized data and, after generating the final model's predictions, invert the normalization to return the output to its original scale. See `sklearn.preprocessing.MinMaxScaler` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)).
  
- Additionally, perform the modeling on the **raw data**, without applying any transformation (such as converting it into log daily returns), to compare results directly with the untransformed dataset.
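As a starting point for the min-max scaling exercise above, here is a minimal NumPy sketch of fitting a scaler on the training split only and inverting it afterwards. The class `TrainOnlyMinMaxScaler` is a hypothetical helper written for illustration; `sklearn.preprocessing.MinMaxScaler` offers the same operations.

```python
import numpy as np

class TrainOnlyMinMaxScaler:
    """Per-column min-max scaling whose statistics come from the training split only."""

    def fit(self, train):
        train = np.asarray(train, dtype=float)
        self.min_ = train.min(axis=0)
        value_range = train.max(axis=0) - self.min_
        # Avoid division by zero for constant columns.
        self.range_ = np.where(value_range == 0, 1.0, value_range)
        return self

    def transform(self, x):
        return (np.asarray(x, dtype=float) - self.min_) / self.range_

    def inverse_transform(self, x):
        return np.asarray(x, dtype=float) * self.range_ + self.min_

train = np.array([[1.0, 10.0], [3.0, 30.0]])
test = np.array([[5.0, 20.0]])

scaler = TrainOnlyMinMaxScaler().fit(train)      # statistics from training data only
scaled_test = scaler.transform(test)             # test values may fall outside [0, 1]
restored = scaler.inverse_transform(scaled_test)
print(scaled_test, restored)
```

Fitting on the training split alone prevents information from the validation and test periods from leaking into the model inputs.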

---

## Next Steps

Proceed to the next module to learn about LLM-based approaches for time series forecasting. 

---