## Deep Learning Methods for Forecasting: LSTNet

**Approximate Learning Time**: Up to 4 hours

---

In this notebook, we study LSTNet, a model proposed by [Lai et al., 2018](https://arxiv.org/abs/1703.07015) to address the limitations of LSTM models in time series forecasting. The architecture combines convolutional networks (CNNs), recurrent networks (LSTMs or GRUs), and skip connections or attention mechanisms to tackle these challenges. By exploring this model, we gain a deeper understanding of how these deep learning architectures can be combined and where each one excels.

---

## LSTNet Model


Time series often exhibit both short-term and long-term recurring patterns. For example, hourly traffic data might display daily patterns as well as weekly patterns, with reduced traffic on weekends. In more complex time series, there can be multiple patterns occurring at different time scales.

The LSTNet model was proposed to capture these patterns by utilizing:
- Convolutional Networks (CNN) to model short-term dependencies,
- LSTMs (or Gated Recurrent Units, GRUs) to capture long-term dependencies,
- Skip connections or attention mechanisms to model very long-term dependencies.

Convolutional networks excel at capturing local patterns: a window (or kernel) slides over the input sequence and extracts local features at each position, and different kernels capture different aspects of the local structure. You can read more about Convolutional Neural Networks [here on Wikipedia](https://en.wikipedia.org/wiki/Convolutional_neural_network).
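To make the sliding window concrete, here is a minimal, dependency-free sketch of a single kernel moving along the time axis of a multivariate series. The function name `conv_over_time` and the toy numbers are assumptions made for illustration and are not part of the notebook's code; the ReLU follows Eq. (1) of the paper.

```python
def conv_over_time(x, kernel, bias=0.0):
    """Slide one kernel over a multivariate series.

    x      : list of T time steps, each a list of D series values
    kernel : list of w rows, each a list of D weights (it spans all series)
    Returns T - w + 1 ReLU activations, one per window position.
    """
    w = len(kernel)
    activations = []
    for t in range(len(x) - w + 1):
        total = bias
        for i in range(w):
            for value, weight in zip(x[t + i], kernel[i]):
                total += value * weight
        activations.append(max(0.0, total))  # ReLU
    return activations


# Two series over four time steps; the first series rises by 1 at every step.
x = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]
# A width-2 kernel that measures the step-to-step increase of the first series.
rise_kernel = [[-1.0, 0.0], [1.0, 0.0]]

features = conv_over_time(x, rise_kernel)
print(features)  # one activation per window position: 4 - 2 + 1 = 3 values
```

A convolutional layer learns many such kernels at once, so each time position ends up described by a vector of local features rather than a single number.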

We’ve already studied LSTMs in the previous notebook. The authors proposed using GRUs (Gated Recurrent Units) instead of LSTMs. GRUs are a more compact variant of LSTMs with fewer parameters, making them computationally lighter while offering comparable performance.

While gated recurrent units such as LSTMs and GRUs are effective at modeling long-term dependencies, they still struggle with very long-term dependencies because of the same vanishing gradient problem faced by plain RNNs. To address this, the authors of LSTNet introduced recurrent-skip connections, which link the current hidden state to the hidden state $p$ steps earlier (for example, the same hour on the previous day) instead of only the immediately preceding one. By stepping back from the last time step in strides of $p$, the recurrent-skip component shortens the path that information has to travel and can better capture long-term periodic patterns.
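The skipping pattern can be sketched as plain index arithmetic. The helper `skip_chains` below is a hypothetical illustration (it is not taken from the paper's code or from this notebook): given the number of available time positions and a skip length, it lists which positions each skip chain visits, discarding the oldest positions that do not fill a complete stride.

```python
def skip_chains(n_steps, skip):
    """Return, for each of the `skip` chains, the time indices it visits."""
    usable = (n_steps // skip) * skip   # largest multiple of `skip` that fits
    start = n_steps - usable            # the oldest positions are discarded
    return [list(range(start + offset, n_steps, skip)) for offset in range(skip)]


chains = skip_chains(n_steps=10, skip=4)
for chain in chains:
    print(chain)
```

Each chain is processed by the recurrent unit as its own short sequence, so positions that are `skip` steps apart become direct neighbours. The last chain always ends at the most recent time step.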

Another approach to handling long-term dependencies is the attention mechanism, which was [originally proposed](https://arxiv.org/abs/1409.0473) for neural machine translation so that the decoder could look back at all encoder hidden states instead of relying on a single fixed-length summary of a long sequence. While this mechanism is not depicted in the diagram below, it is discussed in the text of the LSTNet paper.
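As a rough illustration of the idea, the sketch below weights every past hidden state by its similarity to a query (here standing in for the most recent hidden state) and returns the weighted sum. The name `attend` and the choice of dot-product scoring are assumptions made for this example; dot product is only one of several possible scoring functions.

```python
import math


def attend(hidden_states, query):
    """Dot-product attention over a list of hidden-state vectors."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in hidden_states]
    # Softmax, shifted by the maximum score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the attention-weighted sum of the hidden states.
    context = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(len(query))
    ]
    return weights, context


hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy hidden states, oldest first
query = [1.0, 1.0]                                    # toy "latest hidden state"
weights, context = attend(hidden_states, query)
print(weights, context)
```

Unlike a fixed skip length, the weights are computed from the data, so the model can decide for each input which past positions matter most.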


<div style="text-align: center; padding: 20px;">
<img src="../images/lstnet.png" style="max-width: 90%; clip-path: inset(2px); height: auto; border-radius: 15px; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.1); "></img>
</div>


Finally, the LSTNet model includes an autoregressive (AR) component that models the output as a linear autoregression on the last $k$ time steps. Its output is added to the output of the highly non-linear part of the model (the convolutional and recurrent components), which on its own may not track changes in the scale of the inputs. The idea is that the linear AR component acts as an output scaler, helping to maintain the appropriate scale of the forecasted values, which might otherwise be lost in the non-linear transformations.
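A minimal sketch of why a linear component helps with scale: if the inputs grow by a factor of ten, a bias-free linear autoregression grows by exactly the same factor. The weights below are made-up values standing in for learned parameters, and `ar_component` is a hypothetical helper name.

```python
def ar_component(recent_values, weights, bias=0.0):
    """Linear autoregression: weighted sum of the last k observations plus a bias."""
    return sum(w * y for w, y in zip(weights, recent_values)) + bias


weights = [0.2, 0.3, 0.5]                 # hypothetical learned AR weights, k = 3
window = [1.0, 2.0, 3.0]                  # last k observations of one series
scaled_window = [10 * y for y in window]  # the same series on a 10x larger scale

small = ar_component(window, weights)
large = ar_component(scaled_window, weights)
print(small, large)  # the linear output grows in proportion to the input scale
```

In the full model this linear term is simply added to the output of the non-linear part, one value per series.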

**References**:

[[Lai et al., 2018] Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks](https://arxiv.org/abs/1703.07015)

---

Let's load the log daily returns of exchange rates, and split the data into train, validation, and test subsets!
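The split used in the next cell holds out the last `MAX_FORECASTING_HORIZON` steps for testing and the block of the same length just before them for validation. The sketch below shows the same slicing on a plain list; `holdout_split` is a hypothetical helper and the 100-step series is toy data.

```python
def holdout_split(rows, horizon):
    """Chronological split: the last `horizon` rows are test, the `horizon` rows before them validation."""
    train_val, test = rows[:-horizon], rows[-horizon:]
    train, val = train_val[:-horizon], train_val[-horizon:]
    return train, val, test


train, val, test = holdout_split(list(range(100)), horizon=12)
print(len(train), len(val), len(test))
```

Because the split is chronological, the model is never trained on observations that come after the ones it is evaluated on.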


In [None]:
import pathlib
import numpy as np
import pandas as pd


import torch
from torch.utils.data import DataLoader, Dataset
import torch.nn.functional as F 
import torch.nn as nn 

import lightning as L 
from lightning.pytorch import seed_everything
from lightning.pytorch.callbacks import ModelCheckpoint, EarlyStopping
import optuna


import sys; sys.path.append("../")
import utils
from utils_tutorial import load_file

# WARNING: to compare different models on the same horizon, keep these settings the same across the notebooks
FORECASTING_HORIZON = [4, 8, 12] # weeks 
MAX_FORECASTING_HORIZON = max(FORECASTING_HORIZON)

SEQUENCE_LENGTH = 2 * MAX_FORECASTING_HORIZON
PREDICTION_LENGTH = MAX_FORECASTING_HORIZON

DIRECTORY_PATH_TO_SAVE_RESULTS = pathlib.Path('../results/DIY/').resolve()
MODEL_NAME = "LSTNet"

RESULTS_DIRECTORY = DIRECTORY_PATH_TO_SAVE_RESULTS / MODEL_NAME
if RESULTS_DIRECTORY.exists():
    print(f'Directory {str(RESULTS_DIRECTORY)} already exists.'
           '\nThis notebook will overwrite results in the same directory.'
           '\nYou can also create a new directory if you want to keep this directory.'
           ' Just change the `MODEL_NAME` in this notebook.\n')
else:
    RESULTS_DIRECTORY.mkdir(parents=True)  # also create missing parent directories

data, transformed_data = utils.load_tutotrial_data(dataset='exchange_rate', log_transform=True)
data = transformed_data

train_val_data = data.iloc[:-MAX_FORECASTING_HORIZON]
train_data, val_data = train_val_data.iloc[:-MAX_FORECASTING_HORIZON], train_val_data.iloc[-MAX_FORECASTING_HORIZON:]
test_data = data.iloc[-MAX_FORECASTING_HORIZON:]
print(f"Number of steps in training data: {len(train_data)}\nNumber of steps in validation data: {len(val_data)}\nNumber of steps in test data: {len(test_data)}")

%load_ext autoreload
%autoreload 2

--- 

## Build a rough LSTNet 

In this section, we will roughly implement the LSTNet model, starting from equations 1 to 6 in the paper by Lai et al. (2018).

We will utilize the `SlidingFixedWindow` dataloader from the previous notebook to extract a batch of input data. This batch will serve as the input for implementing the different components of LSTNet, including the convolutional, recurrent, and autoregressive components.
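The implementation of `SlidingFixedWindow` lives in `utils` and is not shown in this notebook. As an assumption about its behaviour (based on the input and target shapes used here), a sliding fixed window pairs each run of `seq_length` consecutive steps with the step that immediately follows it. The function `sliding_windows` below is a hypothetical stand-in for a single series:

```python
def sliding_windows(series, seq_length):
    """Pair every window of `seq_length` steps with the next step as its target."""
    return [
        (series[i:i + seq_length], series[i + seq_length])
        for i in range(len(series) - seq_length)
    ]


samples = sliding_windows(list(range(6)), seq_length=3)
for window, target in samples:
    print(window, "->", target)
```

A series of length `n` therefore yields `n - seq_length` training samples under this assumption.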

I’ve intentionally left parts of the implementation blank for you to practice completing the code yourself. Once you've filled in the blanks, **you can compare your implementation against mine by copying it into the next cell block** and running the function to verify your work.

```python
%%load_file 
../solutions/lstnet_rough.py
```

In [None]:
seq_length = 50
dataset = utils.SlidingFixedWindow(train_data, seq_length)
train_loader = DataLoader(dataset, batch_size=2,  shuffle=True)

inputs, targets = next(iter(train_loader))

print("Inputs shape", inputs.shape)
dropout = 0
n_out_channels = 10
input_dim = inputs.shape[-1]
window=2
bs = inputs.shape[0]
out_features = targets.shape[-1]

# Eq. (1) Convolutional Component
conv1 = nn.Conv2d(
        in_channels= <BLANK>, 
        out_channels=n_out_channels, 
        kernel_size=(window, input_dim)
    )

h_conv = conv1(inputs.unsqueeze(1)).squeeze(-1)
print("Conv output shape:", h_conv.shape)

# Eq. (2) Recurrent Component 
hidden_state_dims_GRU1 = 32
GRU1 = nn.GRU(
        input_size=<BLANK>,
        hidden_size=hidden_state_dims_GRU1,
        batch_first=True, 
        dropout=dropout,
    )

h_conv_in = h_conv.permute(0, 2, 1)
print("GRU input shape: ", h_conv_in.shape)
H_gru, h_gru = GRU1(h_conv_in)
h_gru = h_gru.squeeze(0)
print("GRU output shape:", h_gru.shape)

# Eq. (3) Recurrent-skip Component (GRU for every p hidden states)
skip = 4
hidden_state_dims_GRU2 = 16
GRU2 = nn.GRU(
    input_size=<BLANK>,
    hidden_size=hidden_state_dims_GRU2,
    batch_first=True,
    dropout=dropout,
)

seq_len = h_conv_in.shape[1] // skip # each sequence will have these many elements
n_seq = skip # there will be these many sequences
c = h_conv_in[:, -<BLANK>:] # discard the states which can't fit in the window
c = c.view(<BLANK>, seq_len, n_seq, c.shape[-1]).contiguous() # stride every n_seq before switching index

# switch the dimensions and obtain the input for GRU
c = c.permute(0, 2, 1, 3).contiguous().view(bs*n_seq, <BLANK>, c.shape[-1])

print("These must be equal: ", c[1, :, 1], h_conv_in[0, 2::skip, 1])

_, s = GRU2(c)
print("GRU2 Output shape:", s.shape)

# Eq. (4) Recurrent Skip Component (concatenation)
r = torch.cat((h_gru, s.view(bs, -1)), 1)
linear1 = nn.Linear(hidden_state_dims_GRU1 + skip*hidden_state_dims_GRU2, <BLANK>)
res = linear1(r)
print("r shape:", res.shape)

# (optional) Temporal Attention Layer (replacing the Recurrent Skip Component)
print("H_gru shape:", H_gru.shape)
attn_layer = nn.MultiheadAttention(embed_dim=hidden_state_dims_GRU1, num_heads=4, batch_first=True)
attn_out, attn_ws = attn_layer(query=H_gru[:, -1:], key=<BLANK>, value=<BLANK>)
print("attn out shape: ", attn_out.shape)
print("attn ws shape: ", attn_ws.shape)
r2 = torch.cat((h_gru, attn_out.squeeze(1)), 1)

linear1_attn = nn.Linear(<BLANK>, out_features)
res_attn = linear1_attn(r2)
print("res attn shape: ", res_attn.shape)

# Eq. (5) Autoregressive Component (scaling sensitivity)
linear2 = nn.Linear(<BLANK>, 1)
# transpose so each row holds one series' history; a plain view would mix time steps and series
z = linear2(inputs.permute(0, 2, 1)).squeeze(-1)

# Eq. (6) Final result
Y_t = res + z 
Y_t.shape



--- 

## LSTNet Module


Now that you’ve worked on a rough LSTNet implementation, let’s integrate it into a PyTorch model. We will focus on plain PyTorch for now, and later we will wrap this model in a `LightningModule`.

For now, subclass `torch.nn.Module` and define the two essential functions:
- `__init__`: For initializing the components.
- `forward`: For defining the forward pass of the network.

Use the respective components from the previous implementation to complete these methods.
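As a reminder of the pattern, here is a deliberately tiny module that is unrelated to LSTNet. `TinyForecaster` is a hypothetical name, and its single linear layer is only a placeholder for real components.

```python
import torch
import torch.nn as nn


class TinyForecaster(nn.Module):
    def __init__(self, input_dim, seq_length, out_features):
        super().__init__()
        # Layers are created once here so their parameters are registered with the module.
        self.linear = nn.Linear(seq_length * input_dim, out_features)

    def forward(self, inputs):
        # inputs: (batch, seq_length, input_dim) -> flatten time and series per sample
        batch_size = inputs.shape[0]
        return self.linear(inputs.reshape(batch_size, -1))


model = TinyForecaster(input_dim=8, seq_length=24, out_features=8)
dummy_inputs = torch.randn(2, 24, 8)
out = model(dummy_inputs)  # calling the module runs forward()
print(out.shape)
```

The same division of labour applies to `LSTNet`: build the layers in `__init__`, and describe how a batch flows through them in `forward`.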

Once again, I’ve left certain sections blank for you to fill in. After completing the implementation, **you can compare your solution with mine by using the following magic command**:

```python
%%load_file 
../solutions/lstnet_module.py
```

In [3]:
class LSTNet(nn.Module):
    def __init__(self, 
                 input_dim, 
                 out_features, 
                 seq_length, 
                 num_attn_heads = 4, hidden_state_dims_attn=32,
                 n_out_channels=10, window_size=2, hidden_state_dims_GRU1=32, skip=4, 
                 hidden_state_dims_GRU2=32, dropout=0.0):
        super().__init__()
        self.conv1 = nn.Conv2d(
                in_channels=1, 
                out_channels=n_out_channels, 
                kernel_size=(window_size, input_dim)
            )
        
        self.GRU1 = nn.GRU(
            input_size=n_out_channels,
            hidden_size=hidden_state_dims_GRU1,
            batch_first=True, 
            dropout=dropout,
        )

        self.GRU2 = nn.GRU(
            input_size=n_out_channels,
            hidden_size=hidden_state_dims_GRU2,
            batch_first=True,
            dropout=dropout,
        )

        self.skip = skip
        self.linear1 = nn.Linear(hidden_state_dims_GRU1 + skip*hidden_state_dims_GRU2, out_features)
        self.linear2 = nn.Linear(seq_length, 1)   

        self.attn_layer = nn.MultiheadAttention(embed_dim=hidden_state_dims_attn, 
                                                num_heads=num_attn_heads,
                                                dropout=dropout, 
                                                batch_first=True)     
        self.linear1_attn = nn.Linear(hidden_state_dims_attn + hidden_state_dims_GRU1, out_features)

        self.dropout = nn.Dropout(dropout)

    def forward(self, inputs):
        
        batch_size = inputs.shape[0] 

        ## IMPLEMENT THIS

        return Y_t


--- 

## Training the model

Following the same structure as the previous notebook, we will initialize the `LightningModule` and `Trainer` from PyTorch Lightning and let them run the training and validation loops.

In [4]:
class TimeSeriesModel(L.LightningModule):
    """
    Lightning module for training the model.
    Args:
        input_dim (int): Number of time series (8 if there are 8 time series) 
        output_dim (int): number of time series to predict 
        seq_length (int): number of past time steps given as input
        n_out_channels (int): number of kernels in the convolution layer
        window_size (int): kernel width in convolution layer
        hidden_state_dims_GRU1 (int): dimension of the first recurrent component 
        skip (int): number of hidden units to skip in skip-recurrent component
        hidden_state_dims_GRU2 (int): dimension of the second GRU unit (skip-recurrent component)
        num_attn_heads (int): number of attention heads in the attention unit (used only if skip == 0)
        hidden_state_dims_attn (int): dimension of attention layer (used only if skip == 0)
        dropout (float): probability of dropout (similar to regularization )

        learning_rate (float): starting learning rate for adam optimizer
    """
    def __init__(self, input_dim, output_dim, seq_length, 
                 num_attn_heads=4, hidden_state_dims_attn=64,
                 n_out_channels=16, window_size=2, hidden_state_dims_GRU1=64, skip=4, 
                 hidden_state_dims_GRU2=64, dropout=0.0, learning_rate=1e-3):
        super().__init__()
        self.save_hyperparameters() # can access hyperparams by self.hparams
        self.LSTNet = LSTNet(input_dim, output_dim, seq_length, 
                num_attn_heads, hidden_state_dims_attn, n_out_channels, 
                window_size, hidden_state_dims_GRU1, skip, hidden_state_dims_GRU2, 
                dropout)  # pass dropout through so the tuned value reaches the network
        
        self.loss_function = torch.nn.L1Loss()
    
    def forward(self, inputs):
        return self.LSTNet(inputs)

    def training_step(self, batch, batch_idx):
        inputs, targets = batch 
        outputs = self(inputs)
        loss = self.loss_function(outputs, targets)
        self.log('train_loss', loss, prog_bar=True)
        return loss 

    def validation_step(self, batch, batch_idx):
        inputs, targets = batch 
        outputs = self(inputs)
        loss = self.loss_function(outputs, targets)
        self.log('val_loss', loss, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)

In [None]:
seed_everything(42, workers=True)
BATCH_SIZE = 64

# training dataset 
dataset = utils.SlidingFixedWindow(train_data, seq_length=SEQUENCE_LENGTH)
train_loader = DataLoader(dataset, batch_size=BATCH_SIZE,  shuffle=True)

# validation dataset
val_dataset = utils.SlidingFixedWindow(train_val_data.iloc[-(SEQUENCE_LENGTH + MAX_FORECASTING_HORIZON):], SEQUENCE_LENGTH)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE,  shuffle=False)

model = TimeSeriesModel(input_dim=8, output_dim=8, seq_length=SEQUENCE_LENGTH, skip=0)

# define callbacks 
early_stopping = EarlyStopping(monitor='val_loss', mode='min', verbose=True, patience=5)
# save the best checkpoint; this will be used to evaluate test metrics
best_checkpoint = ModelCheckpoint(save_top_k=1, monitor='val_loss',
                                   mode='min', save_on_train_epoch_end=True,
                                   filename='{epoch:02d}-{val_loss:.5f}')
# save the last checkpoint in case you want to resume training 
last_checkpoint =  ModelCheckpoint(save_top_k=1, monitor='step',
                                   mode='max', save_on_train_epoch_end=True,
                                   filename='{epoch:02d}-{step}')

trainer = L.Trainer(
    max_epochs=20, # preliminary run; we will run hyperparameter tuning in the next step
    check_val_every_n_epoch=5, 
    callbacks=[early_stopping, best_checkpoint, last_checkpoint], 
    deterministic=True,
    )

trainer.fit(model, train_loader, val_loader)


**Visualize the training trajectory**

Run the following command from the command line and open the address that TensorBoard prints. Then navigate to the `Scalars` tab to see how `train_loss` and `val_loss` change during training. 

```bash
tensorboard --logdir=lightning_logs
```

--- 

## Hyperparameter Tuning

Following the same structure as in the previous notebook, we will use Optuna to perform **hyperparameter optimization** for our LSTNet model. As before, we will define:
- An **objective function** that trains the model and returns the performance metric to optimize,
- A **search space** for the hyperparameters using `trial.suggest_*` methods,
- A **study** to manage the optimization process and evaluate different hyperparameter combinations over several trials.
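Because the search space used below is purely categorical, it helps to know how many distinct combinations exist before choosing the number of trials. A small sketch follows; the dictionary mirrors the values suggested in the objective function, and the variable names are illustrative.

```python
from itertools import product

# Candidate values, mirroring the categorical choices in the objective function.
search_space = {"dropout": [0.1, 0.0], "skip": [0, 4]}

# Every combination of one value per hyperparameter.
grid = [dict(zip(search_space, values)) for values in product(*search_space.values())]

for params in grid:
    print(params)
print("distinct combinations:", len(grid))
```

With four combinations and `n_trials=4`, Optuna's default sampler is not guaranteed to visit each combination exactly once, so some settings may be evaluated twice and others not at all.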

In [None]:
def objective(trial: optuna.Trial):
    dropout = trial.suggest_categorical("dropout", [1e-1, 0])
    skip = trial.suggest_categorical("skip", [0, 4]) # attention or seasonality

    model = TimeSeriesModel(
        input_dim=8, output_dim=8, seq_length=SEQUENCE_LENGTH,
        skip=skip,
        dropout=dropout    
    )

    trainer = L.Trainer(
        max_epochs=500, 
        check_val_every_n_epoch=5, 
        callbacks=[EarlyStopping(monitor='val_loss', mode='min', verbose=True, patience=3)], 
        deterministic=True
    )

    trainer.fit(model, train_loader, val_loader)

    return trainer.callback_metrics["val_loss"].item()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=4)

--- 

## Refit on Train-Val Subset

To measure the model's performance on the test data, we will first retrain the model using the combined train-validation dataset. 

**Note**: We can access the best parameters using `study.best_params`. 

In [None]:
study.best_params

In [None]:
# refit on train and val dataset
seed_everything(42, workers=True)

# training dataset 
dataset = utils.SlidingFixedWindow(train_val_data, seq_length=SEQUENCE_LENGTH)
train_val_loader = DataLoader(dataset, batch_size=BATCH_SIZE,  shuffle=True)

model = TimeSeriesModel(
    input_dim=8, output_dim=8, seq_length=SEQUENCE_LENGTH,
    skip=study.best_params['skip'],
    dropout=study.best_params['dropout'],
    )

trainer = L.Trainer(
    max_epochs=500, 
    callbacks=[EarlyStopping(monitor='train_loss', mode='min', verbose=True, patience=10)], 
    deterministic=True
    )

trainer.fit(model, train_val_loader)



--- 

##  Forecast

Because the model predicts only one step ahead, multi-step forecasts have to be generated iteratively. After each prediction, we append the newly predicted value to the input sequence, drop the oldest time step, and use the updated window to predict the next time step.

This process is repeated iteratively until the required number of future predictions is made.
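The loop described above can be sketched with a toy one-step model. Both `recursive_forecast` and `mean_model` are hypothetical names, and the "model" is simply the mean of the window rather than a neural network.

```python
def recursive_forecast(one_step_model, history, horizon):
    """Roll a one-step-ahead model forward by feeding its predictions back in."""
    window = list(history)
    predictions = []
    for _ in range(horizon):
        y = one_step_model(window)
        predictions.append(y)
        window = window[1:] + [y]  # drop the oldest step, append the prediction
    return predictions


def mean_model(window):
    """Toy one-step model: predict the mean of the current window."""
    return sum(window) / len(window)


preds = recursive_forecast(mean_model, [1.0, 2.0, 3.0], horizon=2)
print(preds)
```

Note that every prediction after the first depends on earlier predictions, so errors can accumulate as the horizon grows.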

In [None]:
# let's load the best model
model = TimeSeriesModel.load_from_checkpoint(
    checkpoint_path=trainer.checkpoint_callback.best_model_path,
    input_dim=8, output_dim=8, seq_length=SEQUENCE_LENGTH)
model.eval()

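# forecast the test period recursively, starting from the last SEQUENCE_LENGTH observed steps
history = data.iloc[-(SEQUENCE_LENGTH + MAX_FORECASTING_HORIZON):-MAX_FORECASTING_HORIZON].values
window = torch.tensor(history, dtype=torch.float, device=model.device).unsqueeze(0)  # (1, seq_length, n_series)
y_pred = []
with torch.no_grad():
    for _ in range(MAX_FORECASTING_HORIZON):
        y = model(window)                                    # (1, n_series)
        window = torch.cat((window, y.unsqueeze(1)), dim=1)  # append the prediction as the newest step
        window = window[:, 1:]                               # drop the oldest step
        y_pred.append(y)

y_pred = torch.cat(y_pred)  # (MAX_FORECASTING_HORIZON, n_series)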

AUGMENTED_COL_NAMES = [f"{MODEL_NAME}_{col}_mean" for col in data.columns]
test_predictions_df = pd.DataFrame(y_pred.cpu().numpy(), columns=AUGMENTED_COL_NAMES, index=test_data.index)

# save them to the directory
test_predictions_df.to_csv(f"{str(RESULTS_DIRECTORY)}/predictions.csv", index=True)
print(test_predictions_df.shape)
test_predictions_df.head()

--- 

## Evaluate 

Let's compute the metrics by comparing the predictions with the target data. Note that we have to rename the columns of the dataframe to match the column names the function expects. 
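`utils.get_mase_metrics` is part of the tutorial utilities, and its exact implementation is not shown here. For orientation, the standard non-seasonal definition of MASE (mean absolute scaled error) divides the forecast's mean absolute error by the in-sample mean absolute error of the naive one-step forecast. The sketch below assumes that definition; the function name `mase` and the numbers are illustrative.

```python
def mase(actual, forecast, history, season=1):
    """Mean absolute scaled error relative to the in-sample naive forecast."""
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    naive_errors = [abs(history[i] - history[i - season]) for i in range(season, len(history))]
    naive_mae = sum(naive_errors) / len(naive_errors)
    return mae / naive_mae


history = [1.0, 2.0, 4.0, 7.0]   # in-sample data; naive errors are 1, 2, 3
actual = [8.0, 9.0]              # held-out targets
forecast = [7.0, 11.0]           # absolute errors are 1 and 2

score = mase(actual, forecast, history)
print(score)
```

Values below 1 mean the forecast beats the naive "repeat the last value" baseline on average; values above 1 mean it does worse.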

In [None]:
# evaluate metrics
target_data = data[-MAX_FORECASTING_HORIZON:]
model_metrics, records = utils.get_mase_metrics(
    historical_data=train_val_data,
    test_predictions=test_predictions_df.rename(
            columns={x:x.split("_")[1] for x in test_predictions_df.columns
        }),
    target_data=target_data,
    forecasting_horizons=FORECASTING_HORIZON,
    columns=data.columns, 
    model_name=MODEL_NAME
)
records = pd.DataFrame(records)

records.to_csv(f"{str(RESULTS_DIRECTORY)}/metrics.csv", index=False)
records[['col', 'horizon', 'mase']].pivot(index=['horizon'], columns='col')

--- 

## Compare Models

In [None]:
utils.display_results(path=DIRECTORY_PATH_TO_SAVE_RESULTS, metric='mase')

---

## Plot Forecasts

In [None]:
fig, axs = utils.plot_forecasts(
    historical_data=train_val_data,
    forecast_directory_path=DIRECTORY_PATH_TO_SAVE_RESULTS,
    target_data=target_data,
    columns=data.columns,
    n_history_to_plot=10, 
    forecasting_horizon=MAX_FORECASTING_HORIZON,
    dpi=200,
    exclude_models=['LSTM'],
    plot_se=False,
)

---

## Conclusion

We explored several deep learning architectures for time series forecasting by closely following the work of Lai et al. (2018) on LSTNet. Specifically, we used CNNs to capture local patterns, GRUs to model long-term dependencies, and skip connections or attention mechanisms to capture very long-term patterns. Combined, these components allow us to model the complex structure of time series data across different time scales.

---

## Exercises

- Increase `SEQUENCE_LENGTH` and observe whether this has any impact on model performance.
  
- Explore ways to include multiple levels of skip connections, e.g., `skip=[4, 12, 24]`.
  
- Add new features to the inputs, such as day of the week, lag features, or other time-related variables.
  
- Experiment with different hidden dimensions for the LSTNet to see how the size of the hidden layer affects the model's accuracy.
  
- Apply a normalization procedure (e.g., **min-max scaling**) to the data, ensuring that only the training data is used for fitting the scaler. Perform the modeling process on the normalized data and, after generating the final model's predictions, invert the normalization to return the output to its original scale. See `sklearn.preprocessing.MinMaxScaler` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html))
  
- Additionally, perform the modeling on the **raw data**, without applying any transformation (such as converting it into log daily returns), to compare results directly with the untransformed dataset.
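For the normalization exercise, the key point is that the scaling statistics come from the training data only. Here is a dependency-free sketch of that workflow; the class `MinMax` is a hypothetical stand-in that mimics the fit / transform / inverse_transform pattern of `sklearn.preprocessing.MinMaxScaler` for a single series.

```python
class MinMax:
    """Scale values to [0, 1] using statistics from the data passed to fit()."""

    def fit(self, values):
        self.low, self.high = min(values), max(values)
        return self

    def transform(self, values):
        return [(v - self.low) / (self.high - self.low) for v in values]

    def inverse_transform(self, scaled):
        return [s * (self.high - self.low) + self.low for s in scaled]


train, test = [2.0, 4.0, 6.0], [8.0]
scaler = MinMax().fit(train)                  # fit on the training data only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)          # may fall outside [0, 1]
restored = scaler.inverse_transform(test_scaled)
print(train_scaled, test_scaled, restored)
```

Values outside the training range map outside `[0, 1]`, which is expected when the scaler has not seen the later data.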


---
## Next Steps

To learn about more advanced deep learning based approaches, proceed to module 6 (Transformer based models) or module 7 (LLM-based models).

---