# Deep Learning Methods for Forecasting

**Approximate Learning Time**: Up to 4 hours

--- 

In this module, we will explore **deep learning architectures** for time series forecasting. Specifically:

- We will cover **Multi-Layer Perceptrons (MLP)**, **Recurrent Neural Networks (RNN)**, and **Long Short-Term Memory (LSTM)** architectures in this notebook.
- In the next notebook, we will implement **LSTNet**, a model proposed by Lai et al. (2018). This will expose you to the challenges and considerations in applying LSTM to time series forecasting, as well as strategies for addressing these issues.

Throughout this module, we will use the **PyTorch** and **PyTorch Lightning** frameworks to facilitate model training and experimentation.

---

## Quick Overview of Deep Learning Models

<ins>**What is Deep Learning?**</ins>


Deep learning is a subset of machine learning that focuses on using **neural networks** with many layers (hence "deep"). These networks are designed to automatically learn and extract meaningful patterns from large datasets. Deep learning excels in tasks such as image recognition, natural language processing, and time series forecasting, where traditional models may struggle to capture complex relationships.

<ins>**What is MLP?**</ins>

A **Multi-Layer Perceptron (MLP)** is a type of **feedforward neural network** consisting of multiple layers of neurons:
- An input layer,
- One or more hidden layers,
- An output layer.

Each neuron is connected to every neuron in the next layer, and the network learns by adjusting the weights of these connections to minimize the error between predicted and actual outputs. MLPs work well when patterns can be learned directly from a fixed-size input, but they have no built-in notion of order or memory: a time series window must be flattened into a single vector, so the temporal structure has to be learned from scratch.
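
To make this concrete, here is a minimal sketch of an MLP applied to a window of a multivariate time series. The window length, feature count, and hidden size are illustrative assumptions, not values used later in this notebook.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes (assumptions for this sketch only)
seq_length, n_features, hidden_dim = 24, 8, 32

# The window is flattened first, so the network sees one long vector
# and has no explicit information about the order of the time steps.
mlp = torch.nn.Sequential(
    torch.nn.Flatten(),                                    # (batch, seq_length * n_features)
    torch.nn.Linear(seq_length * n_features, hidden_dim),  # hidden layer
    torch.nn.ReLU(),
    torch.nn.Linear(hidden_dim, n_features),               # one-step-ahead output
)

window = torch.randn(4, seq_length, n_features)  # toy batch of 4 windows
prediction = mlp(window)
print(prediction.shape)
```

The output has one row per window and one column per feature, which is the same target shape we will use for the LSTM later on.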

<ins>**What is RNN?**</ins>

A **Recurrent Neural Network (RNN)** is a type of neural network designed to handle **sequential data**, such as time series or text. Unlike MLPs, RNNs have connections that loop back on themselves, allowing them to "remember" information from previous steps in the sequence. This makes RNNs well-suited for tasks where context or memory is important.

However, RNNs often face issues like **vanishing gradients**, which can make learning long-term dependencies difficult.

<ins>**What is LSTM?**</ins>

**Long Short-Term Memory (LSTM)** is a type of RNN that addresses the limitations of traditional RNNs, particularly the issue of vanishing gradients. LSTMs add a separate cell state and three **gates** (input, forget, and output) that let the network selectively retain or discard information over long sequences. This enables LSTMs to learn both **short-term** and **long-term** dependencies more effectively, making them well suited to time series forecasting, where patterns span different timescales.
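
To see what the gates do, the sketch below performs a single LSTM step by hand and compares it with `torch.nn.LSTMCell`. It relies on PyTorch's convention of storing the gate weights in the order input, forget, cell candidate, output; the sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
input_dim, hidden_dim = 3, 5
cell = torch.nn.LSTMCell(input_dim, hidden_dim)

x = torch.randn(2, input_dim)    # one time step for a batch of 2
h = torch.randn(2, hidden_dim)   # previous hidden state
c = torch.randn(2, hidden_dim)   # previous cell state (the "memory")

with torch.no_grad():
    # All four gate pre-activations come from one pair of matrix products.
    gates = x @ cell.weight_ih.T + cell.bias_ih + h @ cell.weight_hh.T + cell.bias_hh
    i, f, g, o = gates.chunk(4, dim=1)

    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # gates in (0, 1)
    g = torch.tanh(g)                                               # candidate content

    c_next = f * c + i * g            # forget part of the old memory, write new content
    h_next = o * torch.tanh(c_next)   # expose a gated view of the memory

    h_ref, c_ref = cell(x, (h, c))    # reference result from PyTorch

matches = torch.allclose(h_next, h_ref, atol=1e-5) and torch.allclose(c_next, c_ref, atol=1e-5)
print(matches)
```

Because the cell state is updated additively (`f * c + i * g`), gradients can flow through many steps more easily than in a plain RNN.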


Please refer to this [excellent blog](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) to understand more about LSTMs and RNNs.

---

In this session, we will focus on building forecasting models using **LSTM networks**. While we will primarily work with LSTMs, replacing them with **RNNs** or **MLPs** is relatively straightforward and will be left as an exercise for you to explore.

**Note**: 
There are several Python libraries available that offer pre-built implementations for time series forecasting. However, as of September 2024, many of these libraries are still under active development. While they are useful for their ready-made models and tooling, it is valuable to learn how to implement your own models from scratch, as this gives you a deeper understanding of the underlying mechanics. That being said, here are some libraries you might want to be aware of:

- [**GluonTS**](https://ts.gluon.ai/stable/index.html): A comprehensive library offering several [models](https://ts.gluon.ai/stable/getting_started/models.html) widely used in the academic community.
- [**Pytorch Forecasting**](https://pytorch-forecasting.readthedocs.io/en/stable/index.html): A popular library built on top of PyTorch, designed specifically for time series forecasting.
- [**Neuralforecast**](https://github.com/Nixtla/neuralforecast): Provides several deep learning-based forecasting algorithms.


**References**

[(Lai et al., 2018) Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks](https://arxiv.org/abs/1703.07015)


---

Let's load the log daily returns of exchange rates, and split the data into train, validation, and test subsets!


In [None]:
import pathlib
import numpy as np
import pandas as pd
from termcolor import colored

import torch
from torch.utils.data import DataLoader, Dataset
import lightning as L 
from lightning.pytorch import seed_everything
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
import optuna

import sys; sys.path.append("../")  # make the shared `utils` module importable
import utils

## WARNING: To compare different models on the same horizon, keep this the same across the notebooks
FORECASTING_HORIZON = [4, 8, 12] # weeks 
MAX_FORECASTING_HORIZON = max(FORECASTING_HORIZON)

SEQUENCE_LENGTH = 2 * MAX_FORECASTING_HORIZON
PREDICTION_LENGTH = MAX_FORECASTING_HORIZON

DIRECTORY_PATH_TO_SAVE_RESULTS = pathlib.Path('../results/DIY/').resolve()
MODEL_NAME = "LSTM"

RESULTS_DIRECTORY = DIRECTORY_PATH_TO_SAVE_RESULTS / MODEL_NAME
if RESULTS_DIRECTORY.exists():
    print(colored(f'Directory {str(RESULTS_DIRECTORY)} already exists.'
           '\nThis notebook will overwrite results in the same directory.'
           '\nYou can also create a new directory if you want to keep this directory untouched.'
           ' Just change the `MODEL_NAME` in this notebook.\n', "red" ))
else:
    RESULTS_DIRECTORY.mkdir(parents=True)

data, transformed_data = utils.load_tutotrial_data(dataset='exchange_rate', log_transform=True)
data = transformed_data

train_val_data = data.iloc[:-MAX_FORECASTING_HORIZON]
train_data, val_data = train_val_data.iloc[:-MAX_FORECASTING_HORIZON], train_val_data.iloc[-MAX_FORECASTING_HORIZON:]
test_data = data.iloc[-MAX_FORECASTING_HORIZON:]
print(f"Number of steps in training data: {len(train_data)}\nNumber of steps in validation data: {len(val_data)}\nNumber of steps in test data: {len(test_data)}")

%load_ext autoreload
%autoreload 2

---

## Long Short-Term Memory (LSTM)

PyTorch provides LSTM functionality through the `nn.LSTM` class ([documentation here](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html)). The two main arguments required for setting up an LSTM are: (a) **Input size**: The number of features associated with each step in the sequence, (b) **Hidden size**: The size of the internal hidden state representation in the LSTM.

Neural networks learn by adjusting their weights according to an **objective function** (or loss function). These weight updates are made using the gradients of the objective function, often calculated on a batch of samples. Therefore, during training, data is processed in **batches**.

We will use the convention `batch_first=True`, which means our input tensors will have the following shape: `(batch size, number of time steps in the sequence, dimension of each step)`.

In time series data, each time step can be featurized in different ways. While many time series models use a single feature per time step, it is possible to include multiple features, such as lagged values or categorical features like day of the week, making the dimension of each time step larger than one.

Additionally, time series data may have sequences of varying lengths. When some sequences are shorter than others, **padding** is applied: the shorter sequences are filled with a placeholder value (typically 0) up to a common length. A separate **mask** tensor then marks where the valid data is, using 1s for valid data points and 0s for the padded values.
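
As a small illustration of padding, the sketch below pads three toy sequences to a common length with `pad_sequence` and builds the corresponding mask. The sequences and their lengths are made up for this example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three toy sequences of different lengths, each with 2 features per step.
sequences = [torch.ones(5, 2), torch.ones(3, 2), torch.ones(1, 2)]
lengths = torch.tensor([len(s) for s in sequences])

# Pad with zeros up to the longest sequence: shape (batch, max_len, features).
padded = pad_sequence(sequences, batch_first=True, padding_value=0.0)

# Mask with 1 for real steps and 0 for padded steps: shape (batch, max_len).
mask = (torch.arange(padded.shape[1])[None, :] < lengths[:, None]).long()

print(padded.shape)
print(mask)
```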

In [None]:
batch_size = 12
seq_length = 10
input_dim = 8
hidden_dim = 16
x = torch.randn(batch_size, seq_length, input_dim)  # (batch, steps, features)

lstm = torch.nn.LSTM(input_dim, hidden_dim, batch_first=True)

out, (h, _) = lstm(x)
print("Shape of output: ", out.shape)
print("Shape of the final hidden state: ", h.shape)
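
The two outputs are closely related. As a quick check, assuming a single-layer, unidirectional LSTM like the one above, the last time step of the output sequence equals the final hidden state:

```python
import torch

torch.manual_seed(0)
lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(12, 10, 8)  # (batch, steps, features)

with torch.no_grad():
    out, (h, c) = lstm(x)

# `out` holds the hidden state at every step; `h` holds only the final one,
# with a leading dimension of size num_layers * num_directions (1 here).
print(out.shape, h.shape)
same = torch.allclose(out[:, -1, :], h[-1])
print(same)
```

This is why a forecasting head can be attached to either `out[:, -1]` or `h[-1]` in this setting.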

--- 

## Dataloading for model training

As mentioned earlier, deep learning frameworks require special handling of data, particularly when it comes to **batching**. Batching is crucial for computational efficiency, as it allows models to process multiple samples simultaneously, leveraging hardware like GPUs to speed up training.


PyTorch provides two essential components for working with data:

- `torch.utils.data.Dataset`: This class defines how to retrieve a single data sample from a dataset. It's designed to abstract data access, allowing you to focus on what a sample looks like, not how it’s retrieved.
- `torch.utils.data.DataLoader`: This takes an instance of `Dataset` and groups its samples into batches, optionally shuffling them. With `num_workers > 0`, samples are loaded in parallel by multiple worker processes, making the data ready for training or inference.

In PyTorch, creating a custom dataset involves subclassing `torch.utils.data.Dataset` and defining at least two key methods:

- `__len__`: This method returns the number of samples in your dataset. It defines the range of valid indices that can be passed to `__getitem__`.
- `__getitem__`: This method defines how to fetch a single data sample, given its index. Each call to `__getitem__` should return the data point associated with the provided index.

By implementing these two methods, we can effectively define how data is accessed from any underlying data source—whether it's a CSV file, a database, or a time series dataset like in our case.
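
Before moving to time series, here is a minimal sketch of these two methods on a toy dataset. `SquaresDataset` is a hypothetical class written only for this illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i**2)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # number of valid indices: 0, ..., n - 1
        return self.n

    def __getitem__(self, index):
        x = torch.tensor([float(index)])
        return x, x ** 2

# The DataLoader calls __getitem__ for each index and stacks the results into batches.
loader = DataLoader(SquaresDataset(10), batch_size=4, shuffle=False)
batches = [(x.shape, y.shape) for x, y in loader]
print(batches)
```

With 10 samples and a batch size of 4, the last batch is smaller than the others, which is something to keep in mind when reading the shapes printed later in this notebook.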

---

We will implement a **sliding fixed-window approach** for fetching data. To keep things simple for now, we will not handle sequences that are shorter than the required sequence length. This will be left as an exercise for you to explore. In the next module, we will introduce **GluonTS**, which automatically handles varying sequence lengths.

In this approach, we create one window for each start index from 0 to `len(data) - seq_length - 1`, giving `len(data) - seq_length` windows in total. The target for each window is the value of the next step after the window.
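
As a worked example of this indexing, consider a toy series of six values and a window length of 3 (both chosen only for illustration):

```python
series = list(range(6))   # toy series: 0, 1, 2, 3, 4, 5
seq_length = 3

n_windows = len(series) - seq_length   # number of (window, target) pairs
pairs = [
    (series[i:i + seq_length], series[i + seq_length])
    for i in range(n_windows)
]
for window, target in pairs:
    print(window, "->", target)
```

The last window starts at index `len(series) - seq_length - 1`, so that there is still one value left to serve as its target.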

The class below implements the sliding window logic, and the following cells will initialize this class with both the training and validation datasets. Note that for the validation dataset, the targets will come exclusively from the validation data, while the inputs may still include data from the training set to maintain the sequential nature of the time series.

In [4]:
# Training series
class SlidingFixedWindow(Dataset):
    """
    Returns a time series data iterator with sliding window of fixed size.

    Args:
        data: Pandas dataframe 
        seq_length (int): number of past time steps to include in the training sequence.
    """
    def __init__(self, data, seq_length=100):
        self.data = data
        self.seq_length = seq_length
    
    def __len__(self):
        return len(self.data) - self.seq_length
    
    def __getitem__(self, index):
        return (
            torch.tensor(self.data.iloc[index:index+self.seq_length].values, dtype=torch.float),
            torch.tensor(self.data.iloc[index+self.seq_length].values, dtype=torch.float)
        )

In [None]:
# example run to check the batch sizes
dataset = SlidingFixedWindow(train_data, SEQUENCE_LENGTH)
train_loader = DataLoader(dataset, batch_size=12,  shuffle=True)

print("Training Dataset (showing only 1 batch)")
for batch in train_loader:
    inputs, targets = batch 
    print(f"SlidingFixedWindow:\tShape of inputs: {inputs.shape}\tShape of targets: {targets.shape}")
    break

In [None]:
# since we keep only the last few steps for validation purposes, we truncate train_val_data
print("Validation Dataset")
val_dataset = SlidingFixedWindow(train_val_data[-(SEQUENCE_LENGTH+MAX_FORECASTING_HORIZON):], SEQUENCE_LENGTH)
val_loader = DataLoader(val_dataset, batch_size=10,  shuffle=False)
for batch in val_loader:
    inputs, targets = batch
    print(f"SlidingFixedWindow:\tShape of inputs: {inputs.shape}\tShape of targets: {targets.shape}")

--- 

## Lightning: Training the model


PyTorch provides a flexible interface for implementing deep learning models, but a lot of the training code tends to look quite similar across different projects. Many codebases include best practices and optimizations that help streamline the training process.

**PyTorch Lightning** is a framework designed to simplify the training of neural networks by abstracting much of the boilerplate code involved in training. It does this through the `LightningModule`, a class that you subclass to define the core components of your model.

In a `LightningModule`, you’ll need to implement the following key functions:
- `forward`: Defines the forward pass of the network.
- `training_step`: Specifies the logic for a single training step.
- `validation_step` (optional): Defines the validation logic for a single validation step.
- `configure_optimizers`: Sets up the optimizer(s) and learning rate scheduler(s) for training.

Internally, PyTorch Lightning handles the standard training loops by calling these methods. The framework also provides a `Trainer` class that manages the execution of the training process. `Trainer` comes with a variety of arguments and options to customize training, many of which we will explore in this tutorial.
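
For comparison, here is a rough sketch of the kind of plain PyTorch loop that Lightning takes over. The toy regression data and the linear model are assumptions made for this illustration; the comments indicate which `LightningModule` method each part loosely corresponds to.

```python
import torch

torch.manual_seed(0)

# Toy regression data: y = 2x + noise (purely illustrative).
x = torch.randn(64, 1)
y = 2 * x + 0.1 * torch.randn(64, 1)

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-1)  # ~ configure_optimizers
loss_fn = torch.nn.L1Loss()

losses = []
for epoch in range(50):
    optimizer.zero_grad()          # reset gradients from the previous step
    loss = loss_fn(model(x), y)    # ~ forward + training_step
    loss.backward()                # backpropagation
    optimizer.step()               # weight update
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

A real project would add batching, validation, logging, checkpointing, and device placement on top of this, which is the boilerplate the `Trainer` handles.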


In [7]:
class TimeSeriesModel(L.LightningModule):
    def __init__(self, input_dim=1, hidden_dim=150, output_dim=1, learning_rate=1e-3):
        super().__init__()
        self.save_hyperparameters() # can access hyperparams by self.hparams
        self.lstm = torch.nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        _, (hidden, _) = self.lstm(x)
        return self.fc(hidden[-1])
    
    def training_step(self, batch, batch_idx):
        inputs, targets = batch 
        outputs = self(inputs)
        loss = torch.nn.L1Loss()(outputs, targets)
        self.log('train_loss', loss, prog_bar=True)
        return loss 

    def validation_step(self, batch, batch_idx):
        inputs, targets = batch 
        outputs = self(inputs)
        loss = torch.nn.L1Loss()(outputs, targets)
        self.log('val_loss', loss, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)


We will now train the model for at most **20 epochs**. Each epoch is one full pass over the training dataset, so the number of batches per epoch is determined by the length of the dataset and the batch size.

The argument `check_val_every_n_epoch` specifies how frequently the validation step is run. In our case, we set it to run the `validation_step` every 5 epochs, which is when we will see the `val_loss` being logged.

**PyTorch Lightning** also supports a wide range of **callbacks** ([documentation](https://lightning.ai/docs/pytorch/stable/extensions/callbacks.html)) that can be passed to the `Trainer` to enhance training flexibility. For example, we are using the **EarlyStopping** callback ([documentation](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.callbacks.EarlyStopping.html#lightning.pytorch.callbacks.EarlyStopping)), which stops the training if the monitored metric (e.g., validation loss) does not improve for `patience` consecutive validation checks. Note that `patience` counts validation checks rather than training epochs: with validation running every 5 epochs, `patience=5` corresponds to 25 epochs without improvement.
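
The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not Lightning's actual implementation; `stopped_at` is a hypothetical helper and the loss values are made up.

```python
def stopped_at(val_losses, patience):
    """Return the index of the validation check at which training would stop, or None."""
    best = float("inf")
    checks_without_improvement = 0
    for check_idx, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return check_idx
    return None

# One entry per validation check (not per training epoch).
history = [0.50, 0.40, 0.41, 0.42, 0.43, 0.39]
print(stopped_at(history, patience=3))  # stops before the late improvement is seen
print(stopped_at(history, patience=5))  # never stops on this history
```

A small `patience` saves compute but can stop training just before a late improvement, as the first call shows.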

In [None]:
seed_everything(42, workers=True)
BATCH_SIZE=64

# training dataset 
dataset = SlidingFixedWindow(train_data, seq_length=SEQUENCE_LENGTH)
train_loader = DataLoader(dataset, batch_size=BATCH_SIZE,  shuffle=True)

# validation dataset
val_dataset = SlidingFixedWindow(train_val_data[-(SEQUENCE_LENGTH + MAX_FORECASTING_HORIZON): ], SEQUENCE_LENGTH)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE,  shuffle=False)

model = TimeSeriesModel(input_dim=train_val_data.shape[1], output_dim=train_val_data.shape[1])
trainer = L.Trainer(
    max_epochs=20, # preliminary run; we run hyperparameter tuning in the next step
    check_val_every_n_epoch=5, 
    callbacks=[EarlyStopping(monitor='val_loss', mode='min', verbose=True, patience=5)], 
    deterministic=True
    )

trainer.fit(model, train_loader, val_loader)


Great! We've just trained our first model.

Now, let's examine how the **loss** evolved throughout the training process. To do this, we'll use **TensorBoard**, which is integrated into the Lightning framework. If you haven't installed TensorBoard yet, you can do so from its [homepage](https://pypi.org/project/tensorboard/).

Once TensorBoard is installed, run the following command in your terminal:

```bash
tensorboard --logdir=lightning_logs
```

After launching TensorBoard, open the URL it prints (including the port) in your browser. In the **Scalars** tab, you can observe how `train_loss` and `val_loss` changed during the training process. Note that the run name is displayed next to `v_num` in the progress bar during training, so you can easily identify your session.


--- 

## Hyperparameter Tuning

As you may notice, there are hyperparameters that we hard-coded in the previous section, such as `learning_rate` and `hidden_dim`. These hyperparameters significantly impact the model’s performance. To optimize these values, we will conduct a **hyperparameter search** using **Optuna** ([documentation](https://optuna.readthedocs.io/en/stable/)).

Using Optuna is fairly straightforward and requires four key components:

1. **Objective function**: The function that Optuna tries to optimize (e.g., minimizing validation loss).
2. **Trial**: Each trial is a single evaluation of the objective function with a specific set of hyperparameters.
3. **Search space**: This defines the range of values Optuna will explore for each hyperparameter. We define the search space inside the objective function by calling the `suggest_*` methods of the `optuna.Trial` object.
4. **Study**: We create an `optuna.Study` object with `optuna.create_study` and pass the objective function to `study.optimize`, which searches for the best hyperparameters using Optuna's internal optimization algorithms.

Typically, **learning rate** is searched on a **log-uniform scale** because it often takes smaller values across many orders of magnitude. However, for simplicity and because we will only run a few trials in this tutorial, we will use a categorical search for the learning rate.

We will also search for the best **hidden dimension** (`hidden_dim`) from the values `[256, 512]`.
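
To illustrate what a log-uniform scale means for the learning rate, the sketch below samples values whose logarithm is uniformly distributed. It uses only the standard library and is not Optuna's sampler; `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

random.seed(0)

def sample_log_uniform(low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(random.uniform(math.log(low), math.log(high)))

samples = [sample_log_uniform(1e-5, 1e-1) for _ in range(1000)]

# Count the draws in each decade: [1e-2, 1e-1), [1e-3, 1e-2), [1e-4, 1e-3), [1e-5, 1e-4).
per_decade = [
    sum(10 ** -(k + 1) <= s < 10 ** -k for s in samples)
    for k in range(1, 5)
]
print(per_decade)
```

Each decade receives roughly the same number of draws, whereas plain uniform sampling between `1e-5` and `1e-1` would put almost all draws above `1e-2`.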

In [None]:
def objective(trial: optuna.Trial):
    # typical usage
    # learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    # hidden_dim = trial.suggest_int("hidden_dim", 128, 256)

    # for tutorial we go light
    learning_rate = trial.suggest_categorical("learning_rate", [1e-3, 1e-2])
    hidden_dim = trial.suggest_categorical("hidden_dim", [256, 512])

    model = TimeSeriesModel(
        input_dim=train_val_data.shape[1], 
        output_dim=train_val_data.shape[1],
        learning_rate=learning_rate,
        hidden_dim=hidden_dim    
    )

    trainer = L.Trainer(
        max_epochs=200, 
        check_val_every_n_epoch=5, 
        callbacks=[EarlyStopping(monitor='val_loss', mode='min', verbose=True, patience=3)], 
        deterministic=True
    )

    trainer.fit(model, train_loader, val_loader)

    return trainer.callback_metrics["val_loss"].item()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=2) 

--- 

## Refit on Train-Val Subset

To measure the model's performance on the test data, we will first retrain the model using the combined train-validation dataset. 

**Note**: We can access the best parameters using `study.best_params`. 

In [None]:
# refit on train and val dataset

seed_everything(42, workers=True)

# training dataset 
dataset = SlidingFixedWindow(train_val_data, seq_length=SEQUENCE_LENGTH)
train_val_loader = DataLoader(dataset, batch_size=BATCH_SIZE,  shuffle=True)

model = TimeSeriesModel(
    input_dim=train_val_data.shape[1], 
    output_dim=train_val_data.shape[1],
    hidden_dim=study.best_params['hidden_dim'],
    learning_rate=study.best_params['learning_rate'],
    )

trainer = L.Trainer(
    max_epochs=200, 
    callbacks=[EarlyStopping(monitor='train_loss', mode='min', verbose=False, patience=10)], 
    deterministic=True
    )

trainer.fit(model, train_val_loader)



--- 

##  Forecast

Our model predicts only one step ahead, so multi-step forecasts need to be generated iteratively. After each prediction, we append the newly predicted value to the input sequence, drop the oldest step, and use the updated window to predict the next time step.

This process is repeated iteratively until the required number of future predictions is made.

In [None]:
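# inference: roll the model forward one step at a time
model.eval()
window = torch.tensor(
    data.iloc[-(SEQUENCE_LENGTH + MAX_FORECASTING_HORIZON):-MAX_FORECASTING_HORIZON].values,
    dtype=torch.float,
).unsqueeze(0)  # shape: (1, SEQUENCE_LENGTH, n_features)
y_pred = []
with torch.no_grad():
    for _ in range(MAX_FORECASTING_HORIZON):
        y = model(window)  # shape: (1, n_features)
        # append the prediction and drop the oldest step to keep the window length fixed
        window = torch.cat((window[:, 1:], y.unsqueeze(1)), dim=1)
        y_pred.append(y)

y_pred = torch.cat(y_pred)  # shape: (MAX_FORECASTING_HORIZON, n_features)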

In [None]:
AUGMENTED_COL_NAMES = [f"{MODEL_NAME}_{col}_mean" for col in data.columns]
test_predictions_df = pd.DataFrame(y_pred.numpy(), columns=AUGMENTED_COL_NAMES, index=test_data.index)
test_predictions_df.to_csv(f"{str(RESULTS_DIRECTORY)}/predictions.csv", index=True)
test_predictions_df.head()

--- 

## Evaluate 

Let's compute the metrics by comparing the predictions with the target data. Note that we have to rename the columns of the dataframe to match the column names expected by the function.

In [None]:
target_data = data[-MAX_FORECASTING_HORIZON:]
model_metrics, records = utils.get_mase_metrics(
    historical_data=train_val_data,
    test_predictions=test_predictions_df.rename(
            columns={x:x.split("_")[1] for x in test_predictions_df.columns
        }),
    target_data=target_data,
    forecasting_horizons=FORECASTING_HORIZON,
    columns=data.columns, 
    model_name=MODEL_NAME
)

records = pd.DataFrame(records)

records.to_csv(f"{str(RESULTS_DIRECTORY)}/metrics.csv", index=False)
records[['col', 'horizon', 'mase']].pivot(index=['horizon'], columns='col')

--- 

## Compare Models

In [None]:
utils.display_results(path=DIRECTORY_PATH_TO_SAVE_RESULTS, metric='mase')

---

## Plot Forecasts

In [None]:
fig, axs = utils.plot_forecasts(
    historical_data=train_val_data,
    forecast_directory_path=DIRECTORY_PATH_TO_SAVE_RESULTS,
    target_data=test_data,
    columns=data.columns,
    n_history_to_plot=10, 
    forecasting_horizon=MAX_FORECASTING_HORIZON,
    dpi=200, 
    plot_se=False
)

## (Optional) Expanding Window DataLoader


In this optional section, we will look at the design of an expanding window dataloader. 
We want to be able to handle sequences of varying lengths.
The difficulty arises because `torch.utils.data.DataLoader` stacks all the samples of a batch into a single tensor, and a tensor can't have variable-length elements.
As a result, we need to pad our sequences so that all of them are of the same length. 

The `collate_fn` argument of `torch.utils.data.DataLoader` lets us preprocess the individual samples returned by `torch.utils.data.Dataset` so that they are of the same length.
PyTorch provides efficient implementations of recurrent networks such as RNNs and LSTMs that handle sequences of unequal lengths.
These implementations accept a `PackedSequence` object, which represents a batch of sequences of unequal length.

`PackedSequence` has two main attributes: 
  - **`data`**: A tensor holding all the non-padded elements from the input sequences, stacked together.
  - **`batch_sizes`**: A tensor that specifies how many valid (non-padded) sequences are present at each time step.


Thus, a value of 12 at index 1 of `batch_sizes` means that there are 12 sequences that are active at that time step. 
Recurrent networks in PyTorch are designed to efficiently process the inputs by keeping track of the number of sequences alive at any time step, thereby not wasting compute resources on padding. 


 - By using the **`batch_sizes`** tensor, PyTorch knows exactly how many sequences are "alive" (non-padded) at each time step.
 - Instead of iterating over all time steps for all sequences (which would involve unnecessary computations on padded values), PyTorch processes only the non-padded entries.
 
 In essence, PyTorch only processes the valid data points in the `PackedSequence`, while completely ignoring any padded elements.
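
As a small illustration, the sketch below packs three toy sequences of lengths 3, 2, and 1 and inspects the resulting `PackedSequence`. The sizes are arbitrary and chosen only to keep the numbers easy to read.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence

torch.manual_seed(0)

# Three toy sequences with 3, 2 and 1 steps, each step having 4 features.
sequences = [torch.randn(3, 4), torch.randn(2, 4), torch.randn(1, 4)]
lengths = torch.tensor([3, 2, 1])

padded = pad_sequence(sequences, batch_first=True)  # shape (3, 3, 4), zero-padded
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

print(packed.data.shape)    # only the 3 + 2 + 1 = 6 real steps are stored
print(packed.batch_sizes)   # number of sequences still alive at steps 0, 1, 2

# An LSTM accepts the packed batch directly.
lstm = torch.nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
with torch.no_grad():
    _, (h, _) = lstm(packed)
print(h.shape)              # one final hidden state per sequence
```

Each row of `h[-1]` is the hidden state at the last real step of the corresponding sequence, so padded steps do not influence the result.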



PyTorch’s `PackedSequence` helps avoid processing padding in recurrent neural networks (RNNs, LSTMs, GRUs) by skipping the padded values during computation. 


Let's see how this works in practice. 

In [None]:
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

class ExpandingWindow(Dataset):
    """
    Returns an expanding window data iterator for time series.
    Args:
        min_index (int): defines the minimum number of past time steps to include in the training sequence.
    """
    def __init__(self, data, min_index=0):
        self.data = data
        self.min_index = min_index
    
    def __len__(self):
        return len(self.data) - self.min_index - 1
    
    def __getitem__(self, index):
        current_index = index + self.min_index
        return (
            # +1 so that at index 0 (with min_index=0) we don't get an empty window
            torch.tensor(self.data.iloc[:current_index + 1].values, dtype=torch.float), 
            torch.tensor(self.data.iloc[current_index + 1].values, dtype=torch.float)
        )

def collate_fn(batch):
    windows, targets = zip(*batch)

    lengths = torch.tensor([len(window) for window in windows])
    # print("lengths\t ", lengths)

    padded_windows = pad_sequence(windows, batch_first=True)
    # print("padded_windows shape\t", padded_windows.shape)

    packed_windows = pack_padded_sequence(padded_windows, lengths, batch_first=True, enforce_sorted=False)
    targets = torch.stack(targets)

    return packed_windows, targets


dataset = ExpandingWindow(train_data, min_index=100)
train_loader = DataLoader(dataset, batch_size=12,  collate_fn=collate_fn, shuffle=True)

print("Training dataset (showing only 1 batch)")
for batch in train_loader:
    inputs, targets = batch 
    print(f"ExpandingWindow:\nPackedWindow data shape:{inputs.data.shape}\t PackedWindow batch_sizes shape:{inputs.batch_sizes.shape}\tShape of targets: {targets.shape}")
    print(f"Sum of batch_sizes in PackedWindow: {inputs.batch_sizes.sum()}")
    break


In [None]:
val_dataset = ExpandingWindow(train_val_data, min_index=len(train_data)-1)
val_loader = DataLoader(val_dataset, batch_size=10,  collate_fn=collate_fn, shuffle=False)
print("Validation dataset")
for batch in val_loader:
    inputs, targets = batch
    print(f"ExpandingWindow:\nPackedWindow data shape:{inputs.data.shape}\t PackedWindow batch_sizes shape:{inputs.batch_sizes.shape}\tShape of targets: {targets.shape}")
    print(f"Sum of batch_sizes in PackedWindow: {inputs.batch_sizes.sum()}")


---

## Conclusion

We covered how to build deep learning models for time series forecasting. We learned how to implement **dataloaders** in PyTorch to efficiently pack and prepare samples for time series forecasting. Additionally, we explored how to use **PyTorch** and the **Lightning framework** to build and train LSTM models, simplifying the model training process and improving efficiency. Finally, we learned how to use **Optuna** for hyperparameter optimization.

---
## Exercises

- Increase `SEQUENCE_LENGTH` and observe whether this has any impact on model performance.

- Adapt the `SlidingFixedWindow` class to handle **variable-length sequences** in the input data.
  
- Add new features to the inputs, such as day of the week, lag features, or other time-related variables.
  
- Experiment with different hidden dimensions for the LSTM to see how the size of the hidden layer affects the model's accuracy.
  
- Replace the LSTM module with a simple **feed-forward network (MLP)** and compare the performance.
  
- Replace the LSTM module with a standard **RNN** or a **stacked LSTM** architecture to evaluate whether more complex structures improve the results.
  
- Apply a normalization procedure (e.g., **min-max scaling**) to the data, ensuring that only the training data is used for fitting the scaler. Perform the modeling process on the normalized data and, after generating the final model's predictions, invert the normalization to return the output to its original scale. See `sklearn.preprocessing.MinMaxScaler` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)).

- Additionally, perform the modeling on the **raw data**, without applying any transformation (such as converting it into log daily returns), to compare results directly with the untransformed dataset.

---

## Next Steps

- To understand the limitations of LSTM in time series modeling, proceed to Notebook 5.1, where we will explore and implement the LSTNet architecture.

- To learn about more advanced deep learning based approaches, proceed to module 6 (Transformer based models) or module 7 (LLM-based models).

---