# HW3-B. Defining an Encoder-Decoder model

## About this notebook

This notebook was used in the 50.039 Deep Learning course at the Singapore University of Technology and Design.

**Author:** Matthieu DE MARI (matthieu_demari@sutd.edu.sg)

**Version:** 1.0 (06/03/2024)

**Requirements:**
- Python 3
- Matplotlib
- Numpy
- Pandas
- Torch
- Torchmetrics

In [1]:
# Matplotlib
import matplotlib.pyplot as plt
# Numpy
import numpy as np
# Pandas
import pandas as pd
# Torch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
# Helper functions (additional file)
from helper_functions import *



<div class="alert alert-block alert-info">
<b>An important note: While usually not advised, you might want to run the code for this homework using CPU only. <br>
It remains possible, however, to use GPU, but we would advise against it, until we have been able to clarify the reason for bugs (most likely some CUDA reason). </b> 
</div> 

In [2]:
# Force CPU execution (see the note above about unexplained errors on some CUDA machines)
device = torch.device("cpu")
print(device)

cpu
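If you do want to experiment with a GPU despite the note above, one common pattern is to gate the choice behind an explicit switch. This is only a sketch: the `use_gpu` flag is a hypothetical name and is not part of the homework code.

```python
import torch

# Hypothetical switch (not in the original notebook); keep it False to follow the CPU advice
use_gpu = False

# Fall back to CPU whenever the flag is off or no CUDA device is visible
device = torch.device("cuda" if use_gpu and torch.cuda.is_available() else "cpu")
print(device)
```

With the flag left at `False`, the rest of the notebook behaves exactly as in the CPU-only cell.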


## 0. Dataset and Dataloaders from earlier

We start by loading our dataset from the Excel file, and reuse our Dataset and Dataloader objects from HW3-A.

In [3]:
# Load dataset from file
excel_file_path = 'dataset.xlsx'
times, values = load_dataset(excel_file_path)

In [4]:
class CustomDataset(Dataset):
    def __init__(self, times, values, n_inputs, n_outputs):
        self.times = times
        self.values = values
        self.n_inputs = n_inputs
        self.n_outputs = n_outputs
        self.define_samples()

    def define_samples(self):
        self.inputs = []
        self.outputs = []
        self.mid = []
        # Define all inputs
        for i in range(len(self.times) - self.n_inputs - self.n_outputs + 1):
            # Last input not included (only 19 values)
            next_input = self.values[i:(i + self.n_inputs - 1)]
            next_output = self.values[(i + self.n_inputs):(i + self.n_inputs + self.n_outputs)]
            # Mid is the turning point, i.e. the value of the 20-th sample in the series of inputs
            # It will not be read by the encoder and will serve as the first input to the decoder
            next_mid = [self.values[i + self.n_inputs - 1]]
            self.inputs.append(next_input)
            self.outputs.append(next_output)
            self.mid.append(next_mid)
        
    def __len__(self):
        return len(self.inputs)
    
    def __getitem__(self, idx):
        # Select samples corresponding to the different inputs
        # and outputs we have created with the define_samples() function,
        # and convert them to PyTorch tensors
        x = torch.tensor(self.inputs[idx], dtype = torch.float32)
        y = torch.tensor(self.outputs[idx], dtype = torch.float32)
        m = torch.tensor(self.mid[idx], dtype = torch.float32)
        return x, y, m

In [6]:
# Create our PyTorch Dataset object from the class above
n_inputs = 20
n_outputs = 5
pt_dataset = CustomDataset(times, values, n_inputs, n_outputs)

In [7]:
# Create DataLoader object
batch_size = 256
pt_dataloader = DataLoader(pt_dataset, batch_size = batch_size, shuffle = True)
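To make the windowing in `define_samples()` concrete, here is a small sketch on a toy series. The array `series` is a made-up stand-in for `values` (the real data comes from the Excel file), so the numbers are illustrative only.

```python
import numpy as np

n_inputs, n_outputs = 20, 5
series = np.arange(30, dtype=np.float32)  # toy stand-in for `values`: 0, 1, ..., 29

i = 0  # index of the first window
enc_in = series[i:i + n_inputs - 1]                      # 19 values: x(t) ... x(t+18)
mid = series[i + n_inputs - 1:i + n_inputs]              # 1 value:   x(t+19)
target = series[i + n_inputs:i + n_inputs + n_outputs]   # 5 values:  x(t+20) ... x(t+24)

# Number of windows that fit in the series, same formula as in define_samples()
n_windows = len(series) - n_inputs - n_outputs + 1

print(enc_in.shape, mid, target, n_windows)
```

Each sample therefore covers 25 consecutive values: 19 for the encoder, 1 turning point, and 5 targets.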

## 1. Step 1: Designing an Encoder model

We propose to approach this task by using an Encoder-Decoder model. Both the Encoder and Decoder parts of the model will consist of a simple LSTM.

**Question 6:** What is a Seq2Seq model, and how does it relate to Encoder-Decoder models?

A Seq2Seq model, as the name suggests, maps one sequence (the feature sequence) to another sequence (the target sequence), possibly of a different length. Machine translation is a typical example. It relates to Encoder-Decoder models because a Seq2Seq model is built from two main components.

The encoder first processes the input sequence with an RNN and encodes it into a context vector of fixed dimensions, which can be seen as a summary of the input sequence.

The decoder then takes the context vector produced by the encoder and, again using an RNN, generates the target sequence one element at a time.

Therefore, a Seq2Seq model can be seen as an instance of the Encoder-Decoder family, in which the overall model is split into an Encoder and a Decoder.
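The handoff between the two components can be sketched in a few lines. This is a minimal illustration with random tensors, not the homework solution: the sizes (batch of 4, hidden size 8) and variable names are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 8
encoder = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
decoder = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)

src = torch.randn(4, 19, 1)    # 4 input sequences of 19 steps, one feature per step
dec_in = torch.randn(4, 5, 1)  # 4 decoder input sequences of 5 steps

# The encoder compresses each input sequence into a fixed-size state (h_n, c_n)
_, context = encoder(src)

# The decoder starts from that state instead of from zeros
dec_out, _ = decoder(dec_in, context)

print(context[0].shape, dec_out.shape)
```

Note that the input and output sequences have different lengths (19 and 5); only the fixed-size state links them.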

We want to implement the LSTM architecture drawn below. Its objective is to receive the entries $ x(t), x(t+1), ..., x(t+18) $ (19 input points) and learn the dynamics of the data, in the hope that we will later be able to use this information for future predictions.

<img title="Our Encoder Architecture" alt="Our Encoder Architecture" src="./images/20240318_183921.jpg">

Given the LSTM Architecture above, answer the questions below.

**Question 7:** This encoder seems to receive all inputs present in the first tensor coming from the Dataloader object, which includes n_inputs - 1 elements (here 20-1 = 19 inputs). This LSTM could then produce 19 outputs, but for some reason, they are not shown on this image. What is the reason for this omission? Why is our diagram suggesting that the final memory vector is the only important information that will come out of this encoder model?

The omission keeps the diagram clear and focused on the only encoder output that is used downstream, namely the final memory vector. The encoder does produce a hidden output after receiving each element of the sequence, but none of these intermediate outputs is compared to a target or passed on to the decoder. Only the final memory vector, which summarises the whole input sequence, is handed over to the decoder, so the diagram abstracts the rest away.
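A quick check with `nn.LSTM` shows what is being left out of the diagram. The sketch below uses a random input and assumes a single-layer, unidirectional LSTM with one feature per time step; for that configuration the last per-step output coincides with the final hidden state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=1, hidden_size=25, batch_first=True)
x = torch.randn(1, 19, 1)  # one sequence of 19 scalar inputs

output, (h_n, c_n) = lstm(x)
print(output.shape)          # one 25-dimensional output per time step
print(h_n.shape, c_n.shape)  # final hidden and cell states

# For a single-layer, unidirectional LSTM, the last step of `output` is h_n itself
same = torch.allclose(output[:, -1, :], h_n[0])
print(same)
```

The 19 per-step outputs exist, but the final states already carry the information that the decoder will start from.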

We want our Encoder model to be represented by the EncoderRNN object, whose class prototype is shown below.

**Question 8:** There are a few Nones to be replaced in the code below. Please show your code in your report after you have figured out the correct EncoderRNN class.

In [43]:
# class EncoderRNN(nn.Module):
#     def __init__(self, input_size, hidden_size):
#         super(EncoderRNN, self).__init__()
#         self.hidden_size = hidden_size
#         self.lstm = nn.LSTM(input_size, hidden_size)

#     def forward(self, inputs):
#         batch_size = inputs.size(0)  # Get batch size from the first dimension
#         hidden = torch.randn(1, batch_size, self.hidden_size)
#         cell = torch.randn(1, batch_size, self.hidden_size)
        
#         # Pass inputs through the LSTM layer
#         output, (hidden, cell) = self.lstm(inputs, (hidden, cell))
        
#         # Return the output and hidden states
#         return output, hidden

# class EncoderRNN(nn.Module):
#     def __init__(self, input_size, hidden_size):
#         super(EncoderRNN, self).__init__()
#         self.hidden_size = hidden_size
#         self.lstm = nn.LSTM(input_size, hidden_size)

#     def forward(self, inputs):
#         # Get batch size from the first dimension of the input
#         batch_size = inputs.size(0)

#         # Initialize hidden and cell states with zeros
#         hidden = torch.zeros(1, batch_size, self.hidden_size)
#         cell = torch.zeros(1, batch_size, self.hidden_size)

#         # Pass inputs through the LSTM layer
#         output, (hidden, cell) = self.lstm(inputs, (hidden, cell))

#         # Return the output and the last hidden state
#         return output



class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        # input_size (the window length n_inputs) is kept for interface compatibility.
        # Assumption: the LSTM reads one scalar value per time step, as in the diagram.
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(1, hidden_size, batch_first = True)

    def forward(self, inputs):
        # Reshape (batch, seq_len) into (batch, seq_len, 1): one feature per time step
        inputs = inputs.unsqueeze(-1)

        # Pass inputs through the LSTM layer
        # (hidden and cell states default to zeros when not provided)
        output, (hidden, cell) = self.lstm(inputs)

        # Return the final hidden and cell states, i.e. the memory passed to the decoder
        return hidden, cell

In [44]:
# Defining our EncoderRNN model
hidden_size = 25
encoder_model = EncoderRNN(n_inputs, hidden_size).to(device)
print(encoder_model)

EncoderRNN(
  (lstm): LSTM(1, 25, batch_first=True)
)


**Question 9:** Consider the cell below. What is contained in *vec1\[0\]* and *vec2\[0\]*?

In [45]:
# Testing our EncoderRNN model
inputs, _, _ = next(iter(pt_dataloader))
inputs_reworked = inputs[0, :].reshape(1, -1)
print(inputs_reworked.shape)
encoder_out = encoder_model(inputs_reworked)
vec1, vec2 = encoder_out
print(vec1[0])
print(vec2[0])

torch.Size([1, 19])

## 2. Step 2: Designing a Decoder model

Our next step is to produce a decoder model. It will receive a certain memory vector as its memory starting point. It will also receive five inputs denoted *val1*, *val2*, *val3*, *val4*, and *val5*. It will then attempt to produce five outputs denoted *y1*, *y2*, *y3*, *y4*, and *y5*.

Consider the architecture drawn below and answer the following questions.

As you can see, it will receive certain input values *valk* and will attempt to predict a value *yk*, with k in $ \{1, 2, 3, 4, 5\} $.

<img title="Our Decoder Architecture" alt="Our Decoder Architecture" src="./images/20240318_184112.jpg">

**Question 10:** Assuming that the encoder has seen the inputs $ x(t), x(t+1), ... x(t+18) $, what should we use as a memory vector to play the role of the memory starting point for the decoder?

**Question 11:** We will use a Decoder that is NOT auto-regressive. What does that mean for the input and output values of our Decoder LSTM-based model?

**Question 12:** Assuming that the encoder has seen the inputs corresponding to the sample with index $ t $, i.e. $ x(t), x(t+1), ... x(t+18) $, which values should we use in place of *val1*, *val2*, *val3*, *val4*, *val5*? Remember Q11: we are not planning to use an auto-regressive decoder here. Could you then explain why we only used 19 values as inputs in the encoder part?

**Question 13:** Assuming that the encoder has seen the inputs corresponding to the sample with index $ t $, i.e. $ x(t), x(t+1), ... x(t+18) $, what are the target values that we are trying to match with our predictions in place of *y1*, *y2*, *y3*, *y4*, *y5*?
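One possible reading of Q11 to Q13 is sketched below. It assumes that "not auto-regressive" means the decoder is fed ground-truth values rather than its own previous predictions, and it uses made-up numbers in place of real data. Under that assumption, the decoder inputs are simply the targets shifted by one step, with the turning point `mid` in front.

```python
import torch

# Toy values standing in for one sample (batch size 1)
mid = torch.tensor([[19.0]])                               # x(t+19), shape (batch, 1)
outputs = torch.tensor([[20.0, 21.0, 22.0, 23.0, 24.0]])   # x(t+20) ... x(t+24)

# Decoder inputs: the turning point followed by the first four ground-truth targets
decoder_inputs = torch.cat([mid, outputs[:, :-1]], dim=1)
# Decoder targets: the five values that follow the turning point
decoder_targets = outputs

print(decoder_inputs)
print(decoder_targets)
```

This also suggests why the 20th value is held back from the encoder: it is needed as the first decoder input.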

**Question 14:** What is then the purpose and the expected use for the Linear layer in self.linear? Why is there a for loop in the forward method?

**Question 15:** Having figured out the questions in Q10-14, can you figure what to use in place of the Nones in the code for the DecoderRNN below? Show your final code in your report.

In [None]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.lstm = None
        self.linear = nn.Linear(None)

    def forward(self, outputs, mid, encoder_hidden_states):
        hidden_states = encoder_hidden_states
        final_pred = torch.zeros(outputs.shape).to(outputs.device)
        val = None
        for i in range(outputs.shape[1]):
            pred, hidden_states = None
            pred = self.linear(pred)
            final_pred[:, i] = pred.squeeze()
            val = None
        return final_pred
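For reference, here is one way the template could be completed. It is a sketch under stated assumptions, not the official solution: the class name `DecoderRNNSketch` is hypothetical, the decoder is assumed to read one scalar per step, the encoder state is assumed to be a `(hidden, cell)` tuple of shape `(1, batch, hidden_size)`, and ground-truth values are fed as inputs at every step.

```python
import torch
import torch.nn as nn

class DecoderRNNSketch(nn.Module):
    def __init__(self, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        # One scalar in per step, one hidden vector out per step
        self.lstm = nn.LSTM(1, hidden_size, batch_first=True)
        # Maps each hidden vector to a single predicted value
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, outputs, mid, encoder_hidden_states):
        hidden_states = encoder_hidden_states
        final_pred = torch.zeros(outputs.shape).to(outputs.device)
        # First decoder input is the turning point, reshaped to (batch, 1, 1)
        val = mid.unsqueeze(1)
        for i in range(outputs.shape[1]):
            pred, hidden_states = self.lstm(val, hidden_states)
            pred = self.linear(pred)              # (batch, 1, 1)
            final_pred[:, i] = pred.reshape(-1)   # store the i-th prediction
            # Next input is the ground-truth value, not the prediction
            val = outputs[:, i].reshape(-1, 1, 1)
        return final_pred

# Shape check with random tensors
decoder = DecoderRNNSketch(hidden_size=25, output_size=5)
toy_outputs, toy_mid = torch.randn(4, 5), torch.randn(4, 1)
toy_state = (torch.zeros(1, 4, 25), torch.zeros(1, 4, 25))
print(decoder(toy_outputs, toy_mid, toy_state).shape)
```

The loop is what lets each step consume the state left by the previous one while receiving a fresh input value.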

In [None]:
# Defining our DecoderRNN model
decoder_model = DecoderRNN(hidden_size = hidden_size, output_size = n_outputs)
print(decoder_model)

**Question 16:** Consider the cell below. What should the final size of the *decoder_out* tensor be?

In [None]:
# Testing our DecoderRNN model
inputs, outputs, mid = next(iter(pt_dataloader))
encoder_out = encoder_model(inputs)
decoder_out = decoder_model(outputs, mid, encoder_out)
print(decoder_out.shape)

## 3. Step 3: Assembling everything into a Seq2Seq model

Our final objective is to assemble both our encoder model and decoder model into a Seq2Seq model, following the architecture drawn below.

<img title="Our Seq2Seq Architecture" alt="Our Seq2Seq Architecture" src="./images/20240318_184350.jpg">

**Question 17:** Why have we preferred to use an Encoder-Decoder architecture, instead of a single LSTM that would receive 24 inputs, produce 24 outputs, and would only compare the final 5 predicted values to the ground truth in our dataset?

**Question 18:** Having figured out the models in EncoderRNN and DecoderRNN, can you now figure out the missing code in the cell below? Show it in your report.

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Seq2Seq, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.encoder_model = EncoderRNN(self.input_size, self.hidden_size)
        self.decoder_model = DecoderRNN(self.hidden_size, self.output_size)

    def forward(self, inputs, outputs, mid):
        encoder_hidden_states = self.encoder_model(inputs)
        pred_final = self.decoder_model(outputs, mid, encoder_hidden_states)
        return pred_final

In [None]:
# Defining our Seq2Seq model
seq2seq_model = Seq2Seq(input_size = n_inputs, \
                        hidden_size = hidden_size, \
                        output_size = n_outputs)
print(seq2seq_model)

In [None]:
# Testing our Seq2Seq model
seq2seq_model = Seq2Seq(input_size = n_inputs, \
                        hidden_size = hidden_size, \
                        output_size = n_outputs)
inputs, outputs, mid = next(iter(pt_dataloader))
print(inputs.shape, outputs.shape, mid.shape)
seq2seq_out = seq2seq_model(inputs, outputs, mid)
print(seq2seq_out.shape)

## 4. Step 4: Finally, training and evaluating our Seq2Seq model

**Question 19:** Given your understanding of the task, which (very simple) loss function should we use in our trainer function? Show your updated code in your report.

In [None]:
def train(dataloader, model, num_epochs, learning_rate):
    # Set the model to training mode
    model.train()
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr = learning_rate)

    for epoch in range(num_epochs):
        total_loss = 0
        for inputs, outputs, mid in dataloader:
            # Clear previous gradients
            optimizer.zero_grad()
            # Forward pass
            pred = model(inputs, outputs, mid)
            # Calculate loss
            loss = criterion(pred, outputs)
            total_loss += loss.item()
            # Backward pass and optimization
            loss.backward()
            optimizer.step()
        
        # Print the average loss per batch every 25 epochs
        if epoch % 25 == 0:
            print(f'Epoch {epoch+1}/{num_epochs}, Avg Loss: {total_loss/len(dataloader)}')
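As a sanity check on the loss used in the trainer, the sketch below compares `nn.MSELoss` with a manual mean of squared differences. The tensors are made-up values chosen so that the result is easy to verify by hand.

```python
import torch
import torch.nn as nn

pred = torch.tensor([[1.0, 2.0, 3.0]])
target = torch.tensor([[1.0, 2.0, 5.0]])

# Built-in mean squared error (default reduction is the mean over all elements)
loss = nn.MSELoss()(pred, target)
# Manual version: squared differences are 0, 0 and 4, so the mean is 4/3
manual = ((pred - target) ** 2).mean()

print(loss.item(), manual.item())
```

Because the reduction is a mean over all elements, the value does not depend on the batch size, which makes the per-batch average printed by the trainer comparable across epochs.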

Having figured out the correct models and trainer function, you may now use the cell below. It trains the model from scratch for a small number of epochs (26 in the code below) and shows how long this takes. This is just information to let you know how long the full training loop (in the next cell) might take on your machine!

In [None]:
# Training the model
hidden_size = 25
seq2seq_model = Seq2Seq(input_size = n_inputs, \
                        hidden_size = hidden_size, \
                        output_size = n_outputs).to(device)
%timeit -r 1 -n 1 train(dataloader = pt_dataloader, model = seq2seq_model, num_epochs = 26, learning_rate = 1e-3)

In [None]:
# This was used to save a starting point for the next cell, do not run.
#torch.save(seq2seq_model.state_dict(), 'seq2seq_model_start.pth')

**Question 20:** The loss values we see when using the model with randomly initialized parameters are very high. While the loss seems to decrease, many iterations will be needed. The next cell runs the training loop again, but initializes the weights of the model using the values in the *seq2seq_model_start.pth* file, presumably coming from another roughly similar model, trained on a different but similar task. This is done in an attempt to help the model train better and faster. Under which name is this concept known in Deep Learning?
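The mechanics of starting from saved weights can be sketched as follows. This is an illustration only: it uses a bare `nn.LSTM` and an in-memory buffer instead of the homework's model and `.pth` file, and the names `source_model` and `new_model` are hypothetical. Reusing parameters learned elsewhere as a starting point is commonly referred to as transfer learning (or a warm start).

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in for a model trained on a related task
source_model = nn.LSTM(input_size=1, hidden_size=25, batch_first=True)

# Save its parameters (an in-memory buffer replaces the .pth file here)
buffer = io.BytesIO()
torch.save(source_model.state_dict(), buffer)
buffer.seek(0)

# A freshly initialized model with the same architecture...
new_model = nn.LSTM(input_size=1, hidden_size=25, batch_first=True)
# ...starts training from the saved parameters instead of random ones
new_model.load_state_dict(torch.load(buffer))

match = all(torch.equal(p, q) for p, q in
            zip(source_model.parameters(), new_model.parameters()))
print(match)
```

Loading only works when the two architectures have matching parameter names and shapes.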

<div class="alert alert-block alert-info">
<b>Note: The next cell will take a (somewhat) long time to run. On my machine's CPU, it takes ~4 minutes.<br>
You can estimate how long it will take on your machine by taking the execution time of the previous cell and scaling it by the ratio of epoch counts (1051 versus 26 in the code, i.e. roughly 40 times longer). As mentioned at the beginning of this notebook, we have observed issues when running the code on some CUDA machines. The cause is unclear at the moment, so try GPU computing at your own risk... </b>
</div>

In [None]:
# Training the model
hidden_size = 25
seq2seq_model = Seq2Seq(input_size = n_inputs, \
                        hidden_size = hidden_size, \
                        output_size = n_outputs).to(device)
# Start from given parameters to make training easier
seq2seq_model.load_state_dict(torch.load('seq2seq_model_start.pth'))
%timeit -r 1 -n 1 train(dataloader = pt_dataloader, model = seq2seq_model, num_epochs = 1051, learning_rate = 1e-3)

In [None]:
# Do not uncomment and execute, this was used to prepare the model that you will be using in the final question!
#torch.save(seq2seq_model.state_dict(), 'seq2seq_model_end.pth')

**Question 21:** The code below shows the predictions produced by your Seq2Seq model after training and can be used to confirm that you have trained the right model! Show some screenshots in your report, and discuss the final performance you have obtained for your model. For your information, I typically obtain an MSE of ~0.05 after 1000 iterations of training. Additional performance can probably be obtained via hyperparameter tuning (for example, changing the size of the memory vector).

In [None]:
# Quick check on our Seq2Seq model
# (Seeding for reproducibility)
hidden_size = 25
seq2seq_model = Seq2Seq(input_size = n_inputs, \
                        hidden_size = hidden_size, \
                        output_size = n_outputs).to(device)
seq2seq_model.load_state_dict(torch.load('seq2seq_model_end.pth'))
seed_value = 187
test_model(seq2seq_model, pt_dataloader, seed_value)
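If you want a number to compare against the ~0.05 MSE mentioned in Q21, a small evaluation loop can help. The sketch below is not the `test_model` helper from `helper_functions` (whose internals are not shown here): `evaluate_mse` and the `RepeatMid` baseline are hypothetical names, and the data is random. The baseline simply repeats the last observed value for all five horizons.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate_mse(model, dataloader):
    """Average squared error per predicted value over the whole dataloader."""
    model.eval()
    criterion = nn.MSELoss(reduction="sum")
    total_sq_err, total_count = 0.0, 0
    with torch.no_grad():
        for inputs, outputs, mid in dataloader:
            pred = model(inputs, outputs, mid)
            total_sq_err += criterion(pred, outputs).item()
            total_count += outputs.numel()
    return total_sq_err / total_count

class RepeatMid(nn.Module):
    """Hypothetical baseline: predict the turning point for every horizon."""
    def forward(self, inputs, outputs, mid):
        return mid.expand_as(outputs)

# Toy data with the same shapes as the homework batches
torch.manual_seed(0)
x, y, m = torch.randn(10, 19), torch.randn(10, 5), torch.randn(10, 1)
loader = DataLoader(TensorDataset(x, y, m), batch_size=4)
baseline_mse = evaluate_mse(RepeatMid(), loader)
print(baseline_mse)
```

Any model that accepts `(inputs, outputs, mid)` can be passed in place of the baseline, so a trained Seq2Seq model can be scored the same way.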

In [None]:
# Reload the trained model
hidden_size = 25
seq2seq_model = Seq2Seq(input_size = n_inputs, \
                        hidden_size = hidden_size, \
                        output_size = n_outputs).to(device)
seq2seq_model.load_state_dict(torch.load('seq2seq_model_end.pth'))
# Visualize
visualize_some_predictions(seq2seq_model, pt_dataloader)

## This concludes HW3.

Do not give up, it is a feasible task! If your model does not work, you are probably making a mistake in the Encoder model or, more likely, in the Decoder model. Take your time to think about the task at hand and the model we should use for that task.