📝 **Author:** Amirhossein Heydari - 📧 **Email:** <amirhosseinheydari78@gmail.com> - 📍 **Origin:** [mr-pylin/pytorch-workshop](https://github.com/mr-pylin/pytorch-workshop)

---


**Table of contents**<a id='toc0_'></a>    
- [Dependencies](#toc1_)    
- [Dataset](#toc2_)    
  - [Regular Dataset](#toc2_1_)    
  - [Sequential Dataset](#toc2_2_)    
- [Types of sequence-to-sequence modeling configurations](#toc3_)    
- [Network Structure: Recurrent Neural Networks](#toc4_)    
  - [Simple Vanilla RNN](#toc4_1_)    
  - [Combined Weights and Concatenated Input and Hidden](#toc4_2_)    
  - [Deep RNN](#toc4_3_)    
  - [RNN using PyTorch](#toc4_4_)    
  - [Long Short-Term Memory (LSTM)](#toc4_5_)    
  - [Gated Recurrent Units (GRU)](#toc4_6_)    


# <a id='toc1_'></a>[Dependencies](#toc0_)


In [1]:
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset, TensorDataset
from torchinfo import summary

In [2]:
# set a seed for deterministic results
seed = 42
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [None]:
# check if cuda is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# log
device

# <a id='toc2_'></a>[Dataset](#toc0_)


## <a id='toc2_1_'></a>[Regular Dataset](#toc0_)

- Regular datasets typically used in CNNs & MLPs are composed of independent data points
- Each data point is usually represented as a fixed-size vector (or tensor for images)

🧾 **Notations**:

- $N$: Number of samples in the dataset.
- $\mathbf{x}_i$: Input data point $i$, where $i \in \{1, 2, \ldots, N\}$.
- $\mathbf{y}_i$: Label or target associated with input data $i$.

🔬 **Formulations**:

- Dataset: $D=\{(\mathbf{x}_i, \mathbf{y}_i)\mid i = 1, 2, \ldots, N\}$
- Each $\mathbf{x}_i \in ℝ^M$, where $M$ is the dimensionality of the input feature vector

🌟 **Example**: $D = \{ (\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), (\mathbf{x}_3, \mathbf{y}_3) \}$

- $\mathbf{x}_1 = [1.0, 2.0], \quad \mathbf{y}_1 = 0$
- $\mathbf{x}_2 = [2.5, 3.5], \quad \mathbf{y}_2 = 1$
- $\mathbf{x}_3 = [0.5, 1.5], \quad \mathbf{y}_3 = 0$


In [None]:
class RegularDataset(Dataset):
    def __init__(self):
        self.data = torch.tensor([[1.0, 2.0], [2.5, 3.5], [0.5, 1.5]], dtype=torch.float32)
        self.labels = torch.tensor([0, 1, 0], dtype=torch.int64)

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        return self.data[idx], self.labels[idx]


# create dataset and dataloader
dataset = RegularDataset()
dataloader = DataLoader(dataset, batch_size=1, shuffle=False, num_workers=0)

# iterate through the dataset
for data, label in dataloader:
    print(f"data: {data}, label: {label}")

## <a id='toc2_2_'></a>[Sequential Dataset](#toc0_)

- Sequential datasets used in RNNs are composed of sequences of data points.
- Each sequence represents a temporal or sequential relationship among the data points.

🧾 **Notations**:

- $N$: Number of sequences in the dataset.
- $T$: Length of each sequence.
- $\mathbf{x}^t_i$: Input data point at time step $t$ in the sequence $i$, where $t \in \{1, 2, \ldots, T\}$ and $i \in \{1, 2, \ldots, N\}$
- $\mathbf{y}_i$: Label or target associated with sequence $i$.

🔬 **Formulations**:

- Dataset: $D = \{ (\mathbf{x}_i^1, \mathbf{x}_i^2, \ldots, \mathbf{x}_i^T, \mathbf{y}_i) \mid i = 1, 2, \ldots, N \}$
- Each $\mathbf{x}^t_i \in ℝ^M$, where $M$ is the dimensionality of the input feature vector at each time step.

🌟 **Example**: $D = \{ (\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4, \mathbf{y}_2), (\mathbf{x}_3, \mathbf{x}_4, \mathbf{x}_5, \mathbf{y}_3) \}$

- $\mathbf{x}_1 = [1.0, 0.0]$
- $\mathbf{x}_2 = [0.5, 1.5]$
- $\mathbf{x}_3 = [1.0, 2.0]$
- $\mathbf{x}_4 = [2.0, 1.0]$
- $\mathbf{x}_5 = [1.5, 0.5]$
- $\mathbf{y}_1 = 0$
- $\mathbf{y}_2 = 1$
- $\mathbf{y}_3 = 0$
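
The overlapping windows in the example above can also be built without an explicit loop. The following is a minimal sketch, assuming the five data points are stored in one tensor; it uses `Tensor.unfold` and is only one of several ways to do this.

```python
import torch

# the five data points x_1 ... x_5 from the example above
data = torch.tensor([[1.0, 0.0], [0.5, 1.5], [1.0, 2.0], [2.0, 1.0], [1.5, 0.5]])
seq_length = 3

# unfold slides a window of size `seq_length` with step 1 along dim 0;
# the window axis is appended last, so permute it back to (N, T, M)
windows = data.unfold(0, seq_length, 1).permute(0, 2, 1)

print(windows.shape)  # torch.Size([3, 3, 2])
print(windows[1])     # the rows x_2, x_3, x_4
```

Each `windows[i]` holds the same rows as `data[i : i + seq_length]`, which is the slice an overlapping dataset would return for index `i`.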


In [None]:
class SequentialDatasetWithoutOverlap(Dataset):
    def __init__(self):
        # original data points
        self.data = torch.tensor([[1.0, 0.0], [0.5, 1.5], [1.0, 2.0], [2.0, 1.0], [1.5, 0.5], [2.5, 1.5]], dtype=torch.float32)

        # labels for each sequence
        self.labels = torch.tensor([0, 1], dtype=torch.int64)

        # sequence length
        self.seq_length = 3

    def __len__(self) -> int:
        # number of sequences without overlap
        return len(self.data) // self.seq_length

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        # calculate the start index of the sequence
        start_idx = idx * self.seq_length

        # create a sequence of length seq_length
        sequence = self.data[start_idx : start_idx + self.seq_length]
        label = self.labels[idx]
        return sequence, label


# create dataset and dataloader
dataset = SequentialDatasetWithoutOverlap()
dataloader = DataLoader(dataset, batch_size=1, shuffle=False, num_workers=0)

# iterate through the dataset
for sequence, label in dataloader:
    print(f"sequence:\n{sequence}\nlabel: {label}\n")

In [None]:
class SequentialDatasetWithOverlap(Dataset):
    def __init__(self):
        # original data points
        self.data = torch.tensor([[1.0, 0.0], [0.5, 1.5], [1.0, 2.0], [2.0, 1.0], [1.5, 0.5]], dtype=torch.float32)

        # labels for each sequence
        self.labels = torch.tensor([0, 1, 0], dtype=torch.int64)

        # sequence length
        self.seq_length = 3

    def __len__(self) -> int:
        return len(self.data) - self.seq_length + 1

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        # create a sequence of length seq_length
        sequence = self.data[idx : idx + self.seq_length]
        label = self.labels[idx]
        return sequence, label


# create dataset and dataloader
dataset = SequentialDatasetWithOverlap()
dataloader = DataLoader(dataset, batch_size=1, shuffle=True)

# iterate through the dataset
for sequence, label in dataloader:
    print(f"sequence:\n{sequence}\nlabel: {label}\n")

# <a id='toc3_'></a>[Types of sequence-to-sequence modeling configurations](#toc0_)

1. **One-to-One** (Single Input to Single Output):
    - Simplest form of neural network where a single input is mapped to a single output
    - Used in a standard feed-forward neural network (e.g. MLP or CNN based architectures)
    - e.g. Image classification
1. **One-to-Many** (Single Input to Sequence Output):
    - A single input is processed by the RNN, which then produces a sequence of outputs over time.
    - e.g. Image captioning (an image input resulting in a sequence of words).
1. **Many-to-One** (Sequence Input to Single Output):
    - The RNN processes each input in the sequence, and the final hidden state is used to produce the output
    - e.g. Sentiment analysis (a sequence of words leading to a single sentiment label)
1. **Many-to-Many** (Sequence Input to Sequence Output):
    - A sequence of inputs leads to a sequence of outputs. This can be further divided into two subcategories:
      - **Synchronized** Many-to-Many
        - Each input in the sequence has a corresponding output
        - The RNN processes a sequence of inputs, producing a corresponding output at each time step
        - e.g. Video classification (each frame in a video results in a corresponding label)
      - **Asynchronous** Many-to-Many
        - The lengths of the input and output sequences can differ
        - The RNN processes a sequence of inputs and generates a sequence of outputs which may have different lengths
        - e.g. Machine translation (a sequence of words in one language translates to a sequence of words in another language)

<figure style="text-align: center;">
  <img src="../assets/images/original/rnn/seq-to-seq-modeling.svg" alt="seq-to-seq-modeling.svg" style="width: 100%;">
  <figcaption style="text-align: center;">sequence-to-sequence modeling</figcaption>
</figure>
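
In code, the difference between the many-to-one and the synchronized many-to-many configuration often comes down to which time steps are passed to the output layer. The sketch below assumes illustrative sizes and a hypothetical linear `head`; it is not tied to any particular task.

```python
import torch
from torch import nn

torch.manual_seed(0)

# illustrative sizes (assumptions, not taken from a real dataset)
batch_size, seq_length, input_dim, hidden_dim, num_classes = 4, 6, 8, 16, 3

rnn = nn.RNN(input_dim, hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, num_classes)  # hypothetical output layer

x = torch.randn(batch_size, seq_length, input_dim)
out, h_n = rnn(x)  # out: (batch, seq, hidden), h_n: (1, batch, hidden)

# many-to-one: only the last time step feeds the output layer
many_to_one = head(out[:, -1, :])

# synchronized many-to-many: the same output layer is applied at every time step
many_to_many = head(out)

print(many_to_one.shape)   # torch.Size([4, 3])
print(many_to_many.shape)  # torch.Size([4, 6, 3])
```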


# <a id='toc4_'></a>[Network Structure: Recurrent Neural Networks](#toc0_)

- Recurrent Neural Networks (RNNs) are a class of neural networks designed for processing sequential data, such as time series, natural language, or speech.
- They are characterized by their ability to use information from previous time steps, enabling them to model temporal dependencies effectively.
- Unlike feedforward neural networks, RNNs possess a "memory" component to process information from previous inputs, influencing the current output.
- The same weights are used across all time steps, which reduces the number of parameters and helps the model generalize across positions in a sequence.
- RNNs can suffer from vanishing and exploding gradients, making training difficult for long sequences.
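
The vanishing-gradient behaviour can be observed directly. The sketch below assumes an untrained `nn.RNN` with PyTorch's default initialization and illustrative sizes; it backpropagates from the last time step and compares how much gradient reaches the first and the last input.

```python
import torch
from torch import nn

torch.manual_seed(0)

# illustrative sizes (assumptions)
seq_length, input_dim, hidden_dim = 50, 10, 20

rnn = nn.RNN(input_dim, hidden_dim, batch_first=True)
x = torch.randn(1, seq_length, input_dim, requires_grad=True)

out, _ = rnn(x)

# backpropagate from the hidden state of the last time step only
out[:, -1, :].sum().backward()

# gradient norm per time step, shape: (seq_length,)
grad_norms = x.grad.norm(dim=2).squeeze(0)

print(f"gradient norm at the first time step: {grad_norms[0].item():.3e}")
print(f"gradient norm at the last time step : {grad_norms[-1].item():.3e}")
```

With this setup the gradient reaching the first input is typically many orders of magnitude smaller than the one reaching the last input, which is why early time steps contribute little to learning in long sequences.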

🧬 **RNN Variants**:

- Vanilla RNN
- Long Short-Term Memory (LSTM)
  - Improves upon the vanilla RNN by introducing gates to control information flow
- Gated Recurrent Units (GRU)
  - Simplifies the LSTM architecture while maintaining performance

🔗 **Useful Links**:

- [karpathy.github.io/2015/05/21/rnn-effectiveness](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)
- [stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks)
- [d2l.ai/chapter_recurrent-modern/deep-rnn.html](https://d2l.ai/chapter_recurrent-modern/deep-rnn.html)
- [towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21)


## <a id='toc4_1_'></a>[Simple Vanilla RNN](#toc0_)

🧾 **Notations**:

- $\mathbf{x}_t$: Input at time step $t$.
- $\mathbf{h}_t$: Hidden state at time step $t$.
- $\mathbf{y}_t$: Output at time step $t$.
- $\mathbf{W}_{ih}$: Weight matrix for input to hidden
- $\mathbf{W}_{hh}$: Weight matrix for hidden to hidden
- $\mathbf{W}_{ho}$: Weight matrix for hidden to output
- $\mathbf{b}_{ih}$: Bias for input to hidden
- $\mathbf{b}_{hh}$: Bias for hidden to hidden
- $\mathbf{b}_{ho}$: Bias for hidden to output
- $\mathbf{\sigma}$: Activation function (e.g., Tanh, Sigmoid, ReLU)
- $\mathbf{g}$: Activation function for output (e.g., Softmax for classification)

🔬 **Formulations**:

- **Hidden State Calculation**:
   $$\mathbf{h}_t = \sigma(\mathbf{W}_{ih} \mathbf{x}_t + \mathbf{b}_{ih} + \mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{b}_{hh}), \quad \mathbf{h}_0 = \mathbf{0}$$
- **Output Calculation**:
   $$\mathbf{y}_t = g(\mathbf{W}_{ho} \mathbf{h}_t + \mathbf{b}_{ho})$$

<figure style="text-align: center;">
  <img src="../assets/images/original/rnn/vanilla-rnn.svg" alt="vanilla-rnn.svg" style="width: 100%;">
  <figcaption style="text-align: center;">Vanilla Recurrent Neural Networks</figcaption>
</figure>

<figure style="text-align: center;">
  <img src="../assets/images/original/rnn/calculation.svg" alt="calculation.svg" style="width: 100%;">
  <figcaption style="text-align: center;">Calculations</figcaption>
</figure>
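
As a quick sanity check, the hidden state formula above can be compared with `nn.RNNCell`, which exposes its parameters as `weight_ih`, `weight_hh`, `bias_ih` and `bias_hh`. This is a minimal sketch with illustrative sizes, using $\sigma = \tanh$ (the default non-linearity of `nn.RNNCell`).

```python
import torch
from torch import nn

torch.manual_seed(0)

# illustrative sizes (assumptions)
batch_size, input_dim, hidden_dim = 2, 4, 3

cell = nn.RNNCell(input_dim, hidden_dim)  # tanh non-linearity by default
x_t = torch.randn(batch_size, input_dim)
h_prev = torch.randn(batch_size, hidden_dim)

with torch.no_grad():
    # h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh), written for row-vector batches
    h_manual = torch.tanh(x_t @ cell.weight_ih.T + cell.bias_ih + h_prev @ cell.weight_hh.T + cell.bias_hh)
    h_builtin = cell(x_t, h_prev)

print(torch.allclose(h_manual, h_builtin, atol=1e-5))
```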


In [7]:
class VanillaRNN(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
        super().__init__()
        self.hidden_dim = hidden_dim

        # input to hidden connection weights
        self.W_ih = nn.Parameter(torch.randn(hidden_dim, input_dim))
        # input to hidden connection biases
        self.b_ih = nn.Parameter(torch.randn(hidden_dim))

        # hidden to hidden connection weights
        self.W_hh = nn.Parameter(torch.randn(hidden_dim, hidden_dim))
        # hidden to hidden connection biases
        self.b_hh = nn.Parameter(torch.randn(hidden_dim))

        # weights for hidden to output connection
        self.W_ho = nn.Parameter(torch.randn(output_dim, hidden_dim))
        # bias for output layer
        self.b_ho = nn.Parameter(torch.randn(output_dim))

    def forward(self, input: torch.Tensor, hidden: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        hidden = torch.tanh(input @ self.W_ih.T + self.b_ih + hidden @ self.W_hh.T + self.b_hh)
        output = hidden @ self.W_ho.T + self.b_ho
        return output, hidden

    def init_hidden(self, batch_size: int) -> torch.Tensor:
        # initialize the hidden state with zeros (h_0)
        return torch.zeros(batch_size, self.hidden_dim)

In [None]:
# hyperparameters
input_dim = 10
hidden_dim = 20
output_dim = 5
num_data = 128
sequence_length = 5
batch_size = 32

# generate synthetic dataset
x = torch.randn(num_data, sequence_length, input_dim)
y = torch.randn(num_data)

# create dataset and dataloader
dataset = TensorDataset(x, y)
trainsetloader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=0)

# log
print(f"x.size()               : {x.size()}")
print(f"y.size()               : {y.size()}")
print(f"x.size() [first batch] : {next(iter(trainsetloader))[0].size()}")
print(f"y.size() [first batch] : {next(iter(trainsetloader))[1].size()}")

In [None]:
# initialize model
rnn_1 = VanillaRNN(input_dim, hidden_dim, output_dim)
rnn_1

In [None]:
summary(rnn_1, input_size=((batch_size, input_dim), (batch_size, hidden_dim)), device="cpu")

In [None]:
# forward pass through the RNN
for c, (x, y_true) in enumerate(trainsetloader):
    # initialize hidden state
    hidden = rnn_1.init_hidden(x.size(0))

    for i in range(sequence_length):
        y_pred, hidden = rnn_1(x[:, i, :], hidden)
        print(f"batch: {c+1}/{len(trainsetloader)} | time step: {i+1} | hidden.size(): {hidden.size()} | output.size(): {y_pred.size()}")

## <a id='toc4_2_'></a>[Combined Weights and Concatenated Input and Hidden](#toc0_)

- Reformulate the Vanilla RNN by:
  - Combining the input-to-hidden and hidden-to-hidden weights into a single weight matrix
  - Concatenating the input and hidden states together

🧾 **Notations**:

- $\mathbf{x}_t$: Input at time step $t$.
- $\mathbf{h}_t$: Hidden state at time step $t$.
- $\mathbf{y}_t$: Output at time step $t$.
- $\mathbf{W}$: Combined weight matrix
- $\mathbf{b}$: Combined bias vector
- $\mathbf{W}_{ho}$: Weight matrix for hidden to output
- $\mathbf{b}_{ho}$: Bias for hidden to output
- $\mathbf{\sigma}$: Activation function (e.g., Tanh, Sigmoid, ReLU)
- $\mathbf{g}$: Activation function for output (e.g., Softmax for classification)

🔬 **Formulations**:

- **Concatenation of Input and Hidden State**:
   $$\mathbf{z}_t = [\mathbf{x}_t; \mathbf{h}_{t-1}]$$
- **Hidden State Calculation**:
   $$\mathbf{h}_t = \sigma(\mathbf{W} \mathbf{z}_t + \mathbf{b})$$
- **Output Calculation**:
   $$\mathbf{y}_t = g(\mathbf{W}_{ho} \mathbf{h}_t + \mathbf{b}_{ho})$$

<figure style="text-align: center;">
  <img src="../assets/images/original/rnn/combine-weights.svg" alt="combine-weights.svg" style="width: 100%;">
  <figcaption style="text-align: center;">Combining Weights</figcaption>
</figure>
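
The two formulations are numerically equivalent when $\mathbf{W} = [\mathbf{W}_{ih} \; \mathbf{W}_{hh}]$ (stacked column-wise) and $\mathbf{b} = \mathbf{b}_{ih} + \mathbf{b}_{hh}$. The sketch below checks this with randomly drawn parameters; all sizes and values are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# illustrative sizes (assumptions)
batch_size, input_dim, hidden_dim = 2, 4, 3

# separate parameters, as in the simple vanilla RNN
W_ih = torch.randn(hidden_dim, input_dim)
W_hh = torch.randn(hidden_dim, hidden_dim)
b_ih = torch.randn(hidden_dim)
b_hh = torch.randn(hidden_dim)

x_t = torch.randn(batch_size, input_dim)
h_prev = torch.randn(batch_size, hidden_dim)

# separate formulation
h_separate = torch.tanh(x_t @ W_ih.T + b_ih + h_prev @ W_hh.T + b_hh)

# combined formulation: W = [W_ih  W_hh], b = b_ih + b_hh, z_t = [x_t; h_{t-1}]
W = torch.cat((W_ih, W_hh), dim=1)  # (hidden_dim, input_dim + hidden_dim)
b = b_ih + b_hh
z_t = torch.cat((x_t, h_prev), dim=1)
h_combined = torch.tanh(z_t @ W.T + b)

print(torch.allclose(h_separate, h_combined, atol=1e-5))
```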


In [12]:
class VanillaRNN2(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
        super().__init__()
        self.hidden_dim = hidden_dim

        # combined weight matrix for input to hidden and hidden to hidden
        self.W = nn.Parameter(torch.randn(hidden_dim, input_dim + hidden_dim))
        self.b = nn.Parameter(torch.randn(hidden_dim))

        # weights for hidden to output connection
        self.W_ho = nn.Parameter(torch.randn(output_dim, hidden_dim))
        self.b_ho = nn.Parameter(torch.randn(output_dim))

    def forward(self, input: torch.Tensor, hidden: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        combined = torch.cat((input, hidden), dim=1)  # concatenate input and hidden state
        hidden = torch.tanh(combined @ self.W.T + self.b)
        output = hidden @ self.W_ho.T + self.b_ho
        return output, hidden

    def init_hidden(self, batch_size: int) -> torch.Tensor:
        # initialize the hidden state with zeros (h_0)
        return torch.zeros(batch_size, self.hidden_dim)

In [None]:
# hyperparameters
input_dim = 10
hidden_dim = 20
output_dim = 5
num_data = 128
sequence_length = 5
batch_size = 32

# generate synthetic dataset
x = torch.randn(num_data, sequence_length, input_dim)
y = torch.randn(num_data)

# create dataset and dataloader
dataset = TensorDataset(x, y)
trainsetloader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=0)

# log
print(f"x.size()               : {x.size()}")
print(f"y.size()               : {y.size()}")
print(f"x.size() [first batch] : {next(iter(trainsetloader))[0].size()}")
print(f"y.size() [first batch] : {next(iter(trainsetloader))[1].size()}")

In [None]:
# initialize model
rnn_2 = VanillaRNN2(input_dim, hidden_dim, output_dim)
rnn_2

In [None]:
summary(rnn_2, input_size=((batch_size, input_dim), (batch_size, hidden_dim)), device="cpu")

In [None]:
# forward pass through the RNN
for c, (x, y_true) in enumerate(trainsetloader):
    # initialize hidden state
    hidden = rnn_2.init_hidden(x.size(0))

    for i in range(sequence_length):
        y_pred, hidden = rnn_2(x[:, i, :], hidden)
        print(f"batch: {c+1}/{len(trainsetloader)} | time step: {i+1} | hidden.size(): {hidden.size()} | output.size(): {y_pred.size()}")

## <a id='toc4_3_'></a>[Deep RNN](#toc0_)

- A **Deep RNN** consists of **multiple** layers of RNN cells stacked on top of each other.
- Each layer processes the **hidden states** of the layer below as its **input**.
- The **output** of one layer is used as the **input** to the **next layer**.

🧾 **Notations**:

- $\mathbf{x}_t$: Input at time step $t$.
- $\mathbf{h}^l_t$: Hidden state at time step $t$ in layer $l$.
- $\mathbf{y}_t$: Output at time step $t$.
- $\mathbf{W}^l$: Combined weight matrix for layer $l$
- $\mathbf{b}^l$: Bias vector for layer $l$
- $\mathbf{W}_{ho}$: Weight matrix for hidden to output
- $\mathbf{b}_{ho}$: Bias for hidden to output
- $\mathbf{\sigma}$: Activation function (e.g., Tanh, Sigmoid, ReLU)
- $\mathbf{g}$: Activation function for output (e.g., Softmax for classification)
- $L$: Number of layers

🔬 **Formulations**:

- **Concatenation of Input and Hidden State for Layer 1**:
   $$\mathbf{z}_t^1 = [\mathbf{x}_t; \mathbf{h}_{t-1}^1]$$
- **Hidden State Calculation for Layer 1**:
   $$\mathbf{h}_t^1 = \sigma(\mathbf{W}^1 \mathbf{z}_t^1 + \mathbf{b}^1)$$
- **Concatenation of Hidden States for Subsequent Layers**:
   $$\mathbf{z}_t^l = [\mathbf{h}_t^{l-1}; \mathbf{h}_{t-1}^l] \quad \text{for} \quad l = 2, \ldots, L$$
- **Hidden State Calculation for Subsequent Layers**:
   $$\mathbf{h}_t^l = \sigma(\mathbf{W}^l \mathbf{z}_t^l + \mathbf{b}^l) \quad \text{for} \quad l = 2, \ldots, L$$
- **Output Calculation**:
   $$\mathbf{y}_t = g(\mathbf{W}_{ho} \mathbf{h}_t^L + \mathbf{b}_{ho})$$

<figure style="text-align: center;">
  <img src="../assets/images/original/rnn/deep-rnn.svg" alt="deep-rnn.svg" style="width: 100%;">
  <figcaption style="text-align: center;">Deep Recurrent Neural Networks</figcaption>
</figure>
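
The layer-wise recurrence above can be checked against a multi-layer `nn.RNN`. The sketch below is one possible way to do it, assuming illustrative sizes: it reads the per-layer parameters (`weight_ih_l{k}`, `weight_hh_l{k}`, `bias_ih_l{k}`, `bias_hh_l{k}`), unrolls the formulas by hand, and compares the final hidden states.

```python
import torch
from torch import nn

torch.manual_seed(0)

# illustrative sizes (assumptions)
batch_size, seq_length, input_dim, hidden_dim, num_layers = 2, 5, 4, 3, 2

rnn = nn.RNN(input_dim, hidden_dim, num_layers, batch_first=True)
x = torch.randn(batch_size, seq_length, input_dim)

# h[l] holds h_{t-1}^l for each layer, starting from zeros
h = [torch.zeros(batch_size, hidden_dim) for _ in range(num_layers)]

with torch.no_grad():
    for t in range(seq_length):
        layer_input = x[:, t, :]  # layer 1 receives x_t
        for l in range(num_layers):
            # combined weight and bias for layer l
            W = torch.cat((getattr(rnn, f"weight_ih_l{l}"), getattr(rnn, f"weight_hh_l{l}")), dim=1)
            b = getattr(rnn, f"bias_ih_l{l}") + getattr(rnn, f"bias_hh_l{l}")

            # z_t^l = [layer input; h_{t-1}^l], h_t^l = tanh(W^l z_t^l + b^l)
            z = torch.cat((layer_input, h[l]), dim=1)
            h[l] = torch.tanh(z @ W.T + b)

            # the new hidden state is the input of the next layer
            layer_input = h[l]

    _, h_n = rnn(x)

manual_h_n = torch.stack(h)  # (num_layers, batch_size, hidden_dim)
print(torch.allclose(manual_h_n, h_n, atol=1e-5))
```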


In [17]:
class DeepRNN(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int, num_layers: int):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        # define RNN layers
        self.rnn_layers = nn.ModuleList()
        for i in range(num_layers):
            if i == 0:
                self.rnn_layers.append(nn.Linear(input_dim + hidden_dim, hidden_dim))
            else:
                self.rnn_layers.append(nn.Linear(hidden_dim + hidden_dim, hidden_dim))

        # define the output layer
        self.output_layer = nn.Linear(hidden_dim, output_dim)

    def forward(self, input: torch.Tensor, hidden: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # the first layer receives the input x_t; deeper layers receive h_t^{l-1}
        layer_input = input
        new_hidden = []

        for i, rnn_layer in enumerate(self.rnn_layers):
            # concatenate the layer input with this layer's previous hidden state h_{t-1}^l
            combined_input = torch.cat((layer_input, hidden[i]), dim=1)
            hidden_state = torch.tanh(rnn_layer(combined_input))
            new_hidden.append(hidden_state)

            # the new hidden state becomes the input of the next layer
            layer_input = hidden_state

        # use the last layer's hidden state for output
        output = self.output_layer(new_hidden[-1])
        return output, torch.stack(new_hidden)

    def init_hidden(self, batch_size: int) -> torch.Tensor:
        # initialize hidden state with zeros for each layer and batch
        return torch.zeros(self.num_layers, batch_size, self.hidden_dim)

In [18]:
# hyperparameters
input_dim = 10
hidden_dim = 20
output_dim = 5
num_layers = 3
num_data = 128
sequence_length = 5
batch_size = 32

# generate synthetic dataset
x = torch.randn(num_data, sequence_length, input_dim)
y = torch.randn(num_data)

# create dataset and dataloader
dataset = TensorDataset(x, y)
trainsetloader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=0)

In [None]:
# initialize model
deep_rnn = DeepRNN(input_dim, hidden_dim, output_dim, num_layers)
deep_rnn

In [None]:
summary(deep_rnn, input_size=((batch_size, input_dim), (num_layers, batch_size, hidden_dim)), device="cpu")

In [None]:
# forward pass through the RNN
for c, (x, y_true) in enumerate(trainsetloader):
    # initialize hidden state for each batch
    hidden = deep_rnn.init_hidden(x.size(0))
    print(hidden.size())

    for i in range(sequence_length):
        y_pred, hidden = deep_rnn(x[:, i, :], hidden)
        print(f"Batch: {c+1}/{len(trainsetloader)} | Time step: {i+1} | hidden.size(): {hidden.size()} | output.size(): {y_pred.size()}")

## <a id='toc4_4_'></a>[RNN using PyTorch](#toc0_)

📝 **Docs**:

- `torch.nn.RNN`: [pytorch.org/docs/stable/generated/torch.nn.RNN.html](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html)
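
Before wrapping `nn.RNN` in a model, it may help to look at what it returns. The sketch below uses illustrative sizes and `batch_first=True`; for a unidirectional RNN, the last time step of `out` should match the final hidden state of the top layer.

```python
import torch
from torch import nn

torch.manual_seed(0)

# illustrative sizes (assumptions)
batch_size, seq_length, input_dim, hidden_dim, num_layers = 4, 5, 10, 20, 2

rnn = nn.RNN(input_dim, hidden_dim, num_layers, batch_first=True)
x = torch.randn(batch_size, seq_length, input_dim)

# h_0 defaults to zeros when it is not provided
out, h_n = rnn(x)

print(out.shape)  # torch.Size([4, 5, 20]) -> top-layer hidden state at every time step
print(h_n.shape)  # torch.Size([2, 4, 20]) -> final hidden state of every layer

# the last time step of `out` is the final hidden state of the last layer
print(torch.allclose(out[:, -1, :], h_n[-1]))
```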


In [22]:
# RNN model for sequence-to-one tasks
class RNN(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int, num_layers: int = 1):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        self.rnn = nn.RNN(input_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # initialize hidden state with zeros
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)

        # forward propagate RNN
        # out : output of the RNN last layer for each sequence in a batch for each time step [(batch_size, seq_length, hidden_dim)].
        # _   : the final hidden state (often denoted as hn) of the RNN [(num_layers, batch_size, hidden_dim)].
        out, _ = self.rnn(x, h0)

        # decode the hidden state of the last time step [seq-to-one modeling]
        # :  -> selects all elements along the first dimension (typically batch size).
        # -1 -> selects the last element along the second dimension (which represents the sequence length)
        # :  -> selects all elements along the third dimension (feature dimension)
        out = self.fc(out[:, -1, :])
        return out

In [23]:
# hyperparameters
input_dim = 10
hidden_dim = 20
output_dim = 5
num_layers = 1
num_data = 128
sequence_length = 5
batch_size = 32

# generate synthetic dataset
x = torch.randn(num_data, sequence_length, input_dim)
y = torch.randn(num_data, output_dim)

# create dataset and dataloader
dataset = TensorDataset(x, y)
trainsetloader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=0)

In [None]:
# initialize model
rnn_3 = RNN(input_dim, hidden_dim, output_dim, num_layers)
rnn_3

In [None]:
summary(rnn_3, input_size=(batch_size, *x.size()[1:]), device="cpu")

In [None]:
# forward pass through the RNN
for c, (x, y_true) in enumerate(trainsetloader):
    y_pred = rnn_3(x)
    print(f"batch: {c+1}/{len(trainsetloader)} | output.size(): {y_pred.size()}")

## <a id='toc4_5_'></a>[Long Short-Term Memory (LSTM)](#toc0_)

- A type of recurrent neural network (RNN) designed to address the **Vanishing Gradient** problem inherent in traditional RNNs.
- **Long Short-Term Memory** signifies a system capable of remembering information over both **long** and **short** durations of time.
- **Vanilla RNNs** primarily had **short-term** memory due to their design.
- It is based on the [**Long Short-term Memory**](https://www.researchgate.net/publication/13853244_Long_Short-term_Memory) paper, published in **1997** by [*Sepp Hochreiter*](https://scholar.google.at/citations?user=tvUH3WMAAAAJ&hl=en) and [*Jürgen Schmidhuber*](https://scholar.google.com/citations?user=gLnCTgIAAAAJ&hl=en).

🧾 **Notations**:

- $\mathbf{x}_t$: Input vector at time step $t$
- $\mathbf{h}_t$: Hidden state vector at time step $t$
- $\mathbf{c}_t$: Cell state vector at time step $t$
- $\mathbf{W}_f$: Weight matrix for the forget gate (combined input and hidden state)
- $\mathbf{W}_i$: Weight matrix for the input gate (combined input and hidden state)
- $\mathbf{W}_c$: Weight matrix for the candidate cell state (combined input and hidden state)
- $\mathbf{W}_o$: Weight matrix for the output gate (combined input and hidden state)
- $\mathbf{b}_f$: Bias vector for the forget gate
- $\mathbf{b}_i$: Bias vector for the input gate
- $\mathbf{b}_c$: Bias vector for the candidate cell state
- $\mathbf{b}_o$: Bias vector for the output gate

🔬 **Formulations**:

- **Concatenation of Input and Hidden State**:
   $$\mathbf{z}_t = [\mathbf{x}_t; \mathbf{h}_{t-1}]$$
- **Forget Gate**
   $$\mathbf{f}_t = \sigma(\mathbf{W}_f \mathbf{z}_t + \mathbf{b}_f)$$
- **Input Gate**
   $$\mathbf{i}_t = \sigma(\mathbf{W}_i \mathbf{z}_t + \mathbf{b}_i)$$
- **Candidate Cell State**
   $$\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c \mathbf{z}_t + \mathbf{b}_c)$$
- **Cell State**
   $$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$$
- **Output Gate**
   $$\mathbf{o}_t = \sigma(\mathbf{W}_o \mathbf{z}_t + \mathbf{b}_o)$$
- **Hidden State**
   $$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$

✍️ **Notes**:

- The `Cell State` update contains no weight matrix multiplication, only element-wise gating and addition, which lets long-term memory flow through the unrolled units with much less risk of the gradient vanishing.
- **LSTMs** do not directly solve **Exploding Gradients** but are often **less prone** to it because their structure avoids excessively amplifying gradients during backpropagation.
- **LSTMs** can capture **longer sequences** and handle long-term dependencies more effectively than vanilla RNNs because the gated cell state mitigates the **Vanishing Gradient** problem.
- The **hidden state** $\mathbf{h}_t$ (**short-term** memory) at each time step is the typical **output** for that specific time step.
- The **cell state** $\mathbf{c}_t$ (**long-term** memory) is internal to the LSTM and **is not** used directly as the **output**.

📝 **Docs**:

- [pytorch.org/docs/stable/generated/torch.nn.LSTM.html](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html)

<figure style="text-align: center;">
  <img src="../assets/images/original/rnn/lstm.svg" alt="lstm.svg" style="width: 100%;">
  <figcaption style="text-align: center;">Long Short-Term Memory (LSTM)</figcaption>
</figure>
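
The gate equations above can be traced step by step against `nn.LSTMCell`. This is a minimal sketch with illustrative sizes; it relies on PyTorch stacking the four gate blocks of `weight_ih` and `weight_hh` in the order input, forget, candidate, output.

```python
import torch
from torch import nn

torch.manual_seed(0)

# illustrative sizes (assumptions)
batch_size, input_dim, hidden_dim = 2, 4, 3

cell = nn.LSTMCell(input_dim, hidden_dim)
x_t = torch.randn(batch_size, input_dim)
h_prev = torch.randn(batch_size, hidden_dim)
c_prev = torch.randn(batch_size, hidden_dim)

with torch.no_grad():
    # PyTorch keeps separate input/hidden weights; concatenating them gives the combined matrices
    W = torch.cat((cell.weight_ih, cell.weight_hh), dim=1)  # (4 * hidden_dim, input_dim + hidden_dim)
    b = cell.bias_ih + cell.bias_hh
    z_t = torch.cat((x_t, h_prev), dim=1)

    # split the pre-activations into the four blocks: input, forget, candidate, output
    i_lin, f_lin, c_lin, o_lin = (z_t @ W.T + b).chunk(4, dim=1)

    f_t = torch.sigmoid(f_lin)           # forget gate
    i_t = torch.sigmoid(i_lin)           # input gate
    c_tilde = torch.tanh(c_lin)          # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde   # cell state
    o_t = torch.sigmoid(o_lin)           # output gate
    h_t = o_t * torch.tanh(c_t)          # hidden state

    h_ref, c_ref = cell(x_t, (h_prev, c_prev))

print(torch.allclose(h_t, h_ref, atol=1e-5), torch.allclose(c_t, c_ref, atol=1e-5))
```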


In [27]:
# LSTM model for sequence-to-one tasks
class LSTMModel(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int, num_layers: int):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)

        out, _ = self.lstm(x, (h0, c0))
        out = self.fc(out[:, -1, :])
        return out

In [28]:
# hyperparameters
input_dim = 10
hidden_dim = 20
output_dim = 5
num_layers = 2
num_data = 128
sequence_length = 5
batch_size = 32

# generate synthetic dataset
x = torch.randn(num_data, sequence_length, input_dim)
y = torch.randn(num_data, output_dim)

# create dataset and dataloader
dataset = TensorDataset(x, y)
trainsetloader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=0)

In [None]:
# initialize model
lstm = LSTMModel(input_dim, hidden_dim, output_dim, num_layers)
lstm

In [None]:
summary(lstm, input_size=(batch_size, *x.size()[1:]), device="cpu")

In [None]:
# forward pass through the LSTM
for c, (x_batch, y_true) in enumerate(trainsetloader):
    y_pred = lstm(x_batch)
    print(f"Batch: {c+1}/{len(trainsetloader)} | Output Size: {y_pred.size()}")

## <a id='toc4_6_'></a>[Gated Recurrent Units (GRU)](#toc0_)

- A gating mechanism in recurrent neural networks, introduced in 2014 by [*Kyunghyun Cho*](https://dblp.uni-trier.de/search/author?author=Kyunghyun%20Cho) and colleagues.
- It is evaluated against the LSTM in the [**Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling**](https://arxiv.org/abs/1412.3555) paper.
- Similar to LSTM but lacks a separate `cell state` and an `output gate`, resulting in fewer parameters than LSTM.
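
The parameter saving can be made concrete by counting the parameters of single-layer PyTorch modules. The sketch below assumes the sizes used throughout this notebook (`input_dim = 10`, `hidden_dim = 20`); `count_params` is a small helper defined here for illustration.

```python
from torch import nn

input_dim, hidden_dim = 10, 20


def count_params(module: nn.Module) -> int:
    # total number of learnable scalars in the module
    return sum(p.numel() for p in module.parameters())


n_rnn = count_params(nn.RNN(input_dim, hidden_dim))
n_gru = count_params(nn.GRU(input_dim, hidden_dim))
n_lstm = count_params(nn.LSTM(input_dim, hidden_dim))

print(n_rnn, n_gru, n_lstm)  # 640 1920 2560
```

A GRU layer holds three weight/bias blocks (reset, update, candidate) and an LSTM layer holds four, so their sizes are 3x and 4x that of a vanilla RNN layer with the same dimensions.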

🧾 **Notations**:

- $\mathbf{x}_t$: Input vector at time step $t$
- $\mathbf{h}_t$: Hidden state vector at time step $t$
- $\mathbf{c}_t$: Concatenated input and hidden state vector at time step $t$
- $\mathbf{W}_z$: Weight matrix for the update gate
- $\mathbf{W}_r$: Weight matrix for the reset gate
- $\mathbf{W}_h$: Weight matrix for the candidate hidden state
- $\mathbf{b}_z$: Bias vector for the update gate
- $\mathbf{b}_r$: Bias vector for the reset gate
- $\mathbf{b}_h$: Bias vector for the candidate hidden state
- $\mathbf{z}_t$: Update gate vector at time step $t$
- $\mathbf{r}_t$: Reset gate vector at time step $t$
- $\mathbf{\tilde{h}}_t$: Candidate hidden state vector at time step $t$

🔬 **Formulations**:

- **Concatenated Input and Hidden State**:
   $$\mathbf{c}_t = [\mathbf{x}_t; \mathbf{h}_{t-1}]$$
- **Reset Gate**:
   $$\mathbf{r}_t = \sigma(\mathbf{W}_r \mathbf{c}_t + \mathbf{b}_r)$$
- **Update Gate**:
   $$\mathbf{z}_t = \sigma(\mathbf{W}_z \mathbf{c}_t + \mathbf{b}_z)$$
- **Candidate Hidden State**:
   $$\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_h [\mathbf{x}_t; (\mathbf{r}_t \odot \mathbf{h}_{t-1})] + \mathbf{b}_h)$$
- **Hidden State**:
   $$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$$

📝 **Docs**:

- [pytorch.org/docs/stable/generated/torch.nn.GRU.html](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html)

<figure style="text-align: center;">
  <img src="../assets/images/original/rnn/gru.svg" alt="gru.svg" style="width: 100%;">
  <figcaption style="text-align: center;">Gated Recurrent Units (GRU)</figcaption>
</figure>
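
A single GRU step following the formulations above can be sketched as below. The parameters are randomly drawn and all sizes are illustrative assumptions. Note that, according to the `torch.nn.GRU` documentation, PyTorch swaps the roles of $\mathbf{z}_t$ and $1 - \mathbf{z}_t$ and applies the reset gate after the hidden-state matrix product, so this sketch is not expected to reproduce `nn.GRUCell` numerically.

```python
import torch

torch.manual_seed(0)

# illustrative sizes (assumptions)
batch_size, input_dim, hidden_dim = 2, 4, 3

# randomly drawn parameters, one combined matrix per gate (illustrative values)
W_r = torch.randn(hidden_dim, input_dim + hidden_dim)
W_z = torch.randn(hidden_dim, input_dim + hidden_dim)
W_h = torch.randn(hidden_dim, input_dim + hidden_dim)
b_r = torch.zeros(hidden_dim)
b_z = torch.zeros(hidden_dim)
b_h = torch.zeros(hidden_dim)


def gru_step(x_t: torch.Tensor, h_prev: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    # concatenated input and previous hidden state
    c_t = torch.cat((x_t, h_prev), dim=1)

    r_t = torch.sigmoid(c_t @ W_r.T + b_r)  # reset gate
    z_t = torch.sigmoid(c_t @ W_z.T + b_z)  # update gate

    # candidate hidden state: the reset gate scales h_{t-1} before the matrix product
    h_tilde = torch.tanh(torch.cat((x_t, r_t * h_prev), dim=1) @ W_h.T + b_h)

    # new hidden state: element-wise interpolation between h_{t-1} and the candidate
    h_t = (1 - z_t) * h_prev + z_t * h_tilde
    return h_t, z_t, h_tilde


x_t = torch.randn(batch_size, input_dim)
h_prev = torch.tanh(torch.randn(batch_size, hidden_dim))
h_t, z_t, h_tilde = gru_step(x_t, h_prev)

print(h_t.shape)  # torch.Size([2, 3])
```

Because $\mathbf{z}_t$ lies in $(0, 1)$, every element of $\mathbf{h}_t$ falls between the corresponding elements of $\mathbf{h}_{t-1}$ and $\tilde{\mathbf{h}}_t$.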


In [32]:
class GRUModel(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int, num_layers: int):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        self.gru = nn.GRU(input_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)

        out, _ = self.gru(x, h0)
        out = self.fc(out[:, -1, :])
        return out

In [33]:
# hyperparameters
input_dim = 10
hidden_dim = 20
output_dim = 5
num_layers = 2
num_data = 128
sequence_length = 5
batch_size = 32

# generate synthetic dataset
x = torch.randn(num_data, sequence_length, input_dim)
y = torch.randn(num_data, output_dim)

# create dataset and dataloader
dataset = TensorDataset(x, y)
trainsetloader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=0)

In [None]:
# initialize model
gru = GRUModel(input_dim, hidden_dim, output_dim, num_layers)
gru

In [None]:
summary(gru, input_size=(batch_size, *x.size()[1:]), device="cpu")

In [None]:
# forward pass through the GRU
for c, (x_batch, y_true) in enumerate(trainsetloader):
    y_pred = gru(x_batch)
    print(f"Batch: {c+1}/{len(trainsetloader)} | Output Size: {y_pred.size()}")