# Lab 4: Recurrent Neural Networks (RNNs)

Recurrent Neural Networks (RNNs) are a class of neural networks that are particularly well suited to sequential data. One of their most common applications is natural language processing, where they are used for tasks such as language modeling, text generation, and sentiment analysis.

In this lab, we will briefly review the concept of RNNs, implement a simple RNN from scratch using PyTorch, and apply it to a sequence prediction task. We will also explore how to handle sequential data using `Dataset` and `DataLoader`.

## Introduction to RNNs

RNNs are designed to handle sequential data by maintaining a hidden state that captures information from previous time steps. This allows them to learn dependencies in the data over time. The basic structure of an RNN consists of an input layer, a hidden layer, and an output layer. The hidden layer is recurrent, meaning that it takes both the current input and the previous hidden state as input.

A simple RNN can be defined mathematically as follows:

$$ h_t = f_w(h_{t-1}, x_t) $$

where:

| symbol | description |
|--------|-------------|
| $$ h_t $$ | new hidden state at time step \( t \), the output of the RNN at that time step |
| $$ f_w $$ | function that computes the hidden state with parameters \( w \) |
| $$ h_{t-1} $$ | old hidden state from the previous time step |
| $$ x_t $$ | input vector at time step \( t \) |

To process a sequence of inputs, the RNN iterates through each time step, updating its hidden state based on the current input and the previous hidden state. This allows the network to maintain a memory of past inputs, which is crucial for tasks that require understanding context over time.
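
For the vanilla RNN implemented later in this lab, one common concrete choice of $f_w$ is an affine map followed by $\tanh$. The weight names below mirror the `Wxh`, `Whh`, and `Why` layers used in the code; the bias symbols $b_h$ and $b_y$ are notation introduced here for illustration ($b_h$ stands for the combined bias of the two hidden-state layers):

$$ h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad y_t = W_{hy} h_t + b_y $$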

For example, in a language modeling task, the RNN can learn to predict the next word in a sentence based on the words that have come before it. Consider the input sentence "The cat sat on the mat." The RNN processes each word in the sentence sequentially, updating its hidden state to capture the context of the sentence as it progresses.

---

Let's start by importing the necessary libraries and setting up our environment.

In [2]:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Embeddings

Before diving into RNNs, it's important to understand embeddings, which are a way to represent words or tokens in a continuous vector space. This allows the model to capture semantic relationships between words. Word embeddings can be learned from data using techniques like Word2Vec, GloVe, or FastText, or they can be obtained from pre-trained models like BERT or GPT.

## Understanding Word Embeddings

In particular, we will use PyTorch's `nn.Embedding` class to create an embedding layer. This layer takes a vocabulary size and an embedding dimension as input and outputs the corresponding embeddings for the input tokens.

## Implementation of Embedding Layer

Let's implement a simple embedding layer using PyTorch:

In [3]:
# simple embedding layer
class SimpleEmbedding(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(SimpleEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, x):
        return self.embedding(x)

In the code above, we wrap PyTorch's `nn.Embedding` class in a small module. The layer is constructed from a vocabulary size and an embedding dimension, and it returns one embedding vector per input token. The `forward` method applies the embedding layer to the input tensor `x`, which contains the indices of the tokens in the vocabulary.
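
As a minimal sketch of what the layer does internally, the snippet below checks that an embedding lookup is the same as indexing rows of the layer's weight matrix. The sizes (6 tokens, 3 dimensions) and the index values are hypothetical and chosen only to keep the example small.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the sketch is reproducible

# hypothetical tiny sizes, chosen only for illustration
embedding = nn.Embedding(num_embeddings=6, embedding_dim=3)
indices = torch.tensor([4, 0, 4])

looked_up = embedding(indices)            # shape: (3, 3)
manual_rows = embedding.weight[indices]   # index the weight matrix directly

same = torch.equal(looked_up, manual_rows)
print(looked_up.shape, same)
```

Because token `4` appears twice, rows 0 and 2 of the result are identical: the same token always maps to the same vector.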

## Example Usage of the Embedding Layer

To demonstrate how the embedding layer works, we can use a sample sentence and convert it into embeddings. Let's assume we have a vocabulary of size 10000 and an embedding dimension of 50. We can create a simple example as follows:

In [4]:
# Example usage
vocab_size = 10000  # Example vocabulary size
embedding_dim = 50  # Example embedding dimension
embedding_layer = SimpleEmbedding(vocab_size, embedding_dim).to(device)
sample_input = "a quick brown fox jumps over the lazy dog"

# tokenize the input sentence
tokens = sample_input.split()
# create a mapping from tokens to indices (sorted so the mapping is reproducible across runs)
token_to_index = {token: i for i, token in enumerate(sorted(set(tokens)))}
# convert tokens to indices
input_indices = torch.tensor([token_to_index[token] for token in tokens], dtype=torch.long).to(device)

output_embeddings = embedding_layer(input_indices)
print("Output embeddings shape:", output_embeddings.shape)
print("Output embeddings:", output_embeddings)

Output embeddings shape: torch.Size([9, 50])
Output embeddings: tensor([[-0.2963, -0.6340, -0.3285, -0.6866,  2.8050, -0.4176, -0.2991,  0.1261,
         -2.5564,  0.0357,  0.1976,  0.0407, -1.5840, -0.6216,  1.8142,  0.4543,
         -0.1480, -0.1056, -1.3950, -1.3051,  0.1303,  1.5572, -1.3652, -0.0126,
          0.7808, -2.0361, -1.3009,  0.0242, -0.0625,  0.3843, -0.6666, -0.7672,
         -0.4212, -0.1538,  1.0725,  0.6273, -0.2252, -0.6387,  0.4487, -2.1970,
          2.0449,  1.5520,  1.1275, -0.3182,  0.8463,  0.5343, -0.5538,  1.0649,
         -0.3522, -0.2571],
        [ 0.2715, -1.3029, -0.9057,  0.3579, -1.7064,  0.7636, -2.0855, -0.5252,
          0.7536, -0.4373,  0.5272, -0.3707,  0.0800,  0.6728,  1.2277,  1.9393,
          1.1930, -0.2834, -0.7333,  0.0278,  1.5018,  0.4754, -0.1076, -0.3920,
          1.1594,  0.8735,  0.2560,  2.0067, -0.5472, -1.5595,  1.5067, -0.5563,
         -1.4512,  0.4099,  2.3768,  2.1723,  0.0143, -0.2143,  0.4804,  0.7065,
         -0.5686,

# Vanilla RNN

Vanilla RNNs are a type of RNN that do not use any advanced techniques like LSTM or GRU. They are the simplest form of RNNs and can be implemented using basic PyTorch operations. In this section, we will implement a simple vanilla RNN from scratch using PyTorch's `nn.Module`.

![Process sequences](https://calvinfeng.gitbook.io/machine-learning-notebook/~gitbook/image?url=https%3A%2F%2F760545131-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-legacy-files%2Fo%2Fassets%252F-LIA3amopGH9NC6Rf0mA%252F-LIA3mTJltflw3MVKAEQ%252F-LIA3nSKrqNJpLgASeso%252Fsequence.png%3Fgeneration%3D1532415397328022%26alt%3Dmedia&width=768&dpr=4&quality=100&sign=5baacf94&sv=2)

Depending on the task and input sequence, RNNs can be implemented in different ways. The most common configurations are:

- One-to-One: The input sequence is a single vector, and the RNN produces a single output vector. This is typically used for tasks like sentiment analysis or classification.
- One-to-Many: The input sequence is a single vector, and the RNN produces a sequence of output vectors. This is used for tasks like text generation or image captioning.
- Many-to-One: The input sequence is a sequence of vectors, and the RNN produces a single output vector. This is used for tasks like video classification or sentiment analysis on a sequence of text.
- Many-to-Many: The input sequence is a sequence of vectors, and the RNN produces a sequence of output vectors. This is used for tasks like machine translation or part-of-speech tagging.

A vanilla RNN can be implemented with PyTorch's `nn.Module` class alone, without relying on `nn.RNN`. Let's look in detail at how to implement a simple vanilla RNN from scratch.

## One-to-One RNN

In a one-to-one RNN, the input sequence is a single vector, and the RNN produces a single output vector. It consists of a single input vector, a hidden state, and an output vector. The hidden state is updated based on the input vector, and the output vector is produced from the hidden state.

In [5]:
class OneToOneRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(OneToOneRNN, self).__init__()
        self.hidden_size = hidden_size
        self.Wxh = nn.Linear(input_size, hidden_size)  # input to hidden
        self.Whh = nn.Linear(hidden_size, hidden_size)  # hidden to hidden
        self.Why = nn.Linear(hidden_size, output_size)  # hidden to output

    def forward(self, x, h_prev):
        h_prev = torch.tanh(self.Wxh(x) + self.Whh(h_prev))
        y_t = self.Why(h_prev)
        return y_t, h_prev

The usage of a one-to-one RNN is similar to that of a feedforward neural network, where the input is processed through the RNN to produce an output. This type of RNN is often used for tasks like sentiment analysis or classification, where the input is a single vector (e.g., a fixed length sentence represented as a vector) and the output is a single vector (e.g., a sentiment score or class label).

In [10]:
# Example usage of OneToOneRNN
input_size = 10  # Size of input vector
hidden_size = 20  # Size of hidden state
output_size = 5   # Size of output vector
rnn = OneToOneRNN(input_size, hidden_size, output_size).to(device)
x = torch.randn(1, input_size).to(device)  # Example input vector
h_prev = torch.zeros(1, hidden_size).to(device)  # Initial hidden state
output, h_next = rnn(x, h_prev)
print("input: ", x)
print("Output:", output)

input:  tensor([[-0.6192, -0.6575,  1.0770,  0.6959,  1.2351, -0.3910, -1.1162, -1.1564,
         -0.3809, -1.4583]])
Output: tensor([[-0.0336,  0.4924,  0.1411, -0.0116,  0.3583]],
       grad_fn=<AddmmBackward0>)


## Many-to-One RNN

In a many-to-one RNN, the input sequence is a sequence of vectors, and the RNN produces a single output vector. This is used for tasks like video classification or sentiment analysis on a sequence of text. The hidden state is updated at each time step based on the input vector, and the final hidden state is used to produce the output vector.

The implementation of a many-to-one RNN is similar to that of a one-to-one RNN, but it processes a sequence of input vectors instead of a single vector. The final hidden state after processing the entire sequence is used to produce the output vector.

The modification is highlighted below:

```diff
@@ -10 +10,2 @@
-        h_prev = torch.tanh(self.Wxh(x) + self.Whh(h_prev))
+        for t in range(x.size(0)):
+            h_prev = torch.tanh(self.Wxh(x[t]) + self.Whh(h_prev))
```

In [11]:
# implementing a simple RNN without using nn.RNN
class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleRNN, self).__init__()
        self.hidden_size = hidden_size
        self.Wxh = nn.Linear(input_size, hidden_size)  # input to hidden
        self.Whh = nn.Linear(hidden_size, hidden_size)  # hidden to hidden
        self.Why = nn.Linear(hidden_size, output_size)  # hidden to output

    def forward(self, x, h_prev):
        # to handle sequences, we iterate over the inputs (time steps)
        # x is expected to be of shape (seq_len, input_size)
        for t in range(x.size(0)):
            h_prev = torch.tanh(self.Wxh(x[t]) + self.Whh(h_prev))
        y_t = self.Why(h_prev)
        return y_t, h_prev
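
The many-to-one loop above keeps only the final output. As a rough sketch of the many-to-many configuration listed earlier, the same recurrence can emit one output per time step. `ManyToManyRNN` is a hypothetical name introduced here, and the sizes are illustrative rather than taken from the lab.

```python
import torch
import torch.nn as nn

class ManyToManyRNN(nn.Module):
    """Hypothetical variant of the from-scratch RNN that outputs at every step."""
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.Wxh = nn.Linear(input_size, hidden_size)   # input to hidden
        self.Whh = nn.Linear(hidden_size, hidden_size)  # hidden to hidden
        self.Why = nn.Linear(hidden_size, output_size)  # hidden to output

    def forward(self, x, h_prev):
        # x: (seq_len, input_size), h_prev: (1, hidden_size)
        outputs = []
        for t in range(x.size(0)):
            h_prev = torch.tanh(self.Wxh(x[t]) + self.Whh(h_prev))
            outputs.append(self.Why(h_prev))  # one output per time step
        # stack the per-step outputs into (seq_len, output_size)
        return torch.cat(outputs, dim=0), h_prev

seq = torch.randn(9, 50)  # e.g. 9 tokens with 50-dimensional embeddings
m2m = ManyToManyRNN(50, 128, 2)
ys, h_last = m2m(seq, torch.zeros(1, 128))
print(ys.shape, h_last.shape)
```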

### Example Usage of the Many-to-One RNN layer

To demonstrate how the RNN works, we can create a simple example where we process a sequence of inputs and compute the corresponding hidden states and outputs. The sample input is taken from the output embeddings of a previous embedding layer, and we will use a simple RNN to process this sequence.

In [20]:

x = output_embeddings  # Get embeddings for the input indices
input_size = embedding_dim  # Size of input vector
hidden_size = 128  # Size of hidden state
output_size = 2   # Size of output vector
rnn = SimpleRNN(input_size, hidden_size, output_size).to(device)
h_prev = torch.zeros(1, hidden_size).to(device)  # Initial hidden state
output, h_next = rnn(x, h_prev)
print("Input: ", x.shape)
print("Output:", output.shape)

Input:  torch.Size([9, 50])
Output: torch.Size([1, 2])
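
As a sanity check on the from-scratch recurrence, the sketch below copies the parameters of two `nn.Linear` layers into PyTorch's built-in `nn.RNN` and compares the final hidden states. It assumes a PyTorch version in which `nn.RNN` accepts unbatched input of shape `(seq_len, input_size)`; the sizes are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, seq_len = 4, 6, 5  # hypothetical sizes

# the two linear maps used by the from-scratch recurrence
Wxh = nn.Linear(input_size, hidden_size)
Whh = nn.Linear(hidden_size, hidden_size)

# built-in single-layer tanh RNN with the same parameters copied in
builtin = nn.RNN(input_size, hidden_size, nonlinearity="tanh")
with torch.no_grad():
    builtin.weight_ih_l0.copy_(Wxh.weight)
    builtin.bias_ih_l0.copy_(Wxh.bias)
    builtin.weight_hh_l0.copy_(Whh.weight)
    builtin.bias_hh_l0.copy_(Whh.bias)

x = torch.randn(seq_len, input_size)  # unbatched sequence
h = torch.zeros(1, hidden_size)       # initial hidden state

with torch.no_grad():
    h_manual = h
    for t in range(seq_len):
        h_manual = torch.tanh(Wxh(x[t]) + Whh(h_manual))
    _, h_builtin = builtin(x, h)      # h_builtin: (1, hidden_size)

matches = torch.allclose(h_manual, h_builtin, atol=1e-5)
print(matches)
```

If the two results agree, the hand-written loop computes the same recurrence as `nn.RNN`, which is typically the faster option in practice.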


# Datasets and Data Loaders for sequential data

To train our RNN, we need a dataset that consists of sequences. There are various datasets available for sequence prediction tasks, such as the Penn Treebank dataset for language modeling or the [IMDB dataset](https://ai.stanford.edu/~amaas/data/sentiment/) for sentiment analysis.

## Downloading the Dataset

Let's start by downloading the dataset with `curl` and extracting it with `tar`. We will use the IMDB dataset for sentiment analysis, which consists of movie reviews labeled as positive or negative:

In [203]:
!curl -Lo imdb.tar.gz https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf imdb.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 10 80.2M   10 8784k    0     0  1951k      0  0:00:42  0:00:04  0:00:38 1951k
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now


Here we define two helper functions to load the IMDB dataset and its vocabulary. The `load_imdb_data` function reads the dataset from the specified path and returns a list of `(label, text)` tuples, one per review. The label is `1` for positive reviews and `0` for negative reviews. The `load_imdb_vocab` function reads the `imdb.vocab` file, which lists one token per line; a token's line position later serves as its index.

In [204]:
# after downloading the dataset, we define a function to load the IMDB dataset
import os

def load_imdb_data(path="./aclImdb", split='train'):
    if split not in ['train', 'test']:
        raise ValueError("split must be either 'train' or 'test'")
    data = []
    split_path = os.path.join(path, split)
    for label in ['pos', 'neg']:
        label_path = os.path.join(split_path, label)
        for fname in os.listdir(label_path):
            if fname.endswith('.txt'):
                with open(os.path.join(label_path, fname), 'r', encoding='utf-8') as f:
                    line = f.read().strip()
                    data.append((1 if label == 'pos' else 0, line))
    return data

def load_imdb_vocab(path="./aclImdb"):
    if not os.path.exists(os.path.join(path, 'imdb.vocab')):
        raise FileNotFoundError("Vocabulary file not found in the specified path.")
    # Load the vocabulary from the imdb.vocab file
    tokens = []
    with open(os.path.join(path, 'imdb.vocab'), 'r', encoding='utf-8') as f:
        tokens = [line.strip() for line in f.readlines()]
    return tokens

vocab_data = load_imdb_vocab(path="./aclImdb")
train_data = load_imdb_data(path="./aclImdb", split='train')
test_data = load_imdb_data(path="./aclImdb", split='test')

It's good to print out some basic statistics about the dataset, such as the number of reviews and the average length of the reviews. This will help us understand the dataset better and prepare for training our RNN.

In [205]:
print("Vocabulary size:", len(vocab_data))
print("Number of training samples:", len(train_data))
print("Number of testing samples:", len(test_data))

print("Sample training data:", train_data[0])
print("Sample testing data:", test_data[0])

Vocabulary size: 89527
Number of training samples: 25000
Number of testing samples: 25000
Sample training data: (1, 'Zentropa has much in common with The Third Man, another noir-like film set among the rubble of postwar Europe. Like TTM, there is much inventive camera work. There is an innocent American who gets emotionally involved with a woman he doesn\'t really understand, and whose naivety is all the more striking in contrast with the natives.<br /><br />But I\'d have to say that The Third Man has a more well-crafted storyline. Zentropa is a bit disjointed in this respect. Perhaps this is intentional: it is presented as a dream/nightmare, and making it too coherent would spoil the effect. <br /><br />This movie is unrelentingly grim--"noir" in more than one sense; one never sees the sun shine. Grim, but intriguing, and frightening.')
Sample testing data: (1, "Previous reviewer Claudio Carvalho gave a much better recap of the film's plot details than I could. What I recall mostly is
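
The cell above reports sample counts but not review lengths. A minimal sketch of a length summary follows; `review_length_stats` and the three stand-in reviews are hypothetical, and lengths are measured in whitespace-separated tokens, which is only an approximation of what the model will see after tokenization.

```python
from statistics import mean

def review_length_stats(data):
    """Return (min, mean, max) review length in whitespace-separated tokens.

    `data` is a list of (label, text) tuples, as returned by load_imdb_data.
    """
    lengths = [len(text.split()) for _, text in data]
    return min(lengths), mean(lengths), max(lengths)

# hypothetical stand-in reviews; replace with train_data once it is loaded
sample_data = [
    (1, "a great film with a clever plot"),
    (0, "dull and far too long"),
    (1, "loved it"),
]
shortest, average, longest = review_length_stats(sample_data)
print(shortest, average, longest)
```

Long reviews matter here because the from-scratch RNN processes one token per loop iteration, so both runtime and the depth of backpropagation grow with sequence length.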

## Defining a Custom Dataset Class

After that, we define a custom dataset class that inherits from `torch.utils.data.Dataset`. This class will handle the tokenization of the input sequences and the conversion of tokens to indices.

In [206]:
from torch.utils.data import Dataset, DataLoader

class SentimentDataset(Dataset):
    def __init__(self, sequences, vocab, device='cpu'):
        self.sequences = sequences
        self.vocab = vocab
        self.device = device
        self.token_to_index = {token: i for i, token in enumerate(vocab)}
        self.index_to_token = {i: token for i, token in enumerate(vocab)}

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        label, line = self.sequences[idx]
        target = torch.zeros(2, dtype=torch.float, device=self.device)
        target[label] = 1.0  # one-hot encoding for binary classification
        # keep only in-vocabulary tokens; the dict lookup is O(1),
        # whereas `token in self.vocab` would scan the whole list each time
        indices = [self.token_to_index[token] for token in self.tokenize(line) if token in self.token_to_index]
        input_seq = torch.tensor(indices, dtype=torch.long, device=self.device)
        return input_seq, target

    def tokenize(self, text):
        # lowercase, strip periods and commas, then split on whitespace
        text = text.lower().replace('.', '').replace(',', '')
        return text.split()
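
To make the tokenization and index lookup concrete, here is a small standalone sketch of the same steps on a hypothetical four-word vocabulary. Tokens that are not in the vocabulary are silently dropped, which mirrors the filtering in `__getitem__`; `encode` is a helper name introduced only for this example.

```python
def tokenize(text):
    # same normalisation as SentimentDataset.tokenize
    return text.lower().replace('.', '').replace(',', '').split()

def encode(text, token_to_index):
    """Map a review to vocabulary indices, dropping unknown tokens."""
    return [token_to_index[tok] for tok in tokenize(text) if tok in token_to_index]

# hypothetical four-word vocabulary, for illustration only
vocab = ["the", "movie", "was", "great"]
token_to_index = {token: i for i, token in enumerate(vocab)}

encoded = encode("The movie was great, truly GREAT.", token_to_index)
print(encoded)  # "truly" is out of vocabulary and is skipped
```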

## Constructing the Dataloader

To use our custom dataset class, we can create an instance of it and pass it to a `DataLoader`. The `DataLoader` will handle batching and shuffling of the data during training. For simplicity, we will use a batch size of 1 to avoid complications with padding and masking during training.

In [207]:
train_loader = DataLoader(
    SentimentDataset(train_data, vocab_data, device=device),
    batch_size=1,
    shuffle=True
)
test_loader = DataLoader(
    SentimentDataset(test_data, vocab_data, device=device),
    batch_size=1,
    shuffle=False
)

In [208]:
for inputs, labels in train_loader:
    print("Input shape:", inputs.shape)
    print("Label:", labels)
    
    break  # just to check the first batch

Input shape: torch.Size([1, 439])
Label: tensor([[0., 1.]])
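
For batch sizes larger than 1, reviews of different lengths have to be padded to a common length. The sketch below shows one possible `collate_fn` built on `torch.nn.utils.rnn.pad_sequence`. The function name `pad_collate`, the toy samples, and the choice of `0` as padding index are assumptions; in this lab index `0` is also a real vocabulary entry, so a dedicated padding index would be needed in practice.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch, pad_index=0):
    """Pad variable-length index sequences to the longest one in the batch."""
    sequences, targets = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=pad_index)
    return padded, lengths, torch.stack(targets)

# hypothetical toy samples in the (input_seq, target) format of SentimentDataset
toy_samples = [
    (torch.tensor([5, 2, 9]), torch.tensor([0.0, 1.0])),
    (torch.tensor([7]), torch.tensor([1.0, 0.0])),
    (torch.tensor([3, 3, 1, 4, 8]), torch.tensor([0.0, 1.0])),
]
loader = DataLoader(toy_samples, batch_size=3, collate_fn=pad_collate)
padded, lengths, targets = next(iter(loader))
print(padded.shape, lengths.tolist(), targets.shape)
```

Returning the original lengths alongside the padded batch lets a model ignore the padded positions later, for example by reading the hidden state at each sequence's true last step.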


# Training the model

Before we can train our RNN, we need to construct the model for the binary classification task. We will use the previous simple RNN architecture with an embedding layer, an RNN layer, and a linear output layer.

```mermaid
flowchart TD;
    A[Input Layer] --> B[Embedding Layer]
    B --> C[RNN Layer]
    C --> D[Output Layer]
    C -->|Recurrent Connection| C
```

In [209]:
class SentimentRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, output_size):
        super(SentimentRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = SimpleEmbedding(vocab_size, embedding_dim)
        self.rnn = SimpleRNN(input_size=embedding_dim, hidden_size=hidden_size, output_size=output_size)
        self.fc = nn.Linear(hidden_size, output_size)

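    def forward(self, x):
        embedded = self.embedding(x)
        # create the initial hidden state on the same device as the input
        h_prev = torch.zeros(1, self.hidden_size, device=embedded.device)
        # SimpleRNN also returns its own output projection, which is unused here;
        # the final hidden state is classified by self.fc instead
        _, h_last = self.rnn(embedded, h_prev)
        output = self.fc(h_last)
        return output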

## Construct the model, loss function, optimizer and parameters

In [210]:
NUM_EPOCHS = 5
learning_rate = 0.001
embedding_size = 50

# Construct the model, loss function, optimizer and parameters
model = SentimentRNN(vocab_size=len(vocab_data), embedding_dim=embedding_size, hidden_size=128, output_size=2).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

## Training loop

In the training loop, we will iterate over the dataset for a specified number of epochs. For each batch, we will perform the following steps:
1. Zero the gradients of the model parameters.
2. Forward pass: Pass the input through the model to get the output.
3. Compute the loss using the specified loss function.
4. Backward pass: Compute the gradients of the loss with respect to the model parameters.
5. Update the model parameters using the optimizer.

In [None]:
model.train() # Set the model to training mode

for epoch in range(NUM_EPOCHS):
    for i, (inputs, targets) in enumerate(train_loader):
        inputs = inputs.to(device)
        targets = targets.float().to(device)

        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs.flatten())

        # Compute loss
        loss = criterion(outputs, targets)

        # Backward pass
        loss.backward()

        # Update parameters
        optimizer.step()

        if (i + 1) % 100 == 0:
            print(f'Epoch [{epoch + 1}/{NUM_EPOCHS}], Step [{i + 1}/{len(train_loader)}], Loss: {loss.item():.4f}')

Epoch [1/5], Step [100/25000], Loss: 0.4456
Epoch [1/5], Step [200/25000], Loss: 0.7368
Epoch [1/5], Step [300/25000], Loss: 0.9745
Epoch [1/5], Step [400/25000], Loss: 0.6506
Epoch [1/5], Step [500/25000], Loss: 0.4566
Epoch [1/5], Step [600/25000], Loss: 0.5874
Epoch [1/5], Step [700/25000], Loss: 0.4563
Epoch [1/5], Step [800/25000], Loss: 0.7857
Epoch [1/5], Step [900/25000], Loss: 0.7775
Epoch [1/5], Step [1000/25000], Loss: 0.6017
Epoch [1/5], Step [1100/25000], Loss: 0.6524
Epoch [1/5], Step [1200/25000], Loss: 0.4830
Epoch [1/5], Step [1300/25000], Loss: 0.5055
Epoch [1/5], Step [1400/25000], Loss: 1.0214
Epoch [1/5], Step [1500/25000], Loss: 0.5068
Epoch [1/5], Step [1600/25000], Loss: 0.8094
Epoch [1/5], Step [1700/25000], Loss: 0.8628
Epoch [1/5], Step [1800/25000], Loss: 0.7206
Epoch [1/5], Step [1900/25000], Loss: 0.8552
Epoch [1/5], Step [2000/25000], Loss: 0.6927
Epoch [1/5], Step [2100/25000], Loss: 0.6084
Epoch [1/5], Step [2200/25000], Loss: 0.4912
Epoch [1/5], Step [
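
After training, the held-out split can be used to estimate accuracy. The following is a minimal sketch of an evaluation loop under the lab's conventions (batch size 1, one-hot targets, the model called on a flattened index sequence). `evaluate`, `AlwaysClassOne`, and `toy_loader` are hypothetical stand-ins so that the snippet runs on its own; in the lab they would be replaced by the trained model and `test_loader`.

```python
import torch
import torch.nn as nn

def evaluate(model, loader):
    """Return accuracy over (inputs, one-hot targets) batches of size 1."""
    model.eval()  # switch off training-only behaviour such as dropout
    correct, total = 0, 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for inputs, targets in loader:
            logits = model(inputs.flatten())   # shape: (1, 2)
            predicted = logits.argmax(dim=1)
            expected = targets.argmax(dim=1)
            correct += (predicted == expected).sum().item()
            total += expected.numel()
    return correct / total

# hypothetical stand-in model and data so the sketch runs on its own
class AlwaysClassOne(nn.Module):
    def forward(self, x):
        return torch.tensor([[0.0, 1.0]])

toy_loader = [
    (torch.tensor([[1, 2, 3]]), torch.tensor([[0.0, 1.0]])),
    (torch.tensor([[4, 5]]), torch.tensor([[1.0, 0.0]])),
]
accuracy = evaluate(AlwaysClassOne(), toy_loader)
print(accuracy)
```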

# Gradient vanishing and gradient explosion problem of Vanilla RNN

Vanilla RNNs can suffer from the gradient vanishing and gradient explosion problems, which can make training difficult. The gradient vanishing problem occurs when the gradients become very small as they are propagated back through time, leading to slow or stalled learning. The gradient explosion problem occurs when the gradients become very large, causing the model parameters to diverge.
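
The effect can be isolated in a simplified setting. The sketch below drops the `tanh` and uses a purely linear recurrence whose weight matrix is a scaled orthogonal matrix, so each backward step rescales the gradient by exactly `scale`. This is an illustrative simplification, not the lab's model, and `gradient_norm_through_time` is a name introduced only here.

```python
import torch

torch.manual_seed(0)

def gradient_norm_through_time(scale, steps=50, hidden_size=8):
    """Norm of d(sum h_T)/d h_0 for the linear recurrence h_t = h_{t-1} W^T."""
    # W = scale * Q with Q orthogonal, so every step rescales gradients by `scale`
    q, _ = torch.linalg.qr(torch.randn(hidden_size, hidden_size, dtype=torch.float64))
    w = scale * q
    h0 = torch.randn(1, hidden_size, dtype=torch.float64, requires_grad=True)
    h = h0
    for _ in range(steps):
        h = h @ w.T
    h.sum().backward()
    return h0.grad.norm().item()

vanishing = gradient_norm_through_time(scale=0.9)
exploding = gradient_norm_through_time(scale=1.1)
print(f"scale 0.9: {vanishing:.3e}   scale 1.1: {exploding:.3e}")
```

In this setting the gradient norm is proportional to `scale ** steps`, so a factor slightly below 1 shrinks it geometrically and a factor slightly above 1 grows it geometrically. With `tanh` in the loop, saturation adds further shrinkage on top of this.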

## Demonstration of the gradient vanishing

To demonstrate the gradient vanishing problem, we can create a simple RNN with a large number of time steps and observe how the gradients behave during training. We will use a synthetic dataset with a long sequence length to illustrate this issue.

In [None]:
# Create a synthetic dataset with long sequences
sequence_length = 1000  # Length of the sequence
num_samples = 100  # Number of samples in the dataset
input_size = 10  # Size of input vector
hidden_size = 20  # Size of hidden state
output_size = 5   # Size of output vector
rnn = SimpleRNN(input_size, hidden_size, output_size).to(device)
h_prev = torch.zeros(1, hidden_size).to(device)  # Initial hidden state
for i in range(num_samples):
    x = torch.randn(sequence_length, input_size).to(device)  # Example input sequence
    output, h_next = rnn(x, h_prev)
    loss = criterion(output, torch.rand(1, output_size).to(device))  # Random target in [0, 1] for demonstration
    # clear the gradients of `rnn` itself; `optimizer` was built for the sentiment model and does not track `rnn`
    rnn.zero_grad()
    loss.backward()
    print(f"Sample {i+1}, Loss: {loss.item()}, Gradients: {rnn.Wxh.weight.grad.norm().item()}, {rnn.Whh.weight.grad.norm().item()}, {rnn.Why.weight.grad.norm().item()}")

## Demonstration of the gradient explosion

To look for the gradient explosion problem, we run the same experiment again on long synthetic sequences and watch for gradient norms that grow very large. Explosion typically arises when the recurrent weights are large, so that repeated multiplication through time amplifies the gradient; with the default initialization used here, the norms may stay small instead.

In [None]:
# Demonstration of the gradient explosion problem
for i in range(num_samples):
    x = torch.randn(sequence_length, input_size).to(device)  # Example input sequence
    output, h_next = rnn(x, h_prev)
    loss = criterion(output, torch.rand(1, output_size).to(device))  # Random target in [0, 1] for demonstration
    # clear the gradients of `rnn` itself; `optimizer` was built for the sentiment model and does not track `rnn`
    rnn.zero_grad()
    loss.backward()
    print(f"Sample {i+1}, Loss: {loss.item()}, Gradients: {rnn.Wxh.weight.grad.norm().item()}, {rnn.Whh.weight.grad.norm().item()}, {rnn.Why.weight.grad.norm().item()}")

# Solving gradient vanishing and explosion with LSTM and GRU

LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) are advanced RNN architectures designed primarily to mitigate the vanishing gradient problem. They use gating mechanisms to control the flow of information through the network, allowing them to learn long-term dependencies more effectively. Exploding gradients are usually handled separately, for example with gradient clipping.
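
Gating does not by itself bound the size of the gradients, so a common complementary measure against explosion is to clip the gradient norm before each optimizer step. The sketch below uses `torch.nn.utils.clip_grad_norm_`; the stand-in layer, the inflated loss, and the threshold `max_norm = 1.0` are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(10, 10)  # hypothetical stand-in for any model
loss = (layer(torch.randn(4, 10)) * 1000.0).pow(2).mean()  # deliberately large loss
loss.backward()

max_norm = 1.0  # assumed threshold; tune per task
# rescales all gradients in place and returns the total norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in layer.parameters()))
print(float(norm_before), float(norm_after))
```

In a training loop the call would sit between `loss.backward()` and `optimizer.step()`.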

## LSTM

LSTM is a type of RNN that uses a special gating mechanism to control the flow of information through the network. It consists of three gates: the input gate, the forget gate, and the output gate. These gates allow the LSTM to selectively remember or forget information from previous time steps, making it more effective at learning long-term dependencies.


In [21]:
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, output_size):
        super(SentimentLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = SimpleEmbedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

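    def forward(self, x):
        # LSTM layer expects input of shape (batch_size, seq_len, input_size)
        embedded = self.embedding(x)
        lstm_output, (h_n, c_n) = self.lstm(embedded)
        # h_n stacks one final hidden state per layer; take the last layer and
        # reshape to (batch_size, hidden_size) so the output is (batch_size, output_size)
        output = self.fc(h_n[-1].reshape(-1, self.hidden_size))
        return output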
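
To connect the three gates described above to actual computation, the sketch below performs a single LSTM step by hand and compares it with `nn.LSTMCell`. It relies on PyTorch's documented layout, in which the stacked weights hold the input, forget, cell-candidate, and output blocks in that order; the sizes are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 5, 4  # hypothetical sizes
cell = nn.LSTMCell(input_size, hidden_size)

x = torch.randn(1, input_size)   # input at one time step
h = torch.randn(1, hidden_size)  # previous hidden state
c = torch.randn(1, hidden_size)  # previous cell state

with torch.no_grad():
    # all four pre-activations at once; PyTorch stacks them as i, f, g, o
    gates = x @ cell.weight_ih.T + cell.bias_ih + h @ cell.weight_hh.T + cell.bias_hh
    i, f, g, o = gates.chunk(4, dim=1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)                    # candidate cell values
    c_manual = f * c + i * g             # forget part of the old memory, write new content
    h_manual = o * torch.tanh(c_manual)  # expose part of the memory as hidden state
    h_ref, c_ref = cell(x, (h, c))

lstm_matches = (torch.allclose(h_manual, h_ref, atol=1e-5)
                and torch.allclose(c_manual, c_ref, atol=1e-5))
print(lstm_matches)
```

The additive update of the cell state `c` is what gives gradients a more direct path through time than the repeated `tanh` of a vanilla RNN.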

## GRU

GRU (Gated Recurrent Unit) is another type of RNN that is similar to LSTM but has a simpler architecture. It uses two gates: the update gate and the reset gate. The update gate controls how much of the previous hidden state is carried over to the new one, while the reset gate controls how much of the previous hidden state is used when computing the candidate state. GRUs have fewer parameters than LSTMs of the same size, which often makes them faster to train.

In [None]:
class SentimentGRU(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, output_size):
        super(SentimentGRU, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = SimpleEmbedding(vocab_size, embedding_dim)
        self.gru = nn.GRU(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

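    def forward(self, x):
        # GRU layer expects input of shape (batch_size, seq_len, input_size)
        embedded = self.embedding(x)
        gru_output, h_n = self.gru(embedded)
        # h_n stacks one final hidden state per layer; take the last layer and
        # reshape to (batch_size, hidden_size) so the output is (batch_size, output_size)
        output = self.fc(h_n[-1].reshape(-1, self.hidden_size))
        return output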
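
The difference in architectural size can be checked by counting parameters. The sketch below compares single-layer `nn.RNN`, `nn.GRU`, and `nn.LSTM` modules using the embedding dimension (50) and hidden size (128) from this lab; `count_parameters` is a helper introduced here. The counts describe model size only and say nothing about accuracy.

```python
import torch.nn as nn

def count_parameters(module):
    """Total number of scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# same sizes as the lab's models: embedding_dim=50, hidden_size=128
rnn_params = count_parameters(nn.RNN(input_size=50, hidden_size=128))
gru_params = count_parameters(nn.GRU(input_size=50, hidden_size=128))
lstm_params = count_parameters(nn.LSTM(input_size=50, hidden_size=128))
print(rnn_params, gru_params, lstm_params)
```

A GRU holds three weight blocks of the vanilla RNN's size and an LSTM holds four, so at these sizes the GRU has three quarters of the LSTM's recurrent parameters.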