In [1]:
# install dependencies
!pip install datasets transformers


# M2177.004300 002 Deep Learning <br> Assignment #2 Part 1: Training Recurrent Neural Network (RNN)


Copyright (C) Data Science & AI Laboratory, Seoul National University. This material is for educational uses only. Some contents are based on the material provided by other paper/book authors and may be copyrighted by them. Written by DongHyeok Lee, October 2024


The goal of this assignment is to progressively train deeper and more accurate models using PyTorch.

This notebook uses the [imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset, a popular benchmark for sentiment analysis in natural language processing. It contains 50,000 movie reviews from the Internet Movie Database (IMDB), split evenly into 25,000 reviews for training and 25,000 for testing. Each review is labeled as either positive (1) or negative (0), making this a binary classification problem. The dataset is balanced, with an equal number of positive and negative reviews in both the training and test sets.

**Note**: certain details are missing or ambiguous on purpose, in order to test your knowledge of the related materials. However, if you feel that something essential is missing and you cannot proceed to the next step, contact the teaching staff with a clear description of your problem.

### Submitting your work:

<font color=red>**DO NOT clear the final outputs**</font> so that TAs can grade both your code and results.  
Once you have finished the problems, run the _CollectSubmission.sh_ script with your **Student number** as the input argument. <br>
This will produce a compressed file called _[Your student number].tar.gz_. Please submit this file on ETL. &nbsp;&nbsp; (Usage: ./_CollectSubmission.sh_ &nbsp; 20\*\*-\*\*\*\*\*)


### In this notebook, we will focus on two main sections:

2-1-1. Implement the forward and backward propagation for a GRU (Gated Recurrent Unit)

- We will dive deep into the internal mechanics of a GRU cell
- You will implement the forward pass, understanding how the update gate, reset gate, and new memory content interact
- You'll also implement the backward pass, deriving and calculating the gradients for each component
- This exercise will give you a thorough understanding of how GRUs process sequential data

2-1-2. Training Multi-Layer RNN with PyTorch Module

- We'll use PyTorch to build a complete GRU-based model for sentiment analysis
- You'll learn how to:
  - Create a custom PyTorch Module for the GRU model
  - Set up the training loop, including data loading and batching
  - Implement the forward pass of the entire model
  - Train the model on the IMDB dataset
  - Evaluate the model's performance on a test set

By completing these two sections, you'll gain both a low-level understanding of GRU operations and practical experience in using GRUs for a real-world NLP task. This combination of theoretical knowledge and practical application will deepen your understanding of recurrent neural networks and their use in sequence modeling tasks.


## Assignment 2-1-1 | Implement the forward and backward propagation for a GRU (Gated Recurrent Unit)


The figure and equations below describe the GRU architecture.  
Use them to implement the forward and backward passes.


<img src="./images/GRU.webp" alt="GRU cell architecture" width="500"/>


Key equations of the Gated Recurrent Unit (GRU):

1. Update Gate:

$z_t = \sigma( (U_z \cdot h_{t-1} + b_{zh}) + (W_z \cdot x_{t} + b_{zw}))$

2. Reset Gate:

$r_t = \sigma( (U_r \cdot h_{t-1} + b_{rh}) + (W_r \cdot x_{t} + b_{rw}))$

3. Current Memory Content:

$\tilde{h}_t = \tanh((W \cdot x_{t} + b_{w}) + r_t * (U \cdot h_{t-1} + b_{h}))$

4. Final Memory:

$h_t = z_t * h_{t-1} + (1-z_t) * \tilde{h}_t$

Where:

- $\sigma$ is the sigmoid function
- $\tanh$ is the hyperbolic tangent function
- $*$ denotes element-wise multiplication
- $\cdot$ represents matrix multiplication
- $W_z, W_r, W, U_z, U_r, U$ are weight matrices
- $b_{zh}, b_{rh}, b_{h}, b_{zw}, b_{rw}, b_{w}$ are bias vectors
- $x_t$ is the input at the current time step
- $h_{t-1}$ is the hidden state from the previous time step
- $h_t$ is the hidden state at the current time step

Tip. The derivative of the tanh function:

$\frac{d}{dx} \tanh(x) = 1 - \tanh^2(x)$
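
The tip above covers $\tanh$; the two gates also need the sigmoid derivative. Applying the chain rule to the four GRU equations gives the gradients below. This is a sketch rather than an official solution: $g_t$ and the $\delta$ symbols are notation introduced here, with $g_t = \partial L / \partial h_t$ standing for the incoming gradient (`grad_output`), and vectors are treated as columns as in the equations above.

$\frac{d}{dx} \sigma(x) = \sigma(x) * (1 - \sigma(x))$

$\delta^{\tilde{h}}_t = g_t * (1 - z_t) * (1 - \tilde{h}_t^2)$

$\delta^{z}_t = g_t * (h_{t-1} - \tilde{h}_t) * z_t * (1 - z_t)$

$\delta^{r}_t = \delta^{\tilde{h}}_t * (U \cdot h_{t-1} + b_{h}) * r_t * (1 - r_t)$

$\frac{\partial L}{\partial h_{t-1}} = g_t * z_t + U_z^\top \cdot \delta^{z}_t + U_r^\top \cdot \delta^{r}_t + U^\top \cdot (\delta^{\tilde{h}}_t * r_t)$

$\frac{\partial L}{\partial W_z} = \delta^{z}_t \cdot x_t^\top, \quad \frac{\partial L}{\partial W_r} = \delta^{r}_t \cdot x_t^\top, \quad \frac{\partial L}{\partial W} = \delta^{\tilde{h}}_t \cdot x_t^\top$

$\frac{\partial L}{\partial U_z} = \delta^{z}_t \cdot h_{t-1}^\top, \quad \frac{\partial L}{\partial U_r} = \delta^{r}_t \cdot h_{t-1}^\top, \quad \frac{\partial L}{\partial U} = (\delta^{\tilde{h}}_t * r_t) \cdot h_{t-1}^\top$

$\frac{\partial L}{\partial b_{zh}} = \frac{\partial L}{\partial b_{zw}} = \delta^{z}_t, \quad \frac{\partial L}{\partial b_{rh}} = \frac{\partial L}{\partial b_{rw}} = \delta^{r}_t, \quad \frac{\partial L}{\partial b_{w}} = \delta^{\tilde{h}}_t, \quad \frac{\partial L}{\partial b_{h}} = \delta^{\tilde{h}}_t * r_t$

With a batch of inputs, the weight and bias gradients are summed over the samples in the batch.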


In [2]:
import torch


def gru_forward(
    input: torch.Tensor,  # (batch_size, input_size), dtype=torch.double
    hidden: torch.Tensor,  # (batch_size, hidden_size), dtype=torch.double
    weight_ih: torch.Tensor,  # (3 * hidden_size, input_size), dtype=torch.double
    weight_hh: torch.Tensor,  # (3 * hidden_size, hidden_size), dtype=torch.double
    bias_ih: torch.Tensor,  # (3 * hidden_size), dtype=torch.double
    bias_hh: torch.Tensor,  # (3 * hidden_size), dtype=torch.double
):
    ##############################################################################
    #                          IMPLEMENT YOUR CODE                               #
    ##############################################################################
    # Pre-activations for all three gates; rows are ordered (reset, update, new)
    gi = input @ weight_ih.t() + bias_ih  # (batch_size, 3 * hidden_size)
    gh = hidden @ weight_hh.t() + bias_hh  # (batch_size, 3 * hidden_size)
    i_r, i_z, i_n = gi.chunk(3, dim=1)
    h_r, h_z, h_n = gh.chunk(3, dim=1)
    # Reset Gate
    r_t = torch.sigmoid(i_r + h_r)
    # Update Gate
    z_t = torch.sigmoid(i_z + h_z)
    # Current Memory Content: the reset gate scales (U h + b_h), including the bias
    h_bar = torch.tanh(i_n + r_t * h_n)
    # Final Memory
    h_t = z_t * hidden + (1 - z_t) * h_bar
    ##############################################################################
    #                          END OF YOUR CODE                                  #
    ##############################################################################
    # output: torch.Tensor  # (batch_size, hidden_size)
    output = h_t
    return output


def gru_backward(
    grad_output: torch.Tensor,  # (batch_size, hidden_size), dtype=torch.double
    #
    input: torch.Tensor,  # (batch_size, input_size), dtype=torch.double
    hidden: torch.Tensor,  # (batch_size, hidden_size), dtype=torch.double
    weight_ih: torch.Tensor,  # (3 * hidden_size, input_size), dtype=torch.double
    weight_hh: torch.Tensor,  # (3 * hidden_size, hidden_size), dtype=torch.double
    bias_ih: torch.Tensor,  # (3 * hidden_size), dtype=torch.double
    bias_hh: torch.Tensor,  # (3 * hidden_size), dtype=torch.double
    # IMPORTANT!
    # The rows of weight_ih, weight_hh, bias_ih, and bias_hh (first dimension of
    # size 3 * hidden_size) are ordered as: reset, update, new (current).
):
    ##############################################################################
    #                          IMPLEMENT YOUR CODE                               #
    ##############################################################################
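    # Recompute the forward intermediates; rows are ordered (reset, update, new)
    gi = input @ weight_ih.t() + bias_ih  # (batch_size, 3 * hidden_size)
    gh = hidden @ weight_hh.t() + bias_hh  # (batch_size, 3 * hidden_size)
    i_r, i_z, i_n = gi.chunk(3, dim=1)
    h_r, h_z, h_n = gh.chunk(3, dim=1)
    r_t = torch.sigmoid(i_r + h_r)
    z_t = torch.sigmoid(i_z + h_z)
    h_bar = torch.tanh(i_n + r_t * h_n)
    # Backprop through the final memory: h_t = z_t * h_{t-1} + (1 - z_t) * h_bar
    grad_n_pre = grad_output * (1 - z_t) * (1 - h_bar**2)  # through tanh
    grad_z_pre = grad_output * (hidden - h_bar) * z_t * (1 - z_t)  # through sigmoid
    grad_r_pre = grad_n_pre * h_n * r_t * (1 - r_t)  # through sigmoid
    # Gradients w.r.t. the input-side and hidden-side pre-activations
    grad_gi = torch.cat([grad_r_pre, grad_z_pre, grad_n_pre], dim=1)
    grad_gh = torch.cat([grad_r_pre, grad_z_pre, grad_n_pre * r_t], dim=1)
    grad_hidden = grad_output * z_t + grad_gh @ weight_hh
    grad_weight_ih = grad_gi.t() @ input
    grad_weight_hh = grad_gh.t() @ hidden
    grad_bias_ih = grad_gi.sum(dim=0)
    grad_bias_hh = grad_gh.sum(dim=0)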
    ##############################################################################
    #                          END OF YOUR CODE                                  #
    ##############################################################################
    grad_hidden: torch.Tensor  # (batch_size, hidden_size)
    grad_weight_ih: torch.Tensor  # (3 * hidden_size, input_size)
    grad_weight_hh: torch.Tensor  # (3 * hidden_size, hidden_size)
    grad_bias_ih: torch.Tensor  # (3 * hidden_size)
    grad_bias_hh: torch.Tensor  # (3 * hidden_size)
    return grad_hidden, grad_weight_ih, grad_weight_hh, grad_bias_ih, grad_bias_hh

<font color='red'>Important!</font>  
Write the final result to `model_checkpoints/gru.py` and submit it.  
Errors resulting from modifications to any part of the script other than the function implementation sections will be considered as a failure to submit.
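
Before writing `model_checkpoints/gru.py`, it can help to compare your functions against PyTorch itself. The sketch below is one possible check, not part of the provided material: the helper name `gru_cell_reference` and the toy tensor sizes are made up for illustration. It loads the given weights into `torch.nn.GRUCell`, whose rows follow the same reset, update, new order, and uses autograd to produce reference gradients.

```python
import torch


def gru_cell_reference(input, hidden, weight_ih, weight_hh, bias_ih, bias_hh, grad_output):
    """Hypothetical helper: reference output and gradients from torch.nn.GRUCell."""
    cell = torch.nn.GRUCell(input.size(1), hidden.size(1)).double()
    with torch.no_grad():
        # Copy the given parameters into the built-in cell.
        cell.weight_ih.copy_(weight_ih)
        cell.weight_hh.copy_(weight_hh)
        cell.bias_ih.copy_(bias_ih)
        cell.bias_hh.copy_(bias_hh)
    hidden = hidden.detach().clone().requires_grad_(True)
    output = cell(input, hidden)
    # Same order as the return value of gru_backward.
    grads = torch.autograd.grad(
        output,
        (hidden, cell.weight_ih, cell.weight_hh, cell.bias_ih, cell.bias_hh),
        grad_outputs=grad_output,
    )
    return output.detach(), grads


torch.manual_seed(0)
batch_size, input_size, hidden_size = 4, 5, 3  # toy sizes, chosen arbitrarily
x = torch.randn(batch_size, input_size, dtype=torch.double)
h = torch.randn(batch_size, hidden_size, dtype=torch.double)
w_ih = torch.randn(3 * hidden_size, input_size, dtype=torch.double)
w_hh = torch.randn(3 * hidden_size, hidden_size, dtype=torch.double)
b_ih = torch.randn(3 * hidden_size, dtype=torch.double)
b_hh = torch.randn(3 * hidden_size, dtype=torch.double)
g = torch.randn(batch_size, hidden_size, dtype=torch.double)

ref_output, ref_grads = gru_cell_reference(x, h, w_ih, w_hh, b_ih, b_hh, g)

# With gru_forward and gru_backward defined, the comparison would look like:
# assert torch.allclose(gru_forward(x, h, w_ih, w_hh, b_ih, b_hh), ref_output)
# for mine, ref in zip(gru_backward(g, x, h, w_ih, w_hh, b_ih, b_hh), ref_grads):
#     assert torch.allclose(mine, ref)
```

Double precision keeps the comparison tight, which matches the `dtype=torch.double` annotations in the function signatures.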


## Assignment 2-1-2 | Training Multi-Layer RNN


Fortunately, the GRU is already implemented in PyTorch (see [torch.nn.GRUCell](https://pytorch.org/docs/stable/generated/torch.nn.GRUCell.html)). In this problem, we will use PyTorch's built-in GRU to build a model and train it on the IMDB dataset.


In this problem, you will implement and train a Gated Recurrent Unit (GRU) model for sentiment analysis using the IMDB dataset. Your task is to:

1. Implement the GRUModel class:

   - The class should inherit from nn.Module
   - Initialize the model with appropriate layers (embedding, GRU, and output layers)
   - Implement the forward pass

2. Set up the training process:

   - Choose appropriate hyperparameters (e.g., embedding dimension, hidden dimension, number of layers)
   - Initialize the model, loss function, and optimizer
   - Implement the training loop using the provided train() function

3. Evaluate the model:
   - Use the provided evaluate() function to test your model on the test dataset
   - Report the final training loss, test loss, and test accuracy

Your goal is to achieve the highest possible accuracy on the test set. Experiment with different hyperparameters and model architectures to improve your results.

Note: The data loading and preprocessing steps have been provided for you. Focus on implementing the model and training process.


In [None]:
from torch.utils.data import DataLoader

from datasets import load_dataset
from transformers import BertTokenizer


def load_data_and_tokenizer(max_length: int = 256):
    dataset = load_dataset("imdb")
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    def preprocess_function(examples):
        return tokenizer(
            examples["text"],
            padding="max_length",
            truncation=True,
            max_length=max_length,
        )

    # Tokenize once; the "label" column keeps its name because train() and
    # evaluate() read batch["label"].
    encoded_dataset = dataset.map(preprocess_function, batched=True)
    encoded_dataset.set_format(type="torch")

    return encoded_dataset, tokenizer


def get_dataloader(encoded_dataset, batch_size):
    train_dataloader = DataLoader(
        encoded_dataset["train"], shuffle=True, batch_size=batch_size, drop_last=True
    )
    test_dataloader = DataLoader(
        encoded_dataset["test"], batch_size=batch_size, drop_last=True
    )
    return train_dataloader, test_dataloader

In [7]:
from typing import Tuple
from tqdm import tqdm

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.optim import Optimizer


def train(
    model: nn.Module,
    dataloader: DataLoader,
    optimizer: Optimizer,
    criterion: nn.Module,
    device: torch.device,
) -> float:
    model.train()
    model.to(device)
    total_loss = 0
    tqdm_bar = tqdm(dataloader, desc="Training")
    for batch in tqdm_bar:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].float().to(device)
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        # iterating over tqdm_bar already advances the progress bar
        tqdm_bar.set_postfix(loss=loss.item())
    return total_loss / len(dataloader)


def evaluate(
    model: nn.Module,
    dataloader: DataLoader,
    criterion: nn.Module,
    device: torch.device,
) -> Tuple[float, float]:
    model.eval()
    model.to(device)
    total_loss = 0
    correct = 0
    total = 0
    tqdm_bar = tqdm(dataloader, desc="Evaluating")
    with torch.no_grad():
        for batch in tqdm_bar:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].float().to(device)
            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, labels)
            total_loss += loss.item()
            predictions = torch.round(torch.sigmoid(outputs))
            correct += (predictions == labels).sum().item()
            total += labels.numel()
            # iterating over tqdm_bar already advances the progress bar
            tqdm_bar.set_postfix(loss=loss.item())
    # divide by the samples actually evaluated: with drop_last=True the last
    # partial batch is skipped, so len(dataloader.dataset) would overcount
    accuracy = correct / total
    return total_loss / len(dataloader), accuracy
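
`train()` and `evaluate()` above expect the model to return raw logits: `nn.BCEWithLogitsLoss` applies the sigmoid internally, and `evaluate()` rounds the sigmoid output to get a 0/1 prediction. The following is a minimal sketch of that relationship; the logits and labels are made-up values for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical logits for four reviews and their true labels (illustrative values).
logits = torch.tensor([2.0, -1.0, 0.5, -3.0])
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])

# BCEWithLogitsLoss takes logits directly; it matches sigmoid followed by BCELoss.
loss_from_logits = nn.BCEWithLogitsLoss()(logits, labels)
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), labels)

# Rounding the sigmoid output thresholds the probability at 0.5, i.e. the logit at 0.
predictions = torch.round(torch.sigmoid(logits))
accuracy = (predictions == labels).float().mean().item()

print(f"loss={loss_from_logits.item():.4f}, accuracy={accuracy:.2f}")
```

Because the loss already contains the sigmoid, the model's final layer should not apply one itself.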

#### A. setup dataset


Load the encoded dataset and tokenizer.  
You can find the details of the code in `src/assign1/load_data.py`.


In [None]:
max_length = 256

encoded_dataset, tokenizer = load_data_and_tokenizer(max_length=max_length)

#### B. EDA of dataset


In [None]:
print(
    f"tokenizer vocab size : {tokenizer.vocab_size}\n"
    f"token of [PAD] : {tokenizer.pad_token_id}\n"
    f"token of [UNK] : {tokenizer.unk_token_id}\n"
    f"token of [CLS] : {tokenizer.cls_token_id}\n"
    f"token of [SEP] : {tokenizer.sep_token_id}\n"
    f"token of [MASK] : {tokenizer.mask_token_id}"
)

In [None]:
# check data
print(encoded_dataset)

In [None]:
# check data detail
print(encoded_dataset["train"][224])

#### C. setup dataloader and check validation


In [None]:
batch_size = 32

train_dataloader, test_dataloader = get_dataloader(
    encoded_dataset, batch_size=batch_size
)

print(f"num of train data batches : {len(train_dataloader)}")
print(f"num of test data batches : {len(test_dataloader)}")

for batch in train_dataloader:
    assert isinstance(batch, dict)
    assert "input_ids" in batch
    assert "attention_mask" in batch
    assert "label" in batch
    assert batch["input_ids"].shape[0] == batch["attention_mask"].shape[0] == batch_size
    assert batch["input_ids"].shape[-1] == max_length

for batch in test_dataloader:
    assert isinstance(batch, dict)
    assert "input_ids" in batch
    assert "attention_mask" in batch
    assert "label" in batch
    assert batch["input_ids"].shape[0] == batch["attention_mask"].shape[0] == batch_size
    assert batch["input_ids"].shape[-1] == max_length

#### D. Define Model


In [8]:
import torch

import torch.nn as nn
import torch.optim as optim

In [9]:
class GRUModel(nn.Module):
    def __init__(
        self,
        tokenizer,
        embed_dim,
        hidden_dim,
        num_layers,
        is_bidirectional,
        output_dim=1,
        dropout=0.2,
    ):
        super(GRUModel, self).__init__()
        self.is_bidirectional = is_bidirectional

        self.embedding = nn.Embedding(
            tokenizer.vocab_size, embed_dim, padding_idx=tokenizer.pad_token_id
        )
        self.gru = nn.GRU(
            embed_dim,
            hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout,
            bidirectional=is_bidirectional,
        )
        self.fc = nn.Linear(hidden_dim * (2 if is_bidirectional else 1), output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids, attention_mask):
        embedded = self.dropout(self.embedding(input_ids))
        lengths = attention_mask.sum(dim=1)
        packed_embedded = nn.utils.rnn.pack_padded_sequence(
            embedded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, hidden = self.gru(packed_embedded)

        if self.is_bidirectional:
            hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        else:
            hidden = hidden[-1, :, :]

        output = self.fc(hidden)
        # squeeze only the last dim so the batch dim survives when batch_size == 1
        return output.squeeze(-1)
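
The `forward` method above derives each sequence's true length from `attention_mask` and packs the batch before feeding it to the GRU, so the final hidden state is taken at the last real token instead of at a padding position. The sketch below illustrates the effect on a toy batch. The tensor sizes and random "embeddings" are assumptions for illustration, and it assumes right-side padding, as in the tokenized data here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
gru = nn.GRU(input_size=4, hidden_size=3, batch_first=True)

# Two toy sequences of embeddings padded to length 5; the true lengths are 5 and 2.
embedded = torch.randn(2, 5, 4)
attention_mask = torch.tensor([[1, 1, 1, 1, 1], [1, 1, 0, 0, 0]])
lengths = attention_mask.sum(dim=1)

packed = nn.utils.rnn.pack_padded_sequence(
    embedded, lengths.cpu(), batch_first=True, enforce_sorted=False
)
_, hidden_packed = gru(packed)  # (num_layers, batch_size, hidden_size)

# Reference: run the short sequence alone, without its padding positions.
_, hidden_short = gru(embedded[1:2, :2])

# Without packing, the GRU keeps updating its state on the padding positions.
_, hidden_padded = gru(embedded)

matches_unpadded = torch.allclose(hidden_packed[-1, 1], hidden_short[-1, 0], atol=1e-5)
differs_from_padded = not torch.allclose(hidden_packed[-1, 1], hidden_padded[-1, 1], atol=1e-5)
print(matches_unpadded, differs_from_padded)
```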

## Experiment with changing the parameters and submit best model

**Train at least 5 different models with varying parameter combinations and report the results.**  
To do so:

1. Modify the given parameters to create different model configurations.
2. Train at least 5 distinct models, each with a unique combination of these parameters.
3. Run experiments with these different models.
4. Collect and analyze the results from each experiment.
5. Prepare a short report that compares and contrasts the performance of these different model configurations at `model_checkpoints/assignment2-1-2/report.md`  
   !!Tip. Write it briefly. Length and content are not part of the grading score.!!
6. **Submit all trained models and configs, including the best-performing one**  
   <font color='red'>Scores will be assigned in order of the highest accuracy achieved, and a perfect score will be given for an accuracy of 88% or above.</font>

The goal is to understand how different parameter settings affect the model's performance on sentiment analysis with the IMDB dataset. This process, known as hyperparameter tuning, is a core part of machine learning research and development.
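
One way to organize the experiments is to enumerate configurations from a small search space instead of editing values by hand. The sketch below is only an illustration: `search_space`, `base`, and `make_configs` are hypothetical names, and the candidate values are arbitrary starting points, not recommended settings. Each resulting dictionary carries the required fields of the `ExperimentConfig` class in this notebook.

```python
from itertools import product

# Hypothetical search space: every combination of these values becomes one run.
search_space = {
    "hidden_dim": [128, 256],
    "num_layers": [1, 2],
    "is_bidirectional": [False, True],
}

# Settings shared by all runs (illustrative values).
base = {
    "seed": 1,
    "batch_size": 32,
    "embed_dim": 128,
    "output_dim": 1,
    "dropout": 0.2,
    "epochs": 10,
}


def make_configs(base, search_space):
    """Return one config dict per combination of the search-space values."""
    keys = list(search_space)
    configs = []
    for values in product(*(search_space[key] for key in keys)):
        overrides = dict(zip(keys, values))
        # Encode the varied settings in the name so saved checkpoints are distinguishable.
        name = "gru_" + "_".join(f"{key}{value}" for key, value in overrides.items())
        configs.append({"config_name": name, **base, **overrides})
    return configs


configs = make_configs(base, search_space)
for config in configs:
    print(config["config_name"])
```

Each dictionary could then be passed as `ExperimentConfig(**config)`, with `optimizer_params` supplied separately as in the cells below.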


In [10]:
# cpu or gpu
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# or if you have MPS
if torch.backends.mps.is_available():
    device = torch.device("mps")

In [28]:
from pydantic import BaseModel, Field
from typing import Optional


class ExperimentConfig(BaseModel):
    config_name: str  # feel free to name your model as you like, it will be used for saving model

    # about env
    seed: int  # for simplicity, we use common seed for all experiments

    # about dataset
    batch_size: int

    # about model
    embed_dim: int
    hidden_dim: int
    num_layers: int
    output_dim: int
    is_bidirectional: bool
    dropout: float

    # about optimizer
    optimizer: Optional[str] = Field(None)
    optimizer_params: dict = Field(default_factory=dict)
    epochs: int

In [31]:
OptimizerClass = torch.optim.Adam  # < ----- set this parameter
optimizer_params = {"lr": 1e-3}  # < ----- set this parameter

model_config = ExperimentConfig(
    config_name="gru",  # <---- will be used for saving model
    #
    seed=1,  # < ----- set this parameter
    #
    batch_size=32,  # < ----- set this parameter
    #
    embed_dim=128,  # < ----- set this parameter
    hidden_dim=128,  # < ----- set this parameter
    num_layers=2,  # < ----- set this parameter
    is_bidirectional=False,  # < ----- set this parameter
    dropout=0.2,  # < ----- set this parameter
    optimizer_params=optimizer_params,
    #
    epochs=10,  # < ----- set this parameter
    #
    output_dim=1,
)

In [None]:
torch.manual_seed(model_config.seed)
torch.cuda.manual_seed(model_config.seed)
torch.cuda.manual_seed_all(model_config.seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False


model = GRUModel(
    tokenizer=tokenizer,
    embed_dim=model_config.embed_dim,
    hidden_dim=model_config.hidden_dim,
    num_layers=model_config.num_layers,
    is_bidirectional=model_config.is_bidirectional,
    dropout=model_config.dropout,
)

criterion = nn.BCEWithLogitsLoss()
optimizer = OptimizerClass(model.parameters(), **model_config.optimizer_params)

# set optimizer name
model_config.optimizer = optimizer.__class__.__name__

train_dataloader, test_dataloader = get_dataloader(
    encoded_dataset, batch_size=model_config.batch_size
)

In [37]:
for _ in range(model_config.epochs):
    train_final_loss = train(model, train_dataloader, optimizer, criterion, device)
test_loss, test_accuracy = evaluate(model, test_dataloader, criterion, device)

print(
    f"train_final_loss : {train_final_loss}, test_loss : {test_loss}, test_accuracy : {test_accuracy}"
)
# ---- save model and config
from pathlib import Path

save_path = Path("./model_checkpoints/assignment2-1-2")
save_path.mkdir(parents=True, exist_ok=True)

save_dict = {
    "test_loss": test_loss,
    "test_accuracy": test_accuracy,
    "model_state_dict": model.state_dict(),
}

save_dict.update(model_config.dict())

model_path = (
    save_path
    / f"{model_config.config_name}_test_loss_{test_loss:.4f}_test_accuracy_{test_accuracy:.4f}.pth"
)
torch.save(save_dict, model_path)
print(f"model saved to {model_path}")
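
To reuse a checkpoint saved this way, load the dictionary, rebuild the model from the stored config fields, and restore the weights. The sketch below shows that round trip under simplifying assumptions: a small `nn.Linear` stands in for `GRUModel` so the example runs on its own, the file name `example.pth` is made up, and the metric values are placeholders, not real results.

```python
import tempfile
from pathlib import Path

import torch
import torch.nn as nn

# Stand-in for the trained GRUModel so the sketch is self-contained.
model = nn.Linear(4, 1)
save_dict = {
    "test_loss": 0.0,  # placeholder value, not a real result
    "test_accuracy": 0.0,  # placeholder value, not a real result
    "model_state_dict": model.state_dict(),
    "hidden_dim": 4,  # config fields are stored alongside the weights
}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "example.pth"
    torch.save(save_dict, model_path)
    # map_location="cpu" lets a checkpoint trained on GPU load on any machine.
    checkpoint = torch.load(model_path, map_location="cpu")

# Rebuild the model from the stored config, then restore its weights.
restored = nn.Linear(checkpoint["hidden_dim"], 1)
restored.load_state_dict(checkpoint["model_state_dict"])
restored.eval()  # disable dropout before evaluating
```

For the actual model, the same pattern would rebuild `GRUModel` from the saved `embed_dim`, `hidden_dim`, `num_layers`, `is_bidirectional`, and `dropout` fields.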

---


# Please provide your analysis on each hyper-parameter.


---
