# Assignment 1 - Autoregressive models with Transformers
## Generative AI Models 2024

#### Instructions on how to use this notebook:

This notebook is hosted on ``Google Colab``. To be able to work on it, you have to create your own copy. Go to *File* and select *Save a copy in Drive*.

You can also avoid using ``Colab`` entirely, and download the notebook to run it on your own machine. If you choose this, go to *File* and select *Download .ipynb*.

The advantage of using **Colab** is that you can use a GPU. You can complete this assignment with a CPU, but it will take a bit longer. Furthermore, we encourage you to train using the GPU not only for faster training, but also to get experience with this setting. This includes moving models and tensors to the GPU and back. This experience is very valuable because for various models and large datasets (like large CNNs for ImageNet, or Transformer models trained on Wikipedia), training on GPU is the only feasible way.

The default ``Colab`` runtime does not have a GPU. To change this, go to *Runtime - Change runtime type*, and select *GPU* as the hardware accelerator. The GPU you get depends on what resources are available at the time, and its memory can range from about 5GB to around 18GB if you are lucky. If you are curious, you can run the following in a code cell to check:

```sh
!nvidia-smi
```

Note that despite the name, ``Google Colab`` does not support collaborative work without issues. When two or more people edit the notebook concurrently, only one version will be saved. You can choose to do group programming with one person sharing the screen with the others, or make multiple copies of the notebook to work concurrently.

**Submission:** Please bring your (partial) solution to the instruction sessions, where you can discuss it with the instructors and your colleagues.

In [None]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


## Introduction

In this assignment, we are going to implement an autoregressive model (ARM). An ARM is a likelihood-based deep generative model that utilizes the product rule and generates new objects element by element. Transformers are the current state-of-the-art architectures used for Large Language Models (LLMs). Specifically, generative LLMs are parameterized by so-called decoder-transformers. The model used in this assignment is based on the architecture of so-called Generative Pretrained Transformers (GPTs):
- [Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I., 2018. Improving language understanding by generative pre-training.](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)

You can read more about ARMs in Chapter 2 of the following book:
- [Tomczak, J.M., "Deep Generative Modeling", Springer, 2022](https://link.springer.com/book/10.1007/978-3-030-93158-2)

You can read more about transformers in Chapter 12 of the following book:
- [Prince, S.J.D., "Understanding Deep Learning", MIT Press, 2023](https://udlbook.github.io/udlbook/)

In particular, the goals of this assignment are the following:

- Understand how transformer-based ARMs are formulated.
- Implement components of transformer-based ARMs using PyTorch.
- Train and evaluate a transformer-based ARM for text data.

This notebook is essential for preparing a report. Moreover, please remember to submit the final notebook together with the report (PDF).

### Theory behind ARMs

Let us consider a high-dimensional random variable $\mathbf{x} \in \mathcal{X}^{T}$ where $\mathcal{X} = \{0,1,\dots , L-1\}$ or $\mathcal{X} = \mathbb{R}$. Our goal is to model $p(\mathbf{x})$. We can apply the product rule to express this distribution as follows:
$$
p(\mathbf{x}) = p(x_1) \prod_{t=2}^{T} p(x_{t}|\mathbf{x}_{<t}) ,
$$
where $\mathbf{x}_{<t} = [x_1, x_2, \ldots , x_{t-1}]^{\top}$. For instance, for $\mathbf{x} = [x_1, x_2, x_{3}]^{\top}$, we have $p(\mathbf{x}) = p(x_1) p(x_{2}|x_{1}) p(x_{3} | x_{1}, x_{2})$.

The generative procedure is straightforward: We start with $x_1 \sim p(x_1)$, and then we proceed with $x_t \sim p(x_{t}|\mathbf{x}_{<t})$ by plugging in all previously sampled variables $\mathbf{x}_{<t}$. We can think of this procedure as a for-loop.
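
The for-loop view can be sketched in plain Python. The conditional below is a hypothetical hand-written rule over a three-symbol alphabet, not a trained model; only the loop structure mirrors the procedure described above.

```python
import random

VOCAB = [0, 1, 2]  # toy alphabet, i.e. L = 3

def toy_conditional(prefix):
    """Hypothetical p(x_t | x_<t): favours repeating the previous symbol."""
    if not prefix:                        # p(x_1): uniform
        return [1 / 3, 1 / 3, 1 / 3]
    probs = [0.2, 0.2, 0.2]
    probs[prefix[-1]] = 0.6
    return probs

def sample_sequence(T, rng):
    x = []
    for _ in range(T):
        probs = toy_conditional(x)        # condition on everything sampled so far
        x.append(rng.choices(VOCAB, weights=probs)[0])
    return x

rng = random.Random(0)
sample = sample_sequence(T=8, rng=rng)
print(sample)
```

In the assignment, the role of `toy_conditional` is played by the transformer, which outputs a categorical distribution over the next token given the prefix.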

Now, the main question is how to parameterize the conditional distributions $p(x_{t}|\mathbf{x}_{<t})$. We can accomplish that by using neural networks, in particular, transformers. In this assignment, we focus on <i>decoder transformers</i> that utilize causal multi-head self-attention.
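
To make "causal" concrete, here is a minimal sketch of a causally masked softmax over a matrix of attention scores, written in plain Python rather than PyTorch. The function name `causal_softmax` and the toy scores are illustrative only.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position t may only attend to positions <= t."""
    T = len(scores)
    weights = []
    for t in range(T):
        # Scores for future positions (s > t) are treated as -inf, i.e. weight 0.
        visible = scores[t][: t + 1]
        m = max(visible)                  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        z = sum(exps)
        weights.append([e / z for e in exps] + [0.0] * (T - t - 1))
    return weights

scores = [[0.0, 1.0, 2.0],
          [1.0, 1.0, 5.0],
          [0.5, 0.5, 0.5]]
attn = causal_softmax(scores)
for row in attn:
    print([round(w, 3) for w in row])
```

Each row sums to one and has zeros above the diagonal, so the representation at position $t$ never depends on later tokens.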

### Note

In this assignment, we build a simple LLM. For this purpose, we use a dataset consisting of $\sim 8.5$k newspaper headlines, where each headline contains at most 150 letters (tokens). You are provided with a tokenizer that turns characters into a sequence of integers and applies padding, and with text processing functions (e.g., for removing special characters). Your model will be trained with 1.3M tokens per iteration, and will consist of a few million to over a dozen million weights.

These numbers do not necessarily impress anyone in the LLM community. However, please be aware that such datasets and models are not small, and the task could be treated as a small-sized LLM problem. As you will notice in the end, we can still observe similar phenomena, such as hallucinations and the power of scaling up.

## IMPORTS

In [None]:
# DO NOT REMOVE!
import os

import pickle

import spacy
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import numpy as np

!pip install datasets
from datasets import load_dataset

import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

!pip install pytorch_model_summary
from pytorch_model_summary import summary

Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed dataset

In [None]:
# DO NOT REMOVE OR MODIFY
# Check if GPU is available and determine the device
if torch.cuda.is_available():
    device = 'cuda'
else:
    device = 'cpu'

print(f'The available device is {device}')

The available device is cpu


In [None]:
# DO NOT REMOVE! (unless you work locally)
# mount drive: WE NEED IT FOR SAVING IMAGES! NECESSARY FOR GOOGLE COLAB!
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# DO NOT REMOVE! (unless you work locally)
# PLEASE CHANGE IT TO YOUR OWN GOOGLE DRIVE OR YOUR LOCAL DIR!
results_model_dir = '/content/drive/My Drive/Results/'

## Auxiliary classes and functions

Let us define some useful classes:
1. DataProcessor: "cleaning" texts.
2. Tokenizer: transforming characters to integers and padding.
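
As a rough illustration of what the tokenizer does, the sketch below maps characters to integers with the same id layout as the `Tokenizer` class in this notebook (0 for padding, 1 to 26 for letters, 27 for a space, 28 for `cls`). The helper names `encode_one` and `decode_one` are hypothetical, and they handle a single string rather than a batch.

```python
import string

PAD, SPACE, CLS = 0, 27, 28
CHAR_TO_INT = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase)}
CHAR_TO_INT[' '] = SPACE
INT_TO_CHAR = {i: ch for ch, i in CHAR_TO_INT.items()}

def encode_one(text, max_length):
    """Prepend cls, map characters to ints, then pad (or truncate) to max_length + 1 ids."""
    ids = [CLS] + [CHAR_TO_INT[ch] for ch in text[:max_length]]
    return ids + [PAD] * (max_length + 1 - len(ids))

def decode_one(ids):
    """Stop at the first pad and skip cls."""
    chars = []
    for i in ids:
        if i == PAD:
            break
        if i != CLS:
            chars.append(INT_TO_CHAR[i])
    return ''.join(chars)

ids = encode_one('ab c', max_length=6)
print(ids)              # [28, 1, 2, 27, 3, 0, 0]
print(decode_one(ids))  # ab c
```

With `max_length=149`, as used later in the setup, every headline becomes a sequence of exactly 150 ids.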

In [None]:
# DO NOT REMOVE OR MODIFY
class DataProcessor(object):
    def __init__(self, ):
        super().__init__()
        nlp = spacy.load("en_core_web_sm")
        nltk.download('omw-1.4')
        nltk.download("punkt")
        nltk.download("wordnet")
        nltk.download("stopwords")

    @staticmethod
    def preprocess_text(text):
        # Tokenize, remove punctuation and lowercase
        tokens = nltk.word_tokenize(text)
        tokens = [word.lower() for word in tokens if word.isalpha()]

        # Remove stopwords and lemmatize
        stop_words = set(stopwords.words("english"))
        lemmatizer = WordNetLemmatizer()
        processed_text = [
            lemmatizer.lemmatize(word) for word in tokens if word not in stop_words
        ]

        return " ".join(processed_text)

    def process_batch(self, texts):
        return [self.preprocess_text(d) for d in texts]

In [None]:
# DO NOT REMOVE OR MODIFY
class Tokenizer(object):
    def __init__(self, max_length=0):
        super().__init__()

        self.max_length = max_length

        self.alphabet_letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

        self.alphabet = self.prepare_alphabet()
        self.decoded_alphabet = self.prepare_decoded_alphabet()

    def prepare_alphabet(self):
        # PREPARE THE ALPHABET (CHAR->INT)
        # as a dictionary
        alphabet = {}
        alphabet['pad'] = 0  # add 'pad'
        count = 1

        for letter in self.alphabet_letters:
            alphabet[letter] = count
            count += 1

        # add ' ', 'cls' tokens
        alphabet[' '] = count
        alphabet['cls'] = count + 1

        return alphabet

    def prepare_decoded_alphabet(self):
        # PREPARE DECODED ALPHABET (INT->CHAR)
        decoded_alphabet = {}
        decoded_alphabet[0] = 'pad'

        for i, letter in enumerate(self.alphabet_letters):
            decoded_alphabet[i+1] = letter

        # add ' ', 'cls' tokens (same ids as in prepare_alphabet)
        count = len(self.alphabet_letters) + 1
        decoded_alphabet[count] = ' '
        decoded_alphabet[count + 1] = 'cls'

        return decoded_alphabet

    def encode(self, texts):
        N = len(texts)

        if self.max_length == 0:
            max_length = 0
            for i in range(N):
                len_i = len(texts[i])
                if len_i > max_length:
                    max_length = len_i
        else:
            max_length = self.max_length

        tokens = np.zeros((N, max_length+1))

        for i in range(N):
            len_i = len(texts[i])
            for j in range(-1, max_length):
                if j == -1:
                    tokens[i,j+1] = self.alphabet['cls']
                elif j >= len_i:
                    tokens[i,j+1] = self.alphabet['pad']
                else:
                    if texts[i][j] == 'é':
                        tokens[i,j+1] = self.alphabet['e']
                    elif texts[i][j] == 'í':
                        tokens[i,j+1] = self.alphabet['i']
                    elif texts[i][j] == 'á':
                        tokens[i,j+1] = self.alphabet['a']
                    elif texts[i][j] == 'ó':
                        tokens[i,j+1] = self.alphabet['o']
                    elif texts[i][j] == 'æ':
                        tokens[i,j+1] = self.alphabet['a']
                    elif texts[i][j] == 'ä':
                        tokens[i,j+1] = self.alphabet['a']
                    else:
                        tokens[i,j+1] = self.alphabet[texts[i][j]]

        return tokens

    def decode(self, tokens):
        texts = []

        for i in range(len(tokens)):
            tokens_i = tokens[i,:]
            text_i = ''
            for j in range(len(tokens_i)):
                if tokens_i[j] == 0:
                    break
                else:
                    if self.decoded_alphabet[tokens_i[j]] != 'cls':
                        text_i += self.decoded_alphabet[tokens_i[j]]
            texts.append(text_i)

        return texts

Some useful functions:

In [None]:
# DO NOT REMOVE OR MODIFY
def save_texts(sampled_texts, name=''):
    # open file in write mode
    with open(results_dir + '/samples_' + name + '.txt', 'w') as fp:
        for item in sampled_texts:
            # write each item in a new line
            fp.write("%s\n" % item)

# Data

In [None]:
# DO NOT REMOVE OR MODIFY
class Headers(Dataset):
    """A simple dataset based on headers. Source: https://huggingface.co/datasets/IlyaGusev/headline_cause"""

    def __init__(self, dataprocessor, tokenizer, mode='train', num_training_data=None, transforms=None):
        # LOAD DATA
        dataset = load_dataset("IlyaGusev/headline_cause", "en_simple")

        # PREPARE DATA
        if mode == 'train':
            train_texts = dataprocessor.process_batch(dataset['train'][:]['left_title'] + dataset['train'][:]['right_title']) # list
            if num_training_data is None:
                self.data = torch.from_numpy(tokenizer.encode(train_texts)).long()
            else:
                self.data = torch.from_numpy(tokenizer.encode(train_texts))[:num_training_data].long()
        elif mode == 'val':
            validation_texts = dataprocessor.process_batch(dataset['validation'][:]['left_title'] + dataset['validation'][:]['right_title']) # list
            self.data = torch.from_numpy(tokenizer.encode(validation_texts)).long()
        else:
            test_texts = dataprocessor.process_batch(dataset['test'][:]['left_title'] + dataset['test'][:]['right_title']) # list
            self.data = torch.from_numpy(tokenizer.encode(test_texts)).long()

        self.transforms = transforms

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        if self.transforms:
            sample = self.transforms(sample)
        return sample

## Implementing ARMs with Transformers


### Loss Function (NLL)
Our loss function is the negative log-likelihood for the categorical distribution (i.e., the cross-entropy loss).

Please note how it is implemented and how tokens (T) are handled.
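
The following sketch, in plain Python with made-up numbers, shows the two things worth noticing. First, the prediction made at position $t$ is scored against the token at position $t+1$ (in the notebook this shift is applied in the model's `forward`, not in the loss class itself). Second, the `mean` reduction averages per-sequence sums over the batch rather than averaging over tokens. The function `sequence_nll` is hypothetical and not part of the notebook.

```python
import math

def sequence_nll(log_probs, tokens):
    """Sum of -log p(x_{t+1} | x_{<=t}) over one sequence.

    log_probs[t][v] is the log-probability assigned at position t to value v
    for the *next* token; the last position has no target and is dropped.
    """
    preds = log_probs[:-1]     # positions 0 .. T-2
    targets = tokens[1:]       # tokens    1 .. T-1
    return -sum(lp[v] for lp, v in zip(preds, targets))

# Toy batch: 2 sequences, T = 3 tokens, V = 2 values, uniform predictions.
uniform = [math.log(0.5), math.log(0.5)]
batch_log_probs = [[uniform] * 3, [uniform] * 3]
batch_tokens = [[1, 0, 1], [1, 1, 0]]

per_seq = [sequence_nll(lp, x) for lp, x in zip(batch_log_probs, batch_tokens)]
loss_sum = sum(per_seq)                  # reduction='sum'
loss_mean = sum(per_seq) / len(per_seq)  # reduction='mean': average over sequences
print(loss_sum, loss_mean)
```

With uniform predictions over two values, each of the two scored tokens costs $\log 2$, so every sequence contributes $2 \log 2$.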

In [None]:
# DO NOT REMOVE OR MODIFY
class LossFun(nn.Module):
    def __init__(self,):
        super().__init__()

        self.loss = nn.NLLLoss(reduction='none')

    def forward(self, y_model, y_true, reduction='sum'):
        # y_model: B(atch) x T(okens) x V(alues)
        # y_true: B x T
        B, T, V = y_model.size()

        y_model = y_model.view(B * T, V)
        y_true = y_true.view(B * T,)

        loss_matrix = self.loss(y_model, y_true) # B*T

        if reduction == 'sum':
            return torch.sum(loss_matrix)
        elif reduction == 'mean':
            loss_matrix = loss_matrix.view(B, T)
            return torch.mean(torch.sum(loss_matrix, 1))
        else:
            raise ValueError('Reduction could be either `sum` or `mean`.')

### Transformer block

Transformers consist of transformer blocks. In the cell below, please define a transformer block.
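
The `TransformerBlock` in this notebook uses a post-norm layout: each sub-layer's output is added to its input and the sum is layer-normalized. Below is a minimal plain-Python sketch of that one step for a single token vector. It omits the learnable scale and shift of `nn.LayerNorm`, and the `sublayer` passed in is a made-up stand-in for attention or the MLP.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)   # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_step(x, sublayer):
    """Post-norm residual connection: LayerNorm(x + sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

x = [1.0, 2.0, 3.0, 4.0]
out = residual_step(x, sublayer=lambda v: [0.5 * a for a in v])
print([round(o, 3) for o in out])
```

A full block applies this step twice, once around self-attention and once around the MLP.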

In [None]:
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, num_emb, num_heads=8):
        super().__init__()

        # hyperparams
        self.D = num_emb
        self.H = num_heads

        # weights for self-attention
        self.w_k = nn.Linear(self.D, self.D * self.H)
        self.w_q = nn.Linear(self.D, self.D * self.H)
        self.w_v = nn.Linear(self.D, self.D * self.H)

        # weights for a combination of multiple heads
        self.w_c = nn.Linear(self.D * self.H, self.D)

    def forward(self, x, causal=True):
        # x: B(atch) x T(okens) x D(imensionality)
        B, T, D = x.size()

        # keys, queries, values
        k = self.w_k(x).view(B, T, self.H, D) # B x T x H x D
        q = self.w_q(x).view(B, T, self.H, D) # B x T x H x D
        v = self.w_v(x).view(B, T, self.H, D) # B x T x H x D

        k = k.transpose(1, 2).contiguous().view(B * self.H, T, D) # B*H x T x D
        q = q.transpose(1, 2).contiguous().view(B * self.H, T, D) # B*H x T x D
        v = v.transpose(1, 2).contiguous().view(B * self.H, T, D) # B*H x T x D

        k = k / (D**0.25) # scaling
        q = q / (D**0.25) # scaling

        # kq
        kq = torch.bmm(q, k.transpose(1, 2)) # B*H x T x T

        # if causal
        if causal:
            mask = torch.triu_indices(T, T, offset=1)
            kq[..., mask[0], mask[1]] = float('-inf')

        # softmax
        skq = F.softmax(kq, dim=2)

        # self-attention
        sa = torch.bmm(skq, v) # B*H x T x D
        sa = sa.view(B, self.H, T, D) # B x H x T x D
        sa = sa.transpose(1, 2) # B x T x H x D
        sa = sa.contiguous().view(B, T, D * self.H) # B x T x D*H

        out = self.w_c(sa) # B x T x D

        return out

In [None]:
# YOUR CODE GOES HERE
# NOTE: The class must contain the following elements:
# (i) components (nn.Module) of a transformer block
# (ii) the forward function
# Moreover, forward must return the processed input

class TransformerBlock(nn.Module):
    def __init__(self, num_emb, num_neurons, num_heads=4):
        super().__init__()

        # hyperparams
        self.D = num_emb
        self.H = num_heads
        self.neurons = num_neurons

        # components
        self.msha = MultiHeadSelfAttention(num_emb=self.D, num_heads=self.H)
        self.layer_norm1 = nn.LayerNorm(self.D)
        self.layer_norm2 = nn.LayerNorm(self.D)

        self.mlp = nn.Sequential(nn.Linear(self.D, self.neurons * self.D),
                                nn.GELU(),
                                nn.Linear(self.neurons * self.D, self.D))

    def forward(self, x, causal=True):
        # Multi-Head Self-Attention
        x_attn = self.msha(x, causal)
        # LayerNorm
        x = self.layer_norm1(x_attn + x)
        # MLP
        x_mlp = self.mlp(x)
        # LayerNorm
        x = self.layer_norm2(x_mlp + x)

        return x

### ARM (Decoder-Transformer)

Once we have a class for transformer blocks, we can define a decoder-transformer that serves as our autoregressive model.

In [None]:
# DO NOT REMOVE OR MODIFY
class DecoderTransformer(nn.Module):
    def __init__(self, num_tokens, num_token_vals, num_emb, num_neurons, num_heads=2, dropout_prob=0.1, num_blocks=10, device='cpu'):
        super().__init__()

        # hyperparams
        self.device = device
        self.num_tokens = num_tokens
        self.num_token_vals = num_token_vals
        self.num_emb = num_emb
        self.num_blocks = num_blocks

        # embedding layer
        self.embedding = torch.nn.Embedding(num_token_vals, num_emb)

        # positional embedding
        self.positional_embedding = nn.Embedding(num_tokens, num_emb)

        # transformer blocks
        self.transformer_blocks = nn.ModuleList()
        for _ in range(num_blocks):
            self.transformer_blocks.append(TransformerBlock(num_emb=num_emb, num_neurons=num_neurons, num_heads=num_heads))

        # output layer (logits + softmax)
        self.logits = nn.Sequential(nn.Linear(num_emb, num_token_vals))

        # dropout layer
        self.dropout = nn.Dropout(dropout_prob)

        # loss function
        self.loss_fun = LossFun()

    def transformer_forward(self, x, causal=True, temperature=1.0):
        # x: B(atch) x T(okens)
        # embedding of tokens
        x = self.embedding(x) # B x T x D
        # embedding of positions
        pos = torch.arange(0, x.shape[1], dtype=torch.long).unsqueeze(0).to(self.device)
        pos_emb = self.positional_embedding(pos)
        # dropout of embedding of inputs
        x = self.dropout(x + pos_emb)

        # transformer blocks
        for i in range(self.num_blocks):
            x = self.transformer_blocks[i](x, causal)  # pass the causal flag on to each block

        # output logits
        out = self.logits(x)

        return F.log_softmax(out/temperature, 2)

    @torch.no_grad()
    def sample(self, batch_size=4, temperature=1.0):
        x_seq = np.asarray([[self.num_token_vals - 1] for i in range(batch_size)])

        # sample next tokens
        for i in range(self.num_tokens-1):
            xx = torch.tensor(x_seq, dtype=torch.long, device=self.device)
            # process x and calculate log_softmax
            x_log_probs = self.transformer_forward(xx, temperature=temperature)
            # sample i-th tokens
            x_i_sample = torch.multinomial(torch.exp(x_log_probs[:,i]), 1).to(self.device)
            # update the batch with new samples
            x_seq = np.concatenate((x_seq, x_i_sample.to('cpu').detach().numpy()), 1)

        return x_seq

    @torch.no_grad()
    def top1_rec(self, x, causal=True):
        x_prob = torch.exp(self.transformer_forward(x, causal=causal))[:,:-1,:].contiguous()
        _, x_rec_max = torch.max(x_prob, dim=2)
        # fraction of correctly predicted next tokens per sequence, summed over the batch
        return torch.sum(torch.mean((x_rec_max == x[:,1:]).float(), 1))

    def forward(self, x, causal=True, temperature=1.0, reduction='mean'):
        # get log-probabilities
        log_prob = self.transformer_forward(x, causal=causal, temperature=temperature)

        return self.loss_fun(log_prob[:,:-1].contiguous(), x[:,1:].contiguous(), reduction=reduction)
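
`transformer_forward` divides the logits by `temperature` before the log-softmax. The plain-Python sketch below, with made-up logits, illustrates the effect: a temperature below 1 sharpens the distribution, and a temperature above 1 flattens it. The helper `softmax_with_temperature` is illustrative only.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]
for temp in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

When sampling, lower temperatures therefore tend to give more repetitive but more plausible text, while higher temperatures give more varied output.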

### Evaluation and training functions

**Please DO NOT remove or modify them.**

In [None]:
# DO NOT REMOVE OR MODIFY
def evaluation(test_loader, name=None, model_best=None, epoch=None, device='cuda'):
    # EVALUATION
    if model_best is None:
        # load best performing model
        model_best = torch.load(name + '.model').to(device)

    model_best.eval()
    loss = 0.
    rec = 0.
    N = 0.
    for indx_batch, test_batch in enumerate(test_loader):
        loss_t = model_best.forward(test_batch.to(device), reduction='sum')
        loss = loss + loss_t.item()

        rec_t = model_best.top1_rec(test_batch.to(device))
        rec = rec + rec_t.item()

        N = N + test_batch.shape[0]
    loss = loss / N
    rec = rec / N

    if epoch is None:
        print(f'FINAL LOSS: nll={loss}, rec={rec}')
    else:
        print(f'Epoch: {epoch}, val nll={loss}, val rec={rec}')

    return loss, rec

def plot_curve(name, nll_val, ylabel='nll'):
    plt.plot(np.arange(len(nll_val)), nll_val, linewidth='3')
    plt.xlabel('epochs')
    plt.ylabel(ylabel)
    plt.savefig(name + '_' + ylabel + '_val_curve.pdf', bbox_inches='tight')
    plt.close()

In [None]:
# DO NOT REMOVE OR MODIFY
def training(name, max_patience, num_epochs, model, optimizer, training_loader, val_loader, device='cuda'):
    nll_val = []
    rec_val = []
    best_nll = 1000.
    patience = 0

    # Main loop
    for e in range(num_epochs):
        # TRAINING
        model.train()
        for indx_batch, batch in enumerate(training_loader):
            loss = model.forward(batch.to(device))

            optimizer.zero_grad()
            loss.backward(retain_graph=True)
            optimizer.step()

        # Validation
        loss_val, r_val = evaluation(val_loader, model_best=model, epoch=e, device=device)
        nll_val.append(loss_val)  # save for plotting
        rec_val.append(r_val)

        if e == 0:
            print('saved!')
            torch.save(model, name + '.model')
            best_nll = loss_val

            sampled_tokens = model.sample(batch_size=64, temperature=1.0)
            sampled_texts = tokenizer.decode(sampled_tokens)
            save_texts(sampled_texts, name='epoch_' + str(e))

        else:
            if loss_val < best_nll:
                print('saved!')
                torch.save(model, name + '.model')
                best_nll = loss_val
                patience = 0

                sampled_tokens = model.sample(batch_size=64, temperature=1.0)
                sampled_texts = tokenizer.decode(sampled_tokens)
                save_texts(sampled_texts, name='epoch_' + str(e))
            else:
                patience = patience + 1

        if patience > max_patience:
            break

    nll_val = np.asarray(nll_val)
    rec_val = np.asarray(rec_val)

    np.save(name + '_nll_val.npy', nll_val)
    np.save(name + '_rec_val.npy', rec_val)

    return nll_val, rec_val

### Setup

**NOTE: *Please comment your code! Especially if you introduce any new variables (e.g., hyperparameters).***

In [None]:
# DO NOT REMOVE OR MODIFY
dataprocessor = DataProcessor()
tokenizer = Tokenizer(max_length=149)

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# PLEASE MODIFY ACCORDING TO THE REPORT REQUIREMENTS
num_training_data = None  # None to take all training data

# DO NOT REMOVE OR MODIFY THE REST OF THIS CELL
#-dataset
train_dataset = Headers(dataprocessor, tokenizer, num_training_data=num_training_data, mode="train")
validation_dataset = Headers(dataprocessor, tokenizer, mode="val")
test_dataset = Headers(dataprocessor, tokenizer, mode="test")

#-dataloaders
BATCH_SIZE = 32

training_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(validation_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading data:   0%|          | 0.00/1.13M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/148k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/145k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/4332 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/542 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/542 [00:00<?, ? examples/s]

# **1.** Model with $<100$k weights

In [None]:
# DO NOT REMOVE (but you can modify if necessary)
#-creating a dir for saving results
name = 'arm_transformer_1'  # NOTE: if you run multiple experiments, you would overwrite results. Please modify this part if necessary.
results_dir = results_model_dir + name + '/'
if not(os.path.exists(results_dir)):
  os.mkdir(results_dir)

In the next cell, please initialize the model. Please remember to comment your code!

In [None]:
# DO NOT REMOVE but PLEASE MODIFY WHENEVER YOU ARE ASKED FOR IT!
# NOTE: in order to obtain required sizes of your models, you can play with
#       various values of num_neurons, num_heads, num_blocks, num_emb
num_tokens = 150 # do not modify!
num_token_vals = 29  # do not modify!
num_neurons = 10 # please modify it
num_heads = 5 # please modify it
num_blocks = 5 # please modify it
num_emb = num_heads * 4  # please modify it but it must be a multiple of num_heads
causal=True # do not modify!

lr = 1e-3 # learning rate; do not modify!
num_epochs = 1000 # max. number of epochs; do not modify!
max_patience = 10 # early stopping: training stops once the validation loss has not improved for more than max_patience epochs; do not modify!

In [None]:
# DO NOT REMOVE OR MODIFY
model = DecoderTransformer(num_tokens=num_tokens, num_token_vals=num_token_vals, num_emb=num_emb, num_neurons=num_neurons, num_heads=num_heads, num_blocks=num_blocks, device=device)
model = model.to(device)
# Print the summary (like in Keras)
print(summary(model, torch.zeros(1, num_tokens, dtype=torch.long).to(device), show_input=False, show_hierarchical=False))

--------------------------------------------------------------------------
         Layer (type)        Output Shape         Param #     Tr. Param #
          Embedding-1        [1, 150, 20]             580             580
          Embedding-2        [1, 150, 20]           3,000           3,000
            Dropout-3        [1, 150, 20]               0               0
   TransformerBlock-4        [1, 150, 20]          16,620          16,620
   TransformerBlock-5        [1, 150, 20]          16,620          16,620
   TransformerBlock-6        [1, 150, 20]          16,620          16,620
   TransformerBlock-7        [1, 150, 20]          16,620          16,620
   TransformerBlock-8        [1, 150, 20]          16,620          16,620
             Linear-9        [1, 150, 29]             609             609
           LossFun-10                  []               0               0
Total params: 87,289
Trainable params: 87,289
Non-trainable params: 0
-----------------------------------------
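
To hit a target size without instantiating models by trial and error, the parameter count can be computed by hand. The sketch below assumes the architecture exactly as defined in this notebook: each head uses the full embedding width, the MLP hidden width is `num_neurons * num_emb`, and every linear layer has a bias. `count_params` is a hypothetical helper, not part of the assignment code.

```python
def count_params(num_emb, num_neurons, num_heads, num_blocks,
                 num_tokens=150, num_token_vals=29):
    D, H = num_emb, num_heads

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weights + biases

    attention = 3 * linear(D, D * H) + linear(D * H, D)  # w_k, w_q, w_v, w_c
    layer_norms = 2 * (2 * D)                             # scale and shift, twice
    mlp = linear(D, num_neurons * D) + linear(num_neurons * D, D)
    block = attention + layer_norms + mlp

    embeddings = num_token_vals * D + num_tokens * D     # token + positional
    output = linear(D, num_token_vals)
    return embeddings + num_blocks * block + output

print(count_params(num_emb=20, num_neurons=10, num_heads=5, num_blocks=5))  # 87289
print(count_params(num_emb=30, num_neurons=50, num_heads=5, num_blocks=5))  # 556919
```

The first call reproduces the total in the summary above; the second uses the settings of the $\sim 500$k model in the next part of the notebook.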

Please initialize the optimizer.

In [None]:
# DO NOT REMOVE OR MODIFY
optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad == True], lr=lr)

## Training and final evaluation

In the following two cells, we run the training and the final evaluation.
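
The early-stopping rule inside `training` can be isolated as follows. This is a simplified sketch that operates on a precomputed list of validation losses (the numbers are made up), not a replacement for the training function; `stopping_epoch` is a hypothetical helper.

```python
def stopping_epoch(val_losses, max_patience):
    """Return (epoch at which training stops, best loss seen)."""
    best = float('inf')
    patience = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss        # a checkpoint would be saved here
            patience = 0
        else:
            patience += 1
        if patience > max_patience:
            return epoch, best
    return len(val_losses) - 1, best

losses = [5.0, 4.0, 4.2, 4.1, 3.9, 4.0, 4.0, 4.0]
print(stopping_epoch(losses, max_patience=2))  # (7, 3.9)
```

Note that the counter is reset whenever a new best loss is found, so training stops only after `max_patience + 1` consecutive epochs without improvement.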

In [None]:
# DO NOT REMOVE OR MODIFY
# Training procedure
nll_val, rec_val = training(name=results_dir + name, max_patience=max_patience, num_epochs=num_epochs, model=model, optimizer=optimizer, training_loader=training_loader, val_loader=val_loader, device=device)

Epoch: 0, val nll=150.11425533505823, val rec=0.7034721075388778
saved!
Epoch: 1, val nll=141.23683194656653, val rec=0.7145979342865328
saved!
Epoch: 2, val nll=136.89613055127134, val rec=0.721488896331224
saved!
Epoch: 3, val nll=134.1142120924383, val rec=0.7267701001184893
saved!
Epoch: 4, val nll=131.06994809083832, val rec=0.7328252282089853
saved!
Epoch: 5, val nll=127.96223117504613, val rec=0.7388741767714384
saved!
Epoch: 6, val nll=125.42319322424181, val rec=0.7439820247382696
saved!
Epoch: 7, val nll=123.44612690267527, val rec=0.7489350959383694
saved!
Epoch: 8, val nll=121.70628689136012, val rec=0.7520679125486704
saved!
Epoch: 9, val nll=120.37328133107991, val rec=0.7548849697042656
saved!
Epoch: 10, val nll=118.73346803109145, val rec=0.7579125284708734
saved!
Epoch: 11, val nll=117.44606339007726, val rec=0.7613920637602296
saved!
Epoch: 12, val nll=116.15168655226591, val rec=0.7635156835577145
saved!
Epoch: 13, val nll=114.76345453579047, val rec=0.76641324173480

In [None]:
# DO NOT REMOVE OR MODIFY
# Final evaluation
test_loss, test_rec = evaluation(name=results_dir + name, test_loader=test_loader, device=device)

with open(results_dir + name + '_test_loss.txt', "w") as f:
    f.write('Test NLL: ' + str(test_loss)+'\n'+'Test REC: ' + str(test_rec))

plot_curve(results_dir + name, nll_val, ylabel='nll')
plot_curve(results_dir + name, rec_val, ylabel='rec')

FINAL LOSS: nll=98.86395286194073, rec=0.8040441780512623
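
The reported NLL is a per-sequence sum, so it can be easier to interpret per token. The sketch below assumes 149 predicted positions per sequence (150 tokens minus the initial `cls`), padding included, which is how `forward` computes the loss. The conversion to perplexity is a standard one and is not part of the notebook.

```python
import math

test_nll = 98.86395286194073   # per-sequence test NLL printed above
num_predicted = 150 - 1        # every token except the first is predicted

nll_per_token = test_nll / num_predicted
perplexity = math.exp(nll_per_token)
print(f'nll/token = {nll_per_token:.4f} nats, perplexity = {perplexity:.3f}')
```

Because padded positions are easy to predict and are counted too, the per-token value is optimistic relative to a score over actual characters only.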


# **2.** Model with $\sim 500$k weights

In [None]:
# DO NOT REMOVE (but you can modify if necessary)
#-creating a dir for saving results
name = 'arm_transformer_2'  # NOTE: if you run multiple experiments, you would overwrite results. Please modify this part if necessary.
results_dir = results_model_dir + name + '/'
if not(os.path.exists(results_dir)):
  os.mkdir(results_dir)

In [None]:
# DO NOT REMOVE but PLEASE MODIFY WHENEVER YOU ARE ASKED FOR IT!
# NOTE: in order to obtain required sizes of your models, you can play with
#       various values of num_neurons, num_heads, num_blocks, num_emb
num_tokens = 150 # do not modify!
num_token_vals = 29  # do not modify!
num_neurons = 50 # please modify it
num_heads = 5 # please modify it
num_blocks = 5 # please modify it
num_emb = num_heads * 6  # please modify it but it must be a multiple of num_heads
causal=True # do not modify!

lr = 1e-3 # learning rate; do not modify!
num_epochs = 1000 # max. number of epochs; do not modify!
max_patience = 10 # early stopping: training stops once the validation loss has not improved for more than max_patience epochs; do not modify!

In [None]:
# DO NOT REMOVE OR MODIFY
model = DecoderTransformer(num_tokens=num_tokens, num_token_vals=num_token_vals, num_emb=num_emb, num_neurons=num_neurons, num_heads=num_heads, num_blocks=num_blocks, device=device)
model = model.to(device)
# Print the summary (like in Keras)
print(summary(model, torch.zeros(1, num_tokens, dtype=torch.long).to(device), show_input=False, show_hierarchical=False))

--------------------------------------------------------------------------
         Layer (type)        Output Shape         Param #     Tr. Param #
          Embedding-1        [1, 150, 30]             870             870
          Embedding-2        [1, 150, 30]           4,500           4,500
            Dropout-3        [1, 150, 30]               0               0
   TransformerBlock-4        [1, 150, 30]         110,130         110,130
   TransformerBlock-5        [1, 150, 30]         110,130         110,130
   TransformerBlock-6        [1, 150, 30]         110,130         110,130
   TransformerBlock-7        [1, 150, 30]         110,130         110,130
   TransformerBlock-8        [1, 150, 30]         110,130         110,130
             Linear-9        [1, 150, 29]             899             899
           LossFun-10                  []               0               0
Total params: 556,919
Trainable params: 556,919
Non-trainable params: 0
---------------------------------------
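
The rows of the summary above can be checked by hand, which helps when tuning `num_emb` and `num_blocks` toward a target size. The sketch below is not part of the assignment code: the function names are hypothetical, and the formulas for the two embeddings and the output layer are inferred from the table (870 = 29 × 30, 4,500 = 150 × 30, 899 = 30 × 29 + 29). The per-block count is copied from the table rather than derived, because it depends on how `TransformerBlock` is implemented.

```python
def embedding_and_head_params(num_tokens, num_token_vals, num_emb):
    """Parameter counts of the layers outside the transformer blocks."""
    token_emb = num_token_vals * num_emb               # one vector per token value
    pos_emb = num_tokens * num_emb                     # one vector per position
    head = num_emb * num_token_vals + num_token_vals   # linear weights + bias
    return token_emb, pos_emb, head


def total_params(num_tokens, num_token_vals, num_emb, num_blocks, params_per_block):
    """Total size, given the per-block count read from the summary table."""
    outside_blocks = sum(embedding_and_head_params(num_tokens, num_token_vals, num_emb))
    return outside_blocks + num_blocks * params_per_block


# Values of the ~500k configuration; 110_130 is the TransformerBlock row above.
print(embedding_and_head_params(150, 29, 30))
print(total_params(150, 29, 30, num_blocks=5, params_per_block=110_130))
```

Almost all weights sit in the transformer blocks, so `num_neurons`, `num_emb` and `num_blocks` are the settings that move the total.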

In [None]:
# DO NOT REMOVE OR MODIFY
optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad == True], lr=lr)

## Training and final evaluation

In [None]:
# DO NOT REMOVE OR MODIFY
# Training procedure
nll_val, rec_val = training(name=results_dir + name, max_patience=max_patience, num_epochs=num_epochs, model=model, optimizer=optimizer, training_loader=training_loader, val_loader=val_loader, device=device)

Epoch: 0, val nll=142.6662045862402, val rec=0.7140283303067253
saved!
Epoch: 1, val nll=134.27733721856262, val rec=0.7241945125959897
saved!
Epoch: 2, val nll=128.8114979874164, val rec=0.7351717526622364
saved!
Epoch: 3, val nll=122.59004149489738, val rec=0.7472882042071916
saved!
Epoch: 4, val nll=116.89857533585102, val rec=0.7598566111603346
saved!
Epoch: 5, val nll=113.38544557015395, val rec=0.7684254188819125
saved!
Epoch: 6, val nll=108.52875318914322, val rec=0.7786844718060371
saved!
Epoch: 7, val nll=105.47002993650543, val rec=0.7844795652861085
saved!
Epoch: 8, val nll=101.99314649518566, val rec=0.7927821412737519
saved!
Epoch: 9, val nll=99.62399100553506, val rec=0.7980200144637555
saved!
Epoch: 10, val nll=98.82028614902848, val rec=0.7991963611757623
saved!
Epoch: 11, val nll=96.96204874524331, val rec=0.8037408146031229
saved!
Epoch: 12, val nll=94.83151774388837, val rec=0.8063721287294507
saved!
Epoch: 13, val nll=93.53866948764703, val rec=0.8089043874142354
sa
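
The log prints `saved!` after every epoch in which the validation NLL improved, and `max_patience` bounds how long training continues without such an improvement. The following is a minimal sketch of that kind of patience-based early stopping, replayed on a list of validation losses. It is an illustration only: the function name is hypothetical, and the stopping condition (`patience > max_patience`) is an assumption about how the `training` function counts epochs.

```python
def replay_early_stopping(val_losses, max_patience):
    """Replay validation losses and report (best_epoch, last_epoch_run)."""
    best_loss = float("inf")
    best_epoch = -1
    patience = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: this is where a checkpoint would be saved.
            best_loss, best_epoch = loss, epoch
            patience = 0
        else:
            patience += 1
        # Assumed rule: stop once more than max_patience epochs pass without improvement.
        if patience > max_patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Hypothetical loss curve: improves until epoch 2, then stalls.
losses = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 3.3]
print(replay_early_stopping(losses, max_patience=2))
```

Because the checkpoint is only written on improvement, the saved model corresponds to the best validation epoch and not to the last epoch that was run.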

In [None]:
# DO NOT REMOVE OR MODIFY
# Final evaluation
test_loss, test_rec = evaluation(name=results_dir + name, test_loader=test_loader, device=device)

with open(results_dir + name + '_test_loss.txt', "w") as f:
    f.write('Test NLL: ' + str(test_loss)+'\n'+'Test REC: ' + str(test_rec))
    f.close()

plot_curve(results_dir + name, nll_val, ylabel='nll')
plot_curve(results_dir + name, rec_val, ylabel='rec')

FINAL LOSS: nll=90.63749689193669, rec=0.8175536838404807


# **3.** Model with $\sim$5M weights

In [None]:
# DO NOT REMOVE (but you can modify if necessary)
#-creating a dir for saving results
name = 'arm_transformer_3'  # NOTE: running several experiments under the same name overwrites earlier results; change the name if necessary.
results_dir = results_model_dir + name + '/'
if not(os.path.exists(results_dir)):
  os.mkdir(results_dir)

In [None]:
# DO NOT REMOVE but PLEASE MODIFY WHENEVER YOU ARE ASKED FOR IT!
# NOTE: in order to obtain required sizes of your models, you can play with
#       various values of num_neurons, num_heads, num_blocks, num_emb
num_tokens = 150 # do not modify!
num_token_vals = 29  # do not modify!
num_neurons = 170 # please modify it
num_heads = 6 # please modify it
num_blocks = 6 # please modify it
num_emb = num_heads * 8  # please modify it but it must be a multiple of num_heads
causal=True # do not modify!

lr = 1e-3 # learning rate; do not modify!
num_epochs = 1000 # max. number of epochs; do not modify!
max_patience = 10 # early stopping is used: if validation performance doesn't improve for longer than max_patience epochs, training is stopped; do not modify!

In [None]:
# DO NOT REMOVE OR MODIFY
model = DecoderTransformer(num_tokens=num_tokens, num_token_vals=num_token_vals, num_emb=num_emb, num_neurons=num_neurons, num_heads=num_heads, num_blocks=num_blocks, device=device)
model = model.to(device)
# Print the summary (like in Keras)
print(summary(model, torch.zeros(1, num_tokens, dtype=torch.long).to(device), show_input=False, show_hierarchical=False))

--------------------------------------------------------------------------
         Layer (type)        Output Shape         Param #     Tr. Param #
          Embedding-1        [1, 150, 48]           1,392           1,392
          Embedding-2        [1, 150, 48]           7,200           7,200
            Dropout-3        [1, 150, 48]               0               0
   TransformerBlock-4        [1, 150, 48]         847,968         847,968
   TransformerBlock-5        [1, 150, 48]         847,968         847,968
   TransformerBlock-6        [1, 150, 48]         847,968         847,968
   TransformerBlock-7        [1, 150, 48]         847,968         847,968
   TransformerBlock-8        [1, 150, 48]         847,968         847,968
   TransformerBlock-9        [1, 150, 48]         847,968         847,968
            Linear-10        [1, 150, 29]           1,421           1,421
           LossFun-11                  []               0               0
Total params: 5,097,821
Trainable par

In [None]:
# DO NOT REMOVE OR MODIFY
optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad == True], lr=lr)

## Training and final evaluation

In [None]:
# DO NOT REMOVE OR MODIFY
# Training procedure
nll_val, rec_val = training(name=results_dir + name, max_patience=max_patience, num_epochs=num_epochs, model=model, optimizer=optimizer, training_loader=training_loader, val_loader=val_loader, device=device)

Epoch: 0, val nll=156.06555896491582, val rec=0.6951633326681778
saved!
Epoch: 1, val nll=146.16275170808348, val rec=0.7100844576789884
saved!
Epoch: 2, val nll=140.99412666123732, val rec=0.7150313299960316
saved!
Epoch: 3, val nll=137.11826023228494, val rec=0.7219284786069525
saved!
Epoch: 4, val nll=134.49906217568034, val rec=0.7260085735813718
saved!
Epoch: 5, val nll=131.0279286514789, val rec=0.7324909002578567
saved!
Epoch: 6, val nll=127.3363843404059, val rec=0.738880370375855
saved!
Epoch: 7, val nll=122.00376413493139, val rec=0.7497337770637991
saved!
Epoch: 8, val nll=116.59743418112892, val rec=0.7616273450675486
saved!
Epoch: 9, val nll=110.69199272803274, val rec=0.7749572891150893
saved!
Epoch: 10, val nll=103.61478911114794, val rec=0.7881324581554455
saved!
Epoch: 11, val nll=100.03460152826626, val rec=0.7962431080666855
saved!
Epoch: 12, val nll=96.06888346654462, val rec=0.8037903423238945
saved!
Epoch: 13, val nll=93.9127611674066, val rec=0.8083100160549488
s

In [None]:
# DO NOT REMOVE OR MODIFY
# Final evaluation
test_loss, test_rec = evaluation(name=results_dir + name, test_loader=test_loader, device=device)

with open(results_dir + name + '_test_loss.txt', "w") as f:
    f.write('Test NLL: ' + str(test_loss)+'\n'+'Test REC: ' + str(test_rec))
    f.close()

plot_curve(results_dir + name, nll_val, ylabel='nll')
plot_curve(results_dir + name, rec_val, ylabel='rec')

FINAL LOSS: nll=89.12901796010148, rec=0.8210208266423638


# **4.** Model with $>10$M weights

In [None]:
# DO NOT REMOVE (but you can modify if necessary)
#-creating a dir for saving results
name = 'arm_transformer_4'  # NOTE: running several experiments under the same name overwrites earlier results; change the name if necessary.
results_dir = results_model_dir + name + '/'
if not(os.path.exists(results_dir)):
  os.mkdir(results_dir)

In [None]:
# DO NOT REMOVE but PLEASE MODIFY WHENEVER YOU ARE ASKED FOR IT!
# NOTE: in order to obtain required sizes of your models, you can play with
#       various values of num_neurons, num_heads, num_blocks, num_emb
num_tokens = 150 # do not modify!
num_token_vals = 29  # do not modify!
num_neurons = 220 # please modify it
num_heads = 6 # please modify it
num_blocks = 6 # please modify it
num_emb = num_heads * 10  # please modify it but it must be a multiple of num_heads
causal=True # do not modify!

lr = 1e-3 # learning rate; do not modify!
num_epochs = 1000 # max. number of epochs; do not modify!
max_patience = 10 # early stopping is used: if validation performance doesn't improve for longer than max_patience epochs, training is stopped; do not modify!

In [None]:
# DO NOT REMOVE OR MODIFY
model = DecoderTransformer(num_tokens=num_tokens, num_token_vals=num_token_vals, num_emb=num_emb, num_neurons=num_neurons, num_heads=num_heads, num_blocks=num_blocks, device=device)
model = model.to(device)
# Print the summary (like in Keras)
print(summary(model, torch.zeros(1, num_tokens, dtype=torch.long).to(device), show_input=False, show_hierarchical=False))

--------------------------------------------------------------------------
         Layer (type)        Output Shape         Param #     Tr. Param #
          Embedding-1        [1, 150, 60]           1,740           1,740
          Embedding-2        [1, 150, 60]           9,000           9,000
            Dropout-3        [1, 150, 60]               0               0
   TransformerBlock-4        [1, 150, 60]       1,685,040       1,685,040
   TransformerBlock-5        [1, 150, 60]       1,685,040       1,685,040
   TransformerBlock-6        [1, 150, 60]       1,685,040       1,685,040
   TransformerBlock-7        [1, 150, 60]       1,685,040       1,685,040
   TransformerBlock-8        [1, 150, 60]       1,685,040       1,685,040
   TransformerBlock-9        [1, 150, 60]       1,685,040       1,685,040
            Linear-10        [1, 150, 29]           1,769           1,769
           LossFun-11                  []               0               0
Total params: 10,122,749
Trainable pa

In [None]:
# DO NOT REMOVE OR MODIFY
optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad == True], lr=lr)

## Training and final evaluation

In [None]:
# DO NOT REMOVE OR MODIFY
# Training procedure
nll_val, rec_val = training(name=results_dir + name, max_patience=max_patience, num_epochs=num_epochs, model=model, optimizer=optimizer, training_loader=training_loader, val_loader=val_loader, device=device)

KeyboardInterrupt: 

In [None]:
# DO NOT REMOVE OR MODIFY
# Final evaluation
test_loss, test_rec = evaluation(name=results_dir + name, test_loader=test_loader, device=device)

with open(results_dir + name + '_test_loss.txt', "w") as f:
    f.write('Test NLL: ' + str(test_loss)+'\n'+'Test REC: ' + str(test_rec))
    f.close()

plot_curve(results_dir + name, nll_val, ylabel='nll')
plot_curve(results_dir + name, rec_val, ylabel='rec')

## Final sampled texts

In [None]:
# DO NOT REMOVE
# Sample texts: load best model
names = ['arm_transformer_1', 'arm_transformer_2', 'arm_transformer_3','arm_transformer_4']

# sample
temperature = 1.0 # you can modify it
num_samples = 64 # you can modify it

for name in names:
    results_dir = results_model_dir + name + '/'
    model_best = torch.load(results_dir + name + '.model')
    model_best = model_best.eval()

    sampled_tokens = model_best.sample(batch_size=num_samples, temperature=temperature)  # do not modify
    sampled_texts = tokenizer.decode(sampled_tokens)  # do not modify

    save_texts(sampled_texts, name='FINAL_' + str(temperature))

# **5.** Model with $>10$M weights and 1000 training data points

In [None]:
# PLEASE MODIFY ACCORDING TO THE REPORT REQUIREMENTS
num_training_data = 1000  # None to take all training data

# DO NOT REMOVE OR MODIFY THE REST OF THIS CELL
#-dataset
train_dataset = Headers(dataprocessor, tokenizer, num_training_data=num_training_data, mode="train")
validation_dataset = Headers(dataprocessor, tokenizer, mode="val")
test_dataset = Headers(dataprocessor, tokenizer, mode="test")

#-dataloaders
BATCH_SIZE = 32

training_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(validation_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

In [None]:
# DO NOT REMOVE (but you can modify if necessary)
#-creating a dir for saving results
name = 'arm_transformer_1000'  # NOTE: running several experiments under the same name overwrites earlier results; change the name if necessary.
results_dir = results_model_dir + name + '/'
if not(os.path.exists(results_dir)):
  os.mkdir(results_dir)

In [None]:
# DO NOT REMOVE but PLEASE MODIFY WHENEVER YOU ARE ASKED FOR IT!
# NOTE: in order to obtain required sizes of your models, you can play with
#       various values of num_neurons, num_heads, num_blocks, num_emb
num_tokens = 150 # do not modify!
num_token_vals = 29  # do not modify!
num_neurons = 50 # please modify it
num_heads = 5 # please modify it
num_blocks = 5 # please modify it
num_emb = num_heads * 6  # please modify it but it must be a multiple of num_heads
causal=True # do not modify!

lr = 1e-3 # learning rate; do not modify!
num_epochs = 1000 # max. number of epochs; do not modify!
max_patience = 10 # early stopping is used: if validation performance doesn't improve for longer than max_patience epochs, training is stopped; do not modify!

In [None]:
# DO NOT REMOVE OR MODIFY
model = DecoderTransformer(num_tokens=num_tokens, num_token_vals=num_token_vals, num_emb=num_emb, num_neurons=num_neurons, num_heads=num_heads, num_blocks=num_blocks, device=device)
model = model.to(device)
# Print the summary (like in Keras)
print(summary(model, torch.zeros(1, num_tokens, dtype=torch.long).to(device), show_input=False, show_hierarchical=False))

In [None]:
# DO NOT REMOVE OR MODIFY
optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad == True], lr=lr)

In [None]:
# DO NOT REMOVE OR MODIFY
# Training procedure
nll_val, rec_val = training(name=results_dir + name, max_patience=max_patience, num_epochs=num_epochs, model=model, optimizer=optimizer, training_loader=training_loader, val_loader=val_loader, device=device)

In [None]:
# DO NOT REMOVE OR MODIFY
# Final evaluation
test_loss, test_rec = evaluation(name=results_dir + name, test_loader=test_loader, device=device)

with open(results_dir + name + '_test_loss.txt', "w") as f:
    f.write('Test NLL: ' + str(test_loss)+'\n'+'Test REC: ' + str(test_rec))
    f.close()

plot_curve(results_dir + name, nll_val, ylabel='nll')
plot_curve(results_dir + name, rec_val, ylabel='rec')

## Final sampled texts

In [None]:
# DO NOT REMOVE
# Sample texts: load best model
model_best = torch.load(results_dir + name + '.model')
model_best = model_best.eval()

# sample
temperature = 1.0 # you can modify it
num_samples = 64 # you can modify it

sampled_tokens = model_best.sample(batch_size=num_samples, temperature=temperature)  # do not modify
sampled_texts = tokenizer.decode(sampled_tokens)  # do not modify

save_texts(sampled_texts, name='FINAL_' + str(temperature))

# Best Model

In [None]:
# DO NOT REMOVE
# Sample texts: load best model
name = 'arm_transformer_2'
results_dir = results_model_dir + name + '/'
# sample
temperatures = [0.01, 0.1, 0.5, 0.8, 1.0] # you can modify it
num_samples = 64 # you can modify it

# load the best model once; it does not change between temperatures
model_best = torch.load(results_dir + name + '.model')
model_best = model_best.eval()

for temperature in temperatures:
  sampled_tokens = model_best.sample(batch_size=num_samples, temperature=temperature)  # do not modify
  sampled_texts = tokenizer.decode(sampled_tokens)  # do not modify

  save_texts(sampled_texts, name='FINAL_' + str(temperature))
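
To get a feel for what the temperature values do, the sketch below applies a temperature-scaled softmax to a small, made-up logit vector. It assumes the common convention of dividing the logits by the temperature before the softmax; the notebook's `sample` method may implement temperature differently. The function name and the logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [value / temperature for value in logits]
    largest = max(scaled)                      # subtract the max to avoid overflow
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical logits for three token values.
logits = [2.0, 1.0, 0.0]
for temperature in [0.01, 0.1, 0.5, 0.8, 1.0]:
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Under this convention, temperatures close to zero put nearly all probability on the most likely token, so samples become almost deterministic, while a temperature of 1.0 samples from the model's unmodified distribution.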