# Chapter 7: Transformers

## Introduction

In this chapter we cover transformers.

Transformers are currently the state of the art for natural language processing (NLP) tasks. For example, large language models are essentially very large transformers with some additional tricks.

## Self-attention

The main idea behind transformers is the self-attention operation.

This operation takes in a sequence of $n$-dimensional vectors $x_1, \dots, x_k$ and outputs another sequence of $n$-dimensional vectors of the same length $y_1, \dots, y_k.$ In NLP tasks these vectors represent the tokens of the text.

## Self-attention

The output vector $y_i$ is obtained as a weighted sum of input vectors:
$$
  y_i = \sum_{j} w_{ij}x_j.
$$

These $w_{ij}$ are not learned weights of the model but are instead computed from the inputs as
$$
  w'_{ij} = x_ix_j^T, \ w_{ij} = \frac{e^{w'_{ij}}}{\sum_{l} e^{w'_{il}}}.
$$

## Self-attention

That is, to compute $w_{ij}$ we first compute the Euclidean inner product of $x_i$ with each input vector $x_j$ (including $x_i$ itself) and then apply softmax over $j.$ The softmax normalizes the inner products so that the weights $w_{ij}$ are positive and sum to one for each $i.$

For now we have no weights in self-attention. We will discuss how to add them later.
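
As a rough illustration, the weight-free operation described above can be sketched in PyTorch as follows. The function name `basic_self_attention` and the random toy input are assumptions made for this example; the rows of `x` play the role of the vectors $x_i.$

```python
import torch

def basic_self_attention(x):
  # x has shape (k, n): k input vectors of dimension n, one per row
  raw = x @ x.T                  # raw[i, j] = inner product of x_i and x_j
  w = torch.softmax(raw, dim=1)  # normalize each row so that its weights sum to 1
  return w @ x, w                # y_i = sum_j w[i, j] * x_j

torch.manual_seed(0)
x = torch.randn(5, 3)            # toy input: k = 5 vectors of dimension n = 3
y, w = basic_self_attention(x)
print(y.shape)                   # same shape as the input
print(w.sum(dim=1))              # the weights in each row sum to 1
```

Note that nothing in this function is learned: the output depends only on the input vectors.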

## Self-attention

To understand what problem self-attention tries to solve, consider the following paragraph:

"The cat lived in a barn. It liked to chase mice."

To parse this paragraph you need to know what "It" in the second sentence refers to. To do this you need to look at all the nouns in the preceding sentence and pick one based on context.

The idea is that when trying to understand what specific words in a sentence mean we need to pay special attention to other words.

## Self-attention

This is exactly what self-attention allows the model to do. When parsing input vector $x_i$ it allows the model to focus on all other input vectors that are relevant when parsing $x_i$.

The inner product measures how "related" the two vectors are. The meaning of related depends on the modeling task.

To produce the output $y_i$ we first measure how related $x_i$ is to all the input vectors $x_j.$ We then compute the weighted sum of the $x_j$ based on this relevance to obtain $y_i.$

## Self-attention

To get the self-attention layer used in actual transformers we need three additions.

## Self-attention

1. Queries, keys and values:

In self-attention each input vector $x_i$ is used in three distinct ways:

- It is compared to every other input vector to get the $w_{ij}$ used to compute its own output $y_i.$
- It is compared to every other input vector to get the $w_{ji}$ used to compute outputs for all other input vectors $x_j.$
- It is used in the weighted sum to get each output vector $y_j.$

## Self-attention

These three distinct roles are called query, key and value. In actual self-attention three different $n$ by $n$ matrices $Q$, $K$ and $V$ are used to preprocess the $x_i$ before computing $y_i.$ This gives extra flexibility and also gives the model weights to learn.

So the formulas now are:
$$
  q_i = x_iQ, \ k_i = x_iK, \ v_i = x_iV,
$$
$$
  w'_{ij} = q_ik_j^T, \ w_{ij} = \text{softmax}_j(w'_{ij}),
$$
$$
  y_i = \sum_{j}w_{ij}v_j.
$$
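
The formulas above can be sketched in a few lines of PyTorch. This is only an illustration: the matrices are random stand-ins for learned weights, the variable names are chosen for this example, and the vectors are stored as rows, so the projections are written as `x @ Q`.

```python
import torch

torch.manual_seed(0)
seq_len, n = 5, 4                      # sequence length and vector dimension
x = torch.randn(seq_len, n)            # rows are the input vectors x_i

# Random stand-ins for the weight matrices; in a real model these are learned
Q, K, V = torch.randn(n, n), torch.randn(n, n), torch.randn(n, n)

q, k, v = x @ Q, x @ K, x @ V          # queries, keys and values, one per row
w = torch.softmax(q @ k.T, dim=1)      # w[i, j]: softmax over j of q_i . k_j
y = w @ v                              # y_i = sum_j w[i, j] * v_j
print(y.shape)
```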

## Self-attention

2. Normalizing the inner product:

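The inner product used to compute $w'_{ij}$ is normalized as
$$
w'_{ij} = \frac{q_ik_j^T}{\sqrt{n}},
$$
where $n$ is the dimension of $x_i.$ Without this scaling the inner products tend to grow with $n,$ which pushes softmax into regions where its gradients are very small.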
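
One way to get a feeling for this scaling is a small numerical experiment. The sketch below assumes, as an idealization, that the entries of the queries and keys are independent standard normal numbers; under that assumption the spread of the raw inner products grows roughly like $\sqrt{n},$ while the scaled ones stay close to one.

```python
import torch

torch.manual_seed(0)
for n in (4, 64, 1024):
  q = torch.randn(10000, n)    # random queries with unit-variance entries
  k = torch.randn(10000, n)    # random keys with unit-variance entries
  dots = (q * k).sum(dim=1)    # 10000 sample inner products
  # Standard deviation before and after dividing by sqrt(n)
  print(n, dots.std().item(), (dots / n ** 0.5).std().item())
```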

## Self-attention

3. Multi-head attention:

A single self-attention operation can learn only one relationship between the inputs. However, it is reasonable to assume that there is more than one relationship. To learn them we would have to apply several self-attention operations to the input, which increases the model size considerably.

It turns out we can apply several self-attention operations without increasing the size of the model. This is called multi-head attention.

## Self-attention

For concreteness, suppose the dimension of our input vectors $x_i$ is $40.$ To apply multi-head attention with $4$ "heads" we proceed as follows.
First, we are going to have four separate query, key and value matrices:
$$
  Q_{r}, K_r, V_r, \ r=1,2,3,4.
$$
These matrices are not going to be square but instead $40$ by $10.$

## Self-attention

We apply these matrices to our input vectors to project them into four sets of $10$ dimensional query, key and value vectors. We then apply self-attention to each of these sets.

We then concatenate the outputs to obtain $40$ dimensional vectors $y_i.$ Lastly, for the model to be able to learn a proper embedding (instead of just a concatenation) we multiply each vector by a $40$ by $40$ matrix $W.$
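
The procedure for the $40$-dimensional example with $4$ heads can be sketched as follows. The weights are random stand-ins for learned parameters and the variable names are chosen for this illustration. Scaling the scores by the square root of the per-head dimension is an assumption here (it is the common choice); the last line checks that the per-head projections together have as many entries as three $40$ by $40$ matrices.

```python
import torch

torch.manual_seed(0)
seq_len, dim, heads = 6, 40, 4
head_dim = dim // heads                  # 10, as in the example above

x = torch.randn(seq_len, dim)

# Random stand-ins: one 40 x 10 query, key and value matrix per head
Q = torch.randn(heads, dim, head_dim)
K = torch.randn(heads, dim, head_dim)
V = torch.randn(heads, dim, head_dim)
W = torch.randn(dim, dim)                # final 40 x 40 matrix

outputs = []
for r in range(heads):
  q, k, v = x @ Q[r], x @ K[r], x @ V[r]               # each has shape (seq_len, 10)
  w = torch.softmax(q @ k.T / head_dim ** 0.5, dim=1)  # scaled attention weights
  outputs.append(w @ v)

y = torch.cat(outputs, dim=1) @ W        # concatenate the heads, then apply W
print(y.shape)

# Projection parameters: 4 heads x 3 matrices x (40 x 10) entries vs. 3 x (40 x 40)
print(Q.numel() + K.numel() + V.numel(), 3 * dim * dim)
```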

## Self-attention

Here is the diagram of the whole process (it was taken from this [blogpost](https://peterbloem.nl/blog/transformers)):

![](../images/multi-head.png){fig-align="center"}

## Transformers

The output of an attention layer is then usually fed to a feedforward layer, with a residual connection and layer normalization around each of the two, to obtain a transformer block. Models are called transformers if they are built up from transformer blocks.

Let's build our own transformer block using `PyTorch`.

This is the transformer block used in the [paper](https://arxiv.org/abs/1706.03762) that introduced the transformer architecture.

## Transformers

In [1]:
#| output-location: slide
import torch
from torch import nn

class TransformerBlock(nn.Module):
  def __init__(self, embed_dim, no_heads, attn_mask, fc_dim=2048):
    super().__init__()
    self.mask = attn_mask
    self.attention = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=no_heads, batch_first=True)
    self.norm1 = nn.LayerNorm(embed_dim)
    self.fc = nn.Sequential(
      nn.Linear(embed_dim, fc_dim),
      nn.ReLU(),
      nn.Linear(fc_dim, embed_dim)
    )
    self.norm2 = nn.LayerNorm(embed_dim)

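  def forward(self, x):
    # Masked self-attention: queries, keys and values all come from x
    att, _ = self.attention(x, x, x, is_causal=True, attn_mask=self.mask.to(x.device))
    # Residual connection followed by layer normalization
    x = self.norm1(x + att)
    # Feedforward layer, again with a residual connection and normalization
    x = self.norm2(x + self.fc(x))
    return x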

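# The last argument is the attention mask, here a causal mask for sequences of length 4
print(TransformerBlock(256, 4, nn.Transformer.generate_square_subsequent_mask(4)))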

TransformerBlock(
  (attention): MultiheadAttention(
    (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
  )
  (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
  (fc): Sequential(
    (0): Linear(in_features=256, out_features=2048, bias=True)
    (1): ReLU()
    (2): Linear(in_features=2048, out_features=256, bias=True)
  )
  (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)


## Transformers

Here is a diagram for our transformer block:

![](../images/transformer_block.png){fig-align="center"}

## Transformers

Self-attention layers are permutation equivariant, i.e. if we shuffle the order of the inputs $x_i$ the outputs $y_i$ remain the same except that they are shuffled in the same way. For NLP, we would like our model to be sensitive to word order. To achieve this we are going to add a positional embedding to each token embedding.
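
This behaviour under shuffling can be checked numerically. The sketch below uses a small, randomly initialized `nn.MultiheadAttention` layer without a mask; the sizes are arbitrary choices for this illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
attention = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
attention.eval()

x = torch.randn(1, 5, 8)       # one sequence of 5 vectors of dimension 8
perm = torch.randperm(5)       # a random reordering of the 5 positions
x_perm = x[:, perm]

with torch.no_grad():
  y, _ = attention(x, x, x)
  y_perm, _ = attention(x_perm, x_perm, x_perm)

# Shuffling the inputs shuffles the outputs in the same way
print(torch.allclose(y[:, perm], y_perm, atol=1e-5))
```

Without positional information the layer therefore cannot tell two orderings of the same tokens apart.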

Let's build a transformer model.

## Transformers

In [2]:
#| output-location: slide
class Transformer(nn.Module):
  def __init__(self, no_tokens, seq_length, pad_idx, embed_dim, no_heads, depth):
    super().__init__()
    mask = nn.Transformer.generate_square_subsequent_mask(seq_length)

    self.token_embedding = nn.Embedding(embedding_dim=embed_dim, num_embeddings=no_tokens)
    self.pos_embedding = nn.Embedding(embedding_dim=embed_dim, num_embeddings=seq_length)

    tblocks = []
    for i in range(depth):
      tblocks.append(TransformerBlock(embed_dim=embed_dim, no_heads=no_heads, attn_mask=mask))
    self.tblocks = nn.Sequential(*tblocks)

    self.toprobs = nn.Linear(embed_dim, no_tokens)

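  def forward(self, x):
    # x has shape (batch, sequence) and contains token indices
    tokens = self.token_embedding(x)
    b, n, e = tokens.size()
    # Embed the positions 0, ..., n-1 and repeat them for every sequence in the batch
    positions = self.pos_embedding(torch.arange(n, device=tokens.device))[None, :, :].expand(b, n, e)
    x = tokens + positions
    x = self.tblocks(x)
    # Unnormalized scores (logits) over the tokens for every position
    return self.toprobs(x)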

print(Transformer(10, 10, 0, 10, 5, 3))

Transformer(
  (token_embedding): Embedding(10, 10)
  (pos_embedding): Embedding(10, 10)
  (tblocks): Sequential(
    (0): TransformerBlock(
      (attention): MultiheadAttention(
        (out_proj): NonDynamicallyQuantizableLinear(in_features=10, out_features=10, bias=True)
      )
      (norm1): LayerNorm((10,), eps=1e-05, elementwise_affine=True)
      (fc): Sequential(
        (0): Linear(in_features=10, out_features=2048, bias=True)
        (1): ReLU()
        (2): Linear(in_features=2048, out_features=10, bias=True)
      )
      (norm2): LayerNorm((10,), eps=1e-05, elementwise_affine=True)
    )
    (1): TransformerBlock(
      (attention): MultiheadAttention(
        (out_proj): NonDynamicallyQuantizableLinear(in_features=10, out_features=10, bias=True)
      )
      (norm1): LayerNorm((10,), eps=1e-05, elementwise_affine=True)
      (fc): Sequential(
        (0): Linear(in_features=10, out_features=2048, bias=True)
        (1): ReLU()
        (2): Linear(in_features=2048, out_

## Transformers

Let's apply our transformer to text generation. Let's generate text the way large language models do it.

Given a sequence of tokens our model is going to predict the next token. We can use this to generate text token by token by feeding the model its output as input.

## Transformers

To speed up training we are actually going to ask the model to predict the whole input sequence shifted by one position, i.e. the target at position $i$ is the input token at position $i+1.$ The model is still learning to predict the next token, but now every position in the sequence contributes to training instead of just the last one.
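
To make the shifted targets concrete, here is a small sketch in plain Python. The name `jonas` and the token strings `<eos>` and `<pad>` are illustrative assumptions for this example.

```python
name = "jonas"
tokens = list(name) + ["<eos>"]     # input sequence, ending with an end-of-sequence token
targets = tokens[1:] + ["<pad>"]    # the same sequence moved one step ahead

for position, target in enumerate(targets):
  # At each position the model sees the tokens so far and should predict `target`
  print(tokens[:position + 1], "->", target)
```

A single name thus gives one training example per position.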

## Transformers

Since we are giving the whole sequence to the model we need to make sure that it is not able to peek ahead when predicting the next token. This is done by adding a mask to the raw attention scores $w'_{ij}$ before the softmax, which sets the weights $w_{ij}$ with $j > i$ to zero. Then, when computing the output of the attention layer the model is only able to use the current and previous tokens.
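
The effect of such a mask can be seen on a toy example. The sketch below uses random numbers as a stand-in for the raw attention scores; `nn.Transformer.generate_square_subsequent_mask` returns a matrix with zeros on and below the diagonal and minus infinity above it.

```python
import torch
from torch import nn

# Additive mask: 0 on and below the diagonal, -inf above it
mask = nn.Transformer.generate_square_subsequent_mask(4)
print(mask)

torch.manual_seed(0)
scores = torch.randn(4, 4)                 # stand-in for the raw scores w'_ij
w = torch.softmax(scores + mask, dim=1)    # entries with -inf get weight 0
print(w)                                   # row i only has weights for positions j <= i
```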

## Transformers

This idea is summarized in the following diagram, which was taken from this [blogpost](https://peterbloem.nl/blog/transformers).

![](../images/masked-attention.svg){fig-align="center"}

## Transformers

We have already added this mask when building our transformer model.

For the actual task, we are going to feed a list of Lithuanian names to the model and ask it to generate more names letter by letter.

Let's build a dataset class that is going to return tokenized names and also the same names shifted by one letter, which serve as the training targets.

## Transformers

In [4]:
import numpy as np
import torch
import pandas as pd
from torch.utils.data import Dataset, DataLoader

class CharLevelTokenizerLt:
  def __init__(self):
    self.i2t = [
      '<pad>', '<eos>',
      'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
      'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
      'r', 's', 't', 'u', 'v', 'y', 'z', 'č',
      'ė', 'š', 'ū', 'ž',
      '<unk>' # Stands in for any character that is not listed above
    ]
    self.t2i = {token:index for index, token in enumerate(self.i2t)}
    self.no_tokens = len(self.i2t)
    self.pad_idx = 0
    self.eos_idx = 1
    self.unk_idx = self.t2i['<unk>']

  def string_to_idx(self, string, seq_length=None):
    tokens = [token for token in string]
    return self.tokens_to_idx(tokens, seq_length)

  def tokens_to_idx(self, tokens, seq_length=None):
    idxs = [self.t2i[token] if token in self.t2i else self.unk_idx for token in tokens]
    idxs = [*idxs, self.eos_idx]
    if seq_length is not None:
      idxs = idxs + [self.pad_idx] * (seq_length - len(idxs))
      idxs = idxs[:seq_length]
    return idxs

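  def idx_to_string(self, indices):
    tokens = self.idx_to_tokens(indices)
    # Tokens are single characters, so join them without a separator
    return ''.join(tokens)

  def idx_to_tokens(self, indices):
    # Skip the padding and end-of-sequence tokens
    return [self.i2t[idx] for idx in indices if idx != self.pad_idx and idx != self.eos_idx]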

class NameData(Dataset):
  def __init__(self, seq_length):
    self.tokenizer = CharLevelTokenizerLt()
    self.data = pd.read_csv("../data/names/vardai.csv")

    self.no_tokens = self.tokenizer.no_tokens
    self.pad_idx = self.tokenizer.pad_idx
    self.eos_idx = self.tokenizer.eos_idx
    self.seq_length = seq_length

  def __len__(self):
    return len(self.data)

  def __getitem__(self, idx):
    name = self.data.loc[idx]["name"]

    inp  = self.tokenizer.tokens_to_idx(name[:], self.seq_length)
    target = self.tokenizer.tokens_to_idx(name[1:], self.seq_length)

    inp = torch.IntTensor(inp)
    target = torch.LongTensor(target)
    # One-hot encode the targets: shape (seq_length, no_tokens)
    target = torch.nn.functional.one_hot(target, num_classes=self.no_tokens).float()

    return inp, target

seq_length = 15
batch_size = 32

data = NameData(seq_length=seq_length)

dataloader = DataLoader(
  data,
  batch_size=batch_size,
  shuffle=True
)

## Transformers

Let's create a function for training one epoch.

In [5]:
from tqdm import tqdm # This is a library that implements loading bars
import sys

def train_epoch(dataloader, model, loss_fn, optimizer):
  model.train() # Set model to training mode

  total_loss = 0
  total_batches = 0

  with tqdm(dataloader, unit="batch", file=sys.stdout) as ep_tqdm:
    ep_tqdm.set_description("Train")
    for X, y in ep_tqdm:
      X, y = X.to(device), y.to(device)

      # Forward pass
      pred = model(X)
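      # CrossEntropyLoss expects the class dimension second, so merge the
      # batch and sequence dimensions: (b, n, tokens) -> (b*n, tokens)
      loss = loss_fn(pred.flatten(0, 1), y.flatten(0, 1))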
        
      # Backward pass
      loss.backward()
      optimizer.step()

      # Reset the computed gradients back to zero
      optimizer.zero_grad()

      # Output stats (.item() converts the loss to a plain Python number)
      total_loss += loss.item()
      total_batches += 1
      ep_tqdm.set_postfix(average_batch_loss=total_loss/total_batches)

## Transformers

Now we can train the model.

In [21]:
#| output-location: slide
# Hyperparameters
learning_rate = 0.0001
epochs = 20

device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
print(f"Using {device} device")

model = Transformer(
  no_tokens=data.no_tokens,
  seq_length=data.seq_length,
  pad_idx=data.pad_idx,
  embed_dim=128,
  no_heads=8,
  depth=4
).to(device)

loss_fn = nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=learning_rate)

# Organize the training loop
for t in range(epochs):
  print(f"Epoch {t+1}\n-------------------------------")
  train_epoch(dataloader, model, loss_fn, optimizer)

print("Done!")

Using cuda device
Epoch 1
-------------------------------
Train: 100%|██████████| 194/194 [00:00<00:00, 214.43batch/s, average_batch_loss=0.816]
Epoch 2
-------------------------------
Train: 100%|██████████| 194/194 [00:00<00:00, 211.03batch/s, average_batch_loss=0.703]
Epoch 3
-------------------------------
Train: 100%|██████████| 194/194 [00:00<00:00, 227.83batch/s, average_batch_loss=0.673]
Epoch 4
-------------------------------
Train: 100%|██████████| 194/194 [00:00<00:00, 233.20batch/s, average_batch_loss=0.658]
Epoch 5
-------------------------------
Train: 100%|██████████| 194/194 [00:00<00:00, 214.30batch/s, average_batch_loss=0.648]
Epoch 6
-------------------------------
Train: 100%|██████████| 194/194 [00:00<00:00, 227.10batch/s, average_batch_loss=0.64] 
Epoch 7
-------------------------------
Train: 100%|██████████| 194/194 [00:00<00:00, 225.38batch/s, average_batch_loss=0.635]
Epoch 8
-------------------------------
Train: 100%|██████████| 194/194 [00:00<00:00, 207.40b

## Transformers

Now let's sample from the model.

For comparison, you can try generating random sequences of characters from the Lithuanian alphabet to see that these results are quite good.

## Transformers

In [22]:
#| output-location: slide
def sample(model, seed, temperature=1.0):
  assert len(seed) > 0
  model.eval()
  with torch.no_grad():
    # Position whose prediction gives the next character
    idx_to_sample = len(seed) - 1

    tokenizer = CharLevelTokenizerLt()

    result = '' + seed
    # 15 is the sequence length the model was trained with
    for _ in range(15 - len(seed)):
      inp = torch.LongTensor(tokenizer.string_to_idx(result, 15)).view(1, 15).to(device)
      # A temperature above 1 flattens the distribution, below 1 sharpens it
      inp = model(inp)/temperature
      inp = inp[0, idx_to_sample, :]
      p = nn.functional.softmax(inp, dim=0)
      cd = torch.distributions.Categorical(p)
      next_char = cd.sample().item()
      # Stop as soon as the model predicts the end of the name
      if next_char == tokenizer.eos_idx or next_char == tokenizer.pad_idx:
        break
      result += tokenizer.idx_to_string([next_char])
      idx_to_sample += 1

    return result

print(
  ' '.join([sample(model=model, seed='jo') for _ in range(5)]) + '\n'
  + ' '.join([sample(model=model, seed='jū') for _ in range(5)]) + '\n'
  + ' '.join([sample(model=model, seed='ja') for _ in range(5)]) + '\n'
  + ' '.join([sample(model=model, seed='je') for _ in range(5)]) + '\n'
)

joelmantė jopvydas johelas jozavinė joherdas
jūrija jūkija jūfraja jūšrija jūratijus
jačius jaolgvinė jaūgvidė jaomintė jaūdronė
jevdrohijas jeongirdas jevfropijas ječimonė jeompinė



## Extras

There are many pre-trained open source transformers. Here are some classic examples:

- [BERT](https://huggingface.co/google-bert/bert-base-uncased)
- [GPT-2](https://huggingface.co/openai-community/gpt2)

## Practice task

Try building a transformer model for sentiment analysis on this [dataset](https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment).