<a href="https://colab.research.google.com/github/vlamen/tue-deeplearning/blob/main/practicals/P5.3_Transformer_answer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# P5.3 - Text Translation (using Transformers)

In this practical we will implement a Transformer [1] from scratch and apply it to the task of text translation (same as in P3.3). We will use the PyTorch library.

## Learning outcomes



*   Understand the basic concept of positional embeddings in transformers
*   Understand and implement the basic concept and underlying mechanism of multi-head attention
*   Learn how to train a transformer to parametrize the joint probability distribution $P(y_0,...,y_k|x_0,...,x_n)$ over the words $Y$ in the target language, conditioned on the words $X$ of the source sequence

## Data Preparation
We will follow roughly the same steps as P3.3 to prepare the text data. For more information on what each function does, you can revisit P3.3.

### Data Downloading

In [None]:
!pip install pytorch-nlp
!python -m spacy download de_core_news_sm
!python -m spacy download en_core_web_sm
!pip install torch==2.3.0 torchtext

In [None]:
## boilerplate code to download the data and parse it as JSON Lines ##
import json
from tqdm.notebook import tqdm
import gdown


def download_and_extract(url, file_name):
    gdown.download(url, file_name, quiet=False)
    with open(file_name, 'r', encoding='utf-8') as f:
        data = [json.loads(line.strip()) for line in f]
    return data


train_data = download_and_extract("https://drive.google.com/uc?id=1GqE08tMg-dQBbVRiQZ-7eiEHdjl0LXzr", "train.jsonl")
valid_data = download_and_extract("https://drive.google.com/uc?id=1PIPpx3rm0eYuJw3cJxYpgDlzeIrzeDr9", "test.jsonl")

print(f"Number of training sentences: {len(train_data)}")
print(f"Number of validation sentences: {len(valid_data)}\n\n")

valid_iterator = iter(valid_data)
for _ in range(3):
    batch = next(valid_iterator)
    print("DE: " + batch['de'])
    print("EN: " + batch['en'] + '\n')


### Data preprocessing

In [None]:
import torch
import torch.nn as nn
import math
from collections import Counter
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import vocab
import spacy
from tqdm.notebook import tqdm

# check if gpu is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

de_counter, en_counter = Counter(), Counter()

de_tokenizer = get_tokenizer('spacy', language='de')
en_tokenizer = get_tokenizer('spacy', language='en')

for batch in tqdm(train_data):

    # look up by key so we do not depend on the key order in the JSON records
    en, de = batch['en'], batch['de']

    de_counter.update(de_tokenizer(de))
    en_counter.update(en_tokenizer(en))


de_vocab = vocab(de_counter, min_freq=2, specials=['<unk>', '<start>', '<stop>', '<pad>'])
en_vocab = vocab(en_counter, min_freq=2, specials=['<unk>', '<start>', '<stop>', '<pad>'])

print(f"Unique tokens in source (de) vocabulary: {len(de_vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(en_vocab)}")

### Pipeline creation

In [None]:
def de_pipeline(text):
    """
    Tokenizes the German sentence from a string into a list of strings (tokens) and reverses their order. Then converts
    each token to its corresponding index. Furthermore, it adds start and stop tokens at the appropriate positions.
    """
    word_idcs = [de_vocab['<start>']]  # start with start token
    vocab_map = de_vocab.get_stoi()  # get our vocab -> idx map

    # now, append all words if they exist in the vocab; if not, enter <unk>
    # note that we do this in reverse
    word_idcs.extend(vocab_map.get(token, vocab_map['<unk>']) for token in de_tokenizer(text)[::-1])

    word_idcs.append(de_vocab['<stop>'])  # end with stop token

    return word_idcs

def en_pipeline(text):
    """
    Tokenizes English sentence from a string into a list of strings (tokens), then converts each token
    to corresponding indices. Furthermore, it adds a start token at the appropriate position
    """
    word_idcs = [en_vocab['<start>']]  # start with start token
    vocab_map = en_vocab.get_stoi()  # get our vocab -> idx map

    # now, append all words if they exist in the vocab; if not, enter <unk>
    word_idcs.extend(vocab_map.get(token, vocab_map['<unk>']) for token in en_tokenizer(text))
    word_idcs.append(en_vocab['<stop>'])  # end with stop token (from the English vocab)
    return word_idcs
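
To make the two pipelines concrete, here is a minimal sketch of the same idea on a toy vocabulary. The dictionary `toy_stoi` and the helper `toy_pipeline` are hypothetical names used only for illustration; the four special tokens are placed first, mirroring the order passed to `vocab` above.

```python
# Toy vocabulary (hypothetical); the four specials mirror the notebook's order.
toy_stoi = {'<unk>': 0, '<start>': 1, '<stop>': 2, '<pad>': 3,
            'a': 4, 'dog': 5, 'runs': 6}


def toy_pipeline(tokens, stoi, reverse=False):
    """Map tokens to indices, wrap them in <start>/<stop>, optionally reversed."""
    if reverse:
        tokens = tokens[::-1]
    ids = [stoi['<start>']]
    # unknown tokens fall back to the <unk> index
    ids.extend(stoi.get(tok, stoi['<unk>']) for tok in tokens)
    ids.append(stoi['<stop>'])
    return ids


forward_ids = toy_pipeline(['a', 'dog', 'barks'], toy_stoi)
reversed_ids = toy_pipeline(['a', 'dog', 'barks'], toy_stoi, reverse=True)
print(forward_ids)   # [1, 4, 5, 0, 2]
print(reversed_ids)  # [1, 0, 5, 4, 2]
```

The word `barks` is not in the toy vocabulary, so it maps to index 0 (`<unk>`) in both cases.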

In [None]:
# tokenize all data up front
train_tokenized = [(torch.tensor(en_pipeline(sentence['en']), dtype=torch.int64, device=device),
                    torch.tensor(de_pipeline(sentence['de']), dtype=torch.int64, device=device))
                    for sentence in tqdm(train_data)]
valid_tokenized = [(torch.tensor(en_pipeline(sentence['en']), dtype=torch.int64, device=device),
                   torch.tensor(de_pipeline(sentence['de']), dtype=torch.int64, device=device))
                   for sentence in tqdm(valid_data)]

## Transformer implementation

Now, we will implement the different components of the transformer architecture (see below). The given code also proposes the main hyperparameters your code should use. Feel free to change the values of these parameters!

<center><img src="https://raw.githubusercontent.com/vlamen/tue-deeplearning/main/img/transformer.png" alt="None" width="500"/></center>

### Positional encoding

The first transformer component we will be implementing is the positional encoding. Since transformers do not make use of the order of the sequence, information about the order must be injected manually in the model. This is done through positional encodings. To generate the positional encoding of a sequence, the following formulas are used:

$PE_{pos,2i} = \sin(pos/10000^{2i/d_{model}})$

$PE_{pos,2i+1} = \cos(pos/10000^{2i/d_{model}})$

In these formulas, $pos$ is the position, $i$ indexes pairs of embedding dimensions, and $d_{model}$ is the embedding size. The sine is used for the even embedding dimensions, while the cosine is used for the odd ones. The word embeddings and the positional encodings have the same size, so that they can be summed together. After the summation, a dropout layer is applied.

In the ```PositionalEncoding``` class below, implement the calculation of the positional encoding discussed above. You can use the ```pe``` variable for the positional encoding. The ```position``` variable already contains all possible positions.

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length, dropout):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(dropout)

        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        # exp(-2i * ln(10000) / d_model) == 1 / 10000^(2i / d_model)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions

        # buffer: stored with the module and moved across devices, but not trained
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # sum word embedding and positional encoding, then apply dropout
        return self.dropout(x + self.pe[:, :x.size(1)])
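
As a sanity check on the formulas, the sketch below computes the same encoding in plain Python for a tiny, arbitrarily chosen configuration. The function name `positional_encoding` and the sizes are assumptions made for illustration. It also checks that the `exp`/`log` form of the divisor used in the class matches $1/10000^{2i/d_{model}}$.

```python
import math


def positional_encoding(max_seq_length, d_model):
    """Return a max_seq_length x d_model nested list of sinusoidal encodings."""
    pe = [[0.0] * d_model for _ in range(max_seq_length)]
    for pos in range(max_seq_length):
        for two_i in range(0, d_model, 2):
            angle = pos / (10000 ** (two_i / d_model))
            pe[pos][two_i] = math.sin(angle)          # even dimension
            if two_i + 1 < d_model:
                pe[pos][two_i + 1] = math.cos(angle)  # odd dimension
    return pe


pe = positional_encoding(max_seq_length=4, d_model=6)

# Position 0 alternates sin(0) = 0 and cos(0) = 1.
print(pe[0])

# The exp/log form of the divisor equals 1 / 10000^(2i / d_model).
d_model, two_i = 6, 2
div_term = math.exp(two_i * -(math.log(10000.0) / d_model))
same = math.isclose(div_term, 1 / 10000 ** (two_i / d_model))
print(same)
```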

### Multi-Head attention

Next, we will implement the (multi-head) attention mechanism of transformers. Recall that in transformers, attention is calculated using the queries ($Q$), keys ($K$) and values ($V$). These are obtained using a linear transformation of the current embeddings. The formula for a single attention head is as follows, where $d_k$ is the dimensionality of the keys and $\sqrt{d_k}$ is the scaling factor:

$ \text{Attention}( Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V $

A drawback of using a single attention head is that only one type of attention can be calculated. To mitigate this issue, multiple attention heads can be used, where each attention head uses a different $Q_i$, $K_i$ and $V_i$. The output of the different attention heads is then concatenated at the end to obtain a new embedding for each word.

In practice, we can generate queries, keys and values by first performing the linear transformations on the input embeddings. Then, we can split the output into the different $Q_i$, $K_i$ and $V_i$.


Below, complete the ```MultiHeadAttention``` class. In the class initialization ```d_model``` indicates the size of the input/output embeddings (512 in the paper), while ```num_heads``` indicates the number of attention heads used (8 in the paper).

Note that a mask is used in the transformer. In the encoder, it ensures that attention is only calculated between embeddings of words (and not padding tokens). In the decoder, it additionally prevents the transformer from accessing "future" tokens. To apply the mask in the ```scaled_dot_product_attention``` function, you can consider the following code, where we fill the masked values with a large negative number:

```attn_scores = attn_scores.masked_fill(mask == 0, -1e9)```
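
To see why a large negative fill value works, the following plain-Python sketch applies the same idea to a single row of attention scores. The scores, the mask and the helper `softmax` are made up for illustration. After the softmax, the masked position receives a weight of zero and the remaining weights still sum to one.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5]   # one row of attention scores (illustrative values)
mask = [1, 1, 0]           # 0 marks a position that must not be attended to

# Same idea as masked_fill(mask == 0, -1e9), written for a plain list.
masked_scores = [s if keep != 0 else -1e9 for s, keep in zip(scores, mask)]
probs = softmax(masked_scores)
print(probs)
```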

**Hint:** When performing the splitting and combining of the heads, keep in mind what the data shape should be. The input (and output) of the ```MultiHeadAttention``` module will have shape ```[batch size, seq_length, d_model]```.


In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, V)
        return output

    def split_heads(self, x):
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)

    def forward(self, q, k, v, mask=None):
        Q = self.split_heads(self.W_q(q))
        K = self.split_heads(self.W_k(k))
        V = self.split_heads(self.W_v(v))

        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        output = self.W_o(self.combine_heads(attn_output))
        return output
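
The shape bookkeeping in `split_heads` and `combine_heads` can be checked in isolation. The sketch below uses small, arbitrarily chosen sizes and repeats the same `view`/`transpose` calls outside the class. It shows that combining is the exact inverse of splitting and that the per-head score matrix has shape ```[batch size, num_heads, seq_length, seq_length]```.

```python
import torch

# Small illustrative sizes (assumptions, not the notebook's hyperparameters).
batch_size, seq_length, d_model, num_heads = 2, 5, 8, 4
d_k = d_model // num_heads

x = torch.arange(batch_size * seq_length * d_model, dtype=torch.float32)
x = x.view(batch_size, seq_length, d_model)

# split: [batch, seq, d_model] -> [batch, heads, seq, d_k]
heads = x.view(batch_size, seq_length, num_heads, d_k).transpose(1, 2)

# attention scores per head: [batch, heads, seq, seq]
scores = torch.matmul(heads, heads.transpose(-2, -1))

# combine: [batch, heads, seq, d_k] -> [batch, seq, d_model]
combined = heads.transpose(1, 2).contiguous().view(batch_size, seq_length, d_model)

print(heads.shape, scores.shape, combined.shape)
print(torch.equal(combined, x))
```

The call to `contiguous()` is needed because `transpose` returns a non-contiguous view, which `view` cannot reshape directly.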

### Position-wise Feed-Forward Networks

Now, we will implement the position-wise feed-forward networks used by the transformer. Each one applies the same two-layer MLP to every embedding independently, with an input and output size of ```d_model```, a hidden size of ```d_ff```, and a ReLU activation between the two layers. The network can be described using the following formula:

$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$

In the original transformer paper, the input and output size has a dimensionality of 512, while the hidden state has a size of 2048.

In [None]:
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))
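
For a feeling of how much of the model sits in these layers, the sketch below counts the weights and biases of one such network. The helper `ffn_parameter_count` is a hypothetical name; the two configurations are the paper's sizes (512 and 2048) and the values this notebook proposes further down (64 and 64).

```python
def ffn_parameter_count(d_model, d_ff):
    """Weights and biases of two linear layers: d_model -> d_ff -> d_model."""
    first = d_model * d_ff + d_ff        # W_1 and b_1
    second = d_ff * d_model + d_model    # W_2 and b_2
    return first + second


paper_count = ffn_parameter_count(d_model=512, d_ff=2048)
notebook_count = ffn_parameter_count(d_model=64, d_ff=64)
print(paper_count)     # 2099712
print(notebook_count)  # 8320
```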

### Encoder Block

With all the required components defined, let's bring it all together to create a single encoding block, as shown in the picture below:

<center><img src="https://raw.githubusercontent.com/vlamen/tue-deeplearning/main/img/transformer_encoder_block.png" alt="None" width="500"/></center>

In a single block, a Multi-Head Attention layer is applied first, with dropout used on its output. After that, the residual connection is applied, followed by normalization. For the second part of the block, the same steps are applied, with the attention layer replaced by the position-wise feed-forward network.

**Note** In the transformer, layer normalization is used. In PyTorch, this is implemented in the ```nn.LayerNorm``` module.

In [None]:
class EncoderBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderBlock, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x
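
The "add and norm" step can be illustrated on a single embedding vector. The sketch below is a plain-Python approximation of what `nn.LayerNorm` computes at initialization, when its learnable scale is 1 and its shift is 0. The function `layer_norm` and the example vectors are assumptions made for illustration, and dropout is left out.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Normalize one embedding vector to zero mean and (almost) unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in vector]


x = [1.0, 2.0, 3.0, 4.0]               # input embedding (illustrative)
sublayer_out = [0.5, -0.5, 0.5, -0.5]  # stand-in for attention or FFN output

residual = [a + b for a, b in zip(x, sublayer_out)]  # residual connection
normed = layer_norm(residual)                        # normalization
print(normed)
```

Each position in the sequence is normalized independently over its `d_model` features, so the result does not depend on the other sentences in the batch.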

### Encoder



To create the full encoder, we will stack multiple encoder blocks together. In the original paper, 6 blocks were used. Both in the positional encoding and each block, a dropout probability of 0.1 was used.



In [None]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length, dropout)
        self.layers = nn.ModuleList([EncoderBlock(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])


    def forward(self, x, mask):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, mask)

        return x

### Decoder Block

Similarly, we will need to create a decoder block. Compared to the encoder block, the decoder block contains two multi-head attention layers. The first layer is masked, such that tokens can only attend to "past" tokens, and not look into the future. The second attention layer calculates attention between the embeddings of the decoder, and the output embeddings from the encoder.

<center><img src="https://raw.githubusercontent.com/vlamen/tue-deeplearning/main/img/transformer_decoder_block.png" alt="None" width="500"/></center>

In [None]:
class DecoderBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderBlock, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, out_enc, enc_mask, dec_mask):
        attn_output = self.self_attn(x, x, x, dec_mask)
        x = self.norm1(x + self.dropout(attn_output))
        attn_output = self.cross_attn(x, out_enc, out_enc, enc_mask)
        x = self.norm2(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x

### Decoder

Now, we will stack multiple blocks together to form the full decoder. In the original paper, 6 blocks were stacked on top of each other. Following the decoder blocks, there is a final layer to map the output embeddings to predictions.

In [None]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length, dropout)

        self.layers = nn.ModuleList([DecoderBlock(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        self.pred_layer = nn.Linear(d_model, vocab_size)

    def forward(self, x, out_enc, enc_mask, dec_mask):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, out_enc, enc_mask, dec_mask)

        out = self.pred_layer(x)
        return out

### Transformer

Since we have all the required parts, we can build the full transformer. We will also add a function to create 2 masks (which we have already seen previously). The first mask, ```enc_mask```, will mask out padding tokens in the input sentence, since we do not want to attend to these. The second mask, ```dec_mask```, masks the padding tokens in the output sentence. However, in addition to masking the padding tokens, it also masks "future" decoder tokens, such that the decoder cannot make use of future information during training.

Thanks to the mask used in the decoder, we only need to perform one forward pass through the model during training, whereas in an RNN model this would need to happen autoregressively.
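
This rests on the chain rule of probability, which factorizes the joint distribution from the learning outcomes into one conditional per target token:

$P(y_0,...,y_k|x_0,...,x_n) = \prod_{t=0}^{k} P(y_t|y_0,...,y_{t-1},x_0,...,x_n)$

Each factor corresponds to one decoder output position. Because the mask only lets that position attend to the target tokens before it, all factors can be evaluated in the same forward pass when the ground-truth target sentence is given as decoder input.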



In [None]:
class Transformer(nn.Module):
    def __init__(self, enc_vocab_size, dec_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super(Transformer, self).__init__()

        self.encoder = Encoder(enc_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)
        self.decoder = Decoder(dec_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)

    def forward(self, enc_in, dec_in, enc_mask, dec_mask):
        out_enc = self.encoder(enc_in, enc_mask)
        out = self.decoder(dec_in, out_enc, enc_mask, dec_mask)
        return out

### Creating Dataloaders

In [None]:
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

def decoder_mask(size):
    # Ones strictly above the diagonal of a 'size x size' matrix mark "future" positions;
    # comparing with 0 below flips this, so True means that attention is allowed
    mask = torch.triu(torch.ones(1, size, size), diagonal = 1).type(torch.int)
    return mask == 0


def collate_batch(batch):
    """
    Concatenate multiple datapoints to obtain a single batch of data
    """
    # sentences are stored as tuples; get respective lists
    en_list = [x[0] for x in batch]
    de_list = [x[1] for x in batch]

    # pad sequences in batch
    de_padded = pad_sequence(sequences = de_list,
                             batch_first = True,
                             padding_value = de_vocab['<pad>'])
    en_padded = pad_sequence(sequences = en_list,
                             batch_first = True,
                             padding_value = en_vocab['<pad>'])

    # decoder input, remove last token
    en_padded_in = en_padded[:,:-1]

    # decoder output (target), remove first token
    en_padded_out = en_padded[:, 1:]

    de_mask = (de_padded != de_vocab['<pad>']).unsqueeze(1).unsqueeze(1).int()

    # mask is calculated for the decoder input
    en_mask_in = (en_padded_in != en_vocab['<pad>']).unsqueeze(1).unsqueeze(1).int()
    dec_mask = decoder_mask(en_padded_in.size(-1)).unsqueeze(0).to(device)

    # combine the two masks
    en_mask = en_mask_in & dec_mask

    # return source (DE) and target sequences (EN) after transferring them to GPU (if available)
    return de_padded.to(device), en_padded_in.to(device), en_padded_out.to(device), de_mask.to(device), en_mask.to(device)
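
To illustrate how the padding mask and the "future" mask combine, the sketch below builds the resulting decoder mask for one short sequence in plain Python. The helper `combined_decoder_mask` and the token ids are hypothetical; index 3 for `<pad>` follows from the order of the special tokens in the vocabularies above.

```python
PAD = 3  # index of '<pad>' in the notebook's vocabularies


def combined_decoder_mask(token_ids, pad_idx=PAD):
    """mask[i][j] is True when query position i may attend to key position j."""
    size = len(token_ids)
    return [[(j <= i) and (token_ids[j] != pad_idx) for j in range(size)]
            for i in range(size)]


# '<start>' = 1, two ordinary tokens, then one padding position
mask = combined_decoder_mask([1, 7, 9, PAD])
for row in mask:
    print([int(v) for v in row])
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 0]
```

The last column is zero everywhere because the padded position may never be attended to. The row for the padded position itself is not empty, but its loss is ignored anyway through `ignore_index` in the criterion defined later.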

## Model Training

With everything in place, let's train a transformer! Below, we provide some hyperparameters, but feel free to change these to your own preferences.

In [None]:
MAX_SEQ_LENGTH = max([len(i[0]) for i in train_tokenized] + \
                     [len(i[1]) for i in train_tokenized] + \
                     [len(i[0]) for i in valid_tokenized] + \
                     [len(i[1]) for i in valid_tokenized]) + 1

In [None]:
BATCH_SIZE = 128
EPOCHS = 15
LR = 0.001

DROPOUT = 0.1
N_HEADS = 4
N_LAYERS = 4
HIDDEN_DIM = 64
FF_HIDDEN_DIM = 64

In [None]:
transformer = Transformer(enc_vocab_size = len(de_vocab),
                          dec_vocab_size = len(en_vocab),
                          d_model = HIDDEN_DIM,
                          num_heads = N_HEADS,
                          num_layers = N_LAYERS,
                          d_ff = FF_HIDDEN_DIM,
                          max_seq_length = MAX_SEQ_LENGTH,
                          dropout=DROPOUT,
                          ).to(device)

In [None]:
from torch import optim

optimizer = optim.Adam(transformer.parameters(), lr=LR)
criterion = nn.CrossEntropyLoss(ignore_index=en_vocab['<pad>'])

In [None]:
trainloader = DataLoader(train_tokenized, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
validloader = DataLoader(valid_tokenized, batch_size=BATCH_SIZE, collate_fn=collate_batch)

Below, we implement the training loop, as well as helper functions for training and evaluating for a full epoch. Since the transformer does not need to perform autoregressive forward passes during training, these functions remain relatively simple.

In [None]:
def train(model, dataloader, optimizer):

    epoch_loss = 0.0
    model.train()

    for i, (src, tgt_in, tgt_out, src_mask, tgt_mask) in enumerate(tqdm(dataloader)):

        optimizer.zero_grad()
        tgt_hat = model(src, tgt_in, src_mask, tgt_mask).transpose(-1,-2)

        loss = criterion(tgt_hat, tgt_out)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()

    return epoch_loss/len(dataloader)

@torch.no_grad()
def evaluate(model, dataloader):

    epoch_loss = 0.0
    model.eval()

    for i, (src, tgt_in, tgt_out, src_mask, tgt_mask) in enumerate(tqdm(dataloader)):
        tgt_hat = model(src, tgt_in, src_mask, tgt_mask).transpose(-1,-2)
        loss = criterion(tgt_hat, tgt_out)
        epoch_loss += loss.item()

    return epoch_loss/len(dataloader)
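
The `.transpose(-1,-2)` in both functions is there because `nn.CrossEntropyLoss` expects the class dimension at position 1, i.e. logits of shape ```[batch size, vocab size, seq_length]``` with targets of shape ```[batch size, seq_length]```. The sketch below uses random logits and made-up targets to check that this gives the same loss as flattening the batch and sequence dimensions, with padded positions ignored in both cases.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_length, vocab_size, pad_idx = 2, 4, 10, 3  # illustrative sizes

# Decoder output layout: [batch, seq, vocab]
logits = torch.randn(batch_size, seq_length, vocab_size)
# Made-up targets; 3 marks padding, as in the notebook's vocabularies
targets = torch.tensor([[5, 6, 2, 3],
                        [7, 2, 3, 3]])

criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

# Layout used in the training loop: class dimension moved to position 1.
loss_transposed = criterion(logits.transpose(-1, -2), targets)

# Equivalent layout: flatten batch and sequence into one dimension.
loss_flat = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))

print(torch.allclose(loss_transposed, loss_flat))
```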

In [None]:
import time

SAVE=False

best_valid_loss = float('inf')
train_loss_arr = []; val_loss_arr = []
for epoch in range(EPOCHS):

    epoch_start_time = time.time()

    train_loss = train(transformer, trainloader, optimizer)
    val_loss = evaluate(transformer, validloader)

    train_loss_arr.append(train_loss); val_loss_arr.append(val_loss)

    if SAVE and (val_loss < best_valid_loss):
        best_valid_loss = val_loss
        torch.save(transformer.state_dict(), 'p5_3-model.pt')

    print('-' * 76)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'train loss {:8.3f} '
          'valid loss {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           train_loss_arr[-1],
                                           val_loss_arr[-1]))
    print('-' * 76)

## Model evaluation

Now that we've trained the model, let's see how it performs! When running inference, the transformer is not able to produce a translated sentence in a single forward pass. Like the RNN seq2seq models, this decoding happens autoregressively, producing one new token at a time. Below, the greedy decoder is implemented, where we simply add the next most likely word to the sentence at each autoregressive step.

In [None]:
import numpy as np

### Your code here ###
def idx_to_sen(sentence_idcs, vocab):
    sentence_idcs = sentence_idcs[sentence_idcs > 3] #remove special tokens
    sentence_idcs = np.array(vocab.get_itos())[sentence_idcs]
    return ' '.join(sentence_idcs)

def print_val_examples(src, trg, pred, N):
    for src_, trg_, pred_ in zip(src[:N], trg[:N], pred[:N]):
        print(f' src: {src_}\n trg: {trg_}\n pred: {pred_}\n')

@torch.no_grad()
def greedy_decoder(model, dataloader):

    model.eval()  # make sure dropout is disabled while decoding

    predf = []; srcf = []; trgf = []

    for idx, (src, tgt_in, tgt_out, src_mask, tgt_mask) in tqdm(enumerate(dataloader)):

        out_enc = model.encoder(src, src_mask)

        sentence = tgt_in[:,[0]]

        sen_len = 1
        while True:

            dec_mask = decoder_mask(sentence.size(1)).unsqueeze(0).to(device)
            out = model.decoder(sentence, out_enc, src_mask, dec_mask)

            _, next_word = torch.max(out[:,-1], dim=1, keepdim=True)
            if sen_len == MAX_SEQ_LENGTH:
                break

            sentence = torch.cat([sentence, next_word], dim=1)
            sen_len += 1

        for p, s, t in zip(sentence.cpu(), src.cpu(), tgt_out.cpu()):
            predf.append(idx_to_sen(p, en_vocab))
            srcf.append(idx_to_sen(s, de_vocab))
            trgf.append(idx_to_sen(t, en_vocab))



    return srcf, trgf, predf

out = greedy_decoder(transformer, validloader)

print_val_examples(*out, N=10)
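
The greedy decoder above keeps generating until `MAX_SEQ_LENGTH` is reached, so predictions may contain extra words after the model has already produced `<stop>`. One possible post-processing step, sketched below in plain Python, is to cut each predicted index sequence at the first stop token before converting it to text. The helper `truncate_at_stop` and the example sequence are hypothetical; index 2 for `<stop>` follows from the order of the special tokens in the vocabularies above.

```python
STOP = 2  # index of '<stop>' in the notebook's vocabularies


def truncate_at_stop(token_ids, stop_idx=STOP):
    """Keep tokens up to, but not including, the first stop token."""
    out = []
    for tok in token_ids:
        if tok == stop_idx:
            break
        out.append(tok)
    return out


decoded = [1, 15, 8, 42, STOP, 8, 8, 15]  # hypothetical greedy output
truncated = truncate_at_stop(decoded)
print(truncated)  # [1, 15, 8, 42]
```

If no stop token is present, the sequence is returned unchanged.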
