<a href="https://colab.research.google.com/github/hissain/mlworks/blob/main/codes/Seq2Seq_Transformer_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
# NOTE: If you are running this notebook on Google Colab,
#       then uncomment the two lines below and then run this cell!

#!pip install datasets evaluate --upgrade
#!python -m spacy download de_core_news_sm

In [5]:
from google.colab import drive
drive.mount('/content/drive')

import os
os.chdir('/content/drive/MyDrive/hissain')

Mounted at /content/drive


In [6]:
# prompt: How to list my google drive files?

!ls


codes  datasets  models  tmp


In [7]:
import torch
import torch.nn as nn
import torch.optim as optim
import random
import numpy as np
import spacy
import datasets
import torchtext
import tqdm
import evaluate


#transformer
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from typing import Iterable, List
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset
from timeit import default_timer as timer
from torch.nn import Transformer
from torch import Tensor
from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm

import numpy as np
import torch.nn as nn
import torch
import torch.nn.functional as F
import numpy as np
import math
import os
import pandas as pd
import matplotlib.pyplot as plt



In [8]:
seed = 1234

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

### Dataset

Next, we'll load our dataset using the `datasets` library. When using the `load_dataset` function we pass the name of the dataset, `bentrevett/multi30k`.

The dataset we'll be using is a subset of the [Multi30k dataset](https://github.com/multi30k/dataset), which is hosted [here](https://huggingface.co/datasets/bentrevett/multi30k) on the HuggingFace dataset hub. This subset has ~30,000 parallel English and German sentences obtained using the task 1 raw data from [here](https://github.com/multi30k/dataset/tree/master/data/task1/raw). We use the "2016" versions of the test sets.


In [9]:
dataset = datasets.load_dataset("bentrevett/multi30k")
#dataset_chat = datasets.load_dataset("li2017dailydialog/daily_dialog")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/4.60M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/164k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/156k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/29000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1014 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [10]:
dataset

DatasetDict({
    train: Dataset({
        features: ['en', 'de'],
        num_rows: 29000
    })
    validation: Dataset({
        features: ['en', 'de'],
        num_rows: 1014
    })
    test: Dataset({
        features: ['en', 'de'],
        num_rows: 1000
    })
})

For convenience, we create a variable for each split. Each being a `Dataset` object.


In [11]:
train_data, valid_data, test_data = (
    dataset["train"],
    dataset["validation"],
    dataset["test"],
)

We can index into each `Dataset` to view an individual example. Each example has two features: "en" and "de", which are the parallel English and German sentences.


In [12]:
train_data[0]

{'en': 'Two young, White males are outside near many bushes.',
 'de': 'Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.'}

### Tokenizers

Next, we'll load the spaCy models that contain the tokenizers.

A tokenizer is used to turn a string into a list of tokens that make up that string, e.g. "good morning!" becomes ["good", "morning", "!"]. We'll start talking about the sentences being a sequence of tokens from now, instead of saying they're a sequence of words. What's the difference? Well, "good" and "morning" are both words and tokens, but "!" is not a word. We could say "!" is punctuation, but the term token is more general and covers: words, punctuation, numbers and any special symbols.

spaCy has model for each language ("de_core_news_sm" for German and "en_core_web_sm" for English) which need to be loaded so we can access the tokenizer of each model.

**Note**: the models must first be downloaded using the following on the command line:

```
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
```

We load the models as such:


In [13]:
#!python -m spacy download de_core_news_sm
#!python -m spacy download en_core_web_sm

In [14]:
en_nlp = spacy.load("en_core_web_sm")
de_nlp = spacy.load("de_core_news_sm")

We can call the tokenizer for each spaCy model using the `.tokenizer` method, which accepts a string and returns a sequence of `Token` objects. We can get the string from the token object using the `text` attribute.


In [15]:
string = "What a lovely day it is today!"
[token.text for token in en_nlp.tokenizer(string)]

['What', 'a', 'lovely', 'day', 'it', 'is', 'today', '!']

Next, we'll write a function used to apply the tokenizer to all of the examples in each data split, as well as apply some other processing.

This function takes in an example from the `Dataset` object, applies the tokenizers English and German spaCy models, trims the list of tokens to a maximum length, optionally converts each token to lowercase, and then appends the start of sequence and end of sequence tokens to the beginning and end of the list of tokens.

This function will be used with the `map` method from a `Dataset`, which needs to return a dictionary containing the names of the features in each example where the outputs are stored. As the output feature names "en_tokens" and "de_tokens" are not already contained in the example (where we only have "en" and "de" features), this will create two new features in each example.


In [16]:
def tokenize_example(example, en_nlp, de_nlp, max_length, lower, sos_token, eos_token, pad_token):
    en_tokens = [token.text for token in en_nlp.tokenizer(example["en"])][:max_length]
    de_tokens = [token.text for token in de_nlp.tokenizer(example["de"])][:max_length]
    if lower:
        en_tokens = [token.lower() for token in en_tokens]
        de_tokens = [token.lower() for token in de_tokens]
    en_tokens = [sos_token] + en_tokens[:max_length-2] + [eos_token]
    de_tokens = [sos_token] + de_tokens[:max_length-2] + [eos_token]

    en_tokens = en_tokens + [pad_token] * (max_length - len(en_tokens))
    de_tokens = de_tokens + [pad_token] * (max_length - len(de_tokens))

    return {"en_tokens": en_tokens, "de_tokens": de_tokens}

In [17]:
# prompt: a dummy example to test tokenize_example

example = {
    "en": "This is a test sentence.",
    "de": "Dies ist ein Testsatz."
    }
tokenized_example = tokenize_example(
    example, en_nlp, de_nlp,
    max_length=10, lower=True,
    sos_token="<sos>", eos_token="<eos>", pad_token="<pad>")

print(tokenized_example)


{'en_tokens': ['<sos>', 'this', 'is', 'a', 'test', 'sentence', '.', '<eos>', '<pad>', '<pad>'], 'de_tokens': ['<sos>', 'dies', 'ist', 'ein', 'testsatz', '.', '<eos>', '<pad>', '<pad>', '<pad>']}


We apply the `tokenize_example` function using the `map` method as below.

The `example` argument is implied, however all additional arguments to the `tokenize_example` function need to be stored in a dictionary and passed to the `fn_kwargs` argument of `map`.

Here, we're trimming all sequences to a maximum length of 1000 tokens, converting each token to lower case, and using `<sos>` and `<eos>` as the start and end of sequence tokens, respectively.


In [18]:
max_length = 100
lower = True
sos_token = "<sos>"
eos_token = "<eos>"
pad_token = "<pad>"

fn_kwargs = {
    "en_nlp": en_nlp,
    "de_nlp": de_nlp,
    "max_length": max_length,
    "lower": lower,
    "sos_token": sos_token,
    "eos_token": eos_token,
    "pad_token": pad_token,
}

train_data = train_data.map(tokenize_example, fn_kwargs=fn_kwargs)
valid_data = valid_data.map(tokenize_example, fn_kwargs=fn_kwargs)
test_data = test_data.map(tokenize_example, fn_kwargs=fn_kwargs)

Map:   0%|          | 0/29000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1014 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

We can now look at an example, confirming the two new features have been added; both of which are lowercased list of strings with the start/end of sequence tokens appended.


In [19]:
#train_data[0]

In [20]:
#train_data["en_tokens"]

In [21]:
min_freq = 1
unk_token = "<unk>"

special_tokens = [
    unk_token,
    pad_token,
    sos_token,
    eos_token,
]

en_vocab = torchtext.vocab.build_vocab_from_iterator(
    train_data["en_tokens"],
    min_freq=min_freq,
    specials=special_tokens,
)

de_vocab = torchtext.vocab.build_vocab_from_iterator(
    train_data["de_tokens"],
    min_freq=min_freq,
    specials=special_tokens,
)

In [22]:
en_vocab.get_itos()[-10:]

['yong',
 'yongsan',
 'yorkie',
 'youngster',
 'ypoung',
 'zales',
 'zippered',
 'zips',
 'zoom',
 'zooming']

We can get the index from a given token using the `get_stoi` (stoi = " **s**tring **to** **i**nt) method.


In [23]:
en_vocab.get_stoi()["the"]

7

The `len` of each vocabulary gives us the number of unique tokens in each one. We can see that our training data had around 2000 more German tokens (that appeared at least twice) than the English data.


In [24]:
len(en_vocab), len(de_vocab)

(9797, 18669)

We can also use the `in` keyword to get a boolean indicating if a token is in the vocabulary.


In [25]:
"the" in en_vocab

True

Remember how we converted all of our tokens to lowercase? This means that no tokens containing any uppercase characters appear in our vocabulary.


In [26]:
"The" in en_vocab

False

What happens if you try and get the index of a token that isn't in the vocabulary? You get index zero for the `<unk>` (unknown) token, right?

Well, no. One quirk of the `torchtext` vocabulary class is that you have to manually set what value you want your vocabulary to return when you try and get the index of an out-of-vocabulary token. If you have not set this value, then you will receive an error! This is so you can set your vocabulary to return any value when trying to get the index of a token not in the vocabulary, even something like `-100`.


In [27]:
#en_vocab["The"]

We already know the index of our `<unk>` token is zero as it's the first element in our `special_tokens` list, and we've also manually inspected it using `get_itos`.

However, here we'll programmatically get it and also check that both our vocabularies have the same index for the unknown and padding tokens as this simplifies some code later on.

We also get the index of our `<pad>` token, as we'll use it later


In [28]:
assert en_vocab[unk_token] == de_vocab[unk_token]
assert en_vocab[pad_token] == de_vocab[pad_token]

unk_index = en_vocab[unk_token]
pad_index = en_vocab[pad_token]

unk_index, pad_index

(0, 1)

Using the `set_default_index` method we can set what value is returned when we try and get the index of a token outside of our vocabulary. In this case, the index of the unknown token, `<unk>`.


In [29]:
en_vocab.set_default_index(unk_index)
de_vocab.set_default_index(unk_index)

Now, we can happily get indexes of out of vocabulary tokens until our heart is content!


In [30]:
en_vocab["The"]

0

And we can get the token corresponding to that index to prove it's the `<unk>` token.


In [31]:
en_vocab.get_itos()[0]

'<unk>'

Another useful feature of the vocabulary is the `lookup_indices` method. This takes in a list of tokens and returns a list of indices. In the below example we can see the token "crime" does not exist in our vocabulary so is coverted to the index of the `<unk>` token, zero, which we passed to the `set_default_index` method.


In [32]:
tokens = ["i", "love", "watching", "crime", "shows"]

In [33]:
en_vocab.lookup_indices(tokens)

[956, 2169, 173, 6799, 821]

Conversely, we can use the `lookup_tokens` method to convert a list of indices back into tokens using the vocabulary. Notice how the original "crime" token is now an `<unk>` token. There is no way to tell what the original sequence of tokens was.


In [34]:
en_vocab.lookup_tokens(en_vocab.lookup_indices(tokens))

['i', 'love', 'watching', 'crime', 'shows']

Hopefully we've now got the gist of how the `torchtext.Vocab` class works. Time to put it into action!

Just like our `tokenize_example`, we create a `numericalize_example` function which we'll use with the `map` method of our dataset. This will "numericalize" (a fancy way of saying convert tokens to indices) our tokens in each example using the vocabularies and return the result into new "en_ids" and "de_ids" features.


In [35]:
def numericalize_example(example, en_vocab, de_vocab):
    en_ids = en_vocab.lookup_indices(example["en_tokens"])
    de_ids = de_vocab.lookup_indices(example["de_tokens"])
    return {"en_ids": en_ids, "de_ids": de_ids}

We apply the `numericalize_example` function, passing our vocabularies in the `fn_kwargs` dictionary to the `fn_kwargs` argument.


In [36]:
fn_kwargs = {
    "en_vocab": en_vocab,
    "de_vocab": de_vocab
    }

train_data = train_data.map(numericalize_example, fn_kwargs=fn_kwargs)
valid_data = valid_data.map(numericalize_example, fn_kwargs=fn_kwargs)
test_data = test_data.map(numericalize_example, fn_kwargs=fn_kwargs)

Map:   0%|          | 0/29000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1014 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Checking an example, we can see that it has the two new features: "en_ids" and "de_ids", both a list of integers representing their indices in the respective vocabulary.


In [37]:
#train_data[0]

We can confirm the indices are correct by using the `lookup_tokens` method with the corresponding vocabulary on the list of indices.


In [38]:
#en_vocab.lookup_tokens(train_data[0]["en_ids"])

In [39]:
data_type = "torch"
format_columns = ["en_ids", "de_ids"]

train_data = train_data.with_format(
    type=data_type,
    columns=format_columns,
    output_all_columns=True
)

valid_data = valid_data.with_format(
    type=data_type,
    columns=format_columns,
    output_all_columns=True,
)

test_data = test_data.with_format(
    type=data_type,
    columns=format_columns,
    output_all_columns=True,
)

We can confirm this worked by checking an example and seeing the "en_ids" and "de_ids" features are listed as `tensor([...])`.


In [40]:
#train_data[0]

We can also check this using the `type` built-in function on one of the features.


In [41]:
type(train_data[0]["en_ids"])

torch.Tensor

In [42]:
def get_collate_fn(en_vocab, de_vocab, pad_index):

    def collate_fn(batch):
        en_batch = [torch.tensor(en_vocab(example["en_tokens"])) for example in batch] # Convert to tensors
        de_batch = [torch.tensor(de_vocab(example["de_tokens"])) for example in batch] # Convert to tensors

        en_batch = pad_sequence(en_batch, padding_value=pad_index, batch_first=True)
        de_batch = pad_sequence(de_batch, padding_value=pad_index, batch_first=True)

        return en_batch, de_batch

    return collate_fn

def get_data_loader(dataset, batch_size, en_vocab, de_vocab, pad_index, shuffle=False):
    collate_fn = get_collate_fn(en_vocab, de_vocab, pad_index)
    data_loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        collate_fn=collate_fn,
        shuffle=shuffle,
    )
    return data_loader

In [43]:
batch_size = 512

# Added en_vocab and de_vocab arguments to the function call
train_data_loader = get_data_loader(train_data, batch_size, en_vocab, de_vocab, pad_index, shuffle=True)
valid_data_loader = get_data_loader(valid_data, batch_size, en_vocab, de_vocab, pad_index)
test_data_loader = get_data_loader(test_data, batch_size, en_vocab, de_vocab, pad_index)

In [44]:
for batch in train_data_loader:
    print(batch[0].shape)
    print(batch[1].shape)
    break

torch.Size([512, 100])
torch.Size([512, 100])


In [45]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

In [46]:
class Generator(nn.Module):
    def __init__(
        self,
        max_seq_len,
        num_encoder_layers: int,
        num_decoder_layers: int,
        emb_size: int,
        nhead: int,
        src_vocab_size: int,
        tgt_vocab_size: int,
        dim_feedforward: int,
        dropout: float = 0.1
    ):
        super(Generator, self).__init__()
        self.transformer = Transformer(
            d_model=emb_size,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = nn.Embedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = nn.Embedding(tgt_vocab_size, emb_size)
        self.pos_emb = PositionalEncoding(emb_size, max_seq_len)

    def forward(self,
                src: Tensor,
                trg: Tensor,
                src_mask: Tensor,
                tgt_mask: Tensor,
                src_padding_mask: Tensor,
                tgt_padding_mask: Tensor,
                memory_key_padding_mask: Tensor):

        src_emb = self.pos_emb(self.src_tok_emb(src))
        tgt_emb = self.pos_emb(self.tgt_tok_emb(trg))
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
                                src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
        return self.generator(outs)

    def encode(self, src: Tensor, src_mask: Tensor):
        return self.transformer.encoder(self.pos_emb(self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        return self.transformer.decoder(self.pos_emb(self.tgt_tok_emb(tgt)), memory, tgt_mask)

In [47]:
MAX_SEQ_LEN = max_length
SRC_VOCAB_SIZE = len(de_vocab)
TGT_VOCAB_SIZE = len(en_vocab)
EMB_SIZE = 256
NHEAD = 4
FFN_HID_DIM = 512
BATCH_SIZE = batch_size
NUM_ENCODER_LAYERS = 6
NUM_DECODER_LAYERS = 6
DEVICE = 'cuda'
NUM_EPOCHS = 10

model = Generator(MAX_SEQ_LEN, NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE, NHEAD,
                SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)
model

Generator(
  (transformer): Transformer(
    (encoder): TransformerEncoder(
      (layers): ModuleList(
        (0-5): 6 x TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=512, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
      )
      (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    )
    (decoder): TransformerDecoder(
      (layers): ModuleList(
        (0-5): 6 x TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
    

In [48]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)


model.apply(init_weights)

Generator(
  (transformer): Transformer(
    (encoder): TransformerEncoder(
      (layers): ModuleList(
        (0-5): 6 x TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=512, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
      )
      (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    )
    (decoder): TransformerDecoder(
      (layers): ModuleList(
        (0-5): 6 x TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
    

In [49]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"The model has {count_parameters(model):,} trainable parameters")

The model has 17,713,477 trainable parameters


In [50]:
optimizer = optim.Adam(model.parameters())

In [51]:
criterion = nn.CrossEntropyLoss(ignore_index=pad_index)

In [52]:
def generate_square_subsequent_mask(sz):
    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

def create_mask(src, tgt):
    src_seq_len = src.shape[1]
    tgt_seq_len = tgt.shape[1]

    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)

    src_padding_mask = (src == pad_index)
    tgt_padding_mask = (tgt == pad_index)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask

In [53]:
def train_fn(
    model, data_loader, optimizer, criterion, clip, device
):
    model.train()
    model.to(device)
    epoch_loss = 0
    for i, batch in enumerate(tqdm.notebook.tqdm(data_loader, desc="Training", unit="batch")):
        src = batch[1].to(device)
        trg = batch[0].to(device)

        #src = [src length, batch size]
        #trg = [trg length, batch size]

        # Ensure that src and trg have the same batch size
        assert src.size(1) == trg.size(1), "Source and target batch sizes must be equal"

        optimizer.zero_grad()
        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, trg)
        output = model(src, trg, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)
        # output = [trg length, batch size, trg vocab size]
        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim)
        # output = [(trg length - 1) * batch size, trg vocab size]
        trg = trg[1:].view(-1)
        # trg = [(trg length - 1) * batch size]
        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(data_loader)

In [54]:
def evaluate_fn(model, data_loader, criterion, device):
    model.eval()
    model.to(device)
    epoch_loss = 0
    with torch.no_grad():
        for i, batch in enumerate(tqdm.notebook.tqdm(data_loader, desc="Evaluating", unit="batch")):
            src = batch[1].to(device)
            trg = batch[0].to(device)
            # src = [src length, batch size]
            # trg = [trg length, batch size]
            src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, trg)
            output = model(src, trg, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)
            #output = model(src, trg, 0)  # turn off teacher forcing
            # output = [trg length, batch size, trg vocab size]
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            # output = [(trg length - 1) * batch size, trg vocab size]
            trg = trg[1:].view(-1)
            # trg = [(trg length - 1) * batch size]
            loss = criterion(output, trg)
            epoch_loss += loss.item()
    return epoch_loss / len(data_loader)


In [55]:
import tqdm

n_epochs = 3
clip = 1.0

best_valid_loss = float("inf")

for epoch in tqdm.notebook.tqdm(range(0, n_epochs)):
    print(f"Epoch: {epoch}")
    train_loss = train_fn(
        model,
        train_data_loader,
        optimizer,
        criterion,
        clip,
        DEVICE,
    )
    valid_loss = evaluate_fn(
        model,
        valid_data_loader,
        criterion,
        DEVICE,
    )
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), "tmp/transformer_model.pt")
    print(f"\tTrain Loss: {train_loss:7.3f} | Train PPL: {np.exp(train_loss):7.3f}")
    print(f"\tValid Loss: {valid_loss:7.3f} | Valid PPL: {np.exp(valid_loss):7.3f}")

  0%|          | 0/3 [00:00<?, ?it/s]

Epoch: 0


Training:   0%|          | 0/57 [00:00<?, ?batch/s]



Evaluating:   0%|          | 0/2 [00:00<?, ?batch/s]

	Train Loss:   7.705 | Train PPL: 2218.665
	Valid Loss:   5.865 | Valid PPL: 352.324
Epoch: 1


Training:   0%|          | 0/57 [00:00<?, ?batch/s]

Evaluating:   0%|          | 0/2 [00:00<?, ?batch/s]

	Train Loss:   5.442 | Train PPL: 230.898
	Valid Loss:   5.370 | Valid PPL: 214.794
Epoch: 2


Training:   0%|          | 0/57 [00:00<?, ?batch/s]

Evaluating:   0%|          | 0/2 [00:00<?, ?batch/s]

	Train Loss:   5.366 | Train PPL: 214.074
	Valid Loss:   5.371 | Valid PPL: 215.061


We've now successfully trained a model that translates German into English! But how well does it perform?


## Evaluating the Model

The first thing to do is to test the model's performance on the test set.

We'll load the parameters (`state_dict`) that gave our model the best validation loss and run it on the test set to get our test loss and perplexity.


In [56]:
!ls tmp

transformer_model.pt


In [57]:
torch.save(model.state_dict(), "./tmp/transformer_model.pt")
model.load_state_dict(torch.load("./tmp/transformer_model.pt"))

test_loss = evaluate_fn(model, test_data_loader, criterion, DEVICE)

print(f"| Test Loss: {test_loss:.3f} | Test PPL: {np.exp(test_loss):7.3f} |")

Evaluating:   0%|          | 0/2 [00:00<?, ?batch/s]

| Test Loss: 5.357 | Test PPL: 212.084 |


In [58]:
import torch

def greedy_decode2(model, sentence, en_vocab, de_vocab, device, max_length, de_nlp, lower=True):
    model.eval()
    with torch.no_grad():

        src_tokens = [token.text for token in de_nlp.tokenizer(sentence)]

        if lower:
            src_tokens = [token.lower() for token in src_tokens]

        src_tokens = [sos_token] + src_tokens[:max_length-2] + [eos_token]
        src_indices = de_vocab.lookup_indices(src_tokens)
        src_tensor = torch.LongTensor(src_indices).unsqueeze(0).to(device)

        # Encode the source sentence
        encoder_output = model.encode(src_tensor, None)

        # Initialize the target sentence with the start-of-sentence token
        tgt_indices = [en_vocab.lookup_indices([sos_token])[0]]

        for _ in range(max_length):
            tgt_tensor = torch.LongTensor(tgt_indices).unsqueeze(0).to(device)
            # Decode the target sentence
            decoder_output = model.decode(tgt_tensor, encoder_output, None)
            # Get the predicted token
            pred_token = decoder_output.argmax(2)[:,-1].item()
            tgt_indices.append(pred_token)

            # Stop decoding if the end-of-sentence token is predicted
            if pred_token == en_vocab.lookup_indices([eos_token])[0]:
                break

    # Convert the predicted indices to tokens and join them into a sentence
    tgt_tokens = en_vocab.lookup_tokens(tgt_indices)
    return " ".join(tgt_tokens[1:-1])  # Remove the start and end tokens

# Usage Example (assuming proper objects are defined):
# result = greedy_decode(model, sentence, en_vocab, de_vocab, device, max_length, de_nlp)


In [59]:
def greedy_decode(model, sentence, en_vocab, de_vocab, device, max_length):
    model.eval()
    with torch.no_grad():
        # Tokenize and numericalize the input sentence
        src_tokens = [token.text for token in de_nlp.tokenizer(sentence)]

        if lower:
          de_tokens = [token.lower() for token in src_tokens]
          de_tokens = [sos_token] + de_tokens[:max_length-2] + [eos_token]

        de_tokens = de_tokens + [pad_token] * (max_length - len(de_tokens))

        src_indices = de_vocab.lookup_indices(de_tokens)
        src_tensor = torch.LongTensor(src_indices).unsqueeze(0).to(device)

        # Encode the source sentence
        encoder_output = model.encode(src_tensor, None)

        # Initialize the target sentence with the start-of-sentence token
        # Use sos_token instead of init_token
        tgt_indices = [en_vocab.lookup_indices([sos_token])[0]]

        for _ in range(max_length):
            tgt_tensor = torch.LongTensor(tgt_indices).unsqueeze(0).to(device)
            # Decode the target sentence
            decoder_output = model.decode(tgt_tensor, encoder_output, None)
            # Get the predicted token
            pred_token = decoder_output.argmax(2)[:,-1].item()
            tgt_indices.append(pred_token)

            # Stop decoding if the end-of-sentence token is predicted
            # Use eos_token instead of init_token
            if pred_token == en_vocab.lookup_indices([eos_token])[0]:
                break

    # Convert the predicted indices to tokens and join them into a sentence
    tgt_tokens = en_vocab.lookup_tokens(tgt_indices)
    return " ".join(tgt_tokens[1:-1])  # Remove the start and end tokens

In [61]:
for i in range(5):
    src_sentence = test_data[i]["de"]
    trg_sentence = test_data[i]["en"]
    translation = greedy_decode(model, src_sentence, en_vocab, de_vocab, DEVICE, max_length)
    #translation = greedy_decode2(model, src_sentence, en_vocab, de_vocab, DEVICE, max_length,de_nlp, True)
    print(f"Source: {src_sentence}")
    print(f"Target: {trg_sentence}")
    print(f"Translation: {translation}")
    print()

Source: Ein Mann mit einem orangefarbenen Hut, der etwas anstarrt.
Target: A man in an orange hat starring at something.
Translation: trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees

Source: Ein Boston Terrier läuft über saftig-grünes Gras vor einem weißen Zaun.
Target: A Boston Terrier is running on lush green grass in front of a white fence.
Translation: trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees trees