# Language Translation with Transformer

In this hands-on session, we demonstrate how to build a lightweight transformer model that translates German to English. The code below is adapted from this [PyTorch tutorial](https://pytorch.org/tutorials/beginner/translation_transformer.html).

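We first install the `spacy` package, which provides handy tools to break sentences into words, together with its English and German models. Uncomment the lines below if these are not installed yet.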

In [7]:
# !pip install -U spacy
# !pip install torchdata==0.5.1
# !python -m spacy download en_core_web_sm
# !python -m spacy download de_core_news_sm


## Language Translation with nn.Transformer and torchtext

We will use the [Multi30k](http://www.statmt.org/wmt16/multimodal-task.html#task1)
dataset to train a German-to-English translation model.



In [8]:
from torchtext.datasets import Multi30k
import itertools

SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))

In [9]:
# Multi30k returns an iterator.
# It sequentially produces pairs of a
# 'de' sentence and its corresponding
# 'en' sentence.
# Let us print some pairs.
start_idx = 0
stop_idx = 20
step_size = 5
for de_sentence, en_sentence in itertools.islice(train_iter,start_idx,stop_idx,step_size):
    print(de_sentence)
    print(en_sentence)
    print('-'*20)

Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.
Two young, White males are outside near many bushes.
--------------------
Ein Mann in grün hält eine Gitarre, während der andere Mann sein Hemd ansieht.
A man in green holds a guitar while the other man observes his shirt.
--------------------
Eine Ballettklasse mit fünf Mädchen, die nacheinander springen.
A ballet class of five girls jumping in sequence.
--------------------
Eine Frau mit schwarzem Oberteil und Brille streut Puderzucker auf einem Gugelhupf.
A lady in a black top with glasses is sprinkling powdered sugar on a bundt cake.
--------------------


### Data Processing

We first build the data pipeline. In order to train our transformer, we need to

1.  Break sentences into words -- we use `get_tokenizer`
2.  Build German and English vocabularies from the training set -- we use `build_vocab` (short for `build_vocab_from_iterator`)
3.  Define special 'words' and insert them into the vocabularies
    -  BOS: Beginning of Sentence
    -  EOS: End of Sentence
    -  PAD: Padding
    -  UNK: Unknown word (for words not in the training set)

We will build three transforms to convert sentences to PyTorch tensors.

1. `token_transform`: given a sentence, breaks it into a list of words
2. `vocab_transform`: given a list of words, returns the indices of these words in the vocabulary
3. `tensor_transform`: given a list of indices, adds the indices of `<BOS>` and `<EOS>` and converts the list to a PyTorch tensor

Lastly, we compose these three transforms into `text_transform`.
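
To make the order of the three stages concrete, here is a minimal, dependency-free sketch of the same pipeline. It is not the torchtext implementation: `toy_tokenize`, `build_toy_vocab`, `make_numericalize`, `add_bos_eos`, and the local `compose` are hypothetical stand-ins, whitespace splitting replaces spaCy, and plain lists replace tensors.

```python
from collections import Counter
from functools import reduce

UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']

def compose(*funcs):
    """Right-to-left composition: compose(a, b, c)(x) == a(b(c(x)))."""
    return lambda x: reduce(lambda acc, f: f(acc), reversed(funcs), x)

def toy_tokenize(sentence):
    # Hypothetical stand-in for the spaCy tokenizer: split on whitespace only.
    return sentence.split()

def build_toy_vocab(corpus):
    # Special symbols take indices 0..3; the remaining words follow,
    # ordered by descending frequency (ties broken alphabetically).
    counts = Counter(tok for s in corpus for tok in toy_tokenize(s))
    words = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(SPECIALS + words)}

def make_numericalize(vocab):
    # Words missing from the vocabulary fall back to UNK_IDX.
    return lambda tokens: [vocab.get(t, UNK_IDX) for t in tokens]

def add_bos_eos(ids):
    return [BOS_IDX] + ids + [EOS_IDX]

corpus = ["a man holds a guitar", "a girl jumps"]
vocab = build_toy_vocab(corpus)
toy_text_transform = compose(add_bos_eos, make_numericalize(vocab), toy_tokenize)

ids = toy_text_transform("a man holds a piano")
print(ids)  # 'piano' is not in the toy corpus, so it maps to UNK_IDX
```

The rightmost function passed to `compose` runs first, so tokenization happens before numericalization, and the BOS/EOS indices are added last.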

In [10]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator as build_vocab
from toolz.functoolz import compose
from typing import Iterable, List
import torch

# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

In [11]:
token_transform = {}
vocab_transform = {}
language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

# Create source and target language tokenizer. Make sure to install the dependencies.
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')


# helper function that yields one list of tokens per sentence
def yield_tokens(data_iter: Iterable, language: str) -> Iterable[List[str]]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

    for data_sample in data_iter:
        sample_sentence = data_sample[language_index[language]]
        yield token_transform[language](sample_sentence)


for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Training data Iterator 
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    # Create torchtext's Vocab object 
    vocab_transform[ln] = build_vocab(yield_tokens(train_iter, ln),
                                      min_freq=1,
                                      specials=special_symbols,
                                      special_first=True)
    # Set UNK_IDX as the default index.
    vocab_transform[ln].set_default_index(UNK_IDX)

def tensor_transform(token_ids: List[int]):
    # token_ids: a list of word indices
    return torch.cat((torch.tensor([BOS_IDX]), 
                      torch.tensor(token_ids), 
                      torch.tensor([EOS_IDX])))
 
# src and tgt language text transforms to convert raw strings into tensors of indices
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Note: compose from toolz composes
    # from the right by default, so
    # compose(a,b,c)(x)
    # gives you a(b(c(x)))
    text_transform[ln] = compose(tensor_transform,     # 3. Add BOS/EOS and create tensor
                                 vocab_transform[ln],  # 2. Numericalization
                                 token_transform[ln])  # 1. Tokenization


In [12]:
# Let us try token_transform and text_transform
# note how 'abscdes' is represented by index 0,
# which is the index for 'UNK_IDX'
en_sentence = 'I like apple, but I do not like abscdes'
word_list = token_transform['en'](en_sentence)
print('Word list is {} \n'.format(word_list))

word_tensor = text_transform['en'](en_sentence)
print('Tensor of word indices: {}'.format(word_tensor))

Word list is ['I', 'like', 'apple', ',', 'but', 'I', 'do', 'not', 'like', 'abscdes'] 

Tensor of word indices: tensor([   2, 1166,  347, 1633,   15, 1289, 1166,  755,  978,  347,    0,    3])


### Build a dataloader

Now we are ready to build a dataloader. Recall that in PyTorch, we train the model on a batch of inputs. The purpose of a dataloader is to automatically collect a batch of inputs.

To build a dataloader, we need to tell PyTorch how to assemble a collection of inputs into a single tensor. This is called `collate`.

We already have `text_transform`, which transforms a sentence into a tensor of word indices. To collate a batch of sentences, we need to

1.  Pad all sentence tensors to the same length -- we use `pad_sequence`
2.  Stack the padded tensors into a single batch tensor, which `pad_sequence` also does for us.

Note that the input to our `collate` function has the form

[[de_sentence_0, en_sentence_0], [de_sentence_1, en_sentence_1], ..., [de_sentence_n, en_sentence_n]]

In [13]:
from torch.nn.utils.rnn import pad_sequence

# function to collate data samples into batch tensors
def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))

    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch
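
The layout produced by `collate_fn` is easy to get wrong, so here is a small sketch of what padding does, using plain Python lists in place of tensors. `pad_time_major` is a hypothetical helper written for this illustration; it mimics the default `batch_first=False` layout of `pad_sequence`, where the first axis is the position in the sentence and the second axis is the sentence in the batch.

```python
PAD_IDX = 1  # same padding index as defined earlier in the notebook

def pad_time_major(sequences, padding_value):
    """Pad lists to a common length and return rows indexed by time step.

    result[t][b] is token t of sentence b, i.e. shape (max_len, batch).
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]
    # Transpose from (batch, time) to (time, batch).
    return [list(step) for step in zip(*padded)]

# Two already-numericalized sentences of different lengths (made-up indices).
batch = [[2, 7, 8, 3], [2, 9, 3]]
out = pad_time_major(batch, PAD_IDX)
for row in out:
    print(row)
```

The shorter sentence receives `PAD_IDX` in the last time step, and every column of the result is one sentence.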

### Masking

Recall that in the lecture we introduced the concept of target masking: when predicting a word, the decoder may only look at the earlier words of the target sentence. We implement the masking mechanism below.

In [14]:
def generate_square_subsequent_mask(sz):
    '''
    Generate masking matrix.
    For sz=3, mask is
    [[0., -inf, -inf],
     [0.,   0., -inf],
     [0.,   0.,  0. ]]
    '''
    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask
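
To see why the mask uses `-inf` rather than `0` for blocked positions, the following sketch adds such a mask to a small matrix of attention scores and applies a row-wise softmax. It uses only the standard library; `causal_mask` and `softmax` are illustrative helpers, and the scores are made up.

```python
import math

NEG_INF = float('-inf')

def causal_mask(sz):
    # Row i may attend to columns 0..i; later columns get -inf.
    return [[0.0 if j <= i else NEG_INF for j in range(sz)] for i in range(sz)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [[1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0]]
mask = causal_mask(3)
masked = [[s + m for s, m in zip(srow, mrow)] for srow, mrow in zip(scores, mask)]
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 3) for w in row])
```

After the softmax, every position above the diagonal has weight exactly zero, so position `i` only mixes information from positions `0..i`.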


Note that when preparing the dataloader, we padded short sentences with the special word `<PAD>`. These padding tokens should not participate in training, so we also generate masking tensors that indicate the locations of `<PAD>`.

In [15]:
def create_mask(src, tgt):
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)

    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask
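
As a small illustration of the padding masks returned by `create_mask`, the sketch below builds the same kind of boolean mask from a time-major batch stored as nested lists. `key_padding_mask` is a hypothetical helper; the comparison followed by a transpose mirrors `(src == PAD_IDX).transpose(0, 1)`.

```python
PAD_IDX = 1

def key_padding_mask(time_major_batch, pad_idx):
    """Return mask[b][t] == True where sentence b has padding at step t."""
    by_sentence = zip(*time_major_batch)  # (time, batch) -> (batch, time)
    return [[tok == pad_idx for tok in sent] for sent in by_sentence]

# Time-major batch of two sentences; the second one is padded at the end.
batch = [[2, 2], [7, 9], [8, 3], [3, 1]]
mask = key_padding_mask(batch, PAD_IDX)
for row in mask:
    print(row)
```

Each row of the mask corresponds to one sentence, and `True` marks the positions that attention should skip.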

### Seq2Seq Network using Transformer


<img src="Figs/Transformer_Arc.png" alt="" width="300">

PyTorch has a built-in transformer constructor. We still need to implement our own word embedding and positional encoding modules.

In [16]:
from torch import Tensor
import torch.nn as nn
from torch.nn import Transformer
import math
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# helper Module that adds positional encoding to the token embedding to introduce a notion of word order.
class PositionalEncoding(nn.Module):
    def __init__(self,
                 emb_size: int,
                 dropout: float,
                 maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        den = torch.exp(- torch.arange(0, emb_size, 2)* math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])

# Embedding layer template
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

# Seq2Seq Network 
class Seq2SeqTransformer(nn.Module):
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 emb_size: int,
                 nhead: int,
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 dim_feedforward: int = 512,
                 dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        self.transformer = Transformer(d_model=emb_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward,
                                       dropout=dropout)
        self.linear = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(
            emb_size, dropout=dropout)

    def forward(self,
                src: Tensor,
                trg: Tensor,
                src_mask: Tensor,
                tgt_mask: Tensor,
                src_padding_mask: Tensor,
                tgt_padding_mask: Tensor,
                memory_key_padding_mask: Tensor):
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None, 
                                src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
        return self.linear(outs)

    def encode(self, src: Tensor, src_mask: Tensor):
        return self.transformer.encoder(self.positional_encoding(
                            self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        return self.transformer.decoder(self.positional_encoding(
                          self.tgt_tok_emb(tgt)), memory,
                          tgt_mask)
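
The sinusoidal table built in `PositionalEncoding` can be reproduced with the standard library alone, which makes the individual values easy to inspect. The sketch below assumes an even `emb_size` and uses nested lists instead of a tensor; `positional_encoding` is an illustrative function, not part of PyTorch.

```python
import math

def positional_encoding(maxlen, emb_size):
    """Return a maxlen x emb_size table of sinusoidal position encodings."""
    table = [[0.0] * emb_size for _ in range(maxlen)]
    for pos in range(maxlen):
        for k in range(0, emb_size, 2):
            # Same frequency term as `den` in PositionalEncoding: 10000^(-k / emb_size).
            freq = math.exp(-k * math.log(10000) / emb_size)
            table[pos][k] = math.sin(pos * freq)      # even dimensions use sine
            table[pos][k + 1] = math.cos(pos * freq)  # odd dimensions use cosine
    return table

pe = positional_encoding(maxlen=4, emb_size=4)
for row in pe:
    print([round(v, 4) for v in row])
```

Position 0 is encoded as alternating zeros and ones, and higher dimensions oscillate more slowly, so each position receives a distinct pattern.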

Let's now define the parameters of our model. 

In [17]:
torch.manual_seed(0)
BATCH_SIZE = 196
SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 256
NUM_ENCODER_LAYERS = 5
NUM_DECODER_LAYERS = 5
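
A quick sanity check on these numbers can catch configuration mistakes early. The sketch below relies on the fact that multi-head attention splits the embedding evenly across heads, so `EMB_SIZE` must be divisible by `NHEAD`; it also computes the number of batches per epoch. `TRAIN_LEN`, `head_dim`, and `batches_per_epoch` are names introduced here for illustration.

```python
import math

EMB_SIZE, NHEAD, BATCH_SIZE = 512, 8, 196
TRAIN_LEN = 29000  # number of training sentence pairs, as stated in the training cell

# Each attention head works on an equal slice of the embedding.
assert EMB_SIZE % NHEAD == 0, "embedding size must split evenly across heads"
head_dim = EMB_SIZE // NHEAD

# The last batch of an epoch may be smaller, hence the ceiling.
batches_per_epoch = math.ceil(TRAIN_LEN / BATCH_SIZE)

print(head_dim, batches_per_epoch)
```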

Let's define the training and evaluation loops that will be called for each epoch.




In [18]:
from torch.utils.data import DataLoader
from tqdm.notebook import tqdm

def train_epoch(model, optimizer):
    model.train()
    losses = 0
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)
    # training set has 29000 sentence pairs,
    # valid set has 1014 sentence pairs
    train_len = 29000
    valid_len = 1014
    for src, tgt in tqdm(train_dataloader, total=math.ceil(train_len/BATCH_SIZE)):
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask,src_padding_mask, tgt_padding_mask, src_padding_mask)

        optimizer.zero_grad()

        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

    return losses / math.ceil(train_len/BATCH_SIZE)


def evaluate(model):
    model.eval()
    losses = 0

    val_iter = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    val_dataloader = DataLoader(val_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    # training set has 29000 sentence pairs,
    # valid set has 1014 sentence pairs
    train_len = 29000
    valid_len = 1014

    for src, tgt in val_dataloader:
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask,src_padding_mask, tgt_padding_mask, src_padding_mask)
        
        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        losses += loss.item()

    return losses / math.ceil(valid_len/BATCH_SIZE)
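
Both loops slice the target as `tgt[:-1]` for the decoder input and `tgt[1:]` for the expected output. The sketch below shows this shift on a single sentence stored as a plain list; the token indices 11, 12, 13 are made up for illustration.

```python
BOS_IDX, EOS_IDX = 2, 3

tgt = [BOS_IDX, 11, 12, 13, EOS_IDX]  # hypothetical token indices for one sentence

tgt_input = tgt[:-1]  # what the decoder sees:    <bos> w1 w2 w3
tgt_out = tgt[1:]     # what it should predict:   w1 w2 w3 <eos>

# At step t the decoder reads tgt_input[t] and is trained to output tgt_out[t].
pairs = list(zip(tgt_input, tgt_out))
print(pairs)
```

Combined with the target mask, this means the prediction for each position depends only on the ground-truth words before it.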

We now define the function that translates German sentences into English.

Note that at translation time we only have the German sentence, so we cannot feed the ground-truth English sentence to the decoder. Instead, the decoder is fed the words that the network has predicted so far.

In [19]:
# function to generate output sequence using greedy algorithm 
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len-1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.linear(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys


# actual function to translate input sentence into target language
def translate(model: torch.nn.Module, src_sentence: str):
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")
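
The control flow of greedy decoding can be seen more easily without the model. In the sketch below, `toy_next_token_scores` is a hypothetical stand-in for the decoder followed by the linear layer: it returns scores over a six-word vocabulary and is scripted to prefer index 4, then 5, then `<eos>`. Only the loop structure is meant to match `greedy_decode`.

```python
BOS_IDX, EOS_IDX = 2, 3

def toy_next_token_scores(prefix):
    """Hypothetical scorer: prefers 4 after <bos>, then 5, then <eos>."""
    script = {1: 4, 2: 5}
    favourite = script.get(len(prefix), EOS_IDX)
    return [1.0 if idx == favourite else 0.0 for idx in range(6)]

def greedy_decode_toy(score_fn, max_len, start_symbol):
    ys = [start_symbol]
    for _ in range(max_len - 1):
        scores = score_fn(ys)
        # Greedy choice: take the single highest-scoring word.
        next_word = max(range(len(scores)), key=scores.__getitem__)
        ys.append(next_word)
        if next_word == EOS_IDX:
            break  # stop as soon as the end-of-sentence symbol is produced
    return ys

decoded = greedy_decode_toy(toy_next_token_scores, max_len=10, start_symbol=BOS_IDX)
print(decoded)
```

Decoding stops either when `<eos>` is produced or when `max_len` is reached, whichever comes first.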

Let us initialize an instance of our transformer model and use it to translate a German sentence.

In [20]:
transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE, 
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

DE_sentence = "Eine Gruppe von Menschen steht vor einem Iglu ."

transformer = transformer.to(DEVICE)
print('Translated sentence for randomly initialized transformer')
print('-'*80)
print('DE sentence: {}'.format(DE_sentence))
print('Translated sentence: {}'.format(translate(transformer, DE_sentence)))
print('-'*80)

Translated sentence for randomly initialized transformer
--------------------------------------------------------------------------------
DE sentence: Eine Gruppe von Menschen steht vor einem Iglu .
Translated sentence:  hanger hanger hanger hanger hanger synchronized Barker hanger Barker hanger Barker Barker hanger hanger hanger
--------------------------------------------------------------------------------


We use cross entropy as the loss function and Adam (an optimizer that combines momentum with per-parameter adaptive learning rates) as the optimizer. Now we have all the ingredients to train our model. Let's do it!
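
Because padded positions carry no information, the loss should skip them. The sketch below computes a mean cross entropy over non-padding targets with the standard library only, as a rough illustration of what `ignore_index=PAD_IDX` achieves; `cross_entropy_ignore_pad` is a hypothetical helper, the logits are made up, and the function assumes at least one non-padding target.

```python
import math

PAD_IDX = 1

def cross_entropy_ignore_pad(logits, targets, ignore_index):
    """Mean negative log-likelihood over targets that are not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding positions contribute nothing to the loss
        m = max(row)
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_norm - row[target]
        count += 1
    return total / count

logits = [[0.0, 0.0, 2.0, 0.0],   # target is word 2
          [0.0, 0.0, 0.0, 0.0],   # target is word 3
          [5.0, 0.0, 0.0, 0.0]]   # target is <pad>, so this row is skipped
targets = [2, 3, PAD_IDX]
loss = cross_entropy_ignore_pad(logits, targets, PAD_IDX)
print(round(loss, 4))
```

Changing the logits in the padded row leaves the loss unchanged, which is the behavior we want during training.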

In [21]:
from timeit import default_timer as timer
NUM_EPOCHS = 10

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE, 
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

transformer = transformer.to(DEVICE)

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

for epoch in range(1, NUM_EPOCHS+1):
    start_time = timer()
    train_loss = train_epoch(transformer, optimizer)
    end_time = timer()
    val_loss = evaluate(transformer)
    print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "f"Epoch time = {(end_time - start_time):.3f}s"))


Epoch: 1, Train loss: 6.966, Val loss: 5.475, Epoch time = 422.088s


Epoch: 2, Train loss: 5.021, Val loss: 4.557, Epoch time = 428.845s


Epoch: 3, Train loss: 4.323, Val loss: 4.027, Epoch time = 447.127s


In [22]:
print('Translated sentence for trained transformer')
print('-'*80)
print(translate(transformer, "Eine Gruppe von Menschen steht vor einem Iglu ."))
print('-'*80)

Translated sentence for trained transformer
--------------------------------------------------------------------------------
 A group of people are are in a street . 
--------------------------------------------------------------------------------
