# Chapter 10: Train A Transformer to Translate English to French

This chapter covers

* Tokenizing English and French phrases to subwords
* Understanding word embedding and positional encoding 
* Training a Transformer from scratch to translate English to French
* Using the trained Transformer to translate an English phrase into French

In the last chapter, we built a Transformer from scratch that can translate between any two languages, based on the paper Attention Is All You Need.  Specifically, we implemented the self-attention mechanism, using query, key, and value vectors to calculate scaled dot product attention (SDPA). 

To have a deeper understanding of self-attention and Transformers, we'll use English-to-French translation as our case study in this chapter. By exploring the process of training a model for converting English sentences into French, you will gain a comprehensive understanding of the Transformer's architecture and the functioning of the attention mechanism. 

Picture yourself having amassed a collection of over 47,000 English-to-French translation pairs. Your objective is to train the encoder-decoder Transformer from the last chapter using this dataset. This chapter will walk you through all phases of the project. You’ll first use subword tokenization to break English and French phrases into tokens. You’ll then build your English and French vocabularies that contain all unique tokens in each language. The vocabularies allow you to represent English and French phrases as sequences of indexes. After that, you’ll use word embedding to transform these indexes (essentially one-hot vectors) into compact vector representations. We’ll add positional encodings to the word embeddings to form input embeddings. Positional encodings allow the Transformer to know the ordering of tokens in the sequence. 

Finally, you’ll train the encoder-decoder Transformer from Chapter 9 to translate English to French by using the collection of English-to-French translations. After training, you’ll learn to translate common English phrases with the trained Transformer. Specifically, you’ll use the encoder to capture the meaning of the English phrase. You’ll then use the decoder in the trained Transformer to generate the French translation in an autoregressive manner, starting with the beginning token "BOS". In each time step, the decoder generates the most likely next token based on previously generated tokens and the encoder’s output, until the predicted token is "EOS", which signals the end of the sentence. The trained model can translate common English phrases accurately as if you were using Google Translate for the task.

# 1	Subword tokenization

## 1.1	Tokenize English and French phrases
First go to https://gattonweb.uky.edu/faculty/lium/gai/en2fr.zip to download zip file that contains the 47,000 English to French translations that I collected from various sources. Unzip the file and place en2fr.csv in the folder /files/ on your computer. We'll load the data and take a look as follows:

In [1]:
import pandas as pd

df=pd.read_csv("files/en2fr.csv")
num_examples=len(df)
print(f"there are {num_examples} examples in the training data")
print(df.iloc[30856]["en"])
print(df.iloc[30856]["fr"])

there are 47173 examples in the training data
How are you?
Comment êtes-vous?


In [2]:
!pip install transformers



In [3]:
from transformers import XLMTokenizer

tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")

tokenized_en=tokenizer.tokenize("I don't speak French.")
print(tokenized_en)
tokenized_fr=tokenizer.tokenize("Je ne parle pas français.")
print(tokenized_fr)
print(tokenizer.tokenize("How are you?"))
print(tokenizer.tokenize("Comment êtes-vous?"))

['i</w>', 'don</w>', "'t</w>", 'speak</w>', 'fr', 'ench</w>', '.</w>']
['je</w>', 'ne</w>', 'parle</w>', 'pas</w>', 'franc', 'ais</w>', '.</w>']
['how</w>', 'are</w>', 'you</w>', '?</w>']
['comment</w>', 'et', 'es-vous</w>', '?</w>']


In [4]:
# build dictionaries
from collections import Counter

en=df["en"].tolist()

en_tokens=[["BOS"]+tokenizer.tokenize(x)+["EOS"] for x in en]        
PAD=0
UNK=1
# apply to English 
word_count=Counter()
for sentence in en_tokens:
    for word in sentence:
        word_count[word]+=1
frequency=word_count.most_common(50000)        
total_en_words=len(frequency)+2
en_word_dict={w[0]:idx+2 for idx,w in enumerate(frequency)}
en_word_dict["PAD"]=PAD
en_word_dict["UNK"]=UNK
# another dictionary to map numbers to tokens
en_idx_dict={v:k for k,v in en_word_dict.items()}

In [5]:
enidx=[en_word_dict.get(i,UNK) for i in tokenized_en]   
print(enidx)

[15, 100, 38, 377, 476, 574, 5]


In [6]:
entokens=[en_idx_dict.get(i,"UNK") for i in enidx]   
print(entokens)
en_phrase="".join(entokens)
en_phrase=en_phrase.replace("</w>"," ")
for x in '''?:;.,'("-!&)%''':
    en_phrase=en_phrase.replace(f" {x}",f"{x}")   
print(en_phrase)

['i</w>', 'don</w>', "'t</w>", 'speak</w>', 'fr', 'ench</w>', '.</w>']
i don't speak french. 


In [7]:
# exercise 10.1
tokens=['how</w>', 'are</w>', 'you</w>', '?</w>']
indexes=[en_word_dict.get(i,UNK) for i in tokens]   
print(indexes)
tokens=[en_idx_dict.get(i,"UNK") for i in indexes]   
print(tokens)
phrase="".join(tokens)
phrase=phrase.replace("</w>"," ")
for x in '''?:;.,'("-!&)%''':
    phrase=phrase.replace(f" {x}",f"{x}")   
print(phrase)

[157, 17, 22, 26]
['how</w>', 'are</w>', 'you</w>', '?</w>']
how are you? 


In [8]:
# do the same for French phrases
fr=df["fr"].tolist()       
fr_tokens=[["BOS"]+tokenizer.tokenize(x)+["EOS"] for x in fr] 
word_count=Counter()
for sentence in fr_tokens:
    for word in sentence:
        word_count[word]+=1
frequency=word_count.most_common(50000)        
total_fr_words=len(frequency)+2
fr_word_dict={w[0]:idx+2 for idx,w in enumerate(frequency)}
fr_word_dict["PAD"]=PAD
fr_word_dict["UNK"]=UNK
fr_idx_dict={v:k for k,v in fr_word_dict.items()}

In [9]:
fridx=[fr_word_dict.get(i,UNK) for i in tokenized_fr]   
print(fridx)

[28, 40, 231, 32, 726, 370, 4]


In [10]:
frtokens=[fr_idx_dict.get(i,"UNK") for i in fridx]   
print(frtokens)
fr_phrase="".join(frtokens)
fr_phrase=fr_phrase.replace("</w>"," ")
for x in '''?:;.,'("-!&)%''':
    fr_phrase=fr_phrase.replace(f" {x}",f"{x}")  
print(fr_phrase)

['je</w>', 'ne</w>', 'parle</w>', 'pas</w>', 'franc', 'ais</w>', '.</w>']
je ne parle pas francais. 


In [11]:
# exercise 10.2
tokens=['comment</w>', 'et', 'es-vous</w>', '?</w>']
indexes=[fr_word_dict.get(i,UNK) for i in tokens]   
print(indexes)
tokens=[fr_idx_dict.get(i,"UNK") for i in indexes]   
print(tokens)
phrase="".join(tokens)
phrase=phrase.replace("</w>"," ")
for x in '''?:;.,'("-!&)%''':
    phrase=phrase.replace(f" {x}",f"{x}")   
print(phrase)

[452, 61, 742, 30]
['comment</w>', 'et', 'es-vous</w>', '?</w>']
comment etes-vous? 


In [12]:
import pickle

with open("files/dict.p","wb") as fb:
    pickle.dump((en_word_dict,en_idx_dict,
                 fr_word_dict,fr_idx_dict),fb)

## 1.2. Sequence Padding and Batch Creation


In [13]:
out_en_ids=[[en_word_dict.get(w,UNK) for w in s] for s in en_tokens]
out_fr_ids=[[fr_word_dict.get(w,UNK) for w in s] for s in fr_tokens]
sorted_ids=sorted(range(len(out_en_ids)),
                  key=lambda x:len(out_en_ids[x]))
out_en_ids=[out_en_ids[x] for x in sorted_ids]
out_fr_ids=[out_fr_ids[x] for x in sorted_ids]

In [14]:
import numpy as np

batch_size=128
idx_list=np.arange(0,len(en_tokens),batch_size)
np.random.shuffle(idx_list)

batch_indexs=[]
for idx in idx_list:
    batch_indexs.append(np.arange(idx,min(len(en_tokens),
                                          idx+batch_size)))

In [15]:
def seq_padding(X, padding=0):
    L = [len(x) for x in X]
    ML = max(L)
    padded_seq = np.array([np.concatenate([x, [padding] * (ML - len(x))])
        if len(x) < ML else x for x in X])
    return padded_seq

The following class is defined in the local module ch09util.py

```Python
import torch
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# define the Batch class
class Batch:
    def __init__(self, src, trg=None, pad=0):
        src = torch.from_numpy(src).to(DEVICE).long()
        self.src = src
        self.src_mask = (src != pad).unsqueeze(-2)
        if trg is not None:
            trg = torch.from_numpy(trg).to(DEVICE).long()
            self.trg = trg[:, :-1]
            self.trg_y = trg[:, 1:]
            self.trg_mask = make_std_mask(self.trg, pad)
            self.ntokens = (self.trg_y != pad).data.sum()
```

```python
import numpy as np
def subsequent_mask(size):
    attn_shape = (1, size, size)
    subsequent_mask = np.triu(np.ones(attn_shape),
                              k=1).astype('uint8')
    output = torch.from_numpy(subsequent_mask) == 0
    return output

def make_std_mask(tgt, pad):
    tgt_mask = (tgt != pad).unsqueeze(-2)
    output = tgt_mask & subsequent_mask(tgt.size(-1)).type_as(tgt_mask.data)
    return output 
```

In [16]:
from utils.ch09util import Batch

class BatchLoader():
    def __init__(self):
        self.idx=0
    def __iter__(self):
        return self
    def __next__(self):
        self.idx += 1
        if self.idx<=len(batch_indexs):
            b=batch_indexs[self.idx-1]
            batch_en=[out_en_ids[x] for x in b]
            batch_fr=[out_fr_ids[x] for x in b]
            batch_en=seq_padding(batch_en)
            batch_fr=seq_padding(batch_fr)
            return Batch(batch_en,batch_fr)
        raise StopIteration

# 2	Word embedding and positional encoding
## 2.1. Word Embedding


In [17]:
src_vocab = len(en_word_dict)
tgt_vocab = len(fr_word_dict)
print(f"there are {src_vocab} distinct English tokens")
print(f"there are {tgt_vocab} distinct French tokens")

there are 11055 distinct English tokens
there are 11239 distinct French tokens


```python
import math

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super().__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        out = self.lut(x) * math.sqrt(self.d_model)
        return out
```

## 2.1. Positional Encoding
To model the order of elements in the input and output sequences, we'll first create positional encodings of the sequences as follows:

```python
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model, device=DEVICE)
        position = torch.arange(0., max_len, 
                                device=DEVICE).unsqueeze(1)
        div_term = torch.exp(torch.arange(
            0., d_model, 2, device=DEVICE)
            * -(math.log(10000.0) / d_model))
        pe_pos = torch.mul(position, div_term)
        pe[:, 0::2] = torch.sin(pe_pos)
        pe[:, 1::2] = torch.cos(pe_pos)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)  

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)].requires_grad_(False)
        out = self.dropout(x)
        return out
```

In [18]:
from utils.ch09util import PositionalEncoding
import torch
DEVICE = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

pe = PositionalEncoding(256, 0.1)
x = torch.zeros(1, 8, 256).to(DEVICE)
y = pe.forward(x)
print(f"the shape of positional encoding is {y.shape}")
print(y)

the shape of positional encoding is torch.Size([1, 8, 256])
tensor([[[ 0.0000e+00,  1.1111e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  1.1111e+00],
         [ 9.3497e-01,  0.0000e+00,  8.9107e-01,  ...,  0.0000e+00,
           1.1940e-04,  1.1111e+00],
         [ 1.0103e+00, -0.0000e+00,  1.0646e+00,  ...,  1.1111e+00,
           2.3880e-04,  1.1111e+00],
         ...,
         [-1.0655e+00,  3.1518e-01, -1.1091e+00,  ...,  1.1111e+00,
           5.9700e-04,  1.1111e+00],
         [-3.1046e-01,  1.0669e+00, -7.1559e-01,  ...,  1.1111e+00,
           7.1641e-04,  1.1111e+00],
         [ 0.0000e+00,  8.3767e-01,  2.5419e-01,  ...,  1.1111e+00,
           8.3581e-04,  0.0000e+00]]], device='mps:0')


# 3	Train the Transformer for English-to-French translation

## 3.1 Loss Function and the Optimizer



In [19]:
from utils.ch09util import create_model

model = create_model(src_vocab, tgt_vocab, N=6,
    d_model=256, d_ff=1024, h=8, dropout=0.1)

We define the following class in the local module:

```python
class LabelSmoothing(nn.Module):
    def __init__(self, size, padding_idx, smoothing=0.1):
        super().__init__()
        self.criterion = nn.KLDivLoss(reduction='sum')  
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None

    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        true_dist.fill_(self.smoothing / (self.size - 2))
        true_dist.scatter_(1, 
               target.data.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() > 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        output = self.criterion(x, true_dist.clone().detach())
        return output
```

```python
class NoamOpt:
    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0

    def step(self):
        self._step += 1
        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()

    def rate(self, step=None):
        if step is None:
            step = self._step
        output = self.factor * (self.model_size ** (-0.5) *
        min(step ** (-0.5), step * self.warmup ** (-1.5)))
        return output
```

We create the optimizer for training as follows:

In [20]:
from utils.ch09util import NoamOpt

optimizer = NoamOpt(256, 1, 2000, torch.optim.Adam(
    model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))

To create the loss function for training, we first define the following class in the local module:

```python
class SimpleLossCompute:
    def __init__(self, generator, criterion, opt=None):
        self.generator = generator
        self.criterion = criterion
        self.opt = opt

    def __call__(self, x, y, norm):
        x = self.generator(x)
        loss = self.criterion(x.contiguous().view(-1, x.size(-1)),
                              y.contiguous().view(-1)) / norm
        loss.backward()
        if self.opt is not None:
            self.opt.step()
            self.opt.optimizer.zero_grad()
        return loss.data.item() * norm.float()
```

We then define the loss function as follows:

In [21]:
from utils.ch09util import (LabelSmoothing,
       SimpleLossCompute)

criterion = LabelSmoothing(tgt_vocab, 
                           padding_idx=0, smoothing=0.1)
loss_func = SimpleLossCompute(
            model.generator, criterion, optimizer)

## 3.2 The training loop
We'll train the model for 100 epochs. We'll calculate the loss and the number of tokens from each batch. After each epoch, we calculate the average loss in the epoch as the ratio between the total loss and the total number of tokens:

In [22]:
from tqdm import tqdm
# train for 100 epochs
for epoch in range(100):
    model.train()
    tloss=0
    tokens=0
    for batch in tqdm(BatchLoader()):
        out = model(batch.src, batch.trg, 
                    batch.src_mask, batch.trg_mask)
        loss = loss_func(out, batch.trg_y, batch.ntokens)
        tloss += loss
        tokens += batch.ntokens
    print(f"Epoch {epoch}, average loss: {tloss/tokens}")
torch.save(model.state_dict(),"files/en2fr.pth")   

369it [02:07,  2.90it/s]


Epoch 0, average loss: 5.847522735595703


369it [01:53,  3.24it/s]


Epoch 1, average loss: 3.689833641052246


369it [01:53,  3.25it/s]


Epoch 2, average loss: 2.9376380443573


369it [01:54,  3.23it/s]


Epoch 3, average loss: 2.2890236377716064


369it [01:53,  3.24it/s]


Epoch 4, average loss: 1.8289636373519897


369it [01:53,  3.24it/s]


Epoch 5, average loss: 1.603419303894043


369it [01:53,  3.24it/s]


Epoch 6, average loss: 1.4144387245178223


369it [01:53,  3.24it/s]


Epoch 7, average loss: 1.2825123071670532


369it [01:55,  3.20it/s]


Epoch 8, average loss: 1.1661337614059448


369it [01:55,  3.21it/s]


Epoch 9, average loss: 1.0830228328704834


369it [01:54,  3.23it/s]


Epoch 10, average loss: 1.0090007781982422


369it [01:54,  3.24it/s]


Epoch 11, average loss: 0.943182110786438


369it [01:54,  3.23it/s]


Epoch 12, average loss: 0.8932991623878479


369it [01:54,  3.23it/s]


Epoch 13, average loss: 0.8491826057434082


369it [01:53,  3.25it/s]


Epoch 14, average loss: 0.8029215335845947


369it [01:53,  3.26it/s]


Epoch 15, average loss: 0.763870120048523


369it [01:53,  3.24it/s]


Epoch 16, average loss: 0.7349712252616882


369it [01:53,  3.25it/s]


Epoch 17, average loss: 0.7014152407646179


369it [01:53,  3.25it/s]


Epoch 18, average loss: 0.6720163822174072


369it [01:53,  3.25it/s]


Epoch 19, average loss: 0.6478269100189209


369it [01:53,  3.25it/s]


Epoch 20, average loss: 0.6248847246170044


369it [01:53,  3.25it/s]


Epoch 21, average loss: 0.6040297746658325


369it [01:53,  3.25it/s]


Epoch 22, average loss: 0.5849698781967163


369it [01:53,  3.25it/s]


Epoch 23, average loss: 0.5691565275192261


369it [01:54,  3.23it/s]


Epoch 24, average loss: 0.5525315403938293


369it [01:53,  3.24it/s]


Epoch 25, average loss: 0.5350983142852783


369it [01:53,  3.24it/s]


Epoch 26, average loss: 0.5215675830841064


369it [01:54,  3.23it/s]


Epoch 27, average loss: 0.5070772767066956


369it [01:53,  3.25it/s]


Epoch 28, average loss: 0.4940132796764374


369it [01:53,  3.25it/s]


Epoch 29, average loss: 0.4823126196861267


369it [01:53,  3.25it/s]


Epoch 30, average loss: 0.472593754529953


369it [01:53,  3.25it/s]


Epoch 31, average loss: 0.4633314609527588


369it [01:53,  3.25it/s]


Epoch 32, average loss: 0.45213305950164795


369it [01:53,  3.25it/s]


Epoch 33, average loss: 0.44409289956092834


369it [01:53,  3.25it/s]


Epoch 34, average loss: 0.4327544569969177


369it [01:53,  3.24it/s]


Epoch 35, average loss: 0.42592698335647583


369it [01:53,  3.24it/s]


Epoch 36, average loss: 0.41850852966308594


369it [01:53,  3.25it/s]


Epoch 37, average loss: 0.41082173585891724


369it [01:53,  3.24it/s]


Epoch 38, average loss: 0.40383151173591614


369it [01:53,  3.24it/s]


Epoch 39, average loss: 0.3972686529159546


369it [01:53,  3.24it/s]


Epoch 40, average loss: 0.38946449756622314


369it [01:53,  3.24it/s]


Epoch 41, average loss: 0.3862106502056122


369it [01:54,  3.24it/s]


Epoch 42, average loss: 0.379559189081192


369it [01:54,  3.22it/s]


Epoch 43, average loss: 0.37567535042762756


369it [01:53,  3.25it/s]


Epoch 44, average loss: 0.36888596415519714


369it [01:53,  3.24it/s]


Epoch 45, average loss: 0.36327412724494934


369it [01:53,  3.24it/s]


Epoch 46, average loss: 0.3613062798976898


369it [01:53,  3.25it/s]


Epoch 47, average loss: 0.3563258945941925


369it [01:53,  3.24it/s]


Epoch 48, average loss: 0.35198041796684265


369it [01:53,  3.25it/s]


Epoch 49, average loss: 0.3468976616859436


369it [01:53,  3.25it/s]


Epoch 50, average loss: 0.34578049182891846


369it [01:53,  3.25it/s]


Epoch 51, average loss: 0.34015193581581116


369it [01:53,  3.25it/s]


Epoch 52, average loss: 0.3366280794143677


369it [01:53,  3.25it/s]


Epoch 53, average loss: 0.332644522190094


369it [01:54,  3.23it/s]


Epoch 54, average loss: 0.3293381631374359


369it [01:54,  3.24it/s]


Epoch 55, average loss: 0.3263861835002899


369it [01:54,  3.23it/s]


Epoch 56, average loss: 0.3213471472263336


369it [01:53,  3.24it/s]


Epoch 57, average loss: 0.31845811009407043


369it [01:54,  3.23it/s]


Epoch 58, average loss: 0.3162967562675476


369it [01:53,  3.25it/s]


Epoch 59, average loss: 0.31308597326278687


369it [01:54,  3.23it/s]


Epoch 60, average loss: 0.3104373812675476


369it [01:54,  3.22it/s]


Epoch 61, average loss: 0.30830684304237366


369it [01:54,  3.23it/s]


Epoch 62, average loss: 0.3045060932636261


369it [01:54,  3.23it/s]


Epoch 63, average loss: 0.30258941650390625


369it [01:54,  3.23it/s]


Epoch 64, average loss: 0.30039340257644653


369it [01:54,  3.23it/s]


Epoch 65, average loss: 0.29789379239082336


369it [01:54,  3.23it/s]


Epoch 66, average loss: 0.2949216067790985


369it [01:54,  3.22it/s]


Epoch 67, average loss: 0.29301711916923523


369it [01:54,  3.23it/s]


Epoch 68, average loss: 0.29126012325286865


369it [01:54,  3.23it/s]


Epoch 69, average loss: 0.28894802927970886


369it [01:54,  3.23it/s]


Epoch 70, average loss: 0.28648027777671814


369it [01:54,  3.23it/s]


Epoch 71, average loss: 0.28499600291252136


369it [01:54,  3.23it/s]


Epoch 72, average loss: 0.2832731306552887


369it [01:54,  3.23it/s]


Epoch 73, average loss: 0.2806107997894287


369it [01:54,  3.24it/s]


Epoch 74, average loss: 0.27935734391212463


369it [01:54,  3.24it/s]


Epoch 75, average loss: 0.277267187833786


369it [01:54,  3.24it/s]


Epoch 76, average loss: 0.2759348750114441


369it [01:54,  3.23it/s]


Epoch 77, average loss: 0.2743316888809204


369it [01:54,  3.23it/s]


Epoch 78, average loss: 0.2727672755718231


369it [01:54,  3.23it/s]


Epoch 79, average loss: 0.27027732133865356


369it [01:54,  3.23it/s]


Epoch 80, average loss: 0.2680286467075348


369it [01:53,  3.24it/s]


Epoch 81, average loss: 0.26699298620224


369it [01:54,  3.23it/s]


Epoch 82, average loss: 0.2660874128341675


369it [01:54,  3.22it/s]


Epoch 83, average loss: 0.2651960253715515


369it [01:54,  3.23it/s]


Epoch 84, average loss: 0.26356935501098633


369it [01:54,  3.22it/s]


Epoch 85, average loss: 0.2613799273967743


369it [01:54,  3.23it/s]


Epoch 86, average loss: 0.26018279790878296


369it [01:54,  3.22it/s]


Epoch 87, average loss: 0.25927796959877014


369it [01:54,  3.23it/s]


Epoch 88, average loss: 0.25833258032798767


369it [01:54,  3.24it/s]


Epoch 89, average loss: 0.25644704699516296


369it [01:53,  3.24it/s]


Epoch 90, average loss: 0.25532129406929016


369it [01:54,  3.23it/s]


Epoch 91, average loss: 0.2546718120574951


369it [01:54,  3.23it/s]


Epoch 92, average loss: 0.2522023916244507


369it [01:53,  3.24it/s]


Epoch 93, average loss: 0.2518562376499176


369it [01:54,  3.23it/s]


Epoch 94, average loss: 0.2501901388168335


369it [01:54,  3.23it/s]


Epoch 95, average loss: 0.25037461519241333


369it [01:53,  3.24it/s]


Epoch 96, average loss: 0.24857191741466522


369it [01:54,  3.24it/s]


Epoch 97, average loss: 0.24719516932964325


369it [01:54,  3.23it/s]


Epoch 98, average loss: 0.2453359067440033


369it [01:54,  3.23it/s]

Epoch 99, average loss: 0.24515776336193085





The above training process takes a couple of hours if you are using a GPU. It may take several hours if you are using CPU training. Once the training is done, the model weights are saved as *en2fr.pth* on your computer. 

# 4. Translate English to French with the Trained Model


In [23]:
def translate(eng):
    # tokenize the English sentence
    tokenized_en=tokenizer.tokenize(eng)
    # add beginning and end tokens
    tokenized_en=["BOS"]+tokenized_en+["EOS"]
    # convert tokens to indexes
    enidx=[en_word_dict.get(i,UNK) for i in tokenized_en]  
    src=torch.tensor(enidx).long().to(DEVICE).unsqueeze(0)
    # create mask to hide padding
    src_mask=(src!=0).unsqueeze(-2)
    # encode the English sentence
    memory=model.encode(src,src_mask)
    # start translation in an autogressive fashion
    start_symbol=fr_word_dict["BOS"]
    ys = torch.ones(1, 1).fill_(start_symbol).type_as(src.data)
    translation=[]
    for i in range(100):
        out = model.decode(memory,src_mask,ys,
        subsequent_mask(ys.size(1)).type_as(src.data))
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        ys = torch.cat([ys, torch.ones(1, 1).type_as(
            src.data).fill_(next_word)], dim=1)
        sym = fr_idx_dict[ys[0, -1].item()]
        if sym != 'EOS':
            translation.append(sym)
        else:
            break
    # convert tokens to sentences
    trans="".join(translation)
    trans=trans.replace("</w>"," ") 
    for x in '''?:;.,'("-!&)%''':
        trans=trans.replace(f" {x}",f"{x}")    
    print(trans)
    return trans

Let's try the defined function on the English phrase "Today is a beautiful day!", like so:

In [24]:
from utils.ch09util import subsequent_mask

with open("files/dict.p","rb") as fb:
    en_word_dict,en_idx_dict,\
    fr_word_dict,fr_idx_dict=pickle.load(fb)
trained_weights=torch.load("files/my_en2fr.pth",
                           map_location=DEVICE)
model.load_state_dict(trained_weights)
model.eval()
eng = "Today is a beautiful day!"
translated_fr = translate(eng)

UNKtroUNKUNKplantambour contre............................................................................................. 


In [25]:
eng = "A little boy in jeans climbs a small tree while another child looks on."
translated_fr = translate(eng)

UNKregrette atiUNK18UNKUNKUNKUNKwar18UNKUNKUNKUNKwar18UNKUNKUNKun. un. un.............. un.. un. un. un. un..... un. un. un. un............. un................... un..... 


In [26]:
eng = "I don't speak French."
translated_fr = translate(eng)

UNKUNKUNKUNKactUNKun. un. un. un. un.......................... un. un.......... un. un. un. un..... un. un... un. un. un... un. un........... un. un... 


Now let's try the sentence "I do not speak French."

In [27]:
eng = "I do not speak French."
translated_fr = translate(eng)

UNKUNKUNKUNKactUNKun. un. un. un. un................. un..... un. un. un. un. un....... un. un. un. un..... un. un... un. un. un.... un....... un... un. un.... 


In [28]:
# exercise 10.3
eng = "I love skiing in the winter!"
translated_fr = translate(eng)
eng = "How are you?"
translated_fr = translate(eng)

vend UNKUNKruUNKdiscuwarUNKcontre. contre. contre....................................................................................... 
UNKles. UNKles. les. les. les. les. les..................................................................................... 
