# Portuguese to English Translation

In this notebook, I load the filtered dataset from the previous notebook and then train an encoder-decoder model to translate short Portuguese phrases to English. 

First I import the libraries that I need:

In [1]:
import torch
from torch import nn
from torch.nn import functional as F
from collections import Counter
import re
import pandas as pd
import numpy as np
import random

## Reproducibility

I set the random seeds for PyTorch, NumPy and Python's `random` library so others can get the same results:

In [2]:
torch.manual_seed(5)
np.random.seed(3)
random.seed(0)

## Load DataFrame

I load in the pickled DataFrame from the previous notebook, which is already in the correct format for use in this notebook:

In [3]:
df = pd.read_pickle('short_df.p')

## Split

To split the data into training and validation sets, I first shuffle the rows of the DataFrame:

In [4]:
# shuffle
df = df.sample(frac=1, random_state=7)

In [5]:
len(df)

45812

Then I choose a split point that will put 90% of the items in the training set and 10% in the validation set:

In [6]:
cut = int(len(df)*.9)
cut

41230

In [7]:
df_train = df[:cut]
df_val = df[cut:]

In [8]:
len(df_train), len(df_val)

(41230, 4582)

## Tokenize

I need to tokenize the Portuguese text separately from the English text, since the two languages have distinct vocabularies (apart from some overlap, such as proper nouns). First I create flat lists of all the tokens in the training set, one list per language:

In [9]:
pt_alltoks = []
for sent in df_train['pt_toks']:
    for tok in sent:
        pt_alltoks.append(tok)

In [10]:
en_alltoks = []
for sent in df_train['en_toks']:
    for tok in sent:
        en_alltoks.append(tok)

Then I create the vocab lists, sorted from most to least common. The ordering isn't necessary, but it's nice for inspection:

In [11]:
pt_vocab = [tok for tok, count in Counter(pt_alltoks).most_common()]
en_vocab = [tok for tok, count in Counter(en_alltoks).most_common()]
len(pt_vocab), len(en_vocab)

(14872, 10715)

Then I add special tokens to each vocab list:
- UNK is for unknown tokens (which may appear in the validation set but not in the training set)
- PAD is for padding the ends of sequences which are not as long as others in the same batch
- START marks the start of each sentence
- END marks the end of each sentence

In [12]:
special_toks = ['UNK', 'PAD', 'START', 'END']
pt_vocab = special_toks + pt_vocab
en_vocab = special_toks + en_vocab
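As a quick illustration of the vocab construction above, here is a minimal sketch on a made-up token list (`toy_toks` and `toy_vocab` are hypothetical names, not part of the notebook). It shows that, because the special tokens are prepended, they always receive the indices 0 to 3:

```python
from collections import Counter

# Toy token list standing in for pt_alltoks (made-up data)
toy_toks = ['a', 'casa', 'a', 'rua', 'a', 'casa']

# most_common() orders tokens by descending count
toy_vocab = [tok for tok, count in Counter(toy_toks).most_common()]

# Prepend the special tokens, as in the cell above
special_toks = ['UNK', 'PAD', 'START', 'END']
toy_vocab = special_toks + toy_vocab

for num, tok in enumerate(toy_vocab):
    print(num, tok)
# 0 UNK
# 1 PAD
# 2 START
# 3 END
# 4 a
# 5 casa
# 6 rua
```

This fixed ordering is what lets later cells refer to PAD as index 1 and START as index 2.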

Finally I make a tokenization function that extracts all tokens from a string by matching either words or runs of punctuation, never mixing the two (so no token contains both a word and a punctuation mark):

In [13]:
def tokenize_sentence(sent):
    return re.findall(r'\w+|[^\w\s]+', sent) # matches words or punctuation
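To see what this pattern does in practice, here is a small usage sketch with two sample strings of my own choosing. Note that a hyphenated form is split into three tokens, and that consecutive punctuation marks are kept together as a single token because of the `+` in `[^\w\s]+`:

```python
import re

def tokenize_sentence(sent):
    # \w+ matches a run of word characters (accented letters included),
    # [^\w\s]+ matches a run of anything that is neither word nor whitespace
    return re.findall(r'\w+|[^\w\s]+', sent)

print(tokenize_sentence('precisamos dar-lhes tempo.'))
# ['precisamos', 'dar', '-', 'lhes', 'tempo', '.']

print(tokenize_sentence('é a única coisa que importa!!'))
# ['é', 'a', 'única', 'coisa', 'que', 'importa', '!!']
```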

## Numericalize

I convert tokens into integers for use in the neural network by assigning each token a number based on its index in its vocab list:

In [14]:
pt_tok2num = {tok:num for num,tok in enumerate(pt_vocab)}
pt_num2tok = {num:tok for num,tok in enumerate(pt_vocab)}

In [15]:
en_tok2num = {tok:num for num,tok in enumerate(en_vocab)}
en_num2tok = {num:tok for num,tok in enumerate(en_vocab)}

Then I make functions that can deal with out-of-dictionary tokens by replacing them with the UNK token:

In [16]:
def pt_numericalize(tok):
    return pt_tok2num.get(tok, pt_tok2num['UNK']) # fall back to UNK for unseen tokens

In [17]:
def en_numericalize(tok):
    return en_tok2num.get(tok, en_tok2num['UNK']) # fall back to UNK for unseen tokens
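The UNK fallback can be checked on a tiny vocabulary. This is a sketch with hypothetical names (`toy_vocab`, `toy_tok2num`, `toy_numericalize`); it uses `dict.get` with a default, which behaves the same as an explicit membership test:

```python
toy_vocab = ['UNK', 'PAD', 'START', 'END', 'a', 'casa']
toy_tok2num = {tok: num for num, tok in enumerate(toy_vocab)}

def toy_numericalize(tok):
    # dict.get returns the UNK index when tok is not in the vocab
    return toy_tok2num.get(tok, toy_tok2num['UNK'])

# 'azul' is not in the toy vocab, so it maps to UNK (index 0)
print([toy_numericalize(t) for t in ['a', 'casa', 'azul']])
# [4, 5, 0]
```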

I make encode functions that tokenize a sentence, add the special tokens, and convert the result to a tensor. The Portuguese version adds both START and END:

In [18]:
def pt_encode(sent):
    toks = ['START'] + tokenize_sentence(sent) + ['END']
    nums = [pt_numericalize(tok) for tok in toks]
    tens = torch.tensor(nums)
    return tens

The model's decoder will add the START token to English sentences, so I only need to add the END token:

In [19]:
def en_encode(sent):
    toks = tokenize_sentence(sent) + ['END']
    nums = [en_numericalize(tok) for tok in toks]
    tens = torch.tensor(nums)
    return tens

The decode functions do the opposite:

In [20]:
def pt_decode(tens):
    nums = tens.numpy().tolist()
    nums = nums[1:-1] # drop START and the last position (END, or PAD if the row is padded)
    toks = [pt_num2tok[num] for num in nums]
    sentence = ' '.join(toks)
    return sentence

In [21]:
def en_decode(tens):
    nums = tens.numpy().tolist()
    nums = nums[:-1] # drop the last position (END, or PAD if the row is padded)
    toks = [en_num2tok[num] for num in nums]
    sentence = ' '.join(toks)
    return sentence

Note: Don't confuse these encode and decode functions with the encoder and decoder in the model.
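Here is a minimal round-trip sketch of the encode and decode steps, using a toy vocabulary and hypothetical function names (`toy_encode`, `toy_decode`) so that it runs on its own. It shows that the round trip is lossy: an out-of-vocabulary word comes back as `UNK`, and punctuation comes back separated by a space:

```python
import re
import torch

toy_vocab = ['UNK', 'PAD', 'START', 'END', 'a', 'casa', '.']
tok2num = {tok: num for num, tok in enumerate(toy_vocab)}
num2tok = {num: tok for num, tok in enumerate(toy_vocab)}

def toy_encode(sent):
    # tokenize, wrap in START/END, map to integers, convert to a tensor
    toks = ['START'] + re.findall(r'\w+|[^\w\s]+', sent) + ['END']
    return torch.tensor([tok2num.get(t, tok2num['UNK']) for t in toks])

def toy_decode(tens):
    # drop the first and last positions, then map integers back to tokens
    nums = tens.tolist()[1:-1]
    return ' '.join(num2tok[n] for n in nums)

tens = toy_encode('a casa azul.')
print(tens)              # tensor([2, 4, 5, 0, 6, 3])
print(toy_decode(tens))  # a casa UNK .
```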

## Batching

I create batches by first creating a batch as a small DataFrame, then extracting x and y tensor batches from that.

First I choose a batch size:

In [22]:
bs = 64

Then I create a function to extract the input tensor batch from a DataFrame batch:

In [23]:
def xb_from_dfb(dfb):
    tens_list = [pt_encode(o) for o in dfb['pt']]
    max_len = dfb['pt_len'].max() + 2 # start and end tokens also
    padded = torch.ones(bs,max_len).long() # start from all PAD (index 1), then copy each sentence in
    for i,t in enumerate(tens_list):
        padded[i,:len(t)] = t
    return padded
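The padding step can be tried in isolation. The sketch below uses two made-up sequences and a `PAD` constant of my own naming; it pads them by hand in the same way as the function above, then compares the result with PyTorch's built-in `pad_sequence`, which is one possible alternative when every batch is padded only to its own longest sequence:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD = 1  # index of the PAD token in the vocab lists

# Two made-up encoded sentences of different lengths
seqs = [torch.tensor([2, 4, 5, 3]), torch.tensor([2, 4, 3])]

# Manual approach: allocate a tensor full of PAD, then copy each sequence in
max_len = max(len(t) for t in seqs)
padded = torch.full((len(seqs), max_len), PAD, dtype=torch.long)
for i, t in enumerate(seqs):
    padded[i, :len(t)] = t

# Built-in equivalent
builtin = pad_sequence(seqs, batch_first=True, padding_value=PAD)

print(padded)
# tensor([[2, 4, 5, 3],
#         [2, 4, 3, 1]])
print(torch.equal(padded, builtin))  # True
```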

I find the length of the longest English sentence in the whole dataset and store it, so that every target batch can be padded to the same length:

In [24]:
max_en_len = df['en_len'].max()
max_en_len

8

Then I create a function to extract the output tensor from a DataFrame batch:

In [25]:
def yb_from_dfb(dfb):
    tens_list = [en_encode(o) for o in dfb['en']]
    max_len = max_en_len + 1 # end token also
    padded = torch.ones(bs,max_len).long() # start from all PAD (index 1), then copy each sentence in
    for i,t in enumerate(tens_list):
        padded[i,:len(t)] = t
    return padded

Finally I create a generator to create a DataFrame batch and then use the previous functions to output the x and y batches:

In [26]:
def get_batches(df):
    random_state = random.randint(0,100) # I need a random shuffle each time
    df_shuf = df.sample(frac=1, random_state=random_state) # the random state is different each time
    for i in range(0,len(df),bs):
        start = i
        end = min(i+bs, len(df)) # the last batch may be shorter
        dfb = df_shuf.iloc[start:end]
        yield xb_from_dfb(dfb), yb_from_dfb(dfb)
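The shuffle-then-slice pattern of the generator can be sketched on a toy DataFrame. The names `toy_df`, `toy_bs` and `toy_batches` are hypothetical, and the sketch yields the DataFrame slices directly rather than tensors. It shows that the final slice can hold fewer rows than the batch size:

```python
import pandas as pd

toy_df = pd.DataFrame({'pt': list('abcdefg')})  # 7 made-up rows
toy_bs = 3

def toy_batches(df, bs, seed):
    # shuffle all rows, then walk through them in steps of bs
    df_shuf = df.sample(frac=1, random_state=seed)
    for i in range(0, len(df_shuf), bs):
        yield df_shuf.iloc[i:i+bs]  # iloc slicing past the end is safe

sizes = [len(b) for b in toy_batches(toy_df, toy_bs, seed=0)]
print(sizes)  # [3, 3, 1]
```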

## Model

To create my encoder-decoder model, I first need to choose the sizes of each layer.

The encoder embedding layer will go from the size of the Portuguese vocab to the embedding size, so I need the vocab size:

In [27]:
pt_voc_sz = len(pt_vocab)
pt_voc_sz

14876

I need the English vocab size for the decoder's embedding layer:

In [28]:
en_voc_sz = len(en_vocab)
en_voc_sz

10719

Then I'll use the same embedding size for both encoder and decoder, and also for the LSTM's hidden size:

In [29]:
emb_sz = 200

This function creates the START token for a full batch in the decoder:

In [30]:
def en_start(bs):
    return torch.ones(bs,1).long()*2 # 2 is the index of START in en_vocab

Then I create the full encoder-decoder model.

It includes these layers:
- an Embedding layer for the encoder
- an LSTM layer for the encoder
- an Embedding layer for the decoder
- an LSTM layer for the decoder
- a Linear layer for the decoder

The forward method uses the encoder to create vector representations of each sentence in the batch, which are then passed as LSTM hidden state to the decoder.

The decoder works a lot like a normal LSTM language model, but it also starts with the vector representations as hidden state from the encoder.

The decoder uses a technique called "teacher forcing", which feeds the true previous English token into the decoder at each sequence step to predict the following token. Without it, early in training the decoder would have to base most of its predictions on its own incorrect previous predictions, which makes the task much harder. Teacher forcing effectively makes the task a lot more like a normal language model.

My slight adjustment to this technique is that I add a hyperparameter `p` to the forward method. At every decoder time step, I use the true previous token with probability `p` (normal "teacher forcing"), but with probability `1-p` I use the predicted previous token. My idea is that this will make the model more flexible and stronger, since this is closer to the inference prediction task (which doesn't get to see any of the y data). 

The forward method can be used for inference by simply not supplying `y` or `p` arguments. When `y` is not given, only previous predictions will be used as input in the decoder.

In [31]:
class TranslateModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Portuguese encoder
        self.pt_emb = nn.Embedding(pt_voc_sz, emb_sz, padding_idx=1)
        self.pt_lstm = nn.LSTM(emb_sz, emb_sz, 1, batch_first=True)
        # English decoder
        self.en_emb = nn.Embedding(en_voc_sz, emb_sz, padding_idx=1)
        self.en_lstm = nn.LSTM(emb_sz, emb_sz, 1, batch_first=True)
        self.en_lin = nn.Linear(emb_sz, en_voc_sz)
        self.en_lin.weight = self.en_emb.weight # weight tying
    
    # using teacher forcing
    def forward(self, x, y=None, p=.9):
        # encoder
        _, h = self.pt_lstm(self.pt_emb(x))
        # decoder
        en_in = en_start(len(x)) # START token
        outputs = []
        for i in range(max_en_len+1): # add 1 for END token
            if y is not None: # y is only included in training
                if i>0: # for any sequence step after the first
                    # use true previous token with probability p
                    if random.random() < p: en_in = y[:,i-1:i]
            out, h = self.en_lstm(self.en_emb(en_in), h)
            out = self.en_lin(out)
            en_in = out.argmax(dim=-1)
            outputs.append(out)
        return torch.cat(outputs, dim=1) # concat on sequence dim

In [32]:
model = TranslateModel()
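The per-step choice between the true token and the predicted token can be looked at separately from the model. The following is a minimal sketch, with a hypothetical helper `count_forced_steps` that only counts how many decoder steps would receive the true previous token for a given `p`. As in the `forward` method above, the first step always uses START, and one coin flip per time step applies to the whole batch:

```python
import random

def count_forced_steps(p, n_steps, rng):
    """Count the steps (after the first) that would be fed the true previous token."""
    forced = 0
    for i in range(n_steps):
        # step 0 always uses START; later steps flip a coin with probability p
        if i > 0 and rng.random() < p:
            forced += 1
    return forced

rng = random.Random(0)
n_steps = 9  # max_en_len + 1 in this notebook
for p in [1.0, 0.6, 0.0]:
    print(p, count_forced_steps(p, n_steps, rng))
```

With `p=1.0` all 8 steps after the first are teacher-forced, with `p=0.0` none are (the same situation as inference), and values in between give a random mix.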

## Loss

My loss function is the standard language model cross-entropy loss. I just need to flatten the predictions and targets before using the PyTorch function:

In [33]:
def translate_loss(preds, targs):
    preds = preds.view(-1, en_voc_sz)
    targs = targs.view(-1)
    return F.cross_entropy(preds, targs)
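The flattening step is easiest to follow with concrete shapes. This sketch uses small made-up sizes and random scores (all names here are illustrative). The last line shows an optional variant that is not used in this notebook: `ignore_index` tells `F.cross_entropy` to skip PAD targets, whereas the loss above treats PAD positions like any other target:

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
toy_bs, seq_len, voc_sz = 2, 3, 5

preds = torch.randn(toy_bs, seq_len, voc_sz)  # one score per vocab item per step
targs = torch.tensor([[4, 3, 1],
                      [2, 3, 1]])             # last position of each row is PAD (1)

flat_preds = preds.view(-1, voc_sz)  # (toy_bs * seq_len, voc_sz)
flat_targs = targs.view(-1)          # (toy_bs * seq_len,)
loss = F.cross_entropy(flat_preds, flat_targs)

print(flat_preds.shape, flat_targs.shape)  # torch.Size([6, 5]) torch.Size([6])
print(loss.dim())                          # 0, a single scalar

# Optional variant (an assumption, not the notebook's loss): skip PAD targets
loss_no_pad = F.cross_entropy(flat_preds, flat_targs, ignore_index=1)
```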

## Optimizer

I use the Adam optimizer, a common default that works well here:

In [34]:
opt = torch.optim.Adam(model.parameters(), lr=.001)

## Train

I use a standard training function, but with a `p` value supplied to determine the chance of teacher forcing:

In [35]:
def train_epoch(p):
    for xb, yb in get_batches(df_train):
        out = model(xb, yb, p=p)
        loss = translate_loss(out, yb)
        loss.backward()
        opt.step()
        opt.zero_grad()

## Validate

I also use a standard validation function that returns the mean loss over the full validation set. I don't pass a `y` to the model, so this is the same functionality as I'll use during inference:

In [36]:
def validate_epoch():
    losses = []
    with torch.no_grad():
        for xb, yb in get_batches(df_val):
            out = model(xb)
            loss = translate_loss(out, yb)
            losses.append(loss.item()) # store a plain float rather than a tensor
    return float(np.mean(losses))

I'm not using any additional metric (such as accuracy) since I want to check the effectiveness on the validation set at the end of training by manually checking the results. Translation is tricky because there are many ways to "correctly" translate a sentence, so I'll use my own judgment.

## Epochs

For training, I'll use a decaying `p` value across epochs, starting at 1 (pure teacher forcing, with no reliance on previous predictions) and ending at .6 (a 60% chance at each time step of using the true previous token, and a 40% chance of relying on the previous prediction). This lets the model start with an easier task and increase the difficulty as it goes:

In [37]:
ps = [1., 1., 1., 1., .9, .8, .7, .6]
for i,p in enumerate(ps):
    train_epoch(p=p)
    val_loss = validate_epoch()
    print(f'Epoch {i+1} completed. Validation loss: {val_loss}')

Epoch 1 completed. Validation loss: 5.276256084442139
Epoch 2 completed. Validation loss: 5.27774715423584
Epoch 3 completed. Validation loss: 5.353205680847168
Epoch 4 completed. Validation loss: 5.444311141967773
Epoch 5 completed. Validation loss: 4.595528602600098
Epoch 6 completed. Validation loss: 4.233743667602539
Epoch 7 completed. Validation loss: 4.03050422668457
Epoch 8 completed. Validation loss: 3.769270181655884


## Results

I get a batch of inputs from the validation set and pass them through the model in inference mode (by not passing in any `y` data). I then group the predicted sentences with the true label sentences and the input sentences for further examination:

In [38]:
xb, yb = next(get_batches(df_val))
out = model(xb)
nums = out.argmax(dim=-1)

ins = [pt_decode(o) for o in xb]
trues = [en_decode(o) for o in yb]
preds = [en_decode(o) for o in nums]

examples = list(zip(ins, trues, preds))
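The step from model output to readable text can be sketched on fake scores. Everything here is illustrative: `toy_en_vocab`, the hand-built `logits`, and the hypothetical helper `trim_at_end`, which cuts a prediction at its first END token. The notebook's `en_decode` instead drops only the last position, which is why trailing `END` and `PAD` tokens can show up in the printed examples:

```python
import torch

toy_en_vocab = ['UNK', 'PAD', 'START', 'END', 'we', 'agree', '.']
END = toy_en_vocab.index('END')

# Fake scores for one sentence of 5 steps: (batch, seq_len, vocab)
logits = torch.zeros(1, 5, len(toy_en_vocab))
for step, idx in enumerate([4, 5, 6, 3, 1]):  # we agree . END PAD
    logits[0, step, idx] = 1.0

# Pick the highest-scoring token at each step: (batch, seq_len)
nums = logits.argmax(dim=-1)

def trim_at_end(row, end_idx=END):
    """Hypothetical helper: keep only the tokens before the first END."""
    ids = row.tolist()
    if end_idx in ids:
        ids = ids[:ids.index(end_idx)]
    return ' '.join(toy_en_vocab[i] for i in ids)

print(nums)                  # tensor([[4, 5, 6, 3, 1]])
print(trim_at_end(nums[0]))  # we agree .
```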

Next I compare inputs and true labels to predictions to judge how well the model did. Each example features three sentences:
- the input Portuguese sentence
- the English translation that is the true label from the dataset
- the English translation that the model generated

Here are some examples of inputs that probably overlap with the training data and have been memorized:

In [44]:
examples[16]

('( o parlamento aprova a acta ) END',
 '( the minutes were approved ) END PAD',
 '( the minutes were approved ) END PAD')

In [45]:
examples[21]

('( o parlamento aprova a resolução ) END',
 '( parliament adopted the resolution ) END PAD',
 '( parliament adopted the resolution ) END PAD')

In [46]:
examples[25]

('a votação terá lugar hoje . END PAD',
 'the vote will take place today . END',
 'the vote will take place today . END')

Memorization is not ideal, since it only works on data that has been seen in the training set. However, it can still be useful for phrases that are so common that they might as well be memorized.

The more interesting results have translations that are not exactly the same as the true labels but still are pretty accurate.

In this first example, the model's output is actually a more literal translation of the Portuguese sentence, which could be literally translated as something like "I don't have any doubt about that":

In [47]:
examples[2]

('não tenho qualquer dúvida quanto a isso .',
 'of this i am in no doubt .',
 'i have no doubt about that . END')

This example is again close to the true label translation, although the verb phrase comes out ungrammatical:

In [48]:
examples[7]

('em consequência , votámos contra este relatório .',
 'we have therefore voted against this report .',
 'we are now voted against this report .')

This next prediction actually makes more sense than the label translation, at least when this sentence is not in any surrounding context. "precisamos" means "we need to", and "dar-lhes tempo" means "give them time", so the prediction seems more accurate at least at the beginning of the sentence:

In [52]:
examples[27]

('precisamos dar - lhes tempo . END PAD',
 'it must grant the time for them .',
 'we need to do them . END PAD')

A lot of these previous translations could again be based on some memorization of different label translations for the same common input sentence, shared between training and validation data.

The most interesting predictions are the ones that are clearly not memorized because they are obviously not how any human would translate the inputs, but they still demonstrate some strange understanding of the inputs.

This next one translates "very clear" as "clear clear", which makes sense in a way, because it is repeating and thus emphasizing the idea of "clear":

In [49]:
examples[8]

('esta mensagem é muito clara . END PAD',
 'that message is very clear . END PAD',
 'this message is clear clear . END PAD')

This one doesn't quite work grammatically but seems to understand that more positivity is needed:

In [50]:
examples[15]

('temos de adoptar uma atitude mais positiva .',
 'we must adopt a more positive approach .',
 'we must be a positive more . END')

This one knows about a decision and its location:

In [51]:
examples[22]

('a decisão desta assembleia está tomada . END',
 'the house has made its decision . END',
 'the decision is in this house . END')

This one shows the model has an idea of the relation between the concepts of priority and importance, although it fumbles the full translation:

In [53]:
examples[33]

('estas prioridades são de três ordens . END',
 'there are three types of priority . END',
 'these three three situations are important . END')

This one translates the first half of the sentence very well: "é a única coisa" by itself would be "it is the only thing". It didn't translate the second half, which should be "that matters":

In [55]:
examples[38]

('é a única coisa que importa ! END',
 'that is all that counts . END PAD',
 'it is the only one ! ! END')

This one understood the concept but chose the most obvious word both times:

In [59]:
examples[46]

('isto é bom e positivo . END PAD',
 'this is both good and positive . END',
 'this is good and good . END PAD')

This one knows the need for awareness, but not what we need to be aware of:

In [56]:
examples[43]

('temos de estar cientes desse risco . END',
 'we must be aware of this risk .',
 'we need to be aware of this .')

The following translations have more problems but are still interesting to examine.

This one got too focused on the ideas of "important" and "two":

In [64]:
examples[59]

('tomaram - se assim dois passos importantes .',
 'two important steps have thus been taken .',
 'important two two points important points . END')

This one has an interesting Yoda-like way of saying "we don't need more":

In [57]:
examples[44]

('não precisamos de mais directivas . END PAD',
 'we do not need more directives . END',
 'we need more not be . END PAD')

This one has an interesting way of saying something isn't finished:

In [62]:
examples[53]

('esse trabalho ainda não foi concluído . END',
 'this work is on - going . END',
 'this has yet been not yet . END')

And finally, this one thinks Mrs. Pack is so correct that she should be addressed as Mrs. Right:

In [60]:
examples[47]

('tem razão , senhora deputada pack . END',
 'you are right , mrs pack . END',
 'you are , mrs right . END PAD')