<a href="https://colab.research.google.com/github/ounospanas/AIDL_B_CS01/blob/main/Seq2Seq_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
"""
The MIT License (MIT)
Copyright (c) 2021 NVIDIA
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
"""



This code example demonstrates how to build a neural machine translation network. It is a sequence-to-sequence network based on an encoder-decoder architecture. More context for this code example can be found in the section "Programming Example: Neural Machine Translation" in Chapter 14 in the book Learning Deep Learning by Magnus Ekman (ISBN: 9780137470358).

The data used to train the model is expected to be in the file fra.txt in the working directory (see SRC_DEST_FILE_NAME in the constants defined later).
We begin by importing the modules that we need for the program. Just as in c12e1_autocomplete_embedding, we use some functionality from TensorFlow to preprocess text.


In [2]:
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.text \
    import text_to_word_sequence
from tensorflow.keras.preprocessing.sequence \
    import pad_sequences
import numpy as np
import random


Next, we define some constants. We specify a vocabulary size of 10,000 symbols, out of which four indices are reserved for padding, out-of-vocabulary words (denoted as UNK), START tokens, and STOP tokens. Our training corpus is large, so we set the parameter READ_LINES to the number of lines in the input file we want to use in our example (60,000). Our LSTM layers consist of 256 units (LAYER_SIZE), and the embedding layers output 128 dimensions (EMBEDDING_WIDTH). We use 20% (TEST_PERCENT) of the dataset as the test set and further select 20 sentences (SAMPLE_SIZE) to inspect in detail during training. We limit the length of the source and destination sentences to, at most, 60 words (MAX_LENGTH). Finally, we provide the name of the data file, where each line is expected to contain two versions of the same sentence (one in each language) separated by a tab character.

In [3]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
EPOCHS = 20
BATCH_SIZE = 128
MAX_WORDS = 10000
READ_LINES = 60000
LAYER_SIZE = 256
EMBEDDING_WIDTH = 128
TEST_PERCENT = 0.2
SAMPLE_SIZE = 20
OOV_WORD = 'UNK'
PAD_INDEX = 0
OOV_INDEX = 1
START_INDEX = MAX_WORDS - 2
STOP_INDEX = MAX_WORDS - 1
MAX_LENGTH = 60
SRC_DEST_FILE_NAME = 'fra.txt'


The next code snippet shows the function used to read the input data file and do some initial processing. Each line is split into two strings, where the first contains the sentence in the destination language and the second contains the sentence in the source language. We use the function text_to_word_sequence() to clean the data somewhat (make everything lowercase and remove punctuation) and split each sentence into a list of individual words. If the list (sentence) is longer than the maximum allowed length, then it is truncated.

In [4]:
# Function to read file.
def read_file_combined(file_name, max_len):
    src_word_sequences = []
    dest_word_sequences = []
    # The context manager closes the file even if an error occurs.
    with open(file_name, 'r', encoding='utf-8') as file:
        for i, line in enumerate(file):
            if i == READ_LINES:
                break
            # Column 0 is the destination sentence, column 1 the source.
            pair = line.split('\t')
            word_sequence = text_to_word_sequence(pair[1])
            src_word_sequence = word_sequence[0:max_len]
            src_word_sequences.append(src_word_sequence)
            word_sequence = text_to_word_sequence(pair[0])
            dest_word_sequence = word_sequence[0:max_len]
            dest_word_sequences.append(dest_word_sequence)
    return src_word_sequences, dest_word_sequences
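
To make the per-line processing concrete, the sketch below parses a single tab-separated line without TensorFlow. The helper simple_word_sequence() is a hypothetical stand-in that only approximates text_to_word_sequence() (lowercasing, replacing punctuation with spaces, splitting on whitespace), and parse_line() is an illustrative name that is not part of the notebook.

```python
# Characters treated as separators. This list is an assumption chosen to
# resemble the default Keras filter; apostrophes are deliberately kept.
FILTER_CHARS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TABLE = str.maketrans({ch: ' ' for ch in FILTER_CHARS})

def simple_word_sequence(text):
    """Hypothetical stand-in for text_to_word_sequence()."""
    return text.lower().translate(_TABLE).split()

def parse_line(line, max_len):
    """Split one line into (source words, destination words)."""
    # Column 0 holds the destination sentence, column 1 the source sentence.
    pair = line.split('\t')
    dest_words = simple_word_sequence(pair[0])[:max_len]
    src_words = simple_word_sequence(pair[1])[:max_len]
    return src_words, dest_words

src_words, dest_words = parse_line("I'm a student.\tJe suis étudiant.\n",
                                   max_len=60)
print(src_words)   # ['je', 'suis', 'étudiant']
print(dest_words)  # ["i'm", 'a', 'student']
```

Sentences longer than max_len are simply cut off, which mirrors the slicing with [0:max_len] in read_file_combined().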


The next code snippet shows functions used to turn sequences of words into sequences of tokens, and vice versa. We call tokenize() a single time for each language, so the argument sequences is a list of lists where each of the inner lists represents a sentence. The Tokenizer class assigns indices to the most common words and returns either these indices or the reserved OOV_INDEX for less common words that did not make it into the vocabulary. We tell the Tokenizer to use a vocabulary of 9998 (MAX_WORDS-2), that is, use only indices 0 to 9997, so that we can use indices 9998 and 9999 as our START and STOP tokens (the Tokenizer does not support the notion of START and STOP tokens but does reserve index 0 to use as a padding token and index 1 for out-of-vocabulary words). Our tokenize() function returns both the tokenized sequence and the Tokenizer object itself. This object will be needed anytime we want to convert tokens back into words.

The function tokens_to_words() requires a Tokenizer and a list of indices. We simply check for the reserved indices: If we find a match, we replace them with hardcoded strings, and if we find no match, we let the Tokenizer convert the index to the corresponding word string. The Tokenizer expects a list of lists of indices and returns a list of strings, which is why we need to call it with [[index]] and then select the 0th element to arrive at a string.


In [5]:
# Functions to tokenize and un-tokenize sequences.
def tokenize(sequences):
    # "MAX_WORDS-2" used to reserve two indices
    # for START and STOP.
    tokenizer = Tokenizer(num_words=MAX_WORDS-2,
                          oov_token=OOV_WORD)
    tokenizer.fit_on_texts(sequences)
    token_sequences = tokenizer.texts_to_sequences(sequences)
    return tokenizer, token_sequences

def tokens_to_words(tokenizer, seq):
    word_seq = []
    for index in seq:
        if index == PAD_INDEX:
            word_seq.append('PAD')
        elif index == OOV_INDEX:
            word_seq.append(OOV_WORD)
        elif index == START_INDEX:
            word_seq.append('START')
        elif index == STOP_INDEX:
            word_seq.append('STOP')
        else:
            word_seq.append(tokenizer.sequences_to_texts(
                [[index]])[0])
    print(word_seq)
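
The index layout described above can be illustrated without the Keras Tokenizer. The following is a minimal pure-Python sketch under the assumption that words are ranked by frequency, index 0 is reserved for padding, index 1 for out-of-vocabulary words, and only indices below num_words are kept. The names build_vocab(), encode(), and decode() are hypothetical and not part of the notebook.

```python
from collections import Counter

PAD_INDEX, OOV_INDEX = 0, 1

def build_vocab(sentences, num_words):
    """Map the most common words to indices 2, 3, ... below num_words."""
    counts = Counter(word for sentence in sentences for word in sentence)
    ranked = [word for word, _ in counts.most_common()]
    word_index = {word: i + 2 for i, word in enumerate(ranked)}
    # Words whose index is num_words or larger fall out of the vocabulary.
    return {word: i for word, i in word_index.items() if i < num_words}

def encode(sentence, vocab):
    """Replace each word by its index, or by OOV_INDEX if it is unknown."""
    return [vocab.get(word, OOV_INDEX) for word in sentence]

def decode(indices, vocab):
    """Convert indices back to words, handling the reserved indices first."""
    index_word = {i: word for word, i in vocab.items()}
    reserved = {PAD_INDEX: 'PAD', OOV_INDEX: 'UNK'}
    return [reserved[i] if i in reserved else index_word[i] for i in indices]

sentences = [['i', 'am', 'a', 'student'], ['i', 'am', 'tired'], ['i', 'sleep']]
vocab = build_vocab(sentences, num_words=4)
print(vocab)                                          # {'i': 2, 'am': 3}
print(encode(['i', 'am', 'a', 'student'], vocab))     # [2, 3, 1, 1]
print(decode([0, 2, 3, 1], vocab))                    # ['PAD', 'i', 'am', 'UNK']
```

With num_words=4 only indices 2 and 3 are available for real words, in the same way that MAX_WORDS-2 leaves the two highest indices free for START and STOP in the notebook.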


Given these helper functions, it is straightforward to read the input data
file and convert it into tokenized sequences.

In [6]:
# Read file and tokenize.
src_seq, dest_seq = read_file_combined(SRC_DEST_FILE_NAME,
                                       MAX_LENGTH)
src_tokenizer, src_token_seq = tokenize(src_seq)
dest_tokenizer, dest_token_seq = tokenize(dest_seq)


It is now time to arrange the data into arrays that can be used for training and testing. The following example provides some insight into what we need as input and output for a single training example, where src_input is the input to the encoder network, dest_input is the input to the decoder network, and dest_target is the desired output from the decoder network:

src_input = [PAD, PAD, PAD, id("je"), id("suis"), id("étudiant")]
dest_input = [START, id("i"), id("am"), id("a"), id("student"), STOP, PAD, PAD]
dest_target = [one_hot_id("i"), one_hot_id("am"), one_hot_id("a"), one_hot_id("student"), one_hot_id(STOP), one_hot_id(PAD), one_hot_id(PAD), one_hot_id(PAD)]

In the example, id(string) refers to the tokenized index of the string, and one_hot_id is the one-hot encoded version of the index. (In this PyTorch implementation, dest_target is stored as plain integer indices, because nn.CrossEntropyLoss accepts class indices as targets.) We have assumed that the longest source sentence is six words, so we padded src_input to be of that length. Similarly, we have assumed that the longest destination sentence is eight words including START and STOP tokens, so we padded both dest_input and dest_target to be of that length. Note how the symbols in dest_input are offset by one location compared to the symbols in dest_target because when we later do inference, the inputs into the decoder network will be coming from the output of the network for the previous timestep. Although this example shows a single training example as lists, in reality they will be rows in NumPy arrays, where each array contains multiple training examples.

The padding is done to ensure that we can use mini-batches for training. That is, all source sentences need to be the same length, and all destination sentences need to be the same length. We pad the source input at the beginning (known as prepadding) and the destination at the end (known as postpadding). This choice is not obvious, so the reasoning behind it is given next.

We previously stated that when using padding, the model can learn to ignore the padded values. However, as always, things are not as simple as they might appear. Although the model can learn to ignore values, it will not perfectly learn this. The ease with which it learns to ignore padding values might depend on how the data is arranged. It is not hard to imagine that inputting a considerable number of zeros at the end of a sequence will dilute the input and affect the internal state of the network. From that perspective, it makes sense to pad the input values with zeros in the beginning of the sequence instead. Similarly, in a sequence-to-sequence network, if the encoder has created an internal state that is transferred to the decoder, diluting this state by presenting a number of zeros before the START token also seems like it could be bad. This reasoning supports the chosen padding (prepadding of the source input and postpadding of the destination input).

The code snippet below shows a compact way of creating the three arrays that we need. The first two lines create two new lists, each containing the destination sequences but the first (dest_target_token_seq) also augmented with STOP_INDEX after each sequence and the second (dest_input_token_seq) augmented with both START_INDEX and STOP_INDEX. It is easy to miss that dest_input_token_seq has a STOP_INDEX, but that falls out naturally because it is created from the dest_target_token_seq for which a STOP_INDEX was just added to each sentence.

Next, we call pad_sequences() on both the original src_token_seq list (of lists) and on these two new destination lists. The pad_sequences() function pads the sequences with the PAD value and then returns a NumPy array. The default behavior of pad_sequences() is to do prepadding, and we do that for the source sequence but explicitly ask for postpadding for the destination sequences.

We conclude with converting the data type to np.int64 to match what PyTorch later requires.


In [7]:
# Prepare training data.
dest_target_token_seq = [x + [STOP_INDEX] for x in dest_token_seq]
dest_input_token_seq = [[START_INDEX] + x for x in
                        dest_target_token_seq]
src_input_data = pad_sequences(src_token_seq)
dest_input_data = pad_sequences(dest_input_token_seq,
                                padding='post')
dest_target_data = pad_sequences(
    dest_target_token_seq, padding='post',
    maxlen=len(dest_input_data[0]))

# Convert to int64, the integer type PyTorch expects for indices.
src_input_data = src_input_data.astype(np.int64)
dest_input_data = dest_input_data.astype(np.int64)
dest_target_data = dest_target_data.astype(np.int64)
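
As a rough illustration of what the three arrays contain, the sketch below repeats the same steps on two toy sentences using plain Python lists. The pad() helper is a hypothetical, simplified replacement for pad_sequences(), and the token values are made up for the example.

```python
PAD, START, STOP = 0, 9998, 9999

def pad(seqs, padding='pre', width=None):
    """Pad every sequence with PAD to a common width (simplified sketch)."""
    if width is None:
        width = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        fill = [PAD] * (width - len(s))
        padded.append(fill + s if padding == 'pre' else s + fill)
    return padded

src = [[5, 6, 7], [8]]                  # toy source sentences
dest = [[11, 12], [13, 14, 15, 16]]     # toy destination sentences

dest_target = [s + [STOP] for s in dest]
dest_input = [[START] + s for s in dest_target]

src_in = pad(src)                                  # prepadding
dest_in = pad(dest_input, padding='post')          # postpadding
dest_tgt = pad(dest_target, padding='post', width=len(dest_in[0]))

print(src_in)    # [[5, 6, 7], [0, 0, 8]]
print(dest_in)   # [[9998, 11, 12, 9999, 0, 0], [9998, 13, 14, 15, 16, 9999]]
print(dest_tgt)  # [[11, 12, 9999, 0, 0, 0], [13, 14, 15, 16, 9999, 0]]
```

Each row of dest_in, shifted left by one position, equals the corresponding row of dest_tgt, which is the one-step offset between decoder input and decoder target.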


The next code snippet demonstrates how we can manually split our dataset into a training dataset and a test dataset. We split the dataset by first creating a list test_indices, which contains a 20% (TEST_PERCENT) subset of all the numbers from 0 to N−1, where N is the size of our original dataset. We then create a list train_indices, which contains the remaining 80%. We can now use these lists to select a number of rows in the arrays representing the dataset and create two new collections of arrays, one to be used as training set and one to be used as test set. Finally, we create a third collection of arrays, which only contains 20 (SAMPLE_SIZE) random examples from the test dataset. We will use them to inspect the resulting translations in detail, but since that is a manual process, we limit ourselves to a small number of sentences.

As a last step, we convert the NumPy arrays to PyTorch tensors and create TensorDataset objects.


In [8]:
# Split into training and test set.
rows = len(src_input_data[:,0])
all_indices = list(range(rows))
test_rows = int(rows * TEST_PERCENT)
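test_indices = random.sample(all_indices, test_rows)
# Use a set for the membership test; scanning the list for every row is slow.
test_index_set = set(test_indices)
train_indices = [x for x in all_indices if x not in test_index_set]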

train_src_input_data = src_input_data[train_indices]
train_dest_input_data = dest_input_data[train_indices]
train_dest_target_data = dest_target_data[train_indices]

test_src_input_data = src_input_data[test_indices]
test_dest_input_data = dest_input_data[test_indices]
test_dest_target_data = dest_target_data[test_indices]

# Create a sample of the test set that we will inspect in detail.
test_indices = list(range(test_rows))
sample_indices = random.sample(test_indices, SAMPLE_SIZE)
sample_input_data = test_src_input_data[sample_indices]
sample_target_data = test_dest_target_data[sample_indices]

# Create Dataset objects.
trainset = TensorDataset(torch.from_numpy(train_src_input_data),
                         torch.from_numpy(train_dest_input_data),
                         torch.from_numpy(train_dest_target_data))
testset = TensorDataset(torch.from_numpy(test_src_input_data),
                        torch.from_numpy(test_dest_input_data),
                        torch.from_numpy(test_dest_target_data))
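
The index-based split can be checked in isolation. The sketch below is a minimal version that works on row numbers only. The function name split_indices() and the seed parameter are assumptions added for reproducibility and are not part of the notebook.

```python
import random

def split_indices(n_rows, test_fraction, seed=None):
    """Return disjoint lists of train and test row indices."""
    rng = random.Random(seed)
    test_indices = rng.sample(range(n_rows), int(n_rows * test_fraction))
    test_set = set(test_indices)  # fast membership test
    train_indices = [i for i in range(n_rows) if i not in test_set]
    return train_indices, test_indices

train_idx, test_idx = split_indices(1000, 0.2, seed=0)
print(len(train_idx), len(test_idx))  # 800 200
```

Because the test indices are sampled without replacement and the training indices are the remaining rows, no sentence pair ends up in both sets.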


We are now ready to build our model. It consists of an encoder part and a decoder part (see Figures 14-4 and 14-5 in Chapter 14). The encoder consists of an embedding layer and two LSTM layers. The decoder consists of an embedding layer, two LSTM layers, and a fully connected softmax layer. We define these as two separate models, but we will use them together as an encoder-decoder model.

The code snippet below contains the implementation of the encoder model. One thing to note is that its forward() method discards the per-timestep outputs of the LSTM object and returns only the state. That is, it returns a tuple containing both the output state (h) and the internal state (c) for all layers.


In [9]:
# Define models.
class EncoderModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding_layer = nn.Embedding(MAX_WORDS, EMBEDDING_WIDTH)
        nn.init.uniform_(self.embedding_layer.weight, -0.05, 0.05) # PyTorch default is N(0, 1).
        self.lstm_layers = nn.LSTM(EMBEDDING_WIDTH, LAYER_SIZE, num_layers=2, batch_first=True)

    def forward(self, inputs):
        x = self.embedding_layer(inputs)
        x = self.lstm_layers(x)
        # we need the states not the output
        return x[1]


The next code snippet shows the implementation of the decoder model. In addition to the sentence in the destination language, it needs the output state from the encoder model. We use the same mechanism as in c12e1_autocomplete_embedding to manage the input state.


In [11]:
class DecoderModel(nn.Module):
    def __init__(self):
        super().__init__()
        # State handed to the LSTM layers as their initial (h, c) state.
        # While use_state is False, the LSTM layers are called without a
        # state and start from all-zero h and c. After set_state() has been
        # called, self.state is passed in and used as initial state instead.
        self.state = None
        self.use_state = False
        # Embedding layer with MAX_WORDS as input size and EMBEDDING_WIDTH
        # as output size. The weights are initialized with uniform random
        # numbers between -0.05 and 0.05 instead of PyTorch's default N(0, 1)
        # initialization, to match the range used in our TensorFlow examples.
        self.embedding_layer = nn.Embedding(MAX_WORDS, EMBEDDING_WIDTH)
        nn.init.uniform_(self.embedding_layer.weight, -0.05, 0.05)
        self.lstm_layers = nn.LSTM(EMBEDDING_WIDTH, LAYER_SIZE, num_layers=2, batch_first=True)
        self.output_layer = nn.Linear(LAYER_SIZE, MAX_WORDS)

    def forward(self, inputs):
        x = self.embedding_layer(inputs)
        if self.use_state:
            x = self.lstm_layers(x, self.state)
        else:
            x = self.lstm_layers(x)
        # Store the most recent internal state (h, c) for use in the next
        # time step. detach() marks the copies as not needing gradients, and
        # clone() makes sure they do not change under the hood if the layers
        # are later called with new inputs.
        self.state = (x[1][0].detach().clone(), x[1][1].detach().clone())
        x = self.output_layer(x[0])
        return x

    # Functions to provide explicit control of LSTM state.
    def set_state(self, state):
        self.state = state
        self.use_state = True
        return

    def get_state(self):
        return self.state

    def clear_state(self):
        self.use_state = False
        return


The next code snippet instantiates the two models and creates two optimizers, one for each model. We decided to use RMSProp as optimizer because some experiments indicate that it performs better than Adam for this specific model. We use CrossEntropyLoss as usual.

We transfer the models to the GPU and create a DataLoader object for both the training and test dataset. We have not had to do this lately because it has been included in our train_model function that was reused for all recent examples. We cannot use that function in this example because it does not support the more complex encoder-decoder model that we want to train.


In [12]:
encoder_model = EncoderModel()
decoder_model = DecoderModel()

# Loss functions and optimizer.
encoder_optimizer = torch.optim.RMSprop(encoder_model.parameters(), lr=0.001)
decoder_optimizer = torch.optim.RMSprop(decoder_model.parameters(), lr=0.001)
loss_function = nn.CrossEntropyLoss()

# Using a custom training loop instead of our standard training function.
# Transfer model to GPU.
encoder_model.to(device)
decoder_model.to(device)

trainloader = DataLoader(dataset=trainset, batch_size=BATCH_SIZE, shuffle=True)
testloader = DataLoader(dataset=testset, batch_size=BATCH_SIZE, shuffle=False)


The final code snippet shows how to train and test the model. It is a modified version of the training loop that we had implemented in the train_model function (see utilities.py for comparison). We focus on describing the differences.

We need to set both the encoder and decoder model in training mode by calling the train() method. The same applies for the calls to zero_grad() and step() on the optimizers. Additionally, our DataLoader objects will now return two sets of inputs (src_inputs and dest_inputs) to supply input values both to the encoder and decoder models.

For the forward pass we first invoke the encoder model. We then read out its state and use that to set the state for the decoder model, followed by invoking the decoder model as well.

We have also modified how the accuracy metric is calculated. As described in the book, accuracy is not necessarily a meaningful metric for machine translation (BLEU score is better) and one can also envision different ways of computing accuracy. In this example, we count how many of the predicted words match the expected targets, instead of checking whether a whole sentence matches the target sentence. We chose to compute it this way to match what is automatically computed in the TensorFlow version of this code example.
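
The difference between the two ways of counting can be seen on a toy example. The sketch below uses plain Python lists instead of tensors, and the function names token_accuracy() and sentence_accuracy() are hypothetical. As in the training loop, every position is counted, including padded ones.

```python
def token_accuracy(predictions, targets):
    """Fraction of individual positions where prediction equals target."""
    correct = sum(p == t
                  for pred, tgt in zip(predictions, targets)
                  for p, t in zip(pred, tgt))
    total = sum(len(tgt) for tgt in targets)
    return correct / total

def sentence_accuracy(predictions, targets):
    """Fraction of sentences that match the target in every position."""
    exact = sum(pred == tgt for pred, tgt in zip(predictions, targets))
    return exact / len(targets)

targets = [[11, 12, 9999, 0], [13, 14, 15, 9999]]
predictions = [[11, 12, 9999, 0], [13, 99, 15, 9999]]

print(token_accuracy(predictions, targets))     # 0.875
print(sentence_accuracy(predictions, targets))  # 0.5
```

A single wrong word lowers the token-level score only slightly but makes the whole sentence count as wrong in the sentence-level score, so the two numbers are not comparable.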

The second inner loop below evaluates the model on the test dataset. Just as in our usual train_model function, this loop largely mimics the first inner loop but without running a backward pass and adjusting weights.

We then move on to a third inner loop below that was not present in our train_model function. This loop uses the model to produce and print out translations for our 20 sample examples. We do this to gain some insight into what the translations look like. We first invoke the encoder model with the sentence to translate and retrieve the model internal state. We then use this internal state as a starting point for the decoder to generate a translation, using the START token as input for the first time step. The loop that generates the translation is very similar to the code in c12e1_autocomplete_embedding. After all, the decoder is no different than a language model that generates a sentence based on an encoder-generated initial state representing the sentence to translate. The loop iterates until the model produces a STOP token or until a given number of words have been produced. Finally, we convert the produced tokenized sequences into the corresponding word sequences and print them out.
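
The generation loop described above follows a greedy decoding pattern, which the sketch below shows in isolation. It makes simplifying assumptions: step_fn is a hypothetical callable that stands in for one decoder step and returns a dictionary of scores per token together with the updated state, and toy_step() is a made-up model that always continues a fixed sentence.

```python
START_INDEX, STOP_INDEX = 9998, 9999

def greedy_decode(step_fn, state, max_length):
    """Feed each predicted token back in until STOP or max_length."""
    prev_index = START_INDEX
    produced = []
    for _ in range(max_length):
        scores, state = step_fn(prev_index, state)
        # Greedy choice: pick the highest-scoring token.
        prev_index = max(scores, key=scores.get)
        produced.append(prev_index)
        if prev_index == STOP_INDEX:
            break
    return produced

# Toy stand-in for the decoder: START -> 11 -> 12 -> STOP.
TOY_NEXT = {START_INDEX: 11, 11: 12, 12: STOP_INDEX}

def toy_step(prev_index, state):
    scores = {11: 0.1, 12: 0.1, STOP_INDEX: 0.1}
    scores[TOY_NEXT[prev_index]] = 0.9
    return scores, state + 1  # the state here is just a step counter

print(greedy_decode(toy_step, state=0, max_length=60))  # [11, 12, 9999]
print(greedy_decode(toy_step, state=0, max_length=2))   # [11, 12]
```

In the notebook the role of step_fn is played by the decoder model together with its stored LSTM state, and the initial state comes from the encoder.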

See the section "Experimental results" in Chapter 14 for examples of translations that were generated by an equivalent TensorFlow implementation of this code example, as well as a discussion about the results.


In [13]:
# Train and test repeatedly.
for i in range(EPOCHS):
    encoder_model.train() # Set model in training mode.
    decoder_model.train() # Set model in training mode.
    train_loss = 0.0
    train_correct = 0
    train_batches = 0
    train_elems = 0
    for src_inputs, dest_inputs, dest_targets in trainloader:
        # Move data to GPU.
        src_inputs, dest_inputs, dest_targets = src_inputs.to(
            device), dest_inputs.to(device), dest_targets.to(device)

        # Zero the parameter gradients.
        encoder_optimizer.zero_grad()
        decoder_optimizer.zero_grad()

        # Forward pass.
        encoder_state = encoder_model(src_inputs)
        decoder_model.set_state(encoder_state)
        outputs = decoder_model(dest_inputs)
        loss = loss_function(outputs.view(-1, MAX_WORDS), dest_targets.view(-1))
        # Accumulate metrics.
        _, indices = torch.max(outputs.data, 2)
        train_correct += (indices == dest_targets).sum().item()
        train_elems += indices.numel()
        train_batches +=  1
        train_loss += loss.item()

        # Backward pass and update.
        loss.backward()
        encoder_optimizer.step()
        decoder_optimizer.step()

    train_loss = train_loss / train_batches
    train_acc = train_correct / train_elems

    # Evaluate the model on the test dataset.
    encoder_model.eval() # Set model in inference mode.
    decoder_model.eval() # Set model in inference mode.
    test_loss = 0.0
    test_correct = 0
    test_batches = 0
    test_elems = 0
    with torch.no_grad(): # No gradients are needed for evaluation.
        for src_inputs, dest_inputs, dest_targets in testloader:
            # Move data to GPU.
            src_inputs, dest_inputs, dest_targets = src_inputs.to(
                device), dest_inputs.to(device), dest_targets.to(device)
            encoder_state = encoder_model(src_inputs)
            decoder_model.set_state(encoder_state)
            outputs = decoder_model(dest_inputs)
            loss = loss_function(outputs.view(-1, MAX_WORDS), dest_targets.view(-1))
            _, indices = torch.max(outputs, 2)
            test_correct += (indices == dest_targets).sum().item()
            test_elems += indices.numel()
            test_batches +=  1
            test_loss += loss.item()

    test_loss = test_loss / test_batches
    test_acc = test_correct / test_elems
    print(f'Epoch {i+1}/{EPOCHS} loss: {train_loss:.4f} - acc: {train_acc:0.4f} - val_loss: {test_loss:.4f} - val_acc: {test_acc:0.4f}')

    # Loop through samples to see result
    for (test_input, test_target) in zip(sample_input_data,
                                         sample_target_data):
        # Run a single sentence through encoder model.
        x = np.reshape(test_input, (1, -1))
        inputs = torch.from_numpy(x)
        inputs = inputs.to(device)
        last_states = encoder_model(inputs)

        # Provide resulting state and START_INDEX as input
        # to decoder model.
        decoder_model.set_state(last_states)
        prev_word_index = START_INDEX
        produced_string = ''
        pred_seq = []
        for j in range(MAX_LENGTH):
            x = np.reshape(np.array(prev_word_index), (1, 1))
            # Predict next word and capture internal state.
            inputs = torch.from_numpy(x)
            inputs = inputs.to(device)
            outputs = decoder_model(inputs)
            preds = outputs.cpu().detach().numpy()[0][0]
            state = decoder_model.get_state()
            decoder_model.set_state(state)

            # Find the most probable word.
            prev_word_index = preds.argmax()
            pred_seq.append(prev_word_index)
            if prev_word_index == STOP_INDEX:
                break
        tokens_to_words(src_tokenizer, test_input)
        tokens_to_words(dest_tokenizer, test_target)
        tokens_to_words(dest_tokenizer, pred_seq)
        print('\n\n')


Epoch 1/20 loss: 2.4628 - acc: 0.6101 - val_loss: 2.0833 - val_acc: 0.6441
['PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'vous', 'ai', 'je', 'réveillés']
['did', 'i', 'wake', 'you', 'up', 'STOP', 'PAD', 'PAD', 'PAD']
['i', 'want', 'to', 'do', 'you', 'STOP']



['PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'cela', "m'a", 'rendu', 'très', 'triste']
['that', 'made', 'me', 'very', 'sad', 'STOP', 'PAD', 'PAD', 'PAD']
['i', 'have', 'a', 'good', 'STOP']



['PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', "c'est", 'tellement', 'relaxant']
['this', 'is', 'so', 'relaxing', 'STOP', 'PAD', 'PAD', 'PAD', 'PAD']
['i', 'have', 'a', 'lot', 'STOP']



['PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'tom', 'a', 'fait', 'un', 'bon', 'travail']
['tom', 'did', 'a', 'good', 'job', 'STOP', 'PAD', 'PAD', 'PAD']
['tom', 'is', 'a', 'good', 'STOP']



['PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', "j'e