With LSTMs and the encoder-decoder framework, we can build some pretty powerful things, such as translators! Let's see how to create a French-to-English translator with TensorFlow.

Tips

Don't use the whole dataset for your first experiments; take just 5,000 or even 3,000 sentences. This lets you iterate faster and avoids failures caused only by a lack of computing power.

Let's get started!

# Import Libraries

In [20]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import sklearn
import tensorflow_datasets as tfds
import tensorflow as tf
tf.__version__

'2.17.1'

# Importing data

In [21]:
url ='https://go.aws/38ECHUB'
dataset = pd.read_csv(url, sep = '\t', header=None)
dataset.head()

Unnamed: 0,0,1
0,Go.,Va !
1,Hi.,Salut !
2,Run!,Cours !
3,Run!,Courez !
4,Wow!,Ça alors !


Create an object doc containing the first 15,000 rows from the file.

In [22]:
doc = dataset.iloc[:15000, :].copy()  # copy, so new columns can be added without a SettingWithCopyWarning
doc.head()

Unnamed: 0,0,1
0,Go.,Va !
1,Hi.,Salut !
2,Run!,Cours !
3,Run!,Courez !
4,Wow!,Ça alors !


In [23]:
len(doc)

15000

In your opinion, are we going to need to lemmatize and remove stop words for a translation problem?

Add the token <start> to the beginning of each target sentence and store the result in a new column named padded_en.

In [24]:
doc["padded_en"] = doc.iloc[:,0].apply(lambda x: '<start> ' + x)
doc.head()

Unnamed: 0,0,1,padded_en
0,Go.,Va !,<start> Go.
1,Hi.,Salut !,<start> Hi.
2,Run!,Cours !,<start> Run!
3,Run!,Courez !,<start> Run!
4,Wow!,Ça alors !,<start> Wow!


Create two objects, tokenizer_fr and tokenizer_en, that are instances of the tf.keras.preprocessing.text.Tokenizer class.

Be careful! Since we added a special token containing special characters, make sure you set up the tokenizers so that this token is kept intact (for example, by using the filters argument).

In [25]:
tokenizer_fr = tf.keras.preprocessing.text.Tokenizer()
tokenizer_en = tf.keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')
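
To see why the filters argument matters, here is a minimal pure-Python sketch of the kind of splitting the tokenizer performs: lowercase the text, replace every filter character with a space, then split on whitespace. The helper simple_tokenize is hypothetical and only approximates the Keras class; DEFAULT_FILTERS is assumed to match the Keras default, and CUSTOM_FILTERS is the string used above, with < and > removed.

```python
# Hypothetical helper: approximates Keras-style tokenization, it is not the real API.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
CUSTOM_FILTERS = '!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n'  # same, minus '<' and '>'


def simple_tokenize(text, filters):
    """Lowercase, turn every filter character into a space, split on whitespace."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()


default_tokens = simple_tokenize("<start> Go.", DEFAULT_FILTERS)
custom_tokens = simple_tokenize("<start> Go.", CUSTOM_FILTERS)

print(default_tokens)  # the angle brackets are stripped: ['start', 'go']
print(custom_tokens)   # the special token survives: ['<start>', 'go']
```

With the default filters the special token would collapse into the ordinary word "start", which is why tokenizer_en is given a custom filter string.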

Fit the tokenizers on the French and padded English sentences, respectively.

In [26]:
tokenizer_fr.fit_on_texts(doc.iloc[:,1])
tokenizer_en.fit_on_texts(doc['padded_en'])

Create three new columns in your DataFrame for the encoded French, English, and padded English sentences.

In [27]:
doc["encoded_fr"] = doc.iloc[:,1].apply(lambda x: tokenizer_fr.texts_to_sequences([x])[0])
doc["encoded_en"] = doc.iloc[:,0].apply(lambda x: tokenizer_en.texts_to_sequences([x])[0])
doc["encoded_padded_en"] = doc["padded_en"].apply(lambda x: tokenizer_en.texts_to_sequences([x])[0])
doc.head()

Unnamed: 0,0,1,padded_en,encoded_fr,encoded_en,encoded_padded_en
0,Go.,Va !,<start> Go.,[53],[19],"[1, 19]"
1,Hi.,Salut !,<start> Hi.,[717],[866],"[1, 866]"
2,Run!,Cours !,<start> Run!,[3078],[147],"[1, 147]"
3,Run!,Courez !,<start> Run!,[3079],[147],"[1, 147]"
4,Wow!,Ça alors !,<start> Wow!,"[24, 3080]",[2008],"[1, 2008]"


As we learned from the tutorial, the padded target sequences need to have the same length as the target sequences, so we remove the last element of each padded target sequence. The resulting sequences are the decoder inputs used for teacher forcing.

In [28]:
doc["encoded_padded_en_clean"] = doc["encoded_padded_en"].apply(lambda x: x[:-1])
doc.head()

Unnamed: 0,0,1,padded_en,encoded_fr,encoded_en,encoded_padded_en,encoded_padded_en_clean
0,Go.,Va !,<start> Go.,[53],[19],"[1, 19]",[1]
1,Hi.,Salut !,<start> Hi.,[717],[866],"[1, 866]",[1]
2,Run!,Cours !,<start> Run!,[3078],[147],"[1, 147]",[1]
3,Run!,Courez !,<start> Run!,[3079],[147],"[1, 147]",[1]
4,Wow!,Ça alors !,<start> Wow!,"[24, 3080]",[2008],"[1, 2008]",[1]
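
The shift can be checked by hand on a longer sentence. The sketch below uses a made-up three-token target (the ids are illustrative) and the start index 1 seen in the table above; it shows that each decoder input is simply the previous target token.

```python
START_ID = 1  # index of the <start> token, as seen in the encoded_padded_en column

# Hypothetical encoded target sentence (token ids, without the start token).
target = [26, 29, 107]

# Prepend the start token, then drop the last element so the lengths match.
with_start = [START_ID] + target
decoder_input = with_start[:-1]

# At step t the decoder sees decoder_input[t] and must predict target[t].
for seen, expected in zip(decoder_input, target):
    print(f"decoder sees {seen:>3} -> should predict {expected}")
```

The rows shown above all contain a single word, which is why their teacher forcing sequence reduces to [1].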


In [29]:
print(doc.head(20))

         0              1       padded_en  encoded_fr encoded_en  \
0      Go.           Va !     <start> Go.        [53]       [19]   
1      Hi.        Salut !     <start> Hi.       [717]      [866]   
2     Run!        Cours !    <start> Run!      [3078]      [147]   
3     Run!       Courez !    <start> Run!      [3079]      [147]   
4     Wow!     Ça alors !    <start> Wow!  [24, 3080]     [2008]   
5    Fire!       Au feu !   <start> Fire!   [67, 640]      [756]   
6    Help!     À l'aide !   <start> Help!  [19, 3081]       [74]   
7    Jump.         Saute.   <start> Jump.      [1982]      [757]   
8    Stop!    Ça suffit !   <start> Stop!  [24, 1983]       [63]   
9    Stop!         Stop !   <start> Stop!      [3082]       [63]   
10   Stop!   Arrête-toi !   <start> Stop!   [122, 39]       [63]   
11   Wait!      Attends !   <start> Wait!       [249]      [110]   
12   Wait!     Attendez !   <start> Wait!       [311]      [110]   
13  Go on.      Poursuis.  <start> Go on.      [

It's rather difficult to work with variable-length sequences. Use zero-padding to normalize the length of all the sequences in each category.

In [30]:
encoded_padded_fr = tf.keras.preprocessing.sequence.pad_sequences(doc["encoded_fr"], padding='post')
encoded_padded_en = tf.keras.preprocessing.sequence.pad_sequences(doc["encoded_en"], padding='post')
teacher_forcing_en = tf.keras.preprocessing.sequence.pad_sequences(doc["encoded_padded_en_clean"], padding='post')

doc.head()

Unnamed: 0,0,1,padded_en,encoded_fr,encoded_en,encoded_padded_en,encoded_padded_en_clean
0,Go.,Va !,<start> Go.,[53],[19],"[1, 19]",[1]
1,Hi.,Salut !,<start> Hi.,[717],[866],"[1, 866]",[1]
2,Run!,Cours !,<start> Run!,[3078],[147],"[1, 147]",[1]
3,Run!,Courez !,<start> Run!,[3079],[147],"[1, 147]",[1]
4,Wow!,Ça alors !,<start> Wow!,"[24, 3080]",[2008],"[1, 2008]",[1]
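
Note that the DataFrame itself is unchanged; the padded arrays are separate objects. As a rough illustration of what padding='post' does, here is a pure-Python sketch. The helper pad_post is hypothetical and assumes, like the call above, that the target length is the length of the longest sequence.

```python
def pad_post(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


# A few encoded French rows taken from the table above.
encoded = [[53], [717], [24, 3080]]
padded = pad_post(encoded)
shape = (len(padded), len(padded[0]))

print(padded)  # [[53, 0], [717, 0], [24, 3080]]
print(shape)   # (3, 2)
```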


What are the shapes of the arrays you just created for the French, English, and teacher forcing (padded English) sentences?

In [31]:
encoded_padded_fr.shape, encoded_padded_en.shape, teacher_forcing_en.shape

((15000, 11), (15000, 5), (15000, 5))

Use sklearn's train_test_split function to divide your sample into training and validation sets.

In [32]:
from sklearn.model_selection import train_test_split
en_train, en_val, fr_train, fr_val, teacher_forcing_train, teacher_forcing_val = train_test_split(encoded_padded_en, encoded_padded_fr, teacher_forcing_en, test_size=0.3)
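
The important property of this call is that the three arrays are split with the same row permutation, so each French sentence stays aligned with its English target and its teacher forcing input. The sketch below imitates that behaviour in pure Python; split_aligned is a hypothetical helper, not sklearn's implementation, and its rounding of the validation size is an assumption.

```python
import random


def split_aligned(*arrays, test_size=0.3, seed=0):
    """Split several equally long lists using one shared shuffled index."""
    n = len(arrays[0])
    assert all(len(a) == n for a in arrays), "all arrays must have the same length"
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_val = round(n * test_size)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    result = []
    for a in arrays:
        result.append([a[i] for i in train_idx])  # train part
        result.append([a[i] for i in val_idx])    # validation part
    return result


# Toy data: row i of each list belongs to the same sentence pair.
en = [[i, 0] for i in range(10)]
fr = [[i * 10, 0] for i in range(10)]
shifted = [[1, i] for i in range(10)]

en_tr, en_va, fr_tr, fr_va, sh_tr, sh_va = split_aligned(en, fr, shifted)
print(len(en_tr), len(en_va))  # 7 3
```

With the real function, passing random_state makes the split reproducible from one run to the next.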

# MODEL

Now it's time to code the model. Thankfully, you can largely base it on the code provided during the demo!

Create the following variables:

n_embed: the number of dimensions you want for the embedding output space

n_lstm: the number of units you want for the LSTM layers

fr_len: the length of a French sentence

en_len: the length of an English or teacher forcing sentence

vocab_size_fr: the number of tokens in the French vocabulary

vocab_size_en: the number of tokens in the English vocabulary (based on the padded sequences, so the <start> token is included!)

In [33]:
n_embed = 128
n_lstm = 54
fr_len = encoded_padded_fr.shape[1]
en_len = encoded_padded_en.shape[1]
vocab_size_fr = len(tokenizer_fr.word_index)
vocab_size_en = len(tokenizer_en.word_index)

## Set up the encoder

This works the same way as in the demo; just make sure the input dimension of the embedding equals the number of words in the French vocabulary + 1 (for the zero-padding).

In [34]:
encoder_input = tf.keras.layers.Input(shape=(fr_len,))
encoder_embedding = tf.keras.layers.Embedding(input_dim=vocab_size_fr+1, output_dim=n_embed)
encoder_lstm = tf.keras.layers.LSTM(n_lstm, return_state=True)

encoder_embedding_output = encoder_embedding(encoder_input)
encoder_output = encoder_lstm(encoder_embedding_output)

encoder = tf.keras.Model(encoder_input, encoder_output)
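
The + 1 comes from the way word indices are assigned: they start at 1, and 0 is left free for padding, so ids run from 0 to the vocabulary size inclusive. The sketch below builds a small frequency-ranked index in pure Python to make this visible; it approximates the tokenizer's word_index rather than calling it, and the sentences are illustrative.

```python
from collections import Counter

sentences = ["va", "salut", "ça alors", "ça suffit"]

# Rank words by frequency; indices start at 1 so that 0 stays free for padding.
counts = Counter(word for s in sentences for word in s.split())
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

vocab_size = len(word_index)           # number of distinct words
largest_id = max(word_index.values())  # highest id the embedding will receive
input_dim = vocab_size + 1             # ids cover 0..vocab_size inclusive

print(word_index)
print(vocab_size, largest_id, input_dim)  # 5 5 6
```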

Try the encoder on the French training data (using the call method).

In [35]:
encoder(fr_train)

(<tf.Tensor: shape=(10500, 54), dtype=float32, numpy=
 array([[-0.0029259 ,  0.00423762, -0.00022697, ..., -0.02514385,
         -0.00603572,  0.03828905],
        [-0.00467367,  0.00342932, -0.00068056, ..., -0.02621231,
         -0.00447195,  0.03833227],
        [-0.00576637,  0.00032366, -0.00055746, ..., -0.0262196 ,
         -0.00725065,  0.04018672],
        ...,
        [-0.00364678,  0.00833101,  0.00586866, ..., -0.02265527,
         -0.00177914,  0.03089541],
        [-0.00613004,  0.00213864,  0.00080831, ..., -0.02626098,
         -0.00793605,  0.04020957],
        [-0.00364258,  0.00304395,  0.00084676, ..., -0.02543091,
         -0.00715573,  0.03807997]], dtype=float32)>,
 <tf.Tensor: shape=(10500, 54), dtype=float32, numpy=
 array([[-0.0029259 ,  0.00423762, -0.00022697, ..., -0.02514385,
         -0.00603572,  0.03828905],
        [-0.00467367,  0.00342932, -0.00068056, ..., -0.02621231,
         -0.00447195,  0.03833227],
        [-0.00576637,  0.00032366, -0.0005574

## Set up the decoder

This works the same way as in the demo; just make sure the input dimension of the embedding equals the number of words in the English vocabulary + 1 (for the zero-padding). The same goes for the last Dense layer!

In [36]:
decoder_input = tf.keras.Input(shape=(en_len,))
decoder_embed = tf.keras.layers.Embedding(input_dim=vocab_size_en+1,
                                          output_dim=n_embed)
decoder_lstm = tf.keras.layers.LSTM(n_lstm, return_sequences=True, return_state=True)
decoder_pred = tf.keras.layers.Dense(vocab_size_en+1, activation="softmax")

decoder_embed_output = decoder_embed(decoder_input) # teacher forcing happens here
# the decoder input is actually the padded target we created earlier, remember
# if target is: [91, 47, 89, 21, 62]
# the decoder input will be: [1, 91, 47, 89, 21]  (1 is the index of <start>)
decoder_lstm_output, _, _ = decoder_lstm(decoder_embed_output, initial_state=encoder_output[1:])
# in the step described above the decoder receives the encoder state as its
# initial state.
decoder_output = decoder_pred(decoder_lstm_output)
# then the dense layer will convert the vector representation for each element
# in the sequence into a probability distribution across all possible tokens
# in the vocabulary!

decoder = tf.keras.Model(inputs = [encoder_input,decoder_input], outputs = decoder_output)
# all we need to do is put the model together using the input output framework!

In [37]:
decoder([fr_train, teacher_forcing_train])

<tf.Tensor: shape=(10500, 5, 2920), dtype=float32, numpy=
array([[[0.00034316, 0.00034162, 0.00034017, ..., 0.00034312,
         0.00034113, 0.00034277],
        [0.00034281, 0.00034204, 0.00034085, ..., 0.00034286,
         0.0003413 , 0.00034258],
        [0.00034324, 0.00034274, 0.00034108, ..., 0.00034246,
         0.00034188, 0.00034281],
        [0.00034308, 0.00034292, 0.00034131, ..., 0.00034285,
         0.00034184, 0.00034292],
        [0.00034307, 0.00034311, 0.00034155, ..., 0.00034317,
         0.00034181, 0.00034306]],

       [[0.00034298, 0.00034168, 0.00034025, ..., 0.00034311,
         0.00034112, 0.0003428 ],
        [0.00034281, 0.00034149, 0.00034021, ..., 0.00034253,
         0.00034119, 0.00034291],
        [0.00034271, 0.00034227, 0.00034086, ..., 0.00034281,
         0.00034164, 0.00034255],
        [0.0003428 , 0.00034277, 0.00034111, ..., 0.00034316,
         0.00034166, 0.00034264],
        [0.00034296, 0.00034316, 0.00034135, ..., 0.00034345,
         0.000
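
The values are all close to 0.00034 because the model is still untrained: its logits are nearly equal, so the softmax is nearly uniform over the 2920 output classes. A minimal pure-Python check of that arithmetic, using a hypothetical softmax helper and exactly equal logits as a simplifying assumption:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


n_classes = 2920  # vocab_size_en + 1, the last dimension of the output above
probs = softmax([0.0] * n_classes)

print(round(probs[0], 8))  # 0.00034247, i.e. 1 / 2920
```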

## Set up the inference decoder
The code here will be identical to the demo's, unless you changed some naming conventions!

In [38]:
decoder_state_input_h = tf.keras.Input(shape=(n_lstm,))
decoder_state_input_c = tf.keras.Input(shape=(n_lstm,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
# at the first step of the inference, these inputs will be respectively the
# hidden state and C state of the encoder model
# for the following steps, they will become the hidden and C state from the
# decoder itself. Since the decoder's input sequence is unknown at inference
# time, we will have to predict step by step using a loop

decoder_input_inf = tf.keras.Input(shape=(1,))
decoder_embed_output = decoder_embed(decoder_input_inf)
# the decoder input here is of shape 1 because we will feed the elements in the
# sequence one by one

decoder_outputs, state_h, state_c = decoder_lstm(decoder_embed_output, initial_state=decoder_states_inputs)
# the lstm layer works in the same way, the output from the embedding is used
# and the decoder state is used as described above

decoder_states = [state_h, state_c]
# we store the lstm states in a specific object as we'll have to use them as
# initial state for the next inference step

decoder_outputs = decoder_pred(decoder_outputs)
# the lstm output is then converted to a probability distribution over the
# target vocabulary

decoder_inf = tf.keras.Model(inputs = [decoder_input_inf, decoder_states_inputs],
                     outputs = [decoder_outputs, decoder_states])
# Finally we wrap up the model building by setting up the inputs and outputs

Compile the decoder (the training version) using the appropriate loss and metric functions.

In [39]:
decoder.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]
)
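
The sparse variants are used because the targets are integer token ids rather than one-hot vectors. As a rough sketch of what the two quantities measure, here is a pure-Python version for a single sequence; the helper names mirror the Keras ones but are hypothetical re-implementations, and Keras additionally averages over the whole batch.

```python
import math


def sparse_categorical_crossentropy(target_ids, predicted_probs):
    """Mean of -log(p[target]) over the time steps of one sequence."""
    losses = [-math.log(p[t]) for t, p in zip(target_ids, predicted_probs)]
    return sum(losses) / len(losses)


def sparse_categorical_accuracy(target_ids, predicted_probs):
    """Share of time steps where the most probable class is the target id."""
    hits = [
        int(max(range(len(p)), key=p.__getitem__) == t)
        for t, p in zip(target_ids, predicted_probs)
    ]
    return sum(hits) / len(hits)


# Two time steps over a toy 3-token vocabulary.
probs = [[0.1, 0.7, 0.2], [0.2, 0.2, 0.6]]
targets = [1, 0]

loss = sparse_categorical_crossentropy(targets, probs)
acc = sparse_categorical_accuracy(targets, probs)
print(round(loss, 4), acc)  # 0.9831 0.5
```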

Train the decoder for 50 epochs; this should take about 10 minutes. Is there any overfitting?

In [40]:
decoder.fit(x=[fr_train, teacher_forcing_train], y=en_train, epochs=50, validation_data=([fr_val, teacher_forcing_val], en_val))

Epoch 1/50
329/329 ━━━━━━━━━━━━━━━━━━━━ 6s 9ms/step - loss: 5.2920 - sparse_categorical_accuracy: 0.3782 - val_loss: 3.6276 - val_sparse_categorical_accuracy: 0.4280
Epoch 2/50
329/329 ━━━━━━━━━━━━━━━━━━━━ 4s 9ms/step - loss: 3.5500 - sparse_categorical_accuracy: 0.4317 - val_loss: 3.4568 - val_sparse_categorical_accuracy: 0.4464
Epoch 3/50
329/329 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - loss: 3.3707 - sparse_categorical_accuracy: 0.4508 - val_loss: 3.3317 - val_sparse_categorical_accuracy: 0.4659
Epoch 4/50
329/329 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - loss: 3.2319 - sparse_categorical_accuracy: 0.4667 - val_loss: 3.1780 - val_sparse_categorical_accuracy: 0.4696
Epoch 5/50
329/329 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - loss: 3.0612 - sparse_categorical_accuracy: 0.4735 - val_loss: 3.0347 - val_sparse_categorical_accuracy: 0.4854


<keras.src.callbacks.history.History at 0x7c2e7fc084c0>

Adapt the code from the demo to make some predictions on the validation data.
Be careful: in the demo, the starting index for the teacher forcing sequences was 0. What index starts the teacher forcing sequences now?

Set up the first decoder input with the right dimensions too!

In [41]:
enc_input = fr_val
# classic encoder input

dec_input = tf.ones(shape=(len(fr_val),1))
# the first decoder input is the special <start> token, whose index is 1

enc_out, state_h_inf, state_c_inf = encoder(enc_input)
# we compute once and for all the encoder output and the encoder
# h state and c state

dec_state = [state_h_inf, state_c_inf]
# The encoder h state and c state will serve as initial states for the
# decoder

pred = []  # we'll store the predictions in here

# we loop over the expected length of the target, but actually the loop can run
# for as many steps as we wish, which is the advantage of the encoder decoder
# architecture
for i in range(en_len):
  dec_out, dec_state = decoder_inf([dec_input, dec_state])
  # the decoder state is updated and we get the first prediction probability
  # vector
  decoded_out = tf.argmax(dec_out, axis=-1)
  # we decode the softmax vector into an index
  pred.append(decoded_out) # update the prediction list
  dec_input = decoded_out # the previous pred will be used as the new input

pred = tf.concat(pred, axis=-1).numpy()
for i in range(10):
  print("pred:", pred[i,:])
  print("true:", en_val[i,:])
  print("\n")

pred: [  2  26  29 107 163]
true: [  2  26  29 107   0]


pred: [   2  665    3    2 1583]
true: [  2 453   3  39   0]


pred: [ 15  16 166   7  21]
true: [51 16 49  0  0]


pred: [215 132   9 132   9]
true: [460 139   0   0   0]


pred: [178 350 524  66  63]
true: [178 879   0   0   0]


pred: [   2   26 1728  263    4]
true: [  2 421   0   0   0]


pred: [  6 244  28 510 510]
true: [  6 244  28 510   0]


pred: [  2 206 126 564  54]
true: [  2 859 242  23   0]


pred: [  32   11 2635  218  218]
true: [ 32 645   0   0   0]


pred: [ 14  59  19 247  48]
true: [  14   59 1215    0    0]
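
The loop above is a greedy decoder: at each step it keeps the most probable token, feeds it back as the next input, and carries the state forward. Stripped of TensorFlow, the same control flow looks like the sketch below. toy_step is a hypothetical stand-in for decoder_inf with an arbitrary rule and a 5-token vocabulary; only the structure of greedy_decode is the point.

```python
START_ID = 1  # index of the <start> token


def toy_step(prev_token, state):
    """Hypothetical stand-in for decoder_inf: returns (scores, new_state)."""
    next_token = (prev_token + state) % 5
    scores = [0.0] * 5
    scores[next_token] = 1.0
    return scores, state + 1


def greedy_decode(step_fn, initial_state, n_steps):
    """Feed each prediction back as the next input, carrying the state along."""
    token, state = START_ID, initial_state
    output = []
    for _ in range(n_steps):
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        output.append(token)
    return output


decoded = greedy_decode(toy_step, initial_state=2, n_steps=4)
print(decoded)  # [3, 1, 0, 0]
```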




Use the tokenizer to convert the target and predicted sequences back to text. What do you think of the translations?

In [42]:
y_sample = tokenizer_en.sequences_to_texts(en_val)[:10]
pred_sample = tokenizer_en.sequences_to_texts(pred)[:10]

for i, j in zip(y_sample,pred_sample):
  print("true:", i)
  print("pred", j)
  print("\n")

true: i have no time
pred i have no time day


true: i woke you up
pred i believed you i borrowed


true: that's the one
pred it's the cat is that


true: study hard
pred sit down tom down tom


true: everyone agreed
pred everyone knows tea now stop


true: i forgot
pred i have forgotten forget it


true: i'm by your side
pred i'm by your side side


true: i truly hope not
pred i wasn't done yet out


true: they escaped
pred they are melons today today


true: we can't fail
pred we can't go die on
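
Converting ids back to text amounts to inverting the word index and skipping ids that have no word, such as the padding value 0, which is why the padded true sequences print without trailing tokens. A minimal pure-Python sketch, with a hypothetical five-word vocabulary standing in for tokenizer_en.word_index:

```python
# Hypothetical vocabulary standing in for tokenizer_en.word_index.
word_index = {"<start>": 1, "i": 2, "have": 3, "no": 4, "time": 5}
index_word = {i: w for w, i in word_index.items()}


def sequence_to_text(sequence):
    """Join the words of known ids; ids without a word (like 0) are skipped."""
    return " ".join(index_word[i] for i in sequence if i in index_word)


text = sequence_to_text([2, 3, 4, 5, 0])
print(text)  # i have no time
```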


