# English-to-Spanish translation with a sequence-to-sequence Transformer

**Author:** [fchollet](https://twitter.com/fchollet)<br>
**Date created:** 2021/05/26<br>
**Last modified:** 2022/02/25<br>
**Description:** Implementing a sequence-to-sequence Transformer and training it on a machine translation task.

## Introduction

In this example, we'll build a sequence-to-sequence Transformer model, which
we'll train on an English-to-Spanish machine translation task.

You'll learn how to:

- Vectorize text using the Keras `TextVectorization` layer.
- Implement a `TransformerEncoder` layer, a `TransformerDecoder` layer,
and a `PositionalEmbedding` layer.
- Prepare data for training a sequence-to-sequence model.
- Use the trained model to generate translations of never-seen-before
input sentences (sequence-to-sequence inference).

The code featured here is adapted from the book
[Deep Learning with Python, Second Edition](https://www.manning.com/books/deep-learning-with-python-second-edition)
(chapter 11: Deep learning for text).
The present example is fairly barebones, so for detailed explanations of
how each building block works, as well as the theory behind Transformers,
I recommend reading the book.

We will also try to find the best parameters for three ranges of input length:

a) 1-3 words

b) 4-10 words

c) more than 10 words

## Setup

In [18]:
import pathlib
import random
import string
import re
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization

## Downloading the data

We'll be working with an English-to-Spanish translation dataset
provided by [Anki](https://www.manythings.org/anki/). Later on, we will split it into three subsets according to the length of the input text. Let's download it:

In [19]:
text_file = keras.utils.get_file(
    fname="spa-eng.zip",
    origin="http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip",
    extract=True,
)
text_file = pathlib.Path(text_file).parent / "spa-eng" / "spa.txt"

## Parsing the data

Each line contains an English sentence and its corresponding Spanish sentence.
The English sentence is the *source sequence* and Spanish one is the *target sequence*.
We prepend the token `"[start]"` and we append the token `"[end]"` to the Spanish sentence.

*Run only one of the following three code cells at a time; each one overwrites `text_pairs`.*

In [20]:
# Takes into account only English sentences of fewer than 4 words
with open(text_file) as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    eng, spa = line.split("\t")
    spa = "[start] " + spa + " [end]"
    if len(eng.split()) < 4:
        text_pairs.append((eng, spa))

In [21]:
# Takes into account only English sentences of 4 to 10 words
# with open(text_file) as f:
#     lines = f.read().split("\n")[:-1]
# text_pairs = []
# for line in lines:
#     eng, spa = line.split("\t")
#     spa = "[start] " + spa + " [end]"
#     if 4 <= len(eng.split()) <= 10:
#         text_pairs.append((eng, spa))

In [22]:
# Takes into account only English sentences of more than 10 words
# with open(text_file) as f:
#     lines = f.read().split("\n")[:-1]
# text_pairs = []
# for line in lines:
#     eng, spa = line.split("\t")
#     spa = "[start] " + spa + " [end]"
#     if len(eng.split()) > 10:
#         text_pairs.append((eng, spa))
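The three cells above differ only in the word-count bounds, so the same filtering could be expressed with one parameterized helper. The following is a minimal sketch, not part of the original notebook: `filter_pairs_by_length`, `min_words`, `max_words`, and the two sample lines are hypothetical names and data chosen for illustration.

```python
def filter_pairs_by_length(lines, min_words=1, max_words=None):
    """Return (eng, spa) pairs whose English side has min_words..max_words words."""
    pairs = []
    for line in lines:
        # Each line is expected to hold "english<TAB>spanish".
        eng, spa = line.split("\t")[:2]
        n_words = len(eng.split())
        if n_words < min_words:
            continue
        if max_words is not None and n_words > max_words:
            continue
        # Same start/end markers as in the cells above.
        pairs.append((eng, "[start] " + spa + " [end]"))
    return pairs


# Two made-up lines in the dataset's tab-separated format.
sample_lines = [
    "Go.\tVe.",
    "I like climbing mountains.\tMe gusta escalar montañas.",
]
short_pairs = filter_pairs_by_length(sample_lines, max_words=3)
medium_pairs = filter_pairs_by_length(sample_lines, min_words=4, max_words=10)
long_pairs = filter_pairs_by_length(sample_lines, min_words=11)
```

With a helper like this, switching between the three length ranges becomes a change of arguments instead of commenting cells in and out.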

Here's what our sentence pairs look like:

In [23]:
for _ in range(10):
    print(random.choice(text_pairs))

("Someone's watching me.", '[start] Alguien me está vigilando. [end]')
('Keep this.', '[start] Guarde esto. [end]')
('Tom likes Mary.', '[start] A Tom le gusta Mary. [end]')
('She likes tigers.', '[start] A ella le gustan los tigres. [end]')
('Do not interfere!', '[start] ¡No interfieras! [end]')
('That will do.', '[start] Con eso bastará. [end]')
("I didn't ask.", '[start] Yo no pregunté. [end]')
('Have him come.', '[start] Hazlo venir. [end]')
('I loved that.', '[start] Eso me gustó. [end]')
('Come on.', '[start] Ándale. [end]')


Now, let's split the sentence pairs into a training set, a validation set,
and a test set.

In [24]:
random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples :]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

12456 total pairs
8720 training pairs
1868 validation pairs
1868 test pairs
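The split sizes printed above follow directly from the arithmetic in the cell: 15% of the pairs (rounded down) go to validation, the same number to test, and the rest to training. A small sketch that reproduces the numbers, where `split_sizes` is a hypothetical helper introduced only for this check:

```python
def split_sizes(total, val_fraction=0.15):
    # Validation and test sets get the same size; training gets the remainder.
    num_val = int(val_fraction * total)
    num_train = total - 2 * num_val
    return num_train, num_val, num_val


train_n, val_n, test_n = split_sizes(12456)
print(train_n, val_n, test_n)  # 8720 1868 1868
```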


## Vectorizing the text data

We'll use two instances of the `TextVectorization` layer to vectorize the text
data (one for English and one for Spanish),
that is to say, to turn the original strings into integer sequences
where each integer represents the index of a word in a vocabulary.
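To make "integer sequences" concrete, here is a toy, pure-Python sketch of the idea. It is not how `TextVectorization` is implemented: `build_vocab` and `vectorize` are hypothetical helpers, punctuation handling is omitted, and the only convention borrowed from the layer is that index 0 is used for padding and index 1 for out-of-vocabulary words.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    # Count lowercase, whitespace-separated words across all texts.
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is reserved for padding, index 1 for unknown words.
    vocab = ["", "[UNK]"] + [w for w, _ in counts.most_common(max_tokens - 2)]
    return {word: idx for idx, word in enumerate(vocab)}


def vectorize(text, word_index, sequence_length):
    # Map each word to its index (1 if unknown), then truncate or pad with 0.
    ids = [word_index.get(w, 1) for w in text.lower().split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


word_index = build_vocab(["i love golf", "i love tea"], max_tokens=10)
vector = vectorize("i love coffee", word_index, sequence_length=5)
print(vector)  # [2, 3, 1, 0, 0]
```

The word "coffee" was never seen when building the vocabulary, so it maps to the unknown index, and the two trailing zeros are padding up to the fixed sequence length.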

The English layer will use the default string standardization (strip punctuation characters)
and splitting scheme (split on whitespace), while
the Spanish layer will use a custom standardization, where we add the character
`"¿"` to the set of punctuation characters to be stripped.

Note: in a production-grade machine translation model, I would not recommend
stripping the punctuation characters in either language. Instead, I would recommend turning
each punctuation character into its own token,
which you could achieve by providing a custom `split` function to the `TextVectorization` layer.
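As an illustration of what "each punctuation character as its own token" means, here is a pure-Python sketch. It only shows the tokenization idea: a callable actually passed to `TextVectorization` would presumably need to operate on string tensors (for example with `tf.strings` ops) rather than on Python strings. `PUNCT` and `split_keep_punctuation` are hypothetical names.

```python
import re

# Hypothetical set of punctuation marks to keep as standalone tokens.
PUNCT = "¿?¡!.,;:"


def split_keep_punctuation(text):
    # Surround each punctuation mark with spaces, then split on whitespace.
    spaced = re.sub("([%s])" % re.escape(PUNCT), r" \1 ", text.lower())
    return spaced.split()


tokens = split_keep_punctuation("¿Dónde está Tom?")
print(tokens)  # ['¿', 'dónde', 'está', 'tom', '?']
```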

In [25]:
strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

vocab_size = 15000
sequence_length = 20
batch_size = 64


def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, "[%s]" % re.escape(strip_chars), "")


eng_vectorization = TextVectorization(
    max_tokens=vocab_size, output_mode="int", output_sequence_length=sequence_length,
)
spa_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
    standardize=custom_standardization,
)
train_eng_texts = [pair[0] for pair in train_pairs]
train_spa_texts = [pair[1] for pair in train_pairs]
eng_vectorization.adapt(train_eng_texts)
spa_vectorization.adapt(train_spa_texts)

Next, we'll format our datasets.

At each training step, the model will seek to predict target words N+1 (and beyond)
using the source sentence and the target words 0 to N.

As such, the training dataset will yield a tuple `(inputs, targets)`, where:

- `inputs` is a dictionary with the keys `encoder_inputs` and `decoder_inputs`.
`encoder_inputs` is the vectorized source sentence and `decoder_inputs` is the target sentence "so far",
that is to say, the words 0 to N used to predict word N+1 (and beyond) in the target sentence.
- `targets` is the target sentence offset by one step:
it provides the next words in the target sentence -- what the model will try to predict.
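The one-step offset can be seen on a single sequence of token ids. This is a minimal sketch with made-up ids (2 for `"[start]"`, 3 for `"[end]"`, 0 for padding are assumptions for illustration); `make_decoder_example` is a hypothetical helper.

```python
def make_decoder_example(target_ids):
    # The decoder sees everything except the last position...
    decoder_inputs = target_ids[:-1]
    # ...and must predict everything except the first position.
    targets = target_ids[1:]
    return decoder_inputs, targets


# Hypothetical ids: 2 = "[start]", 3 = "[end]", 0 = padding.
target_ids = [2, 15, 7, 3, 0, 0]
decoder_inputs, targets = make_decoder_example(target_ids)
print(decoder_inputs)  # [2, 15, 7, 3, 0]
print(targets)         # [15, 7, 3, 0, 0]
```

Both slices are one element shorter than the original sequence, which is why the Spanish vectorizer is configured with `sequence_length + 1` output steps: after slicing, decoder inputs and targets are each `sequence_length` long.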

In [26]:

def format_dataset(eng, spa):
    eng = eng_vectorization(eng)
    spa = spa_vectorization(spa)
    return ({"encoder_inputs": eng, "decoder_inputs": spa[:, :-1],}, spa[:, 1:])


def make_dataset(pairs):
    eng_texts, spa_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset)
    return dataset.shuffle(2048).prefetch(16).cache()


train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

Let's take a quick look at the sequence shapes
(we have batches of 64 pairs, and all sequences are 20 steps long):

In [27]:
for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")

inputs["encoder_inputs"].shape: (64, 20)
inputs["decoder_inputs"].shape: (64, 20)
targets.shape: (64, 20)


## Building the model

Our sequence-to-sequence Transformer consists of a `TransformerEncoder`
and a `TransformerDecoder` chained together. To make the model aware of word order,
we also use a `PositionalEmbedding` layer.

The source sequence will be passed to the `TransformerEncoder`,
which will produce a new representation of it.
This new representation will then be passed
to the `TransformerDecoder`, together with the target sequence so far (target words 0 to N).
The `TransformerDecoder` will then seek to predict the next words in the target sequence (N+1 and beyond).

A key detail that makes this possible is causal masking
(see method `get_causal_attention_mask()` on the `TransformerDecoder`).
The `TransformerDecoder` sees the entire sequence at once, and thus we must make
sure that it only uses information from target tokens 0 to N when predicting token N+1
(otherwise, it could use information from the future, which would
result in a model that cannot be used at inference time).
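The causal mask is a lower-triangular matrix: row `i` (the position being predicted from) may attend to column `j` only when `i >= j`. A pure-Python sketch of the same comparison, without the batch dimension that the layer adds; `causal_mask` is a hypothetical standalone function, not the layer's method.

```python
def causal_mask(sequence_length):
    # Entry [i][j] is 1 when position i is allowed to attend to position j.
    return [
        [1 if i >= j else 0 for j in range(sequence_length)]
        for i in range(sequence_length)
    ]


for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

The first position can only see itself, while the last position can see the whole sequence so far.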

In [28]:

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, mask=None):
        # Default to no mask so that `padding_mask` is always defined.
        padding_mask = None
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
        attention_output = self.attention(
            query=inputs, value=inputs, key=inputs, attention_mask=padding_mask
        )
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "dense_dim": self.dense_dim,
            "num_heads": self.num_heads,
        })
        return config


class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=embed_dim
        )
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super().get_config()
        config.update({
            "sequence_length": self.sequence_length,
            "vocab_size": self.vocab_size,
            "embed_dim": self.embed_dim,
        })
        return config


class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.latent_dim = latent_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [layers.Dense(latent_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        # Default to no mask so that `padding_mask` is always defined.
        padding_mask = None
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)

        attention_output_1 = self.attention_1(
            query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
        )
        out_1 = self.layernorm_1(inputs + attention_output_1)

        attention_output_2 = self.attention_2(
            query=out_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        out_2 = self.layernorm_2(out_1 + attention_output_2)

        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(out_2 + proj_output)

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)],
            axis=0,
        )
        return tf.tile(mask, mult)

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "latent_dim": self.latent_dim,
            "num_heads": self.num_heads,
        })
        return config


Next, we assemble the end-to-end model.

In [29]:
embed_dim = 512
latent_dim = 2048
num_heads = 8

encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, latent_dim, num_heads)(x)
encoder = keras.Model(encoder_inputs, encoder_outputs)

decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, embed_dim), name="decoder_state_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, encoded_seq_inputs)
x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
decoder = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)

decoder_outputs = decoder([decoder_inputs, encoder_outputs])
transformer = keras.Model(
    [encoder_inputs, decoder_inputs], decoder_outputs, name="transformer"
)

## Training our model

We'll use accuracy as a quick way to monitor training progress on the validation data.
Note that machine translation typically uses BLEU scores as well as other metrics, rather than accuracy.

Here we train for 30 epochs; to get the model to actually converge
you should train for at least that many.

In [30]:
epochs = 30  # This should be at least 30 for convergence

transformer.summary()
transformer.compile(
    "rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
transformer.fit(train_ds, epochs=epochs, validation_data=val_ds)

Model: "transformer"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 encoder_inputs (InputLayer)    [(None, None)]       0           []                               
                                                                                                  
 positional_embedding_2 (Positi  (None, None, 512)   7690240     ['encoder_inputs[0][0]']         
 onalEmbedding)                                                                                   
                                                                                                  
 decoder_inputs (InputLayer)    [(None, None)]       0           []                               
                                                                                                  
 transformer_encoder_1 (Transfo  (None, None, 512)   10503168    ['positional_embedding_

<keras.callbacks.History at 0x7f4b075327a0>

## Decoding test sentences

Finally, let's demonstrate how to translate brand new English sentences.
We simply feed into the model the vectorized English sentence
as well as the target token `"[start]"`, then we repeatedly generate the next token, until
we hit the token `"[end]"`.
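The control flow of this greedy loop can be sketched independently of the model. In the sketch below, `greedy_decode` is a hypothetical helper and `toy_next_token` is a canned stand-in for the trained Transformer (an assumption for illustration only); the real loop picks the highest-probability token from the model's predictions at each step.

```python
def greedy_decode(next_token_fn, max_length=20):
    # Start from the "[start]" marker and append one token per step.
    tokens = ["[start]"]
    for _ in range(max_length):
        token = next_token_fn(tokens)
        tokens.append(token)
        # Stop as soon as the end marker is produced.
        if token == "[end]":
            break
    return " ".join(tokens)


# Hypothetical stand-in for the model: replays a fixed translation.
canned = ["haz", "una", "lista", "[end]"]


def toy_next_token(tokens_so_far):
    return canned[len(tokens_so_far) - 1]


result = greedy_decode(toy_next_token)
print(result)  # [start] haz una lista [end]
```

The `max_length` bound matters because nothing guarantees that a model will ever emit `"[end]"`.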

In [31]:
spa_vocab = spa_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20


def decode_sequence(input_sentence):
    tokenized_input_sentence = eng_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = spa_vectorization([decoded_sentence])[:, :-1]
        predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])

        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token

        if sampled_token == "[end]":
            break
    return decoded_sentence


test_eng_texts = [pair[0] for pair in test_pairs]
for n in range(30):
    input_sentence = random.choice(test_eng_texts)
    translated = decode_sequence(input_sentence)
    print(n, input_sentence, translated)

0 You're nuts! [start] estás loco [end]
1 I'm grieving. [start] soy imparcial [end]
2 I feel tired. [start] yo siento cansada [end]
3 Bite your tongue. [start] Átate los cordones [end]
4 Make a list. [start] haz una lista [end]
5 Flip a coin. [start] los una zorra [end]
6 Wait here. [start] esperad aquí [end]
7 You aren't stupid. [start] no sos estúpido [end]
8 I'm so alone. [start] estoy tan solo [end]
9 Do fish sleep? [start] lo dormir [end]
10 Summer is coming. [start] el invierno se acerca [end]
11 I love golf. [start] me encantan las bodas [end]
12 Keep in touch. [start] mantente en contacto [end]
13 I've been suspended. [start] he estado preocupada [end]
14 Keep still. [start] mantente quieto [end]
15 Come over. [start] veníos [end]
16 It is second-hand. [start] es maravilloso [end]
17 I'll change. [start] voy a cambiar [end]
18 Somebody poisoned Tom. [start] alguien a tom [end]
19 That's obvious. [start] es obvio [end]
20 You want this? [start] eres este [end]
21 Be confident. [

# Evaluation

We score the test-set translations with two metrics, BLEU and chrF, using the implementations from NLTK.

## BLEU

In [32]:
pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [33]:
import random
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

In [34]:
def calculate_bleu(reference, candidate):
    reference_tokens = reference.split()
    candidate_tokens = candidate.split()
    smoothie = SmoothingFunction().method4  # smoothing to handle cases with 0 counts
    return sentence_bleu([reference_tokens], candidate_tokens, smoothing_function=smoothie)


test_eng_texts = []
for pair in test_pairs:
    eng_sentence = pair[0]
    test_eng_texts.append(eng_sentence)

test_spa_texts = []
for pair in test_pairs:
    spa_sentence = pair[1]
    test_spa_texts.append(spa_sentence)

print(len(test_eng_texts))

bleu_scores = []
for idx in range(len(test_eng_texts)):
    input_sentence = test_eng_texts[idx]
    target_translation = test_spa_texts[idx]
    translated = decode_sequence(input_sentence)
    translated = translated.replace("[start]", "").replace("[end]", "").strip()

    bleu_score = calculate_bleu(target_translation, translated)
    bleu_scores.append(bleu_score)

    print(idx, input_sentence, translated, "| BLEU score:", bleu_score)

average_bleu = np.mean(bleu_scores)
print("\n\nAverage BLEU score:", average_bleu)

1868
0 We're having dinner. estamos colgados en la cena | BLEU score: 0
1 Why not? por qué no interfieras | BLEU score: 0.04753271977233425
2 Send him in. mándalo dentro | BLEU score: 0
3 I ate it. lo conozco | BLEU score: 0.015071184180845467
4 I like climbing. me gusta la luz de las velas | BLEU score: 0.081939171811711
5 This doesn't fit. esto no cabe | BLEU score: 0.03722145753922423
6 Nobody's indispensable. nadie sabrá | BLEU score: 0
7 I don't eat. yo no como comer | BLEU score: 0
8 Ask me tomorrow. pregunta en mañana | BLEU score: 0
9 Raise your hands! levantá tus manos | BLEU score: 0
10 Here's your bag. aquí está tu cartera | BLEU score: 0.10202995073993343
11 I see them. las veo | BLEU score: 0
12 We're Canadians. somos abogados | BLEU score: 0
13 They waited. se abrazaron | BLEU score: 0
14 Kiss Tom. besen a tomás | BLEU score: 0.03722145753922423
15 Was I snoring? estaba equivocada | BLEU score: 0
16 You're too idealistic. eres demasiado generoso | BLEU score: 0.0372214575
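One thing worth checking when reading these scores: the references taken from `test_pairs` still contain the `[start]` and `[end]` markers, capital letters, and punctuation (see the sample pairs printed earlier), while the decoded candidates are lowercase with punctuation stripped. That mismatch probably lowers the word-level overlap. The sketch below shows one possible way to normalize a reference before scoring. `normalize_reference` is a hypothetical helper, and Python's `str.lower()` also lowercases accented capitals, which may not match the tensor-based standardization used by the vectorizer.

```python
import re
import string

# Same character set as the notebook's `strip_chars`.
strip_chars = (string.punctuation + "¿").replace("[", "").replace("]", "")


def normalize_reference(reference):
    # Drop the start/end markers first, since the brackets are not stripped below.
    text = reference.replace("[start]", "").replace("[end]", "")
    # Lowercase and remove punctuation, roughly mirroring the vectorizer.
    text = re.sub("[%s]" % re.escape(strip_chars), "", text.lower())
    # Collapse leftover whitespace.
    return " ".join(text.split())


normalized = normalize_reference("[start] A Tom le gusta Mary. [end]")
print(normalized)  # a tom le gusta mary
```

Scores computed against normalized references would not be comparable with the numbers printed above.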

## chrF

In [35]:
from nltk.translate.chrf_score import sentence_chrf

In [36]:
def calculate_chrf(reference, candidate):
    # chrF compares character n-grams, so the plain strings are passed as-is.
    return sentence_chrf(reference, candidate, min_len=2, max_len=6, beta=3.0)

test_eng_texts = []
for pair in test_pairs:
    eng_sentence = pair[0]
    test_eng_texts.append(eng_sentence)

test_spa_texts = []
for pair in test_pairs:
    spa_sentence = pair[1]
    test_spa_texts.append(spa_sentence)

print(len(test_eng_texts))

chrf_scores = []
for idx, (input_sentence, target_translation) in enumerate(zip(test_eng_texts, test_spa_texts)):
    translated = decode_sequence(input_sentence)
    translated = translated.replace("[start]", "").replace("[end]", "").strip()

    chrf_score = calculate_chrf(target_translation, translated)
    chrf_scores.append(chrf_score)

    print(idx, input_sentence, translated, "| chrF score:", chrf_score)

average_chrf = np.mean(chrf_scores)
print("\n\nAverage chrF score:", average_chrf)

1868
0 We're having dinner. estamos colgados en la cena | chrF score: 0.23028814154651345
1 Why not? por qué no interfieras | chrF score: 0.2093635853687812
2 Send him in. mándalo dentro | chrF score: 0.1621894607114041
3 I ate it. lo conozco | chrF score: 0.06629156563414278
4 I like climbing. me gusta la luz de las velas | chrF score: 0.20265479339994757
5 This doesn't fit. esto no cabe | chrF score: 0.102120026566992
6 Nobody's indispensable. nadie sabrá | chrF score: 0.06853018035548639
7 I don't eat. yo no como comer | chrF score: 0.12242529221082628
8 Ask me tomorrow. pregunta en mañana | chrF score: 0.17157868183882444
9 Raise your hands! levantá tus manos | chrF score: 0.23806587785868363
10 Here's your bag. aquí está tu cartera | chrF score: 0.27188411298948667
11 I see them. las veo | chrF score: 0.12793017662604753
12 We're Canadians. somos abogados | chrF score: 0.05424778311056875
13 They waited. se abrazaron | chrF score: 0.07253821054668874
14 Kiss Tom. besen a tomás | c