This tutorial demonstrates how to train a sequence-to-sequence (seq2seq) model for Spanish-to-English translation, roughly based on [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/pdf/1508.04025v5.pdf). The model uses an encoder-decoder architecture. The encoder is a bidirectional RNN that condenses the input sequence into a single vector, and the decoder is an RNN that expands that vector into a new sequence.

![example](./images/notebook2_figure1.png)

An encoder/decoder connected by attention.

While this architecture is somewhat outdated, it is still a very useful project to work through to get a deeper understanding of seq2seq models and attention mechanisms before going on to Transformers.

> ## SETUP

In [1]:
import numpy as np

import typing
from typing import Any, Tuple

import einops
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from IPython.display import clear_output
import tensorflow as tf
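# Run a trivial op so TensorFlow initializes (and prints any startup logs) now,
# then clear the cell output.
tf.multiply(2, 3)
clear_output()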
import tensorflow_text as tf_text

>> ### The data

We use a dataset from [Anki](http://www.manythings.org/anki/). It contains language translation pairs, one tab-separated pair per line, in the format: <br>
`May I borrow this book? ¿Puedo tomar prestado este libro?`

>> ### Download and prepare the dataset

After downloading the dataset, here are the steps you need to take to prepare the data:
1. Add a start and end token to each sentence.
2. Clean the sentences by removing special characters.
3. Create a word index and reverse word index (dictionaries mapping from word → id and id → word).
4. Pad each sentence to a maximum length.
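
The four steps above can be sketched in plain Python on two toy sentences. This is only an illustration of the idea, not the preprocessing code used in this tutorial. The helper names (`clean`, `tokenize`, `build_indexes`, `pad`) and the `[PAD]` token are assumptions made for the example. Note that the sketch cleans the sentence before adding the markers, so that the brackets in `[START]` and `[END]` are not stripped as special characters.

```python
import re

PAD, START, END = "[PAD]", "[START]", "[END]"  # hypothetical token names


def clean(sentence):
    # Step 2: lowercase, then drop everything except a-z and spaces.
    return re.sub(r"[^a-z ]", "", sentence.lower())


def tokenize(sentence):
    # Step 1: wrap the cleaned words in start and end tokens.
    return [START] + clean(sentence).split() + [END]


def build_indexes(token_lists):
    # Step 3: word -> id and id -> word dictionaries; id 0 is reserved for padding.
    word_to_id = {PAD: 0}
    for tokens in token_lists:
        for token in tokens:
            word_to_id.setdefault(token, len(word_to_id))
    id_to_word = {i: w for w, i in word_to_id.items()}
    return word_to_id, id_to_word


def pad(ids, length, pad_id=0):
    # Step 4: truncate or right-pad a list of ids to exactly `length` entries.
    return ids[:length] + [pad_id] * max(0, length - len(ids))


sentences = ["May I borrow this book?", "He is getting old."]
token_lists = [tokenize(s) for s in sentences]
word_to_id, id_to_word = build_indexes(token_lists)
max_len = max(len(tokens) for tokens in token_lists)
padded = [pad([word_to_id[t] for t in tokens], max_len) for tokens in token_lists]

for row in padded:
    print(row, [id_to_word[i] for i in row])
```

Both rows end up with the same length, and the shorter sentence is filled with the padding id `0` at the end.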

* Step 0: Download the data

In [2]:
import pathlib

path_to_zip = tf.keras.utils.get_file(
    'spa-eng.zip', origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
    extract=True)
path_to_file = pathlib.Path(path_to_zip).parent/'spa-eng/spa.txt'

In [3]:
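def load_data(path):
    text = path.read_text(encoding='utf-8')
    lines = text.splitlines()
    # Each line holds an English sentence and its Spanish translation, separated by a tab.
    pairs = [line.split('\t') for line in lines]

    # English is the translation target; Spanish is the context (the model input).
    target = np.array([pair[0] for pair in pairs])
    context = np.array([pair[1] for pair in pairs])

    return target, context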

In [4]:
target_raw, context_raw = load_data(path_to_file)
print(context_raw[-1])

Si quieres sonar como un hablante nativo, debes estar dispuesto a practicar diciendo la misma frase una y otra vez de la misma manera en que un músico de banjo practica el mismo fraseo una y otra vez hasta que lo puedan tocar correctamente y en el tiempo esperado.


In [5]:
print(target_raw[-1])

If you want to sound like a native speaker, you must be willing to practice saying the same sentence over and over in the same way that banjo players practice the same phrase over and over until they can play it correctly and at the desired tempo.


>> ### Create a tf.data dataset

In [6]:
BUFFER_SIZE = len(context_raw)
BATCH_SIZE = 64

is_train = np.random.uniform(size=(len(target_raw),)) < 0.8

train_raw = (
    tf.data.Dataset.from_tensor_slices((context_raw[is_train], target_raw[is_train]))
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE)
)

val_raw = (
    tf.data.Dataset.from_tensor_slices((context_raw[~is_train], target_raw[~is_train]))
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE)
)

In [7]:
for example_context_strings, example_target_strings in train_raw.take(1):
    print(example_context_strings[:5])
    print()
    print(example_target_strings[:5])
    break

tf.Tensor(
[b'Cuando en Roma, haz como los romanos.' b'Soy un investigador privado.'
 b'Se est\xc3\xa1 haciendo mayor.'
 b'\xc2\xbfLe pedir\xc3\xadas a Tom que me llame de vuelta, por favor?'
 b'Tom no tuvo el valor de admitir que hab\xc3\xada cometido un error.'], shape=(5,), dtype=string)

tf.Tensor(
[b'When in Rome, do as the Romans do.' b"I'm a private investigator."
 b'He is getting old.' b'Would you ask Tom to call me back, please?'
 b"Tom didn't have the courage to admit that he had made a mistake."], shape=(5,), dtype=string)


>> ### Text Preprocessing

One of the goals of this tutorial is to build a model that can be exported as a `tf.saved_model`. To make that exported model useful, it should take `tf.string` inputs and return `tf.string` outputs. All the text preprocessing therefore happens inside the model, mainly in a `tf.keras.layers.TextVectorization` layer.

>>> #### Standardization

The model deals with multilingual text and a limited vocabulary, so it is important to standardize the input text.

The first step is Unicode normalization, which splits accented characters into a base letter plus a combining mark and replaces compatibility characters with their ASCII equivalents. The `tensorflow_text` package contains a Unicode normalization operation:

In [8]:
example_text = tf.constant('¿Todavía está en casa?')

print(example_text.numpy())
print(tf_text.normalize_utf8(example_text, 'NFKD').numpy())

b'\xc2\xbfTodav\xc3\xada est\xc3\xa1 en casa?'
b'\xc2\xbfTodavi\xcc\x81a esta\xcc\x81 en casa?'
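
The same decomposition can be reproduced with Python's standard `unicodedata` module, which is a convenient way to inspect what NFKD does without TensorFlow. This is a side illustration, not part of the tutorial's pipeline, and the variable names are chosen only for the example.

```python
import unicodedata

# Start from the precomposed (NFC) form, where each accented letter is one code point.
text = unicodedata.normalize("NFC", "¿Todavía está en casa?")
decomposed = unicodedata.normalize("NFKD", text)

# Each accented letter is now a base letter followed by a combining accent,
# so the decomposed string is two code points longer.
print(len(text), len(decomposed))

# The UTF-8 bytes show the combining acute accent (0xCC 0x81) after "i" and "a".
print(decomposed.encode("utf-8"))

# Dropping the combining marks leaves unaccented letters.
stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
print(stripped)

# The "K" (compatibility) part also expands characters such as the "fi" ligature.
ligature = unicodedata.normalize("NFKD", "\ufb01")
print(ligature)
```

The inverted question mark has no decomposition, so it passes through unchanged.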


Unicode normalization will be the first step in the text standardization function:

In [9]:
def tf_lower_and_split_punct(text):
    # Split accented characters.
    text = tf_text.normalize_utf8(text, 'NFKD')
    text = tf.strings.lower(text)
    # Keep space, a to z, and select punctuation.
    text = tf.strings.regex_replace(text, '[^ a-z.?!,¿]', '')
    # Add spaces around punctuation.
    text = tf.strings.regex_replace(text, '[.?!,¿]', r' \0 ')
    # Strip whitespace.
    text = tf.strings.strip(text)

    # Add the start and end tokens.
    text = tf.strings.join(['[START]', text, '[END]'], separator=' ')
    return text

In [10]:
print(example_text.numpy().decode())
print(tf_lower_and_split_punct(example_text).numpy().decode())

¿Todavía está en casa?
[START] ¿ todavia esta en casa ? [END]
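
To trace the standardization one operation at a time, here is a rough pure-Python mirror of the function built on `re` and `unicodedata`. It is a sketch for experimentation only: the name `lower_and_split_punct` is hypothetical, and Python's string methods may treat unusual characters differently from the TensorFlow ops.

```python
import re
import unicodedata


def lower_and_split_punct(text):
    # Split accented characters into base letter + combining mark.
    text = unicodedata.normalize("NFKD", text)
    text = text.lower()
    # Keep space, a to z, and select punctuation; this also drops the combining marks.
    text = re.sub(r"[^ a-z.?!,¿]", "", text)
    # Add spaces around punctuation (\g<0> is the whole match in Python's re).
    text = re.sub(r"[.?!,¿]", r" \g<0> ", text)
    text = text.strip()
    # Add the start and end tokens.
    return " ".join(["[START]", text, "[END]"])


print(lower_and_split_punct("¿Todavía está en casa?"))

# Punctuation in the middle of a sentence leaves doubled spaces behind,
# which a whitespace split then ignores.
print(lower_and_split_punct("Hola, mundo.").split())
```

The first call prints the same string as the TensorFlow version above. The second shows that the extra spaces inserted around punctuation are harmless once the text is split on whitespace.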


>>> #### Text Vectorization

This standardization function will be wrapped in a `tf.keras.layers.TextVectorization` layer, which handles vocabulary extraction and converts input text to sequences of tokens.
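
As a mental model of "vocabulary extraction" under a size cap, the following plain-Python sketch keeps only the most frequent tokens and maps everything else to an unknown token. It is a simplified stand-in, not the layer's actual implementation: the helper names `build_vocab` and `vectorize` are hypothetical, and the sketch assumes two reserved slots (an empty padding token at id 0 and `[UNK]` at id 1).

```python
from collections import Counter


def build_vocab(token_lists, max_tokens):
    # Count every token in the corpus.
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Two slots are reserved (padding and unknown), so keep max_tokens - 2 real tokens.
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(tokens, token_to_id, unk_id=1):
    # Map each token to its id; tokens outside the vocabulary become unk_id.
    return [token_to_id.get(token, unk_id) for token in tokens]


corpus = [
    "[START] he is old . [END]".split(),
    "[START] he is here . [END]".split(),
    "[START] she is here [END]".split(),
]

vocab = build_vocab(corpus, max_tokens=8)
token_to_id = {token: i for i, token in enumerate(vocab)}

# "she" and "old" each appear only once, so they fall outside the 8-entry vocabulary.
ids = vectorize("[START] she is old [END]".split(), token_to_id)
print(vocab)
print(ids)
```

Each sentence maps to a list whose length equals its token count, with no padding added. That variable-length output is what a ragged result looks like in this simplified setting.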

In [11]:
max_vocab_size = 5000

context_text_processor = tf.keras.layers.TextVectorization(
    standardize=tf_lower_and_split_punct,
    max_tokens=max_vocab_size,
    ragged=True)