This tutorial shows how to train a sequence-to-sequence (seq2seq) model that translates from Spanish to English, and how to build an attention mechanism into the model.

The tutorial builds the attention mechanism from scratch. TensorFlow also provides ready-made layers, such as `tf.keras.layers.AdditiveAttention()`, that let you build it quickly.

After training, you can plot the attention weights between input words and output words. The example plot below shows which parts of the input sequence the model attends to while translating.

![](https://tensorflow.org/images/spanish-english.png)
Refer to Tensorflow.Org (2020).

Reference:
* Neural machine translation with attention: https://www.tensorflow.org/tutorials/text/nmt_with_attention
* Neural Machine Translation (seq2seq) Tutorial: https://github.com/tensorflow/nmt

In [None]:
import logging
logging.basicConfig(level=logging.INFO, 
                    format="%(asctime)s - %(levelname)s : %(message)s")

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split
import unicodedata
import re
import numpy as np
import os
import io

import tensorflow as tf
log = logging.getLogger('tensorflow')
print("Tensorflow Version: {}".format(tf.__version__))
print("GPU is{} available.".format(
    "" if tf.config.experimental.list_physical_devices("GPU") else " not"))

## Data Preprocessing

We'll use a language dataset provided by [http://www.manythings.org/anki/](http://www.manythings.org/anki/). Each line of the dataset contains one translation pair in the format `Sentence1<TAB>Translated_Sentence2`, that is, a sentence and its translation separated by a tab.

```text
May I borrow this book? ¿Puedo tomar prestado este libro?
```
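
As a minimal sketch of how one such line can be split into its two parts, assuming the two-column, tab-separated layout (the sample line is hard-coded here rather than read from the file):

```python
# Hard-coded sample line; the tab-separated, two-column layout is an assumption
# about the file format.
line = "May I borrow this book?\t¿Puedo tomar prestado este libro?\n"

# Drop the trailing newline, then split on the tab into the two sentences.
english, spanish = line.rstrip("\n").split("\t")
print(english)
print(spanish)
```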

You can download the dataset for the language pair you want. After downloading it, preprocess the data with the following steps:

1. Add a `start` and an `end` token to each sequence.
2. Clean the sentences by removing special characters.
3. Create two mapping dictionaries: one from words to indices and one from indices to words.
4. Pad each sequence to the maximum length so that the sequences can be batched. There are different strategies for handling padding; for example, masking the padded positions is a better approach.
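
Steps 3 and 4 can be sketched in plain Python as follows. The helpers `build_vocab` and `encode_and_pad` are hypothetical names used only for this illustration; they assign indices in order of first appearance, which may differ from the ordering a library tokenizer produces.

```python
def build_vocab(sentences):
    """Map each distinct token to an index, reserving 0 for padding."""
    word_to_index = {}
    for sentence in sentences:
        for token in sentence.split():
            if token not in word_to_index:
                word_to_index[token] = len(word_to_index) + 1
    index_to_word = {i: w for w, i in word_to_index.items()}
    return word_to_index, index_to_word


def encode_and_pad(sentences, word_to_index):
    """Convert sentences to index lists and right-pad them with 0."""
    encoded = [[word_to_index[t] for t in s.split()] for s in sentences]
    max_len = max(len(seq) for seq in encoded)
    return [seq + [0] * (max_len - len(seq)) for seq in encoded]


sentences = ["<start> hello world <end>", "<start> hello <end>"]
word_to_index, index_to_word = build_vocab(sentences)
padded = encode_and_pad(sentences, word_to_index)
print(padded)
```

Reserving index 0 for padding makes it easy to mask the padded positions later.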

In [None]:
# Download the dataset
path_to_dir = tf.keras.utils.get_file('spa-eng.zip', 
                                      origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip', 
                                      extract=True)
logging.info("Path to the directory: {}".format(path_to_dir))

In [None]:
path_to_file = os.path.join(os.path.dirname(path_to_dir), "spa-eng", "spa.txt")
assert os.path.exists(path_to_file), "File was not found."

In [None]:
!tail -n 3 {path_to_file}

Convert Unicode text to an ASCII-like form by stripping accents (combining marks).

In [None]:
def unicode_to_ascii(s):
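  """Strip accents from a Unicode string.

  unicodedata.normalize('NFD', s) decomposes each character into a base
  character plus combining marks.
  unicodedata.category(c) returns the general category of the character c;
  characters in category 'Mn' (nonspacing mark) are dropped.
  """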
  return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')
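
As a quick standalone check of this approach, the sketch below repeats the same logic under a hypothetical name, `strip_accents`, and applies it to a sample sentence. Note that characters without a decomposition, such as `¿`, are kept, so the result is not strictly ASCII.

```python
import unicodedata


def strip_accents(text):
    # Hypothetical helper mirroring the function above: decompose (NFD),
    # then drop combining marks (category "Mn").
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


sample = "¿Todavía están en casa?"
print(strip_accents(sample))
```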

The following function preprocesses a sentence by cleaning special characters and adding the start and end tokens.

In [None]:
def preprocess_sentence(w):
  """Preprocess the sentence, and add tags on the head and the tail of it."""
  w = unicode_to_ascii(w.lower().strip())
  
  # create a space between a word and the punctuation following it
  # "This is the end." -> "This is the end ."
  w = re.sub(r"([?.!,¿])", r" \1 ", w)
  w = re.sub(r'[" "]+', " ", w)

  # replace everything except (a-z, A-Z, '.', '?', '!', ',', '¿') with a space
  w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

  w = w.strip()

  # add a start and an end token
  w = '<start> ' + w + ' <end>'
  return w

Let's try some examples. 

For the seq2seq task, the start and end tokens mark where each sentence begins and ends. Both the input and the output sequences need them.

In [None]:
en_sentence = u"May I borrow this book?"
sp_sentence = u"¿Puedo tomar prestado este libro?"

print(preprocess_sentence(en_sentence))
print(preprocess_sentence(sp_sentence).encode('utf-8'))

Create the dataset using the preprocessing function.

In [None]:
def create_dataset(path, num_examples):
  """Create the datasets for both the input sequence and the output sequence."""
  # read all lines and close the file afterwards
  with io.open(path, encoding='utf-8') as f:
    lines = f.read().strip().split('\n')
  word_pairs = [[preprocess_sentence(w) for w in l.split('\t')] for l in lines[:num_examples]]
  return zip(*word_pairs)

In [None]:
en, sp = create_dataset(path_to_file, None)

logging.info("There are {} sequences.".format(len(en)))

print(en[-1])
print(sp[-1])

The function `tokenize` implements how the sequences are tokenized. In recent versions of TensorFlow (for example, r2.8), you can instead use `tf.keras.layers.TextVectorization()` to tokenize the sequences.

In [None]:
def tokenize(text):
  """Implement the tokenizer."""
  lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
  lang_tokenizer.fit_on_texts(text)
  tensor = lang_tokenizer.texts_to_sequences(text)
  tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')
  return tensor, lang_tokenizer

We wrap all of the above procedures in one function. The function takes the file path as input and returns the input tensor (the input sequences), the target tensor (the output sequences), the input tokenizer, and the target tokenizer.

In [None]:
def load_dataset(path, num_examples=None):
  """
  create cleaned input, output pairs
  to train a model from Spanish to English,
  the target language is English, and input language is Spanish
  """
  targ_lang, inp_lang = create_dataset(path, num_examples)
  
  input_tensor, input_tokenizer = tokenize(inp_lang)
  target_tensor, target_tokenizer = tokenize(targ_lang)

  return input_tensor, target_tensor, input_tokenizer, target_tokenizer

### Limit the Size of Datasets

Before training a model on the whole dataset, you can limit the dataset size to train it faster.

In [None]:
num_examples = None  # you can try different numbers
input_tensor, target_tensor, input_tokenizer, target_tokenizer = load_dataset(path_to_file, num_examples)

logging.info("Input tensor shape: {}".format(input_tensor.shape))
logging.info("Target tensor shape: {}".format(target_tensor.shape))
logging.info("Input tokenizer's vocabulary size: {}".format(
    len(input_tokenizer.index_word)))
logging.info("Target tokenizer's vocabulary size: {}".format(
    len(target_tokenizer.index_word)))

In [None]:
def max_length(tensor):
  """Get the max length of the sequence."""
  return max((len(t) for t in tensor), default=0)

# the max sequence length
max_length_target, max_length_input = max_length(target_tensor), max_length(input_tensor)
logging.info("The max length of input sequences is {}.".format(max_length_input))
logging.info("The max length of target sequences is {}.".format(max_length_target))

In [None]:
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = \
  train_test_split(input_tensor, target_tensor, test_size=0.2)

logging.info("There are {} sequences for training.".format(len(input_tensor_train)))
logging.info("There are {} sequences for validation.".format(len(input_tensor_val)))

Let's examine the encoded sequences first, and then decode the indices back into tokens.

In [None]:
input_tensor_train[0]

In [None]:
target_tensor_train[0]

In [None]:
def decode(tokenizer, tensor):
  """Decode the indices into the tokens."""
  for t in tensor:
    if t != 0: print("  {} --> {}".format(t, tokenizer.index_word[t]))

print("Input (Spanish): Index to Word")
decode(input_tokenizer, input_tensor_train[0])
print("\nTarget (English): Index to Word")
decode(target_tokenizer, target_tensor_train[0])

### Create a tf.data.Dataset

Here we use the `tf.data.Dataset.from_tensor_slices` API to wrap the data (input sequences) and the labels (output sequences).
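
Conceptually, shuffling and then batching with the final partial batch dropped behaves like the NumPy sketch below. This is only an illustration of the idea, not how `tf.data` is implemented; `shuffled_batches` is a hypothetical helper and the toy arrays are made up.

```python
import numpy as np


def shuffled_batches(inputs, targets, batch_size, seed=0):
    """Yield aligned (input, target) batches, dropping the final partial batch."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(inputs))      # one shuffled order for both arrays
    num_full = len(inputs) // batch_size      # number of complete batches
    for b in range(num_full):
        idx = order[b * batch_size:(b + 1) * batch_size]
        yield inputs[idx], targets[idx]


toy_inputs = np.arange(10).reshape(10, 1)
toy_targets = toy_inputs * 10
batches = list(shuffled_batches(toy_inputs, toy_targets, batch_size=4))
print(len(batches))  # 10 examples with batch size 4 give 2 full batches
```

Dropping the remainder keeps every batch the same size, which matters later because the encoder's initial hidden state is created for a fixed batch size.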

In [None]:
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train))\
            .shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
val_dataset = tf.data.Dataset.from_tensor_slices((input_tensor_val, target_tensor_val))\
            .batch(BATCH_SIZE, drop_remainder=True)
dataset

Let's get the input and the output sequences first.

In [None]:
example_input_batch, example_target_batch = next(iter(dataset))
logging.info("Input sequence's shape: {}".format(example_input_batch.shape))
logging.info("Target sequence's shape: {}".format(example_target_batch.shape))

print(example_input_batch[0])
print(example_target_batch[0])

You can also decode a batch drawn from the pipeline back into text.

In [None]:
' '.join([input_tokenizer.index_word[n] for n in example_input_batch[8].numpy() if n != 0]), \
' '.join([target_tokenizer.index_word[n] for n in example_target_batch[8].numpy() if n != 0])

## Seq2Seq Model with the Attention

The seq2seq model consists of two parts: the encoder and the decoder. The encoder encodes the input sequence into an intermediate matrix, and the decoder decodes the output sequence from that matrix.

### Encoder

The encoder, the blue part of the image below, does the following:

- Takes a list of token IDs (from the previous dataset pipeline).
- Looks up an embedding vector for each token (using `tf.keras.layers.Embedding`).
- Processes the embeddings into a new sequence (using `tf.keras.layers.GRU`).

It returns:

- The processed sequence for the attention head.
- The state for initializing the decoder.
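
To make the shapes concrete, here is a minimal NumPy sketch of the encoder steps listed above. It assumes one common formulation of the GRU without bias terms, uses small made-up sizes, and uses random weights; it illustrates the data flow rather than reproducing the Keras layers.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)
batch_size, seq_len, vocab_size, embedding_dim, units = 2, 5, 11, 4, 3

# Token IDs -> embedding lookup: (batch_size, seq_len, embedding_dim)
token_ids = rng.integers(0, vocab_size, size=(batch_size, seq_len))
embedding_table = rng.normal(size=(vocab_size, embedding_dim))
embedded = embedding_table[token_ids]

# One weight matrix per gate: update (z), reset (r), candidate (h).
W = {g: rng.normal(size=(embedding_dim, units)) * 0.1 for g in "zrh"}
U = {g: rng.normal(size=(units, units)) * 0.1 for g in "zrh"}

state = np.zeros((batch_size, units))  # initial hidden state, all zeros
outputs = []
for t in range(seq_len):
    x_t = embedded[:, t, :]
    z = sigmoid(x_t @ W["z"] + state @ U["z"])
    r = sigmoid(x_t @ W["r"] + state @ U["r"])
    candidate = np.tanh(x_t @ W["h"] + (r * state) @ U["h"])
    state = z * state + (1.0 - z) * candidate
    outputs.append(state)

sequence_output = np.stack(outputs, axis=1)  # (batch_size, seq_len, units)
final_state = state                          # (batch_size, units)
print(sequence_output.shape, final_state.shape)
```

The final state is simply the output at the last time step, which is why it can be used to initialize the decoder.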

In [None]:
class Encoder(tf.keras.Model):
  """Define the Encoder model."""

  def __init__(self, vocab_size, embedding_dim, enc_units, batch_size):
    super(Encoder, self).__init__()
    self.batch_size = batch_size
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(input_dim=vocab_size, 
                                               output_dim=embedding_dim)
    self.gru = tf.keras.layers.GRU(units=enc_units, 
                                   return_sequences=True, 
                                   return_state=True, 
                                   recurrent_initializer='glorot_uniform')
    
  def call(self, x, hidden):
    """
    args:
      x: (batch_size, sequence_length)
      hidden: (batch_size, enc_units)
      
    returns:
      sequence_output: (batch_size, sequence_length, enc_units)
      final_state: (batch_size, enc_units)
    """
    embed = self.embedding(x)
    sequence_output, final_state = self.gru(embed, initial_state=hidden)
    return sequence_output, final_state

  def initial_hidden_state(self):
    """Initialize the hidden state to all zeros."""
    return tf.zeros((self.batch_size, self.enc_units))

vocab_size_input = len(input_tokenizer.word_index) + 1
vocab_size_target = len(target_tokenizer.word_index) + 1
embedding_dim = 256
units = 1024

encoder = Encoder(vocab_size_input, embedding_dim, units, BATCH_SIZE)

In [None]:
# sample input
sample_hidden = encoder.initial_hidden_state()
sequence_output, final_state = encoder(example_input_batch, sample_hidden)

print('Encoder output shape (batch_size, sequence_length (variable, timestamp), units): {}'.format(sequence_output.shape))
print('Encoder hidden state shape (batch_size, units): {}'.format(final_state.shape))

### Attention Mechanism

Now let's take a look at the attention mechanism. As the diagram below shows, the attention mechanism assigns a weight to each input word taken by the encoder, and the decoder then uses these weights to predict the words in the sentence.

![](https://www.tensorflow.org/images/seq2seq/attention_mechanism.jpg)
Refer to Tensorflow.Org (2020).

**More specifically, the attention mechanism weights the time steps.** In an RNN model, the hidden state carries information from one time step to the next, and a new hidden state is generated at each time step.

First, the encoder outputs collected over all time steps have the shape `(batch_size, sequence_length, hidden_state_size)`.

The attention weights have the shape `(batch_size, sequence_length)` for a batch. They are computed by a score function whose parameters are randomly initialized and then learned. The weights are expanded along a new final axis and repeated across the hidden-state axis to become `(batch_size, sequence_length, hidden_state_size)`.

Next, multiply the hidden-state matrix element-wise by the expanded attention matrix to get a new matrix that carries the per-time-step weighting. Sum this new matrix over the time axis to summarize the overall contribution of the time steps (the shape becomes `(batch_size, hidden_state_size)`). This summarized matrix is called the `context vector`.

The context vector can then be combined with the decoder state and passed through an activation layer (normally tanh) to become the attention vector.

The decoder then uses this attention vector to predict the words.

More details and formulas are shown below.
![](https://www.tensorflow.org/images/seq2seq/attention_equation_0.jpg)
![](https://www.tensorflow.org/images/seq2seq/attention_equation_1.jpg)
Refer to Tensorflow.Org (2020).

The notation used in the equations and in the pseudo-code (refer to Tensorflow.Org (2020)) is as follows:

- $s$ is the encoder index. In other words, $s$ stands for the time step of the input.
- $t$ is the decoder index.
- $a_{ts}$ is the attention weights.
- $h_s$ is the sequence of encoder outputs being attended to (the attention `key` and `value` in transformer terminology).
- $h_t$ is the decoder state attending to the sequence (the attention `query` in transformer terminology).
- $c_t$ is the resulting context vector.
- $a_t$ is the final output combining the `context` and `query`.
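
For readers who cannot load the equation images, the relations are commonly written as follows. This is a transcription in the notation of the list above, following the usual multiplicative (Luong-style) and additive (Bahdanau-style) formulations; the weight names $W$, $W_1$, $W_2$, $W_c$, and $v_a$ are assumptions introduced here.

$$
a_{ts} = \frac{\exp\left(\mathrm{score}(h_t, h_s)\right)}{\sum_{s'} \exp\left(\mathrm{score}(h_t, h_{s'})\right)}
$$

$$
c_t = \sum_{s} a_{ts} \, h_s
$$

$$
a_t = \tanh\left(W_c \, [c_t ; h_t]\right)
$$

$$
\mathrm{score}(h_t, h_s) =
\begin{cases}
h_t^{\top} W h_s & \text{(multiplicative)} \\
v_a^{\top} \tanh\left(W_1 h_t + W_2 h_s\right) & \text{(additive)}
\end{cases}
$$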

In detail:

* `attention weights = softmax(score, axis = 1)`. By default, softmax is applied on the last axis, but here we apply it on axis 1, since the shape of score is `(batch_size, max_length, 1)`. `max_length` is the length of the input (the number of time steps). Since we are assigning a weight to each input position, softmax should be applied on that axis.

* `context vector = sum(attention weights * encoder outputs, axis = 1)`. Axis 1 is chosen for the same reason as above.

* `merged vector = concat(embedding output, context vector)`. This merged vector is then passed to the GRU (a gated RNN) layer.

* The last piece is the `score` function. Its job is to calculate a scalar logit score for each key-query pair. There are two common approaches: multiplicative and additive. In TensorFlow, you can use `tf.keras.layers.Attention` for the multiplicative style and `tf.keras.layers.AdditiveAttention` for the additive style.
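
The pseudo-code above can be sketched in NumPy with an additive score function. The sizes are small made-up values and the projection weights `W1`, `W2`, and `v` are random stand-ins for parameters that a real model would learn.

```python
import numpy as np


def softmax(x, axis):
    # Subtract the max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
batch_size, seq_len, hidden_size, attn_units = 2, 5, 4, 3

values = rng.normal(size=(batch_size, seq_len, hidden_size))  # encoder outputs
query = rng.normal(size=(batch_size, hidden_size))            # decoder state

# Random projection weights (assumed here; learned in a real model).
W1 = rng.normal(size=(hidden_size, attn_units))
W2 = rng.normal(size=(hidden_size, attn_units))
v = rng.normal(size=(attn_units, 1))

# Additive score: one scalar per (example, time step) -> (batch_size, seq_len, 1)
score = np.tanh(values @ W1 + query[:, None, :] @ W2) @ v

# Normalize over the time axis (axis 1), then take the weighted sum.
attention_weights = softmax(score, axis=1)                  # (batch_size, seq_len, 1)
context_vector = (attention_weights * values).sum(axis=1)   # (batch_size, hidden_size)

print(attention_weights.shape, context_vector.shape)
```

Because the softmax runs over axis 1, the weights for each example sum to 1 across the input time steps.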

Next, you are going to build a Bahdanau attention layer.

In [None]:
class BahdanauAttention(tf.keras.layers.Layer):
  """An Attention Layer"""
  def __init__(self, units):
    """Constructor
    
    args:
      units: the number of units in the dense layers that compute the score
    """
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    """call is the layer function
    
    args:
      query: the hidden state from the decoder, (batch_size, hidden_size == encoder_units)
      values: the encoder outputs with return_sequences=True
              (batch_size, sequence_length, encoder_units)
      
    returns:
      context_vector: (batch_size, encoder_units)
      attention_weights: (batch_size, sequence_length, 1)
    """
    # hidden state shape == (batch size, hidden size)
    # to expand dimension at the time axis == (batch size, 1, hidden size)
    # perform this to calculate the score
    hidden_time_axis = tf.expand_dims(query, axis=1)
    
    # addition shape == (batch_size, sequence_length, units), from
    #   (batch_size, sequence_length, units) + (batch_size, 1, units)
    # score shape == (batch size, sequence_length, 1)
    addition = tf.nn.tanh(self.W1(values) + self.W2(hidden_time_axis))
    score = self.V(addition)
    
    # attention weights shape == (batch size, sequence_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)
    
    # context vector shape after sum == (batch_size, encoder_units)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)
    
    return context_vector, attention_weights

In [None]:
attention_layer = BahdanauAttention(10)
context_vector, attention_weights = attention_layer(final_state, sequence_output)

print("Context vector Shape (batch size, hidden size): {}".format(context_vector.shape))
print("Attention Weights Shape (batch size, sequence length, 1): {}".format(attention_weights.shape))

### Decoder

The decoder's job is to generate predictions for the next output token. (The decoder is the red part of the image above.)

- The decoder receives the complete encoder output.
- It uses a GRU layer to keep track of what it has generated so far.
- It uses its RNN output as the query to the attention over the encoder's output to produce the context vector.
- It combines the decoder's output with the context vector to generate the `attention vector`.
- It generates logit predictions for the next token based on the `attention vector`.

In [None]:
class Decoder(tf.keras.Model):
  """Implement the decoder."""

  def __init__(self, vocab_size, embedding_dim, dec_units, batch_size):
    super(Decoder, self).__init__()
    self.dec_units = dec_units
    self.batch_size = batch_size
    self.embedding = tf.keras.layers.Embedding(
        input_dim=vocab_size, 
        output_dim=embedding_dim)
    self.gru = tf.keras.layers.GRU(
        units=dec_units, 
        return_sequences=True, 
        return_state=True, 
        recurrent_initializer='glorot_uniform')
    self.dense = tf.keras.layers.Dense(units=vocab_size)
    self.attention = BahdanauAttention(dec_units)
    
  def call(self, x, hidden, enc_output):
    """
    args:
      x: (batch_size, 1) an input of the decoder, 
         the token from the previous time step, or the start token at the beginning
      hidden: (batch_size, encoder_units)
         the hidden state from the previous time step, or from the encoder at the beginning
      enc_output: (batch_size, sequence_length, encoder_units)
    """
    
    # the embedding features (batch_size, 1) => (batch_size, 1, embedding_dims)
    x = self.embedding(x)
    
    # context_vector: (batch_size, encoder_units == hidden_size)
    # attentions_weights: (batch_size, sequence_length, 1)
    context_vector, attention_weights = self.attention(hidden, enc_output)
    
    # (batch_size, 1, hidden_size + embedding_size)
    attention_vector = tf.concat([tf.expand_dims(context_vector, axis=1), x], axis=-1)
    
    # gru_seq: (batch_size, 1, dec_units)
    # gru_state: (batch_size, dec_units)
    gru_seq, gru_state = self.gru(attention_vector)
    
    # shape: (batch_size, dec_units)
    gru_seq = tf.reshape(gru_seq, (-1, gru_seq.shape[2]))
    
    # results: (batch_size, vocab_size)
    results = self.dense(gru_seq)
    
    return results, gru_state, attention_weights

Let's pass a sample batch into the decoder.

In [None]:
decoder = Decoder(vocab_size_target, embedding_dim, units, BATCH_SIZE)

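# sample integer token IDs so that the embedding lookup receives valid indices
decoder_output, decode_state, sample_attention = decoder(
  tf.random.uniform((BATCH_SIZE, 1), maxval=vocab_size_target, dtype=tf.int32),
  final_state, sequence_output)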

print("Decoder output shape (batch_size, vocab_size): {}".format(decoder_output.shape))
print("Decoder state shape (batch_size, decoder_units): {}".format(decode_state.shape))
print("Attention Weights shape (batch_size, sequence_length, 1): {}".format(sample_attention.shape))

# Training

## Define the Optimizer and the Loss Function

We can treat training as a classification problem: at each time step, the decoder predicts the next token from the vocabulary. We can therefore use `tf.keras.losses.SparseCategoricalCrossentropy` to calculate the loss value. The loss function only counts the loss of real target tokens.

We use masking to hide the padded tokens. For this reason, `reduction` is set to `none` for `SparseCategoricalCrossentropy`, so that the per-token losses are available for masking.
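
The masking idea can be sketched in NumPy as follows. `masked_sparse_ce` is a hypothetical helper written for this illustration, and the labels and logits are made-up values; padding is assumed to use ID 0.

```python
import numpy as np


def masked_sparse_ce(real, logits, pad_id=0):
    """Per-token cross-entropy from logits, zeroed where `real` is padding."""
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # One loss value per token, the equivalent of reduction='none'.
    per_token = -log_probs[np.arange(len(real)), real]
    mask = (real != pad_id).astype(per_token.dtype)
    return per_token * mask


real = np.array([3, 0, 1])  # the second token is padding
logits = np.array([[0.1, 0.2, 0.3, 2.0],
                   [1.0, 0.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0, 0.0]])
per_token_loss = masked_sparse_ce(real, logits)
print(per_token_loss, per_token_loss.mean())
```

Note that taking the plain mean afterwards still counts the padded positions in the denominator; dividing by the sum of the mask instead would average over real tokens only.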

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
  from_logits=True, reduction='none')

def loss_function(real, pred):
  """Loss function with masking."""
  
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss = loss_object(real, pred)
    
  mask = tf.cast(mask, dtype=loss.dtype)
  loss *= mask

  return tf.reduce_mean(loss)

## Define the Checkpoint for Saving Models

In [None]:
ckpt_dir = os.path.join('./train_ckpt')
ckpt_path = os.path.join(ckpt_dir, 'ckpt')
checkpoint = tf.train.Checkpoint(optimizer=optimizer, 
                                 encoder=encoder, 
                                 decoder=decoder)

## Training Steps

The following is the basic training flow of the NMT (or the seq2seq) model.

1. Pass the input (a batch of text sequences) and an initialized hidden state into the encoder, which returns the encoder output and the encoder state.
2. Pass the encoder output, the encoder state, and the decoder input (the start token) into the decoder, which returns the decoder result (the vocabulary predictions), the decoder state, and the attention weights.
3. Continue passing the next input and the hidden state back to the decoder to generate the next result:

   - Pass the decoder hidden state back to the decoder for the next prediction.
   - Use `teacher forcing` to decide the next input to the decoder. (Teacher forcing is a method where the target word is passed as the next input to the decoder.)

4. Calculate the gradients and let the optimizer apply them to the trainable variables.
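
The teacher forcing step can be illustrated with a toy target batch. The token IDs below are made up (1 stands for the start token, 2 for the end token, 0 for padding), and no model is involved; the sketch only records which token is fed to the decoder and which token it is expected to predict at each step.

```python
import numpy as np

# Toy target batch: 1 = <start>, 2 = <end>, 0 = padding (IDs are made up).
targets = np.array([[1, 5, 6, 2, 0],
                    [1, 7, 2, 0, 0]])

steps = []
decoder_input = targets[:, 0:1]          # the <start> column, shape (batch, 1)
for t in range(1, targets.shape[1]):
    label = targets[:, t]                # token the decoder should predict
    steps.append((decoder_input[:, 0].tolist(), label.tolist()))
    decoder_input = targets[:, t:t + 1]  # teacher forcing: feed the true token

for fed, expected in steps:
    print("input:", fed, "-> label:", expected)
```

At every step the decoder input is the ground-truth token from the previous position, regardless of what the model predicted.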

In [None]:
@tf.function
def train_step(inputs, targets, enc_hidden):
  """train_step calculates the loss and applies the gradients to the variables using an optimizer"""
  loss = 0.0

  with tf.GradientTape() as tape:
    encoder_output, encoder_state = \
      encoder(inputs, enc_hidden, training=True)
    
    # the hidden state from the encoder is passed into the decoder as the hidden state at the beginning
    decoder_hidden = encoder_state 
    
    # before training alongside the time axis, define the start token, (BATCH_SIZE, 1)
    decoder_inputs = tf.expand_dims([target_tokenizer.word_index['<start>']] * BATCH_SIZE, axis=-1)
    
    # the way implementing the `Teacher Forcing` method
    for t in range(1, targets.shape[1]):
      # predictions: (batch_size, vocab_size)
      predictions, decoder_hidden, _ = \
        decoder(decoder_inputs, decoder_hidden, encoder_output, training=True)
      
      loss += loss_function(targets[:, t], predictions)
        
      # for the next timestamp
      decoder_inputs = tf.expand_dims(targets[:, t], 1)

  batch_loss = loss / int(targets.shape[1])
  trainable_variables = encoder.trainable_variables + decoder.trainable_variables
    
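  # gradients use the summed loss; batch_loss differs only by the constant
  # factor 1 / sequence_length and is returned for reporting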
  gradients = tape.gradient(loss, trainable_variables)
  optimizer.apply_gradients(zip(gradients, trainable_variables))

  return batch_loss

In [None]:
@tf.function
def val_step(inputs, targets, enc_hidden):
  """val_step calculates the loss for one validation batch, without teacher forcing."""

  loss = 0.0

  encoder_output, encoder_state = \
    encoder(inputs, enc_hidden, training=False)
  decoder_hidden = encoder_state
  decoder_inputs = tf.expand_dims([target_tokenizer.word_index['<start>']] * BATCH_SIZE, axis=-1)

  for t in range(1, targets.shape[1]):
    results, decoder_hidden, _ = \
      decoder(decoder_inputs, decoder_hidden, encoder_output, training=False)
    loss += loss_function(targets[:, t], results)

    best_predict_ids = tf.argmax(results, axis=1)
    decoder_inputs = tf.expand_dims(best_predict_ids, axis=-1)
  
  # normalize by the sequence length, consistent with train_step
  batch_loss = loss / int(targets.shape[1])

  return batch_loss

In [None]:
EPOCHS = 10
steps_per_epoch = len(input_tensor_train) // BATCH_SIZE

for epoch in range(EPOCHS):
    
  enc_hidden = encoder.initial_hidden_state()

  total_loss = 0.
  for (idx, (inputs, targets)) in enumerate(dataset.take(steps_per_epoch)):
    batch_loss = train_step(inputs, targets, enc_hidden)
    total_loss += batch_loss
    
    if (idx + 1) % 400 == 0:
      print("  Epoch {}, Batch: {}, Loss: {:.8f}".format(
        epoch + 1, idx + 1, batch_loss))
  
  # saving the model every 2 epochs
  if (epoch + 1) % 2 == 0:
    checkpoint.save(file_prefix=ckpt_path)

  val_loss = 0.
  val_steps = 0
  for inputs, targets in val_dataset:
    val_loss += val_step(inputs, targets, enc_hidden)
    val_steps += 1

  # average each loss over its own number of batches
  print("Epoch {}, Loss: {:.6f}, Val Loss: {:.6f}".format(
    epoch+1, total_loss / steps_per_epoch, val_loss / max(val_steps, 1)))

# Translation

* Translation is a series of predictions similar to the training loop, except that it does not use teacher forcing. The input to the decoder at each time step is its previous prediction, along with the hidden state and the encoder output.
* Stop predicting when the model outputs the end token.
* Store the attention weights for every time step so that they can be plotted later.
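
The loop described above is a greedy decoding loop. The sketch below shows its control flow in plain Python with a toy stand-in for the decoder; `greedy_decode` and `toy_step` are hypothetical names, and the toy step simply favours the next token ID.

```python
def greedy_decode(step_fn, start_id, end_id, max_length):
    """Feed each prediction back as the next input until <end> or max_length."""
    token, state, output = start_id, None, []
    for _ in range(max_length):
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        output.append(token)
        if token == end_id:
            break
    return output


def toy_step(token, state):
    # Hypothetical stand-in for the decoder: always favours token + 1.
    scores = [0.0] * 5
    scores[min(token + 1, 4)] = 1.0
    return scores, state


print(greedy_decode(toy_step, start_id=0, end_id=3, max_length=10))
```

The `max_length` bound matters because a model is not guaranteed to ever produce the end token.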

Now let's translate from Spanish (input) to English (target) using the model.

In [None]:
def generate(sentence):
  """Translate a sentence."""
  input_length = max_length_input
  target_length = max_length_target
  attention_plots = np.zeros((target_length, input_length))
  
  # preprocess sentence
  sentence = preprocess_sentence(sentence)
  inputs = [input_tokenizer.word_index[i] for i in sentence.split(' ')]
  inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs], maxlen=input_length, padding="post")
  inputs = tf.convert_to_tensor(inputs)

  # store the result
  result = ''

  hidden = tf.zeros((1, units))  # (batch_size, hidden_size): (1, 1024)
  enc_output, enc_hidden = encoder(inputs, hidden, training=False)
  dec_hidden = enc_hidden
  dec_input = tf.expand_dims([target_tokenizer.word_index['<start>']], axis=0)  # (batch_size, sequence_length), (1, 1)

  for t in range(target_length):
    predictions, dec_hidden, attention_weights = decoder(dec_input, dec_hidden, enc_output, training=False)
    
    # storing the attention weights
    attention_weights = tf.reshape(attention_weights, (-1, )) 
    attention_plots[t] = attention_weights.numpy()
    
    # predicted id, [0] to get the element
    predicted_id = tf.argmax(predictions, axis=-1).numpy()[0]
    result += target_tokenizer.index_word[predicted_id] + " "
    
    if target_tokenizer.index_word[predicted_id] == "<end>":
      break
    
    # continue predicting the word
    dec_input = tf.expand_dims([predicted_id], axis=0)
  return result, sentence, attention_plots

In [None]:
def plot_attention(attention, sentence, predicted_sentence):
  """plot for attention"""
  fig = plt.figure(figsize=(8, 8))
  ax = fig.add_subplot(1,1,1)
  ax.matshow(attention, cmap="viridis")
  ax.set_xticklabels([''] + sentence, rotation=90)
  ax.set_yticklabels([''] + predicted_sentence)
  ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
  ax.yaxis.set_major_locator(ticker.MultipleLocator(1))
  plt.show()

In [None]:
def translate(sentence):
  result, sentence, attention_plots = generate(sentence)

  print("Input: {}".format(sentence))
  print("Translation: {}".format(result))

  attention_plot = attention_plots[:len(result.split(' ')), :len(sentence.split(' '))]
  plot_attention(attention_plot, sentence.split(' '), result.split(' '))

# Restore the Latest Checkpoint

In [None]:
checkpoint.restore(tf.train.latest_checkpoint(ckpt_dir))

In [None]:
translate(u"hace mucho frio aqui.")

In [None]:
translate(u'¿todavia estan en casa?')

In [None]:
# this sentence may be translated incorrectly (expected: "Try to find out.")
translate(u'trata de averiguarlo.')