#### Taken from [tensorflow-tutorial ](https://www.tensorflow.org/tutorials/text/nmt_with_attention ) and added a *translate_batch()* function to translate a batch and dump outputs into a file

# Neural machine translation with attention

In [3]:
# from google.colab import drive
# drive.mount('/content/drive')

In [4]:
import tensorflow as tf

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split

import unicodedata
import re
import numpy as np
import os
import io
import time
from spanish_english_datapreprocessing import NMTDataset
from tensorflow.python.ops import math_ops

In [5]:
!ls utils


ls: cannot access 'utils': No such file or directory


We'll use the same dataset we worked on notebook-1 (text-processing). For our convenience we've created a utils/dataset.py file which returns train and validation tf.data.Dataset objects.

In [6]:
BUFFER_SIZE = 90000
BATCH_SIZE = 32
num_examples = 80000

dataset_creator = NMTDataset('en-spa')
train_dataset, val_dataset, inp_lang, targ_lang = dataset_creator.call(num_examples, BUFFER_SIZE, BATCH_SIZE)

Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip


In [7]:

print("Inpute Vocabulary Size: {}".format(len(inp_lang.word_index)))
print("Target Vocabulary Size: {}".format(len(targ_lang.word_index)))

Inpute Vocabulary Size: 17592
Target Vocabulary Size: 9219


In [8]:
example_input_batch, example_target_batch = next(iter(train_dataset))
example_input_batch.shape, example_target_batch.shape

(TensorShape([32, 20]), TensorShape([32, 14]))

In [9]:
example_input_batch.shape

TensorShape([32, 20])

In [10]:
input_maxLen = example_input_batch.shape[1]
output_maxLen = example_target_batch.shape[1]

## Write the encoder and decoder model

Implement an encoder-decoder model with attention which you can read about in the TensorFlow [Neural Machine Translation (seq2seq) tutorial](https://github.com/tensorflow/nmt). This example uses a more recent set of APIs. This notebook implements the [attention equations](https://github.com/tensorflow/nmt#background-on-the-attention-mechanism) from the seq2seq tutorial. The following diagram shows that each input words is assigned a weight by the attention mechanism which is then used by the decoder to predict the next word in the sentence. The below picture and formulas are an example of attention mechanism from [Luong's paper](https://arxiv.org/abs/1508.04025v5).

<img src="https://www.tensorflow.org/images/seq2seq/attention_mechanism.jpg" width="500" alt="attention mechanism">

The input is put through an encoder model which gives us the encoder output of shape *(batch_size, max_length, hidden_size)* and the encoder hidden state of shape *(batch_size, hidden_size)*.

Here are the equations that are implemented:

<img src="https://www.tensorflow.org/images/seq2seq/attention_equation_0.jpg" alt="attention equation 0" width="800">
<img src="https://www.tensorflow.org/images/seq2seq/attention_equation_1.jpg" alt="attention equation 1" width="800">

This tutorial uses [Luong attention](https://arxiv.org/abs/1508.04025) for the encoder. Let's decide on notation before writing the simplified form:

* FC = Fully connected (dense) layer
* EO = Encoder output
* H = hidden state
* X = input to the decoder

And the pseudo-code:

* `score = MatMul(Transpose(H)xE0)`
* `attention weights = softmax(score, axis = 1)`. Softmax by default is applied on the last axis but here we want to apply it on the *1st axis*, since the shape of score is *(batch_size, max_length, hidden_size)*. `Max_length` is the length of our input. Since we are trying to assign a weight to each input, softmax should be applied on that axis.
* `context vector = sum(attention weights * EO, axis = 1)`. Same reason as above for choosing axis as 1.
* `embedding output` = The input to the decoder X is passed through an embedding layer.
* `merged vector = concat(embedding output, context vector)`
* This merged vector is then given to the GRU

The shapes of all the vectors at each step have been specified in the comments in the code:

In [11]:
# Define some useful parameters for further use

vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1
max_length_input = example_input_batch.shape[1]
max_length_output = example_target_batch.shape[1]

embedding_dim = 250
units = 512
steps_per_epoch = num_examples//BATCH_SIZE

In [12]:
# Encoder is composed of embedding layer and then one GRU layer. It produces outputs and last hidden states.
# Encoder Outputs shape = (BATCH_SIZE, max_length_input, units)
# Last Hidden State Shape = (BATCH_SIZE, units)

class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    x = self.embedding(x)
    output, state = self.gru(x, initial_state = hidden)
    return output, state

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))

In [13]:
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

# sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print ('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print ('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))

Encoder output shape: (batch size, sequence length, units) (32, 20, 512)
Encoder Hidden state shape: (batch size, units) (32, 512)


In [14]:
class LuongAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(LuongAttention, self).__init__()
    # To recall, score = V*tanh(W1(encoder_outputs) + W2(Prev Step's Hidden State))
    # self.W1 = tf.keras.layers.Dense(units)
    # self.W2 = tf.keras.layers.Dense(units)

    # self.W_p = tf.keras.layers.Dense(units,use_bias = False)
    # self.v_p = tf.keras.layers.Dense(1,use_bias = False)

  def local_m(self,input_max_length,values,current_timestep,D):
    ## input_max_length represent Tx or S
    ## values (m,maxLen,n_a) ,n_a represent Encoder hidden state
    aligned_position = current_timestep
    left = int(aligned_position - D)
    if (left<0):
      left = 0
    right = int(aligned_position + D)
    if (right > input_max_length):
      right = input_max_length
    values = values[:,left:right,:]
    return values





  def call(self, query, values,current_timestep,input_max_length=input_maxLen,D=8):


    query_with_time_axis = tf.expand_dims(query, 1)     # (m,1,n_s)

    values = self.local_m(input_max_length,values,current_timestep,D)     # (m,maxLen*,n_a)



    ##-------- COMPUTING EQUATION (4: Luong's Attention) (General Alignment Model)   ---------#
    # new_values = self.W1(values) # (batch_size,inpt_ max_length, units) (m,inpt_maxLen,units)
    # new_query_with_time_axis = self.W2(query_with_time_axis) # (batch_size, 1, units) (m,1,units)
    # score = tf.matmul(new_query_with_time_axis, new_values, transpose_b=True)     # (m,1,inpt_maxLen)

    ##-------- COMPUTING EQUATION (4: Luong's Attention) (Dot Alignment Model)   ---------#
    score = tf.matmul(query_with_time_axis, values, transpose_b=True)     # (m,1,maxLen*)




    attention_weights = tf.reshape(score, shape=(-1, score.shape[2], 1))      # (m,maxLen*,1)



    # Reshape attention_weights to shape = (batch_size, inpt_max_length, 1)
    attention_weights = tf.nn.softmax(attention_weights, axis=1)      # (m,maxLen*,1)



    #---------- COMPUTING EQUATION (2) -----------#
    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values     # (m,maxLen*,n_a)
    # Context vector is passed on to the curren time step's decoder cell
    context_vector = tf.reduce_sum(context_vector, axis=1)      # (m,n_a)

    return context_vector, attention_weights


In [15]:
attention_layer = LuongAttention(50)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output,current_timestep = 0)

print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))

Attention result shape: (batch size, units) (32, 512)
Attention weights shape: (batch_size, sequence_length, 1) (32, 8, 1)


Decoder based on Luong Attention architecture

In [16]:
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

  def call(self, x, hidden, enc_output,current_timestep):

    x = self.embedding(x) # (m,1,emb_dim)

    output,state = self.gru(x) # output (m,1,n_s) state (m,n_s)
    context_vector, attention_weights = attention_layer(state, enc_output,current_timestep)  # context_vector (m,n_a) attention_weigths (m,maxlen,1)
    output = tf.concat([tf.expand_dims(context_vector, 1), output], axis=-1) # output  (m,1,n_a+n_s)

    output = tf.reshape(output, (-1, output.shape[2])) # output (m,n_a+n_s)

    # output shape == (batch_size, vocab)
    x = self.fc(output) #

    return x, state, attention_weights


In [17]:
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

sample_decoder_output, _, _ = decoder(tf.random.uniform((BATCH_SIZE, 1)),
                                      sample_hidden, sample_output,current_timestep = 0)

print ('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))

Decoder output shape: (batch_size, vocab size) (32, 9220)


## Define the optimizer and the loss function

In [18]:
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_mean(loss_)

## Checkpoints (Object-based saving)

In [19]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

## Training

1. Pass the *input* through the *encoder* which return *encoder output* and the *encoder hidden state*.
2. The encoder output, encoder hidden state and the decoder input (which is the *start token*) is passed to the decoder.
3. The decoder returns the *predictions* and the *decoder hidden state*.
4. The decoder hidden state is then passed back into the model and the predictions are used to calculate the loss.
5. Use *teacher forcing* to decide the next input to the decoder.
6. *Teacher forcing* is the technique where the *target word* is passed as the *next input* to the decoder.
7. The final step is to calculate the gradients and apply it to the optimizer and backpropagate.

In [20]:
@tf.function
def train_step(inp, targ, enc_hidden):
  loss = 0

  with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)

    dec_hidden = enc_hidden

    dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

    # Teacher forcing - feeding the target as the next input
    for t in range(1, targ.shape[1]):
      # passing enc_output to the decoder
      predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output,current_timestep = t)

      loss += loss_function(targ[:, t], predictions)

      # using teacher forcing
      dec_input = tf.expand_dims(targ[:, t], 1)

  batch_loss = (loss / int(targ.shape[1]))

  variables = encoder.trainable_variables + decoder.trainable_variables

  gradients = tape.gradient(loss, variables)

  optimizer.apply_gradients(zip(gradients, variables))

  return batch_loss

In [21]:
EPOCHS = 10

for epoch in range(EPOCHS):
  start = time.time()

  enc_hidden = encoder.initialize_hidden_state()
  total_loss = 0

  for (batch, (inp, targ)) in enumerate(train_dataset.take(steps_per_epoch)):
    batch_loss = train_step(inp, targ, enc_hidden)
    total_loss += batch_loss

    if batch % 100 == 0:
      print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                   batch,
                                                   batch_loss.numpy()))
  # saving (checkpoint) the model every 2 epochs
  if (epoch + 1) % 2 == 0:
    checkpoint.save(file_prefix = checkpoint_prefix)

  print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                      total_loss / steps_per_epoch))
  print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Epoch 1 Batch 0 Loss 4.8091
Epoch 1 Batch 100 Loss 2.6913
Epoch 1 Batch 200 Loss 2.3396
Epoch 1 Batch 300 Loss 1.9207
Epoch 1 Batch 400 Loss 1.8140
Epoch 1 Batch 500 Loss 1.8017
Epoch 1 Batch 600 Loss 1.5252
Epoch 1 Batch 700 Loss 1.6745
Epoch 1 Batch 800 Loss 1.6051
Epoch 1 Batch 900 Loss 1.5095
Epoch 1 Batch 1000 Loss 1.3068
Epoch 1 Batch 1100 Loss 1.5956
Epoch 1 Batch 1200 Loss 1.2988
Epoch 1 Batch 1300 Loss 1.3366
Epoch 1 Batch 1400 Loss 1.3374
Epoch 1 Batch 1500 Loss 1.0973
Epoch 1 Batch 1600 Loss 1.2705
Epoch 1 Batch 1700 Loss 1.1393
Epoch 1 Batch 1800 Loss 1.1572
Epoch 1 Batch 1900 Loss 1.1366
Epoch 1 Loss 1.2830
Time taken for 1 epoch 90.48747634887695 sec

Epoch 2 Batch 0 Loss 0.7219
Epoch 2 Batch 100 Loss 0.6921
Epoch 2 Batch 200 Loss 0.8858
Epoch 2 Batch 300 Loss 0.9150
Epoch 2 Batch 400 Loss 0.9338
Epoch 2 Batch 500 Loss 0.8557
Epoch 2 Batch 600 Loss 0.8890
Epoch 2 Batch 700 Loss 0.9014
Epoch 2 Batch 800 Loss 0.6968
Epoch 2 Batch 900 Loss 0.6829
Epoch 2 Batch 1000 Loss 0.71

## Translate

* The evaluate function is similar to the training loop, except we don't use *teacher forcing* here. The input to the decoder at each time step is its previous predictions along with the hidden state and the encoder output.
* Stop predicting when the model predicts the *end token*.
* And store the *attention weights for every time step*.

Note: The encoder output is calculated only once for one input.

In [26]:
def evaluate(sentence):

  sentence = dataset_creator.preprocess_sentence(sentence)

  inputs = []
  for i in sentence.split(" "):
    try:
      idx = inp_lang.word_index[i]
      inputs.append(idx)
    except:
      # 1 for oov token to handle key erro
      inputs.append(1)
  inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                         maxlen=max_length_input,
                                                         padding='post')
  inputs = tf.convert_to_tensor(inputs)

  result = ''

  hidden = [tf.zeros((1, units))]
  enc_out, enc_hidden = encoder(inputs, hidden)

  dec_hidden = enc_hidden
  dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)

  for t in range(max_length_output):
    predictions, dec_hidden, attention_weights = decoder(dec_input,
                                                         dec_hidden,
                                                         enc_out,current_timestep = t)

    # storing the attention weights to plot later on

    predicted_id = tf.argmax(predictions[0]).numpy()

    result += targ_lang.index_word[predicted_id] + ' '

    if targ_lang.index_word[predicted_id] == '<end>':
      return result, sentence

    # the predicted ID is fed back into the model
    dec_input = tf.expand_dims([predicted_id], 0)

  return result, sentence

In [27]:
def translate(sentence):
  result, sentence = evaluate(sentence)

  print('Input: %s' % (sentence))
  print('Predicted translation: {}'.format(result))

## Restore the latest checkpoint and test

In [28]:
# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

<tensorflow.python.checkpoint.checkpoint.CheckpointLoadStatus at 0x7e2ed01cb2b0>

In [29]:
translate(u'hace mucho frio aqui.')

Input: <start> hace mucho frio aqui . <end>
Predicted translation: it s very cold here for it here . <end> 


In [30]:
translate(u'esta es mi vida.')

Input: <start> esta es mi vida . <end>
Predicted translation: this is my life . <end> 


In [31]:
translate(u'¿todavia estan en casa?')

Input: <start> ¿ todavia estan en casa ? <end>
Predicted translation: are you still at home ? <end> 


In [32]:
# wrong translation
translate(u'trata de averiguarlo.')

Input: <start> trata de averiguarlo . <end>
Predicted translation: try to figure it to figure it . <end> 


In [33]:
translate(u'la comida se desperdicia')


Input: <start> la comida se desperdicia <end>
Predicted translation: food s gone sour . <end> 


In [34]:
translate(u'mi nueva casa es muy grande')

Input: <start> mi nueva casa es muy grande <end>
Predicted translation: my new house is very large . <end> 


In [35]:
translate(u"el futbol es un buen juego")

Input: <start> el futbol es un buen juego <end>
Predicted translation: soccer is a good game . <end> 


In [36]:
translate(u"Estoy muy enojada contigo")

Input: <start> estoy muy enojada contigo <end>
Predicted translation: i m very angry with you . <end> 


In [37]:
translate(u"hoy es un día muy especial para mi")

Input: <start> hoy es un dia muy especial para mi <end>
Predicted translation: today is a very special day very special day . <end> 


In [38]:
translate(u"hola señor necesitamos su ayuda para poder resolver el problema")

Input: <start> hola senor necesitamos su ayuda para poder resolver el problema <end>
Predicted translation: hi for his help him for his price . <end> 


In [None]:
def translate_batch(test_dataset):
  with open('output_text.txt', 'w') as f:
    for (inputs, targets) in test_dataset:
      outputs = np.zeros((BATCH_SIZE, max_length_output),dtype=np.int16)
      hidden_state = tf.zeros((BATCH_SIZE, units))
      enc_output, dec_h = encoder(inputs, hidden_state)
      dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)
      for t in range(max_length_output):
        preds, dec_h, _ = decoder(dec_input, dec_h, enc_output)
        predicted_id = tf.argmax(preds, axis=1).numpy()
        outputs[:, t] = predicted_id
        dec_input = tf.expand_dims(predicted_id, 1)
      outputs = targ_lang.sequences_to_texts(outputs)
      for t, item in enumerate(outputs):
        try:
          i = item.index('<end>')
          f.write("%s\n" %item[:i])
        except:
          f.write("%s \n" % item) # For those translated sequences which didn't correctly translated and have <end> token.

outputs = translate_batch(val_dataset)

In [None]:
!head output_text.txt
! wc -l output_text.txt

In [None]:
val_targets = list(val_dataset.take(1))
val_targets = np.asarray(val_targets[0][1])
print(type(val_targets))
targ_lang.sequences_to_texts(val_targets)[:10]

We can see that the model worked well. Despite being not very accurate the translations, however do make some sense.