## Neural Machine Translation with Attention

This notebook shows how to train a `seq2seq` model with attention for
English-to-Turkish translation. Once the model is trained, you can use it to
translate English sentences into Turkish.

We heavily borrowed from [TensorFlow's tutorial on Neural Machine Translation with Attention](https://www.tensorflow.org/tutorials/text/nmt_with_attention).

In [1]:
import os
import time

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import tensorflow as tf
from IPython.display import Video
from sklearn.model_selection import train_test_split

import preprocess
import utils
from model import *

### Sections of the Notebook
1. [Loading Dataset](#load)
2. [Preparing Dataset](#prepare)
3. [Seq2Seq Models](#seq2seq)
4. [Training and Optimizer](#training)
5. [Evaluation and Testing](#testing)
6. [Exercises](#exercise)

<a id="load"></a>
### 1. Loading Dataset

We'll use a language dataset provided by http://www.manythings.org/anki/, which
offers translation datasets between English and 80 different languages. Each
dataset is a tab-separated file with three columns: the first column is a
sentence in one of the 80 languages, the second is its English translation, and
the third gives the source of the row. We can ignore the third column for our
purposes.
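
To make the file layout concrete, here is a minimal sketch of how a single row could be split into its columns. The column order follows the description above. `parse_row` and the sample row are hypothetical: they are not part of `utils`, which may read the file differently.

```python
# Sketch: split one row of the tab-separated file into its three columns.
# The sample row is made up for illustration; it is not taken from the dataset.

def parse_row(line):
    """Return the first two columns of a row and ignore the third (source)."""
    first, second, _source = line.rstrip("\n").split("\t")
    return first, second


sample_row = "Merhaba!\tHi!\t(hypothetical source note)\n"
sentence, english = parse_row(sample_row)
print(sentence, "=>", english)
```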

In [None]:
path_to_zip = tf.keras.utils.get_file(
    "tur-eng.zip", origin="https://github.com/haluk/NLP_course_materials/blob/master/hw4/tur-eng.zip?raw=true",
    extract=True)
path_to_file = os.path.join(os.path.dirname(path_to_zip), "tur.txt")

num_examples = 1000
input_tensor, target_tensor, inp_lang, targ_lang = utils.load_dataset(
    path_to_file, num_examples
)

# we translate from English to Turkish
input_tensor, target_tensor, inp_lang, targ_lang = target_tensor, input_tensor, targ_lang, inp_lang

# Calculate max_length of the target and input tensors
max_length_targ, max_length_inp = target_tensor.shape[1], input_tensor.shape[1]

# Creating training and validation sets using an 80-20 split
(
    input_tensor_train,
    input_tensor_val,
    target_tensor_train,
    target_tensor_val,
) = train_test_split(input_tensor, target_tensor, test_size=0.2)

# Show length
print(
    "{:15s} => {:10s}: {}\t{:15s}: {}".format(
        "Input language",
        "Training size",
        len(input_tensor_train),
        "Validation size",
        len(input_tensor_val),
    )
)
print(
    "{:15s} => {:10s}: {}\t{:15s}: {}".format(
        "Target language",
        "Training size",
        len(target_tensor_train),
        "Validation size",
        len(target_tensor_val),
    )
)

In [None]:
# We show one sentence from input and target languages
print("Input Language; index to word mapping")
utils.convert(inp_lang, input_tensor_train[30])
print()
print("Target Language; index to word mapping")
utils.convert(targ_lang, target_tensor_train[30])

<a id=prepare></a>
### 2. Preparing Dataset 
We will use the `tf.data.Dataset` API to build an asynchronous, highly optimized
data pipeline that keeps the GPU from starving for data. It reads the data
(text, in our case), creates batches, and sends them to the GPU.

In [None]:
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train) // BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index) + 1
vocab_tar_size = len(targ_lang.word_index) + 1

dataset = tf.data.Dataset.from_tensor_slices(
    (input_tensor_train, target_tensor_train)
).shuffle(BUFFER_SIZE)

dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
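
The cell above shuffles the training pairs and groups them into batches of `BATCH_SIZE`, discarding any final partial batch. The plain-Python sketch below mimics that behavior to show where `steps_per_epoch` comes from. It assumes the 1,000-example subset yields exactly 800 training pairs after the 80-20 split. `shuffled_batches` is a hypothetical helper, not part of `tf.data`.

```python
import random


def shuffled_batches(examples, batch_size, seed=0):
    """Shuffle the examples, then yield full batches only (the remainder is dropped)."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    num_full_batches = len(examples) // batch_size
    for i in range(num_full_batches):
        yield examples[i * batch_size : (i + 1) * batch_size]


# Assumption: 1,000 examples with an 80-20 split leave 800 training examples.
train_examples = range(800)
batches = list(shuffled_batches(train_examples, batch_size=64))

print(len(batches))             # steps per epoch: 12
print(800 - len(batches) * 64)  # examples dropped with the partial batch: 32
```

With so few examples, each epoch skips 32 of the 800 training pairs, but because the data is reshuffled, different pairs are skipped in different epochs.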

<a id=seq2seq></a>
### 3. Seq2Seq Models

We will use [Jay Alammar's](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/) wonderful visualizations to explain the `seq2seq` model and the `attention` mechanism.

In [2]:
Video("https://jalammar.github.io/images/seq2seq_2.mp4", width=900, height=200)

In [3]:
Video("https://jalammar.github.io/images/seq2seq_4.mp4", width=900, height=200)

A Neural Machine Translation (NMT) model is composed of an `encoder` and a `decoder`. The encoder processes each token in the input sequence and compresses the information it captures into a vector called the `context`, whose size equals the chosen number of units. The decoder takes the `context` vector as input and produces the output sequence token by token. In an NMT model, both the `encoder` and the `decoder` are RNNs. We define the `encoder` and `decoder` models in the `model` module imported at the top of the notebook.
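
As a rough illustration of how an encoder folds a sentence of any length into a fixed-size `context`, here is a minimal pure-Python sketch. It uses a plain tanh RNN with random weights and random stand-ins for word embeddings. All names and sizes are hypothetical, and the `Encoder` class used later in this notebook may use a different recurrent cell.

```python
import math
import random


def rnn_encode(token_vectors, w_in, w_rec):
    """Run a plain tanh RNN over the tokens and return the final hidden state."""
    units = len(w_rec)
    hidden = [0.0] * units  # the initial hidden state is all zeros
    for x in token_vectors:
        # New state mixes the current token with the previous hidden state.
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hk for w, hk in zip(w_rec[j], hidden))
            )
            for j in range(units)
        ]
    return hidden  # plays the role of the context vector


rng = random.Random(0)
embedding_dim, units = 3, 4  # toy sizes, far smaller than a real model
w_in = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in range(units)]
w_rec = [[rng.uniform(-1, 1) for _ in range(units)] for _ in range(units)]

# Two input sentences of different lengths (random stand-ins for embeddings).
short = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in range(2)]
long = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in range(5)]

# Both contexts have `units` entries, regardless of sentence length.
print(len(rnn_encode(short, w_in, w_rec)), len(rnn_encode(long, w_in, w_rec)))
```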

In [4]:
Video("https://jalammar.github.io/images/RNN_1.mp4", width=900, height=400)

Here we see an unrolled view of the processing steps of the `RNN`-based `encoder` and `decoder`. With attention, the `decoder` focuses on the parts of the input that are relevant to a given decoding step: it first looks at the hidden states of the encoder and scores each of them. It then applies a softmax to the scores and multiplies each hidden state by its softmaxed score, so that hidden states with high scores are amplified.
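
The score, softmax, and multiply steps can be sketched in a few lines of plain Python. Two assumptions are made here that the text does not spell out: the score is a simple dot product between the decoder state and each encoder hidden state, and the weighted hidden states are summed into a single context vector. The `Decoder` used in this notebook may score the states differently.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Score, softmax, and weight the encoder hidden states for one decoding step."""
    # Assumed scoring function: dot product with each encoder hidden state.
    scores = [
        sum(d * h for d, h in zip(decoder_state, state)) for state in encoder_states
    ]
    weights = softmax(scores)
    # Multiply each hidden state by its weight, then sum into one context vector.
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(len(decoder_state))
    ]
    return weights, context


# Toy values: three encoder hidden states of size 2 and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The first and third hidden states align with the decoder state, so they receive most of the weight and dominate the context vector.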

In [5]:
Video("https://jalammar.github.io/images/seq2seq_7.mp4", width=900, height=300)

### Encoder and Decoder

In [None]:
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

<a id=training></a>
### 4. Training and Optimizer

   1. Pass the input through the encoder, which returns the encoder output and the encoder hidden state.
   2. Pass the encoder output, the encoder hidden state, and the decoder input (which is the start token) to the decoder.
   3. The decoder returns the predictions and the decoder hidden state.
   4. Pass the decoder hidden state back into the model and use the predictions to calculate the loss.
   5. Use teacher forcing to decide the next input to the decoder.
   6. Teacher forcing is the technique where the target word is passed as the next input to the decoder.
   7. The final step is to calculate the gradients by backpropagation and apply them with the optimizer.
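
Teacher forcing, as described in the steps above, can be illustrated with a small sketch that lists what the decoder is fed and what it is expected to predict at each time step. `teacher_forcing_pairs` and the example sentence are hypothetical and exist only for illustration.

```python
def teacher_forcing_pairs(target_tokens):
    """Pair each decoder input with the token the decoder should predict next."""
    # At step t the decoder is fed the true token t-1, not its own prediction.
    return list(zip(target_tokens[:-1], target_tokens[1:]))


# Hypothetical target sentence, already wrapped in start and end tokens.
target = ["<start>", "bana", "anlat", "<end>"]

for decoder_input, expected_output in teacher_forcing_pairs(target):
    print(f"input: {decoder_input:8s} -> expected: {expected_output}")
```

Even if the decoder predicts the wrong word at one step, the next step still receives the correct target word, which keeps early training mistakes from compounding.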


In [None]:
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none"
)


def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)


checkpoint_dir = "./training_checkpoints"
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer, encoder=encoder, decoder=decoder)


@tf.function
def train_step(inp, targ, enc_hidden):
    loss = 0

    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)

        dec_hidden = enc_hidden

        dec_input = tf.expand_dims([targ_lang.word_index["<start>"]] * BATCH_SIZE, 1)

        # Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
            # passing enc_output to the decoder
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)

            loss += loss_function(targ[:, t], predictions)

            # using teacher forcing
            dec_input = tf.expand_dims(targ[:, t], 1)

    batch_loss = loss / int(targ.shape[1])

    variables = encoder.trainable_variables + decoder.trainable_variables

    gradients = tape.gradient(loss, variables)

    optimizer.apply_gradients(zip(gradients, variables))

    return batch_loss


EPOCHS = 100

for epoch in range(EPOCHS):
    start = time.time()

    enc_hidden = encoder.initialize_hidden_state()
    total_loss = 0

    for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
        batch_loss = train_step(inp, targ, enc_hidden)
        total_loss += batch_loss
        if batch % 100 == 0:
            print(
                "Epoch {} Batch {} Loss {:.4f}".format(
                    epoch + 1, batch, batch_loss.numpy()
                )
            )
    # saving (checkpoint) the model every 2 epochs
    if (epoch + 1) % 2 == 0:
        checkpoint.save(file_prefix=checkpoint_prefix)

    print("Epoch {} Loss {:.4f}".format(epoch + 1, total_loss / steps_per_epoch))
    print("Time taken for 1 epoch {} sec\n".format(time.time() - start))

<a id=testing></a>
### 5. Evaluation and Testing

Evaluation is similar to training, except that we don't use teacher forcing. The input to the decoder at each time step is its own previous prediction, along with the hidden state and the encoder output.
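
The feedback loop can be sketched independently of the model. In the toy example below, a fixed table of next-token scores stands in for the decoder. This is a hypothetical simplification: the real decoder also receives the hidden state and the encoder output at every step.

```python
def greedy_decode(step_fn, start_token, end_token, max_length):
    """Feed each prediction back in as the next input until <end> or max_length."""
    tokens = []
    current = start_token
    for _ in range(max_length):
        scores = step_fn(current)              # mapping: candidate token -> score
        current = max(scores, key=scores.get)  # greedy choice: highest score
        tokens.append(current)
        if current == end_token:
            break
    return tokens


# Hypothetical stand-in for the decoder: a fixed table of next-token scores.
toy_scores = {
    "<start>": {"bana": 0.7, "anlat": 0.2, "<end>": 0.1},
    "bana": {"bana": 0.1, "anlat": 0.8, "<end>": 0.1},
    "anlat": {"bana": 0.1, "anlat": 0.1, "<end>": 0.8},
}

result = greedy_decode(toy_scores.get, "<start>", "<end>", max_length=10)
print(" ".join(result))  # bana anlat <end>
```

The `max_length` cap matters because nothing guarantees that the model ever predicts the end token.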

In [None]:
def evaluate(sentence):
    attention_plot = np.zeros((max_length_targ, max_length_inp))

    sentence = preprocess.preprocess_sentence(sentence)

    inputs = [inp_lang.word_index[i] for i in sentence.split(" ")]
    inputs = tf.keras.preprocessing.sequence.pad_sequences(
        [inputs], maxlen=max_length_inp, padding="post"
    )
    inputs = tf.convert_to_tensor(inputs)

    result = ""

    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index["<start>"]], 0)

    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(
            dec_input, dec_hidden, enc_out
        )

        # storing the attention weights to plot later on
        attention_weights = tf.reshape(attention_weights, (-1,))
        attention_plot[t] = attention_weights.numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()

        result += targ_lang.index_word[predicted_id] + " "

        if targ_lang.index_word[predicted_id] == "<end>":
            return result, sentence, attention_plot

        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence, attention_plot

In [None]:
# function for plotting the attention weights
def plot_attention(attention, sentence, predicted_sentence):
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap="viridis")

    fontdict = {"fontsize": 14}

    ax.set_xticklabels([""] + sentence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels([""] + predicted_sentence, fontdict=fontdict)

    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()

In [None]:
def translate(sentence):
    result, sentence, attention_plot = evaluate(sentence)

    print("Input: %s" % (sentence))
    print("Predicted translation: {}".format(result))

    attention_plot = attention_plot[
        : len(result.split(" ")), : len(sentence.split(" "))
    ]
    plot_attention(attention_plot, sentence.split(" "), result.split(" "))

The wrapper function `translate` translates a sentence from the input language into the target language and plots the attention weights.

In [None]:
translate("Tell me!")

<a id=exercise></a>
### 6. Exercises


1. Due to computational limitations inside the Jupyter container, we used a subset of 1,000 examples from the training set. Train the model for at least 10 epochs using all training examples. (*Hint*: If you set `num_examples` to `None`, the whole dataset is used.)

2. Plot the attention weights for the new model. In Jupyter, the plots appear in the notebook; when you use batch submission, however, you need to save the plots to a file. Use [this API](https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.savefig.html) instead of `plt.show()`.
 

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

Please contact Haluk Dogan (<a href="mailto:hdogan@vivaldi.net">hdogan@vivaldi.net</a>) for further questions or inquiries.