<a href="https://colab.research.google.com/github/nyp-sit/sdaai-iti107/blob/main/session-7/nmt_baseline_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" align="left"/></a>

# Seq2Seq Model for Machine Translation

One of the most successful applications of the seq2seq architecture is machine translation. Neural network-based machine translation is commonly called *Neural Machine Translation (NMT)*. In this week's programming exercise, we will examine a basic seq2seq architecture that consists of an encoder-decoder pair. We will use it to translate from English to Bahasa Indonesia (which is similar to Malay). In the next programming exercise, we will modify this basic structure to include an attention mechanism to improve the translation quality.

You will learn: 
1. how to implement an encoder and decoder network
2. how a sequence to sequence model works
3. basic processing steps in preparing text for translation

*Credit: This notebook is adapted from https://www.tensorflow.org/tutorials/text/nmt_with_attention*

In [None]:
import tensorflow as tf

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import unicodedata
import re
import numpy as np
import os
import io
import time


def fix_cudnn_bug(): 
    # during training, tf may throw a cudnn initialization error: "failed to get convolution algos"
    # letting GPU memory grow on demand (allow_growth) works around it
    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True
    config.log_device_placement = False
    sess = tf.compat.v1.Session(config=config)
    tf.compat.v1.keras.backend.set_session(sess)
    
fix_cudnn_bug()

## Data Preparation

We'll use a language dataset provided by http://www.manythings.org/anki/. This dataset contains language translation pairs in the format:

```
what do you want to say?	apa yang ingin kamu katakan?
```

There are a variety of languages available, but we'll use the English-Indonesian dataset. For convenience, we've hosted a copy of this dataset on SDAAI cloud storage, but you can also download directly from the link provided above. After downloading the dataset, here are the steps we'll take to prepare the data:

1. Clean the sentences by removing special characters.
2. Add a *start* and *end* token to each sentence.
3. Convert text to numbers (vectorization) using a tokenizer (the tokenizer automatically creates a word index and a reverse word index, i.e. dictionaries mapping word → id and id → word).
4. Pad each sentence to the length of the longest sentence in the corpus.

In [None]:
# Download the file
url = 'https://sdaai-bucket.s3-ap-southeast-1.amazonaws.com/datasets/ind-eng.zip'
zipfilename = 'ind-eng.zip'
path_to_zip = tf.keras.utils.get_file(
    zipfilename, origin=url,
    extract=True)

path_to_file = os.path.dirname(path_to_zip)+"/ind.txt"

The following code converts Unicode text to a normalized form so that it can be represented with ASCII characters. This step is _not necessary_ for a language like Bahasa Indonesia (or Malay) because, like English, it is written using only ASCII characters. For languages such as French or German, however, you will need the following code to normalize the text.
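
As a quick illustration of why this matters for accented languages, the sketch below applies the same NFD-and-strip idea to a French and a German word. The function name `strip_accents` and the sample words are only for this example.

```python
import unicodedata


def strip_accents(text):
    # NFD splits an accented character into a base letter plus combining marks;
    # dropping category 'Mn' (nonspacing marks) keeps only the base letters.
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')


print(strip_accents('garçon'))        # French sample word
print(strip_accents('über'))          # German sample word
print(strip_accents('selamat pagi'))  # Indonesian text passes through unchanged
```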

In [None]:
# Converts the unicode file to ascii
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')


def preprocess_sentence(s):
    s = unicode_to_ascii(s.lower().strip())

    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ."
    # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
    s = re.sub(r"([?.!,¿])", r" \1 ", s)
    s = re.sub(r'[" "]+', " ", s)

    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
    s = re.sub(r"[^a-zA-Z?.!,¿]+", " ", s)

    s = s.strip()

    # adding a start and an end token to the sentence
    # so that the model knows when to start and stop predicting.
    s = '<start> ' + s + ' <end>'
    return s

In [None]:
en_sentence = u"What do you want to say?"
ind_sentence = u"Apa yang ingin kamu katakan?"
print(preprocess_sentence(en_sentence))
print(preprocess_sentence(ind_sentence))

Each line in the file contains the following fields: English sentence, Indonesian sentence and attribution. Each field is separated by a tab `(\t)`.  For example:
```
It might rain tomorrow.	Hujan mungkin akan turun besok.	CC-BY 2.0 (France) Attribution: tatoeba.org #31045 (CK) & #4449966 (Bilmanda)
```
We only want to keep the English and Indonesian sentence fields and drop the attribution. So we do a `line.split('\t')` which gives us an array of 3 fields and we keep the first 2 by python slicing `[:2]`.
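
The split-and-slice step can be tried on the sample line above before running it on the whole file. This is a small standalone sketch; the variable names are chosen for the example only.

```python
# One raw line from the dataset: three tab-separated fields.
line = ("It might rain tomorrow.\tHujan mungkin akan turun besok.\t"
        "CC-BY 2.0 (France) Attribution: tatoeba.org #31045 (CK) & #4449966 (Bilmanda)")

fields = line.split('\t')          # English, Indonesian, attribution
english, indonesian = fields[:2]   # keep the first two fields, drop the attribution

print(len(fields))
print(english)
print(indonesian)
```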

In [None]:
# 1. split the line into source (e.g. english), and target (e.g.indonesian) sentence fields 
# 2. normalize the source and target sentence fields 
# 3. return sentence pairs in the format: e.g. [ENGLISH, INDONESIAN]
def create_dataset(path, num_examples):
    # read the whole file; the with-block makes sure the file handle is closed
    with io.open(path, encoding='UTF-8') as f:
        lines = f.read().strip().split('\n')

    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')[:2]]  for l in lines[:num_examples]]

    return zip(*word_pairs)

In [None]:
en, ind = create_dataset(path_to_file, None)

# print the last sample
print(en[-1])
print(ind[-1])

In [None]:
def max_length(sequences):
    return max(len(seq) for seq in sequences)

### Tokenization

We need to convert the 'cleaned' text to a sequence of numbers (i.e. tokenize the text) and pad each sequence to the same length, because our network expects all samples in a batch to have the same tensor shape. Here we use the [keras Tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) to do the job (there are more advanced tokenizers, such as the [subwords-based tokenizer](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/SubwordTextEncoder), which will not be covered here). Note that we set `filters` to an empty string in the Tokenizer because we have already done our own filtering (e.g. replacing special characters with spaces).
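
To make the idea concrete without any framework, here is a minimal pure-Python sketch of what tokenizing and post-padding amount to. The helpers `build_word_index` and `pad_post` are hypothetical and only mimic the general behaviour (frequent words get small indices, 0 is reserved for padding); they are not the Keras implementation.

```python
from collections import Counter


def build_word_index(sentences):
    # Count words, then give the most frequent word index 1.
    # Index 0 is never assigned, so it can be used for padding.
    counts = Counter(word for s in sentences for word in s.split(' '))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def pad_post(sequences, pad_value=0):
    # Append pad_value to the end of each sequence up to the longest length.
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


sentences = ['<start> i am happy . <end>', '<start> i am here now . <end>']
word_index = build_word_index(sentences)
sequences = [[word_index[w] for w in s.split(' ')] for s in sentences]
padded = pad_post(sequences)

for row in padded:
    print(row)
```

Both rows come out with the same length, and the shorter sentence ends in a 0, which is the value the loss function later has to ignore.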

In [None]:
def tokenize(sentences):
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
      filters='')
    lang_tokenizer.fit_on_texts(sentences)

    sequences = lang_tokenizer.texts_to_sequences(sentences)

    sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences,
                                                         padding='post')

    return sequences, lang_tokenizer

We need *separate* tokenizers for source text (English) and target text (Indonesian). Here we create two tokenizers (`src_lang_tokenizer` and `targ_lang_tokenizer`), fit separately on source and target text corpus. 

In [None]:
def load_dataset(path, num_examples=None):
    # creating cleaned input, output pairs
    src_sentences, targ_sentences = create_dataset(path, num_examples)

    src_sequences, src_lang_tokenizer = tokenize(src_sentences)
    targ_sequences, targ_lang_tokenizer = tokenize(targ_sentences)

    return src_sequences, targ_sequences, src_lang_tokenizer, targ_lang_tokenizer

Training on the complete dataset of sentences will probably take a long time. For a quick test that your model works as expected (i.e. with no logic errors), you can set the size of the dataset to something small (say, a couple of hundred samples). If `num_examples = None`, we will use the entire dataset.

In [None]:
# Try experimenting with the size of that dataset
num_examples = None
src_sequences, targ_sequences, src_lang_tokenizer, targ_lang_tokenizer = load_dataset(path_to_file, num_examples)

# Calculate the maximum sequence length of the source and target sequences
max_length_src, max_length_targ = max_length(src_sequences), max_length(targ_sequences)
print('maximum sentence length in source text corpus = {}'.format(max_length_src))
print('maximum sentence length in target text corpus = {}'.format(max_length_targ))


In [None]:
# function to print each index in a sequence together with its corresponding word
def convert(lang_tokenizer, tokens):
    for t in tokens:
        if t != 0:
            print ("%d ----> %s" % (t, lang_tokenizer.index_word[t]))

In [None]:
print ("Input Language; index to word mapping")
convert(src_lang_tokenizer, src_sequences[200])
print ()
print ("Target Language; index to word mapping")
convert(targ_lang_tokenizer, targ_sequences[200])

In [None]:
# the vocabulary size consists of all the words in the word_to_index table plus 1 reserved token of value 0
src_vocab_size = len(src_lang_tokenizer.word_index)+1
targ_vocab_size = len(targ_lang_tokenizer.word_index)+1
print('src language vocab size = {}'.format(src_vocab_size))
print('target language vocab size = {}'.format(targ_vocab_size))

In [None]:
# buffer size for shuffling data
BUFFER_SIZE = len(src_sequences)

# batch size 
BATCH_SIZE = 64

# we set the training steps per epoch to match number of batches
steps_per_epoch = len(src_sequences)//BATCH_SIZE

# this is the embedding size  
EMBEDDING_SIZE = 256

# this is the number of neuron units in the LSTM/GRU layer
RNN_UNITS = 1024

### Create a tf.data dataset

We convert our training data into `tf.data.Dataset` and use it for shuffling and batching.
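
What shuffling and batching with `drop_remainder=True` do can be sketched in plain Python. The helper `shuffled_batches` and the ten made-up pairs below are assumptions for illustration only; `tf.data` does the real work in the notebook.

```python
import random


def shuffled_batches(pairs, batch_size, seed=None):
    """Yield full batches from a shuffled copy of pairs, dropping any remainder."""
    order = list(pairs)
    random.Random(seed).shuffle(order)
    full = len(order) // batch_size   # same arithmetic as steps_per_epoch
    for b in range(full):
        yield order[b * batch_size:(b + 1) * batch_size]


pairs = [(i, i * 10) for i in range(10)]   # 10 made-up (source, target) pairs
batches = list(shuffled_batches(pairs, batch_size=4, seed=0))
print(len(batches), [len(b) for b in batches])
```

With 10 pairs and a batch size of 4, only two full batches are produced and the last 2 pairs are dropped, which is why every batch has exactly the same shape.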

In [None]:
dataset = tf.data.Dataset.from_tensor_slices((src_sequences, targ_sequences)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

Check if our dataset gives the correct batch size. 

In [None]:
example_input_batch, example_target_batch = next(iter(dataset))
example_input_batch.shape, example_target_batch.shape

## Encoder and Decoder model

Now let us implement the encoder and decoder network.

**Exercise**

First let's implement our Encoder network as shown in the dotted box: 

![encoder](nb_images/encoder.png)

Our encoder network consists of one embedding layer, followed by a GRU layer. As we need to use the hidden state of the last timestep as the input to the decoder network, we need to set `return_state` to `True`. Although we don't really need the output at each timestep, we will need it for the attention-based model later on, so let's set `return_sequences` to `True` as well.

To make our code easier to read, we will encapsulate the details of our encoder network in a custom model using the Keras subclassing API (introduced in Keras 2.2.0). We just need to implement the forward pass in the `call()` method. Your `call()` needs to return both the output and the final (time-step) hidden state. This final encoder hidden state is passed to the decoder as its initial hidden state.

Complete the code. 

<details><summary>Click here for solution</summary>
    
```
def __init__(...):
    ...
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    ...
    
def call():
    ... 
    output, state = self.gru(embed, initial_state = hidden)
    ...
```
</details>

In [None]:
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        """
        Arguments:
        vocab_size -- vocabulary size for the embedding layer
        embedding_dim -- the length of the embedding vector
        enc_units -- number of units in the encoder RNN layer
        batch_sz -- batch size
        """
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        
        ### START YOUR CODE HERE ###
        
        # create an Embedding layer with appropriate size
        self.embedding = None
        
        # create a GRU layer with appropriate parameters. Make sure it returns the output sequence and the final hidden state
        self.gru = None
        
        ### END YOUR CODE HERE ###
        
    # Implement the forward pass
    def call(self, sequence, hidden):
        """
        Arguments:
        sequence -- source sequence
        hidden -- initial hidden state
        """
        
        # call embedding layer 
        embed = self.embedding(sequence)
        
        ### START YOUR CODE HERE ###
        
        # call GRU layer and set the initial state. 
        output, state = None, None
        
        ### END YOUR CODE HERE ###
        
        return output, state
    
    # initialize encoder initial hidden state
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))

Let's test your encoder network by passing it the sample batch you created in an earlier notebook cell. As the sample sequences are padded to a length of 38, you should expect the following output:

```
Encoder output shape: (batch size, sequence length, units) (64, 38, 1024)
Encoder Hidden state shape: (batch size, units) (64, 1024)
```


In [None]:
encoder = Encoder(src_vocab_size, EMBEDDING_SIZE, RNN_UNITS, BATCH_SIZE)

print(example_input_batch.shape)
# sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print ('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print ('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))

**Exercise**

Now let us implement our Decoder network as shown in the dotted box:

![decoder.png](nb_images/decoder.png)

Similar to the encoder network, our decoder consists of one embedding layer followed by a GRU layer, plus a Dense layer (shown as the projection layer in the diagram). Our GRU needs to return the output as well as the hidden state at each time step because, during training, we feed the decoder one token at a time, compare the output at each time step with the expected output to calculate the loss, and pass the hidden state on to the next time step.

Complete the code below.

<details><summary>Click here for solution</summary>
    
```
self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
self.gru = tf.keras.layers.GRU(self.dec_units,
                               return_sequences=True,
                               return_state=True,
                               recurrent_initializer='glorot_uniform')
self.fc = tf.keras.layers.Dense(vocab_size)
```
</details>

In [None]:
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        
        ### START YOUR CODE HERE ###
       
    
        ### END YOUR CODE ###

    def call(self, sequence, hidden):
        """
        Arguments:
        sequence -- target sequence (as we are using teacher forcing)
        hidden -- hidden state (in the first timestep, the hidden state is from encoder's final hidden state)
        """
        
        # embedding shape after passing through embedding == (batch_size, 1, EMBEDDING_SIZE)
        embed = self.embedding(sequence)

        # passing the embedding to the GRU
        output, state = self.gru(embed, hidden)
        
        # if one component of the shape is -1, its size is computed automatically so that the total size stays constant
        # so if the original shape of x is (64, 10, 32), tf.reshape(x, (-1, 32)) gives shape (640, 32)
        output = tf.reshape(output, (-1, output.shape[2]))
    
        # output shape == (batch_size, vocab)
        
        x = self.fc(output)

        return x, state

Let's test your decoder network by passing it a batch of samples with a single timestep each. You should expect the following output:

```
Decoder output shape: (batch_size, vocab size) (64, 4291)
```

Since our vocab size is 4291, the output is of 4291 dimensions.

In [None]:
decoder = Decoder(targ_vocab_size, EMBEDDING_SIZE, RNN_UNITS, BATCH_SIZE)

sample_decoder_output, _ = decoder(tf.random.uniform((64, 1)),
                                      sample_hidden)

print ('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))

## Define the optimizer and the loss function

Let's define our loss function. 

Since we are using indexes (e.g. 23, 45, 12, etc.) and not one-hot-encoded vectors for our target labels, we will use `SparseCategoricalCrossentropy` as our loss function. Note that we need to set `from_logits=True` because the outputs of our Decoder are logits (i.e. unscaled, unnormalized values).
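
The relationship between an integer label, a one-hot label and logits can be checked by hand. The following is a framework-free sketch with made-up logits; the helper names are hypothetical and this is not how Keras computes the loss internally.

```python
import math


def sparse_ce_from_logits(label, logits):
    # Cross-entropy for an integer label: log-sum-exp(logits) - logits[label].
    # Subtracting the max first keeps the exponentials numerically stable.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0]   # made-up unnormalized scores over a 3-word vocabulary
label = 0                   # integer ("sparse") label

loss_sparse = sparse_ce_from_logits(label, logits)

# The same loss computed from a one-hot label and softmax probabilities.
one_hot = [1.0 if i == label else 0.0 for i in range(len(logits))]
loss_one_hot = -sum(t * math.log(p) for t, p in zip(one_hot, softmax(logits)))

print(loss_sparse, loss_one_hot)
```

The two values agree up to floating-point error, so the sparse form simply saves us from building the one-hot vectors.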

As our sequences are padded with 0 (to be the same length), we don't want to take these zeros into account when computing the loss. One way to do this is to compute the mask and use the mask to zero out the loss at those positions that are padded. See the diagram here:

![mask](nb_images/mask_loss.png)

You can first compare each position to zero by using `tf.math.equal()`. This will set to True for those positions that are zeros. You can then invert that using `tf.math.logical_not()` so that your final mask will be True for those non-zero positions. 

You can then use the mask to do element-wise multiplication with the loss. Before you can do that, you need to cast the mask (which is of boolean type) to the dtype of the loss by using `tf.cast(x, dtype=loss.dtype)`.
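
Here is a small worked example of the masking arithmetic in plain Python, assuming the per-example losses for one timestep have already been computed. All numbers are made up.

```python
# Target token ids for one timestep across a batch of 4; 0 means padding.
real = [4, 0, 7, 0]

# Made-up per-example losses for that timestep (what reduction='none' would give).
per_example_loss = [1.0, 3.0, 0.5, 2.5]

# True where the target is a real token, False where it is padding.
mask = [token != 0 for token in real]

# Cast the boolean mask to float and zero out the loss at padded positions.
masked_loss = [loss * float(keep) for loss, keep in zip(per_example_loss, mask)]

# Average over the batch (padded positions contribute 0 to the sum).
mean_loss = sum(masked_loss) / len(masked_loss)

print(masked_loss)
print(mean_loss)
```

Note that the mean is taken over the whole batch, as `tf.reduce_mean` does in this notebook, not over the non-padded positions only.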

**Exercise:** 

Complete the code in `loss_function()`. 

<details><summary>Click here for solution</summary>
    
```

mask = tf.math.logical_not(tf.math.equal(real, 0))
loss_ = loss_object(real, pred)

mask = tf.cast(mask, dtype=loss_.dtype)
loss_ *= mask

```
    
</details>

In [None]:
optimizer = tf.keras.optimizers.Adam()

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    
    ### WRITE YOUR CODE HERE ###
    

    
    ### END CODE HERE ###
    
    return tf.reduce_mean(loss_)

## Checkpoints (Object-based saving)

In [None]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

## Training

The following diagram shows the training process using teacher forcing:

![training seq2seq](nb_images/seq2seq_train.png)

1. Pass the *input* through the *encoder*, which returns the *encoder output* and the *encoder hidden state*.
2. The encoder hidden state and the decoder input (which is the *start token*) are passed to the decoder.
3. The decoder returns the *predictions* and the *decoder hidden state*.
4. The *decoder hidden state* is passed back to the decoder at the next timestep. The *prediction* is compared with the expected token to calculate the loss.
5. Use *teacher forcing* to decide the next input to the decoder. *Teacher forcing* is the technique where the *target word* is passed as the *next input* to the decoder.
6. The final step is to calculate the gradients by backpropagation and let the optimizer apply them to the encoder and decoder variables.
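
The way teacher forcing pairs decoder inputs with expected outputs can be traced on a single toy sentence. This sketch uses plain strings instead of token ids, and no model is involved.

```python
target = ['<start>', 'i', 'am', 'happy', '<end>']

dec_input = target[0]   # the first decoder input is always <start>
pairs = []
for t in range(1, len(target)):
    # At step t the decoder sees dec_input and is expected to predict target[t].
    pairs.append((dec_input, target[t]))
    # Teacher forcing: the next input is the ground-truth word,
    # regardless of what the decoder actually predicted.
    dec_input = target[t]

for given, expected in pairs:
    print('{:>8} -> {}'.format(given, expected))
```

The expected word is always one position ahead of the input word, which is why the training loop starts its range at 1.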

**Note:** We are feeding the target sequence to the decoder one timestep at a time.

*In the code below, you will see the use of `@tf.function` at the beginning of the function. It compiles the Python function into a TensorFlow graph, which usually runs faster.*

In [None]:
@tf.function
def train_step(src, targ, enc_hidden):
    loss = 0

    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(src, enc_hidden)

        dec_hidden = enc_hidden

        # create the input for the first timestep for decoder which is <start> token
        # we create batch size samples of <start_token>, and shape it to <batch, 1>
        dec_input = tf.expand_dims([targ_lang_tokenizer.word_index['<start>']] * BATCH_SIZE, 1)

        # Teacher forcing - feeding the target as the next input
        # Note that targ.shape[1] is the size of the 2nd axis, i.e. the target sequence length
        # e.g. if the target sequence is '<start> I am happy <end>', then range(1, 5)
        # visits the tokens at positions 1, 2, 3, 4,
        # i.e. 'I', 'am', 'happy', '<end>', while dec_input is '<start>', 'I', 'am', 'happy'
        # so targ[:, t] is always one timestep ahead of dec_input
        for t in range(1, targ.shape[1]):
            # pass the current input token and the previous hidden state to the decoder
            predictions, dec_hidden = decoder(dec_input, dec_hidden)

            loss += loss_function(targ[:, t], predictions)

            # we advance the input to the next timestep
            dec_input = tf.expand_dims(targ[:, t], 1)

    batch_loss = (loss / int(targ.shape[1]))

    variables = encoder.trainable_variables + decoder.trainable_variables

    gradients = tape.gradient(loss, variables)

    optimizer.apply_gradients(zip(gradients, variables))

    return batch_loss

In [None]:
EPOCHS = 30

train = False 

if train:
    for epoch in range(EPOCHS):
        start = time.time()

        enc_hidden = encoder.initialize_hidden_state()
        total_loss = 0

        for (batch, (src, targ)) in enumerate(dataset.take(steps_per_epoch)):
            batch_loss = train_step(src, targ, enc_hidden)
            total_loss += batch_loss

            if batch % 50 == 0:
                print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.numpy()))
        # saving (checkpoint) the model every 2 epochs
        if (epoch + 1) % 2 == 0:
            checkpoint.save(file_prefix = checkpoint_prefix)

        print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                          total_loss / steps_per_epoch))
        print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

## Translate

* The evaluate function is similar to the training loop, except that we don't use *teacher forcing* here. The input to the decoder at each time step is its own previous prediction, along with the previous hidden state. For timestep 0, the hidden state of the decoder is set to the hidden state of the last timestep of the encoder.
* Stop predicting when the model predicts the *end token* `<end>`.

Note: The encoder output is calculated only once for one input.
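
The decoding loop described above is usually called greedy decoding. Below is a minimal sketch of that loop in plain Python. The `toy_step` function and its score table are hypothetical stand-ins for the trained decoder, used only so that the loop can run on its own.

```python
def greedy_decode(step_fn, start_token, end_token, initial_state, max_len):
    """Feed each prediction back as the next input until end_token or max_len."""
    token, state = start_token, initial_state
    output = []
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        token = max(scores, key=scores.get)   # argmax over the vocabulary
        output.append(token)
        if token == end_token:
            break
    return output


def toy_step(token, state):
    # Hypothetical stand-in for the decoder: fixed scores for the next word.
    table = {
        '<start>': {'selamat': 0.9, 'pagi': 0.1, '<end>': 0.0},
        'selamat': {'selamat': 0.1, 'pagi': 0.8, '<end>': 0.1},
        'pagi':    {'selamat': 0.0, 'pagi': 0.1, '<end>': 0.9},
    }
    return table[token], state + 1   # the "state" here is just a step counter


result = greedy_decode(toy_step, '<start>', '<end>', initial_state=0, max_len=10)
print(' '.join(result))
```

The `max_len` bound plays the same role as `max_length_targ` in the notebook: it stops the loop even if the end token is never predicted.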

In [None]:
def evaluate(sentence):
    

    sentence = preprocess_sentence(sentence)

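    # look up the index of each word; a word that was not in the training corpus raises a KeyError here
    inputs = [src_lang_tokenizer.word_index[i] for i in sentence.split(' ')]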
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                           maxlen=max_length_src,
                                                           padding='post')
    inputs = tf.convert_to_tensor(inputs)

    result = ''

    hidden = [tf.zeros((1, RNN_UNITS))]
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang_tokenizer.word_index['<start>']], 0)

    for t in range(max_length_targ):
        predictions, dec_hidden = decoder(dec_input,dec_hidden)

        predicted_id = tf.argmax(predictions[0]).numpy()

        result += targ_lang_tokenizer.index_word[predicted_id] + ' '

        if targ_lang_tokenizer.index_word[predicted_id] == '<end>':
            return result, sentence

        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence

In [None]:
def translate(sentence):
    result, sentence = evaluate(sentence)

    print('Input: %s' % (sentence))
    print('Predicted translation: {}'.format(result))


## Restore the latest checkpoint and test

In [None]:
#Uncomment the following if you want to download the pretrained model checkpoints 
# !wget https://sdaai-bucket.s3-ap-southeast-1.amazonaws.com/pretrained-weights/iti107/session-8/nmt-chk-30epochs.tar.gz
# !tar xvf nmt-chk-30epochs.tar.gz

In [None]:
# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

In [None]:
test_sents = [
    "We hope prices are going to drop.",
    "You look familiar. Do I know you?",
    "I went to see a doctor.", 
    "I'm sorry, but I'm busy right now.",
    "I have moved out.",
    "there was a heavy rain this morning.",
    "My wife likes the painting.",
    "I ate two slices of bread.",
    "I can't go out because I broke my leg.",
    "This is a very cold morning."
]

for sent in test_sents: 
    translate(sent)

## Next steps

* As we can see, our baseline model got some of the translations correct, but in some cases, the translations made no sense at all. We will try to improve the model using Attention in our next exercise.
