# 2. Translation Model

## Introduction

In this example, I build single-layer LSTM networks for both the encoder and the decoder. The encoder's final states (the hidden state and the cell state) must be passed on to the decoder as its initial state.

```python
# LSTM layer in Encoder
lstm_layer = tf.keras.layers.LSTM(
    units,  # dimensionality of the output space
    return_sequences=True,  # return the output at every time step
    return_state=True,  # also return the final hidden and cell states for the decoder
)
```

However, accuracy can often be improved with a BiLSTM or with multi-layer LSTM/BiLSTM networks. The following snippet creates a BiLSTM model with explicit forward and backward layers:

```python
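import tensorflow as tf
from tensorflow.keras.layers import Activation, Dense
from tensorflow.keras.models import Sequential

model = Sequential()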
forward_layer = tf.keras.layers.LSTM(10, return_sequences=True)
backward_layer = tf.keras.layers.LSTM(
    10, activation="relu", return_sequences=True, go_backwards=True
)
model.add(
    tf.keras.layers.Bidirectional(
        forward_layer, backward_layer=backward_layer, input_shape=(5, 10)
    )
)
model.add(Dense(5))
model.add(Activation("softmax"))
```
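
To make the forward/backward idea concrete, here is a minimal pure-Python sketch. It is not an LSTM: the hypothetical `toy_cell` just keeps a running sum, an assumption chosen so the result can be checked by hand. What it shares with `Bidirectional` is the structure: one pass reads the sequence left to right, another reads it right to left, and the per-step outputs are concatenated (the default `merge_mode="concat"` in Keras).

```python
def toy_cell(state, x):
    """Stand-in for an LSTM cell: the new state is a running sum."""
    return state + x


def run_direction(seq, reverse=False):
    """Run the toy cell over seq and return one output per time step."""
    order = reversed(seq) if reverse else seq
    state, outputs = 0, []
    for x in order:
        state = toy_cell(state, x)
        outputs.append(state)
    # Re-align the backward outputs with the original time order.
    return outputs[::-1] if reverse else outputs


def bidirectional(seq):
    """Pair forward and backward outputs at every time step."""
    forward = run_direction(seq)
    backward = run_direction(seq, reverse=True)
    return list(zip(forward, backward))


print(bidirectional([1, 2, 3]))
# [(1, 6), (3, 5), (6, 3)]
```

At position 0 the forward slot has only seen the first element, while the backward slot already summarizes the whole sequence. This is the extra context a BiLSTM encoder gives to each time step.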

For more background, see the tutorials on building an [Encoder-Decoder Model using LSTM](https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/) and on [comparing LSTM with BiLSTM](https://machinelearningmastery.com/develop-bidirectional-lstm-sequence-classification-python-keras/).

## Overview

### Training Task

Training involves two tasks:

1. Input task: read an input sequence (text) and extract useful information from it.
2. Output task: process the output so that the probability of each target token can be computed. This requires two sequences derived from the target: the Ground Truth Sequence, which is fed to the decoder, and the Final Token Sequence, which the model should predict when given the Ground Truth Sequence.

```python
dec_input = targ[:, :-1]  # Ground Truth Sequence (all tokens but the last)
real = targ[:, 1:]        # Final Token Sequence (all tokens but the first)
pred = decoder(dec_input, decoder_initial_state)
logits = pred.rnn_output
loss = loss_function(real, logits)
```
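
The shift between the two sequences is easier to see on plain lists. The sketch below uses made-up token ids (the assumption being that 1 is `<start>`, 2 is `<end>` and 0 is padding) and shows that every token fed to the decoder is paired with the token that follows it in the target.

```python
# Hypothetical token ids: 1 = <start>, 2 = <end>, 0 = padding.
targ = [
    [1, 15, 150, 22, 2],
    [1, 176, 123, 2, 0],
]

dec_input = [row[:-1] for row in targ]  # what the decoder is fed
real = [row[1:] for row in targ]        # what the decoder should predict

for fed, expected in zip(dec_input[0], real[0]):
    print(f"fed {fed:>3} -> expect {expected:>3}")
```

Both sequences are one token shorter than the target, so the logits and the labels line up position by position.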

#### Data cleaning

Standardize Unicode letters and convert them to ASCII to simplify processing.

- `unicodedata.normalize(form, unistr)` returns the normal form `form` of the Unicode string `unistr`. Valid values for `form` are 'NFC', 'NFKC', 'NFD', and 'NFKD'.
- `unicodedata.category(c)` returns `"Mn"` for nonspacing marks (such as combining accents). These characters are dropped after decomposition.

```python
import unicodedata


def unicode_to_ascii(s):
    return "".join(
        c
        for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn"
    )
```

The sample code below shows how to handle punctuation and special characters:

```python
import re

# `w` holds one raw sentence; unicode_to_ascii is defined above
w = unicode_to_ascii(w.lower().strip())
# create a space between a word and the punctuation following it
w = re.sub(r"([?.!,¿])", r" \1 ", w)
# collapse runs of spaces (and double quotes) into a single space
w = re.sub(r'[" "]+', " ", w)
# replace everything except (a-z, A-Z, ".", "?", "!", ",", "¿") with a space
w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
w = w.strip()
```
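
Put together, the cleaning steps can be wrapped in one function. This is a sketch using only the standard library. The name `clean_sentence` is hypothetical and is not necessarily what `preprocess_sentence` in the demo does. The separate space-collapsing step is folded into the last regex, which already turns any run of disallowed characters (spaces included) into a single space.

```python
import re
import unicodedata


def unicode_to_ascii(s):
    """Drop combining marks (category Mn) after NFD decomposition."""
    return "".join(
        c
        for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn"
    )


def clean_sentence(sentence):
    """Lowercase, strip accents, and separate punctuation with spaces."""
    w = unicode_to_ascii(sentence.lower().strip())
    w = re.sub(r"([?.!,¿])", r" \1 ", w)    # space out punctuation
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)  # drop the rest, collapse spaces
    return w.strip()


print(clean_sentence("Tôi thích hoa."))
# toi thich hoa .
```

One caveat for Vietnamese: letters without a canonical decomposition, such as `đ`, are left untouched by NFD and are then removed by the last regex, so `clean_sentence("đi")` returns `"i"`. If that matters, such letters need an explicit mapping before cleaning.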

#### Padding
In tasks such as translation or text summarization, the length of the input and output is not fixed. The model's input size, however, is fixed when the neural network is built. Empty positions are therefore filled with an extra symbol called pad.
```python
tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')
```
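
To show what `padding='post'` does, here is a plain-Python sketch that mimics the behaviour on lists of token ids. `pad_post` is a hypothetical helper. Like `pad_sequences` with its defaults, it pads with 0 up to the length of the longest sequence.

```python
def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (max_len - len(seq)) for seq in sequences]


batch = [[5, 41, 860], [17, 16], [176]]
print(pad_post(batch))
# [[5, 41, 860], [17, 16, 0], [176, 0, 0]]
```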

#### Start and End of a Sentence
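At the first decoding step there is no previous output to feed in, yet the decoder still needs an input, and it also needs a way to signal that the sentence is finished. So we use start-of-sequence `<start>` and end-of-sequence `<end>` tokens.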
```python
w = '<start> ' + w + ' <end>'
```

#### Out of Vocabulary
Some words do not exist in the dictionary. For these we introduce an Out-Of-Vocabulary (OOV) token.
```python
tf.keras.preprocessing.text.Tokenizer(filters='', oov_token='<OOV>')
```

These extra symbols are called the new vocabulary or extended vocabulary.
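
The sketch below shows how these special symbols end up in a vocabulary, using only the standard library. The helpers `build_vocab` and `encode` are hypothetical. The id layout (0 reserved for padding, 1 for `<OOV>`, remaining words by frequency) is an assumption that mirrors the convention of the Keras `Tokenizer`.

```python
from collections import Counter


def build_vocab(sentences):
    """Map words to ids; 0 is reserved for padding, 1 for <OOV>."""
    counts = Counter(word for s in sentences for word in s.split())
    word_index = {"<OOV>": 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index


def encode(sentence, word_index):
    """Replace each word by its id, falling back to the <OOV> id."""
    oov_id = word_index["<OOV>"]
    return [word_index.get(w, oov_id) for w in sentence.split()]


train = ["<start> i like tea . <end>", "<start> i like you . <end>"]
word_index = build_vocab(train)
print(encode("<start> i like bread . <end>", word_index))
# [2, 3, 4, 1, 5, 6]
```

The unseen word "bread" is mapped to id 1, so the model always receives a valid index even for words it has never seen.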

### Attention

Two popular attention mechanisms were developed by Bahdanau (`tfa.seq2seq.BahdanauAttention`) and Luong (`tfa.seq2seq.LuongAttention`).
Although the idea behind attention is easy to understand, the implementation is complex. Fortunately, TensorFlow Addons provides a helper, `AttentionWrapper`, that adds attention to the decoder cell.

```python
# Luong Attention
attention_mechanism = tfa.seq2seq.LuongAttention(
    dec_units, memory, memory_sequence_length
)
rnn_cell = tfa.seq2seq.AttentionWrapper(
    tf.keras.layers.LSTMCell(dec_units),  # the wrapper expects a cell instance
    attention_mechanism,
    attention_layer_size=dec_units,
)
```
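
For intuition about what the attention mechanism computes, here is a minimal pure-Python sketch of dot-product (Luong-style) attention. It is a simplification: it omits any learned projection of the encoder outputs, and the names `dot_attention`, `memory` and `query` are illustrative. Each encoder output is scored against the decoder state, the scores are normalized with a softmax, and the context vector is the weighted sum of the encoder outputs.

```python
import math


def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(query, memory):
    """Return the attention weights and the context vector."""
    weights = softmax([dot(query, h) for h in memory])
    dim = len(memory[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, memory)) for i in range(dim)
    ]
    return weights, context


memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder outputs
query = [2.0, 0.0]                             # toy decoder state
weights, context = dot_attention(query, memory)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

The first and third encoder outputs align with the query and receive equal, larger weights. The second is orthogonal to it and receives the smallest weight.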

### Decoding during Training

During training, we have access to both the input and output sequences of a training pair. This means that we can use the output sequence's ground truth tokens as input for the decoder.

The `TrainingSampler` object is initialized with the (embedded) ground truth sequences and the lengths of the ground truth sequences. Both are supplied when the decoder is called, not in the sampler's constructor.

```python
sampler = tfa.seq2seq.sampler.TrainingSampler()
# `fc` is assumed to be the final Dense layer that projects onto the target vocabulary
decoder = tfa.seq2seq.BasicDecoder(rnn_cell, sampler=sampler, output_layer=fc)
```
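
The outputs produced this way are compared with the shifted target sequence. Because batches are padded, a common choice is to leave padded positions out of the loss. The sketch below assumes pad id 0 and works on plain per-token probabilities instead of TensorFlow tensors. `masked_cross_entropy` is a hypothetical helper, not the `loss_function` used in the demo.

```python
import math


def masked_cross_entropy(real, probs, pad_id=0):
    """Mean negative log-likelihood over non-padding target tokens.

    real:  list of target token ids for one sentence
    probs: list of dicts mapping token id -> predicted probability
    """
    losses = [
        -math.log(step_probs[token])
        for token, step_probs in zip(real, probs)
        if token != pad_id
    ]
    return sum(losses) / len(losses)


real = [15, 2, 0]  # the last position is padding
probs = [{15: 0.5, 2: 0.5}, {15: 0.25, 2: 0.75}, {0: 1.0}]
print(round(masked_cross_entropy(real, probs), 4))
```

Whatever the model predicts at the padded position has no effect on the result, so the model is not rewarded or punished for what it outputs after `<end>`.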

### Decoding during Inferencing

At inference time there is no ground truth, so the `TrainingSampler` object must be replaced with an inference sampler. This example uses `BasicDecoder` from tf-addons with a `GreedyEmbeddingSampler`. Another option is the [BeamSearchDecoder, also from tf-addons](https://www.tensorflow.org/addons/tutorials/networks_seq2seq_nmt#use_tf-addons_beamsearchdecoder).

```python
greedy_sampler = tfa.seq2seq.GreedyEmbeddingSampler()
decoder_instance = tfa.seq2seq.BasicDecoder(
    cell=decoder.rnn_cell, sampler=greedy_sampler, output_layer=decoder.fc
)
```
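
Greedy decoding itself is a short loop: start from `<start>`, pick the highest-scoring next token, feed it back in, and stop at `<end>` or at a maximum length. The sketch below replaces the real decoder with a hypothetical lookup table `NEXT_SCORES` that depends only on the previous token. A real decoder also conditions on its state and on attention, so this is an illustration of the control flow only.

```python
START, END = 1, 2  # hypothetical ids for <start> and <end>

# Toy "model": for each previous token, scores for the next token id.
NEXT_SCORES = {
    1: {5: 2.0, 7: 0.5},
    5: {7: 1.5, 2: 0.1},
    7: {2: 3.0, 5: 0.2},
}


def greedy_decode(next_scores, max_length=10):
    """Pick the highest-scoring token at each step until <end>."""
    token, output = START, []
    for _ in range(max_length):
        scores = next_scores[token]
        token = max(scores, key=scores.get)
        output.append(token)
        if token == END:
            break
    return output


print(greedy_decode(NEXT_SCORES))
# [5, 7, 2]
```

Greedy decoding commits to the best token at every step. Beam search keeps several candidate sequences alive instead, at a higher computational cost.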

## Demo

I will build a translator from Vietnamese to English. The data was downloaded from http://www.manythings.org/anki/ and pre-processed.

```python
import time

import tensorflow as tf
import tensorflow_addons as tfa

from nmtdataset import NMTDataset
from models import Encoder, Decoder
from functions import *


def get_nmt():
    """Get the link to the dataset.
    If the dataset does not exist, download it manually and assign new path."""
    path_to_file = "./dict/vie-eng/vie.txt"
    return path_to_file
```

```python
# Configuration parameters
# DataSet
BUFFER_SIZE = 256000
BATCH_SIZE = 64
num_examples = 10000  # limit the number of training examples for faster training
# Neural Network parameters
embedding_dim = 256
units = 1024
steps_per_epoch = num_examples // BATCH_SIZE
```

```python
# Load DataSet
dataset_creator = NMTDataset("en-vie", get_nmt())
train_dataset, val_dataset, inp_lang, targ_lang = dataset_creator.call(
    num_examples, BUFFER_SIZE, BATCH_SIZE
)
example_input_batch, example_target_batch = next(iter(train_dataset))
vocab_inp_size = len(inp_lang.word_index) + 1
vocab_tar_size = len(targ_lang.word_index) + 1
max_length_input = example_input_batch.shape[1]
max_length_output = example_target_batch.shape[1]
```

```python
# Test Encoder Stack
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

# sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_h, sample_c = encoder(example_input_batch, sample_hidden)
```

```python
# Test decoder stack
decoder = Decoder(
    vocab_tar_size,
    embedding_dim,
    units,
    BATCH_SIZE,
    max_length_input,
    max_length_output,
    "luong",
)
sample_x = tf.random.uniform((BATCH_SIZE, max_length_output))
decoder.attention_mechanism.setup_memory(sample_output)
initial_state = decoder.build_initial_state(
    BATCH_SIZE, [sample_h, sample_c], tf.float32
)

sample_decoder_outputs = decoder(sample_x, initial_state)
```

```python
EPOCHS = 50

for epoch in range(EPOCHS):
    start = time.time()

    enc_hidden = encoder.initialize_hidden_state()
    total_loss = 0

    for (batch, (inp, targ)) in enumerate(train_dataset.take(steps_per_epoch)):
        batch_loss = train_step(
            inp, targ, enc_hidden, BATCH_SIZE, encoder, decoder
        )
        total_loss += batch_loss

    print(
        "Epoch {} Loss {:.4f} taken time  {:.2f} sec".format(
            epoch + 1, total_loss / steps_per_epoch, time.time() - start
        )
    )
```

```
Epoch 1 Loss 0.8116 taken time  29.86 sec
Epoch 2 Loss 0.6467 taken time  20.18 sec
Epoch 3 Loss 0.5686 taken time  20.33 sec
Epoch 4 Loss 0.5143 taken time  19.90 sec
Epoch 5 Loss 0.4593 taken time  19.97 sec
Epoch 6 Loss 0.4025 taken time  20.08 sec
Epoch 7 Loss 0.3499 taken time  20.31 sec
Epoch 8 Loss 0.2995 taken time  21.75 sec
Epoch 9 Loss 0.2519 taken time  20.70 sec
Epoch 10 Loss 0.2120 taken time  20.61 sec
Epoch 11 Loss 0.1732 taken time  21.00 sec
Epoch 12 Loss 0.1438 taken time  20.89 sec
Epoch 13 Loss 0.1228 taken time  20.99 sec
Epoch 14 Loss 0.1014 taken time  20.91 sec
Epoch 15 Loss 0.0870 taken time  21.12 sec
Epoch 16 Loss 0.0779 taken time  20.73 sec
Epoch 17 Loss 0.0680 taken time  21.22 sec
Epoch 18 Loss 0.0620 taken time  20.49 sec
Epoch 19 Loss 0.0560 taken time  20.49 sec
Epoch 20 Loss 0.0532 taken time  20.68 sec
Epoch 21 Loss 0.0522 taken time  21.97 sec
Epoch 22 Loss 0.0592 taken time  20.50 sec
Epoch 23 Loss 0.0653 taken time  20.55 sec
Epoch 24 Loss 0.0830
```

```python
def translate(sentence):
    result = evaluate_sentence(
        dataset_creator.preprocess_sentence(sentence),
        inp_lang,
        targ_lang,
        encoder,
        decoder,
        max_length_input,
        units,
    )
    print(result)
    result = targ_lang.sequences_to_texts(
        result
    )  # Transform vector numbers to words
    print("Input: %s" % (sentence))
    print("Translation: {}".format(result))
```

```python
translate(u'Tôi thích hoa.')
```

```
[[  5  41 860  39  22   4   3]]
Input: Tôi thích hoa.
Translation: ['i like bread with me . <end>']
```


```python
translate(u'Trời nắng.')
```

```
[[  17   16  919 1345    4    3]]
Input: Trời nắng.
Translation: ['it s likely snowing . <end>']
```


```python
translate(u'Anh yêu em.')
```

```
[[ 15 150  22   4   3]]
Input: Anh yêu em.
Translation: ['he love me . <end>']
```


```python
translate(u'Tiếp tục đi.')
```

```
[[176 123   4   3]]
Input: Tiếp tục đi.
Translation: ['keep last . <end>']
```
