# Neural Machine Translation

* We will build a Neural Machine Translation (NMT) model to translate human-readable dates ("25th of June, 2009") into machine-readable dates ("2009-06-25"), using an attention model. 

<a name='0'></a>
## Packages

In [None]:
import random

import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np

from tensorflow.keras.layers import Bidirectional, Concatenate, Permute, Dot, Input, LSTM, Multiply
from tensorflow.keras.layers import RepeatVector, Dense, Activation, Lambda
from tensorflow.keras.optimizers.legacy import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import load_model, Model
import tensorflow.keras.backend as K

from faker import Faker
from tqdm import tqdm
from babel.dates import format_date

from nmt_utils import *

%matplotlib inline

<a name='1'></a>
## 1 - Loading the dataset

<a name='1.1'></a>
### 1.1 - Overview

* The dataset contains 10,000 human readable dates written in a variety of possible formats (*e.g. "the 29th of August 1958", "03/30/1968", "24 JUNE 1987"*) and their equivalent, standardized, machine readable dates. 

In [None]:
m = 10000
dataset, human_vocab, machine_vocab, inv_machine_vocab = load_dataset(m)

- `dataset`: a list of tuples of (human readable date, machine readable date).

- `human_vocab`: a python dictionary mapping all characters used in the human readable dates to an integer-valued index.

- `machine_vocab`: a python dictionary mapping all characters used in machine readable dates to an integer-valued index. 

- `inv_machine_vocab`: the inverse dictionary of `machine_vocab`, mapping from indices back to characters. 

Let's preprocess the data and map the raw text data into the index values. 
- We will set Tx=30 
    - We assume Tx is the maximum length of the human readable date.
    - If we get a longer input, we would have to truncate it. <br><br>
    
- We will set Ty=10
    - "YYYY-MM-DD" is 10 characters long.
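
The index mapping, truncation, and padding described above can be sketched in plain Python. This is only an illustration: `encode_date`, `toy_vocab`, and the `<unk>`/`<pad>` entries are hypothetical names chosen here, and the real `preprocess_data` helper from `nmt_utils` may normalize the text differently.

```python
def encode_date(text, length, vocab):
    """Hypothetical helper: map characters to indices, then truncate or pad to `length`."""
    unk = vocab["<unk>"]
    pad = vocab["<pad>"]
    # Truncate inputs longer than `length`; unknown characters map to <unk>
    indices = [vocab.get(ch, unk) for ch in text[:length]]
    # Pad inputs shorter than `length` with the <pad> index
    indices += [pad] * (length - len(indices))
    return indices

# Toy vocabulary (an assumption for this sketch, not the notebook's human_vocab)
toy_vocab = {ch: i for i, ch in enumerate("0123456789/ ")}
toy_vocab["<unk>"] = len(toy_vocab)
toy_vocab["<pad>"] = len(toy_vocab)

padded = encode_date("03/30/1968", 12, toy_vocab)    # shorter than 12: padded
truncated = encode_date("03/30/1968", 5, toy_vocab)  # longer than 5: truncated
print(padded)
print(truncated)
```

Every encoded date has exactly `length` entries, which is what lets the dates be stacked into a single array of shape `(m, Tx)`.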

In [None]:
Tx = 30
Ty = 10
X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)

print("X.shape:", X.shape)
print("Y.shape:", Y.shape)
print("Xoh.shape:", Xoh.shape)
print("Yoh.shape:", Yoh.shape)

We now have:
- `X`:
    - Each character in X is replaced by an index (integer) mapped to the character using `human_vocab`. 
    - Each date is padded to ensure a length of $T_x$ using a special character (< pad >). 
    - `X.shape = (m, Tx)` where m is the number of training examples in the dataset (10,000 here). <br><br>
    
- `Y`:
    - Each character is replaced by the index (integer) mapped to the character using `machine_vocab`. 
    - `Y.shape = (m, Ty)`. <br><br>
    
- `Xoh`: one-hot version of `X`
    - Each index in `X` is converted to the one-hot representation.
    - `Xoh.shape = (m, Tx, len(human_vocab))` <br><br>  
    
- `Yoh`: one-hot version of `Y`
    - Each index in `Y` is converted to the one-hot representation. 
    - `Yoh.shape = (m, Ty, len(machine_vocab))`. 
    - `len(machine_vocab) = 11` since there are 10 numeric digits (0 to 9) and the `-` symbol.
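
As a rough illustration of the one-hot shapes listed above, the sketch below converts a small index array with NumPy. The `one_hot` helper and the toy index values are assumptions made for this example; the notebook itself obtains `Xoh` and `Yoh` from `preprocess_data`.

```python
import numpy as np

def one_hot(indices, num_classes):
    """Hypothetical helper: turn an integer array into one-hot vectors on a new last axis."""
    return np.eye(num_classes, dtype=np.float32)[indices]

# One toy target of length Ty = 10 with arbitrary indices in [0, 11)
Y_demo = np.array([[2, 0, 0, 9, 10, 0, 6, 10, 2, 5]])
Yoh_demo = one_hot(Y_demo, num_classes=11)

print(Y_demo.shape)    # (m, Ty)
print(Yoh_demo.shape)  # (m, Ty, number of classes)
```

Taking `argmax` over the last axis of the one-hot array recovers the original indices.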

* Let's also look at one preprocessed training example.

In [None]:
index = 0

print("Source date:", dataset[index][0])
print("Target date:", dataset[index][1])
print()

print("Source after preprocessing (indices):", X[index])
print("Target after preprocessing (indices):", Y[index])
print()

print("Source after preprocessing (one-hot):\n", Xoh[index])
print()

print("Target after preprocessing (one-hot):\n", Yoh[index])

<a name='2'></a>
## 2 - Building the model

<a name='2-1'></a>
### 2.1 - Model Overview

#### Model architecture:

<center><img src="images/attn_model.png" width="60%" height="60%"></center>
<caption><center><b>Figure 1:</b> Neural machine translation with attention</center></caption>

- We will use an attention mechanism between the encoder and decoder layers.

- There are two separate LSTMs in this model: pre-attention and post-attention LSTMs.

- *Pre-attention* Bi-LSTM is a Bi-directional LSTM and comes *before* the attention mechanism.

    - The pre-attention Bi-LSTM goes through $T_x$ time steps. <br><br>

- *Post-attention* LSTM comes *after* the attention mechanism. 

    - The post-attention LSTM goes through $T_y$ time steps. <br><br>

- In this model, the post-attention LSTM at time $t$ does not take the previous time step's prediction $y^{\langle t-1 \rangle}$ as input.

    - There isn't as strong a dependency between the previous character and the next character in a YYYY-MM-DD date.


#### Pre-attention Bi-LSTM outputs

- $\overrightarrow{a}^{\langle t \rangle}$: hidden state of the forward-direction, pre-attention LSTM.
    
    
- $\overleftarrow{a}^{\langle t \rangle}$: hidden state of the backward-direction, pre-attention LSTM.
    
    
- $a^{\langle t \rangle} = [\overrightarrow{a}^{\langle t \rangle}, \overleftarrow{a}^{\langle t \rangle}]$: the concatenation of the forward-direction activations $\overrightarrow{a}^{\langle t \rangle}$ and the backward-direction activations $\overleftarrow{a}^{\langle t \rangle}$ of the pre-attention Bi-LSTM.

#### Computing the "energies" $e^{\langle t, t' \rangle}$
- "e" is called the "energies" variable.
- $s^{\langle t-1 \rangle}$ is the hidden state of the post-attention LSTM
- $a^{\langle t' \rangle}$ is the hidden state of the pre-attention Bi-LSTM.
- $s^{\langle t-1 \rangle}$ and $a^{\langle t' \rangle}$ are fed into a simple neural network, which learns the function to output $e^{\langle t, t' \rangle}$.
    - `RepeatVector` node is used to copy $s^{\langle t-1 \rangle}$'s value $T_x$ times;
    - Then `Concatenate` is used to concatenate $s^{\langle t-1 \rangle}$ and $a^{\langle t' \rangle}$;
    - The concatenation of $s^{\langle t-1 \rangle}$ and $a^{\langle t' \rangle}$ is fed into a "Dense" layer, which computes $e^{\langle t, t' \rangle}$. <br><br>
    
- $e^{\langle t, t' \rangle}$ is passed through a softmax to compute the attention $\alpha^{\langle t, t' \rangle}$ that $y^{\langle t \rangle}$ should pay to $a^{\langle t' \rangle}$.
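
The steps above can be traced in NumPy for a single decoder time step. This is a sketch under assumptions, not the Keras implementation: the weights `W1`, `b1`, `W2`, `b2` are random stand-ins for learned parameters, the toy sizes are arbitrary, and the two-layer network (10 `tanh` units followed by 1 `relu` unit) mirrors the shared `Dense` layers defined in section 2.2. The softmax runs over the $T_x$ axis, and the context vector is the attention-weighted sum of the Bi-LSTM states.

```python
import numpy as np

rng = np.random.default_rng(0)
m, Tx, n_a, n_s = 2, 5, 4, 3                       # toy sizes, chosen only for illustration

a = rng.normal(size=(m, Tx, 2 * n_a))              # Bi-LSTM hidden states
s_prev = rng.normal(size=(m, n_s))                 # previous post-attention hidden state

# Random stand-ins for the learned weights of the small energy network
W1, b1 = rng.normal(size=(2 * n_a + n_s, 10)), np.zeros(10)
W2, b2 = rng.normal(size=(10, 1)), np.zeros(1)

s_rep = np.repeat(s_prev[:, None, :], Tx, axis=1)  # RepeatVector: (m, Tx, n_s)
concat = np.concatenate([a, s_rep], axis=-1)       # Concatenate: (m, Tx, 2*n_a + n_s)
hidden = np.tanh(concat @ W1 + b1)                 # first Dense (tanh): (m, Tx, 10)
energies = np.maximum(hidden @ W2 + b2, 0.0)       # second Dense (relu): (m, Tx, 1)

# Softmax over the Tx axis (axis=1), shifted by the max for numerical stability
exp_e = np.exp(energies - energies.max(axis=1, keepdims=True))
alphas = exp_e / exp_e.sum(axis=1, keepdims=True)  # attention weights: (m, Tx, 1)

# Context vector: weighted sum of the Bi-LSTM states over t'
context = (alphas * a).sum(axis=1, keepdims=True)  # (m, 1, 2*n_a)

print(alphas.shape, context.shape)
```

For each example, the attention weights sum to 1 across the $T_x$ input positions, so the context is a convex combination of the encoder states.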

<a name='2-2'></a>
### 2.2 - Building the Attention Mechanism

In [None]:
# Defining shared layers as global variables
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation = "tanh")
densor2 = Dense(1, activation = "relu")
activator = Activation(softmax, name='attention_weights') 
dotor = Dot(axes = 1)

In [None]:
def one_step_attention(a, s_prev):
    """
    Performs one step of attention: Outputs a context vector computed as a dot product of the attention weights
    "alphas" and the hidden states "a" of the Bi-LSTM.
    
    Arguments:
    a -- hidden state output of the Bi-LSTM, tensor of shape (m, Tx, 2*n_a)
    s_prev -- previous hidden state of the (post-attention) LSTM, tensor of shape (m, n_s)
    
    Returns:
    context -- context vector, input of the next (post-attention) LSTM cell
    """
    
    # Using repeator to repeat s_prev to be of shape (m, Tx, n_s)
    s_prev = repeator(s_prev)
    
    # Using concatenator to concatenate a and s_prev on the last axis
    concat = concatenator([a, s_prev])
    
    # Using densor1 and densor2 to compute the "energies"
    e = densor1(concat)
    energies = densor2(e)
    
    # Using "activator" to compute the attention weights
    alphas = activator(energies)
    
    # Using dotor to compute the context vector
    context = dotor([alphas, a])
    
    return context

<a name='2-3'></a>
### 2.3 - Full Implementation

In [None]:
n_a = 32 # number of units for the pre-attention, bi-directional LSTM's hidden state 'a'
n_s = 64 # number of units for the post-attention LSTM's hidden state "s"

post_activation_LSTM_cell = LSTM(n_s, return_state=True)
output_layer = Dense(len(machine_vocab), activation=softmax)

In [None]:
def modelf(Tx, Ty, n_a, n_s, human_vocab_size, machine_vocab_size):
    """
    Arguments:
    Tx -- length of the input sequence
    Ty -- length of the output sequence
    n_a -- hidden state size of the Bi-LSTM
    n_s -- hidden state size of the post-attention LSTM
    human_vocab_size -- size of the python dictionary "human_vocab"
    machine_vocab_size -- size of the python dictionary "machine_vocab"

    Returns:
    model -- Keras model instance
    """
    
    # Defining the input layer
    X = Input(shape=(Tx, human_vocab_size))
    
    # Defining initial hidden state s0 and initial cell state c0 for the post-attention LSTM
    s0 = Input(shape=(n_s,), name='s0')
    c0 = Input(shape=(n_s,), name='c0')
    
    s = s0
    c = c0
    
    outputs = []
    
    # Defining pre-attention Bi-LSTM
    a = Bidirectional(LSTM(units=n_a, return_sequences=True))(X)
    
    # Loop over Ty
    for t in range(Ty):
        context = one_step_attention(a,s)
        _, s, c = post_activation_LSTM_cell(inputs=context, initial_state=[s,c])
        out = output_layer(s)
  
        outputs.append(out)
    
    model = Model(inputs=[X,s0,c0], outputs=outputs)
    
    return model

In [None]:
model = modelf(Tx, Ty, n_a, n_s, len(human_vocab), len(machine_vocab))

In [None]:
model.summary()

<a name='2-4'></a>
### 2.4 - Training the Model

In [None]:
opt = Adam(learning_rate=.005, beta_1=.9, beta_2=.999, decay=.01)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

In [None]:
s0 = np.zeros((m, n_s))
c0 = np.zeros((m, n_s))
outputs = list(Yoh.swapaxes(0,1))

In [None]:
history = model.fit([Xoh, s0, c0], outputs, epochs=100, batch_size=100)

While training you can see the loss as well as the accuracy on each of the 10 positions of the output. 
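
Because the model has one output per position, Keras expects the targets as a list of $T_y$ arrays, each of shape `(m, len(machine_vocab))`, which is why `Yoh` is passed through `swapaxes(0, 1)` before training. A small sketch with toy sizes (the names ending in `_demo` are made up for this example):

```python
import numpy as np

m_demo, Ty_demo, vocab_demo = 4, 10, 11

# Toy stand-in for Yoh: (examples, output positions, vocabulary size)
Yoh_batch_demo = np.zeros((m_demo, Ty_demo, vocab_demo))

# Swap the example and position axes, then split along the first axis
targets_demo = list(Yoh_batch_demo.swapaxes(0, 1))

print(len(targets_demo))      # one array per output position
print(targets_demo[0].shape)  # (examples, vocabulary size)
```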

In [None]:
history.history.keys()

In [None]:
plt.figure(figsize=(5,4))
plt.plot(history.history['loss'], c='k')
plt.xticks(fontsize=11)
plt.yticks(fontsize=11)
plt.xlabel('Epochs', fontsize=13)
plt.ylabel('Loss', fontsize=13)
plt.title('')
plt.show()

<a name='3'></a>
## 3 - Results

<a name='3.1'></a>
### 3.1 - Testing the Model

In [None]:
examples = ['3 May 1979', '5 April 09', '21th of August 2016', 'Tue 10 Jul 2007', 'Saturday May 9 2018', 'March 3 2001', 'March 3rd 2001', '1 March 2001']
s_0 = np.zeros((1, n_s))
c_0 = np.zeros((1, n_s))

for example in examples:
    source = string_to_int(example, Tx, human_vocab)

    source = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), source)))
    source = np.expand_dims(source, axis=0)
    
    prediction = model.predict([source, s_0, c_0])
    prediction = np.argmax(prediction, axis = -1).ravel()  # flatten (Ty, 1) to (Ty,) so each entry is a scalar index
    output = [inv_machine_vocab[int(i)] for i in prediction]
    
    print("source:", example)
    print("output:", ''.join(output), "\n")

<a name='3.2'></a>
### 3.2 - Visualizing Attention

One advantage of the attention model is that each part of the output (such as the month) learns to depend only on a small part of the input (the characters in the input giving the month). We can visualize which part of the input each part of the output is looking at.

Let's now visualize the attention values of the network. We'll propagate an example through the network, then visualize the values of $\alpha^{\langle t, t' \rangle}$.

In [None]:
model.layers

The function `plot_attention_map()` pulls out the attention values from the model and plots them.

In [None]:
def plot_attention_map(modelx, input_vocabulary, inv_output_vocabulary, text, Tx, Ty, n_s, num = 7):
    """
    Plot the attention map
    """
   
    # Recreating part of the model 
    X = modelx.inputs[0] 
    s0 = modelx.inputs[1] 
    c0 = modelx.inputs[2] 
    s = s0
    c = c0
    
    a = modelx.layers[2](X)  # pre-attention bi-LSTM
    outputs = []

    for t in range(Ty):
        s_prev = s
        s_prev = modelx.layers[3](s_prev)    # repeat vector
        concat = modelx.layers[4]([a, s_prev])    # concatenation
        e = modelx.layers[5](concat)     # dense 1
        energies = modelx.layers[6](e)    # dense 2
        alphas = modelx.layers[7](energies) # softmax
        context = modelx.layers[8]([alphas, a]) 
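        _, s, c = modelx.layers[10](context, initial_state = [s, c]) # post-attention LSTM returns (output, hidden state, cell state)
        outputs.append(alphas)  # collect the attention weights, which are what the map displays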

    f = Model(inputs = [X, s0, c0], outputs = outputs)  
    
    # Converting 'text' to its one-hot representation
    encoded = np.array(string_to_int(text, Tx, input_vocabulary)).reshape((1, Tx))
    encoded = np.array(list(map(lambda x: to_categorical(x, num_classes=len(input_vocabulary)), encoded)))

    # Building the attention map
    s0 = np.zeros((1, n_s))
    c0 = np.zeros((1, n_s))
    r = f([encoded, s0, c0])
 
    attention_map = np.zeros((Ty, Tx))
    for t in range(Ty):
        for t_prime in range(Tx):
            attention_map[t][t_prime] = r[t][0, t_prime, 0]  # index the trailing size-1 axis to get a scalar

    # Normalizing attention map
    row_max = attention_map.max(axis=1)
    attention_map = attention_map / row_max[:, None]

    prediction = modelx.predict([encoded, s0, c0])
    predicted_text = []
    for i in range(len(prediction)):
        # Selecting the index with max probability for each element sequence
        predicted_text.append(int(np.argmax(prediction[i], axis=1)[0]))
    
    predicted_text = int_to_string(predicted_text, inv_output_vocabulary)
    
    # getting the lengths of the input and output strings
    input_length = len(text)
    output_length = Ty
    
    # plotting the attention_map
    fig, ax = plt.subplots()
    
    # adding image
    img = ax.imshow(attention_map[:,:len(text)], cmap='Blues')

    # adding labels
    ax.set_yticks(range(output_length))
    ax.set_yticklabels(predicted_text)

    ax.set_xticks(range(input_length))
    ax.set_xticklabels(list(text))

    ax.set_xlabel('Input Sequence')
    ax.set_ylabel('Output Sequence')
    
    # adding colorbar
    x0 = ax.get_position().x1 + .02
    y0 = ax.get_position().y0
    w = .02
    h = ax.get_position().height
    
    cax = fig.add_axes([x0, y0, w, h])
    cbar =  fig.colorbar(img, cax=cax)
    cbar.ax.set_ylabel('Alpha Values', rotation=-90, va="bottom")

    ax.grid()
    plt.show()
    
    return attention_map

In [None]:
attention_map = plot_attention_map(model, human_vocab, inv_machine_vocab, "Tuesday 09 Oct 1993", Tx, Ty, n_s = 64)