#### Text Translation 

Seq2seq encoder-decoder using Keras

Jay Urbain, PhD

#### Language translation task

Translate from a language *X* to *English*.

Below are 5 different results from different translation models. 

Even if we don't know the original sentence, its meaning becomes clear when the following examples are considered as a group.

Translation A: I ask him whether he will once again make a stand-up comedy tour.

Translation B: I ask him if he will again make a stand-up comedy tour.

Translation C: I wonder him if he will ever make a booth up comedy tour.

Translation D: I ask him if he will ever make a stand-up comedy tour ever.

Translation E: I ask him whether he will again make a stand-up comedy tour.

It should be relatively easy to spot the worst translation as it doesn't quite make sense in English when translated literally. That shows the difficulty of translating in general. Context has a significant impact on language translation. 

In [1]:
!python OpenSeq2Seq/run.py --config_file=OpenSeq2Seq/example_configs/nmt.json --logdir=./noatt --mode=infer --inference_out=baseline.txt


Let it run; you can open [baseline.txt](baseline.txt) to watch the translation as it is being written. Once you see "I wonder him if he will ever make a booth up com@@ ed@@ y@@ tour ." in the output, press the Stop button or choose Interrupt Kernel. It should take about 3.5 minutes to reach this line. The last part of the lab explains why the translations contain fragments like "com@@": they come from byte pair encoding (BPE). You can simply remove those markers to get the sentences used above.
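
Removing the markers takes one regular expression. The sketch below assumes the common subword convention in which `@@` followed by a space marks a piece that continues into the next token. The helper name `strip_bpe` and the sample string are illustrative and not part of OpenSeq2Seq.

```python
import re

def strip_bpe(line):
    """Join BPE sub-word pieces by deleting the '@@ ' continuation markers."""
    # Matches '@@ ' inside the line, or a dangling '@@' at the very end.
    return re.sub(r"@@( |$)", "", line)

# Hypothetical model output, not copied from baseline.txt.
sample = "a stand-up com@@ ed@@ y tour ."
print(strip_bpe(sample))
```

Applying it line by line to an inference output file gives text that can be compared directly with the five translations above.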

So you can see that this model produced the worst translation. Identifying the best translation, on the other hand, may differ from person to person, since some subjectivity is involved. Take Translation D, for example: the double use of 'ever' in one sentence probably lowers its score as a good English translation. As it turns out, all the best results you have identified use **attention**.

## Attention Explained
Paying attention helps when we want to understand something. In neural networks, it likewise helps to identify the most important parts of the input to focus on. You can find resources on [attention](http://ruder.io/deep-learning-nlp-best-practices/index.html#attention) and current best practices for NLP in general. Attention can also be visualized, as in the following image:

<p align="center">
  <img src="https://github.com/philipperemy/keras-attention-mechanism/blob/master/assets/attention_1.png?raw=true" width="400">
</p>


We can see the big spike is where the attention of the model will be directed. Our overall goal is to use attention in Machine Translation models but let's try to implement attention in a simpler problem. 
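
A spike like this is what a softmax produces when one score dominates the others. As a minimal sketch in plain Python (the scores are made-up numbers, not the output of any model in this lab):

```python
import math

def softmax(scores):
    """Turn arbitrary real-valued scores into weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical relevance scores for five time steps; step 2 stands out.
scores = [0.1, 0.3, 4.0, 0.2, 0.1]
weights = softmax(scores)
print([round(w, 3) for w in weights])
```

The weights always sum to 1, so raising the weight of one time step necessarily lowers the others. This is why the attention layer below uses a softmax activation.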

In [None]:
# Explicit imports instead of wildcards, so it is clear where each name comes from.
from keras.layers import (Input, Dense, Permute, Reshape, Lambda, RepeatVector, Flatten,
                          Multiply, LSTM, Embedding, Bidirectional, TimeDistributed)
from keras.models import Model
from keras.utils import to_categorical

import keras.backend as K
import numpy as np

Function to get activations of attention layer:

In [None]:
def get_activations(model, inputs, layer_name=None):
    """Return the outputs of all layers, or only of the layer named layer_name."""
    activations = []
    inp = model.input
    if layer_name is None:
        outputs = [layer.output for layer in model.layers]  # all layer outputs
    else:
        outputs = [layer.output for layer in model.layers if layer.name == layer_name]
    # one evaluation function per selected layer output
    funcs = [K.function([inp] + [K.learning_phase()], [out]) for out in outputs]
    # the trailing 1. is the learning-phase flag (1 = training mode)
    layer_outputs = [func([inputs, 1.])[0] for func in funcs]
    for layer_activations in layer_outputs:
        activations.append(layer_activations)
    return activations

Generation of simple random dataset for attention

In [None]:
def get_data_recurrent(n, time_steps, input_dim, attention_column=None):
    """
    Data generation. x is purely random except that its values at time step
    attention_column equal the target y.
    In practice, the network should learn that the target = x[:, attention_column].
    Therefore, most of its attention should be focused on the time step addressed by attention_column.
    :param n: the number of samples to retrieve.
    :param time_steps: the number of time steps of your series.
    :param input_dim: the number of dimensions of each element in the series.
    :param attention_column: the time step linked to the target. Everything else is purely random.
        Defaults to the middle of the series.
    :return: x: model inputs, y: model targets
    """
    if attention_column is None:
        # attention_column indexes the time axis, so it is derived from time_steps
        # (not input_dim). A fixed default keeps training and test data consistent.
        attention_column = time_steps // 2
    x = np.random.standard_normal(size=(n, time_steps, input_dim))
    y = np.random.randint(low=0, high=2, size=(n, 1))
    x[:, attention_column, :] = np.tile(y[:], (1, input_dim))
    return x, y

In [None]:
INPUT_DIM = 2
TIME_STEPS = 20
# if True, the attention vector is shared across the input_dimensions where the attention is applied.
SINGLE_ATTENTION_VECTOR = False
APPLY_ATTENTION_BEFORE_LSTM = True

The Attention itself:

In [None]:
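def attention_3d_block(inputs):
    # inputs.shape = (batch_size, time_steps, input_dim)
    input_dim = int(inputs.shape[2])
    a = Permute((2, 1))(inputs)  # -> (batch_size, input_dim, time_steps)
    a = Reshape((input_dim, TIME_STEPS))(a)  # no-op; only documents the shape
    a = Dense(TIME_STEPS, activation='softmax')(a)  # weights over time steps sum to 1
    if SINGLE_ATTENTION_VECTOR:
        # average over input dimensions, then repeat so all dimensions share one vector
        a = Lambda(lambda x: K.mean(x, axis=1), name='dim_reduction')(a)
        a = RepeatVector(input_dim)(a)
    a_probs = Permute((2, 1), name='attention_vec')(a)  # -> (batch_size, time_steps, input_dim)
    output_attention_mul = Multiply(name='attention_mul')([inputs, a_probs])  # re-weight the inputs
    return output_attention_mul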
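
To make the shapes concrete, the same computation can be traced for a single sample in NumPy. This is only a sketch: the matrix `W` and bias `b` stand in for the Dense layer's trainable parameters and are random here, and the `SINGLE_ATTENTION_VECTOR` branch is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
time_steps, input_dim = 20, 2

# One sample, laid out like a single row of the Keras input batch.
x = rng.standard_normal((time_steps, input_dim))

# Stand-ins for the Dense layer's kernel and bias (random, untrained).
W = rng.standard_normal((time_steps, time_steps))
b = np.zeros(time_steps)

a = x.T                                        # Permute: (input_dim, time_steps)
logits = a @ W + b                             # Dense acts on the last (time) axis
logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
a = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax over time
a_probs = a.T                                  # Permute back: (time_steps, input_dim)
out = x * a_probs                              # Multiply: element-wise re-weighting

print(a_probs.shape, a_probs.sum(axis=0))
```

Each input dimension gets its own distribution over the time steps, which is why every column of `a_probs` sums to 1.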

Attention can be placed in two positions relative to the LSTM. This function applies it after the LSTM.

In [None]:
def model_attention_applied_after_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    lstm_units = 32
    lstm_out = LSTM(lstm_units, return_sequences=True)(inputs)
    attention_mul = attention_3d_block(lstm_out)
    attention_mul = Flatten()(attention_mul)
    output = Dense(1, activation='sigmoid')(attention_mul)
    model = Model(inputs=[inputs], outputs=output)
    return model

This function will apply attention before.

In [None]:
def model_attention_applied_before_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    attention_mul = attention_3d_block(inputs)
    lstm_units = 32
    attention_mul = LSTM(lstm_units, return_sequences=False)(attention_mul)
    output = Dense(1, activation='sigmoid')(attention_mul)
    model = Model(inputs=[inputs], outputs=output)
    return model

Here we can select where to apply the attention depending on whether APPLY_ATTENTION_BEFORE_LSTM is true or false. The following cell will generate data and compile the model:

In [None]:
N = 300000
inputs_1, outputs = get_data_recurrent(N, TIME_STEPS, INPUT_DIM)

if APPLY_ATTENTION_BEFORE_LSTM:
    m = model_attention_applied_before_lstm()
else:
    m = model_attention_applied_after_lstm()

m.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
print(m.summary())

Fit model and get the attention visualization:

In [None]:
m.fit([inputs_1], outputs, epochs=1, batch_size=512, validation_split=0.1)

attention_vectors = []
for i in range(300):
    testing_inputs_1, testing_outputs = get_data_recurrent(1, TIME_STEPS, INPUT_DIM)
    attention_vector = np.mean(get_activations(m,
                                               testing_inputs_1,
                                               layer_name='attention_vec')[0], axis=2).squeeze()
    # print('attention =', attention_vector)
    assert abs(np.sum(attention_vector) - 1.0) < 1e-5
    attention_vectors.append(attention_vector)

attention_vector_final = np.mean(np.array(attention_vectors), axis=0)
# plot part.
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

pd.DataFrame(attention_vector_final, columns=['attention (%)']).plot(kind='bar',
                                                                     title='Attention Mechanism as '
                                                                           'a function of input'
                                                                           ' dimensions.')
plt.show()

### Exercise 1

Change the code so that attention is applied after the LSTM, and rerun the cells as required.

## Next Task - Translation

Now that you understand how attention works and how it affects the model, let's return to our original task of translation. Machine translation is a well-known and typical application of Natural Language Processing. Since the 1950s, scientists have tried to build models that automatically translate from, say, French to English. Machines can now do this translation automatically, and the attention mechanism has greatly increased translation quality. Here is an example attention map for the neural machine translation of a sample phrase:
<p align="center">
  <img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/12/Screen-Shot-2015-12-30-at-1.23.48-PM.png" width="400">
</p>

In our lab we will concentrate on a much simpler task: translating dates from human-readable to machine-readable form, e.g. Oct 25th, 2017 to 2017-10-25. The inspiration comes from this [post](https://medium.com/datalogue/attention-in-keras-1892773a4f22). You can also follow it for more detailed background on this problem.

To do this we need to understand one more concept: sequence-to-sequence language modeling.
The idea of this architecture is shown here:
<p align="center">
<img src="https://talbaumel.github.io/attention/img/birnn.jpg" width="400">
</p>

There is an Embedding layer at the bottom, a bidirectional RNN in the middle, and a softmax as the output.

In [None]:
ENCODER_UNITS = 32
DECODER_UNITS = 32

Here we use a more complex idea than simple seq2seq: we add two explicit parts to our network, an encoder and a decoder (to which attention is applied). The idea is illustrated below:
<p align="center"><img src="https://i.stack.imgur.com/Zwsmz.png"></p>

The lower part of the network is encoding the input to some hidden intermediate representation and the upper part is decoding the hidden representation into an actual readable output.

Finally, let's create a machine translation model:

In [None]:
def model_simple_nmt(in_chars, out_chars):
    inputs = Input(shape=(TIME_STEPS,))
    
    input_embed = Embedding(in_chars, ENCODER_UNITS * 2, input_length=TIME_STEPS, trainable=True,
                            name='embedding')(inputs)
    
    enc_out = Bidirectional(LSTM(ENCODER_UNITS, return_sequences=True))(input_embed)
    dec_out = LSTM(DECODER_UNITS, return_sequences=True)(enc_out)
    attention_mul = attention_3d_block(dec_out)
    
    output = TimeDistributed(Dense(out_chars, activation='softmax'))(attention_mul)
   
    model = Model(inputs=[inputs], outputs=output)
    return model

Now we need to generate data. Our data will be dates in different text formats and in fixed output format.
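
The lab builds these pairs with Faker and Babel. As a rough illustration of what one training example looks like, the sketch below uses only the standard library `datetime` module as a stand-in. The particular format strings are arbitrary choices, not the ones the lab uses.

```python
from datetime import date

d = date(2017, 10, 25)

# Several human-readable spellings of the same day (model inputs).
# Month names produced by strftime depend on the active locale.
human_variants = [
    d.strftime("%d %b %Y"),
    d.strftime("%B %d, %Y"),
    d.strftime("%d.%m.%y"),
]

# The single machine-readable target shared by all of them (ISO 8601).
target = d.isoformat()

for h in human_variants:
    print(repr(h), "->", target)
```

Many differently formatted inputs map to one fixed-format output, which is what makes the task a small translation problem.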

In [None]:
from faker import Faker
import random
from tqdm import tqdm
from babel.dates import format_date
import numpy as np

In [None]:
fake = Faker()
fake.seed(12345)  # recent Faker releases reject instance seeding; use Faker.seed(12345) there
random.seed(12345)

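# Lowercase 'y' is the calendar year. Uppercase 'Y' would give the ISO week-numbering
# year, which differs from the calendar year for some dates around New Year.
FORMATS = ['short',
           'medium',
           'long',
           'full',
           'd MMM yyyy',
           'd MMMM yyyy',
           'dd MMM yyyy',
           'd MMM, yyyy',
           'd MMMM, yyyy',
           'dd, MMM yyyy',
           'd MM yy',
           'd MMMM yyyy',
           'MMMM d yyyy',
           'MMMM d, yyyy',
           'dd.MM.yy']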

# change this if you want it to work with another language
LOCALES = ['en_US']

In [None]:
def create_date():
    """
        Creates some fake dates 
        :returns: tuple containing human readable string, machine readable string, and date object
    """
    dt = fake.date_object()

    try:
        human_readable = format_date(dt, format=random.choice(FORMATS), locale=random.choice(LOCALES))

        case_change = random.choice([0,1,2])
        if case_change == 1:
            human_readable = human_readable.upper()
        elif case_change == 2:
            human_readable = human_readable.lower()
        # if case_change == 0, do nothing

        machine_readable = dt.isoformat()
    except AttributeError as e:
        return None, None, None

    return human_readable, machine_readable, dt

In [None]:
def create_dataset(n_examples):
    """
        Creates a dataset with n_examples and vocabularies
        :n_examples: the number of examples to generate
    """
    human_vocab = set()
    machine_vocab = set()
    dataset = []

    for i in tqdm(range(n_examples)):
        h, m, _ = create_date()
        if h is not None:
            dataset.append((h, m))
            human_vocab.update(tuple(h))
            machine_vocab.update(tuple(m))

    human = dict(zip(list(human_vocab) + ['<unk>', '<pad>'], 
                     list(range(len(human_vocab) + 2))))
    inv_machine = dict(enumerate(list(machine_vocab) + ['<unk>', '<pad>']))
    machine = {v:k for k,v in inv_machine.items()}
 
    return dataset, human, machine, inv_machine

In [None]:
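def string_to_int(string, length, vocab):
    """Encode string as a list of vocabulary indices, truncated or padded to length."""
    if len(string) > length:
        string = string[:length]

    # characters missing from the vocabulary map to the index of '<unk>'
    rep = [vocab.get(ch, vocab['<unk>']) for ch in string]

    if len(string) < length:
        rep += [vocab['<pad>']] * (length - len(string))

    return rep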

In [None]:
def int_to_string(ints, inv_vocab):
    return [inv_vocab[i] for i in ints]
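
A quick round trip shows what these two helpers do. The sketch below is self-contained: `encode` and `decode` are hypothetical stand-ins that mirror `string_to_int` and `int_to_string`, and the tiny vocabulary is made up for the example.

```python
def encode(text, length, vocab):
    """Map characters to ids, truncating or padding to a fixed length."""
    ids = [vocab.get(ch, vocab['<unk>']) for ch in text[:length]]
    return ids + [vocab['<pad>']] * (length - len(ids))

def decode(ids, inv_vocab):
    """Map ids back to their symbols."""
    return [inv_vocab[i] for i in ids]

# Tiny hypothetical vocabulary: digits, '-', plus the two special symbols.
symbols = list('0123456789-') + ['<unk>', '<pad>']
vocab = {s: i for i, s in enumerate(symbols)}
inv_vocab = {i: s for s, i in vocab.items()}

ids = encode('2017-10-25', 12, vocab)
print(ids)
print(decode(ids, inv_vocab))
```

Every sequence comes out with the same length, so a batch of them can be stacked into one array. Characters outside the vocabulary all share the `<unk>` id.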

Actually generating data:

In [None]:
N = 300000
dataset, human_vocab, machine_vocab, inv_machine_vocab = create_dataset(N)

Compiling and training model:

In [None]:
m = model_simple_nmt(len(human_vocab), len(machine_vocab))

m.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
print(m.summary())

In [None]:
inputs, targets = zip(*dataset)
inputs = np.array([string_to_int(i, TIME_STEPS, human_vocab) for i in inputs])
targets = [string_to_int(t, TIME_STEPS, machine_vocab) for t in targets]
targets = np.array(list(map(lambda x: to_categorical(x, num_classes=len(machine_vocab)), targets)))
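
The last line turns every target id into a one-hot row, because the `categorical_crossentropy` loss compares the softmax output with a distribution over the vocabulary at each time step. A plain-Python sketch of that encoding (the helper `one_hot` is illustrative, not a Keras function):

```python
def one_hot(ids, num_classes):
    """Return one row per id with a single 1.0 at that id's position."""
    return [[1.0 if j == i else 0.0 for j in range(num_classes)] for i in ids]

# Three hypothetical character ids from a four-symbol vocabulary.
rows = one_hot([2, 0, 3], num_classes=4)
for row in rows:
    print(row)
```

With the lab's settings, each target therefore becomes an array with `TIME_STEPS` rows and `len(machine_vocab)` columns.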

In [None]:
m.fit([inputs], targets, epochs=1, batch_size=64, validation_split=0.1)

Let's check our model:

In [None]:
EXAMPLES = ['3 May 1979', '5 Apr 09', '20th February 2016', 'Wed 10 Jul 2007']

def run_example(model, input_vocabulary, inv_output_vocabulary, text):
    encoded = string_to_int(text, TIME_STEPS, input_vocabulary)
    prediction = model.predict(np.array([encoded]))
    prediction = np.argmax(prediction[0], axis=-1)
    return int_to_string(prediction, inv_output_vocabulary)

def run_examples(model, input_vocabulary, inv_output_vocabulary, examples=EXAMPLES):
    predicted = []
    for example in examples:
        predicted.append(''.join(run_example(model, input_vocabulary, inv_output_vocabulary, example)))
        print('input:', example)
        print('output:', predicted[-1])
    return predicted

In [None]:
run_examples(m, human_vocab, inv_machine_vocab)

And visualize the actual attention map on some example:

In [None]:
def attention_map(model, input_vocabulary, inv_output_vocabulary, text):
    """
        visualization of attention map
    """
    # encode the string
    encoded = string_to_int(text, TIME_STEPS, input_vocabulary)

    # get the output sequence
    prediction = model.predict(np.array([encoded]))
    predicted_text = np.argmax(prediction[0], axis=-1)
    predicted_text = int_to_string(predicted_text, inv_output_vocabulary)

    text_ = list(text)
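    # get the lengths of the input and of the prediction up to the first '<pad>'
    input_length = len(text)
    # fall back to the full length if the model never predicts '<pad>'
    output_length = (predicted_text.index('<pad>')
                     if '<pad>' in predicted_text else len(predicted_text))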
    # get the activation map
    attention_vector = get_activations(model, [encoded], layer_name='attention_vec')[0].squeeze()
    activation_map = attention_vector[0:output_length, 0:input_length]
    
    plt.clf()
    f = plt.figure(figsize=(8, 8.5))
    ax = f.add_subplot(1, 1, 1)

    # add image
    i = ax.imshow(activation_map, interpolation='nearest', cmap='gray')

    # add colorbar
    cbaxes = f.add_axes([0.2, 0, 0.6, 0.03])
    cbar = f.colorbar(i, cax=cbaxes, orientation='horizontal')
    cbar.ax.set_xlabel('Probability', labelpad=2)

    # add labels
    ax.set_yticks(range(output_length))
    ax.set_yticklabels(predicted_text[:output_length])

    ax.set_xticks(range(input_length))
    ax.set_xticklabels(text_[:input_length], rotation=45)

    ax.set_xlabel('Input Sequence')
    ax.set_ylabel('Output Sequence')

    # add grid and legend
    ax.grid()

    f.show()

In [None]:
attention_map(m, human_vocab, inv_machine_vocab, EXAMPLES[0])

As you can probably see, the default model for this lab is not that good, but you can try to improve it yourself. You could get better results, like this:

<p align="center"><img src="https://user-images.githubusercontent.com/6295292/26899949-bbac0c7c-4b9e-11e7-84d6-c2f31166af07.png" width="800"></p>

### Exercise 2

Add layers and modify the code to improve the results of the date translation.

## Real Case

### Before doing this part run Kernel->Restart so the GPU memory is completely free.

After the toy examples we finally see what attention is good for. Next, we will try an actual Neural Machine Translation model, which is shipped with this lab. This NMT model was trained on a German-English corpus, so it will translate from German to English.

Before we start we need to discuss one more thing, which is really important in machine translation (and also other NLP tasks): BPE representation.

### BPE
BPE stands for byte pair encoding. It means that common byte pairs (bigrams of characters in our case) are replaced by a byte that never occurs in the corpus. Say our corpus never contains the "#" character; we could then use it to represent a typical bigram like "ie". In practice, however, all printable characters are in use, so BPE uses the unprintable part of the code page. To actually print the text, we need to convert it back, so you'll see "@@ " in the text: these are artifacts of that conversion.
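
A single merge step can be sketched in a few lines. This is a toy character-level illustration of the idea described above, with a made-up German phrase and "#" as the unused symbol. Real BPE tooling repeats the step many times and manages the new symbols itself.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent symbol pair that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair, new_symbol):
    """Replace every non-overlapping occurrence of `pair` with `new_symbol`."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_symbol)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical corpus in which the bigram "ie" is the most frequent pair.
text = list("die sieben lieder")
pair = most_frequent_pair(text)
print(pair, ''.join(merge_pair(text, pair, "#")))
```

Repeating this step with a fresh symbol each time shortens the text further and yields a vocabulary of frequent character sequences.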

Here is some example text in German, which our model will translate into English.

In [None]:
! head wmt/newstest2015.tok.bpe.32000.de

The actual architecture used is:
![](../2017-09-14_23-11-48.png)
It is slightly more complex than in our previous task with dates. Here we again have an encoder-decoder architecture, but attention is now computed over the whole input rather than a part of it. We also use the so-called *context vector*, a representation of the whole sentence that helps the model "get the idea" of a phrase before translating it.
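
One common way to build such a context vector is to weight every encoder state by its attention weight and sum the results. The sketch below uses dot-product scoring purely for brevity. That choice, the function name `attention_context`, and the numbers are assumptions, and the source does not say how OpenSeq2Seq computes its scores.

```python
import math

def attention_context(query, encoder_states):
    """Score each encoder state against the query, softmax the scores,
    and return (weights, weighted sum of the states)."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    shift = max(scores)  # numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    context = [sum(w * state[k] for w, state in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

# Hypothetical 2-d encoder states for a three-token sentence.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_context([2.0, 0.0], states)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

Because the weights cover every input position, the resulting vector summarizes the whole sentence while emphasizing the states most relevant to the current decoding step.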


Let's finally see what our model gives us:

In [None]:
!python OpenSeq2Seq/run.py --config_file=OpenSeq2Seq/example_configs/nmt.json --logdir=./nmt --mode=infer --inference_out=pred.txt

Open [pred.txt](pred.txt) and compare it with [baseline.txt](baseline.txt) to see the difference attention makes in the overall quality of the translation. Training for more epochs will improve the results and may get you to the translation you identified as the best in the first part of the lab.

__Acknowledgements__: code based on keras-visualize-activations of Philippe Remy

URL: https://github.com/philipperemy/keras-visualize-activations

The idea of date translation is borrowed from https://github.com/datalogue/keras-attention.

For the real case we have used https://github.com/NVIDIA/OpenSeq2Seq, NVIDIA's implementation of Seq2Seq model.
