# Training on a large dataset with an attention model

After implementing [beam search on a large dataset](BeamSearchOnLargeDataset.ipynb), I'll now add an attention model.
As the training set I use the [European Parliament Proceedings Parallel Corpus 1996-2011](http://statmt.org/europarl/).

Here, it would have helped to use `tf.keras` instead of `Keras`, as there is an attention model in TensorFlow. There are also plans to add an attention layer to `Keras`, so I won't reimplement it here, although it would be instructive. I'll use [keras-attention](https://github.com/datalogue/keras-attention) with an [AttentionDecoder layer](https://github.com/datalogue/keras-attention/blob/master/models/custom_recurrents.py#L10). There is also a tutorial on how to use it at [Machine Learning Mastery](https://machinelearningmastery.com/encoder-decoder-attention-sequence-to-sequence-prediction-keras/) and a [Medium article about it](https://medium.com/datalogue/attention-in-keras-1892773a4f22).

Again, I'll refactor the code a bit, putting most of the implementation details into modules.

In [1]:
# import gc
# import os
# 
import keras
# from keras.backend.tensorflow_backend import set_session
from keras.preprocessing.sequence import pad_sequences
import numpy as np
# import pandas as pd
from sklearn.model_selection import train_test_split
# import tensorflow as tf
# from tqdm import tqdm_notebook as tqdm
#  
# import bytepairencoding as bpe
import seq2seq
from utils.download import download_and_extract_resources
from utils.linguistic import bleu_scores_europarl, preprocess_input_europarl as preprocess
# 
# 
# # Fixing random state ensure reproducible results
# RANDOM_STATE=42
# np.random.seed(RANDOM_STATE)
# tf.set_random_seed(RANDOM_STATE)
# 
# pd.set_option('max_colwidth', 60)  # easier to read texts in e.g. df.head()
# 
# # technical detail so that an instance (maybe running in a different window)
# # doesn't take all the GPU memory resulting in some strange error messages
# config = tf.ConfigProto()
# config.gpu_options.per_process_gpu_memory_fraction = 0.5
# set_session(tf.Session(config=config))

from utils.preparation import Europarl, RANDOM_STATE

Using TensorFlow backend.


Fixed random seed to 42
Set gpu memory fraction to 0.5


In [2]:
MAX_INPUT_LENGTH = 20 #100  # was 50
MAX_TARGET_LENGTH = 25 # 125  # was 65
# LATENT_DIM = 512
# EMBEDDING_DIM = 300
# BPE_MERGE_OPERATIONS = 5_000  # I'd love to use 10_000 x 300, but this one is broken: https://github.com/bheinzerling/bpemb/issues/6
EPOCHS = 5 #20
BATCH_SIZE = 32
DROPOUT = 0.5
TEST_SIZE = 25 # 2_500  
EMBEDDING_TRAINABLE = True  # Improves results significantly, and at least it's not the dominant training-time factor (that's the output softmax layer)

## Download and explore data

In [3]:
# PATH = 'data'
# INPUT_LANG = 'en'
# TARGET_LANG = 'de'
# LANGUAGES = [INPUT_LANG, TARGET_LANG]
# BPE_URL = {lang: f'http://cosyne.h-its.org/bpemb/data/{lang}/' for lang in LANGUAGES}
# BPE_MODEL_NAME = {lang: f'{lang}.wiki.bpe.op{BPE_MERGE_OPERATIONS}.model' for lang in LANGUAGES}
# BPE_WORD2VEC_NAME = {lang: f'{lang}.wiki.bpe.op{BPE_MERGE_OPERATIONS}.d{EMBEDDING_DIM}.w2v.bin' for lang in LANGUAGES}
# 
# EXTERNAL_RESOURCES = {
#     # Europarl Corpus
#     'de-en.tgz': 'http://statmt.org/europarl/v7/de-en.tgz',
#     
#     # Bytepairencoding subwords (_MODEL_) and pretrained embeddings (_WORD2VEC_)
#     BPE_MODEL_NAME[INPUT_LANG]: f'{BPE_URL[INPUT_LANG]}/{BPE_MODEL_NAME[INPUT_LANG]}',
#     BPE_WORD2VEC_NAME[INPUT_LANG] + '.tar.gz': f'{BPE_URL[INPUT_LANG]}/{BPE_WORD2VEC_NAME[INPUT_LANG]}' + '.tar.gz',
#     BPE_MODEL_NAME[TARGET_LANG]: f'{BPE_URL[TARGET_LANG]}/{BPE_MODEL_NAME[TARGET_LANG]}',
#     BPE_WORD2VEC_NAME[TARGET_LANG] + '.tar.gz': f'{BPE_URL[TARGET_LANG]}/{BPE_WORD2VEC_NAME[TARGET_LANG]}' + '.tar.gz',
# }

europarl = Europarl()
download_and_extract_resources(fnames_and_urls=europarl.external_resources, dest_path=europarl.path)

de-en.tgz already downloaded (188.6 MB)
en.wiki.bpe.op5000.model already downloaded (0.3 MB)
en.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (6.2 MB)
de.wiki.bpe.op5000.model already downloaded (0.3 MB)
de.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (5.7 MB)


In [4]:
europarl.load_and_preprocess(max_input_length=MAX_INPUT_LENGTH, max_target_length=MAX_TARGET_LENGTH)
# df = pd.DataFrame(data={
#     'input_texts': read_europarl(INPUT_LANG),
#     'target_texts': read_europarl(TARGET_LANG)
# })

Total number of unfiltered translations 1920209
Filtered translations with length between (1, input=20/target=25) characters: 14943


In [5]:
# print("Nr total input:", len(df))
# df['input_length'] = df.input_texts.apply(len)
# df['target_length'] = df.target_texts.apply(len)
europarl.df.head()

| | input_texts | target_texts | input_length | target_length | input_sequences | target_sequences |
|---|---|---|---|---|---|---|
| 67 | agenda | arbeitsplan | 6 | 11 | [1, 631, 222, 34, 2] | [1, 941, 197, 3454, 2] |
| 704 | what is the result? | was sind die folgen? | 19 | 20 | [1, 781, 14, 3, 714, 2426, 2] | [1, 748, 126, 6, 2374, 3720, 2] |
| 1261 | with what aim? | zu welchem zweck? | 14 | 17 | [1, 23, 781, 2973, 2426, 2] | [1, 26, 2740, 156, 155, 142, 359, 188, 3720, 2] |
| 1401 | why? | wieso? | 4 | 6 | [1, 958, 38, 2426, 2] | [1, 167, 1659, 3720, 2] |
| 1403 | no. | nein. | 3 | 5 | [1, 220, 5, 2] | [1, 124, 191, 3, 2] |


### Filter translations (only sentences shorter than a given length)

With a fully working machine translation system, it would of course be better to train on all data (plus maybe some augmented data). Without attention (and maybe a copy mechanism, dynamic memory, ...) there's no point in doing so anyway, and filtering also reduces training time (a full training run on ~2 million translations might take days, even with a good GPU).
I use a different maximum length for the input (English) than for the target (German), as German is more verbose.

In [6]:
# non_empty = (df.input_length > 1) & (df.target_length > 1)  # there are empty phrases like '\n' --> 'Frau Präsidentin\n'
# short_inputs = (df.input_length < MAX_INPUT_LENGTH) & (df.target_length < MAX_TARGET_LENGTH)
# print(f'Sentences with length between (1, input={MAX_INPUT_LENGTH}/target={MAX_TARGET_LENGTH}) characters:', sum(non_empty & short_inputs))
# df = df[non_empty & short_inputs]
# gc.collect();  # df with filtered sentences is significant smaller, so time to garbage collect
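
The length filter described above boils down to two conditions per sentence pair: both sides must be non-empty, and both must be shorter than their limit in characters. A minimal sketch on a hand-made list of pairs (the toy data and the `keep` helper are illustrative assumptions, not part of `utils.preparation`):

```python
# Toy sentence pairs (assumption); the real pairs come from the Europarl corpus.
pairs = [
    ("agenda", "arbeitsplan"),
    ("why?", "wieso?"),
    ("\n", "frau präsidentin\n"),  # "empty" input that only holds a newline
    ("this sentence is clearly longer than twenty characters", "dieser satz ist zu lang"),
]

MAX_INPUT_LENGTH = 20   # characters, as in the configuration cell
MAX_TARGET_LENGTH = 25  # characters


def keep(pair, max_input=MAX_INPUT_LENGTH, max_target=MAX_TARGET_LENGTH):
    """Return True if both texts have more than one character and are below their limit."""
    source, target = pair
    return 1 < len(source) < max_input and 1 < len(target) < max_target


filtered = [pair for pair in pairs if keep(pair)]
print(filtered)
```

Only the first two pairs survive: the third has an input of a single newline character, and the fourth exceeds the input limit.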

## Load (pretrained) Bytepairs

I need the subword dictionary (in `BPE_MODEL_NAME`), the pretrained embeddings (in `BPE_WORD2VEC_NAME`) and a [sentencepiece](https://github.com/google/sentencepiece) handler that can encode/decode them.

In [7]:
# bpe_input, bpe_target = [bpe.Bytepairencoding(
#     word2vec_fname=os.path.join(PATH, BPE_WORD2VEC_NAME[lang]),
#     sentencepiece_fname=os.path.join(PATH, BPE_MODEL_NAME[lang]),
# ) for lang in [INPUT_LANG, TARGET_LANG]] 
print("English subwords", europarl.bpe_input.sentencepiece.EncodeAsPieces("this is a test for pretrained bytepairembeddings"))
print("German subwords", europarl.bpe_target.sentencepiece.EncodeAsPieces("das ist ein test für vortrainierte zeichengruppen"))

English subwords ['▁this', '▁is', '▁a', '▁test', '▁for', '▁pre', 'tr', 'ained', '▁by', 'te', 'pa', 'ire', 'm', 'bed', 'd', 'ings']
German subwords ['▁das', '▁ist', '▁ein', '▁test', '▁für', '▁v', 'ort', 'rain', 'ierte', '▁zeich', 'eng', 'ruppen']


In [8]:
# Now encode the texts into sequences of indexes of bytepairs
# df['input_sequences'] = df.input_texts.apply(bpe_input.subword_indices)
# df['target_sequences'] = df.target_texts.apply(bpe_target.subword_indices)
# df[['input_sequences', 'target_sequences']].head()

In [9]:
# Those will be the inputs for the seq2seq model (that needs to know how long the sequences can get)
max_len_input = europarl.df.input_sequences.apply(len).max()
max_len_target = europarl.df.target_sequences.apply(len).max()
(max_len_input, max_len_target)

(15, 16)

In [10]:
train_ids, val_ids = train_test_split(np.arange(europarl.df.shape[0]), test_size=0.1, random_state=RANDOM_STATE)  # fixed random_state

In [11]:
# modified version of https://github.com/datalogue/keras-attention/
import tensorflow as tf
import keras
from keras import backend as K
from keras import regularizers, constraints, initializers, activations
from keras.layers.recurrent import Recurrent
from keras.engine import InputSpec


def _time_distributed_dense(x, w, b=None, dropout=None,
                            input_dim=None, output_dim=None,
                            timesteps=None, training=None):
    """Apply `y . w + b` for every temporal slice y of x.
    # Arguments
        x: input tensor.
        w: weight matrix.
        b: optional bias vector.
        dropout: whether to apply dropout (same dropout mask
            for every temporal slice of the input).
        input_dim: integer; optional dimensionality of the input.
        output_dim: integer; optional dimensionality of the output.
        timesteps: integer; optional number of timesteps.
        training: training phase tensor or boolean.
    # Returns
        Output tensor.
    """
    if not input_dim:
        input_dim = K.shape(x)[2]
    if not timesteps:
        timesteps = K.shape(x)[1]
    if not output_dim:
        output_dim = K.shape(w)[1]

    if dropout is not None and 0. < dropout < 1.:
        # apply the same dropout pattern at every timestep
        ones = K.ones_like(K.reshape(x[:, 0, :], (-1, input_dim)))
        dropout_matrix = K.dropout(ones, dropout)
        expanded_dropout_matrix = K.repeat(dropout_matrix, timesteps)
        x = K.in_train_phase(x * expanded_dropout_matrix, x, training=training)

    # the version below is arguably clearer than the one in older Keras;
    # it behaves the same on TensorFlow but is untested on other backends
    x = K.dot(x, w)
    if b is not None:
        x = K.bias_add(x, b)
    return x


class AttentionDecoder(Recurrent):

    def __init__(self, units, alphabet_size,
                 embedding_dim=30,
                 is_monotonic=False,
                 normalize_energy=False,
                 activation='tanh',
                 dropout=None,
                 recurrent_dropout=None,
                 return_probabilities=False,
                 name='AttentionDecoder',
                 kernel_initializer='glorot_uniform',
                 recurrent_initializer='orthogonal',
                 bias_initializer='zeros',
                 kernel_regularizer=None,
                 bias_regularizer=None,
                 activity_regularizer=None,
                 kernel_constraint=None,
                 bias_constraint=None,
                 **kwargs):
        """
        Implements an AttentionDecoder that takes in a sequence encoded by an
        encoder and outputs the decoded states
        :param units: dimension of the hidden state and the attention matrices
        :param alphabet_size: output sequence alphabet size
            (alphabet may contain <end_of_seq> but do not need to have <start_of_seq>,
            because it is added internally inside the layer)
        :param embedding_dim: size of internal embedding for output labels
        :param is_monotonic: if True - Luong-style monotonic attention
            if False - Bahdanau-style attention (non-monotonic)
            See references for details
        :param normalize_energy: whether attention weights are normalized
        references:
            (1) Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio.
            "Neural machine translation by jointly learning to align and translate."
            arXiv preprint arXiv:1409.0473 (2014).
            (2) Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, Douglass Eck
            "Online and Linear-Time Attention by Enforcing Monotonic Alignments"
            arXiv arXiv:1704.00784 (2017)
        notes:
            with `is_monotonic=True`, `normalize_energy=True` equal to model in (2)
            with `is_monotonic=False`, `normalize_energy=False` equal to model in (1)
        """
        # self.start_token = alphabet_size
        output_dim = alphabet_size  # alphabet + end_token

        self.units = units
        self.alphabet_size = alphabet_size
        self.output_dim = output_dim

        self.is_monotonic = is_monotonic
        self.normalize_energy = normalize_energy

        self.embedding_dim = embedding_dim
        self.dropout = dropout
        self.recurrent_dropout = recurrent_dropout

        self.return_probabilities = return_probabilities
        self.activation = activations.get(activation)

        self.kernel_initializer = initializers.get(kernel_initializer)
        self.recurrent_initializer = initializers.get(recurrent_initializer)
        self.bias_initializer = initializers.get(bias_initializer)

        self.kernel_regularizer = regularizers.get(kernel_regularizer)
        self.recurrent_regularizer = regularizers.get(kernel_regularizer)
        self.bias_regularizer = regularizers.get(bias_regularizer)
        self.activity_regularizer = regularizers.get(activity_regularizer)

        self.kernel_constraint = constraints.get(kernel_constraint)
        self.recurrent_constraint = constraints.get(kernel_constraint)
        self.bias_constraint = constraints.get(bias_constraint)

        super(AttentionDecoder, self).__init__(**kwargs)
        self.name = name
        self.return_sequences = True  # must return sequences
        self.return_state = False
        self.stateful = False
        self.uses_learning_phase = True

    def add_scalar(self, initial_value=0, name=None, trainable=True):
        scalar = K.variable(initial_value, name=name)
        if trainable:
            self._trainable_weights.append(scalar)
        else:
            self._non_trainable_weights.append(scalar)
        return scalar

    def build(self, input_shapes):
        """
          See Appendix 2 of Bahdanau 2014, arXiv:1409.0473
          for model details that correspond to the matrices here.
          See Luong 2017, arXiv:1704.00784
          for model details that correspond to the scalars here.
        """
        input_shape = input_shapes
        if isinstance(input_shapes[0], (list, tuple)):
            if len(input_shapes) > 2:
                raise ValueError('Layer ' + self.name + ' expects ' +
                                 '1 or 2 input tensors, but it received ' +
                                 str(len(input_shapes)) + ' input tensors.')

            self.input_spec = [InputSpec(shape=input_shape) for input_shape in input_shapes]
            input_shape = input_shapes[0]
        else:
            self.input_spec = [InputSpec(shape=input_shape)]
        self.batch_size, self.timesteps, self.input_dim = input_shape

        if self.stateful:
            super(AttentionDecoder, self).reset_states()

        self.states = [None, None, None]  # y, s, t

        """
            Embedding matrix for y (outputs)
        """
        # self.E_y = self.add_weight(shape=(self.alphabet_size + 1, self.embedding_dim),  # +1 for start token
        #                            name='E_y',
        #                            initializer='orthogonal')
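        # NOTE: relies on the global `s2s` (created in a later cell, before this
        # layer is first called); the pretrained target embeddings are wrapped
        # with K.variable instead of being created through add_weight.
        self.E_y = K.variable(s2s.bpe_target.embedding_matrix)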

        """
            Matrices for creating the context vector
        """

        self.V_a = self.add_weight(shape=(self.units,),
                                   name='V_a',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.W_a = self.add_weight(shape=(self.units, self.units),
                                   name='W_a',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.U_a = self.add_weight(shape=(self.input_dim, self.units),
                                   name='U_a',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.b_a = self.add_weight(shape=(self.units,),
                                   name='b_a',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
        """
            Matrices for the r (reset) gate
        """
        self.C_r = self.add_weight(shape=(self.input_dim, self.units),
                                   name='C_r',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.U_r = self.add_weight(shape=(self.units, self.units),
                                   name='U_r',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_r = self.add_weight(shape=(self.embedding_dim, self.units),
                                   name='W_r',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.b_r = self.add_weight(shape=(self.units,),
                                   name='b_r',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)

        """
            Matrices for the z (update) gate
        """
        self.C_z = self.add_weight(shape=(self.input_dim, self.units),
                                   name='C_z',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.U_z = self.add_weight(shape=(self.units, self.units),
                                   name='U_z',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_z = self.add_weight(shape=(self.embedding_dim, self.units),
                                   name='W_z',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.b_z = self.add_weight(shape=(self.units,),
                                   name='b_z',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
        """
            Matrices for the proposal
        """
        self.C_p = self.add_weight(shape=(self.input_dim, self.units),
                                   name='C_p',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.U_p = self.add_weight(shape=(self.units, self.units),
                                   name='U_p',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_p = self.add_weight(shape=(self.embedding_dim, self.units),
                                   name='W_p',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.b_p = self.add_weight(shape=(self.units,),
                                   name='b_p',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
        """
            Matrices for making the final prediction vector
        """
        self.C_o = self.add_weight(shape=(self.input_dim, self.output_dim),
                                   name='C_o',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.U_o = self.add_weight(shape=(self.units, self.output_dim),
                                   name='U_o',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_o = self.add_weight(shape=(self.embedding_dim, self.output_dim),
                                   name='W_o',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.b_o = self.add_weight(shape=(self.output_dim,),
                                   name='b_o',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)

        # For creating the initial state:
        self.W_s = self.add_weight(shape=(self.input_dim, self.units),
                                   name='W_s',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)

        if self.is_monotonic:
            self.Energy_r = self.add_scalar(initial_value=-1,
                                            name='r')
            self.states.append(None)
        if self.normalize_energy:
            self.Energy_g = self.add_scalar(initial_value=1,
                                            name='g')

        self.built = True

    def call(self, x, training=None):
        # TODO: check that model is loading from .h5 correctly
        # TODO: for now this cannot be a shared layer
        # (it can only be shared if teacher forcing is used, or not used, in all cases simultaneously)
        if isinstance(x, list):
            # teacher forcing for training
            self.x_seq, self.y_true = x
            self.use_teacher_forcing = True
        else:
            # inference
            self.x_seq = x
            self.use_teacher_forcing = False

        self.curr_batch_timesteps = tf.shape(self.x_seq)[1]

        # apply a dense layer over the time dimension of the sequence
        # do it here because it doesn't depend on any previous steps
        # therefore we can save computation time:
        self._uxpb = _time_distributed_dense(self.x_seq, self.U_a, b=self.b_a,
                                             dropout=self.dropout,
                                             input_dim=self.input_dim,
                                             timesteps=self.timesteps,
                                             output_dim=self.units,
                                             training=training)

        return super(AttentionDecoder, self).call(self.x_seq, training=training)

    def get_initial_state(self, inputs):
        if isinstance(inputs, list):
            assert len(inputs) == 2  # inputs == [encoder_outputs, y_true]
            encoder_outputs = inputs[0]
        else:
            encoder_outputs = inputs

        memory_shape = K.shape(encoder_outputs)

        # apply the matrix on the first time step to get the initial s0.
        s0 = activations.tanh(K.dot(encoder_outputs[:, 0], self.W_s))

        y0 = K.zeros((memory_shape[0],), dtype='int64') # + self.start_token
        t0 = K.zeros((memory_shape[0],), dtype='int64')

        initial_states = [y0, s0, t0]
        if self.is_monotonic:
            # initial attention has form: [1, 0, 0, ..., 0] for each sample in batch
            alpha0 = K.ones((memory_shape[0], 1))
            alpha0 = K.switch(K.greater(memory_shape[1], 1),
                              lambda: K.concatenate([alpha0, K.zeros((memory_shape[0], memory_shape[1] - 1))], axis=-1),
                              alpha0)
            # like energy, attention is stored in shape (samples, time, 1)
            alpha0 = K.expand_dims(alpha0, -1)
            initial_states.append(alpha0)

        return initial_states

    def step(self, x, states):
        if self.is_monotonic:
            ytm, stm, timestep, previous_attention = states
        else:
            ytm, stm, timestep = states

        ytm = K.gather(self.E_y, K.cast(ytm, 'int32'))

        if self.recurrent_dropout is not None and 0. < self.recurrent_dropout < 1.:
            stm = K.in_train_phase(K.dropout(stm, self.recurrent_dropout), stm)
            ytm = K.in_train_phase(K.dropout(ytm, self.recurrent_dropout), ytm)

        et = self._compute_energy(stm)

        if self.is_monotonic:
            at = self._compute_probabilities(et, previous_attention)
        else:
            at = self._compute_probabilities(et)

        # calculate the context vector
        context = K.squeeze(K.batch_dot(at, self.x_seq, axes=1), axis=1)

        # ~~~> calculate new hidden state

        # first calculate the "r" gate:
        rt = activations.sigmoid(
            K.dot(ytm, self.W_r)
            + K.dot(stm, self.U_r)
            + K.dot(context, self.C_r)
            + self.b_r)

        # now calculate the "z" gate
        zt = activations.sigmoid(
            K.dot(ytm, self.W_z)
            + K.dot(stm, self.U_z)
            + K.dot(context, self.C_z)
            + self.b_z)

        # calculate the proposal hidden state:
        s_tp = activations.tanh(
            K.dot(ytm, self.W_p)
            + K.dot((rt * stm), self.U_p)
            + K.dot(context, self.C_p)
            + self.b_p)

        # new hidden state:
        st = (1 - zt) * stm + zt * s_tp

        yt = activations.softmax(
            K.dot(ytm, self.W_o)
            + K.dot(st, self.U_o)
            + K.dot(context, self.C_o)
            + self.b_o)

        if self.use_teacher_forcing:
            ys = K.in_train_phase(self.y_true[:, timestep[0]], K.argmax(yt, axis=-1))
            ys = K.flatten(ys)
        else:
            ys = K.flatten(K.argmax(yt, axis=-1))

        if self.return_probabilities:
            output = at
        else:
            output = yt

        next_states = [ys, st, timestep + 1]
        if self.is_monotonic:
            next_states.append(at)
        return output, next_states

    def _compute_energy(self, stm):
        # "concat" energy function
        # energy_i = g * V / |V| * tanh([stm, h_i] * W + b) + r
        _stm = K.dot(stm, self.W_a)

        V_a = self.V_a
        if self.normalize_energy:
            V_a = self.Energy_g * K.l2_normalize(self.V_a)

        et = K.dot(activations.tanh(K.expand_dims(_stm, axis=1) + self._uxpb),
                   K.expand_dims(V_a))

        if self.is_monotonic:
            et += self.Energy_r

        return et

    def _compute_probabilities(self, energy, previous_attention=None):
        if self.is_monotonic:
            # add presigmoid noise to encourage discreteness
            sigmoid_noise = K.in_train_phase(1., 0.)
            noise = K.random_normal(K.shape(energy), mean=0.0, stddev=sigmoid_noise)
            # encourage discreteness in train
            energy = K.in_train_phase(energy + noise, energy)

            p = K.in_train_phase(K.sigmoid(energy),
                                 K.cast(energy > 0, energy.dtype))
            p = K.squeeze(p, -1)
            p_prev = K.squeeze(previous_attention, -1)
            # monotonic attention function from tensorflow
            at = K.in_train_phase(
                tf.contrib.seq2seq.monotonic_attention(p, p_prev, 'parallel'),
                tf.contrib.seq2seq.monotonic_attention(p, p_prev, 'hard'))
            at = K.expand_dims(at, -1)
        else:
            # softmax
            at = keras.activations.softmax(energy, axis=1)

        return at

    def compute_output_shape(self, input_shapes):
        """
            For Keras internal compatibility checking
        """
        input_shape = input_shapes
        if isinstance(input_shapes[0], (list, tuple)):
            input_shape = input_shapes[0]

        timesteps = input_shape[1]
        if self.return_probabilities:
            return (None, timesteps, timesteps)
        else:
            return (None, timesteps, self.output_dim)

    def get_config(self):
        """
            For rebuilding models on load time.
        """
        config = {
            'units': self.units,
            'alphabet_size': self.alphabet_size,
            'embedding_dim': self.embedding_dim,
            'return_probabilities': self.return_probabilities,
            'is_monotonic': self.is_monotonic,
            'normalize_energy': self.normalize_energy
        }
        base_config = super(AttentionDecoder, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
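
To make the role of `W_a`, `U_a`, `b_a` and `V_a` in `_compute_energy` and `_compute_probabilities` more concrete, here is a minimal NumPy sketch of the non-monotonic (Bahdanau-style) branch for a single decoder step. The function name `additive_attention`, the toy shapes and the random weights are assumptions for illustration only; the layer above performs roughly the same computation with Keras backend ops (plus optional dropout).

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def additive_attention(s_prev, encoder_outputs, W_a, U_a, b_a, V_a):
    """One step of additive attention.

    s_prev:          (batch, units)                previous decoder state
    encoder_outputs: (batch, timesteps, input_dim) encoder states h_1..h_T
    Returns attention weights (batch, timesteps) and context (batch, input_dim).
    """
    # Projection of the encoder states; independent of the decoder step,
    # which is why the layer precomputes it once in `call`.
    uxpb = encoder_outputs @ U_a + b_a            # (batch, timesteps, units)
    ws = s_prev @ W_a                             # (batch, units)
    # energy_i = V_a . tanh(W_a s + U_a h_i + b_a)
    energy = np.tanh(ws[:, None, :] + uxpb) @ V_a  # (batch, timesteps)
    weights = softmax(energy, axis=1)             # normalize over timesteps
    # Context vector: attention-weighted sum of the encoder states.
    context = np.einsum('bt,btd->bd', weights, encoder_outputs)
    return weights, context


# Toy dimensions and random parameters (assumptions, not trained values).
rng = np.random.default_rng(42)
batch, timesteps, input_dim, units = 2, 4, 3, 5
h = rng.normal(size=(batch, timesteps, input_dim))
s = rng.normal(size=(batch, units))
W_a = rng.normal(size=(units, units))
U_a = rng.normal(size=(input_dim, units))
b_a = np.zeros(units)
V_a = rng.normal(size=units)

weights, context = additive_attention(s, h, W_a, U_a, b_a, V_a)
print(weights.shape, context.shape)
```

Each row of `weights` sums to one over the encoder timesteps, and `context` has the same last dimension as the encoder output, which is what the gate matrices `C_r`, `C_z`, `C_p` and `C_o` expect.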

In [12]:
#s2s = Seq2SeqWithAttention(
s2s = seq2seq.Seq2SeqWithBPE(
    bpe_input=europarl.bpe_input,
    bpe_target=europarl.bpe_target,
    max_len_input=max_len_input,
    max_len_target=max_len_target
)
# s2s.model.compile(optimizer=keras.optimizers.Adam(clipnorm=1., clipvalue=.5), loss='categorical_crossentropy')
# s2s.model.summary()
import keras.layers as L
rnn_encoded = L.Bidirectional(
    L.GRU(s2s.latent_dim // 2, return_sequences=True),
    name='bidirectional_1',
    merge_mode='concat'
)(s2s.encoder_embeddings)
attention_decoder = AttentionDecoder(
    s2s.latent_dim,
    len(europarl.bpe_target.tokens),
    trainable=EMBEDDING_TRAINABLE,
    embedding_dim=s2s.bpe_target.embedding_dim,
)(rnn_encoded)

# train_generator = s2s.create_batch_generator(
#     train_ids, europarl.df.input_sequences, europarl.df.target_sequences, BATCH_SIZE
# )
# val_generator = s2s.create_batch_generator(
#     val_ids, europarl.df.input_sequences, europarl.df.target_sequences, BATCH_SIZE
# )
# 
# s2s.model.fit_generator(
#     train_generator,
#     steps_per_epoch=np.ceil(len(train_ids) / BATCH_SIZE),
#     epochs=EPOCHS,
#     validation_data=val_generator,
#     validation_steps=np.ceil(len(val_ids) / BATCH_SIZE),
# )


In [13]:
model = keras.models.Model(inputs=s2s.encoder_inputs, outputs=attention_decoder)
model.compile(optimizer=keras.optimizers.Adam(clipnorm=1., clipvalue=.5), loss='categorical_crossentropy')

In [14]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
encoder_inputs (InputLayer)  (None, 15)                0         
_________________________________________________________________
input_embedding (Embedding)  (None, 15, 302)           1760358   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 15, 512)           858624    
_________________________________________________________________
AttentionDecoder (AttentionD (None, 15, 5367)          9947737   
Total params: 12,566,719
Trainable params: 12,566,719
Non-trainable params: 0
_________________________________________________________________


In [15]:
def create_batch_generator(
    samples_ids, input_sequences, target_sequences, batch_size
):

    def batch_generator():
        # number of batches per pass over the data, as an int
        nr_batches = int(np.ceil(len(samples_ids) / batch_size))
        while True:
            # reshuffle the sample ids on every pass
            shuffled_ids = np.random.permutation(samples_ids)
            batch_splits = np.array_split(shuffled_ids, nr_batches)
            for batch_ids in batch_splits:
                batch_X = pad_sequences(
                    input_sequences.iloc[batch_ids],
                    padding='post',
                    maxlen=max_len_input
                )
                batch_y = pad_sequences(
                    target_sequences.iloc[batch_ids],
                    padding='post',
                    maxlen=max_len_target
                )
                # targets: drop the first (start) token and one-hot encode the rest
                batch_y_t_output = keras.utils.to_categorical(
                    batch_y[:, 1:],
                    num_classes=len(europarl.bpe_target.tokens)
                )
                # decoder input for teacher forcing; only used by the commented-out yield
                batch_x_t_input = batch_y[:, :-1]
                #yield ([batch_X, batch_x_t_input], batch_y_t_output)
                yield (batch_X, batch_y_t_output)

    return batch_generator()

train_generator = create_batch_generator(
    train_ids, europarl.df.input_sequences, europarl.df.target_sequences, BATCH_SIZE
)
val_generator = create_batch_generator(
    val_ids, europarl.df.input_sequences, europarl.df.target_sequences, BATCH_SIZE
)

model.fit_generator(
    train_generator,
    steps_per_epoch=int(np.ceil(len(train_ids) / BATCH_SIZE)),  # step counts should be ints
    epochs=EPOCHS,
    validation_data=val_generator,
    validation_steps=int(np.ceil(len(val_ids) / BATCH_SIZE)),
)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fd36c05ffd0>
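
The generator above trains against `batch_y[:, 1:]`, i.e. the padded target sequence without its first token, one-hot encoded. A minimal NumPy-only sketch of that shaping step follows. The helpers `pad_post` and `one_hot` are hypothetical stand-ins for `pad_sequences(..., padding='post')` and `keras.utils.to_categorical` (note that `pad_post` truncates at the end, which is a simplification), and the toy ids and vocabulary size are made up for illustration.

```python
import numpy as np


def pad_post(sequences, maxlen, value=0):
    """Right-pad (or cut at the end) integer sequences to a fixed length.

    Hypothetical stand-in for pad_sequences(..., padding='post').
    """
    out = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]
        out[row, :len(trimmed)] = trimmed
    return out


def one_hot(ids, num_classes):
    """One-hot encode an integer array along a new last axis."""
    return np.eye(num_classes, dtype=np.float32)[ids]


# Toy target sequences: 1 = start token, 2 = end token, 0 = padding.
targets = [[1, 5, 4, 2], [1, 3, 2]]
batch_y = pad_post(targets, maxlen=5)

# Drop the start token, then one-hot encode what the model should emit.
batch_y_t_output = one_hot(batch_y[:, 1:], num_classes=6)

print(batch_y.shape, batch_y_t_output.shape)
```

The time axis of the targets is one shorter than `maxlen`, so it has to match the number of timesteps the model emits.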

In [16]:
name = 'attentionmodel'
# model.save_weights(f'data/{name}_model_weights.h5') 
s2s.model.save_weights(f'data/{name}_model_weights.h5')  # https://drive.google.com/open?id=10Sv-JnAiUT_fvU_cw1_H7mkcTAipC5aA
s2s.inference_encoder_model.save_weights(f'data/{name}_inference_encoder_model_weights.h5')  # https://drive.google.com/open?id=1gNBrn_Wij0PyeE-jJsEnlv7aHXkYuAup
s2s.inference_decoder_model.save_weights(f'data/{name}_inference_decoder_model_weights.h5')  # https://drive.google.com/open?id=1LCU53Hnb4m42QO3qsZTAkyYyroqz2vbe

In [30]:
europarl.df.head()

| Unnamed: 0 | input_texts | target_texts | input_length | target_length | input_sequences | target_sequences |
|---|---|---|---|---|---|---|
| 67 | agenda | arbeitsplan | 6 | 11 | [1, 631, 222, 34, 2] | [1, 941, 197, 3454, 2] |
| 704 | what is the result? | was sind die folgen? | 19 | 20 | [1, 781, 14, 3, 714, 2426, 2] | [1, 748, 126, 6, 2374, 3720, 2] |
| 1261 | with what aim? | zu welchem zweck? | 14 | 17 | [1, 23, 781, 2973, 2426, 2] | [1, 26, 2740, 156, 155, 142, 359, 188, 3720, 2] |
| 1401 | why? | wieso? | 4 | 6 | [1, 958, 38, 2426, 2] | [1, 167, 1659, 3720, 2] |
| 1403 | no. | nein. | 3 | 5 | [1, 220, 5, 2] | [1, 124, 191, 3, 2] |


In [36]:
p = model.predict(pad_sequences(
    [europarl.bpe_input.subword_indices(preprocess('agenda'))],
    padding='post',
    maxlen=max_len_input
), verbose=True)
p.shape



(1, 15, 5367)
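
The output shape `(1, 15, 5367)` means one distribution over the 5367 target tokens for each of the 15 timesteps. A greedy readout of such a tensor could look like the sketch below. It is not the beam search used elsewhere in this notebook, and the helper `greedy_decode`, the toy vocabulary and the toy probabilities are assumptions for illustration only.

```python
import numpy as np


def greedy_decode(probs, tokens, end_token='</s>'):
    """Take the most likely token at each timestep and stop at end_token.

    probs:  array of shape (timesteps, vocabulary_size)
    tokens: list mapping a vocabulary index to its token string
    """
    decoded = []
    for idx in np.argmax(probs, axis=-1):
        token = tokens[idx]
        if token == end_token:
            break
        decoded.append(token)
    return decoded


# Hypothetical 5-token vocabulary and a 4-timestep prediction.
toy_tokens = ['<pad>', '<s>', '</s>', 'arbeits', 'plan']
toy_probs = np.array([
    [0.0, 0.1, 0.1, 0.7, 0.1],
    [0.0, 0.1, 0.1, 0.1, 0.7],
    [0.0, 0.1, 0.7, 0.1, 0.1],
    [0.7, 0.1, 0.1, 0.1, 0.0],
])

decoded = greedy_decode(toy_probs, toy_tokens)
print(decoded)
```

With the real model this would be called as `greedy_decode(p[0], s2s.bpe_target.tokens)`, assuming `tokens` can be indexed by token id as in the next cell.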

In [35]:
s2s.bpe_target.tokens[np.argmax(p[0, 3, :])]

'</s>'

In [None]:
def predict(sentence, beam_width=5):
    return s2s.decode_beam_search(pad_sequences(
        [europarl.bpe_input.subword_indices(preprocess(sentence))],
        padding='post',
        maxlen=max_len_input,
    ), beam_width=beam_width)

In [None]:
# Performance on some examples:
EXAMPLES = [
    'Hello.',
    'You are welcome.',
    'How do you do?',
    'I hate mondays.',
    'I am a programmer.',
    'Data is the new oil.',
    'It could be worse.',
    "I am on top of it.",
    "N° Uno",
    "Awesome!",
    "Put your feet up!",
    "From the start till the end!",
    "From dusk till dawn.",
]
for en in [sentence + '\n' for sentence in EXAMPLES]:
    print(f"{preprocess(en)!r} --> {predict(en)!r}")

In [None]:
# Performance on training set:
for en, de in europarl.df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

In [None]:
# Performance on validation set
val_df = europarl.df.iloc[val_ids]
for en, de in val_df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

In [None]:
bleu = bleu_scores_europarl(
    input_texts=europarl.df.input_texts.iloc[val_ids[:TEST_SIZE]],
    target_texts=europarl.df.target_texts.iloc[val_ids[:TEST_SIZE]],
    predict=lambda text: predict(text)
)
print(f'average BLEU on test set = {bleu.mean()}')

# Conclusion

Translations of short sentences look decent. For longer sentences, however, the translation obviously gets lost partway through: although the output is related to a correct translation, it is confusing and repeats itself.
It is worth noting that the sentences are not too long for an LSTM/GRU model (52 and 71 bytepairs for the encoding and decoding network, respectively). LSTMs/GRUs are said to handle sequences of up to 100 elements, with performance starting to degrade at around 60 (at least that's what the Stanford courses say). So sufficiently long training might solve the problem for the sentence lengths chosen here; the training does progress from epoch to epoch, which is really nice to see on large data. But of course it is better to do what humans do and apply an attention model, instead of trying to keep everything condensed in a single embedding of 512 float32 values while also generating bytepair after bytepair.
This model is also already realistic in terms of training time: it took around 18h on a GTX1080. Besides implementing an attention model, it is tempting to see how a convolutional network might improve runtime performance (and also quality). But let's get to attention first.
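
To make the contrast with a single fixed-size embedding concrete, here is a minimal NumPy sketch of the core attention step: at each decoding step the decoder weights all encoder states and reads out their weighted sum. It uses scaled dot-product scoring for brevity, which is an assumption and not necessarily the scoring used by the `AttentionDecoder` layer above. The function names and the random toy states are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()


def attention_context(decoder_state, encoder_states):
    """Return attention weights and the context vector for one decoding step.

    decoder_state:  shape (latent_dim,)
    encoder_states: shape (timesteps, latent_dim)
    """
    latent_dim = encoder_states.shape[1]
    # One score per encoder timestep (scaled dot product, an assumption).
    scores = encoder_states @ decoder_state / np.sqrt(latent_dim)
    weights = softmax(scores)
    # Weighted sum of the encoder states.
    context = weights @ encoder_states
    return weights, context


rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(15, 512))  # 15 timesteps, 512 features
decoder_state = rng.normal(size=512)

weights, context = attention_context(decoder_state, encoder_states)
print(weights.shape, context.shape)
```

The context vector has the same size as one encoder state, but it is recomputed at every output step, so the decoder does not depend on one summary of the whole sentence.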