# Training on full dataset with attention model

This is a copy-and-paste version of the original [Attention model with tensorflow](AttentionModelForMachineTranslationWithTensorflow.ipynb) notebook, now trained on close to the full [European Parliament Proceedings Parallel Corpus 1996-2011](http://statmt.org/europarl/) English-German translations. I only used 95% of the translation corpus: the remaining 5% are very long sentences that would increase the total training time considerably (training already took ~40h here).

Before this training run I also tried stacking multiple RNN layers in the encoder and decoder. The gains were marginal: the validation loss decreased slightly, but the translation quality did not visibly improve. As additional layers increase training time significantly and raise the risk of overfitting, I decided not to continue down this path. Without an attention model, additional layers would usually help, but it seems that [Attention is all you need](https://arxiv.org/abs/1706.03762) holds true. The [Google NMT model](https://arxiv.org/pdf/1609.08144.pdf) works with 8 encoder layers (only the first one bidirectional), 8 decoder layers and residual connections. That sounds reasonable, but for a side project it is probably too computationally intensive (Google itself uses low-precision arithmetic and their specialised TPU hardware).

Now I train on sentences of up to 400 characters _per sentence_. So _one_ sentence can span up to five usual typewriter lines! It's a parliament corpus, and our European representatives (many of them lawyers) are certainly good at formulating very complicated, hard-to-understand nested sentences (much longer than this one). So translation is a really tough, high-level task here. Note that for technical reasons (related to the pretrained byte-pair-encoding embeddings used) I preprocess the input to lowercase and reduce all numbers to 0.
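The lowercasing and number reduction could look roughly like the sketch below. The real helper (`preprocess_input_europarl` from `utils.linguistic`) is not shown in this notebook, so the function name `normalize_text`, the whitespace stripping and the regular expression are assumptions for illustration only.

```python
import re


def normalize_text(text):
    """Hypothetical stand-in for the notebook's preprocessing step."""
    # Lowercase and trim surrounding whitespace (assumed behaviour)
    text = text.strip().lower()
    # Collapse every run of digits into a single 0 (assumed rule)
    return re.sub(r"\d+", "0", text)


print(normalize_text("Article 12 was adopted in 1999."))  # article 0 was adopted in 0.
```

Reducing all numbers to one token keeps the subword vocabulary small, at the price of losing the actual values in the translation.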

In [1]:
# Check that there is a GPU
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 10899378753239656055
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 7735830119
locality {
  bus_id: 1
  links {
  }
}
incarnation: 13522463194795574981
physical_device_desc: "device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1"
]


In [2]:
# Check that Cuda/Cudnn/GPU works as intended
import tensorflow as tf
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

with tf.Session() as sess:
    print(sess.run(c))

[[22. 28.]
 [49. 64.]]


In [3]:
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.python.layers import core as layers_core
from tqdm import tqdm_notebook as tqdm

from utils.download import download_and_extract_resources
from utils.linguistic import bleu_scores_europarl, preprocess_input_europarl as preprocess
from utils.preparation import Europarl, RANDOM_STATE

Fixed random seed to 42


Using TensorFlow backend.


In [4]:
MAX_INPUT_LENGTH = 400
MAX_TARGET_LENGTH = 450
LATENT_DIM = 256  # was 512, but a smaller hidden representation should suffice as attention lets the decoder look back at the input as needed
EPOCHS = 20
BATCH_SIZE = 128  # was 64, but the tensorflow implementation needs less GPU memory, so the batch size can be increased
DROPOUT = 0.25  # Dropout on both input and output of the RNN cells, so the effective dropout is close to 0.5; this worked slightly better
TEST_SIZE = 2500
BEAM_WIDTH = 5
EMBEDDING_TRAINABLE = True  # Improves results significantly, and at least it's not the dominant training time factor (that's the output softmax layer)

## Download and explore data

In [5]:
europarl = Europarl()
download_and_extract_resources(fnames_and_urls=europarl.external_resources, dest_path=europarl.path)

de-en.tgz already downloaded (188.6 MB)
en.wiki.bpe.op5000.model already downloaded (0.3 MB)
en.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (6.2 MB)
de.wiki.bpe.op5000.model already downloaded (0.3 MB)
de.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (5.7 MB)


In [6]:
europarl.load_and_preprocess(max_input_length=MAX_INPUT_LENGTH, max_target_length=MAX_TARGET_LENGTH)

Total number of unfiltered translations 1920209
Filtered translations with length between (1, input=400/target=450) characters: 1864679


In [7]:
europarl.df.head()

| Unnamed: 0 | input_texts | target_texts | input_length | target_length | input_sequences | target_sequences |
|---|---|---|---|---|---|---|
| 0 | resumption of the session | wiederaufnahme der sitzungsperiode | 25 | 34 | [1, 344, 146, 498, 90, 6, 3, 3235, 90, 2] | [1, 247, 351, 750, 5, 934, 43, 3158, 4762, 2] |
| 1 | i declare resumed the session of the european ... | ich erkläre die am freitag, dem 0. dezember un... | 203 | 217 | [1, 305, 1712, 480, 344, 3027, 3, 3235, 90, 6,... | [1, 241, 156, 14, 476, 1252, 6, 46, 333, 324, ... |
| 2 | although, as you will have seen, the dreaded '... | wie sie feststellen konnten, ist der gefürchte... | 191 | 185 | [1, 651, 4, 18, 983, 329, 126, 1479, 4, 3, 55,... | [1, 167, 54, 604, 1191, 1403, 4, 30, 5, 596, 3... |
| 3 | you have requested a debate on this subject in... | im parlament besteht der wunsch nach einer aus... | 105 | 110 | [1, 983, 126, 1026, 152, 20, 9, 2033, 118, 19,... | [1, 13, 2269, 974, 5, 111, 203, 82, 40, 95, 34... |
| 4 | in the meantime, i should like to observe a mi... | heute möchte ich sie bitten - das ist auch der... | 232 | 217 | [1, 7, 3, 520, 133, 1258, 4, 305, 1351, 582, 1... | [1, 402, 2187, 2983, 241, 156, 54, 72, 2099, 7... |


In [8]:
print("English subwords", europarl.bpe_input.sentencepiece.EncodeAsPieces("this is a test for pretrained bytepairembeddings"))
print("German subwords", europarl.bpe_target.sentencepiece.EncodeAsPieces("das ist ein test für vortrainierte zeichengruppen"))

English subwords ['▁this', '▁is', '▁a', '▁test', '▁for', '▁pre', 'tr', 'ained', '▁by', 'te', 'pa', 'ire', 'm', 'bed', 'd', 'ings']
German subwords ['▁das', '▁ist', '▁ein', '▁test', '▁für', '▁v', 'ort', 'rain', 'ierte', '▁zeich', 'eng', 'ruppen']


In [9]:
# Those will be the inputs for the seq2seq model (that needs to know how long the sequences can get)
max_len_input = europarl.df.input_sequences.apply(len).max()
max_len_target = europarl.df.target_sequences.apply(len).max()
(max_len_input, max_len_target)

(161, 171)

In [10]:
train_ids, val_ids = train_test_split(np.arange(europarl.df.shape[0]), test_size=0.1, random_state=RANDOM_STATE)  # fixed random_state

In [11]:
tf.reset_default_graph()

with tf.device('/gpu:0'):

    encoder_inputs = tf.placeholder(
        shape=(None, None),  # batch_size x max_len_input
        dtype=tf.int32,
        name='encoder_inputs' 
    )
    batch_size = tf.shape(encoder_inputs)[0]
    beam_width = tf.placeholder_with_default(1, shape=[])
    dropout = tf.placeholder_with_default(tf.cast(0.0, tf.float32), shape=[])
    keep_prob = tf.cast(1.0, tf.float32) - dropout

    embedding_encoder = tf.get_variable(
        "embedding_encoder", 
        initializer=tf.constant(europarl.bpe_input.embedding_matrix),
        trainable=EMBEDDING_TRAINABLE,
    )
    encoder_emb_inp = tf.nn.embedding_lookup(
        embedding_encoder,
        encoder_inputs,
        name="encoder_emb_inp"
    )
    
    input_sequence_length = tf.placeholder(
        shape=(None, ),
        dtype=tf.int32,
        name='input_sequence_length'
    )
    
    rnn_cell_type = tf.nn.rnn_cell.GRUCell
    encoder_forward_cell = tf.nn.rnn_cell.DropoutWrapper(
        rnn_cell_type(num_units=LATENT_DIM // 2, name='encoder_forward_cell'),
        input_keep_prob=keep_prob,
        output_keep_prob=keep_prob,  # state_keep_prob not set as it was not helpful here
        dtype=tf.float32,
    )
    encoder_backward_cell = tf.nn.rnn_cell.DropoutWrapper(
        rnn_cell_type(num_units=LATENT_DIM // 2, name='encoder_backward_cell'),
        input_keep_prob=keep_prob,
        output_keep_prob=keep_prob,
        dtype=tf.float32,
    )
    encoder_bi_outputs, encoder_bi_state = tf.nn.bidirectional_dynamic_rnn(
        encoder_forward_cell, encoder_backward_cell,
        inputs=encoder_emb_inp,
        sequence_length=input_sequence_length,
        time_major=False,
        dtype=tf.float32,
    )
    encoder_outputs = tf.concat(encoder_bi_outputs, -1)
    encoder_state = tf.concat(encoder_bi_state, -1)
    
    # Regarding time_major:
    # If true, these `Tensors` must be shaped `[max_time, batch_size, depth]`.
    # If false, these `Tensors` must be shaped `[batch_size, max_time, depth]`.
    # Using `time_major = True` is a bit more efficient because it avoids
    # transposes at the beginning and end of the RNN calculation.  However,
    # most TensorFlow data is batch-major, so by default this function
    # accepts input and emits output in batch-major form.
    #
    # for simplicity I work with batch major here instead of time_major
    # so I don't need to transpose inputs and transpose back for attention mechanism
    
    decoder_inputs = tf.placeholder(
        shape=(None, None),  # batch_size x max_len_target
        dtype=tf.int32,
        name='decoder_inputs' 
    )
    embedding_decoder = tf.get_variable(
        "embedding_decoder", 
        initializer=tf.constant(europarl.bpe_target.embedding_matrix),
        trainable=EMBEDDING_TRAINABLE,
    )
    decoder_emb_inp = tf.nn.embedding_lookup(
        embedding_decoder,
        decoder_inputs,
        name="decoder_emb_inp"
    )
    
    target_sequence_length = tf.placeholder(
        shape=(None, ),
        dtype=tf.int32,
        name='target_sequence_length'
    )
    
    # tiling is necessary to work with BeamSearchDecoder
    # read carefully the NOTE on constructor in
    # https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/AttentionWrapper 
    tiled_encoder_outputs = tf.contrib.seq2seq.tile_batch(encoder_outputs, multiplier=beam_width)
    tiled_encoder_state = tf.contrib.seq2seq.tile_batch(encoder_state, multiplier=beam_width)
    tiled_sequence_length = tf.contrib.seq2seq.tile_batch(input_sequence_length, multiplier=beam_width)
    
    attention_mechanism = tf.contrib.seq2seq.LuongAttention(
        LATENT_DIM,
        memory=tiled_encoder_outputs,
        memory_sequence_length=tiled_sequence_length,
        dtype=tf.float32,
        name='attention_mechanism',
    )
    decoder_rnn_cell = tf.nn.rnn_cell.DropoutWrapper(
        rnn_cell_type(num_units=LATENT_DIM, name='decoder_cell'),
        input_keep_prob=keep_prob,
        output_keep_prob=keep_prob,
        dtype=tf.float32,
    )
    decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
        decoder_rnn_cell,
        attention_mechanism,
        attention_layer_size=LATENT_DIM, 
        name='attention_wrapper',
    )

    training_helper = tf.contrib.seq2seq.TrainingHelper(
        inputs=decoder_emb_inp, 
        sequence_length=target_sequence_length,
        time_major=False,
        name="decoder_training_helper",
    )
    
    projection_layer = layers_core.Dense(
        units=len(europarl.bpe_target.tokens),
        use_bias=False,
        name='projection_layer',
    )
    
    initial_state=decoder_cell.zero_state(dtype=tf.float32, batch_size=batch_size).clone(
        cell_state=encoder_state
    )
    decoder = tf.contrib.seq2seq.BasicDecoder(
        cell=decoder_cell,
        helper=training_helper,
        initial_state=initial_state,
        output_layer=projection_layer,
    )
    outputs, _final_state, _final_sequence_length = tf.contrib.seq2seq.dynamic_decode(  
        decoder,
        output_time_major=False,
        impute_finished=False,
    )
    logits = outputs.rnn_output
    
    decoder_outputs = tf.placeholder(
        shape=(None, None),  # batch_size x max_len_target
        dtype=tf.int32,
        name='decoder_outputs',
    )
    target_weights = tf.cast(tf.sequence_mask(target_sequence_length), dtype=tf.float32)
    train_loss = tf.contrib.seq2seq.sequence_loss(logits, decoder_outputs, target_weights)

    params = tf.trainable_variables()
    gradients = tf.gradients(train_loss, params)
    clipped_gradients, _ = tf.clip_by_global_norm(
        t_list=gradients,
        clip_norm=1.,
    )
    
    optimizer = tf.train.AdamOptimizer()
    update_step = optimizer.apply_gradients(zip(clipped_gradients, params))
    
    inference_decoder_initial_state = decoder_cell.zero_state(
        dtype=tf.float32,
        batch_size=batch_size * beam_width  # tricky and somehow unintuitive, but necessary
    ).clone(
        cell_state=tiled_encoder_state
    )
    inference_decoder = tf.contrib.seq2seq.BeamSearchDecoder(
        cell=decoder_cell,
        embedding=embedding_decoder,
        start_tokens=tf.fill([batch_size], europarl.bpe_target.start_token_idx),
        end_token=europarl.bpe_target.stop_token_idx,
        initial_state=inference_decoder_initial_state,
        beam_width=BEAM_WIDTH,
        output_layer=projection_layer,
        length_penalty_weight=1.0,  # https://machinelearningmastery.com/configure-encoder-decoder-model-neural-machine-translation/
    )

    
    inference_outputs, _inference_final_state, _inference_final_sequence_length = tf.contrib.seq2seq.dynamic_decode(
        inference_decoder,
        maximum_iterations=tf.round(tf.reduce_max(input_sequence_length) * 2),  # a bit more flexible than max_len_target
        impute_finished=False,
    )
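As a reminder of what the attention mechanism in the graph above computes for a single decoder step, here is a minimal sketch in plain Python. It is a simplification under stated assumptions: the learned projection of the encoder outputs is left out (the keys are used as they are), and the function name `luong_attention` is made up for this example.

```python
import math


def luong_attention(query, keys, length):
    """Dot-product attention over the first `length` keys; the rest is padding."""
    # Multiplicative score: dot product between decoder state and each encoder output
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys[:length]]
    # Numerically stable softmax over the valid source positions only
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    # Padded positions get zero weight (the role of memory_sequence_length)
    weights = [e / total for e in exps] + [0.0] * (len(keys) - length)
    # Context vector: attention-weighted sum of the encoder outputs
    dim = len(keys[0])
    context = [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
    return weights, context


weights, context = luong_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],  # the last key is padding
    length=2,
)
```

The masking is the reason why `input_sequence_length` has to be tiled together with the encoder outputs for beam search: every beam needs its own copy of the valid lengths.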

In [12]:
def run_train_batch(batch_ids):
    batch_input_sequences = europarl.df.input_sequences.iloc[batch_ids]
    batch_input_lengths = batch_input_sequences.apply(len)
    batch_target_sequences = europarl.df.target_sequences.iloc[batch_ids]
    batch_target_lengths = batch_target_sequences.apply(len) - 1

    batch_input_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_input_sequences,
        maxlen=max_len_input,
        dtype=int,
        padding='post'
    )
    batch_target_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_target_sequences,
        maxlen=max_len_target,
        dtype=int,
        padding='post'
    )
    pred, loss, _ = sess.run(
        fetches=[
            outputs, train_loss, update_step
        ],
        feed_dict={
            encoder_inputs: batch_input_padded,
            input_sequence_length: np.array(batch_input_lengths),
            decoder_inputs: batch_target_padded[:, :batch_target_lengths.max()],
            target_sequence_length: np.array(batch_target_lengths),
            decoder_outputs: batch_target_padded[:, 1:batch_target_lengths.max() + 1],
            dropout: DROPOUT,
        }
    )
    return loss

def run_val_batch(batch_ids):
    batch_input_sequences = europarl.df.input_sequences.iloc[batch_ids]
    batch_input_lengths = batch_input_sequences.apply(len)
    batch_target_sequences = europarl.df.target_sequences.iloc[batch_ids]
    batch_target_lengths = batch_target_sequences.apply(len) - 1

    batch_input_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_input_sequences,
        maxlen=max_len_input,
        dtype=int,
        padding='post'
    )
    batch_target_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_target_sequences,
        maxlen=max_len_target,
        dtype=int,
        padding='post'
    )
    loss = sess.run(
        fetches=[train_loss],
        feed_dict={
            encoder_inputs: batch_input_padded,
            input_sequence_length: np.array(batch_input_lengths),
            decoder_inputs: batch_target_padded[:, :batch_target_lengths.max()],
            target_sequence_length: np.array(batch_target_lengths),
            decoder_outputs: batch_target_padded[:, 1:batch_target_lengths.max() + 1],
        }
    )
    return loss

def run_validation_loss():
    return np.mean([
        run_val_batch(ids)
        for ids 
        in np.array_split(val_ids, np.ceil(len(val_ids) / BATCH_SIZE))
    ])
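The slicing in `run_train_batch` and `run_val_batch` above implements teacher forcing: the decoder input is the target sequence without its last token, and the expected output is the same sequence shifted by one. The sketch below replays this on plain lists. The helper name `teacher_forcing_pair` is made up, the ids 1 and 2 are taken from the start and end of the sequences shown in `europarl.df`, and 0 as padding value is the default of `pad_sequences`.

```python
START, STOP, PAD = 1, 2, 0  # START/STOP as seen in the dataframe, PAD is the pad_sequences default


def teacher_forcing_pair(sequences):
    # One token less than the full sequence, like `apply(len) - 1` above
    lengths = [len(seq) - 1 for seq in sequences]
    max_len = max(lengths)
    # Post-pad every sequence to the longest full length in the batch
    padded = [seq + [PAD] * (max_len + 1 - len(seq)) for seq in sequences]
    decoder_in = [row[:max_len] for row in padded]        # starts with START
    decoder_out = [row[1:max_len + 1] for row in padded]  # ends with STOP
    return decoder_in, decoder_out, lengths


decoder_in, decoder_out, lengths = teacher_forcing_pair([[1, 7, 8, 2], [1, 9, 2]])
print(decoder_in)   # [[1, 7, 8], [1, 9, 2]]
print(decoder_out)  # [[7, 8, 2], [9, 2, 0]]
print(lengths)      # [3, 2]
```

Positions beyond `lengths` (the trailing `2` in the second decoder input and the `0` in the second output) are ignored by the loss thanks to the `tf.sequence_mask` weights.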

In [13]:
config = tf.ConfigProto(
    allow_soft_placement=True,  # needed as recommendation from https://github.com/tensorflow/tensorflow/issues/2292
    log_device_placement=True,
)
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

batches_per_epoch = np.ceil(len(train_ids) / BATCH_SIZE)
for epoch in range(EPOCHS):
    shuffled_ids = np.random.permutation(train_ids)
    batch_splits = np.array_split(shuffled_ids, batches_per_epoch)
    train_losses = []
    N = len(batch_splits)
    with tqdm(batch_splits, desc=f"Epoch {epoch+1}") as t:
        for train_batch_ids in t:
            batch_loss = run_train_batch(train_batch_ids)
            train_losses.append(batch_loss)
            t.set_postfix(train_loss=np.mean(train_losses))
        print("train_loss", np.mean(train_losses), "val_loss", run_validation_loss())
        
validation_input_sequences = europarl.df.input_sequences.iloc[val_ids[:BATCH_SIZE]]
validation_input_lengths = validation_input_sequences.apply(len)

validation_input_padded = tf.keras.preprocessing.sequence.pad_sequences(
    validation_input_sequences,
    maxlen=max_len_input,
    dtype=int,
    padding='post'
)

Progress per epoch (13112 batches each):

Epoch 1: train_loss 2.8138769 val_loss 1.9964966
Epoch 2: train_loss 2.1779718 val_loss 1.8172414
Epoch 3: train_loss 2.068193 val_loss 1.7540516
Epoch 4: train_loss 2.0192885 val_loss 1.7253596
Epoch 5: train_loss 1.990736 val_loss 1.7020212
Epoch 6: train_loss 1.9711366 val_loss 1.6883754
Epoch 7: train_loss 1.9569423 val_loss 1.6794814
Epoch 8: train_loss 1.9460126 val_loss 1.6664897
Epoch 9: train_loss 1.9372182 val_loss 1.6598784
Epoch 10: train_loss 1.9297335 val_loss 1.6548756
Epoch 11: train_loss 1.9238572 val_loss 1.6510123
Epoch 12: train_loss 1.9188092 val_loss 1.6469036
Epoch 13: train_loss 1.9143066 val_loss 1.641668
Epoch 14: train_loss 1.9104325 val_loss 1.6381077
Epoch 15: train_loss 1.9070565 val_loss 1.6381866
Epoch 16: train_loss 1.9039155 val_loss 1.6336378
Epoch 17: train_loss 1.9007643 val_loss 1.634092
Epoch 18: train_loss 1.8982465 val_loss 1.6273415
Epoch 19: train_loss 1.8959094 val_loss 1.6293005
Epoch 20: train_loss 1.893739 val_loss 1.6259134


In [14]:
def predict(sentence):
    sequenced = europarl.bpe_input.subword_indices(preprocess(sentence))
    padded = tf.keras.preprocessing.sequence.pad_sequences(
        [sequenced],
        maxlen=max_len_input,
        dtype=int,
        padding='post'
    )
    
    beam_search_output = sess.run(
        fetches=[inference_outputs],
        feed_dict={
            encoder_inputs: padded,
            input_sequence_length: [len(sequenced)],
            beam_width: BEAM_WIDTH,
        }
    )[0]
    
    return europarl.bpe_target.sentencepiece.DecodePieces([
        europarl.bpe_target.tokens[idx] for idx in beam_search_output.predicted_ids[0, :, 0].tolist()
    ])
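The `length_penalty_weight=1.0` passed to the `BeamSearchDecoder` matters for these long sentences. To my understanding the decoder follows the GNMT paper and divides the summed log-probability of each hypothesis by `((5 + length) / 6) ** weight`. The sketch below only reproduces this formula in plain Python; the function names are made up and the numbers are toy values.

```python
def length_penalty(length, weight):
    """GNMT-style length penalty: ((5 + length) / 6) ** weight."""
    return ((5.0 + length) / 6.0) ** weight


def normalized_score(log_prob, length, weight=1.0):
    # Beam hypotheses are ranked by log-probability divided by the penalty
    return log_prob / length_penalty(length, weight)


# A short hypothesis has the better raw log-probability ...
print(normalized_score(-4.0, length=1))  # -4.0
# ... but after normalization the longer one wins
print(normalized_score(-6.0, length=7))  # -3.0
```

With a weight of 0 the penalty is always 1, so beam search would favour short translations, as every additional token can only lower the summed log-probability.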


In [15]:
name = 'tfattentionmodel_full'

saver = tf.train.Saver()
saver.save(sess, f"data/{name}.ckpt")
# tfattentionmodel_full.ckpt.index https://drive.google.com/open?id=1JzIxjjZqcLIBYBZCal7QnwF6yHemrhv4
# tfattentionmodel_full.ckpt.meta https://drive.google.com/open?id=1b5XBioHmCDu_BTgJMl5TiC3vgLGyR_9J
# tfattentionmodel_full.ckpt.data-00000-of-00001 https://drive.google.com/open?id=1Fm61A1ghfVysq-BLpoigAOEwsaJ_hzXu

'data/tfattentionmodel.ckpt'

In [16]:
europarl.df.head()

| Unnamed: 0 | input_texts | target_texts | input_length | target_length | input_sequences | target_sequences |
|---|---|---|---|---|---|---|
| 0 | resumption of the session | wiederaufnahme der sitzungsperiode | 25 | 34 | [1, 344, 146, 498, 90, 6, 3, 3235, 90, 2] | [1, 247, 351, 750, 5, 934, 43, 3158, 4762, 2] |
| 1 | i declare resumed the session of the european ... | ich erkläre die am freitag, dem 0. dezember un... | 203 | 217 | [1, 305, 1712, 480, 344, 3027, 3, 3235, 90, 6,... | [1, 241, 156, 14, 476, 1252, 6, 46, 333, 324, ... |
| 2 | although, as you will have seen, the dreaded '... | wie sie feststellen konnten, ist der gefürchte... | 191 | 185 | [1, 651, 4, 18, 983, 329, 126, 1479, 4, 3, 55,... | [1, 167, 54, 604, 1191, 1403, 4, 30, 5, 596, 3... |
| 3 | you have requested a debate on this subject in... | im parlament besteht der wunsch nach einer aus... | 105 | 110 | [1, 983, 126, 1026, 152, 20, 9, 2033, 118, 19,... | [1, 13, 2269, 974, 5, 111, 203, 82, 40, 95, 34... |
| 4 | in the meantime, i should like to observe a mi... | heute möchte ich sie bitten - das ist auch der... | 232 | 217 | [1, 7, 3, 520, 133, 1258, 4, 305, 1351, 582, 1... | [1, 402, 2187, 2983, 241, 156, 54, 72, 2099, 7... |


In [17]:
# Performance on some examples:
EXAMPLES = [
    'Hello.',
    'You are welcome.',
    'How do you do?',
    'I hate mondays.',
    'I am a programmer.',
    'Data is the new oil.',
    'It could be worse.',
    "I am on top of it.",
    "N° Uno",
    "Awesome!",
    "Put your feet up!",
    "From the start till the end!",
    "From dusk till dawn.",
]
for en in [sentence + '\n' for sentence in EXAMPLES]:
    print(f"{preprocess(en)!r} --> {predict(en)!r}")

'hello.' --> 'helfen.'
'you are welcome.' --> 'sie begrüßen.'
'how do you do?' --> 'wie tun sie?'
'i hate mondays.' --> 'ich habe die antwort.'
'i am a programmer.' --> 'ich bin ein problem.'
'data is the new oil.' --> 'daten ist das neue öl.'
'it could be worse.' --> 'es könnte noch schlimmer sein.'
'i am on top of it.' --> 'ich bin überhaupt.'
'n° uno' --> 'nito-doo'
'awesome!' --> 'ein willkommen!'
'put your feet up!' --> 'ich habe ihre füße!'
'from the start till the end!' --> 'aus dem anfang!'
'from dusk till dawn.' --> 'aus duskus tatsächlich geht es darum.'


In [18]:
# Performance on training set:
for en, de in europarl.df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

Original 'i declare resumed the session of the european parliament adjourned on friday 0 december 0, and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.', got 'ich erkläre die sitzungsperiode des europäischen parlaments am freitag am freitag 0. dezember 0 unterbrochen, und ich möchte noch einmal wünschen, dass sie in der hoffnung, dass sie eine freude festgelegten feiertage haben.', exp: 'ich erkläre die am freitag, dem 0. dezember unterbrochene sitzungsperiode des europäischen parlaments für wiederaufgenommen, wünsche ihnen nochmals alles gute zum jahreswechsel und hoffe, daß sie schöne ferien hatten.'
Original "although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.", got 'obwohl sie, wie sie gesehen haben, die schreckliche "millennium-muggel" versagt haben, müssen die menschen in einigen l

In [19]:
# Performance on validation set
val_df = europarl.df.iloc[val_ids]
for en, de in val_df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

Original 'it is important not to underestimate the work involved.', got 'es ist wichtig, die arbeit zu unterschätzen.', exp: 'das sollte man nicht unterschätzen.'
Original 'mr vanhanen, you were mr calm, and i think you, mr tuomioja, were mr collected.', got 'herr vanhanen, sie waren kollege kollege, und ich glaube, herr tuioja, herr kommissar.', exp: 'herr vanhanen, sie waren mr. calm, und, ich denke, sie, herr tuomioja, waren mr. collected.'
Original "most members of this parliament are aware of the commission's efforts to make sure that the european support for the palestinian authority is money that is properly spent, well spent and spent in ways that help to promote pluralism, the rule of law and clean government in the palestinian territories.", got 'die meisten mitglieder dieses parlaments sind sich der bemühungen der kommission bewusst, sicherzustellen, dass die europäische unterstützung für die palästinensische autonomiebehörde gut ausgegeben wird, gut ausgegeben und ausgegebe

Original 'the debate is not therefore whether we are in favour of or opposed to the alternative methods.', got 'die aussprache ist also nicht, ob wir für die alternativen methoden sind oder gegen die alternative methoden sind.', exp: 'es geht also nicht darum, ob wir für oder gegen alternativmethoden sind.'
Original 'in conclusion, i would say that only a realistic policy, appropriate to the needs of the population, the environment and an increasingly high-quality market, is capable of achieving the objectives which we think the european union should be aiming at in terms of viticultural policy.', got 'abschließend möchte ich sagen, dass nur eine realistische politik, die für die bedürfnisse der bevölkerung, die umwelt und eine zunehmende qualitativ hochwertige marktwirtschaft angemessen ist, in der lage ist, die ziele zu erreichen, die wir denken, dass die europäische union im hinblick auf die wettbewerbsfähige politik abzielen sollte.', exp: 'abschließend möchte ich feststellen, daß 

In [20]:
bleu = bleu_scores_europarl(
    input_texts=europarl.df.input_texts.iloc[val_ids[:TEST_SIZE]],
    target_texts=europarl.df.target_texts.iloc[val_ids[:TEST_SIZE]],
    predict=lambda text: predict(text)
)
print(f'average BLEU on test set = {bleu.mean()}')

average BLEU on test set = 0.19960701983863563
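The `bleu_scores_europarl` helper is not shown in this notebook. As a reminder of what a BLEU score measures, here is a simplified sentence-level version in plain Python. The whitespace tokenization, the uniform weights up to 4-grams and the missing smoothing are my assumptions; the real helper may differ, so numbers from this sketch are not comparable to the average reported above.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    # Count all n-grams of the token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    ref, hyp = reference.split(), hypothesis.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference
        matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        if total == 0 or matches == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matches / total))
    # The brevity penalty punishes hypotheses that are shorter than the reference
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty
    return brevity * math.exp(sum(log_precisions) / max_n)


print(sentence_bleu("a b c d", "a b c d"))  # 1.0
print(sentence_bleu("a b c d", "w x y z"))  # 0.0
```

Without smoothing, every hypothesis with fewer than four tokens or without a single matching 4-gram scores 0, which is one reason why sentence-level BLEU averages tend to look low.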


# Conclusion

Most sentences are astonishingly well translated in my opinion. In the really long, convoluted sentences, you can see that the encoder/decoder model does not really have an understanding of grammar (neither in the input nor in the target language) and makes mistakes. Sometimes the translation is grammatically wrong and therefore hard to understand. More problematic is that a translation is sometimes readable but simply wrong (e.g. when a negation gets lost, as in 'it is important not to underestimate the work involved.' -> 'es ist wichtig, die arbeit zu unterschätzen.'). The translation/transliteration of unknown words (or of known words in the wrong places, "Mr. Miller" -> "Herr Müller") is problematic. It would be better to learn how to look them up or copy them. The repetitions are back, although they are more subtle now.

Anyway, I'm fascinated: this model really knows nothing about language, not even the concept of words, and it already works well on really complicated sentences. Even most humans would need several years of learning to do better.



## Further ways to improve

### Hyperparameters

- *learning rate*: The validation loss jumps around in the last epochs here (e.g. 1.6273415 after epoch 18, then 1.6293005 after epoch 19). This would be a good spot to decrease the learning rate, either manually or with an automatic decay, and to train longer from there.
- *latent\_dim*: For shorter sentences, there were only marginal improvements when using a larger latent embedding size than the one used here. It would be worth checking whether it helps here, but it would also multiply the training time.
- *LSTMs vs GRUs or stacking layers*: Here the byte-pair sequences of up to 170 tokens are much longer than what LSTMs/GRUs can remember (a sequence length of around 60-100). Again, it would be worth checking whether LSTMs instead of GRUs help here (at the cost of more parameters), or even adding another layer that works on a coarser level (like words).
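For the automatic learning rate decay mentioned in the list above, an exponential schedule would be one option. The sketch below only shows the formula in plain Python; as far as I know `tf.train.exponential_decay` computes the same expression. The start value 0.001 is the default of `tf.train.AdamOptimizer`, 13112 is the number of batches per epoch from the training log, and the decay rate of 0.5 is an arbitrary choice for illustration.

```python
def exponential_decay(initial_lr, step, decay_steps, decay_rate):
    """Learning rate after `step` batches: initial_lr * decay_rate ** (step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)


BATCHES_PER_EPOCH = 13112  # from the training log

# Learning rate at the start of each of the first four epochs
for epoch in range(4):
    step = epoch * BATCHES_PER_EPOCH
    print(epoch + 1, exponential_decay(0.001, step, BATCHES_PER_EPOCH, 0.5))
```

In practice it might be better to keep the learning rate constant for the first epochs and only start the decay once the validation loss begins to oscillate.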

For anything intended for practical use, we should play around with the hyperparameters here, as that needs no engineering effort, only (cloud) GPU resources and some time. For me there's not much to learn from it, and I don't want to spend much money, so I won't do any of it.

### More data

Getting more translation data should be relatively cheap, and it would definitely help the translations. In particular, the model performs terribly on my short self-made examples (with words and sentences very different from the corpus); this could easily be improved by additionally training on a simpler training set (like movie subtitles or whatever). Again, there's not much for me to learn from it. So, although it would improve the translation quality a lot in practice, I won't put energy into more data.


### Convolutional Architecture

CNNs can also be used for neural machine translation. Besides training more efficiently on a GPU, they also promise to track grammatical structures better (with more position-robustness within the sentence) and to keep a global view (both get lost in long, convoluted sentences). They would also be interesting for me, as I have rarely worked with computer vision models, and they require a different kind of fine-tuning (working with residuals, normalizations, ...). Although that's very interesting, I think I'll postpone this approach until I have studied and programmed several CNNs for CV.

### Copy-mechanism

It's possible to train a word alignment model (somewhat similar to the attention model) that predicts the relative positions of unknown words. Given such an UNK-alignment model, we can look up unknown words and either translate them with a dictionary or just copy them. That would be very helpful for the many names and entities in this dataset, and also for the many numbers (so they would not all need to be preprocessed to 0 here). It would also avoid garbling unknown words through the translation/transliteration that happens now (e.g. "mr tuomioja" becomes "herr tuioja"). I'm not sure how difficult it is to implement the copy mechanism here (especially as the model does not work on word level, but a copy mechanism would), but I had this approach in mind from the beginning, so I plan to implement it.
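To make the idea a bit more concrete, here is a toy sketch of the copy step in plain Python. It assumes a word-level model that emits an `<unk>` token and uses the attention weights as a stand-in for a separate alignment model. Both are assumptions, as the model in this notebook works on subwords and has no such token; the function name is made up as well.

```python
def copy_unknown_words(source_tokens, output_tokens, attention, unk="<unk>"):
    """Replace every unknown output token by the source token it attends to most."""
    result = []
    for token, weights in zip(output_tokens, attention):
        if token == unk:
            # Index of the source position with the highest alignment weight
            best = max(range(len(source_tokens)), key=lambda i: weights[i])
            token = source_tokens[best]
        result.append(token)
    return result


translated = copy_unknown_words(
    source_tokens=["mr", "tuomioja", "spoke"],
    output_tokens=["herr", "<unk>", "sprach"],
    attention=[[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],  # toy weights, one row per output token
)
print(translated)  # ['herr', 'tuomioja', 'sprach']
```

Instead of copying the source word, the lookup could also go through a dictionary first and only fall back to copying for names and numbers.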

### Pointer-Generator Networks

It's a technique from text summarization that also looks worth considering for this translation corpus, given the long, convoluted sentences. It's described e.g. in [Get To The Point: Summarization with Pointer-Generator Networks](https://arxiv.org/abs/1704.04368); I'll quote the abstract here:

> Neural sequence-to-sequence models have provided a viable new approach for abstractive text summarization (meaning they are not restricted to simply selecting and rearranging passages from the original text). However, these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. In this work we propose a novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways. First, we use a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator. Second, we use coverage to keep track of what has been summarized, which discourages repetition. We apply our model to the CNN / Daily Mail summarization task, outperforming the current abstractive state-of-the-art by at least 2 ROUGE points. 

We can see the mentioned repetitions in the translation model here, too.
It might be better to first get a feeling for pointer-generator networks on a summarization corpus, but I would like to implement the technique as well.
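The core of the pointer-generator idea is the mixture of two distributions: with probability `p_gen` the model generates a word from its vocabulary, otherwise it copies a source word according to the attention weights. Here is a minimal sketch of this mixing step in plain Python, following the formula of the paper as I understand it; the function name and the toy numbers are made up.

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix the generator distribution with the copy distribution over source tokens."""
    # Generator part: vocabulary probabilities scaled by p_gen
    mixed = {word: p_gen * prob for word, prob in vocab_dist.items()}
    # Pointer part: attention mass is added to the attended source tokens,
    # which also makes out-of-vocabulary source words producible
    for token, weight in zip(source_tokens, attention):
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * weight
    return mixed


dist = final_distribution(
    p_gen=0.5,
    vocab_dist={"herr": 0.5, "kollege": 0.5},
    attention=[0.25, 0.75],
    source_tokens=["herr", "tuomioja"],
)
print(dist)  # {'herr': 0.375, 'kollege': 0.25, 'tuomioja': 0.375}
```

In the toy example the name "tuomioja" is not part of the vocabulary, but still gets probability mass through the pointer part.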

### Coverage mechanism

To avoid repetitions, we can also calculate a coverage vector and then compute modified attention weights that respect the coverage vector. It's probably not that difficult and also something I'd like to try out.
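A sketch of the bookkeeping in plain Python, assuming the formulation of the pointer-generator paper: the coverage vector is the sum of all previous attention distributions, and attending again to already covered source positions is penalized by the sum of `min(attention, coverage)`. Feeding the coverage vector back into the attention scores is left out here, and the function name is made up.

```python
def coverage_loss(attention_steps):
    """Sum over decoder steps of sum_i min(attention_i, coverage_i)."""
    coverage = [0.0] * len(attention_steps[0])
    loss = 0.0
    for attention in attention_steps:
        # Penalize attention on positions that were already attended to
        loss += sum(min(a, c) for a, c in zip(attention, coverage))
        # Accumulate the attention of this step into the coverage vector
        coverage = [c + a for c, a in zip(coverage, attention)]
    return loss


# Attending to the same source position twice is penalized ...
print(coverage_loss([[1.0, 0.0], [1.0, 0.0]]))  # 1.0
# ... while moving on to the next position is not
print(coverage_loss([[1.0, 0.0], [0.0, 1.0]]))  # 0.0
```

Added to the training loss with a small weight, this term would push the decoder away from translating the same part of the sentence again and again.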

### NLU / dynamic neural network memory

This is something I also had in mind beforehand. Such memories are more important for dialogue managers, but they are probably the hottest, most magical state-of-the-art technique. Maybe I'll switch to a dialogue corpus or some toy problem that requires a lot of remembering first. It would help here too, but as the errors are more subtle here, it's more difficult to get started.

### Multi learning

The dataset is really well suited to learning translations for many different language pairs at once. With a simple encoder-into-one-big-state-that-is-more-or-less-language-independent/decode-from-that model, it would be trivial to implement (but then we'd have all the problems back from [a simple model](SimpleModelForMachineTranslation.ipynb)). Attention modelling is not language independent, so it might be necessary to translate into a language-independent code first and from this intermediate language into the target language. Maybe those dynamic neural network memories are needed, maybe not. I need to check research papers first.

Besides the big multi-language learning, learning anything else in addition would also help. If the model learned to do entity recognition or part-of-speech tagging, translations would get better, too. Of course, the hard part is to get similarly annotated data without cheating (using the Google NLP API or the like). I'm not sure whether NLTK can parse these complicated sentences correctly. A way to multi-learn which words should be marked for copying might be to compare multiple translations and look for words that are copy+pasted in all of them.

### Deploy

It would be fun to deploy the translation model on a website or as a translation bot somewhere.

## tl;dr

I guess I'll start implementing the copy mechanism next (and/or deploy the model somewhere).
 