# Examine Multiple Layers for Neural Machine Translation

After implementing [Attention](AttentionModelForMachineTranslationWithTensorflow.ipynb), I'll first come back to examine whether it's worth implementing multiple layers (of encoders and decoders) instead of only one. After all, deep learning is all about deep nets, so at some point I have to check this.

Again, I'll refactor the code a bit so we can concentrate on the stacking issues.

In [1]:
import re

import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.python.layers import core as layers_core
from tqdm import tqdm_notebook as tqdm

from utils.download import download_and_extract_resources
from utils.linguistic import bleu_scores_europarl, preprocess_input_europarl as preprocess
from utils.preparation import check_gpu_working, Europarl, RANDOM_STATE

check_gpu_working()

Fixed random seed to 42
Availabe devices: [name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 6811444189735086180
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 7689758311
locality {
  bus_id: 1
  links {
  }
}
incarnation: 8961736636428683247
physical_device_desc: "device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1"
]
Cuda/Cudnn/GPU works as intended


In [2]:
MAX_INPUT_LENGTH = 20 # 100
MAX_TARGET_LENGTH = 25 # 125
LATENT_DIM = 256  # was 512, but a smaller hidden representation should suffice because attention lets the decoder look back at the input as needed
LAYERS = 2
EPOCHS = 25 #20
BATCH_SIZE = 128  # was 64, but the TensorFlow implementation needs less GPU memory, so the batch size can be increased
DROPOUT = 0.25  # dropout on both the input and the output of the RNN cells, so the effective dropout is roughly 0.5, but this setup works slightly better
TEST_SIZE = 2500
BEAM_WIDTH = 5
EMBEDDING_TRAINABLE = True  # Improves results significantly, and at least it's not the dominant factor in training time (that's the output softmax layer)
WARMUP_EPOCHS = 10

## Download and explore data

In [3]:
europarl = Europarl()
download_and_extract_resources(fnames_and_urls=europarl.external_resources, dest_path=europarl.path)
europarl.load_and_preprocess(max_input_length=MAX_INPUT_LENGTH, max_target_length=MAX_TARGET_LENGTH)

de-en.tgz already downloaded (188.6 MB)
en.wiki.bpe.op5000.model already downloaded (0.3 MB)
en.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (6.2 MB)
de.wiki.bpe.op5000.model already downloaded (0.3 MB)
de.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (5.7 MB)
Total number of unfiltered translations 1920209
Filtered translations with length between (1, input=20/target=25) characters: 14943


In [4]:
europarl.df[['input_texts', 'target_texts']].head()

Unnamed: 0,input_texts,target_texts
67,agenda,arbeitsplan
704,what is the result?,was sind die folgen?
1261,with what aim?,zu welchem zweck?
1401,why?,wieso?
1403,no.,nein.


In [5]:
print("English subwords", europarl.bpe_input.sentencepiece.EncodeAsPieces("this is a test for pretrained bytepairembeddings"))
print("German subwords", europarl.bpe_target.sentencepiece.EncodeAsPieces("das ist ein test für vortrainierte zeichengruppen"))

English subwords ['▁this', '▁is', '▁a', '▁test', '▁for', '▁pre', 'tr', 'ained', '▁by', 'te', 'pa', 'ire', 'm', 'bed', 'd', 'ings']
German subwords ['▁das', '▁ist', '▁ein', '▁test', '▁für', '▁v', 'ort', 'rain', 'ierte', '▁zeich', 'eng', 'ruppen']
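
The `▁` prefix in the pieces marks the start of a new word, so the pieces can be glued back together without extra bookkeeping. A minimal sketch of that convention in plain Python, using the pieces printed above (`join_pieces` is a hypothetical helper for illustration, not part of sentencepiece):

```python
def join_pieces(pieces):
    """Concatenate subword pieces and turn the word-boundary marker into spaces."""
    return "".join(pieces).replace("▁", " ").strip()


english_pieces = ['▁this', '▁is', '▁a', '▁test', '▁for', '▁pre', 'tr', 'ained',
                  '▁by', 'te', 'pa', 'ire', 'm', 'bed', 'd', 'ings']
german_pieces = ['▁das', '▁ist', '▁ein', '▁test', '▁für', '▁v', 'ort', 'rain',
                 'ierte', '▁zeich', 'eng', 'ruppen']

print(join_pieces(english_pieces))
print(join_pieces(german_pieces))
```

Rare words like "bytepairembeddings" are split into many short pieces, while frequent words stay whole.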


In [6]:
# Those will be the inputs for the seq2seq model (that needs to know how long the sequences can get)
max_len_input = europarl.df.input_sequences.apply(len).max()
max_len_target = europarl.df.target_sequences.apply(len).max()
(max_len_input, max_len_target)

(15, 16)

In [7]:
train_ids, val_ids = train_test_split(np.arange(europarl.df.shape[0]), test_size=0.1, random_state=RANDOM_STATE)  # fixed random_state

In [20]:
tf.reset_default_graph()

with tf.device('/gpu:0'):

    encoder_inputs = tf.placeholder(
        shape=(None, None),  # batch_size x max_len_input
        dtype=tf.int32,
        name='encoder_inputs' 
    )
    batch_size = tf.shape(encoder_inputs)[0]
    beam_width = tf.placeholder_with_default(1, shape=[])
    dropout = tf.placeholder_with_default(tf.cast(0.0, tf.float32), shape=[])
    keep_prob = tf.cast(1.0, tf.float32) - dropout
    # embedding_trainable = tf.placeholder_with_default(EMBEDDING_TRAINABLE, shape=[])
    learning_rate = tf.placeholder_with_default(tf.cast(1e-3, tf.float32), shape=[])

    embedding_encoder = tf.get_variable(
        "embedding_encoder", 
        initializer=tf.constant(europarl.bpe_input.embedding_matrix),
        trainable=EMBEDDING_TRAINABLE,
    )
    encoder_emb_inp = tf.nn.embedding_lookup(
        embedding_encoder,
        encoder_inputs,
        name="encoder_emb_inp"
    )
    
    input_sequence_length = tf.placeholder(
        shape=(None, ),
        dtype=tf.int32,
        name='input_sequence_length'
    )
    
    rnn_cell_type = tf.nn.rnn_cell.GRUCell
    encoder_forward_cells = [tf.nn.rnn_cell.DropoutWrapper(
        rnn_cell_type(num_units=LATENT_DIM // 2, name=f'encoder_forward_cell{layer}'),
        input_keep_prob=keep_prob,
        output_keep_prob=keep_prob,  # state_keep_prob not set as it was not helpful here
        dtype=tf.float32,
    ) for layer in range(LAYERS)]
    encoder_backward_cells = [tf.nn.rnn_cell.DropoutWrapper(
        rnn_cell_type(num_units=LATENT_DIM // 2, name=f'encoder_backward_cell{layer}'),
        input_keep_prob=keep_prob,
        output_keep_prob=keep_prob,
        dtype=tf.float32,
    ) for layer in range(LAYERS)]
    encoder_outputs, encoder_output_state_fw, encoder_output_state_bw = tf.contrib.rnn.stack_bidirectional_dynamic_rnn(
        encoder_forward_cells, encoder_backward_cells,
        inputs=encoder_emb_inp,
        sequence_length=input_sequence_length,
        time_major=False,
        dtype=tf.float32,
    )
    encoder_state = tf.concat([encoder_output_state_fw[-1], encoder_output_state_bw[-1]], -1)
    
    # Regarding time_major:
    # If true, these `Tensors` must be shaped `[max_time, batch_size, depth]`.
    # If false, these `Tensors` must be shaped `[batch_size, max_time, depth]`.
    # Using `time_major = True` is a bit more efficient because it avoids
    # transposes at the beginning and end of the RNN calculation.  However,
    # most TensorFlow data is batch-major, so by default this function
    # accepts input and emits output in batch-major form.
    #
    # for simplicity I work with batch major here instead of time_major
    # so I don't need to transpose inputs and transpose back for attention mechanism
    
    decoder_inputs = tf.placeholder(
        shape=(None, None),  # batch_size x max_len_target
        dtype=tf.int32,
        name='decoder_inputs' 
    )
    embedding_decoder = tf.get_variable(
        "embedding_decoder", 
        initializer=tf.constant(europarl.bpe_target.embedding_matrix),
        trainable=EMBEDDING_TRAINABLE,
    )
    decoder_emb_inp = tf.nn.embedding_lookup(
        embedding_decoder,
        decoder_inputs,
        name="decoder_emb_inp"
    )
    
    target_sequence_length = tf.placeholder(
        shape=(None, ),
        dtype=tf.int32,
        name='target_sequence_length'
    )
    
    # tiling is necessary to work with BeamSearchDecoder
    # read carefully the NOTE on constructor in
    # https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/AttentionWrapper 
    tiled_encoder_outputs = tf.contrib.seq2seq.tile_batch(encoder_outputs, multiplier=beam_width)
    tiled_encoder_state = tf.contrib.seq2seq.tile_batch(encoder_state, multiplier=beam_width)
    tiled_sequence_length = tf.contrib.seq2seq.tile_batch(input_sequence_length, multiplier=beam_width)
    
    attention_mechanism = tf.contrib.seq2seq.LuongAttention(
        LATENT_DIM,
        memory=tiled_encoder_outputs,
        memory_sequence_length=tiled_sequence_length,
        dtype=tf.float32,
        name='attention_mechanism',
    )
    decoder_rnn_cells = [rnn_cell_type(num_units=LATENT_DIM, name=f'decoder_cell{layer}') for layer in range(LAYERS)]
    def residual_fn(inputs, outputs):
        tf.contrib.framework.nest.assert_same_structure(inputs, outputs)
        inputs_without_attention = tf.slice(inputs, [0, 0], [batch_size, LATENT_DIM])
        return tf.contrib.framework.nest.map_structure(lambda inp, out: inp + out, inputs_without_attention, outputs) 
    for layer in range(1, LAYERS):
        decoder_rnn_cells[layer] = tf.contrib.rnn.ResidualWrapper(
            decoder_rnn_cells[layer],
            residual_fn=residual_fn
        )
    decoder_rnn_cells = [tf.nn.rnn_cell.DropoutWrapper(
        cell,
        input_keep_prob=keep_prob,
        output_keep_prob=keep_prob,
        dtype=tf.float32,
    ) for cell in decoder_rnn_cells]
    attention_cells = [tf.contrib.seq2seq.AttentionWrapper(
        cell,
        attention_mechanism,
        attention_layer_size=LATENT_DIM,
        name=f'attention_wrapper{layer}',
    ) for layer, cell in enumerate(decoder_rnn_cells)] 
    decoder_cell = tf.contrib.rnn.MultiRNNCell(attention_cells)

    training_helper = tf.contrib.seq2seq.TrainingHelper(
        inputs=decoder_emb_inp, 
        sequence_length=target_sequence_length,
        time_major=False,
        name="decoder_training_helper",
    )
    
    projection_layer = layers_core.Dense(
        units=len(europarl.bpe_target.tokens),
        use_bias=False,
        name='projection_layer',
    )
    
    initial_state = tuple(
        attention_cells[0].zero_state(dtype=tf.float32, batch_size=batch_size).clone(
            cell_state=encoder_state
        )
        for _ in range(LAYERS)
    )

    decoder = tf.contrib.seq2seq.BasicDecoder(
        cell=decoder_cell,
        helper=training_helper,
        initial_state=initial_state,
        output_layer=projection_layer,
    )
    outputs, _final_state, _final_sequence_length = tf.contrib.seq2seq.dynamic_decode(
        decoder,
        output_time_major=False,
        impute_finished=False,
    )
    logits = outputs.rnn_output
    
    decoder_outputs = tf.placeholder(
        shape=(None, None),  # batch_size x max_len_target
        dtype=tf.int32,
        name='decoder_outputs',
    )
    target_weights = tf.cast(tf.sequence_mask(target_sequence_length), dtype=tf.float32)
    l2_lambda = tf.placeholder_with_default(tf.cast(1e-5, tf.float32), shape=[])
    loss_l2 = tf.add_n([
        tf.nn.l2_loss(v)
        for v in tf.trainable_variables()
        if not re.match(r'embedding_(de|en)coder', v.name)  # don't regularize embeddings
    ])
    train_loss = tf.contrib.seq2seq.sequence_loss(logits, decoder_outputs, target_weights) + l2_lambda * loss_l2

    params = tf.trainable_variables()
    gradients = tf.gradients(train_loss, params)
    clipped_gradients, _ = tf.clip_by_global_norm(
        t_list=gradients,
        clip_norm=1.,
    )
    
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    update_step = optimizer.apply_gradients(zip(clipped_gradients, params))
    
    params_without_embeddings = [v for v in tf.trainable_variables() if not re.match(r'embedding_(de|en)coder', v.name)]
    gradients_without_embeddings = tf.gradients(train_loss, params_without_embeddings)
    clipped_gradients_without_embeddings, _ = tf.clip_by_global_norm(
        t_list=gradients_without_embeddings,
        clip_norm=1.,
    )
    update_step_without_embeddings = optimizer.apply_gradients(zip(clipped_gradients_without_embeddings, params_without_embeddings))
    
    inference_decoder_initial_state = tuple(
        attention_cells[0].zero_state(
            dtype=tf.float32,
            batch_size=batch_size * beam_width  # tricky and somewhat unintuitive, but necessary
        ).clone(
            cell_state=tiled_encoder_state
        ) for _ in range(LAYERS)
    )
    inference_decoder = tf.contrib.seq2seq.BeamSearchDecoder(
        cell=decoder_cell,
        embedding=embedding_decoder,
        start_tokens=tf.fill([batch_size], europarl.bpe_target.start_token_idx),
        end_token=europarl.bpe_target.stop_token_idx,
        initial_state=inference_decoder_initial_state,
        beam_width=BEAM_WIDTH,
        output_layer=projection_layer,
        length_penalty_weight=1.0,  # https://machinelearningmastery.com/configure-encoder-decoder-model-neural-machine-translation/
    )

    
    inference_outputs, _inference_final_state, _inference_final_sequence_length = tf.contrib.seq2seq.dynamic_decode(
        inference_decoder,
        maximum_iterations=tf.round(tf.reduce_max(input_sequence_length) * 2),  # a bit more flexible than max_len_target
        impute_finished=False,
    )

ValueError: Shapes (?, 256) and (?, 128) are incompatible
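
The tiling for the `BeamSearchDecoder` is easier to picture with a toy example. As the `tile_batch` documentation describes it, every batch entry is repeated `multiplier` times in a row, which is why the inference initial state is built for `batch_size * beam_width` rows. A NumPy sketch of the same layout (`tile_batch_np` is a hypothetical stand-in, not the TensorFlow function, and the numbers are made up):

```python
import numpy as np


def tile_batch_np(x, multiplier):
    """Repeat every batch entry `multiplier` times along axis 0 (hypothetical helper)."""
    return np.repeat(x, multiplier, axis=0)


encoder_state = np.array([[0.1, 0.2], [0.3, 0.4]])  # batch_size=2, depth=2
input_lengths = np.array([3, 5])                    # one length per batch entry

tiled_state = tile_batch_np(encoder_state, 3)       # beam width of 3
tiled_lengths = tile_batch_np(input_lengths, 3)

print(tiled_state.shape)  # (6, 2)
print(tiled_lengths)      # [3 3 3 5 5 5]
```

Each sentence gets `beam_width` consecutive rows, so every beam of a sentence attends over the same encoder outputs and starts from the same encoder state.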

In [9]:
def run_train_batch(batch_ids, epoch):
    batch_input_sequences = europarl.df.input_sequences.iloc[batch_ids]
    batch_input_lengths = batch_input_sequences.apply(len)
    batch_target_sequences = europarl.df.target_sequences.iloc[batch_ids]
    batch_target_lengths = batch_target_sequences.apply(len) - 1

    batch_input_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_input_sequences,
        maxlen=max_len_input,
        dtype=int,
        padding='post'
    )
    batch_target_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_target_sequences,
        maxlen=max_len_target,
        dtype=int,
        padding='post'
    )
    lr = 1e-3 / 2 + (epoch / WARMUP_EPOCHS) * (1e-3 / 2) if epoch <= WARMUP_EPOCHS else 1e-3 * (0.96 ** (epoch - WARMUP_EPOCHS))
    pred, loss, _ = sess.run(
        fetches=[
            outputs, train_loss, update_step if epoch >= WARMUP_EPOCHS else update_step_without_embeddings
        ],
        feed_dict={
            encoder_inputs: batch_input_padded,
            input_sequence_length: np.array(batch_input_lengths),
            decoder_inputs: batch_target_padded[:, :batch_target_lengths.max()],
            target_sequence_length: np.array(batch_target_lengths),
            decoder_outputs: batch_target_padded[:, 1:batch_target_lengths.max() + 1],
            dropout: DROPOUT,
            # embedding_trainable: epoch >= WARMUP_EPOCHS,
            learning_rate: lr
        }
    )
    return loss, lr

def run_val_batch(batch_ids):
    batch_input_sequences = europarl.df.input_sequences.iloc[batch_ids]
    batch_input_lengths = batch_input_sequences.apply(len)
    batch_target_sequences = europarl.df.target_sequences.iloc[batch_ids]
    batch_target_lengths = batch_target_sequences.apply(len) - 1

    batch_input_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_input_sequences,
        maxlen=max_len_input,
        dtype=int,
        padding='post'
    )
    batch_target_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_target_sequences,
        maxlen=max_len_target,
        dtype=int,
        padding='post'
    )
    loss = sess.run(
        fetches=[train_loss],
        feed_dict={
            encoder_inputs: batch_input_padded,
            input_sequence_length: np.array(batch_input_lengths),
            decoder_inputs: batch_target_padded[:, :batch_target_lengths.max()],
            target_sequence_length: np.array(batch_target_lengths),
            decoder_outputs: batch_target_padded[:, 1:batch_target_lengths.max() + 1],
        }
    )
    return loss

def run_validation_loss():
    return np.mean([
        run_val_batch(ids)
        for ids 
        in np.array_split(val_ids, np.ceil(len(val_ids) / BATCH_SIZE))
    ])
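
The slicing in `run_train_batch` and `run_val_batch` implements teacher forcing: the decoder is fed the target sequence without its last token and should predict the same sequence shifted left by one. A toy NumPy sketch of that shift and the matching loss mask, assuming every target sequence starts with a start token and ends with a stop token (the ids 1, 2 and 0 below are made up for illustration):

```python
import numpy as np

# Hypothetical token ids: 1 = start, 2 = stop, 0 = padding.
targets = [[1, 7, 8, 2], [1, 9, 2]]
lengths = np.array([len(t) for t in targets]) - 1  # decoder steps per sentence

max_len = max(len(t) for t in targets)
padded = np.zeros((len(targets), max_len), dtype=int)
for row, seq in enumerate(targets):
    padded[row, :len(seq)] = seq  # same effect as padding='post'

steps = lengths.max()
decoder_inputs = padded[:, :steps]        # what the decoder is fed
decoder_outputs = padded[:, 1:steps + 1]  # what it should predict

# Same idea as tf.sequence_mask: 1.0 for real steps, 0.0 for padding.
mask = (np.arange(steps)[None, :] < lengths[:, None]).astype(np.float32)

print(decoder_inputs)
print(decoder_outputs)
print(mask)
```

The last step of the shorter sentence is masked out, so padding positions don't contribute to the loss.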

In [10]:
config = tf.ConfigProto(
    allow_soft_placement=True,  # needed as recommendation from https://github.com/tensorflow/tensorflow/issues/2292
    log_device_placement=True,
)
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

batches_per_epoch = np.ceil(len(train_ids) / BATCH_SIZE)
for epoch in range(EPOCHS):
    shuffled_ids = np.random.permutation(train_ids)
    batch_splits = np.array_split(shuffled_ids, batches_per_epoch)
    train_losses = []
    N = len(batch_splits)
    with tqdm(batch_splits, desc=f"Epoch {epoch+1}") as t:
        for train_batch_ids in t:
            batch_loss, lr = run_train_batch(train_batch_ids, epoch=epoch)
            train_losses.append(batch_loss)
            t.set_postfix(train_loss=np.mean(train_losses))
        print(f"learning rate={lr:.6}, train_loss={np.mean(train_losses):.6}, val_loss={run_validation_loss():.6}")
        
validation_input_sequences = europarl.df.input_sequences.iloc[val_ids[:BATCH_SIZE]]
validation_input_lengths = validation_input_sequences.apply(len)

validation_input_padded = tf.keras.preprocessing.sequence.pad_sequences(
    validation_input_sequences,
    maxlen=max_len_input,
    dtype=int,
    padding='post'
)

Epoch 1: learning rate=0.0005, train_loss=4.43282, val_loss=3.01002
Epoch 2: learning rate=0.00055, train_loss=2.60096, val_loss=2.41777
Epoch 3: learning rate=0.0006, train_loss=2.15044, val_loss=2.14528
Epoch 4: learning rate=0.00065, train_loss=1.90661, val_loss=1.9993
Epoch 5: learning rate=0.0007, train_loss=1.74073, val_loss=1.89794
Epoch 6: learning rate=0.00075, train_loss=1.61326, val_loss=1.82161
Epoch 7: learning rate=0.0008, train_loss=1.50433, val_loss=1.75017
Epoch 8: learning rate=0.00085, train_loss=1.41186, val_loss=1.70891
Epoch 9: learning rate=0.0009, train_loss=1.33869, val_loss=1.65817
Epoch 10: learning rate=0.00095, train_loss=1.26279, val_loss=1.63562
Epoch 11: learning rate=0.001, train_loss=1.18154, val_loss=1.54836
Epoch 12: learning rate=0.00096, train_loss=1.03525, val_loss=1.51307
Epoch 13: learning rate=0.0009216, train_loss=0.940898, val_loss=1.49401
Epoch 14: learning rate=0.000884736, train_loss=0.867922, val_loss=1.47633
Epoch 15: learning rate=0.000849347, train_loss=0.806498, val_loss=1.4843
Epoch 16: learning rate=0.000815373, train_loss=0.754629, val_loss=1.48139
Epoch 17: learning rate=0.000782758, train_loss=0.709191, val_loss=1.48578
Epoch 18: learning rate=0.000751447, train_loss=0.672443, val_loss=1.48585
Epoch 19: learning rate=0.00072139, train_loss=0.636069, val_loss=1.50003
Epoch 20: learning rate=0.000692534, train_loss=0.60938, val_loss=1.52113
Epoch 21: learning rate=0.000664833, train_loss=0.585555, val_loss=1.51313
Epoch 22: learning rate=0.000638239, train_loss=0.561859, val_loss=1.54346
Epoch 23: learning rate=0.00061271, train_loss=0.546097, val_loss=1.53758
Epoch 24: learning rate=0.000588201, train_loss=0.525103, val_loss=1.54961
Epoch 25: learning rate=0.000564673, train_loss=0.5082, val_loss=1.55562
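
The learning rates in this log follow the warm-up and decay rule inside `run_train_batch`. A small standalone sketch of that rule makes it easier to check; the function name `learning_rate_for_epoch` and the parameter names `base_lr` and `decay` are only illustrative, while the constants are the ones used above:

```python
WARMUP_EPOCHS = 10


def learning_rate_for_epoch(epoch, base_lr=1e-3, decay=0.96, warmup_epochs=WARMUP_EPOCHS):
    """Linear warm-up from base_lr / 2 to base_lr, then exponential decay per epoch."""
    if epoch <= warmup_epochs:
        return base_lr / 2 + (epoch / warmup_epochs) * (base_lr / 2)
    return base_lr * decay ** (epoch - warmup_epochs)


# `epoch` is zero-based, as in the training loop; the log counts from 1.
for epoch in (0, 1, 10, 11, 24):
    print(f"Epoch {epoch + 1}: learning rate={learning_rate_for_epoch(epoch):.6}")
```

For epochs 1, 2, 11, 12 and 25 this gives 0.0005, 0.00055, 0.001, 0.00096 and about 0.000564673, the same values as in the log. During the first `WARMUP_EPOCHS` epochs the training loop also uses `update_step_without_embeddings`, so the pretrained embeddings only start to change once the learning rate has reached its peak.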


In [11]:
def predict(sentence):
    sequenced = europarl.bpe_input.subword_indices(preprocess(sentence))
    padded = tf.keras.preprocessing.sequence.pad_sequences(
        [sequenced],
        maxlen=max_len_input,
        dtype=int,
        padding='post'
    )
    
    beam_search_output = sess.run(
        fetches=[inference_outputs],
        feed_dict={
            encoder_inputs: padded,
            input_sequence_length: [len(sequenced)],
            beam_width: BEAM_WIDTH,
        }
    )[0]
    
    return europarl.bpe_target.sentencepiece.DecodePieces([
        europarl.bpe_target.tokens[idx] for idx in beam_search_output.predicted_ids[0, :, 0].tolist()
    ])


In [12]:
name = 'tfattentionmodel_2layers'

saver = tf.train.Saver()
saver.save(sess, f"data/{name}.ckpt")
# tfattentionmodel.ckpt.index https://drive.google.com/open?id=1Xv2Qc9Gnac_Of9bIpQef5LZJnX9ij933
# tfattentionmodel.ckpt.meta https://drive.google.com/open?id=162WY8XEjyqfvitiIBfmq9oHLC1rQhyvr
# tfattentionmodel.ckpt.data-00000-of-00001 https://drive.google.com/open?id=1racW0agvk5nJ_xBaxifAaEWM8T9ot7FL

'data/tfattentionmodel_2layers.ckpt'

In [13]:
# Performance on some examples:
EXAMPLES = [
    'Hello.',
    'You are welcome.',
    'How do you do?',
    'I hate mondays.',
    'I am a programmer.',
    'Data is the new oil.',
    'It could be worse.',
    "I am on top of it.",
    "N° Uno",
    "Awesome!",
    "Put your feet up!",
    "From the start till the end!",
    "From dusk till dawn.",
]
for en in [sentence + '\n' for sentence in EXAMPLES]:
    print(f"{preprocess(en)!r} --> {predict(en)!r}")

'hello.' --> 'hallo!'
'you are welcome.' --> 'sie sind begrüßen.'
'how do you do?' --> 'wie meinen sie?'
'i hate mondays.' --> 'ich meine, ja.'
'i am a programmer.' --> 'ich bin ein bäug.'
'data is the new oil.' --> 'die realität ist die.'
'it could be worse.' --> 'das muss nicht sein.'
'i am on top of it.' --> 'ich habe das leid.'
'n° uno' --> 'ärmelaltalta'
'awesome!' --> 'einverstanden!'
'put your feet up!' --> 'ihr wort sich sich sich!'
'from the start till the end!' --> 'der kampf geht!'
'from dusk till dawn.' --> 'viel dank.'


In [14]:
# Performance on training set:
for en, de in europarl.df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

Original 'what is the result?', got 'was ist das ergebnis?', exp: 'was sind die folgen?'
Original 'with what aim?', got 'zu welchem zweck?', exp: 'zu welchem zweck?'
Original 'why?', got 'warum?', exp: 'wieso?'
Original 'no.', got 'nein.', exp: 'nein.'
Original 'just like europol.', got 'so weit, so europol.', exp: 'genau wie europol.'
Original 'vote', got 'abstimmungen', exp: 'abstimmungen'
Original 'why not?', got 'warum nicht?', exp: 'warum?'
Original 'and now the erika.', got 'und die zeit drson.', exp: 'und nun erika.'
Original 'they want answers.', got 'sie wollen antworten.', exp: 'sie wollen antworten.'
Original 'storms in europe', got 'stärme europa', exp: 'stürme in europa'
Original 'food safety', got 'lebensmittelicherheit', exp: 'lebensmittelsicherheit'
Original 'first part', got 'erster teil', exp: 'teil i'
Original 'if not, why not?', got 'wenn nicht, warum nicht?', exp: 'wenn nicht, warum nicht?'
Original 'second part', got 'teil ii', exp: 'teil ii'
Original '0 discharge

In [15]:
# Performance on validation set
val_df = europarl.df.iloc[val_ids]
for en, de in val_df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

Original 'yes.', got 'ja.', exp: 'ja.'
Original 'why?', got 'warum?', exp: 'warum?'
Original '(loud applause)', got '(lebhafter beifall)', exp: '(lebhafter beifall)'
Original 'loud applause', got 'lebhafter beifall', exp: 'lebhafter beifall'
Original 'why?', got 'warum?', exp: 'warum?'
Original 'president.', got 'der präsident.', exp: 'die präsidentin.'
Original 'consumer protection', got 'verbraucherschutz', exp: 'verbraucherschutz'
Original 'biocidal products', got 'biozidprodukte', exp: 'biozid-produkte'
Original '(applause)', got '(beifall)', exp: '(beifall)'
Original 'applause', got 'beifall', exp: 'beifall'
Original 'tempus fugit!', got 'wirft fatuhush!', exp: 'die zeit drängt!'
Original 'hence this debate.', got 'dies muss sein sein.', exp: 'deshalb diese debatte.'
Original '(applause)', got '(beifall)', exp: '(beifall)'
Original 'maes (verts/ale).', got 'maes (verts/ale).', exp: 'maes (verts/ale).'
Original '(applause)', got '(beifall)', exp: '(beifall)'
Original 'how can this 

In [16]:
bleu = bleu_scores_europarl(
    input_texts=europarl.df.input_texts.iloc[val_ids[:TEST_SIZE]],
    target_texts=europarl.df.target_texts.iloc[val_ids[:TEST_SIZE]],
    predict=lambda text: predict(text)
)
print(f'average BLEU on test set = {bleu.mean()}')


average BLEU on test set = 0.3452285046600621
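
As a rough illustration of what a sentence-level BLEU score measures, here is a generic, unsmoothed sketch in plain Python. It is not the code behind `bleu_scores_europarl`, which may tokenize and smooth differently; `sentence_bleu` and `ngrams` are hypothetical names, and whitespace tokenization is an assumption:

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed BLEU: geometric mean of clipped n-gram precisions times brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each n-gram count by how often it occurs in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    brevity_penalty = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


print(sentence_bleu("wenn nicht , warum nicht ?", "wenn nicht , warum nicht ?"))
print(sentence_bleu("ja .", "ja ."))
```

A perfect match with at least four tokens scores 1.0, but a perfect two-token match like "ja ." scores 0.0 here because it contains no 3-grams or 4-grams. With sentences as short as the ones in this subset, the average therefore depends heavily on how the scoring handles such cases.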


# Conclusion

The translations for the training and validation sets are decent. The main problem without attention was that the RNN decoder repeated itself and lost track of what it intended to say (like "leider geht es nur um die realität und die realität."). The BLEU score also improved from 0.183 to 0.237.

Now, the main problems are out-of-vocabulary words (meaning words not seen often enough in the training set). Some of these problems will go away if we use the whole training data set instead of a subset. But in the long run, we might need a copy mechanism, a lookup method and/or a dynamic neural network memory. Also, some translations contain grammatical mistakes. I can imagine that multi-task learning (for example, also learning grammatical features or entity recognition, maybe in combination with some data augmentation) would help. It's also somewhat funny to see how accurate the translations are for complicated sentences from the validation set, while for my own very simple examples, which differ from Europarl discussions, the translations are terrible.

Regarding TensorFlow, although the code is a bit longer, it doesn't take much energy to figure out what's going on. Every line manipulates the computational graph without many surprises. Although debugging is still a hassle (cryptic error messages and no direct evaluation), it's not as mind-boggling as working on different abstraction levels with Keras, IMHO. It's also nice to see that a lot of state-of-the-art technology from research is already implemented in TensorFlow. The computational speed went up a lot here, too. I think the main reason is that the TensorFlow implementation only trains until the end of each sentence and not beyond (whereas the Keras implementation only masked those positions so that the loss function ignored them).

My next step will be to train the neural machine translation model on the whole Europarl dataset (2.3 Mio > 600k) and also to train for a longer time (even here the training did not converge after 20 epochs).