# Training on a large dataset with an attention model

After implementing [beam search on a large dataset](BeamSearchOnLargeDataset.ipynb), I'll now add an attention model.
As the training set I use the [European Parliament Proceedings Parallel Corpus 1996-2011](http://statmt.org/europarl/).

I first intended to implement this with `Keras` as well. However, there is no built-in implementation of an attention layer or an attention decoder (it is planned at the moment). There are several projects such as [keras-attention](https://github.com/datalogue/keras-attention) or the slightly modified [keras-monotonic-attention](https://github.com/andreyzharkov/keras-monotonic-attention) (which works considerably better). There is also the [seq2seq project](https://github.com/farizrahman4u/seq2seq), which has a lot of open issues, and a promising-looking [NMT Keras](https://nmt-keras.readthedocs.io/en/latest/), for which I could not get all dependencies installed. I wouldn't mind reimplementing it on my own (or improving one of these projects), since I am doing this project for learning purposes anyway.

After a while, though, I found this approach really cumbersome. I don't like switching back and forth between the high-level layers of `Keras` and `keras.backend`, where I have to implement pretty much everything at a low level (in a different way than in usual `Keras`) and, in addition, as object-oriented extensions of layers, cells and so on. All parameters have to be passed along; since GRU/LSTM have to be changed, they also have to be reimplemented; vectorization needs custom time-distributed layers; and the masking layer won't work with additional inputs such as the weighted context vector, so masking has to be reimplemented as well. Just look into these projects: there is a lot of noisy code in there that distracts from the original algorithm.

Attention as a basic idea is pretty simple: while decoding, we look back at the encoded states, weighted depending on the current decoding position, the previous (or current) hidden state of the decoder, and possibly the previous alignment. We build a context vector from them and use it as an additional input (besides the last generated token) to the decoder. For performance, we might look only at the local hidden states around a predicted alignment position (which can be linear in the simplest form, but is usually also learnable). That's not so tough to express as a direct computation, but since in Keras we never work with the computation graph directly, it's harder than it should be.
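
To make the idea concrete, here is a minimal sketch of the simplest variant (dot-product scores over all encoder states) in plain Python. It is only an illustration under my own assumptions: the function names `softmax` and `dot_attention` and the toy vectors are made up, and real implementations add learnable weights and work on batched tensors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(decoder_state, encoder_states):
    """Return (alignment weights, context vector) for one decoding step."""
    # Score every encoder state against the current decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    # The context vector is the weighted sum of the encoder states.
    dim = len(decoder_state)
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context

# Toy example: three encoder states of dimension 2 (hypothetical values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = dot_attention(decoder_state, encoder_states)
print(weights, context)
```

The context vector would then be fed to the decoder together with the embedding of the last generated token.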

With TensorFlow we're closer to research, and as I intended to use multiple frameworks anyway, I'll now follow the [seq2seq tutorial from TensorFlow](https://www.tensorflow.org/tutorials/seq2seq). So this notebook also contains a TensorFlow implementation of the raw seq2seq model and beam search. I also refactored all the preparation work into a module.

In [1]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 13627177616749298348
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 7808362087
locality {
  bus_id: 1
  links {
  }
}
incarnation: 18341031548040615031
physical_device_desc: "device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1"
]


In [2]:
import tensorflow as tf
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

with tf.Session() as sess:
    print(sess.run(c))

[[22. 28.]
 [49. 64.]]


In [3]:
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.python.layers import core as layers_core
from tqdm import tqdm_notebook as tqdm

from utils.download import download_and_extract_resources
from utils.linguistic import bleu_scores_europarl, preprocess_input_europarl as preprocess
from utils.preparation import Europarl, RANDOM_STATE

Fixed random seed to 42


Using TensorFlow backend.


In [4]:
MAX_INPUT_LENGTH = 50 #100  # was 50
MAX_TARGET_LENGTH = 65 # 125  # was 65
LATENT_DIM = 512  # 256  # was 512, but a smaller hidden representation should suffice, as we look back at the encoder states as needed anyway
EPOCHS = 20
BATCH_SIZE = 128
DROPOUT = 0.25
TEST_SIZE = 2500
BEAM_WIDTH = 5
EMBEDDING_TRAINABLE = True  # Improves results significantly, and it is not the dominant factor in training time (that is the output softmax layer)

## Download and explore data

In [5]:
europarl = Europarl()
download_and_extract_resources(fnames_and_urls=europarl.external_resources, dest_path=europarl.path)

de-en.tgz already downloaded (188.6 MB)
en.wiki.bpe.op5000.model already downloaded (0.3 MB)
en.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (6.2 MB)
de.wiki.bpe.op5000.model already downloaded (0.3 MB)
de.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (5.7 MB)


In [6]:
europarl.load_and_preprocess(max_input_length=MAX_INPUT_LENGTH, max_target_length=MAX_TARGET_LENGTH)

Total number of unfiltered translations 1920209
Filtered translations with length between (1, input=50/target=65) characters: 167211


In [7]:
europarl.df.head()

| | input_texts | target_texts | input_length | target_length | input_sequences | target_sequences |
|---|---|---|---|---|---|---|
| 0 | resumption of the session | wiederaufnahme der sitzungsperiode | 25 | 34 | [1, 344, 146, 498, 90, 6, 3, 3235, 90, 2] | [1, 247, 351, 750, 5, 934, 43, 3158, 4762, 2] |
| 5 | please rise, then, for this minute' s silence. | ich bitte sie, sich zu einer schweigeminute zu... | 46 | 55 | [1, 3005, 416, 77, 359, 4, 241, 4, 17, 76, 451... | [1, 241, 156, 72, 3112, 54, 4, 39, 26, 95, 473... |
| 6 | (the house rose and observed a minute' s silence) | (das parlament erhebt sich zu einer schweigemi... | 49 | 52 | [1, 29, 140, 414, 3231, 8, 3106, 2484, 9, 451,... | [1, 35, 2444, 2269, 2109, 625, 39, 26, 95, 473... |
| 7 | madam president, on a point of order. | frau präsidentin, zur geschäftsordnung. | 37 | 39 | [1, 1599, 134, 546, 4, 19, 9, 918, 6, 535, 5, 2] | [1, 1161, 2266, 52, 4, 132, 2232, 1516, 3, 2] |
| 13 | madam president, on a point of order. | frau präsidentin, zur geschäftsordnung. | 37 | 39 | [1, 1599, 134, 546, 4, 19, 9, 918, 6, 535, 5, 2] | [1, 1161, 2266, 52, 4, 132, 2232, 1516, 3, 2] |


In [8]:
print("English subwords", europarl.bpe_input.sentencepiece.EncodeAsPieces("this is a test for pretrained bytepairembeddings"))
print("German subwords", europarl.bpe_target.sentencepiece.EncodeAsPieces("das ist ein test für vortrainierte zeichengruppen"))

English subwords ['▁this', '▁is', '▁a', '▁test', '▁for', '▁pre', 'tr', 'ained', '▁by', 'te', 'pa', 'ire', 'm', 'bed', 'd', 'ings']
German subwords ['▁das', '▁ist', '▁ein', '▁test', '▁für', '▁v', 'ort', 'rain', 'ierte', '▁zeich', 'eng', 'ruppen']


In [9]:
# Those will be the inputs for the seq2seq model (that needs to know how long the sequences can get)
max_len_input = europarl.df.input_sequences.apply(len).max()
max_len_target = europarl.df.target_sequences.apply(len).max()
(max_len_input, max_len_target)

(27, 40)

In [10]:
train_ids, val_ids = train_test_split(np.arange(europarl.df.shape[0]), test_size=0.1, random_state=RANDOM_STATE)  # fixed random_state

In [11]:
TIME_MAJOR = False

tf.reset_default_graph()

with tf.device('/gpu:0'):

    encoder_inputs = tf.placeholder(
        shape=(None, None),  # batch_size x max_len_input
        dtype=tf.int32,
        name='encoder_inputs' 
    )
    batch_size = tf.shape(encoder_inputs)[0]
    
    dropout = tf.placeholder_with_default(tf.cast(0.0, tf.float32), shape=[])
    keep_prob = tf.cast(1.0, tf.float32) - dropout

    embedding_encoder = tf.get_variable(
        "embedding_encoder", 
        initializer=tf.constant(europarl.bpe_input.embedding_matrix),
        trainable=EMBEDDING_TRAINABLE,
    )
    encoder_emb_inp = tf.nn.embedding_lookup(
        embedding_encoder,
        encoder_inputs,
        name="encoder_emb_inp"
    )
    
    input_sequence_length = tf.placeholder(
        shape=(None, ),
        dtype=tf.int32,
        name='input_sequence_length'
    )
    
    rnn_cell_type = tf.nn.rnn_cell.GRUCell
    encoder_forward_cell = tf.nn.rnn_cell.DropoutWrapper(
        rnn_cell_type(num_units=LATENT_DIM // 2, name='encoder_forward_cell'),
        input_keep_prob=keep_prob,
        output_keep_prob=keep_prob,
        #state_keep_prob=keep_prob,
        dtype=tf.float32,
    )
    encoder_backward_cell = tf.nn.rnn_cell.DropoutWrapper(
        rnn_cell_type(num_units=LATENT_DIM // 2, name='encoder_backward_cell'),
        input_keep_prob=keep_prob,
        output_keep_prob=keep_prob,
        #state_keep_prob=keep_prob,
        dtype=tf.float32,
    )
    encoder_bi_outputs, encoder_bi_state = tf.nn.bidirectional_dynamic_rnn(
        encoder_forward_cell, encoder_backward_cell,
        inputs=encoder_emb_inp,
        sequence_length=input_sequence_length,
        time_major=TIME_MAJOR,
        dtype=tf.float32,
    )
    encoder_outputs = tf.concat(encoder_bi_outputs, -1)
    encoder_state = tf.concat(encoder_bi_state, -1)
    
    # Regarding time_major:
    # If true, these `Tensors` must be shaped `[max_time, batch_size, depth]`.
    # If false, these `Tensors` must be shaped `[batch_size, max_time, depth]`.
    # Using `time_major = True` is a bit more efficient because it avoids
    # transposes at the beginning and end of the RNN calculation.  However,
    # most TensorFlow data is batch-major, so by default this function
    # accepts input and emits output in batch-major form.
    
    decoder_inputs = tf.placeholder(
        shape=(None, None),  # batch_size x max_len_target
        dtype=tf.int32,
        name='decoder_inputs' 
    )
    embedding_decoder = tf.get_variable(
        "embedding_decoder", 
        initializer=tf.constant(europarl.bpe_target.embedding_matrix),
        trainable=EMBEDDING_TRAINABLE,
    )
    decoder_emb_inp = tf.nn.embedding_lookup(
        embedding_decoder,
        decoder_inputs,
        name="decoder_emb_inp"
    )
    
    target_sequence_length = tf.placeholder(
        shape=(None, ),
        dtype=tf.int32,
        name='target_sequence_length'
    )
    decoder_cell = tf.nn.rnn_cell.DropoutWrapper(
        rnn_cell_type(num_units=LATENT_DIM, name='decoder_cell'),
        input_keep_prob=keep_prob,
        output_keep_prob=keep_prob,
        #state_keep_prob=keep_prob,
        dtype=tf.float32,
    )
    training_helper = tf.contrib.seq2seq.TrainingHelper(
        inputs=decoder_emb_inp, 
        sequence_length=target_sequence_length,
        time_major=TIME_MAJOR,
        name="decoder_training_helper",
    )
    
    projection_layer = layers_core.Dense(
        units=len(europarl.bpe_target.tokens),
        use_bias=False,
        name='projection_layer',
    )
    
    decoder = tf.contrib.seq2seq.BasicDecoder(
        cell=decoder_cell,
        helper=training_helper,
        initial_state=encoder_state,
        output_layer=projection_layer,
    )
    outputs, _final_state, _final_sequence_length = tf.contrib.seq2seq.dynamic_decode(  
        decoder,
        output_time_major=TIME_MAJOR,
        impute_finished=True,
        # swap_memory=True,
    )
    logits = outputs.rnn_output
    
    decoder_outputs = tf.placeholder(
        shape=(None, None),  # batch_size x max_len_target
        dtype=tf.int32,
        name='decoder_outputs',
    )
    target_weights = tf.cast(tf.sequence_mask(target_sequence_length), dtype=tf.float32)
    train_loss = tf.contrib.seq2seq.sequence_loss(logits, decoder_outputs, target_weights)

    params = tf.trainable_variables()
    gradients = tf.gradients(train_loss, params)
    clipped_gradients, _ = tf.clip_by_global_norm(
        t_list=gradients,
        clip_norm=1.,
    )
    
    optimizer = tf.train.AdamOptimizer()
    update_step = optimizer.apply_gradients(zip(clipped_gradients, params))
    
    inference_decoder_initial_state = tf.contrib.seq2seq.tile_batch(
        encoder_state,
        multiplier=BEAM_WIDTH,
        name='inference_decoder_initital_state',
    )

    inference_decoder = tf.contrib.seq2seq.BeamSearchDecoder(
        cell=decoder_cell,
        embedding=embedding_decoder,
        start_tokens=tf.fill([batch_size], europarl.bpe_target.start_token_idx),
        end_token=europarl.bpe_target.stop_token_idx,
        initial_state=inference_decoder_initial_state,
        beam_width=BEAM_WIDTH,
        output_layer=projection_layer,
        length_penalty_weight=1.0,  # TODO: check hyperparameter tuning
    )

    
    
    inference_outputs, _inference_final_state, _inference_final_sequence_length = tf.contrib.seq2seq.dynamic_decode(
        inference_decoder,
        maximum_iterations=tf.round(tf.reduce_max(input_sequence_length) * 2),  # a bit more flexible than max_len_target
        swap_memory=True,
    )
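
The training loss in the graph above weights every time step with `target_weights`, so padded positions do not contribute. As a rough illustration of that idea (plain Python, not the actual `tf.contrib.seq2seq.sequence_loss` code), the masked average cross-entropy could be sketched like this. The function name `masked_sequence_loss` and the toy values are made up for this example.

```python
import math

def masked_sequence_loss(logits, targets, lengths):
    """Average cross-entropy over all non-padded time steps of a batch.

    logits:  [batch][time][vocab] unnormalized scores
    targets: [batch][time] token indices
    lengths: [batch] number of valid time steps per sequence
    """
    total, count = 0.0, 0
    for seq_logits, seq_targets, length in zip(logits, targets, lengths):
        for step in range(length):  # steps >= length are padding and skipped
            step_logits = seq_logits[step]
            m = max(step_logits)
            log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
            total += log_z - step_logits[seq_targets[step]]
            count += 1
    return total / count

# Two sequences, vocabulary of 4; the second sequence has one padded step
# whose (arbitrary) logits must not influence the result.
logits = [
    [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0, 0.0], [9.0, 9.0, 9.0, -9.0]],
]
targets = [[1, 2], [3, 0]]
lengths = [2, 1]
loss = masked_sequence_loss(logits, targets, lengths)
print(loss)  # uniform predictions over 4 tokens give log(4)
```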

In [12]:
def run_train_batch(batch_ids):
    batch_input_sequences = europarl.df.input_sequences.iloc[batch_ids]
    batch_input_lengths = batch_input_sequences.apply(len)
    batch_target_sequences = europarl.df.target_sequences.iloc[batch_ids]
    batch_target_lengths = batch_target_sequences.apply(len) - 1

    batch_input_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_input_sequences,
        maxlen=max_len_input,
        dtype=int,
        padding='post'
    )
    batch_target_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_target_sequences,
        maxlen=max_len_target,
        dtype=int,
        padding='post'
    )
    loss, _ = sess.run(
        fetches=[
            train_loss, update_step  # the decoder outputs are not needed here, so don't fetch them
        ],
        feed_dict={
            encoder_inputs: batch_input_padded,
            input_sequence_length: np.array(batch_input_lengths),
            decoder_inputs: batch_target_padded[:, :batch_target_lengths.max()],
            target_sequence_length: np.array(batch_target_lengths),
            decoder_outputs: batch_target_padded[:, 1:batch_target_lengths.max() + 1],
            dropout: DROPOUT,
        }
    )
    return loss

def run_val_batch(batch_ids):
    batch_input_sequences = europarl.df.input_sequences.iloc[batch_ids]
    batch_input_lengths = batch_input_sequences.apply(len)
    batch_target_sequences = europarl.df.target_sequences.iloc[batch_ids]
    batch_target_lengths = batch_target_sequences.apply(len) - 1

    batch_input_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_input_sequences,
        maxlen=max_len_input,
        dtype=int,
        padding='post'
    )
    batch_target_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_target_sequences,
        maxlen=max_len_target,
        dtype=int,
        padding='post'
    )
    loss = sess.run(
        fetches=train_loss,  # fetch the tensor directly to get a scalar instead of a one-element list
        feed_dict={
            encoder_inputs: batch_input_padded,
            input_sequence_length: np.array(batch_input_lengths),
            decoder_inputs: batch_target_padded[:, :batch_target_lengths.max()],
            target_sequence_length: np.array(batch_target_lengths),
            decoder_outputs: batch_target_padded[:, 1:batch_target_lengths.max() + 1],
        }
    )
    return loss

def run_validation_loss():
    return np.mean([
        run_val_batch(ids)
        for ids 
        in np.array_split(val_ids, np.ceil(len(val_ids) / BATCH_SIZE))
    ])
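
Both batch functions feed the decoder the target sequence shifted by one position: the decoder input starts with the start token and drops the last token, while the expected output drops the start token. A small plain-Python sketch of that slicing (the helper names `pad_post` and `teacher_forcing_batch` are made up for this example; 1 and 2 are the start and stop token ids seen in the data above, 0 is the padding value):

```python
def pad_post(sequences, pad_value=0):
    """Pad all sequences at the end to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_value] * (max_len - len(s)) for s in sequences]

def teacher_forcing_batch(target_sequences):
    """Return decoder inputs, expected outputs and valid lengths for a batch."""
    lengths = [len(s) - 1 for s in target_sequences]  # one step fewer than tokens
    padded = pad_post(target_sequences)
    steps = max(lengths)
    decoder_in = [row[:steps] for row in padded]        # starts with the start token
    decoder_out = [row[1:steps + 1] for row in padded]  # same tokens shifted left by one
    return decoder_in, decoder_out, lengths

batch = [[1, 247, 351, 2], [1, 35, 2]]
decoder_in, decoder_out, lengths = teacher_forcing_batch(batch)
print(decoder_in)   # [[1, 247, 351], [1, 35, 2]]
print(decoder_out)  # [[247, 351, 2], [35, 2, 0]]
print(lengths)      # [3, 2]
```

Everything beyond the valid length of a sequence (here the third step of the second sequence) is ignored by the loss mask.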

In [13]:
config = tf.ConfigProto(
    allow_soft_placement=True,  # needed as recommendation from https://github.com/tensorflow/tensorflow/issues/2292
    log_device_placement=True,
)
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

batches_per_epoch = np.ceil(len(train_ids) / BATCH_SIZE)
for epoch in range(EPOCHS):
    shuffled_ids = np.random.permutation(train_ids)
    batch_splits = np.array_split(shuffled_ids, batches_per_epoch)
    train_losses = []
    N = len(batch_splits)
    with tqdm(batch_splits, desc=f"Epoch {epoch+1}") as t:
        for train_batch_ids in t:
            batch_loss = run_train_batch(train_batch_ids)
            train_losses.append(batch_loss)
            t.set_postfix(train_loss=np.mean(train_losses))
        print("train_loss", np.mean(train_losses), "val_loss", run_validation_loss())
        
validation_input_sequences = europarl.df.input_sequences.iloc[val_ids[:BATCH_SIZE]]
validation_input_lengths = validation_input_sequences.apply(len)

validation_input_padded = tf.keras.preprocessing.sequence.pad_sequences(
    validation_input_sequences,
    maxlen=max_len_input,
    dtype=int,
    padding='post'
)

# translations = sess.run(
#     fetches=[inference_outputs.sample_id],
#     feed_dict={
#         encoder_inputs: validation_input_padded,
#         input_sequence_length: np.array(validation_input_lengths),
#     },
# )
# for input_text, target_text, translation_token_indices in zip(
#     europarl.df.input_texts.iloc[val_ids[:BATCH_SIZE]],
#     europarl.df.target_texts.iloc[val_ids[:BATCH_SIZE]],
#     translations[0]
# ):
#     translation = europarl.bpe_target.sentencepiece.DecodePieces([
#         europarl.bpe_target.tokens[idx] for idx in translation_token_indices
#     ])
#     print(input_text, target_text, translation)


train_loss 3.3044024 val_loss 2.2765489
train_loss 2.1493225 val_loss 1.9203267
train_loss 1.8672987 val_loss 1.7638649
train_loss 1.7052873 val_loss 1.6825069
train_loss 1.5943111 val_loss 1.6305454
train_loss 1.5122467 val_loss 1.6029016
train_loss 1.4481808 val_loss 1.5801067
train_loss 1.397142 val_loss 1.5679839
train_loss 1.3544468 val_loss 1.5588113
train_loss 1.3182902 val_loss 1.5606792
train_loss 1.2866696 val_loss 1.5568115
train_loss 1.2595927 val_loss 1.5548369
train_loss 1.2369063 val_loss 1.5530577
train_loss 1.2158996 val_loss 1.556557
train_loss 1.1947303 val_loss 1.5570173
train_loss 1.1787221 val_loss 1.5603123
train_loss 1.1633192 val_loss 1.565825
train_loss 1.1501997 val_loss 1.5637382
train_loss 1.1345676 val_loss 1.568669
train_loss 1.1238617 val_loss 1.5717236
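
The training loss keeps falling over all 20 epochs, while the validation loss bottoms out earlier and then slowly rises again. A small sketch to find the epoch with the lowest validation loss, e.g. to decide which checkpoint to keep (the values are copied from the log above; the variable names are only for illustration):

```python
# Validation losses per epoch, copied from the training log.
val_losses = [
    2.2765489, 1.9203267, 1.7638649, 1.6825069, 1.6305454,
    1.6029016, 1.5801067, 1.5679839, 1.5588113, 1.5606792,
    1.5568115, 1.5548369, 1.5530577, 1.556557, 1.5570173,
    1.5603123, 1.565825, 1.5637382, 1.568669, 1.5717236,
]

# Epochs are 1-based in the log, list indices are 0-based.
best_epoch = min(range(len(val_losses)), key=lambda i: val_losses[i]) + 1
print(best_epoch, val_losses[best_epoch - 1])  # 13 1.5530577
```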


In [14]:
def predict(sentence):
    sequenced = europarl.bpe_input.subword_indices(preprocess(sentence))
    padded = tf.keras.preprocessing.sequence.pad_sequences(
        [sequenced],
        maxlen=max_len_input,
        dtype=int,
        padding='post'
    )
    
    beam_search_output = sess.run(
        fetches=[inference_outputs],
        feed_dict={
            encoder_inputs: padded,
            input_sequence_length: [len(sequenced)],
        }
    )[0]
    
    return europarl.bpe_target.sentencepiece.DecodePieces([
        europarl.bpe_target.tokens[idx] for idx in beam_search_output.predicted_ids[0, :, 0].tolist()
    ])
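
`predict` returns the best hypothesis found by the beam search decoder. To illustrate why a beam can beat greedy decoding, here is a toy beam search in plain Python. It is a sketch under my own assumptions, not what `BeamSearchDecoder` does internally: it leaves out the length penalty and batching, and `toy_model`, the token ids and the probabilities are invented for this example.

```python
import math

START, END, A, B = 1, 2, 3, 4  # hypothetical token ids

def toy_model(prefix):
    """Return log-probabilities of the next token for a given prefix."""
    if len(prefix) == 1:
        return {A: math.log(0.6), B: math.log(0.4)}
    if len(prefix) == 2 and prefix[-1] == B:
        return {END: math.log(0.9), A: math.log(0.1)}
    if len(prefix) == 2:
        return {A: math.log(0.5), B: math.log(0.5)}
    return {END: 0.0}  # always stop after three tokens

def beam_search(step_log_probs, beam_width, max_steps):
    """Keep the `beam_width` best partial hypotheses at every step."""
    beams = [([START], 0.0)]
    finished = []
    for _ in range(max_steps):
        candidates = [
            (prefix + [token], score + log_p)
            for prefix, score in beams
            for token, log_p in step_log_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            # Hypotheses ending with the stop token are set aside.
            (finished if prefix[-1] == END else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_steps
    return max(finished, key=lambda c: c[1])

greedy_tokens, greedy_score = beam_search(toy_model, beam_width=1, max_steps=5)
beam_tokens, beam_score = beam_search(toy_model, beam_width=2, max_steps=5)
print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

Greedy decoding (beam width 1) commits to the locally best first token and ends with probability 0.3, while a beam of width 2 also keeps the second-best first token and finds the sequence with probability 0.36.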


In [15]:
name = 'tfattentionmodel'

saver = tf.train.Saver()
saver.save(sess, f"data/{name}.ckpt")

# model.save_weights(f'data/{name}_model_weights.h5') 
# s2s.model.save_weights(f'data/{name}_model_weights.h5')  # https://drive.google.com/open?id=10Sv-JnAiUT_fvU_cw1_H7mkcTAipC5aA
# s2s.inference_encoder_model.save_weights(f'data/{name}_inference_encoder_model_weights.h5')  # https://drive.google.com/open?id=1gNBrn_Wij0PyeE-jJsEnlv7aHXkYuAup
# s2s.inference_decoder_model.save_weights(f'data/{name}_inference_decoder_model_weights.h5')  # https://drive.google.com/open?id=1LCU53Hnb4m42QO3qsZTAkyYyroqz2vbe

'data/tfattentionmodel.ckpt'

In [16]:
europarl.df.head()

| | input_texts | target_texts | input_length | target_length | input_sequences | target_sequences |
|---|---|---|---|---|---|---|
| 0 | resumption of the session | wiederaufnahme der sitzungsperiode | 25 | 34 | [1, 344, 146, 498, 90, 6, 3, 3235, 90, 2] | [1, 247, 351, 750, 5, 934, 43, 3158, 4762, 2] |
| 5 | please rise, then, for this minute' s silence. | ich bitte sie, sich zu einer schweigeminute zu... | 46 | 55 | [1, 3005, 416, 77, 359, 4, 241, 4, 17, 76, 451... | [1, 241, 156, 72, 3112, 54, 4, 39, 26, 95, 473... |
| 6 | (the house rose and observed a minute' s silence) | (das parlament erhebt sich zu einer schweigemi... | 49 | 52 | [1, 29, 140, 414, 3231, 8, 3106, 2484, 9, 451,... | [1, 35, 2444, 2269, 2109, 625, 39, 26, 95, 473... |
| 7 | madam president, on a point of order. | frau präsidentin, zur geschäftsordnung. | 37 | 39 | [1, 1599, 134, 546, 4, 19, 9, 918, 6, 535, 5, 2] | [1, 1161, 2266, 52, 4, 132, 2232, 1516, 3, 2] |
| 13 | madam president, on a point of order. | frau präsidentin, zur geschäftsordnung. | 37 | 39 | [1, 1599, 134, 546, 4, 19, 9, 918, 6, 535, 5, 2] | [1, 1161, 2266, 52, 4, 132, 2232, 1516, 3, 2] |


In [17]:
# Performance on some examples:
EXAMPLES = [
    'Hello.',
    'You are welcome.',
    'How do you do?',
    'I hate mondays.',
    'I am a programmer.',
    'Data is the new oil.',
    'It could be worse.',
    "I am on top of it.",
    "N° Uno",
    "Awesome!",
    "Put your feet up!",
    "From the start till the end!",
    "From dusk till dawn.",
]
for en in [sentence + '\n' for sentence in EXAMPLES]:
    print(f"{preprocess(en)!r} --> {predict(en)!r}")

'hello.' --> 'hallo!'
'you are welcome.' --> 'sie sind zu begrüßen.'
'how do you do?' --> 'wie tun sie es?'
'i hate mondays.' --> 'ich habe gesprochen.'
'i am a programmer.' --> 'ich bin ein programm.'
'data is the new oil.' --> 'vors sind die neuen ölländerung.'
'it could be worse.' --> 'das könnte bedenklich werden.'
'i am on top of it.' --> 'ich bin schon davon.'
'n° uno' --> 'neino'
'awesome!' --> 'das ist eine schande!'
'put your feet up!' --> 'fassen sie ihren füßen!'
'from the start till the end!' --> 'beginnen sie am ende!'
'from dusk till dawn.' --> 'aus der mouskampagne.'


In [18]:
# Performance on training set:
for en, de in europarl.df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

Original "please rise, then, for this minute' s silence.", got 'ich bitte sie, sich zu einer schweigeminute zu erheben.', exp: 'ich bitte sie, sich zu einer schweigeminute zu erheben.'
Original "(the house rose and observed a minute' s silence)", got '(das parlament erhebt sich zu einer schweigeminute.)', exp: '(das parlament erhebt sich zu einer schweigeminute.)'
Original 'madam president, on a point of order.', got 'frau präsidentin, zur geschäftsordnung.', exp: 'frau präsidentin, zur geschäftsordnung.'
Original 'madam president, on a point of order.', got 'frau präsidentin, zur geschäftsordnung.', exp: 'frau präsidentin, zur geschäftsordnung.'
Original 'thank you, mr segni, i shall do so gladly.', got 'vielen dank, herr segni, ich werde dies tun.', exp: 'vielen dank, herr segni, das will ich gerne tun.'
Original 'it is the case of alexander nikitin.', got 'das ist der fall von alexander nikitin.', exp: 'das ist der fall von alexander nikitin.'
Original 'it will, i hope, be examined 

In [19]:
# Performance on validation set
val_df = europarl.df.iloc[val_ids]
for en, de in val_df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

Original 'perhaps there is a connection.', got 'vielleicht gibt es hier einen zusammenhang.', exp: 'vielleicht besteht da ein zusammenhang.'
Original '(applause)', got '(beifall)', exp: '(beifall)'
Original 'mr president, what is going on?', got 'herr präsident, was wird hier geschehen?', exp: 'herr präsident! was steckt dahinter?'
Original 'it is something that we must do ourselves.', got 'das müssen wir uns tun.', exp: 'es ist etwas, das wir selbst tun müssen.'
Original "no reform'.", got 'keine reform.', exp: 'keine reform."'
Original 'there must be no more lockerbies.', got 'es darf keine vorbedingungen mehr tun.', exp: 'es darf keine lockerbies mehr geben.'
Original 'the european union must lead by example.', got 'die europäische union muss mit gutem beispiel vorangehen.', exp: 'die europäische union muss mit gutem beispiel vorangehen.'
Original 'that is the point.', got 'das ist der punkt.', exp: 'darum geht es.'
Original 'it is as simple as that.', got 'so einfach ist das.', exp

In [20]:
bleu = bleu_scores_europarl(
    input_texts=europarl.df.input_texts.iloc[val_ids[:TEST_SIZE]],
    target_texts=europarl.df.target_texts.iloc[val_ids[:TEST_SIZE]],
    predict=lambda text: predict(text)
)
print(f'average BLEU on test set = {bleu.mean()}')


average BLEU on test set = 0.32828439168305806


# Conclusion

...