# Training on a large dataset with an attention model

After implementing [beam search on a large dataset](BeamSearchOnLargeDataset.ipynb), I'll now add an attention model.
As the training set I use the [European Parliament Proceedings Parallel Corpus 1996-2011](http://statmt.org/europarl/).

I first intended to implement this with `Keras` as well. However, there is no built-in implementation of an attention layer or an attention decoder (it's planned at the moment). There are several projects such as [keras-attention](https://github.com/datalogue/keras-attention) or the slightly modified [keras-monotonic-attention](https://github.com/andreyzharkov/keras-monotonic-attention) (which works noticeably better). There is also the [seq2seq project](https://github.com/farizrahman4u/seq2seq), which has a lot of open issues, and the promising-looking [NMT Keras](https://nmt-keras.readthedocs.io/en/latest/), for which I could not get all dependencies installed. I wouldn't mind reimplementing it on my own (or improving one of these projects), since I do this project for learning purposes anyway.

After a while, though, I found this approach really unpleasant. I don't like switching back and forth between the high-level layers of `Keras` and `keras.backend`, where pretty much everything has to be implemented at a low level (in a different way than usual in `Keras`) and, on top of that, as object-oriented extensions of layers, cells and so on. Parameters have to be passed along everywhere, GRU/LSTM have to be changed and therefore reimplemented, vectorization needs custom time-distributed layers, and the masking layer won't work with additional inputs like the weighted context vector, so masking has to be reimplemented too. Just look into these projects: there is a lot of noisy code that distracts from the original algorithm.

The basic idea of attention is pretty simple: while decoding, we look back at a weighted combination of the encoder states. The weights depend on the current decoding position, the previous (or current) hidden state of the decoder, and possibly the previous alignment. From them we build a context vector and feed it to the decoder as an additional input (besides the last generated token). For performance we might only look at the hidden states locally around a predicted alignment position (which can be linear in the simplest form, but is usually learnable as well). That's not hard to express as a direct computation, but since in Keras we never work with the computation graph directly, it's harder than it should be.
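
To make the idea concrete, here is a minimal sketch of a single attention step in plain Python. It assumes dot-product scoring (the simplest Luong-style variant, without the learnable weight matrix) and global attention over all encoder states; the function names are made up for illustration and do not come from any library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step.

    decoder_state:  list of floats, the current decoder hidden state
    encoder_states: list of lists, one hidden state per source position
    """
    # Score every encoder state against the decoder state (dot product).
    scores = [
        sum(d * e for d, e in zip(decoder_state, enc))
        for enc in encoder_states
    ]
    weights = softmax(scores)
    # The context vector is the weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(w * enc[i] for w, enc in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_context([2.0, 0.0], encoder_states)
print(weights, context)
```

The context vector would then be concatenated with the embedding of the last generated token (or with the decoder output) before the next step. Local attention would restrict `encoder_states` to a window around a predicted position.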

With TensorFlow we're closer to research, and as I intended to use multiple frameworks anyway, I'll now follow the [seq2seq tutorial from TensorFlow](https://www.tensorflow.org/tutorials/seq2seq). So this notebook also contains a TensorFlow implementation of the plain seq2seq model and of beam search. I also refactored all the preparation work into a module.

In [1]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 8922951419847043913
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 7734522676
locality {
  bus_id: 1
  links {
  }
}
incarnation: 12087032027046734255
physical_device_desc: "device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1"
]


In [2]:
import tensorflow as tf
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

with tf.Session() as sess:
    print(sess.run(c))

[[22. 28.]
 [49. 64.]]


In [3]:
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.python.layers import core as layers_core
from tqdm import tqdm_notebook as tqdm

from utils.download import download_and_extract_resources
from utils.linguistic import bleu_scores_europarl, preprocess_input_europarl as preprocess
from utils.preparation import Europarl, RANDOM_STATE

Fixed random seed to 42


Using TensorFlow backend.


In [4]:
MAX_INPUT_LENGTH = 20  # alternatives: 50, 100 (was 50)
MAX_TARGET_LENGTH = 25  # alternatives: 65, 125 (was 65)
LATENT_DIM = 256  # was 512, but a smaller hidden representation should do, as we look back at the encoder states as needed anyway
EPOCHS = 20
BATCH_SIZE = 128
DROPOUT = 0.25
TEST_SIZE = 2500
BEAM_WIDTH = 5
EMBEDDING_TRAINABLE = True  # Improves results significantly, and at least it's not the dominant factor in training time (that's the output softmax layer)

## Download and explore data

In [5]:
europarl = Europarl()
download_and_extract_resources(fnames_and_urls=europarl.external_resources, dest_path=europarl.path)

de-en.tgz already downloaded (188.6 MB)
en.wiki.bpe.op5000.model already downloaded (0.3 MB)
en.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (6.2 MB)
de.wiki.bpe.op5000.model already downloaded (0.3 MB)
de.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (5.7 MB)


In [6]:
europarl.load_and_preprocess(max_input_length=MAX_INPUT_LENGTH, max_target_length=MAX_TARGET_LENGTH)

Total number of unfiltered translations 1920209
Filtered translations with length between (1, input=20/target=25) characters: 14943


In [7]:
europarl.df.head()

| | input_texts | target_texts | input_length | target_length | input_sequences | target_sequences |
|---|---|---|---|---|---|---|
| 67 | agenda | arbeitsplan | 6 | 11 | [1, 631, 222, 34, 2] | [1, 941, 197, 3454, 2] |
| 704 | what is the result? | was sind die folgen? | 19 | 20 | [1, 781, 14, 3, 714, 2426, 2] | [1, 748, 126, 6, 2374, 3720, 2] |
| 1261 | with what aim? | zu welchem zweck? | 14 | 17 | [1, 23, 781, 2973, 2426, 2] | [1, 26, 2740, 156, 155, 142, 359, 188, 3720, 2] |
| 1401 | why? | wieso? | 4 | 6 | [1, 958, 38, 2426, 2] | [1, 167, 1659, 3720, 2] |
| 1403 | no. | nein. | 3 | 5 | [1, 220, 5, 2] | [1, 124, 191, 3, 2] |


In [8]:
print("English subwords", europarl.bpe_input.sentencepiece.EncodeAsPieces("this is a test for pretrained bytepairembeddings"))
print("German subwords", europarl.bpe_target.sentencepiece.EncodeAsPieces("das ist ein test für vortrainierte zeichengruppen"))

English subwords ['▁this', '▁is', '▁a', '▁test', '▁for', '▁pre', 'tr', 'ained', '▁by', 'te', 'pa', 'ire', 'm', 'bed', 'd', 'ings']
German subwords ['▁das', '▁ist', '▁ein', '▁test', '▁für', '▁v', 'ort', 'rain', 'ierte', '▁zeich', 'eng', 'ruppen']
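
The `▁` in the pieces above marks the start of a new word, which is what makes the segmentation reversible. A minimal sketch of the reverse direction, assuming this marker convention; `join_pieces` is a hypothetical helper written for illustration (the notebook itself relies on `DecodePieces` for this later on).

```python
WORD_START = "▁"  # marker put in front of a piece that starts a new word


def join_pieces(pieces):
    """Concatenate subword pieces and turn word-start markers into spaces."""
    return "".join(pieces).replace(WORD_START, " ").strip()


pieces = ['▁das', '▁ist', '▁ein', '▁test', '▁für', '▁v', 'ort', 'rain', 'ierte']
print(join_pieces(pieces))
```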


In [9]:
# Those will be the inputs for the seq2seq model (that needs to know how long the sequences can get)
max_len_input = europarl.df.input_sequences.apply(len).max()
max_len_target = europarl.df.target_sequences.apply(len).max()
(max_len_input, max_len_target)

(15, 16)

In [10]:
train_ids, val_ids = train_test_split(np.arange(europarl.df.shape[0]), test_size=0.1, random_state=RANDOM_STATE)  # fixed random_state

In [11]:
TIME_MAJOR = False

tf.reset_default_graph()

with tf.device('/gpu:0'):

    encoder_inputs = tf.placeholder(
        shape=(None, None),  # batch_size x max_len_input
        dtype=tf.int32,
        name='encoder_inputs' 
    )
    batch_size = tf.shape(encoder_inputs)[0]
    beam_width = tf.placeholder_with_default(1, shape=[])
    dropout = tf.placeholder_with_default(tf.cast(0.0, tf.float32), shape=[])
    keep_prob = tf.cast(1.0, tf.float32) - dropout

    embedding_encoder = tf.get_variable(
        "embedding_encoder", 
        initializer=tf.constant(europarl.bpe_input.embedding_matrix),
        trainable=EMBEDDING_TRAINABLE,
    )
    encoder_emb_inp = tf.nn.embedding_lookup(
        embedding_encoder,
        encoder_inputs,
        name="encoder_emb_inp"
    )
    
    input_sequence_length = tf.placeholder(
        shape=(None, ),
        dtype=tf.int32,
        name='input_sequence_length'
    )
    
    rnn_cell_type = tf.nn.rnn_cell.GRUCell
    encoder_forward_cell = tf.nn.rnn_cell.DropoutWrapper(
        rnn_cell_type(num_units=LATENT_DIM // 2, name='encoder_forward_cell'),
        input_keep_prob=keep_prob,
        output_keep_prob=keep_prob,  # state_keep_prob not set as it was not helpful here
        dtype=tf.float32,
    )
    encoder_backward_cell = tf.nn.rnn_cell.DropoutWrapper(
        rnn_cell_type(num_units=LATENT_DIM // 2, name='encoder_backward_cell'),
        input_keep_prob=keep_prob,
        output_keep_prob=keep_prob,
        dtype=tf.float32,
    )
    encoder_bi_outputs, encoder_bi_state = tf.nn.bidirectional_dynamic_rnn(
        encoder_forward_cell, encoder_backward_cell,
        inputs=encoder_emb_inp,
        sequence_length=input_sequence_length,
        time_major=TIME_MAJOR,
        dtype=tf.float32,
    )
    encoder_outputs = tf.concat(encoder_bi_outputs, -1)
    encoder_state = tf.concat(encoder_bi_state, -1)
    
    # Regarding time_major:
    # If true, these `Tensors` must be shaped `[max_time, batch_size, depth]`.
    # If false, these `Tensors` must be shaped `[batch_size, max_time, depth]`.
    # Using `time_major = True` is a bit more efficient because it avoids
    # transposes at the beginning and end of the RNN calculation.  However,
    # most TensorFlow data is batch-major, so by default this function
    # accepts input and emits output in batch-major form.
    
    decoder_inputs = tf.placeholder(
        shape=(None, None),  # batch_size x max_len_target
        dtype=tf.int32,
        name='decoder_inputs' 
    )
    embedding_decoder = tf.get_variable(
        "embedding_decoder", 
        initializer=tf.constant(europarl.bpe_target.embedding_matrix),
        trainable=EMBEDDING_TRAINABLE,
    )
    decoder_emb_inp = tf.nn.embedding_lookup(
        embedding_decoder,
        decoder_inputs,
        name="decoder_emb_inp"
    )
    
    target_sequence_length = tf.placeholder(
        shape=(None, ),
        dtype=tf.int32,
        name='target_sequence_length'
    )
    #attention_states = tf.transpose(encoder_outputs, [1, 0, 2]) if TIME_MAJOR else encoder_outputs
    
    #tiled_inputs = tf.contrib.seq2seq.tile_batch(encoder_inputs, multiplier=beam_width)
    tiled_encoder_outputs = tf.contrib.seq2seq.tile_batch(encoder_outputs, multiplier=beam_width)
    tiled_encoder_state = tf.contrib.seq2seq.tile_batch(encoder_state, multiplier=beam_width)
    tiled_sequence_length = tf.contrib.seq2seq.tile_batch(input_sequence_length, multiplier=beam_width)
    
    attention_mechanism = tf.contrib.seq2seq.LuongAttention(
        LATENT_DIM,
        memory=tiled_encoder_outputs,
        memory_sequence_length=tiled_sequence_length,
        dtype=tf.float32,
        name='attention_mechanism',
    )
    decoder_rnn_cell = tf.nn.rnn_cell.DropoutWrapper(
        rnn_cell_type(num_units=LATENT_DIM, name='decoder_cell'),
        input_keep_prob=keep_prob,
        output_keep_prob=keep_prob,
        dtype=tf.float32,
    )
    #decoder_rnn_cell = rnn_cell_type(num_units=LATENT_DIM, name='decoder_cell')
    decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
        decoder_rnn_cell,
        attention_mechanism,
        attention_layer_size=LATENT_DIM, 
        name='attention_wrapper',
    )

    training_helper = tf.contrib.seq2seq.TrainingHelper(
        inputs=decoder_emb_inp, 
        sequence_length=target_sequence_length,
        time_major=TIME_MAJOR,
        name="decoder_training_helper",
    )
    
    projection_layer = layers_core.Dense(
        units=len(europarl.bpe_target.tokens),
        use_bias=False,
        name='projection_layer',
    )
    
    # Start from a zero attention state, but use the encoder state as the RNN cell state
    initial_state = decoder_cell.zero_state(dtype=tf.float32, batch_size=batch_size).clone(
        cell_state=encoder_state
    )
    decoder = tf.contrib.seq2seq.BasicDecoder(
        cell=decoder_cell,
        helper=training_helper,
        initial_state=initial_state,
        output_layer=projection_layer,
    )
    outputs, _final_state, _final_sequence_length = tf.contrib.seq2seq.dynamic_decode(  
        decoder,
        output_time_major=TIME_MAJOR,
        impute_finished=False, #True,
        # swap_memory=True,
    )
    logits = outputs.rnn_output
    
    decoder_outputs = tf.placeholder(
        shape=(None, None),  # batch_size x max_len_target
        dtype=tf.int32,
        name='decoder_outputs',
    )
    target_weights = tf.cast(tf.sequence_mask(target_sequence_length), dtype=tf.float32)
    train_loss = tf.contrib.seq2seq.sequence_loss(logits, decoder_outputs, target_weights)

    params = tf.trainable_variables()
    gradients = tf.gradients(train_loss, params)
    clipped_gradients, _ = tf.clip_by_global_norm(
        t_list=gradients,
        clip_norm=1.,
    )
    
    optimizer = tf.train.AdamOptimizer()
    update_step = optimizer.apply_gradients(zip(clipped_gradients, params))
    
    # Initial decoder state for beam search: zero attention state for all
    # batch_size * beam_width hypotheses, with the tiled encoder state as RNN cell state
    inference_decoder_initial_state = decoder_cell.zero_state(
        dtype=tf.float32,
        batch_size=batch_size * beam_width
    ).clone(
        cell_state=tiled_encoder_state
    )
    inference_decoder = tf.contrib.seq2seq.BeamSearchDecoder(
        cell=decoder_cell,
        embedding=embedding_decoder,
        start_tokens=tf.fill([batch_size], europarl.bpe_target.start_token_idx),
        end_token=europarl.bpe_target.stop_token_idx,
        initial_state=inference_decoder_initial_state,
        beam_width=BEAM_WIDTH,  # must match the value fed into the `beam_width` placeholder
        output_layer=projection_layer,
        length_penalty_weight=1.0,  # TODO: check hyperparameter tuning
    )

    
    inference_outputs, _inference_final_state, _inference_final_sequence_length = tf.contrib.seq2seq.dynamic_decode(
        inference_decoder,
        maximum_iterations=tf.reduce_max(input_sequence_length) * 2,  # a bit more flexible than max_len_target; already int32, so no rounding needed
        # swap_memory=True,
        impute_finished=False,
    )
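
The loss in the graph above combines `tf.sequence_mask` with `sequence_loss`, so padded time steps do not contribute. The following plain-Python sketch illustrates that computation under the assumption that the loss is averaged over all real (non-padded) time steps of the batch; it works on probabilities instead of logits to stay short, and the function names only mirror the TensorFlow ones for readability.

```python
import math


def sequence_mask(lengths, max_len):
    """1.0 for real time steps, 0.0 for padding (one row per sequence)."""
    return [[1.0 if t < n else 0.0 for t in range(max_len)] for n in lengths]


def masked_sequence_loss(probs, targets, lengths):
    """Average cross-entropy over all non-padded time steps of a batch.

    probs:   [batch][time][vocab] predicted probabilities
    targets: [batch][time] target token ids
    lengths: [batch] number of real time steps per sequence
    """
    weights = sequence_mask(lengths, max_len=len(targets[0]))
    total, count = 0.0, 0.0
    for seq_probs, seq_targets, seq_weights in zip(probs, targets, weights):
        for step_probs, target, weight in zip(seq_probs, seq_targets, seq_weights):
            if weight == 0.0:
                continue  # padded step: ignored entirely
            total += -math.log(step_probs[target])
            count += weight
    return total / count


# Two sequences, two time steps, vocabulary of size two.
probs = [
    [[0.5, 0.5], [0.25, 0.75]],
    [[0.5, 0.5], [0.9, 0.1]],  # second step of this row is padding
]
targets = [[0, 1], [1, 0]]
loss = masked_sequence_loss(probs, targets, lengths=[2, 1])
print(loss)
```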

In [12]:
def run_train_batch(batch_ids):
    batch_input_sequences = europarl.df.input_sequences.iloc[batch_ids]
    batch_input_lengths = batch_input_sequences.apply(len)
    batch_target_sequences = europarl.df.target_sequences.iloc[batch_ids]
    batch_target_lengths = batch_target_sequences.apply(len) - 1

    batch_input_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_input_sequences,
        maxlen=max_len_input,
        dtype=int,
        padding='post'
    )
    batch_target_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_target_sequences,
        maxlen=max_len_target,
        dtype=int,
        padding='post'
    )
    loss, _ = sess.run(
        fetches=[
            train_loss, update_step  # the decoder outputs are not used here, so they are not fetched
        ],
        feed_dict={
            encoder_inputs: batch_input_padded,
            input_sequence_length: np.array(batch_input_lengths),
            decoder_inputs: batch_target_padded[:, :batch_target_lengths.max()],
            target_sequence_length: np.array(batch_target_lengths),
            decoder_outputs: batch_target_padded[:, 1:batch_target_lengths.max() + 1],
            dropout: DROPOUT,
        }
    )
    return loss

def run_val_batch(batch_ids):
    batch_input_sequences = europarl.df.input_sequences.iloc[batch_ids]
    batch_input_lengths = batch_input_sequences.apply(len)
    batch_target_sequences = europarl.df.target_sequences.iloc[batch_ids]
    batch_target_lengths = batch_target_sequences.apply(len) - 1

    batch_input_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_input_sequences,
        maxlen=max_len_input,
        dtype=int,
        padding='post'
    )
    batch_target_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_target_sequences,
        maxlen=max_len_target,
        dtype=int,
        padding='post'
    )
    loss = sess.run(
        fetches=train_loss,  # fetch the tensor directly to get a scalar instead of a one-element list
        feed_dict={
            encoder_inputs: batch_input_padded,
            input_sequence_length: np.array(batch_input_lengths),
            decoder_inputs: batch_target_padded[:, :batch_target_lengths.max()],
            target_sequence_length: np.array(batch_target_lengths),
            decoder_outputs: batch_target_padded[:, 1:batch_target_lengths.max() + 1],
            # dropout is not fed, so it keeps its default of 0.0 during validation
        }
    )
    return loss

def run_validation_loss():
    return np.mean([
        run_val_batch(ids)
        for ids 
        in np.array_split(val_ids, np.ceil(len(val_ids) / BATCH_SIZE))
    ])
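
Both batch functions above feed the same padded target matrix twice, shifted by one column: the decoder reads the tokens up to the second to last one and has to predict the tokens from the second one onwards. A minimal sketch of that slicing on plain lists, using two target sequences from the table further up; `teacher_forcing_batch` is a hypothetical helper, and the padding id 0 is the `pad_sequences` default.

```python
def pad_post(sequence, max_len, pad_id=0):
    """Right-pad a list of token ids with pad_id up to max_len."""
    return sequence + [pad_id] * (max_len - len(sequence))


def teacher_forcing_batch(target_sequences, pad_id=0):
    """Build decoder inputs, expected outputs and step counts for a batch."""
    # One prediction per token after the first one.
    steps = [len(seq) - 1 for seq in target_sequences]
    max_steps = max(steps)
    padded = [pad_post(seq, max_steps + 1, pad_id) for seq in target_sequences]
    decoder_inputs = [row[:max_steps] for row in padded]  # drop the last column
    decoder_outputs = [row[1:max_steps + 1] for row in padded]  # shift left by one
    return decoder_inputs, decoder_outputs, steps


batch = [
    [1, 124, 191, 3, 2],               # 'nein.'
    [1, 748, 126, 6, 2374, 3720, 2],   # 'was sind die folgen?'
]
decoder_inputs, decoder_outputs, steps = teacher_forcing_batch(batch)
print(decoder_inputs, decoder_outputs, steps)
```

For the shorter row, the input columns beyond its step count still contain the last token and padding, which is harmless because the sequence lengths mask those positions in the loss.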

In [13]:
config = tf.ConfigProto(
    allow_soft_placement=True,  # needed as recommendation from https://github.com/tensorflow/tensorflow/issues/2292
    log_device_placement=True,
)
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

batches_per_epoch = int(np.ceil(len(train_ids) / BATCH_SIZE))  # explicit integer section count for np.array_split
for epoch in range(EPOCHS):
    shuffled_ids = np.random.permutation(train_ids)
    batch_splits = np.array_split(shuffled_ids, batches_per_epoch)
    train_losses = []
    with tqdm(batch_splits, desc=f"Epoch {epoch+1}") as t:
        for train_batch_ids in t:
            batch_loss = run_train_batch(train_batch_ids)
            train_losses.append(batch_loss)
            t.set_postfix(train_loss=np.mean(train_losses))
        print("train_loss", np.mean(train_losses), "val_loss", run_validation_loss())
        
validation_input_sequences = europarl.df.input_sequences.iloc[val_ids[:BATCH_SIZE]]
validation_input_lengths = validation_input_sequences.apply(len)

validation_input_padded = tf.keras.preprocessing.sequence.pad_sequences(
    validation_input_sequences,
    maxlen=max_len_input,
    dtype=int,
    padding='post'
)

train_loss 4.0278277 val_loss 2.7171733
train_loss 2.2796981 val_loss 2.1541057
train_loss 1.7984314 val_loss 1.8657602
train_loss 1.5180403 val_loss 1.7278348
train_loss 1.3311 val_loss 1.6529965
train_loss 1.1854817 val_loss 1.5839186
train_loss 1.0658473 val_loss 1.5412661
train_loss 0.9756876 val_loss 1.5229611
train_loss 0.885829 val_loss 1.4964112
train_loss 0.8155995 val_loss 1.5020987
train_loss 0.75577414 val_loss 1.4849538
train_loss 0.6974251 val_loss 1.4671365
train_loss 0.6548841 val_loss 1.4799666
train_loss 0.6150893 val_loss 1.4828558
train_loss 0.5813776 val_loss 1.4721166
train_loss 0.5453936 val_loss 1.495212
train_loss 0.5149394 val_loss 1.4916309
train_loss 0.49211258 val_loss 1.4876364
train_loss 0.46770316 val_loss 1.5019754
train_loss 0.4528464 val_loss 1.5093204
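
In the log above the training loss keeps falling over all 20 epochs, while the validation loss flattens out at roughly 1.47 to 1.51 after epoch 9. The following sketch picks the best epoch from the logged validation losses and shows where a simple patience rule would have stopped. Early stopping is not part of this notebook; both helpers and the patience value of 3 are assumptions made for illustration.

```python
def best_epoch(val_losses):
    """Return (1-based epoch, loss) of the lowest validation loss."""
    index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return index + 1, val_losses[index]


def stop_epoch(val_losses, patience):
    """First epoch at which `patience` epochs passed without a new best loss."""
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None  # the rule never triggered


# Validation losses copied from the training log above.
val_losses = [
    2.7171733, 2.1541057, 1.8657602, 1.7278348, 1.6529965,
    1.5839186, 1.5412661, 1.5229611, 1.4964112, 1.5020987,
    1.4849538, 1.4671365, 1.4799666, 1.4828558, 1.4721166,
    1.495212, 1.4916309, 1.4876364, 1.5019754, 1.5093204,
]
print(best_epoch(val_losses), stop_epoch(val_losses, patience=3))
```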


In [14]:
def predict(sentence):
    sequenced = europarl.bpe_input.subword_indices(preprocess(sentence))
    padded = tf.keras.preprocessing.sequence.pad_sequences(
        [sequenced],
        maxlen=max_len_input,
        dtype=int,
        padding='post'
    )
    
    beam_search_output = sess.run(
        fetches=[inference_outputs],
        feed_dict={
            encoder_inputs: padded,
            input_sequence_length: [len(sequenced)],
            beam_width: BEAM_WIDTH,
        }
    )[0]
    
    return europarl.bpe_target.sentencepiece.DecodePieces([
        europarl.bpe_target.tokens[idx] for idx in beam_search_output.predicted_ids[0, :, 0].tolist()
    ])
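
`predict` returns the best beam, and which beam counts as best depends on `length_penalty_weight`. As far as I know, the contrib `BeamSearchDecoder` divides the summed log-probabilities by the length penalty from the GNMT paper, `((5 + length) / 6) ** weight`. The sketch below illustrates that normalisation under this assumption; the hypothesis scores are made-up numbers.

```python
def length_penalty(length, weight):
    """GNMT-style length penalty: ((5 + length) / 6) ** weight."""
    return ((5.0 + length) / 6.0) ** weight


def beam_score(log_prob_sum, length, weight):
    """Length-normalised score of a hypothesis (higher is better)."""
    return log_prob_sum / length_penalty(length, weight)


# Hypothetical hypotheses: (sum of token log-probabilities, length in tokens)
short_hyp = (-2.0, 2)
long_hyp = (-2.6, 5)

for weight in (0.0, 1.0):
    scores = [beam_score(lp, n, weight) for lp, n in (short_hyp, long_hyp)]
    print(weight, scores)
```

With a weight of 0 the penalty is always 1, so the shorter hypothesis wins on raw log-probability. With a weight of 1 the longer hypothesis scores higher, which counteracts the bias of beam search towards short outputs.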


In [15]:
name = 'tfattentionmodel'

saver = tf.train.Saver()
saver.save(sess, f"data/{name}.ckpt")

# model.save_weights(f'data/{name}_model_weights.h5') 
# s2s.model.save_weights(f'data/{name}_model_weights.h5')  # https://drive.google.com/open?id=10Sv-JnAiUT_fvU_cw1_H7mkcTAipC5aA
# s2s.inference_encoder_model.save_weights(f'data/{name}_inference_encoder_model_weights.h5')  # https://drive.google.com/open?id=1gNBrn_Wij0PyeE-jJsEnlv7aHXkYuAup
# s2s.inference_decoder_model.save_weights(f'data/{name}_inference_decoder_model_weights.h5')  # https://drive.google.com/open?id=1LCU53Hnb4m42QO3qsZTAkyYyroqz2vbe

'data/tfattentionmodel.ckpt'

In [17]:
# Performance on some examples:
EXAMPLES = [
    'Hello.',
    'You are welcome.',
    'How do you do?',
    'I hate mondays.',
    'I am a programmer.',
    'Data is the new oil.',
    'It could be worse.',
    "I am on top of it.",
    "N° Uno",
    "Awesome!",
    "Put your feet up!",
    "From the start till the end!",
    "From dusk till dawn.",
]
for en in [sentence + '\n' for sentence in EXAMPLES]:
    print(f"{preprocess(en)!r} --> {predict(en)!r}")

'hello.' --> 'hallo!'
'you are welcome.' --> 'sie sind begrüßen.'
'how do you do?' --> 'wie soll sie tun?'
'i hate mondays.' --> 'ich danke ihnen.'
'i am a programmer.' --> 'ich bin ein problem.'
'data is the new oil.' --> 'es ist die kapösung.'
'it could be worse.' --> 'es ging nicht.'
'i am on top of it.' --> 'ich bin darüber satt.'
'n° uno' --> 'kubaität.'
'awesome!' --> 'einverstanden!'
'put your feet up!' --> 'neben sie sich vor!'
'from the start till the end!' --> 'im gegenteil!'
'from dusk till dawn.' --> 'von dähren sie an.'


In [18]:
# Performance on training set:
for en, de in europarl.df[['input_texts', 'target_texts']][1:20].values.tolist():  # first rows of the full DataFrame, not filtered by train_ids
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

Original 'what is the result?', got 'was ist das ergebnis?', exp: 'was sind die folgen?'
Original 'with what aim?', got 'zu welchem zweck?', exp: 'zu welchem zweck?'
Original 'why?', got 'warum?', exp: 'wieso?'
Original 'no.', got 'nein.', exp: 'nein.'
Original 'just like europol.', got 'genau, europol.', exp: 'genau wie europol.'
Original 'vote', got 'abstimmungen', exp: 'abstimmungen'
Original 'why not?', got 'warum nicht?', exp: 'warum?'
Original 'and now the erika.', got 'und jetzt die recht.', exp: 'und nun erika.'
Original 'they want answers.', got 'sie wollten wir ab.', exp: 'sie wollen antworten.'
Original 'storms in europe', got 'stürme in europa', exp: 'stürme in europa'
Original 'food safety', got 'lebensmittelsicherheit', exp: 'lebensmittelsicherheit'
Original 'first part', got 'teil i', exp: 'teil i'
Original 'if not, why not?', got 'wenn nicht, warum nicht?', exp: 'wenn nicht, warum nicht?'
Original 'second part', got 'teil ii', exp: 'teil ii'
Original '0 discharge', got 

In [19]:
# Performance on validation set
val_df = europarl.df.iloc[val_ids]
for en, de in val_df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

Original 'yes.', got 'ja.', exp: 'ja.'
Original 'why?', got 'warum?', exp: 'warum?'
Original '(loud applause)', got '(lebhafter beifall)', exp: '(lebhafter beifall)'
Original 'loud applause', got 'lebhafter beifall', exp: 'lebhafter beifall'
Original 'why?', got 'warum?', exp: 'warum?'
Original 'president.', got 'der präsident.', exp: 'die präsidentin.'
Original 'consumer protection', got 'verbraucherschutz', exp: 'verbraucherschutz'
Original 'biocidal products', got 'biozidprodukte', exp: 'biozid-produkte'
Original '(applause)', got '(beifall)', exp: '(beifall)'
Original 'applause', got 'beifall', exp: 'beifall'
Original 'tempus fugit!', got 'aus fräumenung!', exp: 'die zeit drängt!'
Original 'hence this debate.', got 'das ist ein begrüßen.', exp: 'deshalb diese debatte.'
Original '(applause)', got '(beifall)', exp: '(beifall)'
Original 'maes (verts/ale).', got 'maes (verts/ale).', exp: 'maes (verts/ale).'
Original '(applause)', got '(beifall)', exp: '(beifall)'
Original 'how can this

In [20]:
bleu = bleu_scores_europarl(
    input_texts=europarl.df.input_texts.iloc[val_ids[:TEST_SIZE]],
    target_texts=europarl.df.target_texts.iloc[val_ids[:TEST_SIZE]],
    predict=predict,
)
print(f'average BLEU on test set = {bleu.mean()}')


average BLEU on test set = 0.35325087498798863
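
How `bleu_scores_europarl` computes its per-sentence scores is not shown in this notebook. To give a feeling for the metric, here is a simplified sentence-level BLEU sketch: clipped n-gram precisions up to bigrams, a brevity penalty, whitespace tokenisation and no smoothing. All of these are assumptions made for illustration, so its numbers are not comparable to the average above.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=2):
    """Simplified BLEU: clipped n-gram precisions times a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    # Assumption: never ask for n-grams longer than the hypothesis itself.
    max_n = min(max_n, len(hyp))
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        total = sum(hyp_counts.values())
        # A hypothesis n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        if matches == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(matches / total))
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1.0 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / max_n)


# (expected, predicted) pairs taken from the validation output above
pairs = [
    ('ja.', 'ja.'),
    ('die präsidentin.', 'der präsident.'),
    ('biozid-produkte', 'biozidprodukte'),
]
for expected, predicted in pairs:
    print(repr(expected), repr(predicted), sentence_bleu(expected, predicted))
```

The last two pairs show how harsh the unsmoothed, word-level variant is on short sentences: a near miss such as a missing hyphen scores the same as a completely wrong translation.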


# Conclusion

...