# Training on a large dataset with an attention model

After implementing [Beamsearch on a large dataset](BeamSearchOnLargeDataset.ipynb), I'll now add an attention model.
As the training set I use the [European Parliament Proceedings Parallel Corpus 1996-2011](http://statmt.org/europarl/).

Here, it would have helped to use `tf.keras` instead of `Keras`, as TensorFlow already ships an attention model. There are also plans to add an attention layer to `Keras`, so I won't reimplement it here, although that would be instructive. Instead, I'll use [keras-attention](https://github.com/datalogue/keras-attention) with its [AttentionDecoder layer](https://github.com/datalogue/keras-attention/blob/master/models/custom_recurrents.py#L10). There is also a tutorial on how to use it at [Machine Learning Mastery](https://machinelearningmastery.com/encoder-decoder-attention-sequence-to-sequence-prediction-keras/) and a [Medium article about it](https://medium.com/datalogue/attention-in-keras-1892773a4f22).
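To make concrete what an attention decoder roughly does at each output step, here is a minimal NumPy sketch of additive (Bahdanau-style) attention. This is not the `AttentionDecoder` code from keras-attention; the function name, the weight names (`W_enc`, `W_dec`, `v`) and the toy dimensions are assumptions chosen for illustration only.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def additive_attention(encoder_states, decoder_state, W_enc, W_dec, v):
    """Illustrative additive attention for a single decoder step.

    encoder_states: (T, H) one vector per input position
    decoder_state:  (H,)   current decoder hidden state
    W_enc, W_dec:   (H, A) projection matrices (hypothetical names)
    v:              (A,)   scoring vector (hypothetical name)
    """
    # one score per input position: how relevant is it for this step?
    scores = np.tanh(encoder_states @ W_enc + decoder_state @ W_dec) @ v
    weights = softmax(scores)           # (T,), non-negative, sums to 1
    context = weights @ encoder_states  # (H,), weighted mix of encoder states
    return context, weights


# toy dimensions: 5 input positions, hidden size 8, attention size 4
rng = np.random.default_rng(0)
T, H, A = 5, 8, 4
encoder_states = rng.normal(size=(T, H))
decoder_state = rng.normal(size=H)
W_enc = rng.normal(size=(H, A))
W_dec = rng.normal(size=(H, A))
v = rng.normal(size=A)

context, weights = additive_attention(encoder_states, decoder_state, W_enc, W_dec, v)
print(weights.round(3), context.shape)
```

The point of the sketch is that the decoder gets a fresh `context` vector for every output step, instead of relying on a single fixed-size summary of the whole input sentence.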

Again, I'll refactor the code a bit, putting most of the implementation details into modules.

In [1]:
# import gc
# import os
# 
import keras
# from keras.backend.tensorflow_backend import set_session
from keras.preprocessing.sequence import pad_sequences
import numpy as np
# import pandas as pd
from sklearn.model_selection import train_test_split
# import tensorflow as tf
# from tqdm import tqdm_notebook as tqdm
#  
# import bytepairencoding as bpe
import seq2seq
from utils.download import download_and_extract_resources
from utils.linguistic import bleu_scores_europarl, preprocess_input_europarl as preprocess
# 
# 
# # Fixing random state ensure reproducible results
# RANDOM_STATE=42
# np.random.seed(RANDOM_STATE)
# tf.set_random_seed(RANDOM_STATE)
# 
# pd.set_option('max_colwidth', 60)  # easier to read texts in e.g. df.head()
# 
# # technical detail so that an instance (maybe running in a different window)
# # doesn't take all the GPU memory resulting in some strange error messages
# config = tf.ConfigProto()
# config.gpu_options.per_process_gpu_memory_fraction = 0.5
# set_session(tf.Session(config=config))

from utils.preparation import Europarl, RANDOM_STATE

Using TensorFlow backend.


Fixed random seed to 42
Set gpu memory fraction to 0.5


In [2]:
# MAX_INPUT_LENGTH = 100  # was 50
# MAX_TARGET_LENGTH = 125  # was 65
# LATENT_DIM = 512
# EMBEDDING_DIM = 300
# BPE_MERGE_OPERATIONS = 5_000  # I'd love to use 10_000 x 300, but this one is broken: https://github.com/bheinzerling/bpemb/issues/6
EPOCHS = 20
BATCH_SIZE = 32
DROPOUT = 0.5
TEST_SIZE = 25 # 2_500  
EMBEDDING_TRAINABLE = True  # Improves results significantly, and at least it's not the dominant training-time factor (that's the output softmax layer)

## Download and explore data

In [3]:
# PATH = 'data'
# INPUT_LANG = 'en'
# TARGET_LANG = 'de'
# LANGUAGES = [INPUT_LANG, TARGET_LANG]
# BPE_URL = {lang: f'http://cosyne.h-its.org/bpemb/data/{lang}/' for lang in LANGUAGES}
# BPE_MODEL_NAME = {lang: f'{lang}.wiki.bpe.op{BPE_MERGE_OPERATIONS}.model' for lang in LANGUAGES}
# BPE_WORD2VEC_NAME = {lang: f'{lang}.wiki.bpe.op{BPE_MERGE_OPERATIONS}.d{EMBEDDING_DIM}.w2v.bin' for lang in LANGUAGES}
# 
# EXTERNAL_RESOURCES = {
#     # Europarl Corpus
#     'de-en.tgz': 'http://statmt.org/europarl/v7/de-en.tgz',
#     
#     # Bytepairencoding subwords (_MODEL_) and pretrained embeddings (_WORD2VEC_)
#     BPE_MODEL_NAME[INPUT_LANG]: f'{BPE_URL[INPUT_LANG]}/{BPE_MODEL_NAME[INPUT_LANG]}',
#     BPE_WORD2VEC_NAME[INPUT_LANG] + '.tar.gz': f'{BPE_URL[INPUT_LANG]}/{BPE_WORD2VEC_NAME[INPUT_LANG]}' + '.tar.gz',
#     BPE_MODEL_NAME[TARGET_LANG]: f'{BPE_URL[TARGET_LANG]}/{BPE_MODEL_NAME[TARGET_LANG]}',
#     BPE_WORD2VEC_NAME[TARGET_LANG] + '.tar.gz': f'{BPE_URL[TARGET_LANG]}/{BPE_WORD2VEC_NAME[TARGET_LANG]}' + '.tar.gz',
# }

europarl = Europarl()
download_and_extract_resources(fnames_and_urls=europarl.external_resources, dest_path=europarl.path)

de-en.tgz already downloaded (188.6 MB)
en.wiki.bpe.op5000.model already downloaded (0.3 MB)
en.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (6.2 MB)
de.wiki.bpe.op5000.model already downloaded (0.3 MB)
de.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (5.7 MB)


In [4]:
europarl.load_and_preprocess(max_input_length=20, max_target_length=25)
# df = pd.DataFrame(data={
#     'input_texts': read_europarl(INPUT_LANG),
#     'target_texts': read_europarl(TARGET_LANG)
# })

Total number of unfiltered translations 1920209
Filtered translations with length between (1, input=20/target=25) characters: 14943


In [5]:
# print("Nr total input:", len(df))
# df['input_length'] = df.input_texts.apply(len)
# df['target_length'] = df.target_texts.apply(len)
europarl.df.head()

| | input_texts | target_texts | input_length | target_length | input_sequences | target_sequences |
|---|---|---|---|---|---|---|
| 67 | agenda | arbeitsplan | 6 | 11 | [1, 631, 222, 34, 2] | [1, 941, 197, 3454, 2] |
| 704 | what is the result? | was sind die folgen? | 19 | 20 | [1, 781, 14, 3, 714, 2426, 2] | [1, 748, 126, 6, 2374, 3720, 2] |
| 1261 | with what aim? | zu welchem zweck? | 14 | 17 | [1, 23, 781, 2973, 2426, 2] | [1, 26, 2740, 156, 155, 142, 359, 188, 3720, 2] |
| 1401 | why? | wieso? | 4 | 6 | [1, 958, 38, 2426, 2] | [1, 167, 1659, 3720, 2] |
| 1403 | no. | nein. | 3 | 5 | [1, 220, 5, 2] | [1, 124, 191, 3, 2] |


### Filter translations (only sentences shorter than a given length)

With a fully working machine translation system, it would of course be better to train on all data (plus maybe some augmented data). Without attention (and maybe a copy mechanism, dynamic memory, ...) there is little point in that anyway, and filtering also reduces training time (a full training run on ~2 million translations might take days, even with a good GPU).
I use a different maximum length for the input (English) than for the target (German) language, as German is more verbose.

In [6]:
# non_empty = (df.input_length > 1) & (df.target_length > 1)  # there are empty phrases like '\n' --> 'Frau Präsidentin\n'
# short_inputs = (df.input_length < MAX_INPUT_LENGTH) & (df.target_length < MAX_TARGET_LENGTH)
# print(f'Sentences with length between (1, input={MAX_INPUT_LENGTH}/target={MAX_TARGET_LENGTH}) characters:', sum(non_empty & short_inputs))
# df = df[non_empty & short_inputs]
# gc.collect();  # df with filtered sentences is significant smaller, so time to garbage collect

## Load (pretrained) Bytepairs

I need the subword dictionary (in `BPE_MODEL_NAME`), the pretrained embeddings (in `BPE_WORD2VEC_NAME`) and a [sentencepiece](https://github.com/google/sentencepiece) handler that can encode/decode them.
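As a rough picture of how these three pieces fit together, the following toy sketch maps subword pieces to indices and then to embedding rows. Everything in it is hypothetical: the tiny vocabulary, the 4-dimensional random embeddings and the helper `toy_subword_indices` only stand in for the real sentencepiece model and pretrained vectors. The start/stop markers 1 and 2 are an assumption based on the sequences shown by `europarl.df.head()`.

```python
import numpy as np

# assumed markers, matching the leading 1 / trailing 2 in the sequences above
START, STOP = 1, 2

# hypothetical toy vocabulary; the real one comes from the sentencepiece model
toy_vocab = {
    '<pad>': 0, '<s>': START, '</s>': STOP,
    '▁this': 3, '▁is': 4, '▁a': 5, '▁test': 6,
}

# hypothetical embedding matrix: one row per vocabulary entry
toy_embeddings = np.random.default_rng(0).normal(size=(len(toy_vocab), 4))


def toy_subword_indices(pieces):
    """Wrap the piece indices in start/stop markers."""
    return [START] + [toy_vocab[piece] for piece in pieces] + [STOP]


indices = toy_subword_indices(['▁this', '▁is', '▁a', '▁test'])
vectors = toy_embeddings[indices]  # embedding lookup: one vector per index
print(indices, vectors.shape)
```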

In [7]:
# bpe_input, bpe_target = [bpe.Bytepairencoding(
#     word2vec_fname=os.path.join(PATH, BPE_WORD2VEC_NAME[lang]),
#     sentencepiece_fname=os.path.join(PATH, BPE_MODEL_NAME[lang]),
# ) for lang in [INPUT_LANG, TARGET_LANG]] 
print("English subwords", europarl.bpe_input.sentencepiece.EncodeAsPieces("this is a test for pretrained bytepairembeddings"))
print("German subwords", europarl.bpe_target.sentencepiece.EncodeAsPieces("das ist ein test für vortrainierte zeichengruppen"))

English subwords ['▁this', '▁is', '▁a', '▁test', '▁for', '▁pre', 'tr', 'ained', '▁by', 'te', 'pa', 'ire', 'm', 'bed', 'd', 'ings']
German subwords ['▁das', '▁ist', '▁ein', '▁test', '▁für', '▁v', 'ort', 'rain', 'ierte', '▁zeich', 'eng', 'ruppen']


In [8]:
# Now encode the texts into sequences of indexes of bytepairs
# df['input_sequences'] = df.input_texts.apply(bpe_input.subword_indices)
# df['target_sequences'] = df.target_texts.apply(bpe_target.subword_indices)
# df[['input_sequences', 'target_sequences']].head()

In [9]:
# Those will be the inputs for the seq2seq model (that needs to know how long the sequences can get)
max_len_input = europarl.df.input_sequences.apply(len).max()
max_len_target = europarl.df.target_sequences.apply(len).max()
(max_len_input, max_len_target)

(15, 16)

In [10]:
train_ids, val_ids = train_test_split(np.arange(europarl.df.shape[0]), test_size=0.1, random_state=RANDOM_STATE)  # fixed random_state

In [11]:
s2s = seq2seq.Seq2SeqWithBPE(
    bpe_input=europarl.bpe_input,
    bpe_target=europarl.bpe_target,
    max_len_input=max_len_input,
    max_len_target=max_len_target
)
#s2s.model.compile(optimizer=keras.optimizers.Adam(clipnorm=1., clipvalue=.5), loss='categorical_crossentropy')
import keras.layers as L
rnn_encoded = L.Bidirectional(
    L.GRU(s2s.latent_dim // 2, return_sequences=True),
    name='bidirectional_1',
    merge_mode='concat'
)(s2s.encoder_embeddings)
attention_decoder = seq2seq.AttentionDecoder(
    s2s.latent_dim,
    len(europarl.bpe_target.tokens),
    trainable=EMBEDDING_TRAINABLE,
)(rnn_encoded)

# train_generator = s2s.create_batch_generator(
#     train_ids, europarl.df.input_sequences, europarl.df.target_sequences, BATCH_SIZE
# )
# val_generator = s2s.create_batch_generator(
#     val_ids, europarl.df.input_sequences, europarl.df.target_sequences, BATCH_SIZE
# )
# 
# s2s.model.fit_generator(
#     train_generator,
#     steps_per_epoch=np.ceil(len(train_ids) / BATCH_SIZE),
#     epochs=EPOCHS,
#     validation_data=val_generator,
#     validation_steps=np.ceil(len(val_ids) / BATCH_SIZE),
# )


inputs shape: (?, ?, 512)


In [12]:
model = keras.models.Model(inputs=s2s.encoder_inputs, outputs=attention_decoder)
model.compile(optimizer=keras.optimizers.Adam(clipnorm=1., clipvalue=.5), loss='categorical_crossentropy')

In [13]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
encoder_inputs (InputLayer)  (None, 15)                0         
_________________________________________________________________
input_embedding (Embedding)  (None, 15, 302)           1760358   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 15, 512)           858624    
_________________________________________________________________
AttentionDecoder (AttentionD (None, 15, 5367)          44911432  
Total params: 47,530,414
Trainable params: 47,530,414
Non-trainable params: 0
_________________________________________________________________


In [14]:
def create_batch_generator(
    samples_ids, input_sequences, target_sequences, batch_size
):
    """Yield (padded inputs, one-hot targets) batches forever, reshuffling on every pass."""

    def batch_generator():
        nr_batches = int(np.ceil(len(samples_ids) / batch_size))
        while True:
            # new random order for each pass over the data
            shuffled_ids = np.random.permutation(samples_ids)
            batch_splits = np.array_split(shuffled_ids, nr_batches)
            for batch_ids in batch_splits:
                batch_X = pad_sequences(
                    input_sequences.iloc[batch_ids],
                    padding='post',
                    maxlen=max_len_input
                )
                batch_y = pad_sequences(
                    target_sequences.iloc[batch_ids],
                    padding='post',
                    maxlen=max_len_target
                )
                # one-hot targets, shifted by one step (the start token is dropped)
                batch_y_t_output = keras.utils.to_categorical(
                    batch_y[:, 1:],
                    num_classes=len(europarl.bpe_target.tokens)
                )
                # teacher-forcing decoder input; unused here, as the attention model only takes the encoder input
                batch_x_t_input = batch_y[:, :-1]
                #yield ([batch_X, batch_x_t_input], batch_y_t_output)
                yield (batch_X, batch_y_t_output)

    return batch_generator()

train_generator = create_batch_generator(
    train_ids, europarl.df.input_sequences, europarl.df.target_sequences, BATCH_SIZE
)
val_generator = create_batch_generator(
    val_ids, europarl.df.input_sequences, europarl.df.target_sequences, BATCH_SIZE
)

model.fit_generator(
    train_generator,
    steps_per_epoch=np.ceil(len(train_ids) / BATCH_SIZE),
    epochs=EPOCHS,
    validation_data=val_generator,
    validation_steps=np.ceil(len(val_ids) / BATCH_SIZE),
)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f6fd7b2f470>
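The batch generator above post-pads every sequence and one-hot encodes the targets shifted by one step. The following NumPy-only sketch reproduces that transformation on two toy target sequences. The helpers `pad_post` and `one_hot` are hypothetical stand-ins for `pad_sequences(..., padding='post')` and `keras.utils.to_categorical`, and the sketch assumes no sequence is longer than `maxlen`.

```python
import numpy as np


def pad_post(sequences, maxlen, value=0):
    """Right-pad integer sequences to a common length (assumes len(seq) <= maxlen)."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        out[row, :len(seq)] = seq
    return out


def one_hot(indices, num_classes):
    """Turn an integer array of shape (...) into one-hot vectors of shape (..., num_classes)."""
    return np.eye(num_classes, dtype=np.float32)[indices]


# two toy target sequences, wrapped in start (1) and stop (2) markers
targets = [[1, 4, 5, 2], [1, 3, 2]]

batch_y = pad_post(targets, maxlen=5)
# dropping the first column removes the start token, so position t holds
# the token the model should emit at step t
decoder_targets = one_hot(batch_y[:, 1:], num_classes=6)

print(batch_y)
print(decoder_targets.shape)
```

Note that the padding index 0 is one-hot encoded like any other token, so padded positions contribute to the loss unless they are masked.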

In [15]:
name = 'attentionmodel'
model.save_weights(f'data/{name}_model_weights.h5') 
# s2s.model.save_weights(f'data/{name}_model_weights.h5')  # https://drive.google.com/open?id=10Sv-JnAiUT_fvU_cw1_H7mkcTAipC5aA
# s2s.inference_encoder_model.save_weights(f'data/{name}_inference_encoder_model_weights.h5')  # https://drive.google.com/open?id=1gNBrn_Wij0PyeE-jJsEnlv7aHXkYuAup
# s2s.inference_decoder_model.save_weights(f'data/{name}_inference_decoder_model_weights.h5')  # https://drive.google.com/open?id=1LCU53Hnb4m42QO3qsZTAkyYyroqz2vbe

In [35]:
def predict(sentence, beam_width=5):
    return s2s.decode_beam_search(pad_sequences(
        [europarl.bpe_input.subword_indices(preprocess(sentence))],
        padding='post',
        maxlen=max_len_input,
    ), beam_width=beam_width)
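`s2s.decode_beam_search` lives in the `seq2seq` module and is not shown in this notebook. As a reminder of the idea, here is a generic beam search sketch over a toy next-token table. It is not that implementation: `beam_search`, `toy_step` and `TOY_TABLE` are hypothetical names, and the sketch does no length normalisation.

```python
import math


def beam_search(step_log_probs, start, stop, beam_width=3, max_len=10):
    """Keep the `beam_width` best partial hypotheses at every step.

    step_log_probs(prefix) must return a dict {token: log probability}.
    """
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        # extend every open hypothesis by every possible next token
        candidates = [
            (prefix + [token], score + logp)
            for prefix, score in beams
            for token, logp in step_log_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            (finished if prefix[-1] == stop else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by max_len still count
    return max(finished, key=lambda c: c[1])


# toy model in which the greedy first choice ('a') leads to a worse sentence
TOY_TABLE = {
    '<s>': {'a': 0.6, 'b': 0.4},
    'a': {'x': 0.5, 'y': 0.5},
    'b': {'z': 0.9, 'x': 0.1},
    'x': {'</s>': 1.0}, 'y': {'</s>': 1.0}, 'z': {'</s>': 1.0},
}


def toy_step(prefix):
    """Next-token log probabilities depend only on the last token."""
    return {tok: math.log(p) for tok, p in TOY_TABLE[prefix[-1]].items()}


greedy, greedy_score = beam_search(toy_step, '<s>', '</s>', beam_width=1)
best, best_score = beam_search(toy_step, '<s>', '</s>', beam_width=2)
print(greedy, math.exp(greedy_score))
print(best, math.exp(best_score))
```

With `beam_width=1` the search is greedy and commits to `'a'`; with `beam_width=2` it keeps `'b'` alive long enough to find the more probable sentence.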

In [36]:
# Performance on some examples:
EXAMPLES = [
    'Hello.',
    'You are welcome.',
    'How do you do?',
    'I hate mondays.',
    'I am a programmer.',
    'Data is the new oil.',
    'It could be worse.',
    "I am on top of it.",
    "N° Uno",
    "Awesome!",
    "Put your feet up!",
    "From the start till the end!",
    "From dusk till dawn.",
]
for en in [sentence + '\n' for sentence in EXAMPLES]:
    print(f"{preprocess(en)!r} --> {predict(en)!r}")

'hello.' --> 'hallo!'
'you are welcome.' --> 'sie sind willkommen!'
'how do you do?' --> 'wie können sie das tun?'
'i hate mondays.' --> 'das tut mir leid.'
'i am a programmer.' --> 'darauf bin ich gespannt.'
'data is the new oil.' --> 'das ist die folge.'
'it could be worse.' --> 'dies war überbucht.'
'i am on top of it.' --> 'ich bin es leid.'
'n° uno' --> 'zärger beifall)'
'awesome!' --> 'klar!'
'put your feet up!' --> 'bitte fahren sie sich!'
'from the start till the end!' --> 'im gegenteil!'
'from dusk till dawn.' --> 'ja, frau harms.'


In [38]:
# Performance on training set:
for en, de in europarl.df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

Original 'what is the result?', got 'mit welchem ergebnis?', exp: 'was sind die folgen?'
Original 'with what aim?', got 'zu welchem zweck?', exp: 'zu welchem zweck?'
Original 'why?', got 'warum?', exp: 'wieso?'
Original 'no.', got 'nein.', exp: 'nein.'
Original 'just like europol.', got 'genau wie europol.', exp: 'genau wie europol.'
Original 'vote', got 'abstimmungen', exp: 'abstimmungen'
Original 'why not?', got 'warum nicht?', exp: 'warum?'
Original 'and now the erika.', got 'und zu den philippinen', exp: 'und nun erika.'
Original 'they want answers.', got 'sie wollen antworten.', exp: 'sie wollen antworten.'
Original 'storms in europe', got 'stürme in europa', exp: 'stürme in europa'
Original 'food safety', got 'lebensmittelsicherheit', exp: 'lebensmittelsicherheit'
Original 'first part', got 'teil i', exp: 'teil i'
Original 'if not, why not?', got 'wenn nicht, warum?', exp: 'wenn nicht, warum nicht?'
Original 'second part', got 'teil ii', exp: 'teil ii'
Original '0 discharge', got

In [39]:
# Performance on validation set
val_df = europarl.df.iloc[val_ids]
for en, de in val_df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

Original 'yes.', got 'ja.', exp: 'ja.'
Original 'why?', got 'warum?', exp: 'warum?'
Original '(loud applause)', got '(lebhafter beifall)', exp: '(lebhafter beifall)'
Original 'loud applause', got 'lebhafter beifall', exp: 'lebhafter beifall'
Original 'why?', got 'warum?', exp: 'warum?'
Original 'president.', got 'der präsident.', exp: 'die präsidentin.'
Original 'consumer protection', got 'verbraucherschutz', exp: 'verbraucherschutz'
Original 'biocidal products', got 'biozid-produkte', exp: 'biozid-produkte'
Original '(applause)', got '(beifall)', exp: '(beifall)'
Original 'applause', got 'beifall', exp: 'beifall'
Original 'tempus fugit!', got 'zu den tat!', exp: 'die zeit drängt!'
Original 'hence this debate.', got 'das wäre nett.', exp: 'deshalb diese debatte.'
Original '(applause)', got '(beifall)', exp: '(beifall)'
Original 'maes (verts/ale).', got 'maes (verts/ale).', exp: 'maes (verts/ale).'
Original '(applause)', got '(beifall)', exp: '(beifall)'
Original 'how can this be?', got

In [41]:
bleu = bleu_scores_europarl(
    input_texts=europarl.df.input_texts.iloc[val_ids[:TEST_SIZE]],
    target_texts=europarl.df.target_texts.iloc[val_ids[:TEST_SIZE]],
    predict=lambda text: predict(text)
)
print(f'average BLEU on test set = {bleu.mean()}')



average BLEU on test set = 0.27822399794374275
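To give a feeling for what a sentence-level BLEU value measures, here is a simplified sketch. The implementation of `bleu_scores_europarl` is not shown in this notebook and may differ (for example in tokenisation, n-gram order or smoothing). The function `simple_bleu` is hypothetical: it splits on whitespace, uses n-grams up to order 2, and skips n-gram orders longer than the hypothesis so that one-word sentences can still score.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, hypothesis, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    effective_n = min(max_n, len(hyp))  # skip orders the hypothesis cannot contain
    log_precision_sum = 0.0
    for n in range(1, effective_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # clipped counts: an n-gram is credited at most as often as it occurs in the reference
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        if overlap == 0:
            return 0.0
        log_precision_sum += math.log(overlap / sum(hyp_counts.values()))
    # penalise hypotheses that are shorter than the reference
    brevity_penalty = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(log_precision_sum / effective_n)


exact = simple_bleu('zu welchem zweck?', 'zu welchem zweck?')
partial = simple_bleu('wenn nicht, warum nicht?', 'wenn nicht, warum?')
unrelated = simple_bleu('was sind die folgen?', 'mit welchem ergebnis?')
print(exact, partial, unrelated)
```

An exact match scores 1.0 and a hypothesis sharing no word with the reference scores 0.0, even if it is a valid translation such as 'mit welchem ergebnis?'. This is one reason why BLEU on very short sentences should be read with care.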


# Conclusion

Translations of short sentences look decent. But it's also obvious that for longer sentences the translation gets lost somewhere in the sentence: although the translated sentence is related to a real translation, it is also confusing and repeats itself.
It's worth noting that the sentences, at (52, 71) byte pairs for the encoding/decoding network, are not too long for an LSTM/GRU model. LSTMs/GRUs are known to handle sequences of up to 100 elements, with performance starting to decrease at around 60 (at least that's what the Stanford courses say). So it could be that training long enough (we can see here that training progresses epoch by epoch, which is really nice to see for large data) would solve the problem for the sentence lengths chosen here. But of course, it's better to do what humans do and apply an attention model instead of trying to keep everything condensed in a 512-dimensional float32 embedding while also generating byte pair after byte pair.
This model is also already realistic in terms of training time: I needed around 18h on a GTX1080. Besides implementing an attention model, it is tempting to see how a convolutional network might improve the runtime performance (and also the quality). But let's get to attention first.