# Training on large dataset with already implemented BeamSearch but without attention model

After [implementing Beamsearch](BeamSearchForMachineTranslation.ipynb), I'll now train it on a large dataset. The goal is beside a better translation quality also to show problems arising without attention model (that is needed for larger texts). 
As trainings set I use the [European Parliament Proceedings Parallel Corpus 1996-2011](http://statmt.org/europarl/) German-English corpus with medium sized sentences.

Again, I'll refactor the code a bit, putting most of the implementation details into modules.

In [1]:
import gc
import os

import keras
from keras.backend.tensorflow_backend import set_session
from keras.preprocessing.sequence import pad_sequences
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tqdm import tqdm_notebook as tqdm
 
import bytepairencoding as bpe
import seq2seq
from utils.download import download_and_extract_resources
from utils.linguistic import bleu_scores_europarl, read_europarl, preprocess_input_europarl as preprocess


# Fixing random state ensure reproducible results
RANDOM_STATE=42
np.random.seed(RANDOM_STATE)
tf.set_random_seed(RANDOM_STATE)

pd.set_option('max_colwidth', 60)  # easier to read texts in e.g. df.head()

# technical detail so that an instance (maybe running in a different window)
# doesn't take all the GPU memory resulting in some strange error messages
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5
set_session(tf.Session(config=config))

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [2]:
MAX_INPUT_LENGTH = 100  # was 50
MAX_TARGET_LENGTH = 125  # was 65
LATENT_DIM = 512
EMBEDDING_DIM = 300
BPE_MERGE_OPERATIONS = 5_000  # I'd love to use 10_000 x 300, but this one is broken: https://github.com/bheinzerling/bpemb/issues/6
EPOCHS = 20
BATCH_SIZE = 32  # was 64, but need to reduced so a batch still fits in GPU memory
DROPOUT = 0.5
TEST_SIZE = 2_500  # was 500
EMBEDDING_TRAINABLE = True  # Improves results significant and for at least it's not the most dominant training time factor (that's the output softmax layer)

## Download and explore data

In [3]:
PATH = 'data'
INPUT_LANG = 'en'
TARGET_LANG = 'de'
LANGUAGES = [INPUT_LANG, TARGET_LANG]
BPE_URL = {lang: f'http://cosyne.h-its.org/bpemb/data/{lang}/' for lang in LANGUAGES}
BPE_MODEL_NAME = {lang: f'{lang}.wiki.bpe.op{BPE_MERGE_OPERATIONS}.model' for lang in LANGUAGES}
BPE_WORD2VEC_NAME = {lang: f'{lang}.wiki.bpe.op{BPE_MERGE_OPERATIONS}.d{EMBEDDING_DIM}.w2v.bin' for lang in LANGUAGES}

EXTERNAL_RESOURCES = {
    # Europarl Corpus
    'de-en.tgz': 'http://statmt.org/europarl/v7/de-en.tgz',
    
    # Bytepairencoding subwords (_MODEL_) and pretrained embeddings (_WORD2VEC_)
    BPE_MODEL_NAME[INPUT_LANG]: f'{BPE_URL[INPUT_LANG]}/{BPE_MODEL_NAME[INPUT_LANG]}',
    BPE_WORD2VEC_NAME[INPUT_LANG] + '.tar.gz': f'{BPE_URL[INPUT_LANG]}/{BPE_WORD2VEC_NAME[INPUT_LANG]}' + '.tar.gz',
    BPE_MODEL_NAME[TARGET_LANG]: f'{BPE_URL[TARGET_LANG]}/{BPE_MODEL_NAME[TARGET_LANG]}',
    BPE_WORD2VEC_NAME[TARGET_LANG] + '.tar.gz': f'{BPE_URL[TARGET_LANG]}/{BPE_WORD2VEC_NAME[TARGET_LANG]}' + '.tar.gz',
}

download_and_extract_resources(fnames_and_urls=EXTERNAL_RESOURCES, dest_path=PATH)

de-en.tgz already downloaded (188.6 MB)
en.wiki.bpe.op5000.model already downloaded (0.3 MB)
en.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (6.2 MB)
de.wiki.bpe.op5000.model already downloaded (0.3 MB)
de.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (5.7 MB)


In [4]:
df = pd.DataFrame(data={
    'input_texts': read_europarl(INPUT_LANG),
    'target_texts': read_europarl(TARGET_LANG)
})

In [5]:
print("Nr total input:", len(df))
df['input_length'] = df.input_texts.apply(len)
df['target_length'] = df.target_texts.apply(len)
df.head()

Nr total input: 1920209


Unnamed: 0,input_texts,target_texts,input_length,target_length
0,resumption of the session,wiederaufnahme der sitzungsperiode,25,34
1,i declare resumed the session of the european parliament...,"ich erkläre die am freitag, dem 0. dezember unterbrochen...",203,217
2,"although, as you will have seen, the dreaded 'millennium...","wie sie feststellen konnten, ist der gefürchtete ""millen...",191,185
3,you have requested a debate on this subject in the cours...,im parlament besteht der wunsch nach einer aussprache im...,105,110
4,"in the meantime, i should like to observe a minute' s si...",heute möchte ich sie bitten - das ist auch der wunsch ei...,232,217


### Filter translations (only sentences shorter than a given length)

With a full working machine translation system, it's of course better to train on all data (plus maybe some augmented data). Without attention (and maybe copy mechanism, dynamic memory, ...) there's no point anyway in it, but it also reduces training time (a full training on ~2 Mio translations might take days, even with a good GPU).
I use different length for input (english) than target (german) language as german is more verbose.

In [6]:
non_empty = (df.input_length > 1) & (df.target_length > 1)  # there are empty phrases like '\n' --> 'Frau Präsidentin\n'
short_inputs = (df.input_length < MAX_INPUT_LENGTH) & (df.target_length < MAX_TARGET_LENGTH)
print(f'Sentences with length between (1, input={MAX_INPUT_LENGTH}/target={MAX_TARGET_LENGTH}) characters:', sum(non_empty & short_inputs))
df = df[non_empty & short_inputs]
gc.collect();  # df with filtered sentences is significant smaller, so time to garbage collect

Sentences with length between (1, input=100/target=125) characters: 597331


## Load (pretrained) Bytepairs

I need the subwords dictionary (in `BPE_WORD2VEC_NAME`), the pretrained embeddings (in `BPE_MODEL_NAME`) and a [sentencepiece](https://github.com/google/sentencepiece) handler that can encode/decode them.

In [7]:
bpe_input, bpe_target = [bpe.Bytepairencoding(
    word2vec_fname=os.path.join(PATH, BPE_WORD2VEC_NAME[lang]),
    sentencepiece_fname=os.path.join(PATH, BPE_MODEL_NAME[lang]),
) for lang in [INPUT_LANG, TARGET_LANG]] 
print("English subwords", bpe_input.sentencepiece.EncodeAsPieces("this is a test for pretrained bytepairembeddings"))
print("German subwords", bpe_input.sentencepiece.EncodeAsPieces("das ist ein test für vortrainierte zeichengruppen"))

English subwords ['▁this', '▁is', '▁a', '▁test', '▁for', '▁pre', 'tr', 'ained', '▁by', 'te', 'pa', 'ire', 'm', 'bed', 'd', 'ings']
German subwords ['▁d', 'as', '▁is', 't', '▁e', 'in', '▁test', '▁f', 'ür', '▁v', 'ort', 'rain', 'ier', 'te', '▁ze', 'ic', 'hen', 'gr', 'up', 'p', 'en']


In [8]:
# Now encode the texts into sequences of indexes of bytepairs
df['input_sequences'] = df.input_texts.apply(bpe_input.subword_indices)
df['target_sequences'] = df.target_texts.apply(bpe_target.subword_indices)
df[['input_sequences', 'target_sequences']].head()

Unnamed: 0,input_sequences,target_sequences
0,"[1, 344, 146, 498, 90, 6, 3, 3235, 90, 2]","[1, 247, 351, 750, 5, 934, 43, 3158, 4762, 2]"
5,"[1, 3005, 416, 77, 359, 4, 241, 4, 17, 76, 451, 782, 21,...","[1, 241, 156, 72, 3112, 54, 4, 39, 26, 95, 4739, 89, 937..."
6,"[1, 29, 140, 414, 3231, 8, 3106, 2484, 9, 451, 782, 21, ...","[1, 35, 2444, 2269, 2109, 625, 39, 26, 95, 4739, 89, 937..."
7,"[1, 1599, 134, 546, 4, 19, 9, 918, 6, 535, 5, 2]","[1, 1161, 2266, 52, 4, 132, 2232, 1516, 3, 2]"
12,"[1, 530, 3, 414, 1434, 35, 4, 305, 186, 321, 366, 18, 19...","[1, 835, 19, 684, 494, 21, 161, 48, 838, 30, 4, 781, 67,..."


In [9]:
# Those will be the inputs for the seq2seq model (that needs to know how long the sequences can get)
max_len_input = df.input_sequences.apply(len).max()
max_len_target = df.target_sequences.apply(len).max()
(max_len_input, max_len_target)

(52, 71)

In [10]:
train_ids, val_ids = train_test_split(np.arange(df.shape[0]), test_size=0.1, random_state=RANDOM_STATE)  # fixed random_state

In [11]:
s2s = seq2seq.Seq2SeqWithBPE(
    bpe_input=bpe_input,
    bpe_target=bpe_target,
    max_len_input=max_len_input,
    max_len_target=max_len_target
)
s2s.model.compile(optimizer=keras.optimizers.Adam(clipnorm=1., clipvalue=.5), loss='categorical_crossentropy')
train_generator = s2s.create_batch_generator(train_ids, df.input_sequences, df.target_sequences, BATCH_SIZE)
val_generator = s2s.create_batch_generator(val_ids, df.input_sequences, df.target_sequences, BATCH_SIZE)

s2s.model.fit_generator(
    train_generator,
    steps_per_epoch=np.ceil(len(train_ids) / BATCH_SIZE),
    epochs=EPOCHS,
    validation_data=val_generator,
    validation_steps=np.ceil(len(val_ids) / BATCH_SIZE),
)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f31d6788978>

In [12]:
name = 'beamsearchlarge'
s2s.model.save_weights(f'data/{name}_model_weights.h5')
s2s.inference_encoder_model.save_weights(f'data/{name}_inference_encoder_model_weights.h5')
s2s.inference_decoder_model.save_weights(f'data/{name}_inference_decoder_model_weights.h5')

In [13]:
def predict(sentence, beam_width=5):
    return s2s.decode_beam_search(pad_sequences(
        [bpe_input.subword_indices(preprocess(sentence))],
        padding='post',
        maxlen=max_len_input,
    ), beam_width=beam_width)

In [14]:
# Performance on some examples:
EXAMPLES = [
    'Hello.',
    'You are welcome.',
    'How do you do?',
    'I hate mondays.',
    'I am a programmer.',
    'Data is the new oil.',
    'It could be worse.',
    "I am on top of it.",
    "N° Uno",
    "Awesome!",
    "Put your feet up!",
    "From the start till the end!",
    "From dusk till dawn.",
]
for en in [sentence + '\n' for sentence in EXAMPLES]:
    print(f"{preprocess(en)!r} --> {predict(en)!r}")

'hello.' --> 'helfen sie.'
'you are welcome.' --> 'wir begrüßen sie.'
'how do you do?' --> 'wie tun sie?'
'i hate mondays.' --> 'ich meine das.'
'i am a programmer.' --> 'ich bin ein programm.'
'data is the new oil.' --> 'daten sind das neue öl.'
'it could be worse.' --> 'es könnte schlimmer sein.'
'i am on top of it.' --> 'ich bin dagegen.'
'n° uno' --> 'änderungsantrag'
'awesome!' --> 'nein!'
'put your feet up!' --> 'nehmen sie das!'
'from the start till the end!' --> 'aus dem ende!'
'from dusk till dawn.' --> 'von der drehteu.'


In [15]:
# Performance on training set:
for en, de in df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

Original "please rise, then, for this minute' s silence.", got 'bitte gestatten sie mir also eine schweigeminute.', exp: 'ich bitte sie, sich zu einer schweigeminute zu erheben.'
Original "(the house rose and observed a minute' s silence)", got '(das parlament erhebt sich zu einer schweigeminute.)', exp: '(das parlament erhebt sich zu einer schweigeminute.)'
Original 'madam president, on a point of order.', got 'frau präsidentin, zur geschäftsordnung.', exp: 'frau präsidentin, zur geschäftsordnung.'
Original 'if the house agrees, i shall do as mr evans has suggested.', got 'wenn der abgeordnete soviel ich einverstanden habe, dann habe ich das haus.', exp: 'wenn das haus damit einverstanden ist, werde ich dem vorschlag von herrn evans folgen.'
Original 'madam president, on a point of order.', got 'frau präsidentin, zur geschäftsordnung.', exp: 'frau präsidentin, zur geschäftsordnung.'
Original 'i would like your advice about rule 0 concerning inadmissibility.', got 'ich würde gerne ihre

In [16]:
# Performance on validation set
val_df = df.iloc[val_ids]
for en, de in val_df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

Original 'we congratulate you on a job very well done.', got 'wir gratulieren ihnen zu ihrer sehr guten arbeit.', exp: 'wir gratulieren dir zu deiner hervorragenden arbeit.'
Original 'in this case, i strongly disagree with what was said by the previous speaker, carl schlyter.', got 'in diesem punkt stimme ich dem vorschlag von herrn schnellhardt nicht zu.', exp: 'in diesem fall widerspreche ich klar dem, was der vorredner, carl schlyter, gesagt hat.'
Original 'it only makes sense to rebuild these if the refugees who fled are coming back.', got 'es ist nur so, wenn die flüchtlinge zurückgekehrt werden müssen.', exp: 'deren wiederaufbau ist nur dann von nutzen, wenn die flüchtlinge aus den betreffenden gebieten wieder zurückkehren.'
Original 'eba: everything but arms.', got 'eurobonds: alles.', exp: 'eba: everything but arms.'
Original 'in wider terms, this directive is not ambitious enough.', got 'diese richtlinie reicht nicht aus.', exp: 'generell gesehen fehlt es dieser richtlinie an 

In [17]:
bleu = bleu_scores_europarl(
    input_texts=df.input_texts.iloc[val_ids[:TEST_SIZE]],
    target_texts=df.target_texts.iloc[val_ids[:TEST_SIZE]],
    predict=lambda text: predict(text)
)
print(f'average BLEU on test set = {bleu.mean()}')

HBox(children=(IntProgress(value=0, max=2500), HTML(value='')))


average BLEU on test set = 0.18274617561255022


# Conclusion

The texts feel more readable, allthough the BLEU score rises up only a bit ($0.328 > $0.316$).
A lot of the problems in the translations certainly depend on the still small training set (~200k), so as next step, I'll train on a bigger sub-corpus of longer texts. This will also make the need to use an attention model more clear.