# Byte pair encoding seq2seq model in Keras that translates English to German

As a next step, I'll move to a higher level than char-level embeddings. The first natural choice would be word embeddings like word2vec, GloVe, the more modern fastText from Facebook (especially for German), or the even more recent StarSpace. Word embeddings give good results very quickly, so they would be the most natural choice for prototyping. The disadvantage is, of course, that out-of-vocabulary words can't be handled and that linguistic patterns (like plural or case) are harder to detect. The vocabulary size is the most critical factor, especially in terms of performance and memory consumption. A big vocabulary (more than 50k entries) would be very problematic, because the output softmax layer has that size. Not only is it slow, the one-hot encoding of the targets also takes an enormous amount of memory and gets tricky to work around (even here, I had to reduce the batch size to work with 5k byte pairs). So the state-of-the-art technique is to use byte pair encodings with a medium-sized vocabulary (around 10k). Here, I'll use the pretrained models from https://github.com/bheinzerling/bpemb
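To make the memory argument concrete, here is a back-of-the-envelope sketch. The helper `one_hot_bytes` is a name I made up for this illustration. The figures are the ones this notebook arrives at further down (167,211 sentence pairs, 39 decoder steps, 5,367 target tokens), and float32 targets are an assumption.

```python
def one_hot_bytes(n_samples, seq_len, vocab_size, bytes_per_value=4):
    """Bytes needed to hold dense one-hot targets (hypothetical helper)."""
    return n_samples * seq_len * vocab_size * bytes_per_value

# Figures taken from later cells of this notebook; float32 (4 bytes) is assumed.
full_set = one_hot_bytes(167_211, 39, 5_367)
one_batch = one_hot_bytes(64, 39, 5_367)

print(f"whole data set:  {full_set / 2**30:.1f} GiB")
print(f"one batch of 64: {one_batch / 2**20:.1f} MiB")
```

Under these assumptions the full set of one-hot targets would need well over 100 GiB, while a single batch stays in the range of tens of MiB. That is the reason for encoding the targets batch by batch instead of up front.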

One consequence is that I have to preprocess the raw sentences (lower-casing, replacing numbers by $0$) to be able to use the pretrained byte pair embeddings. For production it wouldn't be a big hassle to train the embeddings yourself, but that's not necessary for demonstration purposes here.

As training set, I use the German-English part of the [European Parliament Proceedings Parallel Corpus 1996-2011](http://statmt.org/europarl/), restricted to medium-sized sentences.

In [1]:
# technical detail so that an instance (maybe running in a different window)
# doesn't take all the GPU memory resulting in some strange error messages
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5
set_session(tf.Session(config=config))

Using TensorFlow backend.


In [2]:
import math
import matplotlib.pyplot as plt
import os
import re
import tarfile

from gensim.models import KeyedVectors
import keras
import keras.layers as L
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
from keras import regularizers
import numpy as np
import pandas as pd
import requests
import sentencepiece as spm
from sklearn.model_selection import train_test_split
from tqdm import tqdm_notebook as tqdm

# Fix the random state to get reproducible results
RANDOM_STATE=42
np.random.seed(RANDOM_STATE)
tf.set_random_seed(RANDOM_STATE)

In [3]:
MAX_INPUT_LENGTH = 50
MAX_TARGET_LENGTH = 65
LATENT_DIM = 512
EMBEDDING_DIM = 300
BPE_MERGE_OPERATIONS = 5_000  # I'd love to use 10_000 x 300, but this one is broken: https://github.com/bheinzerling/bpemb/issues/6
EPOCHS = 20
BATCH_SIZE = 64
DROPOUT = 0.5
TEST_SIZE = 500
EMBEDDING_TRAINABLE = True  # Improves results significantly, and at least it's not the dominant training-time factor (that's the output softmax layer)

## Download and explore data

In [4]:
def download_file(fname, url):
    print(f"Downloading {fname} from {url} ...")
    response = requests.get(url, stream=True)

    total_size = int(response.headers.get('content-length', 0))
    block_size = 1024

    download = tqdm(
        response.iter_content(block_size),
        total=math.ceil(total_size / block_size),  # true division, so a partial last block is counted
        unit='KB',
        unit_scale=True
    )
    with open(fname, "wb") as handle:
        for data in download:
            handle.write(data)

PATH = 'data'
INPUT_LANG = 'en'
TARGET_LANG = 'de'
LANGUAGES = [INPUT_LANG, TARGET_LANG]
BPE_URL = {lang: f'http://cosyne.h-its.org/bpemb/data/{lang}/' for lang in LANGUAGES}
BPE_MODEL_NAME = {lang: f'{lang}.wiki.bpe.op{BPE_MERGE_OPERATIONS}.model' for lang in LANGUAGES}
BPE_WORD2VEC_NAME = {lang: f'{lang}.wiki.bpe.op{BPE_MERGE_OPERATIONS}.d{EMBEDDING_DIM}.w2v.bin' for lang in LANGUAGES}
DOWNLOAD_FILES = {
    'de-en.tgz': 'http://statmt.org/europarl/v7/de-en.tgz',
    BPE_MODEL_NAME[INPUT_LANG]: f'{BPE_URL[INPUT_LANG]}/{BPE_MODEL_NAME[INPUT_LANG]}',
    BPE_WORD2VEC_NAME[INPUT_LANG] + '.tar.gz': f'{BPE_URL[INPUT_LANG]}/{BPE_WORD2VEC_NAME[INPUT_LANG]}' + '.tar.gz',
    BPE_MODEL_NAME[TARGET_LANG]: f'{BPE_URL[TARGET_LANG]}/{BPE_MODEL_NAME[TARGET_LANG]}',
    BPE_WORD2VEC_NAME[TARGET_LANG] + '.tar.gz': f'{BPE_URL[TARGET_LANG]}/{BPE_WORD2VEC_NAME[TARGET_LANG]}' + '.tar.gz',
}
os.makedirs(PATH, exist_ok=True)

for name, url in DOWNLOAD_FILES.items():
    fname = os.path.join(PATH, name)
    exists = os.path.exists(fname)
    size = os.path.getsize(fname) if exists else -1
    if exists and size > 0:
        print(f'{name} already downloaded ({size / 2**20:3.1f} MB)')
        continue
    download_file(fname, url)
    if re.search(r'\.(tgz|tar\.gz)$', fname):
        # the context manager closes the archive even if extraction fails
        with tarfile.open(fname, "r:gz") as tar:
            tar.extractall(path=PATH)
        print(f'Extracted {fname} ...')


de-en.tgz already downloaded (188.6 MB)
en.wiki.bpe.op5000.model already downloaded (0.3 MB)
en.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (6.2 MB)
de.wiki.bpe.op5000.model already downloaded (0.3 MB)
de.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (5.7 MB)


In [5]:
# Following https://github.com/bheinzerling/bpemb/blob/master/preprocess_text.sh
# (ignoring urls as there shouldn't be any in parliament discussions)
def preprocess(line):
    line = re.sub(r'\d+', '0', line)
    line = re.sub(r'\s+', ' ', line)
    return line.lower().strip()

def read_corpus_lines(language):
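    # read explicitly as UTF-8 and close the file again after reading
    with open(f'{PATH}/europarl-v7.de-en.{language}', 'r', encoding='utf-8') as corpus_file:
        return [preprocess(line) for line in corpus_file]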
    
pd.set_option('max_colwidth', 60)
df = pd.DataFrame(data={
    'input_texts': read_corpus_lines('en'),
    'target_texts': read_corpus_lines('de'), 
})
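
To see what the preprocessing does to a raw line, here is a small stand-alone check. It repeats the same three steps as `preprocess` above; the sample sentence is only an illustration that I typed in by hand.

```python
import re

def preprocess(line):
    # Same steps as in the notebook: number runs -> 0, squeeze whitespace, lower case.
    line = re.sub(r'\d+', '0', line)
    line = re.sub(r'\s+', ' ', line)
    return line.lower().strip()

raw = "I declare resumed the session adjourned on Friday 17 December 1999.\n"
print(preprocess(raw))
# i declare resumed the session adjourned on friday 0 december 0.
```

Note that a whole run of digits collapses into a single `0`, so `17` and `1999` become indistinguishable. That is the price for matching the text format the pretrained embeddings were trained on.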

In [6]:
print("Nr total input:", len(df))
# No start symbol is added here; '<s>' and '</s>' are added later during subword encoding
df['input_length'] = df.input_texts.apply(len)  # length in characters
df['target_length'] = df.target_texts.apply(len)  # length in characters
df.head()

Nr total input: 1920209


| | input_texts | target_texts | input_length | target_length |
|---|---|---|---|---|
| 0 | resumption of the session | wiederaufnahme der sitzungsperiode | 25 | 34 |
| 1 | i declare resumed the session of the european parliament... | ich erkläre die am freitag, dem 0. dezember unterbrochen... | 203 | 217 |
| 2 | although, as you will have seen, the dreaded 'millennium... | wie sie feststellen konnten, ist der gefürchtete "millen... | 191 | 185 |
| 3 | you have requested a debate on this subject in the cours... | im parlament besteht der wunsch nach einer aussprache im... | 105 | 110 |
| 4 | in the meantime, i should like to observe a minute' s si... | heute möchte ich sie bitten - das ist auch der wunsch ei... | 232 | 217 |


In [7]:
non_empty = (df.input_length > 1) & (df.target_length > 1)  # there are empty phrases like '\n' --> 'Frau Präsidentin\n'
short_inputs = (df.input_length < MAX_INPUT_LENGTH) & (df.target_length < MAX_TARGET_LENGTH)
sum(non_empty & short_inputs)
df = df[non_empty & short_inputs]

167211

In [8]:
input_pretrained_bpe = KeyedVectors.load_word2vec_format(os.path.join(PATH, BPE_WORD2VEC_NAME[INPUT_LANG]), binary=True)
target_pretrained_bpe = KeyedVectors.load_word2vec_format(os.path.join(PATH, BPE_WORD2VEC_NAME[TARGET_LANG]), binary=True)
sp_input = spm.SentencePieceProcessor()
sp_input.Load(os.path.join(PATH, BPE_MODEL_NAME[INPUT_LANG]))
subwords = sp_input.EncodeAsPieces("this is a test for pretrained bytepairembeddings")
print(subwords)
sp_target = spm.SentencePieceProcessor()
sp_target.Load(os.path.join(PATH, BPE_MODEL_NAME[TARGET_LANG]))
subwords = sp_target.EncodeAsPieces("das ist ein test für vortrainiert teilwort-embeddings")
print(subwords)

True

['▁this', '▁is', '▁a', '▁test', '▁for', '▁pre', 'tr', 'ained', '▁by', 'te', 'pa', 'ire', 'm', 'bed', 'd', 'ings']


True

['▁das', '▁ist', '▁ein', '▁test', '▁für', '▁v', 'ort', 'rain', 'iert', '▁teil', 'wort', '-', 'em', 'be', 'd', 'dings']
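
The subword splits above come from pretrained byte pair encoding models. For intuition, here is a minimal sketch of how BPE merges can be learned: start from single characters and repeatedly merge the most frequent adjacent pair. The function `learn_bpe` and the toy word counts are made up for this illustration, and this is not how SentencePiece or BPEmb implement it internally.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dict (toy version)."""
    # Every word starts as a tuple of single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # on ties, the pair seen first wins
        merges.append(best)
        # Replace every occurrence of the best pair by the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab

# Made-up word counts, only for illustration.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_counts, num_merges=3)
print(merges)
print(vocab)
```

On this toy corpus the frequent ending "est" gets its own symbol after two merges, while rarer words stay split into smaller pieces. The pretrained models do the same at scale with `BPE_MERGE_OPERATIONS` merges, which is why "bytepairembeddings" falls apart into many short pieces.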


In [9]:
input_wordvec_index = dict({
    word: index 
    for index, word 
    in enumerate(['<pad>', '<s>', '</s>'] + input_pretrained_bpe.index2word)  # haven't found start/stop tokens, so add them manually
})
input_unk_index = input_wordvec_index['<unk>']

target_wordvec_index = dict({
    word: index 
    for index, word 
    in enumerate(['<pad>', '<s>', '</s>'] + target_pretrained_bpe.index2word)  # haven't found start/stop tokens, so add them manually
})
target_unk_index = target_wordvec_index['<unk>']

def subword_indices(text, unk_index, sp, wordvec_index):
    subwords = ['<s>'] + sp.EncodeAsPieces(text) + ['</s>']  # automatically add start/stop tokens
    return [wordvec_index.get(subword, unk_index) for subword in subwords]

def input_subword_indices(text):
    return subword_indices(text, input_unk_index, sp_input, input_wordvec_index)

def target_subword_indices(text):
    return subword_indices(text, target_unk_index, sp_target, target_wordvec_index)

FULL_EMBEDDING_DIM = EMBEDDING_DIM + 2
input_embedding_matrix = np.zeros((len(input_wordvec_index), FULL_EMBEDDING_DIM), dtype=np.float32)
input_embedding_matrix[0, :] = 1e-6 * np.random.standard_normal(FULL_EMBEDDING_DIM)  # pad symbol as close to zero
input_embedding_matrix[1, -1] = 1  # one hot encode start symbol
input_embedding_matrix[2, -2] = 1  # one hot encode stop symbol
input_embedding_matrix[3:, :-2] = input_pretrained_bpe.vectors

target_embedding_matrix = np.zeros((len(target_wordvec_index), FULL_EMBEDDING_DIM), dtype=np.float32)
target_embedding_matrix[0, :] = 1e-6 * np.random.standard_normal(FULL_EMBEDDING_DIM)  # pad symbol as close to zero
target_embedding_matrix[1, -1] = 1  # one hot encode start symbol
target_embedding_matrix[2, -2] = 1  # one hot encode stop symbol
target_embedding_matrix[3:, :-2] = target_pretrained_bpe.vectors

df['input_sequences'] = df.input_texts.apply(input_subword_indices)
df['target_sequences'] = df.target_texts.apply(target_subword_indices)

In [10]:
input_embedding_matrix[:5, -8:]
input_embedding_matrix[-5:, -8:]
target_embedding_matrix[:4, -8:]

array([[ 3.5701549e-07, -6.9290962e-07,  8.9959985e-07,  3.0729953e-07,
         8.1286214e-07,  6.2962886e-07, -8.2899498e-07, -5.6018104e-07],
       [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00,  0.0000000e+00,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00,  1.0000000e+00],
       [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00,  0.0000000e+00,
         0.0000000e+00,  0.0000000e+00,  1.0000000e+00,  0.0000000e+00],
       [-7.1291000e-02,  1.3662800e-01, -9.3874000e-02, -1.3716000e-02,
         5.3709999e-02,  5.3034998e-02,  0.0000000e+00,  0.0000000e+00],
       [-5.3902999e-02, -1.0761100e-01, -3.5625100e-01,  6.3295998e-02,
         1.0626500e-01, -1.8801700e-01,  0.0000000e+00,  0.0000000e+00]],
      dtype=float32)

array([[-0.240234,  0.259846,  0.233096,  1.183777, -0.265572, -0.147195,
         0.      ,  0.      ],
       [ 0.741856,  0.96104 , -0.303011, -0.01168 , -0.268806,  0.085153,
         0.      ,  0.      ],
       [-0.086124, -0.168327,  0.055605,  1.531863, -0.044677,  0.176467,
         0.      ,  0.      ],
       [-0.368369,  0.186533,  0.397875,  0.504644,  0.097682, -0.148711,
         0.      ,  0.      ],
       [ 0.272315,  0.356335,  0.225856,  0.584159, -0.160238,  0.018695,
         0.      ,  0.      ]], dtype=float32)

array([[-2.6987493e-07, -9.7876375e-07, -4.4429325e-07,  3.7730049e-07,
         7.5698864e-07, -9.2216533e-07,  8.6960591e-07,  1.3556379e-06],
       [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00,  0.0000000e+00,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00,  1.0000000e+00],
       [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00,  0.0000000e+00,
         0.0000000e+00,  0.0000000e+00,  1.0000000e+00,  0.0000000e+00],
       [-3.5677999e-02, -7.2214000e-02, -6.2057000e-02,  2.3062800e-01,
         1.2143200e-01,  1.2812901e-01,  0.0000000e+00,  0.0000000e+00]],
      dtype=float32)
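
The layout of these matrices can be reproduced on a toy scale. The sketch below uses plain lists and made-up 3-dimensional "pretrained" vectors; `extend_embeddings` is a hypothetical helper, and unlike the notebook it fills the pad row with exact zeros instead of tiny random noise.

```python
def extend_embeddings(pretrained, dim):
    """Prepend <pad>, <s>, </s> rows and append two marker columns (toy helper)."""
    full_dim = dim + 2
    pad_row = [0.0] * full_dim      # the notebook uses ~1e-6 noise here instead
    start_row = [0.0] * full_dim
    stop_row = [0.0] * full_dim
    start_row[-1] = 1.0             # last column flags <s>
    stop_row[-2] = 1.0              # second-to-last column flags </s>
    rows = [pad_row, start_row, stop_row]
    for vector in pretrained:
        # Pretrained rows keep their values and get zeros in the marker columns.
        rows.append(list(vector) + [0.0, 0.0])
    return rows

toy_vectors = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]  # made-up vectors
matrix = extend_embeddings(toy_vectors, dim=3)
for row in matrix:
    print(row)
```

The two extra columns keep the start and stop symbols orthogonal to every pretrained subword vector, which is what the printed slices above show in their last two columns.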

In [11]:
max_len_input = df.input_sequences.apply(len).max()
max_len_target = df.target_sequences.apply(len).max()
nr_input_tokens = len(input_wordvec_index)  
nr_target_tokens = len(target_wordvec_index)

# one hot encoded y_t_output wouldn't fit into memory any longer
# so need to train/validate on batches generated on the fly
def create_batch_generator(samples_ids):
    
    def batch_generator():
        nr_batches = np.ceil(len(samples_ids) / BATCH_SIZE)
        while True:
            shuffled_ids = np.random.permutation(samples_ids)
            batch_splits = np.array_split(shuffled_ids, nr_batches)
            for batch_ids in batch_splits:
                batch_X = pad_sequences(df.iloc[batch_ids].input_sequences, padding='post', maxlen=max_len_input)
                batch_y = pad_sequences(df.iloc[batch_ids].target_sequences, padding='post', maxlen=max_len_target)
                batch_y_t_output = keras.utils.to_categorical(batch_y[:,1:], num_classes=nr_target_tokens)
                batch_x_t_input = batch_y[:,:-1]
                yield ([batch_X, batch_x_t_input], batch_y_t_output)
    
    return batch_generator()
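
The slicing in the generator implements teacher forcing: the decoder sees the target shifted by one position and has to predict the next token at every step. Here is a minimal stand-alone sketch with plain lists. `pad_post` and `teacher_forcing_pair` are hypothetical helpers, and the token ids 17 and 42 are made up; only 0, 1 and 2 follow the notebook's convention for `<pad>`, `<s>` and `</s>`.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Right-pad a list of token ids to maxlen (assumes len(sequence) <= maxlen)."""
    return list(sequence) + [pad_value] * (maxlen - len(sequence))

def teacher_forcing_pair(target_ids, maxlen):
    """Split a padded target into decoder input and expected output."""
    padded = pad_post(target_ids, maxlen)
    decoder_input = padded[:-1]    # <s> w1 w2 ... (last position dropped)
    decoder_output = padded[1:]    # w1 w2 ... </s> (shifted left by one)
    return decoder_input, decoder_output

# 0 = <pad>, 1 = <s>, 2 = </s>; 17 and 42 are made-up subword ids.
decoder_input, decoder_output = teacher_forcing_pair([1, 17, 42, 2], maxlen=6)
print(decoder_input)   # [1, 17, 42, 2, 0]
print(decoder_output)  # [17, 42, 2, 0, 0]
```

In the generator above, `batch_x_t_input` plays the role of `decoder_input` and the one-hot encoded `batch_y_t_output` the role of `decoder_output`. Both have length `max_len_target - 1`.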

In [12]:
train_ids, val_ids = train_test_split(np.arange(df.shape[0]), test_size=0.1)

In [13]:
nr_input_tokens, nr_target_tokens
len(train_ids), len(val_ids)

(5829, 5367)

(150489, 16722)
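
With these numbers, a few quantities that show up later can be worked out by hand. This is only arithmetic on values from this notebook. The softmax count assumes a dense layer with bias on top of a `LATENT_DIM`-sized decoder state, as in the model defined further down.

```python
import math

nr_input_tokens, nr_target_tokens = 5829, 5367   # from the output above
n_train, n_val = 150_489, 16_722                  # from the output above
full_embedding_dim = 300 + 2                      # EMBEDDING_DIM plus the two marker columns
latent_dim, batch_size = 512, 64                  # LATENT_DIM, BATCH_SIZE

input_embedding_params = nr_input_tokens * full_embedding_dim
target_embedding_params = nr_target_tokens * full_embedding_dim
softmax_params = latent_dim * nr_target_tokens + nr_target_tokens  # weights + biases

steps_per_epoch = math.ceil(n_train / batch_size)
validation_steps = math.ceil(n_val / batch_size)

print(input_embedding_params, target_embedding_params, softmax_params)
print(steps_per_epoch, validation_steps)
```

The output softmax layer alone has more parameters than either embedding matrix and is applied at every decoder time step, which fits the remark that it dominates the training time.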

In [14]:
encoder_gru = L.Bidirectional(
    L.GRU(LATENT_DIM // 2, dropout=DROPOUT, return_state=True, name='encoder_gru'),
    name='encoder_bidirectional'
)
decoder_gru = L.GRU(LATENT_DIM, dropout=DROPOUT, return_sequences=True, return_state=True, name='decoder_gru', dtype=tf.float32)
decoder_dense = L.Dense(nr_target_tokens, activation='softmax', name='decoder_outputs', dtype=tf.float32)

input_embedding = L.Embedding(
    nr_input_tokens,
    FULL_EMBEDDING_DIM,
    mask_zero=True,
    weights=[input_embedding_matrix],
    trainable=EMBEDDING_TRAINABLE,
    name='input_embedding',
    dtype=tf.float32,
)
target_embedding = L.Embedding(
    nr_target_tokens,
    FULL_EMBEDDING_DIM,
    mask_zero=True,
    weights=[target_embedding_matrix],
    trainable=EMBEDDING_TRAINABLE,
    name='target_embedding',
    dtype=tf.float32,
)

encoder_inputs = L.Input(shape=(max_len_input, ), dtype='int32', name='encoder_inputs')
encoder_embeddings = input_embedding(encoder_inputs)
_, encoder_state_1, encoder_state_2 = encoder_gru(encoder_embeddings)
encoder_states = L.concatenate([encoder_state_1, encoder_state_2])

decoder_inputs = L.Input(shape=(max_len_target-1, ), dtype='int32', name='decoder_inputs')
decoder_mask = L.Masking(mask_value=0)(decoder_inputs)
decoder_embeddings_inputs = target_embedding(decoder_mask)
decoder_embeddings_outputs, _ = decoder_gru(decoder_embeddings_inputs, initial_state=encoder_states) 
decoder_outputs = decoder_dense(decoder_embeddings_outputs)

model = Model(inputs=[encoder_inputs, decoder_inputs], outputs=decoder_outputs)

inference_encoder_model = Model(encoder_inputs, encoder_states)
    
inference_decoder_state_inputs = L.Input(shape=(LATENT_DIM, ), dtype='float32', name='inference_decoder_state_inputs')
inference_decoder_embeddings_outputs, inference_decoder_states = decoder_gru(
    decoder_embeddings_inputs, initial_state=inference_decoder_state_inputs
)
inference_decoder_outputs = decoder_dense(inference_decoder_embeddings_outputs)

inference_decoder_model = Model(
    [decoder_inputs, inference_decoder_state_inputs], 
    [inference_decoder_outputs, inference_decoder_states]
)

In [15]:
model.summary()
inference_decoder_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_inputs (InputLayer)     (None, 27)           0                                            
__________________________________________________________________________________________________
decoder_inputs (InputLayer)     (None, 39)           0                                            
__________________________________________________________________________________________________
input_embedding (Embedding)     (None, 27, 302)      1760358     encoder_inputs[0][0]             
__________________________________________________________________________________________________
masking_1 (Masking)             (None, 39)           0           decoder_inputs[0][0]             
__________________________________________________________________________________________________
encoder_bi

In [16]:
model.compile(optimizer=keras.optimizers.Adam(clipnorm=1., clipvalue=.5), loss='categorical_crossentropy')

In [17]:
train_generator = create_batch_generator(train_ids)
val_generator = create_batch_generator(val_ids)
model.fit_generator(
    train_generator,
    steps_per_epoch=np.ceil(len(train_ids) / BATCH_SIZE),
    epochs=EPOCHS,
    validation_data=val_generator,
    validation_steps=np.ceil(len(val_ids) / BATCH_SIZE),
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f8088884a58>

In [18]:
def decode_sequence(input_seq):
    states_value = inference_encoder_model.predict(input_seq)
    
    tokens = {idx: token for (token, idx) in target_wordvec_index.items()}
    start_token_idx = target_wordvec_index['<s>']
    end_token_idx = target_wordvec_index['</s>']
    
    target_seq = np.zeros((1, max_len_target-1))
    target_seq[0, 0] = start_token_idx
    
    decoded_sequence = [] 
    for i in range(max_len_target):
        output_tokens, output_states = inference_decoder_model.predict(
            [target_seq, states_value]
        )
        
        # greedy search
        sampled_token_idx = np.argmax(output_tokens[0, 0, :])
        if sampled_token_idx == end_token_idx:
            break
        sampled_token = tokens.get(sampled_token_idx, '~')
        decoded_sequence.append(sampled_token)
            
        target_seq[0, 0] = sampled_token_idx
        states_value = output_states
    
    return sp_target.DecodePieces(decoded_sequence)

In [19]:
def predict(sentence):
    return decode_sequence(keras.preprocessing.sequence.pad_sequences(
        [input_subword_indices(preprocess(sentence))],
        padding='post',
        maxlen=max_len_input,
    ))

In [20]:
# Performance on some examples:
EXAMPLES = [
    'Hello.',
    'You are welcome.',
    'How do you do?',
    'I hate mondays.',
    'I am a programmer.',
    'Data is the new oil.',
    'It could be worse.',
    "I am on top of it.",
    "N° Uno",
    "Awesome!",
    "Put your feet up!",
    "From the start till the end!",
    "From dusk till dawn.",
]
for en in [sentence + '\n' for sentence in EXAMPLES]:
    print(f"{en!r} --> {predict(en)!r}")

'Hello.\n' --> 'hallam.'
'You are welcome.\n' --> 'seien sie willkommen.'
'How do you do?\n' --> 'was tun sie?'
'I hate mondays.\n' --> 'ich habe mir meine fehler.'
'I am a programmer.\n' --> 'ich bin ein programm.'
'Data is the new oil.\n' --> 'die erdöl ist ein ölliger anfang.'
'It could be worse.\n' --> 'das könnte weiter verschlechtert werden.'
'I am on top of it.\n' --> 'ich bin überzeugt.'
'N° Uno\n' --> 'napvess'
'Awesome!\n' --> 'das ist eine schande!'
'Put your feet up!\n' --> 'fangen sie hart!'
'From the start till the end!\n' --> 'mit dem zielsetzungen!'
'From dusk till dawn.\n' --> 'von einem dialog wird sie tun.'


In [21]:
# Performance on training set:
for en, de in df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

Original "please rise, then, for this minute' s silence.", got 'bitte fahren sie sich das wort für mich.', exp: 'ich bitte sie, sich zu einer schweigeminute zu erheben.'
Original "(the house rose and observed a minute' s silence)", got '(das parlament erhebt sich zu einer schweigeminute.)', exp: '(das parlament erhebt sich zu einer schweigeminute.)'
Original 'madam president, on a point of order.', got 'frau präsidentin, zur geschäftsordnung.', exp: 'frau präsidentin, zur geschäftsordnung.'
Original 'madam president, on a point of order.', got 'frau präsidentin, zur geschäftsordnung.', exp: 'frau präsidentin, zur geschäftsordnung.'
Original 'thank you, mr segni, i shall do so gladly.', got 'vielen dank, herr segni, ich möchte es danken.', exp: 'vielen dank, herr segni, das will ich gerne tun.'
Original 'it is the case of alexander nikitin.', got 'das ist der fall von alexander nikitin.', exp: 'das ist der fall von alexander nikitin.'
Original 'it will, i hope, be examined in a positive

In [22]:
# Performance on validation set
val_df = df.iloc[val_ids]
for en, de in val_df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

Original 'a lot of inspiration is needed.', got 'eine vielzahl von uns ist erforderlich.', exp: 'diese denkanstöße sind dringend nötig.'
Original 'so what was cancún about?', got 'was ist also zu cancún?', exp: 'worum ging es in cancún?'
Original 'i now turn to another subject.', got 'ich komme nun zu einem weiteren punkt.', exp: 'ich komme jetzt zu einem anderen punkt.'
Original 'thank you, mr prodi.', got 'vielen dank, herr prodi.', exp: 'vielen dank, herr prodi.'
Original 'the experience acquired varied somewhat.', got 'die erfahrung hat etwas ähnliches.', exp: 'die damit gemachten erfahrungen sind recht unterschiedlich.'
Original 'that is what mr smith suggested.', got 'das hat sagt, herr smith.', exp: 'das hat herr smith vorgeschlagen.'
Original 'it is stable internally and externally.', got 'das ist ein ständiges und interessanter punkt.', exp: 'er ist stabil nach innen und außen.'
Original 'mr president, i am wearing my irish scarf today.', got 'herr präsident, ich bin der irisc

In [23]:
import spacy
try:
    from spacy.lang.de import German
except ModuleNotFoundError:
    spacy.cli.download('de')
    from spacy.lang.de import German
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

parser = German()
chencherry = SmoothingFunction()  # to handle short sequences, see also http://www.nltk.org/_modules/nltk/translate/bleu_score.html#SmoothingFunction.method3

def remove_spaces_and_puncts(tokens):
     return [token.orth_ for token in tokens if not (token.is_space or token.is_punct)]  

bleu_scores = np.zeros(TEST_SIZE)
nist_scores = np.zeros(TEST_SIZE)

for i in tqdm(range(TEST_SIZE)):
    pred_tokens = remove_spaces_and_puncts(parser(predict(df.iloc[val_ids[i]].input_texts)))
    ref_tokens = remove_spaces_and_puncts(parser(df.iloc[val_ids[i]].target_texts))
    bleu_scores[i] = sentence_bleu([ref_tokens], pred_tokens, smoothing_function=chencherry.method3)

print("Average bleu score:", bleu_scores.mean())

HBox(children=(IntProgress(value=0, max=500), HTML(value='')))


Average bleu score: 0.31657723155182826
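
For intuition about what goes into this number: BLEU is built on clipped n-gram precisions between the prediction and the reference. The sketch below computes only that building block for one sentence pair from the validation output above (punctuation stripped). It is not a replacement for NLTK's `sentence_bleu`, which additionally combines several n-gram orders, applies a brevity penalty and the chosen smoothing. The helper names are my own.

```python
from collections import Counter

def ngrams(tokens, n):
    """All consecutive n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    if not hyp_counts:
        return 0.0
    # Each hypothesis n-gram counts at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

# Sentence pair taken from the validation output above.
hyp = 'ich komme nun zu einem weiteren punkt'.split()
ref = 'ich komme jetzt zu einem anderen punkt'.split()
for n in (1, 2):
    print(n, modified_precision(ref, hyp, n))
```

Five of the seven predicted words occur in the reference, but only two of the six bigrams do, so a translation that reads almost right can still lose a lot on the higher n-gram orders.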


In [24]:
name = 'bytepairencoding'
os.makedirs('models', exist_ok=True)  # make sure the target directory exists before saving
model.save_weights(f'models/{name}_model_weights.h5')  # https://drive.google.com/open?id=1xK2QVTsIpJLmphSEUZl1Unqmz85MYeQK
inference_encoder_model.save_weights(f'models/{name}_inference_encoder_model_weights.h5')  # https://drive.google.com/open?id=115Kp7ZIMqxu6YDvk4RhjvfYShcQdRP_o
inference_decoder_model.save_weights(f'models/{name}_inference_decoder_model_weights.h5')  # https://drive.google.com/open?id=1_e3DE5lDw10joIb83UFbzGJyvQrMfb8w

# Conclusion

It is definitely an improvement over the pure char-level model, although a lot of the translations are still more funny than decent. The BLEU score agrees with this impression and increases considerably ($0.316 > 0.213$).
On a side note, training could have been stopped early after a few epochs (the model indeed overfits from that point on). So this new model has a lot of capacity left that could be used with either more data or a better model. The next missing piece is beam search instead of greedy search. I guess this will reduce some of the strange or funny translations a bit.
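
To illustrate the difference between greedy search and beam search, here is a minimal sketch on a made-up toy model. Everything in it is an assumption for illustration: `beam_search`, `toy_probs` and the probabilities are invented, and wiring it to `inference_decoder_model` would additionally require carrying the GRU state along with every hypothesis, which is left out here.

```python
import math

def beam_search(next_token_probs, start, stop, beam_width, max_len):
    """Return the most probable sequence found with a fixed-width beam."""
    beams = [([start], 0.0)]          # (tokens, summed log probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, prob in next_token_probs(tokens).items():
                if prob > 0.0:
                    candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the best `beam_width` hypotheses of this step.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == stop else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)            # unfinished hypotheses compete as they are
    return max(finished, key=lambda item: item[1])[0]

# Made-up next-token distributions, chosen so that the greedy choice is not optimal.
TOY_MODEL = {
    ('<s>',): {'a': 0.6, 'b': 0.4},
    ('<s>', 'a'): {'x': 0.55, 'y': 0.45},
    ('<s>', 'b'): {'z': 0.9, 'y': 0.1},
}

def toy_probs(tokens):
    # After two real tokens the toy model always emits the stop symbol.
    return TOY_MODEL.get(tuple(tokens), {'</s>': 1.0})

print(beam_search(toy_probs, '<s>', '</s>', beam_width=1, max_len=5))  # greedy
print(beam_search(toy_probs, '<s>', '</s>', beam_width=2, max_len=5))
```

With a beam width of 1 this is exactly greedy search and commits to `a` because it looks best at the first step. With a beam width of 2 the hypothesis starting with `b` survives long enough to win with a higher overall probability (0.36 versus 0.33). The sketch does no length normalization, which a real implementation would probably want.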