# Beam search for a seq2seq model in Keras that translates English to German

After [implementing byte pair encoding](BytepairencodingForMachineTranslation.ipynb), I'll now improve the decoding (inference) step. Instead of always taking the most likely next symbol when decoding, beam search keeps a list of the `beam_width` best partial translations so far, expands each of them by one symbol, and again keeps only the best `beam_width` of the new candidates. I will implement it from scratch in Python here. An alternative would be to use the beam search method from `tensorflow` (and probably `tf.keras` instead of `Keras`). As my main purpose here is to go step by step on my own, I'll follow the educational approach and do it myself, using https://gist.github.com/udibr/67be473cf053d8c38730 as a template.
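
To make the idea concrete before touching the real model, here is a minimal sketch of beam search on a made-up toy "model". `TOY_MODEL`, its probabilities and `toy_beam_search` are hypothetical and exist only for this illustration; they are not part of the notebook's `seq2seq` module. With `beam_width=1` the search degenerates to greedy decoding and commits to `a` too early, while `beam_width=2` finds the more probable sequence via `b`.

```python
import math

# Hypothetical toy "model": next-symbol probabilities that depend only on the
# previous symbol. '</s>' ends a sequence. All numbers are made up.
TOY_MODEL = {
    '<s>': {'a': 0.6, 'b': 0.4},
    'a': {'x': 0.5, 'y': 0.5},
    'b': {'z': 0.9, 'x': 0.1},
    'x': {'</s>': 1.0},
    'y': {'</s>': 1.0},
    'z': {'</s>': 1.0},
}


def toy_beam_search(beam_width, max_len=5):
    """Return the best (sequence, probability) found with the given beam width."""
    # Each candidate is (summed negative log probability, symbols so far).
    beam = [(0.0, ['<s>'])]
    for _ in range(max_len):
        expanded = []
        for cost, seq in beam:
            if seq[-1] == '</s>':
                expanded.append((cost, seq))  # finished candidates are kept as-is
                continue
            for symbol, prob in TOY_MODEL[seq[-1]].items():
                expanded.append((cost - math.log(prob), seq + [symbol]))
        # keep only the `beam_width` cheapest (= most probable) candidates
        beam = sorted(expanded, key=lambda c: c[0])[:beam_width]
        if all(seq[-1] == '</s>' for _, seq in beam):
            break
    best_cost, best_seq = beam[0]
    return best_seq, math.exp(-best_cost)


greedy_seq, greedy_prob = toy_beam_search(beam_width=1)
beam_seq, beam_prob = toy_beam_search(beam_width=2)
print(greedy_seq, round(greedy_prob, 2))
print(beam_seq, round(beam_prob, 2))
```

The greedy run ends with probability 0.3, the beam of width 2 with 0.36, because the locally second-best first symbol `b` leads to a much more certain continuation.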

As training set I use the [European Parliament Proceedings Parallel Corpus 1996-2011](http://statmt.org/europarl/) German-English corpus with medium-sized sentences.

In [1]:
# technical detail so that an instance (maybe running in a different window)
# doesn't take all the GPU memory resulting in some strange error messages
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5
set_session(tf.Session(config=config))

Using TensorFlow backend.


In [2]:
import gc
import math
import matplotlib.pyplot as plt
import os
import re
import tarfile

from gensim.models import KeyedVectors
import keras
import keras.layers as L
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
from keras import regularizers
import numpy as np
import pandas as pd
import requests
import sentencepiece as spm
from sklearn.model_selection import train_test_split
from tqdm import tqdm_notebook as tqdm

import bytepairencoding as bpe
import seq2seq
from utils import download_and_extract_resources, read_europarl, preprocess_europarl as preprocess


# Fixing the random state ensures reproducible results
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
tf.set_random_seed(RANDOM_STATE)

pd.set_option('max_colwidth', 60)

In [3]:
MAX_INPUT_LENGTH = 50
MAX_TARGET_LENGTH = 65
LATENT_DIM = 512
EMBEDDING_DIM = 300
BPE_MERGE_OPERATIONS = 5_000  # I'd love to use 10_000 x 300, but this one is broken: https://github.com/bheinzerling/bpemb/issues/6
EPOCHS = 20
BATCH_SIZE = 64
DROPOUT = 0.5
TEST_SIZE = 500
EMBEDDING_TRAINABLE = True  # Improves results significantly, and it is at least not the dominant training-time factor (that's the output softmax layer)

## Download and explore data

In [4]:
PATH = 'data'
INPUT_LANG = 'en'
TARGET_LANG = 'de'
LANGUAGES = [INPUT_LANG, TARGET_LANG]
BPE_URL = {lang: f'http://cosyne.h-its.org/bpemb/data/{lang}' for lang in LANGUAGES}  # no trailing slash: the file name is appended with '/' below
BPE_MODEL_NAME = {lang: f'{lang}.wiki.bpe.op{BPE_MERGE_OPERATIONS}.model' for lang in LANGUAGES}
BPE_WORD2VEC_NAME = {lang: f'{lang}.wiki.bpe.op{BPE_MERGE_OPERATIONS}.d{EMBEDDING_DIM}.w2v.bin' for lang in LANGUAGES}

EXTERNAL_RESOURCES = {
    # Europarl Corpus
    'de-en.tgz': 'http://statmt.org/europarl/v7/de-en.tgz',
    
    # Bytepairencoding subwords (_MODEL_) and pretrained embeddings (_WORD2VEC_)
    BPE_MODEL_NAME[INPUT_LANG]: f'{BPE_URL[INPUT_LANG]}/{BPE_MODEL_NAME[INPUT_LANG]}',
    BPE_WORD2VEC_NAME[INPUT_LANG] + '.tar.gz': f'{BPE_URL[INPUT_LANG]}/{BPE_WORD2VEC_NAME[INPUT_LANG]}' + '.tar.gz',
    BPE_MODEL_NAME[TARGET_LANG]: f'{BPE_URL[TARGET_LANG]}/{BPE_MODEL_NAME[TARGET_LANG]}',
    BPE_WORD2VEC_NAME[TARGET_LANG] + '.tar.gz': f'{BPE_URL[TARGET_LANG]}/{BPE_WORD2VEC_NAME[TARGET_LANG]}' + '.tar.gz',
    
    # Bytepairencoded model weights from BytepairencodingForMachineTranslation.ipynb
    'bytepairencoding_model_weights.h5': 'https://drive.google.com/open?id=1xK2QVTsIpJLmphSEUZl1Unqmz85MYeQK',
    'bytepairencoding_inference_encoder_model_weights.h5': 'https://drive.google.com/open?id=115Kp7ZIMqxu6YDvk4RhjvfYShcQdRP_o',
    'bytepairencoding_inference_decoder_model_weights.h5': 'https://drive.google.com/open?id=1_e3DE5lDw10joIb83UFbzGJyvQrMfb8w',
}

In [9]:
download_and_extract_resources(fnames_and_urls=EXTERNAL_RESOURCES, dest_path=PATH)

de-en.tgz already downloaded (188.6 MB)
en.wiki.bpe.op5000.model already downloaded (0.3 MB)
en.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (6.2 MB)
de.wiki.bpe.op5000.model already downloaded (0.3 MB)
de.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (5.7 MB)


In [10]:
df = pd.DataFrame(data={
    'input_texts': read_europarl(INPUT_LANG),
    'target_texts': read_europarl(TARGET_LANG)
})

In [11]:
print("Nr total input:", len(df))
df.target_texts = df.target_texts  # no-op: nothing is prepended to the raw text here; decoding starts from `bpe_target.start_token`
df['input_length'] = df.input_texts.apply(len)
df['target_length'] = df.target_texts.apply(len)
df.head()

Nr total input: 1920209


|   | input_texts | target_texts | input_length | target_length |
|---|---|---|---|---|
| 0 | resumption of the session | wiederaufnahme der sitzungsperiode | 25 | 34 |
| 1 | i declare resumed the session of the european parliament... | ich erkläre die am freitag, dem 0. dezember unterbrochen... | 203 | 217 |
| 2 | although, as you will have seen, the dreaded 'millennium... | wie sie feststellen konnten, ist der gefürchtete "millen... | 191 | 185 |
| 3 | you have requested a debate on this subject in the cours... | im parlament besteht der wunsch nach einer aussprache im... | 105 | 110 |
| 4 | in the meantime, i should like to observe a minute' s si... | heute möchte ich sie bitten - das ist auch der wunsch ei... | 232 | 217 |


In [12]:
non_empty = (df.input_length > 1) & (df.target_length > 1)  # there are empty phrases like '\n' --> 'Frau Präsidentin\n'
short_inputs = (df.input_length < MAX_INPUT_LENGTH) & (df.target_length < MAX_TARGET_LENGTH)
print((non_empty & short_inputs).sum())  # number of sentence pairs that survive the filter
df = df[non_empty & short_inputs]

167211

In [13]:
gc.collect()

17

In [14]:
bpe_input, bpe_target = [bpe.Bytepairencoding(
    word2vec_fname=os.path.join(PATH, BPE_WORD2VEC_NAME[lang]),
    sentencepiece_fname=os.path.join(PATH, BPE_MODEL_NAME[lang]),
) for lang in [INPUT_LANG, TARGET_LANG]] 
print("English subwords", bpe_input.sentencepiece.EncodeAsPieces("this is a test for pretrained bytepairembeddings"))
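# use the German model here; the output recorded below came from `bpe_input`, hence the heavily fragmented pieces
print("German subwords", bpe_target.sentencepiece.EncodeAsPieces("das ist ein test für vortrainierte zeichengruppen"))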

English subwords ['▁this', '▁is', '▁a', '▁test', '▁for', '▁pre', 'tr', 'ained', '▁by', 'te', 'pa', 'ire', 'm', 'bed', 'd', 'ings']
German subwords ['▁d', 'as', '▁is', 't', '▁e', 'in', '▁test', '▁f', 'ür', '▁v', 'ort', 'rain', 'ier', 'te', '▁ze', 'ic', 'hen', 'gr', 'up', 'p', 'en']
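
In these outputs the `▁` prefix marks the beginning of a word, which is what allows the pieces to be glued back into plain text after decoding. The following sketch mimics that step in pure Python. `join_pieces` is a hypothetical helper written for this illustration and only approximates what `sentencepiece.DecodePieces` does; it is not the library's implementation.

```python
def join_pieces(pieces):
    """Approximate SentencePiece detokenization: concatenate the pieces and
    turn every '▁' word-boundary marker into a space."""
    return ''.join(pieces).replace('▁', ' ').strip()


# The English pieces printed above
english_pieces = ['▁this', '▁is', '▁a', '▁test', '▁for', '▁pre', 'tr', 'ained',
                  '▁by', 'te', 'pa', 'ire', 'm', 'bed', 'd', 'ings']
sentence = join_pieces(english_pieces)
print(sentence)
```

Because the word boundaries are stored inside the pieces themselves, the decoder can emit arbitrary subword sequences and still produce properly spaced text.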


In [15]:
df['input_sequences'] = df.input_texts.apply(bpe_input.subword_indices)
df['target_sequences'] = df.target_texts.apply(bpe_target.subword_indices)

In [16]:
max_len_input = df.input_sequences.apply(len).max()
max_len_target = df.target_sequences.apply(len).max()
# nr_input_tokens = len(input_wordvec_index)  
# nr_target_tokens = len(target_wordvec_index)
# 
# # one hot encoded y_t_output wouldn't fit into memory any longer
# # so need to train/validate on batches generated on the fly
# def create_batch_generator(samples_ids):
#     
#     def batch_generator():
#         nr_batches = np.ceil(len(samples_ids) / BATCH_SIZE)
#         while True:
#             shuffled_ids = np.random.permutation(samples_ids)
#             batch_splits = np.array_split(shuffled_ids, nr_batches)
#             for batch_ids in batch_splits:
#                 batch_X = pad_sequences(df.iloc[batch_ids].input_sequences, padding='post', maxlen=max_len_input)
#                 batch_y = pad_sequences(df.iloc[batch_ids].target_sequences, padding='post', maxlen=max_len_target)
#                 batch_y_t_output = keras.utils.to_categorical(batch_y[:,1:], num_classes=nr_target_tokens)
#                 batch_x_t_input = batch_y[:,:-1]
#                 yield ([batch_X, batch_x_t_input], batch_y_t_output)
#     
#     return batch_generator()

In [17]:
train_ids, val_ids = train_test_split(np.arange(df.shape[0]), test_size=0.1)

In [18]:
# nr_input_tokens, nr_target_tokens
len(train_ids), len(val_ids)

(150489, 16722)

In [19]:
# encoder_gru = L.Bidirectional(
#     L.GRU(LATENT_DIM // 2, dropout=DROPOUT, return_state=True, name='encoder_gru'),
#     name='encoder_bidirectional'
# )
# decoder_gru = L.GRU(LATENT_DIM, dropout=DROPOUT, return_sequences=True, return_state=True, name='decoder_gru', dtype=tf.float32)
# decoder_dense = L.Dense(nr_target_tokens, activation='softmax', name='decoder_outputs', dtype=tf.float32)
# 
# input_embedding = L.Embedding(
#     nr_input_tokens,
#     FULL_EMBEDDING_DIM,
#     mask_zero=True,
#     weights=[input_embedding_matrix],
#     trainable=EMBEDDING_TRAINABLE,
#     name='input_embedding',
#     dtype=tf.float32,
# )
# target_embedding = L.Embedding(
#     nr_target_tokens,
#     FULL_EMBEDDING_DIM,
#     mask_zero=True,
#     weights=[target_embedding_matrix],
#     trainable=EMBEDDING_TRAINABLE,
#     name='target_embedding',
#     dtype=tf.float32,
# )
# 
# encoder_inputs = L.Input(shape=(max_len_input, ), dtype='int32', name='encoder_inputs')
# encoder_embeddings = input_embedding(encoder_inputs)
# _, encoder_state_1, encoder_state_2 = encoder_gru(encoder_embeddings)
# encoder_states = L.concatenate([encoder_state_1, encoder_state_2])
# 
# decoder_inputs = L.Input(shape=(max_len_target-1, ), dtype='int32', name='decoder_inputs')
# decoder_mask = L.Masking(mask_value=0)(decoder_inputs)
# decoder_embeddings_inputs = target_embedding(decoder_mask)
# decoder_embeddings_outputs, _ = decoder_gru(decoder_embeddings_inputs, initial_state=encoder_states) 
# decoder_outputs = decoder_dense(decoder_embeddings_outputs)
# 
# model = Model(inputs=[encoder_inputs, decoder_inputs], outputs=decoder_outputs)
# 
# inference_encoder_model = Model(encoder_inputs, encoder_states)
#     
# inference_decoder_state_inputs = L.Input(shape=(LATENT_DIM, ), dtype='float32', name='inference_decoder_state_inputs')
# inference_decoder_embeddings_outputs, inference_decoder_states = decoder_gru(
#     decoder_embeddings_inputs, initial_state=inference_decoder_state_inputs
# )
# inference_decoder_outputs = decoder_dense(inference_decoder_embeddings_outputs)
# 
# inference_decoder_model = Model(
#     [decoder_inputs, inference_decoder_state_inputs], 
#     [inference_decoder_outputs, inference_decoder_states]
# )

s2s = seq2seq.Seq2SeqWithBPE(
    bpe_input=bpe_input,
    bpe_target=bpe_target,
    max_len_input=max_len_input,
    max_len_target=max_len_target
)

In [20]:
s2s.model.summary()
s2s.inference_decoder_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_inputs (InputLayer)     (None, 27)           0                                            
__________________________________________________________________________________________________
decoder_inputs (InputLayer)     (None, 39)           0                                            
__________________________________________________________________________________________________
input_embedding (Embedding)     (None, 27, 302)      1760358     encoder_inputs[0][0]             
__________________________________________________________________________________________________
masking_1 (Masking)             (None, 39)           0           decoder_inputs[0][0]             
__________________________________________________________________________________________________
encoder_bi

In [21]:
# s2s.model.compile(optimizer=keras.optimizers.Adam(clipnorm=1., clipvalue=.5), loss='categorical_crossentropy')

In [22]:
# s2s.model.load_weights('models/bytepairencoding_model_weights.h5')
# the weight files are downloaded into PATH by download_and_extract_resources above
s2s.inference_encoder_model.load_weights(os.path.join(PATH, 'bytepairencoding_inference_encoder_model_weights.h5'))
s2s.inference_decoder_model.load_weights(os.path.join(PATH, 'bytepairencoding_inference_decoder_model_weights.h5'))
# train_generator = create_batch_generator(train_ids)
# val_generator = create_batch_generator(val_ids)
# model.fit_generator(
#     train_generator,
#     steps_per_epoch=np.ceil(len(train_ids) / BATCH_SIZE),
#     epochs=EPOCHS,
#     validation_data=val_generator,
#     validation_steps=np.ceil(len(val_ids) / BATCH_SIZE),
# )

In [71]:
def decode_sequence(input_seq):
    states_value = s2s.inference_encoder_model.predict(input_seq)
    
    tokens = {idx: token for (token, idx) in bpe_target.wordvec_index.items()}
    start_token_idx = bpe_target.wordvec_index['<s>']
    end_token_idx = bpe_target.wordvec_index['</s>']
    
    target_seq = np.zeros((1, max_len_target-1))
    target_seq[0, 0] = start_token_idx
    
    decoded_sequence = [] 
    for i in range(max_len_target):
        output_tokens, output_states = s2s.inference_decoder_model.predict(
            [target_seq, states_value]
        )
        
        # greedy search
        sampled_token_idx = np.argmax(output_tokens[0, 0, :])
        if sampled_token_idx == end_token_idx:
            break
        sampled_token = tokens.get(sampled_token_idx, '~')
        decoded_sequence.append(sampled_token)
            
        target_seq[0, 0] = sampled_token_idx
        states_value = output_states
    
    return bpe_target.sentencepiece.DecodePieces(decoded_sequence)

def decode_beam_search(input_seq, beam_width, branch_width=None):
    if branch_width is None:
        branch_width = beam_width
        
    initial_states = s2s.inference_encoder_model.predict(input_seq)
    
    top_candidates = [{
        'states': initial_states,
        'idx_sequence': [bpe_target.start_token_idx],
        'token_sequence': [bpe_target.start_token],
        'score': 0.0,
        'live': True
    }]
    
    for _ in range(max_len_target):
        # stop early once every candidate in the beam has produced the stop token
        if not any(candidate['live'] for candidate in top_candidates):
            break
        new_candidates = []
        for candidate in top_candidates:
            if not candidate['live']:
                new_candidates.append(candidate)
                continue
         
            target_seq = np.zeros((1, max_len_target - 1))
            target_seq[0, 0] = candidate['idx_sequence'][-1]
            output, states = s2s.inference_decoder_model.predict(
                [target_seq, candidate['states']]
            )
            probs = output[0, 0, :]
        
            for idx in np.argsort(-probs)[:branch_width]:
                new_candidates.append({
                    'states': states,
                    'idx_sequence': candidate['idx_sequence'] + [idx],
                    'token_sequence': candidate['token_sequence'] + [bpe_target.tokens[idx]],
                    # summing -log(prob) is numerically more stable than multiplying probabilities;
                    # the goal is now to minimize the score
                    'score': candidate['score'] - np.log(probs[idx]),  
                    'live': idx != bpe_target.stop_token_idx,
                })
        
        top_candidates = sorted(
            new_candidates, key=lambda c: c['score']
        )[:beam_width]
        
    return bpe_target.sentencepiece.DecodePieces(top_candidates[0]['token_sequence'])
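
The score in `decode_beam_search` is a sum of negative log probabilities rather than a product of probabilities. The sketch below shows why, using deliberately exaggerated, made-up numbers (`step_prob` and `steps` are assumptions for this illustration, not values taken from the model): the product underflows to zero, so candidates could no longer be ranked, while the summed score stays an ordinary float.

```python
import math

# Made-up values, exaggerated to make the effect visible;
# real decoder probabilities differ from step to step.
step_prob = 1e-4
steps = 100

product = 1.0
neg_log_sum = 0.0
for _ in range(steps):
    product *= step_prob                # eventually underflows to 0.0 in float64
    neg_log_sum -= math.log(step_prob)  # grows linearly and stays comparable

print(product)      # the product carries no ranking information any more
print(neg_log_sum)  # roughly steps * 9.21
```

Minimizing the summed negative log probability ranks candidates in the same order as maximizing the product would, as long as the product is still representable.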

In [80]:
def predict(sentence, beam_width=10, branch_width=None):
    # branch_width=None lets decode_beam_search fall back to beam_width
    return decode_beam_search(keras.preprocessing.sequence.pad_sequences(
        [bpe_input.subword_indices(preprocess(sentence))],
        padding='post',
        maxlen=max_len_input,
    ), beam_width=beam_width, branch_width=branch_width)

In [81]:
# Performance on some examples:
EXAMPLES = [
    'Hello.',
    'You are welcome.',
    'How do you do?',
    'I hate mondays.',
    'I am a programmer.',
    'Data is the new oil.',
    'It could be worse.',
    "I am on top of it.",
    "N° Uno",
    "Awesome!",
    "Put your feet up!",
    "From the start till the end!",
    "From dusk till dawn.",
]
for en in [sentence + '\n' for sentence in EXAMPLES]:
    print(f"{preprocess(en)!r} --> {predict(en)!r}")

'hello.' --> 'hallam.'
'you are welcome.' --> 'seien sie willkommen.'
'how do you do?' --> 'was tun sie?'
'i hate mondays.' --> 'ich habe vielleicht.'
'i am a programmer.' --> 'ich bin ein programm.'
'data is the new oil.' --> 'daten sind die öls.'
'it could be worse.' --> 'das könnte besorgniserregend sein.'
'i am on top of it.' --> 'ich gehe davon aus.'
'n° uno' --> 'änderungsantrag 0'
'awesome!' --> 'einverstanden!'
'put your feet up!' --> 'fangen sie hart!'
'from the start till the end!' --> 'mit dem zielsetzungen!'
'from dusk till dawn.' --> 'als dilemma.'


In [75]:
# Performance on training set:
for en, de in df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

Original "please rise, then, for this minute' s silence.", got 'bitte lassen sie mich das wort äußern.', exp: 'ich bitte sie, sich zu einer schweigeminute zu erheben.'
Original "(the house rose and observed a minute' s silence)", got '(das parlament erhebt sich zu einer schweigeminute.)', exp: '(das parlament erhebt sich zu einer schweigeminute.)'
Original 'madam president, on a point of order.', got 'frau präsidentin, zur geschäftsordnung.', exp: 'frau präsidentin, zur geschäftsordnung.'
Original 'madam president, on a point of order.', got 'frau präsidentin, zur geschäftsordnung.', exp: 'frau präsidentin, zur geschäftsordnung.'
Original 'thank you, mr segni, i shall do so gladly.', got 'vielen dank, herr segni.', exp: 'vielen dank, herr segni, das will ich gerne tun.'
Original 'it is the case of alexander nikitin.', got 'das ist der fall von alexander nikitin.', exp: 'das ist der fall von alexander nikitin.'
Original 'it will, i hope, be examined in a positive light.', got 'ich hoffe

In [76]:
# Performance on validation set
val_df = df.iloc[val_ids]
for en, de in val_df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

Original 'a lot of inspiration is needed.', got 'eine vielversprechnung ist erforderlich.', exp: 'diese denkanstöße sind dringend nötig.'
Original 'so what was cancún about?', got 'was ist also in cancún?', exp: 'worum ging es in cancún?'
Original 'i now turn to another subject.', got 'ich komme nun zu einem weiteren punkt.', exp: 'ich komme jetzt zu einem anderen punkt.'
Original 'thank you, mr prodi.', got 'vielen dank, herr prodi.', exp: 'vielen dank, herr prodi.'
Original 'the experience acquired varied somewhat.', got 'die erfahrungen haben etwas ähnliches.', exp: 'die damit gemachten erfahrungen sind recht unterschiedlich.'
Original 'that is what mr smith suggested.', got 'das hat smith gesagt.', exp: 'das hat herr smith vorgeschlagen.'
Original 'it is stable internally and externally.', got 'das ist in der tat exprimierend.', exp: 'er ist stabil nach innen und außen.'
Original 'mr president, i am wearing my irish scarf today.', got 'herr präsident, ich möchte meinen irischen bed

In [82]:
import spacy
try:
    from spacy.lang.de import German
except ModuleNotFoundError:
    spacy.cli.download('de')
    from spacy.lang.de import German
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

parser = German()
chencherry = SmoothingFunction()  # to handle short sequences, see also http://www.nltk.org/_modules/nltk/translate/bleu_score.html#SmoothingFunction.method3

def remove_spaces_and_puncts(tokens):
    return [token.orth_ for token in tokens if not (token.is_space or token.is_punct)]

bleu_scores = np.zeros(TEST_SIZE)

for i in tqdm(range(TEST_SIZE)):
    pred_tokens = remove_spaces_and_puncts(parser(predict(df.iloc[val_ids[i]].input_texts)))
    ref_tokens = remove_spaces_and_puncts(parser(df.iloc[val_ids[i]].target_texts))
    bleu_scores[i] = sentence_bleu([ref_tokens], pred_tokens, smoothing_function=chencherry.method3)
    
print("Average bleu score", bleu_scores.mean())

HBox(children=(IntProgress(value=0, max=500), HTML(value='')))


Average bleu score 0.3276391602399966
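
To get a feeling for what the BLEU score measures, here is a simplified sketch of its core ingredient, the clipped n-gram precision, applied to one of the validation examples above. `clipped_precision` is a hypothetical helper written for this illustration. NLTK's `sentence_bleu` goes further: it combines the precisions for 1- to 4-grams, applies a brevity penalty and, as used above, smoothing.

```python
from collections import Counter


def clipped_precision(reference, hypothesis, n):
    """Share of hypothesis n-grams that also occur in the reference, where each
    n-gram is counted at most as often as it appears in the reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    hyp_counts = ngrams(hypothesis)
    ref_counts = ngrams(reference)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


ref = 'ich komme jetzt zu einem anderen punkt'.split()
hyp = 'ich komme nun zu einem weiteren punkt'.split()
p1 = clipped_precision(ref, hyp, 1)  # 5 of 7 words match
p2 = clipped_precision(ref, hyp, 2)  # 2 of 6 bigrams match
print(p1, p2)
```

The translation is perfectly acceptable German, yet two substituted words already cost a lot of n-gram overlap, which is worth keeping in mind when reading the average score.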


In [24]:
name = 'bytepairencoding'
# the models live on the `s2s` object in this notebook
s2s.model.save_weights(f'models/{name}_model_weights.h5')  # https://drive.google.com/open?id=1xK2QVTsIpJLmphSEUZl1Unqmz85MYeQK
s2s.inference_encoder_model.save_weights(f'models/{name}_inference_encoder_model_weights.h5')  # https://drive.google.com/open?id=115Kp7ZIMqxu6YDvk4RhjvfYShcQdRP_o
s2s.inference_decoder_model.save_weights(f'models/{name}_inference_decoder_model_weights.h5')  # https://drive.google.com/open?id=1_e3DE5lDw10joIb83UFbzGJyvQrMfb8w

# Conclusion

It is definitely an improvement over the pure char level, although a lot of the translations are still more funny than decent. The BLEU score agrees with this impression and shows a substantial increase ($0.316 > 0.213$).
On a side note, the training could be stopped early after a few epochs (the model indeed overfits from that point on). So this new model has a lot of capacity left, which could be used with either more data or a better model. But the next missing piece is beam search instead of greedy search. I guess this will reduce some of the strange/funny translations a bit.