# Sequence-to-Sequence Learning

## Data

[OPUS](http://opus.lingfil.uu.se/) (Open Parallel Corpus) provides many free parallel corpora. In particular, we'll use their [English-German Tatoeba corpus](http://opus.lingfil.uu.se/), which consists of phrases translated from English to German or vice versa.

Some preprocessing was involved to extract just the aligned sentences from the various XML files OPUS provides; I've provided the [processed data for you](../data/en_de_corpus.json).

## Preparing the data

In [87]:
import numpy as np
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
# import every layer from `keras.layers`; the older submodule paths
# (`keras.layers.recurrent`, `.embeddings`, `.wrappers`) are not available in all Keras versions
from keras.layers import (Activation, Average, Concatenate, Dense, Dropout, Embedding,
                          Input, LSTM, RepeatVector, SimpleRNN, TimeDistributed)

In [20]:
import json

# use a context manager so the file handle is closed after loading
with open('en_de_corpus.json', 'r') as f:
    data = json.load(f)

# to deal with memory issues,
# limit the dataset
# we could also generate the training samples on-demand
# with a generator and use keras models' `fit_generator` method
max_len = 6
max_examples = 80000
max_vocab_size = 10000

def get_texts(source_texts, target_texts, max_len, max_examples):
    """extract texts
    training gets difficult with widely varying lengths
    since some sequences are mostly padding
    long sequences get difficult too, so we are going
    to cheat and just consider short-ish sequences.
    this assumes whitespace as a token delimiter
    and that the texts are already aligned.
    """
    sources, targets = [], []
    for i, source in enumerate(source_texts):
        # assume we split on whitespace
        if len(source.split(' ')) <= max_len:
            target = target_texts[i]
            if len(target.split(' ')) <= max_len:
                sources.append(source)
                targets.append(target)
    return sources[:max_examples], targets[:max_examples]

en_texts, de_texts = get_texts(data['en'], data['de'], max_len, max_examples)
n_examples = len(en_texts)

In [21]:
# add start and stop tokens
start_token = '^'
end_token = '$'
en_texts = [' '.join([start_token, text, end_token]) for text in en_texts]
de_texts = [' '.join([start_token, text, end_token]) for text in de_texts]

In [22]:
# characters for the tokenizers to filter out
# preserve start and stop tokens
filter_chars = '!"#$%&()*+,-./:;<=>?@[\\]^_{|}~\t\n\'`“”–'.replace(start_token, '').replace(end_token, '')

source_tokenizer = Tokenizer(max_vocab_size, filters=filter_chars)
source_tokenizer.fit_on_texts(en_texts)
target_tokenizer = Tokenizer(max_vocab_size, filters=filter_chars)
target_tokenizer.fit_on_texts(de_texts)

# vocab sizes
# idx 0 is reserved by keras (for padding)
# and not part of the word_index,
# so add 1 to account for it
source_vocab_size = len(source_tokenizer.word_index) + 1
target_vocab_size = len(target_tokenizer.word_index) + 1
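
To make the filtering step concrete, here is a minimal pure-Python sketch of roughly what the tokenizers do to each text: lowercase it, replace every filtered character with a space, and split on whitespace. `simple_tokenize` is a hypothetical helper written for illustration, not part of Keras, and it only approximates the tokenizer's behaviour. It shows that the start and stop markers survive, and that punctuation inside a word splits it, which is one way a text can end up with more tokens than whitespace-separated words.

```python
start_token, end_token = '^', '$'
filter_chars = '!"#$%&()*+,-./:;<=>?@[\\]^_{|}~\t\n\'`“”–'
filter_chars = filter_chars.replace(start_token, '').replace(end_token, '')

def simple_tokenize(text, filters=filter_chars):
    """Hypothetical approximation: lowercase, turn filtered characters into spaces, split."""
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.lower().translate(table).split()

tokens = simple_tokenize('^ I took the bus back. $')
print(tokens)

# the apostrophe is filtered, so "Don't" becomes two tokens
split_tokens = simple_tokenize("^ Don't stop. $")
print(split_tokens)
```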

In [23]:
# find max length (in tokens) of input and output sentences
max_input_length = max(len(seq) for seq in source_tokenizer.texts_to_sequences_generator(en_texts))
max_output_length = max(len(seq) for seq in target_tokenizer.texts_to_sequences_generator(de_texts))

In [24]:
sequences = pad_sequences(source_tokenizer.texts_to_sequences(en_texts[:1]), maxlen=max_input_length)
print(en_texts[0])
print(sequences[0])

^ I took the bus back. $
[  0   0   0   0   1   3 217   5 448 112   2]
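
The zeros sit on the left of the printed sequence because `pad_sequences` pads (and truncates) at the front by default. A minimal sketch of that behaviour in plain Python, where `left_pad` is a hypothetical stand-in and not the Keras implementation:

```python
def left_pad(seqs, maxlen, value=0):
    """Hypothetical stand-in for pad_sequences with its default 'pre' padding and truncating."""
    padded = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]  # keep only the last `maxlen` tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

rows = left_pad([[1, 3, 217], [5, 448, 112, 2, 9, 7]], maxlen=5)
print(rows)
```

Padding at the front keeps the real tokens next to the end of the sequence, so the last LSTM step always sees actual content rather than padding.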


In [25]:
def build_one_hot_vecs(sequences):
    """generate one-hot vectors from token sequences"""
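    # boolean to reduce memory footprint
    # (`np.bool` was removed from recent NumPy releases; the builtin `bool` is equivalent)
    X = np.zeros((len(sequences), max_input_length, source_vocab_size), dtype=bool)
    for i, sent in enumerate(sequences):
        word_idxs = np.arange(max_input_length)
        # pair each timestep with its token id; index with separate arrays, not a list of arrays
        X[i, word_idxs, sent] = True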
    return X

In [26]:
def build_target_vecs():
    """encode words in the target sequences as one-hots"""
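    # builtin `bool` instead of the removed `np.bool` alias
    y = np.zeros((n_examples, max_output_length, target_vocab_size), dtype=bool)
    for i, sent in enumerate(pad_sequences(target_tokenizer.texts_to_sequences(de_texts), maxlen=max_output_length)):
        word_idxs = np.arange(max_output_length)
        # pair each timestep with its token id; index with separate arrays, not a list of arrays
        y[i, word_idxs, sent] = True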
    return y
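
Both helpers above rely on NumPy integer-array indexing to set one entry per timestep. The toy example below is a sketch of that single step on a four-token sequence. It also estimates the size of the full target tensor, assuming all 80000 examples pass the length filter and using the 11 timesteps and 19102-word target vocabulary reported in the model summaries, which suggests why memory is a concern.

```python
import numpy as np

# toy padded sequence: 4 timesteps, vocabulary of 6 ids (0 is padding)
sent = np.array([0, 2, 5, 1])
vocab_size = 6

one_hot = np.zeros((len(sent), vocab_size), dtype=bool)
# pair each timestep with its token id: sets (0, 0), (1, 2), (2, 5), (3, 1)
one_hot[np.arange(len(sent)), sent] = True
print(one_hot.astype(int))

# upper bound on the target tensor size: one byte per boolean element
# (assumes every one of the 80000 examples survives filtering)
upper_bound_gb = 80000 * 11 * 19102 / 1e9
print(round(upper_bound_gb, 1), 'GB')
```

Note that the padding id 0 is one-hot encoded like any other token, which matches what the helpers above do.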

## Defining the model

In [96]:
hidden_dim = 128
embedding_dim = 128

def build_model(one_hot=False, bidirectional=False, extra_dense=False, random_stuff=False):
    """build a vanilla sequence-to-sequence model.
    specify `one_hot=True` to build it for one-hot encoded inputs,
    otherwise, pass in sequences directly and embeddings will be learned.
    specify `bidirectional=True` to use a bidirectional LSTM encoder"""
    if one_hot:
        input = Input(shape=(max_input_length,source_vocab_size))
        input_ = input
    else:
        input = Input(shape=(max_input_length,), dtype='int32')
        input_ = Embedding(source_vocab_size, embedding_dim, input_length=max_input_length)(input)

    # encoder; don't return sequences, just give us one representation vector
    if bidirectional:
        forwards = LSTM(hidden_dim, return_sequences=False)(input_)
        backwards = LSTM(hidden_dim, return_sequences=False, go_backwards=True)(input_)
        encoder = Concatenate(-1)([forwards, backwards])
    else:
        encoder = LSTM(hidden_dim, return_sequences=False)(input_)
        
    # random extra dense layer
    if extra_dense:
        encoder = Dense(hidden_dim)(encoder)

    # repeat encoder output for each desired output from the decoder
    encoder = RepeatVector(max_output_length)(encoder)
    
    # decoder; do return sequences (timesteps)
    decoder = LSTM(hidden_dim, return_sequences=True)(encoder)

    # apply the dense layer to each timestep
    # give output conforming to target vocab size
    decoder = TimeDistributed(Dense(target_vocab_size))(decoder)
    
    # just some random layers to test out adding stuff
    if random_stuff:
        random_layer_1 = Dense(embedding_dim, activation='sigmoid')(decoder)
        random_layer_2 = SimpleRNN(embedding_dim, activation='relu')(random_layer_1)
        random_layer_3 = Average()([random_layer_1, random_layer_2])
        decoder = Dropout(0.3)(random_layer_3)

    # convert to a proper distribution
    predictions = Activation('softmax')(decoder)
    return Model(input, predictions)
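
The two shape-changing steps between encoder and decoder can be mimicked in NumPy. This is only a sketch of the tensor shapes involved (with made-up small dimensions), not of what the Keras layers do internally: concatenation joins the forward and backward encodings along the feature axis, and the repeat step copies the single encoding once per output timestep so the decoder LSTM receives a sequence.

```python
import numpy as np

batch, hidden_dim, max_output_length = 2, 3, 4
forwards = np.ones((batch, hidden_dim))    # stand-in for the forward LSTM output
backwards = np.zeros((batch, hidden_dim))  # stand-in for the backward LSTM output

# like Concatenate(-1): join the two encodings along the last axis
encoded = np.concatenate([forwards, backwards], axis=-1)

# like RepeatVector(max_output_length): one copy of the encoding per timestep
repeated = np.repeat(encoded[:, np.newaxis, :], max_output_length, axis=1)

print(encoded.shape, repeated.shape)
```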

In [38]:
target_reverse_word_index = {v:k for k,v in target_tokenizer.word_index.items()}

def decode_outputs(predictions):
    outputs = []
    for probs in predictions:
        preds = probs.argmax(axis=-1)
        tokens = []
        for idx in preds:
            tokens.append(target_reverse_word_index.get(idx))
        outputs.append(' '.join([t for t in tokens if t is not None]))
    return outputs
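
As a small illustration of the decoding step, the sketch below runs the same argmax-and-lookup logic on a hand-written probability matrix. The `reverse_index` dictionary and the probabilities are invented for the example. Index 0 (padding) has no entry in the dictionary, so it is dropped from the output.

```python
import numpy as np

# hypothetical reverse index; id 0 (padding) is deliberately absent
reverse_index = {1: '^', 2: '$', 3: 'ich', 4: 'bus'}

# one "sentence": 4 timesteps over a 5-word vocabulary
probs = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0],  # most likely id 0 (padding)
    [0.1, 0.7, 0.1, 0.1, 0.0],  # most likely id 1
    [0.0, 0.1, 0.1, 0.6, 0.2],  # most likely id 3
    [0.1, 0.0, 0.8, 0.1, 0.0],  # most likely id 2
])

ids = probs.argmax(axis=-1)
words = [reverse_index.get(int(i)) for i in ids]
sentence = ' '.join(w for w in words if w is not None)
print(sentence)
```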

## Training

In [39]:
def build_seq_vecs(sequences):
    return np.array(sequences)

In [40]:
import math
def generate_batches(batch_size, one_hot=False):
    # number of batches in one pass over the data (one epoch)
    n_batches = math.ceil(n_examples/batch_size)
    while True:
        sequences = pad_sequences(source_tokenizer.texts_to_sequences(en_texts), maxlen=max_input_length)

        if one_hot:
            X = build_one_hot_vecs(sequences)
        else:
            X = build_seq_vecs(sequences)
        y = build_target_vecs()

        # shuffle
        idx = np.random.permutation(len(sequences))
        X = X[idx]
        y = y[idx]

        for i in range(n_batches):
            start = batch_size * i
            end = start+batch_size
            yield X[start:end], y[start:end]
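
The batching logic can be checked on toy arrays without Keras. The sketch below uses a hypothetical `toy_batches` generator with the same structure (reshuffle once per pass, then yield slices), and shows that one epoch corresponds to `ceil(n_examples / batch_size)` generator steps, with a smaller final batch when the sizes do not divide evenly.

```python
import math
import numpy as np

def toy_batches(X, y, batch_size, rng):
    """Yield shuffled (X, y) slices forever, reshuffling once per pass over the data."""
    n_batches = math.ceil(len(X) / batch_size)
    while True:
        idx = rng.permutation(len(X))
        X_s, y_s = X[idx], y[idx]  # same permutation keeps pairs aligned
        for i in range(n_batches):
            start = i * batch_size
            yield X_s[start:start + batch_size], y_s[start:start + batch_size]

X = np.arange(10)
y = X * 2
gen = toy_batches(X, y, batch_size=4, rng=np.random.default_rng(0))

steps_per_epoch = math.ceil(len(X) / 4)
sizes = [len(next(gen)[0]) for _ in range(steps_per_epoch)]
print(steps_per_epoch, sizes)
```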

## Building some Models

In [92]:
n_epochs = 100
batch_size = 128

model = build_model(one_hot=True, bidirectional=False, extra_dense=False)

model.summary()
#model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
#model.fit_generator(generate_batches(batch_size, one_hot=True), steps_per_epoch=math.ceil(n_examples / batch_size), epochs=n_epochs, verbose=1)

Model: "functional_47"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_36 (InputLayer)        [(None, 11, 11464)]       0         
_________________________________________________________________
lstm_75 (LSTM)               (None, 128)               5935616   
_________________________________________________________________
repeat_vector_29 (RepeatVect (None, 11, 128)           0         
_________________________________________________________________
lstm_76 (LSTM)               (None, 11, 128)           131584    
_________________________________________________________________
time_distributed_29 (TimeDis (None, 11, 19102)         2464158   
_________________________________________________________________
activation_27 (Activation)   (None, 11, 19102)         0         
Total params: 8,531,358
Trainable params: 8,531,358
Non-trainable params: 0
___________________________________________
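
The parameter counts in this summary can be reproduced by hand. The sketch below uses the standard counting formulas for an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and for a dense layer. The vocabulary sizes are read off the output shapes above, and the helper names are made up for the example.

```python
def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # one weight per input plus a bias, for every output unit
    return (input_dim + 1) * units

source_vocab, target_vocab, hidden = 11464, 19102, 128

encoder = lstm_params(source_vocab, hidden)  # one-hot input feeds the LSTM directly
decoder = lstm_params(hidden, hidden)
output = dense_params(hidden, target_vocab)

print(encoder, decoder, output, encoder + decoder + output)
# → 5935616 131584 2464158 8531358
```

Most of the encoder's weights come from the one-hot input width, which is why the embedding variants of the model have a much smaller encoder LSTM.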

In [93]:
model = build_model(one_hot=False, bidirectional=False, extra_dense=True)

model.summary()

Model: "functional_49"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_37 (InputLayer)        [(None, 11)]              0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 11, 128)           1467392   
_________________________________________________________________
lstm_77 (LSTM)               (None, 128)               131584    
_________________________________________________________________
dense_52 (Dense)             (None, 128)               16512     
_________________________________________________________________
repeat_vector_30 (RepeatVect (None, 11, 128)           0         
_________________________________________________________________
lstm_78 (LSTM)               (None, 11, 128)           131584    
_________________________________________________________________
time_distributed_30 (TimeDis (None, 11, 19102)       

In [94]:
model = build_model(one_hot=True, bidirectional=True, extra_dense=True)

model.summary()

Model: "functional_51"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_38 (InputLayer)           [(None, 11, 11464)]  0                                            
__________________________________________________________________________________________________
lstm_79 (LSTM)                  (None, 128)          5935616     input_38[0][0]                   
__________________________________________________________________________________________________
lstm_80 (LSTM)                  (None, 128)          5935616     input_38[0][0]                   
__________________________________________________________________________________________________
concatenate_9 (Concatenate)     (None, 256)          0           lstm_79[0][0]                    
                                                                 lstm_80[0][0]        

In [95]:
model = build_model(one_hot=False, bidirectional=True, random_stuff=True)

model.summary()

Model: "functional_53"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_39 (InputLayer)           [(None, 11)]         0                                            
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, 11, 128)      1467392     input_39[0][0]                   
__________________________________________________________________________________________________
lstm_82 (LSTM)                  (None, 128)          131584      embedding_4[0][0]                
__________________________________________________________________________________________________
lstm_83 (LSTM)                  (None, 128)          131584      embedding_4[0][0]                
______________________________________________________________________________________

## Translation

Note that, since training a model takes far too long for me on this computer, none of the subsequent cells have actually been run.

In [67]:
def translate(model, sentences, one_hot=False):
    seqs = pad_sequences(source_tokenizer.texts_to_sequences(sentences), maxlen=max_input_length)
    if one_hot:
        input = build_one_hot_vecs(seqs)
    else:
        input = build_seq_vecs(seqs)
    preds = model.predict(input, verbose=0)
    return decode_outputs(preds)

In [None]:
print(en_texts[0])
print(de_texts[0])
print(translate(model, [en_texts[0]], one_hot=True))
# >>> ^ I took the bus back. $
# >>> ^ Ich nahm den Bus zurück. $
# >>> ^ ich ich die die verloren $

In [None]:
model = build_model(one_hot=False, bidirectional=False)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit_generator(generator=generate_batches(batch_size, one_hot=False), steps_per_epoch=math.ceil(n_examples / batch_size), epochs=n_epochs, verbose=1)

In [None]:
model = build_model(one_hot=False, bidirectional=True)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit_generator(generator=generate_batches(batch_size, one_hot=False), steps_per_epoch=math.ceil(n_examples / batch_size), epochs=n_epochs, verbose=1)

## Model Information

- The underlying data is text. Specifically, the training data is pairs of translated English/German sentences.
- First, the text is filtered by length (for memory and efficiency). Then tokens are added to mark the start and end of each sample; the tokenizers strip punctuation but are configured to preserve these two marker characters. Finally, each sample is converted to a padded sequence of word indices (optionally one-hot encoded, depending on the setup).
- The output of the model is a sequence of probability distributions over the German vocabulary, which can be transformed back into a (German) sentence by taking the most likely word at each timestep and reversing the index lookup used to build the input vectors (though there are some differences, since the language has changed).
- The basic, one-hot version of the model has 4 hidden layers. These are an LSTM encoding layer, a RepeatVector (copying) layer, an LSTM decoding layer, and a TimeDistributed dense layer.
- The final activation function used is a softmax function.
- The loss function is categorical cross-entropy, which penalizes the model according to the log-probability it assigns to the correct word at each timestep.
- The only validation metric included was a graph of the accuracy across the epochs of various configurations for the model. Unfortunately, this graph just seems to be an image included in the notebook, so no code on how it was generated was provided. Apparently, after 300 epochs, an accuracy of ~82% was achieved.
- The accuracy metric here is computed per token position (the tokenizer is word-level, so each position is a word or padding). A stricter measure, such as whole-sentence exact match, would probably be more helpful, since a single wrong word makes the whole sentence count as wrong. Also, one (admittedly much more complicated and somewhat impractical) idea would be to train a German to English translation network, and see if it is able to recover the original English phrases using this network's translations.
- The idea behind this particular architecture is probably that the encoding step takes in the English words and reduces them down (in some fashion) to their actual meaning, while the decoding layer can then take that fundamental meaning and turn it into German words. The other layers just facilitate this.
- One thing that might be interesting to see is if one used the input layer as a further input to the decoding layer, as it somehow might be possible that knowing both the 'meaning of the sentence' (from the encoding layer) and the sentence itself could improve the translations. This could probably be achieved with some type of concatenation layer (I'm not quite sure what the proper architecture would be here).
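
To make the softmax, loss and accuracy points above concrete, here is a minimal NumPy sketch on a made-up batch of two three-token sentences. It is not how Keras computes these internally; it only illustrates the definitions. It also contrasts per-token accuracy with a whole-sentence exact-match score, which is a hypothetical stricter metric of the kind suggested in the list.

```python
import numpy as np

def softmax(logits):
    # subtract the row max for numerical stability
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(y_true, y_prob, eps=1e-7):
    # mean over all timesteps of -log(probability assigned to the true token)
    return float(-(y_true * np.log(y_prob + eps)).sum(axis=-1).mean())

# toy batch: 2 sentences, 3 timesteps, 4-word vocabulary
target_ids = np.array([[1, 2, 3], [1, 3, 0]])
y_true = np.eye(4)[target_ids]

logits = np.zeros((2, 3, 4))
logits[0, [0, 1, 2], [1, 2, 3]] = 5.0  # first sentence: all three tokens right
logits[1, [0, 1, 2], [1, 2, 0]] = 5.0  # second sentence: middle token wrong

probs = softmax(logits)
pred_ids = probs.argmax(axis=-1)

loss = categorical_cross_entropy(y_true, probs)
token_acc = float((pred_ids == target_ids).mean())
sentence_acc = float((pred_ids == target_ids).all(axis=-1).mean())
print(loss, token_acc, sentence_acc)
```

Five of the six tokens are correct, but only one of the two sentences matches exactly, so the sentence-level score is noticeably lower than the token-level one.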