# Simple seq2seq model in Keras that learns to add and subtract integers

To avoid distractions, I'll start with a toy problem (which I had as an exercise in https://www.coursera.org/learn/language-processing). The model learns to calculate with integers at character level, end to end.
I follow the standard seq2seq model with an RNN encoder and decoder (where the decoder receives a thought vector from the encoder as its initial state).
Design decisions:

+ *GRUs*: The usual advice is to start with LSTMs, then implement GRUs and use the LSTM version as a benchmark. If the GRU version performs similarly, use GRUs, as they need only about three quarters of the parameters of an LSTM (three weight blocks instead of four). Here, it's a well-known problem and GRUs work fine (I also tried LSTMs, but the results were only slightly better).
+ *Bidirectional encoder*: Where applicable, bidirectional encoders usually work better, so I use one for the encoder here (obviously they can't be used for the decoder, at least not in an easy way).
+ *1 layer*: Deeper networks would be one way to improve. Here, a 2nd layer would improve the results a bit, but training takes much longer and the model overfits much more easily. And of course, for this toy problem I could easily train a much simpler network of fully connected layers, since in the end we have a linear function to solve ($\sum_{i=0}^{d_a} 10^i a_i \pm \sum_{i=0}^{d_b} 10^i b_i$ for $a=a_{d_a} \dots a_0, b=b_{d_b} \dots b_0$). But that is not the purpose of building a seq2seq model. The goal here is to show that seq2seq can learn to calculate without even knowing that it is calculating :-)
+ *Embeddings*: I use embeddings from the start, although training on one-hot encoded characters would be fine here too (indeed, it should give similar results from a theoretical point of view, and I also tried it). But working with embeddings makes it easier to plug in word embeddings or byte-pair encodings later.
+ *Masking*: It's very important to mask the padding (here the 0 values), so that the loss function ignores it. Otherwise training would take much longer (as the model would additionally have to learn the padding). Keras has a nice Masking layer that does it all automatically for us, but it took me an astonishing amount of time to understand it.
+ *START, END coding*: Results were better when the encoder input strings also end with an END sign.
+ *Log-uniformly distributed equations*: If I just trained on uniformly distributed operands, small values would be underrepresented (the model might not learn how to add 1- or 2-digit numbers, as most of the training examples have more digits). When working with text, we would often solve this problem by building seq2seq models for different input lengths (like <=5, 5-10, 10-15, >15 or so). It took me some time to figure out how important it was to make the training distribution log-uniform here (and it was frustrating to see that $1+1=?$ was impossible for the model to learn, but $1234+4567$ was not).
+ *Hyperparameters*: This is not a Kaggle competition, so I decided not to tune unless really necessary. I take the defaults for the learning rate and others, set dropout to 0.5, which is a common starting value (and worked better than no dropout), and pick some typical values for small problems (a training size of 100k is fine and still quick enough to run through, a batch size of ~128 is usual, and so on).
+ *Decoding*: For simplicity, I use greedy search here; beam search is something for later.
+ *OOP*: For production and reusability, it would be much better to write a Seq2Seq class. But for understanding what's going on, that would be distracting, so here it is as a simple imperative notebook without much syntactic noise or even comments.
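
The parameter comparison in the *GRUs* item can be checked with the textbook formulas. The sketch below is an assumption-laden back-of-the-envelope count, not Keras output: it assumes one input kernel, one recurrent kernel and one bias vector per weight block, four blocks for an LSTM and three for a GRU. The helper `rnn_param_count` and its `n_biases` argument are hypothetical names introduced for illustration (some GRU variants carry a second bias vector, which shifts the numbers slightly).

```python
def rnn_param_count(input_dim, units, n_blocks, n_biases=1):
    """Parameters of one recurrent layer under the textbook formulation.

    Each weight block has an input kernel (input_dim x units), a recurrent
    kernel (units x units) and n_biases bias vectors of length units.
    """
    per_block = input_dim * units + units * units + n_biases * units
    return n_blocks * per_block


EMBEDDING_DIM = 16   # input size, same value as in the notebook
UNITS = 512          # hidden size, same value as LATENT_DIM in the notebook

lstm_params = rnn_param_count(EMBEDDING_DIM, UNITS, n_blocks=4)
gru_params = rnn_param_count(EMBEDDING_DIM, UNITS, n_blocks=3)

print("LSTM:", lstm_params)
print("GRU: ", gru_params)
print("ratio GRU/LSTM:", gru_params / lstm_params)
```

Under these assumptions the ratio is exactly 3/4, independent of the layer sizes.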

This notebook closely follows https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

In [1]:
# technical detail so that an instance (maybe running in a different window)
# doesn't take all the GPU memory resulting in some strange error messages
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5
set_session(tf.Session(config=config))

Using TensorFlow backend.


In [2]:
import keras
import keras.layers as L
from keras.models import Model
import numpy as np
import pandas as pd

# Fix the random state to make results as reproducible as possible
RANDOM_STATE=42
np.random.seed(RANDOM_STATE)
tf.set_random_seed(RANDOM_STATE)

In [3]:
START = '^'
END = '$'

SIZE = 100_000
LATENT_DIM = 512
EMBEDDING_DIM = 16
EPOCHS = 20
BATCH_SIZE = 128
DROPOUT = 0.5

In [4]:
def loguniform_int(low=0, high=1, size=1):
    offset = np.max([1 - low, 0])
    low, high = np.log([low + offset, high + offset])
    return (np.exp(np.random.uniform(low, high, size)) - offset).astype(int)

def create_equations_df(size, min_value=0, max_value=9999, operations={'+': np.add, '-': np.subtract}):
    df = pd.DataFrame()
    df['a'] = loguniform_int(low=min_value, high=max_value, size=size)
    df['b'] = loguniform_int(low=min_value, high=max_value, size=size)
    df['op'] = np.random.choice(list(operations.keys()), size)
    df['result'] = np.zeros(size, dtype='int')
    for symbol, calc in operations.items():
        df.loc[df.op == symbol, 'result'] = calc(df[df.op == symbol]['a'], df[df.op == symbol]['b'])
        
    df['input_texts'] = df.a.astype(str) + df.op + df.b.astype(str) + END
    df['target_texts'] = START + df.result.astype(str) + END
    return df
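
To see why the log-uniform sampling matters, it helps to compare how many digits the operands have under both sampling schemes. The following is a minimal standard-library sketch, not the notebook's code: `loguniform_int` is re-implemented here for a single sample with `math` and `random`, and `digit_shares` is a hypothetical helper introduced for illustration.

```python
import math
import random
from collections import Counter


def loguniform_int(low, high, rng):
    """Draw one integer whose logarithm is (roughly) uniformly distributed."""
    offset = max(1 - low, 0)   # shift so that log() is defined for low=0
    lo, hi = math.log(low + offset), math.log(high + offset)
    return int(math.exp(rng.uniform(lo, hi)) - offset)


def digit_shares(values):
    """Fraction of values per number of digits."""
    counts = Counter(len(str(v)) for v in values)
    return {digits: counts[digits] / len(values) for digits in sorted(counts)}


rng = random.Random(42)
n = 20_000
log_uniform = [loguniform_int(0, 9999, rng) for _ in range(n)]
uniform = [rng.randint(0, 9999) for _ in range(n)]

print("log-uniform:", digit_shares(log_uniform))
print("uniform:    ", digit_shares(uniform))
```

With uniform sampling only about one operand in a thousand has a single digit, whereas the log-uniform scheme spreads the samples much more evenly over the digit counts.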

In [5]:
df = create_equations_df(SIZE)

In [6]:
corpus = pd.concat([df.input_texts, df.target_texts])

In [7]:
tokenizer = keras.preprocessing.text.Tokenizer(num_words=None, filters=None, char_level=True)
tokenizer.fit_on_texts(corpus)
df['input_sequences'] = tokenizer.texts_to_sequences(df.input_texts)
df['target_sequences'] = tokenizer.texts_to_sequences(df.target_texts)
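
What the character-level tokenizer and the subsequent post-padding do can be sketched in plain Python. This is an illustration under assumptions, not the Keras implementation: ids start at 1 so that 0 stays free for padding, more frequent characters get smaller ids, and the tie-breaking by character is my own choice. The names `fit_char_index`, `encode` and `pad_post` are hypothetical.

```python
from collections import Counter


def fit_char_index(texts):
    """Map each character to an id >= 1; frequent characters get smaller ids."""
    counts = Counter(ch for text in texts for ch in text)
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: idx for idx, ch in enumerate(ranked, start=1)}


def encode(text, char_index):
    """Turn a string into a list of character ids."""
    return [char_index[ch] for ch in text]


def pad_post(seqs, value=0):
    """Right-pad all sequences to the length of the longest one."""
    width = max(len(s) for s in seqs)
    return [s + [value] * (width - len(s)) for s in seqs]


texts = ["1+1$", "34-359$", "^2$", "^-325$"]
char_index = fit_char_index(texts)
padded = pad_post([encode(t, char_index) for t in texts])

for text, row in zip(texts, padded):
    print(f"{text!r:10} -> {row}")
```

Because id 0 never occurs for a real character, it can safely be used as the value to mask later on.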

In [8]:
X = keras.preprocessing.sequence.pad_sequences(df.input_sequences, padding='post')
y = keras.preprocessing.sequence.pad_sequences(df.target_sequences, padding='post')
y_t_output = keras.utils.to_categorical(y[:,1:], num_classes=len(tokenizer.word_index)+1)
x_t_input = y[:,:-1]

max_len_input = X.shape[1]
max_len_target = x_t_input.shape[1]
nr_tokens = y_t_output.shape[2]
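
The two slices `y[:,:-1]` and `y[:,1:]` implement teacher forcing: at every step the decoder sees the true previous character and has to predict the next one. A small sketch on characters instead of ids makes the one-step offset visible. `teacher_forcing_pair` is a hypothetical helper, and the `.` symbol only visualises the padding id 0.

```python
START, END, PAD = "^", "$", "."   # "." stands in for the padding id 0


def teacher_forcing_pair(target_text, max_len):
    """Return (decoder_input, decoder_target) as padded character lists."""
    padded = list(target_text) + [PAD] * (max_len - len(target_text))
    decoder_input = padded[:-1]    # corresponds to y[:, :-1]
    decoder_target = padded[1:]    # corresponds to y[:, 1:]
    return decoder_input, decoder_target


dec_in, dec_out = teacher_forcing_pair(START + "-325" + END, max_len=7)
for step, (given, expected) in enumerate(zip(dec_in, dec_out)):
    print(f"step {step}: input {given!r} -> target {expected!r}")
```

The START sign therefore only ever appears as an input, while the END sign is something the decoder must learn to emit.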

In [9]:
tokenizer.word_index
nr_tokens
y_t_output.shape
len(tokenizer.word_index)

{'$': 1,
 '1': 2,
 '2': 3,
 '^': 4,
 '3': 5,
 '4': 6,
 '-': 7,
 '5': 8,
 '0': 9,
 '6': 10,
 '7': 11,
 '8': 12,
 '9': 13,
 '+': 14}

15

(100000, 6, 15)

14

In [10]:
encoder_gru = L.Bidirectional(
    L.GRU(LATENT_DIM // 2, dropout=DROPOUT, return_state=True, name='encoder_gru'),
    name='encoder_bidirectional'
)
decoder_gru = L.GRU(LATENT_DIM, dropout=DROPOUT, return_sequences=True, return_state=True, name='decoder_gru')
decoder_dense = L.Dense(nr_tokens, activation='softmax', name='decoder_outputs')

shared_embedding = L.Embedding(nr_tokens, EMBEDDING_DIM, mask_zero=True, name='shared_embedding')

encoder_inputs = L.Input(shape=(max_len_input, ), dtype='int32', name='encoder_inputs')
encoder_embeddings = shared_embedding(encoder_inputs)
_, encoder_state_1, encoder_state_2 = encoder_gru(encoder_embeddings)
encoder_states = L.concatenate([encoder_state_1, encoder_state_2])

decoder_inputs = L.Input(shape=(max_len_target, ), dtype='int32', name='decoder_inputs')
decoder_mask = L.Masking(mask_value=0)(decoder_inputs)
decoder_embeddings_inputs = shared_embedding(decoder_mask)
decoder_embeddings_outputs, _ = decoder_gru(decoder_embeddings_inputs, initial_state=encoder_states) 
decoder_outputs = decoder_dense(decoder_embeddings_outputs)


model = Model(inputs=[encoder_inputs, decoder_inputs], outputs=decoder_outputs)

inference_encoder_model = Model(encoder_inputs, encoder_states)
    
inference_decoder_state_inputs = L.Input(shape=(LATENT_DIM, ), dtype='float32', name='inference_decoder_state_inputs')
inference_decoder_embeddings_outputs, inference_decoder_states = decoder_gru(
    decoder_embeddings_inputs, initial_state=inference_decoder_state_inputs
)
inference_decoder_outputs = decoder_dense(inference_decoder_embeddings_outputs)

inference_decoder_model = Model(
    [decoder_inputs, inference_decoder_state_inputs], 
    [inference_decoder_outputs, inference_decoder_states]
)

In [11]:
model.summary()
inference_decoder_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
decoder_inputs (InputLayer)     (None, 6)            0                                            
__________________________________________________________________________________________________
masking_1 (Masking)             (None, 6)            0           decoder_inputs[0][0]             
__________________________________________________________________________________________________
encoder_inputs (InputLayer)     (None, 10)           0                                            
__________________________________________________________________________________________________
shared_embedding (Embedding)    multiple             240         encoder_inputs[0][0]             
                                                                 masking_1[0][0]                  
__________

In [12]:
model.compile(optimizer=keras.optimizers.Adam(clipnorm=1.), loss='categorical_crossentropy')
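
Conceptually, masking means that padded timesteps do not contribute to the cross-entropy. The sketch below shows that idea on a single toy sequence with the standard library only. It is not a reproduction of Keras internals (the exact normalisation Keras uses is not modelled here), and `masked_cross_entropy` is a hypothetical helper that takes the mask as an explicit list, whereas in the notebook the mask is derived from the zero-valued decoder inputs.

```python
import math


def masked_cross_entropy(probs, targets, mask):
    """Mean negative log-likelihood over the timesteps where mask is True.

    probs   -- per timestep, a list of class probabilities
    targets -- per timestep, the true class id
    mask    -- per timestep, True if the step should count
    """
    losses = [-math.log(p[t]) for p, t, keep in zip(probs, targets, mask) if keep]
    return sum(losses) / len(losses)


# Toy sequence over 3 classes (0 = padding): two real steps, two padded steps.
targets = [1, 2, 0, 0]
probs = [
    [0.10, 0.80, 0.10],   # real step, fairly confident and right
    [0.10, 0.20, 0.70],   # real step, fairly confident and right
    [0.90, 0.05, 0.05],   # padded step, trivially predicts padding
    [0.90, 0.05, 0.05],   # padded step, trivially predicts padding
]

masked = masked_cross_entropy(probs, targets, mask=[True, True, False, False])
unmasked = masked_cross_entropy(probs, targets, mask=[True] * 4)

print("masked:  ", masked)
print("unmasked:", unmasked)
```

Without the mask the easy padding steps pull the average loss down, so the reported loss looks better than the model actually is on the real characters.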

In [13]:
model.fit([X, x_t_input], y_t_output, validation_split=0.1, epochs=EPOCHS, batch_size=BATCH_SIZE)

Train on 90000 samples, validate on 10000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f53f81c9ba8>

In [14]:
def decode_sequence(input_seq):
    states_value = inference_encoder_model.predict(input_seq)
    
    target_seq = np.zeros((1, max_len_target))
    target_seq[0, 0] = tokenizer.word_index[START]
    
    tokens = {idx: token for (token, idx) in tokenizer.word_index.items()}
    
    decoded_sequence = ''
    for i in range(max_len_target):
        output_tokens, output_states = inference_decoder_model.predict(
            [target_seq, states_value]
        )
        
        # greedy search
        sampled_token_idx = np.argmax(output_tokens[0, 0, :])
        sampled_token = tokens.get(sampled_token_idx, '.')
        if sampled_token == END:
            break
        decoded_sequence += sampled_token
            
        target_seq[0, 0] = sampled_token_idx
        states_value = output_states
    
    return decoded_sequence 
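
Stripped of the Keras details, the loop above is plain greedy decoding: feed the last token and the current state, take the most probable next token, stop at END. The following sketch shows only that control flow. `greedy_decode` is a hypothetical generic version, and `toy_step` is a made-up stand-in for the decoder model that simply spells out the string stored in its state.

```python
def greedy_decode(step_fn, initial_state, start_token, end_token, max_len):
    """Feed the last token and state back in, always picking the argmax token."""
    token, state, decoded = start_token, initial_state, []
    for _ in range(max_len):
        probs, state = step_fn(token, state)   # probs: dict token -> probability
        token = max(probs, key=probs.get)      # greedy choice
        if token == end_token:
            break
        decoded.append(token)
    return "".join(decoded)


def toy_step(token, state):
    """Hypothetical decoder: emits the characters of `state` one by one, then END."""
    next_char = state[0] if state else "$"
    probs = {ch: 0.01 for ch in "0123456789-$"}
    probs[next_char] = 0.89
    return probs, state[1:]


print(greedy_decode(toy_step, initial_state="-325",
                    start_token="^", end_token="$", max_len=6))
```

Beam search would differ only in the selection step: instead of keeping the single best token, it would keep the several best partial sequences at every step.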

In [15]:
def predict(equation):
    return decode_sequence(keras.preprocessing.sequence.pad_sequences(
        tokenizer.texts_to_sequences([equation]),
        padding='post',
        maxlen=X.shape[1]
    ))

In [16]:
# Performance on some examples:
for calc in [eq + '$' for eq in ['1+1', '9+11', '21+34', '359+468', '1359+468', '1-1', '19-1', '34-359', '11359-1468']]:
    print(f"{calc}=got: {predict(calc)}, exp: {eval(calc[:-1])}")

1+1$=got: 2, exp: 2
9+11$=got: 20, exp: 20
21+34$=got: 55, exp: 55
359+468$=got: 827, exp: 827
1359+468$=got: 1827, exp: 1827
1-1$=got: 0, exp: 0
19-1$=got: 18, exp: 18
34-359$=got: -325, exp: -325
11359-1468$=got: -19, exp: 9891
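
One plausible contributor to the last miss (an assumption on my side, not something verified in this notebook): the operand 11359 lies outside the training range of 0 to 9999, and the string `11359-1468$` has 11 characters while the encoder input length is 10. When `maxlen` is given, `pad_sequences` by default truncates from the front (`truncating='pre'`), so the model never sees the leading digit. The sketch emulates just that behaviour for a single sequence; `pad_or_truncate` is a hypothetical helper, not the Keras function.

```python
def pad_or_truncate(seq, maxlen, padding="post", truncating="pre", value=0):
    """Emulate padding and truncation of one sequence to exactly maxlen items."""
    if len(seq) > maxlen:
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    return seq + fill if padding == "post" else fill + seq


MAX_LEN_INPUT = 10   # encoder input length as reported by model.summary()

for equation in ["359+468$", "11359-1468$"]:
    kept = "".join(pad_or_truncate(list(equation), MAX_LEN_INPUT, value="."))
    print(f"{equation!r:15} -> {kept!r}")
```

The second equation is cut down to `1359-1468$`, which is a different calculation from the one that was asked.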


In [17]:
# Performance on training set:
for calc in df.input_texts[:10].tolist():
    print(f"{calc}=got: {predict(calc)}, exp: {eval(calc[:-1])}")

30-209$=got: -179, exp: -179
6350+127$=got: 6477, exp: 6477
846-24$=got: 822, exp: 822
247-92$=got: 155, exp: 155
3+27$=got: 30, exp: 30
3-427$=got: -424, exp: -424
0-2187$=got: -2187, exp: -2187
2914-1403$=got: 1511, exp: 1511
252+22$=got: 274, exp: 274
678-108$=got: 570, exp: 570


In [18]:
# Mean absolute error on a test set
test_df = create_equations_df(size=1000)
test_df['y_pred'] = test_df.input_texts.apply(predict).astype(int)
test_df['y_true'] = test_df.result
print("MAE", np.mean(np.abs(test_df.y_pred - test_df.y_true)))

MAE 10.071
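
The MAE is easy to read but can be dominated by a few predictions that are wrong in a leading digit or in the sign. A complementary view is the share of exactly correct results. The sketch below computes both on a handful of the example predictions printed earlier (the values are copied from the `In [16]` output, not a new experiment); `exact_match_rate` is a hypothetical helper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between expected and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def exact_match_rate(y_true, y_pred):
    """Fraction of predictions that are exactly right."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Expected and predicted values of six of the examples shown above.
y_true = [2, 20, 55, 827, -325, 9891]
y_pred = [2, 20, 55, 827, -325, -19]

print("MAE:        ", mean_absolute_error(y_true, y_pred))
print("exact match:", exact_match_rate(y_true, y_pred))
```

Here five of six answers are exact, yet the single miss alone determines the MAE, which is why reporting both numbers can give a fairer picture.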


## Conclusion

It doesn't work perfectly, but well enough to show that seq2seq works to some degree. I wouldn't be surprised if the mean absolute error were better than the average human error when calculating without any tools.
For improvements and further discussion I'll move on to a real problem (translation); the main steps will be:
* Byte-pair encoding/word embeddings
* Beam Search
* Attention models