# Learning seq2seq
The point of this notebook is to understand and implement a simple sequence-to-sequence (seq2seq) model. I'll generally be following [this tutorial](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html) on Keras' site.

### Imports

In [8]:
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

import numpy as np

### Model building
The general process is as follows:
- Split the model into an encoder and a decoder. 
- Encoder:
    - The encoder takes the input sequence and outputs its internal state (both the hidden and cell states, taken only at the final timestep).
- Decoder:
    - The decoder uses the encoder's final states as its initial states.
    - It takes the *target* sequence as input and is trained to predict the next timestep of that same sequence.
    
Note then that the entire purpose of the encoder is to generate a vector representation of the input in terms of the encoder's hidden and cell states. That is, the encoder serves to compress the information in the input sequence.

The role of the decoder is to use this compressed representation to decode a target sequence. Note that we *do not* feed the input sequence to the decoder directly. This is because all information from the input sequence is (in theory) already held in the final hidden and cell states of the encoder. Thus the role of the decoder is to use that information to work directly with the target sequence.
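To make the "compression" idea concrete, here is a minimal NumPy sketch of an LSTM encoder loop. It is not Keras' implementation: the weights are random, and the names `lstm_encode`, `W`, `U` and `b` are my own for illustration. The point is only that the final `(h, c)` pair has the same fixed size no matter how long the input sequence is.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_encode(x_seq, W, U, b):
    """Run a single-layer LSTM over x_seq (timesteps, features); return the final (h, c)."""
    units = U.shape[0]
    h = np.zeros(units)  # hidden state
    c = np.zeros(units)  # cell state
    for x_t in x_seq:
        z = x_t @ W + h @ U + b            # pre-activations, shape (4 * units,)
        i, f, g, o = np.split(z, 4)        # input gate, forget gate, candidate, output gate
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g                  # update the cell state
        h = o * np.tanh(c)                 # update the hidden state
    return h, c

rng = np.random.default_rng(0)
features, units = 13, 8
W = rng.normal(scale=0.1, size=(features, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

short_seq = rng.normal(size=(6, features))   # 6 timesteps
long_seq = rng.normal(size=(20, features))   # 20 timesteps
h_short, c_short = lstm_encode(short_seq, W, U, b)
h_long, c_long = lstm_encode(long_seq, W, U, b)
print(h_short.shape, c_short.shape, h_long.shape, c_long.shape)
```

Both sequences end up as a pair of length-8 vectors, and that pair is all the decoder gets to see of the input.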

Let's now build the model.

In [204]:
def my_model(input_seq_length, vocab_size, batch_size, num_units_enc, target_seq_length, num_units_dec=None):
    if num_units_dec is None:
        num_units_dec = num_units_enc
        
    encoder_input = Input(shape=(input_seq_length, vocab_size), batch_size=batch_size, name='Encoder_Input')
    encoder_lstm = LSTM(units=num_units_enc, return_state=True, name='Encoder_LSTM')
    encoder_lstm_output = encoder_lstm(encoder_input)
    encoder_states = encoder_lstm_output[1:]
        
    decoder_input = Input(shape=(target_seq_length, vocab_size), batch_size=batch_size, name='Decoder_Input')
    # `return_sequences=True` makes the LSTM emit one output per timestep, so the final
    # dense layer predicts a character at every position of the target sequence.
    decoder_lstm = LSTM(units=num_units_dec, return_sequences=True, name='Decoder_LSTM')
    decoder_lstm_output = decoder_lstm(decoder_input, initial_state=encoder_states) 
    decoder_dense = Dense(vocab_size, activation='softmax', name='Decoder_Dense')
    decoder_output = decoder_dense(decoder_lstm_output)
    
    model = Model(inputs=[encoder_input, decoder_input], outputs=decoder_output)
    return model
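On the `return_sequences` question: the following NumPy sketch mimics only the shapes involved, with random numbers standing in for the LSTM outputs and the dense layer weights (`kernel` and `bias` are illustrative names, not anything pulled from the model). With all timesteps kept, the dense-plus-softmax step yields one distribution over the vocabulary per target position. With only the last timestep, we would get a single distribution per sample.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch, timesteps, units, vocab = 2, 4, 8, 13
kernel = rng.normal(size=(units, vocab))
bias = np.zeros(vocab)

all_steps = rng.normal(size=(batch, timesteps, units))  # what return_sequences=True would give
last_step = all_steps[:, -1, :]                         # what return_sequences=False would give

probs_all = softmax(all_steps @ kernel + bias)    # shape (batch, timesteps, vocab)
probs_last = softmax(last_step @ kernel + bias)   # shape (batch, vocab)
print(probs_all.shape, probs_last.shape)
```

Since the training target has one character per timestep, the per-timestep shape is the one that lines up with it.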

### Data generation
We'll do the simple two-digit addition problem, since data is easy to generate from scratch. The input sequence will be a string of digits and '+' (one-hot encoded later), and the output will be the sum, also as a string. We will append an end-of-string token to both as well. As in the notebook, we can either encode them as lists of integers and let an embedding layer handle representing them as vectors, or one-hot encode everything. For simplicity (and to see what effect the embedding layer has later on) we'll one-hot encode for now.

In [126]:
char_to_int = {str(n): n for n in range(10)}
char_to_int[' '] = 10
char_to_int['+'] = 11
char_to_int['\n'] = 12

int_to_char = {v: k for k, v in char_to_int.items()}
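As a quick sanity check of the vocabulary, here is a small round-trip sketch. The `encode` and `decode` helpers are hypothetical names of my own; the mapping itself is rebuilt so the block runs on its own, and it matches the one above (digits 0 to 9, then space, '+', and newline).

```python
# Rebuild the same 13-symbol vocabulary: digits, padding space, plus sign, end token
vocab = [str(n) for n in range(10)] + [' ', '+', '\n']
char_to_int = {c: i for i, c in enumerate(vocab)}
int_to_char = {i: c for c, i in char_to_int.items()}

def encode(s):
    """Map a string to a list of integer ids."""
    return [char_to_int[c] for c in s]

def decode(ids):
    """Map a list of integer ids back to a string."""
    return ''.join(int_to_char[i] for i in ids)

ids = encode('12+34\n')
print(ids)                 # [1, 2, 11, 3, 4, 12]
print(repr(decode(ids)))   # '12+34\n'
```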

In [127]:
def one_hot(n, dim):
    # One-hots a non-negative integer n (0 <= n < dim)
    one_hot_n = np.zeros(dim)
    one_hot_n[n] = 1
    return one_hot_n
    
def undo_one_hot(v):
    return np.argmax(v)

def one_hot_matrix(M, vocab_size):
    n_samples, seq_length = M.shape
    M_oh = np.array([one_hot(r, vocab_size) for r in np.array(M).flatten()]).reshape((n_samples, seq_length, vocab_size))
    return np.squeeze(M_oh) # In case this is a target vector, we don't want to include an unnecessary axis
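An alternative sketch of the same idea, using row indexing into an identity matrix instead of a Python loop. `one_hot_matrix_vectorized` is a hypothetical name, and unlike `one_hot_matrix` it does not squeeze the result, so it always returns a 3-D array.

```python
import numpy as np

def one_hot_matrix_vectorized(M, vocab_size):
    """One-hot an integer array of shape (n_samples, seq_length)."""
    M = np.asarray(M, dtype=int)
    # Row k of the identity matrix is the one-hot vector for integer k
    return np.eye(vocab_size)[M]

M = np.array([[3, 6, 11, 3, 1, 12],
              [10, 6, 7, 12, 12, 12]])
M_oh = one_hot_matrix_vectorized(M, vocab_size=13)
recovered = M_oh.argmax(axis=-1)   # undo the one-hot encoding
print(M_oh.shape)                  # (2, 6, 13)
```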

In [247]:
def generate_sample(num_terms=2, digit_length=2, max_target_digits=3, int_encoder=None, reverse=False):
    x = []
    for _ in range(num_terms):
        x.append(np.random.randint(10**digit_length))
        
    y = np.sum(x)
    
    x_str = '+'.join(str(n) for n in x)
    y_str = str(y)
    
    # Pad x so that it always has the same length: digit_length characters for each term, plus num_terms - 1 "plus" signs
    x_str = x_str.rjust(num_terms * digit_length + num_terms - 1)
    y_str = y_str.rjust(max_target_digits) # todo: Fix to be a calculated value
    
    if reverse:
        x_str = x_str[::-1]

    x_str += '\n'
    y_str += '\n'
    
    x_list = list(x_str)
    y_list = list(y_str)
    
    if int_encoder is not None:
        assert isinstance(int_encoder, dict), 'int_encoder must be a dictionary mapping characters to integers'
        x_list = [int_encoder[c] for c in x_list]
        y_list = [int_encoder[c] for c in y_list]
        
    return x_list, y_list
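Regarding the todo on `max_target_digits`: one way to calculate it, sketched below, is to count the digits of the largest possible sum. `max_sum_digits` is a hypothetical helper, not something the function above uses yet. It assumes each term is at most `10**digit_length - 1`, which is what `np.random.randint(10**digit_length)` produces.

```python
def max_sum_digits(num_terms=2, digit_length=2):
    """Number of digits in the largest sum of num_terms terms with digit_length digits each."""
    largest_term = 10 ** digit_length - 1
    return len(str(num_terms * largest_term))

print(max_sum_digits(2, 2))   # 99 + 99 = 198   -> 3
print(max_sum_digits(2, 3))   # 999 + 999 = 1998 -> 4
```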

In [248]:
def generate_samples(n_samples, num_terms=2, digit_length=2, max_target_digits=3, int_encoder=None, one_hot=False, reverse=False):
    X = []
    y = []
    for _ in range(n_samples):
        x_sample, y_sample = generate_sample(num_terms, digit_length, max_target_digits, int_encoder, reverse)
        X.append(x_sample)
        y.append(y_sample)
        
    X = np.array(X)
    y = np.array(y)
    
    if one_hot:
        X = one_hot_matrix(X, len(int_encoder))
        y = one_hot_matrix(y, len(int_encoder))
    
    return X, y

In [130]:
generate_sample()

(['7', '3', '+', '6', '7', '\n'], ['1', '4', '0', '\n'])

In [131]:
generate_sample(int_encoder=char_to_int)

([3, 6, 11, 3, 1, 12], [10, 6, 7, 12])

In [132]:
X_oh, y_oh = generate_samples(n_samples=10, int_encoder=char_to_int, one_hot=True)

In [133]:
X_oh.shape, y_oh.shape

((10, 6, 13), (10, 4, 13))

### Test out the model
Let's run some data through our model to make sure everything works as intended.

In [134]:
NUM_TERMS = 2
DIGIT_LENGTH = 2
SEQ_LENGTH = NUM_TERMS * DIGIT_LENGTH + NUM_TERMS

model = my_model(input_seq_length=SEQ_LENGTH, vocab_size=len(char_to_int), batch_size=1, num_units_enc=1, target_seq_length=4)

In [135]:
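def example_prediction(model, X, y):
    # NOTE: the X and y arguments are currently unused. The first one-hot sample is read
    # from the globals X_oh and y_oh, and a leading batch axis of size 1 is added to each.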
    X_singleton = X_oh[0].reshape(1, *X_oh[0].shape)
    y_singleton = y_oh[0].reshape(1, *y_oh[0].shape)

    output = model.predict([X_singleton, y_singleton], batch_size=1)
    return output

def sample_from_softmax(output):
    squeezed_output = np.squeeze(output)
    sampled_output = np.random.choice(np.arange(len(squeezed_output)), p=squeezed_output)
    return sampled_output

def predict_one(model, X, y):
    model_output = example_prediction(model, X, y)
    sampled_pred = sample_from_softmax(model_output)
    # todo: Make this output nicer to get a better idea what's actually being predicted/used
    print(f'X = {X}')
    print(f'y = {y}')
    print(f'Pred = {sampled_pred}')

In [136]:
example_prediction(model, X, y)

array([[0.07845037, 0.07061684, 0.07993645, 0.08047733, 0.07797949,
        0.07469489, 0.07416169, 0.07028104, 0.06670289, 0.08318146,
        0.08559087, 0.07626706, 0.08165962]], dtype=float32)

In [137]:
predict_one(model, X, y)

X = [[ 8  3 11  9  4 12]
 [ 1  8 11  1  6 12]
 [ 3  6 11  1  3 12]
 [ 9  9 11  5  4 12]
 [10  1 11  3  1 12]
 [ 1  7 11  8  4 12]
 [ 6  1 11  9  8 12]
 [ 4  9 11  3  3 12]
 [10  0 11  4  2 12]
 [10  4 11  9  4 12]]
y = [[ 1  7  7 12]
 [10  3  4 12]
 [10  4  9 12]
 [ 1  5  3 12]
 [10  3  2 12]
 [ 1  0  1 12]
 [ 1  5  9 12]
 [10  8  2 12]
 [10  4  2 12]
 [10  9  8 12]]
Pred = 9


### Train the model

In [179]:
X_oh, y_oh = generate_samples(n_samples=10**4, int_encoder=char_to_int, one_hot=True)

In [201]:
model = my_model(input_seq_length=SEQ_LENGTH, vocab_size=len(char_to_int), batch_size=10, num_units_enc=1, target_seq_length=3)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [182]:
input_target = y_oh[:,:-1,:]
output_target = y_oh[:,1:,:]
model.fit([X_oh, input_target], output_target, epochs=30, validation_split=0.3)
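To see what the two slices do, here is the same shift applied to a single target string instead of a one-hot array (a toy illustration; the variable names are mine). At each step the decoder sees one character and is asked to predict the next one.

```python
y_str = ' 67\n'               # padded sum plus the end-of-string token

decoder_input = y_str[:-1]    # what the decoder is fed:        ' 67'
decoder_target = y_str[1:]    # what it is trained to predict:  '67\n'

# Pair each input character with the character it should predict
steps = list(zip(decoder_input, decoder_target))
print(steps)                  # [(' ', '6'), ('6', '7'), ('7', '\n')]
```

One consequence of this setup is that the first character of the target is only ever an input and never a prediction target, since nothing precedes it. A common way around that is to prepend a start-of-sequence token to the target.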

Train on 7000 samples, validate on 3000 samples
Epoch 1/30
...
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x65ef41bd0>

So training for 30 epochs with a single unit in the LSTM gives ~43% validation accuracy. 

Let's do it again, but mirroring the parameters set by the [Addition RNN](https://keras.io/examples/addition_rnn/) script at Keras: a batch size of 128, 128 units in the LSTM, and 200 epochs, with 10% of the data held out for validation. Its docstring reports 99% accuracy on three-digit addition with 50k training samples (5k is the figure it gives for two digits). Here I use three digits with only 5k samples.

In [239]:
NUM_TERMS = 2
DIGIT_LENGTH = 3
SEQ_LENGTH = NUM_TERMS * DIGIT_LENGTH + NUM_TERMS

X_oh, y_oh = generate_samples(n_samples=5*10**3,
                              digit_length=DIGIT_LENGTH,
                              max_target_digits=4,
                              int_encoder=char_to_int,
                              one_hot=True)

In [240]:
model = my_model(input_seq_length=SEQ_LENGTH, vocab_size=len(char_to_int), batch_size=128, num_units_enc=128, target_seq_length=4)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [242]:
input_target = y_oh[:,:-1,:]
output_target = y_oh[:,1:,:]
model.fit([X_oh, input_target], output_target, epochs=200)

Train on 5000 samples
Epoch 1/200
...
Epoch 76/200

<tensorflow.python.keras.callbacks.History at 0x66d602250>

This seems promising. Let's generate a test set and see how it performs.

In [243]:
X_test, y_test = generate_samples(n_samples=1000,
                                  digit_length=DIGIT_LENGTH,
                                  max_target_digits=4,
                                  int_encoder=char_to_int,
                                  one_hot=True)

In [244]:
test_metrics = model.evaluate([X_test, y_test[:, :-1, :]], y_test[:, 1:, :], verbose=0)

In [245]:
print('\n'.join(f'{n} = {v:.4f}' for n, v in list(zip(model.metrics_names, test_metrics))))

loss = 0.9939
accuracy = 0.6507


So an improvement, but still well short of the 99% they claim.

Next, let's try adding in "reversing". They claim this gives a good boost.
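Concretely, reversing here means flipping the padded query string before the end-of-string token is appended, as `generate_sample` does when `reverse=True`. The sketch below isolates just that formatting step. `format_query` is a hypothetical helper that takes the terms explicitly instead of drawing them at random.

```python
def format_query(terms, digit_length=3, reverse=False):
    """Build the padded (and optionally reversed) input string for a list of terms."""
    num_terms = len(terms)
    x_str = '+'.join(str(n) for n in terms)
    # Left-pad to a fixed width: digit_length per term plus the '+' separators
    x_str = x_str.rjust(num_terms * digit_length + num_terms - 1)
    if reverse:
        x_str = x_str[::-1]
    return x_str + '\n'

print(repr(format_query([12, 345])))                 # ' 12+345\n'
print(repr(format_query([12, 345], reverse=True)))   # '543+21 \n'
```

Note that only the input is reversed. The target sum is still written in its usual order.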

In [249]:
NUM_TERMS = 2
DIGIT_LENGTH = 3
SEQ_LENGTH = NUM_TERMS * DIGIT_LENGTH + NUM_TERMS

X_oh, y_oh = generate_samples(n_samples=5*10**3,
                              digit_length=DIGIT_LENGTH,
                              max_target_digits=4,
                              int_encoder=char_to_int,
                              one_hot=True,
                              reverse=True)

In [250]:
model = my_model(input_seq_length=SEQ_LENGTH, vocab_size=len(char_to_int), batch_size=128, num_units_enc=128, target_seq_length=4)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [251]:
input_target = y_oh[:,:-1,:]
output_target = y_oh[:,1:,:]
model.fit([X_oh, input_target], output_target, epochs=200)

Train on 5000 samples
Epoch 1/200
...
Epoch 76/200

<tensorflow.python.keras.callbacks.History at 0x66ae019d0>

In [252]:
X_test, y_test = generate_samples(n_samples=1000,
                                  digit_length=DIGIT_LENGTH,
                                  max_target_digits=4,
                                  int_encoder=char_to_int,
                                  one_hot=True,
                                  reverse=True)

In [253]:
test_metrics = model.evaluate([X_test, y_test[:, :-1, :]], y_test[:, 1:, :], verbose=0)

In [254]:
print('\n'.join(f'{n} = {v:.4f}' for n, v in list(zip(model.metrics_names, test_metrics))))

loss = 0.1018
accuracy = 0.9737


Wow. So reversing the input increased our test accuracy from 65% to 97%, and resulted in a nearly 90% reduction in the loss. That's incredible. Reverse reverse reverse!
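One caveat on these numbers: with a `(batch, timesteps, vocab)` output, the Keras `accuracy` metric is averaged over every character position, so it is a per-character accuracy and not the fraction of sums that are entirely correct. Here is a sketch of computing both from one-hot targets and predicted probabilities. `char_and_sequence_accuracy` is a hypothetical helper, and the arrays are toy data, not results from the model above.

```python
import numpy as np

def char_and_sequence_accuracy(y_true_oh, y_pred_probs):
    """Return (per-character accuracy, whole-sequence accuracy)."""
    true_ids = y_true_oh.argmax(axis=-1)
    pred_ids = y_pred_probs.argmax(axis=-1)
    matches = true_ids == pred_ids                     # shape (n_samples, timesteps)
    return matches.mean(), matches.all(axis=-1).mean()

# Toy data: 2 samples, 3 timesteps, vocabulary of 4; the second sample has one wrong character
true_ids = np.array([[1, 2, 3], [0, 1, 2]])
pred_ids = np.array([[1, 2, 3], [0, 1, 3]])
y_true = np.eye(4)[true_ids]
y_pred = np.eye(4)[pred_ids]

char_acc, seq_acc = char_and_sequence_accuracy(y_true, y_pred)
print(f'per-character = {char_acc:.4f}, whole-sequence = {seq_acc:.4f}')
```

With the arrays from this notebook, the inputs would presumably be `y_test[:, 1:, :]` and `model.predict([X_test, y_test[:, :-1, :]])`. Even then the decoder is fed the true previous characters at every step, so this is still a teacher-forced measurement.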