# Seq2seq RNN using Keras

<p style="text-align:justify;">This project is a walk-through of the code from the [Keras blog](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html). The goal is to use an LSTM to translate a sentence from English to French. The difficulty of this task lies in the sizes of the input and output data: the input sentences have varying lengths that are not known in advance, and the output is unlikely to have the same length as the input.</p>

<div class='alert alert-info'>"the cat sat on the mat" -> [Seq2Seq model] -> "le chat etait assis sur le tapis"
</div>

## Loading the data

In [1]:
from __future__ import print_function

from keras.models import Model, load_model
from keras.layers import LSTM, Dense, Input
import numpy as np

Using TensorFlow backend.


In [3]:
batch_size = 64
epochs = 100
latent_dim = 256
num_samples = 10000
data_path = 'datas/fra.txt'

We start by reading the data and storing it in 3-dimensional NumPy arrays of shape (m, $T_{x}$, nbr_chars).

m: The number of examples<br>
$T_{x}$: The number of characters in a sentence. Here it is the number of characters in the longest sentence, and shorter sentences are padded with zeros<br>
nbr_chars: The number of distinct characters in the vocabulary of each language

<p style="text-align:justify;">It is interesting to note that we add \t at the beginning and \n at the end of each target sentence. These markers tell the algorithm where to start and where to stop: \t is given as the first character, then the RNN predicts one character at a time until it produces \n.</p>
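
To make the tensor layout concrete, here is a minimal sketch on a toy corpus. The two sentence pairs and every `toy_*` name are assumptions made up for illustration, not data from `fra.txt`; the encoding follows the same idea as the notebook.

```python
import numpy as np

# Toy corpus (hypothetical, not taken from fra.txt).
toy_inputs = ["Go.", "Run!"]
# Wrap each target in the start (\t) and end (\n) markers.
toy_targets = ["\t" + t + "\n" for t in ["Va !", "Cours !"]]

# Character vocabulary and longest sentence length.
toy_chars = sorted(set("".join(toy_targets)))
toy_index = {char: i for i, char in enumerate(toy_chars)}
toy_max_len = max(len(t) for t in toy_targets)

# One-hot tensor of shape (m, T_x, nbr_chars).
# Time steps past the end of a shorter sentence stay all-zero.
toy_data = np.zeros((len(toy_targets), toy_max_len, len(toy_chars)), dtype="float32")
for i, text in enumerate(toy_targets):
    for t, char in enumerate(text):
        toy_data[i, t, toy_index[char]] = 1.0

print(toy_data.shape)  # (2, 9, 11)
```

The first sentence is only 6 characters long (markers included), so its last 3 time steps contain no active entry.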

In [4]:
with open(data_path, "r", encoding="utf-8") as f:
    lines = f.read().split("\n")

input_texts = []
target_texts = []
input_characters = set()
target_characters = set()

for line in lines[:min(num_samples, len(lines)-1)]:
    input_text, target_text = line.split("\t")
    target_text = "\t"+target_text+"\n"
    input_texts.append(input_text)
    target_texts.append(target_text)
    # A set ignores duplicates, so no membership test is needed
    input_characters.update(input_text)
    target_characters.update(target_text)
            
input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

Number of samples: 10000
Number of unique input tokens: 71
Number of unique output tokens: 94
Max sequence length for inputs: 16
Max sequence length for outputs: 59


In [5]:
input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])

In [6]:
encoder_input_data = np.zeros((len(input_texts), max_encoder_seq_length, num_encoder_tokens), dtype="float32")
decoder_input_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32")
decoder_target_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32")

<p style="text-align:justify;">The decoder target data is one time step ahead of the decoder input data: at step t, the target is the character that the decoder input holds at step t+1. This way the RNN learns to predict the next character given the context provided by the encoder RNN.</p>
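
The one-step offset can be illustrated on plain strings before any one-hot encoding. This is a simplified sketch with a made-up sentence: the notebook keeps the full sequence in `decoder_input_data` and leaves the last target step empty, whereas the slices below simply drop that last step.

```python
# Hypothetical target sentence, already wrapped in the start (\t) and end (\n) markers.
target_text = "\tVa !\n"

# At step t the decoder reads target_text[t] and must predict target_text[t + 1].
decoder_input = target_text[:-1]   # begins with the start marker
decoder_target = target_text[1:]   # same text, one step ahead, ends with the end marker

for seen, expected in zip(decoder_input, decoder_target):
    print(repr(seen), "->", repr(expected))
```

The first printed pair is `'\t' -> 'V'` and the last is `'!' -> '\n'`, so the model learns both how to start and when to stop.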

In [7]:
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1
    for t, char in enumerate(target_text):
        decoder_input_data[i, t, target_token_index[char]] = 1
        if t > 0:
            decoder_target_data[i, t - 1, target_token_index[char]] = 1

## RNN Model

<p style="text-align:justify;">RNN models need a lot of computing power to train. In this example, running 100 epochs on a dataset of 10,000 sentences took about 20 seconds per epoch, for a total of around 33 minutes on a CPU-only computer.</p>
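
As a quick sanity check of these figures, the arithmetic can be reproduced from the notebook settings. The 20 seconds per epoch is the approximate value quoted above, and the validation share of 0.2 is the `validation_split` passed to `model.fit`; the rounding used here is an assumption about how the split is computed.

```python
import math

# Values taken from the notebook settings and the timing quoted in the text.
num_samples = 10000
validation_split = 0.2
batch_size = 64
epochs = 100
seconds_per_epoch = 20  # approximate

# Samples left for training once the validation share is set aside.
train_samples = num_samples - round(num_samples * validation_split)
# Number of weight updates per epoch.
batches_per_epoch = math.ceil(train_samples / batch_size)
# Total training time in minutes.
total_minutes = epochs * seconds_per_epoch / 60

print(train_samples, batches_per_epoch, round(total_minutes, 1))  # 8000 125 33.3
```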

In [None]:
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_state=True, return_sequences=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=batch_size, epochs=epochs, validation_split=0.2)

model.save('s2s.h5')

Train on 8000 samples, validate on 2000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
 896/8000 [==>...........................] - ETA: 16s - loss: 0.5058

## The inference mode

Here is how it is going to work:
1. Encode the input and retrieve the initial decoder state.
2. Run one step of the decoder with this initial state
   and a "start of sequence" token as its input.
   The output is the next target token.
3. Repeat with the current target token and current states until the end-of-sequence character is produced or the length limit is reached.
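
The loop described by these steps can be sketched without Keras. In the sketch below, `toy_decoder_step` and `TOY_OUTPUT` are hypothetical stand-ins for the trained decoder: a real decoder returns a probability distribution over characters, from which the most likely one is picked.

```python
# Hypothetical stand-in for the trained decoder: given the previous character
# and a state, return (next_char, new_state). Here the "state" is just a
# position in a fixed string.
TOY_OUTPUT = "Va !\n"

def toy_decoder_step(prev_char, state):
    next_char = TOY_OUTPUT[state]
    return next_char, state + 1

def greedy_decode(initial_state, max_length):
    prev_char = "\t"          # "start of sequence" token
    state = initial_state     # step 1: state produced by the encoder
    decoded = ""
    while True:
        # Step 2: one decoder step from the current token and state.
        prev_char, state = toy_decoder_step(prev_char, state)
        decoded += prev_char
        # Stop on the end marker or once the length limit is reached.
        if prev_char == "\n" or len(decoded) >= max_length:
            return decoded
        # Step 3: loop again with the new token and state.

print(repr(greedy_decode(initial_state=0, max_length=59)))  # 'Va !\n'
```

The length limit matters in practice: a model that never emits the end marker would otherwise loop forever.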

In [15]:
# Define the sampling (inference) models
# Encoder: maps an input sequence to its final LSTM states
encoder_model = Model(encoder_inputs, encoder_states)

# Decoder: takes the previous states as explicit inputs
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

# Reuse the trained LSTM and Dense layers, returning the new states as outputs
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

In [16]:
reverse_input_char_index = dict([(i, char) for i, char in enumerate(input_characters)])
reverse_target_char_index = dict([(i, char) for i, char in enumerate(target_characters)])

In [17]:
def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)
    
    target_seq = np.zeros((1,1,num_decoder_tokens))
    target_seq[0, 0, target_token_index['\t']] = 1
    
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char
        
        if (sampled_char == '\n' or len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True
        
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1
        
        states_value = [h, c]
    
    return decoded_sentence

In [18]:
for seq_index in range(100):
    # Slice seq_index:seq_index+1 to keep the batch dimension: shape (1, 16, 71), not (16, 71)
    input_seq = encoder_input_data[seq_index: seq_index+1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)

-
Input sentence: Go.
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR8A(R8aA(R8 a1AR(8 a($ÀÀ''
-
Input sentence: Run!
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR8A(R8aA(R8 a1AR(8 a($ÀÀ''
-
Input sentence: Run!
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR8A(R8aA(R8 a1AR(8 a($ÀÀ''
-
Input sentence: Fire!
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR8A(R8aA(R8 a1AR(8 a($ÀÀ''
-
Input sentence: Help!
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR8A(R8aA(R8 a1AR(8 a($ÀÀ''
-
Input sentence: Jump.
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR8A(R8aA(R8 a1AR(8 a($ÀÀ''
-
Input sentence: Stop!
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR8A(R8aA(R8 a1AR(8 a($ÀÀ''
-
Input sentence: Stop!
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR8A(R8aA(R8 a1AR(8 a($ÀÀ''
-
Input sentence: Stop!
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR8A(R8aA(R8 a1AR(8 a($ÀÀ''
-
Input sentence: Wait!
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR

-
Input sentence: Be nice.
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR8A(R8aA(R8 a1AR(8 a($ÀÀ''
-
Input sentence: Be nice.
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR8A(R8aA(R8 a1AR(8 a($ÀÀ''
-
Input sentence: Be nice.
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR8A(R8aA(R8 a1AR(8 a($ÀÀ''
-
Input sentence: Be nice.
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR8A(R8aA(R8 a1AR(8 a($ÀÀ''
-
Input sentence: Beat it.
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR8A(R8aA(R8 a1AR(8 a($ÀÀ''
-
Input sentence: Call me.
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR8A(R8aA(R8 a1AR(8 a($ÀÀ''
-
Input sentence: Call me.
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR8A(R8aA(R8 a1AR(8 a($ÀÀ''
-
Input sentence: Call us.
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR8A(R8aA(R8 a1AR(8 a($ÀÀ''
-
Input sentence: Call us.
Decoded sentence: dddddzuddddzÉHHHHHHiHHiHHiHHiB61AAR8A(R8aA(R8 a1AR(8 a($ÀÀ''
-
Input sentence: Come in.
Decoded sentence: d