### Character-level Sequence-to-Sequence for Machine Translation - [Keras tutorial](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html)

We implement a basic character-level sequence-to-sequence model that translates short English sentences into short French sentences, character by character. Note that character-level translation is fairly unusual; word-level models are more common for machine translation.

In [1]:
# Use a dataset of pairs of English sentences and their French translation

!wget http://www.manythings.org/anki/fra-eng.zip
!unzip fra-eng.zip -d fra-eng
!more fra-eng/fra.txt

--2019-07-19 04:02:11--  http://www.manythings.org/anki/fra-eng.zip
Resolving www.manythings.org (www.manythings.org)... 104.24.109.196, 104.24.108.196, 2606:4700:30::6818:6dc4, ...
Connecting to www.manythings.org (www.manythings.org)|104.24.109.196|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3467257 (3.3M) [application/zip]
Saving to: ‘fra-eng.zip’


2019-07-19 04:02:13 (2.15 MB/s) - ‘fra-eng.zip’ saved [3467257/3467257]

Archive:  fra-eng.zip
  inflating: fra-eng/_about.txt      
  inflating: fra-eng/fra.txt         
Go.	Va !
Hi.	Salut !
Hi.	Salut.
Run!	Cours !
Run!	Courez !
Who?	Qui ?
Wow!	Ça alors !
Fire!	Au feu !
Help!	À l'aide !
Jump.	Saute.
Stop!	Ça suffit !
Stop!	Stop !
Stop!	Arrête-toi !
Wait!	Attends !
Wait!	Attendez !
Go on.	Poursuis.
Go on.	Continuez.
Go on.	Poursuivez.
Hello!	Bonjour !
Hello!	Salut !
I see.	Je comprends.
I try.	J'essaye.
I won!	J'ai gagné !

A few samples from the dataset

Stop!	Arrête-toi !

Wait!	Attends !

Wait!	Attendez !

Go on.	Poursuis.

Go on.	Continuez.

Go on.	Poursuivez.

In [0]:
from __future__ import print_function

from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np

batch_size = 64  # Batch size for training.
epochs = 100  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 10000  # Number of samples to train on.
# Path to the data txt file on disk.
data_path = 'fra-eng/fra.txt'

**Training the model**

The model has two parts:


1.   An RNN layer acts as the "encoder": it processes the input sequence and returns its own internal state.
2.   Another RNN layer acts as the "decoder": it is trained to predict the next characters of the target sequence, given the previous characters of the target sequence. Specifically, it is trained to turn the target sequences into the same sequences but offset by one timestep (a training process called teacher forcing). Importantly, the decoder uses as its initial state the state vectors from the encoder.

![alt text](https://blog.keras.io/img/seq2seq/seq2seq-teacher-forcing.png)

To prepare the training data, we read sentences from the data file and convert them into three NumPy arrays, encoder_input_data, decoder_input_data and decoder_target_data:

1.   encoder_input_data is a 3D array of shape (num_pairs, max_english_sentence_length, num_english_characters) containing a one-hot vectorization of the English sentences.
2.   decoder_input_data is a 3D array of shape (num_pairs, max_french_sentence_length, num_french_characters) containing a one-hot vectorization of the French sentences.
3.   decoder_target_data is the same as decoder_input_data but offset by one timestep. decoder_target_data[:, t, :] will be the same as decoder_input_data[:, t + 1, :].
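
The offset between the decoder input and the decoder target can be checked on a single toy sentence. The sketch below is only an illustration: the vocabulary, the `one_hot` helper and the sample sentence are assumptions made for this example and are not part of the notebook's own code.

```python
import numpy as np

# Toy target sentence: '\t' marks the start and '\n' marks the end (as in this notebook)
target_text = "\tVa !\n"

# Hypothetical toy vocabulary built from this single sentence
chars = sorted(set(target_text))
char_to_idx = {c: i for i, c in enumerate(chars)}

def one_hot(text, num_steps):
    """Return a (num_steps, vocab_size) one-hot matrix; unused timesteps stay all-zero."""
    data = np.zeros((num_steps, len(chars)), dtype="float32")
    for t, c in enumerate(text):
        data[t, char_to_idx[c]] = 1.0
    return data

num_steps = len(target_text)
decoder_input = one_hot(target_text, num_steps)       # starts with '\t'
decoder_target = one_hot(target_text[1:], num_steps)  # same text, one step ahead

# The target at step t equals the input at step t + 1
offset_holds = all(
    np.array_equal(decoder_target[t], decoder_input[t + 1])
    for t in range(num_steps - 1)
)
print(offset_holds)
```

The last timestep of `decoder_target` stays all-zero, because there is no character left to predict after the end-of-sequence character.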


In [3]:
#----------------------------------------------
# Convert a string to an array of characters
#----------------------------------------------
def word2char (line):
   char = [c for c in line]
   return char 

#----------------------------------------------
# Given an array of sentences, find the length of the longest sentence
#----------------------------------------------
def max_sentence_length (char_sentences):
  max_sentence_length = 0
  for char_sentence in char_sentences:
    max_sentence_length = max (max_sentence_length, len (char_sentence))
  return (max_sentence_length)    

#----------------------------------------------
# Read num_samples lines from the data file, and return three lists of 
# sentences. Each sentence is an array of characters.
#
# eng_char_sentences = [['G', 'o'], ['H', 'i', '.'], ...]
# fr_char_sentences_shift = [['V', 'a', '!'], ['S', 'a', 'l', 'u', 't', '!'], ...]
# fr_char_sentences = [['\t', 'V', 'a', '!'], ['\t', 'S', 'a', 'l', 'u', 't', '!'], ...]
#
# The 'eng_char_sentences' are the original English sentences and will be input to the Encoder
# The 'fr_char_sentences_shift' are the original French sentences and will be the target output of the Decoder.
# Each one keeps the trailing newline ('\n') from the file, which serves as the end-of-sequence character
# The 'fr_char_sentences' are the original French sentences with a Tab('\t') start token and will be the input to the Decoder
#----------------------------------------------
def read_sentences (data_path, num_samples):
  eng_char_sentences = []
  fr_char_sentences = []
  fr_char_sentences_shift = []
  
  with open(data_path, "r", encoding="utf-8") as fp:
    # Read the file line by line
    line = fp.readline()
  
    # We will read up to num_samples lines
    samples = 0
    while line and (samples < num_samples):
      # Split the line as <eng_sentence>TAB<fr_sentence>
      eng_sentence,fr_sentence = line.split('\t')
      fr_sentence_shift = fr_sentence
      # We use "tab" as the "start sequence" character
      fr_sentence = '\t' + fr_sentence
    
      # Convert the sentence from a string to an array of characters
      eng_char_sentence = word2char (eng_sentence)
      fr_char_sentence = word2char (fr_sentence)
      fr_char_sentence_shift = word2char (fr_sentence_shift)
    
      # Add to the array of sentences, each sentence in turn being an array of characters
      eng_char_sentences.append (eng_char_sentence)
      fr_char_sentences.append (fr_char_sentence)
      fr_char_sentences_shift.append (fr_char_sentence_shift)
      
      # Increment samples read so far, and then read the next line
      samples = samples + 1
      line = fp.readline()
  
  return (eng_char_sentences, fr_char_sentences, fr_char_sentences_shift)

# Read sentences from the data file
eng_char_sentences, fr_char_sentences, fr_char_sentences_shift = read_sentences (data_path, num_samples)

# Get the length of the longest sentence
max_eng_sentence_length = max_sentence_length (eng_char_sentences)
max_fr_sentence_length = max_sentence_length (fr_char_sentences)

print (len (eng_char_sentences), len (fr_char_sentences))
print (eng_char_sentences[10], fr_char_sentences[10], fr_char_sentences_shift[10])
print (eng_char_sentences[75], fr_char_sentences[75], fr_char_sentences_shift[75])
print (max_eng_sentence_length, max_fr_sentence_length)

10000 10000
['S', 't', 'o', 'p', '!'] ['\t', 'Ç', 'a', ' ', 's', 'u', 'f', 'f', 'i', 't', '\u202f', '!', '\n'] ['Ç', 'a', ' ', 's', 'u', 'f', 'f', 'i', 't', '\u202f', '!', '\n']
['A', 'w', 'e', 's', 'o', 'm', 'e', '!'] ['\t', 'F', 'a', 'n', 't', 'a', 's', 't', 'i', 'q', 'u', 'e', '\u202f', '!', '\n'] ['F', 'a', 'n', 't', 'a', 's', 't', 'i', 'q', 'u', 'e', '\u202f', '!', '\n']
16 59


In [4]:
#----------------------------------------------
# Create the vocab as a set of all the characters in the corpus of sentences
#----------------------------------------------
def vocab (char_sentences):
  # Single flat list of all the characters from all sentences
  merged_chars = [char for char_sentence in char_sentences for char in char_sentence]
  
  # Create a set from the list, so that it contains only unique characters
  vocab_chars = sorted (set (merged_chars))
  vocab_size = len (vocab_chars)
  return (vocab_chars, vocab_size)

# Get the set of characters in the vocab
eng_vocab_chars, eng_vocab_size = vocab (eng_char_sentences)
fr_vocab_chars, fr_vocab_size = vocab (fr_char_sentences)

eng_vocab_size, fr_vocab_size

(70, 93)

In [5]:
#----------------------------------------------
# Dictionary to map from char to char_index in the vocab
#----------------------------------------------
eng_vocab_dict = {c:i for i, c in enumerate (eng_vocab_chars)}
fr_vocab_dict = {c:i for i, c in enumerate (fr_vocab_chars)}

#----------------------------------------------
# Char -> index utility function
#----------------------------------------------
def char2idx (vocab_dict, c):
  return (vocab_dict [c])

#----------------------------------------------
# Index -> char utility function
#----------------------------------------------
def idx2char (vocab_chars, i):
  return (vocab_chars[i])
  
print (char2idx (eng_vocab_dict, 'G'))
print (idx2char (eng_vocab_chars, 7))

26
-
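
The two lookup helpers are enough to move between text and one-hot arrays in both directions. The following is a minimal sketch of that round trip, assuming a small made-up vocabulary; `encode` and `decode` are hypothetical helper names, not functions defined in this notebook.

```python
import numpy as np

# Hypothetical toy vocabulary; the notebook builds its own from the dataset
vocab_chars = sorted(set("Go. Hi. Run!"))
vocab_dict = {c: i for i, c in enumerate(vocab_chars)}

def encode(text):
    """One-hot encode a string into a (len(text), vocab_size) array."""
    data = np.zeros((len(text), len(vocab_chars)), dtype="float32")
    for t, c in enumerate(text):
        data[t, vocab_dict[c]] = 1.0
    return data

def decode(data):
    """Recover the string by taking the argmax index at every timestep."""
    return "".join(vocab_chars[i] for i in data.argmax(axis=-1))

encoded = encode("Hi.")
print(encoded.shape)
print(decode(encoded))
```

Note that `argmax` on an all-zero padding row returns index 0, so a padded array should be trimmed to the true sentence length before decoding it this way.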


In [0]:
#----------------------------------------------
# The training data consists of three arrays, each of which is a 3D array of 
# shape (num of samples, num of timesteps, num of input values per timestep)
#
# Each sample is one sentence
#
# The input at each timestep will be a single character from a sentence. Since we will have a fixed 
# number of timesteps, it will be taken as the maximum number of characters of any sentence
#
# Since each character will be one-hot encoded from the vocab, the number of input values per timestep
# will be the size of the vocab
#----------------------------------------------

# Create the training data of the right shape, and initialise all values to 0
encoder_input_data = np.zeros((num_samples, max_eng_sentence_length, eng_vocab_size), dtype='float32')
decoder_input_data = np.zeros((num_samples, max_fr_sentence_length, fr_vocab_size), dtype='float32')
decoder_target_data = np.zeros((num_samples, max_fr_sentence_length, fr_vocab_size), dtype='float32')

#----------------------------------------------
# Populate the training data 3D array
# We go through each sample (ie. first dimension) and each character in that sample (ie. second dimension)
# And for that character we convert it to its index, and then one-hot encode it (third dimension)
#----------------------------------------------
def char_indices (data, char_sentences, vocab_dict):
  # Loop through each sentence (ie. each sample)
  for i, char_sentence in enumerate (char_sentences):
    
    # Loop through each character in each sentence
    for j, char in enumerate (char_sentence):
      
      # Get the character index for the character
      idx = char2idx (vocab_dict, char)
      
      # To one-hot encode the character, set the array element for that index to 1
      data [i, j, idx] = 1.0

# Prepare all three arrays for training
char_indices (encoder_input_data, eng_char_sentences, eng_vocab_dict)
char_indices (decoder_input_data, fr_char_sentences, fr_vocab_dict)
char_indices (decoder_target_data, fr_char_sentences_shift, fr_vocab_dict)

In [0]:
#----------------------------------------------
# Now that the training data has been prepared, we will build an LSTM Encoder-Decoder model and
# train it to predict decoder_target_data given encoder_input_data and decoder_input_data.
#
# The training process and inference process (decoding sentences) are quite different, so we will 
# use a different model for each. However, both models will share the same inner layers
#
# The Encoder is a single LSTM layer that reads the one-hot encoded input directly
# The Decoder is another LSTM layer followed by a Dense layer for predictions
# (there are no Embedding layers, because the characters are already one-hot encoded)
# We build the model using the Keras Functional API
#----------------------------------------------

num_encoder_tokens = eng_vocab_size
num_decoder_tokens = fr_vocab_size

#----------------------------------------------
# Build the Encoder - it is a LSTM layer that takes the encoder_input_data as input
# We discard its output and keep the states, so that they can be passed to the Decoder
encoder_inputs = Input(shape=(None, num_encoder_tokens))
# return_state=True tells the RNN layer to return a list where the first entry is the 
# outputs and the next entries are the internal RNN states
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]
#----------------------------------------------

#----------------------------------------------
# Build the Decoder - it is a LSTM layer that takes the decoder_input_data as input
# and uses the `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# return_sequences=True tells the RNN layer to return its full sequence of outputs (instead 
# of just the last output, which is the default behavior). We also return the internal
# states. We don't use those states in training, but we will use them during inference
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
# Pass the decoder outputs through a Dense layer with Softmax to get the predictions
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
#----------------------------------------------

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)

**Doing Inference**

To decode a test sentence, we:

1.  Encode the input sequence into state vectors.
2.  Start with a target sequence of size 1 (just the start-of-sequence character).
3.  Feed the state vectors and 1-char target sequence to the decoder and run one step of the decoder to produce predictions for the next character.
4.  Sample the next character using these predictions (we simply use argmax).
5.  Append the sampled character to the target sequence.
6.  Repeat steps 3 to 5 until we generate the end-of-sequence character or we hit the character limit.

![alt text](https://blog.keras.io/img/seq2seq/seq2seq-inference.png)
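
The control flow of these steps can be sketched without a trained model. In the example below, `toy_decoder_step` is a hypothetical stand-in for one decoder step (it always spells "abc" and then the end character), and `greedy_decode` and the five-character vocabulary are likewise assumptions made only to show the loop.

```python
import numpy as np

START, END = "\t", "\n"
chars = [START, END, "a", "b", "c"]
char_to_idx = {c: i for i, c in enumerate(chars)}

def toy_decoder_step(prev_char, state):
    """Hypothetical stand-in for one decoder step.

    Returns a probability vector over `chars` and an updated state.
    Here the state is just a counter and the "model" always spells "abc".
    """
    script = "abc" + END
    probs = np.full(len(chars), 0.01)
    probs[char_to_idx[script[state]]] = 1.0
    return probs / probs.sum(), state + 1

def greedy_decode(initial_state, max_len=10):
    """Pick the most likely character at each step until END or max_len."""
    decoded = ""
    prev_char, state = START, initial_state
    while True:
        probs, state = toy_decoder_step(prev_char, state)
        prev_char = chars[int(np.argmax(probs))]
        decoded += prev_char
        if prev_char == END or len(decoded) >= max_len:
            break
    return decoded

result = greedy_decode(initial_state=0)
print(repr(result))
```

Taking the argmax at every step is greedy decoding; it is the simplest sampling choice, and it can miss a better overall sentence that starts with a less likely character.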

In [0]:
#----------------------------------------------
# We will build the Inference model using the Encoder-Decoder layers we created
# earlier. 
#----------------------------------------------

# This is simply to change some variable names
max_encoder_seq_length = max_eng_sentence_length
max_decoder_seq_length = max_fr_sentence_length
input_token_index = eng_vocab_dict
target_token_index = fr_vocab_dict
input_texts = eng_char_sentences

# ------------------------------------------
# Build the Encoder model, which returns the encoder states
encoder_model = Model(encoder_inputs, encoder_states)
# ------------------------------------------

# ------------------------------------------
# The Decoder will work in a loop. For its first iteration, its input sequence contains only
# the start token. And its initial state is the Encoder state. It then generates the
# next character output along with its own internal state.
# 
# Now for the next iteration, it appends the next character to the input sequence
# and uses that as input. It also takes its own internal state and feeds that back
# to itself as the initial state for the next iteration
#
# Hence, when building the model, we define the Decoder Initial State as an input
# variable (rather than using the Encoder state directly since we will take the Encoder state 
# only for the first time)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
# Use the Decoder LSTM layer created during training, but with different inputs
# and initial states
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
# Use the Dense layer created during training to generate predictions from the
# Decoder's outputs
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)
# ------------------------------------------

# Try to understand shapes
print (encoder_inputs, '\n', encoder_outputs, '\n', state_h, '\n', state_c)
print (decoder_inputs, '\n', decoder_outputs, '\n', decoder_states_inputs)

# Reverse-lookup token index to decode sequences back to characters
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())

# ------------------------------------------
# Implements the Inference process. 
# ------------------------------------------
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)
        
        #print ('Target seq = ', target_seq.shape)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # KD - the original logic was to always use a target sequence
        # of length 1, which we update with the last sampled token. We
        # also feed the states_value back as the initial_state for the
        # next iteration
        #
        # The alternate logic which I tried was to keep extending the target
        # sequence in each iteration, so that the last sampled token is appended
        # to the previous target sequence. If we do that we don't change the
        # initial_state value at all, and continue to use the earlier value
        #
        # Both alternates are giving the same results. But I should experiment
        # some more, try out other unseen input sequences and see how both
        # alternates perform
        kdAlternate = True
        if (kdAlternate):
          b = np.zeros((1, target_seq.shape[1] + 1, num_decoder_tokens))
          b[:,:-1] = target_seq
          b[0, -1, sampled_token_index] = 1.
          target_seq = b
        else:
          # Update the target sequence (of length 1).
          target_seq = np.zeros((1, 1, num_decoder_tokens))
          target_seq[0, 0, sampled_token_index] = 1.

          # Update states
          states_value = [h, c]

    return decoded_sentence


for seq_index in range(20):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)
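
To try a sentence that is not in the training set, it first has to be one-hot encoded in the same layout as encoder_input_data. The helper below is a sketch of one way to do that; `encode_sentence` is a hypothetical name, and skipping unknown characters and truncating long sentences are assumptions of this example rather than behavior defined by the notebook.

```python
import numpy as np

def encode_sentence(sentence, vocab_dict, max_length):
    """One-hot encode a single sentence into shape (1, max_length, vocab_size).

    Characters missing from vocab_dict are skipped (their timestep stays
    all-zero) and the sentence is truncated to max_length. Both choices are
    assumptions of this sketch.
    """
    data = np.zeros((1, max_length, len(vocab_dict)), dtype="float32")
    for t, c in enumerate(sentence[:max_length]):
        if c in vocab_dict:
            data[0, t, vocab_dict[c]] = 1.0
    return data

# Hypothetical toy vocabulary; in the notebook you would pass eng_vocab_dict
# and max_eng_sentence_length instead
toy_vocab = {c: i for i, c in enumerate(sorted(set("Be nice.")))}
input_seq = encode_sentence("Be nice.", toy_vocab, max_length=16)
print(input_seq.shape)
```

With the notebook's own objects, the call would look like `decode_sequence(encode_sentence('Be nice.', eng_vocab_dict, max_eng_sentence_length))`.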