# NMT-Keras tutorial

In this module, we are going to create an encoder-decoder model with:
* A bidirectional GRU encoder and a GRU decoder
* An attention model
* The previously generated word fed back into the decoder
* MLPs for initializing the RNN state
* Skip connections from inputs to outputs
* Beam search

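As usual, we first import everything we need. Note that AttGRUCond and MaskedMean are not part of stock Keras; they are provided by the customized Keras version that NMT-Keras builds on.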

In [28]:
from keras.engine import Input
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import GRU, AttGRUCond
from keras.layers import TimeDistributed, Bidirectional
from keras.layers.core import Dense, Activation, Lambda, MaxoutDense, MaskedMean
from keras import backend as K
from keras.engine.topology import merge

Next, we define the dimensions of our model: for instance, a word embedding size of 50 and 100 units in the RNNs. The inputs/outputs are defined as in previous tutorials.

In [29]:
ids_inputs = ['source_text', 'state_below']
ids_outputs = ['target_text']
word_embedding_size = 50
hidden_state_size = 100
input_vocabulary_size = 686  # Autoset in the library
output_vocabulary_size = 513  # Autoset in the library

Now, let's define our encoder. First, we create an Input layer to connect the input text to our model. Next, we apply a word embedding to the sequence of input indices. This word embedding feeds a bidirectional GRU network, which produces our sequence of annotations:

In [30]:
# 1. Source text input
# Input sequences have variable length, so we do not restrict the Input shape
src_text = Input(name=ids_inputs[0],
                 batch_shape=tuple([None, None]),
                 dtype='int32')
# 2. Encoder
# 2.1. Source word embedding
src_embedding = Embedding(input_vocabulary_size, word_embedding_size,
                          name='source_word_embedding',
                          mask_zero=True  # Index 0 is reserved for padding and is masked out
                          )(src_text)
# 2.2. BRNN encoder (GRU/LSTM)
annotations = Bidirectional(GRU(hidden_state_size, 
                                return_sequences=True  # Return the full sequence
                                ),
                            name='bidirectional_encoder',
                            merge_mode='concat')(src_embedding)
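
To make the shape of the annotations concrete, here is a minimal NumPy sketch of what a bidirectional encoder with merge_mode='concat' produces. It is only an illustration: a plain tanh RNN stands in for the GRU, the weights are random, and the names simple_rnn, T and the weight matrices are hypothetical, not part of the library.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_size, hidden = 50, 100   # sizes used in this tutorial
T = 7                        # hypothetical source sentence length


def simple_rnn(x, W, U):
    """Run a tanh RNN over x (T, emb_size) and return every hidden state (T, hidden)."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(x_t @ W + h @ U)
        states.append(h)
    return np.stack(states)


x = rng.normal(size=(T, emb_size))  # one embedded source sentence
W_f = rng.normal(size=(emb_size, hidden)) * 0.1
U_f = rng.normal(size=(hidden, hidden)) * 0.1
W_b = rng.normal(size=(emb_size, hidden)) * 0.1
U_b = rng.normal(size=(hidden, hidden)) * 0.1

forward = simple_rnn(x, W_f, U_f)
# The backward RNN reads the sentence reversed; flip its output back to source order
backward = simple_rnn(x[::-1], W_b, U_b)[::-1]

# 'concat' merge: each source position gets a 2 * hidden annotation vector
annotations = np.concatenate([forward, backward], axis=-1)
print(annotations.shape)  # (7, 200)
```

Each annotation therefore summarizes the words both before and after its position, which is what the attention model later selects from.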

Once we have built the encoder, let's build our decoder. First, we have an additional input: the previously generated word (the so-called state_below). During training this is the reference target word at the previous position (teacher forcing). We introduce it by means of an Input layer and a (target language) word embedding:

In [31]:
# 3. Decoder
# 3.1.1. Previously generated words as inputs for training -> Teacher forcing
next_words = Input(name=ids_inputs[1], batch_shape=tuple([None, None]), dtype='int32')
# 3.1.2. Target word embedding
state_below = Embedding(output_vocabulary_size, word_embedding_size,
                        name='target_word_embedding', 
                        mask_zero=True)(next_words)
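
As an illustration of teacher forcing, the sketch below builds a state_below sequence by shifting the target sentence one position to the right. The helper make_state_below and the start-symbol id BOS are assumptions made for this example; the library prepares this input itself and may use a different convention.

```python
BOS = 1  # hypothetical start-of-sequence id (0 is kept for padding because of mask_zero=True)


def make_state_below(target_ids, bos_id=BOS):
    """Shift the target right by one so that position t holds the word at t - 1."""
    return [bos_id] + target_ids[:-1]


target_text = [5, 9, 4, 2]          # hypothetical target word indices
state_below_ids = make_state_below(target_text)
print(state_below_ids)              # [1, 5, 9, 4]
```

At timestep t the decoder thus sees the correct previous word and is trained to predict target_text[t].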

The initial hidden state of the decoder's GRU is computed by an MLP (in this case, single-layered) from the average of the annotations:

In [32]:
ctx_mean = MaskedMean()(annotations)
initial_state = Dense(hidden_state_size, name='initial_state',
                      activation='tanh')(ctx_mean)
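
The following NumPy sketch shows the computation that the two layers above are expected to perform: an average over time that ignores padded positions, followed by a dense layer with tanh. The function masked_mean and the small sizes are hypothetical and chosen for readability; the actual MaskedMean layer may differ in its details.

```python
import numpy as np


def masked_mean(annotations, mask):
    """Average annotations (B, T, D) over time, counting only steps where mask (B, T) is 1."""
    mask = mask[..., None].astype(annotations.dtype)
    total = (annotations * mask).sum(axis=1)
    count = np.maximum(mask.sum(axis=1), 1.0)  # avoid dividing by zero on empty rows
    return total / count


rng = np.random.default_rng(0)
annotations = rng.normal(size=(2, 4, 6))       # batch of 2, 4 timesteps, 6 features
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]])                # second sentence has 2 padded steps

ctx_mean = masked_mean(annotations, mask)

# Single-layer MLP with tanh, mapping the mean annotation to the decoder state size
W, b = rng.normal(size=(6, 3)), np.zeros(3)
initial_state = np.tanh(ctx_mean @ W + b)
print(initial_state.shape)  # (2, 3)
```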

We now have all the inputs of our decoder:

In [33]:
input_attentional_decoder = [state_below, annotations, initial_state]

Note that, for a given sample, the sequence of annotations and the initial state are the same at every decoding time-step. To avoid recomputing them, we build two models: one for training and another one for sampling. They share weights, but the sampling model is itself made up of two models. The first one (model_init) computes the sequence of annotations and the initial state (initial_state). The second one (model_next) computes a single recurrent step, given the sequence of annotations, the previous hidden state and the words generated so far.

Therefore, we now declare layers in a slightly different way: each layer is first defined as a shared object and then applied, so that the decoding models can reuse it.
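
The split between model_init and model_next can be sketched with plain NumPy functions that share the same weights. This is a toy stand-in, not the Keras models: the weights are random, a simple mean replaces attention, greedy decoding replaces beam search, and the ids EOS and bos are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 6, 4
EOS = 0  # hypothetical end-of-sequence id

# Weights shared by both functions, as the shared layers are in the real model
W_emb = rng.normal(size=(vocab, hidden))
W_out = rng.normal(size=(hidden, vocab))


def model_init(source_ids):
    """Run once per sentence: compute the annotations and the initial decoder state."""
    annotations = W_emb[source_ids]
    initial_state = np.tanh(annotations.mean(axis=0))
    return annotations, initial_state


def model_next(prev_word, prev_state, annotations):
    """Run once per timestep: compute the next-word distribution and the new state."""
    context = annotations.mean(axis=0)  # stand-in for the attention mechanism
    state = np.tanh(W_emb[prev_word] + prev_state + context)
    logits = state @ W_out
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), state


def greedy_decode(source_ids, bos=1, max_len=10):
    annotations, state = model_init(source_ids)  # computed a single time
    word, output = bos, []
    for _ in range(max_len):
        probs, state = model_next(word, state, annotations)
        word = int(probs.argmax())
        if word == EOS:
            break
        output.append(word)
    return output


print(greedy_decode([2, 3, 4]))
```

The encoder work happens once in model_init, while the loop only pays for one recurrent step per generated word.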

So, let's start by building the attentional-conditional GRU:

In [34]:
# Define the AttGRUCond function
sharedAttGRUCond = AttGRUCond(hidden_state_size,
                              return_sequences=True,
                              return_extra_variables=True,  # Return attended input and attention weights
                              return_states=True  # Returns the sequence of hidden states (see discussion above)
                              )
[proj_h, x_att, alphas, h_state] = sharedAttGRUCond(input_attentional_decoder)  # Apply sharedAttGRUCond to our input
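
To give an intuition for the attended input (x_att) and the attention weights (alphas), here is a NumPy sketch of a single attention step. It assumes additive (Bahdanau-style) attention; the internals of AttGRUCond may differ, and the function attention_step and the parameters W_a, U_a and v_a are hypothetical names for this example.

```python
import numpy as np


def attention_step(h_prev, annotations, mask, W_a, U_a, v_a):
    """Additive attention for one decoder step.

    h_prev: (H,) previous decoder state; annotations: (T, D); mask: (T,) with 1 for real words.
    Returns the context vector (D,) and the attention weights (T,).
    """
    scores = np.tanh(annotations @ W_a + h_prev @ U_a) @ v_a  # one score per source position
    scores = np.where(mask > 0, scores, -np.inf)              # padded positions get zero weight
    weights = np.exp(scores - scores.max())
    alphas = weights / weights.sum()
    context = alphas @ annotations                            # weighted sum of annotations
    return context, alphas


rng = np.random.default_rng(0)
T, D, H, A = 5, 8, 4, 6
annotations = rng.normal(size=(T, D))
mask = np.array([1, 1, 1, 0, 0])  # last two positions are padding
h_prev = rng.normal(size=H)
W_a, U_a, v_a = rng.normal(size=(D, A)), rng.normal(size=(H, A)), rng.normal(size=A)

context, alphas = attention_step(h_prev, annotations, mask, W_a, U_a, v_a)
print(alphas.round(3), context.shape)
```

Because the weights depend on the previous decoder state, the context vector changes at every timestep even though the annotations stay fixed.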

Now, we set skip connections between the input and output layers. Note that, since the RNN decoder adds a temporal dimension, we must apply the layers in a TimeDistributed way. Finally, we merge all skip connections and apply a 'tanh' non-linearity:

In [35]:
# Define layer function
shared_FC_mlp = TimeDistributed(Dense(word_embedding_size, activation='linear'),
                                name='logit_lstm')
# Apply layer function
out_layer_mlp = shared_FC_mlp(proj_h)

# Define layer function
shared_FC_ctx = TimeDistributed(Dense(word_embedding_size, activation='linear'),
                                name='logit_ctx')
# Apply layer function
out_layer_ctx = shared_FC_ctx(x_att)
shared_Lambda_Permute = Lambda(lambda x: K.permute_dimensions(x, [1, 0, 2]))
out_layer_ctx = shared_Lambda_Permute(out_layer_ctx)

# Define layer function
shared_FC_emb = TimeDistributed(Dense(word_embedding_size, activation='linear'),
                                name='logit_emb')
# Apply layer function
out_layer_emb = shared_FC_emb(state_below)

additional_output = merge([out_layer_mlp, out_layer_ctx, out_layer_emb], mode='sum', name='additional_input')
shared_activation_tanh = Activation('tanh')
out_layer = shared_activation_tanh(additional_output)
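
The arithmetic behind this cell can be sketched in NumPy as three per-timestep linear projections that are summed and passed through tanh. The helper time_distributed_dense, the random inputs and the assumption that all three tensors are already laid out as (batch, time, features) are ours, made for illustration only.

```python
import numpy as np


def time_distributed_dense(x, W, b):
    """Apply the same affine map at every timestep: (B, T, D_in) -> (B, T, D_out)."""
    return x @ W + b


rng = np.random.default_rng(0)
B, T = 2, 5                       # hypothetical batch size and target length
hidden, ctx, emb = 100, 200, 50   # decoder state, annotation and embedding sizes

proj_h = rng.normal(size=(B, T, hidden))      # decoder hidden states
x_att = rng.normal(size=(B, T, ctx))          # attended source context
state_below = rng.normal(size=(B, T, emb))    # previous word embeddings

# One projection per branch, all mapping to the same size so they can be summed
branches = []
for x in (proj_h, x_att, state_below):
    W = rng.normal(size=(x.shape[-1], emb)) * 0.1
    branches.append(time_distributed_dense(x, W, np.zeros(emb)))

out_layer = np.tanh(branches[0] + branches[1] + branches[2])
print(out_layer.shape)  # (2, 5, 50)
```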

Now, we'll apply a deep output layer with a maxout activation:

In [36]:
shared_maxout = TimeDistributed(MaxoutDense(word_embedding_size), name='maxout_layer')
out_layer = shared_maxout(out_layer)
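
A maxout unit computes several linear projections of its input and keeps the element-wise maximum. The NumPy sketch below illustrates this; the function maxout and the number of pieces k are hypothetical choices for the example rather than the layer's actual configuration.

```python
import numpy as np


def maxout(x, W, b):
    """Maxout over k linear pieces.

    x: (..., D_in); W: (k, D_in, D_out); b: (k, D_out). Returns (..., D_out).
    """
    pieces = np.einsum('...i,kio->...ko', x, W) + b  # (..., k, D_out)
    return pieces.max(axis=-2)                       # keep the largest piece per output unit


rng = np.random.default_rng(0)
k, d_in, d_out = 4, 6, 3
x = rng.normal(size=(2, 5, d_in))          # (batch, time, features)
W = rng.normal(size=(k, d_in, d_out))
b = rng.normal(size=(k, d_out))

y = maxout(x, W, b)
print(y.shape)  # (2, 5, 3)
```

Since the function acts on the last axis only, it is applied independently at every timestep, which mirrors the effect of wrapping the layer in TimeDistributed.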

Finally, we apply a softmax function to obtain a probability distribution over the target vocabulary words at each timestep.

In [37]:
shared_FC_soft = TimeDistributed(Dense(output_vocabulary_size,
                                       activation='softmax',
                                       name='softmax_layer'),
                                 name=ids_outputs[0])
softout = shared_FC_soft(out_layer)
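
As a quick check of what this output looks like, the sketch below applies a numerically stable softmax over the vocabulary axis of a random (batch, time, vocabulary) tensor. The logits are made up; only the vocabulary size is taken from the tutorial.

```python
import numpy as np


def softmax(logits):
    """Softmax over the last axis, shifted by the maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
output_vocabulary_size = 513
logits = rng.normal(size=(2, 3, output_vocabulary_size))  # (batch, time, vocabulary)

probs = softmax(logits)
most_likely = probs.argmax(axis=-1)  # most probable word index at each timestep
print(probs.shape, most_likely.shape)  # (2, 3, 513) (2, 3)
```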

TODO: Include the beam search model in this tutorial
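
Until that section is written, the idea can be illustrated with a generic, library-independent beam search in pure Python. This is a minimal sketch and not the NMT-Keras implementation: step_fn plays the role of model_next, toy_step is a made-up distribution, the ids bos and eos are hypothetical, and no length normalization is applied.

```python
import math


def beam_search(step_fn, init_state, bos, eos, beam_size=3, max_len=10):
    """Keep the beam_size best partial hypotheses at each step.

    step_fn(prev_word, state) must return (list of word probabilities, new state).
    Returns the best word sequence (without bos) and its log-probability.
    """
    beams = [(0.0, [bos], init_state)]  # (log-probability, words, decoder state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, words, state in beams:
            probs, new_state = step_fn(words[-1], state)
            for word, p in enumerate(probs):
                if p > 0:
                    candidates.append((score + math.log(p), words + [word], new_state))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:
            # Hypotheses ending in eos are complete; the rest stay in the beam
            (finished if cand[1][-1] == eos else beams).append(cand)
        if not beams:
            break
    finished.extend(beams)  # hypotheses that hit max_len without producing eos
    best = max(finished, key=lambda c: c[0])
    return best[1][1:], best[0]


def toy_step(prev_word, state):
    """Made-up model: the state counts steps, and eos (id 0) becomes likely after two words."""
    if state >= 2:
        return [0.9, 0.05, 0.05], state + 1
    return [0.1, 0.3, 0.6], state + 1


words, log_prob = beam_search(toy_step, init_state=0, bos=1, eos=0, beam_size=2)
print(words)  # [2, 2, 0]
```

In the real system, step_fn would wrap model_next and carry the annotations and hidden state produced by model_init.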