<a href="https://colab.research.google.com/github/lvapeab/nmt-keras/blob/master/examples/modeling_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model definition in NMT-Keras


In this module, we are going to create an encoder-decoder model with:
* A bidirectional GRU encoder and a GRU decoder
* An attention model
* The previously generated word fed back to the decoder
* MLPs for computing the initial RNN state
* Skip connections from inputs to outputs
* Beam search

We first set up the machine.

In [1]:
!pip install --upgrade pip
!pip uninstall -y kapre keras tensorflow-probability albumentations datascience h5py keras-nightly # Avoid crashes with pre-installed packages 
!git clone https://github.com/lvapeab/nmt-keras
import os
os.chdir('nmt-keras')
!pip install -e .


Found existing installation: kapre 0.3.5
Uninstalling kapre-0.3.5:
  Successfully uninstalled kapre-0.3.5
Found existing installation: Keras 2.4.3
Uninstalling Keras-2.4.3:
  Successfully uninstalled Keras-2.4.3
Found existing installation: tensorflow-probability 0.13.0
Uninstalling tensorflow-probability-0.13.0:
  Successfully uninstalled tensorflow-probability-0.13.0
Found existing installation: albumentations 0.1.12
Uninstalling albumentations-0.1.12:
  Successfully uninstalled albumentations-0.1.12
Found existing installation: datascience 0.10.6
Uninstalling datascience-0.10.6:
  Successfully uninstalled datascience-0.10.6
Found existing installation: h5py 3.1.0
Uninstalling h5py-3.1.0:
  Successfully uninstalled h5py-3.1.0
Found existing

Let's import the necessary modules:


In [2]:
from keras.layers import *
from keras.models import model_from_json, Model
from keras.optimizers import Adam, RMSprop, Nadam, Adadelta, SGD, Adagrad, Adamax
from keras.regularizers import l2
from keras_wrapper.cnn_model import Model_Wrapper
from keras_wrapper.extra.regularize import Regularize


Using TensorFlow backend.
[30/07/2021 09:28:20] NumExpr defaulting to 2 threads.
[30/07/2021 09:28:20] <<< Cupy not available. Using numpy. >>>


And let's define the dimensions of our model. For instance, a word embedding size of 50 and 100 units in the RNNs. The inputs/outputs are defined as in [the NMT-Keras tutorial](https://colab.research.google.com/github/lvapeab/nmt-keras/blob/master/examples/tutorial.ipynb/).

In [3]:
ids_inputs = ['source_text', 'state_below']
ids_outputs = ['target_text']
word_embedding_size = 50
hidden_state_size = 100
input_vocabulary_size = 686  # Autoset in the library
output_vocabulary_size = 513  # Autoset in the library


## Encoder

Let's define our encoder. First, we have to create an Input layer to connect the input text to our model.  Next, we'll apply a word embedding to the sequence of input indices. This word embedding will feed a Bidirectional GRU network, which will produce our sequence of annotations:


In [4]:

# 1. Source text input
src_text = Input(name=ids_inputs[0],
                 batch_shape=tuple([None, None]),  # Since the input sequences have variable length, we do not restrict the Input shape
                 dtype='int32')
# 2. Encoder
# 2.1. Source word embedding
src_embedding = Embedding(input_vocabulary_size, word_embedding_size, 
                          name='source_word_embedding', mask_zero=True # Zeroes as mask
                          )(src_text)
# 2.2. BRNN encoder (GRU/LSTM)
annotations = Bidirectional(GRU(hidden_state_size, 
                                return_sequences=True  # Return the full sequence
                                ),
                            name='bidirectional_encoder',
                            merge_mode='concat')(src_embedding)





[30/07/2021 09:28:20] From /usr/local/lib/python3.7/dist-packages/keras/backend/tensorflow_backend.py:650: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.






[30/07/2021 09:28:20] From /usr/local/lib/python3.7/dist-packages/keras/backend/tensorflow_backend.py:4786: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.






[30/07/2021 09:28:20] From /usr/local/lib/python3.7/dist-packages/keras/backend/tensorflow_backend.py:157: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.





[30/07/2021 09:28:21] From /usr/local/lib/python3.7/dist-packages/keras/backend/tensorflow_backend.py:3599: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
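
To make the role of `mask_zero=True` more concrete, here is a minimal sketch in plain Python (no Keras involved). It assumes that index 0 is reserved for padding and that sentences are padded on the right; the helper names `pad_batch` and `compute_mask` are hypothetical and are not part of NMT-Keras.

```python
# Hypothetical mini-batch of source sentences, already mapped to word indices.
# Index 0 is assumed to be reserved for padding.
batch = [[5, 12, 7], [3, 9], [8]]


def pad_batch(sequences, pad_index=0):
    """Right-pad every sequence with pad_index up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]


def compute_mask(padded, pad_index=0):
    """True where there is a real word, False on padding positions."""
    return [[token != pad_index for token in seq] for seq in padded]


padded = pad_batch(batch)
mask = compute_mask(padded)
print(padded)
print(mask)
```

Layers that support masking can then skip the positions marked as `False`, so the padding does not influence the annotations.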


## Decoder
Once we have built the encoder, let's build our decoder. First, we have an additional input: the previously generated word (the so-called state_below). We introduce it by means of an Input layer and a (target language) word embedding:


In [5]:
# 3. Decoder
# 3.1.1. Previously generated words as inputs for training -> Teacher forcing
next_words = Input(name=ids_inputs[1], batch_shape=tuple([None, None]), dtype='int32')
# 3.1.2. Target word embedding
state_below = Embedding(output_vocabulary_size, word_embedding_size,
                        name='target_word_embedding', 
                        mask_zero=True)(next_words)
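
With teacher forcing, the decoder input at each time-step is the reference word of the previous time-step. A minimal sketch of how such an input could be derived from a target sentence follows. The function `shift_right` and the start index 0 are assumptions made for illustration; the library prepares this input on its own.

```python
def shift_right(target, start_index=0):
    """Build the decoder input: a start symbol followed by all target words but the last."""
    return [start_index] + target[:-1]


# Hypothetical target sentence as word indices.
target_text = [14, 27, 6, 2]
state_below_ids = shift_right(target_text)

# At time-step t the decoder reads state_below_ids[t] and must predict target_text[t].
for previous_word, word_to_predict in zip(state_below_ids, target_text):
    print(previous_word, '->', word_to_predict)
```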

The initial hidden state of the decoder's GRU is initialized by means of an MLP (in this case, single-layered) from the average of the annotations:


In [6]:
ctx_mean = MaskedMean()(annotations)
annotations = MaskLayer()(annotations)  # We may want the padded annotations

initial_state = Dense(hidden_state_size, name='initial_state',
                      activation='tanh')(ctx_mean)
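
The idea behind averaging with a mask can be sketched in plain Python. This is only an illustration of the computation, under the assumption that padded time-steps must be ignored in the average; `masked_mean` is a hypothetical helper, not the `MaskedMean` layer itself.

```python
def masked_mean(annotations, mask):
    """Average the annotation vectors over time, skipping padded time-steps."""
    dim = len(annotations[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(annotations, mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    # max(count, 1) avoids a division by zero for an all-padding sequence.
    return [t / max(count, 1) for t in total]


# Three time-steps of 2-dimensional annotations; the last one is padding.
annotations_example = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask_example = [True, True, False]
ctx_mean_example = masked_mean(annotations_example, mask_example)
print(ctx_mean_example)
```

Without the mask, the padded step would pull the average towards zero, and the initial state would depend on how much padding the batch happens to contain.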


So, we have the input of our decoder:


In [7]:
input_attentional_decoder = [state_below, annotations, initial_state]


Note that, for a given sample, the sequence of annotations and the initial state are the same at every decoding time-step. To avoid recomputing them, we build two models, one for training and another one for sampling. They will share weights, but the sampling model will be made up of two different models. One (model_init) will compute the sequence of annotations and the initial_state. The other (model_next) will compute a single recurrent step, given the sequence of annotations, the previous hidden state and the words generated so far.

Therefore, from now on we slightly change the way we declare layers: we first create each layer and store it in a variable, and then apply it, so that the same layer can be shared between the decoding models.
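
The following toy sketch shows why a layer is created once and then applied several times. `TinyDense` is a hypothetical stand-in for a Keras layer (it is not a Keras class): it owns its weights, so every graph that calls the same object sees the same parameters.

```python
class TinyDense:
    """Hypothetical stand-in for a layer: the object owns its weights."""

    def __init__(self, weight, bias):
        self.weight = weight
        self.bias = bias

    def __call__(self, inputs):
        return [self.weight * x + self.bias for x in inputs]


# Define the layer once...
shared_layer = TinyDense(weight=2.0, bias=1.0)

# ...and apply it in two different places.
training_output = shared_layer([1.0, 2.0, 3.0])  # full sequence, as in training
sampling_output = shared_layer([3.0])            # single step, as in sampling

# A change to the weights (e.g. after training) is visible everywhere.
shared_layer.weight = 3.0
updated_sampling_output = shared_layer([3.0])
print(training_output, sampling_output, updated_sampling_output)
```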

So, let's start by building the attentional-conditional GRU:


In [8]:

# Define the AttGRUCond function
sharedAttGRUCond = AttGRUCond(hidden_state_size,
                              return_sequences=True,
                              return_extra_variables=True,  # Return attended input and attention weights
                              return_states=True  # Returns the sequence of hidden states (see discussion above)
                              )
[proj_h, x_att, alphas, h_state] = sharedAttGRUCond(input_attentional_decoder)  # Apply sharedAttGRUCond to our input
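
As a rough illustration of what the attended input (`x_att`) and the attention weights (`alphas`) are, here is a plain-Python sketch of one attention step. It assumes additive (Bahdanau-style) scoring; the weight names `W_a`, `U_a` and `v_a` and all values are hypothetical, and the real layer also contains the GRU recurrence, which is omitted here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def additive_attention(annotations, state, W_a, U_a, v_a):
    """Score each annotation against the decoder state and average the annotations."""
    projected_state = matvec(U_a, state)
    scores = []
    for h in annotations:
        projected_h = matvec(W_a, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_h, projected_state)]
        scores.append(dot(v_a, hidden))
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [sum(w * h[i] for w, h in zip(weights, annotations)) for i in range(dim)]
    return context, weights


# Two annotations of size 2 and a decoder state of size 1 (toy values).
toy_annotations = [[1.0, 0.0], [0.0, 1.0]]
toy_state = [0.5]
W_a = [[1.0, 0.0], [0.0, 1.0]]
U_a = [[1.0], [1.0]]
v_a = [1.0, -1.0]
toy_context, toy_alphas = additive_attention(toy_annotations, toy_state, W_a, U_a, v_a)
print(toy_context, toy_alphas)
```

The weights sum to one, so the context vector is a weighted average of the annotations that is recomputed at every decoding step.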


Now, we set skip connections between the input and output layers. Note that, since we have a temporal dimension because of the RNN decoder, we must apply the layers in a TimeDistributed way. Finally, we will merge all skip connections and apply a 'tanh' non-linearity:



In [9]:

# Define layer function
shared_FC_mlp = TimeDistributed(Dense(word_embedding_size, activation='linear'),
                                name='logit_lstm')
# Apply layer function
out_layer_mlp = shared_FC_mlp(proj_h)

# Define layer function
shared_FC_ctx = TimeDistributed(Dense(word_embedding_size, activation='linear'),
                                name='logit_ctx')
# Apply layer function
out_layer_ctx = shared_FC_ctx(x_att)
shared_Lambda_Permute = PermuteGeneral((1, 0, 2))
out_layer_ctx = shared_Lambda_Permute(out_layer_ctx)

# Define layer function
shared_FC_emb = TimeDistributed(Dense(word_embedding_size, activation='linear'),
                                name='logit_emb')
# Apply layer function
out_layer_emb = shared_FC_emb(state_below)

shared_additional_output_merge = Add(name='additional_input')
additional_output = shared_additional_output_merge([out_layer_mlp, out_layer_ctx, out_layer_emb])
shared_activation_tanh = Activation('tanh')
out_layer = shared_activation_tanh(additional_output)
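
For a single time-step, the merge above amounts to an element-wise sum of three vectors of the same size followed by a tanh. A minimal sketch with made-up values (the function name `merge_skip_connections` is hypothetical):

```python
import math


def merge_skip_connections(from_decoder, from_context, from_embedding):
    """Element-wise sum of the three projections, followed by tanh."""
    summed = [a + b + c for a, b, c in zip(from_decoder, from_context, from_embedding)]
    return [math.tanh(x) for x in summed]


# Toy projections for one time-step; all three must have the same size.
merged = merge_skip_connections([0.5, -1.0], [0.25, 0.5], [0.25, 0.5])
print(merged)
```

This is why the three Dense layers all project to the same size (`word_embedding_size`): the Add layer requires inputs with matching shapes.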


Now, we'll apply a deep output layer, with linear activation:


In [10]:

shared_deep_out = TimeDistributed(Dense(word_embedding_size, activation='linear', name='maxout_layer'))
out_layer = shared_deep_out(out_layer)


Our output layer is a projection to the target vocabulary space plus a softmax function for obtaining a probability distribution over the target vocabulary words at each timestep.


In [11]:
shared_FC_soft = TimeDistributed(Dense(output_vocabulary_size,
                                       activation='softmax',
                                       name='softmax_layer'),
                                 name=ids_outputs[0])
softout = shared_FC_soft(out_layer)


And we build the Keras model:

In [12]:
model = Model(name='NMT Model', inputs=[src_text, next_words], outputs=softout)

model.summary()

Model: "NMT Model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
source_text (InputLayer)        (None, None)         0                                            
__________________________________________________________________________________________________
source_word_embedding (Embeddin (None, None, 50)     34300       source_text[0][0]                
__________________________________________________________________________________________________
bidirectional_encoder (Bidirect (None, None, 200)    90600       source_word_embedding[0][0]      
__________________________________________________________________________________________________
state_below (InputLayer)        (None, None)         0                                            
__________________________________________________________________________________________
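
The parameter counts shown in the summary can be checked by hand. The sketch below assumes the classic GRU formulation with one bias vector per gate; the helper names are hypothetical.

```python
def embedding_params(vocabulary_size, embedding_size):
    """One embedding vector per vocabulary entry."""
    return vocabulary_size * embedding_size


def gru_params(input_size, units):
    """Update gate, reset gate and candidate state: each one has an input
    kernel, a recurrent kernel and a bias vector."""
    return 3 * (input_size * units + units * units + units)


source_embedding_count = embedding_params(686, 50)
# The bidirectional encoder holds two independent GRUs (forward and backward).
bidirectional_encoder_count = 2 * gru_params(50, 100)
print(source_embedding_count, bidirectional_encoder_count)
```

The output size of 200 for the encoder also follows from the configuration: with `merge_mode='concat'`, the 100 units of each direction are concatenated.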


That's all! We built an NMT model!

## Sampling models

Now, let's build the models required for sampling. Recall that we are building two models, one for encoding the inputs and the other one for advancing steps in the decoding stage.

Let's start with model_init. It will take the usual inputs (src_text and state_below) and will output:
1) The vector of probabilities (for timestep 1)
2) The sequence of annotations (from the encoder)
3) The decoder's current hidden state

The only restriction here is that the first output must be the output layer (probabilities) of the model.

In [13]:

model_init = Model(inputs=[src_text, next_words], outputs=[softout, annotations, h_state])
# Store inputs and outputs names for model_init
ids_inputs_init = ids_inputs

# first output must be the output probs.
ids_outputs_init = ids_outputs + ['preprocessed_input', 'next_state']

Next, we will build model_next. It will have the following inputs:
* Preprocessed input
* Previously generated word
* Previous hidden state

And the following outputs:
* Model probabilities
* Current hidden state

So first, we define the inputs:


In [14]:

preprocessed_size = hidden_state_size*2
preprocessed_annotations = Input(name='preprocessed_input', shape=tuple([None, preprocessed_size]))
prev_h_state = Input(name='prev_state', shape=tuple([hidden_state_size]))
input_attentional_decoder = [state_below, preprocessed_annotations, prev_h_state]

And now, we build the model, using the functions stored in the 'shared*'  variables declared before:


In [15]:
# Apply decoder
[proj_h, x_att, alphas, h_state] = sharedAttGRUCond(input_attentional_decoder)
out_layer_mlp = shared_FC_mlp(proj_h)
out_layer_ctx = shared_FC_ctx(x_att)
out_layer_ctx = shared_Lambda_Permute(out_layer_ctx)
out_layer_emb = shared_FC_emb(state_below)
additional_output = shared_additional_output_merge([out_layer_mlp, out_layer_ctx, out_layer_emb])
out_layer = shared_activation_tanh(additional_output)
out_layer = shared_deep_out(out_layer)
softout = shared_FC_soft(out_layer)
model_next = Model(inputs=[next_words, preprocessed_annotations, prev_h_state],
                   outputs=[softout, preprocessed_annotations, h_state])
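
Since model_next advances the decoder by one step, a beam search can call it once per step for all live hypotheses and then keep the best continuations. The sketch below shows one such pruning step in plain Python. It is a simplified illustration, not the search routine shipped with the library; `beam_step` and the toy probabilities are hypothetical.

```python
import heapq
import math


def beam_step(hypotheses, next_word_probs, beam_size):
    """Expand every hypothesis with every word and keep the beam_size best.

    hypotheses: list of (log_prob, [word ids]) pairs.
    next_word_probs: one probability list per hypothesis.
    """
    candidates = []
    for (log_prob, words), probs in zip(hypotheses, next_word_probs):
        for word_id, p in enumerate(probs):
            if p > 0.0:
                candidates.append((log_prob + math.log(p), words + [word_id]))
    return heapq.nlargest(beam_size, candidates, key=lambda c: c[0])


# Two live hypotheses over a toy vocabulary of three words.
live = [(math.log(0.6), [1]), (math.log(0.4), [2])]
toy_probs = [[0.1, 0.7, 0.2], [0.05, 0.05, 0.9]]
best = beam_step(live, toy_probs, beam_size=2)
for log_prob, words in best:
    print(words, round(math.exp(log_prob), 2))
```

Scores are accumulated as log-probabilities, so the score of a hypothesis is the sum of the log-probabilities of its words.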

Finally, we store the names of the inputs/outputs of model_next. In addition, we create a couple of dictionaries that match outputs to inputs across the different models (model_init->model_next, model_next->model_next):


In [16]:
# Store inputs and outputs names for model_next
# first input must be previous word
ids_inputs_next = [ids_inputs[1]] + ['preprocessed_input', 'prev_state']
# first output must be the output probs.
ids_outputs_next = ids_outputs + ['preprocessed_input', 'next_state']

# Output -> Input matchings from model_init to model_next and from model_next to model_next
matchings_init_to_next = {'preprocessed_input': 'preprocessed_input', 'next_state': 'prev_state'}
matchings_next_to_next = {'preprocessed_input': 'preprocessed_input', 'next_state': 'prev_state'}
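
To illustrate how these dictionaries could be used, here is a toy greedy decoding loop in plain Python. Everything in it is a stand-in: `fake_model_init` and `fake_model_next` are hypothetical functions that return a word id directly instead of a probability vector, and `route` is a hypothetical helper. The real sampling code lives in the wrapper library.

```python
matchings_init_to_next = {'preprocessed_input': 'preprocessed_input', 'next_state': 'prev_state'}
matchings_next_to_next = {'preprocessed_input': 'preprocessed_input', 'next_state': 'prev_state'}


def route(outputs, matchings):
    """Rename the outputs of one call into the inputs of the next call."""
    return {matchings[name]: value for name, value in outputs.items() if name in matchings}


def fake_model_init(source_text):
    # Stand-in for model_init: runs once per source sentence.
    return {'target_text': 3, 'preprocessed_input': list(source_text), 'next_state': 3.0}


def fake_model_next(state_below, preprocessed_input, prev_state):
    # Stand-in for model_next: one decoding step with a toy update rule.
    next_state = prev_state - 1.0
    return {'target_text': int(next_state),
            'preprocessed_input': preprocessed_input,
            'next_state': next_state}


EOS = 0  # hypothetical end-of-sentence index
outputs = fake_model_init([4, 6, 2])
word = outputs['target_text']
inputs = route(outputs, matchings_init_to_next)
translation = []
while word != EOS and len(translation) < 10:
    translation.append(word)
    outputs = fake_model_next(state_below=word, **inputs)
    word = outputs['target_text']
    inputs = route(outputs, matchings_next_to_next)
print(translation)
```

The encoder-side work happens once in the init call, and each iteration of the loop only performs a single decoder step.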


And that's all! To use this model together with the facilities provided by the [multimodal_keras_wrapper](https://github.com/MarcBS/multimodal_keras_wrapper) library, we should declare the model as a method of a Model_Wrapper class. A complete example of this can be found at [`model_zoo.py`](https://github.com/lvapeab/nmt-keras/blob/master/nmt_keras/model_zoo.py).
