<a href="https://colab.research.google.com/github/lvapeab/nmt-keras/blob/master/examples/tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# NMT-Keras tutorial
---

This notebook describes, step by step, how to build a neural machine translation model with NMT-Keras. The tutorial is organized in three sections:


1. Create a Dataset instance, in order to properly manage the data.
2. Create and train the Neural Translation Model on the training data.
3. Apply the trained model to new (unseen) data.

The toolkit runs all these steps automatically. Still, following them one by one in this tutorial is a good way to understand the full process.


Let's start by installing the toolkit.

In [0]:
!pip install --upgrade pip
!pip uninstall -y keras  # Avoid crashes with pre-installed packages
!git clone https://github.com/lvapeab/nmt-keras
import os
os.chdir('nmt-keras')
!pip install -e .


Uninstalling Keras-2.2.4:
  Successfully uninstalled Keras-2.2.4
Cloning into 'nmt-keras'...
remote: Enumerating objects: 4482, done.
remote: Total 4482 (delta 0), reused 0 (delta 0), pack-reused 4482
Receiving objects: 100% (4482/4482), 5.59 MiB | 3.18 MiB/s, done.
Resolving deltas: 100% (3030/3030), done.
Obtaining file:///content/nmt-keras/nmt-keras/nmt-keras
Collecting keras@ https://github.com/MarcBS/keras/archive/master.zip
  Using cached https://github.com/MarcBS/keras/archive/master.zip
Building wheels for collected packages: keras
  Building wheel for keras (setup.py) ... done
  Created wheel for keras: filename=Keras-2.2.4-cp36-none-any.whl size=455356 sha256=9c4eb3a20bacec7bbc93b46cee34b5ac1771a6df736b001541c9fec743073545
  Stored in directory: /tmp/pip-ephem-wheel-cache-owclryko/wheels/82/f8/db/7c0c999dced9850abb60944d255a31dbdf10f76f645454b715
Successfully built keras
Installing collected packages: keras, nmt-keras
  Found existing installation: nmt-keras

## 1. Building a Dataset model
First, we create a [Dataset](https://github.com/MarcBS/multimodal_keras_wrapper/keras_wrapper/dataset.py) object (from the [Multimodal Keras Wrapper](https://github.com/MarcBS/multimodal_keras_wrapper) library). This object will be the interface between our data (text files) and the model:

In [0]:
from keras_wrapper.dataset import Dataset, saveDataset
from data_engine.prepare_data import keep_n_captions
ds = Dataset('tutorial_dataset', 'tutorial', silence=False)

Now that we have the empty dataset, we must indicate its inputs and outputs. In our case, we'll have two different inputs and one single output:

1. Outputs:
    - **target_text**: Sentences in our target language.

2. Inputs:
    - **source_text**: Sentences in the source language.
    - **state_below**: Sentences in the target language, but shifted one position to the right (for teacher-forcing training of the model).
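
To make the relation between **target_text** and **state_below** concrete, here is a minimal, library-free sketch of the right shift. The `shift_right` helper and the `<pad>` filler symbol are hypothetical names chosen for illustration; the toolkit handles the shift internally and may use a different symbol.

```python
def shift_right(tokens, offset=1, pad_token='<pad>'):
    """Shift a token list `offset` positions to the right, keeping its length."""
    if offset <= 0:
        return list(tokens)
    return ([pad_token] * offset + list(tokens))[:len(tokens)]


target_text = ['I', 'am', 'leaving', 'today', '.']
state_below = shift_right(target_text)

# With teacher forcing, at step t the decoder reads state_below[t]
# (the previous reference word) and is trained to predict target_text[t].
print(state_below)  # ['<pad>', 'I', 'am', 'leaving', 'today']
```

Each position of `state_below` therefore holds the word that precedes the one the model has to produce.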

To set up the outputs, we use the `setOutput` function with the appropriate parameters. Note that, when building the dataset for the training split, we also build the vocabulary (up to 30000 words).

In [0]:
ds.setOutput('examples/EuTrans/training.en',
             'train',
             type='text',
             id='target_text',
             tokenization='tokenize_none',
             build_vocabulary=True,
             pad_on_batch=True,
             sample_weights=True,
             max_text_len=30,
             max_words=30000,
             min_occ=0)

ds.setOutput('examples/EuTrans/dev.en',
             'val',
             type='text',
             id='target_text',
             pad_on_batch=True,
             tokenization='tokenize_none',
             sample_weights=True,
             max_text_len=30,
             max_words=0)

[24/03/2020 14:27:55] 	Applying tokenization function: "tokenize_none".
[24/03/2020 14:27:55] Creating vocabulary for data with data_id 'target_text'.
[24/03/2020 14:27:55] 	 Total: 513 unique words in 9900 sentences with a total of 98304 words.
[24/03/2020 14:27:55] Creating dictionary of 30000 most common words, covering 100.0% of the text.
[24/03/2020 14:27:55] Loaded "train" set outputs of data_type "text" with data_id "target_text" and length 9900.
[24/03/2020 14:27:55] 	Applying tokenization function: "tokenize_none".
[24/03/2020 14:27:55] Loaded "val" set outputs of data_type "text" with data_id "target_text" and length 100.
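
The vocabulary step reported in the log above (keep the most common words and report how much of the text they cover) can be sketched in plain Python. This is only an illustration under stated assumptions: `build_vocabulary` and the reserved symbols are hypothetical, and we assume that `max_words=0` means "no limit". It is not the wrapper's actual implementation.

```python
from collections import Counter


def build_vocabulary(sentences, max_words=30000, min_occ=0):
    """Map the most frequent words to indices and report text coverage (%)."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    # Discard rare words, then keep the `max_words` most common ones.
    kept = [(word, count) for word, count in counts.most_common() if count >= min_occ]
    if max_words > 0:  # assumption: 0 disables the size limit
        kept = kept[:max_words]
    special = ['<pad>', '<unk>']  # hypothetical reserved symbols
    words2idx = {word: idx for idx, word in enumerate(special + [w for w, _ in kept])}
    covered = sum(count for _, count in kept)
    coverage = 100.0 * covered / max(1, sum(counts.values()))
    return words2idx, coverage


words2idx, coverage = build_vocabulary(['a b a', 'a c'], max_words=2)
print(words2idx)  # {'<pad>': 0, '<unk>': 1, 'a': 2, 'b': 3}
print(coverage)   # 80.0
```

With only 513 distinct target words in the training set, a limit of 30000 keeps every word, which matches the 100.0% coverage shown in the log.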


Similarly, we introduce the source text data with the `setInput` function. Again, when building the training split, we must construct the vocabulary.

In [0]:
ds.setInput('examples/EuTrans/training.es',
            'train',
            type='text',
            id='source_text',
            pad_on_batch=True,
            tokenization='tokenize_none',
            build_vocabulary=True,
            fill='end',
            max_text_len=30,
            max_words=30000,
            min_occ=0)
ds.setInput('examples/EuTrans/dev.es',
            'val',
            type='text',
            id='source_text',
            pad_on_batch=True,
            tokenization='tokenize_none',
            fill='end',
            max_text_len=30,
            min_occ=0)

[24/03/2020 14:27:55] 	Applying tokenization function: "tokenize_none".
[24/03/2020 14:27:55] Creating vocabulary for data with data_id 'source_text'.
[24/03/2020 14:27:55] 	 Total: 686 unique words in 9900 sentences with a total of 96172 words.
[24/03/2020 14:27:55] Creating dictionary of 30000 most common words, covering 100.0% of the text.
[24/03/2020 14:27:55] Loaded "train" set inputs of data_type "text" with data_id "source_text" and length 9900.
[24/03/2020 14:27:55] 	Applying tokenization function: "tokenize_none".
[24/03/2020 14:27:55] Loaded "val" set inputs of data_type "text" with data_id "source_text" and length 100.
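
The calls above pass `pad_on_batch=True`, `fill='end'` and `max_text_len=30`. As a rough sketch of what such options typically do, the snippet below truncates each sequence and pads it to the longest sequence of its batch. We are assuming these semantics here, and `pad_batch` and `pad_id` are hypothetical names, not part of the library.

```python
def pad_batch(batch, max_text_len=30, pad_id=0, fill='end'):
    """Truncate sequences to max_text_len, then pad them to a common length."""
    truncated = [seq[:max_text_len] for seq in batch]
    width = max(len(seq) for seq in truncated)  # longest sequence in this batch
    padded = []
    for seq in truncated:
        padding = [pad_id] * (width - len(seq))
        # fill='end' appends the padding; any other value prepends it.
        padded.append(seq + padding if fill == 'end' else padding + seq)
    return padded


print(pad_batch([[5, 6, 7], [8]]))                # [[5, 6, 7], [8, 0, 0]]
print(pad_batch([[5, 6, 7], [8]], fill='start'))  # [[5, 6, 7], [0, 0, 8]]
```

Padding per batch, rather than to a global maximum, keeps batches of short sentences small.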


...and for the 'state_below' data. Note that: 1) The `offset` flag is set to 1, which means that the text will be shifted 1 position to the right. 2) At sampling time, we won't have this input. Hence, we 'hack' the dataset model by inserting an artificial input of type 'ghost' for the validation split.

In [0]:
ds.setInput('examples/EuTrans/training.en',
            'train',
            type='text',
            id='state_below',
            required=False,
            tokenization='tokenize_none',
            pad_on_batch=True,
            build_vocabulary='target_text',
            offset=1,
            fill='end',
            max_text_len=30,
            max_words=30000)
ds.setInput(None,
            'val',
            type='ghost',
            id='state_below',
            required=False)

[24/03/2020 14:27:55] 	Applying tokenization function: "tokenize_none".
[24/03/2020 14:27:55] 	Reusing vocabulary named "target_text" for data with data_id "state_below".
[24/03/2020 14:27:55] Loaded "train" set inputs of data_type "text" with data_id "state_below" and length 9900.
[24/03/2020 14:27:55] Loaded "val" set inputs of data_type "ghost" with data_id "state_below" and length 100.


We can also keep the literal source words (for replacing unknown words).

In [0]:
for split, input_text_filename in zip(['train', 'val'], ['examples/EuTrans/training.es', 'examples/EuTrans/dev.es']):
    ds.setRawInput(input_text_filename,
                   split,
                   type='file-name',
                   id='raw_source_text',
                   overwrite_split=True)

[24/03/2020 14:27:55] Loaded "train" set inputs of type "file-name" with id "raw_source_text".
[24/03/2020 14:27:55] Loaded "val" set inputs of type "file-name" with id "raw_source_text".
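
To illustrate why the literal source words are useful, here is a simplified sketch of unknown-word replacement: each unknown symbol in a hypothesis is replaced by the source word with the highest attention weight at that position. The function name, the `<unk>` symbol and the toy attention matrix are assumptions made for this example; the toolkit's own replacement logic may differ.

```python
def replace_unknown_words(hypothesis, source_words, alignments, unk_symbol='<unk>'):
    """Copy the best-aligned source word over each unknown target word.

    alignments[t][s] is the attention weight of target position t on source position s.
    """
    output = []
    for position, word in enumerate(hypothesis):
        if word == unk_symbol:
            weights = alignments[position]
            aligned = max(range(len(weights)), key=weights.__getitem__)
            output.append(source_words[aligned])
        else:
            output.append(word)
    return output


source = ['a', 'Valencia']          # toy raw source sentence
hypothesis = ['to', '<unk>']        # toy model output
attention = [[0.9, 0.1],            # 'to' attends mostly to 'a'
             [0.2, 0.8]]            # '<unk>' attends mostly to 'Valencia'
print(replace_unknown_words(hypothesis, source, attention))  # ['to', 'Valencia']
```

Copying works well for names and numbers, which usually stay the same across languages.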


We also need to match the references with the inputs. Since we only have one reference per input sample, we set `repeat=1`.

In [0]:
keep_n_captions(ds, repeat=1, n=1, set_names=['val'])


[24/03/2020 14:27:55] Keeping 1 captions per input on the val set.
[24/03/2020 14:27:55] Samples reduced to 100 in val set.


Finally, we can save our dataset instance for use in other experiments:

In [0]:
saveDataset(ds, 'datasets')


[24/03/2020 14:27:55] <<< creating directory datasets ... >>>
[24/03/2020 14:27:55] <<< Saving Dataset instance to datasets/Dataset_tutorial_dataset.pkl ... >>>
[24/03/2020 14:27:55] <<< Dataset instance saved >>>


## 2. Creating and training a Neural Translation Model
Now, we'll create and train a Neural Machine Translation (NMT) model. Since there is a significant number of hyperparameters, we'll use the default ones, specified in the `config.py` file. Note that almost every parameter hardcoded here is set automatically from the config if we run `main.py`.

We'll create an `'AttentionRNNEncoderDecoder'` (an LSTM encoder-decoder with an attention mechanism). Refer to the [`model_zoo.py`](https://github.com/lvapeab/nmt-keras/blob/master/nmt_keras/model_zoo.py) file for other models (e.g. Transformer).

So first, let's import the model and the hyperparameters. We'll also load the dataset we stored in the previous section (not necessary as it is in memory, but as a demonstration):

In [48]:
from config import load_parameters
from nmt_keras.model_zoo import TranslationModel
from keras_wrapper.cnn_model import loadModel
from keras_wrapper.dataset import loadDataset
from keras_wrapper.extra.callbacks import PrintPerformanceMetricOnEpochEndOrEachNUpdates
params = load_parameters()
dataset = loadDataset('datasets/Dataset_tutorial_dataset.pkl')


[24/03/2020 14:55:00] <<< Loading Dataset instance from datasets/Dataset_tutorial_dataset.pkl ... >>>
[24/03/2020 14:55:00] <<< Dataset instance loaded >>>


Since the number of words in the dataset may be unknown beforehand, we must update the params information according to the dataset instance:


In [0]:
params['INPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len['source_text']
params['OUTPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len['target_text']

Now, we create a `TranslationModel` instance:


In [0]:
nmt_model = TranslationModel(params,
                             model_type='AttentionRNNEncoderDecoder', 
                             model_name='tutorial_model',
                             vocabularies=dataset.vocabulary,
                             store_path='trained_models/tutorial_model/',
                             verbose=True)


[24/03/2020 14:27:55] <<< Building AttentionRNNEncoderDecoder Translation_Model >>>


-----------------------------------------------------------------------------------
		TranslationModel instance
-----------------------------------------------------------------------------------
_model_type: AttentionRNNEncoderDecoder
name: tutorial_model
model_path: trained_models/tutorial_model/
verbose: True

Params:
	ACCUMULATE_GRADIENTS: 1
	ADDITIONAL_OUTPUT_MERGE_MODE: Add
	ALIGN_FROM_RAW: True
	ALPHA_FACTOR: 0.6
	AMSGRAD: False
	APPLY_DETOKENIZATION: False
	ATTENTION_DROPOUT_P: 0.0
	ATTENTION_MODE: add
	ATTENTION_SIZE: 32
	BATCH_NORMALIZATION_MODE: 1
	BATCH_SIZE: 50
	BEAM_SEARCH: True
	BEAM_SIZE: 6
	BETA_1: 0.9
	BETA_2: 0.999
	BIDIRECTIONAL_DEEP_ENCODER: True
	BIDIRECTIONAL_ENCODER: True
	BIDIRECTIONAL_MERGE_MODE: concat
	BPE_CODES_PATH: examples/EuTrans//training_codes.joint
	CLASSIFIER_ACTIVATION: softmax
	CLIP_C: 5.0
	CLIP_V: 0.0
	COVERAGE_NORM_FACTOR: 0.2
	COVERAGE_PENALTY: False
	DATASET_NAME: EuTrans
	DATASET_STORE_PATH: datasets/
	DATA_AUGMENTATION: False
	DATA_ROOT_PATH

[24/03/2020 14:27:57] Preparing optimizer and compiling. Optimizer configuration: 
	 LR: 0.001
	 LOSS: categorical_crossentropy
	 BETA_1: 0.9
	 BETA_2: 0.999
	 EPSILON: 1e-08


Next, we must define the inputs and outputs mapping from our Dataset instance to our model:


In [0]:

inputMapping = dict()
for i, id_in in enumerate(params['INPUTS_IDS_DATASET']):
    pos_source = dataset.ids_inputs.index(id_in)
    id_dest = nmt_model.ids_inputs[i]
    inputMapping[id_dest] = pos_source
nmt_model.setInputsMapping(inputMapping)

outputMapping = dict()
for i, id_out in enumerate(params['OUTPUTS_IDS_DATASET']):
    pos_target = dataset.ids_outputs.index(id_out)
    id_dest = nmt_model.ids_outputs[i]
    outputMapping[id_dest] = pos_target
nmt_model.setOutputsMapping(outputMapping)
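
To see what these mappings contain, the same loop can be run on toy identifier lists. The lists below are made up for illustration (the dataset order is deliberately different from the model order), and `build_mapping` is a hypothetical helper, not part of the toolkit.

```python
def build_mapping(wanted_ids, dataset_ids, model_ids):
    """Map each model input/output name to its position in the dataset."""
    return {model_ids[i]: dataset_ids.index(id_) for i, id_ in enumerate(wanted_ids)}


toy_dataset_ids = ['state_below', 'source_text']  # order stored in the dataset
toy_model_ids = ['source_text', 'state_below']    # order expected by the model
toy_wanted_ids = ['source_text', 'state_below']   # plays the role of params['INPUTS_IDS_DATASET']

print(build_mapping(toy_wanted_ids, toy_dataset_ids, toy_model_ids))
# {'source_text': 1, 'state_below': 0}
```

The mapping lets the model fetch each of its inputs from the right dataset position, whatever order the data was loaded in.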


We can add some callbacks for controlling the training (e.g. sampling every N updates, early stopping, learning rate annealing...). For instance, let's build a sampling callback. After each epoch, it will compute the 'coco' scores on the development set. We need to pass some configuration variables to the callback (in the `extra_vars` dictionary):


In [0]:
search_params = {
    'language': 'en',
    'tokenize_f': dataset.tokenize_none,
    'beam_size': 12,
    'optimized_search': True,
    'n_parallel_loaders': 1,
    'maxlen': 50,
    # Identifiers of the model and dataset inputs/outputs
    'model_inputs': ['source_text', 'state_below'],
    'model_outputs': ['target_text'],
    'dataset_inputs': ['source_text', 'state_below'],
    'dataset_outputs': ['target_text'],
    'normalize': True,
    'pos_unk': True,
    'heuristic': 0,
    'state_below_maxlen': 1,
    'val': {'references': dataset.extra_variables['val']['target_text']}
}

vocab = dataset.vocabulary['target_text']['idx2words']
callbacks = []
input_text_id = params['INPUTS_IDS_DATASET'][0]

callbacks.append(PrintPerformanceMetricOnEpochEndOrEachNUpdates(nmt_model,
                                                                dataset,
                                                                gt_id='target_text',
                                                                metric_name=['coco'],
                                                                set_name=['val'],
                                                                batch_size=50,
                                                                each_n_epochs=1,
                                                                extra_vars=search_params,
                                                                reload_epoch=0,
                                                                is_text=True,
                                                                input_text_id=input_text_id,
                                                                index2word_y=vocab,
                                                                sampling_type='max_likelihood',
                                                                beam_search=True,
                                                                save_path=nmt_model.model_path,
                                                                start_eval_on_epoch=0,
                                                                write_samples=True,
                                                                write_type='list',
                                                                verbose=True))

Now we are ready to train. Let's set up some training parameters...


In [0]:
training_params = {'n_epochs': 4,
                   'batch_size': 50,
                   'maxlen': 30,
                   'epochs_for_save': 1,
                   'verbose': 1,
                   'eval_on_sets': [], 
                   'n_parallel_loaders': 1,
                   'extra_callbacks': callbacks,
                   'reload_epoch': 0,
                   'epoch_offset': 0}
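
As a quick sanity check on these settings, we can estimate how many parameter updates the run will perform. The sketch assumes one update per batch (no gradient accumulation) and uses the 9900 training sentences reported when the dataset was built.

```python
import math

n_train_samples = 9900  # size of the 'train' split, from the dataset log
batch_size = 50
n_epochs = 4

# Assumption: every batch triggers exactly one parameter update.
updates_per_epoch = math.ceil(n_train_samples / batch_size)
total_updates = updates_per_epoch * n_epochs

print(updates_per_epoch, total_updates)  # 198 792
```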


And train!


In [0]:
nmt_model.trainNet(dataset, training_params)


[24/03/2020 14:29:37] <<< Training model >>>
[24/03/2020 14:29:37] Training parameters: { 
	batch_size: 50
	class_weights: None
	da_enhance_list: []
	da_patch_type: resize_and_rndcrop
	data_augmentation: False
	each_n_epochs: 1
	epoch_offset: 0
	epochs_for_save: 1
	eval_on_epochs: True
	eval_on_sets: []
	extra_callbacks: [<keras_wrapper.extra.callbacks.EvalPerformance object at 0x7f5e1b11fe80>]
	homogeneous_batches: False
	initial_lr: 1.0
	joint_batches: 4
	lr_decay: None
	lr_gamma: 0.1
	lr_half_life: 50000
	lr_reducer_exp_base: 0.5
	lr_reducer_type: linear
	lr_warmup_exp: -1.5
	maxlen: 30
	mean_substraction: False
	metric_check: None
	min_delta: 0.0
	min_lr: 1e-09
	n_epochs: 4
	n_gpus: 1
	n_parallel_loaders: 1
	normalization_type: None
	normalize: False
	num_iterations_val: None
	patience: 0
	patience_check_split: val
	reduce_each_epochs: True
	reload_epoch: 0
	shuffle: True
	start_eval_on_epoch: 0
	start_reduction_on_epoch: 0
	tensorboard: False
	tensorboard_params: {'log_dir': 'tens

Epoch 1/4


[24/03/2020 14:30:05] <<< Saving model to trained_models/tutorial_model//epoch_1 ... >>>





  'TensorFlow optimizers do not '
[24/03/2020 14:30:05] <<< Model saved >>>

[24/03/2020 14:30:05] <<< Predicting outputs of val set >>>




 Total cost of the translations: 413.325317 	 Average cost of the translations: 4.133253
The sampling took: 11.451336 secs (Speed: 0.114513 sec/sample)


[24/03/2020 14:30:17] Prediction output 0: target_text (text)
[24/03/2020 14:30:17] Decoding beam search prediction ...
[24/03/2020 14:30:17] Using heuristic 0
[24/03/2020 14:30:17] Evaluating on metric coco





[24/03/2020 14:30:29] Computing coco scores on the val split...
[24/03/2020 14:30:29] Bleu_1: 0.47932978332553244
[24/03/2020 14:30:29] Bleu_2: 0.44420341781998574
[24/03/2020 14:30:29] Bleu_3: 0.41484331636093436
[24/03/2020 14:30:29] Bleu_4: 0.3858938496848188
[24/03/2020 14:30:29] CIDEr: 3.9957461278437503
[24/03/2020 14:30:29] METEOR: 0.29851574857051033
[24/03/2020 14:30:29] ROUGE_L: 0.6115729583070446
[24/03/2020 14:30:29] TER: 0.46356275303643724
[24/03/2020 14:30:29] Done evaluating on metric coco

[24/03/2020 14:30:29] <<< Progress plot saved in trained_models/tutorial_model//epoch_1.jpg >>>


Epoch 2/4


[24/03/2020 14:30:57] <<< Saving model to trained_models/tutorial_model//epoch_2 ... >>>





  'TensorFlow optimizers do not '
[24/03/2020 14:30:57] <<< Model saved >>>

[24/03/2020 14:30:57] <<< Predicting outputs of val set >>>




 Total cost of the translations: 297.684967 	 Average cost of the translations: 2.976850
The sampling took: 10.919711 secs (Speed: 0.109197 sec/sample)


[24/03/2020 14:31:08] Prediction output 0: target_text (text)
[24/03/2020 14:31:08] Decoding beam search prediction ...
[24/03/2020 14:31:08] Using heuristic 0
[24/03/2020 14:31:08] Evaluating on metric coco





[24/03/2020 14:31:19] Computing coco scores on the val split...
[24/03/2020 14:31:19] Bleu_1: 0.7396147552214621
[24/03/2020 14:31:19] Bleu_2: 0.7123021720367329
[24/03/2020 14:31:19] Bleu_3: 0.6917392773743373
[24/03/2020 14:31:19] Bleu_4: 0.6712081322654118
[24/03/2020 14:31:19] CIDEr: 6.599257729584952
[24/03/2020 14:31:19] METEOR: 0.45359735747932894
[24/03/2020 14:31:19] ROUGE_L: 0.8193126963132882
[24/03/2020 14:31:19] TER: 0.2520242914979757
[24/03/2020 14:31:19] Done evaluating on metric coco

[24/03/2020 14:31:19] <<< Progress plot saved in trained_models/tutorial_model//epoch_2.jpg >>>


Epoch 3/4


[24/03/2020 14:31:46] <<< Saving model to trained_models/tutorial_model//epoch_3 ... >>>





  'TensorFlow optimizers do not '
[24/03/2020 14:31:47] <<< Model saved >>>

[24/03/2020 14:31:47] <<< Predicting outputs of val set >>>




 Total cost of the translations: 197.531387 	 Average cost of the translations: 1.975314
The sampling took: 10.611021 secs (Speed: 0.106110 sec/sample)


[24/03/2020 14:31:57] Prediction output 0: target_text (text)
[24/03/2020 14:31:57] Decoding beam search prediction ...
[24/03/2020 14:31:57] Using heuristic 0
[24/03/2020 14:31:57] Evaluating on metric coco





[24/03/2020 14:32:07] Computing coco scores on the val split...
[24/03/2020 14:32:07] Bleu_1: 0.8863528829767809
[24/03/2020 14:32:07] Bleu_2: 0.8659776671633338
[24/03/2020 14:32:07] Bleu_3: 0.8536660039929164
[24/03/2020 14:32:07] Bleu_4: 0.8427886854101327
[24/03/2020 14:32:07] CIDEr: 8.052806455838352
[24/03/2020 14:32:07] METEOR: 0.5647919618681059
[24/03/2020 14:32:07] ROUGE_L: 0.9119877474469017
[24/03/2020 14:32:07] TER: 0.11538461538461539
[24/03/2020 14:32:07] Done evaluating on metric coco

[24/03/2020 14:32:07] <<< Progress plot saved in trained_models/tutorial_model//epoch_3.jpg >>>


Epoch 4/4


[24/03/2020 14:32:35] <<< Saving model to trained_models/tutorial_model//epoch_4 ... >>>





  'TensorFlow optimizers do not '
[24/03/2020 14:32:35] <<< Model saved >>>

[24/03/2020 14:32:35] <<< Predicting outputs of val set >>>




 Total cost of the translations: 152.965485 	 Average cost of the translations: 1.529655
The sampling took: 11.152328 secs (Speed: 0.111523 sec/sample)


[24/03/2020 14:32:47] Prediction output 0: target_text (text)
[24/03/2020 14:32:47] Decoding beam search prediction ...
[24/03/2020 14:32:47] Using heuristic 0
[24/03/2020 14:32:47] Evaluating on metric coco





[24/03/2020 14:32:57] Computing coco scores on the val split...
[24/03/2020 14:32:57] Bleu_1: 0.9307269346402889
[24/03/2020 14:32:57] Bleu_2: 0.9142206854883193
[24/03/2020 14:32:57] Bleu_3: 0.90259550366933
[24/03/2020 14:32:57] Bleu_4: 0.8919264143953128
[24/03/2020 14:32:57] CIDEr: 8.613646097403906
[24/03/2020 14:32:57] METEOR: 0.6056404030968355
[24/03/2020 14:32:57] ROUGE_L: 0.9431622423078433
[24/03/2020 14:32:57] TER: 0.08097165991902834
[24/03/2020 14:32:57] Done evaluating on metric coco

[24/03/2020 14:32:57] <<< Progress plot saved in trained_models/tutorial_model//epoch_4.jpg >>>
[24/03/2020 14:32:57] <<< Finished training model >>>


## 3. Decoding with a trained Neural Machine Translation Model

Now, we'll load the model we just trained from disk and use it to translate new text. In this case, we want to translate the 'test' split of our dataset.

Since we want to translate a new data split ('test'), we must add it to the dataset instance, just as we did before (in the first section). If we also have the references of the test split and want to evaluate on it, we can add them to the dataset as well. Note that this is not mandatory: we could just predict without evaluating.

In [50]:
dataset.setInput('examples/EuTrans/test.es',
                 'test',
                 type='text',
                 id='source_text',
                 pad_on_batch=True,
                 tokenization='tokenize_none',
                 fill='end',
                 max_text_len=30,
                 min_occ=0)

dataset.setInput(None,
                 'test',
                 type='ghost',
                 id='state_below',
                 required=False)

dataset.setRawInput('examples/EuTrans/test.es',
                    'test',
                    type='file-name',
                    id='raw_source_text',
                    overwrite_split=True)


[24/03/2020 14:55:34] 	Applying tokenization function: "tokenize_none".
[24/03/2020 14:55:34] Loaded "test" set inputs of data_type "text" with data_id "source_text" and length 2996.
[24/03/2020 14:55:34] Loaded "test" set inputs of data_type "ghost" with data_id "state_below" and length 2996.
[24/03/2020 14:55:34] Loaded "test" set inputs of type "file-name" with id "raw_source_text".


Now, let's load the translation model. Suppose we want to load the model saved at the end of epoch 4:


In [0]:
params['INPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len[params['INPUTS_IDS_DATASET'][0]]
params['OUTPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len[params['OUTPUTS_IDS_DATASET'][0]]

# Load model
nmt_model = loadModel('trained_models/tutorial_model', 4)


Once the model is loaded, we just have to invoke the sampling method (in this case, the Beam Search algorithm) for the 'test' split:


In [55]:
params_prediction = {
    'language': 'en',
    'tokenize_f': dataset.tokenize_none,
    'beam_size': 12,
    'optimized_search': True,
    'n_parallel_loaders': 1,
    'maxlen': 50,
    # Identifiers of the model and dataset inputs/outputs
    'model_inputs': ['source_text', 'state_below'],
    'model_outputs': ['target_text'],
    'dataset_inputs': ['source_text', 'state_below'],
    'dataset_outputs': ['target_text'],
    'normalize': True,
    'pos_unk': True,
    'heuristic': 0,
    'state_below_maxlen': 1,
    'predict_on_sets': ['test'],
    'verbose': 1,
}
predictions = nmt_model.predictBeamSearchNet(dataset, params_prediction)['test']


[24/03/2020 15:03:16] <<< Predicting outputs of test set >>>




 Total cost of the translations: 5359.951172 	 Average cost of the translations: 1.789036
The sampling took: 377.352035 secs (Speed: 0.125952 sec/sample)
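
For readers unfamiliar with beam search, the following is a small, self-contained sketch of the idea on a toy "model". It is a simplified illustration, not the toolkit's implementation: the probability table, the `</s>` end symbol and the function names are all hypothetical, and we take the cost of a hypothesis to be its negative log-probability.

```python
import math

# Toy next-word distributions, indexed by the prefix generated so far.
TOY_MODEL = {
    (): {'I': 0.6, 'we': 0.4},
    ('I',): {'am': 0.5, 'leave': 0.5},
    ('we',): {'leave': 0.9, 'am': 0.1},
}


def toy_next_log_probs(prefix):
    """Return log-probabilities of the next word; unknown prefixes end the sentence."""
    distribution = TOY_MODEL.get(tuple(prefix), {'</s>': 1.0})
    return {token: math.log(p) for token, p in distribution.items()}


def beam_search(next_log_probs, beam_size=3, maxlen=10, eos='</s>'):
    """Return the lowest-cost (tokens, cost) hypothesis found."""
    beams = [([], 0.0)]  # live hypotheses: (tokens, accumulated cost)
    finished = []
    for _ in range(maxlen):
        # Expand every live hypothesis with every possible next word.
        candidates = []
        for tokens, cost in beams:
            for token, log_p in next_log_probs(tokens).items():
                candidates.append((tokens + [token], cost - log_p))
        # Keep only the `beam_size` cheapest expansions.
        candidates.sort(key=lambda candidate: candidate[1])
        beams = []
        for tokens, cost in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((tokens, cost))
        if not beams:
            break
    return min(finished or beams, key=lambda candidate: candidate[1])


print(beam_search(toy_next_log_probs, beam_size=1)[0])  # ['I', 'am', '</s>']
print(beam_search(toy_next_log_probs, beam_size=2)[0])  # ['we', 'leave', '</s>']
```

With a beam of size 1 the search is greedy and commits to 'I' too early; with a beam of size 2 it also keeps the 'we' prefix alive and finds the more probable sentence.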


At this point, the variable 'predictions' holds the word indices of the hypotheses. We must decode them into words. To do this, we'll use the dictionary stored in the dataset object:


In [62]:
from keras_wrapper.utils import decode_predictions_beam_search
vocab = dataset.vocabulary['target_text']['idx2words']
predictions = decode_predictions_beam_search(predictions[0],  # The first element of predictions contains the word indices.
                                             vocab,
                                             verbose=params['VERBOSE'])

[24/03/2020 15:16:44] Decoding beam search prediction ...
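
Conceptually, this decoding step is a dictionary lookup from indices to words. The sketch below shows the idea with a toy vocabulary; the `decode_indices` helper, the vocabulary and the use of index 0 as the end marker are assumptions for illustration, not the behaviour of `decode_predictions_beam_search` itself.

```python
def decode_indices(predictions, idx2words, eos_index=0, unk_symbol='<unk>'):
    """Turn lists of word indices into space-separated sentences."""
    sentences = []
    for indices in predictions:
        words = []
        for index in indices:
            if index == eos_index:  # assumption: this index ends the sentence
                break
            words.append(idx2words.get(index, unk_symbol))
        sentences.append(' '.join(words))
    return sentences


toy_idx2words = {0: '<pad>', 1: '<unk>', 2: 'I', 3: 'am', 4: 'leaving', 5: 'today', 6: '.'}
print(decode_indices([[2, 3, 4, 5, 6, 0, 0]], toy_idx2words))  # ['I am leaving today .']
```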


Finally, we store the hypotheses:



In [67]:
from keras_wrapper.extra.read_write import list2file
filepath = 'test.pred'
list2file(filepath, predictions)
!head -n 4 test.pred

I would like to book a room until tomorrow , please .
please wake us up tomorrow at a quarter past seven .
I am leaving today in the afternoon .
would you mind sending down our luggage to room number oh one three , please ?


If we have the references of this split, we can also evaluate the performance of our system on it. First, we must add them to the dataset object:


In [68]:
dataset.setOutput('examples/EuTrans/test.en',
             'test',
             type='text',
             id='target_text',
             pad_on_batch=True,
             tokenization='tokenize_none',
             sample_weights=True,
             max_text_len=30,
             max_words=0)
keep_n_captions(dataset, repeat=1, n=1, set_names=['test'])

[24/03/2020 15:17:54] 	Applying tokenization function: "tokenize_none".
[24/03/2020 15:17:54] Loaded "test" set outputs of data_type "text" with data_id "target_text" and length 2996.
[24/03/2020 15:17:54] Keeping 1 captions per input on the test set.
[24/03/2020 15:17:54] Samples reduced to 2996 in test set.


Next, we call the evaluation system: the COCO package. Although it is mainly used for multimodal captioning, we can also use it for machine translation:


In [69]:

from keras_wrapper.extra.evaluation import select
metric = 'coco'
# Gather the extra variables required by the evaluation function
extra_vars = dict()
extra_vars['tokenize_f'] = dataset.tokenize_none
extra_vars['language'] = params['TRG_LAN']
extra_vars['test'] = dict()
extra_vars['test']['references'] = dataset.extra_variables['test']['target_text']
metrics = select[metric](pred_list=predictions,
                         verbose=1,
                         extra_vars=extra_vars,
                         split='test')

[24/03/2020 15:18:21] Computing coco scores on the test split...
[24/03/2020 15:18:21] Bleu_1: 0.8990784125154527
[24/03/2020 15:18:21] Bleu_2: 0.8819113101031038
[24/03/2020 15:18:21] Bleu_3: 0.8691817537848936
[24/03/2020 15:18:21] Bleu_4: 0.8569641977912813
[24/03/2020 15:18:21] CIDEr: 7.880240064453121
[24/03/2020 15:18:21] METEOR: 0.5841539919810526
[24/03/2020 15:18:21] ROUGE_L: 0.9157293025324528
[24/03/2020 15:18:21] TER: 0.10828884518123068
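
To give some intuition for the `Bleu_n` scores above, here is a minimal sketch of the clipped n-gram precision that BLEU is built on. It is not the COCO package's implementation: full BLEU also combines several n-gram orders, aggregates counts over the whole corpus and applies a brevity penalty. The reference sentence below is invented for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams found in the reference, with clipped counts."""
    hyp_counts = ngrams(hypothesis.split(), n)
    ref_counts = ngrams(reference.split(), n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each n-gram is credited at most as many times as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total


hypothesis = 'I am leaving today in the afternoon .'
reference = 'I am leaving today in the evening .'  # toy reference
print(modified_precision(hypothesis, reference, 1))  # 0.875
```

Clipping prevents a hypothesis from scoring well by repeating a single correct word many times.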


And that's all!