# Hands-on: Build and fix a machine translation model

You will learn the following:

- Adapting and extending the Transformer architecture, starting from the baseline implementation in the GluonNLP API.
- Training and debugging a neural machine translation model.

We target a low-resource translation setting. Your task is to implement normalization-centric changes from Nguyen and Salazar (2019) so that the model provided in this notebook converges.

The session also shows how to adapt and extend models based on the implementations provided in GluonNLP.
Specifically, we use the English-Galician translation task based on the [TED Talks dataset](https://github.com/neulab/word-embeddings-for-nmt) by Qi et al. (2018).
The low-resource setting is well suited to a hands-on session: because the dataset is small, training is quick and results can be compared quickly.

The model code provided in this notebook is taken directly from [GluonNLP's implementation of the Transformer](https://github.com/dmlc/gluon-nlp/blob/master/src/gluonnlp/model/transformer.py) (Vaswani et al. 2017). Running the notebook as-is reproduces a convergence failure, because no learning rate warmup is used.
Your task is to make a few small but targeted changes that allow the model to converge.
The minimal changes required are outlined throughout the notebook.
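
For context on what "warmup" refers to: Vaswani et al. (2017) increase the learning rate linearly over the first `warmup_steps` updates and then decay it with the inverse square root of the step number. The sketch below only illustrates that schedule. The function name is hypothetical, and this notebook deliberately keeps a constant learning rate instead.

```python
def inverse_sqrt_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate schedule from Vaswani et al. (2017).

    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    """
    step = max(step, 1)  # avoid a zero division at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate rises until `warmup_steps` and decays afterwards.
for step in (1, 1000, 4000, 16000):
    print(step, inverse_sqrt_lr(step))
```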

In GluonNLP's Model Zoo you can also find pre-trained machine translation models, together with the scripts to train them: [Machine Translation Models](http://gluon-nlp.mxnet.io/model_zoo/machine_translation/index.html)

References:
- Nguyen, Toan Q., and Julian Salazar. "[Transformers without Tears: Improving the Normalization of Self-Attention](https://arxiv.org/abs/1910.05895)." International Workshop on Spoken Language Translation (2019).
- Qi, Ye, et al. "[When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?](https://www.aclweb.org/anthology/N18-2084/)." Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018.
- Vaswani, Ashish, et al. "[Attention is all you need](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)." Advances in neural information processing systems. 2017.

In [1]:
import numpy as np
import mxnet as mx
from mxnet import gluon, metric
from mxnet.gluon import Block, HybridBlock, nn

import gluonnlp as nlp

ctx = mx.gpu(0)

## Data pipeline

We extracted the English-Galician part of the TED Talks dataset using the [script provided by Qi et al. (2018)](https://github.com/neulab/word-embeddings-for-nmt/blob/master/ted_reader.py), filtered out sentences longer than 200 tokens, and learned joint BPE codes using Lample's [fastBPE](https://github.com/glample/fastBPE).
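
fastBPE marks every sub-word unit that does not end a word with a trailing `@@`. As a minimal sketch of how such a token sequence maps back to words, the helper below merges continuation pieces. The name `bpe_to_words` is a hypothetical stand-in for illustration and is not the helper used later in this notebook.

```python
def bpe_to_words(tokens, marker='@@'):
    """Merge BPE sub-word tokens back into whole words."""
    words, current = [], ''
    for token in tokens:
        if token.endswith(marker):
            # The word continues in the next token: strip the marker and keep collecting.
            current += token[:-len(marker)]
        else:
            words.append(current + token)
            current = ''
    if current:
        # Sequence ended on a continuation piece; keep what was collected.
        words.append(current)
    return words


print(bpe_to_words(['I', 'see@@', 'k', 're@@', 'fu@@', 'ge', 'in', 'Al@@', 'la@@', 'h']))
```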

In [2]:
# List all data files provided in this session 
!tree data

data
└── en_gl
    ├── codes
    ├── dev.en
    ├── dev.en.3000
    ├── dev.gl
    ├── dev.gl.3000
    ├── test.en
    ├── test.en.3000
    ├── test.gl
    ├── test.gl.3000
    ├── train.en
    ├── train.en.3000
    ├── train.gl
    ├── train.gl.3000
    └── vocab.3000

1 directory, 14 files


Let's first load the vocabulary generated by fastBPE.

In [3]:
!head -n 5 data/en_gl/vocab.3000

. 20061
, 19950
a 10534
the 7688
que 6559


In [4]:
with open('data/en_gl/vocab.3000') as f:
    counter = dict((word, int(count)) for word, count in map(str.split, f.readlines()))
vocab = nlp.Vocab(counter, padding_token=None)
print(vocab)

Vocab(size=3163, unk="<unk>", reserved="['<bos>', '<eos>']")


We then load the BPE-tokenized datasets into a `gluon.Dataset`.

In [5]:
def get_data(part):
    en = nlp.data.CorpusDataset('data/en_gl/{}.en.3000'.format(part))
    gl = nlp.data.CorpusDataset('data/en_gl/{}.gl.3000'.format(part))
    indices = list(range(len(en)))
    return mx.gluon.data.ArrayDataset(en, gl, indices)

sentences_train = get_data('train')
sentences_dev = get_data('dev')
sentences_test = get_data('test')

def data_transform(src, tgt, idx):
    src = vocab[src + [vocab.eos_token]]
    tgt = vocab[[vocab.bos_token] + tgt + [vocab.eos_token]]
    return src, tgt, len(src), len(tgt), idx

data_train = sentences_train.transform(data_transform, lazy=False)
data_dev = sentences_dev.transform(data_transform, lazy=False)
data_test = sentences_test.transform(data_transform, lazy=False)
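
To make the effect of `data_transform` concrete, here is a toy sketch using a plain dictionary in place of `nlp.Vocab`. The vocabulary entries and the names `toy_vocab`, `to_ids` and `toy_transform` are made up for illustration.

```python
# Toy stand-in for the vocabulary lookup (assumption: not the real nlp.Vocab).
toy_vocab = {'<unk>': 0, '<bos>': 1, '<eos>': 2, 'the': 3, 'o': 4, 'cat': 5, 'gato': 6}


def to_ids(tokens):
    """Map tokens to indices, falling back to <unk> for unknown tokens."""
    return [toy_vocab.get(tok, toy_vocab['<unk>']) for tok in tokens]


def toy_transform(src, tgt, idx):
    src_ids = to_ids(src + ['<eos>'])              # source: tokens + <eos>
    tgt_ids = to_ids(['<bos>'] + tgt + ['<eos>'])  # target: <bos> + tokens + <eos>
    return src_ids, tgt_ids, len(src_ids), len(tgt_ids), idx


print(toy_transform(['the', 'cat'], ['o', 'gato'], 0))
```

The source only receives an end-of-sentence marker, while the target is wrapped in both markers because the decoder needs a start symbol to condition its first prediction on.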

We then define the `gluon.DataLoader`, which loads batches from the `Dataset` for training or evaluation.

In [6]:
import gluonnlp.data.batchify as btf

batch_size = 4096  # number of tokens per batch
def get_dataloader(dataset, is_train=False):
    batchify_fn = btf.Tuple(btf.Pad(pad_val=0), btf.Pad(pad_val=0),
                            btf.Stack(dtype='float32'), btf.Stack(dtype='float32'),
                            btf.Stack())

    data_lengths = dataset.transform(lambda src, tgt, src_len, tgt_len, idx: (src_len, tgt_len))
    batch_sampler = nlp.data.FixedBucketSampler(lengths=data_lengths,
                                                batch_size=batch_size,
                                                shuffle=is_train,
                                                use_average_length=True)
    data_loader = mx.gluon.data.DataLoader(dataset,
                                           batch_sampler=batch_sampler,
                                           batchify_fn=batchify_fn)
    return data_loader

loader_train = get_dataloader(data_train, is_train=True)
loader_dev = get_dataloader(data_dev)

Let's peek at the data loader's output.

In [7]:
batch = next(iter(loader_dev))
print('Array holding English src tokens has shape\t', batch[0].shape,'\tand lengths\t', batch[2].asnumpy())
print('Array holding Galician tgt tokens has shape\t', batch[1].shape, '\tand lengths\t', batch[3].asnumpy())


Array holding English src tokens has shape	 (2, 168) 	and lengths	 [168. 149.]
Array holding Galician tgt tokens has shape	 (2, 175) 	and lengths	 [175. 165.]
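
The shapes above reflect padding: both sentences in the batch are padded to the length of the longer one, and the valid lengths are kept separately so that padded positions can be masked later. A minimal pure-Python sketch of that idea follows. The helper `pad_batch` is hypothetical and only mimics what `btf.Pad(pad_val=0)` is used for here.

```python
def pad_batch(sequences, pad_val=0):
    """Right-pad variable-length id lists to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_val] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]  # valid lengths, needed for masking
    return padded, lengths


padded, lengths = pad_batch([[3, 5, 2], [7, 2]])
print(padded, lengths)
```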


## Modeling

For the purpose of this session, we focus on two of the changes proposed in the "Transformers without Tears" paper.

### PostNorm vs. PreNorm

Vaswani et al. (2017) apply normalization after the sublayer and residual addition (Post-Norm): $$x_{l+1} = \text{Norm}(x_l + F_l(x_l))$$
Instead, let's normalize only the input of each sublayer and leave the residual path untouched (Pre-Norm): $$x_{l+1} = x_l + F_l(\text{Norm}(x_l))$$

Reference: https://arxiv.org/pdf/1910.05895.pdf#subsection.2.1
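
The two orderings can be sketched on plain Python lists. This is only an illustration under simplifying assumptions: `layer_norm` has no learned gain or bias, and `double` is a hypothetical stand-in for an attention or feed-forward sublayer.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm(x, sublayer, norm=layer_norm):
    # x_{l+1} = Norm(x_l + F_l(x_l))
    return norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm(x, sublayer, norm=layer_norm):
    # x_{l+1} = x_l + F_l(Norm(x_l))
    return [a + b for a, b in zip(x, sublayer(norm(x)))]


def double(v):
    """Hypothetical sublayer F_l used only for this illustration."""
    return [2.0 * e for e in v]


x = [1.0, 2.0, 3.0, 4.0]
print(post_norm(x, double))
print(pre_norm(x, double))
```

In the Pre-Norm variant the input `x` reaches the output unchanged through the residual path, so the block output itself is no longer normalized.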

### ScaleNorm

Santurkar et al. (2018) [1] showed that BatchNorm makes the optimization landscape significantly smoother.
As LayerNorm is inspired by BatchNorm, let's replace it with a scaled L2 normalization,
which also helps to smooth the loss landscape.

$$\text{ScaleNorm}(\mathbf{x}, g) = g \frac{\mathbf{x}}{\|\mathbf{x}\|}$$

Reference: https://arxiv.org/pdf/1910.05895.pdf#subsection.2.3

[1] Santurkar, Shibani, et al. "[How does batch normalization help optimization?](http://papers.nips.cc/paper/7515-how-does-batch-normalization-help-optimization)." Advances in Neural Information Processing Systems. 2018.
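
As a quick numeric check of the formula, the following pure-Python sketch applies ScaleNorm to a single vector. The function name `scale_norm` and the example values are illustrative.

```python
import math


def scale_norm(x, g, eps=1e-5):
    """ScaleNorm(x, g) = g * x / ||x||, with the norm clipped from below by eps."""
    norm = max(math.sqrt(sum(v * v for v in x)), eps)
    return [g * v / norm for v in x]


y = scale_norm([3.0, 4.0], g=2.0)
print(y, math.sqrt(sum(v * v for v in y)))
```

Whatever the length of the input vector, the output has L2 norm `g`, so a single learned scalar controls the magnitude of the activations. The clipping by `eps` keeps the division defined for an all-zero input.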

In [8]:
import math

class ScaleNorm(mx.gluon.HybridBlock):
    """ScaleNorm"""
    def __init__(self, scale, epsilon=1e-5, prefix=None, params=None):
        super(ScaleNorm, self).__init__(prefix=prefix, params=params)
        self.epsilon = epsilon
        with self.name_scope():
            self.scale = self.params.get('scale', shape=(1, ), init=mx.init.Constant(scale))

    def hybrid_forward(self, F, x, scale):
        norm = F.broadcast_div(scale, F.norm(x, axis=-1, keepdims=True).clip(a_min=self.epsilon, a_max=math.inf))
        return F.broadcast_mul(x, norm)

### Adapting the Position-wise Feed-Forward Networks

To implement the changes outlined above, follow the steps below.

- Change `PositionwiseFFN.hybrid_forward` to use Pre-Norm instead of Post-Norm.
- Change the `PositionwiseFFN.__init__` constructor to use `ScaleNorm` instead of `LayerNorm`.

You may make both changes at the same time, or experiment with each change individually in the following code blocks.


In [9]:
class PositionwiseFFN(HybridBlock):
    """Positionwise Feed-Forward Neural Network.

    Parameters
    ----------
    units : int
        Number of units for the output
    hidden_size : int
        Number of units in the hidden layer of position-wise feed-forward networks
    dropout : float
        Dropout probability for the output
    use_residual : bool
        Add residual connection between the input and the output
    ffn1_dropout : bool, default False
        If True, apply dropout both after the first and second Positionwise
        Feed-Forward Neural Network layers. If False, only apply dropout after
        the second.
    activation : str, default 'relu'
        Activation function
    layer_norm_eps : float, default 1e-5
        Epsilon parameter passed to mxnet.gluon.nn.LayerNorm
    weight_initializer : str or Initializer
        Initializer for the input weights matrix, used for the linear
        transformation of the inputs.
    bias_initializer : str or Initializer
        Initializer for the bias vector.
    prefix : str, default None
        Prefix for name of `Block`s
        (and name of weight if params is `None`).
    params : Parameter or None
        Container for weight sharing between cells.
        Created if `None`.
    """

    def __init__(self, *, units=512, hidden_size=2048, dropout=0.0, use_residual=True,
                 ffn1_dropout=False, activation='relu', layer_norm_eps=1e-5,
                 weight_initializer=None, bias_initializer='zeros', prefix=None, params=None):
        super().__init__(prefix=prefix, params=params)
        self._use_residual = use_residual
        self._dropout = dropout
        self._ffn1_dropout = ffn1_dropout
        with self.name_scope():
            self.ffn_1 = nn.Dense(units=hidden_size, flatten=False,
                                  weight_initializer=weight_initializer,
                                  bias_initializer=bias_initializer,
                                  prefix='ffn_1_')
            self.activation = gluon.nn.Activation(activation)
            self.ffn_2 = nn.Dense(units=units, flatten=False,
                                  weight_initializer=weight_initializer,
                                  bias_initializer=bias_initializer,
                                  prefix='ffn_2_')
            if dropout:
                self.dropout_layer = nn.Dropout(rate=dropout)
            self.layer_norm = nn.LayerNorm(in_channels=units, epsilon=layer_norm_eps)

    def hybrid_forward(self, F, inputs):  # pylint: disable=arguments-differ
        """Position-wise encoding of the inputs.

        Parameters
        ----------
        inputs : Symbol or NDArray
            Input sequence. Shape (batch_size, length, C_in)

        Returns
        -------
        outputs : Symbol or NDArray
            Shape (batch_size, length, C_out)
        """
        outputs = self.ffn_1(inputs)
        if self.activation:
            outputs = self.activation(outputs)
        if self._dropout and self._ffn1_dropout:
            outputs = self.dropout_layer(outputs)
        outputs = self.ffn_2(outputs)
        if self._dropout:
            outputs = self.dropout_layer(outputs)
        if self._use_residual:
            outputs = outputs + inputs
        outputs = self.layer_norm(outputs)
        return outputs


### Adapting the Encoder and Decoder Cells

We redefine the `TransformerEncoderCell` and `TransformerDecoderCell` from GluonNLP, making use of the `PositionwiseFFN` declared in the previous code block.

To implement the changes outlined above:

- Change `TransformerEncoderCell.hybrid_forward` to use Pre-Norm instead of Post-Norm.
- Change the `TransformerEncoderCell.__init__` constructor to use `ScaleNorm` instead of `LayerNorm`.

and similarly:

- Change `TransformerDecoderCell.hybrid_forward` to use Pre-Norm instead of Post-Norm.
- Change the `TransformerDecoderCell.__init__` constructor to use `ScaleNorm` instead of `LayerNorm`.


Again, you may make both changes at the same time, or experiment with each change individually in the following code blocks.

In [10]:
from gluonnlp.model.seq2seq_encoder_decoder import _get_attention_cell

class TransformerEncoderCell(HybridBlock):
    """Structure of the Transformer Encoder Cell.

    Parameters
    ----------
    attention_cell : AttentionCell or str, default 'multi_head'
        Arguments of the attention cell.
        Can be 'multi_head', 'scaled_luong', 'scaled_dot', 'dot', 'cosine', 'normed_mlp', 'mlp'
    units : int
        Number of units for the output
    hidden_size : int
        number of units in the hidden layer of position-wise feed-forward networks
    num_heads : int
        Number of heads in multi-head attention
    scaled : bool
        Whether to scale the softmax input by the sqrt of the input dimension
        in multi-head attention
    dropout : float
        Dropout probability.
    use_residual : bool
        Whether to use residual connection.
    output_attention: bool
        Whether to output the attention weights
    attention_use_bias : bool, default False
        Whether to use bias when projecting the query/key/values in the attention cell.
    attention_proj_use_bias : bool, default False
        Whether to use bias when projecting the output of the attention cell.
    weight_initializer : str or Initializer
        Initializer for the input weights matrix, used for the linear
        transformation of the inputs.
    bias_initializer : str or Initializer
        Initializer for the bias vector.
    prefix : str, default None
        Prefix for name of `Block`s. (and name of weight if params is `None`).
    params : Parameter or None
        Container for weight sharing between cells. Created if `None`.
    activation : str, default None
        Activation methods in PositionwiseFFN
    layer_norm_eps : float, default 1e-5
        Epsilon for layer_norm

    Inputs:
        - **inputs** : input sequence. Shape (batch_size, length, C_in)
        - **mask** : mask for inputs. Shape (batch_size, length, length)

    Outputs:
        - **outputs**: output tensor of the transformer encoder cell.
            Shape (batch_size, length, C_out)
        - **additional_outputs**: the additional output of all the transformer encoder cell.
    """

    def __init__(self, *, attention_cell='multi_head', units=128, hidden_size=512, num_heads=4,
                 scaled=True, dropout=0.0, use_residual=True, output_attention=False,
                 attention_proj_use_bias=False, attention_use_bias=False, weight_initializer=None,
                 bias_initializer='zeros', prefix=None, params=None, activation='relu',
                 layer_norm_eps=1e-5):
        super().__init__(prefix=prefix, params=params)
        self._dropout = dropout
        self._use_residual = use_residual
        self._output_attention = output_attention
        with self.name_scope():
            if dropout:
                self.dropout_layer = nn.Dropout(rate=dropout)
            self.attention_cell = _get_attention_cell(attention_cell, units=units,
                                                      num_heads=num_heads, scaled=scaled,
                                                      dropout=dropout, use_bias=attention_use_bias)
            self.proj = nn.Dense(units=units, flatten=False, use_bias=attention_proj_use_bias,
                                 weight_initializer=weight_initializer,
                                 bias_initializer=bias_initializer, prefix='proj_')
            self.ffn = PositionwiseFFN(units=units, hidden_size=hidden_size, dropout=dropout,
                                       use_residual=use_residual,
                                       weight_initializer=weight_initializer,
                                       bias_initializer=bias_initializer, activation=activation,
                                       layer_norm_eps=layer_norm_eps)
            self.layer_norm = nn.LayerNorm(in_channels=units, epsilon=layer_norm_eps)


    def hybrid_forward(self, F, inputs, mask=None):  # pylint: disable=arguments-differ
        """Transformer Encoder Attention Cell.

        Parameters
        ----------
        inputs : Symbol or NDArray
            Input sequence. Shape (batch_size, length, C_in)
        mask : Symbol or NDArray or None
            Mask for inputs. Shape (batch_size, length, length)

        Returns
        -------
        encoder_cell_outputs: list
            Outputs of the encoder cell. Contains:

            - outputs of the transformer encoder cell. Shape (batch_size, length, C_out)
            - additional_outputs of all the transformer encoder cell
        """
        outputs, attention_weights = self.attention_cell(inputs, inputs, inputs, mask)
        outputs = self.proj(outputs)
        if self._dropout:
            outputs = self.dropout_layer(outputs)
        if self._use_residual:
            outputs = outputs + inputs
        outputs = self.layer_norm(outputs)
        outputs = self.ffn(outputs)
        additional_outputs = []
        if self._output_attention:
            additional_outputs.append(attention_weights)
        return outputs, additional_outputs

In [11]:
class TransformerDecoderCell(HybridBlock):
    """Structure of the Transformer Decoder Cell.

    Parameters
    ----------
    attention_cell : AttentionCell or str, default 'multi_head'
        Arguments of the attention cell.
        Can be 'multi_head', 'scaled_luong', 'scaled_dot', 'dot', 'cosine', 'normed_mlp', 'mlp'
    units : int
        Number of units for the output
    hidden_size : int
        number of units in the hidden layer of position-wise feed-forward networks
    num_heads : int
        Number of heads in multi-head attention
    scaled : bool
        Whether to scale the softmax input by the sqrt of the input dimension
        in multi-head attention
    dropout : float
        Dropout probability.
    use_residual : bool
        Whether to use residual connection.
    output_attention: bool
        Whether to output the attention weights
    weight_initializer : str or Initializer
        Initializer for the input weights matrix, used for the linear
        transformation of the inputs.
    bias_initializer : str or Initializer
        Initializer for the bias vector.
    prefix : str, default None
        Prefix for name of `Block`s
        (and name of weight if params is `None`).
    params : Parameter or None
        Container for weight sharing between cells.
        Created if `None`.
    """
    def __init__(self, attention_cell='multi_head', units=128,
                 hidden_size=512, num_heads=4, scaled=True,
                 dropout=0.0, use_residual=True, output_attention=False,
                 weight_initializer=None, bias_initializer='zeros',
                 prefix=None, params=None):
        super(TransformerDecoderCell, self).__init__(prefix=prefix, params=params)
        self._units = units
        self._num_heads = num_heads
        self._dropout = dropout
        self._use_residual = use_residual
        self._output_attention = output_attention
        self._scaled = scaled
        with self.name_scope():
            if dropout:
                self.dropout_layer = nn.Dropout(rate=dropout)
            self.attention_cell_in = _get_attention_cell(attention_cell,
                                                         units=units,
                                                         num_heads=num_heads,
                                                         scaled=scaled,
                                                         dropout=dropout)
            self.attention_cell_inter = _get_attention_cell(attention_cell,
                                                            units=units,
                                                            num_heads=num_heads,
                                                            scaled=scaled,
                                                            dropout=dropout)
            self.proj_in = nn.Dense(units=units, flatten=False,
                                    use_bias=False,
                                    weight_initializer=weight_initializer,
                                    bias_initializer=bias_initializer,
                                    prefix='proj_in_')
            self.proj_inter = nn.Dense(units=units, flatten=False,
                                       use_bias=False,
                                       weight_initializer=weight_initializer,
                                       bias_initializer=bias_initializer,
                                       prefix='proj_inter_')
            self.ffn = PositionwiseFFN(hidden_size=hidden_size,
                                       units=units,
                                       use_residual=use_residual,
                                       dropout=dropout,
                                       weight_initializer=weight_initializer,
                                       bias_initializer=bias_initializer)

            self.layer_norm_in = nn.LayerNorm()
            self.layer_norm_inter = nn.LayerNorm()

    def hybrid_forward(self, F, inputs, mem_value, mask=None, mem_mask=None):  #pylint: disable=unused-argument
        #  pylint: disable=arguments-differ
        """Transformer Decoder Attention Cell.

        Parameters
        ----------
        inputs : Symbol or NDArray
            Input sequence. Shape (batch_size, length, C_in)
        mem_value : Symbol or NDArrays
            Memory value, i.e. output of the encoder. Shape (batch_size, mem_length, C_in)
        mask : Symbol or NDArray or None
            Mask for inputs. Shape (batch_size, length, length)
        mem_mask : Symbol or NDArray or None
            Mask for mem_value. Shape (batch_size, length, mem_length)

        Returns
        -------
        decoder_cell_outputs: list
            Outputs of the decoder cell. Contains:

            - outputs of the transformer decoder cell. Shape (batch_size, length, C_out)
            - additional_outputs of all the transformer decoder cell
        """
        outputs, attention_in_outputs =\
            self.attention_cell_in(inputs, inputs, inputs, mask)
        outputs = self.proj_in(outputs)
        if self._dropout:
            outputs = self.dropout_layer(outputs)
        if self._use_residual:
            outputs = outputs + inputs
        outputs = self.layer_norm_in(outputs)
        inputs = outputs
        outputs, attention_inter_outputs = \
            self.attention_cell_inter(inputs, mem_value, mem_value, mem_mask)
        outputs = self.proj_inter(outputs)
        if self._dropout:
            outputs = self.dropout_layer(outputs)
        if self._use_residual:
            outputs = outputs + inputs
        outputs = self.layer_norm_inter(outputs)
        outputs = self.ffn(outputs)
        additional_outputs = []
        if self._output_attention:
            additional_outputs.append(attention_in_outputs)
            additional_outputs.append(attention_inter_outputs)
        return outputs, additional_outputs

### Adapting the Encoder and Decoder

Finally, we patch the `TransformerEncoder` and `TransformerDecoder` from GluonNLP to make use of the changed components defined in the previous code blocks. Note that GluonNLP defines both a `TransformerDecoder` for training and a `TransformerOneStepDecoder` for testing, both inheriting from `_BaseTransformerDecoder`, so we patch both decoder classes.

In the next code blocks, we subclass `TransformerEncoder`, `TransformerDecoder`, and `TransformerOneStepDecoder` and overwrite only their constructors, essentially patching GluonNLP's Transformer implementation to use the updated cells above. You thus don't need to make any changes to these code blocks.

In [12]:
from gluonnlp.model.transformer import _position_encoding_init

class MyTransformerEncoder(nlp.model.transformer.TransformerEncoder):
    def __init__(self, *, attention_cell='multi_head', num_layers=2, units=512, hidden_size=2048,
             max_length=50, num_heads=4, scaled=True, scale_embed=True, norm_inputs=False,
             dropout=0.0, use_residual=True, output_attention=False, output_all_encodings=False,
             weight_initializer=None, bias_initializer='zeros', prefix=None, params=None):
        HybridBlock.__init__(self, prefix=prefix, params=params)
        assert units % num_heads == 0,\
            'In TransformerEncoder, The units should be divided exactly ' \
            'by the number of heads. Received units={}, num_heads={}' \
            .format(units, num_heads)
        self._max_length = max_length
        self._units = units
        self._output_attention = output_attention
        self._output_all_encodings = output_all_encodings
        self._dropout = dropout
        self._scale_embed = scale_embed
        self._norm_inputs = norm_inputs

        with self.name_scope():
            if dropout:
                self.dropout_layer = nn.Dropout(rate=dropout)
            if self._norm_inputs:
                self.layer_norm = nn.LayerNorm(in_channels=units, epsilon=1e-5)
            self.position_weight = self.params.get_constant(
                'const', _position_encoding_init(max_length, units))
            self.transformer_cells = nn.HybridSequential()
            for i in range(num_layers):
                cell = TransformerEncoderCell(
                    units=units, hidden_size=hidden_size, num_heads=num_heads,
                    attention_cell=attention_cell, weight_initializer=weight_initializer,
                    bias_initializer=bias_initializer, dropout=dropout, use_residual=use_residual,
                    scaled=scaled, output_attention=output_attention, prefix='transformer%d_' % i)
                self.transformer_cells.add(cell)

In [13]:
class MyTransformerDecoder(nlp.model.transformer.TransformerDecoder):
    def __init__(self, attention_cell='multi_head', num_layers=2, units=128, hidden_size=2048,
             max_length=50, num_heads=4, scaled=True, scale_embed=True, norm_inputs=True,
             dropout=0.0, use_residual=True, output_attention=False, weight_initializer=None,
             bias_initializer='zeros', prefix=None, params=None):
        HybridBlock.__init__(self, prefix=prefix, params=params)
        assert units % num_heads == 0, 'In TransformerDecoder, the units should be divided ' \
                                       'exactly by the number of heads. Received units={}, ' \
                                       'num_heads={}'.format(units, num_heads)
        self._num_layers = num_layers
        self._units = units
        self._hidden_size = hidden_size
        self._num_states = num_heads
        self._max_length = max_length
        self._dropout = dropout
        self._use_residual = use_residual
        self._output_attention = output_attention
        self._scaled = scaled
        self._scale_embed = scale_embed
        self._norm_inputs = norm_inputs
        with self.name_scope():
            if dropout:
                self.dropout_layer = nn.Dropout(rate=dropout)
            if self._norm_inputs:
                self.layer_norm = nn.LayerNorm()
            encoding = _position_encoding_init(max_length, units)
            self.position_weight = self.params.get_constant('const', encoding.astype(np.float32))
            self.transformer_cells = nn.HybridSequential()
            for i in range(num_layers):
                self.transformer_cells.add(
                    TransformerDecoderCell(units=units, hidden_size=hidden_size,
                                           num_heads=num_heads, attention_cell=attention_cell,
                                           weight_initializer=weight_initializer,
                                           bias_initializer=bias_initializer, dropout=dropout,
                                           scaled=scaled, use_residual=use_residual,
                                           output_attention=output_attention,
                                           prefix='transformer%d_' % i))

                
class MyTransformerOneStepDecoder(nlp.model.transformer.TransformerOneStepDecoder):
    def __init__(self, *args, **kwargs):
        # Reuse the patched constructor so that both decoders build the same cells.
        MyTransformerDecoder.__init__(self, *args, **kwargs)

## Training and Evaluation

We first instantiate the model and the trainer, then define a custom training step and evaluate the trained model on the development set.

In [14]:
from gluonnlp.model.translation import NMTModel

kwargs = dict(units=512, hidden_size=2048, dropout=0.4, num_layers=4, num_heads=4, max_length=500)

encoder = MyTransformerEncoder(**kwargs, prefix='transformer_enc_')
decoder = MyTransformerDecoder(**kwargs, prefix='transformer_dec_')
one_step_ahead_decoder = MyTransformerOneStepDecoder(**kwargs, params=decoder.collect_params())

model = NMTModel(src_vocab=vocab, tgt_vocab=vocab, encoder=encoder, decoder=decoder,
                 one_step_ahead_decoder=one_step_ahead_decoder, share_embed=True,
                 embed_size=kwargs['units'], tie_weights=True, embed_initializer=None,
                 prefix='transformer_')
model.initialize(init=mx.init.Xavier(magnitude=3), ctx=ctx)
model.hybridize()

trainer = mx.gluon.Trainer(model.collect_params(), 'Adam', {'learning_rate': 3e-4, 'beta2': 0.98, 'epsilon': 1e-9})

In [15]:
from mxnet.gluon.contrib import estimator
from gluonnlp.loss import LabelSmoothing


label_smoothing = LabelSmoothing(epsilon=0.1, units=len(vocab))
label_smoothing.hybridize()

class MyEstimator(estimator.Estimator):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        assert len(self.context) == 1, 'Please change the fit_batch function to support multi-GPU training.'

    def fit_batch(self, train_batch, batch_axis=0):
        src, tgt, src_valid_length, tgt_valid_length, idx = [x.as_in_context(self.context[0]) for x in train_batch]
        with mx.autograd.record():
            # Teacher forcing: feed tgt[:, :-1] to the decoder and predict tgt[:, 1:]
            out, _ = self.net(src, tgt[:, :-1], src_valid_length, tgt_valid_length - 1)
            smoothed_label = label_smoothing(tgt[:, 1:])
            ls = self.loss(out, smoothed_label, tgt_valid_length - 1).sum()
            # Normalize the summed loss by the total target length in the batch
            ls = ls / tgt_valid_length.sum()
        ls.backward()
        # The loss is already normalized above, so gradients are not rescaled again
        self.trainer.step(1)
        return (src, src_valid_length), (tgt, tgt_valid_length), out, ls
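
Two details of the training step can be sketched in plain Python: the one-position shift between decoder input and labels, and label smoothing. The smoothing shown here mixes the one-hot target with a uniform distribution, which is one common formulation. Treat it as an assumption, since `gluonnlp.loss.LabelSmoothing` may differ in detail, and both function names are hypothetical.

```python
def shift_for_teacher_forcing(tgt_ids):
    """Decoder input drops the last token; the labels drop the first one."""
    return tgt_ids[:-1], tgt_ids[1:]


def smooth_label(label, num_classes, epsilon=0.1):
    """Assumed formulation: (1 - epsilon) * one_hot + epsilon / num_classes."""
    off_value = epsilon / num_classes
    dist = [off_value] * num_classes
    dist[label] = 1.0 - epsilon + off_value
    return dist


# Toy target with ids for: <bos> o gato <eos>
decoder_input, labels = shift_for_teacher_forcing([1, 4, 6, 2])
print(decoder_input, labels)
print(smooth_label(labels[0], num_classes=8))
```

The shift is also why the training step passes `tgt_valid_length - 1`: after dropping one token, every target sequence is one position shorter.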

In [16]:
from gluonnlp.loss import MaskedSoftmaxCELoss

loss_function = MaskedSoftmaxCELoss(sparse_label=False)
loss_function.hybridize()

est = MyEstimator(net=model, loss=loss_function, metrics=[metric.Loss()], trainer=trainer, context=ctx)

In [17]:
est.fit(train_data=loader_train, epochs=100)

Training begin: using optimizer Adam with current learning rate 0.0003 
Train for 100 epochs.
[Epoch 0] Begin, current learning rate: 0.0003
[Epoch 0] Finished in 14.830s, training loss: 0.1622
[Epoch 1] Begin, current learning rate: 0.0003
[Epoch 1] Finished in 8.969s, training loss: 0.1596
[Epoch 2] Begin, current learning rate: 0.0003
[Epoch 2] Finished in 8.981s, training loss: 0.1596
[Epoch 3] Begin, current learning rate: 0.0003
[Epoch 3] Finished in 8.983s, training loss: 0.1592
[Epoch 4] Begin, current learning rate: 0.0003
[Epoch 4] Finished in 9.036s, training loss: 0.1570
[Epoch 5] Begin, current learning rate: 0.0003
[Epoch 5] Finished in 8.997s, training loss: 0.1564
[Epoch 6] Begin, current learning rate: 0.0003
[Epoch 6] Finished in 8.963s, training loss: 0.1565
[Epoch 7] Begin, current learning rate: 0.0003
[Epoch 7] Finished in 9.041s, training loss: 0.1562
[Epoch 8] Begin, current learning rate: 0.0003
[Epoch 8] Finished in 9.077s, training loss: 0.1559
[Epoch 9] Begi

In [20]:
from nmt.translation import BeamSearchTranslator

def translate(model, data_loader):
    scorer = nlp.model.BeamSearchScorer(alpha=0.6,  K=5)
    translator = BeamSearchTranslator(model=model, beam_size=4, scorer=scorer)
    translation_out = dict()
    for train_batch in data_loader:
        src, _, src_valid_length, _, idx = [x.as_in_context(ctx) for x in train_batch]
        samples, _, sample_valid_length = translator.translate(src, src_valid_length)
        max_score_sample = samples[:, 0, :].asnumpy()
        sample_valid_length = sample_valid_length[:, 0].asnumpy()
        idx = idx.asnumpy().tolist()
        for i in range(max_score_sample.shape[0]):
            translation_out[idx[i]] = vocab.to_tokens(max_score_sample[i][1:(sample_valid_length[i] - 1)].tolist())
    translation_out = [translation_out[i] for i in range(len(translation_out))]        
    return translation_out
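
The `alpha` and `K` arguments of the scorer control length normalization during beam search. Assuming the scorer follows the GNMT-style length penalty, as we understand the GluonNLP documentation to describe, a hypothesis score is its accumulated log-probability divided by `((K + length) / (K + 1)) ** alpha`. The helper names below are hypothetical and the sketch is not the library code.

```python
def length_penalty(length, alpha=0.6, K=5):
    """GNMT-style length penalty: ((K + length) / (K + 1)) ** alpha."""
    return ((K + length) / (K + 1)) ** alpha


def normalized_score(log_prob, length, alpha=0.6, K=5):
    """Accumulated log-probability divided by the length penalty."""
    return log_prob / length_penalty(length, alpha, K)


# Two hypotheses with the same average per-token log-probability (-0.5).
print(normalized_score(-5.0, length=10))
print(normalized_score(-10.0, length=20))
```

With `alpha=0` the penalty is always 1 and longer hypotheses are penalized purely by their lower total log-probability. Larger values of `alpha` reduce that bias towards short outputs.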

In [21]:
translation_out = translate(model, loader_dev)

In [22]:
from nmt.bleu import compute_bleu, _bpe_to_words

bleu_score, ngram_precisions, brevity_penalty, reference_length, translation_length = \
    compute_bleu(reference_corpus_list=[sentences_dev.transform(lambda src, tgt, idx: tgt)],
                 translation_corpus=translation_out, bpe=True)

print(bleu_score, ngram_precisions, brevity_penalty)

0 [0.0007478005865102639, 0.0, 0.0, 0.0] 1
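
The BLEU score of 0 follows directly from the n-gram precisions printed above. Assuming the standard unsmoothed definition, BLEU is the brevity penalty times the geometric mean of the n-gram precisions, so a single zero precision makes the whole score zero. The function `combine_bleu` below is a hypothetical sketch of that last step, not the `compute_bleu` implementation used in this notebook.

```python
import math


def combine_bleu(precisions, brevity_penalty):
    """BLEU = BP * exp(mean(log p_n)); zero if any n-gram precision is zero."""
    if min(precisions) <= 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty * math.exp(log_mean)


# Values printed by the cell above: only a few unigrams match the reference.
print(combine_bleu([0.0007478005865102639, 0.0, 0.0, 0.0], 1))
# A made-up example with non-zero precisions for comparison.
print(combine_bleu([0.5, 0.25, 0.125, 0.0625], 1.0))
```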


In [23]:
print(sentences_dev[0][0])
print(sentences_dev[0][1])
print(translation_out[0])

['(', 'A@@', 'ra@@', 'bi@@', 'c', ')', 'I', 'see@@', 'k', 're@@', 'fu@@', 'ge', 'in', 'Al@@', 'la@@', 'h', 'from', 'cur@@', 'sed', 'Sa@@', 'tan', '.', 'In', 'the', 'Na@@', 'me', 'of', 'Al@@', 'la@@', 'h', ',', 'the', 'most', 'Gra@@', 'ci@@', 'ous', ',', 'the', 'most', 'M@@', 'er@@', 'ci@@', 'ful', '.']
['(', 'Á@@', 'ra@@', 'be', ')', 'R@@', 'ef@@', 'ú@@', 'xi@@', 'o@@', 'me', 'en', 'Al@@', 'á', 'de', 'S@@', 'at@@', 'an@@', 'ás', ',', 'o', 'mal@@', 'di@@', 'to', '.', 'En', 'nome', 'de', 'Al@@', 'á', ',', 'o', 'mi@@', 'seri@@', 'co@@', 'ri@@', 'di@@', 'oso', '.']
['(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '(', '

If you'd like to train your own machine translation models, check the GluonNLP [Model Zoo](http://gluon-nlp.mxnet.io/model_zoo/machine_translation/index.html) or the source on GitHub: [github.com/dmlc/gluon-nlp](https://github.com/dmlc/gluon-nlp/tree/master/scripts/machine_translation).