# Examine Multiple Layers for Neural Machine Translation

After implementing [Attention](AttentionModelForMachineTranslationWithTensorflow.ipynb), I'll first come back to examine whether it's worth implementing multiple layers (of encoders and decoders) instead of only one. Of course, Deep Learning is all about deep nets, so at some point I have to check it.

In my first attempts I rejected the idea, as a simple two-layer encoder didn't result in significant improvements. After reading [The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation](https://arxiv.org/abs/1804.09849) and looking into the [Google NMT](https://github.com/tensorflow/nmt) project, I realized that multiple layers are worth it if done correctly.

Compared to naive stacking, I made several hyperparameter and design decisions here:
* _Residual Connections_: While I knew residuals from CNNs like [ResNet](https://arxiv.org/abs/1512.03385), I didn't realize how important they are for stacked RNNs. In CNNs they are crucial for very deep networks, but here they already matter for two RNN layers. It is a bit disappointing that implementing a bidirectional stacked RNN is still more difficult than it should be. In the end I copy+pasted an `ExtendedMultiRNNCell` implementation from Google's [seq2seq](https://github.com/google/seq2seq) project. It's easy to make subtle mistakes where the computation graph compiles and is trainable, but in the end performs poorly.
* _L2 Regularization/Weight decay_: For a simple 1-layer RNN, dropout is enough regularization. When I experimented with L2 regularization for a 1-layer RNN, performance decreased. For multiple layers it seems useful to have a *low* L2 regularizer so the layers can learn better. As it might be counterproductive, I don't L2-regularize the embeddings (they are pretrained anyway, so they only need some fine adjustments). *Note:* There is a subtle but important difference between L2 regularization via adding an L2 term to the loss, as implemented here, and implementing weight decay in combination with the Adam optimizer used here. Read more about this _bug_ from [Boris Babenko](https://bbabenko.github.io/weight-decay/), from [fast.ai](http://www.fast.ai/2018/07/02/adam-weight-decay/), or directly in the [arXiv paper](https://arxiv.org/abs/1711.05101). Weight decay would be the right thing to implement; the L2 loss is *wrong* here and might be the reason for the poor performance. I'll rerun it with weight decay instead of the L2 loss.
* _Learning Rate_: I preferred not to handcraft a schedule, but here I followed the advice I found to use three different phases: a warmup phase with a linearly increasing learning rate but without training the pretrained embeddings, a second phase where everything is trained with the default learning rate, and a third phase with exponential learning rate decay. The idea is to first find a solid initialization for the RNN layers, then train them until a first saturation, and then lower the learning rate so training can still make progress (although a bit more slowly).
* _Epochs_: I had to increase the number of epochs. Lowering the learning rate always requires some more epochs. Also, as there are many more parameters with multiple (interconnected) layers, it takes longer to reach saturation. So besides needing more time per epoch anyway, because the number of RNN parameters grows with the number of layers, we also need more epochs, so in the end training takes much longer.
* _Layers_: Google is funny. Google NMT works with 8 layers, as they can run each layer on a separate GPU in parallel. Of course, having 8 high-performance GPUs in one computer is great. Well, I don't have the resources, so all I can do here is implement 2 layers. I'd love to have >= 4 layers (so they look a bit beyond the word level), but that's not doable here for a side project.
* _Label smoothing_: Would be nice to have. But for it, we would need to one-hot-encode the RNN outputs, which would drastically increase the computational effort. See this [stackoverflow](https://stackoverflow.com/questions/49136472/tensorflow-sequence-loss-with-label-smoothing) entry about it. The (hidden) usage of `sparse_softmax_cross_entropy` instead of `softmax_cross_entropy` might be a big reason why my TensorFlow solution is much more efficient in computing time than the Keras solution.
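
As a side note on the label smoothing point: assuming a uniform smoothing distribution, the smoothed cross entropy can be rewritten as a mix of the sparse cross entropy and the mean negative log-probability over the vocabulary. The following is only a minimal pure-Python sketch of that identity and not part of the model in this notebook; the helper names and `epsilon=0.1` are illustrative assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax for one list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def smoothed_cross_entropy(logits, target_id, epsilon=0.1):
    """Label-smoothed loss computed from a sparse (integer) target."""
    log_probs = log_softmax(logits)
    sparse_loss = -log_probs[target_id]              # plain sparse cross entropy
    uniform_loss = -sum(log_probs) / len(log_probs)  # cross entropy against the uniform distribution
    return (1.0 - epsilon) * sparse_loss + epsilon * uniform_loss


def dense_smoothed_cross_entropy(logits, target_id, epsilon=0.1):
    """Reference version that materializes the smoothed one-hot target."""
    num_classes = len(logits)
    target = [epsilon / num_classes] * num_classes
    target[target_id] += 1.0 - epsilon
    return -sum(q * lp for q, lp in zip(target, log_softmax(logits)))


# Hypothetical logits for a vocabulary of four tokens
logits = [2.0, -1.0, 0.5, 0.0]
sparse_version = smoothed_cross_entropy(logits, target_id=2)
dense_version = dense_smoothed_cross_entropy(logits, target_id=2)
```

Both versions agree up to floating point error. Whether the sparse formulation is actually cheaper depends on the framework, since the sum over the vocabulary still touches all logits.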

In [1]:
import re

import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.python.layers import core as layers_core
from tensorflow.python.util import nest
from tqdm import tqdm_notebook as tqdm

from utils.download import download_and_extract_resources
from utils.linguistic import bleu_scores_europarl, preprocess_input_europarl as preprocess
from utils.preparation import check_gpu_working, Europarl, RANDOM_STATE

check_gpu_working()

Fixed random seed to 42
Availabe devices: [name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 2047769553909087464
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 7769512346
locality {
  bus_id: 1
  links {
  }
}
incarnation: 17468272896898241021
physical_device_desc: "device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1"
]
Cuda/Cudnn/GPU works as intended


In [2]:
MAX_INPUT_LENGTH = 400
MAX_TARGET_LENGTH = 450
LATENT_DIM = 256
LAYERS = 2
EPOCHS = 25
BATCH_SIZE = 128
DROPOUT = 0.25  # Dropout is applied to both input and output of the RNN cells, so the effective dropout is close to 0.5; this works slightly better
TEST_SIZE = 2500
BEAM_WIDTH = 5
EMBEDDING_TRAINABLE = True  # after warmup phase
WARMUP_EPOCHS = 5
LEARNING_RATE_DECAY_START_EPOCH = 10
LEARNING_RATE_DECAY_RATE = 0.97
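
The three learning rate phases described in the introduction could be expressed with these constants roughly as follows. This is a sketch under assumptions, not the training loop of this notebook: it assumes zero-based epochs, a base learning rate of `1e-3`, a per-epoch (not per-step) schedule, and decay starting exactly at `LEARNING_RATE_DECAY_START_EPOCH`. The function names are hypothetical.

```python
WARMUP_EPOCHS = 5
LEARNING_RATE_DECAY_START_EPOCH = 10
LEARNING_RATE_DECAY_RATE = 0.97
BASE_LEARNING_RATE = 1e-3  # assumption: default learning rate


def learning_rate_for_epoch(epoch, base_lr=BASE_LEARNING_RATE):
    """Return the learning rate for a zero-based epoch index."""
    if epoch < WARMUP_EPOCHS:
        # Phase 1: linear warmup up to the base learning rate
        return base_lr * (epoch + 1) / WARMUP_EPOCHS
    if epoch < LEARNING_RATE_DECAY_START_EPOCH:
        # Phase 2: constant base learning rate
        return base_lr
    # Phase 3: exponential decay, one factor per epoch
    decay_steps = epoch - LEARNING_RATE_DECAY_START_EPOCH + 1
    return base_lr * LEARNING_RATE_DECAY_RATE ** decay_steps


def embeddings_trainable(epoch):
    """Pretrained embeddings stay frozen during the warmup phase."""
    return epoch >= WARMUP_EPOCHS


schedule = [learning_rate_for_epoch(epoch) for epoch in range(25)]
```

The resulting value per epoch could then be fed into a learning rate placeholder of the graph.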

## Download and explore data

In [3]:
europarl = Europarl()
download_and_extract_resources(fnames_and_urls=europarl.external_resources, dest_path=europarl.path)
europarl.load_and_preprocess(max_input_length=MAX_INPUT_LENGTH, max_target_length=MAX_TARGET_LENGTH)

de-en.tgz already downloaded (188.6 MB)
en.wiki.bpe.op5000.model already downloaded (0.3 MB)
en.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (6.2 MB)
de.wiki.bpe.op5000.model already downloaded (0.3 MB)
de.wiki.bpe.op5000.d300.w2v.bin.tar.gz already downloaded (5.7 MB)
Total number of unfiltered translations 1920209
Filtered translations with length between (1, input=400/target=450) characters: 1864679


In [4]:
europarl.df[['input_texts', 'target_texts']].head()

| | input_texts | target_texts |
|---|---|---|
| 0 | resumption of the session | wiederaufnahme der sitzungsperiode |
| 1 | i declare resumed the session of the european ... | ich erkläre die am freitag, dem 0. dezember un... |
| 2 | although, as you will have seen, the dreaded '... | wie sie feststellen konnten, ist der gefürchte... |
| 3 | you have requested a debate on this subject in... | im parlament besteht der wunsch nach einer aus... |
| 4 | in the meantime, i should like to observe a mi... | heute möchte ich sie bitten - das ist auch der... |


In [5]:
print("English subwords", europarl.bpe_input.sentencepiece.EncodeAsPieces("this is a test for pretrained bytepairembeddings"))
print("German subwords", europarl.bpe_target.sentencepiece.EncodeAsPieces("das ist ein test für vortrainierte zeichengruppen"))

English subwords ['▁this', '▁is', '▁a', '▁test', '▁for', '▁pre', 'tr', 'ained', '▁by', 'te', 'pa', 'ire', 'm', 'bed', 'd', 'ings']
German subwords ['▁das', '▁ist', '▁ein', '▁test', '▁für', '▁v', 'ort', 'rain', 'ierte', '▁zeich', 'eng', 'ruppen']


In [6]:
# Those will be the inputs for the seq2seq model (that needs to know how long the sequences can get)
max_len_input = europarl.df.input_sequences.apply(len).max()
max_len_target = europarl.df.target_sequences.apply(len).max()
(max_len_input, max_len_target)

(161, 171)
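
Since the sequences differ in length, a batch has to be padded to a common length, and the true lengths are passed separately (the model below uses `input_sequence_length` and `target_sequence_length` for that). A minimal sketch of this step, assuming a padding id of `0`; the helper `pad_batch` is hypothetical and not taken from the `utils` package:

```python
def pad_batch(sequences, pad_id=0):
    """Pad integer sequences to the longest one and return their true lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


# Three toy token id sequences of different lengths
batch, lengths = pad_batch([[5, 8, 2], [7], [3, 4]])
```

Here `batch` would be fed as the token ids and `lengths` as the sequence lengths, so the RNN can ignore the padded positions.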

In [7]:
train_ids, val_ids = train_test_split(np.arange(europarl.df.shape[0]), test_size=0.1, random_state=RANDOM_STATE)  # fixed random_state

In [8]:
# copy+pasted from https://github.com/google/seq2seq/blob/7f485894d412e8d81ce0e07977831865e44309ce/seq2seq/contrib/rnn_cell.py#L39
class ExtendedMultiRNNCell(tf.nn.rnn_cell.MultiRNNCell):
    """Extends the Tensorflow MultiRNNCell with residual connections"""

    def __init__(
        self,
        cells,
        residual_connections=False,
        residual_combiner="add",
        residual_dense=False
    ):
        """Create a RNN cell composed sequentially of a number of RNNCells.
        Args:
          cells: list of RNNCells that will be composed in this order.
          state_is_tuple: If True, accepted and returned states are n-tuples, where
            `n = len(cells)`.  If False, the states are all
            concatenated along the column axis.  This latter behavior will soon be
            deprecated.
          residual_connections: If true, add residual connections between all cells.
            This requires all cells to have the same output_size. Also, iff the
            input size is not equal to the cell output size, a linear transform
            is added before the first layer.
          residual_combiner: One of "add", "mean" or "concat". To create inputs for
            layer t+1 either "add" the inputs from the prev layer, add their "mean",
            or "concat" them.
          residual_dense: Densely connect each layer to all other layers
        Raises:
          ValueError: if cells is empty (not allowed), or at least one of the cells
            returns a state tuple but the flag `state_is_tuple` is `False`.
        """
        super(ExtendedMultiRNNCell, self).__init__(cells, state_is_tuple=True)
        assert residual_combiner in ["add", "concat", "mean"]

        self._residual_connections = residual_connections
        self._residual_combiner = residual_combiner
        self._residual_dense = residual_dense

    def __call__(self, inputs, state, scope=None):
        """Run this multi-layer cell on inputs, starting from state."""
        if not self._residual_connections:
            return super(ExtendedMultiRNNCell, self).__call__(
                inputs, state, (scope or "extended_multi_rnn_cell")
            )

        with tf.variable_scope(scope or "extended_multi_rnn_cell"):
            # Adding Residual connections are only possible when input and output
            # sizes are equal. Optionally transform the initial inputs to
            # `cell[0].output_size`
            if self._cells[0].output_size != inputs.get_shape().as_list()[1] and (self._residual_combiner in ["add", "mean"]):
                inputs = tf.contrib.layers.fully_connected(
                    inputs=inputs,
                    num_outputs=self._cells[0].output_size,
                    activation_fn=None,
                    scope="input_transform"
                )

            # Iterate through all layers (code from MultiRNNCell)
            cur_inp = inputs
            prev_inputs = [cur_inp]
            new_states = []
            for i, cell in enumerate(self._cells):
                with tf.variable_scope("cell_%d" % i):
                    if not nest.is_sequence(state):
                        raise ValueError(
                            "Expected state to be a tuple of length %d, but received: %s" %
                            (len(self.state_size), state)
                        )
                    cur_state = state[i]
                    next_input, new_state = cell(cur_inp, cur_state)

                    # Either combine all previous inputs or only the current input
                    input_to_combine = prev_inputs[-1:]
                    if self._residual_dense:
                        input_to_combine = prev_inputs

                    # Add Residual connection
                    if self._residual_combiner == "add":
                        next_input = next_input + sum(input_to_combine)
                    elif self._residual_combiner == "mean":
                        combined_mean = tf.reduce_mean(tf.stack(input_to_combine), 0)
                        next_input = next_input + combined_mean
                    elif self._residual_combiner == "concat":
                        next_input = tf.concat([next_input] + input_to_combine, 1)

                    cur_inp = next_input
                    prev_inputs.append(cur_inp)

                    new_states.append(new_state)
        # `array_ops` is not imported in this notebook, so use the public `tf.concat` instead
        new_states = (tuple(new_states) if self._state_is_tuple else tf.concat(new_states, 1))
        return cur_inp, new_states
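
To make the combiner logic of `ExtendedMultiRNNCell.__call__` easier to follow, here is a TensorFlow-free sketch on plain lists of floats. It only mirrors the "add", "mean" and "concat" branches and the `residual_dense` switch for a single time step; the optional input transform and the RNN states are left out, and the names `combine_residual`, `run_stack` and `double` are hypothetical.

```python
def combine_residual(cell_output, prev_inputs, combiner="add", dense=False):
    """Combine a layer output with the previous layer input(s)."""
    # Dense: use the inputs of all previous layers, otherwise only the latest one
    to_combine = prev_inputs if dense else prev_inputs[-1:]
    if combiner == "add":
        return [o + sum(vals) for o, vals in zip(cell_output, zip(*to_combine))]
    if combiner == "mean":
        return [o + sum(vals) / len(to_combine)
                for o, vals in zip(cell_output, zip(*to_combine))]
    if combiner == "concat":
        return list(cell_output) + [x for inp in to_combine for x in inp]
    raise ValueError("unknown combiner: %s" % combiner)


def run_stack(cells, inputs, combiner="add", dense=False):
    """Run a stack of stateless toy 'cells' with residual connections."""
    cur = list(inputs)
    prev_inputs = [cur]
    for cell in cells:
        out = cell(cur)
        cur = combine_residual(out, prev_inputs, combiner, dense)
        prev_inputs.append(cur)
    return cur


def double(values):
    """Toy stand-in for an RNN cell: doubles every feature."""
    return [2 * v for v in values]


added = run_stack([double, double], [1.0, 2.0])
concatenated = run_stack([double, double], [1.0, 2.0], combiner="concat")
```

With "add" and "mean" the feature size stays constant across layers, while "concat" grows it with every layer, which is why only the first two need matching input and output sizes.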

In [9]:
# Copy+pasted from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/opt/python/training/weight_decay_optimizers.py
# would need tensorflow >= 1.9.2, here I use 1.8

from tensorflow.python.framework import ops
from tensorflow.python.ops import control_flow_ops
from tensorflow.python.ops import resource_variable_ops
from tensorflow.python.ops import state_ops
from tensorflow.python.training import adam
from tensorflow.python.training import momentum as momentum_opt
from tensorflow.python.training import optimizer
from tensorflow.python.util.tf_export import tf_export


class DecoupledWeightDecayExtension(object):
  """This class allows to extend optimizers with decoupled weight decay.
  It implements the decoupled weight decay described by Loshchilov & Hutter
  (https://arxiv.org/pdf/1711.05101.pdf), in which the weight decay is
  decoupled from the optimization steps w.r.t. to the loss function.
  For SGD variants, this simplifies hyperparameter search since it decouples
  the settings of weight decay and learning rate.
  For adaptive gradient algorithms, it regularizes variables with large
  gradients more than L2 regularization would, which was shown to yield better
  training loss and generalization error in the paper above.
  This class alone is not an optimizer but rather extends existing
  optimizers with decoupled weight decay. We explicitly define the two examples
  used in the above paper (SGDW and AdamW), but in general this can extend
  any OptimizerX by using
  `extend_with_decoupled_weight_decay(OptimizerX)`.
  In order for it to work, it must be the first class the Optimizer with
  weight decay inherits from, e.g.
  ```python
  class AdamWOptimizer(DecoupledWeightDecayExtension, adam.AdamOptimizer):
    def __init__(self, weight_decay, *args, **kwargs):
      super(AdamWOptimizer, self).__init__(weight_decay, *args, **kwargs)
  ```
  Note that this extension decays weights BEFORE applying the update based
  on the gradient, i.e. this extension only has the desired behaviour for
  optimizers which do not depend on the value of 'var' in the update step!
  """

  def __init__(self, weight_decay, **kwargs):
    """Construct the extension class that adds weight decay to an optimizer.
    Args:
      weight_decay: A `Tensor` or a floating point value, the factor by which
        a variable is decayed in the update step.
      **kwargs: Keyword arguments that are passed on to the constructor of
        the base optimizer.
    """
    self._decay_var_list = None  # is set in minimize or apply_gradients
    self._weight_decay = weight_decay
    # The tensors are initialized in call to _prepare
    self._weight_decay_tensor = None
    super(DecoupledWeightDecayExtension, self).__init__(**kwargs)

  def minimize(self, loss, global_step=None, var_list=None,
               gate_gradients=optimizer.Optimizer.GATE_OP,
               aggregation_method=None, colocate_gradients_with_ops=False,
               name=None, grad_loss=None, decay_var_list=None):
    """Add operations to minimize `loss` by updating `var_list` with decay.
    This function is the same as Optimizer.minimize except that it allows to
    specify the variables that should be decayed using decay_var_list.
    If decay_var_list is None, all variables in var_list are decayed.
    For more information see the documentation of Optimizer.minimize.
    Args:
      loss: A `Tensor` containing the value to minimize.
      global_step: Optional `Variable` to increment by one after the
        variables have been updated.
      var_list: Optional list or tuple of `Variable` objects to update to
        minimize `loss`.  Defaults to the list of variables collected in
        the graph under the key `GraphKeys.TRAINABLE_VARIABLES`.
      gate_gradients: How to gate the computation of gradients.  Can be
        `GATE_NONE`, `GATE_OP`, or  `GATE_GRAPH`.
      aggregation_method: Specifies the method used to combine gradient terms.
        Valid values are defined in the class `AggregationMethod`.
      colocate_gradients_with_ops: If True, try colocating gradients with
        the corresponding op.
      name: Optional name for the returned operation.
      grad_loss: Optional. A `Tensor` holding the gradient computed for `loss`.
      decay_var_list: Optional list of decay variables.
    Returns:
      An Operation that updates the variables in `var_list`.  If `global_step`
      was not `None`, that operation also increments `global_step`.
    """
    self._decay_var_list = set(decay_var_list) if decay_var_list else False
    return super(DecoupledWeightDecayExtension, self).minimize(
        loss, global_step=global_step, var_list=var_list,
        gate_gradients=gate_gradients, aggregation_method=aggregation_method,
        colocate_gradients_with_ops=colocate_gradients_with_ops, name=name,
        grad_loss=grad_loss)

  def apply_gradients(self, grads_and_vars, global_step=None, name=None,
                      decay_var_list=None):
    """Apply gradients to variables and decay the variables.
    This function is the same as Optimizer.apply_gradients except that it
    allows to specify the variables that should be decayed using
    decay_var_list. If decay_var_list is None, all variables in var_list
    are decayed.
    For more information see the documentation of Optimizer.apply_gradients.
    Args:
      grads_and_vars: List of (gradient, variable) pairs as returned by
        `compute_gradients()`.
      global_step: Optional `Variable` to increment by one after the
        variables have been updated.
      name: Optional name for the returned operation.  Default to the
        name passed to the `Optimizer` constructor.
      decay_var_list: Optional list of decay variables.
    Returns:
      An `Operation` that applies the specified gradients. If `global_step`
      was not None, that operation also increments `global_step`.
    """
    self._decay_var_list = set(decay_var_list) if decay_var_list else False
    return super(DecoupledWeightDecayExtension, self).apply_gradients(
        grads_and_vars, global_step=global_step, name=name)

  def _prepare(self):
    weight_decay = self._weight_decay
    if callable(weight_decay):
      weight_decay = weight_decay()
    self._weight_decay_tensor = ops.convert_to_tensor(
        weight_decay, name="weight_decay")
    # Call the optimizers _prepare function.
    super(DecoupledWeightDecayExtension, self)._prepare()

  def _decay_weights_op(self, var):
    if not self._decay_var_list or var in self._decay_var_list:
      return var.assign_sub(self._weight_decay * var, self._use_locking)
    return control_flow_ops.no_op()

  def _decay_weights_sparse_op(self, var, indices, scatter_add):
    if not self._decay_var_list or var in self._decay_var_list:
      return scatter_add(var, indices, -self._weight_decay * var,
                         self._use_locking)
    return control_flow_ops.no_op()

  # Here, we overwrite the apply functions that the base optimizer calls.
  # super().apply_x resolves to the apply_x function of the BaseOptimizer.
  def _apply_dense(self, grad, var):
    with ops.control_dependencies([self._decay_weights_op(var)]):
      return super(DecoupledWeightDecayExtension, self)._apply_dense(grad, var)

  def _resource_apply_dense(self, grad, var):
    with ops.control_dependencies([self._decay_weights_op(var)]):
      return super(DecoupledWeightDecayExtension, self)._resource_apply_dense(
          grad, var)

  def _apply_sparse(self, grad, var):
    scatter_add = state_ops.scatter_add
    decay_op = self._decay_weights_sparse_op(var, grad.indices, scatter_add)
    with ops.control_dependencies([decay_op]):
      return super(DecoupledWeightDecayExtension, self)._apply_sparse(
          grad, var)

  def _resource_scatter_add(self, x, i, v, _=None):
    # last argument allows for one overflow argument, to have the same function
    # signature as state_ops.scatter_add
    with ops.control_dependencies(
        [resource_variable_ops.resource_scatter_add(x.handle, i, v)]):
      return x.value()

  def _resource_apply_sparse(self, grad, var, indices):
    scatter_add = self._resource_scatter_add
    decay_op = self._decay_weights_sparse_op(var, indices, scatter_add)
    with ops.control_dependencies([decay_op]):
      return super(DecoupledWeightDecayExtension, self)._resource_apply_sparse(
          grad, var, indices)


def extend_with_decoupled_weight_decay(base_optimizer):
  """Factory function returning an optimizer class with decoupled weight decay.
  Returns an optimizer class. An instance of the returned class computes the
  update step of `base_optimizer` and additionally decays the weights.
  E.g., the class returned by
  `extend_with_decoupled_weight_decay(tf.train.AdamOptimizer)` is equivalent to
  `tf.contrib.opt.AdamWOptimizer`.
  The API of the new optimizer class slightly differs from the API of the
  base optimizer:
  - The first argument to the constructor is the weight decay rate.
  - `minimize` and `apply_gradients` accept the optional keyword argument
    `decay_var_list`, which specifies the variables that should be decayed.
    If `None`, all variables that are optimized are decayed.
  Usage example:
  ```python
  # MyAdamW is a new class
  MyAdamW = extend_with_decoupled_weight_decay(tf.train.AdamOptimizer)
  # Create a MyAdamW object
  optimizer = MyAdamW(weight_decay=0.001, learning_rate=0.001)
  sess.run(optimizer.minimize(loss, decay_var_list=[var1, var2]))
  ```
  Note that this extension decays weights BEFORE applying the update based
  on the gradient, i.e. this extension only has the desired behaviour for
  optimizers which do not depend on the value of 'var' in the update step!
  Args:
    base_optimizer: An optimizer class that inherits from tf.train.Optimizer.
  Returns:
    A new optimizer class that inherits from DecoupledWeightDecayExtension
    and base_optimizer.
  """

  class OptimizerWithDecoupledWeightDecay(DecoupledWeightDecayExtension,
                                          base_optimizer):
    """Base_optimizer with decoupled weight decay.
    This class computes the update step of `base_optimizer` and
    additionally decays the variable with the weight decay being decoupled from
    the optimization steps w.r.t. to the loss function, as described by
    Loshchilov & Hutter (https://arxiv.org/pdf/1711.05101.pdf).
    For SGD variants, this simplifies hyperparameter search since
    it decouples the settings of weight decay and learning rate.
    For adaptive gradient algorithms, it regularizes variables with large
    gradients more than L2 regularization would, which was shown to yield
    better training loss and generalization error in the paper above.
    """

    def __init__(self, weight_decay, *args, **kwargs):
      # super delegation is necessary here
      # pylint: disable=useless-super-delegation
      super(OptimizerWithDecoupledWeightDecay, self).__init__(
          weight_decay, *args, **kwargs)
      # pylint: enable=useless-super-delegation

  return OptimizerWithDecoupledWeightDecay


@tf_export("contrib.opt.MomentumWOptimizer")
class MomentumWOptimizer(DecoupledWeightDecayExtension,
                         momentum_opt.MomentumOptimizer):
  """Optimizer that implements the Momentum algorithm with weight_decay.
  This is an implementation of the SGDW optimizer described in "Fixing
  Weight Decay Regularization in Adam" by Loshchilov & Hutter
  (https://arxiv.org/abs/1711.05101)
  ([pdf])(https://arxiv.org/pdf/1711.05101.pdf).
  It computes the update step of `train.MomentumOptimizer` and additionally
  decays the variable. Note that this is different from adding
  L2 regularization on the variables to the loss. Decoupling the weight decay
  from other hyperparameters (in particular the learning rate) simplifies
  hyperparameter search.
  For further information see the documentation of the Momentum Optimizer.
  Note that this optimizer can also be instantiated as
  ```python
  extend_with_decoupled_weight_decay(tf.train.MomentumOptimizer)(
      weight_decay=weight_decay, learning_rate=learning_rate, momentum=momentum)
  ```
  """

  def __init__(self, weight_decay, learning_rate, momentum,
               use_locking=False, name="MomentumW", use_nesterov=False):
    """Construct a new MomentumW optimizer.
    For further information see the documentation of the Momentum Optimizer.
    Args:
      weight_decay:  A `Tensor` or a floating point value.  The weight decay.
      learning_rate: A `Tensor` or a floating point value.  The learning rate.
      momentum: A `Tensor` or a floating point value.  The momentum.
      use_locking: If `True` use locks for update operations.
      name: Optional name prefix for the operations created when applying
        gradients.  Defaults to "MomentumW".
      use_nesterov: If `True` use Nesterov Momentum.
        See [Sutskever et al., 2013](
        http://jmlr.org/proceedings/papers/v28/sutskever13.pdf).
        This implementation always computes gradients at the value of the
        variable(s) passed to the optimizer. Using Nesterov Momentum makes the
        variable(s) track the values called `theta_t + mu*v_t` in the paper.
    @compatibility(eager)
    When eager execution is enabled, learning_rate, weight_decay and momentum
    can each be a callable that takes no arguments and returns the actual value
    to use. This can be useful for changing these values across different
    invocations of optimizer functions.
    @end_compatibility
    """
    super(MomentumWOptimizer, self).__init__(
        weight_decay, learning_rate=learning_rate, momentum=momentum,
        use_locking=use_locking, name=name, use_nesterov=use_nesterov)


# @tf_export("contrib.opt.AdamWOptimizer")
class AdamWOptimizer(DecoupledWeightDecayExtension, adam.AdamOptimizer):
  """Optimizer that implements the Adam algorithm with weight decay.
  This is an implementation of the AdamW optimizer described in "Fixing
  Weight Decay Regularization in Adam" by Loshchilov & Hutter
  (https://arxiv.org/abs/1711.05101)
  ([pdf])(https://arxiv.org/pdf/1711.05101.pdf).
  It computes the update step of `train.AdamOptimizer` and additionally decays
  the variable. Note that this is different from adding L2 regularization on
  the variables to the loss: it regularizes variables with large
  gradients more than L2 regularization would, which was shown to yield better
  training loss and generalization error in the paper above.
  For further information see the documentation of the Adam Optimizer.
  Note that this optimizer can also be instantiated as
  ```python
  extend_with_decoupled_weight_decay(tf.train.AdamOptimizer)(weight_decay=weight_decay)
  ```
  """

  def __init__(self, weight_decay, learning_rate=0.001, beta1=0.9, beta2=0.999,
               epsilon=1e-8, use_locking=False, name="AdamW"):
    """Construct a new AdamW optimizer.
    For further information see the documentation of the Adam Optimizer.
    Args:
      weight_decay:  A `Tensor` or a floating point value.  The weight decay.
      learning_rate: A Tensor or a floating point value.  The learning rate.
      beta1: A float value or a constant float tensor.
        The exponential decay rate for the 1st moment estimates.
      beta2: A float value or a constant float tensor.
        The exponential decay rate for the 2nd moment estimates.
      epsilon: A small constant for numerical stability. This epsilon is
        "epsilon hat" in the Kingma and Ba paper (in the formula just before
        Section 2.1), not the epsilon in Algorithm 1 of the paper.
      use_locking: If True use locks for update operations.
      name: Optional name for the operations created when applying gradients.
        Defaults to "AdamW".
    """
    super(AdamWOptimizer, self).__init__(
        weight_decay, learning_rate=learning_rate, beta1=beta1, beta2=beta2,
        epsilon=epsilon, use_locking=use_locking, name=name)
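
To see why the decoupled decay above differs from an L2 term in the loss, the following pure-Python sketch runs Adam on a single scalar weight in both variants. It is a simplified illustration under assumptions (one parameter, constant loss gradient, hypothetical function names), not a replacement for the optimizer classes above. As in `_decay_weights_op`, the decoupled variant subtracts `weight_decay * w` before the Adam update.

```python
import math


def adam_l2_steps(w, loss_grad, l2, steps, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam where the penalty 0.5 * l2 * w**2 is part of the loss."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = loss_grad + l2 * w  # penalty gradient is mixed into the gradient
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)  # penalty gets rescaled by Adam
    return w


def adamw_steps(w, loss_grad, weight_decay, steps, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with decoupled weight decay applied before the gradient update."""
    m = v = 0.0
    for t in range(1, steps + 1):
        w -= weight_decay * w  # decay is independent of the Adam statistics
        m = beta1 * m + (1 - beta1) * loss_grad
        v = beta2 * v + (1 - beta2) * loss_grad * loss_grad
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# No loss gradient at all, so only the regularization moves the weight
l2_weak = adam_l2_steps(1.0, loss_grad=0.0, l2=0.01, steps=100)
l2_strong = adam_l2_steps(1.0, loss_grad=0.0, l2=0.1, steps=100)
decay_weak = adamw_steps(1.0, loss_grad=0.0, weight_decay=0.01, steps=100)
decay_strong = adamw_steps(1.0, loss_grad=0.0, weight_decay=0.1, steps=100)
```

In this toy setting `l2_weak` and `l2_strong` end up almost identical, because Adam normalizes the penalty gradient by its own magnitude, whereas the decoupled variant shrinks the weight by a factor of `1 - weight_decay` per step.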

In [10]:
tf.reset_default_graph()

with tf.device('/gpu:0'):

    encoder_inputs = tf.placeholder(
        shape=(None, None),  # batch_size x max_len_input
        dtype=tf.int32,
        name='encoder_inputs' 
    )
    batch_size = tf.shape(encoder_inputs)[0]
    beam_width = tf.placeholder_with_default(1, shape=[])
    dropout = tf.placeholder_with_default(tf.cast(0.0, tf.float32), shape=[])
    keep_prob = tf.cast(1.0, tf.float32) - dropout
    learning_rate = tf.placeholder_with_default(tf.cast(1e-3, tf.float32), shape=[])

    embedding_encoder = tf.get_variable(
        "embedding_encoder", 
        initializer=tf.constant(europarl.bpe_input.embedding_matrix),
        trainable=EMBEDDING_TRAINABLE,
    )
    encoder_emb_inp = tf.nn.embedding_lookup(
        embedding_encoder,
        encoder_inputs,
        name="encoder_emb_inp"
    )
    
    input_sequence_length = tf.placeholder(
        shape=(None, ),
        dtype=tf.int32,
        name='input_sequence_length'
    )
    
    rnn_cell_type = tf.nn.rnn_cell.GRUCell
    encoder_forward_cells = [
        rnn_cell_type(num_units=LATENT_DIM // 2, name=f'encoder_forward_cell{layer}') 
        for layer in range(LAYERS)
    ]
    encoder_backward_cells = [
        rnn_cell_type(num_units=LATENT_DIM // 2, name=f'encoder_backward_cell{layer}') 
        for layer in range(LAYERS)
    ]
    encoder_forward_cells = [tf.nn.rnn_cell.DropoutWrapper(
        cell,
        input_keep_prob=keep_prob,
        output_keep_prob=keep_prob,
        dtype=tf.float32,
    ) for cell in encoder_forward_cells]
    encoder_backward_cells = [tf.nn.rnn_cell.DropoutWrapper(
        cell,
        input_keep_prob=keep_prob,
        output_keep_prob=keep_prob,
        dtype=tf.float32,
    ) for cell in encoder_backward_cells]
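    # NOTE: `._cells` hands the plain list of wrapped cells to
    # `stack_bidirectional_dynamic_rnn`, so `ExtendedMultiRNNCell.__call__`
    # (and with it the residual logic) is never invoked for the encoder.
    encoder_outputs, encoder_state_fw, encoder_state_bw = tf.contrib.rnn.stack_bidirectional_dynamic_rnn(
        cells_fw=ExtendedMultiRNNCell(encoder_forward_cells, residual_connections=True)._cells,
        cells_bw=ExtendedMultiRNNCell(encoder_backward_cells, residual_connections=True)._cells,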
        inputs=encoder_emb_inp,
        sequence_length=input_sequence_length,
        time_major=False,
        dtype=tf.float32,
    )
    encoder_state = tf.concat([encoder_state_fw[-1], encoder_state_bw[-1]], -1)
    
    # Regarding time_major:
    # If true, these `Tensors` must be shaped `[max_time, batch_size, depth]`.
    # If false, these `Tensors` must be shaped `[batch_size, max_time, depth]`.
    # Using `time_major = True` is a bit more efficient because it avoids
    # transposes at the beginning and end of the RNN calculation.  However,
    # most TensorFlow data is batch-major, so by default this function
    # accepts input and emits output in batch-major form.
    #
    # for simplicity I work with batch major here instead of time_major
    # so I don't need to transpose inputs and transpose back for attention mechanism
    
    decoder_inputs = tf.placeholder(
        shape=(None, None),  # batch_size x max_len_target
        dtype=tf.int32,
        name='decoder_inputs' 
    )
    embedding_decoder = tf.get_variable(
        "embedding_decoder", 
        initializer=tf.constant(europarl.bpe_target.embedding_matrix),
        trainable=EMBEDDING_TRAINABLE,
    )
    decoder_emb_inp = tf.nn.embedding_lookup(
        embedding_decoder,
        decoder_inputs,
        name="decoder_emb_inp"
    )
    
    target_sequence_length = tf.placeholder(
        shape=(None, ),
        dtype=tf.int32,
        name='target_sequence_length'
    )
    
    # tiling is necessary to work with BeamSearchDecoder
    # read carefully the NOTE on constructor in
    # https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/AttentionWrapper 
    tiled_encoder_outputs = tf.contrib.seq2seq.tile_batch(encoder_outputs, multiplier=beam_width)
    tiled_encoder_state = tf.contrib.seq2seq.tile_batch(encoder_state, multiplier=beam_width)
    tiled_sequence_length = tf.contrib.seq2seq.tile_batch(input_sequence_length, multiplier=beam_width)
    
    attention_mechanism = tf.contrib.seq2seq.LuongAttention(
        LATENT_DIM,
        memory=tiled_encoder_outputs,
        memory_sequence_length=tiled_sequence_length,
        dtype=tf.float32,
        name='attention_mechanism',
    )
    decoder_rnn_cells = [rnn_cell_type(num_units=LATENT_DIM, name=f'decoder_cell{layer}') for layer in range(LAYERS)]
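    def residual_fn(inputs, outputs):
        # AttentionWrapper feeds the cell concat([inputs, attention], -1) by default,
        # so only the first LATENT_DIM columns are the layer input to add to the outputs.
        # Size -1 keeps all rows; a fixed batch_size would not match the
        # batch_size * beam_width rows the cell sees during beam search.
        tf.contrib.framework.nest.assert_same_structure(inputs, outputs)
        inputs_without_attention = tf.slice(inputs, [0, 0], [-1, LATENT_DIM])
        return tf.contrib.framework.nest.map_structure(lambda inp, out: inp + out, inputs_without_attention, outputs)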
    for layer in range(1, LAYERS):
        decoder_rnn_cells[layer] = tf.contrib.rnn.ResidualWrapper(
            decoder_rnn_cells[layer],
            residual_fn=residual_fn
        )
    decoder_rnn_cells = [tf.nn.rnn_cell.DropoutWrapper(
        cell,
        input_keep_prob=keep_prob,
        output_keep_prob=keep_prob,
        dtype=tf.float32,
    ) for cell in decoder_rnn_cells]
    attention_cells = [tf.contrib.seq2seq.AttentionWrapper(
        cell,
        attention_mechanism,
        attention_layer_size=LATENT_DIM,
        name=f'attention_wrapper{layer}',
    ) for layer, cell in enumerate(decoder_rnn_cells)] 
    decoder_cell = tf.contrib.rnn.MultiRNNCell(attention_cells)

    training_helper = tf.contrib.seq2seq.TrainingHelper(
        inputs=decoder_emb_inp, 
        sequence_length=target_sequence_length,
        time_major=False,
        name="decoder_training_helper",
    )
    
    projection_layer = layers_core.Dense(
        units=len(europarl.bpe_target.tokens),
        use_bias=False,
        name='projection_layer',
    )
    
    initial_state = tuple(
        attention_cells[0].zero_state(dtype=tf.float32, batch_size=batch_size).clone(
            cell_state=encoder_state
        )
        for _ in range(LAYERS)
    )

    decoder = tf.contrib.seq2seq.BasicDecoder(
        cell=decoder_cell,
        helper=training_helper,
        initial_state=initial_state,
        output_layer=projection_layer,
    )
    outputs, _final_state, _final_sequence_length = tf.contrib.seq2seq.dynamic_decode(
        decoder,
        output_time_major=False,
        impute_finished=False,
    )
    logits = outputs.rnn_output
    
    decoder_outputs = tf.placeholder(
        shape=(None, None),  # batch_size x max_len_target
        dtype=tf.int32,
        name='decoder_outputs',
    )
    target_weights = tf.cast(tf.sequence_mask(target_sequence_length), dtype=tf.float32)
    weight_decay = tf.placeholder_with_default(tf.cast(1e-5, tf.float32), shape=[])
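    # No explicit L2 term (l2_lambda * loss_l2) is added to the loss anymore;
    # regularization is applied as decoupled weight decay by AdamWOptimizer below.
    train_loss = tf.contrib.seq2seq.sequence_loss(logits, decoder_outputs, target_weights)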
    params = tf.trainable_variables()
    gradients = tf.gradients(train_loss, params)
    clipped_gradients, _ = tf.clip_by_global_norm(
        t_list=gradients,
        clip_norm=1.,
    )
    
    params_without_embeddings = [v for v in tf.trainable_variables() if not re.match(r'embedding_(de|en)coder', v.name)]
    
    optimizer = AdamWOptimizer(learning_rate=learning_rate, weight_decay=weight_decay)
    update_step = optimizer.apply_gradients(zip(clipped_gradients, params), decay_var_list=params_without_embeddings)
    
    gradients_without_embeddings = tf.gradients(train_loss, params_without_embeddings)
    clipped_gradients_without_embeddings, _ = tf.clip_by_global_norm(
        t_list=gradients_without_embeddings,
        clip_norm=1.,
    )
    update_step_without_embeddings = optimizer.apply_gradients(zip(clipped_gradients_without_embeddings, params_without_embeddings))
    
    inference_decoder_initial_state = tuple(
        attention_cells[0].zero_state(
            dtype=tf.float32,
            batch_size=batch_size * beam_width  # tricky and somehow unintuitive, but necessary
        ).clone(
            cell_state=tiled_encoder_state
        ) for _ in range(LAYERS)
    )
    inference_decoder = tf.contrib.seq2seq.BeamSearchDecoder(
        cell=decoder_cell,
        embedding=embedding_decoder,
        start_tokens=tf.fill([batch_size], europarl.bpe_target.start_token_idx),
        end_token=europarl.bpe_target.stop_token_idx,
        initial_state=inference_decoder_initial_state,
        beam_width=BEAM_WIDTH,
        output_layer=projection_layer,
        length_penalty_weight=1.0,  # https://machinelearningmastery.com/configure-encoder-decoder-model-neural-machine-translation/
    )

    
    inference_outputs, _inference_final_state, _inference_final_sequence_length = tf.contrib.seq2seq.dynamic_decode(
        inference_decoder,
        maximum_iterations=tf.round(tf.reduce_max(input_sequence_length) * 2),  # a bit more flexible than max_len_target
        impute_finished=False,
    )
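
The decoder needs a custom `residual_fn` because the cell input is wider than the cell output: by default `AttentionWrapper` concatenates the attention vector to the layer input. Below is a minimal pure-Python sketch of that shape logic, using plain lists instead of tensors. The function name `residual_add` and the toy values are made up for illustration.

```python
def residual_add(cell_inputs, cell_outputs, latent_dim):
    """Add the first `latent_dim` input features of each row to the outputs.

    cell_inputs:  rows of length latent_dim + attention_dim
    cell_outputs: rows of length latent_dim
    """
    result = []
    for inp_row, out_row in zip(cell_inputs, cell_outputs):
        layer_input = inp_row[:latent_dim]  # drop the attention part
        result.append([i + o for i, o in zip(layer_input, out_row)])
    return result


# Two rows (e.g. two beams), latent_dim = 2, attention_dim = 2.
inputs = [[1.0, 2.0, 9.0, 9.0],
          [3.0, 4.0, 9.0, 9.0]]
outputs = [[0.5, 0.5],
           [0.5, 0.5]]
print(residual_add(inputs, outputs, latent_dim=2))  # [[1.5, 2.5], [3.5, 4.5]]
```

Every row gets its own residual. That matters during beam search, where the cell sees `batch_size * beam_width` rows.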

In [11]:
def run_train_batch(batch_ids, epoch):
    batch_input_sequences = europarl.df.input_sequences.iloc[batch_ids]
    batch_input_lengths = batch_input_sequences.apply(len)
    batch_target_sequences = europarl.df.target_sequences.iloc[batch_ids]
    batch_target_lengths = batch_target_sequences.apply(len) - 1

    batch_input_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_input_sequences,
        maxlen=max_len_input,
        dtype=int,
        padding='post'
    )
    batch_target_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_target_sequences,
        maxlen=max_len_target,
        dtype=int,
        padding='post'
    )
    base_lr = 1e-3
    if epoch < WARMUP_EPOCHS:
        lr = base_lr * (0.5 + epoch / (2 * WARMUP_EPOCHS))
    elif epoch < LEARNING_RATE_DECAY_START_EPOCH:
        lr = base_lr
    else:
        lr = base_lr * (LEARNING_RATE_DECAY_RATE ** (epoch - LEARNING_RATE_DECAY_START_EPOCH))
    pred, loss, _ = sess.run(
        fetches=[
            outputs, train_loss, update_step if epoch >= WARMUP_EPOCHS else update_step_without_embeddings
        ],
        feed_dict={
            encoder_inputs: batch_input_padded,
            input_sequence_length: np.array(batch_input_lengths),
            decoder_inputs: batch_target_padded[:, :batch_target_lengths.max()],
            target_sequence_length: np.array(batch_target_lengths),
            decoder_outputs: batch_target_padded[:, 1:batch_target_lengths.max() + 1],
            dropout: DROPOUT,
            # embedding_trainable: epoch >= WARMUP_EPOCHS,
            learning_rate: lr
        }
    )
    return loss, lr

def run_val_batch(batch_ids):
    batch_input_sequences = europarl.df.input_sequences.iloc[batch_ids]
    batch_input_lengths = batch_input_sequences.apply(len)
    batch_target_sequences = europarl.df.target_sequences.iloc[batch_ids]
    batch_target_lengths = batch_target_sequences.apply(len) - 1

    batch_input_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_input_sequences,
        maxlen=max_len_input,
        dtype=int,
        padding='post'
    )
    batch_target_padded = tf.keras.preprocessing.sequence.pad_sequences(
        batch_target_sequences,
        maxlen=max_len_target,
        dtype=int,
        padding='post'
    )
    loss = sess.run(
        fetches=[train_loss],
        feed_dict={
            encoder_inputs: batch_input_padded,
            input_sequence_length: np.array(batch_input_lengths),
            decoder_inputs: batch_target_padded[:, :batch_target_lengths.max()],
            target_sequence_length: np.array(batch_target_lengths),
            decoder_outputs: batch_target_padded[:, 1:batch_target_lengths.max() + 1],
        }
    )
    return loss

def run_validation_loss():
    return np.mean([
        run_val_batch(ids)
        for ids 
        in np.array_split(val_ids, np.ceil(len(val_ids) / BATCH_SIZE))
    ])
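
The learning-rate schedule in `run_train_batch` is a linear warm-up from half the base rate, then a plateau, then exponential decay. Here it is as a standalone sketch. The defaults `warmup_epochs=5`, `decay_start_epoch=10` and `decay_rate=0.97` are assumptions inferred from the learning rates printed in the training log, not read from the notebook's constants.

```python
def learning_rate_for_epoch(epoch, base_lr=1e-3, warmup_epochs=5,
                            decay_start_epoch=10, decay_rate=0.97):
    """Return the learning rate for a zero-based epoch index."""
    if epoch < warmup_epochs:
        # Linear warm-up from 0.5 * base_lr towards base_lr.
        return base_lr * (0.5 + epoch / (2 * warmup_epochs))
    if epoch < decay_start_epoch:
        # Plateau at the base learning rate.
        return base_lr
    # Exponential decay, one factor of decay_rate per epoch.
    return base_lr * decay_rate ** (epoch - decay_start_epoch)


for epoch in (0, 4, 5, 11, 24):
    print(f"epoch {epoch + 1}: lr={learning_rate_for_epoch(epoch):.6}")
```

Under these assumptions the embeddings are also excluded from the update during the first five epochs, because `update_step_without_embeddings` is used while `epoch < WARMUP_EPOCHS`.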

In [12]:
def predict(sentence):
    sequenced = europarl.bpe_input.subword_indices(preprocess(sentence))
    padded = tf.keras.preprocessing.sequence.pad_sequences(
        [sequenced],
        maxlen=max_len_input,
        dtype=int,
        padding='post'
    )
    
    beam_search_output = sess.run(
        fetches=[inference_outputs],
        feed_dict={
            encoder_inputs: padded,
            input_sequence_length: [len(sequenced)],
            beam_width: BEAM_WIDTH,
        }
    )[0]
    
    return europarl.bpe_target.sentencepiece.DecodePieces([
        europarl.bpe_target.tokens[idx] for idx in beam_search_output.predicted_ids[0, :, 0].tolist()
    ])
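
The indexing `predicted_ids[0, :, 0]` in `predict` relies on the batch-major layout of the beam search output: `[batch_size, time, beam_width]`, with beams ordered from best to worst. The pure-Python sketch below shows the same selection on nested lists. The helper `top_beam_ids`, its optional truncation at the stop token, and the toy ids are illustrative additions, not part of the notebook.

```python
def top_beam_ids(predicted_ids, sentence=0, stop_token_idx=None):
    """Pick the best beam for one sentence from a [batch][time][beam] nested list."""
    ids = [step[0] for step in predicted_ids[sentence]]  # beam 0 = best beam
    if stop_token_idx is not None and stop_token_idx in ids:
        ids = ids[:ids.index(stop_token_idx)]  # cut at the first stop token
    return ids


# Toy output: batch of 1 sentence, 4 time steps, beam width 2; token 2 is the stop token.
predicted_ids = [[[5, 7], [6, 6], [2, 8], [2, 2]]]
print(top_beam_ids(predicted_ids))                    # [5, 6, 2, 2]
print(top_beam_ids(predicted_ids, stop_token_idx=2))  # [5, 6]
```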

In [13]:
config = tf.ConfigProto(
    allow_soft_placement=True,  # needed as recommendation from https://github.com/tensorflow/tensorflow/issues/2292
    log_device_placement=True,
)
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

batches_per_epoch = np.ceil(len(train_ids) / BATCH_SIZE)
for epoch in range(EPOCHS):
    shuffled_ids = np.random.permutation(train_ids)
    batch_splits = np.array_split(shuffled_ids, batches_per_epoch)
    train_losses = []
    N = len(batch_splits)
    with tqdm(batch_splits, desc=f"Epoch {epoch+1}") as t:
        for train_batch_ids in t:
            batch_loss, lr = run_train_batch(train_batch_ids, epoch=epoch)
            train_losses.append(batch_loss)
            t.set_postfix(train_loss=np.mean(train_losses))
        print(f"learning rate={lr:.6}, train_loss={np.mean(train_losses):.6}, val_loss={run_validation_loss():.6}")
    
    if epoch > 0 and ((epoch+1) % 5 == 0) and (epoch+1 < EPOCHS):
        bleu = bleu_scores_europarl(
            input_texts=europarl.df.input_texts.iloc[val_ids[:TEST_SIZE]],
            target_texts=europarl.df.target_texts.iloc[val_ids[:TEST_SIZE]],
            predict=lambda text: predict(text)
        )
        print(f'average BLEU on test set = {bleu.mean()}')
        
validation_input_sequences = europarl.df.input_sequences.iloc[val_ids[:BATCH_SIZE]]
validation_input_lengths = validation_input_sequences.apply(len)

validation_input_padded = tf.keras.preprocessing.sequence.pad_sequences(
    validation_input_sequences,
    maxlen=max_len_input,
    dtype=int,
    padding='post'
)

Epoch 1: learning rate=0.0005, train_loss=3.51415, val_loss=2.52636
Epoch 2: learning rate=0.0006, train_loss=2.81383, val_loss=2.26058
Epoch 3: learning rate=0.0007, train_loss=2.6593, val_loss=2.15732
Epoch 4: learning rate=0.0008, train_loss=2.58927, val_loss=2.11433
Epoch 5: learning rate=0.0009, train_loss=2.5511, val_loss=2.08338
average BLEU on test set = 0.14742867793506825
Epoch 6: learning rate=0.001, train_loss=2.24391, val_loss=1.80562
Epoch 7: learning rate=0.001, train_loss=2.1181, val_loss=1.75325
Epoch 8: learning rate=0.001, train_loss=2.07112, val_loss=1.71901
Epoch 9: learning rate=0.001, train_loss=2.04338, val_loss=1.70204
Epoch 10: learning rate=0.001, train_loss=2.02408, val_loss=1.68641
average BLEU on test set = 0.19216522202581007
Epoch 11: learning rate=0.001, train_loss=2.01039, val_loss=1.67712
Epoch 12: learning rate=0.00097, train_loss=1.99403, val_loss=1.66369
Epoch 13: learning rate=0.0009409, train_loss=1.98071, val_loss=1.65155
Epoch 14: learning rate=0.000912673, train_loss=1.96839, val_loss=1.6437
Epoch 15: learning rate=0.000885293, train_loss=1.95737, val_loss=1.63398
average BLEU on test set = 0.20081519462137531
Epoch 16: learning rate=0.000858734, train_loss=1.94778, val_loss=1.62813
Epoch 17: learning rate=0.000832972, train_loss=1.93884, val_loss=1.62212
Epoch 18: learning rate=0.000807983, train_loss=1.93075, val_loss=1.61687
Epoch 19: learning rate=0.000783743, train_loss=1.92314, val_loss=1.60873
Epoch 20: learning rate=0.000760231, train_loss=1.9161, val_loss=1.60678
average BLEU on test set = 0.19941596977259465
Epoch 21: learning rate=0.000737424, train_loss=1.90963, val_loss=1.60114
Epoch 22: learning rate=0.000715301, train_loss=1.9037, val_loss=1.59932
Epoch 23: learning rate=0.000693842, train_loss=1.89771, val_loss=1.58867
Epoch 24: learning rate=0.000673027, train_loss=1.89226, val_loss=1.58643
Epoch 25: learning rate=0.000652836, train_loss=1.88718, val_loss=1.58144
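
With its default averaging, `sequence_loss` reports the mean cross-entropy per target token in nats, so `exp(loss)` can be read as per-token perplexity. The short sketch below converts two validation losses from the log above. The helper name `perplexity` is illustrative.

```python
import math


def perplexity(mean_cross_entropy_nats):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy_nats)


for label, loss in [("epoch 1 val", 2.52636), ("epoch 25 val", 1.58144)]:
    print(f"{label}: loss={loss}, perplexity={perplexity(loss):.2f}")
```

Note that the training loss is computed with dropout active, which is a likely reason why it stays above the validation loss throughout.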


In [14]:
name = f'tfattentionmodel_{LAYERS}layers_with_weight_decay'

saver = tf.train.Saver()
saver.save(sess, f"data/{name}.ckpt")
# tfattentionmodel_2layers.ckpt.index https://drive.google.com/open?id=1t5f7vbI6sdBqlTJ3DguUnJwKjx2NInC4
# tfattentionmodel_2layers.ckpt.meta https://drive.google.com/open?id=1Ikp266cw7c93S6mCYHkE0SaBfI1WZtmF
# tfattentionmodel_2layers.ckpt.data-00000-of-00001 https://drive.google.com/open?id=1QMT_5nA7dOHe5G8FdCOY0MDvwh5b5_-A

'data/tfattentionmodel_2layers_with_weight_decay.ckpt'
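
The `with_weight_decay` in the checkpoint name refers to decoupled weight decay instead of an L2 term in the loss. Under Adam these two are not equivalent, which a single scalar update step can illustrate. This is a simplified sketch, not the `AdamWOptimizer` implementation. All function names and values are illustrative, and whether the decay is additionally scaled by the learning rate differs between implementations.

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight; returns (new_w, m, v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def step_with_l2(w, grad, m, v, t, l2=1e-2):
    # L2 penalty: its gradient l2 * w is mixed into the loss gradient and
    # is then rescaled by Adam's adaptive denominator like everything else.
    return adam_step(w, grad + l2 * w, m, v, t)


def step_with_decoupled_decay(w, grad, m, v, t, weight_decay=1e-2):
    # Decoupled decay: Adam only sees the loss gradient; the weight is
    # shrunk separately, without the adaptive rescaling.
    new_w, m, v = adam_step(w, grad, m, v, t)
    return new_w - weight_decay * w, m, v


w0, grad = 2.0, 0.5
w_l2, _, _ = step_with_l2(w0, grad, 0.0, 0.0, t=1)
w_dec, _, _ = step_with_decoupled_decay(w0, grad, 0.0, 0.0, t=1)
print(f"L2 in loss: {w_l2:.6f}, decoupled decay: {w_dec:.6f}")
```

On the first step Adam normalizes the gradient to roughly its sign, so the L2 term barely changes the update, while the decoupled decay shrinks the weight by `weight_decay * w` regardless of the gradient scale.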

In [15]:
# Performance on some examples:
EXAMPLES = [
    'Hello.',
    'You are welcome.',
    'How do you do?',
    'I hate mondays.',
    'I am a programmer.',
    'Data is the new oil.',
    'It could be worse.',
    "I am on top of it.",
    "N° Uno",
    "Awesome!",
    "Put your feet up!",
    "From the start till the end!",
    "From dusk till dawn.",
]
for en in [sentence + '\n' for sentence in EXAMPLES]:
    print(f"{preprocess(en)!r} --> {predict(en)!r}")

'hello.' --> 'helfen.'
'you are welcome.' --> 'sie sind begrüßenswert.'
'how do you do?' --> 'wie tun sie?'
'i hate mondays.' --> 'ich habe gesprochen.'
'i am a programmer.' --> 'ich bin ein programm.'
'data is the new oil.' --> 'die daten sind das neue öl.'
'it could be worse.' --> 'es könnte schlimmer sein.'
'i am on top of it.' --> 'ich bin ganz wichtig.'
'n° uno' --> 'nio-oo'
'awesome!' --> 'einwände!'
'put your feet up!' --> 'lassen sie mich die füße sein!'
'from the start till the end!' --> 'aus dem beginn des endes!'
'from dusk till dawn.' --> 'aus der dusus till-zulassung.'


In [16]:
# Performance on training set:
for en, de in europarl.df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

Original 'i declare resumed the session of the european parliament adjourned on friday 0 december 0, and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.', got 'ich erkläre die sitzungsperiode des europäischen parlaments, der am freitag am freitag am 0. dezember 0 unterbrochen wird, und ich möchte ihnen noch einmal wünschen, dass sie in der hoffnung, dass sie einen freudefähigen zeitraum haben, ein glückliches neues jahr ermöglicht.', exp: 'ich erkläre die am freitag, dem 0. dezember unterbrochene sitzungsperiode des europäischen parlaments für wiederaufgenommen, wünsche ihnen nochmals alles gute zum jahreswechsel und hoffe, daß sie schöne ferien hatten.'
Original "although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.", got 'obwohl sie sehen, daß das schreckliche "millium-misszettel" geschei

In [17]:
# Performance on validation set
val_df = europarl.df.iloc[val_ids]
for en, de in val_df[['input_texts', 'target_texts']][1:20].values.tolist():
    print(f"Original {en!r}, got {predict(en)!r}, exp: {de!r}")

Original 'it is important not to underestimate the work involved.', got 'es ist wichtig, die arbeitnehmer zu unterschätzen.', exp: 'das sollte man nicht unterschätzen.'
Original 'mr vanhanen, you were mr calm, and i think you, mr tuomioja, were mr collected.', got 'herr vanhanen, sie waren herrn calders, und ich denke, sie sind herrn tuomioja, herrn kollege.', exp: 'herr vanhanen, sie waren mr. calm, und, ich denke, sie, herr tuomioja, waren mr. collected.'
Original "most members of this parliament are aware of the commission's efforts to make sure that the european support for the palestinian authority is money that is properly spent, well spent and spent in ways that help to promote pluralism, the rule of law and clean government in the palestinian territories.", got 'die meisten mitglieder dieses parlaments sind der bemühungen der kommission bekannt, sicherzustellen, dass die europäische unterstützung für die palästinensische autonomiebehörde geld, das in den palästinensischen gebie

Original 'the debate is not therefore whether we are in favour of or opposed to the alternative methods.', got 'die debatte ist nicht, ob wir für die alternativen methoden stimmen oder gegen die alternativen methoden stimmen.', exp: 'es geht also nicht darum, ob wir für oder gegen alternativmethoden sind.'
Original 'in conclusion, i would say that only a realistic policy, appropriate to the needs of the population, the environment and an increasingly high-quality market, is capable of achieving the objectives which we think the european union should be aiming at in terms of viticultural policy.', got 'abschließend möchte ich sagen, dass nur eine realistische politik, die den bedürfnissen der bevölkerung, der umwelt und einem zunehmenden qualitativ hochwertigen markt angemessen ist, in der lage ist, die ziele zu erreichen, die wir unserer meinung nach auf die vititikpolitik richten sollten.', exp: 'abschließend möchte ich feststellen, daß nur eine realistische, auf die bedürfnisse der m

In [18]:
bleu = bleu_scores_europarl(
    input_texts=europarl.df.input_texts.iloc[val_ids[:TEST_SIZE]],
    target_texts=europarl.df.target_texts.iloc[val_ids[:TEST_SIZE]],
    predict=lambda text: predict(text)
)
print(f'average BLEU on test set = {bleu.mean()}')

average BLEU on test set = 0.204563636855427


In [19]:
# Checking on a bigger test size
BIGGER_TEST_SIZE = 12500  # 5 * TEST_SIZE
bleu = bleu_scores_europarl(
    input_texts=europarl.df.input_texts.iloc[val_ids[:BIGGER_TEST_SIZE]],
    target_texts=europarl.df.target_texts.iloc[val_ids[:BIGGER_TEST_SIZE]],
    predict=lambda text: predict(text)
)
print(f'average BLEU on test set = {bleu.mean()}')

average BLEU on test set = 0.20097956867751898


# Conclusion

There is only a marginal improvement ($0.202 > 0.199$), while the model needed twice as much time to train. It is also not clear whether even this tiny improvement is just a product of the hyperparameter adjustments. I'll need to rerun this experiment with weight decay to check whether 2 layers yield at least a significant improvement, even if it is a small one.
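
Whether such a small BLEU gap is more than noise can be estimated with a paired bootstrap over per-sentence scores. The following is a sketch under stated assumptions: the function `paired_bootstrap_ci` is hypothetical, and the scores are synthetic placeholders, not results from this notebook. Real input would be per-sentence BLEU scores of both models on the same validation sentences, assuming `bleu_scores_europarl` returns one score per sentence.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=1000, alpha=0.05, seed=0):
    """Confidence interval for mean(scores_b) - mean(scores_a) on paired sentences."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample sentences with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic per-sentence scores for illustration only.
rng = random.Random(42)
one_layer = [min(1.0, max(0.0, rng.gauss(0.20, 0.15))) for _ in range(300)]
two_layer = [min(1.0, max(0.0, s + rng.gauss(0.003, 0.05))) for s in one_layer]
lower, upper = paired_bootstrap_ci(one_layer, two_layer)
print(f"95% CI for the mean BLEU difference: [{lower:.4f}, {upper:.4f}]")
```

If the interval contains 0, the difference between the two models cannot be distinguished from sampling noise on that test set.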

Of course, even small improvements can add up to a solid one. Google uses 8 layers for its NMT model (which can run in parallel on one machine with 8 GPUs), so there might be potential anyway. But for a side project with access to only one GPU, it is not worth pursuing this further.