# Neural Machine Translation: a Practical Session

This notebook presents a practical session that introduces students to training neural machine translation systems. It was created for the lab sessions of the *machine translation* module of the [IARFID](https://www.upv.es/titulaciones/MUIARFID/) master's degree at [Universitat Politècnica de València](https://www.upv.es/en).

## Introduction

The goal of this lab session is to build machine translation systems based on neural networks (neural machine translation; NMT) from a dataset of bilingual parallel sentences using a [custom version](https://github.com/PRHLT/OpenNMT-py/tree/lab_sessions) of the **OpenNMT-py** toolkit (Klein et al., 2017).

### Dataset

The dataset used in this practical session is the Spanish–English language pair of the **EuTrans** corpus (Casacuberta et al., 2004), which consists of interactions between a customer and a receptionist at the front desk of a hotel. It ships with the custom version of **OpenNMT-py** that we are using, under *OpenNMT-py/dataset/EuTrans*.

Here we can see an example of its content:

> *por favor, ¿nos puede dar la llave de la habitación?*

> *can you give us the key to the room, please?*

### Network description

The neural network that we are going to use for training the NMT system has the following configuration:

* Encoder and decoder are both Transformers with a hidden size of 64.
* 2 layers each.
* Transformer feed-forward sublayers of size 64.
* 2 self-attention heads.
* Source word vector of size 64.
* Target word vector of size 64.
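
To make these sizes concrete, the sketch below derives the per-head dimension and a rough weight count for one attention block and one feed-forward block. It assumes the usual Transformer layout (four square projections per attention block and two linear maps per feed-forward block, all with biases) and ignores embeddings and layer normalization. The helper names are hypothetical and not part of **OpenNMT-py**.

```python
# Sizes taken from the configuration list above.
MODEL_SIZE = 64  # hidden size of encoder and decoder
FF_SIZE = 64     # feed-forward size
HEADS = 2        # self-attention heads


def head_size(model_size, heads):
    """Dimension handled by each attention head."""
    if model_size % heads != 0:
        raise ValueError("model size must be divisible by the number of heads")
    return model_size // heads


def attention_params(model_size):
    """Query, key, value and output projections, each with a bias vector."""
    return 4 * (model_size * model_size + model_size)


def feed_forward_params(model_size, ff_size):
    """Two linear maps (model -> ff -> model), each with a bias vector."""
    return (model_size * ff_size + ff_size) + (ff_size * model_size + model_size)


print("per-head size:", head_size(MODEL_SIZE, HEADS))
print("attention weights:", attention_params(MODEL_SIZE))
print("feed-forward weights:", feed_forward_params(MODEL_SIZE, FF_SIZE))
```

These counts are only an estimate of the model's scale; the checkpoint produced by training is the authoritative source.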

## Setup

### Installation
To install **OpenNMT-py**, the first step is to clone the repository:

```bash
! git clone --branch lab_sessions https://github.com/PRHLT/OpenNMT-py.git
```

Then, it can be easily installed through pip:

```bash
! pip install -e OpenNMT-py/
```

The evaluation step needs some additional packages:

```bash
! pip install sacrebleu glom
```

Finally, it is recommended to install some optional requirements:

```bash
! pip install -r OpenNMT-py/requirements.opt.txt
```

### Storage

By default, everything generated in **Google Colab** is stored on its servers and is not kept permanently. Thus, we need to set up **Google Drive** to be able to save a copy of the data:

```python
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')
```

Then, we are going to create a new folder inside our **Google Drive**:

```bash
! mkdir -p drive/MyDrive/NMT
```

## Training

### System's configuration

The network's configuration and the settings of the training process are defined in the following file:

```bash
%%bash
cat <<EOF > config.yaml
# Data
save_data: drive/MyDrive/NMT/dataset/
src_vocab: drive/MyDrive/NMT/dataset/EuTrans-es.vocab
tgt_vocab: drive/MyDrive/NMT/dataset/EuTrans-en.vocab
overwrite: False

# Corpora:
data:
    corpus_1:
        path_src: OpenNMT-py/dataset/EuTrans/training.es
        path_tgt: OpenNMT-py/dataset/EuTrans/training.en
    valid:
        path_src: OpenNMT-py/dataset/EuTrans/development.es
        path_tgt: OpenNMT-py/dataset/EuTrans/development.en

# Model
decoder_type: transformer
encoder_type: transformer
word_vec_size: 64
rnn_size: 64
layers: 2
transformer_ff: 64
heads: 2
accum_count: 8
warmup_steps: 8000
optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0
batch_size: 50
batch_type: tokens
normalization: tokens
dropout: 0.1
label_smoothing: 0.1
max_generator_batches: 2
param_init: 0.0
param_init_glorot: 'true'
position_encoding: 'true'

# Train on a single GPU
world_size: 1
gpu_ranks:
  - 0

# Checkpoints
save_model: drive/MyDrive/NMT/models/EuTrans
save_checkpoint_steps: 1000
train_steps: 5000
valid_steps: 1000
keep_checkpoint: 10
report_every: 100

EOF
```
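
The settings `decay_method: noam`, `learning_rate: 2.0`, `warmup_steps: 8000` and `rnn_size: 64` together determine the learning rate at each update. The sketch below follows the Noam schedule as it is commonly defined for Transformers. It assumes that the schedule is scaled by `learning_rate` and that `rnn_size` plays the role of the model size. `noam_learning_rate` is a hypothetical helper written for illustration, not a function of **OpenNMT-py**.

```python
def noam_learning_rate(step, learning_rate=2.0, model_size=64, warmup_steps=8000):
    """Learning rate at a given update step (1-based) under the Noam schedule.

    The rate grows linearly during warm-up and then decays with the
    inverse square root of the step number.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    scale = model_size ** -0.5
    return learning_rate * scale * min(step ** -0.5, step * warmup_steps ** -1.5)


# Inspect the schedule at a few steps, including the last training step (5000).
for step in (100, 1000, 5000, 8000):
    print(step, noam_learning_rate(step))
```

Under these assumptions the learning rate is still increasing when training stops at step 5000, because warm-up lasts 8000 steps. This is worth keeping in mind when changing `train_steps` or `warmup_steps`.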

### Vocabulary building
Before training the model, the vocabulary needs to be built. You can do so by running:

```bash
! onmt_build_vocab -config config.yaml
```
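
Conceptually, building a vocabulary amounts to counting how often each token occurs in the training data. The sketch below illustrates that idea, assuming whitespace tokenization. The function `count_tokens` and the two sample sentences are made up for illustration and are not the toolkit's implementation or actual corpus lines.

```python
from collections import Counter


def count_tokens(sentences):
    """Count whitespace-separated tokens over an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts


# Illustrative sentences only; the real input would be the training file.
sample = [
    "can you give us the key to the room , please ?",
    "the key to the room , please .",
]

vocab = count_tokens(sample)
for token, count in vocab.most_common(3):
    print(token, count)
```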

### Model training

After that, you can start the training process by running:

```bash
! onmt_train -config config.yaml
```
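
Checkpoints are written under the `save_model` prefix. The translation step loads `EuTrans_step_5000.pt`, which suggests that file names follow a `<prefix>_step_<N>.pt` pattern. Assuming that pattern, and assuming `keep_checkpoint` retains only the most recent files, the sketch below lists the checkpoints that the configuration should leave behind. `expected_checkpoints` is a hypothetical helper.

```python
def expected_checkpoints(save_model, train_steps, save_every, keep=-1):
    """List the checkpoint paths expected after training.

    A positive `keep` retains only the most recent `keep` checkpoints.
    """
    steps = list(range(save_every, train_steps + 1, save_every))
    if keep > 0:
        steps = steps[-keep:]
    return [f"{save_model}_step_{step}.pt" for step in steps]


# Values copied from config.yaml.
checkpoints = expected_checkpoints(
    "drive/MyDrive/NMT/models/EuTrans",
    train_steps=5000,
    save_every=1000,
    keep=10,
)
for path in checkpoints:
    print(path)
```

Any of these intermediate checkpoints could be passed to the translation step to compare models at different stages of training.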

## Translation
Once the network has been trained, the test set can be translated by running:

```bash
! onmt_translate -model drive/MyDrive/NMT/models/EuTrans_step_5000.pt -src OpenNMT-py/dataset/EuTrans/test.es -output drive/MyDrive/NMT/EuTrans-test.en.hyp -verbose -replace_unk
```

## Evaluation
Finally, the translation hypotheses can be evaluated against the reference translations by running:

```bash
! sacrebleu --force OpenNMT-py/dataset/EuTrans/test.en < drive/MyDrive/NMT/EuTrans-test.en.hyp | glom score
```
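
The score reported by `sacrebleu` is BLEU, which is built on clipped n-gram precision between each hypothesis and its reference. The sketch below shows only that one ingredient, assuming whitespace tokenization. It is not full BLEU: there is no brevity penalty, no combination across n-gram orders, and none of the tokenization that `sacrebleu` applies. The function names and the example hypothesis are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams found in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams also found in the reference.

    Each n-gram is credited at most as many times as it occurs in the
    reference, so repeating a correct word does not inflate the score.
    """
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


reference = "can you give us the key to the room , please ?"
hypothesis = "can you give me the key please"  # made-up system output

print("unigram precision:", clipped_precision(hypothesis, reference, 1))
print("bigram precision:", clipped_precision(hypothesis, reference, 2))
```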

## Exercises

* **Try different sizes of the word embeddings for source and target words.**

* **Try different optimization algorithms (e.g., SGD, Adagrad, Adadelta).**

* **Try recurrent neural networks.**

## Resources

* [OpenNMT-py's documentation](https://opennmt.net/OpenNMT-py).

## References

* Casacuberta, F., Ney, H., Och, F. J., Vidal, E., Vilar, J. M., Barrachina, S., García-Varea, I., Llorens, D., Hinarejos, C. D., & Molau, S. (2004). [Some approaches to statistical and finite-state speech-to-speech translation](https://doi.org/10.1016/S0885-2308(03)00028-7). *Computer Speech & Language*, 18, 25–47.
* Klein, G., Kim, Y., Deng, Y., Senellart, J., & Rush, A. M. (2017). [OpenNMT: Open-Source Toolkit for Neural Machine Translation](https://www.aclweb.org/anthology/P17-4012). In *Proceedings of the Association for Computational Linguistics: System Demonstrations*, 67–72.