In [None]:
# Install OpenNMT-py 3.x
!pip3 install OpenNMT-py

Collecting OpenNMT-py
  Downloading OpenNMT_py-3.4.1-py3-none-any.whl (252 kB)
Collecting configargparse (from OpenNMT-py)
  Downloading ConfigArgParse-1.7-py3-none-any.whl (25 kB)
Collecting ctranslate2<4,>=3.2 (from OpenNMT-py)
  Downloading ctranslate2-3.20.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (35.7 MB)
Collecting waitress (from OpenNMT-py)
  Downloading waitress-2.1.2-py3-none-any.whl (57 kB)

# Prepare Your Datasets
Please make sure you have completed the [first exercise](https://colab.research.google.com/drive/1rsFPnAQu9-_A6e2Aw9JYK3C8mXx9djsF?usp=sharing).

In [None]:
# Open the folder where you saved your prepared datasets from the first exercise
# You might need to mount your Google Drive first
%cd /content/drive/MyDrive/nmt/
!ls

/content/drive/MyDrive/nmt
drive-download-20231011T073737Z-001.zip
Europarl.en-nl.en-filtered.en.subword.dev
Europarl.en-nl.en-filtered.en.subword.test
Europarl.en-nl.en-filtered.en.subword.train
Europarl.en-nl.nl-filtered.nl.subword.dev
Europarl.en-nl.nl-filtered.nl.subword.test
Europarl.en-nl.nl-filtered.nl.subword.train
source.model
source.vocab
target.model
target.vocab


In [None]:
!unzip /content/drive/MyDrive/nmt/drive-download-20231011T073737Z-001.zip

Archive:  /content/drive/MyDrive/nmt/drive-download-20231011T073737Z-001.zip
  inflating: source.model            
  inflating: Europarl.en-nl.nl-filtered.nl.subword.test  
  inflating: source.vocab            
  inflating: target.model            
  inflating: Europarl.en-nl.en-filtered.en.subword.test  
  inflating: Europarl.en-nl.en-filtered.en.subword.dev  
  inflating: Europarl.en-nl.nl-filtered.nl.subword.dev  
  inflating: target.vocab            
  inflating: Europarl.en-nl.nl-filtered.nl.subword.train  
  inflating: Europarl.en-nl.en-filtered.en.subword.train  


# Create the Training Configuration File

The following config file matches most of the recommended values for the Transformer model [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762). As the current dataset is small, we reduced the following values:
* `train_steps` - for datasets with a few million sentences, consider a value between 100000 and 200000, or more. Enabling the `early_stopping` option can stop the training when there is no considerable improvement.
* `valid_steps` - 10000 can be a good value if `train_steps` is large enough.
* `warmup_steps` - its value must be less than `train_steps`. Try values such as 4000 or 8000.

Refer to [OpenNMT-py training parameters](https://opennmt.net/OpenNMT-py/options/train.html) for more details. If you are interested in further explanation of the Transformer model, you can check this article, [Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/).
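To get a feel for how `warmup_steps` interacts with `train_steps`, the following sketch computes the learning rate under the "noam" schedule described by Vaswani et al. (2017): the rate grows linearly during warmup and then decays with the inverse square root of the step. The function name `noam_lr` is hypothetical, and the default values mirror the `learning_rate`, `hidden_size`, and `warmup_steps` used in this notebook's configuration. It is an illustration of the formula from the paper, and OpenNMT-py's own implementation may differ in details.

```python
def noam_lr(step, learning_rate=2.0, model_size=512, warmup_steps=1000):
    """Learning rate at a given step (step >= 1) under the noam schedule."""
    scale = model_size ** -0.5
    # Linear increase until warmup_steps, then inverse square-root decay
    return learning_rate * scale * min(step ** -0.5, step * warmup_steps ** -1.5)


# Print the rate at a few steps; it peaks when step == warmup_steps
for step in (100, 500, 1000, 5000, 10000):
    print(step, round(noam_lr(step), 6))
```

If `warmup_steps` were not smaller than `train_steps`, training would end before the rate ever reached its peak, which is why the warmup value has to stay below the total number of steps.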

In [None]:
# Create the YAML configuration file
# On a regular machine, you can create it manually or with nano
# Note here we are using some smaller values because the dataset is small
# For larger datasets, consider increasing: train_steps, valid_steps, warmup_steps, save_checkpoint_steps, keep_checkpoint

config = '''# config.yaml


## Where the samples will be written
save_data: run

# Training files
data:
    corpus_1:
        path_src: Europarl.en-nl.en-filtered.en.subword.train
        path_tgt: Europarl.en-nl.nl-filtered.nl.subword.train
        transforms: [filtertoolong]
    valid:
        path_src: Europarl.en-nl.en-filtered.en.subword.dev
        path_tgt: Europarl.en-nl.nl-filtered.nl.subword.dev
        transforms: [filtertoolong]

# Vocabulary files, generated by onmt_build_vocab
src_vocab: run/source.vocab
tgt_vocab: run/target.vocab

# Vocabulary size - should be the same as in SentencePiece
src_vocab_size: 50000
tgt_vocab_size: 50000

# Filter out source/target longer than n if [filtertoolong] enabled
src_seq_length: 150
tgt_seq_length: 150

# Tokenization options
src_subword_model: source.model
tgt_subword_model: target.model

# Where to save the log file and the output models/checkpoints
log_file: train.log
save_model: models/model.en-nl

# Stop training if it does not improve after n validations
early_stopping: 4

# Default: 5000 - Save a model checkpoint for each n
save_checkpoint_steps: 1000

# To save space, limit checkpoints to last n
# keep_checkpoint: 3

seed: 3435

# Default: 100000 - Train the model to max n steps
# Increase to 200000 or more for large datasets
# For fine-tuning, add the required steps to the original steps
train_steps: 10000

# Default: 10000 - Run validation after n steps
valid_steps: 1000

# Default: 4000 - for large datasets, try up to 8000
warmup_steps: 1000
report_every: 100

# Number of GPUs, and IDs of GPUs
world_size: 1
gpu_ranks: [0]

# Batching
bucket_size: 262144
num_workers: 0  # Default: 2, set to 0 when RAM out of memory
batch_type: "tokens"
batch_size: 4096   # Tokens per batch, change when CUDA out of memory
valid_batch_size: 2048
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 2
# warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 8
hidden_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.2]
attention_dropout: [0.1]
'''

with open("config.yaml", "w+") as config_yaml:
  config_yaml.write(config)
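Since the configuration is written by hand, a key that is accidentally repeated can go unnoticed, because many YAML loaders accept repeated keys and keep only one value. The sketch below is a minimal check for repeated top-level keys using only the standard library. The function name `duplicate_top_level_keys` and the sample text are hypothetical, and the check deliberately ignores nested keys and comments.

```python
import re
from collections import Counter

# A top-level key starts at column 0 and is followed by a colon
KEY_PATTERN = re.compile(r"^([A-Za-z_][\w-]*)\s*:")


def duplicate_top_level_keys(yaml_text):
    """Return the sorted list of top-level keys that appear more than once."""
    keys = []
    for line in yaml_text.splitlines():
        match = KEY_PATTERN.match(line)
        if match:
            keys.append(match.group(1))
    return sorted(key for key, count in Counter(keys).items() if count > 1)


# Hypothetical sample with one repeated key
sample = """train_steps: 10000
valid_steps: 1000
data:
    corpus_1:
        path_src: train.src
train_steps: 20000
"""
print(duplicate_top_level_keys(sample))
```

In this notebook, you could call `duplicate_top_level_keys(config)` right after defining the `config` string.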

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# [Optional] Check the content of the configuration file
!cat config.yaml

# Build Vocabulary

For large datasets, it is not feasible to use all words/tokens found in the corpus. Instead, a specific set of vocabulary is extracted from the training dataset, usually between 32k and 100k words. This is the main purpose of the vocabulary building step.
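As a rough illustration of the idea, the sketch below counts whitespace-separated tokens and keeps only the most frequent ones. It is not how `onmt_build_vocab` is implemented. The function name `build_vocab` and the toy corpus are hypothetical.

```python
from collections import Counter


def build_vocab(lines, max_size):
    """Count whitespace-separated tokens and keep the max_size most frequent."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts.most_common(max_size)


# Toy subworded corpus (hypothetical)
corpus = ["▁the ▁cat ▁sat", "▁the ▁dog ▁sat", "▁the ▁cat ▁ran"]
for token, count in build_vocab(corpus, max_size=2):
    print(token, count)
```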

In [None]:
# Find the number of CPUs/cores on the machine
!nproc --all

2


In [None]:
# Build Vocabulary

# -config: path to your config.yaml file
# -n_sample: use -1 to build vocabulary on all the segments in the training dataset
# -num_threads: change it to match the number of CPUs to run it faster

!onmt_build_vocab -config config.yaml -n_sample -1 -num_threads 2

2023-10-11 07:44:18.362361: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-11 07:44:21.493838: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-10-11 07:44:21.494435: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355


From the **Runtime menu** > **Change runtime type**, make sure that the "**Hardware accelerator**" is "**GPU**".


In [None]:
# Check if the GPU is active
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-1759f39f-df0c-a03f-3066-463f5fec7c38)


In [None]:
# Check if the GPU is visible to PyTorch

import torch

print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))

gpu_memory = torch.cuda.mem_get_info(0)
print("Free GPU memory:", gpu_memory[0]/1024**2, "out of:", gpu_memory[1]/1024**2)

True
Tesla T4
Free GPU memory: 15007.75 out of: 15109.75


# Training

Now, start training your NMT model! 🎉 🎉 🎉

In [None]:
# [Optional] Remove the checkpoints of a previous run (this permanently deletes them)
!rm -rf drive/MyDrive/nmt/models/

In [None]:
# Train the NMT model
!onmt_train -config config.yaml

2023-10-11 07:49:17.604071: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-11 07:49:20.115497: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-10-11 07:49:20.115959: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355


In [None]:
# For error debugging try:
# !dmesg -T

# Translation

Translation Options:
* `-model` - specify the last model checkpoint name; try testing the quality of multiple checkpoints
* `-src` - the subworded test dataset, source file
* `-output` - give any file name to the new translation output file
* `-gpu` - GPU ID, usually 0 if you have one GPU. If this option is omitted, translation runs on the CPU, which is slower.
* `-min_length` - [optional] to avoid empty translations
* `-verbose` - [optional] if you want to print translations

Refer to [OpenNMT-py translation options](https://opennmt.net/OpenNMT-py/options/translate.html) for more details.
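To compare several checkpoints, it can help to assemble the command from the options listed above instead of editing it by hand. The sketch below only builds and prints the commands and does not run them. The helper name `build_translate_cmd`, the checkpoint steps, and the output file names are hypothetical, so adjust them to the checkpoints you actually have.

```python
import shlex


def build_translate_cmd(model, src, output, gpu=0, min_length=1):
    """Assemble an onmt_translate command as a list of arguments."""
    return [
        "onmt_translate",
        "-model", model,
        "-src", src,
        "-output", output,
        "-gpu", str(gpu),
        "-min_length", str(min_length),
    ]


# Hypothetical checkpoint steps; one output file per checkpoint
for step in (9000, 10000):
    cmd = build_translate_cmd(
        model=f"models/model.en-nl_step_{step}.pt",
        src="Europarl.en-nl.en-filtered.en.subword.test",
        output=f"en-nl.translated.step_{step}",
    )
    print(shlex.join(cmd))
```

Each printed line can be pasted into a cell with a leading `!`, or the list can be passed to `subprocess.run(cmd, check=True)`.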

In [None]:
# Translate the "subworded" source file of the test dataset
# Change the model name, if needed.
!onmt_translate -model /content/drive/MyDrive/nmt/models/model.en-nl_step_10000.pt -src /content/drive/MyDrive/nmt/Europarl.en-nl.en-filtered.en.subword.test -output en-nl.translated -gpu 0 -min_length 1

2023-10-11 09:53:50.070770: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-11 09:53:55.884677: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-10-11 09:53:55.885162: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355


In [None]:
# Check the first 5 lines of the translation file
!head -n 5 en-nl.translated

▁In ▁on s ▁la at s te ▁ ja ar ▁he b ben ▁we ▁ vo or uit gang ▁ ge bo ek t ▁bi j ▁de ▁opening ▁van ▁de ▁Inter gouvernement ele ▁Conf er ent ie .
▁M ij n he er ▁de ▁Vo or zi tter , ▁in ▁m ij n ▁ho or zi tting ▁ op ▁ 1 3 ▁j anu ari ▁ z al ▁ ik ▁ ze g gen ▁da t ▁ er ▁ vo or ▁he t ▁e in de ▁van ▁ dit ▁ ja ar ▁ vier ▁ ja ar ▁ ge en ▁spe ci fi eke ▁ vo or wa a rden ▁ zi j n ▁ ge s te ld ▁ vo or ▁de ▁ uit vo ering ▁van ▁de ▁we t ge v ing . ▁I k ▁ z ou ▁gr a ag ▁ zi en ▁da t ▁de ▁amend emen ten ▁ 1 0 , ▁ 5 , ▁ 1 0 , ▁ 1 0 , ▁ 1 5 , ▁ 1 0 , ▁ 5 0 , ▁ 1 0 , ▁ 5 0 , ▁ 1 0 , ▁ 1 5 ▁ ja ar ▁ z ou den ▁ zi j n ▁go edge k eur d .
▁De ▁Europe se ▁Ra ad ▁ z al ▁ zi j n ▁men ing ▁g even ▁over ▁de ze ▁a an passing en .
▁B 4 - 1 3 6 4 / 9 6 ▁van ▁me v r ou w ▁Aelvoet ▁en ▁and eren , ▁name ns ▁de ▁ELDR - F rac ti e , ▁over ▁de ▁crisis ▁in ▁de ▁Noord el ijk e ▁All i anti e ;
▁M ij n he er ▁de ▁Vo or zi tter , ▁na ar ▁m ij n ▁men ing ▁mo et en ▁de ▁prior ite ite n ▁van ▁he t ▁we rk pro gram ma ▁van ▁de ▁Commi

In [None]:
!pip3 install -r MT-Preparation/requirements.txt

Collecting sentencepiece (from -r MT-Preparation/requirements.txt (line 3))
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


In [None]:
# If needed install/update sentencepiece
!pip3 install --upgrade -q sentencepiece

# Desubword the translation file
!python3 MT-Preparation/subwording/3-desubword.py target.model en-nl.translated

Done desubwording! Output: en-nl.translated.desubword
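Conceptually, desubwording joins the subword pieces back into words. The sketch below illustrates the convention on a plain string, assuming that `▁` marks the beginning of a word, as in the translation output above. The function name `desubword` is hypothetical. The `3-desubword.py` script used in this notebook takes the SentencePiece model as an argument, so prefer it for real files.

```python
def desubword(line):
    """Join subword pieces and turn the word-boundary marker into a space."""
    # Remove the spaces between pieces, then restore spaces at word boundaries
    return "".join(line.split()).replace("▁", " ").strip()


pieces = "▁De ▁Europe se ▁Ra ad ▁ z al ▁ zi j n ▁men ing ▁g even"
print(desubword(pieces))
```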


In [None]:
# Check the first 5 lines of the desubworded translation file
!head -n 5 en-nl.translated.desubword

In ons laatste jaar hebben we vooruitgang geboekt bij de opening van de Intergouvernementele▁Conferentie.
Mijnheer de Voorzitter, in mijn hoorzitting op 13▁januari zal ik zeggen dat er voor het▁einde van dit jaar vier jaar geen specifieke voorwaarden zijn gesteld voor de uitvoering van de wetgeving. Ik zou▁graag zien dat de▁amendementen 10, 5, 10, 10, 15, 10, 50, 10, 50, 10, 15 jaar zouden zijn goedgekeurd.
De Europese Raad zal zijn mening geven over deze aanpassingen.
B4-1364/96 van mevrouw Aelvoet en anderen, namens de ELDR-Fractie, over de crisis in de Noordelijke Alliantie;
Mijnheer de Voorzitter, naar mijn mening moeten de▁prioriteiten van het werkprogramma van de Commissie voor 2002 onmiddellijk aandacht worden besteed aan de uitdagingen van de Europese Unie, zowel op het gebied van de interne markt als op het gebied van de interne markt en de economische en▁monetaire▁unie, vooral op het gebied van de buitenlandse aangelegenheden.


In [None]:
# Desubword the target file (reference) of the test dataset
# Note: You might as well have split files *before* subwording during dataset preparation,
# but sometimes datasets have tokenization issues, so this way you are sure the file is really untokenized.
!python3 MT-Preparation/subwording/3-desubword.py target.model /content/drive/MyDrive/nmt/Europarl.en-nl.nl-filtered.nl.subword.test

Done desubwording! Output: /content/drive/MyDrive/nmt/Europarl.en-nl.nl-filtered.nl.subword.test.desubword


In [None]:
# Check the first 5 lines of the desubworded reference
!head -n 5 /content/drive/MyDrive/nmt/Europarl.en-nl.nl-filtered.nl.subword.test.desubword

Tijdens ons debat van vorig jaar verheugden we ons op de start van de Intergouvernementele▁Conferentie.
Mijnheer de Voorzitter, ik herinner me in mijn hoorzitting op 13▁januari precies te hebben gezegd dat ik er in de vijf▁jaren die we▁samen zouden doorbrengen,▁ernaar zou▁streven om▁samen met andere collega's, want ik ben▁niet de enige▁commissaris, er hebben waarschijnlijk vijftien of▁zestien▁commissarissen te maken met een redelijk groot deel van de wetgeving, 1 500 teksten met elkaar te verenigen die moeten worden toegepast en intelligent moeten worden toegepast op de markt.
De Europese Raad zal zich vervolgens over deze aanpassingen uitspreken.
B4-1346/96 van mevrouw André en anderen, namens de▁Fractie van de Europese Liberale en▁Democratische▁Partij, over de crisis in Oost-Zaïre; -B4-1367/96 van mevrouw Baldi en anderen, namens de▁Fractie Unie voor Europa, over de situatie in Zaïre; -B4-1392/96 van mevrouw Sauquillo Pérez▁del Arco en de heer Pons Grau, namens de▁Fractie van de▁Part

# MT Evaluation

There are several MT evaluation metrics, such as BLEU, TER, METEOR, COMET, and BERTScore.

Here we are using BLEU. Files must be detokenized/desubworded beforehand.
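BLEU is built on clipped n-gram precision: each n-gram in the translation counts as a match at most as many times as it occurs in the reference. The sketch below shows only this one component on whitespace-tokenized strings. It omits the brevity penalty, the combination of n-gram orders, and sacreBLEU's tokenization, so it is not a substitute for the BLEU script used in this notebook. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, reference, n):
    """Share of hypothesis n-grams found in the reference, with clipped counts."""
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # A hypothesis n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


# "the" occurs three times in the hypothesis but is credited only once
print(clipped_precision("the the the cat", "the cat sat", 1))
```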

In [None]:
# Download the BLEU script
!wget https://raw.githubusercontent.com/ymoslem/MT-Evaluation/main/BLEU/compute-bleu.py

--2023-10-11 10:01:31--  https://raw.githubusercontent.com/ymoslem/MT-Evaluation/main/BLEU/compute-bleu.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 957 [text/plain]
Saving to: ‘compute-bleu.py’


2023-10-11 10:01:31 (16.5 MB/s) - ‘compute-bleu.py’ saved [957/957]



In [None]:
# Install sacrebleu
!pip3 install sacrebleu



In [None]:
# Evaluate the translation (without subwording)
!python3 compute-bleu.py /content/drive/MyDrive/nmt/Europarl.en-nl.nl-filtered.nl.subword.test.desubword en-nl.translated.desubword

Reference 1st sentence: Tijdens ons debat van vorig jaar verheugden we ons op de start van de Intergouvernementele▁Conferentie.
MTed 1st sentence: In ons laatste jaar hebben we vooruitgang geboekt bij de opening van de Intergouvernementele▁Conferentie.
BLEU:  14.973550832214752


# More Features and Directions to Explore

Experiment with the following ideas:
* Increase `train_steps` and see to what extent new checkpoints provide better translation, in terms of both BLEU and your human evaluation.

* Check MT evaluation metrics other than BLEU, such as [TER](https://github.com/mjpost/sacrebleu#ter), [WER](https://blog.machinetranslation.io/compute-wer-score/), [METEOR](https://blog.machinetranslation.io/compute-bleu-score/#meteor), [COMET](https://github.com/Unbabel/COMET), and [BERTScore](https://github.com/Tiiiger/bert_score). What are the conceptual differences between them? Are there special cases for using a specific metric?

* Continue training from the last model checkpoint using the `-train_from` option, only if the training stopped and you want to continue it. In this case, `train_steps` in the config file should be larger than the steps of the last checkpoint you train from.
```
!onmt_train -config config.yaml -train_from models/model.en-nl_step_3000.pt
```

* **Ensemble Decoding:** During translation, instead of adding one model/checkpoint to the `-model` argument, add multiple checkpoints. For example, try the two last checkpoints. Does it improve the quality of translation? Does it affect translation speed?

* **Averaging Models:** Try to average multiple models into one model using the [average_models.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/bin/average_models.py) script, and see how this affects translation quality.
```
python3 average_models.py -models model_step_xxx.pt model_step_yyy.pt -output model_avg.pt
```
* **Release the model:** Try this command and see how it reduces the model size.
```
onmt_release_model --model "model.pt" --output "model_released.pt"
```
* **Use CTranslate2:** For efficient translation, consider using [CTranslate2](https://github.com/OpenNMT/CTranslate2), a fast inference engine. Check out an [example](https://gist.github.com/ymoslem/60e1d1dc44fe006f67e130b6ad703c4b).

* **Work on low-resource languages:** Find out more details about [how to train NMT models for low-resource languages](https://blog.machinetranslation.io/low-resource-nmt/).

* **Train a multilingual model:** Find out helpful notes about [training multilingual models](https://blog.machinetranslation.io/multilingual-nmt).

* **Publish a demo:** Show off your work through a [simple demo with CTranslate2 and Streamlit](https://blog.machinetranslation.io/nmt-web-interface/).
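Several of the ideas above (continuing with `-train_from`, ensemble decoding, and averaging models) start from picking the most recent checkpoints. The sketch below selects them by step number, assuming checkpoint names end in `_step_<n>.pt` as in this notebook. The function name `last_checkpoints` and the example paths are hypothetical.

```python
import re

# Checkpoint names are assumed to end in "_step_<number>.pt"
STEP_PATTERN = re.compile(r"_step_(\d+)\.pt$")


def last_checkpoints(paths, n=2):
    """Return the n (n >= 1) checkpoints with the highest step, in ascending order."""
    stepped = []
    for path in paths:
        match = STEP_PATTERN.search(path)
        if match:
            stepped.append((int(match.group(1)), path))
    # Sort by the numeric step, since plain string sorting puts 10000 before 9000
    return [path for _, path in sorted(stepped)[-n:]]


names = [
    "models/model.en-nl_step_9000.pt",
    "models/model.en-nl_step_10000.pt",
    "models/model.en-nl_step_1000.pt",
]
print(last_checkpoints(names))
```

On a real run, the list of paths could come from `glob.glob("models/*.pt")`.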
