# OpenNMT Tutorial and Starter Code
(modified from the OpenNMT quickstart to work in Colab)

While creating your own models from scratch is common for many tasks, it is often useful to rely on a tool or framework instead. In this exercise we're going to look at one popular NMT tool, OpenNMT, as a way to use beam search, which can be tricky to implement efficiently on your own.

Finally, we'll look at how to configure different models for OpenNMT, including the Transformer, which we'll cover in detail next week.

OpenNMT is similar to other ML frameworks in that it relies on a combination of editable .yaml files and command-line tools to run the training procedure.
### Make sure you have the *.yml config files from the lab repository.



In [None]:
!git clone https://github.com/OpenNMT/OpenNMT-py.git

fatal: destination path 'OpenNMT-py' already exists and is not an empty directory.


In [None]:
!pip install --upgrade OpenNMT-py

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install -r /content/OpenNMT-py/requirements.opt.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/NVIDIA/apex.git@700d6825e205732c1d6be511306ca4e595297070 (from -r /content/OpenNMT-py/requirements.opt.txt (line 2))
  Cloning https://github.com/NVIDIA/apex.git (to revision 700d6825e205732c1d6be511306ca4e595297070) to /tmp/pip-req-build-nbd6rjnl
  Running command git clone --filter=blob:none --quiet https://github.com/NVIDIA/apex.git /tmp/pip-req-build-nbd6rjnl
  Running command git rev-parse -q --verify 'sha^700d6825e205732c1d6be511306ca4e595297070'
  Running command git fetch -q https://github.com/NVIDIA/apex.git 700d6825e205732c1d6be511306ca4e595297070
  Running command git checkout -q 700d6825e205732c1d6be511306ca4e595297070
  Resolved https://github.com/NVIDIA/apex.git to commit 700d6825e205732c1d6be511306ca4e595297070
  Running command git submodule update --init --recursive -q
  Preparing metadata (setup.py) ... done
Collecting pyroug

### Next let's get OpenNMT as well as a toy English to German corpus.

In [None]:
!wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
!tar xf toy-ende.tar.gz
# Note: `!cd` only affects its own subshell, so the notebook's working directory
# is unchanged and later cells keep using relative paths such as `toy-ende/...`.
!cd toy-ende

--2023-03-09 22:41:44--  https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.224.24, 52.216.209.56, 52.216.79.30, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.224.24|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1662081 (1.6M) [application/x-gzip]
Saving to: ‘toy-ende.tar.gz’


2023-03-09 22:41:44 (4.23 MB/s) - ‘toy-ende.tar.gz’ saved [1662081/1662081]



## Processing Vocab

Once we have the corpus and OpenNMT we can build the vocab we'll use. This relies on having a config file with this information laid out.

Let's take a second to look at the config file we'll be using, `toy_en_de.yaml`, which we will upload from the students repo into the root project directory inside our Colab environment.

The important parts for data processing are at the top of the YAML file:

```
# toy_en_de.yaml

## Where the samples will be written
save_data: toy-ende/run/example
## Where the vocab(s) will be written
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt
# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt

```
In this file, we specify where the data is, where to save it, as well as the vocab files corresponding to the corpus.
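
Before building the vocab, it can help to confirm that every file named in the `data` section actually exists. The sketch below assumes the section has been mirrored by hand into a Python dict; the `config` dict and the `missing_paths` helper are hypothetical names for illustration and are not part of OpenNMT.

```python
from pathlib import Path

# Hand-written mirror of the `data` section above (an assumption for
# illustration; OpenNMT itself reads the YAML file directly).
config = {
    "corpus_1": {
        "path_src": "toy-ende/src-train.txt",
        "path_tgt": "toy-ende/tgt-train.txt",
    },
    "valid": {
        "path_src": "toy-ende/src-val.txt",
        "path_tgt": "toy-ende/tgt-val.txt",
    },
}


def missing_paths(data_section, root="."):
    """Return the configured corpus files that do not exist under `root`."""
    missing = []
    for corpus in data_section.values():
        for key in ("path_src", "path_tgt"):
            if not (Path(root) / corpus[key]).is_file():
                missing.append(corpus[key])
    return missing


# An empty list means every configured file was found.
print(missing_paths(config))
```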

Once uploaded, we can run the cell below:

In [None]:
!onmt_build_vocab -config /content/drive/MyDrive/COLX_531_lab3_jhlbxx/toy_en_de.yaml -n_sample 10000


Corpus corpus_1's weight should be given. We default it to 1 for you.
[2023-03-09 22:43:20,175 INFO] Counter vocab from 10000 samples.
[2023-03-09 22:43:20,175 INFO] Build vocab on 10000 transformed examples/corpus.
[2023-03-09 22:43:20,573 INFO] Counters src:24995
[2023-03-09 22:43:20,573 INFO] Counters tgt:35816
Traceback (most recent call last):
  File "/usr/local/bin/onmt_build_vocab", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/onmt/bin/build_vocab.py", line 202, in main
    build_vocab_main(opts)
  File "/usr/local/lib/python3.9/dist-packages/onmt/bin/build_vocab.py", line 186, in build_vocab_main
    save_counter(src_counter, opts.src_vocab)
  File "/usr/local/lib/python3.9/dist-packages/onmt/bin/build_vocab.py", line 175, in save_counter
  File "/usr/local/lib/python3.9/dist-packages/onmt/utils/misc.py", line 47, in check_path
    raise IOError(f"path {path} exists, stop.")
OSError: path toy-ende/run/example.vocab.src exists, stop.


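* `-n_sample` is required here -- it represents the number of lines sampled from each corpus to build the vocab.
* This configuration is the simplest possible, without any tokenization or other transforms. See other example configurations for more complex pipelines.
* The traceback in the output above is raised because `toy-ende/run/example.vocab.src` already exists and the config sets `overwrite: False`; remove the old files or set `overwrite: True` to rebuild the vocab.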
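
To make `-n_sample` concrete, here is a rough sketch of what counting a vocab from sampled lines amounts to (here simply the first `n_sample` lines). It is not OpenNMT's implementation: the `count_vocab` helper is hypothetical, and whitespace splitting stands in for whatever transforms a real pipeline applies.

```python
from collections import Counter
from itertools import islice


def count_vocab(lines, n_sample):
    """Count whitespace-separated tokens in the first `n_sample` lines."""
    counter = Counter()
    for line in islice(lines, n_sample):
        counter.update(line.split())
    return counter


# Tiny in-memory corpus standing in for a file such as src-train.txt.
corpus = ["the cat sat", "the dog sat", "a bird flew"]
counts = count_vocab(corpus, n_sample=2)
for token, freq in counts.most_common():
    print(token, freq)
```

Because only two lines are sampled, tokens from the third line never enter the counter, which is why a small `-n_sample` can leave rare words out of the vocab.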


## Training

Next we will begin training with OpenNMT, again using the same config file but **adding** to it all the relevant parts we need:

```
# toy_en_de.yaml

# Train on a single GPU
world_size: 1
gpu_ranks: [0]

# Where to save the checkpoints
# Note it won't actually make it to 10,000 steps because of early stopping
save_model: toy-ende/run/model
save_checkpoint_steps: 500
train_steps: 10000
valid_steps: 500
early_stopping: 2


# Checkpoint settings
keep_checkpoint: 3
seed: 531
warmup_steps: 400
report_every: 100

# Model (note these are actually default values, but I've explicitly written them out to show how you can edit them)
decoder_type: rnn
encoder_type: rnn 
enc_layers: 2
dec_layers: 2
enc_rnn_size: 500
dec_rnn_size: 500
dropout: 0.3
global_attention : dot


# Optimizer settings
optim: sgd
learning_rate: 1

```

Here the config file covers two major things: Model checkpointing and Model Hyperparameters.
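
The `early_stopping: 2` setting is why training halts before `train_steps` is reached. As a generic illustration of patience-based early stopping (not OpenNMT's exact criteria; `should_stop` is a hypothetical helper and the perplexities are made up), the sketch below stops once validation perplexity has failed to improve for `patience` consecutive validations.

```python
def should_stop(val_ppls, patience):
    """Return True if the last `patience` validations all failed to
    improve on the best perplexity seen before them."""
    if len(val_ppls) <= patience:
        return False
    best_before = min(val_ppls[:-patience])
    return all(ppl >= best_before for ppl in val_ppls[-patience:])


# Hypothetical validation perplexities, one per `valid_steps` interval.
history = [30.0, 22.0, 18.5, 18.9, 19.4]
print(should_stop(history[:4], patience=2))  # still improving recently
print(should_stop(history, patience=2))      # two validations without improvement
```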

Certain settings are available only for certain models. For instance, you wouldn't (want to) use positional encoding for an RNN-based model, but it is necessary for proper training of Transformers, and we could include it by adding the line ```position_encoding: true```.

If we wanted to know more about any of these settings, we could take a peek at the OpenNMT [train documentation](https://opennmt.net/OpenNMT-py/options/train.html).

For instance, for the encoder options, it shows which models are available:
```
--encoder_type, -encoder_type
Possible choices: rnn, brnn, ggnn, mean, transformer, cnn, transformer_lm

Type of encoder layer to use. Non-RNN layers are experimental. Options are [rnn|brnn|ggnn|mean|transformer|cnn|transformer_lm].

```


Finally, we will train our model with this configuration (the small RNN model took about 10 minutes to train).

In [None]:
!onmt_train -config /content/drive/MyDrive/COLX_531_lab3_jhlbxx/toy_en_de.yaml

[2023-03-09 22:13:49,714 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2023-03-09 22:13:49,715 INFO] Missing transforms field for valid data, set to default: [].
[2023-03-09 22:13:49,715 INFO] Parsed 2 corpora from -data.
[2023-03-09 22:13:49,715 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2023-03-09 22:13:49,872 INFO] Building model...
[2023-03-09 22:13:58,941 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(25000, 500, padding_idx=1)
        )
      )
      (dropout): Dropout(p=0.3, inplace=False)
    )
    (rnn): LSTM(500, 500, num_layers=2, batch_first=True, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(32768, 500, padding_idx=1)
        )
      )
      (dropout): Dropout(p=0.3, inplace=False)


Once our model is saved, we can use it to generate predictions. Checkpoints are written to the path given by the ```save_model``` setting of our config file, in this case with the prefix ```toy-ende/run/model_```. Since we save only every 500 training steps and keep the last three checkpoints, we can choose from ```model_step_2500.pt```, ```model_step_3000.pt```, and ```model_step_3500.pt```. Early stopping indicates that the best of the three (lowest perplexity/highest accuracy) is 2500, but let's look at how to pick between them using BLEU:
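
Since the checkpoint names encode the training step, a small helper can list whichever checkpoints are actually present, which is handy when early stopping ends at a different step. This is a sketch under the assumption that files follow the `model_step_<N>.pt` pattern seen above; `list_checkpoints` is a hypothetical helper, not an OpenNMT function.

```python
import re
from pathlib import Path


def list_checkpoints(run_dir, prefix="model"):
    """Return (step, path) pairs for `<prefix>_step_<N>.pt` files, sorted by step."""
    pattern = re.compile(rf"^{re.escape(prefix)}_step_(\d+)\.pt$")
    found = []
    for path in Path(run_dir).glob(f"{prefix}_step_*.pt"):
        match = pattern.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found, key=lambda pair: pair[0])


# Prints nothing if the directory is missing or holds no checkpoints.
for step, path in list_checkpoints("toy-ende/run"):
    print(step, path)
```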

## Translating

To do so, we need to translate the source sentences, decoding with beam search. Here we've chosen a ```-beam_size``` of 10; in the exercise questions you will be asked to try different sizes.
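
As a rough picture of what beam search does, the toy sketch below keeps the `beam_size` highest-scoring partial hypotheses at each step. It is deliberately simplified and is not how OpenNMT implements decoding: the `next_log_probs` table is a made-up stand-in for a model (it conditions only on the previous token), and all names are hypothetical.

```python
import math

# Made-up next-token distributions keyed by the previous token.
next_log_probs = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a": {"</s>": math.log(0.4), "a": math.log(0.3), "b": math.log(0.3)},
    "b": {"</s>": math.log(0.9), "a": math.log(0.05), "b": math.log(0.05)},
}


def beam_search(beam_size, max_len=5):
    """Return finished hypotheses as (log_prob, tokens), best first."""
    beam = [(0.0, ["<s>"])]
    finished = []
    for _ in range(max_len):
        # Expand every hypothesis on the beam by every possible next token.
        candidates = []
        for score, tokens in beam:
            for token, log_p in next_log_probs[tokens[-1]].items():
                candidates.append((score + log_p, tokens + [token]))
        # Keep only the `beam_size` best candidates for the next step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == "</s>":
                finished.append((score, tokens))
            else:
                beam.append((score, tokens))
        if not beam:
            break
    return sorted(finished, key=lambda c: c[0], reverse=True)


for size in (1, 2):
    score, tokens = beam_search(beam_size=size)[0]
    print(size, tokens, round(math.exp(score), 2))
```

With this table, greedy decoding (`beam_size=1`) commits to `a` and finishes with probability 0.24, while `beam_size=2` keeps `b` alive and recovers the better hypothesis with probability 0.36.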

Let's first create predictions for our ```_step_2500.pt```, ```_step_3000.pt```, and ```_step_3500.pt``` models (note: your model may have stopped at a different point, in which case use the appropriate last three checkpoints):

In [None]:
# Note: the checkpoints used in this run are steps 1500, 2000, and 2500, while the
# output files are named val_2500/val_3000/val_3500; the names do not match the steps.
!onmt_translate -model toy-ende/run/model_step_1500.pt -src toy-ende/src-val.txt -output toy-ende/val_2500.txt -gpu 0 -beam_size 10 -seed 531 -block_ngram 2
!onmt_translate -model toy-ende/run/model_step_2000.pt -src toy-ende/src-val.txt -output toy-ende/val_3000.txt -gpu 0 -beam_size 10 -seed 531 -block_ngram 2
!onmt_translate -model toy-ende/run/model_step_2500.pt -src toy-ende/src-val.txt -output toy-ende/val_3500.txt -gpu 0 -beam_size 10 -seed 531 -block_ngram 2


[2023-03-09 22:55:53,705 INFO] PRED SCORE: -1.6577, PRED PPL: 5.25 NB SENTENCES: 3000
[2023-03-09 22:59:22,300 INFO] PRED SCORE: -1.5070, PRED PPL: 4.51 NB SENTENCES: 3000
[2023-03-09 23:02:37,808 INFO] PRED SCORE: -1.4296, PRED PPL: 4.18 NB SENTENCES: 3000


[Note we can now manually inspect the results under val_*.txt]

Finally, let's calculate the BLEU scores of the outputs! We would eventually want to select the model with the highest BLEU (in our case 37 with our 2500-step model) and use it on our test set.

The cell below uses the `multi-bleu-detok.perl` script that ships in `OpenNMT-py/tools/`. If you instead upload `multi-bleu.perl` from the students repo into the root project directory, adjust the path accordingly:



In [None]:
!perl  OpenNMT-py/tools/multi-bleu-detok.perl toy-ende/tgt-val.txt < toy-ende/val_2500.txt
!perl  OpenNMT-py/tools/multi-bleu-detok.perl toy-ende/tgt-val.txt < toy-ende/val_3000.txt
!perl  OpenNMT-py/tools/multi-bleu-detok.perl toy-ende/tgt-val.txt < toy-ende/val_3500.txt

Use of uninitialized value in division (/) at OpenNMT-py/tools/multi-bleu-detok.perl line 149, <STDIN> line 3000.
BLEU = 0.00, 17.7/1.3/0.2/0.0 (BP=0.794, ratio=0.812, hyp_len=59251, ref_len=72954)
Use of uninitialized value in division (/) at OpenNMT-py/tools/multi-bleu-detok.perl line 149, <STDIN> line 3000.
BLEU = 0.00, 19.2/1.1/0.1/0.0 (BP=0.649, ratio=0.698, hyp_len=50937, ref_len=72954)
BLEU = 0.13, 14.6/0.6/0.0/0.0 (BP=0.924, ratio=0.927, hyp_len=67614, ref_len=72954)
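
The `BP` field in the output above is BLEU's brevity penalty, which penalises hypotheses that are shorter than the reference: it is 1 when the hypothesis is at least as long as the reference, and `exp(1 - ref_len / hyp_len)` otherwise. The sketch below recomputes it from the `hyp_len` and `ref_len` values printed in the first BLEU line; the `brevity_penalty` helper name is ours and is not part of the Perl script.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """Standard BLEU brevity penalty for corpus-level lengths."""
    if hyp_len == 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


# Lengths copied from the first BLEU line above.
bp = brevity_penalty(hyp_len=59251, ref_len=72954)
print(f"BP={bp:.3f}")
```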


# Lab 3 - Exercise 1

We have seen how OpenNMT can be used; now let's apply it to our Multi30k dataset.

You can run your code in here and then download the results to submit on GitHub.

*You are provided with a `Multi30k.yaml` to fill in. Be sure to submit this alongside your Colab notebook and other files in the repository.*

## 1.1

### Build the vocab for the Multi30k En-Fr dataset

While just having a word-level vocabulary is fine for some cases, sub-word tokenization might help capture morphological information better.

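To do this, in your config file add ```transforms: [filtertoolong]``` to the training corpora. (Note that `filtertoolong` by itself only filters out overly long examples; sub-word tokenization requires a tokenization transform such as `sentencepiece` in addition.)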

Please provide the code you ran to build the vocab as well as the "data" section of your multi30k config file.
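
For intuition, a length filter of this kind simply drops sentence pairs in which either side has too many tokens. The sketch below is a standalone approximation, not OpenNMT's `filtertoolong` code; the `filter_too_long` helper, the example sentences, and the limit of 5 tokens are illustrative assumptions.

```python
def filter_too_long(pairs, max_src_len, max_tgt_len):
    """Keep (src, tgt) pairs whose whitespace token counts fit both limits."""
    kept = []
    for src, tgt in pairs:
        if len(src.split()) <= max_src_len and len(tgt.split()) <= max_tgt_len:
            kept.append((src, tgt))
    return kept


# Made-up French/English pairs; the second one exceeds the 5-token limit.
pairs = [
    ("un chien court", "a dog runs"),
    ("un homme avec un chapeau rouge marche", "a man with a red hat walks"),
]
print(filter_too_long(pairs, max_src_len=5, max_tgt_len=5))
```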


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import spacy.cli

spacy.cli.download("en_core_web_sm")
spacy.cli.download("fr_core_news_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')


In [None]:
import fr_core_news_sm
import en_core_web_sm
import pandas as pd

spacy_fr = fr_core_news_sm.load()
spacy_en = en_core_web_sm.load()

Before using `OpenNMT` to build our vocabs, we need to separate the French and English sentences from the original Multi30k dataset and save them in separate files. You can use the code below to do this:

In [None]:
# THIS CODE GENERATES THE TOKENIZED FILES FOR EACH LANGUAGE

import csv
from tqdm import tqdm

FILE_LIST = ["train_eng_fre.tsv", "val_eng_fre.tsv", "test_eng_fre.tsv"]

# NOTE: update with your desired path
path = "/content/drive/MyDrive/COLX_531_lab3_jhlbxx/data/"

for file in FILE_LIST:
  prefix = file.split("_")[0]  # e.g. "train" from "train_eng_fre.tsv"
  with open(path + file, "r", encoding="utf-8") as tsv:
    tsv_reader = csv.reader(tsv, delimiter="\t")
    next(tsv_reader, None)  # skip the first row
    # Write one space-joined tokenized sentence per line, one file per language
    with open(path + prefix + "_fr.tokd", "w", encoding="utf-8") as out_fr, \
         open(path + prefix + "_en.tokd", "w", encoding="utf-8") as out_en:
      for row in tqdm(tsv_reader):
        # Column 0 is tokenized as English, column 1 as French
        tokenized_en = [tok.text for tok in spacy_en(row[0])]
        tokenized_fr = [tok.text for tok in spacy_fr(row[1])]
        out_fr.write(" ".join(tokenized_fr) + "\n")
        out_en.write(" ".join(tokenized_en) + "\n")

29000it [06:51, 70.47it/s]
1014it [00:14, 71.19it/s]
1000it [00:13, 71.60it/s]


In [None]:
!onmt_build_vocab -config /content/drive/MyDrive/COLX_531_lab3_jhlbxx/multi30k.yml -n_sample 10000

Corpus corpus_1's weight should be given. We default it to 1 for you.
[2023-03-10 01:34:20,451 INFO] Counter vocab from 10000 samples.
[2023-03-10 01:34:20,451 INFO] Build vocab on 10000 transformed examples/corpus.
[2023-03-10 01:34:20,636 INFO] Counters src:6886
[2023-03-10 01:34:20,636 INFO] Counters tgt:6411


```
Changes made to Data saving, Corpus, and Vocab section in the config file go HERE
```

In [None]:
# TODO Train Model
# multi30k.yaml

## TO DO COMPLETE DATA SAVING
## Where the samples will be written
save_data: /content/drive/MyDrive/lab3_results/multi30k/run/example
## Where the vocab(s) will be written
src_vocab: /content/drive/MyDrive/lab3_results/multi30k/run/example.vocab.src
tgt_vocab: /content/drive/MyDrive/lab3_results/multi30k/run/example.vocab.tgt
# Prevent overwriting existing files in the folder
overwrite: False


# Corpus opts:
data:
## TODO COMPLETE CORPUS OPTIONS
    corpus_1:
        path_src: /content/drive/MyDrive/COLX_531_lab3_jhlbxx/data/train_fr.tokd
        path_tgt: /content/drive/MyDrive/COLX_531_lab3_jhlbxx/data/train_en.tokd
        transforms: [filtertoolong]
    valid:
        path_src: /content/drive/MyDrive/COLX_531_lab3_jhlbxx/data/val_fr.tokd
        path_tgt: /content/drive/MyDrive/COLX_531_lab3_jhlbxx/data/val_en.tokd
## Add sentencepiece and filter long segments
    


#TODO Fill in vocab you create
src_vocab: /content/drive/MyDrive/lab3_results/multi30k/run/example.vocab.src
tgt_vocab: /content/drive/MyDrive/lab3_results/multi30k/run/example.vocab.tgt

## 1.2
Train Model

Fill in the `multi30k.yaml` config to set up a seq2seq model that has a 3-layer RNN encoder, a 2-layer RNN decoder, and MLP attention, with 20% dropout and Adam as the optimizer.

Copy and paste the changed parts of the *.yml file below along with the training command you used.

```
Changes to model, and optimizer here.

```

In [None]:
# Train on a single GPU
world_size: 1
gpu_ranks: [0]

# Where to save the checkpoints
# Note it won't actually make it to 10,000 steps because of early stopping
save_model: /content/drive/MyDrive/lab3_results/multi30k/run/model
save_checkpoint_steps: 500
train_steps: 10000
valid_steps: 500
early_stopping: 2


# Checkpoint settings
keep_checkpoint: 5
seed: 531
warmup_steps: 400
report_every: 100

# Model 
## TODO Create RNN enc/dec with MLP attention
## Should have 3 layers in encoder and 2 layers in decoder
## 20% dropout and 500 hidden units
decoder_type: rnn
encoder_type: rnn 
enc_layers: 3
dec_layers: 2
enc_rnn_size: 500
dec_rnn_size: 500
dropout: 0.2
global_attention : mlp


# Optimizer settings
## TODO Set Adam as Optimizer
optim: adam
learning_rate: 0.001

## 1.3

Decoding

Create predictions for the validation set using your saved models and select the one that has the highest BLEU. You should set beam size to 5 for each of these models.

Report the BLEU on this model.

In [None]:
## Code to train the model (predictions are created in the next cell)
!onmt_train -config /content/drive/MyDrive/COLX_531_lab3_jhlbxx/multi30k.yml

[2023-03-10 01:37:48,578 INFO] Missing transforms field for valid data, set to default: [].
[2023-03-10 01:37:48,578 INFO] Parsed 2 corpora from -data.
[2023-03-10 01:37:48,579 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2023-03-10 01:37:48,640 INFO] Building model...
[2023-03-10 01:37:51,588 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(6896, 500, padding_idx=1)
        )
      )
      (dropout): Dropout(p=0.2, inplace=False)
    )
    (rnn): LSTM(500, 500, num_layers=3, batch_first=True, dropout=0.2)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(6416, 500, padding_idx=1)
        )
      )
      (dropout): Dropout(p=0.2, inplace=False)
    )
    (dropout): Dropout(p=0.2, inplace=False)
    (rnn): StackedLSTM(
      (dropout): Dropo

In [None]:
# Note: the exercise asks for a beam size of 5; the results logged below were
# produced with -beam_size 10 as written in these commands.
!onmt_translate -model /content/drive/MyDrive/lab3_results/multi30k/run/model_step_3000.pt -src /content/drive/MyDrive/COLX_531_lab3_jhlbxx/data/val_fr.tokd -output /content/drive/MyDrive/lab3_results/multi30k/run/val_3000.txt -gpu 0 -beam_size 10 -seed 531 -block_ngram 2
!onmt_translate -model /content/drive/MyDrive/lab3_results/multi30k/run/model_step_3500.pt -src /content/drive/MyDrive/COLX_531_lab3_jhlbxx/data/val_fr.tokd -output /content/drive/MyDrive/lab3_results/multi30k/run/val_3500.txt -gpu 0 -beam_size 10 -seed 531 -block_ngram 2
!onmt_translate -model /content/drive/MyDrive/lab3_results/multi30k/run/model_step_4000.pt -src /content/drive/MyDrive/COLX_531_lab3_jhlbxx/data/val_fr.tokd -output /content/drive/MyDrive/lab3_results/multi30k/run/val_4000.txt -gpu 0 -beam_size 10 -seed 531 -block_ngram 2
!onmt_translate -model /content/drive/MyDrive/lab3_results/multi30k/run/model_step_4500.pt -src /content/drive/MyDrive/COLX_531_lab3_jhlbxx/data/val_fr.tokd -output /content/drive/MyDrive/lab3_results/multi30k/run/val_4500.txt -gpu 0 -beam_size 10 -seed 531 -block_ngram 2

[2023-03-10 01:47:27,365 INFO] PRED SCORE: -0.3187, PRED PPL: 1.38 NB SENTENCES: 1014
[2023-03-10 01:47:59,592 INFO] PRED SCORE: -0.2852, PRED PPL: 1.33 NB SENTENCES: 1014
[2023-03-10 01:48:31,087 INFO] PRED SCORE: -0.2513, PRED PPL: 1.29 NB SENTENCES: 1014
[2023-03-10 01:49:02,291 INFO] PRED SCORE: -0.2388, PRED PPL: 1.27 NB SENTENCES: 1014


In [None]:
## Code to compute BLEU scores
!perl  OpenNMT-py/tools/multi-bleu-detok.perl /content/drive/MyDrive/COLX_531_lab3_jhlbxx/data/val_en.tokd < /content/drive/MyDrive/lab3_results/multi30k/run/val_3000.txt
!perl  OpenNMT-py/tools/multi-bleu-detok.perl /content/drive/MyDrive/COLX_531_lab3_jhlbxx/data/val_en.tokd < /content/drive/MyDrive/lab3_results/multi30k/run/val_3500.txt
!perl  OpenNMT-py/tools/multi-bleu-detok.perl /content/drive/MyDrive/COLX_531_lab3_jhlbxx/data/val_en.tokd < /content/drive/MyDrive/lab3_results/multi30k/run/val_4000.txt
!perl  OpenNMT-py/tools/multi-bleu-detok.perl /content/drive/MyDrive/COLX_531_lab3_jhlbxx/data/val_en.tokd < /content/drive/MyDrive/lab3_results/multi30k/run/val_4500.txt

BLEU = 36.11, 63.9/42.7/29.8/20.9 (BP=1.000, ratio=1.075, hyp_len=14439, ref_len=13431)
BLEU = 38.60, 66.9/45.5/32.1/22.7 (BP=1.000, ratio=1.034, hyp_len=13892, ref_len=13431)
BLEU = 39.77, 67.3/46.4/33.3/24.0 (BP=1.000, ratio=1.041, hyp_len=13984, ref_len=13431)
BLEU = 39.84, 67.3/46.6/33.4/24.1 (BP=1.000, ratio=1.044, hyp_len=14018, ref_len=13431)
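
To select the checkpoint programmatically rather than by eye, the scores can be parsed out of the script's output. The sketch below pairs the four BLEU lines above with the checkpoints they were computed for (in the order the commands were run) and picks the highest; the `parse_bleu` helper is hypothetical.

```python
import re


def parse_bleu(line):
    """Extract the overall score from a line such as 'BLEU = 36.11, ...'."""
    match = re.match(r"BLEU = (\d+(?:\.\d+)?)", line)
    if match is None:
        raise ValueError(f"not a BLEU line: {line!r}")
    return float(match.group(1))


# Output lines copied from above, keyed by checkpoint step.
bleu_lines = {
    3000: "BLEU = 36.11, 63.9/42.7/29.8/20.9 (BP=1.000, ratio=1.075, hyp_len=14439, ref_len=13431)",
    3500: "BLEU = 38.60, 66.9/45.5/32.1/22.7 (BP=1.000, ratio=1.034, hyp_len=13892, ref_len=13431)",
    4000: "BLEU = 39.77, 67.3/46.4/33.3/24.0 (BP=1.000, ratio=1.041, hyp_len=13984, ref_len=13431)",
    4500: "BLEU = 39.84, 67.3/46.6/33.4/24.1 (BP=1.000, ratio=1.044, hyp_len=14018, ref_len=13431)",
}

scores = {step: parse_bleu(line) for step, line in bleu_lines.items()}
best_step = max(scores, key=scores.get)
print(best_step, scores[best_step])
```

On these four lines the step-4500 checkpoint comes out on top with a BLEU of 39.84, narrowly ahead of step 4000.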
