
# Fine-tuning an en-pt NMT model without teacher forcing, using JoeyNMT 2.0 and the Tatoeba corpus

This notebook is based on [this demo](https://github.com/joeynmt/joeynmt/blob/main/notebooks/quick-start-with-joeynmt2.ipynb).

> ⚠ **Important:** Before you start, set runtime type to GPU.

In [22]:
!nvidia-smi

Fri Aug  5 11:08:30 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce RTX 208...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 26%   27C    P0    53W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:5E:00.0 Off |                  N/A |
| 29%   28C    P0    53W / 250W |      0MiB / 11019MiB |      1%      Default |
|       

Make sure that you have a compatible PyTorch version.

In [2]:
import torch
torch.__version__

'1.11.0'

Install joeynmt (it's important to clone it from my fork, so teacher forcing can be deactivated).

In [None]:
! git clone https://github.com/lina-conti/joeynmt

In [29]:
%pip install -e /home/lconti/joeynmt

Obtaining file:///home/lconti/joeynmt
  Preparing metadata (setup.py) ... done
Installing collected packages: joeynmt
  Attempting uninstall: joeynmt
    Found existing installation: joeynmt 2.0.0
    Uninstalling joeynmt-2.0.0:
      Successfully uninstalled joeynmt-2.0.0
  Running setup.py develop for joeynmt
Successfully installed joeynmt-2.0.0

### Dataset and vocabulary

The dataset and vocabulary are the same as for the pre-trained model, so we don't need to compute them again.

In [12]:
data_dir = "/home/lconti/en-pt_tatoeba"

## Configuration

Joey NMT reads model and training hyperparameters from a configuration file. We're generating this now to configure paths in the appropriate places.

The configuration below builds a small Transformer model with embeddings shared between the source and target languages, based on the subword vocabularies created for the pre-trained model.

Note the `teacher_forcing` option in the `model` section: it is specific to my fork of joeynmt and accepts "on", "off" or "alternating" (the default is "on").
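
As a hedged illustration of that option, the sketch below builds the YAML line that would switch teacher forcing off. The key name and its three values come from the note above; the helper `teacher_forcing_fragment` is hypothetical, and the placement and indentation under `model` are assumptions to check against the fork.

```python
# Values named in the note above; anything else is rejected.
TEACHER_FORCING_MODES = ("on", "off", "alternating")


def teacher_forcing_fragment(mode: str = "on") -> str:
    """Return a YAML line for the `model` section (hypothetical helper)."""
    if mode not in TEACHER_FORCING_MODES:
        raise ValueError(
            f"teacher_forcing must be one of {TEACHER_FORCING_MODES}, got {mode!r}"
        )
    # Assumption: the key sits directly under `model`, indented by four
    # spaces like the other keys in this notebook's config.
    return f'    teacher_forcing: "{mode}"\n'


print(teacher_forcing_fragment("off"))
```

Rejecting unknown values up front avoids discovering a typo such as "of" only after training has started.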

In [15]:
old_model_dir = "/home/lconti/en-pt_tatoeba/models/normal_tf"
new_model_dir = "/home/lconti/en-pt_tatoeba/models/finetune_tf"

In [16]:
from pathlib import Path

# Create the config
config = """
name: "tatoeba_enpt_finetune_tf_sp"
joeynmt_version: "2.0.0"

data:
    train: "{data_dir}/train"
    dev: "{data_dir}/validation"
    test: "{data_dir}/test"
    dataset_type: "huggingface"
    #dataset_cfg:           # not necessary for manually saved pyarrow dataset
    #    name: "en-pt"
    sample_dev_subset: 200
    src:
        lang: "en"
        max_length: 100
        lowercase: False
        normalize: False
        level: "bpe"
        voc_limit: 32000
        voc_min_freq: 1
        voc_file: "{data_dir}/vocab.txt"
        tokenizer_type: "sentencepiece"
        tokenizer_cfg:
            model_file: "{data_dir}/sp.model"

    trg:
        lang: "pt"
        max_length: 100
        lowercase: False
        normalize: False
        level: "bpe"
        voc_limit: 32000
        voc_min_freq: 1
        voc_file: "{data_dir}/vocab.txt"
        tokenizer_type: "sentencepiece"
        tokenizer_cfg:
            model_file: "{data_dir}/sp.model"

""".format(data_dir=data_dir)
with (Path(data_dir) / "config_finetune.yaml").open('w') as f:
    f.write(config)

In [17]:
config += """
testing:
    n_best: 1
    beam_size: 5
    beam_alpha: 1.0
    batch_size: 256
    batch_type: "token"
    max_output_length: 100
    eval_metrics: ["bleu"]
    #return_prob: "hyp"
    #return_attention: False
    sacrebleu_cfg:
        tokenize: "13a"

training:
    load_model: "{old_model_dir}/latest.ckpt"
    reset_best_ckpt: True
    reset_scheduler: True
    reset_optimizer: True
    #reset_iter_state: False
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999]
    scheduling: "warmupinversesquareroot"
    learning_rate_warmup: 2000
    learning_rate: 0.0002
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    loss: "crossentropy"
    batch_size: 512
    batch_type: "token"
    batch_multiplier: 4
    early_stopping_metric: "bleu"
    epochs: 10
    updates: 40000
    validation_freq: 1000
    logging_freq: 100
    model_dir: "{new_model_dir}"
    overwrite: False
    shuffle: True
    use_cuda: True
    print_valid_sents: [0, 1, 2, 3]
    keep_best_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4
        embeddings:
            embedding_dim: 256
            scale: True
            dropout: 0.0
        # typically ff_size = 4 x hidden_size
        hidden_size: 256
        ff_size: 1024
        dropout: 0.1
        layer_norm: "pre"
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 256
            scale: True
            dropout: 0.0
        # typically ff_size = 4 x hidden_size
        hidden_size: 256
        ff_size: 1024
        dropout: 0.1
        layer_norm: "pre"

""".format(old_model_dir=old_model_dir, new_model_dir=new_model_dir)
with (Path(data_dir) / "config_finetune.yaml").open('w') as f:
    f.write(config)
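
Since the paths are filled in with `str.format`, a quick check that no `{placeholder}` was left unfilled can catch a typo before training starts. This is a minimal sketch: the helper `unfilled_placeholders` and the sample values are hypothetical, and it only looks for simple `{name}` fields.

```python
import re


def unfilled_placeholders(config_text: str) -> list:
    """Return the names of `{placeholder}` fields still present in the text."""
    return sorted(set(re.findall(r"\{([A-Za-z_][A-Za-z0-9_]*)\}", config_text)))


# Small stand-in for the real config string (hypothetical values).
sample = 'train: "{data_dir}/train"\nmodel_dir: "/tmp/models/demo"\n'
print(unfilled_placeholders(sample))                            # ['data_dir']
print(unfilled_placeholders(sample.format(data_dir="/tmp/d")))  # []
```

Running `unfilled_placeholders(config)` on the notebook's own `config` string should return an empty list once both cells have been executed.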

## Model Training


### Run training
⏳ This will take a while. Model parameters will be stored in `new_model_dir`. The log reports training progress; look out for the example translations and the BLEU scores it prints to get an impression of the current quality.

> ⛔ If you execute this twice, you might get an error that the model directory already exists. You can set `overwrite: True` in the configuration, or delete the directory manually (`!rm -r {new_model_dir}`).

In [None]:
!python -m joeynmt train {data_dir}/config_finetune.yaml

### Continue training after interruption
To continue after an interruption, the configuration needs to be modified in the following places:

- `load_model` to point to the checkpoint to load.
- `reset_*` options (must be False) to resume the previous session.
- `model_dir` to create a new directory.

When resuming training, I hit the bug described [here](https://github.com/pytorch/pytorch/issues/80809). The only workaround I have found is downgrading PyTorch.

In [24]:
%pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.10.1+cu111
  Using cached https://download.pytorch.org/whl/cu111/torch-1.10.1%2Bcu111-cp39-cp39-linux_x86_64.whl (2137.7 MB)
Collecting torchvision==0.11.2+cu111
  Using cached https://download.pytorch.org/whl/cu111/torchvision-0.11.2%2Bcu111-cp39-cp39-linux_x86_64.whl (24.5 MB)
Collecting torchaudio==0.10.1
  Using cached https://download.pytorch.org/whl/rocm4.1/torchaudio-0.10.1%2Brocm4.1-cp39-cp39-linux_x86_64.whl (2.7 MB)
Installing collected packages: torch, torchvision, torchaudio
  Attempting uninstall: torch
    Found existing installation: torch 1.12.0
    Uninstalling torch-1.12.0:
      Successfully uninstalled torch-1.12.0
Successfully installed torch-1.10.1+cu111 torchaudio-0.10.1+rocm4.1 torchvision-0.11.2+cu111

In [31]:
%pip install setuptools==59.5.0

Collecting setuptools==59.5.0
  Downloading setuptools-59.5.0-py3-none-any.whl (952 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 952.4/952.4 kB 41.8 MB/s eta 0:00:00
Installing collected packages: setuptools
  Attempting uninstall: setuptools
    Found existing installation: setuptools 62.3.2
    Uninstalling setuptools-62.3.2:
      Successfully uninstalled setuptools-62.3.2
Successfully installed setuptools-59.5.0

In [12]:
resume_config = config\
  .replace('load_model: "/home/lconti/en-pt_tatoeba/models/normal_tf/latest.ckpt"', 'load_model: "/home/lconti/en-pt_tatoeba/models/finetune_tf/latest.ckpt"')\
  .replace('reset_best_ckpt: True', 'reset_best_ckpt: False')\
  .replace('reset_scheduler: True', 'reset_scheduler: False')\
  .replace('reset_optimizer: True', 'reset_optimizer: False')\
  .replace('#reset_iter_state: False', 'reset_iter_state: False')\
  .replace('model_dir: "/home/lconti/en-pt_tatoeba/models/finetune_tf"', 'model_dir: "/home/lconti/en-pt_tatoeba/models/finetune2_tf"')

with (Path(data_dir) / "config_finetune2.yaml").open('w') as f:
    f.write(resume_config)
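
Because `str.replace` silently returns the text unchanged when the pattern is absent, a mistyped path in a chain of replacements would go unnoticed. The sketch below shows one way to make each substitution fail loudly; `replace_checked` is a hypothetical helper, and the demo string and paths are stand-ins for the real config.

```python
def replace_checked(text: str, old: str, new: str) -> str:
    """Like str.replace, but raise unless `old` occurs exactly once."""
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one occurrence of {old!r}, found {count}")
    return text.replace(old, new)


# Stand-in config snippet (hypothetical paths).
demo = 'load_model: "/models/a/latest.ckpt"\nreset_optimizer: True\n'
resumed = replace_checked(demo, 'reset_optimizer: True', 'reset_optimizer: False')
resumed = replace_checked(
    resumed,
    'load_model: "/models/a/latest.ckpt"',
    'load_model: "/models/b/latest.ckpt"',
)
print(resumed)
```

Requiring exactly one match also guards against a pattern that accidentally hits several keys at once.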

In [19]:
!python -m joeynmt train {data_dir}/config_finetune2.yaml

2022-08-05 17:00:26,618 - INFO - root - Hello! This is Joey-NMT (version 2.0.0).
2022-08-05 17:00:26,618 - INFO - joeynmt.helpers -                           cfg.name : tatoeba_enpt_finetune2_tf_sp
2022-08-05 17:00:26,618 - INFO - joeynmt.helpers -                cfg.joeynmt_version : 2.0.0
2022-08-05 17:00:26,618 - INFO - joeynmt.helpers -                     cfg.data.train : /home/lconti/en-pt_tatoeba/train
2022-08-05 17:00:26,618 - INFO - joeynmt.helpers -                       cfg.data.dev : /home/lconti/en-pt_tatoeba/validation
2022-08-05 17:00:26,618 - INFO - joeynmt.helpers -                      cfg.data.test : /home/lconti/en-pt_tatoeba/test
2022-08-05 17:00:26,618 - INFO - joeynmt.helpers -              cfg.data.dataset_type : huggingface
2022-08-05 17:00:26,618 - INFO - joeynmt.helpers -         cfg.data.sample_dev_subset : 200
2022-08-05 17:00:26,618 - INFO - joeynmt.helpers -                  cfg.data.src.lang : en
2022-08-05 17:00:26,618 - INFO - joeynmt.helpers -        

> 💡 Epoch counting restarts from the beginning, but step numbers should continue from where they left off, and you should find a "reloading" line in the training log.

## Evaluation


The `test` mode can be used to translate (and evaluate on) the test set specified in the configuration. We usually do this only once after we've tuned hyperparameters on the dev set.

In [20]:
!python -m joeynmt test {data_dir}/config_finetune2.yaml --ckpt /home/lconti/en-pt_tatoeba/models/finetune2_tf/best.ckpt

2022-08-05 18:56:39,107 - INFO - root - Hello! This is Joey-NMT (version 2.0.0).
2022-08-05 18:56:39,107 - INFO - joeynmt.data - Building tokenizer...
2022-08-05 18:56:39,192 - INFO - joeynmt.tokenizers - en tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)
2022-08-05 18:56:39,192 - INFO - joeynmt.tokenizers - pt tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)
2022-08-05 18:56:39,192 - INFO - joeynmt.data - Building vocabulary...
2022-08-05 18:56:51,465 - INFO - joeynmt.data - Loading dev set...
2022-08-05 18:56:52,365 - INFO - joeynmt.data - Loading test set...
2022-08-05 18:56:53,032 - INFO - joeynmt.data - Data loaded.
2022-08-05 18:56:53,032 - INFO - joeynmt.helpers - Train dataset: None
2022-08-05 18:56:53,033 - I

> ⚠ In beam search, the batch size is expanded {beam_size} times. For instance, if batch_size=10, batch_type=sentence and beam_size=5, joeynmt internally creates a batch of size 10*5=50, which may cause an out-of-memory error. Take this into account when setting `batch_size` in the `testing` section of the configuration.
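
The arithmetic in the note can be sketched as follows. Both helper names are hypothetical, and `memory_budget` stands for the largest expanded batch your GPU can hold, a value you would have to determine empirically.

```python
def expanded_batch_size(batch_size: int, beam_size: int) -> int:
    """Size of the internal batch once each input is expanded beam_size times."""
    return batch_size * beam_size


def largest_safe_batch_size(memory_budget: int, beam_size: int) -> int:
    """Largest batch_size whose expanded batch stays within memory_budget."""
    return memory_budget // beam_size


print(expanded_batch_size(10, 5))       # 50, the example from the note
print(largest_safe_batch_size(256, 5))  # 51
```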


The `translate` mode is interactive: it prompts for source sentences and translates them one by one.

Let's translate a few examples!

In [None]:
!python -m joeynmt translate {data_dir}/config.yaml --ckpt {model_dir}_resume/best.ckpt

2022-06-04 22:43:59,643 - INFO - root - Hello! This is Joey-NMT (version 2.0.0).
2022-06-04 22:44:17,067 - INFO - joeynmt.model - Building an encoder-decoder model...
2022-06-04 22:44:17,429 - INFO - joeynmt.model - Enc-dec model built.
2022-06-04 22:44:21,857 - INFO - joeynmt.helpers - Load model from /content/drive/MyDrive/models/tatoeba_deen_resume/19000.ckpt.
2022-06-04 22:44:22,065 - INFO - joeynmt.tokenizers - de tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)
2022-06-04 22:44:22,065 - INFO - joeynmt.tokenizers - en tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)

Please enter a source sentence:
Maschinelle Übersetzung macht Spaß!
2022-06-04 22:46:22,281 - INFO - joeynmt.prediction - Predicting 1 example(s)... 

You can also get the n-best hypotheses (up to the size of the beam, 5 in our example), not only the highest-scoring one. The better your model gets, the more interesting the alternatives should be.



In [None]:
nbest_config = config.replace('n_best: 1', 'n_best: 5')\
  .replace('#return_prob: "hyp"', 'return_prob: "hyp"')

with (Path(data_dir) / "nbest_config.yaml").open('w') as f:
    f.write(nbest_config)

In [None]:
!python -m joeynmt translate {data_dir}/nbest_config.yaml --ckpt {model_dir}_resume/best.ckpt

2022-06-04 22:48:11,849 - INFO - root - Hello! This is Joey-NMT (version 2.0.0).
2022-06-04 22:48:29,174 - INFO - joeynmt.model - Building an encoder-decoder model...
2022-06-04 22:48:29,557 - INFO - joeynmt.model - Enc-dec model built.
2022-06-04 22:48:34,476 - INFO - joeynmt.helpers - Load model from /content/drive/MyDrive/models/tatoeba_deen_resume/19000.ckpt.
2022-06-04 22:48:34,743 - INFO - joeynmt.tokenizers - de tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)
2022-06-04 22:48:34,743 - INFO - joeynmt.tokenizers - en tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)

Please enter a source sentence:
Maschinelle Übersetzung macht Spaß!
2022-06-04 22:49:28,668 - INFO - joeynmt.prediction - Predicting 1 example(s)... 

> 💡 In BPE decoding, there are multiple ways to tokenize one sequence. The same output string can therefore appear several times in the n-best list, because each occurrence was generated from a different token sequence.
> For instance, say the 3-best generations were:
> ```
> #1 best ['▁', 'N', 'e', 'w', '▁York']
> #2 best ['▁', 'New', '▁York']
> #3 best ['▁', 'New', '▁Y', 'o', 'r', 'k']
> ```
> All three differ in their next-token predictions, but end up as the same string `New York` once the BPE segmentation is undone.
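
To see why the three candidates collapse into one string, here is a minimal sketch of undoing the segmentation. `detokenize` is a simplified, hypothetical stand-in for SentencePiece decoding (it only joins the pieces and turns the `▁` word-boundary marker into a space), not the library's implementation.

```python
def detokenize(pieces):
    """Join subword pieces; '▁' marks the start of a new word."""
    return "".join(pieces).replace("▁", " ").strip()


candidates = [
    ["▁", "N", "e", "w", "▁York"],
    ["▁", "New", "▁York"],
    ["▁", "New", "▁Y", "o", "r", "k"],
]

for rank, pieces in enumerate(candidates, start=1):
    print(f"#{rank} best -> {detokenize(pieces)!r}")

# All three token sequences yield the same surface string.
unique_strings = {detokenize(pieces) for pieces in candidates}
print(unique_strings)  # {'New York'}
```

Deduplicating an n-best list on the detokenized string, as `unique_strings` does here, is one simple way to keep only distinct translations.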


## Appendix:

### Plotting learning curves

The `plot_validations.py` script generates validation learning curves.

We resumed the training once, so we first concatenate the `validations.txt` files and then use the concatenated `validations.txt` for plotting.

In [None]:
!cat /home/lconti/en-pt_tatoeba/models/normal_tf/validations.txt /home/lconti/en-pt_tatoeba/models/finetune_tf/validations.txt > /home/lconti/en-pt_tatoeba/models/finetune_tf/_validations.txt
!mv /home/lconti/en-pt_tatoeba/models/finetune_tf/validations.txt /home/lconti/en-pt_tatoeba/models/finetune_tf/resumed_validations.txt
!mv /home/lconti/en-pt_tatoeba/models/finetune_tf/_validations.txt /home/lconti/en-pt_tatoeba/models/finetune_tf/validations.txt
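
The same concatenation can be done in Python if shell commands are not available. This is a sketch only: `concatenate` is a hypothetical helper, and the demo runs on throwaway files with placeholder contents rather than real validation logs.

```python
import tempfile
from pathlib import Path


def concatenate(sources, target):
    """Write the contents of `sources`, in order, to `target`."""
    with Path(target).open("w", encoding="utf-8") as out:
        for src in sources:
            out.write(Path(src).read_text(encoding="utf-8"))


# Demo on temporary files (placeholder contents).
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "first.txt"
    second = Path(tmp) / "second.txt"
    first.write_text("line from first run\n", encoding="utf-8")
    second.write_text("line from second run\n", encoding="utf-8")

    merged = Path(tmp) / "merged.txt"
    concatenate([first, second], merged)
    merged_text = merged.read_text(encoding="utf-8")

print(merged_text)
```

Writing to a separate target first, as the shell cell above also does, keeps the original files intact until the merged file has been produced.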

In [None]:
!python /home/lconti/joeynmt/scripts/plot_validations.py /home/lconti/en-pt_tatoeba/models/finetune_tf --output_path /home/lconti/results/finetune_learning_curve.png