
# Training a toy NMT model en-pt with the tatoeba corpus and JoeyNMT 2.0

This notebook is based on [this demo](https://github.com/joeynmt/joeynmt/blob/main/notebooks/quick-start-with-joeynmt2.ipynb).

> ⚠ **Important:** Before you start, set runtime type to GPU.

In [1]:
!nvidia-smi

Wed Aug  3 17:55:21 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce RTX 208...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 26%   25C    P0    52W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:5E:00.0 Off |                  N/A |
| 29%   26C    P0    53W / 250W |      0MiB / 11019MiB |      1%      Default |
|       

Make sure that you have a compatible PyTorch version.

In [2]:
import torch
torch.__version__

'1.11.0'

Install joeynmt from source:

In [None]:
!git clone https://github.com/lina-conti/joeynmt

In [None]:
!pip3 install -e ./joeynmt

## Data Preparation

### Download
We'll use English-Portuguese translations from the [Tatoeba](https://tatoeba.org/) collection ([CC-BY 2.0 FR](https://creativecommons.org/licenses/by/2.0/fr/)).

The [Tatoeba](https://huggingface.co/datasets/tatoeba) corpus is available in Hugging Face's `datasets` library.



In [7]:
data_dir = "/home/lconti/en-pt_tatoeba/data"

In [10]:
!mkdir -p {data_dir}
!wget -O {data_dir}/get_tatoeba.py https://raw.githubusercontent.com/may-/datasets/master/datasets/tatoeba/tatoeba.py

--2022-08-01 11:07:17--  https://raw.githubusercontent.com/may-/datasets/master/datasets/tatoeba/tatoeba.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4425 (4.3K) [text/plain]
Saving to: '/home/lconti/en-pt_tatoeba/data/get_tatoeba.py'


2022-08-01 11:07:18 (40.1 MB/s) - '/home/lconti/en-pt_tatoeba/data/get_tatoeba.py' saved [4425/4425]



The Tatoeba dataset on the Hugging Face Hub has only a train split, with no dev or test split, so let's split the data manually and save it locally.

> 📝 Most dataset loading scripts on Hugging Face come with predefined train/dev/test splits, e.g. [wmt17](https://huggingface.co/datasets/wmt17). In that case you can skip this step and go straight to the Vocabulary section.

In [12]:
from datasets import load_dataset
tatoeba_kwargs = {
  "path": f"{data_dir}/get_tatoeba.py",
  "lang1": "en",
  "lang2": "pt",
  "ignore_verifications": True,
  "cache_dir": "/tmp/.cache/huggingface"
}

tatoeba_dev = load_dataset(split="train[:1000]", **tatoeba_kwargs)
tatoeba_test = load_dataset(split="train[1000:2000]", **tatoeba_kwargs)
tatoeba_train = load_dataset(split="train[2000:]", **tatoeba_kwargs)

tatoeba_dev, tatoeba_test, tatoeba_train

Using custom data configuration en-pt-lang1=en,lang2=pt


Downloading and preparing dataset get_tatoeba/en-pt to /tmp/.cache/huggingface/get_tatoeba/en-pt-lang1=en,lang2=pt/0.0.0/336de120b2cb1a268f4eb9ebc7969075ccfabb978716d834a58a7889dbb5f267...


Downloading data:   0%|          | 0.00/6.14M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Using custom data configuration en-pt-lang1=en,lang2=pt
Reusing dataset get_tatoeba (/tmp/.cache/huggingface/get_tatoeba/en-pt-lang1=en,lang2=pt/0.0.0/336de120b2cb1a268f4eb9ebc7969075ccfabb978716d834a58a7889dbb5f267)
Using custom data configuration en-pt-lang1=en,lang2=pt
Reusing dataset get_tatoeba (/tmp/.cache/huggingface/get_tatoeba/en-pt-lang1=en,lang2=pt/0.0.0/336de120b2cb1a268f4eb9ebc7969075ccfabb978716d834a58a7889dbb5f267)


Dataset get_tatoeba downloaded and prepared to /tmp/.cache/huggingface/get_tatoeba/en-pt-lang1=en,lang2=pt/0.0.0/336de120b2cb1a268f4eb9ebc7969075ccfabb978716d834a58a7889dbb5f267. Subsequent calls will reuse this data.


(Dataset({
     features: ['id', 'translation'],
     num_rows: 1000
 }),
 Dataset({
     features: ['id', 'translation'],
     num_rows: 1000
 }),
 Dataset({
     features: ['id', 'translation'],
     num_rows: 215647
 }))
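
The three `split` strings above carve the single `train` split into contiguous, non-overlapping ranges. A minimal sketch of that arithmetic, using a hypothetical helper `sequential_splits` (it is not part of `datasets`) that builds the same slice strings and checks the row counts against the output above:

```python
def sequential_splits(n_rows, n_dev, n_test, base="train"):
    """Build slice strings that take dev rows first, then test, then train.

    `sequential_splits` is a hypothetical helper for illustration only.
    """
    if n_dev + n_test >= n_rows:
        raise ValueError("dev + test must be smaller than the dataset")
    dev_end = n_dev
    test_end = n_dev + n_test
    splits = {
        "dev": f"{base}[:{dev_end}]",
        "test": f"{base}[{dev_end}:{test_end}]",
        "train": f"{base}[{test_end}:]",
    }
    sizes = {"dev": n_dev, "test": n_test, "train": n_rows - test_end}
    return splits, sizes


# 1000 + 1000 + 215647 rows, as reported in the output above.
splits, sizes = sequential_splits(n_rows=217_647, n_dev=1000, n_test=1000)
print(splits)
print(sizes)
```

Each value in `splits` could be passed as the `split` argument of `load_dataset`, as in the cell above.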

Inspect the data

In [13]:
tatoeba_dev['translation'][:3]

[{'en': "Let's try something.", 'pt': 'Vamos tentar alguma coisa!'},
 {'en': "Let's try something.", 'pt': 'Vamos tentar algo!'},
 {'en': "Let's try something.", 'pt': 'Vamos tentar algo.'}]

In [14]:
tatoeba_test['translation'][:3]

[{'en': "You're my type.", 'pt': 'Você é o meu tipo.'},
 {'en': "You're irresistible.", 'pt': 'Você é irresistível.'},
 {'en': 'Could you call again later, please?',
  'pt': 'Você poderia telefonar de novo mais tarde, por favor?'}]

In [15]:
tatoeba_train['translation'][:3]

[{'en': 'What do you want now?', 'pt': 'O que você deseja agora?'},
 {'en': 'Do you love music?', 'pt': 'Você ama música?'},
 {'en': 'Do you love music?', 'pt': 'Você aprecia a música?'}]
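
The samples above show that one English sentence can have several Portuguese translations, and that such groups sit next to each other in the corpus. Since the splits are contiguous slices, it may be worth checking how often source sentences repeat. A minimal sketch on plain lists of dicts in the shape shown above; `count_repeated_sources` is a hypothetical helper, and the sample rows are copied from the dev output:

```python
from collections import Counter


def count_repeated_sources(pairs, src_lang="en"):
    """Return {source sentence: count} for sources that occur more than once."""
    counts = Counter(pair[src_lang] for pair in pairs)
    return {sentence: n for sentence, n in counts.items() if n > 1}


# First three dev rows, copied from the output above.
dev_sample = [
    {"en": "Let's try something.", "pt": "Vamos tentar alguma coisa!"},
    {"en": "Let's try something.", "pt": "Vamos tentar algo!"},
    {"en": "Let's try something.", "pt": "Vamos tentar algo."},
]

repeated = count_repeated_sources(dev_sample)
print(repeated)
```

On the real data, `tatoeba_dev['translation']` could be passed in place of `dev_sample`.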

Save the train/dev/test splits to the local directory.

In [16]:
from datasets.dataset_dict import DatasetDict

dataset_dict = DatasetDict({ 
  "train": tatoeba_train,
  "validation": tatoeba_dev,
  "test": tatoeba_test
})

dataset_dict.save_to_disk(data_dir)

### Vocabulary

We will use the [sentencepiece](https://github.com/google/sentencepiece) library to split words into subwords according to their frequency in the training corpus. The config refers to this subword level as `bpe`; note that the training log below reports `model_type=unigram`.

The `build_vocab.py` script trains the subword model and creates a joint vocabulary. It takes the same config file as joeynmt itself.

In [13]:
from pathlib import Path

# Create the config
config = """
name: "tatoeba_enpt_sp"
joeynmt_version: "2.0.0"

data:
    train: "{data_dir}/train"
    dev: "{data_dir}/validation"
    test: "{data_dir}/test"
    dataset_type: "huggingface"
    #dataset_cfg:           # not necessary for a manually saved pyarrow dataset
    #    name: "en-pt"
    sample_dev_subset: 200
    src:
        lang: "en"
        max_length: 100
        lowercase: False
        normalize: False
        level: "bpe"
        voc_limit: 32000
        voc_min_freq: 1
        voc_file: "{data_dir}/vocab.txt"
        tokenizer_type: "sentencepiece"
        tokenizer_cfg:
            model_file: "{data_dir}/sp.model"

    trg:
        lang: "pt"
        max_length: 100
        lowercase: False
        normalize: False
        level: "bpe"
        voc_limit: 32000
        voc_min_freq: 1
        voc_file: "{data_dir}/vocab.txt"
        tokenizer_type: "sentencepiece"
        tokenizer_cfg:
            model_file: "{data_dir}/sp.model"

""".format(data_dir=data_dir)
with (Path(data_dir) / "config_normal.yaml").open('w') as f:
    f.write(config)
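
The config is a plain string template filled in with `str.format`, so every `{name}` in it must be supplied, and any literal brace would need to be doubled. A minimal sketch, using only the standard library, that lists the placeholders of a template before formatting it; `placeholder_names` is a hypothetical helper and the template is a shortened stand-in for the config above:

```python
from string import Formatter


def placeholder_names(template):
    """Return the set of `{name}` fields that `template.format(...)` expects."""
    return {
        field_name
        for _, field_name, _, _ in Formatter().parse(template)
        if field_name  # None for literal text, "" for a positional "{}"
    }


template = """
data:
    train: "{data_dir}/train"
    dev: "{data_dir}/validation"
    src:
        voc_file: "{data_dir}/vocab.txt"
"""

names = placeholder_names(template)
print(names)

# Hypothetical path, used only to show that no placeholder is left behind.
filled = template.format(data_dir="/tmp/example")
remaining = placeholder_names(filled)
print(remaining)
```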

Call the script with the `--joint` flag:

In [19]:
! wget -O {data_dir}/build_vocab.py https://raw.githubusercontent.com/joeynmt/joeynmt/main/scripts/build_vocab.py

--2022-08-01 11:20:25--  https://raw.githubusercontent.com/joeynmt/joeynmt/main/scripts/build_vocab.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10277 (10K) [text/plain]
Saving to: '/home/lconti/en-pt_tatoeba/data/build_vocab.py'


2022-08-01 11:20:25 (92.8 MB/s) - '/home/lconti/en-pt_tatoeba/data/build_vocab.py' saved [10277/10277]



In [20]:
!python {data_dir}/build_vocab.py {data_dir}/config_normal.yaml --joint

Dropping NaN...: 100%|████████████████████████| 216/216 [00:02<00:00, 72.46ba/s]
Preprocessing...: 100%|███████████████| 215642/215642 [00:28<00:00, 7436.45ex/s]
### Training sentencepiece...
sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=/tmp/sentencepiece_6py8tfgs.txt --model_prefix=/home/lconti/en-pt_tatoeba/data/sp --model_type=unigram --vocab_size=32000 --character_coverage=1.0 --accept_language=en,pt --unk_piece=<unk> --bos_piece=<s> --eos_piece=</s> --pad_piece=<pad> --unk_id=0 --bos_id=2 --eos_id=3 --pad_id=1 --vocabulary_output_piece_score=false
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: /tmp/sentencepiece_6py8tfgs.txt
  input_format: 
  model_prefix: /home/lconti/en-pt_tatoeba/data/sp
  model_type: UNIGRAM
  vocab_size: 32000
  accept_language: en
  accept_language: pt
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinki

The generated vocabulary looks like this:

In [21]:
!head -20 {data_dir}/vocab.txt

<unk>
<pad>
<s>
</s>
.
▁Tom
'
▁I
?
▁a
▁to
▁que
s
,
▁the
▁de
▁you
t
▁o
▁não
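
The first four entries are the special tokens, in the order given by the `--unk_id=0 --pad_id=1 --bos_id=2 --eos_id=3` flags in the training log above. A minimal sketch that reads a one-token-per-line vocabulary file and checks those ids; `load_vocab` is a hypothetical helper (not JoeyNMT's own loader), and the temporary file stands in for `vocab.txt`:

```python
import tempfile
from pathlib import Path


def load_vocab(path):
    """Map each token to its line index in a one-token-per-line file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {token: index for index, token in enumerate(lines)}


# Stand-in for vocab.txt: the first lines of the listing above.
sample = ["<unk>", "<pad>", "<s>", "</s>", ".", "▁Tom", "'", "▁I"]
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = Path(tmp_dir) / "vocab.txt"
    vocab_path.write_text("\n".join(sample) + "\n", encoding="utf-8")
    vocab = load_vocab(vocab_path)

special_ids = {token: vocab[token] for token in ("<unk>", "<pad>", "<s>", "</s>")}
print(special_ids)
```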


## Configuration

Joey NMT reads model and training hyperparameters from a configuration file. We generate it here so that the paths are filled in at the appropriate places.

The configuration below builds a small Transformer model with embeddings shared between the source and target languages, based on the joint subword vocabulary created above.

In [1]:
model_dir = "/home/lconti/en-pt_tatoeba/models/normal_tf"

In [14]:
config += """
testing:
    n_best: 1
    beam_size: 5
    beam_alpha: 1.0
    batch_size: 256
    batch_type: "token"
    max_output_length: 100
    eval_metrics: ["bleu"]
    #return_prob: "hyp"
    #return_attention: False
    sacrebleu_cfg:
        tokenize: "13a"

training:
    #load_model: "{model_dir}/latest.ckpt"
    #reset_best_ckpt: False
    #reset_scheduler: False
    #reset_optimizer: False
    #reset_iter_state: False
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999]
    scheduling: "warmupinversesquareroot"
    learning_rate_warmup: 2000
    learning_rate: 0.0002
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    loss: "crossentropy"
    batch_size: 512
    batch_type: "token"
    batch_multiplier: 4
    early_stopping_metric: "bleu"
    epochs: 10
    updates: 20000
    validation_freq: 1000
    logging_freq: 100
    model_dir: "{model_dir}"
    overwrite: False
    shuffle: True
    use_cuda: True
    print_valid_sents: [0, 1, 2, 3]
    keep_best_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4
        embeddings:
            embedding_dim: 256
            scale: True
            dropout: 0.0
        # typically ff_size = 4 x hidden_size
        hidden_size: 256
        ff_size: 1024
        dropout: 0.1
        layer_norm: "pre"
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 256
            scale: True
            dropout: 0.0
        # typically ff_size = 4 x hidden_size
        hidden_size: 256
        ff_size: 1024
        dropout: 0.1
        layer_norm: "pre"

""".format(model_dir=model_dir)
with (Path(data_dir) / "config_normal.yaml").open('w') as f:
    f.write(config)
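
A few of the values above interact. A minimal sketch of the arithmetic, assuming (as is usual for gradient accumulation and multi-head attention, but not confirmed by this notebook) that `batch_multiplier` batches are accumulated per update and that each head gets `hidden_size / num_heads` dimensions. The dict below is a hand-copied subset of the config, not something JoeyNMT produces:

```python
# Hand-copied subset of the YAML above.
cfg = {
    "training": {"batch_size": 512, "batch_multiplier": 4},
    "encoder": {"hidden_size": 256, "num_heads": 4, "ff_size": 1024},
    "decoder": {"hidden_size": 256, "num_heads": 8, "ff_size": 1024},
}

# Tokens contributing to one optimizer update (batch_type is "token").
# Assumption: gradients are accumulated over `batch_multiplier` batches.
tokens_per_update = cfg["training"]["batch_size"] * cfg["training"]["batch_multiplier"]
print("tokens per update:", tokens_per_update)

head_dims = {}
for name in ("encoder", "decoder"):
    block = cfg[name]
    # The hidden size has to split evenly across the attention heads.
    assert block["hidden_size"] % block["num_heads"] == 0
    head_dims[name] = block["hidden_size"] // block["num_heads"]
    # The comment in the config suggests ff_size = 4 x hidden_size.
    ff_ratio = block["ff_size"] // block["hidden_size"]
    print(name, "head dim:", head_dims[name], "ff ratio:", ff_ratio)
```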

## Model Training


### Run training
⏳ This will take a while. Model parameters will be stored in `model_dir`. The log reports the training progress; look out for the printed example translations and the BLEU scores to get an impression of the current quality.

> ⛔ If you execute this twice, you might get an error that the model directory already exists. You can specify in the configuration to overwrite it, or delete it manually (`!rm -r {model_dir}`).
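
The manual clean-up can also be done from Python. A minimal sketch using only the standard library; `prepare_model_dir` is a hypothetical helper, not a JoeyNMT function, and removing a directory also removes any checkpoints stored in it:

```python
import shutil
import tempfile
from pathlib import Path


def prepare_model_dir(model_dir, overwrite=False):
    """Remove an existing model directory only when `overwrite` is True.

    Returns True if the path is free to use afterwards, False otherwise.
    """
    path = Path(model_dir)
    if path.exists():
        if not overwrite:
            return False
        shutil.rmtree(path)
    return True


# Demonstration on a throwaway directory instead of the real model_dir.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_dir = Path(tmp_dir) / "normal_tf"
    demo_dir.mkdir()
    kept = prepare_model_dir(demo_dir, overwrite=False)
    removed = prepare_model_dir(demo_dir, overwrite=True)
    still_there = demo_dir.exists()

print(kept, removed, still_there)
```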

In [None]:
!python -m joeynmt train {data_dir}/config_normal.yaml

2022-08-01 11:33:41,216 - INFO - root - Hello! This is Joey-NMT (version 2.0.0).
2022-08-01 11:33:41,216 - INFO - joeynmt.helpers -                           cfg.name : tatoeba_enpt_sp
2022-08-01 11:33:41,216 - INFO - joeynmt.helpers -                cfg.joeynmt_version : 2.0.0
2022-08-01 11:33:41,217 - INFO - joeynmt.helpers -                     cfg.data.train : /home/lconti/en-pt_tatoeba/data/train
2022-08-01 11:33:41,217 - INFO - joeynmt.helpers -                       cfg.data.dev : /home/lconti/en-pt_tatoeba/data/validation
2022-08-01 11:33:41,217 - INFO - joeynmt.helpers -                      cfg.data.test : /home/lconti/en-pt_tatoeba/data/test
2022-08-01 11:33:41,217 - INFO - joeynmt.helpers -              cfg.data.dataset_type : huggingface
2022-08-01 11:33:41,217 - INFO - joeynmt.helpers -         cfg.data.sample_dev_subset : 200
2022-08-01 11:33:41,217 - INFO - joeynmt.helpers -                  cfg.data.src.lang : en
2022-08-01 11:33:41,217 - INFO - joeynmt.helper

2022-08-01 11:33:41,255 - INFO - joeynmt.data - Building tokenizer...
2022-08-01 11:33:41,544 - INFO - joeynmt.tokenizers - en tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)
2022-08-01 11:33:41,552 - INFO - joeynmt.tokenizers - pt tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)
2022-08-01 11:33:41,552 - INFO - joeynmt.data - Loading train set...
Dropping NaN...: 100%|████████████████████████| 216/216 [00:03<00:00, 69.49ba/s]
Preprocessing...: 100%|███████████████| 215642/215642 [00:29<00:00, 7243.66ex/s]
2022-08-01 11:34:15,656 - INFO - joeynmt.data - Building vocabulary...
2022-08-01 11:34:48,734 - INFO - joeynmt.data - Loading dev set...
Dropping NaN...: 100%|████████████████████████████| 1/1 [00:00<00:00, 76.94ba

2022-08-01 12:02:40,873 - INFO - joeynmt.training - Epoch   1, Step:     1100, Batch Loss:     1.111693, Batch Acc: 0.002789, Tokens per Sec:      634, Lr: 0.000110
2022-08-01 12:05:07,983 - INFO - joeynmt.training - Epoch   1, Step:     1200, Batch Loss:     1.034755, Batch Acc: 0.003340, Tokens per Sec:      643, Lr: 0.000120
2022-08-01 12:07:33,440 - INFO - joeynmt.training - Epoch   1, Step:     1300, Batch Loss:     0.999859, Batch Acc: 0.003175, Tokens per Sec:      647, Lr: 0.000130
2022-08-01 12:10:01,040 - INFO - joeynmt.training - Epoch   1, Step:     1400, Batch Loss:     0.922331, Batch Acc: 0.003699, Tokens per Sec:      648, Lr: 0.000140
2022-08-01 12:12:24,275 - INFO - joeynmt.training - Epoch   1, Step:     1500, Batch Loss:     0.957557, Batch Acc: 0.003696, Tokens per Sec:      661, Lr: 0.000150
2022-08-01 12:14:30,006 - INFO - joeynmt.training - Epoch   1, Step:     1600, Batch Loss:     0.861257, Batch Acc: 0.004069, Tokens per Sec:      752, Lr: 0.000160
2022-08-01

2022-08-01 12:54:33,575 - INFO - joeynmt.training - Epoch   2, Step:     3200, Batch Loss:     0.565820, Batch Acc: 0.005704, Tokens per Sec:      790, Lr: 0.000158
2022-08-01 12:56:33,791 - INFO - joeynmt.training - Epoch   2, Step:     3300, Batch Loss:     0.596018, Batch Acc: 0.005703, Tokens per Sec:      780, Lr: 0.000156
2022-08-01 12:58:58,749 - INFO - joeynmt.training - Epoch   2, Step:     3400, Batch Loss:     0.551572, Batch Acc: 0.005913, Tokens per Sec:      658, Lr: 0.000153
2022-08-01 13:01:30,826 - INFO - joeynmt.training - Epoch   2, Step:     3500, Batch Loss:     0.498475, Batch Acc: 0.006715, Tokens per Sec:      629, Lr: 0.000151
2022-08-01 13:03:56,886 - INFO - joeynmt.training - Epoch   2, Step:     3600, Batch Loss:     0.539954, Batch Acc: 0.006345, Tokens per Sec:      655, Lr: 0.000149
2022-08-01 13:06:26,392 - INFO - joeynmt.training - Epoch   2, Step:     3700, Batch Loss:     0.512146, Batch Acc: 0.005858, Tokens per Sec:      637, Lr: 0.000147
2022-08-01

2022-08-01 13:43:42,894 - INFO - joeynmt.training - Epoch   3, Step:     5200, Batch Loss:     0.353230, Batch Acc: 0.006974, Tokens per Sec:      671, Lr: 0.000124
2022-08-01 13:46:07,872 - INFO - joeynmt.training - Epoch   3, Step:     5300, Batch Loss:     0.452040, Batch Acc: 0.005981, Tokens per Sec:      655, Lr: 0.000123
2022-08-01 13:48:34,816 - INFO - joeynmt.training - Epoch   3, Step:     5400, Batch Loss:     0.344018, Batch Acc: 0.008686, Tokens per Sec:      640, Lr: 0.000122
2022-08-01 13:50:58,321 - INFO - joeynmt.training - Epoch   3, Step:     5500, Batch Loss:     0.396004, Batch Acc: 0.006378, Tokens per Sec:      659, Lr: 0.000121
2022-08-01 13:53:25,711 - INFO - joeynmt.training - Epoch   3, Step:     5600, Batch Loss:     0.418149, Batch Acc: 0.006702, Tokens per Sec:      644, Lr: 0.000120
2022-08-01 13:55:49,337 - INFO - joeynmt.training - Epoch   3, Step:     5700, Batch Loss:     0.402390, Batch Acc: 0.007443, Tokens per Sec:      658, Lr: 0.000118
2022-08-01

2022-08-01 14:30:12,787 - INFO - joeynmt.training - Epoch   4, Step:     7100, Batch Loss:     0.332396, Batch Acc: 0.005975, Tokens per Sec:      977, Lr: 0.000106
2022-08-01 14:31:48,179 - INFO - joeynmt.training - Epoch   4, Step:     7200, Batch Loss:     0.367908, Batch Acc: 0.007055, Tokens per Sec:      982, Lr: 0.000105
2022-08-01 14:33:24,281 - INFO - joeynmt.training - Epoch   4, Step:     7300, Batch Loss:     0.303412, Batch Acc: 0.008551, Tokens per Sec:      985, Lr: 0.000105
2022-08-01 14:37:54,865 - INFO - joeynmt.training - Epoch   4, Step:     7400, Batch Loss:     0.366174, Batch Acc: 0.007413, Tokens per Sec:      352, Lr: 0.000104
2022-08-01 14:39:31,064 - INFO - joeynmt.training - Epoch   4, Step:     7500, Batch Loss:     0.330252, Batch Acc: 0.007389, Tokens per Sec:      985, Lr: 0.000103
2022-08-01 14:41:05,626 - INFO - joeynmt.training - Epoch   4, Step:     7600, Batch Loss:     0.381350, Batch Acc: 0.006794, Tokens per Sec:      995, Lr: 0.000103
2022-08-01

2022-08-01 15:05:59,921 - INFO - joeynmt.training - Epoch   5, Step:     9100, Batch Loss:     0.321830, Batch Acc: 0.007476, Tokens per Sec:      991, Lr: 0.000094
2022-08-01 15:07:35,293 - INFO - joeynmt.training - Epoch   5, Step:     9200, Batch Loss:     0.280348, Batch Acc: 0.008095, Tokens per Sec:      975, Lr: 0.000093
2022-08-01 15:09:12,679 - INFO - joeynmt.training - Epoch   5, Step:     9300, Batch Loss:     0.266997, Batch Acc: 0.007042, Tokens per Sec:      978, Lr: 0.000093
2022-08-01 15:10:48,266 - INFO - joeynmt.training - Epoch   5, Step:     9400, Batch Loss:     0.305801, Batch Acc: 0.006992, Tokens per Sec:      988, Lr: 0.000092
2022-08-01 15:12:23,063 - INFO - joeynmt.training - Epoch   5, Step:     9500, Batch Loss:     0.283467, Batch Acc: 0.008758, Tokens per Sec:     1005, Lr: 0.000092
2022-08-01 15:13:58,635 - INFO - joeynmt.training - Epoch   5, Step:     9600, Batch Loss:     0.320564, Batch Acc: 0.007887, Tokens per Sec:      987, Lr: 0.000091
2022-08-01

2022-08-01 15:38:58,828 - INFO - joeynmt.training - Epoch   6, Step:    11100, Batch Loss:     0.274397, Batch Acc: 0.007073, Tokens per Sec:      967, Lr: 0.000085
2022-08-01 15:40:34,746 - INFO - joeynmt.training - Epoch   6, Step:    11200, Batch Loss:     0.293030, Batch Acc: 0.006856, Tokens per Sec:      995, Lr: 0.000085
2022-08-01 15:42:13,376 - INFO - joeynmt.training - Epoch   6, Step:    11300, Batch Loss:     0.260951, Batch Acc: 0.006553, Tokens per Sec:      966, Lr: 0.000084
2022-08-01 15:43:52,027 - INFO - joeynmt.training - Epoch   6, Step:    11400, Batch Loss:     0.281529, Batch Acc: 0.008095, Tokens per Sec:      960, Lr: 0.000084
2022-08-01 15:45:27,945 - INFO - joeynmt.training - Epoch   6, Step:    11500, Batch Loss:     0.330304, Batch Acc: 0.007825, Tokens per Sec:      987, Lr: 0.000083
2022-08-01 15:47:03,477 - INFO - joeynmt.training - Epoch   6, Step:    11600, Batch Loss:     0.300006, Batch Acc: 0.007041, Tokens per Sec:      986, Lr: 0.000083
2022-08-01
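
The `Lr` column in the log is consistent with a schedule that warms up linearly for `learning_rate_warmup` steps and then decays with the inverse square root of the step. The sketch below parses a few of the logged lines and compares them with that formula. The formula is an inference from the logged numbers, not copied from JoeyNMT, and `expected_lr` and `LOG_LINE` are hypothetical names:

```python
import math
import re


def expected_lr(step, peak_lr=0.0002, warmup=2000):
    """Linear warmup to `peak_lr`, then inverse-square-root decay.

    Inferred from the logged values; not taken from JoeyNMT's code.
    """
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * math.sqrt(warmup / step)


LOG_LINE = re.compile(
    r"Epoch\s+(?P<epoch>\d+), Step:\s+(?P<step>\d+), "
    r"Batch Loss:\s+(?P<loss>[\d.]+),.*Lr:\s+(?P<lr>[\d.]+)"
)

# Message parts of three lines from the log above.
log_lines = [
    "Epoch   1, Step:     1100, Batch Loss:     1.111693, Batch Acc: 0.002789, Tokens per Sec:      634, Lr: 0.000110",
    "Epoch   2, Step:     3200, Batch Loss:     0.565820, Batch Acc: 0.005704, Tokens per Sec:      790, Lr: 0.000158",
    "Epoch   6, Step:    11100, Batch Loss:     0.274397, Batch Acc: 0.007073, Tokens per Sec:      967, Lr: 0.000085",
]

rows = []
for line in log_lines:
    match = LOG_LINE.search(line)
    step = int(match["step"])
    rows.append((step, float(match["loss"]), float(match["lr"]), expected_lr(step)))

for step, loss, logged_lr, computed_lr in rows:
    print(f"step {step:>6}  loss {loss:.6f}  logged lr {logged_lr:.6f}  computed lr {computed_lr:.6f}")
```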

### Continue training after interruption
To continue after an interruption, modify the configuration in the following places:

- `load_model`: point it to the checkpoint to load.
- `reset_*` options: set them to `False` to resume the previous session.
- `model_dir`: point it to a new directory.

In [None]:
resume_config = config\
  .replace('#load_model:', 'load_model:')\
  .replace('#reset_best_ckpt: False', 'reset_best_ckpt: False')\
  .replace('#reset_scheduler: False', 'reset_scheduler: False')\
  .replace('#reset_optimizer: False', 'reset_optimizer: False')\
  .replace('#reset_iter_state: False', 'reset_iter_state: False')\
  .replace(f'model_dir: "{model_dir}"', f'model_dir: "{model_dir}_resume"')

with (Path(data_dir) / "resume_config.yaml").open('w') as f:
    f.write(resume_config)
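
`str.replace` silently returns the text unchanged when the pattern is not found, so a small mismatch in spacing or quoting would leave the resume config identical to the original. A minimal sketch of a stricter replacement; `replace_exactly_once` is a hypothetical helper and the snippet is a shortened stand-in with made-up paths:

```python
def replace_exactly_once(text, old, new):
    """Replace `old` with `new`, failing loudly unless `old` occurs exactly once."""
    occurrences = text.count(old)
    if occurrences != 1:
        raise ValueError(f"expected one occurrence of {old!r}, found {occurrences}")
    return text.replace(old, new)


# Shortened stand-in for the training section of the config (made-up paths).
snippet = (
    'training:\n'
    '    #load_model: "/path/to/model/latest.ckpt"\n'
    '    #reset_optimizer: False\n'
    '    model_dir: "/path/to/model"\n'
)

resumed = replace_exactly_once(snippet, "#load_model:", "load_model:")
resumed = replace_exactly_once(resumed, "#reset_optimizer: False", "reset_optimizer: False")
resumed = replace_exactly_once(
    resumed, 'model_dir: "/path/to/model"', 'model_dir: "/path/to/model_resume"'
)
print(resumed)
```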

In [4]:
!python -m joeynmt train {data_dir}/resume_config.yaml

Traceback (most recent call last):
  File "/home/lconti/anaconda3/envs/jnmt/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lconti/anaconda3/envs/jnmt/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/lconti/anaconda3/envs/jnmt/lib/python3.9/site-packages/joeynmt/__main__.py", line 64, in <module>
    main()
  File "/home/lconti/anaconda3/envs/jnmt/lib/python3.9/site-packages/joeynmt/__main__.py", line 44, in main
    train(cfg_file=args.config_path, skip_test=args.skip_test)
  File "/home/lconti/anaconda3/envs/jnmt/lib/python3.9/site-packages/joeynmt/training.py", line 804, in train
    cfg = load_config(Path(cfg_file))
  File "/home/lconti/anaconda3/envs/jnmt/lib/python3.9/site-packages/joeynmt/helpers.py", line 193, in load_config
    with path.open("r", encoding="utf-8") as ymlfile:
  File "/home/lconti/anaconda3/envs/jnmt/lib/python3.9/pathlib.py", line 1252, in open
    r

## Evaluation


The `test` mode can be used to translate (and evaluate on) the test set specified in the configuration. We usually do this only once after we've tuned hyperparameters on the dev set.

In [11]:
!python -m joeynmt test {data_dir}/config_normal.yaml --ckpt {model_dir}/best.ckpt

2022-08-03 18:38:06,199 - INFO - root - Hello! This is Joey-NMT (version 2.0.0).
2022-08-03 18:38:06,200 - INFO - joeynmt.data - Building tokenizer...
2022-08-03 18:38:06,327 - INFO - joeynmt.tokenizers - en tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)
2022-08-03 18:38:06,327 - INFO - joeynmt.tokenizers - pt tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)
2022-08-03 18:38:06,327 - INFO - joeynmt.data - Building vocabulary...
2022-08-03 18:38:17,789 - INFO - joeynmt.data - Loading dev set...
Dropping NaN...: 100%|████████████████████████████| 1/1 [00:00<00:00, 86.76ba/s]
Preprocessing...: 100%|██████████████████| 1000/1000 [00:00<00:00, 16241.76ex/s]
2022-08-03 18:38:18,943 - INFO - joeynmt.data - Loading test set.

> ⚠ In beam search, the batch size is expanded `beam_size` times. For instance, with `batch_size=10`, `batch_type=sentence` and `beam_size=5`, joeynmt internally creates a batch of 10*5=50 sequences, which may cause an out-of-memory error. Take this into account when setting `batch_size` in the `testing` section of the config.
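
The arithmetic in the note can be turned around to pick a batch size for a given beam. A minimal sketch; both functions are hypothetical helpers, and `hypothesis_budget` stands for a number you would find by trial on your own GPU:

```python
def expanded_batch_size(batch_size, beam_size):
    """Number of hypotheses processed together during beam search."""
    return batch_size * beam_size


def max_batch_size(hypothesis_budget, beam_size):
    """Largest batch size whose expanded batch stays within the budget."""
    return hypothesis_budget // beam_size


# The example from the note: 10 sentences with a beam of 5.
expanded = expanded_batch_size(batch_size=10, beam_size=5)
# Reverse direction: how many sentences fit if 50 hypotheses are affordable.
largest = max_batch_size(hypothesis_budget=50, beam_size=5)
print(expanded, largest)
```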


The `translate` mode is interactive: it prompts for source sentences and translates them.

Let's translate a few examples!

In [54]:
!python -m joeynmt translate {data_dir}/config_normal.yaml --ckpt {model_dir}_resume/best.ckpt

2022-06-04 22:43:59,643 - INFO - root - Hello! This is Joey-NMT (version 2.0.0).
2022-06-04 22:44:17,067 - INFO - joeynmt.model - Building an encoder-decoder model...
2022-06-04 22:44:17,429 - INFO - joeynmt.model - Enc-dec model built.
2022-06-04 22:44:21,857 - INFO - joeynmt.helpers - Load model from /content/drive/MyDrive/models/tatoeba_deen_resume/19000.ckpt.
2022-06-04 22:44:22,065 - INFO - joeynmt.tokenizers - de tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)
2022-06-04 22:44:22,065 - INFO - joeynmt.tokenizers - en tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)

Please enter a source sentence:
Maschinelle Übersetzung macht Spaß!
2022-06-04 22:46:22,281 - INFO - joeynmt.prediction - Predicting 1 example(s)... 

You can also get the n-best hypotheses (up to the size of the beam, 5 in our example), not only the highest-scoring one. The better your model gets, the more interesting the alternatives should be.



In [79]:
nbest_config = config.replace('n_best: 1', 'n_best: 5')\
  .replace('#return_prob: "hyp"', 'return_prob: "hyp"')

with (Path(data_dir) / "nbest_config.yaml").open('w') as f:
    f.write(nbest_config)

In [56]:
!python -m joeynmt translate {data_dir}/nbest_config.yaml --ckpt {model_dir}_resume/best.ckpt

2022-06-04 22:48:11,849 - INFO - root - Hello! This is Joey-NMT (version 2.0.0).
2022-06-04 22:48:29,174 - INFO - joeynmt.model - Building an encoder-decoder model...
2022-06-04 22:48:29,557 - INFO - joeynmt.model - Enc-dec model built.
2022-06-04 22:48:34,476 - INFO - joeynmt.helpers - Load model from /content/drive/MyDrive/models/tatoeba_deen_resume/19000.ckpt.
2022-06-04 22:48:34,743 - INFO - joeynmt.tokenizers - de tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)
2022-06-04 22:48:34,743 - INFO - joeynmt.tokenizers - en tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)

Please enter a source sentence:
Maschinelle Übersetzung macht Spaß!
2022-06-04 22:49:28,668 - INFO - joeynmt.prediction - Predicting 1 example(s)... 

> 💡 In BPE decoding, there are multiple ways to tokenize one sequence. The same output string can therefore appear several times in the n-best list, because the hypotheses differ in their tokenization and thus in the generated token sequence.
> For instance, say the 3-best hypotheses were:
> ```
> #1 best ['▁', 'N', 'e', 'w', '▁York']
> #2 best ['▁', 'New', '▁York']
> #3 best ['▁', 'New', '▁Y', 'o', 'r', 'k']
> ```
> All three differ in their next-token predictions, but end up as the same string `New York` once the subwords are merged.
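
To see the effect, the hypotheses can be merged back into strings and deduplicated. A minimal sketch on the example above; `detokenize` is a simplified stand-in for the tokenizer's own decoding (it only handles the `▁` word marker), and both function names are hypothetical:

```python
def detokenize(pieces):
    """Join subword pieces and turn the '▁' marker back into spaces."""
    return "".join(pieces).replace("▁", " ").strip()


def unique_strings(nbest):
    """Detokenize each hypothesis and drop repeats, keeping the best-ranked one."""
    seen = set()
    unique = []
    for pieces in nbest:
        text = detokenize(pieces)
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


# The 3-best example from the note above.
nbest = [
    ["▁", "N", "e", "w", "▁York"],
    ["▁", "New", "▁York"],
    ["▁", "New", "▁Y", "o", "r", "k"],
]
surface_forms = unique_strings(nbest)
print(surface_forms)
```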


## Appendix

### Plotting learning curves

The `plot_validations.py` script generates validation learning curves.


In [2]:
!python /home/lconti/joeynmt/scripts/plot_validations.py {model_dir} --output_path /home/lconti/results/normal_learning_curve.png