
# Training a toy NMT model en-pt with no teacher forcing and using the tatoeba corpus and JoeyNMT 2.0

This notebook is based on [this demo](https://github.com/joeynmt/joeynmt/blob/main/notebooks/quick-start-with-joeynmt2.ipynb).

> ⚠ **Important:** Before you start, set runtime type to GPU.

In [1]:
!nvidia-smi

Tue Aug  2 13:43:02 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce RTX 208...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 26%   26C    P0    52W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:5E:00.0 Off |                  N/A |
| 29%   27C    P0    52W / 250W |      0MiB / 11019MiB |      1%      Default |
|       

Make sure that you have a compatible PyTorch version.

In [2]:
import torch
torch.__version__

'1.11.0'

Install joeynmt (it's important to clone it from my fork, so teacher forcing can be deactivated).

In [None]:
! git clone https://github.com/lina-conti/joeynmt

In [3]:
!pip3 install -e ./joeynmt

Obtaining file:///home/lconti/joeynmt
  Preparing metadata (setup.py) ... [?25ldone
[?25hCollecting future
  Downloading future-0.18.2.tar.gz (829 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m829.2/829.2 kB[0m [31m57.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25ldone
[?25hCollecting pillow
  Downloading Pillow-9.2.0-cp39-cp39-manylinux_2_28_x86_64.whl (3.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.2/3.2 MB[0m [31m100.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting numpy>=1.19.5
  Downloading numpy-1.23.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m17.1/17.1 MB[0m [31m86.7 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Collecting torch>=1.10.0
  Downloading torch-1.12.0-cp39-cp39-manylinux1_x86_64.whl (776.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m776.3/776.3 MB[0m [31m6.1 MB/s[0

## Data Preparation

### Download
We'll use English - Portuguese translations from the [Tatoeba](https://tatoeba.org/) collection ([CC-BY 2.0 FR](https://creativecommons.org/licenses/by/2.0/fr/)).

[Tatoeba](https://huggingface.co/datasets/tatoeba) corpus is available in Huggingface's datasets library.



In [7]:
data_dir = "/home/lconti/en-pt_tatoeba"

In [None]:
!mkdir {data_dir}
!wget -O {data_dir}/get_tatoeba.py https://raw.githubusercontent.com/may-/datasets/master/datasets/tatoeba/tatoeba.py

--2022-08-01 11:07:17--  https://raw.githubusercontent.com/may-/datasets/master/datasets/tatoeba/tatoeba.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4425 (4.3K) [text/plain]
Saving to: '/home/lconti/en-pt_tatoeba/data/get_tatoeba.py'


2022-08-01 11:07:18 (40.1 MB/s) - '/home/lconti/en-pt_tatoeba/data/get_tatoeba.py' saved [4425/4425]



The Tatoeba dataset on HuggingFace Hub doesn't have dev and test split, but train split only. So let's split the data manually and save it locally.

> 📝 Note that most of the dataset loading scripts in Huggingface have pre-defined train-dev-test splits, e.g. [wmt17](https://huggingface.co/datasets/wmt17). In that case, you can skip this step, please go to the Vocabulary generation part.

In [3]:
from datasets import load_dataset
tatoeba_kwargs = {
  "path": f"{data_dir}/get_tatoeba.py",
  "lang1": "en",
  "lang2": "pt",
  "ignore_verifications": True,
  "cache_dir": "/tmp/.cache/huggingface"
}

tatoeba_dev = load_dataset(split="train[:1000]", **tatoeba_kwargs)
tatoeba_test = load_dataset(split="train[1000:2000]", **tatoeba_kwargs)
tatoeba_train = load_dataset(split="train[2000:]", **tatoeba_kwargs)

tatoeba_dev, tatoeba_test, tatoeba_train

Using custom data configuration en-pt-lang1=en,lang2=pt
Reusing dataset get_tatoeba (/tmp/.cache/huggingface/get_tatoeba/en-pt-lang1=en,lang2=pt/0.0.0/336de120b2cb1a268f4eb9ebc7969075ccfabb978716d834a58a7889dbb5f267)
Using custom data configuration en-pt-lang1=en,lang2=pt
Reusing dataset get_tatoeba (/tmp/.cache/huggingface/get_tatoeba/en-pt-lang1=en,lang2=pt/0.0.0/336de120b2cb1a268f4eb9ebc7969075ccfabb978716d834a58a7889dbb5f267)
Using custom data configuration en-pt-lang1=en,lang2=pt
Reusing dataset get_tatoeba (/tmp/.cache/huggingface/get_tatoeba/en-pt-lang1=en,lang2=pt/0.0.0/336de120b2cb1a268f4eb9ebc7969075ccfabb978716d834a58a7889dbb5f267)


(Dataset({
     features: ['id', 'translation'],
     num_rows: 1000
 }),
 Dataset({
     features: ['id', 'translation'],
     num_rows: 1000
 }),
 Dataset({
     features: ['id', 'translation'],
     num_rows: 215647
 }))

Inspect the data

In [4]:
tatoeba_dev['translation'][:3]

[{'en': "Let's try something.", 'pt': 'Vamos tentar alguma coisa!'},
 {'en': "Let's try something.", 'pt': 'Vamos tentar algo!'},
 {'en': "Let's try something.", 'pt': 'Vamos tentar algo.'}]

In [5]:
tatoeba_test['translation'][:3]

[{'en': "You're my type.", 'pt': 'Você é o meu tipo.'},
 {'en': "You're irresistible.", 'pt': 'Você é irresistível.'},
 {'en': 'Could you call again later, please?',
  'pt': 'Você poderia telefonar de novo mais tarde, por favor?'}]

In [6]:
tatoeba_train['translation'][:3]

[{'en': 'What do you want now?', 'pt': 'O que você deseja agora?'},
 {'en': 'Do you love music?', 'pt': 'Você ama música?'},
 {'en': 'Do you love music?', 'pt': 'Você aprecia a música?'}]

Save the train-dev-test splits in local dir

In [7]:
from datasets.dataset_dict import DatasetDict

dataset_dict = DatasetDict({ 
  "train": tatoeba_train,
  "validation": tatoeba_dev,
  "test": tatoeba_test
})

dataset_dict.save_to_disk(data_dir)

### Vocabulary

We will use the [sentencepiece](https://github.com/google/sentencepiece) library to split words into subwords (BPE) according to their frequency in the training corpus.

`build_vocab.py` script will train the BPE model and creates joint vocabulary. It takes the same config file as the joeynmt.

In [7]:
from pathlib import Path

# Create the config
config = """
name: "tatoeba_enpt_no_tf_sp"
joeynmt_version: "2.0.0"

data:
    train: "{data_dir}/train"
    dev: "{data_dir}/validation"
    test: "{data_dir}/test"
    dataset_type: "huggingface"
    #dataset_cfg:           # not necessary for manually saved pyarray daraset
    #    name: "en-pt"
    sample_dev_subset: 200
    src:
        lang: "en"
        max_length: 100
        lowercase: False
        normalize: False
        level: "bpe"
        voc_limit: 32000
        voc_min_freq: 1
        voc_file: "{data_dir}/vocab.txt"
        tokenizer_type: "sentencepiece"
        tokenizer_cfg:
            model_file: "{data_dir}/sp.model"

    trg:
        lang: "pt"
        max_length: 100
        lowercase: False
        normalize: False
        level: "bpe"
        voc_limit: 32000
        voc_min_freq: 1
        voc_file: "{data_dir}/vocab.txt"
        tokenizer_type: "sentencepiece"
        tokenizer_cfg:
            model_file: "{data_dir}/sp.model"

""".format(data_dir=data_dir)
with (Path(data_dir) / "config_no_tf.yaml").open('w') as f:
    f.write(config)

Call the script with `--joint` flag

In [9]:
! wget -O {data_dir}/build_vocab.py https://raw.githubusercontent.com/joeynmt/joeynmt/main/scripts/build_vocab.py

--2022-08-02 13:45:25--  https://raw.githubusercontent.com/joeynmt/joeynmt/main/scripts/build_vocab.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10277 (10K) [text/plain]
Saving to: '/home/lconti/en-pt_no_tf/data/build_vocab.py'


2022-08-02 13:45:25 (53.3 MB/s) - '/home/lconti/en-pt_no_tf/data/build_vocab.py' saved [10277/10277]



In [10]:
!python {data_dir}/build_vocab.py {data_dir}/config.yaml --joint

Dropping NaN...: 100%|███████████████████████| 216/216 [00:01<00:00, 203.73ba/s]
Preprocessing...: 100%|██████████████| 215642/215642 [00:12<00:00, 16708.24ex/s]
### Training sentencepiece...
sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=/tmp/sentencepiece_4q60qtkm.txt --model_prefix=/home/lconti/en-pt_no_tf/data/sp --model_type=unigram --vocab_size=32000 --character_coverage=1.0 --accept_language=en,pt --unk_piece=<unk> --bos_piece=<s> --eos_piece=</s> --pad_piece=<pad> --unk_id=0 --bos_id=2 --eos_id=3 --pad_id=1 --vocabulary_output_piece_score=false
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: /tmp/sentencepiece_4q60qtkm.txt
  input_format: 
  model_prefix: /home/lconti/en-pt_no_tf/data/sp
  model_type: UNIGRAM
  vocab_size: 32000
  accept_language: en
  accept_language: pt
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_f

The generated vocabulary looks like this:

In [11]:
!head -20 {data_dir}/vocab.txt

<unk>
<pad>
<s>
</s>
.
▁Tom
'
▁I
?
▁a
▁to
▁que
s
,
▁the
▁de
▁you
t
▁o
▁não


## Configuration

Joey NMT reads model and training hyperparameters from a configuration file. We're generating this now to configure paths in the appropriate places.

The configuration below builds a small Transformer model with shared embeddings between source and target language on the base of the subword vocabularies created above.

Note the "teacher_forcing" configuration in "model" — this is specific to my fork of joeynmt. It is where you can choose between "on", "off" or "alternating" (default is on).

In [6]:
model_dir = "/home/lconti/en-pt_tatoeba/models/no_tf"

In [8]:
config += """
testing:
    n_best: 1
    beam_size: 5
    beam_alpha: 1.0
    batch_size: 256
    batch_type: "token"
    max_output_length: 100
    eval_metrics: ["bleu"]
    #return_prob: "hyp"
    #return_attention: False
    sacrebleu_cfg:
        tokenize: "13a"

training:
    #load_model: "{model_dir}/latest.ckpt"
    #reset_best_ckpt: False
    #reset_scheduler: False
    #reset_optimizer: False
    #reset_iter_state: False
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999]
    scheduling: "warmupinversesquareroot"
    learning_rate_warmup: 2000
    learning_rate: 0.0002
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    loss: "crossentropy"
    batch_size: 512
    batch_type: "token"
    batch_multiplier: 4
    early_stopping_metric: "bleu"
    epochs: 10
    updates: 20000
    validation_freq: 1000
    logging_freq: 100
    model_dir: "{model_dir}"
    overwrite: False
    shuffle: True
    use_cuda: True
    print_valid_sents: [0, 1, 2, 3]
    keep_best_ckpts: 3

model:
    teacher_forcing: "off"
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4
        embeddings:
            embedding_dim: 256
            scale: True
            dropout: 0.0
        # typically ff_size = 4 x hidden_size
        hidden_size: 256
        ff_size: 1024
        dropout: 0.1
        layer_norm: "pre"
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 256
            scale: True
            dropout: 0.0
        # typically ff_size = 4 x hidden_size
        hidden_size: 256
        ff_size: 1024
        dropout: 0.1
        layer_norm: "pre"

""".format(model_dir=model_dir)
with (Path(data_dir) / "config_no_tf.yaml").open('w') as f:
    f.write(config)

## Model Training


### Run training
⏳ This will take a while. Model parameters will be stored on mounted google drive. The log reports the training process, look out for the prints of example translations and the BLEU evaluation scores to get an impression of the current quality.

> ⛔ If you execute this twice, you might get an error that the model directory already exists. You can specify in the configuration to overwrite it, or delete it manually (`!rm -r {model_dir}`).

In [9]:
!python -m joeynmt train {data_dir}/config_no_tf.yaml

2022-08-03 23:04:04,500 - INFO - root - Hello! This is Joey-NMT (version 2.0.0).
2022-08-03 23:04:04,500 - INFO - joeynmt.helpers -                           cfg.name : tatoeba_enpt_no_tf_sp
2022-08-03 23:04:04,500 - INFO - joeynmt.helpers -                cfg.joeynmt_version : 2.0.0
2022-08-03 23:04:04,500 - INFO - joeynmt.helpers -                     cfg.data.train : /home/lconti/en-pt_tatoeba/train
2022-08-03 23:04:04,500 - INFO - joeynmt.helpers -                       cfg.data.dev : /home/lconti/en-pt_tatoeba/validation
2022-08-03 23:04:04,500 - INFO - joeynmt.helpers -                      cfg.data.test : /home/lconti/en-pt_tatoeba/test
2022-08-03 23:04:04,500 - INFO - joeynmt.helpers -              cfg.data.dataset_type : huggingface
2022-08-03 23:04:04,500 - INFO - joeynmt.helpers -         cfg.data.sample_dev_subset : 200
2022-08-03 23:04:04,501 - INFO - joeynmt.helpers -                  cfg.data.src.lang : en
2022-08-03 23:04:04,501 - INFO - joeynmt.helpers -            cfg

### Continue training after interruption
To continue after an interruption, the configuration needs to be modified in the following places:

- `load_model` to point to the checkpoint to load.
- `reset_*` options (must be False) to resume the previous session.
- `model_dir` to create a new directory.

In [None]:
resume_config = config\
  .replace('#load_model:', 'load_model:')\
  .replace('#reset_best_ckpt: False', 'reset_best_ckpt: False')\
  .replace('#reset_scheduler: False', 'reset_scheduler: False')\
  .replace('#reset_optimizer: False', 'reset_optimizer: False')\
  .replace('#reset_iter_state: False', 'reset_iter_state: False')\
  .replace(f'model_dir: "{model_dir}"', f'model_dir: "{model_dir}_resume"')

with (Path(data_dir) / "resume_config.yaml").open('w') as f:
    f.write(resume_config)

In [None]:
!python -m joeynmt train {data_dir}/resume_config.yaml

2022-06-04 20:53:16,099 - INFO - root - Hello! This is Joey-NMT (version 2.0.0).
2022-06-04 20:53:16,100 - INFO - joeynmt.helpers -                           cfg.name : tatoeba_deen_sp
2022-06-04 20:53:16,100 - INFO - joeynmt.helpers -                cfg.joeynmt_version : 2.0.0
2022-06-04 20:53:16,100 - INFO - joeynmt.helpers -                     cfg.data.train : /content/drive/MyDrive/tatoeba_deen/train
2022-06-04 20:53:16,101 - INFO - joeynmt.helpers -                       cfg.data.dev : /content/drive/MyDrive/tatoeba_deen/validation
2022-06-04 20:53:16,101 - INFO - joeynmt.helpers -                      cfg.data.test : /content/drive/MyDrive/tatoeba_deen/test
2022-06-04 20:53:16,101 - INFO - joeynmt.helpers -              cfg.data.dataset_type : huggingface
2022-06-04 20:53:16,101 - INFO - joeynmt.helpers -         cfg.data.sample_dev_subset : 200
2022-06-04 20:53:16,101 - INFO - joeynmt.helpers -                  cfg.data.src.lang : de
2022-06-04 20:53:16,101 - INFO - joeynmt.hel

> 💡 It starts counting the epochs from the beginning again, but step numbers should continue from before and you should find a "reloading" line in the training log.

## Evaluation


The `test` mode can be used to translate (and evaluate on) the test set specified in the configuration. We usually do this only once after we've tuned hyperparameters on the dev set.

In [10]:
!python -m joeynmt test {data_dir}/config_no_tf.yaml --ckpt {model_dir}/best.ckpt

2022-08-05 17:51:13,445 - INFO - root - Hello! This is Joey-NMT (version 2.0.0).
2022-08-05 17:51:13,445 - INFO - joeynmt.data - Building tokenizer...
2022-08-05 17:51:13,545 - INFO - joeynmt.tokenizers - en tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)
2022-08-05 17:51:13,546 - INFO - joeynmt.tokenizers - pt tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)
2022-08-05 17:51:13,546 - INFO - joeynmt.data - Building vocabulary...
2022-08-05 17:51:28,042 - INFO - joeynmt.data - Loading dev set...
2022-08-05 17:51:29,068 - INFO - joeynmt.data - Loading test set...
2022-08-05 17:51:29,821 - INFO - joeynmt.data - Data loaded.
2022-08-05 17:51:29,821 - INFO - joeynmt.helpers - Train dataset: None
2022-08-05 17:51:29,821 - I

> ⚠ In beam search, the batch size is expanded {beam_size} times. For instance, if batch_size=10, batch_type=sentence and beam_size=5, joeynmt internally creates a batch of length 10*5=50. It may cause an out-of-memory error. Please specify the batch_size in `testing` section of config.yaml by taking this into account.


The `translate` mode is more interactive and takes prompts to translate interactively.

Let's Translate a few examples!

In [54]:
!python -m joeynmt translate {data_dir}/config.yaml --ckpt {model_dir}_resume/best.ckpt

2022-06-04 22:43:59,643 - INFO - root - Hello! This is Joey-NMT (version 2.0.0).
2022-06-04 22:44:17,067 - INFO - joeynmt.model - Building an encoder-decoder model...
2022-06-04 22:44:17,429 - INFO - joeynmt.model - Enc-dec model built.
2022-06-04 22:44:21,857 - INFO - joeynmt.helpers - Load model from /content/drive/MyDrive/models/tatoeba_deen_resume/19000.ckpt.
2022-06-04 22:44:22,065 - INFO - joeynmt.tokenizers - de tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)
2022-06-04 22:44:22,065 - INFO - joeynmt.tokenizers - en tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)

Please enter a source sentence:
Maschinelle Übersetzung macht Spaß!
2022-06-04 22:46:22,281 - INFO - joeynmt.prediction - Predicting 1 example(s)... 

You can also get the n-best hypotheses (up to the size of the beam, in our example 5), not only the highest scoring one. The better your model gets, the more interesting should the alternatives be.



In [79]:
nbest_config = config.replace('n_best: 1', 'n_best: 5')\
  .replace('#return_prob: "hyp"', 'return_prob: "hyp"')

with (Path(data_dir) / "nbest_config.yaml").open('w') as f:
    f.write(nbest_config)

In [56]:
!python -m joeynmt translate {data_dir}/nbest_config.yaml --ckpt {model_dir}_resume/best.ckpt

2022-06-04 22:48:11,849 - INFO - root - Hello! This is Joey-NMT (version 2.0.0).
2022-06-04 22:48:29,174 - INFO - joeynmt.model - Building an encoder-decoder model...
2022-06-04 22:48:29,557 - INFO - joeynmt.model - Enc-dec model built.
2022-06-04 22:48:34,476 - INFO - joeynmt.helpers - Load model from /content/drive/MyDrive/models/tatoeba_deen_resume/19000.ckpt.
2022-06-04 22:48:34,743 - INFO - joeynmt.tokenizers - de tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)
2022-06-04 22:48:34,743 - INFO - joeynmt.tokenizers - en tokenizer: SentencePieceTokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, 100), pretokenizer=none, tokenizer=SentencePieceProcessor, nbest_size=5, alpha=0.0)

Please enter a source sentence:
Maschinelle Übersetzung macht Spaß!
2022-06-04 22:49:28,668 - INFO - joeynmt.prediction - Predicting 1 example(s)... 

> 💡 In BPE decoding, there are multiple ways to tokenize one sequence. That is, the same output string sequence might appear multiple times in the n best list, because they have different tokenization and thus different sequence in the generation.
> For instance, say 3-best generation were:
> ```
> #1 best ['▁', 'N', 'e', 'w', '▁York']
> #2 best ['▁', 'New', '▁York']
> #3 best ['▁', 'New', '▁Y', 'o', 'r', 'k']
> ````
All three were different in next-token prediction, but ended up the same string sequence `New York` after being un-bpe-ed.


## Appendix:

### plotting learning curves

`plot_validations.py` script will generate validation learning curves.

In [2]:
!python /home/lconti/joeynmt/scripts/plot_validations.py {model_dir} --output_path /home/lconti/results/no_tf_learning_curve.png