<a href="https://colab.research.google.com/github/may-/joeynmt/blob/main/notebooks/tokenizer_tutorial_en.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# JoeyNMT v2 tokenizer tutorial

In this notebook, we explain how to integrate a new tokenizer into JoeyNMT.

Author: Mayumi Ohta  
Date: 21. July 2022

> :warning: **important:** Before you start, set runtime type to GPU.

In [1]:
!nvidia-smi

Tue Aug 16 07:11:40 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Make sure that you have a compatible PyTorch version.

In [2]:
import torch
torch.__version__

'1.12.1+cu113'

Mount your Google Drive.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Set root dir path.

In [4]:
import os
root_dir = '/content/drive/MyDrive' # for Google Colab
#root_dir = os.environ["WORK_DIR"] # '/home/studio-lab-user' for Amazon SageMaker

In [5]:
%env HF_DATASETS_CACHE={root_dir}/.cache

env: HF_DATASETS_CACHE=/content/drive/MyDrive/.cache


Install joeynmt for tokenizer integration.

In [6]:
!pip install -q git+https://github.com/may-/joeynmt.git


## fastBPE

JoeyNMT v2 ships with two subword tokenizers, `subword-nmt` and `sentencepiece`. What should we do when we want to use a different one?

As an example, let's implement a [fastBPE](https://github.com/glample/fastBPE) tokenizer.

First, install the fastBPE Python API.

In [7]:
!pip install -q fastbpe sacremoses

  Building wheel for fastbpe (setup.py) ... done
  Building wheel for sacremoses (setup.py) ... done


The tokenizers are defined in `joeynmt/tokenizers.py`. We add a new class for fastBPE here.

In principle, you can inherit from the `BasicTokenizer` class and override `__call__()` to tokenize and `post_process()` to detokenize.

fastBPE is a C++ implementation of the subword-nmt algorithm, so here we inherit from the `SubwordNMTTokenizer` class instead.

In [None]:
from joeynmt.tokenizers import BasicTokenizer, SubwordNMTTokenizer
from pathlib import Path
from typing import List

class FastBPETokenizer(SubwordNMTTokenizer):
    def __init__(
        self,
        level: str = "bpe",
        lowercase: bool = False,
        normalize: bool = False,
        max_length: int = -1,
        min_length: int = -1,
        **kwargs,
    ):
        try:
            import fastBPE
        except ImportError as e:
            #logger.error(e)
            raise ImportError from e
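        # Deliberately skip SubwordNMTTokenizer.__init__ and call the next
        # initializer in the MRO (expected to be BasicTokenizer's), because
        # the BPE object is created with fastBPE below.
        super(SubwordNMTTokenizer, self).__init__(level, lowercase, normalize, max_length, min_length, **kwargs)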
        assert self.level == "bpe"

        # get codes file path
        self.codes: Path = Path(kwargs["codes"])
        assert self.codes.is_file(), f"codes file {self.codes} not found."

        # instantiate fastBPE object
        self.bpe = fastBPE.fastBPE(self.codes.as_posix())
        self.separator = "@@"
        self.dropout = 0.0

    def __call__(self, raw_input: str, is_train: bool = False) -> List[str]:
        # fastBPE.apply()
        tokenized = self.bpe.apply([raw_input])
        tokenized = tokenized[0].strip().split()

        # check if the input sequence length stays within the valid length range
        if is_train and self._filter_by_length(len(tokenized)):
            return None
        return tokenized
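
The `is_train and self._filter_by_length(...)` check above drops training examples whose token count falls outside the configured range. The following is a minimal standalone sketch of such a check, assuming that a bound of `-1` means "disabled" (as the defaults `max_length=-1` and `min_length=-1` suggest). `is_out_of_range` is a hypothetical stand-in for illustration, not JoeyNMT's own `_filter_by_length`.

```python
def is_out_of_range(length: int, min_length: int = -1, max_length: int = -1) -> bool:
    """Return True if `length` violates an active bound.

    Assumption: a bound <= 0 (e.g. -1) is treated as disabled.
    """
    too_long = max_length > 0 and length > max_length
    too_short = min_length > 0 and length < min_length
    return too_long or too_short


print(is_out_of_range(5))                   # no bounds active
print(is_out_of_range(200, max_length=128)) # longer than the upper bound
print(is_out_of_range(2, min_length=3))     # shorter than the lower bound
```

With a check of this kind, a tokenizer can return `None` for out-of-range training examples so that the caller can skip them, which is what `__call__()` above does.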

The `FastBPETokenizer` class defined above is instantiated in the `_build_tokenizer()` function.

Change `_build_tokenizer()` so that `FastBPETokenizer` is used when the config file specifies `tokenizer_type: "fastbpe"`.

In [None]:
def _build_tokenizer(cfg):
    # [...]
    tokenizer_cfg = cfg.get("tokenizer_cfg", {})
    
    if cfg["level"] in ["word", "char"]:
        tokenizer = BasicTokenizer(
            # [...]
        )
        
    elif cfg["level"] == "bpe":
        tokenizer_type = cfg.get("tokenizer_type", cfg.get("bpe_type", "sentencepiece"))
        if tokenizer_type == "sentencepiece":
            assert "model_file" in tokenizer_cfg
            # [...]
        elif tokenizer_type == "subword-nmt":
            assert "codes" in tokenizer_cfg
            # [...]
        elif tokenizer_type == "fastbpe":
            assert "codes" in tokenizer_cfg
            tokenizer = FastBPETokenizer(
                level=cfg["level"],
                lowercase=cfg.get("lowercase", False),
                normalize=cfg.get("normalize", False),
                max_length=cfg.get("max_length", -1),
                min_length=cfg.get("min_length", -1),
                **tokenizer_cfg,
            )
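
To make the lookup order in the cell above concrete, here is a small standalone sketch that isolates the nested `cfg.get(...)` expression. `resolve_tokenizer_type` is a hypothetical helper name used only for illustration. The three example dictionaries are made up and contain only the keys relevant to the lookup.

```python
def resolve_tokenizer_type(cfg: dict) -> str:
    """Same lookup order as above: `tokenizer_type` first, then `bpe_type`,
    and finally the default "sentencepiece"."""
    return cfg.get("tokenizer_type", cfg.get("bpe_type", "sentencepiece"))


examples = [
    {"level": "bpe", "tokenizer_type": "fastbpe"},   # explicit choice
    {"level": "bpe", "bpe_type": "subword-nmt"},     # alternative key
    {"level": "bpe"},                                # neither key given
]
for cfg in examples:
    print(resolve_tokenizer_type(cfg))
```

Under this lookup, the new `fastbpe` branch is reached only when `tokenizer_type` (or `bpe_type`) is set explicitly. If both keys are omitted, the sentencepiece branch is selected.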

In the config file `config.yaml`, you can select "fastbpe" as follows. The `codes` attribute is required.

```yaml
name: "transformer_iwslt14_deen_fastbpe"
joeynmt_version: "2.0.0"

data:
    train: "iwslt14"
    dev: "iwslt14"
    test: "iwslt14"
    dataset_type: "huggingface"
    dataset_cfg:
        name: "de-en"
    src:
        lang: "de"
        max_length: 128
        lowercase: True
        normalize: False
        level: "bpe"
        voc_min_freq: 1
        voc_file: "data/iwslt14/vocab.32000"
        tokenizer_type: "fastbpe"
        tokenizer_cfg:
            codes: "data/iwslt14/bpe.32000"  # necessary
            pretokenizer: "moses"
    trg:
        lang: "en"
        max_length: 128
        lowercase: True
        normalize: False
        level: "bpe"
        voc_min_freq: 1
        voc_file: "data/iwslt14/vocab.32000"
        tokenizer_type: "fastbpe"
        tokenizer_cfg:
            codes: "data/iwslt14/bpe.32000"  # necessary
            pretokenizer: "moses"
[...]
```

### Use with pretrained models

Let's check whether it works on a real dataset, namely iwslt14 en-de.

Download a pretrained model. (The codes file here was trained with subword-nmt.)

In [None]:
!wget -O {root_dir}/transformer_iwslt14_deen_bpe.tar.gz https://www.cl.uni-heidelberg.de/statnlpgroup/joeynmt2/transformer_iwslt14_deen_bpe.tar.gz
!cd {root_dir} && tar -xvf transformer_iwslt14_deen_bpe.tar.gz
!ls {root_dir}/transformer_iwslt14_deen_bpe

--2022-08-15 20:15:23--  https://www.cl.uni-heidelberg.de/statnlpgroup/joeynmt2/transformer_iwslt14_deen_bpe.tar.gz
Resolving www.cl.uni-heidelberg.de (www.cl.uni-heidelberg.de)... 147.142.207.78
Connecting to www.cl.uni-heidelberg.de (www.cl.uni-heidelberg.de)|147.142.207.78|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 220750239 (211M) [application/x-gzip]
Saving to: ‘/content/drive/MyDrive/transformer_iwslt14_deen_bpe.tar.gz’


2022-08-15 20:15:42 (13.0 MB/s) - ‘/content/drive/MyDrive/transformer_iwslt14_deen_bpe.tar.gz’ saved [220750239/220750239]

transformer_iwslt14_deen_bpe/
transformer_iwslt14_deen_bpe/best.ckpt
transformer_iwslt14_deen_bpe/trg_vocab.txt
transformer_iwslt14_deen_bpe/config_v1.yaml
transformer_iwslt14_deen_bpe/config_v2.yaml
transformer_iwslt14_deen_bpe/train.log
transformer_iwslt14_deen_bpe/bpe.32000
transformer_iwslt14_deen_bpe/test.log
transformer_iwslt14_deen_bpe/hyp.test
transformer_iwslt14_deen_bpe/hyp.dev
transformer_iwslt14_de

Copy codes and vocab files.

In [None]:
!mkdir -p {root_dir}/data/iwslt14
!cp {root_dir}/transformer_iwslt14_deen_bpe/bpe.32000 {root_dir}/data/iwslt14/codes.32000
!cp {root_dir}/transformer_iwslt14_deen_bpe/trg_vocab.txt {root_dir}/data/iwslt14/vocab.32000

In [None]:
fastbpe_tokenizer = FastBPETokenizer(codes=f"{root_dir}/data/iwslt14/codes.32000")
fastbpe_tokenizer

FastBPETokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, -1), pretokenizer=none, tokenizer=fastBPE, separator=@@, dropout=0.0)

In [None]:
fastbpe_tokenizer("This is a test .")

['T@@', 'his', 'is', 'a', 'test', '.']

In [None]:
fastbpe_tokenizer("Das ist ein Beispiel .")

['D@@', 'as', 'ist', 'ein', 'B@@', 'ei@@', 'spiel', '.']
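
In the outputs above, a trailing `@@` marks a subword that continues into the next token. The sketch below shows how such a token list can be merged back into a whitespace-tokenized string. It is a standalone illustration of the separator convention, and `merge_subwords` is a hypothetical helper, not the `post_process()` method that `FastBPETokenizer` inherits.

```python
from typing import List


def merge_subwords(tokens: List[str], separator: str = "@@") -> str:
    """Join BPE tokens and remove the continuation markers."""
    # "B@@ ei@@ spiel" -> "Beispiel": drop each marker together with the
    # space that follows it.
    text = " ".join(tokens).replace(separator + " ", "")
    # A marker on the very last token has no following space; strip it too.
    if text.endswith(separator):
        text = text[: -len(separator)]
    return text


print(merge_subwords(['D@@', 'as', 'ist', 'ein', 'B@@', 'ei@@', 'spiel', '.']))
# Das ist ein Beispiel .
```

The result is still Moses-tokenized (note the space before the period). Undoing the pretokenization is a separate step.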

Decode using the fastBPE tokenizer. Specify "fastbpe" as the `tokenizer_type` in the config file.

In [None]:
new_config = """
name: "transformer_iwslt14_deen_fastbpe"
joeynmt_version: "2.0.0"

data:
    train: "iwslt14"
    dev: "iwslt14"
    test: "iwslt14"
    dataset_type: "huggingface"
    dataset_cfg:
        name: "de-en"
    src:
        lang: "de"
        max_length: 128
        lowercase: True
        normalize: False
        level: "bpe"
        voc_min_freq: 1
        voc_file: "data/iwslt14/vocab.32000"
        tokenizer_type: "fastbpe"
        tokenizer_cfg:
            codes: "data/iwslt14/codes.32000"  # required
            pretokenizer: "moses"
    trg:
        lang: "en"
        max_length: 128
        lowercase: True
        normalize: False
        level: "bpe"
        voc_min_freq: 1
        voc_file: "data/iwslt14/vocab.32000"
        tokenizer_type: "fastbpe"
        tokenizer_cfg:
            codes: "data/iwslt14/codes.32000" # required
            pretokenizer: "moses"

testing:
    n_best: 1
    beam_size: 5
    beam_alpha: 1.0
    batch_size: 1024
    batch_type: "token"
    max_output_length: 100
    eval_metrics: ["bleu"]
    return_prob: "none"
    return_attention: False
    sacrebleu_cfg:
        tokenize: "13a"
        lowercase: True

training:
    load_model: "transformer_iwslt14_deen_bpe/best.ckpt"
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999]
    scheduling: "plateau"
    patience: 5
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    early_stopping_metric: "bleu"
    epochs: 100
    validation_freq: 1000
    logging_freq: 100
    model_dir: "transformer_iwslt14_deen_bpe"
    overwrite: False
    shuffle: True
    use_cuda: True
    print_valid_sents: [0, 1, 2, 3, 4]
    keep_best_ckpts: 5

model:
    initializer: "xavier_uniform"
    embed_initializer: "xavier_uniform"
    embed_init_gain: 1.0
    init_gain: 1.0
    bias_initializer: "zeros"
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4
        embeddings:
            embedding_dim: 256
            scale: True
            dropout: 0.
        # typically ff_size = 4 x hidden_size
        hidden_size: 256
        ff_size: 1024
        dropout: 0.3
        layer_norm: "pre"
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4
        embeddings:
            embedding_dim: 256
            scale: True
            dropout: 0.
        # typically ff_size = 4 x hidden_size
        hidden_size: 256
        ff_size: 1024
        dropout: 0.3
        layer_norm: "pre"

"""

(Path(root_dir) / 'data/iwslt14/config.yaml').write_text(new_config)

2675

Generate translations in interactive mode. (Press [Ctrl]+[C] to stop the cell execution.)

In [None]:
!cd {root_dir} && python -m joeynmt translate data/iwslt14/config.yaml

2022-08-15 20:31:51,569 - INFO - root - Hello! This is Joey-NMT (version 2.0.0).
2022-08-15 20:32:06,238 - INFO - joeynmt.model - Building an encoder-decoder model...
2022-08-15 20:32:06,605 - INFO - joeynmt.model - Enc-dec model built.
2022-08-15 20:32:07,196 - INFO - joeynmt.helpers - Load model from /content/drive/MyDrive/transformer_iwslt14_deen_bpe/best.ckpt.
Loading codes from data/iwslt14/codes.32000 ...
Read 32001 codes from the codes file.
Loading codes from data/iwslt14/codes.32000 ...
Read 32001 codes from the codes file.
2022-08-15 20:32:09,415 - INFO - joeynmt.tokenizers - de tokenizer: FastBPETokenizer(level=bpe, lowercase=True, normalize=False, filter_by_length=(-1, 128), pretokenizer=moses, tokenizer=fastBPE, separator=@@, dropout=0.0)
2022-08-15 20:32:09,415 - INFO - joeynmt.tokenizers - en tokenizer: FastBPETokenizer(level=bpe, lowercase=True, normalize=False, filter_by_length=(-1, 128), pretokenizer=moses, tokenizer=fastBPE, separator=@@, dropout=0.0)

Please enter a

### Use in training from scratch

In this section, we demonstrate how to use a new tokenizer when training from scratch.

First, we need to prepare a codes file. We train a tokenizer model using fastBPE's `fast` command.

In [None]:
!git clone https://github.com/glample/fastBPE.git {root_dir}/fastBPE
!cd {root_dir}/fastBPE && g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
!cd {root_dir}/fastBPE && ./fast -h

usage: fastbpe <command> <args>

The commands supported by fastBPE are:

getvocab input1 [input2]             extract the vocabulary from one or two text files
learnbpe nCodes input1 [input2]      learn BPE codes from one or two text files
applybpe output input codes [vocab]  apply BPE codes to a text file
applybpe_stream codes [vocab]        apply BPE codes to stdin and output to stdout



We use the iwslt14 en-de dataset for this purpose.

Download the dataset builder script and load it with Hugging Face's `datasets` package.

In [None]:
!wget -O {root_dir}/data/iwslt14/iwslt14.py https://raw.githubusercontent.com/may-/datasets/master/datasets/iwslt14/iwslt14.py

--2022-08-15 20:34:59--  https://raw.githubusercontent.com/may-/datasets/master/datasets/iwslt14/iwslt14.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7614 (7.4K) [text/plain]
Saving to: ‘/content/drive/MyDrive/data/iwslt14/iwslt14.py’


2022-08-15 20:34:59 (5.56 MB/s) - ‘/content/drive/MyDrive/data/iwslt14/iwslt14.py’ saved [7614/7614]



Before subword tokenization, we pretokenize the text with `MosesTokenizer` so that punctuation is handled better.

In [None]:
from datasets import load_dataset
from sacremoses import MosesTokenizer

moses_tokenizer = {
    "en": MosesTokenizer(lang="en"),
    "de": MosesTokenizer(lang="de"),
}
iwslt14 = load_dataset(f"{root_dir}/data/iwslt14", name="de-en")
iwslt14_train = iwslt14['train'].flatten()
iwslt14_train = iwslt14_train.map(
    lambda item: {"en": moses_tokenizer["en"].tokenize(item["translation.en"], return_str=True),
                  "de": moses_tokenizer["de"].tokenize(item["translation.de"], return_str=True)}
)
iwslt14_train

Downloading and preparing dataset iwslt14/de-en to /content/drive/MyDrive/.cache/iwslt14/de-en/1.0.0/ddd017b0ab639227607efd17fdf9687b1c4f06edf13c78a00314d7ea682d408d...


Downloading data:   0%|          | 0.00/1.09G [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Dataset iwslt14 downloaded and prepared to /content/drive/MyDrive/.cache/iwslt14/de-en/1.0.0/ddd017b0ab639227607efd17fdf9687b1c4f06edf13c78a00314d7ea682d408d. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/174443 [00:00<?, ?ex/s]

Dataset({
    features: ['translation.de', 'translation.en', 'en', 'de'],
    num_rows: 174443
})

We create a joint vocabulary of size 40,000, shared between source and target.

JoeyNMT expects the vocab file in one-token-per-line format. The vocab file generated by fastBPE has two columns, which is incompatible with JoeyNMT, so we drop the second column to make it usable in JoeyNMT.

In [None]:
train_ende = iwslt14_train['en'] + iwslt14_train['de']
(Path(root_dir) / 'data/iwslt14/train.ende').write_text('\n'.join(train_ende))

36662010

In [None]:
!cd {root_dir}/data/iwslt14 && {root_dir}/fastBPE/fast learnbpe 40000 train.ende > codes.40000
!cd {root_dir}/data/iwslt14 && {root_dir}/fastBPE/fast applybpe train.ende.40000 train.ende codes.40000
!cd {root_dir}/data/iwslt14 && {root_dir}/fastBPE/fast getvocab train.ende.40000 > vocab_freq.40000
!cd {root_dir}/data/iwslt14 && cut -d " " -f 1 vocab_freq.40000 > vocab.40000

Loading vocabulary from train.ende ...
Read 6865852 words (173116 unique) from text file.
tcmalloc: large alloc 12000002048 bytes == 0x556bb453a000 @  0x7f99f1a4a887 0x556baef878f3 0x556baef7c78f 0x7f99f0e85c87 0x556baef7ca1a
Loading codes from codes.40000 ...
Read 40000 codes from the codes file.
Loading vocabulary from train.ende ...
Read 6865852 words (173116 unique) from text file.
Applying BPE to train.ende ...
Modified 6865852 words from text file.
Loading vocabulary from train.ende.40000 ...
Read 7421753 words (39687 unique) from text file.
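
The `cut -d " " -f 1` step above keeps only the first space-separated column of each line. For readers who prefer to stay in Python, the following sketch does roughly the same thing on a list of lines. `first_column` is a hypothetical helper, and the sample counts are made up for illustration. Unlike `cut`, it skips empty lines.

```python
from typing import Iterable, List


def first_column(lines: Iterable[str]) -> List[str]:
    """Keep the token before the first space on each non-empty line."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            tokens.append(line.split(" ", 1)[0])
    return tokens


# Made-up "token count" lines in the two-column layout described above.
sample = ["the 1000\n", ", 900\n", "Bei@@ 12\n"]
print(first_column(sample))
```

To apply it to a real file, one could pass an open file handle, for example `first_column(open("vocab_freq.40000", encoding="utf-8"))`, and write the result back with one token per line.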


In [None]:
!head -10 {root_dir}/data/iwslt14/vocab.40000

,
.
the
in
to
of
die
a
and
und


Now we have prepared a codes file and a vocab file in one-token-per-line format.

Let's try to tokenize a text using the codes file trained above.

In [None]:
fastbpe_tokenizer = FastBPETokenizer(codes=f"{root_dir}/data/iwslt14/codes.40000")
fastbpe_tokenizer

FastBPETokenizer(level=bpe, lowercase=False, normalize=False, filter_by_length=(-1, -1), pretokenizer=none, tokenizer=fastBPE, separator=@@, dropout=0.0)

In [None]:
fastbpe_tokenizer("This is a test .")

['This', 'is', 'a', 'test', '.']

In [None]:
fastbpe_tokenizer("Das ist ein Beispiel .")

['Das', 'ist', 'ein', 'Beispiel', '.']

Rewrite the config file so that the new codes and vocab files are loaded.

In [None]:
fastbpe_config = new_config\
  .replace('vocab.32000', 'vocab.40000')\
  .replace('codes.32000', 'codes.40000')\
  .replace('model_dir: "transformer_iwslt14_deen_bpe"', 'model_dir: "iwslt14_deen_fastbpe"')\
  .replace('load_model:', '#load_model:')

with (Path(root_dir) / "data/iwslt14/fastbpe_config.yaml").open('w') as f:
    f.write(fastbpe_config)
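
One caveat with chained `str.replace()` calls is that a pattern which does not occur in the text is silently ignored, so a typo in a file name would go unnoticed. The sketch below is one possible safeguard. `replace_checked` is a hypothetical helper, and `toy_config` is a made-up two-line stand-in for the full config string.

```python
from typing import Dict


def replace_checked(text: str, replacements: Dict[str, str]) -> str:
    """Apply each replacement in order, failing loudly if a pattern is missing."""
    for old, new in replacements.items():
        if old not in text:
            raise ValueError(f"pattern not found: {old!r}")
        text = text.replace(old, new)
    return text


toy_config = (
    'voc_file: "data/iwslt14/vocab.32000"\n'
    'codes: "data/iwslt14/codes.32000"\n'
)
print(replace_checked(toy_config, {
    "vocab.32000": "vocab.40000",
    "codes.32000": "codes.40000",
}))
```

Called with a pattern that is absent from the text, the helper raises `ValueError` instead of returning the text unchanged.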

Start training from scratch using the fastBPE tokenizer with the newly created codes file.

Check the lines containing "INFO - joeynmt.tokenizers" below. You should see "FastBPETokenizer" in the log.

In [None]:
!cd {root_dir} && python -m joeynmt train data/iwslt14/fastbpe_config.yaml

2022-08-15 20:46:16,536 - INFO - root - Hello! This is Joey-NMT (version 2.0.0).
2022-08-15 20:46:16,537 - INFO - joeynmt.helpers -                           cfg.name : transformer_iwslt14_deen_fastbpe
2022-08-15 20:46:16,538 - INFO - joeynmt.helpers -                cfg.joeynmt_version : 2.0.0
2022-08-15 20:46:16,538 - INFO - joeynmt.helpers -                     cfg.data.train : iwslt14
2022-08-15 20:46:16,538 - INFO - joeynmt.helpers -                       cfg.data.dev : iwslt14
2022-08-15 20:46:16,538 - INFO - joeynmt.helpers -                      cfg.data.test : iwslt14
2022-08-15 20:46:16,538 - INFO - joeynmt.helpers -              cfg.data.dataset_type : huggingface
2022-08-15 20:46:16,538 - INFO - joeynmt.helpers -          cfg.data.dataset_cfg.name : de-en
2022-08-15 20:46:16,538 - INFO - joeynmt.helpers -                  cfg.data.src.lang : de
2022-08-15 20:46:16,539 - INFO - joeynmt.helpers -            cfg.data.src.max_length : 128
2022-08-15 20:46:16,539 - INFO - joeynm

## Split on Whitespaces


JoeyNMT splits texts on whitespace when you specify `level: "word"` in the config.
This is useful if your input texts have already been tokenized by someone else and you have to retain that tokenization.

Here, we show a sample use case with the iwslt15 en-vi dataset preprocessed by the Stanford NLP group.
https://nlp.stanford.edu/projects/nmt/

Download the dataset builder script and vocab files.

In [8]:
!mkdir {root_dir}/data/iwslt15
!wget -O {root_dir}/data/iwslt15/vocab.en https://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi/vocab.en
!wget -O {root_dir}/data/iwslt15/vocab.vi https://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi/vocab.vi
!wget -O {root_dir}/data/iwslt15/iwslt15.py https://raw.githubusercontent.com/may-/datasets/master/datasets/iwslt15/iwslt15.py

--2022-08-16 07:13:22--  https://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi/vocab.en
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 139741 (136K) [text/plain]
Saving to: ‘/content/drive/MyDrive/data/iwslt15/vocab.en’


2022-08-16 07:13:23 (561 KB/s) - ‘/content/drive/MyDrive/data/iwslt15/vocab.en’ saved [139741/139741]

--2022-08-16 07:13:23--  https://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi/vocab.vi
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 46767 (46K) [text/plain]
Saving to: ‘/content/drive/MyDrive/data/iwslt15/vocab.vi’


2022-08-16 07:13:23 (469 KB/s) - ‘/content/drive/MyDrive/data/iwslt15/vocab.vi’ saved [46767/46767]

--2022-08-16 07:13:23--  https://raw.git

We have separate vocab files per language, not a single joint vocab file.

In [9]:
!head -10 {root_dir}/data/iwslt15/vocab.en

<unk>
<s>
</s>
Rachel
:
The
science
behind
a
climate


In [10]:
!head -10 {root_dir}/data/iwslt15/vocab.vi

<unk>
<s>
</s>
Khoa
học
đằng
sau
một
tiêu
đề


Download the preprocessed data.

In [13]:
from datasets import load_dataset

iwslt15_envi = load_dataset(f"{root_dir}/data/iwslt15", name="en-vi")
iwslt15_envi

Downloading and preparing dataset iwslt15/en-vi to /content/drive/MyDrive/.cache/iwslt15/en-vi/1.0.0/c3b9d1bd246837934d62a45b3fbead26d7e863e9db5c20d8e54e6d865fad8b81...


Downloading data:   0%|          | 0.00/13.6M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/18.1M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/140k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/188k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/132k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/184k [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Dataset iwslt15 downloaded and prepared to /content/drive/MyDrive/.cache/iwslt15/en-vi/1.0.0/c3b9d1bd246837934d62a45b3fbead26d7e863e9db5c20d8e54e6d865fad8b81. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 133317
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 1553
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 1268
    })
})

Inspect the data. We can see that punctuation marks are already separated.

In [14]:
iwslt15_envi['train'][10]['translation']

{'en': 'Over 15,000 scientists go to San Francisco every year for that .',
 'vi': 'Mỗi năm , hơn 15,000 nhà khoa học đến San Francisco để tham dự hội nghị này .'}
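
Because the punctuation is already separated by spaces, plain whitespace splitting is enough to recover the tokens. The sketch below uses Python's built-in `str.split()` on the English example above. It only illustrates whitespace splitting and is not a call into JoeyNMT's tokenizer.

```python
sentence = "Over 15,000 scientists go to San Francisco every year for that ."

# Split on runs of whitespace; "15,000" stays intact and "." is its own token.
tokens = sentence.split()
print(tokens)
print(len(tokens))

# Joining with single spaces restores the original string exactly.
print(" ".join(tokens) == sentence)
```

This round trip is lossless only because the input uses single spaces between tokens, which is the case for this preprocessed dataset example.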

Create a config file specifying `level: "word"` in both src and trg.

In [21]:
iwslt15_envi_config = """
name: "transformer_iwslt15_envi"
joeynmt_version: "2.0.0"

data:
    train: "iwslt15"
    dev: "iwslt15"
    test: "iwslt15"
    dataset_type: "huggingface"
    dataset_cfg:
        name: "en-vi"
    src:
        lang: "en"
        lowercase: False
        normalize: False
        level: "word"
        voc_file: "data/iwslt15/vocab.en"
        tokenizer_cfg:
            pretokenizer: "none"
    trg:
        lang: "vi"
        lowercase: False
        normalize: False
        level: "word"
        voc_file: "data/iwslt15/vocab.vi"
        tokenizer_cfg:
            pretokenizer: "none"

testing:
    n_best: 1
    beam_size: 5
    beam_alpha: 1.0
    batch_size: 1024
    batch_type: "token"
    max_output_length: 150
    eval_metrics: ["bleu"]
    return_prob: "none"
    return_attention: False
    sacrebleu_cfg:
        tokenize: "13a"
        lowercase: False

training:
    #load_model: "iwslt15_envi/best.ckpt"
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.99]
    scheduling: "warmupinversesquareroot"
    learning_rate_warmup: 4000
    loss: "crossentropy"
    learning_rate: 0.0001
    learning_rate_min: 0.000005
    weight_decay: 0.0
    clip_grad_norm: 1.0
    label_smoothing: 0.1
    batch_multiplier: 4
    batch_size: 1024
    batch_type: "token"
    early_stopping_metric: "bleu"
    epochs: 40
    validation_freq: 1000
    logging_freq: 100
    model_dir: "iwslt15_envi"
    overwrite: False
    shuffle: True
    use_cuda: True
    print_valid_sents: [0, 1, 2, 3, 4]
    keep_best_ckpts: 5

model:
    initializer: "xavier_uniform"
    embed_initializer: "xavier_uniform"
    embed_init_gain: 1.0
    init_gain: 1.0
    bias_initializer: "zeros"
    tied_embeddings: False
    tied_softmax: False
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4
        embeddings:
            embedding_dim: 256
            scale: True
            dropout: 0.
        # typically ff_size = 4 x hidden_size
        hidden_size: 256
        ff_size: 1024
        dropout: 0.1
        layer_norm: "pre"
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4
        embeddings:
            embedding_dim: 256
            scale: True
            dropout: 0.
        # typically ff_size = 4 x hidden_size
        hidden_size: 256
        ff_size: 1024
        dropout: 0.1
        layer_norm: "pre"

"""

(Path(root_dir) / 'data/iwslt15/config.yaml').write_text(iwslt15_envi_config)

2421

Start training with the preprocessed input texts. You will see "BasicTokenizer" in the lines with "INFO - joeynmt.tokenizers".  

Pay attention not only to the console log but also to the `train.log` file. There, in the validation output, you can see the raw model output before detokenization alongside the post-processed strings.

In [None]:
!cd {root_dir} && python -m joeynmt train data/iwslt15/config.yaml

2022-08-16 07:22:52,896 - INFO - root - Hello! This is Joey-NMT (version 2.0.0).
2022-08-16 07:22:52,897 - INFO - joeynmt.helpers -                           cfg.name : transformer_iwslt15_envi
2022-08-16 07:22:52,898 - INFO - joeynmt.helpers -                cfg.joeynmt_version : 2.0.0
2022-08-16 07:22:52,898 - INFO - joeynmt.helpers -                     cfg.data.train : iwslt15
2022-08-16 07:22:52,898 - INFO - joeynmt.helpers -                       cfg.data.dev : iwslt15
2022-08-16 07:22:52,898 - INFO - joeynmt.helpers -                      cfg.data.test : iwslt15
2022-08-16 07:22:52,898 - INFO - joeynmt.helpers -              cfg.data.dataset_type : huggingface
2022-08-16 07:22:52,898 - INFO - joeynmt.helpers -          cfg.data.dataset_cfg.name : en-vi
2022-08-16 07:22:52,899 - INFO - joeynmt.helpers -                  cfg.data.src.lang : en
2022-08-16 07:22:52,899 - INFO - joeynmt.helpers -             cfg.data.src.lowercase : False
2022-08-16 07:22:52,899 - INFO - joeynmt.help