## Train According to [NLP Course from Hugging Face](https://huggingface.co/learn/nlp-course/chapter6/8?fw=pt)

We'd like to build our VinaLM's tokenizer so that it is almost identical to
- ~PhoBERT's tokenizer~
- RoBERTa's tokenizer (`"roberta-base"`)

In [1]:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

In [2]:
from transformers import AutoTokenizer



In [3]:
roberta_base_tokenizer = AutoTokenizer.from_pretrained(
    "roberta-base")

In [4]:
roberta_base_tokenizer.model_max_length

512

In [5]:
cased_bpe_tokenizer = Tokenizer(
    models.BPE(unk_token=roberta_base_tokenizer.unk_token),
)

In [6]:
print([s for s in dir(cased_bpe_tokenizer) if not s.startswith("_")])

['add_special_tokens', 'add_tokens', 'decode', 'decode_batch', 'decoder', 'enable_padding', 'enable_truncation', 'encode', 'encode_batch', 'from_buffer', 'from_file', 'from_pretrained', 'from_str', 'get_vocab', 'get_vocab_size', 'id_to_token', 'model', 'no_padding', 'no_truncation', 'normalizer', 'num_special_tokens_to_add', 'padding', 'post_process', 'post_processor', 'pre_tokenizer', 'save', 'to_str', 'token_to_id', 'train', 'train_from_iterator', 'truncation']


### Set `normalizer, pre_tokenizer, post_processor` etc.
- `post_processor`: responsible for adding, say, the BOS and EOS tokens to the final output
    - Simply choose `processors.RobertaProcessing`!
    - `processors.ByteLevel` alone won't add BOS or EOS
- `decoder`: `decoders.ByteLevel()`

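It seems that we don't need any normalizer for Esperanto. Alternatively, we can pick a neutral one, e.g. `BertNormalizer`, and configure it to do as little as possible (here, `lowercase=False`). It is not a strict no-op, though: as the normalized output of `sent3` further down shows, newlines are replaced with spaces.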

In [13]:
cased_bpe_tokenizer.normalizer = normalizers.Sequence([
    normalizers.BertNormalizer(
        lowercase=False,
    ),
])

In [15]:
sent2 = "Multaj homoj jam havas ĝin."
cased_bpe_tokenizer.normalizer.normalize_str(sent2)

'Multaj homoj jam havas ĝin.'
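To make the "almost nothing" behaviour of this normalizer concrete, here is a minimal sketch using the same `BertNormalizer(lowercase=False)` configuration. The two sample strings are made up for illustration; the sketch assumes the library defaults for the remaining options (in particular `clean_text=True`).

```python
from tokenizers import normalizers

# Same configuration as the cell above: BERT-style cleanup, but no lowercasing.
normalizer = normalizers.BertNormalizer(lowercase=False)

# Capitals and accented Esperanto letters are left untouched...
kept = normalizer.normalize_str("Ĝi estas ĉi tie.")

# ...but whitespace characters such as newlines and tabs are rewritten
# as plain spaces (assumed to come from the default clean_text=True).
cleaned = normalizer.normalize_str("unu\ndu\ttri")

print(repr(kept))
print(repr(cleaned))
```

So the normalizer preserves case and diacritics, which is what matters for a cased Esperanto tokenizer, while still flattening line breaks.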

In [16]:
cased_bpe_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(
    add_prefix_space=False)

In [23]:
sent3 = """\
Por Apple estas konata la apo Duolingo:
por multaj lingvoj ĝi havas lingvan lernilon,
nun ankaŭ por Esperanto.\
"""
normalized_sent3 = cased_bpe_tokenizer.normalizer.normalize_str(
    sent3)
pre_tokenized_sent3 = cased_bpe_tokenizer.pre_tokenizer.pre_tokenize_str(
    normalized_sent3)
print(normalized_sent3)
print()
print(pre_tokenized_sent3)

Por Apple estas konata la apo Duolingo: por multaj lingvoj ĝi havas lingvan lernilon, nun ankaŭ por Esperanto.

[('Por', (0, 3)), ('ĠApple', (3, 9)), ('Ġestas', (9, 15)), ('Ġkonata', (15, 22)), ('Ġla', (22, 25)), ('Ġapo', (25, 29)), ('ĠDuolingo', (29, 38)), (':', (38, 39)), ('Ġpor', (39, 43)), ('Ġmultaj', (43, 50)), ('Ġlingvoj', (50, 58)), ('ĠÄĿi', (58, 61)), ('Ġhavas', (61, 67)), ('Ġlingvan', (67, 75)), ('Ġlernilon', (75, 84)), (',', (84, 85)), ('Ġnun', (85, 89)), ('ĠankaÅŃ', (89, 95)), ('Ġpor', (95, 99)), ('ĠEsperanto', (99, 109)), ('.', (109, 110))]
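Pieces such as `ĠÄĿi` and `ĠankaÅŃ` in the output above are not encoding errors. A byte-level pre-tokenizer works on the UTF-8 bytes of the text and displays each byte as one printable character. The sketch below re-implements that byte-to-character table in plain Python, assuming `ByteLevel` follows the GPT-2 scheme (which is consistent with the output above). The helper names `to_byte_level` and `from_byte_level` are hypothetical and not part of `tokenizers`.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 scheme)."""
    # Bytes that already correspond to printable characters map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every remaining byte (space, control bytes, ...) is shifted to 256 and above.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Show text the way a byte-level pre-tokenizer displays it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(token):
    """Invert to_byte_level, recovering the original text."""
    return bytes(CHAR_TO_BYTE[c] for c in token).decode("utf-8")


print(to_byte_level(" ĝi"))       # the leading space becomes "Ġ"
print(to_byte_level(" ankaŭ"))
print(repr(from_byte_level(to_byte_level(" ĝi"))))
```

The letter `ĝ` takes two bytes in UTF-8, which is why it shows up as two characters in the pre-tokenized pieces while the offsets still count it as a single character.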


In [24]:
sep_token, sep_token_id = (
    roberta_base_tokenizer.sep_token,
    roberta_base_tokenizer.sep_token_id,
)
cls_token, cls_token_id = (
    roberta_base_tokenizer.cls_token,
    roberta_base_tokenizer.cls_token_id,
)
cased_bpe_tokenizer.post_processor = processors.RobertaProcessing(
    sep=(sep_token, sep_token_id),
    cls=(cls_token, cls_token_id),
    #trim_offsets=True,
    add_prefix_space=False,
)
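To see what `RobertaProcessing` adds, the following sketch attaches it to a tiny word-level tokenizer. The vocabulary, the `Whitespace` pre-tokenizer and the name `toy_tokenizer` are assumptions chosen only to keep the example small; they are not part of the tokenizer built in this notebook.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, processors

# Hypothetical toy vocabulary; the special-token IDs mirror the RoBERTa layout.
toy_vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "saluton": 4, "mondo": 5}
toy_tokenizer = Tokenizer(models.WordLevel(toy_vocab, unk_token="<unk>"))
toy_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
toy_tokenizer.post_processor = processors.RobertaProcessing(
    sep=("</s>", 2),
    cls=("<s>", 0),
)

# Single sequence: <s> A </s>
single = toy_tokenizer.encode("saluton mondo")
# Sequence pair: <s> A </s></s> B </s>
pair = toy_tokenizer.encode("saluton", "mondo")

print(single.tokens)
print(pair.tokens)
```

The `(token, id)` pairs passed as `sep` and `cls` must match the IDs the trained vocabulary ends up with, which is why the special tokens are later given to the trainer in the same order as in `"roberta-base"`.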

Lastly, `decoder`:

In [25]:
cased_bpe_tokenizer.decoder = decoders.ByteLevel()

Recall that, for the `"roberta-base"` tokenizer,

| Token Type | Token | Token ID |
| --- | --- | --- |
| `bos_token` | `"<s>"` | 0 |
| `pad_token` | `"<pad>"` | 1 |
| `eos_token` | `"</s>"` | 2 |
| `unk_token` | `"<unk>"` | 3 |
| `mask_token` | `"<mask>"` | 50264 |
| `sep_token` | `"</s>"` | 2 |
| `cls_token` | `"<s>"` | 0 |
| `???` | `"<\|endoftext\|>"` | 50260 |

## `Trainer` and Train

In [26]:
roberta_base_tokenizer.special_tokens_map_extended

{'bos_token': '<s>',
 'eos_token': '</s>',
 'unk_token': '<unk>',
 'sep_token': '</s>',
 'pad_token': '<pad>',
 'cls_token': '<s>',
 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)}

In [27]:
from tokenizers import AddedToken
AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)

AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)

In [28]:
trainer = trainers.BpeTrainer(
    show_progress=True,
    # special_tokens=list(
    #     roberta_base_tokenizer.special_tokens_map.values()),
    #special_tokens=["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"],
    #special_tokens=["<s>", "<pad>", "</s>", "<unk>"],
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    # special_tokens=["<s>", "<pad>", "</s>", "<unk>",
    #     roberta_base_tokenizer.special_tokens_map_extended["mask_token"],
    # ],
    #special_tokens=[],
)

In [29]:
trainer.special_tokens

[AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False),
 AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False),
 AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False),
 AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False),
 AddedToken("<mask>", rstrip=False, lstrip=False, single_word=False, normalized=False)]
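The order of `special_tokens` matters because the trainer assigns the first vocabulary IDs to them in the order given. The sketch below checks this on a toy in-memory corpus; the corpus, the small `vocab_size` and the names `toy_tokenizer` / `toy_trainer` are assumptions for illustration only.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

special_tokens = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]

toy_tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
toy_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

toy_trainer = trainers.BpeTrainer(
    vocab_size=100,            # tiny on purpose; the real run uses the default
    show_progress=False,
    special_tokens=special_tokens,
)

# Hypothetical toy corpus standing in for the real training file.
toy_corpus = [
    "Multaj homoj jam havas ĝin.",
    "Por multaj lingvoj ĝi havas lingvan lernilon.",
]
toy_tokenizer.train_from_iterator(toy_corpus, trainer=toy_trainer)

# Special tokens receive IDs 0, 1, 2, ... in the order they were listed.
special_ids = {tok: toy_tokenizer.token_to_id(tok) for tok in special_tokens}
print(special_ids)
```

Listing `"<s>", "<pad>", "</s>", "<unk>"` first therefore reproduces IDs 0 to 3 of `"roberta-base"`, while `"<mask>"` lands at 4 rather than 50264.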

In [30]:
from pathlib import Path

In [31]:
data_dir = Path("../../../data/")
assert data_dir.is_dir()

In [32]:
text_paths = [
    data_dir/"oscar.eo.txt"
]
assert all(p.exists() for p in text_paths), "some path doesn't exist! Check again, please!"

In [33]:
cased_bpe_tokenizer.train(
    [str(p) for p in text_paths],
    #[str(p) for p in toy_text_paths],
    trainer=trainer,
)

In [34]:
encoding = cased_bpe_tokenizer.encode(sent3)
encoding

Encoding(num_tokens=24, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [35]:
encoding.tokens

['<s>',
 'Por',
 'ĠApple',
 'Ġestas',
 'Ġkonata',
 'Ġla',
 'Ġapo',
 'ĠDuolingo',
 ':',
 'Ġpor',
 'Ġmultaj',
 'Ġlingvoj',
 'ĠÄĿi',
 'Ġhavas',
 'Ġlingvan',
 'Ġlern',
 'ilon',
 ',',
 'Ġnun',
 'ĠankaÅŃ',
 'Ġpor',
 'ĠEsperanto',
 '.',
 '</s>']

In [36]:
encoding.ids

[0,
 1949,
 18453,
 265,
 1792,
 213,
 2324,
 10652,
 30,
 289,
 824,
 1151,
 433,
 581,
 8438,
 901,
 2240,
 16,
 584,
 474,
 289,
 488,
 18,
 2]

In [37]:
vocab_size = cased_bpe_tokenizer.get_vocab_size()
vocab_size

30000

In [38]:
[cased_bpe_tokenizer.id_to_token(i) for i in range(7)]

['<s>', '<pad>', '</s>', '<unk>', '<mask>', '!', '"']

In [39]:
[cased_bpe_tokenizer.id_to_token(vocab_size-i) for i in range(1,7)]

['ãģĹ', 'ĠkaÅĿitan', 'migrado', '130', 'Ġlegiti', 'Ġplibonigoj']

### Making It into a Fast Tokenizer

In [40]:
from transformers import (
    RobertaTokenizerFast,
    PreTrainedTokenizerFast,
    RobertaForMaskedLM,
)

In [41]:
roberta_base_tokenizer.model_max_length

512

In [42]:
roberta_base_tokenizer.max_model_input_sizes

{'roberta-base': 512,
 'roberta-large': 512,
 'roberta-large-mnli': 512,
 'distilroberta-base': 512,
 'roberta-base-openai-detector': 512,
 'roberta-large-openai-detector': 512}

In [43]:
fast_cased_bpe_tokenizer = RobertaTokenizerFast(
    tokenizer_object=cased_bpe_tokenizer,
    model_max_length=512,
)

# fast_cased_bpe_tokenizer = PreTrainedTokenizerFast(
#     tokenizer_object=cased_bpe_tokenizer,
#     unk_token=roberta_base_tokenizer.unk_token,
#     pad_token=roberta_base_tokenizer.pad_token,
#     cls_token=roberta_base_tokenizer.cls_token,
#     sep_token=roberta_base_tokenizer.sep_token,
#     mask_token=roberta_base_tokenizer.mask_token,
# )

In [44]:
fast_cased_bpe_tokenizer.model_max_length

512

In [45]:
fast_cased_bpe_tokenizer.is_fast

True

In [46]:
fast_cased_bpe_tokenizer.special_tokens_map

{'bos_token': '<s>',
 'eos_token': '</s>',
 'unk_token': '<unk>',
 'sep_token': '</s>',
 'pad_token': '<pad>',
 'cls_token': '<s>',
 'mask_token': '<mask>'}

In [47]:
for special_token_type, special_token in fast_cased_bpe_tokenizer.special_tokens_map.items():
    special_token_id = fast_cased_bpe_tokenizer.convert_tokens_to_ids(
        special_token)
    print((special_token_id, special_token_type, special_token))

(0, 'bos_token', '<s>')
(2, 'eos_token', '</s>')
(3, 'unk_token', '<unk>')
(2, 'sep_token', '</s>')
(1, 'pad_token', '<pad>')
(0, 'cls_token', '<s>')
(4, 'mask_token', '<mask>')


In [48]:
encoding = fast_cased_bpe_tokenizer(sent3)
encoding

{'input_ids': [0, 1949, 18453, 265, 1792, 213, 2324, 10652, 30, 289, 824, 1151, 433, 581, 8438, 901, 2240, 16, 584, 474, 289, 488, 18, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [49]:
encoding.tokens()

['<s>',
 'Por',
 'ĠApple',
 'Ġestas',
 'Ġkonata',
 'Ġla',
 'Ġapo',
 'ĠDuolingo',
 ':',
 'Ġpor',
 'Ġmultaj',
 'Ġlingvoj',
 'ĠÄĿi',
 'Ġhavas',
 'Ġlingvan',
 'Ġlern',
 'ilon',
 ',',
 'Ġnun',
 'ĠankaÅŃ',
 'Ġpor',
 'ĠEsperanto',
 '.',
 '</s>']

In [50]:
type(encoding)

transformers.tokenization_utils_base.BatchEncoding

In [51]:
fast_cased_bpe_tokenizer.backend_tokenizer is cased_bpe_tokenizer

False

In [52]:
fast_cased_bpe_tokenizer.backend_tokenizer == cased_bpe_tokenizer

False

### Push to hub

In [53]:
hub_repo = "phunc20/esperoberta-cased"
fast_cased_bpe_tokenizer.push_to_hub(hub_repo)

CommitInfo(commit_url='https://huggingface.co/phunc20/esperoberta-cased/commit/491d7128d0b603b50bf06fd098661609b0d78938', commit_message='Upload tokenizer', commit_description='', oid='491d7128d0b603b50bf06fd098661609b0d78938', pr_url=None, pr_revision=None, pr_num=None)

## Reload to Verify

In [None]:
from transformers import AutoTokenizer

In [55]:
reloaded_tokenizer = AutoTokenizer.from_pretrained(
    hub_repo,
)
reloaded_tokenizer

RobertaTokenizerFast(name_or_path='phunc20/esperoberta-cased', vocab_size=30000, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True)})

In [56]:
reloaded_tokenizer.is_fast

True

In [57]:
sent2

'Multaj homoj jam havas ĝin.'

In [58]:
sent3

'Por Apple estas konata la apo Duolingo:\npor multaj lingvoj ĝi havas lingvan lernilon,\nnun ankaŭ por Esperanto.'

In [59]:
encoding2 = reloaded_tokenizer(sent2)
encoding2

{'input_ids': [0, 7628, 768, 600, 581, 644, 18, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

In [60]:
encoding2.tokens()

['<s>', 'Multaj', 'Ġhomoj', 'Ġjam', 'Ġhavas', 'ĠÄĿin', '.', '</s>']

In [61]:
reloaded_tokenizer.convert_tokens_to_string(encoding2.tokens())

'<s>Multaj homoj jam havas ĝin.</s>'

In [62]:
encoding3 = reloaded_tokenizer(sent3)
encoding3

{'input_ids': [0, 1949, 18453, 265, 1792, 213, 2324, 10652, 30, 289, 824, 1151, 433, 581, 8438, 901, 2240, 16, 584, 474, 289, 488, 18, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [63]:
encoding3.tokens()

['<s>',
 'Por',
 'ĠApple',
 'Ġestas',
 'Ġkonata',
 'Ġla',
 'Ġapo',
 'ĠDuolingo',
 ':',
 'Ġpor',
 'Ġmultaj',
 'Ġlingvoj',
 'ĠÄĿi',
 'Ġhavas',
 'Ġlingvan',
 'Ġlern',
 'ilon',
 ',',
 'Ġnun',
 'ĠankaÅŃ',
 'Ġpor',
 'ĠEsperanto',
 '.',
 '</s>']

In [64]:
reloaded_tokenizer.convert_tokens_to_string(encoding3.tokens())

'<s>Por Apple estas konata la apo Duolingo: por multaj lingvoj ĝi havas lingvan lernilon, nun ankaŭ por Esperanto.</s>'
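The manual inspection above can also be phrased as an automated round-trip check: encode a sentence, decode the IDs while skipping special tokens, and compare with the input. The sketch below does this offline on a toy tokenizer built with the same components as in this notebook (minus the normalizer); the one-sentence training corpus and the name `toy` are assumptions, and the same assertions could be pointed at the reloaded tokenizer instead.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, processors, trainers

# Toy tokenizer with the same pre-tokenizer, post-processor and decoder choices.
toy = Tokenizer(models.BPE(unk_token="<unk>"))
toy.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
toy.post_processor = processors.RobertaProcessing(
    sep=("</s>", 2),
    cls=("<s>", 0),
    add_prefix_space=False,
)
toy.decoder = decoders.ByteLevel()

toy_trainer = trainers.BpeTrainer(
    vocab_size=100,
    show_progress=False,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

sentence = "Multaj homoj jam havas ĝin."
# Training on the test sentence itself guarantees every byte is in the vocabulary.
toy.train_from_iterator([sentence], trainer=toy_trainer)

encoding = toy.encode(sentence)
# decode() skips special tokens by default, so <s> and </s> are dropped.
round_trip = toy.decode(encoding.ids)

print(encoding.tokens[0], encoding.tokens[-1])
print(round_trip)
```

Unlike `convert_tokens_to_string` on the full token list, this comparison ignores `<s>` and `</s>`, so the decoded string can be compared directly with the original sentence.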