In [1]:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer
)

Initialize a tokenizer instance

In [2]:
# unk_token must match the '<unk>' entry in the trainer's special_tokens below
tokenizer = Tokenizer(models.BPE(unk_token='<unk>'))

Next we set the normalization components, applied in order: *convert to lowercase* -> *normalize with NFKD*

In [4]:
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.Lowercase(), normalizers.NFKD()]
)
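To get a feel for what this sequence does to a string, the same two steps can be approximated with Python's standard `unicodedata` module. This is only a sketch under that assumption, not the library's implementation, and `normalize_like_pipeline` is a hypothetical helper name.

```python
import unicodedata


def normalize_like_pipeline(text: str) -> str:
    """Approximate Lowercase() followed by NFKD() with the standard library."""
    lowered = text.lower()
    # NFKD replaces compatibility characters (e.g. the "fi" ligature) and
    # splits accented letters into a base letter plus a combining mark.
    return unicodedata.normalize("NFKD", lowered)


# "\uFB01" is the single-character "fi" ligature, "\u00C9" is "É".
normalized = normalize_like_pipeline("\uFB01le \u00C9cole")
print(normalized.encode("unicode_escape").decode("ascii"))
```

Because the accent survives as a separate combining character (`\u0301`), the normalized text can be longer than the input.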

Next comes pre-tokenization, i.e. how the text is split into words before tokenization. For Dhivehi we want to split on both whitespace and punctuation (a comma isn't part of a word, it's a separate token). This is a common approach, and the `Whitespace` pre-tokenizer covers it:

In [5]:
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
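The `Whitespace` pre-tokenizer is documented as splitting on the pattern `\w+|[^\w\s]+`. A rough standard-library sketch of that rule on an English sample is below. Python's `re` may treat `\w` differently from the regex engine behind `tokenizers` (notably for combining marks such as Thaana vowel signs), so this illustrates the idea rather than reproducing the library's exact output on Dhivehi text. The helper name `split_words_and_punctuation` is made up for this example.

```python
import re
from typing import List

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation and symbols).
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")


def split_words_and_punctuation(text: str) -> List[str]:
    """Split text into word pieces and punctuation pieces, dropping whitespace."""
    return WORD_OR_PUNCT.findall(text)


pieces = split_words_and_punctuation("Hello, world! It's 2024.")
print(pieces)
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2024', '.']
```

To inspect the real behaviour on a Dhivehi sentence, `pre_tokenize_str` on the pre-tokenizer object returns the pieces along with their character offsets.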

Next we train the tokenizer. We've already assigned the BPE model, so all that remains is to build a trainer and pass it our Dhivehi text. The trainer needs two things: the special tokens to include in the vocab (they won't be learned from the training data, since hopefully they never appear in it!) and our target vocab size.

In [6]:
trainer = trainers.BpeTrainer(
    vocab_size=30_000,
    special_tokens=['<s>', '</s>', '<unk>', '<pad>', '<mask>'],
    min_frequency=2
)
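To give a feel for what `min_frequency` controls, here is a toy sketch of a single BPE step: count adjacent symbol pairs across a word-frequency table, pick the most frequent pair, and refuse to merge it if it falls below the threshold. The word counts and the helper `best_pair` are invented for illustration, and the real trainer is considerably more involved.

```python
from collections import Counter


def best_pair(word_counts, min_frequency=2):
    """Return the most frequent adjacent symbol pair, or None if none qualifies."""
    pair_counts = Counter()
    for word, count in word_counts.items():
        symbols = list(word)
        # Each adjacent pair in the word is seen `count` times in the corpus.
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += count
    if not pair_counts:
        return None
    pair, freq = pair_counts.most_common(1)[0]
    return pair if freq >= min_frequency else None


# Hypothetical word frequencies, not taken from the Dhivehi corpus.
toy_counts = {"low": 5, "lowest": 2, "newer": 6}
print(best_pair(toy_counts))
```

Here the pair `('w', 'e')` occurs 8 times (2 in "lowest", 6 in "newer"), more than any other pair, so it would be merged first. Raising `min_frequency` above 8 would stop merging altogether.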

In [7]:
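# The file object yields one line at a time, so the corpus is streamed
# rather than loaded into memory; `with` closes it even if training fails.
with open('../data/dv-corpus-clean-unique.txt', encoding='utf-8') as read_dv:
    tokenizer.train_from_iterator(read_dv, trainer=trainer)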
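`train_from_iterator` accepts an iterator over strings or over lists of strings, so for a large corpus one option is to feed it batches of cleaned lines. The generator below is a minimal sketch of that idea; `batch_lines` and the batch size are assumptions for illustration, not part of the `tokenizers` API.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batch_lines(lines: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield lists of stripped, non-empty lines, at most batch_size per list."""
    cleaned = (line.strip() for line in lines)
    non_empty = (line for line in cleaned if line)
    while True:
        batch = list(islice(non_empty, batch_size))
        if not batch:
            return
        yield batch


# Small in-memory stand-in for the corpus file.
demo = list(batch_lines(["a\n", "\n", "b\n", "c\n"], batch_size=2))
print(demo)  # [['a', 'b'], ['c']]

# With the open corpus file, this would be used as:
# tokenizer.train_from_iterator(batch_lines(read_dv), trainer=trainer)
```

Skipping blank lines here is a design choice for tidiness. The `Whitespace` pre-tokenizer discards whitespace anyway, so it should not change the learned vocabulary.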