Let's build a tokenizer brick by brick. Here we construct three fully custom implementations of the BPE, WordPiece, and Unigram tokenizers we previously discussed, using the building blocks of the Hugging Face Tokenizers library. We train each of these tokenizers on the WikiText-2 corpus.

In [0]:
from datasets import load_dataset

dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")


def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]
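
The generator above feeds the trainer batches of up to 1,000 texts at a time instead of one text per step. A minimal sketch of the same slicing pattern on a plain Python list (the names `batch_iterator` and `toy_corpus` are hypothetical stand-ins, not part of the notebook):

```python
def batch_iterator(texts, batch_size=1000):
    """Yield consecutive slices of `texts`, each holding at most `batch_size` items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]


# Hypothetical toy corpus standing in for the dataset's "text" column.
toy_corpus = [f"line {i}" for i in range(2500)]
batch_sizes = [len(batch) for batch in batch_iterator(toy_corpus)]
print(batch_sizes)  # [1000, 1000, 500]
```

The last batch is simply shorter when the corpus size is not a multiple of the batch size.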



In [0]:
dataset

Dataset({
    features: ['text'],
    num_rows: 36718
})

In [0]:
with open("wikitext-2.txt", "w", encoding="utf-8") as f:
    for i in range(len(dataset)):
        f.write(dataset[i]["text"] + "\n")

We begin by building a WordPiece tokenizer.

In [0]:
# import all relevant pieces of a tokenizer from the HF tokenizers lib and instantiate a base WordPiece model with a given unknown token
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

In [0]:
# we use the standard BERT uncased normalizer
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)

In [0]:
# rather than using a predefined normalizer though, we can make use of a custom one by combining specific normalization steps into a Sequence
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
# the NFD Unicode normalizer decomposes accented characters into a base character plus combining marks, so that StripAccents can recognize and remove the accents

In [0]:
# let's peek at the result of our custom normalizer on a sample input
print(tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))

hello how are u?
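
To see why the NFD step matters, here is a rough standard-library sketch of the same three steps. It is an approximation for illustration, not the library's implementation: it assumes that stripping accents amounts to dropping characters in the Unicode category `Mn` (nonspacing marks), and the function name is hypothetical.

```python
import unicodedata


def nfd_lower_strip_accents(text):
    """Approximate NFD -> Lowercase -> StripAccents with the standard library."""
    decomposed = unicodedata.normalize("NFD", text)  # "é" becomes "e" + combining accent
    lowered = decomposed.lower()
    # Assumption: accents are the nonspacing combining marks (category "Mn").
    return "".join(ch for ch in lowered if unicodedata.category(ch) != "Mn")


print(nfd_lower_strip_accents("Héllò hôw are ü?"))  # hello how are u?

# Without decomposition, a precomposed "é" is a single letter and has no mark to strip.
precomposed = unicodedata.normalize("NFC", "é")
print(unicodedata.category(precomposed))  # Ll
```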


In [0]:
# as with the normalizer, we can use the predefined BERT pre-tokenizer
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

In [0]:
# or we can construct a custom one from the pre-tokenization techniques available in the HF documentation
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

In [0]:
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

[("Let's", (0, 5)),
 ('test', (6, 10)),
 ('my', (11, 13)),
 ('pre-tokenizer.', (14, 28))]

In [0]:
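# and similarly we can construct a sequence of multiple pre-tokenization techniques
# (shown for illustration only: it is not assigned to tokenizer.pre_tokenizer, so the tokenizer keeps WhitespaceSplit)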
pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()]
)
pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

[('Let', (0, 3)),
 ("'", (3, 4)),
 ('s', (4, 5)),
 ('test', (6, 10)),
 ('my', (11, 13)),
 ('pre', (14, 17)),
 ('-', (17, 18)),
 ('tokenizer', (18, 27)),
 ('.', (27, 28))]
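
The two-stage behavior above (split on whitespace first, then isolate punctuation inside each word while tracking offsets) can be approximated with regular expressions. This is a sketch under assumptions rather than the library's logic: the function name is hypothetical, and the regex treats every character that is neither a word character nor whitespace as punctuation, which may differ from the library's definition (for example for underscores).

```python
import re


def whitespace_then_punctuation(text):
    """Split on whitespace, then isolate each punctuation character, keeping offsets."""
    pieces = []
    for word in re.finditer(r"\S+", text):
        # Inside each whitespace-delimited word, separate runs of word characters
        # from individual punctuation characters.
        for piece in re.finditer(r"\w+|[^\w\s]", word.group()):
            start = word.start() + piece.start()
            pieces.append((piece.group(), (start, start + len(piece.group()))))
    return pieces


for piece in whitespace_then_punctuation("Let's test my pre-tokenizer."):
    print(piece)
```

Each offset pair indexes back into the original string, which is what later allows tokens to be mapped to character spans.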

In [0]:
# we can use a trainer to train a new tokenizer, but we must pass it every special token we intend to use; otherwise they won't be added to the vocabulary, since they don't appear in the training corpus
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

# As well as specifying the vocab_size and special_tokens, we can set the min_frequency (the number of times a token must appear to be included in the vocabulary) or change the continuing_subword_prefix (if we want to use something different from ##)

In [0]:
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)






In [0]:
# we can also train directly on a txt file by using the commented out command below
# tokenizer.model = models.WordPiece(unk_token="[UNK]")
# tokenizer.train(["wikitext-2.txt"], trainer=trainer)

In [0]:
# test out our trained tokenizer on an input
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

['let', "##'", '##s', 'test', 'this', 'tok', '##eni', '##zer', '##.']
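
The `##` pieces in the output come from WordPiece's greedy longest-match-first encoding: at each position, the longest vocabulary entry that fits is taken, and pieces that continue a word carry the prefix. A minimal sketch with a hypothetical toy vocabulary (the trained vocabulary is far larger, and the function name is an assumption for illustration):

```python
def wordpiece_encode_word(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        tokens.append(match)
        start = end
    return tokens


# Hypothetical toy vocabulary, chosen to mirror the output above.
toy_vocab = {"tok", "##eni", "##zer", "##en", "test"}
print(wordpiece_encode_word("tokenizer", toy_vocab))  # ['tok', '##eni', '##zer']
print(wordpiece_encode_word("xyz", toy_vocab))  # ['[UNK]']
```

Note that `##eni` wins over `##en` because the longer match is tried first.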


In [0]:
# for post-processing, the final step in a tokenizer, we need to add the [CLS] token at the beginning and the [SEP] token at the end (or after each sentence, if we have a pair of sentences)
# but first, we look up the IDs assigned to each of these tokens so we can use them in a `TemplateProcessing` post-processor
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)

2 3


In [0]:
# we construct templates for input postprocessing as follows, where $A and $B are placeholders for the first and second sentences, respectively
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)
# the ids for the special tokens are passed along so that the tokenizer knows how to map them properly
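
To make the template syntax concrete, here is a small pure-Python interpreter for strings of the form `name:type_id`. It is only a sketch of the idea (the function `apply_template` is hypothetical and works on token strings, whereas the real post-processor works on encodings and IDs):

```python
def apply_template(template, sequence_a, sequence_b=None):
    """Expand a template such as "[CLS]:0 $A:0 [SEP]:0" into tokens and type IDs."""
    tokens, type_ids = [], []
    for piece in template.split():
        name, type_id = piece.rsplit(":", 1)
        if name == "$A":
            chunk = sequence_a
        elif name == "$B":
            chunk = sequence_b
        else:
            chunk = [name]  # anything else is a literal special token
        tokens.extend(chunk)
        type_ids.extend([int(type_id)] * len(chunk))
    return tokens, type_ids


tokens, type_ids = apply_template(
    "[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1", ["let", "test"], ["on", "a"]
)
print(tokens)  # ['[CLS]', 'let', 'test', '[SEP]', 'on', 'a', '[SEP]']
print(type_ids)  # [0, 0, 0, 0, 1, 1, 1]
```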

In [0]:
# test on a single sentence
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

['[CLS]', 'let', "##'", '##s', 'test', 'this', 'tok', '##eni', '##zer', '##.', '[SEP]']


In [0]:
# test on a pair of sentences
encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences.")
print(encoding.tokens)
print(encoding.type_ids)

['[CLS]', 'let', "##'", '##s', 'test', 'this', 'tok', '##eni', '##zer', '##..', '##.', '[SEP]', 'on', 'a', 'pair', 'of', 'sentences', '##.', '[SEP]']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]


In [0]:
# the last thing we need to implement is a decoder; the WordPiece decoder takes the continuation prefix it should strip when merging subwords back together - here ##
tokenizer.decoder = decoders.WordPiece(prefix="##")
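
Conceptually, this decoder glues every piece that starts with the prefix onto the previous piece and joins the rest with spaces. A simplified sketch of that idea (the function name is hypothetical, and the library's decoder may apply additional cleanup that is not modeled here):

```python
def wordpiece_decode(tokens, prefix="##"):
    """Merge continuation pieces into the preceding word, then join words with spaces."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            words[-1] += token[len(prefix):]  # continuation: attach without a space
        else:
            words.append(token)  # start of a new word
    return " ".join(words)


pieces = ["let", "##'", "##s", "test", "this", "tok", "##eni", "##zer", "##."]
print(wordpiece_decode(pieces))  # let's test this tokenizer.
```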

In [0]:
# let's confirm this decoder works as anticipated
tokenizer.decode(encoding.ids)

"let's test this tokenizer... on a pair of sentences."

In [0]:
# we can save our tokenizer to a single JSON file with the save() method
tokenizer.save("wordpiece_tokenizer.json")

In [0]:
# load from file
loaded_tokenizer = Tokenizer.from_file("wordpiece_tokenizer.json")

In [0]:
# Finally, to be able to use this custom built tokenizer in Hugging Face transformers, we need to wrap it in a PreTrainedTokenizerFast class
# OR, if our tokenizer corresponds to an existing model, we can use that specific class (here, BertTokenizerFast)
# we do the former (for truly custom tokenizers) as follows
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    # tokenizer_file="tokenizer.json", # You can load from the tokenizer file, alternatively
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# NOTE: all special tokens must be passed manually like this, since the wrapper cannot infer from the tokenizer object which token is the mask token, the [CLS] token, and so on

Now that we have created our custom tokenizer brick by brick and wrapped it in a `PreTrainedTokenizerFast` class, we can use it like any other 🤗 Transformers tokenizer. We can save it with the `save_pretrained()` method or upload it to the HF Hub with the `push_to_hub()` method.

Now we will construct a custom byte-level BPE tokenizer built like GPT-2's.

In [0]:
# initialize with a BPE model - since byte-level BPE (like that of GPT-2) doesn't need an unknown token, we don't pass one
tokenizer = Tokenizer(models.BPE())

In [0]:
# GPT-2 doesn't use a normalizer, so we skip that step and go straight to pre-tokenization
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)  # don't add a space before the first token

In [0]:
# sample pretokenization
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test pre-tokenization!")

[('Let', (0, 3)),
 ("'s", (3, 5)),
 ('Ġtest', (5, 10)),
 ('Ġpre', (10, 14)),
 ('-', (14, 15)),
 ('tokenization', (15, 27)),
 ('!', (27, 28))]
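
The `Ġ` in the output is how a space byte is displayed. Byte-level BPE maps each of the 256 byte values to a printable character, so no input can ever be unknown. The sketch below reproduces the mapping scheme used by GPT-2's reference encoder; that the library's internal table follows the same scheme is an assumption here.

```python
def bytes_to_unicode():
    """Map every byte value (0-255) to a printable Unicode character, GPT-2 style."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for byte in range(256):
        if byte not in printable:
            # Remaining bytes (space, control characters, ...) are moved past U+00FF.
            byte_values.append(byte)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


mapping = bytes_to_unicode()
visible = "".join(mapping[b] for b in " test".encode("utf-8"))
print(visible)  # Ġtest
```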

In [0]:
# for training byte-level BPE here, we only need the end-of-text special token - it is the only one!
trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)






In [0]:
# as before, we can similarly train the tokenizer on text files using the commented out code below
# tokenizer.model = models.BPE()
# tokenizer.train(["wikitext-2.txt"], trainer=trainer)

In [0]:
# peek at how our encoding works after training
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

['L', 'et', "'", 's', 'Ġtest', 'Ġthis', 'Ġto', 'ken', 'izer', '.']
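
At encoding time, BPE starts from individual symbols and replays the learned merges in the order they were learned. A minimal sketch with a hypothetical merge list (the trained tokenizer's actual merges are not shown in this notebook, and the function name is an assumption):

```python
def apply_bpe_merges(symbols, merges):
    """Apply each merge rule, in learned order, to a sequence of symbols."""
    symbols = list(symbols)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)  # fuse the adjacent pair into one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merges, listed from earliest to latest learned.
toy_merges = [("Ġ", "t"), ("e", "s"), ("Ġt", "es"), ("Ġtes", "t")]
print(apply_bpe_merges("Ġtest", toy_merges))  # ['Ġtest']
print(apply_bpe_merges("Ġtea", toy_merges))  # ['Ġt', 'e', 'a']
```

A word for which few merges apply, like the second example, stays split into smaller pieces.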


In [0]:
# we set trim_offsets=False in the byte-level post-processor so that the offsets of tokens beginning with Ġ keep the leading space
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)

In [0]:
# we can see the effect of the above here:
sentence = "Let's test this tokenizer."
encoding = tokenizer.encode(sentence)
start, end = encoding.offsets[4]
sentence[start:end]

' test'
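
For comparison, trimming an offset span means moving its boundaries inward past any surrounding whitespace. A small sketch of that idea on the same sentence, where the span `(5, 10)` is inferred from the `' test'` output above and `trim_span` is a hypothetical helper:

```python
def trim_span(text, span):
    """Shrink a (start, end) span so it no longer covers surrounding whitespace."""
    start, end = span
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


sentence = "Let's test this tokenizer."
untrimmed = (5, 10)  # span that includes the space encoded by Ġ
trimmed = trim_span(sentence, untrimmed)
print(repr(sentence[untrimmed[0]:untrimmed[1]]))  # ' test'
print(repr(sentence[trimmed[0]:trimmed[1]]))  # 'test'
```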

In [0]:
# lastly we add a decoder
tokenizer.decoder = decoders.ByteLevel()

In [0]:
# we double check it works
tokenizer.decode(encoding.ids)

"Let's test this tokenizer."

In [0]:
# and we wrap our tokenizer in the PreTrainedTokenizerFast class for integration into the HF ecosystem
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
)

# or since we recreated GPT-2, we can use its pretrained tokenizer class as follows in the commented out code below
# from transformers import GPT2TokenizerFast
# wrapped_tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)

# in general, for any arbitrary custom tokenizer, wrapping in the PreTrainedTokenizerFast class is best


Lastly, we build an XLNet tokenizer. As with the previous tokenizers, we start by initializing a Tokenizer, this time with a Unigram model.

In [0]:
tokenizer = Tokenizer(models.Unigram())

In [0]:
# for the normalization, XLNet uses a few replacements (which come from SentencePiece)
# these replace `` and '' with the " character, decompose and strip all accents, and collapse sequences of two or more spaces into a single space
from tokenizers import Regex

tokenizer.normalizer = normalizers.Sequence(
    [
        normalizers.Replace("``", '"'),
        normalizers.Replace("''", '"'),
        normalizers.NFKD(),
        normalizers.StripAccents(),
        normalizers.Replace(Regex(" {2,}"), " "),
    ]
)
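
The effect of this sequence can be previewed with a rough standard-library equivalent. As before, this is a sketch under assumptions: the function name is hypothetical, and accent stripping is approximated by removing characters in the Unicode category `Mn`.

```python
import re
import unicodedata


def xlnet_like_normalize(text):
    """Approximate the quote replacement, NFKD, accent stripping, and space collapsing."""
    text = text.replace("``", '"').replace("''", '"')
    text = unicodedata.normalize("NFKD", text)
    # Assumption: accents are the nonspacing combining marks (category "Mn").
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return re.sub(r" {2,}", " ", text)


print(xlnet_like_normalize("``Café''   au  lait"))  # "Cafe" au lait
```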

In [0]:
# SentencePiece tokenizers use the Metaspace pre-tokenizer
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

In [0]:
# peek pre-tokenization sample
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test the pre-tokenizer!")

[("▁Let's", (0, 5)),
 ('▁test', (5, 10)),
 ('▁the', (10, 14)),
 ('▁pre-tokenizer!', (14, 29))]
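
Metaspace replaces every space with the visible `▁` character, prepends one to the start of the text, and splits so that each piece begins with that marker. A simplified sketch of the splitting and of the reverse operation (both function names are hypothetical, and offsets are not tracked here):

```python
def metaspace_split(text, replacement="▁"):
    """Mark spaces with `replacement` and start a new piece at every marker."""
    marked = replacement + text.replace(" ", replacement)
    pieces = []
    for char in marked:
        if char == replacement or not pieces:
            pieces.append(char)  # a marker opens a new piece
        else:
            pieces[-1] += char
    return pieces


def metaspace_join(pieces, replacement="▁"):
    """Undo the marking: turn markers back into spaces and drop the leading one."""
    return "".join(pieces).replace(replacement, " ").lstrip(" ")


pieces = metaspace_split("Let's test the pre-tokenizer!")
print(pieces)  # ["▁Let's", '▁test', '▁the', '▁pre-tokenizer!']
print(metaspace_join(pieces))  # Let's test the pre-tokenizer!
```

Because the space information lives inside the tokens, the original spacing can be restored without guessing.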

In [0]:
# XLNet has quite a few special tokens, and we need to be sure to pass an unknown token to the trainer
# additional optional args include: the `shrinking_factor` for each step where we remove tokens (defaults to 0.75) and the `max_piece_length` to specify the maximum length of a given token (defaults to 16)
special_tokens = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>", "<s>", "</s>"]
trainer = trainers.UnigramTrainer(
    vocab_size=25000, special_tokens=special_tokens, unk_token="<unk>"
)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)





In [0]:
# peek the tokenization of a sample after training
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

['▁Let', "'", 's', '▁test', '▁this', '▁to', 'ken', 'izer', '.']
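
A Unigram model scores every possible segmentation of a word by summing the log-probabilities of its pieces and keeps the best one, which can be found with a Viterbi-style dynamic program. The sketch below uses hypothetical toy scores, not the values learned by the trainer, and the function name is an assumption for illustration.

```python
import math


def unigram_segment(word, log_probs, unk_token="<unk>"):
    """Return the segmentation of `word` with the highest total log-probability."""
    # best[i] holds (best score for word[:i], start index of the last piece).
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        return [unk_token]  # no combination of known pieces covers the word
    # Walk backwards through the stored start indices to recover the pieces.
    pieces = []
    end = len(word)
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


# Hypothetical toy log-probabilities.
toy_log_probs = {"▁to": -3.0, "ken": -4.0, "izer": -4.5, "▁": -2.0, "to": -3.5, "k": -6.0, "en": -4.0}
print(unigram_segment("▁tokenizer", toy_log_probs))  # ['▁to', 'ken', 'izer']
```

Here `▁to` + `ken` + `izer` scores -11.5, beating alternatives such as `▁` + `to` + `ken` + `izer` at -14.0.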


In [0]:
# we grab the IDs of the relevant special tokens before creating a TemplateProcessing to pass to the post-processor
cls_token_id = tokenizer.token_to_id("<cls>")
sep_token_id = tokenizer.token_to_id("<sep>")
print(cls_token_id, sep_token_id)

0 1


In [0]:
# interestingly, XLNet processes with the <cls> token at the end of the sequence and pads on the left
# the <cls> token in XLNet has a type_id of 2 to distinguish it from other special tokens (see the next cell)
tokenizer.post_processor = processors.TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[("<sep>", sep_token_id), ("<cls>", cls_token_id)],
)

In [0]:
# peek encoding of a set of sentences
encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences!")
print(encoding.tokens)
print(encoding.type_ids)

['▁Let', "'", 's', '▁test', '▁this', '▁to', 'ken', 'izer', '.', '.', '.', '<sep>', '▁', 'on', '▁', 'a', '▁pair', '▁of', '▁sentence', 's', '!', '<sep>', '<cls>']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]


In [0]:
# lastly, we add a Metaspace decoder
tokenizer.decoder = decoders.Metaspace()

In [0]:
# peek how the decoder functions
print(tokenizer.decode(encoding.ids))

Let's test this tokenizer... on a pair of sentences!


In [0]:
# and in conclusion, we can wrap this custom built tokenizer in a PreTrainedTokenizerFast class for integration with the rest of the HF ecosystem
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="left",
)
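
The `padding_side="left"` setting means shorter sequences in a batch are filled with padding at the front rather than at the end. A minimal sketch of left padding on lists of IDs (the helper `pad_left` and the toy IDs are hypothetical; `pad_id=3` assumes `<pad>` kept its position in the `special_tokens` list, just as `<cls>` and `<sep>` received IDs 0 and 1):

```python
def pad_left(batch, pad_id):
    """Left-pad every sequence to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    padded = [[pad_id] * (longest - len(ids)) + list(ids) for ids in batch]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[0] * (longest - len(ids)) + [1] * len(ids) for ids in batch]
    return padded, attention_mask


# Toy sequences ending in <sep> (1) and <cls> (0), as in the XLNet template.
toy_batch = [[11, 12, 13, 1, 0], [21, 1, 0]]
padded, attention_mask = pad_left(toy_batch, pad_id=3)
print(padded)  # [[11, 12, 13, 1, 0], [3, 3, 21, 1, 0]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [0, 0, 1, 1, 1]]
```

With padding on the left, the `<cls>` token stays in the last position of every sequence in the batch.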