# Building a tokenizer, block by block

As we've seen in the previous sections, tokenization comprises several steps:

- **Normalization** (any cleanup of the text that is deemed necessary, such as removing spaces or accents, Unicode normalization, etc.)
- **Pre-tokenization** (splitting the input into words)
- **Running the input through the model** (using the pre-tokenized words to produce a sequence of tokens)
- **Post-processing** (adding the special tokens of the tokenizer, generating the attention mask and token type IDs)

As a reminder, here's another look at the overall process:

![tokenization_pipeline.svg](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter6/tokenization_pipeline.svg)



The 🤗 Tokenizers library has been built to provide several options for each of those steps, which you can mix and match. In this section we'll see how we can build a tokenizer from scratch, as opposed to training a new tokenizer from an old one. You'll then be able to build any kind of tokenizer you can think of!

More precisely, the library is built around a central `Tokenizer` class with the building blocks grouped into submodules:

- `normalizers` contains all the possible types of `Normalizer` you can use
- `pre_tokenizers` contains all the possible types of `PreTokenizer` you can use
- `models` contains the various types of `Model` you can use, like `BPE`, `WordPiece`, and `Unigram`
- `trainers` contains all the different types of `Trainer` you can use to train your model on a corpus
- `processors` contains the various types of `PostProcessor` you can use
- `decoders` contains the various types of `Decoder` you can use to decode the outputs of tokenization

## Installing Required Libraries

First, let's make sure we have all the necessary libraries installed.

In [1]:
%%capture
!pip install datasets evaluate transformers[sentencepiece]

In [179]:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
    Regex,
)

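We also log in to the Hugging Face Hub. This is needed later to upload a tokenizer with `push_to_hub()`; public datasets can be downloaded without it.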

In [2]:
from huggingface_hub import notebook_login

notebook_login()


## Acquiring a corpus

To train our new tokenizer, we will use a small corpus of text (so the examples run fast). The steps for acquiring the corpus are similar to the ones we took at the beginning of this chapter, but this time we'll use the [Math Text](https://huggingface.co/datasets/ddrg/math_text) dataset:

In [3]:
from datasets import load_dataset

dataset = load_dataset("ddrg/math_text", split="train")

def get_training_corpus():
    # First 1000 examples
    for i in range(0, min(1000, len(dataset)), 100):
        yield dataset[i : i + 100]["text"]


The function `get_training_corpus()` is a generator that will yield batches of 100 texts, which we will use to train the tokenizer.
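To make the batching concrete, here is a small sketch of the same slicing pattern on a plain Python list. The `batched` helper and the toy list are stand-ins made up for illustration (the real corpus is the `dataset` object above). It shows that the last batch can be shorter than the others and that a generator can only be consumed once.

```python
def batched(texts, batch_size=100, limit=1000):
    """Yield consecutive slices of `texts`, stopping after `limit` items."""
    for i in range(0, min(limit, len(texts)), batch_size):
        yield texts[i : i + batch_size]


# Hypothetical stand-in for the dataset: 250 short strings
toy_texts = [f"text {i}" for i in range(250)]

batch_sizes = [len(batch) for batch in batched(toy_texts)]
print(batch_sizes)

# A generator is exhausted after one pass, so a second pass yields nothing
corpus = batched(toy_texts)
first_pass = sum(1 for _ in corpus)
second_pass = sum(1 for _ in corpus)
print(first_pass, second_pass)
```

This is why we call `get_training_corpus()` again each time we train a tokenizer, rather than reusing one generator object.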

🤗 Tokenizers can also be trained on text files directly. Here's how we can generate a text file containing the first 1,000 texts from Math Text that we can use locally:

In [5]:
with open("math_text.txt", "w", encoding="utf-8") as f:
    for i in range(1000):
        f.write(dataset[i]["text"] + "\n")

In [17]:
print("Column names:", dataset.column_names)
first_example = dataset[0]

print("Text:")
for key, value in first_example.items():
    if isinstance(value, str):
        for i in range(0, len(value), 140):
            print(f"  {value[i:i+140]}")

Column names: ['id', 'text']
Text:
  Before providing my solution, I'd must admit that Oliver Oloa provides the  way to calculate this integral. I merely provide a different app
  roach, using Fourier transforms. First a comment. I tried to use a symmetry argument saying that  $\int\limits_0^\infty g(x + \frac1x) \oper
  atorname{atan}(x\frac1x)\,dx = \frac{\pi}{4} \int\limits_0^\infty g(x + \frac1x) \frac1x\,dx$ but I was not able to put this integral into t
  hat form. Now to the solution: Since the integrand is even, our integral equals  $ \frac{1}{2}\int_{-\infty}^{+\infty}\frac{\arctan x}{x(1+x
  ^2)}\,dx.  $ We need to know the Fourier transforms  $ \mathcal Fl[\frac{1}{1+x^2}r](\xi)=\sqrt{\frac{\pi}{2}}e^{-|\xi|}\quad\text{and}\quad
   \mathcal Fl[\frac{\arctan x}{x}r](\xi)=\sqrt{\frac{\pi}{2}}\int_{|\xi|}^{+\infty}\frac{e^{-t}}{t}\,dt.  $ By Parseval's formula,  $ \int_0^
  {+\infty}\frac{\arctan x}{x(1+x^2)}\,dx= \frac{1}{2}\sqrt{\frac{\pi}{2}}\sqrt{\frac{\pi}{2}} \int_{-\inft

Next we'll show you how to build your own BERT, GPT-2, and XLNet tokenizers, block by block. That will give us an example of each of the three main tokenization algorithms: WordPiece, BPE, and Unigram. Let's start with BERT!

## Building a WordPiece tokenizer from scratch

To build a tokenizer with the 🤗 Tokenizers library, we start by instantiating a `Tokenizer` object with a `model`, then set its `normalizer`, `pre_tokenizer`, `post_processor`, and `decoder` attributes to the values we want.

For this example, we'll create a `Tokenizer` with a WordPiece model:

In [88]:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

We have to specify the `unk_token` so the model knows what to return when it encounters characters it hasn't seen before. Other arguments we can set here include the `vocab` of our model (we're going to train the model, so we don't need to set this) and `max_input_chars_per_word`, which specifies a maximum length for each word (words longer than the value passed are mapped to the `unk_token` rather than being split into subwords).

### Setting up the Normalizer

The first step of tokenization is normalization, so let's begin with that. Since BERT is widely used, there is a `BertNormalizer` with the classic options we can set for BERT:

In [89]:
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)

Generally speaking, however, when building a new tokenizer you won't have access to such a handy normalizer already implemented in the 🤗 Tokenizers library -- so let's see how to create the BERT normalizer by hand. The library provides a `Lowercase` normalizer and a `StripAccents` normalizer, and you can compose several normalizers using a `Sequence`:

In [90]:
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)

We're also using an `NFD` Unicode normalizer, as otherwise the `StripAccents` normalizer won't properly recognize the accented characters and thus won't strip them out.
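The role of `NFD` can be checked with Python's standard `unicodedata` module. The sketch below is only a rough stand-in for the library's normalizers (the helper names are ours, and we assume that stripping accents amounts to dropping combining marks): a precomposed accented character contains no separate mark to strip until NFD has decomposed it.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks (Unicode categories starting with "M")."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("M"))


def nfd_lower_strip(text):
    """Apply NFD, then lowercasing, then accent stripping, in that order."""
    return strip_accents(unicodedata.normalize("NFD", text).lower())


# "Poincar\u00e9" uses the precomposed character U+00E9 (e with acute accent)
word = "Poincar\u00e9"

print(strip_accents(word))    # unchanged: there is no combining mark to remove yet
print(nfd_lower_strip(word))  # NFD exposes the accent as U+0301, which is then dropped
```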

As we've seen before, we can use the `normalize_str()` method of the `normalizer` to check out the effects it has on a given text:

In [91]:
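# Not a raw string: Python reads "\t" in "\text" as a tab, which is why "ext" shows up
# in the outputs of this section. Write r"..." to keep the LaTeX intact.
math_ex = "All non-trivial zeroes of the Riemann zeta function $\\zeta(s)$ lie on the critical line $\text{Re}(s)=1/2."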
print(tokenizer.normalizer.normalize_str(math_ex))

all non-trivial zeroes of the riemann zeta function $\zeta(s)$ lie on the critical line $	ext{re}(s)=1/2.
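The `ext` in the output above comes from Python's string escapes, not from the normalizer: in a regular string literal, `\t` is a tab. A minimal check on a shortened example (the variable names are ours) shows the difference between a regular and a raw string:

```python
plain = "$\text{Re}(s)$"  # "\t" is parsed as a single tab character
raw = r"$\text{Re}(s)$"   # raw string: the backslash and the "t" are both kept

print(repr(plain))
print(repr(raw))
print(len(plain), len(raw))
```

When feeding LaTeX to a tokenizer, a raw string is usually what we want, since it keeps every backslash as typed.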




### Setting up the Pre-tokenizer

Next is the pre-tokenization step. Again, there is a prebuilt `BertPreTokenizer` that we can use:

In [92]:
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

Or we can build it from scratch:

In [93]:
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

Note that the `Whitespace` pre-tokenizer splits on whitespace and all characters that are not letters, digits, or the underscore character, so it technically splits on whitespace and punctuation:

In [94]:
tokenizer.pre_tokenizer.pre_tokenize_str(math_ex)

[('All', (0, 3)),
 ('non', (4, 7)),
 ('-', (7, 8)),
 ('trivial', (8, 15)),
 ('zeroes', (16, 22)),
 ('of', (23, 25)),
 ('the', (26, 29)),
 ('Riemann', (30, 37)),
 ('zeta', (38, 42)),
 ('function', (43, 51)),
 ('$\\', (52, 54)),
 ('zeta', (54, 58)),
 ('(', (58, 59)),
 ('s', (59, 60)),
 (')$', (60, 62)),
 ('lie', (63, 66)),
 ('on', (67, 69)),
 ('the', (70, 73)),
 ('critical', (74, 82)),
 ('line', (83, 87)),
 ('$', (88, 89)),
 ('ext', (90, 93)),
 ('{', (93, 94)),
 ('Re', (94, 96)),
 ('}(', (96, 98)),
 ('s', (98, 99)),
 (')=', (99, 101)),
 ('1', (101, 102)),
 ('/', (102, 103)),
 ('2', (103, 104)),
 ('.', (104, 105))]
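The splitting rule can be reproduced with Python's `re` module. This is a sketch, not the library's implementation: it assumes the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of characters that are neither word characters nor whitespace), and the helper name is ours.

```python
import re

# Runs of word characters, or runs of "everything else" that is not whitespace
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def whitespace_pre_tokenize(text):
    """Return (piece, (start, end)) pairs, like pre_tokenize_str does."""
    return [(m.group(), m.span()) for m in WHITESPACE_PATTERN.finditer(text)]


pieces = whitespace_pre_tokenize("non-trivial $\\zeta(s)$")
for piece in pieces:
    print(piece)
```

Note how adjacent punctuation characters such as `)$` stay together in one piece, just as in the output above.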

If you only want to split on whitespace, you should use the `WhitespaceSplit` pre-tokenizer instead:

In [95]:
pre_tokenizer = pre_tokenizers.WhitespaceSplit()
pre_tokenizer.pre_tokenize_str(math_ex)

[('All', (0, 3)),
 ('non-trivial', (4, 15)),
 ('zeroes', (16, 22)),
 ('of', (23, 25)),
 ('the', (26, 29)),
 ('Riemann', (30, 37)),
 ('zeta', (38, 42)),
 ('function', (43, 51)),
 ('$\\zeta(s)$', (52, 62)),
 ('lie', (63, 66)),
 ('on', (67, 69)),
 ('the', (70, 73)),
 ('critical', (74, 82)),
 ('line', (83, 87)),
 ('$', (88, 89)),
 ('ext{Re}(s)=1/2.', (90, 105))]

Like with normalizers, you can use a `Sequence` to compose several pre-tokenizers:

In [96]:
pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()]
)
pre_tokenizer.pre_tokenize_str(math_ex)

[('All', (0, 3)),
 ('non', (4, 7)),
 ('-', (7, 8)),
 ('trivial', (8, 15)),
 ('zeroes', (16, 22)),
 ('of', (23, 25)),
 ('the', (26, 29)),
 ('Riemann', (30, 37)),
 ('zeta', (38, 42)),
 ('function', (43, 51)),
 ('$', (52, 53)),
 ('\\', (53, 54)),
 ('zeta', (54, 58)),
 ('(', (58, 59)),
 ('s', (59, 60)),
 (')', (60, 61)),
 ('$', (61, 62)),
 ('lie', (63, 66)),
 ('on', (67, 69)),
 ('the', (70, 73)),
 ('critical', (74, 82)),
 ('line', (83, 87)),
 ('$', (88, 89)),
 ('ext', (90, 93)),
 ('{', (93, 94)),
 ('Re', (94, 96)),
 ('}', (96, 97)),
 ('(', (97, 98)),
 ('s', (98, 99)),
 (')', (99, 100)),
 ('=', (100, 101)),
 ('1', (101, 102)),
 ('/', (102, 103)),
 ('2', (103, 104)),
 ('.', (104, 105))]
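Composing pre-tokenizers means that each step refines the pieces produced by the previous one. Here is a plain-Python sketch of that idea, under the simplifying assumption that "punctuation" means the ASCII characters in `string.punctuation`; the two helper functions are ours and only mimic the behavior shown above.

```python
import re
import string

PUNCT = set(string.punctuation)


def whitespace_split(text):
    """Step 1: split on whitespace only, remembering where each piece starts."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]


def isolate_punctuation(pieces):
    """Step 2: cut every punctuation character out as its own piece."""
    out = []
    for piece, offset in pieces:
        run_start = None  # start of the current run of non-punctuation characters
        for i, ch in enumerate(piece):
            if ch in PUNCT:
                if run_start is not None:
                    out.append((piece[run_start:i], (offset + run_start, offset + i)))
                    run_start = None
                out.append((ch, (offset + i, offset + i + 1)))
            elif run_start is None:
                run_start = i
        if run_start is not None:
            out.append((piece[run_start:], (offset + run_start, offset + len(piece))))
    return out


result = isolate_punctuation(whitespace_split("line }(s)=1/2."))
print([piece for piece, _ in result])
```

Unlike the `Whitespace` pre-tokenizer, this combination separates `}(` and `)=` into single characters, which matches the difference between the two outputs above.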

### Training the Model

The next step in the tokenization pipeline is running the inputs through the model. We already specified our model in the initialization, but we still need to train it, which will require a `WordPieceTrainer`. The main thing to remember when instantiating a trainer in 🤗 Tokenizers is that you need to pass it all the special tokens you intend to use -- otherwise it won't add them to the vocabulary, since they are not in the training corpus:

In [97]:
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

As well as specifying the `vocab_size` and `special_tokens`, we can set the `min_frequency` (the number of times a token must appear to be included in the vocabulary) or change the `continuing_subword_prefix` (if we want to use something different from `##`).
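To see what the `##` prefix is for, here is a toy sketch of how a trained WordPiece vocabulary is typically applied to a single word: greedy longest-match-first, with every piece after the first one carrying the continuation prefix. The vocabulary and function name are made up for illustration, and the library's implementation may differ in its details.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # not at the start of the word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary
toy_vocab = {"zero", "##es", "##s", "z", "##e"}

print(wordpiece_tokenize("zeroes", toy_vocab))
print(wordpiece_tokenize("zeta", toy_vocab))
```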

To train our model using the iterator we defined earlier, we just have to execute this command:

In [98]:
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

We can also use text files to train our tokenizer, which would look like this (we reinitialize the model with an empty `WordPiece` beforehand):

In [99]:
# Alternative training method using text files
# tokenizer.model = models.WordPiece(unk_token="[UNK]")
# tokenizer.train(["math_text.txt"], trainer=trainer)

In both cases, we can then test the tokenizer on a text by calling the `encode()` method:

In [100]:
encoding = tokenizer.encode(math_ex)
print(encoding.tokens)

['all', 'non', '-', 'trivial', 'zeroes', 'of', 'the', 'riemann', 'zeta', 'function', '$\\', 'zeta', '(', 's', ')$', 'lie', 'on', 'the', 'critical', 'line', '$', 'ext', '{', 're', '}(', 's', ')=', '1', '/', '2', '.']


The `encoding` obtained is an `Encoding`, which contains all the necessary outputs of the tokenizer in its various attributes: `ids`, `type_ids`, `tokens`, `offsets`, `attention_mask`, `special_tokens_mask`, and `overflowing`.

### Post-processing

The last step in the tokenization pipeline is post-processing. We need to add the `[CLS]` token at the beginning and the `[SEP]` token at the end (or after each sentence, if we have a pair of sentences). We will use `TemplateProcessing` for this, but first we need to know the IDs of the `[CLS]` and `[SEP]` tokens in the vocabulary:

In [101]:
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)

2 3


To write the template for `TemplateProcessing`, we have to specify how to treat a single sentence and a pair of sentences. For both, we write the special tokens we want to use; the first (or single) sentence is represented by `$A`, while the second sentence (if encoding a pair) is represented by `$B`. For each of these (special tokens and sentences), we also specify the corresponding token type ID after a colon.

The classic BERT template is thus defined as follows:

In [102]:
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)

Note that we need to pass along the IDs of the special tokens, so the tokenizer can properly convert them to their IDs.

Once this is added, going back to our previous example will give:

In [103]:
encoding = tokenizer.encode(math_ex)
print(encoding.tokens)

['[CLS]', 'all', 'non', '-', 'trivial', 'zeroes', 'of', 'the', 'riemann', 'zeta', 'function', '$\\', 'zeta', '(', 's', ')$', 'lie', 'on', 'the', 'critical', 'line', '$', 'ext', '{', 're', '}(', 's', ')=', '1', '/', '2', '.', '[SEP]']
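To make the template syntax less abstract, here is a plain-Python sketch of how such a template can be expanded into tokens and token type IDs. The `apply_template` function is ours and only illustrates the idea; it is not how the library implements `TemplateProcessing`.

```python
def apply_template(template, sequences, special_tokens):
    """Expand a template like "[CLS]:0 $A:0 [SEP]:0" into tokens and type IDs."""
    tokens, type_ids = [], []
    for item in template.split():
        name, _, type_id = item.partition(":")
        type_id = int(type_id) if type_id else 0
        if name.startswith("$"):
            pieces = sequences[name[1:]]  # "$A" -> sequences["A"]
        elif name in special_tokens:
            pieces = [name]
        else:
            raise ValueError(f"Unknown special token in template: {name}")
        tokens.extend(pieces)
        type_ids.extend([type_id] * len(pieces))
    return tokens, type_ids


special = {"[CLS]", "[SEP]"}
single = apply_template("[CLS]:0 $A:0 [SEP]:0", {"A": ["the", "sine"]}, special)
pair = apply_template(
    "[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    {"A": ["the", "sine"], "B": ["zeros"]},
    special,
)
print(single)
print(pair)
```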


And on a pair of sentences, we get the proper result:

In [142]:
math_ex_2 = r"The sine function $sin(x)$ has zeros at $k\pi$."
encoding = tokenizer.encode(math_ex, math_ex_2)
print(encoding.tokens)
print(encoding.type_ids)

With the template above, the token list starts with `[CLS]` and each sentence is followed by `[SEP]`. The type IDs are 0 for everything up to and including the first `[SEP]`, and 1 for the second sentence and the final `[SEP]`.


### Adding a Decoder

We've almost finished building this tokenizer from scratch -- the last step is to include a decoder:

In [105]:
tokenizer.decoder = decoders.WordPiece(prefix="##")

Let's test it on our previous `encoding`:

In [106]:
tokenizer.decode(encoding.ids)

'all non - trivial zeroes of the riemann zeta function $\\ zeta ( s )$ lie on the critical line $ ext { re }( s )= 1 / 2. …'
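The core of WordPiece decoding is gluing continuation pieces back onto the word they belong to. Here is a minimal sketch of that step, with a function name of our own; the library's decoder can additionally clean up spacing around some punctuation, which this sketch leaves out.

```python
def wordpiece_decode(tokens, prefix="##"):
    """Join tokens with spaces, attaching "##" pieces to the previous word."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            words[-1] += token[len(prefix):]
        else:
            words.append(token)
    return " ".join(words)


decoded = wordpiece_decode(["zero", "##es", "of", "the", "zeta", "function"])
print(decoded)
```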

### Saving and Loading the Tokenizer

Great! We can save our tokenizer in a single JSON file like this:

In [107]:
tokenizer.save("tokenizer.json")

We can then reload that file in a `Tokenizer` object with the `from_file()` method:

In [108]:
new_tokenizer = Tokenizer.from_file("tokenizer.json")

### Using with 🤗 Transformers

To use this tokenizer in 🤗 Transformers, we have to wrap it in a `PreTrainedTokenizerFast`. We can either use the generic class or, if our tokenizer corresponds to an existing model, use that class (here, `BertTokenizerFast`).

To wrap the tokenizer in a `PreTrainedTokenizerFast`, we can either pass the tokenizer we built as a `tokenizer_object` or pass the tokenizer file we saved as `tokenizer_file`. The key thing to remember is that we have to manually set all the special tokens:

In [109]:
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    # tokenizer_file="tokenizer.json", # You can load from the tokenizer file, alternatively
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

If you are using a specific tokenizer class (like `BertTokenizerFast`), you will only need to specify the special tokens that are different from the default ones (here, none):

In [110]:
from transformers import BertTokenizerFast

wrapped_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)

You can then use this tokenizer like any other 🤗 Transformers tokenizer. You can save it with the `save_pretrained()` method, or upload it to the Hub with the `push_to_hub()` method.

Now that we've seen how to build a WordPiece tokenizer, let's do the same for a BPE tokenizer. We'll go a bit faster since you know all the steps, and only highlight the differences.

## Building a BPE tokenizer from scratch

Let's now build a GPT-2 tokenizer. Like for the BERT tokenizer, we start by initializing a `Tokenizer` with a BPE model:

In [111]:
tokenizer = Tokenizer(models.BPE())

Also like for BERT, we could initialize this model with a vocabulary if we had one (we would need to pass the `vocab` and `merges` in this case), but since we will train from scratch, we don't need to do that. We also don't need to specify an `unk_token` because GPT-2 uses byte-level BPE, which doesn't require it.
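Byte-level BPE works on bytes, and every byte gets a printable stand-in, so no input can fall outside the base alphabet. The sketch below rebuilds the byte-to-character table in the style published with GPT-2. The function names are ours, and we assume the library's `ByteLevel` components follow the same mapping, in which `Ġ` is the stand-in for a space.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 style)."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Non-printable bytes are moved to code points 256 and above
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()


def to_byte_level(text):
    """Encode text as UTF-8 and replace each byte with its printable stand-in."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(to_byte_level(" zeta"))
print(len(BYTE_TO_CHAR), len(set(BYTE_TO_CHAR.values())))
```

Since all 256 bytes have a distinct symbol, any string can be written with this alphabet, which is why no unknown token is needed.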

GPT-2 does not use a normalizer, so we skip that step and go directly to the pre-tokenization:

In [112]:
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

The option we added to `ByteLevel` here is to not add a space at the beginning of a sentence (which is the default otherwise). We can have a look at the pre-tokenization of an example text like before:

In [143]:
tokenizer.pre_tokenizer.pre_tokenize_str(math_ex)

In the result, each space stays attached to the word that follows it and is displayed as `Ġ`, while letters, digits, and other symbols end up in separate pieces (for example, `Find 2x` becomes `'Find', 'Ġ2', 'x'`).

Next is the model, which needs training. For GPT-2, the only special token is the end-of-text token:

In [114]:
trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

Like with the `WordPieceTrainer`, as well as the `vocab_size` and `special_tokens`, we can specify the `min_frequency` if we want to, or if we have an end-of-word suffix (like `</w>`), we can set it with `end_of_word_suffix`.
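As a reminder of what the trainer learns, here is a toy sketch of two BPE training steps: count adjacent symbol pairs weighted by word frequency, then merge the most frequent pair everywhere. The word counts and helper names are invented for illustration, and real trainers add refinements (such as `min_frequency`) that this sketch ignores.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Hypothetical word frequencies, each word split into characters
words = {tuple("zero"): 5, tuple("zeta"): 3, tuple("hero"): 2}

merges = []
for _ in range(2):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```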

This tokenizer can also be trained on text files:

In [115]:
# Alternative training method
# tokenizer.model = models.BPE()
# tokenizer.train(["math_text.txt"], trainer=trainer)

Let's have a look at the tokenization of a sample text:

In [144]:
encoding = tokenizer.encode(math_ex)
print(encoding.tokens)

The exact tokens depend on the merges learned from the corpus. Tokens that start right after a space begin with `Ġ`.


We apply the byte-level post-processing for the GPT-2 tokenizer as follows:

In [117]:
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)

The `trim_offsets=False` option tells the post-processor to leave the offsets of tokens that begin with `Ġ` as they are: this way the start of the offsets points to the space before the word, not the first character of the word (since the space is technically part of the token). Let's check this on the text we just encoded by slicing the sentence with the offsets of the token at index 4:

In [145]:
sentence = math_ex
encoding = tokenizer.encode(sentence)
start, end = encoding.offsets[4]
sentence[start:end]

' zero'
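For comparison, trimming the offsets would shrink the span so that it no longer covers the space. The helper below is a hypothetical illustration of that adjustment on a hand-picked span, not the library's code.

```python
def trim_whitespace(text, span):
    """Shrink a (start, end) span so it excludes leading and trailing whitespace."""
    start, end = span
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


sentence = "non-trivial zero"
untrimmed = (11, 16)  # hypothetical offsets of a token that carries its leading space
trimmed = trim_whitespace(sentence, untrimmed)

print(repr(sentence[untrimmed[0]:untrimmed[1]]))
print(repr(sentence[trimmed[0]:trimmed[1]]))
```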

Finally, we add a byte-level decoder:

In [119]:
tokenizer.decoder = decoders.ByteLevel()

and we can double-check it works properly:

In [120]:
tokenizer.decode(encoding.ids)

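The decoder maps the byte-level symbols back to text, so each `Ġ` becomes a space again and the result reads like the sentence we encoded.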

Great! Now that we're done, we can save the tokenizer like before, and wrap it in a `PreTrainedTokenizerFast` or `GPT2TokenizerFast` if we want to use it in 🤗 Transformers:

In [121]:
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
)

or:

In [122]:
from transformers import GPT2TokenizerFast

wrapped_tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)

As the last example, we'll show you how to build a Unigram tokenizer from scratch.

## Building a Unigram tokenizer from scratch

Let's now build an XLNet tokenizer. Like for the previous tokenizers, we start by initializing a `Tokenizer` with a Unigram model:

In [123]:
tokenizer = Tokenizer(models.Unigram())

Again, we could initialize this model with a vocabulary if we had one.

For the normalization, XLNet uses a few replacements (which come from SentencePiece):

In [124]:
from tokenizers import Regex

tokenizer.normalizer = normalizers.Sequence(
    [
        normalizers.Replace("``", '"'),
        normalizers.Replace("''", '"'),
        normalizers.NFKD(),
        normalizers.StripAccents(),
        normalizers.Replace(Regex(" {2,}"), " "),
    ]
)

This replaces ``` `` ``` and `''` with `"` and any sequence of two or more spaces with a single space, as well as removing the accents in the texts to tokenize.

The pre-tokenizer to use for any SentencePiece tokenizer is `Metaspace`:

In [125]:
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

We can have a look at the pre-tokenization of an example text like before:

In [146]:
tokenizer.pre_tokenizer.pre_tokenize_str(math_ex)

[('▁All', (0, 3)),
 ('▁non-trivial', (3, 15)),
 ('▁zeroes', (15, 22)),
 ('▁of', (22, 25)),
 ('▁the', (25, 29)),
 ('▁Riemann', (29, 37)),
 ('▁zeta', (37, 42)),
 ('▁function', (42, 51)),
 ('▁$\\zeta(s)$', (51, 62)),
 ('▁lie', (62, 66)),
 ('▁on', (66, 69)),
 ('▁the', (69, 73)),
 ('▁critical', (73, 82)),
 ('▁line', (82, 87)),
 ('▁$\text{Re}(s)=1/2.', (87, 105))]
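The idea behind `Metaspace` fits in a few lines of plain Python: mark every space (and the start of the text) with `▁`, then split in front of each marker. The two helpers below are our own sketch; they ignore offsets and the options of the real pre-tokenizer and decoder.

```python
META = "\u2581"  # the character displayed as ▁ in the outputs


def metaspace_pre_tokenize(text):
    """Prefix the text with the marker, turn spaces into markers, split before each one."""
    marked = META + text.replace(" ", META)
    pieces = []
    for ch in marked:
        if ch == META or not pieces:
            pieces.append(ch)
        else:
            pieces[-1] += ch
    return pieces


def metaspace_decode(pieces):
    """Join the pieces, turn markers back into spaces, and drop the leading one."""
    text = "".join(pieces).replace(META, " ")
    return text[1:] if text.startswith(" ") else text


pieces = metaspace_pre_tokenize("All non-trivial zeroes")
print(pieces)
print(metaspace_decode(pieces))
```

Because the spaces are kept as visible markers inside the pieces, the original text can be rebuilt without guessing where the spaces were.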

Next is the model, which needs training. XLNet has quite a few special tokens:

In [147]:
special_tokens = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>", "<s>", "</s>"]
trainer = trainers.UnigramTrainer(
    vocab_size=25000, special_tokens=special_tokens, unk_token="<unk>"
)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

A very important argument not to forget for the `UnigramTrainer` is the `unk_token`. We can also pass along other arguments specific to the Unigram algorithm, such as the `shrinking_factor` for each step where we remove tokens (defaults to 0.75) or the `max_piece_length` to specify the maximum length of a given token (defaults to 16).
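Once trained, a Unigram model assigns a probability to each token and picks the segmentation whose tokens have the highest combined probability. The sketch below runs that search (a Viterbi pass) on a toy vocabulary. The probabilities are invented, the function name is ours, and returning a single `<unk>` for the whole word is a simplification.

```python
import math


def unigram_segment(word, log_probs, unk_token="<unk>"):
    """Find the most probable split of `word` into vocabulary tokens."""
    n = len(word)
    # best[i] = (best score for word[:i], start index of the last token used)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return [unk_token]  # simplification: no valid split at all
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


# Hypothetical token probabilities
toy_probs = {
    "\u2581zero": 0.05, "es": 0.10, "\u2581": 0.20,
    "z": 0.02, "e": 0.08, "r": 0.05, "o": 0.05, "s": 0.08,
}
toy_log_probs = {token: math.log(p) for token, p in toy_probs.items()}

print(unigram_segment("\u2581zeroes", toy_log_probs))
```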

This tokenizer can also be trained on text files:

In [128]:
# Alternative training method
# tokenizer.model = models.Unigram()
# tokenizer.train(["math_text.txt"], trainer=trainer)

Let's have a look at the tokenization of a sample text:

In [148]:
encoding = tokenizer.encode(math_ex)
print(encoding.tokens)

['▁A', 'll', '▁non-', 'trivial', '▁zero', 'es', '▁of', '▁the', '▁Riemann', '▁', 'zeta', '▁function', '▁$', '\\zeta(s)', '$', '▁lie', '▁on', '▁the', '▁critic', 'al', '▁line', '▁$', '\t', 'e', 'x', 't', '{R', 'e', '}(s)', '=', '1/2', '.']


A peculiarity of XLNet is that it puts the `<cls>` token at the end of the sentence, with a type ID of 2 (to distinguish it from the other tokens). As a result, it pads on the left. We can deal with all the special tokens and token type IDs with a template, like for BERT, but first we have to get the IDs of the `<cls>` and `<sep>` tokens:

In [130]:
cls_token_id = tokenizer.token_to_id("<cls>")
sep_token_id = tokenizer.token_to_id("<sep>")
print(cls_token_id, sep_token_id)

0 1


The template looks like this:

In [131]:
tokenizer.post_processor = processors.TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[("<sep>", sep_token_id), ("<cls>", cls_token_id)],
)

And we can test it works by encoding a pair of sentences:

In [149]:
encoding = tokenizer.encode(math_ex, math_ex_2)
print(encoding.tokens)
print(encoding.type_ids)

['▁A', 'll', '▁non-', 'trivial', '▁zero', 'es', '▁of', '▁the', '▁Riemann', '▁', 'zeta', '▁function', '▁$', '\\zeta(s)', '$', '▁lie', '▁on', '▁the', '▁critic', 'al', '▁line', '▁$', '\t', 'e', 'x', 't', '{R', 'e', '}(s)', '=', '1/2', '.', '<sep>', '▁The', '▁sine', '▁function', '▁$', 's', 'in', '(x)$', '▁has', '▁zeros', '▁at', '▁$k', '\\pi', '$.', '<sep>', '<cls>']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]


Finally, we add a `Metaspace` decoder:

In [133]:
tokenizer.decoder = decoders.Metaspace()

and we're done with this tokenizer! We can save the tokenizer like before, and wrap it in a `PreTrainedTokenizerFast` or `XLNetTokenizerFast` if we want to use it in 🤗 Transformers. One thing to note when using `PreTrainedTokenizerFast` is that on top of the special tokens, we need to tell the 🤗 Transformers library to pad on the left:

In [134]:
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="left",
)

Or alternatively:

In [135]:
from transformers import XLNetTokenizerFast

wrapped_tokenizer = XLNetTokenizerFast(tokenizer_object=tokenizer)

Now that you have seen how the various building blocks are used to build existing tokenizers, you should be able to write any tokenizer you want with the 🤗 Tokenizers library and be able to use it in 🤗 Transformers.

## Summary

In this notebook, we've learned how to build tokenizers from scratch using the 🤗 Tokenizers library. We covered three major tokenization algorithms:

1. **WordPiece** (used by BERT): Builds subwords greedily by iteratively merging the symbol pair that most improves the likelihood of the training data.

2. **BPE (Byte-Pair Encoding)** (used by GPT-2): Starts with individual symbols (bytes, in GPT-2's byte-level variant) and iteratively merges the most frequent pair of adjacent symbols.

3. **Unigram** (used by XLNet): Starts from a large vocabulary and shrinks it step by step, using a probabilistic model to pick the most likely segmentation of a word into subwords.

For each tokenizer, we covered the essential components:
- **Normalization**: Text preprocessing (lowercasing, accent removal, etc.)
- **Pre-tokenization**: Initial splitting of text (whitespace, punctuation, etc.)
- **Model training**: Learning the vocabulary and merge rules
- **Post-processing**: Adding special tokens and token type IDs
- **Decoding**: Converting token IDs back to text

The key takeaway is that tokenizers are highly modular, and you can mix and match different components to create custom tokenizers for your specific needs. The 🤗 Tokenizers library provides a flexible framework for building and training tokenizers that can then be seamlessly integrated with 🤗 Transformers models.

## Additional Exercises

Try these exercises to deepen your understanding:

1. **Custom Normalizer**: Create a tokenizer with a custom normalizer that removes all digits from the text.


In [156]:
def get_training_corpus(start=0, end=1000):
    """Get training corpus"""
    for i in range(start, min(end, len(dataset)), 100):
        yield dataset[i : i + 100]["text"]


def create_digit_removing_tokenizer():
    """Tokenizer that removes all digits"""
    tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

    # Custom normalizer sequence that removes digits
    tokenizer.normalizer = normalizers.Sequence([
        normalizers.NFD(),
        normalizers.Lowercase(),
        normalizers.StripAccents(),
        normalizers.Replace(Regex(r'\d+'), ''),  # Remove all digits
        normalizers.Replace(Regex(r'\s+'), ' '),  # Clean up extra spaces
    ])

    tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

    special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
    trainer = trainers.WordPieceTrainer(vocab_size=10000, special_tokens=special_tokens)

    return tokenizer, trainer

# Test on math examples
digit_removing_tokenizer, trainer = create_digit_removing_tokenizer()
print(f"Math Ex: {math_ex}")
print(f"Normalized Math Ex: {digit_removing_tokenizer.normalizer.normalize_str(math_ex)}")
print()

print(f"Math Ex 2: {math_ex_2}")
print(f"Normalized Math Ex 2: {digit_removing_tokenizer.normalizer.normalize_str(math_ex_2)}")
print()

# Train the tokenizer
digit_removing_tokenizer.train_from_iterator(get_training_corpus(0, 500), trainer=trainer)

# Test encoding
print(f"Math Ex: {math_ex}")
encoding = digit_removing_tokenizer.encode(math_ex)
print(f"Tokens: {encoding.tokens}")
print()

print(f"Math Ex 2: {math_ex_2}")
encoding = digit_removing_tokenizer.encode(math_ex_2)
print(f"Tokens: {encoding.tokens}")
print()

math_ex_3 = "Find 2x + 3y = 15 when x = 4"
print(f"Math Ex 3: {math_ex_3}")
encoding = digit_removing_tokenizer.encode(math_ex_3)
print(f"Tokens: {encoding.tokens}")


Math Ex: All non-trivial zeroes of the Riemann zeta function $\zeta(s)$ lie on the critical line $	ext{Re}(s)=1/2.
Normalized Math Ex: all non-trivial zeroes of the riemann zeta function $\zeta(s)$ lie on the critical line $ ext{re}(s)=/.

Math Ex 2: The sine function $sin(x)$ has zeros at $k\pi$.
Normalized Math Ex 2: the sine function $sin(x)$ has zeros at $k\pi$.

Math Ex: All non-trivial zeroes of the Riemann zeta function $\zeta(s)$ lie on the critical line $	ext{Re}(s)=1/2.
Tokens: ['all', 'non', '-', 'trivial', 'zero', '##es', 'of', 'the', 'riemann', 'zeta', 'function', '$', '\\', 'zeta', '(', 's', ')', '$', 'lie', 'on', 'the', 'critical', 'line', '$', 'ext', '{', 're', '}', '(', 's', ')', '=', '/', '.']

Math Ex 2: The sine function $sin(x)$ has zeros at $k\pi$.
Tokens: ['the', 'sine', 'function', '$', 'sin', '(', 'x', ')', '$', 'has', 'zeros', 'at', '$', 'k', '\\', 'pi', '$', '.']

Math Ex 3: Find 2x + 3y = 15 when x = 4
Tokens: ['find', 'x', '+', 'y', '=', 'when', 'x', '=']


2. **Mixed Pre-tokenizers**: Experiment with different combinations of pre-tokenizers and observe how they affect the tokenization results.

In [165]:
def compare_pretokenizers(ex_sen):
    """Compare different pre-tokenizers"""

    pretokenizers = {
        "Whitespace": pre_tokenizers.Whitespace(),
        "BertPreTokenizer": pre_tokenizers.BertPreTokenizer(),
        "WhitespaceSplit + Punctuation": pre_tokenizers.Sequence([
            pre_tokenizers.WhitespaceSplit(),
            pre_tokenizers.Punctuation()
        ]),
        "ByteLevel": pre_tokenizers.ByteLevel(add_prefix_space=False),
        "Metaspace": pre_tokenizers.Metaspace(),
        "LaTeX": pre_tokenizers.Sequence([
            pre_tokenizers.Split(Regex(r'(\$[^$]*\$)'), behavior='isolated'),  # Isolate LaTeX
            pre_tokenizers.WhitespaceSplit(),
            pre_tokenizers.Punctuation()
        ])
    }

    for name, pretokenizer in pretokenizers.items():
        try:
            result = pretokenizer.pre_tokenize_str(ex_sen)
            print(f"\n{name}:")
            print(f"  Tokens: {[token for token, span in result]}")
        except Exception as e:
            print(f"\n{name}: Error - {e}")

print(f"Original sentence: {math_ex_3}")
compare_pretokenizers(math_ex_3)
print()
print(f"Original sentence: {math_ex}")
compare_pretokenizers(math_ex)
print()



Original sentence: Find 2x + 3y = 15 when x = 4

Whitespace:
  Tokens: ['Find', '2x', '+', '3y', '=', '15', 'when', 'x', '=', '4']

BertPreTokenizer:
  Tokens: ['Find', '2x', '+', '3y', '=', '15', 'when', 'x', '=', '4']

WhitespaceSplit + Punctuation:
  Tokens: ['Find', '2x', '+', '3y', '=', '15', 'when', 'x', '=', '4']

ByteLevel:
  Tokens: ['Find', 'Ġ2', 'x', 'Ġ+', 'Ġ3', 'y', 'Ġ=', 'Ġ15', 'Ġwhen', 'Ġx', 'Ġ=', 'Ġ4']

Metaspace:
  Tokens: ['▁Find', '▁2x', '▁+', '▁3y', '▁=', '▁15', '▁when', '▁x', '▁=', '▁4']

LaTeX:
  Tokens: ['Find', '2x', '+', '3y', '=', '15', 'when', 'x', '=', '4']

Original sentence: All non-trivial zeroes of the Riemann zeta function $\zeta(s)$ lie on the critical line $	ext{Re}(s)=1/2.

Whitespace:
  Tokens: ['All', 'non', '-', 'trivial', 'zeroes', 'of', 'the', 'Riemann', 'zeta', 'function', '$\\', 'zeta', '(', 's', ')$', 'lie', 'on', 'the', 'critical', 'line', '$', 'ext', '{', 'Re', '}(', 's', ')=', '1', '/', '2', '.']

BertPreTokenizer:
  Tokens: ['All', 'non', 
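
One thing to keep in mind when combining pre-tokenizers in a `Sequence`: each step is applied to the pieces produced by the previous one. Isolating `$...$` first therefore does not keep the expression whole if a later step splits on punctuation. The sketch below imitates that chain in plain Python. The helper names are hypothetical, and `[^\w\s]` is only a rough approximation of what `Punctuation` treats as punctuation.

```python
import re

def split_isolated(pattern, pieces):
    """Split each piece so that every regex match becomes its own piece."""
    out = []
    for piece in pieces:
        pos = 0
        for m in re.finditer(pattern, piece):
            if m.start() > pos:
                out.append(piece[pos:m.start()])
            out.append(m.group())
            pos = m.end()
        if pos < len(piece):
            out.append(piece[pos:])
    return out

def split_whitespace(pieces):
    """Split every piece on whitespace and drop the whitespace."""
    return [word for piece in pieces for word in piece.split()]

def split_punctuation(pieces):
    """Isolate every character that is neither a word character nor a space."""
    return split_isolated(r"[^\w\s]", pieces)

text = r"zeros of $\zeta(s)$ lie"
stage1 = split_isolated(r"\$[^$]*\$", [text])  # isolate inline math
stage2 = split_whitespace(stage1)
stage3 = split_punctuation(stage2)             # re-splits the math piece

print(stage1)
print(stage3)
```

After the first stage the math expression is a single piece, but the punctuation stage breaks it up again into `$`, `\`, `zeta`, and so on. To keep the expression whole, the punctuation step would have to be left out or restricted to the non-math pieces.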

3. **Vocabulary Size Impact**: Train the same tokenizer with different vocabulary sizes (e.g., 1000, 5000, 10000, 25000) and compare the tokenization results.

In [169]:
def compare_vocab_sizes():
    """Compare tokenization with different vocabulary sizes"""
    vocab_sizes = [1000, 5000, 10000, 25000]
    test_texts = [
        math_ex,
        math_ex_2,
        math_ex_3
    ]

    results = {}

    for vocab_size in vocab_sizes:

        tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
        tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
        tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

        special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
        trainer = trainers.WordPieceTrainer(vocab_size=vocab_size, special_tokens=special_tokens)

        tokenizer.train_from_iterator(get_training_corpus(0, 200), trainer=trainer)

        # Test on sample texts
        vocab_results = []
        for text in test_texts:
            encoding = tokenizer.encode(text)
            vocab_results.append({
                'text': text,
                'num_tokens': len(encoding.tokens),
                'tokens': encoding.tokens,
                'unk_count': encoding.tokens.count('[UNK]')
            })

        results[vocab_size] = vocab_results

    for i, text in enumerate(test_texts):
        print(f"\nText {i+1}: {text}")
        print("Vocab Size | Tokens | UNKs | Sample Tokens")
        print("-" * 80)
        for vocab_size in vocab_sizes:
            result = results[vocab_size][i]
            sample_tokens = result['tokens'][:8]  # First 8 tokens
            print(f"{vocab_size:9d} | {result['num_tokens']:6d} | {result['unk_count']:4d} | {sample_tokens}")

compare_vocab_sizes()


Text 1: All non-trivial zeroes of the Riemann zeta function $\zeta(s)$ lie on the critical line $	ext{Re}(s)=1/2.
Vocab Size | Tokens | UNKs | Sample Tokens
--------------------------------------------------------------------------------
     1000 |     47 |    0 | ['all', 'non', '-', 'tri', '##v', '##ial', 'zero', '##es']
     5000 |     36 |    0 | ['all', 'non', '-', 'trivial', 'zero', '##es', 'of', 'the']
    10000 |     36 |    0 | ['all', 'non', '-', 'trivial', 'zero', '##es', 'of', 'the']
    25000 |     36 |    0 | ['all', 'non', '-', 'trivial', 'zero', '##es', 'of', 'the']

Text 2: The sine function $sin(x)$ has zeros at $k\pi$.
Vocab Size | Tokens | UNKs | Sample Tokens
--------------------------------------------------------------------------------
     1000 |     20 |    0 | ['the', 'sin', '##e', 'function', '$', 'sin', '(', 'x']
     5000 |     20 |    0 | ['the', 'sin', '##e', 'function', '$', 'sin', '(', 'x']
    10000 |     20 |    0 | ['the', 'sin', '##e', 'function',
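
The first table above shows the effect of vocabulary size: with 1,000 entries, `trivial` is encoded as `tri`, `##v`, `##ial`, while from 5,000 entries on it is a single token. The sketch below illustrates the greedy longest-match-first lookup that WordPiece applies to each word at encoding time. The function name and the two tiny vocabularies are hypothetical and chosen only to reproduce that pattern.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first encoding of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens

small_vocab = {"tri", "##v", "##ial", "zero", "##es"}  # hypothetical
large_vocab = small_vocab | {"trivial"}                # hypothetical

small_result = wordpiece_encode("trivial", small_vocab)
large_result = wordpiece_encode("trivial", large_vocab)
print(small_result)
print(large_result)
```

A larger vocabulary can only help when the extra entries cover the words being encoded. Once every word in the test text is already a single token, growing the vocabulary further no longer changes the token count.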

4. **Domain-Specific Tokenizer**: Try building a tokenizer on a domain-specific corpus (e.g., medical texts, legal documents) and compare it with a general-purpose tokenizer.

In [182]:
def create_latex_tokenizer():
    """Create a tokenizer for LaTeX """

    # WordPiece appears to work well
    tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

    # Clean up text
    tokenizer.normalizer = normalizers.Sequence([
        normalizers.Replace("``", '"'),
        normalizers.Replace("''", '"'),
        normalizers.NFD(),
        normalizers.Lowercase(),
        normalizers.StripAccents(),
        normalizers.Replace(Regex(r'\s+'), ' '),
    ])

    # Isolate LaTeX constructs (later steps in the sequence still split inside them)
    tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
        # Isolate LaTeX math expressions
        pre_tokenizers.Split(
            Regex(r'(\$\$[^$]*\$\$|\$[^$]*\$)'),
            behavior='isolated'
        ),

        # Isolate LaTeX commands
        pre_tokenizers.Split(
            Regex(r'(\\[a-zA-Z]+(?:\{[^}]*\})?(?:\[[^\]]*\])?)'),
            behavior='isolated'
        ),

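        # Isolate common function names
        pre_tokenizers.Split(
            Regex(r'\b(sin|cos|tan|log|ln|exp|sqrt|lim|max|min|sum|int)\b'),
            behavior='isolated'
        ),

        # Intended to keep variable patterns together (such as "2x", "x²").
        # Note: the first alternative also matches any single letter, so
        # ordinary words are isolated one character at a time.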
        pre_tokenizers.Split(
            Regex(r'(\d*[a-zA-Z][²³⁴⁵⁶⁷⁸⁹⁰¹]*|\d+[a-zA-Z]+)'),
            behavior='isolated'
        ),

        # Keep mathematical symbols as single units
        pre_tokenizers.Split(
            Regex(r'([≤≥≠≈∞±×÷∑∏∫∂∇])'),
            behavior='isolated'
        ),

        # Use BERT-style splitting for the rest
        pre_tokenizers.BertPreTokenizer()
    ])

    # Add focused special tokens
    special_tokens = [
        "[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]",
        "[MATH]", "[LATEX]", "[FORMULA]", "[EQUATION]"
    ]

    trainer = trainers.WordPieceTrainer(
        vocab_size=25000,
        special_tokens=special_tokens,
        continuing_subword_prefix="##",
        min_frequency=2
    )

    return tokenizer, trainer

def create_math_tokenizer():
    """Create a tokenizer for math """

    tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

    # Normalization
    tokenizer.normalizer = normalizers.Sequence([
        normalizers.NFD(),
        normalizers.Lowercase(),
        normalizers.StripAccents(),
    ])

    # Pre-tokenization
    tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
        # Isolate LaTeX expressions
        pre_tokenizers.Split(
            Regex(r'(\$[^$]+\$|\\\([^)]*\\\)|\\\[[^\]]*\\\])'),
            behavior='isolated'
        ),

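        # Intended to keep common math expressions together.
        # Note: the first alternative also matches any single letter, so
        # ordinary words are isolated one character at a time.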
        pre_tokenizers.Split(
            Regex(r'(\d*[a-zA-Z][₀₁₂₃₄₅₆₇₈₉²³⁴⁵⁶⁷⁸⁹⁰¹]*|\d+\.\d+|\d+/\d+)'),
            behavior='isolated'
        ),

        # Whitespace + punctuation splitting
        pre_tokenizers.Sequence([
            pre_tokenizers.WhitespaceSplit(),
            pre_tokenizers.Punctuation()
        ])
    ])

    # Include mathematical Unicode in special tokens
    special_tokens = [
        "[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]",
        # Preserve these symbols
        "π", "∞", "∑", "∏", "∫", "∂", "∇", "α", "β", "γ", "δ", "ε", "ζ", "η",
        "θ", "λ", "μ", "ν", "ξ", "ρ", "σ", "τ", "φ", "χ", "ψ", "ω",
        "≤", "≥", "≠", "≈", "±", "×", "÷", "√"
    ]

    trainer = trainers.WordPieceTrainer(
        vocab_size=25000,
        special_tokens=special_tokens,
        continuing_subword_prefix="##"
    )

    return tokenizer, trainer

def test_tokenizers():
    """Test tokenizers"""

    # Create tokenizers
    latex_tokenizer, latex_trainer = create_latex_tokenizer()
    math_tokenizer, math_trainer = create_math_tokenizer()

    # Create training data
    def get_math_corpus():
        math_texts = [
            "The function f(x) = x² + 2x + 1 has derivative f'(x) = 2x + 2.",
            "The integral ∫₀¹ x² dx = 1/3 by the fundamental theorem.",
            "The Riemann zeta function ζ(s) converges for Re(s) > 1.",
            "Euler's identity: e^(iπ) + 1 = 0 is beautiful.",
            "The quadratic formula: x = (-b ± √(b²-4ac))/(2a).",
            "The sine function sin(x) has period 2π.",
            "The limit lim_{x→0} sin(x)/x = 1.",
            "The series ∑_{n=1}^∞ 1/n² = π²/6.",
            "The matrix equation Ax = b has solution x = A⁻¹b.",
            "Complex numbers: z = a + bi where i² = -1.",
            math_ex,
            math_ex_2,
            math_ex_3
        ]

        # Repeat the data 10 times and yield it in batches of 2
        repeated = math_texts * 10
        for i in range(0, len(repeated), 2):
            yield repeated[i:i + 2]

    # Train the tokenizers
    latex_tokenizer.train_from_iterator(get_math_corpus(), trainer=latex_trainer)
    math_tokenizer.train_from_iterator(get_math_corpus(), trainer=math_trainer)

    # Compare tokenizers with a standard one
    standard_tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
    standard_tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
    standard_tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
    standard_trainer = trainers.WordPieceTrainer(
        vocab_size=25000,
        special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
    )
    standard_tokenizer.train_from_iterator(get_math_corpus(), trainer=standard_trainer)

    test_examples = [
        "All non-trivial zeroes of the Riemann zeta function ζ(s) lie on the critical line Re(s)=1/2.",
        "The sine function sin(x) has zeros at kπ.",
        "Find 2x + 3y = 15 when x = 4",
        "The integral ∫₀¹ x² dx = 1/3 by the fundamental theorem of calculus.",
        "The quadratic formula is x = (-b ± √(b²-4ac))/(2a)."
    ]

    for i, example in enumerate(test_examples):
        print(f"\nExample {i+1}: {example}")
        print("-" * 80)

        # Test all three tokenizers
        latex_encoding = latex_tokenizer.encode(example)
        math_encoding = math_tokenizer.encode(example)
        standard_encoding = standard_tokenizer.encode(example)

        print(f"LaTeX ({len(latex_encoding.tokens)} tokens):")
        print(f"  {latex_encoding.tokens}")

        print(f"Math ({len(math_encoding.tokens)} tokens):")
        print(f"  {math_encoding.tokens}")

        print(f"Standard ({len(standard_encoding.tokens)} tokens):")
        print(f"  {standard_encoding.tokens}")

        # Count UNK tokens
        latex_unks = latex_encoding.tokens.count('[UNK]')
        math_unks = math_encoding.tokens.count('[UNK]')
        standard_unks = standard_encoding.tokens.count('[UNK]')

        print(f"UNK counts - LaTeX: {latex_unks}, Math: {math_unks}, Standard: {standard_unks}")

    return latex_tokenizer, math_tokenizer, standard_tokenizer

def analyze_vocabulary_learned(tokenizer, name):
    """Analyze what vocabulary the tokenizer learned"""

    # Get the vocabulary
    vocab = tokenizer.get_vocab()

    # Mathematical functions and operations
    math_terms = []
    greek_letters = []
    math_symbols = []
    functions = []

    for token in vocab.keys():
        token_clean = token.replace('##', '')

        if any(greek in token_clean for greek in ['alpha', 'beta', 'gamma', 'delta', 'pi', 'sigma', 'theta', 'lambda', 'mu', 'omega']):
            greek_letters.append(token)
        elif any(func in token_clean for func in ['sin', 'cos', 'tan', 'log', 'exp', 'sqrt', 'lim', 'max', 'min']):
            functions.append(token)
        elif any(symbol in token for symbol in ['∫', '∑', '∏', '∂', '∇', '±', '≤', '≥', '≠', '≈', '∞', 'π', 'α', 'β', 'γ', 'δ', 'ε', 'ζ', 'η', 'θ', 'λ', 'μ', 'ν', 'ξ', 'ρ', 'σ', 'τ', 'φ', 'χ', 'ψ', 'ω']):
            math_symbols.append(token)
        elif any(math_word in token_clean for math_word in ['integral', 'derivative', 'equation', 'formula', 'theorem', 'matrix', 'vector', 'function']):
            math_terms.append(token)

    print(f"\n=== Vocabulary Analysis: {name} ===")
    print(f"Mathematical terms: {math_terms[:10]}...")
    print(f"Greek letters: {greek_letters[:10]}...")
    print(f"Math symbols: {math_symbols[:10]}...")
    print(f"Functions: {functions[:10]}...")
    print(f"Total math-related tokens: {len(math_terms + greek_letters + math_symbols + functions)}")


latex_tok, math_tok, standard_tok = test_tokenizers()

# Analyze what each tokenizer learned
analyze_vocabulary_learned(latex_tok, "LaTeX Tokenizer")
analyze_vocabulary_learned(math_tok, "Math Tokenizer")
analyze_vocabulary_learned(standard_tok, "Standard Tokenizer")



Example 1: All non-trivial zeroes of the Riemann zeta function ζ(s) lie on the critical line Re(s)=1/2.
--------------------------------------------------------------------------------
LaTeX (78 tokens):
  ['a', 'l', 'l', 'n', 'o', 'n', '-', 't', 'r', 'i', 'v', 'i', 'a', 'l', 'z', 'e', 'r', 'o', 'e', 's', 'o', 'f', 't', 'h', 'e', 'r', 'i', 'e', 'm', 'a', 'n', 'n', 'z', 'e', 't', 'a', 'f', 'u', 'n', 'c', 't', 'i', 'o', 'n', 'ζ', '(', 's', ')', 'l', 'i', 'e', 'o', 'n', 't', 'h', 'e', 'c', 'r', 'i', 't', 'i', 'c', 'a', 'l', 'l', 'i', 'n', 'e', 'r', 'e', '(', 's', ')', '=', '1', '/', '2', '.']
Math (78 tokens):
  ['a', 'l', 'l', 'n', 'o', 'n', '-', 't', 'r', 'i', 'v', 'i', 'a', 'l', 'z', 'e', 'r', 'o', 'e', 's', 'o', 'f', 't', 'h', 'e', 'r', 'i', 'e', 'm', 'a', 'n', 'n', 'z', 'e', 't', 'a', 'f', 'u', 'n', 'c', 't', 'i', 'o', 'n', 'ζ', '(', 's', ')', 'l', 'i', 'e', 'o', 'n', 't', 'h', 'e', 'c', 'r', 'i', 't', 'i', 'c', 'a', 'l', 'l', 'i', 'n', 'e', 'r', 'e', '(', 's', ')', '=', '1', '/', '
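
Both custom tokenizers above emit one token per character with no `##` prefixes, which indicates that the words were already split into single letters during pre-tokenization rather than by WordPiece. The variable pattern is the likely cause: `\d*[a-zA-Z][...]*` requires only one letter, so with `behavior='isolated'` every letter becomes its own piece. The sketch below checks this with Python's `re`, used here as a stand-in for the library's `Regex`, which may differ in edge cases. The `+` variant is one possible adjustment, offered as an assumption rather than a tested fix.

```python
import re

SUPERSCRIPTS = "²³⁴⁵⁶⁷⁸⁹⁰¹"

# First alternative of the pattern used in create_latex_tokenizer
original = re.compile(rf"\d*[a-zA-Z][{SUPERSCRIPTS}]*")
# Hypothetical variant: allow a run of letters instead of exactly one
adjusted = re.compile(rf"\d*[a-zA-Z]+[{SUPERSCRIPTS}]*")

sample = "all 2x x²"
original_matches = original.findall(sample)
adjusted_matches = adjusted.findall(sample)

print(original_matches)
print(adjusted_matches)
```

With the original pattern, `all` is matched as three separate letters, whereas the adjusted pattern keeps it whole and still captures `2x` and `x²`.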

5. **Special Token Handling**: Experiment with different special tokens and see how they affect the model's behavior.

In [186]:
import re

def create_latex_specific_tokenizer():
    """
    A tokenizer for LaTeX documents which focuses on preserving LaTeX command
    structure and mathematical expressions.
    """

    tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

    # Normalization
    tokenizer.normalizer = normalizers.Sequence([
        normalizers.NFD(),
        normalizers.Replace(Regex(r'\s+'), ' '),  # Clean up whitespace
    ])

    # Pre-tokenization
    tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
        # Isolate LaTeX environments
        pre_tokenizers.Split(
            Regex(r'(\\begin\{[^}]+\}.*?\\end\{[^}]+\})'),
            behavior='isolated'
        ),

        # Isolate display math environments
        pre_tokenizers.Split(
            Regex(r'(\$\$[^$]*\$\$|\\\[[^\]]*\\\])'),
            behavior='isolated'
        ),

        # Isolate inline math
        pre_tokenizers.Split(
            Regex(r'(\$[^$]*\$|\\\([^)]*\\\))'),
            behavior='isolated'
        ),

        # Isolate LaTeX commands with their arguments
        pre_tokenizers.Split(
            Regex(r'(\\[a-zA-Z]+\*?(?:\[[^\]]*\])?(?:\{[^}]*\})*)'),
            behavior='isolated'
        ),

        # Isolate standalone braces and their contents
        pre_tokenizers.Split(
            Regex(r'(\{[^{}]*\})'),
            behavior='isolated'
        ),

        # Standard splitting for regular text
        pre_tokenizers.WhitespaceSplit(),
        pre_tokenizers.Punctuation()
    ])

    # LaTeX vocabulary
    latex_special_tokens = [
        "[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]",

        # Document structure
        "[DOCUMENT]", "[SECTION]", "[SUBSECTION]", "[PARAGRAPH]",
        "[MATH_DISPLAY]", "[MATH_INLINE]", "[EQUATION]", "[ALIGN]",

        # LaTeX commands as single tokens
        "\\documentclass", "\\usepackage", "\\begin", "\\end",
        "\\section", "\\subsection", "\\paragraph", "\\item",
        "\\label", "\\ref", "\\cite", "\\bibliography",

        # Mathematical commands
        "\\frac", "\\sqrt", "\\sum", "\\int", "\\prod", "\\lim",
        "\\partial", "\\nabla", "\\infty", "\\alpha", "\\beta", "\\gamma",
        "\\delta", "\\epsilon", "\\zeta", "\\eta", "\\theta", "\\iota",
        "\\kappa", "\\lambda", "\\mu", "\\nu", "\\xi", "\\pi", "\\rho",
        "\\sigma", "\\tau", "\\upsilon", "\\phi", "\\chi", "\\psi", "\\omega",

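        # Mathematical environments (note: in LaTeX source these usually
        # appear as \begin{name}...\end{name}, not as backslash commands)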
        "\\equation", "\\align", "\\gather", "\\split", "\\cases",
        "\\matrix", "\\pmatrix", "\\bmatrix", "\\vmatrix",

        # Formatting
        "\\textbf", "\\textit", "\\emph", "\\underline", "\\overline",
        "\\left", "\\right", "\\big", "\\Big", "\\bigg", "\\Bigg",

        # Symbols that should never be split
        "\\leq", "\\geq", "\\neq", "\\approx", "\\equiv", "\\propto",
        "\\in", "\\notin", "\\subset", "\\supset", "\\cup", "\\cap",
        "\\times", "\\div", "\\pm", "\\mp", "\\cdot", "\\circ",
    ]

    trainer = trainers.WordPieceTrainer(
        vocab_size=50000,
        special_tokens=latex_special_tokens,
        continuing_subword_prefix="##",
        min_frequency=1,  # Include rare LaTeX commands
    )

    return tokenizer, trainer

def create_latex_training_corpus():
    """Generate LaTeX-rich training data"""

    latex_documents = [
        # Basic document structure
        "\\documentclass{article} \\usepackage{amsmath} \\begin{document} Hello World \\end{document}",

        # Mathematical expressions
        "The quadratic formula is \\begin{equation} x = \\frac{-b \\pm \\sqrt{b^2 - 4ac}}{2a} \\end{equation}",

        # Inline and display math
        "We have $f(x) = x^2$ and \\[g(x) = \\int_0^x f(t) dt\\]",

        # Complex mathematical structures
        "\\begin{align} \\nabla \\times \\mathbf{E} &= -\\frac{\\partial \\mathbf{B}}{\\partial t} \\\\ \\nabla \\times \\mathbf{B} &= \\mu_0 \\mathbf{J} + \\mu_0 \\epsilon_0 \\frac{\\partial \\mathbf{E}}{\\partial t} \\end{align}",

        # Matrices and arrays
        "\\begin{pmatrix} a & b \\\\ c & d \\end{pmatrix} \\begin{bmatrix} x \\\\ y \\end{bmatrix} = \\begin{bmatrix} ax + by \\\\ cx + dy \\end{bmatrix}",

        # Greek letters and symbols
        "Let $\\alpha, \\beta, \\gamma$ be angles such that $\\alpha + \\beta + \\gamma = \\pi$",

        # Fractions and radicals
        "\\frac{\\sqrt{a + b}}{\\sqrt{c - d}} = \\sqrt{\\frac{a + b}{c - d}}",

        # Summations and integrals
        "\\sum_{n=1}^{\\infty} \\frac{1}{n^2} = \\frac{\\pi^2}{6} \\quad \\text{and} \\quad \\int_0^{\\infty} e^{-x^2} dx = \\frac{\\sqrt{\\pi}}{2}",

        # Function definitions
        "Let $f: \\mathbb{R} \\to \\mathbb{R}$ be defined by $f(x) = \\sin(x) + \\cos(x)$",

        # Limits and derivatives
        "\\lim_{x \\to 0} \\frac{\\sin(x)}{x} = 1 \\quad \\text{and} \\quad \\frac{d}{dx}[x^n] = nx^{n-1}",

        # Set theory
        "For sets $A, B \\subset \\mathbb{R}$, we have $A \\cup B = \\{x : x \\in A \\text{ or } x \\in B\\}$",

        # Probability
        "If $X \\sim \\mathcal{N}(\\mu, \\sigma^2)$, then $\\mathbb{E}[X] = \\mu$ and $\\text{Var}(X) = \\sigma^2$",

        # Linear algebra
        "The eigenvalues of matrix $A$ satisfy $\\det(A - \\lambda I) = 0$",

        # Complex analysis
        "The Cauchy-Riemann equations: $\\frac{\\partial u}{\\partial x} = \\frac{\\partial v}{\\partial y}$ and $\\frac{\\partial u}{\\partial y} = -\\frac{\\partial v}{\\partial x}$",

        # Document sections
        "\\section{Introduction} \\subsection{Background} \\paragraph{Motivation} This work studies...",

        # References and citations
        "As shown in \\cite{einstein1905} and \\ref{eq:main}, we conclude that...",

        # Environments
        "\\begin{theorem} \\label{thm:main} Every continuous function on $[a,b]$ is uniformly continuous. \\end{theorem}",

        # Cases and piecewise functions
        "\\begin{equation} f(x) = \\begin{cases} x^2 & \\text{if } x \\geq 0 \\\\ -x^2 & \\text{if } x < 0 \\end{cases} \\end{equation}",

        # Tables and formatting
        "\\begin{table} \\begin{tabular}{|c|c|} \\hline $x$ & $f(x)$ \\\\ \\hline 1 & 2 \\\\ 2 & 4 \\\\ \\hline \\end{tabular} \\end{table}",
    ]

    # Expand the corpus
    expanded_corpus = latex_documents * 1000  # Repeat for more training data

    def latex_corpus_generator():
        for i in range(0, len(expanded_corpus), 10):
            yield expanded_corpus[i:i+10]

    return latex_corpus_generator

def test_latex_tokenizer():
    """Test the LaTeX-specific tokenizer"""

    latex_tokenizer, trainer = create_latex_specific_tokenizer()

    # Train on LaTeX corpus
    latex_corpus = create_latex_training_corpus()
    latex_tokenizer.train_from_iterator(latex_corpus(), trainer=trainer)

    # Create standard tokenizer for comparison
    standard_tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
    standard_tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
    standard_tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
    standard_trainer = trainers.WordPieceTrainer(
        vocab_size=50000,
        special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
    )
    standard_tokenizer.train_from_iterator(latex_corpus(), trainer=standard_trainer)

    # Test on LaTeX examples
    latex_test_examples = [
        "\\frac{\\sqrt{b^2 - 4ac}}{2a}",
        "\\begin{equation} \\sum_{n=1}^{\\infty} \\frac{1}{n^2} = \\frac{\\pi^2}{6} \\end{equation}",
        "\\int_0^{\\infty} e^{-x^2} dx = \\frac{\\sqrt{\\pi}}{2}",
        "\\begin{pmatrix} a & b \\\\ c & d \\end{pmatrix}",
        "\\lim_{x \\to 0} \\frac{\\sin(x)}{x} = 1",
        "\\alpha + \\beta + \\gamma = \\pi",
        "\\documentclass{article} \\usepackage{amsmath}",
        "\\begin{align} f(x) &= x^2 \\\\ g(x) &= 2x \\end{align}",
    ]

    for i, example in enumerate(latex_test_examples):
        print(f"\nExample {i+1}: {example}")
        print("-" * 80)

        latex_encoding = latex_tokenizer.encode(example)
        standard_encoding = standard_tokenizer.encode(example)

        latex_unks = latex_encoding.tokens.count('[UNK]')
        standard_unks = standard_encoding.tokens.count('[UNK]')

        print(f"LaTeX-specific ({len(latex_encoding.tokens)} tokens, {latex_unks} UNKs):")
        print(f"  {latex_encoding.tokens}")

        print(f"Standard ({len(standard_encoding.tokens)} tokens, {standard_unks} UNKs):")
        print(f"  {standard_encoding.tokens}")

        # Check LaTeX command preservation
        latex_commands_intact = sum(1 for token in latex_encoding.tokens if token.startswith('\\') and len(token) > 2)
        standard_commands_intact = sum(1 for token in standard_encoding.tokens if token.startswith('\\') and len(token) > 2)

        print(f"LaTeX commands preserved intact - LaTeX: {latex_commands_intact}, Standard: {standard_commands_intact}")

    return latex_tokenizer, standard_tokenizer

def analyze_latex_vocabulary(tokenizer, name):
    """Analyze LaTeX-specific vocabulary learned"""
    print(f"\n=== LaTeX Vocabulary Analysis: {name} ===")

    vocab = tokenizer.get_vocab()

    # Categorize LaTeX vocabulary
    latex_commands = [token for token in vocab.keys() if token.startswith('\\') and len(token) > 2]
    math_environments = [token for token in vocab.keys() if any(env in token for env in ['equation', 'align', 'matrix', 'cases'])]
    greek_letters = [token for token in vocab.keys() if any(greek in token for greek in ['alpha', 'beta', 'gamma', 'delta', 'pi', 'sigma', 'theta', 'lambda'])]
    document_structure = [token for token in vocab.keys() if any(struct in token for struct in ['section', 'chapter', 'document', 'usepackage'])]

    print(f"LaTeX commands: {len(latex_commands)} (e.g., {latex_commands[:10]})")
    print(f"Math environments: {len(math_environments)} (e.g., {math_environments[:5]})")
    print(f"Greek letters: {len(greek_letters)} (e.g., {greek_letters[:10]})")
    print(f"Document structure: {len(document_structure)} (e.g., {document_structure[:5]})")

    return {
        'commands': len(latex_commands),
        'environments': len(math_environments),
        'greek': len(greek_letters),
        'structure': len(document_structure)
    }


latex_tok, standard_tok = test_latex_tokenizer()

# Analyze vocabularies
latex_stats = analyze_latex_vocabulary(latex_tok, "LaTeX-Specific Tokenizer")
standard_stats = analyze_latex_vocabulary(standard_tok, "Standard Tokenizer")



Example 1: \frac{\sqrt{b^2 - 4ac}}{2a}
--------------------------------------------------------------------------------
LaTeX-specific (14 tokens, 0 UNKs):
  ['\\frac', '{', '\\sqrt', '{', 'b', '^', '2', '-', '4ac', '}', '}', '{', '2a', '}']
Standard (16 tokens, 0 UNKs):
  ['\\', 'frac', '{', '\\', 'sqrt', '{', 'b', '^', '2', '-', '4ac', '}', '}', '{', '2a', '}']
LaTeX commands preserved intact - LaTeX: 2, Standard: 0

Example 2: \begin{equation} \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} \end{equation}
--------------------------------------------------------------------------------
LaTeX-specific (38 tokens, 0 UNKs):
  ['\\begin', '{', 'equation', '}', '\\sum', '_', '{', 'n', '=', '1', '}', '^', '{', '\\infty', '}', '\\frac', '{', '1', '}', '{', 'n', '^', '2', '}', '=', '\\frac', '{', '\\pi', '^', '2', '}', '{', '6', '}', '\\end', '{', 'equation', '}']
Standard (45 tokens, 0 UNKs):
  ['\\', 'begin', '{', 'equation', '}', '\\', 'sum', '_', '{', 'n', '=', '1', '}', '^', '{', '
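
In the outputs above, the LaTeX-specific tokenizer keeps `\frac` and `\infty` whole, even though `Punctuation` would normally split off the backslash, as it does for the standard tokenizer. The likely reason is that tokens listed in `special_tokens` are registered as added tokens and matched in the text before pre-tokenization runs. The sketch below imitates that extraction step in plain Python. The function name is hypothetical, and trying longer tokens first is an assumption made so that `\infty` wins over `\in`.

```python
import re

def extract_special(text, special_tokens):
    """Cut `text` into (piece, is_special) pairs, matching longer tokens first."""
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(tok) for tok in ordered))
    pieces, pos = [], 0
    for m in pattern.finditer(text):
        if m.start() > pos:
            pieces.append((text[pos:m.start()], False))
        pieces.append((m.group(), True))
        pos = m.end()
    if pos < len(text):
        pieces.append((text[pos:], False))
    return pieces

specials = [r"\in", r"\infty", r"\frac"]
pieces = extract_special(r"\frac{\infty}", specials)
print(pieces)
```

Only the pieces that are not special tokens would then go through normalization and pre-tokenization, which is consistent with the braces being split off while the commands stay intact.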