# Unigram tokenization
Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [2]:
%%capture
!pip install datasets evaluate transformers[sentencepiece]

Also log into Hugging Face.

In [3]:
from huggingface_hub import notebook_login

notebook_login()


The <font color='blue'>Unigram algorithm</font> is used in <font color='blue'>combination</font> with [SentencePiece](https://huggingface.co/papers/1808.06226), which is the tokenization algorithm used by models like ALBERT, T5, mBART, Big Bird, and XLNet.

SentencePiece addresses the fact that <font color='blue'>not all languages</font> use <font color='blue'>spaces</font> to <font color='blue'>separate words</font>. Instead, SentencePiece treats the <font color='blue'>input</font> as a <font color='blue'>raw input stream</font> which includes the <font color='blue'>space</font> in the <font color='blue'>set of characters</font> to use. Then it can use the <font color='blue'>Unigram algorithm</font> to <font color='blue'>construct</font> the appropriate <font color='blue'>vocabulary</font>.
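
To make the idea concrete, here is a minimal sketch of treating the space as an ordinary symbol. It assumes the convention visible in the tokens shown later in this notebook, where a space is written as the `▁` (U+2581) character. The helper name `to_raw_stream` is hypothetical, and this is an illustration rather than SentencePiece's actual implementation.

```python
# Illustrative only: mimic how a space can become a regular symbol.
META_SPACE = "\u2581"  # the "▁" character that stands in for a space


def to_raw_stream(text):
    """Replace spaces with a visible marker and prefix one to the text."""
    return META_SPACE + text.replace(" ", META_SPACE)


stream = to_raw_stream("This is a test")
print(stream)
# The marker is now part of the character set the vocabulary is built from
print(sorted(set(stream)))
```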

💡 **Tip:** This section covers Unigram in depth, going as far as showing a full implementation. You can skip to the end if you just want a general overview of the tokenization algorithm.

## Training algorithm

Compared to BPE and WordPiece, Unigram works in the other direction: it <font color='blue'>starts</font> from a <font color='blue'>big vocabulary</font> and <font color='blue'>removes tokens</font> from it until it <font color='blue'>reaches</font> the <font color='blue'>desired vocabulary size</font>. There are <font color='blue'>several options</font> to use to build that base vocabulary: we can take the <font color='blue'>most common substrings</font> in <font color='blue'>pre-tokenized words</font>, for instance, or <font color='blue'>apply BPE</font> on the <font color='blue'>initial corpus</font> with a large vocabulary size.

At <font color='blue'>each step</font> of the training, the Unigram algorithm <font color='blue'>computes</font> a <font color='blue'>loss</font> over the <font color='blue'>corpus</font> given the current vocabulary. Then, for <font color='blue'>each symbol</font> in the <font color='blue'>vocabulary</font>, the algorithm <font color='blue'>computes</font> how much the <font color='blue'>overall loss</font> would <font color='blue'>increase</font> if the <font color='blue'>symbol</font> was <font color='blue'>removed</font>, and looks for the <font color='blue'>symbols</font> that would <font color='blue'>increase</font> it the <font color='blue'>least</font>. Those symbols have a <font color='blue'>lower effect</font> on the <font color='blue'>overall loss</font> over the <font color='blue'>corpus</font>, so in a sense they are "less needed" and are the best candidates for removal.

This is all a very <font color='blue'>costly operation</font>, so we don't just remove the single symbol associated with the lowest loss increase, but the <font color='blue'>p</font> (p being a <font color='blue'>hyperparameter</font> you <font color='blue'>can control</font>, usually <font color='blue'>10</font> or <font color='blue'>20</font>) <font color='blue'>percent</font> of the <font color='blue'>symbols</font> associated with the <font color='blue'>lowest loss increase</font>. This process is then repeated until the vocabulary has reached the desired size.

Note that we <font color='blue'>never remove</font> the <font color='blue'>base characters</font>, to make sure any word can be tokenized.
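
A single pruning step could be sketched as follows. The function name `prune_vocabulary` and the loss-increase values are made up for illustration; computing real values requires the tokenization algorithm described in the next section.

```python
def prune_vocabulary(loss_increase, percent=0.1):
    """Drop the `percent` share of tokens whose removal hurts the loss least.

    `loss_increase` maps each token to how much the corpus loss would rise
    if that token were removed. Single characters are never removed.
    """
    # Only multi-character tokens are candidates for removal
    candidates = {t: d for t, d in loss_increase.items() if len(t) > 1}
    ranked = sorted(candidates, key=candidates.get)
    n_remove = int(len(loss_increase) * percent)
    removed = set(ranked[:n_remove])
    return [t for t in loss_increase if t not in removed]


# Hypothetical loss increases, for illustration only
loss_increase = {
    "h": 5.0, "u": 9.0, "g": 4.0, "p": 6.0, "hu": 0.0,
    "ug": 2.5, "pu": 0.0, "hug": 23.5, "gs": 1.0, "ugs": 0.5,
}
kept = prune_vocabulary(loss_increase, percent=0.2)
print(kept)
```

With ten tokens and `percent=0.2`, two tokens are dropped; here those are `"hu"` and `"pu"`, the two multi-character tokens with the smallest loss increase.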

Now, this is still a bit vague: the <font color='blue'>main part</font> of the <font color='blue'>algorithm</font> is to <font color='blue'>compute</font> a <font color='blue'>loss</font> over the <font color='blue'>corpus</font> and see how it <font color='blue'>changes</font> when we <font color='blue'>remove some tokens</font> from the <font color='blue'>vocabulary</font>, but we haven't explained how to do this yet. This step relies on the <font color='blue'>tokenization algorithm</font> of a <font color='blue'>Unigram model</font>, so we'll dive into this next.

We'll <font color='blue'>reuse</font> the <font color='blue'>corpus</font> from the <font color='blue'>previous examples</font>:

```
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
```

and for this example, we will take <font color='blue'>all strict substrings</font> for the initial vocabulary:

```
["h", "u", "g", "hu", "ug", "p", "pu", "n", "un", "b", "bu", "s", "hug", "gs", "ugs"]
```
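
As a quick check, the list above can be reproduced with a few lines of Python. This is a small sketch; the helper name `strict_substrings` is ours, and "strict" is taken to mean any substring shorter than the word it comes from.

```python
def strict_substrings(word):
    """Return every substring of `word` that is shorter than `word` itself."""
    n = len(word)
    return {
        word[i:j]
        for i in range(n)
        for j in range(i + 1, n + 1)
        if j - i < n
    }


words = ["hug", "pug", "pun", "bun", "hugs"]
initial_vocab = set().union(*(strict_substrings(w) for w in words))
print(sorted(initial_vocab, key=lambda t: (len(t), t)))
```

Note that `"hug"` is in the vocabulary only because it is a strict substring of `"hugs"`.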

## Tokenization algorithm

A <font color='blue'>Unigram model</font> is a type of language model that considers <font color='blue'>each token</font> to be <font color='blue'>independent</font> of the <font color='blue'>tokens before it</font>. It's the simplest language model, in the sense that the <font color='blue'>probability of token X</font> given the <font color='blue'>previous context</font> is just the <font color='blue'>probability of token X</font>. So, if we used a Unigram language model to generate text, we would <font color='blue'>always predict</font> the <font color='blue'>most common token</font>.

The <font color='blue'>probability</font> of a <font color='blue'>given token</font> is its <font color='blue'>frequency</font> (the number of times we find it) in the original corpus, <font color='blue'>divided</font> by the <font color='blue'>sum of all frequencies</font> of <font color='blue'>all tokens</font> in the <font color='blue'>vocabulary</font> (to make sure the probabilities sum up to 1). For instance, `"ug"` is present in `"hug"`, `"pug"`, and `"hugs"`, so it has a frequency of 20 in our corpus.

Here are the <font color='blue'>frequencies</font> of <font color='blue'>all</font> the <font color='blue'>possible subwords</font> in the vocabulary:

```
("h", 15) ("u", 36) ("g", 20) ("hu", 15) ("ug", 20) ("p", 17) ("pu", 17) ("n", 16)
("un", 16) ("b", 4) ("bu", 4) ("s", 5) ("hug", 15) ("gs", 5) ("ugs", 5)
```

So, the <font color='blue'>sum</font> of <font color='blue'>all frequencies</font> is <font color='blue'>210</font>, and the probability of the subword `"ug"` is thus 20/210.

✏️ **Now your turn!** Write the code to compute the frequencies above and double-check that the results shown are correct, as well as the total sum.

In [22]:
# Your code here to compute the frequencies
def compute_frequencies(corpus):
    """Compute all subword frequencies for Unigram model."""
    freq = {}
    for word in corpus:
        for i in range(len(word)):
            for j in range(i + 1, len(word) + 1):
                subword = word[i:j]
                freq[subword] = freq.get(subword, 0) + 1
    return freq


corpus = ["hug"]*10 + ["pug"]*5 + ["pun"]*12 + ["bun"]*4 + ["hugs"]*5
frequencies = compute_frequencies(corpus)

print("Computed frequencies:")
for subword in sorted(frequencies.keys()):
    print(f'("{subword}", {frequencies[subword]})')

vocabulary = {
    "h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17,
    "pu": 17, "n": 16, "un": 16, "b": 4, "bu": 4, "s": 5,
    "hug": 15, "gs": 5, "ugs": 5
}


total = sum(frequencies.values())
print(f"\nTotal sum: {total}")
excluded_total = 0
print("\nTotal sum of excluded:")
for subword, freq in frequencies.items():
    if subword not in vocabulary:
        print(f'  "{subword}": {freq} occurrences')
        excluded_total += freq
# Derive the final figure from the computed totals instead of hard-coding it
print(f"\nTotal sum calculated: {total}-{excluded_total} excluded = {total - excluded_total}")

Computed frequencies:
("b", 4)
("bu", 4)
("bun", 4)
("g", 20)
("gs", 5)
("h", 15)
("hu", 15)
("hug", 15)
("hugs", 5)
("n", 16)
("p", 17)
("pu", 17)
("pug", 5)
("pun", 12)
("s", 5)
("u", 36)
("ug", 20)
("ugs", 5)
("un", 16)

Total sum: 236

Total sum of excluded:
  "pug": 5 occurrences
  "pun": 12 occurrences
  "bun": 4 occurrences
  "hugs": 5 occurrences

Total sum calculated: 236-26 excluded = 210


Now, to <font color='blue'>tokenize</font> a given word, we look at <font color='blue'>all</font> the <font color='blue'>possible segmentations</font> into tokens and <font color='blue'>compute</font> the <font color='blue'>probability of each</font> according to the Unigram model. Since <font color='blue'>all tokens</font> are considered <font color='blue'>independent</font>, this probability is just the <font color='blue'>product</font> of the <font color='blue'>probability of each token</font>. For instance, the tokenization `["p", "u", "g"]` of `"pug"` has the probability:

$$P([``p", ``u", ``g"]) = P(``p") \times P(``u") \times P(``g") = \frac{17}{210} \times \frac{36}{210} \times \frac{20}{210} = 0.001322$$

Comparatively, the tokenization `["pu", "g"]` has the probability:

$$P([``pu", ``g"]) = P(``pu") \times P(``g") = \frac{17}{210} \times \frac{20}{210} = 0.007710$$

so that one is way more likely. In general, <font color='blue'>tokenizations</font> with the <font color='blue'>fewest tokens possible</font> will have the <font color='blue'>highest probability</font> (because of that division by 210 repeated for each token), which corresponds to what we <font color='blue'>want intuitively</font>: to <font color='blue'>split</font> a <font color='blue'>word</font> into the <font color='blue'>smallest number of tokens</font> possible.

The <font color='blue'>tokenization</font> of a <font color='blue'>word</font> with the Unigram model is then the <font color='blue'>tokenization</font> with the <font color='blue'>highest probability</font>. In the example of `"pug"`, here are the <font color='blue'>probabilities</font> we would get for <font color='blue'>each possible segmentation</font>:

```
["p", "u", "g"] : 0.001322
["p", "ug"] : 0.007710
["pu", "g"] : 0.007710
```

So, `"pug"` would be tokenized as `["p", "ug"]` or `["pu", "g"]`, depending on <font color='blue'>which</font> of those <font color='blue'>segmentations</font> is <font color='blue'>encountered first</font> (note that in a larger corpus, equality cases like this will be rare).
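
For short words, this exhaustive comparison can be written directly. The sketch below enumerates every segmentation of a word and multiplies the token probabilities; it uses the frequency table from above and the word `"hugs"` as an example. The helper names are ours, and the approach is only practical for very short words.

```python
freqs = {
    "h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17,
    "pu": 17, "n": 16, "un": 16, "b": 4, "bu": 4, "s": 5,
    "hug": 15, "gs": 5, "ugs": 5,
}
total = sum(freqs.values())


def segmentations(word):
    """Yield every way to split `word` into tokens from the vocabulary."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in freqs:
            for rest in segmentations(word[end:]):
                yield [head] + rest


def probability(tokens):
    """Product of the unigram probabilities of the tokens."""
    p = 1.0
    for token in tokens:
        p *= freqs[token] / total
    return p


scored = sorted(((probability(s), s) for s in segmentations("hugs")), reverse=True)
for p, tokens in scored:
    print(tokens, f"{p:.6f}")
```

`"hugs"` has seven segmentations with this vocabulary, and three of them (`["hug", "s"]`, `["hu", "gs"]`, and `["h", "ugs"]`) tie for the highest probability, another example of the equality cases mentioned above.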

In this case, it was easy to find all the possible segmentations and compute their probabilities, but in general it's going to be a bit harder. There is a <font color='blue'>classic algorithm</font> used for this, called the <font color='blue'>Viterbi algorithm</font>. Essentially, we can <font color='blue'>build a graph</font> to <font color='blue'>detect</font> the <font color='blue'>possible segmentations</font> of a <font color='blue'>given word</font> by saying there is a <font color='blue'>branch</font> from <font color='blue'>character _a_</font> to <font color='blue'>character _b_</font> if the <font color='blue'>subword</font> from <font color='blue'>_a_ to _b_</font> is <font color='blue'>in the vocabulary</font>, and attribute to that <font color='blue'>branch</font> the <font color='blue'>probability of the subword</font>.

To <font color='blue'>find the path</font> in that <font color='blue'>graph</font> that is going to have the <font color='blue'>best score</font>, the Viterbi algorithm determines, for <font color='blue'>each position</font> in the <font color='blue'>word</font>, the <font color='blue'>segmentation</font> with the <font color='blue'>best score</font> that <font color='blue'>ends</font> at <font color='blue'>that position</font>. Since we go from the <font color='blue'>beginning to the end</font>, that <font color='blue'>best score</font> can be found by <font color='blue'>looping</font> through <font color='blue'>all subwords ending</font> at the <font color='blue'>current position</font> and then <font color='blue'>using</font> the <font color='blue'>best tokenization score</font> from the <font color='blue'>position</font> this <font color='blue'>subword begins at</font>. Then, we just have to <font color='blue'>unroll the path taken</font> to <font color='blue'>arrive at the end</font>.

Let's take a look at an example using our vocabulary and the word `"unhug"`. For <font color='blue'>each position</font>, the <font color='blue'>subwords</font> with the <font color='blue'>best scores</font> ending there are the following:

```
Character 0 (u): "u" (score 0.171429)
Character 1 (n): "un" (score 0.076190)
Character 2 (h): "un" "h" (score 0.005442)
Character 3 (u): "un" "hu" (score 0.005442)
Character 4 (g): "un" "hug" (score 0.005442)
```

Thus `"unhug"` would be tokenized as `["un", "hug"]`.

✏️ **Now your turn!** Determine the tokenization of the word `"huggun"`, and its score.

In [23]:
# Your code here to determine the tokenization of "huggun"
def unigram_tokenize(word, vocabulary):
    """Tokenize using Unigram model with Viterbi algorithm."""
    total = sum(vocabulary.values())
    probs = {k: v/total for k, v in vocabulary.items()}

    n = len(word)
    best_score = [0.0] * (n + 1)
    best_path = [[] for _ in range(n + 1)]
    best_score[0] = 1.0

    for i in range(1, n + 1):
        for j in range(i):
            subword = word[j:i]
            if subword in probs:
                score = best_score[j] * probs[subword]
                if score > best_score[i]:
                    best_score[i] = score
                    best_path[i] = best_path[j] + [subword]

    return best_path[n], best_score[n]

vocab = {"h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17,
         "pu": 17, "n": 16, "un": 16, "b": 4, "bu": 4, "s": 5,
         "hug": 15, "gs": 5, "ugs": 5}

tokens, score = unigram_tokenize("huggun", vocab)
print(f'Tokenization of "huggun": {tokens}')
print(f'Score: {score:.6f}')

Tokenization of "huggun": ['hug', 'g', 'un']
Score: 0.000518


## Back to training

Now that we have seen how the tokenization works, we can dive a little more deeply into the <font color='blue'>loss</font> used <font color='blue'>during training</font>. At any given stage, this <font color='blue'>loss</font> is computed by <font color='blue'>tokenizing every word</font> in the <font color='blue'>corpus</font>, using the <font color='blue'>current vocabulary</font> and the <font color='blue'>Unigram model</font> determined by the frequencies of each token in the corpus (as seen before).

<font color='blue'>Each word</font> in the corpus has a <font color='blue'>score</font>, and the <font color='blue'>loss</font> is the <font color='blue'>negative log likelihood</font> of those scores -- that is, the sum for all the words in the corpus of all the `-log(P(word))`.

Let's go back to our example with the following corpus:

```
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
```

The <font color='blue'>tokenization</font> of <font color='blue'>each word</font> with their respective <font color='blue'>scores</font> is:

```
"hug": ["hug"] (score 0.071428)
"pug": ["pu", "g"] (score 0.007710)
"pun": ["pu", "n"] (score 0.006168)
"bun": ["bu", "n"] (score 0.001451)
"hugs": ["hug", "s"] (score 0.001701)
```

So the <font color='blue'>loss</font> is:

```
10 * (-log(0.071428)) + 5 * (-log(0.007710)) + 12 * (-log(0.006168)) + 4 * (-log(0.001451)) + 5 * (-log(0.001701)) = 169.8
```
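
The same figure can be checked in a few lines. This sketch works with `-log` probabilities directly, so the best segmentation of a word is the one with the lowest total cost; the helper name `best_cost` is ours.

```python
from math import log

freqs = {
    "h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17,
    "pu": 17, "n": 16, "un": 16, "b": 4, "bu": 4, "s": 5,
    "hug": 15, "gs": 5, "ugs": 5,
}
total = sum(freqs.values())
neg_log_prob = {token: -log(freq / total) for token, freq in freqs.items()}
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}


def best_cost(word, model):
    """Lowest total -log(P) over all segmentations of `word` (Viterbi)."""
    best = [0.0] + [float("inf")] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            token = word[start:end]
            if token in model:
                best[end] = min(best[end], best[start] + model[token])
    return best[-1]


loss = sum(count * best_cost(word, neg_log_prob) for word, count in corpus.items())
print(f"{loss:.1f}")  # prints 169.8
```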

Now we need to compute how <font color='blue'>removing each token affects</font> the <font color='blue'>loss</font>. This is rather tedious, so we'll just do it for <font color='blue'>two tokens</font> here and save the whole process for when we have code to help us. In this (very) particular case, <font color='blue'>every word</font> tokenized with `"pu"` has an <font color='blue'>equivalent tokenization</font> with the same score: as we saw earlier, `"pug"` could be tokenized `["p", "ug"]`, and likewise `"pun"` could be tokenized `["p", "un"]`. Thus, <font color='blue'>removing</font> the <font color='blue'>`"pu"` token</font> from the <font color='blue'>vocabulary</font> will <font color='blue'>give</font> the <font color='blue'>exact same loss</font>.

On the other hand, <font color='blue'>removing `"hug"`</font> will <font color='blue'>make</font> the <font color='blue'>loss worse</font>, because the tokenization of `"hug"` and `"hugs"` will become:

```
"hug": ["hu", "g"] (score 0.006802)
"hugs": ["hu", "gs"] (score 0.001701)
```

These changes will cause the <font color='blue'>loss</font> to <font color='blue'>rise by</font>:

```
- 10 * (-log(0.071428)) + 10 * (-log(0.006802)) = 23.5
```

Therefore, the token `"pu"` will probably be removed from the vocabulary, but not `"hug"`.
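
Both cases can be verified with a short script. As in the worked example, this sketch keeps the remaining probabilities unchanged after a token is removed (no renormalization); the helper names are ours.

```python
from math import log

freqs = {
    "h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17,
    "pu": 17, "n": 16, "un": 16, "b": 4, "bu": 4, "s": 5,
    "hug": 15, "gs": 5, "ugs": 5,
}
total = sum(freqs.values())
model = {token: -log(freq / total) for token, freq in freqs.items()}
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}


def best_cost(word, model):
    """Lowest total -log(P) over all segmentations of `word` (Viterbi)."""
    best = [0.0] + [float("inf")] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            token = word[start:end]
            if token in model:
                best[end] = min(best[end], best[start] + model[token])
    return best[-1]


def corpus_loss(model):
    """Sum of the best -log(P) of each word, weighted by its count."""
    return sum(count * best_cost(word, model) for word, count in corpus.items())


base_loss = corpus_loss(model)
loss_increase = {}
for token in ("pu", "hug"):
    # Remove one token and measure how much the corpus loss rises
    reduced = {t: cost for t, cost in model.items() if t != token}
    loss_increase[token] = corpus_loss(reduced) - base_loss
    print(token, round(loss_increase[token], 1))
```

Removing `"pu"` leaves the loss unchanged, while removing `"hug"` raises it by about 23.5, matching the calculation above.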

## Implementing Unigram

Now let's implement everything we've seen so far in code. Like with BPE and WordPiece, this is not an efficient implementation of the Unigram algorithm (quite the opposite), but it should help you understand it a bit better.

We will use the <font color='blue'>same corpus</font> as <font color='blue'>before</font> as an example:

In [24]:
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

This time, we will use `xlnet-base-cased` as our model:

In [26]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

Like for BPE and WordPiece, we begin by <font color='blue'>counting the number</font> of <font color='blue'>occurrences</font> of <font color='blue'>each word</font> in the corpus:

In [27]:
from collections import defaultdict

word_freqs = defaultdict(int)
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

word_freqs

defaultdict(int,
            {'▁This': 3,
             '▁is': 2,
             '▁the': 1,
             '▁Hugging': 1,
             '▁Face': 1,
             '▁Course.': 1,
             '▁chapter': 1,
             '▁about': 1,
             '▁tokenization.': 1,
             '▁section': 1,
             '▁shows': 1,
             '▁several': 1,
             '▁tokenizer': 1,
             '▁algorithms.': 1,
             '▁Hopefully,': 1,
             '▁you': 1,
             '▁will': 1,
             '▁be': 1,
             '▁able': 1,
             '▁to': 1,
             '▁understand': 1,
             '▁how': 1,
             '▁they': 1,
             '▁are': 1,
             '▁trained': 1,
             '▁and': 1,
             '▁generate': 1,
             '▁tokens.': 1})

Then, we need to <font color='blue'>initialize our vocabulary</font> to something <font color='blue'>larger</font> than the <font color='blue'>vocab size</font> we will want at the end. We have to include <font color='blue'>all</font> the <font color='blue'>basic characters</font> (otherwise we won't be able to tokenize every word), but for the <font color='blue'>bigger substrings</font> we'll only <font color='blue'>keep</font> the <font color='blue'>most common</font> ones, so we <font color='blue'>sort</font> them by <font color='blue'>frequency</font>:

In [28]:
char_freqs = defaultdict(int)
subwords_freqs = defaultdict(int)
for word, freq in word_freqs.items():
    for i in range(len(word)):
        char_freqs[word[i]] += freq
        # Loop through the subwords of length at least 2
        for j in range(i + 2, len(word) + 1):
            subwords_freqs[word[i:j]] += freq

# Sort subwords by frequency
sorted_subwords = sorted(subwords_freqs.items(), key=lambda x: x[1], reverse=True)
sorted_subwords[:10]

[('▁t', 7),
 ('is', 5),
 ('er', 5),
 ('▁a', 5),
 ('▁to', 4),
 ('to', 4),
 ('en', 4),
 ('▁T', 3),
 ('▁Th', 3),
 ('▁Thi', 3)]

In [29]:
token_freqs = list(char_freqs.items()) + sorted_subwords[: 300 - len(char_freqs)]
token_freqs = {token: freq for token, freq in token_freqs}

💡 **Tip:** SentencePiece uses a more efficient algorithm called <font color='blue'>Enhanced Suffix Array (ESA)</font> to create the initial vocabulary.

Next, we compute the <font color='blue'>sum of all frequencies</font>, to <font color='blue'>convert</font> the <font color='blue'>frequencies</font> into <font color='blue'>probabilities</font>. For our model we will <font color='blue'>store</font> the <font color='blue'>logarithms</font> of the <font color='blue'>probabilities</font>, because it's more numerically stable to add logarithms than to multiply small numbers, and this will simplify the computation of the loss of the model:

In [30]:
from math import log

total_sum = sum([freq for token, freq in token_freqs.items()])
model = {token: -log(freq / total_sum) for token, freq in token_freqs.items()}
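
To see why logarithms are the safer choice, consider multiplying many small probabilities. The values below are arbitrary and chosen only to trigger the effect.

```python
from math import log

# 100 tokens, each with probability 1e-5 (arbitrary illustrative values)
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p  # shrinks until it can no longer be represented

log_sum = sum(log(p) for p in probs)

print(product)  # the product underflows to 0.0
print(log_sum)  # the sum of logs remains an ordinary finite number
```

Once the product has underflowed, every segmentation would look equally (im)probable, whereas sums of log probabilities can still be compared.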

Now the main function is the one that <font color='blue'>tokenizes words</font> using the <font color='blue'>Viterbi algorithm</font>. As we saw before, that algorithm computes the <font color='blue'>best segmentation</font> of <font color='blue'>each substring</font> of the <font color='blue'>word</font>, which we will store in a variable named `best_segmentations`. We will store <font color='blue'>one dictionary</font> per <font color='blue'>position</font> in the <font color='blue'>word</font> (from 0 to its total length), with <font color='blue'>two keys</font>: the <font color='blue'>index</font> of the <font color='blue'>start</font> of the <font color='blue'>last token</font> in the <font color='blue'>best segmentation</font>, and the <font color='blue'>score</font> of the <font color='blue'>best segmentation</font>. With the <font color='blue'>index</font> of the <font color='blue'>start</font> of the <font color='blue'>last token</font>, we will be able to <font color='blue'>retrieve</font> the <font color='blue'>full segmentation</font> once the <font color='blue'>list</font> is <font color='blue'>completely populated</font>.

Populating the list is done with just <font color='blue'>two loops</font>: the <font color='blue'>main loop</font> goes over <font color='blue'>each start position</font>, and the <font color='blue'>second loop</font> tries <font color='blue'>all substrings beginning</font> at that <font color='blue'>start position</font>. If the <font color='blue'>substring</font> is in the <font color='blue'>vocabulary</font>, we have a <font color='blue'>new segmentation</font> of the <font color='blue'>word</font> up until that <font color='blue'>end position</font>, which we compare to what is in `best_segmentations`.

Once the main loop is finished, we just <font color='blue'>start from the end </font>and <font color='blue'>hop</font> from <font color='blue'>one start position</font> to the <font color='blue'>next</font>, <font color='blue'>recording</font> the <font color='blue'>tokens</font> as we go, <font color='blue'>until</font> we <font color='blue'>reach</font> the <font color='blue'>start</font> of <font color='blue'>the word</font>:

In [31]:
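def encode_word(word, model):
    # Scores are sums of -log(probability), so lower is better. Note that the
    # empty prefix starts at 1 rather than 0, which adds the same constant to
    # every candidate segmentation and so does not change which one wins.
    best_segmentations = [{"start": 0, "score": 1}] + [
        {"start": None, "score": None} for _ in range(len(word))
    ]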
    for start_idx in range(len(word)):
        # This should be properly filled by the previous steps of the loop
        best_score_at_start = best_segmentations[start_idx]["score"]
        for end_idx in range(start_idx + 1, len(word) + 1):
            token = word[start_idx:end_idx]
            if token in model and best_score_at_start is not None:
                score = model[token] + best_score_at_start
                # If we have found a better segmentation ending at end_idx, we update
                if (
                    best_segmentations[end_idx]["score"] is None
                    or best_segmentations[end_idx]["score"] > score
                ):
                    best_segmentations[end_idx] = {"start": start_idx, "score": score}

    segmentation = best_segmentations[-1]
    if segmentation["score"] is None:
        # We did not find a tokenization of the word -> unknown
        return ["<unk>"], None

    score = segmentation["score"]
    start = segmentation["start"]
    end = len(word)
    tokens = []
    while start != 0:
        tokens.insert(0, word[start:end])
        next_start = best_segmentations[start]["start"]
        end = start
        start = next_start
    tokens.insert(0, word[start:end])
    return tokens, score

We can already try our initial model on some words:

In [32]:
print(encode_word("Hopefully", model))
print(encode_word("This", model))

(['H', 'o', 'p', 'e', 'f', 'u', 'll', 'y'], 41.5157494601402)
(['This'], 6.288267030694535)


Now it's straightforward to compute the <font color='blue'>loss</font> of the <font color='blue'>model</font> on the corpus!

In [33]:
def compute_loss(model):
    loss = 0
    for word, freq in word_freqs.items():
        _, word_loss = encode_word(word, model)
        loss += freq * word_loss
    return loss

We can check it works on the model we have:

In [34]:
compute_loss(model)

413.10377642940875

Computing the <font color='blue'>scores</font> for <font color='blue'>each token</font> is not very hard either; we just have to <font color='blue'>compute</font> the <font color='blue'>loss</font> for the <font color='blue'>models</font> obtained by deleting each token:

In [36]:
import copy

def compute_scores(model):
    scores = {}
    model_loss = compute_loss(model)
    for token, score in model.items():
        # We always keep tokens of length 1
        if len(token) == 1:
            continue
        model_without_token = copy.deepcopy(model)
        _ = model_without_token.pop(token)
        scores[token] = compute_loss(model_without_token) - model_loss
    return scores

We can try it on a <font color='blue'>given token</font>:

In [37]:
scores = compute_scores(model)
print(scores["ll"])
print(scores["his"])

6.376412403623874
0.0


Since `"ll"` is used in the tokenization of `"Hopefully"`, and removing it will probably make us use the token `"l"` twice instead, we expect it will have a positive loss. `"his"` is only used inside the word `"This"`, which is tokenized as itself, so we expect it to have a zero loss.

💡 **Tip:** This approach is <font color='blue'>very inefficient</font>, so <font color='blue'>SentencePiece</font> uses an <font color='blue'>approximation</font> of the <font color='blue'>loss of the model</font> without token X: <font color='blue'>instead</font> of <font color='blue'>starting from scratch</font>, it just replaces token X by its segmentation in the vocabulary that is left. This way, all the scores can be computed at once at the same time as the model loss.

With all of this in place, the last thing we need to do is <font color='blue'>add</font> the <font color='blue'>special tokens</font> used by the model <font color='blue'>to the vocabulary</font>, then <font color='blue'>loop</font> until we have <font color='blue'>pruned enough tokens</font> from the <font color='blue'>vocabulary</font> to reach our desired size:

In [38]:
percent_to_remove = 0.1
while len(model) > 100:
    scores = compute_scores(model)
    sorted_scores = sorted(scores.items(), key=lambda x: x[1])
    # Remove percent_to_remove tokens with the lowest scores.
    for i in range(int(len(model) * percent_to_remove)):
        _ = token_freqs.pop(sorted_scores[i][0])

    total_sum = sum([freq for token, freq in token_freqs.items()])
    model = {token: -log(freq / total_sum) for token, freq in token_freqs.items()}

Then, to <font color='blue'>tokenize some text</font>, we just need to <font color='blue'>apply</font> the <font color='blue'>pre-tokenization</font> and then use our `encode_word()` function:

In [39]:
def tokenize(text, model):
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in words_with_offsets]
    encoded_words = [encode_word(word, model)[0] for word in pre_tokenized_text]
    return sum(encoded_words, [])


tokenize("This is the Hugging Face course.", model)

['▁This',
 '▁is',
 '▁the',
 '▁Hugging',
 '▁Face',
 '▁',
 'c',
 'ou',
 'r',
 's',
 'e',
 '.']

💡 **Tip:** The XLNetTokenizer uses SentencePiece, which is why the `"▁"` character is included. To decode with SentencePiece, concatenate all the tokens and replace `"▁"` with a space.
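
A minimal sketch of that decoding step, applied to the tokens produced above. The function name `decode` is ours, and dropping the leading space is an assumption made here for readability.

```python
META_SPACE = "\u2581"  # the "▁" character that marks a space


def decode(tokens):
    """Join the tokens and turn the space markers back into spaces."""
    text = "".join(tokens).replace(META_SPACE, " ")
    # Assumption: the marker at the very start of the text is not wanted
    return text[1:] if text.startswith(" ") else text


tokens = ["▁This", "▁is", "▁the", "▁Hugging", "▁Face", "▁", "c", "ou", "r", "s", "e", "."]
print(decode(tokens))
```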

That's it for Unigram! Hopefully by now you're feeling like an expert in all things tokenizer. In the next section, we will delve into the building blocks of the 🤗 Tokenizers library, and show you how you can use them to build your own tokenizer.