# Tokenizers

In this notebook, you'll learn how to work with tokenizers using the Hugging Face library.

We'll explore several tokenizers from different Large Language Models (LLMs) and see how each one handles tokenization differently.

## ⚙️ Setup Workspace

We start by setting up the workspace: mounting Google Drive (when running on Colab), installing the `transformers` library, and silencing warnings.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install transformers

In [None]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

## 🧩 Tokenization

### Tokenization Definition

<p style="background-color:#fff1d7; padding:15px; "> <b>💡 </b> <b>Tokenizers</b> are essential tools in Natural Language Processing (NLP), acting as a bridge between human language and machine learning models. <br>
Their main role is to <b>break down raw text into tokens</b> (units such as words, subwords, or characters) and then convert these tokens into <b>numerical IDs</b> that models can understand.</p><br>
This process, called <b>tokenization</b>, is a key step before feeding text into a model.

<img src="https://raw.githubusercontent.com/mallociFrancesca/XAIKGRLGM/a77f9ea5633475efe43038ef2a11e1341342e0ef/hands-on-session/tokenization.png" alt="Tokenization Diagram" width="300">


### Tokenization Pipeline

An implementation of a tokenizer consists of the following pipeline of processes, each applying different transformations to the textual information:

<img src="https://raw.githubusercontent.com/mallociFrancesca/XAIKGRLGM/a77f9ea5633475efe43038ef2a11e1341342e0ef/hands-on-session/tokenization-pipeline.png" alt="Tokenization Pipeline" width="600">


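- **Normalization**: Standardize the text (e.g., remove accents, lowercase)
- **Pre-tokenization**: Split the text into basic units such as words and punctuation marks
- **Model**: Split those units into tokens from the learned vocabulary (e.g., with BPE, WordPiece, or Unigram)
- **Postprocessor**: Add special tokens (e.g., `[CLS]`, `[SEP]`)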
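
The four stages above can be sketched in plain Python. This is a toy illustration, not the Hugging Face implementation: the vocabulary `TOY_VOCAB` and all function names are made up for this example, and the model step uses a simple greedy longest-match split in the style of WordPiece.

```python
import re
import unicodedata

# Hypothetical toy vocabulary, for illustration only
TOY_VOCAB = {"[CLS]", "[SEP]", "[UNK]", "hello", "world", "token", "##izer", "##s", "!"}


def normalize(text: str) -> str:
    """Normalization: lowercase the text and strip accents."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def pre_tokenize(text: str) -> list[str]:
    """Pre-tokenization: split into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def wordpiece(word: str, vocab: set[str]) -> list[str]:
    """Model: greedy longest-match-first subword split (WordPiece style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Pieces that do not start the word carry the '##' prefix
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            # No prefix of the remaining text is in the vocabulary
            return ["[UNK]"]
        start = end
    return pieces


def post_process(pieces: list[str]) -> list[str]:
    """Postprocessor: wrap the sequence in special tokens."""
    return ["[CLS]", *pieces, "[SEP]"]


def toy_tokenize(text: str) -> list[str]:
    words = pre_tokenize(normalize(text))
    pieces = [p for w in words for p in wordpiece(w, TOY_VOCAB)]
    return post_process(pieces)


print(toy_tokenize("Héllo tokenizers!"))
# ['[CLS]', 'hello', 'token', '##izer', '##s', '!', '[SEP]']
```

Real tokenizers learn their vocabulary from data and may implement each stage differently, so treat this only as a mental model of the pipeline.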

There are various tokenization methods, and typically, a tokenizer must be trained on a dataset.

Generally, each model has its own tokenizer.

However, the Hugging Face library provides pre-trained tokenizers that can be used out of the box 🤗.

<p style="background-color:#f2f2ff; padding:15px; border-width:3px; border-color:#e2e2ff; border-style:solid; border-radius:6px">
🪄 The transformers library has a set of <b>Auto classes</b>, like <code>AutoConfig</code>, <code>AutoModel</code>, and <code>AutoTokenizer</code>. <br> The <b>Auto classes</b> infer the right concrete class from the model name, so you do not have to pick it yourself.</p>

Let’s see how to tokenize a text using the `AutoTokenizer` class.

## ✂️ Tokenizing Text

In this section, you will tokenize the sentence **Hello world!** using the tokenizer of the [`Helsinki-NLP/opus-mt-en-it` translation model](https://huggingface.co/Helsinki-NLP/opus-mt-en-it).

Let's import the `AutoTokenizer` class, instantiate the tokenizer, and define the sentence to tokenize.

In [None]:
from transformers import AutoTokenizer

# Specify the model name
model_name = 'Helsinki-NLP/opus-mt-en-it'

# load the pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(type(tokenizer))

<class 'transformers.models.marian.tokenization_marian.MarianTokenizer'>


In [None]:
# define the sentence to tokenize
sentence = "Hello world!"
# Apply the tokenizer to the sentence.
tokens = tokenizer(sentence)
print(tokens)

{'input_ids': [226, 1127, 499, 49, 0], 'attention_mask': [1, 1, 1, 1, 1]}


The tokenizer splits the sentence into tokens and returns the ID of each token.

The returned dictionary contains:

* [input_ids](https://huggingface.co/docs/transformers/main/en/glossary#input-ids): numerical representations of the tokens.
* [attention_mask](https://huggingface.co/docs/transformers/main/en/glossary#attention-mask): indicates which tokens should be attended to.

In [None]:
# Extract the token IDs directly
token_ids = tokenizer(sentence).input_ids
print(token_ids)

[226, 1127, 499, 49, 0]


To map each token ID to its corresponding token, you can use the `decode` method of the tokenizer.

In [None]:
for token_id in token_ids:
    print(tokenizer.decode(token_id))

H
ello
world
!
</s>


The special token `</s>` is automatically added by the tokenizer and it is used to indicate the end of the input text.

## 🔡 Understanding Tokenization Space

Each token the model understands is mapped to a unique integer ID.

To better understand how this mapping works, we can access the tokenizer vocabulary via the `.get_vocab()` method, which provides the mapping between tokens and respective IDs.

**Vocabulary**

In [None]:
import random

# Get the tokenizer's vocabulary (token -> ID)
vocabulary = tokenizer.get_vocab()

# Get all the tokens (i.e., the vocabulary keys)
vocab_keys = list(vocabulary.keys())

# Shuffle the tokens randomly
random.shuffle(vocab_keys)

# Show 10 random tokens along with their IDs
{ k: vocabulary[k] for k in vocab_keys[:10] }

{'.2.2007,': 53163,
 'นี': 57478,
 '가': 56056,
 '▁probation': 38013,
 '白色蝴蝶兰': 77080,
 '▁Ringrazi': 16562,
 '▁Coreper': 36359,
 '▁Chloro': 50328,
 '▁6,': 5763,
 '42/2010': 51362}

In [None]:
print("Total vocabulary size:", tokenizer.vocab_size)

Total vocabulary size: 80035


The vocabulary size indicates that the tokenizer can recognize and encode 80,035 distinct tokens.
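
Since the vocabulary is a plain token-to-ID dictionary, looking up IDs and mapping them back to tokens are both dictionary operations. The sketch below uses a small hand-written vocabulary (`toy_vocab`, hypothetical and unrelated to the real 80,035-entry one) to show the idea, including the fallback to `<unk>` for unknown tokens.

```python
# Hypothetical toy vocabulary (token -> ID), for illustration only
toy_vocab = {"</s>": 0, "<unk>": 1, "hello": 2, "world": 3, "!": 4, "<pad>": 5}

# Inverting the mapping gives ID -> token
id_to_token = {token_id: token for token, token_id in toy_vocab.items()}


def encode(tokens: list[str]) -> list[int]:
    """Map tokens to IDs, falling back to <unk> for unknown tokens."""
    return [toy_vocab.get(token, toy_vocab["<unk>"]) for token in tokens]


def decode(ids: list[int]) -> list[str]:
    """Map IDs back to tokens."""
    return [id_to_token[i] for i in ids]


ids = encode(["hello", "world", "!", "</s>"])
print(ids)                # [2, 3, 4, 0]
print(decode(ids))        # ['hello', 'world', '!', '</s>']
print(encode(["pizza"]))  # [1]  (not in the toy vocabulary)
```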

What are the **special tokens** in this vocabulary?

In [None]:
# Get the list of all special tokens
special_tokens = tokenizer.all_special_tokens

# Get the corresponding IDs of the special tokens
special_token_ids = tokenizer.all_special_ids

print(f"Number of special tokens: {len(special_tokens)}")
print(f"Special tokens: {special_tokens}")
print(f"Special tokens ID: {special_token_ids}")

Number of special tokens: 3
Special tokens: ['</s>', '<unk>', '<pad>']
Special tokens ID: [0, 1, 80034]


* `</s>`: marks the end of the sentence
* `<unk>`: stands in for any piece of text that is not in the vocabulary
* `<pad>`: used to pad shorter sequences in a batch so that all inputs are the same length

Indeed, the token IDs of our sentence end with `0`, the ID of `</s>`:

In [None]:
print(tokenizer.decode(token_ids))
print(token_ids)

Hello world!</s>
[226, 1127, 499, 49, 0]


We can include special tokens inside of the strings themselves. For instance:

In [None]:
tokenizer("hello!</s>")

{'input_ids': [1711, 1127, 49, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

The tokenizer does not split `</s>` like ordinary text: it recognizes it as a special token and maps it to a single ID (`0`).

Note that the tokenizer still appends its own `</s>` at the end, which is why `0` appears twice in the output.
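
One way to picture this behavior is to split the raw string on the special tokens first, and only then tokenize the ordinary chunks. The helper below is a simplified illustration under that assumption; `split_on_special_tokens` is a made-up name and this is not how the library implements it internally.

```python
import re

# Special tokens of the tokenizer used in this notebook
SPECIAL_TOKENS = ["</s>", "<unk>", "<pad>"]


def split_on_special_tokens(text: str, special_tokens=SPECIAL_TOKENS) -> list[str]:
    """Split text so that every special token becomes its own chunk."""
    # The capturing group makes re.split keep the matched special tokens
    pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    return [chunk for chunk in re.split(pattern, text) if chunk]


print(split_on_special_tokens("hello!</s>"))
# ['hello!', '</s>']
```

Only the `'hello!'` chunk would then go through the usual subword splitting, while `'</s>'` maps directly to its ID.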

> ⚠️ These details are specific to the tokenizer being used.

>  **The tokenizer is responsible for deciding how to tokenize the input text.**

> As a result, you might notice varying behavior depending on which tokenizer is applied.

Now, in the next section, let's examine these differences by comparing three models.

## 🔍 Visualizing Tokenization Differences

In this section, you'll compare the tokenizer of `Helsinki-NLP/opus-mt-en-it` with those of `Llama-2-7b-mt-Italian-to-English` and `bert-base-cased`, focusing on their respective tokenization strategies.

In [None]:
# Define the text to use for comparing different tokenization strategies of each model
text = """
Language Mix & punctuation!
🧠 HOLA
display_tokens Pizza True if != <= import: one tab:"\t" Four tabs: "        "
7.5+42=49.5
"""

You'll wrap the code from the previous section into a function called `show_tokens`.
The function takes in a text and the model name, and prints the vocabulary length of the tokenizer and a colored list of the tokens.

In [None]:
# A list of colors in RGB for representing the tokens
colors = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

def show_tokens(sentence: str, tokenizer_name: str):
    """ Show the tokens each separated by a different color """

    # Load the tokenizer and tokenize the input
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids

    # Extract vocabulary length
    print(f"Vocab length: {len(tokenizer)}")

    # Print a colored list of tokens
    for idx, t in enumerate(token_ids):
        print(
            f'\x1b[0;30;48;2;{colors[idx % len(colors)]}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )
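
The escape sequences used in `show_tokens` are standard ANSI color codes: `0` resets the style, `30` selects a black foreground, and `48;2;R;G;B` sets a 24-bit background color, while `\x1b[0m` resets everything afterwards. The small helper below (`colorize` is a name introduced here for illustration) isolates that part, assuming a terminal or notebook that renders 24-bit ANSI colors.

```python
def colorize(text: str, rgb: str) -> str:
    """Wrap text in an ANSI 24-bit background color; rgb is given as 'R;G;B'."""
    start = f"\x1b[0;30;48;2;{rgb}m"  # reset, black foreground, RGB background
    reset = "\x1b[0m"                 # back to the default style
    return start + text + reset


palette = ["102;194;165", "252;141;98"]
pieces = ["token", "izer"]

# Cycle through the palette so neighbouring pieces get different colors
line = " ".join(colorize(p, palette[i % len(palette)]) for i, p in enumerate(pieces))
print(line)
```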

Consider the following features when you're doing your comparison:
- Vocabulary length
- Special tokens
- Tokenization of the tabs, special characters and special keywords


**Helsinki-NLP/opus-mt-en-it**

In [None]:
show_tokens(text, "Helsinki-NLP/opus-mt-en-it")

Vocab length: 80035
Language Mix & p un ctu ation !  <unk> H OLA display _ to ken s P izza True if ! = <= import : one tab : " " [...]

**Llama-2-7b-mt-Italian-to-English**

In [None]:
show_tokens(text, "kaitchup/Llama-2-7b-mt-Italian-to-English")

Vocab length: 32000
<s>  
 Language Mix & pun ctu ation ! 
 � � � � HO LA 
 display _ to kens P izza True if != <= import : one [...]

**bert-base-cased**

In [None]:
show_tokens(text, "bert-base-cased")

Vocab length: 28996
[CLS] Language Mix & pu ##nc ##tu ##ation ! [UNK] H ##OL ##A display _ token ##s Pizza True if ! = < = import : one ta ##b : " [...]

---
**👉 Note:**
Each model typically defines its own set of special tokens.
Some of these tokens are essential during training, while others may be helpful during inference.

The `tokenizer` object provides convenient attributes to access these special tokens. Here are some common examples:

- **`pad_token`** — used to pad input sequences to the same length (as discussed later),
- **`bos_token`** and **`eos_token`** — mark the beginning and end of a sequence, respectively,
- **`mask_token`** — used to mask tokens during training,
- **`sep_token`** — separates segments or sentences (e.g., next sentence prediction),
- **`cls_token`** — indicates the start of a sequence (e.g., for classification tasks),
- **`unk_token`** — used to indicate out-of-vocabulary tokens (i.e. tokens that are not in the vocabulary).

> ⚠️  Not all tokenizers use all of these special tokens.
If a token isn't used by a specific tokenizer, its corresponding attribute will be set to `None`.
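
A quick way to see which of these attributes a tokenizer actually defines is to read them all and keep the ones that are not `None`. The sketch below uses a stand-in object built with `SimpleNamespace` (an assumption made so the example runs without downloading a model); it mimics the three special tokens printed earlier for `Helsinki-NLP/opus-mt-en-it`. The helper name `describe_special_tokens` is made up for this example.

```python
from types import SimpleNamespace

SPECIAL_ATTRS = [
    "pad_token", "bos_token", "eos_token", "mask_token",
    "sep_token", "cls_token", "unk_token",
]


def describe_special_tokens(tokenizer) -> dict:
    """Return each special-token attribute, or None when it is not set."""
    return {name: getattr(tokenizer, name, None) for name in SPECIAL_ATTRS}


# Stand-in for a real tokenizer object (illustration only)
fake_tokenizer = SimpleNamespace(eos_token="</s>", unk_token="<unk>", pad_token="<pad>")

used = {k: v for k, v in describe_special_tokens(fake_tokenizer).items() if v is not None}
print(used)
# {'pad_token': '<pad>', 'eos_token': '</s>', 'unk_token': '<unk>'}
```

Passing a tokenizer loaded with `AutoTokenizer.from_pretrained(...)` instead of `fake_tokenizer` should work the same way, since the function only reads attributes.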



Now, let's finish this notebook by examining how the tokenizer processes multiple sentences at once, and how padding tokens are used in this case.

## 📚 Batch Tokenization

Generally, especially at training time, we want to process several text sequences at once (e.g., an entire batch of sentences).

A tokenizer can also accept a list of sentences as follows:

In [None]:
sentences = [
    "My first sentence",
    "I'm the seconde sentence"
]

tokens = tokenizer(sentences)

for ids in tokens["input_ids"]:
    print(ids)

[888, 308, 11347, 0]
[22, 5, 98, 4, 993, 40, 11347, 0]


> ⚠️ Sentences of different lengths contain a different number of tokens.
However, for the model to process them properly, the input tensors must have **uniform dimensions** — meaning the same length across all sequences.


<p style="background-color:#f2f2ff; padding:15px; border-width:3px; border-color:#e2e2ff; border-style:solid; border-radius:6px">
To achieve this, we use <b>padding</b>: shorter sentences are extended with special padding tokens (such as <code>PAD</code>) so they match the length of the longest sentence in the batch.
</p>

Since these padding tokens aren't part of the actual input content, it's important to tell the model to ignore them during processing.

That's where the **`attention_mask`** comes in: it tells the model which tokens are real and which are just padding.
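
To make the mechanics concrete, here is a minimal sketch of right-padding written by hand. It is not the library's implementation; `pad_batch` is a name introduced for this example, and the IDs are the ones printed above for the two sentences, with `80034` as the `<pad>` ID reported earlier.

```python
def pad_batch(batch: list[list[int]], pad_id: int) -> dict:
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [888, 308, 11347, 0],
    [22, 5, 98, 4, 993, 40, 11347, 0],
]
padded = pad_batch(batch, pad_id=80034)

for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
# [888, 308, 11347, 0, 80034, 80034, 80034, 80034] [1, 1, 1, 1, 0, 0, 0, 0]
# [22, 5, 98, 4, 993, 40, 11347, 0] [1, 1, 1, 1, 1, 1, 1, 1]
```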

In [None]:
# Tokenize the sentences with padding enabled and print the IDs next to their attention masks
tokens = tokenizer(sentences, padding=True)

for tok, att in zip(tokens["input_ids"], tokens["attention_mask"]):
    print(tok, att)

[888, 308, 11347, 0, 80034, 80034, 80034, 80034] [1, 1, 1, 1, 0, 0, 0, 0]
[22, 5, 98, 4, 993, 40, 11347, 0] [1, 1, 1, 1, 1, 1, 1, 1]


To match the length of the second sentence, the first one is padded with four copies of `80034`, the ID of the `<pad>` token.

Similarly, the `attention_mask` for the first sentence includes `0`s in the positions where padding was added. This signals the model to skip those tokens during processing.


For completeness, we can also decode batches of sentences, with `tokenizer.batch_decode()`.

In [None]:
tokenizer.batch_decode(tokens["input_ids"])

['My first sentence</s> <pad> <pad> <pad> <pad>',
 "I'm the seconde sentence</s>"]
