# Tokenizer

## What's a tokenizer?



A tokenizer is like a translator that converts human text into numbers that computers can understand. Think of it as the bridge between how we communicate and how machines process language.

## The fundamental problem

Computers can only work with numbers, but we communicate with words, sentences, and complex language. So we need a way to convert "Hello, how are you?" into something like [7592, 11, 703, 389, 345, 30]. This conversion process is called tokenization, and the tool that does it is a tokenizer.
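To make the text-to-numbers step concrete, here is a minimal sketch built on a made-up vocabulary. The names `toy_vocab`, `encode`, and `decode` are assumptions for illustration, and the IDs are invented; real tokenizers have tens of thousands of entries and assign different IDs.

```python
import re

# Toy vocabulary: every token string gets an integer ID.
# These tokens and IDs are invented for illustration only.
toy_vocab = {"hello": 0, ",": 1, "how": 2, "are": 3, "you": 4, "?": 5, "[UNK]": 6}
id_to_token = {i: t for t, i in toy_vocab.items()}

def encode(text):
    """Lowercase, split into words and punctuation, then look up each piece."""
    pieces = re.findall(r"\w+|[^\w\s]", text.lower())
    return [toy_vocab.get(p, toy_vocab["[UNK]"]) for p in pieces]

def decode(ids):
    """Map each ID back to its token string."""
    return [id_to_token[i] for i in ids]

ids = encode("Hello, how are you?")
print(ids)          # [0, 1, 2, 3, 4, 5]
print(decode(ids))  # ['hello', ',', 'how', 'are', 'you', '?']
```

Anything outside the vocabulary falls back to the `[UNK]` ID, which is one reason the choice of vocabulary matters so much.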

But here's where it gets interesting: there are many different ways to break down text into pieces, just like there are different ways to slice a pizza. Each method has trade-offs, and different models were trained expecting their text to be "sliced" in specific ways.

## Understanding tokens through examples

Let's start with a simple example. Consider the word "unhappiness":

A **character-level** tokenizer would break it down letter by letter: ["u", "n", "h", "a", "p", "p", "i", "n", "e", "s", "s"]. This gives us 11 tokens.

A **word-level** tokenizer would keep it as one piece: ["unhappiness"]. This gives us 1 token.

A **subword** tokenizer (which most modern models use) might split it into chunks like ["un", "happi", "ness"]. This gives us 3 tokens, each close to a meaningful unit.

Why does this matter? The subword approach is useful because the pieces often line up with meaningful parts of a word: "un-" means "not," "happi" is the stem of "happy," and "-ness" turns adjectives into nouns. The model can learn these patterns and apply them to words it has never seen before. Real subword vocabularies are learned from frequency statistics, though, so actual splits do not always fall on such clean boundaries.
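The three granularities can be compared side by side in a short sketch. The subword step here uses hand-written prefix and suffix rules (`split_affixes` is a hypothetical helper, not how trained tokenizers work) purely to show the idea of reusable pieces. Note that the stem comes out as "happi", since that is the substring actually present in the word.

```python
def char_tokenize(word):
    """Character-level: one token per letter."""
    return list(word)

def word_tokenize(text):
    """Word-level: split on whitespace only."""
    return text.split()

def split_affixes(word, prefixes=("un",), suffixes=("ness",)):
    """Illustrative subword split: peel off one known prefix and one known suffix."""
    pieces = []
    for prefix in prefixes:
        if word.startswith(prefix) and len(word) > len(prefix):
            pieces.append(prefix)
            word = word[len(prefix):]
            break
    tail = None
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            tail = suffix
            word = word[:-len(suffix)]
            break
    pieces.append(word)
    if tail is not None:
        pieces.append(tail)
    return pieces

word = "unhappiness"
print(len(char_tokenize(word)))     # 11
print(word_tokenize(word))          # ['unhappiness']
print(split_affixes(word))          # ['un', 'happi', 'ness']
print(split_affixes("unkindness"))  # ['un', 'kind', 'ness']
```

The last line shows the payoff: "un" and "ness" are reused for a word the rules were never written for.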



## The vocabulary challenge

Every tokenizer comes with a vocabulary - essentially a dictionary of all the tokens it knows. Think of this like a chef's ingredient list. If you're making Italian food, your ingredients (vocabulary) might include "basil," "mozzarella," and "prosciutto." If you're making Japanese food, you'd need different ingredients like "miso," "nori," and "wasabi."

Similarly, a tokenizer trained on English text will have different vocabulary than one trained on Chinese text, or one trained on computer code, or one trained on medical texts. Each has learned to recognize the most important "ingredients" for its domain.
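As a rough sketch of how the training data shapes the vocabulary, the snippet below builds two tiny word-level vocabularies by frequency. The corpora, the `build_vocab` helper, and the size limit are all made up for illustration; real tokenizers learn subword vocabularies from far larger corpora.

```python
from collections import Counter

def build_vocab(corpus, max_size):
    """Keep the max_size most frequent whitespace-separated words."""
    counts = Counter(word for line in corpus for word in line.lower().split())
    return {word for word, _ in counts.most_common(max_size)}

# Two tiny made-up corpora standing in for different domains.
cooking_corpus = ["add basil and mozzarella", "slice the mozzarella", "add fresh basil"]
code_corpus = ["def add ( a , b ) :", "return a + b", "def main ( ) :"]

cooking_vocab = build_vocab(cooking_corpus, max_size=6)
code_vocab = build_vocab(code_corpus, max_size=6)

query = "add basil".split()
print([w in cooking_vocab for w in query])  # [True, True]
print([w in code_vocab for w in query])     # [False, False]
```

The same two words are fully covered by one vocabulary and entirely missing from the other, because each vocabulary only kept what was frequent in its own data.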

## Why different models need different tokenizers

Here's the crucial part: when a model like BERT or GPT was trained, it learned to understand text that was tokenized in a very specific way. It's like teaching someone a secret code - if you later try to communicate using a different code, they won't understand you.

Here is a concrete example:

In [2]:
from transformers import AutoTokenizer

In [3]:
# BERT was trained with WordPiece tokenization
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_tokens = bert_tokenizer.tokenize("preprocessing")
print("BERT sees:", bert_tokens)  # ['prep', '##ro', '##ces', '##sing']

# GPT-2 was trained with Byte-Pair Encoding
gpt_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt_tokens = gpt_tokenizer.tokenize("preprocessing")
print("GPT-2 sees:", gpt_tokens)  # ['pre', 'processing']

BERT sees: ['prep', '##ro', '##ces', '##sing']
GPT-2 sees: ['pre', 'processing']

Notice how they split the same word differently? BERT's vocabulary has no single entry for "processing", so it falls back to four pieces: "prep" followed by the continuations "##ro", "##ces", and "##sing". GPT-2's vocabulary happens to contain both "pre" and "processing", so it needs only two tokens. Neither split was designed by hand; each reflects which character sequences were common in that tokenizer's training data.

If you used BERT's tokenizer with a GPT-2 model, it would be like giving someone a message written in one code while they only know how to read a different code. The token IDs would point to the wrong entries in the model's embedding table, so its output would be meaningless.
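The mismatch can be illustrated without downloading any model. In this sketch, `vocab_a` and `vocab_b` are two invented vocabularies that contain the same tokens under different IDs; they do not reflect the real BERT or GPT-2 vocabularies.

```python
# Two made-up vocabularies; neither reflects a real model's IDs.
vocab_a = {"pre": 0, "processing": 1, "data": 2}
vocab_b = {"data": 0, "pre": 1, "processing": 2}

# What "model B" associates with each ID (here, simply the token string).
id_to_token_b = {i: t for t, i in vocab_b.items()}

tokens = ["pre", "processing"]
ids_from_a = [vocab_a[t] for t in tokens]           # produced by tokenizer A
seen_by_b = [id_to_token_b[i] for i in ids_from_a]  # how model B interprets them

print(ids_from_a)  # [0, 1]
print(seen_by_b)   # ['data', 'pre']
```

The IDs are perfectly valid numbers, but they mean something else to the second model, and nothing raises an error to warn you.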

## The three main tokenization strategies

**WordPiece tokenization** (used by the BERT family) starts with individual characters and gradually merges them into larger pieces, choosing merges by how much they improve the likelihood of the training data rather than by raw frequency alone. It prefixes "##" to any piece that continues a word instead of starting one. It's like building words from common syllables.
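Once a WordPiece vocabulary exists, BERT-style tokenizers apply it to a word by greedy longest-match-first lookup. The sketch below shows only that lookup step, not vocabulary training. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions, with the vocabulary chosen so that the split matches the BERT output recorded earlier for "preprocessing".

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; non-initial pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for this example.
toy_vocab = {"prep", "##ro", "##ces", "##sing", "un", "##able"}

print(wordpiece_tokenize("preprocessing", toy_vocab))  # ['prep', '##ro', '##ces', '##sing']
print(wordpiece_tokenize("unable", toy_vocab))         # ['un', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))            # ['[UNK]']
```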

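**Byte-Pair Encoding** (used by the GPT and RoBERTa families) repeatedly merges the most frequent adjacent pair of symbols into a new symbol. It uses no continuation marker; GPT-2 and RoBERTa run BPE over bytes and instead mark a token that follows a space with "Ġ". It's like finding the most common letter patterns and treating them as units.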
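The merge loop at the heart of BPE training fits in a few lines. This is a minimal character-level sketch on an invented word-frequency table; the helper names are assumptions, and production tokenizers such as GPT-2's work on bytes and add pre-tokenization steps not shown here.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn num_merges merge rules, most frequent pair first."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Made-up word frequencies for illustration.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, final_words = train_bpe(word_freqs, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

After three merges, "hug" has become a single symbol while the rarer "pug" is still two pieces, which is how frequent strings end up as whole tokens.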

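**SentencePiece** (used by T5 and many multilingual models) is a library rather than a single algorithm: it trains a BPE or unigram model directly on raw text, treating whitespace as an ordinary symbol and writing it as "▁" to show where words begin. Because it needs no separate word-splitting step, it is particularly good for languages that don't use spaces between words.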
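The "▁" convention is easiest to see in isolation. The sketch below covers only the whitespace handling (mark, split, and reverse), not the learned BPE or unigram segmentation that the real library performs; `to_pieces` and `from_pieces` are hypothetical helper names.

```python
META = "\u2581"  # the "▁" marker that stands in for a space

def to_pieces(text):
    """Mark word starts with ▁ and split before each marker (no learned model)."""
    marked = META + text.replace(" ", META)
    pieces, current = [], ""
    for ch in marked:
        if ch == META and current:
            pieces.append(current)
            current = ""
        current += ch
    if current:
        pieces.append(current)
    return pieces

def from_pieces(pieces):
    """Undo to_pieces: join, turn markers back into spaces, drop the leading one."""
    return "".join(pieces).replace(META, " ")[1:]

pieces = to_pieces("Hello world")
print(pieces)               # ['▁Hello', '▁world']
print(from_pieces(pieces))  # Hello world
```

Because the space is stored inside the tokens themselves, joining the pieces restores the original text without any language-specific rules.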

## Domain-specific considerations

Just as you'd use different vocabularies when talking to a doctor versus a mechanic, different tokenizers are optimized for different domains. A tokenizer trained on medical texts is more likely to keep a term like "hypertension" as a single token, while a general tokenizer might break it into several pieces that carry little meaning on their own.

A code-specific tokenizer tends to keep common identifiers and operators intact, while a general tokenizer might split "getUserName()" in ways that lose the programming meaning.

## The practical impact

This explains why you can't just swap tokenizers between models. Each model learned to understand the world through the lens of its specific tokenization scheme. It's not just about converting text to numbers - it's about converting text to the specific numbers that model was trained to understand.

When you see code like `AutoTokenizer.from_pretrained("bert-base-uncased")`, you're not just loading a tool - you're loading the exact "translation dictionary" that BERT expects. Using a different tokenizer would be like trying to use a French-English dictionary to translate Spanish text.


## More examples

In [4]:
# BERT uses WordPiece tokenization
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# GPT uses Byte-Pair Encoding (BPE)
gpt_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# RoBERTa also uses byte-level BPE, like GPT-2
roberta_tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Same text, different results:
text = "unhappiness"
print("BERT:", bert_tokenizer.tokenize(text))      # ['un', '##ha', '##pp', '##iness']
print("GPT-2:", gpt_tokenizer.tokenize(text))      # ['un', 'h', 'appiness']
print("RoBERTa:", roberta_tokenizer.tokenize(text)) # ['un', 'h', 'appiness']

BERT: ['un', '##ha', '##pp', '##iness']
GPT-2: ['un', 'h', 'appiness']
RoBERTa: ['un', 'h', 'appiness']
