# 📝 Tokenization — Introduction

**Tokenization** is the process of breaking a sentence into smaller pieces, or *tokens*.

A **Tokenizer** is a program that splits text into tokens.

Tokenizers generate tokens using one of three main approaches:

---

| **Type**          | **Description**                                                        | **Pros**                                                                                           | **Cons**                                               | **Example**                                   |
|-------------------|------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|--------------------------------------------------------|------------------------------------------------|
| **Word-based**    | Splits text at spaces/punctuation into full words.                     | Preserves meaning well.                                                                            | Large vocabulary, struggles with rare/unknown words.  | `"playing"` → `"playing"`                     |
| **Character-based** | Splits into individual characters.                                    | Very small vocabulary, handles unknown words.                                                      | Loses semantic meaning of whole words.                | `"playing"` → `"p"`, `"l"`, `"a"`, `"y"`, `"i"`, `"n"`, `"g"` |
| **Subword-based** | Splits common words whole, breaks rare words into smaller parts.        | Reduces vocabulary size, handles rare/unknown words, and keeps frequent words intact for semantics. | Slightly more complex tokenization process.           | `"playing"` → `"play"`, `"##ing"`              |

---
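To make the trade-offs in the table concrete, here is a minimal standard-library sketch. The helper names (`simple_word_tokenize`, `char_tokenize`) and the tiny training text are hypothetical, and real tokenizers use far richer rules. It only illustrates why a word-level vocabulary misses an unseen word while a character-level vocabulary can still cover it.

```python
import re

def simple_word_tokenize(text):
    # Lowercase, then keep runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def char_tokenize(text):
    # Every character (including spaces) becomes its own token.
    return list(text.lower())

train_text = "we love playing token games"  # hypothetical training text
word_vocab = set(simple_word_tokenize(train_text))
char_vocab = set(char_tokenize(train_text))

new_word = "playtime"  # never seen as a whole word in the training text
word_ok = all(t in word_vocab for t in simple_word_tokenize(new_word))
char_ok = all(c in char_vocab for c in char_tokenize(new_word))

print("word-level vocabulary covers 'playtime':", word_ok)
print("char-level vocabulary covers 'playtime':", char_ok)
```

The word-level check fails because `playtime` was never seen as a whole word, while every one of its characters already occurs in the training text.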


## Word-Based

### **NLTK** (Natural Language Toolkit)
A classic Python library for teaching and experimenting with Natural Language Processing (NLP). Its `word_tokenize` function applies rule-based word splitting and relies on the **Punkt** sentence-tokenizer data, which must be downloaded via `nltk.download()`.

In [1]:
import nltk
# NLTK doesn't ship its tokenizer data inside the Python package; it stores it in separate files that must be downloaded once.
nltk.download('punkt')
nltk.download('punkt_tab')

sentence = "We love playing playful NLP token games, don't we?"

tokens = nltk.word_tokenize(sentence)
print(tokens)

['We', 'love', 'playing', 'playful', 'NLP', 'token', 'games', ',', 'do', "n't", 'we', '?']


[nltk_data] Downloading package punkt to /Users/jonas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/jonas/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### **spaCy**
Modern, fast, production-ready NLP library. Includes an efficient tokenizer that handles many edge cases automatically.

In [19]:
import spacy
# Download the pre-trained pipeline (en_core_web_sm), then load it and its tokenizer rules into memory.
spacy.cli.download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")
# Runs the tokenizer plus other pipeline components not covered in this notebook (e.g. POS tagging and NER).
doc = nlp(sentence)

# Wrapper Function
def spacy_tok(text: str):
    return [t.text for t in nlp.make_doc(text)]  # call tokenizer directly

tokens = spacy_tok(sentence)
print(tokens)

['We', 'love', 'playing', 'playful', 'NLP', 'token', 'games', ',', 'do', "n't", 'we', '?']


### Special Tokens in Word-Based Tokenizers

- **`<unk>`** (unknown): Stands in for **out-of-vocabulary (OOV)** tokens, i.e. words the vocabulary does not contain.
- **`<pad>`** (padding): Appended to shorter sequences until every sequence in a batch is as long as the longest one.
- **`<bos>`** (begin-of-sequence): Tells the decoder when to start.
- **`<eos>`** (end-of-sequence): Tells the decoder when to stop.

The decoder is the part of the model that generates the output sequence step-by-step.
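As a rough illustration of where `<unk>` comes into play, the sketch below maps tokens to integer IDs and falls back to the `<unk>` ID for anything outside the vocabulary. The helpers `build_vocab` and `encode`, the ordering of the special tokens, and the toy data are all assumptions made for this example, not part of any library.

```python
def build_vocab(token_lists, specials=("<pad>", "<unk>", "<bos>", "<eos>")):
    # Reserve the first IDs for the special tokens (this ordering is an assumption).
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tokens in token_lists:
        for tok in tokens:
            # Assign the next free ID the first time a token is seen.
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab):
    # Any token missing from the vocabulary is mapped to the <unk> ID.
    unk_id = vocab["<unk>"]
    return [vocab.get(tok, unk_id) for tok in tokens]

train_tokens = [["we", "love", "token", "games"]]  # toy training data
vocab = build_vocab(train_tokens)
ids = encode(["<bos>", "we", "love", "unicorns", "<eos>"], vocab)
print(ids)  # "unicorns" is out-of-vocabulary, so it gets the <unk> ID
```

Reserving a fixed ID for `<pad>` (here 0) is a common convention because it makes padded positions easy to mask later, but the specific numbering is a design choice.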

In [5]:
from torchtext.data.utils import get_tokenizer

def add_specials(lines, tok, bos="<bos>", eos="<eos>", pad="<pad>"):
    seqs = [[bos] + tok(s) + [eos] for s in lines]
    max_len = max(len(x) for x in seqs)
    return [x + [pad] * (max_len - len(x)) for x in seqs]

# tokenizer used inside the helper
tok = get_tokenizer("basic_english")  # or: get_tokenizer("spacy", language="en_core_web_sm")
lines = [sentence, "padding example"]
tokens = add_specials(lines, tok)
print(tokens)

[['<bos>', 'we', 'love', 'playing', 'playful', 'nlp', 'token', 'games', ',', 'don', "'", 't', 'we', '?', '<eos>'], ['<bos>', 'padding', 'example', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']]


## Character-Based

Character-based tokenization splits text into individual characters rather than words or subwords. One example is the **Keras Tokenizer** (from `tensorflow.keras.preprocessing.text`), which can be configured with `char_level=True` to perform this type of tokenization.

In [4]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Create a character-level tokenizer
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts([sentence])

# Token IDs
ids = tokenizer.texts_to_sequences([sentence])[0]

# Map IDs back to characters
reverse_map = {v: k for k, v in tokenizer.word_index.items()}
chars = [reverse_map[i] for i in ids]

print("Characters:", chars)
print("Token IDs:", ids)

Characters: ['w', 'e', ' ', 'l', 'o', 'v', 'e', ' ', 'p', 'l', 'a', 'y', 'i', 'n', 'g', ' ', 'p', 'l', 'a', 'y', 'f', 'u', 'l', ' ', 'n', 'l', 'p', ' ', 't', 'o', 'k', 'e', 'n', ' ', 'g', 'a', 'm', 'e', 's', ',', ' ', 'd', 'o', 'n', "'", 't', ' ', 'w', 'e', '?']
Token IDs: [8, 2, 1, 3, 5, 12, 2, 1, 6, 3, 7, 9, 13, 4, 10, 1, 6, 3, 7, 9, 14, 15, 3, 1, 4, 3, 6, 1, 11, 5, 16, 2, 4, 1, 10, 7, 17, 2, 18, 19, 1, 20, 5, 4, 21, 11, 1, 8, 2, 22]


**Why is "w" = 8?**

1. **Build Vocabulary**  
   `tokenizer.fit_on_texts([sentence])` scans the text (with `char_level=True`) and collects all unique characters.

2. **Assign Index Numbers**  
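   Each character gets a numeric ID starting at **1**, ordered **by frequency** in the text:  
   - Most frequent → ID 1  
   - Next most frequent → ID 2, etc.  
   - Rare characters get higher IDs.
   - `"W"` is lowercased to `"w"` first (the tokenizer's default `lower=True`). `"w"` occurs twice and ranks 8th; judging by the IDs in the output, characters with equal counts are ordered by first appearance.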

3. **Convert Text → Sequences**  
   `tokenizer.texts_to_sequences([sentence])` replaces each character with its assigned ID.
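The ID assignment described in these steps can be approximated with `collections.Counter`. This is a sketch, not the Keras implementation: it assumes lowercasing and first-appearance tie-breaking, both of which are consistent with the IDs printed in the output.

```python
from collections import Counter

sentence = "We love playing playful NLP token games, don't we?"

# Count every character after lowercasing (assumed to mirror lower=True).
counts = Counter(sentence.lower())

# most_common() sorts by count; equal counts keep first-encountered order.
char_to_id = {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

print("count of 'w':", counts["w"], "| ID of 'w':", char_to_id["w"])
print("IDs 1-3:", [ch for ch, _ in counts.most_common(3)])
```

Here `"w"` occurs only twice, the same as `"y"`, `"g"`, and `"t"`, but it appears first in the sentence and therefore gets the lowest ID of that group.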


In [5]:
print(f"Most frequent character (ID 1): '{reverse_map[1]}'")
print(f"Second most frequent character (ID 2): '{reverse_map[2]}'")
print(f"Least frequent character (highest ID {len(reverse_map)}): '{reverse_map[len(reverse_map)]}'")

Most frequent character (ID 1): ' '
Second most frequent character (ID 2): 'e'
Least frequent character (highest ID 22): '?'


## Subword-Based

### WordPiece

WordPiece is a **subword tokenization algorithm** developed at **Google** for speech recognition and later used in **BERT**.

**How it works**

1. It starts small: the initial vocabulary contains every character that appears in the training data.
2. It then iteratively merges vocabulary entries into longer pieces that better represent the training text, until the vocabulary reaches the target size (e.g., 30k).

**Example**  
Training text: `"playing"`

Start vocab:  
`['p', 'l', 'a', 'y', 'i', 'n', 'g']`

Merge a pair whenever doing so improves the likelihood of the training data:  
An early merge might be `"p"` + `"l"` → `"pl"`.  
A later merge might be `"play"` + `"ing"`.

Final tokenization might be:  
`"playing"` → `["play", "##ing"]`

`##` means “continuation of a previous token” in BERT’s WordPiece.
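The sketch below shows one way a single merge step could be scored. It assumes the pair score commonly described for WordPiece, `freq(pair) / (freq(first) * freq(second))`, as a stand-in for "improves likelihood"; the original training code is not something this notebook relies on, and the toy word frequencies are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus: word -> frequency.
word_freqs = {"play": 5, "playing": 3, "pig": 2, "pin": 1}

def split_word(word):
    # First character as-is; the rest carry the "##" continuation prefix.
    return [word[0]] + ["##" + ch for ch in word[1:]]

def pair_scores(word_freqs):
    unit_freq, pair_freq = Counter(), Counter()
    for word, freq in word_freqs.items():
        units = split_word(word)
        for unit in units:
            unit_freq[unit] += freq
        for a, b in zip(units, units[1:]):
            pair_freq[(a, b)] += freq
    # Assumed score: pair frequency normalised by the frequencies of its parts.
    return {
        pair: freq / (unit_freq[pair[0]] * unit_freq[pair[1]])
        for pair, freq in pair_freq.items()
    }

scores = pair_scores(word_freqs)
best = max(scores, key=scores.get)
merged = best[0] + best[1][2:]  # drop the "##" of the second unit when joining
print("best pair:", best, "-> new vocabulary entry:", merged)
```

Under this score the most frequent pair does not automatically win: `("p", "##l")` occurs most often in the toy corpus, yet `("##i", "##n")` scores higher because its parts are rarer on their own.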

In [13]:
from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_tokenizer.tokenize(sentence)

['we',
 'love',
 'playing',
 'playful',
 'nl',
 '##p',
 'token',
 'games',
 ',',
 'don',
 "'",
 't',
 'we',
 '?']

In this output, "playing" appears as a whole token because the BERT vocabulary already contains it.
WordPiece only breaks a word into subwords (e.g., play + ##ing) when the whole word is not in the vocabulary, as happens here with "nlp" → `nl` + `##p`.
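At tokenization time, this behaviour can be sketched as a greedy longest-match-first lookup over a vocabulary. The function `wordpiece_tokenize` and the toy vocabulary below are hypothetical and written for illustration only; the real `BertTokenizer` also performs normalization and punctuation splitting before this step.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark continuation pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"play", "nl", "##p", "##ing", "##ful"}  # hypothetical vocabulary

print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("nlp", toy_vocab))
print(wordpiece_tokenize("playing", toy_vocab | {"playing"}))
```

Adding `"playing"` to the vocabulary makes the first attempted match succeed, which is why frequent words stay whole.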

### Unigram

**Unigram** is a **subword tokenization algorithm** that starts with a large vocabulary and gradually removes pieces that contribute the least to representing the training data.  

**How it works (Unigram)**

1. Start with a **large candidate vocabulary** (all characters + many possible substrings).
2. Assign each piece a **probability** of appearing in the text.
3. Iteratively **prune** the least useful pieces (those whose removal least hurts the model’s likelihood).
4. Stop when you reach the **target vocabulary size** (e.g., 32k).

**Example**
Training word: `"playing"`

Initial vocab:  
`['p', 'l', 'a', 'y', 'i', 'n', 'g', 'pl', 'play', 'ing']`

Step 1: Assign probabilities based on training data usage.  
Step 2: Remove rare or low-probability pieces (e.g., `'pl'` if it’s rarely used).  
Step 3: Keep high-probability pieces like `'play'` and `'ing'`.

Final tokenization might be:  
`"playing"` → `['play', 'ing']`
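Once each piece has a probability, a Unigram tokenizer can choose the segmentation with the highest total probability. The sketch below does this with a small dynamic program (Viterbi-style). The function name `unigram_segment` and the probabilities in `toy_probs` are made up for illustration and are not taken from any trained model.

```python
import math

def unigram_segment(word, probs):
    # best[i] = (best log-probability of word[:i], start index of its last piece)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        return None  # the word cannot be built from the available pieces
    # Walk backwards through the stored start indices to recover the pieces.
    pieces, end = [], len(word)
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

# Hypothetical piece probabilities for the example vocabulary.
toy_probs = {
    "p": 0.05, "l": 0.05, "a": 0.05, "y": 0.05, "i": 0.05, "n": 0.05, "g": 0.05,
    "pl": 0.02, "play": 0.30, "ing": 0.33,
}

print(unigram_segment("playing", toy_probs))

# Pruning "play" forces a less probable segmentation built from smaller pieces.
pruned = {k: v for k, v in toy_probs.items() if k != "play"}
print(unigram_segment("playing", pruned))
```

Comparing the two results gives a feel for why pruning a useful piece hurts the likelihood more than pruning a rarely used one such as `'pl'`.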

**SentencePiece** is the framework/tool that implements tokenization algorithms like Unigram, and it is a required dependency of the `XLNetTokenizer`.

In [14]:
from transformers import XLNetTokenizer

xlnet_tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
xlnet_tokenizer.tokenize(sentence)

['▁We',
 '▁love',
 '▁playing',
 '▁playful',
 '▁N',
 'LP',
 '▁token',
 '▁games',
 ',',
 '▁don',
 "'",
 't',
 '▁we',
 '?']

A "▁" prefix marks a token that starts a new word, i.e. one that was preceded by a space in the original text (or that begins the text, as with `▁We`).
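Because the "▁" marker records where the spaces were, the original text can be rebuilt from the pieces. A minimal sketch using the token list from the output above (plain string operations, no tokenizer library involved):

```python
pieces = ['▁We', '▁love', '▁playing', '▁playful', '▁N', 'LP', '▁token',
          '▁games', ',', '▁don', "'", 't', '▁we', '?']

# Join the pieces, turn each marker back into a space, drop the leading space.
text = "".join(pieces).replace("▁", " ").lstrip()
print(text)
```

This round trip works here because pieces without the marker (such as `LP` or `,`) attach directly to the preceding piece.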

# 📊 Tokenization Performance Analysis

In this section, we compare four tokenizers (NLTK, spaCy, `BertTokenizer`, `XLNetTokenizer`) by counting how often each token occurs and measuring each tool's processing time with `datetime`.

In [20]:
from datetime import datetime
from collections import Counter

test_corpus = [
    "We love playing playful NLP token games, don't we?",
    "Tokenization isn't magic—it's rules, data, and edge cases.",
    "Re-enter vs reenter vs re-enter: which one splits?",
    "Email me at alice.bob@example.co.uk or visit https://example.org.",
    "Café Münster costs €3.50 — deal?",
    "I’m not mad—just surprised 😄.",
    "IBM taught me tokenization; XLNet taught me SentencePiece.",
    "New words emerge daily: doomscrolling, micro-SaaS, and finfluencers.",
    "Version v2.1.0-alpha+build.7 fixed 12/08/2025 bugs.",
    "Numbers: 1,234,567 and 3.14 and 0.001%.",
    "Hashtags and mentions: #NLP @you @OpenAI",
    "Quotes: “smart” vs 'dumb' and ASCII vs UTF-8.",
    "Unicorns play, playing and replaying token games.",
    "中文字符混合 with English tokens.",
    "Pokémon and München are tricky for lowercasing.",
    "Let's test don't, couldn't, and shouldn't together."
]
    
def run_test(tokenize_fn, lines, name="tokenizer"):
    start = datetime.now()
    all_tokens = []
    for line in lines:
        all_tokens.extend(tokenize_fn(line))
    ms = (datetime.now() - start).total_seconds() * 1000.0
    freq = Counter(all_tokens)
    print(f"\n== {name} ==")
    print(f"time: {ms:.2f} ms | tokens: {len(all_tokens)} | unique: {len(freq)}")
    print("top 10:", freq.most_common(10))
    return ms, freq

tokenizers = [
    ("NLTK word_tokenize", nltk.word_tokenize),
    ("spaCy en_core_web_sm", spacy_tok),
    ("BERT WordPiece", bert_tokenizer.tokenize),
    ("XLNet Unigram", xlnet_tokenizer.tokenize)
]

for name, fn in tokenizers:
    run_test(fn, test_corpus, name)


== NLTK word_tokenize ==
time: 1.74 ms | tokens: 158 | unique: 107
top 10: [('.', 12), ('and', 9), (',', 8), (':', 6), ("n't", 5), ('vs', 4), ('?', 3), ('me', 3), ('@', 3), ('playing', 2)]

== spaCy en_core_web_sm ==
time: 3.77 ms | tokens: 165 | unique: 110
top 10: [('.', 12), ('and', 9), (',', 8), ("n't", 5), (':', 5), ('-', 4), ('vs', 4), ('?', 3), ('—', 3), ('me', 3)]

== BERT WordPiece ==
time: 4.72 ms | tokens: 247 | unique: 147
top 10: [('.', 22), (',', 10), ("'", 9), ('and', 9), (':', 6), ('token', 5), ('t', 5), ('-', 5), ('vs', 4), ('/', 4)]

== XLNet Unigram ==
time: 1.82 ms | tokens: 273 | unique: 162
top 10: [('.', 22), ('▁', 15), (',', 9), ("'", 9), ('s', 9), ('▁and', 9), ('t', 6), ('-', 5), (':', 5), ('▁token', 4)]
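Single-run timings over 16 short lines are likely to be noisy, and a first call may include one-off setup work. As a possible refinement, the sketch below repeats the measurement with `time.perf_counter` and reports the median. The helper `time_tokenizer` and the `repeats` parameter are assumptions made for this example; `str.split` is only a stand-in so the block runs on its own, and any of the tokenizer functions used above could be passed instead.

```python
import time
from statistics import median

def time_tokenizer(tokenize_fn, lines, repeats=5):
    durations_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            tokenize_fn(line)
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    # The median is less sensitive to a slow first run than the mean.
    return median(durations_ms)

sample_lines = [
    "We love playing playful NLP token games, don't we?",
    "Numbers: 1,234,567 and 3.14 and 0.001%.",
]

ms = time_tokenizer(str.split, sample_lines)  # str.split is a placeholder tokenizer
print(f"median over 5 runs: {ms:.4f} ms")
```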
