# Tokenization for LLMs: From Text to Numbers

**Why Tokenization Matters**: Neural networks can only process numbers, not text. Tokenization is the crucial first step that converts human-readable text into numerical representations that language models can understand and learn from.

In this tutorial, we'll build a tokenizer from scratch. Along the way, we will:

1. **Load and explore text data** - Working with Romeo and Juliet from Project Gutenberg
2. **Tokenize text using regex patterns** - Breaking text into meaningful units
3. **Build a vocabulary** - Mapping tokens to unique integer IDs
4. **Handle special tokens** - Dealing with unknown words and document boundaries
5. **Create a BasicTokenizer class** - Encoding text to IDs and decoding back
6. **Compare with BPE tokenization** - Understanding modern tokenization used in GPT models

By the end, you'll understand how text becomes the numerical input that powers LLMs!

## 1. Setup and Load Data

First, we import the required library and load our text data - Romeo and Juliet from Project Gutenberg.

In [35]:
import re 

file_path = "../data/romeo_juliet_gutenberg.txt"

In [36]:
# Load the text data
with open(file_path, "r", encoding="utf-8") as file:
    text = file.read()

print(f"Data '{file_path}' loaded successfully.")
print(f"Total characters: {len(text):,}")

Data '../data/romeo_juliet_gutenberg.txt' loaded successfully.
Total characters: 161,780


## 2. Tokenization with Regex

**The Challenge**: How do we split text into tokens? Should "don't" be one token or two? What about numbers and punctuation?

**Our Solution**: We use a regex pattern that intelligently handles different text elements:

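- `[a-zA-Z]+(?:'[a-zA-Z]+)?` - Words, contractions, and possessives written with a straight apostrophe (e.g., "don't", "it's", "Romeo's")
- `[0-9]` - Individual digits (0-9); each digit is a separate token
- `[^\w\s]` - Any single character that is neither a word character nor whitespace, i.e. punctuation and symbols (periods, commas, etc.)

Whitespace is discarded, and so is anything that none of the three alternatives match, such as underscores and non-ASCII letters.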

This pattern strikes a balance between granularity and meaning, keeping contractions together while separating punctuation.
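
To see how the three alternatives interact, here is a small sketch on two made-up strings (the sample sentences and the variable names are assumptions for illustration, not taken from the play):

```python
import re

# Same pattern as in this tutorial: words/contractions | single digits | punctuation
PATTERN = r"[a-zA-Z]+(?:'[a-zA-Z]+)?|[0-9]|[^\w\s]"

# Contractions stay whole, digits are split, punctuation is separated
tokens = re.findall(PATTERN, "Don't wait, it's Act 42!")
print(tokens)

# Underscores and non-ASCII letters match none of the alternatives,
# so they silently disappear from the token stream
dropped = re.findall(PATTERN, "snake_case café")
print(dropped)
```

The second call is worth keeping in mind: characters that match no alternative are dropped rather than reported, so the pattern is best suited to plain ASCII English text.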

In [37]:
# Split text into tokens using regex
# This pattern handles words, contractions, individual digits, and punctuation
tokens = re.findall(r"[a-zA-Z]+(?:'[a-zA-Z]+)?|[0-9]|[^\w\s]", text)
print(f"Text split into {len(tokens):,} tokens.")

# Show a sample of tokens to see what we got
print(f"\nFirst 30 tokens: {tokens[:30]}")

Text split into 38,250 tokens.

First 30 tokens: ['The', 'Project', 'Gutenberg', 'eBook', 'of', 'Romeo', 'and', 'Juliet', 'This', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no']


## 3. Building the Vocabulary

**What is a Vocabulary?** It's a dictionary that maps each unique token to a unique integer ID. This is how we convert text to numbers!

We create two mappings:
- **`word_to_id`** - Convert tokens to IDs (for encoding: text → numbers)
- **`id_to_word`** - Convert IDs back to tokens (for decoding: numbers → text)

Why sort the tokens? This ensures consistent ordering across different runs, making our tokenizer deterministic and reproducible.
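
As a minimal sketch of what sorting does, the toy example below (the token list and the `toy_` names are made up for illustration) shows that `sorted()` orders strings by Unicode code point, so punctuation and digits come first, then uppercase letters, then lowercase letters:

```python
toy_tokens = ["thou", "Romeo", "!", "art", "7", "Romeo", ","]

# set() removes duplicates but has no guaranteed order for strings;
# sorted() fixes the order by comparing Unicode code points
toy_vocab = sorted(set(toy_tokens))
print(toy_vocab)

# Build both directions of the mapping from the sorted list
toy_word_to_id = {tok: i for i, tok in enumerate(toy_vocab)}
toy_id_to_word = {i: tok for tok, i in toy_word_to_id.items()}

print(toy_word_to_id["Romeo"], toy_id_to_word[3])
```

This is why the first IDs in the real vocabulary belong to punctuation and digits rather than to words starting with "a".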

In [38]:
# Get unique tokens and sort them (by Unicode code point)
all_words = sorted(set(tokens))
vocab_size = len(all_words)
print(f"Vocabulary size: {vocab_size:,} unique tokens")
print(f"Reduction: {len(tokens):,} tokens → {vocab_size:,} unique tokens")

Vocabulary size: 4,632 unique tokens
Reduction: 38,250 tokens → 4,632 unique tokens


In [39]:
# Create token-to-ID and ID-to-token mappings
word_to_id = {word: i for i, word in enumerate(all_words)}
id_to_word = {i: word for i, word in enumerate(all_words)}

print("Token to ID mapping created.")   
print("ID to Token mapping created.")
print(f"\nExample: 'Romeo' → ID {word_to_id.get('Romeo', 'Not found')}")
print(f"Example: ID 100 → '{id_to_word.get(100, 'Not found')}'")

Token to ID mapping created.
ID to Token mapping created.

Example: 'Romeo' → ID 663
Example: ID 100 → 'BUT'


In [40]:
# Let's inspect some mappings to see how it works
print("First 20 token mappings (sorted by code point):")
for word in all_words[:20]:
    print(f"  '{word}' → {word_to_id[word]}")

First 20 token mappings (sorted by code point):
  '!' → 0
  '#' → 1
  '$' → 2
  '%' → 3
  '&' → 4
  '(' → 5
  ')' → 6
  '*' → 7
  ',' → 8
  '-' → 9
  '.' → 10
  '/' → 11
  '0' → 12
  '1' → 13
  '2' → 14
  '3' → 15
  '4' → 16
  '5' → 17
  '6' → 18
  '7' → 19


In [41]:
# Verify reverse mapping works correctly
print("Reverse mapping (ID → Token):")
for i in range(20):
    print(f"  {i} → '{id_to_word[i]}'")

Reverse mapping (ID → Token):
  0 → '!'
  1 → '#'
  2 → '$'
  3 → '%'
  4 → '&'
  5 → '('
  6 → ')'
  7 → '*'
  8 → ','
  9 → '-'
  10 → '.'
  11 → '/'
  12 → '0'
  13 → '1'
  14 → '2'
  15 → '3'
  16 → '4'
  17 → '5'
  18 → '6'
  19 → '7'


## 4. Adding Special Tokens

**The Problem**: What happens when we encounter a word that's not in our vocabulary? Or when we need to mark where one document ends and another begins?

**The Solution**: Special tokens! These are reserved tokens with specific purposes:

1. **`<|unknown|>`** - Represents out-of-vocabulary (OOV) words
   - When we see "Hello" (not in Romeo & Juliet), we map it to this token
   - Preserves sentence structure even when we don't know every word
   
2. **`<|endoftext|>`** - Marks boundaries between different documents
   - Essential when training on multiple texts
   - Tells the model "this sequence is complete, next one is unrelated"

Let's add these special tokens to our vocabulary:

In [42]:
# Add special tokens to vocabulary
all_tokens_with_special = all_words.copy()
all_tokens_with_special.extend(["<|endoftext|>", "<|unknown|>"])

# Create updated vocabulary
vocab_with_special = {token: i for i, token in enumerate(all_tokens_with_special)}

print(f"Original vocabulary size: {len(word_to_id)}")
print(f"With special tokens: {len(vocab_with_special)}")
print(f"\nSpecial tokens added:")
print(f"  <|endoftext|> -> {vocab_with_special['<|endoftext|>']}")
print(f"  <|unknown|>   -> {vocab_with_special['<|unknown|>']}")

Original vocabulary size: 4632
With special tokens: 4634

Special tokens added:
  <|endoftext|> -> 4632
  <|unknown|>   -> 4633


Now let's demonstrate how the `<|unknown|>` token handles words not in our vocabulary:

In [43]:
# Example: Encode a word not in vocabulary
unknown_word = "Hello"  # This word is not in Romeo & Juliet

# Using .get() with the <|unknown|> token ID as default
# This is a key technique: instead of raising an error, we return the unknown token ID
unknown_token_id = vocab_with_special["<|unknown|>"]
word_id = vocab_with_special.get(unknown_word, unknown_token_id)

print(f"Attempting to encode: '{unknown_word}'")
print(f"Token '{unknown_word}' → ID: {word_id}")
print(f"This ID maps to: '{all_tokens_with_special[word_id]}'")
print(f"\nThis graceful fallback prevents errors when encountering new words!")

Attempting to encode: 'Hello'
Token 'Hello' → ID: 4633
This ID maps to: '<|unknown|>'

This graceful fallback prevents errors when encountering new words!


## 5. Complete BasicTokenizer Class

Now we'll combine everything we've learned into a reusable `BasicTokenizer` class.

**Why a class?** It encapsulates all tokenization logic in one place:
- Automatically builds vocabulary from any text corpus
- Provides clean `encode()` and `decode()` methods
- Handles special tokens gracefully
- Can be easily reused once built (saving and loading are not implemented here)

**Key Design Decisions**:
- `__init__(text)`: Creates vocabulary from the training text
- `tokenize_text()`: Static method for splitting text (can be used independently)
- `create_vocab()`: Static method for building vocabulary with special tokens
- `encode()`: Convert text → list of token IDs
- `decode()`: Convert list of token IDs → text (with proper punctuation spacing)

In [44]:
class BasicTokenizer:
    # One shared pattern: words/contractions | single digits | punctuation
    TOKEN_PATTERN = r"[a-zA-Z]+(?:'[a-zA-Z]+)?|[0-9]|[^\w\s]"

    def __init__(self, text):
        """Initialize tokenizer by creating vocabulary from text with special tokens"""
        self.word_to_id = self.create_vocab(text)
        self.id_to_word = {i: s for s, i in self.word_to_id.items()}

    @staticmethod
    def tokenize_text(text):
        """Split text into tokens using the shared regex pattern"""
        return re.findall(BasicTokenizer.TOKEN_PATTERN, text)

    @staticmethod
    def create_vocab(text):
        """Create vocabulary from raw text with special tokens"""
        tokens = BasicTokenizer.tokenize_text(text)

        # Get unique tokens and sort them for a deterministic ID assignment
        all_words = sorted(set(tokens))

        # Add special tokens at the end
        all_words.extend(["<|endoftext|>", "<|unknown|>"])

        # Create vocabulary mapping
        return {word: i for i, word in enumerate(all_words)}

    def encode(self, text):
        """Convert text to a list of token IDs"""
        tokens = self.tokenize_text(text)

        # Use the <|unknown|> ID for tokens not in the vocabulary
        unknown_id = self.word_to_id["<|unknown|>"]
        return [self.word_to_id.get(token, unknown_id) for token in tokens]

    def decode(self, ids):
        """Convert a list of token IDs back to text"""
        tokens = [self.id_to_word[i] for i in ids]

        # Join tokens with spaces
        text = " ".join(tokens)

        # Remove spaces before punctuation
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

        return text

## 6. Testing the Tokenizer

Let's test our complete tokenizer with different scenarios to verify it works correctly.

In [45]:
# Create the tokenizer instance
tokenizer = BasicTokenizer(text)
print(f"Tokenizer created with vocabulary size: {len(tokenizer.word_to_id)}")

Tokenizer created with vocabulary size: 4634


### Test 1: Handling Unknown Words (Out-of-Vocabulary)

First, let's verify that our tokenizer handles words it has never seen before:

In [46]:
# Test with unknown words (not in Romeo & Juliet vocabulary)
sample_text = "Hello Romeo! This is an unknown word: xyzabc"

try:
    encoded = tokenizer.encode(sample_text)
    decoded = tokenizer.decode(encoded)
    
    print("Original text:", sample_text)
    print("Decoded text: ", decoded)
    print("\n✓ Success! Unknown words 'Hello' and 'xyzabc' were replaced with <|unknown|>")
    print("  This allows the tokenizer to handle any text without crashing!")
except KeyError as e:
    print(f"✗ Error: Token '{e.args[0]}' not found in vocabulary!")
    print(f"  Solution: Use .get() with <|unknown|> token as default")

Original text: Hello Romeo! This is an unknown word: xyzabc
Decoded text:  <|unknown|> Romeo! This is an unknown word: <|unknown|>

✓ Success! Unknown words 'Hello' and 'xyzabc' were replaced with <|unknown|>
  This allows the tokenizer to handle any text without crashing!


### Test 2: Basic Encoding and Decoding

Let's test with text that contains words from our vocabulary to verify the encode/decode cycle works perfectly:

In [47]:
# Test with text from Romeo & Juliet (all words in vocabulary)
sample_text = "Romeo, Romeo! Wherefore art thou Romeo?"
encoded = tokenizer.encode(sample_text)
decoded = tokenizer.decode(encoded)

print("Original text:", sample_text)
print("Encoded IDs:  ", encoded[:10], "... (first 10 IDs)")
print("Decoded text: ", decoded)
print(f"\n✓ Perfect roundtrip! {len(encoded)} tokens encoded and decoded")
print("  Notice punctuation spacing is preserved correctly")

Original text: Romeo, Romeo! Wherefore art thou Romeo?
Encoded IDs:   [663, 8, 663, 0, 890, 1095, 4152, 663, 24] ... (first 10 IDs)
Decoded text:  Romeo, Romeo! Wherefore art thou Romeo?

✓ Perfect roundtrip! 9 tokens encoded and decoded
  Notice punctuation spacing is preserved correctly


### Test 3: Numbers and Punctuation

Our regex pattern treats each digit as a separate token. Let's see how this works:

In [48]:
# Test with numbers and special characters
sample_text = "Act 1, Scene 2: Romeo's age is 16."
tokens_list = tokenizer.tokenize_text(sample_text)
encoded = tokenizer.encode(sample_text)
decoded = tokenizer.decode(encoded)

print("Original text:", sample_text)
print("Tokens:       ", tokens_list)
print("Decoded text: ", decoded)
print(f"\n✓ Notice: '16' becomes two tokens ['1', '6']")
print("  This is intentional for our simple tokenizer!")

Original text: Act 1, Scene 2: Romeo's age is 16.
Tokens:        ['Act', '1', ',', 'Scene', '2', ':', "Romeo's", 'age', 'is', '1', '6', '.']
Decoded text:  Act 1, Scene 2: <|unknown|> age is 1 6.

✓ Notice: '16' becomes two tokens ['1', '6']
  This is intentional for our simple tokenizer!
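
Note that the decoded text reads `1 6` rather than `16`, so splitting digits makes the round trip lossy. As a hedged sketch of one alternative, the variant below keeps runs of digits together by changing `[0-9]` to `[0-9]+` (the name `MULTI_DIGIT_PATTERN` is hypothetical and not part of this tutorial's tokenizer):

```python
import re

DIGIT_PATTERN = r"[a-zA-Z]+(?:'[a-zA-Z]+)?|[0-9]|[^\w\s]"         # pattern used in this tutorial
MULTI_DIGIT_PATTERN = r"[a-zA-Z]+(?:'[a-zA-Z]+)?|[0-9]+|[^\w\s]"  # hypothetical variant

sample = "age is 16."

# One token per digit: small, closed set of digit tokens
per_digit = re.findall(DIGIT_PATTERN, sample)

# One token per number: readable round trip, but every distinct number needs its own ID
per_number = re.findall(MULTI_DIGIT_PATTERN, sample)

print(per_digit)
print(per_number)
```

The trade-off is vocabulary coverage: with per-digit tokens, ten entries cover every number, whereas per-number tokens send any number not seen during vocabulary building to `<|unknown|>`.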


### Test 4: Contractions

Our regex pattern preserves contractions as single tokens, which is important for natural language understanding:

In [49]:
# Test with contractions
sample_text = "I'll go, but thou'rt not coming!"
tokens_list = tokenizer.tokenize_text(sample_text)
encoded = tokenizer.encode(sample_text)
decoded = tokenizer.decode(encoded)

print("Original text:", sample_text)
print("Tokens:       ", tokens_list)
print("Decoded text: ", decoded)
print(f"\n✓ The regex keeps 'I'll' and 'thou'rt' together as single tokens")
print("  Neither exact string is in our vocabulary, so both decode to <|unknown|>")

Original text: I'll go, but thou'rt not coming!
Tokens:        ["I'll", 'go', ',', 'but', "thou'rt", 'not', 'coming', '!']
Decoded text:  <|unknown|> go, but <|unknown|> not coming!

✓ The regex keeps 'I'll' and 'thou'rt' together as single tokens
  Neither exact string is in our vocabulary, so both decode to <|unknown|>
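
In the decoded text, both contractions come back as `<|unknown|>`, which means the exact strings `I'll` and `thou'rt` never entered the vocabulary. One possible cause, stated here as an assumption that has not been checked against this file, is that the source text uses the typographic apostrophe `’` (U+2019), which the pattern's straight `'` does not match. The sketch below shows the effect; `normalize_apostrophes` is a hypothetical helper, not part of the tutorial's class:

```python
import re

PATTERN = r"[a-zA-Z]+(?:'[a-zA-Z]+)?|[0-9]|[^\w\s]"

def normalize_apostrophes(text):
    """Hypothetical helper: map the typographic apostrophe (U+2019) to the ASCII one."""
    return text.replace("\u2019", "'")

curly = "I\u2019ll go"

# With a typographic apostrophe the contraction falls apart into three tokens
split_tokens = re.findall(PATTERN, curly)

# After normalization the contraction matches the first alternative as a whole
joined_tokens = re.findall(PATTERN, normalize_apostrophes(curly))

print(split_tokens)
print(joined_tokens)
```

Running `text.count("\u2019")` on the loaded play would confirm or rule out this explanation before any normalization is applied to the corpus.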


### Test 5: End-of-Text Token

The `<|endoftext|>` token is crucial when training on multiple documents. It tells the model "this sequence ends here, the next tokens are from a different document":

In [50]:
# Test with <|endoftext|> token to separate two different sentences
text1 = "Romeo loves Juliet"
text2 = "The sun rises in the east"
combined_text = text1 + " <|endoftext|> " + text2

encoded = tokenizer.encode(combined_text)
decoded = tokenizer.decode(encoded)

print("Combined text:", combined_text)
print("Decoded text: ", decoded)
print(f"\n✗ The regex splits <|endoftext|> into '<', '|', 'endoftext', '|', '>', and each piece decodes to <|unknown|>")
print("  To mark document boundaries, special tokens must be matched before the regex splits the text")

Combined text: Romeo loves Juliet <|endoftext|> The sun rises in the east
Decoded text:  Romeo loves Juliet <|unknown|> <|unknown|> <|unknown|> <|unknown|> <|unknown|> The sun <|unknown|> in the east

✗ The regex splits <|endoftext|> into '<', '|', 'endoftext', '|', '>', and each piece decodes to <|unknown|>
  To mark document boundaries, special tokens must be matched before the regex splits the text
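
The five `<|unknown|>` tokens in the decoded text come from the regex splitting `<|endoftext|>` into `<`, `|`, `endoftext`, `|`, `>`. One way to keep special tokens intact, sketched here under the assumption that they should always be recognized in input text, is to list them as the first alternatives of the pattern. The names `SPECIAL_TOKENS`, `SPECIAL_AWARE_PATTERN`, and `tokenize_with_specials` are hypothetical:

```python
import re

SPECIAL_TOKENS = ["<|endoftext|>", "<|unknown|>"]
BASE_PATTERN = r"[a-zA-Z]+(?:'[a-zA-Z]+)?|[0-9]|[^\w\s]"

# re.escape keeps the '|' characters literal; putting the special tokens first
# lets them win over the single-character punctuation alternative
SPECIAL_AWARE_PATTERN = "|".join(re.escape(t) for t in SPECIAL_TOKENS) + "|" + BASE_PATTERN

def tokenize_with_specials(text):
    """Split text into tokens, keeping special tokens as single units."""
    return re.findall(SPECIAL_AWARE_PATTERN, text)

sample = "Juliet <|endoftext|> The"
original = re.findall(BASE_PATTERN, sample)
fixed = tokenize_with_specials(sample)

print(original)
print(fixed)
```

Because `<|endoftext|>` is already in the vocabulary, a tokenizer using this pattern in `encode()` would map it to its own ID. Production tokenizers usually make this opt-in, so that special tokens appearing in untrusted input are not interpreted by accident.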


## 7. Advanced Tokenization: Byte Pair Encoding (BPE)

**Congratulations!** You've built a working tokenizer from scratch. But there's a problem...

### Why Basic Word Tokenization Isn't Enough for Modern LLMs

Our `BasicTokenizer` works well for learning, but has critical limitations:

**Problem 1: Massive Vocabulary Size**
- Romeo & Juliet alone: 4,632 unique tokens
- Full English language: 100,000+ unique words
- GPT-3 training data: Millions of unique words!
- **Result**: Every vocabulary entry needs its own row in the model's embedding and output layers, so memory and compute grow with vocabulary size

**Problem 2: Out-of-Vocabulary Words**
- Every new word, typo, or name → `<|unknown|>` token
- Information is lost: "xylophone" and "quantum" both become `<|unknown|>`
- **Result**: Poor handling of rare words, technical terms, proper nouns

**Problem 3: No Morphological Understanding**
- "run", "running", "runs", "runner" are completely separate tokens
- Model doesn't learn that these words are related
- **Result**: Inefficient learning, poor generalization

### Enter Byte Pair Encoding (BPE)

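**The BPE Solution**: Break words into **subword units** instead of whole words! For example (illustrative splits; the actual pieces depend on the merges learned from the training corpus):

- "running" → ["run", "ning"]
- "runner" → ["run", "ner"]
- "uncommon" → ["un", "common"]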

**Benefits**:
1. **Smaller vocabulary**: 30K-50K subwords (vs 100K+ words)
2. **No true unknowns**: Any word can be built from subwords
3. **Morphology captured**: "run" appears in "running", "runs", "runner"
4. **Better efficiency**: A smaller vocabulary means smaller embedding and output layers

**Modern LLMs using BPE**: GPT-2, GPT-3, GPT-4, LLaMA, and most others!
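
To make the merge idea concrete, here is a minimal character-level sketch of BPE training on a made-up four-word corpus. It is a simplification under stated assumptions: GPT-2's actual tokenizer works on bytes and pre-splits text with a regex, and the function names `get_pair_counts` and `merge_pair` are illustrative only.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (made up): word -> frequency; each word starts as a tuple of characters
corpus = {"run": 5, "runs": 3, "running": 4, "runner": 2}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(2):
    counts = get_pair_counts(words)
    best = max(counts, key=counts.get)  # most frequent pair; ties go to the first one seen
    merges.append(best)
    words = merge_pair(words, best)

print(merges)
print(list(words))
```

After two merges, `run` has become a single symbol shared by all four words. The merges are driven purely by pair frequency, so the resulting pieces line up with morphology only when the frequent pairs happen to coincide with it.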

Let's explore how GPT-2's BPE tokenizer works using OpenAI's `tiktoken` library:

In [51]:
# Install tiktoken if needed: pip install tiktoken
import tiktoken

print(f"tiktoken version: {tiktoken.__version__}")

tiktoken version: 0.11.0


### Initialize GPT-2 Tokenizer

GPT-2 uses BPE with a vocabulary of **50,257 tokens** - much smaller than word-level tokenizers while handling any possible text!

In [52]:
# Load GPT-2 tokenizer
gpt2_tokenizer = tiktoken.get_encoding("gpt2")
print(f"GPT-2 vocabulary size: {gpt2_tokenizer.n_vocab}")

# Note: To explore newer tokenizers, you can load other encodings, e.g.:
# gpt4_tokenizer = tiktoken.get_encoding("cl100k_base")   # GPT-4
# gpt4o_tokenizer = tiktoken.get_encoding("o200k_base")   # GPT-4o

GPT-2 vocabulary size: 50257


### Encoding and Decoding with BPE

Let's see how BPE handles text encoding. Notice the `allowed_special="all"` parameter:

In [53]:
# Encode text with special token
sample_text = "Hello, world! <|endoftext|>"

# Why allowed_special="all"?
# - Without it: tiktoken's default (disallowed_special="all") raises a ValueError
#   when the text contains a special token such as <|endoftext|>
# - With it: <|endoftext|> is treated as a single special token with a dedicated ID
# This is crucial for maintaining special token semantics in the model!
tokens = gpt2_tokenizer.encode(sample_text, allowed_special="all")

print(f"Original text: {sample_text}")
print(f"Encoded tokens: {tokens}")
print(f"Number of tokens: {len(tokens)}")
print(f"\nNotice how 'Hello' and 'world' are encoded as single tokens!")
print("BPE handles common words efficiently while breaking rare words into subwords.")

Original text: Hello, world! <|endoftext|>
Encoded tokens: [15496, 11, 995, 0, 220, 50256]
Number of tokens: 6

Notice how 'Hello' and 'world' are encoded as single tokens!
BPE handles common words efficiently while breaking rare words into subwords.


In [54]:
# Decode tokens back to text
decoded_text = gpt2_tokenizer.decode(tokens)
print(f"Decoded text: {decoded_text}")
print(f"\n✓ Perfect roundtrip encoding/decoding!")

Decoded text: Hello, world! <|endoftext|>

✓ Perfect roundtrip encoding/decoding!


### Comparing Basic vs BPE Tokenization

Let's compare our `BasicTokenizer` with GPT-2's BPE tokenizer side-by-side:

In [55]:
# Compare tokenization approaches on the same text
test_text = "Romeo's running towards Juliet."

# Our basic tokenizer
basic_tokens = tokenizer.tokenize_text(test_text)
basic_encoded = tokenizer.encode(test_text)

# GPT-2 BPE tokenizer
bpe_tokens = gpt2_tokenizer.encode(test_text)

print("=" * 60)
print("TOKENIZATION COMPARISON")
print("=" * 60)
print(f"\nText: {test_text}")

print(f"\n📝 Basic Tokenizer (Word-level):")
print(f"  Tokens: {basic_tokens}")
print(f"  Count: {len(basic_tokens)} tokens")

print(f"\n🤖 BPE Tokenizer (Subword-level):")
print(f"  Token IDs: {bpe_tokens}")
print(f"  Count: {len(bpe_tokens)} tokens")
print(f"  Decoded subwords: {[gpt2_tokenizer.decode([t]) for t in bpe_tokens]}")

print(f"\n💡 Key Observations:")
print(f"  - BPE splits the rarer name 'Romeo' into 'R', 'ome', 'o' but keeps the common ' running' as one token")
print(f"  - Nothing falls back to an unknown token: any string can be assembled from subword pieces")
print(f"  - The cost is a longer sequence here ({len(bpe_tokens)} tokens vs {len(basic_tokens)})")

TOKENIZATION COMPARISON

Text: Romeo's running towards Juliet.

📝 Basic Tokenizer (Word-level):
  Tokens: ["Romeo's", 'running', 'towards', 'Juliet', '.']
  Count: 5 tokens

🤖 BPE Tokenizer (Subword-level):
  Token IDs: [49, 462, 78, 338, 2491, 3371, 38201, 13]
  Count: 8 tokens
  Decoded subwords: ['R', 'ome', 'o', "'s", ' running', ' towards', ' Juliet', '.']

💡 Key Observations:
  - BPE splits the rarer name 'Romeo' into 'R', 'ome', 'o' but keeps the common ' running' as one token
  - Nothing falls back to an unknown token: any string can be assembled from subword pieces
  - The cost is a longer sequence here (8 tokens vs 5)


## Summary: Key Takeaways

### What We Learned

1. **Basic Tokenization**
   - ✅ Simple and intuitive to understand
   - ✅ Great for learning tokenization concepts
   - ✅ Works well for small, controlled vocabularies
   - ❌ Huge vocabulary size for real applications (100K+ tokens)
   - ❌ Cannot handle unseen words effectively
   - ❌ No understanding of word relationships

2. **BPE Tokenization (GPT-2, GPT-3, GPT-4)**
   - ✅ Smaller, efficient vocabulary (30K-50K tokens)
   - ✅ Handles any possible text through subword decomposition
   - ✅ Captures morphological relationships (run → running → runner)
   - ✅ Better generalization to rare and new words
   - ✅ Industry standard for modern LLMs

### The Tokenization Pipeline

```
Raw Text
   ↓
Tokenization (splitting)
   ↓
Vocabulary Mapping (token → ID)
   ↓
Token IDs (numbers for the model)
   ↓
Model Training/Inference
   ↓
Token IDs (output)
   ↓
Decoding (ID → token → text)
   ↓
Generated Text
```

### When to Use Each Approach

- **Use Basic Word Tokenization**: 
  - Educational purposes and learning
  - Small, fixed domains with limited vocabulary
  - When interpretability is crucial

- **Use BPE (or similar subword methods)**:
  - Production LLMs and real-world applications
  - When vocabulary size matters
  - For handling diverse, open-domain text
  - **This is what you should use for building actual LLMs!**

### Next Steps

Now that you understand tokenization, the next step is learning how to:
- Create efficient data loaders for training
- Handle batching and sequence padding
- Implement attention mechanisms that process these tokens

Ready to move on? Check out the next notebook on data sampling and loading! 🚀