# 🧩 Mini-Lab: Tokenizer Explorer

**Module 2: LLM Core Concepts** | **Duration: ~30 min** | **Type: Mini-Lab**

---

## Learning Objectives

By the end of this mini-lab, you will be able to:

1. **Understand** how tokenization converts text into tokens that LLMs process
2. **Explore** different tokenization algorithms (BPE, WordPiece, SentencePiece)
3. **Compare** tokenizers across different models (GPT-4o and GPT-3.5)
4. **Predict** token counts for cost estimation and context planning
5. **Identify** edge cases that affect token counts unexpectedly

## Target Concepts

| Concept | Description |
|---------|-------------|
| Tokenization | Converting text into discrete units (tokens) for model processing |
| Byte-Pair Encoding (BPE) | Algorithm that repeatedly merges the most frequent pair of adjacent symbols into a new token |

## Why This Matters

- **Cost Control**: API costs are based on tokens, not characters
- **Context Limits**: Understanding token counts helps you fit content within limits
- **Prompt Design**: Knowing how text tokenizes helps write efficient prompts

## 1. Setup

In [11]:
import tiktoken
from collections import Counter

# Initialize tokenizers for different models
gpt4_enc = tiktoken.encoding_for_model("gpt-4o")
gpt35_enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

print("✓ Tokenizers initialized")
print(f"  - GPT-4o tokenizer: {gpt4_enc.name}")
print(f"  - GPT-3.5 tokenizer: {gpt35_enc.name}")

✓ Tokenizers initialized
  - GPT-4o tokenizer: o200k_base
  - GPT-3.5 tokenizer: cl100k_base


## 2. Understanding Byte-Pair Encoding (BPE)

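BPE is the tokenization algorithm used by GPT models. Training works by:
1. Starting with a base vocabulary of individual symbols (single bytes in the byte-level variant that `tiktoken` implements)
2. Repeatedly merging the most frequent pair of adjacent symbols into a new token
3. Stopping at a target vocabulary size, which leaves common strings as single tokens and rare strings split into several subword pieces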

Let's visualize how this works:

In [10]:
def visualize_tokenization(text, encoder, name="Tokenizer"):
    """Visualize how text is broken into tokens."""
    tokens = encoder.encode(text)
    # Guard against empty input, which would otherwise divide by zero
    ratio = len(text) / len(tokens) if tokens else 0.0
    
    print(f"\n{'='*60}")
    print(f"📝 Input: \"{text}\"")
    print(f"🔢 {name}: {len(tokens)} tokens")
    print(f"📊 Ratio: {ratio:.2f} chars/token")
    print(f"\n🧩 Token breakdown:")
    
    for i, token_id in enumerate(tokens):
        token_text = encoder.decode([token_id])
        # Escape special characters for display
        display_text = token_text.replace('\n', '\\n').replace('\t', '\\t')
        print(f"   [{i}] ID: {token_id:6d} → '{display_text}'")
    
    return tokens

# Simple example
visualize_tokenization("Hello, world!", gpt4_enc, "GPT-4o")


📝 Input: "Hello, world!"
🔢 GPT-4o: 4 tokens
📊 Ratio: 3.25 chars/token

🧩 Token breakdown:
   [0] ID:  13225 → 'Hello'
   [1] ID:     11 → ','
   [2] ID:   2375 → ' world'
   [3] ID:      0 → '!'


[13225, 11, 2375, 0]
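
The three training steps listed at the start of this section can be sketched in a few lines. This is a character-level toy under simplifying assumptions, not `tiktoken`'s implementation: real GPT tokenizers work on UTF-8 bytes, pre-split text with a regular expression, and are trained on far larger corpora. The function name `train_toy_bpe` and the sample corpus are made up for illustration.

```python
from collections import Counter


def train_toy_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single-character symbols
    words = Counter(tuple(word) for word in corpus)
    merges = []

    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol

        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Rewrite every word with the chosen pair fused into one symbol
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words

    return merges, words


merges, words = train_toy_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(dict(words))
```

After two merges, `low` has become a single symbol, while the rarer endings of `lower` and `lowest` are still split into characters. That is the same pattern the real tokenizer shows: frequent strings collapse into one token and rare ones stay in pieces.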

## 3. Token Patterns: What Gets Merged?

BPE creates tokens based on frequency in the training data. Common patterns become single tokens:

In [7]:
# Explore different types of content
examples = [
    # Common English words (often single tokens)
    "The quick brown fox jumps over the lazy dog",
    
    # Technical terms (might be split)
    "TensorFlow PyTorch LangChain embeddings",
    
    # Numbers (different patterns)
    "2024 1234567890 3.14159 $1,000,000",
    
    # Code (special tokenization)
    "def calculate_embedding(text: str) -> list[float]:",
    
    # Whitespace variations
    "word    word\nword\tword",
]

for text in examples:
    visualize_tokenization(text, gpt4_enc, "GPT-4o")


📝 Input: "The quick brown fox jumps over the lazy dog"
🔢 GPT-4o: 9 tokens
📊 Ratio: 4.78 chars/token

🧩 Token breakdown:
   [0] ID:    976 → 'The'
   [1] ID:   4853 → ' quick'
   [2] ID:  19705 → ' brown'
   [3] ID:  68347 → ' fox'
   [4] ID:  65613 → ' jumps'
   [5] ID:   1072 → ' over'
   [6] ID:    290 → ' the'
   [7] ID:  29082 → ' lazy'
   [8] ID:   6446 → ' dog'

📝 Input: "TensorFlow PyTorch LangChain embeddings"
🔢 GPT-4o: 7 tokens
📊 Ratio: 5.57 chars/token

🧩 Token breakdown:
   [0] ID:  40994 → 'Tensor'
   [1] ID:  18017 → 'Flow'
   [2] ID:  15993 → ' Py'
   [3] ID: 162709 → 'Torch'
   [4] ID:  27830 → ' Lang'
   [5] ID:  20848 → 'Chain'
   [6] ID: 174989 → ' embeddings'

📝 Input: "2024 1234567890 3.14159 $1,000,000"
🔢 GPT-4o: 18 tokens
📊 Ratio: 1.89 chars/token

🧩 Token breakdown:
   [0] ID:   1323 → '202'
   [1] ID:     19 → '4'
   [2] ID:    220 → ' '
   [3] ID:   7633 → '123'
   [4] ID:  19354 → '
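
The digit splits in the breakdown above (`'202'` then `'4'`, and `'123'` as the first piece of the long number) match how `tiktoken` encodings pre-split text before any BPE merges apply: runs of digits are cut into groups of at most three, left to right, and merges never cross a group boundary. The sketch below reproduces only that digit rule with the standard `re` module. The helper name `digit_chunks` is hypothetical, and the real pattern also handles letters, punctuation, and whitespace.

```python
import re


def digit_chunks(text):
    """Split each run of digits into left-to-right groups of at most three."""
    return re.findall(r"\d{1,3}", text)


print(digit_chunks("2024"))        # ['202', '4']
print(digit_chunks("1234567890"))  # ['123', '456', '789', '0']
```

This is why long numbers have such a low chars-per-token ratio: a ten-digit number needs at least four tokens no matter how common it is.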

## 4. Cross-Model Comparison

Different models use different tokenizers. This affects:
- Token counts (and thus costs)
- How meaning is captured at the token level

In [None]:
def compare_tokenizers(text):
    """Compare how different tokenizers handle the same text."""
    tokenizers = [
        ("GPT-4o (o200k_base)", gpt4_enc),
        ("GPT-3.5 (cl100k_base)", gpt35_enc),
    ]
    
    print(f"\n{'='*60}")
    print(f"📝 Text: \"{text}\"")
    print(f"📏 Length: {len(text)} characters")
    print(f"\n📊 Token counts:")
    
    for name, enc in tokenizers:
        tokens = enc.encode(text)
        print(f"   {name}: {len(tokens)} tokens")
    
    print("\n🔍 Detailed breakdown:")
    for name, enc in tokenizers:
        tokens = enc.encode(text)
        decoded = [enc.decode([t]) for t in tokens]
        print(f"\n   {name}:")
        print(f"   {decoded}")

# Test with various content types
test_texts = [
    "Artificial Intelligence",
    "The embedding dimension is 1536",
    "async def fetch_data(): await api.call()",
    "こんにちは世界",  # Japanese: "Hello World"
    "🤖💡🚀",  # Emojis
]

for text in test_texts:
    compare_tokenizers(text)
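
One reason the Japanese and emoji samples tend to cost more tokens per character: `tiktoken` encodings operate on UTF-8 bytes, and these characters take three or four bytes each where ASCII takes one. The sketch below only counts characters and bytes with the standard library, so it gives no token counts. `byte_profile` is a hypothetical helper name.

```python
def byte_profile(text):
    """Return (characters, UTF-8 bytes, bytes per character) for a string."""
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    return n_chars, n_bytes, (n_bytes / n_chars if n_chars else 0.0)


for sample in ["Artificial Intelligence", "こんにちは世界", "🤖💡🚀"]:
    chars, size, ratio = byte_profile(sample)
    print(f"{sample!r}: {chars} chars, {size} bytes, {ratio:.1f} bytes/char")
```

Byte count is only a starting point. A larger vocabulary can hold more merged multi-byte tokens, which is one way two tokenizers end up with different counts for the same non-English text.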

## 5. Token Count Estimation Tool

Build a practical tool for estimating API costs:

In [None]:
# Pricing in USD per million tokens (approximate, as of 2025)
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-3.5-turbo": {"input": 0.50, "output": 1.50},
}

def estimate_cost(prompt, expected_output_tokens=500, model="gpt-4o-mini"):
    """Estimate API cost for a prompt."""
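    # Fail loudly on unknown models rather than silently applying gpt-4o-mini prices
    if model not in PRICING:
        raise ValueError(f"No pricing data for model: {model}")
    prices = PRICING[model]
    
    # gpt-4o models use o200k_base; gpt-3.5-turbo uses cl100k_base
    enc = gpt4_enc if model.startswith("gpt-4o") else gpt35_enc
    
    input_tokens = len(enc.encode(prompt))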
    
    input_cost = (input_tokens / 1_000_000) * prices["input"]
    output_cost = (expected_output_tokens / 1_000_000) * prices["output"]
    total_cost = input_cost + output_cost
    
    print(f"\n💰 Cost Estimate for {model}")
    print(f"{'='*40}")
    print(f"📥 Input tokens:  {input_tokens:,}")
    print(f"📤 Output tokens: {expected_output_tokens:,} (estimated)")
    print(f"\n💵 Costs:")
    print(f"   Input:  ${input_cost:.6f}")
    print(f"   Output: ${output_cost:.6f}")
    print(f"   Total:  ${total_cost:.6f}")
    print(f"\n📊 Per 1000 calls: ${total_cost * 1000:.2f}")
    
    return {
        "input_tokens": input_tokens,
        "output_tokens": expected_output_tokens,
        "total_cost": total_cost
    }

# Example: RAG system prompt
rag_prompt = """
You are a helpful assistant. Use the following context to answer the question.

Context:
Large Language Models (LLMs) are AI systems trained on vast amounts of text data.
They can understand and generate human-like text. Common architectures include
transformer-based models like GPT, which use attention mechanisms to process
sequential data efficiently.

Question: What is an LLM?

Answer:
"""

estimate_cost(rag_prompt, expected_output_tokens=100, model="gpt-4o-mini")
estimate_cost(rag_prompt, expected_output_tokens=100, model="gpt-4o")
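
Token counts matter for context planning as well as cost: the prompt and the reply must fit in the model's context window together. Below is a minimal sketch of that check. It assumes the caller supplies the window size (the 128,000 used here is an illustrative figure, so check the model's documentation) and already has a prompt token count such as `input_tokens` from `estimate_cost`. The function name `remaining_output_budget` is hypothetical.

```python
def remaining_output_budget(prompt_tokens, context_window, reserved_tokens=0):
    """Return how many tokens are left for the model's reply.

    reserved_tokens covers overhead the caller wants to hold back,
    for example chat-format framing tokens (an assumption, tune as needed).
    """
    if prompt_tokens < 0 or context_window <= 0 or reserved_tokens < 0:
        raise ValueError("token counts must be non-negative and the window positive")
    # Clamp at zero: an oversized prompt leaves no room for a reply
    return max(context_window - prompt_tokens - reserved_tokens, 0)


budget = remaining_output_budget(
    prompt_tokens=1_200, context_window=128_000, reserved_tokens=50
)
print(budget)  # 126750
```

Some APIs also cap reply length separately from the context window, which this sketch ignores.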

## 6. Edge Cases & Gotchas

Understanding edge cases helps avoid surprises:

In [None]:
def analyze_edge_cases():
    """Explore tokenization edge cases."""
    
    edge_cases = [
        # Repeated characters
        ("aaaaaaaaaa", "Repeated letters"),
        ("...........", "Repeated punctuation"),
        
        # Case sensitivity
        ("Hello hello HELLO", "Case variations"),
        
        # Leading/trailing spaces
        (" word", "Leading space"),
        ("word ", "Trailing space"),
        (" word ", "Both spaces"),
        
        # Special characters
        ("user@email.com", "Email"),
        ("https://example.com/path?query=value", "URL"),
        
        # Code patterns
        ("{{variable}}", "Template syntax"),
        ("/* comment */", "Comment block"),
        
        # Unicode
        ("café résumé naïve", "Accented characters"),
        ("→ ← ↔ ⇒", "Arrows"),
    ]
    
    print("\n🔬 Tokenization Edge Cases")
    print("="*60)
    
    for text, description in edge_cases:
        tokens = gpt4_enc.encode(text)
        ratio = len(text) / len(tokens) if len(tokens) > 0 else 0
        print(f"\n{description}:")
        print(f"  Text: \"{text}\"")
        print(f"  Chars: {len(text)}, Tokens: {len(tokens)}, Ratio: {ratio:.2f}")
        decoded = [gpt4_enc.decode([t]) for t in tokens]
        print(f"  Breakdown: {decoded}")

analyze_edge_cases()
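
Accented text has one more wrinkle that is easy to miss: the same visible string can be stored in composed (NFC) or decomposed (NFD) form. The two forms are different byte sequences, so a byte-level tokenizer sees different input and can return different token counts. The sketch below shows the difference using only the standard `unicodedata` module. Normalizing to NFC before counting is a suggestion, not something the lab's code does.

```python
import unicodedata

text = "café résumé naïve"
nfc = unicodedata.normalize("NFC", text)  # accented letters as single code points
nfd = unicodedata.normalize("NFD", text)  # base letter + combining mark

for label, form in [("NFC", nfc), ("NFD", nfd)]:
    print(f"{label}: {len(form)} code points, {len(form.encode('utf-8'))} bytes")

print("Same on screen, equal in memory?", nfc == nfd)
```

Text copied from different sources (web pages, PDFs, macOS filenames) may arrive in either form, so token counts for visually identical prompts can differ.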

## 7. Practical Exercise: Optimize a Prompt

Apply what you've learned to optimize a prompt for token efficiency:

In [None]:
# Original verbose prompt
verbose_prompt = """
You are an extremely helpful and knowledgeable AI assistant. Your job is to help
users with their questions. Please make sure to be thorough, accurate, and helpful
in all of your responses. When answering questions, please consider the context
carefully and provide comprehensive answers.

The user has the following question that they would like you to answer:

What is machine learning?

Please provide your response below:
"""

# Optimized prompt (same meaning, fewer tokens)
optimized_prompt = """
You are a helpful AI assistant. Answer accurately and thoroughly.

Question: What is machine learning?

Answer:
"""

print("📊 Prompt Optimization Analysis")
print("="*60)

verbose_tokens = len(gpt4_enc.encode(verbose_prompt))
optimized_tokens = len(gpt4_enc.encode(optimized_prompt))
savings = verbose_tokens - optimized_tokens
percent_saved = (savings / verbose_tokens) * 100

print(f"\n📝 Verbose prompt: {verbose_tokens} tokens")
print(f"✨ Optimized prompt: {optimized_tokens} tokens")
print(f"💰 Tokens saved: {savings} ({percent_saved:.1f}%)")
print(f"\n🔢 Over 1M API calls, this saves ~{savings * 1_000_000:,} tokens!")
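
Token savings turn into money through the same per-million-token arithmetic used in `estimate_cost`. The small sketch below assumes only input tokens change and uses a made-up figure of 50 tokens saved per call (run the cell above for the real `savings` value). `input_savings_usd` is a hypothetical helper, and the prices are the approximate ones from the `PRICING` table.

```python
def input_savings_usd(tokens_saved_per_call, calls, price_per_million_usd):
    """Dollar savings from sending fewer input tokens on every call."""
    return tokens_saved_per_call * calls / 1_000_000 * price_per_million_usd


# Illustrative inputs: 50 tokens saved per call, one million calls
for model, price in [("gpt-4o-mini", 0.15), ("gpt-4o", 2.50)]:
    saved = input_savings_usd(50, 1_000_000, price)
    print(f"{model}: ${saved:,.2f} saved")
```

The same trimmed prompt is worth far more on an expensive model, so prompt optimization pays off first where the per-token price is highest.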

## 🎯 Summary

### Key Takeaways

1. **Tokenization Basics**
   - Tokens are the units LLMs process
   - 1 token ≈ 4 characters ≈ 0.75 words (in English)
   - BPE builds vocabulary from frequent patterns

2. **Model Differences**
   - Different models use different tokenizers
   - GPT-4o uses `o200k_base`, GPT-3.5 uses `cl100k_base`
   - Same text can have different token counts across models

3. **Cost Implications**
   - API pricing is per token, not per character
   - Efficient prompts save money at scale
   - Input/output tokens often have different prices

4. **Edge Cases**
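   - Non-English text and emoji often use more tokens per character than English
   - Long numbers split into short digit groups, and code identifiers may split at unexpected points
   - A leading space is usually folded into the following token, so `" word"` and `"word"` map to different tokens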
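
The 4-characters-per-token rule of thumb can serve as a rough estimator when no tokenizer is at hand. The sketch below is exactly that heuristic and nothing more. It will undercount for non-English text, numbers, and unusual code, so use a real tokenizer whenever exact counts matter. `rough_token_estimate` is a hypothetical name.

```python
import math


def rough_token_estimate(text, chars_per_token=4.0):
    """Estimate token count from character count (English-prose heuristic)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


print(rough_token_estimate("Hello, world!"))  # 4
```

For "Hello, world!" the estimate matches the 4 tokens GPT-4o reported in section 2. For the 43-character pangram in section 3 it gives 11 against an actual 9, which shows how loose the rule is.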

### Next Steps

- **mini-context**: Learn about context window limits
- **mini-temperature**: Explore generation parameters
- **lab-llm-playground**: Combine all LLM core concepts