# Tokenization Demo — See How LLMs "Read" Text

This notebook demonstrates **real tokenization** using OpenAI's `tiktoken` library.

**Why this matters for architects:**
- LLM costs are per-token, not per-character or per-word
- Different languages produce different token counts for the same meaning
- Understanding tokenization helps optimize prompts and estimate costs

**Prerequisites:**
```bash
pip install tiktoken
```


In [1]:
# ==========================================================================
# STEP 1: Load Tokenizer
# ==========================================================================
# tiktoken is OpenAI's fast tokenizer library.
# cl100k_base is used by GPT-4, GPT-3.5-turbo, and OpenAI's text-embedding models.
# Newer models such as GPT-4o use a different encoding (o200k_base).
# ==========================================================================

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

print("Tokenizer loaded: cl100k_base (GPT-4/GPT-3.5-turbo)")
print(f"Vocabulary size: {enc.n_vocab:,} tokens")


Tokenizer loaded: cl100k_base (GPT-4/GPT-3.5-turbo)
Vocabulary size: 100,277 tokens


In [2]:
# ==========================================================================
# STEP 2: See How Text Becomes Tokens
# ==========================================================================
# Tokens are NOT words. They're subword units learned from training data.
# Common words = 1 token. Rare words = multiple tokens.
# ==========================================================================

def show_tokens(text):
    """Visualize how text is tokenized."""
    tokens = enc.encode(text)
    print(f"Text: \"{text}\"")
    print(f"Token count: {len(tokens)}")
    print("Breakdown:")
    for token_id in tokens:
        token_text = enc.decode([token_id])
        # Show whitespace explicitly
        display = repr(token_text) if token_text.strip() != token_text else f"'{token_text}'"
        print(f"  {token_id:>6} → {display}")
    print()

print("How Text Becomes Tokens")
print("=" * 60)
print()

show_tokens("Hello world")
show_tokens("The quick brown fox")
show_tokens("antidisestablishmentarianism")  # Long word = multiple tokens!


How Text Becomes Tokens
============================================================

Text: "Hello world"
Token count: 2
Breakdown:
    9906 → 'Hello'
    1917 → ' world'

Text: "The quick brown fox"
Token count: 4
Breakdown:
     791 → 'The'
    4062 → ' quick'
   14198 → ' brown'
   39935 → ' fox'

Text: "antidisestablishmentarianism"
Token count: 6
Breakdown:
     519 → 'ant'
   85342 → 'idis'
   34500 → 'establish'
     479 → 'ment'
    8997 → 'arian'
    2191 → 'ism'
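
Step 2 notes that tokens are subword units learned from training data. The sketch below illustrates that idea with a toy byte-pair-encoding (BPE) trainer in plain Python. It is a simplified illustration, not tiktoken's implementation: the names `train_bpe`, `merge_pair`, and `apply_merges` and the tiny corpus are made up for this example, and real tokenizers work on bytes, apply a pre-splitting regex, and train on far larger corpora.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    # Start with every word split into single characters.
    corpus = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break  # Every word is already a single symbol.
        best_pair = pair_counts.most_common(1)[0][0]
        merges.append(best_pair)
        corpus = [merge_pair(symbols, best_pair) for symbols in corpus]
    return merges


def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` in `symbols` with the fused symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def apply_merges(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy corpus: "the" is frequent, so it ends up as a single token.
toy_corpus = ["the"] * 5 + ["then", "other"]
toy_merges = train_bpe(toy_corpus, num_merges=2)
print(apply_merges("the", toy_merges))    # frequent word
print(apply_merges("zebra", toy_merges))  # unseen word stays fragmented
```

With only two merges, the frequent word collapses into one symbol while the unseen word stays split into characters, which mirrors the common-word versus rare-word pattern in the output above.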



In [3]:
# ==========================================================================
# STEP 3: Language Comparison — Same Meaning, Different Token Counts
# ==========================================================================
# cl100k_base's vocabulary is skewed toward English text.
# Other languages often need MORE tokens for the same meaning.
# ==========================================================================

print("Language Comparison")
print("=" * 70)
print()

greetings = {
    "English": "Hello, how are you today? I hope you're doing well.",
    "German": "Hallo, wie geht es Ihnen heute? Ich hoffe, es geht Ihnen gut.",
    "French": "Bonjour, comment allez-vous aujourd'hui? J'espère que vous allez bien.",
    "Spanish": "Hola, ¿cómo estás hoy? Espero que estés bien.",
    "Chinese": "你好，今天你好吗？我希望你一切都好。",
    "Japanese": "こんにちは、今日はお元気ですか？",
    "Arabic": "مرحبا، كيف حالك اليوم؟",
}

print(f"{'Language':<12} {'Tokens':>8} {'Chars':>8} {'vs English':>12}")
print("-" * 45)

english_tokens = len(enc.encode(greetings["English"]))

for lang, text in greetings.items():
    tokens = len(enc.encode(text))
    ratio = tokens / english_tokens
    marker = "" if lang == "English" else f"{ratio:.1f}x"
    print(f"{lang:<12} {tokens:>8} {len(text):>8} {marker:>12}")

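print()
chinese_ratio = len(enc.encode(greetings["Chinese"])) / english_tokens
print(f"INSIGHT: Chinese uses {chinese_ratio:.1f}x the tokens of English for the same meaning.")
print("         The Japanese and Arabic samples omit the second sentence,")
print("         so their ratios understate the gap.")


Language Comparison
======================================================================

Language       Tokens    Chars   vs English
---------------------------------------------
English            14       51             
German             18       61         1.3x
French             19       70         1.4x
Spanish            16       45         1.1x
Chinese            22       18         1.6x
Japanese           11       16         0.8x
Arabic             18       22         1.3x

INSIGHT: Chinese uses 1.6x the tokens of English for the same meaning.
         The Japanese and Arabic samples omit the second sentence,
         so their ratios understate the gap.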
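
One reason the Chinese sample needs 22 tokens for only 18 characters: cl100k_base is a byte-level BPE, so it starts from UTF-8 bytes, and each CJK character occupies three bytes versus one for an ASCII letter. The sketch below uses only the standard library to compare characters with UTF-8 bytes. The helper name `byte_profile` and the sample words are made up for this example, and byte count is only a rough proxy for token count, not a replacement for `enc.encode`.

```python
def byte_profile(text):
    """Return (characters, UTF-8 bytes, bytes per character) for a non-empty string."""
    chars = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    return chars, utf8_bytes, utf8_bytes / chars


samples = {
    "English": "Hello",
    "German": "Grüße",
    "Chinese": "你好",
    "Arabic": "مرحبا",
}

print(f"{'Language':<10} {'Chars':>6} {'Bytes':>6} {'Bytes/char':>11}")
for lang, text in samples.items():
    chars, utf8_bytes, ratio = byte_profile(text)
    print(f"{lang:<10} {chars:>6} {utf8_bytes:>6} {ratio:>11.1f}")
```

Scripts that need more bytes per character give the tokenizer more raw material to cover, unless the vocabulary happens to contain merged tokens for them.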


In [4]:
# ==========================================================================
# STEP 4: Numbers Are Expensive!
# ==========================================================================
# cl100k_base splits digit runs into chunks of up to three digits,
# so long numbers always need several tokens.
# UUIDs, timestamps, and IDs can bloat token counts significantly.
# ==========================================================================

print("Numbers Are Token-Expensive")
print("=" * 60)
print()

examples = [
    ("100", "Simple number"),
    ("1000000", "Million"),
    ("3.14159265", "Pi digits"),
    ("2024-12-31", "Date"),
    ("192.168.1.100", "IP address"),
    ("$1,234,567.89", "Currency"),
    ("550e8400-e29b-41d4-a716-446655440000", "UUID"),
]

print(f"{'Tokens':>6}  {'Example':<40} {'Note'}")
print("-" * 70)

for text, note in examples:
    tokens = len(enc.encode(text))
    print(f"{tokens:>6}  {text:<40} {note}")

print()
print("INSIGHT: UUIDs alone = 15+ tokens each!")
print("         Consider: Do you need full IDs in prompts?")


Numbers Are Token-Expensive
============================================================

Tokens  Example                                  Note
----------------------------------------------------------------------
     1  100                                      Simple number
     3  1000000                                  Million
     5  3.14159265                               Pi digits
     6  2024-12-31                               Date
     7  192.168.1.100                            IP address
     8  $1,234,567.89                            Currency
    18  550e8400-e29b-41d4-a716-446655440000     UUID

INSIGHT: UUIDs alone = 15+ tokens each!
         Consider: Do you need full IDs in prompts?
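
The counts in the table follow from how cl100k_base pre-splits text before BPE merging: as far as I know, its split pattern matches digit runs in groups of at most three (`\p{N}{1,3}`), so a long number cannot become a single token. The sketch below mimics only that one rule with the standard `re` module. `digit_chunks` is a made-up helper that treats every non-digit character as its own piece, so it gives a rough piece count rather than tiktoken's exact result (the letters in a UUID, for instance, merge differently).

```python
import re


def digit_chunks(text):
    """Split text into runs of up to three digits; other characters stand alone."""
    return re.findall(r"\d{1,3}|\D", text)


for example in ["100", "1000000", "3.14159265", "2024-12-31"]:
    pieces = digit_chunks(example)
    print(f"{len(pieces):>2}  {example:<12} {pieces}")
```

For these four inputs the piece counts (1, 3, 5, 6) match the token counts in the table above.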


In [5]:
# ==========================================================================
# STEP 5: Real Cost Calculator
# ==========================================================================
# Calculate actual costs using real token counts.
# ==========================================================================

def estimate_cost(text, price_per_1k=0.003):
    """Estimate cost for processing this text."""
    tokens = len(enc.encode(text))
    cost = (tokens / 1000) * price_per_1k
    return tokens, cost

print("Real Cost Estimation (mid-tier model: $0.003/1K input tokens)")
print("=" * 70)
print()

documents = {
    "Short email": "Hi John, can we reschedule our meeting to 3pm? Thanks, Sarah",
    "Support ticket": """I'm having trouble logging into my account. I've tried resetting 
my password three times but keep getting an error message saying 'Invalid credentials'. 
My username is john.doe@example.com. Please help!""",
    "Legal clause": """Notwithstanding any other provision of this Agreement, neither party 
shall be liable to the other for any indirect, incidental, consequential, special, or 
exemplary damages arising out of or related to this Agreement, including but not limited 
to loss of revenue, loss of profits, loss of business, or loss of data.""",
}

print(f"{'Document':<20} {'Tokens':>8} {'Per-doc':>12} {'10K/day':>12} {'Monthly':>12}")
print("-" * 70)

for name, text in documents.items():
    tokens, cost = estimate_cost(text)
    daily = cost * 10000
    monthly = daily * 30
    print(f"{name:<20} {tokens:>8} ${cost:>10.5f} ${daily:>10.2f} ${monthly:>10.2f}")

print()
print("INSIGHT: Legal docs (verbose) cost ~4x as much as emails (concise).")


Real Cost Estimation (mid-tier model: $0.003/1K input tokens)
======================================================================

Document               Tokens      Per-doc      10K/day      Monthly
----------------------------------------------------------------------
Short email                17 $   0.00005 $      0.51 $     15.30
Support ticket             42 $   0.00013 $      1.26 $     37.80
Legal clause               66 $   0.00020 $      1.98 $     59.40

INSIGHT: Legal docs (verbose) cost ~4x as much as emails (concise).
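
The table prices input tokens only. Many APIs also bill generated output tokens, often at a different rate, so a fuller estimate adds both sides. A minimal sketch of that arithmetic follows. The function name `estimate_request_cost`, the output price of $0.015/1K, and the 200-token reply length are hypothetical placeholders, not figures from any price list.

```python
def estimate_request_cost(input_tokens, output_tokens,
                          input_price_per_1k=0.003,
                          output_price_per_1k=0.015):
    """Cost of one request in dollars, billing input and output separately.

    Both default prices are placeholders; substitute your provider's rates.
    """
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost


# 66 input tokens (the legal clause above) plus an assumed 200-token reply.
per_request = estimate_request_cost(input_tokens=66, output_tokens=200)
print(f"Per request: ${per_request:.5f}")
print(f"10K/day:     ${per_request * 10_000:,.2f}")
print(f"Monthly:     ${per_request * 10_000 * 30:,.2f}")
```

Under these assumed rates the reply dominates the bill, so output length deserves the same scrutiny as prompt length.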


In [6]:
# ==========================================================================
# STEP 6: Prompt Optimization — Before vs After
# ==========================================================================
# Small prompt changes can significantly reduce token counts at scale.
# ==========================================================================

print("Prompt Optimization")
print("=" * 60)
print()

verbose = """You are a helpful AI assistant that specializes in answering 
questions about our company's products and services. Please provide detailed, 
comprehensive, and helpful responses to all user inquiries. Make sure to be 
polite and professional at all times. If you don't know the answer to a 
question, please say so rather than making something up."""

concise = """You are a product support assistant. Be helpful and accurate. 
If unsure, say so."""

v_tokens = len(enc.encode(verbose))
c_tokens = len(enc.encode(concise))

print("VERBOSE SYSTEM PROMPT:")
print(f"  Tokens: {v_tokens}")
print()
print("CONCISE SYSTEM PROMPT:")
print(f"  Tokens: {c_tokens}")
print()

savings = v_tokens - c_tokens
# At 50K requests/day
daily_requests = 50000
daily_savings = (savings / 1000) * 0.003 * daily_requests
monthly_savings = daily_savings * 30

print(f"Savings: {savings} tokens/request ({savings/v_tokens*100:.0f}% reduction)")
print()
print(f"At {daily_requests:,} requests/day:")
print(f"  Daily:   ${daily_savings:,.2f}")
print(f"  Monthly: ${monthly_savings:,.2f}")
print(f"  Annual:  ${monthly_savings * 12:,.2f}")
print()
print("TAKEAWAY: Shorter prompts = real money saved at scale.")


Prompt Optimization
============================================================

VERBOSE SYSTEM PROMPT:
  Tokens: 70

CONCISE SYSTEM PROMPT:
  Tokens: 19

Savings: 51 tokens/request (73% reduction)

At 50,000 requests/day:
  Daily:   $7.65
  Monthly: $229.50
  Annual:  $2,754.00

TAKEAWAY: Shorter prompts = real money saved at scale.


## Summary

**Key tokenization insights for architects:**

1. **Tokens ≠ Words**: Common words = 1 token, rare words = multiple tokens
2. **Language matters**: In this sample, German needed ~1.3x and Chinese ~1.6x the tokens of English; measure your own text
3. **Numbers are expensive**: Digit runs split into chunks of up to three digits; a UUID took 18 tokens
4. **Prompt optimization pays**: Small changes compound at scale

**Practical applications:**
- Use `tiktoken` to validate cost estimates before production
- Factor language into multi-region pricing
- Optimize system prompts for token efficiency
- Consider summarization to reduce input tokens
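
The first bullet, validating estimates before production, can be sketched as a pre-flight check that flags prompts over a token budget. The helper `check_token_budget` and its parameters are made up for this example. It accepts any counting function, so in this notebook you could pass `lambda s: len(enc.encode(s))` with the `enc` object from Step 1. The whitespace counter below is only a stand-in that keeps the sketch runnable without `tiktoken`, and it undercounts real tokens.

```python
def check_token_budget(prompt, count_tokens, max_tokens):
    """Return (token_count, within_budget) for a prompt.

    count_tokens: any callable mapping a string to a token count.
    """
    used = count_tokens(prompt)
    return used, used <= max_tokens


def whitespace_count(text):
    """Stand-in counter: one 'token' per whitespace-separated word."""
    return len(text.split())


prompt = "You are a product support assistant. Be helpful and accurate."
used, ok = check_token_budget(prompt, whitespace_count, max_tokens=8)
print(f"{used} tokens, within budget: {ok}")
```

Injecting the counter keeps the budget logic independent of any one tokenizer, which helps when different models use different encodings.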
