# Understanding Tokenization: From Text to Numbers

Welcome to the foundation of language models! Before a transformer can process text, the text must be converted into numbers. This process is called **tokenization**.

## What You'll Learn

1. **What is Tokenization?** - Converting text to numerical representations
2. **Character-Level Tokenization** - Simplest approach for learning
3. **Building Vocabularies** - Creating token mappings from text
4. **Subword Tokenization** - How modern LLMs handle text (BPE, GPT-2 style)
5. **Special Tokens** - Handling padding, unknown words, and boundaries
6. **Real-World Impact** - How tokenization affects model size and performance

Let's start with the basics and work our way up to modern approaches!

In [None]:
import sys
import os
sys.path.append('..')

import re
import json
from collections import Counter, defaultdict
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple

# Try to import advanced tokenizers
try:
    import tiktoken
    TIKTOKEN_AVAILABLE = True
    print("✅ tiktoken available - we can use GPT-2 style tokenization")
except ImportError:
    TIKTOKEN_AVAILABLE = False
    print("⚠️ tiktoken not available - we'll focus on character-level tokenization")

# Set style for better plots
plt.style.use('default')
sns.set_palette("husl")

print("Environment setup complete!")

## 1. What is Tokenization?

**Tokenization** is the process of breaking down text into smaller units called **tokens**, then converting these tokens into numerical IDs that neural networks can process.

```
Text: "Hello world!"
  ↓ Tokenization
Tokens: ["Hello", " world", "!"]
  ↓ Convert to IDs
Token IDs: [15496, 995, 0]
```

Let's see this in action with different approaches:

In [None]:
def demonstrate_tokenization_concept():
    """Show the basic concept of tokenization with different granularities."""
    
    text = "Hello world! 🌍"
    print(f"Original text: '{text}'")
    print()
    
    # Different tokenization approaches
    approaches = [
        ("Character-level", list(text)),
        ("Word-level", text.split()),
        ("Subword (manual)", ["Hello", " world", "!", " 🌍"])
    ]
    
    for name, tokens in approaches:
        print(f"{name:16}: {len(tokens):2d} tokens → {tokens}")
    
    print()
    print("🔍 Key Observations:")
    print("• Character-level: Many tokens, handles any text, but very long sequences")
    print("• Word-level: Fewer tokens, but struggles with new/rare words")
    print("• Subword: Balanced approach - handles new words while keeping sequences manageable")
    
    return approaches

# Demonstrate the concept
tokenization_examples = demonstrate_tokenization_concept()
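
The word-level weakness noted above (new or rare words) can be made concrete with a toy vocabulary. This is only an illustrative sketch: `build_word_vocab` and `encode_words` are hypothetical helpers, not part of any library, and splitting on whitespace is a simplifying assumption.

```python
from typing import Dict, List


def build_word_vocab(corpus: List[str]) -> Dict[str, int]:
    """Map each whitespace-separated word to an ID; ID 0 is reserved for <UNK>."""
    vocab = {"<UNK>": 0}
    for sentence in corpus:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode_words(text: str, vocab: Dict[str, int]) -> List[int]:
    """Look up each word, falling back to the <UNK> ID for unseen words."""
    return [vocab.get(word, vocab["<UNK>"]) for word in text.split()]


word_vocab = build_word_vocab(["hello world", "hello there"])
seen_ids = encode_words("hello world", word_vocab)
unseen_ids = encode_words("hello tokenizers", word_vocab)

print(seen_ids)    # [1, 2]
print(unseen_ids)  # [1, 0]  ("tokenizers" was never seen, so it becomes <UNK>)
```

Every unseen word collapses to the same `<UNK>` ID, so the model loses all information about it. Subword methods avoid this by falling back to smaller known pieces.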

## 2. Character-Level Tokenization

Let's start with the simplest approach: treating each character as a token. It is a good starting point for learning, with clear trade-offs:
- ✅ Simple to understand and implement
- ✅ No out-of-vocabulary (OOV) words, as long as the vocabulary covers every character in the input
- ✅ Works with any language
- ❌ Creates very long sequences
- ❌ Hard to learn word-level patterns

In [None]:
class SimpleCharacterTokenizer:
    """Simple character-level tokenizer for educational purposes."""
    
    def __init__(self, text_corpus: str = None):
        # Special tokens
        self.PAD = '<PAD>'  # For padding sequences to same length
        self.UNK = '<UNK>'  # For unknown characters
        self.BOS = '<BOS>'  # Beginning of sequence
        self.EOS = '<EOS>'  # End of sequence
        
        # Build vocabulary
        if text_corpus:
            self.build_vocab(text_corpus)
        else:
            # Default vocabulary with common characters
            self.vocab = self._create_default_vocab()
            self._create_mappings()
    
    def _create_default_vocab(self):
        """Create a default vocabulary with common characters."""
        # Letters, digits, punctuation, and special tokens
        chars = []
        chars.extend([chr(i) for i in range(32, 127)])  # Printable ASCII
        chars.extend([self.PAD, self.UNK, self.BOS, self.EOS])
        return chars
    
    def build_vocab(self, text: str):
        """Build vocabulary from text corpus."""
        # Get unique characters from text
        unique_chars = sorted(set(text))
        
        # Add special tokens
        self.vocab = [self.PAD, self.UNK, self.BOS, self.EOS] + unique_chars
        self._create_mappings()
        
        print(f"Built vocabulary from {len(text):,} characters")
        print(f"Vocabulary size: {len(self.vocab)} unique characters")
    
    def _create_mappings(self):
        """Create character ↔ ID mappings."""
        self.char_to_id = {char: i for i, char in enumerate(self.vocab)}
        self.id_to_char = {i: char for i, char in enumerate(self.vocab)}
    
    def encode(self, text: str, add_special_tokens: bool = True) -> List[int]:
        """Convert text to token IDs."""
        tokens = []
        
        if add_special_tokens:
            tokens.append(self.char_to_id[self.BOS])
        
        for char in text:
            token_id = self.char_to_id.get(char, self.char_to_id[self.UNK])
            tokens.append(token_id)
        
        if add_special_tokens:
            tokens.append(self.char_to_id[self.EOS])
        
        return tokens
    
    def decode(self, token_ids: List[int], skip_special_tokens: bool = True) -> str:
        """Convert token IDs back to text."""
        special_ids = {
            self.char_to_id[self.PAD],
            self.char_to_id[self.BOS],
            self.char_to_id[self.EOS]
        }
        
        chars = []
        for token_id in token_ids:
            if skip_special_tokens and token_id in special_ids:
                continue
            chars.append(self.id_to_char.get(token_id, self.UNK))
        
        return ''.join(chars)
    
    def tokenize(self, text: str) -> List[str]:
        """Return list of character tokens."""
        return [self.BOS] + list(text) + [self.EOS]

# Create and test character tokenizer
char_tokenizer = SimpleCharacterTokenizer()

# Test with example text
test_text = "Hello, world! 🚀"
print(f"Original text: '{test_text}'")
print(f"Vocabulary size: {len(char_tokenizer.vocab)}")
print()

# Tokenization process
tokens = char_tokenizer.tokenize(test_text)
token_ids = char_tokenizer.encode(test_text)
decoded_text = char_tokenizer.decode(token_ids)

print("Tokenization process:")
print(f"1. Tokens:     {tokens}")
print(f"2. Token IDs:  {token_ids}")
print(f"3. Decoded:    '{decoded_text}'")
print()
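# Note: the default vocabulary is printable ASCII only, so the emoji encodes
# to <UNK> and this round-trip check is expected to print False.
print(f"Round-trip successful: {test_text == decoded_text}")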

### Visualizing Character Tokenization

In [None]:
def visualize_character_tokenization():
    """Visualize how character tokenization works."""
    
    text = "AI is amazing!"
    tokens = char_tokenizer.tokenize(text)
    token_ids = char_tokenizer.encode(text)
    
    print(f"Text: '{text}'")
    print()
    
    # Show character-by-character mapping
    print("Character → Token ID mapping:")
    print("─" * 30)
    
    for i, (token, token_id) in enumerate(zip(tokens, token_ids)):
        if token in ['<BOS>', '<EOS>']:
            print(f"{i:2d}: {token:>6} → {token_id:3d}  (special)")
        else:
            print(f"{i:2d}: {repr(token):>6} → {token_id:3d}")
    
    # Vocabulary analysis
    print(f"\n📊 Vocabulary Analysis:")
    print(f"Total vocabulary size: {len(char_tokenizer.vocab)}")
    print(f"Sequence length: {len(token_ids)} tokens")
    print(f"Original text length: {len(text)} characters")
    
    # Show some vocabulary examples
    print("\n🔤 Sample vocabulary (first 20 tokens):")
    for i in range(min(20, len(char_tokenizer.vocab))):
        char = char_tokenizer.vocab[i]
        if char in ['<PAD>', '<UNK>', '<BOS>', '<EOS>']:
            print(f"{i:3d}: {char}")
        else:
            print(f"{i:3d}: {repr(char)}")

visualize_character_tokenization()

## 3. Building Vocabulary from Real Text

Let's see how to build a vocabulary from a real text corpus and analyze character frequencies:

In [None]:
def analyze_text_corpus():
    """Analyze character frequencies in a text corpus."""
    
    # Sample text corpus (you could load from a file)
    corpus = """
    The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet!
    Machine learning is transforming how we process natural language. 
    Transformers use attention mechanisms to understand context and relationships between words.
    Deep learning models can generate human-like text with remarkable accuracy.
    The future of AI looks very promising! 🤖✨
    """.strip()
    
    print(f"Corpus length: {len(corpus):,} characters")
    print(f"Sample text: {repr(corpus[:100])}...")
    print()
    
    # Build tokenizer from this corpus
    corpus_tokenizer = SimpleCharacterTokenizer(corpus)
    
    # Analyze character frequencies
    char_counts = Counter(corpus)
    total_chars = len(corpus)
    
    print("📊 Character Frequency Analysis:")
    print("─" * 40)
    
    # Show top 15 most common characters
    for char, count in char_counts.most_common(15):
        percentage = (count / total_chars) * 100
        char_repr = repr(char) if char.isprintable() else f"'\\x{ord(char):02x}'"
        print(f"{char_repr:>8}: {count:3d} ({percentage:4.1f}%)")
    
    # Visualize character frequencies
    plt.figure(figsize=(12, 8))
    
    # Top characters
    plt.subplot(2, 1, 1)
    top_chars = char_counts.most_common(20)
    chars, counts = zip(*top_chars)
    char_labels = [repr(c) if c.isprintable() else f"\\x{ord(c):02x}" for c in chars]
    
    plt.bar(range(len(chars)), counts, color='skyblue')
    plt.xlabel('Characters')
    plt.ylabel('Frequency')
    plt.title('Top 20 Most Frequent Characters')
    plt.xticks(range(len(chars)), char_labels, rotation=45)
    
    # Character distribution
    plt.subplot(2, 1, 2)
    frequencies = list(char_counts.values())
    plt.hist(frequencies, bins=20, color='lightgreen', alpha=0.7)
    plt.xlabel('Character Frequency')
    plt.ylabel('Number of Characters')
    plt.title('Distribution of Character Frequencies')
    plt.yscale('log')
    
    plt.tight_layout()
    plt.show()
    
    # Test encoding/decoding
    # Note: uppercase 'H' never appears in the corpus, so it encodes to <UNK>
    # and the round trip below is expected to fail for that one character.
    test_sentence = "Hello AI! 🤖"
    encoded = corpus_tokenizer.encode(test_sentence)
    decoded = corpus_tokenizer.decode(encoded)
    
    print(f"\n🧪 Encoding Test:")
    print(f"Original: '{test_sentence}'")
    print(f"Encoded:  {encoded}")
    print(f"Decoded:  '{decoded}'")
    print(f"Round-trip: {test_sentence == decoded}")
    
    return corpus_tokenizer, char_counts

# Analyze the corpus
corpus_tokenizer, char_frequencies = analyze_text_corpus()
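
One common refinement when building a vocabulary from a corpus is to drop very rare characters and let them fall back to the unknown token. The sketch below assumes a hypothetical `min_count` threshold and a helper named `build_char_vocab`; neither appears in the code above.

```python
from collections import Counter
from typing import Dict

SPECIAL_TOKENS = ["<PAD>", "<UNK>", "<BOS>", "<EOS>"]


def build_char_vocab(text: str, min_count: int = 1) -> Dict[str, int]:
    """Build a char -> ID mapping, keeping characters seen at least min_count times."""
    counts = Counter(text)
    kept = sorted(ch for ch, n in counts.items() if n >= min_count)
    # Special tokens come first so their IDs stay stable across corpora.
    return {token: i for i, token in enumerate(SPECIAL_TOKENS + kept)}


sample = "banana bandana"
full_vocab = build_char_vocab(sample)
pruned_vocab = build_char_vocab(sample, min_count=3)

print(len(full_vocab), len(pruned_vocab))  # 9 6
```

With `min_count=3`, only `a` and `n` survive next to the four special tokens; `b`, `d`, and the space would encode to `<UNK>`.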

## 4. Introduction to Subword Tokenization

While character-level tokenization is simple, modern language models use **subword tokenization**. This approach:
- ✅ Handles new words by breaking them into known subparts
- ✅ Creates shorter sequences than character-level
- ✅ Captures morphological patterns (prefixes, suffixes)
- ✅ Works well across different languages

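Let's implement a simple version of **Byte Pair Encoding (BPE)**, which is used by GPT-2 and many other models. GPT-2 applies BPE to bytes; the simplified version below works on the characters of lowercased words: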

In [None]:
class SimpleBPETokenizer:
    """Simple Byte Pair Encoding tokenizer for educational purposes."""
    
    def __init__(self):
        self.vocab = {}
        self.merges = []
        self.word_freqs = {}
    
    def train(self, texts: List[str], vocab_size: int = 1000):
        """Train BPE on a corpus of texts."""
        print(f"Training BPE tokenizer on {len(texts)} texts...")
        
        # Step 1: Count word frequencies and initialize with characters
        self.word_freqs = self._count_word_frequencies(texts)
        
        # Step 2: Initialize vocabulary with characters
        self.vocab = self._get_character_vocab()
        
        # Step 3: Iteratively merge most frequent pairs
        while len(self.vocab) < vocab_size:
            # Find most frequent pair
            pair_counts = self._count_pairs()
            if not pair_counts:
                break
                
            most_frequent_pair = max(pair_counts, key=pair_counts.get)
            
            # Merge the pair
            self._merge_pair(most_frequent_pair)
            
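            # Add to vocabulary and merges. Two different pairs can join into
            # the same string, so only assign an ID the first time it appears.
            new_token = ''.join(most_frequent_pair)
            if new_token not in self.vocab:
                self.vocab[new_token] = len(self.vocab)
            self.merges.append(most_frequent_pair)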
        
        print(f"Training complete! Vocabulary size: {len(self.vocab)}")
        print(f"Number of merges: {len(self.merges)}")
    
    def _count_word_frequencies(self, texts: List[str]) -> Dict[str, int]:
        """Count word frequencies in the corpus."""
        word_freqs = Counter()
        
        for text in texts:
            # Simple word splitting (could be more sophisticated)
            words = re.findall(r'\w+|[^\w\s]', text.lower())
            word_freqs.update(words)
        
        # Convert to character-separated format for BPE
        char_word_freqs = {}
        for word, freq in word_freqs.items():
            # Separate characters with spaces and append </w> to mark the word boundary
            char_word = ' '.join(word) + ' </w>'
            char_word_freqs[char_word] = freq
        
        return char_word_freqs
    
    def _get_character_vocab(self) -> Dict[str, int]:
        """Initialize vocabulary with all characters."""
        chars = set()
        for word in self.word_freqs:
            chars.update(word.split())
        
        # Add special tokens
        chars.update(['<pad>', '<unk>', '<bos>', '<eos>'])
        
        return {char: i for i, char in enumerate(sorted(chars))}
    
    def _count_pairs(self) -> Dict[Tuple[str, str], int]:
        """Count frequency of adjacent pairs in the vocabulary."""
        pair_counts = Counter()
        
        for word, freq in self.word_freqs.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pair = (symbols[i], symbols[i + 1])
                pair_counts[pair] += freq
        
        return pair_counts
    
    def _merge_pair(self, pair: Tuple[str, str]):
        """Merge a pair in all words."""
        new_word_freqs = {}
        
        for word, freq in self.word_freqs.items():
            new_word = self._merge_word(word, pair)
            new_word_freqs[new_word] = freq
        
        self.word_freqs = new_word_freqs
    
    def _merge_word(self, word: str, pair: Tuple[str, str]) -> str:
        """Merge a specific pair in a word."""
        symbols = word.split()
        new_symbols = []
        i = 0
        
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                # Merge the pair
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        
        return ' '.join(new_symbols)
    
    def encode(self, text: str) -> List[str]:
        """Encode text using learned BPE merges."""
        # Simple word splitting
        words = re.findall(r'\w+|[^\w\s]', text.lower())
        
        tokens = []
        for word in words:
            # Start with character-separated word
            word_tokens = ' '.join(word) + ' </w>'
            
            # Apply merges in order
            for pair in self.merges:
                word_tokens = self._merge_word(word_tokens, pair)
            
            tokens.extend(word_tokens.split())
        
        return tokens
    
    def show_merges(self, n: int = 10):
        """Show the first n merges learned."""
        print(f"First {n} BPE merges:")
        print("─" * 30)
        for i, (left, right) in enumerate(self.merges[:n]):
            merged = left + right
            print(f"{i+1:2d}: '{left}' + '{right}' → '{merged}'")

# Train a simple BPE tokenizer
training_texts = [
    "The quick brown fox jumps over the lazy dog",
    "Machine learning is amazing and transformative",
    "Natural language processing with transformers",
    "Deep learning models learn patterns from data",
    "Artificial intelligence is changing the world",
    "Tokenization is the first step in text processing"
]

bpe_tokenizer = SimpleBPETokenizer()
bpe_tokenizer.train(training_texts, vocab_size=100)

# Show some learned merges
bpe_tokenizer.show_merges(10)

# Test encoding
test_text = "learning transformers"
bpe_tokens = bpe_tokenizer.encode(test_text)
# Compare against raw characters only (no <BOS>/<EOS>) so the ratio is fair
char_tokens = list(test_text)

print(f"\n🧪 Tokenization Comparison:")
print(f"Text: '{test_text}'")
print(f"BPE tokens:       {bpe_tokens} ({len(bpe_tokens)} tokens)")
print(f"Character tokens: {char_tokens} ({len(char_tokens)} tokens)")
print(f"\n📊 BPE produces a {len(char_tokens)/len(bpe_tokens):.1f}x shorter sequence here")
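
To make the merge loop concrete, here is a single BPE step on a three-word toy corpus. The word frequencies are made up for illustration, and `count_pairs` and `merge` are standalone hypothetical helpers that mirror `_count_pairs` and `_merge_word` above, using tuples of symbols instead of space-joined strings.

```python
from collections import Counter
from typing import Dict, Tuple

Symbols = Tuple[str, ...]

# Toy corpus: each word is a tuple of symbols ending in the </w> marker.
word_freqs: Dict[Symbols, int] = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("s", "l", "o", "t", "</w>"): 3,
}


def count_pairs(freqs: Dict[Symbols, int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge(symbols: Symbols, pair: Tuple[str, str]) -> Symbols:
    """Replace every occurrence of `pair` in `symbols` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


pair_counts = count_pairs(word_freqs)
best_pair = max(pair_counts, key=pair_counts.get)
merged = {merge(symbols, best_pair): freq for symbols, freq in word_freqs.items()}

print(best_pair, pair_counts[best_pair])  # ('l', 'o') 10
print(list(merged))
```

The pair `('l', 'o')` appears in all three words (5 + 2 + 3 = 10 times), so it is merged first and `lo` becomes a new symbol. Training repeats this count-and-merge step until the vocabulary reaches the target size.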

## 5. Real-World Tokenization: GPT-2 Style

Let's see how modern language models like GPT-2 actually tokenize text using the `tiktoken` library:

In [None]:
def demonstrate_gpt2_tokenization():
    """Show how GPT-2 tokenizes text using tiktoken."""
    
    if not TIKTOKEN_AVAILABLE:
        print("⚠️ tiktoken not available. Install with: pip install tiktoken")
        return
    
    # Load GPT-2 tokenizer
    enc = tiktoken.get_encoding("gpt2")
    
    print(f"GPT-2 Tokenizer:")
    print(f"Vocabulary size: {enc.n_vocab:,} tokens")
    print()
    
    # Test different types of text
    test_cases = [
        "Hello world!",
        "Tokenization is fascinating",
        "The transformer architecture revolutionized NLP",
        "🤖 AI + ML = 🚀",
        "antidisestablishmentarianism",  # Long word
        "New_words_with_underscores",
        "CamelCaseWords",
        "123-456-7890"
    ]
    
    print("🔍 GPT-2 Tokenization Examples:")
    print("─" * 60)
    
    for text in test_cases:
        # Encode to token IDs
        token_ids = enc.encode(text)
        
        # Decode back to check
        decoded = enc.decode(token_ids)
        
        # Get token strings
        tokens = [enc.decode([token_id]) for token_id in token_ids]
        
        print(f"Text: '{text}'")
        print(f"Tokens: {tokens}")
        print(f"Token IDs: {token_ids}")
        print(f"Length: {len(token_ids)} tokens (vs {len(text)} chars)")
        print(f"Round-trip: {text == decoded}")
        print()
    
    # Compare tokenization approaches
    comparison_text = "The transformer model uses self-attention mechanisms"
    
    gpt2_tokens = enc.encode(comparison_text)
    char_tokens = char_tokenizer.encode(comparison_text, add_special_tokens=False)
    
    print("📊 Tokenization Comparison:")
    print(f"Text: '{comparison_text}'")
    print(f"Characters: {len(comparison_text)} chars")
    print(f"GPT-2 tokens: {len(gpt2_tokens)} tokens")
    print(f"Character tokens: {len(char_tokens)} tokens")
    print(f"GPT-2 efficiency: {len(char_tokens)/len(gpt2_tokens):.1f}x compression")
    
    # Show actual GPT-2 tokens
    gpt2_token_strings = [enc.decode([tid]) for tid in gpt2_tokens]
    print(f"\nGPT-2 tokens: {gpt2_token_strings}")
    
    return enc

# Demonstrate GPT-2 tokenization
gpt2_encoder = demonstrate_gpt2_tokenization()
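
GPT-2's tokenizer runs BPE over UTF-8 bytes rather than Unicode characters, which is why emoji survive its round trip while the ASCII-only character tokenizer earlier falls back to `<UNK>`. The sketch below shows only the byte-level base layer, with no merges applied; `bytes_encode` and `bytes_decode` are hypothetical names used for illustration.

```python
from typing import List


def bytes_encode(text: str) -> List[int]:
    """Turn text into byte values (0-255), one ID per UTF-8 byte."""
    return list(text.encode("utf-8"))


def bytes_decode(ids: List[int]) -> str:
    """Reassemble the bytes and decode them back to text."""
    return bytes(ids).decode("utf-8")


text = "Hi 🚀"
byte_ids = bytes_encode(text)

print(byte_ids)                 # [72, 105, 32, 240, 159, 154, 128]
print(bytes_decode(byte_ids))   # Hi 🚀
```

A base vocabulary of 256 byte values can represent any string, so nothing is ever out of vocabulary. The cost is that a single emoji takes four IDs before any merges are learned.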

## 6. Special Tokens and Their Importance

Special tokens serve crucial roles in language models. Let's understand what they do:

In [None]:
def demonstrate_special_tokens():
    """Show the role of special tokens in language models."""
    
    print("🎯 Special Tokens and Their Roles:")
    print("═" * 50)
    
    special_tokens = {
        '<PAD>': 'Padding - Make all sequences the same length',
        '<UNK>': 'Unknown - Handle out-of-vocabulary words',
        '<BOS>': 'Beginning of Sequence - Mark sequence start',
        '<EOS>': 'End of Sequence - Mark sequence end',
        '<MASK>': 'Mask - For masked language modeling (written [MASK] in BERT)',
        '<CLS>': 'Classification - For sequence classification (written [CLS] in BERT)',
        '<SEP>': 'Separator - Separate different parts of input (written [SEP] in BERT)'
    }
    
    for token, description in special_tokens.items():
        print(f"{token:8} : {description}")
    
    print("\n" + "─" * 50)
    
    # Demonstrate padding
    sentences = [
        "AI is cool",
        "Machine learning",
        "Natural language processing with transformers"
    ]
    
    print("\n📦 Padding Example (making sequences same length):")
    
    # Tokenize all sentences
    tokenized = [char_tokenizer.encode(s, add_special_tokens=False) for s in sentences]
    
    # Find max length
    max_length = max(len(tokens) for tokens in tokenized)
    pad_id = char_tokenizer.char_to_id['<PAD>']
    
    print(f"Max sequence length: {max_length}")
    print(f"PAD token ID: {pad_id}")
    print()
    
    # Pad sequences
    for i, (sentence, tokens) in enumerate(zip(sentences, tokenized)):
        # Pad to max length
        padded_tokens = tokens + [pad_id] * (max_length - len(tokens))
        
        print(f"Sentence {i+1}: '{sentence}'")
        print(f"Original:  {tokens} (len: {len(tokens)})")
        print(f"Padded:    {padded_tokens} (len: {len(padded_tokens)})")
        print()
    
    # Demonstrate unknown tokens
    print("❓ Unknown Token Example:")
    
    # Text with emoji that might not be in vocabulary
    text_with_unknown = "Hello 🦄 world"
    
    # Create a limited tokenizer (lowercase letters, space, and the capitals H, E, L, O)
    limited_vocab = list("abcdefghijklmnopqrstuvwxyz HELO")
    limited_tokenizer = SimpleCharacterTokenizer()
    limited_tokenizer.vocab = ['<PAD>', '<UNK>', '<BOS>', '<EOS>'] + limited_vocab
    limited_tokenizer._create_mappings()
    
    tokens = limited_tokenizer.encode(text_with_unknown)
    decoded = limited_tokenizer.decode(tokens)
    
    print(f"Original: '{text_with_unknown}'")
    print(f"Tokens:   {tokens}")
    print(f"Decoded:  '{decoded}'")
    print(f"Note: Emoji became {limited_tokenizer.UNK} (unknown token)")

demonstrate_special_tokens()
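
Padded positions are usually paired with a mask so the model can ignore them. The sketch below is a minimal illustration using a hypothetical helper `pad_batch`; the padding ID of 0 is an arbitrary choice for this example, and the batch is assumed to be non-empty.

```python
from typing import List, Tuple


def pad_batch(
    sequences: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest length and build a 1/0 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, mask


batch = [[5, 6, 7], [8, 9], [10]]
padded, mask = pad_batch(batch)

print(padded)  # [[5, 6, 7], [8, 9, 0], [10, 0, 0]]
print(mask)    # [[1, 1, 1], [1, 1, 0], [1, 0, 0]]
```

The mask has the same shape as the padded batch, so it can be passed alongside the token IDs to whichever layer needs to skip padding.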

## 7. Impact on Model Architecture

Tokenization decisions directly affect model architecture and performance. Let's see how:

In [None]:
def analyze_tokenization_impact():
    """Analyze how tokenization affects model architecture and performance."""
    
    print("🏗️ TOKENIZATION IMPACT ON MODEL ARCHITECTURE")
    print("═" * 60)
    
    # Different tokenization scenarios
    scenarios = [
        ("Character-level", 128, "Small vocab, long sequences"),
        ("Word-level", 50000, "Large vocab, medium sequences"),
        ("Subword (GPT-2)", 50257, "Balanced vocab and sequences"),
        ("Subword (Large)", 100000, "Very large vocab, short sequences")
    ]
    
    d_model = 512  # Model dimension
    sample_text = "The transformer architecture revolutionized natural language processing"
    
    print(f"Sample text: '{sample_text}'")
    print(f"Text length: {len(sample_text)} characters")
    print()
    
    results = []
    
    for name, vocab_size, description in scenarios:
        # Estimate sequence length based on tokenization type
        if "Character" in name:
            seq_len = len(sample_text) + 2  # +2 for BOS/EOS
        elif "Word" in name:
            seq_len = len(sample_text.split()) + 2
        else:  # Subword
            # Estimate based on compression ratio
            if vocab_size <= 50257:
                seq_len = int(len(sample_text) / 4) + 2  # ~4 chars per token
            else:
                seq_len = int(len(sample_text) / 6) + 2  # ~6 chars per token
        
        # Calculate memory requirements
        embedding_params = vocab_size * d_model
        attention_memory = seq_len ** 2  # Attention matrix
        
        results.append({
            'name': name,
            'vocab_size': vocab_size,
            'seq_len': seq_len,
            'embedding_params': embedding_params,
            'attention_memory': attention_memory,
            'description': description
        })
    
    # Display results
    print("📊 Architecture Impact Analysis:")
    print("─" * 80)
    print(f"{'Tokenization':<15} {'Vocab':<8} {'Seq Len':<8} {'Embed Params':<12} {'Attn Memory':<12} {'Description':<25}")
    print("─" * 80)
    
    for r in results:
        print(f"{r['name']:<15} {r['vocab_size']:<8,} {r['seq_len']:<8} {r['embedding_params']:<12,} {r['attention_memory']:<12,} {r['description']:<25}")
    
    # Visualize trade-offs
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
    
    names = [r['name'] for r in results]
    
    # Vocabulary size
    vocab_sizes = [r['vocab_size'] for r in results]
    ax1.bar(names, vocab_sizes, color='skyblue')
    ax1.set_title('Vocabulary Size')
    ax1.set_ylabel('Number of Tokens')
    ax1.tick_params(axis='x', rotation=45)
    ax1.set_yscale('log')
    
    # Sequence length
    seq_lens = [r['seq_len'] for r in results]
    ax2.bar(names, seq_lens, color='lightgreen')
    ax2.set_title('Sequence Length')
    ax2.set_ylabel('Number of Tokens')
    ax2.tick_params(axis='x', rotation=45)
    
    # Embedding parameters
    embed_params = [r['embedding_params'] for r in results]
    ax3.bar(names, embed_params, color='lightcoral')
    ax3.set_title('Embedding Parameters')
    ax3.set_ylabel('Number of Parameters')
    ax3.tick_params(axis='x', rotation=45)
    ax3.set_yscale('log')
    
    # Attention memory
    attn_memory = [r['attention_memory'] for r in results]
    ax4.bar(names, attn_memory, color='gold')
    ax4.set_title('Attention Memory (O(n²))')
    ax4.set_ylabel('Memory Units')
    ax4.tick_params(axis='x', rotation=45)
    ax4.set_yscale('log')
    
    plt.tight_layout()
    plt.show()
    
    print("\n🎯 Key Trade-offs:")
    print("• Large vocabulary → More embedding parameters")
    print("• Long sequences → Quadratic attention memory growth")
    print("• Character-level → Simple but inefficient")
    print("• Subword → Best balance for most applications")
    print("• Optimal choice depends on task, compute, and language")
    
    return results

# Analyze tokenization impact
impact_analysis = analyze_tokenization_impact()
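
As a rough worked example of the two costs above, the sketch below converts them into bytes. It assumes 32-bit floats (4 bytes per value) and counts a single attention score matrix for one head in one layer; real models multiply this by heads, layers, and batch size. `embedding_bytes` and `attention_bytes` are hypothetical helper names.

```python
BYTES_PER_FLOAT32 = 4  # assumption: parameters and scores stored as float32


def embedding_bytes(vocab_size: int, d_model: int) -> int:
    """Memory for the embedding table: one d_model vector per vocabulary entry."""
    return vocab_size * d_model * BYTES_PER_FLOAT32


def attention_bytes(seq_len: int) -> int:
    """Memory for one seq_len x seq_len attention score matrix."""
    return seq_len * seq_len * BYTES_PER_FLOAT32


gpt2_embed = embedding_bytes(50257, 512)
char_embed = embedding_bytes(128, 512)
char_attn = attention_bytes(73)   # 71 characters + BOS/EOS
subword_attn = attention_bytes(19)  # int(71 / 4) + 2, as estimated above

print(f"Embedding table: {gpt2_embed:,} vs {char_embed:,} bytes")
print(f"Attention matrix: {subword_attn:,} vs {char_attn:,} bytes")
```

For this one sentence the embedding table dominates either way, but the attention term grows with the square of sequence length while the embedding table stays fixed, which is why long character-level sequences become expensive.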

## 8. Practical Considerations

Let's wrap up with real-world considerations when choosing tokenization strategies:

In [None]:
def practical_tokenization_guide():
    """Provide practical guidance for choosing tokenization strategies."""
    
    print("🎯 PRACTICAL TOKENIZATION GUIDE")
    print("═" * 50)
    
    guidelines = {
        "🔤 Character-Level": {
            "Best for": [
                "Learning and experimentation",
                "Languages without clear word boundaries",
                "Small datasets",
                "Character-level tasks (spelling correction)"
            ],
            "Pros": ["No OOV words", "Simple implementation", "Language agnostic"],
            "Cons": ["Long sequences", "Hard to learn word-level patterns", "Computationally expensive"]
        },
        
        "📝 Word-Level": {
            "Best for": [
                "Well-defined vocabularies",
                "Domain-specific applications",
                "When semantic meaning is crucial"
            ],
            "Pros": ["Preserves word meaning", "Shorter sequences", "Interpretable"],
            "Cons": ["Large vocabularies", "OOV problems", "Language-specific", "Poor morphology handling"]
        },
        
        "🧩 Subword (BPE/WordPiece)": {
            "Best for": [
                "General-purpose language models",
                "Multilingual applications",
                "Production systems",
                "Most modern NLP tasks"
            ],
            "Pros": ["Handles OOV words", "Balanced vocab/sequence length", "Good morphology", "Cross-lingual"],
            "Cons": ["More complex", "Requires training", "Less interpretable"]
        }
    }
    
    for category, info in guidelines.items():
        print(f"\n{category}")
        print("─" * 30)
        
        print("✅ Best for:")
        for item in info["Best for"]:
            print(f"   • {item}")
        
        print("\n✅ Pros:")
        for pro in info["Pros"]:
            print(f"   • {pro}")
        
        print("\n❌ Cons:")
        for con in info["Cons"]:
            print(f"   • {con}")
    
    print("\n" + "═" * 50)
    print("🎬 TOKENIZATION IN TRANSFORMER PIPELINE")
    print("═" * 50)
    
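    # Note: the token strings and IDs in steps 2-3 are illustrative; run a real
    # tokenizer (for example tiktoken's "gpt2" encoding) to get exact values.
    pipeline_steps = [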
        ("1. Raw Text", "'The transformer is amazing!'"),
        ("2. Tokenization", "['The', ' transform', 'er', ' is', ' amazing', '!']"),
        ("3. Token IDs", "[464, 5516, 263, 318, 4998, 0]"),
        ("4. Embeddings", "[[0.1, -0.3, 0.8, ...], [0.2, 0.1, -0.4, ...], ...]"),
        ("5. Transformer", "Attention → Feed-Forward → Layer Norm → ..."),
        ("6. Output Logits", "Probability distribution over vocabulary"),
        ("7. Decoding", "Convert back to text"),
        ("8. Final Text", "'Transformers revolutionized AI!'") 
    ]
    
    for step, description in pipeline_steps:
        print(f"{step:<16} → {description}")
    
    print("\n🔑 KEY TAKEAWAYS:")
    print("─" * 20)
    print("1. Tokenization is the bridge between human language and AI")
    print("2. Choice affects model size, speed, and performance")
    print("3. Subword tokenization is the current best practice")
    print("4. Consider your specific use case and constraints")
    print("5. Vocabulary size directly impacts embedding layer size")
    print("6. Sequence length affects attention computation (O(n²))")

practical_tokenization_guide()
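
Steps 3 and 4 of the pipeline (token IDs to embeddings) amount to a table lookup. The sketch below uses a small random table built with the standard library; the sizes, the seed, and the name `embed` are arbitrary assumptions, and a real model would learn these values during training.

```python
import random
from typing import List

random.seed(0)  # fixed seed so the toy table is reproducible
VOCAB_SIZE, D_MODEL = 10, 4

# One row of D_MODEL floats per token ID.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(D_MODEL)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids: List[int]) -> List[List[float]]:
    """Look up the embedding row for each token ID."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([3, 1, 3])
print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True: the same ID maps to the same vector
```

This is why vocabulary size sets the number of rows in the embedding layer: every possible token ID needs its own row.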

## Summary

Congratulations! You've learned the fundamentals of tokenization - the essential first step in all language models.

### What We Covered:

1. **Tokenization Basics** - Converting text to numbers for neural networks
2. **Character-Level** - Simple approach treating each character as a token
3. **Vocabulary Building** - Creating token mappings from text corpora
4. **Subword Tokenization** - Modern approach using BPE and similar algorithms
5. **Special Tokens** - Handling padding, unknowns, and sequence boundaries
6. **Architecture Impact** - How tokenization affects model size and performance
7. **Practical Guidelines** - Choosing the right approach for your needs

### Key Insights:

- **Tokenization is crucial** - It's the foundation that enables all text processing
- **Trade-offs matter** - Vocabulary size vs sequence length vs interpretability
- **Subword wins** - Best balance for most modern applications
- **Context matters** - Choose based on your specific task and constraints

### What's Next?

Now that you understand how text becomes numbers, you're ready to explore:
- **01_attention_mechanism** - How transformers process token sequences
- **02_transformer_blocks** - Building complete processing units
- **03_positional_encoding** - Adding position information to tokens

The journey from text to intelligent AI responses starts with tokenization - and you've just mastered it! 🚀