# Understanding Tokenization: From Text to Numbers

Before a transformer can process text, the text has to be converted into numbers. This step is called **tokenization**, the essential bridge between human language and the model.

## What You'll Learn

1. **Tokenization Fundamentals** - Converting text to numerical representations
2. **Character-Level Approach** - Simplest method for learning
3. **Subword Tokenization** - The approach used by modern models such as GPT (byte-pair encoding) and BERT (WordPiece)
4. **Special Tokens** - Handling boundaries and unknown content

Let's master the foundation that makes all language models possible!

In [None]:
import sys
import os
sys.path.append('..')

import re
import json
from collections import Counter, defaultdict
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple

# Try to import advanced tokenizers
try:
    import tiktoken
    TIKTOKEN_AVAILABLE = True
    print("✅ tiktoken available - we can use GPT-2 style tokenization")
except ImportError:
    TIKTOKEN_AVAILABLE = False
    print("⚠️ tiktoken not available - we'll focus on character-level tokenization")

# Set style for better plots
plt.style.use('default')
sns.set_palette("husl")

print("Environment setup complete!")

## Character-Level Tokenization

**Tokenization** splits text into tokens and maps each token to an integer ID; the model then looks up an embedding vector for each ID: `Text → Tokens → IDs → Embeddings`

Let's implement the simplest approach - character-level tokenization:

In [None]:
def demonstrate_tokenization_concept():
    """Show the basic concept with different approaches."""
    
    text = "Hello world! 🌍"
    
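    approaches = [
        ("Character-level", list(text), "Simple, handles any text, long sequences"),
        # Hand-written split for illustration only; a real byte-level BPE tokenizer
        # may break the emoji into several tokens.
        ("Subword (GPT-style)", ["Hello", " world", "!", " 🌍"], "Balanced approach, modern standard")
    ]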
    
    print(f"Text: '{text}' ({len(text)} chars)")
    print("\nTokenization Approaches:")
    for name, tokens, description in approaches:
        print(f"{name:20}: {len(tokens):2d} tokens → {description}")
    
    return approaches[0][1]  # Return character tokens for next cell

char_tokens = demonstrate_tokenization_concept()
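
The `Text → Tokens → IDs → Embeddings` pipeline described above can be sketched end to end in a few lines. This is a minimal illustration, not how a real model is built: the vocabulary comes from the input string itself, `embedding_dim` is an arbitrary small size chosen for readability, and the embedding values are random placeholders for what would normally be learned parameters.

```python
import random

text = "hello"

# Step 1: text -> tokens (one token per character)
tokens = list(text)

# Step 2: tokens -> integer IDs via a vocabulary built from the text itself
vocab = sorted(set(tokens))                            # ['e', 'h', 'l', 'o']
token_to_id = {tok: i for i, tok in enumerate(vocab)}
ids = [token_to_id[tok] for tok in tokens]             # [1, 0, 2, 2, 3]

# Step 3: IDs -> embeddings via a lookup table with one row per vocabulary entry.
# The values are random here; in a trained model they are learned.
rng = random.Random(0)
embedding_dim = 4  # hypothetical size, for illustration only
embedding_table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]
embeddings = [embedding_table[i] for i in ids]

print(tokens)
print(ids)
print(len(embeddings), "vectors of size", len(embeddings[0]))
```

Both `l` characters share the same ID, so they also share the same embedding row. Any difference in how the model treats them has to come from later components such as positional information.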

## Implementing Character Tokenization

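Character-level tokenization is a good place to start because it is simple, and a vocabulary built from the training corpus covers every character in that corpus. Characters outside the vocabulary (for example, an emoji when using the default ASCII vocabulary below) fall back to `<UNK>`: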

In [None]:
class SimpleCharacterTokenizer:
    """Simple character-level tokenizer for educational purposes."""
    
    def __init__(self, text_corpus: str = None):
        # Special tokens
        self.PAD = '<PAD>'  # For padding sequences to same length
        self.UNK = '<UNK>'  # For unknown characters
        self.BOS = '<BOS>'  # Beginning of sequence
        self.EOS = '<EOS>'  # End of sequence
        
        # Build vocabulary
        if text_corpus:
            self.build_vocab(text_corpus)
        else:
            # Default vocabulary with common characters
            self.vocab = self._create_default_vocab()
            self._create_mappings()
    
    def _create_default_vocab(self):
        """Create a default vocabulary: special tokens first, then printable ASCII."""
        # Special tokens come first so their IDs match the layout used by build_vocab
        chars = [self.PAD, self.UNK, self.BOS, self.EOS]
        chars.extend(chr(i) for i in range(32, 127))  # Printable ASCII (space through '~')
        return chars
    
    def build_vocab(self, text: str):
        """Build vocabulary from text corpus."""
        # Get unique characters from text
        unique_chars = sorted(set(text))
        
        # Add special tokens
        self.vocab = [self.PAD, self.UNK, self.BOS, self.EOS] + unique_chars
        self._create_mappings()
        
        print(f"Built vocabulary from {len(text):,} characters")
        print(f"Vocabulary size: {len(self.vocab)} unique characters")
    
    def _create_mappings(self):
        """Create character ↔ ID mappings."""
        self.char_to_id = {char: i for i, char in enumerate(self.vocab)}
        self.id_to_char = {i: char for i, char in enumerate(self.vocab)}
    
    def encode(self, text: str, add_special_tokens: bool = True) -> List[int]:
        """Convert text to token IDs."""
        tokens = []
        
        if add_special_tokens:
            tokens.append(self.char_to_id[self.BOS])
        
        for char in text:
            token_id = self.char_to_id.get(char, self.char_to_id[self.UNK])
            tokens.append(token_id)
        
        if add_special_tokens:
            tokens.append(self.char_to_id[self.EOS])
        
        return tokens
    
    def decode(self, token_ids: List[int], skip_special_tokens: bool = True) -> str:
        """Convert token IDs back to text."""
        special_ids = {
            self.char_to_id[self.PAD],
            self.char_to_id[self.BOS],
            self.char_to_id[self.EOS]
        }
        
        chars = []
        for token_id in token_ids:
            if skip_special_tokens and token_id in special_ids:
                continue
            chars.append(self.id_to_char.get(token_id, self.UNK))
        
        return ''.join(chars)
    
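    def tokenize(self, text: str) -> List[str]:
        """Return the token strings that encode() maps to IDs (unknown characters become <UNK>)."""
        chars = [char if char in self.char_to_id else self.UNK for char in text]
        return [self.BOS] + chars + [self.EOS]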

# Create and test character tokenizer
char_tokenizer = SimpleCharacterTokenizer()

# Test with example text
test_text = "Hello, world! 🚀"
print(f"Original text: '{test_text}'")
print(f"Vocabulary size: {len(char_tokenizer.vocab)}")
print()

# Tokenization process
tokens = char_tokenizer.tokenize(test_text)
token_ids = char_tokenizer.encode(test_text)
decoded_text = char_tokenizer.decode(token_ids)

print("Tokenization process:")
print(f"1. Tokens:     {tokens}")
print(f"2. Token IDs:  {token_ids}")
print(f"3. Decoded:    '{decoded_text}'")
print()
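# The default vocabulary is printable ASCII, so '🚀' is encoded as <UNK>
# and the decoded text does not match the original exactly.
round_trip_ok = test_text == decoded_text
print(f"{'✅' if round_trip_ok else '⚠️'} Round-trip exact: {round_trip_ok}")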
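
The default vocabulary above covers only printable ASCII, so `🚀` is mapped to `<UNK>` and cannot be recovered on decoding. One common way to avoid unknown tokens entirely is to tokenize UTF-8 bytes instead of characters, which gives a fixed vocabulary of 256 values. The sketch below is an assumption about how that could look and is not part of `SimpleCharacterTokenizer`; `bytes_encode` and `bytes_decode` are hypothetical helper names.

```python
from typing import List

def bytes_encode(text: str) -> List[int]:
    """Map text to its UTF-8 byte values (each in the range 0..255)."""
    return list(text.encode("utf-8"))

def bytes_decode(ids: List[int]) -> str:
    """Rebuild the original text from UTF-8 byte values."""
    return bytes(ids).decode("utf-8")

sample = "Hi 🚀"
ids = bytes_encode(sample)

print(ids)                          # the emoji alone takes 4 byte tokens
print(bytes_decode(ids) == sample)  # lossless round trip
```

The trade-off is sequence length: every non-ASCII character costs two to four tokens, which is one reason byte-level schemes are usually combined with subword merging.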

In [None]:
def visualize_character_tokenization():
    """Visualize how character tokenization works."""
    
    text = "AI is amazing!"
    tokens = char_tokenizer.tokenize(text)
    token_ids = char_tokenizer.encode(text)
    
    print(f"Text: '{text}'")
    print()
    
    # Show character-by-character mapping
    print("Character → Token ID mapping:")
    print("─" * 30)
    
    for i, (token, token_id) in enumerate(zip(tokens, token_ids)):
        if token in ['<BOS>', '<EOS>']:
            print(f"{i:2d}: {token:>6} → {token_id:3d}  (special)")
        else:
            print(f"{i:2d}: {repr(token):>6} → {token_id:3d}")
    
    # Vocabulary analysis
    print("\n📊 Vocabulary Analysis:")
    print(f"Total vocabulary size: {len(char_tokenizer.vocab)}")
    print(f"Sequence length: {len(token_ids)} tokens")
    print(f"Original text length: {len(text)} characters")
    
    # Show some vocabulary examples
    print("\n🔤 Sample vocabulary (first 20 tokens):")
    for i in range(min(20, len(char_tokenizer.vocab))):
        char = char_tokenizer.vocab[i]
        if char in ['<PAD>', '<UNK>', '<BOS>', '<EOS>']:
            print(f"{i:3d}: {char}")
        else:
            print(f"{i:3d}: {repr(char)}")

visualize_character_tokenization()
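
The tokenizer defines `<PAD>` for padding sequences to the same length, but nothing so far uses it. A minimal sketch of how padding might work for a batch is shown below. The helper name `pad_batch`, the choice of right-padding, and the 1/0 mask convention are assumptions for illustration; the IDs in `batch` are made up and `pad_id=0` is just an example value.

```python
from typing import List, Tuple

def pad_batch(sequences: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest length and build a matching mask."""
    max_len = max(len(seq) for seq in sequences)  # assumes a non-empty batch
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, mask

batch = [[5, 6, 7], [8], [9, 10]]
padded, mask = pad_batch(batch, pad_id=0)

print(padded)  # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print(mask)    # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

The mask lets downstream code ignore padded positions, so the padding token does not influence the result.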

## GPT-2 Comparison & Summary

Let's compare our character approach with GPT-2's subword tokenization:

In [None]:
def compare_tokenization_approaches():
    """Compare character vs GPT-2 tokenization."""
    
    if not TIKTOKEN_AVAILABLE:
        print("⚠️ tiktoken not available - skipping the GPT-2 comparison")
        return
    
    enc = tiktoken.get_encoding("gpt2")
    test_cases = [
        "Hello world!",
        "Tokenization is fascinating",
        "antidisestablishmentarianism"  # Long word
    ]
    
    print("📊 Tokenization Comparison:")
    print("─" * 50)
    
    for text in test_cases:
        gpt2_tokens = enc.encode(text)
        char_tokens = char_tokenizer.encode(text, add_special_tokens=False)
        
        # Show token breakdown
        gpt2_token_strings = [enc.decode([tid]) for tid in gpt2_tokens]
        
        print(f"Text: '{text}'")
        print(f"GPT-2 ({len(gpt2_tokens):2d}): {gpt2_token_strings}")
        print(f"Chars ({len(char_tokens):2d}): compression = {len(char_tokens)/len(gpt2_tokens):.1f}x")
        print()
    
    print("🎯 Key Takeaways:")
    print("• Character-level: Simple but creates long sequences")
    print("• Subword (GPT-2): Balanced vocabulary size and sequence length")
    print("• Special tokens: Essential for sequence boundaries and padding")
    print(f"• GPT-2 vocabulary: {enc.n_vocab:,} tokens vs {len(char_tokenizer.vocab)} entries in the character vocabulary")

compare_tokenization_approaches()
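
To give a feel for where subword vocabularies come from, here is a toy sketch of byte-pair encoding (BPE) training: start from single characters and repeatedly merge the most frequent adjacent pair. It is heavily simplified compared with GPT-2's tokenizer, which works on bytes and pre-splits text with a regular expression. The function names `count_pairs`, `merge_pair`, and `train_bpe` are hypothetical, and ties are broken by first occurrence.

```python
from collections import Counter
from typing import Dict, List, Tuple

Word = Tuple[str, ...]

def count_pairs(words: Dict[Word, int]) -> Counter:
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words: Dict[Word, int], pair: Tuple[str, str]) -> Dict[Word, int]:
    """Replace every occurrence of `pair` with one merged symbol."""
    merged: Dict[Word, int] = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus: str, num_merges: int) -> Tuple[List[Tuple[str, str]], Dict[Word, int]]:
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = {tuple(w): c for w, c in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent pair
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)       # [('l', 'o'), ('lo', 'w')]
print(list(words))  # [('low',), ('low', 'e', 'r'), ('low', 'e', 's', 't')]
```

After two merges the frequent word `low` is a single token, while the rarer `lower` and `lowest` are still split into `low` plus characters. That is the balance between vocabulary size and sequence length noted in the takeaways above.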

## What's Next?

You've mastered tokenization - the foundation that enables all language models! 

### The Learning Path:
1. **01_attention_mechanism** - How transformers process token sequences
2. **02_transformer_blocks** - Complete processing units  
3. **03_positional_encoding** - Adding position information

### Remember:
- **Tokenization** converts `Text → Numbers` for AI processing
- **Character-level** is simple but creates long sequences
- **Subword** (like GPT-2) balances efficiency and coverage
- **Special tokens** handle boundaries and padding

Ready to dive deeper into transformer architecture! 🚀