<a href="https://colab.research.google.com/github/kenechiomeke/intro_to_language_ai/blob/main/getting_started_tokenisers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🔤➡️🔢 Tokenization Playground: Breaking Language Into Pieces

Welcome to your interactive tokenization laboratory! This notebook will help you understand how different tokenization strategies work by letting you experiment with them hands-on.

## 📚 What You'll Learn:
- How to implement different tokenization methods
- When to use each strategy
- The trade-offs between different approaches
- How tokenization affects downstream NLP tasks

## 🚀 Getting Started:
1. Run each cell in order (Shift + Enter)
2. Experiment with different text inputs
3. Compare results across tokenization methods
4. Try your own examples!

---

## 📦 Step 1: Install Required Libraries

First, let's install all the libraries we'll need. This might take a minute!

In [1]:
# Install required packages
!pip install transformers tokenizers sentencepiece nltk spacy pandas matplotlib seaborn
!python -m spacy download en_core_web_sm

print("✅ All libraries installed successfully!")

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
✅ All libraries installed successfully!


## 📚 Step 2: Import Libraries

In [2]:
import re
import nltk
import spacy
from collections import Counter
from transformers import GPT2Tokenizer, AutoTokenizer
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
import sentencepiece as spm
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

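# Download NLTK data (recent NLTK releases look for 'punkt_tab' rather than 'punkt')
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)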

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

print("✅ All imports successful!")

✅ All imports successful!


## 📝 Step 3: Prepare Sample Text

Let's create some diverse sample text that will showcase the differences between tokenization methods. We'll include:
- Regular sentences
- Technical terms
- Social media style text
- Multilingual content
- Numbers and special characters

In [3]:
# Sample texts to demonstrate different tokenization challenges
sample_texts = {
    "simple": "Hello world! How are you today?",

    "technical": "ChatGPT uses transformer architecture with self-attention mechanisms. It's pre-trained on massive datasets.",

    "social_media": "OMG! Just saw the most amazing sunset!!! #beautiful #nofilter Can't believe it's already 2024... Time flies!",

    "complex_words": "The unhappiness of the disconnected users was unprecedented. They were extraordinarily disappointed.",

    "multilingual": "Hello! Bonjour! Hola! The weather is nice today. Il fait beau aujourd'hui.",

    "numbers_code": "The function calculate_loss() returned 0.0045 after 1000 iterations. Error code: HTTP-404.",

    "misspellings": "I definatly think your right about this. Recieve my sincere apoligies for the delay."
}

# Display sample texts
print("📝 Sample Texts for Tokenization:")
print("=" * 50)
for name, text in sample_texts.items():
    print(f"\n🔹 {name.upper()}:")
    print(f"{text}")

# Let users add their own text
print("\n" + "=" * 50)
print("💡 Want to try your own text? Add it below and re-run the tokenization cells!")
sample_texts["custom"] = "Add your own text here to see how different tokenizers handle it!"

📝 Sample Texts for Tokenization:

🔹 SIMPLE:
Hello world! How are you today?

🔹 TECHNICAL:
ChatGPT uses transformer architecture with self-attention mechanisms. It's pre-trained on massive datasets.

🔹 SOCIAL_MEDIA:
OMG! Just saw the most amazing sunset!!! #beautiful #nofilter Can't believe it's already 2024... Time flies!

🔹 COMPLEX_WORDS:
The unhappiness of the disconnected users was unprecedented. They were extraordinarily disappointed.

🔹 MULTILINGUAL:
Hello! Bonjour! Hola! The weather is nice today. Il fait beau aujourd'hui.

🔹 NUMBERS_CODE:
The function calculate_loss() returned 0.0045 after 1000 iterations. Error code: HTTP-404.

🔹 MISSPELLINGS:
I definatly think your right about this. Recieve my sincere apoligies for the delay.

💡 Want to try your own text? Add it below and re-run the tokenization cells!


## 🔤 Method 1: Word-Level Tokenization

The simplest approach: split text by spaces and punctuation. This is intuitive but has limitations with compound words and out-of-vocabulary terms.
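
To make the out-of-vocabulary limitation concrete, here is a minimal sketch that builds a word-level vocabulary from two training sentences and then encodes a sentence containing an unseen word. The helper names (`build_vocab`, `encode`) and the `<unk>` entry are illustrative choices for this sketch, not part of NLTK or spaCy.

```python
import re

TOKEN_PATTERN = r"\w+|[^\w\s]"  # word runs, or single punctuation marks

def build_vocab(texts):
    """Assign an integer ID to every token seen in the training texts.

    ID 0 is reserved for unknown words (an assumption of this sketch).
    """
    vocab = {"<unk>": 0}
    for text in texts:
        for token in re.findall(TOKEN_PATTERN, text.lower()):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Return (token, id) pairs; unseen tokens get the <unk> ID."""
    tokens = re.findall(TOKEN_PATTERN, text.lower())
    return [(tok, vocab.get(tok, vocab["<unk>"])) for tok in tokens]

train_texts = ["The weather is nice today.", "Hello world!"]
vocab = build_vocab(train_texts)

encoded = encode("The weather is unseasonably nice!", vocab)
for token, token_id in encoded:
    print(f"{token:14} -> {token_id}")
```

The word "unseasonably" never appeared in the training sentences, so it collapses to the unknown ID and its meaning is lost to anything downstream. Subword methods exist largely to avoid this.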

In [None]:
def word_tokenize_simple(text):
    """Simple word tokenization by splitting on whitespace and punctuation"""
    # Lowercase, then extract runs of word characters (\w+) and single punctuation marks ([^\w\s])
    tokens = re.findall(r'\b\w+\b|[^\w\s]', text.lower())
    return tokens

def word_tokenize_nltk(text):
    """NLTK's word tokenization (more sophisticated)"""
    return nltk.word_tokenize(text.lower())

def word_tokenize_spacy(text):
    """spaCy's word tokenization (handles linguistics better)"""
    doc = nlp(text)
    return [token.text.lower() for token in doc]

# Test word tokenization methods
print("🔤 WORD-LEVEL TOKENIZATION COMPARISON")
print("=" * 60)

test_text = sample_texts["technical"]
print(f"Original text: {test_text}")
print()

methods = {
    "Simple (Regex)": word_tokenize_simple,
    "NLTK": word_tokenize_nltk,
    "spaCy": word_tokenize_spacy
}

for method_name, method_func in methods.items():
    tokens = method_func(test_text)
    print(f"📌 {method_name}:")
    print(f"  Tokens: {tokens}")
    print(f"  Count: {len(tokens)} tokens")
    print()

print("💡 Notice how different methods handle contractions and punctuation differently!")

## 🔡 Method 2: Character-Level Tokenization

Split text into individual characters. This handles any text but loses word-level meaning and creates very long sequences.
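
The trade-off can be sketched in a few lines: a character vocabulary stays tiny and can encode and decode the text without loss, but the sequence is several times longer than the word count. The variable names below are illustrative only.

```python
text = "Hello world! How are you today?"

# Vocabulary: one integer ID per distinct character in the text
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in text]

# Decoding is lossless: map every ID back to its character
id_to_char = {i: ch for ch, i in char_vocab.items()}
decoded = "".join(id_to_char[i] for i in char_ids)

word_count = len(text.split())
print(f"Vocabulary size: {len(char_vocab)}")
print(f"Sequence length: {len(char_ids)} characters vs {word_count} words")
print(f"Round trip OK:   {decoded == text}")
```

A model working at this level has to learn spelling before it can learn anything about words, which is one reason character-level models need longer contexts for the same amount of text.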

In [None]:
def char_tokenize(text):
    """Character-level tokenization"""
    return list(text)

def char_tokenize_no_spaces(text):
    """Character-level tokenization without spaces"""
    return [char for char in text if char != ' ']

# Test character tokenization
print("🔡 CHARACTER-LEVEL TOKENIZATION")
print("=" * 50)

test_text = sample_texts["simple"]
print(f"Original text: '{test_text}'")
print()

# With spaces
char_tokens = char_tokenize(test_text)
print(f"📌 With spaces:")
print(f"  Tokens: {char_tokens}")
print(f"  Count: {len(char_tokens)} tokens")
print()

# Without spaces
char_tokens_no_space = char_tokenize_no_spaces(test_text)
print(f"📌 Without spaces:")
print(f"  Tokens: {char_tokens_no_space}")
print(f"  Count: {len(char_tokens_no_space)} tokens")
print()

# Show vocabulary size
unique_chars = set(char_tokens)
print(f"💡 Unique characters (vocabulary): {sorted(unique_chars)}")
print(f"  Vocabulary size: {len(unique_chars)}")

## 🧩 Method 3: Byte Pair Encoding (BPE)

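BPE is the subword method behind GPT-style models. It starts from individual characters (or bytes), repeatedly finds the most frequent adjacent pair of symbols in the training corpus, and merges that pair into a new symbol until the vocabulary reaches a target size. Common words end up as single tokens, while rare words are split into smaller known pieces, which keeps the vocabulary compact while still covering rare words.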
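
The merge loop is easier to follow on a toy corpus. The sketch below is a didactic, from-scratch version of BPE training. It is not how GPT-2 or the `tokenizers` library implement it (those work on bytes, apply pre-tokenization, and use much faster data structures), and the function names are made up for this example.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges; returns the merge list and the segmented corpus."""
    word_freqs = Counter(tuple(w) for w in words)  # each word starts as a tuple of characters
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair that was counted first
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs

toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, segmented = learn_bpe(toy_words, num_merges=3)

print("Learned merges:", merges)
for symbols, freq in segmented.items():
    print(f"{freq} x {' '.join(symbols)}")
```

Raising `num_merges` grows the vocabulary and shortens the token sequences. That is the same dial as the `vocab_size` argument in the cell below.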

In [None]:
# Initialize GPT-2 tokenizer for BPE example
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Create a simple BPE tokenizer (for illustrative purposes, not for production)
def create_bpe_tokenizer(texts, vocab_size=1000):
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))  # unk_token lets characters unseen in training map to [UNK]
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer

# Train a small BPE tokenizer on our sample texts
corpus = list(sample_texts.values())
custom_bpe_tokenizer = create_bpe_tokenizer(corpus, vocab_size=200)

# Test BPE tokenization
print("🧩 BYTE PAIR ENCODING (BPE) TOKENIZATION")
print("=" * 60)

test_text = sample_texts["social_media"]
print(f"Original text: '{test_text}'")
print()

# Using pre-trained GPT-2 tokenizer
gpt2_tokens = gpt2_tokenizer.tokenize(test_text)
print(f"📌 GPT-2 BPE Tokenizer:")
print(f"  Tokens: {gpt2_tokens}")
print(f"  Count: {len(gpt2_tokens)} tokens")
print()

# Using our custom trained BPE tokenizer
custom_bpe_encoding = custom_bpe_tokenizer.encode(test_text)
print(f"📌 Custom BPE Tokenizer (vocab_size=200):")
print(f"  Tokens: {custom_bpe_encoding.tokens}")
print(f"  Count: {len(custom_bpe_encoding.tokens)} tokens")
print(f"  IDs: {custom_bpe_encoding.ids}")
print()

test_text_oov = "supercalifragilisticexpialidocious"
print(f"Original OOV text: '{test_text_oov}'")
print(f"GPT-2 OOV tokens: {gpt2_tokenizer.tokenize(test_text_oov)}")
print("💡 Notice how BPE breaks down unknown words into known subword units!")

## 🌍 Method 4: SentencePiece (for multilingual and robust tokenization)

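SentencePiece is a language-agnostic subword tokenizer that trains directly on raw text, with no language-specific pre-tokenization. It treats whitespace as an ordinary symbol (shown as ▁), so tokenization is reversible and also works for languages that do not separate words with spaces. That makes it a common choice for multilingual models.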
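
One idea worth seeing in isolation is how SentencePiece keeps tokenization reversible: spaces are replaced by a visible marker (▁, U+2581) before segmentation, so joining the pieces and swapping the marker back recovers the original text. The sketch below imitates only that step in plain Python. The fixed-size chunking in `toy_segment` is a hypothetical stand-in for a trained model, not how SentencePiece chooses pieces.

```python
SPACE_MARK = "\u2581"  # "▁", the visible whitespace marker

def to_marked(text):
    """Prefix the text with the marker and replace every space with it."""
    return SPACE_MARK + text.replace(" ", SPACE_MARK)

def toy_segment(marked, piece_len=3):
    """Stand-in for a trained model: cut the marked string into fixed-size chunks."""
    return [marked[i:i + piece_len] for i in range(0, len(marked), piece_len)]

def detokenize(pieces):
    """Join the pieces, turn markers back into spaces, and drop the leading space."""
    return "".join(pieces).replace(SPACE_MARK, " ")[1:]

text = "Il fait beau"
pieces = toy_segment(to_marked(text))
restored = detokenize(pieces)

print("Pieces:  ", pieces)
print("Restored:", restored)
```

Because the marker travels inside the pieces, no separate rule is needed to decide where spaces go when decoding, whatever the language.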

In [None]:
# Prepare data for SentencePiece training (requires a text file)
corpus_file = "corpus.txt"
with open(corpus_file, "w", encoding="utf-8") as f:
    for text in sample_texts.values():
        f.write(text + "\n")

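# Train SentencePiece model
# hard_vocab_limit=false treats vocab_size as an upper bound, because a corpus this small may not yield 200 pieces
spm.SentencePieceTrainer.train(
    f'--input={corpus_file} --model_prefix=m --vocab_size=200 --model_type=unigram --hard_vocab_limit=false'
)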

# Load SentencePiece model
sp_tokenizer = spm.SentencePieceProcessor()
sp_tokenizer.load("m.model")

print("🌍 SENTENCEPIECE TOKENIZATION")
print("=" * 50)

test_text = sample_texts["multilingual"]
print(f"Original text: '{test_text}'")
print()

sp_tokens = sp_tokenizer.encode_as_pieces(test_text)
print(f"📌 SentencePiece Tokens:")
print(f"  Tokens: {sp_tokens}")
print(f"  Count: {len(sp_tokens)} tokens")
print()

test_text_korean = "안녕하세요 세상"
print(f"Original Korean text: '{test_text_korean}'")
sp_tokens_korean = sp_tokenizer.encode_as_pieces(test_text_korean)
print(f"📌 SentencePiece Korean Tokens: {sp_tokens_korean}")
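print("💡 This tiny model saw no Hangul during training, so these characters fall back to the unknown token. SentencePiece is language-agnostic, but it still needs training data in each script it should handle.")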

## 📊 Token Scale and LLMs: What does '1 Trillion Tokens' Mean?

You often hear about Large Language Models (LLMs) being trained on vast amounts of data, measured in 'tokens'. Let's explore what that means.

In [None]:
def analyze_token_scale():
    print("📊 ANALYZING TOKEN SCALE AND LLMs")
    print("=" * 50)

    # Order-of-magnitude figures for illustration; exact training-set sizes vary by model and are not always published
    llm_sizes = {
        "Small LLM (e.g., Early GPT)": {"tokens": 1 * (10**9), "description": "1 Billion tokens"},
        "Medium LLM (illustrative)": {"tokens": 10 * (10**9), "description": "10 Billion tokens"},
        "Large LLM (e.g., GPT-3)": {"tokens": 300 * (10**9), "description": "300 Billion tokens"},
        "Very Large LLM (e.g., LLaMA-7B)": {"tokens": 1 * (10**12), "description": "1 Trillion tokens"}
    }

    avg_tokens_per_word = 1.3 # Common ratio for subword tokenizers
    avg_words_per_page = 500
    avg_pages_per_book = 300
    avg_words_per_tweet = 20

    print("\nApproximate Equivalents:")
    for name, data in llm_sizes.items():
        tokens = data['tokens']
        description = data['description']

        words = tokens / avg_tokens_per_word
        books = words / (avg_words_per_page * avg_pages_per_book)
        tweets = words / avg_words_per_tweet

        print(f"\n🚀 {name} ({description}):")
        print(f"  Approx. words: {words:,.0f}")
        print(f"  Equivalent to approx. {books:,.0f} books")
        print(f"  Equivalent to approx. {tweets:,.0f} tweets")

    print("\n" + "=" * 50)
    print("💡 Why do more tokens matter?")
    print("  - **Wider Exposure:** Model sees more diverse language, facts, and styles.")
    print("  - **Deeper Understanding:** Learns more complex patterns and relationships.")
    print("  - **Better Generalization:** Performs better on unseen data and tasks.")
    print("  - **Emergent Capabilities:** Larger scale can unlock new abilities (e.g., reasoning).")

    print("\n  Is an LLM trained on 1T better than one trained on 1B?")
    print("  **Generally, YES!** 1 Trillion tokens is 1000 times more data than 1 Billion tokens. \n  This massive difference in training data typically leads to vastly superior performance, \n  more robust understanding, and greater capabilities in an LLM.")
    print("  However, the *quality* and *diversity* of tokens also matter, not just the quantity.")

analyze_token_scale()  # prints its report and returns nothing, so there is no result to store

## 🛠️ Hands-On Experimentation

Now it's your turn! Try different texts and see how tokenization affects the results.

In [None]:
# Interactive experimentation function
def tokenize_and_analyze(text, show_details=True):
    """Tokenize text with multiple methods and show detailed analysis"""

    print(f"🔍 ANALYZING: '{text}'")
    print("=" * min(60, len(text) + 20))
    print(f"📏 Text length: {len(text)} characters")
    print(f"📝 Word count: {len(text.split())} words")
    print()

    # Tokenize with different methods
    methods = {
        "Word (NLTK)": word_tokenize_nltk(text),
        "Character": char_tokenize(text),
        "GPT-2 BPE": gpt2_tokenizer.tokenize(text)
    }

    results = {}
    for method_name, tokens in methods.items():
        results[method_name] = {
            'count': len(tokens),
            'tokens': tokens,
            'ratio': len(tokens) / len(text) if len(text) > 0 else 0,
            'compression': len(text) / len(tokens) if tokens else 0
        }

        print(f"📌 {method_name}:")
        print(f"  Tokens: {len(tokens)}")
        print(f"  Ratio: {len(tokens)/len(text):.3f} tokens/char" if len(text) > 0 else "  Ratio: N/A")
        print(f"  Compression: {len(text)/len(tokens):.1f}x" if tokens else "  Compression: N/A")

        if show_details and len(tokens) <= 20:
            print(f"  All tokens: {tokens}")
        elif show_details:
            print(f"  First 10: {tokens[:10]}")
            print(f"  Last 10: {tokens[-10:]}")
        print()

    # Vocabulary analysis for longer texts
    if len(text) > 50:
        word_tokens = methods["Word (NLTK)"]
        unique_words = set(word_tokens)
        word_freq = Counter(word_tokens)

        print("📊 VOCABULARY ANALYSIS:")
        print(f"  Unique words: {len(unique_words)}")
        print(f"  Vocabulary richness: {len(unique_words)/len(word_tokens):.3f}")
        print(f"  Most common: {word_freq.most_common(5)}")
        print()

    return results

# Example experiments
print("🛠️ HANDS-ON TOKENIZATION EXPERIMENTS")
print("=" * 50)
print("Try these examples or add your own text below!\n")

# Example texts: technical jargon, informal social-media style, plain prose, and very long rare words
experiment_texts = [
    "The transformer architecture uses self-attention mechanisms.",
    "OMG! This is sooooo cool!!!",
    "Machine learning enables computers to learn without explicit programming.",
    "antidisestablishmentarianism pseudopseudohypoparathyroidism"
]

for i, text in enumerate(experiment_texts, 1):
    print(f"\n🧪 EXPERIMENT {i}:")
    tokenize_and_analyze(text, show_details=True)
    print("-" * 50)

print("\n💡 TRY YOUR OWN TEXT:")
print("Replace the text below and run the cell to see how different methods tokenize it!")

# User's custom text (they can modify this)
your_text = "Replace this with any text you want to analyze!"
print(f"\n🎯 YOUR CUSTOM TEXT ANALYSIS:")
tokenize_and_analyze(your_text, show_details=True)

## 🎯 When to Use Which Tokenization Method?

Based on our experiments, here's a practical guide for choosing tokenization strategies.
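
A guide like this can also be written as a small function, which makes the priority between competing requirements explicit. The sketch below is one possible encoding. The function name, the flag names, the order in which the rules are checked, and the fallback are all assumptions made for illustration.

```python
def recommend_tokenizer(multilingual=False, llm_scale=False, clean_text=False,
                        noisy_text=False, prototype=False, realtime=False):
    """Return the first matching recommendation, checked in a fixed priority order."""
    rules = [
        (multilingual, "SentencePiece"),
        (llm_scale, "BPE (GPT-2 style)"),
        (clean_text, "Word-level (NLTK/spaCy)"),
        (noisy_text, "Robust BPE"),
        (prototype, "Pre-trained tokenizer"),
        (realtime, "Fast word-level"),
    ]
    for applies, recommendation in rules:
        if applies:
            return recommendation
    # Assumption: with no flags set, fall back to an off-the-shelf tokenizer
    return "Pre-trained tokenizer"

# A noisy-text chatbot: both flags are set, and the earlier rule wins
choice = recommend_tokenizer(noisy_text=True, realtime=True)
print(choice)
```

Real projects rarely fit a single branch, so treat the output as a starting point to evaluate on your own data rather than a final answer.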

In [None]:
def tokenization_decision_guide():
    """Interactive guide to help choose the right tokenization method"""

    print("🎯 TOKENIZATION DECISION GUIDE")
    print("=" * 40)

    scenarios = {
        "📚 Clean Text Analysis": {
            "description": "News articles, books, formal documents",
            "recommendation": "Word-level tokenization (NLTK/spaCy)",
            "why": "Text is well-formed, vocabulary is manageable",
            "pros": ["Human-readable", "Preserves word meaning", "Fast processing"],
            "cons": ["Struggles with OOV words", "Large vocabulary", "No subword info"]
        },

        "🌍 Multilingual Applications": {
            "description": "Apps supporting multiple languages",
            "recommendation": "SentencePiece or multilingual BERT tokenizer",
            "why": "Language-agnostic, handles different scripts",
            "pros": ["Works across languages", "Consistent handling", "Space-aware"],
            "cons": ["More complex setup", "Larger models", "Training required"]
        },

        "🤖 Large Language Models": {
            "description": "Training or fine-tuning LLMs",
            "recommendation": "BPE (GPT-style) or SentencePiece",
            "why": "Balances vocabulary size with semantic meaning",
            "pros": ["Efficient encoding", "Handles rare words", "Proven at scale"],
            "cons": ["Less interpretable", "Training overhead", "Hyperparameter tuning"]
        },

        "📱 Social Media/Noisy Text": {
            "description": "Tweets, comments, user-generated content",
            "recommendation": "Robust BPE or character-level for extreme cases",
            "why": "Handles misspellings, slang, and novel expressions",
            "pros": ["Robust to noise", "Handles creativity", "No OOV issues"],
            "cons": ["May lose word boundaries", "Longer sequences", "Complex preprocessing"]
        },

        "🔬 Research/Prototyping": {
            "description": "Quick experiments and proof of concepts",
            "recommendation": "Pre-trained tokenizers (GPT-2, BERT)",
            "why": "Ready to use, well-tested, community support",
            "pros": ["No training needed", "Proven performance", "Easy integration"],
            "cons": ["May not fit domain", "Fixed vocabulary", "Less control"]
        },

        "⚡ Real-time Applications": {
            "description": "Chatbots, live translation, streaming",
            "recommendation": "Fast word-level or optimized BPE",
            "why": "Speed is critical, latency matters",
            "pros": ["Fast processing", "Low latency", "Predictable performance"],
            "cons": ["May sacrifice accuracy", "Limited handling of edge cases"]
        }
    }

    for scenario, details in scenarios.items():
        print(f"\n{scenario}")
        print(f"📝 Use case: {details['description']}")
        print(f"🎯 Recommended: {details['recommendation']}")
        print(f"💡 Why: {details['why']}")
        print(f"✅ Pros: {', '.join(details['pros'])}")
        print(f"❌ Cons: {', '.join(details['cons'])}")

    print("\n" + "=" * 50)
    print("🚀 QUICK DECISION FLOWCHART:")
    print("""
    ┌─ Multilingual? ──→ YES ──→ SentencePiece
    │
    ├─ Large scale LLM? ──→ YES ──→ BPE (GPT-2 style)
    │
    ├─ Clean, formal text? ──→ YES ──→ Word-level (NLTK)
    │
    ├─ Noisy/social media? ──→ YES ──→ Robust BPE
    │
    ├─ Quick prototype? ──→ YES ──→ Pre-trained tokenizer
    │
    └─ Real-time critical? ──→ YES ──→ Fast word-level
    """)

    print("\n🔧 IMPLEMENTATION TIPS:")
    tips = [
        "Always evaluate tokenization on your specific domain",
        "Consider vocabulary size vs. sequence length trade-offs",
        "Test with edge cases (URLs, hashtags, code, etc.)",
        "Monitor out-of-vocabulary rates during development",
        "Use subword regularization for robustness (if available)",
        "Document your tokenization choices for reproducibility",
        "Consider computational constraints in production"
    ]

    for i, tip in enumerate(tips, 1):
        print(f"  {i}. {tip}")

tokenization_decision_guide()

## ⚡ Performance and Efficiency Analysis

Let's measure the speed and memory efficiency of different tokenization methods.
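
Single timings of fast functions are noisy, so a common approach is to run each tokenizer several times and keep the best result. The sketch below shows that pattern with `time.perf_counter` on two dependency-free tokenizers. The helper name `time_tokenizer` and the choice of five repeats are assumptions for illustration.

```python
import time

def time_tokenizer(tokenize, text, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        tokenize(text)
        best = min(best, time.perf_counter() - start)
    return best

sample = "Hello world! How are you today? " * 1000

timings = {
    "whitespace split": time_tokenizer(str.split, sample),
    "character": time_tokenizer(list, sample),
}

for name, seconds in timings.items():
    print(f"{name:18} {seconds * 1000:.3f} ms")
```

Taking the minimum rather than the mean filters out runs that were slowed down by unrelated activity on the machine. The absolute numbers will still differ between environments.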

In [None]:
import time
import sys

def measure_tokenization_performance():
    """Measure speed and efficiency of different tokenization methods"""

    # Create test text of different sizes
    base_text = " ".join(sample_texts.values())
    test_texts = {
        "Small (1KB)": base_text[:1000],
        "Medium (10KB)": (base_text * 10)[:10000],
        "Large (100KB)": (base_text * 100)[:100000]
    }

    methods = {
        "Word (NLTK)": lambda x: word_tokenize_nltk(x),
        "Character": lambda x: char_tokenize(x),
        "GPT-2 BPE": lambda x: gpt2_tokenizer.tokenize(x)
    }

    print("⚡ TOKENIZATION PERFORMANCE ANALYSIS")
    print("=" * 55)

    results = []

    for text_size, text in test_texts.items():
        print(f"\n📏 Testing with {text_size} text ({len(text):,} characters)")
        print("-" * 40)

        for method_name, method_func in methods.items():
            # Measure wall-clock time with a high-resolution monotonic clock
            start_time = time.perf_counter()
            tokens = method_func(text)
            end_time = time.perf_counter()

            duration = end_time - start_time
            chars_per_second = len(text) / duration if duration > 0 else float('inf')
            tokens_per_second = len(tokens) / duration if duration > 0 else float('inf')

            # Memory estimation (rough): sys.getsizeof counts only the list object, not the token strings it holds
            token_memory = sys.getsizeof(tokens)

            print(f"{method_name:12} | {duration*1000:6.1f}ms | {chars_per_second/1000:6.1f}K chars/s | {len(tokens):5d} tokens | {token_memory/1024:.1f}KB")

            results.append({
                'Text Size': text_size,
                'Method': method_name,
                'Duration (ms)': duration * 1000,
                'Chars/sec': chars_per_second,
                'Tokens': len(tokens),
                'Memory (KB)': token_memory / 1024
            })

    # Performance summary
    print("\n" + "=" * 55)
    print("📊 PERFORMANCE SUMMARY:")

    perf_df = pd.DataFrame(results)

    # Average performance by method
    avg_perf = perf_df.groupby('Method').agg({
        'Duration (ms)': 'mean',
        'Chars/sec': 'mean',
        'Memory (KB)': 'mean'
    }).round(2)

    print("\n📈 Average Performance by Method:")
    print(avg_perf)

    # Visualization
    plt.figure(figsize=(15, 5))

    # Speed comparison
    plt.subplot(1, 3, 1)
    for method in perf_df['Method'].unique():
        method_data = perf_df[perf_df['Method'] == method]
        plt.plot(range(len(method_data)), method_data['Chars/sec'],
                         marker='o', label=method)
    plt.xlabel('Text Size Category')
    plt.ylabel('Characters/Second')
    plt.title('Processing Speed')
    plt.legend()
    plt.yscale('log')

    # Token efficiency
    plt.subplot(1, 3, 2)
    pivot_tokens = perf_df.pivot(index='Text Size', columns='Method', values='Tokens')
    pivot_tokens.plot(kind='bar', ax=plt.gca())
    plt.title('Token Count by Method')
    plt.ylabel('Number of Tokens')
    plt.xticks(rotation=45)

    # Memory usage
    plt.subplot(1, 3, 3)
    for method in perf_df['Method'].unique():
        method_data = perf_df[perf_df['Method'] == method]
        plt.plot(range(len(method_data)), method_data['Memory (KB)'],
                         marker='s', label=method)
    plt.xlabel('Text Size Category')
    plt.ylabel('Memory Usage (KB)')
    plt.title('Memory Efficiency')
    plt.legend()

    plt.tight_layout()
    plt.show()

    print("\n🎯 KEY PERFORMANCE INSIGHTS:")
    fastest_method = avg_perf['Chars/sec'].idxmax()
    most_memory_efficient = avg_perf['Memory (KB)'].idxmin()

    print(f"⚡ Fastest method: {fastest_method}")
    print(f"💾 Most memory efficient: {most_memory_efficient}")
    print(f"🔤 Character tokenization creates {perf_df[perf_df['Method']=='Character']['Tokens'].iloc[0]/perf_df[perf_df['Method']=='Word (NLTK)']['Tokens'].iloc[0]:.1f}x more tokens than word tokenization")

measure_tokenization_performance()

## 🎓 Summary and Next Steps

Congratulations! You've now experienced hands-on how different tokenization strategies work and their trade-offs.

In [None]:
import time

def generate_learning_summary():
    """Generate a personalized learning summary"""

    print("🎓 TOKENIZATION MASTERY SUMMARY")
    print("=" * 45)

    print("\n✅ WHAT YOU'VE LEARNED:")
    learnings = [
        "How computers break human language into processable pieces",
        "The evolution from word-level to sophisticated subword methods",
        "Why subword methods like BPE became the standard choice for large language models",
        "How to choose the right tokenization for your use case",
        "The relationship between tokens and LLM training scale",
        "Performance trade-offs between different approaches"
    ]

    for i, learning in enumerate(learnings, 1):
        print(f"  {i}. {learning}")

    print("\n🔧 PRACTICAL SKILLS GAINED:")
    skills = [
        "Implementing word, character, and BPE tokenization",
        "Using production tokenizers (GPT-2, SentencePiece)",
        "Analyzing tokenization performance and efficiency",
        "Debugging tokenization issues with real examples",
        "Making informed decisions about tokenization strategies"
    ]

    for i, skill in enumerate(skills, 1):
        print(f"  {i}. {skill}")

    print("\n🚀 NEXT STEPS IN YOUR NLP JOURNEY:")
    next_steps = [
        "📚 Word Embeddings: How tokens become meaningful vectors",
        "🔍 Attention Mechanisms: How models focus on relevant tokens",
        "🏗️ Transformer Architecture: The building blocks of modern LLMs",
        "⚙️ Fine-tuning: Adapting pre-trained models to your domain",
        "🎯 Evaluation: Measuring NLP model performance",
        "🔧 Production: Deploying NLP models at scale"
    ]

    for step in next_steps:
        print(f"  • {step}")

    print("\n💡 EXPERIMENT IDEAS:")
    experiments = [
        "Try tokenizing text in different languages",
        "Compare tokenization of code vs natural language",
        "Analyze how tokenization affects sentiment analysis",
        "Experiment with custom BPE vocabulary sizes",
        "Build a simple text classifier using different tokenizers",
        "Measure tokenization impact on model performance"
    ]

    for experiment in experiments:
        print(f"  🧪 {experiment}")

    print("\n📚 RECOMMENDED RESOURCES:")
    resources = [
        "🔗 Hugging Face Tokenizers Library Documentation",
        "📖 'Natural Language Processing with Python' (NLTK Book)",
        "🎥 Andrej Karpathy's 'Let's build GPT' video series",
        "📄 BPE for NLP paper: 'Neural Machine Translation of Rare Words with Subword Units' (Sennrich et al., 2016)",
        "🌐 OpenAI's GPT-2 tokenizer implementation",
        "🔬 Google's SentencePiece research papers"
    ]

    for resource in resources:
        print(f"  {resource}")

    print("\n" + "=" * 45)
    print("🌟 Remember: Great tokenization is the foundation of great NLP!")
    print("Keep experimenting, keep learning, and most importantly...")
    print("🚀 Keep building amazing things with language AI!")

    # Generate a completion certificate
    print("\n" + "="*50)
    print("🏆 TOKENIZATION MASTERY CERTIFICATE")
    print("="*50)
    print(f"This certifies that you have successfully completed")
    print(f"the Interactive Tokenization Workshop and demonstrated")
    print(f"understanding of fundamental NLP tokenization concepts.")
    print(f"")
    print(f"Date: {time.strftime('%Y-%m-%d')}")
    print(f"Workshop: Breaking Language Into Pieces")
    print(f"")
    print(f"Skills Demonstrated:")
    print(f"✓ Word-level tokenization")
    print(f"✓ Character-level tokenization")
    print(f"✓ Byte Pair Encoding (BPE)")
    print(f"✓ Production tokenizer usage")
generate_learning_summary()