# Interactive Tokenization Tutorial: BPE, WordPiece, and SentencePiece

This tutorial provides hands-on experience with three widely used subword tokenization approaches in modern NLP: BPE, WordPiece, and SentencePiece. We'll use pretrained tokenizers from Hugging Face to see how each one behaves in practice.

## Learning Objectives
- Understand how different tokenization algorithms split text
- Learn the vocabulary building process for each algorithm
- See how OOV (Out-of-Vocabulary) words are handled
- Practice with token IDs, encoding, and decoding
- Compare algorithms on real examples

## Prerequisites
Make sure you have the required packages installed:
```bash
pip install torch transformers tokenizers sentencepiece datasets pandas matplotlib seaborn
```

In [None]:
# Import required libraries
import torch
from transformers import AutoTokenizer, GPT2Tokenizer, BertTokenizer, T5Tokenizer
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece, Unigram
from tokenizers.trainers import BpeTrainer, WordPieceTrainer, UnigramTrainer
from tokenizers.pre_tokenizers import Whitespace
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from collections import Counter
import re

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

print("All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")

## Part 1: Understanding Tokenization Fundamentals

Before diving into specific algorithms, let's understand what tokenization is and why it matters.
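
As a minimal sketch of the idea (plain Python, not the Hugging Face API), the snippet below turns one sentence into integer IDs in two naive ways: by whitespace-separated word and by character. The helper `build_vocab` and the example sentence are illustrative assumptions, not part of any library.

```python
# Illustrative only: two naive tokenizers and the ID mappings they induce.
text = "the cat sat on the mat"

def build_vocab(tokens):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

# Word-level: short sequences, but every new word needs its own vocabulary entry.
word_tokens = text.split()
word_vocab = build_vocab(word_tokens)
word_ids = [word_vocab[t] for t in word_tokens]

# Character-level: tiny vocabulary, but much longer sequences.
char_tokens = list(text)
char_vocab = build_vocab(char_tokens)
char_ids = [char_vocab[c] for c in char_tokens]

print(f"word-level: {len(word_ids)} tokens, vocab size {len(word_vocab)}")
print(f"char-level: {len(char_ids)} tokens, vocab size {len(char_vocab)}")
```

Subword tokenizers sit between these two extremes: frequent words stay whole, while rare words are split into smaller reusable pieces.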

In [None]:
# Sample texts for our experiments
sample_texts = [
    "Hello, world! This is a tokenization tutorial.",
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning models need to understand text.",
    "Tokenization is preprocessing for NLP tasks.",
    "Subword tokenization handles out-of-vocabulary words effectively.",
    "GPT-2 uses Byte-Pair Encoding for tokenization.",
    "BERT uses WordPiece tokenization algorithm.",
    "T5 uses SentencePiece with unigram language model."
]

# Let's also test with some challenging examples
challenging_texts = [
    "COVID-19 pandemic started in 2020.",
    "The transformer architecture revolutionized NLP.",
    "Hugging Face transformers library is amazing!",
    "Pre-training and fine-tuning are key concepts.",
    "Attention is all you need (2017).",
    "GPT-3.5-turbo and GPT-4 are powerful models."
]

all_texts = sample_texts + challenging_texts

print("Sample texts prepared for tokenization experiments:")
for i, text in enumerate(all_texts[:3]):
    print(f"{i+1}. {text}")
print(f"... and {len(all_texts)-3} more texts")

## Part 2: Byte-Pair Encoding (BPE) - GPT-2 Style

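BPE is used by the GPT family of models. Training starts from a base alphabet of single symbols (raw bytes in GPT-2's byte-level variant) and repeatedly merges the most frequent adjacent pair into a new token until the target vocabulary size is reached.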
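
To make the merge step concrete, here is a small sketch of a single BPE training iteration on a toy corpus. The corpus, its frequencies, and the helper names `count_pairs` and `merge_pair` are made up for illustration; GPT-2's real tokenizer works on bytes and was trained on far more data.

```python
from collections import Counter

# Toy corpus: each word is a tuple of symbols with its frequency (illustrative data).
corpus = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w", "e", "s", "t"): 6,
    ("w", "i", "d", "e", "s", "t"): 3,
}

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

pairs = count_pairs(corpus)
# Ties are broken by first occurrence in this sketch.
best = max(pairs, key=pairs.get)
corpus = merge_pair(corpus, best)

print(f"Most frequent pair: {best} (count {pairs[best]})")
print(f"Corpus after one merge: {list(corpus)}")
```

Repeating the count-and-merge loop grows the vocabulary by one token per iteration; the ordered list of merges is what the tokenizer later replays on new text.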

In [None]:
# Load GPT-2 tokenizer (uses BPE)
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

print("GPT-2 Tokenizer (BPE) Analysis")
print("=" * 50)
print(f"Vocabulary size: {len(gpt2_tokenizer)}")
print(f"Model max length: {gpt2_tokenizer.model_max_length}")
print(f"Special tokens: {gpt2_tokenizer.special_tokens_map}")

# Demonstrate tokenization on sample text
test_text = "Hello, tokenization! This is a test with out-of-vocabulary words like COVID-19."
print(f"\nOriginal text: '{test_text}'")

# Encode to tokens
tokens = gpt2_tokenizer.tokenize(test_text)
print(f"\nTokens ({len(tokens)}): {tokens}")

# Encode to IDs
token_ids = gpt2_tokenizer.encode(test_text)
print(f"\nToken IDs ({len(token_ids)}): {token_ids}")

# Decode back to text
decoded_text = gpt2_tokenizer.decode(token_ids)
print(f"\nDecoded text: '{decoded_text}'")
print(f"Perfect reconstruction: {test_text == decoded_text}")

In [None]:
# Interactive BPE Analysis Function
def analyze_bpe_tokenization(text, tokenizer, name="BPE"):
    """Analyze how BPE tokenizes a given text."""
    print(f"\n{name} Tokenization Analysis")
    print("-" * 40)
    print(f"Original: '{text}'")
    
    # Tokenize
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.encode(text)
    
    print(f"Tokens ({len(tokens)}): {tokens}")
    print(f"Token IDs: {token_ids}")
    
    # Show the ID assigned to each token
    print("\nToken → ID mapping:")
    for i, token in enumerate(tokens):
        # GPT-2 marks a leading space with 'Ġ'; show it as '▁' for readability
        display_token = token.replace('Ġ', '▁')
        # Look up the ID directly; re-encoding the token string would tokenize 'Ġ' itself
        token_id = tokenizer.convert_tokens_to_ids(token)
        print(f"  {i:2d}: '{display_token}' → ID {token_id}")
    
    return tokens, token_ids

# Test BPE on various examples
test_cases = [
    "hello world",
    "tokenization",
    "COVID-19",
    "transformer",
    "out-of-vocabulary"
]

for test_case in test_cases:
    analyze_bpe_tokenization(test_case, gpt2_tokenizer, "GPT-2 BPE")

## Part 3: WordPiece - BERT Style

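WordPiece is used by BERT and similar models. At tokenization time it splits each word greedily, repeatedly taking the longest vocabulary entry that matches the start of the remaining characters; pieces that continue a word carry a `##` prefix.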
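
The greedy longest-match procedure can be sketched in a few lines. This is a simplified illustration with a hypothetical toy vocabulary (`toy_vocab`) and a hypothetical function name; the real BERT tokenizer also lowercases, splits on punctuation, and caps word length before this step.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##ization", "##ize", "##r", "un", "##related"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("tokenizer", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Note that a single unmatched position turns the entire word into `[UNK]`, which is where WordPiece can lose information.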

In [None]:
# Load BERT tokenizer (uses WordPiece)
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

print("BERT Tokenizer (WordPiece) Analysis")
print("=" * 50)
print(f"Vocabulary size: {len(bert_tokenizer)}")
print(f"Model max length: {bert_tokenizer.model_max_length}")
print(f"Special tokens: {bert_tokenizer.special_tokens_map}")

# WordPiece specific analysis
def analyze_wordpiece_tokenization(text, tokenizer, name="WordPiece"):
    """Analyze how WordPiece tokenizes a given text."""
    print(f"\n{name} Tokenization Analysis")
    print("-" * 40)
    print(f"Original: '{text}'")
    
    # Tokenize
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    
    print(f"Tokens ({len(tokens)}): {tokens}")
    print(f"Token IDs: {token_ids}")
    
    # Show WordPiece specific features
    print("\nWordPiece features:")
    for i, token in enumerate(tokens):
        if token.startswith('##'):
            feature = "Continuation subword (##)"
        elif token in ['[CLS]', '[SEP]', '[PAD]', '[UNK]', '[MASK]']:
            feature = "Special token"
        else:
            feature = "Word beginning"
        print(f"  {i:2d}: '{token}' → {feature}")
    
    return tokens, token_ids

# Test WordPiece on the same examples
for test_case in test_cases:
    analyze_wordpiece_tokenization(test_case, bert_tokenizer, "BERT WordPiece")

## Part 4: SentencePiece - T5 Style

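SentencePiece is used by T5, mT5, and many multilingual models. It operates on raw text as a sequence of Unicode characters, with no language-specific pre-tokenization: whitespace is kept as an ordinary symbol, written `▁`. SentencePiece is a library rather than a single algorithm; it supports both BPE and unigram language model segmentation, and T5 uses the unigram variant.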
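
The sketch below illustrates two of these ideas in plain Python: escaping whitespace as `▁`, and choosing the most probable segmentation under a unigram model with a Viterbi search. The vocabulary and its log-probabilities are invented for the example, and the function names are hypothetical; the real library learns the probabilities from data and also applies Unicode normalization.

```python
import math

# Hypothetical unigram vocabulary with made-up log-probabilities.
toy_logprobs = {
    "▁": -5.0, "▁hello": -2.0, "▁world": -2.5, "▁wor": -4.0, "ld": -4.0,
    "h": -6.0, "e": -6.0, "l": -6.0, "o": -6.0, "w": -6.0, "r": -6.0, "d": -6.0,
}

def escape_whitespace(text):
    """Add a leading marker and replace spaces with the visible marker ▁ (U+2581)."""
    return "▁" + text.replace(" ", "▁")

def unigram_segment(text, logprobs):
    """Viterbi search for the segmentation with the highest total log-probability.

    Assumes every character of the escaped text is itself in the vocabulary.
    """
    s = escape_whitespace(text)
    n = len(s)
    best_score = [-math.inf] * (n + 1)
    best_start = [0] * (n + 1)
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = s[start:end]
            if piece in logprobs and best_score[start] + logprobs[piece] > best_score[end]:
                best_score[end] = best_score[start] + logprobs[piece]
                best_start[end] = start
    # Walk the back-pointers from the end of the string to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best_start[end]
        pieces.append(s[start:end])
        end = start
    return pieces[::-1]

pieces = unigram_segment("hello world", toy_logprobs)
# Detokenization: concatenate, turn markers back into spaces, drop the leading one.
restored = "".join(pieces).replace("▁", " ")[1:]

print(pieces)
print(restored)
```

Because the space is part of the token itself, joining the pieces and undoing the marker recovers the original spacing without any language-specific rules.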

In [None]:
# Load T5 tokenizer (uses SentencePiece)
t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')

print("T5 Tokenizer (SentencePiece) Analysis")
print("=" * 50)
print(f"Vocabulary size: {len(t5_tokenizer)}")
print(f"Model max length: {t5_tokenizer.model_max_length}")
print(f"Special tokens: {t5_tokenizer.special_tokens_map}")

# SentencePiece specific analysis
def analyze_sentencepiece_tokenization(text, tokenizer, name="SentencePiece"):
    """Analyze how SentencePiece tokenizes a given text."""
    print(f"\n{name} Tokenization Analysis")
    print("-" * 40)
    print(f"Original: '{text}'")
    
    # Tokenize
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    
    print(f"Tokens ({len(tokens)}): {tokens}")
    print(f"Token IDs: {token_ids}")
    
    # Show SentencePiece specific features
    print("\nSentencePiece features:")
    for i, token in enumerate(tokens):
        if token.startswith('▁'):
            feature = "Word boundary marker (▁)"
        elif token in ['<pad>', '</s>', '<unk>']:
            feature = "Special token"
        else:
            feature = "Subword continuation"
        print(f"  {i:2d}: '{token}' → {feature}")
    
    return tokens, token_ids

# Test SentencePiece on the same examples
for test_case in test_cases:
    analyze_sentencepiece_tokenization(test_case, t5_tokenizer, "T5 SentencePiece")

## Part 5: Comparative Analysis

Let's compare how different tokenizers handle the same text.

In [None]:
def compare_tokenizers(text):
    """Compare how different tokenizers handle the same text."""
    print(f"\nComparative Analysis for: '{text}'")
    print("=" * 60)
    
    # Get tokenizations from all three
    gpt2_tokens = gpt2_tokenizer.tokenize(text)
    bert_tokens = bert_tokenizer.tokenize(text)
    t5_tokens = t5_tokenizer.tokenize(text)
    
    # Create comparison table
    max_tokens = max(len(gpt2_tokens), len(bert_tokens), len(t5_tokens))
    
    print(f"{'Index':<6} {'GPT-2 (BPE)':<20} {'BERT (WordPiece)':<20} {'T5 (SentencePiece)':<20}")
    print("-" * 70)
    
    for i in range(max_tokens):
        gpt2_token = gpt2_tokens[i] if i < len(gpt2_tokens) else ""
        bert_token = bert_tokens[i] if i < len(bert_tokens) else ""
        t5_token = t5_tokens[i] if i < len(t5_tokens) else ""
        
        # Clean up display
        gpt2_display = gpt2_token.replace('Ġ', '▁')
        
        print(f"{i:<6} {gpt2_display:<20} {bert_token:<20} {t5_token:<20}")
    
    print(f"\nToken counts: GPT-2: {len(gpt2_tokens)}, BERT: {len(bert_tokens)}, T5: {len(t5_tokens)}")
    
    return {
        'gpt2': gpt2_tokens,
        'bert': bert_tokens,
        't5': t5_tokens
    }

# Compare on interesting examples
comparison_texts = [
    "Hello, world!",
    "COVID-19 pandemic",
    "transformer architecture",
    "out-of-vocabulary words",
    "GPT-3.5-turbo model"
]

comparison_results = {}
for text in comparison_texts:
    comparison_results[text] = compare_tokenizers(text)

In [None]:
# Visualize token count comparison
def visualize_token_counts(comparison_results):
    """Create a bar chart comparing token counts across tokenizers."""
    texts = list(comparison_results.keys())
    gpt2_counts = [len(comparison_results[text]['gpt2']) for text in texts]
    bert_counts = [len(comparison_results[text]['bert']) for text in texts]
    t5_counts = [len(comparison_results[text]['t5']) for text in texts]
    
    # Create DataFrame for easier plotting
    df = pd.DataFrame({
        'Text': texts,
        'GPT-2 (BPE)': gpt2_counts,
        'BERT (WordPiece)': bert_counts,
        'T5 (SentencePiece)': t5_counts
    })
    
    # Create the plot
    plt.figure(figsize=(12, 8))
    
    x = range(len(texts))
    width = 0.25
    
    plt.bar([i - width for i in x], gpt2_counts, width, label='GPT-2 (BPE)', alpha=0.8)
    plt.bar(x, bert_counts, width, label='BERT (WordPiece)', alpha=0.8)
    plt.bar([i + width for i in x], t5_counts, width, label='T5 (SentencePiece)', alpha=0.8)
    
    plt.xlabel('Text Examples')
    plt.ylabel('Number of Tokens')
    plt.title('Token Count Comparison Across Tokenizers')
    plt.xticks(x, [text[:20] + '...' if len(text) > 20 else text for text in texts], rotation=45, ha='right')
    plt.legend()
    plt.tight_layout()
    plt.grid(axis='y', alpha=0.3)
    
    plt.show()
    
    return df

token_count_df = visualize_token_counts(comparison_results)
print("\nToken Count Summary:")
print(token_count_df)

## Part 6: Handling Out-of-Vocabulary (OOV) Words

Let's see how each tokenizer handles words that are not in its vocabulary.
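
Before running the real tokenizers, it may help to see why the base alphabet matters. The sketch below contrasts a closed word vocabulary, where unseen words collapse to one unknown ID, with a byte-level alphabet, which can represent any string. The vocabulary and function names are toy assumptions, not the actual BERT or GPT-2 implementations.

```python
# Toy closed vocabulary: anything outside it collapses to a single unknown ID.
word_vocab = {"[UNK]": 0, "hello": 1, "world": 2}

def encode_closed(text):
    """Map each whitespace-separated word to its ID, or to [UNK] if unseen."""
    return [word_vocab.get(w, word_vocab["[UNK]"]) for w in text.split()]

def decode_closed(ids):
    """Map IDs back to words; the original spelling of unknown words is lost."""
    inverse = {i: w for w, i in word_vocab.items()}
    return " ".join(inverse[i] for i in ids)

# Byte-level base alphabet: 256 symbols are enough to represent any string.
def encode_bytes(text):
    return list(text.encode("utf-8"))

def decode_bytes(ids):
    return bytes(ids).decode("utf-8")

text = "hello café"
closed_ids = encode_closed(text)
byte_ids = encode_bytes(text)

print(closed_ids, "->", decode_closed(closed_ids))
print(len(byte_ids), "bytes ->", decode_bytes(byte_ids))
```

The closed vocabulary cannot recover "café", while the byte sequence round-trips exactly at the cost of more IDs. Subword tokenizers with a byte-level base aim to keep that lossless fallback while merging frequent byte runs into longer tokens.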

In [None]:
# Test with completely made-up words and technical terms
oov_test_cases = [
    "supercalifragilisticexpialidocious",
    "antidisestablishmentarianism",
    "pseudopseudohypoparathyroidism",
    "Huggingface",  # Company name
    "tokenizer123",  # Alphanumeric
    "COVID-19",  # Hyphenated with numbers
    "@username",  # Social media handle
    "#hashtag",  # Hashtag
    "www.example.com",  # URL
    "user@email.com"  # Email
]

def test_oov_handling(text, tokenizer, tokenizer_name):
    """Test how a tokenizer handles OOV words."""
    print(f"\n{tokenizer_name} OOV Analysis: '{text}'")
    print("-" * 50)
    
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    
    print(f"Tokens ({len(tokens)}): {tokens}")
    print(f"Token IDs: {token_ids}")
    
    # Check for UNK tokens
    unk_count = 0
    if tokenizer_name == "BERT":
        unk_count = tokens.count('[UNK]')
    elif tokenizer_name == "T5":
        unk_count = tokens.count('<unk>')
    # GPT-2's byte-level BPE can always fall back to single bytes, so it never emits UNK
    
    print(f"UNK tokens: {unk_count}")
    
    # Decode and check for information loss
    decoded = tokenizer.decode(token_ids)
    if hasattr(tokenizer, 'clean_up_tokenization'):
        decoded = tokenizer.clean_up_tokenization(decoded)
    
    perfect_reconstruction = (text.lower().strip() == decoded.lower().strip())
    print(f"Decoded: '{decoded}'")
    print(f"Perfect reconstruction: {perfect_reconstruction}")
    
    return tokens, perfect_reconstruction

print("Testing Out-of-Vocabulary Handling")
print("=" * 60)

oov_results = {}
for test_case in oov_test_cases[:5]:  # Test first 5 cases
    print(f"\n{'='*60}")
    print(f"Testing: '{test_case}'")
    
    gpt2_tokens, gpt2_perfect = test_oov_handling(test_case, gpt2_tokenizer, "GPT-2")
    bert_tokens, bert_perfect = test_oov_handling(test_case, bert_tokenizer, "BERT")
    t5_tokens, t5_perfect = test_oov_handling(test_case, t5_tokenizer, "T5")
    
    oov_results[test_case] = {
        'gpt2_tokens': len(gpt2_tokens),
        'bert_tokens': len(bert_tokens),
        't5_tokens': len(t5_tokens),
        'gpt2_perfect': gpt2_perfect,
        'bert_perfect': bert_perfect,
        't5_perfect': t5_perfect
    }

## Part 7: Building a Simple BPE Tokenizer from Scratch

Let's implement a simplified version of BPE to understand the algorithm better.

In [None]:
class SimpleBPE:
    """A simplified implementation of Byte-Pair Encoding."""
    
    def __init__(self):
        self.word_freqs = {}
        self.vocab = set()
        self.merges = []
    
    def get_word_frequencies(self, texts):
        """Count word frequencies in the corpus."""
        word_freq = Counter()
        for text in texts:
            # Simple word splitting
            words = text.lower().split()
            for word in words:
                # Remove punctuation for simplicity
                word = re.sub(r'[^a-zA-Z]', '', word)
                if word:
                    word_freq[word] += 1
        return word_freq
    
    def get_initial_vocab(self, word_freqs):
        """Get initial vocabulary (all characters)."""
        vocab = set()
        for word in word_freqs:
            for char in word:
                vocab.add(char)
        return vocab
    
    def get_pairs(self, word_freqs):
        """Get all adjacent pairs in the corpus with their frequencies."""
        pairs = Counter()
        for word, freq in word_freqs.items():
            chars = list(word)
            for i in range(len(chars) - 1):
                pairs[(chars[i], chars[i + 1])] += freq
        return pairs
    
    def merge_vocab(self, pair, word_freqs):
        """Merge every occurrence of `pair` in each word's symbol sequence."""
        new_word_freqs = {}
        merged = ''.join(pair)

        for word, freq in word_freqs.items():
            # A word is a str on the first pass and a tuple of symbols afterwards
            symbols = list(word)
            new_symbols = []
            i = 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                    new_symbols.append(merged)
                    i += 2
                else:
                    new_symbols.append(symbols[i])
                    i += 1
            # Keep symbol boundaries (no re-joining) so later merges see the new token
            new_word_freqs[tuple(new_symbols)] = freq

        return new_word_freqs
    
    def train(self, texts, num_merges=10):
        """Train the BPE tokenizer."""
        # Get word frequencies
        self.word_freqs = self.get_word_frequencies(texts)
        print(f"Initial vocabulary size: {len(self.get_initial_vocab(self.word_freqs))}")
        print(f"Most frequent words: {dict(self.word_freqs.most_common(5))}")
        
        # Initialize vocabulary with characters
        self.vocab = self.get_initial_vocab(self.word_freqs)
        
        # Perform merges
        current_word_freqs = self.word_freqs.copy()
        
        for i in range(num_merges):
            pairs = self.get_pairs(current_word_freqs)
            if not pairs:
                break
            
            # Get most frequent pair
            best_pair = pairs.most_common(1)[0][0]
            print(f"\nMerge {i+1}: {best_pair} (frequency: {pairs[best_pair]})")
            
            # Merge the pair
            current_word_freqs = self.merge_vocab(best_pair, current_word_freqs)
            
            # Add merged token to vocabulary
            new_token = ''.join(best_pair)
            self.vocab.add(new_token)
            self.merges.append(best_pair)
            
            print(f"New token: '{new_token}'")
            print(f"Vocabulary size: {len(self.vocab)}")
        
        print(f"\nFinal vocabulary size: {len(self.vocab)}")
        print(f"Final vocabulary: {sorted(self.vocab)}")
    
    def tokenize(self, word):
        """Tokenize a word using learned merges."""
        # Start with characters
        tokens = list(word.lower())
        
        # Apply merges in order
        for pair in self.merges:
            i = 0
            while i < len(tokens) - 1:
                if tokens[i] == pair[0] and tokens[i + 1] == pair[1]:
                    # Merge the pair
                    tokens = tokens[:i] + [''.join(pair)] + tokens[i + 2:]
                else:
                    i += 1
        
        return tokens

# Train our simple BPE
simple_bpe = SimpleBPE()
training_texts = [
    "the quick brown fox jumps over the lazy dog",
    "machine learning is transforming the world",
    "tokenization is an important preprocessing step",
    "the transformer architecture uses attention mechanisms",
    "natural language processing helps machines understand text"
]

print("Training Simple BPE Tokenizer")
print("=" * 50)
simple_bpe.train(training_texts, num_merges=15)

In [None]:
# Test our simple BPE tokenizer
test_words = ["the", "quick", "transformer", "tokenization", "machine", "learning"]

print("\nTesting Simple BPE Tokenizer")
print("=" * 40)

for word in test_words:
    tokens = simple_bpe.tokenize(word)
    print(f"'{word}' → {tokens}")

## Part 8: Interactive Experiments

Now let's create some interactive functions to experiment with tokenization.

In [None]:
def interactive_tokenizer_comparison(text):
    """Interactive function to compare tokenizers on any text."""
    print(f"\n🔍 Analyzing: '{text}'")
    print("=" * 80)
    
    tokenizers = {
        "GPT-2 (BPE)": gpt2_tokenizer,
        "BERT (WordPiece)": bert_tokenizer,
        "T5 (SentencePiece)": t5_tokenizer
    }
    
    results = {}
    
    for name, tokenizer in tokenizers.items():
        tokens = tokenizer.tokenize(text)
        token_ids = tokenizer.encode(text, add_special_tokens=False)
        decoded = tokenizer.decode(token_ids)
        
        results[name] = {
            'tokens': tokens,
            'count': len(tokens),
            'ids': token_ids,
            'decoded': decoded,
            'perfect': text.strip().lower() == decoded.strip().lower()
        }
        
        print(f"\n{name}:")
        print(f"  Tokens ({len(tokens)}): {tokens[:10]}{'...' if len(tokens) > 10 else ''}")
        print(f"  Token IDs: {token_ids[:10]}{'...' if len(token_ids) > 10 else ''}")
        print(f"  Perfect reconstruction: {results[name]['perfect']}")
    
    # Summary comparison
    print(f"\n📊 Summary:")
    print(f"  Token counts: ", end="")
    for name in tokenizers.keys():
        print(f"{name.split()[0]}: {results[name]['count']}", end="  ")
    print()
    
    return results

# Test with various examples
test_examples = [
    "Hello, world!",
    "The transformer model revolutionized NLP.",
    "COVID-19 pandemic affected everyone globally.",
    "GPT-3.5-turbo and GPT-4 are amazing models!"
]

for example in test_examples:
    interactive_tokenizer_comparison(example)

In [None]:
# Create a function to explore vocabulary
def explore_vocabulary(tokenizer, name, sample_size=20):
    """Explore the vocabulary of a tokenizer."""
    print(f"\n🔤 {name} Vocabulary Exploration")
    print("=" * 50)
    
    vocab_size = len(tokenizer)
    print(f"Total vocabulary size: {vocab_size:,}")
    
    # Get some sample tokens
    if hasattr(tokenizer, 'get_vocab'):
        vocab = tokenizer.get_vocab()
        
        # Show some interesting tokens
        print(f"\nSample tokens (first {sample_size}):")
        sample_tokens = list(vocab.keys())[:sample_size]
        for i, token in enumerate(sample_tokens):
            token_id = vocab[token]
            # Show GPT-2's space marker 'Ġ' as '▁' for readability
            display_token = token.replace('Ġ', '▁')
            print(f"  {i:2d}: '{display_token}' (ID: {token_id})")
        
        # Look for special patterns
        print(f"\nSpecial patterns found:")
        
        if name == "GPT-2 (BPE)":
            space_tokens = [token for token in vocab.keys() if token.startswith('Ġ')][:5]
            print(f"  Space-prefixed tokens: {[t.replace('Ġ', '▁') for t in space_tokens]}")
        
        elif name == "BERT (WordPiece)":
            subword_tokens = [token for token in vocab.keys() if token.startswith('##')][:5]
            print(f"  Subword tokens (##): {subword_tokens}")
            special_tokens = [token for token in vocab.keys() if token.startswith('[') and token.endswith(']')]
            print(f"  Special tokens: {special_tokens}")
        
        elif name == "T5 (SentencePiece)":
            boundary_tokens = [token for token in vocab.keys() if token.startswith('▁')][:5]
            print(f"  Boundary tokens (▁): {boundary_tokens}")
            special_tokens = [token for token in vocab.keys() if token.startswith('<') and token.endswith('>')]
            print(f"  Special tokens: {special_tokens}")

# Explore vocabularies
explore_vocabulary(gpt2_tokenizer, "GPT-2 (BPE)")
explore_vocabulary(bert_tokenizer, "BERT (WordPiece)")
explore_vocabulary(t5_tokenizer, "T5 (SentencePiece)")

## Part 9: Practical Applications and Best Practices

Let's explore some practical considerations when working with tokenizers.
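
One of those considerations is that models consume fixed-shape batches, so shorter sequences are padded and an attention mask records which positions are real. The following is a minimal pure-Python sketch of that bookkeeping; `pad_batch` is a hypothetical helper and the token IDs are made up, standing in for the output of a real tokenizer.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Right-pad (and truncate) ID lists to a common length and build attention masks."""
    target = max(len(seq) for seq in sequences)
    if max_length is not None:
        target = min(target, max_length)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:target]  # truncate sequences that are too long
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Made-up token IDs standing in for three tokenized sentences.
batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 3231, 102], [101, 102]]
ids, mask = pad_batch(batch, pad_id=0, max_length=5)

for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

This naive truncation simply cuts the tail of a long sequence. Library tokenizers typically handle special tokens during truncation, so prefer their built-in `padding` and `truncation` options in real pipelines.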

In [None]:
# Demonstrate sequence length considerations
def analyze_sequence_lengths(texts, tokenizers_dict):
    """Analyze how sequence lengths vary across tokenizers."""
    print("📏 Sequence Length Analysis")
    print("=" * 50)
    
    results = {}
    
    for name, tokenizer in tokenizers_dict.items():
        lengths = []
        for text in texts:
            tokens = tokenizer.encode(text, add_special_tokens=False)
            lengths.append(len(tokens))
        
        results[name] = {
            'lengths': lengths,
            'mean': sum(lengths) / len(lengths),
            'max': max(lengths),
            'min': min(lengths)
        }
    
    # Print statistics
    print(f"{'Tokenizer':<20} {'Mean':<8} {'Min':<6} {'Max':<6}")
    print("-" * 45)
    for name, stats in results.items():
        print(f"{name:<20} {stats['mean']:<8.1f} {stats['min']:<6} {stats['max']:<6}")
    
    return results

# Test with our sample texts
tokenizers_dict = {
    "GPT-2 (BPE)": gpt2_tokenizer,
    "BERT (WordPiece)": bert_tokenizer,
    "T5 (SentencePiece)": t5_tokenizer
}

length_results = analyze_sequence_lengths(all_texts, tokenizers_dict)

In [None]:
# Demonstrate batch processing and padding
def demonstrate_batch_processing():
    """Show how to handle batches of texts with different lengths."""
    print("🔄 Batch Processing Demonstration")
    print("=" * 50)
    
    # Sample batch with different lengths
    batch_texts = [
        "Short text.",
        "This is a medium-length text for demonstration.",
        "This is a much longer text that will result in more tokens when tokenized by our various tokenizers, which is useful for demonstrating padding and truncation strategies."
    ]
    
    # Use BERT tokenizer for demonstration
    tokenizer = bert_tokenizer
    
    print("Original texts:")
    for i, text in enumerate(batch_texts):
        print(f"  {i+1}: {text[:50]}{'...' if len(text) > 50 else ''}")
    
    # Tokenize without padding
    print("\nTokenized (no padding):")
    for i, text in enumerate(batch_texts):
        tokens = tokenizer.encode(text, add_special_tokens=True)
        print(f"  {i+1}: Length {len(tokens)}, IDs: {tokens[:10]}{'...' if len(tokens) > 10 else ''}")
    
    # Batch encode with padding
    batch_encoding = tokenizer(
        batch_texts,
        padding=True,
        truncation=True,
        max_length=50,
        return_tensors="pt"
    )
    
    print("\nBatch encoded (with padding):")
    print(f"  Input IDs shape: {batch_encoding['input_ids'].shape}")
    print(f"  Attention mask shape: {batch_encoding['attention_mask'].shape}")
    
    print("\nPadded sequences:")
    for i in range(len(batch_texts)):
        input_ids = batch_encoding['input_ids'][i].tolist()
        attention_mask = batch_encoding['attention_mask'][i].tolist()
        
        # Count actual tokens (non-padded)
        actual_tokens = sum(attention_mask)
        print(f"  {i+1}: Actual tokens: {actual_tokens}, Total length: {len(input_ids)}")
        print(f"      Input IDs: {input_ids[:15]}{'...' if len(input_ids) > 15 else ''}")
        print(f"      Attention : {attention_mask[:15]}{'...' if len(attention_mask) > 15 else ''}")

demonstrate_batch_processing()

## Part 10: Key Takeaways and Best Practices

Let's summarize what we've learned about tokenization.

In [None]:
def create_tokenization_summary():
    """Create a comprehensive summary of tokenization concepts."""
    
    summary = """
    🎯 TOKENIZATION SUMMARY & BEST PRACTICES
    ======================================
    
    📚 ALGORITHM COMPARISON:
    
    BPE (Byte-Pair Encoding) - GPT-2 Style:
    ✅ Excellent OOV handling (byte-level BPE falls back to bytes, so no UNK is needed)
    ✅ Deterministic and reversible
    ✅ Good for generative tasks
    ⚠️  Can create very long sequences for rare words
    
    WordPiece - BERT Style:
    ✅ Balances vocabulary size and sequence length
    ✅ Good for understanding tasks
    ✅ Handles morphology reasonably well
    ⚠️  Uses [UNK] for truly unknown words (information loss)
    
    SentencePiece - T5 Style:
    ✅ Language-agnostic (handles any Unicode)
    ✅ No need for pre-tokenization
    ✅ Great for multilingual models
    ✅ Reversible tokenization
    ⚠️  Characters missing from the trained vocabulary map to <unk> unless byte fallback is enabled
    
    🛠️ ENGINEERING BEST PRACTICES:
    
    1. Always use the same tokenizer for training and inference
    2. Pay attention to special tokens ([CLS], [SEP], <s>, </s>, etc.)
    3. Handle padding and truncation appropriately for your task
    4. Consider sequence length limits of your model
    5. Test tokenization on edge cases (URLs, emails, code, etc.)
    6. Use batch processing for efficiency
    7. Understand the trade-offs between vocab size and sequence length
    
    💡 PRACTICAL TIPS:
    
    • BPE: Best for generation, handles any text
    • WordPiece: Good balance for most tasks
    • SentencePiece: Essential for multilingual applications
    
    • Smaller vocab = longer sequences
    • Larger vocab = shorter sequences but more parameters
    
    • Always validate your tokenization pipeline
    • Monitor for unexpected tokenizations
    • Consider domain-specific vocabularies for specialized text
    """
    
    print(summary)

create_tokenization_summary()

# Final interactive test function
def final_tokenization_test(user_text):
    """Final test function for user input."""
    print(f"\n🧪 Final Test with Your Text: '{user_text}'")
    print("=" * 60)
    
    return interactive_tokenizer_comparison(user_text)

# Test with a final example
test_result = final_tokenization_test("The future of AI and NLP is bright with transformers!")

print("\n🎉 Congratulations! You've completed the tokenization tutorial!")
print("Now you understand how BPE, WordPiece, and SentencePiece work in practice.")
print("\n💡 Next steps:")
print("• Experiment with different texts using the functions above")
print("• Try training your own tokenizer on domain-specific data")
print("• Explore how tokenization affects model performance")
print("• Learn about other tokenization strategies (Unigram, etc.)")