# Understanding Tokenization

Tokenization is the process of converting text into tokens that machine learning models can understand. It's a fundamental step in NLP that bridges the gap between human language and numerical representations.

## What is Tokenization?

**Tokenization** breaks down text into smaller units called **tokens**. These tokens are then converted to numerical IDs that models can process.

```
"Hello world!" → ["Hello", "world", "!"] → [7592, 2088, 999]
```
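
To make the two arrows above concrete, here is a minimal sketch in plain Python. The vocabulary `TOY_VOCAB` and the helpers `toy_tokenize` and `toy_encode` are made up for illustration; real tokenizers ship vocabularies with tens of thousands of entries and use more careful splitting rules. The sketch lowercases the text first, as an uncased model would.

```python
import re

# Toy vocabulary for illustration only; the three IDs mirror the example above.
TOY_VOCAB = {"[UNK]": 100, "hello": 7592, "world": 2088, "!": 999}

def toy_tokenize(text):
    """Lowercase, then split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_encode(text, vocab=TOY_VOCAB):
    """Map each token to its ID, falling back to the [UNK] ID for unseen tokens."""
    return [vocab.get(token, vocab["[UNK]"]) for token in toy_tokenize(text)]

tokens = toy_tokenize("Hello world!")
ids = toy_encode("Hello world!")
print(tokens, ids)
```

Any token missing from the toy vocabulary maps to the `[UNK]` ID, which is the problem subword tokenization is designed to reduce.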

## Learning Objectives

By the end of this notebook, you'll understand:
1. Different types of tokenization approaches
2. How to use Hugging Face tokenizers
3. Special tokens and their purposes
4. Handling different text scenarios
5. Comparing tokenizer performance

Let's dive in!

In [None]:
# Import required libraries
from transformers import (
    AutoTokenizer
)
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Set style for better plots
plt.style.use('default')
sns.set_palette("husl")

print("All libraries imported successfully!")

## 1. Types of Tokenization

Let's explore different tokenization approaches by comparing how they handle the same text:

In [None]:
# Sample text to tokenize
sample_text = "Hugging Face revolutionizes natural language processing!"

print(f"Original text: '{sample_text}'")
print(f"Text length: {len(sample_text)} characters")
print("-" * 60)

# Word-level tokenization (simple split)
word_tokens = sample_text.split()
print(f"Word tokenization: {word_tokens}")
print(f"Number of tokens: {len(word_tokens)}")
print()

# Character-level tokenization
char_tokens = list(sample_text)
print(f"Character tokenization: {char_tokens}")
print(f"Number of tokens: {len(char_tokens)}")
print()

# Subword tokenization (we'll see this with actual tokenizers below)
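
Before loading real tokenizers, it may help to see the idea behind subword splitting in a few lines. The sketch below is a simplified greedy longest-match-first splitter in the style of WordPiece. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions for illustration, not the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate substring until it matches a vocabulary entry.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabulary, far smaller than a real one
toy_vocab = {"token", "##ization", "##ize", "##s", "revolution", "##izes"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("revolutionizes", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary, a word is only mapped to `[UNK]` when no piece matches at some position, which is much rarer than a whole-word miss.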

## 2. Hugging Face Tokenizers

Let's load different tokenizers and see how they handle the same text:

In [None]:
# Load different tokenizers
tokenizers = {
    'BERT (WordPiece)': AutoTokenizer.from_pretrained('bert-base-uncased'),
    'GPT-2 (BPE)': AutoTokenizer.from_pretrained('gpt2'),
    'T5 (SentencePiece)': AutoTokenizer.from_pretrained('t5-small'),
    'DistilBERT': AutoTokenizer.from_pretrained('distilbert-base-uncased')
}

# Compare tokenization across different models
test_sentence = "The quick-thinking AI researcher's breakthrough was unprecedented!"

print(f"Input: '{test_sentence}'\n")

for name, tokenizer in tokenizers.items():
    tokens = tokenizer.tokenize(test_sentence)
    token_ids = tokenizer.encode(test_sentence, add_special_tokens=False)
    
    print(f"{name}:")
    print(f"  Tokens: {tokens}")
    print(f"  Count: {len(tokens)} tokens")
    print(f"  IDs: {token_ids[:10]}{'...' if len(token_ids) > 10 else ''}")
    print()
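
The GPT-2 tokenizer above is built on byte pair encoding (BPE), which learns its vocabulary by repeatedly merging the most frequent adjacent pair of symbols. The following is a rough sketch of that training loop on a hypothetical four-word corpus. The helpers `count_pairs` and `merge_pair` are illustrative names, and real implementations add details such as byte-level alphabets and end-of-word handling.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical corpus: each word is a tuple of characters mapped to its count.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pairs = count_pairs(corpus)
    best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)
```

Each learned merge becomes a vocabulary entry, so frequent character sequences such as `est` end up as single tokens while rare words stay split into smaller pieces.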

## 3. Understanding Special Tokens

Special tokens have specific meanings and help models understand text structure:

In [None]:
# Let's examine BERT's special tokens
bert_tokenizer = tokenizers['BERT (WordPiece)']

print("=== BERT Special Tokens ===")
special_tokens = {
    'CLS Token': bert_tokenizer.cls_token,
    'SEP Token': bert_tokenizer.sep_token, 
    'PAD Token': bert_tokenizer.pad_token,
    'UNK Token': bert_tokenizer.unk_token,
    'MASK Token': bert_tokenizer.mask_token
}

for name, token in special_tokens.items():
    token_id = bert_tokenizer.convert_tokens_to_ids(token)
    print(f"{name}: '{token}' (ID: {token_id})")

print("\n=== What they do ===")
print("[CLS]: Classification token - placed at the beginning")
print("[SEP]: Separator token - separates sentences")
print("[PAD]: Padding token - fills sequences to equal length")
print("[UNK]: Unknown token - replaces out-of-vocabulary words")
print("[MASK]: Mask token - used for masked language modeling")

In [None]:
# See special tokens in action
sentence1 = "I love machine learning."
sentence2 = "It's fascinating and powerful."

# Tokenize the concatenated text; tokenize() returns tokens only, without special tokens
tokens_without = bert_tokenizer.tokenize(sentence1 + " " + sentence2)

# Using encode for proper special token handling
encoded = bert_tokenizer.encode(sentence1, sentence2, add_special_tokens=True)
tokens_from_ids = bert_tokenizer.convert_ids_to_tokens(encoded)

print("Two sentences:")
print(f"Sentence 1: '{sentence1}'")
print(f"Sentence 2: '{sentence2}'")
print()

print("Without special tokens:")
print(f"Tokens: {tokens_without}")
print()

print("With special tokens:")
print(f"Tokens: {tokens_from_ids}")
print(f"Token IDs: {encoded}")
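
The layout of a BERT-style sentence pair can also be written out by hand. This sketch assumes the two sentences are already tokenized (the token lists below are typed in manually, not produced by the tokenizer) and uses a hypothetical helper, `build_pair_input`, to add the special tokens and the segment IDs that tell the model which sentence each token belongs to.

```python
def build_pair_input(tokens_a, tokens_b, cls_token="[CLS]", sep_token="[SEP]"):
    """Assemble a BERT-style sentence pair and its segment IDs."""
    tokens = [cls_token] + tokens_a + [sep_token] + tokens_b + [sep_token]
    # Segment 0 covers [CLS], sentence A, and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids

# Hand-written token lists for the two sentences
pair_tokens, token_type_ids = build_pair_input(
    ["i", "love", "machine", "learning", "."],
    ["it", "'", "s", "fascinating", "and", "powerful", "."],
)

for token, segment in zip(pair_tokens, token_type_ids):
    print(f"{segment}  {token}")
```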

## 4. Encoding and Decoding

Let's understand the full tokenization pipeline:

In [None]:
# Complete tokenization pipeline
text = "Tokenization is the first step in NLP preprocessing!"

print(f"Original text: '{text}'")
print("="*50)

# Step 1: Text → Tokens
tokens = bert_tokenizer.tokenize(text)
print(f"1. Tokenize: {tokens}")

# Step 2: Tokens → IDs
token_ids = bert_tokenizer.convert_tokens_to_ids(tokens)
print(f"2. Convert to IDs: {token_ids}")

# Step 3: IDs → Tokens
tokens_back = bert_tokenizer.convert_ids_to_tokens(token_ids)
print(f"3. Convert back to tokens: {tokens_back}")

# Step 4: Tokens → Text
text_back = bert_tokenizer.convert_tokens_to_string(tokens_back)
print(f"4. Convert back to text: '{text_back}'")

# All-in-one methods
print("\n" + "="*50)
print("All-in-one methods:")
encoded = bert_tokenizer.encode(text, add_special_tokens=True)
decoded = bert_tokenizer.decode(encoded)
print(f"Encode: {encoded}")
print(f"Decode: '{decoded}'")
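
Step 4 is worth a closer look, because the round trip is not always exact. The sketch below is a simplified stand-in for what converting WordPiece tokens back to a string involves: gluing `##` pieces onto the previous token and joining the rest with spaces. `join_wordpieces` is a hypothetical helper and the token list is hand-written.

```python
def join_wordpieces(tokens):
    """Rebuild text from WordPiece tokens by gluing '##' pieces onto the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # strip the '##' marker and append to the last word
        else:
            words.append(token)
    return " ".join(words)

pieces = ["token", "##ization", "is", "the", "first", "step", "!"]
rebuilt = join_wordpieces(pieces)
print(rebuilt)
```

Note that the rebuilt text is lowercase and has a space before the exclamation mark, so information such as casing is lost once an uncased tokenizer has processed the text.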

## 5. Handling Out-of-Vocabulary (OOV) Words

Let's see how different tokenizers handle words they haven't seen before:

In [None]:
# Text with unusual/made-up words
oov_text = "The splendiferous AI researcher invented the wonderflabber technique!"

print(f"Text with OOV words: '{oov_text}'\n")

# Test different tokenizers
for name, tokenizer in tokenizers.items():
    tokens = tokenizer.tokenize(oov_text)
    print(f"{name}:")
    print(f"  Tokens: {tokens}")
    
    # Flag WordPiece continuation pieces ('##...') and unknown tokens.
    # '▁' (SentencePiece) and 'Ġ' (GPT-2) mark the start of a word, so they are not flagged.
    oov_indicators = [
        token for token in tokens
        if token.startswith('##') or token == tokenizer.unk_token
    ]
    
    if oov_indicators:
        print(f"  Continuation/unknown tokens: {oov_indicators}")
    print()
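
One reason GPT-2 rarely needs an unknown token is that its BPE works on bytes rather than characters. As a rough illustration, any string can be written as a sequence of UTF-8 byte values, so a base alphabet of 256 symbols is enough to represent every made-up word. The helper names below are hypothetical.

```python
def to_byte_tokens(text):
    """Represent text as UTF-8 byte values, a base alphabet of only 256 symbols."""
    return list(text.encode("utf-8"))

def from_byte_tokens(byte_ids):
    """Invert to_byte_tokens by decoding the byte values back into a string."""
    return bytes(byte_ids).decode("utf-8")

word = "wonderflabber"
byte_ids = to_byte_tokens(word)

print(byte_ids)
print(from_byte_tokens(byte_ids) == word)
```

A byte-level tokenizer then applies learned merges on top of these byte values, so common words still end up as one or two tokens.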

## 6. Tokenization with Attention Masks and Padding

When processing batches, we need consistent sequence lengths:

In [None]:
# Different length sentences
sentences = [
    "Short sentence.",
    "This is a medium-length sentence with more words.",
    "This is a very long sentence that contains many words and will require more tokens to represent properly."
]

print("=== Tokenizing Multiple Sentences ===")
for i, sentence in enumerate(sentences):
    tokens = bert_tokenizer.tokenize(sentence)
    print(f"Sentence {i+1}: {len(tokens)} tokens")
    print(f"  '{sentence}'")
    print(f"  Tokens: {tokens[:10]}{'...' if len(tokens) > 10 else ''}")
    print()

In [None]:
# Batch tokenization with padding
tokenized_batch = bert_tokenizer(
    sentences,
    padding=True,  # Pad to the longest sequence
    truncation=True,  # Truncate if too long
    max_length=128,  # Maximum sequence length
    return_tensors="pt"  # Return PyTorch tensors
)

print("=== Batch Tokenization Results ===")
print(f"Input IDs shape: {tokenized_batch['input_ids'].shape}")
print(f"Attention mask shape: {tokenized_batch['attention_mask'].shape}")
print()

# Show the results for each sentence
for i in range(len(sentences)):
    input_ids = tokenized_batch['input_ids'][i]
    attention_mask = tokenized_batch['attention_mask'][i]
    
    print(f"Sentence {i+1}:")
    print(f"  Input IDs: {input_ids[:15]}...")
    print(f"  Attention: {attention_mask[:15]}...")
    print(f"  Non-padding tokens: {attention_mask.sum().item()}")
    print()
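
What `padding=True` does can be sketched in plain Python. This is a simplified illustration, assuming right-padding with a pad ID of 0 (the ID of `[PAD]` in BERT's vocabulary). `pad_batch` is a hypothetical helper and the ID sequences are made up.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        padding = max_len - len(ids)
        input_ids.append(ids + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * padding)
    return input_ids, attention_mask

# Made-up ID sequences of different lengths
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
padded_ids, padded_mask = pad_batch(batch)

print(padded_ids)
print(padded_mask)
```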

## 7. Visualizing Tokenization

Let's create some visualizations to better understand tokenization:

In [None]:
# Compare token counts across different tokenizers
test_sentences = [
    "Hello world!",
    "Natural language processing is amazing.",
    "The transformer architecture revolutionized AI.",
    "Subword tokenization handles out-of-vocabulary words efficiently.",
    "Machine learning models require numerical representations of text data."
]

# Collect token counts
results = []
for idx, sentence in enumerate(test_sentences, start=1):
    for name, tokenizer in tokenizers.items():
        tokens = tokenizer.tokenize(sentence)
        results.append({
            'Sentence': f"Sentence {idx}",
            'Tokenizer': name,
            'Token Count': len(tokens),
            'Text Length': len(sentence)
        })

df = pd.DataFrame(results)

# Create visualization
plt.figure(figsize=(12, 6))
sns.barplot(data=df, x='Sentence', y='Token Count', hue='Tokenizer')
plt.title('Token Count Comparison Across Different Tokenizers')
plt.xlabel('Test Sentences')
plt.ylabel('Number of Tokens')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

# Show the actual sentences
print("\nTest sentences:")
for i, sentence in enumerate(test_sentences):
    print(f"{i+1}. '{sentence}' ({len(sentence)} chars)")

In [None]:
# Token length distribution
sample_texts = [
    "AI", "GPT", "BERT", "transformer", "attention", "mechanism", 
    "tokenization", "preprocessing", "subword", "vocabulary",
    "neural", "network", "machine", "learning", "algorithm",
    "unprecedented", "revolutionary", "extraordinary", "magnificent"
]

tokenizer = bert_tokenizer
token_data = []

for text in sample_texts:
    tokens = tokenizer.tokenize(text)
    # One row per word; repeating the row once per token would skew the trend line
    token_data.append((text, len(text), len(tokens), len(tokens) / len(text)))

token_df = pd.DataFrame(token_data, columns=['Word', 'Char_Length', 'Token_Count', 'Tokens_per_Char'])

plt.figure(figsize=(10, 6))
plt.scatter(token_df['Char_Length'], token_df['Token_Count'], alpha=0.6)
plt.xlabel('Word Length (characters)')
plt.ylabel('Token Count')
plt.title('Character Length vs Token Count')

# Add trend line
z = np.polyfit(token_df['Char_Length'], token_df['Token_Count'], 1)
p = np.poly1d(z)
plt.plot(token_df['Char_Length'], p(token_df['Char_Length']), "r--", alpha=0.8)

plt.grid(True, alpha=0.3)
plt.show()

## 8. Advanced Tokenization Features

In [None]:
# Fast tokenizers with offsets
from transformers import AutoTokenizer

# Load a fast tokenizer
fast_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=True)

text = "Hugging Face provides amazing NLP tools!"

# Tokenize with offsets (character positions).
# Plain Python lists keep the offsets as ints, so they can slice the text directly.
encoding = fast_tokenizer(
    text,
    return_offsets_mapping=True
)

tokens = fast_tokenizer.convert_ids_to_tokens(encoding["input_ids"])
offsets = encoding["offset_mapping"]

print(f"Original text: '{text}'")
print("\nTokens with character positions:")
print("-" * 50)

for token, (start, end) in zip(tokens, offsets):
    if start == end == 0:  # Special tokens
        print(f"'{token}' -> Special token")
    else:
        original_text = text[start:end]
        print(f"'{token}' -> '{original_text}' (pos {start}-{end})")
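
The idea behind an offset mapping is simply that each token remembers where it came from in the original string. Here is a minimal sketch using a regular expression instead of a real tokenizer; `tokenize_with_offsets` is a hypothetical helper, and its splitting rule (words and single punctuation marks) is much cruder than a subword tokenizer's.

```python
import re

def tokenize_with_offsets(text):
    """Return (token, (start, end)) pairs for words and punctuation marks."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]

source = "Hugging Face provides amazing NLP tools!"
token_spans = tokenize_with_offsets(source)

for token, (start, end) in token_spans:
    # Slicing the original text with the offsets recovers the token exactly
    print(f"{token!r} -> ({start}, {end}) -> {source[start:end]!r}")
```

Offsets like these are what make it possible to map token-level predictions, such as named entities, back to character spans in the input.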

In [None]:
# Working with different languages
multilingual_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

multilingual_texts = {
    "English": "Hello, how are you?",
    "French": "Bonjour, comment allez-vous?",
    "German": "Hallo, wie geht es dir?",
    "Japanese": "こんにちは、元気ですか？",
    "Arabic": "مرحبا، كيف حالك؟"
}

print("=== Multilingual Tokenization ===")
for language, text in multilingual_texts.items():
    tokens = multilingual_tokenizer.tokenize(text)
    print(f"{language}: '{text}'")
    print(f"  Tokens ({len(tokens)}): {tokens}")
    print()

## 9. Performance Comparison

Let's measure tokenization speed:

In [None]:
import time

# Generate test data
long_text = "This is a test sentence for measuring tokenization speed. " * 1000
test_texts = [long_text] * 100

print(f"Testing with {len(test_texts)} texts, each {len(long_text):,} characters")
print("="*60)

performance_results = []

for name, tokenizer in tokenizers.items():
    start_time = time.perf_counter()  # monotonic, high-resolution timer
    
    # Tokenize all texts
    for text in test_texts:
        _ = tokenizer.tokenize(text)
    
    end_time = time.perf_counter()
    duration = end_time - start_time
    
    performance_results.append((name, duration))
    print(f"{name}: {duration:.3f} seconds")

# Sort by performance
performance_results.sort(key=lambda x: x[1])
print("\nRanked by speed (fastest first):")
for i, (name, duration) in enumerate(performance_results, 1):
    print(f"{i}. {name}: {duration:.3f}s")
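
A single timed run can be noisy, so one option is to repeat the measurement and keep the fastest run. The sketch below uses the standard library's `timeit` module and times `str.split` as a stand-in, so that it runs without downloading any model. `best_time` is a hypothetical helper; swapping in a call to a real tokenizer is left as an assumption about how you would use it.

```python
import timeit

def best_time(func, repeat=5, number=100):
    """Return the fastest per-call time in seconds across several repeated runs."""
    runs = timeit.repeat(func, repeat=repeat, number=number)
    return min(runs) / number

sentence = "This is a test sentence for measuring tokenization speed. " * 10
seconds = best_time(lambda: sentence.split())

print(f"str.split: {seconds * 1e6:.2f} microseconds per call")
```

Taking the minimum rather than the mean reduces the influence of background activity on the machine.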

## 10. Practical Tips and Best Practices

In [None]:
print("=== Tokenization Best Practices ===")
print()

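# 1. Always check sequence length
def check_sequence_length(text, tokenizer, max_length=512):
    tokens = tokenizer.tokenize(text)
    # tokenize() omits special tokens, so count the ones the model input will add
    token_count = len(tokens) + tokenizer.num_special_tokens_to_add()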
    
    if token_count > max_length:
        print(f"Warning: Text has {token_count} tokens, exceeds max_length of {max_length}")
        return False
    else:
        print(f"Text has {token_count} tokens, within limit")
        return True

example_text = "This is an example text. " * 100
print("1. Checking sequence length:")
check_sequence_length(example_text, bert_tokenizer)
print()

# 2. Handle truncation properly
print("2. Proper truncation handling:")
long_text = "AI revolutionizes everything. " * 50

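# Bad: slice the ID list by hand, which drops the closing [SEP] token
truncated_bad = bert_tokenizer.encode(long_text)[:50]

# Good: let the tokenizer truncate, so max_length accounts for [CLS] and [SEP]
truncated_good = bert_tokenizer.encode(
    long_text,
    max_length=50,
    truncation=True,
    add_special_tokens=True  # the default, shown here for emphasis
)

print(f"Bad truncation ends with: {bert_tokenizer.convert_ids_to_tokens(truncated_bad)[-1]}")
print(f"Good truncation ends with: {bert_tokenizer.convert_ids_to_tokens(truncated_good)[-1]}")
print(f"Good truncation length: {len(truncated_good)}")
print(f"Good truncated text: '{bert_tokenizer.decode(truncated_good)}'")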
print()

# 3. Use batch processing for efficiency
print("3. Batch processing example:")
texts_to_process = ["Short text.", "Medium length text here.", "Much longer text that needs processing."]

# Process in batch
batch_encoded = bert_tokenizer(
    texts_to_process,
    padding=True,
    truncation=True,
    return_tensors="pt",
    max_length=128
)

print(f"Batch shape: {batch_encoded['input_ids'].shape}")
print("Efficient batch processing completed")
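
To see why truncation has to account for special tokens, here is a small sketch of the bookkeeping involved. It assumes BERT's IDs for `[CLS]` (101) and `[SEP]` (102) and uses placeholder content IDs. `truncate_with_special_tokens` is a hypothetical helper, not a library function.

```python
def truncate_with_special_tokens(token_ids, max_length, cls_id=101, sep_id=102):
    """Truncate content IDs so that [CLS] + content + [SEP] fits in max_length."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    content = token_ids[: max_length - 2]  # reserve two slots for the special tokens
    return [cls_id] + content + [sep_id]

content_ids = list(range(1000, 1100))  # 100 placeholder content IDs
truncated = truncate_with_special_tokens(content_ids, max_length=50)

print(len(truncated), truncated[0], truncated[-1])
```

The result is exactly `max_length` long and still starts and ends with the special tokens, which is the behavior to expect from the tokenizer's own `truncation=True`.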

## 🎯 Key Takeaways

**What we've learned about tokenization:**

✅ **Tokenization Types**: Word-level, character-level, and subword tokenization  
✅ **Subword Algorithms**: WordPiece (BERT), BPE (GPT-2), Unigram via SentencePiece (T5)  
✅ **Special Tokens**: CLS, SEP, PAD, UNK, MASK and their purposes  
✅ **Encoding/Decoding**: Converting text ↔ tokens ↔ IDs  
✅ **OOV Handling**: How different tokenizers handle unknown words  
✅ **Batch Processing**: Padding, attention masks, and truncation  
✅ **Performance**: Speed comparisons and optimization tips  
✅ **Multilingual**: Working with different languages  

## 🔧 Best Practices Summary

1. **Always check sequence lengths** before training
2. **Use appropriate max_length** and truncation strategies
3. **Process texts in batches** for efficiency
4. **Choose the right tokenizer** for your model
5. **Handle special tokens** properly for your task
6. **Use fast tokenizers** when available for better performance
7. **Consider multilingual tokenizers** for cross-lingual tasks

## 🚀 Next Steps

Now that you understand tokenization, you're ready to:

1. **Move to the next notebook**: `03_model_loading.ipynb`
2. **Experiment** with different tokenizers on your own text
3. **Practice** handling edge cases like very long texts
4. **Explore** tokenizer-specific features in the documentation

## 📚 Additional Resources

- [Tokenizers Documentation](https://huggingface.co/docs/tokenizers/)
- [Understanding Subword Tokenization](https://huggingface.co/course/chapter6/1)
- [Fast Tokenizers Guide](https://huggingface.co/docs/transformers/fast_tokenizers)
- [Multilingual Models](https://huggingface.co/docs/transformers/multilingual)

## 🧪 Try This Yourself

**Exercise 1**: Compare how different tokenizers handle technical jargon in your field

**Exercise 2**: Test tokenization on text in different languages

**Exercise 3**: Measure tokenization performance on your own dataset

**Exercise 4**: Experiment with custom tokenizer settings

Happy tokenizing! 🔤