# Understanding Tokenization

This notebook demonstrates how language models break text into tokens:
- **What is tokenization?** Breaking text into smaller pieces (tokens)
- **Why does it matter?** Different models tokenize differently
- **Key insight:** The same sentence can produce different tokens in different models


![raw-tokenized-samples.png](raw-tokenized-samples.png)
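
Before loading real tokenizers, a toy version can make the key insight concrete. The sketch below is a simplification: it uses greedy longest-match against two made-up vocabularies (`vocab_a` and `vocab_b` are hypothetical), whereas production tokenizers typically learn merge rules with algorithms such as BPE. It only illustrates that the vocabulary determines how the same text is split.

```python
def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab or end == i + 1:
                tokens.append(piece)
                i = end
                break
    return tokens


# Two made-up vocabularies (hypothetical, not taken from any real model)
vocab_a = {"token", "ization", " matters"}
vocab_b = {"tok", "en", "iz", "ation", " ", "matters"}

text = "tokenization matters"
tokens_a = greedy_tokenize(text, vocab_a)
tokens_b = greedy_tokenize(text, vocab_b)

print(tokens_a)  # 3 tokens
print(tokens_b)  # 6 tokens
```

Both splits join back into the original string, but the token boundaries and token counts differ because the vocabularies differ.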

In [None]:
! pip install transformers tiktoken python-dotenv

In [None]:
import os
from transformers import AutoTokenizer
from dotenv import load_dotenv

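# Log in to Hugging Face (needed for gated repositories such as Llama).
# The token is read from the HF_TOKEN variable in your environment or .env file.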

from huggingface_hub import login
load_dotenv()
login(token=os.getenv("HF_TOKEN"))


## Part 1: Tokenizing with Llama 3.2 1B

Let's start by tokenizing a simple sentence using the Llama 3.2 1B tokenizer.


In [None]:
# Load Llama 3.2 1B tokenizer
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

print(f"Llama 3.2 1B Tokenizer loaded!")
print(f"Vocabulary size: {len(llama_tokenizer):,} tokens")


In [None]:
# Example sentence
sentence = "The hyperparameter-tuning process improved model generalization."

print("="*80)
print("TOKENIZATION WITH LLAMA 3.2 1B")
print("="*80)
print(f"\nOriginal sentence: '{sentence}'")
print(f"Length: {len(sentence)} characters")


In [None]:
# Tokenize the sentence
llama_tokens = llama_tokenizer.encode(sentence, add_special_tokens=False)
llama_token_strings = [llama_tokenizer.decode([t]) for t in llama_tokens]

print("\n" + "-"*80)
print("TOKENIZATION RESULTS")
print("-"*80)
print(f"\nNumber of tokens: {len(llama_tokens)}")
print(f"\nToken IDs:  {llama_tokens}")
print(f"\nToken strings (subwords):")
for i, (token_id, token_str) in enumerate(zip(llama_tokens, llama_token_strings)):
    print(f"  {i}: '{token_str}' (ID: {token_id})")

print("\n" + "="*80)


### Observations

Notice:
- Some words are kept whole (e.g., "The")
- Some words are split into subwords
- Spaces are often included with the following word
- Each token has a unique ID number
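
The leading-space behavior usually comes from a pre-tokenization step that runs before subword splitting. The sketch below uses a deliberately simplified regular expression (`PRETOKEN_PATTERN` is an assumption for illustration, much shorter than the patterns real byte-level BPE tokenizers use) to show how a space can be attached to the word that follows it.

```python
import re

# Simplified pattern (illustrative only): an optional leading space glued to a
# run of letters, or an optional leading space glued to a run of characters
# that are neither letters nor whitespace.
PRETOKEN_PATTERN = re.compile(r" ?[A-Za-z]+| ?[^A-Za-z\s]+")

example_sentence = "The hyperparameter-tuning process improved model generalization."
pieces = PRETOKEN_PATTERN.findall(example_sentence)

for piece in pieces:
    print(repr(piece))
```

Each piece would then be split further into subwords from the vocabulary. Because the space travels with the next word, concatenating the pieces reproduces the original sentence exactly.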


## Part 2: Tokenizing with Mistral 7B

Now let's tokenize the **exact same sentence** using a different model's tokenizer.


In [None]:
# Load Mistral 7B tokenizer
mistral_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

print(f"Mistral 7B Tokenizer loaded!")
print(f"Vocabulary size: {len(mistral_tokenizer):,} tokens")


In [None]:
# Tokenize the SAME sentence
mistral_tokens = mistral_tokenizer.encode(sentence, add_special_tokens=False)
mistral_token_strings = [mistral_tokenizer.decode([t]) for t in mistral_tokens]

print("="*80)
print("TOKENIZATION WITH MISTRAL 7B")
print("="*80)
print(f"\nOriginal sentence: '{sentence}'")

print("\n" + "-"*80)
print("TOKENIZATION RESULTS")
print("-"*80)
print(f"\nNumber of tokens: {len(mistral_tokens)}")
print(f"\nToken IDs:  {mistral_tokens}")
print(f"\nToken strings (subwords):")
for i, (token_id, token_str) in enumerate(zip(mistral_tokens, mistral_token_strings)):
    print(f"  {i}: '{token_str}' (ID: {token_id})")

print("\n" + "="*80)


## Part 3: Comparing Both Tokenizers

Let's put them side-by-side to see the differences clearly.


In [None]:
print("="*80)
print("SIDE-BY-SIDE COMPARISON")
print("="*80)
print(f"\nOriginal sentence: '{sentence}'")
print(f"\n{'Model':<20} {'# Tokens':<12} {'Vocab Size':<15}")
print("-"*80)
print(f"{'Llama 3.2 1B':<20} {len(llama_tokens):<12} {len(llama_tokenizer):,}")
print(f"{'Mistral 7B':<20} {len(mistral_tokens):<12} {len(mistral_tokenizer):,}")

print("\n" + "-"*80)
print("TOKEN BREAKDOWN (side-by-side)")
print("-"*80)

# Show tokens side by side
max_len = max(len(llama_token_strings), len(mistral_token_strings))
print(f"\n{'Position':<10} {'Llama 3.2 Token (ID)':<40} {'Mistral 7B Token (ID)':<40}")
print("-"*100)

for i in range(max_len):
    if i < len(llama_token_strings):
        llama_token = f"'{llama_token_strings[i]}' ({llama_tokens[i]})"
    else:
        llama_token = "-"
    
    if i < len(mistral_token_strings):
        mistral_token = f"'{mistral_token_strings[i]}' ({mistral_tokens[i]})"
    else:
        mistral_token = "-"
    
    print(f"{i:<10} {llama_token:<40} {mistral_token:<40}")

print("\n" + "="*80)


## 🔑 Key Takeaways

From this comparison, we can see:

1. **Different token counts**: The same sentence produces a different number of tokens
2. **Different token IDs**: Even when tokens look similar, they have different IDs
3. **Different splitting strategies**: Models break words differently
4. **Different vocabulary sizes**: Each model has its own vocabulary

**Why does this matter?**
- You **cannot** feed token IDs produced by one model's tokenizer into a different model
- Each model needs its own tokenizer
- Tokenization affects model performance and behavior
- More tokens = more computation during inference
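
To see why token IDs are not portable, consider this minimal sketch. The two ID-to-string tables are made up for illustration (real vocabularies hold tens of thousands of entries); the point is only that the same integer maps to different text in different vocabularies.

```python
# Two made-up ID-to-string tables (hypothetical, for illustration only)
vocab_model_a = {0: "The", 1: " model", 2: " improved", 3: "."}
vocab_model_b = {0: ".", 1: " banana", 2: "The", 3: " sky"}


def decode(ids, id_to_string):
    """Concatenate the string stored for each token ID."""
    return "".join(id_to_string[i] for i in ids)


ids_from_a = [0, 1, 2, 3]  # produced by "model A's" tokenizer

right = decode(ids_from_a, vocab_model_a)
wrong = decode(ids_from_a, vocab_model_b)

print(right)  # The model improved.
print(wrong)  # . bananaThe sky
```

A model's embedding table is indexed by these same IDs, so feeding it IDs from another tokenizer amounts to feeding it the scrambled text.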


---

## Part 4: Special Tokens

Tokenizers also include **special tokens** that have specific meanings:
- **BOS (Beginning of Sequence)**: Marks the start of text
- **EOS (End of Sequence)**: Marks the end of text
- **PAD (Padding)**: Used to make sequences the same length
- **UNK (Unknown)**: For tokens not in vocabulary

Let's see what special tokens each tokenizer uses.


In [None]:
print("="*80)
print("LLAMA 3.2 1B SPECIAL TOKENS")
print("="*80)

# Get all special tokens
special_tokens = {
    "BOS token": (llama_tokenizer.bos_token, llama_tokenizer.bos_token_id),
    "EOS token": (llama_tokenizer.eos_token, llama_tokenizer.eos_token_id),
    "PAD token": (llama_tokenizer.pad_token, llama_tokenizer.pad_token_id),
    "UNK token": (llama_tokenizer.unk_token, llama_tokenizer.unk_token_id),
}

print(f"\n{'Token Type':<20} {'Token':<20} {'Token ID':<15}")
print("-"*80)
for token_type, (token, token_id) in special_tokens.items():
    token_display = f"'{token}'" if token is not None else "None"
    token_id_display = str(token_id) if token_id is not None else "None"
    print(f"{token_type:<20} {token_display:<20} {token_id_display:<15}")

# Show all special tokens dict
print("\n" + "-"*80)
print("All special tokens:")
print(llama_tokenizer.special_tokens_map)
print("="*80)


In [None]:
print("="*80)
print("MISTRAL 7B SPECIAL TOKENS")
print("="*80)

# Get all special tokens
special_tokens = {
    "BOS token": (mistral_tokenizer.bos_token, mistral_tokenizer.bos_token_id),
    "EOS token": (mistral_tokenizer.eos_token, mistral_tokenizer.eos_token_id),
    "PAD token": (mistral_tokenizer.pad_token, mistral_tokenizer.pad_token_id),
    "UNK token": (mistral_tokenizer.unk_token, mistral_tokenizer.unk_token_id),
}

print(f"\n{'Token Type':<20} {'Token':<20} {'Token ID':<15}")
print("-"*80)
for token_type, (token, token_id) in special_tokens.items():
    token_display = f"'{token}'" if token is not None else "None"
    token_id_display = str(token_id) if token_id is not None else "None"
    print(f"{token_type:<20} {token_display:<20} {token_id_display:<15}")

# Show all special tokens dict
print("\n" + "-"*80)
print("All special tokens:")
print(mistral_tokenizer.special_tokens_map)
print("="*80)


### Comparison of Special Tokens


In [None]:
print("="*80)
print("SPECIAL TOKENS COMPARISON")
print("="*80)

print(f"\n{'Token Type':<20} {'Llama 3.2 1B':<30} {'Mistral 7B':<30}")
print("-"*80)

token_types = ["BOS token", "EOS token", "PAD token", "UNK token"]
# Use new names here so the token ID lists from Parts 1-3 are not overwritten
llama_special = [
    (llama_tokenizer.bos_token, llama_tokenizer.bos_token_id),
    (llama_tokenizer.eos_token, llama_tokenizer.eos_token_id),
    (llama_tokenizer.pad_token, llama_tokenizer.pad_token_id),
    (llama_tokenizer.unk_token, llama_tokenizer.unk_token_id),
]
mistral_special = [
    (mistral_tokenizer.bos_token, mistral_tokenizer.bos_token_id),
    (mistral_tokenizer.eos_token, mistral_tokenizer.eos_token_id),
    (mistral_tokenizer.pad_token, mistral_tokenizer.pad_token_id),
    (mistral_tokenizer.unk_token, mistral_tokenizer.unk_token_id),
]

for token_type, (llama_tok, llama_id), (mistral_tok, mistral_id) in zip(token_types, llama_special, mistral_special):
    llama_display = f"'{llama_tok}' (ID: {llama_id})" if llama_tok is not None else "None"
    mistral_display = f"'{mistral_tok}' (ID: {mistral_id})" if mistral_tok is not None else "None"
    print(f"{token_type:<20} {llama_display:<30} {mistral_display:<30}")

print("="*80)


### Observations on Special Tokens

Key points about special tokens:

1. **Different representations**: Even though both models have BOS/EOS tokens, they use different strings and IDs
2. **Not all tokens are present**: Some models may not have certain special tokens (e.g., PAD or UNK)
3. **Critical for training**: These tokens help the model understand:
   - Where text begins and ends
   - How to handle padding in batches
   - What to do with unknown/rare words

**Important:** When preparing data for training or inference, you must use the correct special tokens for your specific model!
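
As a rough illustration of how BOS, EOS, and PAD are used when batching, here is a minimal pure-Python sketch. The helper `pad_batch` and the ID values are hypothetical; in practice the tokenizer performs this step for you, and the real IDs come from the tokenizer attributes printed above.

```python
def pad_batch(sequences, pad_id, bos_id=None, eos_id=None):
    """Wrap each sequence with optional BOS/EOS IDs, then right-pad to equal length."""
    wrapped = []
    for seq in sequences:
        ids = list(seq)
        if bos_id is not None:
            ids = [bos_id] + ids
        if eos_id is not None:
            ids = ids + [eos_id]
        wrapped.append(ids)

    max_len = max(len(ids) for ids in wrapped)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in wrapped]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in wrapped]
    return input_ids, attention_mask


# Hypothetical IDs chosen for illustration only
BOS_ID, EOS_ID, PAD_ID = 1, 2, 0

input_ids, attention_mask = pad_batch([[11, 12, 13], [21]], PAD_ID, BOS_ID, EOS_ID)
print(input_ids)       # [[1, 11, 12, 13, 2], [1, 21, 2, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

When a tokenizer reports `None` for its PAD token, a commonly used workaround is to reuse the EOS token for padding, relying on the attention mask to hide the padded positions.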


---

## Part 5: Base vs Instruct Model Special Tokens

Even within the **same model family**, base and instruction-tuned versions can have different special tokens!

Let's compare Llama 3.2 1B base vs Llama 3.2 1B Instruct.


In [None]:
# Load Llama 3.2 1B Instruct tokenizer
llama_instruct_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

print(f"Llama 3.2 1B Instruct Tokenizer loaded!")
print(f"Vocabulary size: {len(llama_instruct_tokenizer):,} tokens")


In [None]:
print("="*80)
print("LLAMA 3.2 1B BASE - SPECIAL TOKENS")
print("="*80)

print(f"\n{'Token Type':<20} {'Token':<30} {'Token ID':<15}")
print("-"*80)

base_tokens = [
    ("BOS token", llama_tokenizer.bos_token, llama_tokenizer.bos_token_id),
    ("EOS token", llama_tokenizer.eos_token, llama_tokenizer.eos_token_id),
    ("PAD token", llama_tokenizer.pad_token, llama_tokenizer.pad_token_id),
    ("UNK token", llama_tokenizer.unk_token, llama_tokenizer.unk_token_id),
]

for token_type, token, token_id in base_tokens:
    token_display = f"'{token}'" if token is not None else "None"
    token_id_display = str(token_id) if token_id is not None else "None"
    print(f"{token_type:<20} {token_display:<30} {token_id_display:<15}")

print("\n" + "-"*80)
print("All special tokens:")
print(llama_tokenizer.special_tokens_map)
print("="*80)


In [None]:
print("="*80)
print("LLAMA 3.2 1B INSTRUCT - SPECIAL TOKENS")
print("="*80)

print(f"\n{'Token Type':<20} {'Token':<30} {'Token ID':<15}")
print("-"*80)

instruct_tokens = [
    ("BOS token", llama_instruct_tokenizer.bos_token, llama_instruct_tokenizer.bos_token_id),
    ("EOS token", llama_instruct_tokenizer.eos_token, llama_instruct_tokenizer.eos_token_id),
    ("PAD token", llama_instruct_tokenizer.pad_token, llama_instruct_tokenizer.pad_token_id),
    ("UNK token", llama_instruct_tokenizer.unk_token, llama_instruct_tokenizer.unk_token_id),
]

for token_type, token, token_id in instruct_tokens:
    token_display = f"'{token}'" if token is not None else "None"
    token_id_display = str(token_id) if token_id is not None else "None"
    print(f"{token_type:<20} {token_display:<30} {token_id_display:<15}")

print("\n" + "-"*80)
print("All special tokens:")
print(llama_instruct_tokenizer.special_tokens_map)
print("="*80)


### Side-by-Side Comparison: Base vs Instruct


In [None]:
print("="*90)
print("BASE vs INSTRUCT SPECIAL TOKENS COMPARISON")
print("="*90)

print(f"\n{'Token Type':<20} {'Base Model':<35} {'Instruct Model':<35}")
print("-"*90)

token_types = ["BOS token", "EOS token", "PAD token", "UNK token"]
base_tokens = [
    (llama_tokenizer.bos_token, llama_tokenizer.bos_token_id),
    (llama_tokenizer.eos_token, llama_tokenizer.eos_token_id),
    (llama_tokenizer.pad_token, llama_tokenizer.pad_token_id),
    (llama_tokenizer.unk_token, llama_tokenizer.unk_token_id),
]
instruct_tokens = [
    (llama_instruct_tokenizer.bos_token, llama_instruct_tokenizer.bos_token_id),
    (llama_instruct_tokenizer.eos_token, llama_instruct_tokenizer.eos_token_id),
    (llama_instruct_tokenizer.pad_token, llama_instruct_tokenizer.pad_token_id),
    (llama_instruct_tokenizer.unk_token, llama_instruct_tokenizer.unk_token_id),
]

for token_type, (base_tok, base_id), (inst_tok, inst_id) in zip(token_types, base_tokens, instruct_tokens):
    base_display = f"'{base_tok}' (ID: {base_id})" if base_tok is not None else "None"
    inst_display = f"'{inst_tok}' (ID: {inst_id})" if inst_tok is not None else "None"
    
    # Add marker if they're different
    if base_tok != inst_tok or base_id != inst_id:
        marker = " ⚠️ DIFFERENT"
    else:
        marker = " ✓ Same"
    
    print(f"{token_type:<20} {base_display:<35} {inst_display:<35}{marker}")

print("="*90)


---

## Part 6: The Chat Template - How Conversations Are Formatted

The instruct model's tokenizer defines a **`chat_template`** attribute. A base model's tokenizer may leave it unset, so the next cell checks both.

This is a **Jinja2 template** that defines exactly how to format conversations (user/assistant/system messages) into the format the model expects.
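
To give a feel for what such a template does, here is a plain-Python approximation of a Llama-3-style layout. The function `format_chat` is hypothetical and heavily simplified: the real Jinja2 template also handles details such as system prompts and tool definitions, so treat this as a sketch of the idea rather than a replacement for `apply_chat_template()`.

```python
def format_chat(messages, add_generation_prompt=False):
    """Render messages in a simplified Llama-3-style layout (an approximation)."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


demo = [{"role": "user", "content": "What is the capital of France?"}]
rendered = format_chat(demo, add_generation_prompt=True)
print(rendered)
```

The `add_generation_prompt` flag mirrors the parameter of the same name in `apply_chat_template()`: when set, the output ends with an open assistant header, which signals that the next tokens should be the assistant's reply.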


In [None]:
# Check if chat_template exists
print("="*80)
print("CHAT TEMPLATE CHECK")
print("="*80)

print(f"\nBase Model has chat_template: {hasattr(llama_tokenizer, 'chat_template') and llama_tokenizer.chat_template is not None}")
print(f"Instruct Model has chat_template: {hasattr(llama_instruct_tokenizer, 'chat_template') and llama_instruct_tokenizer.chat_template is not None}")

print("="*80)


### The Instruct Model's Chat Template

Let's look at the actual chat template used by the Llama 3.2 Instruct model:


In [None]:
print("="*80)
print("INSTRUCT MODEL CHAT TEMPLATE")
print("="*80)

if llama_instruct_tokenizer.chat_template:
    print("\nThe chat template is a Jinja2 template that formats conversations:")
    print("\n" + "-"*80)
    print(llama_instruct_tokenizer.chat_template)
    print("-"*80)
else:
    print("\nNo chat template found")

print("="*80)


### Using the Chat Template

Let's see how the chat template automatically formats a conversation:


In [None]:
# Create a simple conversation
conversation = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What about Germany?"},
]

print("="*80)
print("EXAMPLE: FORMATTING A CONVERSATION")
print("="*80)

print("\nOriginal conversation (Python list of dicts):")
print("-"*80)
for message in conversation:
    print(f"{message['role']}: {message['content']}")

print("\n" + "="*80)
print("FORMATTED WITH CHAT TEMPLATE")
print("="*80)

# Apply chat template
formatted_text = llama_instruct_tokenizer.apply_chat_template(
    conversation, 
    tokenize=False,  # Get string, not tokens
    add_generation_prompt=False
)

token_ids = llama_instruct_tokenizer.apply_chat_template(
    conversation, 
    tokenize=True,  # Get token IDs, not a string
    add_generation_prompt=False
)

print("\n" + formatted_text)
print("\n" + "="*80)


print("="*80)
print("TOKEN IDS")
print("="*80)
print(token_ids)

---

## Part 7: Tiktoken - OpenAI's Fast Tokenizer

**Tiktoken** is OpenAI's tokenizer library - it's extremely fast and used for GPT models (GPT-3.5, GPT-4, etc.).

Key features:
- 🚀 **Very fast** - written in Rust with Python bindings
- 🎯 **Simple API** - easy to use
- 📦 **Lightweight** - doesn't require downloading large model files
- 🔢 **Multiple encodings** - supports different GPT model encodings

In [None]:
import tiktoken

# Example: Use the encoding for GPT-4
encoding = tiktoken.encoding_for_model("gpt-4")

# Tokenize text
text = "What is the capital of France?"

tokens = encoding.encode(text)
token_strings = [encoding.decode([t]) for t in tokens]

print("="*80)
print("TIKTOKEN EXAMPLE (GPT-4 encoding)")
print("="*80)
print(f"\nText: '{text}'")
print(f"\nNumber of tokens: {len(tokens)}")
print(f"\nToken IDs: {tokens}")
print(f"\nToken strings:")
for i, (token_id, token_str) in enumerate(zip(tokens, token_strings)):
    print(f"  {i}: '{token_str}' (ID: {token_id})")

print("\n" + "="*80)
print("💡 Tiktoken is perfect for:")
print("   - Counting tokens for OpenAI API calls")
print("   - Estimating costs (OpenAI charges by token)")
print("   - Fast tokenization without loading full models")
print("="*80)


### Common Tiktoken Encodings

Different OpenAI models use different encodings:

| Encoding | Models | Vocab Size |
|----------|--------|------------|
| `cl100k_base` | GPT-4, GPT-3.5-turbo, text-embedding-ada-002 | ~100K tokens |
| `p50k_base` | Codex models, text-davinci-002, text-davinci-003 | ~50K tokens |
| `r50k_base` | GPT-3 models (davinci, curie, babbage, ada) | ~50K tokens |

**Quick reference:**
```python
# For GPT-4 / GPT-3.5-turbo
encoding = tiktoken.get_encoding("cl100k_base")

# Or get encoding by model name
encoding = tiktoken.encoding_for_model("gpt-4")
```

**Why use Tiktoken?**
- No need to load full model weights
- Fast: the tiktoken README reports roughly 3-6x higher throughput than a comparable open-source tokenizer
- Perfect for token counting and cost estimation


---

## 🎓 Conclusion: Key Takeaways on Tokenization

Throughout this notebook, we've explored the fundamental concepts of tokenization in large language models. Here's what you should remember:

### 1️⃣ **Different Models = Different Tokenization**
- The **same text** produces **different tokens** in different models
- Llama 3.2 and Mistral tokenize "The hyperparameter-tuning process improved model generalization." differently
- You **cannot** interchange token IDs between models
- Each model has its own vocabulary and tokenization strategy

### 2️⃣ **Special Tokens Matter**
- Most models define special tokens such as BOS, EOS, PAD, and UNK, though not every tokenizer sets all of them
- These tokens have **different strings and IDs** across models
- Special tokens tell the model where text begins, ends, or needs padding
- Using the wrong special tokens can badly degrade the model's output

### 3️⃣ **Base vs Instruct: More Than Just Training**
- **Base models**: For text completion, no chat structure
- **Instruct models**: For conversations, with special formatting
- Instruct models rely on extra tokens for chat formatting: `<|start_header_id|>`, `<|eot_id|>`, etc.
- The **chat template** is what makes instruct models conversation-aware

### 4️⃣ **Chat Templates Are Essential**
- The `chat_template` automatically formats conversations
- It's a Jinja2 template that inserts special tokens in the right places
- **Without it**: Your conversation structure is lost
- **With it**: The model understands user/assistant/system roles

### 5️⃣ **Practical Implications**

**When building LLM applications:**
- ✅ Always use the **correct tokenizer** for your model
- ✅ Use the **instruct version** for chat applications
- ✅ Use **`apply_chat_template()`** for conversations
- ✅ Check the model's **special tokens** before training
- ❌ Don't mix tokenizers between models
- ❌ Don't skip the chat template for instruct models

### 🎯 **Why This Matters**

Tokenization is the **first step** in the LLM pipeline:
```
Text → Tokenization → Model Processing → Output
```

If you get tokenization wrong:
- The model won't understand your input
- Special tokens will be misaligned
- Chat structure will be lost
- Performance will suffer

**Get tokenization right, and everything else follows!**

---

### 📚 What's Next?

Now that you understand tokenization, you're ready to:
- Learn about model architectures
- Understand attention mechanisms
- Fine-tune models for your specific tasks
- Build robust LLM applications

**Remember:** Every successful LLM application starts with proper tokenization! 🚀
