# 🧠 MiniGPT Part 1: How GPT Reads Your Words

**Understanding Tokenization - The Foundation of Large Language Models**

Welcome to the first part of our MiniGPT series! In this notebook, you'll learn how GPT models "read" text by breaking it down into tokens - the fundamental building blocks that make modern AI possible.

## 🎯 What You'll Learn
- **What tokenization is** and why it's crucial for LLMs
- **How different tokenizers work** (character-level, word-level, subword)
- **The famous "strawberry problem"** and why GPT struggles to count letters
- **Byte-Pair Encoding (BPE)** - the algorithm used by GPT models
- **Hands-on implementation** of your own tokenizers

## 🚀 Quick Start

**Option 1: Run in Google Colab (Recommended)**
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/naresh-sharma/mini-gpt/blob/main/notebooks/part1_tokenization.ipynb)

**Option 2: Run Locally**
```bash
# Navigate to the project root directory (where setup.py is located)
cd /path/to/mini-gpt

# Install MiniGPT
pip install -e .

# Start Jupyter
jupyter notebook
```

**Important:** After Jupyter opens in your browser:
1. Navigate to the `notebooks/` folder in the Jupyter file browser
2. Open `part1_tokenization.ipynb`

---

## 📚 The Big Picture
Before we dive into code, let's understand why tokenization matters:

> **"GPT doesn't see words. It sees tokens."**

This simple statement explains so much about how modern AI works. Let's explore what this means!


In [None]:
# Install MiniGPT if running in Colab
try:
    import mini_gpt  # noqa: F401

    print("✅ MiniGPT already installed!")
except ImportError:
    print("📦 Installing MiniGPT...")
    %pip install -q git+https://github.com/naresh-sharma/mini-gpt.git
    print("✅ Installation complete!")

# Install additional dependencies for visualization
%pip install -q matplotlib

# Import our tokenizers and utilities
from mini_gpt import BPETokenizer, SimpleTokenizer
from mini_gpt.utils import (
    analyze_text_efficiency,
    compare_tokenizers,
    demonstrate_strawberry_problem,
)

print("🎉 Ready to explore tokenization!")

✅ MiniGPT already installed!
🎉 Ready to explore tokenization!


## 🔍 What is Tokenization?

Tokenization is the process of breaking text into smaller pieces called **tokens**. Think of it as cutting a sentence into puzzle pieces that a computer can understand.

### Why Do We Need Tokenization?

1. **Computers don't understand text** - they only understand numbers
2. **We need to convert text to numbers** - each token gets a unique ID
3. **There are different ways to split text** - character-level, word-level, or subword-level

Let's see this in action!
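
To make the three granularities concrete, here is a minimal sketch in plain Python of the first two (character-level and word-level). The sample sentence and variable names are made up for illustration and are not part of `mini_gpt`.

```python
text = "Hello world! Hello there!"

# Character-level: every distinct character becomes a token with its own ID.
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in text]

# Word-level: split on whitespace; every distinct word becomes a token.
words = text.split()
word_vocab = {w: i for i, w in enumerate(sorted(set(words)))}
word_ids = [word_vocab[w] for w in words]

print(f"Character-level: {len(char_ids)} tokens, vocab size {len(char_vocab)}")
print(f"Word-level:      {len(word_ids)} tokens, vocab size {len(word_vocab)}")
```

Character-level tokenization keeps the vocabulary tiny but produces long sequences, while word-level tokenization produces short sequences but needs a new vocabulary entry for every new word. Subword tokenization sits between the two.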


In [None]:
# Let's start with a simple example
text = "Hello world!"

print("📝 Original text:", text)
print("📊 Character count:", len(text))
print("🔤 Characters:", list(text))

# Create a simple vocabulary
vocab = {
    "Hello": 1,
    " world": 2,  # Note the space at the beginning
    "!": 3,
    "<UNK>": 0,  # Unknown token
}

print("\n📚 Vocabulary:", vocab)

# Create our tokenizer
tokenizer = SimpleTokenizer(vocab)

# Tokenize the text
tokens = tokenizer.encode(text)
print(f"\n🎯 Tokenized: {tokens}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"🔄 Decoded: '{decoded}'")

📝 Original text: Hello world!
📊 Character count: 12
🔤 Characters: ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!']

📚 Vocabulary: {'Hello': 1, ' world': 2, '!': 3, '<UNK>': 0}

🎯 Tokenized: [1, 2, 3]
🔄 Decoded: 'Hello world!'
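
One way a vocabulary-based tokenizer like this could work is greedy longest-match: at each position, take the longest vocabulary entry that matches, and fall back to `<UNK>` for a single character when nothing matches. The sketch below is an assumption for illustration, not the actual `SimpleTokenizer` implementation; `greedy_encode` and `toy_vocab` are hypothetical names.

```python
def greedy_encode(text, vocab, unk_id=0):
    """Encode text by repeatedly taking the longest vocabulary entry that matches."""
    max_len = max(len(tok) for tok in vocab)
    ids, i = [], 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i : i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            # No entry matches here: emit <UNK> for one character and move on.
            ids.append(unk_id)
            i += 1
    return ids


toy_vocab = {"Hello": 1, " world": 2, "!": 3, "<UNK>": 0}
print(greedy_encode("Hello world!", toy_vocab))
print(greedy_encode("strawberry", toy_vocab))
```

With this toy vocabulary, any text outside the three known pieces collapses into a run of `<UNK>` IDs, one per character.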


## 🍓 The Famous Strawberry Problem

Now let's explore one of the most famous examples in AI: the "strawberry problem." This perfectly illustrates why tokenization matters and why GPT models struggle with certain tasks.


In [None]:
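# Let's demonstrate the strawberry problem
# Note: our toy vocabulary only knows "Hello", " world", and "!", so every
# character of "strawberry" falls back to <UNK> in the output below
demonstrate_strawberry_problem(tokenizer)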

THE STRAWBERRY PROBLEM
Why GPT can't count letters reliably...

When you ask GPT 'How many R's are in strawberry?',
it doesn't see individual letters. It sees tokens!

You see:  s-t-r-a-w-b-e-r-r-y (10 letters, 3 R's)
GPT sees: ['<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>'] (10 tokens)

GPT doesn't have direct access to the letters!
It only knows about tokens: ['<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>']

This is why GPT models:
  ❌ Struggle to count letters
  ❌ Can't reliably spell backwards
  ❌ Have difficulty with character-level tasks

Modern models (like GPT-4+) learned to work around this
through better reasoning, but tokenization still happens!
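
The gap between letters and tokens can be sketched in a few lines. The subword split and the IDs below are invented for illustration; a real GPT tokenizer may split "strawberry" differently.

```python
word = "strawberry"

# With direct access to characters, counting is trivial.
print(word.count("r"))

# Hypothetical subword split and IDs, invented for illustration only.
pieces = ["str", "aw", "berry"]
piece_ids = {"str": 101, "aw": 102, "berry": 103}
model_input = [piece_ids[p] for p in pieces]
print(model_input)

# The IDs themselves carry no spelling information. Recovering the letters
# needs a lookup from each ID back to its string.
id_to_piece = {i: p for p, i in piece_ids.items()}
r_count = sum(id_to_piece[i].count("r") for i in model_input)
print(r_count)
```

The model receives only the list of IDs, so any knowledge about which letters sit inside a token has to be learned indirectly from training data.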


In [None]:
# 🧩 Try it yourself! Experiment with BPE
# Try training the BPE tokenizer on your own text data

from mini_gpt.utils import visualize_tokens

# Your turn! Replace this with your own text
your_texts = [
    "Replace me with your own sentences",
    "Try different types of text",
    "See how BPE learns patterns from your data",
]

print("🎯 Training BPE on your custom text...")
custom_bpe = BPETokenizer(vocab_size=100)  # Small vocab size for quick training
custom_bpe.train(your_texts, verbose=True)

print("\n🧪 Now test it on new text:")
test_text = "Try this sentence with your trained tokenizer"
tokens = custom_bpe.encode(test_text)
print(f"Text: '{test_text}'")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")

# Visualize the tokenization
visualize_tokens(test_text, custom_bpe)

## 🔧 Building a Better Tokenizer: Byte-Pair Encoding

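The SimpleTokenizer we used above only knows the handful of tokens in its hand-written vocabulary, so everything else becomes `<UNK>`. That is too basic for real-world use. Let's build a more sophisticated tokenizer using **Byte-Pair Encoding (BPE)** - the same algorithm used by GPT models!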
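
The core BPE training loop is short: start from individual characters, count adjacent symbol pairs, merge the most frequent pair into a new symbol, and repeat. Here is a minimal sketch under simplifying assumptions (whitespace-split words, character symbols, ties broken by first occurrence). The function names are hypothetical, and `BPETokenizer` in `mini_gpt` may differ in its details; production GPT tokenizers also operate on bytes rather than characters.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings."""
    word_freqs = Counter(word for text in corpus for word in text.split())
    # Each word starts as a tuple of single characters.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


corpus = ["strawberry pie", "strawberry jam", "I love strawberries"]
merges, words = train_bpe(corpus, num_merges=8)
print(merges)
print(list(words))
```

Because "strawberr" appears three times in this tiny corpus, its character pairs are the most frequent, so the first merges gradually glue that prefix together into a single symbol.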


In [None]:
# Create a BPE tokenizer
bpe_tokenizer = BPETokenizer(vocab_size=100)

# Training data (in practice, this would be much larger)
training_texts = [
    "Hello world!",
    "Hello there!",
    "The quick brown fox",
    "strawberry pie",
    "strawberry jam",
    "I love strawberries",
]

print("🎓 Training BPE tokenizer...")
bpe_tokenizer.train(training_texts, verbose=True)

print("\n✅ Training complete!")
print(f"📚 Vocabulary size: {bpe_tokenizer.get_vocab_size()}")
print(f"🔗 Number of merges: {len(bpe_tokenizer.get_merges())}")

🎓 Training BPE tokenizer...
Training BPE tokenizer on 6 texts...
Target vocabulary size: 100
Found 13 unique words
Character-level vocab size: 33
Training complete! Final vocab size: 87
Number of merges: 54

✅ Training complete!
📚 Vocabulary size: 87
🔗 Number of merges: 54


In [1]:
# 🖼️ Visual Comparison: Token Counts per Text
import matplotlib.pyplot as plt

# Test texts for visualization
# (run the earlier cells first: this cell needs `tokenizer` and `bpe_tokenizer`)
texts = ["Hello world!", "strawberry", "Python programming"]
simple_counts = [len(tokenizer.encode(t)) for t in texts]
bpe_counts = [len(bpe_tokenizer.encode(t)) for t in texts]

# Create horizontal bar chart
fig, ax = plt.subplots(figsize=(10, 6))
y_pos = range(len(texts))

# Plot SimpleTokenizer bars
bars1 = ax.barh(y_pos, simple_counts, label="SimpleTokenizer", alpha=0.8, color="skyblue")

# Plot BPE bars (stacked on top of SimpleTokenizer)
bars2 = ax.barh(
    y_pos, bpe_counts, left=simple_counts, label="BPETokenizer", alpha=0.8, color="lightcoral"
)

# Customize the plot
ax.set_yticks(y_pos)
ax.set_yticklabels(texts)
ax.set_xlabel("Token Count")
ax.set_title("Tokenization Comparison: SimpleTokenizer vs BPETokenizer")
ax.legend()
ax.grid(axis="x", alpha=0.3)

# Add value labels on bars
for i, (simple, bpe) in enumerate(zip(simple_counts, bpe_counts)):
    ax.text(simple / 2, i, str(simple), ha="center", va="center", fontweight="bold")
    ax.text(simple + bpe / 2, i, str(bpe), ha="center", va="center", fontweight="bold")

plt.tight_layout()
plt.show()

print("📊 Key Insights:")
print("• Shorter bar segments = fewer tokens = more efficient tokenization")
print("• SimpleTokenizer only knows its hand-written vocabulary; other text falls back to <UNK>")
print("• BPE merges frequent character sequences, so text like its training data needs fewer tokens")
print("• The difference shows why tokenization strategy matters!")


In [None]:
# 🧩 Try it yourself! Test the strawberry problem
# Try different words and see how they get tokenized

# Your turn! Test these words or add your own
test_words = [
    "strawberry",  # The classic example
    "programming",  # Try a longer word
    "hello",  # Try a simple word
    "supercalifragilisticexpialidocious",  # Try a very long word
]

print("🔍 Testing different words with tokenization:")
print("=" * 60)

for word in test_words:
    print(f"\n📝 Word: '{word}'")
    print(f"   Letters: {len(word)}")

    # Test with SimpleTokenizer
    simple_tokens = tokenizer.encode(word)
    print(f"   SimpleTokenizer: {len(simple_tokens)} tokens")

    # Test with BPE
    bpe_tokens = bpe_tokenizer.encode(word)
    print(f"   BPETokenizer: {len(bpe_tokens)} tokens")

    # Show the actual tokens
    print(f"   Simple tokens: {simple_tokens}")
    print(f"   BPE tokens: {bpe_tokens}")

print(
    "\n💡 Key insight: When a word is packed into just a few tokens, its letters are hidden inside them, so GPT can't directly 'see' the individual letters!"
)

In [None]:
# Now let's compare our tokenizers!
test_text = "strawberry"

print("🔍 Comparing tokenization approaches:")
print("=" * 50)

# Simple tokenizer
simple_tokens = tokenizer.encode(test_text)
print(f"SimpleTokenizer: {simple_tokens}")

# BPE tokenizer
bpe_tokens = bpe_tokenizer.encode(test_text)
print(f"BPETokenizer:    {bpe_tokens}")

# Visualize the differences
print(f"\n📊 Tokenization comparison for '{test_text}':")
compare_tokenizers(test_text, [tokenizer, bpe_tokenizer], ["Simple", "BPE"])

## 📈 Efficiency Analysis

Let's analyze how efficiently our tokenizers compress text:
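
The metric used in this section is characters per token. A minimal sketch, assuming `analyze_text_efficiency` computes it as text length divided by token count (this formula matches the 4.00 chars/token reported for "Hello world!" with its 3 tokens, but the library's actual implementation is not shown here, and `chars_per_token` is a hypothetical helper):

```python
def chars_per_token(text, token_ids):
    """Average number of characters represented by each token."""
    if not token_ids:
        return 0.0  # avoid dividing by zero for empty input
    return len(text) / len(token_ids)


# "Hello world!" has 12 characters; the toy vocabulary encodes it as 3 tokens.
print(chars_per_token("Hello world!", [1, 2, 3]))  # 4.0
```

A higher value means each token covers more text, so the same input needs fewer tokens.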


In [None]:
# Analyze efficiency for different texts
test_texts = [
    "Hello world!",
    "strawberry",
    "The quick brown fox jumps over the lazy dog.",
    "I love programming in Python!",
]

print("📊 Tokenization Efficiency Analysis")
print("=" * 60)

for text in test_texts:
    print(f"\n📝 Text: '{text}'")
    print("-" * 40)

    # Simple tokenizer efficiency
    simple_metrics = analyze_text_efficiency(text, tokenizer)
    print(f"SimpleTokenizer: {simple_metrics['chars_per_token']:.2f} chars/token")

    # BPE tokenizer efficiency
    bpe_metrics = analyze_text_efficiency(text, bpe_tokenizer)
    print(f"BPETokenizer:    {bpe_metrics['chars_per_token']:.2f} chars/token")

    # Which is more efficient? More characters per token means fewer tokens.
    if bpe_metrics["chars_per_token"] > simple_metrics["chars_per_token"]:
        print("🏆 BPETokenizer is more efficient")
    else:
        print("🏆 SimpleTokenizer is more efficient")

📊 Tokenization Efficiency Analysis

📝 Text: 'Hello world!'
----------------------------------------
SimpleTokenizer: 4.00 chars/token
BPETokenizer:    6.00 chars/token
🏆 BPETokenizer is more efficient

📝 Text: 'strawberry'
----------------------------------------
SimpleTokenizer: 1.00 chars/token
BPETokenizer:    10.00 chars/token
🏆 BPETokenizer is more efficient

📝 Text: 'The quick brown fox jumps over the lazy dog.'
----------------------------------------
SimpleTokenizer: 1.00 chars/token
BPETokenizer:    1.76 chars/token
🏆 BPETokenizer is more efficient

📝 Text: 'I love programming in Python!'
----------------------------------------
SimpleTokenizer: 1.00 chars/token
BPETokenizer:    1.21 chars/token
🏆 BPETokenizer is more efficient


## 🎯 Key Takeaways

### What We've Learned

1. **Tokenization is fundamental** - Every LLM starts with tokenization
2. **Different approaches exist** - Character, word, and subword tokenization
3. **BPE is powerful** - It learns common patterns and creates efficient representations
4. **The strawberry problem** - Shows why GPT struggles with character-level tasks
5. **Efficiency matters** - Fewer tokens per text means more text fits in the model's context window and less computation per input

### Why This Matters for GPT

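- **GPT-3 uses byte-level BPE** with a vocabulary of about 50,000 tokens (50,257, the same vocabulary as GPT-2)
- **Each token** gets converted to a vector (embedding)
- **The model learns** relationships between these vectors
- **Better tokenization** = shorter token sequences, so more text fits into the model's context window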
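
As a small preview of the token-to-vector step, an embedding can be pictured as a lookup table with one row per token ID. The sizes and random values below are toy assumptions for illustration; in a real model the table is far larger and its values are learned during training.

```python
import random

random.seed(0)

vocab_size = 8  # toy value; GPT-scale vocabularies hold tens of thousands of tokens
embed_dim = 4   # toy value; real models use far more dimensions

# One row of numbers per token ID (random here, learned in a real model).
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embed_dim)] for _ in range(vocab_size)
]

token_ids = [1, 2, 3]  # e.g. the IDs produced for "Hello world!" earlier
vectors = [embedding_table[i] for i in token_ids]

for token_id, vector in zip(token_ids, vectors):
    print(token_id, [round(x, 2) for x in vector])
```

The lookup is just indexing: token ID 1 always maps to row 1 of the table, whatever text it came from.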

## 🚀 What's Next?

In **Part 2: Embeddings**, we'll learn how tokens become vectors that capture meaning!

- How do we convert token IDs to vectors?
- What are embeddings and why do they matter?
- How does GPT understand relationships between words?

**Ready to continue?** Check out the next notebook or explore the code further!

---

## 🔗 Additional Resources

- [Full MiniGPT Repository](https://github.com/naresh-sharma/mini-gpt)
- [Part 2: Embeddings Notebook](notebooks/part2_embeddings.ipynb) (Coming Soon)
- [Blog Post: How GPT Reads Your Words](your-blog-link) (Coming Soon)

**Happy tokenizing!** 🎉


In [None]:
# 🧩 Final Challenge: Your Own Tokenization Experiment
# This is your playground! Try anything you want to explore

from mini_gpt.utils import visualize_tokens

# Replace this with your own text
your_text = "Replace me with anything you want to explore!"

print("🎯 Your Tokenization Experiment")
print("=" * 50)
print(f"Text: '{your_text}'")
print()

# Test with both tokenizers
simple_tokens = tokenizer.encode(your_text)
bpe_tokens = bpe_tokenizer.encode(your_text)

print(f"SimpleTokenizer: {len(simple_tokens)} tokens")
print(f"BPETokenizer: {len(bpe_tokens)} tokens")
print()

# Visualize the tokenization
print("🔍 Tokenization breakdown:")

print("\nSimpleTokenizer:")
visualize_tokens(your_text, tokenizer)

print("\nBPETokenizer:")
visualize_tokens(your_text, bpe_tokenizer)

print("\n🎉 Experiment complete! Try different texts to see how tokenization changes!")

🎯 Your Tokenization Experiment
Text: 'Replace me with anything you want to explore!'

SimpleTokenizer: 45 tokens
BPETokenizer: 42 tokens

🔍 Tokenization breakdown:

SimpleTokenizer:
TOKENIZATION VISUALIZATION
Original: "Replace me with anything you want to explore!"
Tokens:   ['<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '!']
IDs:      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3]
Count:    45 tokens

BPETokenizer:
TOKENIZATION VISUALIZATION
Original: "Replace me with anything you want to explore!"
Tokens:   ['ID:32', 'ID:0', 'ID:12', 'ID:22', 'ID:18', 