# 🤖 Building Your Own ChatGPT: A Complete Guide

## Welcome, Future AI Engineers!

Have you ever wondered how ChatGPT works? In this notebook, we'll learn how to build our own ChatGPT-like AI from scratch using **nanochat**, a project by Andrej Karpathy (former Director of AI at Tesla and a founding member of OpenAI).

### What You'll Learn:
1. What are Large Language Models (LLMs)?
2. How does a chatbot understand and generate text?
3. The complete pipeline: from raw text to a working AI assistant
4. Hands-on experience with real AI code

### Prerequisites:
- Basic understanding of Python
- Curiosity about AI!
- No advanced math required - we'll explain everything

---

## 📚 Table of Contents
1. Introduction to Language Models
2. Understanding the Architecture
3. The Training Pipeline
4. Tokenization: Teaching AI to Read
5. Building the Transformer Model
6. Training Your Model
7. Making Your AI Chat
8. Evaluation and Testing
9. Next Steps

Let's begin!

---

# Part 1: What is a Language Model?

## 🧠 The Big Idea

Imagine you're writing a text message: "I'm going to the..."

Your brain can predict what comes next: "store", "park", "movies", etc. A **language model** does the same thing!

### Definition:
A **Large Language Model (LLM)** is a computer program that:
1. Reads a lot of text (books, websites, articles)
2. Learns patterns in how people write
3. Uses those patterns to predict what word comes next
4. Generates human-like text by predicting one word at a time

### How Big is "Large"?
- **nanochat (our version)**: ~500 million to 2 billion parameters
- **GPT-2** (2019): 1.5 billion parameters
- **GPT-3** (2020): 175 billion parameters
- **GPT-4** (2023): Not officially disclosed; unofficial estimates put it above 1 trillion parameters

A **parameter** is like a tiny piece of knowledge the model learns. More parameters = more knowledge capacity!

### The Magic Formula:
```
Text Input → Model → Predicted Next Word
```

By repeating this over and over, the model can write entire paragraphs, stories, or even code!
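
To make the "predict, then repeat" idea concrete, here is a minimal sketch of a next-word predictor that only counts which word follows which. The tiny corpus and the helper names (`train_bigram`, `predict_next`, `generate`) are made up for illustration; a real LLM learns far richer patterns with a neural network, but the generation loop has the same shape.

```python
from collections import Counter, defaultdict

# Tiny hypothetical training corpus (an assumption for illustration only)
corpus = "i am going to the store . i am going to the park . i am going home ."

def train_bigram(text):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word` (None if the word is unseen)."""
    followers = counts.get(word)
    return followers.most_common(1)[0][0] if followers else None

def generate(counts, start, max_words=6):
    """Repeat next-word prediction, feeding each output back in as input."""
    words = [start]
    for _ in range(max_words):
        next_word = predict_next(counts, words[-1])
        if next_word is None:
            break
        words.append(next_word)
    return " ".join(words)

model = train_bigram(corpus)
print(predict_next(model, "going"))
print(generate(model, "i"))
```

Try changing the corpus and the starting word to see how the "model" only ever repeats patterns it has counted.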

---

# Part 2: Understanding the Transformer Architecture

## 🏗️ The Building Blocks

ChatGPT and nanochat are built using something called a **Transformer**. Think of it as a sophisticated pattern-matching machine.

### Key Components:

#### 1. **Tokens** (The Words)
Before the model can read text, it breaks it into "tokens" - small pieces of text.
- Example: "Hello, world!" → ["Hello", ",", " world", "!"]

#### 2. **Embeddings** (The Meaning)
Each token gets converted into a list of numbers (a vector) that represents its meaning.
- Similar words have similar numbers
- Example: "cat" and "dog" have similar embeddings because they're both animals
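
One common way to measure "similar numbers" is cosine similarity. The sketch below uses hand-picked 4-dimensional vectors purely as an assumption for illustration; real models learn their embeddings during training and use hundreds of dimensions.

```python
import math

# Hypothetical embeddings, hand-picked for illustration (not learned values)
embeddings = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9, 0.8],
}

def cosine_similarity(a, b):
    """Return a score near 1.0 for vectors pointing the same way, near 0.0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(f"cat vs dog: {cat_dog:.2f}")
print(f"cat vs car: {cat_car:.2f}")
```

With these made-up vectors, "cat" scores much closer to "dog" than to "car", which is the behavior we hope learned embeddings show too.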

#### 3. **Attention Mechanism** (The Focus)
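This is the most important innovation! When processing a word, the model looks at the other words in the text to understand context. (In GPT-style models like nanochat, each word looks at itself and the words that came before it.)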

Example sentence: "The animal didn't cross the street because it was too tired."
- When processing "it", attention helps the model know "it" refers to "animal", not "street"
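
Here is a rough sketch of the core calculation, usually called scaled dot-product attention, in plain Python. The 2-dimensional query, key, and value vectors are hand-picked assumptions so that "it" lines up with "animal"; a real Transformer learns these vectors and runs many attention heads in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # Score each key by how well it matches the query
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical vectors, chosen so the query for "it" matches the key for "animal"
tokens = ["animal", "street", "it"]
keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_for_it = [3.0, 0.0]

weights, output = attention(query_for_it, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token:>7}: {weight:.2f}")
```

The largest weight lands on "animal", so the representation of "it" is mostly built from the "animal" value vector.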

#### 4. **Layers** (The Depth)
The model has many layers stacked on top of each other. Each layer:
- Looks at the text from a different perspective
- Extracts more complex patterns
- Builds deeper understanding

Our nanochat model has 20-34 layers!

#### 5. **Prediction Head** (The Output)
At the end, the model outputs probabilities for what word comes next:
- "store": 35% probability
- "park": 25% probability
- "movies": 20% probability
- etc.

### Visual Flow:
```
Input Text
    ↓
Tokenization (break into pieces)
    ↓
Embeddings (convert to numbers)
    ↓
Layer 1 (attention + processing)
    ↓
Layer 2 (attention + processing)
    ↓
    ...
    ↓
Layer 20 (attention + processing)
    ↓
Prediction (what comes next?)
    ↓
Output Text
```

---

# Part 3: The Complete Training Pipeline

## 🎓 From Zero to ChatGPT

Training an AI chatbot happens in several stages. Think of it like teaching a child:

### Stage 1: Learning to Read (Tokenization)
**What happens:** We teach the model how to break text into pieces (tokens).

**Why it matters:** Just like you learned the alphabet before reading, the model needs to learn its "alphabet" of text pieces.

**Example:**
- Input: "I love pizza!"
- Tokens: ["I", " love", " pizza", "!"]
- Token IDs: [314, 1842, 16462, 0] (illustrative; the actual IDs depend on the tokenizer's vocabulary)
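
The text-to-IDs round trip can be sketched with a tiny lookup table. The vocabulary and its IDs below are invented for illustration, and the longest-match rule is a simplification; a trained BPE tokenizer applies learned merge rules instead.

```python
# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of entries
vocab = {"I": 0, " love": 1, " pizza": 2, "!": 3, " code": 4}
id_to_token = {token_id: token for token, token_id in vocab.items()}

def encode(text):
    """Greedily match the longest vocabulary entry at each position."""
    ids = []
    position = 0
    while position < len(text):
        match = None
        for token in sorted(vocab, key=len, reverse=True):
            if text.startswith(token, position):
                match = token
                break
        if match is None:
            raise ValueError(f"No token matches the text at position {position}")
        ids.append(vocab[match])
        position += len(match)
    return ids

def decode(ids):
    """Map IDs back to their text pieces and join them."""
    return "".join(id_to_token[i] for i in ids)

ids = encode("I love pizza!")
print(ids)
print(decode(ids))
```

Decoding the IDs gives back exactly the original text, which is the property every tokenizer needs.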

---

### Stage 2: Base Training (Pretraining)
**What happens:** The model reads millions of documents and learns to predict the next word.

**Dataset:** FineWeb-EDU - a huge collection of high-quality, education-focused web pages
- Size: ~90 billion characters (for our speedrun version)
- That's like reading 45,000 novels!

**Training Process:**
1. Show the model text: "The sky is"
2. Model guesses: "green" (wrong!)
3. Tell the model the correct answer: "blue"
4. Model adjusts its internal parameters
5. Repeat billions of times!

**Result:** A model that can predict text, but doesn't know how to have conversations yet.

---

### Stage 3: Midtraining
**What happens:** We teach the model about conversations and special tasks.

**New Skills:**
- Understanding chat format (user says something, assistant responds)
- Learning to use tools (like a calculator)
- Answering multiple-choice questions
- Developing a personality

**Example Training Data:**
```
User: What is 2+2?
Assistant: 2+2 equals 4.
```

---

### Stage 4: Supervised Fine-tuning (SFT)
**What happens:** We show the model examples of really good conversations.

**Goal:** Make the model:
- More helpful
- More accurate
- Better at following instructions

**Example:**
```
User: Explain photosynthesis to a 5-year-old.
Assistant: Plants are like tiny chefs! They use sunlight as their energy 
to cook up food from air and water. The recipe makes food for the plant 
and releases oxygen that we breathe. Cool, right?
```

---

### Stage 5: Reinforcement Learning (Optional)
**What happens:** The model practices tasks and gets rewarded for correct answers.

**How it works:**
1. Give the model a math problem
2. Model generates an answer
3. Check if the answer is correct
4. Give a reward (like a gold star!) if correct
5. Model learns to maximize rewards

**Result:** Better performance on specific tasks like math problems!
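
Steps 2 to 4 above can be sketched as a simple reward function. The sample answers and the "last number in the text" rule are assumptions for illustration; the step that actually updates the model to favor high-reward answers is left out here.

```python
import re

def extract_final_number(answer_text):
    """Pull the last number out of a generated answer (None if there is none)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer_text)
    return float(numbers[-1]) if numbers else None

def reward(answer_text, correct_value):
    """Return 1.0 (a gold star) for a correct final answer, otherwise 0.0."""
    predicted = extract_final_number(answer_text)
    if predicted is not None and abs(predicted - correct_value) < 1e-9:
        return 1.0
    return 0.0

# Hypothetical sampled answers to: "If John has 5 apples and buys 3 more, how many does he have?"
samples = [
    "5 + 3 = 8, so John has 8 apples.",
    "John has 53 apples.",
    "He has eight apples.",
]
rewards = [reward(sample, 8) for sample in samples]
print(rewards)
print(f"Average reward: {sum(rewards) / len(rewards):.2f}")
```

Notice that the third answer is right in spirit but earns no reward, because this simple checker only understands digits. Designing a fair reward is part of the challenge.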

---

### The Timeline
- **Tokenization:** 5-10 minutes
- **Base Training:** 2-3 hours (the longest part!)
- **Midtraining:** 20-30 minutes
- **SFT:** 10-15 minutes
- **RL:** 10-15 minutes

**Total for speedrun version:** About 4 hours and $100 on a rented 8xH100 GPU node!

---

# Part 4: Let's Get Started - Setup

## 🚀 Your Local Environment

Now let's get hands-on! This notebook is configured to run on your local Ubuntu server with the GB10 Blackwell GPU.

### Your Hardware:
- **GPU**: GB10 Blackwell (128GB unified memory, shared between CPU and GPU)
- **Location**: 192.168.219.45
- **nanochat**: Pre-installed at `/var/www/gpt2/nanochat`

### What You Can Do:
- Explore the nanochat code and architecture
- Run training experiments with your powerful GPU
- Train full models (the speedrun's ~4 hours and ~$100 are quoted for a rented 8xH100 node; expect longer on a single GPU)
- Experiment with different configurations

Let's verify your environment is ready!

In [None]:
# Check your local environment
import sys
import os
import subprocess
import torch

print("🖥️  Environment Check")
print("="*60)

# Python version
print(f"\nPython: {sys.version}")

# Check PyTorch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")

    # Get GPU info
    try:
        gpu_info = subprocess.check_output(
            ['nvidia-smi', '--query-gpu=name,memory.total', '--format=csv,noheader'],
            text=True
        ).strip()
        print(f"\n🎮 GPU: {gpu_info}")

        # Check if it's the GB10 Blackwell
        for i in range(torch.cuda.device_count()):
            print(f"   Device {i}: {torch.cuda.get_device_name(i)}")
            total_memory = torch.cuda.get_device_properties(i).total_memory / (1024**3)
            print(f"   Memory: {total_memory:.1f} GB")
    except Exception as e:
        print(f"GPU info: {e}")

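# Check nanochat installation
nanochat_path = "/var/www/gpt2/nanochat"
nanochat_found = os.path.exists(nanochat_path)
if nanochat_found:
    print(f"\n✅ nanochat found at: {nanochat_path}")
    sys.path.insert(0, nanochat_path)
else:
    print(f"\n⚠️  nanochat not found at {nanochat_path}")

print("\n" + "="*60)
# Only report success when both the GPU and the nanochat checkout were found
if torch.cuda.is_available() and nanochat_found:
    print("✅ Environment ready for AI training!")
else:
    print("⚠️  Environment check finished with warnings (see above)")
print("="*60)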

In [None]:
# Navigate to nanochat directory
import os
import sys

nanochat_path = "/var/www/gpt2/nanochat"
print(f"📁 Using nanochat installation at: {nanochat_path}\n")

if os.path.exists(nanochat_path):
    os.chdir(nanochat_path)
    # Add to Python path
    if nanochat_path not in sys.path:
        sys.path.insert(0, nanochat_path)

    print("✅ Changed to nanochat directory")
    print(f"   Current directory: {os.getcwd()}\n")

    print("📂 Directory contents:")
    import subprocess
    result = subprocess.run(['ls', '-lh'], capture_output=True, text=True)
    print(result.stdout)
else:
    print(f"❌ nanochat not found at {nanochat_path}")
    print("   Please ensure nanochat is installed")

In [None]:
# Check nanochat dependencies
print("📦 Checking nanochat environment\n")
print("="*60)

import sys
import importlib.util

# nanochat uses a virtual environment
venv_path = "/var/www/gpt2/nanochat/.venv"
print(f"Virtual environment: {venv_path}\n")

# Key dependencies to check
dependencies = [
    ('torch', 'PyTorch'),
    ('tiktoken', 'Tokenization'),
    ('tqdm', 'Progress bars'),
    ('numpy', 'Numerical computing'),
]

print("Checking installed packages:\n")
all_installed = True

for module_name, description in dependencies:
    spec = importlib.util.find_spec(module_name)
    if spec is not None:
        module = importlib.import_module(module_name)
        version = getattr(module, '__version__', 'unknown')
        print(f"  ✅ {description:20s} ({module_name}): {version}")
    else:
        print(f"  ❌ {description:20s} ({module_name}): NOT FOUND")
        all_installed = False

print("\n" + "="*60)

if all_installed:
    print("✅ All dependencies are installed!")
    print("\nTo activate the nanochat environment in a terminal:")
    print("   cd /var/www/gpt2/nanochat")
    print("   source .venv/bin/activate")
else:
    print("⚠️  Some dependencies missing. Install with:")
    print("   cd /var/www/gpt2/nanochat")
    print("   source .venv/bin/activate")
    print("   pip install -e .")

---

# Part 5: Exploring the Code Structure

## 🗂️ Understanding the Project

Let's explore what files make up nanochat. Each file has a specific job!

In [None]:
# Let's look at the structure of the project
print("📁 Project Structure:\n")
print("="*60)

import os
from pathlib import Path

# Ensure we're in the nanochat directory
nanochat_path = Path("/var/www/gpt2/nanochat")
if nanochat_path.exists():
    os.chdir(nanochat_path)

def print_tree(directory, prefix="", max_depth=2, current_depth=0):
    """Print a nice tree structure of the project"""
    if current_depth >= max_depth:
        return

    try:
        entries = sorted(Path(directory).iterdir(), key=lambda x: (not x.is_dir(), x.name))
        entries = [e for e in entries if not e.name.startswith('.') and e.name not in ['__pycache__', 'uv.lock', '.venv', 'logs']]

        for i, entry in enumerate(entries):
            is_last = i == len(entries) - 1
            current_prefix = "└── " if is_last else "├── "
            print(f"{prefix}{current_prefix}{entry.name}")

            if entry.is_dir():
                extension = "    " if is_last else "│   "
                print_tree(entry, prefix + extension, max_depth, current_depth + 1)
    except PermissionError:
        pass

print_tree(nanochat_path)

print("\n" + "="*60)
print("\n📝 Key Directories:")
print("   nanochat/ - Core model code (the brain!)")
print("   scripts/  - Training and evaluation scripts")
print("   tasks/    - Test benchmarks (how smart is our AI?)")
print("   tests/    - Unit tests (making sure code works)")

---

# Part 6: Understanding Tokenization

## 🔤 How Computers Read Text

Remember: computers only understand numbers, not words! Tokenization is how we convert text to numbers.

### The Process:
1. **Byte Pair Encoding (BPE):** The algorithm used
2. Start with individual characters: ["H", "e", "l", "l", "o"]
3. Find common pairs: "ll" appears together often
4. Merge them: ["H", "e", "ll", "o"]
5. Repeat to build a vocabulary of 65,536 (2^16) tokens

### Why This Matters:
- Efficient: Common words = 1 token, rare words = multiple tokens
- Flexible: Can handle any text, even words the model has never seen
- Smart: Related words often share token pieces
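
The merge steps described above can be sketched in a few lines. This is a simplified, character-level version with made-up helper names (`most_common_pair`, `merge_pair`, `train_bpe`); production tokenizers typically work on bytes and split text into chunks first, so treat it as an illustration of the idea, not nanochat's tokenizer code.

```python
from collections import Counter

def most_common_pair(tokens):
    """Find the adjacent pair of tokens that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(text, num_merges):
    """Start from single characters and apply `num_merges` merge steps."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair = most_common_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges

tokens, merges = train_bpe("hello hello hello", num_merges=4)
print(merges)
print(tokens)
```

After four merges the repeated word has collapsed into a single token, which is why common words end up costing just one token.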

Let's see tokenization in action!

In [None]:
# Let's explore tokenization with a simple example
print("🔤 Tokenization Demo\n")
print("="*60)

# For this demo, we'll use a simpler tokenizer that's already available
# The real nanochat trains its own tokenizer, but this shows the concept!

try:
    import tiktoken
    
    # Load GPT-2's tokenizer (similar to what nanochat does)
    enc = tiktoken.get_encoding("gpt2")
    
    # Test sentences
    test_sentences = [
        "Hello, world!",
        "Artificial intelligence is amazing!",
        "The quick brown fox jumps over the lazy dog.",
        "I love programming in Python!"
    ]
    
    for sentence in test_sentences:
        # Encode: text -> token IDs
        tokens = enc.encode(sentence)
        
        # Decode each token to see what it represents
        token_strings = [enc.decode([t]) for t in tokens]
        
        print(f"\n📝 Original text: \"{sentence}\"")
        print(f"   Number of tokens: {len(tokens)}")
        print(f"   Token IDs: {tokens}")
        print(f"   Token pieces: {token_strings}")
        
    print("\n" + "="*60)
    print("\n💡 Notice how:")
    print("   - Common words like 'the' are single tokens")
    print("   - Spaces are often attached to words")
    print("   - Punctuation can be separate tokens")
    print("   - The same text always gets the same token IDs")
    print("     (but 'Hello' and ' Hello' with a leading space are different tokens)")
    
except Exception as e:
    print(f"Note: Full tokenization demo requires additional setup ({e}).")
    print("Concept: Text is broken into small pieces and converted to numbers!")

### 🎓 Key Takeaway:

Every word you type gets converted to numbers before the AI can process it. The model learns to recognize patterns in these numbers, just like you learned to recognize patterns in letters to read words!

---

# Part 7: The GPT Model Architecture

## 🏗️ Building the Brain

Now let's look at the actual code that defines the model! This is where the magic happens.

### Key Concepts in the Code:

1. **Embeddings:** Convert token IDs to meaningful numbers
2. **Attention Layers:** Let tokens "talk" to each other
3. **Feed-Forward Networks:** Process the information
4. **Layer Normalization:** Keep numbers in a good range
5. **Residual Connections:** Help information flow through deep networks
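
Concepts 4 and 5 are easy to try by hand. The sketch below applies a basic layer normalization and a residual connection to one small vector; the `feed_forward` stand-in is an assumption (a real layer uses learned weights), and the exact normalization used in `gpt.py` may differ from this textbook version.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Rescale a vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

def feed_forward(vector):
    """Stand-in for a learned sub-layer: doubles each value, then applies ReLU."""
    return [max(0.0, 2.0 * x) for x in vector]

def residual_block(vector):
    """Normalize, transform, then add the original input back (the residual)."""
    transformed = feed_forward(layer_norm(vector))
    return [x + t for x, t in zip(vector, transformed)]

x = [100.0, 200.0, 300.0, 400.0]
normalized = layer_norm(x)
print([round(v, 2) for v in normalized])
print([round(v, 2) for v in residual_block(x)])
```

The normalized values sit in a small range around zero no matter how large the inputs were, and the residual addition means the original information is never lost, only adjusted.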

Let's peek at the actual model code!

In [None]:
# Let's look at key parts of the GPT model
print("🧠 GPT Model Structure\n")
print("="*60)

import os
from pathlib import Path

# Ensure we're in the right directory
nanochat_path = Path("/var/www/gpt2/nanochat")
gpt_file = nanochat_path / "nanochat" / "gpt.py"

if gpt_file.exists():
    # Read the GPT model file
    with open(gpt_file, 'r') as f:
        gpt_code = f.read()

    # Show the GPTConfig (the model's settings)
    print("\n📋 Model Configuration:\n")
    config_start = gpt_code.find('@dataclass\nclass GPTConfig:')
    if config_start != -1:
        config_end = gpt_code.find('\n\n', config_start)
        # find() returns -1 when no blank line follows; fall back to end of file
        print(gpt_code[config_start:config_end if config_end != -1 else None])
    else:
        print("GPTConfig class definition not found in the expected form in gpt.py")

    print("\n" + "="*60)
    print("\n💡 What these settings mean:")
    print("   sequence_len: How many tokens the model can look at once (1024)")
    print("   vocab_size: How many different tokens it knows (~50,000)")
    print("   n_layer: How many layers deep (12 for small, 20-34 for nanochat)")
    print("   n_head: Number of attention 'heads' - different perspectives (6)")
    print("   n_embd: Size of each embedding vector (768 numbers per token)")

    print("\n🎯 The bigger these numbers, the smarter (and slower) the model!")
else:
    print(f"⚠️  Could not find {gpt_file}")
    print("   Make sure nanochat is properly installed")

In [None]:
# Let's create a tiny example model to understand the structure
print("🔬 Creating a Mini-GPT Model\n")
print("="*60)

import sys
import os

# Add nanochat to path
nanochat_path = "/var/www/gpt2/nanochat"
if nanochat_path not in sys.path:
    sys.path.insert(0, nanochat_path)

try:
    import torch
    import torch.nn as nn

    # Create a simple version of the attention mechanism
    class SimpleAttention(nn.Module):
        """A simplified self-attention layer (single head, no learned weights)"""
        def __init__(self, embed_size):
            super().__init__()
            self.embed_size = embed_size

        def forward(self, x):
            # x shape: (batch, sequence_length, embed_size)
            print(f"   Input shape: {x.shape}")
            print(f"   (batch_size, sequence_length, embedding_dim)")
            # Score how much each token should pay attention to every other token
            scores = x @ x.transpose(-2, -1) / (self.embed_size ** 0.5)
            weights = torch.softmax(scores, dim=-1)
            # Each output token is a weighted mix of all token embeddings
            return weights @ x

    # Example
    print("\n📊 Example: Processing 'Hello world'\n")

    batch_size = 1  # Processing one sentence
    seq_length = 2  # "Hello" and "world" (2 tokens)
    embed_dim = 768  # Each token becomes 768 numbers

    # Simulate token embeddings (use CPU if GPU not available)
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    token_embeddings = torch.randn(batch_size, seq_length, embed_dim, device=device)

    print(f"We have {seq_length} tokens")
    print(f"Each token is represented by {embed_dim} numbers")
    print(f"Device: {device}")
    print(f"\nTensor shape: {token_embeddings.shape}")

    # Apply attention
    attention = SimpleAttention(embed_dim).to(device)
    output = attention(token_embeddings)

    print(f"\nAfter attention, shape: {output.shape}")
    print("\n💡 In real GPT, this goes through many attention layers!")

except Exception as e:
    print(f"Error: {e}")
    print(f"\nConcept: Each word becomes a list of 768 numbers,")
    print(f"and attention helps words understand their context!")

---

# Part 8: The Training Process

## 🎓 Teaching the AI

Training is where the model learns! Let's understand how.

### The Learning Loop:

```python
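import torch
from torch import nn

# Toy stand-ins (assumptions for illustration): a 10-token vocabulary and a tiny two-layer model
model = nn.Sequential(nn.Embedding(10, 16), nn.Linear(16, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
tokens = torch.randint(0, 10, (4, 9))  # one batch of 4 random token sequences
for step in range(100):  # real training repeats this billions of times
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # 1. feed text in; each target is the next token
    logits = model(inputs)  # 2. model predicts the next token
    loss = nn.functional.cross_entropy(logits.reshape(-1, 10), targets.reshape(-1))  # 3-4. compare and calculate error (loss)
    optimizer.zero_grad()
    loss.backward()  # 5. work out how each parameter should change...
    optimizer.step()  # ...and adjust the parameters to reduce the error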
```

### The Math Behind Learning:

1. **Forward Pass:** Input → Model → Prediction
2. **Loss Calculation:** How wrong is the prediction?
3. **Backward Pass:** Calculate what changes would improve the prediction
4. **Parameter Update:** Actually make those changes

Step 3 is called **Backpropagation**, and step 4 is **Gradient Descent**!
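
To see the four steps with actual numbers, here is a hand-worked sketch in plain Python. As a simplifying assumption, the "parameters" are just three scores (logits) for three candidate words, and the starting values and learning rate are made up; a real model has millions of parameters and computes the gradients automatically.

```python
import math

def softmax(logits):
    """Convert scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

words = ["blue", "green", "red"]
logits = [0.0, 0.5, 0.0]   # hypothetical starting scores: "green" is favored
target = 0                  # index of the correct word, "blue"
learning_rate = 1.0

losses = []
for step in range(5):
    probs = softmax(logits)                  # 1. forward pass
    loss = -math.log(probs[target])          # 2. cross-entropy loss
    losses.append(loss)
    # 3. Backward pass: for softmax + cross-entropy, the gradient for each
    #    score is (probability - 1) for the target and (probability) otherwise
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # 4. Parameter update: move each score a small step against its gradient
    logits = [z - learning_rate * g for z, g in zip(logits, grads)]
    print(f"step {step}: loss={loss:.3f}  P(blue)={probs[target]:.2f}")
```

Each pass through the loop lowers the loss and raises the probability of "blue", which is gradient descent in miniature.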

### What the Model is Learning:

- Grammar rules ("I am" not "I are")
- Facts ("Paris is the capital of France")
- Patterns (code syntax, story structure)
- Context ("bank" = river bank or money bank?)
- Reasoning (cause and effect)

Let's look at the training script!

In [None]:
# Let's examine the speedrun training script
print("🚀 The Speedrun Training Pipeline\n")
print("="*60)

from pathlib import Path

speedrun_script = Path("/var/www/gpt2/nanochat/speedrun.sh")

if speedrun_script.exists():
    with open(speedrun_script, 'r') as f:
        script = f.read()

    # Show first 20 lines of the script
    print("\n📄 speedrun.sh (first 20 lines):\n")
    lines = script.split('\n')[:20]
    for line in lines:
        print(f"   {line}")
    print("   ...")
else:
    print(f"⚠️  speedrun.sh not found at {speedrun_script}")

# Extract and explain key sections
print("\n" + "="*60)
print("\n📋 Training Pipeline Overview:\n")

stages = [
    ("TOKENIZER", "Training the tokenizer to break text into pieces"),
    ("Base model", "Learning to predict next words from web text"),
    ("Midtraining", "Learning conversation format and tools"),
    ("Supervised Finetuning", "Learning from high-quality examples"),
    ("Reinforcement Learning", "Getting better at specific tasks")
]

for i, (stage, description) in enumerate(stages, 1):
    print(f"\n{i}. {stage}")
    print(f"   → {description}")

print("\n" + "="*60)
print("\n⏱️  Timeline (reference figures for a rented 8xH100 node):")
print("   Total time: ~4 hours")
print("   Total cost: ~$100 in compute")
print("   (a single GPU will take longer)")
print("   Result: Your own ChatGPT!")

In [None]:
# Simulate a simplified training step
print("🔬 Simulating One Training Step\n")
print("="*60)

try:
    import torch
    import torch.nn.functional as F
    
    print("\nExample: Teaching the model that 'The sky is ___'\n")
    
    # Simulate model predictions (probabilities for next word)
    vocab_size = 5  # Simplified: only 5 possible words
    words = ["blue", "green", "red", "yellow", "purple"]
    
    # Initial prediction (random, before training)
    initial_logits = torch.randn(vocab_size)
    initial_probs = F.softmax(initial_logits, dim=0)
    
    print("❌ BEFORE TRAINING:")
    for word, prob in zip(words, initial_probs):
        bar = '█' * int(prob * 50)
        print(f"   {word:8s}: {bar} {prob*100:.1f}%")
    
    # The correct answer
    correct_answer = 0  # "blue"
    
    # After training (model learns "blue" is correct)
    trained_logits = torch.tensor([2.0, -1.0, -1.0, -0.5, -1.0])  # Favor "blue"
    trained_probs = F.softmax(trained_logits, dim=0)
    
    print("\n✅ AFTER TRAINING:")
    for word, prob in zip(words, trained_probs):
        bar = '█' * int(prob * 50)
        print(f"   {word:8s}: {bar} {prob*100:.1f}%")
    
    print("\n💡 See how training increased the probability of 'blue'!")
    print("   This happens billions of times with millions of examples.")
    
except Exception as e:
    print(f"Demo could not run: {e}")
    print("Concept: The model starts with random guesses and gradually")
    print("learns to predict the correct next word through training!")

---

# Part 9: Evaluation - How Smart is Our AI?

## 📊 Testing the Model

After training, we need to test if our model actually learned anything! nanochat uses several benchmarks:

### 1. **ARC (AI2 Reasoning Challenge)**
Science questions that require reasoning
- Example: "What happens when you heat ice?"
- Tests: Common sense, science knowledge

### 2. **GSM8K (Grade School Math)**
Math word problems
- Example: "If John has 5 apples and buys 3 more, how many does he have?"
- Tests: Math reasoning, step-by-step thinking

### 3. **MMLU (Massive Multitask Language Understanding)**
Multiple choice questions across many topics
- Topics: History, science, literature, law, etc.
- Tests: General knowledge

### 4. **HumanEval**
Python coding problems
- Example: "Write a function that checks if a number is prime"
- Tests: Programming ability

### 5. **CORE Score**
Tests the base model's language understanding before it learns to chat

### Example Results:
For the $100 speedrun model:
- ARC-Challenge: ~28% correct (vs 25% random guessing)
- GSM8K: ~4-7% correct (math is hard!)
- MMLU: ~31% correct (better than random!)
- HumanEval: ~8% correct (can write simple code)

Compare to GPT-4:
- ARC-Challenge: ~96%
- GSM8K: ~92%
- MMLU: ~86%
- HumanEval: ~67%

Our model is like a kindergartener - it knows some things, but makes a lot of mistakes!
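
Scores like these are simply "fraction of questions answered correctly", compared against what guessing would achieve. A minimal sketch, with an invented answer key and invented model outputs for eight 4-choice questions:

```python
def accuracy(predictions, answers):
    """Fraction of questions where the predicted choice matches the answer key."""
    correct = sum(1 for p, a in zip(predictions, answers) if p == a)
    return correct / len(answers)

# Hypothetical answer key and model outputs (made up for illustration)
answers     = ["A", "C", "B", "D", "A", "B", "C", "D"]
predictions = ["A", "C", "D", "D", "B", "A", "C", "A"]

num_choices = 4
model_accuracy = accuracy(predictions, answers)
random_baseline = 1 / num_choices  # guessing among 4 options is right 25% of the time

print(f"Model accuracy:  {model_accuracy:.1%}")
print(f"Random baseline: {random_baseline:.1%}")
print(f"Improvement over chance: {model_accuracy - random_baseline:+.1%}")
```

Always read a benchmark score next to its random baseline: 28% on a 4-choice test is only a little better than guessing.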

---

# Part 10: Making the Model Chat

## 💬 From Predictor to Chatbot

Now comes the cool part - turning our next-word predictor into a chatbot!

### The Secret: Clever Formatting

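We teach the model a special format (simplified here; nanochat's tokenizer defines its own special tokens, such as `<|user_start|>` and `<|assistant_start|>`):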
```
<|user|>Your question here<|end|>
<|assistant|>Model's response here<|end|>
```

### How It Works:
1. User types: "Why is the sky blue?"
2. We format it: `<|user|>Why is the sky blue?<|end|><|assistant|>`
3. Model predicts what comes next (the answer!)
4. Model generates: "The sky appears blue because..."
5. Model generates `<|end|>` when done
6. We show the user just the answer part
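
The formatting steps above can be sketched as two small helper functions. The token strings are the simplified ones used in this guide, and `build_prompt` and `extract_answer` are hypothetical names, not functions from nanochat.

```python
# Special tokens as written in this guide; the exact strings are illustrative,
# since a real tokenizer defines its own special tokens.
USER, ASSISTANT, END = "<|user|>", "<|assistant|>", "<|end|>"

def build_prompt(history, new_question):
    """Render past turns plus the new question, ending with the assistant tag."""
    parts = []
    for question, answer in history:
        parts.append(f"{USER}{question}{END}")
        parts.append(f"{ASSISTANT}{answer}{END}")
    parts.append(f"{USER}{new_question}{END}")
    parts.append(ASSISTANT)  # the model continues writing from here
    return "".join(parts)

def extract_answer(generated_text):
    """Keep only the text before the first end marker."""
    return generated_text.split(END, 1)[0].strip()

prompt = build_prompt([], "Why is the sky blue?")
print(prompt)

# Hypothetical raw model output, including the end marker and stray text after it
raw_output = "The sky appears blue because air scatters blue light.<|end|><|user|>"
print(extract_answer(raw_output))
```

Passing earlier turns in `history` is how a chatbot appears to "remember" the conversation: the whole exchange is simply re-sent as one long prompt.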

### Generation Strategies:

**1. Greedy Decoding:** Always pick the most likely word
- Pro: Consistent
- Con: Boring, repetitive

**2. Temperature Sampling:** Add some randomness
- Temperature = 0: Always pick most likely (greedy)
- Temperature = 1: Natural randomness
- Temperature = 2: Very creative (sometimes too random!)

**3. Top-K Sampling:** Only consider top K most likely words
- Prevents choosing unlikely/nonsense words
- Keeps generation sensible

**4. Top-P (Nucleus) Sampling:** Choose from smallest set of words whose probabilities add to P
- Adaptive: sometimes many words, sometimes few
- Most natural-sounding generation

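At the time of writing, nanochat's generation code samples with a temperature and an optional top-k cutoff rather than top-p; check the sampling code in your checkout to confirm.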
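
Top-P is the least obvious of the four strategies, so here is a minimal sketch in plain Python. The words and scores are made up, the function names are hypothetical, and the temperature must be greater than zero in this version.

```python
import math
import random

def softmax(logits):
    """Convert scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(words, probs, p=0.9):
    """Keep the most likely words until their probabilities add up to at least p."""
    ranked = sorted(zip(words, probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1 again
    return [(word, prob / cumulative) for word, prob in kept]

def sample_top_p(words, logits, p=0.9, temperature=1.0, rng=random):
    """Apply temperature, filter with top-p, then draw one word at random."""
    probs = softmax([z / temperature for z in logits])
    nucleus = top_p_filter(words, probs, p)
    choices, weights = zip(*nucleus)
    return rng.choices(choices, weights=weights, k=1)[0]

words = ["great", "good", "wonderful", "amazing", "nice", "ok", "fine"]
logits = [3.0, 2.5, 2.3, 2.0, 1.5, 1.0, 0.5]

nucleus = top_p_filter(words, softmax(logits), p=0.9)
print([word for word, _ in nucleus])
print(sample_top_p(words, logits, p=0.9, rng=random.Random(0)))
```

With these scores, the two least likely words are cut off before sampling, so the model can still vary its wording without picking from the unlikely tail.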

In [None]:
# Demonstrate different sampling strategies
print("🎲 Text Generation Strategies\n")
print("="*60)

try:
    import torch
    import torch.nn.functional as F
    
    # Simulate model predictions for next word
    words = ["great", "good", "wonderful", "amazing", "nice", "ok", "fine"]
    logits = torch.tensor([3.0, 2.5, 2.3, 2.0, 1.5, 1.0, 0.5])
    
    print("Context: 'Today was a ___'\n")
    print("Model's predictions:\n")
    
    probs = F.softmax(logits, dim=0)
    for word, prob in zip(words, probs):
        bar = '█' * int(prob * 50)
        print(f"  {word:10s}: {bar} {prob*100:.1f}%")
    
    print("\n" + "-"*60)
    
    # Strategy 1: Greedy
    print("\n1️⃣  GREEDY (always pick most likely):")
    greedy_choice = words[torch.argmax(logits)]
    print(f"   Result: 'Today was a {greedy_choice}'")
    print("   Same every time! ✓ Reliable ✗ Boring")
    
    # Strategy 2: Temperature = 0.5 (more focused)
    print("\n2️⃣  TEMPERATURE = 0.5 (more focused):")
    temp_logits = logits / 0.5
    temp_probs = F.softmax(temp_logits, dim=0)
    print(f"   'great': {temp_probs[0]*100:.1f}% (was {probs[0]*100:.1f}%)")
    print(f"   'ok': {temp_probs[-2]*100:.1f}% (was {probs[-2]*100:.1f}%)")
    print("   More confident! ✓ Decisive ✗ Less creative")
    
    # Strategy 3: Temperature = 2.0 (more creative)
    print("\n3️⃣  TEMPERATURE = 2.0 (more creative):")
    temp_logits = logits / 2.0
    temp_probs = F.softmax(temp_logits, dim=0)
    print(f"   'great': {temp_probs[0]*100:.1f}% (was {probs[0]*100:.1f}%)")
    print(f"   'ok': {temp_probs[-2]*100:.1f}% (was {probs[-2]*100:.1f}%)")
    print("   More random! ✓ Creative ✗ Less reliable")
    
    # Strategy 4: Top-K = 3
    print("\n4️⃣  TOP-K = 3 (only consider top 3):")
    top_k = 3
    top_k_probs, top_k_indices = torch.topk(probs, top_k)
    print(f"   Only choosing from: {[words[i] for i in top_k_indices]}")
    print("   ✓ Prevents nonsense ✓ Still varied")
    
    print("\n" + "="*60)
    print("\n💡 Top-P (nucleus) sampling is another common strategy:")
    print("   It picks from the smallest set of words that")
    print("   covers P% probability (usually P=0.9 or 90%)")
    print("   This gives natural, coherent text!")
    
except Exception as e:
    print(f"Demo could not run: {e}")
    print("Concept: Different sampling strategies create different text styles!")
    print("Greedy = boring but safe, High temperature = creative but risky")

---

# Part 11: Understanding the Full Architecture

## 🏛️ Putting It All Together

Let's review the complete picture of how everything connects:

### The Complete Flow:

```
USER INPUT
    ↓
"Why is the sky blue?"
    ↓
TOKENIZATION
    ↓
["Why", " is", " the", " sky", " blue", "?"]
    ↓
TOKEN IDs
    ↓
[5195, 318, 262, 6766, 4171, 30]
    ↓
EMBEDDINGS (convert to 768-dimensional vectors)
    ↓
[[0.23, -0.45, 0.12, ...], [0.56, 0.23, -0.11, ...], ...]
    ↓
LAYER 1: Self-Attention + Feed-Forward
    ↓ (each layer adds understanding)
LAYER 2: Self-Attention + Feed-Forward
    ↓
    ...
    ↓
LAYER 20: Self-Attention + Feed-Forward
    ↓
FINAL LAYER OUTPUT (rich representations)
    ↓
PREDICTION HEAD (convert to vocab probabilities)
    ↓
["The": 45%, "Blue": 12%, "A": 8%, ...]
    ↓
SAMPLING (pick next word, e.g. with Top-K or Top-P)
    ↓
"The"
    ↓
ADD TO SEQUENCE, REPEAT
    ↓
"The sky appears blue because..."
    ↓
FINAL RESPONSE
```

### Key Innovations That Make This Work:

1. **Attention Mechanism** (2017): Let words understand context
2. **Large Scale Training** (2018-2020): Train on massive datasets
3. **In-Context Learning** (2020): Model learns from examples without retraining
4. **Instruction Tuning** (2022): Make models helpful and safe
5. **RLHF** (2022): Reinforcement Learning from Human Feedback

### Why It's Called "Generative Pre-trained Transformer" (GPT):

- **Generative**: Generates new text
- **Pre-trained**: Trained on lots of data first
- **Transformer**: Uses the transformer architecture (attention!)

---

# Part 12: Limitations and Future

## ⚠️ What Our Model Can't Do (Yet!)

### Current Limitations:

1. **Hallucinations**: Makes up facts confidently
   - Example: Might say "Paris is in Germany" with confidence!
   - Why: Learned to sound confident, not to verify truth

2. **No Real Understanding**: Pattern matching, not true comprehension
   - Example: Can't truly "understand" what it means to be happy
   - Why: No consciousness or lived experience

3. **Math Difficulties**: Struggles with arithmetic
   - Example: May get 234 × 876 wrong
   - Why: Trained to predict text, not calculate

4. **Knowledge Cutoff**: Doesn't know events after training
   - Our model: Knows data up to training date only
   - Why: Frozen after training

5. **Context Length**: Can only "remember" ~1000 words
   - Example: Forgets beginning of long conversations
   - Why: Fixed context window (1024 tokens)

6. **Reasoning Limits**: Better at pattern matching than logic
   - Example: Can struggle with complex multi-step problems
   - Why: Predicts next token, doesn't "think ahead"

### The Gap to GPT-4:

Our $100 model vs GPT-4:
- **Size**: 560M parameters vs ~1 trillion+
- **Training**: $100 vs ~$100 million
- **Data**: 11B tokens vs ~10 trillion+
- **Time**: 4 hours vs months

It's like comparing a kindergartener to a PhD!

### Active Areas of Research:

1. **Reducing Hallucinations**: Making models more truthful
2. **Longer Context**: Remembering more (up to millions of tokens!)
3. **Multi-modal**: Understanding images, audio, video
4. **Efficiency**: Making models faster and cheaper
5. **Reasoning**: Better logic and planning
6. **Tool Use**: Calling calculators, databases, APIs

### What You Can Do:

- **Experiment**: Try different model sizes and datasets
- **Customize**: Give your model a unique personality
- **Evaluate**: Test on your own benchmarks
- **Contribute**: Help improve nanochat!
- **Learn**: This is just the beginning of your AI journey

---

# Part 13: Hands-On Challenge

## 🎯 Try It Yourself!

Now that you understand how it all works, here are some challenges:

### Beginner Challenges:

1. **Explore the Code**: 
   - Read through `nanochat/gpt.py`
   - Try to understand each component
   - Draw diagrams of the architecture

2. **Tokenization Experiments**:
   - Try tokenizing different languages
   - Compare efficiency (tokens per character)
   - See how emojis are tokenized!

3. **Change the Config**:
   - What happens with 12 layers vs 20?
   - Try different embedding sizes
   - Experiment with number of attention heads

### Intermediate Challenges:

4. **Custom Dataset**:
   - Collect text from your favorite books/websites
   - Train on domain-specific data
   - Create a specialized model!

5. **Personality Tuning**:
   - Follow the identity guide in nanochat docs
   - Give your model a unique personality
   - Make it an expert in a specific topic

6. **New Evaluation**:
   - Create your own benchmark
   - Test on questions you care about
   - Compare different model versions
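
A custom benchmark can start very small. Here is a minimal sketch, assuming your model is wrapped in a function that takes a prompt and returns text. The names `BENCHMARK`, `evaluate`, and `dummy_model` are hypothetical, and the stand-in model exists only so the example runs without a trained checkpoint.

```python
from typing import Callable

# Hypothetical benchmark: (prompt, expected answer) pairs you write yourself.
BENCHMARK = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
    ("What color is the sky on a clear day?", "blue"),
]

def evaluate(generate: Callable[[str], str], benchmark=BENCHMARK) -> float:
    """Return the fraction of prompts whose answer contains the expected text."""
    correct = 0
    for prompt, expected in benchmark:
        answer = generate(prompt)
        # Case-insensitive substring match: simple, but easy to fool.
        if expected.lower() in answer.lower():
            correct += 1
    return correct / len(benchmark)

def dummy_model(prompt: str) -> str:
    """Stand-in for a real model call; always gives the same answer."""
    return "The answer is 4."

print(f"Accuracy: {evaluate(dummy_model):.0%}")
```

Substring matching is deliberately crude (an answer of "14" would count as "4"), so a real benchmark would likely need stricter answer parsing.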

### Advanced Challenges:

7. **Architecture Modifications**:
   - Try different attention patterns
   - Experiment with layer configurations
   - Add new features to the model

8. **Training Optimization**:
   - Tune hyperparameters (learning rate, batch size)
   - Try different optimizers
   - Improve training speed

9. **Build Something New**:
   - Code completion tool
   - Story generator
   - Question-answering system
   - Educational tutor
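
For the training optimization challenge, the learning rate schedule is often the first thing worth experimenting with. Below is a minimal sketch of one common choice, linear warmup followed by cosine decay. It is a generic schedule for experimentation, not necessarily the one nanochat uses, and every default value here is an illustrative assumption.

```python
import math

def lr_at_step(step: int, total_steps: int, max_lr: float = 3e-4,
               warmup_steps: int = 100, min_lr: float = 3e-5) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the first updates are small.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (max_lr - min_lr) * cosine

for step in (0, 50, 100, 550, 1000):
    print(f"step {step:>4}: lr = {lr_at_step(step, total_steps=1000):.2e}")
```

Printing or plotting the schedule before training makes it easy to spot mistakes such as a warmup that is longer than the whole run.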

### Research Ideas:

10. **Investigate Questions**:
    - How does model size affect performance?
    - What's the optimal data amount?
    - Can we make training more efficient?
    - How do different tasks affect each other?
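
One starting point for the data-amount question is the widely cited "Chinchilla" rule of thumb of roughly 20 training tokens per parameter. The sketch below applies it; the ratio is a heuristic rather than a law, and `token_budget` is a hypothetical helper for illustration.

```python
TOKENS_PER_PARAM = 20  # rule-of-thumb ratio; an assumption, not a law

def token_budget(n_params: float, ratio: float = TOKENS_PER_PARAM) -> float:
    """Approximate number of training tokens for a model of n_params parameters."""
    return n_params * ratio

for n_params in (124e6, 560e6, 1.5e9):
    tokens = token_budget(n_params)
    print(f"{n_params / 1e6:>7,.0f}M params -> ~{tokens / 1e9:.1f}B tokens")
```

The ~560M-parameter model trained on ~11B tokens in this notebook sits close to this ratio, which makes it a natural baseline for experiments that train on more or less data.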

Pick one and start experimenting!

```python
# Challenge starter: Let's explore model configurations
print("🎮 Model Configuration Playground\n")
print("=" * 60)

def calculate_model_size(n_layer, n_embd, vocab_size, n_head):
    """
    Calculate approximate number of parameters in the model.
    This is educational - the actual calculation is more complex!
    """
    # Note: n_head is accepted for completeness but does not change the
    # count; heads only split n_embd into equal slices.

    # Embedding layer
    embedding_params = vocab_size * n_embd

    # Each transformer layer has:
    # - Attention: 4 * n_embd^2 (Q, K, V, projection)
    # - Feed-forward: 8 * n_embd^2 (typically 4x expansion)
    params_per_layer = 4 * n_embd * n_embd + 8 * n_embd * n_embd
    transformer_params = n_layer * params_per_layer

    # Output layer (assumes it does not share weights with the embedding)
    output_params = vocab_size * n_embd

    total = embedding_params + transformer_params + output_params
    return total

# Example configurations: (name, n_layer, n_embd, vocab_size, n_head)
# The nanochat rows assume n_embd = 64 * depth and 128-dimensional heads,
# which gives d20 the ~560M parameters quoted earlier.
configs = [
    ("Tiny (for testing)", 6, 384, 50304, 6),
    ("Small (GPT-2 small)", 12, 768, 50304, 12),
    ("nanochat d20", 20, 1280, 65536, 10),
    ("nanochat d26", 26, 1664, 65536, 13),
    ("nanochat d34", 34, 2176, 65536, 17),
]

print("\n📊 Model Configurations:\n")
print(f"{'Name':<20} {'Layers':<8} {'Embed':<8} {'Vocab':<10} {'Heads':<8} {'Parameters'}")
print("-" * 80)

for name, n_layer, n_embd, vocab_size, n_head in configs:
    params = calculate_model_size(n_layer, n_embd, vocab_size, n_head)
    params_m = params / 1_000_000
    print(f"{name:<20} {n_layer:<8} {n_embd:<8} {vocab_size:<10} {n_head:<8} {params_m:.0f}M")

print("\n" + "=" * 60)
print("\n🎯 Your turn!")
print("Try different configurations and see how the size changes.")
print("Remember: more parameters = more capacity, but slower and more expensive to train!")

# Interactive calculator
print("\n" + "-" * 60)
print("\n🧮 Try your own configuration:\n")
print("Example: calculate_model_size(n_layer=15, n_embd=512, vocab_size=50000, n_head=8)\n")
```

---

# Part 14: Running the Full Pipeline on Your GB10 GPU

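## ⚡ The Complete Speedrun

You have access to a GB10 Blackwell GPU (128GB of unified memory), which has plenty of memory for training.

### Your Setup:
- **GPU**: GB10 Blackwell (128GB unified memory)
- **Location**: /var/www/gpt2/nanochat
- **Time**: ~4 hours for the speedrun on the 8×H100 node it was designed for; expect a considerably longer run on a single GB10
- **Cost**: ~$100 in compute if you rent that 8×H100 node (about $24/hour)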

### Training Steps:

**Option 1: Run in Terminal** (Recommended)

Open a terminal/SSH session:

```bash
# Navigate to nanochat
cd /var/www/gpt2/nanochat

# Activate the virtual environment
source .venv/bin/activate

# Optional: Set up wandb for tracking
wandb login

# Run the complete pipeline
bash speedrun.sh

# OR with wandb tracking:
WANDB_RUN=my_first_llm bash speedrun.sh

# OR in a screen session (best for long training):
screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh
# (Detach with Ctrl+A then D, reattach with: screen -r speedrun)
```

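**Option 2: Run from Notebook** (Not Recommended - a multi-hour run stops if the notebook kernel restarts, so use the terminal instead)

```python
import subprocess

# Run the script from the nanochat directory. check=True raises
# CalledProcessError if the script exits with an error.
subprocess.run(['bash', 'speedrun.sh'], cwd='/var/www/gpt2/nanochat', check=True)
```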

### What Happens:

1. **Downloads data** (~24GB of web text for the speedrun)
2. **Trains tokenizer** (10 mins)
3. **Pretrains model** (2-3 hours) - the longest part!
4. **Evaluates base model** (15 mins)
5. **Midtraining** (20 mins)
6. **Supervised fine-tuning** (15 mins)
7. **Evaluates final model** (15 mins)
8. **Generates report** (1 min)
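
Adding up the stage estimates is a quick sanity check on the overall time budget. The sketch below uses the durations from the list above and leaves out the data download, since that depends on your connection.

```python
# (stage, min_minutes, max_minutes) taken from the list above.
# The data download is excluded because it depends on your connection.
stages = [
    ("Train tokenizer", 10, 10),
    ("Pretrain model", 120, 180),
    ("Evaluate base model", 15, 15),
    ("Midtraining", 20, 20),
    ("Supervised fine-tuning", 15, 15),
    ("Evaluate final model", 15, 15),
    ("Generate report", 1, 1),
]

low = sum(shortest for _, shortest, _ in stages)
high = sum(longest for _, _, longest in stages)
print(f"Total: {low}-{high} minutes ({low / 60:.1f}-{high / 60:.1f} hours)")
```

Pretraining accounts for well over half of the total, so that is the stage to watch if a run is taking longer than expected.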

### After Training:

```bash
# Chat with your model in terminal
cd /var/www/gpt2/nanochat
source .venv/bin/activate
python -m scripts.chat_cli

# OR use the web interface
python -m scripts.chat_web
# Then open http://192.168.219.45:8000 in your browser
```

### The Report:

After training, you'll get a `report.md` file at `/var/www/gpt2/nanochat/report.md` with:
- All evaluation scores
- Sample outputs
- Training statistics
- Comparison to baselines

### Training Options:
- **Speedrun (d20)**: ~$100, 4 hours
- **Mid-tier (d26)**: ~$300, 12 hours
- **Full-tier (d34)**: ~$1000, 41 hours

To run different sizes:
```bash
# Full-tier (the ~$1000 run)
bash run1000.sh

# Custom configuration
python -m scripts.base_train --depth=26 --device-batch-size=32
```
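
Dividing cost by time for each tier shows that they all imply roughly the same hourly price, so the tiers differ mainly in how long the hardware is rented. A quick check using the rounded figures from the Training Options list:

```python
# (tier, approximate cost in USD, approximate hours) from Training Options.
tiers = [
    ("Speedrun (d20)", 100, 4),
    ("Mid-tier (d26)", 300, 12),
    ("Full-tier (d34)", 1000, 41),
]

# Implied rental price per hour for each tier.
hourly_rates = {name: cost / hours for name, cost, hours in tiers}

for name, rate in hourly_rates.items():
    print(f"{name:<16} ~${rate:.2f} per hour")
```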

### Monitoring Training:

```bash
# Watch GPU usage
watch -n 1 nvidia-smi

# Follow the log written by the screen command above
# (run this from the directory where you started screen)
tail -f speedrun.log

# Check wandb dashboard (if enabled)
# Visit: https://wandb.ai
```
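
If you would rather record GPU usage from Python, for example in a notebook cell, `nvidia-smi` can print machine-readable CSV. The following is a minimal sketch; `parse_gpu_line` and `query_gpus` are hypothetical helper names, and some systems report memory fields as `[N/A]`, which the parser keeps as `None`.

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"

def parse_gpu_line(line: str) -> dict:
    """Parse one CSV line from nvidia-smi into a dict of numbers (or None)."""
    fields = [field.strip() for field in line.split(",")]
    keys = ["util_percent", "mem_used_mib", "mem_total_mib"]
    # Fields such as "[N/A]" are not numeric; keep those as None.
    return {k: float(v) if v.replace(".", "", 1).isdigit() else None
            for k, v in zip(keys, fields)}

def query_gpus() -> list:
    """Run nvidia-smi once and return one dict per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]

# Parsing works without a GPU, so it can be tried on a sample line first.
print(parse_gpu_line("97, 61234, 81559"))
```

Calling `query_gpus()` in a loop with `time.sleep` gives a simple usage log that can be compared against the training stages afterwards.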

---

# Part 15: Going Deeper - Resources and Next Steps

## 📚 Continue Your Learning Journey

Congratulations! You now understand how ChatGPT-like models work. Here's where to go next:

### Essential Resources:

#### 1. **Courses:**
- [Andrej Karpathy's Neural Networks: Zero to Hero](https://karpathy.ai/zero-to-hero.html)
  - Free YouTube series
  - Builds neural networks from scratch
  - Start here if you want deeper understanding!

- [Fast.ai - Practical Deep Learning](https://course.fast.ai/)
  - Free online course
  - Very practical approach
  - Great for getting started quickly

- [Stanford CS224N: Natural Language Processing](https://web.stanford.edu/class/cs224n/)
  - University-level course
  - Lectures free on YouTube
  - More theoretical

#### 2. **Papers to Read:**
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) (2017)
  - The original Transformer paper
  - Changed everything!

- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) (2020)
  - The GPT-3 paper
  - Shows power of scale

- [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155) (2022)
  - InstructGPT / ChatGPT techniques
  - How to make models helpful

#### 3. **Interactive Resources:**
- [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)
  - Visual guide to transformers
  - Great for visual learners

- [Transformer Explainer](https://poloclub.github.io/transformer-explainer/)
  - Interactive visualization
  - See attention in action!

- [LLM Visualization](https://bbycroft.net/llm)
  - 3D interactive model
  - Explore layer by layer

#### 4. **Code Projects:**
- [nanoGPT](https://github.com/karpathy/nanoGPT)
  - nanochat's predecessor
  - Simpler, just pretraining

- [minGPT](https://github.com/karpathy/minGPT)
  - Even simpler educational implementation
  - Great for learning basics

- [transformers by HuggingFace](https://github.com/huggingface/transformers)
  - Production-ready library
  - Use pre-trained models

### Topics to Explore:

1. **Attention Mechanisms**
   - Self-attention
   - Multi-head attention
   - Grouped-query attention
   - Flash attention

2. **Training Techniques**
   - Gradient descent & backpropagation
   - Optimizers (AdamW, Muon)
   - Learning rate schedules
   - Mixed precision training

3. **Advanced Topics**
   - RLHF (Reinforcement Learning from Human Feedback)
   - Scaling laws
   - Model compression
   - Prompt engineering
   - Fine-tuning techniques

4. **Ethics & Safety**
   - Bias in AI
   - Responsible AI development
   - AI alignment
   - Privacy considerations

### Career Paths:

If you're interested in AI professionally:
- **Machine Learning Engineer**: Build and deploy AI systems
- **Research Scientist**: Advance the field with new techniques
- **AI Safety Researcher**: Make AI safe and beneficial
- **MLOps Engineer**: Infrastructure for AI systems
- **Data Scientist**: Use AI to solve business problems

### Keep Experimenting!

The best way to learn is by doing:
1. Modify nanochat
2. Train your own models
3. Try new ideas
4. Break things (you'll learn the most!)
5. Share your findings

### Join the Community:

- [nanochat Discussions](https://github.com/karpathy/nanochat/discussions)
- [r/MachineLearning](https://reddit.com/r/MachineLearning)
- [Hugging Face Forums](https://discuss.huggingface.co/)
- AI Discord servers and communities

Remember: Every expert was once a beginner. Keep learning, stay curious, and have fun!

---

# Part 16: Summary & Key Takeaways

## 🎓 What You've Learned

### Core Concepts:

1. **Language Models are Next-Word Predictors**
   - They learn patterns from vast amounts of text
   - Generate text by predicting one token at a time
   - "Understanding" is pattern matching, not consciousness

2. **The Transformer Architecture**
   - Attention mechanism: words understand context
   - Embeddings: convert text to meaningful numbers
   - Layers: stack processing for deeper understanding
   - ~560M to 2B parameters store knowledge

3. **The Training Pipeline**
   - Tokenization: teach the model to read
   - Pretraining: learn language from web text
   - Midtraining: learn conversation format
   - Supervised fine-tuning: learn from good examples
   - Reinforcement learning: optimize for specific tasks

4. **From Predictor to Chatbot**
   - Special formatting: `<|user|>...<|assistant|>...`
   - Sampling strategies control creativity
   - Top-P sampling produces natural text

5. **Evaluation Matters**
   - Multiple benchmarks test different abilities
   - Our $100 model is like a kindergartener
   - GPT-4 is like a PhD student
   - The gap is in scale: data, compute, parameters
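
As a reminder of how sampling controls creativity, here is a minimal sketch of top-p (nucleus) sampling on a toy distribution. The probabilities are made up for illustration, and `top_p_filter` and `sample` are hypothetical helpers; nanochat's own sampling code may work differently.

```python
import random

def top_p_filter(probs: dict, p: float = 0.9) -> dict:
    """Keep the most likely tokens until their total probability reaches p."""
    kept, total = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    # Renormalize so the kept probabilities sum to 1 again.
    return {token: prob / total for token, prob in kept.items()}

def sample(probs: dict, rng: random.Random) -> str:
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Toy next-token distribution for "I'm going to the ..." (made-up numbers).
next_token_probs = {"store": 0.5, "park": 0.3, "movies": 0.15, "moon": 0.05}

filtered = top_p_filter(next_token_probs, p=0.9)
print(filtered)
print(sample(filtered, random.Random(0)))
```

With `p=0.9` the unlikely token "moon" is cut off before sampling, while a lower `p` narrows the choice further and makes the output more predictable.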

### The Big Picture:

```
Massive Text Data
        +
Transformer Architecture (Attention!)
        +
Powerful GPUs & Lots of Time
        +
Smart Training Techniques
        ↓
ChatGPT-like AI Assistant!
```

### What Makes Modern LLMs Work:

1. **Scale**: Bigger models, more data, more compute
2. **Attention**: Each token can look at every earlier token in the context
3. **Self-Supervised Learning**: Learn from raw text, with no human labels needed
4. **Transfer Learning**: Pre-train once, fine-tune for tasks
5. **Human Feedback**: RLHF makes models helpful
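
The attention idea fits in a few lines of plain Python. In a GPT-style model attention is causal, meaning each position attends only to itself and earlier positions. The sketch below computes just the attention weights for toy vectors, using the same vectors as queries and keys; a real model uses learned projections, multiple heads, and tensor libraries, so treat this as an illustration only.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(queries, keys):
    """One row of weights per position, covering itself and earlier positions."""
    dim = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        # Causal mask: position i only sees positions 0..i.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
                  for j in range(i + 1)]
        # Later positions get weight 0.
        row = softmax(scores) + [0.0] * (len(keys) - i - 1)
        weights.append(row)
    return weights

# Three toy 2-dimensional token vectors (made-up numbers).
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in causal_attention_weights(vectors, vectors):
    print([round(w, 2) for w in row])
```

Each printed row sums to 1, and the zeros on the right of each row are the causal mask at work.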

### Important Limitations:

- ❌ No true understanding or consciousness
- ❌ Can't access real-time information
- ❌ Makes confident mistakes (hallucinations)
- ❌ Limited context window
- ❌ Struggles with logic and math
- ❌ Reflects biases in training data

### But Also:

- ✅ Amazing at pattern recognition
- ✅ Can help with many tasks
- ✅ Learns from few examples
- ✅ Accessible to individuals (via nanochat!)
- ✅ Improving rapidly
- ✅ Open source versions available

### Your Next Steps:

1. **Explore the nanochat code** - read and understand each file
2. **Experiment** - change configurations, try new ideas
3. **Learn the fundamentals** - take courses, read papers
4. **Build something** - create your own AI project
5. **Join the community** - share and learn from others
6. **Stay ethical** - consider the impact of AI systems

### Final Thoughts:

You now understand more about how LLMs work than most people who use them every day! The technology is fascinating, powerful, and rapidly evolving. But remember:

- **AI is a tool**, not magic
- **Understanding matters** - don't just use black boxes
- **Ethics matter** - build responsible AI
- **Keep learning** - the field changes fast
- **Have fun** - AI is amazing to work with!

---

## 🚀 You're Ready!

You've completed this introduction to building ChatGPT from scratch. The world of AI is now open to you. Go forth and build amazing things!

**Questions? Ideas? Discoveries?**
Share them in the [nanochat Discussions](https://github.com/karpathy/nanochat/discussions)!

---

*Created with ❤️ for curious learners everywhere*

*Based on [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy*

---

# 🎉 Congratulations!

You've completed the nanochat tutorial! You now understand:
- How language models work
- The transformer architecture
- The complete training pipeline
- How to build your own ChatGPT

Keep exploring, keep learning, and most importantly - keep building! 🚀