# Chapter 1: Understanding Large Language Models

**Portfolio Project: Building LLMs from Scratch on AWS** 🚀

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yourusername/llm-from-scratch-aws/blob/main/01_LLM_Fundamentals.ipynb)

---

## 📋 Chapter Overview

This notebook covers foundational concepts of Large Language Models:
- What are LLMs and how do they work?
- Transformer architecture overview
- LLM training stages: pretraining, finetuning, alignment
- Model scale and computational requirements
- AWS cost optimization strategies

**Learning Objectives:**
✅ Understand core LLM concepts  
✅ Learn the LLM development lifecycle  
✅ Prepare for hands-on implementation  

**AWS Services:** None (conceptual)  
**Estimated Cost:** $0.00

---

## 🔧 Environment Setup
        
### Cell Purpose: Install and configure packages for AWS SageMaker and Google Colab

In [None]:
# Install required packages for cloud environments
import sys

# Detect environment
IN_COLAB = 'google.colab' in sys.modules
IN_SAGEMAKER = '/opt/ml' in sys.executable or 'sagemaker' in sys.executable.lower()

print(f"Environment: {'Google Colab' if IN_COLAB else 'AWS SageMaker' if IN_SAGEMAKER else 'Local/Other'}")

# Install packages if needed
if IN_COLAB or IN_SAGEMAKER:
    !pip install -q torch matplotlib numpy pandas
    print("✅ Packages installed successfully!")

### Cell Purpose: Verify installations and check available compute resources

In [None]:
# Import libraries and verify installation
import torch
import numpy as np
import matplotlib.pyplot as plt
from importlib.metadata import version
import platform

print("="*60)
print("ENVIRONMENT INFORMATION")
print("="*60)
print(f"Python: {platform.python_version()}")
print(f"PyTorch: {version('torch')}")
print(f"NumPy: {version('numpy')}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
print("="*60)

## 1.1 What is a Large Language Model?

Large Language Models (LLMs) are deep learning models trained on massive text data to understand and generate human-like text.

### Key Characteristics:
- **Scale**: Billions of parameters (GPT-3: 175B; GPT-4: undisclosed, with unofficial estimates around 1.76T)
- **Architecture**: Transformer-based (self-attention mechanisms)
- **Training**: Self-supervised pretraining + supervised finetuning
- **Capabilities**: Text generation, translation, summarization, Q&A, coding
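
To make the scale figures concrete, the sketch below counts the parameters of a GPT-2-style decoder from its configuration. The function name and the layout it assumes (learned positional embeddings, a fused query/key/value projection, a feed-forward layer four times the model width, and an output head that shares weights with the token embedding) are illustrative assumptions, not something this notebook has defined yet.

```python
def gpt2_param_count(vocab_size=50257, context_len=1024, d_model=768, n_layers=12):
    """Count parameters of a GPT-2-style decoder, assuming a tied output head."""
    token_emb = vocab_size * d_model
    pos_emb = context_len * d_model
    # Attention: fused query/key/value projection plus output projection (weights + biases)
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward: expand to 4*d_model, then project back (weights + biases)
    ffn = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two layer norms per block, each with a scale and a shift vector
    norms = 2 * 2 * d_model
    final_norm = 2 * d_model
    return token_emb + pos_emb + n_layers * (attn + ffn + norms) + final_norm


print(f"GPT-2 Small:  {gpt2_param_count():,} parameters")
print(f"GPT-2 Medium: {gpt2_param_count(d_model=1024, n_layers=24):,} parameters")
```

Under these assumptions the two configurations come out at roughly 124M and 355M parameters, which matches the sizes quoted for GPT-2 Small and Medium in this chapter.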

### LLM Development Stages:
1. **Pretraining**: Learning language patterns from vast unlabeled text
2. **Finetuning**: Adapting to specific tasks with labeled data
3. **Alignment**: RLHF to align with human preferences

### Cell Purpose: Visualize and compare different LLM model sizes

In [None]:
# Visualize LLM scale comparison
models = ['GPT-2\nSmall', 'GPT-2\nMedium', 'GPT-2\nLarge', 'GPT-2\nXL', 'GPT-3', 'GPT-4\n(est)']
parameters_billions = [0.124, 0.355, 0.774, 1.5, 175, 1760]
colors = ['#3498db', '#5dade2', '#85c1e9', '#aed6f1', '#e74c3c', '#c0392b']

plt.figure(figsize=(14, 6))
bars = plt.barh(models, parameters_billions, color=colors, edgecolor='black', linewidth=1.5)
plt.xlabel('Parameters (Billions)', fontsize=12, fontweight='bold')
plt.title('LLM Model Scale Comparison', fontsize=14, fontweight='bold')
plt.xscale('log')
plt.grid(axis='x', alpha=0.3, linestyle='--')

for bar, value in zip(bars, parameters_billions):
    plt.text(value*1.3, bar.get_y() + bar.get_height()/2, 
             f'{value}B', va='center', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

print("📊 Training smaller models from scratch is practical and cost-effective!")

## 1.2 Transformer Architecture

The transformer is the foundation of modern LLMs.

### Core Components:
1. **Token Embeddings**: Convert text into dense vectors
2. **Positional Encodings**: Add position information
3. **Self-Attention**: Allow tokens to attend to each other
4. **Multi-Head Attention**: Multiple attention heads run in parallel
5. **Feed-Forward Networks**: Process attended information
6. **Layer Normalization**: Stabilize training
7. **Residual Connections**: Enable deep networks
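
As a rough illustration of the self-attention step listed above, here is a minimal single-head sketch in NumPy with a causal mask, as used by decoder-only models. The function name, the random weight matrices, and the tiny dimensions are assumptions made for the example; the from-scratch implementation later in the project may be organized differently.

```python
import numpy as np


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with a causal mask.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Mask future positions so token i only attends to tokens 0..i
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax (subtract the row max for numerical stability)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8  # toy sizes for illustration
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)         # one output vector per input token
print(weights.round(2))  # lower-triangular; each row sums to 1
```

Each row of the weight matrix shows how much one token draws on itself and on earlier tokens; the zeros above the diagonal are what make generation left-to-right.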

### Architecture Types:
- **GPT (Decoder-only)**: Autoregressive, left-to-right generation
- **BERT (Encoder-only)**: Bidirectional, masked language modeling
- **T5 (Encoder-Decoder)**: Full transformer, sequence-to-sequence

### Cell Purpose: Visualize the transformer architecture flow

In [None]:
# Visualize transformer architecture
fig, ax = plt.subplots(figsize=(10, 12))
ax.axis('off')

layers = [
    ('Output Probabilities', 0.92, '#e74c3c'),
    ('Linear + Softmax', 0.84, '#e67e22'),
    ('Decoder Block N', 0.72, '#3498db'),
    ('...', 0.64, '#95a5a6'),
    ('Decoder Block 2', 0.56, '#3498db'),
    ('Decoder Block 1', 0.48, '#3498db'),
    ('Add Positional Encoding', 0.36, '#9b59b6'),
    ('Token Embeddings', 0.28, '#1abc9c'),
    ('Input Tokens', 0.16, '#2ecc71')
]

for label, y_pos, color in layers:
    rect = plt.Rectangle((0.2, y_pos-0.04), 0.6, 0.06, 
                         facecolor=color, edgecolor='black', linewidth=2, alpha=0.7)
    ax.add_patch(rect)
    ax.text(0.5, y_pos-0.01, label, ha='center', va='center', 
           fontsize=11, fontweight='bold', color='white')

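# Draw arrows upward, following the data flow from input tokens to output
for i in range(len(layers)-1):
    y_start = layers[i+1][1] + 0.02   # top edge of the lower box
    y_end = layers[i][1] - 0.04       # bottom edge of the upper box
    ax.arrow(0.5, y_start, 0, y_end-y_start-0.01, 
            head_width=0.03, head_length=0.01, fc='black', ec='black', linewidth=2)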

ax.text(0.5, 0.98, 'GPT-Style Transformer (Decoder-Only)', 
       ha='center', fontsize=14, fontweight='bold')
ax.text(0.87, 0.72, 'Self-Attention\n+ FFN + Norm', 
       fontsize=8, bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.3))

ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
plt.tight_layout()
plt.show()

print("🏗️ We'll implement this architecture step-by-step in Chapters 2-4!")

## 1.3 LLM Training Pipeline

### Stage 1: Pretraining (Chapter 5)
- **Objective**: Learn general language understanding
- **Data**: Large unlabeled text corpus
- **Task**: Next-token prediction
- **Duration**: Weeks to months
- **Cost**: $100K - $10M+ for large models
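
Next-token prediction needs no labels because the targets are just the inputs shifted by one position. The sketch below shows one way to build such training pairs with a sliding window; the helper name and the token IDs are hypothetical and stand in for real tokenizer output.

```python
def next_token_pairs(token_ids, context_len):
    """Slide a window over token_ids; targets are the inputs shifted by one."""
    pairs = []
    for start in range(len(token_ids) - context_len):
        inputs = token_ids[start:start + context_len]
        targets = token_ids[start + 1:start + context_len + 1]
        pairs.append((inputs, targets))
    return pairs


token_ids = [10, 11, 12, 13, 14, 15]  # hypothetical token IDs
for inputs, targets in next_token_pairs(token_ids, context_len=4):
    print(inputs, "->", targets)
```

At every position the model is trained to predict the target token given only the inputs up to that position, which is why raw unlabeled text is enough for this stage.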

### Stage 2: Supervised Finetuning (Chapters 6-7)
- **Objective**: Adapt to specific tasks
- **Data**: Curated labeled datasets
- **Task**: Task-specific objectives
- **Duration**: Hours to days
- **Cost**: $10 - $1,000

### Stage 3: Alignment (Optional)
- **Objective**: Align with human preferences
- **Methods**: RLHF, DPO
- **Duration**: Days to weeks
- **Cost**: $1,000 - $100K+
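
For intuition on how DPO uses preference data, the following sketch computes the DPO loss for a single preference pair from sequence log-probabilities. The function name, the argument names, and the default beta of 0.1 are illustrative assumptions; a real implementation would operate on batched tensors from a policy model and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair."""
    # How much more likely each response became relative to the reference model
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the loss starts at log(2)
print(f"{dpo_loss(-5.0, -7.0, -5.0, -7.0):.3f}")
# Policy now favors the chosen response more than the reference did: lower loss
print(f"{dpo_loss(-4.0, -8.0, -5.0, -7.0):.3f}")
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, without training a separate reward model as RLHF does.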

### Cell Purpose: Visualize the three-stage training pipeline

In [None]:
# Visualize training pipeline stages
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

stages = [
    {'title': 'Stage 1:\nPretraining', 'data': 'Unlabeled Text', 
     'task': 'Next Token\nPrediction', 'output': 'Base Model', 'color': '#3498db'},
    {'title': 'Stage 2:\nFinetuning', 'data': 'Labeled Data', 
     'task': 'Task-Specific', 'output': 'Finetuned Model', 'color': '#e67e22'},
    {'title': 'Stage 3:\nAlignment', 'data': 'Human Feedback', 
     'task': 'RLHF / DPO', 'output': 'Aligned Model', 'color': '#2ecc71'}
]

for ax, stage in zip(axes, stages):
    ax.axis('off')
    ax.text(0.5, 0.95, stage['title'], ha='center', fontsize=13, 
           fontweight='bold', color=stage['color'])
    
    # Boxes and arrows
    for y, text in [(0.75, stage['data']), (0.40, stage['task'])]:
        rect = plt.Rectangle((0.1, y-0.1), 0.8, 0.15, 
                            facecolor=stage['color'], alpha=0.3, 
                            edgecolor=stage['color'], linewidth=2)
        ax.add_patch(rect)
        ax.text(0.5, y-0.025, text, ha='center', va='center', fontsize=9)
        ax.arrow(0.5, y-0.12, 0, -0.13, head_width=0.08, head_length=0.05, 
                fc=stage['color'], ec=stage['color'], linewidth=2)
    
    rect = plt.Rectangle((0.1, 0.05), 0.8, 0.12, 
                        facecolor=stage['color'], alpha=0.7, 
                        edgecolor='black', linewidth=2)
    ax.add_patch(rect)
    ax.text(0.5, 0.11, stage['output'], ha='center', va='center', 
           fontsize=10, fontweight='bold', color='white')
    
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)

plt.suptitle('LLM Training Pipeline', fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("🎯 We'll implement pretraining and finetuning hands-on using cost-effective methods!")

## 1.4 AWS Cost Optimization

### 💰 Budget-Friendly LLM Development

**Instance Selection:**
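- **Notebooks**: ml.t3.medium (CPU, about $0.05/hr)
- **Training**: ml.g4dn.xlarge (1 T4 GPU; $0.526/hr is the EC2 on-demand rate for g4dn.xlarge, and ml.-prefixed SageMaker instances are usually priced somewhat higher, so check the SageMaker pricing page for your region)
- **Spot Instances**: Save up to 70%!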

**Data Management:**
- **S3**: $0.023/GB/month for datasets
- **Lifecycle Policies**: Auto-delete old checkpoints
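
A lifecycle rule for cleaning up checkpoints could look like the sketch below. The bucket name, the prefix, and the 30-day retention period are hypothetical placeholders; the commented-out call uses boto3's S3 client and needs AWS credentials, so only the rule itself is built and printed here.

```python
import json

# Hypothetical names: replace with your own bucket and checkpoint prefix
BUCKET = "my-llm-project-bucket"
CHECKPOINT_PREFIX = "checkpoints/"

lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-old-checkpoints",
            "Filter": {"Prefix": CHECKPOINT_PREFIX},
            "Status": "Enabled",
            "Expiration": {"Days": 30},  # assumed retention period
        }
    ]
}

print(json.dumps(lifecycle_config, indent=2))

# To apply the rule (requires boto3 and AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket=BUCKET, LifecycleConfiguration=lifecycle_config
# )
```

Scoping the rule to a prefix keeps datasets and final model artifacts elsewhere in the bucket untouched.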

**Training Optimization:**
- Train GPT-2 scale models (124M-355M params)
- Use FP16 mixed precision
- Gradient accumulation for larger batches
- Early stopping
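
A quick back-of-the-envelope check shows why GPT-2 scale models fit on a single small GPU. The sketch assumes FP32 weights and gradients plus two Adam moment estimates per parameter (16 bytes per parameter in total) and ignores activations, which depend on batch size and sequence length; the function name and byte counts are assumptions for illustration.

```python
def training_memory_gb(n_params, bytes_weights=4, bytes_grads=4, bytes_optimizer=8):
    """Weights + gradients + Adam moment estimates, excluding activations."""
    return n_params * (bytes_weights + bytes_grads + bytes_optimizer) / 1e9


for name, n_params in [("GPT-2 Small", 124e6), ("GPT-2 Medium", 355e6)]:
    print(f"{name}: ~{training_memory_gb(n_params):.1f} GB before activations")
```

Both totals leave headroom on a 16 GB T4. Mixed precision and gradient accumulation mainly help with the activation memory that this estimate leaves out, by shrinking activations and by allowing smaller micro-batches.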

**Alternatives:**
- **Google Colab**: Free GPU (T4, 15GB)
- **Colab Pro**: $10/month (V100/A100)
- **Local**: CPU-only for experiments

### Estimated Costs:
- **Development**: $5-20
- **Full Training**: $20-50
- **Complete Project**: **< $100 total**

### Free Tier:
- **AWS**: 250 hours per month of SageMaker notebook usage (ml.t3.medium) for the first two months; check the current free tier terms
- **Google Colab**: Free GPU with limits
- **AWS Educate**: $100-200 credits

### Cell Purpose: Calculate estimated AWS training costs for different scenarios

In [None]:
# AWS Cost Calculator for LLM Training
def estimate_aws_cost(model_params_millions, epochs, hours_per_epoch, 
                      instance_type='g4dn.xlarge', use_spot=True):
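    '''Estimate AWS SageMaker training costs from approximate hourly rates'''
    
    # Approximate on-demand USD/hour; rates vary by region and change over time,
    # and SageMaker ml.* instances are typically priced above these figures.
    instance_prices = {
        'g4dn.xlarge': 0.526,   # 1x T4 GPU
        'g4dn.2xlarge': 0.752,  # 1x T4 GPU, more CPU
        'g5.xlarge': 1.006,     # 1x A10G GPU
        'p3.2xlarge': 3.06,     # 1x V100 GPU
    }
    
    # Fail loudly instead of silently pricing an unknown instance as g4dn.xlarge
    if instance_type not in instance_prices:
        raise ValueError(f"Unknown instance type: {instance_type!r}")
    hourly_rate = instance_prices[instance_type]
    if use_spot:
        hourly_rate *= 0.35  # assumes a 65% spot discount; actual savings vary
        label = f"{instance_type} (Spot)"
    else:
        label = f"{instance_type} (On-Demand)"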
    
    total_hours = epochs * hours_per_epoch
    compute_cost = total_hours * hourly_rate
    storage_cost = model_params_millions * 0.004 * 0.023 * (total_hours / 720)
    total = compute_cost + storage_cost
    
    print("="*60)
    print(f"AWS COST ESTIMATE: {model_params_millions}M params, {epochs} epochs")
    print("="*60)
    print(f"Instance: {label}")
    print(f"Hours: {total_hours:.1f} (${hourly_rate:.3f}/hr)")
    print(f"Compute: ${compute_cost:.2f}")
    print(f"Storage (S3): ${storage_cost:.2f}")
    print(f"\n💰 TOTAL: ${total:.2f}")
    print("="*60)
    
    if use_spot:
        on_demand = total_hours * instance_prices.get(instance_type, 0.526)
        print(f"✅ Savings vs On-Demand: ${on_demand-compute_cost:.2f} ({((on_demand-compute_cost)/on_demand)*100:.0f}%)")
    
    return total

# Example: GPT-2 Small (recommended)
print("Example 1: GPT-2 Small (124M) - Recommended\n")
cost1 = estimate_aws_cost(124, epochs=10, hours_per_epoch=2, use_spot=True)

print("\n")
print("Example 2: GPT-2 Medium (355M) - Advanced\n")
cost2 = estimate_aws_cost(355, epochs=5, hours_per_epoch=4, 
                          instance_type='g5.xlarge', use_spot=True)

print("\n💡 Recommendation: Start with GPT-2 Small. Total cost < $50!")

## 1.5 Project Roadmap

### What We'll Build:
Complete LLM training and finetuning pipeline!

**Chapter 2: Text Data Processing**
- Tokenization (BPE)
- Creating embeddings
- Data loading and batching
- AWS: S3 for datasets

**Chapter 3: Attention Mechanisms**
- Self-attention from scratch
- Multi-head attention
- Causal masking
- AWS: SageMaker notebooks

**Chapter 4: GPT Model**
- Complete GPT architecture
- Layer normalization
- Residual connections
- Text generation
- AWS: Model checkpoints in S3

**Chapter 5: Pretraining**
- Training loop implementation
- Loss calculation
- Model evaluation
- AWS: SageMaker Training with spot instances

**Chapter 6: Classification Finetuning**
- Spam classification
- Transfer learning
- Evaluation metrics
- AWS: SageMaker finetuning

**Chapter 7: Instruction Finetuning**
- Instruction-following
- Dataset preparation
- LoRA (parameter-efficient)
- Model deployment
- AWS: SageMaker Inference Endpoint

### Features:
✅ Complete from-scratch implementation  
✅ AWS-ready with cost optimization  
✅ Google Colab compatible  
✅ Production-ready code  
✅ Comprehensive documentation  
✅ Portfolio-ready presentation

### Cell Purpose: Visualize project roadmap and chapter complexity

In [None]:
# Project roadmap visualization
chapters = ['Ch 1:\nFundamentals', 'Ch 2:\nText Data', 'Ch 3:\nAttention', 
            'Ch 4:\nGPT Model', 'Ch 5:\nPretraining', 'Ch 6:\nClassification', 
            'Ch 7:\nInstructions']
complexity = [1, 3, 5, 7, 8, 6, 7]
colors = ['#2ecc71', '#3498db', '#9b59b6', '#e74c3c', '#e67e22', '#f39c12', '#1abc9c']
icons = ['📚', '📝', '🧠', '🤖', '🚀', '🎯', '💬']

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Complexity chart
bars = ax1.barh(chapters, complexity, color=colors, edgecolor='black', linewidth=1.5)
ax1.set_xlabel('Complexity Score', fontsize=11, fontweight='bold')
ax1.set_title('Chapter Complexity', fontsize=13, fontweight='bold')
ax1.set_xlim(0, 10)
ax1.grid(axis='x', alpha=0.3)

for bar, value in zip(bars, complexity):
    ax1.text(value+0.2, bar.get_y()+bar.get_height()/2, 
             f'{value}/10', va='center', fontsize=10, fontweight='bold')

# Timeline
ax2.axis('off')
y_positions = np.linspace(0.9, 0.1, len(chapters))

for i, (chapter, y_pos, color, icon) in enumerate(zip(chapters, y_positions, colors, icons)):
    # Use facecolor/edgecolor: passing color= would override the black edge
    circle = plt.Circle((0.15, y_pos), 0.03, facecolor=color, edgecolor='black', linewidth=2)
    ax2.add_patch(circle)
    
    if i < len(chapters) - 1:
        ax2.plot([0.15, 0.15], [y_pos-0.03, y_positions[i+1]+0.03], 
                'k-', linewidth=2, alpha=0.5)
    
    ax2.text(0.22, y_pos, chapter, va='center', fontsize=11, fontweight='bold')
    ax2.text(0.08, y_pos, icon, va='center', ha='center', fontsize=16)

ax2.set_xlim(0, 1)
ax2.set_ylim(0, 1)
ax2.set_title('Project Roadmap', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.show()

print("🎓 You're in Chapter 1 - Fundamentals!")
print("   Ready to build? Let's move to Chapter 2! 🚀")

## 1.6 Prerequisites Check

### Required Knowledge:
**Essential:**
- Python: functions, classes, loops
- NumPy: arrays, broadcasting
- Basic ML: neural networks, gradient descent

**Helpful:**
- PyTorch: tensors, autograd, nn.Module
- AWS: S3, SageMaker basics
- Deep Learning: CNNs, RNNs concepts

### Resources:
- [PyTorch in 60 Minutes](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)
- [AWS SageMaker Guide](https://docs.aws.amazon.com/sagemaker/)
- [Deep Learning Book](https://www.deeplearningbook.org/)

### Cell Purpose: Quick self-assessment of PyTorch and NumPy knowledge

In [None]:
# Quick knowledge check
print("🧪 QUICK KNOWLEDGE CHECK")
print("="*60)

# Test 1: Matrix multiplication
print("\n1. PyTorch Tensor Operations:")
try:
    x = torch.randn(3, 4)
    y = torch.randn(4, 5)
    z = torch.matmul(x, y)
    print(f"   ✅ Matmul: ({x.shape}) × ({y.shape}) = ({z.shape})")
except Exception as e:
    print(f"   ❌ Error: {e}")

# Test 2: Broadcasting
print("\n2. Broadcasting:")
try:
    a = torch.randn(3, 1)
    b = torch.randn(1, 4)
    c = a + b
    print(f"   ✅ Broadcasting: ({a.shape}) + ({b.shape}) = ({c.shape})")
except Exception as e:
    print(f"   ❌ Error: {e}")

# Test 3: Gradients
print("\n3. Automatic Differentiation:")
try:
    x = torch.tensor([2.0], requires_grad=True)
    y = x**2 + 3*x + 1
    y.backward()
    print(f"   ✅ Gradient of y=x²+3x+1 at x=2: dy/dx = {x.grad.item():.1f}")
    print("      (Expected: 7.0, since dy/dx = 2x + 3)")
except Exception as e:
    print(f"   ❌ Error: {e}")

# Test 4: GPU
print("\n4. GPU Availability:")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if torch.cuda.is_available():
    print(f"   ✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory/1e9:.2f} GB")
else:
    print("   ⚠️  No GPU (CPU mode - slower but works!)")

print("\n"+"="*60)
print("✅ Checks complete. If any test above shows ❌, resolve it before proceeding.")
print("💡 If CPU-only, consider Google Colab or AWS SageMaker for faster training.")

## 📝 Chapter Summary

### What We Learned:
1. ✅ **LLM Basics**: Transformer architecture and components
2. ✅ **Training Pipeline**: Three-stage process
3. ✅ **Cost Optimization**: AWS strategies (< $100 total)
4. ✅ **Project Roadmap**: Clear path through 7 chapters
5. ✅ **Environment Setup**: Ready for AWS, Colab, or local

### Key Takeaways:
- LLMs can be built cost-effectively at small scale
- Transformer architecture is the foundation
- AWS offers flexible ML infrastructure
- Hands-on production ML workflow experience

### Next Steps:
➡️ **Chapter 2**: Text data processing, tokenization, and embeddings!

---

## 🔗 Resources

**Papers:**
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [GPT-3 Paper](https://arxiv.org/abs/2005.14165)
- [BERT Paper](https://arxiv.org/abs/1810.04805)

**AWS Documentation:**
- [SageMaker Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/)
- [SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/)
- [Spot Instance Best Practices](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html)

**Learning:**
- [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Hugging Face Course](https://huggingface.co/course)

**Ready for Chapter 2? Let's start building! 🚀**