# Scaling Laws: The Physics of Model Performance

Why do larger models work better? How much data do you need? This notebook explores the empirical scaling laws that describe transformer performance and shape how AI systems are built.

## The Fundamental Questions

Every ML practitioner faces these choices:
- **How big should my model be?**
- **How much data do I need?** 
- **What's the optimal use of my compute budget?**
- **When will my model develop new capabilities?**

## The Physics Behind the Magic

Neural network performance follows **power laws** - the same mathematical relationships that govern earthquakes, city sizes, and biological systems.

**Power Law**: Loss ∝ (Scale)^(-α)
- An empirical regularity, not a law derived from first principles
- Predictable across scales
- Enables performance forecasting before training

**Chinchilla's Discovery**: Many large models were undertrained for their size
- Optimal ratio: ~20 data tokens per parameter
- Smaller, well-trained models often beat larger ones
- Revolutionized resource allocation strategies

**Emergence**: New capabilities appear suddenly at critical scales
- Often compared to phase transitions, like water freezing at 0°C
- Some abilities seem to require a minimum model size
- Hard to predict from smaller models

## What You'll Master

1. **Discover power laws** through hands-on experiments
2. **Apply Chinchilla principles** for optimal resource allocation
3. **Observe emergence** and phase transitions in action
4. **Build prediction models** for performance forecasting

In [None]:
import sys
sys.path.append('..')

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import curve_fit
from typing import Dict, List, Tuple

from src.model.transformer import GPTModel, create_model_config

torch.manual_seed(42)
np.random.seed(42)

plt.style.use('default')
sns.set_palette("husl")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print("Scaling laws laboratory ready! 📊")

## 1. The Power Law: Why Bigger Models Work Better

One of the most useful empirical findings in deep learning: loss falls as a power law in model size.

### The Mathematical Relationship

Neural network loss follows this equation:
```
Loss = A × (Parameters)^(-α) + B
```

Where:
- **A**: Architecture-dependent constant (efficiency factor)
- **α**: Scaling exponent (reported values for transformers range from about 0.05 to 0.35, depending on the setup)
- **B**: Irreducible loss (theoretical minimum for the dataset)
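
Taking logarithms shows why this law is usually inspected on log-log axes. The short derivation below uses the same A, α, and B as above and writes N for the parameter count (a shorthand introduced here):

```latex
\begin{align}
L(N) &= A\,N^{-\alpha} + B \\
L(N) - B &= A\,N^{-\alpha} \\
\log\bigl(L(N) - B\bigr) &= \log A - \alpha \log N
\end{align}
```

The reducible loss L(N) - B is therefore a straight line in log-log space, with slope -α and intercept log A. The raw loss only looks straight while B is small compared with A × N^(-α).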

### Why This Mathematical Form?

**The Physics Explanation**: Think of parameters as "degrees of freedom" for function approximation:

1. **Few parameters**: Can only represent simple, smooth functions
2. **More parameters**: Can capture increasingly complex patterns
3. **Diminishing returns**: Each additional parameter contributes less than the previous ones

**Statistical Mechanics Analogy**: Like gas molecules in a container - more particles allow more complex behavior, but each additional particle has decreasing marginal impact.

**Key Insight**: The power law means doubling model size doesn't halve the loss - it multiplies the reducible part of the loss (Loss - B) by 2^(-α), so improvement is predictable but sub-linear.
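
To make the doubling claim concrete, here is a minimal sketch that evaluates the formula at N and 2N. The constants `A`, `alpha`, and `B` are hypothetical values picked only for illustration, not fitted to any real model:

```python
def power_law_loss(n_params, A, alpha, B):
    """Loss = A * N^(-alpha) + B, the form used in this notebook."""
    return A * n_params ** (-alpha) + B

# Hypothetical constants chosen only for illustration.
A, alpha, B = 10.0, 0.2, 2.0

n = 1_000_000
loss_n = power_law_loss(n, A, alpha, B)
loss_2n = power_law_loss(2 * n, A, alpha, B)

# Doubling N multiplies only the reducible part (loss - B) by 2^(-alpha).
reducible_ratio = (loss_2n - B) / (loss_n - B)

print(f"Loss at N:  {loss_n:.4f}")
print(f"Loss at 2N: {loss_2n:.4f}")
print(f"Reducible loss ratio: {reducible_ratio:.4f} (2^-alpha = {2 ** -alpha:.4f})")
```

With α = 0.2, doubling the model removes only about 13% of the reducible loss, and none of the irreducible term B.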

Let's discover this law experimentally:

In [None]:
def create_model_family():
    """Create transformer models of increasing size to study scaling."""
    return {
        'nano': {'vocab_size': 500, 'd_model': 32, 'n_heads': 2, 'n_layers': 2, 'd_ff': 64, 'max_seq_len': 32, 'dropout': 0.1},
        'micro': {'vocab_size': 500, 'd_model': 48, 'n_heads': 3, 'n_layers': 2, 'd_ff': 96, 'max_seq_len': 32, 'dropout': 0.1},
        'tiny': {'vocab_size': 500, 'd_model': 64, 'n_heads': 4, 'n_layers': 3, 'd_ff': 128, 'max_seq_len': 32, 'dropout': 0.1},
        'small': {'vocab_size': 500, 'd_model': 80, 'n_heads': 5, 'n_layers': 4, 'd_ff': 160, 'max_seq_len': 32, 'dropout': 0.1},
        'medium': {'vocab_size': 500, 'd_model': 96, 'n_heads': 6, 'n_layers': 5, 'd_ff': 192, 'max_seq_len': 32, 'dropout': 0.1}
    }

def count_parameters(model):
    """Count trainable parameters in the model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def measure_performance(model, training_steps=40):
    """Train model briefly and measure final performance."""
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
    criterion = nn.CrossEntropyLoss()
    
    losses = []
    
    for step in range(training_steps):
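        # Deterministic batch per step so every model sees identical data.
        # A local generator avoids resetting the global RNG (which also drives dropout).
        gen = torch.Generator().manual_seed(step)
        x = torch.randint(0, 500, (4, 24), generator=gen).to(device)
        # Learnable synthetic target: a fixed function of the input token.
        # Independent random targets would pin every model near ln(500) ≈ 6.2.
        targets = (x + 1) % 500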
        
        optimizer.zero_grad()
        outputs = model(x)
        loss = criterion(outputs.reshape(-1, outputs.size(-1)), targets.reshape(-1))
        loss.backward()
        
        # Gradient clipping for stability
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        
        losses.append(loss.item())
    
    # Return average of final few steps for stability
    return np.mean(losses[-10:])

# Run the scaling experiment
print("🧬 Measuring the fundamental scaling law...")
print("This will take a few minutes - we're discovering physics!")

model_configs = create_model_family()
scaling_results = {'names': [], 'parameters': [], 'losses': []}

for name, config in model_configs.items():
    print(f"\n🔬 Training {name} model...")
    
    # Create and train model
    model = GPTModel(**config).to(device)
    param_count = count_parameters(model)
    final_loss = measure_performance(model)
    
    # Store results
    scaling_results['names'].append(name)
    scaling_results['parameters'].append(param_count)
    scaling_results['losses'].append(final_loss)
    
    print(f"   Parameters: {param_count:,}")
    print(f"   Final loss: {final_loss:.4f}")
    
    # Clean up GPU memory
    del model
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

print("\n✅ Scaling experiment complete!")
print("Now let's discover the mathematical law...")

In [None]:
# Discover the Power Law

def power_law_function(params, A, alpha, B):
    """Power law: Loss = A * (Parameters)^(-alpha) + B"""
    return A * np.power(params, -alpha) + B

# Convert to numpy arrays for curve fitting
parameters = np.array(scaling_results['parameters'])
losses = np.array(scaling_results['losses'])

# Fit the power law to our data
try:
    # Initial guess for parameters [A, alpha, B]
    initial_guess = [max(losses), 0.2, min(losses) * 0.9]
    
    optimal_params, covariance = curve_fit(
        power_law_function, parameters, losses,
        p0=initial_guess, maxfev=3000
    )
    
    A, alpha, B = optimal_params
    
    print(f"🎯 DISCOVERED SCALING LAW:")
    print(f"   Loss = {A:.3f} × (Parameters)^(-{alpha:.3f}) + {B:.3f}")
    print(f"   Scaling exponent α = {alpha:.3f}")
    print(f"   Irreducible loss B = {B:.3f}")
    
    # Calculate R-squared for fit quality
    predictions = power_law_function(parameters, A, alpha, B)
    ss_res = np.sum((losses - predictions) ** 2)
    ss_tot = np.sum((losses - np.mean(losses)) ** 2)
    r_squared = 1 - (ss_res / ss_tot)
    
    print(f"   R² = {r_squared:.4f} (fit quality)")
    
except Exception as e:
    print(f"Curve fitting failed: {e}")
    print("Using default values for visualization")
    A, alpha, B = max(losses), 0.2, min(losses) * 0.9

# Create comprehensive visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Log-log plot (power law appears as straight line)
axes[0, 0].loglog(parameters, losses, 'ro', markersize=12, label='Measured Performance', markeredgecolor='darkred', linewidth=2)

# Generate smooth power law curve
param_range = np.logspace(np.log10(min(parameters)), np.log10(max(parameters)*3), 100)
power_law_curve = power_law_function(param_range, A, alpha, B)
axes[0, 0].loglog(param_range, power_law_curve, 'b--', linewidth=3, 
                  label=f'Power Law (α={alpha:.3f})')

# Add model name labels
for i, name in enumerate(scaling_results['names']):
    axes[0, 0].annotate(name, (parameters[i], losses[i]), 
                        xytext=(10, 10), textcoords='offset points',
                        fontsize=10, weight='bold')

axes[0, 0].set_xlabel('Parameters (log scale)', fontsize=12)
axes[0, 0].set_ylabel('Loss (log scale)', fontsize=12)
axes[0, 0].set_title('Power Law Discovery\n(Near-straight line on log-log axes suggests a power law)', fontsize=14, weight='bold')
axes[0, 0].legend(fontsize=11)
axes[0, 0].grid(True, alpha=0.3)

# 2. Linear plot showing diminishing returns
axes[0, 1].plot(parameters, losses, 'ro-', markersize=10, linewidth=3, 
                label='Measured', markeredgecolor='darkred')
param_linear = np.linspace(min(parameters), max(parameters)*1.5, 100)
power_law_linear = power_law_function(param_linear, A, alpha, B)
axes[0, 1].plot(param_linear, power_law_linear, 'b--', linewidth=3, 
                label='Power Law Prediction')

axes[0, 1].set_xlabel('Parameters', fontsize=12)
axes[0, 1].set_ylabel('Loss', fontsize=12)
axes[0, 1].set_title('Diminishing Returns\n(Linear scale shows saturation)', fontsize=14, weight='bold')
axes[0, 1].legend(fontsize=11)
axes[0, 1].grid(True, alpha=0.3)

# 3. Performance improvement analysis
param_multiples = np.array([2, 5, 10, 100, 1000])
base_params = min(parameters)
base_loss = power_law_function(base_params, A, alpha, B)

improvement_factors = []
for multiple in param_multiples:
    new_loss = power_law_function(base_params * multiple, A, alpha, B)
    improvement = (base_loss - new_loss) / base_loss * 100
    improvement_factors.append(improvement)

bars = axes[1, 0].bar([f'{m}x' for m in param_multiples], improvement_factors, 
                      color='green', alpha=0.7, edgecolor='darkgreen', linewidth=2)
axes[1, 0].set_xlabel('Parameter Scale Increase', fontsize=12)
axes[1, 0].set_ylabel('Performance Improvement (%)', fontsize=12)
axes[1, 0].set_title('Scaling Returns Analysis\n(Diminishing returns clearly visible)', fontsize=14, weight='bold')
axes[1, 0].grid(True, alpha=0.3)

# Add values on bars
for bar, improvement in zip(bars, improvement_factors):
    axes[1, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                    f'{improvement:.1f}%', ha='center', va='bottom', 
                    fontsize=10, weight='bold')

# 4. Residuals plot (fit quality check)
residuals = losses - power_law_function(parameters, A, alpha, B)
axes[1, 1].scatter(parameters, residuals, color='purple', s=100, alpha=0.7)
axes[1, 1].axhline(y=0, color='black', linestyle='--', alpha=0.5)
axes[1, 1].set_xlabel('Parameters', fontsize=12)
axes[1, 1].set_ylabel('Residuals (Actual - Predicted)', fontsize=12)
axes[1, 1].set_title('Fit Quality Check\n(Random scatter = good fit)', fontsize=14, weight='bold')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate practical insights
# Scaling N by a factor k multiplies the reducible loss (Loss - B) by k^(-alpha).
print(f"\n💡 PRACTICAL SCALING INSIGHTS:")
print(f"• 10x parameters → {(1 - 10**(-alpha)) * 100:.1f}% lower reducible loss")
print(f"• 100x parameters → {(1 - 100**(-alpha)) * 100:.1f}% lower reducible loss")
print(f"• Diminishing returns: Each parameter helps less than the last")
print(f"• Power law enables performance prediction before expensive training!")

print(f"\n🎯 MATHEMATICAL SIGNIFICANCE:")
print(f"• Power laws appear across nature (earthquakes, cities, biology)")
print(f"• Deep learning shows a similar empirical regularity")
print(f"• α ≈ {alpha:.3f} is the exponent fitted to these toy models")
print(f"• Fits like this are used to extrapolate from small runs to larger ones")

## 2. Chinchilla's Revolutionary Discovery

In 2022, DeepMind's Chinchilla paper (Hoffmann et al.) challenged conventional wisdom: **most large language models of the time were significantly undertrained**.

### The Game-Changing Insight

**Traditional Approach**: "Bigger models are always better"
- Train huge models on whatever data you have
- GPT-3 (175B parameters) trained on 300B tokens
- Assumption: model size is the key constraint

**Chinchilla's Discovery**: "There's an optimal data-to-parameter ratio"
- **Optimal ratio: ~20 data tokens per model parameter**
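- At that ratio, a 175B-parameter model like GPT-3 would need ~3.5T tokens (not 300B!)
- Equivalently, GPT-3's compute budget would have been better spent on a smaller model trained on more tokens
- Many large models of that era were undertrained by 10x or more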
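
The GPT-3 numbers above can be checked with a few lines of arithmetic. This sketch uses only the figures quoted in this section (175B parameters, 300B tokens, ~20 tokens per parameter); the function and variable names are introduced here for illustration:

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule-of-thumb ratio quoted above

def tokens_per_parameter(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

gpt3_params = 175e9   # 175B parameters
gpt3_tokens = 300e9   # 300B training tokens

ratio = tokens_per_parameter(gpt3_params, gpt3_tokens)
target_tokens = CHINCHILLA_TOKENS_PER_PARAM * gpt3_params
shortfall = target_tokens / gpt3_tokens

print(f"GPT-3 tokens per parameter: {ratio:.2f}")
print(f"Tokens needed at 20 per parameter: {target_tokens:.2e}")
print(f"Shortfall factor: {shortfall:.1f}x")
```

The ratio comes out below 2 tokens per parameter, roughly a tenth of the rule-of-thumb target.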

### The Science Behind the Discovery

**The Optimization Problem**: Given fixed compute budget C, how do you split it between:
- **Model size N** (number of parameters)
- **Training data D** (number of tokens)

**The Mathematical Relationship**:
```
Compute Budget: C = N × D × constant   (constant ≈ 6 training FLOPs per parameter per token)
Performance: Loss(N, D) = A × N^(-α) + B × D^(-β) + L₀
```
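
One way to get a feel for this trade-off is to fix C and scan over model sizes, letting the compute constraint determine D. The sketch below does that with a simple grid search. All constants are hypothetical values chosen to make the example concrete, and the factor of 6 FLOPs per parameter per token is an assumed value for the constant in the compute equation:

```python
import numpy as np

# Hypothetical constants, chosen only to make the example concrete.
A, ALPHA = 400.0, 0.34
B, BETA = 400.0, 0.28
L0 = 1.7
FLOPS_PER_PARAM_TOKEN = 6.0  # assumed value of the "constant" above

def loss(n_params, n_tokens):
    """Loss(N, D) = A * N^(-alpha) + B * D^(-beta) + L0."""
    return A * n_params ** (-ALPHA) + B * n_tokens ** (-BETA) + L0

def best_split(compute_flops, num_candidates=2000):
    """Grid-search model sizes N; the constraint fixes D = C / (6 * N)."""
    n_grid = np.logspace(6, 12, num_candidates)
    d_grid = compute_flops / (FLOPS_PER_PARAM_TOKEN * n_grid)
    losses = loss(n_grid, d_grid)
    i = int(np.argmin(losses))
    return n_grid[i], d_grid[i], losses[i]

n_opt, d_opt, l_opt = best_split(1e21)
print(f"Best N: {n_opt:.3e} parameters")
print(f"Best D: {d_opt:.3e} tokens")
print(f"Tokens per parameter: {d_opt / n_opt:.1f}")
print(f"Loss at optimum: {l_opt:.4f}")
```

Moving N away from the optimum in either direction raises the loss: too large a model starves it of data, too small a model wastes tokens on limited capacity. The exact tokens-per-parameter value depends entirely on the constants, so the number printed here should not be read as a real-world estimate.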

**Chinchilla's Solution**: Minimize loss subject to compute constraint
- Optimal allocation: N ∝ C^a, D ∝ C^b where a + b = 1
- Chinchilla's fits give a ≈ b ≈ 0.5, so model size and data should grow together
- Result: For every parameter, you need ~20 training tokens
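
The exponents a and b follow from the two equations above. A sketch of the derivation, writing k for the constant in the compute equation (a symbol introduced here) and substituting D = C/(kN) into the loss:

```latex
\begin{align}
L(N) &= A\,N^{-\alpha} + B\left(\frac{kN}{C}\right)^{\beta} + L_0 \\
\frac{dL}{dN} &= -\alpha A\,N^{-\alpha-1} + \beta B\left(\frac{k}{C}\right)^{\beta} N^{\beta-1} = 0 \\
N_{\text{opt}} &= \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}} \left(\frac{C}{k}\right)^{\frac{\beta}{\alpha+\beta}} \\
D_{\text{opt}} &= \frac{C}{k\,N_{\text{opt}}} = \left(\frac{\alpha A}{\beta B}\right)^{-\frac{1}{\alpha+\beta}} \left(\frac{C}{k}\right)^{\frac{\alpha}{\alpha+\beta}}
\end{align}
```

So a = β/(α+β) and b = α/(α+β), which sum to 1. When α and β are close, both exponents are near 0.5 and the tokens-per-parameter ratio stays roughly constant as compute grows.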

### Why This Ratio Exists

**Too Few Tokens** (undertrained):
- Model has far more capacity than the data can fill
- Huge capacity but insufficient information
- Like hiring a genius but teaching them nothing

**Too Many Tokens** (overtrained, for a fixed compute budget):
- Model capacity becomes the bottleneck
- Additional data yields shrinking gains
- The same compute would buy more with a larger model

**Just Right** (Chinchilla optimal):
- Model capacity is balanced against the amount of data
- Every parameter has ~20 tokens of information to learn from
- Lowest loss per unit of training compute
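
As a rough planning aid, the ~20 tokens-per-parameter rule can be combined with a compute estimate to size a model. This sketch assumes the common approximation C ≈ 6 × N × D for training FLOPs, which is an assumption added here. The helper name `chinchilla_allocation` is hypothetical:

```python
import math

def chinchilla_allocation(compute_flops, tokens_per_param=20.0,
                          flops_per_param_token=6.0):
    """Split a FLOP budget into (parameters, tokens).

    Solves C = flops_per_param_token * N * D with D = tokens_per_param * N.
    Both default constants are rule-of-thumb assumptions.
    """
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e18, 1e20, 1e22):
    n, d = chinchilla_allocation(budget)
    print(f"C = {budget:.0e} FLOPs -> N ≈ {n:.2e} params, D ≈ {d:.2e} tokens")
```

Because both N and D scale with the square root of C under this rule, a 100x larger budget buys a model that is only 10x larger, trained on 10x more tokens.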

Let's verify this principle experimentally:

In [None]:
def test_chinchilla_principle():
    """Test different model size vs training data allocation strategies."""
    
    base_config = {
        'vocab_size': 400,
        'max_seq_len': 32,
        'dropout': 0.1
    }
    
    # Different strategies for spending a "compute budget"
    # (the three budgets are only roughly comparable; see the printed values)
    strategies = {
        'Big Undertrained': {
            'config': {**base_config, 'd_model': 128, 'n_heads': 8, 'n_layers': 6, 'd_ff': 256},
            'training_steps': 25,  # Less training
            'philosophy': 'Scale model size, minimal training'
        },
        'Small Overtrained': {
            'config': {**base_config, 'd_model': 64, 'n_heads': 4, 'n_layers': 3, 'd_ff': 128},
            'training_steps': 100,  # More training
            'philosophy': 'Small model, extensive training'
        },
        'Chinchilla Optimal': {
            'config': {**base_config, 'd_model': 96, 'n_heads': 6, 'n_layers': 4, 'd_ff': 192},
            'training_steps': 60,  # Balanced
            'philosophy': 'Balanced model size and training'
        }
    }
    
    results = {}
    
    print("🏁 CHINCHILLA ALLOCATION RACE")
    print("Testing three different compute allocation strategies...\n")
    
    for strategy_name, strategy in strategies.items():
        print(f"🚀 Strategy: {strategy_name}")
        print(f"   Philosophy: {strategy['philosophy']}")
        
        # Create model
        model = GPTModel(**strategy['config']).to(device)
        param_count = count_parameters(model)
        
        # Calculate "compute budget" proxy (params × training_steps)
        compute_budget = param_count * strategy['training_steps']
        
        # Calculate tokens per parameter (Chinchilla ratio)
        # Each step processes 96 tokens (batch of 4 × sequence length 24)
        total_tokens = strategy['training_steps'] * 4 * 24
        tokens_per_param = total_tokens / param_count
        
        print(f"   Parameters: {param_count:,}")
        print(f"   Training steps: {strategy['training_steps']}")
        print(f"   Compute budget: {compute_budget:,}")
        print(f"   Tokens per parameter: {tokens_per_param:.1f}")
        
        # Train the model
        model.train()
        optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
        criterion = nn.CrossEntropyLoss()
        
        losses = []
        for step in range(strategy['training_steps']):
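            # Deterministic batch per step so every strategy sees the same data stream.
            # A local generator avoids resetting the global RNG (which also drives dropout).
            gen = torch.Generator().manual_seed(step)
            x = torch.randint(0, 400, (4, 24), generator=gen).to(device)
            # Learnable synthetic target; independent random targets cannot be learned.
            targets = (x + 1) % 400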
            
            optimizer.zero_grad()
            outputs = model(x)
            loss = criterion(outputs.reshape(-1, outputs.size(-1)), targets.reshape(-1))
            loss.backward()
            
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            
            losses.append(loss.item())
        
        # Calculate final performance
        final_loss = np.mean(losses[-10:])
        
        results[strategy_name] = {
            'parameters': param_count,
            'training_steps': strategy['training_steps'],
            'compute_budget': compute_budget,
            'tokens_per_param': tokens_per_param,
            'final_loss': final_loss,
            'losses': losses,
            'philosophy': strategy['philosophy']
        }
        
        print(f"   Final loss: {final_loss:.4f}")
        print(f"   Efficiency: {final_loss * compute_budget:.0f} (lower = better)\n")
        
        # Cleanup
        del model
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    
    return results

# Run the Chinchilla experiment
print("🎯 TESTING CHINCHILLA'S PRINCIPLE")
print("This experiment tests optimal resource allocation...\n")

chinchilla_results = test_chinchilla_principle()

# Find the winner
winner = min(chinchilla_results.items(), key=lambda x: x[1]['final_loss'])
print(f"🏆 WINNER: {winner[0]}")
print(f"   Final loss: {winner[1]['final_loss']:.4f}")
print(f"   This {'is consistent with' if 'Chinchilla' in winner[0] else 'does not match'} Chinchilla's principle in this toy setting.")

In [None]:
# Visualize Chinchilla Results

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

colors = ['red', 'blue', 'green']
strategy_names = list(chinchilla_results.keys())

# 1. Training curves comparison
for i, (name, result) in enumerate(chinchilla_results.items()):
    axes[0, 0].plot(result['losses'], linewidth=3, color=colors[i], 
                    label=f"{name} (Final: {result['final_loss']:.3f})",
                    marker='o', markersize=4, markevery=max(1, len(result['losses']) // 10))

axes[0, 0].set_title('Training Curves: Resource Allocation Strategies', fontsize=14, weight='bold')
axes[0, 0].set_xlabel('Training Steps')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Final performance comparison
final_losses = [chinchilla_results[name]['final_loss'] for name in strategy_names]
bars = axes[0, 1].bar(strategy_names, final_losses, color=colors, alpha=0.7, 
                      edgecolor='black', linewidth=2)

axes[0, 1].set_title('Final Performance Comparison', fontsize=14, weight='bold')
axes[0, 1].set_ylabel('Final Loss (Lower = Better)')
axes[0, 1].tick_params(axis='x', rotation=45)

# Add values on bars
for bar, loss in zip(bars, final_losses):
    axes[0, 1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
                    f'{loss:.3f}', ha='center', va='bottom', 
                    fontsize=12, weight='bold')

axes[0, 1].grid(True, alpha=0.3)

# 3. Compute efficiency analysis
compute_budgets = [chinchilla_results[name]['compute_budget'] for name in strategy_names]
efficiency_scores = [loss * budget for loss, budget in zip(final_losses, compute_budgets)]

bars = axes[1, 0].bar(strategy_names, efficiency_scores, color=colors, alpha=0.7,
                      edgecolor='black', linewidth=2)

axes[1, 0].set_title('Compute Efficiency\n(Loss × Compute Budget, Lower = Better)', fontsize=14, weight='bold')
axes[1, 0].set_ylabel('Efficiency Score')
axes[1, 0].tick_params(axis='x', rotation=45)

for bar, score in zip(bars, efficiency_scores):
    axes[1, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(efficiency_scores)*0.01,
                    f'{score:.0f}', ha='center', va='bottom', 
                    fontsize=12, weight='bold')

axes[1, 0].grid(True, alpha=0.3)

# 4. Tokens per parameter analysis (Chinchilla ratio)
tokens_per_param = [chinchilla_results[name]['tokens_per_param'] for name in strategy_names]

bars = axes[1, 1].bar(strategy_names, tokens_per_param, color=colors, alpha=0.7,
                      edgecolor='black', linewidth=2)

# Add Chinchilla optimal line
axes[1, 1].axhline(y=20, color='black', linestyle='--', linewidth=3, 
                   label='Chinchilla Optimal (~20)')

axes[1, 1].set_title('Data Efficiency\n(Tokens per Parameter)', fontsize=14, weight='bold')
axes[1, 1].set_ylabel('Tokens per Parameter')
axes[1, 1].tick_params(axis='x', rotation=45)
axes[1, 1].legend()

for bar, ratio in zip(bars, tokens_per_param):
    axes[1, 1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                    f'{ratio:.1f}', ha='center', va='bottom', 
                    fontsize=12, weight='bold')

axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Analyze results
print("\n🔍 CHINCHILLA ANALYSIS:")
for name, result in chinchilla_results.items():
    chinchilla_score = abs(result['tokens_per_param'] - 20)  # Distance from optimal
    print(f"\n📊 {name}:")
    print(f"   Philosophy: {result['philosophy']}")
    print(f"   Tokens/param: {result['tokens_per_param']:.1f} (optimal ≈ 20)")
    print(f"   Chinchilla score: {chinchilla_score:.1f} (lower = closer to optimal)")
    print(f"   Final performance: {result['final_loss']:.4f}")

print("\n💡 KEY INSIGHTS:")
print("• Balanced allocation often outperforms extreme strategies")
print("• 'Bigger is always better' is a costly myth")
print("• Training data quality and quantity matter enormously")
print("• Chinchilla ratio (~20 tokens/param) provides guidance")

print("\n💰 PRACTICAL IMPLICATIONS:")
print("• Many pre-Chinchilla models (e.g., GPT-3) were undertrained for their size")
print("• Smaller, well-trained models often beat larger ones")
print("• Focus budget on high-quality, large-scale datasets")
print("• Question the 'parameter count arms race' mentality")

## 3. Emergence: When Capabilities Suddenly Appear

Some abilities don't improve gradually - they **suddenly appear** at critical model sizes. This phenomenon, called **emergence**, is one of the most striking discoveries in scaling laws.

### The Mystery of Emergent Capabilities

**Examples from Real Models**:
- **Few-shot learning**: GPT-3 could pick up new tasks from just a few in-context examples
- **Chain-of-thought reasoning**: Step-by-step problem solving was reported to pay off mainly in ~100B parameter models
- **Code generation**: Ability to write working programs reportedly emerged around 6B parameters
- **Mathematical reasoning**: Complex math skills appeared abruptly on benchmarks, not gradually

### The Physics of Phase Transitions

**Phase Transition Theory**: Like water freezing at 0°C:

1. **Below Critical Point**: Not enough representational capacity
   - Model cannot form the necessary internal representations
   - Performance remains at chance level
   - Additional training helps little

2. **At Critical Point**: Just enough capacity for pattern to "click"
   - Sudden reorganization of learned representations
   - Dramatic performance jump
   - Phase transition occurs

3. **Above Critical Point**: Pattern is mastered
   - Continued improvement with additional parameters
   - New capability is stable and reliable

### Why Emergence Happens

**Threshold Effects**: Some cognitive abilities require minimum "representational complexity"
- **Compositionality**: Understanding that concepts can be combined
- **Abstraction**: Recognizing patterns across different contexts  
- **Multi-step reasoning**: Chaining together multiple inference steps

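**Mathematical Insight**: Measured with all-or-nothing metrics, these abilities can look like step functions of model size with sharp transitions rather than smooth curves, even when the underlying loss improves gradually.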
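
One simple mechanism that can produce such a sharp curve, offered as an illustration rather than an established explanation: if a task only counts as solved when several steps are all correct, a smooth gain in per-step accuracy turns into an abrupt gain in task accuracy. The sketch below assumes the steps succeed independently, which is a simplification, and the function name is introduced here:

```python
import numpy as np

def exact_match_rate(per_step_accuracy, num_steps):
    """Probability that all `num_steps` independent steps are correct."""
    return per_step_accuracy ** num_steps

# Smoothly improving per-step accuracy, standing in for growing model size.
per_step = np.linspace(0.50, 0.99, 8)

for num_steps in (1, 5, 20):
    rates = exact_match_rate(per_step, num_steps)
    row = "  ".join(f"{r:.2f}" for r in rates)
    print(f"{num_steps:>2} steps: {row}")
```

For a single step the task accuracy rises steadily. For 20 steps it stays near zero across most of the range and climbs only at the very end, which looks like a threshold even though nothing discontinuous happened underneath.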

Let's observe emergence with a pattern recognition task:

In [None]:
def create_emergence_task():
    """Create a task that demonstrates emergent behavior - arithmetic sequence completion."""
    
    def generate_arithmetic_sequences(num_samples=200):
        """Generate arithmetic sequences: 2, 5, 8, 11, ? → 14"""
        sequences = []
        
        for _ in range(num_samples):
            # Create arithmetic sequence: start + n*step
            start = np.random.randint(1, 15)
            step = np.random.randint(1, 4)
            
            # Generate sequence of length 5
            sequence = [start + i * step for i in range(5)]
            
            # Input: first 4 numbers, target: 5th number
            input_seq = sequence[:4]
            target = sequence[4]
            
            sequences.append((input_seq, target))
        
        return sequences
    
    def test_arithmetic_ability(model, test_sequences):
        """Test model's ability to complete arithmetic sequences."""
        model.eval()
        correct = 0
        total = len(test_sequences)
        
        with torch.no_grad():
            for input_seq, target in test_sequences:
                # Convert to tensor with proper range
                # Map numbers to vocabulary indices (add offset to avoid special tokens)
                input_tensor = torch.tensor([input_seq], device=device) + 10
                
                # Get model prediction
                outputs = model(input_tensor)
                prediction = outputs[0, -1, :].argmax().item() - 10  # Remove offset
                
                # Allow tolerance for close answers (emergence isn't perfect)
                if abs(prediction - target) <= 2:
                    correct += 1
        
        return correct / total
    
    # Create test sequences
    test_sequences = generate_arithmetic_sequences(100)
    
    # Test family of models with increasing size
    model_family = {
        'micro': {'vocab_size': 100, 'd_model': 32, 'n_heads': 2, 'n_layers': 2, 'd_ff': 64, 'max_seq_len': 16, 'dropout': 0.1},
        'tiny': {'vocab_size': 100, 'd_model': 48, 'n_heads': 3, 'n_layers': 3, 'd_ff': 96, 'max_seq_len': 16, 'dropout': 0.1},
        'small': {'vocab_size': 100, 'd_model': 64, 'n_heads': 4, 'n_layers': 4, 'd_ff': 128, 'max_seq_len': 16, 'dropout': 0.1},
        'medium': {'vocab_size': 100, 'd_model': 80, 'n_heads': 5, 'n_layers': 5, 'd_ff': 160, 'max_seq_len': 16, 'dropout': 0.1},
        'large': {'vocab_size': 100, 'd_model': 96, 'n_heads': 6, 'n_layers': 6, 'd_ff': 192, 'max_seq_len': 16, 'dropout': 0.1}
    }
    
    emergence_results = {'names': [], 'parameters': [], 'accuracies': []}
    
    print("🔬 SEARCHING FOR EMERGENT CAPABILITIES")
    print("Task: Complete arithmetic sequences (e.g., 2, 5, 8, 11, ? → 14)\n")
    
    for name, config in model_family.items():
        print(f"🧪 Testing {name} model...")
        
        model = GPTModel(**config).to(device)
        param_count = count_parameters(model)
        
        # Train on arithmetic sequence completion
        train_sequences = generate_arithmetic_sequences(300)
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
        criterion = nn.CrossEntropyLoss()
        
        model.train()
        # More intensive training for this complex reasoning task
        for epoch in range(20):
            epoch_losses = []
            for input_seq, target in train_sequences:
                # Add offset to avoid special token conflicts
                x = torch.tensor([input_seq], device=device) + 10
                y = torch.tensor([target + 10], device=device)
                
                optimizer.zero_grad()
                outputs = model(x)
                loss = criterion(outputs[0, -1:, :], y)
                loss.backward()
                
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
                optimizer.step()
                
                epoch_losses.append(loss.item())
            
            # Early stopping if converged
            if epoch > 5 and np.mean(epoch_losses) < 0.1:
                break
        
        # Test arithmetic reasoning ability
        accuracy = test_arithmetic_ability(model, test_sequences)
        
        emergence_results['names'].append(name)
        emergence_results['parameters'].append(param_count)
        emergence_results['accuracies'].append(accuracy)
        
        print(f"   Parameters: {param_count:,}")
        print(f"   Sequence completion accuracy: {accuracy:.3f}")
        
        # Cleanup
        del model
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    
    return emergence_results

# Run emergence experiment
print("🚀 HUNTING FOR EMERGENCE PHENOMENA")
print("This may take several minutes - we're looking for phase transitions!\n")

emergence_data = create_emergence_task()
print("\n✅ Emergence hunt complete!")

In [None]:
# Visualize Emergence Phenomena

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

parameters = np.array(emergence_data['parameters'])
accuracies = np.array(emergence_data['accuracies'])

# 1. Main emergence plot - linear scale
axes[0, 0].plot(parameters, accuracies, 'o-', markersize=12, linewidth=4, 
                color='purple', markeredgecolor='indigo', markeredgewidth=2)  # 'darkpurple' is not a named Matplotlib color

# Add model labels
for i, name in enumerate(emergence_data['names']):
    axes[0, 0].annotate(name, (parameters[i], accuracies[i]), 
                        xytext=(10, 10), textcoords='offset points', 
                        fontsize=11, weight='bold')

# Add capability zones
axes[0, 0].axhspan(0, 0.3, alpha=0.2, color='red', label='No Capability')
axes[0, 0].axhspan(0.3, 0.7, alpha=0.2, color='yellow', label='Emerging')
axes[0, 0].axhspan(0.7, 1.0, alpha=0.2, color='green', label='Mastered')

axes[0, 0].set_xlabel('Parameters', fontsize=12)
axes[0, 0].set_ylabel('Arithmetic Reasoning Accuracy', fontsize=12)
axes[0, 0].set_title('Emergent Capability: Arithmetic Reasoning\n(Sharp transitions indicate phase changes)', fontsize=14, weight='bold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].set_ylim(0, 1)

# 2. Log scale view to see emergence threshold clearly
axes[0, 1].semilogx(parameters, accuracies, 'o-', markersize=12, linewidth=4, 
                    color='orange', markeredgecolor='darkorange', markeredgewidth=2)

axes[0, 1].set_xlabel('Parameters (log scale)', fontsize=12)
axes[0, 1].set_ylabel('Accuracy', fontsize=12)
axes[0, 1].set_title('Phase Transition View\n(Log scale reveals critical points)', fontsize=14, weight='bold')
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].set_ylim(0, 1)

# Find and mark emergence threshold
emergence_threshold = 0.6
emergence_param = None

for i, acc in enumerate(accuracies):
    if acc >= emergence_threshold:
        emergence_param = parameters[i]
        axes[0, 1].axvline(x=emergence_param, color='red', linestyle='--', linewidth=3,
                          label=f'Emergence at {emergence_param:,} params')
        break

if emergence_param is not None:
    axes[0, 1].legend()

# 3. Performance improvement rate (derivative)
if len(accuracies) > 1:
    # Calculate rate of improvement between models
    param_diffs = np.diff(parameters)
    acc_diffs = np.diff(accuracies)
    improvement_rates = acc_diffs / param_diffs
    
    # Plot at midpoints
    param_midpoints = (parameters[:-1] + parameters[1:]) / 2
    
    axes[1, 0].plot(param_midpoints, improvement_rates, 'o-', markersize=10, 
                    linewidth=3, color='green', markeredgecolor='darkgreen')
    
    axes[1, 0].set_xlabel('Parameters', fontsize=12)
    axes[1, 0].set_ylabel('Improvement Rate\n(Accuracy per Parameter)', fontsize=12)
    axes[1, 0].set_title('Emergence Detection\n(Spikes show phase transitions)', fontsize=14, weight='bold')
    axes[1, 0].grid(True, alpha=0.3)

# 4. Capability comparison chart
capability_labels = ['Random\nGuessing', 'Pattern\nRecognition', 'Arithmetic\nReasoning']
# Accuracy cut-offs between levels: below 0.2 is level 0,
# 0.2 up to 0.5 is level 1, and 0.5 or above is level 2
capability_thresholds = [0.2, 0.5]

# np.digitize counts how many cut-offs each accuracy has reached (0, 1, or 2)
model_capabilities = np.digitize(accuracies, capability_thresholds)

# One full-width horizontal bar per model, colored by capability level
model_names = emergence_data['names']
y_pos = np.arange(len(model_names))

colors_cap = ['red', 'yellow', 'green']
for i, (name, cap_level) in enumerate(zip(model_names, model_capabilities)):
    axes[1, 1].barh(i, 1, color=colors_cap[cap_level], alpha=0.7, 
                    edgecolor='black', linewidth=1)
    axes[1, 1].text(0.5, i, capability_labels[cap_level], 
                    ha='center', va='center', fontsize=10, weight='bold')

axes[1, 1].set_yticks(y_pos)
axes[1, 1].set_yticklabels(model_names)
axes[1, 1].set_xlabel('Capability Level')
axes[1, 1].set_title('Emergent Capability Levels\n(Discrete jumps in ability)', fontsize=14, weight='bold')
axes[1, 1].set_xlim(0, 1)

plt.tight_layout()
plt.show()

# Analyze emergence patterns
print("\n🔍 EMERGENCE ANALYSIS:")

if emergence_param:
    print(f"🚀 EMERGENCE DETECTED!")
    print(f"   Critical parameter threshold: ~{emergence_param:,}")
    initial_acc = accuracies[0]
    final_acc = max(accuracies)
    print(f"   Performance jump: {initial_acc:.3f} → {final_acc:.3f}")
    # Guard the ratio: the smallest model may score exactly zero
    if initial_acc > 0:
        print(f"   Improvement factor: {final_acc/initial_acc:.1f}x")
    print(f"   This demonstrates phase transition behavior!")
else:
    print("⏳ No sharp emergence detected in this parameter range")
    print("   Try larger models or different tasks")
    print("   Some capabilities need even more parameters")

print(f"\n💡 EMERGENCE INSIGHTS:")
print(f"• Some capabilities require minimum representational capacity")
print(f"• Performance can jump discontinuously, not smoothly")
print(f"• Neural networks exhibit phase transition phenomena")
print(f"• Scaling can unlock qualitatively new abilities")
print(f"• Emergence thresholds vary by task complexity")

print(f"\n🎯 STRATEGIC IMPLICATIONS:")
print(f"• Plan minimum model sizes for specific capabilities")
print(f"• Some abilities cannot be predicted from smaller models")
print(f"• Emergence explains why scaling sometimes yields surprises")
print(f"• Critical scales exist - below them, some capabilities do not appear")

## Summary: Your Scaling Laws Mastery

You've discovered the three fundamental laws that govern AI performance and learned to wield them strategically.

### 📊 The Three Laws of Scaling

**1. Power Law of Performance**
```
Loss = A × (Parameters)^(-α) + B
```
- **α ≈ 0.1-0.3**: Rough range for transformer scaling exponents; the fitted value depends on the dataset, architecture, and fitting procedure
- **Use Case**: Predict performance before expensive training
- **Key Insight**: Diminishing returns are mathematically predictable
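
As a rough sketch of how such a fit can be done, the snippet below recovers `A`, `α`, and `B` from a handful of (parameters, loss) pairs. It is an illustration under stated assumptions, not this notebook's own fitting code: `fit_power_law` is a hypothetical helper, the data are synthetic and noise-free, and the method (scan candidate values of `B`, then fit a straight line in log-log space with `np.polyfit`) is swapped in for a nonlinear optimizer such as `scipy.optimize.curve_fit` because it needs no initial guess.

```python
import numpy as np

def fit_power_law(params, losses, n_grid=400):
    """Fit loss = A * params**(-alpha) + B.

    Hypothetical helper: scans candidate values of B and, for each one,
    fits a straight line to log(loss - B) versus log(params).
    """
    params = np.asarray(params, dtype=float)
    losses = np.asarray(losses, dtype=float)
    log_n = np.log(params)

    best = None
    # B must stay below the smallest observed loss so log(loss - B) is defined.
    for b in np.linspace(0.0, 0.999 * losses.min(), n_grid):
        log_excess = np.log(losses - b)
        slope, intercept = np.polyfit(log_n, log_excess, 1)
        residual = np.sum((slope * log_n + intercept - log_excess) ** 2)
        if best is None or residual < best[0]:
            best = (residual, np.exp(intercept), -slope, b)

    _, a, alpha, b = best
    return a, alpha, b

# Synthetic, noise-free data generated from assumed values A=50, alpha=0.2, B=1.5
model_sizes = np.array([1e5, 1e6, 1e7, 1e8, 1e9])
observed_losses = 50.0 * model_sizes ** (-0.2) + 1.5

a_fit, alpha_fit, b_fit = fit_power_law(model_sizes, observed_losses)

# Extrapolate one order of magnitude beyond the largest "trained" model
predicted_loss_10b = a_fit * 1e10 ** (-alpha_fit) + b_fit
print(f"alpha ~ {alpha_fit:.3f}, B ~ {b_fit:.3f}, "
      f"predicted loss at 1e10 params ~ {predicted_loss_10b:.3f}")
```

With real, noisy measurements the recovered `B` is sensitive to the largest models in the sample, so treat extrapolations far beyond the fitted range with caution.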

**2. Chinchilla's Optimal Ratio**
```
Optimal: ~20 data tokens per model parameter
```
- **Discovery**: Most large models are severely undertrained
- **Use Case**: Optimize compute budget allocation
- **Key Insight**: Data scaling often beats parameter scaling
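
To make the ratio concrete, here is a minimal sketch that splits a fixed compute budget between parameters and tokens. It assumes the common approximation that training a dense transformer costs about 6 FLOPs per parameter per token (C ≈ 6·N·D), which is not stated in this notebook, and `chinchilla_allocation` is a hypothetical helper name.

```python
import math

def chinchilla_allocation(compute_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a training compute budget into (parameters, tokens).

    Assumes C ~= flops_per_param_token * N * D and D = tokens_per_param * N,
    so N = sqrt(C / (flops_per_param_token * tokens_per_param)).
    """
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example budget chosen so the numbers come out round
n_params, n_tokens = chinchilla_allocation(1.2e20)
print(f"{n_params:.2e} parameters, {n_tokens:.2e} tokens")
```

Under these assumptions a budget of 1.2e20 FLOPs works out to roughly one billion parameters trained on about twenty billion tokens. Because both quantities grow with the square root of compute, a 100x larger budget calls for a model about 10x larger trained on about 10x more tokens.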

**3. Emergence Thresholds**
```
Capability = 0 if Parameters < Critical_Scale else Function(Parameters)
```
- **Phenomenon**: Phase transitions unlock new abilities
- **Use Case**: Plan minimum model sizes for specific tasks
- **Key Insight**: Some abilities have hard requirements
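
The threshold rule can be turned into a small check on measured results. The sketch below is illustrative only: `find_emergence_point` is a hypothetical helper, the 0.6 accuracy cut-off mirrors the `emergence_threshold` used in the plotting cell, and the sample numbers are made up.

```python
import numpy as np

def find_emergence_point(parameters, accuracies, threshold=0.6):
    """Return the smallest parameter count whose accuracy reaches `threshold`, or None."""
    parameters = np.asarray(parameters)
    accuracies = np.asarray(accuracies, dtype=float)
    order = np.argsort(parameters)  # scan models from smallest to largest
    reached = accuracies[order] >= threshold
    if not reached.any():
        return None
    first = order[np.argmax(reached)]  # argmax gives the first True position
    return int(parameters[first])

# Made-up example: accuracy stays near chance, then jumps between 1M and 10M parameters
sizes = [100_000, 1_000_000, 10_000_000, 100_000_000]
accs = [0.05, 0.08, 0.72, 0.91]
print(find_emergence_point(sizes, accs))        # smallest size at or above 0.6
print(find_emergence_point(sizes, accs, 0.95))  # None: threshold never reached
```

The result only brackets the critical scale: the true threshold lies somewhere between the last model below the cut-off and the first one above it, so denser sampling of model sizes narrows the estimate.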

### 💰 Strategic Applications

**Performance Prediction**:
```python
# Estimate parameters needed for a target loss by inverting
# Loss = A * N**(-alpha) + B (A, alpha, B come from your fitted power law)
target_loss = 2.0
estimated_params = (A / (target_loss - B)) ** (1 / alpha)  # requires target_loss > B
```
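
Solving Loss = A × N^(-α) + B for N gives N = (A / (Loss - B))^(1/α). As a worked sketch with made-up coefficients (A = 50, α = 0.2, and B = 1.5 are illustrative values, not fitted results from this notebook, and `params_for_target_loss` is a hypothetical helper):

```python
def params_for_target_loss(target_loss, a, alpha, b):
    """Invert loss = a * N**(-alpha) + b to get the parameter count N."""
    if target_loss <= b:
        # b is the irreducible loss: no finite model size reaches it
        raise ValueError("target_loss must be greater than the irreducible loss b")
    return (a / (target_loss - b)) ** (1.0 / alpha)

# Illustrative coefficients only: (50 / 0.5) ** 5 = 100 ** 5
needed = params_for_target_loss(target_loss=2.0, a=50.0, alpha=0.2, b=1.5)
print(f"Estimated parameters needed: {needed:.3g}")
```

The guard matters in practice: a target at or below `B` has no solution, which is the power law's way of saying that scaling parameters alone cannot reach it.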

**Resource Optimization**:
```python
# Chinchilla-optimal allocation
optimal_tokens = model_parameters * 20
training_steps = optimal_tokens / (batch_size * sequence_length)
```

**Capability Planning**:
```python
# Check if model can develop target capability
if model_parameters >= emergence_threshold[capability]:
    print("Capability possible with sufficient training")
else:
    print("Need larger model for this capability")
```

### 🎯 Decision Framework

**Fixed Compute Budget**:
- Apply Chinchilla principles: smaller model + more data
- Check emergence thresholds for required capabilities
- Use power law to predict final performance

**Fixed Time Constraint**:
- Use largest model that fits compute budget
- Ensure minimum data for stable training
- Plan for diminishing returns at large scales

**New Capability Target**:
- Research emergence thresholds for the capability
- Budget for minimum model size requirement
- Plan data collection for Chinchilla-optimal training

### 🚀 Strategic Implications

**For AI Researchers**:
- **Focus on data scaling**, not just model scaling
- **Anticipate emergent capabilities** by tracking task performance across a range of model sizes
- **Use scaling laws** to optimize research directions
- **Plan experiments** with mathematical precision

**For AI Practitioners**:
- **Question "bigger is better"** assumptions
- **Invest heavily** in high-quality, large-scale datasets
- **Apply Chinchilla principles** to compute budgets
- **Plan capabilities** based on emergence thresholds

**For AI Strategy**:
- **Data is the new moat** - quality datasets are increasingly valuable
- **Compute allocation** can be mathematically optimized
- **Capability prediction** enables strategic planning
- **Scaling laws** reveal the physics of intelligence

### 🧠 The Deep Truth

You've learned that AI progress isn't random - it follows **mathematical laws** as fundamental as those governing physics. These laws enable:

- **Performance prediction** without expensive experiments
- **Resource optimization** for maximum efficiency  
- **Capability forecasting** for strategic planning
- **Scientific understanding** of intelligence scaling

**The scaling laws are your crystal ball** - use them to see the future of AI and build it more efficiently! 🔮📈