# Optimizing PyTorch Kernels with Triton

This notebook explores how PyTorch uses Triton to generate optimized kernels for CUDA-enabled GPUs. We look at how `torch.compile` compiles and executes fused operations, using LayerNorm + GELU as the primary example.

The process will involve:
1. Understanding the baseline PyTorch implementation.
2. Examining the Triton-generated code.
3. Building a pipeline to benchmark different implementations.
4. Exploring techniques for further kernel tuning.

Finally, we will apply this pipeline and knowledge to optimize other common patterns like:
- Softmax + Dropout
- RMSNorm
- Sigmoid * x (Swish/SiLU)

## Kernel Organization Structure

Generated kernels will be saved in an organized manner:
```
./triton_kernels/
├── experiment_1/          # Organized by experiment name
│   ├── kernel_001.py      # Generated kernels
│   ├── kernel_002.py
│   ├── other artifacts from pytorch compile
│   └── ...
└── experiment_2/
    ├── kernel_003.py
    ├── other artifacts from pytorch compile
    └── ...
```

## Prerequisites
- PyTorch with CUDA support
- Triton GPU compiler
- NVIDIA GPU with CUDA capability
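
A quick way to confirm these prerequisites before running the cells below is sketched here. The helper name `check_prerequisites` is hypothetical (it is not part of the notebook); it only checks whether the `torch` and `triton` packages can be found and, if PyTorch is present, whether it reports a usable CUDA device.

```python
import importlib.util


def check_prerequisites():
    """Report which prerequisites are available (hypothetical helper)."""
    status = {
        # find_spec locates a package without importing it
        "torch": importlib.util.find_spec("torch") is not None,
        "triton": importlib.util.find_spec("triton") is not None,
        "cuda": False,
    }
    if status["torch"]:
        import torch  # imported lazily so the check also runs without PyTorch

        status["cuda"] = bool(torch.cuda.is_available())
    return status


status = check_prerequisites()
for name, ok in status.items():
    print(f"{name}: {'available' if ok else 'missing'}")
```

If `cuda` is reported as missing, the notebook still runs on CPU, but Triton GPU kernels will not be generated.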

In [None]:
# Advanced Triton Kernel Optimization Setup
import os
import torch
import time
import gc
import shutil
import glob
from pathlib import Path
from typing import Dict, List, Tuple
import json
import datetime

class ExperimentManager:
    """Manages experiment directories and kernel organization for advanced optimization"""
    
    def __init__(self, base_dir="./triton_kernels"):
        self.base_dir = Path(base_dir).resolve()
        self.base_dir.mkdir(parents=True, exist_ok=True)
        self.current_experiment = None
        self.experiment_counter = 1
        
    def create_experiment(self, name: str = None) -> Path:
        """Create a new experiment directory for kernel analysis"""
        if name is None:
            name = f"experiment_{self.experiment_counter}"
            self.experiment_counter += 1
        
        experiment_path = self.base_dir / name
        experiment_path.mkdir(parents=True, exist_ok=True)
        self.current_experiment = experiment_path
        
        metadata = {
            "experiment_name": name,
            "created_at": datetime.datetime.now().isoformat(),
            "description": "",
            "kernels": []
        }
        
        with open(experiment_path / "metadata.json", "w") as f:
            json.dump(metadata, f, indent=2)
        
        print(f"📁 Created experiment: {experiment_path}")
        return experiment_path
    
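    def set_experiment_cache(self, experiment_path: Path):
        """Point the Inductor and Triton caches at the experiment directory"""
        # Inductor writes its generated Python kernel files under TORCHINDUCTOR_CACHE_DIR;
        # TRITON_CACHE_DIR only controls where Triton stores its compiled artifacts.
        # Some PyTorch versions read these once, so set them before the first compile.
        os.environ["TORCHINDUCTOR_CACHE_DIR"] = str(experiment_path)
        os.environ["TRITON_CACHE_DIR"] = str(experiment_path)
        print(f"🔧 Set cache directory: {experiment_path}")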
    
    def save_kernel_metadata(self, kernel_info: dict):
        """Save metadata about generated kernels"""
        if self.current_experiment is None:
            return
        
        metadata_file = self.current_experiment / "metadata.json"
        if metadata_file.exists():
            with open(metadata_file, "r") as f:
                metadata = json.load(f)
        else:
            metadata = {"kernels": []}
        
        metadata["kernels"].append(kernel_info)
        
        with open(metadata_file, "w") as f:
            json.dump(metadata, f, indent=2)

# Simplified setup for advanced optimization experiments
def setup_advanced_optimization():
    """Configure environment for advanced kernel optimization experiments"""
    
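    # Enable output code logging for kernel analysis.
    # TORCH_LOGS is normally read when torch is imported, so setting it here mainly
    # affects subprocesses; set_logs applies the same setting to the current session.
    os.environ["TORCH_LOGS"] = "output_code"
    os.environ["TRITON_PRINT_AUTOTUNING"] = "1"
    torch._logging.set_logs(output_code=True)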
    
    print("🚀 Advanced Triton Kernel Optimization Environment")
    print("=" * 60)
    
    # Detect device
    if torch.cuda.is_available():
        device = "cuda"
        print(f"✅ CUDA: {torch.cuda.get_device_name(0)}")
        print(f"   Compute capability: {torch.cuda.get_device_capability(0)}")
    else:
        device = "cpu"
        print("⚠️  Using CPU (CUDA not available)")
    
    return device

def clear_compilation_cache():
    """Clear PyTorch compilation cache for fresh experiments"""
    torch._dynamo.reset()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

def find_triton_kernels(search_dirs=None):
    """Search for generated Triton kernel files"""
    
    if search_dirs is None:
        search_dirs = [
            "./triton_kernels",
            f"/tmp/torchinductor_{os.getenv('USER', 'user')}/",
            "/tmp/torchinductor_alibina/",
            "/tmp/triton/"
        ]
    
    kernel_files = []
    
    for cache_dir in search_dirs:
        cache_path = Path(cache_dir)
        if cache_path.exists():
            for file_path in cache_path.rglob("*.py"):
                try:
                    content = file_path.read_text()
                    triton_patterns = [
                        '@triton.jit', 'triton_per_fused', 'triton_poi_fused',
                        'import triton', 'tl.load', 'tl.store'
                    ]
                    
                    if any(pattern in content for pattern in triton_patterns):
                        kernel_files.append((str(file_path), content))
                except Exception:
                    continue
    
    return kernel_files

# Initialize environment for advanced optimization
experiment_manager = ExperimentManager("./triton_kernels")
device = setup_advanced_optimization()
clear_compilation_cache()

print(f"\n✅ Ready for advanced kernel optimization experiments!")
print(f"📂 Kernels will be organized in: {experiment_manager.base_dir}")

## Advanced Kernel Fusion Experiments

This section focuses on advanced kernel optimization patterns using PyTorch + Triton. Each experiment demonstrates different fusion strategies and their performance benefits.

### 🎯 Experiment Overview

| Experiment | Pattern | Focus | Learning Objective |
|------------|---------|-------|-------------------|
| **Experiment 1** | LayerNorm + GELU | Sequential fusion | Fundamental fusion concepts |
| **Experiment 2** | Softmax + Dropout | Reduction + element-wise | Attention mechanism optimization |
| **Experiment 3** | RMSNorm | Modern normalization | Alternative approaches |
| **Experiment 4** | SiLU/Swish variants | Implementation comparison | Built-in vs custom optimization |

### 🧮 Mathematical Background

**Layer Normalization:**
```
LayerNorm(x) = γ * (x - μ) / √(σ² + ε) + β
where μ = mean(x), σ² = var(x) (biased), ε = small constant for numerical stability
```
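
As a reference for what the kernel has to compute per row, here is a minimal plain-Python sketch of layer normalization over a single vector. It assumes the biased (population) variance and an `eps` added under the square root, which is how `nn.LayerNorm` is documented; the function name `layer_norm` and the sample row are illustrative only.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one row: gamma * (x - mean) / sqrt(var + eps) + beta."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n  # biased variance (divide by n)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


row = [1.0, 2.0, 3.0, 4.0]
out = layer_norm(row, gamma=[1.0] * 4, beta=[0.0] * 4)
print(out)
```

With γ = 1 and β = 0 the output has zero mean and (up to the effect of ε) unit variance, which is a handy sanity check for any hand-written or generated kernel.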

**GELU Activation:**
```
GELU(x) = x * Φ(x) = x * 0.5 * (1 + erf(x/√2))
≈ x * 0.5 * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))
```
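
The two GELU forms above can be compared numerically. The sketch below uses only the standard library; `gelu_exact` and `gelu_tanh` are illustrative names. In PyTorch, `F.gelu` uses the exact erf form by default and the tanh form when called with `approximate="tanh"`.

```python
import math


def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Largest gap between the two forms on a grid over [-5, 5]
max_gap = max(abs(gelu_exact(i / 100) - gelu_tanh(i / 100)) for i in range(-500, 501))
print(f"max |exact - tanh| on [-5, 5]: {max_gap:.2e}")
```

The gap is small but not zero, so a kernel using one form should not be expected to match a reference using the other at tight tolerances.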

### 🔄 Fusion Strategy

**Without Fusion (separate kernels):**
1. Load input → Compute LayerNorm → Store intermediate
2. Load intermediate → Compute GELU → Store final

**With Fusion (combined kernel):**
1. Load input → Compute LayerNorm + GELU → Store final

Fusion removes the store and reload of the intermediate tensor through global memory. Both operations are memory-bound, so cutting that traffic is where most of the speedup comes from.
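
A back-of-the-envelope estimate makes the saving concrete. The sketch below assumes an idealized model in which each kernel reads and writes every element exactly once, and it ignores the γ/β parameters, per-row statistics, and cache effects. The helper name `fusion_traffic_bytes` is hypothetical; the shape 32 × 512 × 768 in float32 is this notebook's default test configuration.

```python
def fusion_traffic_bytes(num_elements, bytes_per_element=4):
    """Idealized global-memory traffic for the two strategies."""
    # Unfused: read input, write intermediate, read intermediate, write output
    unfused = 4 * num_elements * bytes_per_element
    # Fused: read input, write output
    fused = 2 * num_elements * bytes_per_element
    return unfused, fused


n = 32 * 512 * 768  # batch x sequence x hidden
unfused, fused = fusion_traffic_bytes(n)
print(f"unfused: {unfused / 1024**2:.0f} MiB, fused: {fused / 1024**2:.0f} MiB")
```

Under these assumptions fusion halves the traffic, so for a purely memory-bound pair of operations a speedup of up to roughly 2x is the upper bound to expect; measured numbers will differ.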

### 📏 Test Configuration

- **Batch Size**: 32 (typical training batch)
- **Sequence Length**: 512 (BERT-base length)  
- **Hidden Dimension**: 768 (BERT-base width)

In [None]:
# Experiment 1: LayerNorm + GELU Fusion
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class LayerNormGELU(nn.Module):
    """LayerNorm followed by GELU - a prime candidate for kernel fusion"""
    
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        self.layer_norm = nn.LayerNorm(normalized_shape, eps=eps)
        
    def forward(self, x):
        # Step 1: Apply layer normalization (in eager mode this materializes an intermediate tensor)
        normalized = self.layer_norm(x)
        # Step 2: Apply GELU activation (reads the intermediate back and writes the output)
        output = F.gelu(normalized)
        return output

def create_test_tensors(batch_size=32, seq_len=512, hidden_dim=768):
    """Create transformer-typical test tensors"""
    return torch.randn(batch_size, seq_len, hidden_dim, device=device, dtype=torch.float32)

# Test the baseline implementation
print("=== Experiment 1: LayerNorm + GELU Fusion ===")

test_input = create_test_tensors()
print(f"Input shape: {test_input.shape}")
print(f"Memory usage: {test_input.element_size() * test_input.numel() / 1024**2:.1f} MB")

# Initialize model
model = LayerNormGELU(test_input.shape[-1]).to(device)

# Test baseline
with torch.no_grad():
    baseline_output = model(test_input)
    print(f"Output shape: {baseline_output.shape}")
    print(f"Output range: [{baseline_output.min():.4f}, {baseline_output.max():.4f}]")

print("✅ Baseline implementation ready for optimization")

In [None]:
# 🔧 Compiled Version with Kernel Capture
# 
# This section demonstrates how PyTorch's torch.compile() works with Triton
# to automatically optimize our LayerNorm + GELU pattern through kernel fusion.

# Advanced Kernel Compilation and Analysis

def capture_kernels_for_experiment(model_fn, input_tensor, experiment_name):
    """
    Capture and organize generated Triton kernels for advanced analysis
    
    This function demonstrates how to:
    1. Organize kernel generation experiments
    2. Trigger optimal kernel compilation
    3. Capture generated artifacts for analysis
    """
    
    # Create dedicated experiment directory
    exp_path = experiment_manager.create_experiment(experiment_name)
    experiment_manager.set_experiment_cache(exp_path)
    
    # Clear compilation cache for fresh kernel generation
    clear_compilation_cache()
    
    # Compile with maximum optimization
    print(f"\n🔧 Compiling {experiment_name} with max-autotune...")
    compiled_model = torch.compile(model_fn, mode="max-autotune")
    
    # Trigger kernel generation
    print("🔥 Generating optimized kernels...")
    with torch.no_grad():
        _ = compiled_model(input_tensor)  # First call triggers compilation
        _ = compiled_model(input_tensor)  # Second call reuses the compiled kernels
    
    # Find and catalog generated kernels
    kernel_files = find_triton_kernels([str(exp_path)])
    
    if kernel_files:
        print(f"✅ Found {len(kernel_files)} kernel files")
        for i, (file_path, content) in enumerate(kernel_files):
            print(f"   {i+1}. {Path(file_path).name}")
            
            kernel_info = {
                "kernel_id": f"kernel_{i+1:03d}",
                "file_path": file_path,
                "size_bytes": len(content),
                "created_at": datetime.datetime.now().isoformat(),
                "operations_detected": analyze_kernel_operations(content)
            }
            experiment_manager.save_kernel_metadata(kernel_info)
    else:
        print("⚠️  No Triton kernels found - checking system cache...")
        system_kernel_files = find_triton_kernels()
        if system_kernel_files:
            print(f"📦 Found {len(system_kernel_files)} kernels in system cache")
    
    return compiled_model, exp_path

def analyze_kernel_operations(content):
    """Analyze kernel content to identify fused operations"""
    operations = []
    content_lower = content.lower()
    
    # Match the specific op name; a bare 'norm' would also tag unrelated kernels
    if 'layer_norm' in content_lower:
        operations.append("layer_norm")
    if 'gelu' in content_lower:
        operations.append("gelu") 
    if 'softmax' in content_lower:
        operations.append("softmax")
    if 'dropout' in content_lower:
        operations.append("dropout")
    if 'sigmoid' in content_lower:
        operations.append("sigmoid")
    
    return operations

def find_triton_kernels(search_dirs=None):
    """
    Enhanced kernel finding with better pattern matching
    
    Searches for Triton kernel files (.py) that contain Triton-specific code patterns.
    This helps us identify which files are actually generated kernels vs other Python files.
    """
    
    if search_dirs is None:
        # Default system cache directories where PyTorch/Triton store generated kernels
        search_dirs = [
            f"/tmp/torchinductor_{os.getenv('USER', 'user')}/",
            "/tmp/torchinductor_alibina/", 
            "/tmp/triton/",
            str(Path.home() / ".triton" / "cache")
        ]
    
    kernel_files = []
    
    for cache_dir in search_dirs:
        cache_path = Path(cache_dir)
        if cache_path.exists():
            # Search for Python files recursively
            for file_path in cache_path.rglob("*.py"):
                try:
                    content = file_path.read_text()
                    # Look for Triton-specific patterns to identify kernel files
                    triton_patterns = [
                        '@triton.jit',           # Triton JIT decorator
                        'triton_per_fused',      # Fused reduction kernels
                        'triton_poi_fused',      # Fused pointwise kernels
                        'import triton',         # Triton imports
                        'tl.load',              # Triton load operations
                        'tl.store'              # Triton store operations
                    ]
                    
                    if any(pattern in content for pattern in triton_patterns):
                        kernel_files.append((str(file_path), content))
                except Exception:
                    # Skip files that can't be read
                    continue
    
    return kernel_files

# 🧪 Execute Experiment 1: LayerNorm + GELU Fusion
print("=" * 60)
print("🧪 EXPERIMENT 1: LayerNorm + GELU Fusion")
print("=" * 60)
print("📖 Learning Objectives:")
print("   • Observe automatic kernel fusion in action")
print("   • Compare fused vs unfused performance")
print("   • Analyze generated Triton kernel code")
print("   • Understand compilation overhead vs runtime benefits")

compiled_model, exp1_path = capture_kernels_for_experiment(
    model, test_input, "layernorm_gelu_fusion"
)

print(f"\n📊 Experiment results saved to: {exp1_path}")
print(f"🔍 You can examine the generated kernel files in this directory!")

# 🧪 Verify correctness: compiled output should match baseline
print(f"\n🔬 Correctness Verification:")
with torch.no_grad():
    compiled_output = compiled_model(test_input)
    
    # Check if outputs are numerically equivalent within tolerance
    max_diff = (baseline_output - compiled_output).abs().max().item()
    if torch.allclose(baseline_output, compiled_output, rtol=1e-5, atol=1e-6):
        print("✅ Compiled model output matches baseline within tolerance")
        print(f"   📊 Max absolute difference: {max_diff:.2e}")
    else:
        print("❌ Output mismatch exceeds tolerance (rtol=1e-5, atol=1e-6)")
        print(f"   📊 Max absolute difference: {max_diff:.2e}")
        print("   💡 Small differences are normal due to different computation orders")

print(f"\n🎓 Key Learning: The compiled model should be numerically equivalent to the")
print(f"   baseline within floating-point tolerance; the benchmarks below measure its speed.")

In [None]:
# 📊 Comprehensive Benchmarking Pipeline
#
# This section implements a rigorous benchmarking methodology to measure
# the true performance impact of kernel fusion and compilation optimizations.

import statistics
from dataclasses import dataclass
from typing import List

@dataclass
class BenchmarkResult:
    """
    Container for benchmark results with comprehensive metrics
    
    This class stores all the important metrics we need to evaluate
    kernel performance comprehensively.
    """
    name: str
    mean_time: float      # Average execution time
    std_time: float       # Standard deviation (shows consistency)
    min_time: float       # Best-case performance
    max_time: float       # Worst-case performance
    throughput: float     # Elements processed per second
    speedup: float = 1.0  # Speedup relative to baseline

class PerformanceBenchmarker:
    """
    🎯 Professional-grade benchmarking for GPU kernel performance
    
    Key principles implemented:
    1. Proper warmup to avoid compilation overhead in measurements
    2. GPU synchronization to get accurate timings
    3. Multiple runs for statistical significance
    4. Comprehensive metrics including throughput and speedup
    """
    
    def __init__(self, warmup_runs=5, benchmark_runs=20):
        """
        Initialize benchmarker with scientific rigor
        
        Args:
            warmup_runs: Number of runs to "warm up" before measuring
                        (eliminates compilation overhead and cache misses)
            benchmark_runs: Number of timed runs for statistical analysis
        """
        self.warmup_runs = warmup_runs
        self.benchmark_runs = benchmark_runs
        self.baseline_time = None
        
    def benchmark_function(self, func, input_tensor, name: str) -> BenchmarkResult:
        """
        Benchmark a function with scientific rigor
        
        This method implements GPU benchmarking best practices:
        1. Warmup phase to eliminate one-time costs
        2. Proper CUDA synchronization for accurate timing
        3. Statistical analysis across multiple runs
        4. Comprehensive metrics calculation
        
        Args:
            func: Function to benchmark
            input_tensor: Input data for the function
            name: Human-readable name for this benchmark
            
        Returns:
            BenchmarkResult with comprehensive performance metrics
        """
        
        print(f"    🔥 Benchmarking: {name}")
        
        # 🔥 Phase 1: Warmup runs
        # These runs eliminate compilation overhead and prepare GPU caches
        print(f"       Warmup: {self.warmup_runs} runs...")
        for i in range(self.warmup_runs):
            with torch.no_grad():
                _ = func(input_tensor)
        
        # Ensure all warmup operations complete before timing
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        
        # ⏱️ Phase 2: Timed benchmark runs
        print(f"       Timing: {self.benchmark_runs} runs...")
        times = []
        
        for i in range(self.benchmark_runs):
            # Synchronize before starting timer
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            
            # Start timing
            start = time.perf_counter()
            
            # Execute function
            with torch.no_grad():
                output = func(input_tensor)
            
            # Synchronize before stopping timer (crucial for GPU timing!)
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            
            # Stop timing
            end = time.perf_counter()
            times.append(end - start)
        
        # 📊 Phase 3: Statistical analysis
        mean_time = statistics.mean(times)
        std_time = statistics.stdev(times) if len(times) > 1 else 0.0
        min_time = min(times)
        max_time = max(times)
        
        # Calculate throughput: how many elements processed per second
        num_elements = input_tensor.numel()
        throughput = num_elements / mean_time
        
        # Calculate speedup relative to baseline
        speedup = 1.0
        if self.baseline_time is not None:
            speedup = self.baseline_time / mean_time
        elif "baseline" in name.lower():
            self.baseline_time = mean_time
        
        print(f"       ✅ Results: {mean_time*1000:.3f}ms ± {std_time*1000:.3f}ms")
        
        return BenchmarkResult(
            name=name,
            mean_time=mean_time,
            std_time=std_time,
            min_time=min_time,
            max_time=max_time,
            throughput=throughput,
            speedup=speedup
        )
    
    def print_results(self, results: List[BenchmarkResult]):
        """
        Print formatted benchmark results in a professional table
        
        This creates an easy-to-read summary table showing:
        - Execution times with standard deviation
        - Speedup factors relative to baseline
        - Throughput in millions of elements per second
        """
        
        print("\n" + "=" * 80)
        print("🏃‍♂️ PERFORMANCE BENCHMARK RESULTS")
        print("=" * 80)
        
        # Table header
        print(f"{'Implementation':<20} {'Time (ms)':<12} {'±Std (ms)':<10} {'Speedup':<8} {'Throughput':<15}")
        print("-" * 80)
        
        # Results rows
        for result in results:
            print(f"{result.name:<20} "
                  f"{result.mean_time*1000:<12.3f} "
                  f"±{result.std_time*1000:<9.3f} "
                  f"{result.speedup:<8.2f}x "
                  f"{result.throughput/1e6:<15.1f}M elem/s")
        
        # Highlight best performer
        if len(results) > 1:
            best = max(results, key=lambda r: r.speedup)
            print(f"\n🏆 Best performer: {best.name} ({best.speedup:.2f}x speedup)")
            
            # Calculate performance improvement
            if best.speedup > 1.1:
                improvement = (best.speedup - 1) * 100
                print(f"🚀 Performance improvement: {improvement:.1f}% faster than baseline")

# 🏃‍♂️ Execute Comprehensive Benchmarks
print("\n📊 COMPREHENSIVE PERFORMANCE ANALYSIS")
print("=" * 60)
print("🎯 Testing multiple tensor sizes to understand scaling behavior")
print("📖 Learning Objectives:")
print("   • Measure fusion benefits across different scales")
print("   • Understand how performance scales with tensor size")
print("   • Observe consistency of performance improvements")
print("   • Analyze throughput characteristics")

benchmarker = PerformanceBenchmarker(warmup_runs=10, benchmark_runs=50)
all_results = []

# Test configurations: small to large to observe scaling
test_configs = [
    (16, 128, 768),   # Small: Typical inference batch
    (32, 512, 768),   # Medium: Training batch  
    (64, 1024, 768),  # Large: Large batch training
]

for i, (batch_size, seq_len, hidden_dim) in enumerate(test_configs, 1):
    print(f"\n📊 Configuration {i}/3: Batch={batch_size}, Seq={seq_len}, Hidden={hidden_dim}")
    
    # Calculate total elements and memory usage
    test_input = create_test_tensors(batch_size, seq_len, hidden_dim)
    total_elements = test_input.numel()
    memory_mb = test_input.element_size() * total_elements / (1024**2)
    
    print(f"    📏 Tensor shape: {test_input.shape}")
    print(f"    🔢 Total elements: {total_elements:,}")
    print(f"    💾 Memory usage: {memory_mb:.1f} MB")
    
    # Create fresh model instances to avoid compilation caching between sizes
    baseline_model = LayerNormGELU(test_input.shape[-1]).to(device)
    compiled_model_fresh = torch.compile(baseline_model, mode="max-autotune")
    
    # Benchmark baseline implementation
    baseline_result = benchmarker.benchmark_function(
        baseline_model, test_input, f"Baseline-{batch_size}x{seq_len}"
    )
    all_results.append(baseline_result)
    
    # Benchmark compiled version
    compiled_result = benchmarker.benchmark_function(
        compiled_model_fresh, test_input, f"Compiled-{batch_size}x{seq_len}"
    )
    all_results.append(compiled_result)
    
    # Print results for this configuration
    benchmarker.print_results([baseline_result, compiled_result])
    
    # Reset baseline for next configuration
    benchmarker.baseline_time = None

print(f"\n🎓 What to look for in the results above:")
print(f"   • Whether kernel fusion gives a consistent speedup across sizes")
print(f"   • How the speedup changes as the tensors grow")
print(f"   • Compilation is a one-time cost, paid during the warmup runs")
print(f"   • Whether larger tensors show larger improvements")
print(f"\n✅ Benchmarking complete!")

## Exploring More Fusion Patterns

### 🎯 Why Study Multiple Patterns?

Now that we understand the fundamentals with LayerNorm + GELU, let's explore other common patterns. Each pattern teaches us different aspects of GPU optimization:

| Pattern | Primary Learning | Common Use Case |
|---------|------------------|-----------------|
| **Softmax + Dropout** | Attention mechanism optimization | Transformer attention layers |
| **RMSNorm** | Alternative normalization schemes | Modern LLMs (LLaMA, Mistral) |
| **SiLU/Swish** | Activation function variants | MLP layers, ConvNeXt |

### 🧠 Fusion Strategy Patterns

Different operations benefit from fusion in different ways:

1. **Sequential Fusion**: Operations applied one after another (LayerNorm → GELU)
2. **Parallel Fusion**: Multiple operations on same data (computing mean + variance)
3. **Reduction Fusion**: Operations that reduce dimensionality (Softmax across sequence)
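
To illustrate the second pattern, the sketch below computes mean and variance in a single pass over the data using Welford's update, so the input is read once instead of twice. This is a plain-Python illustration of the idea, not a claim about the exact code the compiler emits; the function name is hypothetical.

```python
def mean_and_variance_single_pass(values):
    """Welford's algorithm: running mean and sum of squared deviations."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for v in values:
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)
    return mean, m2 / count  # biased variance, as used by LayerNorm


mean, var = mean_and_variance_single_pass([1.0, 2.0, 3.0, 4.0])
print(mean, var)
```

Computing both statistics from one read of the row is the same traffic-saving idea as sequential fusion, applied to two reductions over the same data.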

### 📚 Learning Progression

We'll progress through increasingly complex patterns:
- **Experiment 2**: Softmax + Dropout (attention patterns)
- **Experiment 3**: RMSNorm (modern normalization)
- **Experiment 4**: SiLU variants (activation comparison)
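
As a plain-Python reference for Experiment 2, the sketch below shows a numerically stable softmax followed by inverted dropout, the scheme `F.dropout` uses: during training, elements are zeroed with probability `p` and the survivors are scaled by `1 / (1 - p)`; in eval mode the input passes through unchanged. The function names and sample values are illustrative.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax: subtract the row max before exponentiating."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def inverted_dropout(row, p, training=True, rng=random):
    """Zero each element with probability p; scale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(row)  # identity in eval mode
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in row]


rng = random.Random(0)  # seeded so the run is repeatable
probs = softmax([0.5, 1.0, -2.0, 3.0])
dropped = inverted_dropout(probs, p=0.1, rng=rng)
print(probs)
print(dropped)
```

Because the mask is random, a compiled kernel and the eager baseline cannot be compared element by element in training mode; comparing in eval mode, or comparing statistics over many samples, are the usual alternatives.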

Each experiment will be organized in its own directory with:
- Generated Triton kernels
- Performance benchmarks
- Metadata and analysis
- Compilation artifacts

Let's dive into each pattern and see how PyTorch + Triton optimizes them!
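
For Experiment 3, a minimal plain-Python sketch of RMSNorm over a single row is given here for reference. It assumes the common formulation `x / sqrt(mean(x²) + eps) * weight`, with no mean subtraction and no bias; the function name `rms_norm` and the sample row are illustrative.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale a row by the reciprocal of its root mean square."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


row = [1.0, 2.0, 3.0, 4.0]
out = rms_norm(row, weight=[1.0] * 4)
print(out)
```

Unlike LayerNorm, the output keeps a non-zero mean; only its mean square is normalized to (approximately) one.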

In [None]:
# Experiment 2: Softmax + Dropout Fusion
#
# 📖 Educational Focus: Attention Mechanism Optimization
#
# This pattern is found in every transformer attention layer:
# 1. Compute attention scores (Q @ K^T / √d)
# 2. Apply softmax to get attention weights  
# 3. Apply dropout for regularization
# 4. Use weights to attend to values (Attention @ V)
#
# Mathematical Background:
# Softmax(x_i) = exp(x_i) / Σ(exp(x_j))
# Dropout(x_i) = x_i / (1 - p) with probability (1 - p), else 0   (p = drop probability)

print("=" * 60)
print("🧪 EXPERIMENT 2: Softmax + Dropout Fusion")
print("=" * 60)
print("📖 Focus: Optimizing Transformer Attention Mechanisms")
print("🎯 Key Learning: How reduction operations (softmax) fuse with element-wise ops")

class SoftmaxDropout(nn.Module):
    """
    Softmax followed by Dropout - the heart of attention mechanisms
    
    This pattern appears in every transformer attention layer and is
    a perfect candidate for fusion because:
    1. Softmax is a reduction operation (needs to see all elements)
    2. Dropout is element-wise (can be applied in the same kernel, right after normalization)
    3. Both are memory-bound operations
    """
    
    def __init__(self, dropout_prob=0.1):
        super().__init__()
        self.dropout_prob = dropout_prob
        
    def forward(self, x):
        # Step 1: Apply softmax along last dimension
        # This computes: softmax(x_i) = exp(x_i) / sum(exp(x_j))
        softmax_out = F.softmax(x, dim=-1)
        
        # Step 2: Apply dropout for regularization
        # During training: zero elements with probability dropout_prob and
        # scale the survivors by 1 / (1 - dropout_prob)
        # During inference: dropout is the identity (no scaling)
        dropped_out = F.dropout(softmax_out, p=self.dropout_prob, training=self.training)
        
        return dropped_out

# 🎯 Create attention-like test data
# Shape: [batch_size, num_heads, seq_len, seq_len]
# This represents attention scores before softmax in multi-head attention
batch_size, num_heads, seq_len = 32, 8, 512

attention_input = torch.randn(batch_size, num_heads, seq_len, seq_len, device=device)
print(f"📊 Attention input shape: {attention_input.shape}")
print(f"💾 Memory usage: {attention_input.element_size() * attention_input.numel() / 1024**2:.1f} MB")
print(f"🔢 Total elements: {attention_input.numel():,}")

# Initialize model in training mode to enable dropout
softmax_dropout_model = SoftmaxDropout(dropout_prob=0.1).to(device)
softmax_dropout_model.train()  # Enable dropout for this experiment

print(f"\n📝 Model configuration:")
print(f"   Dropout probability: {softmax_dropout_model.dropout_prob}")
print(f"   Training mode: {softmax_dropout_model.training}")

# 🔧 Capture kernels for this attention pattern
print(f"\n🔧 Compiling and capturing attention mechanism kernels...")
compiled_softmax_dropout, exp2_path = capture_kernels_for_experiment(
    softmax_dropout_model, attention_input, "softmax_dropout_fusion"
)

print(f"📊 Softmax + Dropout experiment saved to: {exp2_path}")

# 🏃‍♂️ Quick performance benchmark
print(f"\n🏃‍♂️ Performance Analysis:")
print(f"   🎯 This pattern tests reduction + element-wise fusion")
print(f"   📈 Expected benefit: Reduced memory bandwidth for attention")

benchmarker_exp2 = PerformanceBenchmarker(warmup_runs=5, benchmark_runs=20)

baseline_result = benchmarker_exp2.benchmark_function(
    softmax_dropout_model, attention_input, "Softmax+Dropout Baseline"
)

compiled_result = benchmarker_exp2.benchmark_function(
    compiled_softmax_dropout, attention_input, "Softmax+Dropout Compiled"
)

benchmarker_exp2.print_results([baseline_result, compiled_result])

# 🔬 Correctness verification for stochastic operations
print(f"\n🔬 Correctness Note:")
print(f"   ⚠️  Dropout is stochastic - training-mode outputs won't match exactly")
print(f"   ✅ We compare in eval mode instead, where dropout is a no-op")

# Test in eval mode for deterministic comparison
softmax_dropout_model.eval()
compiled_softmax_dropout.eval()

with torch.no_grad():
    baseline_eval = softmax_dropout_model(attention_input)
    compiled_eval = compiled_softmax_dropout(attention_input)
    
    if torch.allclose(baseline_eval, compiled_eval, rtol=1e-5, atol=1e-6):
        print(f"   ✅ Eval mode outputs match within tolerance")
    else:
        print(f"   📊 Max difference: {(baseline_eval - compiled_eval).abs().max():.2e}")

print(f"\n🎓 Key Insights:")
print(f"   • Softmax + Dropout fusion reduces memory traffic")
print(f"   • Attention mechanisms benefit significantly from this optimization")
print(f"   • Stochastic operations require careful correctness verification")

In [None]:
# 🧪 Experiment 3: RMSNorm (Root Mean Square Normalization)
#
# 📖 Educational Focus: Modern Normalization Schemes
#
# RMSNorm is used in modern large language models like:
# - LLaMA (Meta)
# - Mistral (Mistral AI)
# - T5 (Google)
#
# Mathematical Comparison:
# LayerNorm: (x - μ) / σ * γ + β  (requires mean AND variance)
# RMSNorm:   x / RMS(x) * γ       (only requires RMS, simpler!)
#
# Where RMS(x) = √(mean(x²) + ε)

print("\n" + "=" * 60)
print("🧪 EXPERIMENT 3: RMSNorm Optimization")
print("=" * 60)
print("📖 Focus: Modern Alternative to LayerNorm")
print("🎯 Key Learning: Simpler normalization can be more efficient")

class RMSNorm(nn.Module):
    """
    Root Mean Square Normalization - A simpler alternative to LayerNorm
    
    Benefits of RMSNorm vs LayerNorm:
    1. 🎯 Simpler: Only requires the mean of squares, not mean AND variance
    2. ⚡ Cheaper: Fewer operations per element and a single reduction
    3. 🔢 No re-centering: Skips the mean subtraction and has no bias term
    4. 📏 Comparable quality: Reported to train about as well as LayerNorm in practice
    
    Used in: LLaMA, Mistral, T5, and other modern LLMs
    """
    
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        # Learnable scaling parameter (like γ in LayerNorm)
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps  # Small constant for numerical stability
        
    def forward(self, x):
        # Step 1: Compute the mean of squares along the hidden dimension
        # (named "variance" by convention, although no mean is subtracted)
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        
        # Step 2: Normalize by RMS = √(mean(x²) + ε)
        # rsqrt computes 1/sqrt in a single operation
        x = x * torch.rsqrt(variance + self.eps)
        
        # Step 3: Apply learned scaling
        return self.weight * x

# 🎯 Create LLaMA-style test data
# Modern LLMs use much larger hidden dimensions
batch_size, seq_len, hidden_dim = 32, 512, 4096  # LLaMA-7B dimensions

rms_input = torch.randn(batch_size, seq_len, hidden_dim, device=device)
print(f"📊 RMSNorm input shape: {rms_input.shape}")
print(f"💾 Memory usage: {rms_input.element_size() * rms_input.numel() / 1024**2:.1f} MB")
print(f"🔢 Total elements: {rms_input.numel():,}")
print(f"📏 Hidden dimension: {hidden_dim} (LLaMA-7B scale)")

# Initialize RMSNorm model
rmsnorm_model = RMSNorm(hidden_dim).to(device)
print(f"\n📝 Model configuration:")
print(f"   Parameters: {sum(p.numel() for p in rmsnorm_model.parameters()):,}")
print(f"   Epsilon: {rmsnorm_model.eps}")

# Compare with equivalent LayerNorm for educational purposes
layernorm_model = nn.LayerNorm(hidden_dim).to(device)
print(f"   LayerNorm parameters: {sum(p.numel() for p in layernorm_model.parameters()):,}")

# 🔧 Capture kernels for RMSNorm optimization
print(f"\n🔧 Compiling and capturing RMSNorm kernels...")
print(f"   🎯 Expected optimization: Fused power + mean + rsqrt + multiply")

compiled_rmsnorm, exp3_path = capture_kernels_for_experiment(
    rmsnorm_model, rms_input, "rmsnorm_optimization"
)

print(f"📊 RMSNorm experiment saved to: {exp3_path}")

# 🏃‍♂️ Comprehensive benchmark: RMSNorm vs LayerNorm
print(f"\n🏃‍♂️ Comprehensive Performance Analysis:")
print(f"   📊 Comparing RMSNorm vs LayerNorm vs Compiled RMSNorm")

benchmarker_exp3 = PerformanceBenchmarker(warmup_runs=5, benchmark_runs=20)

# Benchmark all three variants
rmsnorm_baseline = benchmarker_exp3.benchmark_function(
    rmsnorm_model, rms_input, "RMSNorm Baseline"
)

rmsnorm_compiled = benchmarker_exp3.benchmark_function(
    compiled_rmsnorm, rms_input, "RMSNorm Compiled"
)

layernorm_baseline = benchmarker_exp3.benchmark_function(
    layernorm_model, rms_input, "LayerNorm Baseline"
)

# Compare all results
all_norm_results = [rmsnorm_baseline, rmsnorm_compiled, layernorm_baseline]
benchmarker_exp3.print_results(all_norm_results)

# 🔬 Mathematical correctness verification
print(f"\n🔬 Mathematical Verification:")
with torch.no_grad():
    rms_output = rmsnorm_model(rms_input)
    compiled_output = compiled_rmsnorm(rms_input)
    
    # Verify outputs match within floating-point tolerance
    max_diff = (rms_output - compiled_output).abs().max().item()
    if torch.allclose(rms_output, compiled_output, rtol=1e-5, atol=1e-6):
        print(f"   ✅ Compiled RMSNorm matches baseline (max diff: {max_diff:.2e})")
    else:
        print(f"   ⚠️  Outputs differ, max difference: {max_diff:.2e}")
    
    # Compare RMS vs LayerNorm properties
    layernorm_output = layernorm_model(rms_input)
    
    print(f"\n📊 Normalization Comparison:")
    print(f"   RMSNorm output std: {rms_output.std():.6f}")
    print(f"   LayerNorm output std: {layernorm_output.std():.6f}")
    print(f"   RMSNorm output mean: {rms_output.mean():.6f}")
    print(f"   LayerNorm output mean: {layernorm_output.mean():.6f}")

print(f"\n🎓 Key Insights:")
print(f"   • RMSNorm is computationally simpler than LayerNorm")
print(f"   • Modern LLMs prefer RMSNorm for efficiency")
print(f"   • Fusion makes normalization operations very fast")
print(f"   • On zero-mean inputs such as randn, both schemes yield near-identical output statistics")
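
The last insight can be checked by hand on a tiny vector. The sketch below is a plain-Python illustration, not the notebook's `RMSNorm` module: it fixes γ = 1 and β = 0, and the helper names `rms_norm` and `layer_norm` are made up for this example. It suggests that the two schemes coincide when the input already has zero mean and diverge once the input is shifted.

```python
import math

def rms_norm(x, eps=1e-6):
    """x / sqrt(mean(x^2) + eps), with the scale γ fixed to 1."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]

def layer_norm(x, eps=1e-6):
    """(x - μ) / sqrt(var + eps), with γ = 1 and β = 0."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def max_gap(x):
    """Largest element-wise difference between the two normalizations."""
    return max(abs(a - b) for a, b in zip(rms_norm(x), layer_norm(x)))

centered = [-1.5, -0.5, 0.5, 1.5]        # mean is exactly 0
shifted = [v + 10.0 for v in centered]   # same spread, mean 10

print(f"zero-mean input gap: {max_gap(centered):.2e}")
print(f"shifted input gap:   {max_gap(shifted):.2e}")
```

This is why the `randn` test data above shows such similar statistics for both modules: its per-row mean is already close to zero.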

In [None]:
# Experiment 4: SiLU/Swish Activation (x * sigmoid(x))
# 
# 📖 Educational Focus: Activation Function Variants and Implementation Comparison
# 
# SiLU (Sigmoid Linear Unit) = Swish = x * sigmoid(x)
# 
# Mathematical properties:
# - Smooth and differentiable everywhere
# - Non-monotonic (has a small "dip" with its minimum near x ≈ -1.28)
# - Self-gated: sigmoid(x) gates the input using the input itself (no learned parameters)
# - Used in: EfficientNet, MobileNetV3 (as the hard-swish approximation), LLaMA-style SwiGLU MLPs
# 
# Comparison with other activations:
# ReLU(x) = max(0, x)           # Simple but not smooth
# GELU(x) = x * Φ(x)            # Gaussian-based, smooth
# SiLU(x) = x * σ(x)            # Sigmoid-based, smooth

print("\n" + "=" * 60)
print("🧪 EXPERIMENT 4: SiLU/Swish Activation Optimization")
print("=" * 60)
print("📖 Focus: Comparing Custom vs Built-in Implementations")
print("🎯 Key Learning: How PyTorch optimizes different implementation styles")

class SiLUActivation(nn.Module):
    """
    Custom SiLU implementation: x * sigmoid(x)
    
    This implementation explicitly computes:
    1. sigmoid(x) = 1 / (1 + exp(-x))
    2. x * sigmoid(x)
    
    Educational purpose: See how explicit operations get fused
    """
    
    def __init__(self):
        super().__init__()
        
    def forward(self, x):
        # Explicit computation: x * sigmoid(x)
        # This creates an intermediate sigmoid result
        return x * torch.sigmoid(x)

class SiLUBuiltin(nn.Module):
    """
    Built-in SiLU implementation using PyTorch's optimized version
    
    In eager mode, nn.SiLU() dispatches to a single fused kernel,
    so no intermediate sigmoid tensor is written to memory.
    """
    
    def __init__(self):
        super().__init__()
        self.silu = nn.SiLU()
        
    def forward(self, x):
        return self.silu(x)

# 🎯 Create MLP-style test data
# SiLU is commonly used in MLP layers of modern architectures
batch_size, seq_len, hidden_dim = 64, 512, 2048  # Typical MLP dimensions

silu_input = torch.randn(batch_size, seq_len, hidden_dim, device=device)
print(f"📊 SiLU input shape: {silu_input.shape}")
print(f"💾 Memory usage: {silu_input.element_size() * silu_input.numel() / 1024**2:.1f} MB")
print(f"🔢 Total elements: {silu_input.numel():,}")

# 🔧 Test both implementations and capture their kernels
print(f"\n🔧 Comparing Implementation Strategies:")

# Strategy 1: Custom explicit implementation
print(f"\n1️⃣ Custom Implementation (x * sigmoid(x)):")
silu_custom_model = SiLUActivation().to(device)
compiled_silu_custom, exp4a_path = capture_kernels_for_experiment(
    silu_custom_model, silu_input, "silu_custom_implementation"
)

# Strategy 2: Built-in PyTorch implementation
print(f"\n2️⃣ Built-in Implementation (nn.SiLU):")
silu_builtin_model = SiLUBuiltin().to(device)
compiled_silu_builtin, exp4b_path = capture_kernels_for_experiment(
    silu_builtin_model, silu_input, "silu_builtin_implementation"
)

print(f"📊 Custom SiLU experiment saved to: {exp4a_path}")
print(f"📊 Built-in SiLU experiment saved to: {exp4b_path}")

# 🏃‍♂️ Comprehensive benchmark of all variants
print(f"\n🏃‍♂️ Comprehensive SiLU Performance Analysis:")
print(f"   📊 Testing: Custom vs Built-in vs Compiled versions")
print(f"   🎯 Learning: How different implementations affect performance")

benchmarker_exp4 = PerformanceBenchmarker(warmup_runs=5, benchmark_runs=20)

# Benchmark all SiLU variants
silu_results = []

# Custom implementations
silu_results.append(benchmarker_exp4.benchmark_function(
    silu_custom_model, silu_input, "SiLU Custom"
))
silu_results.append(benchmarker_exp4.benchmark_function(
    compiled_silu_custom, silu_input, "SiLU Custom Compiled"
))

# Built-in implementations  
silu_results.append(benchmarker_exp4.benchmark_function(
    silu_builtin_model, silu_input, "SiLU Built-in"
))
silu_results.append(benchmarker_exp4.benchmark_function(
    compiled_silu_builtin, silu_input, "SiLU Built-in Compiled"
))

# Display comprehensive results
benchmarker_exp4.print_results(silu_results)

# 🔬 Correctness verification across implementations
print(f"\n🔬 Correctness Verification:")
with torch.no_grad():
    custom_output = silu_custom_model(silu_input)
    builtin_output = silu_builtin_model(silu_input)
    compiled_custom = compiled_silu_custom(silu_input)
    compiled_builtin = compiled_silu_builtin(silu_input)
    
    # Check all implementations produce same results
    implementations = [
        ("Custom", custom_output),
        ("Built-in", builtin_output), 
        ("Compiled Custom", compiled_custom),
        ("Compiled Built-in", compiled_builtin)
    ]
    
    print(f"   📊 Cross-implementation comparison:")
    for i, (name1, output1) in enumerate(implementations):
        for j, (name2, output2) in enumerate(implementations[i+1:], i+1):
            max_diff = (output1 - output2).abs().max().item()
            if max_diff < 1e-6:
                print(f"   ✅ {name1} ≈ {name2} (max diff: {max_diff:.2e})")
            else:
                print(f"   ⚠️  {name1} vs {name2} (max diff: {max_diff:.2e})")

# 📊 Mathematical properties demonstration
print(f"\n📊 SiLU Mathematical Properties:")
test_range = torch.linspace(-3, 3, 7, device=device)
silu_values = test_range * torch.sigmoid(test_range)

print(f"   Input:  {test_range.cpu().numpy()}")
print(f"   SiLU:   {silu_values.cpu().numpy()}")
print(f"   📝 Note the smooth curve and the small dip (minimum near x ≈ -1.28)")

print(f"\n🎓 Key Insights:")
print(f"   • Built-in implementations may have optimized kernels")
print(f"   • Compilation can make custom implementations competitive")
print(f"   • Different implementation styles lead to different fusion opportunities")
print(f"   • All variants are mathematically equivalent and agree up to floating-point rounding")
print(f"   • SiLU provides smooth, self-gated activation behavior")
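
The location of the dip can be checked numerically without a GPU. The following is a minimal plain-Python sketch (the helper `silu` and the grid resolution are choices made for this illustration) that scans the negative axis for the lowest value of x · sigmoid(x).

```python
import math

def silu(x: float) -> float:
    """SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))

# Scan [-3, 0] on a fine grid and keep the point with the lowest value
grid = [-3.0 + i * 0.001 for i in range(3001)]
x_min = min(grid, key=silu)
print(f"minimum near x = {x_min:.3f}, SiLU(x) = {silu(x_min):.4f}")

# Non-monotonic: the function falls, then rises again on the negative axis
print(f"SiLU(-3) = {silu(-3.0):.4f}, SiLU(-1) = {silu(-1.0):.4f}")
```

At the minimum, setting the derivative to zero gives x = -(1 + eˣ), which is why the dip sits slightly left of -1.25.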

## 🔍 Kernel Analysis and Deep Learning

Now comes the exciting part - analyzing what PyTorch and Triton actually generated! This is where we transition from using tools to understanding the underlying optimizations.

### 🎓 What We're Looking For

When analyzing generated kernels, we want to understand:

1. **🔗 Fusion Patterns**: Which operations got combined into single kernels?
2. **🧮 Memory Access**: How efficiently does the kernel access memory?
3. **⚡ Parallelization**: How is work distributed across GPU cores?
4. **🎯 Optimization Techniques**: What clever optimizations did Triton apply?

### 🔍 Kernel Naming Conventions

Kernels generated by TorchInductor follow predictable naming patterns:

| Pattern | Meaning | Example |
|---------|---------|---------|
| `triton_poi_fused_*` | Pointwise fused operations | Element-wise operations (GELU, SiLU, dropout) |
| `triton_per_fused_*` | Persistent reduction: the whole reduction dimension is handled in one block | Softmax, LayerNorm over moderately sized rows |
| `triton_red_fused_*` | Looped reduction for larger reduction dimensions | Sum, mean across long dimensions |
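
These prefixes make it easy to sort kernels by type when browsing an experiment directory. The sketch below is a hypothetical helper (`classify_kernel_name` is not part of PyTorch or Triton) and assumes that `per` denotes Inductor's persistent reductions; the kernel names in the demo are made up.

```python
# Prefix-to-category map for Inductor-style kernel names (assumed convention)
KERNEL_PREFIXES = {
    "triton_poi_": "pointwise",
    "triton_per_": "persistent reduction",
    "triton_red_": "reduction",
}

def classify_kernel_name(name: str) -> str:
    """Return the kernel category implied by the name prefix, or 'unknown'."""
    for prefix, category in KERNEL_PREFIXES.items():
        if name.startswith(prefix):
            return category
    return "unknown"

# Hypothetical kernel names, made up for illustration
for kernel in ["triton_poi_fused_mul_sigmoid_0", "triton_per_fused__softmax_1", "my_custom_kernel"]:
    print(f"{kernel:35s} -> {classify_kernel_name(kernel)}")
```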

### 🧠 Understanding Kernel Structure

A typical Triton kernel has these components:

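```python
import triton
import triton.language as tl

@triton.jit
def kernel_name(
    input_ptr, output_ptr,    # Memory pointers
    n_elements,               # Problem size
    BLOCK_SIZE: tl.constexpr  # Compile-time constant
):
    # 1. Calculate the element offsets handled by this program instance
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # Guard the final, partially filled block
    
    # 2. Load data from memory
    data = tl.load(input_ptr + offsets, mask=mask)
    
    # 3. Perform computations (fused operations!); SiLU serves as a stand-in
    result = data * tl.sigmoid(data)
    
    # 4. Store results back to memory
    tl.store(output_ptr + offsets, result, mask=mask)
```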
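
Step 1 is the part that most often confuses newcomers, so here is a plain-Python emulation of the indexing scheme. It is only a sketch: `block_offsets` is a hypothetical helper, and the tiny problem size is chosen so the output is easy to read. Each "program" covers one block of offsets, and a mask disables the lanes that fall past the end of the data.

```python
def block_offsets(pid: int, block_size: int, n_elements: int):
    """Emulate `pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)` plus the bounds mask."""
    offsets = [pid * block_size + i for i in range(block_size)]
    mask = [off < n_elements for off in offsets]
    return offsets, mask

n_elements, block_size = 10, 4
grid = (n_elements + block_size - 1) // block_size  # ceil division, like triton.cdiv

for pid in range(grid):
    offsets, mask = block_offsets(pid, block_size, n_elements)
    active = [off for off, keep in zip(offsets, mask) if keep]
    print(f"program {pid}: offsets={offsets} active={active}")
```

The last program still computes a full block of offsets; only the mask keeps it from touching memory outside the tensor.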

### 📊 Performance Analysis Techniques

For each experiment, we'll analyze:
- **Kernel Count**: How many kernels were generated?
- **Fusion Success**: Which operations got fused together?
- **Memory Patterns**: Coalesced vs scattered memory access
- **Block Sizes**: How work is partitioned across GPU cores
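
Kernel count, the first item in this list, can be approximated by scanning a generated source file for `@triton.jit` definitions. The sketch below assumes that the decorator sits directly above each `def`; the helper name `list_triton_kernels` and the sample source are hypothetical.

```python
import re
from typing import List

# Matches a function definition directly decorated with @triton.jit
KERNEL_DEF = re.compile(r"@triton\.jit\s+def\s+(\w+)\s*\(")

def list_triton_kernels(source: str) -> List[str]:
    """Return the names of all @triton.jit functions found in a source string."""
    return KERNEL_DEF.findall(source)

# Hypothetical file contents, shortened for illustration
sample_source = '''
@triton.jit
def triton_poi_fused_mul_sigmoid_0(in_ptr0, out_ptr0, xnumel, XBLOCK: tl.constexpr):
    pass

@triton.jit
def triton_red_fused_mean_pow_1(in_ptr0, out_ptr0, xnumel, rnumel, XBLOCK: tl.constexpr):
    pass
'''

kernels = list_triton_kernels(sample_source)
print(f"{len(kernels)} kernel(s): {kernels}")
```

A single kernel for a multi-op module is a good sign that fusion succeeded; several kernels suggest that intermediate tensors still pass through global memory.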

Let's dive into the analysis of our experiments!

In [None]:
# Kernel Analysis Tools
def analyze_experiment_kernels(experiment_path: Path):
    """Analyze generated kernels in an experiment directory"""
    
    print(f"\n🔍 Analyzing kernels in: {experiment_path.name}")
    print("=" * 60)
    
    # Read metadata
    metadata_file = experiment_path / "metadata.json"
    if metadata_file.exists():
        with open(metadata_file, "r") as f:
            metadata = json.load(f)
        
        print(f"📋 Experiment: {metadata.get('experiment_name', 'Unknown')}")
        print(f"📅 Created: {metadata.get('created_at', 'Unknown')}")
        print(f"🔢 Kernels found: {len(metadata.get('kernels', []))}")
    
    # Find and analyze kernel files
    kernel_files = list(experiment_path.glob("*.py"))
    
    if not kernel_files:
        print("❌ No kernel files found")
        return
    
    print(f"\n📄 Kernel Files ({len(kernel_files)}):")
    for i, kernel_file in enumerate(kernel_files, 1):
        content = kernel_file.read_text()
        lines = len(content.splitlines())
        size_kb = len(content.encode('utf-8')) / 1024
        
        print(f"   {i}. {kernel_file.name} ({lines} lines, {size_kb:.1f} KB)")
        
        # Extract key information
        if '@triton.jit' in content:
            print(f"      ✅ Triton JIT kernel detected")
        
        if 'tl.load' in content and 'tl.store' in content:
            print(f"      🔄 Memory operations: load/store patterns found")
        
        # Inductor-generated kernels name their block sizes XBLOCK / RBLOCK
        if any(tag in content for tag in ('BLOCK_SIZE', 'block_size', 'XBLOCK', 'RBLOCK')):
            print(f"      📦 Block-based processing detected")
        
        # Look for fusion patterns
        fusion_indicators = []
        if 'layer_norm' in content.lower():
            fusion_indicators.append("LayerNorm")
        if 'gelu' in content.lower():
            fusion_indicators.append("GELU")
        if 'softmax' in content.lower():
            fusion_indicators.append("Softmax")
        if 'dropout' in content.lower():
            fusion_indicators.append("Dropout")
        if 'sigmoid' in content.lower():
            fusion_indicators.append("Sigmoid")
        
        if fusion_indicators:
            print(f"      🔗 Fusion detected: {' + '.join(fusion_indicators)}")

def create_experiment_summary():
    """Create a summary of all experiments"""
    
    print("\n" + "=" * 80)
    print("📊 EXPERIMENT SUMMARY")
    print("=" * 80)
    
    base_path = Path("./triton_kernels")
    if not base_path.exists():
        print("❌ No experiments found")
        return
    
    experiments = [d for d in base_path.iterdir() if d.is_dir()]
    
    if not experiments:
        print("❌ No experiment directories found")
        return
    
    print(f"🧪 Total experiments: {len(experiments)}\n")
    
    for exp_dir in sorted(experiments):
        analyze_experiment_kernels(exp_dir)
        print()

# Analyze all experiments
create_experiment_summary()

# Show directory structure
print("\n📁 Final Directory Structure:")
def show_tree(path: Path, prefix="", max_depth=3, current_depth=0):
    """Show directory tree structure"""
    if current_depth >= max_depth:
        return
    
    if path.is_dir():
        items = sorted(list(path.iterdir()))
        for i, item in enumerate(items):
            is_last = i == len(items) - 1
            current_prefix = "└── " if is_last else "├── "
            print(f"{prefix}{current_prefix}{item.name}")
            
            if item.is_dir() and current_depth < max_depth - 1:
                next_prefix = prefix + ("    " if is_last else "│   ")
                show_tree(item, next_prefix, max_depth, current_depth + 1)

triton_kernels_path = Path("./triton_kernels")
if triton_kernels_path.exists():
    print(f"{triton_kernels_path}/")
    show_tree(triton_kernels_path)
else:
    print("❌ Triton kernels directory not found")

## 🎓 Conclusions and Mastery Path

### 🔑 Key Insights from Our Journey

Through our systematic exploration, we've uncovered fundamental principles of GPU kernel optimization:

#### 1. **Kernel Fusion is Transformative** 🔗
- **Memory Bandwidth**: The primary bottleneck for element-wise and normalization operations, which do little arithmetic per byte moved
- **Fusion Benefits**: Combining operations eliminates intermediate memory transfers
- **Automatic Optimization**: PyTorch + Triton handles this complexity for us
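
A back-of-the-envelope estimate shows where the fusion benefit comes from. The sketch below is a simplified model under stated assumptions: every kernel reads its input once and writes its output once, caches and parameter loads are ignored, and the tensor shape is illustrative rather than taken from the experiments.

```python
def bytes_moved(num_elements: int, bytes_per_element: int, num_kernels: int) -> int:
    """Each kernel reads its input once and writes its output once."""
    return num_kernels * 2 * num_elements * bytes_per_element

# Illustrative shape (an assumption, not taken from the experiments)
num_elements = 32 * 512 * 768
fp32_bytes = 4

unfused = bytes_moved(num_elements, fp32_bytes, num_kernels=2)  # LayerNorm, then GELU
fused = bytes_moved(num_elements, fp32_bytes, num_kernels=1)    # one fused kernel

print(f"unfused: {unfused / 1024**2:.0f} MiB, fused: {fused / 1024**2:.0f} MiB")
print(f"traffic reduction: {unfused / fused:.1f}x")
```

For a memory-bound chain of N element-wise ops, this model predicts roughly an N-fold drop in traffic, which bounds the speedup fusion alone can deliver.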

#### 2. **Compilation Has Two Phases** ⚡
- **First Run**: Much slower (often seconds rather than milliseconds) due to tracing, code generation, and autotuning
- **Subsequent Runs**: Fast, reusing the cached compiled kernels
- **Production Tip**: Warm up compiled models before serving traffic and persist the compile cache across restarts
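
The two phases are easy to measure separately. The sketch below uses a hypothetical helper, `time_first_and_steady`, and a stand-in class `LazySetup` that simulates a one-time compilation delay so the example runs without a GPU; in the notebook one would pass a `torch.compile`d module and `sync=torch.cuda.synchronize` instead.

```python
import time

def time_first_and_steady(fn, *args, steady_runs=10, sync=None):
    """Return (first_call_seconds, mean_later_call_seconds) for fn(*args).

    Pass sync=torch.cuda.synchronize for GPU work so that queued kernels
    are finished before the clock is read.
    """
    def timed_call():
        start = time.perf_counter()
        fn(*args)
        if sync is not None:
            sync()
        return time.perf_counter() - start

    first = timed_call()
    steady = sum(timed_call() for _ in range(steady_runs)) / steady_runs
    return first, steady

class LazySetup:
    """Stand-in for a compiled module: pays a one-time setup cost on first call."""
    def __init__(self):
        self.ready = False

    def __call__(self, x):
        if not self.ready:
            time.sleep(0.05)  # simulated compilation delay
            self.ready = True
        return x * 2

first, steady = time_first_and_steady(LazySetup(), 3.0)
print(f"first call: {first * 1e3:.1f} ms, later calls: {steady * 1e3:.4f} ms")
```

Reporting the two numbers separately avoids the common mistake of folding compile time into a throughput benchmark.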

#### 3. **Different Patterns, Different Optimizations** 🎯
- **Sequential Operations** (LayerNorm + GELU): Straightforward fusion
- **Reduction Operations** (Softmax + Dropout): Complex memory patterns
- **Alternative Implementations** (RMSNorm): Simpler can be faster
- **Built-in vs Custom**: Multiple paths to optimization

### 🛠️ Practical Optimization Strategies

#### Memory-First Thinking 💾
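```python
import torch
import torch.nn.functional as F

x = torch.randn(32, 512, 768, device="cuda")

# Bad: Multiple memory roundtrips in eager mode
def layer_norm_gelu(x):
    x = F.layer_norm(x, x.shape[-1:])  # Memory: Load x, store normalized
    return F.gelu(x)                   # Memory: Load normalized, store activated

# Good: Single memory roundtrip once the compiler fuses both ops
compiled_layer_norm_gelu = torch.compile(layer_norm_gelu)
x = compiled_layer_norm_gelu(x)  # Memory: Load x, store final result
```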

#### Leverage Autotuning 🎯
- Let Triton find optimal block sizes for your hardware
- Use `mode="max-autotune"` when peak throughput matters; it trades longer compile times for a wider configuration search
- Cache compiled kernels across runs
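
The idea behind autotuning is simple: run each candidate configuration, time it, and keep the fastest. The toy sketch below illustrates that loop on the CPU. It is not Triton's autotuner; `autotune` and `time_chunked_sum` are hypothetical names, and the chunked Python sum only stands in for a kernel parameterized by a block size.

```python
import time

DATA = list(range(200_000))  # Stand-in workload

def autotune(candidates, measure):
    """Return (best_candidate, timings), where measure(c) gives a cost in seconds."""
    timings = {c: measure(c) for c in candidates}
    best = min(timings, key=timings.get)
    return best, timings

def time_chunked_sum(block_size: int) -> float:
    """Time a sum computed in chunks of block_size (CPU stand-in for a kernel)."""
    start = time.perf_counter()
    total = 0
    for i in range(0, len(DATA), block_size):
        total += sum(DATA[i:i + block_size])
    return time.perf_counter() - start

best, timings = autotune([64, 256, 1024, 4096], time_chunked_sum)
for block_size, seconds in timings.items():
    print(f"BLOCK_SIZE={block_size:5d}: {seconds * 1e3:.2f} ms")
print(f"selected BLOCK_SIZE={best}")
```

The winner depends on the machine, which is exactly why tuned configurations are cached per device rather than hard-coded.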

#### Profile Before Optimizing 📊
- Measure baseline performance first
- Identify memory-bound vs compute-bound operations
- Focus optimization efforts where they matter most
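
Whether an operation is memory-bound or compute-bound can be estimated from its arithmetic intensity (FLOPs per byte moved) compared with the hardware's ratio of peak compute to peak bandwidth. The sketch below assumes a hypothetical accelerator with 20 TFLOP/s and 1 TB/s; these are round illustrative figures, not the specification of any particular GPU, and the FLOP counts are rough estimates.

```python
def classify_bound(flops: float, bytes_moved: float,
                   peak_flops: float, peak_bandwidth: float) -> str:
    """Compare arithmetic intensity (FLOP/byte) with the machine balance point."""
    intensity = flops / bytes_moved
    machine_balance = peak_flops / peak_bandwidth
    return "compute-bound" if intensity > machine_balance else "memory-bound"

# Hypothetical accelerator: 20 TFLOP/s peak compute, 1 TB/s memory bandwidth
PEAK_FLOPS, PEAK_BW = 20e12, 1e12

# Element-wise activation on n fp32 values: ~4 FLOPs per element (rough estimate),
# one 4-byte read and one 4-byte write per element
n = 64 * 512 * 2048
activation = classify_bound(4 * n, 8 * n, PEAK_FLOPS, PEAK_BW)

# Square fp32 matmul (m x m): 2*m^3 FLOPs, three m x m matrices touched once
m = 4096
matmul = classify_bound(2 * m**3, 3 * m * m * 4, PEAK_FLOPS, PEAK_BW)

print(f"activation: {activation}")
print(f"matmul:     {matmul}")
```

Fusion helps most in the memory-bound case; compute-bound operations such as large matmuls gain more from precision and tiling choices.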

### 🚀 Advanced Optimization Techniques

#### 1. **Custom Triton Kernels** ✍️
When PyTorch's automatic fusion isn't enough:
- Write hand-optimized Triton kernels for critical paths
- Implement novel algorithms not available in PyTorch
- Optimize for specific hardware characteristics

#### 2. **Mixed Precision Optimization** 🎨
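```python
import math
import torch
import torch.nn.functional as F

# Combine kernel fusion with mixed precision
@torch.compile(mode="max-autotune")
def optimized_attention(q, k, v):
    d_k = q.size(-1)  # Per-head feature dimension
    scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d_k)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

# torch.compile does not choose a precision itself: pass fp16/bf16 tensors
# or call the function inside torch.autocast("cuda", dtype=torch.bfloat16).
```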

#### 3. **Memory Layout Optimization** 📐
- Experiment with different tensor layouts (NCHW vs NHWC)
- Use tensor cores when available (they need FP16, BF16, or TF32 inputs and work best when dimensions are multiples of 8)
- Consider padding strategies for optimal memory alignment
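
Padding a dimension up to an alignment boundary is a one-line calculation. The sketch below uses a hypothetical helper, `pad_to_multiple`; the default of 8 reflects the common guidance for FP16 tensor-core tiles, and the sample dimensions are illustrative.

```python
def pad_to_multiple(size: int, multiple: int = 8) -> int:
    """Round size up to the nearest multiple (ceil division, then scale back)."""
    return ((size + multiple - 1) // multiple) * multiple

# Hypothetical dimensions, e.g. a hidden size and two vocabulary sizes
for dim in [4096, 50257, 1001]:
    padded = pad_to_multiple(dim)
    print(f"{dim:6d} -> {padded:6d} (+{padded - dim} padding)")
```

The padded rows or columns must be masked or sliced away afterwards so they do not change the result.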

### 🔬 Experiment Organization Benefits

Our structured approach provides:

#### **Reproducibility** 🔄
- Each experiment is self-contained with metadata
- Easy to reproduce results across different hardware
- Clear documentation of what was tested

#### **Comparison** ⚖️
- Side-by-side performance analysis
- Clear identification of best-performing approaches
- Understanding of trade-offs between different methods

#### **Learning** 🎓
- Generated kernels serve as learning materials
- Progression from simple to complex patterns
- Foundation for advanced optimization work

### 🛤️ Your Next Steps

#### Immediate Actions (This Week) 📅
1. **Apply to Your Models**: Wrap your existing PyTorch models with `torch.compile()`
2. **Measure Impact**: Benchmark before/after compilation on your workloads  
3. **Experiment**: Try different fusion patterns from this notebook

#### Intermediate Exploration (This Month) 📈
1. **Custom Patterns**: Implement fusion for operations specific to your domain
2. **Hardware Tuning**: Experiment with different GPUs and configurations
3. **Production Integration**: Deploy compiled models in your applications

#### Advanced Mastery (Ongoing) 🚀
1. **Custom Triton Kernels**: Write hand-optimized kernels for critical operations
2. **Multi-GPU Scaling**: Extend optimizations to distributed settings
3. **Novel Algorithms**: Implement cutting-edge research with optimal GPU utilization

### 🎯 Final Thoughts

GPU kernel optimization is both an art and a science. The tools (PyTorch + Triton) handle much of the complexity, but understanding the principles helps you:

- **Debug Performance Issues**: Know where to look when things are slow
- **Design Better Architectures**: Choose patterns that optimize well
- **Push Boundaries**: Implement novel ideas with optimal performance

The organized experimental approach we've developed here serves as a foundation for continued exploration and optimization of your specific workloads.

**🎉 Congratulations! You've mastered the fundamentals of PyTorch kernel optimization with Triton!**

In [None]:
# 🎯 Quick Start Guide
print("🚀 TRITON OPTIMIZATION NOTEBOOK - QUICK START")
print("=" * 50)
print("📚 To use this notebook:")
print("   1. Run cells sequentially from top to bottom")
print("   2. Each experiment creates its own organized directory")
print("   3. Check ./triton_kernels/ for generated kernels and analysis")
print("   4. Modify patterns to test your own operations")
print("")
print("🎓 Learning Path:")
print("   Experiment 1: LayerNorm + GELU (fundamentals)")
print("   Experiment 2: Softmax + Dropout (attention)")  
print("   Experiment 3: RMSNorm (modern normalization)")
print("   Experiment 4: SiLU variants (implementation comparison)")
print("")
print("🔬 Each experiment includes:")
print("   • Educational explanations and mathematical background")
print("   • Generated Triton kernels with organized storage")
print("   • Performance benchmarks and analysis")
print("   • Correctness verification and insights")
print("")
print("✨ Ready to explore GPU kernel optimization!")