[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mmcmanus1/rlhf-canary/blob/main/notebooks/04_root_cause_analysis.ipynb)

# Root Cause Analysis: Debugging Regressions

When a canary run fails, the heuristics system helps you work out why. In this notebook you'll learn to interpret suspect rankings, match failure patterns to common issues, and write clear debug reports for your team.

**What you'll learn:**
1. How the heuristics system works
2. Regression categories: dataloader, memory, kernel, IO, etc.
3. Simulating different regression types
4. Interpreting suspect ranking and evidence
5. Creating a "debug story" for failed runs

**Requirements:** GPU runtime (Runtime > Change runtime type > T4 GPU)

**Runtime:** ~10-12 minutes

## 1. Setup

In [None]:
import os
import re
import sys

print("Starting Environment Setup...")

# --- 1. Clone or update the repo ---
if not os.path.exists("/content/rlhf-canary"):
    !git clone https://github.com/mmcmanus1/rlhf-canary.git /content/rlhf-canary
else:
    !cd /content/rlhf-canary && git pull --ff-only

%cd /content/rlhf-canary

# --- 2. Force-Install the "Safe Harbor" Stack ---
!pip install "trl==0.11.4" "transformers==4.44.2" "peft==0.12.0" "accelerate==0.34.2" "tokenizers==0.19.1" --force-reinstall --no-deps --quiet
!pip install -q datasets pydantic click PyYAML bitsandbytes
print("Libraries installed (TRL 0.11.4 / Transformers 4.44.2)")

# --- 3. Patch pyproject.toml (Prevent future drift) ---
project_file = "/content/rlhf-canary/pyproject.toml"
if os.path.exists(project_file):
    with open(project_file, "r") as f:
        content = f.read()
    
    if "trl==0.11.4" not in content:
        content = re.sub(r'trl[<>=!~]+[\d\.]+', 'trl==0.11.4', content)
        with open(project_file, "w") as f:
            f.write(content)
        print("Config file patched to lock TRL 0.11.4")

# --- 4. Patch Source Code (Compatibility Fix) ---
runner_file = "/content/rlhf-canary/canary/runner/local.py"
if os.path.exists(runner_file):
    with open(runner_file, "r") as f:
        code = f.read()
    
    if "processing_class=" in code:
        code = code.replace("processing_class=", "tokenizer=")
        with open(runner_file, "w") as f:
            f.write(code)
        print("Code patched: Reverted 'processing_class' to 'tokenizer'")
    else:
        print("Code is already compatible.")

# --- 5. Install the package ---
!pip install -e . --quiet

print("Environment Ready!")

In [None]:
# Verify GPU and installation
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

import canary
print(f"Canary module loaded from: {canary.__file__}")

## 2. Understanding the Heuristics System

When a canary run shows regressions, the heuristics system analyzes the pattern of failed checks to suggest likely root causes.

### Regression Categories

| Category | What it indicates | Common causes |
|----------|------------------|---------------|
| `DATALOADER` | CPU/IO bottleneck | num_workers, preprocessing |
| `TOKENIZATION` | Tokenizer changes | New tokenizer, different config |
| `COMMUNICATION` | Multi-GPU issues | NCCL, DDP changes |
| `MEMORY` | Memory pressure | Leaks, fragmentation |
| `KERNEL` | GPU kernel changes | Model architecture, batch size |
| `IO` | I/O overhead | Checkpointing, logging |
| `UNKNOWN` | Unclear cause | Needs manual investigation |

### How Suspects Are Ranked

Each suspect has:
- **Confidence score** (0.0 - 1.0): How likely this is the cause
- **Evidence**: Specific metrics that support this diagnosis
- **Suggested actions**: Concrete steps to fix the issue
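
To make the ranking concrete, here is a minimal sketch of how suspects carrying these three fields could be ordered. The `Suspect` dataclass and `rank_suspects` helper are hypothetical stand-ins for illustration, not the classes in `canary.compare.heuristics`, and the confidence values and evidence strings are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Suspect:
    """Hypothetical stand-in for a ranked root-cause suspect."""
    category: str
    confidence: float  # 0.0 - 1.0
    evidence: list[str] = field(default_factory=list)
    suggested_actions: list[str] = field(default_factory=list)


def rank_suspects(suspects):
    """Return suspects sorted from most to least likely."""
    return sorted(suspects, key=lambda s: s.confidence, reverse=True)


# Illustrative values only.
suspects = [
    Suspect("memory", 0.35, ["peak memory +120MB"], ["Check for leaks"]),
    Suspect("dataloader", 0.80, ["step time +40%, memory stable"], ["Raise num_workers"]),
]
ranked = rank_suspects(suspects)
top_categories = [s.category for s in ranked]
print(top_categories)
```

The first entry of the sorted list plays the role of the "top suspect" used later in this notebook.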

In [None]:
# Show the regression categories from the code
from canary.compare.heuristics import RegressionCategory

print("Regression Categories:")
print("="*40)
for cat in RegressionCategory:
    print(f"  {cat.value:<15} - {cat.name}")

## 3. Create a Baseline Run

In [None]:
# Run baseline DPO canary
!python -m canary.cli run configs/dpo_smoke.yaml -o ./rca_output/baseline

In [None]:
import json
from pathlib import Path

# Load baseline metrics
baseline_paths = list(Path('./rca_output/baseline').rglob('metrics.json'))
if not baseline_paths:
    raise FileNotFoundError("No metrics.json found for baseline. Did the training complete?")

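# Pick the most recently written file in case earlier runs left metrics behind
baseline_path = max(baseline_paths, key=lambda p: p.stat().st_mtime)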
with open(baseline_path) as f:
    baseline_metrics = json.load(f)

print("Baseline Metrics:")
print(f"  Step time (mean): {baseline_metrics['perf']['step_time']['mean']:.4f}s")
print(f"  Tokens/sec: {baseline_metrics['perf']['approx_tokens_per_sec']:.0f}")
print(f"  Peak memory: {baseline_metrics['perf']['max_mem_mb']:.0f}MB")

## 4. Simulate a Dataloader Bottleneck

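A common regression is the dataloader becoming the bottleneck, so the GPU waits on the CPU for its next batch. It typically shows up as:
- Step time increases
- GPU utilization drops
- Memory stays stable

We'll simulate this by setting the batch size to 1 with 8 gradient accumulation steps, so each optimizer step needs more dataloader fetches and incurs more Python overhead.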
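
A back-of-the-envelope sketch of why this setting stresses the data path: with gradient accumulation, each optimizer step consumes `gradient_accumulation_steps` micro-batches, so a smaller per-device batch means more dataloader fetches and forward/backward calls per step. The helper name is hypothetical, the numbers come from the config in the next cell, and the count assumes `max_steps` refers to optimizer steps (as in the Hugging Face `Trainer`).

```python
def micro_batch_stats(batch_size, gradient_accumulation_steps, max_steps):
    """Count how much work the data path does for a given config."""
    effective_batch = batch_size * gradient_accumulation_steps
    fetches_per_step = gradient_accumulation_steps
    total_fetches = fetches_per_step * max_steps
    return effective_batch, fetches_per_step, total_fetches


# Values from the dataloader-bottleneck config: batch_size=1, accumulation=8, 60 steps.
effective, per_step, total = micro_batch_stats(1, 8, 60)
print(f"Effective batch size: {effective}")
print(f"Dataloader fetches per optimizer step: {per_step}")
print(f"Total fetches over the run: {total}")
```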

In [None]:
# Create a config that simulates dataloader bottleneck
dataloader_bottleneck_config = """
name: dpo_dataloader_bottleneck
description: Simulates dataloader bottleneck via small batch

# Model configuration
model_name: EleutherAI/pythia-70m
use_peft: true
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05

# Training configuration - SMALL BATCH = MORE DATALOADER CALLS
training_type: dpo
max_steps: 60
batch_size: 1              # Tiny batch = high Python overhead
gradient_accumulation_steps: 8
learning_rate: 5.0e-5
max_length: 256
warmup_steps: 5

# DPO-specific
beta: 0.1
max_prompt_length: 64

# Dataset configuration
dataset_name: Anthropic/hh-rlhf
dataset_split: train
dataset_size: 256
seed: 42

output_dir: ./rca_output
metrics_warmup_steps: 5

profiler:
  enabled: false
"""

with open('configs/dpo_dataloader_bottleneck.yaml', 'w') as f:
    f.write(dataloader_bottleneck_config)

print("Created config with batch_size=1 to simulate dataloader bottleneck")

In [None]:
# Run the dataloader bottleneck config
!python -m canary.cli run configs/dpo_dataloader_bottleneck.yaml -o ./rca_output/dataloader_bottleneck

## 5. Run Root Cause Analysis

In [None]:
# Load the regression metrics
regression_paths = list(Path('./rca_output/dataloader_bottleneck').rglob('metrics.json'))
if not regression_paths:
    raise FileNotFoundError("No metrics.json found for regression run. Did the training complete?")

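# Pick the most recently written file in case earlier runs left metrics behind
regression_path = max(regression_paths, key=lambda p: p.stat().st_mtime)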
with open(regression_path) as f:
    regression_metrics = json.load(f)

print("Regression Metrics:")
print(f"  Step time (mean): {regression_metrics['perf']['step_time']['mean']:.4f}s")
print(f"  Tokens/sec: {regression_metrics['perf']['approx_tokens_per_sec']:.0f}")
print(f"  Peak memory: {regression_metrics['perf']['max_mem_mb']:.0f}MB")

# Calculate deltas
base_step = baseline_metrics['perf']['step_time']['mean']
reg_step = regression_metrics['perf']['step_time']['mean']
step_delta = (reg_step - base_step) / base_step * 100

base_tps = baseline_metrics['perf']['approx_tokens_per_sec']
reg_tps = regression_metrics['perf']['approx_tokens_per_sec']
tps_delta = (reg_tps - base_tps) / base_tps * 100

base_mem = baseline_metrics['perf']['max_mem_mb']
reg_mem = regression_metrics['perf']['max_mem_mb']
mem_delta = reg_mem - base_mem

print("\nDeltas:")
print(f"  Step time: {step_delta:+.1f}%")
print(f"  Tokens/sec: {tps_delta:+.1f}%")
print(f"  Memory: {mem_delta:+.0f}MB")

In [None]:
# Run comparison with root cause analysis
from canary.compare.stats import compare_to_baseline, load_metrics
from canary.compare.thresholds import SMOKE_THRESHOLDS
from canary.compare.heuristics import analyze_regression, format_suspects_markdown

# Load metrics properly
current = load_metrics(str(regression_path))
baseline = load_metrics(str(baseline_path))

# Run comparison
report = compare_to_baseline(current, baseline, SMOKE_THRESHOLDS)

print("="*60)
print("COMPARISON REPORT")
print("="*60)
print(f"\nOverall: {'PASS' if report.passed else 'FAIL'}")
print(f"\nFailed checks ({len(report.failed_checks)}):")
for check in report.failed_checks:
    # Compare against None so a legitimate 0.0% delta is still shown as a percentage
    delta_str = f"{check.delta_pct:+.1f}%" if check.delta_pct is not None else f"{check.delta:+.1f}"
    print(f"  - {check.name}: {delta_str}")

In [None]:
# Run root cause analysis
analysis = analyze_regression(report, current, baseline)

print("="*60)
print("ROOT CAUSE ANALYSIS")
print("="*60)
print(f"\nSummary: {analysis.summary}")
print(f"\nTop {len(analysis.suspects)} suspects:")

for i, suspect in enumerate(analysis.suspects, 1):
    filled = int(suspect.confidence * 10)  # filled bar segments out of 10
    conf_bar = "█" * filled + "░" * (10 - filled)
    print(f"\n#{i} {suspect.category.value.upper()} [{conf_bar}] {suspect.confidence:.0%}")
    print(f"   {suspect.description}")
    print("   Evidence:")
    for ev in suspect.evidence:
        print(f"     - {ev}")
    print("   Actions:")
    for action in suspect.suggested_actions[:2]:  # Show top 2 actions
        print(f"     - {action}")

## 6. Simulate a Memory Regression

Memory regressions typically show up as:
- Peak memory increases significantly
- Throughput may drop due to memory pressure

We'll simulate this by doubling `max_length` to 512 and raising `max_prompt_length` to 128, which increases the memory needed for activations.
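
As a rough sketch of why a longer sequence length raises memory: activations stored per token grow linearly with sequence length, while the attention score matrix in a standard (non-fused) attention implementation grows with its square. The function below is illustrative arithmetic only. It does not model the CUDA allocator, fused attention kernels avoid materializing the full score matrix, and the ratios are upper bounds because `max_length` is a cap rather than the actual length of every example.

```python
def activation_scaling(base_len, new_len):
    """Relative growth of per-token and attention-score activations."""
    ratio = new_len / base_len
    per_token_growth = ratio        # hidden states, MLP activations
    attn_score_growth = ratio ** 2  # seq_len x seq_len score matrix
    return per_token_growth, attn_score_growth


# Baseline max_length is 256; the regression config uses 512.
linear, quadratic = activation_scaling(256, 512)
print(f"Per-token activations: up to {linear:.0f}x")
print(f"Attention scores: up to {quadratic:.0f}x")
```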

In [None]:
# Create a config with higher memory usage
memory_regression_config = """
name: dpo_memory_regression
description: Simulates memory regression via longer sequences

# Model configuration
model_name: EleutherAI/pythia-70m
use_peft: true
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05

# Training configuration - LONGER SEQUENCES = MORE MEMORY
training_type: dpo
max_steps: 60
batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 5.0e-5
max_length: 512            # 2x longer sequences!
warmup_steps: 5

# DPO-specific
beta: 0.1
max_prompt_length: 128     # Also longer prompts

# Dataset configuration
dataset_name: Anthropic/hh-rlhf
dataset_split: train
dataset_size: 256
seed: 42

output_dir: ./rca_output
metrics_warmup_steps: 5

profiler:
  enabled: false
"""

with open('configs/dpo_memory_regression.yaml', 'w') as f:
    f.write(memory_regression_config)

print("Created config with max_length=512 (2x baseline) to simulate memory regression")

In [None]:
# Run the memory regression config
!python -m canary.cli run configs/dpo_memory_regression.yaml -o ./rca_output/memory_regression

In [None]:
# Analyze the memory regression
mem_reg_paths = list(Path('./rca_output/memory_regression').rglob('metrics.json'))
if not mem_reg_paths:
    raise FileNotFoundError("No metrics.json found for memory regression run. Did the training complete?")

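# Pick the most recently written file in case earlier runs left metrics behind
mem_reg_path = max(mem_reg_paths, key=lambda p: p.stat().st_mtime)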
mem_current = load_metrics(str(mem_reg_path))

# Run comparison
mem_report = compare_to_baseline(mem_current, baseline, SMOKE_THRESHOLDS)
mem_analysis = analyze_regression(mem_report, mem_current, baseline)

print("="*60)
print("MEMORY REGRESSION ANALYSIS")
print("="*60)
print(f"\nOverall: {'PASS' if mem_report.passed else 'FAIL'}")

if mem_report.failed_checks:
    print("\nFailed checks:")
    for check in mem_report.failed_checks:
        # Compare against None so a legitimate 0.0% delta is still shown as a percentage
        delta_str = f"{check.delta_pct:+.1f}%" if check.delta_pct is not None else f"{check.delta:+.1f}"
        print(f"  - {check.name}: {delta_str}")

print(f"\nRoot cause summary: {mem_analysis.summary}")
if mem_analysis.top_suspect:
    print(f"\nTop suspect: {mem_analysis.top_suspect.category.value}")
    print(f"Confidence: {mem_analysis.top_suspect.confidence:.0%}")
    print(f"Description: {mem_analysis.top_suspect.description}")

## 7. Using the CLI for Root Cause Analysis

You can also get root cause analysis directly from the CLI.
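
If you want to script this comparison outside a notebook, one option is to build the same command in Python and hand it to `subprocess`. The sketch below only constructs the argument list from the invocation shown in this section. The helper name and the example paths are hypothetical, and the exit code the CLI returns on a failed comparison is not covered here, so check that before gating CI on it.

```python
import sys


def build_compare_command(current, baseline, tier="smoke"):
    """Build the argv for `canary.cli compare` (current first, then baseline)."""
    return [
        sys.executable, "-m", "canary.cli", "compare",
        str(current), str(baseline),
        "--threshold-tier", tier,
    ]


# Placeholder paths; substitute the metrics.json files from your own runs.
cmd = build_compare_command("run/metrics.json", "base/metrics.json")
print(" ".join(cmd[1:]))
# To execute it: subprocess.run(cmd, check=False)
```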

In [None]:
# Use CLI to compare and get analysis
!python -m canary.cli compare {regression_path} {baseline_path} --threshold-tier smoke

## 8. Creating a Debug Story

When presenting a regression to your team, you want a clear "debug story". Here's how to structure it:

In [None]:
def create_debug_story(report, analysis, current, baseline):
    """Create a structured debug story from canary results."""
    
    story = []
    story.append("# Regression Debug Report")
    story.append("")
    
    # 1. What happened
    story.append("## 1. What Happened")
    story.append(f"- Canary result: **{'PASS' if report.passed else 'FAIL'}**")
    story.append(f"- Failed checks: {len(report.failed_checks)}")
    for check in report.failed_checks:
        # Compare against None so a legitimate 0.0% delta is still shown as a percentage
        delta_str = f"{check.delta_pct:+.1f}%" if check.delta_pct is not None else f"{check.delta:+.1f}"
        story.append(f"  - {check.name}: {delta_str}")
    story.append("")
    
    # 2. Root cause
    story.append("## 2. Root Cause Analysis")
    story.append(f"**Summary:** {analysis.summary}")
    story.append("")
    
    if analysis.top_suspect:
        ts = analysis.top_suspect
        story.append(f"**Primary suspect:** {ts.category.value} ({ts.confidence:.0%} confidence)")
        story.append("")
        story.append("Evidence:")
        for ev in ts.evidence:
            story.append(f"- {ev}")
    story.append("")
    
    # 3. Recommended actions
    story.append("## 3. Recommended Actions")
    if analysis.top_suspect:
        for i, action in enumerate(analysis.top_suspect.suggested_actions, 1):
            story.append(f"{i}. {action}")
    story.append("")
    
    # 4. Metrics comparison
    story.append("## 4. Metrics Comparison")
    story.append("| Metric | Baseline | Current | Delta |")
    story.append("|--------|----------|---------|-------|")
    
    base_step = baseline.perf.step_time.mean
    curr_step = current.perf.step_time.mean
    if base_step is not None and base_step > 0 and curr_step is not None:
        step_delta = (curr_step - base_step) / base_step * 100
        story.append(f"| Step time | {base_step:.4f}s | {curr_step:.4f}s | {step_delta:+.1f}% |")
    else:
        story.append("| Step time | N/A | N/A | N/A |")
    
    base_tps = baseline.perf.approx_tokens_per_sec
    curr_tps = current.perf.approx_tokens_per_sec
    if base_tps is not None and base_tps > 0 and curr_tps is not None:
        tps_delta = (curr_tps - base_tps) / base_tps * 100
        story.append(f"| Tokens/sec | {base_tps:.0f} | {curr_tps:.0f} | {tps_delta:+.1f}% |")
    else:
        story.append("| Tokens/sec | N/A | N/A | N/A |")
    
    base_mem = baseline.perf.max_mem_mb
    curr_mem = current.perf.max_mem_mb
    if base_mem is not None and curr_mem is not None:
        mem_delta = curr_mem - base_mem
        story.append(f"| Memory | {base_mem:.0f}MB | {curr_mem:.0f}MB | {mem_delta:+.0f}MB |")
    else:
        story.append("| Memory | N/A | N/A | N/A |")
    
    return "\n".join(story)

# Generate debug story for our dataloader regression
debug_story = create_debug_story(report, analysis, current, baseline)
print(debug_story)

## 9. Pattern Recognition Guide

Here's a quick reference for recognizing regression patterns:

| Pattern | Step Time | Memory | GPU Util | Likely Cause |
|---------|-----------|--------|----------|-------------|
| Step time ↑, Memory stable | ↑ | ~ | ↓ | Dataloader/CPU |
| Step time ↑, Memory ↑ | ↑ | ↑ | ~ | Memory fragmentation |
| Memory ↑↑, Step time ~ | ~ | ↑↑ | ~ | Memory leak |
| All metrics worse | ↑ | ↑ | ↓ | Major regression |
| NaN detected | N/A | N/A | N/A | Numerical instability |
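
The arrows in this table can be derived from raw metric deltas. Here is a minimal sketch, assuming a hypothetical 10% tolerance for "stable" and 50% for a large change. These thresholds are illustrative and are not the values in `SMOKE_THRESHOLDS`.

```python
def trend_arrow(delta_pct, stable_tol=10.0, large_tol=50.0):
    """Map a percentage change to the arrow notation used in the table."""
    if abs(delta_pct) <= stable_tol:
        return "~"
    arrow = "↑" if delta_pct > 0 else "↓"
    # Double the arrow for large changes, e.g. a suspected leak.
    return arrow * 2 if abs(delta_pct) >= large_tol else arrow


# Made-up deltas: step time +35%, memory +3%.
row = (trend_arrow(35.0), trend_arrow(3.0))
print(row)
```

A step-time arrow of `↑` next to a memory arrow of `~` corresponds to the first row of the table (Dataloader/CPU).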


In [None]:
# Quick pattern matcher
def match_regression_pattern(report, current, baseline):
    """Match regression to common patterns."""
    
    failed_names = {c.name for c in report.failed_checks}
    
    # Check for NaN/Inf first (highest priority)
    if 'nan_steps' in failed_names or 'inf_steps' in failed_names:
        return "NUMERICAL_INSTABILITY", "Check learning rate, gradient clipping, data preprocessing"
    
    # Pattern: Step time up, memory stable
    if 'step_time_mean' in failed_names and 'max_memory' not in failed_names:
        return "CPU_BOTTLENECK", "Likely dataloader, tokenization, or Python overhead"
    
    # Pattern: Step time up, memory up
    if 'step_time_mean' in failed_names and 'max_memory' in failed_names:
        return "MEMORY_FRAGMENTATION", "Likely CUDA allocator issue or memory leak"
    
    # Pattern: Only memory up
    if 'max_memory' in failed_names and 'step_time_mean' not in failed_names:
        return "MEMORY_INCREASE", "Model size, batch size, or sequence length change"
    
    # Pattern: Only throughput down
    if 'tokens_per_sec' in failed_names:
        return "THROUGHPUT_DROP", "Check batch efficiency, model changes"
    
    return "UNKNOWN", "Requires manual investigation"

# Test pattern matcher on our regressions
print("Pattern Analysis:")
print("="*50)

pattern, suggestion = match_regression_pattern(report, current, baseline)
print("\nDataloader bottleneck run:")
print(f"  Pattern: {pattern}")
print(f"  Suggestion: {suggestion}")

if mem_report.failed_checks:
    pattern2, suggestion2 = match_regression_pattern(mem_report, mem_current, baseline)
    print("\nMemory regression run:")
    print(f"  Pattern: {pattern2}")
    print(f"  Suggestion: {suggestion2}")

## 10. Summary

### Key Takeaways:

1. **Heuristics analyze patterns** of failed checks to suggest root causes
2. **Suspects are ranked** by confidence score based on evidence
3. **Common patterns** map to specific categories (dataloader, memory, etc.)
4. **Each suspect includes** evidence and suggested actions
5. **Debug stories** help communicate regressions to your team

### When to Use Root Cause Analysis:

- After any canary failure
- When investigating performance degradation
- When onboarding new team members to debugging
- When creating post-mortems for incidents

### Next Steps:

- See `05_ppo_canary.ipynb` for PPO-specific canary workflows
- See `02_profiler_deep_dive.ipynb` for deeper performance analysis