[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mmcmanus1/rlhf-canary/blob/main/notebooks/08_configuration_and_thresholds.ipynb)

# Configuration & Thresholds: Customizing Your Canary Tests

Master RLHF Canary configuration: generate configs, customize regression thresholds, understand test tiers, and build test matrices for comprehensive coverage.

**What you'll learn:**
1. Using `canary init-config` to generate starter configurations
2. Understanding threshold tiers (smoke, default, perf, nightly)
3. Creating custom threshold files
4. Embedding thresholds in config files
5. Understanding the test matrix system
6. Environment fingerprinting for reproducibility
7. JSON output for automation

**Requirements:** GPU runtime (Runtime > Change runtime type > T4 GPU)

**Runtime:** ~12-15 minutes

## 1. Setup

In [None]:
import os
import re
import sys

print("Starting Environment Setup...")

# --- 1. Clone the repo first ---
if not os.path.exists("/content/rlhf-canary"):
    !git clone https://github.com/mmcmanus1/rlhf-canary.git /content/rlhf-canary

%cd /content/rlhf-canary

# --- 2. Force-Install the "Safe Harbor" Stack ---
!pip install "trl==0.11.4" "transformers==4.44.2" "peft==0.12.0" "accelerate==0.34.2" "tokenizers==0.19.1" --force-reinstall --no-deps --quiet
!pip install -q datasets pydantic click PyYAML bitsandbytes
print("Libraries installed (TRL 0.11.4 / Transformers 4.44.2)")

# --- 3. Patch pyproject.toml ---
project_file = "/content/rlhf-canary/pyproject.toml"
if os.path.exists(project_file):
    with open(project_file, "r") as f:
        content = f.read()
    
    if "trl==0.11.4" not in content:
        content = re.sub(r'trl[<>=!~]+[\d\.]+', 'trl==0.11.4', content)
        with open(project_file, "w") as f:
            f.write(content)
        print("Config file patched to lock TRL 0.11.4")

# --- 4. Patch Source Code ---
runner_file = "/content/rlhf-canary/canary/runner/local.py"
if os.path.exists(runner_file):
    with open(runner_file, "r") as f:
        code = f.read()
    
    if "processing_class=" in code:
        code = code.replace("processing_class=", "tokenizer=")
        with open(runner_file, "w") as f:
            f.write(code)
        print("Code patched: Reverted 'processing_class' to 'tokenizer'")
    else:
        print("Code is already compatible.")

# --- 5. Install the package ---
!pip install -e . --quiet

print("Environment Ready!")

In [None]:
# Verify GPU is available
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 2. Generating Configurations with `init-config`

The `canary init-config` command generates starter configurations for different training types and test tiers.

### Usage

```bash
canary init-config <output_path> --type <dpo|sft|ppo> --tier <smoke|perf|nightly>
```

### Step Counts by Tier

| Tier | DPO/SFT Steps | PPO Steps | Duration | Use Case |
|------|---------------|-----------|----------|----------|
| smoke | 100 | 50 | ~5-10 min | PR gating |
| perf | 500 | 200 | ~20-45 min | Performance analysis |
| nightly | 2000 | 500 | ~1-3 hr | Comprehensive testing |

In [None]:
# Generate DPO configs for all tiers
!python -m canary.cli init-config generated/dpo_smoke.yaml --type dpo --tier smoke
!python -m canary.cli init-config generated/dpo_perf.yaml --type dpo --tier perf
!python -m canary.cli init-config generated/dpo_nightly.yaml --type dpo --tier nightly

In [None]:
# Generate PPO and SFT smoke configs
!python -m canary.cli init-config generated/ppo_smoke.yaml --type ppo --tier smoke
!python -m canary.cli init-config generated/sft_smoke.yaml --type sft --tier smoke

In [None]:
# View a generated config
!cat generated/dpo_smoke.yaml

In [None]:
# Compare DPO vs PPO configs
print("=== DPO Smoke Config ===")
!cat generated/dpo_smoke.yaml

print("\n=== PPO Smoke Config ===")
!cat generated/ppo_smoke.yaml

### Config Fields Explained

| Field | Description | Default |
|-------|-------------|---------|
| `name` | Run identifier | `{type}_{tier}` |
| `model_name` | HuggingFace model | `EleutherAI/pythia-70m` |
| `use_peft` | Enable LoRA | `true` |
| `training_type` | `dpo`, `sft`, or `ppo` | varies |
| `max_steps` | Training steps | varies by tier |
| `batch_size` | Per-device batch size | `2` |
| `dataset_name` | HuggingFace dataset | `Anthropic/hh-rlhf` |

**DPO-specific:** `beta`, `max_prompt_length`

**PPO-specific:** `ppo_epochs`, `init_kl_coef`, `target_kl`, `cliprange`, `vf_coef`, `max_new_tokens`, `use_synthetic_reward`
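
As a rough illustration of how the type-specific field lists above could be used, the sketch below flags keys that belong to a different training type (for example `beta` left behind in a PPO config). The helper `misplaced_fields` and the sample config are hypothetical and not part of RLHF Canary; only the field names come from the lists above.

```python
# Type-specific keys, copied from the lists above.
TYPE_SPECIFIC = {
    "dpo": {"beta", "max_prompt_length"},
    "ppo": {
        "ppo_epochs", "init_kl_coef", "target_kl", "cliprange",
        "vf_coef", "max_new_tokens", "use_synthetic_reward",
    },
    "sft": set(),  # no SFT-specific fields are listed above
}


def misplaced_fields(config: dict) -> set:
    """Return type-specific keys that belong to a different training type."""
    own = TYPE_SPECIFIC[config["training_type"]]
    others = set().union(*TYPE_SPECIFIC.values()) - own
    return others & set(config)


# Hypothetical PPO config that still carries a DPO-only key.
config = {"training_type": "ppo", "max_steps": 50, "beta": 0.1, "target_kl": 0.1}
print(sorted(misplaced_fields(config)))  # prints ['beta']
```

A check like this is only a convenience; whether the real config loader rejects or ignores such keys is not covered here.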

## 3. Understanding Threshold Tiers

Thresholds control how sensitive regression detection is. RLHF Canary provides four built-in tiers:

In [None]:
# View all threshold tiers programmatically
from canary.compare.thresholds import (
    DEFAULT_THRESHOLDS,
    SMOKE_THRESHOLDS,
    PERF_THRESHOLDS,
    NIGHTLY_THRESHOLDS,
    get_thresholds,
)

print("Threshold Comparison")
print("=" * 80)
print(f"{'Metric':<35} {'Smoke':>10} {'Default':>10} {'Perf':>10} {'Nightly':>10}")
print("-" * 80)

# (attribute name, unit suffix); plain counts carry no suffix
metrics = [
    ("max_step_time_increase_pct", "%"),
    ("max_tps_drop_pct", "%"),
    ("max_mem_increase_mb", "MB"),
    ("max_mem_increase_pct", "%"),
    ("min_step_count", ""),
    ("nan_steps_allowed", ""),
    ("inf_steps_allowed", ""),
]

# Column order must match the header printed above
tiers = [SMOKE_THRESHOLDS, DEFAULT_THRESHOLDS, PERF_THRESHOLDS, NIGHTLY_THRESHOLDS]

for metric, unit in metrics:
    # Append the full unit (the old unit[0] printed "M" for MB and "s" for steps)
    cells = " ".join(f"{str(getattr(t, metric)) + unit:>10}" for t in tiers)
    print(f"{metric:<35} {cells}")

### When to Use Each Tier

| Tier | Strictness | Use Case |
|------|------------|----------|
| **smoke** | Lenient | PR gating - catch obvious regressions quickly |
| **default** | Balanced | General purpose, manual comparisons |
| **perf** | Strict | Performance-focused testing, longer runs |
| **nightly** | Strictest | Comprehensive nightly soak tests |
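
To make "strictness" concrete, here is a minimal sketch of a percentage-based check, assuming a regression is flagged when the relative increase exceeds a limit such as `max_step_time_increase_pct`. The functions and numbers are illustrative only and are not RLHF Canary's actual comparison code.

```python
def pct_change(baseline: float, current: float) -> float:
    """Percentage change of current relative to baseline."""
    return (current - baseline) / baseline * 100.0


def step_time_regressed(baseline_s: float, current_s: float, max_increase_pct: float) -> bool:
    """True when step time grew by more than the allowed percentage."""
    return pct_change(baseline_s, current_s) > max_increase_pct


# Hypothetical numbers: mean step time went from 0.50 s to 0.56 s (+12%).
print(step_time_regressed(0.50, 0.56, max_increase_pct=15.0))  # prints False
print(step_time_regressed(0.50, 0.56, max_increase_pct=5.0))   # prints True
```

The same 12% slowdown passes under a lenient limit and fails under a strict one, which is the trade-off the tiers encode.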

In [None]:
# Run a quick canary for comparison demos
!python -m canary.cli run configs/dpo_smoke.yaml -o ./canary_output/baseline

from pathlib import Path
baseline_path = next(Path('./canary_output/baseline').rglob('metrics.json'))
!mkdir -p baselines
!cp {baseline_path} baselines/main.json
print("Baseline saved to baselines/main.json")

In [None]:
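# Compare using different tiers. Both inputs are the same run here
# (baselines/main.json is a copy of it), so performance deltas are zero.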
print("=== SMOKE TIER (lenient) ===")
!python -m canary.cli compare {baseline_path} baselines/main.json --threshold-tier smoke 2>/dev/null | head -20

print("\n=== NIGHTLY TIER (strict) ===")
!python -m canary.cli compare {baseline_path} baselines/main.json --threshold-tier nightly 2>/dev/null | head -20

## 4. Creating Custom Threshold Files

For project-specific needs, create a custom threshold YAML file. A file combines two parts:

1. **`base_tier`** - The built-in tier to start from
2. **Overrides** - Any threshold keys whose values should differ from that tier
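
The base-plus-override idea can be sketched with a frozen dataclass and `dataclasses.replace`. This is an assumption about the general pattern, not the implementation behind `load_thresholds_from_yaml`; the `Thresholds` class and `resolve` helper are hypothetical, and the smoke values are the ones quoted in the comments of the custom threshold file in this section.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Thresholds:
    # Field names mirror threshold keys used in this notebook.
    max_step_time_increase_pct: float
    max_tps_drop_pct: float
    max_mem_increase_mb: float


# Illustrative base tier; values taken from the comments in this section.
BASE_TIERS = {
    "smoke": Thresholds(15.0, 12.0, 1000.0),
}


def resolve(spec: dict) -> Thresholds:
    """Start from spec['base_tier'], then apply every other key as an override."""
    overrides = {k: v for k, v in spec.items() if k != "base_tier"}
    return replace(BASE_TIERS[spec["base_tier"]], **overrides)


custom = resolve({"base_tier": "smoke", "max_mem_increase_mb": 256.0})
print(custom)
```

Keys that are not overridden keep the base tier's value, and an unknown key raises a `TypeError` from `replace`, which catches typos early.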

In [None]:
# Create a custom threshold file
custom_thresholds = """
# Custom thresholds for noisy GPU cluster
# Start from smoke tier (lenient) and customize
base_tier: smoke

# Performance thresholds
# Our cluster has high variance, so we're lenient on timing
max_step_time_increase_pct: 25.0  # Allow 25% increase (default smoke: 15%)
max_tps_drop_pct: 20.0            # Allow 20% throughput drop (default smoke: 12%)

# Memory thresholds
# We have limited VRAM, so we're strict on memory
max_mem_increase_mb: 256.0        # Only 256MB increase allowed (default smoke: 1000MB)
max_mem_increase_pct: 10.0        # Only 10% increase allowed (default smoke: 20%)

# Stability thresholds
# Zero tolerance for NaN/Inf
nan_steps_allowed: 0
inf_steps_allowed: 0

# Minimum steps for valid comparison
min_step_count: 20
"""

!mkdir -p thresholds
with open('thresholds/noisy_cluster.yaml', 'w') as f:
    f.write(custom_thresholds)

print("Created thresholds/noisy_cluster.yaml")
!cat thresholds/noisy_cluster.yaml

In [None]:
# Use custom thresholds
!python -m canary.cli compare {baseline_path} baselines/main.json --threshold-file thresholds/noisy_cluster.yaml

In [None]:
# Load and inspect custom thresholds programmatically
from canary.compare.thresholds import load_thresholds_from_yaml

custom = load_thresholds_from_yaml('thresholds/noisy_cluster.yaml')

print("Loaded custom thresholds:")
print(f"  Step time increase: {custom.max_step_time_increase_pct}%")
print(f"  TPS drop: {custom.max_tps_drop_pct}%")
print(f"  Memory increase: {custom.max_mem_increase_mb}MB")
print(f"  Min steps: {custom.min_step_count}")

### Common Custom Threshold Scenarios

**Lenient (demo/development):**
```yaml
base_tier: smoke
max_step_time_increase_pct: 50.0
max_tps_drop_pct: 40.0
max_mem_increase_mb: 2000.0
```

**Strict (production):**
```yaml
base_tier: nightly
max_step_time_increase_pct: 3.0
max_tps_drop_pct: 2.0
nan_steps_allowed: 0
```

**Memory-focused:**
```yaml
base_tier: default
max_mem_increase_mb: 100.0
max_mem_increase_pct: 5.0
```

## 5. Embedding Thresholds in Config Files

You can also embed thresholds directly in your canary config YAML. This keeps thresholds with the test definition.

In [None]:
# Create a config with embedded thresholds
config_with_thresholds = """
name: dpo_strict
description: DPO test with embedded strict thresholds

model_name: EleutherAI/pythia-70m
use_peft: true
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05

training_type: dpo
max_steps: 100
batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 5.0e-5
max_length: 256
warmup_steps: 10

beta: 0.1
max_prompt_length: 64

dataset_name: Anthropic/hh-rlhf
dataset_split: train
dataset_size: 512
seed: 42

output_dir: ./canary_output

# Embedded thresholds - used when comparing this run
thresholds:
  base_tier: perf  # Start from perf tier
  max_step_time_increase_pct: 5.0  # Very strict on timing
  max_mem_increase_mb: 200.0
"""

with open('configs/dpo_strict.yaml', 'w') as f:
    f.write(config_with_thresholds)

print("Created configs/dpo_strict.yaml with embedded thresholds")

In [None]:
# Load and inspect embedded thresholds
import yaml
from canary.compare.thresholds import load_thresholds_from_config

with open('configs/dpo_strict.yaml') as f:
    config_dict = yaml.safe_load(f)

embedded = load_thresholds_from_config(config_dict)

if embedded:
    print("Embedded thresholds found:")
    print(f"  Step time increase: {embedded.max_step_time_increase_pct}%")
    print(f"  Memory increase: {embedded.max_mem_increase_mb}MB")
else:
    print("No embedded thresholds found")

## 6. The Test Matrix System

For running multiple canary tests systematically, RLHF Canary provides a test matrix system.

In [None]:
# Explore the test matrix system
from canary.matrix import (
    TestTier,
    TestDefinition,
    TestMatrix,
    DEFAULT_TEST_MATRIX,
    get_tests_for_tier,
    get_test_by_name,
)

print("Default Test Matrix")
print("=" * 70)
print(f"{'Name':<15} {'Tier':<10} {'Config':<30} {'Timeout':>10}")
print("-" * 70)

for test in DEFAULT_TEST_MATRIX:
    print(f"{test.name:<15} {test.tier.value:<10} {test.config_path:<30} {test.timeout_minutes:>7} min")

In [None]:
# Get tests for specific tier
smoke_tests = get_tests_for_tier(TestTier.SMOKE)

print(f"\nSmoke tests ({len(smoke_tests)} tests):")
for test in smoke_tests:
    print(f"  - {test.name}: {test.description}")

In [None]:
# Pre-built test matrices
print("Pre-built Test Matrices\n")

pr_matrix = TestMatrix.for_pr()
print(f"PR Gate Matrix: {pr_matrix.name}")
print(f"  Description: {pr_matrix.description}")
print(f"  Tests: {[t.name for t in pr_matrix.tests]}")

print()

nightly_matrix = TestMatrix.for_nightly()
print(f"Nightly Matrix: {nightly_matrix.name}")
print(f"  Description: {nightly_matrix.description}")
print(f"  Tests: {[t.name for t in nightly_matrix.tests]}")

print()

perf_matrix = TestMatrix.for_performance()
print(f"Performance Matrix: {perf_matrix.name}")
print(f"  Description: {perf_matrix.description}")
print(f"  Tests: {[t.name for t in perf_matrix.tests]}")

In [None]:
# Create a custom test matrix
from canary.compare.thresholds import SMOKE_THRESHOLDS

custom_matrix = TestMatrix(
    name="my_pr_gate",
    description="Custom PR gate with only DPO",
    tests=[
        TestDefinition(
            name="dpo_quick",
            description="Quick DPO validation",
            tier=TestTier.SMOKE,
            config_path="configs/dpo_smoke.yaml",
            thresholds=SMOKE_THRESHOLDS,
            timeout_minutes=10,
        ),
    ]
)

print(f"Custom matrix: {custom_matrix.name}")
print(f"Tests: {len(custom_matrix.tests)}")

## 7. Environment Fingerprinting

Environment fingerprinting captures hardware and software configuration for reproducible comparisons.

In [None]:
# View environment fingerprint via CLI
!python -m canary.cli env

In [None]:
# Access fingerprint programmatically
from canary.collect.env_fingerprint import get_env_fingerprint, fingerprints_compatible

fingerprint = get_env_fingerprint()

print("Environment Fingerprint Object")
print("=" * 50)
print(f"Python: {fingerprint.python_version}")
print(f"PyTorch: {fingerprint.torch_version}")
print(f"CUDA: {fingerprint.cuda_version}")
print(f"GPU: {fingerprint.gpu_name}")
print(f"GPU Count: {fingerprint.gpu_count}")
print(f"GPU Memory: {fingerprint.gpu_memory_gb}GB")
print(f"Platform: {fingerprint.platform} {fingerprint.platform_version}")
print(f"Transformers: {fingerprint.transformers_version}")
print(f"TRL: {fingerprint.trl_version}")
print(f"\nFingerprint Hash: {fingerprint.fingerprint_hash}")

In [None]:
# Check fingerprint compatibility (simulated different environment)
from canary.collect.env_fingerprint import EnvFingerprint

# Simulate a different environment (A100 vs T4)
a100_fingerprint = EnvFingerprint(
    python_version="3.10.12",
    torch_version="2.1.0",
    cuda_available=True,
    cuda_version="12.1",
    gpu_name="NVIDIA A100-SXM4-40GB",
    gpu_count=1,
    gpu_memory_gb=40.0,
    platform="Linux",
    platform_version="5.15.0",
    transformers_version="4.44.2",
    trl_version="0.11.4",
)

# Check compatibility
compatible, warnings = fingerprints_compatible(fingerprint, a100_fingerprint)

print(f"Compatible: {compatible}")
if warnings:
    print("Warnings:")
    for w in warnings:
        print(f"  - {w}")

### Why Fingerprinting Matters

Performance comparisons are only meaningful on similar hardware:

| Scenario | Compatible | Notes |
|----------|------------|-------|
| Same GPU, same CUDA | Yes | Ideal for regression detection |
| Same GPU, different CUDA | Warn | Minor perf differences possible |
| Different GPU | No | Performance not comparable |
| Different GPU count | No | Distributed setup changes timing and memory |
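
The rules in the table can be written down as a small function. This is a sketch of the table only, assuming three fields are enough to decide; it is not the logic of `fingerprints_compatible`, and the `Env` class and `compare_envs` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Env:
    gpu_name: str
    gpu_count: int
    cuda_version: str


def compare_envs(a: Env, b: Env) -> tuple[bool, list[str]]:
    """Return (compatible, notes) following the table above."""
    compatible = True
    notes = []
    if a.gpu_name != b.gpu_name:
        compatible = False
        notes.append(f"GPU differs: {a.gpu_name} vs {b.gpu_name}")
    if a.gpu_count != b.gpu_count:
        compatible = False
        notes.append(f"GPU count differs: {a.gpu_count} vs {b.gpu_count}")
    if a.cuda_version != b.cuda_version:
        # Warning only: results stay comparable, with minor differences possible.
        notes.append(f"CUDA differs: {a.cuda_version} vs {b.cuda_version}")
    return compatible, notes


t4 = Env("Tesla T4", 1, "12.1")
a100 = Env("NVIDIA A100-SXM4-40GB", 1, "12.1")
print(compare_envs(t4, a100))
print(compare_envs(t4, Env("Tesla T4", 1, "11.8")))
```

Returning notes alongside the boolean lets a caller show warnings even when the comparison is allowed to proceed.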

## 8. JSON Output for Automation

For integration with dashboards, alerting, and automation, use JSON output.
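
One way automation might consume such a report is to turn it into a process exit code. The sketch below assumes the report has a top-level `passed` flag and a `checks` list whose entries carry `name`, `passed`, and `message`; the `gate` helper and the sample report are hypothetical, and real output would come from `canary compare --json`.

```python
import json

# Hypothetical report for illustration only.
SAMPLE = """
{
  "passed": false,
  "checks": [
    {"name": "step_time", "passed": true, "message": "within limit"},
    {"name": "peak_memory", "passed": false, "message": "over limit"}
  ]
}
"""


def gate(report_json: str) -> int:
    """Print failed checks and return an exit code (0 = pass, 1 = fail)."""
    report = json.loads(report_json)
    for check in report["checks"]:
        if not check["passed"]:
            print(f"FAIL {check['name']}: {check['message']}")
    return 0 if report["passed"] else 1


exit_code = gate(SAMPLE)
print(f"exit code: {exit_code}")
```

In a CI job the returned value would typically be passed to `sys.exit` so that a failed comparison fails the job.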

In [None]:
# Get JSON output from compare command
!python -m canary.cli compare {baseline_path} baselines/main.json --threshold-tier smoke --json

In [None]:
# Parse JSON for custom processing
import subprocess
import json

result = subprocess.run(
    ["python", "-m", "canary.cli", "compare", str(baseline_path), "baselines/main.json", "--threshold-tier", "smoke", "--json"],
    capture_output=True,
    text=True
)

report = json.loads(result.stdout)

print("Parsed Comparison Report")
print("=" * 50)
print(f"Passed: {report['passed']}")
print(f"Total checks: {len(report['checks'])}")
print(f"Failed checks: {len([c for c in report['checks'] if not c['passed']])}")

print("\nPerformance Delta:")
for key, value in report['perf_delta'].items():
    print(f"  {key}: {value}")

print("\nAll Checks:")
for check in report['checks']:
    status = "PASS" if check['passed'] else "FAIL"
    print(f"  [{status}] {check['name']}: {check['message']}")

In [None]:
# Example: Build a dashboard payload
import json


def build_dashboard_payload(report: dict, metrics_path: str) -> dict:
    """Build a payload for a metrics dashboard."""
    with open(metrics_path) as f:
        metrics = json.load(f)

    # Assumes run_id looks like "<prefix>_<timestamp>..."; the second
    # "_"-separated part is used as the timestamp (None if there is no "_").
    run_id = metrics.get("run_id", "")

    return {
        "timestamp": run_id.split("_")[1] if "_" in run_id else None,
        "run_id": metrics.get("run_id"),
        "passed": report["passed"],
        "metrics": {
            "step_time_ms": metrics["perf"]["step_time"]["mean"] * 1000,
            "tokens_per_sec": metrics["perf"]["approx_tokens_per_sec"],
            "peak_memory_mb": metrics["perf"]["max_mem_mb"],
            "nan_steps": metrics["stability"]["nan_steps"],
        },
        "checks": {
            "total": len(report["checks"]),
            "passed": sum(1 for c in report["checks"] if c["passed"]),
            "failed": sum(1 for c in report["checks"] if not c["passed"]),
        },
        "deltas": report["perf_delta"],
    }

payload = build_dashboard_payload(report, str(baseline_path))
print("Dashboard Payload:")
print(json.dumps(payload, indent=2))

## 9. Practical Example: Multi-Tier Testing Strategy

Here's how to set up a complete testing pyramid for your project.

In [None]:
# Create a testing pyramid structure
!mkdir -p my_project/configs
!mkdir -p my_project/thresholds
!mkdir -p my_project/baselines

# Generate configs for each tier
!python -m canary.cli init-config my_project/configs/smoke.yaml --type dpo --tier smoke
!python -m canary.cli init-config my_project/configs/perf.yaml --type dpo --tier perf
!python -m canary.cli init-config my_project/configs/nightly.yaml --type dpo --tier nightly

print("Created testing pyramid configs!")
!ls -la my_project/configs/

In [None]:
# Create tier-specific threshold files
pr_thresholds = """
# PR Gate Thresholds - lenient for quick feedback
base_tier: smoke
max_step_time_increase_pct: 20.0
min_step_count: 10
"""

nightly_thresholds = """
# Nightly Thresholds - strict for comprehensive testing
base_tier: nightly
max_step_time_increase_pct: 3.0
max_tps_drop_pct: 2.0
min_step_count: 200
"""

with open('my_project/thresholds/pr.yaml', 'w') as f:
    f.write(pr_thresholds)

with open('my_project/thresholds/nightly.yaml', 'w') as f:
    f.write(nightly_thresholds)

print("Created threshold files!")
!ls -la my_project/thresholds/

In [None]:
# View the complete project structure
!find my_project -type f | sort

### Recommended Testing Strategy

```
my_project/
├── configs/
│   ├── smoke.yaml      # 100 steps, ~5-10 min
│   ├── perf.yaml       # 500 steps, ~20-45 min
│   └── nightly.yaml    # 2000 steps, ~1-3 hr
├── thresholds/
│   ├── pr.yaml         # Lenient for PR gating
│   └── nightly.yaml    # Strict for nightly
└── baselines/
    └── main.json       # Current main branch baseline
```

**Workflow:**
1. **PR opened** → Run smoke test with pr.yaml thresholds
2. **PR merged to main** → Update baseline
3. **Nightly (2 AM)** → Run nightly test with nightly.yaml thresholds
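
The workflow above can be sketched as a lookup from CI event to the commands to run. The event names, the `Plan` tuple, the output directory, and the choice of the smoke config for refreshing the baseline are all assumptions made for illustration; `<metrics.json>` is a placeholder for the metrics file a run writes, which earlier cells locate with `rglob`.

```python
from typing import NamedTuple, Optional


class Plan(NamedTuple):
    config: str
    thresholds: Optional[str]  # None means no comparison for this event
    update_baseline: bool


# Event names are hypothetical labels; paths follow the layout above.
PLANS = {
    "pull_request": Plan("my_project/configs/smoke.yaml", "my_project/thresholds/pr.yaml", False),
    "push_main": Plan("my_project/configs/smoke.yaml", None, True),
    "nightly": Plan("my_project/configs/nightly.yaml", "my_project/thresholds/nightly.yaml", False),
}

BASELINE = "my_project/baselines/main.json"


def commands(event: str, out_dir: str = "./canary_output/current") -> list[str]:
    """Return the shell commands for one CI event, in order."""
    plan = PLANS[event]
    cmds = [f"canary run {plan.config} -o {out_dir}"]
    if plan.thresholds:
        cmds.append(f"canary compare <metrics.json> {BASELINE} --threshold-file {plan.thresholds}")
    if plan.update_baseline:
        cmds.append(f"cp <metrics.json> {BASELINE}")
    return cmds


for event in PLANS:
    print(event, commands(event), sep="\n  ")
```

Keeping the mapping in one table makes it easy to see which thresholds gate which event and where the baseline gets refreshed.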

## 10. Summary

### Key Takeaways

1. **`init-config`** generates starter configs for any training type and tier
2. **Four threshold tiers** (smoke → nightly) provide increasing strictness
3. **Custom threshold files** allow project-specific tuning
4. **Embedded thresholds** keep config and thresholds together
5. **Test matrices** organize multiple tests for CI/CD
6. **Environment fingerprints** ensure hardware-compatible comparisons
7. **JSON output** enables dashboard and automation integration

### Quick Reference

```bash
# Generate config
canary init-config output.yaml --type dpo --tier smoke

# Compare with tier
canary compare current.json baseline.json --threshold-tier perf

# Compare with custom thresholds
canary compare current.json baseline.json --threshold-file custom.yaml

# JSON output
canary compare current.json baseline.json --json

# Environment info
canary env
```

### Next Steps

- [01_quickstart.ipynb](01_quickstart.ipynb) - Core workflow basics
- [07_ci_cd_integration.ipynb](07_ci_cd_integration.ipynb) - GitHub integration
- [09_quantization_and_memory.ipynb](09_quantization_and_memory.ipynb) - Memory optimization