# Post-Training Toolkit - Live Output Demo

## Part 1: The Problem

**Training bugs waste compute and money.**

Typical scenario:
```
Train for 8 hours → Check metrics → Something's wrong →
Dig through logs → Find the bug → Restart

Cost: 8 hours × 8 GPUs × $2.50/hr = $160 wasted
```
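
The cost figure above is plain arithmetic. A minimal sketch that reproduces it, reading the $2.50 rate as a per-GPU hourly price; the helper name `wasted_cost` is illustrative and not part of PTT:

```python
def wasted_cost(hours: float, gpus: int, usd_per_gpu_hour: float) -> float:
    """Cost in USD of a run that has to be discarded and restarted."""
    return hours * gpus * usd_per_gpu_hour


# The scenario above: 8 hours on 8 GPUs at $2.50 per GPU-hour.
print(f"${wasted_cost(8, 8, 2.50):.2f}")  # $160.00
```

The same function makes the payoff of early detection easy to estimate: a bug caught after 30 minutes instead of 8 hours costs `wasted_cost(0.5, 8, 2.50)`.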

**Common failure modes:**
- KL divergence explosion
- Importance sampling collapse (← what we'll see)
- Reward hacking
- Policy collapse

**Most bugs show early signals, but we only look after training ends.**

**PTT solution:** Monitor during training and catch bugs as they happen.

---

## Part 2: The Solution - Just 3 Lines

```python
from trl import DPOTrainer
from post_training_toolkit import DiagnosticsCallback

# Add to any TRL trainer (PPO, DPO, SFT, ORPO...)
# Assumes `model` and `training_args` are already defined.
trainer = DPOTrainer(
    model=model,
    args=training_args,
    callbacks=[DiagnosticsCallback(
        enable_live_warnings=True,    # ← Live alerts in terminal
        stop_on_critical=True,         # ← Auto-stop on critical issues
    )],
    # ... remaining trainer arguments (datasets, tokenizer, etc.)
)
trainer.train()
```

The callback automatically captures metrics and alerts you to problems **as they happen**.

### Live Warnings Demo (DPO Training)

This runs a real DPO training job with live PTT warnings.
**Note:** It uses a tiny model (tiny-gpt2) to keep the demo fast.

In [6]:
# Run DPO training with live PTT warnings
# Watch the output for live alerts like: [DiagnosticsCallback] ⚠️ MEDIUM at step 20: ...

!cd ../.. && python demo/scripts/test_live_warnings.py


This demo shows:
  • Auto-stop on high-severity issues (if any)
  • Auto-diagnostics report at end

[1/4] Loading tiny model: sshleifer/tiny-gpt2
   Model parameters: 102,714

[2/4] Creating preference dataset...
   Train: 48, Eval: 12

[3/4] Setting up DPO training with NEW callback options...
   Options:
     • stop_on_critical=False (for demo)
     • auto_diagnostics=True (prints summary at end)
Extracting prompt in train dataset: 100%|█| 48/48 [00:00<00:00, 3581.24 examples
Applying chat template to train dataset: 100%|█| 48/48 [00:00<00:00, 5340.37 exa
Tokenizing train dataset: 100%|████████| 48/48 [00:00<00:00, 1283.48 examples/s]
Extracting prompt in eval dataset: 100%|█| 12/12 [00:00<00:00, 2382.34 examples/
Applying chat template to eval dataset: 100%|█| 12/12 [00:00<00:00, 3359.03 exam
Tokenizing eval dataset: 100%|█████████| 12/12 [00:00<00:00, 1647.57 examples/s]
No label_names provided for model class `PeftModelForCausalLM

### Auto-Stop Demo (Critical Issue Detection)

This demo shows PTT's **auto-stop** feature. We intentionally inject a bug:
- **Bug**: Learning rate is 100x too high (5e-3 instead of 5e-5)
- **Effect**: DPO loss gets stuck at 0.693 (random chance = not learning)
- **PTT Response**: Detects critical issue and **auto-stops training**
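
Why 0.693 counts as "random chance": the DPO loss is the negative log-sigmoid of the scaled difference between the policy's and the reference model's log-ratios for the chosen and rejected responses. When that difference is zero, the loss is -log(0.5) = ln 2 ≈ 0.6931. The sketch below works this out for a single preference pair; the function name, argument names, and `beta=0.1` default are illustrative, not PTT or TRL API.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * margin))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) == softplus(-m); computed in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: zero margin, loss is exactly ln(2).
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931

# Policy prefers the chosen response more than the reference does: loss drops.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0) < math.log(2))  # True
```

A loss that sits at this value for many steps therefore means the policy is not separating chosen from rejected responses any better than the reference model does.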


In [7]:
# Run DPO training with intentional bug - PTT will auto-stop
# Expected: Critical alert when loss stuck at 0.693, then auto-stop

!cd ../.. && python demo/real_scenario2_dpo_autostop.py

SCENARIO 2: REAL DPO TRAINING WITH PTT AUTO-STOP

Setup:
  • Model: gpt2 (124M)
  • Trainer: DPO
  • Dataset: Anthropic/hh-rlhf (25 examples)
  • Bug: Very high LR (5e-3) → DPO loss explodes (>2.0)
  • PTT: stop_on_critical=True (AUTO-STOP ENABLED)
  • Expected: Auto-stop when loss exceeds 2.0

Loading model and tokenizer...
Loading dataset...

Configuring DPO training...

Creating DPO trainer with PTT callback...

STARTING TRAINING - PTT WILL AUTO-STOP ON CRITICAL ISSUES

👀 PTT will check metrics every step (logging_steps=1)
🚨 Expected: Critical alert + AUTO-STOP around step 11 when loss > 2.0

[DiagnosticsCallback] Detected trainer: DPO
[DiagnosticsCallback] Run directory: real_scenario2_output/diagnostics
[DiagnosticsCallback] Auto-diff enabled
[DiagnosticsCallback] Postmortem recording enabled
[DiagnosticsCallback] Safe stopping enabled (will stop on NaN/Inf)
  5%|██                                          | 1/21 [00:06<02:08,  6.42s/it][DiagnosticsCallback

### View Generated Diagnostics Report

After training, PTT generates a markdown report with:
- Training status (Stable / Partially unstable / Unstable)
- Key insights ranked by severity
- Recommended actions

In [12]:
from IPython.display import Markdown, display
from pathlib import Path

# Find the most recent diagnostics report
report_paths = [
    Path("../outputs/test_live_warnings/diagnostics_report.md"),
    Path("../outputs/reports").glob("*_report.md"),
]

report_found = False
for p in report_paths:
    if isinstance(p, Path) and p.exists():
        print(f"📄 Showing report: {p}")
        display(Markdown(p.read_text()))
        report_found = True
        break
    elif hasattr(p, '__iter__'):
        for f in sorted(p, key=lambda x: x.stat().st_mtime, reverse=True):
            print(f"📄 Showing report: {f}")
            display(Markdown(f.read_text()))
            report_found = True
            break
    if report_found:
        break

if not report_found:
    print("No diagnostics report found. Run the training cells above first!")

📄 Showing report: ../outputs/reports/diagnostics_log_report.md


## RLHF Run Diagnostic Report

Generated: 2026-01-22T06:14:16.026655Z

**Trainer:** DPO | **Status:** Crashed (exception)

### Run Summary
- Steps: 24

- Final DPO Loss: 0.6931
- Mean Win Rate: 65.6%





### Key Insights


1. [HIGH] DPO loss stuck at ~0.693 (random chance). Model may not be learning preferences.
   *Ref: Rafailov et al. (2023) 'DPO', Section 4.2 - Loss at ln(2) indicates no preference signal*

2. [MEDIUM] Win rate shows high volatility (std=0.53), indicating inconsistent preference learning.

3. [LOW] DPO loss has plateaued; consider adjusting learning rate or beta.








### Postmortem
**Exit Reason:** exception
- Last Step: 15
- Timestamp: 2025-12-17T19:26:04.880887+00:00


### Recommended Actions


- DPO loss at random chance: increase learning rate 2-5x, check data quality, or reduce beta.

- DPO loss plateaued: try learning rate warmup/decay or adjust beta parameter.

- Win rate unstable: increase batch size for more stable gradient estimates.



### Callback Configuration Options

```python
DiagnosticsCallback(
    # Where to save diagnostics
    run_dir="./my_run",
    
    # Live warnings during training
    enable_live_warnings=True,     # Print alerts as they happen
    live_warning_interval=10,      # Check every N steps
    
    # Auto-stop on critical issues
    stop_on_critical=True,         # Stop training on HIGH severity
    
    # End-of-run diagnostics
    auto_diagnostics=True,         # Print summary when training ends
    
    # Debug output
    verbose=True,                  # Extra logging
)
```

---
## Part 3: Contributing

### Adding a heuristic = Writing a YAML file

Here's a builtin heuristic, the one that flags DPO loss stuck at random chance:

In [None]:
# Let's look at a builtin heuristic - this is all it takes!
from pathlib import Path

yaml_path = Path("../../post_training_toolkit/heuristics/builtin/dpo/loss_random.yaml")
print("📄 Builtin heuristic example:")
print("-" * 50)
print(yaml_path.read_text())
print("-" * 50)
print("\n✨ That's it! 12 lines of YAML = a training diagnostic.")

### The Condition DSL

| Syntax | Meaning | Example Use Case |
|--------|---------|------------------|
| `< 0.5` | Below threshold | Reward margin too low |
| `> 2.0` | Above threshold | KL divergence explosion |
| `range(0.68, 0.71)` | Stuck in range | DPO loss at random chance |
| `drop(50%)` | Dropped 50% from baseline | Entropy collapse |
| `spike(3x)` | 3x above rolling average | Loss oscillation |
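
To make the table concrete, here is a minimal sketch of how condition strings like these could be evaluated against a metric history. It is not PTT's implementation: the function name `check`, the choice of the first recorded value as the `drop` baseline, and the mean of all earlier values as the `spike` rolling average are assumptions made for illustration.

```python
import re
from statistics import mean


def check(condition: str, history: list[float]) -> bool:
    """Evaluate one condition string against a metric history (oldest first)."""
    cond = condition.strip()
    current = history[-1]

    # "< 0.5" or "> 2.0": simple threshold on the latest value.
    if m := re.fullmatch(r"([<>])\s*(-?\d+(?:\.\d+)?)", cond):
        threshold = float(m.group(2))
        return current < threshold if m.group(1) == "<" else current > threshold

    # "range(0.68, 0.71)": latest value lies inside the closed interval.
    if m := re.fullmatch(r"range\(\s*(-?[\d.]+)\s*,\s*(-?[\d.]+)\s*\)", cond):
        return float(m.group(1)) <= current <= float(m.group(2))

    # "drop(50%)": latest value fell by at least that share of the baseline.
    # Assumption: baseline is the first recorded value and is positive.
    if m := re.fullmatch(r"drop\(\s*([\d.]+)%\s*\)", cond):
        return current <= history[0] * (1 - float(m.group(1)) / 100)

    # "spike(3x)": latest value exceeds N times the average of earlier values.
    if m := re.fullmatch(r"spike\(\s*([\d.]+)x\s*\)", cond):
        if len(history) < 2:
            return False
        return current > float(m.group(1)) * mean(history[:-1])

    raise ValueError(f"Unrecognized condition: {condition!r}")


print(check("range(0.68, 0.71)", [0.70, 0.6931]))  # True
print(check("spike(3x)", [1.0, 1.0, 2.0]))         # False
```

Keeping each condition as a small, self-contained string is what lets a heuristic live in YAML rather than code.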

---

### 🧪 Live Demo: Create & Use a Custom Heuristic

Let's create our own heuristic and watch PTT use it in real training!

In [None]:
# Step 1: Create a custom heuristic YAML file
from pathlib import Path

custom_dir = Path("../custom_heuristics/dpo")
custom_dir.mkdir(parents=True, exist_ok=True)

# Note: Use PTT's standard metric names (e.g., "reward_margin")
yaml_content = """# Custom heuristic: Detect when reward margin is too low
name: custom_margin_too_low
description: Detect when model can't distinguish chosen vs rejected
trainers: [dpo]
metric: reward_margin
condition: "< 0.02"
window: 5
severity: high
message: "Reward margin too low ({value:.4f}). Model can't distinguish chosen vs rejected!"
min_steps: 5
enabled: true
"""

yaml_path = custom_dir / "margin_too_low.yaml"  # Give it a descriptive name
yaml_path.write_text(yaml_content)

print("✅ Created custom heuristic!")
print(f"📄 File: {yaml_path}")
print()
print("Contents:")
print("-" * 50)
print(yaml_content)
print("-" * 50)
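
How `window` and `min_steps` interact is not spelled out in this demo. One plausible reading, sketched below, is that the condition is tested against the mean of the last `window` values once at least `min_steps` values have been logged, and that `{value:.4f}` in `message` is filled in with standard Python string formatting. The function `maybe_alert` and its windowing logic are assumptions for illustration, not PTT's actual behavior.

```python
from statistics import mean
from typing import Optional

MESSAGE = ("Reward margin too low ({value:.4f}). "
           "Model can't distinguish chosen vs rejected!")


def maybe_alert(values: list[float], threshold: float = 0.02,
                window: int = 5, min_steps: int = 5,
                message: str = MESSAGE) -> Optional[str]:
    """Return the formatted alert if the windowed mean is below threshold."""
    if len(values) < min_steps:
        return None  # too early in training to judge
    value = mean(values[-window:])
    if value < threshold:
        return message.format(value=value)
    return None


# Synthetic reward margins, one per logged step.
print(maybe_alert([0.01, 0.012, 0.009, 0.011]))         # None (fewer than min_steps)
print(maybe_alert([0.01, 0.012, 0.009, 0.011, 0.008]))  # alert text with value 0.0100
print(maybe_alert([0.5, 0.6, 0.4, 0.5, 0.5]))           # None (margin is healthy)
```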

**Step 2: Run training and watch PTT auto-stop!**

We'll run DPO training with a tiny model and `stop_on_critical=True`.

When our custom heuristic fires (HIGH severity), PTT will:
1. Print the alert: `🚨 HIGH: Reward margin too low`
2. **Automatically stop training** to save compute!

Watch for the auto-stop around step 20!

In [None]:
# Step 2: Run training with our custom heuristic + auto-stop
# Watch for: 🚨 HIGH alert → then training stops automatically!

!cd ../.. && python demo/scripts/custom_heuristic_demo.py

## Why This Matters

**The missing layer for continuous autonomous RL.**  
Post Training Toolkit turns continuous training from a fragile, manual process into a system that can be monitored and controlled reliably.

**Built for agents.**  
PTT exposes training signals in a structured, machine-readable form that agents can reason over directly. Instead of parsing logs, agents can inspect diagnostics, detect degeneracy early, and propose fixes while training is still running.

This is the foundation for self-correcting, continuously trained agent systems.
