# Post-Training Toolkit Demo
## For the Hugging Face TRL Team

**Agenda (10 min)**
1. Integration: One line, zero config
2. Your GRPO bug: How we'd catch it
3. Contributing: YAML heuristics
4. Vision: Continuous RL & Agent Training

---
# Part 1: Integration

### The entire integration is ONE line:

In [None]:
# This is the ENTIRE integration:

from post_training_toolkit import DiagnosticsCallback

# trainer = GRPOTrainer(
#     model=model,
#     callbacks=[DiagnosticsCallback()],  # <-- Just this
#     ...
# )

print("That's it. Zero configuration needed.")

### Auto-detects your trainer type:

In [None]:
from post_training_toolkit.integrations.trl import TRAINER_CLASS_MAP

print("Supported trainers (auto-detected):")
for cls, typ in sorted(set((k, v) for k, v in TRAINER_CLASS_MAP.items() if "Trainer" in k)):
    print(f"  {cls} → {typ}")

---
# Part 2: The GRPO Bug

> "importance_sampling_ratio wasn't close to 1 (it was mostly ~0)"

Let's simulate this and see how PTT catches it:

In [None]:
import pandas as pd
import numpy as np

np.random.seed(42)

# Simulate: IS ratio starts healthy (~1.0) then collapses (~0.1)
is_ratio_good = np.random.normal(1.0, 0.1, 50)  # First 50 steps: healthy
is_ratio_bad = np.clip(np.random.normal(0.1, 0.05, 50), 0.01, 0.3)  # Last 50: collapsed

df = pd.DataFrame({
    "step": list(range(100)),
    "importance_sampling_ratio": np.concatenate([is_ratio_good, is_ratio_bad]),
})

print(f"Steps 0-49:  IS ratio = {df['importance_sampling_ratio'].iloc[:50].mean():.3f} (healthy)")
print(f"Steps 50-99: IS ratio = {df['importance_sampling_ratio'].iloc[50:].mean():.3f} (PROBLEM!)")

In [None]:
from post_training_toolkit.models.heuristics import run_heuristics

insights = run_heuristics(df, trainer_type="grpo")

print(f"\n🔍 PTT detected {len(insights)} issue(s):\n")
for insight in insights:
    icon = {"high": "🚨", "medium": "⚠️", "low": "ℹ️"}.get(insight.severity, "•")
    print(f"{icon} [{insight.severity.upper()}] {insight.type}")
    print(f"   {insight.message}")
    if insight.reference:
        print(f"   Ref: {insight.reference}")
    print()

### ✅ Caught during training. Not after.
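To make the idea concrete without the toolkit installed, here is a minimal standalone sketch of a threshold check on the IS ratio. It is not PTT's implementation: the function name `below_threshold`, the 10-step window, and the use of a windowed mean are all assumptions chosen for illustration. Only the `0.5` threshold mirrors the `< 0.5` condition used in this demo.

```python
import random
from statistics import mean


def below_threshold(values, threshold=0.5, window=10):
    """Hypothetical check: True when the mean of the last `window` values is below `threshold`."""
    if len(values) < window:
        return False  # not enough history yet
    return mean(values[-window:]) < threshold


random.seed(0)

# Same shape as the simulation above: 50 healthy steps, then 50 collapsed ones.
healthy = [random.gauss(1.0, 0.1) for _ in range(50)]
collapsed = [min(0.3, max(0.01, random.gauss(0.1, 0.05))) for _ in range(50)]
ratios = healthy + collapsed

# Replay the run step by step and record the first step at which the check fires.
first_alert = next(
    (step for step in range(len(ratios)) if below_threshold(ratios[: step + 1])),
    None,
)
print(f"First alert at step {first_alert}")
```

With a windowed mean the alert fires a few steps after the collapse begins, once enough collapsed values have entered the window. A shorter window reacts faster but is noisier.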

---
# Part 3: Contributing Heuristics

### Adding a heuristic = Writing a YAML file

Here's the heuristic that caught the IS ratio bug:

In [None]:
# Let's look at the YAML file:
from pathlib import Path

yaml_path = Path("../post_training_toolkit/heuristics/builtin/grpo/importance_sampling_ratio.yaml")
print(yaml_path.read_text())

### The Condition DSL

| Syntax | Meaning |
|--------|--------|
| `< 0.5` | Below threshold |
| `> 2.0` | Above threshold |
| `range(0.68, 0.71)` | Stuck in range |
| `drop(50%)` | Dropped 50% from baseline |
| `spike(3x)` | 3x above rolling average |
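As a rough illustration of what each row could mean when evaluated against a metric history, here is a standalone sketch. The function names, window sizes, and the choice of "baseline = mean of the first few values" are assumptions for this example, not the toolkit's actual semantics. The `grad_norm` series is made-up sample data.

```python
from statistics import mean


def is_below(values, threshold):
    """`< threshold`: the latest value is under the threshold."""
    return values[-1] < threshold


def is_above(values, threshold):
    """`> threshold`: the latest value is over the threshold."""
    return values[-1] > threshold


def is_stuck_in_range(values, low, high, window=5):
    """`range(low, high)`: the last `window` values all sit inside [low, high]."""
    recent = values[-window:]
    return len(recent) == window and all(low <= v <= high for v in recent)


def has_dropped(values, fraction, baseline_window=5):
    """`drop(50%)`: latest value is at least `fraction` below the mean of the first few values."""
    baseline = mean(values[:baseline_window])
    return values[-1] <= baseline * (1 - fraction)


def has_spiked(values, factor, window=5):
    """`spike(3x)`: latest value exceeds `factor` times the mean of the preceding window."""
    previous = values[-window - 1:-1]
    return bool(previous) and values[-1] > factor * mean(previous)


# Made-up sample histories, one per condition type.
loss = [0.693, 0.694, 0.692, 0.693, 0.693]
entropy = [2.0, 2.1, 1.9, 2.0, 2.0, 1.5, 0.9]
grad_norm = [1.0, 1.1, 0.9, 1.0, 1.0, 4.2]

print(is_stuck_in_range(loss, 0.68, 0.71))  # True
print(has_dropped(entropy, 0.5))            # True
print(has_spiked(grad_norm, 3))             # True
```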

In [None]:
# All conditions parse correctly:
from post_training_toolkit.heuristics.parser import parse_condition

for cond in ["< 0.5", "> 2.0", "range(0.68, 0.71)", "drop(50%)", "spike(3x)"]:
    result = parse_condition(cond)
    print(f"  '{cond}' → {result.type.value}")
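For readers curious how little machinery a DSL like this needs, here is a minimal regex-based sketch that accepts the five forms from the table. It is a hypothetical stand-in, not the toolkit's `parse_condition`: the names `ParsedCondition` and `parse_condition_sketch` and the returned shape are invented for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class ParsedCondition:
    """Hypothetical result type: a condition kind plus its numeric arguments."""
    kind: str    # "lt", "gt", "range", "drop", or "spike"
    args: tuple


_NUM = r"(-?\d+(?:\.\d+)?)"
_PATTERNS = [
    ("lt", re.compile(rf"^<\s*{_NUM}$")),
    ("gt", re.compile(rf"^>\s*{_NUM}$")),
    ("range", re.compile(rf"^range\(\s*{_NUM}\s*,\s*{_NUM}\s*\)$")),
    ("drop", re.compile(rf"^drop\(\s*{_NUM}%\s*\)$")),
    ("spike", re.compile(rf"^spike\(\s*{_NUM}x\s*\)$")),
]


def parse_condition_sketch(text):
    """Try each pattern in turn and return the first match as a ParsedCondition."""
    text = text.strip()
    for kind, pattern in _PATTERNS:
        match = pattern.match(text)
        if match:
            args = tuple(float(group) for group in match.groups())
            if kind == "drop":
                args = (args[0] / 100.0,)  # store 50% as the fraction 0.5
            return ParsedCondition(kind, args)
    raise ValueError(f"Unrecognized condition: {text!r}")


for cond in ["< 0.5", "> 2.0", "range(0.68, 0.71)", "drop(50%)", "spike(3x)"]:
    print(cond, "->", parse_condition_sketch(cond))
```

Raising on unrecognized input, rather than silently ignoring it, means a typo in a heuristic file surfaces at load time and not as a missing alert mid-run.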

### Even faster: Inline alerts (no file needed)

In [None]:
# For quick experiments, define alerts inline:

cb = DiagnosticsCallback(
    custom_alerts=[
        "grpo: importance_sampling_ratio < 0.5 -> high: IS ratio collapsed!",
        "grpo: entropy drop(50%) for 30 steps -> medium: Entropy dropping",
    ]
)

print(f"Custom alerts registered: {len(cb._custom_alerts)}")
for alert in cb._custom_alerts:
    print(f"  • {alert}")
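The two alert strings above suggest a shape of `trainer: metric condition [for N steps] -> severity: message`. Assuming that shape (inferred from these two examples only, not from a documented grammar), a standalone parser could look like the sketch below. `InlineAlert` and `parse_inline_alert` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class InlineAlert:
    """Hypothetical container for one parsed inline alert."""
    trainer: str
    metric: str
    condition: str
    for_steps: Optional[int]
    severity: str
    message: str


# Assumed grammar: "<trainer>: <metric> <condition> [for N steps] -> <severity>: <message>"
_ALERT_RE = re.compile(
    r"^\s*(?P<trainer>\w+)\s*:\s*"
    r"(?P<metric>\w+)\s+"
    r"(?P<condition>.+?)"
    r"(?:\s+for\s+(?P<steps>\d+)\s+steps)?"
    r"\s*->\s*(?P<severity>high|medium|low)\s*:\s*"
    r"(?P<message>.+?)\s*$"
)


def parse_inline_alert(text):
    """Split an alert string into its parts, or raise ValueError if it does not fit."""
    match = _ALERT_RE.match(text)
    if match is None:
        raise ValueError(f"Malformed alert: {text!r}")
    steps = match.group("steps")
    return InlineAlert(
        trainer=match.group("trainer"),
        metric=match.group("metric"),
        condition=match.group("condition"),
        for_steps=int(steps) if steps is not None else None,
        severity=match.group("severity"),
        message=match.group("message"),
    )


first = parse_inline_alert("grpo: importance_sampling_ratio < 0.5 -> high: IS ratio collapsed!")
second = parse_inline_alert("grpo: entropy drop(50%) for 30 steps -> medium: Entropy dropping")
print(first)
print(second)
```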

---
# Part 4: Vision - Continuous RL & Agent Training

### The Problem with Continuous/Online RL

```
Offline RLHF:     Train → Evaluate → Ship → Done
                         ↑
                    Catch problems here

Continuous RL:    Train → Train → Train → Train → ...
                        ↑
                  Problems compound silently
```

**PTT makes continuous RL safe by default.**

### Simulating Continuous Training with Live Monitoring

In [None]:
import time

def simulate_continuous_training():
    """Simulate continuous training with live PTT monitoring."""
    
    metrics_history = []
    
    print("🔄 Simulating continuous GRPO training...\n")
    
    for step in range(100):
        # Simulate metrics - IS ratio degrades over time
        if step < 40:
            is_ratio = np.random.normal(1.0, 0.1)
        elif step < 60:
            is_ratio = np.random.normal(0.7, 0.1)  # Starting to drift
        else:
            is_ratio = np.random.normal(0.2, 0.05)  # Collapsed
        
        metrics_history.append({
            "step": step,
            "importance_sampling_ratio": max(0.01, is_ratio),
            "reward_mean": 0.1 + step * 0.005 + np.random.normal(0, 0.02),
        })
        
        # Run heuristics every 10 steps (like the callback does)
        if step > 0 and step % 10 == 0:
            df = pd.DataFrame(metrics_history)
            insights = run_heuristics(df, "grpo")
            
            high_severity = [i for i in insights if i.severity == "high"]
            if high_severity:
                print(f"Step {step:3d}: 🚨 {high_severity[0].message[:60]}...")
            else:
                print(f"Step {step:3d}: ✅ All metrics healthy")
        
        time.sleep(0.05)  # Simulate training time
    
    print("\n✋ In real training, PTT would have warned you at step 70.")
    print("   Without PTT, you might not notice until step 200+.")

simulate_continuous_training()

### The Agent Training Angle

PTT provides **structured signals** that AI agents can reason over:

```python
# Instead of parsing 10,000 log lines:
# [2024-01-21 10:23:45] loss=0.693
# [2024-01-21 10:23:46] loss=0.694
# ...

# AI gets structured insights:
Insight(
    type="dpo_loss_random",
    severity="high",
    message="Loss stuck at 0.693 (random chance)",
    data={"expected": "< 0.5", "actual": 0.693},
    reference="Rafailov et al. (2023)"
)
```

**This is the missing layer for AI-assisted debugging.**
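As a sketch of how such insights might be handed to an agent, the snippet below mirrors the five fields shown above in a plain dataclass and serializes them to JSON, most severe first. `InsightSketch`, `to_agent_payload`, and the severity ordering are assumptions for illustration; the toolkit's real `Insight` class may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class InsightSketch:
    """Hypothetical stand-in with the same fields as the Insight example above."""
    type: str
    severity: str
    message: str
    data: Dict[str, Any] = field(default_factory=dict)
    reference: Optional[str] = None


# Assumed ordering: unknown severities sort last.
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def to_agent_payload(insights):
    """Serialize insights as a JSON array, most severe first."""
    ranked = sorted(
        insights,
        key=lambda insight: SEVERITY_ORDER.get(insight.severity, len(SEVERITY_ORDER)),
    )
    return json.dumps([asdict(insight) for insight in ranked], indent=2)


payload = to_agent_payload([
    InsightSketch(type="example_low_priority", severity="low", message="Illustrative note"),
    InsightSketch(
        type="dpo_loss_random",
        severity="high",
        message="Loss stuck at 0.693 (random chance)",
        data={"expected": "< 0.5", "actual": 0.693},
        reference="Rafailov et al. (2023)",
    ),
])
print(payload)
```

A flat JSON array keeps the payload easy to drop into a prompt or a tool-call result, and sorting by severity puts the most urgent item where a truncated context window is least likely to cut it off.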

---
# Summary: Why First-Party Integration?

| Point | Value |
|-------|-------|
| **Integration** | One line, zero config |
| **Coverage** | All TRL trainers today |
| **Contributions** | YAML = no Python required |
| **Knowledge** | Encodes tribal knowledge in code |
| **Future** | Enables safe continuous RL + agent training |

---

### One-liners:

> "PTT makes continuous RL safe by default."

> "The early-warning system for long-horizon agent training."

> "Structured signals for humans AND AI to debug training."

---
# Questions?

**Try it yourself:**
```bash
pip install post-training-toolkit
```

**Add to any TRL trainer:**
```python
callbacks=[DiagnosticsCallback()]
```