# P1 eval & tables (approachable walkthrough)

Use this notebook to:
- Run a GRPO training/eval loop (defaults to a tiny HF test model and tiny tasks for a quick smoke test).
- Summarize runs into paper-style tables (reward mean/CI, latency p95/p99, JSON-valid rate).
- Swap in your paper configs (local Qwen/RL checkpoints, P1 task files) for real results.

🎯 **How to adapt for your paper:**
- Set `BASE_MODEL` to your local Qwen/RL adapter path (e.g., `./models/qwen3-4b-thinking-2507`).
  - Optionally set `RESUME_FROM` to load a checkpoint.
- Set `TASKS` to your P1 eval suite (e.g., `tasks/fc_tasks.jsonl`).
- Increase `STEPS`/`MAX_NEW_TOKENS` for real training; enable validation if desired.
- To summarize an existing run, skip the training cell and point `RUN_DIR` to the run directory.
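
Before launching a long run against a new task file, it can help to confirm that the file parses as JSONL. The sketch below is a minimal check under stated assumptions: `check_jsonl` is a hypothetical helper (not part of `agent_stable_slo`), and it only verifies that every non-empty line is valid JSON. It makes no claim about the task schema the training loop expects.

```python
import json
from pathlib import Path


def check_jsonl(path):
    """Return (n_valid, bad_line_numbers) for a JSONL file.

    Hypothetical helper for illustration; blank lines are ignored.
    """
    n_valid = 0
    bad_lines = []
    text = Path(path).read_text(encoding="utf-8")
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            json.loads(line)
            n_valid += 1
        except json.JSONDecodeError:
            # Record the 1-based line number so it is easy to find in an editor.
            bad_lines.append(lineno)
    return n_valid, bad_lines
```

Calling `check_jsonl(TASKS)` after editing the config cell returns the number of parseable lines and the line numbers of any that failed; an empty second value means every non-empty line parsed.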

In [None]:
from pathlib import Path
import pandas as pd
from agent_stable_slo.train.grpo_train_loop import GRPOTrainConfig, train_loop
from scripts.summarize_logs import summarize_file

# --- User-facing config (edit these) ---
BASE_MODEL = "hf-internal-testing/tiny-random-GPTNeoXForCausalLM"   # swap to your local Qwen/RL checkpoint
TASKS = "tasks/tiny_smoke.jsonl"                                   # swap to P1 eval suite, e.g., tasks/fc_tasks.jsonl
RUN_DIR = Path("out/notebook_p1_smoke")                            # where outputs go; point to an existing run to summarize only
RESUME_FROM = None                                                  # optional: existing checkpoint/run dir to resume
STEPS = 40                                                          # raise for real runs (e.g., 500+)
MAX_NEW_TOKENS = 64                                                 # raise for real runs (e.g., 128-192)
CACHE_DATASET = True                                                # keep True for reproducibility
BLOCKLIST = ""                                                     # e.g., "forbidden1,forbidden2"; empty to disable
VAL_TASKS = None                                                    # optional validation tasks file
VAL_INTERVAL = 0                                                    # validate every N steps; 0 disables
SEED = 1234                                                         # set for reproducibility

RUN_DIR.mkdir(parents=True, exist_ok=True)
print(f"Using model={BASE_MODEL}\nTasks={TASKS}\nOut={RUN_DIR}")

In [None]:
# --- Run GRPO training (skip this cell if you only want to summarize an existing run) ---
cfg = GRPOTrainConfig(
    base_model=BASE_MODEL,
    tasks=TASKS,
    out=str(RUN_DIR),
    steps=STEPS,
    max_prompt_len=256,
    max_new_tokens=MAX_NEW_TOKENS,
    deterministic=True,
    cache_dataset=CACHE_DATASET,
    load_in_4bit=False,
    gradient_accumulation=1,
    lr=1e-4,
    lora_rank=4,
    lora_alpha=8,
    lora_dropout=0.0,
    lora_targets="query_key_value,dense",
    eval_interval=max(1, STEPS // 4),
    torch_dtype="float32",
    cache_dir="out/cache",
    val_tasks=VAL_TASKS,
    val_interval=VAL_INTERVAL,
    blocklist=BLOCKLIST,
    seed=SEED,
    resume_from=RESUME_FROM,
)

train_loop(cfg)
print(f"Run complete: {RUN_DIR}")

In [None]:
# --- Summarize into a paper-style table ---
# Choose train_log.jsonl or eval.jsonl depending on what you ran
log_path = RUN_DIR / "train_log.jsonl"  # swap to eval.jsonl if summarizing eval
summary = summarize_file(log_path)
df = pd.DataFrame([summary])
df

In [None]:
# --- Visuals: reward and latency over time (skipped if the log file is missing) ---
import json
import plotly.express as px

if log_path.exists():
    # One JSON object per non-empty line
    records = [
        json.loads(line)
        for line in log_path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    df_log = pd.DataFrame(records)

    fig_reward = px.line(df_log, x="step", y="reward", title="Reward over steps")
    fig_latency = px.line(df_log, x="step", y="latency_ms", title="Latency (ms) over steps")
    fig_reward.show()
    fig_latency.show()
else:
    print(f"No log at {log_path}; run the training cell or point RUN_DIR at an existing run.")

### Sample output (tiny smoke)

| file                                  | json_valid_rate | latency_p95_ms | latency_p99_ms | reward_ci_lower | reward_ci_upper | reward_mean | ttft_avg_ms |
|:--------------------------------------|----------------:|---------------:|---------------:|----------------:|----------------:|------------:|------------:|
| out/notebook_p1_smoke/train_log.jsonl |               1 |            ~95 |            ~96 |               2 |               2 |           2 |         ~93 |

These numbers come from the tiny smoke configuration. With a real model and task suite, the rewards and latencies in your table will differ.
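
The columns in the sample table can be approximated by hand from the raw log records. The following is a minimal sketch, not the implementation of `summarize_file`: it assumes each record carries `reward` and `latency_ms` (the fields plotted in the visuals cell) plus a boolean `json_valid` field, which is a hypothetical name. It uses a normal-approximation 95% interval and linear-interpolation percentiles, both of which may differ from what the project script does.

```python
import math
import statistics


def percentile(values, q):
    """Percentile q (0-100) of a non-empty list, with linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize_records(records):
    """Summarize a non-empty list of log records into one table row."""
    rewards = [float(r["reward"]) for r in records]
    latencies = [float(r["latency_ms"]) for r in records]
    # `json_valid` is a hypothetical field name; missing values count as invalid.
    valid = [bool(r.get("json_valid", False)) for r in records]

    mean = statistics.fmean(rewards)
    # Normal-approximation 95% half-width; zero when there is a single sample.
    if len(rewards) > 1:
        half = 1.96 * statistics.stdev(rewards) / math.sqrt(len(rewards))
    else:
        half = 0.0
    return {
        "reward_mean": mean,
        "reward_ci_lower": mean - half,
        "reward_ci_upper": mean + half,
        "latency_p95_ms": percentile(latencies, 95),
        "latency_p99_ms": percentile(latencies, 99),
        "json_valid_rate": sum(valid) / len(valid),
    }
```

Under this approximation, a log in which every reward is identical yields a zero-width interval, which is one way a row with equal lower bound, upper bound, and mean can arise.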

### Swap in paper runs
- Set `BASE_MODEL` to your local Qwen or RL-adapted checkpoint (or use `RESUME_FROM` to load a saved adapter).
- Point `TASKS` at your P1 eval suite (e.g., `tasks/fc_tasks.jsonl`).
- Increase `STEPS`/`MAX_NEW_TOKENS` and optionally enable validation (`VAL_TASKS`, `VAL_INTERVAL`).
- To summarize an existing run, skip the training cell and set `RUN_DIR` to the run dir, then re-run the summary/visual cells.
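
When several runs need to go into one table, the per-run summaries can be stacked row by row. This is a minimal sketch, assuming the summarizer returns one dict per log file (as the `pd.DataFrame([summary])` call in the summary cell suggests). `collect_run_rows` and its `run` column are hypothetical names, and runs without a log are skipped rather than treated as errors.

```python
from pathlib import Path


def collect_run_rows(run_dirs, summarize, log_name="train_log.jsonl"):
    """Apply `summarize` to each run's log and return a list of row dicts."""
    rows = []
    for run_dir in run_dirs:
        log_path = Path(run_dir) / log_name
        if not log_path.exists():
            # Skip runs that have not produced this log yet.
            continue
        row = dict(summarize(log_path))
        row["run"] = Path(run_dir).name  # label the row with the run directory name
        rows.append(row)
    return rows
```

In this notebook, passing a list of run directories and `summarize_file` to `collect_run_rows`, then wrapping the result in `pd.DataFrame(...)`, would give one row per run. Pass `log_name="eval.jsonl"` to stack eval summaries instead.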