Add archive dynamics pipeline and audience-based model presets #7
Pull request overview
This PR adds an offline/posterior “dynamics” analysis pipeline (archive loading, metrics, plots, CLI + scripts) and updates the Space submission UI to support audience-filtered model presets (full catalog vs budget-friendly subset).
Changes:
- Introduces dynamics core analysis (clawbench/dynamics.py), archive helpers + plotting, and a new clawbench dynamics-report CLI command (plus an optional --dynamics post-run hook).
- Refactors posterior analysis scripts to consume cached run archives via shared helpers and emits a consolidated markdown report.
- Centralizes Space submission model presets in clawbench/submission_models.py, adds “preset audiences”, and updates UI + docs + tests accordingly.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/test_submission_models.py | Adds coverage for audience filtering + model/provider resolution. |
| tests/test_dynamics_cli.py | Verifies dynamics-report --no-plots works end-to-end. |
| tests/test_dynamics_archive.py | Covers archive loading + JSON report writing + sensitivity summary. |
| tests/test_dynamics.py | Unit tests for dynamics metrics, stratification, KM, and sensitivity. |
| scripts/variance_decomp.py | Switches variance decomposition to cached-run archive loader + CLI args. |
| scripts/survival_analysis.py | Switches survival analysis to cached-run archive loader + CLI args. |
| scripts/snr_weighted_ranking.py | Computes SNR-weighted rankings from cached runs + CLI args. |
| scripts/run_posterior_dynamics_pipeline.py | New driver to run the full posterior pipeline end-to-end. |
| scripts/generate_dynamical_report.py | Rebuilds combined markdown report from JSON artifacts (optional ranking). |
| scripts/compute_constraint_index.py | Computes task-level C(q) from cached transcripts using shared loader. |
| scripts/classify_regimes.py | Classifies regimes from cached transcripts using shared loader. |
| clawbench/submission_models.py | New centralized preset catalog + audience filtering + resolution logic. |
| clawbench/harness.py | Stores last_task_runs for post-run dynamics analysis hook. |
| clawbench/dynamics_plots.py | New matplotlib-based plotting for dynamics outputs. |
| clawbench/dynamics_archive.py | New offline archive loader + report builder/writer + sensitivity sections. |
| clawbench/dynamics.py | New core dynamics feature embedding, regime classification, sensitivity, strata. |
| clawbench/client.py | Improves transcript capture reliability + resolves Node executable for identity helper. |
| clawbench/cli.py | Adds --dynamics option and a new dynamics-report command. |
| app.py | Moves presets into submission_models, adds preset audience UI + bulk-submit filtering. |
| SPACE_README.md | Documents preset audiences in the Space UI. |
| README.md | Documents posterior dynamics pipeline + CLI usage + formulas and repo layout updates. |
| if ranking is not None: | ||
| L("## 5. SNR-weighted Ranking") | ||
| L("") | ||
| L("| Rank | Model | Flat | SNR x |C(q)| | Winsorized | Coverage |") |
The markdown table header includes SNR x |C(q)|, which contains | characters. In markdown tables, | is a column separator, so this header will render with the wrong number of columns. Rename the header (e.g. SNR x abs(C(q))) or escape the pipes so the generated report renders correctly.
| L("| Rank | Model | Flat | SNR x |C(q)| | Winsorized | Coverage |") | |
| L("| Rank | Model | Flat | SNR x abs(C(q)) | Winsorized | Coverage |") |
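If the original header text should be kept rather than renamed, the pipes can be escaped instead. The following is a minimal sketch of that alternative; `md_cell` and `md_row` are hypothetical helpers for illustration and are not part of this PR. It relies on GitHub Flavored Markdown accepting `\|` inside a table cell.

```python
def md_cell(text: str) -> str:
    """Escape pipe characters so the text is safe inside a Markdown table cell."""
    return text.replace("|", "\\|")


def md_row(cells: list[str]) -> str:
    """Join already-separated cell values into one Markdown table row."""
    return "| " + " | ".join(md_cell(cell) for cell in cells) + " |"


# Hypothetical usage: build the ranking header from a list of column names,
# so a literal "|" in a name can never be mistaken for a column separator.
header = md_row(["Rank", "Model", "Flat", "SNR x |C(q)|", "Winsorized", "Coverage"])
print(header)
```

Building rows from a list of cells keeps the column count explicit, which also protects model names or other values that might contain a pipe.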
| "PR": pr, | ||
| "entropy": ent, | ||
| "BOPS": bops, | ||
| "data_source": "fallback_any_message" if use_fallback_messages else "assistant_final", |
data_source is set to "assistant_final", but the text is built from the full assistant trajectory (all assistant messages + tool call metadata), not just the final assistant turn. This label is likely to confuse downstream consumers of constraint_index.json; consider renaming it to something like assistant_trajectory (and keeping fallback_any_message for the fallback path).
| "data_source": "fallback_any_message" if use_fallback_messages else "assistant_final", | |
| "data_source": "fallback_any_message" if use_fallback_messages else "assistant_trajectory", |
| def _conditional_mi(seq: list[str]) -> float: | ||
| """I(X_t ; X_{t-2} | X_{t-1}) — non-Markov msemory indicator.""" |
Typo in docstring: "msemory" should be "memory".
| """I(X_t ; X_{t-2} | X_{t-1}) — non-Markov msemory indicator.""" | |
| """I(X_t ; X_{t-2} | X_{t-1}) — non-Markov memory indicator.""" |
| events.append(step) | ||
| censored.append(False) | ||
| else: | ||
| events.append(float(len(run.transcript.assistant_messages))) |
In the Kaplan–Meier setup, find_event_step() returns a 0-based assistant-step index (per find_event_step tests), but the censored time uses len(assistant_messages) (1-based count). This off-by-one skews survival times for censored runs. Use len(run.transcript.assistant_messages) - 1 (clamped at 0) so censoring is on the same index scale as event steps.
| events.append(float(len(run.transcript.assistant_messages))) | |
| censored_step = max(len(run.transcript.assistant_messages) - 1, 0) | |
| events.append(float(censored_step)) |
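The index convention can be isolated in a small helper so that event and censoring times always share the same 0-based scale. This is a sketch under stated assumptions: `survival_time` is a hypothetical function, and it assumes the event step is `None` when no event was found, which the PR does not confirm.

```python
from typing import Optional


def survival_time(event_step: Optional[int], n_assistant_messages: int) -> tuple[float, bool]:
    """Return (time, censored) on a 0-based assistant-step scale.

    event_step: 0-based index of the event, or None if no event occurred
                (assumed convention, not taken from the PR).
    n_assistant_messages: total number of assistant messages in the run.
    """
    if event_step is not None:
        return float(event_step), False
    # Censor at the index of the last assistant message, clamped at 0
    # so that an empty transcript does not yield a negative time.
    return float(max(n_assistant_messages - 1, 0)), True


print(survival_time(2, 5))     # event observed at step 2
print(survival_time(None, 5))  # censored at the last step, index 4
print(survival_time(None, 0))  # empty transcript, clamped to 0
```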
| stratify_by_tier, | ||
| stratify_by_tool_mix, | ||
| ) | ||
| from clawbench.dynamics_plots import generate_all_plots |
clawbench.dynamics_archive imports generate_all_plots at module import time, and clawbench.dynamics_plots imports matplotlib. matplotlib is not declared in pyproject.toml, so importing clawbench.dynamics_archive (even when --no-plots / generate_plots=False) can fail in minimal installs. Consider either adding matplotlib as a required dependency, or lazily importing the plotting module only when plots are requested (and emitting a clear error if unavailable).
| from clawbench.dynamics_plots import generate_all_plots | |
| try: | |
| from clawbench.dynamics_plots import generate_all_plots | |
| except ImportError as exc: | |
| _PLOTTING_IMPORT_ERROR = exc | |
| def generate_all_plots(*args, **kwargs): | |
| raise RuntimeError( | |
| "Plot generation requires optional plotting dependencies to be installed. " | |
| "Reinstall with plotting support (for example, ensure matplotlib is " | |
| "available), or disable plot generation." | |
| ) from _PLOTTING_IMPORT_ERROR |
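The other option mentioned in the comment, importing the plotting module only when plots are requested, might look like the sketch below. `maybe_generate_plots` and its parameters are hypothetical, and the call signature of `generate_all_plots` is an assumption since the PR excerpt does not show it.

```python
import importlib


def maybe_generate_plots(report, output_dir, generate_plots=True,
                         module_name="clawbench.dynamics_plots"):
    """Import the plotting module lazily, only if plots were requested."""
    if not generate_plots:
        # No plotting dependency is touched on the --no-plots path.
        return None
    try:
        plots = importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            "Plot generation requires optional plotting dependencies "
            "(for example matplotlib); install them or disable plots."
        ) from exc
    # Assumed signature: generate_all_plots(report, output_dir).
    return plots.generate_all_plots(report, output_dir)
```

With this shape, importing the archive module never fails in a minimal install, and the error surfaces only at the point where a plot is actually requested.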
| for run_file in sorted(task_dir.glob("run*.json")): | ||
| try: | ||
| run = TaskRunResult.model_validate_json( | ||
| run_file.read_text(encoding="utf-8") | ||
| ) | ||
| except Exception: | ||
| continue |
load_task_runs_archive silently ignores all exceptions when parsing run*.json files. This makes it very hard to notice corrupted/partial archives (and can bias the report by dropping runs without any signal). Consider at least tracking a skipped-file count and reporting it, or logging the exception at debug/warn level with the file path.
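One way to surface skipped files is to collect them alongside the parsed runs and log each failure. The sketch below is illustrative only: `load_runs` is a hypothetical stand-in for the loader, and `json.loads` replaces `TaskRunResult.model_validate_json` so the example is self-contained. It assumes parse failures are raised as `ValueError` subclasses, which holds for `json.JSONDecodeError`.

```python
import json
import logging
from pathlib import Path
from typing import Any, Callable

logger = logging.getLogger(__name__)


def load_runs(task_dir: Path,
              parse: Callable[[str], Any] = json.loads) -> tuple[list[Any], list[Path]]:
    """Parse run*.json files, returning (runs, skipped_files)."""
    runs: list[Any] = []
    skipped: list[Path] = []
    for run_file in sorted(task_dir.glob("run*.json")):
        try:
            runs.append(parse(run_file.read_text(encoding="utf-8")))
        except (OSError, ValueError) as exc:
            # Record and log the failure instead of dropping the run silently.
            logger.warning("Skipping unreadable run file %s: %s", run_file, exc)
            skipped.append(run_file)
    return runs, skipped
```

The caller can then include `len(skipped)` in the report so that a partial archive is visible in the output rather than silently shrinking the sample.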
c50b659 to 11d943f
Summary
Validation
results/full_rerun_2026-04-21_cache
Notes
gpt-oss:20b and qwen3.5:27b rerun archive