Add archive dynamics pipeline and audience-based model presets by HaoLi111 · Pull Request #7 · openclaw/clawbench

HaoLi111 · 2026-04-22T03:37:59Z

Summary

add archive-based dynamics analysis and plotting with CLI support
preserve posterior formulas and explanatory comments in the analysis scripts
add Space preset audiences for full Claw users and budget-sensitive researchers
add focused tests for dynamics reporting and submission model selection

Validation

focused pytest slice passed: 47 tests
real posterior pipeline rerun succeeded on results/full_rerun_2026-04-21_cache

Notes

keeps Scott’s offline workflow intact
verified on the real gpt-oss:20b and qwen3.5:27b rerun archive

Copilot

Pull request overview

This PR adds an offline/posterior “dynamics” analysis pipeline (archive loading, metrics, plots, CLI + scripts) and updates the Space submission UI to support audience-filtered model presets (full catalog vs budget-friendly subset).

Changes:

Introduces dynamics core analysis (clawbench/dynamics.py), archive helpers + plotting, and a new clawbench dynamics-report CLI command (plus an optional --dynamics post-run hook).
Refactors posterior analysis scripts to consume cached run archives via shared helpers and emits a consolidated markdown report.
Centralizes Space submission model presets in clawbench/submission_models.py, adds “preset audiences”, and updates UI + docs + tests accordingly.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`tests/test_submission_models.py`	Adds coverage for audience filtering + model/provider resolution.
`tests/test_dynamics_cli.py`	Verifies `dynamics-report --no-plots` works end-to-end.
`tests/test_dynamics_archive.py`	Covers archive loading + JSON report writing + sensitivity summary.
`tests/test_dynamics.py`	Unit tests for dynamics metrics, stratification, KM, and sensitivity.
`scripts/variance_decomp.py`	Switches variance decomposition to cached-run archive loader + CLI args.
`scripts/survival_analysis.py`	Switches survival analysis to cached-run archive loader + CLI args.
`scripts/snr_weighted_ranking.py`	Computes SNR-weighted rankings from cached runs + CLI args.
`scripts/run_posterior_dynamics_pipeline.py`	New driver to run the full posterior pipeline end-to-end.
`scripts/generate_dynamical_report.py`	Rebuilds combined markdown report from JSON artifacts (optional ranking).
`scripts/compute_constraint_index.py`	Computes task-level C(q) from cached transcripts using shared loader.
`scripts/classify_regimes.py`	Classifies regimes from cached transcripts using shared loader.
`clawbench/submission_models.py`	New centralized preset catalog + audience filtering + resolution logic.
`clawbench/harness.py`	Stores `last_task_runs` for post-run dynamics analysis hook.
`clawbench/dynamics_plots.py`	New matplotlib-based plotting for dynamics outputs.
`clawbench/dynamics_archive.py`	New offline archive loader + report builder/writer + sensitivity sections.
`clawbench/dynamics.py`	New core dynamics feature embedding, regime classification, sensitivity, strata.
`clawbench/client.py`	Improves transcript capture reliability + resolves Node executable for identity helper.
`clawbench/cli.py`	Adds `--dynamics` option and a new `dynamics-report` command.
`app.py`	Moves presets into `submission_models`, adds preset audience UI + bulk-submit filtering.
`SPACE_README.md`	Documents preset audiences in the Space UI.
`README.md`	Documents posterior dynamics pipeline + CLI usage + formulas and repo layout updates.


Copilot · 2026-04-22T03:42:11Z

+    if ranking is not None:
+        L("## 5. SNR-weighted Ranking")
+        L("")
+        L("| Rank | Model | Flat | SNR x |C(q)| | Winsorized | Coverage |")


Copilot · 2026-04-22T03:42:11Z

            "PR": pr,
            "entropy": ent,
            "BOPS": bops,
+            "data_source": "fallback_any_message" if use_fallback_messages else "assistant_final",


data_source is set to "assistant_final", but the text is built from the full assistant trajectory (all assistant messages + tool call metadata), not just the final assistant turn. This label is likely to confuse downstream consumers of constraint_index.json; consider renaming it to something like assistant_trajectory (and keeping fallback_any_message for the fallback path).

Suggested change

"data_source": "fallback_any_message" if use_fallback_messages else "assistant_final",

"data_source": "fallback_any_message" if use_fallback_messages else "assistant_trajectory",

Copilot · 2026-04-22T03:42:11Z

+
+
+def _conditional_mi(seq: list[str]) -> float:
+    """I(X_t ; X_{t-2} | X_{t-1}) — non-Markov msemory indicator."""


Typo in docstring: "msemory" should be "memory".
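For context, the quantity named in this docstring can be estimated from a symbol sequence with plug-in (empirical frequency) probabilities. The sketch below is an illustration only, not the PR's `_conditional_mi`; the function name `conditional_mi` and the choice of bits as the unit are assumptions.

```python
from collections import Counter
from math import log2


def conditional_mi(seq: list[str]) -> float:
    """Plug-in estimate of I(X_t ; X_{t-2} | X_{t-1}) in bits (illustrative)."""
    # Overlapping triples (x_{t-2}, x_{t-1}, x_t).
    triples = list(zip(seq, seq[1:], seq[2:]))
    if not triples:
        return 0.0
    n = len(triples)
    c_abc = Counter(triples)
    c_ab = Counter((a, b) for a, b, _ in triples)
    c_bc = Counter((b, c) for _, b, c in triples)
    c_b = Counter(b for _, b, _ in triples)
    total = 0.0
    for (a, b, c), k in c_abc.items():
        # p(a,b,c) * log2[ p(a,b,c) * p(b) / (p(a,b) * p(b,c)) ]
        total += (k / n) * log2(k * c_b[b] / (c_ab[a, b] * c_bc[b, c]))
    return total
```

A first-order Markov sequence such as `["a", "b"] * 10` gives zero, while a sequence whose next symbol depends on the symbol two steps back (for example `list("aabb" * 5)`) gives a clearly positive value, which is why the docstring calls it a non-Markov memory indicator.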

Suggested change

"""I(X_t ; X_{t-2} | X_{t-1}) — non-Markov msemory indicator."""

"""I(X_t ; X_{t-2} | X_{t-1}) — non-Markov memory indicator."""

Copilot · 2026-04-22T03:42:12Z

+            events.append(step)
+            censored.append(False)
+        else:
+            events.append(float(len(run.transcript.assistant_messages)))


In the Kaplan–Meier setup, find_event_step() returns a 0-based assistant-step index (per find_event_step tests), but the censored time uses len(assistant_messages) (1-based count). This off-by-one skews survival times for censored runs. Use len(run.transcript.assistant_messages) - 1 (clamped at 0) so censoring is on the same index scale as event steps.

Suggested change

events.append(float(len(run.transcript.assistant_messages)))

censored_step = max(len(run.transcript.assistant_messages) - 1, 0)
events.append(float(censored_step))
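To see why the index scale matters, here is a minimal Kaplan–Meier sketch. It is not the PR's estimator: `kaplan_meier`, `censor_time`, and the toy data are hypothetical, and ties are resolved the usual way (events at time t are counted before censorings at t).

```python
def censor_time(n_assistant_messages: int) -> float:
    """Last 0-based assistant step, matching the suggested change above."""
    return float(max(n_assistant_messages - 1, 0))


def kaplan_meier(times: list[float], censored: list[bool]) -> list[tuple[float, float]]:
    """Return (time, S(time)) at each distinct event time."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Runs that hit the event at t (censored runs do not count as events).
        deaths = sum(1 for x, c in zip(times, censored) if x == t and not c)
        # Everything observed at t leaves the risk set afterwards.
        leaving = sum(1 for x in times if x == t)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Two runs with events at 0-based steps 2 and 4, plus one censored run
# that produced 2 assistant messages (last index 1).
on_index_scale = kaplan_meier([2.0, 4.0, censor_time(2)], [False, False, True])
off_by_one = kaplan_meier([2.0, 4.0, 2.0], [False, False, True])
```

With the index-scale censoring time the censored run leaves the risk set before step 2, so S(2) = 1/2; with the message count it is still counted as at risk at step 2, which inflates S(2) to 2/3.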

Copilot · 2026-04-22T03:42:12Z

+    stratify_by_tier,
+    stratify_by_tool_mix,
+)
+from clawbench.dynamics_plots import generate_all_plots


clawbench.dynamics_archive imports generate_all_plots at module import time, and clawbench.dynamics_plots imports matplotlib. matplotlib is not declared in pyproject.toml, so importing clawbench.dynamics_archive (even when --no-plots / generate_plots=False) can fail in minimal installs. Consider either adding matplotlib as a required dependency, or lazily importing the plotting module only when plots are requested (and emitting a clear error if unavailable).

Suggested change

from clawbench.dynamics_plots import generate_all_plots

try:
    from clawbench.dynamics_plots import generate_all_plots
except ImportError as exc:
    _PLOTTING_IMPORT_ERROR = exc
    def generate_all_plots(*args, **kwargs):
        raise RuntimeError(
            "Plot generation requires optional plotting dependencies to be installed. "
            "Reinstall with plotting support (for example, ensure matplotlib is "
            "available), or disable plot generation."
        ) from _PLOTTING_IMPORT_ERROR
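The other option mentioned in the comment, importing the plotting module only when plots are requested, could look roughly like this. This is a sketch under assumptions: `maybe_generate_plots` and its parameters are hypothetical, and the real `generate_all_plots` signature is not shown in this PR page, so the call below is a guess.

```python
import importlib
from pathlib import Path
from typing import Any


def maybe_generate_plots(
    output_dir: Path,
    *,
    generate_plots: bool,
    plots_module: str = "clawbench.dynamics_plots",
) -> Any:
    """Import the plotting module only when plots are actually requested."""
    if not generate_plots:
        # --no-plots path: the plotting module (and matplotlib) is never imported.
        return None
    try:
        plots = importlib.import_module(plots_module)
    except ImportError as exc:
        raise RuntimeError(
            "Plot generation requires matplotlib; install it or disable plots."
        ) from exc
    # Assumption: generate_all_plots accepts the output directory.
    return plots.generate_all_plots(output_dir)
```

Compared with the try/except at module import time, this keeps the import cost and the failure mode entirely on the plotting path, at the price of a slightly less direct call site.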

Copilot · 2026-04-22T03:42:12Z

+            for run_file in sorted(task_dir.glob("run*.json")):
+                try:
+                    run = TaskRunResult.model_validate_json(
+                        run_file.read_text(encoding="utf-8")
+                    )
+                except Exception:
+                    continue


load_task_runs_archive silently ignores all exceptions when parsing run*.json files. This makes it very hard to notice corrupted/partial archives (and can bias the report by dropping runs without any signal). Consider at least tracking a skipped-file count and reporting it, or logging the exception at debug/warn level with the file path.
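One way to surface skipped files, sketched under assumptions: `load_run_files` is a hypothetical stand-in for the loop above, and `json.loads` stands in for `TaskRunResult.model_validate_json` so the example runs without the project installed.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_run_files(task_dir: Path) -> tuple[list[dict], list[Path]]:
    """Parse run*.json files, returning parsed runs and the files that were skipped."""
    runs: list[dict] = []
    skipped: list[Path] = []
    for run_file in sorted(task_dir.glob("run*.json")):
        try:
            runs.append(json.loads(run_file.read_text(encoding="utf-8")))
        except (OSError, ValueError) as exc:
            # json.JSONDecodeError and UnicodeDecodeError are ValueError subclasses.
            logger.warning("Skipping unreadable run file %s: %s", run_file, exc)
            skipped.append(run_file)
    return runs, skipped
```

The caller can then include `len(skipped)` in the report so a partially corrupted archive is visible rather than silently shrinking the sample.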

Copilot AI review requested due to automatic review settings April 22, 2026 03:38

Copilot started reviewing on behalf of HaoLi111 April 22, 2026 03:38

Copilot AI reviewed Apr 22, 2026


pllm-uci and others added 2 commits April 22, 2026 12:03

Add archive dynamics pipeline and audience-based model presets

c209612

fix: preserve preset submission settings and lazy-load plots

11d943f

scoootscooob mentioned this pull request Apr 22, 2026

Merge-ready version of #7 #9

Closed

scoootscooob force-pushed the feat/dynamics-analysis branch from c50b659 to 11d943f Compare April 22, 2026 19:47

scoootscooob merged commit df32a5f into openclaw:main Apr 22, 2026
3 checks passed


4 participants

Suggested change (SNR-weighted ranking table header):

L("| Rank | Model | Flat | SNR x |C(q)| | Winsorized | Coverage |")

L("| Rank | Model | Flat | SNR x abs(C(q)) | Winsorized | Coverage |")
