Add archive dynamics pipeline and audience-based model presets #7

Merged
scoootscooob merged 2 commits into openclaw:main from HaoLi111:feat/dynamics-analysis on Apr 22, 2026

@HaoLi111
Contributor

Summary

  • add archive-based dynamics analysis and plotting with CLI support
  • preserve posterior formulas and explanatory comments in the analysis scripts
  • add Space preset audiences for full Claw users and budget-sensitive researchers
  • add focused tests for dynamics reporting and submission model selection

Validation

  • focused pytest slice passed: 47 tests
  • real posterior pipeline rerun succeeded on results/full_rerun_2026-04-21_cache

Notes

  • keeps Scott’s offline workflow intact
  • verified on the real gpt-oss:20b and qwen3.5:27b rerun archive

Copilot AI review requested due to automatic review settings April 22, 2026 03:38

Copilot AI left a comment

Pull request overview

This PR adds an offline/posterior “dynamics” analysis pipeline (archive loading, metrics, plots, CLI + scripts) and updates the Space submission UI to support audience-filtered model presets (full catalog vs budget-friendly subset).

Changes:

  • Introduces dynamics core analysis (clawbench/dynamics.py), archive helpers + plotting, and a new clawbench dynamics-report CLI command (plus an optional --dynamics post-run hook).
  • Refactors posterior analysis scripts to consume cached run archives via shared helpers and emits a consolidated markdown report.
  • Centralizes Space submission model presets in clawbench/submission_models.py, adds “preset audiences”, and updates UI + docs + tests accordingly.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/test_submission_models.py | Adds coverage for audience filtering + model/provider resolution. |
| tests/test_dynamics_cli.py | Verifies dynamics-report --no-plots works end-to-end. |
| tests/test_dynamics_archive.py | Covers archive loading + JSON report writing + sensitivity summary. |
| tests/test_dynamics.py | Unit tests for dynamics metrics, stratification, KM, and sensitivity. |
| scripts/variance_decomp.py | Switches variance decomposition to cached-run archive loader + CLI args. |
| scripts/survival_analysis.py | Switches survival analysis to cached-run archive loader + CLI args. |
| scripts/snr_weighted_ranking.py | Computes SNR-weighted rankings from cached runs + CLI args. |
| scripts/run_posterior_dynamics_pipeline.py | New driver to run the full posterior pipeline end-to-end. |
| scripts/generate_dynamical_report.py | Rebuilds combined markdown report from JSON artifacts (optional ranking). |
| scripts/compute_constraint_index.py | Computes task-level C(q) from cached transcripts using shared loader. |
| scripts/classify_regimes.py | Classifies regimes from cached transcripts using shared loader. |
| clawbench/submission_models.py | New centralized preset catalog + audience filtering + resolution logic. |
| clawbench/harness.py | Stores last_task_runs for post-run dynamics analysis hook. |
| clawbench/dynamics_plots.py | New matplotlib-based plotting for dynamics outputs. |
| clawbench/dynamics_archive.py | New offline archive loader + report builder/writer + sensitivity sections. |
| clawbench/dynamics.py | New core dynamics feature embedding, regime classification, sensitivity, strata. |
| clawbench/client.py | Improves transcript capture reliability + resolves Node executable for identity helper. |
| clawbench/cli.py | Adds --dynamics option and a new dynamics-report command. |
| app.py | Moves presets into submission_models, adds preset audience UI + bulk-submit filtering. |
| SPACE_README.md | Documents preset audiences in the Space UI. |
| README.md | Documents posterior dynamics pipeline + CLI usage + formulas and repo layout updates. |

if ranking is not None:
    L("## 5. SNR-weighted Ranking")
    L("")
    L("| Rank | Model | Flat | SNR x |C(q)| | Winsorized | Coverage |")

Copilot AI Apr 22, 2026

The markdown table header includes SNR x |C(q)|, which contains | characters. In markdown tables, | is a column separator, so this header will render with the wrong number of columns. Rename the header (e.g. SNR x abs(C(q))) or escape the pipes so the generated report renders correctly.

Suggested change
- L("| Rank | Model | Flat | SNR x |C(q)| | Winsorized | Coverage |")
+ L("| Rank | Model | Flat | SNR x abs(C(q)) | Winsorized | Coverage |")
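If the original header text should stay as written, the escaping route mentioned in the comment could be sketched as follows. This is a minimal illustration, not code from the PR: `escape_cell` and `table_row` are hypothetical helpers, and the sketch assumes the report is viewed in a GitHub-flavored markdown renderer, where `\|` inside a table cell is treated as a literal pipe.

```python
def escape_cell(text: str) -> str:
    """Escape pipe characters so they do not split a markdown table cell."""
    return text.replace("|", "\\|")


def table_row(cells: list[str]) -> str:
    """Join cells into one markdown table row, escaping each cell first."""
    return "| " + " | ".join(escape_cell(cell) for cell in cells) + " |"


# Hypothetical usage: build the ranking header with the original label intact.
header = table_row(
    ["Rank", "Model", "Flat", "SNR x |C(q)|", "Winsorized", "Coverage"]
)
print(header)
```

Escaping at the point where rows are assembled keeps every cell safe, including model names that might contain a pipe, whereas renaming the header fixes only this one label.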

"PR": pr,
"entropy": ent,
"BOPS": bops,
"data_source": "fallback_any_message" if use_fallback_messages else "assistant_final",

Copilot AI Apr 22, 2026

data_source is set to "assistant_final", but the text is built from the full assistant trajectory (all assistant messages + tool call metadata), not just the final assistant turn. This label is likely to confuse downstream consumers of constraint_index.json; consider renaming it to something like assistant_trajectory (and keeping fallback_any_message for the fallback path).

Suggested change
- "data_source": "fallback_any_message" if use_fallback_messages else "assistant_final",
+ "data_source": "fallback_any_message" if use_fallback_messages else "assistant_trajectory",

Comment thread clawbench/dynamics.py


def _conditional_mi(seq: list[str]) -> float:
    """I(X_t ; X_{t-2} | X_{t-1}) — non-Markov msemory indicator."""

Copilot AI Apr 22, 2026

Typo in docstring: "msemory" should be "memory".

Suggested change
- """I(X_t ; X_{t-2} | X_{t-1}) — non-Markov msemory indicator."""
+ """I(X_t ; X_{t-2} | X_{t-1}) — non-Markov memory indicator."""

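For readers unfamiliar with the quantity named in that docstring, here is one way it could be estimated. This is a plug-in (empirical frequency) sketch under stated assumptions, not the implementation in clawbench/dynamics.py: the function name `conditional_mi` is hypothetical, the sequence is treated as categorical symbols, logs are natural, and the function returns 0.0 when fewer than three symbols are available.

```python
import math
from collections import Counter


def conditional_mi(seq: list[str]) -> float:
    """Plug-in estimate of I(X_t ; X_{t-2} | X_{t-1}) in nats."""
    # Each triple is (x, y, z) = (X_t, X_{t-1}, X_{t-2}).
    triples = [(seq[i], seq[i - 1], seq[i - 2]) for i in range(2, len(seq))]
    n = len(triples)
    if n == 0:
        return 0.0
    c_xyz = Counter(triples)
    c_xy = Counter((x, y) for x, y, _ in triples)
    c_yz = Counter((y, z) for _, y, z in triples)
    c_y = Counter(y for _, y, _ in triples)
    total = 0.0
    for (x, y, z), count in c_xyz.items():
        # p(x,y,z) * p(y) / (p(x,y) * p(y,z)); the 1/n factors cancel.
        ratio = (count * c_y[y]) / (c_xy[(x, y)] * c_yz[(y, z)])
        total += (count / n) * math.log(ratio)
    return total
```

A value near zero is consistent with first-order Markov behavior in the sample, while a clearly positive value suggests that X_{t-2} still carries information about X_t after conditioning on X_{t-1}. Plug-in estimates are biased upward on short sequences, so small positive values should be read with care.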
    events.append(step)
    censored.append(False)
else:
    events.append(float(len(run.transcript.assistant_messages)))

Copilot AI Apr 22, 2026

In the Kaplan–Meier setup, find_event_step() returns a 0-based assistant-step index (per find_event_step tests), but the censored time uses len(assistant_messages) (1-based count). This off-by-one skews survival times for censored runs. Use len(run.transcript.assistant_messages) - 1 (clamped at 0) so censoring is on the same index scale as event steps.

Suggested change
- events.append(float(len(run.transcript.assistant_messages)))
+ censored_step = max(len(run.transcript.assistant_messages) - 1, 0)
+ events.append(float(censored_step))
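The index-scale argument can be checked in isolation with a small helper. This is a sketch, not the PR's code: `survival_time` is a hypothetical function, and it assumes, as the comment states, that event steps are 0-based assistant-step indices and that a run with no event is censored at its last observed step.

```python
from typing import Optional


def survival_time(
    event_step: Optional[int], n_assistant_messages: int
) -> tuple[float, bool]:
    """Return (time, censored) on the 0-based assistant-step scale."""
    if event_step is not None:
        # Event observed: the step index is already 0-based.
        return float(event_step), False
    # No event: censor at the last observed step index, clamped at 0
    # so that an empty transcript does not produce a negative time.
    return float(max(n_assistant_messages - 1, 0)), True


# A run with 10 assistant messages and no event is censored at step 9,
# the same scale on which an event in the last message would be recorded.
print(survival_time(None, 10))
print(survival_time(9, 10))
```

Keeping both branches in one function makes it harder for the event and censoring paths to drift onto different scales again.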

Comment thread clawbench/dynamics_archive.py Outdated
    stratify_by_tier,
    stratify_by_tool_mix,
)
from clawbench.dynamics_plots import generate_all_plots

Copilot AI Apr 22, 2026

clawbench.dynamics_archive imports generate_all_plots at module import time, and clawbench.dynamics_plots imports matplotlib. matplotlib is not declared in pyproject.toml, so importing clawbench.dynamics_archive (even when --no-plots / generate_plots=False) can fail in minimal installs. Consider either adding matplotlib as a required dependency, or lazily importing the plotting module only when plots are requested (and emitting a clear error if unavailable).

Suggested change
- from clawbench.dynamics_plots import generate_all_plots
+ try:
+     from clawbench.dynamics_plots import generate_all_plots
+ except ImportError as exc:
+     _PLOTTING_IMPORT_ERROR = exc
+     def generate_all_plots(*args, **kwargs):
+         raise RuntimeError(
+             "Plot generation requires optional plotting dependencies to be installed. "
+             "Reinstall with plotting support (for example, ensure matplotlib is "
+             "available), or disable plot generation."
+         ) from _PLOTTING_IMPORT_ERROR
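The other option raised in the comment, importing the plotting module only when plots are requested, might look roughly like this. It is a sketch under assumptions: `load_plotter` and `write_report` are hypothetical names, and only the module path `clawbench.dynamics_plots` and the function name `generate_all_plots` come from the PR.

```python
import importlib


def load_plotter(module_name: str = "clawbench.dynamics_plots"):
    """Import the plotting module on demand and return its entry point."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"Plot generation needs {module_name!r} and its dependencies "
            "(for example matplotlib); install them or disable plots."
        ) from exc
    return module.generate_all_plots


def write_report(generate_plots: bool) -> str:
    """Hypothetical report writer that touches plotting only when asked."""
    if generate_plots:
        plot = load_plotter()
        plot()
    return "report written"
```

With this shape, a run with plots disabled never imports matplotlib, and the failure message appears only at the point where a plot is actually requested.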

Comment on lines +146 to +152
for run_file in sorted(task_dir.glob("run*.json")):
    try:
        run = TaskRunResult.model_validate_json(
            run_file.read_text(encoding="utf-8")
        )
    except Exception:
        continue

Copilot AI Apr 22, 2026

load_task_runs_archive silently ignores all exceptions when parsing run*.json files. This makes it very hard to notice corrupted/partial archives (and can bias the report by dropping runs without any signal). Consider at least tracking a skipped-file count and reporting it, or logging the exception at debug/warn level with the file path.
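One possible shape for that change is sketched below. It is illustrative only: `load_runs` is a hypothetical stand-in for the loader, `json.loads` stands in for `TaskRunResult.model_validate_json` so the example runs without the project installed, and the set of caught exception types is an assumption that would need to match what the real parser raises.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_runs(task_dir: Path) -> tuple[list[dict], list[Path]]:
    """Parse run*.json files, returning (parsed runs, skipped file paths)."""
    runs: list[dict] = []
    skipped: list[Path] = []
    for run_file in sorted(task_dir.glob("run*.json")):
        try:
            runs.append(json.loads(run_file.read_text(encoding="utf-8")))
        except (OSError, ValueError) as exc:
            # Record the failure instead of dropping the run silently.
            logger.warning("Skipping unreadable run file %s: %s", run_file, exc)
            skipped.append(run_file)
    return runs, skipped
```

Returning the skipped paths lets the report state how many runs were dropped, so a partially corrupted archive shows up in the output rather than quietly shrinking the sample.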

@scoootscooob force-pushed the feat/dynamics-analysis branch from c50b659 to 11d943f on April 22, 2026 19:47
@scoootscooob merged commit df32a5f into openclaw:main on Apr 22, 2026
3 checks passed