Skip to content

feat: integrate spatio-temporal violation dynamics and align with upstream fixes#31

Open
HaoLi111 wants to merge 7 commits into
openclaw:mainfrom
HaoLi111:feature/spatio-temporal-dynamics-v2
Open

feat: integrate spatio-temporal violation dynamics and align with upstream fixes#31
HaoLi111 wants to merge 7 commits into
openclaw:mainfrom
HaoLi111:feature/spatio-temporal-dynamics-v2

Conversation

@HaoLi111
Copy link
Copy Markdown
Contributor

@HaoLi111 HaoLi111 commented Jun 2, 2026

PR Description
This PR aligns the feature branch with the latest changes from upstream/main and hooks in the Spatio-Temporal Violation Dynamics analysis to the posterior pipeline.

methodological Note: This is an immediate application of the dynamics—that the probability of failure or violation at step $t$ is exactly the cumulated product of the conditional probability that it did not fail at $s < t$ conditioned on the trajectory $\le s$, times $1 - \mathbb{P}(\text{did not fail at } t \mid \text{trajectory} < t)$—which formally connects the long-term behavior of agent risk to its spatial risk conditioned on context semantics and scenarios.

i.e.

$$ \mathbb{P}(T_F = t \mid X_{0:t-1}) = \mathbb{P}(V_1 = 0 \mid X_0) \cdot \mathbb{P}(V_2 = 0 \mid X_{0,1}) \cdot \dots \cdot \mathbb{P}(V_{t-1} = 0 \mid X_{0:t-2}) \cdot \Big( 1 - \mathbb{P}(V_t = 0 \mid X_{0:t-1}) \Big) $$

which let you do a lot of things.

Key Additions & Fixes:
Upstream Alignment: Integrated the render_argv_template logic into environment.py and environment_files.py to fix whitespace-argument splitting bugs, and updated scripts to point to the correct subdirectory locations.
Violation Time Decomposition: Hooked violation_time_decomposition.py into the main pipeline. It now writes session results (violation_metrics.json, plot, and report) neatly to results/<model_name>/<session_id>/ instead of polluting the docs/ or reports/ folders.
Test Suite Stability: Created tests/conftest.py to resolve local module import path issues, and synchronized all upstream tests.

Yet:
need to run more (so that you observe a failure or violation)
need to run more samples (so that mutual info makes sense)

Copilot AI review requested due to automatic review settings June 2, 2026 05:54
@HaoLi111 HaoLi111 requested a review from a team as a code owner June 2, 2026 05:54
@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented Jun 2, 2026

Codex review: needs real behavior proof before merge. Reviewed June 4, 2026, 1:57 PM ET / 17:57 UTC.

Summary
The branch adds violation-time decomposition reporting to the posterior dynamics pipeline, carries model/task metadata through regime reports, adjusts downstream model/task parsing, and adds pytest import-path configuration plus focused tests.

Reproducibility: Do we have a high-confidence way to reproduce the issue? Source-level yes for the review finding: a TaskRunResult with a non-dangerous forbidden tool or forbidden shell-pattern violation will have forbidden_violations set, but the new function only localizes dangerous shell commands.

Review metrics: 2 noteworthy metrics.

  • Changed surface: 7 files changed, +277/-4. The PR is more than a small fix and adds a new analysis stage plus downstream report metadata handling.
  • Pipeline stage: 1 mandatory posterior stage added. The new script runs unconditionally in the posterior pipeline, so runtime proof matters before merge.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🦐 gold shrimp
Result: blocked until real behavior proof is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P1] Add redacted terminal output, logs, or generated artifacts from a real posterior pipeline run.
  • [P1] Fix violation-time localization for all stored forbidden_violations kinds, not only dangerous shell commands.

Proof guidance:

  • [P1] Needs real behavior proof before merge: No after-fix terminal output, logs, artifacts, screenshot, or recording shows the new pipeline running; the contributor should add redacted real output and update the PR body for automatic re-review, or ask a maintainer for @clawsweeper re-review if needed.

Risk before merge

  • [P1] No after-fix real behavior proof shows the new posterior pipeline running against a real archive; the PR body explicitly says more runs and samples are still needed.
  • [P1] The new violation-time script is now a mandatory stage in the posterior pipeline, so runtime failures or malformed real-cache assumptions would stop the full pipeline.
  • [P1] The current implementation can mis-time non-dangerous forbidden violations, which would corrupt the hazard and mutual-information outputs even if the script completes.

Maintainer options:

  1. Require real pipeline proof and localization fix (recommended)
    Ask the contributor to fix the violation timing bug and add redacted terminal output, logs, or artifacts showing the posterior pipeline completes on a real cache.
  2. Keep as experimental off the main pipeline
    Maintainers could ask for the script to stay callable directly until the methodology and runtime proof are stronger.
  3. Pause the broad methodology bundle
    If the new decomposition is not yet a maintainer-approved benchmark metric, pause or close this branch and request a narrower proposal.

Next step before merge

  • [P1] Human review is needed because contributor proof is missing and the remaining blockers include methodology and pipeline-availability judgment, not only a mechanical repair.

Security
Cleared: No concrete security or supply-chain regression was found; the diff adds local analysis code and pytest configuration without new dependency sources, CI permissions, secrets, or lifecycle hooks.

Review findings

  • [P2] Localize every forbidden violation type — scripts/violation_time_decomposition.py:29-31
Review details

Best possible solution:

Land this only after violation localization covers every stored forbidden_violations kind, the contributor posts redacted real pipeline output or artifacts, and maintainers accept the new mandatory analysis stage.

Do we have a high-confidence way to reproduce the issue?

Do we have a high-confidence way to reproduce the issue? Source-level yes for the review finding: a TaskRunResult with a non-dangerous forbidden tool or forbidden shell-pattern violation will have forbidden_violations set, but the new function only localizes dangerous shell commands.

Is this the best way to solve the issue?

Is this the best way to solve the issue? Not yet; the maintainable path is to make violation timing match all stored trajectory violation kinds and prove the mandatory pipeline stage on a real archive before merge.

Full review comments:

  • [P2] Localize every forbidden violation type — scripts/violation_time_decomposition.py:29-31
    The new decomposition is supposed to measure the first forbidden violation, but once forbidden_violations is nonempty this loop only returns a turn for dangerous shell commands. Runs with Forbidden tool called or configured Forbidden shell pattern matched violations fall through to the transcript end, shifting the event time and corrupting the hazard and mutual-information output.
    Confidence: 0.87

Overall correctness: patch is incorrect
Overall confidence: 0.82

AGENTS.md: not found in the target repository.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 4f752b617a75.

Label changes

Label justifications:

  • P2: This is a normal-priority benchmark pipeline feature with a concrete correctness issue and limited blast radius.
  • merge-risk: 🚨 availability: Merging would make an unproven new analysis script part of the mandatory posterior pipeline, so failures could stop pipeline runs.
  • rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🦐 gold shrimp.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: No after-fix terminal output, logs, artifacts, screenshot, or recording shows the new pipeline running; the contributor should add redacted real output and update the PR body for automatic re-review, or ask a maintainer for @clawsweeper re-review if needed.
Evidence reviewed

What I checked:

  • Current main does not already contain the requested stage: Current main has no violation_time_decomposition script or violation metrics references, while the PR adds scripts/violation_time_decomposition.py and wires it into the pipeline. (scripts/violation_time_decomposition.py:1, 140c17c30175)
  • Mandatory pipeline integration: The PR inserts violation_time_decomposition.py into run_posterior_dynamics_pipeline.py before the combined report step, so a failure in the new script would fail the whole posterior pipeline. (scripts/run_posterior_dynamics_pipeline.py:91, 140c17c30175)
  • Violation localization mismatch: The added get_first_violation_turn only returns early for dangerous shell commands, but current trajectory evaluation records forbidden tool calls, configured forbidden shell patterns, and dangerous shell commands as forbidden_violations. (scripts/violation_time_decomposition.py:29, 140c17c30175)
  • Proof remains absent: The PR body says more runs and more samples are still needed, and the provided discussion contains no terminal output, logs, artifacts, screenshot, or recording showing the new pipeline running after the fix. (140c17c30175)
  • Feature history provenance: The central pipeline area traces to commits adding archive dynamics and spatio-temporal dynamics evaluation, with recent maintenance in the same files. (scripts/run_posterior_dynamics_pipeline.py:58, c209612d46b0)

Likely related people:

  • HaoLi111: The merged spatio-temporal dynamics evaluation work on current main is by Hao, and this PR continues that methodology surface. (role: feature history contributor; confidence: high; commits: 5c58e7beaaa5; files: scripts/run_posterior_dynamics_pipeline.py, clawbench/dynamics.py, clawbench/dynamics_archive.py)
  • pllm-uci: The current archive dynamics pipeline and several touched script paths were introduced in the archive dynamics pipeline commit. (role: archive pipeline introducer; confidence: high; commits: c209612d46b0; files: scripts/run_posterior_dynamics_pipeline.py, scripts/classify_regimes.py, clawbench/dynamics_archive.py)
  • scoootscooob: Recent dynamics diagnostics and plot-loading fixes touched the same analysis area, and the PR discussion includes a follow-up note about rerunning the experiment. (role: recent area contributor; confidence: high; commits: b6f07d9a8796, 11d943f21cd3; files: scripts/classify_regimes.py, clawbench/dynamics.py, clawbench/dynamics_archive.py)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR expands ClawBench’s evaluation/dynamics tooling by adding “perturbed” task variants, posterior reweighting + reporting scripts, and improving execution-check command rendering so templated values containing whitespace remain a single argv element.

Changes:

  • Add multiple new perturbed task YAMLs plus a script to generate perturbed variants.
  • Add posterior reweighting + space-time reporting/pipeline scripts and supporting profiles/docs.
  • Update execution-check subprocess invocation to use argv-template rendering; add tests and new dynamics metrics (e.g., Rényi proxy).

Reviewed changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
tests/test_trajectory.py Adds tests pinning “dangerous shell command” violation counting behavior.
tests/test_environment_files.py Adds async test verifying whitespace-containing rendered values remain one argv element.
tests/test_environment.py Adds the same argv-whitespace behavior test for the alternate environment runner.
tests/conftest.py Forces repo-root importability in pytest by inserting into sys.path.
tasks-public/tier3/t3-web-research-and-cite-perturbed.yaml Adds a new perturbed Tier 3 task definition.
tasks-public/tier3/t3-msg-inbox-triage-perturbed.yaml Adds a new perturbed Tier 3 task definition.
tasks-public/tier3/t3-feature-export-perturbed.yaml Adds a new perturbed Tier 3 task definition.
tasks-public/tier3/t3-data-sql-query-perturbed.yaml Adds a new perturbed Tier 3 task definition.
tasks-public/tier3/t3-data-pipeline-report-perturbed.yaml Adds a new perturbed Tier 3 task definition.
tasks-public/tier1/t1-fs-quick-note-perturbed.yaml Adds a new perturbed Tier 1 task definition.
tasks-public/tier1/t1-bugfix-discount-perturbed.yaml Adds a new perturbed Tier 1 task definition.
scripts/violation_time_decomposition.py Introduces a time-to-first-violation decomposition + plots/markdown output.
scripts/run_posterior_reweighting.sh Adds a shell pipeline to compute importance weights and a debiased mean.
scripts/run_posterior_dynamics_pipeline.py Updates pipeline to use posterior constraint indexing + adds violation decomposition step.
scripts/run_eval_pipeline.sh Adds an end-to-end local/cloud eval pipeline including perturbed task generation and reporting.
scripts/posterior/3_generate_space_time_report.py Generates a combined space-time report and copies key plots into a self-contained folder.
scripts/posterior/1_compute_posterior_weights.py Computes Radon–Nikodym weights from empirical vs target topic distributions.
scripts/generate_perturbed_tasks.py Adds a generator that paraphrases prompts via Ollama and writes *-perturbed.yaml files.
scripts/debiased_evaluation.py Adds Hajek/IPW aggregation of task scores.
scripts/compute_debiased_dynamics.py Adds IPW/Hajek debiasing over regimes and constraint index.
scripts/compute_constraint_index.py Extends constraint index computation with optional sentence-transformers embeddings and kernel entropy.
profiles/user_target_distribution.json Adds an example target distribution profile.
profiles/radon_nikodym_weights.json Adds example precomputed weights.
profiles/empirical_topic_distribution.json Adds an example empirical benchmark distribution profile.
docs/task_distribution_reweighting.md Documents stratified reweighting and its space-time fusion.
docs/semantic_spatiotemporal_dynamics.md Documents the combined semantic + temporal dynamics framework.
docs/long_term_dynamics.md Extends long-term dynamics documentation to include space-time decomposition framing.
clawbench/render.py Adds render_argv_template() using shlex.split() pre-render to preserve whitespace in substituted values.
clawbench/environment_files.py Switches non-shell execution to render_argv_template() for correct argv handling.
clawbench/environment.py Same argv-template switch for the gateway environment runner.
clawbench/dynamics_archive.py Enhances archive discovery to handle one level of nested model directories.
clawbench/dynamics.py Adds renyi_d2 metric computation to per-trajectory dynamics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +24 to +29
- message: "Thinking...\nThinking Process:\n\n1. **Analyze the Request:**\n \
\ * **Task:** Paraphrase the provided instruction.\n * **Constraint 1:**\
\ Keep the exact same semantic meaning and intent.\n * **Constraint 2:**\
\ Change the wording slightly.\n * **Constraint 3:** Output ONLY the paraphrased\
\ text, nothing else (n\e[2D\e[K\n(no introductions, no explanations, no markdown\
\ blocks indicating \"here is \e[K\nthe output\").\n\n2. **Analyze the Original\
Comment thread tests/test_environment.py
Comment on lines +168 to +189
@pytest.mark.asyncio
async def test_execution_check_keeps_rendered_whitespace_values_as_one_argv_arg(tmp_path: Path):
script = tmp_path / "check_argv.py"
script.write_text(
"import json, sys\n"
"print(json.dumps(sys.argv[1:]))\n",
encoding="utf-8",
)

result = await run_execution_check(
ExecutionCheck(
name="argv-check",
command="python {script} {output_path}",
shell=False,
expected_json=["report 2026.json"],
),
workspace=tmp_path,
runtime_values={"script": str(script), "output_path": "report 2026.json"},
)

assert result.passed is True
assert result.reason == "OK"
Comment thread tests/conftest.py Outdated

# Add the repository root to sys.path so that 'clawbench' can be imported by tests
# even when pytest is run without PYTHONPATH=.
sys.path.insert(0, str(Path(__file__).parent.parent))
dyn_json = dyn_dir / "dynamics.json"
if dyn_json.exists():
try:
dyn_data = json.load(open(dyn_json))
Comment thread scripts/generate_perturbed_tasks.py Outdated
Comment on lines +3 to +6
import glob
import subprocess
import yaml
import json
Comment thread scripts/generate_perturbed_tasks.py Outdated

# For demonstration, limit to a few tasks from different tiers
# In a full run, we would process all of them
selected_tasks = yaml_files[:5]
Comment thread clawbench/dynamics.py
@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal priority bug or improvement with limited blast radius. merge-risk: 🚨 availability 🚨 Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages. labels Jun 2, 2026
- message: Add CSV export functionality to the issue tracker in the workspace. Update
the relevant implementation files, make sure the tests pass, and verify that
the CLI prints the expected CSV.
- message: "Thinking...\nThinking Process:\n\n1. **Analyze the Request:**\n \
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a part of prompt for perturbation was leaked into task.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank for the review! will fix that and rerun experiment for this one.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check others too: they have the same issue (not all of them)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 availability 🚨 Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages. P2 Normal priority bug or improvement with limited blast radius. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants