
feat: Spatio-Temporal dynamics evaluation (quantify and decompose) #27

Merged
scoootscooob merged 2 commits into openclaw:main from HaoLi111:feature/spatio-temporal-dynamics
May 27, 2026

Conversation

@HaoLi111
Contributor

PR: Spatio-Temporal Dynamics — Quantifying and Decomposing Long-Running Agent Behavior

Branch: feature/spatio-temporal-dynamics
Base: main


Please note that some results from the Ollama test do not make sense, owing to poor model performance and limited local computing power. Some metrics are implemented but perhaps raise more questions than they answer. Comments welcome.

Why This Matters

The Fundamental Gap in LLM Agent Evaluation

Current benchmarks answer a single question: "Can the model solve this task?"
They reduce a multi-step, iterative reasoning trajectory into a binary pass/fail or a scalar score. This is sufficient for single-turn inference, but critically insufficient for long-running agents — systems that autonomously plan, execute, observe, and revise over dozens of tool-calling turns.

When an agent runs for 30+ turns, failure modes emerge that no single-pass benchmark can detect:

  • Goal drift: The agent gradually shifts away from the original objective, producing work that is internally consistent but fundamentally off-target.
  • Hallucination loops: The agent enters a self-reinforcing cycle where it conditions on its own erroneous output, amplifying mistakes with each iteration.
  • Chaotic sensitivity: Semantically identical prompts with minor lexical variation cause completely divergent trajectories — a Lyapunov-unstable system masquerading as a reliable tool.
  • Regime collapse: An agent that appears to "explore" is actually trapped in a degenerate attractor basin, repeating the same ineffective action patterns.

None of these failure modes appear in a pass/fail score. An agent that scores 70% but enters destructive loops on 30% of tasks is operationally more dangerous than one that scores 60% but fails gracefully. Standard benchmarks cannot distinguish between the two.

Why Decomposition Is Necessary

Raw temporal metrics (entropy, drift, attractor geometry) treat all benchmark tasks equally. But benchmark datasets are not representative of real-world usage — they over-represent certain capability strata (e.g., mathematics) and under-represent others (e.g., multi-file code refactoring). Reporting unweighted dynamics metrics inherits this bias: a benchmark dominated by tightly-constrained tasks will make any model look artificially stable.

Spatio-temporal decomposition solves this by factoring the analysis into two orthogonal axes:

  1. Spatial: What is the semantic distribution of tasks, and how does it differ from real deployment?
  2. Temporal: How does the agent's trajectory evolve over time within each task?

These are then fused via importance-weighted estimators to produce debiased dynamics metrics that reflect what a real user would actually experience.


The Three-Stage Framework

Stage 1 — Spatial Reweighting (scripts/posterior/1_compute_posterior_weights.py)

We stratify evaluation tasks into semantic clusters using NLU embeddings and compute Radon-Nikodym importance weights:

$$\rho_{k_i} = \frac{P(C_{k_i})}{Q(C_{k_i})}$$

where $Q$ is the benchmark distribution and $P$ is the target user distribution. Over-represented task types are suppressed; under-represented but operationally critical types are amplified.
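As a minimal sketch of the weight formula above: the cluster labels and the target distribution below are hypothetical, the clustering step itself is not shown, and `cluster_weights` is an illustrative name rather than a function from the PR's scripts.

```python
from collections import Counter


def cluster_weights(benchmark_labels, target_probs):
    """Return rho_k = P(C_k) / Q(C_k) for every cluster present in the benchmark."""
    counts = Counter(benchmark_labels)
    n = len(benchmark_labels)
    weights = {}
    for cluster, count in counts.items():
        q = count / n                       # empirical benchmark share Q(C_k)
        p = target_probs.get(cluster, 0.0)  # assumed target share P(C_k)
        weights[cluster] = p / q
    return weights


# Hypothetical benchmark: 8 math tasks and 2 refactoring tasks,
# with an assumed 50/50 split in real usage.
labels = ["math"] * 8 + ["refactor"] * 2
rho = cluster_weights(labels, {"math": 0.5, "refactor": 0.5})
print(rho)  # math is down-weighted (0.625), refactor is up-weighted (2.5)
```

Clusters that appear in the target distribution but never in the benchmark receive no weight here, because there are no benchmark tasks to reweight; that coverage gap would need to be reported separately.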

Stage 2 — Temporal Dynamics (scripts/posterior/2_compute_constraint_index.py)

We treat the agent's multi-turn transcript as a discrete dynamical system and extract:

| Metric | What It Measures |
| --- | --- |
| Participation Ratio (PR) | Effective dimensionality of the trajectory covariance: how many independent behavioral modes the agent explores |
| Von Neumann Kernel Entropy | Continuous entropy via regularized RBF kernel density matrix; robust for high-dimensional embeddings where $N \ll D$ |
| BOPS | Inter-run cosine predictability: do repeated evaluations produce consistent trajectories? |
| Constraint Index $C(q)$ | Composite measure: high $C(q)$ = tight attractor basin (predictable), low $C(q)$ = diffusive/chaotic |
| Lyapunov Proxy $\hat{\lambda}$ | Perturbation sensitivity: do lexically different but semantically identical prompts cause trajectory divergence? |
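Two of the metrics above have standard textbook forms that can be sketched with NumPy. This is an illustration under assumptions, not the code in `2_compute_constraint_index.py`: the function names, the RBF bandwidth `gamma`, and the `ridge` regularizer are hypothetical choices, and each row of `X` is assumed to be the embedding of one turn.

```python
import numpy as np


def participation_ratio(X):
    """PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues) of the covariance."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0, keepdims=True)
    # The Gram matrix shares its nonzero eigenvalues with the covariance (up to
    # a scale factor that cancels in the ratio) and stays small when N << D.
    eig = np.clip(np.linalg.eigvalsh(Xc @ Xc.T), 0.0, None)
    total = eig.sum()
    if total == 0.0:
        return 0.0  # degenerate trajectory: every turn has the same embedding
    return float(total**2 / (eig**2).sum())


def von_neumann_kernel_entropy(X, gamma=1.0, ridge=1e-6):
    """Entropy of the trace-normalized, ridge-regularized RBF kernel matrix."""
    X = np.asarray(X, dtype=float)
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * sq_dist) + ridge * np.eye(len(X))
    density = K / np.trace(K)  # unit trace, like a density matrix
    p = np.clip(np.linalg.eigvalsh(density), 1e-12, None)
    p = p / p.sum()
    return float(-(p * np.log(p)).sum())
```

A trajectory that moves along a single direction gives a participation ratio close to 1, while turns that are all far apart in embedding space push the kernel entropy toward its maximum of log N.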

The embedding backend is configurable:

  • bag-of-words (default): Zero external dependencies, backwards-compatible with existing infrastructure.
  • all-MiniLM-L6-v2 (or any HuggingFace model): Dense semantic embeddings via sentence-transformers for higher-fidelity spatial analysis.

Stage 3 — Spatio-Temporal Fusion (scripts/posterior/3_generate_space_time_report.py)

We apply Hajek importance-weighted estimators to project temporal metrics onto the user-aligned semantic manifold:

$$\mathbb{E}_P[D] \approx \frac{\sum_{i=1}^N \rho_{k_i} D_i}{\sum_{i=1}^N \rho_{k_i}}$$

This produces:

  • Debiased regime probabilities: The true probability a deployed user encounters a chaotic/looping trajectory.
  • Weighted survival curves: Corrected Kaplan-Meier time-to-failure estimates.
  • Expected operational stability: The constraint index and sensitivity a real user would experience.
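The Hajek estimator above is a self-normalized weighted mean, which can be sketched in a few lines. The per-task values and weights below are made up for illustration, and `hajek_estimate` is a hypothetical name.

```python
def hajek_estimate(values, weights):
    """Self-normalized importance-weighted mean: sum(rho_i * D_i) / sum(rho_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * d for w, d in zip(weights, values)) / total_weight


# Hypothetical indicator D_i: 1 if the run for task i was classified as looping.
looped = [1, 0, 0, 0, 1]
# Hypothetical weights: four over-represented tasks, one under-represented task.
rho = [0.625, 0.625, 0.625, 0.625, 2.5]

unweighted = sum(looped) / len(looped)
debiased = hajek_estimate(looped, rho)
print(unweighted, debiased)  # 0.4 0.625
```

Because the denominator is the sum of the weights rather than N, the estimate stays within the range of the observed values even when the weights do not average to 1.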

Why This Requires Significantly More Compute

Spatio-temporal dynamics is fundamentally a high-compute evaluation methodology. This is not a limitation — it is inherent to the problem:

| What Standard Benchmarks Need | What Dynamics Analysis Needs | Why |
| --- | --- | --- |
| 1 run per task | ≥3 runs per task | Inter-run variance for BOPS, PR, and statistical confidence |
| 1 turn per run | 10-50+ turns per run | Trajectory geometry, regime classification, survival curves |
| Original prompts only | Original + perturbed variants | Lyapunov sensitivity estimation |
| Scalar score output | Full transcript archiving | Posterior embedding, constraint index computation |

A production-scale evaluation with 2 frontier models × 50 multi-turn tasks × 3 runs × 30 average turns requires approximately 9,000 agent turns. This is orders of magnitude more compute than a standard single-pass benchmark — but it is the minimum required to make rigorous claims about the operational stability of agents deployed in autonomous, long-horizon settings.
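The turn budget quoted above is a straight product, which a short helper makes easy to re-run with other settings. The `prompt_variants` parameter is an assumption added for illustration: the 9,000 figure counts original prompts only, and how many perturbed variants a real evaluation would add is not stated here.

```python
def agent_turns(models, tasks, runs_per_task, avg_turns, prompt_variants=1):
    """Total agent turns for an evaluation sweep."""
    return models * tasks * runs_per_task * avg_turns * prompt_variants


# Figures from the paragraph above.
dynamics = agent_turns(models=2, tasks=50, runs_per_task=3, avg_turns=30)

# Single-pass comparison: 1 run per task and 1 turn per run.
single_pass = agent_turns(models=2, tasks=50, runs_per_task=1, avg_turns=1)

# Hypothetical: every task also gets one perturbed prompt variant.
with_perturbed = agent_turns(2, 50, 3, 30, prompt_variants=2)

print(dynamics, single_pass, dynamics // single_pass, with_perturbed)
# 9000 100 90 18000
```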

The local dev pipeline (--local flag) validates the infrastructure end-to-end using small Ollama models on Tier 1 tasks. These runs complete in a single turn, producing degenerate (flat) trajectory plots — which the report generator automatically detects and documents. The full dynamics emerge only when frontier models engage in genuine multi-step agentic reasoning on complex tasks.


What This Enables for Rigorous Agent Testing

1. Systematic Safety Characterization

Instead of asking "does it work?", researchers can now ask: "When it fails, what kind of failure is it?" — distinguishing between graceful degradation (low entropy, clean exit) and catastrophic drift (high entropy, unbounded expansion).

2. Prompt Robustness Certification

The Lyapunov proxy $\hat{\lambda}$ directly quantifies whether an agent's behavior is stable under prompt perturbation. A positive $\hat{\lambda}$ flags tasks where minor rewording causes exponentially diverging behavior — exactly the fragility that matters for production deployment.
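The PR description does not spell out how $\hat{\lambda}$ is computed, so the following is only one common finite-time proxy, stated as an assumption: take the per-turn distance between the embeddings of the original and perturbed trajectories and fit the slope of its logarithm against the turn index. The function name and the `eps` guard are hypothetical.

```python
import math


def lyapunov_proxy(base, perturbed, eps=1e-12):
    """Least-squares slope of log ||x_t - x'_t|| against the turn index t."""
    n = min(len(base), len(perturbed))
    if n < 2:
        raise ValueError("need at least two aligned turns")
    logs = []
    for a, b in zip(base[:n], perturbed[:n]):
        dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        logs.append(math.log(dist + eps))  # eps guards against log(0)
    t_mean = (n - 1) / 2
    y_mean = sum(logs) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(logs))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


# Synthetic trajectories: the gap doubles every turn, so the slope is ln 2.
base = [[0.0] for _ in range(5)]
diverging = [[0.01 * 2**t] for t in range(5)]
converging = [[0.5**t] for t in range(5)]
print(lyapunov_proxy(base, diverging) > 0, lyapunov_proxy(base, converging) < 0)
```

With real transcripts the two runs rarely have the same number of turns; truncating to the shorter one, as done here, is the simplest alignment and discards the tail of the longer run.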

3. Debiased Operational Metrics

By fusing dynamics with user-distribution weights, teams can report metrics like: "Under real-world usage patterns, there is a 12% probability of encountering a hallucination loop" — rather than the benchmark-biased "20% of benchmark tasks trigger loops".

4. Regression Detection Over Time

As models are fine-tuned or updated, the constraint index provides a compact, interpretable signal: if $C(q)$ drops for a task category, the model's trajectory stability has regressed — even if its pass/fail score remains unchanged.


Backwards Compatibility

This PR is fully backwards-compatible:

  • Default behavior unchanged. Running without --local uses production cloud models and bag-of-words embeddings. No new dependencies required.
  • --local flag is opt-in. Switches to Ollama models and dense semantic embeddings for local dev validation.
  • torch and sentence-transformers are lazy-imported. The default bag-of-words code path has zero GPU dependencies.
  • No existing CLI commands, APIs, or scripts are modified. All new functionality lives in scripts/posterior/ and docs/.
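The lazy-import point in the list above can be sketched as a backend selector that only imports `sentence_transformers` when the dense path is requested. The names `bow_embed` and `get_embedder` are hypothetical and the bag-of-words tokenizer is deliberately naive; the PR's own backend may differ.

```python
from collections import Counter


def bow_embed(texts):
    """Naive bag-of-words counts over a whitespace-tokenized, lowercased vocabulary."""
    tokenized = [Counter(text.lower().split()) for text in texts]
    vocab = sorted(set().union(*tokenized))
    return [[counts[word] for word in vocab] for counts in tokenized]


def get_embedder(backend="bow"):
    """Return a callable mapping a list of strings to a list of vectors."""
    if backend == "bow":
        return bow_embed
    # Imported here, not at module top, so the default path works without
    # torch or sentence-transformers installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(backend)  # e.g. "all-MiniLM-L6-v2"
    return lambda texts: model.encode(list(texts)).tolist()


embed = get_embedder()
print(embed(["run the tests", "run run"]))  # [[1, 1, 1], [2, 0, 0]]
```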

Files Changed

New: Analysis Pipeline (scripts/posterior/)

  • 1_compute_posterior_weights.py — Radon-Nikodym importance weights
  • 2_compute_constraint_index.py — Constraint index with dual BoW/Transformer backends
  • 3_generate_space_time_report.py — Self-contained report with auto-linked plots

New: Documentation (docs/)

  • long_term_dynamics.md — Temporal trajectory analysis methodology
  • task_distribution_reweighting.md — Spatial reweighting theory
  • semantic_spatiotemporal_dynamics.md — Unified fusion framework

New: Pipeline Orchestration

  • scripts/run_eval_pipeline.sh — End-to-end script with --local dev mode
  • scripts/generate_perturbed_tasks.py — Prompt perturbation generator
  • scripts/run_posterior_reweighting.sh — Standalone spatial reweighting

New: Task Variants & Profiles

  • tasks-public/tier{1,3}/*-perturbed.yaml — 7 perturbed task definitions
  • profiles/*.json — Distribution schemas and precomputed weights

Modified

  • clawbench/dynamics_archive.py — Fixed recursive traversal for fingerprint-nested cache
  • clawbench/dynamics.py — Minor integration hook

How to Test

# Full local dev pipeline (small models, ~3 min)
conda run -n clawbench --no-capture-output bash scripts/run_eval_pipeline.sh --local

# Inspect output
cat results/space_time_report/EVAL_REPORT_SPACE_TIME.md
ls results/space_time_report/plots/

# Verify backwards compat (no torch needed)
python scripts/posterior/2_compute_constraint_index.py --help

@HaoLi111 HaoLi111 requested a review from a team as a code owner May 19, 2026 05:24
@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal priority bug or improvement with limited blast radius. merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. labels May 21, 2026
@clawsweeper

clawsweeper Bot commented May 21, 2026

Codex review: needs real behavior proof before merge. Reviewed May 26, 2026, 5:36 PM ET / 21:36 UTC.

Summary
Adds a spatio-temporal dynamics and reweighting evaluation pipeline with new posterior scripts, docs, profile JSONs, perturbed public tasks, and small dynamics/archive metric changes.

Reproducibility: yes, for the actionable patch blockers: the problematic cache deletion, token override, script rename, and missing mock-results input are all visible in the PR source. Runtime success of the full pipeline is not reproduced because no real behavior proof was provided and this review stayed read-only.

Review metrics: 2 noteworthy metrics.

  • Patch surface: 25 files changed, +1686/-56. The PR spans scripts, docs, profiles, tasks, and runtime dynamics code, so compatibility and proof matter before merge.
  • Existing script moved: 1 script renamed. Moving scripts/compute_constraint_index.py changes a user-facing maintenance entrypoint unless the old path is preserved.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🧂 unranked krab
Result: blocked until real behavior proof is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Add redacted terminal/log proof showing the repaired pipeline running in a real local or cloud setup after the latest commit.
  • Preserve the old constraint-index script path and avoid destructive cache cleanup by default.
  • Respect existing gateway tokens and make standalone scripts consume generated or user-supplied inputs.

Proof guidance:
Needs real behavior proof before merge: No after-fix terminal output, logs, screenshots, recordings, or linked artifacts are present; the contributor should add redacted proof and update the PR body to trigger re-review.

Risk before merge

  • Merging as-is can delete a user's default run cache when they run the new pipeline from a checkout.
  • Merging as-is can break existing direct users of scripts/compute_constraint_index.py unless a compatibility wrapper remains.
  • The end-to-end script can run with the wrong gateway token because it overwrites the operator's OPENCLAW_GATEWAY_TOKEN.
  • No contributor-provided real behavior proof shows that the repaired pipeline works on an actual local or cloud run.

Maintainer options:

  1. Repair compatibility before merge (recommended)
    Keep the old constraint-index entrypoint, isolate or opt into cache clearing, respect existing gateway tokens, and require real run proof before another merge review.
  2. Accept an intentional breaking experiment
    Maintainers could explicitly accept this as an experimental workflow, but should say that existing script paths, cache contents, and gateway-token assumptions may change.
  3. Pause if this is too broad
    If the methodology is not ready for core, close or pause this branch and ask for a narrower PR around one validated dynamics metric or script.

Next step before merge
Human handling is needed because contributor proof is absent and maintainers need to decide whether this experimental public-task/methodology surface belongs in core after the compatibility repairs.

Security
Cleared: No concrete security or supply-chain issue was found; the blockers are compatibility, runtime reliability, and missing proof rather than new dependency or secret-handling risk.

Review findings

  • [P1] Keep the pipeline from deleting the shared run cache — scripts/run_eval_pipeline.sh:33
  • [P1] Preserve the existing constraint-index script path — scripts/run_posterior_dynamics_pipeline.py:80
  • [P1] Do not clobber the user's gateway token — scripts/run_eval_pipeline.sh:40
Review details

Best possible solution:

Land this only after preserving existing entrypoints/cache/token behavior, making the scripts reproducible with real inputs, and providing redacted end-to-end proof for a local run.

Do we have a high-confidence way to reproduce the issue?

Yes for the actionable patch blockers: the problematic cache deletion, token override, script rename, and missing mock-results input are all visible in the PR source. Runtime success of the full pipeline is not reproduced because no real behavior proof was provided and this review stayed read-only.

Is this the best way to solve the issue?

No. The proposed direction may be useful, but the current branch is not the best merge shape until it keeps backward-compatible script paths, avoids destructive defaults, respects user credentials, and proves the new workflow on a real run.

Full review comments:

  • [P1] Keep the pipeline from deleting the shared run cache — scripts/run_eval_pipeline.sh:33
    This removes the entire default .clawbench/run_cache before every run. That path is the documented cache/archive for offline posterior analysis, so users running the new script from a checkout can lose unrelated cached runs and prior expensive evaluations. Use a PR-specific subdirectory or require an explicit clean flag instead.
    Confidence: 0.96
  • [P1] Preserve the existing constraint-index script path — scripts/run_posterior_dynamics_pipeline.py:80
    The patch renames the established scripts/compute_constraint_index.py entrypoint and only updates this driver to the new numbered location. README and downstream messaging still refer to the old script name, so existing direct invocations break after upgrade; leave a wrapper or keep the old path while adding the posterior alias.
    Confidence: 0.92
  • [P1] Do not clobber the user's gateway token — scripts/run_eval_pipeline.sh:40
    This unconditional export overwrites any OPENCLAW_GATEWAY_TOKEN the operator set for their gateway. In the default cloud mode the new pipeline can authenticate with the wrong token even when the environment was configured correctly; default from the existing environment or require the token instead.
    Confidence: 0.9
  • [P2] Make the reweighting script consume a real results file — scripts/run_posterior_reweighting.sh:20
    The standalone reweighting script is wired to results/mock_execution_results.json, but the PR does not add that file or generate it before invoking scripts/debiased_evaluation.py. Running the advertised script fails immediately unless the user invents that input; accept a results path or feed the pipeline's actual output.
    Confidence: 0.88
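For the entrypoint finding above, one possible shape for a compatibility wrapper is a small forwarding helper built on the standard-library `runpy` module. This is a sketch under assumptions, not code from the PR: the helper name `forward` and the deprecation message are hypothetical.

```python
import runpy
import sys
from pathlib import Path


def forward(old_name, new_path, argv):
    """Run the relocated script as __main__, passing the caller's arguments through."""
    new_path = Path(new_path)
    if not new_path.is_file():
        raise FileNotFoundError(f"relocated script not found: {new_path}")
    print(f"{old_name} has moved to {new_path}; forwarding.", file=sys.stderr)
    saved_argv = sys.argv
    sys.argv = [str(new_path), *argv]
    try:
        return runpy.run_path(str(new_path), run_name="__main__")
    finally:
        sys.argv = saved_argv  # leave the caller's argv untouched


# A wrapper kept at the old path could then consist of a single call such as:
# forward(
#     "scripts/compute_constraint_index.py",
#     Path(__file__).parent / "posterior" / "2_compute_constraint_index.py",
#     sys.argv[1:],
# )
```

Existing direct invocations of the old path would keep working, and the stderr notice gives users time to migrate before the wrapper is removed.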

Overall correctness: patch is incorrect
Overall confidence: 0.93

AGENTS.md: not found in the target repository.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 0f1b45e4674b.

Label changes

Label changes:

  • add merge-risk: 🚨 auth-provider: The new pipeline unconditionally overrides OPENCLAW_GATEWAY_TOKEN, which can break existing authenticated gateway setups.

Label justifications:

  • P2: This is a normal-priority feature PR with clear merge blockers but limited blast radius while unmerged.
  • merge-risk: 🚨 compatibility: The patch moves an existing script entrypoint and deletes the shared default run cache from its new pipeline script.
  • merge-risk: 🚨 auth-provider: The new pipeline unconditionally overrides OPENCLAW_GATEWAY_TOKEN, which can break existing authenticated gateway setups.
  • rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🧂 unranked krab.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: No after-fix terminal output, logs, screenshots, recordings, or linked artifacts are present; the contributor should add redacted proof and update the PR body to trigger re-review.
Evidence reviewed

What I checked:

  • Target policy check: No AGENTS.md was found inside the target openclaw/clawbench checkout; the only AGENTS.md found was in the parent ClawSweeper workspace, so no target-specific AGENTS policy applied.
  • Patch surface: The PR changes 25 files with 1686 additions and 56 deletions, including scripts, docs, profiles, task YAMLs, and dynamics code. (2050164e7ccc)
  • Shared cache deletion: The new end-to-end pipeline removes the default .clawbench/run_cache directory before running, which can delete unrelated cached evaluations documented for offline posterior analysis. (scripts/run_eval_pipeline.sh:33, 2050164e7ccc)
  • Gateway token override: The new pipeline unconditionally exports OPENCLAW_GATEWAY_TOKEN to a hard-coded value, overriding operator-provided gateway credentials. (scripts/run_eval_pipeline.sh:40, 2050164e7ccc)
  • Existing entrypoint moved: The diff renames scripts/compute_constraint_index.py into scripts/posterior/2_compute_constraint_index.py while current main still documents the old script in the repository tree and related messaging. (scripts/run_posterior_dynamics_pipeline.py:80, 2050164e7ccc)
  • Standalone reweighting input missing: The advertised reweighting script points to results/mock_execution_results.json, but the PR does not add or generate that file before calling scripts/debiased_evaluation.py. (scripts/run_posterior_reweighting.sh:20, 2050164e7ccc)

Likely related people:

  • pllm-uci: Introduced the current archive dynamics pipeline, run_posterior_dynamics_pipeline.py, and most current compute_constraint_index.py behavior that this PR modifies and moves. (role: feature-history owner; confidence: high; commits: c209612d46b0; files: scripts/compute_constraint_index.py, scripts/run_posterior_dynamics_pipeline.py, clawbench/dynamics_archive.py)
  • scoootscooob: Introduced earlier dynamical-systems diagnostics and recently touched dynamics archive behavior adjacent to the PR's modified code paths. (role: recent area contributor; confidence: medium; commits: b6f07d9a8796, 11d943f21cd3; files: scripts/compute_constraint_index.py, clawbench/dynamics_archive.py)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper

clawsweeper Bot commented May 21, 2026

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@clawsweeper clawsweeper Bot added the merge-risk: 🚨 auth-provider 🚨 Merging this PR could break OAuth, tokens, provider routing, model choice, or credentials. label May 26, 2026
@scoootscooob scoootscooob merged commit 5c58e7b into openclaw:main May 27, 2026
6 checks passed

Labels

merge-risk: 🚨 auth-provider 🚨 Merging this PR could break OAuth, tokens, provider routing, model choice, or credentials. merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. P2 Normal priority bug or improvement with limited blast radius. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.


2 participants