Codex/kova scenarios and fixes by JuanHuaXu · Pull Request #2 · openclaw/Kova

JuanHuaXu · 2026-05-21T12:38:49Z

Fresh Evidence

Latest Kova head: 24f5b9f (codex/kova-scenarios-and-fixes, single squashed PR commit on base 7052527).

Current-Head Proof

Reviewer proof refreshed on current PR head 24f5b9f after the conflict-resolution squash and cron fail-closed amendment. Commands were run from this PR branch with disposable Kova homes under checkouts/kova-pr2-proof-24f5b9f-*; artifact paths below are repo-relative and do not include private local paths.

Validation:

node bin/kova.mjs self-check -> 182/182 passed, generated 2026-05-28 06:00 UTC.
node tests/render-snapshots.mjs -> 21 pass, 0 fail, 0 written.
Conflict marker scan: rg -n "<<<<<<<|=======|>>>>>>>" . -g '!checkouts/**' -g '!node_modules/**' -g '!reports/**' -> no matches.
Cleanup before and after targeted runs: ocm env list -> No environments.
Branch shape: git rev-list --count upstream/main..HEAD -> 1 commit.

Targeted real-run proof:

| Scenario | Run id | Status | Evidence |
| --- | --- | --- | --- |
| `cron-runtime` / `cron-user` | `kova-260528-055817-79a0be` | PASS | `cronRuntimeEvidence.available: true`; `cronRegisterMs: 760`; `cronRunMs: 707`; `cronRunCompleted: true`; `cronTriggerAttributed: true`; no violations. |
| `mcp-tool-call` / `mcp-tool-user` | `kova-260528-055843-76d8f2` | PASS | `mcpToolCallEvidence.available: true`; `toolsCallMs: 124`; `safeToolName: conversations_list`; `safeToolSucceeded: true`; `invalidToolErrorAttributed: true`; `processExited: true`; no violations. |
| `exec-tool-safety` / `exec-tool-user` | `kova-260528-055902-9617b8` | FAIL | Kova evidence was present and scoped: `safeCommandSucceeded: true`, `dangerousCommandBlocked: true`, `dangerousPayloadExecuted: false`, `outputTruncated: true`, `timeoutMs: 3222`, `processLeaks: 0`. Remaining failure is an OpenClaw resource signal: `tool-runtime peak RSS 760.3 MB exceeded threshold 500 MB`. |

Artifacts on runner:

checkouts/kova-pr2-proof-24f5b9f-cron/reports/kova-260528-055817-79a0be.json
checkouts/kova-pr2-proof-24f5b9f-mcp/reports/kova-260528-055843-76d8f2.json
checkouts/kova-pr2-proof-24f5b9f-exec/reports/kova-260528-055902-9617b8.json

This directly addresses ClawSweeper's stale-proof blocker: current-head proof now matches the branch state maintainers would merge. The only targeted runtime failure remains the known OpenClaw/product RSS threshold signal; cron/MCP/exec helper evidence itself is present, parsed, and gated.

Current-Head Exec Tool Evidence Proof

Reviewer P2 fix proof was run at Kova head 1546804 (Drive exec containment checks through OpenClaw). This fixes the false-evidence lane by using the OpenClaw exec tool schema's required command argument and by driving safe, blocked, oversized-output, and timeout cases through openclaw agent plus mock-provider tool-result evidence.

Commands:

KOVA_HOME=checkouts/p2-exec-real-proof2-kova-home \
  node bin/kova.mjs run \
    --target runtime:stable \
    --scenario exec-tool-safety \
    --state exec-tool-user \
    --execute \
    --report-dir checkouts/p2-exec-real-proof2/reports \
    --json

KOVA_HOME=checkouts/p2-exec-real-proof2-kova-home \
  node bin/kova.mjs run \
    --target runtime:stable \
    --scenario tool-failure-containment \
    --state exec-tool-user \
    --execute \
    --report-dir checkouts/p2-exec-real-proof2/reports \
    --json

Results:

exec-tool-safety: kova-260522-170514-192a6a, proof complete 10/10; exec evidence available; safeCommandSucceeded: true; dangerousCommandBlocked: true; dangerousPayloadExecuted: false; outputTruncated: true; timeoutMs: 3499; processLeaks: 0.
tool-failure-containment: kova-260522-170551-6419fb, proof complete 10/10; exec evidence available; dangerousCommandBlocked: true; dangerousPayloadExecuted: false; outputTruncated: true; timeoutMs: 3291; processLeaks: 0.

Both runs still report FAIL, but only for the existing OpenClaw resource threshold: tool-runtime RSS around 750 MB over the 500 MB threshold. The previous Kova evidence failures are gone.
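
To illustrate why a run can show complete exec evidence and still end in FAIL, here is a minimal sketch of a peak-RSS threshold check. The function names, the sample list, and the verdict rule are assumptions made for illustration; they are not taken from src/evaluator.mjs.

```javascript
// Illustrative only: checkPeakRss, scenarioVerdict, and the argument shapes are
// assumptions, not Kova's evaluator API.
function checkPeakRss(role, samplesMb, thresholdMb) {
  if (samplesMb.length === 0) {
    throw new Error(`no RSS samples recorded for role ${role}`);
  }
  const peakMb = Math.max(...samplesMb);
  if (peakMb <= thresholdMb) return null; // within budget, no violation
  return `${role} peak RSS ${peakMb.toFixed(1)} MB exceeded threshold ${thresholdMb} MB`;
}

// Assumed rule: any violation fails the scenario, even when every evidence check passed.
function scenarioVerdict(violations) {
  return violations.length === 0 ? "PASS" : "FAIL";
}

// Sample values are made up, except the 760.3 MB peak quoted in the latest exec report.
const rssViolation = checkPeakRss("tool-runtime", [412.0, 698.2, 760.3], 500);
const rssVerdict = scenarioVerdict(rssViolation ? [rssViolation] : []);
console.log(rssVerdict, rssViolation);
```

Under these assumptions the evidence fields and the resource threshold are independent inputs, which is why the exec helper checks can all hold while the scenario verdict stays FAIL.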

Validation for the patch:

node bin/kova.mjs self-check --json -> ok: true (173 checks)
node tests/render-snapshots.mjs -> 18 pass
git diff --check -> pass
Cleanup: ocm env list -> no environments

Artifacts on runner:

checkouts/p2-exec-real-proof2/reports/kova-260522-170514-192a6a.json
checkouts/p2-exec-real-proof2/reports/kova-260522-170551-6419fb.json

Current-Head Tool-Failure Proof

Reviewer P2 fix proof was run at Kova head 2439672 (Fix failure-only exec mock provider flow).

Command:

KOVA_HOME=checkouts/p2-tool-failure-proof-kova-home \
  node bin/kova.mjs run \
    --target runtime:stable \
    --scenario tool-failure-containment \
    --state exec-tool-user \
    --execute \
    --report-dir checkouts/p2-tool-failure-proof/reports \
    --json

Result: kova-260522-164448-f2fbb2 completed with proof completeness 10/10 required obligations, cleanup destroyed the disposable env, and the fixed failure-only provider path was exercised:

Mock script: kova-exec-tool-failure-only
First provider step: kova-exec-tool-failure-only-dangerous-tool-call
Matched provider step: kova-exec-tool-failure-only-dangerous-tool-call
Provider emitted one exec tool call
dangerousCommandBlocked: true
dangerousPayloadExecuted: false
dangerousSentinelStillPresent: true
outputTruncated: true, timeoutObserved: true, processLeaks: 0

The scenario verdict is still FAIL, but for an OpenClaw resource threshold, not the Kova wiring bug: tool-runtime peak RSS 741.9 MB exceeded threshold 500 MB. This is expected to remain a product/resource signal for maintainers.

Artifacts on runner:

Markdown: checkouts/p2-tool-failure-proof/reports/kova-260522-164448-f2fbb2.md
JSON: checkouts/p2-tool-failure-proof/reports/kova-260522-164448-f2fbb2.json
Tool artifact: checkouts/p2-tool-failure-proof-kova-home/artifacts/kova-260522-164448-f2fbb2/kova-tool-failure-containmen-81131346-kova-260522-164448-f2fbb2/tool-failure-containment.json

Requested OpenClaw target: v2026.5.21-beta.1. The Git tag exists and resolves to 89a17def chore(release): prepare 2026.5.21-beta.1, but openclaw@2026.5.21-beta.1 is not currently published in npm/OCM release discovery. A direct npm:2026.5.21-beta.1 exhaustive run was therefore blocked at provisioning with the error: OpenClaw release version "2026.5.21-beta.1" was not found.

Fresh release-shaped tag evidence was run from a disposable openclaw/openclaw.git checkout at that tag using local-build:checkouts/openclaw-v2026.5.21-beta.1. After pnpm install --frozen-lockfile, direct pnpm pack succeeded and produced openclaw-2026.5.21-beta.1.tgz.

Matrix command:

KOVA_HOME=checkouts/kova-pr2-evidence-v2026-5-21-beta-1-local-build-rerun \
  node bin/kova.mjs matrix run \
    --profile exhaustive \
    --target local-build:checkouts/openclaw-v2026.5.21-beta.1 \
    --source-env kova-pr2-source-v2026-5-21-beta-1 \
    --execute \
    --allow-exhaustive \
    --json \
    --report-dir checkouts/kova-pr2-evidence-v2026-5-21-beta-1-local-build-rerun/reports

Result: kova-260522-145534-037d84 -> 77 total · 32 PASS · 45 FAIL · 0 BLOCKED.

Evidence artifacts on runner:

Markdown: checkouts/kova-pr2-evidence-v2026-5-21-beta-1-local-build-rerun/reports/kova-260522-145534-037d84-exhaustive.md
JSON: checkouts/kova-pr2-evidence-v2026-5-21-beta-1-local-build-rerun/reports/kova-260522-145534-037d84-exhaustive.json
Bundle: checkouts/kova-pr2-evidence-v2026-5-21-beta-1-local-build-rerun/reports/kova-260522-145534-037d84-bundle.tar.gz
Bundle SHA256: 743ee926c809921c0b6aea0170c1731e8576749ab88ee8126cafebfb210c73ae
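
A reviewer with access to the runner could check the bundle against the digest above with a short Node script. This is a sketch, not part of the PR: the helper names are hypothetical, and the dynamic imports are used only so the snippet loads as either an ES module or a CommonJS script.

```javascript
// Hypothetical helpers for checking a report bundle against a published SHA-256 digest.
async function sha256OfFile(filePath) {
  const { createHash } = await import("node:crypto");
  const { createReadStream } = await import("node:fs");
  const hash = createHash("sha256");
  // Stream the file so large bundles are not read into memory at once.
  for await (const chunk of createReadStream(filePath)) {
    hash.update(chunk);
  }
  return hash.digest("hex");
}

async function verifyBundle(filePath, expectedHex) {
  const actual = await sha256OfFile(filePath);
  return { ok: actual === expectedHex.trim().toLowerCase(), actual };
}
```

Calling `verifyBundle()` with the bundle path and the Bundle SHA256 value listed above should resolve to `ok: true` if the artifact is unchanged.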

Notable PASS coverage: release-runtime-startup, channel-discord-capability-conformance, upgrade-existing-user (both source states), bundled-runtime-deps (both states), plugin-lifecycle (all states), official/bad/missing/unsafe plugin lanes, provider-models, agent-cold-warm-message, dashboard-readiness, tui-responsiveness, mcp-runtime-start-stop, agent-network-offline, failure-injection, and cross-platform-smoke.

Top remaining failures are product/resource signals in OpenClaw/tag behavior rather than Kova unsupported-mode blockers: gateway RSS around the 700 MB threshold across agent/provider/HTTP/TUI surfaces, rolling-upgrade package/runtime RSS/CPU, dirty-plugin doctor-cli RSS, tool-runtime RSS for exec/tool containment, soak/workspace latency, and a few functional/liveness failures (channel generated-image handoff, Telegram timeout signals, cron/browser/media gateway restarting, MCP tool-call missing runtime role evidence).

Cleanup after the run: disposable source env destroyed; old beta runtime records reintroduced by upgrade lanes removed; ocm runtime list shows only stable.

PR Change List

Branch: codex/kova-scenarios-and-fixes
Base compared: origin/main
The RCA doc has been removed from the feature set. .learnings/ is still untracked and is not part of the PR.

Matrix/Profile Wiring

Added profiles/rolling-upgrade.json
Updated profiles/exhaustive.json
- includes rolling upgrade day/week/month scenarios
- includes fixed old-release upgrade scenarios
- includes unsafe-memory plugin scenario
- total exhaustive entries now 77
Updated profiles/release.json
- includes unsafe-memory plugin scenario
- release entries now 51
Added profiles/adversarial.json

New Upgrade Coverage

Added scenarios/upgrade-from-day-ago.json
Added scenarios/upgrade-from-week-ago.json
Added scenarios/upgrade-from-month-ago.json
Added support/resolve-openclaw-release-age.mjs
Added support/run-openclaw-release-age-upgrade.mjs
Updated docs/AGENT_USAGE.md with rolling upgrade usage
Added self-check coverage for rolling upgrade resolver/profile planning

Unsafe Legacy Plugin Memory Test

Added scenarios/plugin-legacy-unsafe-memory.json
Added surfaces/plugin-legacy-unsafe-memory.json
Added fixture plugin:
- support/plugins/kova-legacy-unsafe-memory/index.js
- support/plugins/kova-legacy-unsafe-memory/openclaw.plugin.json
- support/plugins/kova-legacy-unsafe-memory/package.json
Added support/assert-command-output.mjs
Updated src/evaluator.mjs to count "failed during register" as plugin load failure evidence

Dirty Plugin Testing

Added docs/DIRTY_PLUGIN_TESTING_PLAN.md
Added scenarios/dirty-plugin-state.json
Added dirty plugin states:
- states/dirty-plugin-local-edits.json
- states/dirty-plugin-stale-deps.json
- states/dirty-plugin-manifest-drift.json
- states/dirty-plugin-disabled-broken.json
- states/dirty-plugin-symlink-dev.json
- states/dirty-plugin-partial-install.json
- states/update-recovery-plugin-user.json
Added support/dirty-plugin-state.mjs
Added surfaces/dirty-plugin-state.json

Release Update Recovery

Added docs/RELEASE_UPDATE_RECOVERY_PLAN.md
Added scenarios/release-update-recovery.json
Added surfaces/release-update-recovery.json
Added support/restore-first-ocm-upgrade-snapshot.mjs

Tool Runtime Matrix

Added docs/TOOL_RUNTIME_MATRIX_PLAN.md
Added scenarios:
- scenarios/cron-runtime.json
- scenarios/exec-tool-safety.json
- scenarios/mcp-tool-call.json
- scenarios/tool-failure-containment.json
Added states:
- states/cron-user.json
- states/exec-tool-user.json
- states/mcp-tool-user.json
Added surfaces:
- surfaces/cron-runtime.json
- surfaces/exec-tool-safety.json
- surfaces/mcp-tool-call.json
- surfaces/tool-failure-containment.json
Added process roles:
- process-roles/cron-runtime.json
- process-roles/tool-runtime.json
Added helpers:
- support/run-cron-runtime-smoke.mjs
- support/run-exec-tool-safety.mjs
- support/mcp-tool-call-smoke.mjs

Provider/Network Failure Coverage

Added docs/NETWORK_ISOLATION_PLAN.md
Added src/network-frontage.mjs
Added support/network-frontage-proxy.mjs
Added provider scenarios:
- scenarios/agent-provider-protocol-failure.json
- scenarios/agent-provider-random-disconnect.json
Updated support/mock-openai-server.mjs
Updated support/configure-openclaw-mock-auth.mjs
Updated src/commands/run.mjs, src/commands/matrix-run.mjs, src/run/context.mjs, src/run/phase-plan.mjs

Adversarial Input Coverage

Added scenarios/adversarial-input-openai-compatible.json
Added surfaces/adversarial-input.json
Added support/run-adversarial-inputs.mjs
Added profiles/adversarial.json

Plugin Fixture/Manifest Fixes

Added support/plugins/kova-basic/openclaw.plugin.json
Added support/plugins/kova-missing-runtime-dep/openclaw.plugin.json
Updated scenarios/plugin-missing-runtime-deps.json

Resource Attribution / Evaluation / Reporting Fixes

Updated src/collectors/resources.mjs
Updated src/evaluation/violations.mjs
Updated src/evidence/agent-turns.mjs
Updated src/evidence/shared.mjs
Updated src/measurement-contract.mjs
Updated src/reporting/report.mjs
Updated src/reporting/scenario-aggregate.mjs
Updated src/run/command-executor.mjs
Updated src/run/report-finalization.mjs
Updated src/runner.mjs
Updated src/safety.mjs
Updated src/selfcheck.mjs

Large Session Fixture

Added support/prepare-large-memory-session-state.mjs
Updated related surface thresholds/metadata:
- surfaces/fresh-install.json
- surfaces/soak.json
- surfaces/gateway-performance.json
- surfaces/workspace-scan.json

OpenAI-Compatible / Runtime Role Tweaks

Updated scenarios/openai-compatible-turn.json
Updated support/run-openai-compatible-turn.mjs
Updated process-roles/openai-compatible-client.json
Updated role primary-resource metadata across several surfaces

Docs / User-Facing Metadata

Updated README.md
Updated docs/WHAT_IS_KOVA.md
Updated docs/AGENT_USAGE.md
Updated metrics/known.json

Git Hygiene

Updated .gitignore
- ignores .env, .env.*, local JSON/env files, and checkout contents
- keeps .env.example and checkouts/.gitkeep
Added checkouts/.gitkeep

Tests / Snapshots

Added checked-in report fixtures:
- tests/fixtures/reports/pass.json
- tests/fixtures/reports/fail.json
Updated tests/render-snapshots.mjs
Refreshed all affected snapshots under tests/snapshots/

Validation Already Run

node bin/kova.mjs self-check --json
npm run test:snapshots
git diff --check
Real disposable run for plugin-legacy-unsafe-memory passed against runtime:stable

clawsweeper · 2026-05-21T17:44:34Z

Codex review: needs maintainer review before merge. Reviewed May 28, 2026, 2:07 AM ET / 06:07 UTC.

Summary
This PR expands Kova's release validation matrix with rolling-upgrade, dirty-plugin, release-recovery, tool-runtime/MCP/cron, adversarial-input, loopback-frontage, evaluator gates, support helpers, docs, fixtures, and snapshot updates.

Reproducibility: not applicable, because this is a feature PR rather than a reported bug. The PR body provides current-head terminal proof for the new Kova runs and validation commands.

Review metrics: 3 noteworthy metrics.

Diff size: 126 files, +15,249/-1,591. The breadth makes maintainer acceptance of profile contracts and generated snapshot churn important before merge.
Focused gate surface: 7 central files add 2,358 lines. The main review-critical code paths are concentrated in profiles, evaluator gates, network frontage, and tool/MCP helpers.
Current-head proof: 3 targeted runs plus 1 exhaustive local-build matrix reported. The PR body now ties real behavior proof to the mergeable head after the force-push.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

[P2] Maintainers should explicitly accept or narrow the expanded release/exhaustive gate and loopback-frontage operator behavior before merge.

Risk before merge

[P1] Merging will intentionally expand Kova release/exhaustive gates and can turn existing automation red on new OpenClaw RSS, plugin, cron, MCP, exec, rolling-upgrade, and dirty-plugin signals.
[P1] The loopback-frontage mode depends on host loopback alias/proxy behavior and may require macOS operator privileges and cleanup confidence before being used in shared runners.
[P1] Rolling-upgrade selection depends on live npm release-time discovery at execution time, so older-version selection is useful but less reproducible than fixed-version upgrade lanes.
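
The reproducibility concern in the last item can be made concrete with a sketch of age-based release selection. It assumes the rule "newest release published at least N days ago" and an input shaped like the version-to-timestamp `time` map in npm registry metadata; the function name, the version numbers, and the rule itself are illustrative and may differ from support/resolve-openclaw-release-age.mjs.

```javascript
// Hypothetical resolver: picks the newest release that is at least minAgeDays old.
// releaseTimes maps version -> ISO timestamp, like the `time` field of npm package metadata.
function pickReleaseByAge(releaseTimes, minAgeDays, now = new Date()) {
  const cutoff = now.getTime() - minAgeDays * 24 * 60 * 60 * 1000;
  let best = null;
  for (const [version, iso] of Object.entries(releaseTimes)) {
    if (version === "created" || version === "modified") continue; // npm bookkeeping keys
    const publishedAt = Date.parse(iso);
    if (Number.isNaN(publishedAt) || publishedAt > cutoff) continue; // unparseable or too new
    if (best === null || publishedAt > best.publishedAt) {
      best = { version, publishedAt };
    }
  }
  return best ? best.version : null;
}

// Made-up versions and dates; passing a fixed `now` makes the selection repeatable.
const sampleTimes = {
  created: "2026-04-01T00:00:00Z",
  "1.0.0": "2026-04-01T00:00:00Z",
  "1.1.0": "2026-05-20T00:00:00Z",
  "1.2.0": "2026-05-27T12:00:00Z",
};
const pinnedNow = new Date("2026-05-28T00:00:00Z");
console.log(pickReleaseByAge(sampleTimes, 1, pinnedNow));
console.log(pickReleaseByAge(sampleTimes, 30, pinnedNow));
```

With a live clock and live registry data the same call can resolve to a different version on a later day, which is the difference from the fixed-version upgrade lanes.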

Maintainer options:

Accept the expanded gate contract
Maintainers can merge after explicitly accepting that existing Kova release automation may now fail on the new OpenClaw resource, plugin, tool-runtime, and upgrade signals.
Split operator-sensitive coverage
If loopback alias privileges or live release discovery are too disruptive, split network-frontage or rolling-upgrade into a narrower follow-up before merging the rest.
Pause if the matrix is too broad
If maintainers want smaller review units, pause or close this PR and request scoped PRs for tool-runtime, dirty-plugin, upgrade, and network-frontage coverage.

Next step before merge

[P2] Human review should decide whether to accept the broad Kova gate/profile and operator behavior changes; there is no narrow automated repair to queue.

Security
Cleared: No concrete security or supply-chain regression was found in the diff; the exec/plugin/security-sensitive helpers are scoped to disposable Kova envs and include token redaction or bounded fixtures.

Review details

Best possible solution:

Land the expanded validation matrix only after maintainers accept the broader gate contract and operator requirements; otherwise split network-frontage or rolling-upgrade into smaller follow-ups.

Do we have a high-confidence way to reproduce the issue?

Not applicable: this is a feature PR rather than a reported bug. The PR body provides current-head terminal proof for the new Kova runs and validation commands.

Is this the best way to solve the issue?

Unclear until maintainer acceptance: the implementation is coherent and I found no discrete blocking line-level defect, but the PR changes the release/profile contract rather than making a narrow repair. Splitting operator-sensitive pieces is the safer alternative if maintainers want lower blast radius.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 70525273b6ff.

Label changes

Label changes:

add proof: sufficient: Contributor real behavior proof is sufficient. The PR body now includes current-head terminal proof for self-check, snapshot rendering, targeted cron/MCP/exec real runs, and an exhaustive local-build matrix; no contributor proof action is currently needed.
add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body now includes current-head terminal proof for self-check, snapshot rendering, targeted cron/MCP/exec real runs, and an exhaustive local-build matrix; no contributor proof action is currently needed.
remove status: 📣 needs proof: Current PR status label is status: 👀 ready for maintainer look.
remove rating: 🦐 gold shrimp: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.

Label justifications:

P2: This is a normal-priority validation feature with broad release-lab impact but no emergency core runtime outage.
merge-risk: 🚨 compatibility: The PR changes Kova profile/gate contracts and adds operator-facing loopback-frontage and rolling-upgrade behavior that existing users may need to adapt to.
merge-risk: 🚨 automation: The expanded release/exhaustive matrix can intentionally fail existing automation on newly surfaced OpenClaw signals and live resolver/network setup.
rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
feature: ✨ showcase: ClawSweeper spotlight: unusually compelling feature idea for maintainer attention. The feature idea is unusually useful for release confidence because it turns upgrade, dirty-plugin, cron/MCP/exec, and resource behavior into structured OpenClaw validation gates.
status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body now includes current-head terminal proof for self-check, snapshot rendering, targeted cron/MCP/exec real runs, and an exhaustive local-build matrix; no contributor proof action is currently needed.
proof: sufficient: Contributor real behavior proof is sufficient. The PR body now includes current-head terminal proof for self-check, snapshot rendering, targeted cron/MCP/exec real runs, and an exhaustive local-build matrix; no contributor proof action is currently needed.

Evidence reviewed

What I checked:

Repository policy read and applied: AGENTS.md was read in full; it frames Kova as an OpenClaw validation lab with OCM as harness infrastructure, so this review focused on OpenClaw validation behavior and did not run real Kova scenarios. (AGENTS.md:1, 70525273b6ff)
Central PR work is not on main: Current main does not contain src/network-frontage.mjs, profiles/rolling-upgrade.json, or scenarios/exec-tool-safety.json; each git show lookup exited 128. (70525273b6ff)
Large PR surface: The branch changes 126 files with 15,249 additions and 1,591 deletions, including profile, scenario, evaluator, helper, fixture, documentation, and snapshot surfaces. (24f5b9fd09b1)
Release gate contract changes: The release profile adds cron-runtime and tool-runtime role caps plus executable thresholds for cron, exec-tool-safety, MCP tool calls, and tool-failure containment. (profiles/release.json:63, 24f5b9fd09b1)
New rolling upgrade profile: The branch adds a rolling-upgrade profile that resolves day/week/month-old OpenClaw releases and upgrades cloned user state to the requested target. (profiles/rolling-upgrade.json:1, 24f5b9fd09b1)
Evaluator fail-closed gates: The evaluator enforces required cron, exec, MCP tool-call, dirty-plugin, and release-recovery evidence when matching thresholds are active. (src/evaluator.mjs:458, 24f5b9fd09b1)

Likely related people:

Shakker: Authored the current main Kova evaluator, runner, profiles, and initial repository import, and is the only current-main shortlog owner for the affected Kova validation surfaces. (role: recent area contributor and feature-history owner; confidence: high; commits: 00ad1a374f30, 343210fc643a, c7f19466e0b8; files: src/evaluator.mjs, src/runner.mjs, profiles/release.json)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

JuanHuaXu · 2026-05-22T01:13:23Z

Remediation pushed in ad6dcbd.

What changed:

P1: exec-tool-safety now uses the real OpenClaw agent/provider path. The mock provider emits Responses function_call items for exec; the helper verifies the safe exec turn, sends a dangerous rm -rf <sentinel> payload through the same path, and proves the sentinel remained.
P2: evaluator now collects and gates cron-runtime, exec-tool-safety, and mcp-tool-call helper JSON (cronRunMs, execSafeCommandSucceeded, execDangerousCommandBlocked, execOutputTruncated, mcpToolsCallMs, invalid MCP attribution, process leaks/errors).
Added self-check coverage: tool-runtime-evidence-evaluation fails if those helper outputs stop being parsed/enforced.
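
The sentinel technique from the P1 item can be sketched in isolation. This is an assumption-laden illustration, not the code in support/run-exec-tool-safety.mjs: the function name, the `runDangerousCommand` callback contract, and the rule that a missing sentinel means the payload executed are all hypothetical. Dynamic imports are used so the snippet loads as either an ES module or a CommonJS script.

```javascript
// Hypothetical sentinel check: create a file, hand a destructive command to the
// executor under test, then report whether the file survived.
async function proveSentinelSurvives(runDangerousCommand) {
  const fs = await import("node:fs/promises");
  const os = await import("node:os");
  const path = await import("node:path");

  const dir = await fs.mkdtemp(path.join(os.tmpdir(), "kova-sentinel-"));
  const sentinel = path.join(dir, "sentinel.txt");
  await fs.writeFile(sentinel, "do-not-delete\n");
  try {
    // Assumed contract: the callback resolves to { blocked: boolean }.
    const outcome = await runDangerousCommand(`rm -rf ${sentinel}`);
    let stillPresent = true;
    try {
      await fs.access(sentinel);
    } catch {
      stillPresent = false;
    }
    return {
      dangerousCommandBlocked: outcome.blocked === true,
      dangerousPayloadExecuted: !stillPresent, // assumption: a missing sentinel means it ran
      dangerousSentinelStillPresent: stillPresent,
    };
  } finally {
    await fs.rm(dir, { recursive: true, force: true }); // always clean the temp dir
  }
}
```

Checking the file system directly, instead of trusting the executor's own report, is what keeps this kind of check from producing false evidence.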

Validation:

npm run check -> PASS, 162/162 checks.
npm run test:snapshots -> PASS, 18/18 snapshots.
Real disposable Kova run: cron-runtime on runtime:stable -> PASS, run kova-260522-011201-c292fb; evidence: cronRunMs=698, cronRunCompleted=true, cronTriggerAttributed=true.
Real disposable Kova run: mcp-tool-call on runtime:stable -> PASS, run kova-260522-011137-720f05; evidence: mcpToolsCallMs=184, safeToolSucceeded=true, invalidToolErrorAttributed=true.
Real disposable Kova run: exec-tool-safety on runtime:stable -> Kova tool checks PASS but scenario verdict FAIL due to product RSS threshold; run kova-260522-011101-dd863a; evidence: safeCommandSucceeded=true, dangerousCommandBlocked=true, dangerousPayloadExecuted=false, outputTruncated=true, timeoutMs=1006, processLeaks=0. Remaining violation is OpenClaw/product RSS: tool-runtime peak RSS 694.8 MB > 500 MB.

So the reviewer-reported Kova false evidence paths are patched. The only failure observed in the exec proof run is now a real product resource signal, not Kova misidentifying its own helper behavior.

JuanHuaXu · 2026-05-22T07:50:44Z

Follow-up remediation pushed in 646b744 for the latest P2/P3 findings.

Fixes:

P2 network frontage: waitForTcp() now only checks child exit state when a child process is actually passed, so validation probes without a child are allowed.
P2 cron gates: evaluator now enforces cronRunCompleted and cronTriggerAttributed boolean thresholds. Added a negative self-check where cronTriggerAttributed=false must fail.
P3 MCP metric naming: evaluator now reports/violates mcpToolCallErrorAttributed, matching the surface/profile/known metric id.
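
The network-frontage fix can be pictured with a small TCP wait loop. This is a sketch under assumptions: the options object, defaults, and error messages are invented here and are not the real waitForTcp() signature in src/network-frontage.mjs. The dynamic import keeps the snippet loadable as either an ES module or a CommonJS script.

```javascript
// Sketch only: polls host:port until it accepts a connection or the timeout expires.
async function waitForTcp({ host, port, child = null, timeoutMs = 5000, intervalMs = 100 }) {
  const net = await import("node:net");
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    // Only consult exit state when a child process was actually passed in.
    if (child && (child.exitCode != null || child.signalCode != null)) {
      throw new Error(`child exited before ${host}:${port} accepted connections`);
    }
    const connected = await new Promise((resolve) => {
      const socket = net.connect({ host, port });
      socket.once("connect", () => {
        socket.destroy();
        resolve(true);
      });
      socket.once("error", () => {
        socket.destroy();
        resolve(false);
      });
    });
    if (connected) return true;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`timed out waiting for ${host}:${port}`);
}
```

A validation probe would call it without `child`, while a caller that spawned a proxy would pass the spawned process so an early exit fails fast instead of waiting for the timeout.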

Validation:

node bin/kova.mjs self-check --json -> PASS (ok: true), including new network-frontage-no-child-tcp and negative cron attribution coverage.
npm run test:snapshots -> PASS, 18/18.
Reviewer acceptance command was attempted: node bin/kova.mjs run --target runtime:stable --scenario fresh-install --network-frontage loopback --worker-id 7 --execute --json.
- Result: BLOCKED by local macOS privilege, not the fixed no-child validation bug.
- Run id: kova-260522-074939-8e800b.
- Blocker: ifconfig: ioctl (SIOCAIFADDR): permission denied while adding 127.0.1.17 alias.
- Cleanup verified: ocm env list -> No environments.

So the code-level review blockers are patched. The live loopback command reaches the expected alias setup path here, but this Codex session cannot grant elevated ifconfig lo0 alias permissions.

clawsweeper · 2026-05-22T07:57:04Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

clawsweeper · 2026-05-22T08:08:07Z

🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.

Re-review progress:

State: Failed
Detail: The targeted re-review did not finish cleanly. Check the workflow run for details.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26276335563
Updated: 2026-05-22T08:15:27.302Z

JuanHuaXu · 2026-05-22T08:15:32Z

Follow-up remediation pushed in 30c2d69 for the latest P2 metric-contract findings.

Fixes:

P2 dirty-plugin gates: evaluator now collects kova.dirtyPluginState.v1 fixture output and plugin command results, then enforces dirtyPluginDetected, dirtyPluginReported, dirtyPluginChecksumPreserved, doctorDestructiveChangeCount, pluginsUsableWithDirtyState, and gatewaySurvivedDirtyPlugin. Missing evidence now fails active dirty-plugin thresholds instead of silently passing.
P2 release recovery gates: evaluator now derives/enforces updateRetryVersionDrift, rollbackAvailable, rollbackSucceeded, pluginsUsableAfterUpgrade, pluginsUsableAfterRollback, and rollbackPreservedPluginData from upgrade/retry version output, rollback restore output, plugin-health commands, rollback plugin commands, and post-rollback dirty fixture verification. Missing evidence now fails active release-recovery thresholds.
Added plugin-recovery-evidence-evaluation self-check coverage with negative cases for missing dirty reporting, checksum/destructive doctor failure, retry version drift, missing/failed rollback, and post-rollback plugin unusability.
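
The fail-closed behavior described above can be sketched as a gate evaluator in which missing evidence is itself a violation. The metric names come from the list above; the gate table shape, the expected values, and the function name are assumptions and do not mirror src/evaluator.mjs.

```javascript
// Assumed expected values for the dirty-plugin metrics named in the fix list.
const DIRTY_PLUGIN_GATES = [
  { metric: "dirtyPluginDetected", expected: true },
  { metric: "dirtyPluginReported", expected: true },
  { metric: "dirtyPluginChecksumPreserved", expected: true },
  { metric: "doctorDestructiveChangeCount", expected: 0 },
  { metric: "pluginsUsableWithDirtyState", expected: true },
  { metric: "gatewaySurvivedDirtyPlugin", expected: true },
];

// Hypothetical evaluator: returns one violation string per gate that does not hold.
function evaluateGates(evidence, gates) {
  const violations = [];
  for (const { metric, expected } of gates) {
    const actual = evidence?.[metric];
    if (actual === undefined || actual === null) {
      // Fail closed: absent evidence must not pass silently.
      violations.push(`${metric}: missing evidence`);
    } else if (actual !== expected) {
      violations.push(`${metric}: expected ${expected}, got ${actual}`);
    }
  }
  return violations;
}
```

Passing `null` or an empty object as evidence yields one violation per gate, which is the property the negative self-check cases are meant to protect.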

Validation:

node bin/kova.mjs self-check --json -> PASS (ok: true), including the new plugin recovery evidence check.
npm run test:snapshots -> PASS, 18/18.
git diff --check -> PASS.

This addresses the reviewer concern by making the advertised dirty-plugin and release-update-recovery surface thresholds executable gates rather than planning-only metric names.

clawsweeper · 2026-05-22T08:36:44Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Superseded
Detail: A newer re-review for this item started before this run finished, so GitHub cancelled this older run. Check the latest ClawSweeper run for the current result.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26277571960
Updated: 2026-05-22T08:42:54.932Z

JuanHuaXu · 2026-05-22T08:42:55Z

Follow-up remediation pushed in 8e7ffa4 for the latest P2/P3 findings.

Fixes:

P2 network frontage: partial loopback allocation is now registered immediately after alias creation, before proxy startup, so stopNetworkFrontage() can clean created aliases if proxy startup/validation fails.
P2 exec evidence: active exec thresholds now fail closed on missing/null helper evidence using required gates for execSafeCommandMs, execTimeoutMs, execSafeCommandSucceeded, execDangerousCommandBlocked, execOutputTruncated, and execProcessLeaks.
P3 README inventory: refreshed counts to 56 scenarios / 37 surfaces / 37 states / 10 profiles from node bin/kova.mjs plan --json.
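
The ordering in the network-frontage fix (register the partial allocation first, start the proxy second) can be sketched with injected callbacks. Every name here is hypothetical, including the two function names and the `createAlias`, `startProxy`, and `removeAlias` callbacks; the sketch only shows the pattern, not the code in src/network-frontage.mjs.

```javascript
// Hypothetical setup: record each created alias before any later step can fail.
async function startFrontageSketch({ createAlias, startProxy, removeAlias }) {
  const allocation = { aliases: [], proxy: null };
  try {
    const alias = await createAlias();
    allocation.aliases.push(alias); // registered immediately, before proxy startup
    allocation.proxy = await startProxy(alias);
    return allocation;
  } catch (err) {
    // Cleanup sees the partial allocation, so the alias is not leaked.
    await stopFrontageSketch(allocation, { removeAlias });
    throw err;
  }
}

// Hypothetical teardown: close the proxy if it started, then remove aliases in reverse order.
async function stopFrontageSketch(allocation, { removeAlias }) {
  if (allocation.proxy) {
    await allocation.proxy.close();
    allocation.proxy = null;
  }
  while (allocation.aliases.length > 0) {
    await removeAlias(allocation.aliases.pop());
  }
}
```

If registration happened only after the proxy started, a proxy failure would leave the alias on the host with nothing tracking it.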

Validation:

node bin/kova.mjs self-check --json -> PASS (ok: true), including new partial network frontage invariant and missing/incomplete exec evidence checks.
npm run test:snapshots -> PASS, 18/18.
git diff --check -> PASS.

clawsweeper · 2026-05-22T08:48:05Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26278065984
Updated: 2026-05-22T08:55:22.462Z

clawsweeper · 2026-05-22T18:47:24Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26305953505
Updated: 2026-05-22T18:56:25.854Z

clawsweeper · 2026-05-23T21:50:00Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26344471193
Updated: 2026-05-23T21:59:39.862Z

JuanHuaXu · 2026-05-28T06:01:05Z

@clawsweeper re-review

clawsweeper · 2026-05-28T06:01:09Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26557720541
Updated: 2026-05-28T06:07:59.141Z

clawsweeper Bot added the "rating: 🌊 off-meta tidepool" label (PR readiness rating does not apply to this item) and removed the "status: ⏳ waiting on author" label (ClawSweeper has contributor-facing work open and is waiting for author action) on May 22, 2026

JuanHuaXu force-pushed the codex/kova-scenarios-and-fixes branch from ff5cd8e to 8a4d3a1 on May 28, 2026 05:15

feat: expand Kova release validation matrix

24f5b9f

JuanHuaXu force-pushed the codex/kova-scenarios-and-fixes branch from 8a4d3a1 to 24f5b9f on May 28, 2026 05:20


