
Codex/kova scenarios and fixes #2

Open
JuanHuaXu wants to merge 1 commit into openclaw:main from JuanHuaXu:codex/kova-scenarios-and-fixes


@JuanHuaXu

@JuanHuaXu JuanHuaXu commented May 21, 2026

Fresh Evidence

Latest Kova head: 24f5b9f (codex/kova-scenarios-and-fixes, single squashed PR commit on base 7052527).

Current-Head Proof

Reviewer proof refreshed on current PR head 24f5b9f after the conflict-resolution squash and cron fail-closed amendment. Commands were run from this PR branch with disposable Kova homes under checkouts/kova-pr2-proof-24f5b9f-*; artifact paths below are repo-relative and do not include private local paths.

Validation:

  • node bin/kova.mjs self-check -> 182/182 passed, generated 2026-05-28 06:00 UTC.
  • node tests/render-snapshots.mjs -> 21 pass, 0 fail, 0 written.
  • Conflict marker scan: rg -n "<<<<<<<|=======|>>>>>>>" . -g '!checkouts/**' -g '!node_modules/**' -g '!reports/**' -> no matches.
  • Cleanup before and after targeted runs: ocm env list -> No environments.
  • Branch shape: git rev-list --count upstream/main..HEAD -> 1 commit.

Targeted real-run proof:

| Scenario | Run id | Status | Evidence |
| --- | --- | --- | --- |
| cron-runtime / cron-user | kova-260528-055817-79a0be | PASS | cronRuntimeEvidence.available: true; cronRegisterMs: 760; cronRunMs: 707; cronRunCompleted: true; cronTriggerAttributed: true; no violations. |
| mcp-tool-call / mcp-tool-user | kova-260528-055843-76d8f2 | PASS | mcpToolCallEvidence.available: true; toolsCallMs: 124; safeToolName: conversations_list; safeToolSucceeded: true; invalidToolErrorAttributed: true; processExited: true; no violations. |
| exec-tool-safety / exec-tool-user | kova-260528-055902-9617b8 | FAIL | Kova evidence was present and scoped: safeCommandSucceeded: true, dangerousCommandBlocked: true, dangerousPayloadExecuted: false, outputTruncated: true, timeoutMs: 3222, processLeaks: 0. Remaining failure is an OpenClaw resource signal: tool-runtime peak RSS 760.3 MB exceeded threshold 500 MB. |

Artifacts on runner:

  • checkouts/kova-pr2-proof-24f5b9f-cron/reports/kova-260528-055817-79a0be.json
  • checkouts/kova-pr2-proof-24f5b9f-mcp/reports/kova-260528-055843-76d8f2.json
  • checkouts/kova-pr2-proof-24f5b9f-exec/reports/kova-260528-055902-9617b8.json

This directly addresses ClawSweeper's stale-proof blocker: current-head proof now matches the branch state maintainers would merge. The only targeted runtime failure remains the known OpenClaw/product RSS threshold signal; cron/MCP/exec helper evidence itself is present, parsed, and gated.
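The split described above, between Kova helper evidence and the OpenClaw RSS threshold, can be illustrated with a minimal sketch. This is not Kova's evaluator: the `splitVerdict` function and its input shape are hypothetical, and only the field names and the 760.3 MB / 500 MB figures are taken from the exec-tool-safety run above.

```javascript
// Hypothetical sketch: keep helper-evidence failures apart from resource-threshold failures.
// Field names mirror the exec-tool-safety evidence quoted above; the function is illustrative only.
function splitVerdict(evidence, resource) {
  const evidenceViolations = [];
  if (evidence.safeCommandSucceeded !== true) evidenceViolations.push("safeCommandSucceeded");
  if (evidence.dangerousCommandBlocked !== true) evidenceViolations.push("dangerousCommandBlocked");
  if (evidence.dangerousPayloadExecuted !== false) evidenceViolations.push("dangerousPayloadExecuted");
  if (evidence.processLeaks !== 0) evidenceViolations.push("processLeaks");

  const resourceViolations = [];
  // A missing or non-numeric RSS reading also counts as a violation here.
  if (!(resource.peakRssMb <= resource.thresholdMb)) {
    resourceViolations.push(
      `peak RSS ${resource.peakRssMb} MB exceeded threshold ${resource.thresholdMb} MB`,
    );
  }

  const clean = evidenceViolations.length === 0 && resourceViolations.length === 0;
  return { status: clean ? "PASS" : "FAIL", evidenceViolations, resourceViolations };
}

const execVerdict = splitVerdict(
  {
    safeCommandSucceeded: true,
    dangerousCommandBlocked: true,
    dangerousPayloadExecuted: false,
    processLeaks: 0,
  },
  { peakRssMb: 760.3, thresholdMb: 500 },
);
console.log(execVerdict.status, execVerdict.evidenceViolations, execVerdict.resourceViolations);
```

With these inputs the status is FAIL while the evidence violation list stays empty, which is the shape of result the exec-tool-safety row reports.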

Current-Head Exec Tool Evidence Proof

Reviewer P2 fix proof was run at Kova head 1546804 (Drive exec containment checks through OpenClaw). This fixes the false-evidence lane by using the OpenClaw exec tool schema's required command argument and by driving safe, blocked, oversized-output, and timeout cases through openclaw agent plus mock-provider tool-result evidence.

Commands:

KOVA_HOME=checkouts/p2-exec-real-proof2-kova-home \
  node bin/kova.mjs run \
    --target runtime:stable \
    --scenario exec-tool-safety \
    --state exec-tool-user \
    --execute \
    --report-dir checkouts/p2-exec-real-proof2/reports \
    --json

KOVA_HOME=checkouts/p2-exec-real-proof2-kova-home \
  node bin/kova.mjs run \
    --target runtime:stable \
    --scenario tool-failure-containment \
    --state exec-tool-user \
    --execute \
    --report-dir checkouts/p2-exec-real-proof2/reports \
    --json

Results:

  • exec-tool-safety: kova-260522-170514-192a6a, proof complete 10/10; exec evidence available; safeCommandSucceeded: true; dangerousCommandBlocked: true; dangerousPayloadExecuted: false; outputTruncated: true; timeoutMs: 3499; processLeaks: 0.
  • tool-failure-containment: kova-260522-170551-6419fb, proof complete 10/10; exec evidence available; dangerousCommandBlocked: true; dangerousPayloadExecuted: false; outputTruncated: true; timeoutMs: 3291; processLeaks: 0.

Both runs still report FAIL, but only for the existing OpenClaw resource threshold: tool-runtime RSS around 750 MB over the 500 MB threshold. The previous Kova evidence failures are gone.

Validation for the patch:

  • node bin/kova.mjs self-check --json -> ok: true (173 checks)
  • node tests/render-snapshots.mjs -> 18 pass
  • git diff --check -> pass
  • Cleanup: ocm env list -> no environments

Artifacts on runner:

  • checkouts/p2-exec-real-proof2/reports/kova-260522-170514-192a6a.json
  • checkouts/p2-exec-real-proof2/reports/kova-260522-170551-6419fb.json

Current-Head Tool-Failure Proof

Reviewer P2 fix proof was run at Kova head 2439672 (Fix failure-only exec mock provider flow).

Command:

KOVA_HOME=checkouts/p2-tool-failure-proof-kova-home \
  node bin/kova.mjs run \
    --target runtime:stable \
    --scenario tool-failure-containment \
    --state exec-tool-user \
    --execute \
    --report-dir checkouts/p2-tool-failure-proof/reports \
    --json

Result: kova-260522-164448-f2fbb2 completed with proof completeness 10/10 required obligations, cleanup destroyed the disposable env, and the fixed failure-only provider path was exercised:

  • Mock script: kova-exec-tool-failure-only
  • First provider step: kova-exec-tool-failure-only-dangerous-tool-call
  • Matched provider step: kova-exec-tool-failure-only-dangerous-tool-call
  • Provider emitted one exec tool call
  • dangerousCommandBlocked: true
  • dangerousPayloadExecuted: false
  • dangerousSentinelStillPresent: true
  • outputTruncated: true, timeoutObserved: true, processLeaks: 0

The scenario verdict is still FAIL, but for an OpenClaw resource threshold, not the Kova wiring bug: tool-runtime peak RSS 741.9 MB exceeded threshold 500 MB. This is expected to remain a product/resource signal for maintainers.

Artifacts on runner:

  • Markdown: checkouts/p2-tool-failure-proof/reports/kova-260522-164448-f2fbb2.md
  • JSON: checkouts/p2-tool-failure-proof/reports/kova-260522-164448-f2fbb2.json
  • Tool artifact: checkouts/p2-tool-failure-proof-kova-home/artifacts/kova-260522-164448-f2fbb2/kova-tool-failure-containmen-81131346-kova-260522-164448-f2fbb2/tool-failure-containment.json

Requested OpenClaw target: v2026.5.21-beta.1. The Git tag exists and resolves to 89a17def chore(release): prepare 2026.5.21-beta.1, but openclaw@2026.5.21-beta.1 is not currently published in npm/OCM release discovery. A direct npm:2026.5.21-beta.1 exhaustive run was therefore blocked at provisioning with the error: OpenClaw release version "2026.5.21-beta.1" was not found.

Fresh release-shaped tag evidence was run from a disposable openclaw/openclaw.git checkout at that tag using local-build:checkouts/openclaw-v2026.5.21-beta.1. After pnpm install --frozen-lockfile, direct pnpm pack succeeded and produced openclaw-2026.5.21-beta.1.tgz.

Matrix command:

KOVA_HOME=checkouts/kova-pr2-evidence-v2026-5-21-beta-1-local-build-rerun \
  node bin/kova.mjs matrix run \
    --profile exhaustive \
    --target local-build:checkouts/openclaw-v2026.5.21-beta.1 \
    --source-env kova-pr2-source-v2026-5-21-beta-1 \
    --execute \
    --allow-exhaustive \
    --json \
    --report-dir checkouts/kova-pr2-evidence-v2026-5-21-beta-1-local-build-rerun/reports

Result: kova-260522-145534-037d84 -> 77 total · 32 PASS · 45 FAIL · 0 BLOCKED.
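As a quick consistency check on the summary line above (32 + 45 + 0 = 77), here is a minimal sketch of how such a tally could be derived from per-scenario statuses. The `tallyStatuses` and `formatTally` helpers and the entry shape are hypothetical, and the entries are synthetic placeholders rather than the contents of the real report JSON.

```javascript
// Hypothetical sketch: count scenario statuses and format a one-line summary.
function tallyStatuses(entries) {
  const counts = { PASS: 0, FAIL: 0, BLOCKED: 0 };
  for (const { status } of entries) {
    if (!Object.prototype.hasOwnProperty.call(counts, status)) {
      throw new Error(`unknown status: ${status}`);
    }
    counts[status] += 1;
  }
  return { total: entries.length, ...counts };
}

function formatTally(tally) {
  return `${tally.total} total · ${tally.PASS} PASS · ${tally.FAIL} FAIL · ${tally.BLOCKED} BLOCKED`;
}

// Synthetic entries that only reproduce the counts quoted above.
const syntheticEntries = [
  ...Array.from({ length: 32 }, () => ({ status: "PASS" })),
  ...Array.from({ length: 45 }, () => ({ status: "FAIL" })),
];
console.log(formatTally(tallyStatuses(syntheticEntries)));
// prints "77 total · 32 PASS · 45 FAIL · 0 BLOCKED"
```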

Evidence artifacts on runner:

  • Markdown: checkouts/kova-pr2-evidence-v2026-5-21-beta-1-local-build-rerun/reports/kova-260522-145534-037d84-exhaustive.md
  • JSON: checkouts/kova-pr2-evidence-v2026-5-21-beta-1-local-build-rerun/reports/kova-260522-145534-037d84-exhaustive.json
  • Bundle: checkouts/kova-pr2-evidence-v2026-5-21-beta-1-local-build-rerun/reports/kova-260522-145534-037d84-bundle.tar.gz
  • Bundle SHA256: 743ee926c809921c0b6aea0170c1731e8576749ab88ee8126cafebfb210c73ae

Notable PASS coverage: release-runtime-startup, channel-discord-capability-conformance, upgrade-existing-user (both source states), bundled-runtime-deps (both states), plugin-lifecycle (all states), the official/bad/missing/unsafe plugin lanes, provider-models, agent-cold-warm-message, dashboard-readiness, tui-responsiveness, mcp-runtime-start-stop, agent-network-offline, failure-injection, and cross-platform-smoke.

Top remaining failures are product/resource signals in OpenClaw/tag behavior rather than Kova unsupported-mode blockers: gateway RSS around the 700 MB threshold across agent/provider/HTTP/TUI surfaces, rolling-upgrade package/runtime RSS/CPU, dirty-plugin doctor-cli RSS, tool-runtime RSS for exec/tool containment, soak/workspace latency, and a few functional/liveness failures (channel generated-image handoff, Telegram timeout signals, cron/browser/media gateway restarting, MCP tool-call missing runtime role evidence).

Cleanup after the run: disposable source env destroyed; old beta runtime records reintroduced by upgrade lanes removed; ocm runtime list shows only stable.

PR Change List

Branch: codex/kova-scenarios-and-fixes
Base compared: origin/main
The RCA doc is removed from the feature set. .learnings/ is still untracked and is not part of this PR.

Matrix/Profile Wiring

  • Added profiles/rolling-upgrade.json
  • Updated profiles/exhaustive.json
    • includes rolling upgrade day/week/month scenarios
    • includes fixed old-release upgrade scenarios
    • includes unsafe-memory plugin scenario
    • total exhaustive entries now 77
  • Updated profiles/release.json
    • includes unsafe-memory plugin scenario
    • release entries now 51
  • Added profiles/adversarial.json

New Upgrade Coverage

  • Added scenarios/upgrade-from-day-ago.json
  • Added scenarios/upgrade-from-week-ago.json
  • Added scenarios/upgrade-from-month-ago.json
  • Added support/resolve-openclaw-release-age.mjs
  • Added support/run-openclaw-release-age-upgrade.mjs
  • Updated docs/AGENT_USAGE.md with rolling upgrade usage
  • Added self-check coverage for rolling upgrade resolver/profile planning
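The idea behind the day/week/month scenarios, picking the newest release that is at least a given age, can be sketched as below. This is an assumption about the general technique, not the contents of support/resolve-openclaw-release-age.mjs: the `resolveReleaseByAge` function is hypothetical, the version-to-timestamp map is modelled on the `time` field of npm registry package metadata, and the versions and dates are made up.

```javascript
// Hypothetical sketch: pick the newest release published at least `minAgeDays` before `now`.
// `publishTimes` maps version -> ISO timestamp, shaped like the npm registry `time` field.
const DAY_MS = 24 * 60 * 60 * 1000;

function resolveReleaseByAge(publishTimes, minAgeDays, now) {
  const cutoff = now.getTime() - minAgeDays * DAY_MS;
  let best = null;
  for (const [version, iso] of Object.entries(publishTimes)) {
    // The npm `time` object also carries bookkeeping keys that are not versions.
    if (version === "created" || version === "modified") continue;
    const publishedAt = Date.parse(iso);
    if (Number.isNaN(publishedAt) || publishedAt > cutoff) continue;
    if (best === null || publishedAt > best.publishedAt) best = { version, publishedAt };
  }
  return best ? best.version : null;
}

// Made-up versions and dates, for illustration only.
const samplePublishTimes = {
  created: "2026-04-01T00:00:00Z",
  "1.0.0": "2026-04-01T00:00:00Z",
  "1.1.0": "2026-05-10T00:00:00Z",
  "1.2.0": "2026-05-20T00:00:00Z",
  "1.3.0": "2026-05-27T12:00:00Z",
};
const referenceTime = new Date("2026-05-28T00:00:00Z");
console.log(resolveReleaseByAge(samplePublishTimes, 1, referenceTime)); // "1.2.0"
console.log(resolveReleaseByAge(samplePublishTimes, 7, referenceTime)); // "1.2.0"
console.log(resolveReleaseByAge(samplePublishTimes, 30, referenceTime)); // "1.0.0"
```

Because the result depends on what has been published by the time of the run, the same inputs can resolve to a different version on a later day, unlike a fixed-version upgrade lane.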

Unsafe Legacy Plugin Memory Test

  • Added scenarios/plugin-legacy-unsafe-memory.json
  • Added surfaces/plugin-legacy-unsafe-memory.json
  • Added fixture plugin:
    • support/plugins/kova-legacy-unsafe-memory/index.js
    • support/plugins/kova-legacy-unsafe-memory/openclaw.plugin.json
    • support/plugins/kova-legacy-unsafe-memory/package.json
  • Added support/assert-command-output.mjs
  • Updated src/evaluator.mjs to count "failed during register" output as plugin load failure evidence

Dirty Plugin Testing

  • Added docs/DIRTY_PLUGIN_TESTING_PLAN.md
  • Added scenarios/dirty-plugin-state.json
  • Added dirty plugin states:
    • states/dirty-plugin-local-edits.json
    • states/dirty-plugin-stale-deps.json
    • states/dirty-plugin-manifest-drift.json
    • states/dirty-plugin-disabled-broken.json
    • states/dirty-plugin-symlink-dev.json
    • states/dirty-plugin-partial-install.json
    • states/update-recovery-plugin-user.json
  • Added support/dirty-plugin-state.mjs
  • Added surfaces/dirty-plugin-state.json

Release Update Recovery

  • Added docs/RELEASE_UPDATE_RECOVERY_PLAN.md
  • Added scenarios/release-update-recovery.json
  • Added surfaces/release-update-recovery.json
  • Added support/restore-first-ocm-upgrade-snapshot.mjs

Tool Runtime Matrix

  • Added docs/TOOL_RUNTIME_MATRIX_PLAN.md
  • Added scenarios:
    • scenarios/cron-runtime.json
    • scenarios/exec-tool-safety.json
    • scenarios/mcp-tool-call.json
    • scenarios/tool-failure-containment.json
  • Added states:
    • states/cron-user.json
    • states/exec-tool-user.json
    • states/mcp-tool-user.json
  • Added surfaces:
    • surfaces/cron-runtime.json
    • surfaces/exec-tool-safety.json
    • surfaces/mcp-tool-call.json
    • surfaces/tool-failure-containment.json
  • Added process roles:
    • process-roles/cron-runtime.json
    • process-roles/tool-runtime.json
  • Added helpers:
    • support/run-cron-runtime-smoke.mjs
    • support/run-exec-tool-safety.mjs
    • support/mcp-tool-call-smoke.mjs

Provider/Network Failure Coverage

  • Added docs/NETWORK_ISOLATION_PLAN.md
  • Added src/network-frontage.mjs
  • Added support/network-frontage-proxy.mjs
  • Added provider scenarios:
    • scenarios/agent-provider-protocol-failure.json
    • scenarios/agent-provider-random-disconnect.json
  • Updated support/mock-openai-server.mjs
  • Updated support/configure-openclaw-mock-auth.mjs
  • Updated src/commands/run.mjs, src/commands/matrix-run.mjs, src/run/context.mjs, src/run/phase-plan.mjs

Adversarial Input Coverage

  • Added scenarios/adversarial-input-openai-compatible.json
  • Added surfaces/adversarial-input.json
  • Added support/run-adversarial-inputs.mjs
  • Added profiles/adversarial.json

Plugin Fixture/Manifest Fixes

  • Added support/plugins/kova-basic/openclaw.plugin.json
  • Added support/plugins/kova-missing-runtime-dep/openclaw.plugin.json
  • Updated scenarios/plugin-missing-runtime-deps.json

Resource Attribution / Evaluation / Reporting Fixes

  • Updated src/collectors/resources.mjs
  • Updated src/evaluation/violations.mjs
  • Updated src/evidence/agent-turns.mjs
  • Updated src/evidence/shared.mjs
  • Updated src/measurement-contract.mjs
  • Updated src/reporting/report.mjs
  • Updated src/reporting/scenario-aggregate.mjs
  • Updated src/run/command-executor.mjs
  • Updated src/run/report-finalization.mjs
  • Updated src/runner.mjs
  • Updated src/safety.mjs
  • Updated src/selfcheck.mjs

Large Session Fixture

  • Added support/prepare-large-memory-session-state.mjs
  • Updated related surface thresholds/metadata:
    • surfaces/fresh-install.json
    • surfaces/soak.json
    • surfaces/gateway-performance.json
    • surfaces/workspace-scan.json

OpenAI-Compatible / Runtime Role Tweaks

  • Updated scenarios/openai-compatible-turn.json
  • Updated support/run-openai-compatible-turn.mjs
  • Updated process-roles/openai-compatible-client.json
  • Updated role primary-resource metadata across several surfaces

Docs / User-Facing Metadata

  • Updated README.md
  • Updated docs/WHAT_IS_KOVA.md
  • Updated docs/AGENT_USAGE.md
  • Updated metrics/known.json

Git Hygiene

  • Updated .gitignore
    • ignores .env, .env.*, local JSON/env files, and checkout contents
    • keeps .env.example and checkouts/.gitkeep
  • Added checkouts/.gitkeep

Tests / Snapshots

  • Added checked-in report fixtures:
    • tests/fixtures/reports/pass.json
    • tests/fixtures/reports/fail.json
  • Updated tests/render-snapshots.mjs
  • Refreshed all affected snapshots under tests/snapshots/

Validation Already Run

  • node bin/kova.mjs self-check --json
  • npm run test:snapshots
  • git diff --check
  • Real disposable run for plugin-legacy-unsafe-memory passed against runtime:stable

@clawsweeper clawsweeper Bot added the following labels on May 21, 2026:

  • rating: 🧂 unranked krab: Not merge-ready due to missing proof or serious correctness/safety concerns.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask.
  • P2: Normal priority bug or improvement with limited blast radius.
  • merge-risk: 🚨 security-boundary: 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data.
  • merge-risk: 🚨 automation: 🚨 Merging this PR could break CI, automerge, proof capture, label sync, or automation.

@clawsweeper

clawsweeper Bot commented May 21, 2026

Codex review: needs maintainer review before merge. Reviewed May 28, 2026, 2:07 AM ET / 06:07 UTC.

Summary
This PR expands Kova's release validation matrix with rolling-upgrade, dirty-plugin, release-recovery, tool-runtime/MCP/cron, adversarial-input, loopback-frontage, evaluator gates, support helpers, docs, fixtures, and snapshot updates.

Reproducibility: not applicable, as this is a feature PR rather than a reported bug. The PR body provides current-head terminal proof for the new Kova runs and validation commands.

Review metrics: 3 noteworthy metrics.

  • Diff size: 126 files, +15,249/-1,591. The breadth makes maintainer acceptance of profile contracts and generated snapshot churn important before merge.
  • Focused gate surface: 7 central files add 2,358 lines. The main review-critical code paths are concentrated in profiles, evaluator gates, network frontage, and tool/MCP helpers.
  • Current-head proof: 3 targeted runs plus 1 exhaustive local-build matrix reported. The PR body now ties real behavior proof to the mergeable head after the force-push.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P2] Maintainers should explicitly accept or narrow the expanded release/exhaustive gate and loopback-frontage operator behavior before merge.

Risk before merge

  • [P1] Merging will intentionally expand Kova release/exhaustive gates and can turn existing automation red on new OpenClaw RSS, plugin, cron, MCP, exec, rolling-upgrade, and dirty-plugin signals.
  • [P1] The loopback-frontage mode depends on host loopback alias/proxy behavior and may require macOS operator privileges and cleanup confidence before being used in shared runners.
  • [P1] Rolling-upgrade selection depends on live npm release-time discovery at execution time, so older-version selection is useful but less reproducible than fixed-version upgrade lanes.

Maintainer options:

  1. Accept the expanded gate contract
    Maintainers can merge after explicitly accepting that existing Kova release automation may now fail on the new OpenClaw resource, plugin, tool-runtime, and upgrade signals.
  2. Split operator-sensitive coverage
    If loopback alias privileges or live release discovery are too disruptive, split network-frontage or rolling-upgrade into a narrower follow-up before merging the rest.
  3. Pause if the matrix is too broad
    If maintainers want smaller review units, pause or close this PR and request scoped PRs for tool-runtime, dirty-plugin, upgrade, and network-frontage coverage.

Next step before merge

  • [P2] Human review should decide whether to accept the broad Kova gate/profile and operator behavior changes; there is no narrow automated repair to queue.

Security
Cleared: No concrete security or supply-chain regression was found in the diff; the exec/plugin/security-sensitive helpers are scoped to disposable Kova envs and include token redaction or bounded fixtures.

Review details

Best possible solution:

Land the expanded validation matrix only after maintainers accept the broader gate contract and operator requirements; otherwise split network-frontage or rolling-upgrade into smaller follow-ups.

Do we have a high-confidence way to reproduce the issue?

Not applicable: this is a feature PR rather than a reported bug. The PR body provides current-head terminal proof for the new Kova runs and validation commands.

Is this the best way to solve the issue?

Unclear until maintainer acceptance: the implementation is coherent and I found no discrete blocking line-level defect, but the PR changes the release/profile contract rather than making a narrow repair. Splitting operator-sensitive pieces is the safer alternative if maintainers want lower blast radius.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 70525273b6ff.

Label changes

Label changes:

  • add proof: sufficient: Contributor real behavior proof is sufficient. The PR body now includes current-head terminal proof for self-check, snapshot rendering, targeted cron/MCP/exec real runs, and an exhaustive local-build matrix; no contributor proof action is currently needed.
  • add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body now includes current-head terminal proof for self-check, snapshot rendering, targeted cron/MCP/exec real runs, and an exhaustive local-build matrix; no contributor proof action is currently needed.
  • remove status: 📣 needs proof: Current PR status label is status: 👀 ready for maintainer look.
  • remove rating: 🦐 gold shrimp: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.

Label justifications:

  • P2: This is a normal-priority validation feature with broad release-lab impact but no emergency core runtime outage.
  • merge-risk: 🚨 compatibility: The PR changes Kova profile/gate contracts and adds operator-facing loopback-frontage and rolling-upgrade behavior that existing users may need to adapt to.
  • merge-risk: 🚨 automation: The expanded release/exhaustive matrix can intentionally fail existing automation on newly surfaced OpenClaw signals and live resolver/network setup.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • feature: ✨ showcase: ClawSweeper spotlight: unusually compelling feature idea for maintainer attention. The feature idea is unusually useful for release confidence because it turns upgrade, dirty-plugin, cron/MCP/exec, and resource behavior into structured OpenClaw validation gates.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body now includes current-head terminal proof for self-check, snapshot rendering, targeted cron/MCP/exec real runs, and an exhaustive local-build matrix; no contributor proof action is currently needed.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body now includes current-head terminal proof for self-check, snapshot rendering, targeted cron/MCP/exec real runs, and an exhaustive local-build matrix; no contributor proof action is currently needed.
Evidence reviewed

What I checked:

  • Repository policy read and applied: AGENTS.md was read in full; it frames Kova as an OpenClaw validation lab with OCM as harness infrastructure, so this review focused on OpenClaw validation behavior and did not run real Kova scenarios. (AGENTS.md:1, 70525273b6ff)
  • Central PR work is not on main: Current main does not contain src/network-frontage.mjs, profiles/rolling-upgrade.json, or scenarios/exec-tool-safety.json; each git show lookup exited 128. (70525273b6ff)
  • Large PR surface: The branch changes 126 files with 15,249 additions and 1,591 deletions, including profile, scenario, evaluator, helper, fixture, documentation, and snapshot surfaces. (24f5b9fd09b1)
  • Release gate contract changes: The release profile adds cron-runtime and tool-runtime role caps plus executable thresholds for cron, exec-tool-safety, MCP tool calls, and tool-failure containment. (profiles/release.json:63, 24f5b9fd09b1)
  • New rolling upgrade profile: The branch adds a rolling-upgrade profile that resolves day/week/month-old OpenClaw releases and upgrades cloned user state to the requested target. (profiles/rolling-upgrade.json:1, 24f5b9fd09b1)
  • Evaluator fail-closed gates: The evaluator enforces required cron, exec, MCP tool-call, dirty-plugin, and release-recovery evidence when matching thresholds are active. (src/evaluator.mjs:458, 24f5b9fd09b1)

Likely related people:

  • Shakker: Authored the current main Kova evaluator, runner, profiles, and initial repository import, and is the only current-main shortlog owner for the affected Kova validation surfaces. (role: recent area contributor and feature-history owner; confidence: high; commits: 00ad1a374f30, 343210fc643a, c7f19466e0b8; files: src/evaluator.mjs, src/runner.mjs, profiles/release.json)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@JuanHuaXu
Author

Remediation pushed in ad6dcbd.

What changed:

  • P1: exec-tool-safety now uses the real OpenClaw agent/provider path. The mock provider emits Responses function_call items for exec; the helper verifies the safe exec turn, sends a dangerous rm -rf <sentinel> payload through the same path, and proves the sentinel remained.
  • P2: evaluator now collects and gates cron-runtime, exec-tool-safety, and mcp-tool-call helper JSON (cronRunMs, execSafeCommandSucceeded, execDangerousCommandBlocked, execOutputTruncated, mcpToolsCallMs, invalid MCP attribution, process leaks/errors).
  • Added self-check coverage: tool-runtime-evidence-evaluation fails if those helper outputs stop being parsed/enforced.
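For readers unfamiliar with the item type named in the P1 fix, the sketch below shows roughly what a Responses-style function_call item for the exec tool could look like. It assumes the general OpenAI Responses item shape (type, call_id, name, arguments as a JSON string); the `buildExecFunctionCall` helper, the call id, and the tool name string are hypothetical and are not taken from support/mock-openai-server.mjs.

```javascript
// Hypothetical sketch of a Responses-style function_call item that requests the exec tool.
function buildExecFunctionCall(callId, command) {
  if (typeof command !== "string" || command.length === 0) {
    throw new TypeError("exec tool call requires a non-empty command string");
  }
  return {
    type: "function_call",
    call_id: callId,
    name: "exec", // assumed tool name
    // Responses function_call items carry their arguments as a JSON-encoded string.
    arguments: JSON.stringify({ command }),
  };
}

// "<sentinel>" is kept as the placeholder used above; the real sentinel path is not shown here.
const dangerousExecCall = buildExecFunctionCall("call_dangerous_1", "rm -rf <sentinel>");
console.log(dangerousExecCall);
```

The helper rejects an empty command because, per the PR body, the exec tool schema requires the command argument.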

Validation:

  • npm run check -> PASS, 162/162 checks.
  • npm run test:snapshots -> PASS, 18/18 snapshots.
  • Real disposable Kova run: cron-runtime on runtime:stable -> PASS, run kova-260522-011201-c292fb; evidence: cronRunMs=698, cronRunCompleted=true, cronTriggerAttributed=true.
  • Real disposable Kova run: mcp-tool-call on runtime:stable -> PASS, run kova-260522-011137-720f05; evidence: mcpToolsCallMs=184, safeToolSucceeded=true, invalidToolErrorAttributed=true.
  • Real disposable Kova run: exec-tool-safety on runtime:stable -> Kova tool checks PASS but scenario verdict FAIL due to product RSS threshold; run kova-260522-011101-dd863a; evidence: safeCommandSucceeded=true, dangerousCommandBlocked=true, dangerousPayloadExecuted=false, outputTruncated=true, timeoutMs=1006, processLeaks=0. Remaining violation is OpenClaw/product RSS: tool-runtime peak RSS 694.8 MB > 500 MB.

So the reviewer-reported Kova false evidence paths are patched. The only failure observed in the exec proof run is now a real product resource signal, not Kova misidentifying its own helper behavior.

@clawsweeper clawsweeper Bot changed labels on May 22, 2026.

Added:

  • proof: sufficient: Contributor real behavior proof is sufficient.
  • rating: 🦐 gold shrimp: Decent PR readiness signal, but merge confidence is limited.
  • status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action.

Removed:

  • rating: 🧂 unranked krab: Not merge-ready due to missing proof or serious correctness/safety concerns.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask.
  • merge-risk: 🚨 security-boundary: 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data.

@JuanHuaXu
Author

Follow-up remediation pushed in 646b744 for the latest P2/P3 findings.

Fixes:

  • P2 network frontage: waitForTcp() now only checks child exit state when a child process is actually passed, so validation probes without a child are allowed.
  • P2 cron gates: evaluator now enforces cronRunCompleted and cronTriggerAttributed boolean thresholds. Added a negative self-check where cronTriggerAttributed=false must fail.
  • P3 MCP metric naming: evaluator now reports/violates mcpToolCallErrorAttributed, matching the surface/profile/known metric id.
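The network-frontage fix above can be pictured with a minimal sketch of a TCP wait loop that only consults child exit state when a child is supplied. This is an illustration of the described behavior, not the code in src/network-frontage.mjs: the `waitForTcpSketch` name, its options, and the injected `probe` function are all hypothetical. The `exitCode` property matches Node's ChildProcess objects, where it stays null while the process is running.

```javascript
// Hypothetical sketch: poll a TCP probe, checking child exit state only if a child was passed.
// `probe` is an injected async function that resolves to true once the port accepts connections.
async function waitForTcpSketch({ probe, child = null, attempts = 10, delayMs = 50 }) {
  for (let i = 0; i < attempts; i += 1) {
    // Validation probes pass no child, so this branch is skipped for them.
    if (child && child.exitCode !== null && child.exitCode !== undefined) {
      throw new Error(`child exited with code ${child.exitCode} before the port was ready`);
    }
    if (await probe()) return true;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}

// Usage without a child: the probe succeeds on its third call.
let probeCalls = 0;
waitForTcpSketch({ probe: async () => ++probeCalls >= 3, delayMs: 1 }).then((ready) => {
  console.log("ready without child:", ready);
});
```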

Validation:

  • node bin/kova.mjs self-check --json -> PASS (ok: true), including new network-frontage-no-child-tcp and negative cron attribution coverage.
  • npm run test:snapshots -> PASS, 18/18.
  • Reviewer acceptance command was attempted: node bin/kova.mjs run --target runtime:stable --scenario fresh-install --network-frontage loopback --worker-id 7 --execute --json.
    • Result: BLOCKED by local macOS privilege, not the fixed no-child validation bug.
    • Run id: kova-260522-074939-8e800b.
    • Blocker: ifconfig: ioctl (SIOCAIFADDR): permission denied while adding 127.0.1.17 alias.
    • Cleanup verified: ocm env list -> No environments.

So the code-level review blockers are patched. The live loopback command reaches the expected alias setup path here, but this Codex session cannot grant elevated ifconfig lo0 alias permissions.

@clawsweeper

clawsweeper Bot commented May 22, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

@clawsweeper

clawsweeper Bot commented May 22, 2026

🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.


@JuanHuaXu
Author

Follow-up remediation pushed in 30c2d69 for the latest P2 metric-contract findings.

Fixes:

  • P2 dirty-plugin gates: evaluator now collects kova.dirtyPluginState.v1 fixture output and plugin command results, then enforces dirtyPluginDetected, dirtyPluginReported, dirtyPluginChecksumPreserved, doctorDestructiveChangeCount, pluginsUsableWithDirtyState, and gatewaySurvivedDirtyPlugin. Missing evidence now fails active dirty-plugin thresholds instead of silently passing.
  • P2 release recovery gates: evaluator now derives/enforces updateRetryVersionDrift, rollbackAvailable, rollbackSucceeded, pluginsUsableAfterUpgrade, pluginsUsableAfterRollback, and rollbackPreservedPluginData from upgrade/retry version output, rollback restore output, plugin-health commands, rollback plugin commands, and post-rollback dirty fixture verification. Missing evidence now fails active release-recovery thresholds.
  • Added plugin-recovery-evidence-evaluation self-check coverage with negative cases for missing dirty reporting, checksum/destructive doctor failure, retry version drift, missing/failed rollback, and post-rollback plugin unusability.
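The fail-closed behavior described in these fixes, where missing evidence fails an active threshold instead of silently passing, can be sketched as follows. The `evaluateRequiredGates` function and the `{ equals }` / `{ max }` rule shapes are hypothetical, and the threshold values are assumptions; only the metric names are taken from the list above.

```javascript
// Hypothetical sketch of fail-closed gating: an active threshold with no evidence is a violation.
function evaluateRequiredGates(thresholds, evidence) {
  const violations = [];
  for (const [metric, rule] of Object.entries(thresholds)) {
    const value = evidence ? evidence[metric] : undefined;
    if (value === undefined || value === null) {
      violations.push({ metric, reason: "missing evidence" });
    } else if ("equals" in rule && value !== rule.equals) {
      violations.push({ metric, reason: `expected ${rule.equals}, got ${value}` });
    } else if ("max" in rule && !(value <= rule.max)) {
      violations.push({ metric, reason: `${value} exceeds max ${rule.max}` });
    }
  }
  return violations;
}

// Metric names come from the fix description; the expected values are assumed.
const dirtyPluginThresholds = {
  dirtyPluginDetected: { equals: true },
  dirtyPluginReported: { equals: true },
  dirtyPluginChecksumPreserved: { equals: true },
  doctorDestructiveChangeCount: { max: 0 },
  pluginsUsableWithDirtyState: { equals: true },
  gatewaySurvivedDirtyPlugin: { equals: true },
};

// A helper that produced no output at all fails every active gate.
console.log(evaluateRequiredGates(dirtyPluginThresholds, null).length); // 6
```

A negative case of this kind, with evidence deliberately absent, is what the new self-check coverage is described as exercising.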

Validation:

  • node bin/kova.mjs self-check --json -> PASS (ok: true), including the new plugin recovery evidence check.
  • npm run test:snapshots -> PASS, 18/18.
  • git diff --check -> PASS.

This addresses the reviewer concern by making the advertised dirty-plugin and release-update-recovery surface thresholds executable gates rather than planning-only metric names.

@clawsweeper

clawsweeper Bot commented May 22, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.


@JuanHuaXu
Author

Follow-up remediation pushed in 8e7ffa4 for the latest P2/P3 findings.

Fixes:

  • P2 network frontage: a partial loopback allocation is now registered immediately after alias creation, before proxy startup, so stopNetworkFrontage() can clean up the created aliases if proxy startup or validation fails.
  • P2 exec evidence: active exec thresholds now fail closed on missing/null helper evidence using required gates for execSafeCommandMs, execTimeoutMs, execSafeCommandSucceeded, execDangerousCommandBlocked, execOutputTruncated, and execProcessLeaks.
  • P3 README inventory: refreshed counts to 56 scenarios / 37 surfaces / 37 states / 10 profiles from node bin/kova.mjs plan --json.
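
The network frontage fix above follows a register-before-start pattern, which can be sketched as follows. Only the name `stopNetworkFrontage()` comes from the comment; the `state` object, the injected `deps` helpers (`createAlias`, `startProxy`, `stopProxy`, `removeAlias`), and the synchronous control flow are hypothetical stand-ins, not the code in this PR.

```javascript
// Hypothetical sketch. Only stopNetworkFrontage() is named in the PR comment;
// the state shape and the injected deps are assumptions for illustration.
function startNetworkFrontage(state, deps) {
  const alias = deps.createAlias();
  // Register the partial allocation right away, before any step that can
  // throw, so cleanup can always find the alias.
  state.frontage = { alias, proxy: null };
  try {
    state.frontage.proxy = deps.startProxy(alias);
  } catch (error) {
    // Proxy startup or validation failed: release what was already created.
    stopNetworkFrontage(state, deps);
    throw error;
  }
  return state.frontage;
}

// Tears down whatever part of the frontage was registered, in reverse order.
function stopNetworkFrontage(state, deps) {
  const frontage = state.frontage;
  if (!frontage) return;
  if (frontage.proxy) deps.stopProxy(frontage.proxy);
  if (frontage.alias) deps.removeAlias(frontage.alias);
  state.frontage = null;
}
```

If registration happened only after the proxy started successfully, a startup failure would leave `state.frontage` empty and the alias would leak, which is the situation the fix is described as closing.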

Validation:

  • node bin/kova.mjs self-check --json -> PASS (ok: true), including new partial network frontage invariant and missing/incomplete exec evidence checks.
  • npm run test:snapshots -> PASS, 18/18.
  • git diff --check -> PASS.

@clawsweeper

clawsweeper Bot commented May 22, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.


@clawsweeper clawsweeper Bot added and removed labels May 22, 2026

Added:

  • status: 🔁 re-review loop (A fresh ClawSweeper review was explicitly requested after the latest review.)
  • status: ⏳ waiting on author (ClawSweeper has contributor-facing work open and is waiting for author action.)
  • rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.)
  • status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)
  • rating: 🦐 gold shrimp (Decent PR readiness signal, but merge confidence is limited.)

Removed:

  • status: ⏳ waiting on author
  • status: 🔁 re-review loop
  • rating: 🦐 gold shrimp
  • rating: 🐚 platinum hermit

@clawsweeper clawsweeper Bot added and removed labels May 22, 2026

Added:

  • rating: 🌊 off-meta tidepool (PR readiness rating does not apply to this item.)

Removed:

  • status: ⏳ waiting on author (ClawSweeper has contributor-facing work open and is waiting for author action.)

@clawsweeper

clawsweeper Bot commented May 22, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.


@clawsweeper clawsweeper Bot added and removed labels May 22, 2026

Added:

  • proof: sufficient (Contributor real behavior proof is sufficient.)
  • rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.)
  • feature: ✨ showcase (ClawSweeper spotlight: unusually compelling feature idea for maintainer attention.)
  • status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)
  • merge-risk: 🚨 compatibility (🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades.)
  • rating: 🦐 gold shrimp (Decent PR readiness signal, but merge confidence is limited.)
  • status: 📣 needs proof (The PR needs real behavior proof before ClawSweeper can clear the contributor ask.)

Removed:

  • rating: 🌊 off-meta tidepool (PR readiness rating does not apply to this item.)
  • proof: sufficient
  • rating: 🐚 platinum hermit
  • status: 👀 ready for maintainer look

@clawsweeper

clawsweeper Bot commented May 23, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.


@clawsweeper clawsweeper Bot added and removed labels May 23, 2026

Added:

  • proof: sufficient (Contributor real behavior proof is sufficient.)
  • rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.)
  • status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)

Removed:

  • rating: 🦐 gold shrimp (Decent PR readiness signal, but merge confidence is limited.)
  • status: 📣 needs proof (The PR needs real behavior proof before ClawSweeper can clear the contributor ask.)

@JuanHuaXu force-pushed the codex/kova-scenarios-and-fixes branch from ff5cd8e to 8a4d3a1 on May 28, 2026 05:15
@JuanHuaXu force-pushed the codex/kova-scenarios-and-fixes branch from 8a4d3a1 to 24f5b9f on May 28, 2026 05:20
@clawsweeper clawsweeper Bot added and removed labels May 28, 2026

Added:

  • rating: 🦐 gold shrimp (Decent PR readiness signal, but merge confidence is limited.)
  • status: 📣 needs proof (The PR needs real behavior proof before ClawSweeper can clear the contributor ask.)

Removed:

  • proof: sufficient (Contributor real behavior proof is sufficient.)
  • rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.)
  • status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)

@JuanHuaXu
Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 28, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.



Labels

  • feature: ✨ showcase (ClawSweeper spotlight: unusually compelling feature idea for maintainer attention.)
  • merge-risk: 🚨 automation (🚨 Merging this PR could break CI, automerge, proof capture, label sync, or automation.)
  • merge-risk: 🚨 compatibility (🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades.)
  • P2 (Normal priority bug or improvement with limited blast radius.)
  • proof: sufficient (Contributor real behavior proof is sufficient.)
  • rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.)
  • status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)
