Skip to content

[codex] fix shell runtime value quoting#30

Merged
scoootscooob merged 2 commits into
mainfrom
codex/fix-shell-exec-rendering
Jun 2, 2026
Merged

[codex] fix shell runtime value quoting#30
scoootscooob merged 2 commits into
mainfrom
codex/fix-shell-exec-rendering

Conversation

@scoootscooob
Copy link
Copy Markdown
Collaborator

@scoootscooob scoootscooob commented Jun 1, 2026

Summary

  • add quote-aware shell rendering for runtime placeholders
  • use it for shell=True execution checks in both verifier implementations
  • use the same renderer for background service commands
  • add {name:raw} as an explicit escape hatch for benchmark-owned shell fragments
  • document the shell placeholder contract in the v0.4 spec
  • add regression coverage for whitespace paths, shell metacharacters, pre-quoted placeholders, raw shell fragments, and background services

Root Cause

ExecutionCheck.shell defaults to True, but the shell path rendered placeholders with plain string substitution and then passed the command to /bin/sh -c. Runtime values containing whitespace were split by the shell, and metacharacters could be interpreted as shell syntax.

Compatibility

The new renderer treats {name} as literal data and quotes or escapes it according to shell quote context. Templates that intentionally need a runtime value to expand as a shell fragment can opt in at that insertion point with {name:raw}. The spec documents that {name:raw} is only for benchmark-owned fragments and should not be used for agent-produced or otherwise untrusted values.

Fixes #29.

Validation

  • /tmp/clawbench-shell-render-venv/bin/python -m pytest -q tests/test_execution_shell_rendering.py tests/test_environment_files.py tests/test_environment.py tests/test_services.py
  • /tmp/clawbench-shell-render-venv/bin/python -m ruff check clawbench/render.py clawbench/environment.py clawbench/environment_files.py clawbench/services.py tests/test_execution_shell_rendering.py tests/test_services.py
  • /tmp/clawbench-shell-render-venv/bin/python -m compileall clawbench/render.py clawbench/environment.py clawbench/environment_files.py clawbench/services.py
  • git diff --check

@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented Jun 1, 2026

Codex review: needs maintainer review before merge. Reviewed June 1, 2026, 4:58 PM ET / 20:58 UTC.

Summary
Adds quote-aware rendering for runtime placeholders in shell-mode execution checks and background services, documents {name:raw}, and adds shell-rendering regression tests.

Reproducibility: yes. from source inspection: current main renders shell-mode commands with render_template and then passes them to /bin/sh, so whitespace and metacharacters in placeholder values are interpreted by the shell. I did not run local reproduction commands because this was a read-only review.

Review metrics: 3 noteworthy metrics.

  • Runtime Shell Paths: 3 changed. Both execution-check implementations and background services now share the new placeholder semantics.
  • Regression Coverage: 1 file added, 1 file extended. The tests cover both verifier implementations plus background service command rendering.
  • Diff Scope: 7 files, +269/-10. The patch is focused but touches core benchmark runtime behavior and the v0.4 task contract.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🌊 off-meta tidepool
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P2] Maintainers should explicitly accept the literal-data default plus {name:raw} upgrade path before merge.

Risk before merge

  • [P1] Existing external task templates that intentionally relied on placeholder values expanding to multiple shell words, globs, redirects, pipes, or other shell syntax will need to switch those insertion points to {name:raw}.
  • [P1] Maintainers should explicitly accept {name:raw} as the compatibility path for benchmark-owned fragments before this shell placeholder contract becomes the default.

Maintainer options:

  1. Approve The Raw Escape Hatch Contract
    Maintainers can intentionally accept literal-data placeholders by default and use {name:raw} as the upgrade path for benchmark-owned shell fragments.
  2. Require A Task Migration Audit
    Before merge, maintainers can ask for a quick audit or migration note covering any known task templates that rely on placeholder values as shell syntax.

Next step before merge

  • No automated repair is indicated; the remaining action is maintainer approval of the shell placeholder compatibility contract for this collaborator-authored PR.

Security
Cleared: No supply-chain or secret-handling regression was found; the patch reduces shell metacharacter injection risk while leaving raw fragments explicitly benchmark-owned.

Review details

Best possible solution:

Merge a maintainer-approved literal-data shell placeholder contract with {name:raw} as the documented compatibility escape hatch, keeping the regression coverage across both execution-check implementations and background services.

Do we have a high-confidence way to reproduce the issue?

Yes, from source inspection: current main renders shell-mode commands with render_template and then passes them to /bin/sh, so whitespace and metacharacters in placeholder values are interpreted by the shell. I did not run local reproduction commands because this was a read-only review.

Is this the best way to solve the issue?

Yes, assuming maintainers accept the compatibility contract: quote-aware literal rendering is the narrow fix for runtime values as data, and {name:raw} gives benchmark-owned fragments an explicit opt-in path.

AGENTS.md: not found in the target repository.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 1ca672acf013.

Label changes

Label changes:

  • add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🌊 off-meta tidepool and patch quality is 🐚 platinum hermit.
  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Not applicable: The external contributor proof gate does not apply to this collaborator-authored PR; the PR body lists validation commands and GitHub checks for the head are successful.
  • remove rating: 🦐 gold shrimp: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.
  • remove status: ⏳ waiting on author: Current PR status label is status: 👀 ready for maintainer look.

Label justifications:

  • P2: This is a normal-priority bug and hardening fix with limited blast radius but real compatibility implications for benchmark task authors.
  • merge-risk: 🚨 compatibility: Changing shell-mode placeholders to literal data by default can break existing tasks that used runtime values as shell fragments unless they adopt {name:raw}.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🌊 off-meta tidepool and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Not applicable: The external contributor proof gate does not apply to this collaborator-authored PR; the PR body lists validation commands and GitHub checks for the head are successful.
Evidence reviewed

What I checked:

  • Current main shell execution path: Current main renders ExecutionCheck.command with plain render_template before passing it to asyncio.create_subprocess_shell when spec.shell is true, so the linked shell quoting problem is not already fixed on main. (clawbench/environment.py:111, 1ca672acf013)
  • Current main file verifier path: The file-based verifier has the same render_template before create_subprocess_shell behavior on current main. (clawbench/environment_files.py:103, 1ca672acf013)
  • Current main background service path: Background services render commands with render_template and launch them with shell=True, so service command placeholders share the same shell interpretation risk. (clawbench/services.py:83, 1ca672acf013)
  • PR shell renderer: The PR head adds render_shell_template, quotes normal placeholders according to shell quote context, and leaves {name:raw} as the explicit unescaped path. (clawbench/render.py:74, c621e6b2004d)
  • PR documentation: The PR documents that shell-mode placeholders are literal data by default and that {name:raw} is only for benchmark-owned shell fragments. (CLAWBENCH_V0_4_SPEC.md:167, c621e6b2004d)
  • PR regression coverage: The PR adds regression tests for unquoted runtime values, shell metacharacters, {name:raw}, pre-quoted placeholders, and background service commands. (tests/test_execution_shell_rendering.py:23, c621e6b2004d)

Likely related people:

  • scoootscooob: Blame ties the original render_template, shell execution paths, and background service command rendering to commits 95b226d and 56531fb, so this person is connected to the current runtime surface beyond this PR. (role: introduced behavior and current area contributor; confidence: high; commits: 95b226dfedb0, 56531fbf432b; files: clawbench/render.py, clawbench/environment.py, clawbench/environment_files.py)
  • Peter Steinberger: Commit da0bf52 introduced render_argv_template and updated shell=False execution checks to preserve placeholder values as single argv arguments. (role: adjacent shell-rendering contributor; confidence: medium; commits: da0bf52c500e; files: clawbench/render.py, clawbench/environment.py, clawbench/environment_files.py)
  • Vincent Koc: Commit ed9adf8 recently changed runtime path hardening in the same environment and service code paths touched by this PR. (role: recent runtime hardening contributor; confidence: medium; commits: ed9adf8d8451, 88ab0f55646e; files: clawbench/environment.py, clawbench/services.py, tests/test_environment.py)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. P2 Normal priority bug or improvement with limited blast radius. merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. labels Jun 1, 2026
@scoootscooob scoootscooob marked this pull request as ready for review June 1, 2026 18:27
@scoootscooob scoootscooob requested a review from a team as a code owner June 1, 2026 18:27
@clawsweeper clawsweeper Bot added rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels Jun 1, 2026
@scoootscooob
Copy link
Copy Markdown
Collaborator Author

@clawsweeper re-review

@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented Jun 1, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels Jun 1, 2026
@scoootscooob scoootscooob merged commit 4f752b6 into main Jun 2, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. P2 Normal priority bug or improvement with limited blast radius. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: shell=True execution checks mis-tokenize runtime values containing whitespace or shell metacharacters

1 participant