fix(tests): add verbose to Claude stream-json evals #1687
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates test harness scripts to emit more detailed CLI output while also tightening contribution guidance around human involvement and PR targeting.
Changes:
- Add `--verbose` to multiple `claude` test invocations (including multi-turn scripts).
- Reorder flags in one test script (move `--output-format stream-json` below `--verbose`).
- Update contribution docs and PR template to require explicit human review/authorization and to target `main` as the PR base branch.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/subagent-driven-dev/run-test.sh | Reorders claude CLI flags around --verbose / --output-format. |
| tests/skill-triggering/run-test.sh | Adds --verbose to test invocation logging. |
| tests/explicit-skill-requests/run-test.sh | Adds --verbose to test invocation logging. |
| tests/explicit-skill-requests/run-multiturn-test.sh | Adds --verbose to each multi-turn invocation. |
| tests/explicit-skill-requests/run-haiku-test.sh | Adds --verbose to each step invocation. |
| tests/explicit-skill-requests/run-extended-multiturn-test.sh | Adds --verbose to each step invocation. |
| tests/explicit-skill-requests/run-claude-describes-sdd.sh | Adds --verbose to each step invocation. |
| CLAUDE.md | Updates human involvement requirements and changes PR base-branch guidance to main. |
| .github/PULL_REQUEST_TEMPLATE.md | Aligns template with main targeting and adds “review vs authorization” options. |
    --plugin-dir "$PLUGIN_DIR" \
    --dangerously-skip-permissions \
    --max-turns "$MAX_TURNS" \
    --verbose \
    --output-format stream-json \
    > "$LOG_FILE" 2>&1 || true
Verified against the logs generated by the targeted evals: the captured files remain valid JSONL with stdout/stderr combined. `jq -s` parsed all three logs successfully (`skill-triggering/systematic-debugging`, `explicit-skill-requests/systematic-debugging`, and `explicit-skill-requests/subagent-driven-development`). Claude Code 2.1.163 rejects `claude -p --output-format stream-json` without `--verbose`, so gating or omitting this flag would reintroduce the harness failure this PR fixes.
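For reference, the JSONL check described above could be scripted roughly as follows. This is a minimal sketch, not the commands actually run for this review: the function name `is_valid_jsonl` is hypothetical, and it assumes `jq` is installed.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if every value in the log parses as JSON.
# is_valid_jsonl is a hypothetical helper name, not part of the repo.
is_valid_jsonl() {
    local log_file="$1"
    # -s slurps all JSON values in the file into one array.
    # -e sets the exit status from the filter result, so an empty log
    # (length 0) is rejected as well as a log that fails to parse.
    jq -s -e 'length > 0' "$log_file" > /dev/null 2>&1
}
```

A call such as `is_valid_jsonl "$LOG_FILE" || echo "log is not valid JSONL"` would flag a log that contains a plain-text CLI error instead of stream events.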
    --plugin-dir "$PLUGIN_DIR" \
    --dangerously-skip-permissions \
    --max-turns 2 \
    --verbose \
    --output-format stream-json \
    > "$TURN1_LOG" 2>&1 || true
I ran this multi-turn script after the PR change. It passed: Turn 3 triggered `superpowers:subagent-driven-development` with no premature tool use. The generated logs were parseable JSONL and modest in size: turn 1 was 36,276 bytes, turn 2 was 18,135 bytes, and turn 3 was 31,544 bytes. Since `--verbose` is required by Claude Code 2.1.163 for `--output-format stream-json`, keeping it unconditional preserves the default eval path.
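A skill-trigger check of this kind could be sketched as below. The helper name `skill_triggered` is hypothetical, and the event shape is an assumption: the sketch presumes each stream event is compact single-line JSON in which a Skill call appears as `"name":"Skill"` with an input field `"skill":"<name>"`. It does not cover the premature-tool-use check.

```shell
#!/usr/bin/env bash
# Sketch: succeed if the log records a Skill tool call for the given skill.
# Assumption (not confirmed by this PR): events are compact one-line JSON
# containing "name":"Skill" and "skill":"<skill name>" on the same line.
skill_triggered() {
    local log_file="$1" skill="$2"
    # First keep only lines that mention the Skill tool, then look for the
    # requested skill name on those lines. -F treats patterns as literals.
    grep -F '"name":"Skill"' "$log_file" | grep -qF "\"skill\":\"$skill\""
}
```

For example, `skill_triggered "$TURN3_LOG" "superpowers:subagent-driven-development"` would return success only when turn 3 invoked that skill; `$TURN3_LOG` is an assumed variable name mirroring `$TURN1_LOG` above.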
Your changes modify our PR guidance.
Who is submitting this PR? (required)
What problem are you trying to solve?
Running the existing Claude Code eval for `systematic-debugging` failed before any skill could be invoked, with the CLI error `Error: When using --print, --output-format=stream-json requires --verbose`.
This made the `systematic-debugging` skill appear not to trigger, but the root cause was the test harness invoking `claude -p` with `--output-format stream-json` and no `--verbose`. Claude Code `2.1.163` rejects that combination before the model/plugin session starts.
While preparing the PR, the contributor guidance also said an agent could not submit unless a human had already reviewed the complete diff, and it said PRs should target `dev`. My human partner explicitly authorized agent-created PRs for manual review after validation, and clarified that feature branches should open PRs against `origin/main`.
I reproduced the eval failure with:
Both reported no skill invocation because the raw log contained only the CLI error above.
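To keep this failure mode from being misread as a skill-triggering problem, a harness could look for the CLI error before interpreting the log. The following is a minimal sketch; the function name `classify_eval_log` and its two output labels are hypothetical, and only the error text comes from the failure described here.

```shell
#!/usr/bin/env bash
# Sketch: tell a harness failure apart from a real eval result.
# The error text is the one the CLI printed; everything else is illustrative.
CLI_ERROR='--output-format=stream-json requires --verbose'

classify_eval_log() {
    local log_file="$1"
    # -F matches the text literally; -e stops grep from parsing the
    # leading dashes of the pattern as its own options.
    if grep -qF -e "$CLI_ERROR" "$log_file"; then
        echo "harness-error"
    else
        echo "needs-inspection"
    fi
}
```

A log classified as `harness-error` says nothing about whether the skill would have triggered, because the session never started.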
What does this PR change?
Adds `--verbose` to the remaining Claude Code eval scripts that combine `claude -p` with `--output-format stream-json`, matching the existing accepted fix in PR #452. It also updates the contributor guidance and PR template to allow explicitly authorized agent-created PRs, and corrects the PR base-branch guidance to `main`.
Is this change appropriate for the core library?
Yes. This fixes the general-purpose Claude Code evaluation harness for core Superpowers skills and corrects repository-level contributor workflow guidance. It is not project-specific, team-specific, or tied to any third-party service.
What alternatives did you consider?
- `systematic-debugging` skill description or body. Rejected: the raw logs showed the model never started; this was not a skill-triggering problem.
- `--output-format stream-json`. Rejected: the tests parse JSON stream events to verify Skill tool invocation.
- `systematic-debugging` test path. Rejected: the same CLI contract affects every `claude -p` plus `stream-json` eval, and PR #452 ("Fix: add --verbose flag for stream-json output in SDD test runner") already established `--verbose` as the accepted fix for this exact class of failure.
- … `main`.
Does this PR contain multiple unrelated changes?
No. The eval harness fix is the original bug fix. The contributor-guidance edits are tied to submitting this fix correctly: they replace the hard "do not submit" gate with honest human review/authorization disclosure, and correct the base-branch instruction to `main` as clarified by my human partner.
Existing PRs
PR #452 merged the same `--verbose` fix for `tests/subagent-driven-dev/run-test.sh`. This PR applies that accepted pattern to the remaining eval scripts that still lacked the flag.
PR #914 is a separate open one-line documentation fix for a broken `Phase 4.5` reference in `systematic-debugging/SKILL.md`; this PR does not duplicate that prose fix.
I also searched for prior art around explicit agent PR authorization and the branch-target guidance; I did not find a direct duplicate of those contributor-guidance changes.
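Scripts that still lack the flag could be located with a check along these lines. This is a sketch rather than the search actually used for this PR: `find_missing_verbose` is a hypothetical name, and the test is a file-level heuristic, so a script with several invocations where only one passes `--verbose` would not be reported.

```shell
#!/usr/bin/env bash
# Sketch: list *.sh files that request stream-json output but never
# mention --verbose. File-level heuristic only; not part of the repo.
find_missing_verbose() {
    local dir="$1" file
    # grep -rl prints the names of matching files; -e protects the
    # pattern's leading dashes from option parsing.
    grep -rl --include='*.sh' -e '--output-format stream-json' "$dir" | sort |
    while IFS= read -r file; do
        # Report the file only when it has no --verbose anywhere.
        grep -q -e '--verbose' "$file" || printf '%s\n' "$file"
    done
}
```

Running `find_missing_verbose tests` from the repository root would be expected to print nothing once every stream-json eval script passes `--verbose`.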
Environment tested
New harness support (required if this PR adds a new harness)
Not applicable. This PR does not add a new harness.
Clean-session transcript for "Let's make a react todo list"
Evaluation
- `use systematic-debugging to figure out what's wrong` and the natural failing-test prompt in `tests/skill-triggering/prompts/systematic-debugging.txt` both failed because the eval scripts exited with `Error: When using --print, --output-format=stream-json requires --verbose`.
- `systematic-debugging` evals reported no Skill tool invocation and raw logs contained only the CLI error.
- `systematic-debugging` evals invoked `superpowers:systematic-debugging`; an additional explicit `subagent-driven-development` eval invoked `superpowers:subagent-driven-development` with no premature tool invocation.
Commands run after the change:
Rigor
- … `superpowers:writing-skills` and completed adversarial pressure testing (paste results below)
This is not a skill-content change; it fixes eval harness invocation. I used `superpowers:systematic-debugging` to trace the false skill failure to the CLI invocation layer and `superpowers:writing-skills` to verify that skill-body changes require behavioral evidence.
Human review / authorization