WIP: Harness resilience — relaxed phase-2 gate, enriched resume diagnostics, timestamped summaries#46
Conversation
…, timestamped summaries Make the run-summary gate actionable for weak models. Completion gate - check_phase_graceful_completion now returns (bool, list[str]); the list carries human-readable missing-artifact messages consumed by the resume prompt builder. - Phase 2 gate relaxes from (pending_fresh AND summary_fresh) to summary_fresh alone. A legitimate '0 new findings' outcome no longer fails. Phase 3 (counter-analysis) provides the second-line review. - build_phase_resume_prompt takes optional failure_details and renders a 'Missing required artifacts:' block when given, otherwise keeps the existing generic reassess wording. Phase prompts - All run-summary paths now use runs/phase-X[-FINDING]-summary-YYYY-MM-DD-HHMMSS.md to prevent overwrites on rerun. Globs in completion.py already use wildcards, so the new format is matched by the gate. Agent files - Removed the 'run summary ... when practical' hedge from all five agent files (auditor, reviewer, validator, exploiter, recon). Phase prompts are now the single source of truth for required durable artifacts. A delegating note replaces each removed line. Runtime callers - harness.py and phase_1.py unpack the new tuple, preserve the short-circuit on check_phase_graceful_completion, and pass phase_failures (or None) to the resume prompt builder. Tests - Updated graceful-completion assertions to assert on the tuple and to assert on the diagnostic fragments (runs/phase-2-summary, itemdb/evidence/<id>/, itemdb/reports, itemdb/notes/sandbox-plan.md, etc.) so regressions in the human-readable messages are caught. - Updated callers that monkeypatch check_phase_graceful_completion to return a tuple. All 676 tests pass; frontmatter and phase-artifact checks pass.
Records the design rationale and verification checklist for the harness resilience changes implemented in 7cd8843. Kept on the feature branch for PR review; the master commit itself ships only the code changes.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (9)
📝 WalkthroughWalkthroughThis PR implements run-summary gating resilience by refactoring phase-completion gating from boolean returns to structured tuples containing success status and human-readable failure details. All phase prompts are standardized to use timestamped run-summary filenames; agent documentation is cleaned of conflicting run-summary obligations and made to defer to phase-prompt specifications; completion logic is enhanced to validate phase-specific artifacts and return exact failure reasons; resume prompts are enriched with targeted guidance for missing artifacts; and comprehensive test coverage validates the new tuple contract and failure-state reset across retry attempts. A separate session API update changes status endpoint querying. ChangesHarness Resilience and Run-Summary Gating
🎯 4 (Complex) | ⏱️ ~60 minutes Possibly Related PRs
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Coverage Report
Generated by pytest-cov on |
pruiz
left a comment
There was a problem hiding this comment.
Review comments added inline. Main theme: align the new “phase prompt is the source of truth for required artifacts” rule with what the completion gates and resume checklists actually enforce.
There was a problem hiding this comment.
Pull request overview
This PR improves phase-mode harness resilience by (1) enriching completion-gate diagnostics to make auto-resume actionable, (2) relaxing the Phase 2 completion gate to allow “0 new findings” runs to pass as long as a run summary is produced, and (3) standardizing run-summary filenames across prompts to be timestamped to avoid overwrites.
Changes:
check_phase_graceful_completionnow returns(ok, failure_details)and Phase 2 gating is relaxed to require only a fresh run summary.- Resume prompts can include an explicit “Missing required artifacts” block populated from gate diagnostics.
- Phase prompts now instruct timestamped run-summary filenames; agent docs remove “when practical” hedges and delegate artifact requirements to phase prompts; tests updated for the tuple return.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/phases/completion.py | Returns (bool, list[str]) from completion gate; relaxes Phase 2 gate; adds failure-detail-driven resume prompt output; updates Phase 2 checklist. |
| tools/codecome/harness.py | Unpacks the new completion-gate tuple and forwards failure details into resume prompts. |
| tools/codecome/phase_1.py | Unpacks the new completion-gate tuple and forwards failure details into resume prompts for subphases. |
| tests/test_render_settings_propagation.py | Updates monkeypatches for the new tuple return shape. |
| tests/test_phases_completion.py | Updates assertions to validate (ok, failures) and adds coverage for Phase 2 relaxed gate + diagnostics + resume prompt rendering. |
| tests/test_phase_graceful_completion_subphases.py | Updates subphase tests for tuple return and validates failure detail presence. |
| tests/test_phase_1_mid_turn_forgiveness.py | Updates mocks to return (ok, failures) tuples. |
| tests/test_phase_1_codeql_plan_repair.py | Updates mocks to return (ok, failures) tuples. |
| prompts/phase-1a-profile.md | Switches run-summary path guidance to timestamped filename. |
| prompts/phase-1b-recon.md | Switches run-summary path guidance to timestamped filename; removes hardcoded summary path mentions. |
| prompts/phase-1c-sandbox.md | Switches run-summary path guidance to timestamped filename; removes hardcoded summary path mentions in prose. |
| prompts/phase-2-audit.md | Switches run-summary path guidance to timestamped filename. |
| prompts/phase-4-validate.md | Switches run-summary path guidance to timestamped filename and updates example. |
| prompts/phase-5-exploit.md | Switches run-summary path guidance to timestamped filename and updates example. |
| prompts/phase-6-report.md | Switches run-summary path guidance to timestamped filename. |
| .opencode/agents/auditor.md | Removes run-summary “when practical” guidance; adds delegation note pointing to phase prompt requirements. |
| .opencode/agents/reviewer.md | Removes run-summary references; adds delegation note pointing to phase prompt requirements. |
| .opencode/agents/recon.md | Removes run-summary references; adds delegation note pointing to phase prompt requirements. |
| .opencode/agents/exploiter.md | Removes run-summary references; adds delegation note pointing to phase prompt requirements. |
| .opencode/agents/validator.md | Removes run-summary references; adds delegation note pointing to phase prompt requirements. |
| .project/harness-resilience-summary-gating-plan.md | Adds the design/implementation plan document for the change set. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Greptile SummaryThis PR makes the phase harness more resilient for weak models by enriching resume diagnostics with specific missing-artifact details, relaxing the Phase 2 completion gate to require only a fresh run summary (zero new findings is a valid outcome), and switching run-summary filenames to a timestamped format to prevent overwrites on re-run.
Confidence Score: 5/5Safe to merge; the changes are well-scoped harness retry/diagnostic improvements with 676 passing tests. All structural changes — tuple return from the completion gate, failure-details forwarding, phase_failures/phase_ok initialisation — are correctly implemented and covered by targeted new tests. The session endpoint migration is the most externally-facing change but degrades gracefully to None on failure. tools/phases/completion.py — the Phase 3 glob inconsistency and the dead code in _find_finding_file are minor but worth a second look before the final merge. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[_run_single_attempt] --> B{returncode?}
B -- "!= 0" --> INFRA[Infrastructure error\nfatal retry or break]
B -- "== 0" --> C{finish_warning?}
C -- "None" --> D{FINISH_TERMINAL_OK?}
C -- "set" --> E{mid-turn AND\nno permission error?}
E -- "yes" --> F[check_phase_graceful_completion\nreturns bool, failures]
F -- "ok=True" --> G[graceful_forgiveness\nfinish_warning=None]
F -- "ok=False" --> H[returncode=2]
E -- "no" --> H
D -- "yes" --> I[check_phase_graceful_completion]
I -- "ok=True" --> J[continue to validation]
I -- "ok=False" --> K[returncode=2\nfinish_warning set]
D -- "no" --> J
J --> L[frontmatter / artifact\nvalidation]
L -- "pass" --> M[Phase complete]
L -- "fail" --> N[resume with repair prompt]
H --> O{retry budget?}
K --> O
O -- "yes" --> P[build_phase_resume_prompt\nfailure_details=phase_failures]
O -- "no" --> Q[exit 2]
P --> A
G --> D
Reviews (2): Last reviewed commit: "fix(completion): keep path diagnostics a..." | Re-trigger Greptile |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tools/codecome/phase_1.py (1)
626-643:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winInitialize
phase_failuresbefore the mid-turn resume path.If a subphase stops mid-turn with a permission error, Lines 627-645 skip the graceful-completion call, so
phase_failuresis never assigned. The auto-resume block still reads it at Line 770, which will raiseUnboundLocalErrorinstead of sending the resume prompt.Suggested fix
step_finish_count = 0 transcript_path: Path = Path() finish_warning: str | None = None subphase_start_time = time.time() @@ while True: attempt_number += 1 + phase_failures: list[str] = [] _reset_subagent_state() finish_warning = NoneAlso applies to: 768-770
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/codecome/phase_1.py` around lines 626 - 643, The variable phase_failures can be unbound when the mid-turn resume path skips calling check_phase_graceful_completion; to fix, initialize phase_failures to a safe default (e.g., an empty list or None) before the block that checks finish_warning so it is always defined for the later auto-resume logic that reads phase_failures; update the code around the finish_warning handling (reference symbols: phase_failures, finish_warning, last_permission_error, check_phase_graceful_completion) to set phase_failures = [] (or None) before the conditional or explicitly assign it in the permission-error branch so the later auto-resume code can safely use it.
🧹 Nitpick comments (1)
.project/harness-resilience-summary-gating-plan.md (1)
18-18: ⚡ Quick winAdd language identifiers to fenced code blocks to satisfy markdownlint.
These unlabeled fences trigger
MD040. Add explicit languages (text,python,bash) so docs lint stays clean.Also applies to: 112-112, 119-119, 146-146, 150-150, 155-155, 164-164, 178-178, 184-184, 199-199, 205-205, 220-220, 226-226, 241-241, 256-256, 276-276, 296-296
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.project/harness-resilience-summary-gating-plan.md at line 18, Update all unlabeled fenced code blocks (``` ) in the document by adding explicit language identifiers (e.g., ```text, ```bash, ```python) so they satisfy markdownlint MD040; locate every triple-backtick fence in the file (including the instances noted around lines 112, 119, 146, 150, 155, 164, 178, 184, 199, 205, 220, 226, 241, 256, 276, 296) and replace the opening fence with the appropriate language token based on the block content.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/codecome/harness.py`:
- Around line 137-138: phase_failures is being initialized once and then reused
across retry attempts, leaking previous-attempt diagnostics into subsequent
resumes; to fix, reinitialize phase_failures = [] at the start of each retry
attempt (i.e., inside the retry loop, before any calls to
check_phase_graceful_completion() or before logic that sets/reads phase_ok) so
each attempt starts with a fresh failures list; apply the same change to the
other places where phase_failures is declared/used to ensure no stale failures
are carried between attempts.
In `@tools/phases/completion.py`:
- Around line 204-226: The code currently returns immediately after appending
the "required phase-1 notes are not all present" failure, which prevents later
checks (sandbox_state_recorded and summary_fresh) from running and leaves
failures incomplete; update the control flow in the function that builds the
failures list (the block that computes fresh_required via _path_is_fresh over
required_artifacts and appends the NOTES_ROOT/... message using
_PHASE1_REQUIRED_ARTIFACT_NAMES) to remove the early return and let execution
continue to evaluate sandbox_state_recorded and summary_fresh so all missing
items are appended to failures, then return once at the end (so
build_phase_resume_prompt receives a complete failure_details list).
---
Outside diff comments:
In `@tools/codecome/phase_1.py`:
- Around line 626-643: The variable phase_failures can be unbound when the
mid-turn resume path skips calling check_phase_graceful_completion; to fix,
initialize phase_failures to a safe default (e.g., an empty list or None) before
the block that checks finish_warning so it is always defined for the later
auto-resume logic that reads phase_failures; update the code around the
finish_warning handling (reference symbols: phase_failures, finish_warning,
last_permission_error, check_phase_graceful_completion) to set phase_failures =
[] (or None) before the conditional or explicitly assign it in the
permission-error branch so the later auto-resume code can safely use it.
---
Nitpick comments:
In @.project/harness-resilience-summary-gating-plan.md:
- Line 18: Update all unlabeled fenced code blocks (``` ) in the document by
adding explicit language identifiers (e.g., ```text, ```bash, ```python) so they
satisfy markdownlint MD040; locate every triple-backtick fence in the file
(including the instances noted around lines 112, 119, 146, 150, 155, 164, 178,
184, 199, 205, 220, 226, 241, 256, 276, 296) and replace the opening fence with
the appropriate language token based on the block content.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: ec9cf21c-483b-4578-9a23-6af534c4bfeb
📒 Files selected for processing (21)
.opencode/agents/auditor.md.opencode/agents/exploiter.md.opencode/agents/recon.md.opencode/agents/reviewer.md.opencode/agents/validator.md.project/harness-resilience-summary-gating-plan.mdprompts/phase-1a-profile.mdprompts/phase-1b-recon.mdprompts/phase-1c-sandbox.mdprompts/phase-2-audit.mdprompts/phase-4-validate.mdprompts/phase-5-exploit.mdprompts/phase-6-report.mdtests/test_phase_1_codeql_plan_repair.pytests/test_phase_1_mid_turn_forgiveness.pytests/test_phase_graceful_completion_subphases.pytests/test_phases_completion.pytests/test_render_settings_propagation.pytools/codecome/harness.pytools/codecome/phase_1.pytools/phases/completion.py
…, 5, 6; align phase 2 glob; honor resume reason Address PR #46 review feedback (owner + Copilot): - Subphase gates (1a, 1b, 1c) now also check for a fresh runs/phase-X-summary*.md via the new _append_run_summary_check helper in tools/phases/completion.py. - Phase 4 and 5 gates now also check for a fresh runs/phase-{4|5}-<finding>-summary*.md. - Phase 6 gate now also checks for a fresh runs/phase-6-summary*.md. - Phase 3 resume checklist now includes a run-summary line in phase_checklist_lines, matching the already-enforced gate check. - Phase 2 diagnostic message and glob now use the same pattern (no mandatory hyphen after 'summary'): both are runs/phase-2-summary*.md. - build_phase_resume_prompt opening line is no longer a universal 'Your previous run completed'. A new _resume_opener_for_reason helper classifies the recorded reason (infrastructure_error, mid-turn cutoffs, finish failures, graceful_forgiveness, terminal-OK with missing artifacts, or unknown) and renders a context-specific opener. - tools/codecome/phase_1.py defensively initializes phase_failures and phase_ok at the top of _run_subphase, mirroring the pattern already established in tools/codecome/harness.py. This eliminates the UnboundLocalError risk on the path that builds the resume prompt when the run was set to returncode=2 without ever entering the graceful-completion branch. Tests: - 15 new tests across 5 new classes in tests/test_phases_completion.py: * TestPhase2GlobStringMatchesDiagnostic * TestResumePromptOpenerDistinguishesReasons * TestSubphaseGatesRequireRunSummary * TestPhase45And6GatesRequireRunSummary * TestPhase3ChecklistMentionsRunSummary - Existing test_resume_prompt_with_failure_details_lists_missing_artifacts updated to use the unhyphenated glob fragment. - tests/test_phase_graceful_completion_subphases.py: positive tests now create a fresh runs/phase-1{a,b,c}-summary.md; negative tests assert the new summary failure fragment. make tests: 676 -> 691 pass; pre-existing threat-model.md check-phase-artifacts warning is unrelated (verified with stash).
Append section 8 to .project/harness-resilience-summary-gating-plan.md documenting the owner-review-driven expansion: - Subphase 1a/1b/1c, phase 4, phase 5, and phase 6 gates now also check for a fresh run summary. - Phase 3 resume checklist now includes a run-summary line. - Phase 2 diagnostic and glob are now consistent (no mandatory hyphen). - build_phase_resume_prompt opener is now context-specific via _resume_opener_for_reason. - tools/codecome/phase_1.py has defensive init of phase_failures and phase_ok. The original plan explicitly deferred subphase summary gating as a non-goal and did not gate phases 4/5/6 on summary either; the follow-up section records that the review pushed back and that this commit resolves the gap.
…, 5, 6; align phase 2 glob; honor resume reason Address PR #46 review feedback (owner + Copilot): - Subphase gates (1a, 1b, 1c) now also check for a fresh runs/phase-X-summary*.md via the new _append_run_summary_check helper in tools/phases/completion.py. - Phase 4 and 5 gates now also check for a fresh runs/phase-{4|5}-<finding>-summary*.md. - Phase 6 gate now also checks for a fresh runs/phase-6-summary*.md. - Phase 3 resume checklist now includes a run-summary line in phase_checklist_lines, matching the already-enforced gate check. - Phase 2 diagnostic message and glob now use the same pattern (no mandatory hyphen after 'summary'): both are runs/phase-2-summary*.md. - build_phase_resume_prompt opening line is no longer a universal 'Your previous run completed'. A new _resume_opener_for_reason helper classifies the recorded reason (infrastructure_error, mid-turn cutoffs, finish failures, graceful_forgiveness, terminal-OK with missing artifacts, or unknown) and renders a context-specific opener. - tools/codecome/phase_1.py defensively initializes phase_failures and phase_ok at the top of _run_subphase, mirroring the pattern already established in tools/codecome/harness.py. This eliminates the UnboundLocalError risk on the path that builds the resume prompt when the run was set to returncode=2 without ever entering the graceful-completion branch. Tests: - 15 new tests across 5 new classes in tests/test_phases_completion.py: * TestPhase2GlobStringMatchesDiagnostic * TestResumePromptOpenerDistinguishesReasons * TestSubphaseGatesRequireRunSummary * TestPhase45And6GatesRequireRunSummary * TestPhase3ChecklistMentionsRunSummary - Existing test_resume_prompt_with_failure_details_lists_missing_artifacts updated to use the unhyphenated glob fragment. - tests/test_phase_graceful_completion_subphases.py: positive tests now create a fresh runs/phase-1{a,b,c}-summary.md; negative tests assert the new summary failure fragment. make tests: 676 -> 691 pass; pre-existing threat-model.md check-phase-artifacts warning is unrelated (verified with stash).
What
Implements the plan in
.project/harness-resilience-summary-gating-plan.mdto make the run-summary gate actionable for weak models.Why
make phase-2with a weak model failed (exit 2) despite producing PENDING findings, because the harness only knew the run was incomplete — it could not tell the model which artifacts were missing. The auto-resume prompt was generic, so a model that thought it was done could not self-diagnose.Changes
Completion gate (
tools/phases/completion.py)check_phase_graceful_completionnow returns(bool, list[str]); the list carries human-readable missing-artifact messages consumed by the resume prompt builder.pending_fresh AND summary_freshtosummary_freshalone. A legitimate "0 new findings" outcome no longer fails. Phase 3 (counter-analysis) provides the second-line review.build_phase_resume_prompttakes optionalfailure_detailsand renders a "Missing required artifacts:" block when given, otherwise keeps the existing generic reassess wording.Phase prompts (7 files under
prompts/)runs/phase-X[-FINDING]-summary-YYYY-MM-DD-HHMMSS.mdto prevent overwrites on rerun. Globs incompletion.pyalready use wildcards.Agent files (5 files under
.opencode/agents/)Runtime callers (
tools/codecome/harness.py,tools/codecome/phase_1.py)check_phase_graceful_completion, and passphase_failures(orNone) to the resume prompt builder.Tests (5 test files)
runs/phase-2-summary,itemdb/evidence/<id>/,itemdb/reports,itemdb/notes/sandbox-plan.md, etc.) so regressions in the human-readable messages are caught.check_phase_graceful_completionto return a tuple.Verification
make tests: 676 passed.Non-goals
WIP checklist for review
Summary by CodeRabbit
Documentation
Improvements