feat: impl+verify loop improvements (P1-P7), bench, profiles, state machine#6
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire captureReviewBaseline and collectReviewDelta into RunTicket so the reviewer sees only what the current worker attempt changed, not the entire worktree diff since base commit. Pre-existing dirty files are excluded; files the worker creates or modifies are included. Add reviewBaseline field and currentReviewBaseline() helper to ticketRunState for resume safety. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add TestRunTicket_RepairReviewDiffOnlyIncludesRepairDelta to prove that the second review request only includes files changed by the repair worker, not files from the initial implementation phase. The test passed immediately, confirming that Task 4 already wired per-attempt baseline capture correctly for both Implement and Repair phases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces DefaultReopenTargetForSnapshot that routes retry decisions through the snapshot's Outcome field instead of relying solely on the execution phase. failed_retryable → implement; needs_decision and blocked → no automatic retry; empty outcome falls back to the existing phase-based logic for backward compatibility. Adds Outcome field to BlockedTicket so callers can propagate the outcome to CLI rendering. Updates collectBlockedTickets to load the outcome from the snapshot and call DefaultReopenTargetForSnapshot, gating RetryPhase on having a trusted snapshot rather than ticket-store status alone. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add stopKind enum and classifyTicketStop helper to map WHY a ticket stopped to a precise TicketOutcome. Add currentOutcome field to ticketRunState and update snapshot() to prefer the explicit outcome over the phase-derived fallback. Annotate all TicketPhaseBlocked transition sites. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Partition blocked-run guidance into three distinct sections based on ticket Outcome: "Retryable tickets" (failed_retryable), "Tickets needing decision" (needs_decision with implement/repair/block commands), and "Blocked tickets" (hard blockers with no retry path). Updates retryableBlockedTickets to exclude needs_decision tickets from the interactive prompt, and updates tests to match the new section labels. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduce ticketDecision type and decisionPrompter interface so tests can inject stub answers. Wire promptNeedsDecisionTickets into handleBlockedEpicRun so needs_decision tickets get an interactive menu in TTY sessions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add user-facing reference doc explaining ready/in_progress/failed_retryable/needs_decision/blocked outcomes, interactive and non-interactive verk run behavior, and backward-compatibility rules for legacy blocked artifacts. Mark the plan as implemented in the plan file and INDEX.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RunEpic now evaluates ticket quality before dispatching any workers. If the deterministic lint finds blocking findings (P2 or above), the run returns ErrEpicBlocked and persists the artifact to .verk/runs/<run-id>/ticket-quality.json so the CLI can surface it. Existing tests updated: epicChildTicket gains a stub ValidationCommands so it satisfies the gate; hand-crafted child tickets in epic_run_test.go and e2e helpers_test.go similarly updated (removing ambiguous "done" AC in favour of the already-present ValidationCommands). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Run the deterministic ticket-quality evaluator in RunTicket before the first worker is dispatched. Blocking findings transition the ticket to Blocked and persist .verk/runs/<run-id>/ticket-quality.json without invoking any worker. RunEpic suppresses the per-ticket gate via SkipTicketQualityGate so it does not overwrite the epic-scoped artifact. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds TicketSummary, RawTicketQualityFinding, PlannerReviewRequest, and PlannerReviewResult types plus PlannerSystemPrompt constant and BuildPlannerReviewPrompt function so the engine can build deterministic planner-role review requests without adapter wiring. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add e2e coverage for the quality gate (CLI scenario block, multi-surface integration-gap block, and inspect --fix safe-repair path), publish the user-facing reference doc, and mark the plan as Implemented in the index. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Append-only JSONL storage for escaped defects and promoted rules under .verk/memory/, plus `verk learn escaped|list|show|promote` subcommands as the foundation for the memory learning loop. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add user-facing reference for the memory learning loop feature and mark the plan as implemented in the plans index. Also fix a pre-existing gofumpt alignment issue in epos/types.go that was blocking the commit hook. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds Profile as a first-class Ticket field (stored via Extra["profile"] in epos, mirroring the model pattern), profile constants and ValidateProfile, and a pure DetectProfile function that routes tickets to security > contract > frontend > backend based on tags, criteria, body keywords, and owned-path extensions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds AdvisoryFindingsFromPromotedRules to the engine (severity P3, never blocking) and wires promoted rules as a "Lessons From Prior Escaped Defects" section into BuildPlannerReviewPrompt. Callers opt in; EvaluateTicketQuality is unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add ResolveTicketProfile helper that detects and writes back the agent profile when empty, or validates it when explicit. Wire into the epic dispatch path and single-ticket CLI path before BuildPlanArtifact. Add AgentProfile field to PlanArtifact so the resolved value is captured in the plan artifact for audit purposes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements Tasks 1-3 of the benchmark-adoption plan: foundational types with JSON/YAML round-trip serialization, matrix parser with validation and pairing expansion, and a provider registry. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ity CLI, bench verknative)
Add IntentConfig to policy (Enabled=false default, MaxAttempts=2), implement runIntentGate with the three validation rules (missing criteria, superset paths, empty test plan) and MaxAttempts retry loop, wire the gate into RunTicket at the Intake phase before dispatching the implementation worker, and add SR-1 post-implementation alignment check that logs out-of-scope changed files as an advisory warning in implementation.out_of_scope_files. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After all tickets in a wave complete, run a single fresh-context reviewer over the union diff to catch cross-ticket contradictions, integration drift, incomplete fanout, and orphaned references. Shadow mode persists the artifact without blocking; enforce mode blocks the wave on findings at or above the configured threshold. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Before the first wave dispatches, run a single fresh-context reviewer over the root ticket and all child tickets to catch plan-level gaps (missing tickets, cross-service boundary omissions, ownership conflicts). Shadow mode persists the artifact without blocking; enforce mode blocks the epic with reason epic_plan_review_gap. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds internal/engine/constraints/ package with signature derivation, candidate store (JSONL append), constraint execution at verify time, and promotion helpers. Hooks into ticket closeout (candidate recording) and verification (active constraint execution). Adds `verk constraints` CLI subcommands (list, show, disable, history). Adds ConstraintsConfig to policy with Enabled=false default and seeds two inactive grep constraints in .verk/constraints/index.json. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
</review_stack_artifact> |
There was a problem hiding this comment.
Actionable comments posted: 3
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
🟠 Major comments (18)
internal/adapters/repo/git/repo.go-431-438 (1)
431-438:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid full-file reads before applying synthetic diff limits.
syntheticUntrackedDiffstill reads entire files into memory (Line 431) and only truncates later. Very large untracked files can cause memory spikes/OOM even though output is capped.Suggested fix
import ( "bytes" "context" "errors" "fmt" + "io" "os" "os/exec" "path" "path/filepath" "sort" @@ - read, readErr := os.ReadFile(fullPath) + f, openErr := os.Open(fullPath) + if openErr != nil { + if errors.Is(openErr, os.ErrNotExist) { + continue + } + return "", fmt.Errorf("open untracked file %q: %w", p, openErr) + } + read, readErr := io.ReadAll(io.LimitReader(f, syntheticDiffSizeLimit+1)) + _ = f.Close() if readErr != nil { if errors.Is(readErr, os.ErrNotExist) { continue } return "", fmt.Errorf("read untracked file %q: %w", p, readErr) } content = read }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/adapters/repo/git/repo.go` around lines 431 - 438, The code in syntheticUntrackedDiff currently does a full-file os.ReadFile(fullPath) (variables: fullPath, read, readErr, content) which can OOM for huge untracked files; change it to open the file (os.Open), then read only up to the configured synthetic diff limit using io.LimitReader (e.g. io.ReadAll(io.LimitReader(f, limit+1))) so you can detect truncation without loading the entire file, preserve the existing os.ErrNotExist handling, return the same error on non-not-exist read errors, and set content to the truncated bytes (and mark/truncate the output if you read more than limit).internal/adapters/runtime/llmclibridge/bridge.go-323-327 (1)
323-327:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMake reasoning-effort support conditional in diagnostics.
DiagnoseRuntimecurrently treatscodexas unavailable whenOptionReasoningEffortis unsupported, even thoughRunonly requires that option when reasoning is explicitly requested. This can produce false-negative readiness results.Suggested fix
func requiredRuntimeOptions(runtimeName string) []llmclient.OptionName { opts := []llmclient.OptionName{ llmclient.OptionWorkingDirectory, llmclient.OptionEnvironment, llmclient.OptionTimeout, llmclient.OptionRawCapture, } if strings.TrimSpace(runtimeName) == RuntimeCodex { opts = append(opts, llmclient.OptionCodexJSONL, - llmclient.OptionReasoningEffort, ) } return opts }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/adapters/runtime/llmclibridge/bridge.go` around lines 323 - 327, DiagnoseRuntime currently fails codex readiness if OptionReasoningEffort is unsupported even though Run only needs that option when reasoning is requested; update DiagnoseRuntime so it only requires llmclient.OptionReasoningEffort when the diagnostics are simulating or validating a reasoning-enabled Run (or when an explicit reasoning flag/parameter is passed), while still requiring llmclient.OptionCodexJSONL (and other core options) unconditionally for RuntimeCodex; locate the readiness checks in DiagnoseRuntime and change the conditional that treats OptionReasoningEffort as mandatory into a conditional that checks for a “reasoning requested” condition (or an extra boolean parameter) before marking the runtime unavailable for missing OptionReasoningEffort so Run and diagnostics align.internal/bench/governance.go-40-50 (1)
40-50:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReject impossible public sampling metadata (
sampled > total_corpus).Line 40-45 only enforces non-zero values. A public suite with
sampling.sampledgreater thansampling.total_corpuscurrently passes validation.🔧 Suggested fix
if s.Sampling.TotalCorpus == 0 { return fmt.Errorf("bench: public suite %q must specify sampling.total_corpus", s.Name) } if s.Sampling.Sampled == 0 { return fmt.Errorf("bench: public suite %q must specify sampling.sampled", s.Name) } + if s.Sampling.Sampled > s.Sampling.TotalCorpus { + return fmt.Errorf( + "bench: public suite %q has sampling.sampled=%d greater than sampling.total_corpus=%d", + s.Name, s.Sampling.Sampled, s.Sampling.TotalCorpus, + ) + } if s.TaskCount < minPublicTaskCount {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/bench/governance.go` around lines 40 - 50, The validation currently only checks non-zero sampling values but misses the impossible case where s.Sampling.Sampled > s.Sampling.TotalCorpus; update the validation (in the same function that checks s.Sampling.TotalCorpus and s.Sampling.Sampled) to add a check that if s.Sampling.Sampled > s.Sampling.TotalCorpus return a fmt.Errorf describing the problem (e.g., "bench: public suite %q has sampled > total_corpus (%d > %d)"). Ensure you reference s.Name, s.Sampling.Sampled and s.Sampling.TotalCorpus in the error message for clarity.internal/bench/compare.go-210-251 (1)
210-251:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPropagate markdown write/flush errors instead of returning success.
Line 211 onward ignores
fmt.Fprintf/fmt.Fprintlnandtabwriter.Flusherrors, soRenderComparisonMarkdowncan returnnilon failed writes (e.g., broken pipe).🔧 Suggested fix
func RenderComparisonMarkdown(w io.Writer, cmp Comparison) error { - fmt.Fprintf(w, "# Benchmark Comparison\n\n") - fmt.Fprintf(w, "**Baseline:** %s (%s) \n", cmp.Baseline.RunID, cmp.Baseline.SuiteName) - fmt.Fprintf(w, "**Candidate:** %s (%s) \n\n", cmp.Candidate.RunID, cmp.Candidate.SuiteName) + writef := func(format string, args ...any) error { + if _, err := fmt.Fprintf(w, format, args...); err != nil { + return fmt.Errorf("bench: render markdown comparison: %w", err) + } + return nil + } + if err := writef("# Benchmark Comparison\n\n"); err != nil { + return err + } + if err := writef("**Baseline:** %s (%s) \n", cmp.Baseline.RunID, cmp.Baseline.SuiteName); err != nil { + return err + } + if err := writef("**Candidate:** %s (%s) \n\n", cmp.Candidate.RunID, cmp.Candidate.SuiteName); err != nil { + return err + } @@ - tw.Flush() - fmt.Fprintln(w) + if err := tw.Flush(); err != nil { + return fmt.Errorf("bench: render markdown comparison: %w", err) + } + if err := writef("\n"); err != nil { + return err + } @@ - tw.Flush() - fmt.Fprintln(w) + if err := tw.Flush(); err != nil { + return fmt.Errorf("bench: render markdown comparison: %w", err) + } + if err := writef("\n"); err != nil { + return err + } return nil }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/bench/compare.go` around lines 210 - 251, RenderComparisonMarkdown currently ignores errors from fmt.Fprintf/Fprintln and tabwriter.Flush, which can hide write failures; update the function (RenderComparisonMarkdown) to check the error returned by each fmt.Fprintf/Fprintln and by tw.Flush (for each tabwriter instance) and return the first non-nil error encountered instead of continuing, ensuring every write/flush call (including the initial headers, the summary block, and the per-task loop) captures its error and propagates it to the caller.internal/bench/matrix.go-75-79 (1)
75-79:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
fixed-reviewerpairing currently degenerates to “first profile is reviewer” infull-verk.
ValidateMatrixmakes reviewer fields effectively present for every full-verk profile, soexpandFixedReviewer’s firstReviewer.Model != ""match is always the first profile. This can silently mis-pair workers/reviewers and skew benchmark outcomes.💡 Fix direction
func ValidateMatrix(m Matrix) error { + if m.ComparisonDesign == designFixedReviewer { + // Require a single, unambiguous reviewer target across profiles, + // or add an explicit `reviewer_profile_id` field in Matrix. + } ... } func expandFixedReviewer(m Matrix) [][2]MatrixProfile { - reviewerIdx := len(m.Profiles) - 1 - for i, p := range m.Profiles { - if p.Reviewer.Model != "" { - reviewerIdx = i - break - } - } + // Resolve reviewer deterministically from explicit config + // (preferred) or from a validated single reviewer target. + reviewerIdx := resolveFixedReviewerIndex(m) ... }Also applies to: 122-132
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/bench/matrix.go` around lines 75 - 79, The problem is that ValidateMatrix populates reviewer fields for full-verk profiles causing expandFixedReviewer to match the first profile (since Reviewer.Model != ""), so fixed-reviewer pairing degenerates; fix by changing the pairing logic in expandFixedReviewer to only consider reviewer candidates that were explicitly specified (e.g., introduce and check a boolean/marker like ReviewerExplicit or ReviewerSource on Profile) or by moving the expandFixedReviewer call so it runs before ValidateMatrix fills in defaults; update expandFixedReviewer to reference that explicit marker (or change call order around ValidateMatrix and expandFixedReviewer) to ensure only originally-specified reviewers are matched.internal/bench/providers/verknative/provider.go-118-123 (1)
118-123:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Scoreallows path traversal viaexpect_marker.
filepath.Join(workspaceDir, marker)with unvalidatedmarkercan escape the workspace (../...or absolute paths), allowing unintended file reads.🛡️ Suggested hardening
marker, ok := markerRaw.(string) if !ok || marker == "" { return bench.Score{Solved: false} } - - path := filepath.Join(workspaceDir, filepath.FromSlash(marker)) + clean := filepath.Clean(filepath.FromSlash(marker)) + if filepath.IsAbs(clean) || clean == ".." || strings.HasPrefix(clean, ".."+string(filepath.Separator)) { + return bench.Score{Solved: false} + } + path := filepath.Join(workspaceDir, clean) + rel, err := filepath.Rel(workspaceDir, path) + if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) { + return bench.Score{Solved: false} + } - data, err := os.ReadFile(path) + data, err := os.ReadFile(path)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/bench/providers/verknative/provider.go` around lines 118 - 123, The code constructs path := filepath.Join(workspaceDir, filepath.FromSlash(marker)) with an unvalidated marker (expect_marker) which allows path traversal; change to validate and sanitize marker before reading: reject absolute markers (filepath.IsAbs), clean the marker (filepath.Clean) and ensure it does not contain upward segments ("..") or path separators that escape the workspace, then compute the final path and verify it is inside workspaceDir (use filepath.Join + filepath.Clean and confirm filepath.Rel(workspaceDir, finalPath) does not start with ".."); only call os.ReadFile on the validated finalPath (references: variables marker/expect_marker, workspaceDir, path, and functions filepath.FromSlash, filepath.Clean, filepath.IsAbs, filepath.Rel, os.ReadFile).internal/bench/report.go-141-169 (1)
141-169:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftPropagate write/flush errors in
RenderMarkdown.
RenderMarkdownreturnserror, but writes at Line 141+ and alltw.Flush()calls are unchecked. A short write or flush failure currently returnsniland emits partial output.Proposed direction
+ writef := func(format string, args ...any) error { + _, err := fmt.Fprintf(w, format, args...) + return err + } + - fmt.Fprintf(w, "# Benchmark Report\n\n") + if err := writef("# Benchmark Report\n\n"); err != nil { + return fmt.Errorf("bench: render markdown header: %w", err) + } @@ - tw.Flush() + if err := tw.Flush(); err != nil { + return fmt.Errorf("bench: flush profiles table: %w", err) + }Apply the same pattern across all
fmt.Fprintf/Fprintlnandtw.Flushsites.Also applies to: 173-185, 189-194, 198-209, 220-232, 262-293
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/bench/report.go` around lines 141 - 169, RenderMarkdown currently ignores write and flush errors (many fmt.Fprintf/Fprintln calls and tw.Flush use), so partial output is silently returned nil; update RenderMarkdown to check and propagate errors from every fmt.Fprintf/Fprintln and from tabwriter.Flush calls. For each write call in RenderMarkdown (e.g., the header writes, label loop over r.Labels, profile table loop over r.Profiles) capture the returned (int, error) and if err != nil return it, and similarly check the error returned by each tw.Flush() and return it when non-nil so the function surface returns the first write/flush error encountered.internal/bench/report.go-103-111 (1)
103-111:⚠️ Potential issue | 🟠 Major | ⚡ Quick winCost totals drop
Score.CostUSDwhenUsageis empty.Both
aggregateTotalsandRenderCSVonly readTaskResult.Usage. If executors provide cost only viaTaskResult.Score(asDefaultExecutordoes), report/csv cost becomes zero.Proposed fix
for _, u := range r.Usage { switch u.Confidence { case "exact": t.ExactCostUSD += u.CostUSD default: t.EstimatedCostUSD += u.CostUSD } } + if len(r.Usage) == 0 && r.Score.CostUSD > 0 { + switch defaultUsageConfidence(UsageRecord{ + CostUSD: r.Score.CostUSD, + Confidence: r.Score.Confidence, + }) { + case "exact": + t.ExactCostUSD += r.Score.CostUSD + default: + t.EstimatedCostUSD += r.Score.CostUSD + } + }And in
RenderCSV, fallback totr.Score.CostUSD/tr.Score.Confidencewhenlen(tr.Usage)==0.Also applies to: 325-333
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/bench/report.go` around lines 103 - 111, aggregateTotals and RenderCSV currently only sum costs from TaskResult.Usage, which drops cost when Usage is empty; update both functions (aggregateTotals and RenderCSV) to fall back to TaskResult.Score.CostUSD and TaskResult.Score.Confidence when len(tr.Usage)==0 (i.e., treat a single pseudo-usage entry sourced from tr.Score when Usage is empty) so totals and CSV output include cost provided via TaskResult.Score (used by DefaultExecutor); locate the loops over r.Usage and tr.Usage and add the conditional fallback to use tr.Score.CostUSD and tr.Score.Confidence when Usage is empty.internal/bench/registry.go-22-27 (1)
22-27:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard against nil/empty providers in
Register.At Line 23,
p.Name()can panic ifpis nil, and empty names are currently accepted. This should return an error instead of crashing or registering an invalid key.Proposed fix
import ( "fmt" "sort" + "strings" ) @@ func (r *Registry) Register(p Provider) error { + if p == nil { + return fmt.Errorf("bench: provider is nil") + } - name := p.Name() + name := strings.TrimSpace(p.Name()) + if name == "" { + return fmt.Errorf("bench: provider name is empty") + } if _, exists := r.providers[name]; exists { return fmt.Errorf("bench: provider %q is already registered", name) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/bench/registry.go` around lines 22 - 27, Registry.Register currently calls p.Name() without validating p and allows empty names, which can panic or register invalid keys; update Register to first check that the Provider parameter (p) is not nil and return a descriptive error if it is, then obtain name := p.Name() and validate that name != "" (return an error for empty names), and only then check r.providers[name] and assign r.providers[name] = p; reference the Registry.Register method, the Provider parameter p, the p.Name() call, and the r.providers map when making the changes.internal/bench/runner.go-144-147 (1)
144-147:⚠️ Potential issue | 🟠 Major | ⚡ Quick winValidate
RunOptions.Providerbefore dereferencing.At Line 144,
opts.Provider.LoadTasks(...)will panic whenProvideris nil. Return a validation error instead.Proposed fix
func Run(ctx context.Context, opts RunOptions) (RunSummary, error) { now := opts.clock() + if opts.Provider == nil { + return RunSummary{}, fmt.Errorf("bench: provider is required") + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/bench/runner.go` around lines 144 - 147, Check for nil RunOptions.Provider before calling opts.Provider.LoadTasks to avoid a panic; if opts.Provider is nil return an appropriate error (e.g., fmt.Errorf("bench: nil provider in RunOptions") or a wrapped validation error) instead of proceeding. Update the code surrounding the LoadTasks call in runner.go (the block using opts.Provider.LoadTasks and returning RunSummary{}) to validate opts.Provider, and ensure the error type/message clearly indicates a missing/invalid provider when returning. Keep the rest of the LoadTasks error handling unchanged.internal/bench/report.go-327-333 (1)
327-333:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDo not infer CSV confidence from only the first usage row.
At Line 329, confidence is taken from the first usage entry. For mixed usage records (e.g., exact + estimated), this can overstate confidence for the aggregated task cost.
Proposed fix
var totalCost float64 - var confidence string + confidence := "unavailable" for _, u := range tr.Usage { totalCost += u.CostUSD - if confidence == "" { - confidence = u.Confidence - } + c := defaultUsageConfidence(u) + switch { + case c == "exact" && confidence == "unavailable": + confidence = "exact" + case c != "exact": + confidence = "estimated" + } } - confidence = defaultUsageConfidence(UsageRecord{CostUSD: totalCost, Confidence: confidence}) + if len(tr.Usage) == 0 { + confidence = defaultUsageConfidence(UsageRecord{CostUSD: totalCost, Confidence: confidence}) + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/bench/report.go` around lines 327 - 333, The code currently sets confidence using only the first entry from tr.Usage; instead, scan all usage rows (the tr.Usage slice) and compute a combined confidence before calling defaultUsageConfidence. For example, iterate over each UsageRecord in tr.Usage, accumulate totalCost as you already do, and derive combinedConfidence by checking every u.Confidence (e.g., if any record is non-"Exact" mark the aggregate as "Estimated"/the most conservative value or otherwise keep "Exact"); then call defaultUsageConfidence(UsageRecord{CostUSD: totalCost, Confidence: combinedConfidence}) so the aggregated task confidence reflects all usage rows rather than just the first.internal/cli/bench.go-199-201 (1)
199-201:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHandle repo-root resolution failures before reading run artifacts.
Line 200 and Line 218 discard
resolveRepoRooterrors, soreport/comparecan read from unintended relative paths when not in a repo.Suggested fix
func runBenchReport(cmd *cobra.Command, runID, format string) error { - repoRoot, _ := resolveRepoRoot() + repoRoot, err := resolveRepoRoot() + if err != nil { + return cmdError(cmd, fmt.Errorf("resolve repo root: %w", err), 1) + } outDir := filepath.Join(repoRoot, ".verk", "bench", "runs", runID) - rr, err := bench.LoadRunResult(outDir) + rr, err := bench.LoadRunResult(outDir) if err != nil { return cmdError(cmd, err, 1) } @@ func runBenchCompare(cmd *cobra.Command, baseline, candidate, format string) error { - repoRoot, _ := resolveRepoRoot() + repoRoot, err := resolveRepoRoot() + if err != nil { + return cmdError(cmd, fmt.Errorf("resolve repo root: %w", err), 1) + } - base, err := bench.LoadRunResult(filepath.Join(repoRoot, ".verk", "bench", "runs", baseline)) + base, err := bench.LoadRunResult(filepath.Join(repoRoot, ".verk", "bench", "runs", baseline))Also applies to: 217-219
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/cli/bench.go` around lines 199 - 201, The code in runBenchReport currently ignores errors from resolveRepoRoot() and proceeds to build outDir and read artifacts, which can point to unintended relative paths; update runBenchReport to check the error returned by resolveRepoRoot() (the call that assigns repoRoot, err := resolveRepoRoot()) and return or log the error immediately instead of discarding it, ensuring outDir is only constructed when repoRoot is valid; apply the same fix to the other occurrence where resolveRepoRoot() is called (the block around lines 217-219) so both places validate and handle the error before using repoRoot.internal/cli/bench_test.go-64-69 (1)
64-69:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUpdate stale expectation in
bench runtest.This test now contradicts current behavior and is failing CI (Line 67 expects non-zero, command returns 0).
Suggested fix
-func TestBenchRun_NotYetImplemented(t *testing.T) { +func TestBenchRun_Completes(t *testing.T) { _, _, code := runBenchCmd(t, "run", "smoke") - if code == 0 { - t.Fatal("bench run: expected non-zero exit code, got 0") + if code != 0 { + t.Fatalf("bench run: expected exit 0, got %d", code) } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/cli/bench_test.go` around lines 64 - 69, The test TestBenchRun_NotYetImplemented has a stale assertion: it calls runBenchCmd(t, "run", "smoke") and asserts the exit code is non-zero; update the expectation to match current behavior by asserting code == 0 (or using t.Fatalf on non-zero) so the test expects a zero exit code from runBenchCmd; modify the assertion in TestBenchRun_NotYetImplemented to validate success instead of failure.internal/cli/run_blocked.go-108-110 (1)
108-110:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDecision prompt ignores run cancellation while waiting for input.
Line 144 hardcodes
context.Background(), so a canceled run context does not interruptChooseTicketDecision; the prompt can hang until another keypress arrives. Thread the caller context through the prompter API and pass it toreadPromptByte.Suggested fix
type decisionPrompter interface { - ChooseTicketDecision(ticket engine.BlockedTicket) (ticketDecision, error) + ChooseTicketDecision(ctx context.Context, ticket engine.BlockedTicket) (ticketDecision, error) } -func (p terminalDecisionPrompter) ChooseTicketDecision(ticket engine.BlockedTicket) (ticketDecision, error) { +func (p terminalDecisionPrompter) ChooseTicketDecision(ctx context.Context, ticket engine.BlockedTicket) (ticketDecision, error) { + if ctx == nil { + ctx = context.Background() + } reason := strings.TrimSpace(ticket.Reason) if reason == "" { reason = noReasonRecorded } @@ - ctx := context.Background() for { _, _ = fmt.Fprint(p.w, " Choice: ") b, ok, cancelled := readPromptByte(ctx, p.r) @@ - decision, err := prompter.ChooseTicketDecision(t) + decision, err := prompter.ChooseTicketDecision(ctx, t)Also applies to: 132-150, 176-177, 197-197, 584-595
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/cli/run_blocked.go` around lines 108 - 110, The decision prompter currently ignores cancellation because ChooseTicketDecision is called with context.Background(); change the decisionPrompter interface signature to ChooseTicketDecision(ctx context.Context, ticket engine.BlockedTicket) (ticketDecision, error), update all implementations and call sites to pass the caller's ctx instead of context.Background(), and thread that same ctx into readPromptByte so the prompt read is cancellable; ensure all referenced calls (ChooseTicketDecision invocations and readPromptByte usage) are updated accordingly.internal/engine/constraints/promote.go-25-33 (1)
25-33: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winUnused
promotedmap is dead code.The
promotedmap is built but never used to filter or check anything. The loop overc.PromotedFromdoes nothing (_ = pf). Either remove this code or implement the intended filtering logic.- // Collect promoted signatures so we can detect already-promoted ones. - promoted := make(map[Signature]struct{}) - for _, c := range idx.Constraints { - promoted[Signature(c.ID)] = struct{}{} - // Also track by finding signature patterns - for _, pf := range c.PromotedFrom { - _ = pf // just a marker - } - }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/engine/constraints/promote.go` around lines 25 - 33, The code builds a promoted map (promoted := make(map[Signature]struct{})) and iterates c.PromotedFrom without using either; remove the dead code or implement the intended check: either delete the entire promoted collection and the unused loop, or use promoted inside the constraint processing to skip already-promoted constraints by checking if Signature(c.ID) exists in promoted or by populating promoted from c.PromotedFrom and using that set where constraints are evaluated; refer to the promoted variable, Signature(c.ID), idx.Constraints, and c.PromotedFrom to locate and update the logic accordingly.internal/engine/review_delta.go-105-110 (1)
105-110:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid false attribution when a file was already deleted at baseline.
At Line 107,
os.IsNotExistalways marks the file as worker-changed. If baseline already recorded it as non-existent (snap.Exists == false), this is unchanged pre-existing state and should not be attributed to the worker.💡 Suggested fix
current, readErr := os.ReadFile(abs) if readErr != nil { if os.IsNotExist(readErr) { - // File was deleted by the worker — include it. - workerChanged = append(workerChanged, f) + // Include only when the file existed at baseline and is now deleted. + if snap.Exists { + workerChanged = append(workerChanged, f) + } continue } return reviewDelta{}, fmt.Errorf("collectReviewDelta: read %s: %w", f, readErr) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/engine/review_delta.go` around lines 105 - 110, When os.ReadFile(abs) returns os.IsNotExist, don't automatically attribute deletion to the worker; check the baseline snapshot's exists flag (snap.Exists) first and only append f to workerChanged when the baseline recorded the file as existing. Update the block around the ReadFile error handling in review_delta.go (the branch using os.IsNotExist(readErr)) to conditionally append to workerChanged only if snap.Exists is true, then continue.internal/engine/intent.go-155-160 (1)
155-160:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse owned-path containment, not exact equality, for TargetFiles checks.
Line 157 currently requires exact path matches, so a valid target like
internal/app/foo.gois rejected when ownership isinternal/app. That can incorrectly fail intent convergence.💡 Suggested fix
- // Rule 2: TargetFiles must be a subset of OwnedPaths (empty TargetFiles is OK). - for _, f := range result.TargetFiles { - if _, ok := ownedSet[f]; !ok { - return "superset_paths" - } - } + // Rule 2: TargetFiles must be within OwnedPaths (file == owned path OR nested under owned dir). + for _, f := range result.TargetFiles { + target := filepath.Clean(strings.TrimSpace(f)) + allowed := false + for owned := range ownedSet { + scope := filepath.Clean(strings.TrimSpace(owned)) + if scope == "" { + continue + } + if target == scope || strings.HasPrefix(target, scope+string(filepath.Separator)) { + allowed = true + break + } + } + if !allowed { + return "superset_paths" + } + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/engine/intent.go` around lines 155 - 160, The current loop in intent.go that checks result.TargetFiles against ownedSet uses exact equality and should instead check containment (owned directory covers files beneath it). Replace the exact map lookup for each f in result.TargetFiles with a containment test: iterate owned paths from ownedSet and treat f as owned if f == owned or f has the owned path as a directory prefix (e.g. f starts with owned + "/"); use filepath.Clean on both sides and strings.HasPrefix for the prefix check to handle path separators correctly so the function that returns "superset_paths" only does so when a target file is truly outside all owned paths.internal/engine/ticket_quality_gate.go-41-42 (1)
41-42:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd runID validation to prevent directory traversal in artifact paths.
The
runIDparameter is used directly infilepath.Joinat line 41 without validation. While current callers validate viaValidateArtifactIdentifier, the function should be defensive. Use the existing validation pattern:Suggested fix
func persistTicketQualityArtifact(repoRoot, runID string, artifact state.TicketQualityArtifact) error { + if err := ValidateArtifactIdentifier(runID, "run id"); err != nil { + return err + } dir := filepath.Join(repoRoot, ".verk", "runs", runID) path := filepath.Join(dir, "ticket-quality.json")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/engine/ticket_quality_gate.go` around lines 41 - 42, The code uses runID directly in filepath.Join to build dir/path; add defensive validation by calling ValidateArtifactIdentifier(runID) before computing dir and path, and return an error (or propagate) if validation fails so malicious runID cannot cause directory traversal; update the block around the dir := filepath.Join(repoRoot, ".verk", "runs", runID) / path := filepath.Join(dir, "ticket-quality.json") to validate runID first and only proceed to join when ValidateArtifactIdentifier succeeds.
🟡 Minor comments (12)
docs/ticket-state-machine.md-67-82 (1)
67-82:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd language identifiers to fenced code blocks.
These fences currently violate MD040. Please annotate them with an explicit language (e.g.,
textfor CLI output examples).Also applies to: 96-99, 111-119, 130-133
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/ticket-state-machine.md` around lines 67 - 82, Update the fenced code blocks in docs/ticket-state-machine.md to include explicit language identifiers (e.g., use ```text instead of ``` ) for all CLI/output examples; specifically change the fenced blocks at the shown snippet and the other occurrences referenced (lines around 96-99, 111-119, 130-133) so they comply with MD040 by adding the appropriate language token such as "text" or "bash".docs/plans/2026-05-16-llmcli-runtime-adapter-integration.md-3-10 (1)
3-10:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winRemove blank lines inside the opening blockquote block.
Lines 6 and 8 create MD028 violations (
no-blanks-blockquote). Keep the status/related-tickets/for-claude lines in one continuous blockquote.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/plans/2026-05-16-llmcli-runtime-adapter-integration.md` around lines 3 - 10, The blockquote contains blank lines causing MD028; remove the empty lines so the three blockquote lines starting with "**Status: Implemented / Reference.**", "**Related tickets:** `fi-8ken` and children." and "**For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans..." are contiguous within the same > block (i.e., delete the blank lines between those > lines to form one continuous blockquote).docs/plans/2026-04-22-ticket-state-machine.md-3-5 (1)
3-5:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix blockquote blank-line formatting to satisfy markdownlint.
Line 4 introduces a blank line between blockquote lines, which triggers MD028. Keep the blockquote contiguous (or make the blank line a quoted
>line).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/plans/2026-04-22-ticket-state-machine.md` around lines 3 - 5, The blockquote in docs/plans/2026-04-22-ticket-state-machine.md is split by a bare blank line between the two quoted lines ("**Status: Implemented** — All six tasks landed..." and "**For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans..."), which triggers markdownlint MD028; fix it by either removing the blank line so the two lines remain contiguous in the same blockquote, or make the blank line a quoted line by adding a leading ">" to preserve spacing; update the lines containing those exact strings accordingly.docs/plans/INDEX.md-53-67 (1)
53-67:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix ordered-list numbering in “Current Priority Order.”
The sequence currently jumps from
1.to3.(Line 57), which reads like a missing priority item. Renumber this section consistently.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/plans/INDEX.md` around lines 53 - 67, The ordered list under "Current Priority Order" has inconsistent numbering (it jumps from "1." to "3."); update the numeric prefixes so they are sequential (change the item starting "Implement the Claude Code foundation..." from "3." to "2." and adjust any following items if present) so the list reads 1., 2., 3., 4., 5. and preserves the existing item text (e.g., lines beginning "Finish the remaining active tool-reliability...", "Implement the Claude Code foundation...", etc.).internal/adapters/runtime/prompt.go-410-415 (1)
410-415:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winHarden intent-block validation against whitespace-only list items.
finalizeIntentBlockcurrently accepts payloads like{"covered_criteria":[" "]}as valid. That allows an effectively empty intent echo to pass.Suggested fix
func finalizeIntentBlock(b VerkIntentBlock) (VerkIntentBlock, bool) { - if len(b.CoveredCriteria) == 0 && len(b.TargetFiles) == 0 && strings.TrimSpace(b.TestPlan) == "" { + b.CoveredCriteria = trimNonEmpty(b.CoveredCriteria) + b.TargetFiles = trimNonEmpty(b.TargetFiles) + b.TestPlan = strings.TrimSpace(b.TestPlan) + if len(b.CoveredCriteria) == 0 && len(b.TargetFiles) == 0 && b.TestPlan == "" { return VerkIntentBlock{}, false } - b.TestPlan = strings.TrimSpace(b.TestPlan) return b, true } + +func trimNonEmpty(items []string) []string { + out := make([]string, 0, len(items)) + for _, item := range items { + if s := strings.TrimSpace(item); s != "" { + out = append(out, s) + } + } + return out +}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/adapters/runtime/prompt.go` around lines 410 - 415, finalizeIntentBlock currently treats lists containing whitespace-only strings as non-empty; update finalizeIntentBlock to trim each entry of b.CoveredCriteria and b.TargetFiles, filter out entries that become empty after trimming, and replace the slices with the filtered results before validation; then apply the existing TestPlan trim and return the block only if the resulting CoveredCriteria or TargetFiles slices are non-empty or TestPlan is non-empty. Ensure you operate on VerkIntentBlock fields CoveredCriteria, TargetFiles, and TestPlan within the finalizeIntentBlock function.internal/adapters/runtime/llmclibridge/bridge_test.go-238-240 (1)
238-240:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winThe cancelled-case stderr check is currently a no-op.
When
wantStderr == "",strings.Contains(x, "")always passes, so this branch does not verify stderr behavior.Suggested fix
- if !strings.Contains(string(result.Stderr), tt.wantStderr) { - t.Fatalf("expected stderr to contain %q, got %q", tt.wantStderr, result.Stderr) - } + if tt.wantStderr == "" { + if len(result.Stderr) != 0 { + t.Fatalf("expected empty stderr, got %q", result.Stderr) + } + } else if !strings.Contains(string(result.Stderr), tt.wantStderr) { + t.Fatalf("expected stderr to contain %q, got %q", tt.wantStderr, result.Stderr) + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/adapters/runtime/llmclibridge/bridge_test.go` around lines 238 - 240, The test currently uses strings.Contains(result.Stderr, tt.wantStderr) which is a no-op when tt.wantStderr == "" (strings.Contains(..., "") is always true); change the assertion to branch on tt.wantStderr: if tt.wantStderr == "" assert that result.Stderr is empty (len(result.Stderr) == 0 or == ""), otherwise assert strings.Contains(string(result.Stderr), tt.wantStderr) and keep the t.Fatalf messages referencing tt.wantStderr and result.Stderr to preserve diagnostics. Use the existing tt.wantStderr and result.Stderr symbols to locate and update the check in the test.internal/bench/providers/aiderpolyglot/provider.go-15-18 (1)
15-18:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winValidate
suitebefore configuration/not-implemented paths inLoadTasks.Right now, an invalid suite can be reported as
ErrNotConfigured(or generic not-implemented), which hides caller input errors and complicates CLI/debug UX.🔧 Proposed patch
import ( "errors" + "fmt" "verk/internal/bench" ) ... func (p *Provider) LoadTasks(suite string) ([]bench.Task, error) { + if suite != "smoke" { + return nil, fmt.Errorf("aider-polyglot: unknown suite %q", suite) + } if p.DatasetPath == "" { return nil, ErrNotConfigured } // TODO: read dataset metadata from p.DatasetPath; not implemented in this spike. - return nil, errors.New("aider-polyglot: loading from dataset not yet implemented (spike)") + return nil, errors.New("aider-polyglot: loading from dataset not yet implemented (spike)") }Also applies to: 52-58
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/bench/providers/aiderpolyglot/provider.go` around lines 15 - 18, In LoadTasks, validate the incoming suite parameter immediately (before any configuration checks or returning ErrNotConfigured/ErrNotImplemented) and return an explicit validation error for a bad suite; update provider.go's LoadTasks to perform the suite check first (use or add a clear error symbol such as bench.ErrInvalidSuite or a new package-level error) so callers get a deterministic validation error instead of ErrNotConfigured/ErrNotImplemented; ensure this same early validation is applied to the other affected code paths around the 52-58 area that currently branch to configuration/not-implemented handling.internal/cli/learn.go-207-209 (1)
207-209:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winReject whitespace-only
--rule-idvalues.Line 207 accepts
" "as present; trim before validating and persist the trimmed value.Suggested fix
- if ruleID == "" { + ruleID = strings.TrimSpace(ruleID) + if ruleID == "" { return cmdError(cmd, fmt.Errorf("--rule-id is required"), 2) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/cli/learn.go` around lines 207 - 209, Trim whitespace from the ruleID flag before validating and using it: replace direct checks of ruleID (the variable validated in the block that currently calls cmdError) with a trimmed value (e.g., strings.TrimSpace(ruleID)) and validate that trimmed string is non-empty; persist and pass the trimmed value onward instead of the original. Ensure the cmdError check uses the trimmedRuleID and that any subsequent references in this command handler (where ruleID is used) are updated to the trimmed variable.internal/cli/learn.go-43-53 (1)
43-53:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winTighten required-flag validation for
--summaryand--missed-by.Whitespace-only
--summaryand comma-only/blank--missed-bycurrently pass required checks.Suggested fix
- if summary == "" { + summary = strings.TrimSpace(summary) + if summary == "" { return cmdError(cmd, fmt.Errorf("--summary is required"), 2) } - if missedBy == "" { + missedBy = strings.TrimSpace(missedBy) + if missedBy == "" { return cmdError(cmd, fmt.Errorf("--missed-by is required"), 2) } missedByList := splitComma(missedBy) + if len(missedByList) == 0 { + return cmdError(cmd, fmt.Errorf("--missed-by is required"), 2) + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/cli/learn.go` around lines 43 - 53, The current checks treat whitespace-only --summary and comma-only/blank entries in --missed-by as valid; update the validation in the CLI handler to trim and validate inputs: replace the summary check to use strings.TrimSpace(summary) and return the same cmdError if it's empty after trimming; for missedBy, split with splitComma but then iterate with mb = strings.TrimSpace(mb), skip or treat empty mb as an error (e.g., if len(missedByList)==0 or any trimmed mb == "" return cmdError about required --missed-by), and validate each non-empty trimmed mb against memory.ValidMissedBy when calling cmdError for unknown values; keep references to splitComma, ValidMissedBy, cmdError, summary, and missedBy so the change is easy to locate.internal/cli/status_test.go-171-171 (1)
171-171:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAssert the CLI execute error in the test.
Line 171 drops the
root.Execute()error, which can hide command failures and make this test pass for the wrong reason.Suggested fix
- _ = root.Execute() + if err := root.Execute(); err != nil { + t.Fatalf("status command failed: %v", err) + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/cli/status_test.go` at line 171, The test currently ignores the error returned by root.Execute(), which can hide failures; update the test in internal/cli/status_test.go to capture the return value from root.Execute() and assert it's nil (or assert the expected error) using the test framework's assertion (e.g., t.Fatalf/t.Errorf or require.NoError/assert.NoError) so command failures fail the test; locate the call to root.Execute() and replace the discarded result with a proper check that fails the test if an unexpected error is returned.internal/engine/constraints/signature.go-27-34 (1)
27-34:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winPlatform inconsistency:
filepath.Diruses OS-specific separators.On Windows,
filepath.Dirreturns paths with backslashes, but glob patterns typically use forward slashes. This could cause inconsistent signature hashes across platforms.func fileToGlob(path string) string { dir := filepath.Dir(path) ext := filepath.Ext(path) if ext == "" { return path } - return dir + "/*" + ext + return filepath.ToSlash(dir) + "/*" + ext }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/engine/constraints/signature.go` around lines 27 - 34, fileToGlob uses filepath.Dir which yields OS-specific separators and can create backslashes on Windows, causing inconsistent glob/signature strings; update fileToGlob to normalize to forward slashes by calling filepath.ToSlash on the inputs/outputs: when ext == "" return filepath.ToSlash(path), otherwise compute dir := filepath.ToSlash(filepath.Dir(path)) and return dir + "/*" + ext (using "/" explicitly for the glob separator) so the produced glob always uses forward slashes across platforms.internal/engine/ticket_quality_gate_test.go-148-155 (1)
148-155:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAssert
RunEpicerrors in artifact-persistence subtests.Both subtests ignore the returned error, which can hide behavior regressions unrelated to artifact writing.
✅ Suggested test tightening
- _, _ = RunEpic(context.Background(), RunEpicRequest{ + _, err := RunEpic(context.Background(), RunEpicRequest{ RepoRoot: repoRoot, RunID: runID, RootTicketID: epic.ID, BaseCommit: baseCommit, Adapter: runtimefake.New(nil, nil), Config: policy.DefaultConfig(), }) + if !errors.Is(err, ErrEpicBlocked) { + t.Fatalf("expected ErrEpicBlocked in blocked case, got: %v", err) + } @@ - _, _ = RunEpic(context.Background(), RunEpicRequest{ + _, err := RunEpic(context.Background(), RunEpicRequest{ RepoRoot: repoRoot, RunID: runID, RootTicketID: epic.ID, BaseCommit: baseCommit, Adapter: adapter, Config: policy.DefaultConfig(), }) + if err != nil { + t.Fatalf("expected passed case to succeed, got: %v", err) + }Also applies to: 223-230
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/engine/ticket_quality_gate_test.go` around lines 148 - 155, The subtests for artifact-persistence call RunEpic(...) and ignore its returned error, which can hide regressions; update those calls in ticket_quality_gate_test.go (the artifact-persistence subtests around the RunEpicRequest usage and the similar block at the later occurrence) to capture the error return and assert success (e.g., use require.NoError(t, err) or assert.NoError(t, err)) and/or fail the test when err != nil so failures in RunEpic are surfaced; locate the RunEpic call and the RunEpicRequest struct usage to modify the test assertions accordingly.
🧹 Nitpick comments (6)
internal/bench/providers/aiderpolyglot/provider_test.go (1)
76-79: ⚡ Quick winAssert
SupportedModescontents, not just count.A slice-length check alone can miss regressions where modes are incorrect but still two items.
✅ Stronger assertion
wantModes := []bench.BenchmarkMode{bench.ModeFullVerk, bench.ModeWorkerOnly} if len(caps.SupportedModes) != len(wantModes) { t.Errorf("SupportedModes: got %v, want %v", caps.SupportedModes, wantModes) } + for i, want := range wantModes { + if caps.SupportedModes[i] != want { + t.Errorf("SupportedModes[%d]: got %q, want %q", i, caps.SupportedModes[i], want) + } + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/bench/providers/aiderpolyglot/provider_test.go` around lines 76 - 79, The test currently only checks the length of caps.SupportedModes against wantModes; change it to assert the actual contents match — compare caps.SupportedModes to wantModes (either by deep-equality with reflect.DeepEqual or by converting to a set and comparing elements) so the test fails if the specific modes differ; update the assertion around the variables wantModes and caps.SupportedModes in provider_test.go to verify element equality (and consider order-insensitive comparison if order isn't guaranteed).internal/bench/providers/verknative/provider_test.go (1)
113-151: ⚡ Quick winAdd a scorer test that rejects
../and absolute marker paths.Given scoring reads from disk, a dedicated boundary test will lock in safe behavior once path sanitization is added.
🧪 Example test to add
+func TestScore_UnsolvedWhenMarkerEscapesWorkspace(t *testing.T) { + dir := t.TempDir() + task := smokeTask("../outside.txt", "ok") + score := verknative.Score(task, dir) + if score.Solved { + t.Fatal("Score.Solved should be false for traversal marker path") + } +}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/bench/providers/verknative/provider_test.go` around lines 113 - 151, Add a new unit test in internal/bench/providers/verknative/provider_test.go (e.g. TestScore_RejectsPathTraversalAndAbsoluteMarker) that constructs tasks with marker paths containing "../" and absolute paths (like "../secret.txt" and "/etc/passwd") using the existing smokeTask helper and calls verknative.Score(task, dir); ensure the test writes matching files both inside and outside dir to prove traversal would succeed if unsanitized, but assert score.Solved is false for both inputs to lock in safe behavior; reference smokeTask and verknative.Score when locating where to add the test.internal/bench/runner_test.go (1)
100-475: 🏗️ Heavy liftAdd resume regression tests for result preservation and run-id consistency.
Please add tests that assert
Resumekeeps prior completed task outcomes (not rewritten ascancelled) and that persistedrun.json/complete.jsonretain the checkpoint run ID.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/bench/runner_test.go` around lines 100 - 475, Add two tests in runner_test.go near the other Run tests that exercise Resume: (1) create an initial run via Run with at least two tasks, ensure one is completed and checkpoint/checkpoint.json and results.jsonl contain that task, then call Resume with the same OutDir and verify the previously completed task remains completed (not changed to "cancelled") in results.jsonl and checkpoint.json; (2) perform a run, capture the run ID written to run.json and complete.json (or checkpoint run metadata), then resume and assert the same run ID is preserved in the persisted run.json/complete.json files after Resume; locate code references like Run, Resume, results.jsonl, checkpoint.json, run.json, and complete.json to implement the assertions.internal/bench/report_test.go (1)
202-254: ⚡ Quick winAdd CSV assertions for mixed-confidence and score-only-cost rows.
Current CSV tests validate first-row behavior, but they don't cover mixed usage confidence or fallback to
TaskResult.ScorewhenUsageis empty—both are important report invariants.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/bench/report_test.go` around lines 202 - 254, TestRenderCSV_HasHeaderAndRows is missing assertions for rows that should show mixed-confidence and cost derived from TaskResult.Score when Usage is empty; update the test data from makeTestRunResult/BuildReport (or ensure tasks like "task-2" and "task-4" exist in the generated report) and add CSV assertions after parsing records to check that the mixed-usage row has cost_confidence "mixed" (e.g., for task-2) and that the score-only row uses the TaskResult.Score value for cost_usd and sets cost_confidence to the appropriate fallback (e.g., "exact" or "unavailable" as defined by RenderCSV logic); locate these checks in TestRenderCSV_HasHeaderAndRows and add comparisons against records[...] indices consistent with the header (cost_usd at index 7, cost_confidence at index 8).internal/engine/constraints/promote.go (1)
134-136: 💤 Low valuePrefer
strings.HasSuffixfor suffix checks.Using
strings.HasSuffixis more idiomatic and readable than manual index slicing.- if len(e) > 6 && e[len(e)-6:] == ".jsonl" { + if strings.HasSuffix(e, ".jsonl") { sigs = append(sigs, Signature(e[:len(e)-6]))Note: You'd also want to use
strings.TrimSuffix(e, ".jsonl")for extracting the signature.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/engine/constraints/promote.go` around lines 134 - 136, Replace the manual slice-based suffix check and extraction in promote.go: instead of checking len(e) > 6 && e[len(e)-6:] == ".jsonl" and slicing e[:len(e)-6], use strings.HasSuffix(e, ".jsonl") to test the suffix and strings.TrimSuffix(e, ".jsonl") to get the base name, then append Signature(trimmed) to sigs (referencing variables e, sigs and the Signature type).internal/engine/constraints/execute.go (1)
144-146: 💤 Low valueSilent error swallowing may hide actionable failures.
When
git difffails for reasons other than "no commits yet", the error is silently discarded. Consider returning the error inExecuteResult.Errso operators can debug unexpected git failures.- // Other errors: return as informational, not fatal - return "", false, nil + // Other errors: return as informational + return "", false, fmt.Errorf("git diff failed: %w", runErr)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/engine/constraints/execute.go` around lines 144 - 146, The code currently swallows non-"no commits yet" git errors by returning "", false, nil; instead propagate that error into the execution result so callers can surface/debug it. Locate the return in internal/engine/constraints/execute.go (the branch handling git diff failures) and replace the nil error with an ExecuteResult that carries the original error in its Err field (or otherwise set ExecuteResult.Err to the caught error) so the function returns the error via ExecuteResult.Err rather than dropping it.
18 issues fixed across security (path traversal, OOM reads), error handling (swallowed write/flush errors, nil guards, validation), and logic bugs (checkpoint key collision, prefix vs exact matching, false attribution, dead code, cross-platform signature stability). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/bench/compare.go`:
- Around line 129-167: pairResults currently keys bMap/cMap only by TaskID which
causes overwrites when a run has multiple profiles; change the maps in
pairResults to use a composite key (e.g., TaskID + ProfileID or TaskID +
ProfileName) so each distinct profile-result remains separate, update the
composite key creation in pairResults where bMap[candidate[i].TaskID] and
similar are assigned, and propagate that composite identity into the
PairedTaskComparison struct (add a field like CompositeID or ProfileID) so
rendering/diffing and classifyOutcome lookups remain accurate; ensure
classifyOutcome(b, c) and any callers of PairedTaskComparison are adjusted to
expect/use the composite identity.
In `@internal/bench/governance.go`:
- Around line 19-27: The switch in SuiteClaimLabel is missing a case for
SamplingModeRegression so regression suites fall through to "exploratory"; add a
case for SamplingModeRegression and return the intended regression label (e.g.,
the same string used for SamplingModeSmoke like "regression/sanity only — not an
externally defensible ranking" or the correct regression-specific text) to
ensure SamplingModeRegression maps to the proper claim label.
In `@internal/bench/report.go`:
- Around line 272-275: aggregateTotals currently adds Score.CostUSD into the
grand totals when tr.Usage is empty but the per-profile (ps.costUSD aggregation
loop over tr.Usage) and CSV/profile table code paths omit Score.CostUSD, causing
inconsistent cost figures; update the per-profile and CSV/profile aggregation
logic (where ps.costUSD is accumulated from tr.Usage) to fall back to or also
include tr.Score.CostUSD when tr.Usage is empty (mirror the aggregateTotals
behavior), ensuring code paths that compute ps.costUSD and CSV rows reference
tr.Score.CostUSD as the source of cost when tr.Usage has no entries.
In `@internal/engine/constraints/promote.go`:
- Around line 67-75: The timeout_ms parsing silently falls back and allows
non-positive values; update the rawSpec["timeout_ms"] handling so that if the
key exists you validate its type and range and return an error immediately on
invalid input: accept only int or float64 (convert float64->int), reject other
types, and require the resulting timeoutMs > 0; set timeoutMs to the validated
value only after these checks. Ensure the error is propagated from the
surrounding parsing/constructor path so invalid timeout_ms cannot be persisted
(referencing rawSpec and timeoutMs in promote.go).
In `@internal/engine/intent.go`:
- Around line 64-67: The code trims criterion IDs only in the if guard but
stores the raw c.ID into requiredCriteria, causing mismatches for IDs with
surrounding whitespace; update the logic in the loop over plan.Criteria so you
compute a normalized id := strings.TrimSpace(c.ID) (or lower-case if
normalization requires) and then use that normalized id for both the non-empty
check and as the key stored in requiredCriteria; apply the same normalization
where membership checks occur (the other occurrence around lines 139-140) so all
set insertions and lookups use the same normalized ID variable.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 827e93a4-d8da-4df8-94d6-f6e89d78e16f
📒 Files selected for processing (17)
internal/adapters/repo/git/repo.gointernal/adapters/runtime/llmclibridge/bridge.gointernal/bench/compare.gointernal/bench/governance.gointernal/bench/providers/verknative/provider.gointernal/bench/registry.gointernal/bench/report.gointernal/bench/runner.gointernal/bench/types.gointernal/cli/learn.gointernal/engine/constraints/promote.gointernal/engine/constraints/signature.gointernal/engine/intent.gointernal/engine/intent_test.gointernal/engine/review_delta.gointernal/engine/test_resolver.gointernal/engine/ticket_quality_gate.go
🚧 Files skipped from review as they are similar to previous changes (9)
- internal/bench/registry.go
- internal/bench/types.go
- internal/engine/ticket_quality_gate.go
- internal/engine/constraints/signature.go
- internal/cli/learn.go
- internal/bench/providers/verknative/provider.go
- internal/engine/intent_test.go
- internal/adapters/repo/git/repo.go
- internal/engine/review_delta.go
…ion label, cost consistency, timeout validation, criterion normalization)
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/bench/governance.go`:
- Line 29: Replace the raw string literal "exploratory" with the existing shared
constant designExploratory in the function that currently returns "exploratory"
(update the return in governance.go to return designExploratory); this removes
the duplicate constant and fixes the goconst lint error while reusing the shared
symbol.
In `@internal/bench/report.go`:
- Around line 132-138: The repeated string literals in the switch (returning
"system result", "capability result", "runtime result", "exploratory") and
elsewhere ("exact", "estimated") should be replaced with package-level constants
to satisfy goconst; add clear constant names (e.g., resultSystem,
resultCapability, resultRuntime, resultExploratory, labelExact, labelEstimated)
and update the switch that checks ModeWorkerOnly and ModeRuntimeProbe (and the
other mode cases) to return those constants, and likewise replace the duplicate
literals around lines 379-385 with the new constants so all occurrences reuse
the same identifiers.
- Around line 348-357: The current loop over tr.Usage sums totalCost but picks
the first non-empty confidence, which can mislabel rows when later entries are
less certain; update the aggregation to compute totalCost as before but compute
confidence by scanning all u.Confidence values and choosing the most
conservative/worst certainty (e.g., if any u.Confidence == "estimated" then
confidence = "estimated", otherwise if any == "exact" then "exact", else ""),
then handle the len(tr.Usage)==0 fallback to tr.Score.CostUSD and finally call
defaultUsageConfidence(UsageRecord{CostUSD: totalCost, Confidence: confidence})
to produce the final confidence; adjust the code around tr.Usage, totalCost,
confidence, defaultUsageConfidence, and UsageRecord accordingly.
In `@internal/engine/intent.go`:
- Around line 154-156: The check that iterates over result.TargetFiles should
normalize each entry (trim surrounding whitespace and skip empty/whitespace-only
strings) before calling fileInOwned against ownedPaths to avoid false
"superset_paths" rejections; update the loop that references result.TargetFiles
to trim entries (e.g., strings.TrimSpace) and continue past empty results, then
call fileInOwned(f, ownedPaths) as before so only meaningful file paths are
validated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a0835c12-529e-4775-9c69-666676033092
📒 Files selected for processing (6)
internal/bench/compare.gointernal/bench/governance.gointernal/bench/governance_test.gointernal/bench/report.gointernal/engine/constraints/promote.gointernal/engine/intent.go
🚧 Files skipped from review as they are similar to previous changes (3)
- internal/bench/governance_test.go
- internal/engine/constraints/promote.go
- internal/bench/compare.go
…CSV, trim TargetFiles
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/bench/report.go`:
- Around line 128-132: Add a dedicated constant for the repeated "unavailable"
confidence string and replace literal occurrences with it: extend the existing
constants block (which defines claimSystemResult, confidenceExact,
confidenceEstim) by adding something like confidenceUnavailable = "unavailable",
then update all uses (e.g., where the code currently compares/assigns the
literal "unavailable" around the confidence logic referenced in the diff,
including the other occurrences noted) to reference confidenceUnavailable
instead of the string literal.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5d7b5eed-a758-45a2-922e-027972cbc7a6
📒 Files selected for processing (3)
internal/bench/governance.gointernal/bench/report.gointernal/engine/intent.go
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/bench/governance.go
- internal/engine/intent.go
Summary
50 commits implementing all 9 plans from
docs/plans/:runIntentGateininternal/engine/intent.go; called fromticket_run.gobefore implement phase; policy flag off by default (policy.intent.enabled)DetectLanguagesFromPaths+WorkerRequest.StandardsResolutionEvidence/DiffRange/TestReferencein closeout schema v2;MigrateReviewFindingsToV2internal/engine/wave_review.go;WaveReviewArtifact; shadow/enforce mode; wired into epic outer loopinternal/engine/epic_review.go;EpicPlanReviewArtifact; runs before first wave; shadow/enforce modeinternal/engine/constraints/package; candidate accumulation; CLI (verk constraints list/show/disable/history); closeout + verify hooks; 2 seed constraintsEvaluateTicketQuality,RunTicketQualityGate, policy configDetectProfile(4 profiles), rationalization injection,go:embedprofile MDsTicketOutcome/TicketPhaseseparation,stopKindenum,classifyTicketStopDiffAgainstFiles,reviewBaseline/collectReviewDeltaEscapedDefect/PromotionEntryJSONL store,verk learnCLIverk benchCLITest plan
go build ./...— cleango test ./...— 1157 tests pass across 22 packagesjust lint— clean🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation