CNTRLPLANE-3339: add agent and convention eval framework#8382
CNTRLPLANE-3339: add agent and convention eval framework#8382enxebre wants to merge 2 commits intoopenshift:mainfrom
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@enxebre: This pull request references CNTRLPLANE-3339 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: enxebre The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughAdds an opt-in agent evaluation framework: a Ginkgo Sequence Diagram(s)sequenceDiagram
participant TestHarness as Test Harness
participant ScenarioDir as Scenario Directory
participant GitRepo as Git Repository
participant Agent as Claude Agent (CLI)
participant Judge as Claude Judge (CLI)
TestHarness->>ScenarioDir: Discover scenarios (prompt.txt, expected.txt, optional patch.diff)
loop For each scenario across EVAL_RUNS trials
TestHarness->>GitRepo: Optionally create worktree and apply patch.diff
TestHarness->>Agent: Invoke claude (agent model) with prompt
Agent-->>TestHarness: Return result & total_cost_usd
TestHarness->>Judge: Invoke claude (judge model) with templated judge prompt (includes agent JSON)
Judge-->>TestHarness: Return JSON {pass, issues}
TestHarness->>GitRepo: Defer cleanup (remove worktree / revert)
TestHarness->>TestHarness: Record pass/missed issues and cost
end
TestHarness->>TestHarness: Compute pass-rate per scenario
TestHarness->>TestHarness: Assert pass-rate >= EVAL_THRESHOLD and report aggregated costs
Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 3 warnings)
✅ Passed checks (8 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Review rate limit: 9/10 reviews remaining, refill in 6 minutes. Comment |
- api-sme: add mandatory linter run before API reviews - api/AGENTS.md: add field grouping rule — fields sharing a common prefix must be consolidated into a dedicated struct - api/AGENTS.md: add best practices section pointing to etcdbackup_types.go and karpenter_types.go as reference examples Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 9
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/agents/api-sme.md:
- Line 19: Update the broken relative link in .claude/agents/api-sme.md: replace
the reference string '../api/AGENTS.md' with the unambiguous repo path
'api/AGENTS.md' so the agent points to the authoritative API guide; ensure the
changed symbol is the link text or path occurrence in the file (search for
'../api/AGENTS.md' and change it to 'api/AGENTS.md').
In `@api/hypershift/v1beta1/hostedcluster_types.go`:
- Around line 850-853: The FooConfig struct's FooDomains field uses an exported
JSON name "FooDomains" which should be lower camel case; update the struct tag
on FooDomains (type FooConfig) from `json:"FooDomains,omitempty"` to
`json:"fooDomains,omitempty"` so the wire representation matches the API
convention and avoid breaking clients.
- Around line 837-838: Rename the struct field Foo_IP to FooIP and change its
JSON tag to include the normalized camelCase name and an omission directive
(e.g., `json:"fooIP,omitempty"`) so the zero value is not serialized; update any
references to Foo_IP to use FooIP and ensure generated CRD/DeepCopy/regeneration
steps are run as appropriate.
- Around line 843-847: The FooID pointer field is still removable because the
current CEL rule only blocks changing a present value; add a parent-level CEL
rule on the containing API struct to prevent removal once set by using the
pattern oldSelf.has(fooID) ? self.has(fooID) : true; add a
+kubebuilder:validation:XValidation annotation with
rule="oldSelf.has(self.fooID) ? self.has(self.fooID) : true" (adjust syntax to
match the struct context) so once FooID is set it cannot be cleared or replaced.
- Around line 836-847: Move Foo_IP, FooConfig, and FooID into a dedicated struct
(e.g., type FooSpec struct { IP string `json:"ip"`; Config *FooConfig
`json:"config,omitempty"`; // +kubebuilder:validation:XValidation:rule="self ==
oldSelf",message="fooID is immutable" FooID *string `json:"id,omitempty"` }) and
replace the three top-level fields in HostedClusterSpec with a single pointer
field Foo *FooSpec `json:"foo,omitempty"`; ensure you transfer comments, json
tags, the immutability kubebuilder tag from FooID into the new FooSpec.FooID
field, keep pointer/omitempty semantics for optional fields, and update any
code/tests that referenced HostedClusterSpec.Foo_IP, .FooConfig, or .FooID to
use HostedClusterSpec.Foo.IP, .Foo.Config, or .Foo.ID respectively.
In `@Makefile`:
- Around line 392-398: The eval-agents recipe currently fans out eval-% targets
in parallel (using $(MAKE) -j $(EVAL_TARGETS)), which races with repo-mutating
tests in test/eval/eval_test.go; fix by either serializing those targets or
isolating each target into its own temporary worktree: change the eval-agents
rule to invoke $(MAKE) $(EVAL_TARGETS) without -j to force serialized runs, or
modify the eval-% target (and/or the test harness in test/eval/eval_test.go
around the patch-based code at lines ~248-268) to create and use a temporary git
worktree per target before applying patches/checkout so parallelism is safe.
Ensure references to the Makefile targets (eval-agents, EVAL_TARGETS, eval-%)
and the test file (test/eval/eval_test.go) are updated accordingly.
In `@test/eval/eval_test.go`:
- Around line 350-385: The patch is currently applied once before the evalRuns
loop causing later runs to see a mutated repo; move the patch lifecycle inside
the loop so each trial starts from the same patched state: call
applyPatch(tc.Patch) at the start of each iteration (inside the for i := range
evalRuns loop) and ensure revertPatch(tc.Patch) is executed after that iteration
(use DeferCleanup scoped per-iteration or an explicit revert after judge
evaluation) so each run is independent; update references around applyPatch,
revertPatch, evalRuns, runAgent and runJudge to reflect the per-iteration
apply/revert flow.
In `@test/eval/README.md`:
- Around line 42-56: The fenced code block showing the directory tree in the
README should include a language specifier to satisfy markdownlint MD040; update
the opening triple backticks for the block containing the directory listing to
```text (i.e., change ``` to ```text) so the directory tree is treated as plain
text and the lint warning is resolved.
- Around line 75-76: Update the README wording to reflect that non-patch runs
are not toolless: clarify that runAgent (see runAgent in test/eval/eval_test.go)
always enables the read-only tools Read, Grep, and Glob, and only gates the Bash
tool on the presence of patch.diff; replace the sentence “with tools enabled if
a patch is present, disabled otherwise” with text stating “Read, Grep, and Glob
are always available; Bash is enabled only when a patch.diff is present (i.e.,
write tools are gated).”
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: a92a4f72-7509-4336-9344-38ab6b57442b
⛔ Files ignored due to path filters (13)
test/eval/testdata/conventions/01-go-test-style/expected.txtis excluded by!**/testdata/**test/eval/testdata/conventions/01-go-test-style/prompt.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/api-sme/01-api-design-review/expected.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/api-sme/01-api-design-review/patch.diffis excluded by!**/testdata/**test/eval/testdata/sme-agents/api-sme/01-api-design-review/prompt.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/cloud-provider-sme/01-kms-integration/expected.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/cloud-provider-sme/01-kms-integration/prompt.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/control-plane-sme/01-ho-cpo-version-skew/expected.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/control-plane-sme/01-ho-cpo-version-skew/prompt.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/data-plane-sme/01-spot-instance-lifecycle/expected.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/data-plane-sme/01-spot-instance-lifecycle/prompt.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/hcp-architect-sme/01-architectural-review/expected.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/hcp-architect-sme/01-architectural-review/prompt.txtis excluded by!**/testdata/**
📒 Files selected for processing (6)
.claude/agents/api-sme.mdMakefileapi/AGENTS.mdapi/hypershift/v1beta1/hostedcluster_types.gotest/eval/README.mdtest/eval/eval_test.go
|
|
||
| **MANDATORY**: Before writing any review, you MUST run `make api-lint-fix` and include its output in your review. Do not skip this step. The linter is the authoritative source for convention violations. Your review must start with the linter findings, then add your own analysis on top. | ||
|
|
||
| Stick to ../api/AGENTS.md |
There was a problem hiding this comment.
Fix the API guide path.
../api/AGENTS.md does not resolve to the authoritative file from this location, so the agent can miss the repo’s primary API guidance entirely. Point this at an unambiguous repo path like api/AGENTS.md instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/agents/api-sme.md at line 19, Update the broken relative link in
.claude/agents/api-sme.md: replace the reference string '../api/AGENTS.md' with
the unambiguous repo path 'api/AGENTS.md' so the agent points to the
authoritative API guide; ensure the changed symbol is the link text or path
occurrence in the file (search for '../api/AGENTS.md' and change it to
'api/AGENTS.md').
| ``` | ||
| test/eval/ | ||
| eval_test.go # Test harness | ||
| testdata/ | ||
| sme-agents/ # Agent scenarios (uses --agent flag) | ||
| <agent-name>/ | ||
| <scenario>/ | ||
| prompt.txt # Input prompt | ||
| expected.txt # Expected issues, one per line | ||
| patch.diff # Optional: applied before run | ||
| conventions/ # Convention tests (no agent) | ||
| <scenario>/ | ||
| prompt.txt | ||
| expected.txt | ||
| ``` |
There was a problem hiding this comment.
Add a language to this fenced block.
markdownlint is already flagging this with MD040. text is enough for the directory tree.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 42-42: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/eval/README.md` around lines 42 - 56, The fenced code block showing the
directory tree in the README should include a language specifier to satisfy
markdownlint MD040; update the opening triple backticks for the block containing
the directory listing to ```text (i.e., change ``` to ```text) so the directory
tree is treated as plain text and the lint warning is resolved.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #8382 +/- ##
=======================================
Coverage 36.42% 36.42%
=======================================
Files 765 765
Lines 93302 93302
=======================================
Hits 33981 33981
Misses 56606 56606
Partials 2715 2715
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/eval/eval_test.go`:
- Around line 237-245: patchedFiles currently only collects "+++ b/..." entries
so removed files and other side-effecting paths are missed, causing repo state
leakage; update patchedFiles to parse both "+++ b/..." and "--- a/..." (and any
"/dev/null" cases) to produce a complete set of touched paths, and change
revertPatch to: (1) restore modified/removed files via git checkout/restore
using the combined set, (2) detect added files (those present in "+++ b/" but
not in "--- a/" or with /dev/null) and remove them (git rm or filesystem
delete), and (3) run a safe git clean (or equivalent) for any untracked files
created by tools; reference the patchedFiles and revertPatch functions when
making these changes so the cleanup covers deletions, additions, and tool
side-effects.
- Around line 136-137: The parsed evalThreshold (from
envOrDefault("EVAL_THRESHOLD", fmt.Sprintf("%g", defaultThreshold)) into
evalThreshold) is not validated for bounds; add a check immediately after
strconv.ParseFloat to assert 0.0 <= evalThreshold <= 1.0 and fail the test with
a clear message that includes the actual value and the EVAL_THRESHOLD env var if
it is out of range (use the existing test assertion style, e.g. Expect/... or
equivalent) so invalid pass-rate semantics are rejected.
- Around line 127-145: Add a guard in the BeforeSuite (in
test/eval/eval_test.go) to detect a dirty git working tree before any
destructive operations (applyPatch/revertPatch) run: run a git status
--porcelain (e.g., via exec.Command("git", "status", "--porcelain")) in repoRoot
and Expect the output to be empty, failing the suite immediately with a clear
message if not; place this check after computing repoRoot and before any tests
that may call applyPatch/revertPatch so you prevent accidental overwrites of
local edits.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 96887751-65f4-4af2-9f90-bd26c782618d
⛔ Files ignored due to path filters (13)
test/eval/testdata/conventions/01-go-test-style/expected.txtis excluded by!**/testdata/**test/eval/testdata/conventions/01-go-test-style/prompt.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/api-sme/01-api-design-review/expected.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/api-sme/01-api-design-review/patch.diffis excluded by!**/testdata/**test/eval/testdata/sme-agents/api-sme/01-api-design-review/prompt.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/cloud-provider-sme/01-kms-integration/expected.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/cloud-provider-sme/01-kms-integration/prompt.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/control-plane-sme/01-ho-cpo-version-skew/expected.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/control-plane-sme/01-ho-cpo-version-skew/prompt.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/data-plane-sme/01-spot-instance-lifecycle/expected.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/data-plane-sme/01-spot-instance-lifecycle/prompt.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/hcp-architect-sme/01-architectural-review/expected.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/hcp-architect-sme/01-architectural-review/prompt.txtis excluded by!**/testdata/**
📒 Files selected for processing (3)
Makefiletest/eval/README.mdtest/eval/eval_test.go
| evalThreshold, err = strconv.ParseFloat(envOrDefault("EVAL_THRESHOLD", fmt.Sprintf("%g", defaultThreshold)), 64) | ||
| Expect(err).NotTo(HaveOccurred(), "EVAL_THRESHOLD must be a float") |
There was a problem hiding this comment.
Validate EVAL_THRESHOLD bounds.
Right now any float parses; values outside [0,1] create invalid pass-rate semantics.
Proposed fix
evalThreshold, err = strconv.ParseFloat(envOrDefault("EVAL_THRESHOLD", fmt.Sprintf("%g", defaultThreshold)), 64)
Expect(err).NotTo(HaveOccurred(), "EVAL_THRESHOLD must be a float")
+ Expect(evalThreshold).To(BeNumerically(">=", 0.0), "EVAL_THRESHOLD must be >= 0.0")
+ Expect(evalThreshold).To(BeNumerically("<=", 1.0), "EVAL_THRESHOLD must be <= 1.0")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| evalThreshold, err = strconv.ParseFloat(envOrDefault("EVAL_THRESHOLD", fmt.Sprintf("%g", defaultThreshold)), 64) | |
| Expect(err).NotTo(HaveOccurred(), "EVAL_THRESHOLD must be a float") | |
| evalThreshold, err = strconv.ParseFloat(envOrDefault("EVAL_THRESHOLD", fmt.Sprintf("%g", defaultThreshold)), 64) | |
| Expect(err).NotTo(HaveOccurred(), "EVAL_THRESHOLD must be a float") | |
| Expect(evalThreshold).To(BeNumerically(">=", 0.0), "EVAL_THRESHOLD must be >= 0.0") | |
| Expect(evalThreshold).To(BeNumerically("<=", 1.0), "EVAL_THRESHOLD must be <= 1.0") |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/eval/eval_test.go` around lines 136 - 137, The parsed evalThreshold
(from envOrDefault("EVAL_THRESHOLD", fmt.Sprintf("%g", defaultThreshold)) into
evalThreshold) is not validated for bounds; add a check immediately after
strconv.ParseFloat to assert 0.0 <= evalThreshold <= 1.0 and fail the test with
a clear message that includes the actual value and the EVAL_THRESHOLD env var if
it is out of range (use the existing test assertion style, e.g. Expect/... or
equivalent) so invalid pass-rate semantics are rejected.
|
The Makefile changes only add eval-related targets ( Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThis is a pre-existing flaky test in the Kubernetes 1.35.0 envtest suite, unrelated to PR #8382's changes. The test installs all CRDs, then uninstalls them and polls for complete deletion with a 30-second timeout. On Kubernetes 1.35.0, the CRD Root CauseThe root cause is a race condition in CRD finalization under Kubernetes 1.35.0's envtest. The test at Eventually(func() bool {
err := k8sClient.Get(ctx, key, &apiextensionsv1.CustomResourceDefinition{})
return apierrors.IsNotFound(err)
}, "30s", "1s").Should(BeTrue(), ...)Kubernetes 1.35.x introduced changes to CRD condition handling (notably Evidence this is pre-existing and not caused by PR #8382:
Recommendations
Evidence
|
Add an evaluation framework for testing Claude Code agent definitions and AGENTS.md conventions. Uses a Go test with Ginkgo at test/eval/ with prompt.txt + expected.txt per scenario, a bidirectional LLM judge with per-issue structured verdicts, configurable pass-rate thresholds, and cost tracking. Key features: - Patch-based scenarios: apply diffs to real API files so agents can run make api-lint-fix against actual code - Convention tests: validate AGENTS.md rules without a specific agent - Auto-discovered make targets from testdata directory structure - Parallel execution via make -j - Structured judge output with per-issue COVERED/MISSED verdicts Scenarios: - sme-agents/api-sme: API design review with linter integration - sme-agents/cloud-provider-sme: KMS encryption design - sme-agents/control-plane-sme: HO/CPO version skew coordination - sme-agents/data-plane-sme: spot instance lifecycle - sme-agents/hcp-architect-sme: architectural review - conventions/01-go-test-style: Gherkin + gomega conventions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (4)
test/eval/README.md (2)
42-56:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd a language to the fenced directory-tree block.
This currently triggers markdownlint MD040; use
textfor the tree block.Proposed fix
-``` +```text test/eval/ eval_test.go # Test harness testdata/ ... -``` +```🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/eval/README.md` around lines 42 - 56, Change the fenced directory-tree block to specify the language by replacing the opening triple backticks with ```text (i.e., update the fenced directory-tree block that starts with ``` to ```text) and keep the existing closing ``` so the block is recognized as plain text and no longer triggers markdownlint MD040.
75-76:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAlign tool-availability docs with the harness behavior.
The harness always enables
Read,Grep,Glob; onlyBashis gated bypatch.diff(test/eval/eval_test.go, Line 282-286).Proposed fix
-3. **Agent invocation**: runs `claude --agent <name> -p <prompt>` - with tools enabled if a patch is present, disabled otherwise +3. **Agent invocation**: runs `claude --agent <name> -p <prompt>` + with `Read`, `Grep`, and `Glob` always enabled; `Bash` is enabled only + when `patch.diff` is present🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/eval/README.md` around lines 75 - 76, Update the README line about agent invocation to match the test harness behavior: state that the harness always enables the Read, Grep and Glob tools and only enables Bash when a patch.diff is present (as implemented in the test harness in eval_test.go where Read/Grep/Glob are unconditionally enabled and Bash is gated by patch.diff).test/eval/eval_test.go (2)
344-381:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRecreate patched state per trial to keep runs independent.
The worktree is created once (Line 345-348) and reused for all
EVAL_RUNSiterations. If run 1 mutates files (e.g., via Bash tool), runs 2..N no longer evaluate the same starting state.Proposed fix shape
- workDir := repoRoot - if tc.Patch != nil { - workDir = createWorktree(tc.Patch) - DeferCleanup(func() { removeWorktree(workDir) }) - } - for i := range evalRuns { + workDir := repoRoot + if tc.Patch != nil { + workDir = createWorktree(tc.Patch) + } + + func() { + if tc.Patch != nil { + defer removeWorktree(workDir) + } By(fmt.Sprintf("run %d/%d", i+1, evalRuns)) agentOutput, agentCost := runAgent(tc, evalModel, workDir) ... if judge.Pass { result.Passed++ } else { ... } + }() }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/eval/eval_test.go` around lines 344 - 381, The current code creates the worktree once (workDir := repoRoot / createWorktree) and reuses it for all iterations of the evalRuns loop, so side effects from runAgent can taint later runs; move the createWorktree/cleanup logic into the loop so each iteration uses a fresh workDir: inside the for i := range evalRuns loop call createWorktree(tc.Patch) to set workDir (and register a DeferCleanup or call removeWorktree after that iteration) so each runAgent invocation gets an independent filesystem state; ensure any existing references to workDir outside the loop are updated accordingly and keep runAgent/judge calls unchanged.
136-137:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winValidate
EVAL_THRESHOLDbounds.
EVAL_THRESHOLDis parsed but never constrained; values outside0.0..1.0make pass-rate semantics invalid.Proposed fix
evalThreshold, err = strconv.ParseFloat(envOrDefault("EVAL_THRESHOLD", fmt.Sprintf("%g", defaultThreshold)), 64) Expect(err).NotTo(HaveOccurred(), "EVAL_THRESHOLD must be a float") + Expect(evalThreshold).To(BeNumerically(">=", 0.0), "EVAL_THRESHOLD must be >= 0.0") + Expect(evalThreshold).To(BeNumerically("<=", 1.0), "EVAL_THRESHOLD must be <= 1.0")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/eval/eval_test.go` around lines 136 - 137, The parsed evalThreshold from envOrDefault is not validated; add a bounds check after strconv.ParseFloat to ensure evalThreshold is within 0.0..1.0 and fail the test with a clear message if not (e.g., use Expect or t.Fatalf to assert evalThreshold >= 0.0 && evalThreshold <= 1.0 with message "EVAL_THRESHOLD must be between 0.0 and 1.0"); update the block around evalThreshold, err = strconv.ParseFloat(...) to include this range validation and clear error message referencing evalThreshold.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/eval/eval_test.go`:
- Around line 257-263: The removeWorktree function currently discards errors
from cmd.CombinedOutput() and os.RemoveAll(dir), which can leak state and hide
CI failures; capture the error and output from cmd.CombinedOutput() and the
error from os.RemoveAll(dir) and surface them (e.g., call Ginkgo's Expect/Fail
or t.Fatalf) with contextual messages including the command output and the
repoRoot/dir so cleanup failures are visible in CI; update removeWorktree to
check the error returned by cmd.CombinedOutput() and by os.RemoveAll and fail
the test or log the errors instead of ignoring them.
---
Duplicate comments:
In `@test/eval/eval_test.go`:
- Around line 344-381: The current code creates the worktree once (workDir :=
repoRoot / createWorktree) and reuses it for all iterations of the evalRuns
loop, so side effects from runAgent can taint later runs; move the
createWorktree/cleanup logic into the loop so each iteration uses a fresh
workDir: inside the for i := range evalRuns loop call createWorktree(tc.Patch)
to set workDir (and register a DeferCleanup or call removeWorktree after that
iteration) so each runAgent invocation gets an independent filesystem state;
ensure any existing references to workDir outside the loop are updated
accordingly and keep runAgent/judge calls unchanged.
- Around line 136-137: The parsed evalThreshold from envOrDefault is not
validated; add a bounds check after strconv.ParseFloat to ensure evalThreshold
is within 0.0..1.0 and fail the test with a clear message if not (e.g., use
Expect or t.Fatalf to assert evalThreshold >= 0.0 && evalThreshold <= 1.0 with
message "EVAL_THRESHOLD must be between 0.0 and 1.0"); update the block around
evalThreshold, err = strconv.ParseFloat(...) to include this range validation
and clear error message referencing evalThreshold.
In `@test/eval/README.md`:
- Around line 42-56: Change the fenced directory-tree block to specify the
language by replacing the opening triple backticks with ```text (i.e., update
the fenced directory-tree block that starts with ``` to ```text) and keep the
existing closing ``` so the block is recognized as plain text and no longer
triggers markdownlint MD040.
- Around line 75-76: Update the README line about agent invocation to match the
test harness behavior: state that the harness always enables the Read, Grep and
Glob tools and only enables Bash when a patch.diff is present (as implemented in
the test harness in eval_test.go where Read/Grep/Glob are unconditionally
enabled and Bash is gated by patch.diff).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 51bed98b-f500-4731-a819-345bbd438e74
⛔ Files ignored due to path filters (13)
test/eval/testdata/conventions/01-go-test-style/expected.txtis excluded by!**/testdata/**test/eval/testdata/conventions/01-go-test-style/prompt.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/api-sme/01-api-design-review/expected.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/api-sme/01-api-design-review/patch.diffis excluded by!**/testdata/**test/eval/testdata/sme-agents/api-sme/01-api-design-review/prompt.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/cloud-provider-sme/01-kms-integration/expected.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/cloud-provider-sme/01-kms-integration/prompt.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/control-plane-sme/01-ho-cpo-version-skew/expected.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/control-plane-sme/01-ho-cpo-version-skew/prompt.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/data-plane-sme/01-spot-instance-lifecycle/expected.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/data-plane-sme/01-spot-instance-lifecycle/prompt.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/hcp-architect-sme/01-architectural-review/expected.txtis excluded by!**/testdata/**test/eval/testdata/sme-agents/hcp-architect-sme/01-architectural-review/prompt.txtis excluded by!**/testdata/**
📒 Files selected for processing (3)
Makefiletest/eval/README.mdtest/eval/eval_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- Makefile
| func removeWorktree(dir string) { | ||
| By("removing git worktree") | ||
| cmd := exec.Command("git", "worktree", "remove", "--force", dir) | ||
| cmd.Dir = repoRoot | ||
| cmd.CombinedOutput() | ||
| os.RemoveAll(dir) | ||
| } |
There was a problem hiding this comment.
Don’t ignore worktree cleanup failures.
Line 259 and Line 262 discard errors. If cleanup fails, state can leak across scenarios and CI diagnostics are lost.
Proposed fix
func removeWorktree(dir string) {
By("removing git worktree")
cmd := exec.Command("git", "worktree", "remove", "--force", dir)
cmd.Dir = repoRoot
- cmd.CombinedOutput()
- os.RemoveAll(dir)
+ out, err := cmd.CombinedOutput()
+ Expect(err).NotTo(HaveOccurred(), "git worktree remove failed: %s", string(out))
+ err = os.RemoveAll(dir)
+ Expect(err).NotTo(HaveOccurred(), "failed to remove worktree dir %s", dir)
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/eval/eval_test.go` around lines 257 - 263, The removeWorktree function
currently discards errors from cmd.CombinedOutput() and os.RemoveAll(dir), which
can leak state and hide CI failures; capture the error and output from
cmd.CombinedOutput() and the error from os.RemoveAll(dir) and surface them
(e.g., call Ginkgo's Expect/Fail or t.Fatalf) with contextual messages including
the command output and the repoRoot/dir so cleanup failures are visible in CI;
update removeWorktree to check the error returned by cmd.CombinedOutput() and by
os.RemoveAll and fail the test or log the errors instead of ignoring them.
|
@enxebre: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| haikuModel = "claude-haiku-4-5-20251001" | ||
|
|
||
| defaultModel = opusModel | ||
| defaultJudgeModel = opusModel |
There was a problem hiding this comment.
I'd use haiku for the judge, it's actually pretty good for this. Opus is slow and doesn't follow the prompt as well (sometimes doesn't return json 🤦).
I'd also look at splitting test cases into two tiers: golden tests (single-issue, targeted) run with sonnet, and integration tests (multi-issue, complex scenarios) run with opus. Right now it looks like we've only got the integration style cases.
The golden tests are faster, cheaper, and more useful for pinpointing exactly which area a change broke: if a golden test for e.g. "missing optional doc" fails, you know immediately what regressed without wading through a multi-issue report. Integration tests then cover whether the agent handles realistic PRs with several issues at once.
There was a problem hiding this comment.
Summary
Solid framework. Auto-discovery, LLM-as-judge, build-tag isolation, and the deliberately-planted-violation pattern all work well.
What works
api/AGENTS.mdconsolidation is a nice improvement over the scattered design principles- Scenario auto-discovery from directory structure means no Makefile changes to add tests
- Cost transparency in the AfterSuite summary
Additional finding
CodeRabbit already covers the mechanical issues (patch lifecycle, parallel race, cleanup errors). One thing not yet flagged:
- The judge prompt's "no entirely unrelated issues" clause can cause false negatives at the default
EVAL_RUNS=1 - Agents frequently volunteer adjacent observations, and a single tangential comment fails the whole scenario with no retry budget
- Worth either softening that clause or bumping the default
EVAL_RUNS
Agreements
- +1 to @theobarberbany on using haiku for the judge role
Generated with Claude Code
| @@ -0,0 +1,29 @@ | |||
| diff --git a/api/hypershift/v1beta1/hostedcluster_types.go b/api/hypershift/v1beta1/hostedcluster_types.go | |||
There was a problem hiding this comment.
aside: we have an API review skill we're working on over in o/api. It'll use KAL + the api review guidance over there. Might be able to shell out to that?
There was a problem hiding this comment.
For this effort we'll keep exercising this sme agent which also relies on the kas linter.
Let's follow up and see how to converge eventually.
In general I would expect we eventually expose a generic api review skill and others as part of a curated promoted skills repo which consumption is standardized.
| "--print", | ||
| "--model", model, | ||
| "--output-format", "json", | ||
| "--no-session-persistence", |
There was a problem hiding this comment.
I'd add "--dangerously-skip-permissions", otherwise claude code will hang waiting on interactive permission input
There was a problem hiding this comment.
I think we could just add the same arg inputs we do when we run Claude in Prow
There was a problem hiding this comment.
yes, this is targeted tools so it doesn't hang.
| Expect(err).NotTo(HaveOccurred(), "claude judge command failed: %s", string(output)) | ||
|
|
||
| var parsed claudeOutput | ||
| err = json.Unmarshal(output, &parsed) |
There was a problem hiding this comment.
I'd write a helperfunc to strip anything before the opening { as sometimes the CLI outputs e.g warnings. that will break this :(
|
@devguyio - I'm defaulting EVAL_RUNS=3 for this reason. |
Summary
test/eval/) for evaluating Claude Code agent definitions and AGENTS.md conventionsmake api-lint-fix)make -jScenarios
sme-agents/api-sme/01-api-design-review— API design review with linter integrationsme-agents/cloud-provider-sme/01-kms-integration— KMS encryption designsme-agents/control-plane-sme/01-ho-cpo-version-skew— HO/CPO versioning coordinationsme-agents/data-plane-sme/01-spot-instance-lifecycle— spot instance lifecyclesme-agents/hcp-architect-sme/01-architectural-review— architectural reviewconventions/01-go-test-style— Gherkin + gomega conventionsUsage
Test plan
make eval-api-smepasses consistentlymake eval-conventionspassesgo build -tags eval ./...make eval-agents🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation