CNTRLPLANE-3339: add agent and convention eval framework#8382

Open
enxebre wants to merge 2 commits into openshift:main from enxebre:evals

Conversation

@enxebre
Member

@enxebre enxebre commented Apr 30, 2026

Summary

  • Add a Go test harness (test/eval/) for evaluating Claude Code agent definitions and AGENTS.md conventions
  • Uses prompt.txt + expected.txt per scenario with an LLM-as-judge for semantic matching
  • Supports patch-based scenarios where agents run tools against real code (e.g., make api-lint-fix)
  • Auto-discovered make targets with parallel execution via make -j
  • Update api-sme agent to run the linter before reviews
  • Add field grouping rule and best practices section to api/AGENTS.md

Scenarios

  • sme-agents/api-sme/01-api-design-review — API design review with linter integration
  • sme-agents/cloud-provider-sme/01-kms-integration — KMS encryption design
  • sme-agents/control-plane-sme/01-ho-cpo-version-skew — HO/CPO versioning coordination
  • sme-agents/data-plane-sme/01-spot-instance-lifecycle — spot instance lifecycle
  • sme-agents/hcp-architect-sme/01-architectural-review — architectural review
  • conventions/01-go-test-style — Gherkin + gomega conventions

Usage

make eval-agents                              # all scenarios in parallel
make eval-api-sme                             # single agent
make eval-agents EVAL_FOCUS=api-sme EVAL_VERBOSE=1  # verbose single
make eval-agents EVAL_RUNS=5 EVAL_THRESHOLD=0.6     # statistical

Test plan

  • make eval-api-sme passes consistently
  • make eval-conventions passes
  • All scenarios compile with go build -tags eval ./...
  • Full parallel run: make eval-agents

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Agent evaluation harness with Make targets to run focused or parallel evaluation runs; configurable model, judge selection, run count, pass threshold, focus, and verbose modes.
  • Documentation

    • Updated API guidance with a revised Approach, mandatory review workflow step, best-practice/type-change patterns, and field-grouping guidance.
    • New evaluation-suite documentation describing scenario layout, configuration options, run commands, and expected evaluation flow.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) Apr 30, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 30, 2026

@enxebre: This pull request references CNTRLPLANE-3339 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/needs-area, area/ai (the PR includes changes related to AI: Claude agents, Cursor rules, etc.), and area/api (the PR includes changes for the API) labels Apr 30, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 30, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the area/testing (the PR includes changes for e2e testing) and approved (the PR has been approved by an approver from all required OWNERS files) labels, and removed the do-not-merge/needs-area label Apr 30, 2026
@openshift-ci openshift-ci Bot requested review from bryan-cox and clebs April 30, 2026 09:53
@coderabbitai
Contributor

coderabbitai Bot commented Apr 30, 2026

Walkthrough

Adds an opt-in agent evaluation framework: a Ginkgo eval test suite that discovers scenarios in test/eval/testdata, optionally applies patch.diff via a detached git worktree, invokes Claude agents via the claude CLI for multiple runs, sends outputs to a separate Claude judge for semantic comparison against expected.txt, computes per-scenario pass rates against EVAL_THRESHOLD, and reports aggregated results and costs. Also adds Make targets (eval-agents, eval-%), a README for the suite, and documentation updates to api/AGENTS.md and .claude/agents/api-sme.md.
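The pass-rate check described in the walkthrough can be illustrated with a minimal sketch. The function names `passRate` and `meetsThreshold` are hypothetical and are not taken from test/eval/eval_test.go; the only assumption is that each run yields a boolean verdict and that a scenario passes when the fraction of passing runs is at least EVAL_THRESHOLD.

```go
package main

import "fmt"

// passRate returns the fraction of passing runs; zero runs yield 0.
func passRate(passes []bool) float64 {
	if len(passes) == 0 {
		return 0
	}
	n := 0
	for _, p := range passes {
		if p {
			n++
		}
	}
	return float64(n) / float64(len(passes))
}

// meetsThreshold reports whether the pass rate reaches the threshold.
// (Hypothetical helper; the real harness may structure this differently.)
func meetsThreshold(passes []bool, threshold float64) bool {
	return passRate(passes) >= threshold
}

func main() {
	// Mirrors EVAL_RUNS=5 EVAL_THRESHOLD=0.6 with 3 of 5 runs passing.
	runs := []bool{true, true, false, true, false}
	fmt.Printf("rate=%.2f ok=%v\n", passRate(runs), meetsThreshold(runs, 0.6))
}
```

Treating zero runs as a rate of 0 is a design choice in this sketch: an empty scenario then fails any positive threshold instead of dividing by zero.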

Sequence Diagram(s)

sequenceDiagram
    participant TestHarness as Test Harness
    participant ScenarioDir as Scenario Directory
    participant GitRepo as Git Repository
    participant Agent as Claude Agent (CLI)
    participant Judge as Claude Judge (CLI)

    TestHarness->>ScenarioDir: Discover scenarios (prompt.txt, expected.txt, optional patch.diff)
    
    loop For each scenario across EVAL_RUNS trials
        TestHarness->>GitRepo: Optionally create worktree and apply patch.diff
        TestHarness->>Agent: Invoke claude (agent model) with prompt
        Agent-->>TestHarness: Return result & total_cost_usd
        TestHarness->>Judge: Invoke claude (judge model) with templated judge prompt (includes agent JSON)
        Judge-->>TestHarness: Return JSON {pass, issues}
        TestHarness->>GitRepo: Defer cleanup (remove worktree / revert)
        TestHarness->>TestHarness: Record pass/missed issues and cost
    end
    
    TestHarness->>TestHarness: Compute pass-rate per scenario
    TestHarness->>TestHarness: Assert pass-rate >= EVAL_THRESHOLD and report aggregated costs
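
As a rough illustration of the judge step in the diagram, the sketch below decodes a `{pass, issues}` JSON verdict. The struct name `judgeVerdict`, the helper `parseVerdict`, and the `[]string` type for `issues` are assumptions made for this example; the diagram only names the two keys.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// judgeVerdict mirrors the {pass, issues} shape from the diagram.
// The []string type for Issues is an assumption.
type judgeVerdict struct {
	Pass   bool     `json:"pass"`
	Issues []string `json:"issues"`
}

// parseVerdict decodes the judge output and wraps decoding errors with context.
func parseVerdict(raw []byte) (judgeVerdict, error) {
	var v judgeVerdict
	if err := json.Unmarshal(raw, &v); err != nil {
		return judgeVerdict{}, fmt.Errorf("decoding judge output: %w", err)
	}
	return v, nil
}

func main() {
	v, err := parseVerdict([]byte(`{"pass": false, "issues": ["missing omitempty"]}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(v.Pass, v.Issues)
}
```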

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 3 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Ote Binary Stdout Contract | ❌ Error | The AfterSuite function violates the OTE Binary Stdout Contract by writing to stdout via fmt.Printf and fmt.Println (lines 149-165), which corrupts the JSON protocol with openshift-tests. | Replace all fmt.Print* and fmt.Println calls in AfterSuite with GinkgoWriter.Printf to prevent stdout corruption and maintain protocol compliance. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | Test code has critical cleanup error-handling and assertion message issues that violate quality requirements. | Fix removeWorktree() to properly check and report errors, and add meaningful failure messages to all assertions. |
| Ipv6 And Disconnected Network Test Compatibility | ⚠️ Warning | The test/eval/eval_test.go calls external Claude APIs via CLI and invokes a Claude judge service, requiring public internet connectivity unavailable in disconnected environments. | Mock or stub Claude CLI calls, add [Skipped:Disconnected] to test name, or document that this test requires authenticated external internet connectivity. |
✅ Passed checks (8 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding an agent and convention evaluation framework. It is specific, concise, and clearly conveys the primary purpose of the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All Ginkgo test titles are stable and deterministic, using static strings and filesystem directory names with no dynamic content. |
| Microshift Test Compatibility | ✅ Passed | The test/eval/eval_test.go file is a Ginkgo test suite with a //go:build eval tag that tests an LLM agent evaluation framework entirely locally, with zero cluster API interactions. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | The new Ginkgo test suite in test/eval/eval_test.go is an AI agent evaluation harness that makes no assumptions about cluster topology, does not interact with OpenShift infrastructure, and is fully compatible with Single Node OpenShift. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Pull request introduces agent evaluation test framework and documentation updates only, with no deployment manifests, operator code, or Kubernetes controllers that define scheduling constraints. |

- api-sme: add mandatory linter run before API reviews
- api/AGENTS.md: add field grouping rule — fields sharing a common
  prefix must be consolidated into a dedicated struct
- api/AGENTS.md: add best practices section pointing to
  etcdbackup_types.go and karpenter_types.go as reference examples

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/agents/api-sme.md:
- Line 19: Update the broken relative link in .claude/agents/api-sme.md: replace
the reference string '../api/AGENTS.md' with the unambiguous repo path
'api/AGENTS.md' so the agent points to the authoritative API guide; ensure the
changed symbol is the link text or path occurrence in the file (search for
'../api/AGENTS.md' and change it to 'api/AGENTS.md').

In `@api/hypershift/v1beta1/hostedcluster_types.go`:
- Around line 850-853: The FooConfig struct's FooDomains field uses an exported
JSON name "FooDomains" which should be lower camel case; update the struct tag
on FooDomains (type FooConfig) from `json:"FooDomains,omitempty"` to
`json:"fooDomains,omitempty"` so the wire representation matches the API
convention and avoid breaking clients.
- Around line 837-838: Rename the struct field Foo_IP to FooIP and change its
JSON tag to include the normalized camelCase name and an omission directive
(e.g., `json:"fooIP,omitempty"`) so the zero value is not serialized; update any
references to Foo_IP to use FooIP and ensure generated CRD/DeepCopy/regeneration
steps are run as appropriate.
- Around line 843-847: The FooID pointer field is still removable because the
current CEL rule only blocks changing a present value; add a parent-level CEL
rule on the containing API struct to prevent removal once set by using the
pattern has(oldSelf.fooID) ? has(self.fooID) : true; add a
+kubebuilder:validation:XValidation annotation with
rule="has(oldSelf.fooID) ? has(self.fooID) : true" (adjust syntax to
match the struct context) so that, together with the existing immutability
rule, FooID can be neither cleared nor replaced once it is set.
- Around line 836-847: Move Foo_IP, FooConfig, and FooID into a dedicated struct
(e.g., type FooSpec struct { IP string `json:"ip"`; Config *FooConfig
`json:"config,omitempty"`; // +kubebuilder:validation:XValidation:rule="self ==
oldSelf",message="fooID is immutable" ID *string `json:"id,omitempty"` }) and
replace the three top-level fields in HostedClusterSpec with a single pointer
field Foo *FooSpec `json:"foo,omitempty"`; ensure you transfer comments, json
tags, the immutability kubebuilder tag from FooID into the new FooSpec.ID
field, keep pointer/omitempty semantics for optional fields, and update any
code/tests that referenced HostedClusterSpec.Foo_IP, .FooConfig, or .FooID to
use HostedClusterSpec.Foo.IP, .Foo.Config, or .Foo.ID respectively.

In `@Makefile`:
- Around line 392-398: The eval-agents recipe currently fans out eval-% targets
in parallel (using $(MAKE) -j $(EVAL_TARGETS)), which races with repo-mutating
tests in test/eval/eval_test.go; fix by either serializing those targets or
isolating each target into its own temporary worktree: change the eval-agents
rule to invoke $(MAKE) $(EVAL_TARGETS) without -j to force serialized runs, or
modify the eval-% target (and/or the test harness in test/eval/eval_test.go
around the patch-based code at lines ~248-268) to create and use a temporary git
worktree per target before applying patches/checkout so parallelism is safe.
Ensure references to the Makefile targets (eval-agents, EVAL_TARGETS, eval-%)
and the test file (test/eval/eval_test.go) are updated accordingly.

In `@test/eval/eval_test.go`:
- Around line 350-385: The patch is currently applied once before the evalRuns
loop causing later runs to see a mutated repo; move the patch lifecycle inside
the loop so each trial starts from the same patched state: call
applyPatch(tc.Patch) at the start of each iteration (inside the for i := range
evalRuns loop) and ensure revertPatch(tc.Patch) is executed after that iteration
(use DeferCleanup scoped per-iteration or an explicit revert after judge
evaluation) so each run is independent; update references around applyPatch,
revertPatch, evalRuns, runAgent and runJudge to reflect the per-iteration
apply/revert flow.

In `@test/eval/README.md`:
- Around line 42-56: The fenced code block showing the directory tree in the
README should include a language specifier to satisfy markdownlint MD040; update
the opening triple backticks for the block containing the directory listing to
```text (i.e., change ``` to ```text) so the directory tree is treated as plain
text and the lint warning is resolved.
- Around line 75-76: Update the README wording to reflect that non-patch runs
are not toolless: clarify that runAgent (see runAgent in test/eval/eval_test.go)
always enables the read-only tools Read, Grep, and Glob, and only gates the Bash
tool on the presence of patch.diff; replace the sentence “with tools enabled if
a patch is present, disabled otherwise” with text stating “Read, Grep, and Glob
are always available; Bash is enabled only when a patch.diff is present (i.e.,
write tools are gated).”
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a92a4f72-7509-4336-9344-38ab6b57442b

📥 Commits

Reviewing files that changed from the base of the PR and between 5eaee74 and b8fb0eb.

⛔ Files ignored due to path filters (13)
  • test/eval/testdata/conventions/01-go-test-style/expected.txt is excluded by !**/testdata/**
  • test/eval/testdata/conventions/01-go-test-style/prompt.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/api-sme/01-api-design-review/expected.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/api-sme/01-api-design-review/patch.diff is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/api-sme/01-api-design-review/prompt.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/cloud-provider-sme/01-kms-integration/expected.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/cloud-provider-sme/01-kms-integration/prompt.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/control-plane-sme/01-ho-cpo-version-skew/expected.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/control-plane-sme/01-ho-cpo-version-skew/prompt.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/data-plane-sme/01-spot-instance-lifecycle/expected.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/data-plane-sme/01-spot-instance-lifecycle/prompt.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/hcp-architect-sme/01-architectural-review/expected.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/hcp-architect-sme/01-architectural-review/prompt.txt is excluded by !**/testdata/**
📒 Files selected for processing (6)
  • .claude/agents/api-sme.md
  • Makefile
  • api/AGENTS.md
  • api/hypershift/v1beta1/hostedcluster_types.go
  • test/eval/README.md
  • test/eval/eval_test.go

Comment thread .claude/agents/api-sme.md

**MANDATORY**: Before writing any review, you MUST run `make api-lint-fix` and include its output in your review. Do not skip this step. The linter is the authoritative source for convention violations. Your review must start with the linter findings, then add your own analysis on top.

Stick to ../api/AGENTS.md
Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix the API guide path.

../api/AGENTS.md does not resolve to the authoritative file from this location, so the agent can miss the repo’s primary API guidance entirely. Point this at an unambiguous repo path like api/AGENTS.md instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/agents/api-sme.md at line 19, Update the broken relative link in
.claude/agents/api-sme.md: replace the reference string '../api/AGENTS.md' with
the unambiguous repo path 'api/AGENTS.md' so the agent points to the
authoritative API guide; ensure the changed symbol is the link text or path
occurrence in the file (search for '../api/AGENTS.md' and change it to
'api/AGENTS.md').
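
To see why the relative link is ambiguous, here is a small sketch that resolves a link against the directory of the file containing it. Whether the agent runtime resolves paths this way or from the repository root is an assumption; `resolveFromFile` is a hypothetical helper used only for illustration.

```go
package main

import (
	"fmt"
	"path"
)

// resolveFromFile resolves a relative link against the directory of the file
// that contains it (slash-separated paths, as in a repository tree).
func resolveFromFile(file, link string) string {
	return path.Join(path.Dir(file), link)
}

func main() {
	// One level up from .claude/agents/ lands in .claude/, not the repo root.
	fmt.Println(resolveFromFile(".claude/agents/api-sme.md", "../api/AGENTS.md")) // .claude/api/AGENTS.md
	// Two levels up would be needed to reach the repo-level api/ directory.
	fmt.Println(resolveFromFile(".claude/agents/api-sme.md", "../../api/AGENTS.md")) // api/AGENTS.md
}
```

Under file-relative resolution the original link points at `.claude/api/AGENTS.md`, which is consistent with the review's suggestion to use the repo path `api/AGENTS.md` instead.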

Comment thread api/hypershift/v1beta1/hostedcluster_types.go Outdated
Comment thread api/hypershift/v1beta1/hostedcluster_types.go Outdated
Comment thread api/hypershift/v1beta1/hostedcluster_types.go Outdated
Comment thread api/hypershift/v1beta1/hostedcluster_types.go Outdated
Comment thread Makefile
Comment thread test/eval/eval_test.go
Comment thread test/eval/README.md
Comment on lines +42 to +56
```
test/eval/
  eval_test.go              # Test harness
  testdata/
    sme-agents/             # Agent scenarios (uses --agent flag)
      <agent-name>/
        <scenario>/
          prompt.txt        # Input prompt
          expected.txt      # Expected issues, one per line
          patch.diff        # Optional: applied before run
    conventions/            # Convention tests (no agent)
      <scenario>/
        prompt.txt
        expected.txt
```
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language to this fenced block.

markdownlint is already flagging this with MD040. text is enough for the directory tree.

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 42-42: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/eval/README.md` around lines 42 - 56, The fenced code block showing the
directory tree in the README should include a language specifier to satisfy
markdownlint MD040; update the opening triple backticks for the block containing
the directory listing to ```text (i.e., change ``` to ```text) so the directory
tree is treated as plain text and the lint warning is resolved.

Comment thread test/eval/README.md
@codecov

codecov Bot commented Apr 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 36.42%. Comparing base (5eaee74) to head (6f26580).
⚠️ Report is 29 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8382   +/-   ##
=======================================
  Coverage   36.42%   36.42%           
=======================================
  Files         765      765           
  Lines       93302    93302           
=======================================
  Hits        33981    33981           
  Misses      56606    56606           
  Partials     2715     2715           
| Flag | Coverage Δ |
|---|---|
| cmd-support | 30.37% <ø> (ø) |
| cpo-hostedcontrolplane | 37.08% <ø> (ø) |
| cpo-other | 35.69% <ø> (ø) |
| hypershift-operator | 47.88% <ø> (ø) |
| other | 27.76% <ø> (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/eval/eval_test.go`:
- Around line 237-245: patchedFiles currently only collects "+++ b/..." entries
so removed files and other side-effecting paths are missed, causing repo state
leakage; update patchedFiles to parse both "+++ b/..." and "--- a/..." (and any
"/dev/null" cases) to produce a complete set of touched paths, and change
revertPatch to: (1) restore modified/removed files via git checkout/restore
using the combined set, (2) detect added files (those present in "+++ b/" but
not in "--- a/" or with /dev/null) and remove them (git rm or filesystem
delete), and (3) run a safe git clean (or equivalent) for any untracked files
created by tools; reference the patchedFiles and revertPatch functions when
making these changes so the cleanup covers deletions, additions, and tool
side-effects.
- Around line 136-137: The parsed evalThreshold (from
envOrDefault("EVAL_THRESHOLD", fmt.Sprintf("%g", defaultThreshold)) into
evalThreshold) is not validated for bounds; add a check immediately after
strconv.ParseFloat to assert 0.0 <= evalThreshold <= 1.0 and fail the test with
a clear message that includes the actual value and the EVAL_THRESHOLD env var if
it is out of range (use the existing test assertion style, e.g. Expect/... or
equivalent) so invalid pass-rate semantics are rejected.
- Around line 127-145: Add a guard in the BeforeSuite (in
test/eval/eval_test.go) to detect a dirty git working tree before any
destructive operations (applyPatch/revertPatch) run: run a git status
--porcelain (e.g., via exec.Command("git", "status", "--porcelain")) in repoRoot
and Expect the output to be empty, failing the suite immediately with a clear
message if not; place this check after computing repoRoot and before any tests
that may call applyPatch/revertPatch so you prevent accidental overwrites of
local edits.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 96887751-65f4-4af2-9f90-bd26c782618d

📥 Commits

Reviewing files that changed from the base of the PR and between f4bf4f8 and 9eab4e7.

⛔ Files ignored due to path filters (13)
  • test/eval/testdata/conventions/01-go-test-style/expected.txt is excluded by !**/testdata/**
  • test/eval/testdata/conventions/01-go-test-style/prompt.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/api-sme/01-api-design-review/expected.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/api-sme/01-api-design-review/patch.diff is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/api-sme/01-api-design-review/prompt.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/cloud-provider-sme/01-kms-integration/expected.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/cloud-provider-sme/01-kms-integration/prompt.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/control-plane-sme/01-ho-cpo-version-skew/expected.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/control-plane-sme/01-ho-cpo-version-skew/prompt.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/data-plane-sme/01-spot-instance-lifecycle/expected.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/data-plane-sme/01-spot-instance-lifecycle/prompt.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/hcp-architect-sme/01-architectural-review/expected.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/hcp-architect-sme/01-architectural-review/prompt.txt is excluded by !**/testdata/**
📒 Files selected for processing (3)
  • Makefile
  • test/eval/README.md
  • test/eval/eval_test.go

Comment thread test/eval/eval_test.go
Comment thread test/eval/eval_test.go
Comment on lines +136 to +137
evalThreshold, err = strconv.ParseFloat(envOrDefault("EVAL_THRESHOLD", fmt.Sprintf("%g", defaultThreshold)), 64)
Expect(err).NotTo(HaveOccurred(), "EVAL_THRESHOLD must be a float")
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate EVAL_THRESHOLD bounds.

Right now any float parses; values outside [0,1] create invalid pass-rate semantics.

Proposed fix
 	evalThreshold, err = strconv.ParseFloat(envOrDefault("EVAL_THRESHOLD", fmt.Sprintf("%g", defaultThreshold)), 64)
 	Expect(err).NotTo(HaveOccurred(), "EVAL_THRESHOLD must be a float")
+	Expect(evalThreshold).To(BeNumerically(">=", 0.0), "EVAL_THRESHOLD must be >= 0.0")
+	Expect(evalThreshold).To(BeNumerically("<=", 1.0), "EVAL_THRESHOLD must be <= 1.0")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-evalThreshold, err = strconv.ParseFloat(envOrDefault("EVAL_THRESHOLD", fmt.Sprintf("%g", defaultThreshold)), 64)
-Expect(err).NotTo(HaveOccurred(), "EVAL_THRESHOLD must be a float")
+evalThreshold, err = strconv.ParseFloat(envOrDefault("EVAL_THRESHOLD", fmt.Sprintf("%g", defaultThreshold)), 64)
+Expect(err).NotTo(HaveOccurred(), "EVAL_THRESHOLD must be a float")
+Expect(evalThreshold).To(BeNumerically(">=", 0.0), "EVAL_THRESHOLD must be >= 0.0")
+Expect(evalThreshold).To(BeNumerically("<=", 1.0), "EVAL_THRESHOLD must be <= 1.0")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/eval/eval_test.go` around lines 136 - 137, The parsed evalThreshold
(from envOrDefault("EVAL_THRESHOLD", fmt.Sprintf("%g", defaultThreshold)) into
evalThreshold) is not validated for bounds; add a check immediately after
strconv.ParseFloat to assert 0.0 <= evalThreshold <= 1.0 and fail the test with
a clear message that includes the actual value and the EVAL_THRESHOLD env var if
it is out of range (use the existing test assertion style, e.g. Expect/... or
equivalent) so invalid pass-rate semantics are rejected.
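
The same bounds check can be written as a standalone function, which makes the edge cases easy to exercise outside Ginkgo. `parseThreshold` is a hypothetical helper for illustration; note that `strconv.ParseFloat` accepts "NaN", which slips past plain `<` and `>` comparisons, so the sketch rejects it explicitly.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// parseThreshold parses a pass-rate threshold and rejects values outside [0, 1].
func parseThreshold(raw string) (float64, error) {
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return 0, fmt.Errorf("EVAL_THRESHOLD=%q must be a float: %w", raw, err)
	}
	// NaN compares false against everything, so check it separately.
	if math.IsNaN(v) || v < 0 || v > 1 {
		return 0, fmt.Errorf("EVAL_THRESHOLD=%q must be within [0, 1]", raw)
	}
	return v, nil
}

func main() {
	for _, raw := range []string{"0.6", "1.5", "NaN"} {
		v, err := parseThreshold(raw)
		fmt.Println(v, err)
	}
}
```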

Comment thread test/eval/eval_test.go Outdated
@hypershift-jira-solve-ci

The Makefile changes only add eval-related targets (eval-agents, eval-*) with no connection to CRD generation, installation, or the envtest workflow.

Test Failure Analysis Complete

Job Information

  • Prow Job: Envtest Vanilla Kube API Validation / Envtest Vanilla Kube 1.35.0
  • Build ID: GitHub Actions run 25160817867, job 73754540287
  • Target: make test-envtest-kube ENVTEST_KUBE_VERSIONS="1.35.0"
  • Test: CRD Installation [It] should install all CRDs for feature set "TechPreviewNoUpgrade"

Test Failure Analysis

Error

[FAILED] Timed out after 30.000s.
CRD clusterclasses.cluster.x-k8s.io should be fully removed
Expected
    <bool>: false
to be true

In [It] at: test/envtest/generator.go:262

Summary

This is a pre-existing flaky test in the Kubernetes 1.35.0 envtest suite, unrelated to PR #8382's changes. The test installs all CRDs, then uninstalls them and polls for complete deletion with a 30-second timeout. On Kubernetes 1.35.0, the CRD clusterclasses.cluster.x-k8s.io intermittently takes longer than 30 seconds to be fully garbage-collected by the API server, causing the Eventually assertion at generator.go:262 to time out. The PR only adds agent eval framework files (.claude/agents/, test/eval/, Makefile eval targets) — none of which touch CRD schemas, envtest logic, or CRD generation.

Root Cause

The root cause is a race condition in CRD finalization under Kubernetes 1.35.0's envtest. The test at generator.go:256-263 uninstalls all CRDs and then polls each one with a 30-second Eventually timeout:

Eventually(func() bool {
    err := k8sClient.Get(ctx, key, &apiextensionsv1.CustomResourceDefinition{})
    return apierrors.IsNotFound(err)
}, "30s", "1s").Should(BeTrue(), ...)

Kubernetes 1.35.x introduced changes to CRD condition handling (notably ObservedGeneration in CRD conditions) which can slow finalization and garbage collection of CRDs in the envtest environment. The clusterclasses.cluster.x-k8s.io CRD is particularly affected because it is a complex multi-version Cluster API CRD with multiple subresources.

Evidence this is pre-existing and not caused by PR #8382:

  • The same evals branch passed Kube 1.35.0 in two prior runs (25159401839 at 10:01 and 25159048275 at 09:53) on the same day.
  • Of 15 total failures in the last ~200 runs of this workflow, at least 13 failed specifically on the Kube 1.35.0 job across many unrelated branches (e.g., docs/remove-todo-from-api-docs, CNTRLPLANE-3318, drop-karpenter-feature-gate, dependabot/*, capi-1.11-bump, etc.).
  • Kube versions 1.31–1.34 pass consistently.
  • Other branches show the identical failure: CRD clusterclasses.cluster.x-k8s.io should be fully removed timing out after 30s (confirmed in run 25112338312 on the docs/remove-todo-from-api-docs branch).

Recommendations

  1. Re-run the job — This is a non-deterministic timeout. The same branch already passed Kube 1.35.0 twice earlier the same day. A simple re-run will likely succeed.

  2. Increase the CRD removal timeout — The 30s timeout at generator.go:262 is too tight for Kubernetes 1.35.0's finalization behavior. Changing it to 60s (matching the CRD installation timeout at line 247) would eliminate this flake:

    // generator.go:260-263 — change "30s" to "60s"
    Eventually(func() bool {
        err := k8sClient.Get(ctx, key, &apiextensionsv1.CustomResourceDefinition{})
        return apierrors.IsNotFound(err)
    }, "60s", "1s").Should(BeTrue(), fmt.Sprintf("CRD %s should be fully removed", crd.Name))

  3. No changes needed to PR #8382. The PR's changes (agent definitions, eval framework, Makefile eval targets) have zero overlap with CRD schemas, envtest setup, or the failing test.

Evidence

| Evidence | Detail |
| --- | --- |
| Failing test | CRD Installation [It] should install all CRDs for feature set "TechPreviewNoUpgrade" at test/envtest/generator.go:262 |
| Failed CRD | clusterclasses.cluster.x-k8s.io (a Cluster API CRD, not modified by PR #8382) |
| Timeout | 30s timeout on Eventually polling for CRD deletion (IsNotFound check) |
| Only on Kube 1.35.0 | Kube 1.31–1.34 all passed in this run; 13/15 recent workflow failures are on Kube 1.35.0 |
| Non-deterministic | Same branch (evals) passed Kube 1.35.0 in runs 25159401839 and 25159048275 earlier that day |
| Cross-branch | Identical failure seen on unrelated branches: docs/remove-todo-from-api-docs (run 25112338312), CNTRLPLANE-3318 (runs 25110262598, 25107675743), drop-karpenter-feature-gate (run 25100554690) |
| PR scope | PR only adds .claude/agents/api-sme.md, api/AGENTS.md, test/eval/*, and Makefile eval targets; no CRD or envtest changes |

Add an evaluation framework for testing Claude Code agent definitions
and AGENTS.md conventions. Uses a Go test with Ginkgo at test/eval/
with prompt.txt + expected.txt per scenario, a bidirectional LLM
judge with per-issue structured verdicts, configurable pass-rate
thresholds, and cost tracking.

Key features:
- Patch-based scenarios: apply diffs to real API files so agents
  can run make api-lint-fix against actual code
- Convention tests: validate AGENTS.md rules without a specific agent
- Auto-discovered make targets from testdata directory structure
- Parallel execution via make -j
- Structured judge output with per-issue COVERED/MISSED verdicts

Scenarios:
- sme-agents/api-sme: API design review with linter integration
- sme-agents/cloud-provider-sme: KMS encryption design
- sme-agents/control-plane-sme: HO/CPO version skew coordination
- sme-agents/data-plane-sme: spot instance lifecycle
- sme-agents/hcp-architect-sme: architectural review
- conventions/01-go-test-style: Gherkin + gomega conventions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
test/eval/README.md (2)

42-56: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language to the fenced directory-tree block.

This currently triggers markdownlint MD040; use text for the tree block.

Proposed fix
-```
+```text
 test/eval/
   eval_test.go                         # Test harness
   testdata/
   ...
-```
+```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/eval/README.md` around lines 42 - 56, Change the fenced directory-tree
block to specify the language by replacing the opening triple backticks with
```text (i.e., update the fenced directory-tree block that starts with ``` to
```text) and keep the existing closing ``` so the block is recognized as plain
text and no longer triggers markdownlint MD040.

75-76: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align tool-availability docs with the harness behavior.

The harness always enables Read,Grep,Glob; only Bash is gated by patch.diff (test/eval/eval_test.go, Line 282-286).

Proposed fix
-3. **Agent invocation**: runs `claude --agent <name> -p <prompt>`
-   with tools enabled if a patch is present, disabled otherwise
+3. **Agent invocation**: runs `claude --agent <name> -p <prompt>`
+   with `Read`, `Grep`, and `Glob` always enabled; `Bash` is enabled only
+   when `patch.diff` is present
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/eval/README.md` around lines 75 - 76, Update the README line about agent
invocation to match the test harness behavior: state that the harness always
enables the Read, Grep and Glob tools and only enables Bash when a patch.diff is
present (as implemented in the test harness in eval_test.go where Read/Grep/Glob
are unconditionally enabled and Bash is gated by patch.diff).
test/eval/eval_test.go (2)

344-381: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Recreate patched state per trial to keep runs independent.

The worktree is created once (Line 345-348) and reused for all EVAL_RUNS iterations. If run 1 mutates files (e.g., via Bash tool), runs 2..N no longer evaluate the same starting state.

Proposed fix shape
-	workDir := repoRoot
-	if tc.Patch != nil {
-		workDir = createWorktree(tc.Patch)
-		DeferCleanup(func() { removeWorktree(workDir) })
-	}
-
 	for i := range evalRuns {
+		workDir := repoRoot
+		if tc.Patch != nil {
+			workDir = createWorktree(tc.Patch)
+		}
+
+		func() {
+			if tc.Patch != nil {
+				defer removeWorktree(workDir)
+			}
 		By(fmt.Sprintf("run %d/%d", i+1, evalRuns))
 
 		agentOutput, agentCost := runAgent(tc, evalModel, workDir)
 		...
 		if judge.Pass {
 			result.Passed++
 		} else {
 			...
 		}
+		}()
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/eval/eval_test.go` around lines 344 - 381, The current code creates the
worktree once (workDir := repoRoot / createWorktree) and reuses it for all
iterations of the evalRuns loop, so side effects from runAgent can taint later
runs; move the createWorktree/cleanup logic into the loop so each iteration uses
a fresh workDir: inside the for i := range evalRuns loop call
createWorktree(tc.Patch) to set workDir (and register a DeferCleanup or call
removeWorktree after that iteration) so each runAgent invocation gets an
independent filesystem state; ensure any existing references to workDir outside
the loop are updated accordingly and keep runAgent/judge calls unchanged.

136-137: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate EVAL_THRESHOLD bounds.

EVAL_THRESHOLD is parsed but never constrained; values outside 0.0..1.0 make pass-rate semantics invalid.

Proposed fix
 	evalThreshold, err = strconv.ParseFloat(envOrDefault("EVAL_THRESHOLD", fmt.Sprintf("%g", defaultThreshold)), 64)
 	Expect(err).NotTo(HaveOccurred(), "EVAL_THRESHOLD must be a float")
+	Expect(evalThreshold).To(BeNumerically(">=", 0.0), "EVAL_THRESHOLD must be >= 0.0")
+	Expect(evalThreshold).To(BeNumerically("<=", 1.0), "EVAL_THRESHOLD must be <= 1.0")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/eval/eval_test.go` around lines 136 - 137, The parsed evalThreshold from
envOrDefault is not validated; add a bounds check after strconv.ParseFloat to
ensure evalThreshold is within 0.0..1.0 and fail the test with a clear message
if not (e.g., use Expect or t.Fatalf to assert evalThreshold >= 0.0 &&
evalThreshold <= 1.0 with message "EVAL_THRESHOLD must be between 0.0 and 1.0");
update the block around evalThreshold, err = strconv.ParseFloat(...) to include
this range validation and clear error message referencing evalThreshold.
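
As a rough sketch of the same bounds check outside Gomega, the parsing and validation could live in one function that returns an error. `parseThreshold` is a hypothetical name, not something in the PR. The explicit NaN check is there because `strconv.ParseFloat` accepts the string "NaN" without error.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// parseThreshold is a hypothetical helper that parses a pass-rate threshold
// and rejects anything outside the closed interval [0.0, 1.0].
func parseThreshold(raw string) (float64, error) {
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return 0, fmt.Errorf("EVAL_THRESHOLD must be a float: %w", err)
	}
	// ParseFloat accepts "NaN", so reject it explicitly.
	if math.IsNaN(v) || v < 0.0 || v > 1.0 {
		return 0, fmt.Errorf("EVAL_THRESHOLD must be between 0.0 and 1.0, got %q", raw)
	}
	return v, nil
}

func main() {
	for _, raw := range []string{"0.6", "1.5", "NaN", "abc"} {
		v, err := parseThreshold(raw)
		fmt.Printf("%q -> %v, err=%v\n", raw, v, err)
	}
}
```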
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/eval/eval_test.go`:
- Around line 257-263: The removeWorktree function currently discards errors
from cmd.CombinedOutput() and os.RemoveAll(dir), which can leak state and hide
CI failures; capture the error and output from cmd.CombinedOutput() and the
error from os.RemoveAll(dir) and surface them (e.g., call Ginkgo's Expect/Fail
or t.Fatalf) with contextual messages including the command output and the
repoRoot/dir so cleanup failures are visible in CI; update removeWorktree to
check the error returned by cmd.CombinedOutput() and by os.RemoveAll and fail
the test or log the errors instead of ignoring them.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 51bed98b-f500-4731-a819-345bbd438e74

📥 Commits

Reviewing files that changed from the base of the PR and between 9eab4e7 and 6f26580.

⛔ Files ignored due to path filters (13)
  • test/eval/testdata/conventions/01-go-test-style/expected.txt is excluded by !**/testdata/**
  • test/eval/testdata/conventions/01-go-test-style/prompt.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/api-sme/01-api-design-review/expected.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/api-sme/01-api-design-review/patch.diff is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/api-sme/01-api-design-review/prompt.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/cloud-provider-sme/01-kms-integration/expected.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/cloud-provider-sme/01-kms-integration/prompt.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/control-plane-sme/01-ho-cpo-version-skew/expected.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/control-plane-sme/01-ho-cpo-version-skew/prompt.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/data-plane-sme/01-spot-instance-lifecycle/expected.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/data-plane-sme/01-spot-instance-lifecycle/prompt.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/hcp-architect-sme/01-architectural-review/expected.txt is excluded by !**/testdata/**
  • test/eval/testdata/sme-agents/hcp-architect-sme/01-architectural-review/prompt.txt is excluded by !**/testdata/**
📒 Files selected for processing (3)
  • Makefile
  • test/eval/README.md
  • test/eval/eval_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • Makefile

Comment thread test/eval/eval_test.go
Comment on lines +257 to +263
func removeWorktree(dir string) {
	By("removing git worktree")
	cmd := exec.Command("git", "worktree", "remove", "--force", dir)
	cmd.Dir = repoRoot
	cmd.CombinedOutput()
	os.RemoveAll(dir)
}
Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t ignore worktree cleanup failures.

Line 259 and Line 262 discard errors. If cleanup fails, state can leak across scenarios and CI diagnostics are lost.

Proposed fix
 func removeWorktree(dir string) {
 	By("removing git worktree")
 	cmd := exec.Command("git", "worktree", "remove", "--force", dir)
 	cmd.Dir = repoRoot
-	cmd.CombinedOutput()
-	os.RemoveAll(dir)
+	out, err := cmd.CombinedOutput()
+	Expect(err).NotTo(HaveOccurred(), "git worktree remove failed: %s", string(out))
+	err = os.RemoveAll(dir)
+	Expect(err).NotTo(HaveOccurred(), "failed to remove worktree dir %s", dir)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/eval/eval_test.go` around lines 257 - 263, The removeWorktree function
currently discards errors from cmd.CombinedOutput() and os.RemoveAll(dir), which
can leak state and hide CI failures; capture the error and output from
cmd.CombinedOutput() and the error from os.RemoveAll(dir) and surface them
(e.g., call Ginkgo's Expect/Fail or t.Fatalf) with contextual messages including
the command output and the repoRoot/dir so cleanup failures are visible in CI;
update removeWorktree to check the error returned by cmd.CombinedOutput() and by
os.RemoveAll and fail the test or log the errors instead of ignoring them.

@openshift-ci
Contributor

openshift-ci Bot commented Apr 30, 2026

@enxebre: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment thread test/eval/eval_test.go
haikuModel = "claude-haiku-4-5-20251001"

defaultModel = opusModel
defaultJudgeModel = opusModel

@theobarberbany theobarberbany Apr 30, 2026

I'd use haiku for the judge, it's actually pretty good for this. Opus is slow and doesn't follow the prompt as well (sometimes doesn't return json 🤦).

I'd also look at splitting test cases into two tiers: golden tests (single-issue, targeted) run with sonnet, and integration tests (multi-issue, complex scenarios) run with opus. Right now it looks like we've only got the integration style cases.

The golden tests are faster, cheaper, and more useful for pinpointing exactly which area a change broke: if a golden test for e.g. "missing optional doc" fails, you know immediately what regressed without wading through a multi-issue report. Integration tests then cover whether the agent handles realistic PRs with several issues at once.
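
A minimal sketch of the two-tier idea, assuming a per-scenario tier label that the harness does not have today. `Tier`, `agentModelFor`, and the Sonnet and Opus identifiers are all hypothetical placeholders. Only the Haiku identifier comes from the snippet under review.

```go
package main

import "fmt"

// Tier is a hypothetical label for a scenario's scope.
type Tier string

const (
	TierGolden      Tier = "golden"      // single-issue, targeted
	TierIntegration Tier = "integration" // multi-issue, realistic PR
)

const (
	haikuModel  = "claude-haiku-4-5-20251001" // from the snippet under review
	sonnetModel = "<sonnet-model-id>"         // placeholder, not a real identifier
	opusModel   = "<opus-model-id>"           // placeholder, not a real identifier
)

// agentModelFor picks the agent model for a tier, defaulting to the larger model.
func agentModelFor(t Tier) string {
	if t == TierGolden {
		return sonnetModel
	}
	return opusModel
}

func main() {
	fmt.Println("judge:", haikuModel)
	fmt.Println("golden agent:", agentModelFor(TierGolden))
	fmt.Println("integration agent:", agentModelFor(TierIntegration))
}
```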

Contributor

@devguyio devguyio left a comment

Summary

Solid framework. Auto-discovery, LLM-as-judge, build-tag isolation, and the deliberately-planted-violation pattern all work well.

What works

  • api/AGENTS.md consolidation is a nice improvement over the scattered design principles
  • Scenario auto-discovery from directory structure means no Makefile changes to add tests
  • Cost transparency in the AfterSuite summary

Additional finding

CodeRabbit already covers the mechanical issues (patch lifecycle, parallel race, cleanup errors). One thing not yet flagged:

  • The judge prompt's "no entirely unrelated issues" clause can cause false negatives at the default EVAL_RUNS=1
  • Agents frequently volunteer adjacent observations, and a single tangential comment fails the whole scenario with no retry budget
  • Worth either softening that clause or bumping the default EVAL_RUNS
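
To make the retry-budget point concrete, here is a small sketch of a pass-rate check. `meetsThreshold` is a hypothetical name and the 0.6 threshold is borrowed from the usage example in the PR description, not from the harness defaults. With a single run, one failed verdict gives a pass rate of zero, which fails any positive threshold.

```go
package main

import "fmt"

// meetsThreshold is a hypothetical helper: it reports whether the share of
// passing runs reaches the configured threshold.
func meetsThreshold(passed, runs int, threshold float64) bool {
	if runs <= 0 {
		return false
	}
	return float64(passed)/float64(runs) >= threshold
}

func main() {
	// One run with one tangential comment flagged by the judge.
	fmt.Println(meetsThreshold(0, 1, 0.6))
	// Three runs where one was flagged still clears a 0.6 threshold.
	fmt.Println(meetsThreshold(2, 3, 0.6))
}
```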

Agreements

Generated with Claude Code

@@ -0,0 +1,29 @@
diff --git a/api/hypershift/v1beta1/hostedcluster_types.go b/api/hypershift/v1beta1/hostedcluster_types.go

aside: we have an API review skill we're working on over in o/api. It'll use KAL + the api review guidance over there. Might be able to shell out to that?

Member Author

For this effort we'll keep exercising this sme agent which also relies on the kas linter.
Let's follow up and see how to converge eventually.

In general I would expect we eventually expose a generic api review skill and others as part of a curated promoted skills repo whose consumption is standardized.

Comment thread test/eval/eval_test.go
"--print",
"--model", model,
"--output-format", "json",
"--no-session-persistence",

I'd add "--dangerously-skip-permissions", otherwise Claude Code will hang waiting on interactive permission input.

Member

I think we could just add the same arg inputs we do when we run Claude in Prow

Member Author

Yes, this uses targeted tools, so it doesn't hang.

Comment thread test/eval/eval_test.go
Expect(err).NotTo(HaveOccurred(), "claude judge command failed: %s", string(output))

var parsed claudeOutput
err = json.Unmarshal(output, &parsed)

I'd write a helper func to strip anything before the opening { as sometimes the CLI outputs e.g. warnings, which will break this :(
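
A minimal sketch of such a helper, assuming the noise only ever appears before the JSON object. `trimToJSON` is a hypothetical name, and the sample output and `result` field in `main` are made up for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// trimToJSON is a hypothetical helper: it drops everything before the first
// '{' so that leading CLI warnings do not break json.Unmarshal.
// If no '{' is present, the input is returned unchanged.
func trimToJSON(out []byte) []byte {
	i := bytes.IndexByte(out, '{')
	if i < 0 {
		return out
	}
	return out[i:]
}

func main() {
	// Made-up CLI output: a warning line followed by the JSON payload.
	raw := []byte("warning: something noisy\n{\"result\":\"ok\"}\n")

	var parsed struct {
		Result string `json:"result"`
	}
	if err := json.Unmarshal(trimToJSON(raw), &parsed); err != nil {
		fmt.Println("unmarshal failed:", err)
		return
	}
	fmt.Println(parsed.Result)
}
```

This stays deliberately simple: a warning that itself contains a `{`, or extra text after the closing brace, would still break parsing and would need a stricter approach.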

@theobarberbany

@devguyio - I'm defaulting EVAL_RUNS=3 for this reason.


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/ai: Indicates the PR includes changes related to AI - Claude agents, Cursor rules, etc.
  • area/api: Indicates the PR includes changes for the API
  • area/testing: Indicates the PR includes changes for e2e testing
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.

6 participants