test(evals): add claude-code agent runner #16609
Merged
… runnerKind on EvalResult
- claudeCode: surface SIGKILL errors in agent log instead of silently swallowing
- variant: explicit fallback prevents future RunnerKind values from silently bucketing as skill
- CompareTable: drop truthy-modelId-always-replaces bug; first-write-wins per lane
- cache: drop redundant agentModel/agentVersion params; modelId already encodes them
- cache.spec: real skillHash invalidation tests via vi.spyOn mock
- globalSetup: import RunnerKind/SkillInstallMode instead of inline literal unions
- claudeCode: drop fossil path comment
## Overview

Adds a Claude Code CLI agent runner to `test/evals/` alongside the existing direct-LLM runner. Real agent invocations exercise the Payload skill the way users actually consume it: progressive disclosure of `SKILL.md` + `reference/*.md` from a sandboxed workdir, rather than the entire skill being injected into a system prompt.

Four eval lanes are now selectable via `EVAL_VARIANT`:

- `skill` (default)
- `baseline`
- `agent-claude-code` (`.claude/skills/payload/` in workdir)
- `agent-claude-code-baseline`

## Key Changes
**Dispatcher-based runner indirection** (`test/evals/runner/`)

- `runCodegenEval` becomes a `RunnerKind`-keyed dispatcher over `Record<RunnerKind, CodegenRunner>`.
- … `runner/llm.ts` behind the new `CodegenRunner` type.
- `runner/claudeCode.ts` wraps the `claude` CLI with lazy init, p-limit concurrency, a process-group-killable timeout, and `'error'`/`'exit'` resolution guards so spawn failures (missing binary, auth, hang) surface as actionable errors instead of test-worker timeouts.

**Sandboxed workdir per case** (`test/evals/runner/workdir.ts`)

- … `os.tmpdir()/payload-eval-*/` with `git init` (fixed local identity) and an embedded skill tree copied verbatim from `tools/claude-plugin/skills/payload/`.
- … `os.tmpdir()` or lands under `$HOME`.
- `getSkillTreeHash` walks the source tree (sorted) so skill content changes invalidate cached results.

**Sandboxed `claude` invocation**

- `CLAUDE_CONFIG_DIR` overridden to a per-process empty sandbox dir, blocking the developer's global `CLAUDE.md`, installed skills, settings, and hooks from contaminating the eval.
- … `~/.claude/.credentials.json` setups. Authentication failures surface the CLI's actual stderr/stdout instead of a generic message.

**Cache + result type extensions** (`test/evals/cache.ts`, `test/evals/types.ts`)

- `codegenKey` keyed on `runnerKind`, `modelId` (which encodes `agentModel`/version for agent runs), `skillInstall`, and a conditional skill-tree hash for runs that depend on skill content.
- `EvalResult` gains required `runnerKind` plus optional `skillInstall`, `agentLog` (truncated), `agentExitCode`.
- `loadSkillContext` stays LLM-only; agents see the live filesystem tree.

**Variant taxonomy + dashboard surfacing** (`test/evals/variant.ts`, dashboard components)

- `getVariant(result)` classifies cache entries into one of four lanes, with explicit fallback for unknown `RunnerKind` values.
- `Variant` widened to four values: `agent-baseline`, `agent-skill`, `baseline`, `skill`.
- `CompareTable` buckets agent rows into the existing skill/baseline columns (badge distinguishes lane in list view).

**Scripts + docs** (`package.json`, `test/evals/README.md`)

- … `:baseline` pattern: `test:eval:agent`, `test:eval:agent:baseline`, and per-suite variants.
- … `OPENAI_API_KEY` for scorer, `ANTHROPIC_API_KEY` for agent), and optional knobs (`EVAL_AGENT_MODEL`, `EVAL_AGENT_CONCURRENCY`, `EVAL_KEEP_WORKDIR`, `EVAL_NO_CACHE`).

## Design Decisions
**Sandbox via `CLAUDE_CONFIG_DIR`, not `--bare`.** The `--bare` flag would force `ANTHROPIC_API_KEY`-only auth and skip keychain. `CLAUDE_CONFIG_DIR` redirection is more invasive (it breaks macOS keychain auth, forcing API-key use for agent lanes) but produces a cleaner sandbox: no user skills, no global `CLAUDE.md`, no plugin marketplace, no hooks. The trade-off is documented; agent runs require `ANTHROPIC_API_KEY` set in the shell.

**Single-file readback, multi-file deferred.** The MVP runner enforces "modify only `payload.config.ts`" via a prompt suffix and reads back only that file. Multi-file agent edits (e.g. extracting a Collection into its own file) would require validators and the scorer to operate on a tree rather than a string, which is a separate phase.

**LLM scorer kept for agent runs.** Agent invocations produce a real config diff that still needs grading. Building build-success scoring would require a Payload-specific oracle the project doesn't have, so the existing `scoreConfigChange` is reused. Consequence: both `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` are needed for agent variants.

**Per-process concurrency cap.** Agent runs are heavy (~30–120s, external process). `pLimit(EVAL_AGENT_CONCURRENCY ?? 2)` at module scope prevents the suite from forking dozens of `claude` processes. Vitest's `eval` project has `fileParallelism: false`, so the module-level limiter is process-wide.

**Verbatim skill install over concatenation.** The LLM runner injects a concatenated `SKILL.md` + `reference/*.md` blob via system prompt because the model has no tool access. The agent runner instead copies the skill directory tree verbatim into `workdir/.claude/skills/payload/`, letting the agent discover and read reference files through its own `Read` tool. The cache key uses a separate `getSkillTreeHash` (not the LLM concatenation) so both runners invalidate on any skill change.

**`runnerKind` required on `EvalResult`.** Optional discriminants made downstream code coerce via `?? 'llm'` at every read. Making it required tightens the new-cache contract; read sites that consume legacy entries keep the default-coercion for backward compatibility.

**`agentModel`/`agentVersion` not separate cache-key fields.** `modelId` for agent runs is `claude-code/<agentModel>/<version>`, so version and model changes invalidate via `modelId` alone. Adding them separately would create silent divergence risk.

## Overall Flow
```mermaid
sequenceDiagram
    participant Spec as eval.*.spec.ts
    participant Variant as variantOptions.ts
    participant Case as runCodegenCase
    participant Dispatch as runCodegenEval
    participant Agent as claudeCodeRunner
    participant Workdir as workdir.ts
    participant CLI as claude (CLI)

    Spec->>Variant: resolveVariantOptions()
    Variant-->>Spec: { kind, skillInstall, agentModel, ... }
    Spec->>Case: runCodegenCase(testCase, label, opts)
    Case->>Case: codegenKey({ runnerKind, modelId, skillInstall, ... })
    Case->>Dispatch: runCodegenEval(instruction, starter, opts)
    alt kind === 'llm'
        Dispatch->>Dispatch: llmRunner.run (unchanged)
    else kind === 'claude-code'
        Dispatch->>Agent: claudeCodeRunner.run
        Agent->>Agent: ensureInit (lazy, memoized)
        Note over Agent: First call: create sandbox CLAUDE_CONFIG_DIR,<br/>capture version, auth-probe (creds-copy fallback)
        Agent->>Workdir: materialize → gitInit → installSkill
        Agent->>CLI: spawn(claude --print --model <m> --dangerously-skip-permissions)
        Note over CLI: env: { CLAUDE_CONFIG_DIR=<sandbox> }<br/>cwd: <workdir>
        CLI-->>Agent: stdout/stderr + exit code
        Agent->>Workdir: readEntry(workdir)
        Agent->>Workdir: cleanup(workdir)
        Agent-->>Dispatch: { modifiedConfig, agentLog, agentExitCode }
    end
    Dispatch-->>Case: CodegenRunnerResult
    Case->>Case: validateConfigTypes (tsc) → evaluateAssertions → scoreConfigChange (OpenAI)
    Case-->>Spec: EvalResult (cached for next run)
```
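
The `alt` branch in the flow above could be pictured roughly as follows. This is a minimal sketch, not the code from this PR: the names `RunnerKind`, `CodegenRunner`, `CodegenRunnerResult`, `runCodegenEval`, `llmRunner`, and `claudeCodeRunner` come from the description, while the `RunOptions` shape, the exact `run` signature, and both stub runner bodies are assumptions made for illustration.

```typescript
// Hypothetical sketch: names mirror the PR description, bodies are stand-ins.
type RunnerKind = 'llm' | 'claude-code'

interface CodegenRunnerResult {
  modifiedConfig: string
  agentLog?: string
  agentExitCode?: number
}

// Assumed options shape; the real one carries more fields (skillInstall, agentModel, ...).
interface RunOptions {
  kind: RunnerKind
}

interface CodegenRunner {
  run: (instruction: string, starter: string, opts: RunOptions) => Promise<CodegenRunnerResult>
}

// Stub for the direct-LLM path (the real runner calls a model API).
const llmRunner: CodegenRunner = {
  run: async (instruction: string, starter: string) => ({
    modifiedConfig: `${starter}\n// llm: ${instruction}`,
  }),
}

// Stub for the agent path (the real runner spawns the claude CLI in a sandboxed workdir).
const claudeCodeRunner: CodegenRunner = {
  run: async (instruction: string, starter: string) => ({
    modifiedConfig: `${starter}\n// agent: ${instruction}`,
    agentLog: 'stub log',
    agentExitCode: 0,
  }),
}

// Record<RunnerKind, CodegenRunner> makes the compiler reject a RunnerKind
// value that has no registered runner.
const runners: Record<RunnerKind, CodegenRunner> = {
  llm: llmRunner,
  'claude-code': claudeCodeRunner,
}

// The dispatcher only looks up the runner by kind and delegates to it.
async function runCodegenEval(
  instruction: string,
  starter: string,
  opts: RunOptions,
): Promise<CodegenRunnerResult> {
  return runners[opts.kind].run(instruction, starter, opts)
}
```

Keying the table by `RunnerKind` means adding a new kind to the union fails type-checking until a runner is registered, which fits the description's concern about unknown `RunnerKind` values falling through silently.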