test(evals): tighten codegen evaluation pipeline by denolfe · Pull Request #16506 · payloadcms/payload

denolfe · 2026-05-06T14:36:02Z

Overview

Hardens the eval pipeline used to test the Payload skill's codegen quality. Adds deterministic AST-level structural assertions, inlines the full skill (SKILL.md plus every reference/*.md) into the eval prompt, fixes dashboard / cache drift, and tightens the LLM scorer rubric.

Full agent evals coming soon.

Key Changes

AST assertions for codegen cases
- New test/evals/assertions/ module: a TS-compiler-API parser plus a small DSL (collectionExists, fieldExists, fieldOption, fieldHook, collectionHook, collectionAccess, blockField).
- CodegenEvalCase gains an optional assertions field. Failures short-circuit the case to fail before the LLM scorer runs, so structural mismatches no longer get smoothed over by a "mostly correct" rating.
- All collections and fields cases backfilled with assertions matching the canonical patterns in FIELDS.md, HOOKS.md, and ACCESS-CONTROL.md.
Full skill context in the eval prompt
- The eval LLM has no tool access, so the markdown links inside SKILL.md (e.g. [FIELDS.md](reference/FIELDS.md)) were inert text. Reference docs were effectively invisible to the model under test.
- New skillContent.ts reads SKILL.md plus every reference/*.md and concatenates them under labeled headers. runner/systemPrompts.ts consumes this; cache.getSkillHash hashes the same content so cache keys invalidate when any reference doc changes.
- System prompt grows from roughly 5K to 42K tokens; OpenAI prompt caching keeps the per-run cost bounded.
LLM scorer rubric tightening
- When the expected outcome names a specific construct or API shape, using a different one caps correctness at 0.4. "Minor" applies only when construct, location, and API surface are correct and only an option value differs.
Skill content adjustments
- Quick-Reference row for computed fields now states "field-level hooks.afterRead returning the value".
- Hook Example expanded with explicit collection-vs-field framing and an inline virtual-field example.
- Removed unsupported "use as const for field options" guidance.
- Added a rule to annotate extracted named constants with the matching Payload type or use satisfies.

Design Decisions

AST assertions over scorer-only judgment. The scorer (gpt-4o-mini) rated wrong-shape answers as "mostly correct" even after rubric tightening. Structural checks belong in deterministic code, not in another LLM. The DSL covers structural concerns; the scorer continues to handle semantic correctness on cases where assertions are absent or pass.
Eager-inline of reference docs over a tool-using runner. Tool use would mirror real Claude Code behavior more faithfully but adds non-determinism and runner complexity. Eager inline is simple, deterministic, and prompt caching is sufficient for now.
Prune-on-write rather than purge-on-startup. Cache file accumulation comes from changing inputs (fixture content, skill hash) producing new keys. Pruning the previous logical-case entry on each write keeps the dashboard surface clean without a destructive global cleanup step.
starterContent stored on each result. Reading the live fixture at render time tied diff correctness to the current fixture file. Persisting the starter alongside the answer keeps stored diffs unaffected by later fixture edits.

Overall Flow

sequenceDiagram
  participant V as Vitest case
  participant R as Runner LLM
  participant T as tsc
  participant A as AST assertions
  participant S as Scorer LLM
  participant C as Cache
  V->>R: instruction + starter + full skill (SKILL.md + reference/*.md)
  R-->>V: modified config
  V->>T: typecheck against fixtures tsconfig
  T-->>V: errors / clean
  V->>A: evaluate structural assertions
  A-->>V: errors / clean
  alt tsc fail or assertion fail
    V->>C: write fail result, prune siblings
    V-->>V: ✗ FAIL [TSC | ASSERT]
  else clean
    V->>S: score correctness + completeness
    S-->>V: score + reasoning
    V->>C: write scored result, prune siblings
  end
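
The gating in the diagram can be sketched roughly as follows. This is a minimal illustration, not the runner's actual code: the `EvalResult` shape, the `Stages` interface, and `evaluate` are hypothetical names, the scorer is a synchronous stand-in for the LLM call, and the sketch assumes evaluation stops at the first failing gate (the diagram does not say whether assertions still run after a tsc failure).

```typescript
// Hypothetical result shape; the real EvalResult in test/evals may differ.
type EvalResult =
  | { status: 'fail'; stage: 'TSC' | 'ASSERT'; errors: string[] }
  | { status: 'scored'; score: number; reasoning: string }

// Injected stages so the sketch runs without tsc or an LLM (all assumptions).
interface Stages {
  typecheck: (code: string) => string[] // tsc errors, empty when clean
  assert: (code: string) => string[] // assertion errors, empty when clean
  score: (code: string) => { score: number; reasoning: string } // scorer stand-in
}

function evaluate(code: string, stages: Stages): EvalResult {
  const tscErrors = stages.typecheck(code)
  if (tscErrors.length > 0) return { status: 'fail', stage: 'TSC', errors: tscErrors }

  const assertionErrors = stages.assert(code)
  if (assertionErrors.length > 0) {
    // Short-circuit: the scorer is never called for a structural mismatch.
    return { status: 'fail', stage: 'ASSERT', errors: assertionErrors }
  }

  const { score, reasoning } = stages.score(code)
  return { status: 'scored', score, reasoning }
}

let scorerCalls = 0
const stages: Stages = {
  typecheck: () => [],
  assert: (code) => (code.indexOf('hooks') !== -1 ? [] : ['fieldHook: afterRead missing on "fullName"']),
  score: () => {
    scorerCalls += 1
    return { score: 0.9, reasoning: 'matches expected' }
  },
}

const failed = evaluate('fields: [{ name: "fullName" }]', stages)
const passed = evaluate('fields: [{ name: "fullName", hooks: {} }]', stages)
```

Here `failed` never reaches the scorer, so `scorerCalls` is incremented only for `passed`.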

Adds test/evals/fixtures/db-stub.ts exporting a typed `stubAdapter` (`{} as DatabaseAdapterObj`) and an `@/*` path mapping in the fixtures tsconfig. All 33 starter `payload.config.ts` fixtures now use `db: stubAdapter` instead of `null as unknown as Parameters<...>['db']`, so LLMs see realistic Payload import patterns rather than a type cast.

Two related fixes so the eval dashboard reflects the current skill + fixtures:
1. Include the SKILL.md hash in `codegenKey` (was only in `qaKey`). Without this, editing SKILL.md returned the old cached codegen result, so dashboard diffs stayed pinned to the pre-edit run.
2. Persist the starter file contents on each codegen `EvalResult` and diff against the stored copy. Previously the dashboard re-read the live fixture file at render time, so any fixture edit (e.g. the db-stub sweep) misaligned every cached diff until each case was re-run. Legacy entries fall back to the live file.
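
The second fix amounts to preferring the stored starter and falling back to the live file only for legacy entries. A minimal sketch, assuming a hypothetical `CodegenResult` subset and a hypothetical `starterForDiff` helper (only the `starterContent` name comes from this PR):

```typescript
// Hypothetical subset of a codegen EvalResult; only starterContent is named in the PR.
interface CodegenResult {
  fixturePath: string
  answer: string
  starterContent?: string // absent on legacy cache entries
}

// Pick the "before" side of the dashboard diff.
function starterForDiff(result: CodegenResult, readLiveFixture: (path: string) => string): string {
  // Prefer the copy persisted with the result; legacy entries use the live fixture.
  return result.starterContent ?? readLiveFixture(result.fixturePath)
}

const readLive = (): string => 'db: stubAdapter'
const current = starterForDiff(
  { fixturePath: 'posts/payload.config.ts', answer: '...', starterContent: 'db: null' },
  readLive,
)
const legacy = starterForDiff({ fixturePath: 'posts/payload.config.ts', answer: '...' }, readLive)
```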

Cache keys depend on fixture content and the SKILL.md hash, so each fixture or skill edit produces a fresh key. The file written under the previous key lingered on disk, and the dashboard kept rendering stale diffs alongside fresh ones. `pruneStaleEntries` removes any entry that matches the same logical case (input + fixturePath + modelId + systemPromptKey) but has a different key. It is called immediately after each successful or tsc-failed cache write. Cache-hit backfill writes are not pruned (same key, no siblings).
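
The pruning rule can be sketched as below. This is an illustration under stated assumptions: the real cache lives in files on disk, while this sketch uses an in-memory `Map`, and the `CacheEntry` shape and the `pruneStaleEntries` signature shown here are guesses rather than the actual ones.

```typescript
// Hypothetical entry shape built from the fields named in the commit message.
interface CacheEntry {
  key: string
  input: string
  fixturePath: string
  modelId: string
  systemPromptKey: string
}

function sameLogicalCase(a: CacheEntry, b: CacheEntry): boolean {
  return (
    a.input === b.input &&
    a.fixturePath === b.fixturePath &&
    a.modelId === b.modelId &&
    a.systemPromptKey === b.systemPromptKey
  )
}

// Removes siblings of the entry just written and returns the removed keys.
function pruneStaleEntries(entries: Map<string, CacheEntry>, written: CacheEntry): string[] {
  const removed: string[] = []
  entries.forEach((entry, key) => {
    if (key !== written.key && sameLogicalCase(entry, written)) {
      entries.delete(key) // deleting during Map iteration is safe in JavaScript
      removed.push(key)
    }
  })
  return removed
}

const base = { input: 'add a status select', fixturePath: 'fields/select', modelId: 'm', systemPromptKey: 'skill' }
const entries = new Map<string, CacheEntry>()
entries.set('k-old', { ...base, key: 'k-old' }) // written before a skill edit
entries.set('k-other', { ...base, key: 'k-other', input: 'add a price number' }) // different case
const fresh: CacheEntry = { ...base, key: 'k-new' }
entries.set(fresh.key, fresh)

const removedKeys = pruneStaleEntries(entries, fresh)
```

Only `k-old` shares the logical case with `k-new`, so it is the only entry removed.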

The virtual-field case passed (score 0.7) despite the LLM placing afterRead on the collection rather than on the field. The scorer's "minor property differs" rule was too lenient: wrong hook scope is a structural mismatch, not a minor option difference.
- Updates the virtual-field dataset `expected` to describe the canonical field-level pattern (`siblingData`, `virtual: true`, field-level `hooks.afterRead`) so the scorer has a precise target.
- Adds a rubric clause to scoreConfigChange: when `expected` names a specific construct or API shape, using a different construct caps correctness at 0.4. "Minor" only applies when construct, location, and API surface are right and only an option value differs.

The LLM scorer keeps rating wrong-shape answers as "mostly correct" even after rubric tightening: gpt-4o-mini misses structural distinctions like field-level vs. collection-level hooks. This adds a deterministic check that runs before the LLM scorer.
- New `test/evals/assertions/` module with a small DSL: collectionExists, fieldExists, fieldOption, fieldHook, collectionHook. Each resolves against an AST built with the TS compiler API; identifier references (e.g. `const Users = {...}; collections: [Users]`) are followed.
- `CodegenEvalCase.assertions` is optional. When set, any failure short-circuits the case to fail with reasoning before the LLM scorer runs (no extra cost, no chance of being smoothed over).
- `EvalResult.assertionErrors` carries the failures for the dashboard.
- Adds the virtual-field fixture and a starter assertion set covering the field-level `hooks.afterRead` case the scorer was missing.
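
To illustrate how such a DSL can separate hook scopes, here is a minimal sketch of the evaluation step only. It assumes the parser has already reduced the config to plain objects, so no TS compiler API calls appear here. The `ParsedCollection` / `ParsedField` shapes, the `Assertion` union, and `runAssertions` are hypothetical; only the assertion names come from the list above.

```typescript
// Hypothetical parser output: hook names found at each level.
interface ParsedField {
  name: string
  type: string
  hooks: string[]
}
interface ParsedCollection {
  slug: string
  fields: ParsedField[]
  hooks: string[]
}

// Assumed encoding of four of the DSL assertions as a discriminated union.
type Assertion =
  | { kind: 'collectionExists'; slug: string }
  | { kind: 'fieldExists'; slug: string; field: string }
  | { kind: 'fieldHook'; slug: string; field: string; hook: string }
  | { kind: 'collectionHook'; slug: string; hook: string }

// Returns an error message, or null when the assertion holds.
function check(collections: ParsedCollection[], assertion: Assertion): string | null {
  const slug = assertion.slug
  const collection = collections.find((c) => c.slug === slug)
  if (!collection) return `collection "${slug}" not found`
  if (assertion.kind === 'collectionExists') return null
  if (assertion.kind === 'collectionHook') {
    return collection.hooks.indexOf(assertion.hook) !== -1
      ? null
      : `collection "${slug}" has no ${assertion.hook} hook`
  }
  const fieldName = assertion.field
  const field = collection.fields.find((f) => f.name === fieldName)
  if (!field) return `field "${fieldName}" not found on "${slug}"`
  if (assertion.kind === 'fieldHook' && field.hooks.indexOf(assertion.hook) === -1) {
    return `field "${fieldName}" has no field-level ${assertion.hook} hook`
  }
  return null
}

function runAssertions(collections: ParsedCollection[], assertions: Assertion[]): string[] {
  const errors: string[] = []
  for (const assertion of assertions) {
    const error = check(collections, assertion)
    if (error !== null) errors.push(error)
  }
  return errors
}

const virtualFieldAssertions: Assertion[] = [
  { kind: 'collectionExists', slug: 'users' },
  { kind: 'fieldHook', slug: 'users', field: 'fullName', hook: 'afterRead' },
]
// afterRead placed on the collection: the shape the scorer kept accepting.
const collectionLevel: ParsedCollection[] = [
  { slug: 'users', hooks: ['afterRead'], fields: [{ name: 'fullName', type: 'text', hooks: [] }] },
]
// afterRead placed on the field: the canonical shape.
const fieldLevel: ParsedCollection[] = [
  { slug: 'users', hooks: [], fields: [{ name: 'fullName', type: 'text', hooks: ['afterRead'] }] },
]

const wrongScopeErrors = runAssertions(collectionLevel, virtualFieldAssertions)
const rightScopeErrors = runAssertions(fieldLevel, virtualFieldAssertions)
```

The collection-level variant yields one error and the field-level variant yields none, which is the distinction the scorer alone was not making reliably.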

Extends the AST DSL and adds assertions to every collections- and fields-dataset case. Patterns asserted match the canonical shapes in the skill's reference docs (FIELDS.md, ACCESS-CONTROL.md, HOOKS.md).

DSL extensions:
- `parentField` on field assertions to walk one level into array/group fields' `fields` array
- `collectionAccess` for collection-level access functions
- `blockField` for fields nested inside blocks (`field.blocks[].fields`)
- Parser now extracts nested fields, blocks, and access objects
- ParsedCollection.access / ParsedField.{fields,blocks} populated when the field type indicates nesting

Backfilled cases:
- collections: posts-title-content, categories-relationship, media-access-control, comments-relationships, beforechange-hook (virtual-field already had assertions)
- fields: select-status, array-images, group-seo, number-price, blocks-layout, checkbox-ispublished

Dashboard / globalSetup updated so codegen-result detection includes assertionErrors (not just tscErrors / changeDescription), preventing miscategorisation when assertions fail before the LLM scorer runs.

Config / plugins / negative datasets are not backfilled. They require DSL extensions for top-level config keys (admin, cors, serverURL, onInit, localization), plugin function-body introspection, and bug-fix verification respectively. Those cases continue to rely on the LLM scorer with its tightened rubric.
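
The nested lookups behind `parentField` and `blockField` might look roughly like this. It is a sketch only: `findField`, `findBlockField`, their parameters, and the nested field names (`alt`, `heading`, the `hero` block) are hypothetical, and the parsed shapes are assumed from the `ParsedField.{fields,blocks}` description above.

```typescript
// Assumed parsed shapes; nested members are present only for nesting field types.
interface ParsedBlock {
  slug: string
  fields: ParsedField[]
}
interface ParsedField {
  name: string
  type: string
  fields?: ParsedField[] // array / group
  blocks?: ParsedBlock[] // blocks
}

// Top-level lookup, or one level down when parentField is given.
function findField(fields: ParsedField[], fieldName: string, parentField?: string): ParsedField | undefined {
  if (parentField === undefined) return fields.find((f) => f.name === fieldName)
  const container = fields.find((f) => f.name === parentField)
  return container?.fields?.find((f) => f.name === fieldName)
}

// Lookup inside field.blocks[].fields for one block slug.
function findBlockField(
  fields: ParsedField[],
  blocksField: string,
  blockSlug: string,
  fieldName: string,
): ParsedField | undefined {
  const block = fields.find((f) => f.name === blocksField)?.blocks?.find((b) => b.slug === blockSlug)
  return block?.fields.find((f) => f.name === fieldName)
}

const parsedFields: ParsedField[] = [
  { name: 'images', type: 'array', fields: [{ name: 'alt', type: 'text' }] },
  {
    name: 'layout',
    type: 'blocks',
    blocks: [{ slug: 'hero', fields: [{ name: 'heading', type: 'text' }] }],
  },
]

const nestedAlt = findField(parsedFields, 'alt', 'images')
const topLevelAlt = findField(parsedFields, 'alt')
const heroHeading = findBlockField(parsedFields, 'layout', 'hero', 'heading')
```

Without `parentField`, `alt` is not found because it exists only inside the `images` array.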

Codegen evals surfaced LLM output like `export const Posts = { slug: 'posts', fields: [{ type: 'text' }] }` where bare extraction widens `type: 'text'` to `string` and breaks the `Field` / `CollectionConfig` discriminated unions. Inline object literals work because of contextual typing, so this issue only shows up in extracted-const code paths. Adds a Type Safety bullet in SKILL.md covering all extraction targets (collections, fields, hooks, access, plugins) — annotate with the matching Payload type or use `satisfies`.
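
The widening problem and both fixes can be reproduced with local stand-in types. The `Field` and `CollectionConfig` definitions below are simplified assumptions for illustration, not Payload's real types, and the `satisfies` variant needs TypeScript 4.9 or later.

```typescript
// Simplified stand-ins for Payload's discriminated unions (assumption).
type Field =
  | { name: string; type: 'text'; maxLength?: number }
  | { name: string; type: 'checkbox'; defaultValue?: boolean }

interface CollectionConfig {
  slug: string
  fields: Field[]
}

// Bare extraction: with no contextual type, `type` is inferred as string.
const Widened = { slug: 'posts', fields: [{ name: 'title', type: 'text' }] }

// @ts-expect-error `type: string` does not match any member of the Field union
const broken: CollectionConfig = Widened

// Fix 1: annotate the extracted constant with the matching type.
const Posts: CollectionConfig = { slug: 'posts', fields: [{ name: 'title', type: 'text' }] }

// Fix 2: `satisfies` checks the shape and keeps the literal type 'text'.
const Pages = { slug: 'pages', fields: [{ name: 'title', type: 'text' }] } satisfies CollectionConfig

const appConfig: { collections: CollectionConfig[] } = { collections: [Posts, Pages] }
```

Inline literals inside `collections: [...]` avoid the problem because the array element type supplies the contextual type; the extracted constants need one of the two fixes.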

The bullet had no example in the skill, no precedent in templates (only Next.js redirects use `as const`, for `RouteHas['type']`), and the most natural reading — applying `as const` to a collection or field literal — produces readonly tuples that fail to assign to `CollectionConfig.fields: Field[]`. The "annotate extracted named constants" rule covers the real type-safety need.
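
The readonly-tuple failure is easy to reproduce in isolation. The types below are simplified local stand-ins, not Payload's definitions:

```typescript
// Simplified stand-ins (assumption); the real Field union is much larger.
type Field = { name: string; type: 'text' | 'number' }
interface CollectionConfig {
  slug: string
  fields: Field[] // mutable array
}

// `as const` on the whole literal makes `fields` a readonly tuple.
const frozen = { slug: 'posts', fields: [{ name: 'title', type: 'text' }] } as const

// @ts-expect-error a readonly tuple is not assignable to the mutable Field[]
const rejected: CollectionConfig = frozen

// Annotating the constant gives the literal a contextual type and compiles.
const accepted: CollectionConfig = { slug: 'posts', fields: [{ name: 'title', type: 'text' }] }
```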

LLM kept producing collection-level afterRead for virtual-field codegen even with rubric + AST assertions in place. The skill mentioned virtual fields but never surfaced the field-vs-collection hook distinction in-context: the Hook Example only showed a collection hook, and the Quick Reference row was ambiguous.
- Quick Reference: row for computed fields now says "field-level hooks.afterRead returning the value".
- Hook Example: adds explicit framing on the two hook levels (which args, what they return) plus a virtual-field example inline, and a trailing rule: when asked to compute / populate a field's value, use a field-level hook, never a collection-level one.
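
For reference, the field-level pattern has roughly the following shape. This sketch uses local stand-in types so it runs on its own; Payload's real field hook receives more arguments than shown, and the `fullName` / `firstName` / `lastName` names are illustrative rather than taken from the skill.

```typescript
// Stand-in types (assumption); Payload's FieldHook has a richer signature.
type FieldHookArgs = { siblingData: Record<string, unknown>; value?: unknown }
type FieldHook = (args: FieldHookArgs) => unknown

interface TextField {
  name: string
  type: 'text'
  virtual?: boolean
  hooks?: { afterRead?: FieldHook[] }
}

const fullName: TextField = {
  name: 'fullName',
  type: 'text',
  virtual: true, // not stored; the value is computed on read
  hooks: {
    // Field-level hook: returns the value for this one field.
    afterRead: [
      ({ siblingData }) =>
        `${String(siblingData.firstName ?? '')} ${String(siblingData.lastName ?? '')}`.trim(),
    ],
  },
}

const computeFullName = fullName.hooks?.afterRead?.[0]
const computed = computeFullName
  ? computeFullName({ siblingData: { firstName: 'Ada', lastName: 'Lovelace' } })
  : undefined
```

A collection-level afterRead would instead receive and return the whole document, which is why the two are not interchangeable for this case.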

The eval LLM has no tool access, so the markdown links inside SKILL.md (e.g. `[FIELDS.md](reference/FIELDS.md)`) were inert text, and every reference doc was effectively invisible to the model under test. Several iterations of the virtual-field case were chased back to this: patterns documented only in FIELDS.md never reached the LLM.
- New shared `skillContent.ts` loads SKILL.md plus every `reference/*.md` in stable order, separated by labeled headers, cached per process
- `runner/systemPrompts.ts` and `cache.ts` both consume it
- `getSkillHash` now hashes the full concatenated context so cache keys invalidate when any reference doc changes, not just SKILL.md
- Context grows from ~5K to ~42K tokens; OpenAI prompt caching keeps the per-run cost amortized
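
A rough sketch of the concatenate-then-hash idea follows. Several things here are assumptions: files are passed in as an in-memory record rather than read from disk, the header format and the "SKILL.md first, references sorted by path" order are guesses at what "stable order" and "labeled headers" mean, and a small FNV-1a hash stands in for whatever digest `getSkillHash` actually uses so the example needs no imports.

```typescript
// In-memory stand-in for the skill files on disk (paths and contents are illustrative).
type SkillFiles = Record<string, string>

// SKILL.md first, then reference docs sorted by path, each under a labeled header.
function buildSkillContent(files: SkillFiles): string {
  const references = Object.keys(files)
    .filter((path) => path !== 'SKILL.md')
    .sort()
  return ['SKILL.md', ...references]
    .map((path) => `===== ${path} =====\n${files[path] ?? ''}`)
    .join('\n\n')
}

// 32-bit FNV-1a over UTF-16 code units; a stand-in, not the real digest.
function hashContent(content: string): string {
  let hash = 0x811c9dc5
  for (let i = 0; i < content.length; i++) {
    hash ^= content.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return ('0000000' + (hash >>> 0).toString(16)).slice(-8)
}

const skillBefore: SkillFiles = {
  'SKILL.md': '# Payload skill',
  'reference/HOOKS.md': 'hooks v1',
  'reference/FIELDS.md': 'fields v1',
}
// Editing only a reference doc must still change the hash.
const skillAfter: SkillFiles = { ...skillBefore, 'reference/FIELDS.md': 'fields v2' }

const contentBefore = buildSkillContent(skillBefore)
const hashBefore = hashContent(contentBefore)
const hashAfter = hashContent(buildSkillContent(skillAfter))
```

Because the hash covers the concatenated text, a cache key derived from it changes when any reference doc changes, which matches the invalidation behavior described above.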

github-actions · 2026-05-06T14:46:14Z

📦 esbuild Bundle Analysis for payload

This analysis was generated by esbuild-bundle-analyzer. 🤖
This PR introduced no changes to the esbuild bundle! 🙌

denolfe added 12 commits May 5, 2026 12:23

chore: disable export-default eslint on eval-results

271108f

chore: add JSDoc to Assertion variant properties

e91cf30

denolfe requested a review from AlessioGr as a code owner May 6, 2026 14:36

github-actions Bot added the created-by: Payload team label May 6, 2026

denolfe changed the title ~~feat(evals): tighten codegen evaluation pipeline~~ test(evals): tighten codegen evaluation pipeline May 6, 2026

DanRibbens approved these changes May 6, 2026

denolfe merged commit 1ed0f74 into main May 6, 2026
17 checks passed

denolfe deleted the ai/skill-codegen-quality branch May 6, 2026 14:43
