test(evals): tighten codegen evaluation pipeline#16506

Merged
denolfe merged 12 commits into main from ai/skill-codegen-quality
May 6, 2026

@denolfe denolfe commented May 6, 2026

Overview

Hardens the eval pipeline used to test the Payload skill's codegen quality. Adds deterministic AST-level structural assertions, inlines the full skill (SKILL.md plus every reference/*.md) into the eval prompt, fixes dashboard / cache drift, and tightens the LLM scorer rubric.

Full agent evals coming soon.

Key Changes

  • AST assertions for codegen cases

    • New test/evals/assertions/ module: a TS-compiler-API parser plus a small DSL (collectionExists, fieldExists, fieldOption, fieldHook, collectionHook, collectionAccess, blockField).
    • CodegenEvalCase gains an optional assertions field. Failures short-circuit the case to fail before the LLM scorer runs, so structural mismatches no longer get smoothed over by a "mostly correct" rating.
    • All collections and fields cases backfilled with assertions matching the canonical patterns in FIELDS.md, HOOKS.md, and ACCESS-CONTROL.md.
  • Full skill context in the eval prompt

    • The eval LLM has no tool access, so the markdown links inside SKILL.md (e.g. [FIELDS.md](reference/FIELDS.md)) were inert text. Reference docs were effectively invisible to the model under test.
    • New skillContent.ts reads SKILL.md plus every reference/*.md and concatenates them under labeled headers. runner/systemPrompts.ts consumes this; cache.getSkillHash hashes the same content so cache keys invalidate when any reference doc changes.
    • System prompt grows from roughly 5K to roughly 42K tokens; OpenAI prompt caching keeps the per-run cost bounded.
  • LLM scorer rubric tightening

    • When the expected outcome names a specific construct or API shape, using a different one caps correctness at 0.4. "Minor" applies only when construct, location, and API surface are correct and only an option value differs.
  • Skill content adjustments

    • Quick-Reference row for computed fields now states "field-level hooks.afterRead returning the value".
    • Hook Example expanded with explicit collection-vs-field framing and an inline virtual-field example.
    • Removed unsupported "use as const for field options" guidance.
    • Added a rule to annotate extracted named constants with the matching Payload type or use satisfies.

Design Decisions

  • AST assertions over scorer-only judgment. The scorer (gpt-4o-mini) rated wrong-shape answers as "mostly correct" even after rubric tightening. Structural checks belong in deterministic code, not in another LLM. The DSL covers structural concerns; the scorer continues to handle semantic correctness on cases where assertions are absent or pass.
  • Eager inlining of reference docs over a tool-using runner. Tool use would mirror real Claude Code behavior more faithfully but adds non-determinism and runner complexity. Eager inlining is simple and deterministic, and prompt caching is sufficient for now.
  • Prune-on-write rather than purge-on-startup. Cache file accumulation comes from changing inputs (fixture content, skill hash) producing new keys. Pruning the previous logical-case entry on each write keeps the dashboard surface clean without a destructive global cleanup step.
  • starterContent stored on each result. Reading the live fixture at render time tied diff correctness to the current fixture file. Persisting the starter alongside the answer makes stored diffs immune to later fixture edits.

Overall Flow

sequenceDiagram
  participant V as Vitest case
  participant R as Runner LLM
  participant T as tsc
  participant A as AST assertions
  participant S as Scorer LLM
  participant C as Cache
  V->>R: instruction + starter + full skill (SKILL.md + reference/*.md)
  R-->>V: modified config
  V->>T: typecheck against fixtures tsconfig
  T-->>V: errors / clean
  V->>A: evaluate structural assertions
  A-->>V: errors / clean
  alt tsc fail or assertion fail
    V->>C: write fail result, prune siblings
    V-->>V: ✗ FAIL [TSC | ASSERT]
  else clean
    V->>S: score correctness + completeness
    S-->>V: score + reasoning
    V->>C: write scored result, prune siblings
  end
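
The gating order in the diagram can be sketched as a small orchestration function. This is a minimal sketch, not the code in `test/evals`: `evaluateCase`, `Stages`, and `Verdict` are hypothetical names, and the stages are injected so the example stays self-contained. The diagram does not say whether assertions still run after a tsc failure; this sketch assumes the case stops at the first failing stage.

```typescript
// Hypothetical sketch of the gating order; none of these names come from the PR.
type Verdict =
  | { status: 'fail'; stage: 'TSC' | 'ASSERT'; errors: string[] }
  | { status: 'scored'; score: number; reasoning: string }

interface Stages {
  typecheck: (config: string) => string[] // tsc errors, empty when clean
  assertStructure: (config: string) => string[] // assertion errors, empty when clean
  score: (config: string) => Promise<{ score: number; reasoning: string }>
}

async function evaluateCase(config: string, stages: Stages): Promise<Verdict> {
  const tscErrors = stages.typecheck(config)
  if (tscErrors.length > 0) {
    return { status: 'fail', stage: 'TSC', errors: tscErrors }
  }

  const assertionErrors = stages.assertStructure(config)
  if (assertionErrors.length > 0) {
    // Short-circuit: the scorer LLM is never called for a structural failure.
    return { status: 'fail', stage: 'ASSERT', errors: assertionErrors }
  }

  // Only structurally valid answers reach the LLM scorer.
  const { score, reasoning } = await stages.score(config)
  return { status: 'scored', score, reasoning }
}
```

Injecting the stages keeps the deterministic gates testable without any LLM call, which is the property the design relies on.
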

denolfe added 12 commits May 5, 2026 12:23
Adds test/evals/fixtures/db-stub.ts exporting a typed `stubAdapter`
(`{} as DatabaseAdapterObj`) and an `@/*` path mapping in the fixtures
tsconfig. All 33 starter `payload.config.ts` fixtures now use
`db: stubAdapter` instead of `null as unknown as Parameters<...>['db']`,
so LLMs see realistic Payload import patterns rather than a type cast.

Two related fixes so the eval dashboard reflects the current skill +
fixtures:

1. Include the SKILL.md hash in `codegenKey` (was only in `qaKey`).
   Without this, editing SKILL.md returned the old cached codegen
   result, so dashboard diffs stayed pinned to the pre-edit run.

2. Persist the starter file contents on each codegen `EvalResult` and
   diff against the stored copy. Previously the dashboard re-read the
   live fixture file at render time, so any fixture edit (e.g. the
   db-stub sweep) misaligned every cached diff until each case was
   re-run. Legacy entries fall back to the live file.
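
A minimal sketch of the two fixes, under stated assumptions: the field names in `CodegenKeyInput` and `StoredResult`, the SHA-256 choice, and the `starterForDiff` helper are hypothetical and only illustrate the idea. The real `codegenKey` and `EvalResult` shapes may differ.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical key inputs; the PR only states that the skill hash is now part of the key.
interface CodegenKeyInput {
  input: string
  fixtureContent: string
  modelId: string
  skillHash: string
}

function codegenKey(parts: CodegenKeyInput): string {
  // Any change to the skill hash yields a different key, so stale results are not reused.
  const payload = JSON.stringify([parts.input, parts.fixtureContent, parts.modelId, parts.skillHash])
  return createHash('sha256').update(payload).digest('hex')
}

// Hypothetical result shape; `starterContent` is absent on legacy cache entries.
interface StoredResult {
  answer: string
  starterContent?: string
}

// Diff against the starter captured at run time; fall back to the live fixture for legacy entries.
function starterForDiff(result: StoredResult, readLiveFixture: () => string): string {
  return result.starterContent ?? readLiveFixture()
}
```
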
Cache keys depend on fixture content and SKILL.md hash, so each fixture
or skill edit produces a fresh key — the previous file lingered on disk
and the dashboard kept rendering stale diffs alongside fresh ones.

`pruneStaleEntries` removes any entry matching the same logical case
(input + fixturePath + modelId + systemPromptKey) but a different key,
called immediately after each successful or tsc-failed cache write.
Cache-hit backfill writes are not pruned (same key, no siblings).
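
The pruning rule can be sketched over an in-memory list. This is an illustration only: the real cache lives in files on disk, and the `CacheEntry` shape and the return-the-survivors signature here are assumptions.

```typescript
// Hypothetical in-memory model of a cache entry.
interface CacheEntry {
  key: string
  input: string
  fixturePath: string
  modelId: string
  systemPromptKey: string
}

// Two entries describe the same logical case when all four identifying fields match.
function sameLogicalCase(a: CacheEntry, b: CacheEntry): boolean {
  return (
    a.input === b.input &&
    a.fixturePath === b.fixturePath &&
    a.modelId === b.modelId &&
    a.systemPromptKey === b.systemPromptKey
  )
}

// Keep the entry just written and every unrelated entry; drop same-case siblings with another key.
function pruneStaleEntries(entries: CacheEntry[], written: CacheEntry): CacheEntry[] {
  return entries.filter((entry) => entry.key === written.key || !sameLogicalCase(entry, written))
}
```

A cache-hit backfill rewrites an entry under its existing key, so no sibling with a different key exists and there is nothing to prune.
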
The virtual-field case passed (score 0.7) despite the LLM placing
afterRead on the collection rather than on the field. The scorer's
"minor property differs" rule was too lenient: wrong hook scope is a
structural mismatch, not a minor option difference.

- Updates the virtual-field dataset `expected` to describe the
  canonical field-level pattern (`siblingData`, `virtual: true`,
  field-level `hooks.afterRead`) so the scorer has a precise target.
- Adds a rubric clause to scoreConfigChange: when `expected` names a
  specific construct or API shape, using a different construct caps
  correctness at 0.4. "Minor" only applies when construct, location,
  and API surface are right and only an option value differs.

The LLM scorer keeps rating wrong-shape answers as "mostly correct"
even after rubric tightening — gpt-4o-mini misses structural
distinctions like field-level vs. collection-level hooks. This adds a
deterministic check that runs before the LLM scorer.

- New `test/evals/assertions/` module with a small DSL:
  collectionExists, fieldExists, fieldOption, fieldHook, collectionHook. Each
  resolves against an AST built with the TS compiler API; identifier
  references (e.g. `const Users = {...}; collections: [Users]`) are
  followed.
- `CodegenEvalCase.assertions` is optional. When set, any failure
  short-circuits the case to fail with reasoning before the LLM scorer
  runs (no extra cost, no chance of being smoothed over).
- `EvalResult.assertionErrors` carries the failures for the dashboard.
- Adds the virtual-field fixture and a starter assertion set covering
  the field-level `hooks.afterRead` case the scorer was missing.
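
To illustrate how such a DSL can separate field-level from collection-level hooks, here is a minimal sketch that evaluates assertions against an already-parsed structure. The `ParsedField` and `ParsedCollection` shapes, the `Assertion` signature, and `runAssertions` are hypothetical; the real module derives its data from the TS compiler API and may look quite different.

```typescript
// Hypothetical parsed shapes (the real ones are built from the TypeScript AST).
interface ParsedField {
  name: string
  type: string
  hooks: string[] // hook names found on the field, e.g. 'afterRead'
}

interface ParsedCollection {
  slug: string
  hooks: string[] // hook names found on the collection
  fields: ParsedField[]
}

// An assertion returns an error message, or null when it is satisfied.
type Assertion = (collections: ParsedCollection[]) => string | null

const fieldHook =
  (slug: string, field: string, hook: string): Assertion =>
  (collections) => {
    const collection = collections.find((c) => c.slug === slug)
    if (!collection) return `collection "${slug}" not found`
    const target = collection.fields.find((f) => f.name === field)
    if (!target) return `field "${slug}.${field}" not found`
    // A hook of the same name on the collection does not satisfy this check.
    return target.hooks.includes(hook) ? null : `field "${slug}.${field}" has no ${hook} hook`
  }

// Collect every failure; a non-empty result would fail the case before the scorer runs.
function runAssertions(collections: ParsedCollection[], assertions: Assertion[]): string[] {
  return assertions.map((assert) => assert(collections)).filter((e): e is string => e !== null)
}
```

With this shape, an answer that puts `afterRead` on the collection fails `fieldHook('users', 'fullName', 'afterRead')` regardless of how a scorer would have rated it.
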
Extends the AST DSL and adds assertions to every collections- and
fields-dataset case. Patterns asserted match the canonical shapes in
the skill's reference docs (FIELDS.md, ACCESS-CONTROL.md, HOOKS.md).

DSL extensions:
- `parentField` on field assertions to walk one level into array/group
  fields' `fields` array
- `collectionAccess` for collection-level access functions
- `blockField` for fields nested inside blocks (`field.blocks[].fields`)
- Parser now extracts nested fields, blocks, and access objects
- ParsedCollection.access / ParsedField.{fields,blocks} populated when
  the field type indicates nesting
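
A minimal sketch of the two nested lookups, assuming hypothetical `ParsedField` / `ParsedBlock` shapes and helper names (`findField`, `findBlockField`) that are not taken from the PR:

```typescript
// Hypothetical parsed shapes; `fields` and `blocks` are only populated for nesting field types.
interface ParsedBlock {
  slug: string
  fields: ParsedField[]
}

interface ParsedField {
  name: string
  type: string
  fields?: ParsedField[] // array / group children
  blocks?: ParsedBlock[] // blocks children
}

// `parentField` walks exactly one level into an array or group field's `fields` array.
function findField(fields: ParsedField[], name: string, parentField?: string): ParsedField | undefined {
  if (parentField === undefined) return fields.find((f) => f.name === name)
  const parent = fields.find((f) => f.name === parentField)
  return parent?.fields?.find((f) => f.name === name)
}

// `blockField` resolves field.blocks[].fields for one block slug.
function findBlockField(
  fields: ParsedField[],
  blocksField: string,
  blockSlug: string,
  name: string,
): ParsedField | undefined {
  const container = fields.find((f) => f.name === blocksField && f.type === 'blocks')
  const block = container?.blocks?.find((b) => b.slug === blockSlug)
  return block?.fields.find((f) => f.name === name)
}
```
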

Backfilled cases:
- collections: posts-title-content, categories-relationship,
  media-access-control, comments-relationships, beforechange-hook
  (virtual-field already had assertions)
- fields: select-status, array-images, group-seo, number-price,
  blocks-layout, checkbox-ispublished

Dashboard / globalSetup updated so codegen-result detection includes
assertionErrors (not just tscErrors / changeDescription), preventing
miscategorisation when assertions fail before the LLM scorer runs.

Config / plugins / negative datasets are not backfilled — they require
DSL extensions for top-level config keys (admin, cors, serverURL,
onInit, localization), plugin function-body introspection, and bug-fix
verification respectively. Those cases continue to rely on the LLM
scorer with its tightened rubric.

Codegen evals surfaced LLM output like
`export const Posts = { slug: 'posts', fields: [{ type: 'text' }] }`
where bare extraction widens `type: 'text'` to `string` and breaks the
`Field` / `CollectionConfig` discriminated unions. Inline object
literals work because of contextual typing, so this issue only shows
up in extracted-const code paths.

Adds a Type Safety bullet in SKILL.md covering all extraction targets
(collections, fields, hooks, access, plugins) — annotate with the
matching Payload type or use `satisfies`.
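
The widening problem and both fixes can be reproduced with a reduced stand-in for the Payload types. `Field` and `CollectionConfig` below are hypothetical simplifications, not the real Payload definitions, and `acceptConfig` exists only to force an assignability check.

```typescript
// Hypothetical, reduced stand-ins for Payload's `Field` and `CollectionConfig`.
type Field =
  | { name: string; type: 'text' }
  | { name: string; type: 'number'; min?: number }

interface CollectionConfig {
  slug: string
  fields: Field[]
}

function acceptConfig(config: CollectionConfig): string {
  return config.slug
}

// Bare extraction: `type` is inferred as `string`, so the discriminated union no longer matches.
const Widened = { slug: 'posts', fields: [{ name: 'title', type: 'text' }] }
// @ts-expect-error `type: string` is not assignable to the `Field` union
acceptConfig(Widened)

// Fix 1: annotate the constant so the literal is contextually typed.
const Annotated: CollectionConfig = { slug: 'posts', fields: [{ name: 'title', type: 'text' }] }
acceptConfig(Annotated)

// Fix 2: `satisfies` checks the shape while keeping the narrower inferred type.
const Checked = { slug: 'posts', fields: [{ name: 'title', type: 'text' }] } satisfies CollectionConfig
acceptConfig(Checked)
```

`satisfies` requires TypeScript 4.9 or later; on older compilers only the annotation form is available.
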
The bullet had no example in the skill, no precedent in templates
(only Next.js redirects use `as const`, for `RouteHas['type']`), and
the most natural reading — applying `as const` to a collection or
field literal — produces readonly tuples that fail to assign to
`CollectionConfig.fields: Field[]`. The "annotate extracted named
constants" rule covers the real type-safety need.
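
The readonly-tuple failure described above can be shown with a reduced example. `Field` here is a hypothetical stand-in for the Payload type, kept small so the snippet is self-contained.

```typescript
// Hypothetical, reduced stand-in for Payload's `Field` union.
type Field = { name: string; type: 'text' } | { name: string; type: 'number' }

// `as const` infers: readonly [{ readonly name: 'title'; readonly type: 'text' }]
const frozenFields = [{ name: 'title', type: 'text' }] as const

// @ts-expect-error a readonly tuple cannot be assigned to the mutable `Field[]`
const fields: Field[] = frozenFields

// Annotating the constant gives the same literal checking without the readonly tuple.
const annotatedFields: Field[] = [{ name: 'title', type: 'text' }]
```
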
LLM kept producing collection-level afterRead for virtual-field
codegen even with rubric + AST assertions in place. The skill
mentioned virtual fields but never surfaced the field-vs-collection
hook distinction in-context — the Hook Example only showed a
collection hook, and the Quick Reference row was ambiguous.

- Quick Reference: row for computed fields now says "field-level
  hooks.afterRead returning the value".
- Hook Example: adds explicit framing on the two hook levels (which
  args, what they return) plus a virtual-field example inline, and a
  trailing rule: when asked to compute / populate a field's value,
  use a field-level hook, never a collection-level one.

The eval LLM has no tool access, so the markdown links inside SKILL.md
(e.g. `[FIELDS.md](reference/FIELDS.md)`) were inert text — every
reference doc was effectively invisible to the model under test.
Several iterations of the virtual-field case were chased back to this:
patterns documented only in FIELDS.md never reached the LLM.

- New shared `skillContent.ts` loads SKILL.md plus every `reference/*.md`
  in stable order, separated by labeled headers, cached per process
- `runner/systemPrompts.ts` and `cache.ts` both consume it
- `getSkillHash` now hashes the full concatenated context so cache keys
  invalidate when any reference doc changes, not just SKILL.md
- Context grows from ~5K to ~42K tokens; OpenAI prompt caching keeps
  the per-run cost amortized
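
A minimal sketch of the concatenate-then-hash idea, working on in-memory strings so it stays self-contained. The header format, the `buildSkillContext` name and signature, and the SHA-256 choice are assumptions; the real `skillContent.ts` reads files from disk and caches the result per process.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical: `references` maps a file name such as 'FIELDS.md' to its contents.
function buildSkillContext(skill: string, references: Record<string, string>): string {
  const sections = Object.keys(references)
    .sort() // stable order, so the output does not depend on listing order
    .map((name) => `===== reference/${name} =====\n${references[name]}`)
  return [`===== SKILL.md =====\n${skill}`, ...sections].join('\n\n')
}

// Hashing the full context means an edit to any reference doc changes the cache key.
function getSkillHash(context: string): string {
  return createHash('sha256').update(context).digest('hex')
}
```

Because the system prompt and the hash are derived from the same string, the cache cannot go stale relative to what the model actually saw.
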
@denolfe denolfe requested a review from AlessioGr as a code owner May 6, 2026 14:36
@denolfe denolfe changed the title feat(evals): tighten codegen evaluation pipeline test(evals): tighten codegen evaluation pipeline May 6, 2026
@denolfe denolfe merged commit 1ed0f74 into main May 6, 2026
17 checks passed
@denolfe denolfe deleted the ai/skill-codegen-quality branch May 6, 2026 14:43

github-actions Bot commented May 6, 2026

📦 esbuild Bundle Analysis for payload

This analysis was generated by esbuild-bundle-analyzer. 🤖
This PR introduced no changes to the esbuild bundle! 🙌
