test: evals per-case QA tests and inlined suites by denolfe · Pull Request #16424 · payloadcms/payload

denolfe · 2026-04-29T18:11:05Z

Overview

Refactors the eval suite to surface per-question pass/fail in vitest output and reduces boilerplate by inlining suite registration into the spec files.

Key Changes

Per-question test reporting
- QA datasets previously batched all questions into a single it() that asserted on aggregate accuracy. Each question now registers as its own it() block inside describe.concurrent, so vitest reports pass/fail per question and -t filtering targets individual cases.
- Extracted runQACase from runDataset for per-case invocation. runDataset still exists as a wrapper that maps over cases and aggregates accuracy for the snapshot output.
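The split between runQACase and runDataset might look roughly like the sketch below. The type shapes, the injected answer and score functions, and the threshold value are assumptions made for illustration; the PR does not show the real signatures.

```typescript
// Hypothetical shapes; the real types in the eval runner may differ.
interface QACase {
  question: string
  expected: string
}

interface QACaseResult {
  question: string
  answer: string
  score: number
  pass: boolean
}

// Assumed value. The PR only says SCORE_THRESHOLD is the per-case pass criterion.
const SCORE_THRESHOLD = 0.7

// Stand-ins for the model call and the LLM judge (both assumptions).
type AnswerFn = (question: string) => Promise<string>
type ScoreFn = (answer: string, expected: string) => Promise<number>

// Runs a single case end to end so that one it() block can call it directly.
async function runQACase(testCase: QACase, answer: AnswerFn, score: ScoreFn): Promise<QACaseResult> {
  const modelAnswer = await answer(testCase.question)
  const caseScore = await score(modelAnswer, testCase.expected)
  return {
    question: testCase.question,
    answer: modelAnswer,
    score: caseScore,
    pass: caseScore >= SCORE_THRESHOLD,
  }
}

// Wrapper that maps over all cases and aggregates accuracy for reporting.
async function runDataset(cases: QACase[], answer: AnswerFn, score: ScoreFn) {
  const results = await Promise.all(cases.map((c) => runQACase(c, answer, score)))
  const passed = results.filter((r) => r.pass).length
  const accuracy = cases.length === 0 ? 0 : passed / cases.length
  return { results, accuracy }
}
```

Under this layout, accuracy is derived from the per-case results rather than asserted on, which matches the description of runDataset as a wrapper for the snapshot output.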
Inlined suite registration
- The 10 register*Suite() functions in suites/ were thin wrappers around two patterns. Replaced with registerQACases and registerCodegenCases helpers in suites/helpers.ts; each spec file calls these directly with its dataset and label.
- Deleted the per-suite files. The suites/ directory now contains only helpers.ts and types.ts.
- Dropped redundant nested wrappers in graphql/local-api/rest-api specs (single-child describe('Collections') / describe('CRUD')).
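A helper in the spirit of registerQACases could be sketched as follows. The parameter list is hypothetical, and the test API is injected as a plain object so the sketch runs without vitest; the PR names the helper but does not show its signature.

```typescript
// Minimal structural stand-ins for vitest's describe.concurrent and it.
type SuiteFn = (name: string, fn: () => void) => void
type TestFn = (name: string, fn: () => void | Promise<void>) => void

interface QACase {
  question: string
  expected: string
}

// Hypothetical signature, for illustration only.
function registerQACases(
  label: string,
  cases: QACase[],
  runCase: (testCase: QACase) => Promise<void>,
  api: { describe: SuiteFn; it: TestFn },
): void {
  api.describe(label, () => {
    for (const testCase of cases) {
      // One test per question, so the reporter lists each case by name
      // and -t filtering can target a single question.
      api.it(testCase.question, () => runCase(testCase))
    }
  })
}
```

In a spec file the api argument would presumably be built from vitest's own exports, for example { describe: describe.concurrent, it }, with the dataset and label passed in directly.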
API key check in globalSetup
- The duplicated beforeAll API key guard from 10 spec files moved into globalSetup.setup(). Spec files no longer import beforeAll or repeat the check.
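A global setup file along these lines would cover the guard once for the whole run. Vitest calls an exported setup() from each file listed under globalSetup before any test file executes. The environment variable name and the error text here are assumptions, since the PR does not say which key is checked.

```typescript
// Assumed variable name; substitute whichever key the eval runner requires.
const REQUIRED_KEY = 'OPENAI_API_KEY'

// Pure check, kept separate so it can be exercised without touching process.env.
export function assertApiKey(env: Record<string, string | undefined>): void {
  if (!env[REQUIRED_KEY]) {
    throw new Error(`${REQUIRED_KEY} is not set. Add it to .env before running the evals.`)
  }
}

// Entry point that vitest invokes once per run. Throwing here aborts the run
// before any spec file is loaded, so individual specs need no beforeAll guard.
export function setup(): void {
  assertApiKey(process.env)
}
```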
Richer failure output
- failureMessage and caseFailureMessage now include the model answer and scorer reasoning inline in the vitest assertion message.
- Switched all assert(condition, message) calls in suites to expect(value, message).toBe(...) for native vitest reporter integration.
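As an illustration of the richer message, a formatter like the one below could feed the second argument of expect. The field names and line layout are assumptions; only the inclusion of the model answer and scorer reasoning comes from the PR.

```typescript
// Hypothetical result shape for one scored case.
interface CaseOutcome {
  question: string
  expected: string
  answer: string
  score: number
  reasoning: string
  pass: boolean
}

// Builds a multi-line message so a failing case explains itself in the test output.
function caseFailureMessage(outcome: CaseOutcome): string {
  return [
    `Question: ${outcome.question}`,
    `Expected: ${outcome.expected}`,
    `Model answer: ${outcome.answer}`,
    `Score: ${outcome.score.toFixed(2)}`,
    `Scorer reasoning: ${outcome.reasoning}`,
  ].join('\n')
}

// Inside a vitest test the message would be passed as the second argument:
//   expect(outcome.pass, caseFailureMessage(outcome)).toBe(true)
```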
Path alias support in fixture type-checking
- Added fixtures/ambient.d.ts with wildcard module declarations for @/* and ~/*. The per-invocation tsconfig in validate.ts includes this file so LLM-generated configs that import from common path aliases type-check structurally without requiring real stub files per fixture.
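The ambient file could be as small as the sketch below, assuming it uses TypeScript's shorthand ambient module form with wildcard patterns. The actual contents of fixtures/ambient.d.ts are not shown in the PR.

```typescript
// fixtures/ambient.d.ts (sketch)
// Shorthand ambient module declarations with wildcard patterns: every import
// whose specifier matches is typed as any, so no stub file is needed per path.
declare module '@/*'
declare module '~/*'
```

With these declarations in the compilation, an import such as one from @/components/... type-checks without the target file existing on disk.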
Removed unused ACCURACY_THRESHOLD
- The dataset-level accuracy gate is no longer applied since each case asserts independently. SCORE_THRESHOLD (per-case pass criterion for the LLM judge) remains.
Cleanup
- Removed redundant dotenv.config() from runner/systemPrompts.ts (vitest setup already loads .env).
- Switched bare fs/path imports to node: prefixed.

Design Decisions

Per-case over batched assertions: The batched it() masked individual failures behind aggregate accuracy. With 30+ questions per dataset, the visibility tradeoff favored splitting. Each case now asserts independently and aggregate metrics are computed at snapshot/reporting time rather than in test code.
Inlining over indirection: With variant selection moved to a runtime env var (EVAL_VARIANT via resolveVariantOptions()), each spec calls register once. The register*Suite() function abstraction added a layer with no remaining purpose. Inlining puts the suite definition (name, datasets, structure) in one file the reader can scan top to bottom.
expect over assert: Both work, but expect integrates with vitest's reporter so the custom failure message is rendered alongside the standard test output rather than as a raw AssertionError.
Ambient module declarations over per-fixture stub files: LLMs commonly generate imports from @/components/... paths. Maintaining real stub files per import path would not scale. Wildcard module declarations let any @/* or ~/* import resolve to the any type, which is sufficient for structural type-checking.
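To make the "Inlining over indirection" point concrete, runtime variant selection might be sketched as below. Only the names EVAL_VARIANT and resolveVariantOptions come from the PR; the variant names, the option shape, and the default are invented for illustration.

```typescript
// Hypothetical option shape and variant table.
interface VariantOptions {
  label: string
  systemPromptKey: string
}

const VARIANTS: Record<string, VariantOptions> = {
  baseline: { label: 'baseline', systemPromptKey: 'default' },
  skill: { label: 'skill', systemPromptKey: 'withSkill' },
}

// Reads the variant once at runtime, so a spec file registers its cases a
// single time instead of once per variant.
function resolveVariantOptions(
  env: Record<string, string | undefined> = process.env,
): VariantOptions {
  const name = env['EVAL_VARIANT'] ?? 'baseline'
  const options = VARIANTS[name]
  if (!options) {
    const known = Object.keys(VARIANTS).join(', ')
    throw new Error(`Unknown EVAL_VARIANT "${name}". Expected one of: ${known}`)
  }
  return options
}
```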

Overall Flow

flowchart TB
    subgraph Before
      A1[spec file]
      A1 --> A2[register Suite function]
      A2 --> A3[runDataset]
      A3 --> A4[Promise.all over cases]
      A4 --> A5[Single it asserts aggregate accuracy]
    end

    subgraph After
      B1[spec file]
      B1 --> B2[registerQACases helper]
      B2 --> B3[describe.concurrent]
      B3 --> B4[it per case]
      B4 --> B5[runQACase]
      B5 --> B6[expect with rich message]
    end

To see the specific tasks where the Asana app for GitHub is being used, see below:
- https://app.asana.com/0/0/1214427219166339

Each QA dataset case now registers as its own vitest it() block instead of being batched into a single accuracy assertion. This surfaces per-question pass/fail in vitest output and enables targeted re-runs via -t filtering.
- Extract runQACase from runDataset for per-case invocation
- Extract registerQACases / registerCodegenCases helpers
- Inline 10 suite files into the spec files that consume them
- Switch assert() to expect() for native vitest reporter integration
- Enrich failureMessage and caseFailureMessage with answer + reasoning
- Add fixtures/ambient.d.ts for @/* and ~/* path aliases in codegen

github-actions · 2026-04-29T18:20:25Z

📦 esbuild Bundle Analysis for payload

This analysis was generated by esbuild-bundle-analyzer. 🤖

Meta File	Out File	Size (raw)	Note
packages/next/meta_index.json	esbuild/index.js	985.42 KB	🆕 Added
packages/payload/meta_index.json	esbuild/index.js	1.39 MB	🆕 Added
packages/payload/meta_shared.json	esbuild/exports/shared.js	191.30 KB	🆕 Added
packages/richtext-lexical/meta_client.json	esbuild/exports/client_optimized/index.js	287.18 KB	🆕 Added
packages/ui/meta_client.json	esbuild/exports/client_optimized/index.js	1.19 MB	🆕 Added
packages/ui/meta_shared.json	esbuild/exports/shared_optimized/index.js	16.32 KB	🆕 Added

Largest paths

This visualization shows the top 20 largest paths in the bundle.

Meta file: packages/next/meta_index.json, Out file: esbuild/index.js

Path	Size
../../node_modules	████████████████████▌ 82.3%, 807.63 KB
dist/views/Version	█▎ 5.3%, 51.49 KB
dist/views/Dashboard	▌ 2.2%, 21.38 KB
dist/views/Document	▍ 1.7%, 16.59 KB
dist/views/List	▎ 1.2%, 11.38 KB
dist/views/Root	▎ 1.0%, 9.90 KB
dist/views/Versions	▏ 0.6%, 6.17 KB
dist/views/API	▏ 0.6%, 6.13 KB
dist/elements/Nav	▏ 0.6%, 5.96 KB
dist/views/Account	▏ 0.6%, 5.55 KB
dist/elements/DocumentHeader	▏ 0.5%, 4.81 KB
dist/views/Login	0.4%, 4.40 KB
dist/layouts/Root	0.3%, 3.41 KB
dist/views/ForgotPassword	0.3%, 3.13 KB
dist/views/CreateFirstUser	0.3%, 2.81 KB
dist/templates/Default	0.3%, 2.64 KB
dist/views/BrowseByFolder	0.3%, 2.61 KB
dist/views/CollectionFolders	0.2%, 2.44 KB
dist/views/ResetPassword	0.2%, 2.40 KB
dist/views/Logout	0.2%, 1.94 KB
(other)	████▍ 17.7%, 173.12 KB

Meta file: packages/payload/meta_index.json, Out file: esbuild/index.js

Path	Size
../../node_modules	█████████████████▏ 68.8%, 951.98 KB
dist/fields/hooks	▊ 3.2%, 44.07 KB
dist/collections/operations	▋ 2.9%, 39.96 KB
dist/versions/migrations	▎ 1.3%, 18.50 KB
dist/auth/operations	▎ 1.1%, 15.63 KB
dist/fields/config	▎ 1.0%, 14.16 KB
dist/globals/operations	▎ 1.0%, 13.32 KB
dist/utilities/configToJSONSchema.js	▏ 0.9%, 13.13 KB
dist/queues/operations	▏ 0.9%, 12.43 KB
dist/fields/validations.js	▏ 0.8%, 10.57 KB
dist/bin/generateImportMap	▏ 0.7%, 9.08 KB
dist/collections/config	▏ 0.6%, 8.91 KB
dist/config/orderable	▏ 0.6%, 8.00 KB
dist/uploads/fetchAPI-multipart	▏ 0.6%, 7.80 KB
dist/index.js	▏ 0.6%, 7.79 KB
dist/database/migrations	▏ 0.5%, 7.54 KB
dist/collections/endpoints	▏ 0.5%, 6.23 KB
dist/config/sanitize.js	0.4%, 5.86 KB
dist/auth/strategies	0.4%, 5.50 KB
dist/queues/config	0.4%, 5.31 KB
(other)	███████▊ 31.2%, 431.87 KB

Meta file: packages/payload/meta_shared.json, Out file: esbuild/exports/shared.js

Path	Size
../../node_modules	███████████████████▊ 79.4%, 148.89 KB
dist/fields/validations.js	█▍ 5.6%, 10.57 KB
dist/config/orderable	▍ 1.7%, 3.13 KB
dist/fields/baseFields	▍ 1.5%, 2.79 KB
dist/utilities/deepCopyObject.js	▎ 1.4%, 2.54 KB
dist/auth/cookies.js	▏ 0.8%, 1.55 KB
dist/utilities/flattenTopLevelFields.js	▏ 0.8%, 1.42 KB
dist/fields/config	▏ 0.7%, 1.28 KB
dist/utilities/getVersionsConfig.js	▏ 0.6%, 1.04 KB
dist/utilities/flattenAllFields.js	▏ 0.5%, 943 B
dist/folders/utils	▏ 0.5%, 916 B
dist/utilities/unflatten.js	0.4%, 779 B
dist/utilities/sanitizeUserDataForEmail.js	0.4%, 713 B
dist/utilities/getFieldPermissions.js	0.3%, 651 B
dist/collections/config	0.3%, 570 B
dist/bin/generateImportMap	0.3%, 561 B
dist/auth/sessions.js	0.3%, 525 B
dist/fields/getFieldPaths.js	0.3%, 485 B
dist/utilities/getSafeRedirect.js	0.2%, 423 B
dist/utilities/deepMerge.js	0.2%, 413 B
(other)	█████▏ 20.6%, 38.74 KB

Meta file: packages/richtext-lexical/meta_client.json, Out file: esbuild/exports/client_optimized/index.js

Path	Size
dist/features/blocks	███▏ 12.8%, 36.34 KB
dist/lexical/plugins	██▉ 11.5%, 32.65 KB
dist/lexical/ui	██▏ 8.6%, 24.36 KB
dist/features/experimental_table	██ 8.3%, 23.70 KB
dist/packages/@lexical	█▋ 6.7%, 18.99 KB
dist/features/link	█▋ 6.5%, 18.53 KB
dist/features/toolbars	█▍ 5.7%, 16.08 KB
dist/features/upload	█▏ 4.9%, 13.77 KB
dist/features/textState	▉ 3.9%, 11.08 KB
dist/features/relationship	▊ 3.2%, 9.03 KB
dist/lexical/utils	▊ 3.1%, 8.79 KB
dist/features/converters	▋ 2.9%, 8.36 KB
dist/features/debug	▋ 2.6%, 7.40 KB
dist/utilities/fieldsDrawer	▋ 2.5%, 7.15 KB
dist/lexical/config	▍ 1.8%, 5.08 KB
dist/features/lists	▍ 1.8%, 5.00 KB
dist/features/format	▎ 1.2%, 3.46 KB
dist/lexical/LexicalEditor.js	▎ 1.1%, 3.23 KB
dist/field/Field.js	▎ 1.0%, 2.81 KB
dist/lexical/nodes	▏ 0.9%, 2.66 KB
(other)	█████████████████████▊ 87.2%, 247.61 KB

Meta file: packages/ui/meta_client.json, Out file: esbuild/exports/client_optimized/index.js

Path	Size
../../node_modules	████████████▎ 49.3%, 579.12 KB
dist/elements/FolderView	▋ 2.5%, 29.38 KB
dist/elements/BulkUpload	▌ 2.4%, 28.24 KB
dist/elements/WhereBuilder	▍ 1.5%, 17.36 KB
dist/views/Edit	▍ 1.5%, 17.30 KB
dist/forms/Form	▎ 1.4%, 15.91 KB
dist/fields/Relationship	▎ 1.3%, 15.79 KB
dist/elements/Table	▎ 1.3%, 15.77 KB
dist/fields/Upload	▎ 1.2%, 14.22 KB
dist/fields/Blocks	▎ 1.2%, 13.90 KB
dist/elements/QueryPresets	▏ 0.9%, 10.36 KB
dist/elements/PublishButton	▏ 0.8%, 9.11 KB
dist/providers/Folders	▏ 0.7%, 8.46 KB
dist/elements/HTMLDiff	▏ 0.7%, 8.38 KB
dist/elements/ListHeader	▏ 0.7%, 8.07 KB
dist/fields/Array	▏ 0.7%, 7.73 KB
dist/views/CollectionFolder	▏ 0.6%, 7.50 KB
dist/views/List	▏ 0.6%, 7.36 KB
dist/elements/ReactSelect	▏ 0.6%, 7.33 KB
dist/elements/LivePreview	▏ 0.6%, 7.03 KB
(other)	████████████▋ 50.7%, 596.51 KB

Meta file: packages/ui/meta_shared.json, Out file: esbuild/exports/shared_optimized/index.js

Path	Size
dist/graphics/Logo	█████ 20.0%, 3.12 KB
../../node_modules	████▎ 17.0%, 2.65 KB
dist/graphics/Icon	██▍ 9.8%, 1.52 KB
dist/utilities/formatDocTitle	██▏ 8.5%, 1.32 KB
dist/providers/TableColumns	█▍ 5.5%, 862 B
dist/utilities/groupNavItems.js	█▎ 5.2%, 814 B
dist/utilities/getGlobalData.js	█▏ 4.9%, 762 B
dist/utilities/api.js	█▏ 4.8%, 756 B
dist/elements/Translation	▊ 3.2%, 493 B
dist/utilities/handleTakeOver.js	▋ 2.8%, 440 B
dist/utilities/traverseForLocalizedFields.js	▋ 2.6%, 399 B
dist/elements/withMergedProps	▌ 2.2%, 339 B
dist/utilities/getVisibleEntities.js	▌ 2.1%, 329 B
dist/utilities/getNavGroups.js	▍ 1.9%, 301 B
dist/elements/WithServerSideProps	▍ 1.5%, 232 B
dist/utilities/handleGoBack.js	▎ 1.2%, 180 B
dist/fields/mergeFieldStyles.js	▎ 1.0%, 159 B
dist/utilities/handleBackToDashboard.js	▎ 1.0%, 152 B
dist/forms/Form	▏ 0.9%, 147 B
dist/utilities/abortAndIgnore.js	▏ 0.9%, 146 B
(other)	████████████████████ 80.0%, 12.51 KB

Details

Next to the size is how much the size has increased or decreased compared with the base branch of this PR.

‼️: Size increased by 20% or more. Special attention should be given to this.
⚠️: Size increased in acceptable range (lower than 20%).
✅: No change or even downsized.
🗑️: The out file is deleted: not found in base branch.
🆕: The out file is newly found: will be added to base branch.

denolfe · 2026-04-29T19:14:19Z

Failures are flakes.

denolfe added 3 commits March 25, 2026 17:04

chore(evals): move API key check to globalSetup and clean up imports

cd5c75e

Move the duplicated beforeAll API key guard from 10 spec files into globalSetup.ts setup(). Remove redundant dotenv import from systemPrompts.ts and use node: prefixed imports.

chore: remove unused ACCURACY_THRESHOLD constant

fbe6cc0

The suite-level accuracy gate was removed when QA tests were split into per-question cases. SCORE_THRESHOLD remains for per-case pass/fail.

github-actions Bot added the created-by: Payload team label Apr 29, 2026

denolfe changed the title ~~chore(evals): per-case QA tests and inlined suites~~ test(evals): per-case QA tests and inlined suites Apr 29, 2026

denolfe changed the title ~~test(evals): per-case QA tests and inlined suites~~ test: evals per-case QA tests and inlined suites Apr 29, 2026

denolfe marked this pull request as ready for review April 29, 2026 19:13

denolfe merged commit db1b2a8 into main Apr 29, 2026
171 of 176 checks passed

denolfe deleted the ai/evals-cleanup branch April 29, 2026 19:14
