Skip to content

test: evals per-case QA tests and inlined suites#16424

Merged
denolfe merged 3 commits intomainfrom
ai/evals-cleanup
Apr 29, 2026
Merged

test: evals per-case QA tests and inlined suites#16424
denolfe merged 3 commits intomainfrom
ai/evals-cleanup

Conversation

@denolfe
Copy link
Copy Markdown
Member

@denolfe denolfe commented Apr 29, 2026

Overview

Refactors the eval suite to surface per-question pass/fail in vitest output and reduces boilerplate by inlining suite registration into the spec files.

Key Changes

  • Per-question test reporting

    • QA datasets previously batched all questions into a single it() that asserted on aggregate accuracy. Each question now registers as its own it() block inside describe.concurrent, so vitest reports pass/fail per question and -t filtering targets individual cases.
    • Extracted runQACase from runDataset for per-case invocation. runDataset still exists as a wrapper that maps over cases and aggregates accuracy for the snapshot output.
  • Inlined suite registration

    • The 10 register*Suite() functions in suites/ were thin wrappers around two patterns. Replaced with registerQACases and registerCodegenCases helpers in suites/helpers.ts; each spec file calls these directly with its dataset and label.
    • Deleted the per-suite files. The suites/ directory now contains only helpers.ts and types.ts.
    • Dropped redundant nested wrappers in graphql/local-api/rest-api specs (single-child describe('Collections') / describe('CRUD')).
  • API key check in globalSetup

    • The duplicated beforeAll API key guard from 10 spec files moved into globalSetup.setup(). Spec files no longer import beforeAll or repeat the check.
  • Richer failure output

    • failureMessage and caseFailureMessage now include the model answer and scorer reasoning inline in the vitest assertion message.
    • Switched all assert(condition, message) calls in suites to expect(value, message).toBe(...) for native vitest reporter integration.
  • Path alias support in fixture type-checking

    • Added fixtures/ambient.d.ts with wildcard module declarations for @/* and ~/*. The per-invocation tsconfig in validate.ts includes this file so LLM-generated configs that import from common path aliases type-check structurally without requiring real stub files per fixture.
  • Removed unused ACCURACY_THRESHOLD

    • The dataset-level accuracy gate is no longer applied since each case asserts independently. SCORE_THRESHOLD (per-case pass criterion for the LLM judge) remains.
  • Cleanup

    • Removed redundant dotenv.config() from runner/systemPrompts.ts (vitest setup already loads .env).
    • Switched bare fs/path imports to node: prefixed.

Design Decisions

  • Per-case over batched assertions: The batched it() masked individual failures behind aggregate accuracy. With 30+ questions per dataset, the visibility tradeoff favored splitting. Each case now asserts independently and aggregate metrics are computed at snapshot/reporting time rather than in test code.

  • Inlining over indirection: With variant selection moved to a runtime env var (EVAL_VARIANT via resolveVariantOptions()), each spec calls register once. The register*Suite() function abstraction added a layer with no remaining purpose. Inlining puts the suite definition (name, datasets, structure) in one file the reader can scan top to bottom.

  • expect over assert: Both work, but expect integrates with vitest's reporter so the custom failure message is rendered alongside the standard test output rather than as a raw AssertionError.

  • Ambient module declarations over per-fixture stub files: LLMs commonly generate imports from @/components/... paths. Maintaining real stub files per import path would not scale. Wildcard module declarations let any @/* import resolve to any, which is sufficient for structural type-checking.

Overall Flow

flowchart TB
    subgraph Before
      A1[spec file]
      A1 --> A2[register Suite function]
      A2 --> A3[runDataset]
      A3 --> A4[Promise.all over cases]
      A4 --> A5[Single it asserts aggregate accuracy]
    end

    subgraph After
      B1[spec file]
      B1 --> B2[registerQACases helper]
      B2 --> B3[describe.concurrent]
      B3 --> B4[it per case]
      B4 --> B5[runQACase]
      B5 --> B6[expect with rich message]
    end
Loading

denolfe added 3 commits March 25, 2026 17:04
Move the duplicated beforeAll API key guard from 10 spec files into
globalSetup.ts setup(). Remove redundant dotenv import from
systemPrompts.ts and use node: prefixed imports.
Each QA dataset case now registers as its own vitest it() block instead
of being batched into a single accuracy assertion. This surfaces
per-question pass/fail in vitest output and enables targeted re-runs
via -t filtering.

- Extract runQACase from runDataset for per-case invocation
- Extract registerQACases / registerCodegenCases helpers
- Inline 10 suite files into the spec files that consume them
- Switch assert() to expect() for native vitest reporter integration
- Enrich failureMessage and caseFailureMessage with answer + reasoning
- Add fixtures/ambient.d.ts for @/* and ~/* path aliases in codegen
The suite-level accuracy gate was removed when QA tests were split into
per-question cases. SCORE_THRESHOLD remains for per-case pass/fail.
@denolfe denolfe changed the title chore(evals): per-case QA tests and inlined suites test(evals): per-case QA tests and inlined suites Apr 29, 2026
@denolfe denolfe changed the title test(evals): per-case QA tests and inlined suites test: evals per-case QA tests and inlined suites Apr 29, 2026
@github-actions
Copy link
Copy Markdown
Contributor

📦 esbuild Bundle Analysis for payload

This analysis was generated by esbuild-bundle-analyzer. 🤖

Meta File Out File Size (raw) Note
packages/next/meta_index.json esbuild/index.js 985.42 KB 🆕 Added
packages/payload/meta_index.json esbuild/index.js 1.39 MB 🆕 Added
packages/payload/meta_shared.json esbuild/exports/shared.js 191.30 KB 🆕 Added
packages/richtext-lexical/meta_client.json esbuild/exports/client_optimized/index.js 287.18 KB 🆕 Added
packages/ui/meta_client.json esbuild/exports/client_optimized/index.js 1.19 MB 🆕 Added
packages/ui/meta_shared.json esbuild/exports/shared_optimized/index.js 16.32 KB 🆕 Added
Largest paths These visualization shows top 20 largest paths in the bundle.

Meta file: packages/next/meta_index.json, Out file: esbuild/index.js

Path Size
../../node_modules ${{\color{Goldenrod}{ ████████████████████▌ }}}$ 82.3%, 807.63 KB
dist/views/Version ${{\color{Goldenrod}{ █▎ }}}$ 5.3%, 51.49 KB
dist/views/Dashboard ${{\color{Goldenrod}{ ▌ }}}$ 2.2%, 21.38 KB
dist/views/Document ${{\color{Goldenrod}{ ▍ }}}$ 1.7%, 16.59 KB
dist/views/List ${{\color{Goldenrod}{ ▎ }}}$ 1.2%, 11.38 KB
dist/views/Root ${{\color{Goldenrod}{ ▎ }}}$ 1.0%, 9.90 KB
dist/views/Versions ${{\color{Goldenrod}{ ▏ }}}$ 0.6%, 6.17 KB
dist/views/API ${{\color{Goldenrod}{ ▏ }}}$ 0.6%, 6.13 KB
dist/elements/Nav ${{\color{Goldenrod}{ ▏ }}}$ 0.6%, 5.96 KB
dist/views/Account ${{\color{Goldenrod}{ ▏ }}}$ 0.6%, 5.55 KB
dist/elements/DocumentHeader ${{\color{Goldenrod}{ ▏ }}}$ 0.5%, 4.81 KB
dist/views/Login ${{\color{Goldenrod}{ }}}$ 0.4%, 4.40 KB
dist/layouts/Root ${{\color{Goldenrod}{ }}}$ 0.3%, 3.41 KB
dist/views/ForgotPassword ${{\color{Goldenrod}{ }}}$ 0.3%, 3.13 KB
dist/views/CreateFirstUser ${{\color{Goldenrod}{ }}}$ 0.3%, 2.81 KB
dist/templates/Default ${{\color{Goldenrod}{ }}}$ 0.3%, 2.64 KB
dist/views/BrowseByFolder ${{\color{Goldenrod}{ }}}$ 0.3%, 2.61 KB
dist/views/CollectionFolders ${{\color{Goldenrod}{ }}}$ 0.2%, 2.44 KB
dist/views/ResetPassword ${{\color{Goldenrod}{ }}}$ 0.2%, 2.40 KB
dist/views/Logout ${{\color{Goldenrod}{ }}}$ 0.2%, 1.94 KB
(other) ${{\color{Goldenrod}{ ████▍ }}}$ 17.7%, 173.12 KB

Meta file: packages/payload/meta_index.json, Out file: esbuild/index.js

Path Size
../../node_modules ${{\color{Goldenrod}{ █████████████████▏ }}}$ 68.8%, 951.98 KB
dist/fields/hooks ${{\color{Goldenrod}{ ▊ }}}$ 3.2%, 44.07 KB
dist/collections/operations ${{\color{Goldenrod}{ ▋ }}}$ 2.9%, 39.96 KB
dist/versions/migrations ${{\color{Goldenrod}{ ▎ }}}$ 1.3%, 18.50 KB
dist/auth/operations ${{\color{Goldenrod}{ ▎ }}}$ 1.1%, 15.63 KB
dist/fields/config ${{\color{Goldenrod}{ ▎ }}}$ 1.0%, 14.16 KB
dist/globals/operations ${{\color{Goldenrod}{ ▎ }}}$ 1.0%, 13.32 KB
dist/utilities/configToJSONSchema.js ${{\color{Goldenrod}{ ▏ }}}$ 0.9%, 13.13 KB
dist/queues/operations ${{\color{Goldenrod}{ ▏ }}}$ 0.9%, 12.43 KB
dist/fields/validations.js ${{\color{Goldenrod}{ ▏ }}}$ 0.8%, 10.57 KB
dist/bin/generateImportMap ${{\color{Goldenrod}{ ▏ }}}$ 0.7%, 9.08 KB
dist/collections/config ${{\color{Goldenrod}{ ▏ }}}$ 0.6%, 8.91 KB
dist/config/orderable ${{\color{Goldenrod}{ ▏ }}}$ 0.6%, 8.00 KB
dist/uploads/fetchAPI-multipart ${{\color{Goldenrod}{ ▏ }}}$ 0.6%, 7.80 KB
dist/index.js ${{\color{Goldenrod}{ ▏ }}}$ 0.6%, 7.79 KB
dist/database/migrations ${{\color{Goldenrod}{ ▏ }}}$ 0.5%, 7.54 KB
dist/collections/endpoints ${{\color{Goldenrod}{ ▏ }}}$ 0.5%, 6.23 KB
dist/config/sanitize.js ${{\color{Goldenrod}{ }}}$ 0.4%, 5.86 KB
dist/auth/strategies ${{\color{Goldenrod}{ }}}$ 0.4%, 5.50 KB
dist/queues/config ${{\color{Goldenrod}{ }}}$ 0.4%, 5.31 KB
(other) ${{\color{Goldenrod}{ ███████▊ }}}$ 31.2%, 431.87 KB

Meta file: packages/payload/meta_shared.json, Out file: esbuild/exports/shared.js

Path Size
../../node_modules ${{\color{Goldenrod}{ ███████████████████▊ }}}$ 79.4%, 148.89 KB
dist/fields/validations.js ${{\color{Goldenrod}{ █▍ }}}$ 5.6%, 10.57 KB
dist/config/orderable ${{\color{Goldenrod}{ ▍ }}}$ 1.7%, 3.13 KB
dist/fields/baseFields ${{\color{Goldenrod}{ ▍ }}}$ 1.5%, 2.79 KB
dist/utilities/deepCopyObject.js ${{\color{Goldenrod}{ ▎ }}}$ 1.4%, 2.54 KB
dist/auth/cookies.js ${{\color{Goldenrod}{ ▏ }}}$ 0.8%, 1.55 KB
dist/utilities/flattenTopLevelFields.js ${{\color{Goldenrod}{ ▏ }}}$ 0.8%, 1.42 KB
dist/fields/config ${{\color{Goldenrod}{ ▏ }}}$ 0.7%, 1.28 KB
dist/utilities/getVersionsConfig.js ${{\color{Goldenrod}{ ▏ }}}$ 0.6%, 1.04 KB
dist/utilities/flattenAllFields.js ${{\color{Goldenrod}{ ▏ }}}$ 0.5%, 943 B
dist/folders/utils ${{\color{Goldenrod}{ ▏ }}}$ 0.5%, 916 B
dist/utilities/unflatten.js ${{\color{Goldenrod}{ }}}$ 0.4%, 779 B
dist/utilities/sanitizeUserDataForEmail.js ${{\color{Goldenrod}{ }}}$ 0.4%, 713 B
dist/utilities/getFieldPermissions.js ${{\color{Goldenrod}{ }}}$ 0.3%, 651 B
dist/collections/config ${{\color{Goldenrod}{ }}}$ 0.3%, 570 B
dist/bin/generateImportMap ${{\color{Goldenrod}{ }}}$ 0.3%, 561 B
dist/auth/sessions.js ${{\color{Goldenrod}{ }}}$ 0.3%, 525 B
dist/fields/getFieldPaths.js ${{\color{Goldenrod}{ }}}$ 0.3%, 485 B
dist/utilities/getSafeRedirect.js ${{\color{Goldenrod}{ }}}$ 0.2%, 423 B
dist/utilities/deepMerge.js ${{\color{Goldenrod}{ }}}$ 0.2%, 413 B
(other) ${{\color{Goldenrod}{ █████▏ }}}$ 20.6%, 38.74 KB

Meta file: packages/richtext-lexical/meta_client.json, Out file: esbuild/exports/client_optimized/index.js

Path Size
dist/features/blocks ${{\color{Goldenrod}{ ███▏ }}}$ 12.8%, 36.34 KB
dist/lexical/plugins ${{\color{Goldenrod}{ ██▉ }}}$ 11.5%, 32.65 KB
dist/lexical/ui ${{\color{Goldenrod}{ ██▏ }}}$ 8.6%, 24.36 KB
dist/features/experimental_table ${{\color{Goldenrod}{ ██ }}}$ 8.3%, 23.70 KB
dist/packages/@lexical ${{\color{Goldenrod}{ █▋ }}}$ 6.7%, 18.99 KB
dist/features/link ${{\color{Goldenrod}{ █▋ }}}$ 6.5%, 18.53 KB
dist/features/toolbars ${{\color{Goldenrod}{ █▍ }}}$ 5.7%, 16.08 KB
dist/features/upload ${{\color{Goldenrod}{ █▏ }}}$ 4.9%, 13.77 KB
dist/features/textState ${{\color{Goldenrod}{ ▉ }}}$ 3.9%, 11.08 KB
dist/features/relationship ${{\color{Goldenrod}{ ▊ }}}$ 3.2%, 9.03 KB
dist/lexical/utils ${{\color{Goldenrod}{ ▊ }}}$ 3.1%, 8.79 KB
dist/features/converters ${{\color{Goldenrod}{ ▋ }}}$ 2.9%, 8.36 KB
dist/features/debug ${{\color{Goldenrod}{ ▋ }}}$ 2.6%, 7.40 KB
dist/utilities/fieldsDrawer ${{\color{Goldenrod}{ ▋ }}}$ 2.5%, 7.15 KB
dist/lexical/config ${{\color{Goldenrod}{ ▍ }}}$ 1.8%, 5.08 KB
dist/features/lists ${{\color{Goldenrod}{ ▍ }}}$ 1.8%, 5.00 KB
dist/features/format ${{\color{Goldenrod}{ ▎ }}}$ 1.2%, 3.46 KB
dist/lexical/LexicalEditor.js ${{\color{Goldenrod}{ ▎ }}}$ 1.1%, 3.23 KB
dist/field/Field.js ${{\color{Goldenrod}{ ▎ }}}$ 1.0%, 2.81 KB
dist/lexical/nodes ${{\color{Goldenrod}{ ▏ }}}$ 0.9%, 2.66 KB
(other) ${{\color{Goldenrod}{ █████████████████████▊ }}}$ 87.2%, 247.61 KB

Meta file: packages/ui/meta_client.json, Out file: esbuild/exports/client_optimized/index.js

Path Size
../../node_modules ${{\color{Goldenrod}{ ████████████▎ }}}$ 49.3%, 579.12 KB
dist/elements/FolderView ${{\color{Goldenrod}{ ▋ }}}$ 2.5%, 29.38 KB
dist/elements/BulkUpload ${{\color{Goldenrod}{ ▌ }}}$ 2.4%, 28.24 KB
dist/elements/WhereBuilder ${{\color{Goldenrod}{ ▍ }}}$ 1.5%, 17.36 KB
dist/views/Edit ${{\color{Goldenrod}{ ▍ }}}$ 1.5%, 17.30 KB
dist/forms/Form ${{\color{Goldenrod}{ ▎ }}}$ 1.4%, 15.91 KB
dist/fields/Relationship ${{\color{Goldenrod}{ ▎ }}}$ 1.3%, 15.79 KB
dist/elements/Table ${{\color{Goldenrod}{ ▎ }}}$ 1.3%, 15.77 KB
dist/fields/Upload ${{\color{Goldenrod}{ ▎ }}}$ 1.2%, 14.22 KB
dist/fields/Blocks ${{\color{Goldenrod}{ ▎ }}}$ 1.2%, 13.90 KB
dist/elements/QueryPresets ${{\color{Goldenrod}{ ▏ }}}$ 0.9%, 10.36 KB
dist/elements/PublishButton ${{\color{Goldenrod}{ ▏ }}}$ 0.8%, 9.11 KB
dist/providers/Folders ${{\color{Goldenrod}{ ▏ }}}$ 0.7%, 8.46 KB
dist/elements/HTMLDiff ${{\color{Goldenrod}{ ▏ }}}$ 0.7%, 8.38 KB
dist/elements/ListHeader ${{\color{Goldenrod}{ ▏ }}}$ 0.7%, 8.07 KB
dist/fields/Array ${{\color{Goldenrod}{ ▏ }}}$ 0.7%, 7.73 KB
dist/views/CollectionFolder ${{\color{Goldenrod}{ ▏ }}}$ 0.6%, 7.50 KB
dist/views/List ${{\color{Goldenrod}{ ▏ }}}$ 0.6%, 7.36 KB
dist/elements/ReactSelect ${{\color{Goldenrod}{ ▏ }}}$ 0.6%, 7.33 KB
dist/elements/LivePreview ${{\color{Goldenrod}{ ▏ }}}$ 0.6%, 7.03 KB
(other) ${{\color{Goldenrod}{ ████████████▋ }}}$ 50.7%, 596.51 KB

Meta file: packages/ui/meta_shared.json, Out file: esbuild/exports/shared_optimized/index.js

Path Size
dist/graphics/Logo ${{\color{Goldenrod}{ █████ }}}$ 20.0%, 3.12 KB
../../node_modules ${{\color{Goldenrod}{ ████▎ }}}$ 17.0%, 2.65 KB
dist/graphics/Icon ${{\color{Goldenrod}{ ██▍ }}}$ 9.8%, 1.52 KB
dist/utilities/formatDocTitle ${{\color{Goldenrod}{ ██▏ }}}$ 8.5%, 1.32 KB
dist/providers/TableColumns ${{\color{Goldenrod}{ █▍ }}}$ 5.5%, 862 B
dist/utilities/groupNavItems.js ${{\color{Goldenrod}{ █▎ }}}$ 5.2%, 814 B
dist/utilities/getGlobalData.js ${{\color{Goldenrod}{ █▏ }}}$ 4.9%, 762 B
dist/utilities/api.js ${{\color{Goldenrod}{ █▏ }}}$ 4.8%, 756 B
dist/elements/Translation ${{\color{Goldenrod}{ ▊ }}}$ 3.2%, 493 B
dist/utilities/handleTakeOver.js ${{\color{Goldenrod}{ ▋ }}}$ 2.8%, 440 B
dist/utilities/traverseForLocalizedFields.js ${{\color{Goldenrod}{ ▋ }}}$ 2.6%, 399 B
dist/elements/withMergedProps ${{\color{Goldenrod}{ ▌ }}}$ 2.2%, 339 B
dist/utilities/getVisibleEntities.js ${{\color{Goldenrod}{ ▌ }}}$ 2.1%, 329 B
dist/utilities/getNavGroups.js ${{\color{Goldenrod}{ ▍ }}}$ 1.9%, 301 B
dist/elements/WithServerSideProps ${{\color{Goldenrod}{ ▍ }}}$ 1.5%, 232 B
dist/utilities/handleGoBack.js ${{\color{Goldenrod}{ ▎ }}}$ 1.2%, 180 B
dist/fields/mergeFieldStyles.js ${{\color{Goldenrod}{ ▎ }}}$ 1.0%, 159 B
dist/utilities/handleBackToDashboard.js ${{\color{Goldenrod}{ ▎ }}}$ 1.0%, 152 B
dist/forms/Form ${{\color{Goldenrod}{ ▏ }}}$ 0.9%, 147 B
dist/utilities/abortAndIgnore.js ${{\color{Goldenrod}{ ▏ }}}$ 0.9%, 146 B
(other) ${{\color{Goldenrod}{ ████████████████████ }}}$ 80.0%, 12.51 KB
Details

Next to the size is how much the size has increased or decreased compared with the base branch of this PR.

  • ‼️: Size increased by 20% or more. Special attention should be given to this.
  • ⚠️: Size increased in acceptable range (lower than 20%).
  • ✅: No change or even downsized.
  • 🗑️: The out file is deleted: not found in base branch.
  • 🆕: The out file is newly found: will be added to base branch.

@denolfe denolfe marked this pull request as ready for review April 29, 2026 19:13
@denolfe
Copy link
Copy Markdown
Member Author

denolfe commented Apr 29, 2026

Failures are flakes.

@denolfe denolfe merged commit db1b2a8 into main Apr 29, 2026
171 of 176 checks passed
@denolfe denolfe deleted the ai/evals-cleanup branch April 29, 2026 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant