test(evals): remove Q and A evals and low-power model variant by denolfe · Pull Request #16509 · payloadcms/payload

denolfe · 2026-05-06T16:14:17Z

Overview

Cuts the eval suite down to codegen-only signal. Removes the QA pipeline, the low-power model variant, and the four QA-only suites (conventions, graphql, local-api, rest-api). Repurposes EVAL_VARIANT=baseline to run codegen with no skill context, preserving the skill-uplift comparison.

Key Changes

Removed QA pipeline
- Deleted runDataset.ts, runner/runEval.ts, scorer/scoreAnswer.ts, writeFailedQAAssertion, all *QADataset modules, and the qaWithSkill / qaNoSkill / configReview system prompts.
Removed low-power model variant
- Dropped EVAL_VARIANT=low-power, the gpt-4o model entry, and every :low-power package script.
Repurposed baseline variant for codegen
- EVAL_VARIANT=baseline now selects codegenNoSkill. Skill vs. no-skill comparison still works against codegen.
Stripped QA halves from mixed specs
- eval.collections, eval.config, eval.fields, eval.building-plugins, eval.negative now register only codegen cases. The negative suite drops its configReview override too.
Narrowed shared types
- SystemPromptKey reduced to codegenWithSkill | codegenNoSkill. ModelKey reduced to the two models still in use. Dead types removed (EvalCase, RunnerResult, RunEvalOptions, ScoreAnswerOptions, RunDatasetOptions, the QA ScorerResult).
Patched dashboard for the new world
- EvalDashboard and globalSetup lost qa filtering, the Low Power column, the QA badge, and qaWithSkill / qaNoSkill / low-power variant logic.
Collapsed package scripts
- 45 eval scripts reduced to 15. Per surviving suite, only test:eval:<suite> and test:eval:<suite>:baseline. Top-level test:eval, test:eval:baseline, test:eval:report kept.
Fixed a latent default-prompt bug
- runCodegenDataset.ts and runCodegenEval.ts defaulted systemPromptKey to 'qaWithSkill'. Now default to 'codegenWithSkill'. Always overridden in practice, but a real bug surfaced once the literal was deleted.
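
A minimal sketch of the narrowed prompt key and the corrected default may make the last two items concrete. Only the two union members and the 'codegenWithSkill' default come from the description above; `CodegenRunOptions` and `resolveSystemPromptKey` are hypothetical names for illustration, not the actual signatures in runCodegenDataset.ts or runCodegenEval.ts.

```typescript
// Union members are taken from the PR description; everything else is assumed.
type SystemPromptKey = 'codegenWithSkill' | 'codegenNoSkill'

// Hypothetical options shape: callers may omit the key and rely on the default.
interface CodegenRunOptions {
  systemPromptKey?: SystemPromptKey
}

// Hypothetical helper showing the fixed default. With the narrowed union,
// a stale default such as 'qaWithSkill' would no longer type-check here.
function resolveSystemPromptKey(options: CodegenRunOptions = {}): SystemPromptKey {
  return options.systemPromptKey ?? 'codegenWithSkill'
}
```

Under these assumptions, a caller that passes no key gets the with-skill prompt, and a caller that passes 'codegenNoSkill' keeps its explicit choice.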

Design Decisions

Why drop QA rather than fix it. QA evals score free-form answers against expected concepts via an LLM judge. Output is noisy and weakly correlated with whether the skill changes real behavior. Codegen evals have a hard TypeScript compile gate plus a structural assertion layer before the LLM scorer, so failures map to actual regressions. One signal, higher fidelity.
Why keep baseline instead of removing it with low-power. Skill vs. no-skill is the only remaining lever for measuring whether the skill document helps. Repurposing baseline keeps that lever without new infrastructure; existing scripts and the dashboard's baseline column work as-is.
Why patch the dashboard in place rather than delete it. The compare table still has work to do for codegen with-skill vs. without-skill and run-over-run drift. Dropping QA filters and the Low Power column is a smaller change than rebuilding the view.
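
The gating order behind the first decision (compile gate, then structural assertions, then the LLM scorer) can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: `Gates`, `EvalOutcome`, and `evaluateCodegen` are hypothetical names, and the real compile, assertion, and judge steps are replaced by injected functions.

```typescript
// Hypothetical result shape: records which stage decided the outcome.
type EvalOutcome =
  | { stage: 'tsc'; pass: false; detail: string }
  | { stage: 'assertions'; pass: false; detail: string }
  | { stage: 'judge'; pass: boolean; detail: string }

// Hypothetical stand-ins for the three layers named in the description.
interface Gates {
  compile: (source: string) => string[] // compiler errors, empty when clean
  assertStructure: (source: string) => string[] // failed assertions, empty when clean
  judge: (source: string) => Promise<{ pass: boolean; reasoning: string }>
}

async function evaluateCodegen(source: string, gates: Gates): Promise<EvalOutcome> {
  // Gate 1: a compile failure is recorded and nothing else runs.
  const compileErrors = gates.compile(source)
  if (compileErrors.length > 0) {
    return { stage: 'tsc', pass: false, detail: compileErrors.join('\n') }
  }

  // Gate 2: structural assertions run only on code that compiled.
  const failedAssertions = gates.assertStructure(source)
  if (failedAssertions.length > 0) {
    return { stage: 'assertions', pass: false, detail: failedAssertions.join('\n') }
  }

  // Only output that cleared both deterministic gates reaches the LLM judge.
  const verdict = await gates.judge(source)
  return { stage: 'judge', pass: verdict.pass, detail: verdict.reasoning }
}
```

In this sketch the judge is never called for output that fails a deterministic gate, which is why a failure at either of the first two stages maps to a concrete regression rather than to judge noise.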

Overall Flow

flowchart LR
    A[EVAL_VARIANT env] --> B{variant?}
    B -->|skill, default| C[codegenWithSkill]
    B -->|baseline| D[codegenNoSkill]
    C --> E[runCodegenEval]
    D --> E
    E --> F[tsc compile gate]
    F -->|fail| G[record TSC failure]
    F -->|pass| H[structural assertions]
    H -->|fail| I[record assertion failure]
    H -->|pass| J[scoreConfigChange LLM judge]
    J --> K[EvalResult cached and surfaced in dashboard]

The QA branch (LLM answers a question, LLM scores the answer) is gone. Codegen is the only path; baseline means the same path with no skill context injected.
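
The variant branch at the top of the flowchart amounts to a two-way mapping, sketched below. The variant values and prompt keys come from the flowchart; the function name `promptKeyForVariant` is hypothetical, and treating any unrecognized value as the default is an assumption the description does not confirm.

```typescript
type SystemPromptKey = 'codegenWithSkill' | 'codegenNoSkill'

// Hypothetical helper: maps the EVAL_VARIANT value to a system prompt key.
// Assumption: anything other than 'baseline' (including unset) means "skill".
function promptKeyForVariant(variant: string | undefined): SystemPromptKey {
  return variant === 'baseline' ? 'codegenNoSkill' : 'codegenWithSkill'
}
```

A runner could call this as `promptKeyForVariant(process.env.EVAL_VARIANT)`, so both variants share the same codegen path and differ only in whether skill context is injected.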

Codegen is the only signal worth keeping. Drop the QA pipeline (runDataset, runEval, scoreAnswer, all *QADataset modules, qaWithSkill/qaNoSkill/configReview prompts) and the low-power model variant. Repurpose EVAL_VARIANT=baseline to mean codegen-with-no-skill so the skill-uplift comparison still works.

- delete 4 QA-only specs (conventions, graphql, local-api, rest-api)
- delete 5 per-suite qa.ts datasets
- strip QA halves from 5 mixed specs (collections, config, fields, negative, building-plugins)
- delete QA pipeline modules + writeFailedQAAssertion
- narrow SystemPromptKey, ModelKey, helpers to codegen
- collapse package.json scripts to {*, *:baseline} per suite
- patch dashboard + globalSetup to drop low-power/QA branches

github-actions · 2026-05-06T16:23:53Z

📦 esbuild Bundle Analysis for payload

This analysis was generated by esbuild-bundle-analyzer. 🤖
This PR introduced no changes to the esbuild bundle! 🙌

github-actions Bot added the created-by: Payload team label May 6, 2026

denolfe changed the title ~~test(evals): remove QA evals and low-power model variant~~ test(evals): remove Q and A evals and low-power model variant May 6, 2026

denolfe marked this pull request as ready for review May 6, 2026 16:23

denolfe merged commit 5c3111e into main May 6, 2026
168 of 169 checks passed

denolfe deleted the ai/evals-remove-q-and-a branch May 6, 2026 16:28

denolfe merged 1 commit into main from ai/evals-remove-q-and-a
