test(evals): remove Q and A evals and low-power model variant #16509
Merged
Codegen is the only signal worth keeping. Drop the QA pipeline
(runDataset, runEval, scoreAnswer, all *QADataset modules,
qaWithSkill/qaNoSkill/configReview prompts) and the low-power model
variant. Repurpose EVAL_VARIANT=baseline to mean codegen-with-no-skill
so the skill-uplift comparison still works.
- delete 4 QA-only specs (conventions, graphql, local-api, rest-api)
- delete 5 per-suite qa.ts datasets
- strip QA halves from 5 mixed specs (collections, config, fields,
negative, building-plugins)
- delete QA pipeline modules + writeFailedQAAssertion
- narrow SystemPromptKey, ModelKey, helpers to codegen
- collapse package.json scripts to {*, *:baseline} per suite
- patch dashboard + globalSetup to drop low-power/QA branches
Overview
Cuts the eval suite down to codegen-only signal. Removes the QA pipeline, the low-power model variant, and the four QA-only suites (conventions, graphql, local-api, rest-api). Repurposes EVAL_VARIANT=baseline to run codegen with no skill context, preserving the skill-uplift comparison.

Key Changes
- Removed QA pipeline: runDataset.ts, runner/runEval.ts, scorer/scoreAnswer.ts, writeFailedQAAssertion, all *QADataset modules, and the qaWithSkill/qaNoSkill/configReview system prompts.
- Removed low-power model variant: EVAL_VARIANT=low-power, the gpt-4o model entry, and every :low-power package script.
- Repurposed baseline variant for codegen: EVAL_VARIANT=baseline now selects codegenNoSkill. Skill vs. no-skill comparison still works against codegen.
- Stripped QA halves from mixed specs: eval.collections, eval.config, eval.fields, eval.building-plugins, eval.negative now register only codegen cases. The negative suite drops its configReview override too.
- Narrowed shared types: SystemPromptKey reduced to codegenWithSkill | codegenNoSkill. ModelKey reduced to the two models still in use. Dead types removed (EvalCase, RunnerResult, RunEvalOptions, ScoreAnswerOptions, RunDatasetOptions, the QA ScorerResult).
- Patched dashboard for the new world: EvalDashboard and globalSetup lost qa filtering, the Low Power column, the QA badge, and qaWithSkill/qaNoSkill/low-power variant logic.
- Collapsed package scripts: test:eval:<suite> and test:eval:<suite>:baseline. Top-level test:eval, test:eval:baseline, test:eval:report kept.
- Fixed a latent default-prompt bug: runCodegenDataset.ts and runCodegenEval.ts defaulted systemPromptKey to 'qaWithSkill'. Now default to 'codegenWithSkill'. Always overridden in practice, but a real bug surfaced once the literal was deleted.

Design Decisions
Why drop QA rather than fix it. QA evals score free-form answers against expected concepts via an LLM judge. Output is noisy and weakly correlated with whether the skill changes real behavior. Codegen evals have a hard TypeScript compile gate plus a structural assertion layer before the LLM scorer, so failures map to actual regressions. One signal, higher fidelity.
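The gate ordering described above can be sketched roughly as follows. This is a minimal illustration, not the repository's runner: `Gates`, `GateResult`, `evaluateCodegen`, and the `passThreshold` default are hypothetical names and values introduced here, and the judge is modeled as a synchronous function for brevity even though a real LLM call would be asynchronous.

```typescript
// Hypothetical sketch: none of these names are taken from the PR.
type GateResult =
  | { stage: 'tsc'; passed: false; detail: string }
  | { stage: 'assertions'; passed: false; detail: string }
  | { stage: 'judge'; passed: boolean; score: number }

interface Gates {
  // Returns compiler error messages; an empty array means the code compiled.
  compile: (code: string) => string[]
  // Returns messages for failed structural assertions; empty means all passed.
  assertStructure: (code: string) => string[]
  // Stand-in for the LLM judge; returns a score between 0 and 1.
  judge: (code: string) => number
}

function evaluateCodegen(
  code: string,
  gates: Gates,
  passThreshold = 0.7, // assumed cutoff, chosen only for illustration
): GateResult {
  // Hard gate 1: a compile failure ends the run before any scoring happens.
  const compileErrors = gates.compile(code)
  if (compileErrors.length > 0) {
    return { stage: 'tsc', passed: false, detail: compileErrors.join('; ') }
  }

  // Hard gate 2: structural assertions must hold before the judge is consulted.
  const failedAssertions = gates.assertStructure(code)
  if (failedAssertions.length > 0) {
    return { stage: 'assertions', passed: false, detail: failedAssertions.join('; ') }
  }

  // Only code that passed both deterministic gates reaches the noisy scorer.
  const score = gates.judge(code)
  return { stage: 'judge', passed: score >= passThreshold, score }
}
```

Because the two deterministic gates short-circuit, a failure is attributed to a specific stage rather than to judge variance, which is the property the QA path lacked.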
Why keep baseline instead of removing it with low-power. Skill vs. no-skill is the only remaining lever for measuring whether the skill document helps. Repurposing baseline keeps that lever without new infrastructure; existing scripts and the dashboard's baseline column work as-is.

Why patch the dashboard in place rather than delete it. The compare table still has work to do for codegen with-skill vs. without-skill and run-over-run drift. Dropping QA filters and the Low Power column is a smaller change than rebuilding the view.
Overall Flow
flowchart LR
  A[EVAL_VARIANT env] --> B{variant?}
  B -->|skill, default| C[codegenWithSkill]
  B -->|baseline| D[codegenNoSkill]
  C --> E[runCodegenEval]
  D --> E
  E --> F[tsc compile gate]
  F -->|fail| G[record TSC failure]
  F -->|pass| H[structural assertions]
  H -->|fail| I[record assertion failure]
  H -->|pass| J[scoreConfigChange LLM judge]
  J --> K[EvalResult cached and surfaced in dashboard]

The QA branch (LLM answers a question, LLM scores the answer) is gone. Codegen is the only path; baseline means the same path with no skill context injected.
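The variant selection at the head of the flow, together with the corrected default prompt key, might look something like the sketch below. The `SystemPromptKey` union mirrors the two members named in this PR; `resolveSystemPromptKey`, `CodegenEvalOptions`, and `withDefaults` are hypothetical names, and treating every value other than `baseline` as the with-skill path is an assumption about how unknown variants are handled.

```typescript
// The narrowed union described in this PR: only the two codegen prompts remain.
type SystemPromptKey = 'codegenWithSkill' | 'codegenNoSkill'

// Hypothetical helper: maps the EVAL_VARIANT value to a prompt key.
// Assumption: anything other than 'baseline' (including unset) runs with the skill.
function resolveSystemPromptKey(variant: string | undefined): SystemPromptKey {
  return variant === 'baseline' ? 'codegenNoSkill' : 'codegenWithSkill'
}

// Hypothetical options shape, used only to illustrate the default.
interface CodegenEvalOptions {
  systemPromptKey?: SystemPromptKey
}

// With the union narrowed, a stale default such as 'qaWithSkill' would no longer
// type-check here, which is how the latent default surfaced as a compile error.
function withDefaults(options: CodegenEvalOptions = {}): Required<CodegenEvalOptions> {
  return { systemPromptKey: options.systemPromptKey ?? 'codegenWithSkill' }
}
```

A caller would typically pass the environment value straight through, for example `withDefaults({ systemPromptKey: resolveSystemPromptKey(process.env.EVAL_VARIANT) })`.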