
test(evals): remove Q and A evals and low-power model variant #16509

Merged
denolfe merged 1 commit into main from ai/evals-remove-q-and-a
May 6, 2026


@denolfe denolfe commented May 6, 2026

Overview

Cuts the eval suite down to codegen-only signal. Removes the QA pipeline, the low-power model variant, and the four QA-only suites (conventions, graphql, local-api, rest-api). Repurposes EVAL_VARIANT=baseline to run codegen with no skill context, preserving the skill-uplift comparison.

Key Changes

  • Removed QA pipeline

    • Deleted runDataset.ts, runner/runEval.ts, scorer/scoreAnswer.ts, writeFailedQAAssertion, all *QADataset modules, and the qaWithSkill / qaNoSkill / configReview system prompts.
  • Removed low-power model variant

    • Dropped EVAL_VARIANT=low-power, the gpt-4o model entry, and every :low-power package script.
  • Repurposed baseline variant for codegen

    • EVAL_VARIANT=baseline now selects codegenNoSkill. Skill vs. no-skill comparison still works against codegen.
  • Stripped QA halves from mixed specs

    • eval.collections, eval.config, eval.fields, eval.building-plugins, eval.negative now register only codegen cases. The negative suite drops its configReview override too.
  • Narrowed shared types

    • SystemPromptKey reduced to codegenWithSkill | codegenNoSkill. ModelKey reduced to the two models still in use. Dead types removed (EvalCase, RunnerResult, RunEvalOptions, ScoreAnswerOptions, RunDatasetOptions, the QA ScorerResult).
  • Patched dashboard for the new world

    • EvalDashboard and globalSetup lost qa filtering, the Low Power column, the QA badge, and qaWithSkill / qaNoSkill / low-power variant logic.
  • Collapsed package scripts

    • 45 eval scripts reduced to 15. Per surviving suite, only test:eval:<suite> and test:eval:<suite>:baseline. Top-level test:eval, test:eval:baseline, test:eval:report kept.
  • Fixed a latent default-prompt bug

    • runCodegenDataset.ts and runCodegenEval.ts defaulted systemPromptKey to 'qaWithSkill'; they now default to 'codegenWithSkill'. The default was always overridden in practice, but it surfaced as a real bug once the 'qaWithSkill' literal was deleted.
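
The narrowed prompt type and the repurposed variant can be pictured roughly as follows. This is a minimal sketch, not the code from this PR: `resolveSystemPromptKey` is a hypothetical helper, and treating every non-baseline value as the with-skill default is an assumption. Only `SystemPromptKey`, its two members, and the `baseline` value come from the description above.

```typescript
// Taken from the PR description: the only two prompt keys left after the change.
type SystemPromptKey = 'codegenWithSkill' | 'codegenNoSkill'

// Hypothetical helper: maps the raw EVAL_VARIANT value to a prompt key.
// Assumption: anything other than 'baseline' (including an unset variable)
// falls back to 'codegenWithSkill', mirroring the new default.
function resolveSystemPromptKey(rawVariant: string | undefined): SystemPromptKey {
  return rawVariant === 'baseline' ? 'codegenNoSkill' : 'codegenWithSkill'
}
```

Passing the raw value in as an argument, rather than reading the environment inside the helper, keeps the mapping trivially testable.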

Design Decisions

  • Why drop QA rather than fix it. QA evals score free-form answers against expected concepts via an LLM judge. Output is noisy and weakly correlated with whether the skill changes real behavior. Codegen evals have a hard TypeScript compile gate plus a structural assertion layer before the LLM scorer, so failures map to actual regressions. One signal, higher fidelity.

  • Why keep baseline instead of removing it with low-power. Skill vs. no-skill is the only remaining lever for measuring whether the skill document helps. Repurposing baseline keeps that lever without new infrastructure; existing scripts and the dashboard's baseline column work as-is.

  • Why patch the dashboard in place rather than delete it. The compare table still has work to do for codegen with-skill vs. without-skill and run-over-run drift. Dropping QA filters and the Low Power column is a smaller change than rebuilding the view.
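
One way to read the skill vs. no-skill comparison is as a difference in pass rates between the two variants. The sketch below is illustrative only: `CaseOutcome`, `passRate`, and `skillUplift` are hypothetical names, and this PR does not state how the dashboard actually computes its comparison.

```typescript
// Hypothetical shape: one outcome per codegen case in a run.
interface CaseOutcome {
  caseId: string
  passed: boolean
}

// Fraction of cases that passed; an empty run counts as 0 to avoid dividing by zero.
function passRate(outcomes: CaseOutcome[]): number {
  if (outcomes.length === 0) return 0
  return outcomes.filter((o) => o.passed).length / outcomes.length
}

// Positive values suggest the skill document helped; negative values suggest it hurt.
function skillUplift(withSkill: CaseOutcome[], baseline: CaseOutcome[]): number {
  return passRate(withSkill) - passRate(baseline)
}
```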

Overall Flow

flowchart LR
    A[EVAL_VARIANT env] --> B{variant?}
    B -->|skill, default| C[codegenWithSkill]
    B -->|baseline| D[codegenNoSkill]
    C --> E[runCodegenEval]
    D --> E
    E --> F[tsc compile gate]
    F -->|fail| G[record TSC failure]
    F -->|pass| H[structural assertions]
    H -->|fail| I[record assertion failure]
    H -->|pass| J[scoreConfigChange LLM judge]
    J --> K[EvalResult cached and surfaced in dashboard]

The QA branch (LLM answers a question, LLM scores the answer) is gone. Codegen is the only path; baseline means the same path with no skill context injected.
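
The gate ordering in the flowchart can be sketched as a short-circuiting function. This is a simplified, synchronous stand-in under stated assumptions: `runGatedEval`, `Gates`, and the `GatedResult` shape are hypothetical, the three gate functions are injected rather than real, and the actual `runCodegenEval` and `scoreConfigChange` signatures are not shown in this PR.

```typescript
// Hypothetical result shape: records which stage decided the outcome.
type GatedResult =
  | { stage: 'tsc'; pass: false }
  | { stage: 'assertions'; pass: false }
  | { stage: 'judge'; pass: boolean }

// Hypothetical injected gates standing in for the compiler, the structural
// assertions, and the LLM judge.
interface Gates {
  compiles: (code: string) => boolean
  assertionsHold: (code: string) => boolean
  judgePasses: (code: string) => boolean
}

// Each gate runs only if the previous one passed, so the LLM judge is never
// consulted for output that fails to compile or fails a structural assertion.
function runGatedEval(code: string, gates: Gates): GatedResult {
  if (!gates.compiles(code)) return { stage: 'tsc', pass: false }
  if (!gates.assertionsHold(code)) return { stage: 'assertions', pass: false }
  return { stage: 'judge', pass: gates.judgePasses(code) }
}
```

Ordering the cheap, deterministic checks first means a failure at the first two stages is attributable to the generated code rather than to judge noise.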

Codegen is the only signal worth keeping. Drop the QA pipeline
(runDataset, runEval, scoreAnswer, all *QADataset modules,
qaWithSkill/qaNoSkill/configReview prompts) and the low-power model
variant. Repurpose EVAL_VARIANT=baseline to mean codegen-with-no-skill
so the skill-uplift comparison still works.

- delete 4 QA-only specs (conventions, graphql, local-api, rest-api)
- delete 5 per-suite qa.ts datasets
- strip QA halves from 5 mixed specs (collections, config, fields,
  negative, building-plugins)
- delete QA pipeline modules + writeFailedQAAssertion
- narrow SystemPromptKey, ModelKey, helpers to codegen
- collapse package.json scripts to {*, *:baseline} per suite
- patch dashboard + globalSetup to drop low-power/QA branches
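
The `{*, *:baseline}` script shape can be illustrated by generating the names from a list of suites. A minimal sketch: `evalScriptNames` is a hypothetical helper, not part of the repository, and the suite names passed in are whatever the caller supplies. Only the `test:eval:<suite>` pattern and the three top-level script names come from the description above.

```typescript
// Hypothetical helper: lists the script names implied by the new convention.
function evalScriptNames(suites: string[]): string[] {
  // Top-level scripts named in the PR description.
  const names = ['test:eval', 'test:eval:baseline', 'test:eval:report']
  for (const suite of suites) {
    // Each surviving suite keeps exactly two scripts: with-skill and baseline.
    names.push(`test:eval:${suite}`)
    names.push(`test:eval:${suite}:baseline`)
  }
  return names
}
```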
@denolfe denolfe changed the title from "test(evals): remove QA evals and low-power model variant" to "test(evals): remove Q and A evals and low-power model variant" May 6, 2026

github-actions Bot commented May 6, 2026

📦 esbuild Bundle Analysis for payload

This analysis was generated by esbuild-bundle-analyzer. 🤖
This PR introduced no changes to the esbuild bundle! 🙌

@denolfe denolfe marked this pull request as ready for review May 6, 2026 16:23
@denolfe denolfe merged commit 5c3111e into main May 6, 2026
168 of 169 checks passed
@denolfe denolfe deleted the ai/evals-remove-q-and-a branch May 6, 2026 16:28