Build generic benchmark runner for nForma skills #107

@jobordu

Problem

nf:solve has a CI benchmark gate that catches regressions before they ship. The quick, quick --full, new-milestone, fix-tests, and debug skills carry the same regression risk but have no equivalent coverage today.

User Story

As a maintainer, I want every PR to run smoke benchmarks for core skills so that regressions in quick, fix-tests, or debug are caught before merge.

Proposed Solution

Build a generic bin/nf-benchmark.cjs --skill=<name> --track=smoke|full --json runner that loads per-skill fixtures from benchmarks/<skill>/ and integrates into the existing benchmark-gate.yml alongside solve.

MVP scope: extract shared utils → generic runner → quick smoke fixtures → CI gate update.

Acceptance Criteria

  • bin/nf-benchmark.cjs --skill=quick --track=smoke --json exits 0 and outputs valid JSON with pass_rate
  • Smoke fixtures for nf:quick exist under benchmarks/quick/fixtures.json with at least one exits_zero fixture
  • benchmarks/quick/baseline.json exists and the CI gate enforces it
  • benchmark-gate.yml runs both solve and quick and fails if either score drops below its baseline
  • Shared utilities (evaluatePassCondition, extractResidual, snapshotFormalJson, restoreFormalJson) are extracted into bin/benchmark-utils.cjs and imported by both runners
  • No fixture in the CI (smoke) track requires an LLM API key to execute
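
To make the criteria above concrete, here is a minimal sketch of how an exits_zero fixture could be evaluated and how pass_rate could be checked against a baseline. The fixture fields (id, track, command, args, pass_condition), the 0 to 1 scale for pass_rate, and the function names are assumptions for illustration; the actual schema used by the solve runner may differ. The demo fixtures invoke Node itself as a stand-in command, so no nForma skill or LLM API key is involved.

```javascript
'use strict';
const { spawnSync } = require('node:child_process');

// ASSUMED fixture shape: { id, track, command, args, pass_condition }.
// Only the exits_zero condition named in the issue is handled here.
function runFixture(fixture) {
  const result = spawnSync(fixture.command, fixture.args, { encoding: 'utf8' });
  if (fixture.pass_condition === 'exits_zero') {
    return result.status === 0;
  }
  throw new Error(`Unsupported pass_condition: ${fixture.pass_condition}`);
}

// Runs every fixture on the given track and reports pass_rate as a
// fraction between 0 and 1 (assumed scale).
function summarize(fixtures, track) {
  const selected = fixtures.filter((f) => f.track === track);
  const passed = selected.filter((f) => runFixture(f)).length;
  const pass_rate = selected.length === 0 ? 0 : passed / selected.length;
  return { track, total: selected.length, passed, pass_rate };
}

// ASSUMED baseline shape: { pass_rate: <floor> }.
function meetsBaseline(summary, baseline) {
  return summary.pass_rate >= baseline.pass_rate;
}

// Stand-in fixtures: one process that exits 0 and one that exits 1.
const demo = [
  { id: 'exits-ok', track: 'smoke', command: process.execPath,
    args: ['-e', 'process.exit(0)'], pass_condition: 'exits_zero' },
  { id: 'exits-bad', track: 'smoke', command: process.execPath,
    args: ['-e', 'process.exit(1)'], pass_condition: 'exits_zero' },
];

const summary = summarize(demo, 'smoke');
console.log(JSON.stringify(summary));
// A CI gate would turn a failed baseline check into a non-zero exit code.
process.exitCode = meetsBaseline(summary, { pass_rate: 0.5 }) ? 0 : 1;
```

Treating an empty track as a pass_rate of 0 is a deliberate choice in this sketch: a missing or mis-tagged fixture file then fails the gate instead of passing silently.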

Decomposition

  1. Extract evaluatePassCondition, extractResidual, snapshotFormalJson, restoreFormalJson from nf-benchmark-solve.cjs into bin/benchmark-utils.cjs
  2. Build bin/nf-benchmark.cjs accepting --skill=, --track=smoke|full, --json, loading from benchmarks/<skill>/
  3. Add benchmarks/quick/fixtures.json with smoke fixtures (exits_zero on --dry-run or --plan-only invocations)
  4. Add benchmarks/quick/baseline.json with initial pass_rate floor
  5. Update benchmark-gate.yml to run both solve and quick and check both baselines
  6. Update nf-benchmark-solve.cjs to delegate to benchmark-utils.cjs (remove duplication)
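
Steps 1 and 6 amount to moving helpers into a module that both runners require. The issue does not describe what the helpers do internally, so the sketch below only guesses from the names snapshotFormalJson and restoreFormalJson: it assumes they save and restore the raw contents of a JSON file around a benchmark run. The signatures and the withSnapshot wrapper are hypothetical.

```javascript
'use strict';
const fs = require('node:fs');

// ASSUMED behavior: capture the file's raw contents, or null if it is absent.
function snapshotFormalJson(filePath) {
  return fs.existsSync(filePath) ? fs.readFileSync(filePath, 'utf8') : null;
}

// ASSUMED behavior: put the file back exactly as it was when snapshotted,
// deleting it if it did not exist at snapshot time.
function restoreFormalJson(filePath, snapshot) {
  if (snapshot === null) {
    fs.rmSync(filePath, { force: true });
  } else {
    fs.writeFileSync(filePath, snapshot, 'utf8');
  }
}

// Hypothetical convenience wrapper: restores state even if the fixture throws.
function withSnapshot(filePath, fn) {
  const snapshot = snapshotFormalJson(filePath);
  try {
    return fn();
  } finally {
    restoreFormalJson(filePath, snapshot);
  }
}

module.exports = { snapshotFormalJson, restoreFormalJson, withSnapshot };
```

Under this layout, both nf-benchmark-solve.cjs and the generic runner would obtain the helpers with a single require('./benchmark-utils.cjs') call, which is what removes the duplication in step 6.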

Not Doing (this issue)

  • LLM-calling fixtures in CI
  • Fixtures for new-milestone, fix-tests, debug — add once the generic runner is proven
  • Composite cross-skill scoring

Difficulty

M. The pattern is proven (the solve runner exists), so extraction and generalization are mechanical. Main risk: quick may not have a cheap deterministic smoke path, which could require fixture design work.

Open Questions

  • Does nf:quick have a --dry-run or --plan-only flag that makes smoke fixtures cheap without LLM calls?
  • Should baselines auto-update after merge or remain manual (currently manual for solve)?
