Problem
nf:solve has a CI benchmark gate that catches regressions before they ship. The same risk exists for quick, quick --full, new-milestone, fix-tests, and debug — but no equivalent coverage exists today.
User Story
As a maintainer, I want every PR to run smoke benchmarks for core skills so that regressions in quick, fix-tests, or debug are caught before merge.
Proposed Solution
Build a generic bin/nf-benchmark.cjs --skill=<name> --track=smoke|full --json runner that loads per-skill fixtures from benchmarks/<skill>/ and integrates into the existing benchmark-gate.yml alongside solve.
MVP scope: extract shared utils → generic runner → quick smoke fixtures → CI gate update.
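As a rough illustration of the CLI surface described above, here is a minimal argument-parsing sketch. The flag names (--skill=, --track=smoke|full, --json) and the benchmarks/<skill>/ layout come from this proposal; the function name parseArgs, the default track of smoke, the error messages, and the fixturesPath field are assumptions made for the example, not the behavior of any existing runner.

```javascript
const path = require('node:path');

// Hypothetical parser for the proposed runner's flags.
// Defaults and error wording are assumptions, not existing behavior.
function parseArgs(argv) {
  const opts = { skill: null, track: 'smoke', json: false };
  for (const arg of argv) {
    if (arg === '--json') {
      opts.json = true;
    } else if (arg.startsWith('--skill=')) {
      opts.skill = arg.slice('--skill='.length);
    } else if (arg.startsWith('--track=')) {
      opts.track = arg.slice('--track='.length);
    } else {
      throw new Error(`Unknown argument: ${arg}`);
    }
  }
  if (!opts.skill) {
    throw new Error('Missing required --skill=<name>');
  }
  if (!['smoke', 'full'].includes(opts.track)) {
    throw new Error(`--track must be smoke or full, got: ${opts.track}`);
  }
  // Per-skill fixtures are expected under benchmarks/<skill>/.
  opts.fixturesPath = path.join('benchmarks', opts.skill, 'fixtures.json');
  return opts;
}

// Equivalent to: node bin/nf-benchmark.cjs --skill=quick --track=smoke --json
const opts = parseArgs(['--skill=quick', '--track=smoke', '--json']);
console.log(opts.skill, opts.track, opts.json);
```

Rejecting unknown flags and invalid tracks up front keeps a typo in the workflow file from silently running the wrong track.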
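Acceptance Criteria
- bin/nf-benchmark.cjs --skill=quick --track=smoke --json exits 0 and outputs valid JSON with pass_rate
- Fixtures for nf:quick exist in benchmarks/quick/fixtures.json with at least one exits_zero fixture
- benchmarks/quick/baseline.json exists and the CI gate enforces it
- benchmark-gate.yml runs both solve and quick and fails if either score drops below its baseline
- Shared utils (evaluatePassCondition, extractResidual, snapshot/restore) are extracted into bin/benchmark-utils.cjs and imported by both runners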
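A gate that fails when any skill's pass_rate falls below its baseline floor could be sketched roughly as follows. The function name checkBaselines and both input shapes (plain objects mapping skill name to a number) are assumptions for illustration; the real baseline.json schema and the existing solve gate logic may differ. The numbers are made up.

```javascript
// Hypothetical gate check. results maps skill -> reported pass_rate,
// baselines maps skill -> pass_rate floor. Both shapes are assumptions.
function checkBaselines(results, baselines) {
  const failures = [];
  for (const [skill, floor] of Object.entries(baselines)) {
    const actual = results[skill];
    if (typeof actual !== 'number') {
      // A skill with a baseline but no result counts as a failure.
      failures.push(`${skill}: no pass_rate reported`);
    } else if (actual < floor) {
      failures.push(`${skill}: pass_rate ${actual} is below baseline ${floor}`);
    }
  }
  return failures;
}

// Illustrative numbers only: quick has dropped below its floor.
const gateFailures = checkBaselines(
  { solve: 0.9, quick: 0.5 },
  { solve: 0.8, quick: 1.0 }
);
for (const line of gateFailures) {
  console.log(line);
}
// A CI step would exit non-zero whenever gateFailures is non-empty.
```

Iterating over the baselines rather than the results means a skill that silently stops reporting is treated as a regression instead of being skipped.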
Decomposition
- Extract evaluatePassCondition, extractResidual, snapshotFormalJson, restoreFormalJson from nf-benchmark-solve.cjs → bin/benchmark-utils.cjs
- Build bin/nf-benchmark.cjs accepting --skill=, --track=smoke|full, --json, loading from benchmarks/<skill>/
- Add benchmarks/quick/fixtures.json with smoke fixtures (exits_zero on --dry-run or --plan-only invocations)
- Add benchmarks/quick/baseline.json with initial pass_rate floor
- Update benchmark-gate.yml to run both solve and quick and check both baselines
- Update nf-benchmark-solve.cjs to delegate to benchmark-utils.cjs (remove duplication)
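To make the fixture step more concrete, here is a minimal sketch of how exits_zero fixtures might be run and summarized into a pass_rate. Only the exits_zero condition name and the pass_rate field come from this issue; the fixture shape ({ id, command, args, pass_condition }), the function names runFixture and runAll, and reporting pass_rate as a fraction between 0 and 1 are assumptions. It does not reproduce the existing evaluatePassCondition helper. The stand-in fixtures invoke the Node binary itself so the sketch runs without nf:quick.

```javascript
const { spawnSync } = require('node:child_process');

// Hypothetical fixture shape: { id, command, args, pass_condition }.
function runFixture(fixture) {
  const result = spawnSync(fixture.command, fixture.args, { encoding: 'utf8' });
  if (fixture.pass_condition === 'exits_zero') {
    // status is null if the process could not start or was killed by a signal.
    return result.status === 0;
  }
  throw new Error(`Unsupported pass_condition: ${fixture.pass_condition}`);
}

// Runs every fixture and reports pass_rate as a fraction (an assumption).
function runAll(fixtures) {
  const passed = fixtures.filter((fixture) => runFixture(fixture)).length;
  const total = fixtures.length;
  return { total, passed, pass_rate: total === 0 ? 0 : passed / total };
}

// Stand-in fixtures: one process that exits 0 and one that exits 1.
const demoFixtures = [
  { id: 'exits-ok', command: process.execPath, args: ['-e', 'process.exit(0)'], pass_condition: 'exits_zero' },
  { id: 'exits-bad', command: process.execPath, args: ['-e', 'process.exit(1)'], pass_condition: 'exits_zero' },
];

const summary = runAll(demoFixtures);
console.log(JSON.stringify(summary));
```

A real benchmarks/quick/fixtures.json would replace the stand-in commands with nf:quick invocations once the open question about a cheap deterministic flag is settled.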
Not Doing (this issue)
- LLM-calling fixtures in CI
- Fixtures for new-milestone, fix-tests, debug (add once the generic runner is proven)
- Composite cross-skill scoring
Difficulty
M — Pattern is proven (solve runner exists), extraction + generalization is mechanical. Main risk: quick may not have a cheap deterministic smoke path, which could require fixture design work.
Open Questions
- Does nf:quick have a --dry-run or --plan-only flag that makes smoke fixtures cheap without LLM calls?
- Should baselines auto-update after merge or remain manual (currently manual for solve)?