Build generic benchmark runner for nForma skills

## Problem

`nf:solve` has a CI benchmark gate that catches regressions before they ship. The same risk exists for `quick`, `quick --full`, `new-milestone`, `fix-tests`, and `debug` — but no equivalent coverage exists today.

## User Story

As a maintainer, I want every PR to run smoke benchmarks for core skills so that regressions in `quick`, `fix-tests`, or `debug` are caught before merge.

## Proposed Solution

Build a generic `bin/nf-benchmark.cjs --skill=<name> --track=smoke|full --json` runner that loads per-skill fixtures from `benchmarks/<skill>/` and integrates into the existing `benchmark-gate.yml` alongside solve.

MVP scope: extract shared utils → generic runner → `quick` smoke fixtures → CI gate update.
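As a rough illustration of the proposed CLI surface, the sketch below parses the three flags and resolves the per-skill directory. The function names `parseBenchmarkArgs` and `fixturePaths`, the default track of `smoke`, and the skill-name pattern are assumptions made for this example, not existing nForma code.

```javascript
'use strict';
const path = require('path');

const TRACKS = new Set(['smoke', 'full']);

// Hypothetical helper: parse `--skill=<name> --track=smoke|full --json`.
// Defaulting the track to `smoke` is an assumption of this sketch.
function parseBenchmarkArgs(argv) {
  const opts = { skill: null, track: 'smoke', json: false };
  for (const arg of argv) {
    if (arg === '--json') {
      opts.json = true;
    } else if (arg.startsWith('--skill=')) {
      opts.skill = arg.slice('--skill='.length);
    } else if (arg.startsWith('--track=')) {
      opts.track = arg.slice('--track='.length);
    } else {
      throw new Error(`Unknown argument: ${arg}`);
    }
  }
  // Restrict skill names so the value cannot escape `benchmarks/`.
  if (!opts.skill || !/^[a-z0-9-]+$/.test(opts.skill)) {
    throw new Error('--skill=<name> is required (lowercase letters, digits, hyphens)');
  }
  if (!TRACKS.has(opts.track)) {
    throw new Error(`--track must be one of: ${[...TRACKS].join(', ')}`);
  }
  return opts;
}

// Hypothetical helper: locate the files under `benchmarks/<skill>/`.
function fixturePaths(skill, rootDir = process.cwd()) {
  const dir = path.join(rootDir, 'benchmarks', skill);
  return {
    fixtures: path.join(dir, 'fixtures.json'),
    baseline: path.join(dir, 'baseline.json'),
  };
}
```

With this shape, `parseBenchmarkArgs(process.argv.slice(2))` would be the entry point of `bin/nf-benchmark.cjs`, and adding a new skill would only require a new directory under `benchmarks/`.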

## Acceptance Criteria

- [ ] `bin/nf-benchmark.cjs --skill=quick --track=smoke --json` exits 0 and outputs valid JSON with `pass_rate`
- [ ] Smoke fixtures for `nf:quick` exist in `benchmarks/quick/fixtures.json`, including at least one `exits_zero` fixture
- [ ] `benchmarks/quick/baseline.json` exists and the CI gate enforces it
- [ ] `benchmark-gate.yml` runs both `solve` and `quick` and fails if either score drops below its baseline
- [ ] Shared utilities (`evaluatePassCondition`, `extractResidual`, snapshot/restore) are extracted into `bin/benchmark-utils.cjs` and imported by both runners
- [ ] No fixture in the CI (smoke) track requires an LLM API key to execute
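The baseline criteria above could be enforced with a check along these lines. This is a minimal sketch: it assumes `pass_rate` is a fraction between 0 and 1, that `baseline.json` stores a single `pass_rate` floor, and that each fixture result carries a boolean `passed` field. The solve runner may use a different shape, and `computePassRate` and `checkBaseline` are hypothetical names.

```javascript
'use strict';

// Fraction of fixtures that passed. An empty run counts as 0 so that a
// missing or empty fixture file cannot pass the gate silently.
function computePassRate(results) {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.passed).length;
  return passed / results.length;
}

// Compare a run against the stored floor and derive a process exit code.
// `baseline` is assumed to look like `{ "pass_rate": 0.9 }`.
function checkBaseline(report, baseline) {
  const ok = report.pass_rate >= baseline.pass_rate;
  return {
    ok,
    actual: report.pass_rate,
    floor: baseline.pass_rate,
    exitCode: ok ? 0 : 1,
  };
}
```

In the CI gate, each skill would run this check independently, and the job would fail if any skill returns a non-zero `exitCode`.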

## Decomposition

1. Extract `evaluatePassCondition`, `extractResidual`, `snapshotFormalJson`, `restoreFormalJson` from `nf-benchmark-solve.cjs` → `bin/benchmark-utils.cjs`
2. Build `bin/nf-benchmark.cjs` accepting `--skill=`, `--track=smoke|full`, `--json`, loading from `benchmarks/<skill>/`
3. Add `benchmarks/quick/fixtures.json` with smoke fixtures (`exits_zero` on `--dry-run` or `--plan-only` invocations)
4. Add `benchmarks/quick/baseline.json` with an initial `pass_rate` floor
5. Update `benchmark-gate.yml` to run both solve and quick and check both baselines
6. Update `nf-benchmark-solve.cjs` to delegate to `benchmark-utils.cjs` (remove duplication)
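To make steps 1 to 3 concrete, here is one possible shape for the shared evaluator and a fixture runner built on it. The real `evaluatePassCondition` in `nf-benchmark-solve.cjs` may have a different signature. The fixture fields (`id`, `command`, `args`, `pass_condition`, `timeout_ms`) and the `runFixture` helper are hypothetical, and only the `exits_zero` condition named in this issue is handled.

```javascript
'use strict';
const { spawnSync } = require('child_process');

// Assumed signature: a condition name plus the outcome of one invocation.
function evaluatePassCondition(condition, outcome) {
  switch (condition) {
    case 'exits_zero':
      // `status` is null when the process timed out or failed to spawn,
      // so those cases are treated as failures.
      return outcome.status === 0;
    default:
      throw new Error(`Unsupported pass condition: ${condition}`);
  }
}

// Run one fixture as a child process and evaluate its pass condition.
function runFixture(fixture) {
  const proc = spawnSync(fixture.command, fixture.args || [], {
    encoding: 'utf8',
    timeout: fixture.timeout_ms || 60000, // default timeout is an assumption
  });
  const outcome = {
    status: proc.status,
    stdout: proc.stdout,
    stderr: proc.stderr,
  };
  return {
    id: fixture.id,
    passed: evaluatePassCondition(fixture.pass_condition, outcome),
  };
}
```

Keeping the evaluator free of any process handling means both `nf-benchmark-solve.cjs` and the generic runner could import it from `bin/benchmark-utils.cjs` without sharing execution logic.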

## Not Doing (this issue)

- LLM-calling fixtures in CI
- Fixtures for `new-milestone`, `fix-tests`, `debug` — add once the generic runner is proven
- Composite cross-skill scoring

## Difficulty

**M**: The pattern is proven (the solve runner exists), and extraction and generalization are mechanical. Main risk: `quick` may not have a cheap, deterministic smoke path, in which case the fixtures will need dedicated design work.

## Open Questions

- Does `nf:quick` have a `--dry-run` or `--plan-only` flag that makes smoke fixtures cheap without LLM calls?
- Should baselines auto-update after merge or remain manual (currently manual for solve)?
