Benchmarks and Evaluation

Sasha Lopashev edited this page Jun 27, 2026 · 1 revision

Migaki should not claim optimization from lower token counts alone.

An optimization counts as successful only when the workflow still satisfies its declared acceptance criteria.

Measurement Dimensions

  • Token count.
  • Estimated cost.
  • Actual cost where available.
  • Latency.
  • Validator pass rate.
  • Eval score.
  • Schema validity.
  • Source-grounding score.
  • Human acceptance rate.
  • Regression relative to the declared threshold.
  • Replayability.
  • Policy compliance.

Baselines

Every benchmark should compare against simple baselines:

  • naive execution,
  • no optimization,
  • simple static routing,
  • provider or gateway default behavior.

Routing benchmarks in particular should show that the routing strategy beats these simple alternatives on the chosen task and model set.
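
A minimal sketch of such a comparison follows. It is illustrative only: `RunResult`, `beats_all_baselines`, and the definition of "beats" used here (strictly cheaper than every baseline, with an eval score drop no larger than `max_score_drop`) are assumptions, not part of Migaki.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical summary of one benchmark run."""
    name: str
    cost_usd: float
    eval_score: float


def beats_all_baselines(candidate, baselines, max_score_drop=0.0):
    """Return True only if the candidate is cheaper than every baseline
    and its eval score drops by no more than max_score_drop against each."""
    if not baselines:
        # Without baselines there is nothing to beat, so refuse to answer.
        raise ValueError("at least one baseline is required")
    return all(
        candidate.cost_usd < b.cost_usd
        and candidate.eval_score >= b.eval_score - max_score_drop
        for b in baselines
    )
```

A routing strategy that is cheaper than naive execution but loses to simple static routing would fail this check, which is the point of listing several baselines.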

Acceptance Criteria

Each benchmark should declare:

  • which quality metric must not regress,
  • the allowed regression threshold,
  • required deterministic invariants,
  • allowed providers,
  • data-retention requirements,
  • replay and audit level.
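
One way to make these declarations checkable is to record them as data. The sketch below assumes a hypothetical `AcceptanceCriteria` record and `violations` helper; the field names and the string labels for retention and audit level are placeholders, and only the quality, invariant, and provider checks are implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptanceCriteria:
    """Hypothetical declaration attached to one benchmark."""
    quality_metric: str                         # metric that must not regress
    allowed_regression: float                   # largest tolerated drop in that metric
    required_invariants: frozenset = frozenset()
    allowed_providers: frozenset = frozenset()
    data_retention: str = "unspecified"         # declared only, not checked here
    audit_level: str = "unspecified"            # declared only, not checked here


def violations(criteria, baseline_metrics, optimized_metrics,
               passed_invariants, providers_used):
    """Return a list of human-readable problems; an empty list means accepted."""
    problems = []
    metric = criteria.quality_metric
    drop = baseline_metrics[metric] - optimized_metrics[metric]
    if drop > criteria.allowed_regression:
        problems.append(f"{metric} regressed by {drop:.4f}")
    missing = set(criteria.required_invariants) - set(passed_invariants)
    if missing:
        problems.append(f"invariants not satisfied: {sorted(missing)}")
    disallowed = set(providers_used) - set(criteria.allowed_providers)
    if disallowed:
        problems.append(f"providers not allowed: {sorted(disallowed)}")
    return problems
```

Returning a list of problems rather than a single boolean keeps the reason for a rejection visible in the report.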

Reporting Format

A benchmark report should include:

  • baseline plan,
  • optimized plan,
  • plan diff,
  • enabled passes,
  • disabled passes,
  • token and cost deltas,
  • latency deltas,
  • validator or eval results,
  • warning list,
  • evidence bundle link or artifact path.
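
As a rough illustration, the fields above could be assembled into a JSON-serializable dictionary. The `build_report` function, its key names, and the shape of the `baseline` and `optimized` inputs (a `plan` list of step names plus `tokens`, `cost_usd`, and `latency_ms`) are assumptions made for this sketch, not a Migaki format.

```python
import difflib


def build_report(baseline, optimized, enabled_passes, disabled_passes,
                 validator_results, warnings, evidence_path):
    """Assemble a benchmark report from two run summaries (hypothetical shape)."""
    # Plans are assumed to be lists of step names; diff them line by line.
    plan_diff = list(difflib.unified_diff(
        baseline["plan"], optimized["plan"],
        fromfile="baseline", tofile="optimized", lineterm="",
    ))

    def delta(key):
        # Negative values mean the optimized run used less.
        return optimized[key] - baseline[key]

    return {
        "baseline_plan": baseline["plan"],
        "optimized_plan": optimized["plan"],
        "plan_diff": plan_diff,
        "enabled_passes": list(enabled_passes),
        "disabled_passes": list(disabled_passes),
        "token_delta": delta("tokens"),
        "cost_delta_usd": delta("cost_usd"),
        "latency_delta_ms": delta("latency_ms"),
        "validator_results": validator_results,
        "warnings": list(warnings),
        "evidence_path": evidence_path,
    }
```

The result can be passed to `json.dumps` and stored next to the evidence bundle so that the plan diff and the deltas travel together.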

Rule of Thumb

The defensible target is not:

Same answer, lower cost.

The defensible target is:

Equivalent task outcome under declared validators and allowed regression thresholds, with lower cost, lower latency, or improved reliability.
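
The rule can be written as a small predicate. This is a sketch under stated assumptions: `is_defensible_win` and its parameters are hypothetical, deltas are computed as optimized minus baseline, and a positive `reliability_delta` means reliability improved.

```python
def is_defensible_win(validators_passed, quality_drop, allowed_regression,
                      cost_delta, latency_delta, reliability_delta):
    """Apply the rule of thumb: equivalent outcome first, then a real gain."""
    # The outcome is equivalent only if validators pass and any quality
    # drop stays within the declared regression threshold.
    outcome_equivalent = validators_passed and quality_drop <= allowed_regression
    # At least one of cost, latency, or reliability must actually improve.
    improved = cost_delta < 0 or latency_delta < 0 or reliability_delta > 0
    return outcome_equivalent and improved
```

Under this predicate, a large cost saving with failing validators is never reported as an optimization.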