Benchmarks and Evaluation

Sasha Lopashev edited this page Jun 27, 2026 · 1 revision

Migaki should not claim optimization from lower token counts alone.

An optimization has worked only if the workflow still satisfies its declared acceptance criteria.

Measurement Dimensions

Token count.
Estimated cost.
Actual cost where available.
Latency.
Validator pass rate.
Eval score.
Schema validity.
Source-grounding score.
Human acceptance rate.
Regression threshold.
Replayability.
Policy compliance.

Baselines

Every benchmark should compare against simple baselines:

naive execution,
no optimization,
simple static routing,
provider or gateway default behavior.

Routing benchmarks in particular should show that they beat these simple alternatives on the chosen task set and model set.
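As an illustration of the baseline comparison, here is a minimal sketch that lists the baselines a candidate fails to beat. The `RunResult` type, the `unbeaten_baselines` function, and all numbers are hypothetical and not part of Migaki. The sketch assumes a single quality score where higher is better and treats "beats" as "cheaper without dropping quality beyond a tolerance".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical summary of one benchmark run over the chosen task set."""
    name: str
    quality: float   # eval score, higher is better
    cost_usd: float  # total cost over the task set


def unbeaten_baselines(candidate, baselines, max_quality_drop=0.0):
    """Return the names of baselines the candidate does NOT beat.

    The candidate beats a baseline when it is strictly cheaper and its
    quality is no more than max_quality_drop below the baseline's.
    """
    failures = []
    for base in baselines:
        quality_ok = candidate.quality >= base.quality - max_quality_drop
        cheaper = candidate.cost_usd < base.cost_usd
        if not (quality_ok and cheaper):
            failures.append(base.name)
    return failures


# Illustrative numbers only; they are not measurements.
baselines = [
    RunResult("naive execution", quality=0.90, cost_usd=10.0),
    RunResult("simple static routing", quality=0.89, cost_usd=7.0),
    RunResult("gateway default", quality=0.90, cost_usd=9.0),
]
candidate = RunResult("candidate router", quality=0.90, cost_usd=7.5)

failures = unbeaten_baselines(candidate, baselines, max_quality_drop=0.01)
print(failures)
```

In this made-up case the candidate is cheaper than naive execution and the gateway default but not cheaper than simple static routing, so the routing claim would not hold yet.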

Acceptance Criteria

Each benchmark should declare:

which quality metric must not regress,
the allowed regression threshold,
required deterministic invariants,
allowed providers,
data-retention requirements,
replay and audit level.
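One way to make such a declaration checkable is to hold it in a small record and test a run against it. The sketch below is an assumption about how that could look, not Migaki's actual schema: `AcceptanceCriteria`, `find_violations`, the field names, and the provider names are all hypothetical. Retention and audit level are only declared here; enforcing them is outside this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptanceCriteria:
    """Hypothetical declaration attached to one benchmark."""
    quality_metric: str           # metric that must not regress
    max_regression: float         # allowed drop in that metric
    required_invariants: tuple    # deterministic checks that must pass
    allowed_providers: frozenset  # providers the run may use
    retention_days: int           # data-retention requirement
    audit_level: str              # replay and audit level


def find_violations(criteria, baseline, optimized, providers_used, invariant_results):
    """Return readable violations; an empty list means the run is acceptable."""
    problems = []
    metric = criteria.quality_metric
    drop = baseline[metric] - optimized[metric]
    if drop > criteria.max_regression:
        problems.append(f"{metric} regressed by {drop:.3f}")
    for name in criteria.required_invariants:
        # A missing invariant result counts as a failure.
        if not invariant_results.get(name, False):
            problems.append(f"invariant failed: {name}")
    for provider in sorted(set(providers_used) - criteria.allowed_providers):
        problems.append(f"provider not allowed: {provider}")
    return problems


criteria = AcceptanceCriteria(
    quality_metric="eval_score",
    max_regression=0.02,
    required_invariants=("schema_valid",),
    allowed_providers=frozenset({"provider_a", "provider_b"}),
    retention_days=30,
    audit_level="full",
)

# Illustrative run: quality drops too far and an undeclared provider is used.
problems = find_violations(
    criteria,
    baseline={"eval_score": 0.92},
    optimized={"eval_score": 0.85},
    providers_used=["provider_a", "provider_c"],
    invariant_results={"schema_valid": True},
)
print(problems)
```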

Reporting Format

A benchmark report should include:

baseline plan,
optimized plan,
plan diff,
enabled passes,
disabled passes,
token and cost deltas,
latency deltas,
validator or eval results,
warning list,
evidence bundle link or artifact path.
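The report fields above could be assembled into a single serializable record. The following is a rough sketch under assumed names: the key names, the pass names, the plan steps, the artifact path, and every number are placeholders invented for illustration. The plan diff uses the standard-library `difflib.unified_diff`.

```python
import difflib
import json


def delta(baseline, optimized):
    """Absolute and relative change; negative means the optimized plan is lower."""
    absolute = optimized - baseline
    relative = absolute / baseline if baseline else None
    return {
        "baseline": baseline,
        "optimized": optimized,
        "absolute": absolute,
        "relative": relative,
    }


# Hypothetical plans, represented as ordered step descriptions.
baseline_plan = ["retrieve", "summarize(model=large)", "validate"]
optimized_plan = ["retrieve", "summarize(model=small)", "validate"]

report = {
    "baseline_plan": baseline_plan,
    "optimized_plan": optimized_plan,
    "plan_diff": list(
        difflib.unified_diff(
            baseline_plan, optimized_plan, "baseline", "optimized", lineterm=""
        )
    ),
    "enabled_passes": ["model_downgrade"],
    "disabled_passes": ["prompt_compression"],
    "tokens": delta(12000, 9000),
    "cost_usd": delta(0.48, 0.36),
    "latency_ms": delta(800, 600),
    "validator_results": {"schema_valid": True, "eval_score": 0.91},
    "warnings": [],
    "evidence_bundle": "artifacts/run-001/evidence.tar.gz",  # placeholder path
}

print(json.dumps(report, indent=2))
```

Keeping the baseline value next to each delta lets a reader recompute the change instead of trusting a bare percentage.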

Rule of Thumb

The defensible target is not:

Same answer, lower cost.

The defensible target is:

Equivalent task outcome under declared validators and allowed regression thresholds, with lower cost, lower latency, or improved reliability.
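Read as a decision rule, the target could be sketched like this. The function name and parameters are hypothetical, and the sketch assumes deltas are computed as optimized minus baseline, so a negative cost or latency delta is an improvement and a positive reliability delta is an improvement.

```python
def is_defensible(
    validators_passed,
    quality_drop,
    allowed_drop,
    cost_delta=0.0,
    latency_delta=0.0,
    reliability_delta=0.0,
):
    """Return True only when the outcome is equivalent AND something improved."""
    # Equivalent outcome: declared validators pass and any quality drop
    # stays within the allowed regression threshold.
    outcome_equivalent = validators_passed and quality_drop <= allowed_drop
    # Improvement: lower cost, lower latency, or better reliability.
    improved = cost_delta < 0 or latency_delta < 0 or reliability_delta > 0
    return outcome_equivalent and improved


# Cheaper and within the threshold: defensible.
print(is_defensible(True, quality_drop=0.01, allowed_drop=0.02, cost_delta=-0.12))
# Cheaper, but a validator failed: not defensible.
print(is_defensible(False, quality_drop=0.0, allowed_drop=0.02, cost_delta=-0.12))
# Validators pass, but nothing improved: not an optimization.
print(is_defensible(True, quality_drop=0.0, allowed_drop=0.02))
```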