Benchmarks and Evaluation
Sasha Lopashev edited this page Jun 27, 2026 · 1 revision
Migaki should not claim optimization from lower token counts alone.
An optimization has worked only if the workflow still satisfies its declared acceptance criteria.
Metrics to track include:
- Token count.
- Estimated cost.
- Actual cost where available.
- Latency.
- Validator pass rate.
- Eval score.
- Schema validity.
- Source-grounding score.
- Human acceptance rate.
- Regression threshold.
- Replayability.
- Policy compliance.
Every benchmark should compare against simple baselines:
- naive execution,
- no optimization,
- simple static routing,
- provider or gateway default behavior.
Routing benchmarks in particular should prove they beat simple alternatives under the chosen task and model set.
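The baseline comparison above can be sketched as a small check that lists which baselines a candidate fails to beat. This is a minimal illustration, not Migaki's implementation: `RunResult`, `baselines_not_beaten`, the single quality score, and every number below are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One benchmark run (hypothetical shape, for illustration only)."""
    name: str
    quality: float   # higher is better, e.g. an eval score in [0, 1]
    cost_usd: float  # lower is better


def baselines_not_beaten(candidate, baselines, max_quality_drop=0.0):
    """Return the names of baselines the candidate does NOT beat.

    "Beat" here means: quality is within max_quality_drop of the
    baseline's quality AND cost is strictly lower.
    """
    failed = []
    for baseline in baselines:
        quality_ok = candidate.quality >= baseline.quality - max_quality_drop
        cheaper = candidate.cost_usd < baseline.cost_usd
        if not (quality_ok and cheaper):
            failed.append(baseline.name)
    return failed


# Invented numbers, chosen only to show one failing comparison.
baselines = [
    RunResult("naive execution", quality=0.90, cost_usd=1.00),
    RunResult("no optimization", quality=0.90, cost_usd=0.95),
    RunResult("simple static routing", quality=0.88, cost_usd=0.60),
    RunResult("gateway default", quality=0.89, cost_usd=0.70),
]
candidate = RunResult("candidate routing", quality=0.89, cost_usd=0.65)

print(baselines_not_beaten(candidate, baselines, max_quality_drop=0.02))
# prints ['simple static routing']
```

In this made-up run the candidate is cheaper than three baselines but not cheaper than simple static routing, so the routing claim would not hold for this task and model set.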
Each benchmark should declare:
- which quality metric must not regress,
- the allowed regression threshold,
- required deterministic invariants,
- allowed providers,
- data-retention requirements,
- replay and audit level.
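One way to make these declarations explicit is a small record that a benchmark must fill in before it runs. The sketch below is an assumption about how that could look; the class name, field names, and example values are hypothetical and do not come from Migaki itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkDeclaration:
    """Up-front declaration for one benchmark (hypothetical field names)."""
    protected_metric: str              # quality metric that must not regress
    max_regression: float              # allowed drop in protected_metric
    invariants: tuple[str, ...]        # required deterministic invariants
    allowed_providers: tuple[str, ...]
    data_retention: str
    audit_level: str                   # replay and audit level

    def problems(self) -> list[str]:
        """Return human-readable issues; an empty list means usable."""
        issues = []
        if not self.protected_metric:
            issues.append("no protected quality metric declared")
        if self.max_regression < 0:
            issues.append("regression threshold must be non-negative")
        if not self.allowed_providers:
            issues.append("no allowed providers declared")
        if not self.data_retention:
            issues.append("no data-retention requirement declared")
        if not self.audit_level:
            issues.append("no replay and audit level declared")
        return issues


# Placeholder values for illustration only.
declaration = BenchmarkDeclaration(
    protected_metric="eval_score",
    max_regression=0.02,
    invariants=("output parses as valid JSON",),
    allowed_providers=("provider-a",),
    data_retention="no prompt storage",
    audit_level="full replay",
)
print(declaration.problems())  # prints []
```

Rejecting a benchmark whose `problems()` list is non-empty keeps the acceptance criteria fixed before any optimized run is measured.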
A benchmark report should include:
- baseline plan,
- optimized plan,
- plan diff,
- enabled passes,
- disabled passes,
- token and cost deltas,
- latency deltas,
- validator or eval results,
- warning list,
- evidence bundle link or artifact path.
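The report fields above could be assembled roughly as follows. This is a sketch under stated assumptions: plans are treated as plain text, metrics as name-to-number mappings, and `metric_deltas`, `build_report`, and all example values are hypothetical names invented for illustration.

```python
import difflib


def metric_deltas(baseline, optimized):
    """Absolute and relative change for each metric present in both runs."""
    deltas = {}
    for name in sorted(baseline.keys() & optimized.keys()):
        before, after = baseline[name], optimized[name]
        absolute = after - before
        # Relative change is undefined when the baseline value is zero.
        relative = absolute / before if before else None
        deltas[name] = {"before": before, "after": after,
                        "absolute": absolute, "relative": relative}
    return deltas


def build_report(baseline_plan, optimized_plan, enabled_passes,
                 disabled_passes, baseline_metrics, optimized_metrics,
                 validator_results, warnings, evidence_path):
    """Collect the listed report fields into one dictionary."""
    plan_diff = list(difflib.unified_diff(
        baseline_plan.splitlines(), optimized_plan.splitlines(),
        fromfile="baseline", tofile="optimized", lineterm=""))
    return {
        "baseline_plan": baseline_plan,
        "optimized_plan": optimized_plan,
        "plan_diff": plan_diff,
        "enabled_passes": list(enabled_passes),
        "disabled_passes": list(disabled_passes),
        "deltas": metric_deltas(baseline_metrics, optimized_metrics),
        "validator_results": dict(validator_results),
        "warnings": list(warnings),
        "evidence_path": evidence_path,
    }


# Placeholder inputs; pass names and the path are invented.
report = build_report(
    baseline_plan="call model A\ncall model B",
    optimized_plan="call model A",
    enabled_passes=["example_pass"],
    disabled_passes=["other_pass"],
    baseline_metrics={"tokens": 12000, "latency_ms": 800},
    optimized_metrics={"tokens": 9000, "latency_ms": 1000},
    validator_results={"schema_validity": True},
    warnings=["latency regressed"],
    evidence_path="artifacts/example-run",
)
print(report["deltas"]["tokens"]["relative"])      # prints -0.25
print(report["deltas"]["latency_ms"]["relative"])  # prints 0.25
```

Reporting every delta, including the ones that got worse, is what lets a reader see the trade: here tokens fall by a quarter while latency rises by a quarter.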
The defensible target is not:
Same answer, lower cost.
The defensible target is:
Equivalent task outcome under declared validators and allowed regression thresholds, with lower cost, lower latency, or improved reliability.
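Read as a predicate, the defensible target has three parts: validators pass, quality stays within the allowed regression, and at least one of cost, latency, or reliability improves. A minimal sketch follows, assuming a single quality score and using hypothetical names (`Outcome`, `is_defensible`) and invented numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Measured outcome of one plan (hypothetical shape)."""
    validators_passed: bool
    quality: float       # higher is better
    cost_usd: float      # lower is better
    latency_ms: float    # lower is better
    success_rate: float  # reliability proxy, higher is better


def is_defensible(baseline, optimized, max_regression):
    """True only if the optimized plan meets the defensible target."""
    if not optimized.validators_passed:
        return False
    if baseline.quality - optimized.quality > max_regression:
        return False
    return (optimized.cost_usd < baseline.cost_usd
            or optimized.latency_ms < baseline.latency_ms
            or optimized.success_rate > baseline.success_rate)


# Invented numbers for illustration.
baseline = Outcome(True, quality=0.90, cost_usd=1.00,
                   latency_ms=800, success_rate=0.95)
cheaper = Outcome(True, quality=0.89, cost_usd=0.70,
                  latency_ms=800, success_rate=0.95)
too_lossy = Outcome(True, quality=0.80, cost_usd=0.30,
                    latency_ms=800, success_rate=0.95)

print(is_defensible(baseline, cheaper, max_regression=0.02))    # prints True
print(is_defensible(baseline, too_lossy, max_regression=0.02))  # prints False
```

The second case is the "lower cost" claim the page warns against: the plan is much cheaper, but quality drops past the declared threshold, so it does not count as an optimization.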