Execution Evidence Bundles

Sasha Lopashev edited this page Jun 27, 2026 · 1 revision

An execution evidence bundle is the artifact that makes Migaki optimization inspectable.

Without evidence, optimization becomes invisible prompt mutation. Migaki should make every transformation visible enough to benchmark, audit, replay, and debug.

Contents

An evidence bundle may include:

  • original mIR plan,
  • optimized mIR plan,
  • plan diff,
  • enabled optimization passes,
  • disabled optimization passes,
  • context blocks added, removed, compressed, or reordered,
  • token estimates before and after,
  • cache planning decisions,
  • routing decisions,
  • provider capability assumptions,
  • retry and fallback decisions,
  • validator results,
  • eval metadata,
  • cost estimate,
  • actual cost,
  • latency estimate,
  • actual latency,
  • trace IDs,
  • replay handles,
  • policy decisions,
  • human approvals,
  • redactions.

Required for v0

  • Original plan reference.
  • Optimized plan reference.
  • Plan diff.
  • Pass list.
  • Warnings.
  • Context diff.
  • Token estimates.
  • Cost estimates.
  • Provider assumptions.
  • Validator results, where available.
  • Replay mode.
  • Redaction metadata.
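
The v0 list above can be read as a minimal record plus a completeness check. The following is a sketch only: the class name, field names, and field types are assumptions made for illustration, not a Migaki schema. Validator results are treated as optional because the list says "where available".

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class EvidenceBundleV0:
    """Hypothetical v0 bundle; names and types are illustrative assumptions."""
    original_plan_ref: Optional[str] = None
    optimized_plan_ref: Optional[str] = None
    plan_diff: Optional[str] = None
    passes: Optional[list[str]] = None
    warnings: Optional[list[str]] = None
    context_diff: Optional[dict[str, Any]] = None
    token_estimates: Optional[dict[str, int]] = None
    cost_estimates: Optional[dict[str, float]] = None
    provider_assumptions: Optional[dict[str, Any]] = None
    validator_results: Optional[dict[str, Any]] = None  # only "where available"
    replay_mode: Optional[str] = None
    redaction_metadata: Optional[dict[str, Any]] = None


# Fields that may legitimately be absent from a v0 bundle.
OPTIONAL_FIELDS = {"validator_results"}


def missing_required(bundle: EvidenceBundleV0) -> list[str]:
    """Return the names of required fields that were never set (still None)."""
    return [
        f.name
        for f in fields(bundle)
        if f.name not in OPTIONAL_FIELDS and getattr(bundle, f.name) is None
    ]
```

An empty list (for example, no warnings) counts as present here; only `None` is treated as missing, so "nothing to report" stays distinguishable from "not recorded".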

Privacy and Retention

Evidence bundles can contain sensitive data. The evidence model should support:

  • metadata-only replay,
  • full trace replay,
  • redacted exports,
  • privacy classes,
  • retention policies,
  • provider data-retention notes,
  • audit levels.

The evidence bundle should say what it omitted, not only what it included.
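
One way to make omissions explicit is to have the redaction step itself record what it dropped. The sketch below assumes a bundle represented as a plain dictionary; the `redact` function, the `redactions` key, and the record shape are hypothetical and chosen only to illustrate the idea.

```python
import copy
from typing import Any


def redact(bundle: dict[str, Any], drop: set[str], reason: str) -> dict[str, Any]:
    """Return a copy of `bundle` without the keys in `drop`, listing each omission."""
    # Deep-copy the kept values so the export never aliases the original bundle.
    out = {k: copy.deepcopy(v) for k, v in bundle.items() if k not in drop}
    # Keep any earlier redaction records and append the new ones.
    redactions = list(out.get("redactions", []))
    for key in sorted(k for k in drop if k in bundle):
        redactions.append({"field": key, "reason": reason})
    out["redactions"] = redactions
    return out
```

Only keys that were actually present are recorded, so the `redactions` list describes real omissions rather than the full drop policy.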

Export Targets

Migaki should integrate with existing telemetry rather than invent a closed observability island.

Potential export targets:

  • OpenTelemetry GenAI spans and metrics,
  • Langfuse-style trace systems,
  • local JSON artifacts,
  • CI artifacts,
  • CLI reports.
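
Of the targets above, a local JSON artifact is the simplest to sketch. The example below uses only the Python standard library; the function name, the `.evidence.json` suffix, and the dictionary representation of a bundle are assumptions for illustration, not a Migaki convention.

```python
import json
from pathlib import Path
from typing import Any


def write_json_artifact(bundle: dict[str, Any], out_dir: Path, run_id: str) -> Path:
    """Write `bundle` as a JSON file under `out_dir` and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{run_id}.evidence.json"
    # Sorted keys keep the file stable across runs, which makes diffs in CI readable.
    text = json.dumps(bundle, indent=2, sort_keys=True) + "\n"
    path.write_text(text, encoding="utf-8")
    return path
```

The same file could then be uploaded as a CI artifact or summarized by a CLI report without a separate export path.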

Claim Standard

An optimization is credible only when the evidence bundle can answer:

  • What changed?
  • Why was it allowed?
  • What constraints were checked?
  • What did it cost before and after?
  • What quality gate passed?
  • What provider assumptions were used?
  • Can the run be replayed or audited?
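
The questions above could be turned into an automated check. In the sketch below, the mapping from each question to bundle keys is an assumption made for illustration (the page does not define which fields answer which question), and the bundle is again a plain dictionary.

```python
from typing import Any

# Hypothetical mapping: each question is answered if any listed key is non-empty.
CLAIM_QUESTIONS: dict[str, tuple[str, ...]] = {
    "What changed?": ("plan_diff", "context_diff"),
    "Why was it allowed?": ("policy_decisions",),
    "What constraints were checked?": ("validator_results",),
    "What did it cost before and after?": ("cost_estimates",),
    "What quality gate passed?": ("eval_metadata",),
    "What provider assumptions were used?": ("provider_assumptions",),
    "Can the run be replayed or audited?": ("replay_mode", "trace_ids"),
}


def unanswered(bundle: dict[str, Any]) -> list[str]:
    """Return the questions for which the bundle holds no non-empty evidence."""
    return [
        question
        for question, keys in CLAIM_QUESTIONS.items()
        if not any(bundle.get(key) for key in keys)
    ]
```

Under this sketch, an optimization claim would be treated as credible only when `unanswered(bundle)` returns an empty list. Empty values count as unanswered, which is a deliberate but debatable choice.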
