bench: statistical methodology (warmup, confidence intervals) #684

Merged
danieljohnmorris merged 1 commit into main from chore/bench-stats-methodology on May 22, 2026

@danieljohnmorris
Collaborator

Summary

Implements ILO-349 (follow-up to ILO-65 / #608).

  • Warmup: 5 runs are discarded before measurement (3 in --quick mode) to reduce cold-start noise
  • 30 measurement runs per engine/language (10 in --quick), each an independent process invocation; ilo uses its internal 10k-iter loop per run
  • Full statistics computed per (benchmark, language): min, max, mean, median, p95, p99, stddev
  • 95% confidence intervals via t-distribution (df = n-1), converging to z = 1.960 for large samples
  • Confidence-interval overlap detection: after all benchmarks complete, the 95% confidence intervals are compared pairwise; overlapping intervals are reported as "no statistically significant difference". This is a non-fatal warning by default; a non-zero exit (exit 1) can be enabled for continuous-integration gates
  • results.json schema updated: values are now stats objects instead of bare scalars; a methodology block records warmup/measure counts and CI method
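
As a rough illustration of the statistics listed above, the following Python sketch computes one stats object from a list of timings. It is not the code from this PR: the function names `summarize` and `percentile`, the nearest-rank percentile method, and the hard-coded critical-value table are all assumptions made for the example. Only the field names (min, max, mean, median, p95, p99, stddev, ci95_lo, ci95_hi) and the t-distribution with df = n-1 come from the description.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
# Assumption: only the two sample sizes named in the summary are covered
# (n=10 in --quick mode, n=30 in normal mode).
T_CRIT_95 = {9: 2.262, 29: 2.045}


def percentile(sorted_xs, p):
    """Nearest-rank percentile (assumed method; the PR does not name one)."""
    rank = max(1, math.ceil(p * len(sorted_xs) / 100))
    return sorted_xs[rank - 1]


def summarize(samples):
    """Return a stats object for one (benchmark, language) pair."""
    xs = sorted(samples)
    n = len(xs)
    mean = statistics.fmean(xs)
    stddev = statistics.stdev(xs)  # sample standard deviation (n-1 denominator)
    # Half-width of the 95% confidence interval for the mean.
    half_width = T_CRIT_95[n - 1] * stddev / math.sqrt(n)
    return {
        "min": xs[0],
        "max": xs[-1],
        "mean": mean,
        "median": statistics.median(xs),
        "p95": percentile(xs, 95),
        "p99": percentile(xs, 99),
        "stddev": stddev,
        "ci95_lo": mean - half_width,
        "ci95_hi": mean + half_width,
    }


if __name__ == "__main__":
    # Hypothetical timings in milliseconds, 10 samples as in --quick mode.
    timings_ms = [100.0 + i for i in range(10)]
    print(summarize(timings_ms))
```

With only 10 or 30 samples, p95 and p99 sit at or next to the maximum under nearest-rank, so they are best read as tail indicators rather than precise quantiles.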

Test plan

  • ./bench/run.sh --quick --no-rust completes without error
  • bench/results.json contains ci95_lo/ci95_hi fields
  • CI overlap section prints without crashing on degenerate (single-sample) input
  • --quick produces 10 samples, normal mode produces 30
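
The overlap check exercised by the test plan could look roughly like the sketch below. This is an assumption-laden illustration rather than the PR's implementation: `overlapping_pairs` is a hypothetical helper, the language names and interval values are made up, and a single-sample (degenerate) input is modelled here as a zero-width interval with lo equal to hi, which may differ from how bench/run.sh handles it.

```python
from itertools import combinations


def overlapping_pairs(intervals):
    """Return the sorted name pairs whose confidence intervals overlap.

    `intervals` maps a language/engine name to a (ci95_lo, ci95_hi) tuple.
    Intervals that merely touch at an endpoint count as overlapping.
    """
    pairs = []
    for (a, (a_lo, a_hi)), (b, (b_lo, b_hi)) in combinations(sorted(intervals.items()), 2):
        if a_lo <= b_hi and b_lo <= a_hi:
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    # Hypothetical intervals in milliseconds; "rust" stands in for a
    # degenerate single-sample result with a zero-width interval.
    intervals = {
        "ilo": (10.0, 12.0),
        "python": (11.5, 14.0),
        "rust": (3.0, 3.0),
    }
    for a, b in overlapping_pairs(intervals):
        print(f"{a} vs {b}: no statistically significant difference")
```

Interval overlap is a conservative test: non-overlapping 95% intervals indicate a significant difference, but two intervals can overlap slightly while a direct two-sample test would still find one.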

🤖 Generated with Claude Code

@codecov

codecov Bot commented May 22, 2026

❌ 1 Tests Failed:

Tests completed: 686 (Failed: 1, Passed: 685, Skipped: 0)
View the full list of 1 ❄️ flaky test(s)
ilo::interpreter::tests::interpret_braced_guard_in_loop_no_early_return

Flake rate in main: 66.67% (Passed 1 times, Failed 2 times)

Stack Traces | 0.011s run time
thread 'interpreter::tests::interpret_braced_guard_in_loop_no_early_return' (32872) panicked at src/interpreter/mod.rs:10202:9:
parse errors: [ParseError { code: "ILO-P003", position: 27, span: Span { start: 35, end: 36 }, message: "expected `{`, got `;`", hint: Some("ilo bodies are single-line, `;`-separated — not python/swift-style indented. Use either the brace-block form `name p:t>r;{body1;body2}` or the single-line form `name p:t>r;body1;body2`. For statements that require a block (`@k xs{...}`, `wh cond{...}`, `?subj{...}`), the `{...}` must be on the same line as the head.") }]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


@danieljohnmorris added the "mini" label (Created by mini PC autonomous workflow) on May 22, 2026
@danieljohnmorris
Collaborator Author

needs manual rebase (conflicts in: bench/run.sh)

@danieljohnmorris
Collaborator Author

needs manual — rebase conflict in non-doc file(s). Auto-resolve skipped.

@danieljohnmorris
Collaborator Author

needs manual — rebase conflict in non-doc file(s)

hotfix(codegen/js): handle Pattern::Or, Expr::Todo, Expr::Panic
@danieljohnmorris force-pushed the chore/bench-stats-methodology branch from 8527fc7 to 0485a68 on May 22, 2026 10:36
@danieljohnmorris merged commit f32d1e6 into main on May 22, 2026
8 of 10 checks passed
@danieljohnmorris deleted the chore/bench-stats-methodology branch on May 22, 2026 10:40