cli: --bench --json envelope with top-level engine field#419
Merged
Conversation
`--bench` previously dumped human-readable text regardless of output mode. Wire `run_bench` to the resolved `OutputMode` so `--json` (or non-TTY auto-detect) emits one JSON object per engine measurement instead. Top-level `engine` field names which engine produced the numbers — `tree`, `vm` (with `variant: "fresh"|"reusable"` for the two VM modes), `jit`, or `llvm`. Python transpiler comparison and the summary block are skipped under `--json`; text mode is unchanged. Lets agents read bench results without parsing prose. Closes the JSON-envelope slice of adoption brief 4.
New regression suite asserts the bench JSON contract: - one envelope per engine measurement - top-level `engine` is `tree`, `vm`, or `jit` - VM emits two records with `variant: "fresh" | "reusable"` - every line carries iterations / totalMs / perCallNs / result - `--text` still produces the legacy human-readable text The 7 pre-existing `bench_*` tests in eval_inline.rs grepped for "Rust interpreter" / "Register VM" / "Cranelift JIT" headings. Subprocess-mode output now auto-resolves to JSON when stderr isn't a TTY (matching every other ilo subcommand), so these tests need an explicit `--text` to stay on the legacy text contract they were written against.
Replace the misleading per-engine bench invocation table (every `--bench` run already exercises tree/vm/jit regardless of engine flag) with one accurate text invocation, the JSON invocation, and the envelope shape so agents know what to parse.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Slim slice of adoption brief 4.
--benchpreviously ignored output mode and dumped text only. With--jsonit now emits one JSON object per engine measurement, top-levelenginefield naming which engine produced the numbers — so agents can filter bench results without parsing prose.The other deliverables in the brief either already exist (engine docs at
skills/ilo/ilo-engines.md) or were deliberately reversed (the--engineCLI flag, removed in 0e30891). This PR is just the JSON envelope.What's in the diff
run_benchtakes ajsonflag from the resolvedOutputMode. Each engine block emits either text or one JSON line. Python comparator + Summary skipped in JSON mode (Python isn't one of the four engines; Summary is prose).regression_bench_engine_field.rs(4 tests) covers the JSON contract; the 7 pre-existingbench_*tests ineval_inline.rsneed--textbecause subprocess mode now auto-resolves to JSON (matching every other subcommand).--benchinvocation already exercises all engines) with the actual invocations and the JSON envelope shape.Notes
aotis reserved in the JSON contract but--benchdoesn't measure AOT today (useilo buildthen time the produced binary). When a future change adds an AOT bench it gets"engine":"aot"for free.variant: "fresh"|"reusable". Keeping both is useful — the reusable VmState measurement is closer to the steady-state cost agents care about, while the fresh-compile path matches whatilo file.ilodoes today.Test plan
cargo test --release --features cranelift(all green)cargo fmt --checkcargo clippy --release --features cranelift --tests -- -D warnings--textmode unchangedFollow-ups
"engine":"aot"is actually produced (separate brief — needs design on whether bench builds + spawns a binary, or compiles in-process).