feature: closed-loop agent benchmark harness (Phase 2) #399
Merged
Adds research/closed-loop-bench/ - a driver, per-language adapters, task definitions, and system prompts for the closed-loop benchmark that measures end-to-end token cost across generation -> compile -> repair -> retry.

Variants under test: python, ilo-pre-phase-1, ilo-post-phase-1, zero. Adapters call real ilo and zero binaries for compile-check; ilo also runs test cases via subprocess. Zero is compile-only (documented limitation).

Driver supports --mock for offline plumbing validation (no API key) and --full for the live Anthropic matrix. Prompt-cache aware; budget cap; 3-seed median + variance aggregation. Charts script produces the 4 canonical figures from aggregated.json.

The whole research/ tree was previously ignored. Add a targeted exception for closed-loop-bench/ and BENCHMARK-METHODOLOGY.md while keeping per-cell raw results, zero compile caches, and __pycache__ out of the repo.
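The compile-check step that the adapters perform can be pictured roughly as follows. This is a minimal sketch, not the code in variants.py: `CheckResult` and `compile_check` are hypothetical names, and because the exact argument lists passed to the ilo and zero binaries are not shown in this PR, the command is left entirely to the caller.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    ok: bool          # True when the checker exited with status 0
    diagnostics: str  # combined stdout + stderr, usable as repair feedback


def compile_check(cmd: list[str], timeout_s: float = 30.0) -> CheckResult:
    """Run an external checker command and report whether it succeeded.

    `cmd` is whatever argv the adapter builds for its language; it is an
    assumption of this sketch that exit status 0 means "compiles".
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing binary or a hang is recorded as a failed check.
        return CheckResult(ok=False, diagnostics=str(exc))
    return CheckResult(ok=proc.returncode == 0,
                       diagnostics=proc.stdout + proc.stderr)


# Stand-in commands (the Python interpreter itself) so the sketch runs
# without ilo or zero installed.
passing = compile_check([sys.executable, "-c", "pass"])
failing = compile_check([sys.executable, "-c", "raise SystemExit(3)"])
```

Treating a missing binary as a failed check rather than an exception keeps one broken adapter from aborting a whole matrix run; whether the real driver makes the same choice is not stated here.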
Output of running the harness in --mock mode: 4 variants x 5 tasks x 5 session lengths x 1 model x 3 seeds = 300 cells, aggregated to 100 rows in results/aggregated.json. CSV mirror in data/aggregated.csv.

All four canonical figures generated to charts/ as PNG + SVG:
- chart1-tokens-vs-n: total tokens per task vs session length
- chart2-cost-composition: cost composition at the largest N
- chart3-success-rate: success rate by variant
- chart4-usd-cost: USD per completed task

These numbers are NOT publishable - the mock LLM is a deterministic stub returning broken-then-fixed reference impls. They exist to prove the loop, the adapters, the aggregation, and the chart generator all work end-to-end. The live matrix is gated on ANTHROPIC_API_KEY and will overwrite results/ + charts/ when run.
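The cell count and the per-seed aggregation described above can be sketched as follows. The task ids, session-length values, field names, and token counts are placeholders invented for illustration, and the use of sample variance is an assumption; the actual schema of aggregated.json is not shown in this PR.

```python
import itertools
import statistics
from collections import defaultdict

VARIANTS = ["python", "ilo-pre-phase-1", "ilo-post-phase-1", "zero"]
TASKS = [f"task-{i}" for i in range(1, 6)]  # placeholder task ids
SESSION_LENGTHS = [1, 2, 3, 4, 5]           # placeholder values of N
MODELS = ["mock"]
SEEDS = [0, 1, 2]


def enumerate_cells():
    """Every (variant, task, n, model, seed) combination in the matrix."""
    return list(itertools.product(
        VARIANTS, TASKS, SESSION_LENGTHS, MODELS, SEEDS))


def aggregate(cells):
    """Collapse the seeds of each (variant, task, n, model) into one row."""
    groups = defaultdict(list)
    for c in cells:
        key = (c["variant"], c["task"], c["n"], c["model"])
        groups[key].append(c["total_tokens"])
    rows = []
    for (variant, task, n, model), tokens in sorted(groups.items()):
        rows.append({
            "variant": variant, "task": task, "n": n, "model": model,
            "seeds": len(tokens),
            "median_tokens": statistics.median(tokens),
            # Sample variance needs at least two observations.
            "variance_tokens": (statistics.variance(tokens)
                                if len(tokens) > 1 else 0.0),
        })
    return rows


# Synthetic token counts, only to exercise the shapes.
cells = [
    dict(variant=v, task=t, n=n, model=m, seed=s, total_tokens=1000 + 10 * s)
    for v, t, n, m, s in enumerate_cells()
]
rows = aggregate(cells)
print(len(cells), len(rows))  # 300 100
```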
Two reference docs alongside the harness:
- research/closed-loop-bench/README.md
  Quick start, repo layout, what's shipped in this PR (harness + mock dataset), what's deferred (the live matrix), how to extend with new languages or tasks.
- research/BENCHMARK-METHODOLOGY.md
  How to reproduce, what each metric means, the per-variant adapter behaviour, the repair-loop algorithm, pricing assumptions, and the four documented limitations: zero is compile-only, ilo-post-phase-1 is simulated via a prefix slice of ai.txt, ilo-post-phase-4 is deferred until Phase 4 lands, and the shipped numbers are mock-only.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Summary
This is the Phase 2 keystone: a closed-loop benchmark harness that drives an
LLM through generation -> compile -> repair -> retry until passing, measuring
total tokens, success rate, and time across ilo, Zero, and Python on 5
canonical tasks at 5 session lengths. The dataset this produces is the
empirical receipt behind every "ilo wins per-task token cost on iterative
agent workflows" claim: without it the strategy is opinion; with it, it is
measurement.
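The generation -> compile -> repair -> retry loop can be sketched as below. This is an illustration under stated assumptions, not the code in driver.py: `closed_loop`, `LoopResult`, the attempt limit, and a budget expressed in tokens are all hypothetical, and the stub mirrors the "broken-then-fixed" behaviour described for the mock LLM.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class LoopResult:
    passed: bool
    attempts: int
    total_tokens: int


def closed_loop(
    generate: Callable[[Optional[str]], Tuple[str, int]],
    check: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 5,       # assumed retry limit
    token_budget: int = 50_000,  # assumed cap, expressed in tokens
) -> LoopResult:
    """Generate, check, and feed diagnostics back until pass or give-up."""
    feedback: Optional[str] = None
    total = 0
    for attempt in range(1, max_attempts + 1):
        source, used = generate(feedback)  # one model call
        total += used                      # every retry adds to the cost
        ok, diagnostics = check(source)
        if ok:
            return LoopResult(True, attempt, total)
        if total >= token_budget:
            return LoopResult(False, attempt, total)
        feedback = diagnostics             # repair prompt for the next try
    return LoopResult(False, max_attempts, total)


def make_mock_llm() -> Callable[[Optional[str]], Tuple[str, int]]:
    """Deterministic stub: broken on the first call, fixed afterwards."""
    calls = {"n": 0}

    def generate(feedback: Optional[str]) -> Tuple[str, int]:
        calls["n"] += 1
        return ("broken" if calls["n"] == 1 else "fixed", 100)

    return generate


def mock_check(source: str) -> Tuple[bool, str]:
    ok = source == "fixed"
    return ok, "" if ok else "error: broken"


result = closed_loop(make_mock_llm(), mock_check)
print(result)  # LoopResult(passed=True, attempts=2, total_tokens=200)
```

Summing tokens across every attempt, rather than only the passing one, is what makes the measurement "closed-loop": a variant that compiles on the first try pays once, while one that needs repairs pays for each round trip.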
What's in this PR:
system prompts, charts script, run script, README and methodology doc.
numbers are NOT publishable - they exist to prove the plumbing.
What's deferred:
3 seeds = 750 cells, ~$100-150 in API spend). Gated on access to an
ANTHROPIC_API_KEY. The brief explicitly accepts "harness + smaller sample, document the gap" within the 4-hour scaffolding timebox.
ilo-post-phase-4 variant - gated on Phase 4 landing.
What's in the diff
e62dcbc research: scaffold closed-loop agent benchmark harness - driver.py (mock + live modes, prompt caching, repair loop, budget cap),
variants.py (python/ilo/zero adapters), charts.py, run_benchmark.sh,
5 task JSON files, 4 system prompts, gitignore exception.
f14d782 research: add 300-cell mock dataset and four canonical charts - aggregated.json, aggregated.csv, 4 charts (PNG + SVG), summary.txt.
d207ef8 docs: closed-loop benchmark README and methodology note - closed-loop-bench/README.md + research/BENCHMARK-METHODOLOGY.md.
Repro
Mock (no API key, ~3 minutes):
Live (requires ANTHROPIC_API_KEY):
Known limitations (documented in BENCHMARK-METHODOLOGY.md)
successful zero check, not on running output. Adding Zero-runnable test harnesses per task is future work.
ilo-post-phase-1 simulates modular skills by loading a 30% prefix of ai.txt. Once ilo skill get is wired into the variant prompt loader this can be re-run more accurately.
cache_hit_rate as 0 or a synthetic 0.85 - only the live mode produces real cache stats.
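The 30% prefix proxy for ilo-post-phase-1 might look something like the sketch below. The function name `spec_prefix` is hypothetical, and measuring the prefix in characters and trimming back to a line boundary are assumptions; the PR does not say whether the real slice is taken by characters, lines, or tokens.

```python
def spec_prefix(text: str, percent: int = 30) -> str:
    """Return roughly the first `percent` of `text`, ending on a full line."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    cut = len(text) * percent // 100    # integer maths avoids float rounding
    newline = text.rfind("\n", 0, cut)  # last newline inside the prefix
    if newline == -1:
        return text[:cut]               # no line break to snap to
    return text[: newline + 1]


# Ten 7-character lines: 30% of 70 characters is exactly three lines.
sample = "".join(f"line {i}\n" for i in range(10))
print(repr(spec_prefix(sample)))  # 'line 0\nline 1\nline 2\n'
```

A prefix of this kind only approximates a modular skill load: it keeps whatever happens to come first in ai.txt rather than the sections a task actually needs, which is why the limitation is listed.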
Test plan
./run_benchmark.sh --mock runs end-to-end and writes results/aggregated.json, data/aggregated.csv, and 4 PNG+SVG charts.
validates that every ref impl actually compiles via ilo --ast and zero check).
binaries (/Users/dan/.cargo/bin/ilo, ~/code/ilo-lang/zero/.zero/bin/zero).
./run_benchmark.sh --full against the Anthropic API - blocked on ANTHROPIC_API_KEY access, not run in this PR.
won't show the strategy crossover; that arrives with the live run.
Follow-ups
ilo skill get <name> into load_spec_for_variant so ilo-post-phase-1 measures the real modular skill load, not a prefix proxy.
limitation.
ilo-post-phase-4 variant once typed fix plans ship.