feature: closed-loop agent benchmark harness (Phase 2) by danieljohnmorris · Pull Request #399 · ilo-lang/ilo

danieljohnmorris · 2026-05-18T19:46:19Z

Summary

This is the Phase 2 keystone: a closed-loop benchmark harness that drives an
LLM through generation -> compile -> repair -> retry until passing, measuring
total tokens, success rate, and time across ilo, Zero, and Python on 5
canonical tasks at 5 session lengths. The dataset this produces is the
empirical receipt behind every "ilo wins per-task token cost on iterative
agent workflows" claim: without it the strategy is opinion; with it, it is
measurement.
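
For readers new to the loop shape, here is a minimal sketch of generation -> compile -> repair -> retry with token accounting. It is not the code in driver.py: `run_closed_loop`, `CellResult`, the stub functions, and the `max_attempts` default are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CellResult:
    success: bool
    attempts: int
    total_tokens: int


def run_closed_loop(
    generate: Callable[[str, Optional[str]], tuple[str, int]],
    check: Callable[[str], Optional[str]],
    task_prompt: str,
    max_attempts: int = 5,  # assumed cap, not taken from the harness
) -> CellResult:
    """Generate code, check it, and feed errors back until it passes."""
    total_tokens = 0
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        # First call has no error; later calls are repair requests.
        code, tokens = generate(task_prompt, error)
        total_tokens += tokens
        error = check(code)  # None means the check passed
        if error is None:
            return CellResult(True, attempt, total_tokens)
    return CellResult(False, max_attempts, total_tokens)


# Stub LLM: broken on the first attempt, fixed once it sees an error.
def stub_generate(prompt: str, error: Optional[str]) -> tuple[str, int]:
    return ("ok" if error else "broken", 100)


def stub_check(code: str) -> Optional[str]:
    return None if code == "ok" else "syntax error"


result = run_closed_loop(stub_generate, stub_check, "sum a list")
print(result)
```

The stub mirrors the broken-then-fixed behaviour the mock mode is described as having, so one repair round is enough to pass.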

What's in this PR:

- The harness end-to-end: driver, per-language adapters, 5 task definitions,
  system prompts, charts script, run script, README and methodology doc.
- A 300-cell mock dataset that exercises every code path in the loop. The
  numbers are NOT publishable - they exist to prove the plumbing.
- All four canonical charts generated to PNG + SVG.

What's deferred:

- The live Anthropic matrix (5 variants x 5 tasks x 5 lengths x 2 models x
  3 seeds = 750 cells, ~$100-150 in API spend). Gated on access to an
  ANTHROPIC_API_KEY. The brief explicitly accepts "harness + smaller
  sample, document the gap" within the 4-hour scaffolding timebox.
- The ilo-post-phase-4 variant - gated on Phase 4 landing.
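
The cell counts quoted in this PR (750 live, 300 mock, 100 aggregated rows) can be sanity-checked by enumerating the matrix. In the sketch below the variant names come from this PR, but treating ilo-post-phase-4 as the fifth live variant is an assumption, and the task names, session lengths, and model names are placeholders rather than the harness's real values.

```python
from itertools import product

# Variant names are from the PR text; the fifth live variant is assumed.
VARIANTS = ["python", "ilo-pre-phase-1", "ilo-post-phase-1", "zero", "ilo-post-phase-4"]
TASKS = [f"task-{i}" for i in range(1, 6)]  # placeholder task names
SESSION_LENGTHS = [1, 2, 3, 4, 5]           # placeholder session lengths
MODELS = ["model-a", "model-b"]             # placeholder model names
SEEDS = [0, 1, 2]

# Live matrix: 5 variants x 5 tasks x 5 lengths x 2 models x 3 seeds.
live_cells = list(product(VARIANTS, TASKS, SESSION_LENGTHS, MODELS, SEEDS))

# Mock matrix: 4 variants and 1 model, everything else unchanged.
mock_variants = [v for v in VARIANTS if v != "ilo-post-phase-4"]
mock_cells = list(product(mock_variants, TASKS, SESSION_LENGTHS, MODELS[:1], SEEDS))

# Aggregating over seeds leaves one row per (variant, task, length, model).
aggregated_rows = len(mock_cells) // len(SEEDS)

print(len(live_cells), len(mock_cells), aggregated_rows)
```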

What's in the diff

- e62dcbc research: scaffold closed-loop agent benchmark harness -
  driver.py (mock + live modes, prompt caching, repair loop, budget cap),
  variants.py (python/ilo/zero adapters), charts.py, run_benchmark.sh,
  5 task JSON files, 4 system prompts, gitignore exception.
- f14d782 research: add 300-cell mock dataset and four canonical charts -
  aggregated.json, aggregated.csv, 4 charts (PNG + SVG), summary.txt.
- d207ef8 docs: closed-loop benchmark README and methodology note -
  closed-loop-bench/README.md + research/BENCHMARK-METHODOLOGY.md.

Repro

Mock (no API key, ~3 minutes):

cd research/closed-loop-bench
./run_benchmark.sh --mock

Live (requires ANTHROPIC_API_KEY):

export ANTHROPIC_API_KEY=sk-ant-...
./run_benchmark.sh

Known limitations (documented in BENCHMARK-METHODOLOGY.md)

- Zero is compile-only - the test harness judges Zero variants on
  successful zero check, not on running output. Adding Zero-runnable
  test harnesses per task is future work.
- ilo-post-phase-1 simulates modular skills by loading a 30% prefix of
  ai.txt. Once ilo skill get is wired into the variant prompt loader
  this can be re-run more accurately.
- Mock numbers report cache_hit_rate as 0 or a synthetic 0.85 - only
  the live mode produces real cache stats.
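
To make the prefix proxy concrete, here is a rough sketch of taking a 30% prefix of a spec file. It assumes the slice is measured in characters; the harness may cut by lines or tokens instead, and `load_spec_prefix` is a hypothetical name, not the loader in this PR.

```python
def load_spec_prefix(spec_text: str, fraction: float = 0.30) -> str:
    """Return the leading `fraction` of the spec, measured in characters."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    # round() avoids off-by-one cuts caused by float representation.
    cut = round(len(spec_text) * fraction)
    return spec_text[:cut]


# Stand-in for the contents of ai.txt; the real file is not reproduced here.
full_spec = "x" * 100
prefix = load_spec_prefix(full_spec)
print(len(prefix))
```

A prefix only approximates a modular skill load because it keeps whatever happens to come first in the file, not the sections a task actually needs.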

Test plan

- ./run_benchmark.sh --mock runs end-to-end and writes
  results/aggregated.json, data/aggregated.csv, and 4 PNG+SVG charts.
- All 300 mock cells succeed (mock returns ref impls on retry, which
  validates that every ref impl actually compiles via ilo --ast and
  zero check).
- Python, ilo and zero compile adapters all exercised against real
  binaries (/Users/dan/.cargo/bin/ilo, ~/code/ilo-lang/zero/.zero/bin/zero).
- Live ./run_benchmark.sh --full against the Anthropic API -
  blocked on ANTHROPIC_API_KEY access, not run in this PR.
- Chart visual sanity-check against expected shapes - the mock data
  won't show the strategy crossover; that arrives with the live run.
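
As an illustration of what a compile adapter might look like, the sketch below shells out to a checker and turns its exit status into either a pass or an error string for the repair prompt. `compile_check` is a hypothetical helper, not the code in variants.py, and the exact argument layout for ilo --ast or zero check is not shown in this PR, so the demo uses the Python interpreter as a stand-in command.

```python
import subprocess
import sys
from typing import Optional, Sequence


def compile_check(cmd: Sequence[str], timeout_s: float = 30.0) -> Optional[str]:
    """Run a checker command; return None on success, else its error text."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        # Missing binary or hung checker counts as a failed check.
        return f"{type(exc).__name__}: {exc}"
    if proc.returncode == 0:
        return None
    # Prefer stderr, fall back to stdout, then to the bare exit code.
    return proc.stderr.strip() or proc.stdout.strip() or f"exit code {proc.returncode}"


# Stand-in commands so the sketch runs without ilo or zero installed.
ok = compile_check([sys.executable, "-c", "pass"])
bad = compile_check([sys.executable, "-c", "import sys; sys.exit('type error')"])
print(ok, bad)
```

Returning the error text rather than a boolean keeps the adapter usable by the repair step, which needs something to show the model.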

Follow-ups

- Run the live matrix and replace the mock dataset.
- Wire ilo skill get <name> into load_spec_for_variant so
  ilo-post-phase-1 measures the real modular skill load, not a prefix
  proxy.
- Add Zero-runnable test harnesses per task to lift the compile-only
  limitation.
- Add the ilo-post-phase-4 variant once typed fix plans ship.

Adds research/closed-loop-bench/ - a driver, per-language adapters, task definitions, and system prompts for the closed-loop benchmark that measures end-to-end token cost across generation -> compile -> repair -> retry.

Variants under test: python, ilo-pre-phase-1, ilo-post-phase-1, zero. Adapters call real ilo and zero binaries for compile-check; ilo also runs test cases via subprocess. Zero is compile-only (documented limitation).

Driver supports --mock for offline plumbing validation (no API key) and --full for the live Anthropic matrix. Prompt-cache aware; budget cap; 3-seed median + variance aggregation. Charts script produces the 4 canonical figures from aggregated.json.

The whole research/ tree was previously ignored. Add a targeted exception for closed-loop-bench/ and BENCHMARK-METHODOLOGY.md while keeping per-cell raw results, zero compile caches, and pycache out of the repo.
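
The 3-seed median + variance aggregation mentioned above could look roughly like this. The row keys, the `aggregate` name, and the choice of population variance are assumptions made for the sketch (sample variance is equally plausible), and the token counts are made-up values, not results from the harness.

```python
from collections import defaultdict
from statistics import median, pvariance


def aggregate(cells):
    """Collapse per-seed cells into one row per (variant, task, n, model)."""
    groups = defaultdict(list)
    for cell in cells:
        key = (cell["variant"], cell["task"], cell["n"], cell["model"])
        groups[key].append(cell["total_tokens"])
    rows = []
    for (variant, task, n, model), tokens in sorted(groups.items()):
        rows.append({
            "variant": variant, "task": task, "n": n, "model": model,
            "seeds": len(tokens),
            "median_tokens": median(tokens),
            # Population variance is an assumption of this sketch.
            "variance_tokens": pvariance(tokens),
        })
    return rows


# Three seeds of one cell, with invented token counts.
cells = [
    {"variant": "ilo-pre-phase-1", "task": "task-1", "n": 1,
     "model": "model-a", "seed": seed, "total_tokens": tokens}
    for seed, tokens in enumerate([100, 120, 110])
]
rows = aggregate(cells)
print(rows[0]["median_tokens"], rows[0]["variance_tokens"])
```

Grouping on everything except the seed is what turns 300 mock cells into 100 aggregated rows.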

Output of running the harness in --mock mode: 4 variants x 5 tasks x 5 session lengths x 1 model x 3 seeds = 300 cells, aggregated to 100 rows in results/aggregated.json. CSV mirror in data/aggregated.csv.

All four canonical figures generated to charts/ as PNG + SVG:

- chart1-tokens-vs-n: total tokens per task vs session length
- chart2-cost-composition: cost composition at the largest N
- chart3-success-rate: success rate by variant
- chart4-usd-cost: USD per completed task

These numbers are NOT publishable - the mock LLM is a deterministic stub returning broken-then-fixed reference impls. They exist to prove the loop, the adapters, the aggregation, and the chart generator all work end-to-end. The live matrix is gated on ANTHROPIC_API_KEY and will overwrite results/ + charts/ when run.

Two reference docs alongside the harness:

- research/closed-loop-bench/README.md: Quick start, repo layout, what's shipped in this PR (harness + mock dataset), what's deferred (the live matrix), how to extend with new languages or tasks.
- research/BENCHMARK-METHODOLOGY.md: How to reproduce, what each metric means, the per-variant adapter behaviour, the repair-loop algorithm, pricing assumptions, and the four documented limitations: zero is compile-only, ilo-post-phase-1 is simulated via a prefix slice of ai.txt, ilo-post-phase-4 is deferred until Phase 4 lands, and the shipped numbers are mock-only.

codecov · 2026-05-18T19:49:45Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.



danieljohnmorris added 3 commits May 18, 2026 20:45

Merge remote-tracking branch 'origin/main' into feature/phase2-closed…

4c88687

…-loop-bench

danieljohnmorris merged commit c79b85a into main May 18, 2026
5 checks passed

danieljohnmorris deleted the feature/phase2-closed-loop-bench branch May 18, 2026 20:02

danieljohnmorris merged 4 commits into main from feature/phase2-closed-loop-bench
