feature: closed-loop agent benchmark harness (Phase 2) #399

Merged: danieljohnmorris merged 4 commits into main from feature/phase2-closed-loop-bench on May 18, 2026

@danieljohnmorris
Collaborator

Summary

This is the Phase 2 keystone: a closed-loop benchmark harness that drives an
LLM through generation -> compile -> repair -> retry until passing, measuring
total tokens, success rate, and time across ilo, Zero, and Python on 5
canonical tasks at 5 session lengths. The dataset this produces is the
empirical receipt behind every "ilo wins per-task token cost on iterative
agent workflows" claim: without it the strategy is opinion; with it, it is
measurement.
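
For orientation, the generation -> compile -> repair -> retry loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in driver.py: the names run_cell, CellResult, generate and check are hypothetical, and the stubbed LLM and checker stand in for the real model call and the per-language adapters.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical callback shapes (assumptions, not the harness API):
#   generate(prompt, feedback) -> (source_code, tokens_used)
#   check(source_code)         -> (ok, error_message)
Generate = Callable[[str, Optional[str]], Tuple[str, int]]
Check = Callable[[str], Tuple[bool, str]]


@dataclass
class CellResult:
    success: bool
    attempts: int
    total_tokens: int


def run_cell(prompt: str, generate: Generate, check: Check,
             max_attempts: int = 5) -> CellResult:
    """Generate, check, and feed errors back until the check passes
    or the attempt limit is reached."""
    feedback: Optional[str] = None
    total_tokens = 0
    for attempt in range(1, max_attempts + 1):
        source, tokens = generate(prompt, feedback)
        total_tokens += tokens          # every attempt counts toward cost
        ok, error = check(source)
        if ok:
            return CellResult(True, attempt, total_tokens)
        feedback = error                # repair step: next call sees the error
    return CellResult(False, max_attempts, total_tokens)


# Deterministic stub: broken on the first attempt, fixed once it sees feedback.
def mock_generate(prompt: str, feedback: Optional[str]) -> Tuple[str, int]:
    return ("broken", 120) if feedback is None else ("fixed", 80)


def mock_check(source: str) -> Tuple[bool, str]:
    ok = source == "fixed"
    return ok, "" if ok else "syntax error"


result = run_cell("sum a list", mock_generate, mock_check)
print(result)  # CellResult(success=True, attempts=2, total_tokens=200)
```

Counting tokens on failed attempts as well as the passing one is what makes the measurement "closed-loop": a language that compiles first time pays for one call, one that needs repairs pays for all of them.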

What's in this PR:

  • The harness end-to-end: driver, per-language adapters, 5 task definitions,
    system prompts, charts script, run script, README and methodology doc.
  • A 300-cell mock dataset that exercises every code path in the loop. The
    numbers are NOT publishable - they exist to prove the plumbing.
  • All four canonical charts generated to PNG + SVG.

What's deferred:

  • The live Anthropic matrix (5 variants x 5 tasks x 5 lengths x 2 models x
    3 seeds = 750 cells, ~$100-150 in API spend). Gated on access to an
    ANTHROPIC_API_KEY. The brief explicitly accepts "harness + smaller
    sample, document the gap" within the 4-hour scaffolding timebox.
  • The ilo-post-phase-4 variant - gated on Phase 4 landing.
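
As a sanity check on the cell counts, the matrix can be enumerated as a Cartesian product. The sketch below is illustrative only: the variant names come from this PR, but the task names, session-length labels, model names and the build_matrix helper are placeholders, since the description gives only their counts.

```python
from itertools import product

# Variant names as listed in this PR; the fifth is the deferred one.
VARIANTS = ["python", "ilo-pre-phase-1", "ilo-post-phase-1", "zero",
            "ilo-post-phase-4"]
# Placeholders: only the counts (5 tasks, 5 lengths, 3 seeds) are from the PR.
TASKS = [f"task-{i}" for i in range(1, 6)]
LENGTHS = [f"N{i}" for i in range(1, 6)]
SEEDS = [0, 1, 2]


def build_matrix(variants, tasks, lengths, models, seeds):
    """Return one dict per benchmark cell."""
    keys = ("variant", "task", "session_length", "model", "seed")
    return [dict(zip(keys, combo))
            for combo in product(variants, tasks, lengths, models, seeds)]


# Live matrix: 5 variants x 5 tasks x 5 lengths x 2 models x 3 seeds.
live_cells = build_matrix(VARIANTS, TASKS, LENGTHS, ["model-a", "model-b"], SEEDS)
# Mock matrix: the 4 variants available today, 1 model.
mock_cells = build_matrix(VARIANTS[:4], TASKS, LENGTHS, ["mock"], SEEDS)

print(len(live_cells), len(mock_cells))  # 750 300
```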

What's in the diff

  • e62dcbc research: scaffold closed-loop agent benchmark harness -
    driver.py (mock + live modes, prompt caching, repair loop, budget cap),
    variants.py (python/ilo/zero adapters), charts.py, run_benchmark.sh,
    5 task JSON files, 4 system prompts, gitignore exception.
  • f14d782 research: add 300-cell mock dataset and four canonical charts -
    aggregated.json, aggregated.csv, 4 charts (PNG + SVG), summary.txt.
  • d207ef8 docs: closed-loop benchmark README and methodology note -
    closed-loop-bench/README.md + research/BENCHMARK-METHODOLOGY.md.

Repro

Mock (no API key, ~3 minutes):

cd research/closed-loop-bench
./run_benchmark.sh --mock

Live (requires ANTHROPIC_API_KEY):

export ANTHROPIC_API_KEY=sk-ant-...
./run_benchmark.sh

Known limitations (documented in BENCHMARK-METHODOLOGY.md)

  • Zero is compile-only - the test harness judges Zero variants on
    successful zero check, not on running output. Adding Zero-runnable
    test harnesses per task is future work.
  • ilo-post-phase-1 simulates modular skills by loading a 30% prefix of
    ai.txt. Once ilo skill get is wired into the variant prompt loader
    this can be re-run more accurately.
  • Mock mode reports cache_hit_rate as 0 or a synthetic 0.85 - only
    live mode produces real cache stats.
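
To make the prefix-proxy limitation concrete, here is one way a "30% prefix of ai.txt" could be taken. This is a sketch under assumptions: the PR does not say whether the slice is measured in lines, characters or tokens, and load_spec_prefix is a hypothetical name, not the function in the harness. The version below cuts on line boundaries so the prompt never ends mid-rule.

```python
def load_spec_prefix(spec_text: str, fraction: float = 0.30) -> str:
    """Return roughly the first `fraction` of the spec, cut at a line boundary.

    Assumption: slicing by line count. The real harness may slice differently.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    lines = spec_text.splitlines(keepends=True)
    if not lines:
        return ""
    keep = max(1, round(len(lines) * fraction))  # always keep at least one line
    return "".join(lines[:keep])


# Stand-in for ai.txt: ten one-line rules.
spec = "".join(f"rule {i}\n" for i in range(10))
prefix = load_spec_prefix(spec)
print(prefix.count("\n"))  # 3
```

A prefix slice only approximates modular loading: it keeps whatever happens to come first in the file, not the skills a given task needs, which is why the follow-up to wire in ilo skill get matters.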

Test plan

  • ./run_benchmark.sh --mock runs end-to-end and writes
    results/aggregated.json, data/aggregated.csv, and 4 PNG+SVG charts.
  • All 300 mock cells succeed (mock returns ref impls on retry, which
    validates that every ref impl actually compiles via ilo --ast and
    zero check).
  • Python, ilo and zero compile adapters all exercised against real
    binaries (/Users/dan/.cargo/bin/ilo, ~/code/ilo-lang/zero/.zero/bin/zero).
  • Live ./run_benchmark.sh --full against the Anthropic API -
    blocked on ANTHROPIC_API_KEY access, not run in this PR.
  • Chart visual sanity-check against expected shapes - the mock data
    won't show the strategy crossover; that arrives with the live run.
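
The compile-check step that the adapters perform can be pictured with a small subprocess wrapper. This is a hedged sketch, not the code in variants.py: compile_check is a hypothetical helper, and passing the source file as the final argument is an assumption about how ilo --ast and zero check are invoked. The runnable demonstration uses Python's own py_compile module so it works without either binary installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List, Tuple


def compile_check(source: str, command: List[str], suffix: str,
                  timeout: float = 30.0) -> Tuple[bool, str]:
    """Write `source` to a temp file, run `command + [path]`, and return
    (ok, diagnostics). ok is True only on exit code 0."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / f"candidate{suffix}"
        path.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(command + [str(path)], capture_output=True,
                                  text=True, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Missing binary or hung compiler counts as a failed check.
            return False, f"{type(exc).__name__}: {exc}"
    return proc.returncode == 0, (proc.stderr or proc.stdout).strip()


# Stand-in checker that needs no external binary.
PYTHON_CHECK = [sys.executable, "-m", "py_compile"]
# Assumed shapes for the other adapters (argument order not confirmed here):
#   ["ilo", "--ast"]   and   ["zero", "check"]

ok, _ = compile_check("print('hi')\n", PYTHON_CHECK, ".py")
bad, diagnostics = compile_check("def broken(:\n", PYTHON_CHECK, ".py")
print(ok, bad)  # True False
```

The diagnostics string is what a repair loop would feed back to the model on the next attempt.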

Follow-ups

  • Run the live matrix and replace the mock dataset.
  • Wire ilo skill get <name> into load_spec_for_variant so
    ilo-post-phase-1 measures the real modular skill load, not a prefix
    proxy.
  • Add Zero-runnable test harnesses per task to lift the compile-only
    limitation.
  • Add the ilo-post-phase-4 variant once typed fix plans ship.

Adds research/closed-loop-bench/ - a driver, per-language adapters, task
definitions, and system prompts for the closed-loop benchmark that measures
end-to-end token cost across generation -> compile -> repair -> retry.

Variants under test: python, ilo-pre-phase-1, ilo-post-phase-1, zero.
Adapters call real ilo and zero binaries for compile-check; ilo also runs
test cases via subprocess. Zero is compile-only (documented limitation).

Driver supports --mock for offline plumbing validation (no API key) and
--full for the live Anthropic matrix. Prompt-cache aware; budget cap; 3-seed
median + variance aggregation. Charts script produces the 4 canonical
figures from aggregated.json.

The whole research/ tree was previously ignored. Add a targeted exception
for closed-loop-bench/ and BENCHMARK-METHODOLOGY.md while keeping per-cell
raw results, zero compile caches, and pycache out of the repo.

Output of running the harness in --mock mode: 4 variants x 5 tasks x
5 session lengths x 1 model x 3 seeds = 300 cells, aggregated to 100 rows
in results/aggregated.json. CSV mirror in data/aggregated.csv.
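
The 300 -> 100 reduction is a group-by over everything except the seed. A minimal sketch of that aggregation follows, assuming per-cell dicts with the field names shown and population variance; the actual field names in aggregated.json and the variance definition used by the driver are not stated in this PR, and aggregate is a hypothetical helper.

```python
from collections import defaultdict
from statistics import median, pvariance


def aggregate(cells):
    """Collapse per-seed cells into one row per (variant, task, N, model)."""
    groups = defaultdict(list)
    for cell in cells:
        key = (cell["variant"], cell["task"],
               cell["session_length"], cell["model"])
        groups[key].append(cell)

    rows = []
    for (variant, task, n, model), group in sorted(groups.items()):
        tokens = [c["total_tokens"] for c in group]
        rows.append({
            "variant": variant,
            "task": task,
            "session_length": n,
            "model": model,
            "seeds": len(group),
            "median_tokens": median(tokens),
            # Assumption: population variance; sample variance is equally plausible.
            "variance_tokens": pvariance(tokens),
            "success_rate": sum(c["success"] for c in group) / len(group),
        })
    return rows


# Three seeds of one cell, with made-up token counts.
cells = [
    {"variant": "python", "task": "t1", "session_length": 1, "model": "mock",
     "seed": seed, "total_tokens": tokens, "success": True}
    for seed, tokens in enumerate([100, 130, 160])
]
rows = aggregate(cells)
print(rows[0]["median_tokens"], rows[0]["variance_tokens"])
```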

All four canonical figures generated to charts/ as PNG + SVG:
  chart1-tokens-vs-n        total tokens per task vs session length
  chart2-cost-composition   cost composition at the largest N
  chart3-success-rate       success rate by variant
  chart4-usd-cost           USD per completed task
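
For chart4, "USD per completed task" can be read as total spend divided by the number of successful cells. The sketch below takes that reading; it is an assumption, as are the usd_per_completed_task helper, the field names, and the prices, which are placeholders rather than the pricing assumptions in BENCHMARK-METHODOLOGY.md. It also ignores prompt-cache discounts for brevity.

```python
from typing import Iterable, Mapping, Optional


def usd_per_completed_task(cells: Iterable[Mapping],
                           input_usd_per_mtok: float,
                           output_usd_per_mtok: float) -> Optional[float]:
    """Total spend across all cells divided by the number of successes.

    Failed cells still cost money, so their spend stays in the numerator.
    Returns None when nothing succeeded.
    """
    cells = list(cells)
    total_usd = sum(
        c["input_tokens"] * input_usd_per_mtok
        + c["output_tokens"] * output_usd_per_mtok
        for c in cells
    ) / 1_000_000
    completed = sum(1 for c in cells if c["success"])
    return total_usd / completed if completed else None


# Made-up cells and placeholder prices (USD per million tokens).
example_cells = [
    {"input_tokens": 1_000_000, "output_tokens": 100_000, "success": True},
    {"input_tokens": 500_000, "output_tokens": 50_000, "success": False},
]
cost = usd_per_completed_task(example_cells, input_usd_per_mtok=2.0,
                              output_usd_per_mtok=10.0)
print(cost)  # 4.5
```

Keeping failed cells in the numerator means a variant with a low success rate shows a higher cost per completed task even if its individual calls are cheap.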

These numbers are NOT publishable - the mock LLM is a deterministic stub
returning broken-then-fixed reference impls. They exist to prove the
loop, the adapters, the aggregation, and the chart generator all work
end-to-end. The live matrix is gated on ANTHROPIC_API_KEY and will
overwrite results/ + charts/ when run.

Two reference docs alongside the harness:

- research/closed-loop-bench/README.md: quick start, repo layout, what's
  shipped in this PR (harness + mock dataset), what's deferred (the live
  matrix), how to extend with new languages or tasks.

- research/BENCHMARK-METHODOLOGY.md: how to reproduce, what each metric
  means, the per-variant adapter behaviour, the repair-loop algorithm,
  pricing assumptions, and the four documented limitations: zero is
  compile-only, ilo-post-phase-1 is simulated via a prefix slice of
  ai.txt, ilo-post-phase-4 is deferred until Phase 4 lands, and the
  shipped numbers are mock-only.

@codecov

codecov Bot commented May 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

@danieljohnmorris danieljohnmorris merged commit c79b85a into main May 18, 2026
5 checks passed
@danieljohnmorris danieljohnmorris deleted the feature/phase2-closed-loop-bench branch May 18, 2026 20:02