Fix resumed multi-run cost attribution medians by orban · Pull Request #31 · orban/intent-layer

orban · 2026-04-04T07:09:55Z

Summary

- replay the cost attribution estimator implementation and edge-case fixes onto this branch
- fix resumed multi-run summary recomputation so medians use per-run cost data while totals still use summed run totals
- add a regression test covering resumed multi-run cost medians vs totals
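
The median/total distinction in the second item can be illustrated with a minimal sketch. `RunCost` and `summarize_costs` are hypothetical names chosen for illustration, not the harness's actual types or functions. The point is that the median is taken over the individual per-run values, while the total is their sum.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RunCost:
    """Cost recorded for one run (hypothetical shape, not the harness's type)."""
    run_id: str
    cost_usd: float


def summarize_costs(runs: list[RunCost]) -> dict[str, float]:
    """Median over per-run values; total as the sum of run totals."""
    per_run = [r.cost_usd for r in runs]
    return {
        "total_cost_usd": sum(per_run),
        "median_cost_usd": median(per_run) if per_run else 0.0,
    }


# A resumed session: two runs restored from a checkpoint plus one new run.
restored = [RunCost("run-1", 1.0), RunCost("run-2", 5.0)]
fresh = [RunCost("run-3", 3.0)]
summary = summarize_costs(restored + fresh)
```

Here the total is 9.0 while the median is 3.0. Taking a median over already-summed totals would lose the per-run spread.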

Testing

pytest eval-harness/tests/test_reporter.py eval-harness/tests/test_task_runner.py eval-harness/tests/test_resume.py eval-harness/tests/test_agentbench.py -q

…to eval harness

- store run_config in checkpoints; warn on resume if config mismatches
- classify errors as infra/timeout/genuine; skip genuine failures on resume
- add --retry-all flag to override and re-run genuine failures too
- write per-trial JSON to results/trials/ for ls-level observability
- 13 new tests covering all three features

Entire-Checkpoint: 7fb98ee370d1
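
The classify-then-skip behaviour described above could look roughly like the following sketch. The marker strings and the names `ErrorKind`, `classify_error`, and `should_rerun` are assumptions made for illustration; the real prefixes used by the harness are not shown in this PR.

```python
from enum import Enum


class ErrorKind(Enum):
    INFRA = "infra"
    TIMEOUT = "timeout"
    GENUINE = "genuine"


# Hypothetical markers; the harness's real error prefixes are not shown here.
INFRA_MARKERS = ("[infra]", "docker:")
TIMEOUT_MARKERS = ("[timeout]",)


def classify_error(message: str) -> ErrorKind:
    """Bucket an error message by its prefix; anything unknown is genuine."""
    lowered = message.lower()
    if lowered.startswith(TIMEOUT_MARKERS):
        return ErrorKind.TIMEOUT
    if lowered.startswith(INFRA_MARKERS):
        return ErrorKind.INFRA
    return ErrorKind.GENUINE


def should_rerun(message: str, retry_all: bool = False) -> bool:
    """On resume, skip genuine failures unless retry_all is set."""
    return retry_all or classify_error(message) is not ErrorKind.GENUINE
```

With `retry_all=True`, mirroring the `--retry-all` flag, every failed trial is re-run regardless of its class.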

After each batch, classifies failures as infra vs genuine. If infra failures detected: checks Docker health, restarts if needed, reduces parallelism, resets circuit breaker, and retries. Max 2 retry rounds.

Entire-Checkpoint: 806ec810be18

Status file (.eval-status.json) updated after each result with machine-readable state: workers, pass/fail rates, paused flag. Control dir (.eval-control/) accepts commands: pause, resume, set-workers N, skip-task <id>. Commands consumed on read. Enables Ralph Loop or any external agent to manage running evals.

Entire-Checkpoint: 5f55311d2e2a
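
"Commands consumed on read" can be sketched as follows. This assumes one command per file inside the control directory, processed in lexicographic filename order; the actual file layout of `.eval-control/` is not described in this PR, and `consume_commands` is a hypothetical name.

```python
from pathlib import Path


def consume_commands(control_dir: Path) -> list[str]:
    """Read each command file once, then delete it so it is not re-applied."""
    commands = []
    for path in sorted(control_dir.glob("*")):
        if path.is_file():
            commands.append(path.read_text().strip())
            path.unlink()  # consuming the command: a second read sees nothing
    return commands
```

Deleting the file right after reading keeps a command such as `pause` from being applied again on the next poll.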

- Wire fisher_exact_test into reporter for per-task significance testing
- Add _compute_recommendations() flagging ceiling/floor/infra-only tasks
- Add Per-Task Analysis table and Recommendations section to markdown output
- Add lib/monitor.py: polling-based eval supervisor with stall detection, Docker recovery, infra-task skipping, and worker scaling
- 79 new tests across stats, reporter, and monitor modules

Entire-Checkpoint: 722abb53d525
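
For readers unfamiliar with the significance test mentioned above, here is a standalone sketch of a two-sided Fisher exact test on a 2x2 pass/fail table, using only the standard library. It is not the repository's `fisher_exact_test`; the function name and table layout below are assumptions for illustration.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout: row 1 = condition A (passes, fails),
    row 2 = condition B (passes, fails).
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x passes in row 1 with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-7))
    return min(1.0, p)
```

For example, 5/5 passes in one condition against 0/5 in the other gives p = 2/252, while identical 3/5 pass counts give p = 1.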

- AGENTbench loader (HuggingFace dataset) and runner (Docker-based eval)
- run-agentbench CLI command with 4 conditions (none/flat/human/intent_layer)
- Dynamic condition discovery in reporter (no more hardcoded condition list)
- Path traversal protection in write_test_infrastructure
- Docker --network none for test isolation
- Thread-safe temp dirs (PID + thread ID)
- Checkpoint batching (every 10 results vs every 1)
- Monitor uses Reporter.INFRA_ERROR_PREFIXES (no drift)
- Set-based infra result filtering (replaces fragile list.remove)
- Empty dict all() guard in pre-validation

Entire-Checkpoint: 815f2403cb52
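
The path traversal protection listed above is commonly implemented by resolving the target path and checking that it stays under the base directory. The sketch below shows that general technique; `safe_join` is a hypothetical helper, not the code in write_test_infrastructure.

```python
from pathlib import Path


def safe_join(base_dir: Path, relative: str) -> Path:
    """Resolve `relative` under `base_dir`, rejecting escapes such as '../'."""
    base = base_dir.resolve()
    target = (base / relative).resolve()
    # An absolute `relative` replaces `base` entirely, so it is caught here too.
    if target != base and base not in target.parents:
        raise ValueError(f"path escapes base directory: {relative!r}")
    return target
```

Resolving both sides first means symlinks and `..` segments are normalized before the containment check.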

…ote docker support

Three fixes from run 4 analysis:

- Regression eval: only fail when a golden-passing test now fails (was requiring 100% pass rate, which is impossible when repos have 14-83 pre-existing failures in the baseline)
- strip_docs: preserve README.md variants (setup.py reads them)
- docker_runner: add EVAL_DOCKER_HOST support for remote x86 execution via rsync, avoiding QEMU emulation on Apple Silicon

Also removes .index-cache-preserve/ (context files now generated dynamically per run).

Entire-Checkpoint: 9f4d4a6248d7
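
The regression-eval rule in the first fix can be sketched with set logic. The function names and the dict-of-booleans result shape are assumptions for illustration; the sketch also assumes that a test missing from the candidate run counts as failing.

```python
def regression_failures(golden: dict[str, bool], candidate: dict[str, bool]) -> set[str]:
    """Tests that passed in the golden run but do not pass in the candidate run."""
    golden_passing = {name for name, passed in golden.items() if passed}
    # Assumption: a test absent from the candidate run is treated as failing.
    return {name for name in golden_passing if not candidate.get(name, False)}


def regression_eval_passes(golden: dict[str, bool], candidate: dict[str, bool]) -> bool:
    """Pass unless at least one golden-passing test regressed."""
    return not regression_failures(golden, candidate)
```

Pre-existing failures (tests that already fail in the golden run) never count against the candidate, which is what makes the check satisfiable on repos with a failing baseline.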

UGREEN NAS (chronos) runs an rsync daemon that intercepts all rsync connections and rejects paths outside configured modules. tar piped through SSH bypasses this entirely and works reliably.

Entire-Checkpoint: e86fbbd52da2

Prevents two issues from overnight eval runs:

- SSH agent key expiry caused all workers to hang indefinitely on stale connections. Added ConnectTimeout, ServerAliveInterval, and subprocess timeout=300s so failures surface within 5 minutes.
- sync_from_remote was transferring .venv dirs (4GB+) back from chronos. Added excludes for .venv, node_modules, __pycache__, etc.

Entire-Checkpoint: aa17a443b57d
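
A sketch of how the SSH options and the subprocess timeout might combine. `ConnectTimeout` and `ServerAliveInterval` are standard OpenSSH client options, and 300 seconds comes from the commit message; the values 10 and 15 and both function names are assumptions for illustration.

```python
import subprocess


def build_ssh_command(host: str, remote_cmd: str) -> list[str]:
    """Build an ssh invocation that fails fast on dead or stale connections."""
    return [
        "ssh",
        "-o", "ConnectTimeout=10",       # assumed value: give up connecting after 10s
        "-o", "ServerAliveInterval=15",  # assumed value: probe the server every 15s
        host,
        remote_cmd,
    ]


def run_remote(host: str, remote_cmd: str, timeout: float = 300.0) -> subprocess.CompletedProcess:
    """Run a remote command; raises subprocess.TimeoutExpired after `timeout` seconds."""
    return subprocess.run(
        build_ssh_command(host, remote_cmd),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

The SSH options catch a dead connection, while the subprocess timeout is the outer bound that surfaces any remaining hang within five minutes.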

Replace ephemeral docker run with persistent containers (docker run -d + docker exec) so setup runs once per task instead of 3x. Add start_container, exec_in_container, stop_container, copy_into_container to docker_runner.py.

Fix git "dubious ownership" error caused by macOS tar overlay changing file UIDs inside containers (CVE-2022-24765). Add safe.directory config before overlay. Include stderr/stdout tail in setup error messages.

Entire-Checkpoint: c11a4a4ced3d
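
The persistent-container flow can be sketched as a sequence of Docker CLI invocations. The helper names below are deliberately different from the functions added to docker_runner.py, whose signatures are not shown in this PR. Using `sleep infinity` as the keep-alive process and `/workspace` as the repository path are assumptions.

```python
def start_cmd(image: str) -> list[str]:
    # Detached container kept alive by a long-running no-op (assumed keep-alive).
    return ["docker", "run", "-d", image, "sleep", "infinity"]


def exec_cmd(container_id: str, command: str) -> list[str]:
    # Login shell so PATH additions from /etc/profile are picked up.
    return ["docker", "exec", container_id, "bash", "-lc", command]


def stop_cmd(container_id: str) -> list[str]:
    return ["docker", "rm", "-f", container_id]


def task_commands(container_id: str, setup: str, test_cmds: list[str]) -> list[list[str]]:
    """Setup runs once; each condition's tests reuse the same container."""
    cmds = [
        # Mark the repo as safe before files with foreign UIDs are overlaid.
        exec_cmd(container_id, "git config --global --add safe.directory /workspace"),
        exec_cmd(container_id, setup),
    ]
    cmds += [exec_cmd(container_id, c) for c in test_cmds]
    cmds.append(stop_cmd(container_id))
    return cmds
```

With three test commands the setup command still appears only once in the plan, which is the saving the commit describes.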

resolve conflicts in eval-harness/lib/{agentbench_runner,cli,reporter,task_runner}.py and tests/{test_agentbench,test_task_runner}.py by taking main's versions and re-applying the cost-attribution commits (ad44f3c, 3d38aa0, 0412446) on top. also thread skill_generation=skill_metrics through the empty-run and timeout TaskResult paths that main added.

orban added 20 commits February 23, 2026 20:30

default eval model to sonnet for reproducibility

4aae36a

Entire-Checkpoint: 5384ed119f2a

bump default parallelism from 2 to 8 workers

b981018

Entire-Checkpoint: 81278d45e27d

add circuit breaker to skip remaining reps after repeated failures

d5bc415

Pre-validation failures trip at task level (all conditions share Docker setup), other failures trip at task+condition level. Threshold of 2 accounts for in-flight parallel workers.

Entire-Checkpoint: b2252e9cb3d3
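
A minimal sketch of the two-level tripping rule described in this commit. The `CircuitBreaker` class and its methods are hypothetical; only the threshold of 2 and the task vs. task+condition scoping come from the commit message.

```python
import threading
from collections import Counter
from typing import Optional, Tuple

Key = Tuple[str, Optional[str]]


class CircuitBreaker:
    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self._failures: Counter = Counter()
        self._lock = threading.Lock()  # workers record failures concurrently

    @staticmethod
    def _key(task_id: str, condition: str, pre_validation: bool) -> Key:
        # Pre-validation failures affect every condition of the task.
        return (task_id, None) if pre_validation else (task_id, condition)

    def record_failure(self, task_id: str, condition: str, pre_validation: bool = False) -> None:
        with self._lock:
            self._failures[self._key(task_id, condition, pre_validation)] += 1

    def is_open(self, task_id: str, condition: str) -> bool:
        """True when remaining reps for this task/condition should be skipped."""
        with self._lock:
            return (
                self._failures[(task_id, None)] >= self.threshold
                or self._failures[(task_id, condition)] >= self.threshold
            )
```

A threshold above 1 tolerates a single failure reported by a worker that was already in flight when the first failure arrived.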

fix stale test references to renamed setup_workspace method

785c257

Entire-Checkpoint: 94115d70169e

add network parameter to run_in_docker, default to bridge

b53c614

--network none breaks setup commands that need pip install. Default to bridge (Docker's default) and expose the parameter so callers can opt into none for pure-test phases later.

Entire-Checkpoint: 8118f4f3017a

fix agentbench loader split and docker shell compatibility

2d42eab

- agentbench_loader: use split="train" (dataset has no "test" split)
- docker_runner: use bash instead of sh (AGENTbench setup uses `source`)

Entire-Checkpoint: 0afc9c1d9169

use login shell in docker to pick up ~/.local/bin PATH

9ab8632

AGENTbench images install tools like uv to /root/.local/bin which only gets added to PATH via /etc/profile in login shells. bash -lc instead of bash -c fixes exit 127 for repos using uv.

Entire-Checkpoint: 5d8c2af91c5d

pull docker images on remote host when EVAL_DOCKER_HOST is set

2c9ea2e

The pre-pull step was running `docker pull` locally even when Docker execution happens on a remote host via SSH. Now uses `ssh $host docker pull` when EVAL_DOCKER_HOST is configured.

Entire-Checkpoint: cfe3bf0bccca
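
The host-dependent pull can be sketched as a small command builder. `docker_pull_command` is a hypothetical helper; only the `EVAL_DOCKER_HOST` variable and the `ssh $host docker pull` form come from the commit message.

```python
import os
from typing import Mapping, Optional


def docker_pull_command(image: str, env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Pull locally by default, or via ssh when EVAL_DOCKER_HOST is set."""
    env = os.environ if env is None else env
    host = env.get("EVAL_DOCKER_HOST")
    pull = ["docker", "pull", image]
    return ["ssh", host, *pull] if host else pull
```

Passing `env` explicitly keeps the helper easy to test without mutating the process environment.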

Add cost attribution estimator

ad44f3c

Nightshift-Task: cost-attribution
Nightshift-Ref: https://github.com/marcus/nightshift

Fix cost attribution edge cases

3d38aa0

Nightshift-Task: cost-attribution
Nightshift-Ref: https://github.com/marcus/nightshift

Fix resumed cost medians

0412446

Nightshift-Task: cost-attribution
Nightshift-Ref: https://github.com/marcus/nightshift

This was referenced Apr 15, 2026

Add eval-harness cost attribution estimator #29

Closed

Fix cost attribution edge cases #30

Closed

orban merged commit 58d73dc into main Apr 15, 2026

orban merged 21 commits into main from cost-attribution-fix3
