Fix resumed multi-run cost attribution medians #31

Merged
orban merged 21 commits into main from cost-attribution-fix3
Apr 15, 2026

Conversation

@orban (Owner) commented Apr 4, 2026

Summary

  • replay the cost attribution estimator implementation and edge-case fixes onto this branch
  • fix resumed multi-run summary recomputation so medians use per-run cost data while totals still use summed run totals
  • add a regression test covering resumed multi-run cost medians vs totals
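
The median/total distinction in the second bullet can be illustrated with a minimal sketch. It assumes one cost figure per run, stored as integer cents to keep the arithmetic exact; `summarize_costs` and the dictionary keys are hypothetical names, not the reporter's actual API.

```python
from statistics import median
from typing import Dict, List


def summarize_costs(per_run_cost_cents: List[int]) -> Dict[str, float]:
    """Summarize one task/condition cell across several runs.

    per_run_cost_cents holds one cost figure per run (hypothetical shape).
    """
    return {
        # Total spend is the sum of every run's total.
        "total_cost_cents": sum(per_run_cost_cents),
        # The median is taken over the individual runs, not over the sum.
        "median_cost_cents": median(per_run_cost_cents),
    }


summary = summarize_costs([10, 30, 20])
print(summary)
```

With three runs costing 10, 30, and 20 cents, the total is 60 and the median is 20. Computing the median from the summed figure would report 60 instead, which is the kind of mismatch a regression test for this fix would need to catch.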

Testing

  • pytest eval-harness/tests/test_reporter.py eval-harness/tests/test_task_runner.py eval-harness/tests/test_resume.py eval-harness/tests/test_agentbench.py -q

orban added 20 commits February 23, 2026 20:30
…to eval harness

- store run_config in checkpoints; warn on resume if config mismatches
- classify errors as infra/timeout/genuine; skip genuine failures on resume
- add --retry-all flag to override and re-run genuine failures too
- write per-trial JSON to results/trials/ for ls-level observability
- 13 new tests covering all three features

Entire-Checkpoint: 7fb98ee370d1
Entire-Checkpoint: 5384ed119f2a
Entire-Checkpoint: 81278d45e27d
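
The resume behaviour described in the commit above (classify errors, skip genuine failures, re-run them only with --retry-all) might look roughly like the following sketch. The marker strings and the function names `classify_error` and `should_rerun` are assumptions made for illustration; the harness's real classification rules are not shown in this PR.

```python
from typing import Optional

# Hypothetical substrings that mark an infrastructure problem.
INFRA_MARKERS = ("docker daemon", "connection refused")


def classify_error(message: Optional[str]) -> str:
    """Return 'none', 'timeout', 'infra', or 'genuine' for an error message."""
    if not message:
        return "none"
    text = message.lower()
    if "timeout" in text or "timed out" in text:
        return "timeout"
    if any(marker in text for marker in INFRA_MARKERS):
        return "infra"
    return "genuine"


def should_rerun(error: Optional[str], retry_all: bool = False) -> bool:
    """Decide on resume whether a checkpointed result needs another attempt."""
    kind = classify_error(error)
    if kind == "none":
        return False  # finished without error, nothing to redo
    if kind == "genuine":
        return retry_all  # genuine failures are re-run only on request
    return True  # infra and timeout failures are always retried


print(should_rerun("Cannot connect to the Docker daemon"))
print(should_rerun("AssertionError: 2 != 3"))
print(should_rerun("AssertionError: 2 != 3", retry_all=True))
```
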
Pre-validation failures trip at task level (all conditions share
Docker setup), other failures trip at task+condition level.
Threshold of 2 accounts for in-flight parallel workers.

Entire-Checkpoint: b2252e9cb3d3
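
A circuit breaker with the two trip scopes described above could be sketched as follows. The class and method names are hypothetical; only the scoping rule and the threshold of 2 come from the commit message.

```python
import threading
from collections import Counter


class CircuitBreaker:
    """Counts failures per task (pre-validation) or per task+condition."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self._failures: Counter = Counter()
        self._lock = threading.Lock()  # workers record failures in parallel

    def record_failure(self, task_id: str, condition: str, pre_validation: bool) -> None:
        # Pre-validation failures are keyed by task only, because every
        # condition shares the same Docker setup.
        key = (task_id, None) if pre_validation else (task_id, condition)
        with self._lock:
            self._failures[key] += 1

    def is_open(self, task_id: str, condition: str) -> bool:
        with self._lock:
            task_level = self._failures[(task_id, None)]
            cell_level = self._failures[(task_id, condition)]
        return task_level >= self.threshold or cell_level >= self.threshold


breaker = CircuitBreaker()
breaker.record_failure("task-1", "flat", pre_validation=False)
breaker.record_failure("task-1", "flat", pre_validation=False)
print(breaker.is_open("task-1", "flat"), breaker.is_open("task-1", "human"))
```

A threshold above 1 tolerates a failure from a worker that was already in flight when the first failure was recorded.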
After each batch, classifies failures as infra vs genuine. If infra
failures detected: checks Docker health, restarts if needed, reduces
parallelism, resets circuit breaker, and retries. Max 2 retry rounds.

Entire-Checkpoint: 806ec810be18
Status file (.eval-status.json) updated after each result with
machine-readable state: workers, pass/fail rates, paused flag.
Control dir (.eval-control/) accepts commands: pause, resume,
set-workers N, skip-task <id>. Commands consumed on read.
Enables Ralph Loop or any external agent to manage running evals.

Entire-Checkpoint: 5f55311d2e2a
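
"Commands consumed on read" can be sketched as a read-then-delete loop over the control directory. The on-disk layout here is an assumption: each command is taken to be a file whose name is the command and whose content is an optional argument. `consume_commands` is a hypothetical helper, not the harness's function.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def consume_commands(control_dir: Path) -> List[Tuple[str, str]]:
    """Read every pending command file, then delete it so it runs only once."""
    commands = []
    for path in sorted(control_dir.iterdir()):
        if not path.is_file():
            continue
        commands.append((path.name, path.read_text().strip()))
        path.unlink()  # consumed: a second read will not see this command
    return commands


# Demonstration in a throwaway directory standing in for .eval-control/.
with tempfile.TemporaryDirectory() as tmp:
    control = Path(tmp)
    (control / "pause").write_text("")
    (control / "set-workers").write_text("4\n")
    first = consume_commands(control)
    second = consume_commands(control)

print(first)
print(second)
```

The second call returns an empty list because the first call already removed both files.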
- Wire fisher_exact_test into reporter for per-task significance testing
- Add _compute_recommendations() flagging ceiling/floor/infra-only tasks
- Add Per-Task Analysis table and Recommendations section to markdown output
- Add lib/monitor.py: polling-based eval supervisor with stall detection,
  Docker recovery, infra-task skipping, and worker scaling
- 79 new tests across stats, reporter, and monitor modules

Entire-Checkpoint: 722abb53d525
- AGENTbench loader (HuggingFace dataset) and runner (Docker-based eval)
- run-agentbench CLI command with 4 conditions (none/flat/human/intent_layer)
- Dynamic condition discovery in reporter (no more hardcoded condition list)
- Path traversal protection in write_test_infrastructure
- Docker --network none for test isolation
- Thread-safe temp dirs (PID + thread ID)
- Checkpoint batching (every 10 results vs every 1)
- Monitor uses Reporter.INFRA_ERROR_PREFIXES (no drift)
- Set-based infra result filtering (replaces fragile list.remove)
- Empty dict all() guard in pre-validation

Entire-Checkpoint: 815f2403cb52
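
One common way to implement the path traversal protection listed above is to resolve the target path and check that it still lies under the workspace root. This is a sketch of that general technique; `safe_join` and the example root are hypothetical, and the actual check in write_test_infrastructure may differ.

```python
from pathlib import Path


def safe_join(root: Path, relative: str) -> Path:
    """Join relative onto root, rejecting anything that escapes root."""
    resolved_root = root.resolve()
    candidate = (resolved_root / relative).resolve()
    try:
        # relative_to raises ValueError when candidate is outside the root.
        candidate.relative_to(resolved_root)
    except ValueError:
        raise ValueError(f"path escapes workspace: {relative!r}") from None
    return candidate


workspace = Path("/tmp/eval-workspace")  # hypothetical workspace root
print(safe_join(workspace, "tests/test_example.py"))

try:
    safe_join(workspace, "../../etc/passwd")
except ValueError as exc:
    print("rejected:", exc)
```

Absolute inputs such as `/etc/passwd` are rejected as well, because joining an absolute path onto the root discards the root.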
--network none breaks setup commands that need pip install.
Default to bridge (Docker's default) and expose the parameter
so callers can opt into none for pure-test phases later.

Entire-Checkpoint: 8118f4f3017a
- agentbench_loader: use split="train" (dataset has no "test" split)
- docker_runner: use bash instead of sh (AGENTbench setup uses `source`)

Entire-Checkpoint: 0afc9c1d9169
AGENTbench images install tools like uv to /root/.local/bin which
only gets added to PATH via /etc/profile in login shells. bash -lc
instead of bash -c fixes exit 127 for repos using uv.

Entire-Checkpoint: 5d8c2af91c5d
…ote docker support

Three fixes from run 4 analysis:

- Regression eval: only fail when a golden-passing test now fails (was
  requiring 100% pass rate, which is impossible when repos have 14-83
  pre-existing failures in the baseline)
- strip_docs: preserve README.md variants (setup.py reads them)
- docker_runner: add EVAL_DOCKER_HOST support for remote x86 execution
  via rsync, avoiding QEMU emulation on Apple Silicon

Also removes .index-cache-preserve/ (context files now generated
dynamically per run).

Entire-Checkpoint: 9f4d4a6248d7
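
The revised regression rule (fail only when a test that passed in the golden baseline now fails) reduces to a set comparison. The sketch below assumes test outcomes are available as name-to-bool mappings and treats a test missing from the candidate run as failing; both are assumptions, and `regression_failures` is a hypothetical name.

```python
from typing import Dict, Set


def regression_failures(golden: Dict[str, bool], candidate: Dict[str, bool]) -> Set[str]:
    """Return tests that passed in the golden run but do not pass now."""
    return {
        name
        for name, passed in golden.items()
        # Assumption: a test absent from the candidate run counts as failing.
        if passed and not candidate.get(name, False)
    }


golden = {"test_a": True, "test_b": False, "test_c": True}
unchanged = {"test_a": True, "test_b": False, "test_c": True}
regressed = {"test_a": True, "test_b": False, "test_c": False}

print(regression_failures(golden, unchanged))
print(regression_failures(golden, regressed))
```

In the `unchanged` case the result is empty even though `test_b` fails, because it was already failing in the baseline. A 100% pass-rate requirement would have rejected that run.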
UGREEN NAS (chronos) runs an rsync daemon that intercepts all rsync
connections and rejects paths outside configured modules. tar piped
through SSH bypasses this entirely and works reliably.

Entire-Checkpoint: e86fbbd52da2
The pre-pull step was running `docker pull` locally even when Docker
execution happens on a remote host via SSH. Now uses `ssh $host docker
pull` when EVAL_DOCKER_HOST is configured.

Entire-Checkpoint: cfe3bf0bccca
Prevents two issues from overnight eval runs:
- SSH agent key expiry caused all workers to hang indefinitely on
  stale connections. Added ConnectTimeout, ServerAliveInterval, and
  subprocess timeout=300s so failures surface within 5 minutes.
- sync_from_remote was transferring .venv dirs (4GB+) back from
  chronos. Added excludes for .venv, node_modules, __pycache__, etc.

Entire-Checkpoint: aa17a443b57d
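
The SSH hardening described above combines client-side options with a subprocess timeout. The sketch below builds such a command; `ConnectTimeout` and `ServerAliveInterval` are standard OpenSSH options, but the specific values (10 and 15 seconds) and the helper names are assumptions. Only the 300-second subprocess timeout is stated in the commit message.

```python
import subprocess
from typing import List, Sequence


def build_ssh_command(
    host: str,
    remote_cmd: Sequence[str],
    connect_timeout: int = 10,  # assumed value
    alive_interval: int = 15,   # assumed value
) -> List[str]:
    """Build an ssh invocation that gives up instead of hanging on a dead link."""
    return [
        "ssh",
        "-o", f"ConnectTimeout={connect_timeout}",
        "-o", f"ServerAliveInterval={alive_interval}",
        host,
        *remote_cmd,
    ]


def run_remote(host: str, remote_cmd: Sequence[str]) -> subprocess.CompletedProcess:
    """Run the command; subprocess.TimeoutExpired is raised after 300 seconds."""
    return subprocess.run(
        build_ssh_command(host, remote_cmd),
        capture_output=True,
        text=True,
        timeout=300,
    )


command = build_ssh_command("chronos", ["docker", "info"])
print(" ".join(command))
```

`run_remote` is defined but not called here, since it needs a reachable host.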
Replace ephemeral docker run with persistent containers (docker run -d +
docker exec) so setup runs once per task instead of 3x. Add
start_container, exec_in_container, stop_container, copy_into_container
to docker_runner.py.

Fix git "dubious ownership" error caused by macOS tar overlay changing
file UIDs inside containers (CVE-2022-24765). Add safe.directory config
before overlay. Include stderr/stdout tail in setup error messages.

Entire-Checkpoint: c11a4a4ced3d
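
The persistent-container pattern (`docker run -d` once, then `docker exec` per step) can be sketched as a command plan. The helper names, the `sleep infinity` keep-alive, and the example image are assumptions; they are not the signatures of the functions added to docker_runner.py.

```python
from typing import List, Sequence


def start_cmd(image: str, name: str) -> List[str]:
    # Keep the container alive in the background (assumes the image has `sleep`).
    return ["docker", "run", "-d", "--name", name, image, "sleep", "infinity"]


def exec_cmd(name: str, script: str) -> List[str]:
    # bash -lc runs a login shell, so PATH additions from /etc/profile apply.
    return ["docker", "exec", name, "bash", "-lc", script]


def stop_cmd(name: str) -> List[str]:
    return ["docker", "rm", "-f", name]


def plan_task(image: str, name: str, setup: str, test_scripts: Sequence[str]) -> List[List[str]]:
    """One start, one setup, one exec per test script, one stop."""
    plan = [start_cmd(image, name), exec_cmd(name, setup)]
    plan.extend(exec_cmd(name, script) for script in test_scripts)
    plan.append(stop_cmd(name))
    return plan


plan = plan_task(
    "example/image:latest",  # hypothetical image
    "eval-task-1",
    "pip install -e .",
    ["pytest -q", "pytest -q", "pytest -q"],
)
for step in plan:
    print(" ".join(step))
```

Setup appears once in the plan regardless of how many test scripts follow, which is the saving the commit describes.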
Nightshift-Task: cost-attribution
Nightshift-Ref: https://github.com/marcus/nightshift
Nightshift-Task: cost-attribution
Nightshift-Ref: https://github.com/marcus/nightshift
Nightshift-Task: cost-attribution
Nightshift-Ref: https://github.com/marcus/nightshift
resolve conflicts in eval-harness/lib/{agentbench_runner,cli,reporter,task_runner}.py
and tests/{test_agentbench,test_task_runner}.py by taking main's versions
and re-applying the cost-attribution commits (ad44f3c, 3d38aa0, 0412446)
on top. also thread skill_generation=skill_metrics through the empty-run
and timeout TaskResult paths that main added.
@orban orban merged commit 58d73dc into main Apr 15, 2026