feat(agent-comparison): harden autoresearch live evals by notque · Pull Request #207 · notque/claude-code-toolkit

notque · 2026-03-30T03:15:29Z

Summary

- Harden live registered-skill autoresearch evaluation and blind body scoring.
- Add guardrails and tests for trigger evidence, contamination rejection, holdout selection, and final report consistency.
- Apply a measured socratic-debugging body improvement and document the short runnable workflows.

Validation

pytest -q scripts/tests/test_agent_comparison_optimize_loop.py scripts/tests/test_skill_eval_claude_code.py scripts/tests/test_passk_eval.py scripts/tests/test_eval_compare_optimization.py
python3 -m scripts.skill_eval.run_eval --eval-set skills/agent-comparison/references/read-only-ops-short-tasks.json --skill-path skills/read-only-ops
python3 skills/agent-comparison/scripts/optimize_loop.py --target skills/socratic-debugging/SKILL.md --goal 'Improve the first response so it asks exactly one question, avoids direct diagnosis, avoids code examples, and does not add tool-permission preamble.' --benchmark-tasks skills/agent-comparison/references/socratic-debugging-body-short-tasks.json --optimization-scope body-only --max-iterations 1 --beam-width 1 --candidates-per-parent 1 --revert-streak-limit 1 --output-dir /tmp/cck-autoresearch-pr-proof2 --report /tmp/cck-autoresearch-pr-proof2/report.html --verbose
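
The `--goal` string in the last command names properties of the first response that can be checked mechanically. As a rough illustration only, and not the toolkit's actual scorer (which this PR does not show), two of those properties could be approximated with simple string heuristics. The function name `first_turn_checks` and both heuristics are assumptions made for this sketch.

```python
FENCE = "`" * 3  # Markdown code-fence marker, built here to avoid a literal fence


def first_turn_checks(response: str) -> dict[str, bool]:
    """Hypothetical heuristics for two of the goal's criteria.

    Counting question marks and looking for code fences is a crude proxy;
    it would miss, for example, inline code or a question phrased without
    a question mark.
    """
    return {
        "exactly_one_question": response.count("?") == 1,
        "no_code_example": FENCE not in response,
    }


if __name__ == "__main__":
    print(first_turn_checks("What did you expect this function to return?"))
```

A check like this can only flag obvious violations; judging whether a response "avoids direct diagnosis" would need a reader or a model-based grader.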

Notes

- The base branch is fix/autoresearch-recommendations because the autoresearch infrastructure is not on main.
- This includes the accepted socratic-debugging instruction-body update proven by the short blind body benchmark.

notque · 2026-03-30T03:20:51Z

Current status on 98ecf1a587affda2e09611359dd6dd30132b48cd:

Local validation passed:
- pytest -q scripts/tests/test_agent_comparison_optimize_loop.py scripts/tests/test_skill_eval_claude_code.py scripts/tests/test_passk_eval.py scripts/tests/test_eval_compare_optimization.py
The short live body benchmark still runs on this branch and produced an additional accepted tightening of the socratic-debugging body, which is now included in the head commit.

Workflow note:

GitHub reports no checks for this PR because .github/workflows/test.yml is configured only for push/pull_request on main.
This PR targets fix/autoresearch-recommendations, so the Tests workflow does not trigger for it in its current configuration.

The PR is merge-clean, but no GitHub Actions checks apply to this stacked base branch, so there is no CI signal for it.
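
The workflow note above comes down to a branch filter: for `pull_request` events, GitHub Actions compares the filter against the PR's base branch. The sketch below models that rule in Python. The contents of `.github/workflows/test.yml` are not shown in this PR, so the filter value `["main"]` and the helper name `pull_request_triggers` are assumptions based on the description.

```python
from fnmatch import fnmatchcase

# Assumed filter, taken from the note that the workflow runs only for main.
WORKFLOW_BRANCH_FILTER = ["main"]


def pull_request_triggers(base_branch: str, branch_filter: list[str]) -> bool:
    """Return True if a pull_request event targeting base_branch passes the filter.

    fnmatchcase only approximates GitHub's glob rules (GitHub's single `*`
    does not match `/`, for instance), which is sufficient for literal
    branch names such as "main".
    """
    return any(fnmatchcase(base_branch, pattern) for pattern in branch_filter)


if __name__ == "__main__":
    for base in ("main", "fix/autoresearch-recommendations"):
        print(base, pull_request_triggers(base, WORKFLOW_BRANCH_FILTER))
```

Under this model a PR into fix/autoresearch-recommendations never matches, which is consistent with GitHub reporting no checks here.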

notque added 2 commits March 29, 2026 20:16

feat(agent-comparison): harden autoresearch live evals

2ec703c

fix(socratic-debugging): tighten first-turn question discipline

98ecf1a

notque force-pushed the feature/skill-body-autoresearch-hardening branch from a2ba028 to 98ecf1a March 30, 2026 03:19

notque merged commit 23cad06 into fix/autoresearch-recommendations Mar 30, 2026

notque deleted the feature/skill-body-autoresearch-hardening branch March 30, 2026 03:51

notque mentioned this pull request Mar 30, 2026

feat(agent-comparison): promote autoresearch live eval hardening to main #208

Merged
