
feat(agent-comparison): harden autoresearch live evals#207

Merged
notque merged 2 commits into fix/autoresearch-recommendations from feature/skill-body-autoresearch-hardening
Mar 30, 2026

notque commented Mar 30, 2026

Summary

  • harden live registered-skill autoresearch evaluation and blind body scoring
  • add guardrails/tests for trigger evidence, contamination rejection, holdout selection, and final report consistency
  • apply a measured socratic-debugging body improvement and document the short runnable workflows

Validation

  • pytest -q scripts/tests/test_agent_comparison_optimize_loop.py scripts/tests/test_skill_eval_claude_code.py scripts/tests/test_passk_eval.py scripts/tests/test_eval_compare_optimization.py
  • python3 -m scripts.skill_eval.run_eval --eval-set skills/agent-comparison/references/read-only-ops-short-tasks.json --skill-path skills/read-only-ops
  • python3 skills/agent-comparison/scripts/optimize_loop.py --target skills/socratic-debugging/SKILL.md --goal 'Improve the first response so it asks exactly one question, avoids direct diagnosis, avoids code examples, and does not add tool-permission preamble.' --benchmark-tasks skills/agent-comparison/references/socratic-debugging-body-short-tasks.json --optimization-scope body-only --max-iterations 1 --beam-width 1 --candidates-per-parent 1 --revert-streak-limit 1 --output-dir /tmp/cck-autoresearch-pr-proof2 --report /tmp/cck-autoresearch-pr-proof2/report.html --verbose
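The `--goal` string in the last command names four properties of the first response. As a rough illustration only, those properties could be approximated with simple text heuristics like the sketch below. This is not the scoring used by `optimize_loop.py` or the blind body benchmark; the function name `check_first_response` and the phrase patterns are hypothetical choices made for this example.

```python
import re

# Fenced-code marker, built this way so the example does not contain a literal fence.
FENCE = "`" * 3

# Hypothetical phrase lists; a real evaluator would need something far more robust.
DIAGNOSIS_RE = re.compile(r"\b(the (bug|problem|issue) is|you need to|the fix is)\b")
PREAMBLE_RE = re.compile(
    r"\b(i (don't|do not|cannot|can't) have (access|permission)|tool permission)"
)


def check_first_response(text: str) -> dict:
    """Return one boolean per criterion named in the --goal string."""
    lowered = text.lower()
    return {
        # "asks exactly one question": approximated by counting question marks
        "exactly_one_question": text.count("?") == 1,
        # "avoids code examples": approximated by the absence of a code fence
        "no_code_example": FENCE not in text,
        # "avoids direct diagnosis": approximated by a few telltale phrases
        "no_direct_diagnosis": DIAGNOSIS_RE.search(lowered) is None,
        # "does not add tool-permission preamble": same phrase-based approach
        "no_permission_preamble": PREAMBLE_RE.search(lowered) is None,
    }


if __name__ == "__main__":
    reply = "What did you expect the function to return when the list is empty?"
    print(check_first_response(reply))
```

Heuristics like these are brittle (a question mark inside a quoted log line would miscount, for example), so they are best read as a way to make the goal concrete rather than as a replacement for the benchmark.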

Notes

  • base branch is fix/autoresearch-recommendations because the autoresearch infrastructure is not on main.
  • this includes the accepted socratic-debugging instruction-body update proven by the short blind body benchmark.

notque force-pushed the feature/skill-body-autoresearch-hardening branch from a2ba028 to 98ecf1a on March 30, 2026 03:19

notque commented Mar 30, 2026

Current status on 98ecf1a587affda2e09611359dd6dd30132b48cd:

  • Local validation passed:
    • pytest -q scripts/tests/test_agent_comparison_optimize_loop.py scripts/tests/test_skill_eval_claude_code.py scripts/tests/test_passk_eval.py scripts/tests/test_eval_compare_optimization.py
  • The short live body benchmark still runs on this branch; it produced one more accepted tightening of the socratic-debugging body, which is now included in the head commit.

Workflow note:

  • GitHub reports no checks for this PR because .github/workflows/test.yml is configured only for push/pull_request on main.
  • This PR targets fix/autoresearch-recommendations, so the Tests workflow does not trigger for it in its current configuration.

The PR merges cleanly, but no GitHub Actions run applies to this stacked base branch, so there is no CI signal for it.
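The trigger behaviour described in the workflow note can be modelled in a few lines. The sketch below assumes the workflow's `pull_request` branch filter contains only `main`, as the note states, and uses Python's `fnmatchcase` as a stand-in for GitHub's filter matching. That stand-in is exact for plain branch names but does not reproduce GitHub's glob rules for `*` and `**`. The function name `pull_request_triggers` is hypothetical.

```python
from fnmatch import fnmatchcase


def pull_request_triggers(base_branch: str, branch_filters: list[str]) -> bool:
    """Return True if a PR targeting base_branch passes the branch filter.

    Simplified model: GitHub compares the PR's *base* branch against the
    filter patterns. fnmatchcase is only an approximation of that matching.
    """
    return any(fnmatchcase(base_branch, pattern) for pattern in branch_filters)


# Assumed current filter, per the workflow note: only main.
configured = ["main"]
print(pull_request_triggers("fix/autoresearch-recommendations", configured))  # False

# Adding the stacked base branch to the filter would let the workflow run.
extended = configured + ["fix/autoresearch-recommendations"]
print(pull_request_triggers("fix/autoresearch-recommendations", extended))  # True
```

Under this model, a stacked PR gets CI only when its base branch is listed in the filter, which matches the absence of checks reported here.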

notque merged commit 23cad06 into fix/autoresearch-recommendations on Mar 30, 2026
notque deleted the feature/skill-body-autoresearch-hardening branch on March 30, 2026 03:51