Fail fast on systemic SearchQA rollout failures #64

Merged: Yif-Yang merged 2 commits into microsoft:main from summerview1997:codex/searchqa-rollout-failfast on Jun 17, 2026

@summerview1997

Contributor

Summary

This PR makes SearchQA rollout fail fast when every item in a batch fails before the target agent produces any response.

Previously, per-item exceptions such as model endpoint misconfiguration were recorded as ordinary failed answers. If every item had agent_ok=false, the trainer could continue with a complete-looking run and all-zero scores, even though no agent responses were produced.
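To illustrate why this went unnoticed, here is a minimal sketch. It assumes each rollout row is a dict with `agent_ok`, `fail_reason`, and a numeric `score`. The `score` key, the row layout, and the `mean_score` helper are hypothetical and are not taken from the repository.

```python
# Hypothetical row layout: only agent_ok and fail_reason are named in this PR.
infra_failure_batch = [
    {"agent_ok": False, "fail_reason": "connection refused", "score": 0.0},
    {"agent_ok": False, "fail_reason": "connection refused", "score": 0.0},
]
wrong_answer_batch = [
    {"agent_ok": True, "fail_reason": None, "score": 0.0},
    {"agent_ok": True, "fail_reason": None, "score": 0.0},
]


def mean_score(rows):
    """Average the per-row score (illustrative helper, not the trainer's metric)."""
    return sum(row["score"] for row in rows) / len(rows)


# Both batches average to the same score, so the aggregate alone cannot
# tell an unreachable endpoint apart from a model that answered incorrectly.
scores_look_identical = mean_score(infra_failure_batch) == mean_score(wrong_answer_batch)

# Only the agent_ok flag reveals that no agent response was ever produced.
no_agent_output = not any(row["agent_ok"] for row in infra_failure_batch)
```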

Changes

  • Add a SearchQA rollout guard that detects when every row in a batch has agent_ok=false.
  • Raise a runtime error summarizing the most common fail_reason.
  • Apply the guard to both resumed/cached result paths and newly completed batches.
  • Keep ordinary wrong-answer results valid when at least one row has an agent response.
  • Add regression tests for cached systemic failures and for rollouts that were answered but wrong.
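The guard described in the list above could look roughly like the following. This is a sketch under stated assumptions, not the code merged in `skillopt/envs/searchqa/rollout.py`: the function name `raise_if_systemic_failure`, the dict-based row layout, the error wording, and the choice to treat an empty batch as valid are all hypothetical.

```python
from collections import Counter


def raise_if_systemic_failure(rows):
    """Raise RuntimeError if no row in the batch has an agent response.

    Hypothetical helper: each row is assumed to be a dict carrying the
    agent_ok and fail_reason fields mentioned in this PR.
    """
    if not rows:
        # Assumption: an empty batch is not treated as a systemic failure.
        return
    if any(row.get("agent_ok") for row in rows):
        # At least one agent response exists, so wrong answers stay valid results.
        return
    # Every row failed before the agent responded: summarize the dominant cause.
    reasons = Counter(str(row.get("fail_reason") or "unknown") for row in rows)
    reason, count = reasons.most_common(1)[0]
    raise RuntimeError(
        f"All {len(rows)} SearchQA rollouts failed before the agent responded; "
        f"most common fail_reason: {reason!r} ({count}/{len(rows)} rows)"
    )
```

Calling the same helper on both cached results loaded at resume time and on freshly completed batches would cover the two paths the PR mentions, so a stale cache of failed rows cannot pass as a finished run.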

Impact

Infrastructure failures such as missing or unreachable model endpoints become visible immediately instead of being mistaken for model quality or skill optimization failure.

Validation

  • /home/thomas/SkillOpt/.venv/bin/python -m pytest -q tests/test_searchqa_rollout_failfast.py
  • /home/thomas/SkillOpt/.venv/bin/python -m pytest -q
  • /home/thomas/SkillOpt/.venv/bin/python -m ruff check skillopt/envs/searchqa/rollout.py tests/test_searchqa_rollout_failfast.py
  • /home/thomas/SkillOpt/.venv/bin/python -m py_compile skillopt/envs/searchqa/rollout.py tests/test_searchqa_rollout_failfast.py
  • git diff --check

@Yif-Yang Yif-Yang merged commit 0e96221 into microsoft:main Jun 17, 2026
1 check passed