Skip to content

fix: Replace done_callback with coroutine chain for judge tracking#147

Open
jsonbailey wants to merge 1 commit intomainfrom
jb/aic-2174/evaluations
Open

fix: Replace done_callback with coroutine chain for judge tracking#147
jsonbailey wants to merge 1 commit intomainfrom
jb/aic-2174/evaluations

Conversation

@jsonbailey
Copy link
Copy Markdown
Contributor

@jsonbailey jsonbailey commented Apr 28, 2026

Summary

  • Replaces add_done_callback in ManagedModel._track_judge_results with a proper _run_and_track coroutine wrapped in asyncio.create_task, so that awaiting response.evaluations guarantees both evaluation and tracker.track_judge_result() calls complete in sequence
  • The old callback approach ran outside the asyncio scheduler and gave no ordering guarantees; the new approach makes the chain explicit and testable
  • Adds test_managed_model.py with 5 tests covering: return-before-resolve, collect results, tracking fires inside the awaited chain, failed results skip tracking, noop evaluator returns empty list

Test plan

  • All 138 existing tests pass
  • 5 new test_managed_model.py tests added and passing
  • e2e validation via hello-python-ai agent

🤖 Generated with Claude Code


Note

Medium Risk
Changes async scheduling for judge evaluations/tracking, which can affect ordering and side effects (metrics emission) during ManagedModel.invoke and when consumers await response.evaluations. Covered by new tests but still touches concurrency behavior.

Overview
Fixes judge-result tracking determinism. ManagedModel._track_judge_results now wraps evaluator execution in an awaited coroutine chain and returns a new asyncio task, so awaiting response.evaluations reliably includes both evaluation completion and tracker.track_judge_result() side effects (instead of relying on a add_done_callback).

Adds a focused test_managed_model.py suite validating that invoke() returns before evaluations finish, that awaiting evaluations yields results, and that tracking fires only for successful judge results (including a noop-evaluator case).

Reviewed by Cursor Bugbot for commit 381cf75. Bugbot is set up for automated code reviews on this repo. Configure here.

`_track_judge_results` previously used `add_done_callback` to fire
`track_judge_result()` after evaluation completed, but callbacks run
outside the asyncio task scheduler and can execute at unpredictable
times. Replace with a single `_run_and_track` coroutine wrapped in a
new `asyncio.create_task`, so that awaiting `response.evaluations`
guarantees both evaluation and tracker calls complete in sequence.

Add `test_managed_model.py` covering: invoke() returns before
evaluations resolve; awaiting evaluations collects results; tracking
fires inside the awaited chain (not before); failed judge results do
not trigger tracking; noop evaluator returns an empty list.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jsonbailey jsonbailey marked this pull request as ready for review April 28, 2026 22:57
@jsonbailey jsonbailey requested a review from a team as a code owner April 28, 2026 22:57
@jsonbailey jsonbailey force-pushed the jb/aic-2174/evaluations branch from a2402ea to 381cf75 Compare April 28, 2026 22:57
@jsonbailey jsonbailey changed the title fix(ldai): Replace done_callback with coroutine chain for judge tracking fix: Replace done_callback with coroutine chain for judge tracking Apr 28, 2026
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 381cf75. Configure here.

for r in results:
if r.success:
tracker.track_judge_result(r)
return results
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing exception handling loses evaluation results on tracking failure

Medium Severity

The _run_and_track coroutine has no try/except around the tracker.track_judge_result(r) call. If tracking raises (e.g., the LD client is shut down or track() fails internally), the exception propagates through the wrapper task, and callers awaiting response.evaluations receive an exception instead of the List[JudgeResult]. The old add_done_callback approach was inherently resilient to this — asyncio catches callback exceptions and logs them without affecting the task's result. The new inline approach loses that isolation, meaning a tracking failure now destroys the evaluation results that were successfully computed.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 381cf75. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant