feat: Add evaluations support to ManagedAgent.run() (#153)
Conversation
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 9f9c880. Configure here.
```python
        log.warning("Judge evaluation failed: %s", r.error_message)
    return results
# …
return asyncio.create_task(_run_and_track(evaluator_task))
```
There was a problem hiding this comment.
Duplicated _track_judge_results logic across managed classes
Low Severity
The `_track_judge_results` method in `ManagedAgent` is a character-for-character duplicate of the same method in `ManagedModel`. Both take `tracker`, `input_text`, and `output_text`, call `evaluator.evaluate()`, wrap it in an async task that iterates results, tracks successful ones, and logs failures. This duplicated logic increases maintenance burden: a bug fix or behavior change in one would need to be manually replicated in the other.
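Should the duplication ever become a problem, the shared loop could be hoisted into a single module-level helper that both managed classes call. A minimal sketch, assuming the `evaluator`/`tracker`/result shapes described in this thread (the helper name and exact signature are hypothetical):

```python
import asyncio
import logging

log = logging.getLogger(__name__)


def start_judge_tracking(evaluator, tracker, input_text, output_text):
    """Run judge evaluations in the background and track successful results.

    Hypothetical shared helper: both ManagedModel and ManagedAgent could
    delegate here instead of carrying private copies of the same loop.
    """
    async def _run_and_track():
        results = await evaluator.evaluate(input_text, output_text)
        for r in results:
            if r.success:
                try:
                    # A tracking failure must not destroy computed results.
                    tracker.track_judge_result(r)
                except Exception as exc:
                    log.warning("Failed to track judge result: %s", exc)
            else:
                log.warning("Judge evaluation failed: %s", r.error_message)
        return results

    # Requires a running event loop, as inside ManagedAgent.run().
    return asyncio.create_task(_run_and_track())
```

Keeping the helper free of class state is what makes it shareable; each caller passes its own evaluator and tracker.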
We will consider a refactor in the future if needed. It's light enough that we will leave it as is for the moment.
Wire judge evaluations into ManagedAgent.run() via an asyncio.Task, mirroring ManagedModel.run(). Awaiting result.evaluations guarantees both evaluation and tracker.track_judge_result() complete. run() returns immediately; the evaluations task resolves asynchronously. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mirror the managed_model.py fix in managed_agent.py: wrap tracker.track_judge_result() in try/except so a tracking failure does not destroy successfully computed evaluation results, and log a warning when a judge evaluation fails (r.success is False) so failures are visible rather than silently skipped.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
🤖 I have created a release *beep* *boop*

---

<details><summary>launchdarkly-server-sdk-ai: 0.19.0</summary>

## [0.19.0](launchdarkly-server-sdk-ai-0.18.0...launchdarkly-server-sdk-ai-0.19.0) (2026-05-05)

### ⚠ BREAKING CHANGES

* StructuredResponse replaced by RunnerResult with new "parsed" property
* AgentResult replaced by RunnerResult and Managed Result
* Removed ModelRunner and AgentRunner protocols
* Removed invoke_method, invoke_structured_model from AIProvider base class.
* ModelResponse was replaced by RunnerResult
* Add ManagedResult, RunnerResult, and Runner protocol; rename invoke() to run() ([#148](#148))
* Swap track_metrics_of parameter order to match spec ([#144](#144))

### Features

* Add evaluations support to ManagedAgent.run() ([#153](#153)) ([442f46a](442f46a))
* Add judge evaluation support to agent graphs ([#142](#142)) ([3d5a6a9](3d5a6a9))
* Add ManagedGraphResult, GraphMetricSummary, and AgentGraphRunnerResult types ([#151](#151)) ([301e24c](301e24c))
* Add ManagedResult, RunnerResult, and Runner protocol; rename invoke() to run() ([#148](#148)) ([88d4ddc](88d4ddc))
* Add root-level tools map with customParameters to AI Config types ([#141](#141)) ([f17c535](f17c535))
* bake sampling_rate into Judge at construction; simplify Evaluator to List[Judge] ([#159](#159)) ([86c79e6](86c79e6))
* Update LangChain runners to implement Runner protocol returning RunnerResult ([#150](#150)) ([62a8e25](62a8e25))

### Bug Fixes

* Add runtime DeprecationWarnings to deprecated methods ([#145](#145)) ([2189b81](2189b81))
* AgentResult replaced by RunnerResult and Managed Result ([fbb0b4b](fbb0b4b))
* build judge input as string; strip legacy judge config messages ([#165](#165)) ([e6942a6](e6942a6))
* Fall back to model.parameters.tools when root tools absent ([#146](#146)) ([2c30d75](2c30d75))
* Graph tracking refactor — ManagedAgentGraph drives tracking for new runner shape ([#154](#154)) ([20a5020](20a5020))
* ModelResponse was replaced by RunnerResult ([fbb0b4b](fbb0b4b))
* parse model.parameters.tools as list ([#160](#160)) ([fb53e99](fb53e99))
* reference correct PyPI package names in provider load error messages ([#164](#164)) ([48761c9](48761c9))
* Removed invoke_method, invoke_structured_model from AIProvider base class. ([fbb0b4b](fbb0b4b))
* Removed ModelRunner and AgentRunner protocols ([fbb0b4b](fbb0b4b))
* Replace done_callback with coroutine chain for judge tracking ([#147](#147)) ([1e1f36b](1e1f36b))
* StructuredResponse replaced by RunnerResult with new "parsed" property ([fbb0b4b](fbb0b4b))
* Swap track_metrics_of parameter order to match spec ([#144](#144)) ([53db736](53db736))
</details>

<details><summary>launchdarkly-server-sdk-ai-langchain: 0.6.0</summary>

## [0.6.0](launchdarkly-server-sdk-ai-langchain-0.5.0...launchdarkly-server-sdk-ai-langchain-0.6.0) (2026-05-05)

### Features

* Add judge evaluation support to agent graphs ([#142](#142)) ([3d5a6a9](3d5a6a9))
* Migrate LangGraph runner to AgentGraphRunnerResult; clean up legacy shape detection ([#156](#156)) ([efa8e00](efa8e00))
* Support conversation history directly in AI Provider model runners ([#166](#166)) ([4bb3e78](4bb3e78))
* Update LangChain runners to implement Runner protocol returning RunnerResult ([#150](#150)) ([62a8e25](62a8e25))

### Bug Fixes

* build judge input as string; strip legacy judge config messages ([#165](#165)) ([e6942a6](e6942a6))
</details>

<details><summary>launchdarkly-server-sdk-ai-openai: 0.5.0</summary>

## [0.5.0](launchdarkly-server-sdk-ai-openai-0.4.0...launchdarkly-server-sdk-ai-openai-0.5.0) (2026-05-05)

### Features

* Add judge evaluation support to agent graphs ([#142](#142)) ([3d5a6a9](3d5a6a9))
* Support conversation history directly in AI Provider model runners ([#166](#166)) ([4bb3e78](4bb3e78))
* Update OpenAI graph runner to return AgentGraphRunnerResult with GraphMetrics ([#155](#155)) ([388b7af](388b7af))
* Update OpenAI runners to implement Runner protocol returning RunnerResult ([#149](#149)) ([382e662](382e662))

### Bug Fixes

* build judge input as string; strip legacy judge config messages ([#165](#165)) ([e6942a6](e6942a6))
</details>

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

> [!NOTE]
> **Medium Risk**
> Primarily a release/version bump, but it publishes **breaking API changes** (move to unified `Runner.run()`/`RunnerResult` and removal of `invoke_*` methods), which can break downstream integrations.
>
> **Overview**
> Cuts a new release across the core SDK and provider packages: `launchdarkly-server-sdk-ai` to `0.19.0`, LangChain provider to `0.6.0`, and OpenAI provider to `0.5.0`, updating the release manifest and package metadata accordingly.
>
> Changelogs document the shipped breaking API surface changes (notably removing `invoke_model()`/`invoke_structured_model()` in favor of `run(...)` and standardizing returns on `RunnerResult`) plus accompanying feature/fix entries; the core package version constants/docs (`__version__`, `PROVENANCE.md`) are updated to match.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit a20d7a5. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: jsonbailey <jbailey@launchdarkly.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>


Summary
- Wire judge evaluations into `ManagedAgent.run()` via an `asyncio.Task`, mirroring `ManagedModel.run()` (PR 7 / PR 8)
- `run()` returns immediately; `await result.evaluations` guarantees both evaluation and `tracker.track_judge_result()` complete
- Evaluations run through `ai_config.evaluator.evaluate(input, content)`; resolves to an empty list with `Evaluator.noop()`
- Failed results (`success=False`) do NOT call `track_judge_result()`

Depends on
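From a caller's point of view, the run-then-await contract above can be sketched with toy stand-ins (`FakeManagedAgent` and `FakeManagedResult` are illustrative only; just the shape of `run()` and the `evaluations` handle follows this PR):

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeManagedResult:
    """Stand-in for ManagedResult: output plus an optional evaluations task."""
    output: str
    evaluations: Optional["asyncio.Task"] = None


class FakeManagedAgent:
    """Toy agent mimicking the run-then-await contract from the summary."""

    async def run(self, prompt: str) -> FakeManagedResult:
        async def _evaluate() -> List[dict]:
            # Evaluations resolve after run() has already returned.
            await asyncio.sleep(0)
            return [{"judge": "relevance", "success": True}]

        return FakeManagedResult(
            output=f"answer to: {prompt}",
            evaluations=asyncio.create_task(_evaluate()),
        )


async def main() -> None:
    agent = FakeManagedAgent()
    result = await agent.run("hello")
    # The main output is usable immediately...
    assert result.output.startswith("answer")
    # ...and awaiting the handle guarantees evaluation (and, in the real
    # SDK, tracker.track_judge_result()) has completed.
    judge_results = await result.evaluations
    assert judge_results[0]["success"] is True


if __name__ == "__main__":
    asyncio.run(main())
```

Callers that never await `result.evaluations` still get their output; the background task simply resolves on its own.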
Test plan
- `uv run pytest packages/sdk/server-ai/tests/`
- `TestManagedAgentEvaluations` tests: run returns before evaluations resolve, collect results, tracking fires on await, noop evaluator returns empty list, failed results not tracked

🤖 Generated with Claude Code
> [!NOTE]
> **Medium Risk**
> Adds background `asyncio` evaluation tasks and tracking side effects to `ManagedAgent.run()`, which can introduce task-lifecycle/awaiting and error-handling edge cases. Core behavior remains the same for the main agent invocation, but results now include an optional `evaluations` task and new warning logs on evaluation failures.
>
> **Overview**
> `ManagedAgent.run()` now kicks off judge evaluations via `ai_config.evaluator.evaluate(input, output)` in a background `asyncio.Task` and returns a `ManagedResult` that includes the optional `evaluations` handle.
>
> When `result.evaluations` is awaited, successful `JudgeResult`s are recorded via `tracker.track_judge_result()`; failed results (or tracking exceptions) are logged as warnings without failing the main run. Tests are expanded with a helper config and a new `TestManagedAgentEvaluations` suite covering task-asynchrony, result collection, and tracking behavior for noop/failed evaluations.
>
> Reviewed by Cursor Bugbot for commit 91877d5. Bugbot is set up for automated code reviews on this repo. Configure here.