feat: Add evaluations support to ManagedAgent.run() (#153)
Conversation
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 9f9c880. Configure here.
```python
        log.warning("Judge evaluation failed: %s", r.error_message)
    return results
# …
return asyncio.create_task(_run_and_track(evaluator_task))
```
There was a problem hiding this comment.
Duplicated _track_judge_results logic across managed classes
Low Severity
The `_track_judge_results` method in `ManagedAgent` is a character-for-character duplicate of the same method in `ManagedModel`. Both take `tracker`, `input_text`, and `output_text`, call `evaluator.evaluate()`, wrap it in an async task that iterates results, tracks successful ones, and logs failures. This duplicated logic increases maintenance burden: a bug fix or behavior change in one would need to be manually replicated in the other.
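Should the duplication ever become a problem, the shared loop could be hoisted into a single module-level helper that both managed classes call. A minimal sketch, assuming the `evaluator`/`tracker`/result shapes described in this thread (the helper name and exact signature are hypothetical):

```python
import asyncio
import logging

log = logging.getLogger(__name__)


def start_judge_tracking(evaluator, tracker, input_text, output_text):
    """Run judge evaluations in the background and track successful results.

    Hypothetical shared helper: both ManagedModel and ManagedAgent could
    delegate here instead of carrying private copies of the same loop.
    """
    async def _run_and_track():
        results = await evaluator.evaluate(input_text, output_text)
        for r in results:
            if r.success:
                try:
                    # A tracking failure must not destroy computed results.
                    tracker.track_judge_result(r)
                except Exception as exc:
                    log.warning("Failed to track judge result: %s", exc)
            else:
                log.warning("Judge evaluation failed: %s", r.error_message)
        return results

    # Requires a running event loop, as inside ManagedAgent.run().
    return asyncio.create_task(_run_and_track())
```

Keeping the helper free of class state is what makes it shareable; each caller passes its own evaluator and tracker.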
We will consider a refactor in the future if needed. It's light enough that we will leave it as is for the moment.
Wire judge evaluations into ManagedAgent.run() via an asyncio.Task, mirroring ManagedModel.run(). Awaiting result.evaluations guarantees both evaluation and tracker.track_judge_result() complete. run() returns immediately; the evaluations task resolves asynchronously. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mirror the managed_model.py fix in managed_agent.py: wrap tracker.track_judge_result() in try/except so a tracking failure does not destroy successfully computed evaluation results, and log a warning when a judge evaluation fails (r.success is False) so failures are visible rather than silently skipped.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
🤖 I have created a release *beep* *boop*

---

<details><summary>launchdarkly-server-sdk-ai: 0.19.0</summary>

## [0.19.0](launchdarkly-server-sdk-ai-0.18.0...launchdarkly-server-sdk-ai-0.19.0) (2026-05-05)

### ⚠ BREAKING CHANGES

* StructuredResponse replaced by RunnerResult with new "parsed" property
* AgentResult replaced by RunnerResult and Managed Result
* Removed ModelRunner and AgentRunner protocols
* Removed invoke_method, invoke_structured_model from AIProvider base class.
* ModelResponse was replaced by RunnerResult
* Add ManagedResult, RunnerResult, and Runner protocol; rename invoke() to run() ([#148](#148))
* Swap track_metrics_of parameter order to match spec ([#144](#144))

### Features

* Add evaluations support to ManagedAgent.run() ([#153](#153)) ([442f46a](442f46a))
* Add judge evaluation support to agent graphs ([#142](#142)) ([3d5a6a9](3d5a6a9))
* Add ManagedGraphResult, GraphMetricSummary, and AgentGraphRunnerResult types ([#151](#151)) ([301e24c](301e24c))
* Add ManagedResult, RunnerResult, and Runner protocol; rename invoke() to run() ([#148](#148)) ([88d4ddc](88d4ddc))
* Add root-level tools map with customParameters to AI Config types ([#141](#141)) ([f17c535](f17c535))
* bake sampling_rate into Judge at construction; simplify Evaluator to List[Judge] ([#159](#159)) ([86c79e6](86c79e6))
* Update LangChain runners to implement Runner protocol returning RunnerResult ([#150](#150)) ([62a8e25](62a8e25))

### Bug Fixes

* Add runtime DeprecationWarnings to deprecated methods ([#145](#145)) ([2189b81](2189b81))
* AgentResult replaced by RunnerResult and Managed Result ([fbb0b4b](fbb0b4b))
* build judge input as string; strip legacy judge config messages ([#165](#165)) ([e6942a6](e6942a6))
* Fall back to model.parameters.tools when root tools absent ([#146](#146)) ([2c30d75](2c30d75))
* Graph tracking refactor — ManagedAgentGraph drives tracking for new runner shape ([#154](#154)) ([20a5020](20a5020))
* ModelResponse was replaced by RunnerResult ([fbb0b4b](fbb0b4b))
* parse model.parameters.tools as list ([#160](#160)) ([fb53e99](fb53e99))
* reference correct PyPI package names in provider load error messages ([#164](#164)) ([48761c9](48761c9))
* Removed invoke_method, invoke_structured_model from AIProvider base class. ([fbb0b4b](fbb0b4b))
* Removed ModelRunner and AgentRunner protocols ([fbb0b4b](fbb0b4b))
* Replace done_callback with coroutine chain for judge tracking ([#147](#147)) ([1e1f36b](1e1f36b))
* StructuredResponse replaced by RunnerResult with new "parsed" property ([fbb0b4b](fbb0b4b))
* Swap track_metrics_of parameter order to match spec ([#144](#144)) ([53db736](53db736))
</details>

<details><summary>launchdarkly-server-sdk-ai-langchain: 0.6.0</summary>

## [0.6.0](launchdarkly-server-sdk-ai-langchain-0.5.0...launchdarkly-server-sdk-ai-langchain-0.6.0) (2026-05-05)

### Features

* Add judge evaluation support to agent graphs ([#142](#142)) ([3d5a6a9](3d5a6a9))
* Migrate LangGraph runner to AgentGraphRunnerResult; clean up legacy shape detection ([#156](#156)) ([efa8e00](efa8e00))
* Support conversation history directly in AI Provider model runners ([#166](#166)) ([4bb3e78](4bb3e78))
* Update LangChain runners to implement Runner protocol returning RunnerResult ([#150](#150)) ([62a8e25](62a8e25))

### Bug Fixes

* build judge input as string; strip legacy judge config messages ([#165](#165)) ([e6942a6](e6942a6))
</details>

<details><summary>launchdarkly-server-sdk-ai-openai: 0.5.0</summary>

## [0.5.0](launchdarkly-server-sdk-ai-openai-0.4.0...launchdarkly-server-sdk-ai-openai-0.5.0) (2026-05-05)

### Features

* Add judge evaluation support to agent graphs ([#142](#142)) ([3d5a6a9](3d5a6a9))
* Support conversation history directly in AI Provider model runners ([#166](#166)) ([4bb3e78](4bb3e78))
* Update OpenAI graph runner to return AgentGraphRunnerResult with GraphMetrics ([#155](#155)) ([388b7af](388b7af))
* Update OpenAI runners to implement Runner protocol returning RunnerResult ([#149](#149)) ([382e662](382e662))

### Bug Fixes

* build judge input as string; strip legacy judge config messages ([#165](#165)) ([e6942a6](e6942a6))
</details>

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

> [!NOTE]
> **Medium Risk**
> Primarily a release/version bump, but it publishes **breaking API changes** (move to unified `Runner.run()`/`RunnerResult` and removal of `invoke_*` methods), which can break downstream integrations.
>
> **Overview**
> Cuts a new release across the core SDK and provider packages: `launchdarkly-server-sdk-ai` to `0.19.0`, LangChain provider to `0.6.0`, and OpenAI provider to `0.5.0`, updating the release manifest and package metadata accordingly.
>
> Changelogs document the shipped breaking API surface changes (notably removing `invoke_model()`/`invoke_structured_model()` in favor of `run(...)` and standardizing returns on `RunnerResult`) plus accompanying feature/fix entries; the core package version constants/docs (`__version__`, `PROVENANCE.md`) are updated to match.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit a20d7a5. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: jsonbailey <jbailey@launchdarkly.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>


Summary
- Wire judge evaluations into `ManagedAgent.run()` via an `asyncio.Task`, mirroring `ManagedModel.run()` (PR 7 / PR 8)
- `run()` returns immediately; `await result.evaluations` guarantees both evaluation and `tracker.track_judge_result()` complete
- Evaluations run through `ai_config.evaluator.evaluate(input, content)`; resolves to an empty list with `Evaluator.noop()`
- Failed results (`success=False`) do NOT call `track_judge_result()`

Depends on
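From a caller's point of view, the run-then-await contract above can be sketched with toy stand-ins (`FakeManagedAgent` and `FakeManagedResult` are illustrative only; just the shape of `run()` and the `evaluations` handle follows this PR):

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeManagedResult:
    """Stand-in for ManagedResult: output plus an optional evaluations task."""
    output: str
    evaluations: Optional["asyncio.Task"] = None


class FakeManagedAgent:
    """Toy agent mimicking the run-then-await contract from the summary."""

    async def run(self, prompt: str) -> FakeManagedResult:
        async def _evaluate() -> List[dict]:
            # Evaluations resolve after run() has already returned.
            await asyncio.sleep(0)
            return [{"judge": "relevance", "success": True}]

        return FakeManagedResult(
            output=f"answer to: {prompt}",
            evaluations=asyncio.create_task(_evaluate()),
        )


async def main() -> None:
    agent = FakeManagedAgent()
    result = await agent.run("hello")
    # The main output is usable immediately...
    assert result.output.startswith("answer")
    # ...and awaiting the handle guarantees evaluation (and, in the real
    # SDK, tracker.track_judge_result()) has completed.
    judge_results = await result.evaluations
    assert judge_results[0]["success"] is True


if __name__ == "__main__":
    asyncio.run(main())
```

Callers that never await `result.evaluations` still get their output; the background task simply resolves on its own.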
Test plan
- `uv run pytest packages/sdk/server-ai/tests/`
- `TestManagedAgentEvaluations` tests: run returns before evaluations resolve, collect results, tracking fires on await, noop evaluator returns empty list, failed results not tracked

🤖 Generated with Claude Code
> [!NOTE]
> **Medium Risk**
> Adds background `asyncio` evaluation tasks and tracking side effects to `ManagedAgent.run()`, which can introduce task-lifecycle/awaiting and error-handling edge cases. Core behavior remains the same for the main agent invocation, but results now include an optional `evaluations` task and new warning logs on evaluation failures.
>
> **Overview**
> `ManagedAgent.run()` now kicks off judge evaluations via `ai_config.evaluator.evaluate(input, output)` in a background `asyncio.Task` and returns a `ManagedResult` that includes the optional `evaluations` handle.
>
> When `result.evaluations` is awaited, successful `JudgeResult`s are recorded via `tracker.track_judge_result()`; failed results (or tracking exceptions) are logged as warnings without failing the main run. Tests are expanded with a helper config and a new `TestManagedAgentEvaluations` suite covering task-asynchrony, result collection, and tracking behavior for noop/failed evaluations.
>
> Reviewed by Cursor Bugbot for commit 91877d5. Bugbot is set up for automated code reviews on this repo. Configure here.