Skip to content

feat: add Evaluator class for judge orchestration#1331

Merged
jsonbailey merged 4 commits intonext-ai-releasefrom
jb/aic-2174/server-ai
May 1, 2026
Merged

feat: add Evaluator class for judge orchestration#1331
jsonbailey merged 4 commits intonext-ai-releasefrom
jb/aic-2174/server-ai

Conversation

@jsonbailey
Copy link
Copy Markdown
Contributor

@jsonbailey jsonbailey commented Apr 28, 2026

Summary

  • Adds Evaluator class that wraps judges and JudgeConfiguration
  • Evaluator.noop() static factory returns a no-op evaluator (resolves to [])
  • evaluate(input, output) runs all configured judges in parallel; missing judge key → warning + skip (not error)
  • Evaluator does NOT call tracker.trackJudgeResult — that belongs in the managed layer
  • Attaches Evaluator to LDAICompletionConfig and LDAIAgentConfig (populated in createChat/createAgent)

Test plan

  • All 188 existing tests pass
  • New Evaluator.test.ts covers noop(), judge evaluation, missing judge warns+skips, error handling, no tracker calls

🤖 Generated with Claude Code


Note

Medium Risk
Introduces new judge orchestration and changes how createChat/createJudge construct and wire judges, which can affect when/if evaluations run and how sampling is applied. Main risk is behavioral drift in judge execution paths due to the new evaluator attachment and updated Judge sampling semantics.

Overview
Adds an internal Evaluator abstraction that runs all configured Judge instances in parallel and returns an array of LDJudgeResults (with a noop() mode that always returns []).

Refactors LDAIClientImpl.createChat to build and attach an Evaluator onto the returned LDAICompletionConfig (and updates config types to carry evaluator), instead of returning a populated judges map to TrackedChat; judge instance creation is centralized via a new _createJudgeInstance helper.

Updates Judge to support a constructor-level default sampling rate (exposed via sampleRate) and changes evaluate/evaluateMessages so samplingRate is optional and falls back to the constructor default. Adds/updates Jest coverage for Evaluator, the new Judge sampling behavior, and the updated createJudge constructor arguments.

Reviewed by Cursor Bugbot for commit 12f3657. Bugbot is set up for automated code reviews on this repo. Configure here.


Open in Devin Review

@github-actions
Copy link
Copy Markdown
Contributor

@launchdarkly/js-sdk-common size report
This is the brotli compressed size of the ESM build.
Compressed size: 25623 bytes
Compressed size limit: 29000
Uncompressed size: 125843 bytes

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 28, 2026

@launchdarkly/browser size report
This is the brotli compressed size of the ESM build.
Compressed size: 179311 bytes
Compressed size limit: 200000
Uncompressed size: 830815 bytes

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 28, 2026

@launchdarkly/js-client-sdk size report
This is the brotli compressed size of the ESM build.
Compressed size: 31866 bytes
Compressed size limit: 34000
Uncompressed size: 113634 bytes

@github-actions
Copy link
Copy Markdown
Contributor

@launchdarkly/js-client-sdk-common size report
This is the brotli compressed size of the ESM build.
Compressed size: 38473 bytes
Compressed size limit: 39000
Uncompressed size: 211104 bytes

@jsonbailey jsonbailey changed the title feat(server-sdk-ai): add Evaluator class for judge orchestration feat: add Evaluator class for judge orchestration Apr 28, 2026
…-1657)

Introduces `Evaluator` wrapping judges and JudgeConfiguration. The evaluator
runs all configured judges in parallel, warns+skips on missing judge keys, and
intentionally does NOT call tracker.trackJudgeResult — that responsibility
belongs in the managed layer. Attaches Evaluator to LDAICompletionConfig and
LDAIAgentConfig via createChat/createAgent. Adds Evaluator.noop() static factory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jsonbailey jsonbailey force-pushed the jb/aic-2174/server-ai branch from 46ab0a4 to c751ce6 Compare April 28, 2026 23:12
@jsonbailey jsonbailey marked this pull request as ready for review April 28, 2026 23:23
@jsonbailey jsonbailey requested a review from a team as a code owner April 28, 2026 23:23
cursor[bot]

This comment was marked as resolved.

Comment thread packages/sdk/server-ai/src/api/judge/Evaluator.ts Outdated
Comment thread packages/sdk/server-ai/src/LDAIClientImpl.ts
Comment thread packages/sdk/server-ai/src/LDAIClientImpl.ts Outdated
Comment thread packages/sdk/server-ai/src/api/judge/Evaluator.ts Outdated
Comment thread packages/sdk/server-ai/src/api/judge/Evaluator.ts Outdated
Comment thread packages/sdk/server-ai/src/api/config/types.ts Outdated
jsonbailey and others added 2 commits April 30, 2026 09:57
- Include judgeConfigKey in error LDJudgeResults emitted from the catch
  block so error results are attributable to a specific judge config.
- Switch from Promise.allSettled to Promise.all in Evaluator.evaluate;
  the map callback already catches internally so allSettled never sees
  rejections. Filter out null returns from the missing-judge path.
- Mark the Evaluator class @internal and remove it from the public
  api/judge re-exports. It is consumed only by the managed layer; tests
  import via the source path.
- Mark the evaluator property on LDAICompletionConfig and LDAIAgentConfig
  as @internal so it is excluded from the published API surface.
- Simplify _initializeJudges to return Map<string, Judge> directly,
  eliminating the Record-then-convert step in _buildEvaluator.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…r to Judge[]

- `Judge` constructor now takes a `sampleRate: number = 1.0` between the
  provider and the optional logger. The rate is stored as
  `private readonly _sampleRate` with a public `get sampleRate()` getter, in
  line with the rest of the file's `_xxx` field convention.
- `Judge.evaluate` and `Judge.evaluateMessages` now take an optional
  `samplingRate?: number`; when omitted (or `undefined`) the constructor
  default is used. Falls through with `samplingRate ?? this._sampleRate`
  so an explicit `0` overrides correctly.
- `Evaluator` is now constructed with just `(judges: Judge[], logger?)` —
  the `Map<string, Judge>` and `LDJudgeConfiguration` parameters are gone.
  `evaluate()` iterates the array and calls `judge.evaluate(input, output)`
  with no per-call rate; sampling is the judge's concern. The missing-judge
  warn-and-skip path is gone since the array can only contain judges that
  were successfully created.
- `LDAIClientImpl._initializeJudges` is removed. The judge-construction
  body of `createJudge` is extracted into a private
  `_createJudgeInstance` helper that does NOT emit
  `TRACK_USAGE_CREATE_JUDGE`. The public `createJudge` emits the usage
  event then delegates — matching the existing `completionConfig` /
  `_completionConfig` and `judgeConfig` / `_judgeConfig` split. The
  public `createJudge` (and the `LDAIClient` interface) gain an optional
  `sampleRate?: number = 1.0` so callers can bake the default rate at
  construction. `_buildEvaluator` now inlines the per-config
  `Promise.all`, calls `_createJudgeInstance` per judge (no per-judge
  usage event), and constructs `new Evaluator(judges, this._logger)`.
- Drop the legacy `judges: Record<string, Judge>` build in `createChat`.
  The `TrackedChat` constructor still accepts a `judges` param (its
  default-empty `{}`), but the client no longer populates it; the
  deprecated `invoke()` path will see no judges. The `run()` path is
  unaffected — it relies on the `evaluator` attached to the config.
- Update `Evaluator.test.ts` to construct with `Judge[]`. Add new
  `Judge.test.ts` cases covering sampleRate defaulting, the per-call
  override (including explicit `0`), and the `undefined` fallthrough.
- Update the `LDAIClientImpl.test.ts` `createJudge` expectation to assert
  the new constructor arity (with the default `1.0`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
cursor[bot]

This comment was marked as resolved.

@jsonbailey jsonbailey requested a review from joker23 May 1, 2026 15:53
…orts

Judge.evaluate() never throws — it has its own internal try/catch and
always returns an LDJudgeResult. Simplify Evaluator.evaluate() to plain
Promise.all, remove the now-unused _logger param from the constructor,
and drop the Evaluator export from the public judge index. Update
LDAIClientImpl and tests to match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

View 4 additional findings in Devin Review.

Open in Devin Review

Comment on lines +328 to +331
// Attach the evaluator to the config for use by the managed layer
const configWithEvaluator: LDAICompletionConfig = { ...config, evaluator };

return new TrackedChat(configWithEvaluator, provider, {}, this._logger);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 TrackedChat receives empty judges record, breaking all judge evaluations

createChat now passes an empty object {} for the judges parameter of TrackedChat, while attaching an evaluator to the config instead. However, TrackedChat._evaluateWithJudges() (TrackedChat.ts:85) still looks up judges from this.judges[judgeConfig.key], which will always return undefined since the record is empty. This means every judge evaluation triggered by TrackedChat.invoke() will fail with the error "Judge configuration is not enabled for ..." instead of actually running the judge. The evaluator on the config is never consumed by TrackedChat.

Prompt for agents
The PR introduces an Evaluator abstraction and attaches it to the config via configWithEvaluator, but TrackedChat was not updated to use it. TrackedChat._evaluateWithJudges() (TrackedChat.ts:77-115) still relies on the this.judges record passed to its constructor, which is now always empty {}.

Two approaches to fix this:

1. Update TrackedChat to use the evaluator from the config (this.aiConfig.evaluator) instead of the old this.judges record. This would mean replacing the _evaluateWithJudges method to delegate to the evaluator, and updating the invoke() method accordingly. Note that TrackedChat currently also calls tracker.trackJudgeResult for each result, which still needs to happen.

2. Alternatively, keep populating the judges record as before (reverting the empty {} change) while also attaching the evaluator for other consumers. But this would mean duplicating the judge instances.

Approach 1 is cleaner and aligns with the PR's intent. The key files are:
- packages/sdk/server-ai/src/LDAIClientImpl.ts (createChat method, lines 296-332)
- packages/sdk/server-ai/src/api/chat/TrackedChat.ts (constructor, invoke, _evaluateWithJudges)
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional and will be resolved in a subsequent commit

@jsonbailey jsonbailey changed the base branch from main to next-ai-release May 1, 2026 16:14
@jsonbailey jsonbailey merged commit f26c4cc into next-ai-release May 1, 2026
43 checks passed
@jsonbailey jsonbailey deleted the jb/aic-2174/server-ai branch May 1, 2026 16:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants