feat: add Evaluator class for judge orchestration #1331
jsonbailey merged 4 commits into next-ai-release
…-1657)
Introduces `Evaluator` wrapping judges and JudgeConfiguration. The evaluator runs all configured judges in parallel, warns and skips on missing judge keys, and intentionally does NOT call tracker.trackJudgeResult; that responsibility belongs in the managed layer. Attaches Evaluator to LDAICompletionConfig and LDAIAgentConfig via createChat/createAgent. Adds Evaluator.noop() static factory.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
46ab0a4 to c751ce6
- Include judgeConfigKey in error LDJudgeResults emitted from the catch block so error results are attributable to a specific judge config.
- Switch from Promise.allSettled to Promise.all in Evaluator.evaluate; the map callback already catches internally so allSettled never sees rejections. Filter out null returns from the missing-judge path.
- Mark the Evaluator class @internal and remove it from the public api/judge re-exports. It is consumed only by the managed layer; tests import via the source path.
- Mark the evaluator property on LDAICompletionConfig and LDAIAgentConfig as @internal so it is excluded from the published API surface.
- Simplify _initializeJudges to return Map<string, Judge> directly, eliminating the Record-then-convert step in _buildEvaluator.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
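The catch-inside-the-callback pattern this commit describes can be sketched roughly as follows. This is an illustration only, not the SDK's code: `JudgeResultSketch`, `JudgeFn`, and `evaluateAll` are hypothetical names, and the real `LDJudgeResult` and `Judge` types carry more fields. Because every callback resolves (to a result, an error result, or `null`), plain `Promise.all` never rejects here.

```typescript
// Hypothetical shapes for illustration; not the SDK's real types.
interface JudgeResultSketch {
  judgeConfigKey: string;
  success: boolean;
  error?: string;
}

type JudgeFn = (input: string, output: string) => Promise<JudgeResultSketch>;

async function evaluateAll(
  keys: string[],
  judges: Map<string, JudgeFn>,
  input: string,
  output: string,
  warn: (message: string) => void,
): Promise<JudgeResultSketch[]> {
  const results = await Promise.all(
    keys.map(async (key): Promise<JudgeResultSketch | null> => {
      const judge = judges.get(key);
      if (!judge) {
        // Missing-judge path: warn and skip instead of failing the batch.
        warn(`No judge found for key "${key}"; skipping`);
        return null;
      }
      try {
        return await judge(input, output);
      } catch (err) {
        // Tag the error result with its config key so it stays attributable.
        return {
          judgeConfigKey: key,
          success: false,
          error: err instanceof Error ? err.message : String(err),
        };
      }
    }),
  );
  // Drop the nulls produced by the missing-judge path.
  return results.filter((r): r is JudgeResultSketch => r !== null);
}
```

Since rejections are converted to values inside the callback, `Promise.allSettled` would only ever report fulfilled entries, which is why the simpler `Promise.all` is sufficient in this shape.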
…r to Judge[]
- `Judge` constructor now takes a `sampleRate: number = 1.0` between the
provider and the optional logger. The rate is stored as
`private readonly _sampleRate` with a public `get sampleRate()` getter, in
line with the rest of the file's `_xxx` field convention.
- `Judge.evaluate` and `Judge.evaluateMessages` now take an optional
`samplingRate?: number`; when omitted (or `undefined`) the constructor
default is used. Falls through with `samplingRate ?? this._sampleRate`
so an explicit `0` overrides correctly.
- `Evaluator` is now constructed with just `(judges: Judge[], logger?)` —
the `Map<string, Judge>` and `LDJudgeConfiguration` parameters are gone.
`evaluate()` iterates the array and calls `judge.evaluate(input, output)`
with no per-call rate; sampling is the judge's concern. The missing-judge
warn-and-skip path is gone since the array can only contain judges that
were successfully created.
- `LDAIClientImpl._initializeJudges` is removed. The judge-construction
body of `createJudge` is extracted into a private
`_createJudgeInstance` helper that does NOT emit
`TRACK_USAGE_CREATE_JUDGE`. The public `createJudge` emits the usage
event then delegates — matching the existing `completionConfig` /
`_completionConfig` and `judgeConfig` / `_judgeConfig` split. The
public `createJudge` (and the `LDAIClient` interface) gain an optional
`sampleRate?: number = 1.0` so callers can bake the default rate at
construction. `_buildEvaluator` now inlines the per-config
`Promise.all`, calls `_createJudgeInstance` per judge (no per-judge
usage event), and constructs `new Evaluator(judges, this._logger)`.
- Drop the legacy `judges: Record<string, Judge>` build in `createChat`.
The `TrackedChat` constructor still accepts a `judges` param (its
default-empty `{}`), but the client no longer populates it; the
deprecated `invoke()` path will see no judges. The `run()` path is
unaffected — it relies on the `evaluator` attached to the config.
- Update `Evaluator.test.ts` to construct with `Judge[]`. Add new
`Judge.test.ts` cases covering sampleRate defaulting, the per-call
override (including explicit `0`), and the `undefined` fallthrough.
- Update the `LDAIClientImpl.test.ts` `createJudge` expectation to assert
the new constructor arity (with the default `1.0`).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
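The sampling fallthrough in this commit depends on `??` rather than `||`, so an explicit `0` is respected. A minimal sketch, assuming a stand-in class: `SampledJudge`, `effectiveRate`, and `shouldSample` are hypothetical names, and the comparison against a random draw is an assumption about how the rate is applied, not something the commit states.

```typescript
// Hypothetical stand-in for the sampling fields described in the commit.
class SampledJudge {
  private readonly _sampleRate: number;

  constructor(sampleRate: number = 1.0) {
    this._sampleRate = sampleRate;
  }

  get sampleRate(): number {
    return this._sampleRate;
  }

  // `??` falls back only on null/undefined, so an explicit 0 overrides
  // the constructor default. With `||`, a 0 would be silently replaced.
  effectiveRate(samplingRate?: number): number {
    return samplingRate ?? this._sampleRate;
  }

  // Assumed sampling rule for illustration: run when a draw in [0, 1)
  // is below the rate. `random` is injectable to keep tests deterministic.
  shouldSample(samplingRate?: number, random: () => number = Math.random): boolean {
    return random() < this.effectiveRate(samplingRate);
  }
}
```

Under this rule a rate of `0` never samples and a rate of `1.0` always does, which matches the intent of the "explicit `0`" test case mentioned in the commit.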
…orts
Judge.evaluate() never throws: it has its own internal try/catch and always returns an LDJudgeResult. Simplify Evaluator.evaluate() to plain Promise.all, remove the now-unused _logger param from the constructor, and drop the Evaluator export from the public judge index. Update LDAIClientImpl and tests to match.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
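With judges that never throw, the evaluator reduces to a thin fan-out. A rough sketch of that final shape, assuming simplified types: `EvaluatorSketch`, `JudgeSketch`, and `ResultSketch` are hypothetical names, and the real class and result type live in the SDK source.

```typescript
// Hypothetical minimal shapes; the SDK's Judge and LDJudgeResult are richer.
interface ResultSketch {
  judgeConfigKey: string;
  success: boolean;
}

interface JudgeSketch {
  // Assumed contract from the commit: always resolves, never rejects.
  evaluate(input: string, output: string): Promise<ResultSketch>;
}

class EvaluatorSketch {
  private readonly _judges: JudgeSketch[];

  constructor(judges: JudgeSketch[]) {
    this._judges = judges;
  }

  // A no-op evaluator has no judges, so evaluate() resolves to [].
  static noop(): EvaluatorSketch {
    return new EvaluatorSketch([]);
  }

  // Runs every judge in parallel and preserves the array order.
  async evaluate(input: string, output: string): Promise<ResultSketch[]> {
    return Promise.all(this._judges.map((judge) => judge.evaluate(input, output)));
  }
}
```

Note that the sketch does no tracking; per the PR description, calling `tracker.trackJudgeResult` is left to the managed layer.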
// Attach the evaluator to the config for use by the managed layer
const configWithEvaluator: LDAICompletionConfig = { ...config, evaluator };

return new TrackedChat(configWithEvaluator, provider, {}, this._logger);
🔴 TrackedChat receives empty judges record, breaking all judge evaluations
createChat now passes an empty object {} for the judges parameter of TrackedChat, while attaching an evaluator to the config instead. However, TrackedChat._evaluateWithJudges() (TrackedChat.ts:85) still looks up judges from this.judges[judgeConfig.key], which will always return undefined since the record is empty. This means every judge evaluation triggered by TrackedChat.invoke() will fail with the error "Judge configuration is not enabled for ..." instead of actually running the judge. The evaluator on the config is never consumed by TrackedChat.
Prompt for agents
The PR introduces an Evaluator abstraction and attaches it to the config via configWithEvaluator, but TrackedChat was not updated to use it. TrackedChat._evaluateWithJudges() (TrackedChat.ts:77-115) still relies on the this.judges record passed to its constructor, which is now always empty {}.
Two approaches to fix this:
1. Update TrackedChat to use the evaluator from the config (this.aiConfig.evaluator) instead of the old this.judges record. This would mean replacing the _evaluateWithJudges method to delegate to the evaluator, and updating the invoke() method accordingly. Note that TrackedChat currently also calls tracker.trackJudgeResult for each result, which still needs to happen.
2. Alternatively, keep populating the judges record as before (reverting the empty {} change) while also attaching the evaluator for other consumers. But this would mean duplicating the judge instances.
Approach 1 is cleaner and aligns with the PR's intent. The key files are:
- packages/sdk/server-ai/src/LDAIClientImpl.ts (createChat method, lines 296-332)
- packages/sdk/server-ai/src/api/chat/TrackedChat.ts (constructor, invoke, _evaluateWithJudges)
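Approach 1 could look roughly like the helper below. This is only a sketch under assumptions: `evaluateAndTrack`, `EvaluatorLike`, `TrackerLike`, and `JudgeResultLike` are hypothetical names, and how `TrackedChat` would actually obtain the evaluator and tracker is not shown in this PR.

```typescript
// Hypothetical interfaces standing in for the SDK's evaluator and tracker.
interface JudgeResultLike {
  judgeConfigKey: string;
  success: boolean;
}

interface EvaluatorLike {
  evaluate(input: string, output: string): Promise<JudgeResultLike[]>;
}

interface TrackerLike {
  trackJudgeResult(result: JudgeResultLike): void;
}

// Delegates judge execution to the evaluator, then records each result.
// Tracking stays with the caller because the evaluator does not track.
async function evaluateAndTrack(
  evaluator: EvaluatorLike | undefined,
  tracker: TrackerLike,
  input: string,
  output: string,
): Promise<JudgeResultLike[]> {
  if (!evaluator) {
    // No evaluator attached to the config: nothing to run or track.
    return [];
  }
  const results = await evaluator.evaluate(input, output);
  for (const result of results) {
    tracker.trackJudgeResult(result);
  }
  return results;
}
```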
This is intentional and will be resolved in a subsequent commit
Summary
- `Evaluator` class that wraps judges and `JudgeConfiguration`
- `Evaluator.noop()` static factory returns a no-op evaluator (resolves to `[]`)
- `evaluate(input, output)` runs all configured judges in parallel; missing judge key → warning + skip (not error)
- `Evaluator` does NOT call `tracker.trackJudgeResult`; that belongs in the managed layer
- Attaches `Evaluator` to `LDAICompletionConfig` and `LDAIAgentConfig` (populated in `createChat`/`createAgent`)

Test plan
- `Evaluator.test.ts` covers noop(), judge evaluation, missing judge warns+skips, error handling, no tracker calls

🤖 Generated with Claude Code
Note
Medium Risk
Introduces new judge orchestration and changes how `createChat`/`createJudge` construct and wire judges, which can affect when/if evaluations run and how sampling is applied. Main risk is behavioral drift in judge execution paths due to the new evaluator attachment and updated `Judge` sampling semantics.

Overview
Adds an internal `Evaluator` abstraction that runs all configured `Judge` instances in parallel and returns an array of `LDJudgeResult`s (with a `noop()` mode that always returns `[]`).

Refactors `LDAIClientImpl.createChat` to build and attach an `Evaluator` onto the returned `LDAICompletionConfig` (and updates config types to carry `evaluator`), instead of returning a populated judges map to `TrackedChat`; judge instance creation is centralized via a new `_createJudgeInstance` helper.

Updates `Judge` to support a constructor-level default sampling rate (exposed via `sampleRate`) and changes `evaluate`/`evaluateMessages` so `samplingRate` is optional and falls back to the constructor default. Adds/updates Jest coverage for `Evaluator`, the new `Judge` sampling behavior, and the updated `createJudge` constructor arguments.

Reviewed by Cursor Bugbot for commit 12f3657. Bugbot is set up for automated code reviews on this repo.