fix(trajectory): add diagnostic comments to trajectory match scorer failures#81
Open
Sarthak Gupta (sarthakkgupta) wants to merge 2 commits intolangchain-ai:mainfrom
Conversation
…ailures Fixes langchain-ai#66. Modifies _scorer functions in strict, unordered, subset, and superset to return (bool, str) tuples on failure. The reasoning string surfaces as EvaluatorResult.comment via the openevals runner. Fully backward-compatible — callers that only check result.score are unaffected. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #66.
Summary
Trajectory match evaluators previously returned only a
True/Falsescore with no explanation of why a comparison failed. This made debugging agent test failures a manual process of eyeballing trajectory diffs.This PR modifies the
_scorerfunctions instrict,unordered,subset, andsupersetto return(bool, str)tuples on failure. Theopenevalsrunner already supports this format — the string surfaces asEvaluatorResult.comment.Before
After
Changes
python/agentevals/trajectory/strict.py— per-step diagnostics (length, role, tool name, tool args)python/agentevals/trajectory/unordered.py— missing/extra tool call namespython/agentevals/trajectory/subset.py— extra tool call names not in referencepython/agentevals/trajectory/superset.py— missing required tool call names from referencepython/tests/test_trajectory.py— updated failure-case assertions to allow diagnostic commentspython/tests/test_trajectory_diagnostics.py— 13 new tests covering each failure modeBackward compatibility
Fully backward-compatible. All callers that only check
result["score"]are unaffected. Pass cases still returnTrue(no tuple), socommentremainsNoneon success.