fix(trajectory): add diagnostic comments to trajectory match scorer failures#81

Open
Sarthak Gupta (sarthakkgupta) wants to merge 2 commits into langchain-ai:main from sarthakkgupta:fix/trajectory-diagnostic-comments

Conversation

@sarthakkgupta

Fixes #66.

Summary

Trajectory match evaluators previously returned only a True/False score with no explanation of why a comparison failed, so debugging agent test failures meant manually eyeballing trajectory diffs.

This PR modifies the _scorer functions in strict, unordered, subset, and superset to return (bool, str) tuples on failure. The openevals runner already supports this format — the string surfaces as EvaluatorResult.comment.

Before

result = evaluator(outputs=agent_run, reference_outputs=expected)
# result["score"] = False
# result["comment"] = None   ← no information

After

result = evaluator(outputs=agent_run, reference_outputs=expected)
# result["score"] = False
# result["comment"] = "Step 2, tool 'search_web': argument mismatch (match_mode='exact').
#                      Expected: {\"query\": \"AI news 2025\"}.
#                      Got: {\"query\": \"AI news\"}."

Changes

  • python/agentevals/trajectory/strict.py — per-step diagnostics (length, role, tool name, tool args)
  • python/agentevals/trajectory/unordered.py — missing/extra tool call names
  • python/agentevals/trajectory/subset.py — extra tool call names not in reference
  • python/agentevals/trajectory/superset.py — missing required tool call names from reference
  • python/tests/test_trajectory.py — updated failure-case assertions to allow diagnostic comments
  • python/tests/test_trajectory_diagnostics.py — 13 new tests covering each failure mode

Backward compatibility

Fully backward-compatible. All callers that only check result["score"] are unaffected. Pass cases still return True (no tuple), so comment remains None on success.

Sarthak Gupta and others added 2 commits March 27, 2026 21:57
…ailures

Fixes langchain-ai#66. Modifies _scorer functions in strict, unordered, subset, and
superset to return (bool, str) tuples on failure. The reasoning string
surfaces as EvaluatorResult.comment via the openevals runner. Fully
backward-compatible — callers that only check result.score are unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

[Feature Request] Error messages for failed trajectories