fix(trajectory): add diagnostic comments to trajectory match scorer failures by sarthakkgupta · Pull Request #81 · langchain-ai/agentevals

Sarthak Gupta (sarthakkgupta) · 2026-03-28T04:58:20Z

Fixes #66.

Summary

Trajectory match evaluators previously returned only a True/False score with no explanation of why a comparison failed. This made debugging agent test failures a manual process of eyeballing trajectory diffs.

This PR modifies the _scorer functions in strict, unordered, subset, and superset to return (bool, str) tuples on failure. The openevals runner already supports this format — the string surfaces as EvaluatorResult.comment.
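The return convention described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `strict_scorer_sketch` and `to_result` are hypothetical names, the message format is invented, and `to_result` only imitates how a runner might turn a `(bool, str)` tuple into a score and comment. The real openevals runner internals are not reproduced here.

```python
from typing import Union

# A scorer returns either a bare bool (pass) or a (bool, str) tuple (failure + reason).
ScoreResult = Union[bool, tuple[bool, str]]


def strict_scorer_sketch(outputs: list[dict], reference: list[dict]) -> ScoreResult:
    """Hypothetical strict comparison: checks length and per-step role only."""
    if len(outputs) != len(reference):
        return (
            False,
            f"Trajectory length mismatch: expected {len(reference)} messages, "
            f"got {len(outputs)}.",
        )
    for step, (out, ref) in enumerate(zip(outputs, reference)):
        if out.get("role") != ref.get("role"):
            return (
                False,
                f"Step {step}: role mismatch. "
                f"Expected: {ref.get('role')!r}. Got: {out.get('role')!r}.",
            )
    # Pass case: a bare True with no tuple, so no comment is attached.
    return True


def to_result(raw: ScoreResult) -> dict:
    """Hypothetical stand-in for the runner: unpack a tuple into score and comment."""
    if isinstance(raw, tuple):
        score, comment = raw
        return {"score": score, "comment": comment}
    return {"score": raw, "comment": None}
```

With this shape, a failing comparison yields a dict whose `comment` carries the reason, while a passing one keeps `comment` as `None`.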

Before

result = evaluator(outputs=agent_run, reference_outputs=expected)
# result["score"] = False
# result["comment"] = None   ← no information

After

result = evaluator(outputs=agent_run, reference_outputs=expected)
# result["score"] = False
# result["comment"] = "Step 2, tool 'search_web': argument mismatch (match_mode='exact').
#                      Expected: {\"query\": \"AI news 2025\"}.
#                      Got: {\"query\": \"AI news\"}."

Changes

python/agentevals/trajectory/strict.py — per-step diagnostics (length, role, tool name, tool args)
python/agentevals/trajectory/unordered.py — missing/extra tool call names
python/agentevals/trajectory/subset.py — extra tool call names not in reference
python/agentevals/trajectory/superset.py — missing required tool call names from reference
python/tests/test_trajectory.py — updated failure-case assertions to allow diagnostic comments
python/tests/test_trajectory_diagnostics.py — 13 new tests covering each failure mode
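For the unordered, subset, and superset modes, the missing and extra tool call names listed above could be computed with a multiset difference. The sketch below is an assumption about one reasonable approach, not the implementation in this PR; `tool_name_diff` and `unordered_comment` are hypothetical helpers, and the comment wording is illustrative.

```python
from collections import Counter


def tool_name_diff(output_names: list[str], reference_names: list[str]):
    """Return (missing, extra) tool call names, respecting repeated calls."""
    out, ref = Counter(output_names), Counter(reference_names)
    missing = sorted((ref - out).elements())  # in the reference, absent from the output
    extra = sorted((out - ref).elements())    # in the output, absent from the reference
    return missing, extra


def unordered_comment(output_names: list[str], reference_names: list[str]):
    """Hypothetical unordered scorer: True on match, (False, reason) otherwise."""
    missing, extra = tool_name_diff(output_names, reference_names)
    if not missing and not extra:
        return True
    parts = []
    if missing:
        parts.append(f"Missing tool calls: {missing}.")
    if extra:
        parts.append(f"Unexpected tool calls: {extra}.")
    return False, " ".join(parts)
```

Under the same assumptions, a subset check would report only `extra` and a superset check only `missing`.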

Backward compatibility

Fully backward-compatible. All callers that only check result["score"] are unaffected. Pass cases still return True (no tuple), so comment remains None on success.
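To illustrate the compatibility claim, here is a small sketch with two callers. The result dicts are hand-written stand-ins for evaluator output, and `legacy_check` and `assert_trajectory_match` are hypothetical helpers rather than part of agentevals.

```python
def legacy_check(result: dict) -> bool:
    """Existing-style caller: reads only the score and ignores any comment."""
    return bool(result["score"])


def assert_trajectory_match(result: dict) -> None:
    """Newer-style caller: surfaces the diagnostic comment when the match fails."""
    if not result["score"]:
        raise AssertionError(
            result.get("comment") or "trajectory mismatch (no diagnostic comment)"
        )


# Hand-written stand-ins for evaluator results (assumed shape: score + comment).
passing = {"score": True, "comment": None}
failing = {"score": False, "comment": "Step 2, tool 'search_web': argument mismatch."}

legacy_ok = legacy_check(passing)        # unaffected by the new comment field
legacy_failed = legacy_check(failing)    # still just a boolean
```

The legacy caller behaves the same with or without a comment; only callers that opt in to reading `comment` see the new diagnostics.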

…ailures Fixes langchain-ai#66. Modifies _scorer functions in strict, unordered, subset, and superset to return (bool, str) tuples on failure. The reasoning string surfaces as EvaluatorResult.comment via the openevals runner. Fully backward-compatible — callers that only check result.score are unaffected. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Sarthak Gupta and others added 2 commits March 27, 2026 21:57

Merge branch 'main' into fix/trajectory-diagnostic-comments

651a3cb

Sarthak Gupta (sarthakkgupta) wants to merge 2 commits into langchain-ai:main from sarthakkgupta:fix/trajectory-diagnostic-comments