feat: add tool_calls grader (#187)#202
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new tool_calls grader to validate tool usage in an agent session (required/forbidden tools and total call-count bounds), integrating it into the grader kind registry, YAML parameter decoding, and grader factory creation.
Changes:
- Register new grader kind
tool_callsinmodelsand wire it into grader parameter decoding + factory creation. - Implement
ToolCallsGraderto evaluate required/forbidden tools and min/max tool call counts with partial scoring. - Add comprehensive unit tests for validation, grading behavior, scoring, details, and factory integration.
Show a summary per file
| File | Description |
|---|---|
| internal/models/outcome.go | Registers tool_calls as a grader kind for validation/error messaging. |
| internal/models/grader_params.go | Adds ToolCallsGraderParameters and YAML decode support for tool_calls. |
| internal/graders/grader.go | Adds factory wiring to instantiate the new grader. |
| internal/graders/tool_calls_grader.go | Implements the tool_calls grader logic and result payload. |
| internal/graders/tool_calls_grader_test.go | Adds unit tests covering parameter validation and grading/scoring scenarios. |
Copilot's findings
- Files reviewed: 5/5 changed files
- Comments generated: 3
There was a problem hiding this comment.
Copilot's findings
Comments suppressed due to low confidence (1)
internal/graders/tool_calls_grader.go:121
- This main success/failure return also omits
Type. All graders should setGraderResults.Typeto their kind; otherwise downstream reporting can’t reliably identify the grader type.
return &models.GraderResults{
Name: g.name,
Passed: passed,
Score: score,
Feedback: feedback,
Details: map[string]any{
- Files reviewed: 7/7 changed files
- Comments generated: 5
| | Option | Type | Default | Description | | ||
| | ----------------- | ----------- | ------- | ----------------------------------------------- | | ||
| | `required_tools` | `list[str]` | `[]` | Tool names that **must** appear in the session | | ||
| | `forbidden_tools` | `list[str]` | `[]` | Tool names that **must not** appear | | ||
| | `min_calls` | `int` | `0` | Minimum total tool calls required (0 = no min) | | ||
| | `max_calls` | `int` | `0` | Maximum total tool calls allowed (0 = no limit) | |
There was a problem hiding this comment.
The docs say min_calls/max_calls default to 0 and that 0 disables the bound, but the implementation treats these as optional pointers (unset = no check) and 0 as an explicit bound (e.g., max_calls: 0 requires zero calls). Please update the defaults/description to reflect the actual behavior (use “unset/omitted” for no min/max).
| | Option | Type | Default | Description | | |
| | ----------------- | ----------- | ------- | ----------------------------------------------- | | |
| | `required_tools` | `list[str]` | `[]` | Tool names that **must** appear in the session | | |
| | `forbidden_tools` | `list[str]` | `[]` | Tool names that **must not** appear | | |
| | `min_calls` | `int` | `0` | Minimum total tool calls required (0 = no min) | | |
| | `max_calls` | `int` | `0` | Maximum total tool calls allowed (0 = no limit) | | |
| | Option | Type | Default | Description | | |
| | ----------------- | ----------- | ------- | -------------------------------------------------------- | | |
| | `required_tools` | `list[str]` | `[]` | Tool names that **must** appear in the session | | |
| | `forbidden_tools` | `list[str]` | `[]` | Tool names that **must not** appear | | |
| | `min_calls` | `int` | unset | Minimum total tool calls required (omit for no minimum) | | |
| | `max_calls` | `int` | unset | Maximum total tool calls allowed (omit for no maximum) | |
| bounds. Use `tool_constraint` when you also need turn and token budgets. | ||
| Use `behavior` when you need `max_duration_ms` or a single | ||
| `max_tool_calls` cap without tool-name validation. |
There was a problem hiding this comment.
This aside states that tool_constraint supports turn/token budgets, but the current tool_constraint grader parameters/implementation don’t include max_turns/max_tokens checks (it only validates tool usage patterns). Please adjust the comparison text to match what tool_constraint actually enforces so readers choose the right grader.
| bounds. Use `tool_constraint` when you also need turn and token budgets. | |
| Use `behavior` when you need `max_duration_ms` or a single | |
| `max_tool_calls` cap without tool-name validation. | |
| bounds. Use `tool_constraint` when you need more structured tool-usage | |
| constraints or pattern validation. Use `behavior` when you need | |
| `max_duration_ms` or a single `max_tool_calls` cap without tool-name | |
| validation. |
| // ToolCallsGrader validates which tools an agent called during execution. | ||
| // It checks required tools, forbidden tools, minimum calls, and maximum calls. | ||
| type ToolCallsGrader struct { |
There was a problem hiding this comment.
tool_calls currently checks only ToolCall.Name and total count. The linked issue (#187) mentions needing access to tool arguments/results (e.g., namespace-mode where the real tool is in arguments), so this may not fully satisfy the stated requirement if the PR is intended to close that issue. Consider clarifying scope (simple name/count constraints) or extending the grader params to support matching on ToolCall.Arguments/Result similarly to tool_constraint.
| return &models.GraderResults{ | ||
| Name: g.name, | ||
| Passed: false, | ||
| Score: 0, | ||
| Feedback: "no session data available for tool_calls grading", | ||
| }, nil |
There was a problem hiding this comment.
models.GraderResults.Type is required and is set by all other graders, but this early-return result doesn’t populate it. Please set Type: models.GraderKindToolCalls here so JSON/JUnit/web output includes the grader kind consistently.
This issue also appears on line 116 of the same file.
| calledNames := make([]string, 0, len(calledSet)) | ||
| for name := range calledSet { | ||
| calledNames = append(calledNames, name) | ||
| } | ||
|
|
There was a problem hiding this comment.
unique_tools is built by iterating over a map, so the order will be nondeterministic across runs. Consider sorting calledNames before returning it to keep Details stable (helps debugging and avoids flaky snapshot/JSON comparisons).
Adds a new tool_calls grader that validates which tools an agent called during execution. Supports four constraint types: - required_tools: tools that must appear in the session - forbidden_tools: tools that must not appear in the session - min_calls: minimum total tool call count - max_calls: maximum total tool call count Partial scoring: score = passed_checks / total_checks. Each constraint counts as one check. Constructor validates parameters at creation time. Includes 25 tests covering constructor validation, required/forbidden tools, call count bounds, combined checks, partial scoring, details output, edge cases, and factory integration. Closes #187 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add tool_calls to at-a-glance table in graders.mdx - Add full Tool Calls section with config options, examples, and comparison tip - Add tool_calls to grader type enum in schema.mdx - Site builds clean Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
5263d99 to
5fd38c5
Compare
Summary
Adds a new
tool_callsgrader that validates which tools an agent called during execution, without caring about order.Closes #187
Constraints Supported
required_tools[]stringforbidden_tools[]stringmin_calls*intmax_calls*intScoring
Partial credit:
score = passed_checks / total_checks. Each constraint counts as one check. Constructor validates parameters at creation time (non-negative bounds, min ≤ max, at least one constraint).Example YAML
Changes
internal/models/outcome.go: AddedGraderKindToolCallsconstant +AllGraderKinds()registrationinternal/models/grader_params.go: AddedToolCallsGraderParametersstruct + YAML decoder caseinternal/graders/grader.go: Added factory caseinternal/graders/tool_calls_grader.go: Full grader implementation (~120 lines)internal/graders/tool_calls_grader_test.go: 25 tests covering constructor validation, required/forbidden tools, call count bounds, combined checks, partial scoring, details output, edge cases, and factory integrationTesting
All 25 new tests pass. Existing model and grader tests unaffected.