feat: add tool_calls grader (#187) #202

Merged: spboyer merged 3 commits into main from squad/187-tool-calls-grader on Apr 21, 2026

Conversation

@spboyer spboyer commented Apr 21, 2026

Summary

Adds a new tool_calls grader that validates which tools an agent called during execution, without caring about order.

Closes #187

Constraints Supported

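| Parameter | Type | Description |
| ----------------- | ---------- | ------------------------------------- |
| `required_tools` | `[]string` | Tools that must appear in the session |
| `forbidden_tools` | `[]string` | Tools that must not appear |
| `min_calls` | `*int` | Minimum total tool call count |
| `max_calls` | `*int` | Maximum total tool call count |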

Scoring

Partial credit: score = passed_checks / total_checks. Each constraint counts as one check. Constructor validates parameters at creation time (non-negative bounds, min ≤ max, at least one constraint).
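
As a rough illustration of the partial-credit rule, the sketch below computes passed_checks / total_checks from a list of called tool names. It is not the PR's implementation: the `toolCallsParams` type and `scoreToolCalls` function are hypothetical names, and it assumes each configured parameter contributes exactly one check.

```go
package main

import "fmt"

// toolCallsParams mirrors the four constraints listed above.
// Hypothetical type for illustration; not the PR's ToolCallsGraderParameters.
type toolCallsParams struct {
	RequiredTools  []string
	ForbiddenTools []string
	MinCalls       *int // nil means "not configured"
	MaxCalls       *int // nil means "not configured"
}

// scoreToolCalls returns passed_checks / total_checks for the called tool names.
// Assumption: each configured parameter contributes exactly one check.
func scoreToolCalls(p toolCallsParams, calls []string) float64 {
	called := make(map[string]bool, len(calls))
	for _, name := range calls {
		called[name] = true
	}

	passed, total := 0, 0
	record := func(ok bool) {
		total++
		if ok {
			passed++
		}
	}

	if len(p.RequiredTools) > 0 {
		ok := true
		for _, name := range p.RequiredTools {
			ok = ok && called[name]
		}
		record(ok)
	}
	if len(p.ForbiddenTools) > 0 {
		ok := true
		for _, name := range p.ForbiddenTools {
			ok = ok && !called[name]
		}
		record(ok)
	}
	if p.MinCalls != nil {
		record(len(calls) >= *p.MinCalls)
	}
	if p.MaxCalls != nil {
		record(len(calls) <= *p.MaxCalls)
	}

	if total == 0 {
		return 0 // the constructor is described as rejecting this case
	}
	return float64(passed) / float64(total)
}

func main() {
	minCalls, maxCalls := 2, 20
	p := toolCallsParams{
		RequiredTools:  []string{"bash", "view"},
		ForbiddenTools: []string{"rm"},
		MinCalls:       &minCalls,
		MaxCalls:       &maxCalls,
	}
	// "view" is missing and "rm" was called, so 2 of 4 checks pass.
	fmt.Println(scoreToolCalls(p, []string{"bash", "bash", "rm"}))
}
```

If the real grader counts one check per tool name rather than per parameter, the denominator would differ; the PR text does not settle that detail.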

Example YAML

graders:
  - kind: tool_calls
    name: verify-tools
    params:
      required_tools: [bash, view]
      forbidden_tools: [rm]
      min_calls: 2
      max_calls: 20
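
The constructor checks named under Scoring (non-negative bounds, min ≤ max, at least one constraint) might look roughly like the sketch below. `validateParams` and its error messages are hypothetical and are not taken from the PR; the values in `main` match the example config above.

```go
package main

import (
	"errors"
	"fmt"
)

// validateParams is a hypothetical stand-in for the constructor validation.
// It rejects configurations with no constraint, negative bounds, or min > max.
func validateParams(required, forbidden []string, minCalls, maxCalls *int) error {
	if len(required) == 0 && len(forbidden) == 0 && minCalls == nil && maxCalls == nil {
		return errors.New("tool_calls: at least one constraint is required")
	}
	if minCalls != nil && *minCalls < 0 {
		return errors.New("tool_calls: min_calls must be non-negative")
	}
	if maxCalls != nil && *maxCalls < 0 {
		return errors.New("tool_calls: max_calls must be non-negative")
	}
	if minCalls != nil && maxCalls != nil && *minCalls > *maxCalls {
		return errors.New("tool_calls: min_calls must not exceed max_calls")
	}
	return nil
}

func main() {
	minCalls, maxCalls := 2, 20
	err := validateParams([]string{"bash", "view"}, []string{"rm"}, &minCalls, &maxCalls)
	fmt.Println("example config valid:", err == nil)
}
```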

Changes

  • internal/models/outcome.go: Added GraderKindToolCalls constant + AllGraderKinds() registration
  • internal/models/grader_params.go: Added ToolCallsGraderParameters struct + YAML decoder case
  • internal/graders/grader.go: Added factory case
  • internal/graders/tool_calls_grader.go: Full grader implementation (~120 lines)
  • internal/graders/tool_calls_grader_test.go: 25 tests covering constructor validation, required/forbidden tools, call count bounds, combined checks, partial scoring, details output, edge cases, and factory integration

Testing

All 25 new tests pass. Existing model and grader tests unaffected.

Copilot AI review requested due to automatic review settings April 21, 2026 17:11
@github-actions github-actions Bot enabled auto-merge (squash) April 21, 2026 17:11

Copilot AI left a comment

Pull request overview

Adds a new tool_calls grader to validate tool usage in an agent session (required/forbidden tools and total call-count bounds), integrating it into the grader kind registry, YAML parameter decoding, and grader factory creation.

Changes:

  • Register new grader kind tool_calls in models and wire it into grader parameter decoding + factory creation.
  • Implement ToolCallsGrader to evaluate required/forbidden tools and min/max tool call counts with partial scoring.
  • Add comprehensive unit tests for validation, grading behavior, scoring, details, and factory integration.

| File | Description |
| ---- | ----------- |
| internal/models/outcome.go | Registers tool_calls as a grader kind for validation/error messaging. |
| internal/models/grader_params.go | Adds ToolCallsGraderParameters and YAML decode support for tool_calls. |
| internal/graders/grader.go | Adds factory wiring to instantiate the new grader. |
| internal/graders/tool_calls_grader.go | Implements the tool_calls grader logic and result payload. |
| internal/graders/tool_calls_grader_test.go | Adds unit tests covering parameter validation and grading/scoring scenarios. |

Copilot's findings

  • Files reviewed: 5/5 changed files
  • Comments generated: 3

Comment thread internal/graders/tool_calls_grader.go
Comment thread internal/graders/tool_calls_grader.go
Comment thread internal/graders/tool_calls_grader.go
Copilot AI review requested due to automatic review settings April 21, 2026 17:24

Copilot AI left a comment

Copilot's findings

Comments suppressed due to low confidence (1)

internal/graders/tool_calls_grader.go:121

  • This main success/failure return also omits Type. All graders should set GraderResults.Type to their kind; otherwise downstream reporting can’t reliably identify the grader type.
		return &models.GraderResults{
			Name:     g.name,
			Passed:   passed,
			Score:    score,
			Feedback: feedback,
			Details: map[string]any{
  • Files reviewed: 7/7 changed files
  • Comments generated: 5

Comment on lines +541 to +546
| Option | Type | Default | Description |
| ----------------- | ----------- | ------- | ----------------------------------------------- |
| `required_tools` | `list[str]` | `[]` | Tool names that **must** appear in the session |
| `forbidden_tools` | `list[str]` | `[]` | Tool names that **must not** appear |
| `min_calls` | `int` | `0` | Minimum total tool calls required (0 = no min) |
| `max_calls` | `int` | `0` | Maximum total tool calls allowed (0 = no limit) |

Copilot AI Apr 21, 2026

The docs say min_calls/max_calls default to 0 and that 0 disables the bound, but the implementation treats these as optional pointers (unset = no check) and 0 as an explicit bound (e.g., max_calls: 0 requires zero calls). Please update the defaults/description to reflect the actual behavior (use “unset/omitted” for no min/max).
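
The distinction between an omitted bound and an explicit zero can be sketched with optional pointers, as below. `withinBounds` is a hypothetical helper written for this illustration, not code from the PR.

```go
package main

import "fmt"

// withinBounds reports whether count satisfies the optional bounds.
// A nil pointer means the bound was omitted (no check);
// a pointer to 0 is an explicit bound of zero.
func withinBounds(count int, minCalls, maxCalls *int) bool {
	if minCalls != nil && count < *minCalls {
		return false
	}
	if maxCalls != nil && count > *maxCalls {
		return false
	}
	return true
}

func main() {
	zero := 0
	fmt.Println(withinBounds(3, nil, nil))   // bounds omitted: true
	fmt.Println(withinBounds(3, nil, &zero)) // max_calls: 0 with three calls: false
	fmt.Println(withinBounds(0, nil, &zero)) // max_calls: 0 with zero calls: true
}
```

With a plain `int` field, the zero value could not be told apart from an omitted setting, which is why the pointer form changes what the documented default should say.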

Suggested change
| Option | Type | Default | Description |
| ----------------- | ----------- | ------- | -------------------------------------------------------- |
| `required_tools` | `list[str]` | `[]` | Tool names that **must** appear in the session |
| `forbidden_tools` | `list[str]` | `[]` | Tool names that **must not** appear |
| `min_calls` | `int` | unset | Minimum total tool calls required (omit for no minimum) |
| `max_calls` | `int` | unset | Maximum total tool calls allowed (omit for no maximum) |

Comment on lines +554 to +556
bounds. Use `tool_constraint` when you also need turn and token budgets.
Use `behavior` when you need `max_duration_ms` or a single
`max_tool_calls` cap without tool-name validation.

Copilot AI Apr 21, 2026

This aside states that tool_constraint supports turn/token budgets, but the current tool_constraint grader parameters/implementation don’t include max_turns/max_tokens checks (it only validates tool usage patterns). Please adjust the comparison text to match what tool_constraint actually enforces so readers choose the right grader.

Suggested change
bounds. Use `tool_constraint` when you need more structured tool-usage
constraints or pattern validation. Use `behavior` when you need
`max_duration_ms` or a single `max_tool_calls` cap without tool-name
validation.

Comment on lines +11 to +13
// ToolCallsGrader validates which tools an agent called during execution.
// It checks required tools, forbidden tools, minimum calls, and maximum calls.
type ToolCallsGrader struct {

Copilot AI Apr 21, 2026

tool_calls currently checks only ToolCall.Name and total count. The linked issue (#187) mentions needing access to tool arguments/results (e.g., namespace-mode where the real tool is in arguments), so this may not fully satisfy the stated requirement if the PR is intended to close that issue. Consider clarifying scope (simple name/count constraints) or extending the grader params to support matching on ToolCall.Arguments/Result similarly to tool_constraint.

Comment on lines +50 to +55
		return &models.GraderResults{
			Name:     g.name,
			Passed:   false,
			Score:    0,
			Feedback: "no session data available for tool_calls grading",
		}, nil

Copilot AI Apr 21, 2026

models.GraderResults.Type is required and is set by all other graders, but this early-return result doesn’t populate it. Please set Type: models.GraderKindToolCalls here so JSON/JUnit/web output includes the grader kind consistently.

This issue also appears on line 116 of the same file.

Comment on lines +111 to +115
	calledNames := make([]string, 0, len(calledSet))
	for name := range calledSet {
		calledNames = append(calledNames, name)
	}

Copilot AI Apr 21, 2026

unique_tools is built by iterating over a map, so the order will be nondeterministic across runs. Consider sorting calledNames before returning it to keep Details stable (helps debugging and avoids flaky snapshot/JSON comparisons).
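
One way to get the stable ordering suggested here is to sort the collected names before returning them. The sketch below assumes `calledSet` is a `map[string]struct{}`, which the quoted snippet does not confirm; `sortedToolNames` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedToolNames collects the keys of calledSet and sorts them,
// so the result does not depend on Go's randomized map iteration order.
// The map type is an assumption; the PR's actual type is not shown.
func sortedToolNames(calledSet map[string]struct{}) []string {
	names := make([]string, 0, len(calledSet))
	for name := range calledSet {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	calledSet := map[string]struct{}{"view": {}, "bash": {}, "rm": {}}
	fmt.Println(sortedToolNames(calledSet))
}
```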

Copilot AI added 3 commits April 21, 2026 14:21
Adds a new tool_calls grader that validates which tools an agent called
during execution. Supports four constraint types:

- required_tools: tools that must appear in the session
- forbidden_tools: tools that must not appear in the session
- min_calls: minimum total tool call count
- max_calls: maximum total tool call count

Partial scoring: score = passed_checks / total_checks. Each constraint
counts as one check. Constructor validates parameters at creation time.

Includes 25 tests covering constructor validation, required/forbidden
tools, call count bounds, combined checks, partial scoring, details
output, edge cases, and factory integration.

Closes #187

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add tool_calls to at-a-glance table in graders.mdx
- Add full Tool Calls section with config options, examples, and comparison tip
- Add tool_calls to grader type enum in schema.mdx
- Site builds clean

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@spboyer spboyer force-pushed the squad/187-tool-calls-grader branch from 5263d99 to 5fd38c5 Compare April 21, 2026 18:22
@spboyer spboyer merged commit 7bd731f into main Apr 21, 2026
6 checks passed
@spboyer spboyer deleted the squad/187-tool-calls-grader branch April 21, 2026 18:26

Development

Successfully merging this pull request may close these issues.

feature request: implementing tool_calls grader

3 participants