feat: add tool_calls grader (#187) by spboyer · Pull Request #202 · microsoft/waza

spboyer · 2026-04-21T17:11:25Z

Summary

Adds a new tool_calls grader that validates which tools an agent called during execution, without caring about order.

Closes #187

Constraints Supported

Parameter	Type	Description
`required_tools`	`[]string`	Tools that must appear in the session
`forbidden_tools`	`[]string`	Tools that must not appear
`min_calls`	`*int`	Minimum total tool call count
`max_calls`	`*int`	Maximum total tool call count

Scoring

Partial credit: score = passed_checks / total_checks. Each constraint counts as one check. Constructor validates parameters at creation time (non-negative bounds, min ≤ max, at least one constraint).

Example YAML

graders:
  - kind: tool_calls
    name: verify-tools
    params:
      required_tools: [bash, view]
      forbidden_tools: [rm]
      min_calls: 2
      max_calls: 20

Changes

internal/models/outcome.go: Added GraderKindToolCalls constant + AllGraderKinds() registration
internal/models/grader_params.go: Added ToolCallsGraderParameters struct + YAML decoder case
internal/graders/grader.go: Added factory case
internal/graders/tool_calls_grader.go: Full grader implementation (~120 lines)
internal/graders/tool_calls_grader_test.go: 25 tests covering constructor validation, required/forbidden tools, call count bounds, combined checks, partial scoring, details output, edge cases, and factory integration

Testing

All 25 new tests pass. Existing model and grader tests unaffected.

Copilot

Pull request overview

Adds a new tool_calls grader to validate tool usage in an agent session (required/forbidden tools and total call-count bounds), integrating it into the grader kind registry, YAML parameter decoding, and grader factory creation.

Changes:

Register new grader kind tool_calls in models and wire it into grader parameter decoding + factory creation.
Implement ToolCallsGrader to evaluate required/forbidden tools and min/max tool call counts with partial scoring.
Add comprehensive unit tests for validation, grading behavior, scoring, details, and factory integration.

Show a summary per file

File	Description
internal/models/outcome.go	Registers `tool_calls` as a grader kind for validation/error messaging.
internal/models/grader_params.go	Adds `ToolCallsGraderParameters` and YAML decode support for `tool_calls`.
internal/graders/grader.go	Adds factory wiring to instantiate the new grader.
internal/graders/tool_calls_grader.go	Implements the `tool_calls` grader logic and result payload.
internal/graders/tool_calls_grader_test.go	Adds unit tests covering parameter validation and grading/scoring scenarios.

Copilot's findings

Files reviewed: 5/5 changed files
Comments generated: 3

Copilot

Copilot's findings

Comments suppressed due to low confidence (1)

internal/graders/tool_calls_grader.go:121

This main success/failure return also omits Type. All graders should set GraderResults.Type to their kind; otherwise downstream reporting can’t reliably identify the grader type.

		return &models.GraderResults{
			Name:     g.name,
			Passed:   passed,
			Score:    score,
			Feedback: feedback,
			Details: map[string]any{

Files reviewed: 7/7 changed files
Comments generated: 5

Copilot · 2026-04-21T17:31:01Z

+| Option            | Type        | Default | Description                                     |
+| ----------------- | ----------- | ------- | ----------------------------------------------- |
+| `required_tools`  | `list[str]` | `[]`    | Tool names that **must** appear in the session  |
+| `forbidden_tools` | `list[str]` | `[]`    | Tool names that **must not** appear             |
+| `min_calls`       | `int`       | `0`     | Minimum total tool calls required (0 = no min)  |
+| `max_calls`       | `int`       | `0`     | Maximum total tool calls allowed (0 = no limit) |


The docs say min_calls/max_calls default to 0 and that 0 disables the bound, but the implementation treats these as optional pointers (unset = no check) and 0 as an explicit bound (e.g., max_calls: 0 requires zero calls). Please update the defaults/description to reflect the actual behavior (use “unset/omitted” for no min/max).

Suggested change

| Option | Type | Default | Description |

| ----------------- | ----------- | ------- | ----------------------------------------------- |

| `required_tools` | `list[str]` | `[]` | Tool names that **must** appear in the session |

| `forbidden_tools` | `list[str]` | `[]` | Tool names that **must not** appear |

| `min_calls` | `int` | `0` | Minimum total tool calls required (0 = no min) |

| `max_calls` | `int` | `0` | Maximum total tool calls allowed (0 = no limit) |

| Option | Type | Default | Description |

| ----------------- | ----------- | ------- | -------------------------------------------------------- |

| `required_tools` | `list[str]` | `[]` | Tool names that **must** appear in the session |

| `forbidden_tools` | `list[str]` | `[]` | Tool names that **must not** appear |

| `min_calls` | `int` | unset | Minimum total tool calls required (omit for no minimum) |

| `max_calls` | `int` | unset | Maximum total tool calls allowed (omit for no maximum) |

Copilot · 2026-04-21T17:31:01Z

+  bounds. Use `tool_constraint` when you also need turn and token budgets.
+  Use `behavior` when you need `max_duration_ms` or a single
+  `max_tool_calls` cap without tool-name validation.


This aside states that tool_constraint supports turn/token budgets, but the current tool_constraint grader parameters/implementation don’t include max_turns/max_tokens checks (it only validates tool usage patterns). Please adjust the comparison text to match what tool_constraint actually enforces so readers choose the right grader.

Suggested change

bounds. Use `tool_constraint` when you also need turn and token budgets.

Use `behavior` when you need `max_duration_ms` or a single

`max_tool_calls` cap without tool-name validation.

bounds. Use `tool_constraint` when you need more structured tool-usage

constraints or pattern validation. Use `behavior` when you need

`max_duration_ms` or a single `max_tool_calls` cap without tool-name

validation.

Copilot · 2026-04-21T17:31:01Z

+// ToolCallsGrader validates which tools an agent called during execution.
+// It checks required tools, forbidden tools, minimum calls, and maximum calls.
+type ToolCallsGrader struct {


tool_calls currently checks only ToolCall.Name and total count. The linked issue (#187) mentions needing access to tool arguments/results (e.g., namespace-mode where the real tool is in arguments), so this may not fully satisfy the stated requirement if the PR is intended to close that issue. Consider clarifying scope (simple name/count constraints) or extending the grader params to support matching on ToolCall.Arguments/Result similarly to tool_constraint.

Copilot · 2026-04-21T17:31:02Z

+			return &models.GraderResults{
+				Name:     g.name,
+				Passed:   false,
+				Score:    0,
+				Feedback: "no session data available for tool_calls grading",
+			}, nil


models.GraderResults.Type is required and is set by all other graders, but this early-return result doesn’t populate it. Please set Type: models.GraderKindToolCalls here so JSON/JUnit/web output includes the grader kind consistently.

This issue also appears on line 116 of the same file.

Copilot · 2026-04-21T17:31:02Z

+		calledNames := make([]string, 0, len(calledSet))
+		for name := range calledSet {
+			calledNames = append(calledNames, name)
+		}
+


unique_tools is built by iterating over a map, so the order will be nondeterministic across runs. Consider sorting calledNames before returning it to keep Details stable (helps debugging and avoids flaky snapshot/JSON comparisons).

Adds a new tool_calls grader that validates which tools an agent called during execution. Supports four constraint types: - required_tools: tools that must appear in the session - forbidden_tools: tools that must not appear in the session - min_calls: minimum total tool call count - max_calls: maximum total tool call count Partial scoring: score = passed_checks / total_checks. Each constraint counts as one check. Constructor validates parameters at creation time. Includes 25 tests covering constructor validation, required/forbidden tools, call count bounds, combined checks, partial scoring, details output, edge cases, and factory integration. Closes #187 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add tool_calls to at-a-glance table in graders.mdx - Add full Tool Calls section with config options, examples, and comparison tip - Add tool_calls to grader type enum in schema.mdx - Site builds clean Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings April 21, 2026 17:11

spboyer requested review from chlowell, richardpark-msft and wbreza as code owners April 21, 2026 17:11

github-actions Bot enabled auto-merge (squash) April 21, 2026 17:11

Copilot started reviewing on behalf of spboyer April 21, 2026 17:12 View session

Copilot AI reviewed Apr 21, 2026

View reviewed changes

Comment thread internal/graders/tool_calls_grader.go

Comment thread internal/graders/tool_calls_grader.go

Comment thread internal/graders/tool_calls_grader.go

Copilot AI review requested due to automatic review settings April 21, 2026 17:24

Copilot started reviewing on behalf of spboyer April 21, 2026 17:25 View session

Copilot AI reviewed Apr 21, 2026

View reviewed changes

Copilot AI added 3 commits April 21, 2026 14:21

fix: errcheck lint violation in tool_calls grader test

5fd38c5

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer force-pushed the squad/187-tool-calls-grader branch from 5263d99 to 5fd38c5 Compare April 21, 2026 18:22

spboyer merged commit 7bd731f into main Apr 21, 2026
6 checks passed

spboyer deleted the squad/187-tool-calls-grader branch April 21, 2026 18:26

spboyer mentioned this pull request Apr 21, 2026

feature request: implementing tool_calls grader #187

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add tool_calls grader (#187)#202

feat: add tool_calls grader (#187)#202
spboyer merged 3 commits into
mainfrom
squad/187-tool-calls-grader

spboyer commented Apr 21, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

spboyer commented Apr 21, 2026

Summary

Constraints Supported

Scoring

Example YAML

Changes

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Copilot's findings

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Copilot's findings

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants