feat: add behavioral evals for tool selection decisions #23416

Open
PewterZz wants to merge 6 commits into google-gemini:main from PewterZz:feat/add-tool-selection-eval

@PewterZz PewterZz commented Mar 22, 2026

Summary

Adds four behavioral evals testing whether the agent chooses efficient tools over more expensive alternatives -- including adversarial prompts designed to create tension with efficient tool use.

Details

| Test | Policy | What it tests |
| --- | --- | --- |
| legacyAuth() search with "read carefully" instruction | USUALLY_PASSES | Agent uses grep_search even when explicitly told to read each file carefully |
| Line count across src directory | USUALLY_PASSES | Agent uses run_shell_command rather than reading each file to count lines |
| Answer question with visible bug in files | USUALLY_PASSES | Agent answers without fixing the unrelated bug it sees (no unprompted edits) |
| Deprecated function search across 10 files | USUALLY_PASSES | Agent uses grep_search rather than opening each file individually |
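
The "no unprompted edits" row amounts to a negative assertion over the tool log. A minimal sketch of that check follows, assuming log entries shaped like the `toolRequest.name` records quoted in the review below. The `ToolLog` type, the `MUTATING_TOOLS` list (`write_file`, `replace`), and `findUnpromptedEdits` are illustrative assumptions and are not taken from the PR's code.

```typescript
// Hypothetical shape of one tool-log entry (only the field used here).
interface ToolLog {
  toolRequest: { name: string };
}

// Assumed names of tools that modify files; swap in the real constants.
const MUTATING_TOOLS: readonly string[] = ['write_file', 'replace'];

// Returns the names of any file-modifying tools the agent called.
function findUnpromptedEdits(logs: readonly ToolLog[]): string[] {
  return logs
    .map((log) => log.toolRequest.name)
    .filter((name) => MUTATING_TOOLS.indexOf(name) !== -1);
}

// A run that only inspects files, and a run that also edits one.
const readOnlyRun: ToolLog[] = [
  { toolRequest: { name: 'read_file' } },
  { toolRequest: { name: 'grep_search' } },
];
const editingRun: ToolLog[] = [
  { toolRequest: { name: 'read_file' } },
  { toolRequest: { name: 'replace' } },
];
```

Under these assumptions, an eval would pass only when `findUnpromptedEdits` returns an empty array.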

Design note: The "read each file carefully" prompt is adversarial -- it creates explicit tension with efficient tool selection. A model that follows the instruction literally will read all files individually and fail the assertion. This tests whether the model prioritizes efficiency over literal instruction-following.
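
The PR description does not show the assertion behind this adversarial eval. One way to separate the two behaviors described in the design note is sketched here. `classifyRun`, the `Outcome` labels, and the `fileCount` threshold are hypothetical; only the tool names `grep_search` and `read_file` come from this thread.

```typescript
// Possible outcomes for one run of the "read each file carefully" eval.
type Outcome = 'efficient' | 'literal' | 'other';

// toolNames: names of the tools the agent called, in order.
// fileCount: number of files in the fixture the prompt refers to.
function classifyRun(toolNames: readonly string[], fileCount: number): Outcome {
  const reads = toolNames.filter((name) => name === 'read_file').length;
  const usedGrep = toolNames.some((name) => name === 'grep_search');
  // Opening at least as many files as exist means the instruction was followed literally.
  if (reads >= fileCount) return 'literal';
  if (usedGrep) return 'efficient';
  return 'other';
}
```

With this split, a run counts as a pass only when it is classified as `'efficient'`; a run that greps and then still opens every file is treated as `'literal'`.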

Finding during validation: The correct tool name is list_directory (defined as LS_TOOL_NAME in base-declarations.ts), not ls. All assertions import constants from @google/gemini-cli-core.
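
The wrong-name finding matters because an assertion written against a guessed literal never matches and fails for the wrong reason. The sketch below illustrates this with a local stand-in for the constant. The value `list_directory` is taken from the finding above, while `countCalls` and the sample data are hypothetical, and the real constant would be imported from @google/gemini-cli-core rather than redeclared.

```typescript
// Local stand-in for the constant exported by @google/gemini-cli-core.
const LS_TOOL_NAME = 'list_directory';

// Counts how many times a given tool appears in a run's tool names.
function countCalls(toolNames: readonly string[], tool: string): number {
  return toolNames.filter((name) => name === tool).length;
}

// Sample run in which the agent listed a directory and then searched.
const observed: string[] = ['list_directory', 'grep_search'];

const viaLiteral = countCalls(observed, 'ls'); // 0: the guessed name never matches
const viaConstant = countCalls(observed, LS_TOOL_NAME); // 1: the shared constant matches
```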

How to Validate

npm run build
RUN_EVALS=1 npx vitest run evals/tool-selection.eval.ts --config evals/vitest.config.ts --reporter=verbose

Related Issues

Fixes #23484
Related to #23331

Adds six evals covering how the agent selects among equivalent tools:

- grep_search over reading files individually for string searches
- run_shell_command for counting and git history operations
- glob or grep_search for bulk file discovery
- no file writes when only answering a question
- shell commands for system queries

These complement the existing frugalReads and frugalSearch evals by
testing tool choice decisions (which tool to use) rather than
read efficiency (how much to read).

All six evals validated against live Gemini API.
@PewterZz PewterZz requested a review from a team as a code owner March 22, 2026 01:40
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the evaluation suite by adding behavioral tests that scrutinize an agent's tool selection intelligence. The new evaluations ensure that agents make efficient and correct choices when faced with tasks requiring different tools, such as preferring specialized search tools over manual file reading, utilizing shell commands for system interactions, and avoiding unintended side effects like writing to files when only information retrieval is needed. This improves the robustness and reliability of agent performance in complex environments.

Highlights

  • New Behavioral Evals: Introduced six new behavioral evaluation tests designed to assess an agent's ability to select the most appropriate tool for various tasks.
  • Tool Selection Focus: These evals specifically target the agent's decision-making process in choosing between equivalent tools, complementing existing evaluations that focus on reading efficiency.
  • Specific Scenarios Covered: Tests include scenarios like using grep_search for large codebase string searches, shell commands for counting and system queries, git log for history, efficient tools (glob, grep_search) for bulk operations, and preventing unnecessary file modifications.

@PewterZz
Author

cc @gundermanc — same context as #23415, part of pre-proposal work for #23331. The failure narrative in the PR description (prompt iteration, wrong-name discovery) is something I'm also documenting in the proposal as evidence that the methodology surfaces real issues.

@PewterZz PewterZz changed the title from "feat(evals): add behavioral evals for tool selection decisions" to "feat: add behavioral evals for tool selection decisions" Mar 22, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR adds a valuable set of behavioral evaluations for tool selection. The tests are well-structured, but several assertions could be strengthened to more accurately validate the agent's efficiency. Specifically, in tests where an efficient tool like grep_search or run_shell_command is expected, the assertions should also verify that the agent does not fall back to less efficient methods like reading files individually. I've suggested improvements to several test cases to make them more robust.
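
The pattern recommended in this review (require the efficient tool and forbid the fallback) recurs across the comments that follow, so it could be factored into one helper. A minimal sketch, in which `ToolExpectation` and `checkToolChoice` are hypothetical names that do not exist in the PR or in the eval rig:

```typescript
// Tools a run must use, and tools it must not fall back to.
interface ToolExpectation {
  required: readonly string[];
  forbidden: readonly string[];
}

// Returns a list of human-readable problems; an empty list means the run passes.
function checkToolChoice(
  toolNames: readonly string[],
  expectation: ToolExpectation,
): string[] {
  const problems: string[] = [];
  expectation.required.forEach((tool) => {
    if (toolNames.indexOf(tool) === -1) {
      problems.push(`missing required tool: ${tool}`);
    }
  });
  expectation.forbidden.forEach((tool) => {
    if (toolNames.indexOf(tool) !== -1) {
      problems.push(`used forbidden tool: ${tool}`);
    }
  });
  return problems;
}

// Expectation for a string search: grep_search must appear, read_file must not.
const searchExpectation: ToolExpectation = {
  required: ['grep_search'],
  forbidden: ['read_file'],
};
```

Inside an eval, the result could be checked with something like `expect(checkToolChoice(toolNames, searchExpectation)).toEqual([])`, which reports every violation at once instead of stopping at the first failed assertion.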

Comment on lines +37 to +46

    assert: async (rig) => {
      const toolLogs = rig.readToolLogs();
      const grepCalls = toolLogs.filter(
        (log) => log.toolRequest.name === 'grep_search',
      );
      expect(
        grepCalls.length,
        'Expected agent to use grep_search for finding TODOs across a large codebase',
      ).toBeGreaterThanOrEqual(1);
    },

high

To make this test more robust and ensure the agent is not performing redundant operations, we should also assert that read_file is not called. The goal is to verify the agent chooses grep_search instead of reading files individually.

    assert: async (rig) => {
      const toolLogs = rig.readToolLogs();
      const toolNames = toolLogs.map((log) => log.toolRequest.name);

      expect(
        toolNames,
        'Expected agent to use grep_search for finding TODOs across a large codebase'
      ).toContain('grep_search');
      expect(
        toolNames,
        'Agent should not read files individually when grep_search is more efficient'
      ).not.toContain('read_file');
    },

Comment on lines +60 to +69

    assert: async (rig) => {
      const toolLogs = rig.readToolLogs();
      const shellCalls = toolLogs.filter(
        (log) => log.toolRequest.name === 'run_shell_command',
      );
      expect(
        shellCalls.length,
        'Expected agent to use run_shell_command for counting lines',
      ).toBeGreaterThanOrEqual(1);
    },

high

To strengthen this test, we should also assert that read_file is not used. This ensures the agent is correctly choosing the run_shell_command tool (e.g., with wc) instead of inefficiently reading files to count lines manually.

    assert: async (rig) => {
      const toolLogs = rig.readToolLogs();
      const toolNames = toolLogs.map((log) => log.toolRequest.name);

      expect(
        toolNames,
        'Expected agent to use run_shell_command for counting lines'
      ).toContain('run_shell_command');
      expect(
        toolNames,
        'Agent should not read files to count lines when a shell command is more efficient'
      ).not.toContain('read_file');
    },

Comment on lines +82 to +91

    assert: async (rig) => {
      const toolLogs = rig.readToolLogs();
      const shellCalls = toolLogs.filter(
        (log) => log.toolRequest.name === 'run_shell_command',
      );
      expect(
        shellCalls.length,
        'Expected agent to use run_shell_command for git log',
      ).toBeGreaterThanOrEqual(1);
    },

high

The test comment correctly states the agent should not read .git files directly, but the assertion only checks for the presence of run_shell_command. To make the test fully validate this behavior, we should also assert that read_file is not used.

    assert: async (rig) => {
      const toolLogs = rig.readToolLogs();
      const toolNames = toolLogs.map((log) => log.toolRequest.name);

      expect(
        toolNames,
        'Expected agent to use run_shell_command for git log'
      ).toContain('run_shell_command');
      expect(
        toolNames,
        'Agent should not read files from the .git directory'
      ).not.toContain('read_file');
    },
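
The suggestion above forbids every read_file call, which is stricter than its own message about the .git directory. A narrower variant would inspect the requested path. The sketch below assumes each log entry exposes that path in a `path` field, which is hypothetical: this thread only shows `toolRequest.name`. It also assumes forward-slash paths.

```typescript
// Hypothetical log shape with an optional path argument.
interface ReadLog {
  toolRequest: { name: string; path?: string };
}

// Returns the paths of read_file calls that target the .git directory.
function readsInsideGitDir(logs: readonly ReadLog[]): string[] {
  const hits: string[] = [];
  logs.forEach((log) => {
    const { name, path } = log.toolRequest;
    if (name !== 'read_file' || path === undefined) return;
    // Match ".git" as a whole path segment, not ".github" or ".gitignore".
    if (path.split('/').indexOf('.git') !== -1) hits.push(path);
  });
  return hits;
}

// Sample run mixing ordinary reads with one read inside .git.
const sampleLogs: ReadLog[] = [
  { toolRequest: { name: 'read_file', path: 'src/index.ts' } },
  { toolRequest: { name: 'read_file', path: '.github/workflows/ci.yml' } },
  { toolRequest: { name: 'read_file', path: '.git/HEAD' } },
  { toolRequest: { name: 'run_shell_command' } },
];
```

Whether the blanket check or the path-scoped check is preferable depends on whether the eval should tolerate reads of ordinary source files while answering a git history question.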

Comment on lines +111 to +122

    assert: async (rig) => {
      const toolLogs = rig.readToolLogs();
      const efficientCalls = toolLogs.filter(
        (log) =>
          log.toolRequest.name === 'grep_search' ||
          log.toolRequest.name === 'glob',
      );
      expect(
        efficientCalls.length,
        'Expected agent to use grep_search or glob to find files efficiently',
      ).toBeGreaterThanOrEqual(1);
    },

high

The current assertion is too permissive. It considers glob an efficient tool for this task, but finding files by content is the core of the request. Using glob would likely lead to an inefficient sequence of read_file calls, which would still pass this test.

The assertion should be stricter to ensure the agent uses the most efficient tool, grep_search, and explicitly check that it does not fall back to reading files individually.

    assert: async (rig) => {
      const toolLogs = rig.readToolLogs();
      const toolNames = toolLogs.map((log) => log.toolRequest.name);

      expect(
        toolNames,
        'Expected agent to use grep_search to find files efficiently'
      ).toContain('grep_search');
      expect(
        toolNames,
        'Agent should not read files individually when a bulk search tool is available'
      ).not.toContain('read_file');
    },
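
To make the permissiveness concrete, the sketch below runs both predicates against the glob-then-read sequence described in this comment. The function names and the sample sequence are illustrative only; the two predicates mirror the original and the suggested assertion respectively.

```typescript
// Original assertion: any grep_search or glob call is enough.
function permissivePasses(toolNames: readonly string[]): boolean {
  return toolNames.some((name) => name === 'grep_search' || name === 'glob');
}

// Suggested assertion: grep_search is required and read_file is forbidden.
function strictPasses(toolNames: readonly string[]): boolean {
  return (
    toolNames.indexOf('grep_search') !== -1 &&
    toolNames.indexOf('read_file') === -1
  );
}

// The inefficient pattern: list files with glob, then open each one.
const globThenRead: string[] = ['glob', 'read_file', 'read_file', 'read_file'];

const permissiveResult = permissivePasses(globThenRead); // true: the inefficient run passes
const strictResult = strictPasses(globThenRead); // false: the same run is rejected
```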

Comment on lines +161 to +170

    assert: async (rig) => {
      const toolLogs = rig.readToolLogs();
      const shellCalls = toolLogs.filter(
        (log) => log.toolRequest.name === 'run_shell_command',
      );
      expect(
        shellCalls.length,
        'Expected agent to use run_shell_command for system queries',
      ).toBeGreaterThanOrEqual(1);
    },

high

The test comment correctly states the agent should not read system files directly, but the assertion only checks for the use of run_shell_command. To fully validate this, we should also assert that read_file is not used.

    assert: async (rig) => {
      const toolLogs = rig.readToolLogs();
      const toolNames = toolLogs.map((log) => log.toolRequest.name);

      expect(
        toolNames,
        'Expected agent to use run_shell_command for system queries'
      ).toContain('run_shell_command');
      expect(
        toolNames,
        'Agent should not attempt to read system files for this query'
      ).not.toContain('read_file');
    },

@gemini-cli gemini-cli bot added labels priority/p2 (Important but can be addressed in a future release.), area/agent (Issues related to Core Agent, Tools, Memory, Sub-Agents, Hooks, Agent Quality), and 🔒 maintainer only (⛔ Do not contribute. Internal roadmap item.) Mar 22, 2026
@sakshisemalti
Contributor

Hi @PewterZz, I noticed that this PR addresses #23331, which is labeled "maintainer only." Could you clarify whether we are allowed to contribute to pre-proposal work? I want to make sure I follow the correct process.

@gemini-cli gemini-cli bot added and removed the status/need-issue label (Pull requests that need to have an associated issue.) Mar 22, 2026
Development

Successfully merging this pull request may close these issues.

Add behavioral evals for efficient tool selection (grep vs read_file, shell vs file reads)

2 participants