feat: add behavioral evals for tool selection decisions by PewterZz · Pull Request #23416 · google-gemini/gemini-cli

PewterZz · 2026-03-22T01:40:55Z

Summary

Adds four behavioral evals testing whether the agent chooses efficient tools over more expensive alternatives, including adversarial prompts designed to create tension with efficient tool use.

Details

| Test | Policy | What it tests |
| --- | --- | --- |
| `legacyAuth()` search with "read carefully" instruction | `USUALLY_PASSES` | Agent uses `grep_search` even when explicitly told to read each file carefully |
| Line count across src directory | `USUALLY_PASSES` | Agent uses `run_shell_command` rather than reading each file to count lines |
| Answer question with visible bug in files | `USUALLY_PASSES` | Agent answers without fixing the unrelated bug it sees (no unprompted edits) |
| Deprecated function search across 10 files | `USUALLY_PASSES` | Agent uses `grep_search` rather than opening each file individually |
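The "no unprompted edits" row could be checked along the following lines. This is only a sketch: the `ToolLog` shape mirrors the `log.toolRequest.name` access seen in this PR's assertions, while `MUTATING_TOOLS`, the tool names in it, and `findUnpromptedEdits` are hypothetical names chosen for illustration, not taken from the PR.

```typescript
// Minimal log-entry shape, mirroring the `log.toolRequest.name` access used in this PR.
interface ToolLog {
  toolRequest: { name: string };
}

// Assumption: which tool names count as file-mutating. These literals are
// illustrative; a real eval would import the constants from the core package.
const MUTATING_TOOLS: ReadonlySet<string> = new Set(['write_file', 'replace']);

// Returns the names of any file-mutating tool calls found in the logs.
// An empty result means the agent answered without editing anything.
function findUnpromptedEdits(logs: ToolLog[]): string[] {
  return logs
    .map((log) => log.toolRequest.name)
    .filter((toolName) => MUTATING_TOOLS.has(toolName));
}

// Hypothetical trace: the agent only reads, which is the desired behavior.
const readOnlyRun: ToolLog[] = [
  { toolRequest: { name: 'read_file' } },
  { toolRequest: { name: 'grep_search' } },
];

// Hypothetical trace: the agent also "fixes" the unrelated bug it noticed.
const editingRun: ToolLog[] = [
  ...readOnlyRun,
  { toolRequest: { name: 'replace' } },
];

const editsInReadOnlyRun = findUnpromptedEdits(readOnlyRun);
const editsInEditingRun = findUnpromptedEdits(editingRun);
```

Returning the offending names, rather than a boolean, makes the failure message say which edit tool was used.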

Design note: The "read each file carefully" prompt is adversarial -- it creates explicit tension with efficient tool selection. A model that follows the instruction literally will read all files individually and fail the assertion. This tests whether the model prioritizes efficiency over literal instruction-following.

Finding during validation: The correct tool name is list_directory (defined as LS_TOOL_NAME in base-declarations.ts), not ls. All assertions import constants from @google/gemini-cli-core.
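To illustrate why the wrong-name finding matters, here is a small sketch of how a guessed literal silently matches nothing. `LS_TOOL_NAME` below is a local stand-in for the constant the description mentions; `countCalls` and `sampleLogs` are hypothetical.

```typescript
// Minimal log-entry shape, mirroring the `log.toolRequest.name` access used in this PR.
interface ToolLog {
  toolRequest: { name: string };
}

// Assumption: a local stand-in for the constant the PR says is defined in
// base-declarations.ts. A real eval would import it from @google/gemini-cli-core.
const LS_TOOL_NAME = 'list_directory';

// Counts how many logged tool calls used the given tool name.
function countCalls(logs: ToolLog[], toolName: string): number {
  return logs.filter((log) => log.toolRequest.name === toolName).length;
}

// Hypothetical trace: the agent listed a directory once, then searched.
const sampleLogs: ToolLog[] = [
  { toolRequest: { name: 'list_directory' } },
  { toolRequest: { name: 'grep_search' } },
];

// A guessed literal never matches, so the count is zero even though the
// agent did list a directory.
const callsByGuessedName = countCalls(sampleLogs, 'ls');

// The constant matches the name that actually appears in the logs.
const callsByConstant = countCalls(sampleLogs, LS_TOOL_NAME);
```

With a literal, a positive assertion fails for the wrong reason and a negative assertion (`not.toContain('ls')`) passes vacuously; importing the constant avoids both.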

How to Validate

npm run build
RUN_EVALS=1 npx vitest run evals/tool-selection.eval.ts --config evals/vitest.config.ts --reporter=verbose

Related Issues

Fixes #23484
Related to #23331

Adds six evals covering how the agent selects among equivalent tools:
- grep_search over reading files individually for string searches
- run_shell_command for counting and git history operations
- glob or grep_search for bulk file discovery
- no file writes when only answering a question
- shell commands for system queries

These complement the existing frugalReads and frugalSearch evals by testing tool choice decisions (which tool to use) rather than read efficiency (how much to read). All six evals validated against live Gemini API.

gemini-code-assist · 2026-03-22T01:41:10Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the evaluation suite by adding behavioral tests that scrutinize an agent's tool selection intelligence. The new evaluations ensure that agents make efficient and correct choices when faced with tasks requiring different tools, such as preferring specialized search tools over manual file reading, utilizing shell commands for system interactions, and avoiding unintended side effects like writing to files when only information retrieval is needed. This improves the robustness and reliability of agent performance in complex environments.

Highlights

New Behavioral Evals: Introduced six new behavioral evaluation tests designed to assess an agent's ability to select the most appropriate tool for various tasks.
Tool Selection Focus: These evals specifically target the agent's decision-making process in choosing between equivalent tools, complementing existing evaluations that focus on reading efficiency.
Specific Scenarios Covered: Tests include scenarios like using grep_search for large codebase string searches, shell commands for counting and system queries, git log for history, efficient tools (glob, grep_search) for bulk operations, and preventing unnecessary file modifications.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

PewterZz · 2026-03-22T01:41:17Z

cc @gundermanc — same context as #23415, part of pre-proposal work for #23331. The failure narrative in the PR description (prompt iteration, wrong-name discovery) is something I'm also documenting in the proposal as evidence that the methodology surfaces real issues.

gemini-code-assist

Code Review

This PR adds a valuable set of behavioral evaluations for tool selection. The tests are well-structured, but several assertions could be strengthened to more accurately validate the agent's efficiency. Specifically, in tests where an efficient tool like grep_search or run_shell_command is expected, the assertions should also verify that the agent does not fall back to less efficient methods like reading files individually. I've suggested improvements to several test cases to make them more robust.

gemini-code-assist · 2026-03-22T01:43:31Z

evals/tool-selection.eval.ts

+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+      const grepCalls = toolLogs.filter(
+        (log) => log.toolRequest.name === 'grep_search',
+      );
+      expect(
+        grepCalls.length,
+        'Expected agent to use grep_search for finding TODOs across a large codebase',
+      ).toBeGreaterThanOrEqual(1);
+    },


To make this test more robust and ensure the agent is not performing redundant operations, we should also assert that read_file is not called. The goal is to verify the agent chooses grep_search instead of reading files individually.

assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const toolNames = toolLogs.map((log) => log.toolRequest.name);
  expect(
    toolNames,
    'Expected agent to use grep_search for finding TODOs across a large codebase',
  ).toContain('grep_search');
  expect(
    toolNames,
    'Agent should not read files individually when grep_search is more efficient',
  ).not.toContain('read_file');
},
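Since the same "required tool present, fallback tool absent" pattern recurs in the suggestions in this review, it could be factored into one helper. The following is a sketch under assumptions: `ToolChoiceExpectation` and `checkToolChoice` are hypothetical names, and the helper returns failure messages instead of calling `expect` so that it stays independent of the test runner.

```typescript
// Minimal log-entry shape, mirroring the `log.toolRequest.name` access used in this PR.
interface ToolLog {
  toolRequest: { name: string };
}

// Hypothetical description of what a tool-selection eval expects.
interface ToolChoiceExpectation {
  required: string; // tool that must be called at least once
  forbidden: string[]; // tools that must not be called at all
}

// Returns human-readable failures; an empty array means the trace passed.
function checkToolChoice(
  logs: ToolLog[],
  expectation: ToolChoiceExpectation,
): string[] {
  const toolNames = logs.map((log) => log.toolRequest.name);
  const failures: string[] = [];

  // Positive check: the efficient tool was used.
  if (!toolNames.some((toolName) => toolName === expectation.required)) {
    failures.push(`expected at least one call to ${expectation.required}`);
  }

  // Negative checks: the agent did not fall back to a slower tool.
  for (const forbiddenTool of expectation.forbidden) {
    const callCount = toolNames.filter((t) => t === forbiddenTool).length;
    if (callCount > 0) {
      failures.push(`unexpected ${callCount} call(s) to ${forbiddenTool}`);
    }
  }
  return failures;
}

// Hypothetical trace: grep_search was used, but two files were also read.
const grepThenRead: ToolLog[] = [
  { toolRequest: { name: 'grep_search' } },
  { toolRequest: { name: 'read_file' } },
  { toolRequest: { name: 'read_file' } },
];

const grepFailures = checkToolChoice(grepThenRead, {
  required: 'grep_search',
  forbidden: ['read_file'],
});
```

Inside an eval, the result could be asserted with something like `expect(grepFailures).toEqual([])`, which would surface every violated expectation in one failure message.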

gemini-code-assist · 2026-03-22T01:43:31Z

evals/tool-selection.eval.ts

+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+      const shellCalls = toolLogs.filter(
+        (log) => log.toolRequest.name === 'run_shell_command',
+      );
+      expect(
+        shellCalls.length,
+        'Expected agent to use run_shell_command for counting lines',
+      ).toBeGreaterThanOrEqual(1);
+    },


To strengthen this test, we should also assert that read_file is not used. This ensures the agent is correctly choosing the run_shell_command tool (e.g., with wc) instead of inefficiently reading files to count lines manually.

assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const toolNames = toolLogs.map((log) => log.toolRequest.name);
  expect(
    toolNames,
    'Expected agent to use run_shell_command for counting lines',
  ).toContain('run_shell_command');
  expect(
    toolNames,
    'Agent should not read files to count lines when a shell command is more efficient',
  ).not.toContain('read_file');
},

gemini-code-assist · 2026-03-22T01:43:31Z

evals/tool-selection.eval.ts

+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+      const shellCalls = toolLogs.filter(
+        (log) => log.toolRequest.name === 'run_shell_command',
+      );
+      expect(
+        shellCalls.length,
+        'Expected agent to use run_shell_command for git log',
+      ).toBeGreaterThanOrEqual(1);
+    },


The test comment correctly states the agent should not read .git files directly, but the assertion only checks for the presence of run_shell_command. To make the test fully validate this behavior, we should also assert that read_file is not used.

assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const toolNames = toolLogs.map((log) => log.toolRequest.name);
  expect(
    toolNames,
    'Expected agent to use run_shell_command for git log',
  ).toContain('run_shell_command');
  expect(
    toolNames,
    'Agent should not read files from the .git directory',
  ).not.toContain('read_file');
},

gemini-code-assist · 2026-03-22T01:43:31Z

evals/tool-selection.eval.ts

+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+      const efficientCalls = toolLogs.filter(
+        (log) =>
+          log.toolRequest.name === 'grep_search' ||
+          log.toolRequest.name === 'glob',
+      );
+      expect(
+        efficientCalls.length,
+        'Expected agent to use grep_search or glob to find files efficiently',
+      ).toBeGreaterThanOrEqual(1);
+    },


The current assertion is too permissive. It considers glob an efficient tool for this task, but finding files by content is the core of the request. Using glob would likely lead to an inefficient sequence of read_file calls, which would still pass this test.

The assertion should be stricter to ensure the agent uses the most efficient tool, grep_search, and explicitly check that it does not fall back to reading files individually.

assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const toolNames = toolLogs.map((log) => log.toolRequest.name);
  expect(
    toolNames,
    'Expected agent to use grep_search to find files efficiently',
  ).toContain('grep_search');
  expect(
    toolNames,
    'Agent should not read files individually when a bulk search tool is available',
  ).not.toContain('read_file');
},
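The gap described in this comment can be made concrete with a hypothetical trace in which the agent calls `glob` once and then opens ten files. The sketch below assumes only the `log.toolRequest.name` shape from the PR; `permissivePasses`, `strictPasses`, and the trace itself are illustrative.

```typescript
// Minimal log-entry shape, mirroring the `log.toolRequest.name` access used in this PR.
interface ToolLog {
  toolRequest: { name: string };
}

// Original check: any grep_search or glob call is enough to pass.
function permissivePasses(logs: ToolLog[]): boolean {
  return logs.some(
    (log) =>
      log.toolRequest.name === 'grep_search' ||
      log.toolRequest.name === 'glob',
  );
}

// Suggested check: grep_search is required and read_file is not allowed.
function strictPasses(logs: ToolLog[]): boolean {
  const toolNames = logs.map((log) => log.toolRequest.name);
  const usedGrep = toolNames.some((toolName) => toolName === 'grep_search');
  const readFiles = toolNames.some((toolName) => toolName === 'read_file');
  return usedGrep && !readFiles;
}

// Hypothetical inefficient trace: one glob, then every file opened in turn.
const globThenReads: ToolLog[] = [{ toolRequest: { name: 'glob' } }];
for (let fileIndex = 0; fileIndex < 10; fileIndex++) {
  globThenReads.push({ toolRequest: { name: 'read_file' } });
}

// The permissive check accepts this trace; the strict check rejects it.
const permissiveResult = permissivePasses(globThenReads);
const strictResult = strictPasses(globThenReads);
```

One trade-off worth noting: the strict form also rejects a trace where the agent uses `grep_search` and then reads a single matching file for context, so a call-count budget may be a reasonable alternative to an outright ban.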

gemini-code-assist · 2026-03-22T01:43:31Z

evals/tool-selection.eval.ts

+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+      const shellCalls = toolLogs.filter(
+        (log) => log.toolRequest.name === 'run_shell_command',
+      );
+      expect(
+        shellCalls.length,
+        'Expected agent to use run_shell_command for system queries',
+      ).toBeGreaterThanOrEqual(1);
+    },


The test comment correctly states the agent should not read system files directly, but the assertion only checks for the use of run_shell_command. To fully validate this, we should also assert that read_file is not used.

assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const toolNames = toolLogs.map((log) => log.toolRequest.name);
  expect(
    toolNames,
    'Expected agent to use run_shell_command for system queries',
  ).toContain('run_shell_command');
  expect(
    toolNames,
    'Agent should not attempt to read system files for this query',
  ).not.toContain('read_file');
},

sakshisemalti · 2026-03-22T18:50:50Z

Hi @PewterZz, I noticed that this PR addresses #23331, which was labeled “maintainers only.” Could you clarify if we are allowed to contribute to pre-proposal work? I want to make sure I follow the correct process.

PewterZz requested a review from a team as a code owner March 22, 2026 01:40

PewterZz changed the title ~~feat(evals): add behavioral evals for tool selection decisions~~ feat: add behavioral evals for tool selection decisions Mar 22, 2026

gemini-code-assist bot reviewed Mar 22, 2026

github-actions bot mentioned this pull request Mar 22, 2026

📊 AI CLI tools daily bulletin 2026-03-22 compasify/agents-radar#70

Open

gemini-cli bot added labels Mar 22, 2026: priority/p2 (Important but can be addressed in a future release.), area/agent (Issues related to Core Agent, Tools, Memory, Sub-Agents, Hooks, Agent Quality), 🔒 maintainer only (⛔ Do not contribute. Internal roadmap item.)

PewterZz mentioned this pull request Mar 22, 2026

feat: add behavioral eval for write_todos task planning #23418

Open

PewterZz added 2 commits March 23, 2026 02:06

fix: add negative assertions to prevent fallback to inefficient tools

6276f5c

fix: add turn count assertions to grep and shell tool selection evals

ef51120

gemini-cli bot added and removed the status/need-issue label (Pull requests that need to have an associated issue.) Mar 22, 2026

github-actions bot mentioned this pull request Mar 23, 2026

📊 AI CLI tools community daily digest 2026-03-23 gsscsd/big_model_radar#80

Open

PewterZz added 3 commits March 25, 2026 19:56

fix: use tool name constants from base-declarations instead of string…

2efb22a

… literals

test: add hard case for grep usage across large codebase

0963e58

fix: rewrite tool-selection evals with adversarial prompts that test …

d36e2fb

…real decisions

PewterZz wants to merge 6 commits into google-gemini:main from PewterZz:feat/add-tool-selection-eval
