feat: add behavioral evals for tool selection decisions #23416
PewterZz wants to merge 6 commits into google-gemini:main
Adds six evals covering how the agent selects among equivalent tools:
- grep_search over reading files individually for string searches
- run_shell_command for counting and git history operations
- glob or grep_search for bulk file discovery
- no file writes when only answering a question
- shell commands for system queries

These complement the existing frugalReads and frugalSearch evals by testing tool choice decisions (which tool to use) rather than read efficiency (how much to read). All six evals were validated against the live Gemini API.
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the evaluation suite by adding behavioral tests that scrutinize an agent's tool selection intelligence. The new evaluations ensure that agents make efficient and correct choices when faced with tasks requiring different tools, such as preferring specialized search tools over manual file reading, utilizing shell commands for system interactions, and avoiding unintended side effects like writing to files when only information retrieval is needed. This improves the robustness and reliability of agent performance in complex environments.
cc @gundermanc (same context as #23415, part of pre-proposal work for #23331). The failure narrative in the PR description (prompt iteration, wrong-name discovery) is something I'm also documenting in the proposal as evidence that the methodology surfaces real issues.
Code Review
This PR adds a valuable set of behavioral evaluations for tool selection. The tests are well-structured, but several assertions could be strengthened to more accurately validate the agent's efficiency. Specifically, in tests where an efficient tool like grep_search or run_shell_command is expected, the assertions should also verify that the agent does not fall back to less efficient methods like reading files individually. I've suggested improvements to several test cases to make them more robust.
assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const grepCalls = toolLogs.filter(
    (log) => log.toolRequest.name === 'grep_search',
  );
  expect(
    grepCalls.length,
    'Expected agent to use grep_search for finding TODOs across a large codebase',
  ).toBeGreaterThanOrEqual(1);
},
To make this test more robust and ensure the agent is not performing redundant operations, we should also assert that read_file is not called. The goal is to verify the agent chooses grep_search instead of reading files individually.
assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const toolNames = toolLogs.map((log) => log.toolRequest.name);
  expect(
    toolNames,
    'Expected agent to use grep_search for finding TODOs across a large codebase'
  ).toContain('grep_search');
  expect(
    toolNames,
    'Agent should not read files individually when grep_search is more efficient'
  ).not.toContain('read_file');
},
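The suggested changes in this review all share one shape: require the efficient tool and forbid the fallback. A minimal sketch of that pattern as a pure helper follows. The `ToolLog` interface and the `checkToolSelection` name are hypothetical and not part of the eval rig; only the `toolRequest.name` field is taken from the snippets in this review.

```typescript
// Hypothetical shape of one tool-log entry. Only `toolRequest.name` is
// taken from the review snippets; the real rig type may carry more fields.
interface ToolLog {
  toolRequest: { name: string };
}

interface ToolSelectionResult {
  missing: string[]; // required tools that were never called
  forbidden: string[]; // forbidden tools that were called at least once
  ok: boolean; // true only when both lists are empty
}

// Hypothetical helper: checks that every required tool was used and that
// no forbidden tool appears anywhere in the trace.
function checkToolSelection(
  logs: ToolLog[],
  required: string[],
  forbidden: string[],
): ToolSelectionResult {
  const used = new Set(logs.map((log) => log.toolRequest.name));
  const missing = required.filter((name) => !used.has(name));
  const hit = forbidden.filter((name) => used.has(name));
  return {
    missing,
    forbidden: hit,
    ok: missing.length === 0 && hit.length === 0,
  };
}
```

Inside an eval, the result could feed the existing assertions, for example by expecting `ok` to be true and using `missing` and `forbidden` to build the failure message.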
evals/tool-selection.eval.ts
Outdated
assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const shellCalls = toolLogs.filter(
    (log) => log.toolRequest.name === 'run_shell_command',
  );
  expect(
    shellCalls.length,
    'Expected agent to use run_shell_command for counting lines',
  ).toBeGreaterThanOrEqual(1);
},
To strengthen this test, we should also assert that read_file is not used. This ensures the agent is correctly choosing the run_shell_command tool (e.g., with wc) instead of inefficiently reading files to count lines manually.
assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const toolNames = toolLogs.map((log) => log.toolRequest.name);
  expect(
    toolNames,
    'Expected agent to use run_shell_command for counting lines'
  ).toContain('run_shell_command');
  expect(
    toolNames,
    'Agent should not read files to count lines when a shell command is more efficient'
  ).not.toContain('read_file');
},
evals/tool-selection.eval.ts
Outdated
assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const shellCalls = toolLogs.filter(
    (log) => log.toolRequest.name === 'run_shell_command',
  );
  expect(
    shellCalls.length,
    'Expected agent to use run_shell_command for git log',
  ).toBeGreaterThanOrEqual(1);
},
The test comment correctly states the agent should not read .git files directly, but the assertion only checks for the presence of run_shell_command. To make the test fully validate this behavior, we should also assert that read_file is not used.
assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const toolNames = toolLogs.map((log) => log.toolRequest.name);
  expect(
    toolNames,
    'Expected agent to use run_shell_command for git log'
  ).toContain('run_shell_command');
  expect(
    toolNames,
    'Agent should not read files from the .git directory'
  ).not.toContain('read_file');
},
evals/tool-selection.eval.ts
Outdated
assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const efficientCalls = toolLogs.filter(
    (log) =>
      log.toolRequest.name === 'grep_search' ||
      log.toolRequest.name === 'glob',
  );
  expect(
    efficientCalls.length,
    'Expected agent to use grep_search or glob to find files efficiently',
  ).toBeGreaterThanOrEqual(1);
},
The current assertion is too permissive. It considers glob an efficient tool for this task, but finding files by content is the core of the request. Using glob would likely lead to an inefficient sequence of read_file calls, which would still pass this test.
The assertion should be stricter to ensure the agent uses the most efficient tool, grep_search, and explicitly check that it does not fall back to reading files individually.
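To make the concern concrete, the sketch below runs both versions of the check against a hypothetical trace in which the agent calls glob once and then reads three files one by one. The `ToolLog` interface, the function names, and both traces are illustrative assumptions; only the tool names and the `toolRequest.name` field come from the snippets in this review.

```typescript
// Hypothetical log entry shape; only `toolRequest.name` is taken from the review.
interface ToolLog {
  toolRequest: { name: string };
}

// Builds an illustrative trace from a list of tool names.
const trace = (names: string[]): ToolLog[] =>
  names.map((name) => ({ toolRequest: { name } }));

// Mirrors the original assertion: any grep_search or glob call passes.
function permissiveCheck(logs: ToolLog[]): boolean {
  return logs.some(
    (log) =>
      log.toolRequest.name === 'grep_search' ||
      log.toolRequest.name === 'glob',
  );
}

// Mirrors the stricter suggestion: grep_search required, read_file forbidden.
function strictCheck(logs: ToolLog[]): boolean {
  const toolNames = logs.map((log) => log.toolRequest.name);
  return toolNames.includes('grep_search') && !toolNames.includes('read_file');
}

// Assumed inefficient trace: list files with glob, then open each one.
const globThenRead = trace(['glob', 'read_file', 'read_file', 'read_file']);
// Assumed efficient trace: a single content search.
const grepOnly = trace(['grep_search']);

const results = {
  permissiveOnGlobThenRead: permissiveCheck(globThenRead),
  strictOnGlobThenRead: strictCheck(globThenRead),
  strictOnGrepOnly: strictCheck(grepOnly),
};
```

Under these assumptions the permissive check accepts the glob-then-read trace while the strict check rejects it, and the grep-only trace satisfies the strict check.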
assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const toolNames = toolLogs.map((log) => log.toolRequest.name);
  expect(
    toolNames,
    'Expected agent to use grep_search to find files efficiently'
  ).toContain('grep_search');
  expect(
    toolNames,
    'Agent should not read files individually when a bulk search tool is available'
  ).not.toContain('read_file');
},

assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const shellCalls = toolLogs.filter(
    (log) => log.toolRequest.name === 'run_shell_command',
  );
  expect(
    shellCalls.length,
    'Expected agent to use run_shell_command for system queries',
  ).toBeGreaterThanOrEqual(1);
},
The test comment correctly states the agent should not read system files directly, but the assertion only checks for the use of run_shell_command. To fully validate this, we should also assert that read_file is not used.
assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const toolNames = toolLogs.map((log) => log.toolRequest.name);
  expect(
    toolNames,
    'Expected agent to use run_shell_command for system queries'
  ).toContain('run_shell_command');
  expect(
    toolNames,
    'Agent should not attempt to read system files for this query'
  ).not.toContain('read_file');
},
Summary
Adds four behavioral evals testing whether the agent chooses efficient tools over more expensive alternatives, including adversarial prompts designed to create tension with efficient tool use.
Details
legacyAuth() search with "read carefully" instruction | USUALLY_PASSES | grep_search even when explicitly told to read each file carefully
USUALLY_PASSES | run_shell_command rather than reading each file to count lines
USUALLY_PASSES
USUALLY_PASSES | grep_search rather than opening each file individually

Design note: The "read each file carefully" prompt is adversarial -- it creates explicit tension with efficient tool selection. A model that follows the instruction literally will read all files individually and fail the assertion. This tests whether the model prioritizes efficiency over literal instruction-following.
Finding during validation: The correct tool name is list_directory (defined as LS_TOOL_NAME in base-declarations.ts), not ls. All assertions import constants from @google/gemini-cli-core.
How to Validate
Related Issues
Fixes #23484
Related to #23331