fix(plan): Fix AskUser evals by Adib234 · Pull Request #22074 · google-gemini/gemini-cli

Adib234 · 2026-03-11T21:09:23Z

Summary

Some AskUser evals were failing. After some debugging, the cause appears to be that they were running in non-interactive mode, which doesn't support the AskUser tool. This PR changes the eval setup so that these evals run in an interactive environment.

Details

Related Issues

Fixes #22073

How to Validate

The evals should pass locally, and I've verified in the nightly evals workflow that they pass.

Two workflow runs:
https://github.com/google-gemini/gemini-cli/actions/runs/22971547342
https://github.com/google-gemini/gemini-cli/actions/runs/22974083453

Pre-Merge Checklist

github-actions · 2026-03-11T21:12:24Z

Size Change: -4 B (0%)

Total Size: 26.1 MB

Filename	Size	Change
`./bundle/chunk-Q22W3GF4.js`	0 B	-3.62 MB (removed)	🏆
`./bundle/chunk-V7GMLYYI.js`	0 B	-13.4 MB (removed)	🏆
`./bundle/core-7UQOL5N4.js`	0 B	-40.1 kB (removed)	🏆
`./bundle/devtoolsService-PVZYGYRZ.js`	0 B	-27.7 kB (removed)	🏆
`./bundle/interactiveCli-QFEKEZGF.js`	0 B	-1.59 MB (removed)	🏆
`./bundle/oauth2-provider-QIWCNLVN.js`	0 B	-9.19 kB (removed)	🏆
`./bundle/chunk-DHNDYRY7.js`	13.4 MB	+13.4 MB (new file)	🆕
`./bundle/chunk-EIKPCH4X.js`	3.62 MB	+3.62 MB (new file)	🆕
`./bundle/core-NEJD7DCK.js`	40.1 kB	+40.1 kB (new file)	🆕
`./bundle/devtoolsService-MGOHA53A.js`	27.7 kB	+27.7 kB (new file)	🆕
`./bundle/interactiveCli-JYQT2PPQ.js`	1.59 MB	+1.59 MB (new file)	🆕
`./bundle/oauth2-provider-WNR34MDU.js`	9.19 kB	+9.19 kB (new file)	🆕

ℹ️ View Unchanged

Filename	Size	Change
`./bundle/chunk-34MYV7JD.js`	2.45 kB	0 B
`./bundle/chunk-37ZTTFQF.js`	966 kB	0 B
`./bundle/chunk-5AUYMPVF.js`	858 B	0 B
`./bundle/chunk-5Q3GACO5.js`	1.95 MB	0 B
`./bundle/chunk-664ZODQF.js`	124 kB	0 B
`./bundle/chunk-DAHVX5MI.js`	206 kB	0 B
`./bundle/chunk-IUUIT4SU.js`	56.5 kB	0 B
`./bundle/chunk-RJTRUG2J.js`	39.8 kB	0 B
`./bundle/devtools-36NN55EP.js`	696 kB	0 B
`./bundle/dist-T73EYRDX.js`	356 B	0 B
`./bundle/gemini.js`	689 kB	0 B
`./bundle/getMachineId-bsd-TXG52NKR.js`	1.55 kB	0 B
`./bundle/getMachineId-darwin-7OE4DDZ6.js`	1.55 kB	0 B
`./bundle/getMachineId-linux-SHIFKOOX.js`	1.34 kB	0 B
`./bundle/getMachineId-unsupported-5U5DOEYY.js`	1.06 kB	0 B
`./bundle/getMachineId-win-6KLLGOI4.js`	1.72 kB	0 B
`./bundle/keychain-token-storage-UHHE7LLN.js`	0 B	-518 B (removed)	🏆
`./bundle/memoryDiscovery-RQEFN44F.js`	922 B	0 B
`./bundle/multipart-parser-KPBZEGQU.js`	11.7 kB	0 B
`./bundle/node_modules/@google/gemini-cli-devtools/dist/client/main.js`	221 kB	0 B
`./bundle/node_modules/@google/gemini-cli-devtools/dist/src/_client-assets.js`	227 kB	0 B
`./bundle/node_modules/@google/gemini-cli-devtools/dist/src/index.js`	11.5 kB	0 B
`./bundle/node_modules/@google/gemini-cli-devtools/dist/src/types.js`	132 B	0 B
`./bundle/sandbox-macos-permissive-open.sb`	890 B	0 B
`./bundle/sandbox-macos-permissive-proxied.sb`	1.31 kB	0 B
`./bundle/sandbox-macos-restrictive-open.sb`	3.36 kB	0 B
`./bundle/sandbox-macos-restrictive-proxied.sb`	3.56 kB	0 B
`./bundle/sandbox-macos-strict-open.sb`	4.82 kB	0 B
`./bundle/sandbox-macos-strict-proxied.sb`	5.02 kB	0 B
`./bundle/src-QVCVGIUX.js`	47 kB	0 B
`./bundle/tree-sitter-7U6MW5PS.js`	274 kB	0 B
`./bundle/tree-sitter-bash-34ZGLXVX.js`	1.84 MB	0 B
`./bundle/undici-4X2YZID5.js`	360 B	0 B
`./bundle/keychain-token-storage-A2VJ7RAZ.js`	518 B	+518 B (new file)	🆕

gemini-code-assist · 2026-03-11T21:14:12Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses issues with AskUser evaluations failing in non-interactive environments by enabling and configuring an interactive testing mode. The changes ensure that tests involving user interaction tools can run correctly, improving the reliability of the evaluation suite. This also includes updates to how tool confirmations are awaited, making the testing framework more adaptable to dynamic tool usage.

Highlights

- Interactive Evaluation Support: Introduced support for running evaluations in an interactive environment, which is crucial for tools like AskUser that require user interaction. This involved modifying the test helper to simulate interactive terminal input and handle pending confirmations.
- AskUser Evaluation Updates: Updated existing AskUser evaluations to leverage the new interactive testing framework. This includes switching from evalTest to appEvalTest, configuring approvalMode and autoUpdate settings, and explicitly allowing the ask_user tool via policy files within the test setup.
- Flexible Tool Confirmation Waiting: Enhanced the waitForPendingConfirmation utility to accept an array of tool names or display names, allowing tests to wait for confirmation from multiple possible tools, improving robustness for scenarios where the exact tool sequence might vary.
- Terminal Setup Prompt Bypass: Implemented a mechanism to automatically bypass the terminal keybindings setup prompt during evaluations by setting terminalSetupPromptShown in the .gemini/state.json file, ensuring a smoother test execution flow for interactive tests.
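The "Flexible Tool Confirmation Waiting" highlight can be illustrated with a small matcher. This is only a sketch of the idea, not the code in AppRig.tsx: the `PendingConfirmation` shape, the `matchesConfirmation` function, and the "Ask User" display name are assumptions made for illustration.

```typescript
// Hypothetical shape of a pending confirmation; not the real AppRig type.
interface PendingConfirmation {
  toolName: string;
  toolDisplayName?: string;
}

// A matcher is either a single name or a list of acceptable names.
type ToolMatcher = string | readonly string[];

// Returns true when the pending confirmation matches any accepted name,
// comparing against both the internal tool name and the display name.
function matchesConfirmation(
  pending: PendingConfirmation,
  matcher?: ToolMatcher,
): boolean {
  if (matcher === undefined) return true; // no filter: any confirmation matches
  const accepted: readonly string[] =
    typeof matcher === "string" ? [matcher] : matcher;
  return accepted.some(
    (name) => name === pending.toolName || name === pending.toolDisplayName,
  );
}

// "Ask User" is an assumed display name used only for this example.
const pending: PendingConfirmation = {
  toolName: "ask_user",
  toolDisplayName: "Ask User",
};
console.log(matchesConfirmation(pending, "ask_user")); // true
console.log(matchesConfirmation(pending, ["some_other_tool", "Ask User"])); // true
console.log(matchesConfirmation(pending, ["some_other_tool"])); // false
```

Accepting a list lets a test proceed when the model picks any one of several acceptable tools, which is the variability the highlight describes.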

Changelog

evals/ask_user.eval.ts
- Updated evalTest calls to appEvalTest to utilize the new interactive evaluation setup.
- Added configOverrides to evaluation cases to set general configurations like approvalMode and disable auto-updates.
- Included .gemini/state.json and .gemini/policies/allow_ask_user.toml in test file setups to enable interactive mode and allow the ask_user tool.
- Modified assertions to use waitForPendingConfirmation instead of waitForToolCall for interactive tool interactions.
- Introduced setup functions to set breakpoints for tool calls.
evals/test-helper.ts
- Added logic to bypass the terminal keybindings setup prompt by writing to .gemini/state.json.
- Implemented conditional execution within evalTest to support interactive runs using rig.runInteractive and sendKeys.
- Extended the EvalCase interface with an optional interactive property to specify interactive test cases.
packages/cli/src/test-utils/AppRig.tsx
- Modified the waitForPendingConfirmation method to accept an array of toolNameOrDisplayName for more flexible tool confirmation matching.

Activity

The pull request is currently in DRAFT status.
Adib234 verified that the evals pass locally and in nightly evals workflows, providing links to two workflow runs: https://github.com/google-gemini/gemini-cli/actions/runs/22971547342 and https://github.com/google-gemini/gemini-cli/actions/runs/22974083453.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request refactors the ask_user evaluation tests to run in an interactive environment using appEvalTest to fix failures in non-interactive mode.

My review focuses on improving the robustness and clarity of the new test setups. I've identified a conflicting policy configuration in the ask_user tests and an unreliable setTimeout in test-helper.ts, both of which could lead to test flakiness. Please see my detailed comments for suggestions on how to address these issues.
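For context on the setTimeout concern: a fixed delay assumes the application is ready after a set time, which can fail on a slow machine. A common alternative is to poll for the condition with an upper bound. The sketch below is generic and is not the change made in this PR; `waitUntil` and its parameters are hypothetical names.

```typescript
// Polls `condition` until it returns true, or rejects once `timeoutMs` elapses.
async function waitUntil(
  condition: () => boolean | Promise<boolean>,
  timeoutMs = 5000,
  intervalMs = 50,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await condition()) return;
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    // Sleep briefly before checking again.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Usage: a simulated UI becomes ready after a delay the test does not know.
let ready = false;
setTimeout(() => {
  ready = true;
}, 120);
waitUntil(() => ready, 2000).then(() => console.log("ready observed"));
```

The test then waits only as long as needed and fails with a clear error if the condition never holds.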

evals/ask_user.eval.ts

evals/test-helper.ts

gemini-code-assist

Code Review

This pull request refactors the AskUser evaluation tests to run in an interactive environment, which is a necessary fix since the AskUser tool is not supported in non-interactive mode. The changes involve introducing a new appEvalTest helper, updating test assertions to wait for interactive confirmations, and enhancing the test rig to support waiting for multiple possible tool calls. A potential issue was found in how default configurations are merged with test-specific overrides, which could lead to incorrect test setups, violating a rule about consistent merge operations.
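The merge concern can be shown with a minimal example. The settings shape and key names below are placeholders chosen for illustration, not the real gemini-cli schema, and the snippet does not reproduce the code under review.

```typescript
// Hypothetical settings shape; key names are assumptions for this sketch.
interface Settings {
  general?: { approvalMode?: string; autoUpdate?: boolean };
}

const defaults: Settings = {
  general: { approvalMode: "default", autoUpdate: false },
};
const overrides: Settings = { general: { approvalMode: "plan" } };

// Shallow spread: the override's `general` object replaces the default one
// wholesale, so `autoUpdate: false` is silently dropped.
const shallow: Settings = { ...defaults, ...overrides };

// Merging one level deeper keeps defaults the override does not mention.
const merged: Settings = {
  ...defaults,
  ...overrides,
  general: { ...defaults.general, ...overrides.general },
};

console.log(shallow.general); // { approvalMode: 'plan' }
console.log(merged.general); // { approvalMode: 'plan', autoUpdate: false }
```

If a test relies on a default such as disabled auto-updates, the shallow variant would run with that default missing, which is the kind of incorrect setup the review warns about.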

evals/ask_user.eval.ts

Adib234 · 2026-03-12T15:19:17Z

Nightly evals workflow run after recent changes: https://github.com/google-gemini/gemini-cli/actions/runs/23008601171

evals/test-helper.ts

evals/ask_user.eval.ts

gundermanc · 2026-03-12T20:44:34Z

evals/test-helper.ts

          );
        }

+        // Bypassing terminal keybindings setup prompt for interactive tests


We run the tests in parallel. Is updating the home dir going to cause intermittent test failures?

Added _createStateFile to make sure that the state.json mocking happens within the isolated TestRig.homeDir that is unique to each test run.
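The isolation idea in this reply can be sketched as follows. This is not the actual _createStateFile implementation: `createIsolatedHome` is a hypothetical helper, and the temporary-directory prefix is an assumption. Only the .gemini/state.json path and the terminalSetupPromptShown key come from the PR.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: creates a unique home directory for one test run and
// writes .gemini/state.json inside it, so parallel tests never share the file.
function createIsolatedHome(state: Record<string, unknown>): string {
  const homeDir = fs.mkdtempSync(path.join(os.tmpdir(), "eval-home-"));
  const stateDir = path.join(homeDir, ".gemini");
  fs.mkdirSync(stateDir, { recursive: true });
  fs.writeFileSync(
    path.join(stateDir, "state.json"),
    JSON.stringify(state, null, 2),
  );
  return homeDir;
}

const homeA = createIsolatedHome({ terminalSetupPromptShown: true });
const homeB = createIsolatedHome({ terminalSetupPromptShown: true });
console.log(homeA !== homeB); // true: each call gets its own directory

// Clean up the temporary directories once the tests finish.
for (const dir of [homeA, homeB]) {
  fs.rmSync(dir, { recursive: true, force: true });
}
```

Because each run writes into its own directory, one test cannot overwrite or observe another test's state file, which addresses the parallel-execution question above.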

gundermanc · 2026-03-12T20:49:12Z

evals/ask_user.eval.ts

+      },
+    },
+    files: {
+      '.gemini/state.json': JSON.stringify({ terminalSetupPromptShown: true }),


We're writing this file in the test-helper.ts. Do we need it both here and there?

Also, can we avoid the [action at a distance](https://en.wikipedia.org/wiki/Action_at_a_distance_(computer_programming)) phenomenon by plumbing this option through to the TestRig and having it pass a command-line argument or write the config file in one place? Ideally, we should read and write state.json in just one place in the codebase to minimize the risk of regression.

Created _createStateFile so that state.json is read and written in one place.

Adib234 added 4 commits March 11, 2026 12:33

complete

5fa4490

test

a9bf778

test on nightly evals

f9479f5

revert random changes

64a7693

Adib234 self-assigned this Mar 11, 2026

Adib234 requested a review from a team as a code owner March 11, 2026 21:09

Adib234 marked this pull request as draft March 11, 2026 21:09

gemini-cli bot added the area/agent Issues related to Core Agent, Tools, Memory, Sub-Agents, Hooks, Agent Quality label Mar 11, 2026

gemini-code-assist bot reviewed Mar 11, 2026

evals/ask_user.eval.ts (outdated, resolved)

evals/test-helper.ts (outdated, resolved)

Adib234 added 2 commits March 12, 2026 10:54

address bot comments

6efc259

clean up code

7955265

Adib234 marked this pull request as ready for review March 12, 2026 15:01

Adib234 changed the title ~~[DRAFT] Fix AskUser evals~~ fix(plan): Fix AskUser evals Mar 12, 2026

gemini-code-assist bot reviewed Mar 12, 2026

evals/ask_user.eval.ts (resolved)

address bot comment

2949b2a

gundermanc reviewed Mar 12, 2026

evals/test-helper.ts (outdated, resolved)

removed unused code

fa59260

gundermanc reviewed Mar 12, 2026

evals/test-helper.ts (outdated, resolved)

avoid null suppression

fe8ae8d

gundermanc reviewed Mar 12, 2026

evals/ask_user.eval.ts (outdated, resolved)

gundermanc reviewed Mar 12, 2026

address nits: remove timeout, remove changes in test-helper, and read/write state.json in one place

2a2ba53

gundermanc approved these changes Mar 13, 2026

Adib234 added this pull request to the merge queue Mar 13, 2026

github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Mar 13, 2026

Adib234 added this pull request to the merge queue Mar 13, 2026

github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Mar 13, 2026

Merge branch 'main' into adibakm/fix-ask-user-evals

cdb5c0f

Adib234 enabled auto-merge March 13, 2026 13:13

Adib234 added this pull request to the merge queue Mar 13, 2026

gemini-cli bot added area/core Issues related to User Interface, OS Support, Core Functionality 🔒 maintainer only ⛔ Do not contribute. Internal roadmap item. labels Mar 13, 2026

Merged via the queue into main with commit 263b8cd Mar 13, 2026
27 checks passed

Adib234 deleted the adibakm/fix-ask-user-evals branch March 13, 2026 13:40

ruomengz pushed a commit that referenced this pull request Mar 13, 2026

fix(plan): Fix AskUser evals (#22074)

a6108a6

Adib234 merged 11 commits into main from adibakm/fix-ask-user-evals
2 participants