fix(plan): Fix AskUser evals #22074

Merged
Adib234 merged 11 commits into main from adibakm/fix-ask-user-evals on Mar 13, 2026

Conversation

@Adib234 Adib234 commented Mar 11, 2026

Summary

Some AskUser evals were failing. After some debugging, it looks like they failed because they ran in non-interactive mode, which doesn't support the AskUser tool. This PR adds support for running evals in an interactive environment.

Details

Related Issues

Fixes #22073

How to Validate

The evals should pass locally, and I've verified in the nightly evals workflow that they pass.

Two workflow runs:
https://github.com/google-gemini/gemini-cli/actions/runs/22971547342
https://github.com/google-gemini/gemini-cli/actions/runs/22974083453

Pre-Merge Checklist

  • Updated relevant documentation and README (if needed)
  • Added/updated tests (if needed)
  • Noted breaking changes (if any)
  • Validated on required platforms/methods:
    • MacOS
      • npm run
      • npx
      • Docker
      • Podman
      • Seatbelt
    • Windows
      • npm run
      • npx
      • Docker
    • Linux
      • npm run
      • npx
      • Docker

@Adib234 Adib234 self-assigned this Mar 11, 2026
@Adib234 Adib234 requested a review from a team as a code owner March 11, 2026 21:09
@Adib234 Adib234 marked this pull request as draft March 11, 2026 21:09
github-actions bot commented Mar 11, 2026

Size Change: -4 B (0%)

Total Size: 26.1 MB

Filename | Size | Change
./bundle/chunk-Q22W3GF4.js | 0 B | -3.62 MB (removed) 🏆
./bundle/chunk-V7GMLYYI.js | 0 B | -13.4 MB (removed) 🏆
./bundle/core-7UQOL5N4.js | 0 B | -40.1 kB (removed) 🏆
./bundle/devtoolsService-PVZYGYRZ.js | 0 B | -27.7 kB (removed) 🏆
./bundle/interactiveCli-QFEKEZGF.js | 0 B | -1.59 MB (removed) 🏆
./bundle/oauth2-provider-QIWCNLVN.js | 0 B | -9.19 kB (removed) 🏆
./bundle/chunk-DHNDYRY7.js | 13.4 MB | +13.4 MB (new file) 🆕
./bundle/chunk-EIKPCH4X.js | 3.62 MB | +3.62 MB (new file) 🆕
./bundle/core-NEJD7DCK.js | 40.1 kB | +40.1 kB (new file) 🆕
./bundle/devtoolsService-MGOHA53A.js | 27.7 kB | +27.7 kB (new file) 🆕
./bundle/interactiveCli-JYQT2PPQ.js | 1.59 MB | +1.59 MB (new file) 🆕
./bundle/oauth2-provider-WNR34MDU.js | 9.19 kB | +9.19 kB (new file) 🆕
ℹ️ View Unchanged
Filename | Size | Change
./bundle/chunk-34MYV7JD.js | 2.45 kB | 0 B
./bundle/chunk-37ZTTFQF.js | 966 kB | 0 B
./bundle/chunk-5AUYMPVF.js | 858 B | 0 B
./bundle/chunk-5Q3GACO5.js | 1.95 MB | 0 B
./bundle/chunk-664ZODQF.js | 124 kB | 0 B
./bundle/chunk-DAHVX5MI.js | 206 kB | 0 B
./bundle/chunk-IUUIT4SU.js | 56.5 kB | 0 B
./bundle/chunk-RJTRUG2J.js | 39.8 kB | 0 B
./bundle/devtools-36NN55EP.js | 696 kB | 0 B
./bundle/dist-T73EYRDX.js | 356 B | 0 B
./bundle/gemini.js | 689 kB | 0 B
./bundle/getMachineId-bsd-TXG52NKR.js | 1.55 kB | 0 B
./bundle/getMachineId-darwin-7OE4DDZ6.js | 1.55 kB | 0 B
./bundle/getMachineId-linux-SHIFKOOX.js | 1.34 kB | 0 B
./bundle/getMachineId-unsupported-5U5DOEYY.js | 1.06 kB | 0 B
./bundle/getMachineId-win-6KLLGOI4.js | 1.72 kB | 0 B
./bundle/keychain-token-storage-UHHE7LLN.js | 0 B | -518 B (removed) 🏆
./bundle/memoryDiscovery-RQEFN44F.js | 922 B | 0 B
./bundle/multipart-parser-KPBZEGQU.js | 11.7 kB | 0 B
./bundle/node_modules/@google/gemini-cli-devtools/dist/client/main.js | 221 kB | 0 B
./bundle/node_modules/@google/gemini-cli-devtools/dist/src/_client-assets.js | 227 kB | 0 B
./bundle/node_modules/@google/gemini-cli-devtools/dist/src/index.js | 11.5 kB | 0 B
./bundle/node_modules/@google/gemini-cli-devtools/dist/src/types.js | 132 B | 0 B
./bundle/sandbox-macos-permissive-open.sb | 890 B | 0 B
./bundle/sandbox-macos-permissive-proxied.sb | 1.31 kB | 0 B
./bundle/sandbox-macos-restrictive-open.sb | 3.36 kB | 0 B
./bundle/sandbox-macos-restrictive-proxied.sb | 3.56 kB | 0 B
./bundle/sandbox-macos-strict-open.sb | 4.82 kB | 0 B
./bundle/sandbox-macos-strict-proxied.sb | 5.02 kB | 0 B
./bundle/src-QVCVGIUX.js | 47 kB | 0 B
./bundle/tree-sitter-7U6MW5PS.js | 274 kB | 0 B
./bundle/tree-sitter-bash-34ZGLXVX.js | 1.84 MB | 0 B
./bundle/undici-4X2YZID5.js | 360 B | 0 B
./bundle/keychain-token-storage-A2VJ7RAZ.js | 518 B | +518 B (new file) 🆕

compressed-size-action

@gemini-cli gemini-cli bot added the area/agent Issues related to Core Agent, Tools, Memory, Sub-Agents, Hooks, Agent Quality label Mar 11, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses issues with AskUser evaluations failing in non-interactive environments by enabling and configuring an interactive testing mode. The changes ensure that tests involving user interaction tools can run correctly, improving the reliability of the evaluation suite. This also includes updates to how tool confirmations are awaited, making the testing framework more adaptable to dynamic tool usage.

Highlights

  • Interactive Evaluation Support: Introduced support for running evaluations in an interactive environment, which is crucial for tools like AskUser that require user interaction. This involved modifying the test helper to simulate interactive terminal input and handle pending confirmations.
  • AskUser Evaluation Updates: Updated existing AskUser evaluations to leverage the new interactive testing framework. This includes switching from evalTest to appEvalTest, configuring approvalMode and autoUpdate settings, and explicitly allowing the ask_user tool via policy files within the test setup.
  • Flexible Tool Confirmation Waiting: Enhanced the waitForPendingConfirmation utility to accept an array of tool names or display names, allowing tests to wait for confirmation from multiple possible tools, improving robustness for scenarios where the exact tool sequence might vary.
  • Terminal Setup Prompt Bypass: Implemented a mechanism to automatically bypass the terminal keybindings setup prompt during evaluations by setting terminalSetupPromptShown in the .gemini/state.json file, ensuring a smoother test execution flow for interactive tests.
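
The "flexible tool confirmation waiting" highlight can be illustrated with a small matcher that accepts either one name or a list of names. This is only a sketch under assumptions: `PendingConfirmation`, `matchesTool`, and `findPendingConfirmation` are hypothetical names, and the real `waitForPendingConfirmation` in AppRig.tsx may be structured differently.

```typescript
// Hypothetical shape of a pending confirmation; the real AppRig type may differ.
interface PendingConfirmation {
  toolName: string;
  toolDisplayName?: string;
}

// Returns true when the pending confirmation matches the given name(s).
// A single string and an array of strings are normalized to the same path.
function matchesTool(
  pending: PendingConfirmation,
  toolNameOrDisplayName?: string | string[],
): boolean {
  if (toolNameOrDisplayName === undefined) {
    return true; // no filter given: any pending confirmation matches
  }
  const names = Array.isArray(toolNameOrDisplayName)
    ? toolNameOrDisplayName
    : [toolNameOrDisplayName];
  return names.some(
    (name) => pending.toolName === name || pending.toolDisplayName === name,
  );
}

// Picks the first pending confirmation that matches any of the names.
function findPendingConfirmation(
  pendings: PendingConfirmation[],
  toolNameOrDisplayName?: string | string[],
): PendingConfirmation | undefined {
  return pendings.find((p) => matchesTool(p, toolNameOrDisplayName));
}
```

Accepting a list lets a test tolerate runs in which the model reaches the confirmation through a different tool than the one the test author expected.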

Changelog

  • evals/ask_user.eval.ts
    • Updated evalTest calls to appEvalTest to utilize the new interactive evaluation setup.
    • Added configOverrides to evaluation cases to set general configurations like approvalMode and disable auto-updates.
    • Included .gemini/state.json and .gemini/policies/allow_ask_user.toml in test file setups to enable interactive mode and allow the ask_user tool.
    • Modified assertions to use waitForPendingConfirmation instead of waitForToolCall for interactive tool interactions.
    • Introduced setup functions to set breakpoints for tool calls.
  • evals/test-helper.ts
    • Added logic to bypass the terminal keybindings setup prompt by writing to .gemini/state.json.
    • Implemented conditional execution within evalTest to support interactive runs using rig.runInteractive and sendKeys.
    • Extended the EvalCase interface with an optional interactive property to specify interactive test cases.
  • packages/cli/src/test-utils/AppRig.tsx
    • Modified the waitForPendingConfirmation method to accept an array of toolNameOrDisplayName for more flexible tool confirmation matching.
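
The conditional execution described for evals/test-helper.ts might look roughly like the sketch below. The `interactive` property, `rig.runInteractive`, and `sendKeys` are named in the changelog, but their signatures are not shown there; the `RigLike` interface, the `startEval` function, and every parameter type here are assumptions made for illustration.

```typescript
// Eval case with the optional flag described in the changelog.
interface EvalCase {
  name: string;
  prompt: string;
  interactive?: boolean;
}

// Hypothetical minimal rig surface; the real TestRig signatures may differ.
interface RigLike {
  run(prompt: string): Promise<void>;
  runInteractive(): Promise<void>;
  sendKeys(keys: string): Promise<void>;
}

// Chooses the interactive or non-interactive path for one eval case.
async function startEval(rig: RigLike, evalCase: EvalCase): Promise<void> {
  if (evalCase.interactive) {
    await rig.runInteractive(); // start the app with a simulated terminal
    await rig.sendKeys(evalCase.prompt); // type the prompt
    await rig.sendKeys('\r'); // submit it
  } else {
    await rig.run(evalCase.prompt); // one-shot, non-interactive run
  }
}
```

Keeping the branch in one function means individual eval cases only need to set `interactive: true` rather than duplicate the terminal-driving steps.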

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the ask_user evaluation tests to run in an interactive environment using appEvalTest to fix failures in non-interactive mode.

My review focuses on improving the robustness and clarity of the new test setups. I've identified a conflicting policy configuration in the ask_user tests and an unreliable setTimeout in test-helper.ts, both of which could lead to test flakiness. Please see my detailed comments for suggestions on how to address these issues.
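
The review flags a fixed setTimeout as a flakiness risk but does not show the code in question. One common alternative, sketched here under that assumption, is to poll for a condition until a deadline instead of sleeping for a fixed interval. `waitUntil` and its option names are hypothetical and are not taken from test-helper.ts.

```typescript
// Polls `condition` until it returns true or the timeout elapses.
// Unlike a fixed sleep, this returns as soon as the condition holds and
// fails loudly when it never does.
async function waitUntil(
  condition: () => boolean,
  options: { timeoutMs?: number; intervalMs?: number } = {},
): Promise<void> {
  const { timeoutMs = 5000, intervalMs = 50 } = options;
  const deadline = Date.now() + timeoutMs;
  while (!condition()) {
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    // Yield to the event loop before checking again.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A fixed delay passes on a fast machine and fails on a loaded CI runner, while a polling wait adapts to both and reports a clear timeout error.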

@Adib234 Adib234 marked this pull request as ready for review March 12, 2026 15:01
@Adib234 Adib234 changed the title from "[DRAFT] Fix AskUser evals" to "fix(plan): Fix AskUser evals" Mar 12, 2026
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the AskUser evaluation tests to run in an interactive environment, which is a necessary fix since the AskUser tool is not supported in non-interactive mode. The changes involve introducing a new appEvalTest helper, updating test assertions to wait for interactive confirmations, and enhancing the test rig to support waiting for multiple possible tool calls. A potential issue was found in how default configurations are merged with test-specific overrides, which could lead to incorrect test setups, violating a rule about consistent merge operations.
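
The review does not spell out how the merge goes wrong. One common way such a problem arises, shown here purely as an assumption, is a shallow spread that replaces a whole nested section of the defaults. The config keys below (`general`, `enableAutoUpdate`, `approvalMode`) and the `deepMerge` helper are hypothetical and may not match the project's settings schema.

```typescript
type PlainObject = { [key: string]: unknown };

function isPlainObject(value: unknown): value is PlainObject {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Recursively merges `overrides` into a copy of `base`.
// Nested objects are merged key by key; other values are replaced.
function deepMerge(base: PlainObject, overrides: PlainObject): PlainObject {
  const result: PlainObject = { ...base };
  for (const [key, value] of Object.entries(overrides)) {
    const existing = result[key];
    result[key] =
      isPlainObject(existing) && isPlainObject(value)
        ? deepMerge(existing, value)
        : value;
  }
  return result;
}

// Hypothetical defaults and per-test overrides.
const defaults = { general: { enableAutoUpdate: false } };
const overrides = { general: { approvalMode: 'default' } };

// Shallow spread: `general` from overrides replaces the default section,
// so enableAutoUpdate is silently lost.
const shallow = { ...defaults, ...overrides };

// Deep merge: both keys survive under `general`.
const merged = deepMerge(defaults, overrides);
```

With the shallow version, a test that overrides one key in a section drops every default in that section, which is the kind of incorrect setup the review warns about.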

Adib234 commented Mar 12, 2026

Nightly evals workflow run after recent changes: https://github.com/google-gemini/gemini-cli/actions/runs/23008601171

);
}

// Bypassing terminal keybindings setup prompt for interactive tests
Member

We run the tests in parallel. Is updating the home dir going to cause intermittent test failures?

Contributor Author

Added _createStateFile to make sure that state.json mocking happens within the isolated TestRig.homeDir unique to each test run.
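
The isolation described in this reply can be sketched with a fresh temporary home directory per test, so that parallel tests never write the same state.json. This is a minimal sketch: `createIsolatedHome` is a hypothetical stand-in, not the actual `_createStateFile` implementation, and the `eval-home-` prefix is an arbitrary choice.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical helper: creates a unique home directory for one test run
// and writes .gemini/state.json inside it.
function createIsolatedHome(): string {
  // mkdtempSync appends random characters, so each call gets its own directory.
  const homeDir = fs.mkdtempSync(path.join(os.tmpdir(), 'eval-home-'));
  const stateDir = path.join(homeDir, '.gemini');
  fs.mkdirSync(stateDir, { recursive: true });
  fs.writeFileSync(
    path.join(stateDir, 'state.json'),
    JSON.stringify({ terminalSetupPromptShown: true }),
  );
  return homeDir;
}
```

Because nothing is written to the real home directory, tests running in parallel cannot race on a shared file.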

},
},
files: {
'.gemini/state.json': JSON.stringify({ terminalSetupPromptShown: true }),
Member

We're writing this file in the test-helper.ts. Do we need it both here and there?

Also, can we avoid the [action at a distance](https://en.wikipedia.org/wiki/Action_at_a_distance_(computer_programming)) phenomenon by plumbing this option through to the TestRig and having it pass a command-line arg or write the config file in one place? We should ideally write and read state.json directly in just one place in the codebase to minimize the risk of regression.

Contributor Author

Created _createStateFile so that state.json is read and written in one place.

@Adib234 Adib234 added this pull request to the merge queue Mar 13, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Mar 13, 2026
@Adib234 Adib234 added this pull request to the merge queue Mar 13, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Mar 13, 2026
@Adib234 Adib234 enabled auto-merge March 13, 2026 13:13
@Adib234 Adib234 added this pull request to the merge queue Mar 13, 2026
@gemini-cli gemini-cli bot added area/core Issues related to User Interface, OS Support, Core Functionality 🔒 maintainer only ⛔ Do not contribute. Internal roadmap item. labels Mar 13, 2026
Merged via the queue into main with commit 263b8cd Mar 13, 2026
27 checks passed
@Adib234 Adib234 deleted the adibakm/fix-ask-user-evals branch March 13, 2026 13:40
ruomengz pushed a commit that referenced this pull request Mar 13, 2026
Development

Successfully merging this pull request may close these issues.

Fix failing AskUser evals

2 participants