feat(evals): implement related evaluation system for targeted testing #24877
alisa-alisa wants to merge 8 commits into main
Conversation
Hi @alisa-alisa, thank you so much for your contribution to Gemini CLI! We really appreciate the time and effort you've put into this.

We're making some updates to our contribution process to improve how we track and review changes. Please take a moment to review our recent discussion post: Improving Our Contribution Process & Introducing New Guidelines.

Key Update: Starting January 26, 2026, the Gemini CLI project will require all pull requests to be associated with an existing issue. Any pull requests not linked to an issue by that date will be automatically closed.

Thank you for your understanding and for being a part of our community!
Size Change: -4 B (0%)
Total Size: 34 MB
🧪 Related Evaluation Rationale

Something missing? Update `evals/suites.json` to adjust detection logic.

✅ 60 tests passed successfully on gemini-3-flash-preview.

🧠 Model Steering Guidance

This PR modifies files that affect the model's behavior (prompts, tools, or instructions). This is an automated guidance message triggered by steering logic signatures.
Warning: Gemini encountered an error creating the summary. You can try again by commenting.
```bash
# Run the targeted regression loop for your changes
MODEL_LIST=gemini-3-flash-preview node scripts/run_eval_regression.js --related
```
Should we just make this an npm run script? Maybe it should default to 3 flash to simplify the command?
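For the record, a `package.json` entry along these lines could wrap the command (the script name `eval:related` and the flash default are assumptions, not something in this PR; the inline `MODEL_LIST=` prefix works in POSIX shells only):

```json
{
  "scripts": {
    "eval:related": "MODEL_LIST=gemini-3-flash-preview node scripts/run_eval_regression.js --related"
  }
}
```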
#### Updating Detection Logic

If you add a new tool or functional area, you should update `evals/suites.json` to ensure your new evaluations are triggered correctly.
I worry that this will get out of sync. Maybe this hint is enough for the agent to auto-update the config, maybe not.
Can we run Gemini CLI and task it with picking test files to run based on metadata + file paths edited?
It's not deterministic but seems a lot more flexible.
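For reference, a new `suites.json` entry might look something like this (the `paths` key is an assumption for illustration; only `description` and `evals` are visible in the diff below):

```json
"shell_tools": {
  "description": "Shell execution and background process tools",
  "paths": ["packages/core/src/tools/shell*"],
  "evals": ["evals/shell_tools.eval.ts"]
}
```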
| "evals": ["evals/background_processes.eval.ts"] | ||
| }, | ||
| "core_steering": { | ||
| "description": "System prompts and core model steering", |
I think this might not be granular enough. If the user changes the topic prompt, for example, we might break the update_topic tool's scenarios, but we don't run the tests.
A few ways we could fix this:
- Run any tests that are affected by the system prompt -- may be too broad. Any prompt change would run just about everything.
- Refactor the prompt into separate files by feature area and increase the specificity of the mapping.
- Use Gemini CLI analysis of the diff to decide.
I'd lean towards the last option, simply because it'll be the most robust and flexible regardless of the shape of the code.
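For comparison, the deterministic path-mapping approach being discussed can be sketched in a few lines (all names here are illustrative; this is not the actual `run_eval_regression.js` implementation, and the `paths` patterns are assumptions):

```javascript
// Illustrative suite registry: suite ids map to path patterns and eval files.
// The eval file names mirror the suites.json fragment above; the regexes
// standing in for path globs are assumptions for this sketch.
const suites = {
  core_steering: {
    description: "System prompts and core model steering",
    paths: [/^packages\/core\/src\/prompts\//],
    evals: ["evals/core_steering.eval.ts"],
  },
  background_processes: {
    description: "Background process management",
    paths: [/^packages\/core\/src\/services\/shell/],
    evals: ["evals/background_processes.eval.ts"],
  },
};

// Collect the eval files for every suite whose path patterns match
// at least one changed file.
function relatedEvals(changedFiles) {
  const selected = new Set();
  for (const file of changedFiles) {
    for (const suite of Object.values(suites)) {
      if (suite.paths.some((re) => re.test(file))) {
        suite.evals.forEach((e) => selected.add(e));
      }
    }
  }
  return [...selected];
}

console.log(relatedEvals(["packages/core/src/prompts/system.ts"]));
// → [ 'evals/core_steering.eval.ts' ]
```

The weakness the comment points out shows up here directly: a change to `packages/core/src/prompts/system.ts` selects only the core_steering evals, even if a tool's scenarios depend on that prompt.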
Summary
Details
Related Issues
How to Validate
Pre-Merge Checklist