feat(evals): implement related evaluation system for targeted testing #24877

Open
alisa-alisa wants to merge 8 commits into main from alisa/eval_suites

@alisa-alisa
Contributor

Summary

Details

Related Issues

How to Validate

Pre-Merge Checklist

  • Updated relevant documentation and README (if needed)
  • Added/updated tests (if needed)
  • Noted breaking changes (if any)
  • Validated on required platforms/methods:
    • MacOS
      • npm run
      • npx
      • Docker
      • Podman
      • Seatbelt
    • Windows
      • npm run
      • npx
      • Docker
    • Linux
      • npm run
      • npx
      • Docker

@alisa-alisa alisa-alisa marked this pull request as ready for review April 8, 2026 02:36
@gemini-cli
Contributor

gemini-cli bot commented Apr 8, 2026

Hi @alisa-alisa, thank you so much for your contribution to Gemini CLI! We really appreciate the time and effort you've put into this.

We're making some updates to our contribution process to improve how we track and review changes. Please take a moment to review our recent discussion post: Improving Our Contribution Process & Introducing New Guidelines.

Key Update: Starting January 26, 2026, the Gemini CLI project will require all pull requests to be associated with an existing issue. Any pull requests not linked to an issue by that date will be automatically closed.

Thank you for your understanding and for being a part of our community!

@alisa-alisa alisa-alisa requested review from a team as code owners April 8, 2026 02:36
@github-actions

github-actions bot commented Apr 8, 2026

Size Change: -4 B (0%)

Total Size: 34 MB

| Filename | Size | Change |
| --- | --- | --- |
| ./bundle/chunk-6NC7VEIE.js | 0 B | -14.8 MB (removed) 🏆 |
| ./bundle/chunk-KGKDW64C.js | 0 B | -3.16 MB (removed) 🏆 |
| ./bundle/chunk-YZMOYGHO.js | 0 B | -3.8 kB (removed) 🏆 |
| ./bundle/core-EFVQHIPO.js | 0 B | -46 kB (removed) 🏆 |
| ./bundle/devtoolsService-HEWEMTVN.js | 0 B | -28.4 kB (removed) 🏆 |
| ./bundle/gemini-XFAXZEY7.js | 0 B | -552 kB (removed) 🏆 |
| ./bundle/interactiveCli-EOUTS4RE.js | 0 B | -1.66 MB (removed) 🏆 |
| ./bundle/oauth2-provider-HNL3PGOR.js | 0 B | -9.16 kB (removed) 🏆 |
| ./bundle/chunk-EF4VWSDY.js | 3.16 MB | +3.16 MB (new file) 🆕 |
| ./bundle/chunk-MWSLZPCV.js | 3.8 kB | +3.8 kB (new file) 🆕 |
| ./bundle/chunk-VGWUQILE.js | 14.8 MB | +14.8 MB (new file) 🆕 |
| ./bundle/core-OZACINWD.js | 46 kB | +46 kB (new file) 🆕 |
| ./bundle/devtoolsService-ZTYLDWMR.js | 28.4 kB | +28.4 kB (new file) 🆕 |
| ./bundle/gemini-B4ZUVDIR.js | 552 kB | +552 kB (new file) 🆕 |
| ./bundle/interactiveCli-L4FGCOSA.js | 1.66 MB | +1.66 MB (new file) 🆕 |
| ./bundle/oauth2-provider-RS2PMGIT.js | 9.16 kB | +9.16 kB (new file) 🆕 |
ℹ️ View Unchanged
| Filename | Size | Change |
| --- | --- | --- |
| ./bundle/bundled/third_party/index.js | 8 MB | 0 B |
| ./bundle/chunk-34MYV7JD.js | 2.45 kB | 0 B |
| ./bundle/chunk-5AUYMPVF.js | 858 B | 0 B |
| ./bundle/chunk-5PS3AYFU.js | 1.18 kB | 0 B |
| ./bundle/chunk-664ZODQF.js | 124 kB | 0 B |
| ./bundle/chunk-DAHVX5MI.js | 206 kB | 0 B |
| ./bundle/chunk-IUUIT4SU.js | 56.5 kB | 0 B |
| ./bundle/chunk-P52CHEF3.js | 1.96 MB | 0 B |
| ./bundle/chunk-RJTRUG2J.js | 39.8 kB | 0 B |
| ./bundle/cleanup-KBIRXHLW.js | 0 B | -932 B (removed) 🏆 |
| ./bundle/devtools-36NN55EP.js | 696 kB | 0 B |
| ./bundle/dist-T73EYRDX.js | 356 B | 0 B |
| ./bundle/events-XB7DADIJ.js | 418 B | 0 B |
| ./bundle/gemini.js | 4.97 kB | 0 B |
| ./bundle/getMachineId-bsd-TXG52NKR.js | 1.55 kB | 0 B |
| ./bundle/getMachineId-darwin-7OE4DDZ6.js | 1.55 kB | 0 B |
| ./bundle/getMachineId-linux-SHIFKOOX.js | 1.34 kB | 0 B |
| ./bundle/getMachineId-unsupported-5U5DOEYY.js | 1.06 kB | 0 B |
| ./bundle/getMachineId-win-6KLLGOI4.js | 1.72 kB | 0 B |
| ./bundle/memoryDiscovery-23MICZEL.js | 980 B | 0 B |
| ./bundle/multipart-parser-KPBZEGQU.js | 11.7 kB | 0 B |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/client/main.js | 222 kB | 0 B |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/src/_client-assets.js | 229 kB | 0 B |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/src/index.js | 13.4 kB | 0 B |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/src/types.js | 132 B | 0 B |
| ./bundle/sandbox-macos-permissive-open.sb | 890 B | 0 B |
| ./bundle/sandbox-macos-permissive-proxied.sb | 1.31 kB | 0 B |
| ./bundle/sandbox-macos-restrictive-open.sb | 3.36 kB | 0 B |
| ./bundle/sandbox-macos-restrictive-proxied.sb | 3.56 kB | 0 B |
| ./bundle/sandbox-macos-strict-open.sb | 4.82 kB | 0 B |
| ./bundle/sandbox-macos-strict-proxied.sb | 5.02 kB | 0 B |
| ./bundle/src-QVCVGIUX.js | 47 kB | 0 B |
| ./bundle/tree-sitter-7U6MW5PS.js | 274 kB | 0 B |
| ./bundle/tree-sitter-bash-34ZGLXVX.js | 1.84 MB | 0 B |
| ./bundle/cleanup-ZN7YV5ZN.js | 932 B | +932 B (new file) 🆕 |

compressed-size-action

Comment thread scripts/changed_prompt.js Fixed
@gemini-cli gemini-cli bot added the priority/p1 label (Important and should be addressed in the near term.) Apr 8, 2026
@github-actions

github-actions bot commented Apr 8, 2026

🧪 Related Evaluation Rationale

  • Force-testing all tests in evals/failing_steering.eval.ts (part of core_steering suite) because the file was modified.
  • Testing core_steering because packages/core/src/prompts/snippets.ts was modified.

Something missing? Update evals/suites.json to adjust detection logic.
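
The rationale above implies a lookup from changed file paths to eval suites. A minimal sketch of how such a lookup could work follows; it is not the PR's actual implementation. Only `description` and `evals` are visible in this thread, so the `triggers` field (source-path prefixes that select a suite) and the `selectRelated` function are hypothetical names introduced for illustration.

```typescript
// Hypothetical schema: this thread only shows `description` and `evals`.
// `triggers` is an assumed field holding source-path prefixes.
interface Suite {
  description: string;
  triggers: string[];
  evals: string[];
}

interface RelatedSelection {
  suites: string[]; // suites selected because a trigger path changed
  forcedEvals: string[]; // eval files that were themselves modified
  rationale: string[]; // one human-readable line per decision
}

function selectRelated(
  changedFiles: string[],
  config: { [name: string]: Suite },
): RelatedSelection {
  const suites: string[] = [];
  const forcedEvals: string[] = [];
  const rationale: string[] = [];

  for (const file of changedFiles) {
    for (const name of Object.keys(config)) {
      const suite = config[name];
      if (!suite) continue;
      if (suite.evals.indexOf(file) !== -1) {
        // A modified eval file is always re-run in full.
        if (forcedEvals.indexOf(file) === -1) {
          forcedEvals.push(file);
          rationale.push(
            `Force-testing all tests in ${file} (part of ${name} suite) because the file was modified.`,
          );
        }
      } else if (suite.triggers.some((prefix) => file.indexOf(prefix) === 0)) {
        // A changed source file under a trigger prefix selects the whole suite once.
        if (suites.indexOf(name) === -1) {
          suites.push(name);
          rationale.push(`Testing ${name} because ${file} was modified.`);
        }
      }
    }
  }
  return { suites, forcedEvals, rationale };
}
```

Under these assumptions, passing the two files named in the rationale with a `core_steering` entry whose trigger prefix is `packages/core/src/prompts/` would select that suite once and force-test the modified eval file.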


60 tests passed successfully on gemini-3-flash-preview.

🧠 Model Steering Guidance

This PR modifies files that affect the model's behavior (prompts, tools, or instructions).

  • 🚀 Maintainer Reminder: Please ensure that these changes do not regress results on benchmark evals before merging.

This is an automated guidance message triggered by steering logic signatures.

@gemini-code-assist
Contributor

Warning

Gemini encountered an error creating the summary. You can try again by commenting /gemini summary.

Comment thread evals/README.md
```bash
# Run the full regression loop for a specific model
# Run the targeted regression loop for your changes
MODEL_LIST=gemini-3-flash-preview node scripts/run_eval_regression.js --related
```
Member

@gundermanc gundermanc Apr 8, 2026

Should we just make this an npm run script? Maybe it should default to 3 flash to simplify the command?
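
One way the suggested default could look is sketched below. It assumes `MODEL_LIST` is a comma-separated list, which this thread does not confirm, and `resolveModelList` and `DEFAULT_MODEL_LIST` are hypothetical names rather than anything in `scripts/run_eval_regression.js`.

```typescript
// Hypothetical default, mirroring the model named in the README command.
const DEFAULT_MODEL_LIST = 'gemini-3-flash-preview';

// Assumption: MODEL_LIST is a comma-separated list of model names.
function resolveModelList(env: { [key: string]: string | undefined }): string[] {
  const raw = env['MODEL_LIST'];
  const parsed = (raw === undefined ? '' : raw)
    .split(',')
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
  // Fall back to the default when the variable is unset or effectively empty.
  return parsed.length > 0 ? parsed : [DEFAULT_MODEL_LIST];
}
```

With a fallback like this, an npm script would not need the `MODEL_LIST=...` prefix for the common case, while an explicit value would still take precedence.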

Comment thread evals/README.md
#### Updating Detection Logic

If you add a new tool or functional area, you should update `evals/suites.json`
to ensure your new evaluations are triggered correctly.
Member

I worry that this will get out of sync. Maybe this hint is enough for the agent to auto-update the config, maybe not.

Can we run Gemini CLI and task it with picking test files to run based on metadata + file paths edited?

It's not deterministic but seems a lot more flexible.
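
If the static mapping is kept, the drift concern could be partly addressed by a deterministic consistency check, for example in CI. The sketch below is one possible shape, not part of this PR: `checkSuiteCoverage` is a hypothetical function, and it relies only on the `evals` arrays visible in the `evals/suites.json` excerpt. The caller is assumed to supply the list of eval files found on disk.

```typescript
// Only `description` and `evals` are taken from the excerpt in this thread.
interface SuiteEntry {
  description: string;
  evals: string[];
}

interface CoverageReport {
  unmapped: string[]; // eval files on disk that no suite lists
  stale: string[]; // suite entries pointing at files that no longer exist
}

function checkSuiteCoverage(
  evalFilesOnDisk: string[],
  config: { [name: string]: SuiteEntry },
): CoverageReport {
  // Collect every eval path referenced by any suite, without duplicates.
  const listed: string[] = [];
  for (const name of Object.keys(config)) {
    const entry = config[name];
    if (!entry) continue;
    for (const evalPath of entry.evals) {
      if (listed.indexOf(evalPath) === -1) listed.push(evalPath);
    }
  }
  const unmapped = evalFilesOnDisk.filter((file) => listed.indexOf(file) === -1);
  const stale = listed.filter((evalPath) => evalFilesOnDisk.indexOf(evalPath) === -1);
  return { unmapped, stale };
}
```

A non-empty `unmapped` or `stale` list could fail the build, which would catch a new eval file that was never added to a suite. It would not catch a missing source-path trigger, so it complements rather than replaces the diff-analysis idea above.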

Comment thread evals/suites.json
"evals": ["evals/background_processes.eval.ts"]
},
"core_steering": {
"description": "System prompts and core model steering",
Member

I think this might not be granular enough. If the user changes the topic prompt, for example, we might break the update_topic tool's scenarios, but we don't run the tests.

There are a few ways we could fix this:

  • Run any tests that are affected by the system prompt -- may be too broad. Any prompt change would run just about everything.
  • Refactor the prompt into separate files by feature area and increase the specificity of the mapping.
  • Use Gemini CLI analysis of the diff to decide.

I'd lean towards the last option, simply because it'll be the most robust and flexible regardless of the shape of the code.
