feat(evals): implement related evaluation system for targeted testing #24877
alisa-alisa wants to merge 8 commits into main
Conversation
Hi @alisa-alisa, thank you so much for your contribution to Gemini CLI! We really appreciate the time and effort you've put into this.

We're making some updates to our contribution process to improve how we track and review changes. Please take a moment to review our recent discussion post: Improving Our Contribution Process & Introducing New Guidelines.

Key Update: Starting January 26, 2026, the Gemini CLI project will require all pull requests to be associated with an existing issue. Any pull requests not linked to an issue by that date will be automatically closed.

Thank you for your understanding and for being a part of our community!
Size Change: -4 B (0%)
Total Size: 34 MB
🧪 Related Evaluation Rationale

Something missing? Update `evals/suites.json` to adjust detection logic.

✅ 60 tests passed successfully on gemini-3-flash-preview.

🧠 Model Steering Guidance

This PR modifies files that affect the model's behavior (prompts, tools, or instructions). This is an automated guidance message triggered by steering logic signatures.
Warning: Gemini encountered an error creating the summary. You can try again by commenting.
```bash
# Run the targeted regression loop for your changes
MODEL_LIST=gemini-3-flash-preview node scripts/run_eval_regression.js --related
```
Should we just make this an npm run script? Maybe it should default to 3 flash to simplify the command?
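For the record, a `package.json` entry along these lines could wrap the command (the script name `eval:related` and the flash default are assumptions, not something in this PR; the inline `MODEL_LIST=` prefix works in POSIX shells only):

```json
{
  "scripts": {
    "eval:related": "MODEL_LIST=gemini-3-flash-preview node scripts/run_eval_regression.js --related"
  }
}
```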
#### Updating Detection Logic

If you add a new tool or functional area, you should update `evals/suites.json` to ensure your new evaluations are triggered correctly.
I worry that this will get out of sync. Maybe this hint is enough for the agent to auto-update the config, maybe not.
Can we run Gemini CLI and task it with picking test files to run based on metadata + file paths edited?
It's not deterministic but seems a lot more flexible.
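For reference, a new `suites.json` entry might look something like this (the `paths` key is an assumption for illustration; only `description` and `evals` are visible in the diff below):

```json
"shell_tools": {
  "description": "Shell execution and background process tools",
  "paths": ["packages/core/src/tools/shell*"],
  "evals": ["evals/shell_tools.eval.ts"]
}
```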
| "evals": ["evals/background_processes.eval.ts"] | ||
| }, | ||
| "core_steering": { | ||
| "description": "System prompts and core model steering", |
I think this might not be granular enough. If the user changes the topic prompt, for example, we might break the update_topic tool's scenarios, but we don't run the tests.
A few ways we could fix this:
- Run any tests that are affected by the system prompt -- may be too broad. Any prompt change would run just about everything.
- Refactor the prompt into separate files by feature area and increase the specificity of the mapping.
- Use Gemini CLI analysis of the diff to decide.
I'd lean towards the last option, simply because it'll be the most robust and flexible regardless of the shape of the code.
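For comparison, the deterministic path-mapping approach being discussed can be sketched in a few lines (all names here are illustrative; this is not the actual `run_eval_regression.js` implementation, and the `paths` patterns are assumptions):

```javascript
// Illustrative suite registry: suite ids map to path patterns and eval files.
// The eval file names mirror the suites.json fragment above; the regexes
// standing in for path globs are assumptions for this sketch.
const suites = {
  core_steering: {
    description: "System prompts and core model steering",
    paths: [/^packages\/core\/src\/prompts\//],
    evals: ["evals/core_steering.eval.ts"],
  },
  background_processes: {
    description: "Background process management",
    paths: [/^packages\/core\/src\/services\/shell/],
    evals: ["evals/background_processes.eval.ts"],
  },
};

// Collect the eval files for every suite whose path patterns match
// at least one changed file.
function relatedEvals(changedFiles) {
  const selected = new Set();
  for (const file of changedFiles) {
    for (const suite of Object.values(suites)) {
      if (suite.paths.some((re) => re.test(file))) {
        suite.evals.forEach((e) => selected.add(e));
      }
    }
  }
  return [...selected];
}

console.log(relatedEvals(["packages/core/src/prompts/system.ts"]));
// → [ 'evals/core_steering.eval.ts' ]
```

The weakness the comment points out shows up here directly: a change to `packages/core/src/prompts/system.ts` selects only the core_steering evals, even if a tool's scenarios depend on that prompt.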
Summary
Details
Related Issues
How to Validate
Pre-Merge Checklist