Add optional eval set result persistence to AgentEvaluator #4414

ftnext wants to merge 5 commits into google:main
Add optional eval result persistence to AgentEvaluator to align programmatic evaluation with existing EvalSetResultsManager workflows.

- Extend `AgentEvaluator.evaluate()` with:
  - `app_name: Optional[str] = None`
  - `eval_set_results_manager: Optional[EvalSetResultsManager] = None`
- Extend `AgentEvaluator.evaluate_eval_set()` with the same optional parameters.
- Persist aggregated `EvalCaseResult` entries per eval set when a results manager is provided.
- Save results before the failure assertion so failed runs still leave artifacts.
- Add app name resolution logic (explicit `app_name` first, then derive from `agent_module`, including `.agent` suffix handling).
- Add unit tests covering explicit/derived `app_name`, save-on-failure behavior, and argument propagation from `evaluate()` to `evaluate_eval_set()`.
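Below is a minimal usage sketch of the call this commit describes (import paths, the async call style, and the eval-set path parameter name are assumptions rather than verbatim repository code; only `app_name` and `eval_set_results_manager` come from this change):

```python
# Hedged sketch of the extended AgentEvaluator.evaluate() call described in this commit.
# Import paths and the eval-set path parameter name are assumptions; only app_name and
# eval_set_results_manager are taken from this PR.
from google.adk.evaluation.agent_evaluator import AgentEvaluator
from google.adk.evaluation.local_eval_set_results_manager import LocalEvalSetResultsManager


async def run_eval_with_persistence() -> None:
  results_manager = LocalEvalSetResultsManager(agents_dir="/tmp/agents")
  await AgentEvaluator.evaluate(
      agent_module="home_automation_agent",
      eval_dataset_file_path_or_dir="path/to/eval_set.test.json",  # placeholder path
      app_name="home_automation_agent",          # explicit; otherwise derived from agent_module
      eval_set_results_manager=results_manager,  # new: persists aggregated EvalCaseResult entries
  )
```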
Add an integration test that demonstrates how to persist eval set results from `AgentEvaluator.evaluate()` without explicitly passing `app_name`.

- Use `LocalEvalSetResultsManager` with pytest `tmp_path` as `agents_dir`.
- Call `AgentEvaluator.evaluate()` with `eval_set_results_manager` only.
- Verify that an eval set result file is created under:
  `<tmp_path>/home_automation_agent/.adk/eval_history/*.evalset_result.json`

This serves as a usage example and verifies derived `app_name` behavior in an end-to-end evaluation flow.
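A hedged sketch of what such an integration test might look like, assuming pytest-asyncio and the import paths shown; the fixture module and eval file path are placeholders, while the `.adk/eval_history` layout comes from the commit message above:

```python
# Hedged sketch of the integration test described above (fixture module and eval file
# path are placeholders; the .adk/eval_history layout is taken from the commit message).
import pytest

from google.adk.evaluation.agent_evaluator import AgentEvaluator
from google.adk.evaluation.local_eval_set_results_manager import LocalEvalSetResultsManager


@pytest.mark.asyncio
async def test_eval_results_persisted_with_derived_app_name(tmp_path):
  results_manager = LocalEvalSetResultsManager(agents_dir=str(tmp_path))

  await AgentEvaluator.evaluate(
      agent_module="home_automation_agent",
      eval_dataset_file_path_or_dir="path/to/eval_set.test.json",  # placeholder
      eval_set_results_manager=results_manager,  # app_name omitted: derived from agent_module
  )

  # The results manager should have written at least one eval set result file
  # under the derived app_name directory.
  result_files = list(
      (tmp_path / "home_automation_agent" / ".adk" / "eval_history").glob(
          "*.evalset_result.json"
      )
  )
  assert result_files
```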
Move newly added optional parameters (`app_name`, `eval_set_results_manager`) to the end of public AgentEvaluator method signatures to avoid breaking existing positional-argument callers.

- Keep backward compatibility for:
  - `AgentEvaluator.evaluate_eval_set(...)`
  - `AgentEvaluator.evaluate(...)`
- Add regression tests covering positional argument behavior for:
  - `print_detailed_results` in `evaluate_eval_set`
  - `initial_session_file` and `print_detailed_results` in `evaluate`
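As an illustration of the compatibility guarantee (the leading positional parameters shown are an assumption about the existing public signature, not verbatim repository code), a pre-existing positional call is unaffected because the new parameters are appended after all existing ones with `None` defaults:

```python
from google.adk.evaluation.agent_evaluator import AgentEvaluator


async def existing_caller() -> None:
  # Existing positional-style call: still valid because app_name and
  # eval_set_results_manager are appended at the end of the signature with
  # None defaults. Leading parameter order is an assumption, not verbatim ADK code.
  await AgentEvaluator.evaluate(
      "home_automation_agent",       # agent_module
      "path/to/eval_set.test.json",  # eval data path (placeholder)
  )
```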
Summary of Changes

Hello @ftnext, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses the lack of built-in evaluation set result persistence in `AgentEvaluator`.

Highlights
Code Review
This pull request introduces a valuable feature for persisting evaluation results in AgentEvaluator, which aligns well with existing workflows. The changes are thoughtfully implemented, maintaining backward compatibility by adding new optional parameters to the end of method signatures. The new helper methods for resolving application names and managing results are well-designed and encapsulated. The accompanying tests are comprehensive, covering the new functionality, edge cases like saving results on failure, and ensuring positional argument compatibility is not broken. I have one minor suggestion to improve the robustness of the result flattening logic.
…o per-case as `adk eval` and Web UI/API
Please ensure you have read the contribution guide before creating a pull request.
Link to Issue or Description of Change
1. Link to an existing issue (if applicable):
Problem:
`AgentEvaluator.evaluate()` did not support built-in eval set result persistence, making it harder to reuse the same workflow as CLI/Web paths that already use `EvalSetResultsManager`. Also, introducing new parameters in the middle of method signatures would break positional-argument compatibility for existing users.
Solution:
This PR adds optional eval result persistence to `AgentEvaluator` while preserving backward compatibility:

- `AgentEvaluator.evaluate()` and `AgentEvaluator.evaluate_eval_set()` gain two optional parameters:
  - `app_name: Optional[str] = None`
  - `eval_set_results_manager: Optional[EvalSetResultsManager] = None`
- Persist each `EvalCaseResult` (one save per `EvalCaseResult`), aligning `AgentEvaluator` with existing CLI/Web/API persistence behavior.
- Resolve `app_name` from explicit input first, then derive it from `agent_module` (including `.agent` suffix handling).
- Unit tests cover explicit/derived `app_name` and argument propagation from `evaluate()` to `evaluate_eval_set()`.
- An integration test covers `app_name` omission with `LocalEvalSetResultsManager`.

Testing Plan
Unit Tests:
Manual End-to-End (E2E) Tests:
Verify result files are created under:

`<tmp_path>/<derived_app_name>/.adk/eval_history/*.evalset_result.json`

(e.g., 2 files when `num_runs=2` on a single-case eval fixture). This is helpful for debugging failed integration tests.
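For the manual check, a small hedged helper along these lines can list the persisted files (the `agents_dir` value is a placeholder and must match the directory handed to the results manager):

```python
from pathlib import Path

# Placeholder; use the same agents_dir that was passed to the results manager.
agents_dir = Path("/tmp/agents")

# Collect every persisted eval set result file under the derived app_name directories.
result_files = sorted(agents_dir.glob("*/.adk/eval_history/*.evalset_result.json"))
print(f"{len(result_files)} eval set result file(s) found")
for path in result_files:
  print(f"  {path}")
```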
Checklist
Additional context