Add optional eval set result persistence to AgentEvaluator #4414

Open

ftnext wants to merge 5 commits into google:main from ftnext:agent-evaluator-save-evalset-result

Conversation

Contributor

@ftnext commented on Feb 8, 2026

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):

Problem:
AgentEvaluator.evaluate() did not support built-in eval set result persistence, making it harder to reuse the same workflow as CLI/Web paths that already use EvalSetResultsManager.
Also, introducing new parameters in the middle of method signatures would break positional-argument compatibility for existing users.

Solution:
This PR adds optional eval result persistence to AgentEvaluator while preserving backward compatibility:

  • Add optional parameters to AgentEvaluator.evaluate() and AgentEvaluator.evaluate_eval_set():
    • app_name: Optional[str] = None
    • eval_set_results_manager: Optional[EvalSetResultsManager] = None
  • Persist results per EvalCaseResult (one save per EvalCaseResult), aligning AgentEvaluator with existing CLI/Web/API persistence behavior.
  • Resolve app_name from explicit input first, then derive from agent_module (including .agent suffix handling).
  • Save results before failure assertion so failed eval runs still leave artifacts for inspection.
  • Keep existing positional argument behavior by appending new parameters at the end of public method signatures.
  • Add/extend tests to verify:
    • explicit and derived app_name
    • save-on-failure behavior
    • argument propagation from evaluate() to evaluate_eval_set()
    • positional-argument backward compatibility
  • Add an integration usage example for app_name omission with LocalEvalSetResultsManager (see the usage sketch after this list).
  • For multi-run evals, this produces multiple result files (roughly num_runs × number_of_eval_cases; for single-case evals, effectively num_runs files).
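
The intended usage looks roughly like the sketch below. The agent module name, eval set path, and agents_dir value are placeholders, and the import paths assume the src/google/adk/evaluation layout referenced in this PR; the first two arguments are passed positionally, exactly as existing callers already do.

```python
import asyncio

from google.adk.evaluation.agent_evaluator import AgentEvaluator
from google.adk.evaluation.local_eval_set_results_manager import (
    LocalEvalSetResultsManager,
)


async def main() -> None:
  # Result files land under <agents_dir>/<app_name>/.adk/eval_history/.
  results_manager = LocalEvalSetResultsManager(agents_dir="/tmp/adk_agents")

  await AgentEvaluator.evaluate(
      "home_automation_agent",       # agent_module (placeholder)
      "eval_sets/simple.test.json",  # eval set file or directory (placeholder)
      # New parameters, appended at the end of the signature; app_name is
      # optional and is derived from agent_module when omitted.
      eval_set_results_manager=results_manager,
  )


asyncio.run(main())
```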

Testing Plan

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.
% pytest tests/unittests/evaluation

======================== 357 passed, 169 warnings in 9.68s =========================

Manual End-to-End (E2E) Tests:

% pytest tests/integration/test_with_test_file.py::test_with_single_test_file_saves_eval_set_result

======================== 1 passed, 14 warnings in 5.24s ========================

Verify result files are created under: <tmp_path>/<derived_app_name>/.adk/eval_history/*.evalset_result.json (e.g., 2 files when num_runs=2 on a single-case eval fixture).
This is helpful for debugging failed integration tests.
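
The on-disk check amounts to a glob over the derived app directory; a minimal sketch, assuming pytest's tmp_path fixture was passed as agents_dir and the derived app name is home_automation_agent:

```python
# tmp_path is the pytest fixture (a pathlib.Path) that was used as agents_dir.
eval_history_dir = tmp_path / "home_automation_agent" / ".adk" / "eval_history"
result_files = list(eval_history_dir.glob("*.evalset_result.json"))
assert result_files, "expected at least one persisted eval set result file"
```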

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules.

Additional context

  • This PR intentionally preserves public API positional compatibility by appending new optional parameters at the tail of method signatures.
  • A generated local eval result JSON file may exist in the working tree from manual verification and is intentionally not part of the code change.

Commits

Add optional eval result persistence to AgentEvaluator to align programmatic
evaluation with existing EvalSetResultsManager workflows.

- Extend AgentEvaluator.evaluate() with:
  - app_name: Optional[str] = None
  - eval_set_results_manager: Optional[EvalSetResultsManager] = None
- Extend AgentEvaluator.evaluate_eval_set() with the same optional parameters.
- Persist aggregated EvalCaseResult entries per eval set when a results manager is provided.
- Save results before failure assertion so failed runs still leave artifacts.
- Add app name resolution logic (explicit app_name first, then derive from agent_module, including ".agent" suffix handling); see the sketch after this list.
- Add unit tests covering explicit/derived app_name, save-on-failure behavior, and argument propagation from evaluate() to evaluate_eval_set().
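
A rough approximation of that resolution rule (the actual helper in agent_evaluator.py may differ in detail):

```python
from typing import Optional


def _resolve_app_name(app_name: Optional[str], agent_module: str) -> str:
  """Sketch: an explicit app_name wins; otherwise derive it from agent_module."""
  if app_name:
    return app_name
  # Drop a trailing ".agent" segment, e.g.
  # "home_automation_agent.agent" -> "home_automation_agent".
  if agent_module.endswith(".agent"):
    return agent_module[: -len(".agent")]
  return agent_module
```
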
Add an integration test that demonstrates how to persist eval set results from
AgentEvaluator.evaluate() without explicitly passing app_name.

- Use LocalEvalSetResultsManager with pytest tmp_path as agents_dir.
- Call AgentEvaluator.evaluate() with eval_set_results_manager only.
- Verify that an eval set result file is created under:
  <tmp_path>/home_automation_agent/.adk/eval_history/*.evalset_result.json

This serves as a usage example and verifies derived app_name behavior in an end-to-end evaluation flow.

Move newly added optional parameters (`app_name`, `eval_set_results_manager`)
to the end of public AgentEvaluator method signatures to avoid breaking
existing positional-argument callers.

- Keep backward compatibility for:
  - AgentEvaluator.evaluate_eval_set(...)
  - AgentEvaluator.evaluate(...)
- Add regression tests covering positional argument behavior for:
  - `print_detailed_results` in evaluate_eval_set
  - `initial_session_file` and `print_detailed_results` in evaluate
@gemini-code-assist
Contributor

Summary of Changes

Hello @ftnext, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses the lack of built-in evaluation set result persistence in AgentEvaluator, bringing its capabilities closer to CLI/Web evaluation workflows. It integrates a flexible mechanism to save evaluation outcomes, ensuring that valuable data is retained for analysis, even when evaluations do not complete successfully. The implementation carefully preserves the existing public API to prevent breaking changes for current users.

Highlights

  • Eval Set Result Persistence: Introduced optional persistence for evaluation set results within AgentEvaluator, allowing aggregated EvalCaseResult entries to be saved using an EvalSetResultsManager.
  • Backward Compatibility: New optional parameters (app_name and eval_set_results_manager) were added to AgentEvaluator.evaluate() and AgentEvaluator.evaluate_eval_set() at the end of their signatures to maintain positional argument compatibility for existing users.
  • Failure Resilience: Evaluation results are now saved before any failure assertions, ensuring that artifacts are available for inspection even if an evaluation run fails.
  • Application Name Resolution: The app_name for persistence can be explicitly provided or automatically derived from the agent_module name, including handling for .agent suffixes.
  • Comprehensive Testing: New unit and integration tests have been added to cover explicit and derived app_name usage, save-on-failure behavior, argument propagation, and positional-argument backward compatibility.


Changelog
  • src/google/adk/evaluation/agent_evaluator.py
    • Imported EvalSetResultsManager for result persistence.
    • Added app_name: Optional[str] and eval_set_results_manager: Optional[EvalSetResultsManager] parameters to evaluate_eval_set and evaluate methods.
    • Updated docstrings for evaluate_eval_set and evaluate to describe the new parameters.
    • Introduced a call to _maybe_save_eval_set_result within evaluate_eval_set to trigger persistence.
    • Added new static helper methods: _resolve_app_name for determining the application name, _flatten_eval_results_by_eval_case_order for ordering results, and _maybe_save_eval_set_result for conditional saving of evaluation results (a sketch of this save flow follows the changelog).
  • tests/integration/test_with_test_file.py
    • Imported LocalEvalSetResultsManager for local persistence testing.
    • Added test_with_single_test_file_saves_eval_set_result to verify that evaluation results are correctly persisted to disk with a derived application name when app_name is not explicitly provided.
  • tests/unittests/evaluation/test_agent_evaluator.py
    • Added new unit tests to validate the persistence functionality.
    • Includes test_evaluate_eval_set_saves_results_with_explicit_app_name to check saving with a specified app_name.
    • Includes test_evaluate_eval_set_uses_derived_app_name to verify automatic app_name derivation.
    • Includes test_evaluate_eval_set_saves_before_assert_failure to confirm results are saved even if the evaluation process later raises an AssertionError.
    • Includes test_evaluate_passes_results_manager_and_app_name to ensure evaluate correctly passes the new parameters to evaluate_eval_set.
    • Includes test_evaluate_eval_set_keeps_positional_print_detailed_results and test_evaluate_keeps_positional_initial_session_file_and_print_flag to confirm backward compatibility with existing positional arguments.
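
Read together with the changelog entries above, the conditional save plausibly reduces to the sketch below. It is an approximation only: the import paths and the save_eval_set_result(app_name, eval_set_id, eval_case_results) call are assumed from the existing evaluation module layout, and the real helper may differ.

```python
from typing import Optional

from google.adk.evaluation.eval_result import EvalCaseResult
from google.adk.evaluation.eval_set_results_manager import EvalSetResultsManager


def _maybe_save_eval_set_result(
    eval_set_results_manager: Optional[EvalSetResultsManager],
    app_name: Optional[str],
    agent_module: str,
    eval_set_id: str,
    eval_case_results: list[EvalCaseResult],
) -> None:
  """Sketch: persist each EvalCaseResult when a results manager is provided."""
  if eval_set_results_manager is None:
    return
  # _resolve_app_name is the helper sketched earlier in this PR.
  resolved_app_name = _resolve_app_name(app_name, agent_module)
  # One save per EvalCaseResult, so a multi-run eval yields roughly
  # num_runs x number_of_eval_cases result files.
  for case_result in eval_case_results:
    eval_set_results_manager.save_eval_set_result(
        resolved_app_name, eval_set_id, [case_result]
    )
```
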
Activity
  • The author, ftnext, has implemented new features to enable optional persistence of evaluation results in AgentEvaluator.
  • Unit tests have been updated and new ones added, all passing locally, to ensure the correctness and backward compatibility of the changes.
  • Manual end-to-end tests were conducted to confirm that result files are generated as expected, aiding in debugging.
  • The author has completed a self-review, added necessary code comments, and confirmed adherence to contribution guidelines.

@adk-bot added the eval label ([Component] This issue is related to evaluation) on Feb 8, 2026

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a valuable feature for persisting evaluation results in AgentEvaluator, which aligns well with existing workflows. The changes are thoughtfully implemented, maintaining backward compatibility by adding new optional parameters to the end of method signatures. The new helper methods for resolving application names and managing results are well-designed and encapsulated. The accompanying tests are comprehensive, covering the new functionality, edge cases like saving results on failure, and ensuring positional argument compatibility is not broken. I have one minor suggestion to improve the robustness of the result flattening logic.
