🔴 Required Information
Is your feature request related to a specific problem?
The Evaluator.evaluate_invocations interface does not receive session state, making it impossible to build custom evaluators that assess the state an agent produces. The method signature only accepts actual_invocations, expected_invocations, and conversation_scenario:
```python
# google.adk.evaluation.evaluator.Evaluator
def evaluate_invocations(
    self,
    actual_invocations: list[Invocation],
    expected_invocations: Optional[list[Invocation]] = None,
    conversation_scenario: Optional[ConversationScenario] = None,
) -> EvaluationResult:
```
The Invocation model itself contains user_content, final_response, intermediate_data, rubrics, and app_details — but no session state.
Meanwhile, the data model already anticipates session state evaluation:
- EvalCase.final_session_state: Optional[SessionState] exists, with the docstring "The expected final session state at the end of the conversation"
- EvalCaseResult.session_details: Optional[Session] is populated by LocalEvalService with the full session (including Session.state)
The plumbing exists, but the evaluation pipeline never wires session state through to evaluators — meaning EvalCase.final_session_state is defined but unused.
Describe the Solution You'd Like
Extend the Evaluator.evaluate_invocations interface to optionally receive the initial and final session state, so that custom evaluators can access them:
```python
def evaluate_invocations(
    self,
    actual_invocations: list[Invocation],
    expected_invocations: Optional[list[Invocation]] = None,
    conversation_scenario: Optional[ConversationScenario] = None,
    # New parameters:
    initial_session_state: Optional[SessionState] = None,
    final_session_state: Optional[SessionState] = None,
) -> EvaluationResult:
```
Where:
- initial_session_state is the state the session started with (from EvalCase.session_input.state)
- final_session_state is the state after inference completed (from Session.state via EvalCaseResult.session_details)
LocalEvalService._evaluate_metric would forward these from data already available at evaluation time. This is backward-compatible since both new parameters default to None, and existing evaluators can simply ignore them.
With this change, developers could build custom Evaluator subclasses that validate session state — compare it against EvalCase.final_session_state (the expected golden state), check that specific keys were populated, verify structural constraints, or use an LLM-as-judge over the state contents.
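As a concrete illustration, here is what such a subclass could look like under the proposed signature. This is a self-contained sketch, not ADK code: the stand-in types below replace the real Invocation, SessionState, and EvaluationResult classes, and the scoring rule (fraction of expected key/value pairs present in the final state) is a hypothetical example metric.

```python
from abc import ABC
from typing import Any, Optional

# Stand-ins so the sketch runs on its own; real code would import the
# corresponding ADK types instead.
Invocation = dict[str, Any]
SessionState = dict[str, Any]


class EvaluationResult:
    def __init__(self, overall_score: float):
        self.overall_score = overall_score


class Evaluator(ABC):
    def evaluate_invocations(
        self,
        actual_invocations: list[Invocation],
        expected_invocations: Optional[list[Invocation]] = None,
        conversation_scenario: Optional[Any] = None,
        initial_session_state: Optional[SessionState] = None,
        final_session_state: Optional[SessionState] = None,
    ) -> EvaluationResult:
        raise NotImplementedError()


class FinalStateMatchEvaluator(Evaluator):
    """Scores the fraction of expected key/value pairs that appear in
    the actual final session state (a toy metric for illustration)."""

    def __init__(self, expected_state: SessionState):
        self.expected_state = expected_state

    def evaluate_invocations(
        self,
        actual_invocations,
        expected_invocations=None,
        conversation_scenario=None,
        initial_session_state=None,
        final_session_state=None,
    ) -> EvaluationResult:
        final = final_session_state or {}
        if not self.expected_state:
            return EvaluationResult(overall_score=1.0)
        matched = sum(
            1 for k, v in self.expected_state.items() if final.get(k) == v
        )
        return EvaluationResult(overall_score=matched / len(self.expected_state))


evaluator = FinalStateMatchEvaluator({"owner": "alice", "timeline": ["draft"]})
result = evaluator.evaluate_invocations(
    [], final_session_state={"owner": "alice", "timeline": ["draft"], "extra": 1}
)
print(result.overall_score)  # → 1.0
```

The same shape extends naturally to the other checks mentioned above: verifying that keys exist, enforcing structural constraints, or handing the state contents to an LLM judge.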
Impact on your work
We are building multi-agent systems where agents collaboratively produce structured outputs stored in session state (e.g., timelines, assignments). The conversation is the means, but session state is the actual output artifact.
Currently, ADK's evaluation metrics can only evaluate text responses and tool calls, which represent a fraction of the agent's real output. The structured data we need to evaluate lives in session state and is invisible to the evaluator pipeline.
We have a workaround, but first-class support would let us leverage ADK's metric aggregation, reporting, and eval infrastructure for session state evaluation instead of maintaining a parallel post-processing pipeline.
Willingness to contribute
No
🟡 Recommended Information
Describe Alternatives You've Considered
We currently bypass the Evaluator interface entirely and run session-state-aware evaluation as a post-processing step after ADK's pipeline completes. We built a custom LLM judge class that receives EvalCaseResult, extracts conversation history from invocations, and additionally pulls structured data from session_details.state to include as context for the judge LLM.
Why this is suboptimal:
- It runs outside the ADK evaluation pipeline, so results aren't integrated into ADK's per-invocation metric tracking, aggregation, or EvalMetricResult reporting.
- It requires custom plumbing to collect EvalSetResult objects after cli_eval completes, run judges separately, and merge results into our own reporting format.
- EvalCase.final_session_state exists in the data model but is completely unused — we can't leverage it for defining expected state in eval sets.
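For reference, the post-processing step of that workaround boils down to something like the sketch below. Plain dicts stand in for EvalCaseResult and the invocation objects, and build_judge_context is a hypothetical helper of ours, not an ADK API:

```python
import json


def build_judge_context(eval_case_result: dict) -> str:
    """Assemble the context our external LLM judge receives: the
    conversation turns from the invocations, plus the structured data
    pulled from session_details.state."""
    turns = [
        f"user: {inv['user_content']}\nagent: {inv['final_response']}"
        for inv in eval_case_result["invocations"]
    ]
    state = eval_case_result["session_details"]["state"]
    return (
        "\n".join(turns)
        + "\n\nFinal session state:\n"
        + json.dumps(state, indent=2)
    )


result = {
    "invocations": [
        {"user_content": "Plan the launch.", "final_response": "Timeline drafted."}
    ],
    "session_details": {"state": {"timeline": ["draft", "review", "ship"]}},
}
print(build_judge_context(result))
```

Everything after the judge call — scoring, aggregation, reporting — then has to be reimplemented outside ADK, which is exactly the duplication the proposed interface change would remove.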
Proposed API / Implementation
1. Extend Evaluator.evaluate_invocations
```python
# evaluator.py
class Evaluator(ABC):

    def evaluate_invocations(
        self,
        actual_invocations: list[Invocation],
        expected_invocations: Optional[list[Invocation]] = None,
        conversation_scenario: Optional[ConversationScenario] = None,
        initial_session_state: Optional[SessionState] = None,
        final_session_state: Optional[SessionState] = None,
    ) -> EvaluationResult:
        raise NotImplementedError()
```
2. Wire it through LocalEvalService._evaluate_metric
# local_eval_service.py — _evaluate_metric_for_eval_case
```python
# local_eval_service.py — _evaluate_metric_for_eval_case
evaluation_result = await self._evaluate_metric(
    eval_metric=eval_metric,
    actual_invocations=inference_result.inferences,
    expected_invocations=eval_case.conversation,
    conversation_scenario=eval_case.conversation_scenario,
    initial_session_state=eval_case.session_input.state
    if eval_case.session_input
    else None,
    final_session_state=session.state,  # from session_service.get_session()
)
```
Both values are already available at evaluation time — session_input.state from the EvalCase, and Session.state from the session service (already fetched for EvalCaseResult.session_details).
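One compatibility wrinkle worth noting: third-party Evaluator subclasses written against the current three-parameter signature would not accept the new keywords if the service passes them unconditionally. If stricter compatibility is desired, the service could forward only the parameters each evaluator declares. The guard below is a hypothetical sketch, not existing ADK code:

```python
import inspect


def forward_supported_kwargs(method, **kwargs):
    """Drop keyword arguments the method does not declare, so evaluators
    written against the old signature keep working unchanged."""
    params = inspect.signature(method).parameters
    accepted = {k: v for k, v in kwargs.items() if k in params}
    return method(**accepted)


class LegacyEvaluator:
    # Old-style signature: no session-state parameters.
    def evaluate_invocations(self, actual_invocations, expected_invocations=None):
        return len(actual_invocations)


legacy = LegacyEvaluator()
score = forward_supported_kwargs(
    legacy.evaluate_invocations,
    actual_invocations=[{}, {}],
    final_session_state={"owner": "alice"},  # silently dropped for legacy code
)
print(score)  # → 2
```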
Additional Context
EvalCase.final_session_state (in eval_case.py) already exists as Optional[SessionState] but was never connected to the evaluation pipeline. With the proposed interface change, developers could compare the actual final_session_state against this golden/expected value in their custom evaluators.