Evaluator interface lacks access to session state #4532

@lucasbarzotto-axonify

Description

🔴 Required Information

Is your feature request related to a specific problem?

The Evaluator.evaluate_invocations interface does not receive session state, making it impossible to build custom evaluators that assess the state an agent produces. The method signature only accepts actual_invocations, expected_invocations, and conversation_scenario:

# google.adk.evaluation.evaluator.Evaluator
def evaluate_invocations(
    self,
    actual_invocations: list[Invocation],
    expected_invocations: Optional[list[Invocation]] = None,
    conversation_scenario: Optional[ConversationScenario] = None,
) -> EvaluationResult:
    ...

The Invocation model itself contains user_content, final_response, intermediate_data, rubrics, and app_details — but no session state.

Meanwhile, the data model already anticipates session state evaluation:

  • EvalCase.final_session_state: Optional[SessionState] exists with the docstring "The expected final session state at the end of the conversation"
  • EvalCaseResult.session_details: Optional[Session] is populated by LocalEvalService with the full session (including Session.state)

The plumbing exists, but the evaluation pipeline never wires session state through to evaluators — meaning EvalCase.final_session_state is defined but unused.

Describe the Solution You'd Like

Extend the Evaluator.evaluate_invocations interface to optionally receive the initial and final session state, so that custom evaluators can access them:

def evaluate_invocations(
    self,
    actual_invocations: list[Invocation],
    expected_invocations: Optional[list[Invocation]] = None,
    conversation_scenario: Optional[ConversationScenario] = None,
    # New parameters:
    initial_session_state: Optional[SessionState] = None,
    final_session_state: Optional[SessionState] = None,
) -> EvaluationResult:
    ...

Where:

  • initial_session_state is the state the session started with (from EvalCase.session_input.state)
  • final_session_state is the state after inference completed (from Session.state via EvalCaseResult.session_details)

LocalEvalService._evaluate_metric would forward these from data already available at evaluation time. This is backward-compatible since both new parameters default to None, and existing evaluators can simply ignore them.

With this change, developers could build custom Evaluator subclasses that validate session state — compare it against EvalCase.final_session_state (the expected golden state), check that specific keys were populated, verify structural constraints, or use an LLM-as-judge over the state contents.
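As a rough illustration of the simplest of these checks, the sketch below compares an actual final state against an expected one key by key. It does not import ADK: `SessionState` is assumed here to behave like a plain dict, and `StateCheckResult` and `compare_session_state` are hypothetical names, not part of the library. A real evaluator would call something like this from inside `evaluate_invocations` and wrap the outcome in ADK's own result type.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Stand-in for ADK's SessionState; assumed here to behave like a plain dict.
SessionState = dict[str, Any]


@dataclass
class StateCheckResult:
    """Hypothetical result type for this sketch; not ADK's EvaluationResult."""

    score: float
    missing_keys: list[str] = field(default_factory=list)
    mismatched_keys: list[str] = field(default_factory=list)


def compare_session_state(
    actual: Optional[SessionState],
    expected: Optional[SessionState],
) -> StateCheckResult:
    """Score the fraction of expected keys that appear in `actual` with equal values."""
    if not expected:
        # No golden state was provided, so there is nothing to fail on.
        return StateCheckResult(score=1.0)
    actual = actual or {}
    missing = sorted(k for k in expected if k not in actual)
    mismatched = sorted(
        k for k in expected if k in actual and actual[k] != expected[k]
    )
    matched = len(expected) - len(missing) - len(mismatched)
    return StateCheckResult(matched / len(expected), missing, mismatched)
```

Keys that exist only in the actual state are ignored in this sketch, which suits agents that also write scratch data to the session; a stricter evaluator could penalize them.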

Impact on your work

We are building multi-agent systems where agents collaboratively produce structured outputs stored in session state (e.g., timelines, assignments). The conversation is the means, but session state is the actual output artifact.

Currently, ADK's evaluation metrics can only evaluate text responses and tool calls, which represent a fraction of the agent's real output. The structured data we need to evaluate lives in session state and is invisible to the evaluator pipeline.

We have a workaround, but first-class support would let us leverage ADK's metric aggregation, reporting, and eval infrastructure for session state evaluation instead of maintaining a parallel post-processing pipeline.

Willingness to contribute

No


🟡 Recommended Information

Describe Alternatives You've Considered

We currently bypass the Evaluator interface entirely and run session-state-aware evaluation as a post-processing step after ADK's pipeline completes. We built a custom LLM judge class that receives EvalCaseResult, extracts conversation history from invocations, and additionally pulls structured data from session_details.state to include as context for the judge LLM.

Why this is suboptimal:

  • It runs outside the ADK evaluation pipeline, so results aren't integrated into ADK's per-invocation metric tracking, aggregation, or EvalMetricResult reporting.
  • It requires custom plumbing to collect EvalSetResult objects after cli_eval completes, run judges separately, and merge results into our own reporting format.
  • EvalCase.final_session_state exists in the data model but is completely unused — we can't leverage it for defining expected state in eval sets.
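Our actual judge class is not reproduced in this issue. As a minimal sketch of the idea only, the helper below shows how conversation text and session state might be combined into a single context string for a judge LLM. The function name, the `(role, text)` turn format, and the section headings are all assumptions made for illustration.

```python
import json
from typing import Any


def build_judge_context(turns: list[tuple[str, str]], state: dict[str, Any]) -> str:
    """Combine conversation turns and session state into one judge context string."""
    lines = ["## Conversation"]
    for role, text in turns:
        lines.append(f"{role}: {text}")
    lines.append("## Final session state")
    # sort_keys keeps the serialized state stable between runs; default=str
    # avoids failing on values (such as datetimes) that JSON cannot encode.
    lines.append(json.dumps(state, indent=2, sort_keys=True, default=str))
    return "\n".join(lines)
```

With first-class support, the same string could be built inside an evaluator from the forwarded state instead of from `EvalCaseResult` after the run.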

Proposed API / Implementation

1. Extend Evaluator.evaluate_invocations

# evaluator.py
class Evaluator(ABC):
    def evaluate_invocations(
        self,
        actual_invocations: list[Invocation],
        expected_invocations: Optional[list[Invocation]] = None,
        conversation_scenario: Optional[ConversationScenario] = None,
        initial_session_state: Optional[SessionState] = None,
        final_session_state: Optional[SessionState] = None,
    ) -> EvaluationResult:
        raise NotImplementedError()

2. Wire it through LocalEvalService._evaluate_metric

# local_eval_service.py — _evaluate_metric_for_eval_case
evaluation_result = await self._evaluate_metric(
    eval_metric=eval_metric,
    actual_invocations=inference_result.inferences,
    expected_invocations=eval_case.conversation,
    conversation_scenario=eval_case.conversation_scenario,
    initial_session_state=eval_case.session_input.state if eval_case.session_input else None,
    final_session_state=session.state,  # from session_service.get_session()
)

Both values are already available at evaluation time — session_input.state from the EvalCase, and Session.state from the session service (already fetched for EvalCaseResult.session_details).
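One detail worth considering for backward compatibility: a subclass that overrides `evaluate_invocations` with today's three-parameter signature would raise a `TypeError` if the service passed the new keyword arguments to it unconditionally. A possible way around this, sketched below with stand-in classes rather than real ADK types, is to inspect the evaluator's signature and forward only what it accepts. `call_evaluator`, `LegacyEvaluator`, and `StateAwareEvaluator` are hypothetical names for this example.

```python
import inspect
from typing import Any, Optional


class LegacyEvaluator:
    """Stand-in for an existing evaluator written against today's signature."""

    def evaluate_invocations(self, actual_invocations, expected_invocations=None,
                             conversation_scenario=None):
        return {"saw_state": False}


class StateAwareEvaluator:
    """Stand-in for an evaluator that opts in to the proposed parameters."""

    def evaluate_invocations(self, actual_invocations, expected_invocations=None,
                             conversation_scenario=None,
                             initial_session_state=None, final_session_state=None):
        return {"saw_state": final_session_state is not None}


def call_evaluator(evaluator: Any, actual_invocations: list,
                   *, initial_session_state: Optional[dict] = None,
                   final_session_state: Optional[dict] = None,
                   **kwargs: Any) -> Any:
    """Forward the session-state kwargs only if the evaluator's method accepts them."""
    params = inspect.signature(evaluator.evaluate_invocations).parameters
    accepts_var_kw = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    optional = {
        "initial_session_state": initial_session_state,
        "final_session_state": final_session_state,
    }
    for name, value in optional.items():
        if accepts_var_kw or name in params:
            kwargs[name] = value
    return evaluator.evaluate_invocations(actual_invocations, **kwargs)
```

Here `call_evaluator(LegacyEvaluator(), [], final_session_state={"k": 1})` runs without error because the state arguments are dropped, while the same call with `StateAwareEvaluator()` receives them.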

Additional Context

  • EvalCase.final_session_state (in eval_case.py) already exists as Optional[SessionState] but was never connected to the evaluation pipeline. With the proposed interface change, developers could compare the actual final_session_state against this golden/expected value in their custom evaluators.

Metadata

Labels

  • eval: [Component] This issue is related to evaluation
  • needs review: [Status] The PR/issue is awaiting review from the maintainer