Evaluator interface lacks access to session state

## 🔴 Required Information

### Is your feature request related to a specific problem?

The `Evaluator.evaluate_invocations` interface does not receive session state, making it impossible to build custom evaluators that assess the state an agent produces. The method signature only accepts `actual_invocations`, `expected_invocations`, and `conversation_scenario`:

```python
# google.adk.evaluation.evaluator.Evaluator
def evaluate_invocations(
    self,
    actual_invocations: list[Invocation],
    expected_invocations: Optional[list[Invocation]] = None,
    conversation_scenario: Optional[ConversationScenario] = None,
) -> EvaluationResult:
```

The `Invocation` model itself contains `user_content`, `final_response`, `intermediate_data`, `rubrics`, and `app_details` — but no session state.

Meanwhile, the data model already anticipates session state evaluation:
- `EvalCase.final_session_state: Optional[SessionState]` exists with the docstring *"The expected final session state at the end of the conversation"*
- `EvalCaseResult.session_details: Optional[Session]` is populated by `LocalEvalService` with the full session (including `Session.state`)

The plumbing exists, but the evaluation pipeline never wires session state through to evaluators — meaning `EvalCase.final_session_state` is defined but unused.

### Describe the Solution You'd Like

Extend the `Evaluator.evaluate_invocations` interface to optionally receive the initial and final session state, so that custom evaluators can access them:

```python
def evaluate_invocations(
    self,
    actual_invocations: list[Invocation],
    expected_invocations: Optional[list[Invocation]] = None,
    conversation_scenario: Optional[ConversationScenario] = None,
    # New parameters:
    initial_session_state: Optional[SessionState] = None,
    final_session_state: Optional[SessionState] = None,
) -> EvaluationResult:
```

Where:
- `initial_session_state` is the state the session started with (from `EvalCase.session_input.state`)
- `final_session_state` is the state after inference completed (from `Session.state` via `EvalCaseResult.session_details`)

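`LocalEvalService._evaluate_metric` would forward these from data already available at evaluation time. Because both new parameters default to `None`, existing callers of `evaluate_invocations` are unaffected. One caveat: an existing `Evaluator` subclass that overrides the method with the current three-parameter signature would raise `TypeError` if the service passed the new keyword arguments unconditionally, so the service would need to pass them only to evaluators whose signature accepts them (or those subclasses would need to add the parameters).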

With this change, developers could build custom `Evaluator` subclasses that validate session state — compare it against `EvalCase.final_session_state` (the expected golden state), check that specific keys were populated, verify structural constraints, or use an LLM-as-judge over the state contents.
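
As a rough illustration of the kind of evaluator this would enable, here is a minimal sketch that scores how much of an expected (golden) state appears in the final state. It does not import ADK: `SessionState` is assumed to be a plain dict, `StateCheckResult` is a hypothetical stand-in for `EvaluationResult`, and passing the expected state through the constructor is an assumption, since the proposed signature only carries the actual states.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Stand-in for ADK's SessionState; assumed here to be a plain dict.
SessionState = dict[str, Any]


@dataclass
class StateCheckResult:
    """Hypothetical result type; ADK's EvaluationResult has a different shape."""

    score: float
    missing_keys: list[str] = field(default_factory=list)
    mismatched_keys: list[str] = field(default_factory=list)


class SessionStateMatchEvaluator:
    """Scores the fraction of expected keys whose values match the final state."""

    def __init__(self, expected_state: SessionState):
        # Assumption: the golden state is supplied by whoever builds the evaluator.
        self._expected_state = expected_state

    def evaluate_invocations(
        self,
        actual_invocations: list[Any],
        expected_invocations: Optional[list[Any]] = None,
        conversation_scenario: Optional[Any] = None,
        initial_session_state: Optional[SessionState] = None,
        final_session_state: Optional[SessionState] = None,
    ) -> StateCheckResult:
        actual = final_session_state or {}
        missing = [k for k in self._expected_state if k not in actual]
        mismatched = [
            k
            for k, v in self._expected_state.items()
            if k in actual and actual[k] != v
        ]
        total = len(self._expected_state)
        matched = total - len(missing) - len(mismatched)
        # An empty expectation is treated as trivially satisfied.
        score = matched / total if total else 1.0
        return StateCheckResult(score, missing, mismatched)


evaluator = SessionStateMatchEvaluator({"timeline": ["kickoff"], "owner": "alice"})
result = evaluator.evaluate_invocations(
    actual_invocations=[],
    final_session_state={"timeline": ["kickoff"], "owner": "bob"},
)
print(result.score, result.mismatched_keys)
```

Exact equality per key is the simplest possible check; structural validation or an LLM judge over the state would slot into the same method.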

### Impact on your work

We are building multi-agent systems where agents collaboratively produce structured outputs stored in session state (e.g., timelines, assignments). The conversation is the means, but session state is the actual output artifact.

Currently, ADK's evaluation metrics can only evaluate text responses and tool calls, which represent a fraction of the agent's real output. The structured data we need to evaluate lives in session state and is invisible to the evaluator pipeline.

We have a workaround, but first-class support would let us leverage ADK's metric aggregation, reporting, and eval infrastructure for session state evaluation instead of maintaining a parallel post-processing pipeline.

### Willingness to contribute

No

---

## 🟡 Recommended Information

### Describe Alternatives You've Considered

We currently bypass the `Evaluator` interface entirely and run session-state-aware evaluation as a **post-processing step** after ADK's pipeline completes. We built a custom LLM judge class that receives `EvalCaseResult`, extracts conversation history from invocations, and additionally pulls structured data from `session_details.state` to include as context for the judge LLM.
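
To make the shape of that post-processing step concrete, the sketch below shows one way conversation history and session state could be combined into context for a judge LLM. It is not our actual judge class: `build_judge_context` is a hypothetical helper, the inputs are assumed to be already extracted from `EvalCaseResult` as `(role, text)` pairs and a plain dict, and the prompt layout is arbitrary.

```python
import json
from typing import Any


def build_judge_context(
    conversation: list[tuple[str, str]],
    session_state: dict[str, Any],
) -> str:
    """Combines conversation turns and session state into one prompt section."""
    turns = "\n".join(f"{role}: {text}" for role, text in conversation)
    # sort_keys gives a stable rendering; default=str covers non-JSON values.
    state_json = json.dumps(session_state, indent=2, sort_keys=True, default=str)
    return (
        "## Conversation\n"
        f"{turns}\n\n"
        "## Final session state\n"
        f"{state_json}"
    )


context = build_judge_context(
    [("user", "Plan the launch"), ("agent", "Timeline drafted.")],
    {"timeline": ["kickoff", "review"]},
)
print(context)
```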

**Why this is suboptimal:**
- It runs *outside* the ADK evaluation pipeline, so results aren't integrated into ADK's per-invocation metric tracking, aggregation, or `EvalMetricResult` reporting.
- It requires custom plumbing to collect `EvalSetResult` objects after `cli_eval` completes, run judges separately, and merge results into our own reporting format.
- `EvalCase.final_session_state` exists in the data model but is completely unused — we can't leverage it for defining expected state in eval sets.

### Proposed API / Implementation

#### 1. Extend `Evaluator.evaluate_invocations`

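```python
# evaluator.py
from abc import ABC
from typing import Optional

# Invocation, ConversationScenario, SessionState, and EvaluationResult are the
# existing ADK evaluation types; their imports are omitted here.


class Evaluator(ABC):
    def evaluate_invocations(
        self,
        actual_invocations: list[Invocation],
        expected_invocations: Optional[list[Invocation]] = None,
        conversation_scenario: Optional[ConversationScenario] = None,
        # New, optional parameters:
        initial_session_state: Optional[SessionState] = None,
        final_session_state: Optional[SessionState] = None,
    ) -> EvaluationResult:
        raise NotImplementedError()
```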

#### 2. Wire it through `LocalEvalService._evaluate_metric`

```python
# local_eval_service.py — _evaluate_metric_for_eval_case
evaluation_result = await self._evaluate_metric(
    eval_metric=eval_metric,
    actual_invocations=inference_result.inferences,
    expected_invocations=eval_case.conversation,
    conversation_scenario=eval_case.conversation_scenario,
    initial_session_state=eval_case.session_input.state if eval_case.session_input else None,
    final_session_state=session.state,  # from session_service.get_session()
)
```

Both values are already available at evaluation time — `session_input.state` from the `EvalCase`, and `Session.state` from the session service (already fetched for `EvalCaseResult.session_details`).
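
One wiring detail to consider: an evaluator that overrides `evaluate_invocations` with the current three-parameter signature would raise `TypeError` if it received the new keyword arguments. A possible approach, sketched below under the assumption that the service calls evaluators with keyword arguments, is to forward only the arguments a given evaluator's signature accepts. `call_with_supported_kwargs` and the two evaluator classes are hypothetical names for illustration, not ADK APIs.

```python
import inspect
from typing import Any, Callable


def call_with_supported_kwargs(func: Callable[..., Any], **kwargs: Any) -> Any:
    """Calls func with only the keyword arguments its signature accepts."""
    params = inspect.signature(func).parameters
    accepts_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_var_kwargs:
        # func declares **kwargs, so everything can be passed through.
        return func(**kwargs)
    supported = {k: v for k, v in kwargs.items() if k in params}
    return func(**supported)


class LegacyEvaluator:
    """Overrides the method with the current three-parameter signature."""

    def evaluate_invocations(
        self, actual_invocations, expected_invocations=None, conversation_scenario=None
    ):
        return "legacy"


class StateAwareEvaluator:
    """Accepts the proposed session-state parameters."""

    def evaluate_invocations(
        self,
        actual_invocations,
        expected_invocations=None,
        conversation_scenario=None,
        initial_session_state=None,
        final_session_state=None,
    ):
        return final_session_state


call_kwargs = dict(
    actual_invocations=[],
    expected_invocations=None,
    conversation_scenario=None,
    initial_session_state={},
    final_session_state={"done": True},
)
legacy_result = call_with_supported_kwargs(
    LegacyEvaluator().evaluate_invocations, **call_kwargs
)
aware_result = call_with_supported_kwargs(
    StateAwareEvaluator().evaluate_invocations, **call_kwargs
)
print(legacy_result, aware_result)
```

The trade-off is a small amount of signature inspection per call in exchange for not breaking evaluators written against the current interface.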

### Additional Context

- `EvalCase.final_session_state` (in `eval_case.py`) already exists as `Optional[SessionState]` but was never connected to the evaluation pipeline. With the proposed interface change, developers could compare the actual `final_session_state` against this golden/expected value in their custom evaluators.
