
Support conversation-level evaluators that don't require per-invocation results #4533

@lucasbarzotto-axonify

Description


🔴 Required Information

Is your feature request related to a specific problem?

The evaluation pipeline enforces that every Evaluator must return exactly one PerInvocationResult for each invocation. In LocalEvalService._evaluate_metric_for_eval_case:

if (
    evaluation_result.overall_eval_status != EvalStatus.NOT_EVALUATED
    and len(evaluation_result.per_invocation_results)
    != len(eval_metric_result_per_invocation)
):
    raise ValueError(
        'Eval metric should return results for each invocation. Found '
        f'{len(evaluation_result.per_invocation_results)} results for '
        f'{len(eval_metric_result_per_invocation)} invocations.'
    )

This makes it impossible to build evaluators that assess the conversation or session holistically and produce a single overall score. For example:

  • "Did the agent handle adversarial user behavior appropriately across all turns?"
  • "Is the structured output the agent built over the course of the conversation well-formed?"
  • "Did the agent follow the correct multi-agent routing workflow end-to-end?"

These evaluations don't have meaningful per-turn scores — the judgment applies to the conversation as a whole. Currently, an evaluator must either produce artificial per-invocation scores (semantically wrong and misleading in reports) or bypass the Evaluator pipeline entirely.
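To illustrate the failure mode, the count check can be reproduced in isolation. This is a simplified, self-contained stand-in for the ADK validation above (the function name and parameters here are hypothetical, not ADK API):

```python
# Simplified stand-in for the count check in
# LocalEvalService._evaluate_metric_for_eval_case (names hypothetical).
def check_per_invocation_count(
    num_results: int, num_invocations: int, evaluated: bool = True
) -> None:
    """Raises if an evaluated metric didn't score every invocation."""
    if evaluated and num_results != num_invocations:
        raise ValueError(
            'Eval metric should return results for each invocation. Found '
            f'{num_results} results for {num_invocations} invocations.'
        )

# A conversation-level evaluator returns zero per-invocation results
# for a 5-turn conversation -- the check rejects it outright.
try:
    check_per_invocation_count(num_results=0, num_invocations=5)
    outcome = 'accepted'
except ValueError:
    outcome = 'rejected'
```

There is no count a holistic evaluator can return that is both honest (one judgment) and accepted (N results), which is the core of the problem.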

Additionally, there's no place to store conversation-level reasoning in the result models. RubricScore.rationale exists but is tied to individual rubrics. Neither EvaluationResult (what the evaluator returns) nor EvalMetricResultDetails (where results are stored) has a general-purpose reasoning field. For conversation-level evaluation, there's no way to persist the overall rationale (e.g., "The agent failed because it leaked its system prompt in turn 5 after a jailbreak attempt in turn 4").

Describe the Solution You'd Like

Two changes:

  1. Allow evaluators to return only an overall result without per-invocation results. If EvaluationResult.per_invocation_results is empty and overall_eval_status is not NOT_EVALUATED, the framework should record the overall metric result without raising a ValueError. The overall score/status would be stored in EvalCaseResult.overall_eval_metric_results as it already is, and per-invocation entries for that metric would simply be omitted.

  2. Add an optional reasoning field to the result chain. Conversation-level evaluators need a place to store their rationale alongside the score. This would require a field on both sides of the mapping:

    • On EvaluationResult (what the evaluator returns): so the evaluator can provide reasoning
    • On EvalMetricResultDetails (where results are persisted): so the reasoning is stored in EvalCaseResult

    The exact field names and design are up to the ADK team, but the gap is: there's currently no way to attach a free-text rationale to an overall metric result without defining rubrics.

Both changes are backward-compatible — existing per-invocation evaluators would behave exactly as they do today.

Impact on your work

We build multi-agent systems where several evaluation criteria are inherently conversation-level: adversarial handling, overall text quality, end-to-end workflow correctness, and structural validation of outputs assembled across multiple turns. These don't decompose into per-turn scores.

We currently run these evaluations outside the ADK pipeline as a post-processing step, which means we lose integration with ADK's metric aggregation, result persistence, and reporting. Having first-class support for conversation-level evaluators would let us consolidate our evaluation logic into the ADK framework.

There's also a practical cost dimension: when using LLM-as-judge for conversation-level criteria, the per-invocation model forces N separate LLM calls (one per turn) where a single call reviewing the full conversation would suffice. For a 15-turn conversation with 3 conversation-level metrics, that's 45 LLM judge calls instead of 3.

Willingness to contribute

No


🟡 Recommended Information

Describe Alternatives You've Considered

1. Producing artificial per-invocation results: An evaluator could duplicate its conversation-level score across all invocations to satisfy the validation. This is semantically misleading (the score isn't really per-turn) and inflates the number of LLM judge calls when using LLM-as-judge.
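A sketch of this workaround, using simplified stand-ins for ADK's result models (the real `PerInvocationResult` carries more fields; this is illustrative only):

```python
from dataclasses import dataclass, field

# Simplified stand-ins for ADK's result models (illustrative only).
@dataclass
class PerInvocationResult:
    score: float
    eval_status: str

@dataclass
class EvaluationResult:
    overall_score: float
    overall_eval_status: str
    per_invocation_results: list = field(default_factory=list)

def pad_to_per_invocation(
    score: float, status: str, num_invocations: int
) -> EvaluationResult:
    """Duplicate a single conversation-level score across every turn,
    purely to satisfy the per-invocation count check."""
    return EvaluationResult(
        overall_score=score,
        overall_eval_status=status,
        per_invocation_results=[
            PerInvocationResult(score=score, eval_status=status)
            for _ in range(num_invocations)
        ],
    )

# Every "per-turn" score is identical -- a report built from this
# implies 15 independent judgments that were never actually made.
result = pad_to_per_invocation(0.9, 'PASSED', num_invocations=15)
```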

2. Abusing the rubric system for reasoning: We could create a dummy rubric and store conversation-level reasoning in RubricScore.rationale. This works but misuses the rubric API, which is designed for evaluating specific testable criteria, not for general-purpose explanations.

3. Post-processing outside the pipeline (current approach): We run conversation-level judges on EvalCaseResult objects after ADK's evaluation completes, extracting conversation history from eval_metric_result_per_invocation. Results are then tracked in a separate reporting format. This works but means our most important evaluation criteria live outside ADK's metric tracking and reporting.

Proposed API / Implementation

1. Relax the validation in LocalEvalService._evaluate_metric_for_eval_case

# Only validate per-invocation count when per-invocation results are provided.
if (
    evaluation_result.overall_eval_status != EvalStatus.NOT_EVALUATED
    and evaluation_result.per_invocation_results  # Skip if empty
    and len(evaluation_result.per_invocation_results)
    != len(eval_metric_result_per_invocation)
):
    raise ValueError(...)

# Only distribute per-invocation scores when they exist.
if evaluation_result.per_invocation_results:
    for idx, invocation in enumerate(eval_metric_result_per_invocation):
        # ... existing per-invocation tracking logic ...

2. Add reasoning to the result chain

The exact field names are for the ADK team to decide, but conceptually:

# evaluator.py — what the evaluator returns
class EvaluationResult(BaseModel):
    overall_score: Optional[float] = None
    overall_eval_status: EvalStatus = EvalStatus.NOT_EVALUATED
    per_invocation_results: list[PerInvocationResult] = []
    overall_rubric_scores: Optional[list[RubricScore]] = None
    overall_reasoning: Optional[str] = None  # NEW

# eval_metrics.py — where results are persisted
class EvalMetricResultDetails(EvalBaseModel):
    rubric_scores: Optional[list[RubricScore]] = None
    reasoning: Optional[str] = None  # NEW

The pipeline in _evaluate_metric_for_eval_case would map overall_reasoning to reasoning, following the same pattern as overall_rubric_scores → rubric_scores.
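Conceptually, that mapping step might look like the following. This is a hedged sketch with simplified stand-in types, not the actual ADK models:

```python
from dataclasses import dataclass
from typing import Optional

# Simplified stand-ins; the real models live in evaluator.py and
# eval_metrics.py and carry additional fields.
@dataclass
class EvaluationResult:
    overall_reasoning: Optional[str] = None
    overall_rubric_scores: Optional[list] = None

@dataclass
class EvalMetricResultDetails:
    rubric_scores: Optional[list] = None
    reasoning: Optional[str] = None

def to_details(result: EvaluationResult) -> EvalMetricResultDetails:
    # Mirror the existing overall_rubric_scores -> rubric_scores mapping
    # for the proposed reasoning field.
    return EvalMetricResultDetails(
        rubric_scores=result.overall_rubric_scores,
        reasoning=result.overall_reasoning,
    )

details = to_details(
    EvaluationResult(overall_reasoning='Leaked system prompt in turn 5.')
)
```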

Example conversation-level evaluator

class ConversationQualityEvaluator(Evaluator):
    def evaluate_invocations(
        self,
        actual_invocations: list[Invocation],
        expected_invocations: Optional[list[Invocation]] = None,
        conversation_scenario: Optional[ConversationScenario] = None,
    ) -> EvaluationResult:
        full_conversation = [inv.final_response for inv in actual_invocations]
        score, reasoning = self._judge_full_conversation(full_conversation)

        return EvaluationResult(
            overall_score=score,
            overall_eval_status=EvalStatus.PASSED if score >= 0.8 else EvalStatus.FAILED,
            overall_reasoning=reasoning,
            # No per_invocation_results — this is a conversation-level metric.
        )

Additional Context

  • The change to LocalEvalService is very localized (~5 lines). The data models (EvaluationResult, EvalCaseResult) already structurally support conversation-level results since overall_eval_metric_results and eval_metric_result_per_invocation are tracked independently.
  • This complements the existing per-invocation evaluators — it doesn't replace them. Metrics like final_response_match_v2 and tool_trajectory_avg_score are naturally per-invocation, while conversation-level criteria are a different class of evaluation that the framework doesn't currently accommodate.

Metadata

Labels

eval [Component]: This issue is related to evaluation
needs review [Status]: The PR/issue is awaiting review from the maintainer
