
Support conversation-level evaluators that don't require per-invocation results #4533

@lucasbarzotto-axonify

Description


🔴 Required Information

Is your feature request related to a specific problem?

The evaluation pipeline enforces that every Evaluator must return exactly one PerInvocationResult for each invocation. In LocalEvalService._evaluate_metric_for_eval_case:

if (
    evaluation_result.overall_eval_status != EvalStatus.NOT_EVALUATED
    and len(evaluation_result.per_invocation_results)
    != len(eval_metric_result_per_invocation)
):
    raise ValueError(
        'Eval metric should return results for each invocation. Found '
        f'{len(evaluation_result.per_invocation_results)} results for '
        f'{len(eval_metric_result_per_invocation)} invocations.'
    )

This makes it impossible to build evaluators that assess the conversation or session holistically and produce a single overall score. For example:

  • "Did the agent handle adversarial user behavior appropriately across all turns?"
  • "Is the structured output the agent built over the course of the conversation well-formed?"
  • "Did the agent follow the correct multi-agent routing workflow end-to-end?"

These evaluations don't have meaningful per-turn scores — the judgment applies to the conversation as a whole. Currently, an evaluator must either produce artificial per-invocation scores (semantically wrong and misleading in reports) or bypass the Evaluator pipeline entirely.
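To illustrate the failure mode, the count check can be reproduced in isolation. This is a simplified, self-contained stand-in for the ADK validation above (the function name and parameters here are hypothetical, not ADK API):

```python
# Simplified stand-in for the count check in
# LocalEvalService._evaluate_metric_for_eval_case (names hypothetical).
def check_per_invocation_count(
    num_results: int, num_invocations: int, evaluated: bool = True
) -> None:
    """Raises if an evaluated metric didn't score every invocation."""
    if evaluated and num_results != num_invocations:
        raise ValueError(
            'Eval metric should return results for each invocation. Found '
            f'{num_results} results for {num_invocations} invocations.'
        )

# A conversation-level evaluator returns zero per-invocation results
# for a 5-turn conversation -- the check rejects it outright.
try:
    check_per_invocation_count(num_results=0, num_invocations=5)
    outcome = 'accepted'
except ValueError:
    outcome = 'rejected'
```

There is no count a holistic evaluator can return that is both honest (one judgment) and accepted (N results), which is the core of the problem.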

Additionally, there's no place to store conversation-level reasoning in the result models. RubricScore.rationale exists but is tied to individual rubrics. Neither EvaluationResult (what the evaluator returns) nor EvalMetricResultDetails (where results are stored) has a general-purpose reasoning field. For conversation-level evaluation, there's no way to persist the overall rationale (e.g., "The agent failed because it leaked its system prompt in turn 5 after a jailbreak attempt in turn 4").

Describe the Solution You'd Like

Two changes:

  1. Allow evaluators to return only an overall result without per-invocation results. If EvaluationResult.per_invocation_results is empty and overall_eval_status is not NOT_EVALUATED, the framework should record the overall metric result without raising a ValueError. The overall score/status would be stored in EvalCaseResult.overall_eval_metric_results as it already is, and per-invocation entries for that metric would simply be omitted.

  2. Add an optional reasoning field to the result chain. Conversation-level evaluators need a place to store their rationale alongside the score. This would require a field on both sides of the mapping:

    • On EvaluationResult (what the evaluator returns): so the evaluator can provide reasoning
    • On EvalMetricResultDetails (where results are persisted): so the reasoning is stored in EvalCaseResult

    The exact field names and design are up to the ADK team, but the gap is: there's currently no way to attach a free-text rationale to an overall metric result without defining rubrics.

Both changes are backward-compatible — existing per-invocation evaluators would behave exactly as they do today.

Impact on your work

We build multi-agent systems where several evaluation criteria are inherently conversation-level: adversarial handling, overall text quality, end-to-end workflow correctness, and structural validation of outputs assembled across multiple turns. These don't decompose into per-turn scores.

We currently run these evaluations outside the ADK pipeline as a post-processing step, which means we lose integration with ADK's metric aggregation, result persistence, and reporting. Having first-class support for conversation-level evaluators would let us consolidate our evaluation logic into the ADK framework.

There's also a practical cost dimension: when using LLM-as-judge for conversation-level criteria, the per-invocation model forces N separate LLM calls (one per turn) where a single call reviewing the full conversation would suffice. For a 15-turn conversation with 3 conversation-level metrics, that's 45 LLM judge calls instead of 3.

Willingness to contribute

No


🟡 Recommended Information

Describe Alternatives You've Considered

1. Producing artificial per-invocation results: An evaluator could duplicate its conversation-level score across all invocations to satisfy the validation. This is semantically misleading (the score isn't really per-turn) and inflates the number of LLM judge calls when using LLM-as-judge.
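A sketch of this workaround, using simplified stand-ins for ADK's result models (the real `PerInvocationResult` carries more fields; this is illustrative only):

```python
from dataclasses import dataclass, field

# Simplified stand-ins for ADK's result models (illustrative only).
@dataclass
class PerInvocationResult:
    score: float
    eval_status: str

@dataclass
class EvaluationResult:
    overall_score: float
    overall_eval_status: str
    per_invocation_results: list = field(default_factory=list)

def pad_to_per_invocation(
    score: float, status: str, num_invocations: int
) -> EvaluationResult:
    """Duplicate a single conversation-level score across every turn,
    purely to satisfy the per-invocation count check."""
    return EvaluationResult(
        overall_score=score,
        overall_eval_status=status,
        per_invocation_results=[
            PerInvocationResult(score=score, eval_status=status)
            for _ in range(num_invocations)
        ],
    )

# Every "per-turn" score is identical -- a report built from this
# implies 15 independent judgments that were never actually made.
result = pad_to_per_invocation(0.9, 'PASSED', num_invocations=15)
```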

2. Abusing the rubric system for reasoning: We could create a dummy rubric and store conversation-level reasoning in RubricScore.rationale. This works but misuses the rubric API, which is designed for evaluating specific testable criteria, not for general-purpose explanations.

3. Post-processing outside the pipeline (current approach): We run conversation-level judges on EvalCaseResult objects after ADK's evaluation completes, extracting conversation history from eval_metric_result_per_invocation. Results are then tracked in a separate reporting format. This works but means our most important evaluation criteria live outside ADK's metric tracking and reporting.

Proposed API / Implementation

1. Relax the validation in LocalEvalService._evaluate_metric_for_eval_case

# Only validate per-invocation count when per-invocation results are provided.
if (
    evaluation_result.overall_eval_status != EvalStatus.NOT_EVALUATED
    and evaluation_result.per_invocation_results  # Skip if empty
    and len(evaluation_result.per_invocation_results)
    != len(eval_metric_result_per_invocation)
):
    raise ValueError(...)

# Only distribute per-invocation scores when they exist.
if evaluation_result.per_invocation_results:
    for idx, invocation in enumerate(eval_metric_result_per_invocation):
        # ... existing per-invocation tracking logic ...

2. Add reasoning to the result chain

The exact field names are for the ADK team to decide, but conceptually:

# evaluator.py — what the evaluator returns
class EvaluationResult(BaseModel):
    overall_score: Optional[float] = None
    overall_eval_status: EvalStatus = EvalStatus.NOT_EVALUATED
    per_invocation_results: list[PerInvocationResult] = []
    overall_rubric_scores: Optional[list[RubricScore]] = None
    overall_reasoning: Optional[str] = None  # NEW

# eval_metrics.py — where results are persisted
class EvalMetricResultDetails(EvalBaseModel):
    rubric_scores: Optional[list[RubricScore]] = None
    reasoning: Optional[str] = None  # NEW

The pipeline in _evaluate_metric_for_eval_case would map overall_reasoning to reasoning, following the same pattern as overall_rubric_scores → rubric_scores.
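Conceptually, that mapping step might look like the following. This is a hedged sketch with simplified stand-in types, not the actual ADK models:

```python
from dataclasses import dataclass
from typing import Optional

# Simplified stand-ins; the real models live in evaluator.py and
# eval_metrics.py and carry additional fields.
@dataclass
class EvaluationResult:
    overall_reasoning: Optional[str] = None
    overall_rubric_scores: Optional[list] = None

@dataclass
class EvalMetricResultDetails:
    rubric_scores: Optional[list] = None
    reasoning: Optional[str] = None

def to_details(result: EvaluationResult) -> EvalMetricResultDetails:
    # Mirror the existing overall_rubric_scores -> rubric_scores mapping
    # for the proposed reasoning field.
    return EvalMetricResultDetails(
        rubric_scores=result.overall_rubric_scores,
        reasoning=result.overall_reasoning,
    )

details = to_details(
    EvaluationResult(overall_reasoning='Leaked system prompt in turn 5.')
)
```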

Example conversation-level evaluator

class ConversationQualityEvaluator(Evaluator):
    def evaluate_invocations(
        self,
        actual_invocations: list[Invocation],
        expected_invocations: Optional[list[Invocation]] = None,
        conversation_scenario: Optional[ConversationScenario] = None,
    ) -> EvaluationResult:
        full_conversation = [inv.final_response for inv in actual_invocations]
        score, reasoning = self._judge_full_conversation(full_conversation)

        return EvaluationResult(
            overall_score=score,
            overall_eval_status=EvalStatus.PASSED if score >= 0.8 else EvalStatus.FAILED,
            overall_reasoning=reasoning,
            # No per_invocation_results — this is a conversation-level metric.
        )

Additional Context

  • The change to LocalEvalService is very localized (~5 lines). The data models (EvaluationResult, EvalCaseResult) already structurally support conversation-level results since overall_eval_metric_results and eval_metric_result_per_invocation are tracked independently.
  • This complements the existing per-invocation evaluators — it doesn't replace them. Metrics like final_response_match_v2 and tool_trajectory_avg_score are naturally per-invocation, while conversation-level criteria are a different class of evaluation that the framework doesn't currently accommodate.

Metadata

Labels

eval [Component]: This issue is related to evaluation
needs review [Status]: The PR/issue is awaiting review from the maintainer
