🔴 Required Information
Is your feature request related to a specific problem?
The evaluation pipeline enforces that every `Evaluator` must return exactly one `PerInvocationResult` for each invocation. In `LocalEvalService._evaluate_metric_for_eval_case`:
```python
if (
    evaluation_result.overall_eval_status != EvalStatus.NOT_EVALUATED
    and len(evaluation_result.per_invocation_results)
    != len(eval_metric_result_per_invocation)
):
  raise ValueError(
      'Eval metric should return results for each invocation. Found '
      f'{len(evaluation_result.per_invocation_results)} results for '
      f'{len(eval_metric_result_per_invocation)} invocations.'
  )
```
This makes it impossible to build evaluators that assess the conversation or session holistically and produce a single overall score. For example:
- "Did the agent handle adversarial user behavior appropriately across all turns?"
- "Is the structured output the agent built over the course of the conversation well-formed?"
- "Did the agent follow the correct multi-agent routing workflow end-to-end?"
These evaluations don't have meaningful per-turn scores — the judgment applies to the conversation as a whole. Currently, an evaluator must either produce artificial per-invocation scores (semantically wrong and misleading in reports) or bypass the Evaluator pipeline entirely.
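To make the failure mode concrete, the check can be reproduced in isolation with simplified stand-ins for the ADK types (these are not the real classes, just enough structure to exercise the validation):

```python
from dataclasses import dataclass, field
from enum import Enum


class EvalStatus(Enum):  # simplified stand-in for the ADK enum
    NOT_EVALUATED = 0
    PASSED = 1
    FAILED = 2


@dataclass
class EvaluationResult:  # stand-in: only the fields the check inspects
    overall_eval_status: EvalStatus = EvalStatus.NOT_EVALUATED
    per_invocation_results: list = field(default_factory=list)


def validate(evaluation_result, eval_metric_result_per_invocation):
    # Mirrors the check in LocalEvalService._evaluate_metric_for_eval_case.
    if (
        evaluation_result.overall_eval_status != EvalStatus.NOT_EVALUATED
        and len(evaluation_result.per_invocation_results)
        != len(eval_metric_result_per_invocation)
    ):
        raise ValueError('Eval metric should return results for each invocation.')


# A conversation-level evaluator: one overall status, zero per-invocation entries.
overall_only = EvaluationResult(overall_eval_status=EvalStatus.PASSED)
try:
    validate(overall_only, eval_metric_result_per_invocation=[object(), object()])
except ValueError as e:
    print('rejected:', e)
```

Duplicating the overall score into artificial per-invocation entries would satisfy the count check, which is exactly the workaround this proposal aims to make unnecessary.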
Additionally, there's no place to store conversation-level reasoning in the result models. `RubricScore.rationale` exists but is tied to individual rubrics. Neither `EvaluationResult` (what the evaluator returns) nor `EvalMetricResultDetails` (where results are stored) has a general-purpose reasoning field. For conversation-level evaluation, there's no way to persist the overall rationale (e.g., "The agent failed because it leaked its system prompt in turn 5 after a jailbreak attempt in turn 4").
Describe the Solution You'd Like
Two changes:

1. **Allow evaluators to return only an overall result without per-invocation results.** If `EvaluationResult.per_invocation_results` is empty and `overall_eval_status` is not `NOT_EVALUATED`, the framework should record the overall metric result without raising a `ValueError`. The overall score/status would be stored in `EvalCaseResult.overall_eval_metric_results` as it already is, and per-invocation entries for that metric would simply be omitted.
2. **Add an optional reasoning field to the result chain.** Conversation-level evaluators need a place to store their rationale alongside the score. This would require a field on both sides of the mapping:
   - On `EvaluationResult` (what the evaluator returns), so the evaluator can provide reasoning.
   - On `EvalMetricResultDetails` (where results are persisted), so the reasoning is stored in `EvalCaseResult`.

The exact field names and design are up to the ADK team, but the gap is: there's currently no way to attach a free-text rationale to an overall metric result without defining rubrics.

Both changes are backward-compatible — existing per-invocation evaluators would behave exactly as they do today.
Impact on your work
We build multi-agent systems where several evaluation criteria are inherently conversation-level: adversarial handling, overall text quality, end-to-end workflow correctness, and structural validation of outputs assembled across multiple turns. These don't decompose into per-turn scores.
We currently run these evaluations outside the ADK pipeline as a post-processing step, which means we lose integration with ADK's metric aggregation, result persistence, and reporting. Having first-class support for conversation-level evaluators would let us consolidate our evaluation logic into the ADK framework.
There's also a practical cost dimension: when using LLM-as-judge for conversation-level criteria, the per-invocation model forces N separate LLM calls (one per turn) where a single call reviewing the full conversation would suffice. For a 15-turn conversation with 3 conversation-level metrics, that's 45 LLM judge calls instead of 3.
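The call-count arithmetic from that example, spelled out:

```python
turns = 15
conversation_level_metrics = 3

# Per-invocation model: one judge call per turn for each metric.
per_invocation_calls = turns * conversation_level_metrics

# Conversation-level model: a single judge call per metric.
conversation_level_calls = conversation_level_metrics

print(per_invocation_calls, conversation_level_calls)  # 45 3
```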
Willingness to contribute
No
🟡 Recommended Information
Describe Alternatives You've Considered
1. **Producing artificial per-invocation results:** An evaluator could duplicate its conversation-level score across all invocations to satisfy the validation. This is semantically misleading (the score isn't really per-turn) and inflates the number of LLM judge calls when using LLM-as-judge.
2. **Abusing the rubric system for reasoning:** We could create a dummy rubric and store conversation-level reasoning in `RubricScore.rationale`. This works but misuses the rubric API, which is designed for evaluating specific testable criteria, not for general-purpose explanations.
3. **Post-processing outside the pipeline (current approach):** We run conversation-level judges on `EvalCaseResult` objects after ADK's evaluation completes, extracting conversation history from `eval_metric_result_per_invocation`. Results are then tracked in a separate reporting format. This works but means our most important evaluation criteria live outside ADK's metric tracking and reporting.
Proposed API / Implementation
1. Relax the validation in `LocalEvalService._evaluate_metric_for_eval_case`
```python
# Only validate per-invocation count when per-invocation results are provided.
if (
    evaluation_result.overall_eval_status != EvalStatus.NOT_EVALUATED
    and evaluation_result.per_invocation_results  # Skip if empty.
    and len(evaluation_result.per_invocation_results)
    != len(eval_metric_result_per_invocation)
):
  raise ValueError(...)

# Only distribute per-invocation scores when they exist.
if evaluation_result.per_invocation_results:
  for idx, invocation in enumerate(eval_metric_result_per_invocation):
    # ... existing per-invocation tracking logic ...
```
2. Add reasoning to the result chain
The exact field names are for the ADK team to decide, but conceptually:
```python
# evaluator.py — what the evaluator returns
class EvaluationResult(BaseModel):
  overall_score: Optional[float] = None
  overall_eval_status: EvalStatus = EvalStatus.NOT_EVALUATED
  per_invocation_results: list[PerInvocationResult] = []
  overall_rubric_scores: Optional[list[RubricScore]] = None
  overall_reasoning: Optional[str] = None  # NEW
```

```python
# eval_metrics.py — where results are persisted
class EvalMetricResultDetails(EvalBaseModel):
  rubric_scores: Optional[list[RubricScore]] = None
  reasoning: Optional[str] = None  # NEW
```
The pipeline in `_evaluate_metric_for_eval_case` would map `overall_reasoning` to `reasoning`, following the same pattern as `overall_rubric_scores` → `rubric_scores`.
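A sketch of that mapping, using plain dataclass stand-ins rather than the real Pydantic models (only the `overall_reasoning` and `reasoning` field names come from the proposal; the helper name is illustrative):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvaluationResult:  # stand-in for the evaluator's return model
    overall_reasoning: Optional[str] = None


@dataclass
class EvalMetricResultDetails:  # stand-in for the persisted details model
    reasoning: Optional[str] = None


def to_details(result: EvaluationResult) -> EvalMetricResultDetails:
    # Proposed mapping: overall_reasoning -> reasoning, mirroring how
    # overall_rubric_scores maps to rubric_scores today.
    return EvalMetricResultDetails(reasoning=result.overall_reasoning)


details = to_details(
    EvaluationResult(overall_reasoning='Agent leaked its system prompt in turn 5.')
)
print(details.reasoning)  # Agent leaked its system prompt in turn 5.
```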
Example conversation-level evaluator
```python
class ConversationQualityEvaluator(Evaluator):

  def evaluate_invocations(
      self,
      actual_invocations: list[Invocation],
      expected_invocations: Optional[list[Invocation]] = None,
      conversation_scenario: Optional[ConversationScenario] = None,
  ) -> EvaluationResult:
    full_conversation = [inv.final_response for inv in actual_invocations]
    score, reasoning = self._judge_full_conversation(full_conversation)
    return EvaluationResult(
        overall_score=score,
        overall_eval_status=EvalStatus.PASSED if score >= 0.8 else EvalStatus.FAILED,
        overall_reasoning=reasoning,
        # No per_invocation_results — this is a conversation-level metric.
    )
```
Additional Context
- The change to `LocalEvalService` is very localized (~5 lines). The data models (`EvaluationResult`, `EvalCaseResult`) already structurally support conversation-level results, since `overall_eval_metric_results` and `eval_metric_result_per_invocation` are tracked independently.
- This complements the existing per-invocation evaluators — it doesn't replace them. Metrics like `final_response_match_v2` and `tool_trajectory_avg_score` are naturally per-invocation, while conversation-level criteria are a different class of evaluation that the framework doesn't currently accommodate.