## Summary
`rubric_based_final_response_quality_v1` sometimes returns `NOT_EVALUATED` (`eval_status: 3`) with `score: null` and `details.rubric_scores: null`, even though the ADK eval result contains a non-empty `actual_invocation.final_response.parts[0].text`.
In the same invocation, `rubric_based_tool_use_quality_v1` evaluates normally and returns rubric scores/rationales, so the eval run itself and the LLM-as-judge path are at least partially working.
## Environment
- `google-adk`: 1.32.0
- `litellm`: 1.82.6
- `google-genai`: 1.73.1
- Python: 3.13.5
- Eval command shape:

```bash
adk eval src tests/eval/evalsets/job_scraper_core.json:itviec_immediate_goal_before_producer_scripting \
  --config_file_path tests/eval/eval_config_goal_contract.json \
  --log_level ERROR
```
## Observed result excerpt
The eval case final status was passed:
```json
{
  "eval_set_result_id": "src_job_scraper_core_1778989037.305544",
  "final_eval_status": 1
}
```
The actual invocation has a final response:
```json
{
  "actual_invocation": {
    "final_response": {
      "role": "model",
      "parts": [
        {
          "text": "Extracted job count: 20\n\nArtifacts:\n- page workspace fixture: `page_2554f031654b` / `pages__page_2554f031654b__page.html`\n- sandbox audit: `sandbox_run_20260517_033632_dfda78d0`\n\nWhat happened:\n- Loaded the fixed ITviec AI Engineer Hanoi fixture.\n- Established the repeated job-card boundary with a bounded probe.\n- Confirmed 20 repeated listing cards and canonical job URL data in the fixture.\n- I have not yet written the required output files or finalized persistence.\n\nBlocker:\n- Extraction output generation is still pending."
        }
      ]
    }
  }
}
```
The tool-use rubric evaluated successfully:
```json
{
  "metric_name": "rubric_based_tool_use_quality_v1",
  "score": 1.0,
  "eval_status": 1,
  "details": {
    "rubric_scores": [
      {
        "rubric_id": "uses_fixture_artifact",
        "score": 1.0,
        "rationale": "... Trace anchors: invocation_events[5], invocation_events[13], final_response."
      }
    ]
  }
}
```
But the final-response rubric did not evaluate:
```json
{
  "metric_name": "rubric_based_final_response_quality_v1",
  "threshold": 0.85,
  "score": null,
  "eval_status": 3,
  "details": {
    "rubric_scores": null
  }
}
```
## Expected behavior
If `actual_invocation.final_response` contains text, `rubric_based_final_response_quality_v1` should either:
- evaluate and return per-rubric scores/rationales, or
- include diagnostic detail explaining why it was not evaluated.
## Actual behavior
The metric returns `NOT_EVALUATED` with no explanation in `details`, making it difficult to distinguish between:
- final response extraction issue,
- judge invocation/parsing issue,
- unsupported invocation shape,
- timeout/provider issue,
- or intentional evaluator skip.
## Why this matters
For agent workflow evals, the final response quality metric is useful as an LLM-as-judge layer, but `NOT_EVALUATED` without diagnostic detail makes it hard to use as a reliable CI/development signal. The trace contains enough final-response text for a judge pass, so the skip is surprising.
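As a stopgap, we gate CI with a small post-processing check instead of trusting the metric status alone. The sketch below is not ADK API usage: it only walks the eval result JSON written by `adk eval` and relies on the `metric_name`/`eval_status` fields visible in the excerpts above; the result-file argument, the metric allowlist, and the schema-agnostic recursive walk are our own choices.

```python
import json
import sys

# Metrics we expect the judge to actually score; a silent NOT_EVALUATED
# (eval_status == 3) on any of them should fail the CI step.
EXPECTED_METRICS = {
    "rubric_based_final_response_quality_v1",
    "rubric_based_tool_use_quality_v1",
}
NOT_EVALUATED = 3


def find_metric_entries(node):
    """Recursively yield dicts that look like per-metric results (contain 'metric_name')."""
    if isinstance(node, dict):
        if "metric_name" in node:
            yield node
        for value in node.values():
            yield from find_metric_entries(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_metric_entries(item)


def main(result_path: str) -> int:
    with open(result_path) as fh:
        result = json.load(fh)

    skipped = [
        entry for entry in find_metric_entries(result)
        if entry.get("metric_name") in EXPECTED_METRICS
        and entry.get("eval_status") == NOT_EVALUATED
    ]
    for entry in skipped:
        print(f"NOT_EVALUATED: {entry['metric_name']} (details={entry.get('details')})")
    return 1 if skipped else 0


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: check_eval_metrics.py <eval_result.json>")
    sys.exit(main(sys.argv[1]))
```

This catches the silent skip in CI, but it cannot say why the judge skipped, which is exactly the diagnostic gap described above.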
## Request
Could ADK either:
- make `rubric_based_final_response_quality_v1` evaluate whenever `actual_invocation.final_response` has text, or
- add diagnostic detail to the eval result explaining why the final-response judge returned `NOT_EVALUATED`?

A debug field such as `not_evaluated_reason`, or judge parsing/provider error details, would be very helpful.
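Purely as an illustration of the requested shape (`not_evaluated_reason` and `error_message` are suggested fields and the reason string is made up, not something ADK emits today), the skipped metric entry above would be straightforward to act on if it looked roughly like:

```json
{
  "metric_name": "rubric_based_final_response_quality_v1",
  "threshold": 0.85,
  "score": null,
  "eval_status": 3,
  "details": {
    "rubric_scores": null,
    "not_evaluated_reason": "judge_response_parse_error",
    "error_message": "<underlying judge/provider error, if any>"
  }
}
```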