TypeError in LocalEvalSampler when metric evaluation fails #5403

@msteiner-google

Description

When running adk optimize, if a metric evaluation fails (e.g., due to a transient API error, rate limiting, or a JSONDecodeError from the LLM judge), the LocalEvalSampler crashes with a TypeError. This happens because the evaluation logic gracefully catches the exception and returns a result with a None score, but the sampler subsequently tries to round this None value.

Error Logs

    TypeError: type NoneType doesn't define __round__ method
   
    Traceback (most recent call last):
      ...
      File ".../google/adk/optimization/local_eval_sampler.py", line 362, in sample_and_score
        self._extract_eval_data(eval_set_id, eval_results)
      File ".../google/adk/optimization/local_eval_sampler.py", line 292, in _extract_eval_data
        "score": round(eval_metric_result.score, 2),  # accurate enough
    TypeError: type NoneType doesn't define __round__ method

Root Cause

In google/adk/evaluation/local_eval_service.py, the _evaluate_metric_for_eval_case method catches all exceptions during evaluation:

except Exception as e:
  logger.error(...)
  # We use an empty result.
  evaluation_result = EvaluationResult(
     overall_eval_status=EvalStatus.NOT_EVALUATED
  )

The EvaluationResult (and its nested PerInvocationResult) defaults its score field to None.
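In pseudo-form, the defaulting behavior looks like this (simplified stand-in dataclasses for illustration only, not the actual ADK Pydantic models; field names abbreviated):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EvalMetricResult:  # stand-in for the ADK metric result type
    metric_name: str = ""
    score: Optional[float] = None  # no default score: stays None on failure

@dataclass
class EvaluationResult:  # stand-in for the fallback built in the except-branch
    overall_eval_status: str = "NOT_EVALUATED"
    eval_metric_results: list = field(default_factory=list)

# The fallback result carries no scores at all, and any metric result
# constructed without an explicit score keeps score=None.
fallback = EvaluationResult()
print(fallback.overall_eval_status, EvalMetricResult().score)  # NOT_EVALUATED None
```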

In google/adk/optimization/local_eval_sampler.py, the _extract_eval_data method iterates through these results and attempts to round the score without checking whether it is None:

for eval_metric_result in per_invocation_result.eval_metric_results:
  eval_metric_results.append({
      "metric_name": eval_metric_result.metric_name,
      "score": round(eval_metric_result.score, 2),  # <--- CRASH HERE
      "eval_status": eval_metric_result.eval_status.name,
  })
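The crash reduces to a one-liner: Python's built-in round() has no handling for None, so it raises exactly the TypeError seen in the logs above.

```python
# round() dispatches to the argument's __round__ method,
# which NoneType does not define.
try:
    round(None, 2)
except TypeError as e:
    print(e)  # type NoneType doesn't define __round__ method
```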

Reproduction Steps

  1. Configure an agent for optimization using adk optimize.
  2. Include a metric that relies on an LLM judge (e.g., rubric_based_tool_use_quality_v1).
  3. Trigger a scenario where the judge evaluation fails (e.g., simulate a network error or a malformed judge response).
  4. The process will crash during data extraction instead of reporting a 0.0 score or skipping the failed case.

Proposed Fix

The sampler should handle None scores gracefully, either by defaulting them to 0.0 or skipping the rounding step for un-evaluated metrics.

# google/adk/optimization/local_eval_sampler.py
   
"score": round(eval_metric_result.score, 2) if eval_metric_result.score is not None else 0.0,
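The other option, skipping, may be preferable since defaulting to 0.0 conflates "failed to evaluate" with "scored zero" and could bias the optimizer. A minimal sketch of that variant, using a hypothetical stand-in type and helper rather than the actual ADK classes:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricResult:  # simplified stand-in for the ADK metric result type
    metric_name: str
    score: Optional[float]

def extract_scores(results: list) -> list:
    """Drop metrics that were never evaluated instead of coercing to 0.0."""
    extracted = []
    for r in results:
        if r.score is None:
            continue  # NOT_EVALUATED: nothing meaningful to round
        extracted.append({
            "metric_name": r.metric_name,
            "score": round(r.score, 2),
        })
    return extracted

print(extract_scores([MetricResult("m1", 0.9), MetricResult("m2", None)]))
# [{'metric_name': 'm1', 'score': 0.9}]
```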

Labels: eval ([Component] This issue is related to evaluation)