tool_trajectory_avg_score returns 0.0 even when tool name and args match exactly #5410

@Maithreya-100

Summary

The TrajectoryEvaluator in ADK is returning a score of 0.0 for tool trajectory evaluation even when the expected and actual tool calls have identical name and args fields.

Expected Behavior

According to the trajectory_evaluator.py code, the comparison only checks name and args:

def _are_tool_calls_exact_match(self, actual_tool_calls, expected_tool_calls):
    for actual, expected in zip(actual_tool_calls, expected_tool_calls):
        if actual.name != expected.name or actual.args != expected.args:
            return False
    return True

When both name and args match exactly, the score should be 1.0, not 0.0.
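As a sanity check of that expectation, the quoted logic can be reproduced outside ADK. The sketch below uses a hypothetical `ToolCall` dataclass as a stand-in for ADK's tool-call objects (an assumption; the real type is not shown in this report) and a module-level copy of the comparison. With matching `name` and `args` it returns True even though the `id` values differ.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    """Hypothetical stand-in for ADK's tool-call object (not the real type)."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)
    id: Optional[str] = None


def are_tool_calls_exact_match(actual_tool_calls, expected_tool_calls):
    """Module-level copy of the quoted comparison: only name and args are checked."""
    for actual, expected in zip(actual_tool_calls, expected_tool_calls):
        if actual.name != expected.name or actual.args != expected.args:
            return False
    return True


expected = [ToolCall("execute_sql", {"project_id": "p", "query": "SELECT 1"})]
actual = [
    ToolCall("execute_sql", {"project_id": "p", "query": "SELECT 1"}, id="toolu_vrtx_01")
]

# The id differs between the two calls, but the comparison never inspects it.
print(are_tool_calls_exact_match(actual, expected))
```

If this stand-in passes while the real evaluation fails, the difference is more likely in how the calls are extracted or which comparison path is taken than in the name/args check itself.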

Actual Behavior

The tool_trajectory_avg_score metric returns 0.0 (FAILED) even when:

  • ✅ Tool name matches: execute_sql
  • ✅ Tool args are identical: {'project_id': '[GCP_PROJECT_ID]', 'query': '[SQL_QUERY]'}
  • ✅ All arguments match exactly

Steps to Reproduce

1. Create Test Case File

Create a test file, test_case.evalset.json (the name used in step 3), in the InvocationEvents format, containing a BigQuery SQL execution:

{
  "eval_set_id": "test_trajectory",
  "name": "test_trajectory",
  "eval_cases": [
    {
      "eval_id": "sql_execution_test",
      "conversation": [
        {
          "invocation_id": "test-invocation-001",
          "user_content": {
            "parts": [{"text": "Execute a query"}],
            "role": "user"
          },
          "final_response": {
            "parts": [{"text": "Query executed successfully with results..."}],
            "role": "model"
          },
          "intermediate_data": {
            "invocation_events": [
              {
                "author": "test_agent",
                "content": {
                  "parts": [
                    {"text": "Executing SQL query..."},
                    {
                      "function_call": {
                        "name": "execute_sql",
                        "args": {
                          "project_id": "[YOUR_GCP_PROJECT_ID]",
                          "query": "SELECT * FROM [your_table] WHERE [condition]"
                        }
                      }
                    }
                  ],
                  "role": "model"
                }
              },
              {
                "author": "test_agent",
                "content": {
                  "parts": [
                    {
                      "function_response": {
                        "name": "execute_sql",
                        "response": {
                          "status": "SUCCESS",
                          "rows": [{"sample": "data"}]
                        }
                      }
                    }
                  ],
                  "role": "user"
                }
              }
            ]
          }
        }
      ],
      "session_input": {
        "app_name": "test_agent",
        "user_id": "user"
      }
    }
  ]
}
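To confirm which tool calls this file actually encodes as "expected", the function_call parts can be pulled out of the raw JSON. This is a minimal sketch that walks the structure shown above as plain dictionaries; `extract_function_calls` is a hypothetical helper written for this report and does not use ADK's own loader, which may interpret the file differently.

```python
def extract_function_calls(eval_set: dict) -> list[dict]:
    """Collect every function_call part found under invocation_events."""
    calls = []
    for case in eval_set.get("eval_cases", []):
        for invocation in case.get("conversation", []):
            intermediate = invocation.get("intermediate_data") or {}
            for event in intermediate.get("invocation_events", []):
                content = event.get("content") or {}
                for part in content.get("parts", []):
                    if "function_call" in part:
                        calls.append(part["function_call"])
    return calls


# Trimmed-down copy of the eval set structure, with illustrative values.
sample = {
    "eval_cases": [{
        "eval_id": "sql_execution_test",
        "conversation": [{
            "invocation_id": "test-invocation-001",
            "intermediate_data": {
                "invocation_events": [{
                    "author": "test_agent",
                    "content": {
                        "parts": [
                            {"text": "Executing SQL query..."},
                            {"function_call": {
                                "name": "execute_sql",
                                "args": {"project_id": "p", "query": "SELECT 1"},
                            }},
                        ],
                        "role": "model",
                    },
                }]
            },
        }],
    }]
}

for call in extract_function_calls(sample):
    print(call["name"], call["args"])
```

With the real file, `json.load` can supply the `eval_set` dictionary.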

2. Create Config File

Create test_config.json:

{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 0.5,
      "match_type": "ANY_ORDER"
    },
    "response_match_score": 0.5
  }
}
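This config selects `match_type: ANY_ORDER`, whereas the function quoted in this report is named for exact matching, so a different code path may be involved. As a rough illustration only, assuming ANY_ORDER means "every expected call must appear among the actual calls, regardless of position" (an assumption; this report does not quote that code), an order-insensitive comparison could look like the sketch below. `matches_any_order` is a hypothetical helper, not an ADK function, and it works on plain dictionaries.

```python
def matches_any_order(actual_calls, expected_calls):
    """Return True if each expected call pairs with a distinct actual call."""
    remaining = list(actual_calls)
    for expected in expected_calls:
        for i, actual in enumerate(remaining):
            if actual["name"] == expected["name"] and actual["args"] == expected["args"]:
                # Consume the matched call so it cannot satisfy two expectations.
                del remaining[i]
                break
        else:
            # No remaining actual call matched this expected call.
            return False
    return True


expected_calls = [
    {"name": "execute_sql", "args": {"query": "SELECT 1"}},
    {"name": "list_tables", "args": {}},
]
actual_calls = [
    {"name": "list_tables", "args": {}},
    {"name": "execute_sql", "args": {"query": "SELECT 1"}},
]

print(matches_any_order(actual_calls, expected_calls))
```

Under this reading, a single matching call should also pass, so a score of 0.0 would still be unexpected.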

3. Run Evaluation

adk eval test_agent test_case.evalset.json --config_file_path=test_config.json --print_detailed_results

4. Observe Results

The evaluation will fail with:

Metric: tool_trajectory_avg_score, Status: FAILED, Score: 0.0, Threshold: 0.5

Expected tool calls:
id=None args={'project_id': '[GCP_PROJECT_ID]', 'query': '[SQL]'}

Actual tool calls:
id='toolu_vrtx_01...' args={'project_id': '[GCP_PROJECT_ID]', 'query': '[SQL]'}

Both args dictionaries are IDENTICAL, but the score is 0.0


Expected vs Actual Comparison

Expected Tool Call

{
  "id": null,
  "name": "execute_sql",
  "args": {
    "project_id": "[GCP_PROJECT_ID]",
    "query": "[SQL_QUERY_STRING]"
  }
}

Actual Tool Call

{
  "id": "[DYNAMICALLY_GENERATED_ID]",
  "name": "execute_sql",
  "args": {
    "project_id": "[GCP_PROJECT_ID]",
    "query": "[SQL_QUERY_STRING]"
  }
}

Comparison Result

| Field | Expected | Actual | Match |
|---|---|---|---|
| name | execute_sql | execute_sql | ✅ YES |
| args.project_id | [GCP_PROJECT_ID] | [GCP_PROJECT_ID] | ✅ YES |
| args.query | [SQL_QUERY_STRING] | [SQL_QUERY_STRING] | ✅ YES |
| Expected Result | Score: 1.0 | Score: 0.0 | ❌ MISMATCH |

Affected Code

The issue appears to be in:

  • File: google/adk/evaluation/trajectory_evaluator.py
  • Function: _are_tool_calls_exact_match() (Line ~180-190)
  • Function: _calculate_tool_use_accuracy() (Line ~130-150)

Code Review

The logic in _are_tool_calls_exact_match() appears correct:

def _are_tool_calls_exact_match(self, actual_tool_calls, expected_tool_calls):
    if len(actual_tool_calls) != len(expected_tool_calls):
        return False

    for actual, expected in zip(actual_tool_calls, expected_tool_calls):
        if actual.name != expected.name or actual.args != expected.args:
            return False

    return True

However, the comparison might fail due to:

  1. Type Mismatch: actual.args and expected.args might be different types
  2. Dictionary Comparison Issue: Args might contain nested objects that don't compare as equal
  3. Object Reference Issue: Comparing object references instead of values
  4. Serialization Issue: JSON deserialization creating different object types
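One way to narrow down these hypotheses is to diff the two args values field by field, reporting type differences as well as value differences. The helper below is a hypothetical diagnostic sketch (`describe_args_mismatch` is not part of ADK) and assumes the args can be read as plain dicts, lists, and scalars. Python's `==` on plain dicts compares by value rather than by reference, so hypothesis 3 would only apply if the args are wrapped in some other object type.

```python
def describe_args_mismatch(actual, expected, path="args"):
    """List differences between two args values, including type differences.

    Type differences are reported even where == would succeed (e.g. 1 vs 1.0).
    """
    diffs = []
    if type(actual) is not type(expected):
        diffs.append(f"{path}: type {type(actual).__name__} != {type(expected).__name__}")
    elif isinstance(actual, dict):
        for key in sorted(set(actual) | set(expected), key=str):
            if key not in actual:
                diffs.append(f"{path}.{key}: missing in actual")
            elif key not in expected:
                diffs.append(f"{path}.{key}: missing in expected")
            else:
                diffs.extend(
                    describe_args_mismatch(actual[key], expected[key], f"{path}.{key}")
                )
    elif isinstance(actual, (list, tuple)):
        if len(actual) != len(expected):
            diffs.append(f"{path}: length {len(actual)} != {len(expected)}")
        else:
            for i, (a, e) in enumerate(zip(actual, expected)):
                diffs.extend(describe_args_mismatch(a, e, f"{path}[{i}]"))
    elif actual != expected:
        diffs.append(f"{path}: {actual!r} != {expected!r}")
    return diffs


# Two separately built but equal dicts: no differences are reported.
print(describe_args_mismatch({"query": "SELECT 1"}, {"query": "SELECT 1"}))

# A value that looks identical when printed but differs in type.
print(describe_args_mismatch({"limit": 10}, {"limit": "10"}))
```

An empty result for the real expected and actual args would point away from the args comparison and toward another part of the scoring path.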

Environment Information

  • ADK Version: Latest
  • Python Version: 3.12+
  • OS: Windows/Linux/Mac
  • ADK Components: google-adk[eval]
  • Tool Used: BigQuery Toolset (but likely affects all tools)

Related Issues

This issue appears related to:


Additional Context

This bug blocks reliable tool trajectory testing in ADK evaluation. Even when tools are called correctly with identical arguments, the evaluation fails.

Note: This issue report contains placeholder values for sensitive information:

  • [GCP_PROJECT_ID] = your actual GCP project ID
  • [SQL_QUERY_STRING] = your actual SQL query
  • [YOUR_TABLE] = your actual table names

Metadata

Labels

eval: [Component] This issue is related to evaluation
