Summary
The `TrajectoryEvaluator` in ADK returns a score of 0.0 for tool trajectory evaluation even when the expected and actual tool calls have identical `name` and `args` fields.
Expected Behavior
According to the `trajectory_evaluator.py` code, the comparison only checks `name` and `args`:

```python
def _are_tool_calls_exact_match(self, actual_tool_calls, expected_tool_calls):
    for actual, expected in zip(actual_tool_calls, expected_tool_calls):
        if actual.name != expected.name or actual.args != expected.args:
            return False
    return True
```

When both `name` and `args` match exactly, the score should be 1.0, not 0.0.
Actual Behavior
The `tool_trajectory_avg_score` metric returns 0.0 (FAILED) even when:
- ✅ Tool name matches: `execute_sql`
- ✅ Tool args are identical: `{'project_id': '[GCP_PROJECT_ID]', 'query': '[SQL_QUERY]'}`
- ✅ All arguments match exactly
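As a sanity check on the claim above: Python compares dicts by value, not by identity, so two independently constructed args dicts with the same keys and values are equal. The values here are the report's placeholders, not real data:

```python
# Python dict equality is recursive value equality, so two separately
# built args dicts with identical contents compare equal.
# Placeholder strings stand in for the real project ID and SQL query.
expected_args = {"project_id": "[GCP_PROJECT_ID]", "query": "[SQL_QUERY]"}
actual_args = {"project_id": "[GCP_PROJECT_ID]", "query": "[SQL_QUERY]"}

print(expected_args == actual_args)  # True: value equality holds
print(expected_args is actual_args)  # False: identity differs, and should not matter
```

So the 0.0 score cannot be explained by the two dicts being distinct objects.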
Steps to Reproduce
1. Create Test Case File
Create a test file with `InvocationEvents` format containing a BigQuery SQL execution:

```json
{
  "eval_set_id": "test_trajectory",
  "name": "test_trajectory",
  "eval_cases": [
    {
      "eval_id": "sql_execution_test",
      "conversation": [
        {
          "invocation_id": "test-invocation-001",
          "user_content": {
            "parts": [{"text": "Execute a query"}],
            "role": "user"
          },
          "final_response": {
            "parts": [{"text": "Query executed successfully with results..."}],
            "role": "model"
          },
          "intermediate_data": {
            "invocation_events": [
              {
                "author": "test_agent",
                "content": {
                  "parts": [
                    {"text": "Executing SQL query..."},
                    {
                      "function_call": {
                        "name": "execute_sql",
                        "args": {
                          "project_id": "[YOUR_GCP_PROJECT_ID]",
                          "query": "SELECT * FROM [your_table] WHERE [condition]"
                        }
                      }
                    }
                  ],
                  "role": "model"
                }
              },
              {
                "author": "test_agent",
                "content": {
                  "parts": [
                    {
                      "function_response": {
                        "name": "execute_sql",
                        "response": {
                          "status": "SUCCESS",
                          "rows": [{"sample": "data"}]
                        }
                      }
                    }
                  ],
                  "role": "user"
                }
              }
            ]
          }
        }
      ],
      "session_input": {
        "app_name": "test_agent",
        "user_id": "user"
      }
    }
  ]
}
```
2. Create Config File
Create `test_config.json`:

```json
{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 0.5,
      "match_type": "ANY_ORDER"
    },
    "response_match_score": 0.5
  }
}
```
3. Run Evaluation
```shell
adk eval test_agent test_case.evalset.json --config_file_path=test_config.json --print_detailed_results
```
4. Observe Results
The evaluation will fail with:

```
Metric: tool_trajectory_avg_score, Status: FAILED, Score: 0.0, Threshold: 0.5
Expected tool calls:
  id=None args={'project_id': '[GCP_PROJECT_ID]', 'query': '[SQL]'}
Actual tool calls:
  id='toolu_vrtx_01...' args={'project_id': '[GCP_PROJECT_ID]', 'query': '[SQL]'}
```
Both `args` dictionaries are identical, yet the score is 0.0. The only visible difference is the `id` field, which is `None` on the expected call and dynamically generated on the actual one.
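One hypothesis consistent with that output: if the evaluator anywhere compares the whole tool-call objects (or their serialized forms) rather than only `name` and `args`, the dynamically generated `id` alone makes them unequal. A minimal sketch, using a hypothetical `ToolCall` dataclass (field names assumed, not ADK's actual type):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ToolCall:
    # Hypothetical stand-in for ADK's tool-call object; field names assumed.
    name: str
    args: dict[str, Any]
    id: Optional[str] = None

expected = ToolCall("execute_sql", {"project_id": "[GCP_PROJECT_ID]", "query": "[SQL]"})
actual = ToolCall("execute_sql", {"project_id": "[GCP_PROJECT_ID]", "query": "[SQL]"},
                  id="toolu_vrtx_01...")

# Whole-object equality fails solely because of the generated id.
print(expected == actual)  # False
# Comparing only the documented fields succeeds.
print(expected.name == actual.name and expected.args == actual.args)  # True
```

If the real code path serializes or compares full objects anywhere before reaching `_are_tool_calls_exact_match()`, this would produce exactly the observed mismatch.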
Expected vs Actual Comparison
Expected Tool Call
```python
{
    "id": None,
    "name": "execute_sql",
    "args": {
        "project_id": "[GCP_PROJECT_ID]",
        "query": "[SQL_QUERY_STRING]"
    }
}
```
Actual Tool Call
```python
{
    "id": "[DYNAMICALLY_GENERATED_ID]",
    "name": "execute_sql",
    "args": {
        "project_id": "[GCP_PROJECT_ID]",
        "query": "[SQL_QUERY_STRING]"
    }
}
```
Comparison Result
| Field | Expected | Actual | Match |
| --- | --- | --- | --- |
| `name` | `execute_sql` | `execute_sql` | ✅ YES |
| `args.project_id` | `[GCP_PROJECT_ID]` | `[GCP_PROJECT_ID]` | ✅ YES |
| `args.query` | `[SQL_QUERY_STRING]` | `[SQL_QUERY_STRING]` | ✅ YES |
| **Expected Result** | Score: 1.0 | Score: 0.0 | ❌ MISMATCH |
Affected Code
The issue appears to be in:
- File: `google/adk/evaluation/trajectory_evaluator.py`
- Function: `_are_tool_calls_exact_match()` (Line ~180-190)
- Function: `_calculate_tool_use_accuracy()` (Line ~130-150)
Code Review
The logic in `_are_tool_calls_exact_match()` appears correct:

```python
def _are_tool_calls_exact_match(self, actual_tool_calls, expected_tool_calls):
    if len(actual_tool_calls) != len(expected_tool_calls):
        return False
    for actual, expected in zip(actual_tool_calls, expected_tool_calls):
        if actual.name != expected.name or actual.args != expected.args:
            return False
    return True
```
However, the comparison might fail due to:
- Type mismatch: `actual.args` and `expected.args` might be different types
- Dictionary comparison issue: args might contain nested objects that don't compare as equal
- Object reference issue: comparing object references instead of values
- Serialization issue: JSON deserialization creating different object types
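The type-mismatch and serialization hypotheses are easy to demonstrate in plain Python (this illustrates the general failure modes, not ADK internals): a dict never compares equal to its JSON string form, and a serialization round trip can silently change value types inside the dict:

```python
import json

# "rows" is a tuple here only to show the round-trip effect.
args = {"project_id": "[GCP_PROJECT_ID]", "rows": (1, 2)}

# Type mismatch: a dict is never equal to its JSON string form.
print(args == json.dumps(args))  # False

# Serialization issue: a JSON round trip turns the tuple into a list,
# and (1, 2) != [1, 2] in Python, so the dicts no longer compare equal.
round_tripped = json.loads(json.dumps(args))
print(args == round_tripped)     # False
```

If expected calls come from the eval-set JSON while actual calls come from live tool-call objects, either of these paths would yield `actual.args != expected.args` despite visually identical values.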
Environment Information
- ADK Version: Latest
- Python Version: 3.12+
- OS: Windows/Linux/Mac
- ADK Components: `google-adk[eval]`
- Tool Used: BigQuery Toolset (but likely affects all tools)
Related Issues
This issue appears related to:
- #3434: `tool_trajectory_avg_score` that can never be predicted (the `id` field in tool calls being compared despite being dynamically generated)
Additional Context
This bug blocks reliable tool trajectory testing in ADK evaluation. Even when tools are called correctly with identical arguments, the evaluation fails.
Note: This issue report contains placeholder values for sensitive information:
- `[GCP_PROJECT_ID]` = your actual GCP project ID
- `[SQL_QUERY_STRING]` = your actual SQL query
- `[YOUR_TABLE]` = your actual table names