Summary
The `TrajectoryEvaluator` in ADK returns a score of 0.0 for tool trajectory evaluation even when the expected and actual tool calls have identical `name` and `args` fields.
Expected Behavior
According to the `trajectory_evaluator.py` code, the comparison only checks `name` and `args`:

```python
def _are_tool_calls_exact_match(self, actual_tool_calls, expected_tool_calls):
    for actual, expected in zip(actual_tool_calls, expected_tool_calls):
        if actual.name != expected.name or actual.args != expected.args:
            return False
    return True
```

When both `name` and `args` match exactly, the score should be 1.0, not 0.0.
Actual Behavior
The `tool_trajectory_avg_score` metric returns 0.0 (FAILED) even when:
- ✅ Tool name matches: `execute_sql`
- ✅ Tool args are identical: `{'project_id': '[GCP_PROJECT_ID]', 'query': '[SQL_QUERY]'}`
- ✅ All arguments match exactly
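As a sanity check on the claim above: Python compares dicts by value, not by identity, so two independently constructed args dicts with the same keys and values are equal. The values here are the report's placeholders, not real data:

```python
# Python dict equality is recursive value equality, so two separately
# built args dicts with identical contents compare equal.
# Placeholder strings stand in for the real project ID and SQL query.
expected_args = {"project_id": "[GCP_PROJECT_ID]", "query": "[SQL_QUERY]"}
actual_args = {"project_id": "[GCP_PROJECT_ID]", "query": "[SQL_QUERY]"}

print(expected_args == actual_args)  # True: value equality holds
print(expected_args is actual_args)  # False: identity differs, and should not matter
```

So the 0.0 score cannot be explained by the two dicts being distinct objects.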
Steps to Reproduce
1. Create Test Case File
Create a test file with `InvocationEvents` format containing a BigQuery SQL execution:

```json
{
  "eval_set_id": "test_trajectory",
  "name": "test_trajectory",
  "eval_cases": [
    {
      "eval_id": "sql_execution_test",
      "conversation": [
        {
          "invocation_id": "test-invocation-001",
          "user_content": {
            "parts": [{"text": "Execute a query"}],
            "role": "user"
          },
          "final_response": {
            "parts": [{"text": "Query executed successfully with results..."}],
            "role": "model"
          },
          "intermediate_data": {
            "invocation_events": [
              {
                "author": "test_agent",
                "content": {
                  "parts": [
                    {"text": "Executing SQL query..."},
                    {
                      "function_call": {
                        "name": "execute_sql",
                        "args": {
                          "project_id": "[YOUR_GCP_PROJECT_ID]",
                          "query": "SELECT * FROM [your_table] WHERE [condition]"
                        }
                      }
                    }
                  ],
                  "role": "model"
                }
              },
              {
                "author": "test_agent",
                "content": {
                  "parts": [
                    {
                      "function_response": {
                        "name": "execute_sql",
                        "response": {
                          "status": "SUCCESS",
                          "rows": [{"sample": "data"}]
                        }
                      }
                    }
                  ],
                  "role": "user"
                }
              }
            ]
          }
        }
      ],
      "session_input": {
        "app_name": "test_agent",
        "user_id": "user"
      }
    }
  ]
}
```
2. Create Config File
Create `test_config.json`:

```json
{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 0.5,
      "match_type": "ANY_ORDER"
    },
    "response_match_score": 0.5
  }
}
```
3. Run Evaluation
```shell
adk eval test_agent test_case.evalset.json --config_file_path=test_config.json --print_detailed_results
```
4. Observe Results
The evaluation will fail with:

```
Metric: tool_trajectory_avg_score, Status: FAILED, Score: 0.0, Threshold: 0.5
Expected tool calls:
  id=None args={'project_id': '[GCP_PROJECT_ID]', 'query': '[SQL]'}
Actual tool calls:
  id='toolu_vrtx_01...' args={'project_id': '[GCP_PROJECT_ID]', 'query': '[SQL]'}
```
Both `args` dictionaries are identical, yet the score is 0.0. The only visible difference is the `id` field, which is `None` on the expected call and dynamically generated on the actual one.
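One hypothesis consistent with that output: if the evaluator anywhere compares the whole tool-call objects (or their serialized forms) rather than only `name` and `args`, the dynamically generated `id` alone makes them unequal. A minimal sketch, using a hypothetical `ToolCall` dataclass (field names assumed, not ADK's actual type):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ToolCall:
    # Hypothetical stand-in for ADK's tool-call object; field names assumed.
    name: str
    args: dict[str, Any]
    id: Optional[str] = None

expected = ToolCall("execute_sql", {"project_id": "[GCP_PROJECT_ID]", "query": "[SQL]"})
actual = ToolCall("execute_sql", {"project_id": "[GCP_PROJECT_ID]", "query": "[SQL]"},
                  id="toolu_vrtx_01...")

# Whole-object equality fails solely because of the generated id.
print(expected == actual)  # False
# Comparing only the documented fields succeeds.
print(expected.name == actual.name and expected.args == actual.args)  # True
```

If the real code path serializes or compares full objects anywhere before reaching `_are_tool_calls_exact_match()`, this would produce exactly the observed mismatch.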
Expected vs Actual Comparison
Expected Tool Call
```python
{
    "id": None,
    "name": "execute_sql",
    "args": {
        "project_id": "[GCP_PROJECT_ID]",
        "query": "[SQL_QUERY_STRING]"
    }
}
```
Actual Tool Call
```python
{
    "id": "[DYNAMICALLY_GENERATED_ID]",
    "name": "execute_sql",
    "args": {
        "project_id": "[GCP_PROJECT_ID]",
        "query": "[SQL_QUERY_STRING]"
    }
}
```
Comparison Result
| Field | Expected | Actual | Match |
| --- | --- | --- | --- |
| `name` | `execute_sql` | `execute_sql` | ✅ YES |
| `args.project_id` | `[GCP_PROJECT_ID]` | `[GCP_PROJECT_ID]` | ✅ YES |
| `args.query` | `[SQL_QUERY_STRING]` | `[SQL_QUERY_STRING]` | ✅ YES |
| **Expected Result** | Score: 1.0 | Score: 0.0 | ❌ MISMATCH |
Affected Code
The issue appears to be in:
- File: `google/adk/evaluation/trajectory_evaluator.py`
- Function: `_are_tool_calls_exact_match()` (Line ~180-190)
- Function: `_calculate_tool_use_accuracy()` (Line ~130-150)
Code Review
The logic in `_are_tool_calls_exact_match()` appears correct:

```python
def _are_tool_calls_exact_match(self, actual_tool_calls, expected_tool_calls):
    if len(actual_tool_calls) != len(expected_tool_calls):
        return False
    for actual, expected in zip(actual_tool_calls, expected_tool_calls):
        if actual.name != expected.name or actual.args != expected.args:
            return False
    return True
```
However, the comparison might fail due to:
- Type mismatch: `actual.args` and `expected.args` might be different types
- Dictionary comparison issue: args might contain nested objects that don't compare as equal
- Object reference issue: comparing object references instead of values
- Serialization issue: JSON deserialization creating different object types
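The type-mismatch and serialization hypotheses are easy to demonstrate in plain Python (this illustrates the general failure modes, not ADK internals): a dict never compares equal to its JSON string form, and a serialization round trip can silently change value types inside the dict:

```python
import json

# "rows" is a tuple here only to show the round-trip effect.
args = {"project_id": "[GCP_PROJECT_ID]", "rows": (1, 2)}

# Type mismatch: a dict is never equal to its JSON string form.
print(args == json.dumps(args))  # False

# Serialization issue: a JSON round trip turns the tuple into a list,
# and (1, 2) != [1, 2] in Python, so the dicts no longer compare equal.
round_tripped = json.loads(json.dumps(args))
print(args == round_tripped)     # False
```

If expected calls come from the eval-set JSON while actual calls come from live tool-call objects, either of these paths would yield `actual.args != expected.args` despite visually identical values.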
Environment Information
- ADK Version: Latest
- Python Version: 3.12+
- OS: Windows/Linux/Mac
- ADK Components: `google-adk[eval]`
- Tool Used: BigQuery Toolset (but likely affects all tools)
Related Issues
This issue appears related to:
- #3434: `tool_trajectory_avg_score` that can never be predicted (the `id` field in tool calls being compared despite being dynamically generated)
Additional Context
This bug blocks reliable tool trajectory testing in ADK evaluation. Even when tools are called correctly with identical arguments, the evaluation fails.
Note: This issue report contains placeholder values for sensitive information:
- `[GCP_PROJECT_ID]` = your actual GCP project ID
- `[SQL_QUERY_STRING]` = your actual SQL query
- `[YOUR_TABLE]` = your actual table names