# 🎯 Tool Call Accuracy Evaluation

This notebook demonstrates how to evaluate **tool call accuracy** using Microsoft Foundry's built-in evaluators. Tool Call Accuracy measures whether an agent correctly identifies and calls the appropriate tools with proper parameters.

## 🎯 Learning Objectives

1. **Understand tool call accuracy** evaluation concepts
2. **Define tool definitions** for evaluation
3. **Create test scenarios** with expected tool calls
4. **Evaluate** whether agents make correct tool choices

## 💼 Industry Use Case: Banking Operations Tool Selection

In banking, agents must select the correct tool for each request:
- **Wire transfers** require `initiate_wire_transfer` (not `get_account_balance`)
- **Account inquiries** require `get_account_balance` (not `initiate_wire_transfer`)
- **Fraud alerts** require `report_suspicious_activity`

**Incorrect tool selection can lead to:**
- Unauthorized transactions
- Regulatory violations
- Customer complaints
- Financial losses

### ⚠️ Disclaimer
> **This is a demonstration with simulated data.** In production, actual banking API calls would be validated against compliance requirements.

## 🔐 Authentication Setup

Before running this notebook, authenticate with Azure CLI:

```bash
az login --use-device-code
```

## 1. Environment Setup

In [None]:
import json
import os
import time
from pathlib import Path
from pprint import pprint
from dotenv import load_dotenv

# Load environment variables
notebook_path = Path().absolute()
env_path = notebook_path.parent / '.env'
load_dotenv(env_path)

# Verify required environment variables
project_endpoint = os.environ.get("AI_FOUNDRY_PROJECT_ENDPOINT")
tenant_id = os.environ.get("TENANT_ID")
model_deployment = os.environ.get("AZURE_AI_MODEL_DEPLOYMENT_NAME", "gpt-4o")

if not project_endpoint:
    raise ValueError("🚨 AI_FOUNDRY_PROJECT_ENDPOINT not set in .env")

print(f"🔑 Tenant ID: {tenant_id}")
print(f"📍 Project Endpoint: {project_endpoint[:50]}...")
print(f"🤖 Model Deployment: {model_deployment}")
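
The fail-fast check above covers only the endpoint. A minimal sketch of the same idea for several variables at once follows. The helper name `missing_env_vars` is hypothetical, and which variables count as required is an assumption to adapt to your setup.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or blank in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not (env.get(name) or "").strip()]


# An explicit mapping keeps the example independent of the real environment.
sample_env = {
    "AI_FOUNDRY_PROJECT_ENDPOINT": "https://example.invalid/project",
    "TENANT_ID": "",
}
required = ["AI_FOUNDRY_PROJECT_ENDPOINT", "TENANT_ID"]
print(missing_env_vars(required, sample_env))
```

Calling `missing_env_vars(required)` without a mapping checks `os.environ`, so it can run right after `load_dotenv`.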

## 2. Initialize AI Project Client

In [None]:
from azure.identity import AzureCliCredential
from azure.ai.projects import AIProjectClient
from openai.types.evals.create_eval_jsonl_run_data_source_param import (
    CreateEvalJSONLRunDataSourceParam,
    SourceFileContent,
    SourceFileContentContent,
)
from openai.types.eval_create_params import DataSourceConfigCustom

# Initialize credentials and clients
credential = AzureCliCredential(tenant_id=tenant_id)
project_client = AIProjectClient(endpoint=project_endpoint, credential=credential)
openai_client = project_client.get_openai_client()

print("✅ AIProjectClient initialized")
print("✅ OpenAI client retrieved for evaluations")

## 3. Define Banking Tool Definitions

These are the tools our banking agent should have access to. The evaluator will check if the agent selects the correct tool for each query.

In [None]:
# Banking tool definitions for FSI scenarios
banking_tool_definitions = [
    {
        "type": "function",
        "name": "get_account_balance",
        "description": "Retrieves the current balance for a customer's bank account",
        "parameters": {
            "type": "object",
            "properties": {
                "account_number": {"type": "string", "description": "The account number to check"},
                "account_type": {"type": "string", "description": "Type of account: checking, savings, or investment"}
            },
        },
    },
    {
        "type": "function",
        "name": "initiate_wire_transfer",
        "description": "Initiates a wire transfer from one account to another",
        "parameters": {
            "type": "object",
            "properties": {
                "from_account": {"type": "string", "description": "Source account number"},
                "to_account": {"type": "string", "description": "Destination account number"},
                "amount": {"type": "number", "description": "Amount to transfer in USD"},
                "memo": {"type": "string", "description": "Transfer description or memo"}
            },
        },
    },
    {
        "type": "function",
        "name": "get_transaction_history",
        "description": "Retrieves recent transactions for a bank account",
        "parameters": {
            "type": "object",
            "properties": {
                "account_number": {"type": "string", "description": "The account number"},
                "days": {"type": "integer", "description": "Number of days of history to retrieve"}
            },
        },
    },
    {
        "type": "function",
        "name": "report_suspicious_activity",
        "description": "Reports suspicious or potentially fraudulent activity on an account",
        "parameters": {
            "type": "object",
            "properties": {
                "account_number": {"type": "string", "description": "The affected account number"},
                "description": {"type": "string", "description": "Description of the suspicious activity"},
                "urgency": {"type": "string", "description": "Urgency level: low, medium, high, critical"}
            },
        },
    },
    {
        "type": "function",
        "name": "apply_for_loan",
        "description": "Initiates a loan application process",
        "parameters": {
            "type": "object",
            "properties": {
                "loan_type": {"type": "string", "description": "Type of loan: mortgage, auto, personal"},
                "amount": {"type": "number", "description": "Requested loan amount"},
                "term_months": {"type": "integer", "description": "Loan term in months"}
            },
        },
    },
]

print(f"✅ Defined {len(banking_tool_definitions)} banking tools:")
for tool in banking_tool_definitions:
    print(f"   • {tool['name']}: {tool['description'][:50]}...")
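
Because the evaluator reads these definitions as-is, a quick structural check before submitting them can catch typos early. The sketch below is a local sanity check, not part of the Foundry SDK. The helper name `find_definition_problems` is hypothetical, and the optional `required` list it inspects is an assumption (the definitions above do not declare one).

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("type", "name", "description", "parameters")


def find_definition_problems(tool_definitions: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems found in a list of tool definitions."""
    problems: List[str] = []
    seen_names = set()
    for index, tool in enumerate(tool_definitions):
        label = tool.get("name", f"<tool #{index}>")
        for key in REQUIRED_KEYS:
            if key not in tool:
                problems.append(f"{label}: missing key '{key}'")
        if label in seen_names:
            problems.append(f"{label}: duplicate tool name")
        seen_names.add(label)
        params = tool.get("parameters", {})
        properties = params.get("properties", {})
        # Every name listed under "required" must be a declared property.
        for required_name in params.get("required", []):
            if required_name not in properties:
                problems.append(
                    f"{label}: required parameter '{required_name}' is not declared"
                )
    return problems


# Small stand-in list; in the notebook you would pass banking_tool_definitions.
sample_definitions = [
    {
        "type": "function",
        "name": "get_account_balance",
        "description": "Retrieves the current balance for a customer's bank account",
        "parameters": {
            "type": "object",
            "properties": {"account_number": {"type": "string"}},
            "required": ["account_number"],
        },
    },
    {"type": "function", "name": "broken_tool"},
]
for problem in find_definition_problems(sample_definitions):
    print(problem)
```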

## 4. Create Test Scenarios

We'll create test scenarios that simulate:
1. **Correct tool selection** - Agent picks the right tool
2. **Correct parameters** - Agent extracts proper values from query
3. **Multi-tool scenarios** - Agent uses multiple tools when needed

In [None]:
# Scenario 1: Simple balance inquiry - should call get_account_balance
query1 = "What's the current balance in my checking account CHK-12345?"
tool_calls1 = [
    {
        "type": "tool_call",
        "tool_call_id": "call_balance_1",
        "name": "get_account_balance",
        "arguments": {"account_number": "CHK-12345", "account_type": "checking"},
    }
]

# Scenario 2: Wire transfer request - should call initiate_wire_transfer
query2 = "Please transfer $5,000 from my account ACC-001 to ACC-002 for the office rent payment."
tool_calls2 = [
    {
        "type": "tool_call",
        "tool_call_id": "call_transfer_1",
        "name": "initiate_wire_transfer",
        "arguments": {
            "from_account": "ACC-001",
            "to_account": "ACC-002",
            "amount": 5000,
            "memo": "office rent payment"
        },
    }
]

# Scenario 3: Fraud report - should call report_suspicious_activity
query3 = "I noticed unauthorized charges on account SAV-99999. There's a $2000 ATM withdrawal I didn't make. This is urgent!"
tool_calls3 = [
    {
        "type": "tool_call",
        "tool_call_id": "call_fraud_1",
        "name": "report_suspicious_activity",
        "arguments": {
            "account_number": "SAV-99999",
            "description": "Unauthorized ATM withdrawal of $2000",
            "urgency": "high"
        },
    }
]

# Scenario 4: Multi-step request - should call multiple tools
query4 = "Show me my last 30 days of transactions for account CHK-55555, and also check the balance."
tool_calls4 = [
    {
        "type": "tool_call",
        "tool_call_id": "call_history_1",
        "name": "get_transaction_history",
        "arguments": {"account_number": "CHK-55555", "days": 30},
    },
    {
        "type": "tool_call",
        "tool_call_id": "call_balance_2",
        "name": "get_account_balance",
        "arguments": {"account_number": "CHK-55555", "account_type": "checking"},
    }
]

# Scenario 5: Loan application - should call apply_for_loan
query5 = "I'd like to apply for a $25,000 personal loan with a 36-month term."
tool_calls5 = [
    {
        "type": "tool_call",
        "tool_call_id": "call_loan_1",
        "name": "apply_for_loan",
        "arguments": {
            "loan_type": "personal",
            "amount": 25000,
            "term_months": 36
        },
    }
]

print("✅ Created 5 test scenarios:")
print("   1. Balance inquiry → get_account_balance")
print("   2. Wire transfer → initiate_wire_transfer")
print("   3. Fraud report → report_suspicious_activity")
print("   4. Multi-tool request → get_transaction_history + get_account_balance")
print("   5. Loan application → apply_for_loan")
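
Hand-written scenarios can drift from the tool definitions, for example through a misspelled tool name or argument. A minimal sketch of a pre-flight check follows. The helper name `check_tool_calls` is hypothetical, and the check is purely structural: it does not judge whether the tool choice is appropriate for the query.

```python
from typing import Any, Dict, List


def check_tool_calls(
    tool_calls: List[Dict[str, Any]], tool_definitions: List[Dict[str, Any]]
) -> List[str]:
    """Return problems where a call names an unknown tool or an undeclared argument."""
    declared = {
        tool["name"]: set(tool.get("parameters", {}).get("properties", {}))
        for tool in tool_definitions
    }
    problems: List[str] = []
    for call in tool_calls:
        name = call.get("name")
        if name not in declared:
            problems.append(f"unknown tool: {name}")
            continue
        for argument in call.get("arguments", {}):
            if argument not in declared[name]:
                problems.append(f"{name}: undeclared argument '{argument}'")
    return problems


# Stand-in definition; in the notebook you would pass banking_tool_definitions.
definitions = [
    {
        "type": "function",
        "name": "get_account_balance",
        "parameters": {
            "type": "object",
            "properties": {
                "account_number": {"type": "string"},
                "account_type": {"type": "string"},
            },
        },
    }
]
good_call = [
    {
        "type": "tool_call",
        "tool_call_id": "call_balance_1",
        "name": "get_account_balance",
        "arguments": {"account_number": "CHK-12345", "account_type": "checking"},
    }
]
bad_call = [
    {
        "type": "tool_call",
        "tool_call_id": "call_balance_x",
        "name": "get_account_balance",
        "arguments": {"acct": "CHK-12345"},
    }
]

print(check_tool_calls(good_call, definitions))
print(check_tool_calls(bad_call, definitions))
```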

## 5. Configure Tool Call Accuracy Evaluation

The `builtin.tool_call_accuracy` evaluator checks:
- Was the correct tool selected?
- Were the parameters extracted correctly?
- Were all required tools called?
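
The three questions above can be approximated locally with a deterministic comparison. The sketch below is not how `builtin.tool_call_accuracy` works internally; that evaluator is configured with a model deployment, and its internals are not shown here. This is a simplified exact-match baseline. The function name `compare_tool_calls` and the idea of a separate list of expected calls are assumptions for illustration.

```python
from typing import Any, Dict, List


def compare_tool_calls(
    expected: List[Dict[str, Any]], actual: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Compare expected and actual calls by tool name and exact argument equality.

    Simplification: calls are keyed by tool name, so repeated calls to the
    same tool collapse into one entry.
    """
    expected_by_name = {call["name"]: call.get("arguments", {}) for call in expected}
    actual_by_name = {call["name"]: call.get("arguments", {}) for call in actual}
    missing = sorted(set(expected_by_name) - set(actual_by_name))
    unexpected = sorted(set(actual_by_name) - set(expected_by_name))
    wrong_arguments = sorted(
        name
        for name in set(expected_by_name) & set(actual_by_name)
        if expected_by_name[name] != actual_by_name[name]
    )
    return {
        "missing": missing,
        "unexpected": unexpected,
        "wrong_arguments": wrong_arguments,
        "exact_match": not (missing or unexpected or wrong_arguments),
    }


expected_calls = [
    {"name": "get_transaction_history", "arguments": {"account_number": "CHK-55555", "days": 30}},
    {"name": "get_account_balance", "arguments": {"account_number": "CHK-55555", "account_type": "checking"}},
]
# Hypothetical agent output: one tool skipped, one argument wrong.
actual_calls = [
    {"name": "get_transaction_history", "arguments": {"account_number": "CHK-55555", "days": 7}},
]

report = compare_tool_calls(expected_calls, actual_calls)
print(report)
```

Exact matching is stricter than a model-based judge: a differently worded free-text argument such as `memo` would count as wrong here even when it is semantically fine.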

In [None]:
# Define data source config for tool call accuracy evaluation
data_source_config = DataSourceConfigCustom(
    type="custom",
    item_schema={
        "type": "object",
        "properties": {
            "query": {"anyOf": [{"type": "string"}, {"type": "array", "items": {"type": "object"}}]},
            "tool_definitions": {
                "anyOf": [{"type": "object"}, {"type": "array", "items": {"type": "object"}}]
            },
            "tool_calls": {"anyOf": [{"type": "object"}, {"type": "array", "items": {"type": "object"}}]},
            "response": {"anyOf": [{"type": "string"}, {"type": "array", "items": {"type": "object"}}]},
        },
        "required": ["query", "tool_definitions"],
    },
    include_sample_schema=True,
)

# Testing criteria using the Tool Call Accuracy evaluator
testing_criteria = [
    {
        "type": "azure_ai_evaluator",
        "name": "tool_call_accuracy",
        "evaluator_name": "builtin.tool_call_accuracy",
        "initialization_parameters": {"deployment_name": model_deployment},
        "data_mapping": {
            "query": "{{item.query}}",
            "tool_definitions": "{{item.tool_definitions}}",
            "tool_calls": "{{item.tool_calls}}",
            "response": "{{item.response}}",
        },
    }
]

print("✅ Evaluation criteria configured:")
print("   • Evaluator: builtin.tool_call_accuracy")
print("   • Data fields: query, tool_definitions, tool_calls, response")

## 6. Create Evaluation Object

In [None]:
# Create evaluation object
eval_object = openai_client.evals.create(
    name="Banking Tool Call Accuracy Evaluation",
    data_source_config=data_source_config,
    testing_criteria=testing_criteria,  # type: ignore
)

print("✅ Evaluation created")
print(f"   ID: {eval_object.id}")
print(f"   Name: {eval_object.name}")

## 7. Run Evaluation with Test Data

In [None]:
# Create evaluation run with inline test data
eval_run = openai_client.evals.runs.create(
    eval_id=eval_object.id,
    name="Banking Tool Call Accuracy Run",
    metadata={"team": "fsi-banking", "scenario": "tool-accuracy-v1"},
    data_source=CreateEvalJSONLRunDataSourceParam(
        type="jsonl",
        source=SourceFileContent(
            type="file_content",
            content=[
                # Scenario 1: Balance inquiry
                SourceFileContentContent(
                    item={
                        "query": query1,
                        "tool_definitions": banking_tool_definitions,
                        "tool_calls": tool_calls1,
                        "response": None,
                    }
                ),
                # Scenario 2: Wire transfer
                SourceFileContentContent(
                    item={
                        "query": query2,
                        "tool_definitions": banking_tool_definitions,
                        "tool_calls": tool_calls2,
                        "response": None,
                    }
                ),
                # Scenario 3: Fraud report
                SourceFileContentContent(
                    item={
                        "query": query3,
                        "tool_definitions": banking_tool_definitions,
                        "tool_calls": tool_calls3,
                        "response": None,
                    }
                ),
                # Scenario 4: Multi-tool request
                SourceFileContentContent(
                    item={
                        "query": query4,
                        "tool_definitions": banking_tool_definitions,
                        "tool_calls": tool_calls4,
                        "response": None,
                    }
                ),
                # Scenario 5: Loan application
                SourceFileContentContent(
                    item={
                        "query": query5,
                        "tool_definitions": banking_tool_definitions,
                        "tool_calls": tool_calls5,
                        "response": None,
                    }
                ),
            ]
        ),
    ),
)

print("🚀 Evaluation run created")
print(f"   Run ID: {eval_run.id}")
print(f"   Status: {eval_run.status}")
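
The five `SourceFileContentContent(...)` entries above differ only in the query and the tool calls. A minimal sketch of building the same item payloads in a loop follows. The helper name `build_eval_items` is hypothetical, and the sketch produces plain dictionaries; whether your SDK version accepts plain dictionaries in place of `SourceFileContentContent` is an assumption to verify.

```python
from typing import Any, Dict, List, Tuple

Scenario = Tuple[str, List[Dict[str, Any]]]  # (query, tool_calls)


def build_eval_items(
    scenarios: List[Scenario], tool_definitions: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Wrap each scenario in the {"item": {...}} shape used for inline content."""
    return [
        {
            "item": {
                "query": query,
                "tool_definitions": tool_definitions,
                "tool_calls": tool_calls,
                "response": None,
            }
        }
        for query, tool_calls in scenarios
    ]


# Stand-in data; in the notebook you would pass (query1, tool_calls1), ... and
# banking_tool_definitions.
sample_scenarios: List[Scenario] = [
    (
        "What's the current balance in my checking account CHK-12345?",
        [{"type": "tool_call", "tool_call_id": "call_balance_1",
          "name": "get_account_balance",
          "arguments": {"account_number": "CHK-12345", "account_type": "checking"}}],
    ),
]
sample_tools = [{"type": "function", "name": "get_account_balance"}]

items = build_eval_items(sample_scenarios, sample_tools)
print(len(items), sorted(items[0]["item"]))
```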

In [None]:
# Poll until the run reaches a terminal state
print("⏳ Waiting for evaluation to complete...")
print("-" * 40)

# "canceled" is included defensively so a canceled run cannot keep the loop spinning
terminal_statuses = ("completed", "failed", "canceled")

while True:
    eval_run = openai_client.evals.runs.retrieve(
        run_id=eval_run.id,
        eval_id=eval_object.id
    )
    print(f"   Status: {eval_run.status}")

    if eval_run.status in terminal_statuses:
        break

    time.sleep(5)

if eval_run.status == "completed":
    print("\n✅ Evaluation completed successfully!")
else:
    print(f"\n❌ Evaluation did not complete (status: {eval_run.status}).")
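
A polling loop with no deadline can hang if the run never reaches a terminal state. A minimal sketch of polling with a timeout follows. The helper name `wait_for_terminal_status`, the default timeout, and the set of terminal statuses are assumptions; `fetch_status` stands in for a call such as `openai_client.evals.runs.retrieve(...).status`.

```python
import time
from typing import Callable, Iterable

TERMINAL_STATUSES = ("completed", "failed", "canceled")  # assumed terminal states


def wait_for_terminal_status(
    fetch_status: Callable[[], str],
    timeout_seconds: float = 600.0,
    poll_interval: float = 5.0,
    terminal_statuses: Iterable[str] = TERMINAL_STATUSES,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status until it returns a terminal status or the deadline passes."""
    terminal = set(terminal_statuses)
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still '{status}' after {timeout_seconds} seconds")
        sleep(poll_interval)


# Simulated status sequence standing in for real API responses.
statuses = iter(["queued", "in_progress", "completed"])
final = wait_for_terminal_status(lambda: next(statuses), poll_interval=0.0)
print(final)
```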

## 8. Analyze Evaluation Results

In [None]:
if eval_run.status == "completed":
    print("\n" + "=" * 60)
    print("📊 TOOL CALL ACCURACY EVALUATION RESULTS")
    print("=" * 60)

    # Display result counts
    print(f"\n📈 Result Counts: {eval_run.result_counts}")
    
    # Get output items
    output_items = list(
        openai_client.evals.runs.output_items.list(
            run_id=eval_run.id,
            eval_id=eval_object.id
        )
    )
    
    print(f"\n📝 SCENARIOS EVALUATED: {len(output_items)}")
    print("-" * 60)
    
    # Display results for each scenario
    scenario_names = [
        "Balance Inquiry",
        "Wire Transfer",
        "Fraud Report",
        "Multi-Tool Request",
        "Loan Application"
    ]
    
    # Labels are assigned by position, which assumes output items are returned
    # in the same order the scenarios were submitted
    for i, item in enumerate(output_items):
        scenario = scenario_names[i] if i < len(scenario_names) else f"Scenario {i+1}"
        print(f"\n🔹 {scenario}:")
        print(f"   Status: {item.status}")
        if hasattr(item, 'results') and item.results:
            for result in item.results:
                print(f"   Score: {getattr(result, 'score', 'N/A')}")
    
    # Display report URL
    if eval_run.report_url:
        print(f"\n🔗 Full Report URL: {eval_run.report_url}")

    print("\n" + "-" * 60)
    print("📋 Detailed Results:")
    print("-" * 60)
    pprint(output_items)
else:
    print("\n❌ Cannot display results - evaluation did not complete.")
    if eval_run.report_url:
        print(f"🔗 Check report for details: {eval_run.report_url}")
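
Once per-scenario verdicts are available, a short summary makes regressions easier to spot than the raw `pprint` output. The sketch below works on plain `(name, verdict)` pairs. The helper names are hypothetical, and how a verdict is read from each output item depends on the SDK version, so that extraction step is deliberately left out.

```python
from typing import List, Optional, Tuple

Outcome = Tuple[str, Optional[bool]]  # (scenario name, passed verdict or None)


def pass_rate(outcomes: List[Outcome]) -> float:
    """Fraction of scenarios marked passed; a missing verdict counts as not passed."""
    if not outcomes:
        return 0.0
    passed = sum(1 for _, verdict in outcomes if verdict is True)
    return passed / len(outcomes)


def scenarios_needing_review(outcomes: List[Outcome]) -> List[str]:
    """Names of scenarios that failed or have no verdict."""
    return [name for name, verdict in outcomes if verdict is not True]


# Made-up verdicts for illustration only; these are not results from a real run.
example_outcomes: List[Outcome] = [
    ("Balance Inquiry", True),
    ("Wire Transfer", True),
    ("Fraud Report", False),
    ("Multi-Tool Request", None),
]
print(f"Pass rate: {pass_rate(example_outcomes):.0%}")
print("Needs review:", scenarios_needing_review(example_outcomes))
```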

## 9. FSI Compliance Insights

In [None]:
print("\n" + "=" * 60)
print("💼 FSI COMPLIANCE INSIGHTS - Tool Call Accuracy")
print("=" * 60)

print("\n🎯 Why Tool Call Accuracy Matters in Banking:")
print("-" * 50)
print("   1. SECURITY: Wrong tool could expose sensitive data")
print("   2. COMPLIANCE: Unauthorized actions violate regulations")
print("   3. FINANCIAL: Incorrect transfers = potential losses")
print("   4. AUDIT: All tool calls must be traceable")

print("\n📊 Scenarios Covered:")
print("-" * 50)
print("   ✓ Balance inquiries use read-only tools")
print("   ✓ Wire transfers use proper authorization flow")
print("   ✓ Fraud reports trigger correct escalation")
print("   ✓ Multi-step requests handle all required actions")
print("   ✓ Loan applications route to correct processing")

print("\n🔐 Risk Mitigation Goals:")
print("-" * 50)
print("   • Catch accidental transfer initiation before deployment")
print("   • Confirm fraud alerts route to the reporting tool")
print("   • Check parameter extraction against the query")
print("   • Keep tool selections auditable")

if eval_run.report_url:
    print(f"\n🔗 View detailed report: {eval_run.report_url}")

## 10. Cleanup

In [None]:
# Clean up resources (uncomment to delete the evaluation when you are done)
# try:
#     openai_client.evals.delete(eval_id=eval_object.id)
#     print("🗑️ Evaluation deleted")
# except Exception as e:
#     print(f"⚠️ Could not delete evaluation: {e}")

# print("\n✅ Cleanup completed!")

## 🎯 Summary

In this notebook, you learned how to:

✅ **Define banking tool definitions** with proper schemas  
✅ **Create test scenarios** covering various banking operations  
✅ **Use `builtin.tool_call_accuracy`** evaluator  
✅ **Evaluate tool selection** for correct function calls  
✅ **Validate parameter extraction** from user queries  

### 🔧 Key APIs Used

| API | Purpose |
|-----|--------|
| `DataSourceConfigCustom` | Define schema for evaluation data |
| `builtin.tool_call_accuracy` | Evaluate correct tool selection |
| `CreateEvalJSONLRunDataSourceParam` | Pass inline test data |
| `SourceFileContentContent` | Individual test scenarios |

### 📊 Data Fields for Tool Call Accuracy

| Field | Description |
|-------|-------------|
| `query` | User's input/request |
| `tool_definitions` | Available tools the agent can use |
| `tool_calls` | Tools the agent actually called |
| `response` | Optional - agent's text response |

### 📚 Next Steps

1. **Add negative tests** - scenarios where wrong tools are called
2. **Test edge cases** - ambiguous queries, missing info
3. **Combine with other evaluators** - fluency, safety
4. **Integrate into CI/CD** - automated validation
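
As a starting point for the first step, a negative test pairs a query with a deliberately wrong tool call, so you can confirm the evaluator scores it low. The sketch below reuses the item shape from this notebook. The variable names, the shortened tool definitions, and the `call_wrong_1` identifier are made up for illustration.

```python
import json

# Shortened stand-in definitions; in the notebook you would reuse banking_tool_definitions.
tool_definitions = [
    {"type": "function", "name": "get_account_balance",
     "description": "Retrieves the current balance for a customer's bank account"},
    {"type": "function", "name": "initiate_wire_transfer",
     "description": "Initiates a wire transfer from one account to another"},
]

# A balance inquiry answered with a wire transfer: a valid tool, but the wrong one.
negative_item = {
    "query": "What's the current balance in my checking account CHK-12345?",
    "tool_definitions": tool_definitions,
    "tool_calls": [
        {
            "type": "tool_call",
            "tool_call_id": "call_wrong_1",
            "name": "initiate_wire_transfer",
            "arguments": {"from_account": "CHK-12345", "to_account": "ACC-002", "amount": 100},
        }
    ],
    "response": None,
}

called = negative_item["tool_calls"][0]["name"]
print(json.dumps({"called": called, "appropriate": "get_account_balance"}))
```

Keeping the wrong call structurally valid (a defined tool with plausible arguments) isolates the tool-selection error from schema errors.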
