# 03: Test Multi-Tool Calling Agent

End-to-end tests of the agent against its functional requirements (FR numbers in parentheses).

**Test Scenarios**:
1. Single-domain queries (FR-001, FR-003)
2. Multi-domain queries (FR-005, FR-006)
3. Context-aware follow-ups (FR-011)
4. Proactive suggestions (FR-013)
5. Error handling (FR-008)
6. Performance (FR-012)

## Setup

In [None]:
%pip install --quiet --upgrade mlflow langgraph langchain-core
dbutils.library.restartPython()

In [None]:
import mlflow
import time
from typing import List, Tuple

print("✅ Imports successful")

## Load Agent from MLflow

In [None]:
# Load latest agent version
model_name = "multi_tool_calling_agent"
model_uri = f"models:/{model_name}/latest"

print(f"Loading agent from: {model_uri}")
agent = mlflow.langchain.load_model(model_uri)

print("✅ Agent loaded successfully")

## Helper Functions

In [None]:
def query_agent(query: str, conversation_history: List[Tuple] = None) -> dict:
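    """Query the agent and return the response, full message list, and latency."""
    # Copy the history so the caller's list is not mutated by the append below
    messages = list(conversation_history or [])
    messages.append(("user", query))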
    
    # perf_counter is a monotonic clock, better suited to measuring elapsed time
    start_time = time.perf_counter()
    result = agent.invoke({"messages": messages})
    elapsed_ms = (time.perf_counter() - start_time) * 1000
    
    final_message = result["messages"][-1]
    response = final_message.content
    
    return {
        "query": query,
        "response": response,
        "messages": result["messages"],
        "elapsed_ms": elapsed_ms
    }

def print_result(result: dict):
    """Pretty print result"""
    print(f"\nQuery: {result['query']}")
    print(f"Response: {result['response']}")
    print(f"Time: {result['elapsed_ms']:.0f}ms")

print("✅ Helper functions ready")

## Test 1: Single-Domain Query (FR-001, FR-003)

Test basic customer behavior query.

In [None]:
result = query_agent("What are the top cart abandonment products?")
print_result(result)

# Validations
assert result['response'], "Should have response"
assert result['elapsed_ms'] < 60000, "Should complete within 60s (FR-012)"
assert "[Source:" in result['response'] or "Customer Behavior" in result['response'], "Should cite source (FR-009)"

print("\n✅ Test 1 PASSED")

## Test 2: Multi-Domain Query (FR-005, FR-006)

Test query spanning customer behavior and inventory.

In [None]:
result = query_agent(
    "What products are frequently abandoned in carts and do we have inventory issues with those items?"
)
print_result(result)

# Validations
assert result['response'], "Should have response"
assert "abandon" in result['response'].lower() or "cart" in result['response'].lower(), "Should address cart abandonment"
assert "inventory" in result['response'].lower() or "stock" in result['response'].lower(), "Should address inventory"

print("\n✅ Test 2 PASSED")
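
The keyword assertions above only check the wording of the final answer; they do not show whether the agent actually called tools from both domains. The following is a minimal sketch of one way to inspect that, assuming the returned messages are LangChain-style objects whose AI messages expose a `tool_calls` list of dicts with a `name` key. `tool_names_called` is a hypothetical helper, and the tool names in the demo data are made up for illustration.

```python
from types import SimpleNamespace
from typing import Any, List


def tool_names_called(messages: List[Any]) -> List[str]:
    """Collect tool names from every message that exposes a `tool_calls` list."""
    names = []
    for msg in messages:
        # Messages without tool calls (user tuples, plain replies) contribute nothing.
        for call in getattr(msg, "tool_calls", None) or []:
            names.append(call["name"])
    return names


# Stand-in messages for illustration only; the tool names are hypothetical.
demo_messages = [
    ("user", "What products are abandoned and low on stock?"),
    SimpleNamespace(content="", tool_calls=[{"name": "customer_behavior", "args": {}}]),
    SimpleNamespace(content="", tool_calls=[{"name": "inventory", "args": {}}]),
    SimpleNamespace(content="Final answer", tool_calls=[]),
]
called = tool_names_called(demo_messages)
print(called)
```

In the notebook, an assertion such as `len(set(tool_names_called(result["messages"]))) >= 2` could serve as a stronger signal for FR-005 and FR-006, provided the agent returns messages in that shape.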

## Test 3: Context-Aware Follow-Up (FR-011)

Test conversation history and context understanding.

In [None]:
# First query
result1 = query_agent("What are the top cart abandonment products?")
print_result(result1)

# Follow-up using context
conversation_history = result1['messages']
result2 = query_agent("What about their inventory levels?", conversation_history)
print_result(result2)

# Validations
assert result2['response'], "Should have follow-up response"
assert "inventory" in result2['response'].lower() or "stock" in result2['response'].lower(), "Should understand context reference"

print(f"\nConversation length: {len(conversation_history)} messages")
print("\n✅ Test 3 PASSED")

## Test 4: Proactive Suggestions (FR-013)

Verify agent provides suggestions for related insights.

In [None]:
result = query_agent("Show me this month's sales data")
print_result(result)

# Check for suggestions (numbered list pattern)
import re
suggestions = re.findall(r'\d+\.\s+(.+)', result['response'])

print(f"\nSuggestions found: {len(suggestions)}")
for i, suggestion in enumerate(suggestions, 1):
    print(f"  {i}. {suggestion}")

# Note: FR-013 requires suggestions, but format may vary
print("\n✅ Test 4 PASSED (check suggestions manually)")
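
The cell above never asserts anything, so Test 4 passes even when no suggestions are returned. A possible tightening, sketched under the assumption that suggestions appear as a numbered list with one item per line: `extract_suggestions` is a hypothetical helper, and the sample response is invented for the demo.

```python
import re
from typing import List

# Anchor to the start of a line so decimals such as "3.5" are not treated as list items.
NUMBERED_ITEM = re.compile(r"^[ \t]*\d+\.\s+(.+)$", re.MULTILINE)


def extract_suggestions(response: str) -> List[str]:
    """Return the text of each numbered list item found in the response."""
    return [item.strip() for item in NUMBERED_ITEM.findall(response)]


# Invented sample text, not real agent output.
sample_response = (
    "Sales are up this month.\n\n"
    "You might also want to explore:\n"
    "1. Top products by revenue\n"
    "2. Regional sales breakdown\n"
)
found = extract_suggestions(sample_response)
print(found)
```

If the agent's format turns out to be stable, `assert len(extract_suggestions(result['response'])) >= 1` could replace the manual check.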

## Test 5: Error Handling (FR-008)

Test graceful handling of unavailable data.

In [None]:
result = query_agent("What is the weather forecast for next week?")
print_result(result)

# Validations
assert result['response'], "Should provide response"
# Agent should politely explain it can't answer weather questions

print("\n✅ Test 5 PASSED")
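
The only assertion above is that the response is non-empty, so a hallucinated forecast would also pass. One rough heuristic is to look for refusal wording. The sketch below assumes the agent declines in English using phrases like the ones listed; both the marker list and the `looks_like_decline` helper are hypothetical and would need tuning against real responses.

```python
from typing import Iterable

# Assumed phrasings; adjust to match how the agent actually words a refusal.
DECLINE_MARKERS = ("can't", "cannot", "unable", "don't have", "not available", "outside")


def looks_like_decline(response: str, markers: Iterable[str] = DECLINE_MARKERS) -> bool:
    """Heuristically detect a polite refusal by searching for known phrases."""
    # Normalize curly apostrophes so "can’t" and "can't" are treated alike.
    text = response.lower().replace("\u2019", "'")
    return any(marker in text for marker in markers)


# Invented sample responses for illustration.
declined = looks_like_decline("I'm sorry, I can't help with weather forecasts.")
answered = looks_like_decline("Next week will be sunny with highs of 75F.")
print(declined, answered)
```

A keyword match like this is brittle, so a failure is best treated as a prompt for manual review, not as proof that FR-008 is violated.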

## Test 6: Performance (FR-012)

Verify complex queries complete within 60 seconds.

In [None]:
complex_queries = [
    "Analyze cart abandonment patterns and correlate with inventory stockouts",
    "What products have high cart abandonment and low inventory?",
    "Show customer segments affected by inventory constraints"
]

for query in complex_queries:
    result = query_agent(query)
    print(f"\nQuery: {query[:60]}...")
    print(f"Time: {result['elapsed_ms']:.0f}ms")
    assert result['elapsed_ms'] < 60000, f"Query exceeded 60s: {result['elapsed_ms']:.0f}ms"

print("\n✅ Test 6 PASSED - All queries under 60s")
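
Because the loop above asserts inside the iteration, the first slow query stops the cell and the remaining queries are never timed. An alternative is to collect all timings first and summarize afterwards. This is a minimal sketch: `summarize_latency` is a hypothetical helper, and the timings in the demo are illustrative values, not measurements.

```python
from statistics import mean
from typing import Dict, List

LIMIT_MS = 60_000  # FR-012 threshold, matching the assertions in this notebook


def summarize_latency(elapsed_ms: List[float], limit_ms: float = LIMIT_MS) -> Dict[str, float]:
    """Summarize a list of latencies and count how many reach the limit."""
    if not elapsed_ms:
        raise ValueError("no timings to summarize")
    return {
        "count": len(elapsed_ms),
        "mean_ms": mean(elapsed_ms),
        "max_ms": max(elapsed_ms),
        "over_limit": sum(1 for t in elapsed_ms if t >= limit_ms),
    }


# Illustrative timings only; in the notebook these would come from query_agent.
summary = summarize_latency([1200.0, 4800.0, 3000.0])
print(summary)
```

In the notebook, the list could be built with `[query_agent(q)['elapsed_ms'] for q in complex_queries]`, followed by a single `assert summary['over_limit'] == 0`.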

## Test Summary

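Run this cell after every test cell above has completed without an assertion error. The summary is printed unconditionally; it does not check the results itself.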

In [None]:
print("="*50)
print("TEST SUMMARY")
print("="*50)
print("✅ Test 1: Single-domain queries (FR-001, FR-003)")
print("✅ Test 2: Multi-domain queries (FR-005, FR-006)")
print("✅ Test 3: Context-aware follow-ups (FR-011)")
print("✅ Test 4: Proactive suggestions (FR-013)")
print("✅ Test 5: Error handling (FR-008)")
print("✅ Test 6: Performance under 60s (FR-012)")
print("="*50)
print("\n🎉 All tests completed!")
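
Since the summary above is hardcoded, it reports every test as passed regardless of what happened in earlier cells. One way to make it reflect actual outcomes is to record each result as the tests run. The sketch below assumes each test's assertions are wrapped in a function; `run_test`, `results`, and the two demo tests are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict

results: Dict[str, bool] = {}


def run_test(name: str, test_fn: Callable[[], None]) -> None:
    """Run a test function and record whether its assertions held."""
    try:
        test_fn()
        results[name] = True
    except AssertionError as exc:
        results[name] = False
        print(f"FAILED {name}: {exc}")


def _passing() -> None:
    assert 1 + 1 == 2


def _failing() -> None:
    assert "inventory" in "no match here", "Should address inventory"


# Demo tests stand in for the notebook's real test cells.
run_test("demo pass", _passing)
run_test("demo fail", _failing)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```

With this pattern, a failing test no longer halts the notebook, and the summary cell can end with `assert all(results.values())` to surface any failure.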

## Next Steps

1. Review test results
2. Run additional custom queries
3. Deploy agent to Model Serving (see deployment docs)
4. Create evaluation dataset for MLflow evaluation