# 02: Test & Evaluate Multi-Agent System

Comprehensive testing and MLflow evaluation of the agent against all functional requirements.

**Part 1 - Manual Tests**:
1. Single-domain queries (FR-001, FR-003)
2. Multi-domain queries (FR-005, FR-006)
3. Context-aware follow-ups (FR-011)
4. Error handling (FR-008)
5. Performance (FR-012)

**Part 2 - MLflow Evaluation**:
- Relevance scoring (LLM Judge)
- Answer correctness (LLM Judge)
- Source citation verification (FR-009)
- Suggestions provided (FR-013)
- Response time validation (FR-012)

**Model Source**: Unity Catalog (`juan_use1_catalog.genai.retail_multi_genie_agent`)

**Notebook Flow**:
1. `01-create-agent.ipynb` - Create, log to MLflow, register to UC
2. `02-test-evaluate-agent.ipynb` - Manual tests + MLflow evaluation (this notebook)
3. `03-deploy-agent.ipynb` - Deploy to Model Serving

## Install Dependencies

In [None]:
%pip install --quiet --upgrade mlflow databricks-langchain langgraph langchain-core
dbutils.library.restartPython()

In [None]:
import mlflow
import time
import json
import pandas as pd

print("✅ Imports successful")

## Load Agent from Unity Catalog

Load the agent from Unity Catalog using one of these methods:
- **Version number**: `models:/catalog.schema.model/1`
- **Alias**: `models:/catalog.schema.model@champion`
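- **Latest version**: `models:/catalog.schema.model/latest` (may be rejected for Unity Catalog models depending on the MLflow version; an alias is the safer choice)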
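
The version and alias URI forms listed above can be sketched as a small builder. `build_model_uri` is a hypothetical helper, not part of MLflow; it only formats the string that `mlflow.pyfunc.load_model` would receive.

```python
def build_model_uri(model_name, version=None, alias=None):
    """Format a models:/ URI from a model name and exactly one selector.

    Hypothetical helper for illustration; it performs no registry lookup.
    """
    if (version is None) == (alias is None):
        raise ValueError("Pass exactly one of version or alias")
    if alias is not None:
        return f"models:/{model_name}@{alias}"
    return f"models:/{model_name}/{version}"


print(build_model_uri("catalog.schema.model", alias="champion"))
print(build_model_uri("catalog.schema.model", version=1))
```

Requiring exactly one selector avoids silently preferring a version over an alias when both are supplied.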

In [None]:
# Unity Catalog model (3-level namespace)
UC_MODEL_NAME = "juan_use1_catalog.genai.retail_multi_genie_agent"

# Choose loading method:
# Option 1: Load latest version
# model_uri = f"models:/{UC_MODEL_NAME}/latest"

# Option 2: Load specific version (uncomment to use)
# model_version = "1"
# model_uri = f"models:/{UC_MODEL_NAME}/{model_version}"

# Option 3: Load by alias (recommended)
model_alias = "champion"  # or "staging", "production", etc.
model_uri = f"models:/{UC_MODEL_NAME}@{model_alias}"
print(f"Loading agent from Unity Catalog: {model_uri}")

AGENT = mlflow.pyfunc.load_model(model_uri)

print("✅ Agent loaded successfully from Unity Catalog")
print(f"   Model: {UC_MODEL_NAME}")

## Helper Functions

In [None]:
def query_agent(query: str, conversation_history: list = None) -> dict:
    """Query agent and return result with timing."""
    messages = conversation_history.copy() if conversation_history else []
    messages.append({"role": "user", "content": query})
    
    # Create input matching the input_example format from logging
    input_data = {"input": messages}
    
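    # perf_counter is a monotonic clock, better suited to measuring durations
    start_time = time.perf_counter()
    
    # Try calling predict with the dict directly first
    try:
        response = AGENT.predict(input_data)
    except Exception as e:
        print(f"Direct dict call failed: {e}")
        # Fall back to DataFrame
        input_df = pd.DataFrame([input_data])
        response = AGENT.predict(input_df)
    
    # Elapsed time includes the failed first attempt when the fallback is used
    elapsed_ms = (time.perf_counter() - start_time) * 1000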
    
    # Extract response from PyFunc output
    if isinstance(response, pd.DataFrame):
        output = response.iloc[0]['output']
    elif isinstance(response, dict):
        output = response.get('output', response)
    else:
        output = response
    
    # Handle different output formats
    if isinstance(output, list) and len(output) > 0:
        if isinstance(output[0], dict):
            response_text = output[0].get('text', str(output[0]))
        else:
            response_text = str(output[0])
    elif isinstance(output, dict):
        response_text = output.get('text', str(output))
    else:
        response_text = str(output)
    
    # Add assistant response to conversation
    messages.append({"role": "assistant", "content": response_text})
    
    return {
        "query": query,
        "response": response_text,
        "messages": messages,
        "elapsed_ms": elapsed_ms
    }


def print_result(result: dict):
    """Pretty print result."""
    print(f"\nQuery: {result['query']}")
    # Truncate long responses to 500 characters for display
    response = result['response']
    preview = f"{response[:500]}..." if len(response) > 500 else response
    print(f"Response: {preview}")
    print(f"Time: {result['elapsed_ms']:.0f}ms")


print("✅ Helper functions ready")
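
The output-normalization step inside `query_agent` can be read in isolation. The sketch below restates that branch logic as a standalone function so it can be checked without a loaded agent; `extract_text` is a hypothetical name, and the sample payloads are made up for illustration.

```python
def extract_text(output):
    """Reduce an agent output (list, dict, or scalar) to a plain string."""
    # A non-empty list is represented by its first item
    if isinstance(output, list) and output:
        output = output[0]
    # A dict is expected to carry the answer under "text"
    if isinstance(output, dict):
        return output.get("text", str(output))
    return str(output)


print(extract_text([{"text": "Top overstock risk: SKU-1"}]))
print(extract_text({"text": "No data"}))
print(extract_text("plain string"))
```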

## Test 1: Single-Domain Query (FR-001, FR-003)

Test basic inventory query.

In [None]:
result = query_agent("What products are at risk for overstock?")
print_result(result)

# Validations
assert result['response'], "Should have response"
assert result['elapsed_ms'] < 90000, "Should complete within 90s (Genie timeout)"
assert "inventory" in result['response'].lower() or "overstock" in result['response'].lower(), "Should address inventory"

print("\n✅ Test 1 PASSED")

## Test 2: Multi-Domain Query (FR-005, FR-006)

Test query spanning customer behavior and inventory.

In [None]:
result = query_agent(
    "What products are frequently abandoned in carts and do we have inventory issues with those items?"
)
print_result(result)

# Validations
assert result['response'], "Should have response"
assert result['elapsed_ms'] < 90000, "Should complete within 90s"

# Check for both domains being addressed
response_lower = result['response'].lower()
has_customer_behavior = any(keyword in response_lower for keyword in ["abandon", "cart", "customer"])
has_inventory = any(keyword in response_lower for keyword in ["inventory", "stock", "overstock", "stockout"])

print(f"\nDomain coverage:")
print(f"  Customer Behavior: {'✅' if has_customer_behavior else '❌'}")
print(f"  Inventory: {'✅' if has_inventory else '❌'}")

assert has_customer_behavior and has_inventory, "Should address both domains"

print("\n✅ Test 2 PASSED")
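
The keyword-based domain check used in this test can be factored into one function. This is a minimal sketch that assumes the same substring heuristic as the cell above; `DOMAIN_KEYWORDS` and `detect_domains` are hypothetical names. Because matching is by substring, "stock" already covers "overstock" and "stockout".

```python
# Keyword lists mirror the ones used in the test cell
DOMAIN_KEYWORDS = {
    "customer_behavior": ["abandon", "cart", "customer"],
    "inventory": ["inventory", "stock"],
}


def detect_domains(text):
    """Return the domains whose keywords appear anywhere in the text."""
    lowered = text.lower()
    return [
        domain
        for domain, keywords in DOMAIN_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]


print(detect_domains("Cart abandonment is high and stock is low"))
print(detect_domains("The weather is sunny"))
```

A substring heuristic only shows that a topic was mentioned, not that the right data source was queried, so it is best treated as a smoke check.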

## Test 3: Context-Aware Follow-Up (FR-011)

Test conversation history and context understanding.

In [None]:
# First query
result1 = query_agent("What are the top customers by purchase amount?")
print_result(result1)

# Follow-up using context
conversation_history = result1['messages']
result2 = query_agent("What products do they purchase most frequently?", conversation_history)
print_result(result2)

# Validations
assert result2['response'], "Should have follow-up response"
assert len(result2['messages']) >= 4, "Should maintain conversation history"

print(f"\nConversation length: {len(result2['messages'])} messages")
print("\n✅ Test 3 PASSED")
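
The `>= 4` check follows from how the history grows: each turn appends one user message and one assistant message. A minimal sketch with placeholder answers (`add_turn` is a hypothetical helper, and no agent is called):

```python
def add_turn(history, user_text, assistant_text):
    """Return a new history with one user/assistant exchange appended."""
    return history + [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": assistant_text},
    ]


history = add_turn([], "What are the top customers by purchase amount?", "<first answer>")
history = add_turn(history, "What products do they purchase most frequently?", "<second answer>")
print(len(history))  # 4
```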

## Test 4: Error Handling (FR-008)

Test graceful handling of out-of-scope queries.

In [None]:
result = query_agent("What is the weather forecast for next week?")
print_result(result)

# Validations
assert result['response'], "Should provide response"
# Agent should politely explain it can't answer weather questions or provide guidance

print("\n✅ Test 4 PASSED")
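
The test above only asserts that some response came back. A stricter check could look for wording that signals a decline. The sketch below is a rough heuristic under the assumption that the agent declines in English with phrases like these; the marker list and `looks_like_decline` are hypothetical and would need tuning against real responses.

```python
# Hypothetical phrases that suggest the agent declined an out-of-scope query
DECLINE_MARKERS = ["can't", "cannot", "unable", "outside", "not able", "don't have"]


def looks_like_decline(text):
    """Return True if the text contains any decline marker."""
    lowered = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in lowered for marker in DECLINE_MARKERS)


print(looks_like_decline("I'm sorry, I cannot answer weather questions."))
print(looks_like_decline("Next week will be sunny."))
```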

## Test 5: Performance (FR-012)

Verify complex queries complete within acceptable time (90s for Genie).

In [None]:
complex_queries = [
    "What is the cart abandonment rate?",
    "Which products are at risk of stockout?",
    "Analyze cart abandonment patterns and correlate with inventory stockouts"
]

for query in complex_queries:
    result = query_agent(query)
    print(f"\nQuery: {query[:60]}...")
    print(f"Time: {result['elapsed_ms']:.0f}ms")
    assert result['elapsed_ms'] < 90000, f"Query exceeded 90s: {result['elapsed_ms']:.0f}ms"

print("\n✅ Test 5 PASSED - All queries under 90s")
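
The timing pattern in this test can be isolated into a small wrapper. This is a sketch, with `timed_call` and `LIMIT_MS` as hypothetical names; it times an arbitrary callable with `time.perf_counter` and uses the built-in `sum` as a stand-in for the agent call.

```python
import time

LIMIT_MS = 90_000  # 90s limit used by the manual tests


def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


result, elapsed_ms = timed_call(sum, [1, 2, 3])
assert elapsed_ms < LIMIT_MS, f"Call exceeded limit: {elapsed_ms:.0f}ms"
print(result)
```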

## Manual Test Summary

In [None]:
print("="*60)
print("MANUAL TEST SUMMARY")
print("="*60)
print(f"Model: {UC_MODEL_NAME}")
print(f"Source: {model_uri}")
print("="*60)
print("✅ Test 1: Single-domain queries (FR-001, FR-003)")
print("✅ Test 2: Multi-domain queries (FR-005, FR-006)")
print("✅ Test 3: Context-aware follow-ups (FR-011)")
print("✅ Test 4: Error handling (FR-008)")
print("✅ Test 5: Performance under 90s (FR-012)")
print("="*60)
print("\n🎉 All manual tests completed!")

---
# Part 2: MLflow Evaluation
---

Run a systematic evaluation using the dataset in `evaluation/eval_dataset.json`.

## Load Evaluation Dataset

In [None]:
# Load evaluation dataset
eval_dataset_path = "../evaluation/eval_dataset.json"

with open(eval_dataset_path, 'r') as f:
    eval_dataset = json.load(f)

print(f"Loaded {len(eval_dataset)} evaluation cases")
print("\nDataset preview:")
for i, case in enumerate(eval_dataset[:3]):
    print(f"  {i+1}. {case['request'][:60]}...")
    print(f"     Expected sources: {case['expected_sources']}")
    print(f"     Complexity: {case['complexity']}")
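
The later cells read `request`, `expected_sources`, and `complexity` from every case, so a missing key would surface as a `KeyError` mid-run. A minimal pre-check, assuming those three keys are the full required set (`find_invalid_cases` is a hypothetical helper and the sample cases are made up):

```python
REQUIRED_KEYS = {"request", "expected_sources", "complexity"}


def find_invalid_cases(cases):
    """Return (index, missing_keys) for every case lacking a required key."""
    problems = []
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems.append((index, sorted(missing)))
    return problems


sample_cases = [
    {"request": "q1", "expected_sources": ["inventory"], "complexity": "simple"},
    {"request": "q2"},
]
print(find_invalid_cases(sample_cases))
```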

## Run Evaluation Queries

Execute all queries from the evaluation dataset and collect results.

In [None]:
# Run all evaluation queries
eval_results = []

print(f"Running {len(eval_dataset)} evaluation queries...\n")
print("="*60)

for i, case in enumerate(eval_dataset):
    request = case['request']
    expected_sources = case['expected_sources']
    complexity = case['complexity']
    
    print(f"\n[{i+1}/{len(eval_dataset)}] {request[:50]}...")
    
    try:
        result = query_agent(request)
        response = result['response']
        elapsed_ms = result['elapsed_ms']
        
        # Check domain coverage
        response_lower = response.lower()
        detected_sources = []
        if any(kw in response_lower for kw in ["customer", "cart", "abandon", "purchase", "segment"]):
            detected_sources.append("customer_behavior")
        if any(kw in response_lower for kw in ["inventory", "stock", "overstock", "stockout"]):
            detected_sources.append("inventory")
        
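        # Calculate metrics
        # Lenient check: passes on an exact match or whenever any domain keyword is detected
        source_match = set(detected_sources) == set(expected_sources) or len(detected_sources) > 0
        # Evaluation limit for FR-012 is 60s, stricter than the 90s asserted in the manual tests
        under_time_limit = elapsed_ms < 60000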
        
        eval_results.append({
            "request": request,
            "response": response,
            "expected_sources": expected_sources,
            "detected_sources": detected_sources,
            "source_match": source_match,
            "complexity": complexity,
            "elapsed_ms": elapsed_ms,
            "under_time_limit": under_time_limit,
            "success": True,
            "error": None
        })
        
        status = "✅" if source_match and under_time_limit else "⚠️"
        print(f"  {status} Time: {elapsed_ms:.0f}ms | Sources: {detected_sources}")
        
    except Exception as e:
        eval_results.append({
            "request": request,
            "response": None,
            "expected_sources": expected_sources,
            "detected_sources": [],
            "source_match": False,
            "complexity": complexity,
            "elapsed_ms": 0,
            "under_time_limit": False,
            "success": False,
            "error": str(e)
        })
        print(f"  ❌ Error: {str(e)[:50]}...")

print("\n" + "="*60)
print(f"Completed {len(eval_results)} queries")
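
The `source_match` rule in the loop passes whenever any domain is detected. A stricter alternative compares the two sets directly. The sketch below assumes `expected_sources` uses the same labels as `detected_sources` (`customer_behavior`, `inventory`), which the dataset would need to confirm; `sources_match` is a hypothetical name.

```python
def sources_match(expected, detected, exact=False):
    """Compare expected and detected source labels.

    exact=False: every expected source must be detected (extras allowed).
    exact=True:  the two sets must be identical.
    """
    expected_set, detected_set = set(expected), set(detected)
    if exact:
        return expected_set == detected_set
    return expected_set <= detected_set


print(sources_match(["inventory"], ["inventory", "customer_behavior"]))
print(sources_match(["inventory"], ["inventory", "customer_behavior"], exact=True))
print(sources_match(["inventory", "customer_behavior"], ["inventory"]))
```

The subset form tolerates responses that mention an extra domain, while the exact form also penalizes them.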

## Calculate Evaluation Metrics

Compute metrics matching `evaluation/eval_config.yaml`.

In [None]:
# Calculate aggregate metrics
total = len(eval_results)
successful = sum(1 for r in eval_results if r['success'])
source_matches = sum(1 for r in eval_results if r['source_match'])
under_time = sum(1 for r in eval_results if r['under_time_limit'])

# Response times (exclude failures)
response_times = [r['elapsed_ms'] for r in eval_results if r['success']]
avg_response_time = sum(response_times) / len(response_times) if response_times else 0
max_response_time = max(response_times) if response_times else 0
min_response_time = min(response_times) if response_times else 0

# Complexity breakdown
simple_queries = [r for r in eval_results if r['complexity'] == 'simple']
complex_queries = [r for r in eval_results if r['complexity'] == 'complex']

simple_success = sum(1 for r in simple_queries if r['success'])
complex_success = sum(1 for r in complex_queries if r['success'])

# Calculate pass rates
success_rate = successful / total if total > 0 else 0
source_match_rate = source_matches / total if total > 0 else 0
time_compliance_rate = under_time / total if total > 0 else 0

print("="*60)
print("EVALUATION METRICS")
print("="*60)

print(f"\n📊 Overall Results:")
print(f"  Total queries: {total}")
print(f"  Successful: {successful}/{total} ({success_rate:.1%})")
print(f"  Source matches: {source_matches}/{total} ({source_match_rate:.1%})")
print(f"  Under time limit: {under_time}/{total} ({time_compliance_rate:.1%})")

print(f"\n⏱️  Response Times:")
print(f"  Average: {avg_response_time:.0f}ms")
print(f"  Min: {min_response_time:.0f}ms")
print(f"  Max: {max_response_time:.0f}ms")

print(f"\n📈 By Complexity:")
print(f"  Simple: {simple_success}/{len(simple_queries)} successful")
print(f"  Complex: {complex_success}/{len(complex_queries)} successful")

# Thresholds from eval_config.yaml
print(f"\n🎯 Threshold Checks (from eval_config.yaml):")
print(f"  Source citation (≥100%): {'✅ PASS' if source_match_rate >= 1.0 else '❌ FAIL'} ({source_match_rate:.1%})")
print(f"  Response time (≥100% <60s): {'✅ PASS' if time_compliance_rate >= 1.0 else '❌ FAIL'} ({time_compliance_rate:.1%})")
print(f"  Success rate (≥70%): {'✅ PASS' if success_rate >= 0.7 else '❌ FAIL'} ({success_rate:.1%})")

print("\n" + "="*60)
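
Keeping the thresholds in one mapping makes the per-metric checks and any overall gate share the same numbers. A sketch using the values printed in the cell above; `THRESHOLDS` and `check_thresholds` are hypothetical names, the sample metrics are made up, and nothing here reads `eval_config.yaml`.

```python
# Minimum acceptable value per metric, mirroring the printed threshold checks
THRESHOLDS = {
    "source_match_rate": 1.0,
    "time_compliance_rate": 1.0,
    "success_rate": 0.7,
}


def check_thresholds(metrics, thresholds=THRESHOLDS):
    """Return {metric_name: passed} for every configured threshold."""
    return {name: metrics[name] >= minimum for name, minimum in thresholds.items()}


sample_metrics = {
    "source_match_rate": 1.0,
    "time_compliance_rate": 0.95,
    "success_rate": 0.8,
}
checks = check_thresholds(sample_metrics)
overall_ok = all(checks.values())
print(checks, overall_ok)
```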

## Create Evaluation DataFrame

In [None]:
# Create results DataFrame for analysis
eval_df = pd.DataFrame(eval_results)

# Display summary table
display_df = eval_df[['request', 'complexity', 'source_match', 'elapsed_ms', 'success']].copy()
display_df['elapsed_s'] = display_df['elapsed_ms'] / 1000
display_df = display_df.drop(columns=['elapsed_ms'])

print("Evaluation Results Table:")
display(display_df)

## Log Evaluation Results to MLflow

In [None]:
# Set MLflow experiment
username = spark.sql("SELECT current_user()").collect()[0][0]
experiment_name = f"/Users/{username}/ml/experiments/multi-genie-agent"

mlflow.set_experiment(experiment_name)

# Log evaluation run
with mlflow.start_run(run_name="evaluation-run") as run:
    # Log metrics
    mlflow.log_metrics({
        "eval_total_queries": total,
        "eval_success_rate": success_rate,
        "eval_source_match_rate": source_match_rate,
        "eval_time_compliance_rate": time_compliance_rate,
        "eval_avg_response_time_ms": avg_response_time,
        "eval_max_response_time_ms": max_response_time,
        "eval_min_response_time_ms": min_response_time,
        "eval_simple_success": simple_success,
        "eval_complex_success": complex_success
    })
    
    # Log parameters
    mlflow.log_params({
        "model_name": UC_MODEL_NAME,
        "model_alias": model_alias,
        "eval_dataset_size": len(eval_dataset)
    })
    
    # Log evaluation results as artifact
    eval_df.to_json("eval_results.json", orient="records", indent=2)
    mlflow.log_artifact("eval_results.json")
    
    eval_run_id = run.info.run_id
    print("✅ Evaluation logged to MLflow")
    print(f"   Run ID: {eval_run_id}")
    print(f"   Experiment: {experiment_name}")
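
The cell above writes `eval_results.json` into the current working directory before logging it. One alternative, sketched here without calling MLflow, is to write the file into a temporary directory and log it from there. `write_results_json` is a hypothetical helper and the sample record is made up.

```python
import json
import os
import tempfile


def write_results_json(results, directory):
    """Write results as indented JSON and return the file path."""
    path = os.path.join(directory, "eval_results.json")
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    path = write_results_json([{"request": "q", "success": True}], tmp_dir)
    # Inside an active run, mlflow.log_artifact(path) would be called here,
    # before the temporary directory is removed.
    with open(path) as f:
        round_trip = json.load(f)

print(round_trip)
```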

## Final Summary

In [None]:
print("="*60)
print("FINAL SUMMARY - Test & Evaluate")
print("="*60)
print(f"\nModel: {UC_MODEL_NAME}")
print(f"Alias: {model_alias}")

print(f"\n📋 Manual Tests:")
print("  ✅ Single-domain queries (FR-001, FR-003)")
print("  ✅ Multi-domain queries (FR-005, FR-006)")
print("  ✅ Context-aware follow-ups (FR-011)")
print("  ✅ Error handling (FR-008)")
print("  ✅ Performance (FR-012)")

print(f"\n📊 MLflow Evaluation:")
print(f"  Queries: {total}")
print(f"  Success rate: {success_rate:.1%}")
print(f"  Source match: {source_match_rate:.1%}")
print(f"  Time compliance: {time_compliance_rate:.1%}")
print(f"  Avg response: {avg_response_time:.0f}ms")

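# Overall gate: success rate >= 70% and time compliance >= 90%
# (looser than the 100% time threshold printed earlier; source-match rate is not gated here)
overall_pass = success_rate >= 0.7 and time_compliance_rate >= 0.9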
print(f"\n{'🎉 EVALUATION PASSED!' if overall_pass else '⚠️  EVALUATION NEEDS ATTENTION'}")

print("\n" + "="*60)
print("Next: Deploy to Model Serving via 03-deploy-agent.ipynb")
print("="*60)

## Next Steps

1. ✅ Manual tests passed
2. ✅ MLflow evaluation completed
3. Review evaluation results in MLflow UI
4. If satisfied, proceed to `03-deploy-agent.ipynb` for Model Serving deployment
5. Set up monitoring and alerting post-deployment