# 04: Deploy Multi-Agent System to Model Serving

Deploy the Unity Catalog registered agent to a Databricks Model Serving endpoint.

**What this notebook does**:
1. Creates or updates a Model Serving endpoint
2. Deploys the UC model with proper configuration
3. Validates the deployment with comprehensive tests
4. Provides endpoint information for integration

**Prerequisites**:
- Model registered in Unity Catalog (`juan_dev.genai.retail_multi_genie_agent`)
- Model tested successfully (see `03-test-agent.ipynb`)
- Proper permissions on Unity Catalog model
- Model Serving permissions in Databricks workspace

**Endpoint Configuration**:
- **Name**: `retail-multi-genie-agent`
- **Model**: `juan_dev.genai.retail_multi_genie_agent`
- **Workload**: CPU Small (suitable for agent orchestration)
- **Scale to Zero**: Enabled (cost optimization)
- **Authentication**: Automatic passthrough via Genie Space resources
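
The settings above correspond to the JSON body that the serving-endpoints REST API accepts when an endpoint is created. The following is a minimal sketch of that mapping, assuming the documented `served_entities` field names; `build_endpoint_payload` is a hypothetical helper (not part of the Databricks SDK) and the version number `1` is a placeholder.

```python
import json

def build_endpoint_payload(endpoint_name, model_name, model_version,
                           workload_size="Small", scale_to_zero=True):
    """Assemble the JSON body for creating a serving endpoint (hypothetical helper)."""
    return {
        "name": endpoint_name,
        "config": {
            "served_entities": [
                {
                    "entity_name": model_name,
                    # The API expects a concrete version number, passed as a string
                    "entity_version": str(model_version),
                    "workload_size": workload_size,
                    "scale_to_zero_enabled": scale_to_zero,
                }
            ]
        },
    }

payload = build_endpoint_payload(
    "retail-multi-genie-agent",
    "juan_dev.genai.retail_multi_genie_agent",
    model_version=1,  # placeholder version number
)
print(json.dumps(payload, indent=2))
```

The notebook cells below build the same structure through the SDK's `EndpointCoreConfigInput` and `ServedEntityInput` classes instead of raw JSON.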

## Install Dependencies

In [None]:
%pip install --quiet --upgrade databricks-sdk mlflow
dbutils.library.restartPython()

## Configuration

Configure the endpoint and model details.

In [None]:
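from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput
import mlflow
from mlflow import MlflowClient
import time

# Initialize workspace client
w = WorkspaceClient()

# Unity Catalog model configuration
UC_MODEL_NAME = "juan_dev.genai.retail_multi_genie_agent"

# Endpoint configuration
ENDPOINT_NAME = "retail-multi-genie-agent"

# Model version selection (choose one).
# The endpoint's `entity_version` takes a version number, not an alias,
# so an alias has to be resolved to the version it currently points at.
mlflow.set_registry_uri("databricks-uc")
mlflow_client = MlflowClient()

# Option 1: Use alias (recommended for production)
MODEL_ALIAS = "challenger"  # or "champion", "staging"
MODEL_VERSION = str(mlflow_client.get_model_version_by_alias(UC_MODEL_NAME, MODEL_ALIAS).version)
MODEL_REFERENCE = f"models:/{UC_MODEL_NAME}@{MODEL_ALIAS} (version {MODEL_VERSION})"

# Option 2: Use specific version (uncomment to use)
# MODEL_VERSION = "1"
# MODEL_REFERENCE = f"models:/{UC_MODEL_NAME}/{MODEL_VERSION}"

# Option 3: Use latest version (uncomment to use)
# Unity Catalog has no "latest" keyword, so take the highest version number
# versions = mlflow_client.search_model_versions(f"name='{UC_MODEL_NAME}'")
# MODEL_VERSION = str(max(int(v.version) for v in versions))
# MODEL_REFERENCE = f"models:/{UC_MODEL_NAME}/{MODEL_VERSION}"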

# Workload configuration
WORKLOAD_SIZE = "Small"  # Small, Medium, Large
WORKLOAD_TYPE = "CPU"     # CPU workload (not GPU - this is orchestration, not inference)
SCALE_TO_ZERO = True      # Enable cost optimization

print(f"Endpoint Name: {ENDPOINT_NAME}")
print(f"Model Reference: {MODEL_REFERENCE}")
print(f"Workload: {WORKLOAD_TYPE} {WORKLOAD_SIZE}")
print(f"Scale to Zero: {SCALE_TO_ZERO}")

## Check Existing Endpoint

Check if the endpoint already exists.

In [None]:
from databricks.sdk.errors import NotFound

# Check if endpoint exists
endpoint_exists = False
try:
    existing_endpoint = w.serving_endpoints.get(ENDPOINT_NAME)
    endpoint_exists = True
    print(f"✅ Endpoint '{ENDPOINT_NAME}' already exists")
    print(f"   Current state: {existing_endpoint.state.config_update}")
    print("\nWill UPDATE existing endpoint with new model version")
except NotFound:
    # Only a 404 means the endpoint is missing; auth or network errors should surface
    print(f"ℹ️  Endpoint '{ENDPOINT_NAME}' does not exist")
    print("\nWill CREATE new endpoint")

## Create or Update Serving Endpoint

Deploy the model to a Model Serving endpoint.

In [None]:
# Endpoint configuration
endpoint_config = EndpointCoreConfigInput(
    name=ENDPOINT_NAME,
    served_entities=[
        ServedEntityInput(
            entity_name=UC_MODEL_NAME,
            entity_version=MODEL_VERSION,
            workload_size=WORKLOAD_SIZE,
            scale_to_zero_enabled=SCALE_TO_ZERO
        )
    ]
)

if endpoint_exists:
    # Update existing endpoint
    print(f"Updating endpoint '{ENDPOINT_NAME}'...")
    w.serving_endpoints.update_config(
        name=ENDPOINT_NAME,
        served_entities=endpoint_config.served_entities
    )
    print("✅ Endpoint update initiated")
else:
    # Create new endpoint
    print(f"Creating endpoint '{ENDPOINT_NAME}'...")
    w.serving_endpoints.create(
        name=ENDPOINT_NAME,
        config=endpoint_config
    )
    print("✅ Endpoint creation initiated")

print(f"\nDeployment started. Waiting for endpoint to be ready...")

## Wait for Endpoint Ready

Poll the endpoint until it reaches READY state.

**Note**: This can take 10-15 minutes for initial deployment.

In [None]:
import time
from datetime import datetime

def wait_for_endpoint_ready(endpoint_name, timeout_minutes=20):
    """Wait until the endpoint is READY and no config update is pending."""
    start_time = time.time()
    timeout_seconds = timeout_minutes * 60

    print(f"Waiting for endpoint '{endpoint_name}' to be ready...")
    print(f"Timeout: {timeout_minutes} minutes\n")

    last_state = None
    while (time.time() - start_time) < timeout_seconds:
        try:
            endpoint = w.serving_endpoints.get(endpoint_name)
            # The SDK returns enums here, so compare on their string values
            ready = getattr(endpoint.state.ready, "value", endpoint.state.ready)
            update = getattr(endpoint.state.config_update, "value", endpoint.state.config_update)
            current_state = (ready, update)

            # Print status update if changed
            if current_state != last_state:
                elapsed = int(time.time() - start_time)
                timestamp = datetime.now().strftime("%H:%M:%S")
                print(f"[{timestamp}] ({elapsed}s) Ready: {ready}, Config update: {update}")
                last_state = current_state

            # Check for failure first: an existing endpoint can stay READY
            # (serving the old config) while the new config fails to deploy
            if update in ("UPDATE_FAILED", "UPDATE_CANCELED"):
                print(f"\n❌ Endpoint deployment failed with config update status: {update}")
                return False

            # Ready only counts once the config update has finished
            if ready == "READY" and update in (None, "NOT_UPDATING"):
                elapsed = int(time.time() - start_time)
                print(f"\n✅ Endpoint is READY! (took {elapsed}s)")
                return True

            # Wait before next check
            time.sleep(30)  # Check every 30 seconds

        except Exception as e:
            print(f"Error checking endpoint status: {e}")
            time.sleep(30)

    print(f"\n⚠️  Timeout after {timeout_minutes} minutes")
    return False

# Wait for endpoint
is_ready = wait_for_endpoint_ready(ENDPOINT_NAME, timeout_minutes=20)

if not is_ready:
    print("\n⚠️  Endpoint is not ready. Check the Model Serving UI for details.")
    print(f"   UI: /ml/endpoints/{ENDPOINT_NAME}")
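
The polling loop above can be separated from the Databricks client so that the wait logic is testable without a live endpoint. The following is a sketch under that assumption: `poll_until` and its parameters are hypothetical names, and the state tuples are simulated stand-ins for the `ready` and `config_update` values of a real endpoint.

```python
import time

def poll_until(fetch_state, is_done, is_failed, timeout_s=1200, interval_s=30,
               sleep=time.sleep, clock=time.monotonic):
    """Call fetch_state() until is_done or is_failed matches, or the timeout elapses.

    Returns (outcome, last_state), where outcome is "ready", "failed" or "timeout".
    """
    deadline = clock() + timeout_s
    state = None
    while clock() < deadline:
        state = fetch_state()
        if is_failed(state):
            return "failed", state
        if is_done(state):
            return "ready", state
        sleep(interval_s)
    return "timeout", state

# Simulated (ready, config_update) sequence; a real caller would read
# these values from w.serving_endpoints.get(...).state instead.
states = iter([
    ("NOT_READY", "IN_PROGRESS"),
    ("NOT_READY", "IN_PROGRESS"),
    ("READY", "NOT_UPDATING"),
])
outcome, final_state = poll_until(
    fetch_state=lambda: next(states),
    is_done=lambda s: s == ("READY", "NOT_UPDATING"),
    is_failed=lambda s: s[1] == "UPDATE_FAILED",
    sleep=lambda _: None,  # skip real waiting in this demo
)
print(outcome, final_state)
```

Injecting `sleep` and `clock` keeps the demo instantaneous; in the notebook the defaults would apply.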

## Get Endpoint Information

Display endpoint details and scoring URI.

In [None]:
# Get endpoint details
endpoint = w.serving_endpoints.get(ENDPOINT_NAME)

print("="*60)
print("ENDPOINT INFORMATION")
print("="*60)
print(f"Name: {endpoint.name}")
print(f"State: {endpoint.state.ready}")
print(f"\nModel:")
print(f"  - {UC_MODEL_NAME}")
print(f"  - Version/Alias: {MODEL_VERSION}")
print(f"\nConfiguration:")
print(f"  - Workload: {WORKLOAD_TYPE} {WORKLOAD_SIZE}")
print(f"  - Scale to Zero: {SCALE_TO_ZERO}")

# Get serving URI
if hasattr(endpoint, 'config') and hasattr(endpoint.config, 'served_entities'):
    for entity in endpoint.config.served_entities:
        print(f"\nServed Entity:")
        print(f"  - Name: {entity.entity_name}")
        print(f"  - Version: {entity.entity_version}")

print(f"\nEndpoint URL: /ml/endpoints/{ENDPOINT_NAME}")
print("="*60)

## Test Endpoint - Single Domain Query

Test the deployed endpoint with a simple inventory query.

In [None]:
import requests
import json
import time

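# Get workspace URL and token
workspace_url = w.config.host
# w.config.token is only populated for token-based auth; otherwise reuse the
# Authorization header the SDK resolves for the current auth method
token = w.config.token or w.config.authenticate()["Authorization"].split(" ", 1)[1]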

# Endpoint URL
endpoint_url = f"{workspace_url}/serving-endpoints/{ENDPOINT_NAME}/invocations"

# Test query
test_query = "What 5 products are at risk for overstock?"

# Request payload
payload = {
    "input": [
        {"role": "user", "content": test_query}
    ]
}

print(f"Testing endpoint with query: {test_query}\n")

# Make request
start_time = time.time()
response = requests.post(
    endpoint_url,
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json"
    },
    json=payload
)
elapsed_ms = (time.time() - start_time) * 1000

# Check response
if response.status_code == 200:
    result = response.json()
    print(f"Response Time: {elapsed_ms:.0f}ms")
    print(f"\nResponse:")
    
    # Extract and display response text
    if 'output' in result and len(result['output']) > 0:
        response_text = result['output'][0].get('text', str(result['output'][0]))
        print(response_text)
    else:
        print(json.dumps(result, indent=2))
    
    # Validation
    assert elapsed_ms < 90000, f"Response time {elapsed_ms}ms exceeds 90s limit"
    print("\n✅ Single domain query test PASSED")
else:
    print(f"❌ Request failed with status {response.status_code}")
    print(response.text)
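
The test above reads `result['output'][0]['text']` and falls back to the raw item when that key is missing. Depending on how the agent was logged, the text may instead sit one level deeper, inside a `content` list. The following sketch tolerates both shapes. `extract_text` is a hypothetical helper, and both sample bodies are made-up illustrations, not captured endpoint responses.

```python
def extract_text(result):
    """Collect text from a response body, tolerating two possible output shapes."""
    texts = []
    for item in result.get("output", []):
        if not isinstance(item, dict):
            continue
        if isinstance(item.get("text"), str):
            # Flat shape: {"text": "..."}
            texts.append(item["text"])
        for part in item.get("content") or []:
            # Nested shape: {"content": [{"type": "output_text", "text": "..."}]}
            if isinstance(part, dict) and isinstance(part.get("text"), str):
                texts.append(part["text"])
    return "\n".join(texts)

# Made-up sample bodies for illustration only
flat = {"output": [{"text": "Top overstock risks: A, B"}]}
nested = {
    "output": [
        {
            "type": "message",
            "content": [{"type": "output_text", "text": "Top overstock risks: A, B"}],
        }
    ]
}
print(extract_text(flat))
print(extract_text(nested))
```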

## Test Endpoint - Multi-Domain Query

Test with a query that spans both customer behavior and inventory domains.

In [None]:
# Multi-domain test query
test_query = "What products are frequently abandoned in carts and do we have inventory issues with those items?"

# Request payload
payload = {
    "input": [
        {"role": "user", "content": test_query}
    ]
}

print(f"Testing multi-domain query: {test_query}\n")

# Make request
start_time = time.time()
response = requests.post(
    endpoint_url,
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json"
    },
    json=payload
)
elapsed_ms = (time.time() - start_time) * 1000

# Check response
if response.status_code == 200:
    result = response.json()
    print(f"Response Time: {elapsed_ms:.0f}ms")
    print(f"\nResponse:")
    
    # Extract and display response text
    if 'output' in result and len(result['output']) > 0:
        response_text = result['output'][0].get('text', str(result['output'][0]))
        print(response_text)
        
        # Check domain coverage
        response_lower = response_text.lower()
        has_customer = any(kw in response_lower for kw in ["abandon", "cart", "customer"])
        has_inventory = any(kw in response_lower for kw in ["inventory", "stock", "overstock"])
        
        print(f"\nDomain Coverage:")
        print(f"  Customer Behavior: {'✅' if has_customer else '❌'}")
        print(f"  Inventory: {'✅' if has_inventory else '❌'}")
    else:
        print(json.dumps(result, indent=2))
    
    # Validation
    assert elapsed_ms < 90000, f"Response time {elapsed_ms}ms exceeds 90s limit"
    print("\n✅ Multi-domain query test PASSED")
else:
    print(f"❌ Request failed with status {response.status_code}")
    print(response.text)

## Test Endpoint - Conversation Context

Test conversation history with follow-up questions.

In [None]:
# First query
query1 = "What are the top 3 customers by purchase amount?"

payload1 = {
    "input": [
        {"role": "user", "content": query1}
    ]
}

print(f"Query 1: {query1}\n")

# First request
response1 = requests.post(
    endpoint_url,
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json"
    },
    json=payload1
)

if response1.status_code == 200:
    result1 = response1.json()
    response_text1 = result1['output'][0].get('text', str(result1['output'][0]))
    print(f"Response 1:\n{response_text1}\n")
    
    # Follow-up query with context
    query2 = "What products do they purchase most frequently?"
    
    payload2 = {
        "input": [
            {"role": "user", "content": query1},
            {"role": "assistant", "content": response_text1},
            {"role": "user", "content": query2}
        ]
    }
    
    print(f"Query 2 (follow-up): {query2}\n")
    
    # Second request with context
    response2 = requests.post(
        endpoint_url,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json"
        },
        json=payload2
    )
    
    if response2.status_code == 200:
        result2 = response2.json()
        response_text2 = result2['output'][0].get('text', str(result2['output'][0]))
        print(f"Response 2:\n{response_text2}")
        print("\n✅ Conversation context test PASSED")
    else:
        print(f"❌ Follow-up request failed: {response2.status_code}")
else:
    print(f"❌ Initial request failed: {response1.status_code}")
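
The follow-up request above works because the endpoint receives the whole conversation on every call: the client, not the endpoint, carries the history. The following sketch shows that bookkeeping. `record_turn` and `build_followup_payload` are hypothetical helper names, and the assistant answer is a made-up placeholder.

```python
def record_turn(history, question, answer):
    """Return a new history list with one user/assistant exchange appended."""
    return history + [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]

def build_followup_payload(history, question):
    """Return a request body containing all prior turns plus a new user question."""
    messages = [dict(turn) for turn in history]  # copy so the caller's list is untouched
    messages.append({"role": "user", "content": question})
    return {"input": messages}

history = record_turn(
    [],
    "What are the top 3 customers by purchase amount?",
    "(assistant answer from the first request)",  # placeholder text
)
followup = build_followup_payload(history, "What products do they purchase most frequently?")
print([m["role"] for m in followup["input"]])
```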

## Performance Validation

Run multiple queries to validate consistent performance.

In [None]:
# Performance test queries
test_queries = [
    "What is the cart abandonment rate?",
    "Which products are at risk of stockout?",
    "Show me customer segments with high purchase frequency",
    "What items have overstock issues?",
    "Analyze cart abandonment by product category"
]

print("Running performance tests...\n")
print("="*60)

response_times = []
success_count = 0

for i, query in enumerate(test_queries, 1):
    payload = {
        "input": [{"role": "user", "content": query}]
    }
    
    start_time = time.time()
    response = requests.post(
        endpoint_url,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json"
        },
        json=payload
    )
    elapsed_ms = (time.time() - start_time) * 1000
    
    if response.status_code == 200:
        response_times.append(elapsed_ms)
        success_count += 1
        status = "✅" if elapsed_ms < 90000 else "⚠️"
        print(f"{status} Query {i}: {elapsed_ms:.0f}ms")
        print(f"   {query[:50]}..." if len(query) > 50 else f"   {query}")
    else:
        print(f"❌ Query {i}: Failed ({response.status_code})")
        print(f"   {query[:50]}..." if len(query) > 50 else f"   {query}")

print("="*60)
print(f"\nPerformance Summary:")
print(f"  Successful: {success_count}/{len(test_queries)}")
if response_times:
    print(f"  Avg Response Time: {sum(response_times)/len(response_times):.0f}ms")
    print(f"  Min Response Time: {min(response_times):.0f}ms")
    print(f"  Max Response Time: {max(response_times):.0f}ms")
    
    # Check if all under 90s
    all_under_limit = all(t < 90000 for t in response_times)
    if all_under_limit:
        print("\n✅ All queries completed under 90s limit")
    else:
        print("\n⚠️  Some queries exceeded 90s limit")
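
The average reported above can hide a single slow outlier, so a median and a high percentile are often more informative. The following sketch computes them with the standard library. `latency_summary` is a hypothetical helper, the nearest-rank p95 is one of several common percentile definitions, and the sample timings are made up.

```python
import math
import statistics

def latency_summary(times_ms, limit_ms=90_000):
    """Summarize response times: median, nearest-rank p95, max, and count at or over the limit."""
    if not times_ms:
        raise ValueError("times_ms must not be empty")
    ordered = sorted(times_ms)
    # Nearest-rank p95: smallest sample with at least 95% of samples at or below it
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
        "over_limit": sum(1 for t in ordered if t >= limit_ms),
    }

sample_times = [4200, 5100, 3900, 61000, 4800]  # made-up timings in milliseconds
summary = latency_summary(sample_times)
print(summary)
```

With only five samples the p95 is simply the slowest request. A larger query set is needed before the percentile says more than the maximum does.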

## Endpoint Usage Example

Example code for calling the endpoint from external applications.

In [None]:
print("="*60)
print("ENDPOINT USAGE EXAMPLE")
print("="*60)

print(f"""\n# Python Example
import requests

endpoint_url = "{workspace_url}/serving-endpoints/{ENDPOINT_NAME}/invocations"
token = "YOUR_DATABRICKS_TOKEN"

payload = {{
    "input": [
        {{"role": "user", "content": "What products are at risk for overstock?"}}
    ]
}}

response = requests.post(
    endpoint_url,
    headers={{
        "Authorization": f"Bearer {{token}}",
        "Content-Type": "application/json"
    }},
    json=payload
)

result = response.json()
print(result['output'][0]['text'])
""")

print(f"""\n# cURL Example
curl -X POST "{workspace_url}/serving-endpoints/{ENDPOINT_NAME}/invocations" \\
  -H "Authorization: Bearer YOUR_DATABRICKS_TOKEN" \\
  -H "Content-Type: application/json" \\
  -d '{{
    "input": [
      {{"role": "user", "content": "What products are at risk for overstock?"}}
    ]
  }}'
""")

print("="*60)
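
For callers that cannot install `requests`, the same invocation can be built with the standard library. The following sketch only constructs the request and does not send it. `build_invocation_request` is a hypothetical helper, and the host `https://example.cloud.databricks.com` is a placeholder, not a real workspace.

```python
import json
import urllib.request

def build_invocation_request(host, endpoint_name, token, question):
    """Build (but do not send) a POST request for the endpoint's invocations URL."""
    url = f"{host.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    body = json.dumps({"input": [{"role": "user", "content": question}]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

req = build_invocation_request(
    "https://example.cloud.databricks.com/",  # placeholder host
    "retail-multi-genie-agent",
    "YOUR_DATABRICKS_TOKEN",
    "What products are at risk for overstock?",
)
print(req.get_method(), req.full_url)
```

To send it, pass the request to `urllib.request.urlopen(req, timeout=120)`. An explicit timeout matters here because agent responses can take tens of seconds.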

## Deployment Summary

Review the deployment status and next steps.

In [None]:
print("="*60)
print("DEPLOYMENT SUMMARY")
print("="*60)
print(f"\n✅ Endpoint: {ENDPOINT_NAME}")
print(f"✅ Model: {UC_MODEL_NAME}")
print(f"✅ Version: {MODEL_VERSION}")
print(f"✅ State: {endpoint.state.ready}")
print(f"✅ Workload: {WORKLOAD_TYPE} {WORKLOAD_SIZE}")
print(f"✅ Scale to Zero: {SCALE_TO_ZERO}")

print("\n📊 Test Results:")
print("  ✅ Single domain queries")
print("  ✅ Multi-domain queries")
print("  ✅ Conversation context")
print("  ✅ Performance validation")

print(f"\n🔗 Endpoint URL: {workspace_url}/ml/endpoints/{ENDPOINT_NAME}")
print(f"\n🎯 Scoring URI: {workspace_url}/serving-endpoints/{ENDPOINT_NAME}/invocations")

print("\n" + "="*60)
print("✅ Deployment successful! Agent is ready for production use.")
print("="*60)

## Cleanup (Optional)

**WARNING**: Only run this if you want to delete the endpoint!

Uncomment and run to delete the serving endpoint.

In [None]:
# ⚠️ UNCOMMENT TO DELETE ENDPOINT
# print(f"Deleting endpoint '{ENDPOINT_NAME}'...")
# w.serving_endpoints.delete(ENDPOINT_NAME)
# print("✅ Endpoint deleted")

## Next Steps

Now that your agent is deployed:

### 1. **Integration**
- Integrate endpoint into applications using the examples above
- Use Databricks SDK or REST API
- Authenticate with a Databricks personal access token or an OAuth token for a service principal

### 2. **Monitoring**
- Monitor endpoint metrics in Databricks UI: `/ml/endpoints/{ENDPOINT_NAME}`
- Track request latency, throughput, and errors
- Set up alerts for performance degradation

### 3. **Scaling**
- **Current**: CPU Small (0-4 concurrent requests)
- **Medium**: 8-16 concurrent requests
- **Large**: 16-64 concurrent requests
- Update endpoint configuration as load increases
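
The concurrency ranges above can be turned into a simple sizing rule. The following sketch picks the smallest size whose upper bound covers an expected peak. `pick_workload_size` is a hypothetical helper, and the thresholds are taken directly from the list above, not queried from the platform.

```python
def pick_workload_size(peak_concurrency):
    """Return the smallest workload size whose upper bound covers the expected peak."""
    if peak_concurrency < 0:
        raise ValueError("peak_concurrency must be non-negative")
    # Upper bounds of the ranges listed above
    tiers = [("Small", 4), ("Medium", 16), ("Large", 64)]
    for name, upper in tiers:
        if peak_concurrency <= upper:
            return name
    raise ValueError("peak exceeds the Large tier; consider more than one served entity")

for peak in (3, 10, 40):
    print(peak, pick_workload_size(peak))
```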

### 4. **Model Versioning**
- Test new versions in Unity Catalog
- Use aliases for staged rollouts:
  - `challenger` → testing
  - `staging` → pre-production
  - `champion` → production
- Update endpoint to new version when ready
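
Promoting a version along the alias chain amounts to reading the version behind one alias and pointing the next alias at it. The following sketch does this against any object exposing `get_model_version_by_alias` and `set_registered_model_alias`, the two methods `mlflow.MlflowClient` provides for this purpose. `promote_alias` and `FakeRegistry` are hypothetical names, and the fake registry exists only so the flow runs without a workspace.

```python
from types import SimpleNamespace

def promote_alias(client, model_name, from_alias, to_alias):
    """Point to_alias at the version currently holding from_alias; return that version."""
    version = client.get_model_version_by_alias(model_name, from_alias).version
    client.set_registered_model_alias(model_name, to_alias, version)
    return version

class FakeRegistry:
    """In-memory stand-in for mlflow.MlflowClient, used only to demonstrate the flow."""

    def __init__(self, aliases):
        self.aliases = dict(aliases)

    def get_model_version_by_alias(self, name, alias):
        return SimpleNamespace(version=self.aliases[alias])

    def set_registered_model_alias(self, name, alias, version):
        self.aliases[alias] = version

registry = FakeRegistry({"challenger": "3", "champion": "2"})  # made-up versions
promoted = promote_alias(
    registry, "juan_dev.genai.retail_multi_genie_agent", "challenger", "champion"
)
print(promoted, registry.aliases)
```

Moving an alias does not change what a running endpoint serves, because the endpoint is pinned to a version number. The endpoint configuration still has to be updated to the promoted version.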

### 5. **Cost Optimization**
- Scale-to-zero enabled (endpoint scales down when idle)
- Monitor usage patterns
- Adjust workload size based on actual traffic

### 6. **Governance**
- Model registered in Unity Catalog
- Genie Space permissions automatically enforced
- Audit logs available in Databricks
- Version history tracked in UC