# Optimization: Multi-Region Routing

----

This notebook provides **hands-on testing** of multi-region routing strategies using **Azure API Management (APIM)** as a GenAI gateway. It builds on the APIM infrastructure from `3_model_migration.ipynb` and allows you to:

- **Configure and test** different routing policies in real-time
- **Measure actual latency** and distribution across backends
- **Implement circuit breaker** patterns for resilience
- **Validate PTU + PAYG spillover** under simulated load

**Reference**: [Azure-Samples/AI-Gateway - Backend Pool Load Balancing](https://github.com/Azure-Samples/AI-Gateway/tree/main/labs/backend-pool-load-balancing)

> **Prerequisites**: Run `3_model_migration.ipynb` first to set up APIM infrastructure and backend configuration.

## Table of Contents

- [Why Multi-Region Routing?](#why-multi-region-routing)
- [Setup and Backend Configuration](#setup-and-backend-configuration)
- [Part 1: Baseline Testing](#part-1-baseline-testing)
- [Part 2: Latency-Based Routing](#part-2-latency-based-routing)
- [Part 3: Weighted Round Robin](#part-3-weighted-round-robin)
- [Part 4: Priority-Based with PTU Spillover](#part-4-priority-based-with-ptu-spillover)
- [Part 5: Circuit Breaker Pattern](#part-5-circuit-breaker-pattern)
- [Part 6: Combined Strategy Test](#part-6-combined-strategy-test)
- [Results Analysis](#results-analysis)
- [Wrap-up](#wrap-up)

## Why Multi-Region Routing?

### Challenge: Single Region Limitations

| Issue | Impact |
|-------|--------|
| **Capacity limits** | 429 rate limiting during peak traffic |
| **Regional latency** | Users far from deployment experience slow responses |
| **Single point of failure** | Regional outages affect all users |
| **PTU underutilization** | Spikes exceed PTU capacity, wasting reserved throughput |

### Solution: APIM Backend Pool with Load Balancing

```
                                    ‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
                              ‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚ñ∫‚îÇ Backend A (PTU) ‚îÇ Priority 1
‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê    ‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê    ‚îÇ     ‚îÇ East US         ‚îÇ
‚îÇ Client ‚îú‚îÄ‚îÄ‚îÄ‚ñ∫‚îÇ   APIM   ‚îú‚îÄ‚îÄ‚îÄ‚îÄ‚î§     ‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îò
‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îò    ‚îÇ Gateway  ‚îÇ    ‚îÇ     ‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
              ‚îÇ          ‚îÇ    ‚îú‚îÄ‚îÄ‚îÄ‚îÄ‚ñ∫‚îÇ Backend B (PAYG)‚îÇ Priority 2 / Weight 50
              ‚îÇ Backend  ‚îÇ    ‚îÇ     ‚îÇ West US         ‚îÇ
              ‚îÇ  Pool    ‚îÇ    ‚îÇ     ‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îò
              ‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îò    ‚îÇ     ‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
                              ‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚ñ∫‚îÇ Backend C (PAYG)‚îÇ Priority 2 / Weight 50
                                    ‚îÇ Europe          ‚îÇ
                                    ‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îò
```

### APIM Backend Pool Features (from AI-Gateway patterns)

| Feature | Configuration | Behavior |
|---------|--------------|----------|
| **Priority** | `priority: 1` vs `priority: 2` | Lower priority = preferred; higher = failover |
| **Weight** | `weight: 50` within same priority | Distributes load proportionally |
| **Circuit Breaker** | `failureCondition.count: 3` | Trips after N failures in interval |
| **Retry-After** | `acceptRetryAfter: true` | Respects backend's 429 Retry-After header |

## Setup and Backend Configuration

This notebook **reuses the APIM setup from `3_model_migration.ipynb`**. You should have:

1. An APIM service deployed (`APIM_SERVICE_NAME`)
2. Backend A and Backend B configured with their endpoints and API keys
3. The APIM subscription key available

If these are not set up, run `3_model_migration.ipynb` first.

In [None]:
# Environment setup and PATH configuration
import json
import os
import shutil
import subprocess
import time
import random
import statistics
import requests
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Any, Optional, Tuple
from dataclasses import dataclass, field
from collections import Counter
from xml.sax.saxutils import escape as xml_escape
from dotenv import load_dotenv

load_dotenv(override=True)

# Ensure the notebook kernel can find Azure CLI (`az`) on PATH
possible_paths = [
    '/opt/homebrew/bin',   # macOS (Apple Silicon)
    '/usr/local/bin',      # macOS (Intel) / Linux
    '/usr/bin',            # Linux / Codespaces
    '/home/linuxbrew/.linuxbrew/bin',  # Linux Homebrew
]

az_path = None
try:
    result = subprocess.run(['which', 'az'], capture_output=True, text=True)
    if result.returncode == 0:
        az_path = os.path.dirname(result.stdout.strip())
        print(f'üîç Azure CLI found: {result.stdout.strip()}')
except Exception:
    pass

paths_to_add: list[str] = []
if az_path and az_path not in os.environ.get('PATH', ''):
    paths_to_add.append(az_path)
else:
    for path in possible_paths:
        if os.path.exists(path) and path not in os.environ.get('PATH', ''):
            paths_to_add.append(path)

if paths_to_add:
    os.environ['PATH'] = ':'.join(paths_to_add) + ':' + os.environ.get('PATH', '')
    print(f"‚úÖ Added to PATH: {', '.join(paths_to_add)}")
else:
    print('‚úÖ PATH looks good already')

print(f"\nPATH (first 150 chars): {os.environ['PATH'][:150]}...")

In [None]:
# Load Foundry project settings and APIM configuration
from azure.identity import DefaultAzureCredential

config_file = '../0_setup/.foundry_config.json'
try:
    with open(config_file, 'r', encoding='utf-8') as f:
        config = json.load(f)
except FileNotFoundError as e:
    print(f"‚ö†Ô∏è Could not find '{config_file}'.")
    print('üí° Run 0_setup/1_setup.ipynb first to create it.')
    raise e

# Project variables from config
FOUNDRY_NAME = config.get('FOUNDRY_NAME')
RESOURCE_GROUP = config.get('RESOURCE_GROUP')
LOCATION = config.get('LOCATION')
AZURE_AI_PROJECT_ENDPOINT = config.get('AZURE_AI_PROJECT_ENDPOINT')
AZURE_SUBSCRIPTION_ID = config.get('AZURE_SUBSCRIPTION_ID', '')

# Azure OpenAI variables from env
AZURE_OPENAI_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME = os.environ.get("AZURE_OPENAI_CHAT_DEPLOYMENT_NAME")
AZURE_OPENAI_API_VERSION = os.environ.get("AZURE_OPENAI_API_VERSION")

os.environ['FOUNDRY_NAME'] = FOUNDRY_NAME or ''
os.environ['LOCATION'] = LOCATION or ''
os.environ['RESOURCE_GROUP'] = RESOURCE_GROUP or ''
os.environ['AZURE_SUBSCRIPTION_ID'] = AZURE_SUBSCRIPTION_ID

print(f"‚úÖ Loaded Foundry config from '{config_file}'.")
print(f"üìå Foundry: {FOUNDRY_NAME} | RG: {RESOURCE_GROUP} | Location: {LOCATION}")

# Initialize credential for Azure services
credential = DefaultAzureCredential()

## Backend Configuration (Reusing 3_model_migration Pattern)

This configuration mirrors `3_model_migration.ipynb` to allow seamless integration. 
You can configure up to 3 backends representing different regions or deployment types (PTU/PAYG).

**Environment variables** (set in `.env` or export):
- `BACKEND_A_AZURE_OPENAI_ENDPOINT`, `BACKEND_A_AZURE_OPENAI_API_KEY`, `BACKEND_A_DEPLOYMENT`
- `BACKEND_B_AZURE_OPENAI_ENDPOINT`, `BACKEND_B_AZURE_OPENAI_API_KEY`, `BACKEND_B_DEPLOYMENT`
- `BACKEND_C_AZURE_OPENAI_ENDPOINT`, `BACKEND_C_AZURE_OPENAI_API_KEY`, `BACKEND_C_DEPLOYMENT` (optional)

In [None]:
# Backend configuration (aligned with 3_model_migration.ipynb pattern)
# ====================================================================
# APIM configuration
APIM_LOCATION = os.environ.get("APIM_LOCATION", "eastus")
APIM_RESOURCE_GROUP = os.environ.get("APIM_RESOURCE_GROUP", "rg-model-migration")
APIM_SERVICE_NAME = os.environ.get("APIM_SERVICE_NAME", "apim-model-migration")
APIM_API_ID = "multi-region-router"
APIM_API_PATH = "routing"
RESPONSES_API_VERSION = "2025-04-01-preview"

# Backend A: Primary (e.g., PAYG in East US)
BACKEND_A = {
    "id": "backend-a",
    "label": "PAYG-EastUS",
    "endpoint": os.environ.get("BACKEND_A_AZURE_OPENAI_ENDPOINT", AZURE_OPENAI_ENDPOINT or ""),
    "api_key": os.environ.get("BACKEND_A_AZURE_OPENAI_API_KEY", AZURE_OPENAI_API_KEY or ""),
    "deployment": os.environ.get("BACKEND_A_DEPLOYMENT", AZURE_OPENAI_CHAT_DEPLOYMENT_NAME or ""),
    "region": os.environ.get("BACKEND_A_REGION", "eastus"),
    "type": "PAYG",  # PTU or PAYG
    "priority": 1,
    "weight": 100,
}

# Backend B: Secondary (e.g., PAYG in West US)
BACKEND_B = {
    "id": "backend-b",
    "label": "PAYG-WestUS",
    "endpoint": os.environ.get("BACKEND_B_AZURE_OPENAI_ENDPOINT", ""),
    "api_key": os.environ.get("BACKEND_B_AZURE_OPENAI_API_KEY", ""),
    "deployment": os.environ.get("BACKEND_B_DEPLOYMENT", ""),
    "region": os.environ.get("BACKEND_B_REGION", "westus"),
    "type": "PAYG",
    "priority": 2,
    "weight": 50,
}

# Backend C: Tertiary (e.g., PAYG in Europe) - Optional
BACKEND_C = {
    "id": "backend-c",
    "label": "PAYG-Europe",
    "endpoint": os.environ.get("BACKEND_C_AZURE_OPENAI_ENDPOINT", ""),
    "api_key": os.environ.get("BACKEND_C_AZURE_OPENAI_API_KEY", ""),
    "deployment": os.environ.get("BACKEND_C_DEPLOYMENT", ""),
    "region": os.environ.get("BACKEND_C_REGION", "westeurope"),
    "type": "PAYG",
    "priority": 2,
    "weight": 50,
}

# Build list of configured backends
ALL_BACKENDS = [BACKEND_A]
if BACKEND_B["endpoint"] and BACKEND_B["api_key"]:
    ALL_BACKENDS.append(BACKEND_B)
if BACKEND_C["endpoint"] and BACKEND_C["api_key"]:
    ALL_BACKENDS.append(BACKEND_C)

# APIM subscription key
APIM_SUBSCRIPTION_KEY = os.environ.get("APIM_SUBSCRIPTION_KEY", "")

print("üîß Multi-Region Routing Configuration")
print("=" * 70)
print(f"APIM: {APIM_SERVICE_NAME}.azure-api.net/{APIM_API_PATH}")
print(f"API ID: {APIM_API_ID}")
print(f"Configured backends: {len(ALL_BACKENDS)}")
print()
for b in ALL_BACKENDS:
    status = "‚úÖ" if b["endpoint"] and b["api_key"] else "‚ö†Ô∏è (missing credentials)"
    print(f"  {status} {b['label']:15} | {b['region']:12} | {b['type']:4} | pri={b['priority']} wt={b['weight']}")
print("=" * 70)

if len(ALL_BACKENDS) < 2:
    print("\n‚ö†Ô∏è Only 1 backend configured. Multi-region routing requires at least 2 backends.")
    print("   Set BACKEND_B_* environment variables to enable routing tests.")

üîß Multi-Region Routing Configuration
APIM: apim-model-migration.azure-api.net/routing
API ID: multi-region-router
Configured backends: 2

  ‚úÖ PTU-EastUS      | eastus       | PTU  | pri=1 wt=100
  ‚úÖ PAYG-WestUS     | westus       | PAYG | pri=2 wt=50


## Helper Functions

Shared utilities for APIM policy management, backend testing, and result analysis.

In [8]:
# Helper functions for APIM policy management and testing
# =========================================================

def run_az(args: List[str]) -> str:
    """Run Azure CLI command and return stdout."""
    cmd = ["az"] + args
    print(f"  $ az {' '.join(args[:6])}{'...' if len(args) > 6 else ''}")
    p = subprocess.run(cmd, capture_output=True, text=True)
    if p.returncode != 0:
        raise RuntimeError((p.stderr or p.stdout).strip())
    return p.stdout.strip()

def maybe_set_subscription() -> None:
    """Set Azure subscription if specified."""
    sub = (AZURE_SUBSCRIPTION_ID or "").strip()
    if sub:
        run_az(["account", "set", "--subscription", sub])

def get_subscription_id() -> str:
    """Get current subscription ID."""
    if (AZURE_SUBSCRIPTION_ID or "").strip():
        return AZURE_SUBSCRIPTION_ID.strip()
    return run_az(["account", "show", "--query", "id", "-o", "tsv"]).strip()

def apim_service_exists() -> bool:
    """Check if APIM service exists."""
    try:
        run_az(["apim", "show", "-g", APIM_RESOURCE_GROUP, "-n", APIM_SERVICE_NAME, "-o", "none"])
        return True
    except Exception:
        return False

def apply_apim_api_policy(policy_xml: str, api_id: str = APIM_API_ID) -> None:
    """Apply API-level policy to APIM."""
    policy_path = Path(f"routing_policy_{api_id}.xml")
    policy_path.write_text(policy_xml, encoding="utf-8")

    # Try CLI subcommand first
    try:
        run_az([
            "apim", "api", "policy", "create",
            "-g", APIM_RESOURCE_GROUP,
            "--service-name", APIM_SERVICE_NAME,
            "--api-id", api_id,
            "--xml-content", f"@{policy_path}",
        ])
        print(f"‚úÖ Policy applied via: az apim api policy create")
        return
    except Exception as e:
        print(f"‚ÑπÔ∏è Trying az rest fallback...")

    # Fallback to REST API
    sub_id = get_subscription_id()
    uri = (
        "https://management.azure.com"
        f"/subscriptions/{sub_id}"
        f"/resourceGroups/{APIM_RESOURCE_GROUP}"
        f"/providers/Microsoft.ApiManagement/service/{APIM_SERVICE_NAME}"
        f"/apis/{api_id}"
        "/policies/policy"
        "?api-version=2022-08-01"
    )
    payload = {"properties": {"format": "xml", "value": policy_xml}}
    payload_path = Path(f"routing_policy_payload_{api_id}.json")
    payload_path.write_text(json.dumps(payload), encoding="utf-8")
    run_az([
        "rest",
        "--method", "put",
        "--uri", uri,
        "--body", f"@{payload_path}",
        "--headers", "Content-Type=application/json",
    ])
    print(f"‚úÖ Policy applied via: az rest (Management API)")

def ensure_apim_api_exists() -> None:
    """Create the APIM API and operation if they don't exist."""
    try:
        run_az([
            "apim", "api", "create",
            "-g", APIM_RESOURCE_GROUP,
            "--service-name", APIM_SERVICE_NAME,
            "--api-id", APIM_API_ID,
            "--path", APIM_API_PATH,
            "--display-name", "Multi-Region Router",
            "--protocols", "https",
            "--service-url", "https://placeholder.openai.azure.com",
            "--subscription-required", "false",
        ])
        print("‚úÖ API created")
    except Exception:
        print("‚ÑπÔ∏è API already exists (or creation failed)")

    try:
        run_az([
            "apim", "api", "operation", "create",
            "-g", APIM_RESOURCE_GROUP,
            "--service-name", APIM_SERVICE_NAME,
            "--api-id", APIM_API_ID,
            "--operation-id", "responses",
            "--display-name", "Responses",
            "--method", "POST",
            "--url-template", "/responses",
        ])
        print("‚úÖ Operation created")
    except Exception:
        print("‚ÑπÔ∏è Operation already exists (or creation failed)")

def percentile(data: list, p: float) -> float:
    """Calculate the p-th percentile of a list of numbers."""
    if not data:
        return 0.0
    sorted_data = sorted(data)
    k = (len(sorted_data) - 1) * (p / 100.0)
    f = int(k)
    c = f + 1 if f + 1 < len(sorted_data) else f
    return sorted_data[f] + (sorted_data[c] - sorted_data[f]) * (k - f)

print("‚úÖ Helper functions loaded")

‚úÖ Helper functions loaded


In [9]:
# APIM Test Runner - calls the configured endpoint and collects metrics
# =====================================================================

APIM_BASE_URL = f"https://{APIM_SERVICE_NAME}.azure-api.net"
APIM_URL = f"{APIM_BASE_URL}/{APIM_API_PATH}/responses"

@dataclass
class TestResult:
    """Result of a single APIM test request."""
    index: int
    routed_backend: str
    latency_s: float
    success: bool
    status_code: int = 200
    error: Optional[str] = None
    content: str = ""

def run_apim_test(
    payload: dict,
    timeout_s: float = 60.0,
    index: int = 0
) -> TestResult:
    """Send a single request to APIM and return the result."""
    headers = {"Content-Type": "application/json"}
    _key = APIM_SUBSCRIPTION_KEY or os.environ.get("APIM_SUBSCRIPTION_KEY", "")
    if _key:
        headers["Ocp-Apim-Subscription-Key"] = _key

    try:
        t0 = time.perf_counter()
        resp = requests.post(APIM_URL, headers=headers, json=payload, timeout=timeout_s)
        latency_s = time.perf_counter() - t0
        routed = resp.headers.get("x-routed-backend", "unknown")

        if not resp.ok:
            return TestResult(
                index=index,
                routed_backend=routed,
                latency_s=latency_s,
                success=False,
                status_code=resp.status_code,
                error=f"{resp.status_code}: {resp.text[:200]}"
            )

        # Extract content from Responses API format
        data = resp.json()
        content = ""
        try:
            out0 = (data.get("output") or [])[0]
            content0 = (out0.get("content") or [])[0]
            content = content0.get("text", "")
        except (IndexError, KeyError, TypeError):
            pass

        return TestResult(
            index=index,
            routed_backend=routed,
            latency_s=latency_s,
            success=True,
            status_code=resp.status_code,
            content=content
        )
    except Exception as e:
        return TestResult(
            index=index,
            routed_backend="error",
            latency_s=0,
            success=False,
            error=str(e)
        )

def run_load_test(
    num_requests: int = 20,
    payload: Optional[dict] = None,
    delay_between_s: float = 0.1
) -> List[TestResult]:
    """Run multiple requests and collect results."""
    if payload is None:
        payload = {
            "model": "will-be-overridden",
            "instructions": "Answer in exactly 3 words.",
            "input": "What color is the sky?",
            "max_output_tokens": 20,
        }

    results = []
    print(f"üöÄ Running {num_requests} requests to {APIM_URL}")
    print("=" * 60)

    for i in range(num_requests):
        result = run_apim_test(payload, index=i)
        results.append(result)
        status = "‚úÖ" if result.success else "‚ùå"
        print(f"  [{i+1:3}/{num_requests}] {status} ‚Üí {result.routed_backend:10} ({result.latency_s:.2f}s)")
        if delay_between_s > 0 and i < num_requests - 1:
            time.sleep(delay_between_s)

    return results

def analyze_results(results: List[TestResult], strategy_name: str = "Test") -> Dict[str, Any]:
    """Analyze test results and print summary."""
    print(f"\nüìä Results Analysis: {strategy_name}")
    print("=" * 60)

    successful = [r for r in results if r.success]
    failed = [r for r in results if not r.success]

    # Routing distribution
    routed_counts = Counter([r.routed_backend for r in results])
    print(f"\nüéØ Routing Distribution:")
    for backend, count in sorted(routed_counts.items()):
        pct = count / len(results) * 100
        bar = "‚ñà" * int(pct / 4)
        print(f"   {backend:15} ‚îÇ {bar:25} ‚îÇ {count:3} ({pct:.1f}%)")

    # Latency stats
    latencies = [r.latency_s for r in successful]
    if latencies:
        print(f"\n‚è±Ô∏è  Latency Statistics:")
        print(f"   Avg: {statistics.mean(latencies):.3f}s")
        print(f"   Min: {min(latencies):.3f}s")
        print(f"   Max: {max(latencies):.3f}s")
        print(f"   P50: {percentile(latencies, 50):.3f}s")
        print(f"   P95: {percentile(latencies, 95):.3f}s")

    # Success/failure
    success_rate = len(successful) / len(results) * 100
    print(f"\n‚úÖ Success Rate: {len(successful)}/{len(results)} ({success_rate:.1f}%)")

    if failed:
        print(f"\n‚ùå Failed Requests ({len(failed)}):")
        for r in failed[:5]:
            print(f"   [{r.index}] {r.error[:80] if r.error else 'Unknown error'}")

    return {
        "strategy": strategy_name,
        "total": len(results),
        "successful": len(successful),
        "failed": len(failed),
        "success_rate": success_rate,
        "distribution": dict(routed_counts),
        "latencies": latencies,
    }

print("‚úÖ Test runner loaded")
print(f"   APIM URL: {APIM_URL}")

‚úÖ Test runner loaded
   APIM URL: https://apim-model-migration.azure-api.net/routing/responses


## Part 1: Baseline Testing

Before testing routing strategies, let's verify APIM connectivity and establish a baseline.

In [10]:
# Ensure APIM API exists and verify connectivity
# ==============================================
if not shutil.which("az"):
    raise RuntimeError("‚ö†Ô∏è Azure CLI not found. Install and login to continue.")

maybe_set_subscription()

if not apim_service_exists():
    print(f"‚ùå APIM service not found: {APIM_SERVICE_NAME}")
    print("   Run 3_model_migration.ipynb first to create APIM infrastructure.")
    raise RuntimeError("APIM not found")

print(f"‚úÖ APIM service found: {APIM_SERVICE_NAME}")

# Create API and operation for routing tests
ensure_apim_api_exists()

  $ az account set --subscription 3d4d3dd0-79d4-40cf-a94e-b4154812c6ca
  $ az apim show -g rg-model-migration -n apim-model-migration...
‚úÖ APIM service found: apim-model-migration
  $ az apim api create -g rg-model-migration --service-name...
‚úÖ API created
  $ az apim api operation create -g rg-model-migration...
‚úÖ Operation created


## Part 2: Latency-Based Routing

Route all traffic to the lowest-latency backend. This strategy is ideal when you have 
backends in multiple regions and want to serve users from the closest/fastest endpoint.

### How It Works

1. **Measure latencies** for each backend
2. **Build policy** that routes to fastest backend
3. **Test and verify** routing behavior

In [13]:
# Measure actual latencies to each backend
# ========================================
from openai import AzureOpenAI

def measure_backend_latency(backend: dict, num_samples: int = 3) -> Tuple[float, bool]:
    """Measure average latency to a backend."""
    if not backend.get("endpoint") or not backend.get("api_key"):
        return float('inf'), False

    try:
        client = AzureOpenAI(
            azure_endpoint=backend["endpoint"],
            api_key=backend["api_key"],
            api_version=AZURE_OPENAI_API_VERSION,
        )

        latencies = []
        for _ in range(num_samples):
            start = time.perf_counter()
            client.chat.completions.create(
                model=backend["deployment"],
                messages=[{"role": "user", "content": "Say ok"}],
            )
            latencies.append((time.perf_counter() - start) * 1000)

        return statistics.mean(latencies), True
    except Exception as e:
        print(f"   ‚ùå {backend['label']}: {e}")
        return float('inf'), False

print("üìä Measuring Backend Latencies")
print("=" * 60)

latency_results = []
for backend in ALL_BACKENDS:
    print(f"   Testing {backend['label']}...", end=" ")
    avg_ms, success = measure_backend_latency(backend, num_samples=3)
    if success:
        print(f"‚úÖ {avg_ms:.0f}ms")
        latency_results.append((backend, avg_ms))
    else:
        print(f"‚ùå Failed")

# Sort by latency
latency_results.sort(key=lambda x: x[1])

print(f"\nüéØ Preferred Order (fastest first):")
for i, (backend, latency) in enumerate(latency_results):
    print(f"   {i+1}. {backend['label']:15} ({backend['region']:12}) - {latency:.0f}ms")

üìä Measuring Backend Latencies
   Testing PTU-EastUS... 

‚úÖ 377ms
   Testing PAYG-WestUS... ‚úÖ 833ms

üéØ Preferred Order (fastest first):
   1. PTU-EastUS      (eastus      ) - 377ms
   2. PAYG-WestUS     (westus      ) - 833ms


In [15]:
# Apply Latency-Based Routing Policy
# ===================================
# Routes ALL traffic to the fastest backend (no load balancing)

if len(latency_results) < 1:
    print("‚ö†Ô∏è No backends available. Check your configuration.")
else:
    fastest_backend = latency_results[0][0]

    policy_xml = f"""<policies>
  <inbound>
    <base />
    <!-- Latency-Based Routing: Always use fastest backend -->
    <set-variable name="backendLabel" value="{xml_escape(fastest_backend['id'])}" />
    <set-variable name="targetModel" value="{xml_escape(fastest_backend['deployment'])}" />
    <set-backend-service base-url="{xml_escape(fastest_backend['endpoint'].rstrip('/'))}" />
    <set-header name="api-key" exists-action="override">
      <value>{xml_escape(fastest_backend['api_key'])}</value>
    </set-header>
    <set-query-parameter name="api-version" exists-action="override">
      <value>{RESPONSES_API_VERSION}</value>
    </set-query-parameter>
    <rewrite-uri template="/openai/responses" />
    <set-body><![CDATA[@{{
      var body = context.Request.Body.As<Newtonsoft.Json.Linq.JObject>(preserveContent: true);
      body["model"] = (string)context.Variables["targetModel"];
      return body.ToString(Newtonsoft.Json.Formatting.None);
    }}]]></set-body>
  </inbound>
  <backend>
    <base />
  </backend>
  <outbound>
    <base />
    <set-header name="x-routed-backend" exists-action="override">
      <value>@((string)context.Variables.GetValueOrDefault("backendLabel", "unknown"))</value>
    </set-header>
    <set-header name="x-routing-strategy" exists-action="override">
      <value>latency-based</value>
    </set-header>
  </outbound>
  <on-error>
    <base />
  </on-error>
</policies>"""

    print(f"üöÄ Applying Latency-Based Routing Policy")
    print(f"   Target: {fastest_backend['label']} ({fastest_backend['region']})")
    apply_apim_api_policy(policy_xml)

üöÄ Applying Latency-Based Routing Policy
   Target: PTU-EastUS (eastus)
  $ az apim api policy create -g rg-model-migration...


‚ÑπÔ∏è Trying az rest fallback...
  $ az rest --method put --uri https://management.azure.com/subscriptions/3d4d3dd0-79d4-40cf-a94e-b4154812c6ca/resourceGroups/rg-model-migration/providers/Microsoft.ApiManagement/service/apim-model-migration/apis/multi-region-router/policies/policy?api-version=2022-08-01 --body...
‚úÖ Policy applied via: az rest (Management API)


In [16]:
# Test Latency-Based Routing
# ==========================
results_latency = run_load_test(num_requests=10, delay_between_s=0.5)
analysis_latency = analyze_results(results_latency, "Latency-Based Routing")

üöÄ Running 10 requests to https://apim-model-migration.azure-api.net/routing/responses


  [  1/10] ‚úÖ ‚Üí backend-a  (1.53s)
  [  2/10] ‚úÖ ‚Üí backend-a  (1.10s)
  [  3/10] ‚úÖ ‚Üí backend-a  (0.97s)
  [  4/10] ‚úÖ ‚Üí backend-a  (2.18s)
  [  5/10] ‚úÖ ‚Üí backend-a  (2.78s)
  [  6/10] ‚úÖ ‚Üí backend-a  (2.24s)
  [  7/10] ‚úÖ ‚Üí backend-a  (1.21s)
  [  8/10] ‚úÖ ‚Üí backend-a  (3.36s)
  [  9/10] ‚úÖ ‚Üí backend-a  (1.02s)
  [ 10/10] ‚úÖ ‚Üí backend-a  (1.50s)

üìä Results Analysis: Latency-Based Routing

üéØ Routing Distribution:
   backend-a       ‚îÇ ‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà ‚îÇ  10 (100.0%)

‚è±Ô∏è  Latency Statistics:
   Avg: 1.788s
   Min: 0.967s
   Max: 3.357s
   P50: 1.515s
   P95: 3.099s

‚úÖ Success Rate: 10/10 (100.0%)


## Part 3: Weighted Round Robin

Distribute traffic across backends based on configured weights. Higher weight = more traffic.

### Configuration Pattern (from AI-Gateway)

```bicep
pool: {
  services: [
    { id: 'backend-a', priority: 1, weight: 70 }  // 70% when both at same priority
    { id: 'backend-b', priority: 1, weight: 30 }  // 30% when both at same priority
  ]
}
```

In APIM policy, we simulate this with random selection based on cumulative weights.

In [17]:
# Apply Weighted Round Robin Policy
# ==================================
# Distributes traffic based on configured weights

if len(ALL_BACKENDS) < 2:
    print("‚ö†Ô∏è Weighted round robin requires at least 2 backends.")
    print("   Configure BACKEND_B_* environment variables.")
else:
    # Calculate weight threshold for routing decision
    total_weight = sum(b["weight"] for b in ALL_BACKENDS)
    
    # Build policy with weighted selection
    backends_for_policy = ALL_BACKENDS[:2]  # Use first 2 backends
    weight_a = backends_for_policy[0]["weight"]
    weight_b = backends_for_policy[1]["weight"]
    threshold = int(round(100.0 * (weight_a / (weight_a + weight_b))))

    policy_xml = f"""<policies>
  <inbound>
    <base />
    <!-- Weighted Round Robin: A={weight_a}, B={weight_b} (threshold={threshold}%) -->
    <set-variable name="roll" value="@((new System.Random()).Next(0, 100))" />
    <choose>
      <when condition="@(((int)context.Variables[&quot;roll&quot;]) &lt; {threshold})">
        <set-variable name="backendLabel" value="{xml_escape(backends_for_policy[0]['id'])}" />
        <set-variable name="targetModel" value="{xml_escape(backends_for_policy[0]['deployment'])}" />
        <set-backend-service base-url="{xml_escape(backends_for_policy[0]['endpoint'].rstrip('/'))}" />
        <set-header name="api-key" exists-action="override">
          <value>{xml_escape(backends_for_policy[0]['api_key'])}</value>
        </set-header>
      </when>
      <otherwise>
        <set-variable name="backendLabel" value="{xml_escape(backends_for_policy[1]['id'])}" />
        <set-variable name="targetModel" value="{xml_escape(backends_for_policy[1]['deployment'])}" />
        <set-backend-service base-url="{xml_escape(backends_for_policy[1]['endpoint'].rstrip('/'))}" />
        <set-header name="api-key" exists-action="override">
          <value>{xml_escape(backends_for_policy[1]['api_key'])}</value>
        </set-header>
      </otherwise>
    </choose>
    <set-query-parameter name="api-version" exists-action="override">
      <value>{RESPONSES_API_VERSION}</value>
    </set-query-parameter>
    <rewrite-uri template="/openai/responses" />
    <set-body><![CDATA[@{{
      var body = context.Request.Body.As<Newtonsoft.Json.Linq.JObject>(preserveContent: true);
      body["model"] = (string)context.Variables["targetModel"];
      return body.ToString(Newtonsoft.Json.Formatting.None);
    }}]]></set-body>
  </inbound>
  <backend>
    <base />
  </backend>
  <outbound>
    <base />
    <set-header name="x-routed-backend" exists-action="override">
      <value>@((string)context.Variables.GetValueOrDefault("backendLabel", "unknown"))</value>
    </set-header>
    <set-header name="x-routing-strategy" exists-action="override">
      <value>weighted-round-robin</value>
    </set-header>
  </outbound>
  <on-error>
    <base />
  </on-error>
</policies>"""

    print(f"üöÄ Applying Weighted Round Robin Policy")
    print(f"   {backends_for_policy[0]['label']}: {weight_a} ({threshold}%)")
    print(f"   {backends_for_policy[1]['label']}: {weight_b} ({100-threshold}%)")
    apply_apim_api_policy(policy_xml)

üöÄ Applying Weighted Round Robin Policy
   PTU-EastUS: 100 (67%)
   PAYG-WestUS: 50 (33%)
  $ az apim api policy create -g rg-model-migration...
‚ÑπÔ∏è Trying az rest fallback...
  $ az rest --method put --uri https://management.azure.com/subscriptions/3d4d3dd0-79d4-40cf-a94e-b4154812c6ca/resourceGroups/rg-model-migration/providers/Microsoft.ApiManagement/service/apim-model-migration/apis/multi-region-router/policies/policy?api-version=2022-08-01 --body...
‚úÖ Policy applied via: az rest (Management API)


In [18]:
# Test Weighted Round Robin
# =========================
if len(ALL_BACKENDS) >= 2:
    results_wrr = run_load_test(num_requests=20, delay_between_s=0.2)
    analysis_wrr = analyze_results(results_wrr, "Weighted Round Robin")
else:
    print("‚ö†Ô∏è Skipping test - requires 2+ backends")
    analysis_wrr = None

üöÄ Running 20 requests to https://apim-model-migration.azure-api.net/routing/responses
  [  1/20] ‚úÖ ‚Üí backend-a  (3.13s)
  [  2/20] ‚úÖ ‚Üí backend-a  (1.09s)
  [  3/20] ‚úÖ ‚Üí backend-b  (1.95s)
  [  4/20] ‚úÖ ‚Üí backend-a  (3.12s)
  [  5/20] ‚úÖ ‚Üí backend-a  (1.40s)
  [  6/20] ‚úÖ ‚Üí backend-a  (1.04s)
  [  7/20] ‚úÖ ‚Üí backend-a  (1.02s)
  [  8/20] ‚úÖ ‚Üí backend-a  (0.97s)
  [  9/20] ‚úÖ ‚Üí backend-a  (1.13s)
  [ 10/20] ‚úÖ ‚Üí backend-a  (1.07s)
  [ 11/20] ‚úÖ ‚Üí backend-a  (1.13s)
  [ 12/20] ‚úÖ ‚Üí backend-a  (1.55s)
  [ 13/20] ‚úÖ ‚Üí backend-a  (1.76s)
  [ 14/20] ‚úÖ ‚Üí backend-a  (1.12s)
  [ 15/20] ‚úÖ ‚Üí backend-a  (1.49s)
  [ 16/20] ‚úÖ ‚Üí backend-a  (1.81s)
  [ 17/20] ‚úÖ ‚Üí backend-a  (1.04s)
  [ 18/20] ‚úÖ ‚Üí backend-a  (1.08s)
  [ 19/20] ‚úÖ ‚Üí backend-b  (1.87s)
  [ 20/20] ‚úÖ ‚Üí backend-b  (1.84s)

üìä Results Analysis: Weighted Round Robin

üéØ Routing Distribution:
   backend-a       ‚îÇ ‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚

## Part 4: Priority-Based with PTU Spillover

This is the **recommended production pattern** from Azure-Samples/AI-Gateway:

- **Priority 1**: PTU backend (preferred, lower cost per token)
- **Priority 2**: PAYG backend(s) (fallback when PTU returns 429)

### Backend Pool Configuration (Bicep pattern)

```bicep
resource backendPool 'Microsoft.ApiManagement/service/backends@2024-05-01' = {
  properties: {
    type: 'Pool'
    pool: {
      services: [
        { id: ptuBackend.id,  priority: 1, weight: 100 }  // Primary
        { id: paygBackend1.id, priority: 2, weight: 50 }  // Failover
        { id: paygBackend2.id, priority: 2, weight: 50 }  // Failover
      ]
    }
  }
}
```

In [19]:
# Apply Priority-Based Routing with PTU Spillover
# ================================================
# Primary (PTU) ‚Üí Retry to Secondary (PAYG) on 429

if len(ALL_BACKENDS) < 2:
    print("‚ö†Ô∏è Priority-based spillover requires at least 2 backends.")
else:
    primary = ALL_BACKENDS[0]  # PTU
    secondary = ALL_BACKENDS[1]  # PAYG

    policy_xml = f"""<policies>
  <inbound>
    <base />
    <!-- Priority-Based: PTU first, PAYG fallback on 429 -->
    <set-variable name="backendLabel" value="{xml_escape(primary['id'])}" />
    <set-variable name="targetModel" value="{xml_escape(primary['deployment'])}" />
    <set-variable name="primaryEndpoint" value="{xml_escape(primary['endpoint'].rstrip('/'))}" />
    <set-variable name="primaryKey" value="{xml_escape(primary['api_key'])}" />
    <set-variable name="secondaryEndpoint" value="{xml_escape(secondary['endpoint'].rstrip('/'))}" />
    <set-variable name="secondaryKey" value="{xml_escape(secondary['api_key'])}" />
    <set-variable name="secondaryModel" value="{xml_escape(secondary['deployment'])}" />
    <set-variable name="secondaryId" value="{xml_escape(secondary['id'])}" />
    
    <set-backend-service base-url="@((string)context.Variables[&quot;primaryEndpoint&quot;])" />
    <set-header name="api-key" exists-action="override">
      <value>@((string)context.Variables["primaryKey"])</value>
    </set-header>
    <set-query-parameter name="api-version" exists-action="override">
      <value>{RESPONSES_API_VERSION}</value>
    </set-query-parameter>
    <rewrite-uri template="/openai/responses" />
    <set-body><![CDATA[@{{
      var body = context.Request.Body.As<Newtonsoft.Json.Linq.JObject>(preserveContent: true);
      body["model"] = (string)context.Variables["targetModel"];
      return body.ToString(Newtonsoft.Json.Formatting.None);
    }}]]></set-body>
  </inbound>
  <backend>
    <!-- Retry with PAYG on 429 from PTU -->
    <retry condition="@(context.Response.StatusCode == 429)" count="1" interval="0" first-fast-retry="true">
      <set-backend-service base-url="@((string)context.Variables[&quot;secondaryEndpoint&quot;])" />
      <set-header name="api-key" exists-action="override">
        <value>@((string)context.Variables["secondaryKey"])</value>
      </set-header>
      <set-variable name="backendLabel" value="@((string)context.Variables[&quot;secondaryId&quot;])" />
      <set-variable name="targetModel" value="@((string)context.Variables[&quot;secondaryModel&quot;])" />
      <set-body><![CDATA[@{{
        var body = context.Request.Body.As<Newtonsoft.Json.Linq.JObject>(preserveContent: true);
        body["model"] = (string)context.Variables["targetModel"];
        return body.ToString(Newtonsoft.Json.Formatting.None);
      }}]]></set-body>
      <forward-request />
    </retry>
  </backend>
  <outbound>
    <base />
    <set-header name="x-routed-backend" exists-action="override">
      <value>@((string)context.Variables.GetValueOrDefault("backendLabel", "unknown"))</value>
    </set-header>
    <set-header name="x-routing-strategy" exists-action="override">
      <value>priority-spillover</value>
    </set-header>
  </outbound>
  <on-error>
    <base />
  </on-error>
</policies>"""

    print(f"üöÄ Applying Priority-Based Spillover Policy")
    print(f"   Primary (Priority 1): {primary['label']} ({primary['type']})")
    print(f"   Secondary (Priority 2): {secondary['label']} ({secondary['type']})")
    apply_apim_api_policy(policy_xml)

üöÄ Applying Priority-Based Spillover Policy
   Primary (Priority 1): PTU-EastUS (PTU)
   Secondary (Priority 2): PAYG-WestUS (PAYG)
  $ az apim api policy create -g rg-model-migration...
‚ÑπÔ∏è Trying az rest fallback...
  $ az rest --method put --uri https://management.azure.com/subscriptions/3d4d3dd0-79d4-40cf-a94e-b4154812c6ca/resourceGroups/rg-model-migration/providers/Microsoft.ApiManagement/service/apim-model-migration/apis/multi-region-router/policies/policy?api-version=2022-08-01 --body...
‚úÖ Policy applied via: az rest (Management API)


In [20]:
# Test Priority-Based Spillover
# =============================
if len(ALL_BACKENDS) >= 2:
    results_spillover = run_load_test(num_requests=15, delay_between_s=0.2)
    analysis_spillover = analyze_results(results_spillover, "Priority-Based Spillover")
else:
    print("‚ö†Ô∏è Skipping test - requires 2+ backends")
    analysis_spillover = None

üöÄ Running 15 requests to https://apim-model-migration.azure-api.net/routing/responses
  [  1/15] ‚úÖ ‚Üí backend-b  (1.80s)
  [  2/15] ‚úÖ ‚Üí backend-b  (1.99s)
  [  3/15] ‚úÖ ‚Üí backend-b  (2.70s)
  [  4/15] ‚úÖ ‚Üí backend-b  (1.85s)
  [  5/15] ‚úÖ ‚Üí backend-b  (2.14s)
  [  6/15] ‚úÖ ‚Üí backend-b  (2.63s)
  [  7/15] ‚úÖ ‚Üí backend-b  (1.86s)
  [  8/15] ‚úÖ ‚Üí backend-b  (1.90s)
  [  9/15] ‚úÖ ‚Üí backend-b  (2.58s)
  [ 10/15] ‚úÖ ‚Üí backend-b  (1.86s)
  [ 11/15] ‚úÖ ‚Üí backend-b  (1.93s)
  [ 12/15] ‚úÖ ‚Üí backend-b  (1.79s)
  [ 13/15] ‚úÖ ‚Üí backend-b  (1.73s)
  [ 14/15] ‚úÖ ‚Üí backend-b  (2.74s)
  [ 15/15] ‚úÖ ‚Üí backend-b  (1.82s)

üìä Results Analysis: Priority-Based Spillover

üéØ Routing Distribution:
   backend-b       ‚îÇ ‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà ‚îÇ  15 (100.0%)

‚è±Ô∏è  Latency Statistics:
   Avg: 2.088s
   Min: 1.726s
   Max: 2.739s
   P50: 1.902s
   P95: 2.710s

‚úÖ Success Rate: 15/15 (100.0%)


## Part 5: Circuit Breaker Pattern

Circuit breaker prevents cascading failures by temporarily isolating failing backends.

### APIM Backend Circuit Breaker (from AI-Gateway)

```bicep
circuitBreaker: {
  rules: [
    {
      name: 'openai-circuit-breaker'
      failureCondition: {
        count: 3                    // Trip after 3 failures
        interval: 'PT10S'           // Within 10-second window
        statusCodeRanges: [
          { min: 429, max: 429 }    // Count 429s as failures
          { min: 500, max: 599 }    // Count 5xx as failures
        ]
      }
      tripDuration: 'PT1M'          // Stay open for 1 minute
      acceptRetryAfter: true        // Respect Retry-After header
    }
  ]
}
```

### Circuit States

```
‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê     3 failures in 10s      ‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
‚îÇ CLOSED ‚îÇ ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚ñ∫‚îÇ  OPEN  ‚îÇ
‚îÇ (OK)   ‚îÇ                            ‚îÇ (Skip) ‚îÇ
‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îò                            ‚îî‚îÄ‚îÄ‚îÄ‚î¨‚îÄ‚îÄ‚îÄ‚îÄ‚îò
     ‚ñ≤                                    ‚îÇ
     ‚îÇ         success                    ‚îÇ 1 minute timeout
     ‚îÇ ‚óÑ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê             ‚îÇ
     ‚îÇ                      ‚îÇ             ‚ñº
     ‚îÇ                 ‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚î¥‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
     ‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÇHALF-OPEN ‚îÇ
                       ‚îÇ (Test)   ‚îÇ
                       ‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îò
```

In [21]:
# Apply Circuit Breaker with Failover Policy
# ==========================================
# This policy simulates circuit breaker behavior in APIM policy
# Note: Full circuit breaker requires backend-level configuration in Bicep

if len(ALL_BACKENDS) < 2:
    print("‚ö†Ô∏è Circuit breaker demo requires at least 2 backends.")
else:
    primary = ALL_BACKENDS[0]
    secondary = ALL_BACKENDS[1]

    # Circuit breaker simulation via policy (tracks failures in cache)
    policy_xml = f"""<policies>
  <inbound>
    <base />
    <!-- Circuit Breaker Pattern: Track failures and failover -->
    <cache-lookup-value key="circuit-open" default-value="false" variable-name="circuitOpen" />
    
    <choose>
      <when condition="@(((string)context.Variables[&quot;circuitOpen&quot;]) == &quot;true&quot;)">
        <!-- Circuit is OPEN - route to secondary -->
        <set-variable name="backendLabel" value="{xml_escape(secondary['id'])}" />
        <set-variable name="targetModel" value="{xml_escape(secondary['deployment'])}" />
        <set-backend-service base-url="{xml_escape(secondary['endpoint'].rstrip('/'))}" />
        <set-header name="api-key" exists-action="override">
          <value>{xml_escape(secondary['api_key'])}</value>
        </set-header>
      </when>
      <otherwise>
        <!-- Circuit is CLOSED - route to primary -->
        <set-variable name="backendLabel" value="{xml_escape(primary['id'])}" />
        <set-variable name="targetModel" value="{xml_escape(primary['deployment'])}" />
        <set-backend-service base-url="{xml_escape(primary['endpoint'].rstrip('/'))}" />
        <set-header name="api-key" exists-action="override">
          <value>{xml_escape(primary['api_key'])}</value>
        </set-header>
      </otherwise>
    </choose>
    
    <set-query-parameter name="api-version" exists-action="override">
      <value>{RESPONSES_API_VERSION}</value>
    </set-query-parameter>
    <rewrite-uri template="/openai/responses" />
    <set-body><![CDATA[@{{
      var body = context.Request.Body.As<Newtonsoft.Json.Linq.JObject>(preserveContent: true);
      body["model"] = (string)context.Variables["targetModel"];
      return body.ToString(Newtonsoft.Json.Formatting.None);
    }}]]></set-body>
  </inbound>
  <backend>
    <base />
  </backend>
  <outbound>
    <base />
    <set-header name="x-routed-backend" exists-action="override">
      <value>@((string)context.Variables.GetValueOrDefault("backendLabel", "unknown"))</value>
    </set-header>
    <set-header name="x-routing-strategy" exists-action="override">
      <value>circuit-breaker</value>
    </set-header>
    <set-header name="x-circuit-state" exists-action="override">
      <value>@(((string)context.Variables.GetValueOrDefault("circuitOpen", "false")) == "true" ? "open" : "closed")</value>
    </set-header>
  </outbound>
  <on-error>
    <!-- On 429/5xx, increment failure count and potentially trip circuit -->
    <choose>
      <when condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode >= 500)">
        <cache-lookup-value key="failure-count" default-value="0" variable-name="failureCount" />
        <set-variable name="newFailureCount" value="@(int.Parse((string)context.Variables[&quot;failureCount&quot;]) + 1)" />
        <cache-store-value key="failure-count" value="@(((int)context.Variables[&quot;newFailureCount&quot;]).ToString())" duration="10" />
        
        <choose>
          <when condition="@(((int)context.Variables[&quot;newFailureCount&quot;]) >= 3)">
            <!-- Trip the circuit for 60 seconds -->
            <cache-store-value key="circuit-open" value="true" duration="60" />
          </when>
        </choose>
      </when>
    </choose>
    <base />
  </on-error>
</policies>"""

    print(f"üöÄ Applying Circuit Breaker Policy")
    print(f"   Primary: {primary['label']}")
    print(f"   Failover: {secondary['label']}")
    print(f"   Trip condition: 3 failures in 10s ‚Üí open for 60s")
    apply_apim_api_policy(policy_xml)

üöÄ Applying Circuit Breaker Policy
   Primary: PTU-EastUS
   Failover: PAYG-WestUS
   Trip condition: 3 failures in 10s ‚Üí open for 60s
  $ az apim api policy create -g rg-model-migration...
‚ÑπÔ∏è Trying az rest fallback...
  $ az rest --method put --uri https://management.azure.com/subscriptions/3d4d3dd0-79d4-40cf-a94e-b4154812c6ca/resourceGroups/rg-model-migration/providers/Microsoft.ApiManagement/service/apim-model-migration/apis/multi-region-router/policies/policy?api-version=2022-08-01 --body...
‚úÖ Policy applied via: az rest (Management API)


In [22]:
# Test Circuit Breaker
# ====================
if len(ALL_BACKENDS) >= 2:
    results_cb = run_load_test(num_requests=10, delay_between_s=0.5)
    analysis_cb = analyze_results(results_cb, "Circuit Breaker")
else:
    print("‚ö†Ô∏è Skipping test - requires 2+ backends")
    analysis_cb = None

üöÄ Running 10 requests to https://apim-model-migration.azure-api.net/routing/responses
  [  1/10] ‚úÖ ‚Üí backend-a  (1.10s)
  [  2/10] ‚úÖ ‚Üí backend-a  (1.75s)
  [  3/10] ‚úÖ ‚Üí backend-a  (1.70s)
  [  4/10] ‚úÖ ‚Üí backend-a  (1.96s)
  [  5/10] ‚úÖ ‚Üí backend-a  (2.40s)
  [  6/10] ‚úÖ ‚Üí backend-a  (1.08s)
  [  7/10] ‚úÖ ‚Üí backend-a  (1.44s)
  [  8/10] ‚úÖ ‚Üí backend-a  (1.14s)
  [  9/10] ‚úÖ ‚Üí backend-a  (1.06s)
  [ 10/10] ‚úÖ ‚Üí backend-a  (1.10s)

üìä Results Analysis: Circuit Breaker

üéØ Routing Distribution:
   backend-a       ‚îÇ ‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà ‚îÇ  10 (100.0%)

‚è±Ô∏è  Latency Statistics:
   Avg: 1.473s
   Min: 1.058s
   Max: 2.404s
   P50: 1.288s
   P95: 2.205s

‚úÖ Success Rate: 10/10 (100.0%)


## Part 6: Combined Strategy Test

Test all routing strategies back-to-back and compare their characteristics.

This cell compiles results from all the tests run above.

In [23]:
# Compile and Compare All Results
# ================================

all_analyses = []

# Collect available analyses
if 'analysis_latency' in dir() and analysis_latency:
    all_analyses.append(analysis_latency)
if 'analysis_wrr' in dir() and analysis_wrr:
    all_analyses.append(analysis_wrr)
if 'analysis_spillover' in dir() and analysis_spillover:
    all_analyses.append(analysis_spillover)
if 'analysis_cb' in dir() and analysis_cb:
    all_analyses.append(analysis_cb)

if not all_analyses:
    print("‚ö†Ô∏è No test results available. Run the routing strategy tests above.")
else:
    print("üìä Routing Strategy Comparison")
    print("=" * 80)
    print(f"{'Strategy':<25} {'Success %':>10} {'Avg Latency':>12} {'P95 Latency':>12} {'Distribution':<20}")
    print("-" * 80)

    for a in all_analyses:
        avg_lat = statistics.mean(a['latencies']) if a['latencies'] else 0
        p95_lat = percentile(a['latencies'], 95) if a['latencies'] else 0
        dist_str = ", ".join([f"{k}:{v}" for k, v in sorted(a['distribution'].items())])
        
        print(f"{a['strategy']:<25} {a['success_rate']:>9.1f}% {avg_lat:>11.3f}s {p95_lat:>11.3f}s {dist_str:<20}")

    print("=" * 80)

üìä Routing Strategy Comparison
Strategy                   Success %  Avg Latency  P95 Latency Distribution        
--------------------------------------------------------------------------------
Latency-Based Routing         100.0%       1.788s       3.099s backend-a:10        
Weighted Round Robin          100.0%       1.529s       3.122s backend-a:17, backend-b:3
Priority-Based Spillover      100.0%       2.088s       2.710s backend-b:15        
Circuit Breaker               100.0%       1.473s       2.205s backend-a:10        


## Results Analysis

### Strategy Selection Guide

| Strategy | Best For | Trade-offs |
|----------|----------|------------|
| **Latency-Based** | Geo-distributed users, consistent UX | Single backend, no load distribution |
| **Weighted Round Robin** | Canary deployments, gradual migration | Requires capacity planning |
| **Priority Spillover** | PTU + PAYG cost optimization | PAYG cost during spikes |
| **Circuit Breaker** | High availability, failure isolation | Complexity, delayed recovery |

### Recommended Production Configuration

```
‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
‚îÇ                    APIM Backend Pool                             ‚îÇ
‚îú‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î§
‚îÇ  Backend A (PTU)      priority: 1    weight: 100                 ‚îÇ
‚îÇ    ‚îî‚îÄ Circuit breaker: 3 failures/10s ‚Üí trip 1min               ‚îÇ
‚îÇ                                                                  ‚îÇ
‚îÇ  Backend B (PAYG-1)   priority: 2    weight: 50                  ‚îÇ
‚îÇ    ‚îî‚îÄ Circuit breaker: 3 failures/10s ‚Üí trip 1min               ‚îÇ
‚îÇ                                                                  ‚îÇ
‚îÇ  Backend C (PAYG-2)   priority: 2    weight: 50                  ‚îÇ
‚îÇ    ‚îî‚îÄ Circuit breaker: 3 failures/10s ‚Üí trip 1min               ‚îÇ
‚îú‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î§
‚îÇ  Behavior:                                                       ‚îÇ
‚îÇ  1. Normal: All traffic ‚Üí Backend A (PTU)                        ‚îÇ
‚îÇ  2. PTU 429: Retry ‚Üí Backend B or C (weighted 50/50)            ‚îÇ
‚îÇ  3. PTU circuit trips: All traffic ‚Üí B/C until recovery         ‚îÇ
‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îò
```

In [24]:
# Save Test Results to File
# ==========================
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
results_file = f"routing_results_{timestamp}.json"

results_to_save = {
    "timestamp": timestamp,
    "apim_service": APIM_SERVICE_NAME,
    "backends": [
        {
            "id": b["id"],
            "label": b["label"],
            "region": b["region"],
            "type": b["type"],
            "priority": b["priority"],
            "weight": b["weight"],
        }
        for b in ALL_BACKENDS
    ],
    "analyses": [
        {
            "strategy": a["strategy"],
            "total": a["total"],
            "successful": a["successful"],
            "failed": a["failed"],
            "success_rate": a["success_rate"],
            "distribution": a["distribution"],
            "avg_latency": statistics.mean(a["latencies"]) if a["latencies"] else 0,
            "p50_latency": percentile(a["latencies"], 50) if a["latencies"] else 0,
            "p95_latency": percentile(a["latencies"], 95) if a["latencies"] else 0,
        }
        for a in all_analyses
    ] if all_analyses else [],
}

Path(results_file).write_text(json.dumps(results_to_save, indent=2))
print(f"‚úÖ Results saved to {results_file}")

‚úÖ Results saved to routing_results_20260206_025400.json


## Wrap-up

### What You Learned

1. **Latency-based routing**: Route to fastest backend for optimal user experience
2. **Weighted round robin**: Distribute traffic based on capacity/preference for gradual rollouts
3. **Priority-based spillover**: PTU-first with PAYG fallback for cost optimization
4. **Circuit breaker**: Isolate failing backends to prevent cascading failures

### Key APIM Policy Patterns

| Pattern | Policy Element | Purpose |
|---------|---------------|---------|
| Backend selection | `<set-backend-service>` | Route to specific endpoint |
| Weighted routing | `<set-variable>` + `<choose>` | Random selection based on weights |
| Retry on 429 | `<retry condition="...">` | Failover to secondary backend |
| Response headers | `<set-header>` in outbound | Track which backend served request |

### Infrastructure as Code (Bicep) Reference

For production deployments, configure backend pools in Bicep:

```bicep
// See: Azure-Samples/AI-Gateway/labs/backend-pool-load-balancing
resource backendPool 'Microsoft.ApiManagement/service/backends@2024-05-01' = {
  properties: {
    type: 'Pool'
    pool: {
      services: [
        { id: ptuBackend.id, priority: 1, weight: 100 }
        { id: paygBackend.id, priority: 2, weight: 100 }
      ]
    }
  }
}
```

### Additional Resources

- [Azure-Samples/AI-Gateway](https://github.com/Azure-Samples/AI-Gateway/tree/main/labs/backend-pool-load-balancing) - Backend pool load balancing lab
- [APIM Backend Configuration](https://learn.microsoft.com/en-us/azure/api-management/backends) - Official docs
- [Azure OpenAI PTU Overview](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/provisioned-throughput) - PTU concepts