# Optimization: Model Comparison & Benchmark
----

This notebook focuses on **Workload Optimization** through model comparison and benchmarking. You will learn how to systematically compare **two models you select** (e.g., GPT-4o vs GPT-5.1) with the **Azure OpenAI Responses API** and measure:

- **Latency**: Response time for different scenarios
- **Token Usage**: Input, output, and cached tokens
- **Cost Efficiency**: Price per request with prompt caching
- **Accuracy**: Answer correctness for enterprise scenarios

## Key Features
- Fair comparison aligned across 7 dimensions (same API, cache key, padding, etc.)
- Two-model selection via variables (no hard-coded model pair)
- Prompt caching support with Azure OpenAI
- Enterprise scenario coverage (intent, sentiment, RAG, code, customer service)

## Table of contents

- [Why Model Comparison Matters](#why-model-comparison-matters)
- [Setup](#setup)
- [Step 1: Configure Pricing](#step-1-configure-pricing)
- [Step 2: Define Test Scenarios](#step-2-define-test-scenarios)
- [Step 3: Create Static Padding for Cache Eligibility](#step-3-create-static-padding-for-cache-eligibility)
- [Step 4: Implement Core Benchmark Functions](#step-4-implement-core-benchmark-functions)
- [Step 5: Cache Warmup](#step-5-cache-warmup)
- [Step 6: Run Benchmark](#step-6-run-benchmark)
- [Step 7: Analyze Results](#step-7-analyze-results)
- [Additional Resources](#additional-resources)
- [Wrap-up](#wrap-up)

## Why Model Comparison Matters

### Enterprise Migration Decision Support

When migrating between model versions, organizations need to consider:

| Dimension | Description |
|-----------|-------------|
| **Performance** | Response latency, time-to-first-token (TTFT) |
| **Cost** | Input/output token pricing, cache hit savings |
| **Quality** | Accuracy for specific use cases |
| **Scalability** | Behavior under different workloads |

### Fair Comparison Methodology

This benchmark ensures fairness through:

1. **Same API**: Using Responses API for both models
2. **Same Questions**: Identical test scenarios
3. **Same Cache Key**: Consistent `prompt_cache_key` for cache routing
4. **Warmup Phase**: Populate cache before measurement
5. **Multiple Runs**: Repeated runs per scenario (`NUM_RUNS`) to average out run-to-run variance
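
For item 5, a small number of runs is easier to interpret when summarized with several statistics rather than a single average. The following is a minimal sketch, assuming latencies are collected in seconds as a plain list; `summarize_latencies` is a hypothetical helper and the numbers are illustrative, not measured results.

```python
import statistics
from typing import Dict, List


def summarize_latencies(latencies: List[float]) -> Dict[str, float]:
    """Summarize repeated latency measurements (in seconds) for one model."""
    if not latencies:
        raise ValueError("need at least one measurement")
    return {
        "mean": statistics.mean(latencies),
        "median": statistics.median(latencies),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(latencies) if len(latencies) > 1 else 0.0,
        "min": min(latencies),
        "max": max(latencies),
    }


# Illustrative numbers only, not measured results.
summary = summarize_latencies([0.82, 0.91, 1.40])
print(summary)
```

With only three runs, the median is less sensitive to a single slow request than the mean, so reporting both helps when one run is an outlier.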

## Setup

This notebook reuses the configuration file (`.foundry_config.json`) created by `0_setup/1_setup.ipynb`.

- If the file is missing, run the setup notebook first.
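- Make sure you can authenticate (e.g., `az login`) so that `DefaultAzureCredential` works.
- The client in this notebook authenticates with an API key: set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` in your environment or in a `.env` file.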

In [1]:
# Environment setup and imports
import os
import sys
import time
import json
from datetime import datetime
from typing import Dict, List, Any, Optional
from dotenv import load_dotenv

load_dotenv(override=True)

# Verify required packages
try:
    from openai import AzureOpenAI
    print("✅ Azure OpenAI package imported successfully")
except ImportError:
    print("❌ ERROR: openai package not installed.")
    print('   Run: pip install "openai>=1.60.0"')
    sys.exit(1)

# Load Foundry project settings
config_file = '../0_setup/.foundry_config.json'
try:
    with open(config_file, 'r', encoding='utf-8') as f:
        config = json.load(f)
    print(f"✅ Loaded settings from '{config_file}'")
except FileNotFoundError as e:
    print(f"⚠️ Could not find '{config_file}'.")
    print('💡 Run 0_setup/1_setup.ipynb first to create it.')
    raise e

# Extract configuration values
AZURE_OPENAI_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT", "")
AZURE_OPENAI_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY", "")
AZURE_OPENAI_API_VERSION = "2025-04-01-preview"

print(f"\n📌 Azure OpenAI Endpoint: {AZURE_OPENAI_ENDPOINT[:50]}..." if AZURE_OPENAI_ENDPOINT else "⚠️ AZURE_OPENAI_ENDPOINT not set")
print(f"📌 API Key: {'✅ Set' if AZURE_OPENAI_API_KEY else '⚠️ Not set'}")

✅ Azure OpenAI package imported successfully
✅ Loaded settings from '../0_setup/.foundry_config.json'

📌 Azure OpenAI Endpoint: https://foundry-rq90gs.openai.azure.com...
📌 API Key: ✅ Set


In [2]:
# Validate required environment variables
if not AZURE_OPENAI_ENDPOINT or not AZURE_OPENAI_API_KEY:
    print("❌ ERROR: Missing required environment variables!")
    print("\nPlease set:")
    print("  export AZURE_OPENAI_ENDPOINT='https://your-resource.openai.azure.com/'")
    print("  export AZURE_OPENAI_API_KEY='your-api-key'")
    print("\nOr add them to your .env file.")
else:
    print("✅ All required environment variables are set")
    
# Initialize Azure OpenAI client with Responses API base URL
client = AzureOpenAI(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
    api_version=AZURE_OPENAI_API_VERSION,
)
print("✅ Azure OpenAI client initialized")
print(f"   Base URL: {AZURE_OPENAI_ENDPOINT.rstrip('/')}/openai/v1/")

✅ All required environment variables are set
✅ Azure OpenAI client initialized
   Base URL: https://foundry-rq90gs.openai.azure.com/openai/v1/


## Step 1: Configure Pricing

Define the pricing structure for each model. This allows us to calculate costs accurately.

**Azure OpenAI Pricing** (per 1M tokens):
- Prices are taken from the official Azure OpenAI price list and may change; verify them against the source below before relying on the cost figures
- Cached input tokens are significantly cheaper than uncached tokens
- Source: [Azure OpenAI Pricing](https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/)
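
To make "significantly cheaper" concrete, the sketch below computes the cached-input discount from the same per-1M-token figures used in this step. `SAMPLE_PRICING` and `cache_discount_pct` are hypothetical names introduced only for this illustration; the prices are copied from this notebook and should be checked against the pricing page.

```python
from typing import Dict

# Illustrative copy of the input prices (USD per 1M tokens) used in this step.
SAMPLE_PRICING: Dict[str, Dict[str, float]] = {
    "gpt-4o": {"input": 2.50, "cached_input": 1.25},
    "gpt-5.1": {"input": 1.25, "cached_input": 0.13},
}


def cache_discount_pct(prices: Dict[str, float]) -> float:
    """Return the cached-input discount relative to the uncached input price."""
    return (1 - prices["cached_input"] / prices["input"]) * 100


for model, prices in SAMPLE_PRICING.items():
    print(f"{model}: cached input is {cache_discount_pct(prices):.1f}% cheaper")
```

The discount applies only to the cached portion of the input, so the saving per request depends on how much of the prompt is a shared, cache-eligible prefix.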

In [3]:
# Pricing configuration (per 1M tokens) - Azure OpenAI Official Pricing
# Source: https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/

PRICING: Dict[str, Dict[str, float]] = {
    "gpt-4o": {
        "input": 2.50,           # $2.50 per 1M input tokens
        "cached_input": 1.25,    # $1.25 per 1M cached input tokens (50% discount)
        "output": 10.00,         # $10.00 per 1M output tokens
    },
    "gpt-5.1": {
        "input": 1.25,           # $1.25 per 1M input tokens
        "cached_input": 0.13,    # $0.13 per 1M cached input tokens (~90% discount)
        "output": 10.00,         # $10.00 per 1M output tokens
    },
}

# Display pricing table
print("📊 Model Pricing (per 1M tokens)")
print("=" * 60)
print(f"{'Model':<12} {'Input':>12} {'Cached Input':>15} {'Output':>12}")
print("-" * 60)
for model, prices in PRICING.items():
    print(f"{model:<12} ${prices['input']:>10.2f} ${prices['cached_input']:>13.2f} ${prices['output']:>10.2f}")
print("=" * 60)

📊 Model Pricing (per 1M tokens)
Model               Input    Cached Input       Output
------------------------------------------------------------
gpt-4o       $      2.50 $         1.25 $     10.00
gpt-5.1      $      1.25 $         0.13 $     10.00


## Step 2: Define Test Scenarios

We define various test scenarios covering different enterprise use cases:

| Category | Description | Expected Response |
|----------|-------------|-------------------|
| **Short** | Intent classification, sentiment analysis | Single word/phrase |
| **Medium** | RAG Q&A, code explanation, fact extraction | 1-3 sentences |
| **Long** | Customer service replies, content generation | Paragraph (50-100 words) |

Each scenario includes:
- `category`: Response length category
- `name`: Descriptive name
- `question`: The prompt to send
- `answer_variants`: Keywords expected in the response; the accuracy check passes if any one of them appears (case-insensitive)
- `language`: EN (English) or ZH (Chinese)
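
Because every scenario is a plain dictionary, a missing or misspelled field only surfaces when the benchmark runs. The following is a minimal sketch of an up-front check, assuming the five fields listed above are all required; `validate_scenario`, `REQUIRED_KEYS`, and `ALLOWED_CATEGORIES` are hypothetical names, not part of the notebook.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = {"category", "name", "question", "answer_variants", "language"}
ALLOWED_CATEGORIES = {"Short", "Medium", "Long"}


def validate_scenario(scenario: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one scenario (empty if it is valid)."""
    problems = []
    missing = REQUIRED_KEYS - set(scenario)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if scenario.get("category") not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {scenario.get('category')!r}")
    if not scenario.get("answer_variants"):
        problems.append("answer_variants must be a non-empty list")
    return problems


good = {
    "category": "Short",
    "name": "Sentiment Analysis",
    "question": "What is the sentiment of 'The service was amazing!'? ONE word.",
    "answer_variants": ["positive"],
    "language": "EN",
}
bad = {"category": "Tiny", "name": "Broken"}

print(validate_scenario(good))
print(validate_scenario(bad))
```

Running such a check over the whole scenario list before any API call avoids spending tokens on a run that would fail partway through.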

In [4]:
# Test Scenarios - Enterprise use cases for migration validation

TEST_SCENARIOS: List[Dict[str, Any]] = [
    # =========================================================================
    # SHORT RESPONSE SCENARIOS (Intent Classification, Sentiment)
    # =========================================================================
    {
        "category": "Short",
        "name": "Intent Classification",
        "question": "Classify the intent (complaint/inquiry/praise/request): 'My delivery is 2 hours late!' Answer with ONE word.",
        "answer_variants": ["complaint", "Complaint"],
        "language": "EN"
    },
    {
        "category": "Short",
        "name": "Sentiment Analysis",
        "question": "What is the sentiment of 'The service was amazing!' (positive/negative/neutral)? ONE word.",
        "answer_variants": ["positive", "Positive"],
        "language": "EN"
    },
    
    # =========================================================================
    # MEDIUM RESPONSE SCENARIOS (RAG Q&A, Code Explanation)
    # =========================================================================
    {
        "category": "Medium",
        "name": "RAG Number Extraction",
        "question": "Based on: 'Company ABC reported revenue of $2.76 billion in 2023, a 26% increase.' What was the revenue? Growth rate? Brief answer.",
        "answer_variants": ["2.76", "26"],
        "language": "EN"
    },
    {
        "category": "Medium",
        "name": "RAG Fact Extraction",
        "question": "Based on: 'TechCorp was founded in 2018 in Seattle by Sarah Chen.' When and where was it founded? Brief answer.",
        "answer_variants": ["2018", "Seattle"],
        "language": "EN"
    },
    {
        "category": "Medium",
        "name": "Code Explanation",
        "question": "Explain this code: def f(n): return n if n<=1 else f(n-1)+f(n-2). Answer in 2 sentences.",
        "answer_variants": ["fibonacci", "recursive", "Fibonacci"],
        "language": "EN"
    },
    
    # =========================================================================
    # LONG RESPONSE SCENARIOS (Content Generation, Customer Service)
    # =========================================================================
    {
        "category": "Long",
        "name": "Customer Service Reply",
        "question": "You are a customer service agent. User says: 'My order is 2 hours late!' Generate an apology and solution in about 50 words.",
        "answer_variants": ["sorry", "apologize", "refund", "compensation", "delay"],
        "language": "EN"
    },
    {
        "category": "Long",
        "name": "Product Description",
        "question": "Write a 50-word description for a smart water bottle with hydration tracking features.",
        "answer_variants": ["hydration", "track", "smart", "water"],
        "language": "EN"
    },
]

# Display test scenarios summary
print("📋 Test Scenarios Summary")
print("=" * 60)
for i, scenario in enumerate(TEST_SCENARIOS, 1):
    print(f"{i}. [{scenario['category']}] {scenario['name']} ({scenario['language']})")
print(f"\n✅ Total scenarios: {len(TEST_SCENARIOS)}")

📋 Test Scenarios Summary
1. [Short] Intent Classification (EN)
2. [Short] Sentiment Analysis (EN)
3. [Medium] RAG Number Extraction (EN)
4. [Medium] RAG Fact Extraction (EN)
5. [Medium] Code Explanation (EN)
6. [Long] Customer Service Reply (EN)
7. [Long] Product Description (EN)

✅ Total scenarios: 7


## Step 3: Create Static Padding for Cache Eligibility

Azure OpenAI's prompt caching requires a **minimum of 1024 tokens** in the prompt prefix to be cache-eligible.

We create a realistic enterprise knowledge base document (~1030 tokens) that:
- Acts as context for all test scenarios
- Enables prompt caching for both models
- Simulates real-world RAG (Retrieval-Augmented Generation) scenarios
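
A character-based estimate can drift from the real token count, particularly for text with long separator lines, so it may be worth checking the padding with a tokenizer. The sketch below is one way to do that, assuming the third-party `tiktoken` package and its `o200k_base` encoding are a reasonable match for the deployed models (an assumption to verify for your deployment); it falls back to a rough 4-characters-per-token estimate when `tiktoken` is unavailable. `count_tokens` and `MIN_CACHEABLE_TOKENS` are hypothetical names.

```python
def count_tokens(text: str, encoding_name: str = "o200k_base") -> int:
    """Count tokens with tiktoken if available, else estimate ~4 chars/token."""
    try:
        import tiktoken  # third-party; assumed to be installed separately

        return len(tiktoken.get_encoding(encoding_name).encode(text))
    except Exception:
        # tiktoken missing or its encoding data unavailable: rough estimate only.
        return max(1, len(text) // 4)


MIN_CACHEABLE_TOKENS = 1024  # threshold stated in this step

# Stand-in text; in the notebook you would pass the padding string instead.
sample = "Priority 1 (Order Issues): Response within 2 minutes.\n" * 200
n = count_tokens(sample)
print(f"{n:,} tokens, cache eligible: {n >= MIN_CACHEABLE_TOKENS}")
```

If the measured count sits close to the threshold, adding a little more static context gives a safety margin against tokenizer differences between models.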

In [5]:
# Static Padding - Must be >= 1024 tokens for Azure prompt cache eligibility
# This simulates a realistic enterprise knowledge base

STATIC_PADDING = """[ENTERPRISE KNOWLEDGE BASE - VERSION 2024.12]
================================================================================

SECTION 1: CUSTOMER SERVICE PROTOCOLS
================================================================================

1.1 Response Time Standards:
- Priority 1 (Order Issues): Response within 2 minutes, resolution within 15 minutes
- Priority 2 (Payment Issues): Response within 5 minutes, resolution within 30 minutes
- Priority 3 (General Inquiry): Response within 10 minutes, resolution within 2 hours
- Priority 4 (Feedback/Suggestions): Response within 24 hours

1.2 Compensation Guidelines:
- Delivery delay 30-60 minutes: $5 coupon
- Delivery delay over 60 minutes: $10 coupon + free delivery next order
- Wrong item delivered: Full refund + replacement + $15 coupon
- Food quality issue: Full refund + $20 coupon

1.3 Escalation Procedures:
- Level 1: Frontline agent handles standard issues
- Level 2: Senior agent handles complex complaints
- Level 3: Supervisor handles escalated disputes
- Level 4: Quality assurance team handles legal/PR issues

================================================================================

SECTION 2: E-COMMERCE OPERATIONS
================================================================================

2.1 Order Lifecycle States:
- PENDING: Order placed, awaiting merchant confirmation
- CONFIRMED: Merchant accepted, preparing order
- PREPARING: Kitchen/warehouse processing
- READY: Order ready for pickup by rider
- DISPATCHED: Rider picked up, in transit
- ARRIVING: Rider within 500m of delivery address
- DELIVERED: Order handed to customer
- COMPLETED: Customer confirmed receipt
- CANCELLED: Order cancelled (with reason code)
- REFUNDED: Refund processed

2.2 Merchant Categories:
- Restaurant (Chinese, Western, Japanese, Korean, Fast Food)
- Grocery (Fresh Produce, Dairy, Snacks, Beverages)
- Pharmacy (OTC, Prescription, Health Products)
- Convenience Store (24/7 essentials)
- Specialty Shops (Bakery, Desserts, Coffee)

2.3 Delivery Optimization:
- Smart routing algorithm considers traffic, weather, rider capacity
- Batching orders from same merchant to nearby addresses
- Peak hour surge pricing and rider incentives
- Quality metrics: On-time rate, customer rating, order accuracy

================================================================================

SECTION 3: DATA ANALYTICS FRAMEWORK
================================================================================

3.1 Key Performance Indicators:
- GMV (Gross Merchandise Value): Total transaction value
- Take Rate: Revenue as percentage of GMV
- Order Frequency: Orders per user per month
- Customer Acquisition Cost (CAC): Marketing spend per new user
- Customer Lifetime Value (LTV): Predicted total revenue per user
- Rider Efficiency: Orders delivered per hour per rider

3.2 Reporting Cadence:
- Real-time: Order volume, active riders, system health
- Hourly: Regional performance, surge status
- Daily: Revenue, costs, profit margins
- Weekly: Trend analysis, anomaly detection
- Monthly: Executive summary, strategic metrics
- Quarterly: Investor reports, market analysis

3.3 Data Quality Standards:
- Completeness: All required fields populated
- Accuracy: Values within expected ranges
- Timeliness: Data available within SLA
- Consistency: No conflicting records

================================================================================

SECTION 4: TECHNICAL ARCHITECTURE
================================================================================

4.1 Microservices:
- Order Service: Order creation, modification, cancellation
- User Service: Authentication, profile management
- Merchant Service: Menu management, inventory, hours
- Rider Service: Assignment, tracking, earnings
- Payment Service: Transactions, refunds, settlements
- Notification Service: Push, SMS, in-app messages

4.2 Infrastructure:
- Multi-region deployment for high availability
- Kubernetes clusters with auto-scaling
- Redis clusters for caching and session management
- MySQL clusters with read replicas
- Kafka for event streaming
- Elasticsearch for search and analytics

4.3 API Standards:
- RESTful design with versioned endpoints
- OAuth 2.0 authentication
- Rate limiting: 1000 requests per minute per client
- Response format: JSON with standard error codes
- Pagination: Cursor-based for large datasets

================================================================================

SECTION 5: COMPLIANCE AND SECURITY
================================================================================

5.1 Data Protection:
- PII encryption at rest (AES-256), TLS 1.3 for data in transit
- Data retention: 3 years for transactions, 1 year for logs
- Right to deletion: Process within 30 days

5.2 Food Safety:
- Merchant license verification
- Regular hygiene inspections
- Temperature monitoring for cold chain
- Allergen information disclosure

5.3 Financial Compliance:
- Anti-money laundering (AML) monitoring
- Transaction limits per user per day
- Fraud detection and prevention
- Regular audit trails

================================================================================
[END OF KNOWLEDGE BASE]
================================================================================

Based on the above context, please answer the following question:
"""

# Estimate token count (rough approximation: ~4 chars per token)
estimated_tokens = len(STATIC_PADDING) // 4
print(f"📄 Static Padding Created")
print(f"   Character count: {len(STATIC_PADDING):,}")
print(f"   Estimated tokens: ~{estimated_tokens:,}")
print(f"   Cache eligible: {'✅ Yes (>1024 tokens)' if estimated_tokens >= 1024 else '❌ No (<1024 tokens)'}")

📄 Static Padding Created
   Character count: 5,345
   Estimated tokens: ~1,336
   Cache eligible: ✅ Yes (>1024 tokens)


## Step 4: Implement Core Benchmark Functions

We implement three core functions:

1. **`calculate_cost()`**: Computes the cost in USD based on token usage
2. **`check_answer()`**: Validates if the response contains expected keywords
3. **`test_with_cache_key()`**: Executes a single benchmark request with cache support

In [6]:
def calculate_cost(model: str, input_tokens: int, output_tokens: int, 
                   cached_tokens: int = 0) -> float:
    """
    Calculate the request cost in USD based on token usage.
    
    Parameters:
        model (str): Model name (e.g., 'gpt-4o', 'gpt-5.1')
        input_tokens (int): Total input tokens
        output_tokens (int): Total output tokens
        cached_tokens (int): Number of cached input tokens (default: 0)
    
    Returns:
        float: Cost in USD
    """
    pricing = PRICING[model]
    
    # Uncached tokens = total input - cached
    uncached_tokens = input_tokens - cached_tokens
    
    # Calculate cost: (uncached * input_price + cached * cached_price + output * output_price)
    cost = (
        uncached_tokens * pricing["input"] +
        cached_tokens * pricing["cached_input"] +
        output_tokens * pricing["output"]
    ) / 1_000_000  # Convert from per-1M to actual
    
    return cost


def check_answer(response: str, correct_variants: List[str]) -> bool:
    """
    Check if the response contains any of the expected answer variants.
    
    Parameters:
        response (str): The model's response text
        correct_variants (List[str]): List of acceptable answer keywords
    
    Returns:
        bool: True if any variant is found in the response
    """
    response_lower = response.lower().strip()
    
    for variant in correct_variants:
        if variant.lower() in response_lower:
            return True
    return False


# Test the helper functions
print("✅ Helper functions defined:")
print("   - calculate_cost(model, input_tokens, output_tokens, cached_tokens)")
print("   - check_answer(response, correct_variants)")

# Example cost calculation
example_cost = calculate_cost("gpt-4o", input_tokens=1000, output_tokens=100, cached_tokens=500)
print(f"\n📊 Example cost calculation (gpt-4o, 1000 input, 100 output, 500 cached):")
print(f"   Cost: ${example_cost:.6f}")

✅ Helper functions defined:
   - calculate_cost(model, input_tokens, output_tokens, cached_tokens)
   - check_answer(response, correct_variants)

📊 Example cost calculation (gpt-4o, 1000 input, 100 output, 500 cached):
   Cost: $0.002875
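
Note that `check_answer()` returns `True` as soon as any single variant matches. For scenarios that ask for several facts at once (such as revenue and growth rate), a response containing only one of them would still count as correct. The sketch below shows a stricter alternative that requires every keyword; `check_answer_all` is a hypothetical helper, not part of the notebook.

```python
from typing import List


def check_answer_all(response: str, required_keywords: List[str]) -> bool:
    """Return True only if every keyword appears in the response (case-insensitive)."""
    response_lower = response.lower()
    return all(keyword.lower() in response_lower for keyword in required_keywords)


partial = "Revenue was $2.76 billion."
complete = partial + " Growth was 26%."

print(check_answer_all(partial, ["2.76", "26"]))   # growth rate is missing
print(check_answer_all(complete, ["2.76", "26"]))  # both facts present
```

Which variant to use depends on the scenario: synonym lists (for example "sorry" or "apologize") suit the any-match check, while multi-fact extraction suits the all-match check.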


In [7]:
def test_with_cache_key(
    client: AzureOpenAI,
    model: str,
    instructions: str,
    question: str,
    cache_key: str,
    reasoning_effort: Optional[str] = None,
    stream: bool = False
) -> Dict[str, Any]:
    """
    Test a model using Responses API with prompt_cache_key.
    
    Parameters:
        client: OpenAI client configured for Responses API
        model (str): Model name (gpt-4o or gpt-5.1)
        instructions (str): System instructions (should be >1024 tokens for caching)
        question (str): User question
        cache_key (str): Prompt cache key for cache routing
        reasoning_effort (str): For GPT-5.1, set to "none", "low", "medium", or "high"
        stream (bool): Whether to use streaming mode
    
    Returns:
        dict: Contains latency, tokens, content, and success status
    """
    try:
        # Build request parameters
        params = {
            "model": model,
            "instructions": instructions,
            "input": question,
            "max_output_tokens": 100,
            "extra_body": {"prompt_cache_key": cache_key}
        }
        
        # Add reasoning effort whenever the caller supplies one; pass None
        # for models that do not accept the `reasoning` parameter
        if reasoning_effort:
            params["reasoning"] = {"effort": reasoning_effort}
        
        # Add streaming flag
        if stream:
            params["stream"] = True
        
        # Record start time
        start_time = time.time()
        
        if stream:
            # =====================
            # STREAMING MODE
            # =====================
            response_stream = client.responses.create(**params)
            content = ""
            input_tokens = 0
            output_tokens = 0
            cached_tokens = 0
            first_token_time = None
            
            for event in response_stream:
                # Record time to first token
                if first_token_time is None and hasattr(event, 'type'):
                    if event.type in ['response.output_item.added', 
                                      'response.content_part.added', 
                                      'response.output_text.delta']:
                        first_token_time = time.time() - start_time
                
                # Extract text from delta events
                if hasattr(event, 'type') and event.type == 'response.output_text.delta':
                    if hasattr(event, 'delta'):
                        content += event.delta
                
                # Extract usage from completed event
                if hasattr(event, 'type') and event.type == 'response.completed':
                    if hasattr(event, 'response') and hasattr(event.response, 'usage'):
                        usage = event.response.usage
                        input_tokens = usage.input_tokens
                        output_tokens = usage.output_tokens
                        cached_details = getattr(usage, 'input_tokens_details', None)
                        if cached_details:
                            cached_tokens = getattr(cached_details, 'cached_tokens', 0)
            
            latency = time.time() - start_time
            
            return {
                "success": True,
                "latency": latency,
                "first_token_time": first_token_time or latency,
                "input_tokens": input_tokens,
                "output_tokens": output_tokens,
                "cached_tokens": cached_tokens,
                "content": content,
                "stream": True
            }
        else:
            # =====================
            # NON-STREAMING MODE
            # =====================
            response = client.responses.create(**params)
            latency = time.time() - start_time
            
            # Extract token usage
            usage = response.usage
            input_tokens = usage.input_tokens
            output_tokens = usage.output_tokens
            cached_details = getattr(usage, 'input_tokens_details', {})
            cached_tokens = getattr(cached_details, 'cached_tokens', 0) if cached_details else 0
            
            # Extract response content
            content = ""
            if response.output:
                for item in response.output:
                    if hasattr(item, 'content'):
                        for c in item.content:
                            if hasattr(c, 'text'):
                                content += c.text
            
            return {
                "success": True,
                "latency": latency,
                "first_token_time": latency,  # Same as latency for non-streaming
                "input_tokens": input_tokens,
                "output_tokens": output_tokens,
                "cached_tokens": cached_tokens,
                "content": content,
                "stream": False
            }
            
    except Exception as e:
        # Return error result
        return {
            "success": False,
            "error": str(e),
            "latency": 0,
            "first_token_time": 0,
            "input_tokens": 0,
            "output_tokens": 0,
            "cached_tokens": 0,
            "content": "",
            "stream": stream
        }


print("✅ test_with_cache_key() function defined")
print("   - Supports both streaming and non-streaming modes")
print("   - Tracks latency, token usage, and cache hits")
print("   - Returns structured result dictionary")

✅ test_with_cache_key() function defined
   - Supports both streaming and non-streaming modes
   - Tracks latency, token usage, and cache hits
   - Returns structured result dictionary
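
The result dictionary makes it straightforward to derive a per-request cache hit ratio. The following is a minimal sketch, assuming a dictionary with the same `success`, `input_tokens`, and `cached_tokens` keys that `test_with_cache_key()` returns; `cache_hit_ratio` is a hypothetical helper and the token counts are made up.

```python
from typing import Any, Dict


def cache_hit_ratio(result: Dict[str, Any]) -> float:
    """Fraction of input tokens served from the prompt cache (0.0 to 1.0)."""
    input_tokens = result.get("input_tokens", 0)
    if not result.get("success") or input_tokens <= 0:
        return 0.0
    return result.get("cached_tokens", 0) / input_tokens


# Hypothetical result dictionaries with made-up token counts.
hit = {"success": True, "input_tokens": 1400, "cached_tokens": 1280}
miss = {"success": False, "error": "timeout", "input_tokens": 0, "cached_tokens": 0}

print(f"{cache_hit_ratio(hit):.1%}")
print(f"{cache_hit_ratio(miss):.1%}")
```

Guarding on `success` matters because failed requests report zero tokens, which would otherwise cause a division by zero.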


## Step 5: Cache Warmup

Before running the actual benchmark, we need to **warm up the cache** for both models.

**Why warmup matters:**
- First requests always miss the cache (cold start)
- Subsequent requests benefit from cached prompts
- Warmup ensures fair measurement of cache effectiveness

We send 3 warmup requests per model with a simple question to populate the cache.
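
A fixed number of warmup requests does not by itself confirm that the cache was populated. The sketch below shows one way to verify it by checking `cached_tokens` on each warmup result; `warm_up` and `fake_send` are hypothetical names, and the stub stands in for a real request (in the notebook, `send` could wrap `test_with_cache_key()`).

```python
from typing import Any, Callable, Dict


def warm_up(send: Callable[[], Dict[str, Any]], max_runs: int = 3) -> bool:
    """Call `send` up to `max_runs` times; return True once a cache hit is seen."""
    for _ in range(max_runs):
        result = send()
        if result.get("success") and result.get("cached_tokens", 0) > 0:
            return True
    return False


# Stub standing in for a real request: the first call misses, later calls hit.
calls = {"n": 0}


def fake_send() -> Dict[str, Any]:
    calls["n"] += 1
    cached = 0 if calls["n"] == 1 else 1280
    return {"success": True, "cached_tokens": cached}


print(warm_up(fake_send))
```

If `warm_up` returns `False`, the measured runs would start against a cold cache, so the benchmark's cache statistics should be interpreted with that in mind.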

In [8]:
# Benchmark configuration
NUM_RUNS = 3  # Number of runs per scenario (adjust for more accuracy)
STREAM_MODE = False  # Set to True to enable streaming mode
CACHE_KEY = "benchmark_migration_v2"  # Consistent cache key for all requests

# Build full instructions (static padding + task instruction)
INSTRUCTIONS = STATIC_PADDING + """
You are a helpful assistant. Answer concisely and directly. 
For questions requiring a specific format, follow the format exactly.
"""

print("🔧 Benchmark Configuration")
print("=" * 60)
print(f"   Runs per scenario: {NUM_RUNS}")
print(f"   Streaming mode: {'✅ Enabled' if STREAM_MODE else '❌ Disabled'}")
print(f"   Cache key: {CACHE_KEY}")
print(f"   Total scenarios: {len(TEST_SCENARIOS)}")
print(f"   Instructions length: ~{len(INSTRUCTIONS)//4:,} tokens")

🔧 Benchmark Configuration
   Runs per scenario: 3
   Streaming mode: ❌ Disabled
   Cache key: benchmark_migration_v2
   Total scenarios: 7
   Instructions length: ~1,369 tokens


In [9]:
# Phase 1: Cache Warmup
print("\n" + "=" * 80)
print(" PHASE 1: CACHE WARMUP")
print("=" * 80)

# -----------------------------------------------------------------------------
# Select TWO models to compare
# - label: Used in tables/keys (must be unique)
# - api_model: Value passed into client.responses.create(model=...)
# - price_key: Key used for PRICING / calculate_cost()
# - reasoning_effort: Only applies to GPT-5.x reasoning models (optional)
# -----------------------------------------------------------------------------
MODEL_A = {
    "label": "gpt-4o",
    "api_model": "gpt-4o",
    "price_key": "gpt-4o",
    "reasoning_effort": None,
}
MODEL_B = {
    "label": "gpt-5.1",
    "api_model": "gpt-5.1",
    "price_key": "gpt-5.1",
    "reasoning_effort": "none",  # none/low/medium/high (only for GPT-5.x)
}

MODELS_TO_TEST = [MODEL_A, MODEL_B]
MODEL_LABELS = [m["label"] for m in MODELS_TO_TEST]

# Basic validation
if len(MODELS_TO_TEST) != 2:
    raise ValueError("This notebook expects exactly TWO models in MODELS_TO_TEST")
if len(set(MODEL_LABELS)) != 2:
    raise ValueError("MODEL_A and MODEL_B labels must be unique")
for m in MODELS_TO_TEST:
    if m["price_key"] not in PRICING:
        raise KeyError(f"Missing PRICING entry for price_key='{m['price_key']}'")

print("\n🧪 Models Selected")
print("-" * 60)
for m in MODELS_TO_TEST:
    effort = m.get("reasoning_effort")
    effort_label = f" (reasoning_effort={effort})" if effort else ""
    print(f"  • {m['label']}: api_model={m['api_model']} price_key={m['price_key']}{effort_label}")
print("-" * 60)

warmup_question = "What is 2+2?"
warmup_runs = 3

for m in MODELS_TO_TEST:
    label = m["label"]
    api_model = m["api_model"]
    effort = m.get("reasoning_effort")

    print(f"\n  🔥 Warming up {label}...", end="", flush=True)
    for i in range(warmup_runs):
        result = test_with_cache_key(
            client=client,
            model=api_model,
            instructions=INSTRUCTIONS,
            question=warmup_question,
            cache_key=CACHE_KEY,
            reasoning_effort=effort,
        )

        if result["success"]:
            print(".", end="", flush=True)
        else:
            print("x", end="", flush=True)

        time.sleep(0.3)  # Brief pause between requests
    print(" done")

# Wait for cache to stabilize
print("\n  ⏳ Waiting 2 seconds for cache to stabilize...")
time.sleep(2)
print("  ✅ Cache warmup complete!")


 PHASE 1: CACHE WARMUP

🧪 Models Selected
------------------------------------------------------------
  • gpt-4o: api_model=gpt-4o price_key=gpt-4o
  • gpt-5.1: api_model=gpt-5.1 price_key=gpt-5.1 (reasoning_effort=none)
------------------------------------------------------------

  🔥 Warming up gpt-4o...... done

  🔥 Warming up gpt-5.1...... done

  ⏳ Waiting 2 seconds for cache to stabilize...
  ✅ Cache warmup complete!


## Step 6: Run Benchmark

Now we run the actual benchmark across all test scenarios. For each scenario:

1. Test two selected models (configured in `MODEL_A` and `MODEL_B`)
2. Run multiple times (configurable via `NUM_RUNS`)
3. Collect latency, token usage, and accuracy metrics
4. Calculate costs using prompt caching

In [None]:
# Phase 2: Benchmark Measurement
print("\n" + "=" * 80)
print(" PHASE 2: BENCHMARK MEASUREMENT")
print("=" * 80)

# Initialize results storage
scenario_results = []
results = {
    m["label"]: {
        "latency": [],
        "first_token_time": [],
        "input_tokens": [],
        "output_tokens": [],
        "cached_tokens": [],
        "correct": 0,
        "total": 0,
        "cost": 0,
    }
    for m in MODELS_TO_TEST
}

# Convenience for 2-model reporting
MODEL_1_LABEL = MODEL_A["label"]
MODEL_2_LABEL = MODEL_B["label"]

# Run benchmark for each scenario
for scenario_idx, scenario in enumerate(TEST_SCENARIOS):
    category = scenario["category"]
    name = scenario["name"]
    question = scenario["question"]
    language = scenario["language"]

    print(f"\n  [{scenario_idx + 1}/{len(TEST_SCENARIOS)}] {category} - {name} ({language})")

    # Store scenario-level data
    scenario_data = {
        "id": scenario_idx + 1,
        "category": category,
        "name": name,
        "language": language,
    }
    for m in MODELS_TO_TEST:
        scenario_data[m["label"]] = {}

    # Test each model
    for m in MODELS_TO_TEST:
        display_name = m["label"]
        actual_model = m["api_model"]
        effort = m.get("reasoning_effort")
        price_key = m["price_key"]

        run_latencies = []
        run_first_token = []
        run_input = []
        run_output = []
        run_cached = []
        correct = 0

        # Multiple runs per scenario
        for run in range(NUM_RUNS):
            result = test_with_cache_key(
                client=client,
                model=actual_model,
                instructions=INSTRUCTIONS,
                question=question,
                cache_key=CACHE_KEY,
                reasoning_effort=effort,
                stream=STREAM_MODE,
            )

            if result["success"]:
                # Collect metrics
                run_latencies.append(result["latency"])
                run_first_token.append(result.get("first_token_time", result["latency"]))
                run_input.append(result["input_tokens"])
                run_output.append(result["output_tokens"])
                run_cached.append(result["cached_tokens"])

                # Aggregate results
                results[display_name]["latency"].append(result["latency"])
                results[display_name]["first_token_time"].append(result.get("first_token_time", result["latency"]))
                results[display_name]["input_tokens"].append(result["input_tokens"])
                results[display_name]["output_tokens"].append(result["output_tokens"])
                results[display_name]["cached_tokens"].append(result["cached_tokens"])

                # Check accuracy
                if check_answer(result["content"], scenario["answer_variants"]):
                    correct += 1
                    results[display_name]["correct"] += 1

                results[display_name]["total"] += 1
                time.sleep(0.2)  # Brief pause between requests
            else:
                print(f"      ⚠️ Error: {result.get('error', '')}")

        # Calculate scenario metrics for this model
        avg_latency = sum(run_latencies) / len(run_latencies) if run_latencies else 0
        avg_first_token = sum(run_first_token) / len(run_first_token) if run_first_token else 0
        avg_input = sum(run_input) / len(run_input) if run_input else 0
        avg_output = sum(run_output) / len(run_output) if run_output else 0
        avg_cached = sum(run_cached) / len(run_cached) if run_cached else 0
        cache_pct = (avg_cached / avg_input * 100) if avg_input > 0 else 0
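        # Failed runs count as incorrect here; the aggregated accuracy in Step 7
        # divides by successful runs only (results[...]["total"]).
        accuracy = (correct / NUM_RUNS * 100) if NUM_RUNS > 0 else 0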

        # Calculate cost for this scenario
        scenario_cost = calculate_cost(price_key, sum(run_input), sum(run_output), sum(run_cached))
        results[display_name]["cost"] += scenario_cost

        # Store scenario data
        scenario_data[display_name] = {
            "avg_latency": round(avg_latency, 3),
            "avg_first_token_time": round(avg_first_token, 3),
            "avg_input_tokens": round(avg_input, 0),
            "avg_output_tokens": round(avg_output, 0),
            "avg_cached_tokens": round(avg_cached, 0),
            "cache_hit_pct": round(cache_pct, 1),
            "accuracy": round(accuracy, 1),
            "cost": round(scenario_cost * 1000, 4),  # Cost in milli-dollars
        }

        # Display results
        status = "✅" if accuracy == 100 else "⚠️" if accuracy >= 50 else "❌"
        effort_label = f" (effort={effort})" if effort else ""
        stream_label = " [stream]" if STREAM_MODE else ""
        ttft_info = f" TTFT:{avg_first_token:.3f}s" if STREAM_MODE else ""

        print(
            f"      {display_name}{effort_label}{stream_label}: {avg_latency:.3f}s{ttft_info} | "
            f"in:{avg_input:.0f} out:{avg_output:.0f} cache:{cache_pct:.1f}% | "
            f"acc:{accuracy:.0f}% {status}"
        )

    scenario_results.append(scenario_data)

print("\n" + "=" * 80)
print(" BENCHMARK COMPLETE")
print("=" * 80)


 PHASE 2: BENCHMARK MEASUREMENT

  [1/7] Short - Intent Classification (EN)
      gpt-4o: 0.655s | in:1085 out:2 cache:94.4% | acc:100% ✅
      gpt-5.1 (effort=none): 1.461s | in:1084 out:12 cache:94.5% | acc:100% ✅

  [2/7] Short - Sentiment Analysis (EN)
      gpt-4o: 0.985s | in:1079 out:2 cache:63.3% | acc:100% ✅
      gpt-5.1 (effort=none): 1.810s | in:1078 out:11 cache:95.0% | acc:100% ✅

  [3/7] Medium - RAG Number Extraction (EN)
      gpt-4o: 0.835s | in:1094 out:16 cache:93.6% | acc:100% ✅
      gpt-5.1 (effort=none): 1.495s | in:1093 out:30 cache:93.7% | acc:100% ✅

  [4/7] Medium - RAG Fact Extraction (EN)
      gpt-4o: 0.607s | in:1086 out:12 cache:94.3% | acc:100% ✅
      gpt-5.1 (effort=none): 1.610s | in:1085 out:20 cache:94.4% | acc:100% ✅

  [5/7] Medium - Code Explanation (EN)
      gpt-4o: 1.750s | in:1089 out:66 cache:94.0% | acc:100% ✅


## Step 7: Analyze Results

Now we aggregate the results and generate a comprehensive comparison report.
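
The two headline comparisons in this step use opposite sign conventions: the latency difference is positive when Model 2 is slower than Model 1, while cost savings is positive when Model 2 is cheaper. A minimal sketch of that arithmetic, where `percent_change` is a hypothetical helper (not defined in the notebook) and the sample latencies are taken from the first scenario of the output above:

```python
def percent_change(baseline, candidate):
    """Relative change of candidate versus baseline, in percent (0 if baseline is 0)."""
    if baseline == 0:
        return 0.0
    return (candidate - baseline) / baseline * 100


# Latency: positive means the candidate (Model 2) is slower than the baseline (Model 1)
latency_change = percent_change(0.655, 1.461)

# Cost savings is the same quantity with the sign flipped:
# positive means the candidate is cheaper than the baseline
savings = -percent_change(2.0, 1.0)

print(f"latency change: {latency_change:+.1f}%")
print(f"cost savings: {savings:.1f}%")
```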

In [None]:
# Phase 3: Calculate Aggregated Metrics
print("\n" + "=" * 80)
print(" PHASE 3: ANALYZING RESULTS")
print("=" * 80)

# Calculate aggregated metrics for each selected model
for model in MODEL_LABELS:
    r = results[model]

    if r["latency"]:
        r["avg_latency"] = sum(r["latency"]) / len(r["latency"])
        r["avg_first_token"] = (
            sum(r["first_token_time"]) / len(r["first_token_time"])
            if r["first_token_time"]
            else r["avg_latency"]
        )
        r["avg_input"] = sum(r["input_tokens"]) / len(r["input_tokens"])
        r["avg_output"] = sum(r["output_tokens"]) / len(r["output_tokens"])
        r["avg_cached"] = sum(r["cached_tokens"]) / len(r["cached_tokens"])
        r["cache_pct"] = (r["avg_cached"] / r["avg_input"] * 100) if r["avg_input"] > 0 else 0
        r["accuracy"] = (r["correct"] / r["total"] * 100) if r["total"] > 0 else 0
        r["total_input"] = sum(r["input_tokens"])
        r["total_output"] = sum(r["output_tokens"])
        r["total_cached"] = sum(r["cached_tokens"])

# Shorthand references for 2-model comparisons
r1 = results[MODEL_1_LABEL]
r2 = results[MODEL_2_LABEL]

# Calculate differences (Model 2 vs Model 1)
if r1.get("avg_latency", 0) > 0:
    latency_diff = (r2.get("avg_latency", 0) - r1["avg_latency"]) / r1["avg_latency"] * 100
else:
    latency_diff = 0

if r1.get("cost", 0) > 0:
    cost_savings = (r1["cost"] - r2.get("cost", 0)) / r1["cost"] * 100
else:
    cost_savings = 0

print("\n✅ Aggregation complete")

In [None]:
# Display Executive Summary
print("\n" + "=" * 80)
print(" EXECUTIVE SUMMARY")
print("=" * 80)

print("\n📊 Performance Comparison")
print("-" * 80)
print(f"{'Metric':<20} {MODEL_1_LABEL:>18} {MODEL_2_LABEL:>18} {'Difference':>18}")
print("-" * 80)

# Latency
print(
    f"{'Avg Latency':<20} "
    f"{r1.get('avg_latency', 0):>17.3f}s "
    f"{r2.get('avg_latency', 0):>17.3f}s "
    f"{latency_diff:>+17.1f}%"
)

# TTFT (if streaming)
if STREAM_MODE:
    print(
        f"{'Avg TTFT':<20} "
        f"{r1.get('avg_first_token', 0):>17.3f}s "
        f"{r2.get('avg_first_token', 0):>17.3f}s"
    )

# Accuracy
acc_diff = r2.get('accuracy', 0) - r1.get('accuracy', 0)
print(
    f"{'Accuracy':<20} "
    f"{r1.get('accuracy', 0):>17.1f}% "
    f"{r2.get('accuracy', 0):>17.1f}% "
    f"{acc_diff:>+17.1f}%"
)

# Cache Hit Rate
cache_diff = r2.get('cache_pct', 0) - r1.get('cache_pct', 0)
print(
    f"{'Cache Hit Rate':<20} "
    f"{r1.get('cache_pct', 0):>17.1f}% "
    f"{r2.get('cache_pct', 0):>17.1f}% "
    f"{cache_diff:>+17.1f}%"
)

# Total Cost
print(
    f"{'Total Cost':<20} "
    f"${r1.get('cost', 0):>16.6f} "
    f"${r2.get('cost', 0):>16.6f} "
    f"{-cost_savings:>+17.1f}%"
)

print("-" * 80)
print(f"{'💵 Cost Savings (Model2 vs Model1)':<40} {cost_savings:>38.1f}%")
print("=" * 80)

In [None]:
# Display Token Usage Summary
print("\n📦 Token Usage Summary")
print("-" * 80)
print(f"{'Metric':<25} {MODEL_1_LABEL:>18} {MODEL_2_LABEL:>18}")
print("-" * 80)
print(f"{'Total Input Tokens':<25} {r1.get('total_input', 0):>18,.0f} {r2.get('total_input', 0):>18,.0f}")
print(f"{'Total Output Tokens':<25} {r1.get('total_output', 0):>18,.0f} {r2.get('total_output', 0):>18,.0f}")
print(f"{'Total Cached Tokens':<25} {r1.get('total_cached', 0):>18,.0f} {r2.get('total_cached', 0):>18,.0f}")
print("-" * 80)
print(f"{'Avg Input/Request':<25} {r1.get('avg_input', 0):>18,.0f} {r2.get('avg_input', 0):>18,.0f}")
print(f"{'Avg Output/Request':<25} {r1.get('avg_output', 0):>18,.0f} {r2.get('avg_output', 0):>18,.0f}")
print(f"{'Avg Cached/Request':<25} {r1.get('avg_cached', 0):>18,.0f} {r2.get('avg_cached', 0):>18,.0f}")
print("=" * 80)

In [None]:
# Display Detailed Results by Scenario
print("\n📋 Detailed Results by Scenario")
print("=" * 120)

# Group scenarios by category
categories = {}
for s in scenario_results:
    cat = s["category"]
    if cat not in categories:
        categories[cat] = []
    categories[cat].append(s)

for cat, scenarios in categories.items():
    print(f"\n### {cat} Response Scenarios")
    print("-" * 120)
    print(
        f"{'#':<3} {'Scenario':<30} {'Lang':<5} "
        f"{MODEL_1_LABEL + ' Lat':>12} {MODEL_2_LABEL + ' Lat':>12} "
        f"{MODEL_1_LABEL + ' Cache':>14} {MODEL_2_LABEL + ' Cache':>14} "
        f"{MODEL_1_LABEL + ' Acc':>12} {MODEL_2_LABEL + ' Acc':>12}"
    )
    print("-" * 120)

    for s in scenarios:
        m1 = s.get(MODEL_1_LABEL, {})
        m2 = s.get(MODEL_2_LABEL, {})

        acc1 = m1.get("accuracy", 0)
        acc2 = m2.get("accuracy", 0)
        acc1_icon = "✅" if acc1 == 100 else "⚠️" if acc1 >= 50 else "❌"
        acc2_icon = "✅" if acc2 == 100 else "⚠️" if acc2 >= 50 else "❌"

        print(
            f"{s['id']:<3} {s['name'][:28]:<30} {s['language']:<5} "
            f"{m1.get('avg_latency', 0):>11.3f}s {m2.get('avg_latency', 0):>11.3f}s "
            f"{m1.get('cache_hit_pct', 0):>13.1f}% {m2.get('cache_hit_pct', 0):>13.1f}% "
            f"{acc1:>10.0f}%{acc1_icon} {acc2:>10.0f}%{acc2_icon}"
        )

print("\n" + "=" * 120)

In [None]:
# Generate Recommendation
print("\n🎯 RECOMMENDATION")
print("=" * 80)

# Performance analysis
if r2.get("avg_latency", float("inf")) <= r1.get("avg_latency", float("inf")):
    latency_winner = MODEL_2_LABEL
    latency_verdict = "faster or equal"
else:
    latency_winner = MODEL_1_LABEL
    latency_verdict = "faster"

# Accuracy analysis
accuracy_gap = abs(r1.get("accuracy", 0) - r2.get("accuracy", 0))
if accuracy_gap < 1:
    accuracy_verdict = "Both models perform equally"
elif r1.get("accuracy", 0) > r2.get("accuracy", 0):
    accuracy_verdict = f"{MODEL_1_LABEL} has {accuracy_gap:.1f}% higher accuracy"
else:
    accuracy_verdict = f"{MODEL_2_LABEL} has {accuracy_gap:.1f}% higher accuracy"

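# Phrase the cost verdict so that a negative saving reads as a cost increase
if cost_savings >= 0:
    cost_verdict = f"{MODEL_2_LABEL} saves {cost_savings:.1f}% compared to {MODEL_1_LABEL}"
else:
    cost_verdict = f"{MODEL_2_LABEL} costs {abs(cost_savings):.1f}% more than {MODEL_1_LABEL}"

print("\n  📊 Performance:")
print(f"     • Latency: {latency_winner} is {latency_verdict} by {abs(latency_diff):.1f}%")
print(f"     • Accuracy: {accuracy_verdict}")

print("\n  💰 Cost Efficiency:")
print(f"     • {cost_verdict}")
print(f"     • Prompt caching effective: {r1.get('cache_pct', 0):.1f}% ({MODEL_1_LABEL}) / {r2.get('cache_pct', 0):.1f}% ({MODEL_2_LABEL})")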

# Final recommendation (simple heuristic)
if cost_savings > 50 and r2.get("accuracy", 0) >= r1.get("accuracy", 0) - 5:
    print(f"\n  ✅ **{MODEL_2_LABEL} recommended** for cost-sensitive workloads with acceptable latency trade-off")
else:
    print("\n  ⚠️ Evaluate based on your latency, quality, and cost requirements")

print("\n" + "=" * 80)
print(f"\n✅ Benchmark completed: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

In [None]:
# Save results to JSON file
output_data = {
    "timestamp": datetime.now().isoformat(),
    "config": {
        "padding_tokens": len(STATIC_PADDING) // 4,
        "cache_key": CACHE_KEY,
        "runs_per_scenario": NUM_RUNS,
        "total_scenarios": len(TEST_SCENARIOS),
        "streaming": STREAM_MODE,
        "models": MODELS_TO_TEST,
    },
    "summary": {
        MODEL_1_LABEL: {
            "avg_latency": r1.get("avg_latency", 0),
            "accuracy": r1.get("accuracy", 0),
            "cache_pct": r1.get("cache_pct", 0),
            "total_cost": r1.get("cost", 0),
        },
        MODEL_2_LABEL: {
            "avg_latency": r2.get("avg_latency", 0),
            "accuracy": r2.get("accuracy", 0),
            "cache_pct": r2.get("cache_pct", 0),
            "total_cost": r2.get("cost", 0),
        },
    },
    "scenarios": scenario_results,
}

output_filename = f"benchmark_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
with open(output_filename, "w", encoding="utf-8") as f:
    json.dump(output_data, f, indent=2, ensure_ascii=False)

print(f"\n📁 Results saved to: {output_filename}")

## Pricing Reference

Pricing is intentionally kept as a simple dictionary in this notebook so you can extend it easily.

- If you change `MODEL_A` / `MODEL_B`, make sure their `price_key` values exist in `PRICING`.
- If your org uses custom pricing/discounts, update `PRICING` to match your contract.
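
A quick pre-flight check can catch a missing `price_key` before any requests are sent. The sketch below is illustrative only: `missing_price_keys` is a hypothetical helper, the model labels are placeholders, and the shape of the pricing dictionary is an assumption rather than the notebook's actual `PRICING` structure.

```python
def missing_price_keys(models, pricing):
    """Return the price_key values used by models that have no entry in pricing."""
    return sorted({m["price_key"] for m in models} - set(pricing))


# Placeholder model configs; only the "price_key" field matters for this check
example_models = [
    {"label": "model-a", "price_key": "gpt-4o"},
    {"label": "model-b", "price_key": "my-custom-model"},
]
# Assumed pricing shape: one entry per price_key
example_pricing = {"gpt-4o": {"input": 2.50, "cached_input": 1.25, "output": 10.00}}

missing = missing_price_keys(example_models, example_pricing)
if missing:
    print(f"Add pricing entries for: {missing}")
```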

| Model Key (price_key) | Input (per 1M) | Cached Input (per 1M) | Output (per 1M) |
|-------|----------------|----------------------|-----------------|
| gpt-4o | $2.50 | $1.25 | $10.00 |
| gpt-5.1 | $1.25 | $0.13 | $10.00 |

**Tip**: Cached input pricing can dominate total cost for repeated workloads; always benchmark with prompt caching enabled for an apples-to-apples migration comparison.
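
To see where the cached-input discount comes from, the table prices can be compared directly. This is a small arithmetic sketch; `cached_discount_pct` is a hypothetical helper and the inputs are the per-1M prices listed in the table above.

```python
def cached_discount_pct(input_price, cached_input_price):
    """Discount of the cached input price relative to the regular input price, in percent."""
    return (1 - cached_input_price / input_price) * 100


# Prices per 1M tokens, taken from the table above
gpt_4o_discount = cached_discount_pct(2.50, 1.25)
gpt_51_discount = cached_discount_pct(1.25, 0.13)

print(f"gpt-4o cached discount: {gpt_4o_discount:.1f}%")
print(f"gpt-5.1 cached discount: {gpt_51_discount:.1f}%")
```

With the table values this gives 50% for gpt-4o and roughly 90% for gpt-5.1; if your contract prices differ, the discounts will too.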

## Additional Resources

- [Azure OpenAI Service Pricing](https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/)
- [Azure OpenAI Prompt Caching](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/prompt-caching)
- [Azure OpenAI Responses API](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/responses)
- [Model Migration Best Practices](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/model-versions)
- [Azure AI Foundry Cost Optimization](https://learn.microsoft.com/en-us/azure/ai-foundry/control-plane/how-to-optimize-cost-performance?view=foundry)

## Wrap-up

In this notebook, you learned how to benchmark and compare AI models for enterprise migration decisions.

### Key Takeaways

1. **Fair Comparison Methodology**
   - Use the same API (Responses API) for both models
   - Use consistent cache keys and warmup procedures
   - Run each scenario multiple times to average out run-to-run variance

2. **Prompt Caching Benefits**
   - Requires ≥1024 tokens in the prompt prefix
   - Reduces input cost for repeated queries that share the same prefix
   - In the pricing table above, GPT-5.1 offers a ~90% discount on cached tokens vs ~50% for GPT-4o

3. **Migration Decision Factors**
   - **Latency**: Measure actual response times for your workloads
   - **Accuracy**: Validate with your specific use cases
   - **Cost**: Calculate total cost including cache benefits
   - **Trade-offs**: Balance performance vs cost for your requirements

### Suggested Next Steps

1. **Customize Test Scenarios**: Add scenarios specific to your use cases
2. **Increase Runs**: Run more iterations for production-grade benchmarks
3. **Enable Streaming**: Test TTFT (Time-to-First-Token) with `STREAM_MODE = True`
4. **Compare More Models**: Extend `MODELS_TO_TEST` and `PRICING` for other models
5. **Integrate with CI/CD**: Automate benchmarks as part of deployment pipelines
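
For the CI/CD step, one possible approach is a small gate that reads the saved `summary` object and flags regressions. The sketch below is an assumption about how such a gate could look, not part of the notebook: `find_regressions`, the threshold defaults, and the example numbers are all hypothetical. It relies only on the `avg_latency` and `accuracy` fields that the notebook writes to the JSON file.

```python
def find_regressions(summary, baseline, candidate,
                     max_latency_increase_pct=25.0, max_accuracy_drop=2.0):
    """Return a list of regression messages for candidate vs baseline (empty means pass)."""
    base, cand = summary[baseline], summary[candidate]
    problems = []

    # Latency: relative increase of the candidate over the baseline
    if base["avg_latency"] > 0:
        increase = (cand["avg_latency"] - base["avg_latency"]) / base["avg_latency"] * 100
        if increase > max_latency_increase_pct:
            problems.append(f"latency up {increase:.1f}%")

    # Accuracy: absolute drop in percentage points
    drop = base["accuracy"] - cand["accuracy"]
    if drop > max_accuracy_drop:
        problems.append(f"accuracy down {drop:.1f} points")

    return problems


# Made-up numbers shaped like the "summary" section of the saved results file
example_summary = {
    "model-a": {"avg_latency": 1.0, "accuracy": 100.0},
    "model-b": {"avg_latency": 1.5, "accuracy": 95.0},
}
regressions = find_regressions(example_summary, "model-a", "model-b")
print(regressions)
```

In a pipeline, the summary would come from the saved results file (for example via `json.load`), and a non-empty list would fail the job.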