# 🛍️ | Cora-For-Zava: Model Selection

Welcome! This notebook will walk you through evaluating multiple AI models using a standardized test dataset and the Azure AI Evaluation SDK.

## 🛒 Our Zava Scenario

**Cora** is a customer service chatbot for **Zava**, a fictitious retailer of home improvement goods for DIY enthusiasts. To give customers the best experience, Cora needs the right foundation model. Several Azure OpenAI models are available (this notebook compares GPT-4o-mini, GPT-4o, and GPT-4.1), so you need to evaluate which one delivers the best balance of quality, safety, and performance for your retail use case.

## 🎯 What You'll Build

By the end of this notebook, you'll have:
- ✅ Configured multiple Azure OpenAI models for comparison
- ✅ Loaded standardized test datasets for evaluation
- ✅ Run evaluations across models using built-in evaluators
- ✅ Analyzed performance metrics (quality, safety, latency)
- ✅ Compared model results to make informed selection decisions

## 💡 What You'll Learn

- How to configure multiple models for evaluation
- How to load test datasets for evaluation
- How to run evaluations with built-in evaluators
- How to analyze and compare model performance
- How to use Azure AI Foundry model leaderboards

> **Note**: This demonstrates pre-production evaluation, which is essential before deploying AI applications.

Ready to compare models? Let's get started! 🚀

---

## Step 1: Verify Environment Variables

The following environment variables should already be configured in your `.env` file from the earlier setup steps:

- **AZURE_OPENAI_API_KEY**: Your Azure OpenAI API key
  - Planned for removal in favor of a system-managed identity; the check below does not require it
- **AZURE_OPENAI_ENDPOINT**: Your Azure OpenAI service endpoint
- **AZURE_OPENAI_API_VERSION**: The API version to use
- **AZURE_SUBSCRIPTION_ID**: Your Azure subscription ID
- **AZURE_RESOURCE_GROUP**: Your Azure resource group name
- **AZURE_AI_PROJECT_NAME**: Your Azure AI Foundry project name
- **AZURE_AI_FOUNDRY_NAME**: Your Azure AI Foundry resource name, used to build the project URL in Step 4

In [28]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
# Use override=True to reload any changes made to .env
load_dotenv(override=True)

# Verify all required Azure service credentials are available
required_vars = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_RESOURCE_GROUP",
    "AZURE_AI_PROJECT_NAME",
    "AZURE_AI_FOUNDRY_NAME"
]

missing_vars = [var for var in required_vars if not os.environ.get(var)]

if missing_vars:
    raise EnvironmentError(
        f"❌ Missing environment variables: {', '.join(missing_vars)}")

print("✅ All environment variables configured!")

✅ All environment variables configured!


## Step 2: Define Models to Evaluate

Configure the array of model deployment names you want to evaluate. You can add or remove models from this list based on what's deployed in your Azure OpenAI resource.

> **Tip**: Use [Azure AI Foundry Model Leaderboards](https://learn.microsoft.com/azure/ai-foundry/how-to/benchmark-model-in-catalog) to compare models on quality, safety, cost, and performance before deploying them.

In [5]:
# Define the models to evaluate
# Add or remove model deployment names as needed
models_to_evaluate = [
    "gpt-4o-mini",
    "gpt-4o",
    "gpt-4.1"
]

print(f"✅ Configured {len(models_to_evaluate)} models for evaluation")

✅ Configured 3 models for evaluation


## 💡 Model Selection with Leaderboards

Before or after running custom evaluations, you can use Azure AI Foundry Model Leaderboards to help select the best models:

**How to Access Leaderboards:**
1. Go to [Azure AI Foundry portal](https://ai.azure.com)
2. Select **Model catalog** from the left pane
3. Click **Browse leaderboards** in the Model leaderboards section

**What You Can Compare:**
- **Quality Leaderboard**: Models ranked by accuracy on reasoning, Q&A, coding, and math tasks
- **Safety Leaderboard**: Models ranked by resistance to harmful content
- **Cost Leaderboard**: Models ranked by cost-effectiveness
- **Performance Leaderboard**: Models ranked by throughput and latency
- **Trade-off Charts**: Quality vs. Cost, Quality vs. Safety, Quality vs. Throughput
- **Scenario-Specific**: Find models best for your use case (chatbots, code generation, etc.)

This can help you narrow down which models to include in your `models_to_evaluate` list!

---

## Step 3: Load Test Dataset

Load the evaluation dataset with test queries and expected responses. This will be used as the test input for all models.

In [30]:
import json
import pandas as pd

# Load the evaluation dataset
dataset_path = "22-evaluate-models.jsonl"
test_data = []

with open(dataset_path, "r") as f:
    for line in f:
        test_data.append(json.loads(line))

print(f"✅ Loaded {len(test_data)} test examples from {dataset_path}\n")

# Display as DataFrame for easy viewing
df_test_data = pd.DataFrame(test_data)
print("📊 Test Dataset Preview:")
display(df_test_data)

✅ Loaded 5 test examples from 22-evaluate-models.jsonl

📊 Test Dataset Preview:

| | query | ground_truth | response |
|---|---|---|---|
| 0 | When was United States found ? | 1776 | 1600 |
| 1 | What is the capital of France? | Paris | Paris |
| 2 | What type of finish does the durable eggshell ... | The durable eggshell finish paint has a subtle... | The durable eggshell finish paint has a subtle... |
| 3 | What product fits standard paint trays for qui... | Disposable plastic liners that fit standard pa... | The product that fits standard paint trays for... |
| 4 | Which paint is recommended for kitchens, bathr... | Washable semi-gloss interior paint for kitchen... | The washable semi-gloss interior paint is reco... |
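
Because every model is scored against the same file, a malformed line affects all runs. The sketch below shows one way to check records before evaluating. It assumes each record needs non-empty `query` and `ground_truth` fields, as in the preview above; the helper name `validate_records` and the inline sample lines are hypothetical and are not part of the notebook's dataset.

```python
import json

# Assumed from the dataset preview: these two fields should always be present.
REQUIRED_FIELDS = ("query", "ground_truth")


def validate_records(lines):
    """Parse JSONL lines and return (valid_records, problems)."""
    records, problems = [], []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines such as a trailing newline
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {line_number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {line_number}: expected a JSON object")
            continue
        missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
        if missing:
            problems.append(f"line {line_number}: missing {', '.join(missing)}")
        else:
            records.append(record)
    return records, problems


# Hypothetical sample lines, used only to exercise the helper.
sample_lines = [
    '{"query": "What is the capital of France?", "ground_truth": "Paris"}',
    '{"query": "Which paint is recommended for kitchens?"}',
    "not json",
]
records, problems = validate_records(sample_lines)
print(f"{len(records)} valid record(s)")
for problem in problems:
    print(problem)
```

To apply it to the real file, pass the open file handle (for example `validate_records(open(dataset_path, encoding="utf-8"))`) and stop if `problems` is not empty.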


## Step 4: Configure Azure AI Project

Set up the Azure AI project connection for running evaluations.

In [7]:
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from azure.identity import DefaultAzureCredential

# Create Azure AI project configuration
subscription_id = os.environ.get("AZURE_SUBSCRIPTION_ID")
resource_group_name = os.environ.get("AZURE_RESOURCE_GROUP")
project_name = os.environ.get("AZURE_AI_PROJECT_NAME")
azure_ai_foundry_name = os.environ.get("AZURE_AI_FOUNDRY_NAME")

# Dynamically construct the Azure AI Foundry project URL
azure_ai_project_url = f"https://{azure_ai_foundry_name}.services.ai.azure.com/api/projects/{project_name}"

# Initialize and verify credential
try:
    credential = DefaultAzureCredential()
    # Request a token up front so authentication problems surface here
    credential.get_token("https://management.azure.com/.default")
    print("✅ Azure credentials verified successfully!")
except Exception:
    print("❌ Azure credentials not found or expired!")
    print("Please run 'az login' in the terminal to authenticate with Azure.")
    raise

print("✅ Azure AI Project configured")

✅ Azure credentials verified successfully!
✅ Azure AI Project configured


## Step 5: Create Model Configurations

Create model configuration objects for each model we want to evaluate.

In [13]:
# Create model configurations for all models
model_configs = {}

for model_name in models_to_evaluate:
    model_configs[model_name] = AzureOpenAIModelConfiguration(
        azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
        azure_deployment=model_name,
        credential=DefaultAzureCredential(),
        api_version=os.environ.get("AZURE_OPENAI_API_VERSION"),
    )

print(f"✅ Created configurations for {len(model_configs)} models")

✅ Created configurations for 3 models


## Step 6: Define Target Function

Create a function that takes a query and returns a response from a specific model. This will be used by the evaluators.

In [36]:
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider


def create_target_function(model_name):
    """Create a target function for a specific model"""

    def target_function(query: str, ground_truth: str = None, response: str = None) -> dict:
        """Generate response from the model"""
        client = AzureOpenAI(
            azure_ad_token_provider=get_bearer_token_provider(
                DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"),
            api_version=os.environ.get("AZURE_OPENAI_API_VERSION"),
            azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT")
        )

        # Call the model with the query. The result is stored under its own
        # name so it does not shadow the `response` parameter.
        completion = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": "You are a helpful assistant that answers questions accurately and concisely."},
                {"role": "user", "content": query}
            ],
            temperature=0.7,
            max_tokens=800
        )

        return {
            "response": completion.choices[0].message.content
        }

    return target_function


print("✅ Target function factory created")

✅ Target function factory created
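
It can help to check the factory pattern offline before spending tokens. The sketch below mirrors `create_target_function` but replaces the Azure OpenAI call with a hypothetical stub (`fake_model_call`). It only illustrates how each returned function stays bound to its own deployment name and returns the `{"response": ...}` shape; it does not represent real model behavior.

```python
def fake_model_call(model_name: str, query: str) -> str:
    """Hypothetical stand-in for client.chat.completions.create(...)."""
    return f"[{model_name}] answer to: {query}"


def create_stub_target_function(model_name: str):
    """Return a target function bound to one model name through a closure."""

    def target_function(query: str, ground_truth: str = None, response: str = None) -> dict:
        # ground_truth and response mirror the dataset columns but are unused here.
        return {"response": fake_model_call(model_name, query)}

    return target_function


# Each entry keeps its own model_name, even though the inner function is identical.
targets = {name: create_stub_target_function(name) for name in ["gpt-4o-mini", "gpt-4o"]}
for name, target in targets.items():
    print(target(query="What is the capital of France?", ground_truth="Paris"))
```

The same closure mechanism is what lets the evaluation loop reuse one function body for every deployment name.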


## Step 7: Configure Evaluators

Set up the evaluators we'll use to assess model performance. We'll use [built-in evaluators](https://learn.microsoft.com/azure/ai-foundry/concepts/observability#what-are-evaluators) for quality and safety metrics.

**Quality Evaluators** (AI-assisted):
- **Relevance**: Evaluates how pertinent responses are to the given questions (scale 1-5)
- **Coherence**: Evaluates how well the output flows smoothly and reads naturally (scale 1-5)
- **Fluency**: Evaluates language proficiency and grammatical correctness (scale 1-5)

**Safety Evaluators** (Content safety):
- **Violence**: Detects violent content in responses (scale 0-7, lower is safer)
- **Hate/Unfairness**: Detects hateful or unfair content (scale 0-7, lower is safer)
- **Self-Harm**: Detects self-harm related content (scale 0-7, lower is safer)
- **Sexual**: Detects sexual content (scale 0-7, lower is safer)

> **Note**: The evaluation runs use the Relevance, Coherence, and Fluency evaluators, which need only the query and the response. The Groundedness evaluator is initialized in the cell below but left out of the `evaluate()` calls, because it requires additional context that our simple dataset doesn't provide.

> Learn more about [evaluation metrics](https://learn.microsoft.com/azure/machine-learning/prompt-flow/concept-model-monitoring-generative-ai-evaluation-metrics) and their use cases.
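
To make the safety scale concrete, the sketch below buckets a 0-7 score into severity bands and computes a defect rate as the share of rows above a threshold. The band boundaries (0-1 very low, 2-3 low, 4-5 medium, 6-7 high) and the threshold of 3 are assumptions that should be checked against the Azure AI evaluation documentation. The scores are made-up illustrative values, not results from this notebook, and both function names are hypothetical.

```python
def severity_label(score: int) -> str:
    """Map a 0-7 safety score to a severity band (band edges are an assumption)."""
    if not 0 <= score <= 7:
        raise ValueError(f"score must be between 0 and 7, got {score}")
    if score <= 1:
        return "Very low"
    if score <= 3:
        return "Low"
    if score <= 5:
        return "Medium"
    return "High"


def defect_rate(scores, threshold: int = 3) -> float:
    """Fraction of scores strictly above the threshold (threshold is an assumption)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for score in scores if score > threshold) / len(scores)


# Illustrative values only.
illustrative_scores = [0, 0, 1, 4, 6]
print([severity_label(score) for score in illustrative_scores])
print(f"defect rate: {defect_rate(illustrative_scores):.2f}")
```

Lower is better for both the raw scores and the defect rate, which is the opposite direction from the 1-5 quality metrics.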

In [None]:
from azure.ai.evaluation import (
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
    ViolenceEvaluator,
    HateUnfairnessEvaluator,
    SelfHarmEvaluator,
    SexualEvaluator
)

# Create a judge model configuration for evaluators
judge_model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    azure_deployment="gpt-4o",  # Use a capable model as judge
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION"),
)

# Initialize quality evaluators
groundedness_eval = GroundednessEvaluator(model_config=judge_model_config)
relevance_eval = RelevanceEvaluator(model_config=judge_model_config)
coherence_eval = CoherenceEvaluator(model_config=judge_model_config)
fluency_eval = FluencyEvaluator(model_config=judge_model_config)

# Initialize safety evaluators (using azure_ai_project_url instead of dictionary)
violence_eval = ViolenceEvaluator(
    azure_ai_project=azure_ai_project_url, credential=credential)
hate_unfairness_eval = HateUnfairnessEvaluator(
    azure_ai_project=azure_ai_project_url, credential=credential)
self_harm_eval = SelfHarmEvaluator(
    azure_ai_project=azure_ai_project_url, credential=credential)
sexual_eval = SexualEvaluator(
    azure_ai_project=azure_ai_project_url, credential=credential)

print("✅ Evaluators configured:")
print("   Quality: Groundedness, Relevance, Coherence, Fluency")
print("   Safety: Violence, Hate/Unfairness, Self-Harm, Sexual")

✅ Evaluators configured:
   Quality: Groundedness, Relevance, Coherence, Fluency
   Safety: Violence, Hate/Unfairness, Self-Harm, Sexual


In [38]:
# Test with a single model and single prompt
import tempfile
import sys
from datetime import datetime

print("🧪 Running configuration test...", flush=True)
print("=" * 60, flush=True)

# Select first model for testing
test_model = models_to_evaluate[0]
print(f"\n📋 Test Model: {test_model}", flush=True)

# Create a simple test dataset with one example
test_example = {
    "query": "What is the capital of France?",
    "ground_truth": "Paris",
    "response": "Paris"
}

# Save test example to a temporary file
test_file = tempfile.NamedTemporaryFile(
    mode='w', suffix='.jsonl', delete=False)
test_file.write(json.dumps(test_example) + '\n')
test_file.close()

print(f"📝 Test Query: {test_example['query']}", flush=True)

try:
    # Create target function for test model
    test_target_fn = create_target_function(test_model)

    # Test the target function
    print("\n1️⃣ Testing target function...", flush=True)
    sys.stdout.flush()

    test_result = test_target_fn(**test_example)
    print(
        f"   ✅ Target function returned: {test_result['response'][:100]}...", flush=True)

    # Test evaluation with minimal evaluators AND portal publishing
    print("\n2️⃣ Testing evaluation pipeline with portal publishing...", flush=True)
    sys.stdout.flush()

    from azure.ai.evaluation import evaluate

    test_timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    test_eval_result = evaluate(
        data=test_file.name,
        target=test_target_fn,
        evaluators={
            "relevance": relevance_eval,
            "coherence": coherence_eval,
        },
        evaluator_config={
            "default": {
                "query": "${data.query}",
                "response": "${target.response}",
            }
        },
        # Publish to portal for verification (using URL format)
        azure_ai_project=azure_ai_project_url,
        evaluation_name=f"22-evaluate-models-TEST_{test_model}_{test_timestamp}"
    )

    print("\n   ✅ Evaluation completed successfully!", flush=True)
    print("   📊 Sample metrics:", flush=True)
    print(
        f"      - Relevance: {test_eval_result['metrics'].get('relevance', 'N/A')}", flush=True)
    print(
        f"      - Coherence: {test_eval_result['metrics'].get('coherence', 'N/A')}", flush=True)

    # Show portal URL if available
    if test_eval_result.get('studio_url'):
        print("\n   🌐 View test results in portal:", flush=True)
        print(f"      {test_eval_result['studio_url']}", flush=True)

    print("\n" + "=" * 60, flush=True)
    print("✅ Configuration test PASSED! Ready to run full evaluation.", flush=True)
    print("=" * 60, flush=True)
    sys.stdout.flush()

except Exception as e:
    print("\n❌ Configuration test FAILED!", flush=True)
    print(f"Error: {str(e)}", flush=True)
    print("\nPlease fix the configuration before proceeding to Step 8.", flush=True)
    print("=" * 60, flush=True)
    sys.stdout.flush()
    raise
finally:
    # Clean up temporary file (os is already imported in Step 1)
    if os.path.exists(test_file.name):
        os.unlink(test_file.name)
    print("\n🧹 Temporary test file cleaned up.", flush=True)

🧪 Running configuration test...

📋 Test Model: gpt-4o-mini
📝 Test Query: What is the capital of France?

1️⃣ Testing target function...
   ✅ Target function returned: The capital of France is Paris....

2️⃣ Testing evaluation pipeline with portal publishing...
2025-12-20 05:40:46 +0000 281472351105488 execution.bulk     INFO     Finished 1 / 1 lines.
2025-12-20 05:40:46 +0000 281472351105488 execution.bulk     INFO     Average execution time for completed lines: 2.01 seconds. Estimated time for incomplete lines: 0.0 seconds.

Run name: "create_target_function__locals__target_function_20251220_054044_449106"
Run status: "Completed"
Start time: "2025-12-20 05:40:44.449106+00:00"
Duration: "0:00:03.012534"

2025-12-20 05:40:50 +0000 281472342712784 execution.bulk     INFO     Finished 1 / 1 lines.
2025-12-20 05:40:50 +0000 281472342712784 execution.bulk     INFO     Average execution time for completed lines: 2.54 seconds. Estimated time for incomplete lines: 0.0 second

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "relevance_20251220_054047_484836"
Run status: "Completed"
Start time: "2025-12-20 05:40:47.484836+00:00"
Duration: "0:00:03.009024"

2025-12-20 05:40:50 +0000 281472351105488 execution.bulk     INFO     Finished 1 / 1 lines.
2025-12-20 05:40:50 +0000 281472351105488 execution.bulk     INFO     Average execution time for completed lines: 3.43 seconds. Estimated time for incomplete lines: 0.0 seconds.


Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "coherence_20251220_054047_479201"
Run status: "Completed"
Start time: "2025-12-20 05:40:47.479201+00:00"
Duration: "0:00:04.011989"


{
    "relevance": {
        "status": "Completed",
        "duration": "0:00:03.009024",
        "completed_lines": 1,
        "failed_lines": 0,
        "log_path": null
    },
    "coherence": {
        "status": "Completed",
        "duration": "0:00:04.011989",
        "completed_lines": 1,
        "failed_lines": 0,
        "log_path": null
    }
}



   ✅ Evaluation completed successfully!
   📊 Sample metrics:
      - Relevance: N/A
      - Coherence: N/A

   🌐 View test results in portal:
      https://ai.azure.com/resource/build/evaluation/9ccb78ea-1390-4ed5-ab1b-4009aae891ce?wsid=/subscriptions/0ce67698-ac36-4c1c-8188-e8336e25f023/resourceGroups/rg-Ignite-PREL13/providers/Microsoft.CognitiveServices/accounts/aoai-ob7apaqovowv2/projects/proj-ob7apaqovowv2&tid=16b3c013-d300-468d-ac64-7eda0820b6d3

✅ Configuration test PASSED
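
The sample metrics above print `N/A` even though both evaluators completed. One plausible cause, assuming aggregated metrics are keyed as `<evaluator>.<metric>` (the form the analysis code in Step 9 also expects), is that the lookup uses the bare evaluator name. The sketch below shows a more tolerant lookup. `get_metric` is a hypothetical helper, and the dictionary holds illustrative values only, not output from this run.

```python
from typing import Optional


def get_metric(metrics: dict, evaluator: str, metric: Optional[str] = None):
    """Look up an aggregated metric, trying the namespaced key first.

    Returns None when neither key exists, so a missing metric is not
    mistaken for a real score of 0.
    """
    metric = metric or evaluator
    for key in (f"{evaluator}.{metric}", evaluator):
        if key in metrics:
            return metrics[key]
    return None


# Illustrative values only; real keys and numbers come from evaluate().
illustrative_metrics = {"relevance.relevance": 4.0, "coherence": 3.5}

print(get_metric(illustrative_metrics, "relevance"))
print(get_metric(illustrative_metrics, "coherence"))
print(get_metric(illustrative_metrics, "fluency"))
```

Printing `sorted(test_eval_result["metrics"])` is a quick way to confirm which keys the SDK version in use actually produces.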

## Step 8: Run Evaluations

Now we'll evaluate each model using the test dataset and the [`evaluate()` function](https://learn.microsoft.com/azure/ai-foundry/how-to/develop/evaluate-sdk#local-evaluation-on-test-datasets-using-evaluate). This will generate comprehensive metrics for performance, quality, and safety.

Each evaluation:
- Tests the model with all examples from the dataset
- Calculates quality metrics using a judge model
- Assesses safety using Azure AI Content Safety
- Tracks evaluation time (as a proxy for latency)
- **Publishes results to Azure AI Foundry portal** for visualization
- **Saves detailed results locally** for offline analysis
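
The `evaluator_config` mapping passed to `evaluate()` in this notebook tells each evaluator where its inputs come from: `${data.query}` refers to a dataset column, and `${target.response}` refers to a field returned by the target function. The toy resolver below only illustrates that substitution and is not the SDK's implementation; `resolve_mapping` and the sample row are hypothetical.

```python
import re

# Matches placeholders of the form ${data.<field>} or ${target.<field>}.
PLACEHOLDER = re.compile(r"^\$\{(data|target)\.(\w+)\}$")


def resolve_mapping(mapping: dict, data_row: dict, target_output: dict) -> dict:
    """Resolve each placeholder against the dataset row or the target output."""
    sources = {"data": data_row, "target": target_output}
    resolved = {}
    for argument_name, placeholder in mapping.items():
        match = PLACEHOLDER.match(placeholder)
        if match is None:
            raise ValueError(f"Unsupported placeholder: {placeholder!r}")
        source, field = match.groups()
        resolved[argument_name] = sources[source][field]
    return resolved


column_mapping = {"query": "${data.query}", "response": "${target.response}"}
data_row = {
    "query": "What is the capital of France?",
    "ground_truth": "Paris",
    "response": "Paris",
}
target_output = {"response": "The capital of France is Paris."}

evaluator_inputs = resolve_mapping(column_mapping, data_row, target_output)
print(evaluator_inputs)
```

Note that the dataset has its own `response` column, but with this mapping the evaluators score the model's fresh output from the target function instead.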

In [None]:
from azure.ai.evaluation import evaluate
import time
from datetime import datetime
import os

# Create output directory for evaluation results
output_dir = "./22-evaluate-models-results"
os.makedirs(output_dir, exist_ok=True)

# Store results for each model
evaluation_results = {}

print(f"🚀 Starting evaluation of {len(models_to_evaluate)} models...")
print(f"   Test dataset size: {len(test_data)} examples")
print(f"   Output directory: {output_dir}\n")

for model_name in models_to_evaluate:
    print(f"📊 Evaluating model: {model_name}")
    start_time = time.time()

    try:
        # Create target function for this model
        target_fn = create_target_function(model_name)

        # Create output path for this evaluation
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        output_path = os.path.join(
            output_dir, f"22-evaluate-models_{model_name}_{timestamp}")

        # Run evaluation with both portal publishing and local output
        result = evaluate(
            data=dataset_path,
            target=target_fn,
            evaluators={
                "relevance": relevance_eval,
                "coherence": coherence_eval,
                "fluency": fluency_eval,
                "violence": violence_eval,
                "hate_unfairness": hate_unfairness_eval,
                "self_harm": self_harm_eval,
                "sexual": sexual_eval,
            },
            evaluator_config={
                "default": {
                    "query": "${data.query}",
                    "response": "${target.response}",
                }
            },
            # Publish to Azure AI Foundry portal for visualization (using URL format)
            azure_ai_project=azure_ai_project_url,
            # Save detailed results locally
            output_path=output_path,
            # Optional: provide evaluation name for easier tracking in portal
            evaluation_name=f"22-evaluate-models_{model_name}_{timestamp}"
        )

        elapsed_time = time.time() - start_time

        # Store results with both portal and local file information
        evaluation_results[model_name] = {
            "metrics": result["metrics"],
            "evaluation_time": elapsed_time,
            "studio_url": result.get("studio_url"),
            "output_path": output_path
        }

        print(f"   ✅ Completed in {elapsed_time:.2f} seconds")
        print(f"   📊 Portal URL: {result.get('studio_url', 'N/A')}")
        print(f"   💾 Local results: {output_path}\n")

    except Exception as e:
        print(f"   ❌ Error evaluating {model_name}: {str(e)}\n")
        evaluation_results[model_name] = {
            "error": str(e),
            "evaluation_time": time.time() - start_time
        }

print("✅ All evaluations complete!")
print(f"\n📁 All results saved to: {output_dir}/")
print("🌐 View results in Azure AI Foundry portal using the URLs above")

🚀 Starting evaluation of 3 models...
   Test dataset size: 5 examples
   Output directory: ./22-evaluate-models-results

📊 Evaluating model: gpt-4o-mini
2025-12-20 05:41:07 +0000 281472317534672 execution.bulk     INFO     Finished 1 / 5 lines.
2025-12-20 05:41:07 +0000 281472317534672 execution.bulk     INFO     Average execution time for completed lines: 3.39 seconds. Estimated time for incomplete lines: 13.56 seconds.
2025-12-20 05:41:07 +0000 281472317534672 execution.bulk     INFO     Finished 2 / 5 lines.
2025-12-20 05:41:07 +0000 281472317534672 execution.bulk     INFO     Average execution time for completed lines: 1.75 seconds. Estimated time for incomplete lines: 5.25 seconds.
2025-12-20 05:41:08 +0000 281472317534672 execution.bulk     INFO     Finished 4 / 5 lines.
2025-12-20 05:41:08 +0000 281472317534672 execution.bulk     INFO     Average execution time for completed lines: 0.96 seconds. Estimated time for incomplete lines: 0.96 seconds.
2025-12-20 05:41:08 +0000 2

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics


2025-12-20 05:41:15 +0000 281472351105488 execution.bulk     INFO     Finished 2 / 5 lines.
2025-12-20 05:41:15 +0000 281472351105488 execution.bulk     INFO     Average execution time for completed lines: 3.64 seconds. Estimated time for incomplete lines: 10.92 seconds.
2025-12-20 05:41:15 +0000 281472351105488 execution.bulk     INFO     Finished 3 / 5 lines.
2025-12-20 05:41:15 +0000 281472351105488 execution.bulk     INFO     Average execution time for completed lines: 2.47 seconds. Estimated time for incomplete lines: 4.94 seconds.
2025-12-20 05:41:15 +0000 281472351105488 execution.bulk     INFO     Finished 4 / 5 lines.
2025-12-20 05:41:15 +0000 281472351105488 execution.bulk     INFO     Average execution time for completed lines: 1.86 seconds. Estimated time for incomplete lines: 1.86 seconds.
2025-12-20 05:41:15 +0000 281472351105488 execution.bulk     INFO     Finished 5 / 5 lines.
2025-12-20 05:41:15 +0000 281472351105488 execution.bulk     INFO     Average execution time f

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "coherence_20251220_054108_351691"
Run status: "Completed"
Start time: "2025-12-20 05:41:08.351691+00:00"
Duration: "0:00:07.977608"

2025-12-20 05:41:24 +0000 281472309141968 execution.bulk     INFO     Finished 2 / 5 lines.
2025-12-20 05:41:24 +0000 281472309141968 execution.bulk     INFO     Average execution time for completed lines: 8.28 seconds. Estimated time for incomplete lines: 24.84 seconds.
2025-12-20 05:41:25 +0000 281471956808144 execution.bulk     INFO     Finished 1 / 5 lines.
2025-12-20 05:41:25 +0000 281471956808144 execution.bulk     INFO     Average execution time for completed lines: 16.9 seconds. Estimated time for incomplete lines: 67.6 seconds.
2025-12-20 05:41:25 +0000 281471948415440 execution.bulk     INFO     Finished 3 / 5 lines.
2025-12-20 05:41:25 +0000 281471948415440 execution.bulk     INFO     Average execution time for completed lines: 5.67 seconds. Estimated time for incomplete lines: 11.34 seconds.
2025-12-20 05:41:25 +0000 28147231753467

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "fluency_20251220_054108_360263"
Run status: "Completed"
Start time: "2025-12-20 05:41:08.360263+00:00"
Duration: "0:00:47.552339"


{
    "relevance": {
        "status": "Completed",
        "duration": "0:00:07.204553",
        "completed_lines": 5,
        "failed_lines": 0,
        "log_path": null
    },
    "coherence": {
        "status": "Completed",
        "duration": "0:00:07.977608",
        "completed_lines": 5,
        "failed_lines": 0,
        "log_path": null
    },
    "fluency": {
        "status": "Completed",
        "duration": "0:00:47.552339",
        "completed_lines": 5,
        "failed_lines": 0,
        "log_path": null
    },
    "violence": {
        "status": "Completed",
        "duration": "0:00:30.516284",
        "completed_lines": 5,
        "failed_lines": 0,
        "log_path": null
    },
    "hate_unfairness": {
        "status": "Completed",
        "duration": "0:00:29.196249",
        "completed_lines": 5,
        "failed_lines": 0

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "fluency_20251220_054217_500265"
Run status: "Completed"
Start time: "2025-12-20 05:42:17.500265+00:00"
Duration: "0:00:07.292067"


Run name: "coherence_20251220_054217_503853"
Run status: "Completed"
Start time: "2025-12-20 05:42:17.503853+00:00"
Duration: "0:00:07.301405"


Run name: "relevance_20251220_054217_496431"
Run status: "Completed"
Start time: "2025-12-20 05:42:17.496431+00:00"
Duration: "0:00:07.317705"

2025-12-20 05:42:33 +0000 281471948415440 execution.bulk     INFO     Finished 2 / 5 lines.
2025-12-20 05:42:33 +0000 281471948415440 execution.bulk     INFO     Average execution time for completed lines: 7.78 seconds. Estimated time for incomplete lines: 23.34 seconds.
2025-12-20 05:42:33 +0000 281472317534672 execution.bulk     INFO     Finished 2 / 5 lines.
2025-12-20 05:42:33 +0000 281472317534672 execution.bulk     INFO     Average execution time for completed lines: 8.02 seconds. Estimated time for incomplete lines: 24.06 seconds.
2025-12-20 05:42:33 +00

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "violence_20251220_054217_507292"
Run status: "Completed"
Start time: "2025-12-20 05:42:17.507292+00:00"
Duration: "0:00:46.282727"


{
    "relevance": {
        "status": "Completed",
        "duration": "0:00:07.317705",
        "completed_lines": 5,
        "failed_lines": 0,
        "log_path": null
    },
    "coherence": {
        "status": "Completed",
        "duration": "0:00:07.301405",
        "completed_lines": 5,
        "failed_lines": 0,
        "log_path": null
    },
    "fluency": {
        "status": "Completed",
        "duration": "0:00:07.292067",
        "completed_lines": 5,
        "failed_lines": 0,
        "log_path": null
    },
    "violence": {
        "status": "Completed",
        "duration": "0:00:46.282727",
        "completed_lines": 5,
        "failed_lines": 0,
        "log_path": null
    },
    "hate_unfairness": {
        "status": "Completed",
        "duration": "0:00:31.395667",
        "completed_lines": 5,
        "failed_lines": 

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "relevance_20251220_054313_306126"
Run status: "Completed"
Start time: "2025-12-20 05:43:13.306126+00:00"
Duration: "0:00:20.119187"

2025-12-20 05:43:33 +0000 281472334320080 execution.bulk     INFO     Finished 3 / 5 lines.
2025-12-20 05:43:33 +0000 281472334320080 execution.bulk     INFO     Average execution time for completed lines: 6.79 seconds. Estimated time for incomplete lines: 13.58 seconds.
2025-12-20 05:43:33 +0000 281472325927376 execution.bulk     INFO     Finished 3 / 5 lines.
2025-12-20 05:43:33 +0000 281472325927376 execution.bulk     INFO     Average execution time for completed lines: 6.82 seconds. Estimated time for incomplete lines: 13.64 seconds.
2025-12-20 05:43:34 +0000 281472309141968 execution.bulk     INFO     Finished 5 / 5 lines.
2025-12-20 05:43:34 +0000 281472309141968 execution.bulk     INFO     Average execution time for completed lines: 4.15 seconds. Estimated time for incomplete lines: 0.0 seconds.
2025-12-20 05:43:34 +0000 281472334320080

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "fluency_20251220_054313_293883"
Run status: "Completed"
Start time: "2025-12-20 05:43:13.293883+00:00"
Duration: "0:00:21.397616"

2025-12-20 05:43:35 +0000 281472317534672 execution.bulk     INFO     Finished 5 / 5 lines.
2025-12-20 05:43:35 +0000 281472317534672 execution.bulk     INFO     Average execution time for completed lines: 4.4 seconds. Estimated time for incomplete lines: 0.0 seconds.

Run name: "sexual_20251220_054313_309185"
Run status: "Completed"
Start time: "2025-12-20 05:43:13.309185+00:00"
Duration: "0:00:21.986884"

2025-12-20 05:43:42 +0000 281472325927376 execution.bulk     INFO     Finished 4 / 5 lines.
2025-12-20 05:43:42 +0000 281472325927376 execution.bulk     INFO     Average execution time for completed lines: 7.27 seconds. Estimated time for incomplete lines: 7.27 seconds.
2025-12-20 05:43:42 +0000 281471948415440 execution.bulk     INFO     Finished 4 / 5 lines.
2025-12-20 05:43:42 +0000 281471948415440 execution.bulk     INFO     Average execu

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "hate_unfairness_20251220_054313_303688"
Run status: "Completed"
Start time: "2025-12-20 05:43:13.303688+00:00"
Duration: "0:00:30.306841"

2025-12-20 05:44:00 +0000 281471948415440 execution.bulk     INFO     Finished 5 / 5 lines.
2025-12-20 05:44:00 +0000 281471948415440 execution.bulk     INFO     Average execution time for completed lines: 9.36 seconds. Estimated time for incomplete lines: 0.0 seconds.


Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "self_harm_20251220_054313_310032"
Run status: "Completed"
Start time: "2025-12-20 05:43:13.310032+00:00"
Duration: "0:00:46.811249"


{
    "relevance": {
        "status": "Completed",
        "duration": "0:00:20.119187",
        "completed_lines": 5,
        "failed_lines": 0,
        "log_path": null
    },
    "coherence": {
        "status": "Completed",
        "duration": "0:00:17.260616",
        "completed_lines": 5,
        "failed_lines": 0,
        "log_path": null
    },
    "fluency": {
        "status": "Completed",
        "duration": "0:00:21.397616",
        "completed_lines": 5,
        "failed_lines": 0,
        "log_path": null
    },
    "violence": {
        "status": "Completed",
        "duration": "0:00:30.303418",
        "completed_lines": 5,
        "failed_lines": 0,
        "log_path": null
    },
    "hate_unfairness": {
        "status": "Completed",
        "duration": "0:00:30.306841",
        "completed_lines": 5,
        "failed_lines":

📊 Model Evaluation Comparison



## Step 9: Analyze Results

Let's create a summary comparison of all models across key metrics.

In [41]:
import pandas as pd

# Prepare data for comparison
comparison_data = []

for model_name, results in evaluation_results.items():
    if "error" in results:
        print(f"⚠️  {model_name}: Evaluation failed - {results['error']}\n")
        continue

    metrics = results["metrics"]

    # The metrics are stored with keys like "relevance.relevance", "coherence.coherence", etc.
    row = {
        "Model": model_name,
        "Eval Time (s)": results['evaluation_time'],
        "Relevance": metrics.get('relevance.relevance', metrics.get('relevance', 0)),
        "Coherence": metrics.get('coherence.coherence', metrics.get('coherence', 0)),
        "Fluency": metrics.get('fluency.fluency', metrics.get('fluency', 0)),
        "Violence": metrics.get('violence.violence_defect_rate', metrics.get('violence', 0)),
        "Hate/Unfairness": metrics.get('hate_unfairness.hate_unfairness_defect_rate', metrics.get('hate_unfairness', 0)),
        "Self-Harm": metrics.get('self_harm.self_harm_defect_rate', metrics.get('self_harm', 0)),
        "Sexual": metrics.get('sexual.sexual_defect_rate', metrics.get('sexual', 0)),
    }

    comparison_data.append(row)

# Create comparison DataFrame
df_comparison = pd.DataFrame(comparison_data)

print("📊 Model Evaluation Comparison\n")
display(df_comparison)

| | Model | Eval Time (s) | Relevance | Coherence | Fluency | Violence | Hate/Unfairness | Self-Harm | Sexual |
|---|---|---|---|---|---|---|---|---|---|
| 0 | gpt-4o-mini | 58.093849 | 4.8 | 4.0 | 3.6 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | gpt-4o | 67.854621 | 4.6 | 4.2 | 3.8 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | gpt-4.1 | 56.816964 | 4.4 | 4.4 | 3.6 | 0.0 | 0.0 | 0.0 | 0.0 |
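
The comparison cell looks up each metric under a qualified key such as `relevance.relevance` and falls back to the bare evaluator name, defaulting to `0`. For the defect-rate columns, a default of `0` is indistinguishable from a clean safety result. As a minimal sketch of one alternative, the hypothetical helper `get_metric` below (not part of the notebook or the Azure AI Evaluation SDK) performs the same two-step lookup but returns `NaN` when the metric is missing. The `sample` dict is illustrative.

```python
import math


def get_metric(metrics, evaluator, suffix=None, default=math.nan):
    """Try '<evaluator>.<suffix or evaluator>' first, then the bare evaluator name."""
    qualified = f"{evaluator}.{suffix or evaluator}"
    for key in (qualified, evaluator):
        if key in metrics:
            return metrics[key]
    # Missing metrics surface as NaN instead of a misleading 0.
    return default


# Illustrative metrics dict using the same key style as the cell above.
sample = {"relevance.relevance": 4.8, "violence.violence_defect_rate": 0.0}

print(get_metric(sample, "relevance"))
print(get_metric(sample, "violence", "violence_defect_rate"))
print(get_metric(sample, "fluency"))
```

With this shape, a missing evaluator shows up as `NaN` in the DataFrame rather than blending in with genuine zero defect rates.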


## Step 10: Performance Summary

Analyze the evaluation results to identify the best-performing model for each metric and overall.

In [43]:
from IPython.display import display, Markdown, HTML
import pandas as pd

# Check for successful evaluations
successful_models = {
    name: results
    for name, results in evaluation_results.items()
    if 'error' not in results
}
failed_models = {
    name: results
    for name, results in evaluation_results.items()
    if 'error' in results
}

if successful_models:
    display(Markdown("## 🏆 Best Performing Models by Metric"))

    # Define evaluator metrics and their optimization direction
    evaluator_metrics = [
        ('relevance.relevance', 'Relevance', True),
        ('coherence.coherence', 'Coherence', True),
        ('fluency.fluency', 'Fluency', True),
        ('violence.violence_defect_rate', 'Violence Safety', False),
        ('hate_unfairness.hate_unfairness_defect_rate',
         'Hate/Unfairness Safety', False),
        ('self_harm.self_harm_defect_rate', 'Self-Harm Safety', False),
        ('sexual.sexual_defect_rate', 'Sexual Safety', False),
    ]

    # Create a dataframe for best models
    best_models_data = []

    for metric_key, display_name, higher_is_better in evaluator_metrics:
        valid_models = {}

        # Collect scores for this metric from all successful models
        for model_name, results in successful_models.items():
            metrics = results['metrics']
            score = metrics.get(metric_key)
            if score is not None:
                valid_models[model_name] = score

        if valid_models:
            # Find best model based on optimization direction
            if higher_is_better:
                best_model_name = max(valid_models, key=valid_models.get)
                best_score = valid_models[best_model_name]
                direction = "↑ Higher is Better"
            else:
                best_model_name = min(valid_models, key=valid_models.get)
                best_score = valid_models[best_model_name]
                direction = "↓ Lower is Better"

            best_models_data.append({
                "Metric": display_name,
                "Best Model": best_model_name,
                "Score": f"{best_score:.3f}",
                "Direction": direction
            })

    df_best = pd.DataFrame(best_models_data)
    display(df_best)

    # Calculate overall best model (based on quality metrics average)
    display(Markdown("---"))
    display(Markdown("## 🌟 Overall Best Model (Quality Metrics Average)"))

    quality_metric_keys = ['relevance.relevance',
                           'coherence.coherence', 'fluency.fluency']
    model_quality_scores = {}

    for model_name, results in successful_models.items():
        metrics = results['metrics']
        scores = []
        for metric_key in quality_metric_keys:
            score = metrics.get(metric_key)
            if score is not None:
                scores.append(score)

        if scores:
            avg_score = sum(scores) / len(scores)
            model_quality_scores[model_name] = {
                'avg_quality': avg_score,
                'eval_time': results['evaluation_time']
            }

    if model_quality_scores:
        best_overall = max(model_quality_scores,
                           key=lambda x: model_quality_scores[x]['avg_quality'])
        best_data = model_quality_scores[best_overall]

        display(Markdown(
            f"🥇 **{best_overall}** - Quality: {best_data['avg_quality']:.3f} | Time: {best_data['eval_time']:.2f}s"))

        # Show all model quality scores for comparison
        display(Markdown(""))

        ranking_data = []
        sorted_models = sorted(
            model_quality_scores.items(),
            key=lambda x: x[1]['avg_quality'],
            reverse=True,
        )

        for rank, (model_name, data) in enumerate(sorted_models, 1):
            medal = "🥇" if rank == 1 else "🥈" if rank == 2 else "🥉" if rank == 3 else ""
            ranking_data.append({
                "Rank": f"{medal} {rank}",
                "Model": model_name,
                "Avg Quality Score": f"{data['avg_quality']:.3f}",
                "Eval Time (s)": f"{data['eval_time']:.2f}"
            })

        df_ranking = pd.DataFrame(ranking_data)
        display(df_ranking)

## 🏆 Best Performing Models by Metric

| | Metric | Best Model | Score | Direction |
|---|---|---|---|---|
| 0 | Relevance | gpt-4o-mini | 4.8 | ↑ Higher is Better |
| 1 | Coherence | gpt-4.1 | 4.4 | ↑ Higher is Better |
| 2 | Fluency | gpt-4o | 3.8 | ↑ Higher is Better |
| 3 | Violence Safety | gpt-4o-mini | 0.0 | ↓ Lower is Better |
| 4 | Hate/Unfairness Safety | gpt-4o-mini | 0.0 | ↓ Lower is Better |
| 5 | Self-Harm Safety | gpt-4o-mini | 0.0 | ↓ Lower is Better |
| 6 | Sexual Safety | gpt-4o-mini | 0.0 | ↓ Lower is Better |
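
All three models report a defect rate of 0.0 on every safety metric. The table names gpt-4o-mini for those rows only because `min` returns the first key among equal values. A minimal sketch of a tie-aware alternative follows; `best_with_ties` is a hypothetical helper, not part of the notebook or the SDK, and the `violence` dict re-types the values from the comparison table.

```python
def best_with_ties(scores, higher_is_better=True):
    """Return (best_score, sorted list of every model that achieved it)."""
    pick = max if higher_is_better else min
    best = pick(scores.values())
    winners = sorted(name for name, score in scores.items() if score == best)
    return best, winners


# Violence defect rates as shown in the comparison table.
violence = {"gpt-4o-mini": 0.0, "gpt-4o": 0.0, "gpt-4.1": 0.0}

best_score, winners = best_with_ties(violence, higher_is_better=False)
print(best_score, winners)
# 0.0 ['gpt-4.1', 'gpt-4o', 'gpt-4o-mini']
```

Reporting every tied model makes it clear that the safety rows do not separate the candidates on this dataset.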


---

## 🌟 Overall Best Model (Quality Metrics Average)

🥇 **gpt-4o** - Quality: 4.200 | Time: 67.85s



| | Rank | Model | Avg Quality Score | Eval Time (s) |
|---|---|---|---|---|
| 0 | 🥇 1 | gpt-4o | 4.2 | 67.85 |
| 1 | 🥈 2 | gpt-4o-mini | 4.133 | 58.09 |
| 2 | 🥉 3 | gpt-4.1 | 4.133 | 56.82 |
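
gpt-4o-mini and gpt-4.1 share the same rounded average (4.133) yet appear at ranks 2 and 3, because numbering a sorted list with `enumerate` cannot express ties. One possible alternative is sketched below using pandas' `Series.rank`; it uses the rounded averages from the table rather than the notebook's unrounded values.

```python
import pandas as pd

# Rounded average quality scores, re-typed from the ranking table.
avg_quality = pd.Series({"gpt-4o": 4.2, "gpt-4o-mini": 4.133, "gpt-4.1": 4.133})

# method="min" gives tied models the same (lowest) rank; ascending=False ranks high scores first.
ranks = avg_quality.rank(method="min", ascending=False).astype(int)

print(ranks.to_dict())
# {'gpt-4o': 1, 'gpt-4o-mini': 2, 'gpt-4.1': 2}
```

When averages tie like this, a secondary criterion such as evaluation time or cost may be a reasonable way to choose between the tied models.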


## Step 11: Next Steps

You've successfully evaluated multiple models! Here are some next steps to consider:

### 📊 View Results in Two Places

- **Azure AI Foundry Portal**: Interactive visualizations with [detailed charts and comparisons](https://learn.microsoft.com/azure/ai-foundry/how-to/evaluate-results)
  - **Portal URLs**: Each evaluation includes a studio URL for easy access and team sharing
- **Local Files**: All results are saved in `./22-evaluate-models-results/` for offline analysis
  - **Version Control**: Commit the JSON files for reproducibility and tracking over time
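
For offline analysis, the saved files can be read back with the standard library. The sketch below assumes each JSON file holds an object with a top-level `metrics` key, mirroring the `results["metrics"]` access used earlier. The actual file layout is not shown in this notebook, so treat that key, the file name, and the `load_metrics` helper as assumptions. A temporary directory stands in for `./22-evaluate-models-results/` so the example runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_metrics(results_dir):
    """Map each JSON file's stem to its 'metrics' dict (empty if the key is absent)."""
    loaded = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        loaded[path.stem] = data.get("metrics", {})
    return loaded


# Demo with a throwaway directory and a made-up file name.
with tempfile.TemporaryDirectory() as tmp:
    sample = {"metrics": {"relevance.relevance": 4.6}}
    (Path(tmp) / "example-model.json").write_text(json.dumps(sample), encoding="utf-8")
    loaded = load_metrics(tmp)

print(loaded)
```

Pointing `load_metrics` at the real results directory would return one entry per saved file, which can then be passed to `pd.DataFrame` for comparison across runs.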

### 🏆 Use Model Leaderboards for Selection

- **Browse Leaderboards**: Compare models by [Quality, Safety, Cost, and Performance](https://learn.microsoft.com/azure/ai-foundry/how-to/benchmark-model-in-catalog)
- **Trade-off Analysis**: View quality vs. cost, quality vs. safety charts
- **Scenario Filtering**: Find models best suited for your use case (Q&A, coding, reasoning)
- **Access Portal**: [Azure AI Foundry Model Catalog → Browse Leaderboards](https://aka.ms/model-leaderboards)

---

**Great work! You now have comprehensive evaluation metrics for multiple models.** 🎉
