# Base Model Selection & Evaluation

The first stage of the AI lifecycle involves selecting an appropriate base model. Generative AI models vary widely in terms of capabilities, strengths, and limitations, so it's essential to identify which model best suits your specific use case. 

In this notebook, you'll use the Evaluation SDK to assess multiple candidate models with the same _set of evaluators_, then review the results to see which model best fits your needs. Because you choose the evaluators, you can tailor the evaluation to the metrics that matter for your use case.

Alternatively, you can use the [Leaderboards](https://ai.azure.com/leaderboards) to compare models on standard industry benchmarks for quality, safety, and cost.

Let's get started! 🚀

---

## Step 1: Verify Required Python Packages

The dev container has already installed all the necessary Python packages for you:
- `azure-ai-evaluation`: The main SDK for running evaluations
- `azure-identity`: For authentication with Azure
- `pandas`: For data manipulation and analysis

Let's verify these packages are available and check their versions.

In [None]:
# Verify required packages are installed
import importlib.metadata

try:
    azure_eval_version = importlib.metadata.version('azure-ai-evaluation')
    azure_identity_version = importlib.metadata.version('azure-identity')
    pandas_version = importlib.metadata.version('pandas')
    
    print("✅ All required packages are installed!")
    print(f"📦 azure-ai-evaluation: {azure_eval_version}")
    print(f"📦 azure-identity: {azure_identity_version}")
    print(f"📦 pandas: {pandas_version}")
except importlib.metadata.PackageNotFoundError as e:
    print(f"❌ Missing package: {e}")
    print("Please ensure the dev container has been properly set up.")

## Step 2: Load Environment Variables

Let's load the environment variables from the `.env` file. These variables should already be configured from the initial lab setup.

In [None]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

print("✅ Environment variables loaded from .env file!")

## Step 3: Verify Required Environment Variables

Before we proceed, let's verify that all the required environment variables are set:

1. **AZURE_OPENAI_ENDPOINT**: Your Azure OpenAI endpoint
2. **AZURE_OPENAI_API_VERSION**: The API version for Azure OpenAI
3. **AZURE_OPENAI_API_KEY**: Your Azure OpenAI API key
4. **AZURE_OPENAI_DEPLOYMENT**: The deployment name for the judge model
5. **AZURE_SUBSCRIPTION_ID**: Your Azure subscription ID
6. **AZURE_RESOURCE_GROUP**: Your Azure resource group name
7. **AZURE_AI_PROJECT_NAME**: Your Azure AI project name

> **Note**: These environment variables should already be set from the initial lab setup in your `.env` file.
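
When printing configuration for verification, it can help to avoid echoing secrets such as the API key into the notebook output. The following is a minimal sketch of one way to do that; `mask_secret` and `summarize_env` are hypothetical helpers written for this illustration and are not part of any SDK.

```python
import os


def mask_secret(value, visible=4):
    """Return a redacted form of a secret, keeping only the last few characters."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def summarize_env(names, secret_names=()):
    """Map each variable name to a printable value, masking the secret ones."""
    summary = {}
    for name in names:
        value = os.getenv(name)
        if name in secret_names:
            summary[name] = mask_secret(value)
        else:
            summary[name] = value if value is not None else "<not set>"
    return summary
```

For example, `summarize_env(["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"], secret_names={"AZURE_OPENAI_API_KEY"})` would show the endpoint in full and only the last four characters of the key.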

In [None]:
def check_env_variables(env_vars):
    undefined_vars = [var for var in env_vars if os.getenv(var) is None]
    if undefined_vars:
        print(f"❌ The following environment variables are not defined: {', '.join(undefined_vars)}")
        raise ValueError(f"Missing required environment variables: {', '.join(undefined_vars)}")
    else:
        print("✅ All required environment variables are defined.")

# Check the environment variables required for this exercise.
# AZURE_OPENAI_API_VERSION is included because the AzureOpenAI client
# in Step 8 is created with it.
env_vars_to_check = [
    'AZURE_OPENAI_API_KEY',
    'AZURE_OPENAI_ENDPOINT',
    'AZURE_OPENAI_API_VERSION',
    'AZURE_OPENAI_DEPLOYMENT',
    'AZURE_SUBSCRIPTION_ID',
    'AZURE_RESOURCE_GROUP',
    'AZURE_AI_PROJECT_NAME'
]
check_env_variables(env_vars_to_check)

# Print configuration for verification
print(f"\n📍 Azure OpenAI Endpoint: {os.environ.get('AZURE_OPENAI_ENDPOINT')}")
print(f"📋 API Version: {os.environ.get('AZURE_OPENAI_API_VERSION')}")
print(f"⚖️  Judge Model Deployment: {os.environ.get('AZURE_OPENAI_DEPLOYMENT')}")
print(f"📂 Project Name: {os.environ.get('AZURE_AI_PROJECT_NAME')}")

## Step 4: Create Azure AI Project Configuration

Let's create the Azure AI Project configuration object that will be used to upload evaluation results to the Azure AI Foundry portal for viewing and analysis.

In [None]:
from pprint import pprint

# Get Azure AI project configuration from environment variables
subscription_id = os.environ.get("AZURE_SUBSCRIPTION_ID")
resource_group_name = os.environ.get("AZURE_RESOURCE_GROUP")
project_name = os.environ.get("AZURE_AI_PROJECT_NAME")

# Create the azure_ai_project configuration
azure_ai_project = {
    "subscription_id": subscription_id,
    "resource_group_name": resource_group_name,
    "project_name": project_name,
}

print("✅ Azure AI Foundry project configuration created!")
print("="*80)
pprint(azure_ai_project)

## Step 5: Load Test Data

Let's load the test data from `data.jsonl`. This file contains:
- **query**: Questions to ask the model
- **context**: Background information for answering the question
- **ground_truth**: Expected correct answers for comparison

This data will be used to evaluate how well each model performs.
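
Before running an evaluation, it can be worth checking that every row actually carries the three fields described above. The following is a minimal sketch under that assumption; `find_invalid_rows` is a hypothetical helper for illustration, not an SDK function.

```python
import json

REQUIRED_FIELDS = ("query", "context", "ground_truth")


def find_invalid_rows(lines, required=REQUIRED_FIELDS):
    """Return (line_number, problem) pairs for rows that are not usable test cases."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((number, "row is not a JSON object"))
            continue
        # Treat absent and empty values the same way: both are unusable.
        missing = [field for field in required if not row.get(field)]
        if missing:
            problems.append((number, f"missing or empty: {', '.join(missing)}"))
    return problems
```

Passing an open file handle for `data.jsonl` (for example, `find_invalid_rows(open("data.jsonl", encoding="utf-8"))`) returns an empty list when every row is complete.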

In [None]:
import pandas as pd
import pathlib

# Build the absolute path to the test data
data_file = "data.jsonl"
data_path = str(pathlib.Path.cwd() / data_file)

# Read and display the data from the same path the evaluation will use
df = pd.read_json(data_path, lines=True)

print(f"✅ Test data loaded successfully!")
print(f"📄 Data file: {data_path}")
print(f"📊 Number of test cases: {len(df)}")
print("\n" + "="*80)
print("📋 SAMPLE DATA:")
print("="*80)
print(df.head())

## Step 6: Configure the Judge Model

To evaluate model responses, we need an "LLM Judge" - another AI model that will assess the quality of the responses. We'll configure the judge model using a simple dictionary configuration.

The judge model will be used by evaluators like:
- **RelevanceEvaluator**: Checks if the response is relevant to the question
- **CoherenceEvaluator**: Checks if the response is logically structured

This is also known as "LLM as a Judge" evaluation.

In [None]:
# Configure the judge model using the dictionary approach (as shown in sample-eval.ipynb)
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}

print("✅ Judge model configured successfully!")
print("="*80)
# Mask the API key so the secret is not written to the notebook output
pprint({**model_config, "api_key": "***"})

## Step 7: Initialize the Evaluators

We'll use multiple types of evaluators to comprehensively assess model performance:

**LLM as Judge Evaluators** (require judge model):
- **RelevanceEvaluator**: Measures how relevant the response is to the query
- **CoherenceEvaluator**: Measures logical flow and readability

**NLP Evaluators** (no judge model needed):
- **BleuScoreEvaluator**: Compares response to ground truth using n-gram overlap
- **RougeScoreEvaluator**: Measures overlap of words/phrases with ground truth

**Content Safety Evaluators** (Azure AI service):
- **ViolenceEvaluator**: Detects violent or harmful content

This combination gives us a comprehensive view of quality, accuracy, and safety.
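
To build intuition for what the overlap-based NLP evaluators measure, here is a deliberately simplified sketch of unigram overlap between a response and the ground truth. It is an illustration only: `unigram_overlap` is a hypothetical helper, it assumes plain lowercase whitespace tokenization, and it is not how the SDK's BLEU or ROUGE evaluators are implemented.

```python
from collections import Counter


def unigram_overlap(response, ground_truth):
    """Compute precision, recall, and F1 over shared lowercase whitespace tokens."""
    response_counts = Counter(response.lower().split())
    truth_counts = Counter(ground_truth.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    shared = sum((response_counts & truth_counts).values())
    precision = shared / max(sum(response_counts.values()), 1)
    recall = shared / max(sum(truth_counts.values()), 1)
    f1 = 0.0 if shared == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

A short response that only uses words from the ground truth scores high on precision but low on recall, which is why overlap metrics are usually read alongside the LLM-judged ones.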

In [None]:
from azure.ai.evaluation import evaluate
from azure.ai.evaluation import (
    RelevanceEvaluator, 
    CoherenceEvaluator, 
    BleuScoreEvaluator, 
    RougeScoreEvaluator, 
    RougeType, 
    ViolenceEvaluator,
)
from azure.identity import DefaultAzureCredential

# LLM as Judge evaluators (require judge model)
relevance_evaluator = RelevanceEvaluator(model_config)
coherence_evaluator = CoherenceEvaluator(model_config)

# NLP evaluators (no model required)
bleu_score_evaluator = BleuScoreEvaluator()
rouge_score_evaluator = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)

# Content Safety evaluator (requires Azure AI project)
violence_evaluator = ViolenceEvaluator(
    azure_ai_project=azure_ai_project, 
    credential=DefaultAzureCredential()
)

print("✅ All evaluators initialized successfully!")
print("📊 Evaluators configured:")
print("   - Relevance (LLM Judge)")
print("   - Coherence (LLM Judge)")
print("   - BLEU Score (NLP)")
print("   - ROUGE Score (NLP)")
print("   - Violence (Content Safety)")

## Step 8: Create Target Function for Model Evaluation

The target function is what the evaluation SDK calls to get responses from your model. It receives a query from the test data and returns the model's response.

We'll create a callable class that:
1. Accepts a model deployment name in the constructor
2. Implements a `__call__` method that takes a query and returns a response
3. Uses Azure OpenAI to generate responses

This allows us to easily test different models by creating instances with different deployment names.
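
The call contract described above (take a query, return a dictionary with a `response` key) can be exercised offline before any real deployment is called. The following is a minimal sketch of such a stand-in; `StubTarget` and its canned responses are hypothetical and exist only for this illustration.

```python
class StubTarget:
    """Offline stand-in that follows the same call contract as a model target."""

    def __init__(self, canned_responses, default="I don't know."):
        self.canned_responses = dict(canned_responses)
        self.default = default
        self.calls = []  # record queries so the plumbing can be inspected

    def __call__(self, query: str) -> dict:
        self.calls.append(query)
        return {"response": self.canned_responses.get(query, self.default)}
```

Because the stub is callable and returns the same shape as a real target, it can be swapped in while checking the data file and evaluator wiring, without consuming model quota.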

In [None]:
from openai import AzureOpenAI

class ModelTarget:
    """
    A callable target class for model evaluation.
    This class wraps an Azure OpenAI model and provides a simple interface
    for the evaluation SDK to call and get responses.
    """
    
    def __init__(self, model_deployment_name: str):
        """
        Initialize the model target with a specific deployment.
        
        Args:
            model_deployment_name: The name of the Azure OpenAI model deployment
        """
        self.model_deployment_name = model_deployment_name
        
        # Create Azure OpenAI client
        self.client = AzureOpenAI(
            api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
            api_version=os.environ.get("AZURE_OPENAI_API_VERSION"),
            azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT")
        )
        
    def __call__(self, query: str) -> dict:
        """
        Generate a response for the given query.
        
        Args:
            query: The question or prompt from the test data
            
        Returns:
            A dictionary with the 'response' key containing the model's answer
        """
        try:
            # Call Azure OpenAI with the query
            completion = self.client.chat.completions.create(
                model=self.model_deployment_name,
                messages=[
                    {"role": "system", "content": "You are a helpful AI assistant. Provide clear, accurate, and concise responses."},
                    {"role": "user", "content": query}
                ],
                temperature=0.7,
                max_tokens=800
            )
            
            # Extract the response text
            response_text = completion.choices[0].message.content
            
            return {"response": response_text}
            
        except Exception as e:
            print(f"❌ Error calling model {self.model_deployment_name}: {e}")
            return {"response": f"Error: {str(e)}"}

print("✅ ModelTarget class defined successfully!")
print("💡 This class will be used to generate responses from different models for evaluation.")

## Step 9: Create Evaluation Function

Now let's create a reusable function that:
1. Takes a model deployment name as input
2. Creates a `ModelTarget` instance for that model
3. Runs the evaluation with all configured evaluators
4. Returns the results

This function will make it easy to evaluate multiple models with the same test data and evaluators.

In [None]:
def evaluate_model(model_deployment_name: str):
    """
    Evaluate a specific model deployment using the configured evaluators.
    
    Args:
        model_deployment_name: The name of the Azure OpenAI deployment to evaluate
        
    Returns:
        Evaluation results including metrics and a link to Azure AI Foundry
    """
    print(f"\n{'='*80}")
    print(f"🔄 Starting evaluation for: {model_deployment_name}")
    print(f"{'='*80}\n")
    
    # Create the target for this model
    model_target = ModelTarget(model_deployment_name)
    
    # Run the evaluation
    results = evaluate(
        evaluation_name=f"Base Model Evaluation - {model_deployment_name}",
        data=data_path,
        target=model_target,
        evaluators={
            "relevance": relevance_evaluator,
            "coherence": coherence_evaluator,
            "bleu_score": bleu_score_evaluator,
            "rouge_score": rouge_score_evaluator,
            "violence_score": violence_evaluator,
        },
        azure_ai_project=azure_ai_project,
    )
    
    print(f"\n✅ Evaluation completed for {model_deployment_name}!")
    print(f"📊 Results uploaded to Azure AI Foundry")
    
    return results

print("✅ Evaluation function defined successfully!")
print("🚀 Ready to evaluate models!")

## Step 10: Evaluate All Deployed Models

Now let's evaluate all your deployed models systematically. We'll run the same evaluation on each model to compare their performance side-by-side.

**Models to evaluate:**
- **gpt-4.1**: Latest GPT-4 model with enhanced capabilities
- **gpt-4.1-mini**: Efficient version of GPT-4.1
- **gpt-4.1-nano**: Ultra-efficient version for cost-sensitive workloads
- **gpt-4o**: Optimized GPT-4 variant (2024-11-20)
- **gpt-4o-mini**: Smaller, faster version of GPT-4o

The evaluation will measure:
- How relevant the responses are
- How coherent and well-structured they are
- How closely they match the ground truth
- Whether they contain any harmful content

> **Note**: This may take several minutes as we evaluate multiple models.

In [None]:
# Define all deployed models to evaluate
models_to_evaluate = [
    "gpt-4.1",
    "gpt-4.1-mini",
    "gpt-4.1-nano",
    "gpt-4o",
    "gpt-4o-mini"
]

# Dictionary to store all evaluation results
all_results = {}

# Evaluate each model
for model_name in models_to_evaluate:
    print(f"\n{'='*80}")
    print(f"🚀 Evaluating model: {model_name}")
    print(f"{'='*80}")
    
    try:
        # Run evaluation
        results = evaluate_model(model_name)
        all_results[model_name] = results
        
        # Display summary
        results_df = pd.DataFrame(results["rows"])
        print(f"\n📊 Results for {model_name}:")
        print(results_df.head())
        print(f"\n🌐 View in portal: {results['studio_url']}")
        
    except Exception as e:
        print(f"❌ Error evaluating {model_name}: {e}")
        all_results[model_name] = None

print("\n" + "="*80)
print("✅ ALL MODEL EVALUATIONS COMPLETED!")
print("="*80)
print(f"\n📊 Successfully evaluated {len([r for r in all_results.values() if r is not None])} out of {len(models_to_evaluate)} models")
print("\n💡 Results are stored both locally and in Azure AI Foundry portal!")

## Step 11: Analyze & Compare Results (Local + Cloud)

Now let's analyze the evaluation results both **locally** (in this notebook) and in the **Azure AI Foundry portal** for comprehensive comparison.

### 📊 Local Analysis

We'll aggregate metrics across all models to identify trends and patterns right here in the notebook.

### 🌐 Cloud Portal Analysis

We'll also provide links to view detailed dashboards in Azure AI Foundry where you can:
- Compare side-by-side metrics
- View interactive charts
- Drill down into individual test cases
- Export results for further analysis
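
For the local side of the comparison, the core step is picking a winner per metric while remembering that some metrics (such as violence) are better when lower. The following is a minimal sketch of that selection; `best_by_metric` is a hypothetical helper, and the model names and numbers are placeholders for illustration, not evaluation results.

```python
def best_by_metric(scores, lower_is_better=()):
    """Return {metric: (model, value)}, using min for lower-is-better metrics."""
    metrics = sorted({metric for values in scores.values() for metric in values})
    winners = {}
    for metric in metrics:
        candidates = [
            (model, values[metric])
            for model, values in scores.items()
            if metric in values
        ]
        pick = min if metric in lower_is_better else max
        winners[metric] = pick(candidates, key=lambda item: item[1])
    return winners


# Placeholder numbers for illustration only; these are not evaluation results.
example_scores = {
    "model-a": {"relevance": 4.0, "violence": 0.0},
    "model-b": {"relevance": 4.5, "violence": 1.0},
}
print(best_by_metric(example_scores, lower_is_better={"violence"}))
```

Keeping the scores numeric until the final print avoids converting formatted strings back to floats when ranking.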

In [None]:
import json
from datetime import datetime

# ============================================================================
# LOCAL ANALYSIS: Aggregate metrics across all models
# ============================================================================

print("\n" + "="*80)
print("📊 LOCAL ANALYSIS: MODEL COMPARISON SUMMARY")
print("="*80 + "\n")

# Create comparison table
comparison_data = []

for model_name, results in all_results.items():
    if results is not None:
        # Extract aggregate metrics. Key names can differ between SDK versions;
        # inspect results["metrics"].keys() if a column unexpectedly shows 0.
        metrics = results.get("metrics", {})
        
        comparison_data.append({
            "Model": model_name,
            "Relevance": f"{metrics.get('relevance.gpt_relevance', 0):.2f}",
            "Coherence": f"{metrics.get('coherence.gpt_coherence', 0):.2f}",
            "BLEU Score": f"{metrics.get('bleu_score.bleu_score', 0):.3f}",
            # RougeScoreEvaluator reports precision, recall, and F1; F1 is used here
            "ROUGE Score": f"{metrics.get('rouge_score.rouge_f1_score', 0):.3f}",
            "Violence": f"{metrics.get('violence_score.violence_score', 0):.2f}",
        })

# Create DataFrame for comparison
comparison_df = pd.DataFrame(comparison_data)
print(comparison_df.to_string(index=False))

# Identify best performing models
print("\n" + "-"*80)
print("🏆 BEST PERFORMERS BY METRIC:")
print("-"*80)

if len(comparison_data) > 0:
    # Convert string values back to float for comparison
    for metric in ["Relevance", "Coherence", "BLEU Score", "ROUGE Score"]:
        values = [(row["Model"], float(row[metric])) for row in comparison_data]
        best_model, best_score = max(values, key=lambda x: x[1])
        print(f"  {metric}: {best_model} ({best_score:.3f})")
    
    # For violence, lower is better
    violence_values = [(row["Model"], float(row["Violence"])) for row in comparison_data]
    safest_model, lowest_violence = min(violence_values, key=lambda x: x[1])
    print(f"  Safety (Violence): {safest_model} ({lowest_violence:.3f})")

# Save results locally
output_file = f"evaluation_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
with open(output_file, 'w') as f:
    # Save all results with metrics
    output_data = {
        "timestamp": datetime.now().isoformat(),
        "models_evaluated": models_to_evaluate,
        "comparison_summary": comparison_data,
        "detailed_results": {
            model: {
                "metrics": results.get("metrics", {}) if results else None,
                "studio_url": results.get("studio_url", "") if results else None
            }
            for model, results in all_results.items()
        }
    }
    json.dump(output_data, f, indent=2)

print(f"\n💾 Local results saved to: {output_file}")

# ============================================================================
# CLOUD PORTAL ANALYSIS: Links to Azure AI Foundry
# ============================================================================

print("\n" + "="*80)
print("🌐 CLOUD PORTAL ANALYSIS: Azure AI Foundry Dashboard")
print("="*80 + "\n")

print("📋 View detailed results for each model in the portal:\n")
for model_name, results in all_results.items():
    if results is not None:
        print(f"  • {model_name}:")
        print(f"    {results['studio_url']}\n")

print("-"*80)
print("📊 TO COMPARE ALL MODELS IN THE PORTAL:")
print("-"*80)
print("""
1. Navigate to the Evaluations tab in Azure AI Foundry portal
2. You should see evaluation runs for all models:
   - Base Model Evaluation - gpt-4.1
   - Base Model Evaluation - gpt-4.1-mini
   - Base Model Evaluation - gpt-4.1-nano
   - Base Model Evaluation - gpt-4o
   - Base Model Evaluation - gpt-4o-mini

3. Select the runs you want to compare (check the boxes)

4. Click "Switch to Dashboard View" for visual comparison

5. Use the Charts tab to see:
   - Color-coded performance metrics
   - Side-by-side comparisons
   - Trend analysis across evaluation criteria

💡 TIP: Compare models by family (e.g., all gpt-4.1 variants) to see 
        how mini/nano versions trade performance for efficiency!
""")

print("="*80)
print("✅ ANALYSIS COMPLETE!")
print("="*80)
print(f"📊 Local summary: {len(comparison_data)} models analyzed")
print(f"💾 Results saved: {output_file}")
print(f"🌐 Portal URLs: Ready for cloud comparison")
print("\n💡 You now have both local metrics and cloud dashboards for comprehensive analysis!")

## 🎉 Congratulations!

You've successfully completed the Base Model Selection & Evaluation tutorial! Here's what you learned:

### Key Concepts Covered:
1. ✅ **Package Verification**: Confirmed all required packages are installed
2. ✅ **Environment Configuration**: Loaded environment variables from the `.env` file
3. ✅ **Azure AI Project Setup**: Built the project configuration from the subscription ID, resource group, and project name
4. ✅ **Test Data Loading**: Loaded and reviewed evaluation test cases
5. ✅ **Judge Model Configuration**: Set up an LLM judge for evaluation (`AZURE_OPENAI_DEPLOYMENT`)
6. ✅ **Evaluator Initialization**: Configured multiple types of evaluators (quality, NLP, safety)
7. ✅ **Target Function Creation**: Built a reusable ModelTarget class for testing
8. ✅ **Model Evaluation**: Evaluated 5 different models with the same test data
9. ✅ **Local & Cloud Analysis**: Analyzed results both locally and in Azure AI Foundry portal
10. ✅ **Results Storage**: Saved evaluation data locally (JSON) and to the cloud

### Environment Variables Used:
- `AZURE_OPENAI_ENDPOINT`: Your Azure OpenAI endpoint
- `AZURE_OPENAI_API_VERSION`: API version for Azure OpenAI
- `AZURE_OPENAI_API_KEY`: Authentication for Azure OpenAI
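- `AZURE_OPENAI_DEPLOYMENT`: Judge model deployment for LLM-based evaluation
- `AZURE_SUBSCRIPTION_ID`: Your Azure subscription ID
- `AZURE_RESOURCE_GROUP`: Your Azure resource group name
- `AZURE_AI_PROJECT_NAME`: Your Azure AI project name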

### Evaluators Explored:
- **Relevance**: Measures response relevance to query (LLM Judge)
- **Coherence**: Measures logical flow and readability (LLM Judge)
- **BLEU Score**: N-gram overlap with ground truth (NLP)
- **ROUGE Score**: Word/phrase overlap with ground truth (NLP)
- **Violence**: Content safety evaluation (Azure AI)

### Models Evaluated:
- 🤖 **gpt-4.1**: Latest GPT-4 model with enhanced capabilities (capacity: 50)
- 🤖 **gpt-4.1-mini**: Efficient version of GPT-4.1 (capacity: 20)
- 🤖 **gpt-4.1-nano**: Ultra-efficient version for cost-sensitive workloads (capacity: 20)
- 🤖 **gpt-4o**: Optimized GPT-4 variant from Nov 2024 (capacity: 20)
- 🤖 **gpt-4o-mini**: Smaller, faster version of GPT-4o

### What You Learned:
- How to compare multiple models systematically
- How to use different types of evaluators for comprehensive assessment
- How to analyze results locally with aggregated metrics
- How to leverage Azure AI Foundry portal for visual comparisons
- How to save evaluation results in both local files and cloud storage
- How to interpret evaluation metrics for informed model selection

### Results Storage:
- **Local**: JSON file with timestamp (evaluation_results_YYYYMMDD_HHMMSS.json)
- **Cloud**: Azure AI Foundry portal with interactive dashboards
- **Format**: Structured data with metrics, URLs, and comparison summaries