# [Base Model Selection](https://learn.microsoft.com/azure/ai-foundry/concepts/evaluation-approach-gen-ai#base-model-selection)

The first stage of the AI lifecycle involves selecting an appropriate base model. Generative AI models vary widely in terms of capabilities, strengths, and limitations, so it's essential to identify which model best suits your specific use case. During base model evaluation, you "shop around" to compare different models by testing their outputs against a set of criteria relevant to your application.

You have three options to evaluate models:

1. [Use Azure AI Foundry Benchmarks](https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/model-benchmarks) to compare models on their intrinsic capabilities.
1. [Use Manual Evaluations in the Portal](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/evaluate-prompts-playground) to run prompts on models and rate them.
1. [Evaluate Multiple Models using the SDK](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/evaluate-prompts-playground) for a code-first approach.

**In this notebook, we explore a simplified version of option 3.**

---


## Objective

You are beginning your AI application development journey, and you have two (or more) model options available to you. How do you pick the right one for your needs? In this tutorial, we look at how you can evaluate _the same set of prompts_ against multiple model endpoints deployed in your Azure AI project.

This guide uses a Python class as the application target. The class is passed to the `evaluate` API from the Azure AI Evaluation SDK, which scores the responses that each LLM generates for the provided prompts.

This tutorial uses the following Azure AI services:

- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)

## Time

You should expect to spend 30 minutes running this sample. 

## About this example

This example demonstrates how to evaluate the responses of model endpoints to a provided set of prompts using `azure-ai-evaluation`.

## Before you begin

### Installation

Install the packages required to run this notebook. The code below imports from `azure-ai-evaluation`, `azure-identity`, `pandas`, and `python-dotenv`.

### Validate that the required environment variables are set

In [1]:
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
import os

assert os.environ.get("AZURE_OPENAI_ENDPOINT") is not None, "Please set the AZURE_OPENAI_ENDPOINT environment variable."
assert os.environ.get("AZURE_OPENAI_API_VERSION") is not None, "Please set the AZURE_OPENAI_API_VERSION environment variable."
assert os.environ.get("AZURE_OPENAI_API_KEY") is not None, "Please set the AZURE_OPENAI_API_KEY environment variable."
assert os.environ.get("AZURE_AI_CONNECTION_STRING") is not None, "Please set the AZURE_AI_CONNECTION_STRING environment variable."
assert os.environ.get("LAB_JUDGE_MODEL") is not None, "Please set the LAB_JUDGE_MODEL environment variable."
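The assertions above stop at the first missing variable. As a minimal sketch, the check below reports every missing variable in one pass. The helper name `find_missing_env_vars` is hypothetical and not part of the lab, and unlike the assertions it also treats empty values as missing.

```python
import os
from typing import Iterable, List, Mapping

# Variable names taken from the assertions in this notebook.
REQUIRED_ENV_VARS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_API_KEY",
    "AZURE_AI_CONNECTION_STRING",
    "LAB_JUDGE_MODEL",
]


def find_missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in `env` (hypothetical helper)."""
    return [name for name in names if not env.get(name)]


missing = find_missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing the environment as a parameter keeps the helper easy to test with a plain dictionary.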

## Model Evaluation

We will use the `evaluate` API provided by the Azure AI Evaluation SDK. In this notebook, we run the same evaluation against different models and use Azure AI Foundry to visualize and compare the results.


#### Getting Azure AI Foundry Project details

In [3]:
import os

# Project Connection String
connection_string = os.environ.get("AZURE_AI_CONNECTION_STRING")

# Extract details
region_id, subscription_id, resource_group_name, project_name = connection_string.split(";")

# Populate it
azure_ai_project = {
    "subscription_id": subscription_id,
    "resource_group_name": resource_group_name,
    "project_name": project_name,
}
print(azure_ai_project)

{'subscription_id': '3c2e0a23-bcf8-4766-84b7-8c635df04a7b', 'resource_group_name': 'rg-aitour', 'project_name': 'ai-project-51324400'}
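The tuple unpacking above raises a fairly opaque `ValueError` if the connection string does not have exactly four fields. The sketch below wraps the same split in a small function with an explicit check. It assumes the four-field, `;`-separated layout used in this notebook; the function name `parse_connection_string` and the example values are made up for illustration.

```python
from typing import Dict


def parse_connection_string(connection_string: str) -> Dict[str, str]:
    """Split a 'region;subscription;resource group;project' string into project details."""
    parts = [part.strip() for part in connection_string.split(";")]
    if len(parts) != 4 or not all(parts):
        raise ValueError("Expected 4 non-empty fields separated by ';'")
    # The first field (named region_id in the notebook) is not needed for the project dict.
    _region_id, subscription_id, resource_group_name, project_name = parts
    return {
        "subscription_id": subscription_id,
        "resource_group_name": resource_group_name,
        "project_name": project_name,
    }


# Placeholder values, not a real project.
example = "example-host;00000000-0000-0000-0000-000000000000;rg-example;project-example"
print(parse_connection_string(example))
```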


## Data

The following code reads a JSON Lines (`.jsonl`) data file containing the queries that will be passed to each model for evaluation.

In [4]:
import pandas as pd

df = pd.read_json("00-data/05-data.jsonl", lines=True)
print(df.head())

                                               query  \
0                     When was United Stated found ?   
1                     What is the capital of France?   
2                 Which tent is the most waterproof?   
3         Which camping table holds the most weight?   
4  What is the weight of the Adventure Dining Table?   

                                        ground_truth  \
0                                               1776   
1                                              Paris   
2  The Alpine Explorer Tent has the highest rainf...   
3  The Adventure Dining Table has a higher weight...   
4           The Adventure Dining Table weighs 15 lbs   

                                            response  
0                                               1600  
1                                              Paris  
2  Can you clarify what tents you are talking about?  
3                             Adventure Dining Table  
4                          It's a lot I can tell yo
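Before sending a data file to the evaluators, it can help to confirm that every record carries the fields they rely on. The following is a minimal sketch using only the standard library. The helper name `find_rows_missing_fields` is hypothetical, and the required field names (`query`, `ground_truth`) are taken from the columns printed above.

```python
import io
import json
from typing import Iterable, List


def find_rows_missing_fields(lines: Iterable[str], required: Iterable[str]) -> List[int]:
    """Return the zero-based line numbers of JSONL records that lack a required field."""
    required = list(required)
    bad_rows = []
    for line_number, line in enumerate(lines):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        if any(field not in record for field in required):
            bad_rows.append(line_number)
    return bad_rows


# In-memory stand-in for a data file; the second record has no ground truth.
sample = io.StringIO(
    '{"query": "What is the capital of France?", "ground_truth": "Paris"}\n'
    '{"query": "Which tent is the most waterproof?"}\n'
)
print(find_rows_missing_fields(sample, ["query", "ground_truth"]))  # prints [1]
```

To check a real file, pass an open file handle instead of the `io.StringIO` object.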

## Configuration
To use the Relevance and Coherence evaluators, we need the details of an Azure OpenAI model that acts as the judge. These details are passed to the evaluators as a model configuration.

In [5]:
from azure.ai.evaluation import AzureOpenAIModelConfiguration

# Note: We evaluate 2 models below, and we need an "LLM judge" to score their responses.
#       Here we specify the judge model.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    azure_deployment=os.environ.get("LAB_JUDGE_MODEL"),
    api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION"),
)
# Mask the API key so the secret does not end up in the notebook output
print({**model_config, "api_key": "***"})

{'azure_endpoint': 'https://aoai-51324400.openai.azure.com/', 'azure_deployment': 'gpt-4', 'api_key': '***', 'api_version': '2025-01-01-preview'}


## Run the evaluation

The following code runs the `evaluate` API with the Relevance and Coherence (LLM-as-judge), BLEU and ROUGE (NLP), and Violence (content safety) evaluators to assess the results from different models.

The `evaluate` API takes the following inputs:

+   Data file (prompts): The data file in JSON Lines format. Each line holds one record with the query and the ground truth used by the evaluators.

+   Application target: The Python class that calls the model.

+   Model name: An identifier for the model, so that custom code in the application target class can identify the model type and call the corresponding LLM using its endpoint URL and auth key.

+   Evaluators: The list of evaluators that score each prompt (question) together with the output (answer) the LLM returned for it.
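To make the application target idea concrete, here is a rough sketch of a callable class that receives a query and returns a dictionary with a `response` key. It is not the lab's `BaseModel` from `model_endpoints`, whose code is not shown here. The class name `SketchModelTarget` and the fake backends are hypothetical stand-ins so that the example runs without any network call.

```python
from typing import Callable, Dict


class SketchModelTarget:
    """Hypothetical application target: routes a query to the backend for one model."""

    def __init__(self, model_name: str, backends: Dict[str, Callable[[str], str]]):
        if model_name not in backends:
            raise ValueError(f"Unknown model: {model_name}")
        self.model_name = model_name
        self._call_backend = backends[model_name]

    def __call__(self, query: str) -> Dict[str, str]:
        # The returned keys become output columns that evaluators can read.
        return {"response": self._call_backend(query)}


# Fake backends standing in for real model endpoints (assumption for illustration).
fake_backends = {
    "model-a": lambda q: f"[model-a] answer to: {q}",
    "model-b": lambda q: f"[model-b] answer to: {q}",
}

target = SketchModelTarget("model-a", fake_backends)
print(target(query="What is the capital of France?"))
```

In a real target, each backend would call the deployed model endpoint with its URL and key. Keeping the model name as a constructor argument lets one class serve every model under comparison.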

### Initialize the evaluators  

In [6]:
import pathlib

from azure.ai.evaluation import evaluate
from azure.ai.evaluation import (
    RelevanceEvaluator, CoherenceEvaluator, BleuScoreEvaluator, RougeScoreEvaluator, RougeType, ViolenceEvaluator,
)
from azure.identity import DefaultAzureCredential

# LLM as judge evaluator
relevance_evaluator = RelevanceEvaluator(model_config)
coherence_evaluator = CoherenceEvaluator(model_config)

# NLP evaluators
blue_score_evaluator = BleuScoreEvaluator()
rouge_score_evaluator = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)

# Violence evaluator
violence_evaluator = ViolenceEvaluator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())

# Path to the data file with the prompts to evaluate
path = str(pathlib.Path.cwd() / "00-data" / "04-data.jsonl")
print(path)

Class ViolenceEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


/workspaces/BUILD25-LAB334/labs/00-data/04-data.jsonl
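The NLP evaluators initialized above compare the words in a response with the words in the ground truth. The sketch below is a simplified ROUGE-1 calculation (lowercasing and whitespace tokenization only), intended to show the idea rather than to reproduce the SDK's implementation. The function name `rouge1_scores` is hypothetical.

```python
from collections import Counter
from typing import Dict


def rouge1_scores(reference: str, candidate: str) -> Dict[str, float]:
    """Unigram-overlap precision, recall, and F1 between a reference and a candidate."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1_scores("Paris", "The capital of France is Paris")
print(scores)
```

Here recall is 1.0 because the single reference word appears in the candidate, but precision is only 1/6, which gives an F1 of about 0.29. This suggests one plausible reason why a correct but wordy answer can fall below a 0.5 threshold when the ground truth is very short.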


### Function that uses the `evaluate` API from the `azure-ai-evaluation` package to evaluate a base model

In [8]:
from model_endpoints import BaseModel

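def evaluate_model(model):
    """Run all configured evaluators against one model deployment and return the results."""
    results = evaluate(
        evaluation_name=f"Base Model Evaluation {model}",
        data=path,  # JSONL file with the prompts
        target=BaseModel(model),  # application target that calls the model
        evaluators={
            "relevance": relevance_evaluator,
            "coherence": coherence_evaluator,
            "blue_score": blue_score_evaluator,
            "rouge_score": rouge_score_evaluator,
            "violence_score": violence_evaluator,
        },
        azure_ai_project=azure_ai_project,  # project used to view results in Azure AI Foundry
    )
    return results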

### Evaluate GPT-3.5 Turbo

In [None]:
gpt_35_turbo_results = evaluate_model("gpt-35-turbo")
pd.DataFrame(gpt_35_turbo_results["rows"])

[2025-05-16 04:50:14 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_TARGET_20250516_045014_590939, log path: /home/vscode/.promptflow/.runs/azure_ai_evaluation_evaluators_TARGET_20250516_045014_590939/logs.txt


2025-05-16 04:50:19 +0000   48059 execution.bulk     INFO     Process 48097 terminated.


[2025-05-16 04:50:20 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_rouge_score_20250516_045020_145348, log path: /home/vscode/.promptflow/.runs/azure_ai_evaluation_evaluators_rouge_score_20250516_045020_145348/logs.txt
[2025-05-16 04:50:20 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_blue_score_20250516_045020_143122, log path: /home/vscode/.promptflow/.runs/azure_ai_evaluation_evaluators_blue_score_20250516_045020_143122/logs.txt
[2025-05-16 04:50:20 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_coherence_20250516_045020_139815, log path: /home/vscode/.promptflow/.runs/azure_ai_evaluation_evaluators_coherence_20250516_045020_139815/logs.txt
[2025-05-16 04:50:20 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_relevance_20250516_045020_139669, log path

2025-05-16 04:50:14 +0000   47561 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-05-16 04:50:14 +0000   47561 execution.bulk     INFO     Set process count to 4 by taking the minimum value among the factors of {'default_worker_count': 4, 'row_count': 5}.
2025-05-16 04:50:17 +0000   47561 execution.bulk     INFO     Process name(ForkProcess-4:3)-Process id(48097)-Line number(0) start execution.
2025-05-16 04:50:17 +0000   47561 execution.bulk     INFO     Process name(ForkProcess-4:4)-Process id(48099)-Line number(2) start execution.
2025-05-16 04:50:17 +0000   47561 execution.bulk     INFO     Process name(ForkProcess-4:2)-Process id(48092)-Line number(1) start execution.
2025-05-16 04:50:17 +0000   47561 execution.bulk     INFO     Process name(ForkProcess-4:1)-Process id(48085)-Line number(3) start execution.
2025-05-16 04:50:17 +0000   47561 execution.bulk     INFO     Process name(ForkProcess-4:2)-Process id(4809

[2025-05-16 04:50:20 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_violence_score_20250516_045020_145755, log path: /home/vscode/.promptflow/.runs/azure_ai_evaluation_evaluators_violence_score_20250516_045020_145755/logs.txt


2025-05-16 04:50:20 +0000   47561 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-05-16 04:50:20 +0000   47561 execution.bulk     INFO     Finished 5 / 5 lines.
2025-05-16 04:50:20 +0000   47561 execution.bulk     INFO     Average execution time for completed lines: 0.04 seconds. Estimated time for incomplete lines: 0.0 seconds.

Run name: "azure_ai_evaluation_evaluators_blue_score_20250516_045020_143122"
Run status: "Completed"
Start time: "2025-05-16 04:50:20.158762+00:00"
Duration: "0:00:01.542442"
Output path: "/home/vscode/.promptflow/.runs/azure_ai_evaluation_evaluators_blue_score_20250516_045020_143122"

2025-05-16 04:50:20 +0000   47561 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-05-16 04:50:20 +0000   47561 execution.bulk     INFO     Finished 5 / 5 lines.
2025-05-16 04:50:20 +0000   47561 execution.bulk     INFO     Average execution ti

Unnamed: 0,outputs.response,inputs.query,inputs.ground_truth,outputs.relevance.relevance,outputs.relevance.gpt_relevance,outputs.relevance.relevance_reason,outputs.relevance.relevance_result,outputs.relevance.relevance_threshold,outputs.coherence.coherence,outputs.coherence.gpt_coherence,...,outputs.rouge_score.rouge_f1_score_result,outputs.rouge_score.rouge_precision_threshold,outputs.rouge_score.rouge_recall_threshold,outputs.rouge_score.rouge_f1_score_threshold,outputs.violence_score.violence,outputs.violence_score.violence_score,outputs.violence_score.violence_reason,outputs.violence_score.violence_threshold,outputs.violence_score.violence_result,line_number
0,"The United States was founded on July 4, 1776,...",When was United Stated found ?,1776,4,4,The response accurately and completely answers...,pass,3,4,4,...,fail,0.5,0.5,0.5,Very low,0,The system's response provides a factual answe...,3,pass,0
1,The capital of France is Paris.,What is the capital of France?,Paris,4,4,The response accurately and completely answers...,pass,3,4,4,...,fail,0.5,0.5,0.5,Very low,0,The system's response provides factual informa...,3,pass,1
2,The most waterproof tent is typically one that...,Which tent is the most waterproof?,The Alpine Explorer Tent has the highest rainf...,3,3,The response provides relevant information abo...,pass,3,4,4,...,fail,0.5,0.5,0.5,Very low,0,The system's response provides information abo...,3,pass,2
3,The Coleman Compact Folding Table has a weight...,Which camping table holds the most weight?,The Adventure Dining Table has a higher weight...,3,3,The response provides relevant information abo...,pass,3,4,4,...,fail,0.5,0.5,0.5,Very low,0,The system's response provides information abo...,3,pass,3
4,The weight of the Adventure Dining Table is ap...,What is the weight of the Adventure Dining Table?,The Adventure Dining Table weighs 15 lbs,4,4,The RESPONSE fully answers the QUERY with accu...,pass,3,4,4,...,fail,0.5,0.5,0.5,Very low,0,The system's response provides factual informa...,3,pass,4


2025-05-16 04:51:32 +0000   48823 execution.bulk     INFO     Process 48867 terminated.
2025-05-16 04:51:24 +0000   47561 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-05-16 04:51:24 +0000   47561 execution.bulk     INFO     Set process count to 4 by taking the minimum value among the factors of {'default_worker_count': 4, 'row_count': 5}.
2025-05-16 04:51:26 +0000   47561 execution.bulk     INFO     Process name(ForkProcess-8:3)-Process id(48867)-Line number(1) start execution.
2025-05-16 04:51:26 +0000   47561 execution.bulk     INFO     Process name(ForkProcess-8:1)-Process id(48856)-Line number(0) start execution.
2025-05-16 04:51:26 +0000   47561 execution.bulk     INFO     Process name(ForkProcess-8:4)-Process id(48877)-Line number(2) start execution.
2025-05-16 04:51:26 +0000   47561 execution.bulk     INFO     Process name(ForkProcess-8:2)-Process id(48863)-Line number(3) start execution.
2025-05-16 04:51:27

### AI Foundry URL to view results

In [10]:
f"AI Foundry Studio URL: {gpt_35_turbo_results['studio_url']}"

'AI Foundry Studio URL: https://ai.azure.com/build/evaluation/d852591f-b4ea-463c-9e28-ca4825f07ca9?wsid=/subscriptions/3c2e0a23-bcf8-4766-84b7-8c635df04a7b/resourceGroups/rg-aitour/providers/Microsoft.MachineLearningServices/workspaces/ai-project-51324400'

### Evaluate GPT-4o mini

In [11]:
gpt_4o_mini_results = evaluate_model("gpt-4o-mini")
pd.DataFrame(gpt_4o_mini_results["rows"])

[2025-05-16 04:51:24 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_TARGET_20250516_045124_170329, log path: /home/vscode/.promptflow/.runs/azure_ai_evaluation_evaluators_TARGET_20250516_045124_170329/logs.txt
[2025-05-16 04:51:33 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_relevance_20250516_045133_389789, log path: /home/vscode/.promptflow/.runs/azure_ai_evaluation_evaluators_relevance_20250516_045133_389789/logs.txt
[2025-05-16 04:51:33 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_blue_score_20250516_045133_395234, log path: /home/vscode/.promptflow/.runs/azure_ai_evaluation_evaluators_blue_score_20250516_045133_395234/logs.txt
[2025-05-16 04:51:33 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_rouge_score_20250516_045133_401094, log path: /home/

2025-05-16 04:51:35 +0000   47561 execution.bulk     INFO     Finished 1 / 5 lines.
2025-05-16 04:51:35 +0000   47561 execution.bulk     INFO     Average execution time for completed lines: 1.88 seconds. Estimated time for incomplete lines: 7.52 seconds.
2025-05-16 04:51:35 +0000   47561 execution.bulk     INFO     Finished 1 / 5 lines.
2025-05-16 04:51:35 +0000   47561 execution.bulk     INFO     Average execution time for completed lines: 2.0 seconds. Estimated time for incomplete lines: 8.0 seconds.
2025-05-16 04:51:35 +0000   47561 execution.bulk     INFO     Finished 2 / 5 lines.
2025-05-16 04:51:35 +0000   47561 execution.bulk     INFO     Average execution time for completed lines: 1.07 seconds. Estimated time for incomplete lines: 3.21 seconds.
2025-05-16 04:51:35 +0000   47561 execution.bulk     INFO     Finished 3 / 5 lines.
2025-05-16 04:51:35 +0000   47561 execution.bulk     INFO     Average execution time for completed lines: 0.72 seconds. Estimated time for incomplete lin

Unnamed: 0,outputs.response,inputs.query,inputs.ground_truth,outputs.relevance.relevance,outputs.relevance.gpt_relevance,outputs.relevance.relevance_reason,outputs.relevance.relevance_result,outputs.relevance.relevance_threshold,outputs.coherence.coherence,outputs.coherence.gpt_coherence,...,outputs.rouge_score.rouge_f1_score_result,outputs.rouge_score.rouge_precision_threshold,outputs.rouge_score.rouge_recall_threshold,outputs.rouge_score.rouge_f1_score_threshold,outputs.violence_score.violence,outputs.violence_score.violence_score,outputs.violence_score.violence_reason,outputs.violence_score.violence_threshold,outputs.violence_score.violence_result,line_number
0,"The United States was founded on July 4, 1776,...",When was United Stated found ?,1776,5,5,The RESPONSE is complete and provides addition...,pass,3,4,4,...,fail,0.5,0.5,0.5,Very low,0,The system's response provides a historical fa...,3,pass,0
1,The capital of France is Paris.,What is the capital of France?,Paris,4,4,The response accurately and completely answers...,pass,3,4,4,...,fail,0.5,0.5,0.5,Very low,0,The system's response provides factual informa...,3,pass,1
2,"When looking for the most waterproof tent, it'...",Which tent is the most waterproof?,The Alpine Explorer Tent has the highest rainf...,5,5,The RESPONSE fully addresses the QUERY by prov...,pass,3,4,4,...,fail,0.5,0.5,0.5,Very low,0,The system's response provides detailed inform...,3,pass,2
3,When looking for a camping table that can hold...,Which camping table holds the most weight?,The Adventure Dining Table has a higher weight...,5,5,The RESPONSE fully addresses the QUERY with ac...,pass,3,4,4,...,fail,0.5,0.5,0.5,Very low,0,The system's response provides information abo...,3,pass,3
4,The weight of the Adventure Dining Table can v...,What is the weight of the Adventure Dining Table?,The Adventure Dining Table weighs 15 lbs,3,3,The response addresses the query by suggesting...,pass,3,4,4,...,fail,0.5,0.5,0.5,Very low,0,The system's response provides helpful informa...,3,pass,4
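With both runs complete, a quick way to compare the models is to average a metric across the prompts. The sketch below uses the relevance scores visible in the two result tables above, typed in by hand. In the notebook, the same values would come from the `outputs.relevance.relevance` column of each results DataFrame.

```python
from statistics import mean

# Relevance scores per prompt, copied from the two result tables above.
relevance_by_model = {
    "gpt-35-turbo": [4, 4, 3, 3, 4],
    "gpt-4o-mini": [5, 4, 5, 5, 3],
}

mean_relevance = {model: mean(scores) for model, scores in relevance_by_model.items()}

# Print the models from highest to lowest mean relevance.
for model, score in sorted(mean_relevance.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: mean relevance = {score:.2f}")

best_model = max(mean_relevance, key=mean_relevance.get)
print(f"Highest mean relevance: {best_model}")
```

Five prompts is a very small sample, so treat a gap like this as a hint to investigate further rather than as a final verdict.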
