# **Evaluating AI Models in Azure AI Foundry**

## Overview
This notebook demonstrates how to evaluate AI model outputs using Azure AI Foundry's evaluation capabilities. You'll learn how to assess the groundedness and quality of AI-generated responses, ensuring they're factually accurate and aligned with provided context.

## Evaluation Types in This Notebook

### **Groundedness Evaluation**
Groundedness evaluates whether an AI model's responses are properly supported by the provided context:

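- **GroundednessEvaluator**: Measures how well the generated response aligns with the given context, focusing on relevance and accuracy with respect to that context. In this notebook it is configured with a model configuration (`model_config`).
- **GroundednessProEvaluator**: Detects whether the generated response is consistent and accurate with respect to the given context. In this notebook it is configured with the Azure AI project scope and a credential rather than a model configuration.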

Read more here: [docs](https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-metrics-built-in?tabs=warning#ai-assisted-groundedness)

### **Other Common Evaluations (Not Covered in This Notebook)**
- **Relevance**: Measures how effectively a response addresses a query, assessing accuracy, completeness, and direct relevance based solely on the given query.
- **Coherence**: Measures the logical flow and organization of ideas in a response, so that a reader can easily follow the writer's train of thought.
- **Fluency**: Measures the effectiveness and clarity of written communication, focusing on grammatical accuracy, vocabulary range, sentence complexity, coherence, and overall readability.

Read more here: [docs](https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-metrics-built-in?tabs=warning)

## 1. Setting Up the Environment

First, we'll import the necessary libraries and load environment variables from a `.env` file.
This provides access to our Azure AI Foundry project connection string and model configuration.

In [1]:
import os
import dotenv
dotenv.load_dotenv(".env", override=True)

True
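
Because the later cells read several variables from the environment, it can help to check them right after loading. The sketch below is one possible way to do that; `REQUIRED_VARS` and `missing_env_vars` are hypothetical names introduced here, and the variable list is simply collected from the cells in this notebook.

```python
import os

# Variable names collected from the cells in this notebook.
REQUIRED_VARS = (
    "PROJECT_CONNECTION_STRING",
    "OAI_CONNECTION_NAME",
    "chatModel",
    "AZURE_OPENAI_API_VERSION",
)


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Report problems early instead of failing inside a later cell.
missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a dictionary as `env` makes the helper easy to try out without touching the real environment.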

### Setting Up Azure AI Foundry Client and Connections

Now we'll establish connections to our Azure AI Foundry project and Azure OpenAI service. These connections are essential for:

1. Accessing evaluation capabilities in Azure AI Foundry
2. Configuring the evaluator model that will assess our AI outputs
3. Converting our OpenAI connection into a format compatible with evaluators

The `to_evaluator_model_config()` method extracts the necessary configuration details (endpoint, key, model name) from our Azure OpenAI connection.

In [2]:
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str=os.environ["PROJECT_CONNECTION_STRING"],
)

oai_connection = project_client.connections.get(
    connection_name=os.getenv("OAI_CONNECTION_NAME"),
    include_credentials=True
)

model_config = oai_connection.to_evaluator_model_config(
    deployment_name=os.environ.get("chatModel"),
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION"),
    include_credentials=True,
)
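
For orientation, the configuration handed to the evaluator can be thought of as a small dictionary holding the endpoint, key, deployment, and API version. The sketch below assembles one by hand. It assumes the key names used by `AzureOpenAIModelConfiguration` in `azure.ai.evaluation` (`azure_endpoint`, `api_key`, `azure_deployment`, `api_version`), and the `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` variable names are hypothetical. It is an illustration of the shape only, not a description of what `to_evaluator_model_config()` does internally.

```python
import os


def build_model_config(env=None):
    """Assemble an evaluator model configuration as a plain dictionary."""
    env = os.environ if env is None else env
    return {
        "azure_endpoint": env["AZURE_OPENAI_ENDPOINT"],  # hypothetical variable name
        "api_key": env["AZURE_OPENAI_API_KEY"],          # hypothetical variable name
        "azure_deployment": env["chatModel"],            # same variable as in the cell above
        "api_version": env["AZURE_OPENAI_API_VERSION"],  # same variable as in the cell above
    }


# Placeholder values, for illustration only.
example_config = build_model_config(
    {
        "AZURE_OPENAI_ENDPOINT": "https://example.invalid/",
        "AZURE_OPENAI_API_KEY": "placeholder-key",
        "chatModel": "placeholder-deployment",
        "AZURE_OPENAI_API_VERSION": "placeholder-version",
    }
)
print(sorted(example_config))
```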

## 2. Defining Evaluation Function

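Here we define a function that runs our evaluations and uploads the results to Azure AI Foundry. The function configures two things: which evaluators to run, and how the columns of the data file map to each evaluator's inputs.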

### Evaluation Types
1. **Regular Groundedness** (`GroundednessEvaluator`): Measures how well the generated response aligns with the given context, focusing on relevance and accuracy with respect to that context.
2. **Pro Groundedness** (`GroundednessProEvaluator`): Detects whether the generated response is consistent and accurate with respect to the given context.

### Data Mapping
The `column_mapping` parameter specifies how data fields in our JSONL file map to evaluator inputs:
- `query`: The original user question
- `context`: The reference information that responses should be based on
- `response`: The AI-generated answer we want to evaluate
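
To make the `${data.<column>}` notation concrete, the sketch below resolves a mapping against a single row of data. `resolve_column_mapping` is a hypothetical helper written for this illustration, the sample row is invented, and the real resolution is done inside `evaluate()`, whose internals may differ.

```python
import re

# Matches expressions of the form ${data.<column>} and captures the column name.
_PLACEHOLDER = re.compile(r"^\$\{data\.([^}]+)\}$")


def resolve_column_mapping(column_mapping, row):
    """Return the evaluator inputs obtained by applying the mapping to one row."""
    resolved = {}
    for evaluator_input, expression in column_mapping.items():
        match = _PLACEHOLDER.match(expression)
        if match is None:
            raise ValueError(f"Unsupported mapping expression: {expression!r}")
        resolved[evaluator_input] = row[match.group(1)]
    return resolved


# Invented row in which the column names differ from the evaluator input names.
row = {"question": "What is the capital?", "docs": "Reference text.", "answer": "Model output."}
mapping = {
    "query": "${data.question}",
    "context": "${data.docs}",
    "response": "${data.answer}",
}
print(resolve_column_mapping(mapping, row))
```

In this notebook the column names and evaluator input names happen to be identical, so the mapping looks redundant; it matters when the data file uses different column names.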

In [3]:
import datetime

from azure.ai.evaluation import GroundednessEvaluator, GroundednessProEvaluator, evaluate

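def run_eval_on_azure(model_config, data_path):
    """Run both groundedness evaluators on a JSONL file and log the run to Azure AI Foundry."""
    # Relies on the `project_client` created in the previous cell.
    now = datetime.datetime.now()
    result = evaluate(
        evaluation_name=f"eval-demo-groundedness-{now.strftime('%Y-%m-%d-%H-%M-%S')}",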
        data=data_path,
        evaluators={
            "Regular": GroundednessEvaluator(model_config=model_config),
            "Pro": GroundednessProEvaluator(azure_ai_project=project_client.scope, credential=DefaultAzureCredential()),
        },
        evaluator_config={
            "Regular": {
                "column_mapping": {
                    "query": "${data.query}",
                    "context": "${data.context}",
                    "response": "${data.response}"
                }
            },
            "Pro": {
                "column_mapping": {
                    "query": "${data.query}",
                    "context": "${data.context}",
                    "response": "${data.response}"
                }
            }
        },
        azure_ai_project=project_client.scope,
        # output_path="./myevalresults.json"  # uncomment to also save the results locally
    )
    return result



## 3. Running the Evaluation

Now we'll execute the evaluation on our sample data file. The `eval.jsonl` file contains test cases with:
- User queries
- Context information
- AI-generated responses to evaluate
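
Since every row needs all three fields, a quick check of the file before submitting the run can save a failed evaluation. The following is a minimal sketch under that assumption; `find_invalid_rows` is a hypothetical helper, and the sample lines are invented placeholders rather than the contents of `eval.jsonl`.

```python
import json

REQUIRED_FIELDS = ("query", "context", "response")


def find_invalid_rows(lines, required=REQUIRED_FIELDS):
    """Return (line_number, missing_fields) for each row lacking a required field."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        missing = [field for field in required if not record.get(field)]
        if missing:
            problems.append((number, missing))
    return problems


# Invented placeholder rows: the second one has no context.
sample_lines = [
    json.dumps({"query": "q1", "context": "c1", "response": "r1"}),
    json.dumps({"query": "q2", "response": "r2"}),
]
print(find_invalid_rows(sample_lines))
```

The function accepts any iterable of lines, so an open file handle for `data/eval.jsonl` can be passed directly.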

When the evaluation cell below runs, it will:
1. Upload the evaluation configuration to Azure AI Foundry
2. Process each sample in the data file
3. Score each response for groundedness against its context
4. Generate evaluation results viewable in the Azure AI Foundry web interface

After the run completes, you can view the evaluation results in your Azure AI Foundry project portal under the 'Evaluations' section.

In [4]:
run_eval_on_azure(
    model_config=model_config,
    data_path="data/eval.jsonl"
)

Class GroundednessProEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
[2025-04-07 23:27:58 +0200][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_2khp12i_20250407_232757_355647, log path: C:\Users\povelf\.promptflow\.runs\azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_2khp12i_20250407_232757_355647\logs.txt
[2025-04-07 23:27:58 +0200][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_e315jen1_20250407_232757_355647, log path: C:\Users\povelf\.promptflow\.runs\azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_e315jen1_20250407_232757_355647\logs.txt


Prompt flow service has started...
Prompt flow service has started...
You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_e315jen1_20250407_232757_355647
You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_2khp12i_20250407_232757_355647
2025-04-07 23:27:58 +0200   34460 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-04-07 23:28:00 +0200   34460 execution.bulk     INFO     Finished 1 / 3 lines.
2025-04-07 23:28:00 +0200   34460 execution.bulk     INFO     Average execution time for completed lines: 1.98 seconds. Estimated time for incomplete lines: 3.96 seconds.
2025-04-07 23:28:00 +0200   34460 execution.bulk     INFO     Finished 2 / 3 lines.
2025-04-07 23:28:00 +0200   34460 execution.bulk     INFO     Average execution time




{
    "Regular": {
        "status": "Completed",
        "duration": "0:00:04.035492",
        "completed_lines": 3,
        "failed_lines": 0,
        "log_path": "C:\\Users\\povelf\\.promptflow\\.runs\\azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_2khp12i_20250407_232757_355647"
    },
    "Pro": {
        "status": "Completed",
        "duration": "0:00:13.721957",
        "completed_lines": 3,
        "failed_lines": 0,
        "log_path": "C:\\Users\\povelf\\.promptflow\\.runs\\azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_e315jen1_20250407_232757_355647"
    }
}
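
The run summary above can also be checked programmatically. The sketch below assumes the summary is available as a Python dictionary with the same shape as the printed output (the `log_path` entries are omitted for brevity); `failed_evaluators` is a hypothetical helper, not part of the SDK.

```python
def failed_evaluators(summary):
    """Return the evaluator names whose run did not complete cleanly."""
    return [
        name
        for name, info in summary.items()
        if info.get("status") != "Completed" or info.get("failed_lines", 0) > 0
    ]


# Values copied from the run summary printed above.
summary = {
    "Regular": {"status": "Completed", "duration": "0:00:04.035492",
                "completed_lines": 3, "failed_lines": 0},
    "Pro": {"status": "Completed", "duration": "0:00:13.721957",
            "completed_lines": 3, "failed_lines": 0},
}

problems = failed_evaluators(summary)
print("All evaluators completed" if not problems else f"Check these evaluators: {problems}")
```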




ERROR:opencensus.ext.azure.common.transport:Non-retryable server side error 400: {"itemsReceived":8,"itemsAccepted":0,"appId":null,"errors":[{"index":0,"statusCode":400,"message":"Invalid instrumentation key"},{"index":1,"statusCode":400,"message":"Invalid instrumentation key"},{"index":2,"statusCode":400,"message":"Invalid instrumentation key"},{"index":3,"statusCode":400,"message":"Invalid instrumentation key"},{"index":4,"statusCode":400,"message":"Invalid instrumentation key"},{"index":5,"statusCode":400,"message":"Invalid instrumentation key"},{"index":6,"statusCode":400,"message":"Invalid instrumentation key"},{"index":7,"statusCode":400,"message":"Invalid instrumentation key"}]}.
ERROR:opencensus.ext.azure.common.transport:Non-retryable server side error 400: {"itemsReceived":4,"itemsAccepted":0,"appId":null,"errors":[{"index":0,"statusCode":400,"message":"Invalid instrumentation key"},{"index":1,"statusCode":400,"message":"Invalid instrumentation key"},{"index":2,"statusCode":4