# **Evaluate AI Agent on Azure**

In this notebook, we focus on **evaluating the AI agent using Azure services**. This involves importing the required libraries, loading the necessary configurations, performing evaluations using Azure AI services, and analyzing the results.

### Objectives:
- **Import Libraries:** Import the necessary libraries for evaluation.
- **Load Configurations:** Load the necessary configurations from the environment file.
- **Perform Evaluation:** Use your LLM judge to evaluate the AI agent.
- **Analyze Results:** Analyze the evaluation results to gain insights into the AI agent's performance.

This notebook ensures that the AI agent is evaluated effectively using your LLM judge, providing insights into its performance and areas for improvement.

In [None]:
import datetime
import json
import os

import dotenv
import numpy as np
import pandas as pd
from azure.ai.evaluation import evaluate
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Load environment variables from the local .env file
dotenv.load_dotenv(".env")

True

In [None]:
aoai_endpoint = os.environ["AZURE_OPENAI_API_BASE"]
aoai_api_key = os.environ["AZURE_OPENAI_API_KEY"]
aoai_chat_model_mini = os.environ["AZURE_OPENAI_MODEL_MINI"]
llm_judge = os.environ["LLM_JUDGE"]
aoai_api_version = os.environ["AZURE_OPENAI_API_VERSION"]
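
Because each `os.environ[...]` lookup above raises a `KeyError` on the first missing variable, it can help to check all of them at once before continuing. The sketch below is one way to do that; the helper `find_missing_env_vars` is a hypothetical addition, and only the variable names come from the cell above.

```python
import os

# Variable names taken from the configuration cell; the helper is hypothetical.
REQUIRED_ENV_VARS = [
    "AZURE_OPENAI_API_BASE",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_MODEL_MINI",
    "LLM_JUDGE",
    "AZURE_OPENAI_API_VERSION",
]


def find_missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = find_missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing a plain dict as `env` makes the helper easy to try out without touching the real environment.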

**Transform data to correct format**
- For demonstration purposes, the test data is transformed into the format expected by the evaluation.

In [4]:
import pandas as pd

# Load the CSV file into a DataFrame
fpath = 'data/agent-output/testtest.csv'
df = pd.read_csv(fpath)

df_subset = df[['synthetic_question', 'chunk_data', 'synthetic_response']].rename(
    columns={
        'synthetic_question': 'query',
        'chunk_data': 'context',
        'synthetic_response': 'response'
    }
)

# Export the DataFrame to a JSONL file.
jsonl_path = 'data/agent-output/testtest.jsonl'
df_subset.to_json(jsonl_path, orient='records', lines=True)
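
Empty cells in the CSV are written as `null` in the JSONL file, so a quick check that every record carries a non-empty `query`, `context`, and `response` can catch problems before the evaluation run. This is a minimal sketch using made-up sample rows; the helper `find_invalid_records` is hypothetical and not part of the original notebook.

```python
import json

REQUIRED_FIELDS = ("query", "context", "response")


def find_invalid_records(lines, required=REQUIRED_FIELDS):
    """Return (line_number, missing_fields) for records lacking a non-empty required field."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        missing = [field for field in required if not record.get(field)]
        if missing:
            problems.append((line_number, missing))
    return problems


# Made-up sample rows standing in for the exported file.
sample = [
    '{"query": "What is X?", "context": "X is a tool.", "response": "A tool."}',
    '{"query": "What is Y?", "context": null, "response": "Unknown."}',
]
print(find_invalid_records(sample))  # [(2, ['context'])]
```

To check the real export, the same function can be given an open file handle, for example `with open(jsonl_path) as f: find_invalid_records(f)`.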


**Run evaluation and upload results to AI Foundry**

In [5]:
def get_model_config(eval_model=llm_judge):
    return {
        "azure_endpoint": aoai_endpoint,
        "api_key": aoai_api_key,
        "azure_deployment": eval_model,
        "api_version": aoai_api_version
    }

In [6]:
def load_config(eval_model=llm_judge):
    credential = DefaultAzureCredential()

    model_config = get_model_config(eval_model)

    # Initialize the Azure AI project and Azure OpenAI connection with your environment variables
    azure_ai_project = {
        "subscription_id": os.environ["SUB_ID"],
        "resource_group_name": os.environ["RG_NAME"],
        "project_name": os.environ["AZURE_PROJECT_NAME"],
    }
    return azure_ai_project, model_config, credential

In [7]:
def run_eval_on_azure(azure_ai_project, custom_groundedness, model_name, path):
    now = datetime.datetime.now()
    result = evaluate(
        evaluation_name=f"custom-groundedness-{model_name}-{now.strftime('%Y-%m-%d-%H-%M-%S')}",
        data=path,
        evaluators={
            "custom_groundedness_0_1": custom_groundedness,
        },
        # Column mapping; the key must match the evaluator name used in `evaluators`
        evaluator_config={
            "custom_groundedness_0_1": {
                "column_mapping": {
                    "query": "${data.query}",
                    "context": "${data.context}",
                    "response": "${data.response}"
                }
            }
        },
        azure_ai_project=azure_ai_project,
        # output_path="./myevalresults.json"
    )
    # Return the evaluation result so it can be analyzed afterwards
    return result
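
The `evaluators` argument takes callables that receive the mapped columns as keyword arguments and return a dictionary of scores. The `CustomGroundednessEvaluator` used in this notebook is an LLM-judge evaluator whose implementation is not shown here. As a rough illustration of the callable shape only, the sketch below uses a simple token-overlap heuristic instead of an LLM. The class name, the `threshold` parameter, and the output keys are all assumptions made for this example.

```python
import re


class TokenOverlapGroundedness:
    """Hypothetical stand-in evaluator: scores 1 if enough response tokens appear in the context."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold

    def __call__(self, *, query, context, response, **kwargs):
        context_tokens = set(re.findall(r"\w+", context.lower()))
        response_tokens = re.findall(r"\w+", response.lower())
        if not response_tokens:
            return {"groundedness": 0, "overlap": 0.0}
        # Fraction of response tokens that also occur in the context
        overlap = sum(t in context_tokens for t in response_tokens) / len(response_tokens)
        return {"groundedness": int(overlap >= self.threshold), "overlap": overlap}


evaluator = TokenOverlapGroundedness()
print(evaluator(
    query="Where is the plant?",
    context="The plant is in Sweden.",
    response="The plant is in Sweden.",
))
```

A heuristic like this is far cruder than an LLM judge, but it follows the same call contract and can be handy for testing the pipeline without spending tokens.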


In [8]:
from evaluators.aoai.custom_groundedness import CustomGroundednessEvaluator

# Load Azure AI project and model configuration
azure_ai_project, model_config, credential = load_config(eval_model=llm_judge)

# Custom evaluator for groundedness
custom_groundedness = CustomGroundednessEvaluator(model_config)

# Run evaluation
model_name = model_config["azure_deployment"]
result = run_eval_on_azure(azure_ai_project, custom_groundedness, model_name, jsonl_path)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\povelf\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[2025-02-08 21:51:07 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run evaluators_aoai_custom_groundedness_customgroundednessevaluator_ql67as7j_20250208_215106_870650, log path: C:\Users\povelf\.promptflow\.runs\evaluators_aoai_custom_groundedness_customgroundednessevaluator_ql67as7j_20250208_215106_870650\logs.txt


Prompt flow service has started...
You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=evaluators_aoai_custom_groundedness_customgroundednessevaluator_ql67as7j_20250208_215106_870650
2025-02-08 21:51:07 +0100   79816 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-02-08 21:51:07 +0100   79816 execution.bulk     INFO     The timeout for the batch run is 3600 seconds.
2025-02-08 21:51:07 +0100   79816 execution.bulk     INFO     Current system's available memory is 9745.91015625MB, memory consumption of current process is 294.91015625MB, estimated available worker count is 9745.91015625/294.91015625 = 33
2025-02-08 21:51:07 +0100   79816 execution.bulk     INFO     Set process count to 4 by taking the minimum value among the factors of {'default_worker_count': 4, 'row_count': 50, 'estimated_worker_count_based_on_memory_usage': 33}.
2025-02-08 21:51:15 +0100   79816 execution.bulk     INFO  

ERROR:azure.monitor.opentelemetry.exporter.export._base:Non-retryable server side error: Operation returned an invalid status 'Bad Request'.
ERROR:azure.monitor.opentelemetry.exporter.export._base:Non-retryable server side error: Operation returned an invalid status 'Bad Request'.


ERROR:opencensus.ext.azure.common.transport:Non-retryable server side error 400: {"itemsReceived":4,"itemsAccepted":0,"appId":null,"errors":[{"index":0,"statusCode":400,"message":"Invalid instrumentation key"},{"index":1,"statusCode":400,"message":"Invalid instrumentation key"},{"index":2,"statusCode":400,"message":"Invalid instrumentation key"},{"index":3,"statusCode":400,"message":"Invalid instrumentation key"}]}.
ERROR:opencensus.ext.azure.common.transport:Non-retryable server side error 400: {"itemsReceived":3,"itemsAccepted":0,"appId":null,"errors":[{"index":0,"statusCode":400,"message":"Invalid instrumentation key"},{"index":1,"statusCode":400,"message":"Invalid instrumentation key"},{"index":2,"statusCode":400,"message":"Invalid instrumentation key"}]}.



{
    "custom_groundedness_0_1": {
        "status": "Completed",
        "duration": "0:00:34.004987",
        "completed_lines": 50,
        "failed_lines": 0,
        "log_path": "C:\\Users\\povelf\\.promptflow\\.runs\\evaluators_aoai_custom_groundedness_customgroundednessevaluator_ql67as7j_20250208_215106_870650"
    }
}


Evaluation results saved to "C:\Users\povelf\OneDrive - Microsoft\Povel @ MSFT\EXTERNAL\Customers\Sandvik\demo-sdvk\evaluation\myevalresults.json".



ERROR:opencensus.ext.azure.common.transport:Non-retryable server side error 400: {"itemsReceived":2,"itemsAccepted":0,"appId":null,"errors":[{"index":0,"statusCode":400,"message":"Invalid instrumentation key"},{"index":1,"statusCode":400,"message":"Invalid instrumentation key"}]}.
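
For the analysis step, the per-row scores can be summarized once the run has completed. The sketch below assumes the result of `evaluate` exposes a list of row dictionaries (for example under a `"rows"` key) with input columns prefixed by `inputs.` and evaluator outputs prefixed by `outputs.<evaluator name>.`. The score key used here is hypothetical because the output field of `CustomGroundednessEvaluator` is not shown in this notebook, and the rows are made-up sample data.

```python
from statistics import mean

# Hypothetical score key; adjust it to the field your evaluator actually returns.
SCORE_KEY = "outputs.custom_groundedness_0_1.groundedness"

# Made-up rows mimicking the assumed shape of the per-row results.
rows = [
    {"inputs.query": "Q1", SCORE_KEY: 1},
    {"inputs.query": "Q2", SCORE_KEY: 0},
    {"inputs.query": "Q3", SCORE_KEY: 1},
    {"inputs.query": "Q4", SCORE_KEY: None},  # e.g. a row the judge failed to score
]


def summarize_scores(rows, score_key):
    """Return the number of scored rows, the mean score, and the queries scored 0."""
    scored = [row for row in rows if row.get(score_key) is not None]
    return {
        "scored": len(scored),
        "mean_score": mean(row[score_key] for row in scored) if scored else None,
        "ungrounded_queries": [row["inputs.query"] for row in scored if row[score_key] == 0],
    }


summary = summarize_scores(rows, SCORE_KEY)
print(summary)
```

Listing the queries that scored 0 gives a starting point for inspecting where the agent's responses drift from the provided context.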
