# Evaluate application using simulated data set

## Objective

This tutorial provides a step-by-step guide on how to use the simulator to automate interaction with the LLM. This allows the user to create synthetic test data. After the data has been simulated, we use the evaluation SDK to score the output.

This tutorial uses the following Azure AI services:

- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)

## Before you begin

### Installation

Install the following packages required to execute this notebook. 

In [1]:
%pip install azure-ai-evaluation
%pip install promptflow-azure
%pip install azure-identity
%pip install --upgrade openai
%pip install python-dotenv
#%pip install marshmallow==3.23.3

Note: you may need to restart the kernel to use updated packages.


### Parameters and imports

In [2]:
import pandas as pd
import os
import json
import importlib.resources as pkg_resources
import requests
from typing import Any, Dict, List, Optional
from pathlib import Path
from pprint import pprint
from azure.ai.evaluation import evaluate
from azure.ai.evaluation import GroundednessEvaluator
from azure.ai.evaluation.simulator import Simulator
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from dotenv import load_dotenv
# Using mock endpoint for testing (no APPLICATION_ENDPOINT/KEY needed)
from application_endpoint_mock import ApplicationEndpoint

# Load .env file - explicitly specify the path
env_path = Path(".env")
load_dotenv(dotenv_path=env_path)
print(f"✅ Loaded .env from: {env_path.absolute()}")
print(f"   AZURE_OPENAI_ENDPOINT: {os.environ.get('AZURE_OPENAI_ENDPOINT')}")
print(f"   AZURE_OPENAI_DEPLOYMENT: {os.environ.get('AZURE_OPENAI_DEPLOYMENT')}")

✅ Loaded .env from: /workspaces/eval-hack-2025/Lab4_ApplicationEvaluation/.env
   AZURE_OPENAI_ENDPOINT: https://ai-eval-hack.cognitiveservices.azure.com/
   AZURE_OPENAI_DEPLOYMENT: gpt-4.1
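The later cells read several settings from the environment, and an unset value only surfaces when a service call fails. The following is a minimal sketch of an early check, assuming the variable names used in this notebook; `missing_env_vars` is a hypothetical helper, not part of any SDK.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Variable names taken from the cells in this notebook.
REQUIRED_VARS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT",
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_AI_FOUNDRY_RESOURCE_GROUP",
    "AZURE_AI_FOUNDRY_PROJECT_NAME",
]


def missing_env_vars(
    required: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing settings: {', '.join(missing)}")
```

Running this right after `load_dotenv` should make a misconfigured `.env` file visible before the simulator or the evaluators are created.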


We set up the configuration for the AI Foundry project and the model. These are used to make results visible in the project and to use an LLM as the judge.

In [3]:


project_scope = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("AZURE_AI_FOUNDRY_RESOURCE_GROUP"),
    "project_name": os.environ.get("AZURE_AI_FOUNDRY_PROJECT_NAME"),
}
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}



The AI Foundry SDK provides a simulator that simulates the interaction with the LLM.

In [4]:
simulator = Simulator(model_config=model_config)

Class Simulator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


Connect the application endpoint to the simulator. The callback function maps the input and output between the application endpoint and the format the evaluator expects.

In [5]:
async def callback(
    messages: Dict[str, List[Dict[str, Any]]],  # dict with a "messages" list, as indexed below
    stream: bool = False,
    session_state: Any = None,  # noqa: ANN401
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    # get last message
    latest_message = messages_list[-1]
    query = latest_message["content"]
    context = latest_message.get("context", None)
    # call model end point
    application_endpoint = ApplicationEndpoint()
    response = application_endpoint(query, None)
    print(response)
    # we are formatting the response to follow the openAI chat protocol format
    formatted_response = {
        "content": response["response"],
        "role": "assistant",
        "context": context,
    }
    messages["messages"].append(formatted_response)
    return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
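To make the data flow through the callback easier to follow, here is a minimal sketch that runs the same pattern outside the simulator. `fake_endpoint` and `demo_callback` are hypothetical stand-ins introduced for illustration; the payload shape (a dict holding a `messages` list of role/content entries) is assumed from how the callback above indexes its input.

```python
import asyncio
from typing import Any, Dict, List, Optional


def fake_endpoint(query: str) -> Dict[str, str]:
    """Hypothetical stand-in for ApplicationEndpoint; it only echoes the query."""
    return {"query": query, "response": f"echo: {query}"}


async def demo_callback(
    messages: Dict[str, List[Dict[str, Any]]],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    # Take the most recent message as the query for the application.
    latest = messages["messages"][-1]
    reply = fake_endpoint(latest["content"])
    # Append the reply in chat-protocol form (content, role, context).
    messages["messages"].append(
        {
            "content": reply["response"],
            "role": "assistant",
            "context": latest.get("context"),
        }
    )
    return {
        "messages": messages["messages"],
        "stream": stream,
        "session_state": session_state,
        "context": latest.get("context"),
    }


payload = {"messages": [{"role": "user", "content": "What is Responsible AI?"}]}
result = asyncio.run(demo_callback(payload))
print(result["messages"][-1])
```

Inside a notebook, where an event loop is already running, `await demo_callback(payload)` would be used instead of `asyncio.run`.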

We trigger the simulator with some configuration: we ask it to try to complete two tasks, with two turns of conversation each.

In [6]:
outputs = await simulator(
    target=callback,
    num_queries=2,
    max_conversation_turns=2,
    tasks=[
        "I want to learn more about Responsible AI",
        "I want to know how to implement Responsible AI in my organization",
    ],
)

Generating:   0%|                                                        | 0/4 [00:00<?, ?message/s]

🔧 Using MOCK Application Endpoint (for testing only)
📝 Mock query: I'm interested in Responsible AI. Can you tell me ...
📝 Mock response: According to the Responsible AI Standard: Microsoft's Responsible AI Standard provides a framework f...
{'query': "I'm interested in Responsible AI. Can you tell me what the content of the provided text is, and how it relates to Responsible AI principles?", 'response': "According to the Responsible AI Standard: Microsoft's Responsible AI Standard provides a framework for developing AI systems responsibly. It covers principles including fairness, reliability, safety, privacy, inclusiveness, transparency, and accountability. Each principle provides guidance for building trustworthy AI systems."}


Generating:  25%|████████████                                    | 1/4 [00:02<00:07,  2.43s/message]

🔧 Using MOCK Application Endpoint (for testing only)
📝 Mock query: Thanks for explaining the principles! Can you give...
📝 Mock response: Microsoft's Responsible AI Standard emphasizes that AI systems should treat all people fairly. This ...
{'query': 'Thanks for explaining the principles! Can you give me some real-world examples of how these Responsible AI principles are applied in practice? For instance, how do companies ensure fairness or transparency when deploying AI systems?', 'response': "Microsoft's Responsible AI Standard emphasizes that AI systems should treat all people fairly. This includes identifying and mitigating unfair bias in AI systems."}


Generating:  50%|████████████████████████                        | 2/4 [00:04<00:04,  2.48s/message]

🔧 Using MOCK Application Endpoint (for testing only)
📝 Mock query: I'm interested in implementing Responsible AI in m...
📝 Mock response: Microsoft's Responsible AI Standard provides a framework for developing AI systems responsibly. It c...
{'query': "I'm interested in implementing Responsible AI in my organization. Can you tell me what information or resources are available to help me get started, and are there any specific guidelines or frameworks I should be aware of?", 'response': "Microsoft's Responsible AI Standard provides a framework for developing AI systems responsibly. It covers principles including fairness, reliability, safety, privacy, inclusiveness, transparency, and accountability. Each principle provides guidance for building trustworthy AI systems."}


Generating:  75%|████████████████████████████████████            | 3/4 [00:07<00:02,  2.47s/message]

🔧 Using MOCK Application Endpoint (for testing only)
📝 Mock query: Thanks for sharing that framework! Could you provi...
📝 Mock response: Based on Microsoft's Responsible AI guidelines: Microsoft's Responsible AI Standard provides a frame...
{'query': 'Thanks for sharing that framework! Could you provide examples of how organizations have successfully applied these principles in practice? Also, are there any tools or assessment checklists that can help me evaluate whether our current AI projects align with Responsible AI standards?', 'response': "Based on Microsoft's Responsible AI guidelines: Microsoft's Responsible AI Standard provides a framework for developing AI systems responsibly. It covers principles including fairness, reliability, safety, privacy, inclusiveness, transparency, and accountability. Each principle provides guidance for building trustworthy AI systems."}


Generating: 100%|████████████████████████████████████████████████| 4/4 [00:10<00:00,  2.51s/message]


After the simulation has completed, we can write the output to a file.

In [7]:
simulated_output_file = Path("simulated_output.json")
with simulated_output_file.open("w") as f:  # "w" so that re-running the cell does not append a second JSON document
    json.dump(outputs, f)

The simulated response includes the full conversation between the simulator and the application. To evaluate the output, we need to convert it to the standard format using to_eval_qr_json_lines.

In [8]:
eval_data_file = Path("simulated_eval_data.jsonl")
with eval_data_file.open("w") as file:
    for output in outputs:
        file.write(output.to_eval_qr_json_lines())
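The conversion flattens a multi-turn conversation into one query/response record per line. The sketch below illustrates that target shape with a hand-written function; it is an approximation for explanation only, not the SDK's implementation of `to_eval_qr_json_lines`, and `conversation_to_qr_lines` is a hypothetical name.

```python
import json
from typing import Any, Dict, List


def conversation_to_qr_lines(messages: List[Dict[str, Any]]) -> str:
    """Pair each user message with the next assistant message, one JSON object per line."""
    lines = []
    pending_query = None
    for message in messages:
        if message["role"] == "user":
            pending_query = message["content"]
        elif message["role"] == "assistant" and pending_query is not None:
            record = {
                "query": pending_query,
                "response": message["content"],
                "context": message.get("context"),
            }
            lines.append(json.dumps(record))
            pending_query = None
    return ("\n".join(lines) + "\n") if lines else ""


conversation = [
    {"role": "user", "content": "What is Responsible AI?"},
    {"role": "assistant", "content": "A framework for building AI responsibly.", "context": None},
    {"role": "user", "content": "Can you give an example?"},
    {"role": "assistant", "content": "Mitigating unfair bias.", "context": None},
]
qr_text = conversation_to_qr_lines(conversation)
print(qr_text)
```

Each line of the result can be read back with `json.loads`, which is the same layout `pd.read_json(..., lines=True)` expects.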


Provide the Azure AI project details so that traces and evaluation results are pushed to the project in Azure AI Studio.

In [9]:
azure_ai_project = {
    "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
    "resource_group_name": os.environ["AZURE_AI_FOUNDRY_RESOURCE_GROUP"],
    "project_name": os.environ["AZURE_AI_FOUNDRY_PROJECT_NAME"],
}

The following code reads the JSON Lines file "simulated_eval_data.jsonl", which contains the inputs for the evaluation. Each line provides a query, a response, and a context.

In [10]:
df = pd.read_json("simulated_eval_data.jsonl", lines=True)
print(df.head())

                                               query  \
0  I'm interested in Responsible AI. Can you tell...   
1  Thanks for explaining the principles! Can you ...   
2  I'm interested in implementing Responsible AI ...   
3  Thanks for sharing that framework! Could you p...   

                                            response context  
0  According to the Responsible AI Standard: Micr...    None  
1  Microsoft's Responsible AI Standard emphasizes...    None  
2  Microsoft's Responsible AI Standard provides a...    None  
3  Based on Microsoft's Responsible AI guidelines...    None  
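In the output above the `context` column is `None` for every row. The groundedness evaluator scores a response against its context, so it can help to count empty fields before running the evaluation. The following is a minimal sketch using only the standard library; `count_missing_fields` is a hypothetical helper and the sample lines are made up for illustration.

```python
import json
from typing import Dict, Iterable, Sequence


def count_missing_fields(
    jsonl_lines: Iterable[str],
    fields: Sequence[str] = ("query", "response", "context"),
) -> Dict[str, int]:
    """Count, per field, how many JSON Lines records have it absent, None, or empty."""
    missing = {field: 0 for field in fields}
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        for field in fields:
            if row.get(field) in (None, ""):
                missing[field] += 1
    return missing


sample_lines = [
    '{"query": "q1", "response": "r1", "context": null}',
    '{"query": "q2", "response": "r2", "context": "some source text"}',
    '{"query": "q3", "response": "", "context": null}',
]
missing_counts = count_missing_fields(sample_lines)
print(missing_counts)
```

For the file in this notebook, the lines could be passed in with `Path("simulated_eval_data.jsonl").read_text().splitlines()`.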


To use the Relevance and Coherence evaluators, we provide the Azure OpenAI model details as a model config; the model acts as the judge.

In [11]:
import os

model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
    "api_key": os.environ.get("AZURE_OPENAI_KEY"),
}

The following code runs the Evaluate API and uses the Content Safety, Coherence, Relevance, Groundedness, and Fluency evaluators to evaluate the simulated responses.

The following are a few of the parameters required by the Evaluate API.

+   Data file (Prompts): The data file 'simulated_eval_data.jsonl' in JSON Lines format. Each line contains the query, response, and context for the evaluators.

+   Evaluators: The evaluators to run against the prompts (queries) and the outputs (responses) from the LLM.

Note: Similarity is commented out because this is simulated data and we don't have ground truth from an SME to validate against.

In [12]:
import pathlib
from datetime import datetime
from azure.ai.evaluation import evaluate
from azure.ai.evaluation import (
    ContentSafetyEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    GroundednessEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
)
from application_endpoint import ApplicationEndpoint


content_safety_evaluator = ContentSafetyEvaluator(
    azure_ai_project=azure_ai_project, credential=DefaultAzureCredential()
)
relevance_evaluator = RelevanceEvaluator(model_config)
coherence_evaluator = CoherenceEvaluator(model_config)
groundedness_evaluator = GroundednessEvaluator(model_config)
fluency_evaluator = FluencyEvaluator(model_config)
similarity_evaluator = SimilarityEvaluator(model_config)

path = str(pathlib.Path.cwd() / "simulated_eval_data.jsonl")

results = evaluate(
    # Single quotes inside the f-string: reusing double quotes is a syntax error before Python 3.12.
    evaluation_name=f"Simulated-Eval-Run-{datetime.now().strftime('%Y-%m-%d')}",
    data=path,
    evaluators={
        "content_safety": content_safety_evaluator,
        "coherence": coherence_evaluator,
        "relevance": relevance_evaluator,
        "groundedness": groundedness_evaluator,
        "fluency": fluency_evaluator,
        # "similarity": similarity_evaluator,
    },
    azure_ai_project=azure_ai_project,
    evaluator_config={
        "content_safety": {"column_mapping": {"query": "${data.query}", "response": "${data.response}"}},
        "coherence": {"column_mapping": {"response": "${data.response}", "query": "${data.query}"}},
        "relevance": {
            "column_mapping": {"response": "${data.response}", "context": "${data.context}", "query": "${data.query}"}
        },
        "groundedness": {
            "column_mapping": {
                "response": "${data.response}",
                "context": "${data.context}",
                "query": "${data.query}",
            }
        },
        "fluency": {
            "column_mapping": {"response": "${data.response}", "context": "${data.context}", "query": "${data.query}"}
        },
        # "similarity": {
        #     "column_mapping": {"response": "${data.response}", "context": "${data.context}", "query": "${data.query}"}
        # },
    },
)

Class ContentSafetyEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ViolenceEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class SexualEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class SelfHarmEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class SelfHarmEvaluator: This is an experimental class, and may change at any time. Ple

2025-10-20 14:18:26 +0000 139697896548032 execution.bulk     INFO     Finished 1 / 4 lines.
2025-10-20 14:18:26 +0000 139697896548032 execution.bulk     INFO     Average execution time for completed lines: 1.41 seconds. Estimated time for incomplete lines: 4.23 seconds.
2025-10-20 14:18:27 +0000 139697896548032 execution.bulk     INFO     Finished 2 / 4 lines.
2025-10-20 14:18:27 +0000 139697896548032 execution.bulk     INFO     Average execution time for completed lines: 0.86 seconds. Estimated time for incomplete lines: 1.72 seconds.
2025-10-

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "content_safety_20251020_141825_575967"
Run status: "Completed"
Start time: "2025-10-20 14:18:25.575967+00:00"
Duration: "0:00:57.702149"


{
    "content_safety": {
        "status": "Completed",
        "duration": "0:00:57.702149",
        "completed_lines": 4,
        "failed_lines": 0,
        "log_path": null
    },
    "coherence": {
        "status": "Completed",
        "duration": "0:00:03.001457",
        "completed_lines": 4,
        "failed_lines": 0,
        "log_path": null
    },
    "relevance": {
        "status": "Completed",
        "duration": "0:00:02.001003",
        "completed_lines": 4,
        "failed_lines": 0,
        "log_path": null
    },
    "groundedness": {
        "status": "Completed",
        "duration": "0:00:03.001963",
        "completed_lines": 4,
        "failed_lines": 0,
        "log_path": null
    },
    "fluency": {
        "status": "Completed",
        "duration": "0:00:03.000843",
        "completed_lines": 4,
        "failed

EvaluationException: (UserError) Failed to upload evaluation run to the cloud due to insufficient permission to access the storage. Please ensure that the necessary access rights are granted.
Visit https://aka.ms/azsdk/python/evaluation/remotetracking/troubleshoot to troubleshoot this issue.

View the results here, or view them in the AI Foundry project.

In [None]:
pprint(results)

In [None]:
pd.DataFrame(results["rows"])
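The per-row table can be condensed into one average per metric. The sketch below is a minimal version using only the standard library. It assumes each row is a dict whose score columns start with `outputs.` (for example `outputs.coherence.coherence`); that key layout, the helper name `mean_scores`, and the sample values are assumptions for illustration and should be checked against the actual contents of `results["rows"]`.

```python
from statistics import mean
from typing import Any, Dict, List


def mean_scores(rows: List[Dict[str, Any]], prefix: str = "outputs.") -> Dict[str, float]:
    """Average every numeric column whose name starts with `prefix`."""
    collected: Dict[str, List[float]] = {}
    for row in rows:
        for key, value in row.items():
            # bool is a subclass of int, so exclude it explicitly.
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if key.startswith(prefix) and is_number:
                collected.setdefault(key, []).append(float(value))
    return {key: mean(values) for key, values in collected.items()}


# Made-up rows, only to show the shape the helper expects.
sample_rows = [
    {"inputs.query": "q1", "outputs.coherence.coherence": 4, "outputs.fluency.fluency": 3},
    {"inputs.query": "q2", "outputs.coherence.coherence": 5, "outputs.fluency.fluency": 4},
]
averages = mean_scores(sample_rows)
print(averages)
```

With real results this would be called as `mean_scores(results["rows"])`; non-numeric columns such as reasons or labels are skipped.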

We can also generate adversarial conversations, in which the simulator tries to hack or jailbreak the application.

In [None]:
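from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario

# Adversarial scenarios are run by AdversarialSimulator, not by the Simulator created earlier.
adversarial_simulator = AdversarialSimulator(
    azure_ai_project=azure_ai_project, credential=DefaultAzureCredential()
)

adversarial_outputs = await adversarial_simulator(
    scenario=AdversarialScenario.ADVERSARIAL_SEARCH, max_conversation_turns=2, max_simulation_results=2, target=callback
)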

We can write the conversation to a file for review.

In [None]:

with Path("adversarial_outputs.jsonl").open("w") as f:
    json.dump(adversarial_outputs, f)
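For a quick read-through, a conversation can also be rendered as a short transcript. This is a minimal sketch that assumes each simulator output behaves like a dict with a `messages` list of role/content entries, matching what the callback returns; `format_conversation` is a hypothetical helper and the sample conversation is made up.

```python
from typing import Any, Dict


def format_conversation(conversation: Dict[str, Any]) -> str:
    """Render a conversation as numbered '[role] content' lines."""
    lines = []
    for index, message in enumerate(conversation.get("messages", []), start=1):
        role = message.get("role", "unknown")
        content = message.get("content", "")
        lines.append(f"{index}. [{role}] {content}")
    return "\n".join(lines)


sample_conversation = {
    "messages": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"},
    ]
}
transcript = format_conversation(sample_conversation)
print(transcript)
```

Applied to the notebook's data, this could be called in a loop such as `for output in adversarial_outputs: print(format_conversation(output))`.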