# Evaluate application using manual data set

## Objective

This tutorial provides a step-by-step guide on how to use the simulator to automate interaction with the LLM. This allows users to create synthetic test data. After the data has been simulated, we use the evaluation SDK to score the output.

This tutorial uses the following Azure AI services:

- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)

## Before you begin

### Installation

Install the following packages required to execute this notebook. 

In [6]:
%pip install azure-ai-evaluation
%pip install promptflow-azure
%pip install azure-identity
%pip install --upgrade openai
%pip install python-dotenv
#%pip install marshmallow==3.23.3

Note: you may need to restart the kernel to use updated packages.


### Parameters and imports

In [7]:
import pandas as pd
import os
import json
import importlib.resources as pkg_resources
import requests
from typing import Any, Dict, List, Optional
from pathlib import Path
from pprint import pprint
from azure.ai.evaluation import evaluate
from azure.ai.evaluation import GroundednessEvaluator
from azure.ai.evaluation.simulator import Simulator
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from dotenv import load_dotenv
from application_endpoint import ApplicationEndpoint
load_dotenv()

True

We set up the configuration for the AI Foundry project and the model. The project is used to make results visible in AI Foundry, and the model is used as the LLM judge.

In [8]:


project_scope = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("AZURE_AI_FOUNDRY_RESOURCE_GROUP"),
    "project_name": os.environ.get("AZURE_AI_FOUNDRY_PROJECT_NAME"),
}
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}
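Because the configuration above reads every value with `os.environ.get`, a missing variable silently becomes `None` and only fails later. The following is a minimal sketch of an early check, assuming the variable names used in this notebook; `find_missing_env_vars` is a hypothetical helper, not part of any SDK.

```python
import os

# Variable names taken from the configuration cell in this notebook.
REQUIRED_ENV_VARS = [
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_AI_FOUNDRY_RESOURCE_GROUP",
    "AZURE_AI_FOUNDRY_PROJECT_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT",
]


def find_missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = find_missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv()` makes a misconfigured `.env` file visible before any Azure call is made.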



The AI Foundry SDK provides a simulator that can simulate user interaction with the application.

In [9]:

simulator = Simulator(model_config=model_config)

Connect the application endpoint to the simulator.

In [None]:
async def callback(
    messages: Dict[str, List[Dict]],  # the body indexes messages["messages"], so this is a dict
    stream: bool = False,
    session_state: Any = None,  # noqa: ANN401
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    # get last message
    latest_message = messages_list[-1]
    query = latest_message["content"]
    context = latest_message.get("context", None)
    # call model end point
    model_endpoint = ApplicationEndpoint()
    response = model_endpoint(query, None)
    print(response)
    # we are formatting the response to follow the openAI chat protocol format
    formatted_response = {
        "content": response["response"],
        "role": "assistant",
        "context": context,
    }
    messages["messages"].append(formatted_response)
    return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
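To make the shape of the data flowing through the callback easier to see, here is a small offline sketch. It mirrors the callback body but replaces `ApplicationEndpoint` with `fake_endpoint`, a hypothetical stand-in that simply echoes the query; the real endpoint's behavior is not assumed beyond returning a dict with a `response` key, as the callback above expects.

```python
from typing import Any, Dict, List, Optional


def fake_endpoint(query: str) -> Dict[str, str]:
    """Hypothetical stand-in for ApplicationEndpoint: echoes the query back."""
    return {"query": query, "response": f"You asked: {query}"}


def build_callback_result(
    messages: Dict[str, List[Dict[str, Any]]],
    stream: bool = False,
    session_state: Any = None,
) -> Dict[str, Any]:
    """Mirror the callback body: read the last message, append an assistant reply."""
    latest_message = messages["messages"][-1]
    context: Optional[Any] = latest_message.get("context")
    reply = fake_endpoint(latest_message["content"])
    # Same chat-protocol shape the notebook's callback appends.
    messages["messages"].append(
        {"content": reply["response"], "role": "assistant", "context": context}
    )
    return {
        "messages": messages["messages"],
        "stream": stream,
        "session_state": session_state,
        "context": context,
    }


conversation = {"messages": [{"role": "user", "content": "What is Responsible AI?"}]}
result = build_callback_result(conversation)
print(result["messages"][-1])
```

The returned dict carries the full message list, so each simulated turn sees the conversation history built up so far.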

Call the simulator.

In [5]:
outputs = await simulator(
    target=callback,
    num_queries=2,
    max_conversation_turns=2,
    tasks=[
        "I want to learn more about Responsible AI",
        "I want to know how to implement Responsible AI in my organization",
    ],
)

Generating:   0%|                                                        | 0/4 [00:00<?, ?message/s]

{'azure_endpoint': 'https://aoai-sweden-central-hd.openai.azure.com/', 'azure_deployment': 'gpt-4o', 'type': 'azure_openai', 'api_version': '2024-06-01'}
{'chat_output': "I don't know."}
{'query': "Can you clarify what specific aspects of Responsible AI you'd like to learn about? For example, are you interested in ethical principles, technical implementation, or real-world applications?", 'response': "I don't know."}


Generating:  25%|████████████                                    | 1/4 [00:04<00:12,  4.07s/message]

{'azure_endpoint': 'https://aoai-sweden-central-hd.openai.azure.com/', 'azure_deployment': 'gpt-4o', 'type': 'azure_openai', 'api_version': '2024-06-01'}
{'chat_output': "The ethical principles in Responsible AI, as outlined by Microsoft, are designed to address the unique risks and societal needs associated with AI. These principles guide the development and deployment of AI systems to ensure they are responsible and aligned with societal values. Microsoft has operationalized these principles through its Responsible AI Standard, which provides actionable guidance for product development teams (Source: Microsoft-Responsible-AI-Standard-v2-General-Requirements-3.pdf).\n\nThe principles emphasize the importance of collaboration among industry, academia, civil society, and government to advance the state-of-the-art in AI and address open research questions, measurement gaps, and the design of new practices and tools (Source: Microsoft-Responsible-AI-Standard-v2-General-Requirements-3.pdf)

Generating:  50%|████████████████████████                        | 2/4 [00:11<00:12,  6.10s/message]

{'azure_endpoint': 'https://aoai-sweden-central-hd.openai.azure.com/', 'azure_deployment': 'gpt-4o', 'type': 'azure_openai', 'api_version': '2024-06-01'}
{'chat_output': "To implement Responsible AI in your organization, you can follow specific steps and strategies outlined in Microsoft's Responsible AI Standard v2. Here are some key recommendations and frameworks to consider:\n\n1. **Follow Guidelines for Human-AI Interaction**: When designing AI systems, adhere to established guidelines for human-AI interaction to ensure the system is user-friendly and aligns with ethical principles (Source: Microsoft-Responsible-AI-Standard-v2-General-Requirements-3.pdf).\n\n2. **Leverage Interpretability Tools**: Use techniques from tools like the InterpretML toolkit to understand the impact of features on system behavior. This can help stakeholders comprehend model predictions and ensure transparency in AI decision-making (Source: Microsoft-Responsible-AI-Standard-v2-General-Requirements-3.pdf).\n

Generating:  75%|████████████████████████████████████            | 3/4 [00:21<00:07,  7.88s/message]

{'azure_endpoint': 'https://aoai-sweden-central-hd.openai.azure.com/', 'azure_deployment': 'gpt-4o', 'type': 'azure_openai', 'api_version': '2024-06-01'}
{'chat_output': 'The document does not explicitly mention specific tools or platforms for monitoring AI systems for fairness, bias, and transparency. However, it does recommend practices such as using CheckList to evaluate risks involving identified demographic groups and conducting red teaming exercises to assess these risks (Source: Microsoft-Responsible-AI-Standard-v2-General-Requirements-3.pdf). \n\nTo ensure your team is adequately trained to implement these practices, the document suggests working with user researchers, subject matter experts, and members of identified demographic groups to understand risks and their impacts (Source: Microsoft-Responsible-AI-Standard-v2-General-Requirements-3.pdf). Additionally, establishing feedback mechanisms and a plan for addressing problems can help your team stay aligned with responsible A

Generating: 100%|████████████████████████████████████████████████| 4/4 [00:28<00:00,  7.16s/message]


## Save the simulated data

In [11]:
simulated_output_file = Path("simulated_output.json")
with simulated_output_file.open("w") as f:  # "w" so reruns do not append a second JSON document
    json.dump(outputs, f)

In [12]:
eval_data_file = Path("simulated_eval_data.jsonl")
with eval_data_file.open("w") as file:
    for output in outputs:
        file.write(output.to_eval_qr_json_lines())
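Before running an evaluation it can help to confirm that every line of the JSONL file parses and carries the fields the evaluators will read. The sketch below is one way to do that, assuming each line is a JSON object with at least `query` and `response` keys (the columns shown when the file is loaded later in this notebook). `check_eval_lines` is a hypothetical helper, and the sample text is made up for illustration.

```python
import json
from typing import Any, Dict, List, Sequence


def check_eval_lines(
    jsonl_text: str, required: Sequence[str] = ("query", "response")
) -> List[Dict[str, Any]]:
    """Parse JSONL text and raise ValueError if a record lacks a required key."""
    records = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [key for key in required if key not in record]
        if missing:
            raise ValueError(f"line {line_number} is missing {missing}")
        records.append(record)
    return records


# Illustrative sample, not real simulator output.
sample = "\n".join(
    [
        json.dumps({"query": "What is Responsible AI?", "response": "I don't know."}),
        json.dumps({"query": "Which principles apply?", "response": "Several.", "context": None}),
    ]
)
records = check_eval_lines(sample)
print(len(records), "valid records")
```

For the real file, the same function could be called as `check_eval_lines(eval_data_file.read_text())`.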

## Validate the result

We will use the `evaluate` API from the `azure-ai-evaluation` package. It can take a target application or Python function that handles calls to LLMs and retrieves the responses.

In the notebook, we will use an application target, `ModelEndpoints`, to get answers from multiple model endpoints for the provided questions (prompts).

This application target requires a list of model endpoints and their authentication keys. For simplicity, we have provided them in the `env_var` variable, which is passed into the `init()` function of `ModelEndpoints`.


Please provide the Azure AI project details so that traces and evaluation results are pushed to the project in Azure AI Studio.

In [13]:
azure_ai_project = {
    "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
    "resource_group_name": os.environ["AZURE_AI_FOUNDRY_RESOURCE_GROUP"],
    "project_name": os.environ["AZURE_AI_FOUNDRY_PROJECT_NAME"],
}

## Data

The following code reads the JSON Lines file `simulated_eval_data.jsonl` written in the previous step. Each line provides a query, a response, and a context.

In [14]:
df = pd.read_json("simulated_eval_data.jsonl", lines=True)
print(df.head())

                                               query  \
0  Can you clarify what specific aspects of Respo...   
1  No problem! How about we start with an overvie...   
2  What specific steps or strategies can I use to...   
3  These steps sound comprehensive! Could you pro...   

                                            response context  
0                                      I don't know.    None  
1  The ethical principles in Responsible AI, as o...    None  
2  To implement Responsible AI in your organizati...    None  
3  The document does not explicitly mention speci...    None  
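In the output above, every `context` value is `None`. A quick count of missing values per column makes this kind of gap explicit before the evaluators run. The sketch below uses plain dictionaries so it runs without the data file; `missing_value_counts` is a hypothetical helper and the rows are abbreviated examples.

```python
from typing import Any, Dict, Iterable, List


def missing_value_counts(
    rows: Iterable[Dict[str, Any]], columns: List[str]
) -> Dict[str, int]:
    """Count rows where a column is absent, None, or an empty string."""
    counts = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            if row.get(column) in (None, ""):
                counts[column] += 1
    return counts


# Abbreviated example rows with the same columns as the DataFrame above.
rows = [
    {"query": "q1", "response": "I don't know.", "context": None},
    {"query": "q2", "response": "r2", "context": None},
]
print(missing_value_counts(rows, ["query", "response", "context"]))
```

With the DataFrame from this notebook, `df.to_dict("records")` would give rows in the same shape.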


## Configuration
To use the Relevance and Coherence evaluators, we need the Azure OpenAI model details for the judge model, which are passed as the model config.

In [15]:
import os

model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
    "api_key": os.environ.get("AZURE_OPENAI_KEY"),
}

## Run the evaluation

The following code runs the Evaluate API and uses the Content Safety, Coherence, Relevance, Groundedness, and Fluency evaluators to score the simulated query and response pairs.

The following are a few of the parameters used by the Evaluate API.

+   Data file (Prompts): The data file `simulated_eval_data.jsonl` in JSON Lines format. Each line contains a query, a response, and a context for the evaluators.

+   Application Target: The name of a Python class that can route calls to specific model endpoints, using the model name in conditional logic.

+   Model Name: An identifier of the model, so that custom code in the App Target class can identify the model type and call the respective LLM using its endpoint URL and auth key.

+   Evaluators: A list of evaluators that score the given prompts (questions) as input and the outputs (answers) from the LLM.
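The `evaluator_config` in the next cell uses `column_mapping` entries such as `"query": "${data.query}"` to tell each evaluator which data column feeds which argument. The sketch below illustrates that idea on a single row. It is an illustration of the concept only, not the SDK's implementation: `resolve_mapping` is a hypothetical function, and it handles only the `${data.<column>}` form that appears in this notebook.

```python
import re
from typing import Any, Dict

# Matches placeholders of the form ${data.<column>} used in this notebook.
_PLACEHOLDER = re.compile(r"^\$\{data\.(\w+)\}$")


def resolve_mapping(column_mapping: Dict[str, str], row: Dict[str, Any]) -> Dict[str, Any]:
    """Illustration only: build evaluator inputs for one row from a column mapping."""
    inputs = {}
    for argument, placeholder in column_mapping.items():
        match = _PLACEHOLDER.match(placeholder)
        if match is None:
            raise ValueError(f"unsupported placeholder: {placeholder}")
        # A column that is absent from the row resolves to None.
        inputs[argument] = row.get(match.group(1))
    return inputs


row = {"query": "What is Responsible AI?", "response": "I don't know.", "context": None}
mapping = {"query": "${data.query}", "response": "${data.response}"}
print(resolve_mapping(mapping, row))
```

Reading the mapping this way shows why an evaluator that maps `context` receives an empty value when the data file has no context.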

In [16]:
import pathlib
from datetime import datetime
from azure.ai.evaluation import evaluate
from azure.ai.evaluation import (
    ContentSafetyEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    GroundednessEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
)
from application_endpoint import ApplicationEndpoint


content_safety_evaluator = ContentSafetyEvaluator(
    azure_ai_project=azure_ai_project, credential=DefaultAzureCredential()
)
relevance_evaluator = RelevanceEvaluator(model_config)
coherence_evaluator = CoherenceEvaluator(model_config)
groundedness_evaluator = GroundednessEvaluator(model_config)
fluency_evaluator = FluencyEvaluator(model_config)
similarity_evaluator = SimilarityEvaluator(model_config)

path = str(pathlib.Path.cwd() / "simulated_eval_data.jsonl")

results = evaluate(
    evaluation_name=f"Simulated-Eval-Run-{datetime.now().strftime('%Y-%m-%d')}",
    data=path,
    evaluators={
        "content_safety": content_safety_evaluator,
        "coherence": coherence_evaluator,
        "relevance": relevance_evaluator,
        "groundedness": groundedness_evaluator,
        "fluency": fluency_evaluator,
        # "similarity": similarity_evaluator,
    },
    azure_ai_project=azure_ai_project,
    evaluator_config={
        "content_safety": {"column_mapping": {"query": "${data.query}", "response": "${data.response}"}},
        "coherence": {"column_mapping": {"response": "${data.response}", "query": "${data.query}"}},
        "relevance": {
            "column_mapping": {"response": "${data.response}", "context": "${data.context}", "query": "${data.query}"}
        },
        "groundedness": {
            "column_mapping": {
                "response": "${data.response}",
                "context": "${data.context}",
                "query": "${data.query}",
            }
        },
        "fluency": {
            "column_mapping": {"response": "${data.response}", "context": "${data.context}", "query": "${data.query}"}
        },
        # "similarity": {
        #     "column_mapping": {"response": "${data.response}", "context": "${data.context}", "query": "${data.query}"}
        # },
    },
)

Class ContentSafetyEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ViolenceEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class SexualEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class SelfHarmEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class HateUnfairnessEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
[2025-02-24 16:25:00 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_6v_46gbj_20250224_162459_896423, log path: /home/vscode/.promptflow/

2025-02-24 16:25:00 +0000    1575 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-02-24 16:25:02 +0000    1575 execution.bulk     INFO     Finished 1 / 4 lines.
2025-02-24 16:25:02 +0000    1575 execution.bulk     INFO     Average execution time for completed lines: 2.68 seconds. Estimated time for incomplete lines: 8.04 seconds.
2025-02-24 16:25:03 +0000    1575 execution.bulk     INFO     Finished 2 / 4 lines.
2025-02-24 16:25:03 +0000    1575 execution.bulk     INFO     Average execution time for completed lines: 1.51 seconds. Estimated time for incomplete lines: 3.02 seconds.
2025-02-24 16:25:03 +0000    1575 execution.bulk     INFO     Finished 3 / 4 lines.
2025-02-24 16:25:03 +0000    1575 execution.bulk     INFO     Average execution time for completed lines: 1.16 seconds. Estimated time for incomplete lines: 1.16 seconds.
2025-02-24 16:25:03 +0000    1575 execution.bulk     INFO     Finished 4 / 4 lines.
2025-

View the results

In [29]:
pprint(results)

In [48]:
pd.DataFrame(results["rows"])

| | inputs.query | inputs.response | inputs.context | outputs.coherence.coherence | outputs.coherence.gpt_coherence | outputs.coherence.coherence_reason | outputs.relevance.relevance | outputs.relevance.gpt_relevance | outputs.relevance.relevance_reason | outputs.groundedness.groundedness | outputs.groundedness.gpt_groundedness | outputs.groundedness.groundedness_reason | outputs.fluency.fluency | outputs.fluency.gpt_fluency | outputs.fluency.fluency_reason | line_number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I'm curious about the guidelines for Responsib... | The title of the document that outlines Micros... | | 4 | 4 | The RESPONSE is coherent and effectively addre... | 4 | 4 | The RESPONSE accurately and completely provide... | 1 | 1 | The RESPONSE introduces information about Micr... | 3 | 3 | The response is clear and grammatically correc... | 0 |
| 1 | What are some key principles outlined in Micro... | Microsoft's Responsible AI Standard outlines s... | | 4 | 4 | The RESPONSE is coherent and effectively addre... | 5 | 5 | The RESPONSE fully addresses the QUERY by list... | 1 | 1 | The RESPONSE is completely unrelated to any CO... | 4 | 4 | The RESPONSE is well-articulated, with good gr... | 1 |
| 2 | I'm curious about the specific goals outlined ... | The General Requirements of the Responsible AI... | | 4 | 4 | The RESPONSE is coherent as it directly answer... | 5 | 5 | The RESPONSE fully addresses the QUERY by prov... | 1 | 1 | The RESPONSE is entirely unrelated to any CONT... | 4 | 4 | The RESPONSE demonstrates proficient fluency w... | 2 |
| 3 | That's a comprehensive list of goals! Can you ... | When implementing Responsible AI in your organ... | | 4 | 4 | The RESPONSE is coherent and effectively addre... | 4 | 4 | The RESPONSE fully addresses the QUERY by prov... | 1 | 1 | The RESPONSE introduces a topic about Responsi... | 4 | 4 | The RESPONSE is well-articulated, with good co... | 3 |
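The per-row scores can be condensed into one average per metric. The sketch below does this with the standard library, assuming rows are dictionaries keyed by the column names shown in the results output (for example `outputs.coherence.coherence`). `mean_scores` is a hypothetical helper, and the sample rows copy only the numeric scores from that output.

```python
from statistics import mean
from typing import Any, Dict, List


def mean_scores(rows: List[Dict[str, Any]], columns: List[str]) -> Dict[str, float]:
    """Average each score column, skipping rows where the value is not numeric."""
    averages = {}
    for column in columns:
        values = [row[column] for row in rows if isinstance(row.get(column), (int, float))]
        if values:
            averages[column] = mean(values)
    return averages


SCORE_COLUMNS = [
    "outputs.coherence.coherence",
    "outputs.relevance.relevance",
    "outputs.groundedness.groundedness",
    "outputs.fluency.fluency",
]

# Numeric scores copied from the four result rows in this notebook.
score_rows = [
    dict(zip(SCORE_COLUMNS, scores))
    for scores in [(4, 4, 1, 3), (4, 5, 1, 4), (4, 5, 1, 4), (4, 4, 1, 4)]
]
print(mean_scores(score_rows, SCORE_COLUMNS))
```

In the notebook, the same function could be applied to `results["rows"]`. The uniformly low groundedness average lines up with the empty `inputs.context` column in the results.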
