# Evaluate application using simulated data set

## Objective

This tutorial provides a step-by-step guide on how to use the simulator to automate interaction with the LLM. This allows users to create synthetic test data. After the data has been simulated, we use the evaluation SDK to score the output.

This tutorial uses the following Azure AI services:

- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)

## Before you begin

### Installation

Install the following packages required to execute this notebook. 

In [None]:
%pip install azure-ai-evaluation
%pip install promptflow-azure
%pip install azure-identity
%pip install --upgrade openai
%pip install python-dotenv
#%pip install marshmallow==3.23.3

### Parameters and imports

In [None]:
import pandas as pd
import os
import json
import importlib.resources as pkg_resources
import requests
from typing import Any, Dict, List, Optional
from pathlib import Path
from pprint import pprint
from azure.ai.evaluation import evaluate
from azure.ai.evaluation import GroundednessEvaluator
from azure.ai.evaluation.simulator import Simulator
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from dotenv import load_dotenv
from application_endpoint import ApplicationEndpoint
load_dotenv()
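
The later cells read their settings from environment variables, so a quick check right after `load_dotenv()` can surface a missing value early. The following is a minimal sketch: the helper name `missing_env_vars` is hypothetical, and the variable names listed are the ones this notebook reads with `os.environ`.

```python
import os
from typing import Iterable, List, Mapping

# Environment variables that the later cells of this notebook read.
REQUIRED_ENV_VARS = [
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_AI_FOUNDRY_RESOURCE_GROUP",
    "AZURE_AI_FOUNDRY_PROJECT_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT",
]


def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in env (hypothetical helper)."""
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

An empty list means every listed variable has a value; it does not confirm that the values are valid.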

We set up the configuration for the AI Foundry project and the model config. These are used to make results visible in the project and to use an LLM as the judge.

In [None]:


project_scope = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("AZURE_AI_FOUNDRY_RESOURCE_GROUP"),
    "project_name": os.environ.get("AZURE_AI_FOUNDRY_PROJECT_NAME"),
}
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}



The AI Foundry SDK provides a simulator that can simulate the interaction with the LLM.

In [None]:
simulator = Simulator(model_config=model_config)

Connect the application endpoint to the simulator. The callback function maps the input and output between the application endpoint and the format the evaluator expects.

In [None]:
async def callback(
    messages: Dict[str, List[Dict]],
    stream: bool = False,
    session_state: Any = None,  # noqa: ANN401
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    # Get the latest message in the conversation
    latest_message = messages_list[-1]
    query = latest_message["content"]
    context = latest_message.get("context", None)
    # Call the application endpoint
    application_endpoint = ApplicationEndpoint()
    response = application_endpoint(query, None)
    print(response)
    # Format the response to follow the OpenAI chat protocol
    formatted_response = {
        "content": response["response"],
        "role": "assistant",
        "context": context,
    }
    messages["messages"].append(formatted_response)
    return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
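
To see the message shape that flows through the callback without calling a real service, the sketch below swaps `ApplicationEndpoint` for a hypothetical stub (`fake_endpoint`) and runs the same append logic on a hand-written payload. The payload layout mirrors what the callback above reads (`messages["messages"]`, `role`, `content`); it is an illustration under that assumption, not actual simulator output.

```python
import asyncio
from typing import Any, Dict, List, Optional


def fake_endpoint(query: str) -> Dict[str, str]:
    """Hypothetical stand-in for ApplicationEndpoint; it only echoes the query."""
    return {"response": f"Echo: {query}"}


async def demo_callback(
    messages: Dict[str, List[Dict[str, Any]]],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    # Read the latest turn and send its text to the (stub) application.
    latest_message = messages["messages"][-1]
    response = fake_endpoint(latest_message["content"])
    # Append the reply as an assistant turn in chat-protocol form.
    messages["messages"].append(
        {
            "content": response["response"],
            "role": "assistant",
            "context": latest_message.get("context"),
        }
    )
    return {
        "messages": messages["messages"],
        "stream": stream,
        "session_state": session_state,
        "context": context,
    }


payload = {"messages": [{"role": "user", "content": "What is Responsible AI?"}]}
result = asyncio.run(demo_callback(payload))
print(result["messages"][-1])
```

Inside a notebook, where an event loop is already running, `await demo_callback(payload)` takes the place of `asyncio.run(...)`.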

We run the simulator with a configuration that asks it to complete two tasks, with at most two turns per conversation.

In [None]:
outputs = await simulator(
    target=callback,
    num_queries=2,
    max_conversation_turns=2,
    tasks=[
        "I want to learn more about Responsible AI",
        "I want to know how to implement Responsible AI in my organization",
    ],
)

After the simulation has completed, we can write the output to a file.

In [None]:
simulated_output_file = Path("simulated_output.json")
# Open in write mode: appending a second JSON document on a rerun would make the file invalid JSON
with simulated_output_file.open("w") as f:
    json.dump(outputs, f)

The simulated response includes the full conversation between the simulator and the application. To evaluate the output, we convert it to the standard query/response format using `to_eval_qr_json_lines`.

In [None]:
eval_data_file = Path("simulated_eval_data.jsonl")
with eval_data_file.open("w") as file:
    for output in outputs:
        file.write(output.to_eval_qr_json_lines())
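
As a rough picture of what this conversion produces, the sketch below pairs each user turn with the assistant turn that follows it and writes one JSON object per line. The function `conversation_to_qr_lines` is hypothetical and written only for illustration; it is not the SDK's implementation, and the SDK's output may carry additional fields.

```python
import json
from typing import Dict, List


def conversation_to_qr_lines(messages: List[Dict[str, str]]) -> str:
    """Pair each user turn with the next assistant turn (illustration only)."""
    lines = []
    pending_query = None
    for message in messages:
        if message["role"] == "user":
            # Remember the query until the assistant answers it.
            pending_query = message["content"]
        elif message["role"] == "assistant" and pending_query is not None:
            lines.append(json.dumps({"query": pending_query, "response": message["content"]}))
            pending_query = None
    # JSON Lines: one JSON object per line, each terminated by a newline.
    return ("\n".join(lines) + "\n") if lines else ""


conversation = [
    {"role": "user", "content": "What is Responsible AI?"},
    {"role": "assistant", "content": "A set of principles for building AI safely."},
    {"role": "user", "content": "How do I start?"},
    {"role": "assistant", "content": "Begin with an impact assessment."},
]
jsonl_text = conversation_to_qr_lines(conversation)
records = [json.loads(line) for line in jsonl_text.splitlines()]
print(records)
```

Each record is independent, which is why the evaluation step can later read the file line by line.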


Provide the Azure AI project details so that traces and evaluation results are pushed to the project in Azure AI Studio.

In [None]:
azure_ai_project = {
    "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
    "resource_group_name": os.environ["AZURE_AI_FOUNDRY_RESOURCE_GROUP"],
    "project_name": os.environ["AZURE_AI_FOUNDRY_PROJECT_NAME"],
}

## Data

The following code reads the JSON Lines file "simulated_eval_data.jsonl", which contains the inputs for the evaluators. Each line provides a question and its context.

In [None]:
df = pd.read_json("simulated_eval_data.jsonl", lines=True)
print(df.head())

The Relevance and Coherence evaluators use an Azure OpenAI model as the judge. Its details are passed in as the model config.

In [None]:
import os

model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
    "api_key": os.environ.get("AZURE_OPENAI_KEY"),
}

## Run the evaluation

The following code runs the Evaluate API with the Content Safety, Coherence, Relevance, Groundedness, and Fluency evaluators to score the simulated results.

The Evaluate API requires the following parameters:

+   Data file (Prompts): the data file 'simulated_eval_data.jsonl' in JSON Lines format. Each line contains the question and context for the evaluators.

+   Evaluators: the list of evaluators to apply to the prompts (questions) given as input and the output (answers) from the LLM.

In [None]:
import pathlib
from datetime import datetime
from azure.ai.evaluation import evaluate
from azure.ai.evaluation import (
    ContentSafetyEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    GroundednessEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
)
from application_endpoint import ApplicationEndpoint


content_safety_evaluator = ContentSafetyEvaluator(
    azure_ai_project=azure_ai_project, credential=DefaultAzureCredential()
)
relevance_evaluator = RelevanceEvaluator(model_config)
coherence_evaluator = CoherenceEvaluator(model_config)
groundedness_evaluator = GroundednessEvaluator(model_config)
fluency_evaluator = FluencyEvaluator(model_config)
similarity_evaluator = SimilarityEvaluator(model_config)

path = str(pathlib.Path.cwd() / "simulated_eval_data.jsonl")

results = evaluate(
    evaluation_name=f"Simulated-Eval-Run-{datetime.now().strftime('%Y-%m-%d')}",
    data=path,
    evaluators={
        "content_safety": content_safety_evaluator,
        "coherence": coherence_evaluator,
        "relevance": relevance_evaluator,
        "groundedness": groundedness_evaluator,
        "fluency": fluency_evaluator,
        # "similarity": similarity_evaluator,
    },
    azure_ai_project=azure_ai_project,
    evaluator_config={
        "content_safety": {"column_mapping": {"query": "${data.query}", "response": "${data.response}"}},
        "coherence": {"column_mapping": {"response": "${data.response}", "query": "${data.query}"}},
        "relevance": {
            "column_mapping": {"response": "${data.response}", "context": "${data.context}", "query": "${data.query}"}
        },
        "groundedness": {
            "column_mapping": {
                "response": "${data.response}",
                "context": "${data.context}",
                "query": "${data.query}",
            }
        },
        "fluency": {
            "column_mapping": {"response": "${data.response}", "context": "${data.context}", "query": "${data.query}"}
        },
        # "similarity": {
        #     "column_mapping": {"response": "${data.response}", "context": "${data.context}", "query": "${data.query}"}
        # },
    },
)
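
The `column_mapping` entries above tie each evaluator input (`query`, `response`, `context`) to a column of the data file through a `${data.<column>}` placeholder. The sketch below illustrates that idea on a single row. The function `resolve_column_mapping` is hypothetical and is not how the SDK resolves mappings internally; it only handles the `${data.<column>}` form used in this notebook.

```python
import re
from typing import Any, Dict

# Matches placeholders of the form ${data.<column>} used in this notebook.
_PLACEHOLDER = re.compile(r"^\$\{data\.(\w+)\}$")


def resolve_column_mapping(mapping: Dict[str, str], row: Dict[str, Any]) -> Dict[str, Any]:
    """Map evaluator input names to values from one data row (illustration only)."""
    resolved = {}
    for input_name, placeholder in mapping.items():
        match = _PLACEHOLDER.match(placeholder)
        if match is None:
            raise ValueError(f"Unsupported placeholder: {placeholder}")
        # The captured group is the column name in the data row.
        resolved[input_name] = row[match.group(1)]
    return resolved


row = {"query": "What is Responsible AI?", "response": "A set of principles.", "context": None}
mapping = {"query": "${data.query}", "response": "${data.response}"}
evaluator_inputs = resolve_column_mapping(mapping, row)
print(evaluator_inputs)
```

A mapping that names a column missing from the row raises `KeyError` in this sketch, which is a useful reminder to check that the data file actually contains every column the mapping references.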

View the results

In [None]:
pprint(results)

In [None]:
pd.DataFrame(results["rows"])
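
Beyond viewing the raw rows, it can help to average each numeric score across rows. The sketch below does this with the standard library on hand-written sample rows. The column names (for example `outputs.coherence.coherence`) and the `outputs.` prefix are assumptions made for illustration, as is the helper `average_scores`; check the columns of your own `results["rows"]` before relying on them.

```python
from statistics import mean
from typing import Any, Dict, List


def average_scores(rows: List[Dict[str, Any]], prefix: str = "outputs.") -> Dict[str, float]:
    """Average every numeric column whose name starts with prefix (hypothetical helper)."""
    columns: Dict[str, List[float]] = {}
    for row in rows:
        for name, value in row.items():
            # bool is a subclass of int, so exclude it explicitly.
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if name.startswith(prefix) and is_number:
                columns.setdefault(name, []).append(value)
    return {name: mean(values) for name, values in columns.items()}


# Hand-written sample rows; real rows come from results["rows"].
sample_rows = [
    {"inputs.query": "q1", "outputs.coherence.coherence": 4, "outputs.fluency.fluency": 5},
    {"inputs.query": "q2", "outputs.coherence.coherence": 2, "outputs.fluency.fluency": 3},
]
averages = average_scores(sample_rows)
print(averages)
```

Non-numeric columns, such as the input text or any textual reasoning, are skipped rather than raising an error.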