# Prompt Flow SDK evaluation

https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/flow-evaluate-sdk

To thoroughly assess the performance of your generative AI application on a substantial dataset, you can run evaluations in your development environment with the prompt flow SDK. Given either a test dataset or a target, your application's generations are measured quantitatively with both mathematical metrics and AI-assisted quality and safety evaluators. Built-in or custom evaluators can give you comprehensive insight into the application's capabilities and limitations.

In this notebook, you will learn how to run evaluators on a single row of data and on a larger test dataset for an application target, using built-in evaluators from the prompt flow SDK, and then how to track the results and evaluation logs in Azure AI Studio.

## Getting started
Install the necessary packages:

In [2]:
!pip install promptflow-evals promptflow



Open the `.env` file and add the credentials from your AI Studio model deployment:
- `AZURE_OPENAI_ENDPOINT="<your model deployment endpoint>"`
- `AZURE_OPENAI_API_KEY="<your openai key>"`
- `AZURE_OPENAI_DEPLOYMENT="<deployment name>"`
- `AZURE_OPENAI_API_VERSION="<api version, for example 2023-03-15-preview>"`
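Before running any evaluator, it can help to confirm that these settings are actually visible to the process. The following is a minimal sketch, not part of the prompt flow SDK: `missing_env_vars` is a hypothetical helper that only uses the standard library and the variable names listed above.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
    "AZURE_OPENAI_API_VERSION",
)


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All Azure OpenAI settings are present.")
```

Call it after `load_dotenv()` so that values from the `.env` file are already loaded into the environment.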

## Built-in evaluators
Built-in evaluators support the following application scenarios:

- Question and answer: This scenario is designed for applications that involve sending in queries and generating answers.
- Chat: This scenario is suitable for applications where the model engages in conversation using a retrieval-augmented approach to extract information from your provided documents and generate detailed responses.

For more in-depth information on each evaluator's definition and how it's calculated, see the evaluator documentation.

| Category                | Evaluator class                                      |
|-------------------------|------------------------------------------------------|
| Performance and quality | GroundednessEvaluator, RelevanceEvaluator, CoherenceEvaluator, FluencyEvaluator, SimilarityEvaluator, F1ScoreEvaluator |
| Risk and safety         | ViolenceEvaluator, SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator |
| Composite               | QAEvaluator, ChatEvaluator, ContentSafetyEvaluator, ContentSafetyChatEvaluator |


Built-in composite evaluators are composed of individual evaluators:

- `QAEvaluator` combines all the quality evaluators into a single output of combined metrics for question and answer pairs.
- `ChatEvaluator` combines all the quality evaluators into a single output of combined metrics for chat messages that follow the OpenAI message protocol. In addition to all the quality evaluators, it includes support for the retrieval score, which isn't currently supported as a standalone evaluator class.
- `ContentSafetyEvaluator` combines all the safety evaluators into a single output of combined metrics for question and answer pairs.
- `ContentSafetyChatEvaluator` combines all the safety evaluators into a single output of combined metrics for chat messages that follow the OpenAI message protocol.
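
To illustrate the idea of composition, here is a toy sketch that does not use the SDK. The three classes are hypothetical stand-ins: two trivial evaluators and a composite that calls each one and merges the returned metric dictionaries. The real composite evaluators may be implemented differently.

```python
class AnswerLength:
    """Toy stand-in evaluator: length of the answer in characters."""

    def __call__(self, *, answer, **kwargs):
        return {"answer_length": len(answer)}


class HasQuestionMark:
    """Toy stand-in evaluator: 1 if the question ends with '?', else 0."""

    def __call__(self, *, question, **kwargs):
        return {"question_mark": int(question.strip().endswith("?"))}


class ToyCompositeEvaluator:
    """Hypothetical composite: runs each evaluator and merges the metric dicts."""

    def __init__(self, evaluators):
        self._evaluators = list(evaluators)

    def __call__(self, **inputs):
        combined = {}
        for evaluator in self._evaluators:
            # Each evaluator picks the inputs it needs and ignores the rest.
            combined.update(evaluator(**inputs))
        return combined


composite = ToyCompositeEvaluator([AnswerLength(), HasQuestionMark()])
scores = composite(
    question="Which tent is the most waterproof?",
    answer="The Alpine Explorer Tent.",
)
print(scores)
```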

### Required data input for built-in evaluators
We require question and answer pairs in .jsonl format with the required inputs, and column mapping for evaluating datasets, as follows:

| Evaluator                | question            | answer               | context              | ground_truth         |
|--------------------------|---------------------|----------------------|----------------------|----------------------|
| GroundednessEvaluator    | N/A                 | Required: String     | Required: String     | N/A                  |
| RelevanceEvaluator       | Required: String    | Required: String     | Required: String     | N/A                  |
| CoherenceEvaluator       | Required: String    | Required: String     | N/A                  | N/A                  |
| FluencyEvaluator         | Required: String    | Required: String     | N/A                  | N/A                  |
| SimilarityEvaluator      | Required: String    | Required: String     | N/A                  | Required: String     |
| F1ScoreEvaluator         | N/A                 | Required: String     | N/A                  | Required: String     |
| ViolenceEvaluator        | Required: String    | Required: String     | N/A                  | N/A                  |
| SexualEvaluator          | Required: String    | Required: String     | N/A                  | N/A                  |
| SelfHarmEvaluator        | Required: String    | Required: String     | N/A                  | N/A                  |
| HateUnfairnessEvaluator  | Required: String    | Required: String     | N/A                  | N/A                  |

- Question: the question sent to the generative AI application
- Answer: the response to the question generated by the generative AI application
- Context: the source that the response is generated with respect to (that is, grounding documents)
- Ground truth: the response to the question written by a user/human as the true answer
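
As a sketch of what such a dataset could look like, the snippet below serializes one invented sample row as JSON Lines and checks it against a subset of the requirements from the table. The helpers `to_jsonl` and `missing_columns` are hypothetical and use only the standard library.

```python
import json

# Required columns for a few evaluators, copied from the table above.
REQUIRED_COLUMNS = {
    "GroundednessEvaluator": {"answer", "context"},
    "RelevanceEvaluator": {"question", "answer", "context"},
    "SimilarityEvaluator": {"question", "answer", "ground_truth"},
    "F1ScoreEvaluator": {"answer", "ground_truth"},
}

# One invented sample row for illustration.
rows = [
    {
        "question": "Which tent is the most waterproof?",
        "answer": "The Alpine Explorer Tent is the most waterproof.",
        "context": "From our product list, the Alpine Explorer Tent is the most waterproof.",
        "ground_truth": "The Alpine Explorer Tent.",
    },
]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records) + "\n"


def missing_columns(record, evaluator_name):
    """Return required columns that are absent or not strings (hypothetical helper)."""
    required = REQUIRED_COLUMNS[evaluator_name]
    return sorted(col for col in required if not isinstance(record.get(col), str))


jsonl_text = to_jsonl(rows)
print(jsonl_text)
print(missing_columns(rows[0], "RelevanceEvaluator"))
```

Writing `jsonl_text` to a file such as `data.jsonl` gives a dataset in the expected shape.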

## Performance evaluation

You can run the built-in evaluators by importing the desired evaluator class. Ensure that you set your environment variables first.
The following example imports a single evaluator class and runs it on one row:

In [1]:
import os
from dotenv import load_dotenv
from promptflow.core import AzureOpenAIModelConfiguration
from promptflow.evals.evaluators import RelevanceEvaluator

# load .env variables
load_dotenv()

# Initialize Azure OpenAI Connection with your environment variables
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT"),
    #api_version=os.getenv("AZURE_OPENAI_API_VERSION")
)

# Initializing Relevance Evaluator
relevance_eval = RelevanceEvaluator(model_config)
# Running Relevance Evaluator on single input row
relevance_score = relevance_eval(
    answer="The Alpine Explorer Tent is the most waterproof.",
    context="From our product list,"
    " the alpine explorer tent is the most waterproof."
    " The Adventure Dining Table has higher weight.",
    question="Which tent is the most waterproof?",
)
print(relevance_score)

{'gpt_relevance': 5.0}


## Composite Evaluators
Composite evaluators are built-in evaluators that combine the individual quality or safety metrics, so you get a wide range of metrics out of the box.

The `ChatEvaluator` class provides quality metrics for evaluating chat messages. Because a conversation can contain several turns, it accepts an optional `eval_last_turn` flag to indicate that you only want to evaluate the last turn of the conversation.

In [2]:
from promptflow.evals.evaluators import ChatEvaluator
import pprint

chat_eval = ChatEvaluator(
    model_config=model_config,
    eval_last_turn=True,  # only evaluate the last turn of the conversation
)

conversation = [
    {"role": "user", "content": "What is the value of 2 + 2?"},
    {
        "role": "assistant",
        "content": "2 + 2 = 4",
        "context": {
            "citations": [
                {"id": "math_doc.md", "content": "Information about additions: 1 + 2 = 3, 2 + 2 = 4"}
            ]
        },
    },
]
# Run the composite evaluator and keep the returned metrics
chat_result = chat_eval(conversation=conversation)

pprint.pprint(chat_result)

{'evaluation_per_turn': {'gpt_coherence': {'score': [5.0]},
                         'gpt_fluency': {'score': [5.0]},
                         'gpt_groundedness': {'score': [5.0]},
                         'gpt_relevance': {'score': [5.0]},
                         'gpt_retrieval': {'score': [5.0]}},
 'gpt_coherence': np.float64(5.0),
 'gpt_fluency': np.float64(5.0),
 'gpt_groundedness': np.float64(5.0),
 'gpt_relevance': np.float64(5.0),
 'gpt_retrieval': np.float64(5.0)}
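
The output contains per-turn score lists under `evaluation_per_turn` and one top-level value per metric. One plausible reading, which is an assumption and not confirmed by the SDK, is that each top-level value is the mean of that metric's per-turn scores. The hypothetical helper below reproduces the numbers shown under that assumption; the SDK's actual aggregation may differ, for example in how it treats missing scores.

```python
from statistics import mean

# Per-turn scores in the shape printed above.
evaluation_per_turn = {
    "gpt_coherence": {"score": [5.0]},
    "gpt_fluency": {"score": [5.0]},
    "gpt_groundedness": {"score": [5.0]},
    "gpt_relevance": {"score": [5.0]},
    "gpt_retrieval": {"score": [5.0]},
}


def aggregate_per_turn(per_turn):
    """Average each metric's per-turn scores (assumed aggregation, hypothetical helper)."""
    return {metric: mean(detail["score"]) for metric, detail in per_turn.items()}


overall = aggregate_per_turn(evaluation_per_turn)
print(overall)
```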


## Custom evaluators
Built-in evaluators are a good starting point for evaluating your application's generations. However, you might want to build your own code-based or prompt-based evaluator to cater to your specific evaluation needs.

### Code-based evaluators
Sometimes a large language model isn't needed for certain evaluation metrics. In those cases, code-based evaluators give you the flexibility to define metrics as functions or callable classes. The following example uses a simple Python class, defined in `answer_length.py`, that calculates the length of an answer:

In [3]:
with open("answer_length.py") as fin:
    print(fin.read())
from answer_length import AnswerLengthEvaluator

ans_len = AnswerLengthEvaluator()
answer_length = ans_len(answer="What is the speed of light?")

print(answer_length)

class AnswerLengthEvaluator:
    def __init__(self):
        pass

    def __call__(self, *, answer: str, **kwargs):
        return {"answer_length": len(answer)}
{'answer_length': 27}
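
The text above mentions that a code-based evaluator can also be a plain function. As a minimal sketch, the hypothetical `exact_match` function below follows the same keyword-argument convention as `AnswerLengthEvaluator` and returns a dictionary of metrics. The metric name and the normalization rule are illustrative choices, not part of the SDK.

```python
def exact_match(*, answer: str, ground_truth: str, **kwargs):
    """Hypothetical function-based evaluator.

    Returns 1 when the answer equals the ground truth after lowercasing
    and collapsing whitespace, otherwise 0.
    """

    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return {"exact_match": int(normalize(answer) == normalize(ground_truth))}


print(exact_match(answer="Paris", ground_truth="  paris "))
print(exact_match(answer="Paris", ground_truth="Lyon"))
```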


### Prompt-based evaluators
To build your own prompt-based large language model evaluator, you can create a custom evaluator based on a Prompty file. Prompty is a file with the `.prompty` extension for developing prompt templates. The Prompty asset is a markdown file with a modified front matter. The front matter is in YAML format and contains metadata fields that define the model configuration and the expected inputs of the Prompty. The following example loads an `apology.prompty` file and prints its contents:

In [4]:
with open("apology.prompty") as fin:
    print(fin.read())
from promptflow.client import load_flow

# load apology evaluator from prompty file using promptflow
apology_eval = load_flow(source="apology.prompty", model={"configuration": model_config})
apology_score = apology_eval(
    question="What is the capital of France?", answer="Paris"
)
print()
print(apology_score)

---
name: Apology Evaluator
description: Apology Evaluator for QA scenario
model:
  api: chat
  configuration:
    type: azure_openai
    connection: open_ai_connection
    azure_deployment: gpt-4
  parameters:
    temperature: 0.2
    response_format: { "type": "text" }
inputs:
  question:
    type: string
  answer:
    type: string
outputs:
  apology:
    type: int
---
system:
You are an AI tool that determines if, in a chat conversation, the assistant apologized, for example by saying sorry.
Only provide a response of {"apology": 0} or {"apology": 1} so that the output is valid JSON.
Give an apology of 1 if the assistant apologized in the chat conversation.



{"apology": 0}
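
The printed result above appears to be a JSON string rather than a Python dictionary, which would be consistent with `response_format` being set to text. Assuming that is the case, a small hypothetical helper such as `parse_apology` can turn the response into an integer flag and reject anything unexpected.

```python
import json


def parse_apology(raw):
    """Parse a response like '{"apology": 0}' into 0 or 1 (hypothetical helper)."""
    # Accept either a JSON string or an already-parsed dictionary.
    data = raw if isinstance(raw, dict) else json.loads(raw)
    value = data["apology"]
    if value not in (0, 1):
        raise ValueError(f"Unexpected apology value: {value!r}")
    return int(value)


print(parse_apology('{"apology": 0}'))
```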


## Evaluate on test dataset using evaluate()
After you spot-check your built-in or custom evaluators on a single row of data, you can combine multiple evaluators with the `evaluate()` API on an entire test dataset. To ensure that `evaluate()` can correctly parse the data, you must specify a column mapping that maps the columns in the dataset to the keywords accepted by the evaluators. In this case, we specify the data mapping for `ground_truth`.
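
To make the `${data.<column>}` mapping syntax concrete, the sketch below resolves such references against a single dataset row. This is purely illustrative: `resolve_mapping` is a hypothetical helper, the dataset column name `truth` is invented, and the SDK's own resolution logic may support more reference forms.

```python
import re

# Matches references of the form ${data.column_name}.
_DATA_REF = re.compile(r"^\$\{data\.([A-Za-z_][A-Za-z0-9_]*)\}$")


def resolve_mapping(mapping, row):
    """Resolve '${data.<column>}' references against one row (illustrative only)."""
    resolved = {}
    for input_name, reference in mapping.items():
        match = _DATA_REF.match(reference)
        if match is None:
            raise ValueError(f"Unsupported reference: {reference!r}")
        resolved[input_name] = row[match.group(1)]
    return resolved


# The dataset column is called "truth", but the evaluator expects "ground_truth".
row = {"question": "What is the capital of France?", "truth": "Paris"}
mapping = {"ground_truth": "${data.truth}"}
print(resolve_mapping(mapping, row))
```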

In [5]:
from promptflow.evals.evaluate import evaluate
from answer_length import AnswerLengthEvaluator

ans_len = AnswerLengthEvaluator()

result = evaluate(
    data="data.jsonl", # provide your data here
    evaluators={
        "relevance": relevance_eval,
        #"answer_length": ans_len
    },
    # column mapping
    evaluator_config={
        "default": {
            "ground_truth": "${data.ground_truth}"
        }
    },
    # Optionally provide your AI Studio project information to track your evaluation results in your Azure AI Studio project
    # azure_ai_project=azure_ai_project,

    # Optionally provide an output path to dump a JSON file with the metric summary, row-level data, and studio URL
    output_path="./myevalresults.json",
)

[2024-07-29 12:30:58 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run promptflow_evals_evaluators_relevance_relevance_relevanceevaluator_tcjzrqz7_20240729_123057_996955, log path: /home/codespace/.promptflow/.runs/promptflow_evals_evaluators_relevance_relevance_relevanceevaluator_tcjzrqz7_20240729_123057_996955/logs.txt


Prompt flow service has started...
You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=promptflow_evals_evaluators_relevance_relevance_relevanceevaluator_tcjzrqz7_20240729_123057_996955
2024-07-29 12:31:01 +0000   61364 execution.bulk     INFO     Process 61395 terminated.
2024-07-29 12:31:02 +0000   61364 execution.bulk     INFO     Process 61396 terminated.


[2024-07-29 12:31:03 +0000][promptflow.evals.evaluate._utils][ERROR] - Unable to log traces as trace destination was not defined.


2024-07-29 12:30:58 +0000   60155 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2024-07-29 12:30:58 +0000   60155 execution.bulk     INFO     Set process count to 3 by taking the minimum value among the factors of {'default_worker_count': 4, 'row_count': 3}.
2024-07-29 12:31:00 +0000   60155 execution.bulk     INFO     Process name(ForkProcess-2:1)-Process id(61389)-Line number(0) start execution.
2024-07-29 12:31:00 +0000   60155 execution.bulk     INFO     Process name(ForkProcess-2:3)-Process id(61396)-Line number(2) start execution.
2024-07-29 12:31:00 +0000   60155 execution.bulk     INFO     Process name(ForkProcess-2:2)-Process id(61395)-Line number(1) start execution.
2024-07-29 12:31:00 +0000   60155 execution.bulk     INFO     Process name(ForkProcess-2:3)-Process id(61396)-Line number(2) completed.
2024-07-29 12:31:01 +0000   60155 execution.bulk     INFO     Process name(ForkProcess-2:2)-Process id(61395)-Lin

In [6]:
# check the results
pprint.pprint(result)

{'metrics': {'relevance.gpt_relevance': 5.0},
 'rows': [{'inputs.answer': 'Paris is the capital of France.',
           'inputs.context': 'France is in Europe',
           'inputs.ground_truth': 'Paris has been the capital of France since '
                                  'the 10th century and is known for its '
                                  'cultural and historical landmarks.',
           'inputs.question': 'What is the capital of France?',
           'outputs.relevance.gpt_relevance': 5},
          {'inputs.answer': 'Albert Einstein developed the theory of '
                            'relativity.',
           'inputs.context': 'The theory of relativity is a foundational '
                             'concept in modern physics.',
           'inputs.ground_truth': 'Albert Einstein developed the theory of '
                                  'relativity, with his special relativity '
                                  'published in 1905 and general relativity in '
               

In [31]:
# Don't run this cell as-is: it needs Azure authentication tooling (pwsh, az cli, azd, ...)

azure_ai_project = {
    "subscription_id": "cb72b49c-6479-4ec7-891f-f18be9ee10f2",
    "resource_group_name": "ai-bootcamp",
    "project_name": "josealonso-5048",
}

from promptflow.evals.evaluators import ViolenceEvaluator

# Initializing Violence Evaluator with project information
violence_eval = ViolenceEvaluator(azure_ai_project)
# Running Violence Evaluator on single input row
violence_score = violence_eval(question="What is the capital of France?", answer="Paris.")
print(violence_score)

[2024-07-29 12:28:25 +0000][flowinvoker][INFO] - Getting connections from pf client with provider from args: local...
[2024-07-29 12:28:25 +0000][flowinvoker][INFO] - Promptflow get connections successfully. keys: dict_keys([])
[2024-07-29 12:28:25 +0000][flowinvoker][INFO] - Promptflow executor starts initializing...
[2024-07-29 12:28:25 +0000][flowinvoker][INFO] - Promptflow executor initiated successfully.
[2024-07-29 12:28:25 +0000][flowinvoker][INFO] - Validating flow input with data {'metric_name': 'violence', 'question': 'What is the capital of France?', 'answer': 'Paris.', 'project_scope': {'subscription_id': 'cb72b49c-6479-4ec7-891f-f18be9ee10f2', 'resource_group_name': 'ai-bootcamp', 'project_name': 'josealonso-5048'}, 'credential': None}
[2024-07-29 12:28:25 +0000][flowinvoker][INFO] - Execute flow with data {'metric_name': 'violence', 'question': 'What is the capital of France?', 'answer': 'Paris.', 'project_scope': {'subscription_id': 'cb72b49c-6479-4ec7-891f-f18be9ee10f2'

2024-07-29 12:28:25 +0000   37577 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-07-29 12:28:25 +0000   37577 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-07-29 12:28:25 +0000   37577 execution.flow     INFO     Executing node validate_inputs. node run id: a3484049-3058-446b-88f0-30342082f7d7_validate_inputs_665f5656-67f5-4709-bac9-ff2655383b11
2024-07-29 12:28:25 +0000   37577 execution.flow     INFO     Node validate_inputs completes.
2024-07-29 12:28:25 +0000   37577 execution.flow     INFO     The node 'evaluate_with_rai_service' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-07-29 12:28:25 +0000   37577 execution.flow     INFO     Executing node evaluate_with_rai_service. node run id: a3484049-3058-446b-88f0-30342082f7d7_evaluate_with_rai_service_c94e83cd-c09e-4b96-b85c-bcc5bbcfede1


2024-07-29 12:28:25 +0000   37577 execution          ERROR    Node evaluate_with_rai_service in line None failed. Exception: Execution failure in 'evaluate_with_rai_service': (ClientAuthenticationError) DefaultAzureCredential failed to retrieve a token from the included credentials.
Attempted credentials:
	EnvironmentCredential: EnvironmentCredential authentication unavailable. Environment variables are not fully configured.
Visit https://aka.ms/azsdk/python/identity/environmentcredential/troubleshoot to troubleshoot this issue.
	ManagedIdentityCredential: ManagedIdentityCredential authentication unavailable. The requested identity has not been assigned to this resource. Error: Unexpected response "{'error': 'invalid_request', 'error_description': 'Identity not found'}"
	SharedTokenCacheCredential: SharedTokenCacheCredential authentication unavailable. No accounts were found in the cache.
	AzureCliCredential: Azure CLI not found on path
	AzurePowerShellCredential: PowerShell is not ins

ToolExecutionError: Execution failure in 'evaluate_with_rai_service': (ClientAuthenticationError) DefaultAzureCredential failed to retrieve a token from the included credentials.
Attempted credentials:
	EnvironmentCredential: EnvironmentCredential authentication unavailable. Environment variables are not fully configured.
Visit https://aka.ms/azsdk/python/identity/environmentcredential/troubleshoot to troubleshoot this issue.
	ManagedIdentityCredential: ManagedIdentityCredential authentication unavailable. The requested identity has not been assigned to this resource. Error: Unexpected response "{'error': 'invalid_request', 'error_description': 'Identity not found'}"
	SharedTokenCacheCredential: SharedTokenCacheCredential authentication unavailable. No accounts were found in the cache.
	AzureCliCredential: Azure CLI not found on path
	AzurePowerShellCredential: PowerShell is not installed
	AzureDeveloperCliCredential: Please run 'azd auth login' from a command prompt to authenticate before using this credential.
To mitigate this issue, please refer to the troubleshooting guidelines here at https://aka.ms/azsdk/python/identity/defaultazurecredential/troubleshoot.

## Jailbreak: Blue vs Red team
Evaluating jailbreak is a comparative measurement, not an AI-assisted metric. Run `ContentSafetyEvaluator` or `ContentSafetyChatEvaluator` on two different red-teamed datasets: a baseline adversarial test dataset and the same adversarial test dataset with jailbreak injections in the first turn. You can generate both datasets with the [adversarial simulator](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/simulator-interaction-data). You can then evaluate jailbreak vulnerability by comparing the two datasets' aggregate scores for each safety evaluator.
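
The comparison itself can be sketched in a few lines. The metric names and numbers below are invented for illustration, and `jailbreak_deltas` is a hypothetical helper. In practice you would pass in the aggregate metrics from two evaluation runs, one per dataset.

```python
def jailbreak_deltas(baseline_metrics, jailbreak_metrics, ndigits=4):
    """Per-metric difference (jailbreak minus baseline) for metrics in both runs."""
    shared = sorted(set(baseline_metrics) & set(jailbreak_metrics))
    return {
        name: round(jailbreak_metrics[name] - baseline_metrics[name], ndigits)
        for name in shared
    }


# Invented aggregate scores for the two runs (hypothetical metric names).
baseline = {"violence": 0.02, "self_harm": 0.00}
with_jailbreak = {"violence": 0.10, "self_harm": 0.01}

deltas = jailbreak_deltas(baseline, with_jailbreak)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

A clearly positive delta for a metric suggests that the jailbreak injections made the application more likely to produce that kind of harmful content.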