# Install Dependencies

First, install the evaluators package from the prompt flow SDK.

In [1]:
%pip install promptflow-evals
%pip install --upgrade promptflow

Note: you may need to restart the kernel to use updated packages.


# Built-in evaluators

Built-in evaluators support the following application scenarios:

- Question and answer: This scenario is designed for applications that involve sending in queries and generating answers.
- Chat: This scenario is suitable for applications where the model engages in conversation using a retrieval-augmented approach to extract information from your provided documents and generate detailed responses.


| Category | Evaluator class |
|----------|-----------------|
| Performance and quality | `GroundednessEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `SimilarityEvaluator`, `F1ScoreEvaluator` |
| Risk and safety | `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator` |
| Composite | `QAEvaluator`, `ChatEvaluator`, `ContentSafetyEvaluator`, `ContentSafetyChatEvaluator` |

Both categories of built-in quality and safety metrics take in question and answer pairs, along with additional information for specific evaluators.

Built-in composite evaluators are composed of individual evaluators.

- `QAEvaluator` combines all the quality evaluators into a single output of combined metrics for question and answer pairs.
- `ChatEvaluator` combines all the quality evaluators into a single output of combined metrics for chat messages that follow the OpenAI messages protocol. In addition to all the quality evaluators, it includes support for a retrieval score. Retrieval score isn't currently supported as a standalone evaluator class.
- `ContentSafetyEvaluator` combines all the safety evaluators into a single output of combined metrics for question and answer pairs.
- `ContentSafetyChatEvaluator` combines all the safety evaluators into a single output of combined metrics for chat messages that follow the OpenAI messages protocol.


# Required data input for built-in evaluators
We require question and answer pairs in .jsonl format with the required inputs, and column mapping for evaluating datasets, as follows:

| Evaluator | question | answer | context | ground_truth |
|-----------|----------|--------|---------|------------- |
| `GroundednessEvaluator` | N/A | Required: String | Required: String | N/A |
| `RelevanceEvaluator` | Required: String | Required: String | Required: String | N/A |
| `CoherenceEvaluator` | Required: String | Required: String | N/A | N/A |
| `FluencyEvaluator` | Required: String | Required: String | N/A | N/A |
| `SimilarityEvaluator` | Required: String | Required: String | N/A | Required: String |
| `F1ScoreEvaluator` | N/A | Required: String | N/A | Required: String |
| `ViolenceEvaluator` | Required: String | Required: String | N/A | N/A |
| `SexualEvaluator` | Required: String | Required: String | N/A | N/A |
| `SelfHarmEvaluator` | Required: String | Required: String | N/A | N/A |
| `HateUnfairnessEvaluator` | Required: String | Required: String | N/A | N/A |

- Question: the question sent in to the generative AI application
- Answer: the response to question generated by the generative AI application
- Context: the source that response is generated with respect to (that is, grounding documents)
- Ground truth: the response to question generated by user/human as the true answer
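As a quick pre-flight check, the table above can be turned into a lookup that reports which required fields a row is missing. This is a minimal sketch: `REQUIRED_INPUTS` and `missing_inputs` are hypothetical helper names for illustration and are not part of the prompt flow SDK.

```python
# Required input fields per evaluator, transcribed from the table above.
REQUIRED_INPUTS = {
    "GroundednessEvaluator": {"answer", "context"},
    "RelevanceEvaluator": {"question", "answer", "context"},
    "CoherenceEvaluator": {"question", "answer"},
    "FluencyEvaluator": {"question", "answer"},
    "SimilarityEvaluator": {"question", "answer", "ground_truth"},
    "F1ScoreEvaluator": {"answer", "ground_truth"},
    "ViolenceEvaluator": {"question", "answer"},
    "SexualEvaluator": {"question", "answer"},
    "SelfHarmEvaluator": {"question", "answer"},
    "HateUnfairnessEvaluator": {"question", "answer"},
}


def missing_inputs(evaluator_name: str, row: dict) -> set:
    """Return the required fields that are absent or not strings in `row`."""
    required = REQUIRED_INPUTS[evaluator_name]
    return {field for field in required if not isinstance(row.get(field), str)}


row = {
    "question": "What is the capital of France?",
    "answer": "Paris is the capital of France.",
}
print(missing_inputs("CoherenceEvaluator", row))  # set()
print(missing_inputs("RelevanceEvaluator", row))  # {'context'}
```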

# Performance and quality evaluators
When using AI-assisted performance and quality metrics, you must specify a GPT model for the calculation process. Choose a deployment with either GPT-3.5, GPT-4, or the Davinci model for your calculations and set it as your `model_config`.

> Note: 
We recommend using GPT models that do not have the (preview) suffix for the best performance and parseable responses with our evaluators.

You can run the built-in evaluators by importing the desired evaluator class. Ensure that you set your environment variables.

In [1]:
import os
from dotenv import load_dotenv
from promptflow.core import AzureOpenAIModelConfiguration

load_dotenv("credentials.env")
print(os.getenv("AZURE_OPENAI_ENDPOINT"))

# Initialize Azure OpenAI Connection with your environment variables
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
)
from promptflow.evals.evaluators import RelevanceEvaluator

# Initializing Relevance Evaluator
relevance_eval = RelevanceEvaluator(model_config)
# Running Relevance Evaluator on single input row
relevance_score = relevance_eval(
    answer="The Alpine Explorer Tent is the most waterproof.",
    context="From our product list,"
    " the alpine explorer tent is the most waterproof."
    " The Adventure Dining Table has higher weight.",
    question="Which tent is the most waterproof?",
)
print(relevance_score)

https://gpt4o-ilc.openai.azure.com/
{'gpt_relevance': 5.0}
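The cell above reads four environment variables. If any of them is unset, `os.getenv` returns `None` and the failure only shows up later, when the evaluator calls the model. A minimal sketch of an early check follows; `missing_env_vars` is a hypothetical helper, and only the variable names come from the cell above.

```python
import os

# Variable names taken from the cell above.
REQUIRED_VARS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
    "AZURE_OPENAI_API_VERSION",
)


def missing_env_vars(environ=os.environ) -> list:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All Azure OpenAI settings are present.")
```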


# Risk and safety evaluators
When you use AI-assisted risk and safety metrics, a GPT model isn't required. Instead of `model_config`, provide your `azure_ai_project` information (named `project_scope` in the cells below). This accesses the Azure AI Studio safety evaluations back-end service, which provisions a GPT-4 model that can generate content risk severity scores and reasoning to enable your safety evaluators. To use this service, you must be logged in to Azure in your current IDE, for example VS Code.

> Note: 
Currently, AI-assisted risk and safety metrics are only available in the following regions: East US 2, France Central, UK South, and Sweden Central. Groundedness measurement that uses Azure AI Content Safety Groundedness Detection is only supported in the following regions: East US 2 and Sweden Central. See the Azure AI Studio documentation for the supported metrics and when to use each one.

In [None]:
# %pip install azure-identity

# You may need to log in to your tenant through the Azure CLI to authenticate

In [2]:
from azure.core.credentials import TokenCredential
from azure.identity import DefaultAzureCredential

In [5]:
project_scope = {
    "subscription_id": "13e90e3a-50db-43f8-86e5-42bca5b14ebd",
    "resource_group_name": "rg-aistudio-35hub",
    "project_name": "sk-project",
}


from promptflow.evals.evaluators import ViolenceEvaluator
from azure.identity import DefaultAzureCredential

# Initialize the credential variable
credential = DefaultAzureCredential()

# Initializing Violence Evaluator with project information
violence_eval = ViolenceEvaluator(project_scope)

# Running Violence Evaluator on single input row
violence_score = violence_eval(question="What is the capital of France?", answer="Paris.")
print(violence_score)



[2024-08-01 09:34:24 +0100][flowinvoker][INFO] - Validating flow input with data {'metric_name': 'violence', 'question': 'What is the capital of France?', 'answer': 'Paris.', 'project_scope': {'subscription_id': '13e90e3a-50db-43f8-86e5-42bca5b14ebd', 'resource_group_name': 'rg-aistudio-35hub', 'project_name': 'sk-project'}, 'credential': None}
[2024-08-01 09:34:24 +0100][flowinvoker][INFO] - Execute flow with data {'metric_name': 'violence', 'question': 'What is the capital of France?', 'answer': 'Paris.', 'project_scope': {'subscription_id': '13e90e3a-50db-43f8-86e5-42bca5b14ebd', 'resource_group_name': 'rg-aistudio-35hub', 'project_name': 'sk-project'}, 'credential': None}


2024-08-01 09:34:24 +0100   31196 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-01 09:34:24 +0100   31196 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-01 09:34:24 +0100   31196 execution.flow     INFO     Executing node validate_inputs. node run id: 624b2d13-c935-47eb-b7f8-cfc572256c15_validate_inputs_434627c0-2c0c-4b94-8d83-a818ac71be45
2024-08-01 09:34:24 +0100   31196 execution.flow     INFO     Node validate_inputs completes.
2024-08-01 09:34:24 +0100   31196 execution.flow     INFO     The node 'evaluate_with_rai_service' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-01 09:34:24 +0100   31196 execution.flow     INFO     Executing node evaluate_with_rai_service. node run id: 624b2d13-c935-47eb-b7f8-cfc572256c15_evaluate_with_rai_service_e82ea221-0cf2-4d51-8881-749001e276c8
2024-08-01 09:34:42 +0100   31196 execution.flow     INFO     Node ev

The result of the safety evaluators is a dictionary containing:

- `{metric_name}` provides a severity label for that content risk: Very low, Low, Medium, or High.
- `{metric_name}_score` provides a severity level between 0 and 7 that maps to the severity label given in `{metric_name}`.
- `{metric_name}_reason` provides text explaining why a certain severity score was given for each data point.
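To make the three keys concrete, here is a minimal sketch of reading them from a result. The `sample_result` values are hypothetical placeholders, not real service output. The score-to-label mapping in `severity_label` (two consecutive scores per label) is an assumption for illustration; check the service documentation for the authoritative ranges.

```python
def severity_label(score: int) -> str:
    """Map a 0-7 severity score to a label (assumed: two scores per label)."""
    if not 0 <= score <= 7:
        raise ValueError(f"score must be between 0 and 7, got {score}")
    return ("Very low", "Low", "Medium", "High")[score // 2]


# Hypothetical result, shaped like the dictionary described above.
sample_result = {
    "violence": "Very low",
    "violence_score": 0,
    "violence_reason": "Illustrative placeholder text, not real service output.",
}

metric_name = "violence"
label = sample_result[metric_name]
score = sample_result[f"{metric_name}_score"]
reason = sample_result[f"{metric_name}_reason"]
print(f"{metric_name}: {label} (score {score}). Reason: {reason}")
```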

## Evaluating jailbreak vulnerability
Evaluating jailbreak is a comparative measurement, not an AI-assisted metric. Run `ContentSafetyEvaluator` or `ContentSafetyChatEvaluator` on two different red-teamed datasets: a baseline adversarial test dataset, and the same adversarial test dataset with jailbreak injections in the first turn. You can do this with functionality and attack datasets generated with the adversarial simulator. You can then evaluate jailbreak vulnerability by comparing the two datasets' aggregate scores for each safety evaluator.
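The comparison itself is plain arithmetic over the two sets of aggregate scores. The sketch below assumes each run yields a flat dictionary of aggregate metrics; the metric names and numbers are made up for illustration and `jailbreak_deltas` is a hypothetical helper.

```python
# Hypothetical aggregate metrics from two content safety runs. The metric
# names and values are made up for illustration.
baseline_metrics = {"violence_defect_rate": 0.02, "self_harm_defect_rate": 0.00}
jailbreak_metrics = {"violence_defect_rate": 0.10, "self_harm_defect_rate": 0.01}


def jailbreak_deltas(baseline: dict, jailbreak: dict) -> dict:
    """Return jailbreak minus baseline for every metric present in both runs."""
    shared = baseline.keys() & jailbreak.keys()
    return {name: jailbreak[name] - baseline[name] for name in sorted(shared)}


for name, delta in jailbreak_deltas(baseline_metrics, jailbreak_metrics).items():
    print(f"{name}: {delta:+.2f}")
```

A positive difference means the jailbreak-injected dataset scored worse than the baseline on that metric.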

# Composite evaluators
Composite evaluators are built-in evaluators that combine the individual quality or safety metrics to provide a wide range of metrics out of the box.

The `ChatEvaluator` class provides quality metrics for evaluating chat messages. Because a conversation can span several turns, it has an optional flag, `eval_last_turn`, to indicate that you only want to evaluate the last turn of a conversation.

In [6]:
from promptflow.evals.evaluators import ChatEvaluator

chat_evaluator = ChatEvaluator(
    model_config=model_config,
    eval_last_turn=True,
)

conversation = [
    {"role": "user", "content": "What is the value of 2 + 2?"},
    {
        "role": "assistant",
        "content": "2 + 2 = 4",
        "context": {
            "citations": [
                {"id": "math_doc.md", "content": "Information about additions: 1 + 2 = 3, 2 + 2 = 4"}
            ]
        },
    },
]
result = chat_evaluator(conversation=conversation)

print(result)

  aggregated[metric] = np.nanmean(values)


{'evaluation_per_turn': {'gpt_fluency': {'score': [nan]}, 'gpt_coherence': {'score': [1.0]}, 'gpt_groundedness': {'score': [5.0]}, 'gpt_relevance': {'score': [5.0]}, 'gpt_retrieval': {'score': [5.0]}}, 'gpt_coherence': np.float64(1.0), 'gpt_fluency': np.float64(nan), 'gpt_groundedness': np.float64(5.0), 'gpt_relevance': np.float64(5.0), 'gpt_retrieval': np.float64(5.0)}
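The output above contains `nan` for `gpt_fluency`, so a script that consumes these results may want to detect such entries before comparing or averaging them. A minimal sketch, using plain floats in place of `np.float64` so that it runs without NumPy:

```python
import math

# Aggregate scores copied from the output above, with plain floats in place
# of np.float64 so the snippet has no NumPy dependency.
aggregates = {
    "gpt_coherence": 1.0,
    "gpt_fluency": float("nan"),
    "gpt_groundedness": 5.0,
    "gpt_relevance": 5.0,
    "gpt_retrieval": 5.0,
}

# NaN never compares equal to anything, so test for it with math.isnan.
unscored = [name for name, value in aggregates.items() if math.isnan(value)]
scored = {name: value for name, value in aggregates.items() if not math.isnan(value)}

print("Metrics without a usable score:", unscored)
print("Metrics with a score:", scored)
```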


# Custom evaluators
Built-in evaluators are a good starting point for evaluating your application's generations. However, you might want to build your own code-based or prompt-based evaluator to cater to your specific evaluation needs.

## Code-based evaluators
Sometimes a large language model isn't needed for certain evaluation metrics. In these cases, code-based evaluators give you the flexibility to define metrics based on functions or callable classes. The following example uses a simple Python class in `answer_length.py` that calculates the length of an answer:

In [7]:
with open("answer_length.py", mode='r') as fin:
    print(fin.read())
from answer_length import AnswerLengthEvaluator

answer_length = AnswerLengthEvaluator(answer="What is the speed of light?")

print(answer_length)

# answer_length.py

class AnswerLengthEvaluator:
    def __init__(self, answer):
        self.answer = answer

    def __str__(self):
        return f"The length of the answer '{self.answer}' is {len(self.answer)} characters."
The length of the answer 'What is the speed of light?' is 27 characters.
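The class above reports its result through `__str__`, so it can't be called on a row of data the way the built-in evaluators are. A callable variant might look like the sketch below. `AnswerLengthCallable` is a hypothetical name, and returning a dictionary with a `value` key is an assumption chosen to match the `answer_length.value` metric that appears in the sample `evaluate()` output later in this notebook.

```python
class AnswerLengthCallable:
    """Hypothetical callable evaluator: returns the answer length as a metric."""

    def __call__(self, *, answer: str, **kwargs) -> dict:
        # Extra keyword arguments passed by a caller are accepted and ignored.
        return {"value": len(answer)}


answer_length_callable = AnswerLengthCallable()
print(answer_length_callable(answer="Paris is the capital of France."))
# {'value': 31}
```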


## Prompt-based evaluators
To build your own prompt-based large language model evaluator, you can create a custom evaluator based on a Prompty file. Prompty is a file with the `.prompty` extension for developing prompt templates. The Prompty asset is a markdown file with a modified front matter. The front matter is in YAML format and contains metadata fields that define the model configuration and the expected inputs of the Prompty. Use the example `apology.prompty` file.

You can create your own prompty-based evaluator and run it on a row of data:

In [8]:
with open("apology.prompty") as fin:
    print(fin.read())
from promptflow.client import load_flow

# load apology evaluator from prompty file using promptflow
apology_eval = load_flow(source="apology.prompty", model={"configuration": model_config})
apology_score = apology_eval(
    question="What is the capital of France?", answer="Paris"
)
print(apology_score)

---
name: Apology Evaluator
description: Apology Evaluator for QA scenario
model:
  api: chat
  configuration:
    type: azure_openai
    connection: open_ai_connection
    azure_deployment: gpt-4
  parameters:
    temperature: 0.2
    response_format: { "type": "text" }
inputs:
  question:
    type: string
  answer:
    type: string
outputs:
  apology:
    type: int
---
system:
You are an AI tool that determines if, in a chat conversation, the assistant apologized, like say sorry.
Only provide a response of {"apology": 0} or {"apology": 1} so that the output is valid JSON.
Give a apology of 1 if apologized in the chat conversation.
{"apology": 0}


# Evaluate on test dataset using `evaluate()`
After you spot-check your built-in or custom evaluators on a single row of data, you can combine multiple evaluators with the `evaluate()` API on an entire test dataset. To ensure that `evaluate()` can correctly parse the data, you must specify a column mapping that maps columns from the dataset to the keywords accepted by the evaluators. In this case, we specify the data mapping for `ground_truth`.
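To illustrate what a column mapping expresses, the toy resolver below turns one dataset row into evaluator keyword arguments. It is not the SDK's implementation: `apply_column_mapping` is a hypothetical function, it handles only `${data.<column>}` references, and passing through columns whose names already match is an assumption of this sketch.

```python
import re

# Matches references of the form ${data.<column>}.
_DATA_REF = re.compile(r"^\$\{data\.(\w+)\}$")


def apply_column_mapping(row: dict, mapping: dict) -> dict:
    """Build evaluator keyword arguments from one dataset row (toy resolver)."""
    resolved = dict(row)  # columns whose names already match pass through
    for evaluator_key, reference in mapping.items():
        match = _DATA_REF.match(reference)
        if match is None:
            raise ValueError(f"unsupported reference: {reference}")
        resolved[evaluator_key] = row[match.group(1)]
    return resolved


row = {"question": "What is the capital of France?", "truth": "Paris"}
print(apply_column_mapping(row, {"ground_truth": "${data.truth}"}))
```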

In [9]:
%pip install promptflow-azure



In [10]:
%pip install promptflow

Note: you may need to restart the kernel to use updated packages.





In [11]:
from promptflow.evals.evaluate import evaluate

result = evaluate(
    data="data.jsonl",  # provide your data here
    evaluators={
        "relevance": relevance_eval,
    },
    # Column mapping: read `ground_truth` from the dataset column named `truth`
    evaluator_config={
        "default": {
            "ground_truth": "${data.truth}"
        }
    },
    # Optionally provide your AI Studio project information to track your evaluation results in your Azure AI Studio project
    # azure_ai_project=project_scope,
    # Optionally provide an output path to dump a JSON file of the metric summary, row-level data and metrics, and the studio URL
    # output_path="./myevalresults.json",
)

Starting prompt flow service...


[2024-08-01 09:49:12 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run promptflow_evals_evaluators_relevance_relevance_relevanceevaluator_ylt5kw_a_20240801_094900_130072, log path: C:\Users\iancurtis\.promptflow\.runs\promptflow_evals_evaluators_relevance_relevance_relevanceevaluator_ylt5kw_a_20240801_094900_130072\logs.txt


You can stop the prompt flow service with the following command: 'pf service stop'.

You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=promptflow_evals_evaluators_relevance_relevance_relevanceevaluator_ylt5kw_a_20240801_094900_130072


[2024-08-01 09:49:22 +0100][promptflow.evals.evaluate._utils][ERROR] - Unable to log traces as trace destination was not defined.


2024-08-01 09:49:12 +0100   31196 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2024-08-01 09:49:12 +0100   31196 execution.bulk     INFO     Current system's available memory is 14735.01953125MB, memory consumption of current process is 209.66796875MB, estimated available worker count is 14735.01953125/209.66796875 = 70
2024-08-01 09:49:12 +0100   31196 execution.bulk     INFO     Set process count to 3 by taking the minimum value among the factors of {'default_worker_count': 4, 'row_count': 3, 'estimated_worker_count_based_on_memory_usage': 70}.
2024-08-01 09:49:16 +0100   31196 execution.bulk     INFO     Process name(SpawnProcess-2)-Process id(8812)-Line number(0) start execution.
2024-08-01 09:49:16 +0100   31196 execution.bulk     INFO     Process name(SpawnProcess-3)-Process id(23352)-Line number(1) start execution.
2024-08-01 09:49:16 +0100   31196 execution.bulk     INFO     Process name(SpawnProcess-4)-Process 

> Tip: Get the contents of the `result.studio_url` property for a link to view your logged evaluation results in Azure AI Studio. The evaluator outputs results in a dictionary which contains aggregate `metrics` and row-level data and metrics. An example of an output:

In [12]:
{'metrics': {'answer_length.value': 49.333333333333336,
             'relevance.gpt_relevance': 5.0},
 'rows': [{'inputs.answer': 'Paris is the capital of France.',
           'inputs.context': 'France is in Europe',
           'inputs.ground_truth': 'Paris has been the capital of France since '
                                  'the 10th century and is known for its '
                                  'cultural and historical landmarks.',
           'inputs.question': 'What is the capital of France?',
           'outputs.answer_length.value': 31,
           'outputs.relevance.gpt_relevance': 5},
          {'inputs.answer': 'Albert Einstein developed the theory of '
                            'relativity.',
           'inputs.context': 'The theory of relativity is a foundational '
                             'concept in modern physics.',
           'inputs.ground_truth': 'Albert Einstein developed the theory of '
                                  'relativity, with his special relativity '
                                  'published in 1905 and general relativity in '
                                  '1915.',
           'inputs.question': 'Who developed the theory of relativity?',
           'outputs.answer_length.value': 51,
           'outputs.relevance.gpt_relevance': 5},
          {'inputs.answer': 'The speed of light is approximately 299,792,458 '
                            'meters per second.',
           'inputs.context': 'Light travels at a constant speed in a vacuum.',
           'inputs.ground_truth': 'The exact speed of light in a vacuum is '
                                  '299,792,458 meters per second, a constant '
                                  "used in physics to represent 'c'.",
           'inputs.question': 'What is the speed of light?',
           'outputs.answer_length.value': 66,
           'outputs.relevance.gpt_relevance': 5}],
 'traces': {}}
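The sample output above can be navigated with ordinary dictionary operations. The sketch below uses a trimmed copy of that output (two rows, fewer columns) rather than a live `result`, so it runs on its own.

```python
# Trimmed copy of the sample output above (two rows, fewer columns).
sample = {
    "metrics": {"answer_length.value": 49.333333333333336,
                "relevance.gpt_relevance": 5.0},
    "rows": [
        {"inputs.question": "What is the capital of France?",
         "outputs.answer_length.value": 31,
         "outputs.relevance.gpt_relevance": 5},
        {"inputs.question": "Who developed the theory of relativity?",
         "outputs.answer_length.value": 51,
         "outputs.relevance.gpt_relevance": 5},
    ],
}

# Aggregate metrics are keyed as "<evaluator name>.<metric name>".
for name, value in sample["metrics"].items():
    print(f"{name}: {value:.2f}")

# Row-level results prefix dataset columns with "inputs." and
# evaluator outputs with "outputs.".
for row in sample["rows"]:
    outputs = {k: v for k, v in row.items() if k.startswith("outputs.")}
    print(row["inputs.question"], "->", outputs)
```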


## Supported data formats for `evaluate()`
The `evaluate()` API only accepts data in the JSONLines format. For all built-in evaluators, except for `ChatEvaluator` or `ContentSafetyChatEvaluator`, `evaluate()` requires data in the following format with required input fields. See the previous section on required data input for built-in evaluators.

``` json
{
  "question":"What is the capital of France?",
  "context":"France is in Europe",
  "answer":"Paris is the capital of France.",
  "ground_truth": "Paris"
}
```
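The record above is pretty-printed for readability. In a JSON Lines file, each record occupies exactly one line. A minimal sketch of writing and re-reading such a file with the standard library follows; it writes to a temporary directory so that it can't overwrite a real `data.jsonl`.

```python
import json
import os
import tempfile

rows = [
    {
        "question": "What is the capital of France?",
        "context": "France is in Europe",
        "answer": "Paris is the capital of France.",
        "ground_truth": "Paris",
    },
]

# A temporary directory keeps the sketch from overwriting a real data.jsonl.
path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
with open(path, "w", encoding="utf-8") as fout:
    for row in rows:
        # json.dumps emits no newlines by default, so each record is one line.
        fout.write(json.dumps(row) + "\n")

# Read the file back, one JSON object per non-empty line.
with open(path, encoding="utf-8") as fin:
    loaded = [json.loads(line) for line in fin if line.strip()]
print(loaded == rows)  # True
```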

For the composite evaluator classes `ChatEvaluator` and `ContentSafetyChatEvaluator`, we require an array of messages that adheres to OpenAI's messages protocol. The messages protocol contains a role-based list of messages with the following:

- `content`: The content of that turn of the interaction between user and application or assistant.
- `role`: Either the user or application/assistant.
- `"citations"` (within `"context"`): Provides the documents and their IDs as key-value pairs from the retrieval-augmented generation model.

| Evaluator class | Citations from retrieved documents |
|-----------------|------------------------------------|
| `GroundednessEvaluator` | Required: String |
| `RelevanceEvaluator` | Required: String |
| `CoherenceEvaluator` | N/A |
| `FluencyEvaluator` | N/A |

Citations: the relevant sources, either documents retrieved by the retrieval model or user-provided context, that the model's answer is generated with respect to.

```json
{
    "messages": [
        {
            "content": "<conversation_turn_content>", 
            "role": "<role_name>", 
            "context": {
                "citations": [
                    {
                        "id": "<content_key>",
                        "content": "<content_value>"
                    }
                ]
            }
        }
    ]
}
```
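Before running a chat evaluator over a dataset, it can help to check that each record has this shape. The sketch below is one way to do that. `validate_messages` is a hypothetical helper, and restricting `role` to `user` and `assistant` is an assumption based on the description above.

```python
def validate_messages(messages: list) -> list:
    """Return a list of problems found in a chat-protocol message array."""
    problems = []
    for index, message in enumerate(messages):
        if message.get("role") not in ("user", "assistant"):
            problems.append(f"message {index}: unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {index}: content must be a string")
        # Citations are optional; when present, each needs an id and content.
        for citation in message.get("context", {}).get("citations", []):
            if not {"id", "content"} <= citation.keys():
                problems.append(f"message {index}: citation needs id and content")
    return problems


conversation = [
    {"role": "user", "content": "What is the value of 2 + 2?"},
    {
        "role": "assistant",
        "content": "2 + 2 = 4",
        "context": {"citations": [{"id": "math_doc.md", "content": "2 + 2 = 4"}]},
    },
]
print(validate_messages(conversation))  # []
```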

To `evaluate()` with either the `ChatEvaluator` or `ContentSafetyChatEvaluator`, ensure in the data mapping you match the key `messages` to your array of messages, given that your data adheres to the chat protocol defined above:

``` python
result = evaluate(
    data="data.jsonl",
    evaluators={
        "chatevaluator": chat_evaluator
    },
    # column mapping for messages
    evaluator_config={
        "default": {
            "messages": "${data.messages}"
        }
    }
)
```

# Evaluate on a target
If you have a list of queries that you'd like to run and then evaluate, `evaluate()` also supports a `target` parameter, which sends the queries to an application to collect answers and then runs your evaluators on the resulting questions and answers.

A target can be any callable class in your directory. In this case, we have a Python script `askwiki.py` with a callable class `askwiki()` that we can set as our target. Given a dataset of queries to send to the simple `askwiki` app, we can evaluate the relevance of its outputs.

``` python
from askwiki import askwiki

result = evaluate(
    data="data.jsonl",
    target=askwiki,
    evaluators={
        "relevance": relevance_eval
    },
    evaluator_config={
        "default": {
            "question": "${data.queries}",
            "context": "${outputs.context}",
            "answer": "${outputs.response}"
        }
    }
)
```
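The contents of `askwiki.py` aren't shown here. The mapping above refers to `${data.queries}`, `${outputs.context}`, and `${outputs.response}`, which suggests a target that accepts a `queries` input and returns a dictionary with `response` and `context` keys. The stub below sketches that shape under those assumptions. `AskWikiStub` is a hypothetical stand-in that returns canned text instead of calling an application.

```python
class AskWikiStub:
    """Hypothetical stand-in for the `askwiki` target in the example above."""

    def __call__(self, *, queries: str, **kwargs) -> dict:
        # A real target would call the application; this stub returns canned text.
        context = f"Placeholder context retrieved for: {queries}"
        return {"response": f"Placeholder answer to: {queries}", "context": context}


target = AskWikiStub()
output = target(queries="What is the capital of France?")
print(sorted(output))  # ['context', 'response']
```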