# Prompt Flow SDK evaluation

https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/flow-evaluate-sdk

To assess the performance of your generative AI application on a substantial dataset, you can run evaluations in your development environment with the prompt flow SDK. Given either a test dataset or a target, your application's generations are measured quantitatively with both math-based metrics and AI-assisted quality and safety evaluators. Built-in or custom evaluators can give you insight into the application's capabilities and limitations.

In this notebook, you will learn how to use the prompt flow SDK to run built-in evaluators on a single row of data and on a larger test dataset with an application target, and then track the results and evaluation logs in Azure AI Studio.

## Getting started
Install the necessary packages

In [1]:
!pip install promptflow-evals

Collecting promptflow-evals
  Downloading promptflow_evals-0.3.1-py3-none-any.whl.metadata (4.1 kB)
Collecting aiohttp_retry>=2.8.3 (from promptflow-evals)
  Downloading aiohttp_retry-2.8.3-py3-none-any.whl.metadata (8.9 kB)
Collecting jsonpath_ng>=1.5.0 (from promptflow-evals)
  Downloading jsonpath_ng-1.6.1-py3-none-any.whl.metadata (18 kB)
Collecting promptflow-core<2.0.0,>=1.13.0 (from promptflow-evals)
  Downloading promptflow_core-1.14.0-py3-none-any.whl.metadata (2.7 kB)
Collecting promptflow-devkit<2.0.0,>=1.13.0 (from promptflow-evals)
  Downloading promptflow_devkit-1.14.0-py3-none-any.whl.metadata (5.7 kB)
Collecting aiohttp (from aiohttp_retry>=2.8.3->promptflow-evals)
  Downloading aiohttp-3.9.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.5 kB)
Collecting ply (from jsonpath_ng>=1.5.0->promptflow-evals)
  Downloading ply-3.11-py2.py3-none-any.whl.metadata (844 bytes)
Collecting docstring_parser (from promptflow-core<2.0.0,>=1.13.0->promptflow-eval

Open the .env file and add the credentials for your AI Studio model deployment:
- AZURE_OPENAI_ENDPOINT="<your model deployment endpoint>"
- AZURE_OPENAI_API_KEY="<your openai key>"
- AZURE_OPENAI_DEPLOYMENT="<deployment name>"
- AZURE_OPENAI_API_VERSION="<api version, e.g. 2023-03-15-preview>"

## Built-in evaluators
Built-in evaluators support the following application scenarios:

- Question and answer: This scenario is designed for applications that send in queries and generate answers.
- Chat: This scenario is suitable for applications where the model engages in conversation using a retrieval-augmented approach to extract information from your provided documents and generate detailed responses.

For more in-depth information on each evaluator's definition and how it's calculated, see the Azure AI Studio documentation linked at the top of this notebook.

| Category                | Evaluator class                                      |
|-------------------------|------------------------------------------------------|
| Performance and quality | GroundednessEvaluator, RelevanceEvaluator, CoherenceEvaluator, FluencyEvaluator, SimilarityEvaluator, F1ScoreEvaluator |
| Risk and safety         | ViolenceEvaluator, SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator |
| Composite               | QAEvaluator, ChatEvaluator, ContentSafetyEvaluator, ContentSafetyChatEvaluator |


Built-in composite evaluators are composed of individual evaluators.

- QAEvaluator combines all the quality evaluators into a single output of combined metrics for question and answer pairs.
- ChatEvaluator combines all the quality evaluators into a single output of combined metrics for chat messages that follow the OpenAI message protocol. In addition to all the quality evaluators, it includes support for retrieval score. Retrieval score isn't currently supported as a standalone evaluator class.
- ContentSafetyEvaluator combines all the safety evaluators into a single output of combined metrics for question and answer pairs.
- ContentSafetyChatEvaluator combines all the safety evaluators into a single output of combined metrics for chat messages that follow the OpenAI message protocol.
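
One way to picture this composition is as a merge of the metric dictionaries returned by several evaluators. The sketch below is not the promptflow-evals implementation: `combine_evaluators`, `answer_length`, and `question_overlap` are hypothetical names, and the two toy metrics only stand in for the real quality evaluators.

```python
from typing import Callable, Dict, List

# Assumption for this sketch: an evaluator is any callable that takes
# keyword inputs and returns a dict of metric name -> score.
Evaluator = Callable[..., Dict[str, float]]


def answer_length(*, answer: str, **_: str) -> Dict[str, float]:
    """Toy metric: number of whitespace-separated tokens in the answer."""
    return {"answer_length": float(len(answer.split()))}


def question_overlap(*, question: str, answer: str, **_: str) -> Dict[str, float]:
    """Toy metric: fraction of question tokens that also appear in the answer."""
    q_tokens = set(question.lower().split())
    a_tokens = set(answer.lower().split())
    if not q_tokens:
        return {"question_overlap": 0.0}
    return {"question_overlap": len(q_tokens & a_tokens) / len(q_tokens)}


def combine_evaluators(evaluators: List[Evaluator]) -> Evaluator:
    """Return a composite evaluator that merges the outputs of its parts."""

    def composite(**inputs: str) -> Dict[str, float]:
        merged: Dict[str, float] = {}
        for evaluator in evaluators:
            # Each part sees the same inputs and contributes its own metrics.
            merged.update(evaluator(**inputs))
        return merged

    return composite


toy_qa_evaluator = combine_evaluators([answer_length, question_overlap])
result = toy_qa_evaluator(
    question="Which tent is the most waterproof?",
    answer="The Alpine Explorer Tent is the most waterproof.",
)
print(result)
```

The composite is called once per row and returns every metric in a single dictionary, which mirrors the "single output of combined metrics" described above.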

### Required data input for built-in evaluators
Built-in evaluators require question and answer pairs in .jsonl format that contain the inputs listed below. When you evaluate a dataset, you also need a column mapping from your dataset's columns to these inputs:

| Evaluator                | question            | answer               | context              | ground_truth         |
|--------------------------|---------------------|----------------------|----------------------|----------------------|
| GroundednessEvaluator    | N/A                 | Required: String     | Required: String     | N/A                  |
| RelevanceEvaluator       | Required: String    | Required: String     | Required: String     | N/A                  |
| CoherenceEvaluator       | Required: String    | Required: String     | N/A                  | N/A                  |
| FluencyEvaluator         | Required: String    | Required: String     | N/A                  | N/A                  |
| SimilarityEvaluator      | Required: String    | Required: String     | N/A                  | Required: String     |
| F1ScoreEvaluator         | N/A                 | Required: String     | N/A                  | Required: String     |
| ViolenceEvaluator        | Required: String    | Required: String     | N/A                  | N/A                  |
| SexualEvaluator          | Required: String    | Required: String     | N/A                  | N/A                  |
| SelfHarmEvaluator        | Required: String    | Required: String     | N/A                  | N/A                  |
| HateUnfairnessEvaluator  | Required: String    | Required: String     | N/A                  | N/A                  |

- Question: the question sent to the generative AI application
- Answer: the response to the question generated by the generative AI application
- Context: the source that the response is generated with respect to (that is, the grounding documents)
- Ground truth: the response to the question written by a user/human as the true answer
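
The input table above can be turned into a quick pre-flight check on the rows of a .jsonl file. The following is a sketch only: `REQUIRED_INPUTS` transcribes the table, while `missing_inputs` and the sample row are hypothetical and not part of the SDK.

```python
import json
from typing import Dict, List

# Required columns per evaluator, transcribed from the table above.
REQUIRED_INPUTS: Dict[str, List[str]] = {
    "GroundednessEvaluator": ["answer", "context"],
    "RelevanceEvaluator": ["question", "answer", "context"],
    "CoherenceEvaluator": ["question", "answer"],
    "FluencyEvaluator": ["question", "answer"],
    "SimilarityEvaluator": ["question", "answer", "ground_truth"],
    "F1ScoreEvaluator": ["answer", "ground_truth"],
    "ViolenceEvaluator": ["question", "answer"],
    "SexualEvaluator": ["question", "answer"],
    "SelfHarmEvaluator": ["question", "answer"],
    "HateUnfairnessEvaluator": ["question", "answer"],
}


def missing_inputs(evaluator_name: str, jsonl_line: str) -> List[str]:
    """Return the required columns that are absent or empty in one .jsonl row."""
    row = json.loads(jsonl_line)
    return [
        column
        for column in REQUIRED_INPUTS[evaluator_name]
        if not isinstance(row.get(column), str) or not row[column].strip()
    ]


# Hypothetical sample row with no ground_truth column.
sample_line = json.dumps(
    {
        "question": "Which tent is the most waterproof?",
        "answer": "The Alpine Explorer Tent is the most waterproof.",
        "context": "The Alpine Explorer Tent is the most waterproof.",
    }
)

print(missing_inputs("RelevanceEvaluator", sample_line))
print(missing_inputs("F1ScoreEvaluator", sample_line))
```

Running a check like this over each line of the dataset before an evaluation makes a missing column visible early, rather than part-way through a run.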

## Performance evaluation

In [14]:
import os
from dotenv import load_dotenv
#from promptflow.core import AzureOpenAIModelConfiguration
#from promptflow.evals.evaluators import RelevanceEvaluator

# load .env variables
load_dotenv()

print(os.getenv("AZURE_OPENAI_ENDPOINT"))

None


In [10]:
import os
from dotenv import load_dotenv
from promptflow.core import AzureOpenAIModelConfiguration
from promptflow.evals.evaluators import RelevanceEvaluator

# load .env variables
load_dotenv()

print(os.getenv("AZURE_OPENAI_ENDPOINT"))

# Initialize Azure OpenAI Connection with your environment variables
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
)

# Initialize the relevance evaluator
relevance_eval = RelevanceEvaluator(model_config)
# Run the relevance evaluator on a single input row
relevance_score = relevance_eval(
    answer="The Alpine Explorer Tent is the most waterproof.",
    context="From our product list,"
    " the alpine explorer tent is the most waterproof."
    " The Adventure Dining Table has higher weight.",
    question="Which tent is the most waterproof?",
)
print(relevance_score)

[2024-07-26 14:31:58 +0000][promptflow.core._prompty_utils][ERROR] - Exception occurs: ValueError: Must provide one of the `base_url` or `azure_endpoint` arguments, or the `AZURE_OPENAI_ENDPOINT` environment variable


None


LLMError: OpenAI API hits exception: ValueError: Must provide one of the `base_url` or `azure_endpoint` arguments, or the `AZURE_OPENAI_ENDPOINT` environment variable
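
The `None` printed before the error shows that `AZURE_OPENAI_ENDPOINT` was not set when the cell ran, for example because `load_dotenv()` did not find a .env file or the file does not define the variable. A fail-fast check before building the model configuration makes this easier to spot. The sketch below uses only the standard library; `REQUIRED_SETTINGS` and `missing_settings` are hypothetical helper names, not part of the SDK.

```python
import os
from typing import List, Mapping

# The four variables this notebook reads from the environment.
REQUIRED_SETTINGS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
    "AZURE_OPENAI_API_VERSION",
)


def missing_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required settings that are unset or empty."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


missing = missing_settings()
if missing:
    print("Missing settings: " + ", ".join(missing))
else:
    print("All Azure OpenAI settings are present.")
```

In the notebook, a check like this would run right after `load_dotenv()` and before `AzureOpenAIModelConfiguration` is created, so that a missing endpoint is reported by name instead of surfacing later as an `LLMError`.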