# Evaluation with Data
In this notebook, we introduce built-in evaluators and guide you through creating your own custom evaluators. We'll cover both code-based and prompt-based custom evaluators. Finally, we'll demonstrate how to use the `evaluate` API to assess data using these evaluators.


In [None]:
# Use the following code to set the environment variables if not already set. If set, you can skip this step.
import os
os.environ["AZURE_OPENAI_API_KEY"] = "<api_key>"
os.environ["AZURE_OPENAI_API_VERSION"] = "<api_version>"
os.environ["AZURE_OPENAI_DEPLOYMENT"] = "<deployment>"
os.environ["AZURE_OPENAI_ENDPOINT"] = "<endpoint>"
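
Before running the later cells, it can help to confirm that the four variables are set and no longer hold the placeholder values. The helper below is a minimal sketch: `find_unset_variables` is a hypothetical name, not part of promptflow, and treating any value wrapped in angle brackets as an unfilled placeholder is an assumption based on the cell above.

```python
import os

REQUIRED_VARIABLES = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_DEPLOYMENT",
    "AZURE_OPENAI_ENDPOINT",
]


def find_unset_variables(names, environ=None):
    """Return the names that are missing, empty, or still a <placeholder>."""
    environ = os.environ if environ is None else environ
    unset = []
    for name in names:
        value = environ.get(name, "")
        # Assumption: values such as "<api_key>" are unfilled placeholders.
        if not value or (value.startswith("<") and value.endswith(">")):
            unset.append(name)
    return unset


unset = find_unset_variables(REQUIRED_VARIABLES)
if unset:
    print("Still to configure:", ", ".join(unset))
```

Passing a plain dictionary as `environ` makes the check easy to try without touching the real environment.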

## 1. Built-in Evaluators

The table below lists all the built-in evaluators we support. In the following sections, we will select a few of these evaluators to demonstrate how to use them.

| Category       | Namespace                                        | Evaluator Class           | Notes                                             |
|----------------|--------------------------------------------------|---------------------------|---------------------------------------------------|
| Quality        | promptflow.evals.evaluators                      | GroundednessEvaluator     |                                                   |
|                |                                                  | RelevanceEvaluator        |                                                   |
|                |                                                  | CoherenceEvaluator        |                                                   |
|                |                                                  | FluencyEvaluator          |                                                   |
|                |                                                  | SimilarityEvaluator       |                                                   |
|                |                                                  | F1ScoreEvaluator          |                                                   |
| Content Safety | promptflow.evals.evaluators.content_safety       | ViolenceEvaluator         |                                                   |
|                |                                                  | SexualEvaluator           |                                                   |
|                |                                                  | SelfHarmEvaluator         |                                                   |
|                |                                                  | HateUnfairnessEvaluator   |                                                   |
| Composite      | promptflow.evals.evaluators                      | QAEvaluator               | Built on top of individual quality evaluators.    |
|                |                                                  | ChatEvaluator             | Similar to QAEvaluator but designed for evaluating chat messages. |
|                |                                                  | ContentSafetyEvaluator    | Built on top of individual content safety evaluators. |



### 1.1 Quality Evaluator

In [None]:
import os
from promptflow.core import AzureOpenAIModelConfiguration

# Initialize Azure OpenAI Connection
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
    azure_deployment=os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION"),
)

In [None]:
from promptflow.evals.evaluators import RelevanceEvaluator

# Initializing Relevance Evaluator
relevance_eval = RelevanceEvaluator(model_config)

In [None]:
# Running Relevance Evaluator on single input row
relevance_score = relevance_eval(
    answer="The Alpine Explorer Tent is the most waterproof.",
    context="From our product list,"
    " the alpine explorer tent is the most waterproof."
    " The Adventure Dining Table has higher weight.",
    question="Which tent is the most waterproof?",
)

In [None]:
print(relevance_score)

### 1.2 Content Safety Evaluator

Unlike quality evaluators, which are prompt-based, content safety evaluators are backed by the Responsible AI (RAI) service. To use these evaluators, you need to provide the subscription ID, resource group, and project name of your Azure AI project.

In [None]:
# AI Project Scope
project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}

In [None]:
from promptflow.evals.evaluators.content_safety import ViolenceEvaluator

# Initializing Violence Evaluator
violence_eval = ViolenceEvaluator(project_scope)

In [None]:
# Running Violence Evaluator on single input row
violence_score = violence_eval(question="What is the capital of France?", answer="Paris.")

In [None]:
print(violence_score)

## 2. Custom Evaluator

After gaining a basic understanding of how the built-in evaluators work, let's explore how to create your own evaluator. In this section, we will discuss creating both a code-based custom evaluator and a prompt-based custom evaluator.

### 2.1 Define a Code-based Evaluator

A code-based evaluator can be as simple as a callable class. In the example below, we will show you a simple evaluator that calculates the length of an answer.

In [None]:
with open("custom/answer_length.py") as fin:
    print(fin.read())

In [None]:
from custom.answer_length import AnswerLengthEvaluator

answer_length = AnswerLengthEvaluator()

print(answer_length(answer="some answer"))
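
The contents of `custom/answer_length.py` are not reproduced in this text. As a rough sketch of the callable-class pattern, such an evaluator might look like the following. The class name `AnswerLengthSketch` and the `answer_length` result key are illustrative assumptions, not the actual file contents.

```python
class AnswerLengthSketch:
    """Hypothetical code-based evaluator: reports the answer's character count."""

    def __init__(self):
        # A pure-code metric needs no model or service connection.
        pass

    def __call__(self, *, answer: str, **kwargs):
        # Accepting **kwargs lets the evaluator ignore extra inputs it is given.
        return {"answer_length": len(answer)}


sketch_eval = AnswerLengthSketch()
print(sketch_eval(answer="some answer"))
```

Returning a dictionary rather than a bare number keeps the result self-describing, which matches how the built-in evaluators above return their scores.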

### 2.2 Define a Prompty-based Evaluator

A prompty is a file with the `.prompty` extension, used for developing prompt templates. It is a markdown file with a modified front matter. The front matter is written in YAML and contains metadata fields that define the model configuration and the expected inputs of the prompty.

In [None]:
with open("custom/apology.prompty") as fin:
    print(fin.read())
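
The printed file is not reproduced in this text. To make the layout described above concrete, the sketch below separates a prompty-style document into its YAML front matter and its prompt body using only the standard library. The sample text, its field values, and the `split_front_matter` helper are all hypothetical: they are neither the contents of `custom/apology.prompty` nor a promptflow API.

```python
SAMPLE_PROMPTY = """\
---
name: Sample Evaluator
description: Hypothetical example, not the real apology.prompty
inputs:
  question:
    type: string
  answer:
    type: string
---
system:
You are a grader. Reply with a single score.

user:
question: {{question}}
answer: {{answer}}
"""


def split_front_matter(text):
    """Split text into (front_matter, body) at the '---' delimiter lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text  # No front matter present.
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            front_matter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return front_matter, body
    raise ValueError("front matter is missing its closing '---' line")


front_matter, body = split_front_matter(SAMPLE_PROMPTY)
print(front_matter)
print(body)
```

The front matter carries the metadata (here, the declared inputs), while the body is the prompt template that refers to those inputs by name.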

In [None]:
from promptflow.client import load_flow

# Load the apology evaluator from the prompty file
apology_eval = load_flow(source="custom/apology.prompty", model={"configuration": model_config})

In [None]:
result = apology_eval(
    question="What is the capital of France?", answer="Paris"
)

In [None]:
print(result)

## 3. Using the `evaluate` API to Evaluate with Data

In the previous sections, we showed how to use built-in evaluators on a single row and how to define your own custom evaluators. Now, we will use these evaluators with the `evaluate` API to assess an entire dataset.

First, let's take a peek at what the data looks like.

In [None]:
import pandas as pd

data_path = "data.jsonl"

df = pd.read_json(data_path, lines=True)
df
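
The `.jsonl` extension indicates the JSON Lines layout: one JSON object per line, each becoming one row of the table. The sketch below builds two such lines in memory and parses them with the standard library. The rows, and every column name other than `truth`, are hypothetical and not the contents of the real file.

```python
import json

# Hypothetical rows; the real data.jsonl may use different columns.
sample_rows = [
    {
        "question": "Which tent is the most waterproof?",
        "answer": "The Alpine Explorer Tent is the most waterproof.",
        "context": "The alpine explorer tent is the most waterproof.",
        "truth": "The Alpine Explorer Tent.",
    },
    {
        "question": "What is the capital of France?",
        "answer": "Paris.",
        "context": "Paris is the capital of France.",
        "truth": "Paris",
    },
]

# Serialize: one JSON object per line.
jsonl_text = "\n".join(json.dumps(row) for row in sample_rows)

# Parse: each non-empty line is an independent JSON document.
parsed_rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
print(len(parsed_rows), sorted(parsed_rows[0]))
```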

Now, we will invoke the `evaluate` API with a few of the evaluators that we have already initialized.

We also pass a column mapping that maps the `truth` column of the dataset to `ground_truth`, the input name that evaluators accept.

In [None]:
from promptflow.evals.evaluate import evaluate

result = evaluate(
    data="data.jsonl",
    evaluators={
        "relevance": relevance_eval,
        "violence": violence_eval,
        "answer_length": answer_length,
        "apology": apology_eval
    },
    # column mapping
    evaluator_config={
        "default": {
            "ground_truth": "${data.truth}"
        }
    }
)
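
To illustrate what a mapping such as `"ground_truth": "${data.truth}"` expresses, the sketch below resolves `${data.<column>}` references against a single row. `resolve_column_mapping` is a hypothetical helper written for this explanation; it is not how promptflow implements the mapping internally.

```python
import re

# Matches references of the form ${data.column_name}.
DATA_REFERENCE = re.compile(r"^\$\{data\.(\w+)\}$")


def resolve_column_mapping(row, mapping):
    """Return evaluator inputs: the row's columns plus any mapped parameter names."""
    inputs = dict(row)
    for parameter, reference in mapping.items():
        match = DATA_REFERENCE.match(reference)
        if match is None:
            raise ValueError(f"unsupported reference: {reference!r}")
        column = match.group(1)
        inputs[parameter] = row[column]
    return inputs


row = {"question": "What is the capital of France?", "truth": "Paris"}
inputs = resolve_column_mapping(row, {"ground_truth": "${data.truth}"})
print(inputs["ground_truth"])
```

The point of the mapping is that the dataset can keep its own column names while each evaluator still receives the parameter names it expects.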


Finally, let's check the results produced by the evaluate API.

In [None]:
from IPython.display import display, JSON

display(JSON(result))

In [None]:
# Check the results using Azure AI Studio UI
print(result["studio_url"])