# Evaluation with Data
In this notebook, we introduce built-in evaluators and guide you through creating your own custom evaluators. We'll cover both code-based and prompt-based custom evaluators. Finally, we'll demonstrate how to use the `evaluate` API to assess data using these evaluators.


## 1. Built-in Evaluators


| Category       | Namespace                                        | Evaluator Class           |
|----------------|--------------------------------------------------|---------------------------|
| Quality        | promptflow.evals.evaluators                      | GroundednessEvaluator     |
|                |                                                  | RelevanceEvaluator        |
|                |                                                  | CoherenceEvaluator        |
|                |                                                  | FluencyEvaluator          |
|                |                                                  | SimilarityEvaluator       |
|                |                                                  | F1ScoreEvaluator          |
| Content Safety | promptflow.evals.evaluators.content_safety       | ViolenceEvaluator         |
|                |                                                  | SexualEvaluator           |
|                |                                                  | SelfHarmEvaluator         |
|                |                                                  | HateUnfairnessEvaluator   |
| Composite      | promptflow.evals.evaluators                      | QAEvaluator               |
|                |                                                  | ChatEvaluator             |
|                |                                                  | ContentSafetyEvaluator    |


### 1.1 Quality Evaluator

In [None]:
import os
from promptflow.core import AzureOpenAIModelConfiguration

# Initialize Azure OpenAI Connection
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    api_key=os.environ.get("AZURE_OPENAI_KEY"),
    azure_deployment="gpt-4",
)

In [None]:
from promptflow.evals.evaluators import RelevanceEvaluator

# Initializing Relevance Evaluator
relevance_eval = RelevanceEvaluator(model_config)

# Running Relevance Evaluator on single input row
relevance_score = relevance_eval(
    answer="The Alpine Explorer Tent is the most waterproof.",
    context="From our product list,"
    " the alpine explorer tent is the most waterproof."
    " The Adventure Dining Table has higher weight.",
    question="Which tent is the most waterproof?",
)

In [None]:
print(relevance_score)

### 1.2 Content Safety Evaluator

In [None]:
# Initialize Project Scope (replace these values with your own Azure AI project details)
project_scope = {
    "subscription_id": "e0fd569c-e34a-4249-8c24-e8d723c7f054",
    "resource_group_name": "resource-group",
    "project_name": "project-name",
}

In [None]:
from promptflow.evals.evaluators.content_safety import ViolenceEvaluator

# Initializing Violence Evaluator
violence_eval = ViolenceEvaluator(project_scope)

# Running Violence Evaluator on single input row
violence_score = violence_eval(question="What is the capital of France?", answer="Paris.")

In [None]:
print(violence_score)

### 1.3 Composite Evaluator

#### 1.3.1 QA Evaluator
QAEvaluator is a composite evaluator that combines all quality evaluators, including GroundednessEvaluator, RelevanceEvaluator, CoherenceEvaluator, FluencyEvaluator, SimilarityEvaluator and F1ScoreEvaluator.

In [None]:
from promptflow.evals.evaluators import QAEvaluator

# Initializing QA Evaluator
qa_eval = QAEvaluator(model_config=model_config)

# Running QA Evaluator on single input row
score = qa_eval(
    question="Tokyo is the capital of which country?",
    answer="Japan",
    context="Tokyo is the capital of Japan.",
    ground_truth="Japan",
)

In [None]:
print(score)
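
Of the metrics listed above, the F1 score is the easiest to reason about by hand: it is conventionally the harmonic mean of token-level precision and recall between the answer and the ground truth. The following is a minimal sketch of that idea, assuming lowercasing and whitespace tokenization. The helper `token_f1` is hypothetical and is not the library's implementation, whose normalization rules may differ.

```python
from collections import Counter


def token_f1(answer, ground_truth):
    """Hypothetical helper: token-overlap F1 between two strings (illustration only)."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    answer_tokens = answer.lower().split()
    truth_tokens = ground_truth.lower().split()
    if not answer_tokens or not truth_tokens:
        return 0.0

    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(answer_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(answer_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


exact_match_f1 = token_f1("Japan", "Japan")
verbose_answer_f1 = token_f1("The capital is Tokyo", "Tokyo")
print(exact_match_f1, verbose_answer_f1)
```

A verbose answer keeps full recall but loses precision, which is why its score drops below that of the exact match.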

#### 1.3.2 Chat Evaluator
ChatEvaluator is another composite evaluator. It uses the quality evaluators CoherenceEvaluator, FluencyEvaluator, GroundednessEvaluator, and RelevanceEvaluator to assess chat messages.

In [None]:
from promptflow.evals.evaluators import ChatEvaluator

# Initialize Chat Evaluator
chat_eval = ChatEvaluator(model_config=model_config)

conversation = [
    {"role": "user", "content": "What is the value of 2 + 2?"},
    {
        "role": "assistant",
        "content": "2 + 2 = 4",
        "context": {
            "citations": [{"id": "doc.md", "content": "Information about additions: 1 + 2 = 3, 2 + 2 = 4"}]
        },
    },
    {"role": "user", "content": "What is the capital of Japan?"},
    {
        "role": "assistant",
        "content": "The capital of Japan is Tokyo.",
        "context": {
            "citations": [
                {
                    "id": "doc.md",
                    "content": "Tokyo is Japan's capital, known for its blend of traditional culture and "
                    "technological advancements.",
                }
            ]
        },
    },
]

# Running Chat Evaluator on chat conversation
score = chat_eval(conversation=conversation)

In [None]:
print(score)

## 2. Custom Evaluator

### 2.1 Define a Code-based Evaluator

In [None]:
def answer_length(input, **kwargs):
    # Code-based evaluator: returns the length of the given text as a metric.
    return {"value": len(input)}
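
A code-based evaluator is just a callable that returns a dictionary of metrics, so it can also be written as a class with `__call__` when it needs configuration. The sketch below illustrates that pattern under stated assumptions: the class name `AnswerLengthEvaluator` and the `max_length` threshold are hypothetical and are not part of the library.

```python
class AnswerLengthEvaluator:
    """Hypothetical class-based variant of answer_length (illustration only)."""

    def __init__(self, max_length=100):
        # max_length is an assumed threshold introduced for this example.
        self.max_length = max_length

    def __call__(self, *, input, **kwargs):
        # Return every metric as a key in one dictionary.
        length = len(input)
        return {"value": length, "within_limit": length <= self.max_length}


# Running the evaluator on a single input, as with the built-in evaluators
length_eval = AnswerLengthEvaluator(max_length=10)
sample_length_result = length_eval(input="Paris.")
print(sample_length_result)
```

Keeping the threshold in `__init__` lets the same evaluator class be reused with different settings without changing its call signature.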

### 2.2 Define a Prompt-based Evaluator

In [None]:
import os

# Get the current working directory
current_directory = os.getcwd()

# Print the current working directory
print(current_directory)

In [None]:
with open("apology.prompty") as fin:
    print(fin.read())

## 3. Using the Evaluate API to Evaluate Data

In [None]:
import pandas as pd

data_path = "data.jsonl"

df = pd.read_json(data_path, lines=True)
df
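
The data file is in JSON Lines format: one JSON object per line, with keys that become the columns available to the evaluators. As a rough sketch, the code below writes a small file in that format and reads it back using only the standard library. The file name `sample_data.jsonl` and the rows are made up for illustration; they are not the contents of `data.jsonl`.

```python
import json
import os
import tempfile

# Hypothetical rows; the keys mirror the evaluator inputs used earlier.
sample_rows = [
    {
        "question": "Which tent is the most waterproof?",
        "answer": "The Alpine Explorer Tent is the most waterproof.",
        "context": "From our product list, the alpine explorer tent is the most waterproof.",
        "ground_truth": "The Alpine Explorer Tent",
    },
    {
        "question": "Tokyo is the capital of which country?",
        "answer": "Japan",
        "context": "Tokyo is the capital of Japan.",
        "ground_truth": "Japan",
    },
]

# Write to a temporary directory so the example leaves the working directory untouched.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_data.jsonl")
with open(sample_path, "w", encoding="utf-8") as fout:
    for row in sample_rows:
        fout.write(json.dumps(row) + "\n")

# Read the file back, one JSON object per non-empty line.
with open(sample_path, encoding="utf-8") as fin:
    loaded_rows = [json.loads(line) for line in fin if line.strip()]

print(len(loaded_rows), sorted(loaded_rows[0]))
```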

In [None]:
from promptflow.evals.evaluate import evaluate

result = evaluate(
    data="data.jsonl",
    evaluators={
        "answer_length": answer_length,
        "relevance": relevance_eval,
    },
    # Map the custom evaluator's `input` parameter to the `answer` column
    evaluator_config={
        "answer_length": {"input": "${data.answer}"},
    },
)
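
The `"${data.answer}"` entry in `evaluator_config` binds the evaluator's `input` parameter to the `answer` column of each row. The sketch below mimics that resolution in plain Python to make the idea concrete. `resolve_mapping` is a hypothetical helper, not a promptflow function, and the library's own handling may differ.

```python
import re

# Matches references of the form ${data.<column_name>}
_DATA_REF = re.compile(r"^\$\{data\.(\w+)\}$")


def resolve_mapping(mapping, row):
    """Hypothetical helper: turn {"param": "${data.col}"} into {"param": row["col"]}."""
    resolved = {}
    for param, ref in mapping.items():
        match = _DATA_REF.match(ref)
        if match is None:
            raise ValueError(f"Unsupported reference: {ref!r}")
        resolved[param] = row[match.group(1)]
    return resolved


def answer_length(input, **kwargs):
    return {"value": len(input)}


# A made-up row standing in for one line of the data file
sample_row = {"question": "What is the capital of France?", "answer": "Paris."}
resolved_kwargs = resolve_mapping({"input": "${data.answer}"}, sample_row)
mapped_result = answer_length(**resolved_kwargs)
print(resolved_kwargs, mapped_result)
```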

In [None]:
from IPython.display import display, JSON

display(JSON(result))