<a href="https://colab.research.google.com/github/Nebius-Academy/LLM-Engineering-Essentials/blob/main/topic5/5.1_llm_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Engineering Essentials by Nebius Academy

Course github: [link](https://github.com/Nebius-Academy/LLM-Engineering-Essentials/tree/main)

The course is in development now, with more materials coming soon.

# Intro to LLM evaluations

**Authored by** **Emeli Dral** and **Elena Samuylova**, creators of [**Evidently**](https://www.evidentlyai.com/) ([GitHub](https://github.com/evidentlyai/evidently)), an open-source ML and LLM evaluation framework with 25M+ downloads.

<center>
<img src="https://raw.githubusercontent.com/Nebius-Academy/knowledge-base/refs/heads/main/assets/images/evaluation/evidently_ai_logo_docs.png" width=400 />

</center>

This example uses the open-source Evidently Python library. If you face any issues, open a GitHub issue at https://github.com/evidentlyai/evidently or ask in the Evidently Discord at https://discord.gg/xZjKRaNp8b.

You can also check the official docs https://docs.evidentlyai.com

⭐️ If you enjoy the example, [give us a star on GitHub](https://github.com/evidentlyai/evidently) to support the project!

In [None]:
%pip install "evidently[llm]"

# 🧠 Why Evaluate LLMs?

LLM evaluation is critical across the development, iteration, and monitoring stages of your LLM app. In this tutorial, we'll walk through:
- Reference-based evaluations: when you have a ground truth to compare against.
- Open-ended evaluations: when there's no “correct” answer.
- Multi-turn evals: when you are evaluating a complete conversation session, not just a single input-output pair.

We'll use a realistic mini-dataset for a financial assistant chatbot.

The goal is to demonstrate different evaluation methods you can use. We'll cover:
- Ground truth comparison (exact match, semantic similarity, BERTScore)
- LLM-as-a-judge evaluations (correctness, helpfulness, conciseness)
- Deterministic descriptors (response length, sentiment, keyword presence)

# Imports

In [None]:
import pandas as pd
from evidently import Report
from evidently import Dataset, DataDefinition
from evidently.descriptors import TextLength, Sentiment, IncludesWords, SemanticSimilarity, ExactMatch, BERTScore, SentenceCount
from evidently.descriptors import LLMEval, PIILLMEval, DeclineLLMEval, CorrectnessLLMEval, FaithfulnessLLMEval, ContextQualityLLMEval
from evidently.llm.templates import BinaryClassificationPromptTemplate, MulticlassClassificationPromptTemplate
from evidently.presets import TextEvals
from evidently.metrics import CategoryCount, OutRangeValueCount
from evidently.tests import *

# Toy data

Generate a toy dataset. Let's imagine a Q&A (RAG) use case where the system generates the response based on the retrieved context.

In [None]:
import pandas as pd

eval_data = pd.DataFrame([
    {
        "question": "Can I send $2000 to Brazil today?",
        "context": "Personal account holders are allowed to transfer up to $3000 USD per day to Brazil. Transfers above this threshold require additional verification steps. This limit applies regardless of the recipient’s bank.",
        "answer": "Yes, you can send $2000 to Brazil today waithout any restrictions.",
        "reference_answer": "Yes, you can send $2000 to Brazil today. The daily limit is $3000."
    },
    {
        "question": "How do I block my card if it's lost?",
        "context": "To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.",
        "answer": "Go to the Cards section, select your card, and tap 'Block card' to block it instantly.",
        "reference_answer": "Open the app, go to Cards, choose your card, and tap 'Block card'. Blocking is immediate."
    },
    {
        "question": "Do you offer loans in Argentina?",
        "context": "FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.",
        "answer": "Yes, FinBot offers personal loans in Argentina with competitive rates.",  # Incorrect
        "reference_answer": "No, FinBot does not currently offer loans in Argentina."
    },
    {
        "question": "Is there a fee for using an ATM in Mexico?",
        "context": "ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.",
        "answer": "You’ll be charged $2.50 when using a non-partner ATM in Mexico.",
        "reference_answer": "Yes, the fee is $2.50 for non-partner ATMs. Partner ATMs are free."
    },
    {
        "question": "Can I cancel a transaction after it's sent?",
        "context": "Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient’s bank must agree to reverse the transfer.",
        "answer": "I am afraid I do not have information to answer this question.",
        "reference_answer": "No, but you can submit a recall request. It depends on the recipient’s bank."
    }
])

In [None]:
pd.set_option('display.max_colwidth', None)

Let's first take a look at the starting point: a golden dataset of expected questions and answers.

Having a dataset like this lets you run efficient offline evaluations, e.g. as you iterate on prompts or models, and compare the new answers against the expected ones.

In [None]:
golden_df = eval_data[["question", "reference_answer"]].copy()

In [None]:
golden_df.head()

Unnamed: 0,question,reference_answer
0,Can I send $2000 to Brazil today?,"Yes, you can send $2000 to Brazil today. The daily limit is $3000."
1,How do I block my card if it's lost?,"Open the app, go to Cards, choose your card, and tap 'Block card'. Blocking is immediate."
2,Do you offer loans in Argentina?,"No, FinBot does not currently offer loans in Argentina."
3,Is there a fee for using an ATM in Mexico?,"Yes, the fee is $2.50 for non-partner ATMs. Partner ATMs are free."
4,Can I cancel a transaction after it's sent?,"No, but you can submit a recall request. It depends on the recipient’s bank."
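As a rough illustration of how a golden dataset is used offline, the sketch below runs each golden question through an application and attaches the result to the row. The `generate_answer` function is a hypothetical stand-in that returns canned strings, and `run_golden_set` is an illustrative helper name. Neither is part of Evidently.

```python
# Hypothetical stand-in for the real application call (an assumption for illustration).
def generate_answer(question: str) -> dict:
    """Return a canned context and answer for the given question."""
    return {
        "context": f"(retrieved context for: {question})",
        "answer": f"(generated answer for: {question})",
    }


# Two rows copied from the golden dataset above.
golden_rows = [
    {
        "question": "Do you offer loans in Argentina?",
        "reference_answer": "No, FinBot does not currently offer loans in Argentina.",
    },
    {
        "question": "Is there a fee for using an ATM in Mexico?",
        "reference_answer": "Yes, the fee is $2.50 for non-partner ATMs. Partner ATMs are free.",
    },
]


def run_golden_set(rows, app):
    """Attach the app's context and answer to every golden row."""
    return [{**row, **app(row["question"])} for row in rows]


eval_rows = run_golden_set(golden_rows, generate_answer)
print(sorted(eval_rows[0].keys()))
```

Each resulting row carries the question, the reference answer, and the app's context and answer, which is the shape the evaluations in this tutorial expect.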


Let's assume we ran these questions through our app and captured the actual answer together with the context used to generate it. Here we simply imitate that step with the pre-built demo dataset. This is the output we will work with:

In [None]:
eval_data.head()

Unnamed: 0,question,context,answer,reference_answer
0,Can I send $2000 to Brazil today?,Personal account holders are allowed to transfer up to $3000 USD per day to Brazil. Transfers above this threshold require additional verification steps. This limit applies regardless of the recipient’s bank.,"Yes, you can send $2000 to Brazil today waithout any restrictions.","Yes, you can send $2000 to Brazil today. The daily limit is $3000."
1,How do I block my card if it's lost?,"To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.","Go to the Cards section, select your card, and tap 'Block card' to block it instantly.","Open the app, go to Cards, choose your card, and tap 'Block card'. Blocking is immediate."
2,Do you offer loans in Argentina?,"FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.","Yes, FinBot offers personal loans in Argentina with competitive rates.","No, FinBot does not currently offer loans in Argentina."
3,Is there a fee for using an ATM in Mexico?,"ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.",You’ll be charged $2.50 when using a non-partner ATM in Mexico.,"Yes, the fee is $2.50 for non-partner ATMs. Partner ATMs are free."
4,Can I cancel a transaction after it's sent?,"Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient’s bank must agree to reverse the transfer.",I am afraid I do not have information to answer this question.,"No, but you can submit a recall request. It depends on the recipient’s bank."


# Prepare the dataset for evals

In [None]:
definition = DataDefinition(text_columns=["question", "context", "answer", "reference_answer"])

In [None]:
eval_df = Dataset.from_pandas(
    pd.DataFrame(eval_data),
    data_definition=definition)

# Reference-based evals

## Deterministic

Let's start with exact match for illustration. We will add descriptors to the dataset and preview it locally as a pandas DataFrame to see the process. (At the end of this tutorial, you will see how to aggregate the results into a Report.)

In [None]:
eval_df.add_descriptors(descriptors=[
    ExactMatch(columns=["answer", "reference_answer"], alias="ExactMatch"),
])

In [None]:
eval_df.as_dataframe()

Unnamed: 0,question,context,answer,reference_answer,ExactMatch
0,Can I send $2000 to Brazil today?,Personal account holders are allowed to transfer up to $3000 USD per day to Brazil. Transfers above this threshold require additional verification steps. This limit applies regardless of the recipient’s bank.,"Yes, you can send $2000 to Brazil today waithout any restrictions.","Yes, you can send $2000 to Brazil today. The daily limit is $3000.",False
1,How do I block my card if it's lost?,"To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.","Go to the Cards section, select your card, and tap 'Block card' to block it instantly.","Open the app, go to Cards, choose your card, and tap 'Block card'. Blocking is immediate.",False
2,Do you offer loans in Argentina?,"FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.","Yes, FinBot offers personal loans in Argentina with competitive rates.","No, FinBot does not currently offer loans in Argentina.",False
3,Is there a fee for using an ATM in Mexico?,"ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.",You’ll be charged $2.50 when using a non-partner ATM in Mexico.,"Yes, the fee is $2.50 for non-partner ATMs. Partner ATMs are free.",False
4,Can I cancel a transaction after it's sent?,"Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient’s bank must agree to reverse the transfer.",I am afraid I do not have information to answer this question.,"No, but you can submit a recall request. It depends on the recipient’s bank.",False


Exact Match checks if the generated response matches the reference text exactly.

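However, in real-world LLM output, even perfectly valid answers may use different wording or structure, so this method is too strict for free-form text. Every row above is False, including the card-blocking answer that conveys the same steps as the reference.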
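To see why literal comparison is brittle, here is a minimal plain-Python sketch. It is not Evidently's implementation. `normalized_match` is a hypothetical helper that ignores case, punctuation, and extra whitespace. Even that relaxed check fails as soon as a single word is paraphrased.

```python
import string


def exact_match(a: str, b: str) -> bool:
    """Literal string equality."""
    return a == b


def normalized_match(a: str, b: str) -> bool:
    """Equality after lowercasing, stripping punctuation and collapsing whitespace."""
    def norm(text: str) -> str:
        text = text.lower().translate(str.maketrans("", "", string.punctuation))
        return " ".join(text.split())
    return norm(a) == norm(b)


reference = "Blocking is immediate."
print(exact_match("Blocking is immediate.", reference))       # True
print(exact_match("blocking is immediate", reference))        # False
print(normalized_match("blocking is immediate", reference))   # True
print(normalized_match("Blocking is instant.", reference))    # False
```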

In [None]:
# You can also add the descriptors while creating the dataset, and rely on automated data definition.
# This gives the same result.

eval_df = Dataset.from_pandas(
    pd.DataFrame(eval_data),
    data_definition=DataDefinition(),
    descriptors=[ExactMatch(columns=["answer", "reference_answer"],
                            alias="ExactMatch")])
eval_df.as_dataframe()

Unnamed: 0,question,context,answer,reference_answer,ExactMatch
0,Can I send $2000 to Brazil today?,Personal account holders are allowed to transfer up to $3000 USD per day to Brazil. Transfers above this threshold require additional verification steps. This limit applies regardless of the recipient’s bank.,"Yes, you can send $2000 to Brazil today waithout any restrictions.","Yes, you can send $2000 to Brazil today. The daily limit is $3000.",False
1,How do I block my card if it's lost?,"To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.","Go to the Cards section, select your card, and tap 'Block card' to block it instantly.","Open the app, go to Cards, choose your card, and tap 'Block card'. Blocking is immediate.",False
2,Do you offer loans in Argentina?,"FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.","Yes, FinBot offers personal loans in Argentina with competitive rates.","No, FinBot does not currently offer loans in Argentina.",False
3,Is there a fee for using an ATM in Mexico?,"ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.",You’ll be charged $2.50 when using a non-partner ATM in Mexico.,"Yes, the fee is $2.50 for non-partner ATMs. Partner ATMs are free.",False
4,Can I cancel a transaction after it's sent?,"Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient’s bank must agree to reverse the transfer.",I am afraid I do not have information to answer this question.,"No, but you can submit a recall request. It depends on the recipient’s bank.",False


## Semantic match

Next, let's measure semantic match rather than literal match.

We’ll use two approaches:

*   SemanticSimilarity: cosine similarity over sentence embeddings. A built-in embedding model produces a single vector per text, and the score measures how close the answer and the reference are in meaning. It outputs a float between 0 and 1, where 0 means opposite meanings, 0.5 means unrelated, and 1 means an exact match.
*   BERTScore: token-level alignment. It uses contextual embeddings from BERT and computes pairwise cosine similarities between the tokens of the candidate and the reference. We look at the resulting F1 score.
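The two ideas above can be sketched with toy vectors. This is an illustration under assumptions, not the library's code. The 2-D vectors stand in for real embeddings. `rescale_to_unit_interval` is one mapping consistent with the 0 / 0.5 / 1 interpretation described above. `greedy_f1` shows only the greedy matching step of a BERTScore-style F1, without the contextual model or any optional weighting.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rescale_to_unit_interval(cos_sim):
    """Assumed mapping from [-1, 1] to [0, 1]: opposite -> 0, orthogonal -> 0.5, same -> 1."""
    return (cos_sim + 1) / 2


def greedy_f1(candidate_tokens, reference_tokens):
    """BERTScore-style F1: match every token to its most similar counterpart."""
    precision = sum(
        max(cosine(c, r) for r in reference_tokens) for c in candidate_tokens
    ) / len(candidate_tokens)
    recall = sum(
        max(cosine(r, c) for c in candidate_tokens) for r in reference_tokens
    ) / len(reference_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Sentence-level similarity on toy vectors.
print(rescale_to_unit_interval(cosine([1.0, 0.0], [-1.0, 0.0])))  # 0.0
print(rescale_to_unit_interval(cosine([1.0, 0.0], [0.0, 1.0])))   # 0.5
print(rescale_to_unit_interval(cosine([1.0, 0.0], [2.0, 0.0])))   # 1.0

# Token-level matching: the candidate has one extra, unmatched token.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0]]
f1 = greedy_f1(candidate, reference)
print(round(f1, 4))  # 0.6667
```

In the token-level example, recall is 1.0 because the single reference token is covered, while precision is 0.5 because one candidate token has no counterpart, so the F1 lands at about 0.67.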

In [None]:
eval_df.add_descriptors(descriptors=[
    SemanticSimilarity(columns=["answer", "reference_answer"], alias="Semantic Similarity"),
    BERTScore(columns=["answer", "reference_answer"], alias="BERTScore"),
])

In [None]:
eval_df.as_dataframe()

Unnamed: 0,question,context,answer,reference_answer,ExactMatch,Semantic Similarity,BERTScore
0,Can I send $2000 to Brazil today?,Personal account holders are allowed to transfer up to $3000 USD per day to Brazil. Transfers above this threshold require additional verification steps. This limit applies regardless of the recipient’s bank.,"Yes, you can send $2000 to Brazil today waithout any restrictions.","Yes, you can send $2000 to Brazil today. The daily limit is $3000.",False,0.964713,0.837965
1,How do I block my card if it's lost?,"To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.","Go to the Cards section, select your card, and tap 'Block card' to block it instantly.","Open the app, go to Cards, choose your card, and tap 'Block card'. Blocking is immediate.",False,0.947559,0.827429
2,Do you offer loans in Argentina?,"FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.","Yes, FinBot offers personal loans in Argentina with competitive rates.","No, FinBot does not currently offer loans in Argentina.",False,0.953436,0.846422
3,Is there a fee for using an ATM in Mexico?,"ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.",You’ll be charged $2.50 when using a non-partner ATM in Mexico.,"Yes, the fee is $2.50 for non-partner ATMs. Partner ATMs are free.",False,0.860564,0.753561
4,Can I cancel a transaction after it's sent?,"Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient’s bank must agree to reverse the transfer.",I am afraid I do not have information to answer this question.,"No, but you can submit a recall request. It depends on the recipient’s bank.",False,0.603113,0.571043


While embedding-based metrics are helpful for measuring overall semantic closeness (and help us catch issues like a refusal to respond), they aren't always precise enough for factual evaluations. These methods rely on vector similarity, so they may consider two responses "similar" even if they differ in one small but decisive detail, such as a reversed yes/no fact.
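A quick way to see this limitation is to apply a pass/fail cut-off to the Semantic Similarity scores from the table above. The scores are copied from that table. The row labels and the 0.9 threshold are hypothetical choices made for this sketch.

```python
# Semantic Similarity scores copied from the table above, keyed by a short label.
similarity = {
    "Brazil transfer": 0.964713,
    "Block card": 0.947559,
    "Loans in Argentina": 0.953436,
    "ATM fee in Mexico": 0.860564,
    "Cancel transaction": 0.603113,
}

THRESHOLD = 0.9  # hypothetical cut-off, chosen only for illustration

passed = [name for name, score in similarity.items() if score >= THRESHOLD]
print(passed)
# ['Brazil transfer', 'Block card', 'Loans in Argentina']
```

The "Loans in Argentina" answer passes even though it says "Yes" where the reference says "No", while the refusal in "Cancel transaction" is correctly caught by its low score.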

## LLM as a judge

We can achieve better results with LLM-based judges that can reason about meaning and detect contradictions between texts.

You will need an OpenAI API key to use an LLM as a judge. Set it as an environment variable.

In [None]:
## import os
## os.environ["OPENAI_API_KEY"] = "YOUR KEY"

In [None]:
# if using Google Colab

import os
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('OPEN_AI_API_KEY')

You can also use a different judge model. See docs: https://docs.evidentlyai.com/metrics/customize_llm_judge#change-the-evaluator-llm

We will use a built-in "Correctness" LLM judge from Evidently.

In [None]:
eval_df.add_descriptors(descriptors=[
     CorrectnessLLMEval("answer", target_output="reference_answer"),
])

In [None]:
eval_df.as_dataframe()

Unnamed: 0,question,context,answer,reference_answer,ExactMatch,Semantic Similarity,BERTScore,Correctness,Correctness reasoning
0,Can I send $2000 to Brazil today?,Personal account holders are allowed to transfer up to $3000 USD per day to Brazil. Transfers above this threshold require additional verification steps. This limit applies regardless of the recipient’s bank.,"Yes, you can send $2000 to Brazil today waithout any restrictions.","Yes, you can send $2000 to Brazil today. The daily limit is $3000.",False,0.964713,0.837965,INCORRECT,"The OUTPUT states that there are 'waithout any restrictions', which contradicts the REFERENCE that specifies a daily limit of $3000. This changes the original meaning and introduces an inaccuracy regarding restrictions on sending money."
1,How do I block my card if it's lost?,"To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.","Go to the Cards section, select your card, and tap 'Block card' to block it instantly.","Open the app, go to Cards, choose your card, and tap 'Block card'. Blocking is immediate.",False,0.947559,0.827429,CORRECT,"The OUTPUT conveys the same facts and details as the REFERENCE. It mentions going to the Cards section, selecting the card, and tapping 'Block card' which aligns with the instructions in the REFERENCE. The term 'block it instantly' in the OUTPUT preserves the original meaning of 'Blocking is immediate' in the REFERENCE."
2,Do you offer loans in Argentina?,"FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.","Yes, FinBot offers personal loans in Argentina with competitive rates.","No, FinBot does not currently offer loans in Argentina.",False,0.953436,0.846422,INCORRECT,"The OUTPUT states that FinBot offers personal loans in Argentina, which directly contradicts the REFERENCE that states FinBot does not currently offer loans in Argentina."
3,Is there a fee for using an ATM in Mexico?,"ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.",You’ll be charged $2.50 when using a non-partner ATM in Mexico.,"Yes, the fee is $2.50 for non-partner ATMs. Partner ATMs are free.",False,0.860564,0.753561,INCORRECT,"The OUTPUT states that the fee of $2.50 applies when using a non-partner ATM in Mexico, but the REFERENCE does not specify any geographical limitations and only states the fee for non-partner ATMs without mentioning any location. This implies that the OUTPUT introduces a geographical context that was not present in the REFERENCE, thus changing the original meaning."
4,Can I cancel a transaction after it's sent?,"Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient’s bank must agree to reverse the transfer.",I am afraid I do not have information to answer this question.,"No, but you can submit a recall request. It depends on the recipient’s bank.",False,0.603113,0.571043,INCORRECT,"The OUTPUT states that there is no information to answer the question, which contradicts the REFERENCE that provides a specific instruction to submit a recall request depending on the recipient’s bank. Therefore, it omits key details and does not preserve the original meaning."


### Custom LLM judge - multi-class

Let's create a custom judge that will instead use 4 categories based on what we observe.

Let's re-import data so that we drop the existing descriptors:

In [None]:
eval_df_2 = Dataset.from_pandas(
    pd.DataFrame(eval_data),
    data_definition=definition)

Define the judge prompt using a built-in template. (You can also use a binary template.)

In [None]:
correctness_multiclass = MulticlassClassificationPromptTemplate(
    pre_messages=[("system", "You are a judge that evaluates the factual alignment of two texts.")],
    criteria="""You are given a new answer and a reference answer. Classify the new answer based on how it compares to the reference.
    ===
    Reference: {reference_answer} """,
    category_criteria={
        "fully_correct": "The answer matches the reference in all factual and semantic details.",
        "incomplete": "The answer is correct in what it says but leaves out details from the reference.",
        "adds_claims": "The answer does not contradict reference but introduces new claims not supported by the reference.",
        "contradictory": "The answer contradicts specific facts or meaning in the reference.",
    },
    uncertainty="unknown",
    include_reasoning=True,
    include_scores=False
)

Run the evaluator:

In [None]:
eval_df_2.add_descriptors(descriptors=[
    LLMEval("answer",
        template=correctness_multiclass,
        additional_columns={"reference_answer": "reference_answer"},
        provider="openai",
        model="gpt-4o-mini",
        alias="Multi-class correctness"
    )
])

In [None]:
eval_df_2.as_dataframe()

Unnamed: 0,question,context,answer,reference_answer,Multi-class correctness,Multi-class correctness reasoning
0,Can I send $2000 to Brazil today?,Personal account holders are allowed to transfer up to $3000 USD per day to Brazil. Transfers above this threshold require additional verification steps. This limit applies regardless of the recipient’s bank.,"Yes, you can send $2000 to Brazil today waithout any restrictions.","Yes, you can send $2000 to Brazil today. The daily limit is $3000.",incomplete,"The new answer correctly states that you can send $2000 to Brazil today, but it omits the important detail about the daily limit being $3000, which is a key piece of information from the reference."
1,How do I block my card if it's lost?,"To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.","Go to the Cards section, select your card, and tap 'Block card' to block it instantly.","Open the app, go to Cards, choose your card, and tap 'Block card'. Blocking is immediate.",fully_correct,"The new answer accurately describes the steps to block the card, including going to the Cards section, selecting the card, and tapping 'Block card'. It also states that the action is instant, which is aligned with the reference answer."
2,Do you offer loans in Argentina?,"FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.","Yes, FinBot offers personal loans in Argentina with competitive rates.","No, FinBot does not currently offer loans in Argentina.",contradictory,"The new answer states that FinBot offers personal loans in Argentina, which directly contradicts the reference that asserts FinBot does not offer loans in Argentina."
3,Is there a fee for using an ATM in Mexico?,"ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.",You’ll be charged $2.50 when using a non-partner ATM in Mexico.,"Yes, the fee is $2.50 for non-partner ATMs. Partner ATMs are free.",incomplete,"The new answer states that a $2.50 fee applies when using a non-partner ATM but does not mention the detail about partner ATMs being free, which is included in the reference answer."
4,Can I cancel a transaction after it's sent?,"Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient’s bank must agree to reverse the transfer.",I am afraid I do not have information to answer this question.,"No, but you can submit a recall request. It depends on the recipient’s bank.",incomplete,"The new answer states a lack of information to respond to the question, which does not directly contradict the reference. However, it fails to provide the critical information that one can submit a recall request, and that the outcome depends on the recipient's bank, which are essential details present in the reference."
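Once every row has a category, a simple tally gives a first summary of the run. The sketch below uses plain Python and the five labels copied from the table above. Treating only `fully_correct` as a pass is an assumption made for illustration, and other pass rules are equally possible.

```python
from collections import Counter

# Labels copied from the "Multi-class correctness" column above, in row order.
labels = ["incomplete", "fully_correct", "contradictory", "incomplete", "incomplete"]

counts = Counter(labels)
print(counts.most_common())
# [('incomplete', 3), ('fully_correct', 1), ('contradictory', 1)]

# Assumed pass rule: only "fully_correct" counts as a pass.
pass_rate = counts["fully_correct"] / len(labels)
print(pass_rate)  # 0.2
```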


# Reference-free evals

In production or high-volume testing, you may not have a reference answer. In this case, you can run open-ended evals that judge only the final generation. In many cases, you can also use supplementary information, such as the question and the retrieved context, in your evaluations.

Let's assume that now we don't have the target answer, but we have the retrieved context.

In [None]:
prod_data = eval_data[["question", "context", "answer"]].copy()

In [None]:
prod_data.head()

Unnamed: 0,question,context,answer
0,Can I send $2000 to Brazil today?,Personal account holders are allowed to transfer up to $3000 USD per day to Brazil. Transfers above this threshold require additional verification steps. This limit applies regardless of the recipient’s bank.,"Yes, you can send $2000 to Brazil today waithout any restrictions."
1,How do I block my card if it's lost?,"To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.","Go to the Cards section, select your card, and tap 'Block card' to block it instantly."
2,Do you offer loans in Argentina?,"FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.","Yes, FinBot offers personal loans in Argentina with competitive rates."
3,Is there a fee for using an ATM in Mexico?,"ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.",You’ll be charged $2.50 when using a non-partner ATM in Mexico.
4,Can I cancel a transaction after it's sent?,"Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient’s bank must agree to reverse the transfer.",I am afraid I do not have information to answer this question.


In [None]:
definition = DataDefinition(text_columns=["question", "context", "answer"])

In [None]:
prod_df = Dataset.from_pandas(
    pd.DataFrame(prod_data),
    data_definition=definition)

## Word presence

You can check if specific words are present in the outputs.

In [None]:
prod_df.add_descriptors(descriptors=[
     IncludesWords("answer",
              words_list=["hello", "hi", "good afternoon"],
              mode="any", alias="Says hi"),
      IncludesWords("answer",
                    words_list=["sorry", "apologies", "apologize", "cannot", "afraid"],
                    mode="any",
                    alias="Declines")
])

In [None]:
prod_df.as_dataframe()

Unnamed: 0,question,context,answer,Says hi,Declines
0,Can I send $2000 to Brazil today?,Personal account holders are allowed to transfer up to $3000 USD per day to Brazil. Transfers above this threshold require additional verification steps. This limit applies regardless of the recipient’s bank.,"Yes, you can send $2000 to Brazil today waithout any restrictions.",False,False
1,How do I block my card if it's lost?,"To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.","Go to the Cards section, select your card, and tap 'Block card' to block it instantly.",False,False
2,Do you offer loans in Argentina?,"FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.","Yes, FinBot offers personal loans in Argentina with competitive rates.",False,False
3,Is there a fee for using an ATM in Mexico?,"ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.",You’ll be charged $2.50 when using a non-partner ATM in Mexico.,False,False
4,Can I cancel a transaction after it's sent?,"Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient’s bank must agree to reverse the transfer.",I am afraid I do not have information to answer this question.,False,True
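Conceptually, a word-presence check tokenizes the text and tests membership. The function below is a simplified sketch and not the library's implementation. It lowercases the text, splits it into word tokens, and does no lemmatization, so multi-word entries such as "good afternoon" would never match here. The name `includes_words` is hypothetical.

```python
import re


def includes_words(text, words, mode="any"):
    """Return True if any (or all) of the given single words occur in the text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    hits = [word.lower() in tokens for word in words]
    if mode == "any":
        return any(hits)
    if mode == "all":
        return all(hits)
    raise ValueError(f"Unsupported mode: {mode!r}")


decline_words = ["sorry", "apologies", "apologize", "cannot", "afraid"]

refusal = "I am afraid I do not have information to answer this question."
confident = "Yes, FinBot offers personal loans in Argentina with competitive rates."

print(includes_words(refusal, decline_words))    # True
print(includes_words(confident, decline_words))  # False
```

Keyword checks like this are cheap and deterministic, but they only catch the phrasings you list in advance.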


## Text stats (length)

Check the character count and the sentence count of each answer.

In [None]:
prod_df = Dataset.from_pandas(
    pd.DataFrame(prod_data),
    data_definition=definition,
    descriptors=[
        TextLength("answer", alias="Symbol_Length"),
        SentenceCount("answer", alias="Sentence_Count")])

In [None]:
prod_df.as_dataframe()

Unnamed: 0,question,context,answer,Symbol_Length,Sentence_Count
0,Can I send $2000 to Brazil today?,Personal account holders are allowed to transfer up to $3000 USD per day to Brazil. Transfers above this threshold require additional verification steps. This limit applies regardless of the recipient’s bank.,"Yes, you can send $2000 to Brazil today waithout any restrictions.",66,1
1,How do I block my card if it's lost?,"To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.","Go to the Cards section, select your card, and tap 'Block card' to block it instantly.",86,1
2,Do you offer loans in Argentina?,"FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.","Yes, FinBot offers personal loans in Argentina with competitive rates.",70,1
3,Is there a fee for using an ATM in Mexico?,"ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.",You’ll be charged $2.50 when using a non-partner ATM in Mexico.,63,1
4,Can I cancel a transaction after it's sent?,"Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient’s bank must agree to reverse the transfer.",I am afraid I do not have information to answer this question.,62,1
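Both statistics are easy to approximate in plain Python. The sketch below is an assumption about how such counts can be computed, not the library's code. Length is the number of characters, and sentences are split after `.`, `!` or `?` only when whitespace follows, so a value like "$2.50" is not treated as a sentence boundary.

```python
import re


def text_length(text: str) -> int:
    """Number of characters in the text."""
    return len(text)


def sentence_count(text: str) -> int:
    """Count sentences by splitting after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])


refusal = "I am afraid I do not have information to answer this question."
print(text_length(refusal))     # 62
print(sentence_count(refusal))  # 1

two_sentences = "Yes, the fee is $2.50 for non-partner ATMs. Partner ATMs are free."
print(sentence_count(two_sentences))  # 2
```

The length of 62 matches the `Symbol_Length` value reported for the same answer in the table above.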


Depending on the use case, other deterministic checks may be relevant, such as `IsValidJSON()`.

## Semantic similarity

You can use semantic similarity between answer and context, or answer and question as proxies for hallucinations and relevance.

In [None]:
prod_df.add_descriptors(descriptors=[
     SemanticSimilarity(columns=["answer", "context"], alias="Hallucination proxy"),
     SemanticSimilarity(columns=["answer", "question"], alias="Relevance proxy")
])

In [None]:
prod_df.as_dataframe()

Unnamed: 0,question,context,answer,Symbol_Length,Sentence_Count,Hallucination proxy,Relevance proxy
0,Can I send $2000 to Brazil today?,Personal account holders are allowed to transfer up to $3000 USD per day to Brazil. Transfers above this threshold require additional verification steps. This limit applies regardless of the recipient’s bank.,"Yes, you can send $2000 to Brazil today waithout any restrictions.",66,1,0.835857,0.957484
1,How do I block my card if it's lost?,"To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.","Go to the Cards section, select your card, and tap 'Block card' to block it instantly.",86,1,0.821591,0.858594
2,Do you offer loans in Argentina?,"FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.","Yes, FinBot offers personal loans in Argentina with competitive rates.",70,1,0.810964,0.90973
3,Is there a fee for using an ATM in Mexico?,"ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.",You’ll be charged $2.50 when using a non-partner ATM in Mexico.,63,1,0.906232,0.897521
4,Can I cancel a transaction after it's sent?,"Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient’s bank must agree to reverse the transfer.",I am afraid I do not have information to answer this question.,62,1,0.560166,0.56581
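
To see how far such a proxy goes, you can apply a cut-off to the scores. The sketch below is plain Python over the answer-to-context scores copied from the table above. The threshold of 0.7 and the helper name `rows_below` are assumptions made for illustration, not recommended values.

```python
# Answer-to-context similarity scores copied from the table above (row index -> score).
hallucination_proxy = {0: 0.835857, 1: 0.821591, 2: 0.810964, 3: 0.906232, 4: 0.560166}

# Hypothetical cut-off, chosen only for illustration.
THRESHOLD = 0.7


def rows_below(scores, threshold):
    """Return the row indices whose score falls below the threshold."""
    return [row for row, score in scores.items() if score < threshold]


flagged = rows_below(hallucination_proxy, THRESHOLD)
print(flagged)
```

With this cut-off only row 4 is flagged, and that row is the answer that declines to respond. The answer about loans in Argentina (row 2), which the context does not support, still scores 0.81, so a similarity threshold alone would miss it.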


## LLM judge

Using an LLM judge to check for hallucinations (contradictions between the answer and the context) can give better results than a similarity proxy. Let's use the built-in Faithfulness judge.

In [None]:
prod_df_2 = Dataset.from_pandas(
    pd.DataFrame(prod_data),
    data_definition=definition)

In [None]:
prod_df_2.add_descriptors(descriptors=[
     FaithfulnessLLMEval("answer", context="context", alias="Faithfulness"),
     TextLength("answer", alias="Length")
])

In [None]:
prod_df_2.as_dataframe()

Unnamed: 0,question,context,answer,Faithfulness,Faithfulness reasoning,Length
0,Can I send $2000 to Brazil today?,Personal account holders are allowed to transfer up to $3000 USD per day to Brazil. Transfers above this threshold require additional verification steps. This limit applies regardless of the recipient’s bank.,"Yes, you can send $2000 to Brazil today waithout any restrictions.",UNFAITHFUL,"The response incorrectly states that $2000 can be sent to Brazil without any restrictions. According to the source, personal account holders can transfer up to $3000 per day, but the mention of 'without any restrictions' contradicts the information given that transfers above the threshold require additional verification steps.",66
1,How do I block my card if it's lost?,"To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.","Go to the Cards section, select your card, and tap 'Block card' to block it instantly.",FAITHFUL,"The text accurately describes the process to block a card as provided in the SOURCE. It states to go to the Cards section, select the card, and tap 'Block card', which aligns with the instructions in the SOURCE.",86
2,Do you offer loans in Argentina?,"FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.","Yes, FinBot offers personal loans in Argentina with competitive rates.",UNFAITHFUL,"The response states that FinBot offers loans in Argentina, which contradicts the information provided in the source that indicates loans are available only in the US, Canada, and selected EU countries.",70
3,Is there a fee for using an ATM in Mexico?,"ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.",You’ll be charged $2.50 when using a non-partner ATM in Mexico.,FAITHFUL,The statement accurately reflects the information from the source that non-partner ATMs incur a $2.50 fee per withdrawal in Mexico.,63
4,Can I cancel a transaction after it's sent?,"Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient’s bank must agree to reverse the transfer.",I am afraid I do not have information to answer this question.,FAITHFUL,"The response indicates a lack of information to answer the question, which aligns with the option to decline answering when sufficient information is not provided in the source. It does not contradict the source or introduce new information.",62


### Custom LLM judge

Let's create a custom helpfulness evaluator.

In [None]:
helpfulness = MulticlassClassificationPromptTemplate(
    pre_messages=[("system", "You are a helpfulness evaluator for chatbot responses.")],
    criteria="""You are given a user question and an assistant's answer.
    Classify the answer based on how helpful it is in responding to the user's intent.
    ===
    Question:
    {question}
    """,
    category_criteria={
        "helpful": "The answer directly addresses the user's intent and is actionable. It may provide steps, a relevant clarification, or trigger progress in the conversation.",
        "partially_helpful": "The answer gives some relevant information but likely misses part of the user's intent, lacks clear next steps, or is only marginally actionable.",
        "unhelpful": "The answer does not address the user's question meaningfully, ignores the intent, denies responding or provides vague, irrelevant, or generic replies.",
    },
    uncertainty="unknown",
    include_reasoning=True,
    include_scores=False
)
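
It can help to picture how the pieces of such a template might end up in a single prompt for the judge model. The sketch below is an assumption-laden illustration. `build_judge_prompt` is a hypothetical helper, and the wording it produces is not the prompt Evidently actually sends. It only shows the roles of `criteria`, `category_criteria`, and `uncertainty`.

```python
def build_judge_prompt(criteria, category_criteria, uncertainty, question, answer):
    """Assemble a classification prompt from the same pieces the template takes."""
    # Fill the {question} placeholder, as the additional column would.
    filled_criteria = criteria.format(question=question)
    # One bullet per category, with its description.
    category_lines = "\n".join(
        f"- {name}: {description}" for name, description in category_criteria.items()
    )
    return (
        f"{filled_criteria}\n"
        f"Answer:\n{answer}\n\n"
        f"Choose exactly one category:\n{category_lines}\n"
        f"If you cannot decide, reply '{uncertainty}'.\n"
        "Reply with the category and a short reasoning."
    )


prompt = build_judge_prompt(
    criteria="Classify how helpful the answer is.\nQuestion:\n{question}",
    category_criteria={
        "helpful": "Directly addresses the user's intent.",
        "partially_helpful": "Relevant but incomplete.",
        "unhelpful": "Does not address the question.",
    },
    uncertainty="unknown",
    question="How do I block my card if it's lost?",
    answer="Go to the Cards section, select your card, and tap 'Block card'.",
)
print(prompt)
```

Keeping the category descriptions short and mutually exclusive makes it easier to audit the judge's reasoning later.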

In [None]:
prod_df_2.add_descriptors(descriptors=[
    LLMEval("answer",
        template=helpfulness,
        additional_columns={"question": "question"},
        provider="openai",
        model="gpt-4o-mini",
        alias="Answer helpfulness"
    )
])

In [None]:
prod_df_2.as_dataframe()

Unnamed: 0,question,context,answer,Faithfulness,Faithfulness reasoning,Length,Answer helpfulness,Answer helpfulness reasoning
0,Can I send $2000 to Brazil today?,Personal account holders are allowed to transfer up to $3000 USD per day to Brazil. Transfers above this threshold require additional verification steps. This limit applies regardless of the recipient’s bank.,"Yes, you can send $2000 to Brazil today waithout any restrictions.",UNFAITHFUL,"The response incorrectly states that $2000 can be sent to Brazil without any restrictions. According to the source, personal account holders can transfer up to $3000 per day, but the mention of 'without any restrictions' contradicts the information given that transfers above the threshold require additional verification steps.",66,partially_helpful,"The answer confirms that sending $2000 to Brazil today is possible, which addresses the user's intent. However, it fails to provide additional context or information regarding how to send the money, potential fees, or other requirements that may be necessary, making it only partially helpful."
1,How do I block my card if it's lost?,"To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.","Go to the Cards section, select your card, and tap 'Block card' to block it instantly.",FAITHFUL,"The text accurately describes the process to block a card as provided in the SOURCE. It states to go to the Cards section, select the card, and tap 'Block card', which aligns with the instructions in the SOURCE.",86,helpful,"The answer directly addresses the user's intent by providing clear and actionable steps on how to block a lost card, fulfilling the user's request effectively."
2,Do you offer loans in Argentina?,"FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.","Yes, FinBot offers personal loans in Argentina with competitive rates.",UNFAITHFUL,"The response states that FinBot offers loans in Argentina, which contradicts the information provided in the source that indicates loans are available only in the US, Canada, and selected EU countries.",70,helpful,"The answer directly confirms that loans are offered in Argentina and specifies that they have competitive rates, which effectively addresses the user's question and provides actionable information."
3,Is there a fee for using an ATM in Mexico?,"ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.",You’ll be charged $2.50 when using a non-partner ATM in Mexico.,FAITHFUL,The statement accurately reflects the information from the source that non-partner ATMs incur a $2.50 fee per withdrawal in Mexico.,63,helpful,The answer directly addresses the user's question about ATM fees in Mexico by specifying a charge for using a non-partner ATM. This provides the user with actionable information regarding their inquiry.
4,Can I cancel a transaction after it's sent?,"Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient’s bank must agree to reverse the transfer.",I am afraid I do not have information to answer this question.,FAITHFUL,"The response indicates a lack of information to answer the question, which aligns with the option to decline answering when sufficient information is not provided in the source. It does not contradict the source or introduce new information.",62,unhelpful,The answer states a lack of information and does not attempt to address the user's question regarding the possibility of canceling a transaction after it has been sent. This does not provide any relevant information or guidance to the user.


# Reports: summarize evals

After you add descriptors to the dataset, you can create a Report that summarizes the distribution of all scores. It renders directly in Jupyter/Colab.

In [None]:
report = Report([
    TextEvals()
])

my_eval = report.run(prod_df_2)
my_eval
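
For categorical descriptors, the summary boils down to counting how often each label occurs. The sketch below reproduces that idea in plain Python with the labels copied from the table above. The helper name `label_shares` is hypothetical, and the snippet is not what the Report runs internally.

```python
from collections import Counter

# Labels copied from the "Faithfulness" and "Answer helpfulness" columns above.
faithfulness = ["UNFAITHFUL", "FAITHFUL", "UNFAITHFUL", "FAITHFUL", "FAITHFUL"]
helpfulness = ["partially_helpful", "helpful", "helpful", "helpful", "unhelpful"]


def label_shares(labels):
    """Map each label to the share of rows that carry it."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


print(label_shares(faithfulness))
print(label_shares(helpfulness))
```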

You can also export the result:

In [None]:
# my_eval.json()
# my_eval.dict()
# my_eval.save_html("file.html")

# Advanced: add Tests  

You can also add tests to get pass/fail results instead of just the scores.

In [None]:
report = Report([
    TextEvals(),
    CategoryCount(column="Answer helpfulness", category="unhelpful", tests=[eq(0)]),  # expect no unhelpful answers
    CategoryCount(column="Faithfulness", category="UNFAITHFUL", tests=[eq(0)]),  # expect all answers to be faithful
    OutRangeValueCount(column="Length", left=10, right=70, tests=[eq(0)])  # expect every answer to be 10-70 symbols long
])

my_eval = report.run(prod_df_2)

In [None]:
my_eval

You can also set softer conditions, such as allowing up to 20% of responses to be unhelpful or outside the expected length range.

In [None]:
report = Report([
    TextEvals(),
    CategoryCount(column="Answer helpfulness", category="unhelpful", share_tests=[lte(0.2)]),
    CategoryCount(column="Faithfulness", category="UNFAITHFUL", tests=[eq(0)]),
    OutRangeValueCount(column="Length", left=10, right=70, share_tests=[lte(0.2)])
])

my_eval = report.run(prod_df_2)

In [None]:
my_eval
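
The difference between a condition on the count (`tests`) and a condition on the share (`share_tests`) is easy to check by hand. The sketch below recomputes both in plain Python from the labels and lengths in the tables above. The helper names are hypothetical, and treating both length bounds as inclusive is an assumption, so the library's own results may differ at the boundaries.

```python
# Values copied from the "Answer helpfulness", "Faithfulness" and "Length" columns above.
helpfulness = ["partially_helpful", "helpful", "helpful", "helpful", "unhelpful"]
faithfulness = ["UNFAITHFUL", "FAITHFUL", "UNFAITHFUL", "FAITHFUL", "FAITHFUL"]
lengths = [66, 86, 70, 63, 62]


def category_count(labels, category):
    """Number of rows carrying the given label."""
    return sum(1 for label in labels if label == category)


def out_of_range_count(values, left, right):
    """Number of values outside [left, right]; inclusive bounds are an assumption."""
    return sum(1 for value in values if value < left or value > right)


n = len(lengths)
unhelpful = category_count(helpfulness, "unhelpful")
unfaithful = category_count(faithfulness, "UNFAITHFUL")
out_of_range = out_of_range_count(lengths, 10, 70)

results = {
    "strict: unhelpful count == 0": unhelpful == 0,
    "soft: unhelpful share <= 0.2": unhelpful / n <= 0.2,
    "strict: unfaithful count == 0": unfaithful == 0,
    "soft: out-of-range length share <= 0.2": out_of_range / n <= 0.2,
}
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

On this toy data, one unhelpful answer out of five fails the strict count condition but passes the 20% share condition, while the two unfaithful answers fail the strict faithfulness condition either way.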

# Advanced: multi-turn conversation evals

For dialog systems, it can be useful to analyze multiple responses together. This allows you to assess metrics that apply to the entire session, such as coherence, consistency, and overall user satisfaction.

Let's create a toy dataset.

In [None]:
multi_turn_dialogs = {
    "Session": [
        # 1. Sending money
        "User: Hi, I need to send money to my cousin in Brazil today.\n"
        "Assistant: Sure! You can transfer up to $3000 USD per day to Brazil.\n"
        "User: I’m sending $2000. Is that okay?\n"
        "Assistant: Yes, that’s fine and doesn’t require any additional steps.\n"
        "User: Great. The recipient’s name is Carla Mendes, CPF number 123.456.789-00. Is that all you need?\n"
        "Assistant: Yes, that's enough to proceed. Please confirm the recipient's bank details on the transfer screen.\n"
        "User: Confirmed, thank you!",

        # 2. Lost card
        "User: I lost my debit card! What do I do?\n"
        "Assistant: Don’t worry. You can block your card immediately through the FinBot app.\n"
        "User: How?\n"
        "Assistant: Open the app, go to the Cards section, select the lost card, and tap 'Block card'. It’ll be instant.\n"
        "User: Is it reversible?\n"
        "Assistant: No, once blocked, the card can't be unblocked. You’ll need to order a replacement.\n"
        "User: Got it. Doing it now. Thanks!",

        # 3. Financial market
        "User: Do you know what the average interest rate is for personal loans in Argentina?\n"
        "Assistant: I'm sorry, I can’t provide general financial market information like average interest rates in specific countries.\n"
        "User: Oh, I thought you might have that info.\n"
        "Assistant: I can help with FinBot-specific loan offerings, if available, but I recommend checking with a local financial advisor or bank for up-to-date rates.\n"
        "User: Okay, thanks anyway.",

        # 4. ATM fees
        "User: I'm traveling to Mexico next week. Will I be charged ATM fees there?\n"
        "Assistant: If you use a partner ATM, withdrawals are free. Non-partner ATMs charge $2.50 per withdrawal.\n"
        "User: How do I know which ones are partners?\n"
        "Assistant: You can find a list of partner ATMs in the FinBot app’s “ATM Finder” section.\n"
        "User: Awesome, thanks!",

        # 5. Cancel transaction
        "User: I just sent money to the wrong person! Can I cancel it?\n"
        "Assistant: Unfortunately, once a transaction is processed, it can't be canceled.\n"
        "User: Is there anything I can do?\n"
        "Assistant: You can submit a recall request, but success depends on the recipient's bank cooperation.\n"
        "User: Okay, I'll try that. How do I submit it?\n"
        "Assistant: In the app, go to the transaction details and tap ‘Request Recall’. Follow the steps there.\n"
        "User: Got it, thanks for your help."
    ]
}

multi_turn_df = pd.DataFrame(multi_turn_dialogs)

In [None]:
for entry in multi_turn_df.Session:
    print(entry)
    print('-----')

User: Hi, I need to send money to my cousin in Brazil today.
Assistant: Sure! You can transfer up to $3000 USD per day to Brazil.
User: I’m sending $2000. Is that okay?
Assistant: Yes, that’s fine and doesn’t require any additional steps.
User: Great. The recipient’s name is Carla Mendes, CPF number 123.456.789-00. Is that all you need?
Assistant: Yes, that's enough to proceed. Please confirm the recipient's bank details on the transfer screen.
User: Confirmed, thank you!
-----
User: I lost my debit card! What do I do?
Assistant: Don’t worry. You can block your card immediately through the FinBot app.
User: How?
Assistant: Open the app, go to the Cards section, select the lost card, and tap 'Block card'. It’ll be instant.
User: Is it reversible?
Assistant: No, once blocked, the card can't be unblocked. You’ll need to order a replacement.
User: Got it. Doing it now. Thanks!
-----
User: Do you know what the average interest rate is for personal loans in Argentina?
Assistant: I'm sorry, I

In [None]:
prod_df_3 = Dataset.from_pandas(
    pd.DataFrame(multi_turn_df),
    data_definition=definition)

Now we apply LLM judges that check each complete conversation for PII and for declines (the assistant refusing to answer).

In [None]:
prod_df_3.add_descriptors(descriptors=[
     DeclineLLMEval("Session", include_reasoning=False),
     PIILLMEval("Session", include_reasoning=True)
])

In [None]:
prod_df_3.as_dataframe()

Unnamed: 0,Session,Decline,PII,PII reasoning
0,"User: Hi, I need to send money to my cousin in Brazil today.\nAssistant: Sure! You can transfer up to $3000 USD per day to Brazil.\nUser: I’m sending $2000. Is that okay?\nAssistant: Yes, that’s fine and doesn’t require any additional steps.\nUser: Great. The recipient’s name is Carla Mendes, CPF number 123.456.789-00. Is that all you need?\nAssistant: Yes, that's enough to proceed. Please confirm the recipient's bank details on the transfer screen.\nUser: Confirmed, thank you!",OK,PII,"The text contains a recipient's name (Carla Mendes) and a CPF number (123.456.789-00), both of which are considered personally identifiable information (PII) as they can be used to identify an individual."
1,"User: I lost my debit card! What do I do?\nAssistant: Don’t worry. You can block your card immediately through the FinBot app.\nUser: How?\nAssistant: Open the app, go to the Cards section, select the lost card, and tap 'Block card'. It’ll be instant.\nUser: Is it reversible?\nAssistant: No, once blocked, the card can't be unblocked. You’ll need to order a replacement.\nUser: Got it. Doing it now. Thanks!",OK,OK,"The text does not contain any personally identifiable information. It discusses a general process regarding a lost debit card without revealing any specific personal information, such as names, addresses, or other identifiers."
2,"User: Do you know what the average interest rate is for personal loans in Argentina?\nAssistant: I'm sorry, I can’t provide general financial market information like average interest rates in specific countries.\nUser: Oh, I thought you might have that info.\nAssistant: I can help with FinBot-specific loan offerings, if available, but I recommend checking with a local financial advisor or bank for up-to-date rates.\nUser: Okay, thanks anyway.",DECLINE,OK,The conversation does not contain any personally identifiable information (PII) as it discusses general financial information without identifying any individual or including any personal details.
3,"User: I'm traveling to Mexico next week. Will I be charged ATM fees there?\nAssistant: If you use a partner ATM, withdrawals are free. Non-partner ATMs charge $2.50 per withdrawal.\nUser: How do I know which ones are partners?\nAssistant: You can find a list of partner ATMs in the FinBot app’s “ATM Finder” section.\nUser: Awesome, thanks!",OK,OK,"The text does not contain any personally identifiable information (PII). It discusses ATM fees and locations without revealing any individual's name, address, or any other identifiable details."
4,"User: I just sent money to the wrong person! Can I cancel it?\nAssistant: Unfortunately, once a transaction is processed, it can't be canceled.\nUser: Is there anything I can do?\nAssistant: You can submit a recall request, but success depends on the recipient's bank cooperation.\nUser: Okay, I'll try that. How do I submit it?\nAssistant: In the app, go to the transaction details and tap ‘Request Recall’. Follow the steps there.\nUser: Got it, thanks for your help.",OK,OK,The text does not contain any personally identifiable information (PII). It consists of a conversation about canceling a money transaction and does not disclose any details that could identify an individual.
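
For identifiers with a fixed format, a deterministic check can run alongside the LLM judge. The sketch below is a different, simpler technique than the `PIILLMEval` judge used above. It uses a regular expression that assumes CPF numbers are written as `000.000.000-00`, and the helper name `find_cpf_numbers` is hypothetical. It would not catch names or free-form personal details, which is where the judge remains useful.

```python
import re

# Assumption: CPF numbers appear in the punctuated form 000.000.000-00.
CPF_PATTERN = re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b")


def find_cpf_numbers(text):
    """Return every CPF-formatted number found in the text."""
    return CPF_PATTERN.findall(text)


# Shortened excerpts from the first two sessions above.
sessions = [
    "User: Great. The recipient's name is Carla Mendes, CPF number 123.456.789-00. Is that all you need?",
    "User: I lost my debit card! What do I do?",
]

for index, session in enumerate(sessions):
    print(index, find_cpf_numbers(session))
```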


In [None]:
raw_dialog_data = prod_df_3.as_dataframe()
raw_dialog_data[(raw_dialog_data["Decline"] == "DECLINE") | (raw_dialog_data["PII"] == "PII")]

Unnamed: 0,Session,Decline,PII,PII reasoning
0,"User: Hi, I need to send money to my cousin in Brazil today.\nAssistant: Sure! You can transfer up to $3000 USD per day to Brazil.\nUser: I’m sending $2000. Is that okay?\nAssistant: Yes, that’s fine and doesn’t require any additional steps.\nUser: Great. The recipient’s name is Carla Mendes, CPF number 123.456.789-00. Is that all you need?\nAssistant: Yes, that's enough to proceed. Please confirm the recipient's bank details on the transfer screen.\nUser: Confirmed, thank you!",OK,PII,"The text contains a recipient's name (Carla Mendes) and a CPF number (123.456.789-00), both of which are considered personally identifiable information (PII) as they can be used to identify an individual."
2,"User: Do you know what the average interest rate is for personal loans in Argentina?\nAssistant: I'm sorry, I can’t provide general financial market information like average interest rates in specific countries.\nUser: Oh, I thought you might have that info.\nAssistant: I can help with FinBot-specific loan offerings, if available, but I recommend checking with a local financial advisor or bank for up-to-date rates.\nUser: Okay, thanks anyway.",DECLINE,OK,The conversation does not contain any personally identifiable information (PII) as it discusses general financial information without identifying any individual or including any personal details.


In [None]:
report = Report([
    TextEvals()
])

my_eval = report.run(prod_df_3)
my_eval

You can upload evaluation results to Evidently Cloud to keep track of all your runs and debug them in the UI. See the quickstart for examples: https://docs.evidentlyai.com/quickstart_llm