<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://arize.com/docs/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>

# <center>Evaluation using Pydantic Evals</center>

Pydantic offers an evaluation library, Pydantic Evals, that can run preset direct evaluations, such as checking whether an output matches a Pydantic model, as well as LLM Judge evaluations. These evals are normally run over a dataset of cases defined with Pydantic. However, you may want to evaluate real traces rather than presaved cases.

This notebook shows how to use Pydantic Evals alongside Arize Phoenix to run evals on traces captured from your running application.

<img width=500px src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/pydantic-eval-diagram.png" />

*Note: Phoenix includes its own evals package, but it is also designed to work with other eval packages such as Pydantic Evals.*

## Install dependencies

In [None]:
!pip install -q pydantic-evals arize-phoenix openai openinference-instrumentation-openai python-dotenv nest-asyncio "httpx<0.28.0,>=0.23.0"

## Setup API keys and imports

In [2]:
import os

import dotenv
import phoenix as px
from openai import OpenAI
from pydantic_evals import Case, Dataset

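# Load OPENAI_API_KEY (and PHOENIX_API_KEY, if you use one) from a local .env file.
# If you don't keep a .env file, uncomment the lines below to enter the key interactively:
# from getpass import getpass
# if os.getenv("OPENAI_API_KEY") is None:
#     os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

dotenv.load_dotenv()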

True

## Enable Phoenix Tracing

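Sign up for a free instance of [Phoenix Cloud](https://app.phoenix.arize.com) to get your API key. If you'd prefer, you can instead [self-host Phoenix](https://arize.com/docs/phoenix/deployment). The cell below points at a self-hosted instance running at `http://localhost:6006`; change `PHOENIX_BASE` if your instance lives elsewhere.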

In [3]:
PHOENIX_BASE = "http://localhost:6006"
PHOENIX_COLLECTOR = f"{PHOENIX_BASE}/v1/traces"
PROJECT = "pydantic-evals-tutorial"
PHOENIX_API_KEY = os.getenv("PHOENIX_API_KEY")  # read from the environment (or .env), if set
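
If you switch between a local and a hosted instance, deriving the collector endpoint from a single base URL helps keep the two values in sync. The following is a minimal sketch, not part of the Phoenix API: the helper `collector_endpoint` and the environment variable name `PHOENIX_BASE_URL` are hypothetical and used only for illustration.

```python
import os


def collector_endpoint(base_url: str) -> str:
    """Join a Phoenix base URL with the traces path used in this notebook."""
    # Strip any trailing slash so the result never contains "//v1/traces".
    return base_url.rstrip("/") + "/v1/traces"


# Hypothetical variable name; falls back to the local default used in this notebook.
base = os.getenv("PHOENIX_BASE_URL", "http://localhost:6006")
print(collector_endpoint(base))
```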


In [7]:
from phoenix.otel import register

tracer_provider = register(
    project_name=PROJECT,  # traces are grouped under this Phoenix project
    endpoint=PHOENIX_COLLECTOR,  # collector endpoint defined in the previous cell
    auto_instrument=True,
    batch=True,
    verbose=True,
)
tracer = tracer_provider.get_tracer(__name__)




🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: pydantic-evals-tutorial
|  Span Processor: BatchSpanProcessor
|  Collector Endpoint: http://localhost:6006/v1/traces
|  Transport: HTTP + protobuf
|  Transport Headers: {'authorization': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



## Create Example Traces to Evaluate

Next, we'll run some example inputs through an LLM call to generate traces that we can evaluate. In practice, you'd likely already have an application you're tracing that you'd want to evaluate instead.

In [8]:
client = OpenAI()

inputs = [
    "What is the capital of France?",
    "Who wrote Romeo and Juliet?",
    "What is the largest planet in our solar system?",
]


def generate_trace(question):
    # The instrumented OpenAI client records each call as a trace in Phoenix
    client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant. Only respond with the answer to the question as a single word or proper noun.",
            },
            {"role": "user", "content": question},
        ],
    )


for question in inputs:
    generate_trace(question)


You should now see three traces captured in your Phoenix instance. If you don't see them right away, make sure you've selected the `pydantic-evals-tutorial` project.

## Export Traces from Phoenix

Next, export those traces from Phoenix so that you can evaluate them using Pydantic Evals.

In [None]:
from phoenix.trace.dsl import SpanQuery

query = SpanQuery().select(
    input="llm.input_messages",
    output="llm.output_messages",
)

# The Phoenix Client runs this query and returns the matching spans as a dataframe.
spans = px.Client().query_spans(query, project_name=PROJECT)
# Each input holds the system message at index 0 and the user question at index 1.
spans["input"] = spans["input"].apply(lambda x: x[1].get("message").get("content"))
# The first output message holds the model's answer.
spans["output"] = spans["output"].apply(lambda x: x[0].get("message").get("content"))
spans.head()
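
To make the nested lookups in the cell above easier to follow, here is a small standalone sketch of the same extraction on a plain Python list. The helper `message_content` and the example value are hypothetical; the shape of the messages is assumed from the `x[1].get("message").get("content")` lookups above, and real span data may carry additional fields.

```python
from typing import Any, Optional


def message_content(messages: list[dict[str, Any]], index: int) -> Optional[str]:
    """Return the text content of the message at `index`, or None if it is absent."""
    if index >= len(messages):
        return None
    return messages[index].get("message", {}).get("content")


# Hypothetical example value shaped like the input messages of one span.
example_input = [
    {"message": {"role": "system", "content": "You are a helpful assistant."}},
    {"message": {"role": "user", "content": "What is the capital of France?"}},
]

print(message_content(example_input, 1))  # the user question
print(message_content(example_input, 5))  # None, since there is no message at that position
```

Unlike the inline lambdas, this version returns `None` instead of raising when a span has fewer messages than expected, which can help if your project also contains traces from other applications.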

## Define the Evaluation Dataset
Create a dataset of test cases using Pydantic Evals for a question-answering task.
1. Each Case represents a single test with an input (question) and an expected output (answer).
2. The Dataset aggregates these cases for evaluation.

In [8]:
cases = [
    Case(
        name="capital of France", inputs="What is the capital of France?", expected_output="Paris"
    ),
    Case(
        name="author of Romeo and Juliet",
        inputs="Who wrote Romeo and Juliet?",
        expected_output="William Shakespeare",
    ),
    Case(
        name="largest planet",
        inputs="What is the largest planet in our solar system?",
        expected_output="Jupiter",
    ),
]

## Setup LLM task, Evaluator, and Dataset for Pydantic

Pydantic Evals requires a task to run each case through. Since you've already run this task for each input (represented by the traces you captured above), the task here simply retrieves the corresponding output from your dataframe of exported traces.

In [11]:
import nest_asyncio

nest_asyncio.apply()


async def task(question: str) -> str:
    # Look up the output already captured for this question in the exported spans
    output = spans[spans["input"] == question]["output"].values[0]
    return output
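
The lookup above raises an `IndexError` if no exported span matches a case's input. As a sketch of one way to make that failure easier to diagnose, the same lookup can be expressed with a plain dictionary. The helpers `build_output_lookup` and `lookup_output` and the example rows are hypothetical; with the real dataframe you could pass `zip(spans["input"], spans["output"])` as the rows.

```python
from collections.abc import Iterable


def build_output_lookup(rows: Iterable[tuple[str, str]]) -> dict[str, str]:
    """Map each input to its output; if an input repeats, the first output wins."""
    lookup: dict[str, str] = {}
    for question, answer in rows:
        lookup.setdefault(question, answer)
    return lookup


def lookup_output(lookup: dict[str, str], question: str) -> str:
    """Return the recorded output, with a descriptive error if no trace matches."""
    if question not in lookup:
        raise KeyError(f"No exported trace found for input: {question!r}")
    return lookup[question]


# Hypothetical rows standing in for the input and output columns of the exported spans.
rows = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote Romeo and Juliet?", "Shakespeare"),
    ("What is the capital of France?", "Paris."),
]
lookup = build_output_lookup(rows)
print(lookup_output(lookup, "What is the capital of France?"))
```

Keeping the first output for a repeated input mirrors the `.values[0]` behavior of the dataframe version.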

Then create a basic evaluator that checks whether the output matches the expected value exactly.

In [9]:
from pydantic_evals.evaluators import Evaluator, EvaluatorContext


class MatchesExpectedOutput(Evaluator[str, str]):
    def evaluate(self, ctx: EvaluatorContext[str, str]) -> bool:
        # A boolean result is reported by Pydantic Evals as an assertion
        return ctx.expected_output == ctx.output

In [10]:
dataset = Dataset(
    cases=cases,
    evaluators=[MatchesExpectedOutput()],
)

## Run your experiment and evaluation

Now with everything connected up, you can run your evaluation using Pydantic:

In [None]:
report = dataset.evaluate_sync(task)
print(report)

## Redefine Eval to be LLM-powered or Semantic

That evaluation works, but an exact match is too strict for most real-world settings. Try adding two other kinds of evaluators: a fuzzy match eval and an LLM Judge eval.

In [13]:
from difflib import SequenceMatcher


class FuzzyMatchesOutput(Evaluator[str, str]):
    def evaluate(self, ctx: EvaluatorContext[str, str]) -> bool:
        # Use fuzzy matching to compare expected and actual outputs
        similarity = SequenceMatcher(None, ctx.expected_output, ctx.output).ratio()
        # Consider it correct if similarity is above 0.8 (80%)
        return similarity > 0.8


dataset.add_evaluator(FuzzyMatchesOutput())

In [14]:
from pydantic_evals.evaluators import LLMJudge

dataset.add_evaluator(
    LLMJudge(
        rubric="Output and Expected Output should represent the same answer, even if the text doesn't match exactly",
        include_input=True,
        model="openai:gpt-4o-mini",
    ),
)

In [None]:
report = dataset.evaluate_sync(task)
print(report)

You should now see that the LLM Judge at least catches that "Shakespeare" and "William Shakespeare" represent the same answer.
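
To see why the fuzzy match can still miss this case, here is a standalone check of the similarity ratio, assuming the model answered "Shakespeare" (your model's actual output may differ). `SequenceMatcher.ratio()` returns twice the number of matching characters divided by the combined length of both strings, so the 11 shared characters out of 30 total fall short of the 0.8 threshold.

```python
from difflib import SequenceMatcher


def similarity_ratio(a: str, b: str) -> float:
    """Character-level similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


ratio = similarity_ratio("William Shakespeare", "Shakespeare")
print(round(ratio, 3))  # 0.733
print(ratio > 0.8)  # False
```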

## Upload Labels to Phoenix

As a final step, you can now upload your eval results to Phoenix to capture them in the UI.

In [None]:
results = report.model_dump()

In [17]:
# Create a dataframe for each eval
meo_spans = spans.copy()
fuzzy_label_spans = spans.copy()
llm_label_spans = spans.copy()

for case in results.get("cases"):
    # Phoenix expects a "label" column, so start by extracting the eval result from each row
    assertions = case.get("assertions")
    meo_label = assertions.get("MatchesExpectedOutput").get("value")
    fuzzy_label = assertions.get("FuzzyMatchesOutput").get("value")
    llm_label = assertions.get("LLMJudge").get("value")

    question = case.get("inputs")

    # Update the label in each dataframe where the input value matches
    meo_spans.loc[meo_spans["input"] == question, "label"] = str(meo_label)
    fuzzy_label_spans.loc[fuzzy_label_spans["input"] == question, "label"] = str(fuzzy_label)
    llm_label_spans.loc[llm_label_spans["input"] == question, "label"] = str(llm_label)

# Phoenix can also take in a numeric score for each row which it uses to calculate overall metrics.
# Labels are the strings "True"/"False", so compare explicitly: any non-empty string is truthy.
meo_spans["score"] = meo_spans["label"].apply(lambda x: 1 if x == "True" else 0)
fuzzy_label_spans["score"] = fuzzy_label_spans["label"].apply(lambda x: 1 if x == "True" else 0)
llm_label_spans["score"] = llm_label_spans["label"].apply(lambda x: 1 if x == "True" else 0)
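
Because the labels are stored as strings, the mapping from label to score deserves care: in Python, the string `"False"` is truthy. The sketch below uses a hypothetical helper, `label_to_score`, to show an explicit comparison that maps only a true label to 1.

```python
def label_to_score(label: object) -> int:
    """Return 1 only for the boolean True or the string "True", otherwise 0."""
    return 1 if str(label) == "True" else 0


print(bool("False"))  # True, because any non-empty string is truthy
print([label_to_score(x) for x in ["True", "False", True, False, None]])  # [1, 0, 1, 0, 0]
```

Rows with no label at all (for example, spans that did not match any case) also score 0 under this mapping.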

In [None]:
meo_spans.head()

In [None]:
from phoenix.trace import SpanEvaluations

# Upload your data to Phoenix:
px.Client().log_evaluations(
    SpanEvaluations(
        dataframe=meo_spans,
        eval_name="Direct Match Eval",
    ),
    SpanEvaluations(
        dataframe=fuzzy_label_spans,
        eval_name="Fuzzy Match Eval",
    ),
    SpanEvaluations(
        dataframe=llm_label_spans,
        eval_name="LLM Match Eval",
    ),
)

![results_in_phoenix](https://storage.googleapis.com/arize-phoenix-assets/assets/images/pydantic-evals-results.png)

#### For more on LLM Evaluation, check out our [Arize Master Guide to LLM Evaluation](https://arize.com/llm-evaluation)!