# Judge Evaluations with Evidently

Evidently is an open-source Python library for evaluating, testing, and monitoring ML models and LLM applications. It provides tools to assess data quality, detect drift, and evaluate LLM outputs using techniques like LLM-as-a-Judge. Evidently generates interactive reports and dashboards to help you understand model performance and behavior in production.

This lesson demonstrates how to evaluate LLM responses using Evidently's LLM-as-a-Judge capabilities. You'll learn to compare generated answers against reference answers and create evaluation reports.

This lesson is based on [one of the tutorials from Evidently's LLM evaluation course](https://github.com/evidentlyai/community-examples/blob/main/learn/LLMCourse_Tutorial_1_Intro_to_LLM_evals_methods.ipynb).

First, install evidently:

In [None]:
!uv add evidently

## Load and Prepare Data

Load the GitHub documentation data that will serve as our reference answers:

In [None]:
import docs

github_data = docs.read_github_data()
parsed_data = docs.parse_data(github_data)

file_index = {d['filename']: d['content'] for d in parsed_data}

Read the run results from the previous iteration:

In [None]:
import pickle

with open('sample_eval_rows.bin', 'rb') as f_in:
    rows = pickle.load(f_in)

Create a DataFrame with evaluation rows and map each question to its reference content:

In [None]:
import pandas as pd

df_evals = pd.DataFrame(rows)

df_evals['filename'] = df_evals.original_question.apply(lambda x: x['filename'])
df_evals['reference'] = df_evals.filename.apply(file_index.get)
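
Because `file_index.get` returns `None` for a filename that is not in the index, some rows can end up without a reference. Here is a minimal sketch of a check for that, written against plain Python rows rather than the DataFrame. The helper name `find_missing_references` and the sample data are hypothetical and only illustrate the idea:

```python
def find_missing_references(rows, file_index):
    """Return the filenames from `rows` that have no entry in `file_index`."""
    missing = []
    for row in rows:
        filename = row["original_question"]["filename"]
        if file_index.get(filename) is None:
            missing.append(filename)
    return missing


# Hypothetical sample data, only to illustrate the check
sample_index = {"docs/intro.mdx": "Intro text"}
sample_rows = [
    {"original_question": {"filename": "docs/intro.mdx"}},
    {"original_question": {"filename": "docs/removed.mdx"}},
]

print(find_missing_references(sample_rows, sample_index))
```

If the list is not empty, the judge would compare those answers against an empty reference, so it may be worth dropping or fixing such rows before evaluating.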


## Configure LLM Evaluation

Import the necessary Evidently components for LLM evaluation:

In [None]:
from evidently import Dataset, DataDefinition
from evidently.descriptors import LLMEval
from evidently.llm.templates import MulticlassClassificationPromptTemplate

Define the evaluation criteria using a multiclass classification template. This template instructs the LLM judge to categorize answers into `match`, `partial_match`, `mismatch`, or `not_available`:

In [None]:
matcher = MulticlassClassificationPromptTemplate(
    pre_messages=[
        ("system", "You are a judge that evaluates the factual alignment of two chatbot answers.")
    ],
    criteria="""
    You are given a question, a new answer and a reference answer. 
    Classify the new answer based on how it compares to the reference.
    ===
    Question: {question}
    Reference: {reference}
    """,
    category_criteria={
        "match": "The answer matches the reference in all factual and semantic details.",
        "partial_match": "The answer is correct in what it says but leaves out details from the reference.",
        "mismatch": "The answer doesn't match the reference answer.",
        "not_available": "The answer says that information is not available.",
    },
    uncertainty="unknown",
    include_reasoning=True,
    include_scores=False
)
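
The `{question}` and `{reference}` placeholders in `criteria` are filled in per row. Evidently does this rendering internally, so the sketch below is not its implementation. It only illustrates the substitution idea with plain `str.format`, using a hypothetical helper `fill_criteria` and a made-up row:

```python
criteria = """
You are given a question, a new answer and a reference answer.
Classify the new answer based on how it compares to the reference.
===
Question: {question}
Reference: {reference}
"""


def fill_criteria(template, row):
    """Substitute the row's question and reference into the template text."""
    return template.format(question=row["question"], reference=row["reference"])


# Hypothetical row, only for illustration
sample_row = {
    "question": "How do I install it?",
    "reference": "Install the package with pip.",
}

filled = fill_criteria(criteria, sample_row)
print(filled)
```

Printing the filled text for one or two rows is a cheap way to see what context the judge receives alongside the answer being classified.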


## Create Evaluation Dataset

Create an Evidently `Dataset` with an `LLMEval` descriptor. The descriptor evaluates each answer against its reference using `gpt-4o-mini`:

In [None]:
eval_dataset = Dataset.from_pandas(
    data=df_evals,
    data_definition=DataDefinition(),
    descriptors=[
        LLMEval(
            column_name="answer",
            additional_columns={"question": "question", "reference": "reference"},
            template=matcher,
            provider="openai",
            model="gpt-4o-mini",
            alias="eval"
        )
    ]
)
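
The descriptor above sends requests to OpenAI, so an API key must be available. A minimal sketch of a pre-flight check, assuming the key is supplied through the `OPENAI_API_KEY` environment variable (the helper name `has_openai_key` is hypothetical):

```python
import os


def has_openai_key(env=None):
    """Return True if a non-empty OPENAI_API_KEY is present in the environment."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    print("OPENAI_API_KEY is not set; the LLM judge calls will likely fail.")
```

Running a check like this before building the dataset reports a missing key up front rather than as an authentication error partway through the evaluation.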

Convert the evaluated dataset back to a DataFrame for analysis:

In [None]:
df_eval_result = eval_dataset.as_dataframe()

## Analyze Evaluation Results

Examine the reasoning provided by the LLM judge for a specific evaluation:

In [None]:
df_eval_result.iloc[1]['eval reasoning']

View the original question that was evaluated:

In [None]:
df_eval_result.iloc[1]['question']

View the answer that was evaluated:

In [None]:
df_eval_result.iloc[1]['answer']
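
Beyond inspecting single rows, it is often useful to see how the judge's labels are distributed. Here is a minimal sketch using only the standard library. The helper `label_shares` and the sample labels are hypothetical. With the real results, the labels might come from something like `df_eval_result['eval'].tolist()`, assuming the label column takes its name from the `alias`:

```python
from collections import Counter


def label_shares(labels):
    """Return the share of each label, ordered from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Hypothetical labels, only for illustration
sample_labels = ["match", "match", "partial_match", "mismatch"]

print(label_shares(sample_labels))
```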

## Generate Evaluation Report

At this point we could compute summary statistics directly from the DataFrame, but Evidently also provides built-in reporting.

Import the reporting components:

In [None]:
from evidently import Report
from evidently.presets import TextEvals

Create and run a text evaluation report using the `TextEvals` preset. Passing `None` as the second argument to `run` means there is no reference dataset to compare against:

In [None]:
report = Report([
    TextEvals()
])

my_eval = report.run(eval_dataset, None)


Display the evaluation report with metrics and visualizations:

In [None]:
my_eval

## Conclusions

With Evidently, a few lines of code are enough to run LLM-as-a-Judge evaluations and display the results as a report.