In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.insert(0, '..')

In [13]:
import docs

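# Load the raw docs data and parse it into records with 'filename' and 'content' keys
github_data = docs.read_github_data()
parsed_data = docs.parse_data(github_data)

# Map each filename to its content so references can be looked up later
file_index = {d['filename']: d['content'] for d in parsed_data}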

In [14]:
import pickle

# Load the rows saved by an earlier evaluation run
with open('./eval-run-v2-2025-10-24-21-42.bin', 'rb') as f_in:
    rows = pickle.load(f_in)

In [15]:
import pandas as pd

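df_evals = pd.DataFrame(rows)

# Pull the source filename out of each original question,
# then attach the matching doc content as the reference text
df_evals['filename'] = df_evals['original_question'].apply(lambda r: r['filename'])
df_evals['reference'] = df_evals['filename'].apply(file_index.get)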
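
`file_index.get` returns `None` for any filename that is missing from the index, so some rows may end up without a reference. A minimal sketch of a sanity check, using hypothetical toy data (`toy_index` and `toy_evals` stand in for `file_index` and `df_evals`):

```python
import pandas as pd

# Hypothetical stand-ins for file_index and df_evals
toy_index = {"a.mdx": "content A", "b.mdx": "content B"}
toy_evals = pd.DataFrame({
    "question": ["q1", "q2", "q3"],
    "filename": ["a.mdx", "b.mdx", "missing.mdx"],
})

# dict.get returns None for unknown keys; pandas treats that as a missing value
toy_evals["reference"] = toy_evals["filename"].apply(toy_index.get)

# Rows whose filename did not resolve to any doc content
missing = toy_evals[toy_evals["reference"].isna()]
print(f"{len(missing)} of {len(toy_evals)} rows have no reference")
print(missing["filename"].tolist())

# Keep only the rows that can actually be compared against a reference
toy_evals_clean = toy_evals.dropna(subset=["reference"])
```

Running the same `isna()` check on `df_evals` before building the evaluation dataset would show whether any rows lack a reference to compare against.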

In [3]:
import main

In [11]:
result = await main.run_agent('how do I run llm as a judge')

TOOL CALL (search): search({"query": "run LLM as judge in Evidently"})
TOOL CALL (search): search({"query": "LLM judge"})
TOOL CALL (search): search({"query": "evaluate LLM performance Evidently"})
TOOL CALL (search): search({"query": "Evidently LLM integration"})
TOOL CALL (search): search({"query": "use LLM as evaluator"})
TOOL CALL (search): read_file({"filename":"examples/LLM_judge.mdx"})


In [12]:
print(result.output.format_article())

# How to Run LLM as a Judge in Evidently

## Overview

In this tutorial, you will learn how to use a Language Model (LLM) as a judge to evaluate text based on custom criteria. Two main approaches will be covered: reference-based evaluation against known accurate responses, and open-ended evaluation based on custom metrics.

### References
- [LLM as a judge](https://github.com/evidentlyai/docs/blob/main/examples/LLM_judge.mdx)
## Tutorial Steps

### 1. Environment Setup
- Install the Evidently library:
    ```bash
    pip install evidently
    ```
- Import necessary modules in your Python script:
    ```python
    import pandas as pd
    import numpy as np
    from evidently import Dataset, DataDefinition, Report
    from evidently.llm.templates import BinaryClassificationPromptTemplate
    import os
    os.environ["OPENAI_API_KEY"] = "YOUR_KEY"
    ```

### 2. Create the Evaluation Dataset
- Create a toy dataset containing questions, target responses, new responses, and their labels:
 

In [9]:
from evidently import Dataset, DataDefinition, Report
from evidently.llm.templates import BinaryClassificationPromptTemplate

In [16]:
definition = DataDefinition(
    text_columns=["question", "reference", "answer"],
)

eval_dataset = Dataset.from_pandas(
    df_evals,
    data_definition=definition
)

In [17]:
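# NOTE: the criteria text contains no {reference} placeholder. Evidently's
# LLM judge tutorial embeds extra columns through such placeholders, so it is
# worth verifying that the judge actually receives the reference text.
correctness = BinaryClassificationPromptTemplate(
    criteria="An ANSWER is correct when it matches the REFERENCE in all details",
    target_category="incorrect",
    non_target_category="correct",
    uncertainty="unknown",
    include_reasoning=True,
    pre_messages=[("system", "You are an expert evaluator")]
)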

In [19]:
from evidently.descriptors import LLMEval

In [20]:
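# Score the "answer" column with the correctness judge; the results land in
# the "Correctness" and "Correctness reasoning" columns
eval_dataset.add_descriptors(descriptors=[
    LLMEval(
        "answer",
        template=correctness,
        provider="openai",
        model="gpt-4o-mini",
        alias="Correctness",
        additional_columns={"reference": "reference", "question": "question"}
    ),
])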

In [21]:
from evidently import Report
from evidently.presets import TextEvals

In [23]:
df = eval_dataset.as_dataframe()

In [31]:
df.head()

Unnamed: 0,question,answer,messages,tool_call_number,requests,original_question,original_result,filename,reference,Correctness,Correctness reasoning
0,regexp descriptor usage,# Evidently Regexp Descriptor Usage\n\n## Over...,"[{'kind': 'user-prompt', 'content': 'regexp de...",8,4,"{'question': 'regexp descriptor usage', 'summa...",AgentRunResult(output=SearchResultArticle(foun...,metrics/all_descriptors.mdx,"<Info>\n For an intro, read about [Core Conce...",incorrect,The text contains duplicated references statin...
1,dataset-level evaluation metrics,# Dataset Level Evaluation Metrics\n\n## Overv...,"[{'kind': 'user-prompt', 'content': 'dataset-l...",4,2,{'question': 'dataset-level evaluation metrics...,AgentRunResult(output=SearchResultArticle(foun...,metrics/all_metrics.mdx,"<Info>\n For an intro, read [Core Concepts](/...",correct,The provided text accurately describes the dat...
2,customize Text Evals report conditions,# Text Evals Report Customization\n\n## Custom...,"[{'kind': 'user-prompt', 'content': 'customize...",5,2,{'question': 'customize Text Evals report cond...,AgentRunResult(output=SearchResultArticle(foun...,metrics/preset_text_evals.mdx,"To run this Report, first compute `descriptors...",correct,The text provides accurate information on cust...
3,automatically map data columns,# Automatic Data Column Mapping in Evidently\n...,"[{'kind': 'user-prompt', 'content': 'automatic...",6,2,"{'question': 'automatically map data columns',...",AgentRunResult(output=SearchResultArticle(foun...,docs/library/data_definition.mdx,"To run evaluations, you must create a `Dataset...",correct,The text accurately describes the automatic da...
4,using synthetic data for AI testing,# Using Synthetic Data for AI Testing\n\n## Ge...,"[{'kind': 'user-prompt', 'content': 'using syn...",5,2,{'question': 'using synthetic data for AI test...,AgentRunResult(output=SearchResultArticle(foun...,synthetic-data/why_synthetic.mdx,"When working on an AI system, you need test da...",correct,The text accurately describes the process of g...
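
The judge's labels can be summarized before reading individual reasonings. A minimal sketch, assuming a toy frame (`toy_df`, hypothetical) that reuses the five labels and filenames visible in the preview above; on the real data the same calls would be applied to `df`:

```python
import pandas as pd

# Toy frame mirroring the five rows shown in the preview above
toy_df = pd.DataFrame({
    "filename": [
        "metrics/all_descriptors.mdx",
        "metrics/all_metrics.mdx",
        "metrics/preset_text_evals.mdx",
        "docs/library/data_definition.mdx",
        "synthetic-data/why_synthetic.mdx",
    ],
    "Correctness": ["incorrect", "correct", "correct", "correct", "correct"],
})

# Share of each label assigned by the judge
label_share = toy_df["Correctness"].value_counts(normalize=True)
print(label_share)

# Fraction of answers judged correct
accuracy = (toy_df["Correctness"] == "correct").mean()
print(f"judged correct: {accuracy:.0%}")

# Rows to review by hand: anything not labeled "correct",
# which also catches the "unknown" label used for uncertain verdicts
failures = toy_df.loc[toy_df["Correctness"] != "correct", ["filename", "Correctness"]]
print(failures)
```

Filtering on `!= "correct"` rather than `== "incorrect"` keeps uncertain verdicts in the review set.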


In [29]:
print(df.iloc[1]['answer'])

# Dataset Level Evaluation Metrics

## Overview of Dataset Evaluation Metrics

Evidently provides several dataset-level evaluation metrics that help users understand the quality of their datasets and the performance of machine learning models. Key metrics include:**
  - **Model Quality Summary Metrics**: These metrics consist of Mean Error (ME), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE), which give a quantitative assessment of model predictions relative to actual values.
  - **Classification Quality Metrics**: Classification-based evaluation metrics provide insights into how well the model classifies data into categories.
  - **Regression Metrics**: These include metrics for assessing performance in regression tasks such as errors in predictions, plotting predicted versus actual values, error distributions, and evaluating specific groups of predictions (e.g., underestimations and overestimations).**

Evidently's reporting tools also allow for interactive visu

In [30]:
df.iloc[1]['Correctness reasoning']

'The provided text accurately describes the dataset level evaluation metrics from Evidently, detailing the key metrics used for assessing model performance, including Mean Error, Mean Absolute Error, and Mean Absolute Percentage Error, as well as classification and regression metrics. It also mentions the interactive visualizations and reporting tools, which align with the referenced sources.'

In [9]:
from evidently import Dataset, DataDefinition, Report
from evidently.descriptors import LLMEval
from evidently.llm.templates import BinaryClassificationPromptTemplate

In [None]:
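# `appropriate_scope` is not defined anywhere in this notebook; it must be a
# prompt template created beforehand. df_evals has no "response" column,
# so the judge is pointed at "answer" instead.
llm_evals = Dataset.from_pandas(
    df_evals,
    data_definition=DataDefinition(),
    descriptors=[
        LLMEval("answer", template=appropriate_scope, provider="openai", model="gpt-4o-mini")
    ]
)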