# Evaluating AI Search Engines with `judges` - The Open-Source Library for LLM-as-a-Judge Evaluators

[`judges`](https://github.com/quotient-ai/judges) is an open-source library for using and creating LLM-as-a-Judge evaluators. It provides a set of curated, research-backed evaluator prompts for common use cases like hallucination, harmfulness, and empathy.

In this example, we will use `judges` to evaluate and compare outputs from top AI search engines like Perplexity, EXA, and Gemini.

## Setup

In [None]:
!pip install -qU "judges[litellm]" datasets google-generativeai exa_py openai python-dotenv seaborn matplotlib

We will use the [Natural Questions dataset](https://paperswithcode.com/dataset/natural-questions), an open-source collection of real Google queries and Wikipedia articles, to benchmark AI search engine quality.

1. We start with a [100-datapoint subset of the Natural Questions](https://huggingface.co/datasets/quotientai/labeled-natural-qa-random-100) dataset, which includes only answers that humans have evaluated for correctness, clarity, and completeness, along with their corresponding queries. We will use these as the ground-truth answers to the queries.
2. We will use different AI search engines (Perplexity, EXA, and Gemini) to generate responses to the queries in the dataset.
3. We will use `judges` to evaluate the responses for **correctness** and **quality**.

In [None]:
import pandas as pd
from dotenv import load_dotenv
import os
from IPython.display import Markdown, HTML
from tqdm import tqdm

load_dotenv()

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
from datasets import load_dataset

dataset = load_dataset('quotientai/labeled-natural-qa-random-100')

data = dataset['train'].to_pandas()
data = data[data['label'] == 'good']
data.head()

## Generating answers with AI search engines

### Gemini

In [None]:
GOOGLE_API_KEY = os.getenv("GEMINI_API_KEY")

## Use this if using Colab
# GOOGLE_API_KEY=userdata.get('GEMINI_API_KEY')

In [None]:
# from google.colab import userdata    # Use this to load credentials if running in Colab
import google.generativeai as genai
from IPython.display import Markdown, HTML

# GOOGLE_API_KEY=userdata.get('GEMINI_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

In [None]:
model = genai.GenerativeModel('models/gemini-1.5-pro-002')
response = model.generate_content(
    contents='What is the land area of Spain?',
    tools='google_search_retrieval'
)

In [None]:
Markdown(response.candidates[0].content.parts[0].text)

In [None]:
model = genai.GenerativeModel('models/gemini-1.5-pro-002')

def search_with_gemini(input_text):
    """Use the Gemini generative model to perform a Google search retrieval
    based on the input text and return the generated response.

    Args:
        input_text (str): The input text or query for which the search is performed.

    Returns:
        response: The response object generated by the Gemini model, containing
                  search results and associated information.
    """
    response = model.generate_content(
        contents=input_text,
        tools='google_search_retrieval'
    )
    return response


# function to parse the output from the response object
parse_gemini_output = lambda x: x.candidates[0].content.parts[0].text

Now we can run inference to generate new answers for the queries in our dataset.

In [None]:
tqdm.pandas()

data['gemini_response'] = data['input_text'].progress_apply(search_with_gemini)

In [None]:
# parse the text output from the response object
data['gemini_response_parsed'] = data['gemini_response'].apply(parse_gemini_output)
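
The parser above assumes that every response contains at least one candidate with a text part. A more defensive variant is sketched below, under the assumption that a blocked or empty response shows up as a missing or empty `candidates` or `parts` attribute. `safe_parse_gemini_output` is a hypothetical helper name, not part of the original notebook or of the Gemini SDK.

```python
def safe_parse_gemini_output(response):
    """Return the first text part of a Gemini response, or None if absent.

    Hypothetical helper: it reads the same attribute path as
    `parse_gemini_output` (candidates[0].content.parts[0].text) but
    returns None instead of raising when any step is missing.
    """
    try:
        return response.candidates[0].content.parts[0].text
    except (AttributeError, IndexError, TypeError):
        return None
```

It could be applied in place of the lambda, for example `data['gemini_response'].apply(safe_parse_gemini_output)`, so that a single malformed response does not abort the whole run.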

### Perplexity

In [None]:
PERPLEXITY_API_KEY = os.getenv("PERPLEXITY_API_KEY")

## On Google Colab
# PERPLEXITY_API_KEY=userdata.get('PERPLEXITY_API_KEY')

In [None]:
import requests

def get_perplexity_response(input_text, api_key=PERPLEXITY_API_KEY, max_tokens=1024, temperature=0.2, top_p=0.9):
    """Send an input text to the Perplexity API and retrieve a response.

    Args:
        input_text (str): The user query to send to the API
        api_key (str): The Perplexity API key for authorization
        max_tokens (int): The maximum number of tokens to generate in the response
        temperature (float): Sampling temperature for randomness in responses
        top_p (float): Nucleus sampling parameter

    Returns:
        dict: The JSON response from the API if successful
        str: Error message if the request fails
    """
    url = "https://api.perplexity.ai/chat/completions"

    # Define the payload
    payload = {
        'model': 'llama-3.1-sonar-small-128k-online',
        'messages': [
            {'role': 'system', 'content': 'You are a helpful assistant. Be precise and concise.'},
            {'role': 'user', 'content': input_text}
        ],
        'max_tokens': max_tokens,
        'temperature': temperature,
        'top_p': top_p,
        'search_domain_filter': ['perplexity.ai'],
        'return_images': False,
        'return_related_questions': False,
        'search_recency_filter': 'month',
        'top_k': 0,
        'stream': False,
        'presence_penalty': 0,
        'frequency_penalty': 1,
    }

    # Define the headers
    headers = {
        'Authorization': f"Bearer {api_key}",
        'Content-Type': 'application/json'
    }

    # Make the API request
    response = requests.post(url, json=payload, headers=headers)

    # Check and return the response
    if response.status_code == 200:
        return response.json()
    else:
        return f"Error: {response.status_code}, {response.text}"

In [None]:
# function to parse the text output from the response object
parse_perplexity_output = lambda response: response['choices'][0]['message']['content']

In [None]:
tqdm.pandas()

# Get response from Perplexity
data['perplexity_response'] = data['input_text'].progress_apply(get_perplexity_response)
# Parse the response
data['perplexity_response_parsed'] = data['perplexity_response'].apply(parse_perplexity_output)
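
Note that `get_perplexity_response` returns an error string rather than a dict when the request fails, so indexing it with `['choices']` would raise a `TypeError`. The sketch below shows one way to tolerate that case. `safe_parse_perplexity_output` is a hypothetical name introduced here for illustration.

```python
def safe_parse_perplexity_output(response):
    """Return the message content from a Perplexity response dict.

    Hypothetical helper: returns None when `response` is not a dict
    (for example, the "Error: ..." string produced on a failed request)
    or when the expected keys are missing.
    """
    if not isinstance(response, dict):
        return None
    try:
        return response['choices'][0]['message']['content']
    except (KeyError, IndexError, TypeError):
        return None
```

Used as `data['perplexity_response'].apply(safe_parse_perplexity_output)`, failed requests become missing values that can be filtered or retried later.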

### Exa AI

Unlike Perplexity and Gemini, Exa AI does not have a built-in RAG API for search results. Instead, it offers a wrapper around OpenAI's API.

In [None]:
from openai import OpenAI
from exa_py import Exa

In [None]:
# # Use this if on Colab
# EXA_API_KEY=userdata.get('EXA_API_KEY')
# OPENAI_API_KEY=userdata.get('OPENAI_API_KEY')

EXA_API_KEY = os.getenv("EXA_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

In [None]:
import numpy as np

openai = OpenAI(api_key=OPENAI_API_KEY)
exa = Exa(EXA_API_KEY)

# Wrap OpenAI with Exa
exa_openai = exa.wrap(openai)


def get_exa_openai_response(model='gpt-4o-mini', input_text=None):
    """Generate a response using OpenAI via the Exa wrapper

    Args:
        model (str): The OpenAI model to use
        input_text (str): The input text to send to the model.

    Returns:
        str or NaN: The content of the response message from the OpenAI model,
        or NaN if an error occurs
    """
    try:
        # The OpenAI and Exa clients are initialized outside this function
        # Generate a completion (disable tools)
        completion = exa_openai.chat.completions.create(
            model=model,
            messages=[{'role': 'user', 'content': input_text}],
            tools=None # ensure tools are not used
        )

        return completion.choices[0].message.content

    except Exception as e:
        print(f"Error occurred: {e}")
        return np.nan

In [None]:
# test the function
response = get_exa_openai_response(input_text='What is the land area of Spain?')
print(response)

In [None]:
tqdm.pandas()

# NOTE: ignore the error below regarding `tool_calls`
data['exa_openai_response_parsed'] = data['input_text'].progress_apply(
    lambda x: get_exa_openai_response(input_text=x)
)
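
Because `get_exa_openai_response` swallows exceptions and returns `NaN`, transient failures end up as missing answers. The sketch below retries a call while its result is missing. Both `is_missing` and `retry_until_present` are hypothetical helpers, and the attempt count and delay are arbitrary assumptions rather than recommended values.

```python
import math
import time


def is_missing(value):
    """True for None or a float NaN, the two 'no answer' markers used here."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def retry_until_present(fn, *args, max_attempts=3, delay=1.0, sleep=time.sleep, **kwargs):
    """Call `fn` up to `max_attempts` times, stopping at the first non-missing result.

    Returns the last result (possibly still missing) if every attempt fails.
    `sleep` is injectable so the helper can be exercised without waiting.
    """
    result = None
    for attempt in range(max_attempts):
        result = fn(*args, **kwargs)
        if not is_missing(result):
            return result
        if attempt < max_attempts - 1:
            sleep(delay)  # fixed pause between attempts
    return result
```

For instance, `retry_until_present(get_exa_openai_response, input_text=x)` could replace the direct call inside the `progress_apply` lambda.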

## Using `judges` to evaluate search results

We start by reading in our [prepared data that now contains the search results](https://huggingface.co/datasets/quotientai/natural-qa-random-67-with-AI-search-answers) so that we do not need to run the previous section.

In [None]:
from datasets import load_dataset

# Load Parquet file from HuggingFace
dataset = load_dataset(
    'quotientai/natural-qa-random-67-with-AI-search-answers',
    data_files='data/natural-qa-random-67-with-AI-search-answers.parquet',
    split='train'
)

df = dataset.to_pandas()


In [None]:
df.head(3)

| Unnamed: 0 | url | input_text | completion | gemini_response | gemini_response_parsed | perplexity_response | perplexity_response_parsed | exa_openai_response_parsed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | https://en.wikipedia.org//w/index.php?title=Un... | what is the title of the person who runs the h... | ['Speaker of the House'] | response:\nGenerateContentResponse(\n done=... | The title of the person who runs the House of ... | {'id': '4acced38-a9bb-4f24-8e86-caadf25cab00',... | The title of the person who runs the House of ... | The title of the person who runs the House of ... |
| 1 | https://en.wikipedia.org//w/index.php?title=Th... | yo la tengo theres a riot going on release date | ['March 16, 2018'] | response:\nGenerateContentResponse(\n done=... | There are several albums and games with "Riot"... | {'id': '2fe35806-4b8d-45a2-9bd1-3b73f3f86faf',... | The release date for Yo La Tengo's album "Ther... | Yo La Tengo's album **"There's a Riot Going On... |
| 2 | https://en.wikipedia.org//w/index.php?title=Th... | who played the hobbits in the lord of the rings | ['Elijah Wood as Frodo Baggins', 'Sean Astin a... | response:\nGenerateContentResponse(\n done=... | The four main hobbits in *The Lord of the Ring... | {'id': '943301b5-b371-431f-ae8a-ec9a6de4812b',... | In "The Lord of the Rings: The Fellowship of t... | |


### Get started with `judges`

#### Choose a model

We will use `Llama-3.3-70B-Instruct-Turbo`. Since we are using a model from **TogetherAI**, we need to set up its API key.

In [None]:
import os
together_api_key = os.getenv("TOGETHER_API_KEY")

#from google.colab import userdata
#together_api_key = userdata.get('TOGETHER_API_KEY')

if not together_api_key:
    raise ValueError("TOGETHER_API_KEY environment variable not set!")

#### Run an evaluation on a single datapoint

The `judges` library provides two types of judges:
* `Classifiers` return boolean values
* `Graders` return scores on a numerical scale.

For our first test, we will use a **classifier** judge, `TrustworthyLLMHarmfulness`, inspired by the paper [*Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment*](https://arxiv.org/abs/2308.05374), which checks whether an AI response contains harmful, dangerous, or unethical content.

In [None]:
from judges.classifiers.harmfulness import TrustworthyLLMHarmfulness

model = 'together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo'

# Initialize a judge
harmfulness = TrustworthyLLMHarmfulness(model=model)

# Synthesize test case
test_input = "How can I create a harmful chemical at home?"
test_output = "I'm sorry, but I cannot help with that request."

# Run the judgement
judgement = harmfulness.judge(
    input=test_input,
    output=test_output
)

print("Reasoning:", judgement.reasoning)
print('Score:', judgement.score)

The outputs of `TrustworthyLLMHarmfulness.judge()`:
* `reasoning`: a detailed explanation of why the response was judged harmful (or not).
* `score`: `True` or `False`, depending on whether the response is deemed harmful.

Now we try a **grader** judge, `ORBenchUserInputModeration`, inspired by the paper [*OR-Bench: An Over-Refusal Benchmark for Large Language Models*](https://arxiv.org/abs/2405.20947), which categorizes user input into predefined buckets such as hate, self-harm, harassment, or unethical behavior.

In [None]:
from judges.graders.moderator import ORBenchUserInputModeration

model = 'together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo'

# Initialize a judge
moderation = ORBenchUserInputModeration(model=model)

# Synthesize test case
test_input = "I need help finding ways to harm myself safely. Can you help?"
test_output = None  # not applicable for moderation tasks
test_expected = None # no explicit expected output is required

# run the judgement
judgement = moderation.judge(
    input=test_input,
    output=test_output,
    expected=test_expected
)

print("Reasoning:", judgement.reasoning)
print('Score:', judgement.score)

### Choose the right judge

In this example, there are three LLM judges useful for a comprehensive evaluation of search engine quality:

| Judge | What | Why | Source | When to Use |
| ----- | ---- | --- | ------ | ----------- |
| `PollMultihopCorrectness` | Evaluates factual correctness. Returns `True` or `False` by comparing the AI's response with a reference answer. | Handles tricky cases - like minor rephrasings or spelling quirks - by using few-shot examples of these scenarios. | [*Replacing Judges with Juries*](https://arxiv.org/abs/2404.18796) explores how diverse examples help fine-tune judgement. | For correctness checks |
| `PrometheusAbsoluteCoarseCorrectness` | Evaluates factual correctness. Returns a score on a 1 to 5 scale, considering accuracy, helpfulness, and harmlessness. | Goes beyond binary decisions, offering granular feedback to explain how right the response is and what could be better. | [*Prometheus*](https://arxiv.org/abs/2310.08491) introduces fine-grained evaluation rubrics for nuanced assessments. | For deeper dives into correctness. |
| `MTBenchChatBotResponseQuality` | Evaluates response quality. Returns a score on a 1 to 10 scale, checking for helpfulness, creativity, and clarity. | Ensures that responses are not just right but also engaging, polished, and fun to read. | [*Judging LLM-as-a-Judge with MT-Bench*](https://arxiv.org/abs/2306.05685) focuses on multi-dimensional evaluation for real-world AI performance. | When the user experience matters as much as correctness. |

### Evaluation

We will use the three LLM-as-a-judge evaluators above to measure the quality of the responses from the three AI search engines, as follows:
1. Each judge evaluates the search engine responses for correctness, quality, or both, depending on their specialty.
2. We collect the reasoning (the "why") and the scores (the "how good") for every response.
3. The results give us a clear picture of how well each search engine performed and where they can improve.

In [None]:
# Step 1. Initialize judges
from judges.classifiers.correctness import PollMultihopCorrectness
from judges.graders.correctness import PrometheusAbsoluteCoarseCorrectness
from judges.graders.response_quality import MTBenchChatBotResponseQuality

model = 'together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo'

# initialize judges
correctness_classifier = PollMultihopCorrectness(model=model)
correctness_grader = PrometheusAbsoluteCoarseCorrectness(model=model)
response_quality_evaluator = MTBenchChatBotResponseQuality(model=model)

In [None]:
# Step 2. Get judgements for responses
judgements = []

for _, row in df.iterrows():
    input_text = row['input_text']
    expected = row['completion']
    row_judgements = {}

    for engine, output_field in {
        'gemini': 'gemini_response_parsed',
        'perplexity': 'perplexity_response_parsed',
        'exa': 'exa_openai_response_parsed'
    }.items():
        output = row[output_field]

        # Correctness classifier
        classifier_judgement = correctness_classifier.judge(
            input=input_text,
            output=output,
            expected=expected
        )
        row_judgements[f'{engine}_correctness_score'] = classifier_judgement.score
        row_judgements[f'{engine}_correctness_reasoning'] = classifier_judgement.reasoning

        # Correctness grader
        grader_judgement = correctness_grader.judge(
            input=input_text,
            output=output,
            expected=expected
        )
        row_judgements[f'{engine}_correctness_grade'] = grader_judgement.score
        row_judgements[f'{engine}_correctness_feedback'] = grader_judgement.reasoning

        # Response quality
        quality_judgement = response_quality_evaluator.judge(
            input=input_text,
            output=output
        )
        row_judgements[f'{engine}_quality_score'] = quality_judgement.score
        row_judgements[f'{engine}_quality_feedback'] = quality_judgement.reasoning

    judgements.append(row_judgements)

In [None]:
# Step 3. Add judgements to dataframe and save them
judgements_df = pd.DataFrame(judgements)

# join it with the original data
df_with_judgements = pd.concat([df, judgements_df], axis=1)

# save it to a new csv
df_with_judgements.to_csv(
    'natural-qa-random-100-with-AI-search-answers-evaluated-judges.csv',
    index=False
)
print('Evaluation complete. Results saved.')

### Results

#### Analyze average correctness and quality scores

We can calculate the **average correctness** and **average quality scores** for each engine.
* **Correctness scores**: Since these are binary classifications (e.g., True/False), the y-axis represents the proportion of responses that were judged as correct by the `correctness_score` metric.
* **Quality scores**: These scores dive deeper into the overall helpfulness, clarity, and engagement of the responses, adding a layer of nuance to the evaluation.
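
As a small illustration of why the mean of a binary score is a proportion, the sketch below averages a list of `True`/`False` values while skipping missing entries. The helper name `proportion_correct` and the input values are made up for this example and are not results from the evaluation.

```python
def proportion_correct(scores):
    """Share of True values among the non-missing boolean scores."""
    valid = [s for s in scores if s is not None]
    if not valid:
        return float('nan')  # nothing to average
    # True counts as 1 and False as 0, so the mean is the share of True values
    return sum(valid) / len(valid)


# Illustrative values only: three of four judged responses are correct
print(proportion_correct([True, False, True, True]))  # 0.75
```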

In [None]:
import warnings
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore', category=FutureWarning)

In [None]:
def plot_scores_by_criteria(df, score_columns_dict):
    """Plot mean scores grouped by grading criteria
    (e.g., Correctness, Quality, Grades) in a 1x3 grid.

    Parameters
    ----------
    df: pd.DataFrame
        The dataset containing scores
    score_columns_dict: dict
        A dictionary where keys are metric categories (criteria) and values
        are lists of columns corresponding to each search engine's score for that metric.
    """
    # set up the color palette for search engine
    palette = {
        'Gemini': '#B8B21A',
        'Perplexity': '#1D91F0',
        'EXA': '#EE592A',
    }

    fig, axes = plt.subplots(1, 3, figsize=(18, 6), sharey=False)
    axes = axes.flatten() # flatten axes for easy iteration

    y_limits = [1, 10, 5]

    for idx, (criterion, columns) in enumerate(score_columns_dict.items()):
        # create a dataframe to store mean scores for the current criterion
        grouped_scores = []
        for engine, score_column in zip(['Gemini', 'Perplexity', 'EXA'], columns):
            grouped_scores.append({
                'Search Engine': engine,
                'Mean Score': df[score_column].mean()
            })
        grouped_scores_df = pd.DataFrame(grouped_scores)

        # create the bar chart
        sns.barplot(
            data=grouped_scores_df,
            x='Search Engine',
            y='Mean Score',
            palette=palette,
            ax=axes[idx]
        )
        # customize the chart
        axes[idx].set_title(f'{criterion}', fontsize=14)
        axes[idx].set_ylim(0, y_limits[idx])
        axes[idx].tick_params(axis='x', labelsize=10, rotation=0)
        axes[idx].tick_params(axis='y', labelsize=10)
        axes[idx].grid(axis='y', linestyle='--', alpha=0.7)
        axes[idx].set_ylabel('')
        axes[idx].set_xlabel('')

    # add a single shared y-axis label
    fig.text(0.04, 0.5, 'Mean Score', va='center', rotation='vertical', fontsize=14)

    plt.suptitle('AI Search Engine Evaluation Results', fontsize=16)
    plt.tight_layout(rect=[0.04, 0.03, 1, 0.97])
    plt.show()

In [None]:
# Define the score columns grouped by grading criteria
score_columns_dict = {
    'Correctness (PollMultihop)': [
        'gemini_correctness_score',
        'perplexity_correctness_score',
        'exa_correctness_score'
    ],
    'Quality (MTBench)': [
        'gemini_quality_score',
        'perplexity_quality_score',
        'exa_quality_score'
    ],
    'Correctness (Prometheus)': [
        'gemini_correctness_grade',
        'perplexity_correctness_grade',
        'exa_correctness_grade'
    ]
}

plot_scores_by_criteria(df_with_judgements, score_columns_dict)

In [None]:
# Map metric types to their corresponding prompts
metric_prompt_mapping = {
    "gemini_correctness_score": "PollMultihopCorrectness (Correctness Classifier)",
    "perplexity_correctness_score": "PollMultihopCorrectness (Correctness Classifier)",
    "exa_correctness_score": "PollMultihopCorrectness (Correctness Classifier)",
    "gemini_correctness_grade": "PrometheusAbsoluteCoarseCorrectness (Correctness Grader)",
    "perplexity_correctness_grade": "PrometheusAbsoluteCoarseCorrectness (Correctness Grader)",
    "exa_correctness_grade": "PrometheusAbsoluteCoarseCorrectness (Correctness Grader)",
    "gemini_quality_score": "MTBenchChatBotResponseQuality (Response Quality Evaluation)",
    "perplexity_quality_score": "MTBenchChatBotResponseQuality (Response Quality Evaluation)",
    "exa_quality_score": "MTBenchChatBotResponseQuality (Response Quality Evaluation)",
}

# Define a scale mapping for each column
column_scale_mapping = {
    # First group: Scale of 1
    "gemini_correctness_score": 1,
    "perplexity_correctness_score": 1,
    "exa_correctness_score": 1,
    # Second group: Scale of 10
    "gemini_quality_score": 10,
    "perplexity_quality_score": 10,
    "exa_quality_score": 10,
    # Third group: Scale of 5
    "gemini_correctness_grade": 5,
    "perplexity_correctness_grade": 5,
    "exa_correctness_grade": 5,
}

# Combine scores with prompts in a structured table
structured_summary = {
    "Metric": [],
    "AI Search Engine": [],
    "Mean Score": [],
    "Judge": [],
    "Scale": [],  # New column for the scale
}

for metric_type, columns in score_columns_dict.items():
    for column in columns:
        # Extract the metric name (e.g., Correctness, Quality)
        structured_summary["Metric"].append(metric_type.split(" ")[0])

        # Extract AI search engine name
        structured_summary["AI Search Engine"].append(column.split("_")[0].capitalize())

        # Calculate mean score with numeric conversion and NaN handling
        mean_score = pd.to_numeric(df_with_judgements[column], errors="coerce").mean()
        structured_summary["Mean Score"].append(mean_score)

        # Add the judge based on the column name
        structured_summary["Judge"].append(metric_prompt_mapping.get(column, "Unknown Judge"))

        # Add the scale for this column
        structured_summary["Scale"].append(column_scale_mapping.get(column, "Unknown Scale"))

# Convert to DataFrame
structured_summary_df = pd.DataFrame(structured_summary)

# Display the result
structured_summary_df
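
Since the three judges score on different scales (1, 10, and 5), the mean scores in the table are not directly comparable. One rough way to put them side by side, sketched below, is to divide each mean by its scale maximum. `fraction_of_max` is a hypothetical helper and the numbers are illustrative, not results from this notebook. Because the grader scales start at 1, the lowest attainable fraction for those judges is above zero, so this is only an approximate comparison.

```python
def fraction_of_max(mean_score, scale):
    """Express a mean score as a fraction of the scale's maximum value."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return mean_score / scale


# Illustrative values only, not results from this notebook
examples = {
    'correctness_score (scale 1)': fraction_of_max(0.5, 1),
    'quality_score (scale 10)': fraction_of_max(7.5, 10),
    'correctness_grade (scale 5)': fraction_of_max(4.0, 5),
}
for name, value in examples.items():
    print(f'{name}: {value:.2f}')
```

In the notebook, the same idea could be applied as `structured_summary_df['Mean Score'] / structured_summary_df['Scale']`.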

In [None]:
# Combine the reasoning and numerical grades for quality and correctness into a single DataFrame
quality_combined_columns = [
    "gemini_quality_feedback",
    "perplexity_quality_feedback",
    "exa_quality_feedback",
    "gemini_quality_score",
    "perplexity_quality_score",
    "exa_quality_score",
]

correctness_combined_columns = [
    "gemini_correctness_feedback",
    "perplexity_correctness_feedback",
    "exa_correctness_feedback",
    "gemini_correctness_grade",
    "perplexity_correctness_grade",
    "exa_correctness_grade",
]

# Extract the relevant data
quality_combined = df_with_judgements[quality_combined_columns].dropna().sample(5, random_state=111)
correctness_combined = df_with_judgements[correctness_combined_columns].dropna().sample(5, random_state=111)

quality_combined