# **ARES** Evaluation Strategies

This notebook walks through the main **ARES** configurations and compares them with other evaluation strategies, including few-shot prompting with LLM judges and the RAGAS framework.

**ARES** combines synthetic data generation with fine-tuned classifiers to evaluate context relevance, answer faithfulness, and answer relevance, which reduces the reliance on extensive human annotation. Using synthetic query generation and Prediction-Powered Inference (PPI), **ARES** reports each score together with a statistical confidence interval.
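
To make the role of PPI concrete, the sketch below shows the textbook prediction-powered estimate of a mean: the judge's average prediction on the large unlabeled set is corrected by its average error on a small annotated set, and both sources of variance feed a normal-approximation confidence interval. This is an illustration under stated assumptions, not the code ARES runs; the helper name `ppi_mean_ci` and the toy label lists are made up for the example.

```python
import math
from statistics import mean, variance


def ppi_mean_ci(unlabeled_preds, labeled_preds, labeled_gold, z=1.96):
    """Prediction-powered estimate of a mean label with a normal-approximation CI.

    Hypothetical helper for illustration; not part of the ARES API.
    """
    # Rectifier terms: gold label minus judge prediction on the annotated examples.
    errors = [gold - pred for gold, pred in zip(labeled_gold, labeled_preds)]
    # Judge average on the unlabeled set, shifted by the judge's average error.
    estimate = mean(unlabeled_preds) + mean(errors)
    # The two sample means are independent, so their variances add.
    var = variance(unlabeled_preds) / len(unlabeled_preds) + variance(errors) / len(errors)
    half_width = z * math.sqrt(var)
    return estimate, (estimate - half_width, estimate + half_width)


# Toy data (made up): 8 judge predictions without labels, 4 with gold labels.
unlabeled_preds = [1, 1, 0, 1, 0, 1, 1, 0]
labeled_preds = [1, 0, 1, 1]
labeled_gold = [1, 0, 0, 1]

estimate, (low, high) = ppi_mean_ci(unlabeled_preds, labeled_preds, labeled_gold)
print(f"estimate={estimate:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

With so few annotated examples the interval is very wide; it narrows as either set grows or as the judge's errors become less variable.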

## 1) Setting up

In [6]:
# Optional for UES/IDP, configure API key for desired model(s)
from dotenv import load_dotenv

load_dotenv()

True

Remember to download the datasets and checkpoints linked below before running the later sections.

In [None]:
# Download Synthetic Query Dataset

# https://drive.google.com/file/d/1e5jXjScVIXb1lRD7YQ0ENPGteMibNDTO/view?usp=sharing

In [None]:
# Download checkpoints for evaluation

# Context Relevance: https://drive.google.com/file/d/1INyHfZpsUsn5UEBLSRehI9AX08AI12Lt/view?usp=sharing
# Answer Relevance: https://drive.google.com/file/d/1yg1q6WrCwq7q07YceZUsd7FLVuLNJEue/view?usp=sharing

Explore the dataset.

In [7]:
import pandas as pd

# Read the few-shot prompts file
prompts_df = pd.read_csv(
    "../data/ares/example_files/nq_few_shot_prompt_for_judge_scoring.tsv", sep="\t"
)

# Read the evaluation set (this file also carries the gold label columns)
eval_df = pd.read_csv("../data/ares/example_files/nq_labeled_output.tsv", sep="\t")

# # Basic exploration
# print("\nFew-shot prompts file:")
# print("Shape:", prompts_df.shape)
# print("Columns:", prompts_df.columns)
# print("\nFirst few rows:")
# print(prompts_df.head())

print("\nUnlabeled evaluation set:")
print("Shape:", eval_df.shape)
print("Columns:", eval_df.columns)
print("\nFirst few rows:")
print(eval_df.head(2))



Unlabeled evaluation set:
Shape: (6189, 12)
Columns: Index(['id', 'input', 'meta', 'output', 'wikipedia_id', 'Document',
       'paragraph_number', 'Answer', 'Query', 'Context_Relevance_Label',
       'Answer_Faithfulness_Label', 'Answer_Relevance_Label'],
      dtype='object')

First few rows:
                    id                                              input  \
0 -6371603500131574271  who sings somebody's watching me with michael ...   
1  6860341019198485637         who cracked the enigma code in world war 2   

                                                meta  \
0  {'left_context': '', 'mention': '', 'right_con...   
1  {'left_context': '', 'mention': '', 'right_con...   

                                              output  wikipedia_id  \
0  [{'answer': 'Rockwell', 'meta': {'score': -1},...       1551152   
1  [{'answer': 'Turing', 'meta': {'score': -1}, '...          1208   

                                            Document  paragraph_number  \
0  "Somebody's Wa

## 2) IDP + UES
<p>Uses targeted prompts to enable pre-trained models to assess content relevance and accuracy in a zero-shot manner.</p>

IDP stands for in-domain prompts; UES stands for unlabeled evaluation set.
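
As a rough picture of what judging with in-domain prompts involves, the sketch below assembles a few-shot prompt from labeled examples and maps the judge's reply to a binary score. It is a minimal illustration, not the prompt template or parser ARES uses: the helper names, the instruction text, and the assumption that the prompts file shares the `Query`, `Document`, and `Context_Relevance_Label` columns of the evaluation set are all hypothetical.

```python
import re


def build_judge_prompt(instruction, examples, query, document):
    """Assemble a few-shot prompt: instruction, labeled examples, then the new case."""
    parts = [instruction.strip(), ""]
    for example in examples:
        label = "Yes" if example["Context_Relevance_Label"] == 1 else "No"
        parts += [
            f"Question: {example['Query']}",
            f"Document: {example['Document']}",
            f"Label: {label}",
            "",
        ]
    # The unanswered case goes last so the model completes the final label.
    parts += [f"Question: {query}", f"Document: {document}", "Label:"]
    return "\n".join(parts)


def parse_yes_no(reply):
    """Map a judge reply to 1 (Yes), 0 (No), or None when neither is found."""
    match = re.match(r"\W*(yes|no)\b", reply.strip().lower())
    if match is None:
        return None
    return 1 if match.group(1) == "yes" else 0


# One in-domain example (hypothetical row layout) and one case to judge.
examples = [
    {
        "Query": "who cracked the enigma code in world war 2",
        "Document": "Alan Turing",
        "Context_Relevance_Label": 1,
    },
]
prompt = build_judge_prompt(
    "Answer Yes if the document is relevant to the question, otherwise No.",
    examples,
    "who plays bianca in that's so raven",
    "Bianca, played by Erica Rivera",
)
print(prompt)
print(parse_yes_no("Yes."), parse_yes_no("No"), parse_yes_no("Not sure"))
```

Replies that contain neither answer are returned as `None` so they can be counted separately instead of being silently scored.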

In [10]:
from ares import ARES

ues_idp_config = {
    # Dataset for in-domain prompts
    "in_domain_prompts_dataset": "../data/ares/example_files/nq_few_shot_prompt_for_judge_scoring.tsv",
    # Dataset for unlabeled evaluation
    "unlabeled_evaluation_set": "../data/ares/example_files/nq_output_5samples.tsv",
    # Model: GPT-4o
    "model_choice": "gpt-4o",
}

# Optional: Provide an alternative model of your choice below.
# Here are some models you can choose from:
# - mistralai/Mistral-7B-Instruct-v0.2
# - mistralai/Mixtral-8x7B-Instruct-v0.1
# - gpt-4-turbo-preview
# - microsoft/deberta-v3-large
# - openlm-research/open_llama_7b_v2
# - mosaicml/mpt-7b-instruct

In [11]:
ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)

# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}

Evaluating large subset with gpt-4o:   0%|          | 0/5 [00:00<?, ?it/s]

Number of times did not extract Yes or No: 0
{'Context Relevance Scores': 0.6, 'Answer Faithfulness Scores': 0.4, 'Answer Relevance Scores': 0.4, 'Raw Scores':    Context_Relevance_Score  Answer_Relevance_Score  Answer_Faithfulness_Score
0                        1                       1                          1
1                        1                       1                          1
2                        0                       0                          0
3                        1                       0                          0
4                        0                       0                          0}
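
The aggregate scores in this output are consistent with being the mean of the binary per-example judgments listed under `Raw Scores`. The short check below recomputes them from the five rows shown; the dictionary is copied by hand from the output and is not an ARES data structure.

```python
# Raw per-example judgments copied from the output above.
raw_scores = {
    "Context_Relevance_Score": [1, 1, 0, 1, 0],
    "Answer_Relevance_Score": [1, 1, 0, 0, 0],
    "Answer_Faithfulness_Score": [1, 1, 0, 0, 0],
}

# Each aggregate is the fraction of examples judged positive.
aggregate = {name: sum(values) / len(values) for name, values in raw_scores.items()}

for name, score in aggregate.items():
    print(f"{name}: {score:.1f}")
```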


## 3) Training Classifier + IDP + UES

In [6]:
from ares import ARES

ues_idp_config = {
    # Dataset for in-domain prompts
    "in_domain_prompts_dataset": "nq_few_shot_prompt_for_judge_scoring.tsv",
    # Dataset for unlabeled evaluation
    "unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
    # Model: GPT-3.5
    "model_choice": "gpt-3.5-turbo-0125",
}

# Training Classifier
classifier_config = {
    "training_dataset": ["nq_synth_queries.tsv"],
    "validation_set": ["nq_ratio_0.7.tsv"],
    "label_column": ["Context_Relevance_Label"],
    "num_epochs": 10,
    "patience_value": 3,
    "learning_rate": 5e-6,
    "assigned_batch_size": 1,
    "gradient_accumulation_multiplier": 32,
}
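
Two of these settings are easier to read with a small sketch. Assuming `assigned_batch_size` and `gradient_accumulation_multiplier` carry their usual meaning (gradients from several small batches are accumulated before each optimizer step) and `patience_value` is an early-stopping patience, the hypothetical helpers below show the effective batch size and when training would stop. The function names and the validation losses are made up for illustration.

```python
def effective_batch_size(assigned_batch_size, gradient_accumulation_multiplier):
    """Number of examples contributing to each optimizer step."""
    return assigned_batch_size * gradient_accumulation_multiplier


def epochs_until_stop(validation_losses, patience):
    """Return the epoch at which early stopping triggers, or the last epoch."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        # Stop once the loss has failed to improve for `patience` epochs in a row.
        if epochs_without_improvement >= patience:
            return epoch
    return len(validation_losses)


print(effective_batch_size(1, 32))
print(epochs_until_stop([0.90, 0.80, 0.85, 0.82, 0.81, 0.70], patience=3))
```

Under these assumptions, a batch size of 1 with a multiplier of 32 behaves like a batch of 32 while keeping memory use low, which matters for a model as large as `microsoft/deberta-v3-large`.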

In [7]:
ares_module = ARES(classifier_model=classifier_config)
results = ares_module.train_classifier()
print(results)

# Trains and saves checkpoints



--------------------------------------------------------------------------
Starting new learning rate: 5e-06
--------------------------------------------------------------------------
Creating parent checkpoint directory: checkpoints/microsoft-deberta-v3-large
--------------------------------------------------------------------------
Dataset: nq_synth_queries.tsv
Model: microsoft/deberta-v3-large
Test Set Selection: nq_ratio_0.7.tsv
Number of Runs: 1
Learning Rate: 5e-06
Checkpoint Path: checkpoints/microsoft-deberta-v3-large/Context_Relevance_Label_nq_ratio_0.7_2025-01-06_15:41:49.pt
Patience: 3
Validation Set Choice: True
Number of Epochs: 10
Number of warmup steps: 100
--------------------------------------------------------------------------


FileNotFoundError: [Errno 2] No such file or directory: 'nq_synth_queries.tsv'
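
The run above fails because the training file is not present in the working directory. A small pre-flight check, sketched below, catches this before any model is downloaded. The helper `missing_files` is hypothetical and not part of ARES; it only assumes that the config values are file paths given as a string or a list of strings.

```python
from pathlib import Path


def missing_files(config, keys):
    """Return every path under the given config keys that is not an existing file."""
    missing = []
    for key in keys:
        value = config.get(key, [])
        # Accept both a single path and a list of paths.
        paths = [value] if isinstance(value, str) else list(value)
        missing.extend(path for path in paths if not Path(path).is_file())
    return missing


classifier_config = {
    "training_dataset": ["nq_synth_queries.tsv"],
    "validation_set": ["nq_ratio_0.7.tsv"],
}

not_found = missing_files(classifier_config, ["training_dataset", "validation_set"])
if not_found:
    print("Download or fix these paths before training:", not_found)
```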

In [None]:
ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)

# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}

## 4) Training Classifier + PPI + UES

<h3>UES</h3>

In [None]:
from ares import ARES

ues_idp_config = {
    # Dataset for in-domain prompts
    "in_domain_prompts_dataset": "nq_few_shot_prompt_for_judge_scoring.tsv",
    # Dataset for unlabeled evaluation
    "unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
    # Default model choice
    "model_choice": "gpt-3.5-turbo-1106",
}

In [None]:
ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)

<h3>Training Classifier</h3>

<p>Generates the checkpoint that is used in the PPI step below.</p>

In [None]:
from ares import ARES

classifier_config = {
    "training_dataset": ["nq_synth_queries.tsv"],
    "validation_set": ["nq_ratio_0.7.tsv"],
    "label_column": ["Context_Relevance_Label"],
    "num_epochs": 10,
    "patience_value": 3,
    "learning_rate": 5e-6,
    "assigned_batch_size": 1,
    "gradient_accumulation_multiplier": 32,
}

In [None]:
ares = ARES(classifier_model=classifier_config)
results = ares.train_classifier()
print(results)

<h3>PPI</h3>

In [None]:
from ares import ARES

ppi_config = {
    "evaluation_datasets": ["nq_unlabeled_output.tsv"],
    "checkpoints": ["Context_Relevance_Label_joint_trained_date_time.pt"],
    "labels": ["Context_Relevance_Label"],
    "rag_type": "question_answering",
    "gold_label_paths": ["nq_labeled_output.tsv"],
    "prediction_filepaths": ["nq_0.6_predictions_updated.tsv"],
}

# Download the checkpoint here:
# Context Relevance: https://drive.google.com/file/d/1INyHfZpsUsn5UEBLSRehI9AX08AI12Lt/view?usp=sharing


In [None]:
ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

## 5) ARES Comparison to RAGAS, Zeroshot Llama, and Zeroshot Mixtral

<h3>ARES Configuration</h3>

<p>Synthetic Generator</p>

In [None]:
from ares import ARES

synth_config = {
    "document_filepaths": ["/content/nq_unlabeled_output.tsv"],
    "few_shot_prompt_filenames": ["/content/nq_few_shot_prompt_for_judge_scoring.tsv"],
    "synthetic_queries_filenames": ["nq_synthetic_queries.tsv"],
    "documents_sampled": 6189,
}

ares_module = ARES(synthetic_query_generator=synth_config)
results = ares_module.generate_synthetic_data()
print(results)

# Generates and saves synthetic queries

# Download the synthetic query file here:
# https://drive.google.com/file/d/1e5jXjScVIXb1lRD7YQ0ENPGteMibNDTO/view?usp=sharing


<p>Training Classifier</p>

In [None]:
from ares import ARES

classifier_config = {
    "training_dataset": ["nq_synth_queries.tsv"],
    "validation_set": ["nq_ratio_0.7.tsv"],
    "label_column": ["Context_Relevance_Label", "Answer_Relevance_Label"],
    "num_epochs": 10,
    "patience_value": 3,
    "learning_rate": 5e-6,
    "assigned_batch_size": 1,
    "gradient_accumulation_multiplier": 32,
}

ares = ARES(classifier_model=classifier_config)
results = ares.train_classifier()
print(results)

# Trains and saves classifier for context relevance and answer relevance

# Download checkpoints here!

# Context Relevance: https://drive.google.com/file/d/1INyHfZpsUsn5UEBLSRehI9AX08AI12Lt/view?usp=sharing
# Answer Relevance: https://drive.google.com/file/d/1yg1q6WrCwq7q07YceZUsd7FLVuLNJEue/view?usp=sharing

<p>PPI</p>

In [None]:
from ares import ARES

ppi_config = {
    "evaluation_datasets": ["nq_unlabeled_output.tsv"],
    "checkpoints": [
        "Context_Relevance_Label_joint_trained_date_time.pt",
        "Answer_Relevance_Label_joint_trained_date_time.pt",
    ],
    "rag_type": "question_answering",
    "labels": ["Context_Relevance_Label", "Answer_Relevance_Label"],
    "gold_label_path": "nq_labeled_output.tsv",
}

ares_module = ARES(ppi=ppi_config)
results = ares_module.evaluate_RAG()
print(results)

# Evaluation numbers below should match

Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6056978059262574]
ARES Confidence Interval: [[0.547, 0.664]]
Number of Examples in Evaluation Set: [4421]
Ground Truth Performance: [0.6]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.789]
Annotated Examples used for PPI: 300
------------

Answer_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.5955191133227766]
ARES Confidence Interval: [[0.577, 0.614]]
Number of Examples in Evaluation Set: [4421]
Ground Truth Performance: [0.6]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.977]
Annotated Examples used for PPI: 300
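
One way to read these numbers is to check whether each confidence interval contains the ground-truth performance and how wide it is. The snippet below does this with values copied by hand from the output above; the dictionary layout is made up for the example and is not what `evaluate_RAG()` returns.

```python
# Values copied from the printed ARES results above.
results = {
    "Context_Relevance_Label": {
        "prediction": 0.6056978059262574,
        "interval": (0.547, 0.664),
        "ground_truth": 0.6,
    },
    "Answer_Relevance_Label": {
        "prediction": 0.5955191133227766,
        "interval": (0.577, 0.614),
        "ground_truth": 0.6,
    },
}

summary = {}
for label, result in results.items():
    low, high = result["interval"]
    summary[label] = {
        # True when the ground-truth performance lies inside the interval.
        "covers_ground_truth": low <= result["ground_truth"] <= high,
        "width": round(high - low, 3),
    }
    print(label, summary[label])
```

Both intervals contain the ground-truth value of 0.6, and the answer relevance interval is the narrower of the two, in line with the higher judge accuracy reported for that label (0.977 versus 0.789).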



<h3>RAGAS Configuration</h3>

<p>Data Cleaning | Context Relevance Label Filter</p>

In [1]:
from datasets import Dataset
import pandas as pd


def load_and_prepare_dataset(file_path):
    # Load the dataset from the TSV file
    dataset_df = pd.read_csv(file_path, delimiter="\t")

    # Remove rows where 'Context_Relevance_Label' has no values
    dataset_df = dataset_df.dropna(subset=["Context_Relevance_Label"])

    # Convert 'Context_Relevance_Label' to string if it is not already
    dataset_df["Context_Relevance_Label"] = dataset_df[
        "Context_Relevance_Label"
    ].astype(str)

    # Use 'Context_Relevance_Label' as 'ground_truth'
    prepared_data = {
        "question": dataset_df["Query"].tolist(),
        "contexts": [
            [doc] for doc in dataset_df["Document"].tolist()
        ],  # Contexts are expected to be list of lists
        "answer": dataset_df["Answer"].tolist(),
        "ground_truth": dataset_df[
            "Context_Relevance_Label"
        ].tolist(),  # Using 'Context_Relevance_Label' as 'ground_truth'
    }

    # Convert to HuggingFace's Dataset format
    dataset = Dataset.from_dict(prepared_data)
    return dataset


<p>ARES Label Filter: Removes rows with no value for the specified label.</p>

<p>Context Relevance Accuracy</p>

In [19]:
from ragas import evaluate
from ragas.metrics import context_recall, context_precision

# Load and prepare the dataset
file_path = "../data/ares/example_files/nq_output_15samples.tsv"  # Update with the actual file path
prepared_dataset = load_and_prepare_dataset(file_path)

# Specify metrics
metrics = [
    context_precision,
    context_recall,
]

result = evaluate(prepared_dataset, metrics=metrics)  # No llm argument is passed here, so ragas falls back to its default
print(result)

Evaluating:   0%|          | 0/22 [00:00<?, ?it/s]

{'context_precision': 0.3636, 'context_recall': 0.0000}


<p>Data Cleaning | Answer Relevance Label Filter</p>

In [2]:
from datasets import Dataset
import pandas as pd


def load_and_prepare_dataset(file_path):
    # Load the dataset from the TSV file
    dataset_df = pd.read_csv(file_path, delimiter="\t")

    # Remove rows where 'Answer_Relevance_Label' has no value
    dataset_df = dataset_df.dropna(subset=["Answer_Relevance_Label"])

    # Convert 'Answer_Relevance_Label' to string if it is not already
    dataset_df["Answer_Relevance_Label"] = dataset_df["Answer_Relevance_Label"].astype(
        str
    )

    # Use 'Answer_Relevance_Label' as 'ground_truth'
    prepared_data = {
        "user_input": dataset_df["Query"].tolist(),
        "retrieved_contexts": [[doc] for doc in dataset_df["Document"].tolist()],
        "response": dataset_df["Answer"].tolist(),
        "ground_truth": dataset_df["Answer_Relevance_Label"].tolist(),
    }

    # Convert to HuggingFace's Dataset format
    dataset = Dataset.from_dict(prepared_data)
    return dataset


Note that single-word responses cause problems for RAGAS, which is not designed for this purpose: for some short answers it generates no statements, so no faithfulness score is produced.

In [3]:
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

file_path = "../data/ares/example_files/nq_output_15samples.tsv"
prepared_dataset = load_and_prepare_dataset(file_path)

# Specify metrics
metrics = [answer_relevancy, faithfulness]

In [4]:
# Evaluate
result = evaluate(prepared_dataset, metrics=metrics)

print(result)

Evaluating:   0%|          | 0/22 [00:00<?, ?it/s]

['Rockwell']
{0: 'Rockwell'}
['Turing']
{0: 'Turing'}
['Pittsburgh']
{0: 'Pittsburgh'}
['2010']
{0: '2010'}
['Erica Rivera']
{0: 'Erica Rivera'}
['Bob Dylan']
{0: 'Bob Dylan'}
['1983']
{0: '1983'}
['On June 27 , 1954']
{}


No statements were generated from the answer.


['5 %']
{0: '5 %'}
['22 July 1947']
{}
['in the 1980s']
{}


No statements were generated from the answer.
No statements were generated from the answer.


{'answer_relevancy': 0.6982, 'faithfulness': 0.1250}


In [5]:
results_df = result.to_pandas()
results_df

| | user_input | retrieved_contexts | response | reference | answer_relevancy | faithfulness |
|---|---|---|---|---|---|---|
| 0 | who sings somebody's watching me with michael ... | ["Somebody's Watching Me" is a song recorded b... | Rockwell | 1.0 | 0.579954 | 1.0 |
| 1 | who cracked the enigma code in world war 2 | [Alan Turing\n] | Turing | 1.0 | 0.74628 | 0.0 |
| 2 | where do characters live in this is us | [Most episodes feature a storyline taking plac... | Pittsburgh | 1.0 | 0.648006 | 0.0 |
| 3 | when did slave to the rhythm come out | [The song was written and recorded in 1990, wi... | 2010 | 1.0 | 0.726683 | 0.0 |
| 4 | who plays bianca in that's so raven | [BULLET::::- Bianca, played by Erica Rivera\n] | Erica Rivera | 1.0 | 0.615961 | 0.0 |
| 5 | who is playing halftime show super bowl 50 | [The Super Bowl 50 Halftime Show took place on... | Bob Dylan | 0.0 | 0.685556 | 0.0 |
| 6 | how did early humans make use of stones during... | [Prehistoric technology is technology that pre... | 1983 | 0.0 | 0.728894 | 0.0 |
| 7 | philadelphia is known as the city of what | [Penn named the city Philadelphia, which is Gr... | On June 27 , 1954 | 0.0 | 0.719964 | NaN |
| 8 | the probability of making a type i error when ... | [If the probability of obtaining a result as e... | 5 % | 1.0 | 0.762169 | 0.0 |
| 9 | when was the national flag of india adopted | [The National Flag of India is a horizontal re... | 22 July 1947 | 1.0 | 0.735336 | NaN |
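
The reported `faithfulness` of 0.1250 is consistent with averaging only over the rows that received a score. The check below uses the faithfulness values from the rows displayed above, with a missing value wherever no statements were generated. Whether ragas computes its aggregate exactly this way is an assumption here, not something the output confirms.

```python
import math

# Faithfulness column from the rows displayed above; NaN marks a row with no score.
faithfulness = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, math.nan, 0.0, math.nan]

# Average only over the rows that were actually scored.
scored = [value for value in faithfulness if not math.isnan(value)]
nan_aware_mean = sum(scored) / len(scored)

print(f"{len(scored)} scored rows, mean faithfulness = {nan_aware_mean:.4f}")
```

Only one scored row is rated faithful, so the low aggregate mostly reflects how the metric handles very short answers and says little about the answers themselves.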


<h3>Zeroshot Llama Configuration</h3>

In [None]:
from ares import ARES


ues_idp_config = {
    # Dataset for in-domain prompts
    "in_domain_prompts_dataset": "nq_few_shot_prompt_for_judge_scoring.tsv",
    # Dataset for unlabeled evaluation
    "unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
    # Model: Code Llama 13B Instruct
    "model_choice": "codellama/CodeLlama-13b-Instruct-hf",
}

In [None]:
ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)

# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}

<h3>Zeroshot Mixtral Configuration</h3>

In [None]:
from ares import ARES

ues_idp_config = {
    # Dataset for in-domain prompts
    "in_domain_prompts_dataset": "nq_few_shot_prompt_for_judge_scoring.tsv",
    # Dataset for unlabeled evaluation
    "unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
    # Model: Mixtral 8x7B
    "model_choice": "mistralai/Mixtral-8x7B-v0.1",
}

In [None]:
ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)

# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}
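
Since the two zero-shot configurations differ only in `model_choice`, they can be generated from one list of models. The helper below is a hypothetical convenience, not part of ARES; it only builds the config dictionaries with the same keys used in the cells above and leaves the actual ARES call as a comment.

```python
def build_ues_idp_configs(model_choices, prompts_path, evaluation_path):
    """Build one UES/IDP config per model, keyed by model name."""
    return {
        model_choice: {
            "in_domain_prompts_dataset": prompts_path,
            "unlabeled_evaluation_set": evaluation_path,
            "model_choice": model_choice,
        }
        for model_choice in model_choices
    }


configs = build_ues_idp_configs(
    ["codellama/CodeLlama-13b-Instruct-hf", "mistralai/Mixtral-8x7B-v0.1"],
    "nq_few_shot_prompt_for_judge_scoring.tsv",
    "nq_unlabeled_output.tsv",
)

for model_choice, config in configs.items():
    print(model_choice, config)
    # With ARES installed and the files in place, each config would be run as in
    # the cells above: results = ARES(ues_idp=config).ues_idp()
```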