# **ARES** Evaluation Strategies

This notebook walks through the main **ARES** configurations and compares them with other evaluation strategies, including few-shot prompting of LLM judges and the RAGAS framework.

**ARES** combines synthetic data generation with fine-tuned classifiers to evaluate context relevance, answer faithfulness, and answer relevance, reducing the reliance on extensive human annotations. Synthetic queries supply training data for the classifiers, and Prediction-Powered Inference (PPI) combines the classifier predictions with a small annotated set to report scores with statistical confidence intervals.

## 1) Setting up

In [None]:
# Optional: for UES/IDP, load the API key(s) for the chosen model(s) from a .env file
from dotenv import load_dotenv

load_dotenv()

In [None]:
# Download tutorial datasets

!wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_few_shot_prompt_for_judge_scoring.tsv
!wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_few_shot_prompt_for_synthetic_query_generation.tsv
!wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_labeled_output.tsv
!wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_unlabeled_output.tsv
!wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/eval_datasets/nq/nq_ratio_0.7.tsv
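
Before moving on, it can help to confirm that the downloads succeeded and that the TSV headers contain the columns the later cells read. The following is a minimal sketch, not part of ARES: `missing_columns` is a hypothetical helper, and the column names are the ones used by the data-cleaning cells further down (`Query`, `Document`, `Answer`). Whether every downloaded file carries all of them is an assumption.

```python
import csv
import os

# Column names read by the data-cleaning cells later in this notebook.
REQUIRED_COLUMNS = ("Query", "Document", "Answer")


def missing_columns(tsv_path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the TSV header."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in required if name not in header]


# Only inspect files that actually exist in the working directory.
for name in ["nq_unlabeled_output.tsv", "nq_labeled_output.tsv"]:
    if os.path.exists(name):
        print(name, "missing columns:", missing_columns(name))
    else:
        print(name, "not found; re-run the download cell")
```

An empty list means the header has every required column.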

In [None]:
# Download Synthetic Query Dataset

# https://drive.google.com/file/d/1e5jXjScVIXb1lRD7YQ0ENPGteMibNDTO/view?usp=sharing

In [None]:
# Download checkpoints for evaluation

# Context Relevance: https://drive.google.com/file/d/1INyHfZpsUsn5UEBLSRehI9AX08AI12Lt/view?usp=sharing
# Answer Relevance: https://drive.google.com/file/d/1yg1q6WrCwq7q07YceZUsd7FLVuLNJEue/view?usp=sharing

In [None]:
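# `!export` runs in a throwaway subshell and does not affect the kernel, so use %env.
# Replace the placeholder with a comma-separated list of GPU indices.
%env CUDA_VISIBLE_DEVICES=<specify GPUs>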

In [None]:
from google.colab import output

output.enable_custom_widget_manager()

## 2) IDP + UES
<p>Uses in-domain prompts (IDP) so that a pre-trained model can score an unlabeled evaluation set (UES) for context relevance, answer faithfulness, and answer relevance without any fine-tuning.</p>

In [None]:
from ares import ARES

ues_idp_config = {
    # Dataset for in-domain prompts
    "in_domain_prompts_dataset": "/content/nq_few_shot_prompt_for_judge_scoring.tsv",
    # Dataset for unlabeled evaluation
    "unlabeled_evaluation_set": "/content/nq_unlabeled_output.tsv",
    # Model: GPT-3.5
    "model_choice": "gpt-3.5-turbo-0125",
}

# Optional: Provide an alternative model of your choice below.
# Here are some models you can choose from:
# - mistralai/Mistral-7B-Instruct-v0.2
# - mistralai/Mixtral-8x7B-Instruct-v0.1
# - gpt-4-turbo-preview
# - microsoft/deberta-v3-large
# - openlm-research/open_llama_7b_v2
# - mosaicml/mpt-7b-instruct

In [None]:
ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)

# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}

## 3) Training Classifier + IDP + UES

In [None]:
from ares import ARES

ues_idp_config = {
    # Dataset for in-domain prompts
    "in_domain_prompts_dataset": "nq_few_shot_prompt_for_judge_scoring.tsv",
    # Dataset for unlabeled evaluation
    "unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
    # Model: GPT-3.5
    "model_choice": "gpt-3.5-turbo-0125",
}

# Training Classifier
classifier_config = {
    "training_dataset": ["nq_synth_queries.tsv"],
    "validation_set": ["nq_ratio_0.7.tsv"],
    "label_column": ["Context_Relevance_Label"],
    "num_epochs": 10,
    "patience_value": 3,
    "learning_rate": 5e-6,
    "assigned_batch_size": 1,
    "gradient_accumulation_multiplier": 32,
}
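
The training settings above interact in two ways that a small calculation makes concrete. This sketch assumes that `gradient_accumulation_multiplier` is the number of batches accumulated per optimizer step and that `patience_value` is a conventional early-stopping patience; the actual ARES training loop may differ. The helper `epochs_run` and the loss values are hypothetical and exist only for illustration.

```python
# Values copied from classifier_config above.
assigned_batch_size = 1
gradient_accumulation_multiplier = 32
num_epochs = 10
patience_value = 3

# Under the accumulation assumption, one optimizer step sees this many examples.
effective_batch_size = assigned_batch_size * gradient_accumulation_multiplier
print("Effective batch size:", effective_batch_size)


def epochs_run(validation_losses, patience, max_epochs):
    """Count the epochs executed before patience-based early stopping triggers."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return min(len(validation_losses), max_epochs)


# Hypothetical validation losses, invented purely for illustration.
example_losses = [0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.49, 0.48, 0.47, 0.46]
print("Epochs run:", epochs_run(example_losses, patience_value, num_epochs))
```

With these made-up losses, training would stop after epoch 6, because epochs 4 to 6 all fail to beat the best loss reached at epoch 3.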

In [None]:
ares_module = ARES(classifier_model=classifier_config)
results = ares_module.train_classifier()
print(results)

# Trains and saves checkpoints

In [None]:
ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)

# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}

## 4) Training Classifier + PPI + UES

<h3>UES</h3>

In [None]:
from ares import ARES

ues_idp_config = {
    # Dataset for in-domain prompts
    "in_domain_prompts_dataset": "nq_few_shot_prompt_for_judge_scoring.tsv",
    # Dataset for unlabeled evaluation
    "unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
    # Default model choice
    "model_choice": "gpt-3.5-turbo-1106",
}

In [None]:
ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)

<h3>Training Classifier</h3>

<p>Generates the checkpoint that the PPI step below loads.</p>

In [None]:
from ares import ARES

classifier_config = {
    "training_dataset": ["nq_synth_queries.tsv"],
    "validation_set": ["nq_ratio_0.7.tsv"],
    "label_column": ["Context_Relevance_Label"],
    "num_epochs": 10,
    "patience_value": 3,
    "learning_rate": 5e-6,
    "assigned_batch_size": 1,
    "gradient_accumulation_multiplier": 32,
}

In [None]:
ares = ARES(classifier_model=classifier_config)
results = ares.train_classifier()
print(results)

<h3>PPI</h3>

In [None]:
from ares import ARES

ppi_config = {
    "evaluation_datasets": ["nq_unlabeled_output.tsv"],
    "checkpoints": ["Context_Relevance_Label_joint_trained_date_time.pt"],
    "labels": ["Context_Relevance_Label"],
    "rag_type": "question_answering",
    "gold_label_paths": ["nq_labeled_output.tsv"],
    "prediction_filepaths": ["nq_0.6_predictions_updated.tsv"],
}

# Download the checkpoint here!
# Context Relevance: https://drive.google.com/file/d/1INyHfZpsUsn5UEBLSRehI9AX08AI12Lt/view?usp=sharing


In [None]:
ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)
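
To make the PPI step less opaque, here is a minimal sketch of the standard prediction-powered estimate of a mean: the judge's average prediction on the unlabeled set, corrected by its average error on the annotated set, with a normal-approximation interval. This is a generic illustration, not the ARES implementation; `ppi_mean_interval` and `simulate` are hypothetical names, and the data are simulated (a 0.6 true rate and an 80%-accurate judge are arbitrary choices). The set sizes mirror the 300 annotated and 4421 evaluation examples reported later in this notebook.

```python
import math
import random
import statistics


def ppi_mean_interval(labeled_truth, labeled_pred, unlabeled_pred, z=1.96):
    """Prediction-powered estimate of a mean with a normal-approximation interval.

    labeled_truth and labeled_pred are human labels and judge predictions on the
    same annotated examples; unlabeled_pred holds judge predictions only.
    """
    # Rectifier: how far off the judge is, measured on the annotated examples.
    errors = [y - f for y, f in zip(labeled_truth, labeled_pred)]
    estimate = statistics.fmean(unlabeled_pred) + statistics.fmean(errors)
    # The two samples are independent, so their variances add.
    variance = (
        statistics.variance(unlabeled_pred) / len(unlabeled_pred)
        + statistics.variance(errors) / len(errors)
    )
    half_width = z * math.sqrt(variance)
    return estimate, (estimate - half_width, estimate + half_width)


rng = random.Random(0)  # Fixed seed so the simulation is repeatable.


def simulate(n, true_rate=0.6, judge_accuracy=0.8):
    """Simulate binary labels and a judge that flips each label with some probability."""
    truth = [1 if rng.random() < true_rate else 0 for _ in range(n)]
    pred = [y if rng.random() < judge_accuracy else 1 - y for y in truth]
    return truth, pred


labeled_truth, labeled_pred = simulate(300)
_, unlabeled_pred = simulate(4421)
estimate, interval = ppi_mean_interval(labeled_truth, labeled_pred, unlabeled_pred)
print(f"Naive judge mean: {statistics.fmean(unlabeled_pred):.3f}")
print(f"PPI estimate: {estimate:.3f}, 95% interval: ({interval[0]:.3f}, {interval[1]:.3f})")
```

The naive judge mean is biased whenever the judge makes asymmetric mistakes; the rectifier term removes that bias, and a more accurate judge shrinks the error variance and therefore the interval.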

## 5) Comparing ARES with RAGAS, Zeroshot Llama, and Zeroshot Mixtral

<h3>ARES Configuration</h3>

<p>Synthetic Generator</p>

In [None]:
from ares import ARES

synth_config = {
    "document_filepaths": ["/content/nq_unlabeled_output.tsv"],
    # Few-shot prompt for query generation (downloaded in the setup section)
    "few_shot_prompt_filenames": ["/content/nq_few_shot_prompt_for_synthetic_query_generation.tsv"],
    "synthetic_queries_filenames": ["nq_synthetic_queries.tsv"],
    "documents_sampled": 6189,
}

ares_module = ARES(synthetic_query_generator=synth_config)
results = ares_module.generate_synthetic_data()
print(results)

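# Generates and saves synthetic queries

# Alternatively, download the pre-generated synthetic query file here.
# Note: the classifier cell below reads "nq_synth_queries.tsv", so make the filename match.
# https://drive.google.com/file/d/1e5jXjScVIXb1lRD7YQ0ENPGteMibNDTO/view?usp=sharing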


<p>Training Classifier</p>

In [None]:
from ares import ARES

classifier_config = {
    "training_dataset": ["nq_synth_queries.tsv"],
    "validation_set": ["nq_ratio_0.7.tsv"],
    "label_column": ["Context_Relevance_Label", "Answer_Relevance_Label"],
    "num_epochs": 10,
    "patience_value": 3,
    "learning_rate": 5e-6,
    "assigned_batch_size": 1,
    "gradient_accumulation_multiplier": 32,
}

ares = ARES(classifier_model=classifier_config)
results = ares.train_classifier()
print(results)

# Trains and saves classifier for context relevance and answer relevance

# Download checkpoints here!

# Context Relevance: https://drive.google.com/file/d/1INyHfZpsUsn5UEBLSRehI9AX08AI12Lt/view?usp=sharing
# Answer Relevance: https://drive.google.com/file/d/1yg1q6WrCwq7q07YceZUsd7FLVuLNJEue/view?usp=sharing

<p>PPI</p>

In [None]:
from ares import ARES

ppi_config = {
    "evaluation_datasets": ["nq_unlabeled_output.tsv"],
    "checkpoints": [
        "Context_Relevance_Label_joint_trained_date_time.pt",
        "Answer_Relevance_Label_joint_trained_date_time.pt",
    ],
    "rag_type": "question_answering",
    "labels": ["Context_Relevance_Label", "Answer_Relevance_Label"],
    "gold_label_path": "nq_labeled_output.tsv",
}

ares_module = ARES(ppi=ppi_config)
results = ares_module.evaluate_RAG()
print(results)

# The evaluation numbers should match the output below

Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6056978059262574]
ARES Confidence Interval: [[0.547, 0.664]]
Number of Examples in Evaluation Set: [4421]
Ground Truth Performance: [0.6]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.789]
Annotated Examples used for PPI: 300
------------

Answer_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.5955191133227766]
ARES Confidence Interval: [[0.577, 0.614]]
Number of Examples in Evaluation Set: [4421]
Ground Truth Performance: [0.6]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.977]
Annotated Examples used for PPI: 300
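
A quick way to read this output is to check whether each confidence interval covers the ground-truth performance and how wide it is. The sketch below only re-uses the numbers printed above; `summarize` is a hypothetical helper, not an ARES function.

```python
# Numbers copied from the printed ARES output above.
reported = {
    "Context_Relevance_Label": {
        "prediction": 0.6056978059262574,
        "interval": (0.547, 0.664),
        "ground_truth": 0.6,
    },
    "Answer_Relevance_Label": {
        "prediction": 0.5955191133227766,
        "interval": (0.577, 0.614),
        "ground_truth": 0.6,
    },
}


def summarize(entry):
    """Report interval width, coverage of the ground truth, and point-estimate error."""
    low, high = entry["interval"]
    return {
        "interval_width": round(high - low, 3),
        "covers_ground_truth": low <= entry["ground_truth"] <= high,
        "absolute_error": round(abs(entry["prediction"] - entry["ground_truth"]), 4),
    }


for label, entry in reported.items():
    print(label, summarize(entry))
```

Both intervals cover the ground truth of 0.6. The answer relevance interval is about a third as wide (0.037 versus 0.117), which is consistent with the higher judge accuracy reported for that label (0.977 versus 0.789).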



<h3>RAGAS Configuration</h3>

<p>Data Cleaning | Context Relevance Label Filter</p>

In [None]:
from datasets import Dataset
import pandas as pd


def load_and_prepare_dataset(file_path):
    # Load the dataset from the TSV file
    dataset_df = pd.read_csv(file_path, delimiter="\t")

    # Remove rows where 'Context_Relevance_Label' has no values
    dataset_df = dataset_df.dropna(subset=["Context_Relevance_Label"])

    # Convert 'Context_Relevance_Label' to string if it is not already
    dataset_df["Context_Relevance_Label"] = dataset_df[
        "Context_Relevance_Label"
    ].astype(str)

    # Use 'Context_Relevance_Label' as 'ground_truth'
    prepared_data = {
        "question": dataset_df["Query"].tolist(),
        "contexts": [
            [doc] for doc in dataset_df["Document"].tolist()
        ],  # Contexts are expected to be list of lists
        "answer": dataset_df["Answer"].tolist(),
        "ground_truth": dataset_df[
            "Context_Relevance_Label"
        ].tolist(),  # Using 'Context_Relevance_Label' as 'ground_truth'
    }

    # Convert to HuggingFace's Dataset format
    dataset = Dataset.from_dict(prepared_data)
    return dataset


<p>ARES Label Filter: removes rows that have no value for the specified label.</p>
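
The label filter can be tried on a tiny in-memory sample without pandas or `datasets`. This is a dependency-free sketch of the same idea, parameterized by label column; `prepare_rows` is a hypothetical helper and the sample rows are made up. Unlike the pandas version, it treats only empty fields as missing and keeps each label exactly as it appears in the file.

```python
import csv
import io


def prepare_rows(tsv_text, label_column):
    """Build the question/contexts/answer/ground_truth dict, skipping unlabeled rows."""
    prepared = {"question": [], "contexts": [], "answer": [], "ground_truth": []}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        label = (row.get(label_column) or "").strip()
        if not label:  # Empty field: the row has no value for this label.
            continue
        prepared["question"].append(row["Query"])
        prepared["contexts"].append([row["Document"]])  # One-document list per row.
        prepared["answer"].append(row["Answer"])
        prepared["ground_truth"].append(label)
    return prepared


# Tiny made-up sample; the second row has no Context_Relevance_Label.
sample = (
    "Query\tDocument\tAnswer\tContext_Relevance_Label\tAnswer_Relevance_Label\n"
    "q1\td1\ta1\t1\t0\n"
    "q2\td2\ta2\t\t1\n"
)
print(prepare_rows(sample, "Context_Relevance_Label"))
print(prepare_rows(sample, "Answer_Relevance_Label"))
```

Filtering on `Context_Relevance_Label` keeps only the first row, while filtering on `Answer_Relevance_Label` keeps both, so the two metrics can end up evaluated on differently sized subsets.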

<p>Context Relevance Accuracy</p>

In [None]:
from ragas import evaluate
from ragas.metrics import context_recall, context_precision

# Load and prepare the dataset
file_path = "nq_unlabeled_output.tsv"  # Update with the actual file path
prepared_dataset = load_and_prepare_dataset(file_path)

# Specify metrics
metrics = [
    context_precision,
    context_recall,
]

# Evaluate with the default RAGAS LLM (no custom llm is passed here)
result = evaluate(prepared_dataset, metrics=metrics)
print(result)

Evaluating: 100%|██████████| 8842/8842 [12:15<00:00, 12.03it/s]


{'context_precision': 0.5549, 'context_recall': 0.4737}


<p>Data Cleaning | Answer Relevance Label Filter</p>

In [None]:
from datasets import Dataset
import pandas as pd


def load_and_prepare_dataset(file_path):
    # Load the dataset from the TSV file
    dataset_df = pd.read_csv(file_path, delimiter="\t")

    # Remove rows where 'Answer_Relevance_Label' has no value
    dataset_df = dataset_df.dropna(subset=["Answer_Relevance_Label"])

    # Convert 'Answer_Relevance_Label' to string if it is not already
    dataset_df["Answer_Relevance_Label"] = dataset_df["Answer_Relevance_Label"].astype(
        str
    )

    # Use 'Answer_Relevance_Label' as 'ground_truth'
    prepared_data = {
        "question": dataset_df["Query"].tolist(),
        "contexts": [[doc] for doc in dataset_df["Document"].tolist()],
        "answer": dataset_df["Answer"].tolist(),
        "ground_truth": dataset_df["Answer_Relevance_Label"].tolist(),
    }

    # Convert to HuggingFace's Dataset format
    dataset = Dataset.from_dict(prepared_data)
    return dataset


In [None]:
from ragas import evaluate
from ragas.metrics import answer_relevancy

file_path = "nq_unlabeled_output.tsv"
prepared_dataset = load_and_prepare_dataset(file_path)

# Specify metrics
metrics = [answer_relevancy]

# Evaluate
result = evaluate(prepared_dataset, metrics=metrics)

print(result)

Evaluating:  92%|█████████▏| 4054/4421 [23:01<02:25,  2.52it/s]Failed to parse output. Returning None.
Evaluating:  92%|█████████▏| 4063/4421 [23:04<02:01,  2.95it/s]Failed to parse output. Returning None.
Evaluating: 100%|██████████| 4421/4421 [25:03<00:00,  2.94it/s]


{'answer_relevancy': 0.7511}


<h3>Zeroshot Llama Configuration</h3>

In [None]:
from ares import ARES


ues_idp_config = {
    # Dataset for in-domain prompts
    "in_domain_prompts_dataset": "nq_few_shot_prompt_for_judge_scoring.tsv",
    # Dataset for unlabeled evaluation
    "unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
    # Model: CodeLlama 13B Instruct
    "model_choice": "codellama/CodeLlama-13b-Instruct-hf",
}

In [None]:
ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)

# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}

<h3>Zeroshot Mixtral Configuration</h3>

In [None]:
from ares import ARES

ues_idp_config = {
    # Dataset for in-domain prompts
    "in_domain_prompts_dataset": "nq_few_shot_prompt_for_judge_scoring.tsv",
    # Dataset for unlabeled evaluation
    "unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
    # Model: Mixtral 8x7B
    "model_choice": "mistralai/Mixtral-8x7B-v0.1",
}

In [None]:
ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)

# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}