# Meta-Evaluation Notebook

This notebook largely mirrors the `Experiments.ipynb` notebook in its evaluation methods and general setup, but focuses on meta-evaluation using synthetically perturbed data. That is, it aims to establish which of the available evaluation methods are most suitable for assessing LLM performance on the given task.


Before using EvalSense, you may need to configure several environment variables, especially when working with local models. Some commonly used variables are described below:

- `EVALSENSE_STORAGE_DIR` — Specifies the directory where EvalSense should store datasets, logs and results. If not set, it defaults to a platform-specific application cache folder.
- `HF_HOME` — Specifies the directory used for storing local Hugging Face models. Depending on your system and preferences, you may wish to change this setting instead of using the default cache folder.
- `CUDA_HOME` — May be required when using vLLM local models with NVIDIA GPUs. If not already defined in your environment, this should point to the root of your CUDA installation (e.g., `/usr/local/cuda-12.4.0`).
- `TORCH_CUDA_ARCH_LIST` — Helps suppress CUDA-related warnings when using local models with NVIDIA GPUs. You can determine the appropriate value(s) by running:
  ```
  nvidia-smi --query-gpu=compute_cap --format=csv
  ```
- `TOKENIZERS_PARALLELISM` — It is recommended to set this to `true` when using local models. However, if you encounter issues related to parallelism, you may need to set it to `false`.
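
As a rough illustration of the `TORCH_CUDA_ARCH_LIST` step, the sketch below converts the CSV output of the `nvidia-smi` query into a semicolon-separated value. The helper name `arch_list_from_csv` and the sample output are hypothetical; on a real machine you would pass in the text printed by the command above.

```python
import os


def arch_list_from_csv(csv_text: str) -> str:
    """Turn `nvidia-smi --query-gpu=compute_cap --format=csv` output into an arch list."""
    caps: list[str] = []
    for line in csv_text.splitlines():
        value = line.strip()
        # Skip blank lines and the "compute_cap" header row.
        if not value or not value[0].isdigit():
            continue
        # Several identical GPUs report the same capability; keep each value once.
        if value not in caps:
            caps.append(value)
    # Sort numerically so that "10.0" is ordered after "8.0".
    caps.sort(key=lambda cap: tuple(int(part) for part in cap.split(".")))
    return ";".join(caps)


# Hypothetical output for a machine with two 8.0 GPUs and one 8.6 GPU.
sample_output = "compute_cap\n8.0\n8.0\n8.6\n"
os.environ["TORCH_CUDA_ARCH_LIST"] = arch_list_from_csv(sample_output)
print(os.environ["TORCH_CUDA_ARCH_LIST"])
```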


In [None]:
%load_ext autoreload
%autoreload 2

import os

os.environ["HF_HOME"] = "/vol/bitbucket/ad5518/hf_home"
os.environ["CUDA_HOME"] = "/vol/cuda/12.4.0/"
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"
os.environ["TOKENIZERS_PARALLELISM"] = "true"

## Dataset

As a first step, we need to define a dataset. In this case, we will use the built-in [ACI-Bench dataset](https://www.nature.com/articles/s41597-023-02487-3).


In [None]:
from evalsense.datasets import DatasetManager

aci_dataset_manager = DatasetManager.create(
    "aci-bench", splits=["test1", "test2", "test3"]
)

## Perturbation Prompts

In the next step, we specify different "tiers" of prompts applying progressively more aggressive perturbations to the outputs. This establishes a ground-truth ranking that we can compare with the scores from different evaluation methods to determine their suitability for assessing a specific criterion on the considered task.
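
The tier instructions are later spliced into the user prompt with `str.replace` rather than `str.format`, because the `{prompt}` placeholder has to survive until Inspect AI's `prompt_template` solver fills it in. The sketch below shows the idea on a miniature, hypothetical template and tier list; `fill_tier` is an illustrative helper and not part of EvalSense.

```python
# Hypothetical miniature template; the real one is defined in the next cell.
template = "Apply:\n{perturbation_types}\n\nNote:\n{prompt}\n"

# Hypothetical, shortened tier instructions.
tiers = [
    ["- Rephrase sentences."],
    ["- Make small changes to measurements.", "- Slightly adjust the history."],
]


def fill_tier(template: str, instructions: list[str]) -> str:
    """Substitute only the tier instructions, leaving other placeholders intact."""
    return template.replace("{perturbation_types}", "\n".join(instructions))


per_tier_templates = [fill_tier(template, tier) for tier in tiers]

# str.format would fail at this point because `{prompt}` has no value yet.
try:
    template.format(perturbation_types="- Rephrase sentences.")
except KeyError as error:
    print(f"str.format needs every placeholder, missing: {error}")

print(per_tier_templates[1])
```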


In [None]:
tiered_perturbation_types = [
    [
        "- Rephrase sentences while preserving the exact medical meaning. You may use synonyms, vary sentence structure, or change sentence length, but all clinical facts and measurements must remain unchanged.",
        "- Slightly alter the writing style, such as using different terminology or presenting findings differently, while ensuring the factual content remains identical.",
    ],
    [
        "- Make small changes to test results and quantitative measurements, ensuring they remain clinically plausible and consistent with the original context.",
        "- Introduce minor modifications to the patient's reported symptoms, making sure they are still consistent with the assessment, diagnosis, and treatment plan (e.g., adding or substituting symptoms that commonly co-occur).",
        "- Slightly adjust the patient's clinical history, ensuring consistency with the assessment, diagnosis, and treatment plan.",
        "- Make minor modifications to the treatment plan, but ensure it remains appropriate for the assessment and diagnosis.",
    ],
    [
        "- Significantly alter test results and quantitative measurements, in a way that may change the clinical interpretation or implications of the note.",
        "- Make substantial changes to the patient's reported symptoms, potentially affecting the clinical interpretation of the note.",
        "- Make substantial changes to the patient's clinical history, potentially affecting the clinical interpretation.",
        "- Significantly modify the treatment plan, such that it may lead to a different clinical outcome than the original plan.",
    ],
]

system_prompt_template = "You are a medical AI assistant. Your role is to generate plausible variations of clinical notes by applying controlled content perturbations."

user_prompt_template = """Your task is to generate a clinically plausible variation of the provided clinical note. 

You should maintain the original note's structure and formatting, but modify its content according to the specified types of perturbation below. Try to maintain internal consistency and general medical plausibility when applying any changes.

**Perturbation Instructions**  
Apply the following types of perturbations:
{perturbation_types}

Respond only with the perturbed clinical note, do not include any commentary, reasoning or explanation.

**Original Clinical Note**
{prompt}

**Perturbed Clinical Note**
"""

## Model Config

Next, we specify the model(s) to be used in our experiment. We can use any model and model provider supported by Inspect AI, configurable through model arguments and generation configuration. For an overview of supported models and their arguments, please refer to the related [Inspect AI documentation](https://inspect.aisi.org.uk/models.html). In this demo, we use several local vLLM models.


In [None]:
from inspect_ai.model import GenerateConfigArgs

from evalsense.generation import ModelConfig

gemma_27_config = ModelConfig(
    "vllm/google/gemma-3-27b-it",
    model_args={
        "device": "2",
        "gpu_memory_utilization": 0.8,
        "max_model_len": 8192,
    },
    generation_args=GenerateConfigArgs(
        seed=42,
        temperature=0.7,
        top_p=0.95,
        max_connections=128,
    ),
)

qwen_14_config = ModelConfig(
    "vllm/Qwen/Qwen3-14B",
    model_args={
        "device": "2",
        "gpu_memory_utilization": 0.8,
        "max_model_len": 8192,
    },
    generation_args=GenerateConfigArgs(
        seed=42,
        temperature=0.7,
        top_p=0.95,
        max_connections=128,
    ),
)

llama_config = ModelConfig(
    "vllm/meta-llama/Llama-3.1-8B-Instruct",
    model_args={
        "device": "2",
        "gpu_memory_utilization": 0.8,
        "max_model_len": 8192,
    },
    generation_args=GenerateConfigArgs(
        seed=42,
        temperature=0.7,
        top_p=0.95,
        max_connections=128,
    ),
)

phi_config = ModelConfig(
    "vllm/microsoft/phi-4",
    model_args={
        "device": "2",
        "gpu_memory_utilization": 0.8,
        "max_model_len": 8192,
    },
    generation_args=GenerateConfigArgs(
        seed=42,
        temperature=0.7,
        top_p=0.95,
        max_connections=128,
    ),
)

## Evaluation

In this step, we define the evaluation metrics. We use the same evaluation metrics as in the `Experiments.ipynb` notebook; please refer to that notebook for a more detailed description.
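
The G-Eval evaluators configured in the following cells ask a judge model for a rating between `min_score` and `max_score`. In the original G-Eval formulation, the final score is the probability-weighted average of the candidate ratings rather than the single sampled token. The sketch below illustrates that calculation on made-up token probabilities; `expected_rating` is a hypothetical helper, and whether EvalSense's `get_g_eval_evaluator` normalises in exactly this way is an assumption.

```python
def expected_rating(
    token_probs: dict[str, float], min_score: int = 1, max_score: int = 5
) -> float:
    """Probability-weighted rating over the tokens that are valid scores."""
    mass: dict[int, float] = {}
    for token, prob in token_probs.items():
        text = token.strip()
        # Ignore tokens that are not plain integers (e.g. words or punctuation).
        if not text.isdecimal():
            continue
        score = int(text)
        if min_score <= score <= max_score:
            # Tokens such as "4" and " 4" contribute to the same rating.
            mass[score] = mass.get(score, 0.0) + prob
    total = sum(mass.values())
    if total == 0:
        raise ValueError("no probability mass on a valid rating")
    # Renormalise so that invalid tokens do not drag the score down.
    return sum(score * prob for score, prob in mass.items()) / total


# Made-up probabilities for the first generated token.
print(expected_rating({"4": 0.6, "5": 0.3, "the": 0.1}))
```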


In [None]:
from evalsense.evaluation.evaluators import (
    get_bertscore_evaluator,
    get_bleu_evaluator,
    get_rouge_evaluator,
)

bleu_evaluator = get_bleu_evaluator()
rouge_evaluator = get_rouge_evaluator()
bertscore_evaluator = get_bertscore_evaluator(device="cuda:1")

In [None]:
from evalsense.evaluation.evaluators import get_g_eval_evaluator


detailed_prompt = """You are a medical expert tasked with evaluating the faithfulness of a clinical note generated by a model from doctor-patient dialogue. You will be given:

* The reference note produced by a human expert
* The candidate note produced by the model

Your goal is to determine whether the candidate note is faithful to the reference note, using the following evaluation criteria:

1. **Accuracy of Medical Facts**: All medical conditions, diagnoses, treatments, and test results must be correctly stated and consistent with the reference note.
2. **Completeness of Critical Information**: The note should include all vital information necessary for follow-up care (e.g., key symptoms, diagnoses, procedures, outcomes).
3. **Absence of Hallucinations**: The note should not introduce any information that is not present in the reference note.
4. **Clarity and Non-Misleading Content**: The note should be clear, free of ambiguity, and should not distort or misrepresent any facts.

Instructions:
* Compare the candidate note to the reference note.
* Provide a numerical rating of the candidate note's faithfulness on a scale from 1 (completely unfaithful) to 5 (fully faithful).
* Respond only with the numerical rating without any explanation or context.

Reference Note:
{reference}

Candidate Note:
{prediction}

Output Format:
[Numerical faithfulness rating only, from 1 to 5]"""


brief_prompt = """Assess whether the candidate note faithfully and accurately reflects the content of the reference note.

Provide a numerical rating of the candidate note's faithfulness on a scale from 1 (completely unfaithful) to 5 (fully faithful). Respond only with the numerical rating without any explanation or context.

Reference Note:
{reference}

Candidate Note:
{prediction}

Output Format:
[Numerical faithfulness rating only, from 1 to 5]"""


llama_detailed_g_eval_evaluator = get_g_eval_evaluator(
    quality_name="F. Detail",
    model_name="Llama 3.1 8B",
    prompt_template=detailed_prompt,
    model_config=llama_config,
    min_score=1,
    max_score=5,
)
llama_brief_g_eval_evaluator = get_g_eval_evaluator(
    quality_name="F. Brief",
    model_name="Llama 3.1 8B",
    prompt_template=brief_prompt,
    model_config=llama_config,
    min_score=1,
    max_score=5,
)
qwen_detailed_g_eval_evaluator = get_g_eval_evaluator(
    quality_name="F. Detail",
    model_name="Qwen 3 14B",
    prompt_template=detailed_prompt,
    model_config=qwen_14_config,
    min_score=1,
    max_score=5,
)
qwen_brief_g_eval_evaluator = get_g_eval_evaluator(
    quality_name="F. Brief",
    model_name="Qwen 3 14B",
    prompt_template=brief_prompt,
    model_config=qwen_14_config,
    min_score=1,
    max_score=5,
)
gemma_detailed_g_eval_evaluator = get_g_eval_evaluator(
    quality_name="F. Detail",
    model_name="Gemma 3 27B",
    prompt_template=detailed_prompt,
    model_config=gemma_27_config,
    min_score=1,
    max_score=5,
)
gemma_brief_g_eval_evaluator = get_g_eval_evaluator(
    quality_name="F. Brief",
    model_name="Gemma 3 27B",
    prompt_template=brief_prompt,
    model_config=gemma_27_config,
    min_score=1,
    max_score=5,
)

In [None]:
from typing import Any, Literal, override

from evalsense.evaluation.evaluators.qags import QagsConfig, get_qags_evaluator


class TernaryReferenceBasedQagsConfig(QagsConfig):
    def __init__(self):
        super().__init__(
            answer_comparison_mode="ternary",
        )

    @override
    def get_question_generation_prompt(
        self,
        *,
        source: Literal["prediction"] | Literal["reference"],
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        prompt = "You are an expert medical assistant specialised in processing clinical notes. Your task is to formulate a set of close-ended questions (with yes/no answers) that thoroughly cover the information in the provided clinical note. The questions should be clear, self-contained, unambiguous and directly referring to the key points in the note. You should respond with each question on a new line, without any additional comments or explanations (in particular, you should not provide any answers to the questions).\n\n"
        prompt += "Provided Note:\n"
        if source == "prediction":
            prompt += prediction
        elif source == "reference":
            prompt += self.enforce_not_none("reference", reference)
        else:
            raise ValueError("source must be either 'prediction' or 'reference'")
        prompt += """\n\nOutput Format:
[Each close-ended question on a separate line, without any additional comments or explanations]"""
        return prompt

    @override
    def get_answer_generation_prompt(
        self,
        *,
        source: Literal["prediction"] | Literal["reference"],
        question: str,
        prediction: str | None = None,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        prompt = """You are an expert medical assistant specialised in answering close-ended questions about clinical notes. Your task is to provide an answer to the below question based on the provided clinical note. The answer should be a single word out of the following options:
* Yes
* No
* Unknown

You should reply with unknown if the answer cannot reasonably be determined from the note. You should not provide any additional comments or explanations.\n\n"""
        prompt += "Provided Note:\n"
        if source == "prediction":
            prompt += self.enforce_not_none("prediction", prediction)
        elif source == "reference":
            prompt += self.enforce_not_none("reference", reference)
        else:
            raise ValueError("source must be either 'prediction' or 'reference'")
        prompt += "\n\nQuestion:\n" + question
        prompt += """\n\nOutput Format:
[Single word answer: Yes, No, or Unknown]"""
        return prompt


class JudgeReferenceBasedQagsConfig(QagsConfig):
    def __init__(self):
        super().__init__(
            answer_comparison_mode="judge",
        )

    @override
    def get_question_generation_prompt(
        self,
        *,
        source: Literal["prediction"] | Literal["reference"],
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        prompt = "You are an expert medical assistant specialised in processing clinical notes. Your task is to formulate a set of questions that thoroughly cover the information in the provided clinical note. The questions should be clear, self-contained, unambiguous and directly referring to the key points in the note. All questions you devise should only require brief answers (typically a single word or a short phrase). You should respond with each question on a new line, without any additional comments or explanations (in particular, you should not provide any answers to the questions).\n\n"
        prompt += "Provided Note:\n"
        if source == "prediction":
            prompt += prediction
        elif source == "reference":
            prompt += self.enforce_not_none("reference", reference)
        else:
            raise ValueError("source must be either 'prediction' or 'reference'")
        prompt += """\n\nOutput Format:
[Each question on a separate line, without any additional comments or explanations]"""
        return prompt

    @override
    def get_answer_generation_prompt(
        self,
        *,
        source: Literal["prediction"] | Literal["reference"],
        question: str,
        prediction: str | None = None,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        prompt = """You are an expert medical assistant specialised in answering questions about clinical notes. Your task is to provide an answer to the below question based on the provided clinical note.

You should reply with unknown if the answer cannot reasonably be determined from the note. You should not provide any additional comments or explanations — just a direct, succinct answer to the question.\n\n"""
        prompt += "Provided Note:\n"
        if source == "prediction":
            prompt += self.enforce_not_none("prediction", prediction)
        elif source == "reference":
            prompt += self.enforce_not_none("reference", reference)
        else:
            raise ValueError("source must be either 'prediction' or 'reference'")
        prompt += "\n\nQuestion:\n" + question
        prompt += """\n\nOutput Format:
[Succinct answer without any additional comments or explanations]"""
        return prompt

    @override
    def get_answer_comparison_prompt(
        self,
        *,
        question: str,
        prediction_answer: str,
        reference_answer: str,
        input: str | None = None,
        prediction: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        prompt = """You are an expert medical assistant specialised in evaluating answers to questions about clinical notes. Your task is to compare the two answers to the below question and determine whether they convey the same meaning. While the answers may be phrased differently, they should be semantically equivalent. You should respond with a single word answer out of the following options:
* Yes
* No

You should not provide any additional comments or explanations.\n\n"""
        prompt += "Question:\n" + question
        prompt += "\n\nPrediction Answer:\n" + prediction_answer
        prompt += "\n\nReference Answer:\n" + reference_answer
        prompt += """\n\nOutput Format:
[Single word answer: Yes or No]"""
        return prompt


ternary_reference_based_qags_config = TernaryReferenceBasedQagsConfig()
llama_ternary_reference_based_qags_evaluator = get_qags_evaluator(
    name="Ternary Ref. QAGS",
    model_name="Llama 3.1",
    config=ternary_reference_based_qags_config,
    model_config=llama_config,
)
judge_reference_based_qags_config = JudgeReferenceBasedQagsConfig()
llama_judge_reference_based_qags_evaluator = get_qags_evaluator(
    name="Judge Ref. QAGS",
    model_name="Llama 3.1",
    config=judge_reference_based_qags_config,
    model_config=llama_config,
)

## Experiment Configuration

Here, we specify the experiment configurations for performing the meta-evaluation. We need to define several components for each perturbation tier we specified earlier:

- **The `perturbation_record_to_sample` function** acts as a drop-in replacement for the simpler `FieldSpec` used in some of the other EvalSense notebooks. Instead of only mapping dataset fields to sample fields, a dedicated function lets us store the perturbation tier in each sample's metadata (`"perturbation_type_tier": type_tier`), so that the subsequent analysis can distinguish between different levels of perturbation.
- **The `GenerationSteps`** specify the steps used to perturb the samples during generation, based on the perturbation prompt templates defined above.
- **The `TaskPreprocessor`.** Since we don't need to perform any additional preprocessing, we simply use the default task preprocessor, customizing its name to clarify that we are perturbing the samples.
- **The `TaskConfig`** specifies the generation task to be performed for each tier of perturbations, using the other components described above.

Finally, all the tasks, model configurations and evaluators are incorporated in an `ExperimentBatchConfig`, which specifies the full range of experiments to be performed.
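
One detail worth noting is that `perturbation_record_to_sample` is defined inside a loop and only called later, when the pipeline runs. Python closures look up loop variables at call time, so the function receives the tier through a default argument, which is evaluated once at definition time. The standalone sketch below, with hypothetical function names, contrasts the two behaviours.

```python
from typing import Callable

late_bound: list[Callable[[], int]] = []
early_bound: list[Callable[[], int]] = []

for tier in range(3):

    def read_tier_late() -> int:
        # Looks up `tier` when called, i.e. after the loop has finished.
        return tier

    def read_tier_early(tier: int = tier) -> int:
        # The default value is evaluated once, when the function is defined.
        return tier

    late_bound.append(read_tier_late)
    early_bound.append(read_tier_early)

print([f() for f in late_bound])   # every function sees the final loop value
print([f() for f in early_bound])  # each function keeps its own tier
```

Without the default argument, every task would record the last tier in its sample metadata, and the tiers could no longer be told apart in the analysis.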


In [None]:
from typing import Any

from inspect_ai.dataset import Sample
from inspect_ai.solver import generate, prompt_template, system_message

from evalsense.generation import GenerationSteps
from evalsense.evaluation import ExperimentBatchConfig, TaskConfig
from evalsense.tasks import DefaultTaskPreprocessor

tasks: list[TaskConfig] = []

for type_tier, perturbation_type in enumerate(tiered_perturbation_types):

    def perturbation_record_to_sample(
        record: dict[str, Any],
        type_tier: int = type_tier,  # Bind the current tier at definition time.
    ) -> Sample:
        return Sample(
            input=record["note"],
            target=record["note"],
            id=record["id"],
            metadata={
                "dialogue": record["dialogue"],
                "dataset": record["dataset"],
                "encounter_id": record["encounter_id"],
                "doctor_name": record["doctor_name"],
                "patient_gender": record["patient_gender"],
                "patient_age": record["patient_age"],
                "patient_firstname": record["patient_firstname"],
                "patient_familyname": record["patient_familyname"],
                "cc": record["cc"],
                "2nd_complaints": record["2nd_complaints"],
                "perturbation_type_tier": type_tier,
            },
        )

    user_prompt_template_variant = user_prompt_template.replace(
        "{perturbation_types}", "\n".join(perturbation_type)
    )
    perturb_generation = GenerationSteps(
        name=f"Perturbation tier {type_tier + 1}",
        steps=[
            system_message(system_prompt_template),
            prompt_template(user_prompt_template_variant),
            generate(),
        ],
    )
    perturb_task_preprocessor = DefaultTaskPreprocessor(name="Perturbation")

    task_config = TaskConfig(
        dataset_manager=aci_dataset_manager,
        generation_steps=perturb_generation,
        field_spec=perturbation_record_to_sample,
        task_preprocessor=perturb_task_preprocessor,
    )
    tasks.append(task_config)

experiment_config = ExperimentBatchConfig(
    tasks=tasks,
    model_configs=[
        gemma_27_config,
        qwen_14_config,
        llama_config,
        phi_config,
    ],
    evaluators=[
        bleu_evaluator,
        rouge_evaluator,
        bertscore_evaluator,
        llama_detailed_g_eval_evaluator,
        llama_brief_g_eval_evaluator,
        qwen_detailed_g_eval_evaluator,
        qwen_brief_g_eval_evaluator,
        gemma_detailed_g_eval_evaluator,
        gemma_brief_g_eval_evaluator,
        llama_ternary_reference_based_qags_evaluator,
        llama_judge_reference_based_qags_evaluator,
    ],
)

## Pipeline

To store all the logs, outputs and results associated with our experiments, we also create a new `Project` object, which is passed to the evaluation pipeline. When initialised, the project automatically loads any previously stored outputs and evaluation results associated with the given project name. This ensures that completed tasks are not rerun unnecessarily unless explicitly requested by setting `force_rerun=True` when calling the pipeline's `run` method.


In [None]:
from evalsense.workflow import Pipeline, Project

aci_project = Project(name="ACI-Bench Perturbation")

aci_pipeline = Pipeline(
    experiments=experiment_config,
    project=aci_project,
)

In [None]:
aci_pipeline.run()

## Result Analysis

Finally, we can analyse the correlation between the rankings established by the tiered perturbations and the rankings produced by each of the considered evaluation methods. The evaluation methods with the highest correlation can be expected to be the most suitable for evaluating the given criterion.
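
To make the idea concrete, the sketch below computes a Spearman rank correlation between perturbation tiers and metric scores in plain Python. The data are made up, and the choice of Spearman's coefficient is an assumption for illustration; the statistic actually reported by `MetaResultAnalyser` may differ. Since higher tiers mean heavier perturbation, a well-behaved faithfulness metric should yield a strongly negative correlation here.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up example: two samples per tier and one score per sample.
tiers = [1, 1, 2, 2, 3, 3]
metric_scores = [0.95, 0.90, 0.70, 0.75, 0.40, 0.35]
print(spearman(tiers, metric_scores))
```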


In [None]:
import polars as pl

from evalsense.workflow.analysers import MetaResultAnalyser

analyser = MetaResultAnalyser[pl.DataFrame](output_format="polars")

In [None]:
corr_results = analyser(aci_project)

In [None]:
with pl.Config(tbl_rows=20, fmt_str_lengths=100):
    display(corr_results)