# EvalSense Demo Notebook

This notebook illustrates a basic application of EvalSense to the ACI-Bench dataset.


Before using EvalSense, you may need to configure several environment variables, especially when working with local models. Some commonly used variables are described below:

- `EVALSENSE_STORAGE_DIR` — Specifies the directory where EvalSense should store datasets, models, logs and results. If not set, it defaults to a platform-specific application cache folder.
- `CUDA_HOME` — May be required when using vLLM local models with NVIDIA GPUs. If not already defined in your environment, this should point to the root of your CUDA installation (e.g., `/usr/local/cuda-12.4.0`).
- `TORCH_CUDA_ARCH_LIST` — Helps suppress CUDA-related warnings when using local models with NVIDIA GPUs. You can determine the appropriate value(s) by running:
  ```
  nvidia-smi --query-gpu=compute_cap --format=csv
  ```
- `TOKENIZERS_PARALLELISM` — It is recommended to set this to `true` when using local models. However, if you encounter issues related to parallelism, you may need to set it to `false`.
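
The `nvidia-smi` query above prints a CSV header followed by one compute capability per GPU. As a minimal sketch, the hypothetical helper `arch_list_from_csv` below (not part of EvalSense) converts such output into a value that could be used for `TORCH_CUDA_ARCH_LIST`. The sample output string is illustrative and was not captured from a real machine.

```python
import os


def arch_list_from_csv(csv_output: str) -> str:
    """Convert `nvidia-smi --query-gpu=compute_cap --format=csv` output
    into a semicolon-separated list of unique compute capabilities."""
    lines = [line.strip() for line in csv_output.splitlines() if line.strip()]
    # The first non-empty line is the CSV header ("compute_cap"); skip it.
    capabilities = lines[1:]
    # Deduplicate while preserving the order in which GPUs were reported.
    unique = list(dict.fromkeys(capabilities))
    return ";".join(unique)


# Illustrative output for a machine with two identical GPUs and one older GPU.
sample_output = "compute_cap\n8.0\n8.0\n7.5\n"
arch_list = arch_list_from_csv(sample_output)

# setdefault keeps any value that is already defined in the environment.
os.environ.setdefault("TORCH_CUDA_ARCH_LIST", arch_list)
print(arch_list)  # 8.0;7.5
```

Using `setdefault` rather than direct assignment means that a value exported in the shell takes precedence over the one computed in the notebook.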


In [None]:
%load_ext autoreload
%autoreload 2

import os

# Adjust the paths and values below to match your own environment.
os.environ["EVALSENSE_STORAGE_DIR"] = "/vol/bitbucket/ad5518/evalsense_cache"
os.environ["CUDA_HOME"] = "/vol/cuda/12.4.0/"
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"
os.environ["TOKENIZERS_PARALLELISM"] = "true"

# Importing from evalsense also initialises other environment variables
from evalsense.constants import MODELS_PATH

## Dataset

As a first step, we need to define a dataset. In this case, we will use the built-in [ACI-Bench dataset](https://www.nature.com/articles/s41597-023-02487-3).


In [None]:
from evalsense.datasets.managers import AciBenchDatasetManager

aci_dataset_manager = AciBenchDatasetManager(splits=["train"])

## Generation Steps

The next step is to define how the data should be used to generate model outputs, including the prompts involved. This is achieved using the Solvers abstraction provided by the Inspect AI library. A list of available solvers can be found in the [official documentation](https://inspect.aisi.org.uk/solvers.html#built-in-solvers).

Solvers can be composed into a sequence, allowing multiple steps to be chained together, as illustrated below. The full generation steps need to be assigned a unique name — this enables the comparison of results across different experiments when varying prompts or solver configurations.
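
To make the idea of chaining concrete, the following plain-Python sketch mimics how a sequence of steps might transform a shared state one after another. It does not use Inspect AI: `ToyState` and the `toy_*` functions are hypothetical stand-ins, and the "model output" is a placeholder string rather than a real generation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyState:
    """Hypothetical stand-in for the state passed between steps."""

    user_prompt: str
    messages: list[tuple[str, str]] = field(default_factory=list)
    output: str = ""


Step = Callable[[ToyState], ToyState]


def toy_system_message(text: str) -> Step:
    def step(state: ToyState) -> ToyState:
        # Place the system message at the start of the conversation.
        state.messages.insert(0, ("system", text))
        return state

    return step


def toy_prompt_template(template: str) -> Step:
    def step(state: ToyState) -> ToyState:
        # Substitute the current user prompt into the {prompt} placeholder.
        state.user_prompt = template.format(prompt=state.user_prompt)
        return state

    return step


def toy_generate() -> Step:
    def step(state: ToyState) -> ToyState:
        # A real step would call the model; here we only record a placeholder.
        state.messages.append(("user", state.user_prompt))
        state.output = f"<model output for {len(state.messages)} messages>"
        return state

    return step


def run_steps(steps: list[Step], state: ToyState) -> ToyState:
    # Apply each step in order, feeding the result into the next one.
    for step in steps:
        state = step(state)
    return state


steps = [
    toy_system_message("You are a clinical assistant."),
    toy_prompt_template("Conversation:\n{prompt}\n\nNote:\n"),
    toy_generate(),
]
final_state = run_steps(steps, ToyState(user_prompt="Doctor: How are you feeling today?"))
print(final_state.user_prompt)
```

Because each step only reads and updates the shared state, swapping the template or adding a further step changes the pipeline without touching the other steps, which is why a distinct name per configuration is useful when comparing results.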


In [None]:
from inspect_ai.solver import generate, prompt_template, system_message

from evalsense.generation import GenerationSteps

system_prompt_template = "You are an expert clinical assistant specialising in the creation of medically accurate summaries from a dialogue between the doctor and patient."
user_prompt_template = """Your task is to generate a clinical note based on a conversation between a doctor and a patient. Use the following format for the clinical note:

1. **CHIEF COMPLAINT**: [Brief description of the main reason for the visit]
2. **HISTORY OF PRESENT ILLNESS**: [Summary of the patient's current health status and any changes since the last visit]
3. **REVIEW OF SYSTEMS**: [List of symptoms reported by the patient]
4. **PHYSICAL EXAMINATION**: [Findings from the physical examination]
5. **RESULTS**: [Relevant test results]
6. **ASSESSMENT AND PLAN**: [Doctor's assessment and plan for treatment or further testing]

**Conversation:**
{prompt}

**Note:**
"""

aci_generation = GenerationSteps(
    name="Structured",
    steps=[
        system_message(system_prompt_template),
        prompt_template(user_prompt_template),
        generate(),
    ],
)

## Task Spec

In addition to selecting the dataset to be used, it is also necessary to specify which columns will serve as inputs, targets (i.e., ground-truth references, if available) and sample IDs. Optionally, additional columns can be retained as metadata (for example, if they are useful for the result analysis).

Some datasets may support multiple tasks, in which case a preprocessing step may be needed to adapt the data for a specific task. In this case, no task-specific adjustments are needed, so we use the default preprocessor that returns the data unchanged.
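
As a rough illustration of what such a column mapping does, the sketch below applies a mapping to a single record using plain Python. The helper `apply_field_spec` and the example record are invented for this illustration; the actual conversion is handled by Inspect AI and EvalSense.

```python
from typing import Any


def apply_field_spec(
    record: dict[str, Any],
    *,
    input_field: str,
    target_field: str,
    id_field: str,
    metadata_fields: list[str],
) -> dict[str, Any]:
    """Pick the input, target, ID and metadata columns out of a raw record."""
    return {
        "input": record[input_field],
        "target": record[target_field],
        "id": record[id_field],
        # Only the listed columns are retained as metadata; others are dropped.
        "metadata": {key: record[key] for key in metadata_fields if key in record},
    }


# Made-up record that reuses some of the column names from this notebook.
raw_record = {
    "id": "example-1",
    "dialogue": "Doctor: What brings you in today?",
    "note": "CHIEF COMPLAINT: ...",
    "patient_age": 45,
    "unused_column": "dropped",
}

sample = apply_field_spec(
    raw_record,
    input_field="dialogue",
    target_field="note",
    id_field="id",
    metadata_fields=["patient_age", "patient_gender"],
)
print(sample["metadata"])  # {'patient_age': 45}
```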


In [None]:
from inspect_ai.dataset import FieldSpec

from evalsense.tasks import DefaultTaskPreprocessor

aci_field_spec = FieldSpec(
    input="dialogue",
    target="note",
    id="id",
    metadata=[
        "dataset",
        "encounter_id",
        "doctor_name",
        "patient_gender",
        "patient_age",
        "patient_firstname",
        "patient_familyname",
        "cc",
        "2nd_complaints",
    ],
)
dialogue_task_preprocessor = DefaultTaskPreprocessor(name="Dialogue")

## Model Config

Next, we specify the model(s) to be used in our experiment. We can use any model and model provider supported by Inspect AI, configurable through model arguments and generation configuration. For an overview of supported models and their arguments, please refer to the related [Inspect AI documentation](https://inspect.aisi.org.uk/models.html). In this demo, we use five local vLLM models: [Gemma 3 12B](https://huggingface.co/google/gemma-3-12b-it), [Gemma 3 27B](https://huggingface.co/google/gemma-3-27b-it), [Qwen 2.5 14B Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct), [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and [Phi 4](https://huggingface.co/microsoft/phi-4).


In [None]:
from inspect_ai.model import GenerateConfigArgs

from evalsense.generation import ModelConfig

gemma_12_config = ModelConfig(
    "vllm/google/gemma-3-12b-it",
    model_args={
        "download_dir": MODELS_PATH,
        "device": "2",
        "gpu_memory_utilization": 0.8,
        "max_model_len": 8192,
    },
    generation_args=GenerateConfigArgs(
        seed=42,
        temperature=0.7,
        top_p=0.95,
        max_connections=128,
    ),
)

gemma_27_config = ModelConfig(
    "vllm/google/gemma-3-27b-it",
    model_args={
        "download_dir": MODELS_PATH,
        "device": "2",
        "gpu_memory_utilization": 0.8,
        "max_model_len": 8192,
    },
    generation_args=GenerateConfigArgs(
        seed=42,
        temperature=0.7,
        top_p=0.95,
        max_connections=128,
    ),
)

qwen_14_config = ModelConfig(
    "vllm/Qwen/Qwen2.5-14B-Instruct",
    model_args={
        "download_dir": MODELS_PATH,
        "device": "2",
        "gpu_memory_utilization": 0.8,
        "max_model_len": 8192,
    },
    generation_args=GenerateConfigArgs(
        seed=42,
        temperature=0.7,
        top_p=0.95,
        max_connections=128,
    ),
)

llama_config = ModelConfig(
    "vllm/meta-llama/Llama-3.1-8B-Instruct",
    model_args={
        "download_dir": MODELS_PATH,
        "device": "2",
        "gpu_memory_utilization": 0.8,
        "max_model_len": 8192,
    },
    generation_args=GenerateConfigArgs(
        seed=42,
        temperature=0.7,
        top_p=0.95,
        max_connections=128,
    ),
)

phi_config = ModelConfig(
    "vllm/microsoft/phi-4",
    model_args={
        "download_dir": MODELS_PATH,
        "device": "2",
        "gpu_memory_utilization": 0.8,
        "max_model_len": 8192,
    },
    generation_args=GenerateConfigArgs(
        seed=42,
        temperature=0.7,
        top_p=0.95,
        max_connections=128,
    ),
)
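
The five configurations above differ only in the model identifier. As one possible way to reduce the repetition, the sketch below builds the shared arguments in a single place. The helper `vllm_config_parts` and the placeholder path are hypothetical, and the block deliberately avoids importing EvalSense so that it runs on its own.

```python
from typing import Any

# Values shared by every model configuration in this notebook.
SHARED_MODEL_ARGS: dict[str, Any] = {
    "device": "2",
    "gpu_memory_utilization": 0.8,
    "max_model_len": 8192,
}
SHARED_GENERATION_ARGS: dict[str, Any] = {
    "seed": 42,
    "temperature": 0.7,
    "top_p": 0.95,
    "max_connections": 128,
}


def vllm_config_parts(
    model_id: str, download_dir: str, **model_arg_overrides: Any
) -> tuple[str, dict[str, Any], dict[str, Any]]:
    """Return the model name, model arguments and generation arguments."""
    model_args = {
        "download_dir": download_dir,
        **SHARED_MODEL_ARGS,
        **model_arg_overrides,  # per-model overrides take precedence
    }
    # Return a copy so that callers cannot mutate the shared defaults.
    return f"vllm/{model_id}", model_args, dict(SHARED_GENERATION_ARGS)


# "/path/to/models" is a placeholder; the notebook uses MODELS_PATH instead.
name, model_args, generation_args = vllm_config_parts(
    "microsoft/phi-4", "/path/to/models"
)
# In the notebook, these parts could then be passed on as:
#   ModelConfig(name, model_args=model_args,
#               generation_args=GenerateConfigArgs(**generation_args))
print(name)  # vllm/microsoft/phi-4
```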

## Evaluation

In this step, we define the evaluation metrics. Here, we simply use the BLEU, ROUGE and BERTScore metrics as implemented in EvalSense. Custom metrics are also supported, provided that they are implemented as Inspect AI-compatible [scorers](https://inspect.aisi.org.uk/scorers.html#custom-scorers) and [metrics](https://inspect.aisi.org.uk/reference/inspect_ai.scorer.html#metric). For reference and inspiration, please check the [EvalSense BLEU implementation](https://github.com/nhsengland/evalsense/blob/main/evalsense/evaluation/evaluators/bleu.py) and the [relevant Inspect AI documentation](https://inspect.aisi.org.uk/scorers.html).


In [None]:
from evalsense.evaluation.evaluators import (
    get_bertscore_evaluator,
    get_bleu_evaluator,
    get_rouge_evaluator,
)

bleu_evaluator = get_bleu_evaluator()
rouge_evaluator = get_rouge_evaluator()
bertscore_evaluator = get_bertscore_evaluator(device="cuda:1")
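
BLEU and ROUGE both score a prediction by its n-gram overlap with a reference. To give an intuition for this family of metrics, the sketch below computes a simplified unigram-overlap F1 score in the spirit of ROUGE-1. It uses naive whitespace tokenisation and is not the implementation used by EvalSense; the function name `unigram_f1` is our own.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 based on whitespace-separated tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = unigram_f1("the patient has a cough", "the patient reports a cough")
print(round(score, 2))  # 0.8
```

Four of the five tokens match in each direction, so precision and recall are both 0.8. Note that such lexical metrics cannot tell that "has" and "reports" are near-synonyms here, which is one motivation for embedding-based and model-based metrics.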

To demonstrate the use of more advanced metrics, we also include [G-Eval](https://arxiv.org/abs/2303.16634), an LLM-as-a-judge approach that uses structured prompts and weights judge scores based on output token probabilities, resulting in more stable evaluation scores. Unlike simpler metrics, G-Eval requires additional configuration, such as a function for constructing the prompts as well as additional hyperparameters.

Since G-Eval is a model-based metric, we also need to supply a model configuration so that EvalSense can instantiate the appropriate model during evaluation. Internally, EvalSense handles this evaluator differently from simpler metrics by using a `ScorerFactory`, an abstraction that creates a scorer when passed an Inspect AI model. This design allows EvalSense to optimise resource usage by loading the model only when needed and releasing the associated resources afterwards.

While you don't need to worry about these internal details if you are only using default EvalSense evaluators, understanding them can be useful if you want to implement your own model-based evaluation methods. For reference, you can check the [EvalSense implementation of G-Eval](https://github.com/nhsengland/evalsense/blob/main/evalsense/evaluation/evaluators/g_eval.py).
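
The probability weighting used by G-Eval can be sketched as an expected value over the candidate rating tokens. The function below is a simplified illustration rather than EvalSense's code: `weighted_judge_score` is a hypothetical name, the log-probabilities are invented, and details such as multi-token ratings are ignored.

```python
import math


def weighted_judge_score(
    token_logprobs: dict[str, float], min_score: int = 1, max_score: int = 10
) -> float:
    """Average the valid ratings, weighted by their (renormalised) probability."""
    probabilities: dict[int, float] = {}
    for token, logprob in token_logprobs.items():
        token = token.strip()
        # Ignore tokens that are not ratings within the allowed range.
        if token.isdecimal() and min_score <= int(token) <= max_score:
            rating = int(token)
            probabilities[rating] = probabilities.get(rating, 0.0) + math.exp(logprob)
    total = sum(probabilities.values())
    if total == 0:
        raise ValueError("No valid rating tokens found.")
    # Renormalise so that the weights of the valid ratings sum to one.
    return sum(rating * p for rating, p in probabilities.items()) / total


# Invented log-probabilities for the first output token of a judge model.
example_logprobs = {"7": math.log(0.6), "8": math.log(0.3), "The": math.log(0.1)}
print(round(weighted_judge_score(example_logprobs), 2))  # 7.33
```

Taking only the most likely token would yield a score of 7, whereas the weighted score of about 7.33 also reflects the judge's substantial probability mass on 8.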


In [None]:
from evalsense.evaluation.evaluators import get_g_eval_evaluator


detailed_prompt = """You are a medical expert tasked with evaluating the faithfulness of a clinical note generated by a model from doctor-patient dialogue. You will be given:

* The reference note produced by a human expert
* The candidate note produced by the model

Your goal is to determine whether the candidate note is faithful to the reference note, using the following evaluation criteria:

1. **Accuracy of Medical Facts**: All medical conditions, diagnoses, treatments, and test results must be correctly stated and consistent with the reference note.
2. **Completeness of Critical Information**: The note should include all vital information necessary for follow-up care (e.g., key symptoms, diagnoses, procedures, outcomes).
3. **Absence of Hallucinations**: The note should not introduce any information that is not present in the reference note.
4. **Clarity and Non-Misleading Content**: The note should be clear, free of ambiguity, and should not distort or misrepresent any facts.

Instructions:
* Compare the candidate note to the reference note.
* Provide a numerical rating of the candidate note's faithfulness on a scale from 1 (completely unfaithful) to 10 (fully faithful).
* Respond only with the numerical rating without any explanation or context.

Reference Note:
{reference}

Candidate Note:
{prediction}

Output Format:
[Numerical faithfulness rating only, from 1 to 10]"""


brief_prompt = """Assess whether the candidate note faithfully and accurately reflects the content of the reference note.

Provide a numerical rating of the candidate note's faithfulness on a scale from 1 (completely unfaithful) to 10 (fully faithful). Respond only with the numerical rating without any explanation or context.

Reference Note:
{reference}

Candidate Note:
{prediction}

Output Format:
[Numerical faithfulness rating only, from 1 to 10]"""


llama_detailed_g_eval_evaluator = get_g_eval_evaluator(
    quality_name="F. Detail",
    model_name="Llama 3.1 8B",
    prompt_template=detailed_prompt,
    model_config=llama_config,
)
llama_brief_g_eval_evaluator = get_g_eval_evaluator(
    quality_name="F. Brief",
    model_name="Llama 3.1 8B",
    prompt_template=brief_prompt,
    model_config=llama_config,
)
qwen_detailed_g_eval_evaluator = get_g_eval_evaluator(
    quality_name="F. Detail",
    model_name="Qwen 2.5 14B",
    prompt_template=detailed_prompt,
    model_config=qwen_14_config,
)
qwen_brief_g_eval_evaluator = get_g_eval_evaluator(
    quality_name="F. Brief",
    model_name="Qwen 2.5 14B",
    prompt_template=brief_prompt,
    model_config=qwen_14_config,
)
gemma_detailed_g_eval_evaluator = get_g_eval_evaluator(
    quality_name="F. Detail",
    model_name="Gemma 3 27B",
    prompt_template=detailed_prompt,
    model_config=gemma_27_config,
)
gemma_brief_g_eval_evaluator = get_g_eval_evaluator(
    quality_name="F. Brief",
    model_name="Gemma 3 27B",
    prompt_template=brief_prompt,
    model_config=gemma_27_config,
)

Finally, as an example of an even more sophisticated evaluation approach, we define the QAGS scorer. This scorer evaluates the agreement between the factual content of the reference text and that of the model prediction by generating questions and comparing the answers obtained from each text. It requires configuring several different prompts that are used in the process.
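
The core comparison behind a question-answering-based score can be sketched as the fraction of questions for which the answers derived from the prediction and from the reference agree. The helpers below are hypothetical and the answers are invented. EvalSense's QAGS evaluator reports several scores (coverage, groundedness and accuracy appear in the correlation analysis later in this notebook), and its exact aggregation may differ from this simplification.

```python
def normalise_answer(answer: str) -> str:
    """Lower-case an answer and remove surrounding whitespace and full stops."""
    return answer.strip().strip(".").lower()


def answer_agreement(prediction_answers: list[str], reference_answers: list[str]) -> float:
    """Fraction of questions whose two answers match after normalisation."""
    if len(prediction_answers) != len(reference_answers):
        raise ValueError("Both answer lists must cover the same questions.")
    if not prediction_answers:
        raise ValueError("At least one question is required.")
    matches = sum(
        normalise_answer(pred) == normalise_answer(ref)
        for pred, ref in zip(prediction_answers, reference_answers)
    )
    return matches / len(prediction_answers)


# Invented ternary answers to three yes/no questions about a clinical note.
answers_from_prediction = ["Yes", "unknown", "No."]
answers_from_reference = ["Yes", "Yes", "No"]
agreement = answer_agreement(answers_from_prediction, answers_from_reference)
print(round(agreement, 2))  # 0.67
```

In the ternary mode, a direct comparison like this is sufficient because answers are restricted to Yes, No or Unknown. Free-text answers generally cannot be compared by string equality, which is why the judge mode defines an additional answer comparison prompt.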


In [None]:
# `typing.override` requires Python 3.12 or later.
from typing import Any, Literal, override

from evalsense.evaluation.evaluators.qags import QagsConfig, get_qags_evaluator


class TernaryReferenceBasedQagsConfig(QagsConfig):
    def __init__(self):
        super().__init__(
            answer_comparison_mode="ternary",
        )

    @override
    def get_question_generation_prompt(
        self,
        *,
        source: Literal["prediction"] | Literal["reference"],
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        prompt = "You are an expert medical assistant specialised in processing clinical notes. Your task is to formulate a set of close-ended questions (with yes/no answers) that thoroughly cover the information in the provided clinical note. The questions should be clear, self-contained, unambiguous and directly referring to the key points in the note. You should respond with each question on a new line, without any additional comments or explanations (in particular, you should not provide any answers to the questions).\n\n"
        prompt += "Provided Note:\n"
        if source == "prediction":
            prompt += prediction
        elif source == "reference":
            prompt += self.enforce_not_none("reference", reference)
        else:
            raise ValueError("source must be either 'prediction' or 'reference'")
        prompt += """\n\nOutput Format:
[Each close-ended question on a separate line, without any additional comments or explanations]"""
        return prompt

    @override
    def get_answer_generation_prompt(
        self,
        *,
        source: Literal["prediction"] | Literal["reference"],
        question: str,
        prediction: str | None = None,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        prompt = """You are an expert medical assistant specialised in answering close-ended questions about clinical notes. Your task is to provide an answer to the below question based on the provided clinical note. The answer should be a single word out of the following options:
* Yes
* No
* Unknown

You should reply with unknown if the answer cannot reasonably be determined from the note. You should not provide any additional comments or explanations.\n\n"""
        prompt += "Provided Note:\n"
        if source == "prediction":
            prompt += self.enforce_not_none("prediction", prediction)
        elif source == "reference":
            prompt += self.enforce_not_none("reference", reference)
        else:
            raise ValueError("source must be either 'prediction' or 'reference'")
        prompt += "\n\nQuestion:\n" + question
        prompt += """\n\nOutput Format:
[Single word answer: Yes, No, or Unknown]"""
        return prompt


class JudgeReferenceBasedQagsConfig(QagsConfig):
    def __init__(self):
        super().__init__(
            answer_comparison_mode="judge",
        )

    @override
    def get_question_generation_prompt(
        self,
        *,
        source: Literal["prediction"] | Literal["reference"],
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        prompt = "You are an expert medical assistant specialised in processing clinical notes. Your task is to formulate a set of questions that thoroughly cover the information in the provided clinical note. The questions should be clear, self-contained, unambiguous and directly referring to the key points in the note. All questions you devise should only require brief answers (typically a single word or a short phrase). You should respond with each question on a new line, without any additional comments or explanations (in particular, you should not provide any answers to the questions).\n\n"
        prompt += "Provided Note:\n"
        if source == "prediction":
            prompt += prediction
        elif source == "reference":
            prompt += self.enforce_not_none("reference", reference)
        else:
            raise ValueError("source must be either 'prediction' or 'reference'")
        prompt += """\n\nOutput Format:
[Each question on a separate line, without any additional comments or explanations]"""
        return prompt

    @override
    def get_answer_generation_prompt(
        self,
        *,
        source: Literal["prediction"] | Literal["reference"],
        question: str,
        prediction: str | None = None,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        prompt = """You are an expert medical assistant specialised in answering questions about clinical notes. Your task is to provide an answer to the below question based on the provided clinical note.

You should reply with unknown if the answer cannot reasonably be determined from the note. You should not provide any additional comments or explanations — just a direct, succinct answer to the question.\n\n"""
        prompt += "Provided Note:\n"
        if source == "prediction":
            prompt += self.enforce_not_none("prediction", prediction)
        elif source == "reference":
            prompt += self.enforce_not_none("reference", reference)
        else:
            raise ValueError("source must be either 'prediction' or 'reference'")
        prompt += "\n\nQuestion:\n" + question
        prompt += """\n\nOutput Format:
[Succinct answer without any additional comments or explanations]"""
        return prompt

    @override
    def get_answer_comparison_prompt(
        self,
        *,
        question: str,
        prediction_answer: str,
        reference_answer: str,
        input: str | None = None,
        prediction: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        prompt = """You are an expert medical assistant specialised in evaluating answers to questions about clinical notes. Your task is to compare the two answers to the below question and determine whether they convey the same meaning. While the answers may be phrased differently, they should be semantically equivalent. You should respond with a single word answer out of the following options:
* Yes
* No

You should not provide any additional comments or explanations.\n\n"""
        prompt += "Question:\n" + question
        prompt += "\n\nPrediction Answer:\n" + prediction_answer
        prompt += "\n\nReference Answer:\n" + reference_answer
        prompt += """\n\nOutput Format:
[Single word answer: Yes or No]"""
        return prompt


ternary_reference_based_qags_config = TernaryReferenceBasedQagsConfig()
llama_ternary_reference_based_qags_evaluator = get_qags_evaluator(
    name="Ternary Ref. QAGS",
    model_name="Llama 3.1",
    config=ternary_reference_based_qags_config,
    model_config=llama_config,
)
judge_reference_based_qags_config = JudgeReferenceBasedQagsConfig()
llama_judge_reference_based_qags_evaluator = get_qags_evaluator(
    name="Judge Ref. QAGS",
    model_name="Llama 3.1",
    config=judge_reference_based_qags_config,
    model_config=llama_config,
)

## Workflow

To bring everything together, we configure the overall task and experiment pipeline. For more extensive and complex evaluations, multiple experiment batch configurations can be specified.


In [None]:
from evalsense.evaluation import ExperimentBatchConfig, TaskConfig

aci_task_config = TaskConfig(
    dataset_manager=aci_dataset_manager,
    generation_steps=aci_generation,
    field_spec=aci_field_spec,
    task_preprocessor=dialogue_task_preprocessor,
)

experiment_config = ExperimentBatchConfig(
    tasks=[aci_task_config],
    model_configs=[
        gemma_12_config,
        gemma_27_config,
        qwen_14_config,
        llama_config,
        phi_config,
    ],
    evaluators=[
        bleu_evaluator,
        rouge_evaluator,
        bertscore_evaluator,
        llama_detailed_g_eval_evaluator,
        llama_brief_g_eval_evaluator,
        qwen_detailed_g_eval_evaluator,
        qwen_brief_g_eval_evaluator,
        gemma_detailed_g_eval_evaluator,
        gemma_brief_g_eval_evaluator,
        llama_ternary_reference_based_qags_evaluator,
        llama_judge_reference_based_qags_evaluator,
    ],
)

To store all the logs, outputs and results associated with our experiments, we also create a new `Project` object, which is passed to the evaluation pipeline. When initialised, the project automatically loads any previously stored outputs and evaluation results associated with the given project name. This ensures that any completed tasks are not unnecessarily rerun unless explicitly required by setting `force_rerun=True` when calling the pipeline's `run` method.
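
The skip-unless-forced behaviour described above can be pictured with a small in-memory sketch. `ToyProject` and `run_experiments` are hypothetical and unrelated to the real EvalSense classes, which persist their records to disk rather than keeping them in a dictionary.

```python
from typing import Callable


class ToyProject:
    """Hypothetical stand-in that remembers results per (task, model) pair."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.results: dict[tuple[str, str], str] = {}


def run_experiments(
    project: ToyProject,
    experiments: list[tuple[str, str]],
    run_one: Callable[[str, str], str],
    force_rerun: bool = False,
) -> list[tuple[str, str]]:
    """Run only experiments without a stored result, unless forced."""
    executed = []
    for key in experiments:
        if key in project.results and not force_rerun:
            continue  # a stored result exists, so skip this experiment
        project.results[key] = run_one(*key)
        executed.append(key)
    return executed


def fake_run(task: str, model: str) -> str:
    # Placeholder for the actual generation and evaluation work.
    return f"result of {task} with {model}"


toy_project = ToyProject(name="Toy Project")
grid = [("Task A", "Model X"), ("Task A", "Model Y")]

first = run_experiments(toy_project, grid, fake_run)
second = run_experiments(toy_project, grid, fake_run)
forced = run_experiments(toy_project, grid, fake_run, force_rerun=True)
print(len(first), len(second), len(forced))  # 2 0 2
```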


In [None]:
from evalsense.workflow import Pipeline, Project

aci_project = Project(name="ACI-Bench Evaluation")

aci_pipeline = Pipeline(
    experiments=experiment_config,
    project=aci_project,
)

In [None]:
aci_pipeline.run()

## Result Analysis

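The results associated with the project can be analysed using dedicated result analysers. In this example, we first use a simple tabular result analyser, which summarises all the evaluation metrics in a structured table format. We then use a metric correlation analyser to examine how the scores produced by the different metrics relate to each other.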


In [None]:
import polars as pl

from evalsense.workflow.analysers import TabularResultAnalyser

analyser = TabularResultAnalyser[pl.DataFrame](output_format="polars")

In [None]:
summary_results = analyser(aci_project)

In [None]:
summary_results

In [None]:
from evalsense.workflow.analysers import CorrelationResults, MetricCorrelationAnalyser

analyser = MetricCorrelationAnalyser[CorrelationResults[pl.DataFrame]](
    output_format="polars"
)

In [None]:
corr_results = analyser(
    aci_project,
    return_plot=True,
    metric_labels={
        "Ternary Ref. QAGS (Llama 3.1) Coverage": "Ternary Ref. QAGS (Llama 3.1) Cov.",
        "Ternary Ref. QAGS (Llama 3.1) Groundedness": "Ternary Ref. QAGS (Llama 3.1) Ground.",
        "Ternary Ref. QAGS (Llama 3.1) Accuracy": "Ternary Ref. QAGS (Llama 3.1) Acc.",
    },
)