# LLMScope Demo Notebook

This notebook illustrates a basic application of LLMScope to the ACI-Bench dataset.


Before using LLMScope, you may need to configure several environment variables, especially when working with local models. Some commonly used variables are described below:

- `LLMSCOPE_STORAGE_DIR` — Specifies the directory where LLMScope should store datasets, models, logs and results. If not set, it defaults to a platform-specific application cache folder.
- `CUDA_HOME` — May be required when using vLLM local models with NVIDIA GPUs. If not already defined in your environment, this should point to the root of your CUDA installation (e.g., `/usr/local/cuda-12.4.0`).
- `TORCH_CUDA_ARCH_LIST` — Helps suppress CUDA-related warnings when using local models with NVIDIA GPUs. You can determine the appropriate value(s) by running:
  ```
  nvidia-smi --query-gpu=compute_cap --format=csv
  ```
- `TOKENIZERS_PARALLELISM` — It is recommended to set this to `true` when using local models. However, if you encounter issues related to parallelism, you may need to set it to `false`.
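
The `nvidia-smi` query above prints a header line followed by one compute capability per GPU. As a minimal sketch, the hypothetical helper below (`arch_list_from_csv` is not part of LLMScope) turns that output into a value suitable for `TORCH_CUDA_ARCH_LIST`, assuming the CSV output starts with a `compute_cap` header row. The sample output is illustrative rather than captured from a real machine.

```python
def arch_list_from_csv(csv_text: str) -> str:
    """Convert `nvidia-smi --query-gpu=compute_cap --format=csv` output
    into a space-separated list of unique compute capabilities."""
    lines = [line.strip() for line in csv_text.splitlines() if line.strip()]
    # Drop the header row; the remaining rows hold one capability per GPU.
    caps = [line for line in lines if line != "compute_cap"]
    # Deduplicate and sort numerically by (major, minor) version.
    unique = sorted(set(caps), key=lambda cap: tuple(int(p) for p in cap.split(".")))
    return " ".join(unique)


# Illustrative output for a machine with three GPUs.
sample_output = "compute_cap\n8.0\n8.0\n8.6\n"
print(arch_list_from_csv(sample_output))  # prints: 8.0 8.6
```

The resulting string can then be assigned to `os.environ["TORCH_CUDA_ARCH_LIST"]` before any local model is loaded.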


In [None]:
%load_ext autoreload
%autoreload 2

import os

os.environ["LLMSCOPE_STORAGE_DIR"] = "/vol/bitbucket/ad5518/llmscope_cache"
os.environ["CUDA_HOME"] = "/vol/cuda/12.4.0/"
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"
os.environ["TOKENIZERS_PARALLELISM"] = "true"

## Dataset

As a first step, we need to define a dataset. In this case, we will use the built-in [ACI-Bench dataset](https://www.nature.com/articles/s41597-023-02487-3).


In [None]:
from llmscope.datasets.managers import AciBenchDatasetManager

aci_dataset_manager = AciBenchDatasetManager(splits=["train"])

## Generation Steps

The next step is to define how the data should be used to generate model outputs, including the prompts involved. This is achieved using the Solvers abstraction provided by the Inspect AI library. A list of available solvers can be found in the [official documentation](https://inspect.aisi.org.uk/solvers.html#built-in-solvers).

Solvers can be composed into a sequence, allowing multiple steps to be chained together, as illustrated below. The full generation steps need to be assigned a unique name — this enables the comparison of results across different experiments when varying prompts or solver configurations.


In [None]:
from inspect_ai.solver import generate, prompt_template, system_message

from llmscope.generation import GenerationSteps

system_prompt_template = "You are an expert clinical assistant specialising in the creation of medically accurate summaries from a dialogue between the doctor and patient."
user_prompt_template = """Your task is to generate a clinical note based on a conversation between a doctor and a patient. Use the following format for the clinical note:

1. **CHIEF COMPLAINT**: [Brief description of the main reason for the visit]
2. **HISTORY OF PRESENT ILLNESS**: [Summary of the patient's current health status and any changes since the last visit]
3. **REVIEW OF SYSTEMS**: [List of symptoms reported by the patient]
4. **PHYSICAL EXAMINATION**: [Findings from the physical examination]
5. **RESULTS**: [Relevant test results]
6. **ASSESSMENT AND PLAN**: [Doctor's assessment and plan for treatment or further testing]

**Conversation:**
{prompt}

**Note:**
"""

aci_generation = GenerationSteps(
    name="Structured",
    steps=[
        system_message(system_prompt_template),
        prompt_template(user_prompt_template),
        generate(),
    ],
)
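
To make the role of the `{prompt}` placeholder concrete, the sketch below fills a shortened template with a dialogue using plain string replacement. It only illustrates the idea: `render_prompt` and the example strings are hypothetical, and the actual `prompt_template` solver may handle substitution differently (for instance, by supporting additional variables).

```python
def render_prompt(template: str, dialogue: str) -> str:
    """Substitute the dialogue for the `{prompt}` placeholder."""
    # str.replace is used instead of str.format so that any other braces
    # appearing in the template are left untouched.
    return template.replace("{prompt}", dialogue)


short_template = "**Conversation:**\n{prompt}\n\n**Note:**\n"
dialogue = "[doctor] How are you feeling today?\n[patient] My knee still hurts."

print(render_prompt(short_template, dialogue))
```

In the notebook itself, the dialogue comes from the dataset column selected as the input in the task specification.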

## Task Spec

In addition to selecting the dataset to be used, it is also necessary to specify which columns will serve as inputs, targets (i.e., ground-truth references, if available) and sample IDs. Optionally, additional columns can be retained as metadata, for example if they are useful for result analysis.

Some datasets may support multiple tasks, in which case a preprocessing step may be needed to adapt the data for a specific task. In this case, no task-specific adjustments are needed, so we use the default preprocessor that returns the data unchanged.


In [None]:
from inspect_ai.dataset import FieldSpec

from llmscope.tasks import DefaultTaskPreprocessor

aci_field_spec = FieldSpec(
    input="dialogue",
    target="note",
    id="id",
    metadata=[
        "dataset",
        "encounter_id",
        "doctor_name",
        "patient_gender",
        "patient_age",
        "patient_firstname",
        "patient_familyname",
        "cc",
        "2nd_complaints",
    ],
)
dialogue_task_preprocessor = DefaultTaskPreprocessor(name="Dialogue")
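
Conceptually, the field specification tells the loader how to turn each dataset record into a sample. The following sketch mimics that mapping with a plain dataclass. `Sample`, `to_sample` and the example record are illustrative stand-ins, not the Inspect AI implementation.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Sample:
    """Simplified stand-in for an evaluation sample."""
    input: str
    target: str
    id: str
    metadata: dict[str, Any] = field(default_factory=dict)


def to_sample(record: dict[str, Any], input_col: str, target_col: str,
              id_col: str, metadata_cols: list[str]) -> Sample:
    """Pick the configured columns out of a raw record."""
    return Sample(
        input=record[input_col],
        target=record[target_col],
        id=str(record[id_col]),
        # Only the listed columns that exist in the record are kept as metadata.
        metadata={col: record[col] for col in metadata_cols if col in record},
    )


record = {
    "id": "example-0",
    "dialogue": "[doctor] What brings you in today?",
    "note": "CHIEF COMPLAINT: Knee pain.",
    "patient_age": 54,
    "unused_column": "dropped",
}
sample = to_sample(record, "dialogue", "note", "id", ["patient_age", "cc"])
print(sample.metadata)  # prints: {'patient_age': 54}
```

Columns that are not listed anywhere in the specification, such as `unused_column` here, simply do not appear in the resulting sample.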

## Model Config

Next, we specify the model(s) to be used in our experiment. We can use any model and model provider supported by Inspect AI, configurable through model arguments and generation configuration. For an overview of supported models and their arguments, please refer to the related [Inspect AI documentation](https://inspect.aisi.org.uk/models.html). In this demo, we use two local vLLM models: [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and [Phi 4 Mini Instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct).


In [None]:
from inspect_ai.model import GenerateConfigArgs

from llmscope.constants import MODELS_PATH
from llmscope.generation import ModelConfig

llama_config = ModelConfig(
    "vllm/meta-llama/Llama-3.1-8B-Instruct",
    model_args={
        "download_dir": MODELS_PATH,
        "device": "2",
        "gpu_memory_utilization": 0.8,
        "max_model_len": 8192,
    },
    generation_args=GenerateConfigArgs(
        seed=42,
        temperature=0.7,
        top_p=0.95,
        max_connections=128,
    ),
)

phi_config = ModelConfig(
    "vllm/microsoft/Phi-4-mini-instruct",
    model_args={
        "download_dir": MODELS_PATH,
        "device": "2",
        "gpu_memory_utilization": 0.8,
        "max_model_len": 8192,
    },
    generation_args=GenerateConfigArgs(
        seed=42,
        temperature=0.7,
        top_p=0.95,
        max_connections=128,
    ),
)

## Evaluation

In this step, we define the evaluation metrics. Here, we simply use the BLEU, ROUGE and BERTScore metrics as implemented in LLMScope. Custom metrics are also supported, provided that they are implemented as Inspect AI-compatible [scorers](https://inspect.aisi.org.uk/scorers.html#custom-scorers) and [metrics](https://inspect.aisi.org.uk/reference/inspect_ai.scorer.html#metric). For reference and inspiration, please check the [LLMScope BLEU implementation](https://github.com/nhsengland/llmscope/blob/main/llmscope/evaluation/evaluators/bleu.py) and the [relevant Inspect AI documentation](https://inspect.aisi.org.uk/scorers.html).


In [None]:
from llmscope.evaluation.evaluators import (
    get_bertscore_evaluator,
    get_bleu_evaluator,
    get_rouge_evaluator,
)

bleu_evaluator = get_bleu_evaluator()
rouge_evaluator = get_rouge_evaluator()
bertscore_evaluator = get_bertscore_evaluator()
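
For intuition about what overlap-based metrics such as BLEU and ROUGE measure, the sketch below computes a ROUGE-1-style F1 score from clipped unigram overlap. It is a simplified illustration with whitespace tokenisation, not how LLMScope computes ROUGE, and `unigram_f1` is a hypothetical name.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """ROUGE-1-style F1 based on clipped unigram overlap."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# All 4 predicted tokens occur in the 6-token reference:
# precision = 1.0, recall = 4/6, so F1 is approximately 0.8.
print(unigram_f1("patient reports knee pain", "the patient reports left knee pain"))
```

Scores of this kind reward lexical overlap only, which is one motivation for complementing them with embedding-based or model-based metrics.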

To demonstrate the use of more advanced metrics, we also include [G-Eval](https://arxiv.org/abs/2303.16634), an LLM-as-a-judge approach that uses structured prompts and weights judge scores based on token output probabilities, resulting in more stable evaluation scores. Unlike simpler metrics, G-Eval requires additional configuration, such as a function for constructing the prompts as well as additional hyperparameters.

Since G-Eval is a model-based metric, we also need to supply a model configuration so that LLMScope can instantiate the appropriate model during evaluation. Internally, LLMScope handles this evaluator differently from simpler metrics by using a `ScorerFactory`, an abstraction that creates a scorer when passed an Inspect AI model. This design allows LLMScope to optimise resource usage by loading the model only when needed and releasing the associated resources afterwards.

While you don't need to worry about these internal details if you are only using default LLMScope evaluators, understanding them can be useful if you want to implement your own model-based evaluation methods. For reference, you can check the [LLMScope implementation of G-Eval](https://github.com/nhsengland/llmscope/blob/main/llmscope/evaluation/evaluators/g_eval.py).
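
The factory idea can be pictured with a small, self-contained sketch. Everything here (`FakeModel`, `loaded_model`, `make_word_count_scorer`, `score_with_factory`) is hypothetical and far simpler than the actual `ScorerFactory`. It only shows the pattern of building a scorer from a model that is loaded on demand and released afterwards.

```python
from contextlib import contextmanager
from typing import Callable, Iterator

events: list[str] = []  # records the model lifecycle for illustration


class FakeModel:
    """Stand-in for an expensive judge model."""

    def rate(self, text: str) -> float:
        return float(len(text.split()))


@contextmanager
def loaded_model() -> Iterator[FakeModel]:
    """Load the model on entry and release it on exit."""
    events.append("load")
    try:
        yield FakeModel()
    finally:
        events.append("release")


# A factory takes a model and returns a scoring function.
ScorerFactoryFn = Callable[[FakeModel], Callable[[str], float]]


def make_word_count_scorer(model: FakeModel) -> Callable[[str], float]:
    return lambda prediction: model.rate(prediction)


def score_with_factory(factory: ScorerFactoryFn, predictions: list[str]) -> list[float]:
    # The model only exists for the duration of this block.
    with loaded_model() as model:
        scorer = factory(model)
        return [scorer(p) for p in predictions]


scores = score_with_factory(make_word_count_scorer, ["one two", "three"])
print(scores, events)  # prints: [2.0, 1.0] ['load', 'release']
```

Because the evaluator is described by a factory rather than a ready-made scorer, nothing expensive is allocated until scoring actually starts.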


In [None]:
from llmscope.evaluation.evaluators import get_g_eval_evaluator


def construct_prompt(prediction: str, reference: str | None, **kwargs: dict) -> str:
    if reference is None:
        raise ValueError(
            "The prompt template for evaluating faithfulness requires a reference input."
        )

    prompt = f"""
You are a medical expert tasked with evaluating the faithfulness of a clinical note generated by a model from doctor-patient dialogue. You will be given:

* The reference note produced by a human expert
* The candidate note produced by the model

Your goal is to determine whether the candidate note is faithful to the reference note, using the following evaluation criteria:

1. **Accuracy of Medical Facts**: All medical conditions, diagnoses, treatments, and test results must be correctly stated and consistent with the reference note.
2. **Completeness of Critical Information**: The candidate note should include all vital information necessary for follow-up care (e.g., key symptoms, diagnoses, procedures, outcomes).
3. **Absence of Hallucinations**: The candidate note should not introduce any information that is not present in the reference note.
4. **Clarity and Non-Misleading Content**: The candidate note should be clear, free of ambiguity, and should not distort or misrepresent any facts.

Instructions:
* Compare the candidate note to the reference note.
* Provide a numerical rating of the candidate note's faithfulness on a scale from 1 (completely unfaithful) to 10 (fully faithful).
* Respond only with the numerical rating without any explanation or context.

Reference Note:

{reference}

Candidate Note:

{prediction}

Output Format:
[Numerical faithfulness rating only, from 1 to 10]"""

    return prompt


g_eval_evaluator = get_g_eval_evaluator(
    quality_name="Faithfulness",
    prompt_template=construct_prompt,
    model_config=llama_config,
    min_score=1,
    max_score=10,
    normalise=True,
)
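
The probability weighting used by G-Eval can be sketched in a few lines. Instead of taking the single rating the judge emits, the score is the expectation over all candidate ratings under the judge's token probabilities. The helper below is an illustrative assumption, not LLMScope's code: `weighted_score` is a hypothetical name, the probabilities are made up, and the linear rescaling to [0, 1] is one plausible reading of `normalise=True`.

```python
def weighted_score(rating_probs: dict[int, float], min_score: int = 1,
                   max_score: int = 10, normalise: bool = True) -> float:
    """Expected rating under the judge's probabilities for each rating token."""
    # Keep only ratings inside the valid range and renormalise their mass.
    valid = {r: p for r, p in rating_probs.items() if min_score <= r <= max_score}
    total = sum(valid.values())
    if total == 0:
        raise ValueError("No probability mass on valid ratings.")
    expected = sum(r * p for r, p in valid.items()) / total
    if normalise:
        # Map [min_score, max_score] linearly onto [0, 1] (assumed behaviour).
        return (expected - min_score) / (max_score - min_score)
    return expected


# Made-up judge probabilities for the rating tokens "7", "8" and "9".
probs = {7: 0.25, 8: 0.5, 9: 0.25}
print(weighted_score(probs, normalise=False))  # prints: 8.0
```

Averaging over the rating distribution yields fractional scores, which tends to separate outputs that would otherwise receive the same integer rating.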

## Workflow

To bring everything together, we configure the overall task and experiment pipeline. For more extensive and complex evaluations, multiple experiment batch configurations can be specified.


In [None]:
from llmscope.evaluation import ExperimentBatchConfig, TaskConfig

aci_task_config = TaskConfig(
    dataset_manager=aci_dataset_manager,
    generation_steps=aci_generation,
    field_spec=aci_field_spec,
    task_preprocessor=dialogue_task_preprocessor,
)

experiment_config = ExperimentBatchConfig(
    tasks=[aci_task_config],
    model_configs=[llama_config, phi_config],
    evaluators=[bleu_evaluator, rouge_evaluator, bertscore_evaluator, g_eval_evaluator],
)
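
A batch configuration can be thought of as shorthand for every combination of its tasks, models and evaluators. The sketch below enumerates those combinations with plain strings, under the assumption that the batch expands as a Cartesian product. The labels are placeholders rather than LLMScope objects.

```python
from itertools import product

tasks = ["ACI-Bench / Structured"]
models = ["Llama-3.1-8B-Instruct", "Phi-4-mini-instruct"]
evaluators = ["BLEU", "ROUGE", "BERTScore", "G-Eval Faithfulness"]

# One generation run per (task, model) pair and
# one evaluation per (task, model, evaluator) triple.
generations = list(product(tasks, models))
evaluations = list(product(tasks, models, evaluators))

print(len(generations), len(evaluations))  # prints: 2 8
```

Under this reading, adding a second set of generation steps as another task would double both counts.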

To store all the logs, outputs and results associated with our experiments, we also create a new `Project` object, which is passed to the evaluation pipeline. When initialised, the project automatically loads any previously stored outputs and evaluation results associated with the given project name. This ensures that any completed tasks are not unnecessarily rerun unless explicitly required by setting `force_rerun=True` when calling the pipeline's `run` method.


In [None]:
from llmscope.workflow import Pipeline, Project

aci_project = Project(name="ACI-Bench Evaluation")

aci_pipeline = Pipeline(
    experiments=experiment_config,
    project=aci_project,
)

In [None]:
aci_pipeline.run()
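
The rerun-avoidance behaviour of the project can be illustrated with a toy cache. This is a sketch of the general idea only: `ResultCache` and its `run` method are hypothetical, and the real project tracks considerably more state (logs, outputs and evaluation results).

```python
from typing import Callable


class ResultCache:
    """Toy stand-in for a project's stored results."""

    def __init__(self) -> None:
        self._results: dict[tuple[str, str], float] = {}
        self.computed = 0  # counts how often work was actually done

    def run(self, task: str, model: str, compute: Callable[[], float],
            force_rerun: bool = False) -> float:
        key = (task, model)
        # Reuse a stored result unless a rerun is explicitly requested.
        if key in self._results and not force_rerun:
            return self._results[key]
        self.computed += 1
        self._results[key] = compute()
        return self._results[key]


cache = ResultCache()
cache.run("aci", "llama", lambda: 0.5)                    # computed
cache.run("aci", "llama", lambda: 0.5)                    # served from cache
cache.run("aci", "llama", lambda: 0.5, force_rerun=True)  # recomputed
print(cache.computed)  # prints: 2
```

Calling `aci_pipeline.run()` a second time should therefore be cheap, whereas passing `force_rerun=True` repeats the work.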

## Result Analysis

The results associated with the project can be analysed using dedicated result analysers. In this example, we use a simple tabular result analyser, which summarises all the evaluation metrics in a structured table format.


In [None]:
import polars as pl

from llmscope.workflow.analysers import TabularResultAnalyser

analyser = TabularResultAnalyser[pl.DataFrame](output_format="polars")

In [None]:
analyser(aci_project)