# RadFact Example

Here we show an example of running RadFact to evaluate either findings generation or grounded reporting.

## Endpoint setup

RadFact scores in the MAIRA-2 paper are computed using `Llama-3-70b-Instruct` for entailment verification and GPT-4 for report-to-phrase conversion.

* Edit [`configs/endpoints/azure_chat_openai.yaml`](configs/endpoints/azure_chat_openai.yaml) to configure the endpoints for the Azure Chat API. This will be used by default for parsing the reports into phrases.
* Edit [`configs/endpoints/chat_openai.yaml`](configs/endpoints/chat_openai.yaml) to configure the endpoints for the Chat API. This will be used by default for entailment verification.
* Set the env variable `API_KEY` if you want to use key-based authentication for these endpoints. If you're using multiple endpoints, use a different env variable for each endpoint, e.g., `API_KEY_CHAT_OPENAI` and `API_KEY_AZURE_CHAT_OPENAI`. Make sure to update the corresponding endpoint config files to use these env variable names in `api_key_env_var_name`.
* Update `endpoints` in [`configs/radfact.yaml`](configs/radfact.yaml) and [`configs/report_to_phrases.yaml`](src/report_to_phrases.yaml) to use either `ChatOpenAI` or `AzureChatOpenAI` endpoints as available.

See the [README](README.md#2-endpoint-llm-setup) for more detailed setup instructions.
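Before calling the endpoints, it can help to confirm that the API key variables are actually visible to the notebook process. The following is a minimal sketch, not part of RadFact: `find_missing_env_vars` is a hypothetical helper, and the variable names are the examples mentioned above, so adjust them to whatever you set in `api_key_env_var_name`.

```python
import os

# Example variable names from the setup notes above (an assumption about your
# configuration); they must match `api_key_env_var_name` in your endpoint configs.
REQUIRED_ENV_VARS = ["API_KEY_CHAT_OPENAI", "API_KEY_AZURE_CHAT_OPENAI"]


def find_missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = find_missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing API key variables:", ", ".join(missing))
else:
    print("All API key variables are set.")
```

If you are not using key-based authentication, these variables are not needed and the check can be skipped.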

In [None]:
import json
import pandas as pd
from radfact.data_utils.grounded_phrase_list import GroundedPhraseList
from radfact.metric.radfact import RadFactMetric
from radfact.metric.bootstrapping import MetricBootstrapper
from radfact.metric.print_utils import print_bootstrap_results, print_results
from radfact.paths import EXAMPLES_DIR

## Findings Generation Evaluation

We provide an example csv in [`findings_generation_examples.csv`](examples/findings_generation_examples.csv) with columns `example_id`, `prediction` (model generation), `target` (ground truth).

RadFact expects `candidates` (generations) and `references` (ground truths) as dictionaries keyed by an identifier, typically the study id. We use `example_id` here. The values of `candidates` and `references` are expected to be strings corresponding to the predicted and target findings sections. They are first converted into phrases using the report-to-phrase conversion prompts and then undergo entailment verification to produce the RadFact scores.
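To make the expected input shape concrete, here is a small sketch that builds the two dictionaries from an in-memory CSV using only the standard library. The rows are made up for illustration and are not taken from the example file; the names `toy_candidates` and `toy_references` are hypothetical.

```python
import csv
import io

# Made-up rows for illustration only; not the contents of the example CSV.
csv_text = """example_id,prediction,target
ex-1,No pleural effusion.,No pleural effusion or pneumothorax.
ex-2,The heart size is normal.,The heart is mildly enlarged.
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Identifier -> predicted findings text, and identifier -> target findings text.
toy_candidates = {row["example_id"]: row["prediction"] for row in rows}
toy_references = {row["example_id"]: row["target"] for row in rows}

# Both dictionaries should share the same identifiers.
assert toy_candidates.keys() == toy_references.keys()
print(toy_candidates["ex-1"])
```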

In [None]:
findings_generation_examples = pd.read_csv(EXAMPLES_DIR / 'findings_generation_examples.csv')
display(findings_generation_examples.head(2))
candidates_fg = findings_generation_examples.set_index("example_id")["prediction"].to_dict()
references_fg = findings_generation_examples.set_index("example_id")["target"].to_dict()

For findings generation, we set `is_narrative_text=True` when initialising the metric to instruct it to first perform report-to-phrase conversion.

In [None]:
radfact_metric_for_fg = RadFactMetric(is_narrative_text=True)

`logical_f1_fg` and `radfact_scores_fg` can be obtained directly using the [`compute_metric_score`](radfact/src/radfact/metric/radfact.py#L369) method as shown below.

```python
logical_f1_fg, radfact_scores_fg = radfact_metric_for_fg.compute_metric_score(candidates_fg, references_fg)
```
This calls [`compute_results_per_sample`](radfact/src/radfact/metric/radfact.py#L284) and [`aggregate_results`](radfact/src/radfact/metric/radfact.py#L355) under the hood. However, we break it down explicitly in this example so that we can reuse the per-sample results for bootstrapping.

In [None]:
results_per_sample_fg = radfact_metric_for_fg.compute_results_per_sample(candidates_fg, references_fg)
logical_f1_fg, radfact_scores_fg = radfact_metric_for_fg.aggregate_results(results_per_sample_fg)
logical_f1_fg

We can now look at the results. The only relevant scores for findings generation are `logical_precision`, `logical_recall` and `logical_f1`, since there are no boxes associated with the findings from which to compute the grounding and spatial scores.
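As a rough intuition for these three scores, the sketch below computes precision, recall and F1 from per-phrase entailment flags for a single report pair. This is an illustration under assumptions, not RadFact's implementation: it assumes precision is the fraction of candidate phrases entailed by the reference, recall is the fraction of reference phrases entailed by the candidate, and F1 is their harmonic mean. The function name and the handling of empty inputs are hypothetical, and RadFact's aggregation across samples may differ.

```python
def precision_recall_f1(candidate_entailed, reference_entailed):
    """Compute precision, recall and F1 from per-phrase entailment flags.

    candidate_entailed: one bool per candidate phrase (entailed by the reference?)
    reference_entailed: one bool per reference phrase (entailed by the candidate?)
    Empty inputs score 0.0 here, which is an arbitrary choice for this sketch.
    """
    precision = sum(candidate_entailed) / len(candidate_entailed) if candidate_entailed else 0.0
    recall = sum(reference_entailed) / len(reference_entailed) if reference_entailed else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator > 0 else 0.0
    return precision, recall, f1


# Three of four candidate phrases are entailed; one of two reference phrases is entailed.
p, r, f1 = precision_recall_f1([True, True, False, True], [True, False])
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```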

In [None]:
print("Findings generation RadFact scores:")
print_results(radfact_scores_fg, metrics=["logical_precision", "logical_recall", "logical_f1"])

You can also compute the bootstrap confidence intervals for the scores as shown below.

We set the number of bootstrap samples (`num_samples`) to 10 here because our example dataset is quite small.
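For readers unfamiliar with bootstrapping, the following generic sketch shows a percentile bootstrap over per-sample values using only the standard library. It is not how `MetricBootstrapper` is implemented; `bootstrap_ci` and its parameters are hypothetical, and the input values are made up.

```python
import random
import statistics


def bootstrap_ci(values, num_samples=10, seed=42, alpha=0.05):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(statistics.fmean(rng.choices(values, k=n)) for _ in range(num_samples))
    # Pick the resampled means closest to the alpha/2 and 1 - alpha/2 quantiles.
    lower_idx = int((alpha / 2) * (num_samples - 1))
    upper_idx = int(round((1 - alpha / 2) * (num_samples - 1)))
    return statistics.fmean(values), means[lower_idx], means[upper_idx]


# Made-up per-sample scores for illustration only.
per_sample_scores = [0.5, 0.75, 1.0, 0.6, 0.9]
mean, lower, upper = bootstrap_ci(per_sample_scores)
print(f"mean={mean:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

With so few resamples the interval is coarse; a larger number of resamples gives a more stable estimate on realistically sized datasets.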

In [None]:
bootstrapper = MetricBootstrapper(metric=radfact_metric_for_fg, num_samples=10, seed=42)
radfact_scores_fg_with_cis = bootstrapper.compute_bootstrap_metrics(results_per_sample=results_per_sample_fg)

We can now inspect the results with the confidence intervals.

In [None]:
print("Findings generation RadFact scores (95% CI):")
print_bootstrap_results(radfact_scores_fg_with_cis, metrics=["logical_precision", "logical_recall", "logical_f1"])

## Grounded Reporting Evaluation

For grounded reporting, it's easiest to store model generations and ground truth in JSON format to accommodate both text and boxes. Each grounded report is represented as a list of dicts representing individual sentences, each with `text` and `boxes` keys. The `boxes` are `None` for non-grounded sentences. As for findings generation, the model generations are under `prediction` and the ground truth is under `target`.

Refer to [`grounded_reporting_examples.json`](examples/grounded_reporting_examples.json) for examples of the expected JSON format.
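The structure described above can be sketched as a plain Python object together with a simple shape check. This is illustrative only: `validate_grounded_report` is a hypothetical helper, the sentences are made up, and the box field names (`x_min`, `y_min`, `x_max`, `y_max`) are an assumption, so consult the example file for the actual box format.

```python
def validate_grounded_report(sentences):
    """Check that every sentence dict has a string `text` and a `boxes` key."""
    for i, sentence in enumerate(sentences):
        missing = {"text", "boxes"} - set(sentence)
        if missing:
            raise ValueError(f"Sentence {i} is missing keys: {sorted(missing)}")
        if not isinstance(sentence["text"], str):
            raise ValueError(f"Sentence {i} has a non-string `text` value")
    return True


# Hypothetical example entry; box field names are assumed, not confirmed.
example = {
    "example_id": "ex-1",
    "prediction": [
        {"text": "No pleural effusion.", "boxes": None},  # non-grounded sentence
        {
            "text": "Opacity in the left lower lobe.",
            "boxes": [{"x_min": 0.5, "y_min": 0.6, "x_max": 0.8, "y_max": 0.9}],
        },
    ],
    "target": [
        {"text": "No pleural effusion.", "boxes": None},
    ],
}

print(validate_grounded_report(example["prediction"]))
print(validate_grounded_report(example["target"]))
```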

From this JSON we can parse examples easily into `GroundedPhraseList`, which is expected by RadFact.

In [None]:
with open(EXAMPLES_DIR / 'grounded_reporting_examples.json', "r", encoding="utf-8") as f:
    grounded_reporting_examples = json.load(f)
candidates_gr = {
    example["example_id"]: GroundedPhraseList.from_list_of_dicts(example["prediction"])
    for example in grounded_reporting_examples
}
references_gr = {
    example["example_id"]: GroundedPhraseList.from_list_of_dicts(example["target"])
    for example in grounded_reporting_examples
}
print("Loaded", len(grounded_reporting_examples), "grounded reporting examples")

When operating on grounded reports, represented as `GroundedPhraseList`, we do not need to set `is_narrative_text=True` in the metric. With already-parsed reports, no step to convert reports into phrases is required. `is_narrative_text` is set to `False` by default.

In [None]:
radfact_metric_for_gr = RadFactMetric()

Similarly to findings generation, we can compute the metric scores and confidence intervals for grounded reporting.

As before, we break down the computation so that we can reuse the per-sample results for bootstrapping.

In [None]:
results_per_sample_gr = radfact_metric_for_gr.compute_results_per_sample(candidates_gr, references_gr)
logical_f1_gr, radfact_scores_gr = radfact_metric_for_gr.aggregate_results(results_per_sample_gr)
logical_f1_gr

Since this is grounded reporting, we look at all the metrics returned by RadFact including grounding and spatial scores.

In [None]:
metrics = [
    "logical_precision",
    "logical_recall",
    "logical_f1",
    "spatial_precision",
    "spatial_recall",
    "spatial_f1",
    "grounding_precision",
    "grounding_recall",
    "grounding_f1",
]
print("Grounded reporting RadFact scores:")
print_results(radfact_scores_gr, metrics=metrics)


We can compute the bootstrap confidence intervals for the scores similarly.

In [None]:
bootstrapper = MetricBootstrapper(metric=radfact_metric_for_gr, num_samples=10, seed=42)
radfact_scores_gr_with_cis = bootstrapper.compute_bootstrap_metrics(results_per_sample=results_per_sample_gr)

We can now inspect the metrics with the confidence intervals.

In [None]:
print("Grounded reporting RadFact scores (95% CI):")
print_bootstrap_results(radfact_scores_gr_with_cis, metrics=metrics)