# DSCagg Calculation Example for HNTS-MRG 2024

This notebook demonstrates how to calculate the evaluation metric, the aggregated Dice similarity coefficient (DSCagg), for the [HNTS-MRG 2024 Challenge](https://hntsmrg24.grand-challenge.org/). More information on the evaluation can be found [here](https://hntsmrg24.grand-challenge.org/tasks-and-evaluation/). The evaluation functions are encapsulated in a Docker container image, which will be run on the outputs of participants' submitted algorithms.

This example uses a subset of masks from the HNTS-MRG 2024 training dataset, available on [Zenodo](https://zenodo.org/records/11199559). We use 20 out of 150 training case masks as a proof of concept.

Credit to the HECKTOR 2022 organizers; most of this code is directly based on their [GitHub implementations](https://github.com/voreille/hecktor/blob/master/notebooks/evaluate_segmentation2022.ipynb).

## Table of Contents

1. [Imports](#Imports)
2. [Functions](#Functions)
3. [DSCagg Calculation](#dscagg-calculation)
4. [Extra: Conventional DSC Calculation](#extra-conventional-dsc-calculation)

## Imports

In [1]:
import SimpleITK as sitk
import os
import numpy as np

## Functions

In [2]:
def compute_volumes(im):
    """
    Compute the volumes of the GTVp and the GTVn
    """
    spacing = im.GetSpacing()
    voxvol = spacing[0] * spacing[1] * spacing[2]
    stats = sitk.LabelStatisticsImageFilter()
    stats.Execute(im, im)
    nvoxels1 = stats.GetCount(1)
    nvoxels2 = stats.GetCount(2)
    return nvoxels1 * voxvol, nvoxels2 * voxvol

def compute_agg_dice(intermediate_results):
    """
    Compute the aggregate dice score from the intermediate results
    """
    aggregate_results = {}
    TP1s = [v["TP1"] for v in intermediate_results]
    TP2s = [v["TP2"] for v in intermediate_results]
    vol_sum1s = [v["vol_sum1"] for v in intermediate_results]
    vol_sum2s = [v["vol_sum2"] for v in intermediate_results]
    DSCagg1 = 2 * np.sum(TP1s) / np.sum(vol_sum1s)
    DSCagg2 = 2 * np.sum(TP2s) / np.sum(vol_sum2s)
    aggregate_results['AggregatedDsc'] = {
        'GTVp': DSCagg1,
        'GTVn': DSCagg2,
        'mean': np.mean((DSCagg1, DSCagg2)),
    }
    return aggregate_results

def get_intermediate_metrics(patient_ID, groundtruth, prediction):
    """
    Compute intermediate metrics for a given groundtruth and prediction.
    These metrics are used to compute the aggregate dice.
    """
    overlap_measures = sitk.LabelOverlapMeasuresImageFilter()
    overlap_measures.SetNumberOfThreads(1)
    overlap_measures.Execute(groundtruth, prediction)

    DSC1 = overlap_measures.GetDiceCoefficient(1)
    DSC2 = overlap_measures.GetDiceCoefficient(2)

    vol_gt1, vol_gt2 = compute_volumes(groundtruth)
    vol_pred1, vol_pred2 = compute_volumes(prediction)

    vol_sum1 = vol_gt1 + vol_pred1
    vol_sum2 = vol_gt2 + vol_pred2
    TP1 = DSC1 * (vol_sum1) / 2
    TP2 = DSC2 * (vol_sum2) / 2
    return {
        "PatientID": patient_ID, # added patient ID so we can pinpoint exact results if needed
        "TP1": TP1,
        "TP2": TP2,
        "vol_sum1": vol_sum1,
        "vol_sum2": vol_sum2,
        "DSC1": DSC1,
        "DSC2": DSC2,
        "vol_gt1": vol_gt1, # needed if you want to exclude empty ground truths in conventional DSC calcs
        "vol_gt2": vol_gt2, 
    }

def check_prediction(groundtruth, prediction):
    """
    Check that the prediction uses valid labels and has the same
    spacing as the ground truth (cropping/resampling is currently disabled)
    """

    # Cast to the same type
    caster = sitk.CastImageFilter()
    caster.SetOutputPixelType(sitk.sitkUInt8)
    caster.SetNumberOfThreads(1)
    groundtruth = caster.Execute(groundtruth)
    prediction = caster.Execute(prediction)

    # Check labels
    stats = sitk.LabelStatisticsImageFilter()
    stats.Execute(prediction, prediction)
    labels = stats.GetLabels()
    if not all(label in (0, 1, 2) for label in labels):
        raise RuntimeError(
            "The labels are incorrect. The labels should be background: 0, GTVp: 1, GTVn: 2."
        )
    # Check spacings
    if not np.allclose(
            groundtruth.GetSpacing(), prediction.GetSpacing(), atol=0.000001):
        raise RuntimeError(
            "The resolution of the prediction is different from the MRI ground truth resolution."
        )
    else:
        print('prediction checked, no errors found')
        # to be sure that sitk won't trigger unnecessary errors
        prediction.SetSpacing(groundtruth.GetSpacing())

    # resample_prediction (from the HECKTOR code) would crop the prediction to the same size as the groundtruth;
    # we don't want this returned for now, so the call is commented out
    # return resample_prediction(groundtruth, prediction)
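
`get_intermediate_metrics` does not count overlapping voxels directly. It recovers the overlap volume from the Dice coefficient: since DSC = 2 * overlap / (vol_gt + vol_pred), multiplying DSC by the volume sum and halving gives the overlap back. The following is a minimal pure-Python sketch of that identity. The voxel sets and the voxel volume are made-up values for illustration, not challenge data.

```python
# Toy masks represented as sets of voxel indices (hypothetical data).
gt_voxels = {0, 1, 2, 3, 4, 5}    # 6 ground-truth voxels
pred_voxels = {4, 5, 6, 7}        # 4 predicted voxels
voxel_volume = 2.0                # assumed volume of a single voxel

# Direct computation of the overlap (true-positive) volume.
intersection = len(gt_voxels & pred_voxels) * voxel_volume

# Quantities that the notebook's functions work with.
vol_gt = len(gt_voxels) * voxel_volume
vol_pred = len(pred_voxels) * voxel_volume
vol_sum = vol_gt + vol_pred
dsc = 2 * intersection / vol_sum

# Same rearrangement as in get_intermediate_metrics: TP = DSC * vol_sum / 2.
tp_recovered = dsc * vol_sum / 2

print(f"direct overlap: {intersection}, recovered from DSC: {tp_recovered}")
```

Because the overlap is reconstructed through a floating-point division and multiplication, it can differ from the directly counted value in the last digits.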

## DSCagg Calculation

Remember that DSCagg is calculated over the entire dataset, so it yields a single value per structure rather than patient-level data points as with conventional volumetric DSC.
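
To make the difference concrete, here is a small sketch that contrasts the aggregated score with the mean of per-patient scores. It mirrors the arithmetic in `compute_agg_dice` (2 * sum of TP divided by the sum of volume sums), but the helper names `dsc_agg` and `mean_dsc` and the two toy cases are hypothetical and chosen only for illustration.

```python
def dsc_agg(cases):
    """Aggregated DSC: 2 * sum(TP) / sum(vol_sum) over all cases."""
    total_tp = sum(c["TP"] for c in cases)
    total_vol = sum(c["vol_sum"] for c in cases)
    return 2 * total_tp / total_vol


def mean_dsc(cases):
    """Conventional approach: compute DSC per case, then average."""
    per_case = [2 * c["TP"] / c["vol_sum"] for c in cases]
    return sum(per_case) / len(per_case)


# Hypothetical intermediate results for one structure.
toy_cases = [
    {"TP": 900.0, "vol_sum": 2000.0},  # large structure, per-case DSC 0.9
    {"TP": 1.0, "vol_sum": 20.0},      # small structure, per-case DSC 0.1
]

agg = dsc_agg(toy_cases)
avg = mean_dsc(toy_cases)
print(f"DSCagg: {agg:.3f}, mean per-case DSC: {avg:.3f}")
```

In this toy setup the aggregated score is dominated by the large structure, while the per-case mean weights both cases equally.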

In this example, the ground truth masks are the mid-RT masks, and the "prediction" masks are the registered pre-RT masks.

In [3]:
# first set up the ground truth and prediction paths

prediction_folder = 'prediction_masks'
groundtruth_folder = 'groundtruth_masks'

prediction_files = [os.path.join(prediction_folder, file) for file in os.listdir(prediction_folder) if file.endswith(".nii.gz")]
groundtruth_files = [os.path.join(groundtruth_folder, file) for file in os.listdir(groundtruth_folder) if file.endswith(".nii.gz")]

print("Prediction files", prediction_files, "\n")

print("Ground truth files", groundtruth_files)

Prediction files ['prediction_masks/81_preRT_mask_registered.nii.gz', 'prediction_masks/94_preRT_mask_registered.nii.gz', 'prediction_masks/99_preRT_mask_registered.nii.gz', 'prediction_masks/84_preRT_mask_registered.nii.gz', 'prediction_masks/91_preRT_mask_registered.nii.gz', 'prediction_masks/77_preRT_mask_registered.nii.gz', 'prediction_masks/78_preRT_mask_registered.nii.gz', 'prediction_masks/88_preRT_mask_registered.nii.gz', 'prediction_masks/86_preRT_mask_registered.nii.gz', 'prediction_masks/93_preRT_mask_registered.nii.gz', 'prediction_masks/90_preRT_mask_registered.nii.gz', 'prediction_masks/96_preRT_mask_registered.nii.gz', 'prediction_masks/8_preRT_mask_registered.nii.gz', 'prediction_masks/95_preRT_mask_registered.nii.gz', 'prediction_masks/80_preRT_mask_registered.nii.gz'] 

Ground truth files ['groundtruth_masks/91_midRT_mask.nii.gz', 'groundtruth_masks/88_midRT_mask.nii.gz', 'groundtruth_masks/94_midRT_mask.nii.gz', 'groundtruth_masks/90_midRT_mask.nii.gz', 'groundtruth_
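
The two listings are not in the same order, so each prediction has to be matched to its ground truth through the patient ID at the start of the file name. As a sketch of one way to do this, the snippet below builds a lookup keyed by patient ID and reports predictions without a partner. The helper names `patient_id` and `pair_by_patient` are hypothetical, and the sketch assumes every file name begins with `<ID>_`, as in the listings above.

```python
import os


def patient_id(path):
    """Return the part of the file name before the first underscore."""
    return os.path.basename(path).split("_")[0]


def pair_by_patient(prediction_paths, groundtruth_paths):
    """Map patient ID -> (prediction path, ground truth path).

    Also returns the IDs of predictions that have no ground truth file.
    """
    gt_by_id = {patient_id(p): p for p in groundtruth_paths}
    pairs, missing = {}, []
    for pred in prediction_paths:
        pid = patient_id(pred)
        if pid in gt_by_id:
            pairs[pid] = (pred, gt_by_id[pid])
        else:
            missing.append(pid)
    return pairs, missing


# Small illustrative subset; the second prediction deliberately has no partner.
preds = [
    "prediction_masks/81_preRT_mask_registered.nii.gz",
    "prediction_masks/8_preRT_mask_registered.nii.gz",
]
gts = ["groundtruth_masks/81_midRT_mask.nii.gz"]

pairs, missing = pair_by_patient(preds, gts)
print(pairs)
print("No ground truth found for:", missing)
```

Comparing the full ID string, rather than checking whether the ID occurs anywhere in the path, keeps patient `8` from being confused with patient `81`.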

Note that the evaluation loop below emits some `LabelOverlapMeasuresImageFilter` warnings when a label is present in neither the ground truth file nor the prediction file. This is not an error; it is expected behavior for this setup.
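
Cases with a missing label still enter the aggregated score through the sums of TP and volume. The sketch below illustrates this with toy numbers. It assumes that such a case contributes a TP of 0, which requires the per-case Dice value to be finite; the function name `agg_score` and all values are hypothetical.

```python
def agg_score(cases):
    """2 * sum(TP) / sum(vol_gt + vol_pred), as in the aggregated DSC."""
    total_tp = sum(c["TP"] for c in cases)
    total_vol = sum(c["vol_gt"] + c["vol_pred"] for c in cases)
    return 2 * total_tp / total_vol


# One ordinary case with good overlap (toy values).
base = [{"TP": 80.0, "vol_gt": 100.0, "vol_pred": 100.0}]

# Label absent from both masks: adds nothing to either sum (assumed TP of 0).
both_empty = {"TP": 0.0, "vol_gt": 0.0, "vol_pred": 0.0}

# Empty ground truth but non-empty prediction: only the denominator grows.
false_positive = {"TP": 0.0, "vol_gt": 0.0, "vol_pred": 50.0}

score_base = agg_score(base)
score_both_empty = agg_score(base + [both_empty])
score_fp = agg_score(base + [false_positive])

print(score_base, score_both_empty, score_fp)
```

Under these assumptions, a case where the label is absent from both masks leaves the aggregated score unchanged, whereas a false-positive prediction lowers it in proportion to the predicted volume.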

In [4]:
results = list()
for f in prediction_files:
    patient_ID = os.path.split(f)[-1].split('_')[0] # get the patient ID from the path 
    gt_file = [k for k in groundtruth_files if os.path.split(k)[-1].split('_')[0] == patient_ID][0]

    print(f"Evaluating patient {patient_ID}")

    prediction = sitk.ReadImage(str(f))
    groundtruth = sitk.ReadImage(str(gt_file))
    check_prediction(groundtruth, prediction) 


    results.append(get_intermediate_metrics(patient_ID, groundtruth, prediction))

Evaluating patient 81
prediction checked, no errors found
Evaluating patient 94
prediction checked, no errors found
Evaluating patient 99
prediction checked, no errors found
Evaluating patient 84
prediction checked, no errors found
Evaluating patient 91
prediction checked, no errors found
Evaluating patient 77
prediction checked, no errors found


LabelOverlapMeasuresImageFilter (0x7f817cd98f40): Label  not found.



Evaluating patient 78
prediction checked, no errors found
Evaluating patient 88
prediction checked, no errors found
Evaluating patient 86
prediction checked, no errors found
Evaluating patient 93
prediction checked, no errors found
Evaluating patient 90
prediction checked, no errors found
Evaluating patient 96
prediction checked, no errors found
Evaluating patient 8
prediction checked, no errors found


LabelOverlapMeasuresImageFilter (0x7f817cd814e0): Label  not found.



Evaluating patient 95
prediction checked, no errors found


LabelOverlapMeasuresImageFilter (0x7f816c40dc60): Label  not found.



Evaluating patient 80
prediction checked, no errors found


LabelOverlapMeasuresImageFilter (0x7f817c908fd0): Label  not found.



Display aggregated DSC metrics. This is what will be used in the challenge evaluation/ranking.

In [5]:
# Display raw results
print("The raw results are:", results, "\n")

# Compute and display aggregate dice scores
agg_dice_scores = compute_agg_dice(results)
print(f"Aggregate dice scores: {agg_dice_scores}\n")

The raw results are: [{'PatientID': '81', 'TP1': 391.07343593704957, 'TP2': 4460.099424139208, 'vol_sum1': 24023.082493275902, 'vol_sum2': 14405.702132716853, 'DSC1': 0.03255813953488372, 'DSC2': 0.6192130564757211, 'vol_gt1': 838.014505579392, 'vol_gt2': 4838.369860685461}, {'PatientID': '94', 'TP1': 0.0, 'TP2': 2500.0, 'vol_sum1': 3693.5, 'vol_sum2': 9828.0, 'DSC1': 0.0, 'DSC2': 0.5087505087505088, 'vol_gt1': 0.0, 'vol_gt2': 4620.5}, {'PatientID': '99', 'TP1': 1696.4999999999998, 'TP2': 44592.0, 'vol_sum1': 33991.0, 'vol_sum2': 116139.0, 'DSC1': 0.09982054073137006, 'DSC2': 0.7679074212796735, 'vol_gt1': 1696.5, 'vol_gt2': 46632.0}, {'PatientID': '84', 'TP1': 536.5620653779163, 'TP2': 9859.47343994865, 'vol_sum1': 2928.395133385764, 'vol_sum2': 30560.759545830682, 'DSC1': 0.3664546899841018, 'DSC2': 0.6452374604867274, 'vol_gt1': 1094.0744933953172, 'vol_gt2': 11550.633268569285}, {'PatientID': '91', 'TP1': 595.1388605104555, 'TP2': 3307.638731168378, 'vol_sum1': 6276.388589607348, '

## Extra: Conventional DSC Calculation

Since conventional volumetric DSC was also calculated during the DSCagg calculation, we can display these values for reference. These metrics will not be used in the challenge directly but may be handy to know.

In [6]:
# Extract DSC1 and DSC2 values
DSC1_values = [result["DSC1"] for result in results]
DSC2_values = [result["DSC2"] for result in results]

# Compute and display mean DSC1 and DSC2
mean_DSC1 = np.mean(DSC1_values)
mean_DSC2 = np.mean(DSC2_values)
print(f"Mean DSC1 (GTVp): {mean_DSC1}")
print(f"Mean DSC2 (GTVn): {mean_DSC2}\n")

Mean DSC1 (GTVp): 0.2247082341264615
Mean DSC2 (GTVn): 0.5701023544115448



Conventional volumetric DSC may be disproportionately affected by a single false negative/positive result (yielding a DSC of 0). Therefore, it may be more informative to exclude cases where the ground truth is empty. The code below removes such cases before computing the mean DSC values.

In [7]:
# Keep DSC1 and DSC2 values only for cases with a non-empty ground truth, and record the removed patient IDs
DSC1_values_nozeros = []
DSC2_values_nozeros = []
removed_patients_DSC1 = []
removed_patients_DSC2 = []

for result in results:
    patient_id = result["PatientID"]
    if result["vol_gt1"] != 0.0:
        DSC1_values_nozeros.append(result["DSC1"])
    else:
        removed_patients_DSC1.append(patient_id)
    if result["vol_gt2"] != 0.0:
        DSC2_values_nozeros.append(result["DSC2"])
    else:
        removed_patients_DSC2.append(patient_id)

# Print removed patient IDs
print("Removed patient IDs with empty ground truth volumes for DSC1:", removed_patients_DSC1)
print("Removed patient IDs with empty ground truth volumes for DSC2:", removed_patients_DSC2, "\n")

# Compute and display mean DSC1 and DSC2 over cases with a non-empty ground truth
mean_DSC1_nozeros = np.mean(DSC1_values_nozeros)
mean_DSC2_nozeros = np.mean(DSC2_values_nozeros)
print(f"Mean DSC1 (GTVp) without empty ground truth: {mean_DSC1_nozeros}")
print(f"Mean DSC2 (GTVn) without empty ground truth: {mean_DSC2_nozeros}")

Removed patient IDs with empty ground truth volumes for DSC1: ['94', '86', '8', '95']
Removed patient IDs with empty ground truth volumes for DSC2: ['77', '80'] 

Mean DSC1 (GTVp) without empty ground truth: 0.30642031926335656
Mean DSC2 (GTVn) without empty ground truth: 0.6578104089363979
