# Topic 1.5: Validation in medical image analysis

This notebook combines theory with questions to support the understanding of validation metrics in medical image analysis. Use available markdown sections to fill in your answers to questions as you proceed through the notebook.

**Contents:** <br>

1. [Validation (concepts)](#validation)<br>

    1.1 [Quality characteristics](#quality_characteristics)<br>
    
    - [Accuracy (bias)](#accuracy)<br>
    - [Precision (variation), reproducibility, reliability, replicability](#precision)<br>
    - [Robustness](#robustness)<br>
    - [Efficiency](#efficiency)<br>
    - [Fault detection](#fault)<br>
    
    1.2 [Ground truth](#ground_truth)<br>
    
    - [Ground truth from real data](#gt_real_data)<br>
    - [Ground truth from phantoms](#gt_phantoms)<br>
    - [Data representativeness](#data_representativeness)<br>
    - [Statistical significance](#stat_significance)<br>
    
    1.3 [Measures of quality](#quality_measures)<br>
    
    - [Segmentation - quality measures](#seg_qm)<br>
    - [Registration - quality measures](#reg_qm)<br>
    - [(Computer-aided) detection - quality measures](#det_qm)<br>
    
    
2. [Common limitations of performance metrics in biomedical image analysis](#limitations)

    - [Small structures](#small_structures)<br>
    - [Image artifacts](#artifacts)<br>
    - [Overlap measurements](#overlap_measurements)<br>
    - [Over- and undersegmentation](#over_underseg)<br>
    - [Single-object bias](#object_bias)<br>
    - [Metric combination](#combination)<br>
    - [Choosing the right metric for a given task](#right_metric)<br>


**References:** <br>

[1] Measures of quality: [Toennies, Klaus D. Guide to Medical Image Analysis - Methods and Algorithms, Chapter 13.1](https://link.springer.com/book/10.1007/978-1-4471-2751-2)

[2] Ground truth: [Toennies, Klaus D. Guide to Medical Image Analysis - Methods and Algorithms, Chapter 13.2](https://link.springer.com/book/10.1007/978-1-4471-2751-2)

[3] Limitations of performance metrics: [Reinke et al. Common Limitations of Image Processing Metrics: A Picture Story.](https://arxiv.org/abs/2104.05642)

[4] Assessment of registration errors: [Fitzpatrick, M. Visualization, Image-Guided Procedures and Modeling, 7261:1–12, SPIE Medical Imaging (2009).](https://spie.org/Publications/Proceedings/Paper/10.1117/12.813601)

In [1]:
%load_ext autoreload
%autoreload 2

<div id='validation'></div>

<div style="float:right;margin:-5px 5px"><img src="../reader/assets/read_ico.png" width="42" height="42"></div> 

## 1. Validation (concepts)

Validation of medical image analysis methods is the estimation of correctness of certain results from tests of a method on a representative sample set. In e.g. software design, validation is the evaluation of the degree to which user needs (performance requirements) are met, i.e. whether the right software is being built. In medical image analysis, we usually talk about technical validation, where the aim is to evaluate the performance of computing algorithms with respect to e.g. segmentation accuracy. Validation is used in various image computing classes (registration, segmentation, detection, classification, quantification). 

Prior to performing validation, suitable data needs to be selected, comparison measures need to be chosen, and a norm (e.g. [ground-truth](#ground_truth), explained below) needs to be defined. **Remember that every validation study must have a hypothesis on performance (e.g. outcome is better than...), and a ground truth (gold standard) is essential**. Validation can provide information about our method with respect to another method used to generate the same results (cross-validation). It is mandatory to document a detailed description of the validation procedure together with a well-founded justification of selected measures, as it allows potential new users of the method to investigate the validity of the arguments used to build the validation scenario.

<div id='quality_characteristics'></div>

### 1.1 Quality characteristics
There are a number of quality characteristics used in validation of medical image analysis methods: 

<div id='accuracy'></div>

#### Accuracy (bias)

Accuracy determines the deviation of results from known ground truth. It is computed via a measure of quality ([section 1.3](#quality_measures)) comparing results with some norm. For tasks with binary decisions, accuracy is calculated as the fraction of correct decisions (true positives and true negatives) among all decisions: $\mathrm{A} = \frac{\mathrm{TP + TN}}{\mathrm{TP+FP+FN+TN}}$.
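
As a minimal sketch of the accuracy formula above, the following snippet computes it from the four decision counts. The function name and the counts are hypothetical and chosen only for illustration.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correct decisions (TP + TN) among all decisions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one decision is required")
    return (tp + tn) / total


# Hypothetical counts, not taken from any real experiment.
example_accuracy = accuracy(tp=40, tn=50, fp=5, fn=5)
print(f"accuracy = {example_accuracy:.2f}")
```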

<div id='precision'></div>

#### Precision (variation), reproducibility, reliability, replicability 

These characteristics measure the extent to which equal or similar input produces equal or similar results. Reliable methods produce output within a given range of variation (e.g. in terms of appearance). A method is reproducible if two runs of this method with the exact same input and setup produce the exact same results. A method is replicable if two runs with the same input and the same setup arrive at similar conclusions.
<br>
<br>
<center width="100%"><img src="../reader/assets/accuracy_precision.png" width="600"></center>

<div id='robustness'></div>

#### Robustness 

Robustness of a method characterizes the change of the quality of an analysis result if conditions deviate from assumptions made for analysis (e.g., when noise level increases or if object appearance deviates from prior assumptions). For example, robustness of a segmentation algorithm is the ability of an algorithm to persist in sufficient performance despite abnormalities in the input images (e.g. due to patient motion). Reproducibility and robustness are also important performance indicators in case of varying image data (different scanners, hospitals, patient populations, etc.).

<div id='efficiency'></div>

#### Efficiency

The effort which must be exerted to achieve an analysis result is described by efficiency. You may recall that there are semi-automated methods that require some degree of human interaction or expert knowledge. These factors contribute to the overall determination of a method's efficiency.

<div id='fault'></div>

#### Fault detection

The ability to discover possible faults while an analysis method is being applied is called fault detection. It is a very useful feature, because it requires the method to test for reliability of its own results. 

<div id='ground_truth'></div>

### 1.2 Ground truth

You may remember one of the lecture slides with the following statement: _"In medical image analysis, the truth is difficult to come by, since the reason for producing images in the first place was to gather information about the human body that cannot be accessed otherwise."_. 

Ground truth is a conceptual term relative to the knowledge of the truth concerning a specific question (the “ideal expected result”). In validation, all measures of quality estimation for an analysis method require comparison of the method's produced results with the true information. Ground truth data can be either real or artificial; however, it is never completely certain whether the selected data are representative of the desired ground truth. See also [chapter 13.2 of the Guide to Medical Image Analysis by Toennies, Klaus D.](https://www.springer.com/gp/book/9781447160960)

<div id='gt_real_data'></div>

#### Ground truth from real data
Ground truth based on real data can be created by applying the currently established best method to it if such method exists at all. An example is the use of mutual information and spline-based non-rigid registration for registering MR brain images. An often encountered problem is proving that the conditions under which a standard is applied, are comparable with those conditions under which they are considered to be an established standard. Moreover, the implementation of the established methods is rarely available, even though these days, more implementations become open-source or integrated in widely used freely downloadable software packages.  

If an established method is missing, human experts may help produce ground truth data through _manual data annotation_. This approach requires a lot of effort both from the method's developer and from the expert, who has to carry out the analysis on several datasets and document the findings. Sometimes it is also desirable to have the expert analyze the data(sets) multiple times, in order to measure intra-observer variability and increase the significance of the results. The developer must provide a sufficiently good user interface for the expert to avoid bias caused by the quality of the input component. Sometimes it may be more beneficial to ask several experts and measure inter-observer variability. In such a case, it is crucial to define what is meant by agreement (e.g. agreement by all observers, by the majority of observers, etc.).

An algorithm for the validation of image segmentation that estimates reference standard based on a set of segmentations is called [STAPLE](https://pubmed.ncbi.nlm.nih.gov/15250643/) (Simultaneous Truth and Performance Level Estimation).
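
A much simpler way to define agreement among observers is the voxel-wise majority vote mentioned above. The sketch below is not STAPLE; it is a plain majority vote over binary masks, with hypothetical function and variable names, and it assumes that each annotation is given as a flat list of 0/1 labels of equal length.

```python
def majority_vote(annotations):
    """Voxel-wise strict majority vote over binary masks (flat 0/1 lists)."""
    n_observers = len(annotations)
    n_voxels = len(annotations[0])
    if any(len(a) != n_voxels for a in annotations):
        raise ValueError("all annotations must have the same length")
    consensus = []
    for voxel_labels in zip(*annotations):
        votes = sum(voxel_labels)
        # A voxel is foreground only if more than half of the observers agree;
        # ties (possible with an even number of observers) become background.
        consensus.append(1 if 2 * votes > n_observers else 0)
    return consensus


# Three hypothetical expert annotations of the same six voxels.
expert_a = [0, 1, 1, 1, 0, 0]
expert_b = [0, 1, 1, 0, 0, 0]
expert_c = [0, 0, 1, 1, 1, 0]
consensus = majority_vote([expert_a, expert_b, expert_c])
print(consensus)
```

Unlike this vote, which weights all observers equally, STAPLE also estimates a performance level for each input segmentation.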

<div id='gt_phantoms'></div>

#### Ground truth from phantoms

Phantoms can be used as ground truth as well. They are classified as follows:

_Based on real data_

- cadaver phantoms (human or animal)
- artificial hardware phantoms (e.g. CT and MRI slices generated in the [Visible Human Project](http://vhp.med.umich.edu/))

_Based on simulated data_

- software phantoms representing the reconstructed image or the imaged measurement distribution
- mathematical simulations (e.g. Shepp-Logan phantom)

Phantoms are characterized by specific properties (material, measurement properties, influences from image reconstruction, shape properties), which determine the tasks they can be applied to. A phantom is only useful for validation if the expected analysis result is known for it: for a detection task, the locations to be detected must be specified, and for registration tasks, fiducial markers have to be implanted, for example. Material and measurement properties are often idealized. Image artefacts are typically simulated, e.g. by adding zero-mean Gaussian noise to simulate detector noise, by smoothing the data to evoke partial volume effects, or by including artificial shading to model signal fluctuations.
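
The three simulated artefacts named above can be sketched on a one-dimensional intensity profile. This is an illustrative toy example rather than a realistic phantom: the profile size, the moving-average filter, the linear gain ramp and all parameter values are assumptions made for this sketch.

```python
import random


def make_profile(n=64, start=24, stop=40, intensity=100.0):
    """Ideal 1D profile: an object of constant intensity on a zero background."""
    return [intensity if start <= i < stop else 0.0 for i in range(n)]


def smooth(profile, radius=1):
    """Moving-average filter that blurs edges to mimic partial volume effects."""
    n = len(profile)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(profile[lo:hi]) / (hi - lo))
    return out


def add_shading(profile, max_gain=0.2):
    """Multiply by a linear gain ramp from 1.0 to 1.0 + max_gain."""
    n = len(profile)
    return [v * (1.0 + max_gain * i / (n - 1)) for i, v in enumerate(profile)]


def add_gaussian_noise(profile, sigma=5.0, seed=0):
    """Add zero-mean Gaussian noise; the seed makes the result repeatable."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in profile]


ideal = make_profile()  # the known ground truth of this toy phantom
simulated = add_gaussian_noise(add_shading(smooth(ideal)))
```

Because `ideal` is known exactly, any analysis result obtained on `simulated` can be compared against it.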

The advantage of a software phantom is that it is more straightforward to account for anatomical variation by creating several phantoms with different shapes, unlike in hardware phantoms, where anatomical variation can hardly be modelled. Examples of software phantoms include the BrainWeb phantom; the Field II ultrasound simulation program; the XCAT phantom; or the dynamic MCAT heart phantom simulating a moving heart.

<div id='data_representativeness'></div>

#### Data representativeness

To make a (ground truth) dataset representative, all data properties that may have an impact on the performance of an analysis method should be reflected in it. Representativeness can be enforced by:

- separation between test and training data (leave-one-out technique in classification tasks); if optimal parameters have to be determined for a method, it is unacceptable to validate the results on ground-truth data which has been used to arrive at the optimal parameter value
- identification of sources of variation (all should be covered by the ground truth data) and outlier detection (experts can help)
- robustness with respect to parameter variation (e.g. changes in input thresholds)
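
The separation between training and test data with the leave-one-out technique can be sketched as follows. The helper name is hypothetical; each case serves as test data exactly once and is excluded from the training set in that round.

```python
def leave_one_out(n_cases):
    """Yield (train_indices, test_index) pairs; each case is tested once."""
    for test_index in range(n_cases):
        train_indices = [i for i in range(n_cases) if i != test_index]
        yield train_indices, test_index


splits = list(leave_one_out(4))
for train_indices, test_index in splits:
    # Parameters would be tuned on train_indices only,
    # and the result would be validated on test_index only.
    print(f"train on {train_indices}, test on {test_index}")
```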

<div id='stat_significance'></div>

#### Statistical significance

While your analysis results may seem satisfactory, there is a chance that they are statistically insignificant due to a low number of samples in your validation set. The significance of an experimental outcome can be indicated by the well-known $p$-value. For example, a probability of less than $1\%$ that a result at least as extreme arose by chance alone would be expressed as $p < 0.01$. Significance can be calculated via the [_Student's t-test_](https://towardsdatascience.com/the-statistical-analysis-t-test-explained-for-beginners-and-experts-fd0e358bbb62), which helps you find out if there is a statistically significant difference between two compared groups.
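
As a hedged illustration, the snippet below computes the statistic of a paired Student's t-test for two methods evaluated on the same cases. The Dice scores are invented for this example, and the paired variant is an assumption that fits the situation where both methods are run on identical test data. The $p$-value itself requires the t-distribution, which is not part of the Python standard library; in practice one would typically use `scipy.stats.ttest_rel`. Here the statistic is compared with the tabulated two-sided critical value for $p = 0.01$ and 9 degrees of freedom (3.250).

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return the paired t statistic and its degrees of freedom (n - 1)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both methods must be evaluated on the same cases")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical Dice scores of two methods on the same ten test cases.
method_a = [0.82, 0.79, 0.88, 0.91, 0.75, 0.84, 0.80, 0.86, 0.78, 0.83]
method_b = [0.80, 0.76, 0.85, 0.90, 0.74, 0.80, 0.79, 0.83, 0.77, 0.80]

t_value, dof = paired_t_statistic(method_a, method_b)
T_CRITICAL_P01_DOF9 = 3.250  # two-sided critical value from a t-table
significant = abs(t_value) > T_CRITICAL_P01_DOF9
print(f"t = {t_value:.2f}, dof = {dof}, significant at p < 0.01: {significant}")
```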

<div id='quality_measures'></div>

### 1.3 Measures of quality

Quality is determined by the kind of analysis which has been conducted on a dataset:

| Task         | Quality measure                                                           |
|:--------------|:---------------------------------------------------------------------------:|
| Segmentation  | Correspondence between the segmented object and a reference segmentation |
| Registration | Deviation from the correct registration transformation                    |
| Computer-aided detection (CAD)    | Ratio between correct and incorrect decisions                             |

See also [chapter 13.1 of the Guide to Medical Image Analysis by Toennies, Klaus D.](https://www.springer.com/gp/book/9781447160960)

<div id='seg_qm'></div>

#### Segmentation - quality measures

When segmenting an object in an image, a measure of comparison between some reference $g$ (usually a ground truth) and the segmented object $f$ is required. Mutual correspondence may be determined by calculating volumetric overlap, overlap between object and background or performing distance measurements (of boundary deviations). In 3D cases, volumetric measurements aim to count the number of voxels in both the segmented object and the reference norm weighted by the volume covered by each voxel. 

Overlaps between objects $f$ and $g$ can be calculated by measures that count over-segmentation (number of elements) and under-segmentation. 

<div id="dsc_hd_iou"></div>

Measures often used for quality assessment are the _Dice similarity coefficient_ (DSC), a.k.a. _Sørensen–Dice coefficient_ ($d$), and _Intersection over union_ (IoU, $i$), which is identical to the _Jaccard index_ ($j$):

\begin{equation}
d = \frac{2|F\cap\,G|}{|F|+|G|}\,\,, 
\end{equation}

\begin{equation}
i = \frac{d}{2 - d}\,\,, \quad\mathrm{and}
\end{equation}

\begin{equation}
j = \frac{|F\cap\,G|}{|F\cup\,G|}\,\,,
\end{equation}

where $|F\cap G|$ is the number of elements (voxels) in the overlap, $|F\cup G|$ is the number of elements in the union, and $|F|$, $|G|$ are the sizes of the individual volumes. Each coefficient is equal to 1 in case of perfect correspondence; otherwise it is smaller than 1. In the medical image analysis community, the Dice coefficient is more popular, and therefore also more often present in literature.
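
The definitions above translate directly into set operations. In this minimal sketch, each object is represented as a Python set of pixel coordinates. The function names, the toy objects and the convention of returning 1 for two empty objects are assumptions made for illustration.

```python
def dice(f, g):
    """Dice similarity coefficient of two sets of pixel coordinates."""
    if not f and not g:
        return 1.0  # convention assumed here: two empty objects agree perfectly
    return 2 * len(f & g) / (len(f) + len(g))


def jaccard(f, g):
    """Jaccard index (intersection over union) of two sets of coordinates."""
    if not f and not g:
        return 1.0
    return len(f & g) / len(f | g)


# F is a 4x4 square; G is the same square shifted by one column.
F = {(r, c) for r in range(4) for c in range(4)}
G = {(r, c) for r in range(4) for c in range(1, 5)}

d = dice(F, G)     # 2 * 12 / (16 + 16) = 0.75
j = jaccard(F, G)  # 12 / 20 = 0.6
print(d, j, d / (2 - d))
```

The last printed value confirms numerically that the Jaccard index can be obtained from the Dice coefficient as $d / (2 - d)$.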

Neither the Dice nor the Jaccard index can be used to measure outliers (e.g. in tasks where organ boundaries are to be segmented as part of access planning in surgery). In minimally invasive procedures, it is crucial to determine the deviation of the segmented boundary from the true boundary. This can be done with the _Hausdorff distance_ (HD) between $F$ and $G$, defined as the maximum of all shortest distances between points in $F$ and points in $G$. Since this measure is highly sensitive to image artefacts, the quantile Hausdorff distance is often used instead, in which the largest outlier distances are disregarded. It is computed from a quantile $t_{q}$ of the histogram of distances $d(f,G)$ from points $f \in F$ to $G$ and of distances $d(g,F)$ from points $g \in G$ to $F$:

\begin{equation}
h^{q} = \max\left(t_{q}(d(f,G)),\,t_{q}(d(g,F))\right)
\end{equation}
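
The following sketch implements the Hausdorff distance and its quantile variant for small point sets by brute force. The nearest-rank quantile used here is only one of several possible conventions, and the function names and toy contours are hypothetical.

```python
import math


def directed_distances(a, b):
    """For each point in a, the distance to its nearest point in b."""
    return [min(math.dist(p, q) for q in b) for p in a]


def quantile(values, q):
    """Nearest-rank quantile; q = 1.0 returns the maximum."""
    ordered = sorted(values)
    # The small epsilon guards against floating-point round-off in q * n.
    rank = max(1, math.ceil(q * len(ordered) - 1e-9))
    return ordered[rank - 1]


def hausdorff(f, g, q=1.0):
    """Quantile Hausdorff distance; q = 1.0 gives the classical HD."""
    return max(quantile(directed_distances(f, g), q),
               quantile(directed_distances(g, f), q))


# G is a straight contour; F runs parallel to it with one outlier at (9, 6).
G = [(x, 0) for x in range(10)]
F = [(x, 1) for x in range(9)] + [(9, 6)]

hd = hausdorff(F, G)            # dominated by the single outlier
hd_90 = hausdorff(F, G, q=0.9)  # the largest 10% of distances are disregarded
print(hd, hd_90)
```

In this toy case the single outlier determines the classical HD, while the 90% quantile version reflects the typical boundary deviation of one unit.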

<div id='reg_qm'></div>

#### Registration - quality measures

Registration aims to find a geometric transformation that maps an $n$-dimensional image onto another one, bringing both images into alignment. In case of different dimensionalities of the registered objects, the transformation includes a projection step of the scene from higher dimension to the scene of lower dimension. The steps to evaluate registration accuracy when working with point-based registration are explained in [section 1 of notebook 1.2](../reader/1.2_Point-based_registration.ipynb). 

The quality of a registration method can be measured as the average deviation of known transformation parameters based on comparisons between vector fields (for non-rigid registration) or differences in global rotation and translation (for rigid transformation). Another way of assessing quality for a registration task is to compute the locations of fiducials after registration. However, **one should never use the same corresponding point pairs/fiducials or image similarity metric that were used for optimization when computing the registration transformation!** 

We use Fiducial Localization Error (FLE), Fiducial Registration Error (FRE) and Target Registration Error (TRE) to evaluate registration accuracy:

- FLE quantifies the error in determining the location of a point which is used to estimate the registration transformation. 
- FRE is the error of the fiducial markers following registration, i.e. $\vert\vert\,T(\mathbf{p_{f}}) - \mathbf{p_{m}}\vert\vert$, where $T$ is the estimated transformation and $\mathbf{p_{f}}$, $\mathbf{p_{m}}$ are the points that were **used for estimation**. 
- TRE computes the error of the target fiducials following registration, i.e. $\vert\vert\,T(\mathbf{p_{f}}) - \mathbf{p_{m}}\vert\vert$, where $T$ is the estimated transformation and $\mathbf{p_{f}}$, $\mathbf{p_{m}}$ are the points that were **<font color="red">not</font> used for estimation**.  

It is important to remember that FRE should never be utilized as a surrogate for TRE as the two error measures are uncorrelated given a specific registration task. Typically, we can only estimate the distribution of TRE as it is spatially varying. A good TRE depends on using a good fiducial configuration. More information on FRE and TRE can be found [in this article](https://spie.org/Publications/Proceedings/Paper/10.1117/12.813601?SSO=1).
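
The difference between FRE and TRE lies only in which point pairs are evaluated, as the sketch below illustrates for a 2D rigid transformation. It assumes that the transformation $T$ (rotation angle and translation) has already been estimated elsewhere; the root-mean-square aggregation over points, the function names and all coordinates are assumptions made for this example.

```python
import math


def apply_rigid(point, angle_rad, translation):
    """Apply a 2D rigid transformation T (rotation followed by translation)."""
    x, y = point
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x - s * y + translation[0], s * x + c * y + translation[1])


def rms_error(fixed_points, moving_points, angle_rad, translation):
    """Root-mean-square of ||T(p_f) - p_m|| over corresponding point pairs."""
    squared = [
        math.dist(apply_rigid(p_f, angle_rad, translation), p_m) ** 2
        for p_f, p_m in zip(fixed_points, moving_points)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical estimated transformation: no rotation, shift of one unit in x.
angle, shift = 0.0, (1.0, 0.0)

# Fiducials that were used to estimate T (evaluated as FRE).
fiducials_fixed = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
fiducials_moving = [(1.0, 0.1), (3.0, -0.1), (1.0, 2.0)]

# Target points that were NOT used to estimate T (evaluated as TRE).
targets_fixed = [(5.0, 5.0)]
targets_moving = [(6.3, 5.4)]

fre = rms_error(fiducials_fixed, fiducials_moving, angle, shift)
tre = rms_error(targets_fixed, targets_moving, angle, shift)
print(f"FRE = {fre:.3f}, TRE = {tre:.3f}")
```

In this invented configuration the FRE is small while the TRE is several times larger, which is consistent with the warning that FRE should not be used as a surrogate for TRE.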

If the transformation is unknown, image similarity metrics (see [notebook 1.3](../reader/1.3_Image_similarity_metrics.ipynb)) and the Structural Similarity Index (SSIM) can be used. The SSIM is a perceptual image quality measure indicating whether two images are very similar or the same (a value of $+1$) or very different (a value of $-1$).  

<br>
<br>
<center width="100%"><img src="../reader/assets/quality_measures_registration_tasks.png" width="500"></center>

<font size="1">Figure from [Guide to Medical Image Analysis - Methods and Algorithms](https://link.springer.com/book/10.1007/978-1-4471-2751-2)</font>

<div id='det_qm'></div>

#### (Computer-aided) detection - quality measures

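In detection tasks, an object is either found or not found, while the object either is or is not present in the data. The quality of detection is measured by _sensitivity_ (a.k.a. recall or true positive rate) and _specificity_ (a.k.a. true negative rate). Note that specificity is not the same as precision, which is defined as $\frac{\mathrm{TP}}{\mathrm{TP + FP}}$. Both measures are built from the following four counts: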

- True positives (TP) are objects that are present in the data and were correctly detected. 
- True negatives (TN) are objects that are not present in the data and were correctly reported as absent. 
- False positives (FP) are objects that are not present in the data, but were detected as present. 
- False negatives (FN) are objects that are present in the data, but were reported as absent.

Sensitivity can be calculated as $\frac{\mathrm{TP}}{\mathrm{TP + FN}}$, while specificity is defined as $\frac{\mathrm{TN}}{\mathrm{TN + FP}}$. A good detection method would produce as many TP and TN as possible. FPs (e.g. tumor detected, though absent) and FNs (e.g. tumor overlooked) may have different consequences, and are therefore measured as two separate types of error (type-I and type-II error, respectively). The so-called _confusion matrix_, which lists detection results in an organized way, summarizes a two-class classification problem: 
<br>
<br>
<center width="100%"><img src="../reader/assets/confusion_matrix.jpg" width="300"></center>

<font size="1">Figure from [Guide to Medical Image Analysis - Methods and Algorithms](https://link.springer.com/book/10.1007/978-1-4471-2751-2)</font>

These metrics are commonly used in detection tasks involving medical images. Interestingly, they are also very important when interpreting the performance of any test (e.g., airport security, breast cancer screening, quality assurance in companies, etc.).
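
The sensitivity and specificity defined above can be computed from the four confusion-matrix counts. The sketch below derives the counts from two lists of binary labels; the function names and the example labels are hypothetical.

```python
def confusion_counts(truth, predicted):
    """Return (TP, TN, FP, FN) for two equally long lists of 0/1 labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical ground truth (1 = lesion present) and detector output.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(truth, predicted)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"TP={tp} TN={tn} FP={fp} FN={fn} "
      f"sensitivity={sens:.2f} specificity={spec:.2f}")
```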

In practice, a trade-off between specificity and sensitivity is often targeted. In detection tasks, this trade-off is described by the _receiver operating characteristic_ (ROC) curve, which plots sensitivity against $1 - \mathrm{specificity}$ as the decision threshold is varied. The ROC curve can also serve as a measure of human operator performance when several operators performed the same task.
<br>
<br>
<center width="100%"><img src="../reader/assets/roc_curve.jpg" width="300"></center>

<font size="1">Figure from [Guide to Medical Image Analysis - Methods and Algorithms](https://link.springer.com/book/10.1007/978-1-4471-2751-2)</font>
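
An ROC curve can be traced by sweeping a decision threshold over the detector's confidence scores and recording one (1 - specificity, sensitivity) point per threshold. The sketch below uses invented labels and scores; the area under the curve computed with the trapezoidal rule is added as a commonly used summary, not as something prescribed by this notebook.

```python
def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is detected
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 1)
        fp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical ground-truth labels (1 = object present) and detector scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
auc = area_under_curve(points)
print(points)
print(f"AUC = {auc:.3f}")
```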

<div id='limitations'></div>

<div style="float:right;margin:-5px 5px"><img src="../reader/assets/read_ico.png" width="42" height="42"></div> 

## 2. Common limitations of performance metrics used for segmentation tasks

Recent meta-analytical research has detected major issues in algorithm validation. Most of these flaws are related to the practical use of some performance metrics in a given analysis task. One of the core issues in medical image analysis is the choice of inappropriate metrics [Maier-Hein, L. et al. (2018)](https://www.nature.com/articles/s41467-018-07619-7). The same publication reports that, judging by international challenges, image segmentation is the most popular of all medical image processing tasks. In these competitions, the chosen metrics significantly influence the rankings of the participating methods, and it was found that researchers lack guidelines for choosing the right metric for a given problem. More information can be found in the article [Common Limitations of Image Processing Metrics: A Picture Story](https://arxiv.org/abs/2104.05642).
<br>

<div id='small_structures'></div>

### Small structures

It is important to understand the mathematical properties of a metric before applying it to a given task. Segmentation of small structures, such as brain lesions (e.g. multiple sclerosis) often employs Dice scores, which may not be an appropriate metric because of the often unknown pathological outlines and high inter-observer variability in such tasks. The predictions of two algorithms may differ only by one pixel, yet the impact on the Dice score outcome is substantial (see figure below).
<br>
<br>
<center width="100%"><img src="../reader/assets/small_structure_segmentation.jpg" width="600"></center>

<font size="1">Figure from [Common Limitations of Image Processing
Metrics: A Picture Story](https://arxiv.org/abs/2104.05642)</font>
<br>
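
The effect of a single pixel on the Dice score can be reproduced with a toy example. The sketch below compares a small (2x2 pixels) and a large (20x20 pixels) reference structure, each with a prediction that misses exactly one pixel. The sizes and names are arbitrary choices for illustration and do not reproduce the figure above.

```python
def dice(f, g):
    """Dice similarity coefficient of two non-empty sets of pixel coordinates."""
    return 2 * len(f & g) / (len(f) + len(g))


def square(size):
    """Set of pixel coordinates of a size x size square."""
    return {(r, c) for r in range(size) for c in range(size)}


small_reference = square(2)
large_reference = square(20)

# Both predictions miss exactly one pixel, the one at (0, 0).
small_prediction = small_reference - {(0, 0)}
large_prediction = large_reference - {(0, 0)}

dice_small = dice(small_reference, small_prediction)  # 2 * 3 / (4 + 3)
dice_large = dice(large_reference, large_prediction)  # 2 * 399 / (400 + 399)
print(f"small structure: {dice_small:.3f}, large structure: {dice_large:.3f}")
```

The same one-pixel error lowers the score of the small structure far more than that of the large one.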

<div id='artifacts'></div>

### Image artifacts

Similar issues may arise in the presence of image artifacts such as noise or errors in reference annotations. As seen in the figure below, a single erroneous pixel in the reference annotation may lead to a large performance decrease.
<br>
<br>
<center width="100%"><img src="../reader/assets/noise_effect_segmentation.jpg" width="400"></center>

<font size="1">Figure from [Common Limitations of Image Processing
Metrics: A Picture Story](https://arxiv.org/abs/2104.05642)</font>
<br>

<div id='overlap_measurements'></div>

### Overlap measurements

Overlap-based metrics such as the DSC are incapable of discovering differences in shape, which may have a huge impact, e.g. on radiotherapy applications. Completely different predictions may therefore lead to the exact same DSC value.
<br>
<br>
<center width="100%"><img src="../reader/assets/shape_unawareness.jpg" width="500"></center>

<font size="1">Figure from [Common Limitations of Image Processing
Metrics: A Picture Story](https://arxiv.org/abs/2104.05642)</font>
<br>

<div id='over_underseg'></div>

### Over- and undersegmentation

In some applications, it matters whether an algorithm tends to over- or undersegment the target structure. The DSC does not treat over- and undersegmentation equally, whereas the HD is invariant to this distinction.
<br>
<br>
<center width="100%"><img src="../reader/assets/over_under_segmentation.jpg" width="500"></center>

<font size="1">Figure from [Common Limitations of Image Processing
Metrics: A Picture Story](https://arxiv.org/abs/2104.05642)</font>
<br>
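
A toy example can make this behaviour concrete. In the sketch below, a 4x4 reference square is oversegmented and undersegmented by one pixel layer. The objects are sets of pixel coordinates, the Hausdorff distance is computed by brute force over all pixels, and all names and sizes are assumptions made for illustration.

```python
import math


def dice(f, g):
    """Dice similarity coefficient of two non-empty sets of pixel coordinates."""
    return 2 * len(f & g) / (len(f) + len(g))


def hausdorff(f, g):
    """Classical Hausdorff distance between two sets of pixel coordinates."""
    f_to_g = max(min(math.dist(p, q) for q in g) for p in f)
    g_to_f = max(min(math.dist(q, p) for p in f) for q in g)
    return max(f_to_g, g_to_f)


def square(first, last):
    """Pixels of the square spanning rows and columns first..last (inclusive)."""
    return {(r, c) for r in range(first, last + 1) for c in range(first, last + 1)}


reference = square(1, 4)        # 4x4 pixels
oversegmented = square(0, 5)    # one pixel layer too large
undersegmented = square(2, 3)   # one pixel layer too small

dice_over = dice(reference, oversegmented)    # 2 * 16 / (16 + 36)
dice_under = dice(reference, undersegmented)  # 2 * 4 / (16 + 4)
hd_over = hausdorff(reference, oversegmented)
hd_under = hausdorff(reference, undersegmented)
print(f"DSC over: {dice_over:.3f}, DSC under: {dice_under:.3f}")
print(f"HD over: {hd_over:.3f}, HD under: {hd_under:.3f}")
```

The two errors are of the same geometric size, yet the DSC penalizes the undersegmentation more strongly, while the HD is identical in both cases.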

<div id='object_bias'></div>

### Single-object bias

Commonly, segmentation metrics such as the DSC are applied to detection and localization problems as well. In general, the DSC tends to be strongly biased against single objects, which is why its application in detection tasks should be avoided. An example where the DSC underperforms can be seen below.
<br>
<br>
<center width="100%"><img src="../reader/assets/detection_performance.jpg" width="500"></center>

<font size="1">Figure from [Common Limitations of Image Processing
Metrics: A Picture Story](https://arxiv.org/abs/2104.05642)</font>
<br>

<div id='combination'></div>

### Metric combination

Metrics are typically aggregated over all test cases to produce an overall ranking. However, this can be detrimental in case of missing values (NA): ignoring them can lead to a substantially higher DSC or a varying HD compared to setting missing values to zero. Moreover, a single metric usually does not reflect all important features for algorithm validation. Though the combination of multiple metrics helps mitigate the problem, it has to be kept in mind that some metrics are mathematically related to each other, such as DSC and Intersection over union (IoU) ([see above](#dsc_hd_iou)). Thus, combining related metrics will not change the ranking, and only metrics measuring different properties should be aggregated.
<br>
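
The influence of missing values on an aggregated DSC can be illustrated with a few invented per-case scores, where `None` marks a test case for which an algorithm produced no result. Both helper functions are hypothetical and only contrast the two aggregation strategies.

```python
def mean_ignoring_missing(scores):
    """Average over the available scores only; missing cases are skipped."""
    present = [s for s in scores if s is not None]
    if not present:
        raise ValueError("no scores available")
    return sum(present) / len(present)


def mean_missing_as_zero(scores):
    """Average over all cases, counting each missing result as a score of 0."""
    return sum(0.0 if s is None else s for s in scores) / len(scores)


# Hypothetical per-case DSC values; the algorithm failed on two of five cases.
dsc_per_case = [0.9, 0.8, None, 0.85, None]

optimistic = mean_ignoring_missing(dsc_per_case)
penalized = mean_missing_as_zero(dsc_per_case)
print(f"ignoring NA: {optimistic:.2f}, NA set to zero: {penalized:.2f}")
```

Which strategy is appropriate depends on the task, but the chosen handling of missing values should be reported together with the results.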

<div id='right_metric'></div>

### Choosing the right metric for a given task

The selection of the most appropriate metric depends on your biomedical question and the characteristics of its problem: 

- What is the size, volume and shape of structures?
- Are there image artefacts?
- What is the annotation quality?
- Is computation time relevant?
- Is there any reference available?
- Do we prefer higher sensitivity or specificity?

<div style="float:right;margin:-5px 5px"><img src="../reader/assets/question_ico.png" width="42" height="42"></div>

### *Question 1*:

Describe a situation where volume computation would be an appropriate criterion for measuring the quality of a segmentation task. When should it not be used?

<font style="color:red">Type your answer here</font>

<div style="float:right;margin:-5px 5px"><img src="../reader/assets/question_ico.png" width="42" height="42"></div>

### *Question 2*:
What information about segmentation quality is revealed by the Hausdorff distance? Please describe a scenario where this measure is important to rate a segmentation method.

<font style="color:red">Type your answer here</font>

<div style="float:right;margin:-5px 5px"><img src="../reader/assets/question_ico.png" width="42" height="42"></div>

### *Question 3*:
What needs to be ensured when selecting test data for ground truth?

<font style="color:red">Type your answer here</font>

<div style="float:right;margin:-5px 5px"><img src="../reader/assets/question_ico.png" width="42" height="42"></div>

### *Question 4*:
Why is it necessary to carry out manual segmentation several times by different and by the same person if it shall be used for ground truth? How is the information that is gained from these multiple segmentations used for rating the performance of an algorithm?

<font style="color:red">Type your answer here</font>