# Solutions to the Multiple Testing Problem
As mentioned at the end of the previous section, the multiple testing problem presents a large barrier to using the approach we have discussed so far. Luckily, there are some solutions available. In this section we will discuss the three main approaches you will need when working with `SPM`.

## Solution 1: A Stricter Threshold
The first approach we might consider is to simply use a stricter threshold than 5%. Typically, in imaging, the most liberal threshold we can get away with is $p < 0.001$. By using this threshold, we reduce the expected number of false positives down to a worst-case scenario of 0.1% of the voxels. For an image of 100,000 voxels, this would mean 100 false positives, for an image of 50,000 voxels this would mean 50 false positives, and so on. Although this is still not great, it is nowhere near as bad as using the traditional 5% threshold, which would give 5,000 and 2,500 false positives respectively. The advantage is that this will still retain good sensitivity, because we know we should also be seeing the majority of true effects as well. This is exemplified in {numref}`poldrack-uncorr-thresh-again-fig`, repeated from the previous section. Although there are lots of results outside the yellow circle, importantly we have identified the majority of true positives inside the circle as well.

```{figure} images/poldrack-uncorr-thresh.png
---
width: 800px
name: poldrack-uncorr-thresh-again-fig
---
Visualisation of thresholding an image that contains true activations (within the circle) and noise (outside the circle) using $p < 0.10$. Notice that, despite the false positives, the sensitivity for detecting effects *within* the circle is very high.
```
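
The arithmetic behind these counts can be sketched in a few lines of Python. This is only an illustration: it assumes the worst case where every voxel is null, and `expected_false_positives` is a hypothetical helper name rather than anything provided by `SPM`.

```python
def expected_false_positives(n_voxels: int, alpha: float) -> float:
    """Expected number of false positives if every voxel is truly null."""
    return n_voxels * alpha


# Compare the traditional 5% threshold with the stricter p < 0.001
for n_voxels in (100_000, 50_000):
    for alpha in (0.05, 0.001):
        count = expected_false_positives(n_voxels, alpha)
        print(f"{n_voxels:>7,} voxels at p < {alpha}: about {count:,.0f} false positives")
```

The count scales linearly with both the number of voxels and the threshold, which is why a stricter threshold helps but never brings the expected number of false positives to zero.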

## Solution 2: Control the Family-wise Error (FWE)
An alternative to a stricter threshold is to try to control the family-wise error rate (FWER) directly. By doing so, we can limit the possibility of any false positives. If we control the FWER at 5% it means that there is only a 5% chance of any false positives in the image. This means that, after correction, we can be 95% confident that there are no false positives at all in our image. If we were to repeat our experiment multiple times, and correct our images each time with an FWER procedure, we would only expect 5% of those repeats to contain *any* false positives. So, most of the time we can be pretty confident that any results that survive an FWE-correction procedure are *true positives*. This is illustrated in {numref}`poldrack-fwe-thresh-fig` from [Poldrack, Mumford and Nichols (2011)](https://www-cambridge-org.manchester.idm.oclc.org/core/books/handbook-of-functional-mri-data-analysis/8EDF966C65811FCCC306F7C916228529#), where controlling the FWER at 10% leads to only 1/10 repeats showing any false positives (the second-to-last example, if you are having trouble seeing it). However, we should also note what this has done to our ability to see the true positives within the yellow circle.

```{figure} images/poldrack-fwe-thresh.png
---
width: 800px
name: poldrack-fwe-thresh-fig
---
Visualisation of thresholding an image that contains true activations (within the circle) and noise (outside the circle) using an FWE-corrected $p < 0.10$. Notice that only 10% of the repeats contain *any* false positives.
```

```{tip}
When writing up your analyses using Microsoft Word, be aware of the fact that FWE will often get auto-corrected to "FEW". Please keep an eye on this as we get many submissions every year that talk about "FEW correction in SPM".
```

### Bonferroni Correction
One of the simplest ways of applying an FWE-correction is to use a Bonferroni procedure. Here, we simply create a new threshold by dividing the old threshold by the number of comparisons. So, for our image of 100,000 voxels, we have

$$
\alpha_{\text{BONF}} = \frac{0.05}{m} = \frac{0.05}{100000} = 0.0000005.
$$

Now, we only count voxels where $p < 0.0000005$ as significant. Hopefully it is clear just from this example how *strict* this approach is. It will indeed control the FWER. However, the Bonferroni correction is only well calibrated when our tests are independent of one another, meaning that the value of one test is not connected in any way to the value of another. In fMRI, our tests have a degree of correlation because tests in neighbouring voxels are likely to be very similar, and in that situation Bonferroni becomes stricter than it needs to be. This is where the concept of multiple comparisons becomes a bit hazy. If two tests are perfectly correlated, does it still count as two comparisons or one? If those two tests are correlated, but not perfectly, how many tests does that count as? All of this is to say that if our tests are correlated then there should be some way of having a less severe correction, because the effective number of tests is smaller than the number of voxels.
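
To get a feel for how strict this is, the short Python sketch below computes the Bonferroni threshold, converts it to a one-sided $z$-value, and checks the FWER it would give *if* the tests were independent. The function names are hypothetical and the use of a standard normal statistic is an assumption made purely for illustration.

```python
from statistics import NormalDist


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test threshold after Bonferroni correction."""
    return alpha / n_tests


def fwer_if_independent(per_test_alpha: float, n_tests: int) -> float:
    """Probability of one or more false positives across independent null tests."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


n_voxels = 100_000
alpha_bonf = bonferroni_alpha(0.05, n_voxels)

# One-sided z threshold corresponding to the corrected p-value
z_bonf = -NormalDist().inv_cdf(alpha_bonf)

print(f"Bonferroni threshold: p < {alpha_bonf:.7f} (z > {z_bonf:.2f})")
print(f"FWER at uncorrected p < 0.001: {fwer_if_independent(0.001, n_voxels):.3f}")
print(f"FWER at Bonferroni threshold:  {fwer_if_independent(alpha_bonf, n_voxels):.3f}")
```

Under independence the corrected FWER sits just below 5%. With correlated voxels the true FWER would be lower still, which is the sense in which Bonferroni over-corrects.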

### Random Field Theory Correction
A solution to this issue, which allows us to take the degree of correlation in an image into account, is given by the application of something called *random field theory*. This was the crowning achievement of the SPM developers and is one of the most significant additions to the world of neuroimaging. However, there is no getting around the fact that it is *complicated*. In fact, you could go as far as to say that many of the people who work with fMRI data do not really understand how this works. By-and-large, they just assume that it does work.

It is useful to have some sense of RFT correction, which is explained more fully in the advanced drop-down below. The simple explanation is that this method is able to quantify the degree of correlation in an image by calculating its *smoothness*. This can then be used to redefine the image in terms of *resolution-elements* (known as RESELs), which quantify the resolution of the image in terms of independent units. These RESELs are then used to determine the threshold needed to achieve FWER = 5%, using something known as the *Euler characteristic* (EC), which quantifies the expected number of clusters under the null hypothesis. Practically speaking, however, you do not need to know how this works to use `SPM`. If you are interested, read the drop-down below and consult the paper by [Nichols & Hayasaka (2003)](https://pubmed.ncbi.nlm.nih.gov/14599004/).

````{admonition} Advanced: How Does Random Field Theory Correction Work?
:class: dropdown

To understand random field theory in more detail, we start by breaking the process into 3 steps:
- Estimate the smoothness of an image to quantify the correlation between voxels
- Enter that smoothness value into an equation to compute the expected Euler characteristic at different thresholds
- Find the threshold where the Euler characteristic tells you that only 5% of equivalent images would be expected to show at least one result. This becomes your multiple comparison correction threshold.

To unpack these steps, let us examine an image in 2D. Imagine an image that was just pure noise (i.e. a null image). If the values in that image were drawn from a Gaussian distribution, it might look something like the *left* of {numref}`smoothed-grf-fig`. As it stands, this is not a good representation of a null fMRI image, because we would expect a degree of correlation between neighbouring voxels (even if there was nothing going on). So let us apply some smoothing to the image to create that correlation. This leads to the image on the *right* of {numref}`smoothed-grf-fig`.

```{figure} images/smoothed-grf.png
---
width: 800px
name: smoothed-grf-fig
---
Illustration of an un-smoothed (*left*) and smoothed (*right*) Gaussian random field (GRF).
```

Now we have something that can represent our imaging data under the null hypothesis. This is known as a *Gaussian random field* (GRF). If we then imagine specifying different thresholds to show or hide the voxels in this image, we might see something akin to {numref}`thresh-grf-fig`.

```{figure} images/thresh-grf.png
---
width: 800px
name: thresh-grf-fig
---
Illustration of applying various thresholds to a GRF.
```

If we assume our random field represents our image data under the null hypothesis, then the number of blobs that remain after thresholding tell us how many results we expect under the null, for a given threshold. So this seems useful, because it means we can have some idea about how many results we would expect in our image by chance when using a particular threshold.

This concept of how many blobs remain in an image after thresholding can be quantified using a number known as the *Euler characteristic* (EC) of the image. Although, eventually, we will use the EC to define a $t$-threshold for our image of $t$-values, for this example we will be looking at using the EC to define a $z$-threshold for an image of $z$-values from a standard normal distribution. This is just to make the maths a bit more palatable. So for a normal distribution, the EC is defined as

$$
EC(z) = R \frac{(4\ln{2})^{3/2}}{(2\pi)^{2}}e^{-z^{2}/2}(z^{2}-1),
$$

which does not look particularly intuitive or useful on the face of it. However, the main element we need to understand here is the term $R$, which is known as the RESEL-count of the image. $R$ is defined as

$$
R = \frac{V}{\text{FWHM}_{x} \times \text{FWHM}_{y} \times \text{FWHM}_{z}} = \frac{\text{Search volume}}{\text{Smoothness}},
$$

which is notable for two reasons. Firstly, it relates to the search volume (the size of the image), and secondly it relates to the intrinsic smoothness of the image. This smoothness quantifies the degree of correlation between neighbouring voxels, and is given as the dimensions of a Gaussian kernel which, if applied to an image of white noise, would result in the same smoothness as our data. This is not the same as the smoothing applied during pre-processing, but rather the combination of the inherent smoothness of the data and the smoothing we applied. The dimensions $\text{FWHM}_{x} \times \text{FWHM}_{y} \times \text{FWHM}_{z}$ define a virtual voxel called a *RESEL* (RESolution ELement), which reflects an independent chunk of our image. Dividing the size of the image by the dimensions of a RESEL expresses the image in RESEL-units, which can be thought of as describing the resolution of the image in terms of independent chunks. The clever thing about this is that it means that the degree of correction is *automatically* tuned to both the size of the image and the degree of correlation within an image.

In order to use the Euler characteristic to determine the FWE threshold, consider the plot of Euler characteristic values against threshold values in {numref}`ec-plot-fig`. This tells us how many results we would expect in an image under the null, if we used different thresholds. For instance, picking a threshold of 2.5 would lead to an average of 5 false positive blobs each time we ran our experiment. The trick to using this information is to remember that the definition of the FWER is the probability of *one or more false-positives*. So if we find the threshold where the EC = 1, that tells us the threshold for an approximate FWER of 100% (i.e. a false positive every single time).

```{figure} images/ec-plot.png
---
width: 600px
name: ec-plot-fig
---
Illustration of the EC against different threshold values.
```

The point where EC = 1 is in the tail of the graph, as shown in {numref}`ec-is-1-fig`. So in this example, if we used a threshold of $z = 3.15$ we would expect an average of one false-positive result in our image. So a threshold of $z = 3.15$ is equivalent to an FWER of approximately 100%. In order to get to an FWER of 5% we simply keep increasing the threshold until we get EC = 0.05. Based on the graph in {numref}`ec-is-1-fig`, this would be $z \approx 4$. So if we threshold our image using this threshold, then we should only see one or more false positives 5% of the time. This threshold therefore controls the FWER at 5%.

```{figure} images/ec-is-1.png
---
width: 800px
name: ec-is-1-fig
---
Illustration of how the tail of the EC plot can be used to determine the threshold needed for FWER = 0.05.
```

````
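
As a rough illustration of the random field theory logic summarised above, the following Python sketch computes a RESEL count, evaluates the 3D Gaussian EC formula given in the advanced drop-down, and searches for the threshold where the expected EC equals 0.05. The search volume and smoothness are made-up values, the function names are hypothetical, and this should not be read as `SPM`'s own implementation, which works on $t$-images rather than $z$-images.

```python
import math


def resel_count(search_volume_mm3: float, fwhm_mm: tuple) -> float:
    """R = search volume / (FWHM_x * FWHM_y * FWHM_z)."""
    fx, fy, fz = fwhm_mm
    return search_volume_mm3 / (fx * fy * fz)


def expected_ec(z: float, resels: float) -> float:
    """Expected Euler characteristic of a 3D Gaussian field thresholded at z."""
    scale = (4 * math.log(2)) ** 1.5 / (2 * math.pi) ** 2
    return resels * scale * math.exp(-z ** 2 / 2) * (z ** 2 - 1)


def rft_threshold(resels: float, fwer: float = 0.05,
                  lo: float = 2.0, hi: float = 10.0, tol: float = 1e-8) -> float:
    """Bisection search for the z where expected_ec(z) equals the target FWER.

    The EC formula decreases monotonically for z > sqrt(3), so searching
    from z = 2 upwards is safe as long as the target is bracketed.
    """
    if not (expected_ec(lo, resels) > fwer > expected_ec(hi, resels)):
        raise ValueError("Target EC is not bracketed by [lo, hi].")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_ec(mid, resels) > fwer:
            lo = mid  # still too many expected blobs: raise the threshold
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Assumed example: 100,000 voxels of 2 x 2 x 2 mm, smoothness of 10 mm FWHM
volume_mm3 = 100_000 * 2 * 2 * 2
resels = resel_count(volume_mm3, (10.0, 10.0, 10.0))
z_fwe = rft_threshold(resels, fwer=0.05)

print(f"RESEL count: {resels:.0f}")
print(f"z threshold for FWER = 0.05: {z_fwe:.2f}")
```

Increasing the assumed smoothness lowers the RESEL count and therefore the threshold, which is the automatic tuning to the degree of correlation described above.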

## Solution 3: Control the False Discovery Rate (FDR)
Although the FWE-correction using random field theory was a big breakthrough in allowing fMRI researchers to get around the problem of multiple testing, in reality many researchers found the approach far too strict. This is largely because controlling the family-wise error is inherently a strict thing to do, even after you take the correlation among voxels into account. This means that many fMRI experiments would simply show no results after FWE-correction. Although this does achieve the aim of having no false positives in our results, this is at the expense of false negatives. We saw this earlier in the illustration of FWE correction, where many of the true results inside the yellow circle were missing after FWE correction.

As an alternative to FWE-correction, we can instead elect to not control the FWER at all. Instead, we choose to control a different quantity known as the false discovery rate (FDR). The FDR is simply the proportion of significant results that are actually false positives. An FDR procedure corrects $p$-values in such a way that, on average, the proportion of significant results that are false positives will not exceed a given threshold. For instance, a procedure that controls the FDR at 5% will mean that on average no more than 5% of our results will be false positives. Unfortunately, we do not know which results are which, and so we have to accept the idea that FDR will make sure only a small proportion of them will be false. This is illustrated in {numref}`poldrack-fdr-thresh-fig`, where an FDR procedure of 10% guarantees that (on average) only 10% of the significant voxels will be false positives. In this example we can see the false positives outside the yellow circle, but of course in reality we will have no idea which results are which.

```{figure} images/poldrack-fdr-thresh.png
---
width: 800px
name: poldrack-fdr-thresh-fig
---
Visualisation of thresholding an image that contains true activations (within the circle) and noise (outside the circle) using an FDR-corrected $p < 0.10$. On average, only 10% of the voxels shown as significant are false positives (those outside the circle).
```

Notice that, compared to FWE-correction, we have retained many more of our true positives. This comes at the expense of more false positives. In fact, we have false positives every single time. So FDR does not guarantee no false positives, it just guarantees control over the average proportion of our results that will be false. If we keep that small, then we can assume that the majority of our findings are real results (we just do not know which ones!). This puts FDR correction somewhere in-between a stricter uncorrected threshold and an FWE-corrected threshold in terms of its balance between sensitivity for true positives and risk of false positives.

```{note}
Application of FDR on a voxel-by-voxel basis was criticised by [Chumbley & Friston (2009)](https://pubmed.ncbi.nlm.nih.gov/18603449/) for its lack of spatial specificity. This is because the $p$-values are corrected without any consideration of *where* they came from in the image. Because of this, `SPM` reports *topological* FDR-corrected $p$-values, which are based on applying the FDR procedure to $p$-values from peaks in the image, as determined by random field theory. So just note that the `SPM` FDR values are not the same as regular voxel-wise FDR values. This is explained in more detail in [Chumbley et al. (2010)](https://pmc.ncbi.nlm.nih.gov/articles/PMC3221040/). Also note that FDR-corrected $p$-values are known as $q$-values, just to make the world a more confusing place.
```
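
To make the idea of an FDR procedure more concrete, here is a minimal sketch of the Benjamini-Hochberg step-up procedure. This section does not name a specific procedure, so the choice of Benjamini-Hochberg is our assumption, made because it is the standard voxel-wise method. It is *not* the topological FDR that `SPM` reports, and both the function name and the $p$-values are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per p-value: True if significant at FDR level q."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k (1-based) such that p_(k) <= (k / m) * q
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank

    # Everything ranked at or below the cutoff is declared significant
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Hypothetical p-values from 10 tests, in no particular order
p_vals = [0.041, 0.001, 0.205, 0.008, 0.039, 0.042, 0.060, 0.074, 0.212, 0.216]

fdr_hits = benjamini_hochberg(p_vals, q=0.05)
bonf_hits = [p < 0.05 / len(p_vals) for p in p_vals]

print("FDR significant:       ", sum(fdr_hits))
print("Bonferroni significant:", sum(bonf_hits))
```

Because the cut-off grows with the rank of each $p$-value, the procedure admits more results than Bonferroni while still limiting the expected proportion of false discoveries.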

## Choosing a Threshold
