# Levels of Inference
At this point, we are nearly at the end of our journey towards having some actual results. We have seen how to ask questions of our data using contrasts, create statistical maps from those contrasts and then finally threshold those maps using $p$-values corrected for multiple comparisons. At this stage, we could just stop and declare our findings based on where in the brain we see results after thresholding. However, there is one final choice we have to make in terms of our *level of inference*. This choice is based on the fact that the results at individual voxels are not necessarily the most interesting feature of an image. As such, we may want to evaluate our results using some other criterion that takes into account the *topology* of our statistical map.

## Features of an Image
Although our analysis has so far treated our image as a collection of individual datasets at each voxel, we know that in reality there are spatial connections between those voxels. So far, we have basically ignored this. However, we should really think of our images as a set of *topological features*. To see this, consider the slice through a statistical map shown in {numref}`func-flat-intensity-fig`. Looking at the image this way shows how we have a *landscape* of features, consisting of mountains and valleys of test statistics across the image.

```{figure} images/func-flat-intensity.png
---
width: 800px
name: func-flat-intensity-fig
---
Illustration of how a statistical map can be considered a landscape of topological features, rather than just individual tests at each voxel.
```

To make this a little simpler, consider the 1D slice through an image shown in {numref}`image-landscape-1d-fig`.

```{figure} images/image-landscape-1d.png
---
width: 800px
name: image-landscape-1d-fig
---
Illustration of a 1D slice through a test statistic image, highlighting how both the *peaks* and *spatial extent* of the test statistics may be of interest.
```

Taking this plot into account, we can identify two types of topological feature that may be of interest. One feature is the *peaks* of the signal, corresponding to the *largest* test statistics within the image. On the left we can see a very obvious spike corresponding to a single large test statistic value. The second feature is the *spatial extent* of the signal. For instance, on the right we can see a hill of points which seems to indicate some connection between the neighbouring voxels, even if no single voxel has a particularly large test statistic value. So far in this lesson we have focused on individual voxels, which is to say we have only been looking at the *peaks* in the image. However, there is an argument to say that looking for the *spatial extent* of the signal is more meaningful. After all, how much do we trust a single voxel versus multiple voxels that appear to show a consistent effect? This distinction is an example of the final choice we have to make, namely do we focus on the largest statistics or do we focus on the spatial extent of the signal? This distinction is referred to as performing either *voxel-level* inference or *cluster-level* inference.

## Voxel-level Inference
The approach we have taken so far, in terms of calculating a test statistic and $p$-value at each voxel, is known as *voxel-level* inference. As illustrated in {numref}`vox-level-fig`, this involves specifying a threshold $u$ and then declaring any voxel above that threshold as significant. As discussed in the last section, this threshold can be derived from uncorrected $p$-values, FWE-corrected $p$-values or FDR-corrected $p$-values.
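As a rough sketch of what this thresholding step amounts to, the snippet below applies a threshold $u$ to a simulated map of $t$-statistics. The map, its size and the degrees of freedom are all made up for illustration, and a Bonferroni threshold is used as a simple stand-in for an FWE-corrected threshold (it is not the random-field theory correction that `SPM` uses).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical 3D map of t-statistics: pure noise plus one strong voxel
df = 20
t_map = rng.standard_t(df, size=(10, 10, 10))
t_map[5, 5, 5] = 8.0

# Threshold u corresponding to an uncorrected one-sided p < 0.001
u_unc = stats.t.isf(0.001, df)

# Bonferroni threshold, used here only as a simple stand-in for FWE correction
n_vox = t_map.size
u_bonf = stats.t.isf(0.05 / n_vox, df)

# Voxel-level inference: every voxel above u is declared significant
sig_unc = t_map > u_unc
sig_bonf = t_map > u_bonf

print(f"u (uncorrected) = {u_unc:.2f}, voxels above u: {sig_unc.sum()}")
print(f"u (Bonferroni)  = {u_bonf:.2f}, voxels above u: {sig_bonf.sum()}")
```

Whichever threshold is used, the logic is the same: each voxel is compared with $u$ on its own, without reference to its neighbours.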

The advantage of voxel-level inference is that it is the most spatially-specific approach. This is because we can point to any voxel with a significant $p$-value and say "there was a significant effect here". This means we can *localise* effects with great accuracy. However, the disadvantage is that voxel-level inference is the more *conservative* approach. In addition, we have to think about how useful or realistic it is to find a single result at a single voxel. Surely we expect some degree of spatial extent to the signal, given that a real brain is not neatly compartmentalised into discrete voxel units. Voxel-level inference does not take this into account. As such, it is not uncommon for only a single voxel to survive correction. When this happens, we have to consider the biological plausibility of a significant change in signal only occurring within a single voxel-sized region of the brain (for example, $3 \times 3 \times 3$ mm).

```{figure} images/vox-level.png
---
width: 800px
name: vox-level-fig
---
Illustration of voxel-level inference.
```

```{note}
Voxel-level inference in `SPM` is labeled as *peak-level*. The distinction between a voxel and a peak is really about the fact that a peak is a topological feature of the data, whereas a voxel is not. If this is confusing, do not worry about it. Just know that there is a distinction between voxels and peaks and that the terms are not necessarily interchangeable. If you are interested, [Chumbley *et al.* (2010)](https://pmc.ncbi.nlm.nih.gov/articles/PMC3221040/) discuss this in more detail.
```

You can see an example of voxel-level-thresholded results below.

<iframe src="peak-level.html" width="800" height="600" frameborder="0" scrolling="no" title=""></iframe>

## Cluster-level Inference
An alternative to focusing on individual voxels is to instead focus on the *spatial extent* of the signal. In order to do this, we need to first isolate the signal of interest by thresholding the image. By specifying an initial threshold ($u_c$) we create *clusters* of voxels, defined as all the voxels touching each other after thresholding. Once we have done this, we can then assess the extent of the activation by counting the number of voxels within a cluster, as illustrated in {numref}`clust-level-fig`.

```{figure} images/clust-level.png
---
width: 800px
name: clust-level-fig
---
Illustration of cluster-level inference.
```
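The cluster-forming step can be sketched in a few lines using `scipy.ndimage.label`. The small 2D "slice" of statistics and the value of $u_c$ below are invented purely for illustration, and the default connectivity of `ndimage.label` (faces only) is used rather than the scheme of any particular neuroimaging package.

```python
import numpy as np
from scipy import ndimage

# Hypothetical 2D slice of test statistics
stat_map = np.array([
    [0.2, 0.1, 0.3, 0.0, 0.1, 0.2],
    [0.1, 3.5, 3.8, 0.2, 0.1, 0.3],
    [0.3, 3.6, 4.1, 0.1, 0.2, 5.2],
    [0.2, 0.1, 0.2, 0.3, 0.1, 0.2],
    [0.1, 0.2, 3.4, 3.3, 0.2, 0.1],
])

u_c = 3.1  # assumed cluster-forming threshold

# Step 1: threshold the map to isolate supra-threshold voxels
supra = stat_map > u_c

# Step 2: group touching voxels into clusters (default: faces only)
labels, n_clusters = ndimage.label(supra)

# Step 3: count the voxels in each cluster (label 0 is background)
sizes = np.bincount(labels.ravel())[1:]

print(n_clusters)              # 3
print(sorted(sizes.tolist()))  # [1, 2, 4]
```

Note that the isolated voxel with the largest statistic (5.2) forms the *smallest* cluster, which illustrates how peak height and spatial extent are different features of the same map.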

How do we use this information for inference? In brief, the machinery of random-field theory can be tuned to give us information on the size of clusters we would expect under the null hypothesis. Because of this, we can calculate $p$-values for the *size* of a cluster, allowing us to ask what the probability is of a cluster *this large or larger*, if the null hypothesis were true. This gives an uncorrected $p$-value, which can be subsequently corrected using the expected number of clusters to produce a $p_{\text{FWE}}$ value, or using an FDR procedure to produce a $q_{\text{FDR}}$ value. So, much like voxel-level inference, we can calculate $p$-values for clusters and then correct them across the brain.
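To build some intuition for what "the size of clusters we would expect under the null" means, the sketch below swaps random-field theory for a simple Monte Carlo simulation. It repeatedly generates smooth 2D Gaussian noise, thresholds it, and records the largest cluster in each simulated image. The proportion of simulations whose largest cluster is at least as big as an observed cluster then acts as an approximate FWE-corrected cluster $p$-value. This is *not* how `SPM` calculates these values, and the image size, smoothness, number of simulations and function names are all arbitrary choices made for illustration.

```python
import numpy as np
from scipy import ndimage, stats


def max_cluster_size(z_map, u_c):
    """Size (in voxels) of the largest cluster above the threshold u_c."""
    labels, n = ndimage.label(z_map > u_c)
    if n == 0:
        return 0
    return int(np.bincount(labels.ravel())[1:].max())


def simulate_null_max_sizes(shape, sigma, u_c, n_sims, rng):
    """Largest cluster size in each of n_sims smooth pure-noise images."""
    sizes = np.empty(n_sims, dtype=int)
    for i in range(n_sims):
        noise = ndimage.gaussian_filter(rng.standard_normal(shape), sigma)
        noise /= noise.std()  # rescale to roughly unit variance after smoothing
        sizes[i] = max_cluster_size(noise, u_c)
    return sizes


def cluster_p_fwe(k, null_max_sizes):
    """Proportion of null images whose largest cluster has at least k voxels."""
    return float(np.mean(null_max_sizes >= k))


rng = np.random.default_rng(1)
u_c = stats.norm.isf(0.001)  # cluster-forming threshold, uncorrected p < 0.001
null_sizes = simulate_null_max_sizes((32, 32), sigma=2.0, u_c=u_c,
                                     n_sims=500, rng=rng)

for k in (1, 5, 20, 50):
    print(f"cluster of {k:>2} voxels: approx. p_FWE = {cluster_p_fwe(k, null_sizes):.3f}")
```

Because the null distribution is built from the *largest* cluster in each image, comparing an observed cluster against it controls the error rate across the whole image, which mirrors the logic of the corrected cluster-level $p$-values described above.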

Based on this description, it should be clear that cluster-level inference is a *two-step* procedure. First, we specify an initial threshold to create the clusters, then we perform inference using corrected $p$-values on the resultant clusters. This initial threshold is commonly known as the *cluster-forming threshold*, and is typically specified as an uncorrected $p < 0.001$. This threshold is essentially *arbitrary* and is there simply to create clusters. This threshold will determine the size of clusters you get, as the more liberal it is the bigger the clusters will become. In principle, this should not matter because the FWE correction takes account of this by knowing the size of cluster we would expect under the null at that threshold. However, if you make the threshold too liberal, you may find that individual clusters merge into single large clusters that are difficult to interpret. Also, evidence from Eklund, Nichols & Knutsson (2016) suggests that with a threshold more liberal than $p < 0.001$, the assumptions of random field theory break down and the FWE procedure becomes shockingly liberal. All this is to say that you should probably just stick to $p < 0.001$ as the cluster-forming threshold.

An additional issue with cluster-level inference is that, although it is more sensitive than voxel-level inference, this sensitivity comes at the cost of our ability to spatially localise effects. Whereas with voxel-level inference we can identify individual voxels as significant, we cannot do that with clusters because a significant cluster is based purely on the *size* and not the signal within the cluster. All you can really conclude is that one or more voxels in the cluster has evidence against the null. Although it is tempting to look at the peak statistic within a cluster, this will not always be helpful. If the cluster is very large then the peak may not be very representative of the whole cluster. For instance, there may be a variety of effects within a cluster that lead to $p < 0.001$. We therefore cannot always assume that the effect at the peak will be the same everywhere in the cluster. In addition, just focusing on the peak will not necessarily help with localising effects if the cluster covers half the brain (for instance). So we lose our ability to localise results very accurately, but do gain more sensitivity.

```{note}
Clusters are based on voxels touching each other after thresholding. Note that different software uses different definitions of "touching". For instance, we could count touching as including only the *faces* and *edges* of voxels, or we could also include voxels where the *corners* touch. `SPM` uses an 18-connectivity scheme, which includes all faces (6) and all edges (12), but does not include corners.
```
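The different definitions of "touching" can be made concrete with `scipy.ndimage.generate_binary_structure`, which builds the 3D neighbourhoods for face-only (6), face-and-edge (18) and face-edge-and-corner (26) connectivity. The tiny three-voxel mask below is a made-up example chosen so that each scheme gives a different number of clusters.

```python
import numpy as np
from scipy import ndimage

# 3D neighbourhoods: faces only, faces + edges, faces + edges + corners
schemes = {
    6: ndimage.generate_binary_structure(3, 1),
    18: ndimage.generate_binary_structure(3, 2),
    26: ndimage.generate_binary_structure(3, 3),
}

# Hypothetical supra-threshold mask containing three voxels
mask = np.zeros((3, 3, 3), dtype=bool)
mask[0, 0, 0] = True
mask[1, 1, 0] = True  # shares only an edge with (0, 0, 0)
mask[2, 2, 1] = True  # shares only a corner with (1, 1, 0)

n_found = {}
for conn, structure in schemes.items():
    n_neighbours = int(structure.sum()) - 1  # exclude the centre voxel
    _, n_found[conn] = ndimage.label(mask, structure=structure)
    print(f"{conn}-connectivity: {n_neighbours} neighbours, "
          f"{n_found[conn]} cluster(s)")

# 6-connectivity: 6 neighbours, 3 cluster(s)
# 18-connectivity: 18 neighbours, 2 cluster(s)
# 26-connectivity: 26 neighbours, 1 cluster(s)
```

The same thresholded image can therefore yield different cluster counts and sizes depending on the connectivity scheme, which is worth bearing in mind when comparing results across software.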

You can see an example of cluster-level-thresholded results below.

<iframe src="cluster-level.html" width="800" height="600" frameborder="0" scrolling="no" title=""></iframe>

## Choosing a Level of Inference
Choosing between voxel-level and cluster-level is a balance between sensitivity and localising power. Voxel-level is less sensitive, but better at localisation, whereas cluster-level is more sensitive, but worse at localisation. Cluster-level also requires the arbitrary cluster-forming threshold, which can be cause for criticism given that there is no objective way to determine what this should be. We can see these differences by considering the results presented above. While there are many more results with cluster-level, the size of some of the clusters makes it difficult to localise exactly which regions of the brain are associated with the task. As such, we cannot point to any single voxel and say "there was a significant effect here". All we can really say is that there was enough consistent activity across a range of voxels to be surprising if there was no effect present. Although there are many fewer results with voxel-level, we do at least have the ability to point to a single location and say that there was an effect *precisely* here. Although most of these results have clustered in a convincing way, there are some locations where only a small number of voxels survive. Is this convincing? If not, how many voxels would you need for the results to be convincing? There is no answer to this, making it an entirely arbitrary choice. As such, we need to accept that, technically, a single voxel is just as legitimate a finding as a cluster of 100 voxels. But would you want to stake the main findings of your paper on only 1 voxel?

So, choosing between these two approaches is tricky. Some software (such as `FSL`) defaults to cluster-level by arguing for increased sensitivity, but `SPM` does not do this. Instead, `SPM` shows you *all* levels of inference at once. You choose an initial feature-defining threshold and then `SPM` will show the cluster-level and peak-level results for that threshold. A typical approach is then to choose uncorrected $p < 0.001$ as the initial threshold and then look at both voxel-level and cluster-level statistics together. This threshold will create clusters and peaks for all the correction techniques that need them, but will not affect those correction techniques that do not. This means you can look at both cluster and voxel results in the results table, as shown in {numref}`spm-table-fig`. The advantage of this is that you can see what would happen if you used different levels of inference. The disadvantage is that the temptation is just to choose whichever one gives you results closest to the results you want. What you should be doing is choosing your level of inference *before* even analysing the data, based on the aims of your experiment.

```{figure} images/spm-table.png
---
width: 800px
name: spm-table-fig
---
Example of a results table from `SPM`.
```

```{note}
`SPM` also includes *set-level* inference, which uses random-field theory to determine the probability of the number of clusters in the results, under the null. For instance, {numref}`spm-table-fig` shows set-level inference for the 5 clusters that remain after thresholding. As this is significant, we could conclude that a significant number of clusters were found. However, given that there is no interesting spatial information in this level of inference, it is debatable how useful this is.
```

```{important}
You should *not* mix-and-match the cluster and voxel level, because they are asking two different questions about your data. The voxel-level is asking "is the effect at this voxel surprising if the null hypothesis were true?" whereas the cluster-level is asking "is the size of this cluster at this cluster-defining threshold surprising if the null hypothesis were true?". So you need to pick one to stick with for all your results, rather than picking whichever one makes your results look nicer on an ad-hoc basis. Much like choosing a correction technique, this lack of certainty on which level of inference to choose may feel unsatisfactory. Unfortunately, there is no single correct approach here, so we reach a point where we have to make a decision for ourselves about which one to go with. `SPM` can provide us with options, but ultimately the decision is ours.
```