# Sources of Variance
One of the key concepts we need in order to understand how group-level models work is *sources of variance*. When we have multiple subjects, each with multiple measurements, there are two general reasons why the data values differ from one measurement to the next. One of these corresponds to the *internal consistency* of a single subject, and the other corresponds to the *general consistency* within a group of subjects. For example, {numref}`variance-sources-fig` shows some hypothetical data from 3 subjects who were each measured 5 times. There are two distinct patterns in the data. One pattern corresponds to how close the data points are *within* each subject, while the other corresponds to how close the subjects are to one another, which is the spread *between* the subjects.

```{figure} images/variance-sources.png
---
width: 550px
name: variance-sources-fig
---
Illustration of how multiple measurements across multiple subjects create two sources of variance, one corresponding to variance *within* each subject and one corresponding to variance *between* the subjects.
```

```{important}
The distinction between the *within-subject* and *between-subjects* variance turns out to be very important because our modelling choices influence which one of these sources the variance calculated from the GLM residuals represents. This directly influences the calculation of the standard errors, which then directly affects the magnitude of our test statistics and thus the magnitude of the $p$-values. As such, understanding the difference is *critical* for our inference, as we will come to see.
```

## Within-subject Variance
The *within-subject* variance corresponds to the spread of data points around each subject's mean. This is illustrated in {numref}`ws-variance-fig`, where the orange lines represent the individual subject means ($\mu_1$, $\mu_2$, $\mu_3$) and the spread is indicated by the dashed vertical lines. For each subject, we can therefore calculate a variance value that captures their own personal degree of error. This tells us how *internally consistent* that subject is. The smaller the errors, the more consistent a subject is across the repeated measurements and, in other words, the better the mean is as a representation of that single subject.

```{figure} images/ws-variance.png
---
width: 550px
name: ws-variance-fig
---
Illustration of the *within-subject* variance for 3 subjects each with 5 repeated measurements.
```
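As a minimal sketch of this calculation, the Python snippet below computes a mean and a sample variance for each of 3 subjects with 5 measurements each. The data values and variable names are invented for illustration and are not the values plotted in {numref}`ws-variance-fig`.

```python
from statistics import mean, variance

# Hypothetical data: 3 subjects, each measured 5 times (values are made up).
data = {
    "subject_1": [4.8, 5.1, 5.0, 5.3, 4.8],
    "subject_2": [7.9, 8.2, 8.0, 7.7, 8.2],
    "subject_3": [2.5, 3.5, 3.0, 2.0, 4.0],
}

# Each subject's own mean (the role played by the orange lines in the figure).
subject_means = {name: mean(values) for name, values in data.items()}

# Sample variance of each subject's measurements around their own mean.
within_variances = {name: variance(values) for name, values in data.items()}

for name in data:
    print(
        f"{name}: mean = {subject_means[name]:.2f}, "
        f"within-subject variance = {within_variances[name]:.3f}"
    )
```

In this made-up example, `subject_3` has the largest within-subject variance, so their mean is the least representative of their individual measurements.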

From the perspective of fMRI, our repeated measurements for each subject are the time-series values. The 5 measurements shown in {numref}`ws-variance-fig` are therefore representative of the 200-odd measurements that make up an fMRI time-series. The time-series model is not usually just a simple mean. Nevertheless, we still get errors in terms of the discrepancy between the predicted values of the model and the raw data. From these errors, we can calculate an estimate of the within-subject variance. As such, the *within-subject* variance for fMRI is captured by the variance calculated from the single-subject GLM analysis. This is the `ResMS.nii` image saved by `SPM` for each subject.
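To make the idea of a residual variance concrete, the sketch below simulates a single voxel's time-series and fits a model with an intercept and one regressor by ordinary least squares. Everything here is an assumption made for illustration: the block design, the baseline of 100, the effect of 2 and the noise standard deviation of 1.5. It also ignores the temporal autocorrelation that real fMRI analyses account for, so it is an analogy to the quantity stored in `ResMS.nii` rather than a reproduction of what `SPM` does.

```python
import random

random.seed(0)  # fixed seed so the simulation is repeatable

n_scans = 200

# Hypothetical regressor: alternating blocks of 10 scans "off" and 10 scans "on".
x = [float((t // 10) % 2) for t in range(n_scans)]

# Simulated time-series: assumed baseline of 100, effect of 2, noise SD of 1.5.
y = [100.0 + 2.0 * xt + random.gauss(0.0, 1.5) for xt in x]

# Ordinary least squares for the model y = b0 + b1 * x.
x_bar = sum(x) / n_scans
y_bar = sum(y) / n_scans
sxx = sum((xt - x_bar) ** 2 for xt in x)
sxy = sum((xt - x_bar) * (yt - y_bar) for xt, yt in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Errors: discrepancy between the raw data and the model predictions.
residuals = [yt - (b0 + b1 * xt) for xt, yt in zip(x, y)]

# Residual mean square: sum of squared errors over the residual degrees of freedom
# (number of scans minus the 2 estimated parameters).
residual_df = n_scans - 2
residual_variance = sum(r ** 2 for r in residuals) / residual_df

print(f"Estimated effect: {b1:.3f}")
print(f"Estimated within-subject variance: {residual_variance:.3f}")
```

Because the simulated noise has a standard deviation of 1.5, the estimated variance should land near $1.5^2 = 2.25$, although the exact value depends on the random draw.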

## Between-subject Variance
The *between-subjects* variance corresponds to the spread of subject means around a group mean. This is illustrated in {numref}`bs-variance-fig`, where the orange lines represent the subject means (as before) and the green line represents the group mean (the average of all the subject means). This time, the errors correspond to the vertical distances between each subject's mean and the group mean. From this we can calculate a variance that captures how consistent the subjects are as a group. Alternatively, we can think of this as how similar the subjects are to each other, or how well the group mean represents the group as a whole. The smaller this value, the closer the subject means are to each other and the more consistent the group is.

```{figure} images/bs-variance.png
---
width: 550px
name: bs-variance-fig
---
Illustration of the *between-subjects* variance for 3 subjects.
```
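The same calculation can be sketched in Python by first reducing each subject to their mean and then taking the variance of those means around the group mean. The data values and variable names below are invented for illustration and are not the values plotted in {numref}`bs-variance-fig`.

```python
from statistics import mean, variance

# Hypothetical data: 3 subjects, each measured 5 times (values are made up).
data = {
    "subject_1": [4.8, 5.1, 5.0, 5.3, 4.8],
    "subject_2": [7.9, 8.2, 8.0, 7.7, 8.2],
    "subject_3": [2.5, 3.5, 3.0, 2.0, 4.0],
}

# Step 1: summarise each subject by their own mean.
subject_means = [mean(values) for values in data.values()]

# Step 2: the group mean is the average of the subject means.
group_mean = mean(subject_means)

# Step 3: sample variance of the subject means around the group mean.
between_variance = variance(subject_means)

print(f"Subject means: {[round(m, 2) for m in subject_means]}")
print(f"Group mean: {group_mean:.3f}")
print(f"Between-subjects variance: {between_variance:.3f}")
```

Note that the individual measurements only enter this calculation through the subject means, so the spread *within* each subject plays no direct role in the result.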

The concept of between-subjects variance is important for allowing us to generalise from our current sample of 3 subjects to the wider population. This is because it gives us an estimate of how variable we expect subjects to be *in the population*. If we are going to make inferences about the group mean, we need to have some sense of how well the group mean describes the population of subjects. If the between-subjects variance is high, it suggests that the individual subject means are highly variable and that the group mean is *not* a good summary. Conversely, if the between-subjects variance is low, it suggests that the individual subject means are very consistent and that the group mean *is* a good summary.

## The Bigger Picture
The concepts of *within-subject* and *between-subjects* variance play into our conceptualisation of the data-generating process. If we imagine the simple case of being interested in only a single population, we only have to think of a single population distribution. This distribution has a mean and a variance that we want to estimate. This variance is the *between-subjects* variance. From this distribution, we sample individual subjects. As this is a distribution of averages, we are effectively sampling a random assortment of subject means. Sampling multiple values from each subject represents another level of sampling, where we are drawing values from the individual subject distributions. These are parameterised using the individual subject means and individual subject variances. These individual subject variances are the *within-subject* variances, as illustrated in {numref}`sampling-model-fig`.

```{figure} images/sampling-model.png
---
width: 800px
name: sampling-model-fig
---
Illustration of the repeated measurements sampling model. ...
```
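One way to get a feel for this two-stage sampling process is to simulate it. The sketch below assumes normal distributions at both levels, which the text above does not state explicitly, and uses invented parameter values throughout (a population mean of 5, a between-subjects standard deviation of 2, and a different within-subject standard deviation for each subject).

```python
import random
from statistics import mean, variance

random.seed(1)  # fixed seed so the simulation is repeatable

# Assumed population parameters (made up for illustration).
population_mean = 5.0
between_sd = 2.0              # spread of subject means in the population
within_sds = [0.3, 0.5, 0.8]  # one within-subject SD per subject
n_measurements = 5

# First level of sampling: draw each subject's mean from the population distribution.
true_subject_means = [
    random.gauss(population_mean, between_sd) for _ in within_sds
]

# Second level of sampling: draw repeated measurements from each subject's own
# distribution, defined by that subject's mean and within-subject SD.
data = [
    [random.gauss(mu, sd) for _ in range(n_measurements)]
    for mu, sd in zip(true_subject_means, within_sds)
]

# Estimates recovered from the simulated data.
estimated_means = [mean(values) for values in data]
within_variances = [variance(values) for values in data]
group_mean = mean(estimated_means)
between_variance = variance(estimated_means)

print(f"Estimated subject means: {[round(m, 2) for m in estimated_means]}")
print(f"Within-subject variances: {[round(v, 3) for v in within_variances]}")
print(f"Group mean: {group_mean:.3f}")
print(f"Between-subjects variance: {between_variance:.3f}")
```

With only 3 subjects, the estimates of the population mean and the between-subjects variance will typically be quite far from the values used to generate the data. Increasing the number of entries in `within_sds` is a simple way to see how the estimates behave with a larger sample.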

### The Hierarchical Sampling Model
We can formalise the sampling model illustrated in {numref}`sampling-model-fig` as a *hierarchical linear model*, also known as a *multilevel linear model*. 