# Sources of Variance
Before exploring group-level models for fMRI, one of the key concepts we need to understand is *sources of variance*. When we have multiple subjects, each with multiple measurements, there are two general reasons why the data differs from one measurement to the next. One of these reasons is the *internal consistency* of a single subject, and the other reason is the *general consistency* of each subject within the group. To make this clearer, {numref}`variance-sources-fig` shows some hypothetical data from 3 subjects who were each measured 5 times. Here, we can see two distinct patterns in the data. One pattern corresponds to how close the data points are *within* each subject, while the other pattern corresponds to how close the data points are *between* each of the subjects. These sources correspond to the *within-subject* and *between-subjects* variance.

```{figure} images/variance-sources.png
---
width: 550px
name: variance-sources-fig
---
Illustration of how multiple measurements across multiple subjects create two sources of variance, one corresponding to variance *within* each subject and one corresponding to variance *between* the subjects.
```

## Within-subject Variance
The *within-subject* variance corresponds to the spread of data points around each subject's mean. This is illustrated in {numref}`ws-variance-fig`, where the orange lines represent the individual subject means ($\mu_1$, $\mu_2$, $\mu_3$) and the spread is indicated by the dashed vertical lines. For each subject, we can therefore calculate a variance value that captures their own personal degree of error. This tells us how *internally consistent* that subject is. The smaller the errors, the more consistent a subject is across the repeated measurements.

```{figure} images/ws-variance.png
---
width: 550px
name: ws-variance-fig
---
Illustration of the *within-subject* variance for 3 subjects each with 5 repeated measurements.
```
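
As a minimal sketch of this calculation, the snippet below uses made-up numbers for 3 subjects with 5 measurements each (the values and variable names are illustrative assumptions, not the data behind the figure). It computes each subject's mean, the errors around that mean, and the resulting within-subject variance.

```python
import numpy as np

# Hypothetical data: 3 subjects (rows), 5 repeated measurements each (columns).
data = np.array([
    [4.8, 5.3, 5.1, 4.6, 5.2],
    [7.1, 6.7, 7.4, 6.9, 6.9],
    [2.9, 3.4, 3.1, 2.7, 2.9],
])

# One mean per subject (the orange lines in the figure).
subject_means = data.mean(axis=1)

# Errors: distance of each measurement from its own subject's mean.
within_errors = data - subject_means[:, np.newaxis]

# One within-subject variance per subject (ddof=1 gives the sample variance).
within_var = data.var(axis=1, ddof=1)

print("Subject means:           ", subject_means)
print("Within-subject variances:", within_var)
```

A smaller value in `within_var` indicates a subject whose measurements sit more tightly around their own mean.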

From the perspective of fMRI, our repeated measurements for each subject are the time-series values. We can therefore think of the 5 measurements shown in {numref}`ws-variance-fig` as representative of the 200-odd measurements that make up an fMRI time-series. The time-series model is not usually just a simple mean. Nevertheless, we still get errors in the form of discrepancies between the values predicted by the model and the raw data. From these errors, we can calculate an estimate of the within-subject variance. As such, the *within-subject* variance for fMRI is captured by the variance calculated from the single-subject GLM analysis.

```{important}
The within-subject variance captures the consistency of a single subject who was measured multiple times. It tells us, on average, how much the raw data deviates from the predicted values of the model. In the context of `SPM`, this is the `ResMS.nii` image saved from each single-subject GLM analysis.
```

## Between-subjects Variance
The *between-subjects* variance corresponds to the spread of the subject means around a group mean. This is illustrated in {numref}`bs-variance-fig`, where the orange lines represent the subject means (as before) and the green line represents the group mean (the average of all the subject means). This time, the errors correspond to the vertical distances between each subject's mean and the group mean. From this, we can calculate a variance that captures how consistent the subjects are as a group. The smaller this value, the closer the subject means are to each other and the more consistent the group is.

```{figure} images/bs-variance.png
---
width: 550px
name: bs-variance-fig
---
Illustration of the *between-subjects* variance for 3 subjects.
```
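
A minimal sketch of the same idea in code is given below, assuming three hypothetical subject means (the numbers are invented for illustration). It computes the group mean, the errors of each subject mean around it, and the resulting variance.

```python
import numpy as np

# Hypothetical subject means for 3 subjects (the orange lines in the figure).
subject_means = np.array([5.0, 7.0, 3.0])

# Group mean: the average of all the subject means (the green line).
group_mean = subject_means.mean()

# Errors: distance of each subject's mean from the group mean.
between_errors = subject_means - group_mean

# Variance of the subject means around the group mean (sample variance).
between_var = subject_means.var(ddof=1)

print("Group mean:              ", group_mean)
print("Between-subjects errors: ", between_errors)
print("Between-subjects variance:", between_var)
```

Here the subject means are treated as known values. In real data they are themselves estimates, which is one reason this is only a sketch of the concept.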

The concept of *between-subjects* variance is important because it allows us to generalise from our current sample to the wider population. This is because the between-subjects variance is an estimate of how variable the population of *all* subjects is. When we think about sampling individual subjects from a population, the degree to which each subject differs from another corresponds to the variance of the population distribution. As such, if we want to make inferences about the population of subjects, we need to use the *between-subjects* variance to do so.

```{important}
The between-subjects variance captures the consistency of a group of subjects. It tells us, on average, how much each individual subject deviates from the group average. In the context of `SPM`, this value is calculated using a 2nd-level GLM analysis, as we will see later in this lesson.
```

## The Repeated Measures Sampling Model
The concepts of *within-subject* and *between-subjects* variance play into our conceptualisation of the data generating process. If we imagine the simple case of being interested in only a single population, we only have to think of a single population distribution. This distribution has a mean and a variance that we want to estimate. This variance is the *between-subjects* variance. From this distribution, we sample individual subjects. As this is a distribution of averages, we are effectively sampling a random assortment of subject means. Sampling multiple values from each subject represents another level of sampling, where we are drawing values from the individual subject distributions. These distributions are parameterised by the individual subject means and individual subject variances. These individual subject variances are the *within-subject* variances, as illustrated in {numref}`sampling-model-fig`.

```{figure} images/sampling-model.png
---
width: 800px
name: sampling-model-fig
---
Illustration of the repeated measurements sampling model.
```

### The Hierarchical Perspective
We can formalise the sampling model illustrated in {numref}`sampling-model-fig` as a *hierarchical linear model*, also known as a *multilevel linear model*. In this format, we can write, for measurement $i$ from subject $j$:

$$
\begin{align}
y_{ij}  &= \mu_{j} + \epsilon_{ij} &\quad \text{(Level 1)} \\
\mu_{j} &= \mu + \eta_{j} &\quad \text{(Level 2)} \\
\end{align}
$$

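In this form, the repeated measurements for subject $j$ can be modelled in terms of that subject's individual mean $(\mu_{j})$, with the error $\epsilon_{ij}$ capturing how far measurement $i$ falls from that mean. This forms the 1st-level of the model. The individual subject means can then be modelled in terms of the population mean $(\mu)$, with $\eta_{j}$ capturing how far subject $j$ falls from the population mean. This forms the 2nd-level of the model.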
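
To make the link between the two levels explicit, we can substitute the Level 2 equation into Level 1. This is just a rearrangement of the equations above and introduces no new symbols:

$$
\begin{align}
y_{ij} &= \mu_{j} + \epsilon_{ij} \\
       &= \mu + \eta_{j} + \epsilon_{ij}
\end{align}
$$

Each measurement is therefore the population mean, plus a subject-specific deviation, plus a measurement-specific error.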

```{tip}
The terms "1st-level analysis" and "2nd-level analysis" in `SPM` correspond to the hierarchical sampling perspective. Level 1 is each individual subject and Level 2 is a group of subjects.
```

Importantly, we can connect this hierarchical view with the illustration in {numref}`sampling-model-fig` by writing these equations in terms of distributions:

$$
\begin{align}
y_{ij}  &\sim \mathcal{N}(\mu_{j}, \sigma^{2}_{w_{j}}) &\quad \text{(Level 1)} \\
\mu_{j} &\sim \mathcal{N}(\mu, \sigma^{2}_{b}) &\quad \text{(Level 2)} \\
\end{align}
$$
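
One way to build intuition for these two distributions is to simulate from them. The sketch below is illustrative only: the parameter values are invented, and for simplicity it assumes every subject shares the same within-subject standard deviation (`sigma_w`), whereas the equations above allow a separate $\sigma^{2}_{w_{j}}$ per subject.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population parameters (assumptions for illustration).
mu = 5.0        # population mean
sigma_b = 2.0   # between-subjects standard deviation
sigma_w = 0.5   # within-subject standard deviation, assumed equal for all subjects

n_subjects = 2000
n_measurements = 50

# Level 2: draw one true mean per subject from the population distribution.
subject_means = rng.normal(loc=mu, scale=sigma_b, size=n_subjects)

# Level 1: draw repeated measurements around each subject's own mean.
y = rng.normal(
    loc=subject_means[:, np.newaxis],
    scale=sigma_w,
    size=(n_subjects, n_measurements),
)

# Spread of the simulated (true) subject means around their average.
between_var_hat = subject_means.var(ddof=1)

# Average of the per-subject sample variances.
within_var_hat = y.var(axis=1, ddof=1).mean()

print("Between-subjects variance (true 4.00):", round(between_var_hat, 2))
print("Within-subject variance   (true 0.25):", round(within_var_hat, 3))
```

With a large number of simulated subjects, the two estimates should land close to `sigma_b**2` and `sigma_w**2` respectively, mirroring the two levels of the sampling model.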

The important consequence of all of this is that if we were only interested in one subject (for instance, a case study), then the population distribution is irrelevant and we only need the *within-subject* variance. This would also be true if we had a set of subjects where we were only interested in those specific subjects and no one else. Again, the population distribution does not matter because we do not care about the population, only those subjects (those subjects *are* the population). However, if we want to make generalisations to the whole population from a sample, then we need to take the *between-subjects* variance into account.

```{important}
The connection between all this information and group-level modelling of fMRI data will not be clear at this point. However, we will see how this distinction between different sources of variance is key to understanding the approaches that have been taken over the years. When you have been through this lesson once, it might be a good idea to re-read this section with the bigger picture in mind.
```