# Problems With Using the GLM to Model fMRI Data
Although we have now seen one way of using the GLM to model fMRI data, there are several problems with the approach described in the previous section. These include issues with the shape and timing of the BOLD response, the correlation structure of time series data, the presence of low-frequency noise and the arbitrary scaling of BOLD data. Much of the early research into the mass-univariate method was concerned with identifying and providing solutions to these problems. This can be seen in a series of papers from Friston and colleagues called [Analysis of the fMRI time-series](https://onlinelibrary.wiley.com/doi/abs/10.1002/hbm.460010207), [Analysis of the fMRI time-series - revisited](https://pubmed.ncbi.nlm.nih.gov/9343589/), and [Analysis of the fMRI time-series - revisited -- again](https://pubmed.ncbi.nlm.nih.gov/9343600/). These papers demonstrate how it took a number of years to identify and solve the problems that accompany applying the GLM to fMRI data. Even to this day, some of the "solutions" used by SPM are contentious. As such, it is important to understand both the problems and their purported solutions in order to understand some of the limitations of the analysis approach we will be using going forward.

## Issue 1: The Shape and Timing of the BOLD Response
The first main problem with the approach described in the previous section is the *shape* and *timing* of the predictor variables. So far, we have used a dummy variable to indicate the onset and offset of the experimental conditions. This implies an instantaneous change in signal at the start of each experimental condition, as well as an instantaneous change in signal at the end of the experimental condition. However, as we know, the BOLD response is far from instantaneous. In {numref}`hrf-fig` we can see a typical BOLD response to a stimulus at time 0. 

```{figure} images/hrf.png
---
width: 500px
name: hrf-fig
---
Illustration of a typical BOLD response to a single stimulus at time 0.
```

Notice how it takes around 6 seconds for the signal to reach its peak and around 20 seconds to come back to baseline. As such, if we use a dummy variable we run the risk of our model not fitting the data simply because the actual BOLD response is offset from the stimulus onsets. This lack of fit will lead to larger residuals and a larger variance estimate. This will directly impact the magnitude of the standard errors, leading to a less sensitive model. This issue is illustrated in {numref}`bad-fit-fig`.

```{figure} images/bad-fit.png
---
width: 600px
name: bad-fit-fig
---
Illustration of how a dummy variable leads to poor correspondence with the BOLD signal given the delayed response to the stimuli.
```

One solution would be to offset our dummy variable in time to accommodate the delayed response. This would help improve the fit to an extent, but we would still not be capturing the shape of the actual response. In order to do this, we turn to the process of convolution, which we discussed during the Functional Neuroanatomy module. By convolving our dummy variable with a model of the hemodynamic response, we can create a much more realistic prediction of what our signal should look like, as shown below.

This model of the hemodynamic response is known as the hemodynamic response function (HRF), and was derived using deconvolution methods by several authors in the mid-1990s. The specific version shown above is referred to as the canonical HRF by SPM, and is derived from combining two gamma distributions. This is all we are going to say about this for now. The derivation of the HRF and the justification for the use of convolution both rest on assuming that the relationship between the neural signal and the hemodynamic signal is linear and time-invariant (LTI). This is something we will discuss in more detail in the Experimental Design and Optimisation module. For now, just to show that this is a good idea, the illustration below shows the results of using a raw dummy variable and a convolved dummy variable on the magnitude of the model errors.
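The effect of convolution on the model errors can also be sketched numerically. The example below is only an illustration under stated assumptions: the function `double_gamma_hrf`, its shape parameters, the 1-second sampling interval, the block lengths and the simulated voxel are all hypothetical choices, not SPM's actual implementation of the canonical HRF.

```python
import math
import numpy as np

def double_gamma_hrf(t, peak_shape=6.0, under_shape=16.0, under_ratio=1 / 6):
    """Difference of two gamma densities (unit scale): a peak minus an undershoot.

    The parameter values are assumptions chosen to give an HRF-like shape.
    """
    t = np.asarray(t, dtype=float)

    def gamma_pdf(x, shape):
        # Gamma density with scale 1: x^(shape-1) * exp(-x) / Gamma(shape)
        return x ** (shape - 1) * np.exp(-x) / math.gamma(shape)

    return gamma_pdf(t, peak_shape) - under_ratio * gamma_pdf(t, under_shape)

tr = 1.0                      # assumed sampling interval in seconds
n_scans = 200
hrf = double_gamma_hrf(np.arange(0, 32, tr))

# Dummy variable: alternating 20 s rest and 20 s stimulation blocks
dummy = ((np.arange(n_scans) // 20) % 2 == 1).astype(float)

# Convolve the dummy variable with the HRF and trim to the scan length
convolved = np.convolve(dummy, hrf)[:n_scans] * tr

# Simulated voxel whose true response follows the convolved predictor
rng = np.random.default_rng(0)
y = 400 + 5 * convolved + rng.normal(0, 1, n_scans)

def residual_variance(x, y):
    """Fit y = b0 + b1 * x by least squares and return the error variance estimate."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid / (len(y) - X.shape[1])

var_dummy = residual_variance(dummy, y)
var_conv = residual_variance(convolved, y)
print(f"residual variance, raw dummy: {var_dummy:.2f}")
print(f"residual variance, convolved: {var_conv:.2f}")
```

Because the simulated signal is delayed and smoothed relative to the stimulus, the raw dummy variable leaves more unexplained variance than the convolved predictor, which is the loss of sensitivity described earlier.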

As a final point, after convolution our predictor is no longer a dummy variable and is more akin to a continuous predictor variable, as shown below. However, trying to think about the parameters as regression slopes can become a bit unintuitive when trying to think about changes in relation to experimental conditions. Luckily, we can still think of the parameters as capturing changes in signal magnitude. Because the parameters scale the height of our predictor variable, a larger parameter estimate still means a larger signal change relative to the baseline. So we can think of these parameter estimates in the same way as those from a dummy variable, even after convolution.

## Issue 2: Autocorrelation

Another issue with our approach thus far is that we have not acknowledged that we are working with time series data. What is so special about a time series? Well, one little detail we have not indicated so far is that all the theory behind the GLM is based on assuming that the data in Y are independent. This means there is no correlation between the data points. Why does this matter? Well, correlation has quite a big impact on the structure of the variance and thus on the standard errors. So far, we have just used the GLM as if our time series contains no correlation, but unfortunately time series data have a very specific correlation structure known as autocorrelation. This means that values close to one another in time are more similar (more correlated) than values far apart in time. So we need to take this into account, otherwise our standard errors will be wrong and any subsequent inference will be inaccurate.

In order to accommodate this correlation, SPM starts by doing an initial model fit at every voxel and then estimating the correlation structure in the data using the residuals. This is done using an autoregressive model of order 1, also known as an AR(1) model, which is given by

$$
\epsilon_{t} = \rho \epsilon_{t-1} + \tau_{t}
$$

So this is an enhanced version of the errors we saw previously, where the term $\rho\epsilon_{t-1}$ has been added. Here $\rho$ is the correlation, $\epsilon_{t-1}$ is the error one step back in time, and $\tau_{t}$ is an independent noise term with variance $\sigma^{2}$. The $\rho$ term effectively scales the influence of the previous error value on the current error value. The higher the correlation, the more of the previous error there will be in the current error. The main consequence of this is that when you calculate the covariance of the errors (assuming $|\rho| < 1$) you get

$$
\text{Cov}\left(\epsilon_{t},\epsilon_{t+n}\right) = \frac{\sigma^{2}}{1-\rho^{2}}\rho^{|n|}
$$

which means that the correlation gets weaker the further apart in time the data points are. This therefore captures a simple version of autocorrelation that can be estimated efficiently by SPM using restricted maximum likelihood (ReML) methods. Of note is that SPM does not estimate the correlation uniquely at each voxel. Instead, it pools together voxels that survive an initial thresholding and then uses those voxels to estimate a single correlation structure, which is assumed to be the same everywhere in the brain. This is done in the name of speed and efficiency, but is clearly far from ideal in terms of the assumptions that have to be made.
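To make the covariance formula concrete, the sketch below builds the implied covariance matrix and compares its lag-1 entry against a simulated AR(1) series. The function names `ar1_covariance` and `simulate_ar1`, and the values of $\rho$ and $\sigma^{2}$, are hypothetical choices for illustration. This is not how SPM estimates the correlation.

```python
import numpy as np

def ar1_covariance(n, rho, sigma2=1.0):
    """Covariance matrix of n consecutive errors from a stationary AR(1) process."""
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return sigma2 / (1.0 - rho ** 2) * rho ** lags

def simulate_ar1(n, rho, sigma2, rng):
    """Simulate eps_t = rho * eps_{t-1} + tau_t with Var(tau_t) = sigma2."""
    tau = rng.normal(0.0, np.sqrt(sigma2), n)
    eps = np.empty(n)
    # Draw the first value from the stationary distribution
    eps[0] = rng.normal(0.0, np.sqrt(sigma2 / (1.0 - rho ** 2)))
    for t in range(1, n):
        eps[t] = rho * eps[t - 1] + tau[t]
    return eps

rho, sigma2 = 0.4, 1.0
V = ar1_covariance(5, rho, sigma2)
print(np.round(V, 3))          # entries shrink away from the diagonal

rng = np.random.default_rng(1)
eps = simulate_ar1(100_000, rho, sigma2, rng)
lag1_cov = np.mean(eps[:-1] * eps[1:])   # empirical covariance at lag 1
print(f"theoretical lag-1 covariance: {V[0, 1]:.3f}")
print(f"simulated lag-1 covariance:   {lag1_cov:.3f}")
```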

Once SPM has an estimate of the AR(1) correlation structure, it uses it to create a whitening matrix. How it does this is not important (you can look at pages 195-6 in Poldrack, Mumford & Nichols, 2011, if you are curious). All we need to know is that this whitening matrix (denoted W) can be used to remove the correlation from the data. SPM does this by pre-multiplying the data and design:

$$
\mathbf{WY} = \mathbf{WX}\boldsymbol{\beta} + \mathbf{W}\boldsymbol{\epsilon}
$$

So after the whitening procedure our data and our design matrix will be different. This is because removing the correlation from the data breaks the connection between the data and the predictor variables, as the shape of the BOLD response will be changed by the whitening. In order to make sure our predictor variables are still able to predict the signal accurately, we also have to change them with the whitening matrix. This is one of the reasons why you will see the design matrix change colour in SPM during the statistical modelling.
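As a rough sketch of the whitening step, the example below builds a whitening matrix from an AR(1) correlation matrix and applies it to both the data and the design. Using the inverse Cholesky factor is an assumption made here for simplicity; it is only one of several valid constructions and is not a description of SPM's internals. The function names and simulated values are likewise hypothetical.

```python
import numpy as np

def ar1_correlation(n, rho):
    """Correlation matrix with entries rho^|i - j|."""
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return rho ** lags

def whitening_matrix(V):
    """Return W such that W @ V @ W.T is the identity (V = L L^T, W = L^{-1})."""
    L = np.linalg.cholesky(V)
    return np.linalg.inv(L)

n, rho = 50, 0.4
V = ar1_correlation(n, rho)
W = whitening_matrix(V)

# Simulate data with correlated errors: constant plus one predictor
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
errors = np.linalg.cholesky(V) @ rng.normal(size=n)
y = X @ np.array([400.0, 2.0]) + errors

# Pre-multiply both the data and the design, then fit as usual
WX, Wy = W @ X, W @ y
beta, *_ = np.linalg.lstsq(WX, Wy, rcond=None)

print(np.allclose(W @ V @ W.T, np.eye(n)))   # correlation removed
print(np.round(beta, 2))
```

Note that the columns of `WX` no longer look like the columns of `X`, which mirrors the change in the design matrix described above.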

## Issue 3: Low-frequency Noise
A third issue we face with applying the GLM to fMRI data is that the time series is often contaminated by low-frequency noise, also known as signal drift. For example, the time series below shows a steady increase in signal magnitude over time.

The problem with the drift is that it will bias the parameter estimates. For instance, we may get the impression that the signal change is larger in one experimental condition simply because there were repeats closer to the end of the experiment when the signal was larger than at the start of the experiment. Unfortunately, the cause of the drift relates to the scanners themselves and thus is difficult to avoid. As such, we have to find a way of removing this drift from the data instead.

The way SPM does this is by high-pass filtering the data. This is done using a discrete cosine transform (DCT) basis set. This is essentially a series of cosines increasing in frequency up to the desired filter cutoff, as shown below. The high-pass filter can be enacted by adding the DCT basis set as extra columns in the design matrix. This works in a similar fashion to the Fourier transform, in the sense that a linear combination of periodic functions can be used to represent any signal. In this case, the cosines will act together to remove any frequency below the cutoff from the data.

In order to not clutter up the design matrix, SPM actually takes a slightly different approach (which works out the same as adding the cosines to the design matrix). What SPM does is use the cosines to form a filtering matrix (denoted S) which can be used to pre-multiply the data and design, much in the same way as whitening.

$$
\mathbf{SY} = \mathbf{SX}\boldsymbol{\beta} + \mathbf{S}\boldsymbol{\epsilon}
$$

So this removes the low frequencies from the data and the design, using the same principles as the whitening procedure. The only element of this we need to concern ourselves with is what cutoff to use. By default SPM chooses 128 seconds, which is equivalent to 1/128 ≈ 0.008 Hz. So any frequency below 0.008 Hz will be removed from the data. Practically, this means we should design our experiments so that any periodic changes of interest (such as our experimental conditions) do not occur slower than around every 2 minutes, otherwise we run the risk of removing experimental signal with the filter. As we will see, there are facilities within SPM that can be used to find a suitable filter value and this is something we will also come back to in the Experimental Design and Optimisation module. This filtering is also another reason why you will see the design matrix change colour during the course of the analysis in SPM.
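The filtering matrix can be sketched in a few lines. In the example below, the function `dct_highpass_matrix`, the rule used to pick the number of cosines (keep every cosine whose frequency is at or below the cutoff) and the simulated signal are assumptions made for illustration rather than a copy of SPM's code. With orthonormal cosine columns $\mathbf{C}$, the matrix $\mathbf{S} = \mathbf{I} - \mathbf{CC}^{T}$ gives the residuals after regressing out the cosines.

```python
import numpy as np

def dct_highpass_matrix(n_scans, tr, cutoff=128.0):
    """Return (S, C): the filtering matrix and the low-frequency cosine basis."""
    # Cosine k has frequency k / (2 * n_scans * tr) Hz; keep those <= 1 / cutoff
    n_basis = int(np.floor(2 * n_scans * tr / cutoff))
    n = np.arange(n_scans)
    k = np.arange(1, n_basis + 1)          # the constant (k = 0) is left out
    C = np.sqrt(2.0 / n_scans) * np.cos(np.pi * np.outer(2 * n + 1, k) / (2 * n_scans))
    S = np.eye(n_scans) - C @ C.T
    return S, C

n_scans, tr = 200, 2.0                     # assumed scan count and repetition time
S, C = dct_highpass_matrix(n_scans, tr)

t = np.arange(n_scans) * tr
drift = 0.05 * t                           # slow linear drift
task = np.sin(2 * np.pi * t / 40.0)        # signal with a 40 s period
y = 400 + drift + task

y_filtered = S @ y
print(f"cosines used: {C.shape[1]}")
print(f"drift retained: {np.std(S @ drift) / np.std(drift):.3f}")
print(f"task retained:  {np.std(S @ task) / np.std(task):.3f}")
```

In this sketch the drift is almost entirely removed, while the 40-second signal, which sits well above the cutoff frequency, passes through largely intact.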

## Issue 4: Image Scaling
The final issue that we have not considered is that the BOLD signal itself is measured on an arbitrary scale that can differ from subject to subject or scanner to scanner. The problem is that the GLM parameters are on the same scale as the data. For instance, $\beta_{1}$ gives the change in signal for a unit change in our first predictor variable. Unfortunately, this means we cannot meaningfully compare the magnitude of these parameters between subjects, because their value depends upon the arbitrary scale of the signal. To get around this, SPM will scale the data such that the mean of the data over all voxels and all volumes is equal to 100. This is known as grand mean scaling and will effectively put all our subjects on the same scale. This is done completely automatically in the background for you, so why do we care? Well, you have to be slightly careful about interpretation here, because the temptation might be to think that this scaling puts the parameters on some sort of standardised and interpretable scale, such as percentage signal change. However, this is not correct, and the actual procedure to convert the parameters to percentage signal change is much more involved (as explained by Pernet (2014)). So note that SPM automatically scales the data, but remember this does not result in some standardised and easy-to-interpret metric for our effects.
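The arithmetic behind grand mean scaling is simple enough to sketch directly. The example below is a simplified illustration: the function `grand_mean_scale` and the two simulated subjects are hypothetical, and any masking of non-brain voxels is ignored.

```python
import numpy as np

def grand_mean_scale(data, target=100.0):
    """Scale a (volumes x voxels) array so that its overall mean equals target."""
    factor = target / data.mean()
    return data * factor, factor

rng = np.random.default_rng(3)
# Two simulated subjects measured on very different arbitrary scales
subject_a = 400 + rng.normal(0, 5, size=(100, 50))
subject_b = 1200 + rng.normal(0, 15, size=(100, 50))

scaled_a, factor_a = grand_mean_scale(subject_a)
scaled_b, factor_b = grand_mean_scale(subject_b)

print(f"subject A: factor {factor_a:.3f}, new mean {scaled_a.mean():.1f}")
print(f"subject B: factor {factor_b:.3f}, new mean {scaled_b.mean():.1f}")
```

Each subject is multiplied by a single constant, so the parameter estimates are rescaled by that same constant. Nothing in this step expresses an effect relative to a voxel's own baseline, which is why the result is not percentage signal change.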

## The Adapted GLM
So we have now seen that in order to make the GLM work with fMRI data we have to convolve our dummy variables with the HRF, estimate and remove the autocorrelation in the data, filter the data to remove low-frequency noise and scale the data to put it on a common scale across all subjects. All these steps form an adapted version of the GLM, as illustrated below
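Written in symbols, and assuming the same notation as the whitening and filtering equations above, one way to express this adapted model is

$$
\mathbf{Y}^{*} = \mathbf{X}^{*}\boldsymbol{\beta} + \boldsymbol{\epsilon}^{*}
$$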

where the "star" symbol indicates the whitened and filtered versions of the data, design matrix and errors. Once the parameters have been estimated, they can be used to multiply each column to scale the prediction to best match the data, as shown below. Notice how the constant takes on the baseline level of signal (400 in this example) and that the conditions are simply changes relative to this baseline.

Adding all the scaled columns together provides our final model prediction, giving us a clear perspective on what the GLM is doing to try and separate the true signal from noise.
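The idea of scaling each column and summing the results can be sketched as follows. The design matrix here is a hypothetical stand-in with two simple condition columns and a constant; in a real analysis the columns would be the convolved, whitened and filtered predictors, and the parameter values used to simulate the data are arbitrary.

```python
import numpy as np

n_scans = 120
scan = np.arange(n_scans)

# Hypothetical design: two conditions in alternating 15-scan blocks, plus a constant
cond_a = ((scan // 15) % 4 == 1).astype(float)
cond_b = ((scan // 15) % 4 == 3).astype(float)
constant = np.ones(n_scans)
X = np.column_stack([cond_a, cond_b, constant])

# Simulated data with a baseline of 400 and condition effects of 6 and 3
rng = np.random.default_rng(4)
y = X @ np.array([6.0, 3.0, 400.0]) + rng.normal(0, 1, n_scans)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Multiply each column by its parameter estimate, then add the columns together
scaled_columns = X * beta_hat
prediction = scaled_columns.sum(axis=1)

print(np.round(beta_hat, 2))
print(np.allclose(prediction, X @ beta_hat))
```

The estimate for the constant lands near the baseline of 400, and the condition estimates are changes relative to it. Summing the scaled columns is exactly the matrix product of the design matrix and the parameter estimates.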
