# The General Linear Model

## The General Linear Model Framework
To understand the GLM, we have to start with *multiple regression*. In the Statistical Inference synchronous session we touched on simple linear regression, with a single outcome variable and a single predictor variable. Multiple regression is essentially the same, except we have multiple predictor variables. The multiple regression model with $k$ predictor variables is therefore given by

$$
y_{i} = \beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \dots + \beta_{k}x_{ik} + \epsilon_{i}
$$

where we usually assume a normal probability model of the form

$$
\begin{align}
y_{i} &\sim \mathcal{N}(\mu_{i},\sigma) \\
\mu_{i} &= \beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \dots + \beta_{k}x_{ik}
\end{align}
$$

Conceptually, this is the same as simple regression, except for a few details. Firstly, rather than a regression line this model estimates a regression plane in $k$-dimensional space. For instance, if we have 2 predictor variables then $k=2$ and we can visualise the regression model as shown below

```{figure} images/reg-plane.png
---
width: 600px
name: plane-fig
---
Illustration of how a multiple regression model with $k$ predictors forms a regression *plane* through $k$-dimensional space.
```

Obviously when $k > 2$ this becomes impossible to visualise, but the principle stays the same. The means of our assumed normal distributions become points on a plane in $k$-dimensional space, rather than points on a regression line. The individual regression slopes for each predictor are given by the edges of this plane. Importantly, these slopes are generally not the same as those obtained by fitting a simple regression model to each predictor separately. Each coefficient represents the effect of its predictor after taking all other predictors into account. This is an important point, because adding, removing or changing predictors can change all the coefficients. Many modelling decisions we make are built on this fact.
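To make the point about coefficients concrete, here is a minimal sketch using simulated data. The data, the seed and the variable names (`x1`, `x2`, `beta_multi`, `beta_simple`) are all invented for illustration and are not part of the course materials. The two predictors are deliberately correlated, so the slope for `x1` should differ depending on whether `x2` is in the model.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 200

# Two correlated predictors (simulated, purely illustrative)
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)

# Outcome generated with known coefficients: intercept 1, slopes 2 and -1
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

# Multiple regression: both predictors in the same model
X_multi = np.column_stack([np.ones(n), x1, x2])
beta_multi, *_ = np.linalg.lstsq(X_multi, y, rcond=None)

# Simple regression: x1 on its own, ignoring x2
X_simple = np.column_stack([np.ones(n), x1])
beta_simple, *_ = np.linalg.lstsq(X_simple, y, rcond=None)

print("x1 slope with x2 in the model:", beta_multi[1])
print("x1 slope without x2:          ", beta_simple[1])
```

With `x2` in the model, the slope for `x1` should land near the value of 2 used to generate the data. Without it, the slope also absorbs part of the effect of `x2`, so it comes out noticeably smaller.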

In terms of the GLM, the important point is that the GLM is simply multiple regression written in matrix notation. If you understand multiple regression, then you understand the GLM. The differences are largely notational, and beyond that the methods are exactly the same. Statisticians have long used multiple regression as a general system for data analysis, but it has taken time for other branches of science to catch up. For instance, Jacob Cohen wrote about this back in 1968, but it still has not completely caught on in Experimental Psychology. The usefulness of this approach is that we can use a single framework to define different kinds of statistical models, rather than treating them as separate entities. For instance, although regression and ANOVA models are usually taught separately, they can both be specified within the GLM framework. This flexibility is precisely why the GLM has been adopted as the main system for analysing neuroimaging data.

In terms of writing multiple regression in matrix notation, we start with the observation that there are always $n$ regression equations (one for each data point)

$$
\begin{align}
y_{1} &= \beta_{0} + \beta_{1}x_{11} + \beta_{2}x_{12} + \dots + \beta_{k}x_{1k} + \epsilon_{1} \\
y_{2} &= \beta_{0} + \beta_{1}x_{21} + \beta_{2}x_{22} + \dots + \beta_{k}x_{2k} + \epsilon_{2} \\
y_{3} &= \beta_{0} + \beta_{1}x_{31} + \beta_{2}x_{32} + \dots + \beta_{k}x_{3k} + \epsilon_{3} \\
\vdots  \\
y_{n} &= \beta_{0} + \beta_{1}x_{n1} + \beta_{2}x_{n2} + \dots + \beta_{k}x_{nk} + \epsilon_{n}
\end{align}
$$

Notice how the values of the outcome variable, predictor variables and errors are different for each equation, but the coefficients are the same for each equation. This indicates that the coefficients represent something that is consistent across all our data points, and thus provide a means of condensing our data down into a small number of values. Because the regression equations represent a system of linear equations, we know we can write them as vectors and matrices. Importantly, we can separate the predictor variables from the coefficients to give the following structure

$$
\begin{bmatrix}
y_{1} \\
y_{2} \\
y_{3} \\
\vdots \\
y_{n}
\end{bmatrix}
=
\begin{bmatrix}
1      & x_{11} & x_{12}  & \dots  & x_{1k} \\
1      & x_{21} & x_{22}  & \dots  & x_{2k} \\
1      & x_{31} & x_{32}  & \dots  & x_{3k} \\
\vdots & \vdots & \vdots  & \ddots & \vdots \\
1      & x_{n1} & x_{n2}  & \dots  & x_{nk} 
\end{bmatrix}
\begin{bmatrix}
\beta_{0} \\
\beta_{1} \\
\beta_{2} \\
\vdots    \\
\beta_{k}
\end{bmatrix}
+
\begin{bmatrix}
\epsilon_{1} \\
\epsilon_{2} \\
\epsilon_{3} \\
\vdots       \\
\epsilon_{n}
\end{bmatrix}
$$

which we can write shorthand as

$$
\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}
$$

This is probably the most important equation you will see on this course. It encapsulates the entire structure of our data analysis and is something you should become intimately familiar with during this module. $\mathbf{Y}$ represents the data vector, containing the values of our outcome variable. $\mathbf{X}$ represents the design matrix, containing each of our predictor variables as columns. $\boldsymbol{\beta}$ represents the vector of model coefficients/parameters and $\boldsymbol{\epsilon}$ represents the vector of errors.

The important part here is to recognise how multiplying the design matrix by the parameter vector recreates the $n$ regression equations. The values of these equations give the predicted values of $\mathbf{Y}$, each of which is the point on the regression plane for that particular combination of predictor variables.

$$
\hat{\mathbf{Y}} = \mathbf{X}\boldsymbol{\beta} = 
\begin{bmatrix}
1      & x_{11} & x_{12}  & \dots  & x_{1k} \\
1      & x_{21} & x_{22}  & \dots  & x_{2k} \\
1      & x_{31} & x_{32}  & \dots  & x_{3k} \\
\vdots & \vdots & \vdots  & \ddots & \vdots \\
1      & x_{n1} & x_{n2}  & \dots  & x_{nk} 
\end{bmatrix}
\begin{bmatrix}
\beta_{0} \\
\beta_{1} \\
\beta_{2} \\
\vdots    \\
\beta_{k}
\end{bmatrix}
=
\begin{bmatrix}
\beta_{0} + \beta_{1}x_{11} + \beta_{2}x_{12} + \dots  + \beta_{k}x_{1k} \\
\beta_{0} + \beta_{1}x_{21} + \beta_{2}x_{22} + \dots  + \beta_{k}x_{2k} \\
\beta_{0} + \beta_{1}x_{31} + \beta_{2}x_{32} + \dots  + \beta_{k}x_{3k} \\
\vdots  \\
\beta_{0} + \beta_{1}x_{n1} + \beta_{2}x_{n2} + \dots  + \beta_{k}x_{nk}
\end{bmatrix}
$$
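As a quick check of this equivalence, the sketch below builds a small design matrix and compares the matrix product with the regression equations written out one data point at a time. The numbers are made up ($n = 4$, $k = 2$) and the names `y_hat_matrix` and `y_hat_loop` are just labels chosen for this example.

```python
import numpy as np

# Hypothetical values: n = 4 observations, k = 2 predictors
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.5, 0.1, 0.9, 0.3])
beta = np.array([10.0, 2.0, -3.0])  # beta_0, beta_1, beta_2

# Design matrix: a column of ones for the intercept, then one column per predictor
X = np.column_stack([np.ones_like(x1), x1, x2])

# Matrix form: all n predicted values from a single multiplication
y_hat_matrix = X @ beta

# Equation-by-equation form: one regression equation per data point
y_hat_loop = np.array([beta[0] + beta[1] * a + beta[2] * b for a, b in zip(x1, x2)])

print(y_hat_matrix)
print(np.allclose(y_hat_matrix, y_hat_loop))  # the two forms agree
```

The column of ones is what lets the intercept $\beta_{0}$ appear in every equation without needing a predictor of its own.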

So another way to look at the GLM is as a model plus error

$$
\begin{alignat*}{3}
\mathbf{Y} &= \mathbf{X}\boldsymbol{\beta} &&+ \boldsymbol{\epsilon} \\
\underset{\text{Data}}{\mathbf{Y}} &= \underset{\text{Model}}{\hat{\mathbf{Y}}} &&+ \underset{\text{Error}}{\boldsymbol{\epsilon}}
\end{alignat*}
$$
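One way to get a feel for the "model plus error" view is to run it forwards and simulate some data. The sketch below picks arbitrary values for $\boldsymbol{\beta}$ and $\sigma$ (both assumptions made purely for illustration), computes the model part, draws normally distributed errors, and adds the two together to produce the data.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, illustrative only

# Hypothetical single-predictor design with n = 6 observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
n = x.size
X = np.column_stack([np.ones(n), x])

beta = np.array([2.0, 0.5])  # assumed "true" intercept and slope
sigma = 1.0                  # assumed error standard deviation

model = X @ beta                                  # Y-hat: points on the regression line
error = rng.normal(loc=0.0, scale=sigma, size=n)  # epsilon: random scatter around the line
data = model + error                              # Y: what we would actually observe

for y_i, m_i, e_i in zip(data, model, error):
    print(f"{y_i:7.3f} = {m_i:6.3f} + {e_i:7.3f}")
```

In a real analysis we only ever see the data column. The job of the GLM is to recover plausible values for the model and error parts from it.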

Do not worry at this stage if this all seems a bit abstract. In the next section, we will apply all this theory to a real dataset to see how it works.

## Building the Design Matrix

At the core of the GLM is the structure of the design matrix. This is the part of the GLM that changes from analysis to analysis and is the part that defines different models of our data. In fact, the *general* part of the name refers to the GLM's ability to accommodate different analyses simply through the specification of different design matrices. As we will come to see, SPM visualises the design matrix for us as a means of communicating the model that we are fitting to our data. This is an indication of how important this part of the GLM is, and so it is important to understand how we can structure it to accommodate different types of predictor variables.

### Continuous Predictor Variables

Any variable that is numeric and represents some sort of measurement is classified as a continuous predictor variable. This includes any quantifiable phenomena such as IQ, reaction time, score on a test, age, weight etc. These types of predictor variables are straightforward to use within the GLM because we simply add them verbatim to the design matrix as columns. As such, they do not require any form of special treatment to be used. We will see examples of these types of variables in the next section, when we look at an example GLM using some real-world data.
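As a small illustration of adding continuous predictors "verbatim", the sketch below stacks two made-up variables next to a column of ones. The measurements and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical measurements for five subjects
age = np.array([23.0, 35.0, 41.0, 29.0, 52.0])            # years
reaction_time = np.array([0.41, 0.52, 0.47, 0.39, 0.61])  # seconds

# Intercept column first, then each continuous predictor as its own column
X = np.column_stack([np.ones(age.size), age, reaction_time])

print(X)
print(X.shape)  # one row per subject, one column per parameter
```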

### Categorical Predictor Variables

Any variable that represents a form of grouping or category is classified as a categorical variable. This could include variables such as sex, patient group, blood type, ethnicity etc. Compared with continuous variables, categorical predictor variables are more complex to use with the GLM. Fundamentally, we have to turn these categories into numbers in order to put them in the design matrix. The way we do this is to form dummy variables. These are variables that have a value of 1 or 0 depending on the category. For instance, if the categories were control or patient, we could assign values like so

| Category  | Dummy Variable Value |
| --------- | -------------------- |
| Control   | 0                    |
| Patient   | 1                    |

If we have a variable with more than two categories, we can include more dummy variables like so

|  Blood Type | Dummy 1 Value | Dummy 2 Value | Dummy 3 Value  |
| :---------- | ------------- | ------------- | -------------: |
|  A          | 1             | 0             | 0              |
|  B          | 0             | 1             | 0              |
|  AB         | 0             | 0             | 1              | 
|  O          | 0             | 0             | 0              |

Notice how we always have one fewer dummy variable than the number of categories. So we only need 1 dummy variable to represent the 2 categories of control or patient, and we only need 3 dummy variables to represent the 4 categories of blood type. The category that is coded 0 on every dummy variable (here, control and blood type O) acts as the reference category. As an example, we can see a design matrix below that contains a dummy variable representing diagnosis, where the first 3 subjects are patients and the last 3 are controls.

$$
\mathbf{X} = 
\begin{bmatrix}
1 & 1 \\
1 & 1 \\
1 & 1 \\
1 & 0 \\
1 & 0 \\
1 & 0
\end{bmatrix}
$$

Any row that contains a 1 in the second column indicates that the subject is a patient, whereas any row that contains a 0 in the second column indicates that the subject is a control. We are going to see dummy variables used a lot, so do not worry if this concept is not clear yet. We will see plenty of examples as we press forward.
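The dummy coding described above can be sketched in a few lines. The first part reproduces the diagnosis design matrix, and the second part applies the coding from the blood type table to some made-up subjects. The labels and variable names are assumptions for this example only.

```python
import numpy as np

# Diagnosis: first 3 subjects are patients, last 3 are controls
diagnosis = np.array(["patient", "patient", "patient", "control", "control", "control"])
dummy = (diagnosis == "patient").astype(float)  # 1 for patient, 0 for control
X_diag = np.column_stack([np.ones(diagnosis.size), dummy])
print(X_diag)

# Blood type: 4 categories need 3 dummy variables, with type O coded as all zeros
blood = np.array(["A", "B", "AB", "O", "A", "O"])  # hypothetical subjects
coded_levels = ["A", "B", "AB"]                    # one dummy variable per listed level
dummies = np.column_stack([(blood == level).astype(float) for level in coded_levels])
X_blood = np.column_stack([np.ones(blood.size), dummies])
print(X_blood)
```

Each row of `X_blood` contains at most a single 1 among the dummy columns, and the rows for type O contain none.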
 
## Estimating the Parameters

Once we have constructed our design matrix, we have a fully-formed probability model of the form

$$
\mathbf{Y} \sim \mathcal{N}\left(\mathbf{X}\boldsymbol{\beta},\sigma^{2}\mathbf{I}\right)
$$

The next step is therefore to use maximum likelihood methods to estimate the model parameters. The equations that are derived from the method of maximum likelihood can be written in matrix terms to give a single equation that will simultaneously estimate values for every parameter. This equation is given by

$$
\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^{\prime}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Y}
$$

and is notable for its use of a matrix inverse. As you may remember from the Computational Tools lesson in Functional Neuroanatomy, inverting a matrix is a tricky business because an inverse may not always exist. This tells us something about the limitations of the GLM, namely that not every model we may want to use will be estimable. This is a thorny issue, but one which is thankfully rarely a concern when using SPM.
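The estimation equation translates almost directly into NumPy. The sketch below uses simulated data with assumed "true" coefficients, applies the equation as written, and compares the result with `np.linalg.lstsq`, which solves the same least-squares problem without forming the inverse explicitly. It also shows a simple case where the inverse does not exist, because one column of the design matrix is an exact multiple of another. This is only an illustration and makes no claim about how SPM performs the estimation internally.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed, illustrative only
n = 100

# Simulated design matrix with an intercept and two predictors
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

# Data generated from assumed coefficients plus normal errors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Direct translation of the estimation equation: (X'X)^-1 X'Y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Same estimates from a least-squares solver (no explicit inverse)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))

# A model that cannot be estimated: the third column is just 2 * the second
X_bad = np.column_stack([np.ones(n), x1, 2.0 * x1])
print(np.linalg.matrix_rank(X_bad), "independent columns out of", X_bad.shape[1])
```

When the rank is smaller than the number of columns, $\mathbf{X}^{\prime}\mathbf{X}$ is singular and there is no unique set of parameter estimates.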

## Interpreting the Parameters

Once we have the parameter estimates, our aim is to interpret what they mean, given that their values represent important summaries of the effects in our dataset. If the variable associated with a parameter estimate is a continuous variable, we would interpret the value as a regression slope, telling us how much our outcome variable is predicted to change for a unit increase in the predictor variable. For example, a regression slope with a value of -5.344 is visualised in {numref}`continuous-fig`.

```{figure} images/reg-continuous.png
---
width: 500px
name: continuous-fig
---
Example of the regression slope associated with a continuous predictor variable and a parameter estimate of $\beta_{1}=-5.344$.
```

This can be interpreted as a unit increase in the value of the predictor variable being associated with a decrease in the value of the outcome variable of 5.344. By comparison, if the variable associated with a parameter estimate is a categorical variable, we would interpret the value as a mean difference. To see why, consider what happens when we fit a regression slope to a dummy variable with a value of 0 or 1

```{figure} images/reg-dummy.png
---
width: 500px
name: dummy-fig
---
Example of the regression slope associated with a categorical predictor variable and a parameter estimate of $\beta_{1}=7.940$.
```

So we can see that the model is still fitting a regression slope, but one that goes from the mean of one category to the mean of the other. In this example, the slope has a value of 7.94 which tells us the mean difference between the categories. The intercept of the model is the mean of the group coded as 0 and a unit change simply refers to the change from one group to the other. So although this is still the same interpretation as any other regression slope, it is useful to think of the slopes from dummy variables representing mean differences.
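We can verify this interpretation numerically. The sketch below simulates two groups with different means (the group sizes, means and noise level are arbitrary choices for illustration), fits the GLM with a single dummy variable, and compares the parameter estimates with the group means.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed, illustrative only

# 20 subjects coded 0 followed by 20 subjects coded 1
group = np.repeat([0.0, 1.0], 20)

# Simulated outcome: assumed group means of 10 and 18, plus normal noise
y = np.where(group == 1.0, 18.0, 10.0) + rng.normal(scale=2.0, size=group.size)

# Design matrix: intercept column plus the dummy variable
X = np.column_stack([np.ones(group.size), group])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

mean_0 = y[group == 0.0].mean()
mean_1 = y[group == 1.0].mean()

print("intercept:", beta_hat[0], " mean of group 0:", mean_0)
print("slope:    ", beta_hat[1], " mean difference:", mean_1 - mean_0)
```

The intercept matches the mean of the group coded 0, and the slope matches the difference between the two sample means, up to floating-point error.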

Although we can interpret the values of our parameter estimates, on their own this is not enough because we also need to know how much we can trust these estimates. This is done by calculating the standard deviation of the sampling distribution of the estimates, known as the standard error. In the GLM, the standard errors of the parameter estimates can be calculated using the estimate of the model variance

$$
\hat{\sigma}^{2} = \frac{\hat{\boldsymbol{\epsilon}}^{\prime}\hat{\boldsymbol{\epsilon}}}{n-p}
$$

which is the estimate of the variance (the squared standard deviation) of the normal distribution we are using for our model. Here, $\hat{\boldsymbol{\epsilon}} = \mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ is the vector of residuals (the estimated errors), $n$ is the number of rows of $\mathbf{X}$ and $p$ is the number of model parameters (the number of columns of $\mathbf{X}$). This is then combined with the design matrix to produce a variance-covariance matrix

$$
\text{Cov}\left(\hat{\boldsymbol{\beta}}\right) = \hat{\sigma}^{2}\left(\mathbf{X}^{\prime}\mathbf{X}\right)^{-1}
$$

You do not need to worry too much about what this represents. The main point is just that the standard errors are taken as the square-root of the diagonal elements of this matrix. 
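Putting these pieces together, the sketch below computes the standard errors for a simple simulated example. The data and the variable names are assumptions made for illustration. The residuals (observed data minus fitted values) are used as the estimated errors in the variance formula.

```python
import numpy as np

rng = np.random.default_rng(4)  # fixed seed, illustrative only
n = 80

# Simulated data: one continuous predictor plus an intercept
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

# Parameter estimates from the estimation equation
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Residuals: the estimated errors
resid = y - X @ beta_hat

# Estimated model variance, dividing by n - p
n_obs, p = X.shape
sigma2_hat = (resid @ resid) / (n_obs - p)

# Variance-covariance matrix of the estimates, and the standard errors
cov_beta = sigma2_hat * XtX_inv
se = np.sqrt(np.diag(cov_beta))

print("estimates:      ", beta_hat)
print("standard errors:", se)
```

The off-diagonal elements of `cov_beta` are the covariances between pairs of estimates. They are not needed for the standard errors themselves.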

## Inference

... Although this seems fairly straightforward, inference using neuroimaging data has some very specific challenges that we will pick up on next week. 