# The General Linear Model
To begin with, we will take a high-level perspective on the GLM as a *framework* that can be applied to any dataset. This may appear somewhat abstract on a first read, but we will see how it is applied to a real dataset in the next section. Remember that at this point we are not talking about fMRI data specifically. Because fMRI modelling is a specialised application of the GLM, we first need to discuss the general theory before seeing how this traditional approach needs to be adjusted to suit modelling an fMRI time series.

## The General Linear Model Framework
We start with the mathematical framework behind the GLM. In order to understand the GLM, we need to understand *multiple regression* which, in its most basic form, is known as *simple* regression.

### Simple Regression
In a simple regression model there is a single outcome variable $y$ that is associated with a single predictor variable $x$. The simple regression model takes the form

$$
y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}
$$

which defines a straight-line fit to the data, with $\beta_{0}$ representing the *intercept* and $\beta_{1}$ representing the *slope*. The term $\epsilon$ quantifies the amount of *error* in the model, allowing for the fact that a perfect fit between the two variables is rarely possible. This model is illustrated in {numref}`simple-fig`. In this example, the manager of a production line wants a good estimate of the required number of worker hours given the number of units that must be produced (the *lot size*). 

```{figure} images/simple-reg.png
---
width: 700px
name: simple-fig
---
Illustration of how a simple regression model amounts to constructing a straight-line to summarise the relationship between the predictor variable and the outcome variable.
```

As we can see, the simple regression model consists of a straight line through the scatterplot of measurements. For each value of `Lot Size`, the point on the regression line represents the predicted value of `Hours`. The magnitude of the estimated slope is therefore of interest, given that this quantifies the *strength* of the general relationship between the two variables. However, we also need to consider how close the raw data sit to the regression line, as data that are tightly packed around the regression line suggest a model that fits the data well. This concept of *model fit* is quantified by the *errors*. The smaller the $\epsilon$ terms are, the shorter the vertical distances between the regression line and the raw data.
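As a rough illustration of the ideas above, the sketch below fits a straight line by least squares and computes the predicted values and errors. The `lot_size` and `hours` values are invented for this example and are not the data shown in the figure.

```python
import numpy as np

# Hypothetical data (NOT the values from the figure): lot sizes and worker hours
lot_size = np.array([20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0])
hours = np.array([80.0, 105.0, 150.0, 180.0, 215.0, 250.0, 290.0])

# Least-squares estimates of the slope and intercept
x_mean, y_mean = lot_size.mean(), hours.mean()
slope = np.sum((lot_size - x_mean) * (hours - y_mean)) / np.sum((lot_size - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Predicted values sit on the line; errors are the vertical distances to the data
predicted = intercept + slope * lot_size
errors = hours - predicted

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print("errors:", np.round(errors, 2))
```

The same estimates can be obtained with `np.polyfit(lot_size, hours, 1)`, which returns the slope followed by the intercept.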

These concepts can be further understood by considering the probability model for a simple regression

$$
y_{i} \sim \mathcal{N}(\beta_{0} + \beta_{1}x_{i}, \sigma)
$$

Taking the example from {numref}`simple-fig`, each value of `Lot Size` is therefore associated with a normal distribution of values for `Hours` where the means sit along the population regression line. An illustration of this concept is given in {numref}`simple-prob-fig`.

```{figure} images/simple-reg-prob.gif
---
width: 700px
name: simple-prob-fig
---
Illustration of the simple regression normal probability model. The operator $E(Y)$ indicates the *expected* value of the outcome variable which, for a normal distribution, is equivalent to the mean.
```

The standard deviation of these normal distributions is given by $\sigma$ and reflects how widely spread the data are around the regression line. There is therefore a direct connection between the width of the assumed probability distributions and the model errors. For each value of `Lot Size`, we expect there to be a normal distribution of errors, spread equally above and below the line, with most errors close to the line and fewer further away. As such, the simple regression model can also be expressed as

$$
\begin{align}
y_{i} &= \beta_{0} + \beta_{1}x_{i} + \epsilon_{i} \\
\epsilon_{i} &\sim \mathcal{N}(0,\sigma)
\end{align}
$$

Remembering that this probability model represents the *population*, the aim of a simple regression analysis is to use a *sample* to estimate both the *mean* and *variance* of the population distribution. As the mean depends upon the parameters $\beta_{0}$ and $\beta_{1}$, these must also be estimated from the sample. Finding the line that best fits the data will result in estimates for $\beta_{0}$ and $\beta_{1}$. The errors resulting from this fit can then be used to estimate $\sigma$. 
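To make the population/sample distinction more concrete, here is a minimal simulation sketch. The population values `beta0`, `beta1` and `sigma` are assumptions chosen purely for illustration. A sample is drawn from the normal probability model, the line is fitted, and the errors are then used to estimate $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed population values, chosen only for illustration
beta0, beta1, sigma = 10.0, 3.5, 20.0

# Draw a sample: each y comes from a normal distribution centred on the line
n = 200
x = rng.uniform(20, 120, size=n)
y = rng.normal(loc=beta0 + beta1 * x, scale=sigma)

# Estimate the line from the sample (polyfit returns slope, then intercept)
b1_hat, b0_hat = np.polyfit(x, y, deg=1)

# Use the errors from the fit to estimate sigma
residuals = y - (b0_hat + b1_hat * x)
sigma_hat = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # two estimated mean parameters

print(f"b0_hat = {b0_hat:.2f}, b1_hat = {b1_hat:.2f}, sigma_hat = {sigma_hat:.2f}")
```

The estimates will not match the population values exactly, because they are computed from a finite sample, but they should fall close to them.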

Once the parameters are estimated, hypothesis testing can be used to determine whether the null hypotheses

$$
\begin{align}
\mathcal{H}_{0}:& \beta_{0} = 0 \\
\mathcal{H}_{0}:& \beta_{1} = 0
\end{align}
$$ 

can be rejected. Usually, the test on $\beta_{1}$ is of most interest because a slope of 0 is indicative of *no* relationship between the outcome and predictor variables. If significant, we would assume some non-zero relationship is present in the population and can interpret what the magnitude of $\hat{\beta}_{1}$ tells us about the relationship under study. This is often in the context of using the simple regression model to *predict* the outcome using the values of the predictor. For the example above, this would result in being able to predict `Hours` by only knowing the value of `Lot Size`, allowing the manager to determine if more or fewer workers need to be hired as production demands are scaled up or down.

```{admonition} Randomness of the errors
:class: tip
One of the more confusing aspects of statistical models is that they can be written in multiple equivalent ways. For instance, for simple regression, the probability model can be written as either

$$
y_{i} \sim \mathcal{N}\left(\beta_{0} + \beta_{1}x_{i},\sigma\right)
$$

or

$$
\begin{align}
y_{i} &= \beta_{0} + \beta_{1}x_{i} + \epsilon_{i} \\
\epsilon_{i} &\sim \mathcal{N}(0,\sigma)
\end{align}
$$

Although the first form may be more intuitive, the second form is insightful in terms of the principles of these models. Recall that we are assuming that there is some *constant* effect that runs through all our observations of a particular phenomenon. Also recall that we assume that the reason we do not observe the same value every time is because we are measuring a *random process* that is subject to measurement error. As such, each measurement we take is the sum of our constant effect of interest and random perturbations. This structure is then reflected in the statistical model, as the mean of the assumed distribution is considered the *constant* effect and the errors are considered the *random* deflections. As such, the *errors* are the element that creates randomness and thus are the *random variable* component of our measurements. In statistical parlance, this is the distinction between *fixed-effects* and *random-effects*. Fixed-effects are associated with the mean of the assumed probability distribution, whereas random-effects are associated with random perturbations and thus the *variance* of the measurements.

```
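The equivalence of the two forms can be checked numerically. The sketch below uses arbitrary illustrative values (`beta0`, `beta1`, `sigma` and `x_value` are assumptions) and draws a large number of outcomes both ways for a single predictor value. Both sets of draws should show approximately the same mean and standard deviation.

```python
import numpy as np

beta0, beta1, sigma = 2.0, 0.5, 1.5   # illustrative values only
x_value = 4.0                          # a single, fixed predictor value
mean = beta0 + beta1 * x_value         # constant (fixed) part of the model
n_draws = 100_000

# Form 1: draw y directly from a normal distribution centred on the mean
y_direct = np.random.default_rng(1).normal(loc=mean, scale=sigma, size=n_draws)

# Form 2: draw zero-mean errors and add them to the constant mean
errors = np.random.default_rng(2).normal(loc=0.0, scale=sigma, size=n_draws)
y_additive = mean + errors

for name, y in [("direct", y_direct), ("additive", y_additive)]:
    print(f"{name:>8}: mean = {y.mean():.3f}, sd = {y.std(ddof=1):.3f}")
```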

### Multiple Regression
Expanding the simple regression model to contain *multiple* predictor variables produces the multiple regression model. As such, the multiple regression model with $k$ predictor variables is given by

$$
\begin{align}
y_{i} &= \beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \dots + \beta_{k}x_{ik} + \epsilon_{i} \\
&= \beta_{0} + \sum_{j=1}^{k} \beta_{j}x_{ij} + \epsilon_{i}
\end{align}
$$

where we usually assume a normal probability model of the form

$$
y_{i} \sim \mathcal{N}\left(\beta_{0} + \sum_{j=1}^{k} \beta_{j}x_{ij},\sigma\right)
$$

which can also be expressed as

$$
\begin{align}
y_{i} &= \beta_{0} + \sum_{j=1}^{k} \beta_{j}x_{ij} + \epsilon_{i} \\
\epsilon_{i} &\sim \mathcal{N}\left(0,\sigma\right)
\end{align}
$$

Conceptually, much of multiple regression is the same as simple regression, except for a few details. Firstly, rather than a regression line, this model estimates a regression *plane* defined over the $k$ predictors (so that, together with the outcome, the full picture has $k+1$ dimensions). For instance, if $k=2$ then the model can be visualised as shown in {numref}`plane-fig`.

```{figure} images/reg-plane.png
---
width: 600px
name: plane-fig
---
Illustration of how a multiple regression model with $k$ predictors forms a regression *plane* over the $k$ predictors, with the outcome variable providing the additional dimension.
```

The means of the assumed normal distributions are no longer points along a regression line; rather, they become points on a plane defined over the $k$ predictors. The individual regression slopes for each predictor $\left(\beta_{j}\right)$ are given by the *edges* of this plane (where $j = 1,\dots,k$). Importantly, these slopes are not the same as fitting a simple regression model to each predictor separately. The slope coefficients represent the effect of each predictor after taking all other predictors into account. This means that adding or altering the predictors will change *all* the parameter estimates. This is an important point because many statistical modelling decisions are built upon this fact. For instance, the notion of *controlling* for the effect of a variable is based directly upon this behaviour.
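The difference between a slope from a simple regression and the slope for the same predictor in a multiple regression can be seen in a small simulation. The sketch below uses two correlated, simulated predictors (`x1` and `x2` are hypothetical names, and the true slopes of 2 and -1 are assumptions of the example).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Two correlated predictors (simulated; names are illustrative)
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)

# Assumed true model: slopes of 2 for x1 and -1 for x2
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

# Simple regression of y on x1 alone
X_simple = np.column_stack([np.ones(n), x1])
b_simple, *_ = np.linalg.lstsq(X_simple, y, rcond=None)

# Multiple regression of y on x1 and x2 together
X_multi = np.column_stack([np.ones(n), x1, x2])
b_multi, *_ = np.linalg.lstsq(X_multi, y, rcond=None)

print(f"slope for x1, simple regression:   {b_simple[1]:.3f}")
print(f"slope for x1, multiple regression: {b_multi[1]:.3f}")
```

In the simple regression, the slope for `x1` absorbs part of the effect of the omitted `x2` because the two predictors are correlated. Once `x2` enters the model, the slope for `x1` moves towards its assumed true value.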

Despite these differences, the multiple regression model proceeds in much the same fashion as simple regression. The individual parameters of the mean function can be estimated to produce a single estimate for the intercept $\left(\beta_{0}\right)$ and $k$ estimates for the slopes $\left(\beta_{1},\dots,\beta_{k}\right)$. Estimation of the variance again proceeds using the errors, which now represent vertical distances from the regression *plane* to the raw data. Hypothesis tests can then be performed on each of the estimates to determine which of the $k$ variables appears to show a significant non-zero relationship with the outcome. This involves testing null hypotheses of the form

$$
\begin{align}
\mathcal{H}_{0}:& \beta_{0} = 0 \\
\mathcal{H}_{0}:& \beta_{1} = 0 \\
\vdots \\
\mathcal{H}_{0}:& \beta_{k} = 0
\end{align}
$$

Much like simple regression, the general aim is the accurate *prediction* of the outcome, this time using the values of *multiple* variables. For instance, the manager of the production line may enhance their model of `Hours` by also considering `Wage` and `Age` of the workers, alongside `Lot Size`. This may allow for a more accurate prediction of `Hours`, as well as allowing the manager to determine which factor is the most influential and where the focus should be when considering how to adapt to changes in production.

### Multiple Regression in Matrix Form
Now that we have discussed both *simple* and *multiple* regression, we can return to the topic of the GLM. The most important point to understand is that the GLM is simply *multiple regression in matrix form*. As such, if you understand multiple regression, then you already understand the GLM. The fact that there is a different term to refer to the matrix-variant of multiple regression is largely historical, as discussed in the box below. 

```{admonition} The history of the GLM
:class: tip
Historically, statistical analyses could be categorised as either *regression* methods or *analysis of variance* (ANOVA) methods. The distinction was that regression models sought to estimate the relationships between multiple continuous measurements as a means of prediction, whereas ANOVA models sought to investigate group differences resulting from experimental manipulations. As such, regression was more related to observational studies, whereas ANOVA was associated with designed experiments. However, it was recognised in the 1950s that these two methods could be combined, meaning that an ANOVA model could be specified using multiple regression. Since this point, the equivalence of these methods has been well-recognised, though regression and ANOVA are still often taught in isolation. The use of the term *General Linear Model* is usually reserved for the specific matrix-based framework used to specify both regression and ANOVA models, whereas the terms *regression* and *ANOVA* are typically used in reference to the more historical non-matrix forms of the same analyses.
```

In terms of writing multiple regression in matrix notation, we start with the observation that there are always $n$ regression equations, one for each data point

$$
\begin{align}
y_{1} &= \beta_{0} + \beta_{1}x_{11} + \beta_{2}x_{12} + \dots + \beta_{k}x_{1k} + \epsilon_{1} \\
y_{2} &= \beta_{0} + \beta_{1}x_{21} + \beta_{2}x_{22} + \dots + \beta_{k}x_{2k} + \epsilon_{2} \\
y_{3} &= \beta_{0} + \beta_{1}x_{31} + \beta_{2}x_{32} + \dots + \beta_{k}x_{3k} + \epsilon_{3} \\
\vdots  \\
y_{n} &= \beta_{0} + \beta_{1}x_{n1} + \beta_{2}x_{n2} + \dots + \beta_{k}x_{nk} + \epsilon_{n}
\end{align}
$$

Because these regression equations represent a system of linear equations, we know we can write them as vectors and matrices. Importantly, we can separate the predictor variables from the coefficients to give the following structure

$$
\begin{bmatrix}
y_{1} \\
y_{2} \\
y_{3} \\
\vdots \\
y_{n}
\end{bmatrix}
=
\begin{bmatrix}
1      & x_{11} & x_{12}  & \dots  & x_{1k} \\
1      & x_{21} & x_{22}  & \dots  & x_{2k} \\
1      & x_{31} & x_{32}  & \dots  & x_{3k} \\
\vdots & \vdots & \vdots  & \ddots & \vdots \\
1      & x_{n1} & x_{n2}  & \dots  & x_{nk} 
\end{bmatrix}
\begin{bmatrix}
\beta_{0} \\
\beta_{1} \\
\beta_{2} \\
\vdots    \\
\beta_{k}
\end{bmatrix}
+
\begin{bmatrix}
\epsilon_{1} \\
\epsilon_{2} \\
\epsilon_{3} \\
\vdots       \\
\epsilon_{n}
\end{bmatrix}
$$

This can be written in shorthand as

$$
\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}
$$

This is probably the most important equation you will see on this course. It encapsulates the entire structure of our data analysis and is something you will become intimately familiar with during this module. To make this structure clear:
- $\mathbf{Y}$ represents the *data vector* containing the values of the outcome variable
- $\mathbf{X}$ represents the *design matrix* containing each of the predictor variables as columns 
- $\boldsymbol{\beta}$ represents the vector of *model parameters* 
- $\boldsymbol{\epsilon}$ represents the vector of *errors* 

Importantly, we need to recognise how multiplying $\mathbf{X}$ and $\boldsymbol{\beta}$ recreates the $n$ regression equations.

$$
\mathbf{X}\boldsymbol{\beta} = 
\begin{bmatrix}
1      & x_{11} & x_{12}  & \dots  & x_{1k} \\
1      & x_{21} & x_{22}  & \dots  & x_{2k} \\
1      & x_{31} & x_{32}  & \dots  & x_{3k} \\
\vdots & \vdots & \vdots  & \ddots & \vdots \\
1      & x_{n1} & x_{n2}  & \dots  & x_{nk} 
\end{bmatrix}
\begin{bmatrix}
\beta_{0} \\
\beta_{1} \\
\beta_{2} \\
\vdots    \\
\beta_{k}
\end{bmatrix}
=
\begin{bmatrix}
\beta_{0} + \beta_{1}x_{11} + \beta_{2}x_{12} + \dots  + \beta_{k}x_{1k} \\
\beta_{0} + \beta_{1}x_{21} + \beta_{2}x_{22} + \dots  + \beta_{k}x_{2k} \\
\beta_{0} + \beta_{1}x_{31} + \beta_{2}x_{32} + \dots  + \beta_{k}x_{3k} \\
\vdots  \\
\beta_{0} + \beta_{1}x_{n1} + \beta_{2}x_{n2} + \dots  + \beta_{k}x_{nk}
\end{bmatrix}
= \hat{\mathbf{Y}}
$$

These equations produce the *predicted* values from the model $\left(\hat{\mathbf{Y}}\right)$. As such, another way to look at the GLM is as a *prediction* plus *error*

$$
\underset{\text{Data}}{\mathbf{Y}} = \underset{\text{Prediction}}{\hat{\mathbf{Y}}} + \underset{\text{Error}}{\boldsymbol{\epsilon}}
$$

where the nature of the prediction depends upon the form of the design matrix.
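The way the matrix product recreates the individual regression equations can be verified directly. In the sketch below, the predictor values and the parameter vector are made up for illustration, with $n = 4$ observations and $k = 2$ predictors.

```python
import numpy as np

# Hypothetical values for n = 4 observations and k = 2 predictors
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.5, 0.1, 0.9, 0.3])
beta = np.array([10.0, 2.0, -3.0])   # beta_0, beta_1, beta_2

# Design matrix: a column of ones for the intercept, then one column per predictor
X = np.column_stack([np.ones_like(x1), x1, x2])

# Matrix form of the prediction
y_hat_matrix = X @ beta

# The same prediction written out one regression equation at a time
y_hat_rows = np.array([beta[0] + beta[1] * a + beta[2] * b for a, b in zip(x1, x2)])

print("X has shape", X.shape)
print("matrix and row-wise predictions agree:", np.allclose(y_hat_matrix, y_hat_rows))
```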

## Building the Design Matrix
At the core of the GLM is the structure of the design matrix. This is the part of the GLM that changes from analysis-to-analysis and is the element that defines different models of our data. The *general* bit of the GLM is a reference to the fact that including continuous predictors in the design matrix will produce a regression model, whereas including categorical predictors will produce an ANOVA model. As we will come to see, the design matrix is such a key part of the GLM that SPM chooses to *visualise* this element as a means of communicating the form of GLM being used to model the fMRI data. As such, understanding how the design matrix is structured is vital to understanding the GLM.

### Continuous Predictor Variables
Any variable that is *numeric* and represents some sort of *measurement* is classified as a *continuous* predictor variable. For instance, measurements such as IQ, reaction time, test score, age or weight are examples of continuous predictor variables. These types of variables enter the design matrix verbatim as column vectors. As such, they do not require any form of special treatment, as we will see in the worked example later in the lesson.

### Categorical Predictor Variables

Any variable that represents a form of *grouping* or *category* is classified as a *categorical* variable. For instance, categories such as diagnostic group, blood type or ethnicity are examples of categorical predictor variables. Compared with continuous variables, categorical variables are more complex to use with the GLM. Fundamentally, the category labels must be converted into numbers. The way this is done is by forming what are known as *dummy variables*. These are variables that contain either a value of 1 or 0 depending on the category. For instance, if the categories were *patient* or *control*, the dummy variable values could be assigned like so

| Category  | Dummy Variable Value |
| --------- | -------------------- |
| Control   | 0                    |
| Patient   | 1                    |

If there is a variable with more than two categories, more than one dummy variable can be used

| Blood Type | Dummy 1 Value | Dummy 2 Value | Dummy 3 Value |
| ---------- | ------------- | ------------- | ------------- |
| A          | 1             | 0             | 0             |
| B          | 0             | 1             | 0             |
| AB         | 0             | 0             | 1             |
| O          | 0             | 0             | 0             |

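In both cases, notice how there is always one fewer dummy variable than the number of categories. The category coded 0 on every dummy variable acts as the *reference* category, and is represented by the intercept rather than by a column of its own. As an illustration, consider the design matrix below containing a dummy variable representing diagnosis, where the first 3 subjects are patients and the last 3 are controls.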

$$
\mathbf{X} = 
\begin{bmatrix}
1 & 1 \\
1 & 1 \\
1 & 1 \\
1 & 0 \\
1 & 0 \\
1 & 0
\end{bmatrix}
$$
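A minimal sketch of dummy coding is given below. The helper `dummy_code` is a hypothetical function written for this example (it is not part of SPM or any library). It produces one 0/1 column per category, leaving out a chosen reference category.

```python
import numpy as np

def dummy_code(labels, reference):
    """Return (columns, names): one 0/1 column per category except the reference."""
    labels = np.asarray(labels)
    # Keep categories in order of first appearance, dropping the reference
    levels = [lev for lev in dict.fromkeys(labels.tolist()) if lev != reference]
    columns = np.column_stack([(labels == lev).astype(float) for lev in levels])
    return columns, levels

# Two categories: reproduces the design matrix shown above
diagnosis = ["patient", "patient", "patient", "control", "control", "control"]
dummies, names = dummy_code(diagnosis, reference="control")
X = np.column_stack([np.ones(len(diagnosis)), dummies])
print(X)

# Four categories: three dummy variables, with "O" as the reference
blood = ["A", "B", "AB", "O", "A", "O"]
blood_dummies, blood_names = dummy_code(blood, reference="O")
print(blood_names)
print(blood_dummies)
```

Continuous predictors need no such conversion and could simply be appended to `X` as further columns with `np.column_stack`.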
 
## Estimating the Parameters
Once the design matrix is formed, the probability model of the GLM is given by

$$
\mathbf{Y}_{i} \sim \mathcal{N}\left(\mathbf{X}_{i}\boldsymbol{\beta},\sigma\right)
$$

The next step is therefore to use maximum likelihood methods to estimate the parameters. Although beyond the scope of this lesson to discuss in more detail, maximum likelihood can be used to derive a single equation that estimates the regression plane that best fits the data. In this context, "best fits" is equivalent to minimising the sum of squared errors, such that the estimated parameters are those that minimise the squared vertical distances between the regression plane and the raw data. The equation that achieves this is given by

$$
\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^{\prime}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{Y}
$$

which is notable for its reliance on a matrix *inverse*. Recall that not all matrices are invertible, meaning that not every model specified in the design matrix will be estimable. This is more of an issue with other fMRI software, but is rarely a problem with SPM given its use of a *pseudo-inverse* for solving this equation. This approach comes with its own limitations, but does mean that all models should be estimable and we are not limited in terms of the form that the design matrix can take.
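The estimation equation, and the role of the pseudo-inverse, can be sketched with NumPy. This is only an illustration using simulated data and `np.linalg.pinv` (the Moore-Penrose pseudo-inverse). It is not a description of how SPM implements estimation internally.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
x = rng.normal(size=n)
y = 3.0 + 1.5 * x + rng.normal(scale=0.5, size=n)   # simulated outcome

# Full-rank design: the inverse exists and both routes give the same estimates
X = np.column_stack([np.ones(n), x])
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y
beta_pinv = np.linalg.pinv(X) @ y
print("estimates agree:", np.allclose(beta_inv, beta_pinv))

# Rank-deficient design: an intercept PLUS one dummy column per group,
# so the two dummy columns sum to the intercept column
group = np.repeat([0, 1], n // 2)
X_bad = np.column_stack([np.ones(n), group == 0, group == 1]).astype(float)
rank = np.linalg.matrix_rank(X_bad)      # 2, although there are 3 columns
beta_bad = np.linalg.pinv(X_bad) @ y     # still returns a (minimum-norm) solution
print("columns:", X_bad.shape[1], "rank:", rank)
```

For the rank-deficient design, $\mathbf{X}^{\prime}\mathbf{X}$ is singular, so the ordinary inverse is not available. The pseudo-inverse still yields fitted values, but the individual parameters are no longer uniquely determined, which is one of the limitations alluded to above.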

Once the value of $\boldsymbol{\beta}$ is determined, the value of $\sigma$ can be derived from the errors. The error vector $\boldsymbol{\epsilon}$ is formed by a simple subtraction

$$
\boldsymbol{\epsilon} = \mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}
$$

and can then be used to derive the model variance using

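$$
\hat{\sigma}^{2} = \frac{\boldsymbol{\epsilon}^{\prime}\boldsymbol{\epsilon}}{n-(k+1)}
$$

This is the sum of squared errors divided by the residual degrees of freedom, which is the number of observations $n$ minus the $k+1$ estimated parameters in $\boldsymbol{\beta}$ (the intercept plus the $k$ slopes). The square-root of $\hat{\sigma}^{2}$ provides the estimate of the standard deviation of the probability model distribution.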
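The subtraction and variance calculation can be sketched as follows, using simulated data with an assumed true standard deviation of 2. The divisor is taken here to be the number of observations minus the number of columns in the design matrix, which is an assumption of this sketch about how the degrees of freedom are counted.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma_true = 400, 2.0

# Simulated design matrix (intercept plus two predictors) and outcome
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 0.5, -2.0])
y = X @ beta_true + rng.normal(scale=sigma_true, size=n)

# Parameter estimates, then the error vector by subtraction
beta_hat = np.linalg.pinv(X) @ y
residuals = y - X @ beta_hat

# Residual degrees of freedom: observations minus estimated mean parameters
df_resid = n - X.shape[1]
sigma2_hat = (residuals @ residuals) / df_resid
sigma_hat = np.sqrt(sigma2_hat)

print(f"df = {df_resid}, sigma_hat = {sigma_hat:.3f} (simulated with sigma = {sigma_true})")
```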

## Interpreting the Parameters
Once the parameters have been estimated, interest turns to interpreting what they *mean*. Remembering that the parameters represent summaries of the effects and relationships in the dataset, knowing how to interpret their values is important in order to reach conclusions about the phenomena of interest.

### Parameter Estimates for Continuous Predictors
If the variable associated with a parameter estimate is continuous, the parameter is interpreted as a *regression slope*. For a unit change in the predictor variable, the parameter indicates how much the outcome variable is predicted to change (with any other predictors in the model held constant). For example, a model containing a continuous predictor with a parameter estimate of $\hat{\beta}_{1} = -5.344$ is visualised in {numref}`continuous-fig`. This would be interpreted as the outcome variable decreasing by 5.344 for each unit increase of the predictor.

```{figure} images/reg-continuous.png
---
width: 500px
name: continuous-fig
---
Example of the regression slope associated with a continuous predictor variable and a parameter estimate of $\beta_{1}=-5.344$.
```
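The unit-change interpretation can be confirmed with a small calculation. The slope below is the value quoted in the text, while the intercept of 100 is a made-up value, since the text does not report one.

```python
import numpy as np

beta0_hat = 100.0      # hypothetical intercept (not given in the text)
beta1_hat = -5.344     # slope quoted in the text

def predict(x):
    """Predicted outcome for a given predictor value."""
    return beta0_hat + beta1_hat * np.asarray(x, dtype=float)

# Increasing the predictor by one unit changes the prediction by the slope
change_per_unit = predict(4.0) - predict(3.0)

# Increasing it by two units changes the prediction by twice the slope
change_per_two_units = predict(5.0) - predict(3.0)

print(change_per_unit, change_per_two_units)
```

Note that the change in the prediction does not depend on the intercept or on the starting value of the predictor, only on the slope and the size of the step.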

### Parameter Estimates for Categorical Predictors
If the variable associated with a parameter estimate is categorical, the parameter would be interpreted as a *mean difference*. To see why, consider what happens when we fit a regression slope to a dummy variable, as shown in {numref}`dummy-fig`. Notice how the slope begins at the mean of the group coded with a 0 and then ends at the mean of the group coded with a 1. A unit increase on a dummy variable is equivalent to changing the category from 0 to 1. This means that the regression slope is still interpreted in the same fashion as before, but in this context tells us the *mean difference* between the categories. The *intercept* of this model is therefore the mean of the group coded as 0, the slope is the mean difference and the sum of the intercept and slope gives the mean of the group coded as 1. 

```{figure} images/reg-dummy.png
---
width: 500px
name: dummy-fig
---
Example of the regression slope associated with a categorical predictor variable and a parameter estimate of $\beta_{1}=7.940$.
```
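The correspondence between the parameters and the group means can be checked numerically. The outcome values below are invented for illustration and are unrelated to the figure.

```python
import numpy as np

# Hypothetical outcome values for two groups
control = np.array([10.0, 12.0, 11.0, 9.0])    # coded 0
patient = np.array([18.0, 20.0, 17.0, 19.0])   # coded 1

y = np.concatenate([control, patient])
dummy = np.concatenate([np.zeros(control.size), np.ones(patient.size)])
X = np.column_stack([np.ones(y.size), dummy])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("intercept:", beta_hat[0], "| mean of group coded 0:", control.mean())
print("slope:    ", beta_hat[1], "| mean difference:      ", patient.mean() - control.mean())
print("sum:      ", beta_hat.sum(), "| mean of group coded 1:", patient.mean())
```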

### Standard Errors of the Estimates
Although the values of the parameter estimates can be interpreted on their own, to perform statistical inference we also need some indication of how *variable* the estimates are. This is done by calculating the *standard error* for each parameter. In the GLM, the standard errors of the parameter estimates can be calculated by first constructing the *variance-covariance* matrix of the estimates

$$
\text{Cov}\left(\hat{\boldsymbol{\beta}}\right) = \hat{\sigma}^{2}\left(\mathbf{X}^{\prime}\mathbf{X}\right)^{-1}
$$

This is similar in layout to a correlation matrix, with the variance of each estimate on the diagonal and the covariances between pairs of estimates on the off-diagonal. The standard errors can then be taken as the square-root of the diagonal elements of $\text{Cov}\left(\hat{\boldsymbol{\beta}}\right)$.
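A short sketch of this calculation is given below, using simulated data with one predictor. The divisor used for $\hat{\sigma}^{2}$ is the number of observations minus the number of columns of the design matrix, which is an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 100

# Simulated data: intercept plus a single continuous predictor
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, 0.7]) + rng.normal(scale=1.0, size=n)

# Parameter estimates and estimated error variance
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
sigma2_hat = (residuals @ residuals) / (n - X.shape[1])

# Variance-covariance matrix of the estimates
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)

# Standard errors: square-roots of the diagonal elements
std_errors = np.sqrt(np.diag(cov_beta))

print(cov_beta)
print("standard errors:", std_errors)
```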

### Inference
The final step in using the GLM framework is to perform some sort of hypothesis test using the estimated parameter values. Typically, each estimate will be divided by its standard error to produce a *t*-statistic. For instance

$$
t = \frac{\hat{\beta}_{1}}{\text{SE}\left\{\hat{\beta}_{1}\right\}}
$$

This particular test statistic involves an implicit comparison of $\hat{\beta}_{1}$ with a proposed population value of $\beta_{1} = 0$. In the context of a regression slope, the null hypothesis is therefore that there is no relationship between the outcome and predictor (i.e. the slope is *flat*). In the context of a mean difference, the null hypothesis is that there is no difference in the average value of the outcome variable between the groups (i.e. the means are *identical*). Decisions about whether the null hypothesis can be rejected can be taken by calculating the *p*-value associated with the calculated *t*-statistic via reference to the appropriate null distribution. As such, this element of the GLM is equivalent to performing inference on the results of both the simple and multiple regression models discussed earlier.
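Pulling the steps together, the sketch below computes a *t*-statistic for every parameter of a simulated model. The function `glm_t_statistics` is a hypothetical helper written for this example, and the data are simulated so that one predictor is truly related to the outcome while the other is not.

```python
import numpy as np

def glm_t_statistics(X, y):
    """Return (beta_hat, std_errors, t_values, df_resid) for a least-squares fit."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    residuals = y - X @ beta_hat
    df_resid = X.shape[0] - X.shape[1]            # observations minus parameters
    sigma2_hat = (residuals @ residuals) / df_resid
    std_errors = np.sqrt(sigma2_hat * np.diag(XtX_inv))
    return beta_hat, std_errors, beta_hat / std_errors, df_resid

rng = np.random.default_rng(5)
n = 60
x_related = rng.normal(size=n)      # assumed true slope of 2
x_unrelated = rng.normal(size=n)    # assumed true slope of 0
y = 1.0 + 2.0 * x_related + rng.normal(size=n)
X = np.column_stack([np.ones(n), x_related, x_unrelated])

beta_hat, std_errors, t_values, df_resid = glm_t_statistics(X, y)
for name, b, se, t in zip(["intercept", "x_related", "x_unrelated"], beta_hat, std_errors, t_values):
    print(f"{name:>12}: estimate = {b:7.3f}, SE = {se:.3f}, t = {t:7.3f}")
```

A *p*-value could then be obtained by comparing each *t*-statistic with a *t*-distribution on `df_resid` degrees of freedom, for example with `scipy.stats.t.sf` if SciPy is available.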