# The GLM: A Worked Example

## Dataset
To see how the GLM is used, we will work through a simple non-imaging example before applying the GLM to fMRI data in the next section. In this example we are again looking at the miles per gallon (MPG) of a selection of cars. To keep the example contained, we have limited our sample to only 6 cars. The dataset is shown below.

| MPG  | Horsepower | Weight (1000 lbs) | Transmission |
| ---- | ---------- | ----------------- | ------------ |
| 21.0 | 110        | 2.620             | Manual       |
| 21.0 | 110        | 2.875             | Manual       |
| 22.8 | 93         | 2.320             | Manual       |
| 21.4 | 110        | 3.215             | Automatic    |
| 18.7 | 175        | 3.440             | Automatic    |
| 18.1 | 105        | 3.460             | Automatic    |

In this example, MPG is our outcome variable and so forms the vector $\mathbf{Y}$

In [1]:
Y = [21.0 21.0 22.8 21.4 18.7 18.1]'

## Building the Design Matrix
For the design matrix, we need a column of 1s for the constant, followed by columns for horsepower, weight and a dummy variable for transmission (coded 1 for manual and 0 for automatic). Our design matrix will therefore be as follows

In [2]:
X = [1     1     1     1     1     1;     ...
     110   110   93    110   175   105;   ...
     2.620 2.875 2.320 3.215 3.440 3.460; ...
     1     1     1     0     0     0]'
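The dummy variable in the last column can also be generated from the transmission labels rather than typed by hand. The following is a minimal sketch, assuming the labels are stored in a cell array. The variable names `transmission`, `isManual`, `hp` and `wt` are illustrative and are not part of the original example.

```matlab
% Transmission labels for the six cars, in the same order as the table
transmission = {'Manual'; 'Manual'; 'Manual'; 'Automatic'; 'Automatic'; 'Automatic'};

% Dummy coding: 1 for manual, 0 for automatic (the reference category)
isManual = double(strcmp(transmission, 'Manual'));

% Continuous predictors as column vectors
hp = [110 110 93 110 175 105]';
wt = [2.620 2.875 2.320 3.215 3.440 3.460]';

% Assemble the design matrix: constant, horsepower, weight, transmission
X = [ones(numel(hp),1) hp wt isManual]
```

Coding automatic as 0 makes it the reference category, so the transmission parameter describes manual cars relative to automatic cars.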

Mathematically, our GLM currently has the form

$$
\begin{bmatrix}
21.0 \\ 
21.0 \\
22.8 \\
21.4 \\
18.7 \\
18.1
\end{bmatrix}
=
\begin{bmatrix}
1 & 110 & 2.620 & 1 \\
1 & 110 & 2.875 & 1 \\
1 & 93  & 2.320 & 1 \\
1 & 110 & 3.215 & 0 \\
1 & 175 & 3.440 & 0 \\
1 & 105 & 3.460 & 0
\end{bmatrix}
\begin{bmatrix}
\beta_{0} \\
\beta_{1} \\
\beta_{2} \\
\beta_{3}
\end{bmatrix}
+
\begin{bmatrix}
\epsilon_{1} \\ 
\epsilon_{2} \\
\epsilon_{3} \\
\epsilon_{4} \\
\epsilon_{5} \\
\epsilon_{6}
\end{bmatrix}
$$

So, we have our *outcome* in $\mathbf{Y}$ and our *predictors* arranged in $\mathbf{X}$, but we do not know the values of the *parameters* or the *errors*.

## Parameter Estimation
We can estimate the parameters using the equation given in the previous section, derived from the method of maximum likelihood

In [3]:
beta = inv(X'*X)*X'*Y
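The explicit inverse mirrors the estimation equation, but MATLAB's backslash operator solves the same least-squares problem without forming `inv(X'*X)`, which is generally preferable numerically. A minimal sketch, with `X` and `Y` redefined so that the snippet runs on its own (`betaInv`, `betaSlash` and `maxDiff` are illustrative names):

```matlab
Y = [21.0 21.0 22.8 21.4 18.7 18.1]';
X = [1     1     1     1     1     1;     ...
     110   110   93    110   175   105;   ...
     2.620 2.875 2.320 3.215 3.440 3.460; ...
     1     1     1     0     0     0]';

betaInv   = inv(X'*X)*X'*Y;   % estimator written exactly as in the equation
betaSlash = X \ Y;            % least-squares solution via backslash

% The two solutions should agree up to floating-point error
maxDiff = max(abs(betaInv - betaSlash))
```

For a small, well-behaved design matrix like this one the two approaches should give the same estimates to many decimal places. The difference matters more when the columns of `X` are strongly correlated.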

Once we have the parameter estimates we can calculate the predicted values of $\mathbf{Y}$

In [4]:
Yhat = X*beta

which can then be used to calculate the errors

In [5]:
E = Y - Yhat

We now have all the elements needed to complete the GLM equation

$$
\begin{bmatrix}
21.0 \\ 
21.0 \\
22.8 \\
21.4 \\
18.7 \\
18.1
\end{bmatrix}
=
\begin{bmatrix}
1 & 110 & 2.620 & 1 \\
1 & 110 & 2.875 & 1 \\
1 & 93  & 2.320 & 1 \\
1 & 110 & 3.215 & 0 \\
1 & 175 & 3.440 & 0 \\
1 & 105 & 3.460 & 0
\end{bmatrix}
\begin{bmatrix}
36.700 \\
-0.005 \\
-4.945 \\
-1.715
\end{bmatrix}
+
\begin{bmatrix}
-0.499 \\
\hphantom{-}0.763 \\
-0.264 \\
\hphantom{-}1.129 \\
-0.145 \\
-0.984
\end{bmatrix}
$$
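As a quick sanity check on the completed equation, least-squares errors should be orthogonal to every column of the design matrix. Because the first column is the constant, this also means the errors sum to zero. The sketch below redefines the data so that it is self-contained and uses the backslash operator for the fit. The name `orth` is illustrative.

```matlab
Y = [21.0 21.0 22.8 21.4 18.7 18.1]';
X = [1     1     1     1     1     1;     ...
     110   110   93    110   175   105;   ...
     2.620 2.875 2.320 3.215 3.440 3.460; ...
     1     1     1     0     0     0]';

beta = X \ Y;        % parameter estimates
Yhat = X*beta;       % predicted values
E    = Y - Yhat;     % errors (residuals)

% Each element is the inner product of one design-matrix column with the
% errors; all four should be zero up to floating-point error
orth = X'*E
```

The rounded errors printed in the equation above are consistent with this: they sum to zero, as do the three errors for the manual cars.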

## Interpreting the Parameters

To interpret the parameters, it helps to put them in a table

| Effect       | Estimate | 
| ------------ | -------- |
| Constant     | 36.700   |
| Horsepower   | -0.005   |
| Weight       | -4.945   |
| Transmission | -1.715   |

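The interpretation is that, holding the other predictors constant,
- The constant is the predicted MPG for an automatic car with zero horsepower and zero weight, which is an extrapolation far outside the data rather than a meaningful quantity in itself
- An increase in horsepower of 1 is associated with a decrease of 0.005 MPG
- An increase in weight of 1 unit (1000 lbs) is associated with a decrease of 4.945 MPG
- Manual cars (coded 1) are predicted to have an MPG 1.715 lower than automatic cars (coded 0)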
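To make the interpretation concrete, the fitted model can be used to compare predictions for cars that differ in only one predictor. This is a sketch using hypothetical cars: the rows `carLight`, `carHeavy` and `carManual` are chosen for illustration and do not come from the dataset.

```matlab
Y = [21.0 21.0 22.8 21.4 18.7 18.1]';
X = [1     1     1     1     1     1;     ...
     110   110   93    110   175   105;   ...
     2.620 2.875 2.320 3.215 3.440 3.460; ...
     1     1     1     0     0     0]';
beta = X \ Y;

% Hypothetical automatic cars with 110 horsepower, differing by 1 unit of
% weight (1000 lbs). Row layout: [constant horsepower weight transmission]
carLight = [1 110 2.5 0];
carHeavy = [1 110 3.5 0];

% Difference in predicted MPG for a 1-unit increase in weight
diffWeight = carHeavy*beta - carLight*beta

% The lighter car again, but with a manual transmission
carManual = [1 110 2.5 1];
diffTrans = carManual*beta - carLight*beta
```

Here `diffWeight` should reproduce the weight estimate and `diffTrans` the transmission estimate, because each comparison changes one predictor by exactly 1 while everything else is held fixed.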

Finally, to compute the standard errors we first need an estimate of the error variance

In [6]:
n      = size(X,1);
p      = size(X,2);
sigma2 = (E'*E) / (n-p)

which can then be used to construct the variance-covariance matrix of the parameter estimates and extract the standard errors

In [9]:
covBeta = sigma2 * inv(X'*X);
SE      = sqrt(diag(covBeta))

which we can add to the table from above

| Effect       | Estimate | StdErr |
| ------------ | -------- | ------ |
| Constant     | 36.700   | 9.739  |
| Horsepower   | -0.005   | 0.024  |
| Weight       | -4.945   | 3.075  |
| Transmission | -1.715   | 2.442  |

providing both our estimates of the parameter values and their uncertainty.
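For reference, the two code cells above can be written in equation form. The hat notation is an assumption made here for clarity and may differ from the notation used in the previous section.

$$
\hat{\sigma}^{2} = \frac{\mathbf{E}^{\prime}\mathbf{E}}{n - p},
\qquad
\widehat{\text{Cov}}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^{2}\left(\mathbf{X}^{\prime}\mathbf{X}\right)^{-1},
\qquad
\text{SE}(\hat{\beta}_{j}) = \sqrt{\left[\widehat{\text{Cov}}(\hat{\boldsymbol{\beta}})\right]_{jj}}
$$

where the subscript $jj$ picks out the diagonal element belonging to $\hat{\beta}_{j}$. In this example $n = 6$ and $p = 4$, so the sum of squared errors is divided by only 2. Using the rounded errors shown earlier, $\mathbf{E}^{\prime}\mathbf{E} \approx 3.16$, which gives $\hat{\sigma}^{2} \approx 1.58$.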

## Inference
As indicated in the previous section, to make decisions about the estimates we would typically divide each estimate by its standard error to form a *t*-statistic

In [11]:
t = beta ./ SE

We can then compare each *t*-statistic with the null *t*-distribution with $n-p$ degrees of freedom in order to calculate (one-tailed) *p*-values

In [17]:
1 - tcdf(abs(t),n-p)

We can add these to our table to complete the information we need for this analysis

| Effect       | Estimate | StdErr | *t*    | *p*       |
| ------------ | -------- | ------ | ------ | --------- |
| Constant     | 36.700   | 9.739  | 3.769  | **0.032** |
| Horsepower   | -0.005   | 0.024  | -0.205 | 0.428     |
| Weight       | -4.945   | 3.075  | -1.608 | 0.125     |
| Transmission | -1.715   | 2.442  | -0.702 | 0.278     |

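Based on this information, and using a threshold of $p < 0.05$, only the constant reaches significance. Note that `1 - tcdf(abs(t),n-p)` returns one-tailed *p*-values, so this conclusion does not carry over to a two-tailed test. With only 6 cars and 2 degrees of freedom, the analysis also has very little power to detect effects of horsepower, weight or transmission.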
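The difference between one- and two-tailed *p*-values can be sketched as follows. The snippet takes the rounded *t*-statistics from the table as input, and the names `pOne` and `pTwo` are illustrative. It assumes `tcdf` is available, which requires the Statistics and Machine Learning Toolbox in MATLAB or the statistics package in Octave.

```matlab
% t-statistics and degrees of freedom taken from the table above
t  = [3.769; -0.205; -1.608; -0.702];
df = 6 - 4;

% One-tailed p-values, computed as in the text
pOne = 1 - tcdf(abs(t), df)

% Two-tailed p-values: probability of a |t| at least this large in
% either tail of the null distribution
pTwo = 2 * tcdf(-abs(t), df)
```

Because the *t*-distribution is symmetric, each two-tailed value is twice the one-tailed value. For the constant this gives roughly 0.064 rather than 0.032.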