# The GLM: A Worked Example
The previous section covered the theory of the GLM in detail, but largely in the abstract and with only minimal examples of applications. In this section, we make the framework more concrete by analysing a simple, non-imaging dataset.

## Dataset
For this example, we are looking at the fuel efficiency, in miles per gallon (MPG), of a selection of cars. The question of interest is how *horsepower*, *weight* and *transmission type* are associated with the fuel efficiency of a car. To keep this example contained, we have limited our sample to only 6 cars. The dataset is shown below, with weight recorded in units of 1000 lbs.

| MPG  | Horsepower | Weight (1000 lbs) | Transmission |
| ---- | ---------- | ----------------- | ------------ |
| 21.0 | 110        | 2.620             | Manual       |
| 21.0 | 110        | 2.875             | Manual       |
| 22.8 | 93         | 2.320             | Manual       |
| 21.4 | 110        | 3.215             | Automatic    |
| 18.7 | 175        | 3.440             | Automatic    |
| 18.1 | 105        | 3.460             | Automatic    |

## Specifying the Outcome Vector
For this example, `MPG` is the outcome variable and so forms the vector $\mathbf{y}$.

In [3]:
y = [21.0 21.0 22.8 21.4 18.7 18.1]'

## Building the Design Matrix
For the design matrix, we need a column of 1s for the constant, followed by `horsepower`, `weight` and then a dummy variable for `transmission`, coded 1 for manual and 0 for automatic.

In [6]:
X = [1     1     1     1     1     1;     ...
     110   110   93    110   175   105;   ...
     2.620 2.875 2.320 3.215 3.440 3.460; ...
     1     1     1     0     0     0]'
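
The dummy column does not have to be typed by hand. As a minimal sketch, assuming the transmission labels are held in a cell array (the names `horsepower`, `weight`, `transmission` and `isManual` are illustrative and not part of the original analysis), the same design matrix can be assembled from named column vectors:

```matlab
% Illustrative variable names; values copied from the data table above
horsepower   = [110 110 93 110 175 105]';
weight       = [2.620 2.875 2.320 3.215 3.440 3.460]';   % in units of 1000 lbs
transmission = {'Manual'; 'Manual'; 'Manual'; 'Automatic'; 'Automatic'; 'Automatic'};

% Dummy variable: 1 for manual, 0 for automatic
isManual = double(strcmp(transmission, 'Manual'));

% Columns: constant, horsepower, weight, transmission dummy
X = [ones(numel(horsepower),1) horsepower weight isManual]
```

Building the matrix this way keeps the coding of the dummy variable explicit, which matters later when interpreting the sign of its parameter.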

Taking both the outcome and design matrix together, our GLM currently has the form

$$
\begin{bmatrix}
21.0 \\ 
21.0 \\
22.8 \\
21.4 \\
18.7 \\
18.1
\end{bmatrix}
=
\begin{bmatrix}
1 & 110 & 2.620 & 1 \\
1 & 110 & 2.875 & 1 \\
1 & 93  & 2.320 & 1 \\
1 & 110 & 3.215 & 0 \\
1 & 175 & 3.440 & 0 \\
1 & 105 & 3.460 & 0
\end{bmatrix}
\begin{bmatrix}
\beta_{0} \\
\beta_{1} \\
\beta_{2} \\
\beta_{3}
\end{bmatrix}
+
\begin{bmatrix}
\epsilon_{1} \\ 
\epsilon_{2} \\
\epsilon_{3} \\
\epsilon_{4} \\
\epsilon_{5} \\
\epsilon_{6}
\end{bmatrix}
$$

So, we have our *outcome* in $\mathbf{y}$ and our *predictors* arranged in $\mathbf{X}$, but we do not know the values of the *parameters* in $\boldsymbol{\beta}$ or the *errors* in $\boldsymbol{\epsilon}$.

## Parameter Estimation
We can estimate the parameters using the equation given in the previous section, derived from the method of maximum likelihood

In [9]:
beta = inv(X'*X)*X'*y

These are therefore the values that make the current data most probable. Once we have the parameter estimates we can calculate the predicted values of $\mathbf{y}$. This is denoted $\hat{\mathbf{y}}$ and is pronounced "y-hat".

In [12]:
yhat = X*beta

These values are the points on the fitted regression surface corresponding to each car's combination of predictor values. For instance, focusing on the first estimate, the model predicts that any car with 110 horsepower that weighs 2,620 lbs and has a manual transmission will achieve 21.4985 MPG.
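
As an aside, explicitly inverting $\mathbf{X}'\mathbf{X}$ mirrors the estimation equation, but in practice MATLAB's backslash operator solves the same least-squares problem without forming the inverse. The sketch below redefines `X` and `y` so that it runs on its own, compares the two approaches, and reproduces the prediction for the first car. The names `betaInv`, `betaBackslash`, `maxDiff` and `yhat1` are illustrative.

```matlab
y = [21.0 21.0 22.8 21.4 18.7 18.1]';
X = [1     1     1     1     1     1;     ...
     110   110   93    110   175   105;   ...
     2.620 2.875 2.320 3.215 3.440 3.460; ...
     1     1     1     0     0     0]';

betaInv       = inv(X'*X)*X'*y;   % textbook formula, as used in this section
betaBackslash = X \ y;            % least-squares solve without an explicit inverse

% The two solutions should agree up to floating-point error
maxDiff = max(abs(betaInv - betaBackslash))

% Predicted MPG for the first car (110 hp, 2,620 lbs, manual)
yhat1 = X(1,:) * betaBackslash
```

For a problem this small the two routes are interchangeable; the backslash form is mainly worth knowing for larger or poorly conditioned design matrices.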

These estimates can be used to calculate the errors

In [14]:
e = y - yhat

which represent the degree to which each model prediction diverges from the original data. We now have all the elements needed to complete the GLM equation

$$
\begin{bmatrix}
21.0 \\ 
21.0 \\
22.8 \\
21.4 \\
18.7 \\
18.1
\end{bmatrix}
=
\begin{bmatrix}
1 & 110 & 2.620 & 1 \\
1 & 110 & 2.875 & 1 \\
1 & 93  & 2.320 & 1 \\
1 & 110 & 3.215 & 0 \\
1 & 175 & 3.440 & 0 \\
1 & 105 & 3.460 & 0
\end{bmatrix}
\begin{bmatrix}
36.700 \\
-0.005 \\
-4.945 \\
-1.715
\end{bmatrix}
+
\begin{bmatrix}
-0.499 \\
\hphantom{-}0.763 \\
-0.264 \\
\hphantom{-}1.129 \\
-0.145 \\
-0.984
\end{bmatrix}
$$
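
Two quick sanity checks on the completed equation are sketched below, with `X` and `y` redefined so the block is self-contained. First, the predictions and errors should sum back to the data. Second, the errors from a least-squares fit are orthogonal to every column of the design matrix, so `X'*e` should be zero up to floating-point error. The names `reconErr` and `orthErr` are illustrative, and the backslash solve is used here as a stand-in for `inv(X'*X)*X'*y`.

```matlab
y = [21.0 21.0 22.8 21.4 18.7 18.1]';
X = [1     1     1     1     1     1;     ...
     110   110   93    110   175   105;   ...
     2.620 2.875 2.320 3.215 3.440 3.460; ...
     1     1     1     0     0     0]';

beta = X \ y;      % least-squares estimates
yhat = X*beta;     % predicted values
e    = y - yhat;   % errors

% Both quantities should be zero up to floating-point error
reconErr = max(abs(y - (yhat + e)))   % data = prediction + error
orthErr  = max(abs(X'*e))             % errors orthogonal to each column of X
```

Orthogonality to the constant column means the errors sum to zero overall, and orthogonality to the dummy column means they also sum to zero within the manual cars, which can be seen in the rounded values above.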

## Interpreting the Parameters

In terms of interpreting the parameters, it is helpful to first put them in a table

| Effect       | Estimate | 
| ------------ | -------- |
| Constant     | 36.700   |
| Horsepower   | -0.005   |
| Weight       | -4.945   |
| Transmission | -1.715   |

The interpretation is then that 
- An increase in `horsepower` of 1 is associated with a decrease in `MPG` of 0.005, holding `weight` and `transmission` constant
- An increase in `weight` of 1000 lbs is associated with a decrease in `MPG` of 4.945, holding `horsepower` and `transmission` constant
- Manual cars (coded 1) are predicted to achieve 1.715 fewer `MPG` than automatic cars (coded 0) with the same `horsepower` and `weight`
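
As a worked illustration of the `weight` estimate, consider the first two cars in the dataset. Both are manual and both have 110 horsepower, so their predicted values differ only through `weight`. Using the rounded estimate from the table, the difference in predicted `MPG` is

$$
\hat{y}_{2} - \hat{y}_{1} = \beta_{2}(2.875 - 2.620) = -4.945 \times 0.255 \approx -1.261
$$

so the heavier of the two cars is predicted to achieve roughly 1.26 fewer miles per gallon.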

While at first it seems that `weight` has the largest effect on `MPG`, these raw estimates are difficult to compare because of their differing units (e.g. 1 horsepower vs 1000 lbs), and because we have not yet taken the uncertainty of the estimates into account. To quantify that uncertainty, we can calculate the standard errors, starting with the estimate of the error variance

In [17]:
n      = size(X,1);
p      = size(X,2);
sigma2 = (e'*e) / (n-p)

which can then be used to construct the variance-covariance matrix of the parameter estimates and extract the standard errors

In [20]:
covBeta = sigma2 * inv(X'*X);
SE      = sqrt(diag(covBeta))

We can then add these estimates to the table from above

| Effect       | Estimate | SE     |
| ------------ | -------- | ------ |
| Constant     | 36.700   | 9.739  |
| Horsepower   | -0.005   | 0.024  |
| Weight       | -4.945   | 3.075  |
| Transmission | -1.715   | 2.442  |

providing both our estimates of the parameter values and their uncertainty.

## Inference
As indicated in the previous section, to make decisions about the estimates we would typically divide each estimate by its standard error to form a *t*-statistic

In [23]:
t = beta ./ SE

which we can then use to calculate *p*-values by querying the null *t*-distribution with $n-p$ degrees of freedom

In [25]:
pvals = 1 - tcdf(abs(t),n-p); % one-tailed p-value
pvals = pvals .* 2            % convert to two-tailed p-values

These can be added to our table to complete all the information we need for this analysis

| Effect       | Estimate | SE     | *t*    | *p*   |
| ------------ | -------- | ------ | ------ | ----- |
| Constant     | 36.700   | 9.739  | 3.769  | 0.064 |
| Horsepower   | -0.005   | 0.024  | -0.205 | 0.857 |
| Weight       | -4.945   | 3.075  | -1.608 | 0.249 |
| Transmission | -1.715   | 2.442  | -0.702 | 0.555 |

Based on this information, we would conclude that none of the parameters are significantly different from zero. So although it appeared as if `weight` would be the most relevant predictor of `MPG`, according to this analysis there is not enough evidence to rule out the possibility of *no* relationship with `MPG` for any of the predictor variables. Remember that this conclusion is largely governed by the degree of *uncertainty* in the parameter estimates, due to the small sample. It is likely that, with more data, the standard errors would decrease and a clearer picture of the relationship between these variables and `MPG` would emerge.
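
As a cross-check on the *p*-values that does not require `tcdf`, note that with $n - p = 2$ degrees of freedom the *t*-distribution has a simple closed-form CDF, $F(t) = \frac{1}{2} + \frac{t}{2\sqrt{t^{2} + 2}}$, so the two-tailed *p*-value reduces to $1 - |t| / \sqrt{t^{2} + 2}$. This shortcut is specific to 2 degrees of freedom and is offered only as a sketch. The `t` values are copied from the rounded table, so agreement should only be expected to about three decimal places, and the names `pClosed`, `pTable` and `maxDiff` are illustrative.

```matlab
% t-statistics copied from the table (constant, horsepower, weight, transmission)
t = [3.769; -0.205; -1.608; -0.702];

% Two-tailed p-values from the closed-form t CDF, valid only for 2 degrees of freedom
pClosed = 1 - abs(t) ./ sqrt(t.^2 + 2)

% Reported p-values, for comparison
pTable  = [0.064; 0.857; 0.249; 0.555];
maxDiff = max(abs(pClosed - pTable))
```

With more degrees of freedom no such shortcut exists, which is why `tcdf` is used in the general case.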

## Using `fitlm`
We can confirm the manual calculations given above by using the `fitlm` function from the MATLAB [Statistics and Machine Learning Toolbox](https://uk.mathworks.com/products/statistics.html). We pass the predictor variables as the first argument (omitting the constant column, because `fitlm` adds an intercept by default) and the outcome variable as the second argument. The coefficient table it returns should match the values given above.

In [27]:
Predictors = X(:,2:4);           % predictors without the constant column
Model      = fitlm(Predictors,y)