# Multiple Linear Regression {#sec-multiple-linear-regression}

## <a name="overview"></a> Overview

[Chapter @sec-linear-regression] discussed how to develop and evaluate linear regression models that involve just one predictor.
That is, we assumed models of the form

$$y = w_0 + w_1x + \epsilon \tag{1}$$

In this chapter, we extend this model by introducing more independent variables, i.e., we consider models of the form

$$y = w_0 + w_1x_1+ \dots + w_{m}x_m + \epsilon \tag{2}$$

Models of this form are referred to as multiple linear regression models.



## Multiple linear regression

A multiple linear regression model assumes that the conditional expectation of a response

$$E\left\{Y \mid X^{(1)} = x^{(1)}, \dots, X^{(k)} = x^{(k)}\right\} = \beta_0 + \beta_1 x^{(1)} + \dots + \beta_k x^{(k)} \tag{11.11}$$

is a linear function of the predictors $x^{(1)}, \dots, x^{(k)}$. Here $\beta_0, \dots, \beta_k$ play the role of $w_0, \dots, w_m$ in equation (2), with $k = m$.
This regression model has one intercept and a total of $k$ slopes, and therefore it defines a
$k$-dimensional regression plane in the $(k+1)$-dimensional space of $(X^{(1)}, \dots, X^{(k)}, Y)$.
The intercept $\beta_0$ is the expected response when all predictors equal zero.

Each regression slope $\beta_j$ is the expected change of the response $Y$ when the corresponding
predictor $X^{(j)}$ changes by 1 while all the other predictors remain constant.

In order to estimate all the parameters of model (11.11), we collect a sample of $n$ multivariate
observations.


We again consider the sum of squared errors, $SSE$:

----

**Sum of Squared Errors**
$$SSE = \sum_i (\hat{y}_i - y_i)^2$$

----

To minimize $SSE$, we can again take its partial derivatives with respect to all the unknown
parameters, set them to zero, and solve the resulting system of equations. The system can be conveniently written in
matrix form, which requires basic knowledge of linear algebra (if needed, refer to the Appendix,
Section 12.4, of [1]).
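As a minimal sketch of the matrix formulation, the example below assumes a design matrix with a leading column of ones (for the intercept) and lets NumPy solve the least squares problem; the resulting coefficients satisfy the normal equations $A^T A w = A^T y$. The function name `fit_least_squares` and the synthetic data are illustrative assumptions, not part of the chapter.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return the coefficients (intercept first) that minimize SSE."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    A = np.column_stack([np.ones(len(y)), X])
    # lstsq minimizes ||A w - y||^2 without forming an explicit matrix inverse.
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

# Synthetic, noise-free data from y = 1 + 2*x1 - 3*x2 (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]
coef = fit_least_squares(X, y)
```

Because the data are generated without noise, the recovered coefficients should match the generating ones up to floating-point error.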

## <a name="ekf"></a> Model evaluation

Chapter @sec-linear-regression introduced the simple linear regression model, and the previous section
took this model a step further by adding more predictors to it. We now examine how to evaluate linear regression models that
depend on multiple predictors. In addition, we discuss methods for deciding which predictors, and how many of them,
to use in our model. This is very important, as adding more predictors to the model may lead to overfitting and therefore
to poor predictive power.


### Adjusted $R^2$ criterion

Linear regression models can be evaluated using the $R^2$ criterion. The criterion describes the proportion
of the total variance that is explained by the regression model. It can be computed according to

\begin{equation}
R^2 = \frac{SS_{REG}}{SS_{TOT}}
\end{equation}

However, the problem with $R^2$ is that it never decreases as we add more predictors to the model, i.e., it tells us that
the model has improved, which may not actually be the case. This is because $SS_{TOT}$ does not depend on the model, whereas $SS_{REG}$
can only grow (equivalently, $SS_{ERR}$ can only shrink) when a predictor is added. Therefore, this metric is not a fair criterion when we compare models with different numbers of
predictors. Indeed, adding irrelevant predictors should be penalized, whereas $R^2$ can only reward it.


The adjusted $R^2$ criterion is a better approach we can use in order to measure the goodness-of-fit of the model [1].
It is given by the following equation:

----
**Adjusted $R^2$ criterion**

\begin{equation}
R^{2}_{adjusted} = 1 - \frac{SS_{ERR}/df_{ERR}}{SS_{TOT}/df_{TOT}}
\end{equation}


----

This criterion only improves if the predictor we added to the model considerably reduces the error sum of squares.
It incorporates degrees of freedom into the formula for calculating $R^2$; for a model with $k$ predictors fitted on $n$ observations, $df_{ERR} = n - k - 1$ and $df_{TOT} = n - 1$.
This adjustment results in a penalty when a predictor that adds no predictive capability is added to the model.
Thus, using only the $R^{2}_{adjusted}$ criterion, we should choose the model with the highest adjusted $R^2$ [1].
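The following sketch compares $R^2$ and adjusted $R^2$ for two nested models, the second of which adds a predictor that is pure noise. The helper `r2_scores` and the synthetic data are assumptions made for illustration; $R^2$ is computed as $1 - SS_{ERR}/SS_{TOT}$, which equals $SS_{REG}/SS_{TOT}$ for a least squares fit with an intercept.

```python
import numpy as np

def r2_scores(X, y):
    """Return (R^2, adjusted R^2) for a least squares fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_err = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_err / ss_tot
    # Degrees of freedom: df_ERR = n - k - 1, df_TOT = n - 1.
    adj_r2 = 1.0 - (ss_err / (n - k - 1)) / (ss_tot / (n - 1))
    return r2, adj_r2

# Synthetic data: y depends on x1 only; noise_col is an irrelevant predictor.
rng = np.random.default_rng(1)
n = 40
x1 = rng.normal(size=n)
noise_col = rng.normal(size=n)
y = 2.0 + 1.5 * x1 + rng.normal(scale=0.5, size=n)

r2_small, adj_small = r2_scores(x1.reshape(-1, 1), y)
r2_big, adj_big = r2_scores(np.column_stack([x1, noise_col]), y)
print(f"one predictor:  R2={r2_small:.4f}, adjusted R2={adj_small:.4f}")
print(f"two predictors: R2={r2_big:.4f}, adjusted R2={adj_big:.4f}")
```

The plain $R^2$ of the larger model can never be lower than that of the smaller one, while the adjusted value is free to drop when the extra predictor does not pay for the degree of freedom it consumes.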



### Selecting predictors

We now know how to evaluate a linear regression model based on multiple predictors. We thus turn our attention to how to select
the predictors to include in the model. As noted above, this matters because adding more predictors may lead to overfitting and therefore to poor predictive power.


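Suppose we add predictors to a reduced model to obtain a full model. The additional explained variation is measured by the extra sum of squares $SS_{EX} = SS_{REG}(Full) - SS_{REG}(Reduced)$, with $df_{EX}$ equal to the number of added predictors. Its significance is tested by a partial F-test statistic [1]: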

----
**Partial F-test statistic**

$$F = \frac{SS_{EX}/df_{EX}}{MS_{ERR}(Full)}$$

----
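A small sketch of the partial F-statistic is given below. It relies on the identity $SS_{EX} = SS_{ERR}(Reduced) - SS_{ERR}(Full)$, which holds because $SS_{TOT}$ is the same for both models. The helpers `sse` and `partial_f` and the synthetic data are hypothetical names and values chosen for illustration.

```python
import numpy as np

def sse(X, y, cols):
    """Error sum of squares of the fit on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))

def partial_f(X, y, reduced_cols, full_cols):
    """Partial F-statistic for the predictors in full_cols but not in reduced_cols."""
    n = len(y)
    df_ex = len(full_cols) - len(reduced_cols)
    ss_err_full = sse(X, y, full_cols)
    # SS_EX equals the drop in the error sum of squares.
    ss_ex = sse(X, y, reduced_cols) - ss_err_full
    ms_err_full = ss_err_full / (n - len(full_cols) - 1)
    return (ss_ex / df_ex) / ms_err_full

# Synthetic data: columns 0 and 1 matter, column 2 is irrelevant.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(scale=0.5, size=60)

f_relevant = partial_f(X, y, reduced_cols=[0], full_cols=[0, 1])
f_irrelevant = partial_f(X, y, reduced_cols=[0, 1], full_cols=[0, 1, 2])
print(f_relevant, f_irrelevant)
```

To turn either statistic into a decision, it would be compared with a critical value of the F-distribution with $df_{EX}$ and $df_{ERR}(Full)$ degrees of freedom.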


#### Stepwise selection

The stepwise selection algorithm starts with the simplest model, which excludes all the predictors, i.e., we start with a model
of the form

$$y=w_0$$

Then, predictors enter the model sequentially, one by one. Every new predictor should make
the most significant contribution among all the predictors that have not been included yet [1].
According to this rule, the first predictor $X^{(s)}$ to enter the model is the one that has the
most significant univariate ANOVA F-statistic, that is, the largest ratio $MS_{REG}(X^{(s)})/MS_{ERR}(X^{(s)})$ from the simple regression of $Y$ on that predictor alone.


All F-tests considered at this step refer to the same F-distribution with 1 and $(n-2)$ degrees of freedom.
Therefore, the largest F-statistic implies the lowest P-value and the most significant slope
$w_s$.

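The next predictor $X^{(t)}$ to be selected is the one that makes the most significant contribution
in addition to $X^{(s)}$. Among all the remaining predictors, it should maximize the
partial F-statistic defined above, with the model containing only $X^{(s)}$ as the reduced model and the model containing both $X^{(s)}$ and $X^{(t)}$ as the full model.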


The algorithm continues until the partial F-statistic (called F-to-enter in this scheme) is not significant for any of the remaining predictors, according to a pre-selected significance level $\alpha$. The final model will have all predictors significant at this level.
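The procedure can be sketched as follows. This is a simplified illustration under stated assumptions: instead of computing P-values at level $\alpha$, it compares each F-to-enter statistic with a fixed critical value `f_to_enter` (a hypothetical parameter; 4.0 is roughly the 5% critical value of an F-distribution with 1 and about 60 degrees of freedom). The function names and synthetic data are likewise illustrative.

```python
import numpy as np

def sse(X, y, cols):
    """Error sum of squares of the fit on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))

def forward_selection(X, y, f_to_enter=4.0):
    """Add predictors one by one while the best F-to-enter reaches the threshold."""
    n, num_predictors = X.shape
    selected = []                       # start from the intercept-only model
    remaining = list(range(num_predictors))
    while remaining:
        ss_reduced = sse(X, y, selected)
        best_f, best_j = -np.inf, None
        for j in remaining:
            cols = selected + [j]
            ss_full = sse(X, y, cols)
            # One predictor is added, so df_EX = 1.
            f_stat = (ss_reduced - ss_full) / (ss_full / (n - len(cols) - 1))
            if f_stat > best_f:
                best_f, best_j = f_stat, j
        if best_f < f_to_enter:
            break                       # no remaining predictor is significant
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Synthetic data: only columns 0 and 2 influence the response.
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 4))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=80)
selected = forward_selection(X, y)
print(selected)
```

With a fixed threshold, an irrelevant predictor can still enter by chance, which mirrors the type I error rate of the underlying F-test.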


#### Backward elimination

The backward elimination algorithm works in the direction opposite to stepwise selection.
It starts with the full model that contains all possible predictors. Predictors are removed from the model sequentially, one by one, starting with the least
significant predictor, until all the remaining predictors are statistically significant.
Significance is again determined by a partial F-test. In this scheme, the statistic is called F-to-remove. The first predictor to be removed is the one that minimizes the F-to-remove statistic.
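A minimal sketch of backward elimination is shown below. As a simplifying assumption, significance is judged against a fixed critical value `f_to_remove` (a hypothetical parameter) rather than a P-value at level $\alpha$; the function names and synthetic data are illustrative only.

```python
import numpy as np

def sse(X, y, cols):
    """Error sum of squares of the fit on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))

def backward_elimination(X, y, f_to_remove=4.0):
    """Drop the least significant predictor until all F-to-remove values reach the threshold."""
    n, num_predictors = X.shape
    kept = list(range(num_predictors))  # start from the full model
    while kept:
        ss_full = sse(X, y, kept)
        ms_err_full = ss_full / (n - len(kept) - 1)
        # F-to-remove for j: partial F of the current model against the model without j.
        f_stats = {
            j: (sse(X, y, [c for c in kept if c != j]) - ss_full) / ms_err_full
            for j in kept
        }
        worst = min(f_stats, key=f_stats.get)
        if f_stats[worst] >= f_to_remove:
            break                       # every remaining predictor is significant
        kept.remove(worst)
    return kept

# Synthetic data: only columns 0 and 2 influence the response.
rng = np.random.default_rng(4)
X = rng.normal(size=(80, 4))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=80)
kept = backward_elimination(X, y)
print(kept)
```

The sketch requires more observations than predictors (so that the error degrees of freedom of the full model are positive).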



Both sequential model selection schemes, stepwise selection and backward elimination, involve fitting
at most $K$ models, where $K$ is the number of candidate predictors. This requires much less computing power than the adjusted $R^2$ method,
where all $2^K$ models are considered.

## Summary

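In this chapter we discussed the multiple linear regression model. This is an extension of the simple linear regression model
we saw in chapter @sec-linear-regression. We also looked at the adjusted $R^2$ criterion for evaluating such models and at the partial F-test, stepwise selection, and backward elimination for choosing predictors.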

## References

1. Michael Baron, _Probability and Statistics for Computer Scientists_, 2nd Edition, CRC Press.