# Variable selection and model selection

The main feature of multiple linear regression is that it uses several explanatory variables together. Therefore, the natural question is: which variables should I use to build the best possible model for my objective? This question leads us to introduce model evaluation criteria and variable selection methods.

## What will you learn in this course? 🧐🧐

This course will focus on teaching you evaluation methods for multiple linear regression models.

* Evaluation of multiple linear regression models
    * Analysis of Variance (ANOVA)
    * F-Statistics by Fisher
    * $R^{2}$ (R square)
    * $R^2_{adjusted}$
    * P-values
* Model selection
    * Step by step methods
* Final remarks



## Evaluation of multiple linear regression models 💯

Some of the evaluation criteria presented below may be used for models other than multiple linear regression. It is therefore all the more important to introduce them now and to remember their respective interpretations.


### Analysis of Variance (ANOVA)

The analysis of variance quantifies the performance of a statistical model in terms of estimation error. The quantities discussed below are the building blocks of other performance metrics:

* SST: Sum of Squares Total is an indicator of the dispersion of the values of the target variable $Y$ (whose values are noted $y_{1}, ..., y_{n}$) over the population considered. It is written mathematically as:

$$
SST = \sum_{i=1}^{n}(y_{i}-\bar{y})^2
$$

It is the sum of the squared deviations from the mean of the target variable $Y$ for the $n$ observations considered.

* SSE: Sum of Squares Explained is an indicator of the amount of dispersion of the target variable that is explained by the model. It is defined as:

$$
SSE = \sum_{i=1}^{n}(\hat{y}_{i}-\bar{y})^2
$$

It is the sum of the squared differences between the model estimates $\hat{y}_{i}$ for each observation and the mean of the target variable over the population of interest. In statistics, variation is information: you cannot differentiate samples if they are all described by the exact same set of values.

* SSR: Sum of Squared Residuals is an indicator that quantifies the error made by the model, in other words the portion of the dispersion of the target variable that is not explained by the model, hence the idea of residual. Its formula is as follows:

$$
SSR =\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^2
$$

It is essential to understand these values because they will allow us to build all the evaluation metrics for the multiple linear regression model we will see now.

To summarize, SST (represented in red in the figure below) is proportional to the total variance of the target variable. For a least-squares model with an intercept, it decomposes into two components, $SST = SSE + SSR$. SSE is the part explained by the model: it is proportional to the variance of our estimates around the actual mean of the observed population. SSR (represented in blue in the figure below) is the sum of the squares of the differences between our estimates and the actual values of the target variable. In other words, SST is the total amount of information, SSE is the information explained by the model, and SSR is the information that remains to be explained, or the error made.


![SSR_SST](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/SSR_SST.png)
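
The three quantities above can be computed by hand on a small example. The sketch below is only an illustration: the data are synthetic, the coefficients are arbitrary, and the model is fitted by ordinary least squares with NumPy, assuming an intercept is included. It also checks numerically that the total dispersion splits into an explained part and a residual part.

```python
import numpy as np

# Synthetic data (hypothetical): 50 observations, 2 explanatory variables.
rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

# Ordinary least squares fit with an intercept column.
X_design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ beta
y_bar = y.mean()

sst = np.sum((y - y_bar) ** 2)      # total dispersion of the target
sse = np.sum((y_hat - y_bar) ** 2)  # dispersion explained by the model
ssr = np.sum((y - y_hat) ** 2)      # residual (unexplained) dispersion

print(f"SST = {sst:.3f}, SSE = {sse:.3f}, SSR = {ssr:.3f}")
print("SST equals SSE + SSR:", np.isclose(sst, sse + ssr))
```

The equality holds here because the model is a least-squares fit that contains an intercept; without an intercept the decomposition is not guaranteed.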


### $R^{2}$ (R square)

$R^{2}$, or R-squared, is a statistic that quantifies the explanatory power of the model with respect to the target variable.

$$
R^{2} = 1-\frac{SSR}{SST}
$$

$R^2$ never decreases when an explanatory variable is added to the model. For a least-squares model with an intercept, evaluated on the data used to fit it, $R^2$ varies between 0 and 1. If the model is not very relevant, the sum of squared residuals $SSR$ will be close to the total sum of squares $SST$ and $R^{2}$ will be closer to 0. Conversely, if the model explains the target variable faithfully, then $SSR$ will be closer to 0 and $R^2$ will be closer to 1. So mechanically, with each variable added to the model, the fit to the target variable $Y$ on the training data is at least as good and $R^2$ is at least as high, even when the new variable is irrelevant. As a result, $R^2$ only allows a fair comparison between two models that have the same number of explanatory variables.
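
To see this mechanical effect, here is a minimal sketch on synthetic data (the variable names and the helper `r_squared` are hypothetical, introduced only for this illustration). One useful variable is fitted first, then pure-noise columns are added one at a time, and $R^2$ is recomputed on the training data after each addition.

```python
import numpy as np

def r_squared(X, y):
    """Training R^2 of an OLS fit with intercept: 1 - SSR / SST."""
    X_design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    ssr = np.sum((y - X_design @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - ssr / sst

rng = np.random.default_rng(42)
n = 100
x_useful = rng.normal(size=n)
y = 3.0 * x_useful + rng.normal(size=n)
noise = rng.normal(size=(n, 5))  # columns unrelated to y

r2_values = []
for k in range(6):
    # One useful variable plus the first k noise columns.
    X = np.column_stack([x_useful, noise[:, :k]])
    r2_values.append(r_squared(X, y))

for k, r2 in enumerate(r2_values):
    print(f"{k} noise variables added: R^2 = {r2:.4f}")
```

The printed values should never go down, even though the added columns carry no information about the target. This is why a higher training $R^2$ alone does not show that a larger model is better.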

## Model selection 🤔🤔

When we have $p$ explanatory variables at our disposal, the number of models that can be built is counted as follows: for each explanatory variable, we can build a model with or without it. Applying this reasoning to all the explanatory variables, we can potentially build $2^p$ models. In practice, when the number $p$ of explanatory variables is large, we cannot reasonably explore all $2^p$ models in order to select the best one. Different methods exist that avoid this brute-force search.
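
The counting argument can be checked on a toy case. The sketch below uses three hypothetical variable names and enumerates every subset with the standard library, then prints how quickly $2^p$ grows.

```python
from itertools import combinations

def all_subsets(variables):
    """Return every subset of the variables, including the empty one."""
    subsets = []
    for size in range(len(variables) + 1):
        subsets.extend(combinations(variables, size))
    return subsets

variables = ["x1", "x2", "x3"]  # hypothetical variable names
subsets = all_subsets(variables)
print(len(subsets))  # 2 ** 3 = 8 candidate models

# The number of candidate models doubles with each extra variable.
for p in (10, 20, 30):
    print(p, 2 ** p)
```

With 30 variables there are already more than a billion candidate models, which is why an exhaustive search quickly becomes impractical.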

### Step by step

Step-by-step selection comes in three variants:

* Forward selection: the variables are added one by one to the model, selecting at each step the one that maximises the score on the test set ($R^2$ in the case of linear regression). The procedure stops when all the variables are used or when the score on the test set starts decreasing.
* Backward elimination: this time we start with a model using all the explanatory variables. At each step, the variable with the highest p-value for the Student's t-test on its coefficient is eliminated from the model. The procedure stops when all the remaining variables have p-values lower than a threshold, set by default at 0.05 (but which can be adapted according to the precision needs of the problem considered). This technique is difficult to implement using scikit-learn since the p-values have to be calculated by hand.
* Stepwise: this algorithm alternates between a selection step and an elimination step after each addition of a variable, in order to remove any variables that have become less relevant in the presence of those that have been added. For example, if after adding the most useful variable according to the test score criterion one of the previously selected variables becomes non-significant, we remove it and proceed with a new forward selection step.
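
As an illustration of the forward variant described above, here is a minimal sketch written with NumPy only. It is one possible reading of the procedure, not a reference implementation: the helper names (`fit_ols`, `r2_score`, `forward_selection`), the synthetic data and the 150/50 train/test split are all assumptions made for the example.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) for the given columns."""
    X_design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta

def r2_score(X, y, beta):
    """R^2 of a fitted model on the data (X, y): 1 - SSR / SST."""
    X_design = np.column_stack([np.ones(len(y)), X])
    ssr = np.sum((y - X_design @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - ssr / sst

def forward_selection(X_train, y_train, X_test, y_test):
    """Add variables one by one while the test R^2 keeps improving."""
    remaining = list(range(X_train.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        # Score each candidate variable added to the current selection.
        scores = []
        for j in remaining:
            cols = selected + [j]
            beta = fit_ols(X_train[:, cols], y_train)
            scores.append((r2_score(X_test[:, cols], y_test, beta), j))
        score, best_j = max(scores)
        if score <= best_score:
            break  # the test score stopped improving
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = score
    return selected, best_score

# Synthetic data (hypothetical): only columns 0 and 3 influence y.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(size=200)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

selected, best_score = forward_selection(X_train, y_train, X_test, y_test)
print("selected columns:", selected)
print(f"test R^2: {best_score:.3f}")
```

On data like these, the two informative columns should be picked first. Pure-noise columns may still be added afterwards if they happen to improve the test score slightly, which is one reason the selected set depends on the particular train/test split.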

### Final Notes

The model evaluation and selection methods introduced above apply to linear models in general, and the same ideas carry over to logistic regression, which we will discuss later.

## Resources 📚📚

* How do I interpret R Squared - http://bit.ly/2pP83Eb
* Adjusted R Squared - http://bit.ly/2qqz55b