# Chapter 6 - Linear Model Selection and Regularization

When we build a model we can use many predictors, but we may not know in advance which of them are most relevant to the predictions we are interested in. There are three approaches to selecting models and predictors: subset selection, shrinkage (also known as regularization), and dimension reduction.

## Subset selection
Subset selection consists of systematically trying different subsets of the predictors, adding or removing them, and then comparing the accuracy of the resulting models, for example by comparing their residual sum of squares (RSS) on the validation set

$$RSS = \sum_{i=1}^{n}(y_i - \hat{f}(x_i))^2$$

where

$$\hat{f}(x_1,\ldots,x_p) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j$$

or with other techniques such as the [Akaike Information Criterion](https://en.wikipedia.org/wiki/Akaike_information_criterion), which penalize the number of parameters and so allow us to compare models without holding out a validation set, using all the available observations to train them.
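
As a minimal sketch of the validation-set comparison described above, the following code fits a least squares model on every subset of the predictors and keeps the subset with the lowest validation RSS. It assumes NumPy is available; the data are synthetic and invented for illustration, and `fit_ols`, `rss` and `best_subset` are hypothetical helper names, not part of any library.

```python
from itertools import combinations

import numpy as np


def fit_ols(X, y):
    """Least squares fit with an intercept; returns [beta_0, beta_1, ..., beta_k]."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def rss(X, y, beta):
    """Residual sum of squares of the fitted model on the data (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    resid = y - A @ beta
    return float(resid @ resid)


def best_subset(X_train, y_train, X_val, y_val):
    """Try every subset of columns and return (subset, validation RSS) of the best one."""
    p = X_train.shape[1]
    best = (None, np.inf)
    for k in range(0, p + 1):  # k = 0 is the intercept-only model
        for subset in combinations(range(p), k):
            cols = list(subset)
            beta = fit_ols(X_train[:, cols], y_train)
            score = rss(X_val[:, cols], y_val, beta)
            if score < best[1]:
                best = (subset, score)
    return best


# Synthetic data (an assumption for illustration): only predictors 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

subset, val_rss = best_subset(X_train, y_train, X_val, y_val)
print("selected predictors:", subset, "validation RSS:", val_rss)
```

The exhaustive search fits $2^p$ models, so it is only practical when the number of predictors is small.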

## Shrinkage (Regularization)
The shrinkage method consists of putting a constraint on the values that the coefficients $\beta_j$ can take. In a linear model the coefficients $\beta_j$ are estimated with the least squares method, by minimizing the residual sum of squares, that is by solving

$$\frac{\partial RSS}{\partial \beta_j} = 0$$

for every $j$. We can add a term to RSS that lets us tune the values of the $\beta_j$ coefficients. The most common way is to add a penalty term proportional to the sum of the squared coefficients so that, instead of RSS, the function to be minimized is

$$RSS + \lambda \sum_{j=1}^{p}\beta_j^2$$

We can then use the tuning parameter $\lambda \geq 0$ to shrink the values of the coefficients towards zero: the larger $\lambda$ is, the stronger the shrinkage, while $\lambda = 0$ gives back the least squares estimates. The technique of adding to the expression to be minimized a term with the squared coefficients, that is the squared L2 norm of the $\beta$ vector, is called Ridge Regression or [Tikhonov Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)).
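
The effect of $\lambda$ can be sketched with a few lines of NumPy. The code below assumes centered data, so that the unpenalized intercept $\beta_0$ can be recovered separately, and uses the closed-form minimizer $(X^T X + \lambda I)^{-1} X^T y$ of the penalized RSS. The function name `ridge_fit` and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Minimize RSS + lam * sum(beta_j^2); the intercept beta_0 is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centering removes the intercept from the problem
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta
    return beta0, beta


# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.5, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

for lam in [0.0, 10.0, 1000.0]:
    beta0, beta = ridge_fit(X, y, lam)
    # The coefficients get smaller as lam grows, but none becomes exactly zero.
    print(lam, np.round(beta, 3))
```

In practice the predictors are usually standardized first, because the penalty depends on the scale of each $\beta_j$.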

Ridge Regression does not set any coefficient exactly to zero, so it cannot be used to select the predictors. Using a different penalty term, the L1 norm of the $\beta$ vector, forces some of the coefficients to be exactly zero when $\lambda$ is large enough. The quantity to be minimized is

$$RSS + \lambda \sum_{j=1}^{p}|\beta_j|$$

The technique is called Lasso (Least Absolute Shrinkage and Selection Operator). 
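
The Lasso objective has no closed-form solution in general. One possible way to minimize it, shown here only as a sketch, is cyclic coordinate descent with soft thresholding: each coefficient is updated in turn while the others are held fixed. The function names `soft_threshold` and `lasso_fit`, the fixed number of sweeps and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def soft_threshold(rho, t):
    """Shrink rho towards zero by t, returning exactly zero when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)


def lasso_fit(X, y, lam, n_iter=200):
    """Minimize RSS + lam * sum(|beta_j|) by cyclic coordinate descent."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # the intercept is not penalized
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: the current fit with predictor j left out.
            r_j = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ r_j
            # Exact minimizer of the objective along coordinate j.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    beta0 = y_mean - x_mean @ beta
    return beta0, beta


# Synthetic data (an assumption for illustration): predictors 1 and 3 are irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=100)

for lam in [0.0, 50.0, 1e6]:
    beta0, beta = lasso_fit(X, y, lam)
    print(lam, np.round(beta, 3), "nonzero:", int(np.count_nonzero(beta)))
```

The threshold is $\lambda / 2$ because the RSS term here carries no factor of $1/2$; libraries often scale the objective differently, so their $\lambda$ values are not directly comparable.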

## Dimension Reduction
This technique consists of mapping the predictors from the p-dimensional space to a subspace of dimension M < p, by finding a linear transformation of the predictors. The next step is to fit a linear model on the M transformed predictors. Ignoring the intercept, we can represent our linear model using a vector x for the p predictors, a scalar y for the response, and a row vector B (a $1 \times p$ matrix) for the coefficients, so that the linear model can be written as

$$y = B x$$

We can then define a linear transformation $\Phi$ of the predictors, an $M \times p$ matrix that maps x to the vector z of the M transformed predictors

$$z = \Phi x$$

We fit the linear model for the transformed predictors  

$$y = \Theta z$$

and by replacing z we have

$$y = \Theta \Phi x$$

that is 

$$B = \Theta \Phi$$
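
The relation $B = \Theta \Phi$ can be checked numerically. The sketch below assumes one particular choice of $\Phi$, the first M principal directions of the centered predictors (principal components regression); the text above does not prescribe this choice, and the synthetic data and variable names are assumptions made for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
n, p, M = 200, 5, 2
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=n)

# Center the data so that the model has no intercept, as in the text.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Phi is an M x p matrix whose rows are the first M principal directions.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Phi = Vt[:M]

# Each row of Z is z = Phi x for one observation.
Z = Xc @ Phi.T

# Fit y = Theta z by least squares on the M transformed predictors.
Theta, *_ = np.linalg.lstsq(Z, yc, rcond=None)

# Coefficients expressed in terms of the original p predictors.
B = Theta @ Phi

print("Theta:", np.round(Theta, 3))
print("B:", np.round(B, 3))
print("same fitted values:", np.allclose(Z @ Theta, Xc @ B))
```

Because B is built from only M free coefficients, it is constrained to the row space of $\Phi$ and in general differs from the unconstrained least squares coefficients.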
