### 3.3 Model selection

Although the ridge regression estimator may reduce the mean squared
error of $\hat{\beta}$, it does not directly help us to choose an
appropriate model. We may want to consider a large number of potential
covariates, but there are several reasons for preferring a smaller final
model:

1) Prediction accuracy may well be improved.

2) It is much easier to interpret a smaller model.

3) In many applications, there is an underlying belief that the 'true
model' is **sparse**.

We now review some traditional model selection strategies.

1) *Best subset regression:* For each $k\in\{0,1,\ldots,p\}$,
let $\hat{\beta}_{\left[k\right]}$ denote the value of $\beta$ with
$k$ nonzero entries that minimizes
$$
RSS(\beta)=\sum_{i=1}^{n}\left(Y_{i}-\beta_{0}-\sum_{j=1}^{p}x_{ij}\beta_{j}\right)^{2}
$$


The model size $k$ is then chosen according to some criterion.
For instance, the **Akaike information criterion** (Akaike,
1974) chooses $k$ to minimize $RSS(\hat{\beta}_{\left[k\right]})+2k$,
while the **Bayesian information criterion** (Schwarz, 1978)
uses $RSS(\hat{\beta}_{\left[k\right]})+k\log n$.
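As a rough illustration of best subset regression, the following sketch enumerates every subset of columns with NumPy and then applies the two penalized criteria exactly as written above. The helper names (`rss`, `best_subset`, `choose_k`) and the synthetic data are assumptions made for this example only; they do not come from the notes.

```python
from itertools import combinations

import numpy as np


def rss(X, y, subset):
    """RSS of the least-squares fit on an intercept plus the columns in `subset`."""
    n = len(y)
    design = np.column_stack([np.ones(n), X[:, list(subset)]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)


def best_subset(X, y):
    """For each k = 0, ..., p return (best subset of size k, its RSS)."""
    p = X.shape[1]
    best = []
    for k in range(p + 1):
        # Exhaustive search over all subsets of size k.
        subset = min(combinations(range(p), k), key=lambda s: rss(X, y, s))
        best.append((subset, rss(X, y, subset)))
    return best


def choose_k(best, n, penalty="aic"):
    """Minimise RSS + 2k ("aic") or RSS + k log n ("bic"), as stated in the text."""
    weight = 2.0 if penalty == "aic" else np.log(n)
    scores = [r + weight * k for k, (_, r) in enumerate(best)]
    return int(np.argmin(scores))


# Synthetic data (invented for illustration): only columns 0 and 2 matter,
# and the noise has unit variance.
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=n)

best = best_subset(X, y)
k_aic = choose_k(best, n, "aic")
k_bic = choose_k(best, n, "bic")
print("AIC choice:", k_aic, best[k_aic][0])
print("BIC choice:", k_bic, best[k_bic][0])
```

The search visits all $2^{p}$ subsets, so this sketch is only practical for small $p$. Since $\log n > 2$ once $n \geq 8$, the BIC penalty is the heavier one and never selects a larger $k$ than AIC on the same data.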

2) *Forward stepwise selection:* Start with just the intercept
term and at each subsequent stage add the variable that gives the
largest reduction in the residual sum of squares. This produces a sequence
$\hat{\beta}_{0}^{F}$, $\hat{\beta}_{1}^{F}$, $\ldots$, and we stop when

$$
\frac{RSS\left(\hat{\beta}_{k}^{F}\right)-RSS\left(\hat{\beta}_{k+1}^{F}\right)}{\frac{1}{n-k-2}RSS\left(\hat{\beta}_{k+1}^{F}\right)}\leq F_{1,n-k-2}(\alpha)
$$

where $F_{1,n-k-2}(\alpha)$ is the upper $\alpha$ point of the $F_{1,n-k-2}$
distribution. An open question is how to choose $\alpha$ to account for
the fact that we are doing multiple testing.
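The forward procedure with the F-test stopping rule can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes SciPy is available for the F quantile, and the function names, the default `alpha=0.05` and the synthetic data are all choices made for this example.

```python
import numpy as np
from scipy.stats import f as f_dist


def rss(X, y, cols):
    """RSS of the least-squares fit on an intercept plus the columns in `cols`."""
    n = len(y)
    design = np.column_stack([np.ones(n), X[:, cols]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)


def forward_stepwise(X, y, alpha=0.05):
    """Add variables greedily until the F-statistic falls below the upper alpha point."""
    n, p = X.shape
    selected = []                      # start with the intercept only
    current = rss(X, y, selected)
    while len(selected) < p:
        k = len(selected)
        df = n - k - 2                 # residual degrees of freedom of the larger model
        if df <= 0:
            break
        remaining = [j for j in range(p) if j not in selected]
        # Candidate giving the largest reduction in RSS.
        j_best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        new = rss(X, y, selected + [j_best])
        f_stat = (current - new) / (new / df)
        if f_stat <= f_dist.ppf(1 - alpha, 1, df):
            break                      # improvement not significant: stop
        selected.append(j_best)
        current = new
    return selected


# Synthetic data (invented for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=n)

selected_fwd = forward_stepwise(X, y)
print("Forward selection order:", selected_fwd)
```

Each stage fits at most $p$ candidate models, so the whole path costs on the order of $p^{2}$ least-squares fits instead of the $2^{p}$ needed for an exhaustive search. The sketch applies the same $\alpha$ at every stage and makes no attempt to correct for multiple testing.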

3) *Backward stepwise selection:* (Requires $p<n$.) Start with the full
model, and at each subsequent stage delete the variable whose removal results
in the smallest increase in the RSS. Use an F-test as above to determine when to stop.
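A backward counterpart might look like the sketch below. The text does not spell out the stopping rule, so the code assumes one natural reading: keep deleting while the F-statistic comparing the current model with the reduced one is at most the upper $\alpha$ point, and stop as soon as a deletion would be significant. The names, the default `alpha=0.05` and the synthetic data are again hypothetical, and SciPy is assumed for the F quantile.

```python
import numpy as np
from scipy.stats import f as f_dist


def rss(X, y, cols):
    """RSS of the least-squares fit on an intercept plus the columns in `cols`."""
    n = len(y)
    design = np.column_stack([np.ones(n), X[:, cols]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)


def backward_stepwise(X, y, alpha=0.05):
    """Delete variables greedily while the increase in RSS is not significant."""
    n, p = X.shape
    if p + 1 >= n:
        raise ValueError("the full model needs positive residual degrees of freedom")
    selected = list(range(p))          # start with the full model
    current = rss(X, y, selected)
    while selected:
        k = len(selected) - 1          # number of variables after the deletion
        df = n - k - 2                 # residual degrees of freedom of the current model
        # Variable whose removal gives the smallest increase in RSS.
        j_drop = min(
            selected,
            key=lambda j: rss(X, y, [c for c in selected if c != j]),
        )
        reduced = [c for c in selected if c != j_drop]
        new = rss(X, y, reduced)
        f_stat = (new - current) / (current / df)
        if f_stat > f_dist.ppf(1 - alpha, 1, df):
            break                      # deletion would be significant: stop
        selected, current = reduced, new
    return selected


# Synthetic data (invented for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=n)

selected_bwd = backward_stepwise(X, y)
print("Variables kept by backward selection:", selected_bwd)
```

The explicit check at the top of `backward_stepwise` mirrors the requirement that the full least-squares fit must be well defined before any variable is removed.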
