# Linear Model Selection and Best Subset Selection (32)

Recall the linear model:

$$ Y = \beta_{0} + \beta_{1}X_{1} + ... + \beta_{p}X_{p} + \epsilon $$

Later on, we will extend this framework to models that are **non-linear** but still **additive**.

We will look at alternatives to least squares for the sake of:

* **Prediction Accuracy**: especially when p > n, to control the variance
* **Model Interpretability**: removing irrelevant features yields a model that is easier to interpret and to communicate to others.

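Three classes of methods:

* **Subset selection**: identify a subset of the p predictors believed to be related to the response, then fit least squares on that subset.
* **Shrinkage**: fit all p predictors, but penalize the size of the coefficients so that they are shrunk toward zero.
* **Dimension Reduction**: project the p predictors onto a smaller number of linear combinations of the variables, and use those as predictors.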

### Subset Selection

1. Let $\mathcal{M}_{0}$ denote the **null model**, which contains no predictors. This model simply predicts the sample mean for each observation.
2. For $k = 1, 2, ... p$:
    a. Fit all $\binom {p}{k}$ models that contain exactly k predictors.
    b. Pick the best among these $\binom{p}{k}$ models, and call it $\mathcal{M}_{k}$. Here **best** is defined as having the smallest RSS, or equivalently the largest $R^2$.
3. Select a single best model from among $\mathcal{M}_{0}, ..., \mathcal{M}_{p}$ using cross-validated prediction error, $C_{p}$ (AIC), BIC, or adjusted $R^2$.
    * This step must use an estimate of **test error**, not **training error**: RSS always decreases as predictors are added, so training error would always pick $\mathcal{M}_{p}$.
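
Steps 1 and 2 of the procedure above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the `rss` helper, the `best_subset` function, and the toy data are all made up for this example, and `rss` solves the normal equations by Gauss-Jordan elimination only to avoid depending on a numerical library.

```python
from itertools import combinations


def rss(columns, y):
    """RSS of the least squares fit of y on an intercept plus `columns`."""
    n = len(y)
    cols = [[1.0] * n] + [list(c) for c in columns]
    m = len(cols)
    # Augmented normal equations [X'X | X'y].
    A = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols]
         + [sum(u * v for u, v in zip(ci, y))] for ci in cols]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(m):
        piv = max(range(i, m), key=lambda r: abs(A[r][i]))
        A[i], A[piv] = A[piv], A[i]
        for r in range(m):
            if r != i:
                f = A[r][i] / A[i][i]
                A[r] = [a - f * b for a, b in zip(A[r], A[i])]
    beta = [A[i][m] / A[i][i] for i in range(m)]
    fitted = [sum(b * c[t] for b, c in zip(beta, cols)) for t in range(n)]
    return sum((yt - ft) ** 2 for yt, ft in zip(y, fitted))


def best_subset(X, y):
    """Return {k: (best predictor indices, RSS)} for k = 0, ..., p."""
    p = len(X)  # X is a list of p predictor columns
    best = {}
    for k in range(p + 1):
        # k = 0 is the null model: intercept only, i.e. the sample mean.
        candidates = combinations(range(p), k)
        best[k] = min(((s, rss([X[j] for j in s], y)) for s in candidates),
                      key=lambda pair: pair[1])
    return best


# Toy data (made up): y depends on x0 and x2 only.
x0 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x1 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
x2 = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
y = [2 * a + 3 * c for a, c in zip(x0, x2)]

models = best_subset([x0, x1, x2], y)
for k, (subset, value) in models.items():
    print(k, subset, round(value, 6))
```

Step 3 is deliberately left out: choosing among the returned models requires a test-error estimate, not the training RSS stored here.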

# Forward Stepwise Selection (33)

* For computational reasons, best subset selection cannot be applied when p is very large, because it requires fitting $2^p$ models. Beyond roughly 40 predictors it becomes computationally infeasible.
* Best subset selection may also suffer from statistical problems: with such a large search space, it is more likely to find a model that looks good on the training data but overfits.
    * This is counterintuitive, because we usually expect the most exhaustive search to give the best answer.
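
To get a feel for the $2^p$ growth, the model count can be tallied directly. The function name below is hypothetical, chosen only for this illustration.

```python
from math import comb


def n_best_subset_models(p):
    """Number of models best subset selection fits: sum over k of C(p, k)."""
    return sum(comb(p, k) for k in range(p + 1))


for p in (10, 20, 40):
    print(p, n_best_subset_models(p))
# 10 1024
# 20 1048576
# 40 1099511627776
```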
    
### Forward Stepwise Selection

* Begins with a model containing no predictors and then adds predictors to the model, one-at-a-time, until all predictors are in the model.
    * At each step, the variable that gives the greatest **additional** improvement to the fit is added to the model.
    
1. Let $\mathcal{M}_{0}$ denote the null model, which contains no predictors.
2. For $k = 0, ... , p -1$:
    a. Consider all $p - k$ models that augment the predictors in $\mathcal{M}_{k}$ with one additional predictor.
    b. Choose the **best** among these $p - k$ models, and call it $\mathcal{M}_{k + 1}$. Here **best** is defined as having smallest RSS or highest $R^2$.
3. Select a single best model from among $\mathcal{M}_{0},...,\mathcal{M}_{p}$ using cross-validated prediction error, $C_{p}$ (AIC), BIC, or adjusted $R^2$.
    * These models differ in their number of predictors, so training RSS or $R^2$ cannot be used to compare them.
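
The greedy search in step 2 can be sketched as follows. To keep the example short, the function takes a caller-supplied `rss_of(subset)` that returns the training RSS for a tuple of predictor indices. That helper, the `gain` values, and the `toy_rss` function are assumptions made up for this illustration.

```python
def forward_stepwise(p, rss_of):
    """Return the path [M_0, M_1, ..., M_p] as tuples of predictor indices."""
    current = ()
    path = [current]
    for k in range(p):
        remaining = [j for j in range(p) if j not in current]
        # Steps 2a/2b: try each of the p - k one-variable augmentations.
        best_j = min(remaining, key=lambda j: rss_of(current + (j,)))
        current = current + (best_j,)
        path.append(current)
    return path


# Toy RSS (made up): each predictor removes a fixed amount of RSS.
gain = [5.0, 30.0, 12.0]
calls = []


def toy_rss(subset):
    calls.append(subset)  # record every model that gets "fit"
    return 100.0 - sum(gain[j] for j in subset)


path = forward_stepwise(3, toy_rss)
print(path)        # [(), (1,), (1, 2), (1, 2, 0)]
print(len(calls))  # 6
```

Counting the null model, forward stepwise fits $1 + p(p+1)/2$ models rather than $2^p$ (here 7 instead of 8, but 821 instead of about $10^{12}$ for $p = 40$). The price is that the models on the path are nested, so the search is not guaranteed to find the best subset of each size.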

# Backward Stepwise Selection (34)

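The opposite of forward stepwise selection: start with the full model and remove the least useful predictor, one at a time. Unlike forward stepwise, this requires $n > p$ so that the full model can be fit.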

1. Let $\mathcal{M}_{p}$ denote the **full** model, which contains all p predictors.
2. For $k = p, p - 1, ..., 1$:
    a. Consider all $k$ models that contain all but one of the predictors in $\mathcal{M}_{k}$, for a total of $k - 1$ predictors.
    b. Choose the **best** among these k models, and call it $\mathcal{M}_{k-1}$. Here **best** is defined as having smallest RSS or highest $R^2$
3. Select a single best model from among $\mathcal{M}_{0}, ..., \mathcal{M}_{p}$ using cross-validated prediction error, $C_{p}$ (AIC), BIC, or adjusted $R^2$.
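
A matching sketch for the backward pass, again assuming a caller-supplied `rss_of(subset)` that returns the training RSS. The helper and the toy numbers are hypothetical, made up for this illustration.

```python
def backward_stepwise(p, rss_of):
    """Return the path [M_p, M_{p-1}, ..., M_0] as tuples of predictor indices."""
    current = tuple(range(p))  # full model
    path = [current]
    for k in range(p, 0, -1):
        # Steps 2a/2b: try each of the k models that drop one predictor.
        candidates = [tuple(j for j in current if j != drop) for drop in current]
        current = min(candidates, key=rss_of)
        path.append(current)
    return path


# Toy RSS (made up): each predictor removes a fixed amount of RSS.
gain = [5.0, 30.0, 12.0]


def toy_rss(subset):
    return 100.0 - sum(gain[j] for j in subset)


print(backward_stepwise(3, toy_rss))
# [(0, 1, 2), (1, 2), (1,), ()]
```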

# Estimating Test Error Using Mallows' Cp, AIC, BIC, Adjusted R-squared (35)

We can indirectly estimate test error by making an **adjustment** to the training error to account for the bias due to overfitting. The criteria below all adjust the training error for model size, and can be used to select among a set of models with different numbers of variables.

### Mallows' Cp

Defined by:

$$ C_{p} = \dfrac{1}{n} (RSS + 2d\hat{\sigma}^2) $$

where d is the total # of parameters used and $\hat{\sigma}^2$ is an estimate of the variance of error $\epsilon$ associated with each response measurement. 

### AIC (Akaike Information Criterion)

$$ \text{AIC} = -2\log L + 2 d $$

Where L is the maximized value of the likelihood function for the estimated model.

* For linear models with Gaussian errors, AIC and $C_p$ are proportional, so they select the same model; just use $C_p$.
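
Why the two agree can be sketched in a couple of lines, assuming Gaussian errors and treating the error variance as known and equal to $\hat{\sigma}^2$ (an assumption of this sketch). The maximized log-likelihood then satisfies

$$ -2\log L = n\log(2\pi\hat{\sigma}^2) + \dfrac{\text{RSS}}{\hat{\sigma}^2} $$

so that

$$ \text{AIC} = \dfrac{\text{RSS}}{\hat{\sigma}^2} + 2d + n\log(2\pi\hat{\sigma}^2) = \dfrac{n}{\hat{\sigma}^2} C_{p} + n\log(2\pi\hat{\sigma}^2) $$

The last term is the same for every model being compared, so ranking models by AIC is the same as ranking them by $C_{p}$.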

### BIC (Bayesian Information Criterion)

$$ \text{BIC} = \dfrac {1}{n} (\text{RSS} + \log (n) d \hat{\sigma}^2) $$

* The only difference between BIC and $C_p$ is that the factor 2 is replaced by $\log n$.
* Since $\log n > 2$ for any $n > 7$, BIC places a heavier penalty on large models than AIC, and so tends to select smaller models.

### Adjusted R-squared

$$ \text{Adjusted } R^2 = 1 - \dfrac {\text{RSS} / (n - d - 1)}{\text{TSS} / (n - 1)} $$

where $\text{TSS} = \sum_{i} (y_{i} - \bar{y})^2$ is the total sum of squares.

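* Compared to regular R-squared, this statistic penalizes larger models: adding a variable that barely reduces RSS increases $\text{RSS} / (n - d - 1)$ and so lowers the adjusted $R^2$. Unlike $C_p$, AIC, and BIC, a **larger** value is better.
* Adjusted R-squared is popular in practice, but statisticians tend to prefer $C_p$, AIC, and BIC because there is more theory behind them.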
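
The three RSS-based criteria are easy to compute side by side. The sketch below uses the $C_p$ and BIC formulas given earlier in this section and an adjusted R-squared with TSS in the denominator. The function names and the numbers (n, TSS, the variance estimate, and the two candidate models) are made up for illustration.

```python
import math


def mallows_cp(rss, n, d, sigma2):
    """C_p = (RSS + 2 d sigma^2) / n; smaller is better."""
    return (rss + 2 * d * sigma2) / n


def bic(rss, n, d, sigma2):
    """BIC = (RSS + log(n) d sigma^2) / n; smaller is better."""
    return (rss + math.log(n) * d * sigma2) / n


def adjusted_r2(rss, tss, n, d):
    """Adjusted R^2 = 1 - [RSS / (n - d - 1)] / [TSS / (n - 1)]; larger is better."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))


# Hypothetical numbers: two candidate models fit to the same n = 100 points.
n, tss, sigma2 = 100, 500.0, 1.0
candidates = {"small (d=2)": (120.0, 2), "large (d=5)": (110.0, 5)}

for name, (rss, d) in candidates.items():
    print(name,
          round(mallows_cp(rss, n, d, sigma2), 4),
          round(bic(rss, n, d, sigma2), 4),
          round(adjusted_r2(rss, tss, n, d), 4))
```

With these made-up numbers, $C_p$ and adjusted R-squared favor the larger model while BIC favors the smaller one, which illustrates BIC's heavier penalty on model size.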

# Estimating Test Error Using Cross-Validation (36)

Cross-validation (or a validation set) applied to each $\mathcal{M}_{k}$ has an advantage over AIC, BIC, Cp, and adjusted R-squared: it gives a direct estimate of the test error and does not require an estimate of $\sigma^2$.

### One-Standard-Error Rule

Don't simply choose the model with the minimum estimated test error. Instead, choose the simplest model whose estimated test error is within one standard error of that minimum.
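
A minimal sketch of the one-standard-error rule, assuming the cross-validation error and its standard error have already been computed for each model size. The function name and the numbers are hypothetical.

```python
def one_standard_error_rule(cv_error, cv_se):
    """Pick a model size k, where cv_error[k] and cv_se[k] describe M_k.

    Returns the smallest k whose CV error is within one standard error
    of the minimum CV error.
    """
    k_min = min(range(len(cv_error)), key=lambda k: cv_error[k])
    threshold = cv_error[k_min] + cv_se[k_min]
    return next(k for k in range(len(cv_error)) if cv_error[k] <= threshold)


# Made-up CV results for M_0, ..., M_4.
cv_error = [10.0, 6.0, 4.2, 4.0, 4.1]
cv_se = [0.8, 0.5, 0.4, 0.3, 0.3]

print(one_standard_error_rule(cv_error, cv_se))  # 2
```

Here the minimum is at $k = 3$ (error 4.0, standard error 0.3), so any model with error at most 4.3 qualifies, and the smallest such model is $\mathcal{M}_{2}$.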