# Shrinkage Methods and Ridge Regression (37)

Recall that ordinary least squares chooses the coefficients that minimize the residual sum of squares:

$$ \text{RSS} = \sum_{i = 1}^{n} \bigg(y_{i} - \beta_{0} - \sum_{j = 1}^{p} \beta_{j}x_{ij} \bigg) ^ 2 $$

But in **ridge regression**, the coefficient estimates $\hat{\beta}^R$ are the values that minimize:

$$\sum_{i = 1}^{n} \bigg(y_{i} - \beta_{0} - \sum_{j=1}^{p}\beta_{j}x_{ij} \bigg)^2 + \lambda \sum_{j=1}^{p} \beta_{j}^2 = \text{RSS} + \lambda \sum_{j=1}^{p} \beta_{j}^2 $$

where $\lambda \ge 0$ is a **tuning parameter**, to be determined separately.

* The second term, $\lambda \sum_{j=1}^{p} \beta_{j}^2$, penalizes coefficients for being too large.
* It is also called the **shrinkage penalty**.
    * As $\lambda$ increases, the coefficients shrink toward 0, but they reach exactly 0 only in the limit $\lambda \to \infty$, so every feature stays in the model.
    * Side note: the $L_{2}$ norm is $\|\beta\|_{2} = \sqrt{\sum_{j=1}^{p} \beta_{j}^2}$.
        * Plotting the standardized coefficients against $\|\hat{\beta}_{\lambda}^R\|_{2}/\|\hat{\beta}\|_{2}$ puts the x-axis on a scale from 0 ($\lambda \to \infty$) to 1 ($\lambda = 0$).
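
To make the shrinkage concrete, here is a minimal sketch for the special case of a single predictor ($p = 1$) with an unpenalized intercept. In that case setting the derivative of the ridge criterion to zero gives $\hat{\beta}^R = S_{xy} / (S_{xx} + \lambda)$. The function name `ridge_slope` and the data are made up for illustration.

```python
def ridge_slope(x, y, lam):
    """Ridge estimate of the slope for one predictor (intercept not penalized)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Centered cross-product and sum of squares.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    # lam = 0 recovers least squares; larger lam inflates the denominator.
    return sxy / (sxx + lam)


# Hypothetical toy data, roughly y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    print(f"lambda = {lam:7.1f}  slope = {ridge_slope(x, y, lam):.4f}")
```

The slope gets smaller as $\lambda$ grows but stays nonzero for every finite $\lambda$.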
    
### Scaling of predictors

Because the ridge penalty $\sum_{j=1}^{p} \beta_{j}^2$ depends on the size of the coefficients, and therefore on the units of each predictor, we must standardize each predictor by dividing it by its standard deviation:

$$\tilde{x}_{ij} = \dfrac {x_{ij}} {\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_{j}) ^2}} $$
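
A small sketch of this scaling step, assuming a predictor stored as a plain Python list. The helper name `standardize` and the income figures are hypothetical. It shows that the same quantity recorded in different units ends up identical after standardization, which is why the penalty no longer depends on the units.

```python
import math


def standardize(column):
    """Divide each value by the column's standard deviation (1/n version)."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [v / sd for v in column]


# The same hypothetical predictor in dollars and in thousands of dollars.
income_dollars = [1000.0, 2000.0, 3000.0]
income_thousands = [1.0, 2.0, 3.0]

print(standardize(income_dollars))
print(standardize(income_thousands))
```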

Note that ridge regression never shrinks a coefficient exactly to 0 for any finite $\lambda$, so it never eliminates a predictor from the model.

# The Lasso (38)

The lasso is similar to ridge regression, except that it minimizes:

$$\sum_{i = 1}^{n} \bigg(y_{i} - \beta_{0} - \sum_{j=1}^{p}\beta_{j}x_{ij} \bigg)^2 + \lambda \sum_{j=1}^{p} |\beta_{j}| = \text{RSS} + \lambda \sum_{j=1}^{p} |\beta_{j}| $$

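* The only difference is that the penalty uses $|\beta_{j}|$ (an $L_{1}$ penalty) instead of $\beta_{j}^2$.
* Like ridge, the lasso shrinks coefficients toward zero, but it sets some of them exactly to 0 when $\lambda$ is large enough.
* The result is a **sparse** model that uses only a subset of the predictors, so the lasso also performs variable selection.
* It is popular because the criterion is convex, so it can be optimized efficiently.
* This is useful because it can cut thousands of features down to a few predictors with nonzero coefficients.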
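
The difference between the two penalties is easiest to see in a simple special case, assumed here for illustration: $n = p$, the design matrix is the identity, and there is no intercept, so the least squares estimate is just $\hat{\beta}_{j} = y_{j}$. Minimizing each criterion coordinate by coordinate then gives closed forms: ridge divides by $1 + \lambda$, while the lasso subtracts $\lambda/2$ and truncates at zero (soft thresholding). The function names and numbers are hypothetical.

```python
def ridge_special_case(y_j, lam):
    """Ridge estimate when X is the identity: proportional shrinkage."""
    return y_j / (1.0 + lam)


def lasso_special_case(y_j, lam):
    """Lasso estimate when X is the identity: soft thresholding at lam / 2."""
    if y_j > lam / 2:
        return y_j - lam / 2
    if y_j < -lam / 2:
        return y_j + lam / 2
    return 0.0


# Hypothetical least squares estimates: two large, two small.
ols = [3.0, -1.5, 0.4, -0.2]
lam = 1.0

ridge = [ridge_special_case(b, lam) for b in ols]
lasso = [lasso_special_case(b, lam) for b in ols]

print("ridge:", ridge)  # every coefficient shrunk, none exactly zero
print("lasso:", lasso)  # small coefficients set exactly to zero
```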

### Which should I use?

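It depends. If the true model is dense (most predictors have nonzero coefficients), ridge tends to do better. If it is sparse (only a few predictors matter), the lasso tends to do better. Since we rarely know which case applies, it is best to fit both and compare them using cross-validation (CV).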

# Tuning Parameter Selection for Ridge Regression and Lasso (39)

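Selection criteria that depend on $d$ (the number of parameters) are hard to apply here, because for ridge regression and the lasso it is not clear what $d$ is: all $p$ predictors can be in the model, but their coefficients are constrained. Cross-validation does not need $d$, so CV is our pick for setting the tuning parameter: choose a grid of $\lambda$ values, compute the CV error for each, and select the $\lambda$ with the smallest error.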
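
A rough sketch of choosing $\lambda$ by k-fold cross-validation, assuming a single predictor so that ridge has the closed form $S_{xy} / (S_{xx} + \lambda)$. The names `fit_ridge` and `cv_error`, the grid, and the data are all hypothetical, and folds are assigned by index for simplicity (in practice they would usually be assigned at random).

```python
def fit_ridge(x, y, lam):
    """One-predictor ridge fit with an unpenalized intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / (sxx + lam)
    intercept = y_mean - slope * x_mean
    return intercept, slope


def cv_error(x, y, lam, k):
    """Mean squared prediction error over k folds (fold = index mod k)."""
    n = len(x)
    total = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        b0, b1 = fit_ridge([x[i] for i in train], [y[i] for i in train], lam)
        total += sum((y[i] - b0 - b1 * x[i]) ** 2 for i in test)
    return total / n


# Hypothetical toy data.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.3, 5.9, 7.1, 8.2, 8.8, 10.1]

grid = [0.0, 0.1, 1.0, 10.0, 100.0]
errors = {lam: cv_error(x, y, lam, k=5) for lam in grid}
best_lam = min(errors, key=errors.get)

for lam in grid:
    print(f"lambda = {lam:6.1f}  CV error = {errors[lam]:.4f}")
print("selected lambda:", best_lam)
```

After selecting $\lambda$, the model would be refit on all of the data with that value.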

# Dimension Reduction Methods (40)

The methods we've talked about so far fit linear regression models via least squares or a shrunken approach, using original predictors. Now we'll explore a class of approaches that **transform** the predictors and then fit a least squares model using the transformed variables. This is called **dimension reduction**.

* Let $Z_{1},Z_{2},...,Z_{M}$ represent $M < p$ linear combinations of our original $p$ predictors. That is:

$$ Z_{m} = \sum_{j=1}^{p} \phi_{mj}X_{j}$$

* for some constants $\phi_{m1},...,\phi_{mp}$.
* We can then fit the linear regression model:

$$ y_{i} = \theta_{0} + \sum_{m=1}^M \theta_{m}z_{im} + \epsilon_{i}, \quad i = 1,\dots,n,$$

* using ordinary least squares.
* The idea is that if you are smart about choosing $\phi_{m1},...,\phi_{mp}$, then you can often outperform least squares on the original predictors.

Note:

$$ \sum_{m=1}^M \theta_{m}z_{im} = \sum_{m=1}^{M} \theta_{m} \sum_{j=1}^{p} \phi_{mj}x_{ij} = \sum_{j=1}^{p} \sum_{m=1}^{M} \theta_{m}\phi_{mj}x_{ij} = \sum_{j=1}^{p} \beta_{j}x_{ij}$$

where

$$ \beta_{j} = \sum_{m=1}^{M} \theta_{m}\phi_{mj} $$

* We are still doing linear regression on the original predictors, but the coefficients are now constrained to take the form $\beta_{j} = \sum_{m=1}^{M} \theta_{m}\phi_{mj}$.
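
The identity above can be checked numerically. This is a small sketch with made-up values for $\phi$, $\theta$, and one observation $x_{i}$ (with $M = 2$ and $p = 3$). It computes the fitted part of the model both ways and confirms they agree.

```python
# Hypothetical loadings phi[m][j]: M = 2 linear combinations of p = 3 predictors.
phi = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, -0.5],
]
theta = [2.0, -1.0]    # hypothetical coefficients on Z_1, Z_2
x_i = [1.0, 2.0, 3.0]  # one hypothetical observation

M, p = len(phi), len(x_i)

# Left-hand side: regress on the transformed variables z_im.
z_i = [sum(phi[m][j] * x_i[j] for j in range(p)) for m in range(M)]
lhs = sum(theta[m] * z_i[m] for m in range(M))

# Right-hand side: implied coefficients on the original predictors.
beta = [sum(theta[m] * phi[m][j] for m in range(M)) for j in range(p)]
rhs = sum(beta[j] * x_i[j] for j in range(p))

print("z_i  =", z_i)
print("beta =", beta)
print("lhs  =", lhs, " rhs =", rhs)
```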

# Principal Components Regression (41)

The most popular form of dimension reduction is **principal components regression (PCR)**, which uses **principal components analysis (PCA)** to define the linear combinations.

* The first principal component is the (normalized) linear combination of the variables with the largest variance.
* The second principal component has the largest variance, subject to being uncorrelated with the first.
* And so on.
* The hope is that a few principal components capture most of the variance in the data.

Geometrically, in a scatter plot of the data, the first principal component direction is the line such that the sum of squared perpendicular distances from each point to the line is minimized.

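* This is related to **partial least squares** in lecture 40: both build a small number of linear combinations of the features, but PCR chooses its directions based on the variance of the predictors alone, without using the response.
    * Mathematically, PCR, ridge regression, and the lasso are all similar: each reduces the effective number of variables, either smoothly (ridge regression shrinks continuously) or in a choppy, discrete way (PCR keeps or drops whole components).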
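
As an illustration of PCR with $M = 1$, here is a minimal sketch for two predictors. For a 2x2 covariance matrix the direction of largest variance has a closed form (the principal-axis angle), so no linear algebra library is needed. The function names and data are hypothetical, and the predictors are assumed to be on comparable scales already.

```python
import math


def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def first_pc_direction(x1, x2):
    """Unit vector (phi_11, phi_12) maximizing the variance of phi_11*x1 + phi_12*x2."""
    a, b = center(x1), center(x2)
    n = len(a)
    s11 = sum(u * u for u in a) / n
    s22 = sum(v * v for v in b) / n
    s12 = sum(u * v for u, v in zip(a, b)) / n
    # Principal-axis angle of a symmetric 2x2 matrix [[s11, s12], [s12, s22]].
    angle = 0.5 * math.atan2(2.0 * s12, s11 - s22)
    return [math.cos(angle), math.sin(angle)]


# Hypothetical toy data: two correlated predictors and a response.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]
y = [2.0, 4.1, 6.2, 7.9, 10.1, 12.2]

phi = first_pc_direction(x1, x2)

# Scores on the first principal component (already centered).
z = [phi[0] * u + phi[1] * v for u, v in zip(center(x1), center(x2))]

# PCR with M = 1: least squares regression of centered y on z.
y_c = center(y)
theta1 = sum(zi * yi for zi, yi in zip(z, y_c)) / sum(zi * zi for zi in z)

# Implied coefficients on the original predictors: beta_j = theta_1 * phi_1j.
beta = [theta1 * phi[0], theta1 * phi[1]]

print("phi   =", phi)
print("theta =", theta1)
print("beta  =", beta)
```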