# 3.4 Shrinkage Methods

> *Subset selection* is a discrete process: variables are either retained or discarded. As a result it often exhibits high variance, and so doesn't reduce the prediction error of the full model. *Shrinkage methods* are more continuous, and don't suffer as much from high variability.

## 3.4.1 Ridge Regression

**Ridge regression** shrinks the regression coefficients by imposing a penalty on their size. The ridge coefficients minimize a penalized residual sum of squares:

\begin{align}
\hat{\beta}^{ridge}=argmin_\beta \{ \sum_{i=1}^N(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j)^2+\lambda\sum_{j=1}^p\beta_j^2 \}
\end{align}

- λ ≥ 0 is a complexity parameter that controls the amount of shrinkage

Writing the criterion in matrix form:

\begin{align}
RSS(\lambda)=(\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)+\lambda\beta^T\beta
\end{align}
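
A short sketch of the step from this criterion to its minimizer, assuming $\mathbf{X}$ is centered so that the intercept can be dropped (as the matrix form implies): differentiate with respect to $\beta$ and set the gradient to zero.

\begin{align}
\frac{\partial RSS(\lambda)}{\partial\beta}&=-2\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)+2\lambda\beta=0 \\
\Rightarrow\quad(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})\beta&=\mathbf{X}^T\mathbf{y}
\end{align}

For $\lambda>0$ the matrix $\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}$ is positive definite, so this linear system has a unique solution.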


The ridge regression solutions:
\begin{align}
\hat{\beta}^{ridge}=(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}
\end{align}

- $\mathbf{I}$ is the p×p identity matrix

Note:
- The ridge regression solution is again a linear function of $\mathbf{y}$;
- The solution adds a positive constant to the diagonal of $\mathbf{X}^T\mathbf{X}$ before inversion, which makes the problem nonsingular even if $\mathbf{X}^T\mathbf{X}$ is not of full rank.
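
The closed-form solution above can be sketched in a few lines of NumPy. This is only an illustration: the function name `ridge_coefficients`, the simulated data, and the choice to center `X` and `y` (so that no intercept is needed) are assumptions made for the example, not part of the text.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for centered X and y."""
    p = X.shape[1]
    # Solving the linear system is preferable to forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated, centered data (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X -= X.mean(axis=0)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
y -= y.mean()

beta_ls = ridge_coefficients(X, y, lam=0.0)     # lam = 0 reduces to least squares
beta_ridge = ridge_coefficients(X, y, lam=10.0)  # shrunk towards zero

# Duplicating a column makes X^T X singular, yet the ridge system stays solvable.
X_dup = np.column_stack([X, X[:, 0]])
beta_dup = ridge_coefficients(X_dup, y, lam=1.0)
```

In the duplicated-column case least squares has no unique solution, while the ridge system still does for any positive `lam`.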

### Singular value decomposition (SVD)
The **singular value decomposition (SVD)** of the centered input matrix X gives us some additional insight into the nature of ridge regression. The SVD of the N × p matrix X has the form:

\begin{align}
X=UDV^T
\end{align}

- U: an N×p matrix with orthonormal columns, which span the column space of X
- V: a p×p orthogonal matrix, whose columns span the row space of X
- D: a p×p diagonal matrix, with diagonal entries $d_1 \geq d_2 \geq \cdots \geq d_p \geq 0$ called the singular values of X. If one or more values $d_j = 0$, X is singular

The least squares fitted vector:

\begin{align}
\mathbf{X}\hat{\beta}^{ls}&=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \\
&=UDV^T (VD^TU^TUDV^T)^{-1}VD^TU^Ty \\
&=UDV^T (VD^TDV^T)^{-1}VD^TU^Ty \\
&=UDV^T (V^T)^{-1}D^{-1}(D^T)^{-1}V^{-1}VD^TU^Ty \\
&=\mathbf{U}\mathbf{U}^T\mathbf{y}
\end{align}

Note: $\mathbf{U}^T\mathbf{y}$ are the coordinates of y with respect to the orthonormal basis U. 

The ridge solutions:
\begin{align}
\mathbf{X}\hat{\beta}^{ridge}&=\mathbf{X}(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} \\
&=UD(D^2+\lambda\mathbf{I})^{-1}D^TU^Ty \\
&=\sum_{j=1}^p\mathbf{u}_j\frac{d^2_j}{d^2_j+\lambda}\mathbf{u}^T_j\mathbf{y}
\end{align}

- $\mathbf{u}_j$ are the columns of U

Note: ridge regression computes the coordinates of y with respect to the orthonormal basis U. It then shrinks these coordinates by the factors $\frac{d^2_j}{d^2_j+\lambda}$
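
The two SVD expressions above can be checked numerically. The sketch below uses simulated data and variable names chosen for this example; it relies on `np.linalg.svd` with `full_matrices=False`, which returns the thin SVD with the singular values in descending order.

```python
import numpy as np

# Simulated, centered data (hypothetical values, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
X -= X.mean(axis=0)
y = rng.normal(size=40)
y -= y.mean()
lam = 5.0

# Thin SVD: U is N x p, d holds the p singular values, Vt is V^T (p x p).
U, d, Vt = np.linalg.svd(X, full_matrices=False)

coords = U.T @ y                    # coordinates of y in the basis U
shrink = d**2 / (d**2 + lam)        # ridge shrinkage factor per direction

fit_ls_svd = U @ coords             # least squares fit: U U^T y
fit_ridge_svd = U @ (shrink * coords)  # ridge fit: shrunken coordinates

# Same ridge fit computed from the closed-form coefficients.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
fit_ridge_direct = X @ beta_ridge
```

Because `d` is sorted in descending order, `shrink` is too: directions with smaller singular values are shrunk more.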

#### What does a small value of $d^2_j$ mean? 
The SVD of the centered matrix X is another way of expressing the **principal components** of the variables in X. The sample covariance matrix is given by $S=X^TX/N$, and the SVD gives the eigendecomposition of $X^TX$ (and of $S$, up to the factor $N$).

**Eigendecomposition of $X^TX$:**

\begin{align}
\mathbf{X}^T\mathbf{X}=VD^TU^TUDV^T=VD^2V^T
\end{align}

The eigenvectors $v_j$ (columns of V) are also called the **principal components** (or Karhunen–Loève) directions of X.
The first principal component direction $v_1$ has the property that $z_1=Xv_1$ has the largest sample variance amongst all normalized linear combinations of the columns of X, which is:

\begin{align}
Var(z_1)=Var(Xv_1)=\frac{d^2_1}{N}
\end{align}

and in fact $z_1=Xv_1=u_1d_1$. The derived variable $z_1$ is called the first principal component of X, and hence $u_1$ is the normalized first principal component. Subsequent principal components $z_j$ have maximum variance $\frac{d^2_j}{N}$, subject to being orthogonal to the earlier ones.

Hence the small singular values $d_j$ correspond to directions in the column space of X having small variance, and ridge regression shrinks these directions the most.
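
A small numerical check of the principal component identities above, on simulated data. The variable names are chosen for this example, and `np.var` is used with its default `ddof=0` so that the variance is divided by N, matching $S=X^TX/N$.

```python
import numpy as np

# Simulated data with unequal column scales, then centered (illustrative only).
rng = np.random.default_rng(2)
N = 200
X = rng.normal(size=(N, 3)) * np.array([3.0, 1.0, 0.2])
X -= X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)

v1 = Vt[0]          # first principal component direction (first column of V)
z1 = X @ v1         # first principal component

# Any other unit-length direction should give a variance no larger than Var(z1).
w = rng.normal(size=3)
w /= np.linalg.norm(w)

var_z1 = np.var(z1)             # compare with d[0]**2 / N
var_other = np.var(X @ w)
variances = d**2 / N            # variance along each principal direction
```

The last entries of `variances` belong to the low-variance directions, which are the ones ridge regression shrinks the most.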

### Effective degrees of freedom
\begin{align}
df(\lambda)&=tr[\mathbf{X}(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T] \\
&=tr[\mathbf{H}_\lambda] \\
&=\sum^p_{j=1}\frac{d^2_j}{d^2_j+\lambda}
\end{align}

This monotone decreasing function of λ is the effective degrees of freedom of the ridge regression fit.
Usually in a linear-regression fit with p variables, the degrees-of-freedom of the fit is p, the number of free parameters.

Note that
> df(λ) = p when λ = 0 (no regularization)

> df(λ) → 0 as λ →∞.
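
The effective degrees of freedom can be computed directly from the singular values. The sketch below uses a hypothetical helper `ridge_effective_df` and simulated data, and also forms the matrix $\mathbf{H}_\lambda$ explicitly so the two expressions can be compared.

```python
import numpy as np

def ridge_effective_df(X, lam):
    """Sum of d_j^2 / (d_j^2 + lam) over the singular values of X."""
    d = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(d**2 / (d**2 + lam)))

# Simulated, centered data (hypothetical values, for illustration only).
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5))
X -= X.mean(axis=0)

lam = 2.0
# H_lambda = X (X^T X + lam I)^{-1} X^T, formed explicitly only for comparison.
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
df_trace = float(np.trace(H))
df_svd = ridge_effective_df(X, lam)

# df(lam) over a grid: starts at p for lam = 0 and decreases towards 0.
lams = [0.0, 1.0, 10.0, 100.0, 1000.0]
dfs = [ridge_effective_df(X, l) for l in lams]
```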

## 3.4.2 The Lasso

The lasso is a shrinkage method like ridge, with subtle but important differences. The lasso estimate is defined by:

\begin{align}
\hat{\beta}^{lasso}&=argmin_\beta\sum_{i=1}^N(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j)^2 \\
& \text{subject to } \sum_{j=1}^p|\beta_j|\leq t
\end{align}

Lasso problem in *Lagrangian form*:
\begin{align}
\hat{\beta}^{lasso}&=argmin_\beta\{ \sum_{i=1}^N(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j)^2+\lambda\sum_{j=1}^p|\beta_j| \}
\end{align}

#### Difference with ridge:
The L2 ridge penalty $\sum_{j=1}^p\beta_j^2$ is replaced by the L1 lasso penalty $\sum_{j=1}^p|\beta_j|$. This
latter constraint makes the solutions nonlinear in the $y_i$, and there is no closed form expression as in ridge regression.
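
Since there is no closed form, the lasso has to be computed iteratively. The sketch below uses cyclic coordinate descent, which is one common approach and not something the text above prescribes. It assumes centered `X` and `y` (so the intercept drops out) and minimizes the Lagrangian form exactly as written, without a factor of 1/2 on the residual sum of squares, which is why the per-coordinate threshold is `lam / 2`. The function names and simulated data are hypothetical.

```python
import numpy as np

def soft_threshold(z, gamma):
    """sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=500):
    """Minimize sum((y - X beta)^2) + lam * sum(|beta|) by cyclic updates."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0)      # x_j^T x_j for each column
    residual = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: put coordinate j's contribution back.
            residual += X[:, j] * beta[j]
            rho = X[:, j] @ residual
            # Exact minimizer of the criterion in beta_j, others held fixed.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            residual -= X[:, j] * beta[j]
    return beta

# Simulated, centered data with a sparse true coefficient vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)
y -= y.mean()

lam = 50.0
beta_lasso = lasso_coordinate_descent(X, y, lam)
beta_ls = lasso_coordinate_descent(X, y, 0.0)   # lam = 0 recovers least squares
```

A fixed number of sweeps is used for simplicity; a practical implementation would stop once the coefficients no longer change appreciably.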

> t should be adaptively chosen to minimize an estimate of expected prediction error.

- if $t>t_0=\sum_{j=1}^p|\hat{\beta}_j^{ls}|$, then the lasso estimates are the $\hat{\beta}_j^{ls}$
- if $t=t_0/2$, say, the least squares coefficients are shrunk by about 50% on average

The standardized parameter: $s=t/\sum_{j=1}^p|\hat{\beta}_j^{ls}|$

- s = 1.0: the lasso coefficients are the least squares estimates
- s → 0: the lasso coefficients → 0

## 3.4.3 Discussion: Subset Selection, Ridge Regression and the Lasso

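When the input matrix X is orthonormal, the three procedures have explicit solutions, each applying a simple transformation to the least squares estimates:

- Ridge regression: does a proportional shrinkage
- Lasso: translates each coefficient toward zero by a constant amount (set by λ), truncating at zero. This is "soft thresholding"
- Best-subset selection (of size M): drops all variables with coefficients smaller in absolute value than the Mth largest. This is "hard thresholding"

<img src="./images/lass_ridge.png" width="550">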
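
The three transformations in the list above can be sketched as functions of the least squares coefficients. These explicit forms hold only when the columns of X are orthonormal. The function names and example numbers are hypothetical. In `lasso_soft_threshold`, `lam` is simply the amount of translation; how it relates to λ depends on how the criterion is scaled (with the Lagrangian form written earlier, which has no factor of 1/2, the translation would be λ/2).

```python
import numpy as np

def ridge_shrink(beta_ls, lam):
    """Proportional shrinkage: every coefficient is divided by (1 + lam)."""
    return beta_ls / (1.0 + lam)

def lasso_soft_threshold(beta_ls, lam):
    """Move each coefficient toward zero by lam, truncating at zero."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

def best_subset_hard_threshold(beta_ls, M):
    """Keep the M coefficients largest in absolute value, zero the rest."""
    cutoff = np.sort(np.abs(beta_ls))[-M]
    return np.where(np.abs(beta_ls) >= cutoff, beta_ls, 0.0)

# Hypothetical least squares coefficients, for illustration only.
beta_ls = np.array([3.0, -1.5, 0.4, -0.2])

beta_ridge = ridge_shrink(beta_ls, lam=0.5)
beta_lasso = lasso_soft_threshold(beta_ls, lam=0.5)
beta_subset = best_subset_hard_threshold(beta_ls, M=2)
```

Ridge keeps every coefficient nonzero, while the lasso and best-subset selection set the small ones exactly to zero; the lasso additionally shrinks the coefficients it keeps.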

### Bayes View

Consider the criterion

\begin{align}
\tilde{\beta}&=argmin_\beta\{ \sum_{i=1}^N(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j)^2+\lambda\sum_{j=1}^p|\beta_j|^q \}
\end{align}

for q ≥ 0. The contours of constant value of $\sum_{j=1}^p|\beta_j|^q$ are shown in Figure 3.12, for the case of two inputs.
<img src="./images/q.png" width="550">

<font color= 'red'>The lasso, ridge regression and best subset selection are Bayes estimates with different priors:</font> Thinking of $|\beta_j|^q$ as the negative log-prior density for $\beta_j$ (up to constants), the contours of $\sum_{j=1}^p|\beta_j|^q$ are also the equi-contours of the prior distribution of the parameters.

- q = 0: variable subset selection, as the penalty simply counts the number of nonzero parameters;
- q = 1: the lasso, corresponding to an independent Laplace (double exponential) distribution for each input, with density $\frac{1}{2\tau}exp(-|\beta|/\tau)$, where $\tau=1/\lambda$;
- q = 2: ridge regression, corresponding to a Gaussian prior.
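
The penalty $\sum_{j=1}^p|\beta_j|^q$ for these three values of q can be sketched as below. The function name `lq_penalty` is hypothetical. The case q = 0 is handled separately because NumPy evaluates `0.0 ** 0` as 1, which would count zero coefficients as nonzero.

```python
import numpy as np

def lq_penalty(beta, q):
    """Return sum_j |beta_j|^q, treating q = 0 as a count of nonzero entries."""
    beta = np.asarray(beta, dtype=float)
    if q == 0:
        return float(np.count_nonzero(beta))
    return float(np.sum(np.abs(beta) ** q))

# Hypothetical coefficient vector, for illustration only.
beta = [3.0, 0.0, -4.0]

penalty_subset = lq_penalty(beta, 0)   # number of nonzero coefficients
penalty_lasso = lq_penalty(beta, 1)    # L1 norm
penalty_ridge = lq_penalty(beta, 2)    # squared L2 norm
```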