## Ridge regression (Hoerl and Kennard, 1970)

Consider the linear model with an intercept term, $Y=\beta_{0}+X\beta+\epsilon$. When some of the columns of $X$ are nearly collinear (as is likely when $p$ is nearly as large as $n$), the determinant of $X^{T}X$, which is the square of the volume of the parallelepiped whose edges are the columns of $X$, is nearly 0. Thus, since a determinant is the product of the eigenvalues, at least one eigenvalue of $X^{T}X$ is close to 0, and so at least one of the eigenvalues of $\left(X^{T}X\right)^{-1}$ is very large. But the least squares estimator satisfies $\hat{\beta}\sim N_{p}\left(\beta,\,\sigma^{2}\left(X^{T}X\right)^{-1}\right)$, at least when the columns of $X$ are orthogonal to $(1,1,\ldots,1)^{T}$, and the sum of the eigenvalues is the trace, so at least one of the components of $\hat{\beta}$ has very large variance. We therefore seek to shrink $\hat{\beta}$ towards the origin. This will introduce a bias, but we hope it will be more than compensated by a reduction in variance.
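The effect of near-collinearity can be illustrated numerically. The following is a minimal sketch, assuming NumPy and a simulated design in which the second column is almost a copy of the first; the sample size, the perturbation scale `1e-3` and all variable names are illustrative choices rather than part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Simulated design (an assumption for illustration): x2 is almost a copy of x1.
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
X = X - X.mean(axis=0)  # centre so each column is orthogonal to (1,...,1)^T

gram = X.T @ X
mu = np.linalg.eigvalsh(gram)            # eigenvalues of X^T X, ascending
inv_eigs = 1.0 / mu                      # eigenvalues of (X^T X)^{-1}
coef_var = np.diag(np.linalg.inv(gram))  # Var(beta_hat_j) / sigma^2

print("eigenvalues of X^T X:      ", mu)
print("eigenvalues of the inverse:", inv_eigs)
print("Var(beta_hat_j) / sigma^2: ", coef_var)
```

The smallest eigenvalue of $X^{T}X$ should come out several orders of magnitude below the largest, and the variances of the two nearly collinear coefficients should dwarf that of the third.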

For fixed $\lambda>0$, the ridge regression estimator is $\hat{\beta}_{\lambda}^{R}$, where $\left(\hat{\beta}_{0},\hat{\beta}_{\lambda}^{R}\right)$ minimises

$$ Q_{2}(\beta_{0},\beta)=\sum_{i=1}^{n}\left(Y_{i}-\beta_{0}-\sum_{j=1}^{p}x_{ij}\beta_{j}\right)^{2}+\lambda\sum_{j=1}^{p}\beta_{j}^{2} $$

In that case, when the columns of $X$ are orthogonal to $(1,1,\ldots,1)^{T}$, we have $\hat{\beta}_{\lambda}^{R}=(X^{T}X+\lambda I)^{-1}X^{T}Y$ (see example sheet).
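The closed form can be sketched in a few lines. This is a minimal illustration, assuming NumPy; the function name `ridge_closed_form` and the simulated data are hypothetical. Centring the columns of $X$ and the response is one standard way to handle the unpenalised intercept, since it makes the columns orthogonal to $(1,1,\ldots,1)^{T}$.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return (beta0, beta) minimising Q_2, with the intercept left unpenalised."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean                      # centred design matrix
    p = X.shape[1]
    # Solve (Xc^T Xc + lam I) beta = Xc^T (y - y_mean) rather than forming an inverse.
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - y_mean))
    beta0 = y_mean - x_mean @ beta       # intercept recovered from the means
    return beta0, beta

# Simulated data, purely for illustration.
rng = np.random.default_rng(0)
n, p, lam = 100, 4, 3.0
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(size=n)

beta0, beta = ridge_closed_form(X, y, lam)
print(beta0, beta)
```

One way to check the result is that the gradient of $Q_{2}$ vanishes at the returned values: the residuals sum to zero and $X^{T}r=\lambda\hat{\beta}_{\lambda}^{R}$, where $r$ is the residual vector.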

Observe that $\hat{\beta}_{\lambda}^{R}$ also minimises

$$ \sum_{i=1}^{n}\left(Y_{i}-\sum_{j=1}^{p}x_{ij}\beta_{j}\right)^{2} $$

subject to $\sum_{j=1}^{p}\beta_{j}^{2}\leq s$, where $s=s(\lambda)$ is a bijective function of $\lambda$. The estimator can also be motivated in a Bayesian way, as the maximum a posteriori (MAP) and posterior mean estimator of $\beta$ under a $N_{p}\left(0,\,\frac{\sigma^{2}}{\lambda}I\right)$ prior.
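The Bayesian interpretation can be checked with a short calculation. As a sketch, assume $\sigma^{2}$ is known, ignore the intercept (or centre the data), and write the prior as $\beta\sim N_{p}(0,\tau^{2}I)$, where $\tau^{2}$ is notation introduced here for illustration only. Then the log posterior density is

$$ \log\pi(\beta\mid Y)=\mbox{const}-\frac{1}{2\sigma^{2}}\left\Vert Y-X\beta\right\Vert ^{2}-\frac{1}{2\tau^{2}}\left\Vert \beta\right\Vert ^{2}=\mbox{const}-\frac{1}{2\sigma^{2}}\left\{ \left\Vert Y-X\beta\right\Vert ^{2}+\frac{\sigma^{2}}{\tau^{2}}\left\Vert \beta\right\Vert ^{2}\right\} $$

so maximising the posterior is the same as minimising the ridge criterion with $\lambda=\sigma^{2}/\tau^{2}$. The posterior is Gaussian, so its mode and its mean coincide.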

In order to study the mean squared error (MSE) properties of $\hat{\beta}_{\lambda}^{R}$, we define $V=\left(I+\lambda\left(X^{T}X\right)^{-1}\right)^{-1}$ and $W=\left(X^{T}X+\lambda I\right)^{-1}$, so that

$$ \hat{\beta}_{\lambda}^{R}=\left(X^{T}X+\lambda I\right)^{-1}X^{T}Y=WX^{T}Y=V\hat{\beta} $$

and

$$ V-I=\left(I+\lambda(X^{T}X)^{-1}\right)^{-1}-I=WX^{T}X-I=W\left(X^{T}X-W^{-1}\right)=-\lambda W $$

Assume that $X$ has full rank $p$ (in particular, $p\leq n$), and write $\mu_{1}\geq\ldots\geq\mu_{p}>0$ for the eigenvalues of $X^{T}X$. Let $P$ be an orthogonal matrix such that $P^{T}X^{T}XP=D\equiv\mbox{diag}(\mu_{1},\ldots,\mu_{p})$. Then, for each $j=1,\ldots,p$,

$$ \det\left(W-\frac{1}{\mu_{j}+\lambda}I\right)=\det\left(\left(PDP^{T}+\lambda I\right)^{-1}-\frac{1}{\mu_{j}+\lambda}I\right) $$

$$ =\det\left(P(D+\lambda I)^{-1}P^{T}-\frac{1}{\mu_{j}+\lambda}PP^{T}\right) $$

$$ =\det\left((D+\lambda I)^{-1}-\frac{1}{\mu_{j}+\lambda}I\right)=0 $$

because the $j$-th diagonal entry of the diagonal matrix $(D+\lambda I)^{-1}-\frac{1}{\mu_{j}+\lambda}I$ is zero.
 

Thus the eigenvalues of $W$
  are $ \frac{1}{\mu_{1}+\lambda}\leq...\leq\frac{1}{\mu_{p}+\lambda} $
 .
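The identities for $V$ and $W$ and the eigenvalues of $W$ can be checked numerically. The following is a minimal sketch, assuming NumPy and a simulated full-rank design with no intercept; the dimensions, the value of `lam` and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 40, 4, 2.5
X = rng.normal(size=(n, p))   # simulated design, full rank with probability one
y = rng.normal(size=n)

G = X.T @ X
I = np.eye(p)
W = np.linalg.inv(G + lam * I)
V = np.linalg.inv(I + lam * np.linalg.inv(G))

beta_ols = np.linalg.solve(G, X.T @ y)   # least squares estimator
beta_ridge = W @ X.T @ y                 # ridge estimator

mu = np.linalg.eigvalsh(G)               # eigenvalues of X^T X
eig_W = np.linalg.eigvalsh(W)            # eigenvalues of W

print(np.allclose(V @ beta_ols, beta_ridge))                 # V beta_hat equals the ridge estimator
print(np.allclose(V - I, -lam * W))                          # V - I = -lambda W
print(np.allclose(np.sort(eig_W), np.sort(1 / (mu + lam))))  # eigenvalues 1 / (mu_j + lambda)
```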

#### Theorem 1

For sufficiently small $\lambda>0$, we have $\mbox{MSE}(\hat{\beta}_{\lambda}^{R})<\mbox{MSE}(\hat{\beta})$.

#### Proof: 


$$ \mbox{MSE}(\hat{\beta}_{\lambda}^{R})=\mathbb{E}_{\beta}\left(\left\Vert \hat{\beta}_{\lambda}^{R}-\beta\right\Vert ^{2}\right) $$

$$ =\mathbb{E}_{\beta}\left(\left\Vert V\hat{\beta}-V\beta+V\beta-\beta\right\Vert ^{2}\right)=\mathbb{E}_{\beta}\left(\left\Vert V\left(\hat{\beta}-\beta\right)\right\Vert ^{2}\right)+\beta^{T}(V-I)^{T}(V-I)\beta $$

$$ =\sigma^{2}\mbox{Tr}\left(V\left(X^{T}X\right)^{-1}V^{T}\right)+\lambda^{2}\beta^{T}W^{T}W\beta=\sigma^{2}\mbox{Tr}\left(\left(X^{T}X\right)^{-1}V^{T}V\right)+\lambda^{2}\beta^{T}\left(PDP^{T}+\lambda I\right)^{-1}\left(PDP^{T}+\lambda I\right)^{-1}\beta $$

$$ =\sigma^{2}\mbox{Tr}\left(W(I-\lambda W)\right)+\lambda^{2}\beta^{T}P\left(D+\lambda I\right)^{-2}P^{T}\beta $$

$$ =\sigma^{2}\sum_{j=1}^{p}\frac{1}{\mu_{j}+\lambda}-\lambda\sigma^{2}\sum_{j=1}^{p}\frac{1}{\left(\mu_{j}+\lambda\right)^{2}}+\lambda^{2}\sum_{j=1}^{p}\frac{\alpha_{j}^{2}}{\left(\mu_{j}+\lambda\right)^{2}} $$

$$ =\sigma^{2}\sum_{j=1}^{p}\frac{\mu_{j}}{(\mu_{j}+\lambda)^{2}}+\lambda^{2}\sum_{j=1}^{p}\frac{\alpha_{j}^{2}}{\left(\mu_{j}+\lambda\right)^{2}} $$

where $\alpha=P^{T}\beta$. The cross term in the second line vanishes because $\mathbb{E}_{\beta}(\hat{\beta})=\beta$.
 

Thus the variance is a monotone decreasing function of $\lambda$, while the squared bias is a monotone increasing function of $\lambda$. We deduce that

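$$ \frac{d}{d\lambda}\mbox{MSE}\left(\hat{\beta}_{\lambda}^{R}\right)=-2\sigma^{2}\sum_{j=1}^{p}\frac{\mu_{j}}{\left(\mu_{j}+\lambda\right)^{3}}+\sum_{j=1}^{p}\frac{\left\{ 2\left(\mu_{j}+\lambda\right)^{2}\lambda-2\lambda^{2}(\mu_{j}+\lambda)\right\} \alpha_{j}^{2}}{\left(\mu_{j}+\lambda\right)^{4}} $$

$$ =-2\sigma^{2}\sum_{j=1}^{p}\frac{\mu_{j}}{\left(\mu_{j}+\lambda\right)^{3}}+2\lambda\sum_{j=1}^{p}\frac{\mu_{j}\alpha_{j}^{2}}{\left(\mu_{j}+\lambda\right)^{3}}=2\sum_{j=1}^{p}\frac{\mu_{j}\left(\lambda\alpha_{j}^{2}-\sigma^{2}\right)}{\left(\mu_{j}+\lambda\right)^{3}}<0 $$

for $0\leq\lambda<\frac{\sigma^{2}}{\alpha_{\max}^{2}}$, where $\alpha_{\max}^{2}=\max\left(\alpha_{1}^{2},\ldots,\alpha_{p}^{2}\right)$. Since $\hat{\beta}_{0}^{R}=\hat{\beta}$, the MSE of the ridge estimator is strictly smaller than that of $\hat{\beta}$ for every $\lambda$ in $\left(0,\,\sigma^{2}/\alpha_{\max}^{2}\right)$, which proves the theorem.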
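The MSE formula and the monotonicity claim can be checked numerically. The following is a minimal sketch, assuming NumPy, a simulated design with no intercept, and arbitrary illustrative values for $\beta$ and $\sigma^{2}$; the helper `ridge_mse` is a hypothetical name. It compares the eigenvalue form of the MSE with the matrix form $\sigma^{2}\mbox{Tr}\left(V(X^{T}X)^{-1}V^{T}\right)+\beta^{T}(V-I)^{T}(V-I)\beta$, then evaluates the MSE on a grid inside $[0,\sigma^{2}/\alpha_{\max}^{2})$.

```python
import numpy as np

def ridge_mse(lam, mu, alpha, sigma2):
    """MSE of the ridge estimator in eigenvalue form: variance plus squared bias."""
    variance = sigma2 * np.sum(mu / (mu + lam) ** 2)
    bias_sq = lam ** 2 * np.sum(alpha ** 2 / (mu + lam) ** 2)
    return variance + bias_sq

# Simulated setting; beta and sigma2 are arbitrary illustrative values.
rng = np.random.default_rng(2)
n, p, sigma2 = 30, 3, 1.5
X = rng.normal(size=(n, p))
beta = np.array([1.0, -0.5, 2.0])

G = X.T @ X
mu, P = np.linalg.eigh(G)     # G = P diag(mu) P^T
alpha = P.T @ beta

# Matrix form of the MSE at one value of lambda, for comparison.
lam = 0.3
I = np.eye(p)
W = np.linalg.inv(G + lam * I)
V = I - lam * W
mse_matrix = (sigma2 * np.trace(V @ np.linalg.inv(G) @ V.T)
              + beta @ (V - I).T @ (V - I) @ beta)
mse_eigen = ridge_mse(lam, mu, alpha, sigma2)

# MSE on a grid over [0, sigma^2 / alpha_max^2), where it should be decreasing.
lam_max = sigma2 / np.max(alpha ** 2)
grid = np.linspace(0.0, lam_max, 50, endpoint=False)
mse_grid = np.array([ridge_mse(l, mu, alpha, sigma2) for l in grid])

print(mse_matrix, mse_eigen)
print(np.all(np.diff(mse_grid) < 0))
```

The first grid value, at $\lambda=0$, is the MSE of the least squares estimator, $\sigma^{2}\sum_{j}1/\mu_{j}$, so a strictly decreasing grid is consistent with Theorem 1.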
 

Unfortunately, a poor choice of $\lambda$ may substantially increase the MSE. In practice, $\lambda$ is often chosen by $v$-fold cross validation. The idea is to randomly split the data into $v$ groups (folds) of roughly the same size. For each $\lambda$ on a grid of values, and for $k=1,\ldots,v$, we compute $\hat{\beta}_{\lambda,-k}^{R}$, the ridge regression estimator of $\beta$ based on all of the data except the $k$-th fold. Writing $\kappa(i)$ for the fold to which $\left(x_{i},\,Y_{i}\right)$ belongs, we choose the value of $\lambda$ that minimises

$$ \mbox{CV}(\lambda)=\sum_{i=1}^{n}\left(Y_{i}-x_{i}^{T}\hat{\beta}_{\lambda,\,-\kappa(i)}^{R}\right)^{2} $$
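The cross validation procedure can be sketched as follows. This is a minimal illustration, assuming NumPy; the helper names `ridge_fit` and `cv_ridge`, the grid of $\lambda$ values and the simulated data are all hypothetical. Unlike the displayed criterion, the sketch also refits an unpenalised intercept on each training set, which is an assumption made here so that uncentred data can be used.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge fit with an unpenalised intercept, via centring."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]),
                           Xc.T @ (y - y_mean))
    return y_mean - x_mean @ beta, beta

def cv_ridge(X, y, lambdas, v=5, seed=0):
    """Return CV(lambda) for each lambda on the grid, using v random folds."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    fold_of = rng.permutation(n) % v   # fold label of observation i (0-based)
    cv = np.zeros(len(lambdas))
    for j, lam in enumerate(lambdas):
        for k in range(v):
            held_out = fold_of == k
            # Fit on everything except fold k, then predict fold k.
            beta0, beta = ridge_fit(X[~held_out], y[~held_out], lam)
            resid = y[held_out] - beta0 - X[held_out] @ beta
            cv[j] += np.sum(resid ** 2)
    return cv

# Simulated data, purely for illustration.
rng = np.random.default_rng(3)
n, p = 80, 5
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=n)

lambdas = np.logspace(-3, 3, 13)
cv = cv_ridge(X, y, lambdas, v=5)
best_lam = lambdas[np.argmin(cv)]
print(best_lam)
```

Setting `v` equal to the number of observations makes every fold a single point, so the same function covers the case where each observation is held out in turn.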
 

This is a fairly reliable way of choosing tuning parameters, although, because the same data are used both for fitting and for assessing the fit, the method can be prone to overfitting (choosing $\lambda$ too small). Common choices of $v$ include 5, 10 or $n$, the last of these being known as 'leave-one-out' cross validation.

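Note that the trace term in the proof of Theorem 1 arises because $V(\hat{\beta}-\beta)\sim N_{p}\left(0,\,\sigma^{2}V(X^{T}X)^{-1}V^{T}\right)$, so that $\mathbb{E}_{\beta}\left(\left\Vert V(\hat{\beta}-\beta)\right\Vert ^{2}\right)=\sigma^{2}\mbox{Tr}\left(V(X^{T}X)^{-1}V^{T}\right)$.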
 