# MATH310 - Lecture_notes5

## Some particular bi-objective problems ([see §15.5.2 in VMLS](https://web.stanford.edu/~boyd/vmls/vmls.pdf#page=342))

<font size="4">
 
We focus on solving problems of the following type: 
    
Find ${\bf x}\in \mathbb{R}^{n}$ minimizing the bi-objective function
    
$$J({\bf x}) = \|A{\bf x}-{\bf b}\|^2+ \lambda\|{\bf x}-{\bf x}^{des}\|^2, $$        
where the $m\times n$ coefficient matrix $A$ is "wide" (meaning that $n>m$, i.e. we have more unknowns than equations in the system
$A{\bf x}={\bf b}$) and the magnitude of $\lambda > 0$ indicates how strongly we want the solution ${\bf x}$ to be close to some desired ${\bf x}^{des}\in \mathbb{R}^{n}$.  
    
With $A_1=A$, ${\bf b}_1 = {\bf b}$, $A_2 = I_n$, ${\bf b}_2 = {\bf x}^{des}$, $\lambda_1 = 1$ and $\lambda_2=\lambda$ the above bi-objective function can be expressed as
    
$$J({\bf x}) = \lambda_1\|A_1{\bf x}-{\bf b}_1\|^2 + \lambda_2\|A_2{\bf x}-{\bf b}_2\|^2,$$    
    
i.e. a _weighted sum objective_ of a bi-objective least squares problem, see [§15.1 in VMLS](https://web.stanford.edu/~boyd/vmls/vmls.pdf#page=319).
    
</font>  

## An OLS-formulation of the above problem-type
<font size="4">
 
Note that the above objective function $J({\bf x})$ corresponds to the ordinary least squares (OLS) formulation for solving the system
 
    
$$\begin{bmatrix}
 A \\
\sqrt{\lambda} I_n
\end{bmatrix}{\bf x}= \begin{bmatrix}
 {\bf b} \\
\sqrt{\lambda}{\bf x}^{des}
\end{bmatrix}$$    
       
with the corresponding normal equations 
    
$$(A^tA+\lambda I_n){\bf x} = A^t{\bf b}+\lambda{\bf x}^{des}.$$    

The least squares solution of this system is

$$\hat{\bf x} = (A^tA+\lambda I_n)^{-1}(A^t{\bf b} + \lambda{\bf x}^{des})$$
$$ = (A^tA+\lambda I_n)^{-1}(A^t{\bf b} + (\lambda I_n + A^t A){\bf x}^{des}-(A^tA){\bf x}^{des})$$ 
$$ = (A^tA+\lambda I_n)^{-1}A^t({\bf b}-A{\bf x}^{des})+{\bf x}^{des}.$$    
    
Note that the inverted matrix $(A^tA+\lambda I_n)^{-1}\in \mathbb{R}^{n\times n}$.    
    
</font>  
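
As a quick numerical sanity check of the derivation above, the following minimal sketch compares the stacked OLS solution, the normal-equations solution and the shifted form $(A^tA+\lambda I_n)^{-1}A^t({\bf b}-A{\bf x}^{des})+{\bf x}^{des}$. The sizes, the random test data and the variable names (`A_stack`, `x_shift`, ...) are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 5, 12, 0.3              # wide matrix: n > m (illustrative sizes)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x_des = rng.standard_normal(n)

def J(x):
    """Bi-objective function ||Ax - b||^2 + lam * ||x - x_des||^2."""
    return np.sum((A @ x - b) ** 2) + lam * np.sum((x - x_des) ** 2)

# Stacked OLS problem [A; sqrt(lam) I_n] x = [b; sqrt(lam) x_des]
A_stack = np.vstack([A, np.sqrt(lam) * np.eye(n)])
b_stack = np.concatenate([b, np.sqrt(lam) * x_des])
x_lstsq, *_ = np.linalg.lstsq(A_stack, b_stack, rcond=None)

# Normal equations (A^t A + lam I_n) x = A^t b + lam x_des
G = A.T @ A + lam * np.eye(n)
x_normal = np.linalg.solve(G, A.T @ b + lam * x_des)

# Shifted form: solve for the correction to x_des, then add x_des back
x_shift = np.linalg.solve(G, A.T @ (b - A @ x_des)) + x_des

print(np.allclose(x_lstsq, x_normal), np.allclose(x_normal, x_shift))
```

All three routes solve an $n\times n$ (or $(m+n)\times n$) problem, which is what the next section avoids.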

## The "kernel trick" for faster solution of the above problem type
<font size="4">
 
Note that
    
$$(A^tA+\lambda I_n)A^t = A^t(AA^t+\lambda I_m),$$    
    
where both $(A^tA+\lambda I_n)$ and $(AA^t+\lambda I_m)$ are invertible matrices for $\lambda > 0$ (both are symmetric positive definite).    
    
Multiplication of the above equation from the left by $(A^tA+\lambda I_n)^{-1}$ and from the right by $(AA^t+\lambda I_m)^{-1}$ yields the identity
    
$$A^t(AA^t+\lambda I_m)^{-1} = (A^tA+\lambda I_n)^{-1}A^t.$$ 
    
Therefore the OLS solution of $\begin{bmatrix}
 A \\
\sqrt{\lambda} I_n
\end{bmatrix}{\bf x}= \begin{bmatrix}
 {\bf b} \\
\sqrt{\lambda}{\bf x}^{des}
\end{bmatrix}$
can also be expressed as


$$\hat{\bf x} = A^t(AA^t+\lambda I_m)^{-1}({\bf b}-A{\bf x}^{des})+{\bf x}^{des}.$$
    
Note that here the inverted matrix $(AA^t+\lambda I_m)^{-1}\in \mathbb{R}^{m\times m}$, which is a smaller problem for wide matrices ($n>m$).   
    
If $QR = \bar{A}=\begin{bmatrix}
  A^t \\
  \sqrt{\lambda} I_m 
  \end{bmatrix}$
is _the (reduced) QR decomposition_ of the stacked $(n+m)\times m$ matrix $\bar{A}$, so that $Q^tQ=I_m$, then
    
$$(AA^t+\lambda I_m) = \bar{A}^t\bar{A} = R^tQ^tQR = R^tR,$$    
    
and the OLS solution becomes
    
$$\hat{\bf x} = A^t(R)^{-1}(R^t)^{-1}({\bf b}-A{\bf x}^{des})+{\bf x}^{des}.$$        
    
</font>  
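
The identity and the QR route above can be checked numerically. The sketch below is one possible way to do it with NumPy, assuming random test data and illustrative sizes; the names `x_primal`, `x_dual` and `x_qr` are hypothetical labels for the three ways of computing $\hat{\bf x}$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, lam = 4, 50, 0.5              # wide matrix: n much larger than m
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x_des = rng.standard_normal(n)
r = b - A @ x_des                   # residual of the desired point

# n x n system: (A^t A + lam I_n)^{-1} A^t r + x_des
x_primal = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ r) + x_des

# m x m system: A^t (A A^t + lam I_m)^{-1} r + x_des
x_dual = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(m), r) + x_des

# QR route: A_bar = [A^t; sqrt(lam) I_m] = QR, so A A^t + lam I_m = R^t R
A_bar = np.vstack([A.T, np.sqrt(lam) * np.eye(m)])   # (n + m) x m
Q, R = np.linalg.qr(A_bar)          # default mode is 'reduced': R is m x m
z = np.linalg.solve(R.T, r)         # solves R^t z = r
w = np.linalg.solve(R, z)           # solves R w = z
x_qr = A.T @ w + x_des

print(np.allclose(x_primal, x_dual), np.allclose(x_dual, x_qr))
```

Here `np.linalg.solve` is used for the two triangular systems only to keep the example NumPy-only; it does not exploit the triangular structure of $R$. A dedicated triangular solver such as `scipy.linalg.solve_triangular` would be the more economical choice.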

## [Tikhonov regularization (Ridge regression) modelling](https://en.wikipedia.org/wiki/Tikhonov_regularization)
<font size="4">
 
Let's convert to "statistics notation" where $X$ denotes a mean centered data matrix of size $m\times n$ where typically $n>m$ (we have more variables/unknowns than samples),  ${\bf y}_0$ is the corresponding mean centered response and $\lambda>0$. Then, if the "desired" solution of the above problem type is set to ${\bf 0}$, our minimization problem is about finding $\beta\in \mathbb{R}^{n}$ minimizing the objective
    
$$J(\beta) = \|X\beta-{\bf y}_0\|^2+ \lambda\|\beta\|^2. $$    
    
The corresponding OLS-problem is
    
$$\begin{bmatrix}
X \\
\sqrt{\lambda} I_n
\end{bmatrix}\beta= \begin{bmatrix}
{\bf y}_0 \\
{\bf 0}
\end{bmatrix},$$    
    
where ${\bf y}_0 = {\bf y}-\bar{y}{\bf 1}$ (with $\bar{y}=\frac{1}{m}\sum_{i=1}^m y_i$ and ${\bf 1}$ the vector of ones) is the mean centered version of ${\bf y}$.     
    
This type of OLS-problem is often called [Tikhonov regularization (TR) or Ridge regression (RR)](https://en.wikipedia.org/wiki/Tikhonov_regularization), see [§15.3.1 and §15.4 in VMLS](https://web.stanford.edu/~boyd/vmls/vmls.pdf#page=327).
    
According to the above derivations, the least squares solution of such problems is given by
    
$$\beta_{\lambda} = X^t(XX^t+\lambda I_m)^{-1}{\bf y}_0.$$    

Analogously to PCR (see last week's notes) we predict the response value $\hat{y}$ for a new datapoint (sample) ${\bf x}^t\in \mathbb{R}^n$ based on the $\lambda$-regularized __RR-model__ by including a constant term $\beta_{0,\lambda}$ to calculate
    
$$\hat{y} = \beta_{0,\lambda} + {\bf x}^t\beta_{\lambda}.$$
    
Here $\beta_{0,\lambda} = \bar{y}-\bar{\bf x}^t\beta_{\lambda}$ where $\bar{\bf x}^t$ is the (row) vector of column means used for centering of the data matrix $X$. 
    
Note that for the particular choice ${\bf x} = \bar{\bf x}$ we obtain the prediction
    
$$\hat{y} = \beta_{0,\lambda} + \bar{\bf x}^t{\beta}_{\lambda} = \bar{y}-\bar{\bf x}^t{\beta}_{\lambda} + \bar{\bf x}^t{\beta}_{\lambda} = \bar{y},$$    
    
i.e. from the mean of the observed $X$-data we predict the mean of the observed ${\bf y}$-data, just as we did for the PCR-models.
    
</font>  
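
A minimal sketch of the RR-model above, assuming random test data: `fit_ridge` and `predict_ridge` are hypothetical helper names, not functions from any library. The fit uses the $m\times m$ formula $\beta_{\lambda} = X^t(XX^t+\lambda I_m)^{-1}{\bf y}_0$, and the last lines check that the prediction at $\bar{\bf x}$ equals $\bar{y}$.

```python
import numpy as np

def fit_ridge(X_raw, y, lam):
    """Return (beta0, beta) for the lam-regularized RR-model.

    X_raw is the uncentered m x n data matrix, y the uncentered response.
    """
    x_bar = X_raw.mean(axis=0)          # column means used for centering
    y_bar = y.mean()
    X = X_raw - x_bar                   # mean centered data matrix
    y0 = y - y_bar                      # mean centered response
    m = X.shape[0]
    # beta = X^t (X X^t + lam I_m)^{-1} y0, an m x m solve
    beta = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(m), y0)
    beta0 = y_bar - x_bar @ beta        # constant term
    return beta0, beta

def predict_ridge(beta0, beta, X_new):
    """Predict responses for one sample (1-D) or several samples (rows)."""
    return beta0 + X_new @ beta

rng = np.random.default_rng(2)
X_raw = rng.standard_normal((8, 30)) + 3.0   # m = 8 samples, n = 30 variables
y = rng.standard_normal(8)

beta0, beta = fit_ridge(X_raw, y, lam=1.0)
y_hat_at_mean = predict_ridge(beta0, beta, X_raw.mean(axis=0))
print(np.isclose(y_hat_at_mean, y.mean()))
```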

## Model validation and selection

<font size="4">
 
__Question:__ How do we select the number of principal components ($k$) in PCR and the regularization parameter value ($\lambda$) in RR to obtain models with good predictions?
    
__Answer:__ We can do [10-fold cross validation or leave-one-out cross validation (recall §13.2 in VMLS)](https://web.stanford.edu/~boyd/vmls/vmls.pdf#page=270) for the various candidate models and compare the RMS values of the held-out predictions, choosing a model with a low estimated prediction error.
    
</font>  
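
One possible way to carry out such a comparison for RR is sketched below. It is an illustration under stated assumptions, not a prescribed procedure: the synthetic data, the grid of candidate $\lambda$ values, the helper names `ridge_fit` and `cv_rms`, and the choice to pool all held-out residuals into a single RMS value are all assumptions made for the example.

```python
import numpy as np

def ridge_fit(X_raw, y, lam):
    """Fit the RR-model on uncentered data; returns (beta0, beta)."""
    x_bar, y_bar = X_raw.mean(axis=0), y.mean()
    X, y0 = X_raw - x_bar, y - y_bar
    beta = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(X.shape[0]), y0)
    return y_bar - x_bar @ beta, beta

def cv_rms(X_raw, y, lam, n_folds=10, seed=0):
    """RMS of held-out prediction errors from n_folds-fold cross validation.

    Setting n_folds equal to the number of samples gives leave-one-out.
    """
    m = X_raw.shape[0]
    idx = np.random.default_rng(seed).permutation(m)
    sq_err = []
    for test_idx in np.array_split(idx, n_folds):
        train_idx = np.setdiff1d(idx, test_idx)
        # Centering is redone inside ridge_fit using the training fold only
        beta0, beta = ridge_fit(X_raw[train_idx], y[train_idx], lam)
        resid = y[test_idx] - (beta0 + X_raw[test_idx] @ beta)
        sq_err.append(resid ** 2)
    return np.sqrt(np.mean(np.concatenate(sq_err)))

# Synthetic example data (assumption): m = 40 samples, n = 60 variables
rng = np.random.default_rng(3)
m, n = 40, 60
X_raw = rng.standard_normal((m, n))
beta_true = np.zeros(n)
beta_true[:5] = 1.0
y = X_raw @ beta_true + 0.1 * rng.standard_normal(m)

lambdas = np.logspace(-3, 3, 13)        # candidate regularization values
rms = np.array([cv_rms(X_raw, y, lam) for lam in lambdas])
best_lam = lambdas[np.argmin(rms)]
print(best_lam, rms.min())
```

The same loop could be reused for PCR by replacing `ridge_fit` with a PCR fit and looping over the number of components $k$ instead of $\lambda$.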