### Review: least squares loss function

---

Ordinary least squares regression minimizes the residual sum of squares (RSS) to fit the data:

$$ \text{minimize:}\; {\rm RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p X_{ij}\beta_j\right)\right)^2 $$

where the model prediction $\hat{y}_i$ is the intercept $\beta_0$ plus the sum of the products of each coefficient $\beta_j$ with the predictor value $X_{ij}$.

Alternatively, in matrix notation, using the predictor matrix $X$ (with a column of ones absorbing the intercept), the residuals $\epsilon$, and the vector of beta coefficients $\beta$, we write the same equation as

$$ \text{minimize:}\; {\rm RSS} = \epsilon^T \epsilon = (y - X\beta)^T (y - X\beta ) $$

The derivative with respect to all the beta coefficients becomes

$$ \frac{\partial RSS}{\partial \beta} = -2X^T y + 2X^T X\beta $$

Setting this equal to zero and solving for the beta coefficient vector (assuming $X^T X$ is invertible) gives

$$\beta = (X^T X)^{-1}X^T y $$
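
The closed-form solution above can be checked numerically. The following is a minimal sketch on synthetic data, not a production implementation: the data, the seed, and the names `X_raw`, `true_beta`, and `beta_hat` are assumptions made up for illustration. It solves the normal equations with `np.linalg.solve` (rather than forming the inverse explicitly) and confirms that the gradient vanishes at the solution.

```python
import numpy as np

# Synthetic data (illustrative assumption, not from the text above)
rng = np.random.default_rng(0)
n, p = 100, 3
X_raw = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.5, 3.0])  # intercept first, then p slopes

# Prepend a column of ones so the intercept is part of beta
X = np.column_stack([np.ones(n), X_raw])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient of the RSS evaluated at the solution; should be numerically zero
grad = -2 * X.T @ y + 2 * X.T @ X @ beta_hat

# Cross-check against NumPy's least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", beta_hat)
print("np.linalg.lstsq: ", beta_lstsq)
print("max |gradient|:  ", np.abs(grad).max())
```

Solving the linear system is generally preferred over computing $(X^T X)^{-1}$ directly, since it is cheaper and numerically more stable.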

### The Ridge penalty

---

Ridge regression adds the sum of the squared (non-intercept!) $\beta$ values to the loss function:

$$ \text{minimize:}\; {\rm RSS+Ridge} = \sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p X_{ij}\beta_j\right)\right)^2 + \lambda_2\sum_{j=1}^p \beta_j^2$$

where $\beta_j^2$ is the squared coefficient for variable $X_j$.

$\sum_{j=1}^p \beta_j^2$ is the sum of these squared coefficients for every variable we have in our model. This does **not** include the intercept $\beta_0$.

$\lambda_2$ is a constant that sets the _strength_ of the regularization. The higher this value, the greater the impact of the penalty term in the loss function. If it were zero, we would revert to the plain least squares loss function. If it were, say, a billion, then the residual sum of squares component would have a much smaller effect on the loss/cost than the regularization term, and the coefficients would be pushed very close to zero.

With the penalty added, the loss is referred to as the **penalized residual sum of squares (PRSS)**. In matrix format the Ridge PRSS is:

$$ \text{Ridge PRSS} = (y - X\beta)^T (y - X\beta) + \lambda_2 \; \left\|\beta\right\|^2_2 $$

where $\left\|\beta\right\|_2^2$ is the squared L2-norm of the coefficient vector (again, excluding the intercept).

The derivative with respect to all the beta coefficients becomes

$$ \frac{\partial PRSS}{\partial \beta} = -2X^T y + 2X^T X\beta + 2\lambda_2 \beta$$

Setting equal to zero and solving for the beta coefficient vector gives

$$\beta = (X^T X + \lambda_2\mathbb{1})^{-1}X^T y $$

where $\mathbb{1}$ denotes the identity matrix.
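
To illustrate the Ridge closed form and the effect of $\lambda_2$, here is a minimal sketch under stated assumptions. The helper `ridge_closed_form` and the synthetic data are hypothetical, introduced only for this example. To keep the intercept out of the penalty, the sketch centers $X$ and $y$ first, solves for the slopes, and then recovers the intercept from the means; this is one common way to do it, not the only one.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge slopes via (Xc^T Xc + lam * I)^{-1} Xc^T yc on centered data.

    Centering leaves the intercept unpenalized; it is recovered afterwards.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    p = X.shape[1]
    coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef

# Synthetic data (illustrative assumption)
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = 1.5 + X @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=50)

# Larger lambda shrinks the coefficient vector toward zero
lambdas = [0.0, 10.0, 1000.0]
norms = []
for lam in lambdas:
    b0, b = ridge_closed_form(X, y, lam)
    norms.append(np.linalg.norm(b))
    print(f"lambda={lam:>7}: intercept={b0:.3f}, ||beta||_2={norms[-1]:.3f}")
```

With `lam=0.0` the result coincides with ordinary least squares, and the L2-norm of the slope vector decreases as `lam` grows.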

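* In other words, Ridge PRSS = RSS + $\lambda_2 \left\|\beta\right\|_2^2$: the penalty is a sum of squares, similar in form to a squared-error term, but applied to the coefficients rather than to the residuals.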

### The Lasso penalty

---

Lasso regression takes a different approach. Instead of adding the sum of _squared_ $\beta$ coefficients to the RSS, it adds the sum of the _absolute values_ of the $\beta$ coefficients:

$$ \text{minimize:}\; {\rm RSS + Lasso} = \sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p X_{ij}\beta_j\right)\right)^2 + \lambda_1\sum_{j=1}^p |\beta_j|$$

where $|\beta_j|$ is the absolute value of the $\beta$ coefficient for variable $X_j$ (the sum of these absolute values is the L1-norm of the coefficient vector, again excluding the intercept). $\lambda_1$ is again the strength of the regularization penalty component in the loss function.

**In matrix format the Lasso PRSS is:**

$$ \text{Lasso PRSS} = (y - X\beta)^T (y - X\beta) + \lambda_1 \; \left\|\beta\right\|_1 $$

where 

$$\left\|\beta\right\|_1=\sum_{j=1}^p |\beta_j|$$ 

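Unlike Ridge, however, the Lasso has no closed-form solution for the beta coefficients in general, because the absolute value is not differentiable at zero. The coefficients are instead found with iterative methods such as coordinate descent.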
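
Since the Lasso has to be solved iteratively, a small sketch of one standard approach, cyclic coordinate descent with soft-thresholding, may help. This is an illustrative implementation under stated assumptions, not a reference solver: the names `soft_threshold` and `lasso_coordinate_descent`, the fixed iteration count, and the synthetic data are all hypothetical. For the objective RSS $+ \lambda_1 \left\|\beta\right\|_1$ as written above (no $1/2$ or $1/n$ factor), each coordinate update thresholds at $\lambda_1 / 2$.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0 when |rho| <= t."""
    return np.sign(rho) * np.maximum(np.abs(rho) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize RSS + lam * ||beta||_1 by cyclic coordinate descent.

    Data are centered so the intercept is left unpenalized.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    p = Xc.shape[1]
    beta = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0)  # x_j^T x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            # Residual with the contribution of feature j added back in
            partial_resid = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    intercept = y_mean - x_mean @ beta
    return intercept, beta

# Synthetic data with two irrelevant features (illustrative assumption)
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = 1.0 + X @ np.array([3.0, -2.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

b0_lasso, beta_lasso = lasso_coordinate_descent(X, y, lam=100.0)
b0_zero, beta_zero = lasso_coordinate_descent(X, y, lam=0.0)
print("lam=100:", beta_lasso)
print("lam=0:  ", beta_zero)
```

The soft-thresholding step is what lets the Lasso set coefficients to exactly zero, which Ridge shrinkage does not do. With `lam=0.0` the thresholding disappears and the iteration approaches the least squares fit.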

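* In other words, Lasso PRSS = RSS + $\lambda_1 \left\|\beta\right\|_1$: the penalty is a sum of absolute values, similar in form to an absolute-error term, but applied to the coefficients rather than to the residuals.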