# Lasso regression

*Linear regression loss with $L_1$ regularization penalty*

---
* [Implementation in Python](../pymlalgo/regression/lasso_regression.py)
* [Demo](../demo/lasso_regression_demo.ipynb)
---

### Symbols and Conventions
Refer to [Symbols and Conventions](symbols_and_conventions.ipynb) for details. In summary:
* $n$ is the number of training examples
* $d$ is the number of features in each training example (a.k.a. the dimension of the training example)
* $X$ is the features matrix of shape $n \times d$
* $Y$ is the labels matrix of shape $n \times 1$
* $W$ is the weight matrix of shape $d \times 1$


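The cost function for Lasso regression can be written as follows. The derivation writes the weights as $\beta$ (the $W$ above) and uses $X$ in its transposed $d \times n$ layout, so that $X^{T}\beta$ has shape $n \times 1$ and $X_j$ is the $j^{th}$ row (feature $j$ across all $n$ examples):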

 

$$F(\beta) = \frac{1}{n} ||Y - X^{T}\beta||_2^2 + \lambda||\beta||_1$$


The minimization problem is written as:

$$\min_{\beta_j \in \mathbb{R}} F(\beta) = \frac{1}{n} ||Y - X^{T}\beta||_2^2 + \lambda||\beta||_1$$

where $j = 1, 2, \ldots, d$.

While minimizing w.r.t. $\beta_j$, all the other coefficients $\beta_1, \ldots, \beta_{j-1}, \beta_{j+1}, \ldots, \beta_d$ are held constant. Cycling through the coordinates in this way is known as coordinate descent.
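As a rough illustration, the objective can be evaluated with a few lines of NumPy. This is only a sketch under stated assumptions: the function name `lasso_objective` is hypothetical, `X` is stored with shape `(d, n)` so that `X.T @ beta` mirrors $X^{T}\beta$ in the formula, and `Y` and `beta` are 1-D arrays rather than column matrices.

```python
import numpy as np


def lasso_objective(beta, X, Y, lam):
    """F(beta) = (1/n) * ||Y - X^T beta||_2^2 + lam * ||beta||_1.

    Assumption: X has shape (d, n), Y has shape (n,), beta has shape (d,).
    """
    n = Y.shape[0]
    residual = Y - X.T @ beta                 # shape (n,)
    return residual @ residual / n + lam * np.sum(np.abs(beta))


# Tiny hand-checkable example: d = 2 features, n = 3 examples.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
Y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, -1.0])

# Residual is [0, 3, 3], so the squared-error term is 18 / 3 = 6,
# and the penalty term is 0.5 * (|1| + |-1|) = 1.
value = lasso_objective(beta, X, Y, lam=0.5)
```

If the data is stored in the $n \times d$ layout instead, `X @ beta` would take the place of `X.T @ beta`.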

 

 

Let $g(\beta)$ denote the first term of the objective function (the mean squared error) and $h(\beta)$ the second term (the $L_1$ penalty).

 

 

Since $g$ is differentiable, its subgradient is equal to its gradient. We can find the derivative using the chain rule:

$$\partial_{\beta_j} g(\beta) = \nabla_{\beta_j} g(\beta) = -\frac{2}{n}X_j(Y-X^T\beta)$$

$$=-\frac{2}{n}X_j(Y-X_j^T\beta_j - X_{-j}^{T}\beta_{-j})$$

where $X_{-j}$ and ${\beta_{-j}}$ are the predictor matrix and the coefficient vector with the $j^{th}$ dimension removed.

$$=-\frac{2}{n}X_j(Y - X_{-j}^{T}\beta_{-j}) - \frac{2}{n}X_j(-X_j^T\beta_j)$$

$$=-\frac{2}{n}X_j(Y - X_{-j}^{T}\beta_{-j}) - \frac{2}{n}(-1)X_jX_j^T\beta_j$$

$$=-\frac{2}{n}X_j(Y - X_{-j}^{T}\beta_{-j}) - \frac{2}{n}(-1)||X_j||_2^2\beta_j$$

$$=-\frac{2}{n}(X_j(Y - X_{-j}^{T}\beta_{-j}) - ||X_j||_2^2\beta_j)$$

 

Let $z_j = ||X_j||_2^2$ and $R_{-j} = Y - X_{-j}^{T}\beta_{-j}$. The partial derivative can then be written as:

$$\partial_{\beta_j} g(\beta) = -\frac{2}{n}(X_jR_{-j} - \beta_jz_j)$$
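A quick numerical check can make this expression easier to trust. The sketch below compares the closed form against a central finite difference of $g$. The names `partial_g` and `g` are hypothetical helpers, the data is random, and `X` is assumed to have shape `(d, n)` so that `X[j]` plays the role of $X_j$.

```python
import numpy as np


def g(beta, X, Y):
    """Smooth part of the objective: (1/n) * ||Y - X^T beta||_2^2."""
    residual = Y - X.T @ beta
    return residual @ residual / Y.shape[0]


def partial_g(beta, X, Y, j):
    """Closed form -(2/n) * (X_j R_{-j} - beta_j z_j); X has shape (d, n)."""
    n = Y.shape[0]
    x_j = X[j]                                   # feature j across all examples
    z_j = x_j @ x_j                              # ||X_j||_2^2
    # Y - X_{-j}^T beta_{-j}: add feature j's contribution back to the residual.
    r_minus_j = Y - X.T @ beta + x_j * beta[j]
    return -2.0 / n * (x_j @ r_minus_j - beta[j] * z_j)


rng = np.random.default_rng(0)
X = rng.normal(size=(3, 20))                     # d = 3, n = 20
Y = rng.normal(size=20)
beta = rng.normal(size=3)

# Central finite difference along coordinate j as an independent check.
j, eps = 1, 1e-6
step = np.zeros(3)
step[j] = eps
numeric = (g(beta + step, X, Y) - g(beta - step, X, Y)) / (2 * eps)
analytic = partial_g(beta, X, Y, j)
```

Because $g$ is quadratic in $\beta_j$, the two values should agree up to floating-point rounding.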

 

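Now, let's tackle $h(\beta)$:

$$h(\beta) = \lambda||\beta||_1 = \lambda\sum_{j=1}^{d}|\beta_j|$$

Differentiating w.r.t. $\beta_j$, only the $|\beta_j|$ term contributes, since the other terms of the sum do not depend on $\beta_j$. Because $|\beta_j|$ is not differentiable at 0, the subdifferential is used there: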

$$\partial _{\beta_j} h(\beta) = \partial _{\beta_j} (\lambda|\beta_j|) = \begin{cases}\lambda & \beta_j > 0 \\
-\lambda & \beta_j < 0 \\
v\lambda  & \beta_j = 0, v \in [-1, 1] \\
\end{cases}$$
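To stress that the result is a set rather than a single number at zero, here is a minimal sketch that returns the subdifferential of $\lambda|\beta_j|$ as an interval. The function name `subdifferential_l1` and the `(low, high)` return convention are assumptions made for illustration only.

```python
def subdifferential_l1(beta_j, lam):
    """Return (low, high), the interval of subgradients of lam * |beta_j|."""
    if beta_j > 0:
        return (lam, lam)        # single slope +lam
    if beta_j < 0:
        return (-lam, -lam)      # single slope -lam
    return (-lam, lam)           # every v * lam with v in [-1, 1]


interval_at_zero = subdifferential_l1(0.0, lam=0.5)
```

Away from zero the interval collapses to one point, which is the ordinary derivative.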

Combining the results for $g$ and $h$, we can write

$$\partial_{\beta_j} F(\beta) =\begin{cases}-\frac{2}{n}(X_jR_{-j} - \beta_jz_j) + \lambda & \beta_j > 0 \\
-\frac{2}{n}(X_jR_{-j} - \beta_jz_j)-\lambda & \beta_j < 0 \\
-\frac{2}{n}(X_jR_{-j} - \beta_jz_j) + v\lambda  & \beta_j = 0, v \in [-1, 1] \\
\end{cases}$$

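Setting each of the three cases to 0 and solving for $\beta_j$, we get the update below. The conditions follow from requiring the sign of the solution to match its case, and from $v \in [-1, 1]$ when $\beta_j = 0$.

$$\beta_j =\begin{cases}\frac{\lambda + \frac{2}{n}X_j R_{-j}}{\frac{2}{n}z_j} & \frac{2}{n}X_jR_{-j} < -\lambda \\
\frac{-\lambda + \frac{2}{n}X_j R_{-j}}{\frac{2}{n}z_j} & \frac{2}{n}X_jR_{-j} > \lambda \\
0 & |\frac{2}{n}X_jR_{-j}| \le \lambda\\
\end{cases}$$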
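The closed-form solution for $\beta_j$ is a soft-thresholding step, and repeating it over all coordinates gives a simple cyclic coordinate descent. The sketch below is one possible way to write it and is not necessarily how the linked implementation does it: the function names, the fixed number of sweeps, and the `(d, n)` layout of `X` are all assumptions. The coefficient is set to zero when $|\frac{2}{n}X_jR_{-j}| \le \lambda$.

```python
import numpy as np


def soft_threshold_update(X, Y, beta, j, lam):
    """Minimizer of F over beta_j with all other coordinates held fixed."""
    n = Y.shape[0]
    x_j = X[j]                                   # X has shape (d, n)
    z_j = x_j @ x_j                              # ||X_j||_2^2
    r_minus_j = Y - X.T @ beta + x_j * beta[j]   # Y - X_{-j}^T beta_{-j}
    rho = 2.0 / n * (x_j @ r_minus_j)            # (2/n) X_j R_{-j}
    if rho < -lam:
        return (rho + lam) / (2.0 / n * z_j)
    if rho > lam:
        return (rho - lam) / (2.0 / n * z_j)
    return 0.0                                   # |rho| <= lam


def cyclic_coordinate_descent(X, Y, lam, n_sweeps=100):
    """Sweep over coordinates 0..d-1 a fixed number of times (assumed schedule)."""
    d = X.shape[0]
    beta = np.zeros(d)
    for _ in range(n_sweeps):
        for j in range(d):
            beta[j] = soft_threshold_update(X, Y, beta, j, lam)
    return beta


# Toy data with two orthogonal features: d = 2, n = 4.
X = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
Y = np.array([2.0, 0.1, 2.0, 0.1])
beta_hat = cyclic_coordinate_descent(X, Y, lam=0.5)
```

On this toy data the first coefficient is shrunk from its least-squares value of 2 to 1.5, while the second is set exactly to 0, which illustrates the sparsity that the $L_1$ penalty encourages.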