## $\S4.$ Ridge & Logistic Regressions

**Author**: [Gilyoung Cheong](https://www.linkedin.com/in/gycheong/)

**References**
* ["The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman](https://hastie.su.domains/ElemStatLearn/)
* [Wikipedia page about ridge regression](https://en.wikipedia.org/wiki/Ridge_regression)
* [A Stack Exchange discussion about ridge regression](https://stats.stackexchange.com/questions/69205/how-to-derive-the-ridge-regression-solution) -- We use an amazing linear algebra trick, suggested by whuber in his answer, to compute the Lagrange dual function.

### Ridge Regression

"Ridge regression" is a variant of linear or polynomial regression that takes into consideration the possibility that <u>there can be a relation among the covariances of the input data</u>. (This point is explained in more detail in the notebook.)

Suppose that we are given input data $\boldsymbol{x}_1, \dots, \boldsymbol{x}_m \in \mathbb{R}^n$ (one vector per feature, each recording $n$ samples) and output data $\boldsymbol{y} = (y_1, \dots, y_n) \in \mathbb{R}^n$. The goal of linear regression is to find $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_m) \in \mathbb{R}^{m+1}$ such that with
$$X = \begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1m} \\
1 & x_{21} & x_{22} & \cdots & x_{2m} \\
\vdots & \vdots & \vdots & \cdots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nm}
\end{bmatrix}$$
and $$\boldsymbol{x}_j = (x_{1j}, \dots, x_{nj}) = \begin{bmatrix}
x_{1j} \\ \vdots \\ x_{nj}
\end{bmatrix}$$
for $1 \leq j \leq m$, the $n$-vector $X\boldsymbol{\beta}$ is the best possible approximation of $\boldsymbol{y}$ in the sense that $\|\boldsymbol{y} - X\boldsymbol{\beta}\|$ is minimized.
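The setup above can be sketched numerically. The following is a minimal illustration on synthetic data, assuming NumPy; the sizes, the random seed, and the vector `true_beta` are made up for the example and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3                                  # n samples, m features (toy sizes)
features = rng.normal(size=(n, m))            # column j plays the role of x_j
true_beta = np.array([1.0, 2.0, -1.0, 0.5])   # hypothetical (beta_0, ..., beta_m)

# Design matrix X: a leading column of ones carries the intercept beta_0.
X = np.column_stack([np.ones(n), features])
y = X @ true_beta + 0.1 * rng.normal(size=n)  # noisy synthetic outputs

# Least-squares solution: minimizes ||y - X beta|| over beta in R^(m+1).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residual_norm = np.linalg.norm(y - X @ beta_hat)

print("estimated beta:", beta_hat)
print("residual norm :", residual_norm)
```

At the minimizer the residual $\boldsymbol{y} - X\boldsymbol{\beta}$ is orthogonal to the columns of $X$, which is one way to sanity-check the fit.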

Now, the idea is that when the input data has many features, different features tend to become highly correlated. One suggested remedy is to penalize the length of the vector of linear coefficients, or equivalently its square: $\beta_0^2 + \beta_1^2 + \cdots + \beta_m^2$. That is, given $t \in (0, \infty]$, we want to

* minimize $f_0(\beta_0, \beta_1, \dots, \beta_m) = \|\boldsymbol{y} - X\boldsymbol{\beta}\|^2 = \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_m x_{im}))^2$ subject to
* $f_1(\beta_0, \beta_1, \dots, \beta_m) = \| \boldsymbol{\beta} \|^2 = \beta_0^2 + \beta_1^2 + \cdots + \beta_m^2 \leq t$.

Note that $t = \infty$ is identical to the linear regression, so we may assume that $t$ is finite. Since both $f_0$ and $f_1$ are convex functions defined on $\mathbb{R}^{m+1}$, we may check [Slater's condition](https://github.com/gycheong/machine_learning/blob/main/Convex%20Optimizations/Lagrange%20duality%20and%20Slater's%20condition.ipynb) to see whether we can obtain the minimum using the Lagrange duality of the problem. Indeed, the zero vector is strictly feasible, since $f_1(\boldsymbol{0}) = 0 < t$, and it lies in the (relative) interior of the domain. The Lagrangian is given as a function $L : \mathbb{R}^{m+1} \times \mathbb{R} \rightarrow \mathbb{R}$ defined by

$$\begin{align*} L(\boldsymbol{\beta}, \lambda) &= f_{0}(\boldsymbol{\beta}) + \lambda (f_1(\boldsymbol{\beta}) - t) \\
&= \|\boldsymbol{y} - X\boldsymbol{\beta}\|^2 + \lambda\| \boldsymbol{\beta} \|^2 - \lambda t \\
&= (\boldsymbol{y} - X\boldsymbol{\beta})^T(\boldsymbol{y} - X\boldsymbol{\beta}) + \lambda \boldsymbol{\beta}^T\boldsymbol{\beta} - \lambda t,
\end{align*}$$

where the constraint has been written in the standard form $f_1(\boldsymbol{\beta}) - t \leq 0$, and thus the Lagrangian dual function is given as a function $g : \mathbb{R} \rightarrow \mathbb{R}$ defined by

$$g(\lambda) = \inf_{\boldsymbol{\beta}\in\mathbb{R}^{m+1}}(L(\boldsymbol{\beta}, \lambda)),$$

and, by strong duality, the maximum of $g(\lambda)$ subject to $\lambda \geq 0$ equals the optimal value of the original optimization problem.

### Computation of the Lagrange dual function

Given $\lambda \geq 0$, we compute $g(\lambda)$ under a mild condition. Consider the $(n + m + 1) \times (m+1)$ matrix
$$X_{\sqrt{\lambda}} := \begin{bmatrix}
X \\
\sqrt{\lambda} I_{m+1}
\end{bmatrix},$$

where $I_{m+1}$ is the $(m+1) \times (m+1)$ identity matrix. One can check that $X_{\sqrt{\lambda}}^{T}X_{\sqrt{\lambda}} = X^TX + \lambda I_{m+1}$. Next, consider the $(n+m+1) \times 1$ column matrix
$$\boldsymbol{y}' := \begin{bmatrix}
\boldsymbol{y} \\
\boldsymbol{0}
\end{bmatrix},$$
and note that ${\boldsymbol{y}'}^T\boldsymbol{y}' = \boldsymbol{y}^T\boldsymbol{y}$ and $X_{\sqrt{\lambda}}^T\boldsymbol{y}' = X^T\boldsymbol{y}$. Hence, it follows that
$$\begin{align*}
\|\boldsymbol{y}' - X_{\sqrt{\lambda}}\boldsymbol{\beta}\|^2 &= (\boldsymbol{y}' - X_{\sqrt{\lambda}}\boldsymbol{\beta})^T(\boldsymbol{y}' - X_{\sqrt{\lambda}}\boldsymbol{\beta}) \\
&= ({\boldsymbol{y}'}^T - \boldsymbol{\beta}^TX_{\sqrt{\lambda}}^T)(\boldsymbol{y}' - X_{\sqrt{\lambda}}\boldsymbol{\beta}) \\
&= \boldsymbol{y}^T\boldsymbol{y} - \boldsymbol{\beta}^TX^T\boldsymbol{y} - \boldsymbol{y}^TX\boldsymbol{\beta} + \boldsymbol{\beta}^T(X^TX + \lambda I_{m+1})\boldsymbol{\beta} \\
&= (\boldsymbol{y} - X\boldsymbol{\beta})^T(\boldsymbol{y} - X\boldsymbol{\beta}) + \lambda \boldsymbol{\beta}^T\boldsymbol{\beta} \\
&= \|\boldsymbol{y} - X\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|^2
\end{align*}$$
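The identity above is easy to check numerically. The sketch below assumes NumPy and uses random toy data; the values of `n`, `m`, and `lam` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, lam = 20, 3, 0.7                       # arbitrary toy sizes and penalty
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
y = rng.normal(size=n)
beta = rng.normal(size=m + 1)                # any beta works for the identity

# Augmented matrix: X stacked on top of sqrt(lam) * I_{m+1}.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(m + 1)])
# Augmented output: y followed by m + 1 zeros.
y_aug = np.concatenate([y, np.zeros(m + 1)])

lhs = np.sum((y_aug - X_aug @ beta) ** 2)
rhs = np.sum((y - X @ beta) ** 2) + lam * np.sum(beta ** 2)

print("augmented residual:", lhs)
print("penalized residual:", rhs)
```

The two printed numbers should agree up to floating-point rounding, for any choice of `beta`.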

We recall from a [previous discussion about the linear regression](https://github.com/gycheong/machine_learning/blob/main/Linear%20and%20Polynomial%20Regressions/Linear%20Regression%20(theory).ipynb) that the vectors $\boldsymbol{\beta}$ that minimize the above quantity are precisely the ones that satisfy

$$X_{\sqrt{\lambda}}^{T}X_{\sqrt{\lambda}} \boldsymbol{\beta} = X_{\sqrt{\lambda}}^{T}\boldsymbol{y}',$$

or equivalently

$$(X^{T}X + \lambda I_{m+1}) \boldsymbol{\beta} = X^{T}\boldsymbol{y}.$$

This shows the following:

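**Lemma** (Explicit Lagrange dual function). For any $\lambda \geq 0$ such that $X^{T}X + \lambda I_{m+1}$ is invertible, we have
$$g(\lambda) = L(\boldsymbol{\beta}_{\lambda}, \lambda) = \|\boldsymbol{y} - X\boldsymbol{\beta}_{\lambda}\|^2 + \lambda \|\boldsymbol{\beta}_{\lambda}\|^2 - \lambda t$$

with $\boldsymbol{\beta}_{\lambda} = (X^{T}X + \lambda I_{m+1})^{-1}X^{T}\boldsymbol{y}.$ Here the term $-\lambda t$ comes from writing the constraint as $\|\boldsymbol{\beta}\|^2 - t \leq 0$; it does not depend on $\boldsymbol{\beta}$, so it does not affect the minimizer.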
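As a sketch of how $\boldsymbol{\beta}_{\lambda}$ might be computed, the snippet below solves the linear system $(X^TX + \lambda I_{m+1})\boldsymbol{\beta} = X^T\boldsymbol{y}$ directly and compares the result with ordinary least squares on the augmented system. It assumes NumPy; the helper name `ridge_closed_form` and the toy data are hypothetical.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y, assuming the matrix is invertible."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
n, m, lam = 30, 4, 2.5                       # arbitrary toy sizes and penalty
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
y = rng.normal(size=n)

beta_lam = ridge_closed_form(X, y, lam)

# Same answer from plain least squares on the augmented system.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(m + 1)])
y_aug = np.concatenate([y, np.zeros(m + 1)])
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print("closed form :", beta_lam)
print("augmented LS:", beta_aug)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a common numerical choice; both express the same formula.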

### Invertibility of Covariance Matrix as Motivation behind Ridge Regression

Note that $\lambda = 0$ corresponds to the usual linear regression, which corresponds to $t = \infty$ in the optimization problem above. If we normalize our data so that each feature column has mean $0$, then an argument similar to the [beginning of our discussion about PCA](https://github.com/gycheong/machine_learning/blob/main/PCA/PCA%20(theory).ipynb) shows that the entries of $n^{-1}X^TX$ are given by sample covariances of the input data. Thus, if $X^TX$ is NOT invertible, then these covariances satisfy a relation, namely $\det(X^TX) = 0$.

Adding the penalty $\lambda$ can help us avoid a singular matrix because

$$\det(X^TX + \lambda I_{m+1}) = (-1)^{m+1} P_{X^TX}(-\lambda),$$

where $P_{X^TX}$ is the characteristic polynomial of $X^TX$. Hence, there are at most $m+1$ choices of $\lambda$ such that $X^TX + \lambda I_{m+1}$ is NOT invertible. That is, if we pick $\lambda \geq 0$ at random, we are almost always guaranteed that $X^TX + \lambda I_{m+1}$ is invertible. In fact, since $X^TX$ is positive semidefinite, its eigenvalues are nonnegative, so $X^TX + \lambda I_{m+1}$ is invertible for every $\lambda > 0$. This is a major reason for using Ridge Regression! (Of course, there could be other reasons based on data.)
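To illustrate the point, here is a small sketch, assuming NumPy, in which one feature is an exact multiple of another, so that $X^TX$ is singular. The data and the value of `lam` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x1 = rng.normal(size=n)
# The third column is exactly twice the second, so the columns are dependent.
X = np.column_stack([np.ones(n), x1, 2.0 * x1])

gram = X.T @ X
rank_X = np.linalg.matrix_rank(X)            # rank of X equals rank of X^T X
eigs = np.linalg.eigvalsh(gram)              # eigenvalues in ascending order

lam = 1e-2                                   # any positive penalty
eigs_ridge = np.linalg.eigvalsh(gram + lam * np.eye(3))

print("rank of X               :", rank_X)
print("eigenvalues of X^T X    :", eigs)
print("eigenvalues with penalty:", eigs_ridge)
```

Adding $\lambda I$ shifts every eigenvalue of $X^TX$ up by $\lambda$, which is why the smallest one moves away from zero.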

**Remark**. Note that the polynomial regression is just a linear regression of the polynomial features of the given input data, so we can make sense of the ridge polynomial regression as well.

**Adjusting $\lambda$**. Since we checked that our optimization problem satisfies [Slater's condition](https://github.com/gycheong/machine_learning/blob/main/Convex%20Optimizations/Lagrange%20duality%20and%20Slater's%20condition.ipynb), as long as our original optimization problem has a solution $\boldsymbol{\beta}^* \in \mathbb{R}^{m+1}$, there must be some $\lambda^* \in \mathbb{R}_{\geq 0}$ such that $g(\lambda^*)$ is equal to the optimum. Thus, if such $\lambda^*$ is not one of the $m+1$ roots of the degree $m+1$ polynomial $\det(X^TX + \lambda I_{m+1})$ in $\lambda$, counting with multiplicity, then it must follow that
$$\|\boldsymbol{y} - X\boldsymbol{\beta}_{\lambda^*}\|^2 + \lambda^* (\|\boldsymbol{\beta}_{\lambda^*}\|^2 - t)  = g(\lambda^*) = f_0(\boldsymbol{\beta}^*) = \|\boldsymbol{y} - X\boldsymbol{\beta}^*\|^2,$$

where the term $-\lambda^* t$ appears because the constraint enters the Lagrangian as $\|\boldsymbol{\beta}\|^2 - t \leq 0$.

If it turns out that $\lambda^* = 0$, then $\boldsymbol{\beta}_{\lambda^*} = \boldsymbol{\beta}^*$, which would be the usual linear regression case. In practice, we just try a range of values of $\lambda$ to find a $\lambda^*$ that maximizes $\|\boldsymbol{y} - X\boldsymbol{\beta}_{\lambda}\|^2 + \lambda (\|\boldsymbol{\beta}_{\lambda}\|^2 - t)$.
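Trying several values of $\lambda$ can be sketched as follows. This is only an illustration on synthetic data, assuming NumPy; the grid `lambdas` and the coefficients used to generate `y` are arbitrary. It records how the residual $\|\boldsymbol{y} - X\boldsymbol{\beta}_{\lambda}\|^2$ and the squared length $\|\boldsymbol{\beta}_{\lambda}\|^2$ move as $\lambda$ grows, which is how $\lambda$ relates to the bound $t$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
# Hypothetical generating coefficients plus noise.
y = X @ np.array([0.5, 1.0, -2.0, 3.0]) + 0.3 * rng.normal(size=n)

lambdas = np.array([0.0, 0.1, 1.0, 10.0, 100.0])   # arbitrary trial grid
residuals, coef_norms = [], []
for lam in lambdas:
    # beta_lambda from the linear system (X^T X + lam I) beta = X^T y.
    beta_lam = np.linalg.solve(X.T @ X + lam * np.eye(m + 1), X.T @ y)
    residuals.append(np.sum((y - X @ beta_lam) ** 2))
    coef_norms.append(np.sum(beta_lam ** 2))

for lam, res, norm in zip(lambdas, residuals, coef_norms):
    print(f"lambda={lam:7.2f}  residual={res:10.4f}  ||beta||^2={norm:8.4f}")
```

A larger $\lambda$ gives a shorter coefficient vector at the cost of a larger residual, so each $\lambda$ corresponds to a smaller effective bound $t$.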

### Logistic Regression

Generally speaking, a "logistic regression" means applying a regression technique for classification purposes. The idea is very simple: if we have a model $\phi : \mathbb{R}^n \times \mathbb{R}^{m} \rightarrow \mathbb{R}$, feeding this model into the **sigmoid function** $\sigma : \mathbb{R} \rightarrow [0, 1]$ defined by
$$\sigma(t) := \frac{1}{1 + e^{-t}}$$

would give us the model $\sigma \circ \phi : \mathbb{R}^n \times \mathbb{R}^{m} \rightarrow [0,1]$ that could be trained for predicting the probability distribution of a binary categorical outcome.
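A minimal sketch of the composition $\sigma \circ \phi$, assuming NumPy and taking $\phi$ to be a linear model: the function name `linear_model`, the coefficients `beta`, and the sample inputs are all hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(t):
    """Sigmoid function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def linear_model(x, beta):
    """Hypothetical choice of phi: beta_0 + beta_1 * x_1 + ... (one row per sample)."""
    return beta[0] + x @ beta[1:]

beta = np.array([-1.0, 2.0, 0.5])            # made-up parameters
x = np.array([[0.0, 0.0],
              [1.0, 2.0],
              [-3.0, 1.0]])                  # three sample inputs

probs = sigmoid(linear_model(x, beta))       # predicted probability of class 1
labels = (probs >= 0.5).astype(int)          # threshold at 1/2 to classify

print("probabilities:", probs)
print("labels       :", labels)
```

Thresholding at $1/2$ is the same as checking the sign of $\phi$, since $\sigma(0) = 1/2$ and $\sigma$ is increasing.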

**Remark**. Although one can feed any regression model into the sigmoid function to create a logistic version of that model, it seems that when we discuss a logistic regression, it is usually assumed that the regression is a logistic linear ridge regression, as we can see in the documentation of the [LogisticRegression from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Since any polynomial regression is a linear regression on the polynomial features of the input data, such a library can be directly used when we consider a logistic polynomial (ridge) regression.
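The remark can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not a recommended recipe: the disk-shaped labels, the degree, and `C=1.0` are assumptions made for the example. In scikit-learn, `C` is the inverse of the regularization strength, so a smaller `C` plays the role of a larger $\lambda$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(5)
points = rng.uniform(-2.0, 2.0, size=(400, 2))
# Toy labels: 1 inside a disk around the origin, 0 outside (not linearly separable).
labels = (np.sum(points ** 2, axis=1) < 1.5).astype(int)

# Degree-2 polynomial features followed by L2-penalized logistic regression.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(C=1.0, max_iter=1000),
)
model.fit(points, labels)

accuracy = model.score(points, labels)       # training accuracy
probs = model.predict_proba(points[:3])      # class probabilities, one row per point

print("training accuracy:", accuracy)
print("probabilities    :", probs)
```

Because the decision boundary here is a circle, the squared features let the linear logistic model separate the two classes.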