# Why do we need regularization?
Suppose we have a decision function $f \in \mathcal{F}$ that fits the training data very well but performs poorly on test data, i.e. it overfits. One way to address this problem is to restrict the complexity of the functions we consider.  
For linear functions: $x \to w_1x_1 + ... + w_dx_d$
- $\ell_0$ complexity: the number of all non-zero coefficients.  
- $\ell_1$ complexity (Lasso): $\sum_{i = 1}^{d}|w_i|$, for coefficient $w_i$  
- $\ell_2$ complexity (Ridge): $\sum_{i = 1}^{d}w_i^2$, for coefficient $w_i$  
<br/>

Complexity Measure:  
$\Omega : \mathcal{F} \to [0,\infty)$  
Consider all functions in $\mathcal{F}$ with complexity at most $r$:  
$\mathcal{F}_r = \{f \in \mathcal{F} \mid \Omega(f) \leq r \}$
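
As a rough illustration, the three complexity measures for a linear function and the membership test for $\mathcal{F}_r$ could be sketched in Python as follows. The helper names (`l0_complexity`, `in_F_r`, and so on) and the weight vector are made up for this example.

```python
def l0_complexity(w):
    """Number of non-zero coefficients."""
    return sum(1 for w_i in w if w_i != 0)

def l1_complexity(w):
    """Sum of absolute values of the coefficients (Lasso)."""
    return sum(abs(w_i) for w_i in w)

def l2_complexity(w):
    """Sum of squared coefficients (Ridge)."""
    return sum(w_i ** 2 for w_i in w)

def in_F_r(w, omega, r):
    """True if the linear function with weights w has complexity at most r."""
    return omega(w) <= r

w = [0.0, -2.0, 0.5, 0.0]  # hypothetical weight vector
print(l0_complexity(w))    # 2
print(l1_complexity(w))    # 2.5
print(l2_complexity(w))    # 4.25
print(in_F_r(w, l1_complexity, 3.0))  # True
```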


# Constrained Empirical Risk Minimization
## Constrained ERM (Ivanov Regularization)

For complexity measure $\Omega : \mathcal{F} \to [0,\infty)$, and a fixed $r\ge 0$,
$$ \min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i = 1}^{n}\ell(f(x_i), y_i)$$
$$s.t.\quad \Omega(f) \le r$$  
we can also write in a concise form:  
$$ \min_{f \in \mathcal{F}_r} \frac{1}{n}\sum_{i = 1}^{n}\ell(f(x_i), y_i)$$  
Choose $r$ using validation data or cross-validation

# Penalized Empirical Risk Minimization
## Penalized ERM (Tikhonov regularization) 
For complexity measure $\Omega : \mathcal{F} \to [0,\infty)$, and a fixed $\lambda\ge 0$,  
$$\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i = 1}^{n}\ell(f(x_i), y_i) + \lambda \Omega(f)$$  
Choose $\lambda$ using validation data or cross-validation
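
To make the choice of $\lambda$ concrete, here is a minimal sketch for the simplest case: a one-dimensional model $f(x) = wx$ with square loss and $\Omega(f) = w^2$. In that case, setting the derivative of the penalized objective to zero gives $w = \sum_i x_i y_i / (\sum_i x_i^2 + n\lambda)$. The model, the toy data, the grid, and the function names are all assumptions made for illustration.

```python
def fit_ridge_1d(xs, ys, lam):
    """Minimizer of (1/n) * sum((w*x - y)^2) + lam * w^2 over scalar w."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    s_xx = sum(x * x for x in xs)
    return s_xy / (s_xx + n * lam)

def mean_squared_error(w, xs, ys):
    """Average square loss of f(x) = w*x on the given data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def choose_lambda(train, valid, grid):
    """Return the lambda in `grid` with the smallest validation error."""
    (x_tr, y_tr), (x_va, y_va) = train, valid
    return min(
        grid,
        key=lambda lam: mean_squared_error(fit_ridge_1d(x_tr, y_tr, lam), x_va, y_va),
    )

# Made-up toy data, split into a training and a validation part.
train = ([1.0, 2.0, 3.0], [1.2, 1.9, 3.4])
valid = ([1.5, 2.5], [1.4, 2.6])
grid = [0.0, 0.01, 0.1, 1.0]
best_lam = choose_lambda(train, valid, grid)
print(best_lam)
```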

# Ivanov vs Tikhonov Regularization 
In fact, in most cases the two forms are equivalent.  

Ivanov and Tikhonov are equivalent if:  
- For any choice of $r > 0$, any Ivanov solution  
$$f_r^{*} \in \arg\min_{f \in \mathcal{F_r}}L(f) $$  
is also a Tikhonov solution for some $\lambda > 0$. That is $\exists \lambda > 0$, such that:  
$$f_r^{*} \in \arg\min_{f \in \mathcal{F}}L(f) + \lambda\Omega(f) $$  
- Conversely, for any choice of $\lambda > 0$, any Tikhonov solution is an Ivanov solution for some $r > 0$.  
We will discuss the details in the homework.
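
The correspondence can be checked by hand in one dimension. The sketch below assumes the model $f(x) = wx$, square loss, and the constraint $|w| \le r$. It computes the Tikhonov solution for a given $\lambda$, sets $r$ to the size of that solution, and confirms that the Ivanov solution with this $r$ is the same point. The data and function names are invented for the example.

```python
def tikhonov_1d(xs, ys, lam):
    """Minimizer of (1/n) * sum((w*x - y)^2) + lam * w^2 over scalar w."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    s_xx = sum(x * x for x in xs)
    return s_xy / (s_xx + n * lam)

def ivanov_1d(xs, ys, r):
    """Minimizer of (1/n) * sum((w*x - y)^2) subject to |w| <= r."""
    s_xy = sum(x * y for x, y in zip(xs, ys))
    s_xx = sum(x * x for x in xs)
    w_ols = s_xy / s_xx
    # The risk is a parabola in w, so the constrained minimizer is the
    # point of the interval [-r, r] closest to the unconstrained one.
    return max(-r, min(r, w_ols))

xs = [1.0, 2.0, 3.0]   # made-up inputs
ys = [2.0, 4.1, 5.9]   # made-up targets
lam = 1.0

w_tikhonov = tikhonov_1d(xs, ys, lam)
r = abs(w_tikhonov)            # the radius matching this lambda
w_ivanov = ivanov_1d(xs, ys, r)
print(w_tikhonov, w_ivanov)    # the two values coincide
```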


# $\ell_1$ and $\ell_2$ Regularization  
- Consider Linear models:  
$$\mathcal{F} = \{f : \mathbb R^d \to \mathbb R|f(x) = w^Tx, w \in \mathbb{R}^d\}$$  
- Loss: $\ell(\hat{y}, y) = (\hat y - y)^2$  
 - Ridge Regression with Ivanov Form  
$\hat{w}=\underset{\|w\|_{2}^{2} \leqslant r^{2}}{\arg \min } \frac{1}{n} \sum_{i=1}^{n}\left\{w^{T} x_{i}-y_{i}\right\}^{2}$
 <br/>
 - Ridge Regression with Tikhonov Form  
$\hat{w}=\underset{w \in \mathbf{R}^{d}}{\arg \min } \frac{1}{n} \sum_{i=1}^{n}\left\{w^{T} x_{i}-y_{i}\right\}^{2}+\lambda\|w\|_{2}^{2}$  
 <br/>
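 - How does $\ell_2$ regularization induce “regularity”?
For $\hat{f}(x) = \hat{w}^Tx$, $\hat{f}$ is **Lipschitz continuous** with Lipschitz constant $\lVert \hat w \rVert_2$: when the input moves from $x$ to $x + h$, the prediction changes by at most $\lVert \hat w \rVert_2 \lVert h \rVert_2$.  
Proof: by the Cauchy-Schwarz inequality,  
$$|\hat f(x + h) - \hat f(x)| = |\hat w^T(x + h) - \hat w^Tx| = |\hat w^Th|$$  
$$\le \lVert \hat w \rVert_2 \lVert h \rVert_2 $$  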
 <br/>
 - Lasso Regression with Ivanov Form  
$\hat{w}=\underset{\|w\|_{1} \leqslant r}{\arg \min } \frac{1}{n} \sum_{i=1}^{n}\left\{w^{T} x_{i}-y_{i}\right\}^{2}$  
 <br/>
 - Lasso Regression with Tikhonov Form  
 $\hat{w}=\underset{w \in \mathbf{R}^{d}}{\arg \min } \frac{1}{n} \sum_{i=1}^{n}\left\{w^{T} x_{i}-y_{i}\right\}^{2}+\lambda\|w\|_{1}$ 

# Lasso Gives Feature Sparsity
In the optimal solution $\hat{w}$ from Lasso, many entries are often exactly 0, which means the corresponding features are not needed for prediction.  
(A data point $x \in \mathbb{R}^d$ has $d$ features, and each feature is assigned a weight, which is the corresponding entry of $w$.)
<br/>  
## What are the benefits of Sparsity?
- Time to compute the result is reduced.
- We need less memory to store the features
- Identify the important features.  
<br/>

## Why does Lasso give Sparsity?
Consider $\ell_1$ and $\ell_2$ norm constraints in two dimensions.  
Linear Hypothesis space $\mathcal{F} = \{f(x) = w_1x_1 + w_2x_2\}$
<div align="center"><img src = "./norm constraints.jpg" width = '500' height = '100' align = center /></div>
Blue region: Area satisfying complexity constraint $|w_1| + |w_2| \le r$  

Red lines: contours of $\hat{R}_n(w) = \frac{1}{n}\sum_{i = 1}^{n}(w^Tx_i - y_i)^2$

<div align="center"><img src = "./famous l1.jpg" width = '500' height = '100' align = center /></div>


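As the figure shows, the constrained optimum tends to lie at a corner of the $\ell_1$ ball. The corners sit on the coordinate axes, so some coordinates of the solution are exactly zero, which gives sparsity.  
Suppose the design matrix $X$ has orthonormal columns, so $X^TX = I$ and the contours are circles. Then whenever the OLS solution falls in the green or red regions, the $\ell_1$-constrained solution is at a corner.  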
<div align="center"><img src = "./l1.jpg" width = '500' height = '100' align = center /></div>


## The Empirical Risk for Square Loss
Denote the empirical risk of $f(x) = w^Tx$ by:  
$$\hat{R}_n(w) = \frac{1}{n}\lVert Xw - y\rVert^2$$
where $X$ is the **design matrix**  
$\hat{R}_n(w)$ is minimized by $\hat{w} = (X^TX)^{-1}X^Ty$ (assuming $X^TX$ is invertible)  
- What does $\hat{R}_n$ look like around $\hat{w}$?  
**Complement**  
- **Proposition1** For any vectors $x,b \in R^d$ and symmetric invertible matrix $M \in R^{d \times d}$, we have 
$$
\begin{aligned}  
x^{T} M x-2 b^{T} x=\left(x-M^{-1} b\right)^{T} M\left(x-M^{-1} b\right)-b^{T} M^{-1} b
\end{aligned}
$$
- **Proposition2** (Sum of two quadratic forms in x) Suppose $f(x)$ is the sum of two quadratic forms in $x$:  
$$f(x)=(x-\mu)^{T} \Sigma^{-1}(x-\mu)+(x-\theta)^{T} V^{-1}(x-\theta)$$  
Then we can write $f$ as a single quadratic form plus a constant term, independent of $x$   
$$f(x)=\left(x-M^{-1} b\right)^{T} M\left(x-M^{-1} b\right)-b^{T} M^{-1} b+R$$  
where $M = \Sigma^{-1} + V^{-1}$, $b = \Sigma^{-1}\mu + V^{-1}\theta$, and $R = \theta^{T}V^{-1}\theta + \mu^{T}\Sigma^{-1}\mu$  
By tedious calculation and applying the above conclusion, we can get:  
$$\hat{R}_{n}(w)=\frac{1}{n}(w-\hat{w})^{T} X^{T} X(w-\hat{w})+\hat{R}_{n}(\hat{w})$$  
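
This identity can also be checked directly, without the two propositions. The sketch below assumes $X^TX$ is invertible, so that $\hat{w}$ satisfies the normal equations $X^TX\hat{w} = X^Ty$. Write $Xw - y = X(w-\hat{w}) + (X\hat{w} - y)$ and expand the square:

$$
\begin{aligned}
n\hat{R}_n(w) &= \lVert X(w-\hat{w}) + (X\hat{w}-y)\rVert^2 \\
&= (w-\hat{w})^{T}X^{T}X(w-\hat{w}) + 2(w-\hat{w})^{T}X^{T}(X\hat{w}-y) + \lVert X\hat{w}-y\rVert^2 \\
&= (w-\hat{w})^{T}X^{T}X(w-\hat{w}) + n\hat{R}_n(\hat{w})
\end{aligned}
$$

The cross term vanishes because $X^{T}(X\hat{w}-y) = X^TX\hat{w} - X^Ty = 0$. Dividing by $n$ gives the stated form.
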
Since $\hat{R}_n(\hat{w})$ does not depend on $w$, a contour of $\hat{R}_n$ is obtained by setting $\frac{1}{n}(w-\hat{w})^{T} X^{T} X(w-\hat{w}) = c$ for a constant $c > 0$. The set $\left\{w \mid (w-\hat{w})^{T} X^{T} X(w-\hat{w})=n c\right\}$ is an  **ellipsoid centered at $\hat{w}$**. 
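
As a numerical sanity check of the quadratic-form expression for $\hat{R}_n(w)$, the following sketch compares both sides for $d = 2$. The toy data and helper names are assumptions, and the $2 \times 2$ normal equations are solved with Cramer's rule only to keep the example dependency-free.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def empirical_risk(X, y, w):
    """(1/n) * ||Xw - y||^2, with X given as a list of rows."""
    return sum((dot(x_i, w) - y_i) ** 2 for x_i, y_i in zip(X, y)) / len(X)

def gram_and_ols_2d(X, y):
    """Return G = X^T X and the OLS solution for d = 2 (Cramer's rule)."""
    g11 = sum(x[0] * x[0] for x in X)
    g12 = sum(x[0] * x[1] for x in X)
    g22 = sum(x[1] * x[1] for x in X)
    b1 = sum(x[0] * t for x, t in zip(X, y))
    b2 = sum(x[1] * t for x, t in zip(X, y))
    det = g11 * g22 - g12 * g12  # assumed non-zero, i.e. X^T X invertible
    w_hat = [(b1 * g22 - g12 * b2) / det, (g11 * b2 - g12 * b1) / det]
    return [[g11, g12], [g12, g22]], w_hat

def quadratic_form_risk(X, y, w):
    """(1/n) (w - w_hat)^T X^T X (w - w_hat) + R_n(w_hat)."""
    G, w_hat = gram_and_ols_2d(X, y)
    delta = [w[0] - w_hat[0], w[1] - w_hat[1]]
    G_delta = [dot(G[0], delta), dot(G[1], delta)]
    return dot(delta, G_delta) / len(X) + empirical_risk(X, y, w_hat)

# Made-up data with n = 4 points and d = 2 features.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.1, -0.1, 1.8, 4.3]
w = [0.5, -1.0]
print(empirical_risk(X, y, w), quadratic_form_risk(X, y, w))  # agree up to rounding
```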

The Famous Picture for $\ell^2$ Regularization
<div align="center"><img src = "./famous l2.jpg" width = '500' height = '100' align = center /></div>  

## $(\ell_q)^q$ Constraints
$\ell_{q}:\left(\|w\|_{q}\right)^{q}=\left|w_{1}\right|^{q}+\left|w_{2}\right|^{q}$  
<div align="center"><img src = "./lq.jpg" width = '500' height = '100' align = center /></div>  
For $0 < q < 1$, the $(\ell_q)^q$ constraint encourages even sparser solutions, but the constraint region is no longer convex.  
<div align="center"><img src = "./lq_graph.jpg" width = '500' height = '100' align = center /></div>  
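
A small numeric sketch of why smaller $q$ favors sparse points: it compares $(\|w\|_q)^q$ for a point on an axis with a point on the diagonal that has the same $\ell_1$ norm. The function name and the two points are chosen only for illustration.

```python
def lq_to_the_q(w, q):
    """(||w||_q)^q = sum_i |w_i|^q, for q > 0."""
    return sum(abs(w_i) ** q for w_i in w)

on_axis = [1.0, 0.0]    # sparse point
diagonal = [0.5, 0.5]   # dense point with the same l1 norm

for q in (2.0, 1.0, 0.5):
    print(q, lq_to_the_q(on_axis, q), lq_to_the_q(diagonal, q))
# The axis point always costs 1. The diagonal point costs 0.5 for q = 2,
# 1 for q = 1, and about 1.414 for q = 0.5, so as q shrinks a fixed
# budget excludes more of the dense points while keeping the sparse ones.
```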

# How to Find Lasso Solution?
- $\lVert w\rVert_1 = |w_1| + |w_2|$ is not differentiable  

## Splitting a Number into Positive and Negative Parts
For any $a \in \mathbb{R}$, let $a^+ = \max\{0,a\}$ and $a^- = \max\{0,-a\}$. Some examples:  
$7^+ = 7, 7^- = 0$  
$(-3)^+ = 0, (-3)^- = 3$  
Then $a = a^+ - a^-$ and $|a| = a^+ + a^-$.  
<br/>
In Lasso, we can decompose each $w_i$ as $w_i = w_i^+ - w_i^-$, and we write $w^+ = (w_1^+,...,w_d^+)$ and $w^- = (w_1^-,...,w_d^-)$.  
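
The split into positive and negative parts is easy to express in code. This is a minimal sketch with made-up function names, applied componentwise to a weight vector:

```python
def positive_part(a):
    return max(0.0, a)

def negative_part(a):
    return max(0.0, -a)

def split(w):
    """Split a weight vector into (w_plus, w_minus), componentwise."""
    return [positive_part(w_i) for w_i in w], [negative_part(w_i) for w_i in w]

w = [7.0, -3.0, 0.0]
w_plus, w_minus = split(w)
print(w_plus)   # [7.0, 0.0, 0.0]
print(w_minus)  # [0.0, 3.0, 0.0]
```

Both identities hold in every coordinate: `w_plus[i] - w_minus[i]` recovers `w[i]`, and `w_plus[i] + w_minus[i]` equals `abs(w[i])`.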

## Lasso as a Quadratic Program  
Substituting $w = w^+ - w^-$, and $|w| = w^+ + w^-$ gives an equivalent problem:  
$$\min _{w^{+}, w^{-}} \sum_{i=1}^{n}\left(\left(w^{+}-w^{-}\right)^{T} x_{i}-y_{i}\right)^{2}+\lambda 1^{T}\left(w^{+}+w^{-}\right)$$  
$$\text{s.t.}\quad w_i^+ \geq 0,\ w_i^- \geq 0, \quad \forall i$$  
The new objective is differentiable. More precisely, it is a convex quadratic and the constraints are linear, so this is a quadratic program.  

### A Possible Confusion
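This program has $2d$ variables and $2d$ constraints. To recover $w$ via $w_i = w_i^+ - w_i^-$, we would like $w_i^+$ and $w_i^-$ to be the positive and negative parts of the same number $w_i$. But in the program they are just $2d$ free nonnegative variables: nothing forces the pair $(w_i^+, w_i^-)$ to be the positive and negative parts of $w_i$ (for example, both could be strictly positive). So is this program really equivalent to Lasso?  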

### Solve the Confusion  
The Lasso problem is trivially equivalent to the following:
$$
\min _{w} \min _{a, b} \sum_{i=1}^{n}\left((a-b)^{T} x_{i}-y_{i}\right)^{2}+\lambda 1^{T}(a+b)
$$
subject to $a_{i} \geqslant 0$ for all $i, b_{i} \geqslant 0$ for all $i$
$$
\begin{array}{l}
a-b=w \\
a+b=|w|
\end{array}
$$  
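Claim: We don't need the constraint $a + b = |w|$  
Reason: Take any $a, b \geq 0$ with $a - b = w$, and let $a' = a - \min(a,b)$ and $b' = b - \min(a,b)$ (componentwise). Then $a' - b' = w$ still holds and, for each $i$, at least one of $a'_i, b'_i$ is $0$, so $a' = w^+$, $b' = w^-$, and $a' + b' = |w|$. Since $1^T(a' + b') \leq 1^T(a + b)$, the objective does not increase, so a minimizer can always be taken to satisfy $a + b = |w|$.  
<br/> 

Claim: We don't need the constraint $a - b = w$ (or the outer minimization over $w$)  
Reason: $w$ appears only in this constraint, and $w$ itself is unconstrained (we do not require $w \geq 0$). So we can minimize over $a, b \geq 0$ freely and then set $w = a - b$.  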
<br/>

So Lasso optimization problem becomes  
$$
\min _{a, b} \sum_{i=1}^{n}\left((a-b)^{T} x_{i}-y_{i}\right)^{2}+\lambda 1^{T}(a+b)
$$
subject to $\quad a_{i} \geqslant 0$ for all $i \quad b_{i} \geqslant 0$ for all $i$  
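
A small sketch of why free nonnegative $a, b$ can always be taken to be the positive and negative parts of $a - b$: subtracting $\min(a_i, b_i)$ from both coordinates leaves $a - b$ (and hence the fit term) unchanged and can only lower the penalty $1^T(a + b)$. The function name and the numbers are made up.

```python
def reduce_pair(a, b):
    """Shift nonnegative vectors a, b so that, in each coordinate,
    at least one of the two entries is 0 while a - b is unchanged."""
    shift = [min(a_i, b_i) for a_i, b_i in zip(a, b)]
    a_new = [a_i - s for a_i, s in zip(a, shift)]
    b_new = [b_i - s for b_i, s in zip(b, shift)]
    return a_new, b_new

a = [5.0, 1.0, 0.0]
b = [2.0, 4.0, 0.0]
a_new, b_new = reduce_pair(a, b)
print(a_new, b_new)                      # [3.0, 0.0, 0.0] [0.0, 3.0, 0.0]
print(sum(a) + sum(b))                   # 12.0, penalty term before
print(sum(a_new) + sum(b_new))           # 6.0, penalty term after
```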

## Projected SGD  
$$
\min _{w^{+}, w^{-} \in \mathbf{R}^{d}} \sum_{i=1}^{n}\left(\left(w^{+}-w^{-}\right)^{T} x_{i}-y_{i}\right)^{2}+\lambda 1^{T}\left(w^{+}+w^{-}\right)
$$
subject to $w_{i}^{+} \geqslant 0$ for all $i$, $\quad w_{i}^{-} \geqslant 0$ for all $i$  

This is just like SGD, but after each step we project back onto the constraint set: any component of $w^+$ or $w^-$ that has become negative is set back to $0$.  
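
A minimal sketch of projected SGD for this problem is given below. It assumes a constant step size, a fixed number of passes over the data, and that the penalty $\lambda 1^T(w^+ + w^-)$ is spread evenly over the $n$ terms of the sum. These choices, the toy data, and the function names are illustrative, not prescribed by the notes.

```python
import random

def projected_sgd_lasso(X, y, lam, step=0.01, epochs=200, seed=0):
    """Projected SGD on sum_i ((w+ - w-)^T x_i - y_i)^2 + lam * 1^T (w+ + w-)."""
    n, d = len(X), len(X[0])
    w_pos, w_neg = [0.0] * d, [0.0] * d
    rng = random.Random(seed)
    for _ in range(epochs):
        for i in rng.sample(range(n), n):  # one pass over the data in random order
            residual = sum((w_pos[j] - w_neg[j]) * X[i][j] for j in range(d)) - y[i]
            for j in range(d):
                # Gradient of the i-th term plus its share lam / n of the penalty.
                grad_pos = 2.0 * residual * X[i][j] + lam / n
                grad_neg = -2.0 * residual * X[i][j] + lam / n
                # Gradient step, then projection: negative components go back to 0.
                w_pos[j] = max(0.0, w_pos[j] - step * grad_pos)
                w_neg[j] = max(0.0, w_neg[j] - step * grad_neg)
    return w_pos, w_neg

def lasso_objective(X, y, lam, w_pos, w_neg):
    w = [p - m for p, m in zip(w_pos, w_neg)]
    fit = sum(
        (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) - y_i) ** 2
        for x_i, y_i in zip(X, y)
    )
    return fit + lam * (sum(w_pos) + sum(w_neg))

# Made-up data generated by y = 2 * x_1, so the second feature is irrelevant.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, 0.0, 2.0, 4.0]
w_pos, w_neg = projected_sgd_lasso(X, y, lam=0.1)
print([p - m for p, m in zip(w_pos, w_neg)])
```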

## Coordinate Descent Method
Goal: Minimize $L(w) = L(w_1, w_2,..., w_d)$ over $w = (w_1,..., w_d) \in \mathbb{R}^d$. In each step, we minimize over a single coordinate $w_i$ while holding all the others fixed: 
$$w_{i}^{\text {new }}=\underset{w_{i}}{\arg \min } L\left(w_{1}, \ldots, w_{i-1}, \mathbf{w}_{\mathbf{i}}, w_{i+1}, \ldots, w_{d}\right)$$  
Example:  
Suppose $w \in \mathbb{R}^3$. We first fix $w_2$ and $w_3$ and search over $w_1$ for a value $w_1^*$ such that $L(w_1^*, w_2, w_3) \leq L(w_1, w_2, w_3)$ for all $w_1$. Then we fix $w_1^*$ and $w_3$ and search for the optimal $w_2 = w_2^*$. Finally, with $w_1^*$ and $w_2^*$ fixed, we search for $w_3^*$.<br/>

**Algorithm**  
Goal: Minimize $L(w) = L(w_1, w_2,..., w_d)$ over $w = (w_1,..., w_d) \in R^d$
- Initialize $w^{(0)} = 0$  
- While not converged:
  - Choose a coordinate $j \in \{1,...,d\}$
  - $w_{j}^{\text {new }} \leftarrow \arg \min _{w_{j}} L\left(w_{1}^{(t)}, \ldots, w_{j-1}^{(t)}, \mathbf{w}_{\mathbf{j}}, w_{j+1}^{(t)}, \ldots, w_{d}^{(t)}\right)$
  - $w^{(t+1)} \leftarrow w^{(t)}$, then overwrite coordinate $j$: $w_{j}^{(t+1)} \leftarrow w_{j}^{\text {new }}$
  - $t \leftarrow t+1$
- Coordinate descent works well when minimizing with respect to one coordinate at a time is easy.
- Random coordinate choice: Stochastic Coordinate Descent
- Cyclic coordinate choice: Cyclic Coordinate Descent  
<br/>  
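
To make the algorithm concrete, here is a sketch of cyclic coordinate descent on an objective where each coordinate minimization is easy: the quadratic $L(w) = \frac{1}{2}w^TAw - b^Tw$ with $A$ symmetric positive definite. This objective is an assumption chosen for illustration. Setting $\partial L / \partial w_j = 0$ with the other coordinates fixed gives $w_j = (b_j - \sum_{k \neq j} A_{jk} w_k) / A_{jj}$. A fixed number of sweeps stands in for a convergence test.

```python
def coordinate_descent_quadratic(A, b, sweeps=50):
    """Cyclic coordinate descent for L(w) = 0.5 * w^T A w - b^T w,
    with A symmetric positive definite (so every A[j][j] > 0)."""
    d = len(b)
    w = [0.0] * d  # initialize w^(0) = 0
    for _ in range(sweeps):
        for j in range(d):  # cyclic coordinate choice
            # Exact minimization over w_j with all other coordinates fixed.
            rest = sum(A[j][k] * w[k] for k in range(d) if k != j)
            w[j] = (b[j] - rest) / A[j][j]
    return w

A = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 5.0]
w = coordinate_descent_quadratic(A, b)
print(w)  # close to [0.8, 1.4], the exact minimizer, which solves A w = b
```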
**Sufficient Conditions:**  
Suppose we want to minimize $f : \mathbb{R}^d \to \mathbb{R}$. Sufficient conditions for coordinate descent to converge:  
- $f$ is continuously differentiable and   
- $f$ is strictly convex in each coordinate  

**Weaker Condition:**  
Theorem:  
If the objective $f$ has the following structure:  
$$f\left(w_{1}, \ldots, w_{d}\right)=g\left(w_{1}, \ldots, w_{d}\right)+\sum_{j=1}^{d} h_{j}\left(w_{j}\right)$$  
where $g: \mathbb{R}^d \to \mathbb{R}$ is differentiable and convex and each $h_j: \mathbb{R} \to \mathbb{R}$ is convex but not necessarily differentiable,  
then the coordinate descent algorithm converges to the global minimum.  
The Lasso objective has this structure, with $g$ the squared-error term and $h_j(w_j) = \lambda|w_j|$.

## Coordinate Descent Method for Lasso
- Why mention coordinate descent for Lasso? 
- In Lasso, the coordinate minimization has a closed-form solution!
<br/> 

Closed-Form Coordinate Minimization for Lasso (for details, see Homework 1):  
$$\hat{w}_{j}=\underset{w_{j} \in \mathbf{R}}{\arg \min } \sum_{i=1}^{n}\left(w^{T} x_{i}-y_{i}\right)^{2}+\lambda\|w\|_{1}$$  
then  
$$\hat{w}_{j}=\left\{\begin{array}{ll}
\left(c_{j}+\lambda\right) / a_{j} & \text { if } c_{j}<-\lambda \\
0 & \text { if } c_{j} \in[-\lambda, \lambda] \\
\left(c_{j}-\lambda\right) / a_{j} & \text { if } c_{j}>\lambda
\end{array}\right.$$  
where  
$$a_{j}=2 \sum_{i=1}^{n} x_{i, j}^{2}$$  
$$c_{j}=2 \sum_{i=1}^{n} x_{i, j}\left(y_{i}-w_{-j}^{T} x_{i,-j}\right)$$  
Here $w_{-j}$ denotes $w$ with component $j$ removed, and similarly for $x_{i,-j}$.
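
Putting the closed-form update into the cyclic coordinate descent loop gives a short Lasso solver. The sketch below assumes every feature column is non-zero (so $a_j > 0$) and runs a fixed number of sweeps instead of testing convergence. The function names and the toy data are invented for the example.

```python
def soft_threshold_update(a_j, c_j, lam):
    """Closed-form minimizer over w_j, given a_j > 0 and c_j."""
    if c_j < -lam:
        return (c_j + lam) / a_j
    if c_j > lam:
        return (c_j - lam) / a_j
    return 0.0

def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Cyclic coordinate descent for sum_i (w^T x_i - y_i)^2 + lam * ||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            a_j = 2.0 * sum(X[i][j] ** 2 for i in range(n))
            # Residual of the prediction that ignores feature j.
            c_j = 2.0 * sum(
                X[i][j] * (y[i] - sum(w[k] * X[i][k] for k in range(d) if k != j))
                for i in range(n)
            )
            w[j] = soft_threshold_update(a_j, c_j, lam)
    return w

# Made-up example with an orthonormal design, where the answer can be checked
# by hand: the objective separates into (w_1 - 3)^2 + |w_1| and (w_2 - 0.2)^2 + |w_2|.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.2]
w = lasso_coordinate_descent(X, y, lam=1.0)
print(w)  # [2.5, 0.0]
```

The second coordinate is set exactly to zero because $c_2 = 0.4$ lies in $[-\lambda, \lambda]$, which is the sparsity effect discussed earlier.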