# Linearly Dependent Features
- Suppose the features are not mutually independent:  
  - For example, suppose we have two features $x_1$ and $x_2$ with $x_1 = 3x_2$. The decision function $f = x_1 + x_2$ then reduces to $f = 4x_2$.
  - Question: What happens if we introduce $l_1$ or $l_2$ regularization?


## A Simplest Case
- Input features: $x_1, x_2 \in R$
- Outcome $y \in R$
- Linear prediction functions: $f(x) = w_1x_1 + w_2x_2$
- Suppose $x_1 = x_2$
- Then $f(x) = (w_1 + w_2)x_1$, so all weight vectors with $w_1 + w_2 = k$ define the same prediction function
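
A quick numerical check of this claim, as a minimal sketch: the data, the seed, and the three candidate weight vectors are arbitrary choices made for illustration and are not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed (assumption)
x1 = rng.normal(size=5)
X = np.column_stack([x1, x1])       # x2 is an exact copy of x1

# Three weight vectors that all satisfy w1 + w2 = 2
candidates = [np.array([2.0, 0.0]),
              np.array([1.0, 1.0]),
              np.array([-3.0, 5.0])]

predictions = [X @ w for w in candidates]

# True when every candidate produces the same predictions
same = all(np.allclose(p, predictions[0]) for p in predictions)
print(same)
```

Because the fit cannot distinguish these weight vectors, only the regularizer decides which one is selected.
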
### Example
With the $l_2$ constraint $\lVert w \rVert_2 \leq 2$, the solution lies where the line $w_1 + w_2 = 2\sqrt2$ touches the constraint region. With the $l_1$ constraint $\lVert w \rVert_1 \leq 2$, the solution lies on the line $w_1 + w_2 = 2$.  
<div align="center"><img src = "./l2.jpg" width = '500' height = '100' align = center /></div>  

<div align="center"><img src = "./l1.jpg" width = '500' height = '100' align = center /></div>

  
## Linearly Related Features
Suppose $x_2 = 2x_1$, with the same constraints as in the example above  
<div align="center"><img src = "./l2_2.jpg" width = '500' height = '100' align = center /></div>  
<div align="center"><img src = "./l1_2.jpg" width = '500' height = '100' align = center /></div>  
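
The pictures can be backed by a short calculation. This is a sketch in the same notation, where $k$ is an assumed name for the value of $w_1 + 2w_2$ fixed by the fit. With $x_2 = 2x_1$,

$$f(x) = w_1 x_1 + w_2 x_2 = (w_1 + 2w_2)\,x_1$$

so every $w$ on the line $w_1 + 2w_2 = k$ gives the same predictions. Among those, the smallest $l_2$ norm is attained at the projection of the origin onto the line, and the smallest $l_1$ norm is attained on the $w_2$ axis:

$$\underset{w_1 + 2w_2 = k}{\arg\min}\ \|w\|_2 = \frac{k}{5}\,(1,\ 2), \qquad \underset{w_1 + 2w_2 = k}{\arg\min}\ \|w\|_1 = \left(0,\ \frac{k}{2}\right)$$

The $l_2$ solution splits the weight in proportion $1 : 2$, matching the feature scales, while the $l_1$ solution puts all the weight on the larger-scale feature $x_2$.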




## Linearly Dependent Features
- Identical features
  - $l_1$ regularization spreads the weight arbitrarily (any split in which all weights have the same sign is equally good)
  - $l_2$ regularization spreads the weight evenly
- Linearly related features
  - $l_1$ regularization puts all the weight on the variable with the larger scale and gives 0 weight to the others
  - $l_2$ regularization prefers variables with larger scale: it spreads the weight in proportion to scale
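
The $l_2$ half of this summary can be checked with the closed-form ridge solution. This is a minimal sketch that assumes the penalized form $\frac{1}{n}\lVert Xw - y\rVert_2^2 + \lambda\lVert w\rVert_2^2$ rather than the constraint form used in the examples above; the helper `ridge`, the synthetic data, and `lam=0.1` are illustrative choices.

```python
import numpy as np

def ridge(X, y, lam):
    """Minimizer of (1/n)||Xw - y||^2 + lam * ||w||^2 (penalized form, an assumption)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)              # arbitrary seed
x = rng.normal(size=50)
y = 3 * x + rng.normal(size=50)

# Identical features: x2 = x1
w_identical = ridge(np.column_stack([x, x]), y, lam=0.1)

# Linearly related features: x2 = 2 * x1
w_scaled = ridge(np.column_stack([x, 2 * x]), y, lam=0.1)

print("identical:", w_identical)            # the two weights coincide
print("scaled:   ", w_scaled)               # the second weight is twice the first
```

The $l_1$ behaviour has no closed form in general, so it is not covered by this sketch.
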

# Empirical Risk for Square Loss and Linear Predictors
- Sets of $w$ giving the same empirical risk (i.e. level sets) form ellipsoids around the ERM.
- With $x_1$ and $x_2$ linearly related, we get a degenerate ellipse: the level sets stretch out into parallel lines. 
<div align="center"><img src = "./empirical_risk.jpg" width = '500' height = '100' align = center /></div>  
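
One way to see the degeneracy numerically is to look at the Hessian of the empirical risk $\frac{1}{n}\lVert Xw - y\rVert_2^2$, which is $\frac{2}{n}X^TX$. The sketch below assumes the relation $x_2 = 2x_1$ from the earlier example and uses arbitrary synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([x1, 2 * x1])       # x2 = 2 * x1

# Hessian of (1/n)||Xw - y||^2 with respect to w
hessian = (2 / n) * (X.T @ X)

# eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(hessian)
flat_direction = eigvecs[:, 0]          # eigenvector of the smallest eigenvalue

print(eigvals)                          # the smallest eigenvalue is numerically zero
print(flat_direction)                   # proportional to (2, -1), up to sign
```

Moving $w$ along `flat_direction` leaves $Xw$, and therefore the empirical risk, unchanged, which is why the ellipses never close.
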


## Correlated Features,  $l_1$ Regularization
- Intersection could be anywhere on the top right edge. 
- Minor perturbations (in data) can drastically change intersection point – very **unstable** solution. 
- Makes division of weight among highly correlated features (of same scale) seem arbitrary.  
<div align="center"><img src = "./unstable.jpg" width = '500' height = '100' align = center /></div>  


# Correlated Features and the Grouping Issue

## Example with highly correlated features
- Suppose $y$ is a linear combination of $z_1$ and $z_2$  
- We don't observe $z_1$ and $z_2$ directly, but we have 3 noisy observations of each
- We want to predict $y$ based on the noisy observations

Suppose $(x, y)$ is generated as follows:  
$$\begin{aligned} z_{1}, z_{2} & \sim \mathcal{N}(0,1) \text { (independent) } \\ \varepsilon_{0}, \varepsilon_{1}, \ldots, \varepsilon_{6} & \sim \mathcal{N}(0,1) \text { (independent) } \\ y &=3 z_{1}-1.5 z_{2}+2 \varepsilon_{0} \\ x_{j} &=\left\{\begin{array}{ll}z_{1}+\varepsilon_{j} / 5 & \text { for } j=1,2,3 \\ z_{2}+\varepsilon_{j} / 5 & \text { for } j=4,5,6\end{array}\right.\end{aligned}$$

A sample of 100 $(x,y)$ pairs was generated.  
Correlations within each group of $x$’s were around 0.97.  
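
The sampling scheme above can be sketched in NumPy as follows. The seed and the variable names are arbitrary choices, so the sample correlations will differ somewhat from the value reported here.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed (assumption)
n = 100

z = rng.normal(size=(n, 2))             # latent z1, z2
eps = rng.normal(size=(n, 7))           # eps[:, 0] perturbs y, eps[:, 1:] perturb x

y = 3 * z[:, 0] - 1.5 * z[:, 1] + 2 * eps[:, 0]

X = np.empty((n, 6))
X[:, :3] = z[:, [0]] + eps[:, 1:4] / 5  # x1, x2, x3 observe z1
X[:, 3:] = z[:, [1]] + eps[:, 4:7] / 5  # x4, x5, x6 observe z2

# 6 x 6 matrix of sample correlations between features
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```

Entries of `corr` within the first three columns, and within the last three, are the within-group correlations referred to above.
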
- Lasso regularization path  
<div align="center"><img src = "./lasso path.jpg" width = '500' height = '100' align = center /></div>  
Lines with the same color correspond to features with essentially the same information.  
As we can see, the distribution of weight among them seems almost arbitrary.

# Hedge Bets When Variables Highly Correlated  
When variables are highly correlated (and on the same scale after normalization)  
- we want to give them roughly the same weight, because we want their errors to cancel out
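
A short calculation shows why equal weights help. As a sketch, assume $m$ features observe the same signal $z$ with independent noise of variance $\sigma^2$, that is $x_j = z + \varepsilon_j$ with $\operatorname{Var}(\varepsilon_j) = \sigma^2$ (the symbols $m$ and $\sigma$ are introduced here for illustration). Averaging the features gives

$$\frac{1}{m}\sum_{j=1}^{m} x_j = z + \frac{1}{m}\sum_{j=1}^{m}\varepsilon_j, \qquad \operatorname{Var}\left(\frac{1}{m}\sum_{j=1}^{m}\varepsilon_j\right) = \frac{\sigma^2}{m}$$

whereas putting all the weight on a single feature leaves noise variance $\sigma^2$. In the example above, $m = 3$ and $\sigma = 1/5$, so equal weights reduce the noise variance from $1/25$ to $1/75$.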

# Elastic Net
The elastic net combines lasso and ridge penalties:  
$$\hat{w}=\underset{w \in \mathbf{R}^{d}}{\arg \min } \frac{1}{n} \sum_{i=1}^{n}\left\{w^{T} x_{i}-y_{i}\right\}^{2}+\lambda_{1}\|w\|_{1}+\lambda_{2}\|w\|_{2}^{2}$$  
We expect correlated random variables to have similar coefficients  
<div align="center"><img src = "./elastic net.jpg" width = '500' height = '100' align = center /></div>  
The elastic net solution is closer to the $w_2 = w_1$ line, despite the high correlation.
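
To illustrate why the ridge term pulls the solution toward $w_1 = w_2$, the sketch below evaluates the objective at two weight vectors that fit the data equally well. It assumes the extreme case of two identical features; the function name `elastic_net_objective`, the data, and the penalty values are illustrative choices.

```python
import numpy as np

def elastic_net_objective(w, X, y, lam1, lam2):
    """(1/n)||Xw - y||^2 + lam1 * ||w||_1 + lam2 * ||w||_2^2"""
    n = len(y)
    return (np.sum((X @ w - y) ** 2) / n
            + lam1 * np.sum(np.abs(w))
            + lam2 * np.sum(w ** 2))

rng = np.random.default_rng(0)          # arbitrary seed
x = rng.normal(size=50)
X = np.column_stack([x, x])             # two identical features
y = 2 * x + rng.normal(size=50)

w_corner = np.array([2.0, 0.0])         # all weight on one feature
w_even = np.array([1.0, 1.0])           # weight split evenly

# Pure lasso (lam2 = 0): both weight vectors have the same objective value
lasso_gap = (elastic_net_objective(w_corner, X, y, 0.5, 0.0)
             - elastic_net_objective(w_even, X, y, 0.5, 0.0))

# Elastic net (lam2 > 0): the even split is strictly better
enet_gap = (elastic_net_objective(w_corner, X, y, 0.5, 0.5)
            - elastic_net_objective(w_even, X, y, 0.5, 0.5))

print(lasso_gap, enet_gap)
```

The fit and the $l_1$ penalty are identical for both vectors, so the gap comes entirely from the $l_2$ term: $\lambda_2(2^2 - (1^2 + 1^2)) = 0.5 \cdot 2 = 1$.
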

## Elastic Net - “Sparse Regions”
<div align="center"><img src = "./sparse region.jpg" width = '500' height = '100' align = center /></div>  
Suppose the design matrix $X$ is orthogonal, so $X^T X = I$; the features are then uncorrelated and the contours are circles.  

Then an OLS solution in the green or red regions implies that the elastic-net constrained solution will be at a corner.

# Elastic Net Results on Model

<div align="center"><img src = "./lassoVSElas.jpg" width = '500' height = '100' align = center /></div> 

# Parameters for Correlated Features in Elastic Net  
Recall the elastic net objective function:  
$$J(w)=\frac{1}{n}\|X w-y\|_{2}^{2}+\lambda_{1}\|w\|_{1}+\lambda_{2}\|w\|_{2}^{2}$$  
Let $x_i$ denote the $i$th column of the design matrix $X$  
- Here $x_i \in R^n$ is the $i$th feature, across all training data  
- As we often do in practice, let’s assume the data are standardized so that every column $x_i$ has mean 0 and standard deviation 1  
- Then the correlation between any pair of columns $x_i$ and $x_j$ is $\rho_{i j}=\frac{1}{n} x_{i}^{T} x_{j}$
## Theorem 1  
Under the conditions described above, if $\hat{w}_{i} \hat{w}_{j}>0$, then  
$$\left|\hat{w}_{i}-\hat{w}_{j}\right| \leq \frac{\|y\|_{2} \sqrt{2}}{\sqrt{n} \lambda_{2}} \sqrt{1-\rho_{i j}}$$  
**Proof**  
By assumption, $\hat{w}_i$ and $\hat{w}_j$ are nonzero, so $J$ is differentiable in these two coordinates at $\hat{w}$, and we must have $\frac{\partial J}{\partial w_{i}}(\hat{w})=\frac{\partial J}{\partial w_{j}}(\hat{w})=0$, that is  
$$\frac{\partial J}{\partial w_{i}}(\hat{w})=\frac{2}{n}(X \hat{w}-y)^{T} x_{i}+\lambda_{1} \operatorname{sign}\left(\hat{w}_{i}\right)+2 \lambda_{2} \hat{w}_{i}=0$$  
and  
$$\frac{\partial J}{\partial w_{j}}(\hat{w})=\frac{2}{n}(X \hat{w}-y)^{T} x_{j}+\lambda_{1} \operatorname{sign}\left(\hat{w}_{j}\right)+2 \lambda_{2} \hat{w}_{j}=0$$  
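Since $\hat{w}_{i} \hat{w}_{j}>0$, we have $\operatorname{sign}(\hat{w}_{i})=\operatorname{sign}(\hat{w}_{j})$, so the $\lambda_1$ terms cancel. Subtracting the first equation from the second, we get   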
$$\begin{aligned}
\frac{2}{n}(X \hat{w}-y)^{T}\left(x_{j}-x_{i}\right)+2 \lambda_{2}\left(\hat{w}_{j}-\hat{w}_{i}\right) &=0 \\
\Longleftrightarrow\left(\hat{w}_{i}-\hat{w}_{j}\right) &=\frac{1}{n \lambda_{2}}(X \hat{w}-y)^{T}\left(x_{j}-x_{i}\right)
\end{aligned}$$  
Since $\hat{w}$ is a minimizer of $J$, we must have $J(\hat{w}) \leq J(0)$, that is  
$$\frac{1}{n}\|X \hat{w}-y\|_{2}^{2}+\lambda_{1}\|\hat{w}\|_{1}+\lambda_{2}\|\hat{w}\|_{2}^{2} \leq \frac{1}{n}\|y\|_{2}^{2}$$  
Since the regularization terms are nonnegative, we must have $\|X \hat{w}-y\|_{2}^{2} \leq\|y\|_{2}^{2}$.  
Meanwhile,  
$$\left\|x_{j}-x_{i}\right\|_{2}^{2}=x_{j}^{T} x_{j}+x_{i}^{T} x_{i}-2 x_{j}^{T} x_{i}$$  
Since the columns are standardized, we have  
$$1^{T} x_{i}=1^{T} x_{j}=0$$  
and  
$$ \frac{1}{n}x_i^Tx_i = \frac{1}{n}x_j^Tx_j = 1$$  
and the correlation between $x_i$ and $x_j$ is $\rho_{i j}=\frac{1}{n} x_{i}^{T} x_{j}$. Therefore  
$$\left\|x_{j}-x_{i}\right\|_{2}^{2}=2 n-2 n \rho_{i j}$$  
Putting things together  
$$\begin{aligned}
\left|\hat{w}_{i}-\hat{w}_{j}\right| &=\frac{1}{n \lambda_{2}}\left|(X \hat{w}-y)^{T}\left(x_{j}-x_{i}\right)\right| \\
& \leq \frac{1}{n \lambda_{2}}\|X \hat{w}-y\|_{2}\left\|x_{j}-x_{i}\right\|_{2} \text { by the Cauchy-Schwarz inequality } \\
& \leq \frac{1}{n \lambda_{2}}\|y\|_{2} \sqrt{2 n\left(1-\rho_{i j}\right)} \\
&=\frac{1}{\sqrt{n}} \frac{\sqrt{2}\|y\|_{2}}{\lambda_{2}} \sqrt{1-\rho_{i j}}
\end{aligned}$$
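
The bound can be checked numerically. The sketch below is one possible way to do so, not a reference implementation: it minimizes $J$ with a simple coordinate descent (the helper names `soft_threshold` and `elastic_net_cd`, the number of sweeps, the synthetic data, and the values of $\lambda_1$ and $\lambda_2$ are all assumptions made for illustration), then compares $|\hat{w}_1 - \hat{w}_2|$ with the right-hand side of Theorem 1.

```python
import numpy as np

def soft_threshold(c, t):
    """Shrink c toward zero by t."""
    return np.sign(c) * max(abs(c) - t, 0.0)

def elastic_net_cd(X, y, lam1, lam2, n_sweeps=500):
    """Coordinate descent for (1/n)||Xw - y||^2 + lam1*||w||_1 + lam2*||w||_2^2."""
    n, d = X.shape
    w = np.zeros(d)
    residual = y.astype(float).copy()        # always equals y - X @ w
    col_sq = (X ** 2).sum(axis=0) / n        # (1/n) x_j^T x_j
    for _ in range(n_sweeps):
        for j in range(d):
            residual += X[:, j] * w[j]       # remove feature j from the fit
            c = X[:, j] @ residual / n
            # exact minimizer of J along coordinate j
            w[j] = soft_threshold(c, lam1 / 2) / (col_sq[j] + lam2)
            residual -= X[:, j] * w[j]       # add the updated feature back
    return w

rng = np.random.default_rng(0)               # arbitrary seed
n = 100
z = rng.normal(size=n)
X = np.column_stack([z + rng.normal(size=n) / 5,
                     z + rng.normal(size=n) / 5])
X = (X - X.mean(axis=0)) / X.std(axis=0)     # mean 0 and (1/n) x_j^T x_j = 1
y = 3 * z + rng.normal(size=n)

lam1, lam2 = 0.1, 0.5
w_hat = elastic_net_cd(X, y, lam1, lam2)

rho = X[:, 0] @ X[:, 1] / n                  # correlation of the two columns
gap = abs(w_hat[0] - w_hat[1])
bound = (np.linalg.norm(y) * np.sqrt(2) / (np.sqrt(n) * lam2)
         * np.sqrt(1 - rho))

print("weights:", w_hat)
print("gap:", gap, "bound:", bound)
```

Both weights have the same sign on this data, so the theorem applies and `gap` should not exceed `bound`. The bound tightens as $\rho_{ij} \to 1$ or as $\lambda_2$ grows, which matches the grouping behaviour described earlier.
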