# Questions
* If the loss function is not the squared loss, what is the geometric shape of the risk?
* Does it matter whether we include the regularization term when we compute the loss for ridge regression?
* Does convexity always imply continuity?
* What are the differences between sup and max, and between inf and min?

# Lasso

# Elastic Net

## Lasso vs. Elastic Net
* Lasso yields sparse solutions (many weights are exactly zero). The elastic net retains this property and, in addition, handles groups of correlated features more stably.


### How do Lasso and Elastic Net handle correlated features?
* For highly correlated features on the same scale (assuming they are normalized):
  * Lasso can pick a solution where the weights add up to a constant. However, the solution can be unstable: a minor change in the training data can cause the algorithm to pick another solution where the weights still add up to the same constant but are distributed differently.
  * Elastic net spreads the weight more evenly across the correlated features (i.e. the weights take about the same value), and the solution is stable.
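
The contrast above can be reproduced on a toy problem. The following is a minimal sketch, not a reference implementation: it assumes the objective $\frac{1}{2n}\sum_i (y_i - w^Tx_i)^2 + \lambda_1\|w\|_1 + \frac{\lambda_2}{2}\|w\|_2^2$, solves it by plain coordinate descent, and uses two features that are exact copies of each other. The function names, the data, and the values of `lam1` and `lam2` are all made up for illustration.

```python
def soft_threshold(a, t):
    """Soft-thresholding operator: sign(a) * max(|a| - t, 0)."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0


def elastic_net_cd(X, y, lam1, lam2, w_init, sweeps=500):
    """Coordinate descent for
    (1/(2n)) * sum_i (y_i - w.x_i)^2 + lam1 * ||w||_1 + (lam2/2) * ||w||_2^2.
    Setting lam2 = 0 gives the Lasso objective."""
    n, d = len(X), len(X[0])
    w = list(w_init)
    for _ in range(sweeps):
        for j in range(d):
            # Average product of feature j with the residual that leaves feature j out.
            rho = sum(
                X[i][j] * (y[i] - sum(w[k] * X[i][k] for k in range(d) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam1) / (z + lam2)
    return w


# Hypothetical toy data: the two features are identical (perfectly correlated).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
X = [[v, v] for v in x]
y = [2.0 * v for v in x]

# Different starting points stand in for small perturbations of the training run.
starts = [(0.0, 0.0), (0.0, 1.0), (0.0, 3.0)]
lasso_fits = [elastic_net_cd(X, y, lam1=0.5, lam2=0.0, w_init=s) for s in starts]
enet_fits = [elastic_net_cd(X, y, lam1=0.5, lam2=1.0, w_init=s) for s in starts]

for s, wl, we in zip(starts, lasso_fits, enet_fits):
    print("start", s, "lasso", [round(v, 3) for v in wl], "enet", [round(v, 3) for v in we])
```

On this toy data the Lasso runs end at different weight vectors that share the same sum, while the elastic net runs agree on a single solution with equal weights.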
  
### Why keep these correlated features instead of selecting only one of them?
* Errors might cancel out for evenly distributed weights (which correspond to highly correlated features).
* We may get a smaller variance if we combine several correlated features instead of relying on one. For example, if $x_1, x_2, x_3$ are three noisy measurements of the same quantity, their average $z_1 = (x_1+x_2+x_3)/3$ has a smaller variance than any single $x_i$, as long as the noise in the three features is not perfectly correlated.
* It's easier to implement the algorithm if we don't need to select the features. 
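
The variance argument can be checked with a quick simulation. This is a sketch under a stated assumption: each feature is the same signal plus independent Gaussian noise with unit variance, in which case the average of three features has variance $1/3$. The variable names and the sample size are illustrative.

```python
import random
import statistics

random.seed(0)

# Assumption for illustration: x1, x2, x3 are the same signal plus independent noise.
signal = 1.0
n_draws = 20000

x1_samples = []
z1_samples = []
for _ in range(n_draws):
    x1 = signal + random.gauss(0.0, 1.0)
    x2 = signal + random.gauss(0.0, 1.0)
    x3 = signal + random.gauss(0.0, 1.0)
    x1_samples.append(x1)
    z1_samples.append((x1 + x2 + x3) / 3.0)

var_single = statistics.pvariance(x1_samples)
var_avg = statistics.pvariance(z1_samples)
print("variance of x1:", round(var_single, 3))
print("variance of z1:", round(var_avg, 3))
```

If the noise in the three features were strongly correlated, the reduction would be smaller.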

### How significantly would you expect Lasso and elastic net weights to change when training each on two different random training sets from the population, provided that there are some correlated features? How much would the test performance change in each case? Why would it be a nuisance if the trained weights differ between two random training sets from the same population?
* Lasso weights might change but elastic net weights are more stable.
* The test performance might not change much for either algorithm.
* It's a nuisance because it makes it harder to analyze the impact of each feature and harder to debug your algorithm.


### True or False? If the two features $x_1$ and $x_2$ of a linear classifier $w^Tx$ are perfectly correlated (e.g. $x_2=rx_1$), and $w_1$ and $w_2$ are the weight coefficients corresponding to the two features, then the level sets of the empirical risk are straight lines of the form $w_1 + rw_2=c$. In particular, there are infinitely many empirical risk minimizers with different coefficient vectors $w$.

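* True. Since $x_2 = rx_1$, the prediction is $w^Tx = (w_1 + rw_2)x_1$, so the empirical risk depends on $w$ only through $w_1 + rw_2$. With only two features, the level sets of the empirical risk are therefore straight lines, and every $w$ on the line $w_1 + rw_2=c$ that contains a minimizer is itself a minimizer.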
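
A small numerical check of this claim, as a sketch: it assumes the square loss for the empirical risk (the statement itself does not fix a loss) and uses made-up data with $r = 2$. The function name and all values are illustrative.

```python
def empirical_risk(w1, w2, x1_values, y_values, r):
    """Mean squared error of the predictor w1*x1 + w2*x2 when x2 = r*x1."""
    n = len(x1_values)
    return sum(
        (w1 * x1 + w2 * (r * x1) - y) ** 2 for x1, y in zip(x1_values, y_values)
    ) / n


r = 2.0
x1_values = [-1.0, 0.5, 1.0, 2.0]   # made-up feature values
y_values = [-2.5, 2.0, 3.0, 6.5]    # made-up responses

# Several weight vectors on the same line w1 + r*w2 = c.
c = 3.0
points_on_line = [(c - r * w2, w2) for w2 in (-1.0, 0.0, 0.5, 2.0)]
risks_on_line = [empirical_risk(w1, w2, x1_values, y_values, r) for w1, w2 in points_on_line]

# A weight vector on a different line, w1 + r*w2 = c + 1.
risk_off_line = empirical_risk(c + 1.0, 0.0, x1_values, y_values, r)

print("risks on the line:", risks_on_line)
print("risk off the line:", risk_off_line)
```

All points on the line give the same risk, while the point on the neighboring line gives a different one.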

#### With the above setting, what is the number of regularized empirical risk minimizers if we add $L_1$ regularization? Explain this for the cases where $r=1$ and $r=1.000001$.
* When $r=1$, there are infinitely many minimizers with $L_1$ regularization. They lie on a line $w_1+w_2=c$, on the segment where $w_1$ and $w_2$ have the same sign, because $|w_1|+|w_2|$ is constant along that segment.
* When $r=1.000001$ (or any $r>1$), there should be only one minimizer, with $w_1 = 0$ and $w_2 \neq 0$, because $w_2$ corresponds to the feature $x_2$, which has a larger scale and so reaches the same prediction with a smaller $L_1$ penalty.
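
Because every point on a line $w_1 + rw_2 = c$ has the same empirical risk, the $L_1$-regularized objective restricted to that line is smallest where $|w_1| + |w_2|$ is smallest. The sketch below scans a grid of $w_1$ values on one such line and reports where the $L_1$ norm attains its minimum. The function names, the grid, and the choice $c = 1$ are assumptions made for illustration.

```python
def l1_norm_on_line(w1, r, c):
    """L1 norm of the point (w1, w2) that lies on the line w1 + r*w2 = c."""
    w2 = (c - w1) / r
    return abs(w1) + abs(w2)


def l1_minimizers_on_line(r, c, grid, tol=1e-9):
    """Grid values of w1 whose L1 norm is within tol of the smallest on the grid."""
    values = [(l1_norm_on_line(w1, r, c), w1) for w1 in grid]
    best = min(v for v, _ in values)
    return [w1 for v, w1 in values if v - best <= tol]


c = 1.0
grid = [i / 10 for i in range(-10, 21)]  # w1 from -1.0 to 2.0 in steps of 0.1

ties_r_equal = l1_minimizers_on_line(1.0, c, grid)
ties_r_larger = l1_minimizers_on_line(1.000001, c, grid)

print("r = 1:        minimizing w1 values:", ties_r_equal)
print("r = 1.000001: minimizing w1 values:", ties_r_larger)
```

For $r = 1$ every grid point with $0 \le w_1 \le 1$ ties, while for $r = 1.000001$ only $w_1 = 0$ remains, even though the two problems differ by one part in a million.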

#### What is the number of regularized ERM minimizers if we add $L_2$ regularization? How important is the exact value of $r$?
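* With $L_2$ regularization we should have exactly one minimizer regardless of the value of $r$: for a convex loss, adding the squared $L_2$ penalty makes the objective strictly convex. The exact value of $r$ matters little, since among points with the same empirical risk the penalty favors $w_2 = rw_1$, which changes continuously with $r$.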
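
A short derivation of why $r$ matters so little here, written as a sketch: on a fixed line $w_1 + rw_2 = c$ the empirical risk is constant, so the $L_2$-regularized objective picks the point of smallest Euclidean norm on that line. The multiplier $\mu$ is a symbol introduced only for this derivation.

$$
\begin{aligned}
&\min_{w_1, w_2}\; w_1^2 + w_2^2 \quad \text{subject to} \quad w_1 + rw_2 = c \\
&\text{Stationarity of } w_1^2 + w_2^2 - \mu\,(w_1 + rw_2 - c): \quad 2w_1 = \mu, \quad 2w_2 = \mu r \;\Rightarrow\; w_2 = rw_1 \\
&\text{Substituting into the constraint:} \quad (w_1, w_2) = \frac{c}{1+r^2}\,(1, r)
\end{aligned}
$$

For $r = 1$ this gives $w_1 = w_2 = c/2$, and for $r = 1.000001$ the weights move only slightly away from that point.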

### Theorem for Elastic Net (Slide 26 in 03b.elastic-net.pdf)
* Suppose two correlated features are both selected by elastic net and their weights have the same sign. Then the distance between the two weights is bounded above by a quantity that grows with the norm of the response vector, shrinks as the correlation $\rho$ between the two features approaches 1 (it is proportional to $\sqrt{1-\rho}$), and is scaled by $1/\lambda_2$.
* The upper bound does not depend on $\lambda_1$, the coefficient of the $L_1$ penalty in elastic net. Still, when $\lambda_1$ is nonzero, elastic net gives you sparsity.
* The higher the correlation, the closer the weights are to each other. When two features are highly correlated, their correlation is close to 1, the bound is close to 0, and the two weights are nearly equal, so the weight is distributed evenly. This is preferable since their errors can cancel out.

* In comparison, Lasso can give any combination of weights for highly correlated features as long as they satisfy the constraint, so its weights are unstable.

### How to limit the impact of outliers in the training data?
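* Try the absolute loss function (Laplace loss). It grows linearly in the residual, so a single large residual influences the fit less than it does under the square loss.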
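
One way to see the effect, as a minimal sketch: for a constant predictor, the square loss is minimized by the mean of the responses and the absolute loss is minimized by a median. The data below are made up, with a single outlier appended.

```python
import statistics

clean = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up responses
with_outlier = clean + [100.0]       # the same responses plus one outlier

# Best constant prediction under the square loss is the mean.
mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)

# Best constant prediction under the absolute loss is a median.
median_shift = statistics.median(with_outlier) - statistics.median(clean)

print("shift of square-loss minimizer (mean):    ", round(mean_shift, 3))
print("shift of absolute-loss minimizer (median):", round(median_shift, 3))
```

The outlier drags the square-loss minimizer far from the bulk of the data, while the absolute-loss minimizer barely moves.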

### Why might you train with a hinge loss instead of the 0-1 loss when doing ERM?
* Hinge loss function is convex
* Hinge loss function is continuous
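* Hinge loss penalizes answers that are very wrong more heavily, since it grows linearly as the margin becomes more negative, whereas the 0-1 loss treats all errors the same.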
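
The two losses can be compared directly as functions of the margin $m = y\,f(x)$. This is a small sketch with hypothetical function names. Counting a margin of exactly 0 as an error is a convention chosen here.

```python
def zero_one_loss(margin):
    """0-1 loss: 1 for a misclassification (margin <= 0 by convention here), else 0."""
    return 1.0 if margin <= 0 else 0.0


def hinge_loss(margin):
    """Hinge loss: max(0, 1 - margin)."""
    return max(0.0, 1.0 - margin)


margins = [-3.0, -1.0, 0.0, 0.5, 1.0, 2.0]
for m in margins:
    print("margin", m, "0-1 loss", zero_one_loss(m), "hinge loss", hinge_loss(m))
```

The 0-1 loss is flat on either side of its jump, so it gives no gradient signal, while the hinge loss is a convex upper bound that keeps increasing as predictions get more wrong.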

### Compare the soft margin linear support vector machine's optimization objective with that of ridge regression (Tikhonov form).
* Loss functions are different
* Both use $L_2$ regularization
* No $\frac{1}{n}$ term in SVM
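
For reference, one common way to write the two objectives side by side. The exact scaling conventions vary between textbooks and courses, so treat the constants $C$ and $\lambda$ and the placement of $\frac{1}{n}$ as assumptions of this sketch rather than as the definitive forms.

$$
\begin{aligned}
\text{Soft-margin SVM:} \quad & \min_{w,\,b}\; \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i(w^Tx_i + b)\bigr) \\
\text{Ridge regression (Tikhonov form):} \quad & \min_{w}\; \frac{1}{n}\sum_{i=1}^{n} \bigl(w^Tx_i - y_i\bigr)^2 + \lambda \|w\|_2^2
\end{aligned}
$$

Written this way, both objectives are a sum of a data-fit term and an $L_2$ penalty. They differ in the loss (hinge versus square) and in which of the two terms carries the tuning constant.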

### 5
* For hinge loss, the points at $(-10, 0)$, $(10, 0)$ are not as important as points close to 0. Logistic loss probably makes more errors close to 0.
