## Bias-variance Trade-off

Bias - inability of the model to capture the true relationship in the data

$\text{bias}^2 = (y-\mathbb{E}a(x))^2$

Variance - sensitivity of the model to the fluctuations in the data

Variance = $\mathbb{V}ar(a(x))$

__Trade-off__

$\mathbb{E}[y-a(x)]^2 = \text{bias}^2$ + Variance

- Simple models may have large bias (__underfitting__)
- Excessively complex models have large variance (__overfitting__)

Error $\approx \text{bias}^2 $ + Variance

$ y \in \mathbb{Y}, x \in \mathbb{X}$ 

Real: $y=f(x)$

Dataset: $D = \{(x_1, y_1), \dots ,(x_n, y_n)\}$

$y \approx a(x) \gets$ Model 

---

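Decomposing the expected error of a model, where the expectation is taken over training datasets $D$ and $f = f(x)$ is the true value at a fixed point $x$:

$\mathbb{E}_D[f - a(x)]^2$  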
$=\mathbb{E}_D[f - \mathbb{E}_D(a(x)) + \mathbb{E}_D(a(x)) - a(x)]^2$  
$=\mathbb{E}_D[f-\mathbb{E}_D(a(x))]^2 + \mathbb{E}_D[2(f - \mathbb{E}_D(a(x)))(\mathbb{E}_D(a(x)) - a(x))] + 
\mathbb{E}_D[\mathbb{E}_D(a(x)) - a(x)]^2$

Since $f$ and $\mathbb{E}_D(a(x))$ do not depend on the particular dataset $D$, they are constants under $\mathbb{E}_D$:

$=[f-\mathbb{E}_D(a(x))]^2 + 2(f - \mathbb{E}_D(a(x))) (\mathbb{E}_D(\mathbb{E}_D(a(x))) - \mathbb{E}_D(a(x))) + \mathbb{V}ar(a(x))$  
$=[f-\mathbb{E}_D(a(x))]^2 + 2(f - \mathbb{E}_D(a(x))) \times 0 + \mathbb{V}ar(a(x))$  
$=[f-\mathbb{E}_D(a(x))]^2 + \mathbb{V}ar(a(x))$  
$=\text{bias}^2 + \mathbb{V}ar(a(x))$
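The decomposition can be checked numerically. The sketch below is only an illustration under assumptions that are not in the notes: the true function `f` is a sine, the models are polynomials of a few degrees, and the datasets $D$ differ through random inputs and noisy training targets, while the target $f(x_0)$ at the test point is noise-free, as in the derivation.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def f(x):
    # Assumed "real" relationship y = f(x); chosen only for illustration.
    return np.sin(2 * np.pi * x)

def predict_at(x0, degree, n=30, noise=0.3):
    # Draw a fresh dataset D, fit a polynomial model a(x), return a(x0).
    x = rng.uniform(0.0, 1.0, n)
    y = f(x) + rng.normal(0.0, noise, n)
    model = Polynomial.fit(x, y, degree)
    return model(x0)

x0 = 0.3           # fixed test point
n_datasets = 1000  # number of datasets D to average over
results = {}
for degree in (1, 3, 9):
    preds = np.array([predict_at(x0, degree) for _ in range(n_datasets)])
    bias2 = (f(x0) - preds.mean()) ** 2      # [f - E_D a(x0)]^2
    variance = preds.var()                   # Var_D a(x0)
    error = np.mean((f(x0) - preds) ** 2)    # E_D [f - a(x0)]^2
    results[degree] = (bias2, variance, error)
    print(f"degree={degree}: bias^2={bias2:.4f}  "
          f"variance={variance:.4f}  error={error:.4f}")
```

For every degree the printed error equals bias² plus variance up to floating-point rounding, because the identity also holds for empirical averages. In this setup the straight line should show by far the largest bias, since it cannot follow the sine.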

## Cross-Validation

With cross-validation, the final mean squared error is effectively computed over the whole dataset, because every sample is used for validation exactly once. This is a very useful technique, especially when you have a small dataset.

Another question is how to choose the number of folds. There is no single answer. In general, a larger number of folds is better, since each model is trained on a larger share of the data, but it also means more training runs.
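A minimal sketch of k-fold cross-validation, assuming a plain least-squares linear model and synthetic data (both are illustrative choices, not part of the notes). The helper name `k_fold_mse` is hypothetical.

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Average validation MSE of least-squares regression over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))     # shuffle sample indices once
    folds = np.array_split(idx, k)    # k disjoint validation folds
    fold_mse = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the other k-1 folds, evaluate on the held-out fold.
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        fold_mse.append(np.mean((X[val] @ w - y[val]) ** 2))
    return float(np.mean(fold_mse)), folds

# Synthetic data: intercept column plus two features (assumed example).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0.0, 0.5, 40)

cv_mse, folds = k_fold_mse(X, y, k=5)
print(f"5-fold CV MSE: {cv_mse:.3f}")
```

Each sample lands in exactly one validation fold, so the averaged error draws on the whole dataset even though no model is ever evaluated on data it was trained on.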

## Regularization and Feature Selection

Idea: we can penalize the model for large weights.

Consider loss:

$L(a, X) = \frac{1}{N} \sum^N_{n=1}(a(x_n) - y_n)^2$

And a regularizer $R(w)$: a function that imposes a penalty on the weights.

Regularized loss:

$\min_w \frac{1}{N} \sum^N_{n=1}(a(x_n) - y_n)^2 + \lambda R(w)$

- $\lambda$ - regularization coefficient
- Use validation or cross-validation to select optimal $\lambda$

_What is a regularizer?_

- A function that imposes a penalty on large weights
- Popular regularizers:

$L_2$-norm:  $R(w) = \|w\|^2_2 = \sum^d_{j=1} w_j^2$

$L_1$-norm:  $R(w) = \|w\|_1 = \sum^d_{j=1} | w_j |$
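As a small illustration, the two regularizers and the regularized loss can be written directly in NumPy. This assumes a linear model $a(x) = w^T x$; the function names are hypothetical, and the toy data is chosen so that the MSE term is zero and only the penalty remains.

```python
import numpy as np

def l2_penalty(w):
    # R(w) = sum_j w_j^2
    return float(np.sum(w ** 2))

def l1_penalty(w):
    # R(w) = sum_j |w_j|
    return float(np.sum(np.abs(w)))

def regularized_loss(w, X, y, lam, penalty):
    # Mean squared error of the linear model plus lam * R(w).
    mse = np.mean((X @ w - y) ** 2)
    return float(mse + lam * penalty(w))

w = np.array([3.0, -4.0, 0.0])
X = np.eye(3)   # toy design: predictions X @ w equal w itself
y = w.copy()    # targets match the predictions, so the MSE is 0

print(l2_penalty(w))  # 25.0
print(l1_penalty(w))  # 7.0
print(regularized_loss(w, X, y, 0.1, l2_penalty))
print(regularized_loss(w, X, y, 0.1, l1_penalty))
```

With the fit term at zero, the last two values show how the same weights are charged differently: the squared penalty is dominated by the largest weights, while the absolute penalty grows linearly in each weight.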

## Ridge Regression

$\min_w \frac{1}{N} \sum^N_{n=1}(a(x_n) - y_n)^2 + \lambda \|w\|^2_2$

- Has an analytical solution (with the factor $N$ absorbed into $\lambda$):

$w = (X^TX+\lambda I)^{-1} X^Ty$

> Similar to the ordinary least-squares solution $(X^TX)^{-1}X^Ty$. If features are highly correlated, $X^TX$ is nearly singular and inverting it is unstable; adding $\lambda I$ stabilizes it.

- Has shrinkage effect on weights

It does not remove irrelevant features: the weights shrink but do not become exactly zero.
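The closed form and its stabilizing effect can be sketched on two nearly identical features. The data here is synthetic and the helper name `ridge_fit` is an assumption; the formula is the one above, solved as a linear system rather than with an explicit inverse.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)

# Condition number of the matrix that has to be inverted.
cond_plain = np.linalg.cond(X.T @ X)
cond_ridge = np.linalg.cond(X.T @ X + 1.0 * np.eye(2))

w_small = ridge_fit(X, y, 1e-8)   # practically unregularized
w_ridge = ridge_fit(X, y, 1.0)

print(f"cond without lambda*I: {cond_plain:.2e}")
print(f"cond with    lambda*I: {cond_ridge:.2e}")
print("weights, lambda=1e-8:", w_small)
print("weights, lambda=1:   ", w_ridge)
```

With correlated columns the nearly unregularized weights can land far from the generating values of 1 and 1, while the ridge weights stay close to them and have a smaller norm. Neither ridge weight is exactly zero.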

## Lasso Regression

$\min_w \frac{1}{N} \sum^N_{n=1}(a(x_n) - y_n)^2 + \lambda \|w\|_1$

Lasso (least absolute shrinkage and selection operator)

- No analytical solution
- Some weights become exactly zero: Lasso selects the most important features, and the weights of the others go to zero as $\lambda$ increases

![Selection_008.png](attachment:Selection_008.png)
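Since there is no closed form, Lasso is solved iteratively. One common choice is coordinate descent with soft-thresholding, sketched here for the objective above with a linear model. The solver choice, the function names, and the synthetic data are assumptions for illustration; the notes do not say which algorithm is used.

```python
import numpy as np

def soft_threshold(rho, t):
    # Shrink rho toward zero by t; anything within [-t, t] becomes 0.
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/N) * ||Xw - y||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    z = (X ** 2).sum(axis=0) / n             # per-feature curvature
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]   # residual without feature j
            rho = X[:, j] @ r / n
            # Exact minimizer of the objective in coordinate j.
            w[j] = soft_threshold(rho, lam / 2) / z[j]
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.0])   # last three are irrelevant
y = X @ true_w + 0.1 * rng.normal(size=200)

for lam in (0.01, 0.5, 10.0):
    print(lam, np.round(lasso_cd(X, y, lam), 3))
```

As $\lambda$ grows, the weights of the irrelevant features are set exactly to zero first, and for a large enough $\lambda$ every weight is zero.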

## Lasso and Ridge: Geometric Interpretation

Linear regression + regularization

$\min_w L(w) + \lambda R(w)$

- $R(w) = \| w \|_1 = \sum^d_{i=1} | w_i | \gets$ Lasso

- $R(w) = \| w \|^2_2 = \sum^d_{i=1} w_i^2 \gets$ Ridge

Equivalent constrained form (each $\lambda$ corresponds to some bound $S$):

$\begin{cases}
\min_w L(w) \\
\text{s.t. } R(w) \le S
\end{cases}$


> For example, suppose the global optimum of our loss function, $w^*$, lies outside the constraint region. We can't use it as the solution to our constrained optimization, because it lies outside the region that Lasso allows. We have to find the level set of our target function that is as close as possible to the global optimum and still intersects the region of interest. The point of intersection, which will usually lie on the border of the region, is exactly the solution of Lasso regression.

> $S$ is a hyperparameter that we tune when we select the model. On the plot, the Lasso region crosses the axes at $S$ and $-S$. Basically, $S$ controls how large or how small the region over which we optimize is.

> In the case of ridge regression, we allow our parameters to lie inside the circle, and if the global optimum of our loss function lies outside it, we again have to find the point on the border that is as close as possible to the global optimum.

> Instead of setting some of the weights to zero, ridge regression just shrinks both of them.

> Even if the parameters of the global solution were already quite small, imposing regularization does not send them exactly to 0. They just become smaller but stay nonzero. Therefore ridge does not have the useful property that Lasso has, the selection operator, which allows us to remove irrelevant features from our model.

![image.png](attachment:image.png)
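The geometric picture has a simple closed form in one special case. Assume, purely for illustration, that the loss is $L(w) = \|w - w^*\|_2^2$, so its level sets are circles around the global optimum $w^*$. Then the penalized problems can be solved coordinate by coordinate: Lasso subtracts $\lambda/2$ from each $|w^*_i|$ and clips at zero, while ridge divides every coordinate by $1 + \lambda$. The function names below are hypothetical.

```python
import numpy as np

def lasso_orthonormal(w_star, lam):
    """argmin_w ||w - w_star||^2 + lam * ||w||_1 (soft-thresholding)."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - lam / 2, 0.0)

def ridge_orthonormal(w_star, lam):
    """argmin_w ||w - w_star||^2 + lam * ||w||_2^2 (uniform shrinkage)."""
    return w_star / (1.0 + lam)

w_star = np.array([2.0, 0.3])   # assumed unregularized optimum
for lam in (0.0, 0.5, 1.0):
    print(f"lambda={lam}: lasso={lasso_orthonormal(w_star, lam)}, "
          f"ridge={ridge_orthonormal(w_star, lam)}")
```

At $\lambda = 1$ the Lasso solution is $(1.5, 0)$: the small weight has been removed, which corresponds to the solution sitting on a corner of the diamond. The ridge solution is $(1.0, 0.15)$: both weights are smaller, but neither is zero.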