# 3. Regularization
- **Question:**
    - What is regularization?
    - Why do we need to regularize?
  
## 3.1 Bias and Variance Tradeoff
### 3.1.1 Intuition
-  Given a learnable objective function, the bias-variance tradeoff is the property that constrains any attempt to minimize two sources of error at the same time: bias and variance. Both prevent a model from generalizing beyond the training set, and lowering one typically raises the other.

### 3.1.2 Bias error
- **Intuition:** The bias error comes from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). Another way to think about it: a high-bias model does not pay much attention to the training data and is systematically off, which leads to high total error on both the training and test sets.
- **Example:** There are two scales, A and B. My true weight is 70 kg. We take 4 measurements on each scale.
    - Readings on scale A: 71, 72, 71, 72
    - Readings on scale B: 75, 65, 80, 60
- **Conclusion:** $\mathbb{E}_A[W] = 71.5$ and $\mathbb{E}_B[W] = 70$, so A is more precise (its readings are tightly clustered) but biased toward 71.5 kg (positive bias of 1.5 kg), while B is less precise but unbiased (on average the measured weight is 70 kg).
- **Mathematical definition:** Let $\hat{\theta}$ be an estimator of the target statistic $\theta$. The bias of $\hat{\theta}$ with respect to $\theta$ is defined as:
$$Bias_D(\hat{\theta},\theta) = \mathbb{E}_{D}(\hat{\theta}) - \theta$$
where $\mathbb{E}_{D}$ is the expected value over the possible observations $x$ under the distribution $D \sim \mathbb{P}(x|\theta)$.
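
As a rough illustration, the definition can be applied to the scale example by replacing the expectation with the sample mean of the four readings. This is only a sketch: the helper name `empirical_bias` is made up for this note, and four readings give a crude estimate of the true expectation.

```python
import numpy as np

TRUE_WEIGHT = 70.0  # kg, the known true value from the example

# Four readings per scale, copied from the example above
scale_a = np.array([71.0, 72.0, 71.0, 72.0])
scale_b = np.array([75.0, 65.0, 80.0, 60.0])

def empirical_bias(readings, true_value):
    """Sample-mean estimate of E[theta_hat] - theta (hypothetical helper)."""
    return readings.mean() - true_value

bias_a = empirical_bias(scale_a, TRUE_WEIGHT)
bias_b = empirical_bias(scale_b, TRUE_WEIGHT)
print(bias_a, bias_b)  # 1.5 0.0
```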

### 3.1.3 Variance error
- **Intuition:** The variance error can be understood as “the mean squared deviation around the mean”: for a given data point, variance captures how widely the prediction spreads. A model with high variance pays a lot of attention to the training data and does not generalize to data it has not seen before. As a result, such models perform very well on training data but have high error rates on test data.
- **Example:** with the same configuration as in the bias error example.
    - Variance of scale A (exercise): low
    - Variance of scale B (exercise): high
- **Conclusion:** scale A has high bias but low variance, and scale B has low bias but high variance.
- **Mathematical definition:** Let $\hat{\theta}$ be an estimator of the target statistic $\theta$. The variance of $\hat{\theta}$ measures its spread around its own expected value (it does not depend on $\theta$) and is defined as:
$$Var_D(\hat{\theta}) = \mathbb{E}_D\left[\left(\hat{\theta} - \mathbb{E}_D[\hat{\theta}]\right)^2\right]$$
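
The variance exercise for the two scales can be checked numerically. The sketch below uses the standard definition (mean squared deviation of the readings from their own mean) and assumes the population form of the variance, which is what `np.var` computes by default (`ddof=0`).

```python
import numpy as np

# Four readings per scale, copied from the bias example
scale_a = np.array([71.0, 72.0, 71.0, 72.0])
scale_b = np.array([75.0, 65.0, 80.0, 60.0])

# np.var with the default ddof=0 averages the squared deviations
# of each reading from the sample mean of its own scale.
var_a = np.var(scale_a)
var_b = np.var(scale_b)
print(var_a, var_b)  # 0.25 62.5
```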

### 3.1.4 Mathematical Relationship
- Given the error measure MSE (mean squared error / L2 loss function):
$$\begin{align*}
L(\theta,\hat{\theta}) &= \mathbb{E}_D\left[(\hat{\theta}-\theta)^2\right] \\
&= \mathbb{E}_D\left[(\hat{\theta}-\theta + \mathbb{E}_D[\hat{\theta}] - \mathbb{E}_D[\hat{\theta}] )^2\right] \\
&= \left(\mathbb{E}_D[\hat{\theta}]-\theta\right)^2 + \mathbb{E}_D\left[\left(\hat{\theta}-\mathbb{E}_D[\hat{\theta}]\right)^2\right] \quad \text{(prove this: exercise)}\\
&= Bias_D(\hat{\theta},\theta)^2 + Var_D(\hat{\theta}) 
\end{align*}$$
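
The decomposition can be sanity-checked on the scale readings. This is a numerical check, not a proof: it assumes expectations are replaced by sample averages over the four readings, in which case the identity MSE = bias² + variance holds exactly.

```python
import numpy as np

TRUE_WEIGHT = 70.0  # kg, true value from the scale example
scales = {
    "A": np.array([71.0, 72.0, 71.0, 72.0]),
    "B": np.array([75.0, 65.0, 80.0, 60.0]),
}

results = {}
for name, readings in scales.items():
    mse = np.mean((readings - TRUE_WEIGHT) ** 2)   # mean squared error
    bias_sq = (readings.mean() - TRUE_WEIGHT) ** 2  # squared bias
    variance = np.var(readings)                     # spread around own mean
    results[name] = (mse, bias_sq, variance)
    print(name, mse, bias_sq + variance)
```

Scale A gets almost all of its error from bias, while scale B gets all of it from variance.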

## 3.2 Regularization
- Regularization is the idea of putting constraints on the function space to enforce certain properties on the predictor (for example, preventing exploding gradients or obtaining a smoother fitted hyperplane).

### 3.2.1 Shrinkage Methods
- **Question:** Is an unbiased estimator always good? Why or why not?
- **Example:** Suppose that we have found the best linear unbiased model, meaning that among all unbiased linear models, ours has the lowest variance. Even so, that variance can still be quite large. One way to fix this is to introduce a small amount of bias in exchange for a significant reduction in variance (why?). If this is done carefully, the overall prediction error can be lower.
- **Conclusion:** The following shrinkage techniques do exactly that. They put a constraint on the model that pulls the parameters towards zero. Hence, the predictor will be biased (why?) but will also have lower variance. This is one way of playing the tradeoff game between bias and variance.
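
A small worked example may make the trade concrete. The sketch below is not one of the shrinkage methods that follow; it assumes a simpler, hypothetical setting: estimating a mean $\mu$ from $n$ samples with variance $\sigma^2$ using the shrunken estimator $c\,\bar{x}$, whose bias is $(c-1)\mu$ and whose variance is $c^2\sigma^2/n$. The numbers `mu`, `sigma2` and `n` are made up for illustration.

```python
def shrunken_mean_errors(c, mu, sigma2, n):
    """Squared bias, variance and MSE of the estimator c * sample_mean."""
    bias_sq = ((c - 1.0) * mu) ** 2
    variance = c ** 2 * sigma2 / n
    return bias_sq, variance, bias_sq + variance

mu, sigma2, n = 1.0, 4.0, 4  # hypothetical true mean, noise variance, sample size

# c = 1.0 is the unbiased sample mean; smaller c shrinks towards zero
for c in (1.0, 0.8, 0.5):
    bias_sq, variance, mse = shrunken_mean_errors(c, mu, sigma2, n)
    print(f"c={c}: bias^2={bias_sq:.3f} variance={variance:.3f} mse={mse:.3f}")
```

In this setting the unbiased choice `c = 1.0` has the highest MSE of the three: a little bias buys a larger drop in variance.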

#### 1. Lasso (L1)
- **Definition:**
$$argmin_{\beta}\{\frac{1}{N}||y-X\beta||_2^2 + \lambda_1||\beta||_1\}$$

#### 2. Ridge (L2)
- **Definition:**
$$argmin_{\beta}\{\frac{1}{N}||y-X\beta||_2^2 + \lambda_2||\beta||_2^2\}$$

#### 3. Elastic net (L1 + L2)
- **Definition:**
$$argmin_{\beta}\{\frac{1}{N}||y-X\beta||_2^2 + \lambda_1||\beta||_1+\lambda_2||\beta||_2^2\}$$
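
The three objectives can be written as plain NumPy functions, which is a convenient starting point for plugging them into an optimizer. This is a sketch under stated assumptions: the function names are made up for this note, and the ridge term is taken to be the squared L2 norm, as is standard.

```python
import numpy as np

def mse_loss(X, y, beta):
    """(1/N) * ||y - X beta||_2^2, the data-fit term shared by all three."""
    residual = y - X @ beta
    return residual @ residual / len(y)

def lasso_objective(X, y, beta, lam1):
    """Data fit plus the L1 penalty lam1 * ||beta||_1."""
    return mse_loss(X, y, beta) + lam1 * np.sum(np.abs(beta))

def ridge_objective(X, y, beta, lam2):
    """Data fit plus the squared L2 penalty lam2 * ||beta||_2^2."""
    return mse_loss(X, y, beta) + lam2 * np.sum(beta ** 2)

def elastic_net_objective(X, y, beta, lam1, lam2):
    """Data fit plus both penalties."""
    return (mse_loss(X, y, beta)
            + lam1 * np.sum(np.abs(beta))
            + lam2 * np.sum(beta ** 2))

# Tiny made-up problem, only to show the call pattern
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
beta = np.array([1.0, 1.0])
print(lasso_objective(X, y, beta, lam1=0.1))
print(ridge_objective(X, y, beta, lam2=0.1))
print(elastic_net_objective(X, y, beta, lam1=0.1, lam2=0.1))
```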

#### 4. Tune the hyper-parameter lambda
- Try 

### 3.2.3 Comparison
#### The effect of minimizing L1/L2/L1+L2 penalty
- **Lasso:**
    - Dimension reduction / sparsifying the parameters: some coefficients are driven exactly to zero. (why? prove this)
- **Ridge:**
    - Reduces the impact of non-relevant features. For highly correlated features, it shrinks their coefficients towards each other.
- **Elastic net:**
    - A compromise between the effects of both.
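
The difference between the two penalties is easiest to see in one dimension. The sketch below assumes a simplified, hypothetical setting: each coefficient $b$ is fitted separately to an unpenalized estimate $z$ by minimizing $(z-b)^2$ plus the penalty. Setting the (sub)gradient to zero gives soft-thresholding for the L1 penalty and proportional shrinkage for the squared L2 penalty. It is an illustration, not a proof of the claims above.

```python
import numpy as np

def lasso_1d(z, lam):
    """Minimizer of (z - b)**2 + lam * |b|: soft-thresholding at lam / 2."""
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

def ridge_1d(z, lam):
    """Minimizer of (z - b)**2 + lam * b**2: proportional shrinkage."""
    return z / (1.0 + lam)

z = np.array([3.0, 0.4, -0.2, -2.0])  # hypothetical unpenalized estimates
lam = 1.0

b_lasso = lasso_1d(z, lam)
b_ridge = ridge_1d(z, lam)
print("lasso:", b_lasso)
print("ridge:", b_ridge)
print("exact zeros (lasso, ridge):", np.sum(b_lasso == 0), np.sum(b_ridge == 0))
```

The L1 penalty sets the two small estimates exactly to zero, while the squared L2 penalty scales every estimate down without zeroing any of them.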
    
## 3.3 Exercise:

### 3.3.1 
- Derive the analytical solutions for lasso/ridge

### 3.3.2
- Add the regularization terms to the linear regression code and observe the differences in the parameter vector

### 3.3.3 
- Show the proof in section 3.1.4

### 3.3.4
- Prove the claims in section 3.2.3 (hint: check the gradient)