# Chapter.06 Practical Issues in Machine Learning
---

### 6.1. Bias-Variance tradeoff
6.1.1. Bias-Variance decomposition of the MSE<br>
For any supervised learning algorithm, assuming $y = f(x) + \text{noise}$ with zero-mean noise of variance $\sigma^2$, the expected squared error at an unseen input $x$ decomposes as
$$ \mathbb{E}[(y - \hat{f}(x; D))^2] = (Bias_D[\hat{f}(x; D)])^2 + Var_D[\hat{f}(x; D)] + \sigma^2 $$
$$ \text{where} \quad Bias_D[\hat{f}(x;D)] = \mathbb{E}_D[\hat{f}(x;D)] - f(x) \,\ \text{and} \,\ Var_D[\hat{f}(x;D)] = \mathbb{E}_D[(\mathbb{E}_D[\hat{f}(x ; D)] - \hat{f}(x;D))^2 ] $$

- $Bias_D[\hat{f}(x;D)]$ is the error caused by an improper model or a wrong assumption (e.g., when a non-linear function $f(x)$ is approximated with a learning method for linear models, the estimates $\hat{f}(x)$ are systematically off because of that assumption).
- $Var_D[\hat{f}(x;D)]$ measures how much the fitted model $\hat{f}(x;D)$ changes from one training set $D$ to another.
- $\sigma^2$ is the inherent noise (irreducible error).<br><br>

<strong>Proof.</strong><br>
[PDF File too long](./res/ch06/note_bias-variance_tradeoff.pdf)

Therefore, averaging over the inputs $x$,
$$ \text{MSE} = \mathbb{E}_x \left\{ Bias_D[\hat{f}(x;D)]^2 + Var_D[\hat{f}(x;D)] \right\} + \sigma^2 $$


<img src="./res/ch06/fig_1_1.png" width="400" height="300"><br>
<div align="center">
  Figure.6.1.1
</div>


- The more complex the model is, the more closely it can fit the data points $\rightarrow$ the lower the bias will be.
- However, a more complex model also changes more from one training set to another as it chases individual data points $\rightarrow$ the larger the variance will be.
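The tradeoff can be checked numerically. The following is a minimal sketch, not a definitive experiment: it assumes a ground-truth function $f(x) = \sin(2\pi x)$, Gaussian noise with $\sigma = 0.3$, and polynomial models of different degree (all of these choices, and the helper name `bias_variance`, are illustrative assumptions). Many training sets $D$ are drawn, one model is fitted per set, and the squared bias and variance of the predictions are estimated.

```python
import numpy as np
from numpy.polynomial import Polynomial


def true_f(x):
    # Assumed ground-truth function, used for illustration only.
    return np.sin(2 * np.pi * x)


def bias_variance(degree, n_train=25, n_datasets=200, sigma=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit of the given
    degree, averaged over a grid of test inputs."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)  # fixed inputs; only the noise changes
    x_test = np.linspace(0.05, 0.95, 50)

    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):
        # Draw a new training set D and fit one model to it.
        y_train = true_f(x_train) + rng.normal(0.0, sigma, size=n_train)
        model = Polynomial.fit(x_train, y_train, degree)
        preds[d] = model(x_test)

    mean_pred = preds.mean(axis=0)  # approximates E_D[f_hat(x; D)]
    bias_sq = float(np.mean((mean_pred - true_f(x_test)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance


for degree in (1, 3, 9):
    b2, var = bias_variance(degree)
    print(f"degree={degree}: bias^2={b2:.4f}, variance={var:.4f}")
```

Under these assumptions, the low-degree model should show the larger squared bias and the high-degree model the larger variance.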

### 6.2. Generalization
6.2.1. Overview<br>
- The ultimate goal of machine learning is __good generalization__.
- The data we observe is only a part of the whole.
- We need to find a model that explains the whole data well, given only that portion of it.
- In other words, the model should perform well on unobserved data.

Generalization depends on the __amount of training data__ and the __complexity of the model__.

6.2.2. Training and test data sets<br>


<img src="./res/ch06/fig_2_1.png" width="400" height="300"><br>
<div align="center">
  Figure.6.2.1
</div>

- The whole available data is split into two parts: (i) a training set and (ii) a test set.
- The training set is used to train the model (or machine learning algorithm).
- The test set is held out and used only to evaluate the (generalization) performance of the model.
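The split itself takes only a few lines. The sketch below assumes NumPy arrays and a random shuffle before splitting; the function name `train_test_split` and the 80/20 ratio are illustrative choices, not something prescribed by these notes.

```python
import numpy as np


def train_test_split(X, y, test_ratio=0.2, seed=0):
    """Shuffle the samples and split them into a training and a test part."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(test_ratio * len(X)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


X = np.arange(20, dtype=float).reshape(10, 2)  # 10 toy samples, 2 features
y = np.arange(10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_ratio=0.2)
print(X_train.shape, X_test.shape)
```

Shuffling first avoids a biased split when the samples are stored in some meaningful order.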

For example, consider polynomial curve fitting with training and test sets:

<img src="./res/ch06/fig_2_2.png" width="600" height="300"><br>
<div align="center">
  Figure.6.2.2
</div>

<img src="./res/ch06/fig_2_3.png" width="800" height="300"><br>
<div align="center">
  Figure.6.2.3
</div>

In Figure.6.2.3, $M$ is the order of the polynomial. When $M = 0$, the model under-fits: a constant (the single weight $w_0$) cannot follow any trend in the data.<br>
When $M = 1$, it still under-fits: a straight line (weights $w_0$ and $w_1$) can only fit linear data.<br>
When $M = 3$, it fits well (best model), because the model is a good representation of the nonlinearity of the data.<br>
When $M = 9$, it over-fits: $M$ is so large that the model memorizes much of the noise.
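The same comparison can be reproduced on synthetic data. This is a rough sketch under assumed settings (targets generated as $\sin(2\pi x)$ plus Gaussian noise, 10 training points, 100 test points); it is not the data behind the figures.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def make_data(n, sigma=0.3):
    # Assumed data-generating process: sin(2*pi*x) plus Gaussian noise.
    x = np.linspace(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=n)


def rmse(model, x, y):
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))


x_train, y_train = make_data(10)
x_test, y_test = make_data(100)

train_rmse, test_rmse = {}, {}
for M in (0, 1, 3, 9):
    model = Polynomial.fit(x_train, y_train, M)  # least-squares fit of order M
    train_rmse[M] = rmse(model, x_train, y_train)
    test_rmse[M] = rmse(model, x_test, y_test)
    print(f"M={M}: train RMSE={train_rmse[M]:.3f}, test RMSE={test_rmse[M]:.3f}")
```

The training error can only decrease as $M$ grows, and with 10 points an order-9 polynomial interpolates them almost exactly. The test error is the quantity that reveals whether the extra flexibility helped or only fitted noise.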

### 6.3. Overfitting
So, what is overfitting? An overfitted model is __very good only on the training data__ (memorization) but __not good on test data__ (poor generalization). <br><br>

6.3.1. How to avoid overfitting?<br>

6.3.1.1. More training data<br>
   
__Widrow's rule of thumb__ 

$$
N = O \left( \frac{W}{\epsilon} \right)
$$

where $N$ is the number of training samples, $W$ is the total number of parameters, and $\epsilon$ is the target error expressed as a fraction.

For example, when you set $\epsilon$ to $10\%$, the rule gives

$$
\epsilon = 0.1 \quad \rightarrow \quad N \ge 10W
$$
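As a quick arithmetic check, the rule can be applied to a concrete parameter count. The sketch below takes the constant hidden in $O(\cdot)$ to be 1, which is an assumption, and uses a hypothetical fully connected 784-100-10 network; both helper functions are illustrative names.

```python
import math


def count_mlp_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def widrow_min_samples(n_params, epsilon):
    """Rule-of-thumb training-set size N ~ W / epsilon (constant factor assumed to be 1)."""
    return math.ceil(n_params / epsilon)


W = count_mlp_parameters([784, 100, 10])  # hypothetical 784-100-10 network
print(W, widrow_min_samples(W, epsilon=0.1))
```

The result is an order-of-magnitude guide for how much data a model of that size calls for, not a guarantee.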

6.3.1.2. Reducing the number of features (e.g., by PCA)<br>

We have to find a suitable number of features: with too many features, the model can learn noise, which hinders generalization, and the learned weights vary strongly from one training set to another.
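A minimal sketch of feature reduction by PCA, computed here with an SVD of the centered data. The toy data set (10 noisy features generated from 2 latent directions) and the helper name `pca_reduce` are assumptions made for illustration.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]


rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))  # 2 informative directions
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))  # 10 noisy features

Z, explained = pca_reduce(X, n_components=2)
print(Z.shape, explained.sum())
```

Here `explained` holds the fraction of total variance captured by each kept component, which is one common way to decide how many features to keep.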

6.3.1.3. Regularization<br>

Regularization restricts the model complexity. It is based on __Occam's razor__: among the models that explain the data, prefer the simplest one.

- L1 regularization (CH03.02)
- L2 regularization (CH03.02)
- Max-norm constraint, which bounds the weights directly:

$$
\| \mathbf{w} \|_\infty \le u
$$

There are two types of weights. The first type has a significant influence on performance. __The second type has little or no influence, so these weights can cause overfitting.__ Regularization restricts those weights to be zero or close to zero.
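The effect on the weights can be seen in a small example. The sketch below assumes toy data ($\sin(2\pi x)$ plus noise), degree-9 polynomial features, an L2 penalty with $\lambda = 10^{-3}$, and a max-norm bound $u = 5$; all of these values are illustrative. For simplicity the penalty is also applied to the constant term.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.size)  # assumed toy data

Phi = np.vander(x, N=10, increasing=True)  # degree-9 polynomial features


def fit_ridge(Phi, y, lam):
    """Minimize ||Phi w - y||^2 + lam * ||w||^2 using the closed-form solution."""
    n_features = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_features), Phi.T @ y)


def project_max_norm(w, u):
    """Enforce ||w||_inf <= u by clipping every weight into [-u, u]."""
    return np.clip(w, -u, u)


w_unreg = np.linalg.lstsq(Phi, y, rcond=None)[0]  # no regularization
w_ridge = fit_ridge(Phi, y, lam=1e-3)             # L2 regularization
w_clipped = project_max_norm(w_unreg, u=5.0)      # max-norm constraint

print("||w|| unregularized:", np.linalg.norm(w_unreg))
print("||w|| with L2:      ", np.linalg.norm(w_ridge))
print("max |w| after clip: ", np.max(np.abs(w_clipped)))
```

The unregularized degree-9 fit typically needs very large weights to pass through every noisy point. The L2 penalty shrinks them, and the max-norm projection caps each one at $u$.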

6.3.1.4. Dropout<br>


6.3.1.5. Early-stopping<br>

6.3.1.6. Proper model selection<br>

### 6.4. Model selection

### 6.5. Curse of dimensionality