# Chapter.06 Practical Issues in Machine Learning
---

### 6.1. Bias-Variance tradeoff
6.1.1. Bias-Variance decomposition of the MSE<br>
For any supervised learning algorithm, we can decompose the MSE on an unseen sample $\mathbf{x}$ as
$$ \mathbb{E}[(y - \hat{f}(x; D))^2] = (Bias_D[\hat{f}(x; D)])^2 + Var_D[\hat{f}(x; D)] + \sigma^2 $$
$$ \text{where} \quad Bias_D[\hat{f}(x;D)] = \mathbb{E}_D[\hat{f}(x;D)] - f(x) \,\ \text{and} \,\ Var_D[\hat{f}(x;D)] = \mathbb{E}_D[(\mathbb{E}_D[\hat{f}(x ; D)] - \hat{f}(x;D))^2 ] $$

- $Bias_D[\hat{f}(x;D)]$ comes from an improper model or assumption (e.g., when a non-linear function $f(x)$ is approximated with a learning method for linear models, the estimates $\hat{f}(x)$ carry an error due to this assumption).
- $Var_D[\hat{f}(x;D)]$ measures how much the learned $\hat{f}(x;D)$ changes when the training set $D$ changes.
- $\sigma^2$ is the inherent noise (irreducible error).<br><br>

<strong>Proof.</strong><br>
[PDF File too long](./res/ch06/note_bias-variance_tradeoff.pdf)

Therefore, 
$$ \text{MSE} = \mathbb{E}_x \left\{ Bias_D[\hat{f}(x;D)]^2 + Var_D[\hat{f}(x;D)] \right\} + \sigma^2 $$


<img src="./res/ch06/fig_1_1.png" width="400" height="300"><br>
<div align="center">
  Figure.6.1.1
</div>


- The more complex the model is, the more data points it will capture. $\rightarrow$ the lower the bias will be.
- However, the complexity will make the model vary more to capture the data points $\rightarrow$ the larger the variance will be.
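
The trade-off can be checked numerically. The sketch below is only an illustration under assumptions that are not in the text: the true function is $\sin(2\pi x)$, the noise is Gaussian with $\sigma = 0.3$, and the learner is a least-squares polynomial of degree `deg`. It redraws the training set $D$ many times and estimates the squared bias and the variance, both averaged over a grid of inputs.

```python
import numpy as np
from numpy.polynomial import Polynomial

def f(x):
    # Assumed true function (an illustration, not from the text).
    return np.sin(2 * np.pi * x)

def bias_variance(deg, n_train=25, n_sets=200, sigma=0.3, seed=0):
    """Estimate (bias^2, variance), averaged over x, for a degree-`deg` fit."""
    rng = np.random.default_rng(seed)
    x_grid = np.linspace(0.0, 1.0, 50)
    preds = np.empty((n_sets, x_grid.size))
    for s in range(n_sets):
        # Draw a fresh training set D and fit f_hat(x; D).
        x = rng.uniform(0.0, 1.0, n_train)
        y = f(x) + rng.normal(0.0, sigma, n_train)
        preds[s] = Polynomial.fit(x, y, deg)(x_grid)
    mean_pred = preds.mean(axis=0)                  # E_D[f_hat(x; D)]
    bias2 = float(np.mean((mean_pred - f(x_grid)) ** 2))
    var = float(np.mean(preds.var(axis=0)))         # E_x[Var_D[f_hat]]
    return bias2, var

for deg in (0, 1, 3, 9):
    b2, v = bias_variance(deg)
    print(f"degree {deg}: bias^2 = {b2:.4f}, variance = {v:.4f}")
```

With these settings the low-degree fits should show the larger bias and the high-degree fits the larger variance, which is the pattern the two bullets above describe.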

### 6.2. Generalization
6.2.1. Overview<br>
- The ultimate goal of machine learning is __good generalization__.
- The data we observe is only a part of the whole.
- We need to find a model that explains the whole data well using only the given portion.
- In other words, the model should also explain unobserved data well.

Generalization depends on __amount of training data__ and __complexity of model__.

6.2.2. Training and test data sets<br>


<img src="./res/ch06/fig_2_1.png" width="400" height="300"><br>
<div align="center">
  Figure.6.2.1
</div>

- The whole data set is split into two parts: (i) a training set and (ii) a test set.
- The training set is used to train the model (or a machine learning algorithm).
- The test set is used to evaluate the (generalization) performance of the model.

For example, in polynomial curve fitting with training and test sets

<img src="./res/ch06/fig_2_2.png" width="600" height="300"><br>
<div align="center">
  Figure.6.2.2
</div>

<img src="./res/ch06/fig_2_3.png" width="800" height="300"><br>
<div align="center">
  Figure.6.2.3
</div>

In Figure.6.2.3 above, $ M $ is the order of the polynomial, so the model has $ M + 1 $ weights.<br>
When $ M = 0 $, it is under-fitting, because the model has only the constant weight $ w_0 $ and can fit nothing but a horizontal line.<br>
When $ M = 1 $, it is under-fitting, because with only $ w_0 $ and $ w_1 $ the model can fit nothing but a straight line.<br>
When $ M = 3 $, it is well-fitting (best model), because the model represents the nonlinearity of the data well.<br>
When $ M = 9 $, it is over-fitting, because $ M $ is so large that the model memorizes the noise in the training data.
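
The four cases can be reproduced with a small experiment. This is a minimal sketch, assuming data generated from $\sin(2\pi x)$ plus Gaussian noise and 10 training points (these choices and the helper names `rmse` and `fit_and_score` are illustrative, not taken from the figures).

```python
import numpy as np
from numpy.polynomial import Polynomial

def rmse(model, x, y):
    """Root-mean-square error of a fitted polynomial on (x, y)."""
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))

def fit_and_score(degrees=(0, 1, 3, 9), n_train=10, n_test=100,
                  sigma=0.3, seed=0):
    """Return {M: (train_rmse, test_rmse)} for each polynomial order M."""
    rng = np.random.default_rng(seed)
    x_tr = np.linspace(0.0, 1.0, n_train)
    y_tr = np.sin(2 * np.pi * x_tr) + rng.normal(0.0, sigma, n_train)
    x_te = rng.uniform(0.0, 1.0, n_test)
    y_te = np.sin(2 * np.pi * x_te) + rng.normal(0.0, sigma, n_test)
    scores = {}
    for m in degrees:
        model = Polynomial.fit(x_tr, y_tr, m)  # least squares, m + 1 weights
        scores[m] = (rmse(model, x_tr, y_tr), rmse(model, x_te, y_te))
    return scores

scores = fit_and_score()
for m, (tr, te) in scores.items():
    print(f"M = {m}: train RMSE = {tr:.3f}, test RMSE = {te:.3f}")
```

With 10 training points, the $ M = 9 $ polynomial passes through every point, so its training error is essentially zero even though it has fitted the noise.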

### 6.3. Overfitting
So, what is overfitting? An overfitted model is very __good only for training data__ (memorization), but it is __not good for test data__ (poor generalization). <br><br>

6.3.1. How to avoid overfitting?<br>

6.3.1.1. More training data<br>
   
__Widrow's rule of thumb__ 

$$
N = O \left( \frac{W}{\epsilon} \right)
$$

$$
\text{where} \,\ N \,\ \text{is number of training samples,} \,\ W \,\ \text{is total number of parameters,} \,\ \epsilon \,\ \text{is fraction of target error.}
$$

For example, when you set $ \epsilon $ to $ 10\% $, it should be

$$
\epsilon = 0.1 \quad \rightarrow \quad N \ge 10W
$$
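
The arithmetic above can be written as a one-line helper. This is a sketch that assumes the constant hidden in $ O(\cdot) $ is 1, as the example above does; the function name `widrow_min_samples` is hypothetical.

```python
import math

def widrow_min_samples(num_weights, epsilon):
    """Rule-of-thumb sample count N >= W / epsilon (hidden constant assumed 1)."""
    if not 0.0 < epsilon <= 1.0:
        raise ValueError("epsilon must be in (0, 1]")
    return math.ceil(num_weights / epsilon)

# A network with W = 1000 weights and a 10% target error:
print(widrow_min_samples(1000, 0.1))  # 10000, i.e. N >= 10W
```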

6.3.1.2. Reducing the number of features(e.g., by PCA)<br>

We have to find a suitable number of features: with too many features, the model can learn noise, which hinders generalization, and the variance of the weights becomes high.

6.3.1.3. Regularization<br>

Regularization restricts the model complexity. It is based on __Occam's razor__: among models that explain the data, the simplest one is preferred.

- L1-Regularization(CH03.02)
- L2-Regularization(CH03.02)
- Max-Norm Constraint
    
$$
|| \mathbf{w} ||_\infty \le u
$$

There are two types of weights. The first type has a significant influence on performance. __The second type has little or no influence, so these weights can cause overfitting.__ We have to restrict those weights to be zero.
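
Two of the constraints above can be sketched for a linear model. The example assumes L2 regularization in its closed form (ridge regression) and implements the max-norm constraint $ || \mathbf{w} ||_\infty \le u $ as element-wise clipping; the synthetic data and the names `ridge_fit` and `max_norm_project` are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """L2-regularized least squares: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def max_norm_project(w, u):
    """Enforce ||w||_inf <= u by clipping every weight to [-u, u]."""
    return np.clip(w, -u, u)

# Synthetic data: only the first two of eight features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))
true_w = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(0.0, 0.5, 30)

w_ols = ridge_fit(X, y, 0.0)     # no regularization
w_ridge = ridge_fit(X, y, 10.0)  # weights are shrunk toward zero
print(np.round(w_ols, 3))
print(np.round(w_ridge, 3))
print(np.round(max_norm_project(w_ols, 1.0), 3))
```

Note that L2 regularization shrinks weights toward zero without making them exactly zero; driving weights exactly to zero is usually associated with L1 regularization.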

6.3.1.4. Dropout<br>

<img src="./res/ch06/fig_3_1.png" width="600" height="250"><br>
<div align="center">
  Figure.6.3.1
</div>

Dropout prevents co-adaptation among neurons on the training data. During training, each neuron is dropped with probability $ (1 - p) $ and kept with probability $ p $, so every update trains a different reduced network. At test time, all trained nodes are used.
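
A dropout layer can be sketched in a few lines of NumPy. The version below is the "inverted dropout" variant, which is an assumption on my part: it rescales the kept activations by $ 1/p $ during training so that nothing has to change at test time. The function name and signature are hypothetical.

```python
import numpy as np

def dropout(x, p_keep, training, rng):
    """Inverted dropout: keep each unit with probability p_keep."""
    if not training:
        return x  # test time: all trained nodes are used as they are
    mask = rng.random(x.shape) < p_keep  # True = keep, False = drop
    # Rescale so the expected activation matches the test-time value.
    return x * mask / p_keep

rng = np.random.default_rng(0)
activations = np.ones(10)
print(dropout(activations, p_keep=0.8, training=True, rng=rng))
print(dropout(activations, p_keep=0.8, training=False, rng=rng))
```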

6.3.1.5. Early-stopping<br>
When we train our model on some data, we can see a situation like __Figure.6.1.1__. Obviously, we can stop learning early, at the point where the total error is minimal. This is called early-stopping.

<img src="./res/ch06/fig_3_2.png" width="500" height="400"><br>
<div align="center">
  Figure.6.3.2
</div>

It performs validation after a certain number of epochs (e.g., every 5 epochs). After the validation, it checks whether overfitting has started, i.e., whether the validation error has begun to increase. If so, it stops the training; otherwise, it resumes the training process.
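
The stopping rule can be sketched independently of any training framework. The function below takes the sequence of validation losses recorded at each check and uses a `patience` parameter (an assumption not in the text: the number of consecutive checks without improvement that are tolerated before stopping).

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_check, stop_check) as zero-based indices into val_losses."""
    best_loss = float("inf")
    best_idx = -1
    bad_checks = 0
    for idx, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation error improved: remember this point and reset.
            best_loss, best_idx, bad_checks = loss, idx, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                return best_idx, idx  # overfitting suspected: stop here
    return best_idx, len(val_losses) - 1

# Validation loss falls, then rises again.
print(early_stopping_epoch([1.0, 0.8, 0.6, 0.65, 0.7, 0.9]))  # (2, 4)
```

In practice the weights saved at `best_check` are the ones that would be restored after stopping.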

6.3.1.6. Proper model selection<br>
We must select the best model before overfitting. Training error decreases as the model complexity increases. Test error first decreases and then increases if the model exceeds a certain complexity. 

### 6.4. Model selection

So, how do we choose the best model?<br><br>

6.4.1. Model selection with validation<br>
6.4.1.1. Model selection with validation set<br>
__Input__ : a training/test/validation data set, a set $ \Omega $ of models.<br>
__Algorithm__ : <br>

1. For all models in $ \Omega $ : <br>
    1. Train the model using the training data set<br>
    2. Evaluate the performance of the trained model using the validation data set<br>
2. Select the model with the highest performance<br>
3. Evaluate the performance of the chosen model using the test data set<br>

<img src="./res/ch06/fig_4_1.png" width="500" height="200"><br>
<div align="center">
  Figure.6.4.1
</div>

In the figure above, the validation set is used to select the best model or hyperparameters and to decide when to stop the training early.
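
The algorithm above can be sketched as follows, assuming that the set $ \Omega $ consists of polynomials of different orders and that performance is measured by mean squared error (both are illustrative choices, as are the synthetic data and the 60/30/30 split).

```python
import numpy as np
from numpy.polynomial import Polynomial

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

def select_by_validation(degrees, train, val, test):
    """Train every candidate, pick the best on val, report it on test."""
    x_tr, y_tr = train
    fitted = {m: Polynomial.fit(x_tr, y_tr, m) for m in degrees}  # step 1.1
    val_err = {m: mse(fitted[m], *val) for m in degrees}          # step 1.2
    best = min(val_err, key=val_err.get)                          # step 2
    return best, val_err, mse(fitted[best], *test)                # step 3

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 120)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 120)
train, val, test = [(x[i:j], y[i:j]) for i, j in ((0, 60), (60, 90), (90, 120))]

best, val_err, test_err = select_by_validation(range(10), train, val, test)
print(f"selected order: {best}, test MSE: {test_err:.3f}")
```

The test set is touched only once, after the choice is made, so the reported test error is not biased by the selection.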

6.4.1.2. Multifold($ K $-fold) Cross-Validation<br>
__Input__ : a training/test data set, a set $ \Omega $ of models, the number $ K $ of groups.<br>
__Algorithm__ : <br>

1. Divide the training data set into $ K $ groups
2. For all models in $ \Omega $ :
    1. For $ i = 1 $ to $ K $ :
        1. Train the model using the $ (K - 1) $ groups except for the $ i $th group
        2. Evaluate the performance of the trained model using the $ i $th group
    2. Average the performance of the $ K $ trained models
3. Select the model with the highest average performance
4. Evaluate the performance of the chosen model using the test data

<img src="./res/ch06/fig_4_2.png" width="500" height="200"><br>
<div align="center">
  Figure.6.4.2
</div>

It is useful when the amount of training data is not enough. However, each model has to be trained $ K $ times (excessive computation).
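
Steps 1 to 3 of the algorithm above can be sketched with plain NumPy. As an assumption for illustration, the candidate models are polynomials of different orders and performance is mean squared error; the final evaluation on the test data (step 4) is left out for brevity.

```python
import numpy as np
from numpy.polynomial import Polynomial

def kfold_mse(x, y, degree, k, seed=0):
    """Average validation MSE of a degree-`degree` polynomial over k folds."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)                     # step 1: K groups
    errors = []
    for i in range(k):
        val_idx = folds[i]                             # the i-th group
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = Polynomial.fit(x[train_idx], y[train_idx], degree)
        errors.append(np.mean((model(x[val_idx]) - y[val_idx]) ** 2))
    return float(np.mean(errors))                      # average over K models

def select_by_kfold(x, y, degrees, k=5):
    scores = {m: kfold_mse(x, y, m, k) for m in degrees}
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 40)
best, scores = select_by_kfold(x, y, degrees=(0, 1, 3, 5))
print(f"selected order: {best}")
```

The same shuffled folds are reused for every candidate (fixed `seed`), so the models are compared on identical splits.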

6.4.1.3. Bootstrap Model Selection<br>
__Input__ : a training/test data set, a set $ \Omega $ of models, a sampling ratio $ \rho \,\ (0 < \rho < 1) $, an iteration number $ T $<br>
__Algorithm__ : <br>
1. For all models in $ \Omega $ :
    1. For $ i = 1 $ to $ T $ :
        1. Make a new training set $ S^\prime $ by randomly picking $ \rho n $ samples from the original training set $ S $, where $ n $ is the number of samples in $ S $
        2. Train the model using the new training set $ S^\prime $
        3. Evaluate the performance of the trained model using the data in $ S - S^\prime $
    2. Average the performance of the $ T $ trained models
2. Select the model with the highest average performance
3. Evaluate the performance of the chosen model using the test data
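
The inner loop of the algorithm above can be sketched as follows. The text does not say whether $ S^\prime $ is drawn with replacement; this sketch assumes sampling with replacement, as in the classical bootstrap, and evaluates on the samples that were never drawn ($ S - S^\prime $). Polynomial candidates and mean squared error are again illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

def bootstrap_mse(x, y, degree, rho=0.7, T=20, seed=0):
    """Average held-out MSE over T bootstrap resamples of size rho * n."""
    rng = np.random.default_rng(seed)
    n = len(x)
    errors = []
    for _ in range(T):
        picked = rng.choice(n, size=int(rho * n), replace=True)  # S'
        held_out = np.setdiff1d(np.arange(n), picked)            # S - S'
        model = Polynomial.fit(x[picked], y[picked], degree)
        errors.append(np.mean((model(x[held_out]) - y[held_out]) ** 2))
    return float(np.mean(errors))

def select_by_bootstrap(x, y, degrees, rho=0.7, T=20):
    scores = {m: bootstrap_mse(x, y, m, rho, T) for m in degrees}
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 40)
best, scores = select_by_bootstrap(x, y, degrees=(0, 1, 3, 5))
print(f"selected order: {best}")
```

Because $ \rho < 1 $, at most $ \rho n $ distinct samples are drawn, so the held-out set $ S - S^\prime $ is never empty.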


6.4.2. Model selection with criteria<br>
6.4.2.1. Akaike Information Criterion(AIC)<br>
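Akaike's Information Criterion (AIC) allows us to compare the performance of different statistical models. The model with the lowest AIC is preferred.

$$
\text{AIC}(k) = N \log E_k + 2k
$$

$$
\text{where} \,\ N \,\ \text{is the number of training samples,} \,\ E_k \,\ \text{is the modeling error,} \,\ k \,\ \text{is the model order (number of parameters).} 
$$

If $ E_k $ is the mean squared error and the noise is Gaussian, minimizing $ \text{AIC}(k) $ is equivalent, up to a factor of $ -2 $ and an additive constant, to maximizing

$$
\ln p(D | \mathbf{w}_{\text{ML}}) - M
$$

$$
\text{where} \,\ \mathbf{w}_{\text{ML}} \,\ \text{is the maximum likelihood solution and} \,\ M = k \,\ \text{is the number of parameters.}
$$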





6.4.2.2. Minimum Description Length(MDL) Criterion<br>
The Minimum Description Length (MDL) principle is a formalization of Occam's razor: the best hypothesis is the one that achieves the best compression of the data.
In practice, it means finding the model that minimizes the overall cost function.

$$
\text{MDL}(k) = N \log E_k + k \log N
$$

The MDL estimate converges to the true model order as $ N \rightarrow \infty $.
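
Both criteria are straightforward to compute once the modeling error of each candidate is known. In the sketch below the error values are made-up numbers chosen only to illustrate the calculation; they are not results from any experiment.

```python
import math

def aic(n, err, k):
    """AIC(k) = N log E_k + 2k."""
    return n * math.log(err) + 2 * k

def mdl(n, err, k):
    """MDL(k) = N log E_k + k log N."""
    return n * math.log(err) + k * math.log(n)

N = 50
# Hypothetical modeling errors E_k for three model orders k (illustrative only).
errors = {1: 0.30, 3: 0.10, 9: 0.09}

best_aic = min(errors, key=lambda k: aic(N, errors[k], k))
best_mdl = min(errors, key=lambda k: mdl(N, errors[k], k))
print(best_aic, best_mdl)  # 3 3
```

For $ N \ge 8 $ we have $ \log N > 2 $, so MDL penalizes each extra parameter more heavily than AIC and tends to choose the same or a smaller model order.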

### 6.5. Curse of dimensionality
When training a model from data, more independent samples lead to better training, whereas a larger dimension makes training harder and requires more data. This is called the __curse of dimensionality__. For a fixed number of samples, the data rapidly becomes sparse as the dimension increases. <br><br>

For example, K-NN regression with the Euclidean distance metric suffers from it. One option is to use the Mahalanobis distance metric, which takes the covariance of the data into account.<br><br>
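
The difference between the two metrics can be seen in a short sketch. The covariance matrix below is an assumed example value; in practice it would be estimated from the training data.

```python
import numpy as np

def euclidean(a, b):
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(diff @ diff))

def mahalanobis(a, b, cov):
    """sqrt((a - b)^T cov^{-1} (a - b)) for a given covariance matrix."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Assumed covariance: the first feature varies much more than the second.
cov = np.array([[4.0, 0.0],
                [0.0, 1.0]])
print(euclidean([2.0, 0.0], [0.0, 0.0]))         # 2.0
print(mahalanobis([2.0, 0.0], [0.0, 0.0], cov))  # 1.0
```

A step of 2 along the high-variance feature counts as only one standard deviation, so the Mahalanobis distance is smaller than the Euclidean one. With the identity matrix as covariance, the two metrics coincide.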

As another example, consider polynomial curve fitting of order $ M = 1, 2, 3 $ for a $ D $-dimensional input.<br>
When $ M = 1 $,

$$
y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{i = 1}^{D} w_i x_i
$$

When $ M = 2 $,

$$
y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{i = 1}^{D} w_i x_i + \sum_{i = 1}^{D} \sum_{j = 1}^{D} w_{ij} x_i x_j
$$

When $ M = 3 $,

$$
y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{i = 1}^{D} w_i x_i + \sum_{i = 1}^{D} \sum_{j = 1}^{D} w_{ij} x_i x_j + \sum_{i = 1}^{D} \sum_{j = 1}^{D} \sum_{k = 1}^{D} w_{ijk} x_i x_j x_k
$$

Excluding the bias $ w_0 $, these have $ D, \,\ D^2 + D, \,\ \text{and} \,\ D^3 + D^2 + D $ parameters, respectively. Therefore, the number of parameters grows as follows.

$$
\text{Number of parameters} \propto D^M
$$
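
The growth can be tabulated directly. The first function reproduces the count used above ($ D^m $ weights for each order $ m $, bias excluded). The second is an addition of mine: because $ x_i x_j = x_j x_i $, many of those weights multiply the same monomial, and the number of independent coefficients (bias included) is the binomial coefficient $ \binom{D + M}{M} $, which still grows like $ D^M $ for fixed $ M $.

```python
from math import comb

def naive_param_count(D, M):
    """D^m weights per order m = 1..M, bias w_0 excluded (as in the text)."""
    return sum(D ** m for m in range(1, M + 1))

def independent_param_count(D, M):
    """Distinct monomials of degree <= M in D variables, bias included."""
    return comb(D + M, M)

for D in (3, 10, 100):
    print(D, naive_param_count(D, 3), independent_param_count(D, 3))
```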

Let's think from a geometric viewpoint. Let $ V_D(r) $ be the volume of a $ D $-dimensional sphere of radius $ r $, which is proportional to $ r^D $. <br>
Then the fraction of the volume of the unit sphere that lies in the shell between radius $ 1 - e $ and $ 1 $ is as follows.

$$
\frac{V_D(1) - V_D(1 - e)}{V_D(1)} = 1 - (1 - e)^D \rightarrow 1 \quad \text{as} \quad D \rightarrow \infty
$$

Therefore, in high dimensional spaces, most of the volume of a sphere is concentrated in a thin shell near the surface. The following is the relationship between $ D $ and $ e $.

<img src="./res/ch06/fig_5_1.png" width="300" height="300"><br>
<div align="center">
  Figure.6.5.1.
</div>
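
The shell fraction $ 1 - (1 - e)^D $ is easy to evaluate for a few dimensions. This is a minimal sketch; the function name `shell_fraction` and the chosen values of $ D $ and $ e $ are illustrative.

```python
def shell_fraction(D, e):
    """Fraction of a unit D-sphere's volume lying within distance e of its surface."""
    return 1.0 - (1.0 - e) ** D

for D in (1, 2, 5, 20, 200):
    print(f"D = {D:3d}: shell of thickness 0.05 holds "
          f"{shell_fraction(D, 0.05):.3f} of the volume")
```

Even for a shell only 5% as thick as the radius, the fraction approaches 1 once $ D $ reaches the hundreds.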


Not all intuitions developed in low-dimensional spaces generalize to high-dimensional spaces. Still, although high-dimensional spaces are not easy to deal with, some techniques are applicable and effective specifically in them (e.g., the kernel trick).


<strong>Reference.</strong><br>
https://en.wikipedia.org/wiki/File:ML_dataset_training_validation_test_sets.png<br>
https://en.wikipedia.org/wiki/Akaike_information_criterion<br>