# Install the necessary dependencies

import os
import sys
!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython

---
license:
    code: MIT
    content: CC-BY-4.0
github: https://github.com/ocademy-ai/machine-learning
venue: By Ocademy
open_access: true
bibliography:
  - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib
---

# Model selection

## Over-fitting and under-fitting

### Overview

Remember that the main objective of any machine learning model is to generalize from the training data, so that it can make accurate predictions on unseen data. As you may notice, 'overfitting' and 'underfitting' are both the opposite of 'generalization': overfitting and underfitting models don't generalize well, which results in poor performance.

Here are examples of over-fitting and under-fitting in regression:

:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/under_over_justalright.png
---
name: Over-fitting-regression-ms
---
Over-fitting and under-fitting in regression
:::

### Underfitting

* Underfitting occurs when a machine learning model doesn't fit the training data well enough. It is usually caused by a function that is too simple to capture the underlying trend in the data.
* Underfitting models have high error on both the training set and the test set. This behavior is called 'high bias'.
* This usually happens when we try to fit a linear function to non-linear data.
* Since underfitting models don't perform well on the training set, underfitting is easy to detect.

#### How To Avoid Underfitting?
* Increase the model complexity, e.g. if a linear function underfits, try using polynomial features.
* Increase the number of features through feature engineering.
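
As a rough illustration of the first point, the sketch below fits a straight line and a quadratic to synthetic data. The data-generating function `y = x^2 + noise`, the split sizes, and the helper `mse` are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic non-linear data (an assumption for illustration): y = x^2 plus noise.
x = rng.uniform(-3, 3, size=200)
y = x**2 + rng.normal(scale=0.5, size=x.shape)

# Simple hold-out split: first 150 points for training, the rest for testing.
x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]


def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


errors = {}
for degree in (1, 2):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree}: train MSE={errors[degree][0]:.3f}, "
          f"test MSE={errors[degree][1]:.3f}")
```

The degree-1 model should show a high error on both splits, which is the signature of underfitting, while adding the squared feature brings both errors down.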

### Overfitting
* Overfitting occurs when a machine learning model tries to fit the training data too well. It is usually caused by an overly complicated function that creates lots of unnecessary curves and angles unrelated to the underlying trend, and ends up capturing the noise in the data.
* Overfitting models have low error on the training set but high error on the test set. This behavior is called 'high variance'.

#### How To Avoid Overfitting?
* Since an overfitting model captures the noise in the data, reducing the number of features helps. We can manually select only the important features, or use a feature selection algorithm to do so.
* We can also use 'regularization'. It works well when we have lots of slightly useful features. The scikit-learn linear models `Ridge` and `Lasso` use the regularization parameter `alpha` to control the size of the coefficients by imposing a penalty. Please refer to the tutorials below for more details.
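
A minimal sketch of the regularization point, assuming synthetic data in which only two of ten features carry signal. The `alpha` values are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 10 features, only the first two matter.
X = rng.normal(size=(100, 10))
true_coef = np.array([3.0, -2.0] + [0.0] * 8)
y = X @ true_coef + rng.normal(scale=0.5, size=100)

models = {
    "ols": LinearRegression(),     # no penalty
    "ridge": Ridge(alpha=10.0),    # L2 penalty shrinks all coefficients
    "lasso": Lasso(alpha=0.5),     # L1 penalty can set coefficients exactly to zero
}

coef_norms = {}
for name, model in models.items():
    model.fit(X, y)
    coef_norms[name] = float(np.linalg.norm(model.coef_))
    n_zero = int(np.sum(model.coef_ == 0.0))
    print(f"{name:5s} ||coef||={coef_norms[name]:.3f}  zero coefficients={n_zero}")

lasso_zeros = int(np.sum(models["lasso"].coef_ == 0.0))
```

Larger `alpha` values shrink the coefficients more strongly. In practice `alpha` is usually tuned on held-out data rather than fixed by hand.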

### Good Fitting 
* It is the sweet spot between an underfitting and an overfitting model.
* A good-fitting model generalizes what it learned from the training data and provides accurate predictions on new data.
* To get a good-fitting model, keep training and testing the model until you reach the minimum train and test error. The important quantity here is the test error, because a low train error alone may be a sign of overfitting, so always keep an eye on how the test error changes. The sweet spot is just before the test error starts to rise.
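
One way to look for this sweet spot is to sweep the model complexity and watch the held-out error. The sketch below does this for polynomial degree on synthetic data; the sine-wave target, the split sizes, and the degree range are assumptions for illustration, and the hold-out set plays the role of the 'test error' described above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (an assumption for illustration): a sine wave plus noise on [-1, 1].
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.shape)
x_train, y_train = x[:30], y[:30]
x_val, y_val = x[30:], y[30:]


def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


degrees = list(range(1, 11))
train_err, val_err = [], []
for d in degrees:
    coeffs = np.polyfit(x_train, y_train, deg=d)
    train_err.append(mse(coeffs, x_train, y_train))
    val_err.append(mse(coeffs, x_val, y_val))
    print(f"degree={d:2d}  train MSE={train_err[-1]:.4f}  held-out MSE={val_err[-1]:.4f}")

# Pick the complexity with the lowest held-out error, not the lowest training error.
best_degree = degrees[int(np.argmin(val_err))]
print("degree with lowest held-out error:", best_degree)
```

The training error can only go down as the degree grows, so it cannot be used on its own to choose the model; the held-out error is what reveals where overfitting starts.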

Now let's take a look at another example, hoping it will be helpful for your understanding.

:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/classification.png
---
name: Over-fitting-classification-ms
---
Over-fitting and under-fitting in classification
:::


### A simple example of linear regression 

This is a simple graphical representation of linear regression training. 
:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/bias-variance-datapoints.jpg
---
name: Datapoints-ms
---
Training data points 
:::

:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/bias-variance-overfitting.jpg
---
name: Over-fitting-train-ms
---
Over-fitting model fits very well on training data
:::


:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/bias-variance-overfitting-testdata.jpg
---
name: Over-fitting-test-ms
---
Over-fitting model fits poorly on test data 
:::


:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/bias-variance-underfitting.jpg
---
name: Under-fitting-train-ms
---
Under-fitting model fits poorly on training data
:::

:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/bias-variance-underfitting-test-data.jpg
---
name: Under-fitting-test-ms
---
Under-fitting model fits poorly on test data
:::


:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/bias-variance-perfect-fit.jpg
---
name: Perfect-fitting-train-ms
---
Perfect-fitting model fits well on training data
:::

:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/bias-variance-perfect-fit-test-data.jpg
---
name: Perfect-fitting-test-ms
---
Perfect-fitting model fits well on test data
:::


## Bias variance tradeoff

:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/graphicalillustration.png
---
name: graphicalillustration-ms
---
Graphical illustration of variance and bias
:::



:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/total_error.png
---
name: Model-complexity-ms
---
Model complexity vs. error
:::

## Interpreting the Learning Curves

You might think about the information in the training data as being of two kinds: *signal* and *noise*. The signal is the part that generalizes, the part that can help our model make predictions from new data. The noise is the part that is *only* true of the training data; the noise is all of the random fluctuation that comes from real-world data, or all of the incidental, non-informative patterns that can't actually help the model make predictions. The noise is the part that might look useful but really isn't.

We train a model by choosing weights or parameters that minimize the loss on a training set. You might know, however, that to accurately assess a model's performance, we need to evaluate it on a new set of data, the *validation* data. 

When we train a model, we've been plotting the loss on the training set epoch by epoch. To this we'll add a plot of the loss on the validation data too. We call these plots the **learning curves**. To train deep learning models effectively, we need to be able to interpret them.

Now, the training loss will go down either when the model learns signal or when it learns noise. But the validation loss will go down only when the model learns signal. (Whatever noise the model learned from the training set won't generalize to new data.) So, when a model learns signal both curves go down, but when it learns noise a *gap* is created in the curves. The size of the gap tells you how much noise the model has learned.

Ideally, we would create models that learn all of the signal and none of the noise. This will practically never happen. Instead we make a trade. We can get the model to learn more signal at the cost of learning more noise. So long as the trade is in our favor, the validation loss will continue to decrease. After a certain point, however, the trade can turn against us, the cost exceeds the benefit, and the validation loss begins to rise.

This trade-off indicates that there can be two problems that occur when training a model: not enough signal or too much noise. **Underfitting** the training set is when the loss is not as low as it could be because the model hasn't learned enough *signal*. **Overfitting** the training set is when the loss is not as low as it could be because the model learned too much *noise*. The trick to training deep learning models is finding the best balance between the two.

We'll look at a couple of ways of getting more signal out of the training data while reducing the amount of noise.

## Capacity

A model's **capacity** refers to the size and complexity of the patterns it is able to learn. For neural networks, this will largely be determined by how many neurons it has and how they are connected together. If it appears that your network is underfitting the data, you should try increasing its capacity.

You can increase the capacity of a network either by making it *wider* (more units to existing layers) or by making it *deeper* (adding more layers). Wider networks have an easier time learning more linear relationships, while deeper networks prefer more nonlinear ones. Which is better just depends on the dataset.

You'll explore how the capacity of a network can affect its performance in the exercise.

## Early Stopping

We mentioned that when a model is too eagerly learning noise, the validation loss may start to increase during training. To prevent this, we can simply stop the training whenever it seems the validation loss isn't decreasing anymore. Interrupting the training this way is called **early stopping**.

Once we detect that the validation loss is starting to rise again, we can reset the weights back to where the minimum occurred. This ensures that the model won't continue to learn noise and overfit the data.

Training with early stopping also means we're in less danger of stopping the training too early, before the network has finished learning signal. So besides preventing overfitting from training too long, early stopping can also prevent *underfitting* from not training long enough. Just set your training epochs to some large number (more than you'll need), and early stopping will take care of the rest.
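
The logic can be sketched without any deep learning framework. The example below trains a linear model on polynomial features with plain gradient descent and stops once the validation loss has not improved for a while. The dataset, the `patience` and `min_delta` values, and the helper names are assumptions for illustration, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def features(x, degree=9):
    """Polynomial features x^0 .. x^degree, giving the linear model plenty of capacity."""
    return np.vander(x, degree + 1, increasing=True)


# Small, noisy synthetic dataset (an assumption for illustration).
x_train = rng.uniform(-1, 1, size=15)
x_val = rng.uniform(-1, 1, size=50)
y_train = np.sin(3 * x_train) + rng.normal(scale=0.3, size=x_train.shape)
y_val = np.sin(3 * x_val) + rng.normal(scale=0.3, size=x_val.shape)
X_train, X_val = features(x_train), features(x_val)


def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))


w = np.zeros(X_train.shape[1])
learning_rate, max_epochs = 0.05, 5000   # "some large number" of epochs
patience, min_delta = 50, 1e-5           # how long to wait, and what counts as an improvement

best_val, best_w, best_epoch, wait = np.inf, w.copy(), 0, 0
train_history, val_history = [], []

for epoch in range(max_epochs):
    # One full-batch gradient descent step on the training MSE.
    grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w = w - learning_rate * grad

    train_history.append(mse(X_train, y_train, w))
    val_history.append(mse(X_val, y_val, w))

    if val_history[-1] < best_val - min_delta:
        # New best validation loss: remember the weights and reset the counter.
        best_val, best_w, best_epoch, wait = val_history[-1], w.copy(), epoch, 0
    else:
        wait += 1
        if wait >= patience:
            break  # validation loss has stopped improving

w = best_w  # reset the weights back to where the minimum occurred
print(f"ran {len(val_history)} epochs; best epoch={best_epoch}, best val MSE={best_val:.4f}")
```

Plotting `train_history` and `val_history` against the epoch number gives the learning curves discussed earlier.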

## Adding Early Stopping

In Keras, we include early stopping in our training through a callback. A **callback** is just a function you want run every so often while the network trains. The early stopping callback will run after every epoch. (Keras has [a variety of useful callbacks](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks) pre-defined, but you can [define your own](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LambdaCallback), too.)

:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/traintestoverfitting.png
---
name: EarlyStopping-ms
---
Early stopping
:::

## L1 and L2 Regularization

You may be familiar with the Occam's Razor principle: given two explanations for something, the explanation most likely to be correct is the 'simplest' one, the one that makes the fewest assumptions. This also applies to the models learned by neural networks: given some training data and a network architecture, there are multiple sets of weight values (multiple models) that could explain the data, and simple models are less likely to overfit than complex ones.

A 'simple model' in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters altogether, as we saw in the section above). Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more 'regular'. This is called 'weight regularization', and it is done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:

- L1 regularization, where the cost added is proportional to the absolute value of the weight coefficients (i.e. to what is called the 'L1 norm' of the weights).
- L2 regularization, where the cost added is proportional to the square of the value of the weight coefficients (i.e. to the squared 'L2 norm' of the weights). L2 regularization is also called weight decay in the context of neural networks. Don't let the different name confuse you: for plain gradient descent, weight decay is mathematically the same as L2 regularization.

In `tf.keras`, weight regularization is added by passing weight regularizer instances to layers as keyword arguments. Let's add L2 weight regularization now.

$$L2\ Loss = Loss + \textcolor{red}{\lambda}\sum_{i} w_i^2$$

$$L1\ Loss = Loss + \textcolor{red}{\lambda}\sum_{i} \lvert w_i \rvert$$
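
To make the two formulas concrete, the sketch below evaluates both penalties for a small weight vector and checks the link between L2 regularization and weight decay for one plain gradient-descent step. The values of `base_loss` and `base_grad` are hypothetical placeholders standing in for an unregularized loss and its gradient.

```python
import numpy as np

w = np.array([0.5, -1.5, 2.0])           # example weights (made-up values)
lam = 0.01                               # regularization strength, lambda in the formulas
base_loss = 1.0                          # hypothetical unregularized loss value
base_grad = np.array([0.1, -0.2, 0.3])   # hypothetical gradient of the unregularized loss

l2_loss = base_loss + lam * np.sum(w ** 2)      # Loss + lambda * sum(w_i^2)
l1_loss = base_loss + lam * np.sum(np.abs(w))   # Loss + lambda * sum(|w_i|)

# Gradients of the penalty terms with respect to the weights.
l2_penalty_grad = 2 * lam * w
l1_penalty_grad = lam * np.sign(w)

# One plain gradient-descent step on the L2-regularized loss ...
lr = 0.1
w_l2_step = w - lr * (base_grad + l2_penalty_grad)
# ... equals shrinking ("decaying") the weights, then taking the unregularized step.
w_decay_step = (1 - 2 * lr * lam) * w - lr * base_grad

print("L2 loss:", l2_loss, " L1 loss:", l1_loss)
print("same update:", np.allclose(w_l2_step, w_decay_step))
```

The L2 gradient is proportional to each weight, so large weights are pulled back hardest, while the L1 gradient has constant magnitude, which is what tends to push small weights all the way to zero.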




:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/circlesquare.png
---
name: circlesquare-ms
---
L1 and L2 regularization
:::


:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/L1L2contour.png
---
name: explainedairegularization-ms
---
L1 and L2 regularization  
:::



:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/ridgelassoItayEvron.gif
---
name: berkeley189s21-ms
---
Different $\beta$ and ellipses 
:::




:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/p-norm_balls.webp
---
name: p-norm_balls
---
Different p norm 
:::




:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/elastic_net_balls.webp
---
name: ElasticNet-ms
---
ElasticNet 
:::


### The impact of the value of $\lambda$ 

:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/lagrange-animation.gif
---
name: impact-of-lambda-ms
---
The impact of the value of $\lambda$  
:::


## Dropout

Dropout is one of the most effective and most commonly used regularization techniques for neural networks, developed by Hinton and his students at the University of Toronto. Dropout, applied to a layer, consists of randomly "dropping out" (i.e. setting to zero) a number of output features of the layer during training. Let's say a given layer would normally have returned a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a given input sample during training; after applying dropout, this vector will have a few zero entries distributed at random, e.g. [0, 0.5, 1.3, 0, 1.1]. The 'dropout rate' is the fraction of the features that are being zeroed out; it is usually set between 0.2 and 0.5. At test time, no units are dropped out; instead, the layer's output values are scaled down by a factor equal to the keep probability (1 minus the dropout rate), to compensate for the fact that more units are active than at training time.

In `tf.keras` you can introduce dropout in a network via the `Dropout` layer, which is applied to the output of the layer right before it.
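
The mechanics can be sketched in a few lines of NumPy, using the example vector from the paragraph above. The function names `dropout_train` and `dropout_test` are hypothetical helpers for this illustration and are not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)


def dropout_train(x, rate, rng):
    """Training time: zero each entry independently with probability `rate`."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask


def dropout_test(x, rate):
    """Test time: keep every entry but scale by the keep probability (1 - rate)."""
    return x * (1.0 - rate)


layer_output = np.array([0.2, 0.5, 1.3, 0.8, 1.1])
rate = 0.4

dropped = dropout_train(layer_output, rate, rng)
scaled = dropout_test(layer_output, rate)

print("training-time output:", dropped)   # some entries are zero, chosen at random
print("test-time output:    ", scaled)    # every entry multiplied by 0.6
```

Many frameworks, including the `tf.keras` `Dropout` layer, use the equivalent "inverted" variant: the kept entries are scaled up by 1 / (1 - rate) during training, so nothing needs to be rescaled at test time.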

:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/dropoutgif.gif
---
name: Dropout-ms
---
Dropout 
:::



### Prediction after dropout

:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/kUc8r.jpg
---
name: Prediction-after-dropout-ms
---
Prediction after dropout 
:::



During training, a fraction p of the neuron activations (usually p=0.5, so 50%) is dropped. Dropping activations at the testing stage is not what we want (the goal is better generalization). On the other hand, keeping all activations produces an input that the network does not expect: with p=0.5, the following layer receives input activations that are, on average, about twice as large as during training.

Consider the neurons at the output layer. During training, each neuron usually gets activations from only two of the four hidden-layer neurons it is connected to, due to dropout. Now imagine we finish training and remove dropout. The activations of the output neurons are now computed from all four hidden-layer values. This is likely to push the output neurons into an unusual regime, where they are over-excited and produce values that are too large in absolute terms.

To avoid this, the trick is to multiply the weights of the last layer's input connections by 1-p (so, by 0.5). Alternatively, one can multiply the outputs of the hidden layer by 1-p, which is equivalent.
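
A small worked check of this argument, assuming one output neuron connected to four hidden neurons. The activation and weight values are made up for illustration.

```python
import itertools

import numpy as np

p = 0.5                                      # dropout probability used during training
hidden = np.array([1.0, 2.0, 3.0, 4.0])      # four hidden activations (made-up values)
weights = np.array([0.5, -0.25, 0.1, 0.3])   # weights into one output neuron (made-up values)

# Expected input to the output neuron during training:
# average over all 2^4 dropout masks, each weighted by its probability.
expected_train = 0.0
for mask in itertools.product([0, 1], repeat=len(hidden)):
    mask = np.array(mask)
    prob = np.prod(np.where(mask == 1, 1 - p, p))
    expected_train += prob * np.dot(weights, hidden * mask)

# At test time all four hidden neurons are active.
test_unscaled = np.dot(weights, hidden)           # twice the expected training-time input
test_scaled = np.dot(weights * (1 - p), hidden)   # weights multiplied by 1 - p

print("expected training input:", expected_train)
print("test input, unscaled:   ", test_unscaled)
print("test input, scaled:     ", test_scaled)
```

Scaling the weights by 1-p brings the test-time input back to the average level the output neuron saw during training.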


## Conclusions


:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/ZahidHasan.png
---
name: Training-size-matters-ms
---
Training size matters
:::


:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/steps.png
---
name: Steps-ms
---
How to choose a good model
:::

:::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/model-selection/Bias-vs.webp
---
name: Conclusion-ms
---
Conclusion 
:::

## Your turn! 🚀

Machine learning model selection and dealing with overfitting and underfitting are crucial aspects of the machine learning pipeline. In this assignment, you'll have the opportunity to apply your understanding of these concepts and techniques. Please complete the following tasks:
[assignment](../assignments/ml-advanced/model-selection/model-selection-assignment-1)