import utils
import warnings

warnings.filterwarnings('ignore')
utils.set_css_style('style.css')

# 1. Bias & Variance

Let's look at the following simple example. If you fit a straight line to the data, for instance with a simple linear regression, the fit is not very good. This is a problem of high bias; we also say that the model is underfitting the data. At the opposite end, if you fit an incredibly complex regressor, such as a deep neural network or a neural network with many hidden units, you may fit the data perfectly, but that doesn't look like a great fit either. Such a regressor has high variance, and we say that the model is overfitting the data. A model in between, with a medium level of complexity, fits the data correctly.

As mentioned earlier, in machine learning we usually split our data into two subsets, training data and testing data (and sometimes into three: train, validation and test), and fit our model on the training data in order to make predictions on the test data. When we do that, one of two things might happen: we overfit our model or we underfit it.

## 1.1. Overfitting

**Overfitting** means that the model we trained has trained “too well” and is now fit too closely to the training dataset. This is also called a **high variance** problem. It usually happens when the model is too complex (for example, too many features/variables compared to the number of observations). Such a model will be very accurate on the training data but will probably be inaccurate on new, unseen data. This is because the model does not generalize and cannot make reliable inferences on new data, which is ultimately what you are trying to do. Basically, when this happens, the model learns the “noise” in the training data instead of the actual relationships between variables in the data.

## 1.2. Underfitting

In contrast to overfitting, when a model is underfitted, it means that the model does not fit the training data and therefore misses the trends in the data. It also means the model cannot be generalized to new data. As you probably guessed (or figured out!), this is usually the result of a very simple model (not enough predictors/independent variables). It could also happen when, for example, we fit a linear model (like linear regression) to data that is not linear. It almost goes without saying that this model will have poor predictive ability on the training data.

<img src="figures/bias-variance.png" alt="bias-variance" style="width: 700px;"/>

One way to check if your model is suffering from high bias or high variance is to compare the training error with the validation set error.


## 1.3. Bias & variance diagnosis

We need to distinguish whether bias or variance is the problem contributing to bad predictions.

The training error will tend to decrease as we increase the complexity of the model (for example the degree $d$ of the polynomial of a linear regression model), whereas the validation/test error will tend to decrease as we increase the complexity up to a point, and then it will increase as complexity is increased, forming a convex curve.

* **High bias** (underfitting): both $J_{train}(w)$ and $J_{test}(w)$ will be high. Also, $J_{test}(w) \approx J_{train}(w)$.

* **High variance** (overfitting): $J_{train}(w)$ will be low and $J_{test}(w)$ will be much greater than $J_{train}(w)$.

This is summarized in the figure below:

<img src="figures/bias-variance-diagnosis.png" alt="bias-variance-diagnosis" style="width: 400px;"/>
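As a rough illustration of this diagnosis, the sketch below fits polynomials of degree 1, 3 and 9 to a synthetic noisy sine curve and prints the training and validation costs for each. The data, the chosen degrees and the helper `mse_cost` are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a noisy sine curve
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 60)

# Hold out part of the data for validation
x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]


def mse_cost(coeffs, x, y):
    """J = 1/(2m) * sum of squared residuals of a polynomial model."""
    residuals = np.polyval(coeffs, x) - y
    return np.mean(residuals ** 2) / 2


errors = {}
for d in (1, 3, 9):
    # Least-squares polynomial fit of degree d on the training set only
    coeffs = np.polyfit(x_train, y_train, deg=d)
    errors[d] = (mse_cost(coeffs, x_train, y_train),
                 mse_cost(coeffs, x_val, y_val))
    print(f"d={d}  J_train={errors[d][0]:.4f}  J_val={errors[d][1]:.4f}")
```

The training cost can only go down as $d$ grows, since each polynomial family contains the previous one. For $d = 1$ both costs stay high and close to each other (high bias). Whether the validation cost rises again for large $d$ depends on the noise level and the number of training points.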



## 1.4. Bias & variance correction

In the previous section, we saw how looking at training error and validation error can help you diagnose whether your algorithm has a bias or a variance problem, or maybe both. Knowing whether your model is overfitting or underfitting your data helps you take the correct measures in order to improve your algorithms' performance systematically.

If your algorithm has a high bias, the following are some of the possible remedies:
* Try to make your model more complex (for example a bigger NN: more hidden units, more layers).
* Add more features if possible.
* Try a different model that is suitable for your data.
* Train your model longer.

On the other hand, if your algorithm has a high variance, you can:
* Get more data.
* Use regularization.
* Try a different model that is suitable for your data.

<img src="figures/bias-variance-tradeoff.jpeg" alt="bias-variance" style="width: 600px;"/>

# 2. Regularization

Consider the problem of predicting $y$ from $x \in R$. The leftmost figure below shows the result of fitting $y = w_0+w_1 x$ to a dataset. We see that the data doesn’t really lie on a straight line, and so the fit is not very good.

Instead, if we had added an extra feature $x^2$, and fit $y = w_0 + w_1 x + w_2 x^2$, then we obtain a slightly better fit to the data (see middle figure). Naively, it might seem that the more features we add, the better. However, there is also a danger in adding too many features: the rightmost figure is the result of fitting a $5^{th}$ order polynomial:

\begin{equation}
h_{w}(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5  
\end{equation}

We see that even though the fitted curve passes through the data perfectly, we would not expect this to be a very good predictor. This is a problem of **overfitting**.
 
<img src="figures/reg_example.png" alt="reg-example" style="width: 700px;"/>
  
If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their corresponding cost. Let's suppose we want to reduce the influence of $w_4 x^4$ and $w_5 x^5$. Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function, and the optimisation problem becomes:
  
\begin{equation}
\min_{w} \dfrac {1}{2m} \sum _{i=1}^m \left (h_w (x_{i}) - y_{i} \right)^2 + 1000 \times w_4^2 + 1000 \times w_5^2
\end{equation}

The reason we've added two extra terms at the end is to inflate the cost of $w_4$ and $w_5$ in order to reduce the impact of the corresponding features. Now, in order for the cost function to stay small, we will have to reduce the values of $w_4$ and $w_5$ to near zero. As a result, the new hypothesis is smoother and may fit the underlying trend better.
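To see this effect numerically, here is a minimal sketch that fits a 5th order polynomial with and without the two extra terms. It assumes synthetic data drawn from a noisy quadratic and solves the minimisation in closed form by setting the gradient to zero; the helper `fit` and the data are illustrative, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumed for illustration): a noisy quadratic
x = np.linspace(-1, 1, 8)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.1, x.size)
m = x.size

# Design matrix with columns 1, x, x^2, ..., x^5
X = np.vander(x, N=6, increasing=True)


def fit(penalties):
    """Minimise 1/(2m) * ||Xw - y||^2 + sum_j penalties[j] * w_j^2.

    Setting the gradient to zero gives (X^T X / m + 2 P) w = X^T y / m,
    where P is the diagonal matrix of per-coefficient penalties.
    """
    P = np.diag(penalties)
    return np.linalg.solve(X.T @ X / m + 2 * P, X.T @ y / m)


w_plain = fit(np.zeros(6))
w_penalised = fit(np.array([0.0, 0.0, 0.0, 0.0, 1000.0, 1000.0]))

print("no penalty:       ", np.round(w_plain, 3))
print("penalised w4, w5: ", np.round(w_penalised, 3))
```

With the penalty of 1000 on $w_4$ and $w_5$, both coefficients are pushed very close to zero, while the remaining coefficients are free to fit the data.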
 
We could also **regularize** all of our $w$ parameters in a single summation as:

\begin{equation}
\min_{w} \dfrac {1}{2m} \sum _{i=1}^m \left (h_w (x_{i}) - y_{i} \right)^2 + \lambda \sum_{j=1}^n w_j^2
\end{equation}

 
The $\lambda$, or lambda, is the **regularization parameter**. It determines how much the costs of our $w$ parameters are inflated.

Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If $\lambda$ is chosen to be too large, it may smooth out the function too much and cause underfitting. 

There are many other regularization techniques; the one above is known as L2 regularization. Other regularization types include:

* Early Stopping
* Parameter Norm Penalties 
    * L1 regularization
    * L2 regularization
    * Max-norm regularization
    * Dropout regularization
* Dataset Augmentation
* Noise Robustness (Dropout..)
* Sparse Representations
* ...

For more information, you can [check this article](https://medium.com/inveterate-learner/deep-learning-book-chapter-7-regularization-for-deep-learning-937ff261875c)!

## 2.1. Regularization: Parameter Norm Penalties

As we add more and more parameters to our model, its complexity increases, which results in increasing variance and decreasing bias, i.e., **overfitting**.

In the previous section, we have seen that in order to overcome underfitting or high bias, we can add new parameters to our model so that the model complexity increases, thus reducing the bias. Now, how can we overcome overfitting for a model?

Basically, there are two methods to overcome overfitting:

* Reduce the model complexity
* **Regularization**

As we have seen before, in regularization, the cost function is penalized in order to reduce the values of our model parameters. A regularized regression cost function can then take the following form:

$$ J(w_0,w_1)\ =\ \frac{1}{2m}\sum_{i=1}^m (w_0+w_1x_i-y_i)^2 + P(\lambda,w)$$

We have different types of regularization techniques to overcome this problem. 

### 2.1.1. L2 Regularization

With L2 Regularization, the loss function is augmented in such a way that we not only minimize the sum of squared residuals but also penalize the size of parameter estimates, in order to shrink them towards zero:

$$ J(w)\ =\ \frac{1}{2m}\sum_{i=1}^m (h(x_i)-y_i)^2 +\ \lambda\sum_{j=1}^nw_j^2$$

Notice the extra term, which is known as the penalty term. By changing the value of $\lambda$ (the regularization coefficient), we control the penalty term. The higher the value of $\lambda$, the bigger the penalty, and therefore the smaller the magnitude of the coefficients.

Important Points about L2 Regularization:

* It shrinks the parameters, which makes it useful for mitigating the effects of multicollinearity.
* It reduces the model complexity by coefficient shrinkage.
* It's also called Ridge regularization. 
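The shrinkage effect can be sketched with plain gradient descent on the L2-regularized cost above. This is a minimal illustration under stated assumptions: the data is synthetic, the bias $w_0$ is left unpenalized (matching the sum starting at $j=1$), and the function names, learning rate and step count are choices made for this example.

```python
import numpy as np


def ridge_cost(w, X, y, lam):
    """J(w) = 1/(2m) * sum (h(x_i) - y_i)^2 + lam * sum_{j>=1} w_j^2."""
    m = len(y)
    residuals = X @ w - y
    return residuals @ residuals / (2 * m) + lam * np.sum(w[1:] ** 2)


def ridge_gradient(w, X, y, lam):
    """Gradient of ridge_cost; the bias w[0] is not penalized."""
    m = len(y)
    grad = X.T @ (X @ w - y) / m
    grad[1:] += 2 * lam * w[1:]
    return grad


rng = np.random.default_rng(2)

# Synthetic data (assumed): a bias column plus 3 standard-normal features
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y = X @ np.array([1.0, 3.0, -2.0, 0.5]) + rng.normal(0, 0.1, 50)


def gradient_descent(lam, lr=0.1, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * ridge_gradient(w, X, y, lam)
    return w


norms = []
for lam in (0.0, 0.1, 1.0):
    w = gradient_descent(lam)
    norms.append(np.linalg.norm(w[1:]))
    print(f"lambda={lam:3.1f}  w={np.round(w, 3)}  norm of w[1:]={norms[-1]:.3f}")
```

As $\lambda$ grows, the norm of the penalized coefficients decreases, but none of them is set exactly to zero.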

### 2.1.2. L1 Regularization

The mathematics behind L1 regularization is quite similar to that of L2. The only difference is that instead of adding the squares of $w$, we add the absolute values of $w$.

$$ J(w)\ =\ \frac{1}{2m}\sum_{i=1}^m (h(x_i)-y_i)^2 +\ \lambda\sum_{j=1}^n|w_j|$$

Important Points about L1 regularization:

* It's also called Lasso regularization 
* It is generally used when we have a high number of features, because it automatically does feature selection.
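The feature selection behaviour can be illustrated with a small proximal gradient (ISTA) solver for the L1-regularized cost. ISTA is one of several ways to minimise this cost and is swapped in here for brevity; the synthetic data, the unpenalized bias and the names `soft_threshold` and `lasso_ista` are assumptions of this sketch.

```python
import numpy as np


def soft_threshold(v, t):
    """Shrink each entry of v towards zero by t, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_ista(X, y, lam, lr=0.1, steps=2000):
    """Minimise 1/(2m) * ||Xw - y||^2 + lam * sum_{j>=1} |w_j| with ISTA."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(steps):
        # Gradient step on the squared-error part of the cost
        w = w - lr * X.T @ (X @ w - y) / m
        # Proximal step for the L1 penalty (bias w[0] is left untouched)
        w[1:] = soft_threshold(w[1:], lr * lam)
    return w


rng = np.random.default_rng(3)

# Synthetic data (assumed): 5 features, only the first two influence y
features = rng.normal(size=(100, 5))
X = np.column_stack([np.ones(100), features])
y = 2.0 + 3.0 * features[:, 0] - 2.0 * features[:, 1] + rng.normal(0, 0.1, 100)

w = lasso_ista(X, y, lam=0.5)
print(np.round(w, 3))
```

In this setup the weights of the three irrelevant features are driven to zero, while the two informative weights survive with a reduced magnitude.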

### 2.1.3. Regularization for Sparsity

Let us try to visualize the effect of $L_1$ and $L_2$ regularization by plotting them. For making visualization easy, let us plot them in 2D space. For that we suppose that we just have two parameters $w_0$ and $w_1$. 

The regularization forces the model optimisation to find the best trade-off between the initial cost function and the complexity of the model represented by the penalty/regularization term $P(\lambda,w)$.

In other words, if we keep the regularization term $P(\lambda,w)$ smaller than a certain value, we've achieved our goal. Now let's visualize what it means for the $L_2$ norm of our weight vector to be under a certain value, let's say 1. Since the $L_2$ norm is the Euclidean distance from the origin, our desired vector should then be bound within a circle with a radius of 1, centered on the origin.

This is great at keeping weights small, but it can leave the model unnecessarily large and complex, since all of the features may still remain even with small weights.

When trying to keep the $L_1$ norm under a certain value, the area in which our weight vector can reside takes the shape of the diamond shown below. The most important takeaway here is that, when applying $L_1$ regularization, the optimal value of certain weights can end up being exactly zero. This is because the diamond has sharp corners lying on the axes, as opposed to the smooth circular shape in $L_2$ regularization.

This property of $L_1$ regularization is extensively used as a feature selection mechanism. This acts as a built-in feature selector by killing all bad features and leaving only the strongest in the model. This has many benefits especially with sparse features. With fewer coefficients to store and load, there is a reduction in storage and memory needed with a much smaller model size, which is especially important for embedded models. 

<img src="figures/Ridge_vs_Lasso_Regression.png" alt="Ridge_vs_Lasso_Regularization" style="width: 600px;"/>

Actually, there are different possible choices of regularization, depending on the order $p$ of the parameters in the regularization term $\sum_{j=1}^n|w_j|^p$. This is more generally known as the $L_p$ regularizer. The $L_0$ norm ($p=0$) is the count of the non-zero values in a vector, and the $L_\infty$ norm ($p \to \infty$) is the maximum absolute value of any entry in a vector.

\begin{align*}
L_0 \text{-norm} &= \lVert \boldsymbol{w} \rVert_0 = \sum_{j=1}^n |w_j|^0 \newline
L_1 \text{-norm} &= \lVert \boldsymbol{w} \rVert_1 = \sum_{j=1}^n |w_j| \newline
L_2 \text{-norm} &= \lVert \boldsymbol{w} \rVert_2 = \Big(\sum_{j=1}^n |w_j|^2\Big)^{1/2} \newline
L_{\infty} \text{-norm} &= \lVert \boldsymbol{w} \rVert_{\infty} = \max\big\{|w_1|,|w_2|, ..., |w_n|\big\} \newline
\end{align*}
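As a quick numerical check of these definitions, the snippet below computes the four norms for a small example vector. The vector is arbitrary; note that the Euclidean ($L_2$) norm is the square root of the sum of squares, whereas the L2 penalty used in the cost function is the squared norm.

```python
import numpy as np

w = np.array([0.0, -3.0, 4.0])  # example weight vector (arbitrary values)

l0 = np.count_nonzero(w)       # number of non-zero entries: 2
l1 = np.sum(np.abs(w))         # sum of absolute values: 7.0
l2 = np.sqrt(np.sum(w ** 2))   # Euclidean length: 5.0
linf = np.max(np.abs(w))       # largest absolute value: 4.0

print(l0, l1, l2, linf)

# The same quantities are available through np.linalg.norm
print(np.linalg.norm(w, 0), np.linalg.norm(w, 1),
      np.linalg.norm(w, 2), np.linalg.norm(w, np.inf))
```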


### 2.1.4. Elastic Net regularization
 
In practice, the $L_2$-norm usually provides more generalizable models than the $L_1$-norm. However, in some situations, we will end up with much more complex, heavier models if we use $L_2$ instead of $L_1$. This happens because features often have high correlation with each other: $L_1$ regularization uses one of them and throws away the other, whereas $L_2$ regularization keeps both features and keeps their weight magnitudes small. Therefore, with $L_1$, you can end up with a smaller model, but it may be less predictive.

To get the best of both worlds, a third commonly used regression model is the Elastic Net, which incorporates penalties from both $L_1$ and $L_2$ regularization. This way, you get the benefits of sparsity for really poor predictive features while also keeping decent and great features with smaller weights to provide good generalization. The only trade-off is that there are now two hyperparameters to tune instead of one, namely the two $\lambda$ regularization parameters.

$$ J(w)\ =\ \frac{1}{2m}\sum_{i=1}^m (h(x_i)-y_i)^2 +\ \lambda_1\sum_{j=1}^n|w_j|+\ \lambda_2\sum_{j=1}^nw_j^2$$

Another way to write the cost function in Elastic Net includes having a regularization parameter $\lambda$ and another parameter $\alpha$ corresponding to the weight of $L_1$ and $L_2$ penalty in your cost function:

$$ J(w)\ =\ \frac{1}{2m}\sum_{i=1}^m (h(x_i)-y_i)^2 +\ \lambda( \frac{1-\alpha}{2}\sum_{j=1}^nw_j^2 + \alpha \sum_{j=1}^n|w_j| ) $$

Therefore, in addition to choosing a $\lambda$ value, elastic net also allows us to tune the $\alpha$ parameter, where $\alpha = 0$ corresponds to ridge and $\alpha = 1$ to lasso. Simply put, if you plug in $0$ for $\alpha$, the penalty function reduces to the $L_2$ (ridge) term, and if we set $\alpha$ to $1$ we get the $L_1$ (lasso) term. We can therefore choose an $\alpha$ value between $0$ and $1$ to optimize the elastic net. Effectively, this will shrink some coefficients and set others to $0$ for sparse selection.
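The role of $\alpha$ can be checked directly on the penalty term of the second formulation. The function name `elastic_net_penalty` and the example weights are illustrative choices; the formula itself is the one written above.

```python
import numpy as np


def elastic_net_penalty(w, lam, alpha):
    """lam * ((1 - alpha) / 2 * sum(w_j^2) + alpha * sum(|w_j|))."""
    l2_term = np.sum(w ** 2)
    l1_term = np.sum(np.abs(w))
    return lam * ((1 - alpha) / 2 * l2_term + alpha * l1_term)


w = np.array([0.5, -2.0, 0.0])  # example weights (arbitrary values)

ridge_only = elastic_net_penalty(w, lam=1.0, alpha=0.0)  # 0.5 * 4.25 = 2.125
lasso_only = elastic_net_penalty(w, lam=1.0, alpha=1.0)  # 2.5
mixed = elastic_net_penalty(w, lam=1.0, alpha=0.5)       # 1.0625 + 1.25 = 2.3125

print(ridge_only, lasso_only, mixed)
```

With $\alpha = 0$ only the squared ($L_2$, ridge) term remains, and with $\alpha = 1$ only the absolute-value ($L_1$, lasso) term remains.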


### 2.1.5. Dropout Regularization

“Dropout” in machine learning refers to the process of randomly ignoring certain nodes in a layer during training.
In the figure below, the neural network on the left represents a typical neural network where all units are activated. On the right, the red units have been dropped out of the model — the values of their weights and biases are not considered during training.

<img src="figures/dropout.png" alt="dropout" style="width: 800px;"/>

When we apply dropout to a neural network, we’re creating a “thinned” network with unique combinations of the units in the hidden layers being dropped randomly at different points in time during training. Each time the gradient of our model is updated, we generate a new thinned neural network with different units dropped based on a probability hyperparameter p. Training a network using dropout can thus be viewed as training loads of different thinned neural networks and merging them into one network that picks up the key properties of each thinned network.
This process allows dropout to reduce the overfitting of models on training data.

<img src="figures/dropout_classification_error.jpeg" alt="dropout_classification_error" style="width: 500px;"/>

This graph, taken from the paper “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” by Srivastava et al., compares the change in classification error of models without dropout to the same models with dropout (keeping all other hyperparameters constant). All the models have been trained on the MNIST dataset.
It is observed that the models with dropout had a lower classification error than the same models without dropout at any given point in time. A similar trend was observed when the models were trained on other datasets in vision, as well as speech recognition and text analysis.
The lower error is because dropout helps prevent overfitting on the training data by reducing the reliance of each unit in the hidden layer on other units in the hidden layers.

#### Inverted Dropout

There are a few ways of implementing dropout. The most common technique is called **inverted dropout**. Let's say we have a neural network with 3 layers; the following code illustrates how to apply dropout to a single layer (layer 3).

```python
import numpy as np

keep_prob = 0.8  # probability of keeping a unit, 0 < keep_prob <= 1

# Activations of layer 3, shape (units, examples); random placeholder values
# stand in for a real forward pass (an assumption so that the snippet runs)
a3 = np.random.rand(5, 10)

# Boolean mask: entries below keep_prob are kept (about 80%), the rest are dropped (about 20%)
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob

a3 = np.multiply(a3, d3)  # zero out the dropped units

# Scale the surviving activations up so that the expected value of a3 is unchanged
a3 = a3 / keep_prob
```

* As you can see at the end of the implementation, we divide each activation by `keep_prob` (the probability of keeping a unit, not the dropout probability). This ensures that the expected value of the activation remains the same, so no extra rescaling is needed at test time.
* The mask `d3` is the same for the forward and backward pass within an iteration, but a new mask is drawn for each iteration (pass) or training example.
* At test time we don't use dropout. If you implemented dropout at test time, it would add noise to the predictions.
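These points can be checked with a small helper that applies inverted dropout during training and does nothing at test time. The function name `inverted_dropout`, its `training` flag and the all-ones activations are assumptions made for this sketch.

```python
import numpy as np


def inverted_dropout(a, keep_prob, training, rng):
    """Apply inverted dropout to activations `a` during training only."""
    if not training:
        return a  # no dropout and no rescaling at test time
    mask = rng.random(a.shape) < keep_prob  # True for the units that are kept
    return a * mask / keep_prob


rng = np.random.default_rng(4)
a = np.ones((100, 1000))  # placeholder activations, all equal to 1

a_train = inverted_dropout(a, keep_prob=0.8, training=True, rng=rng)
a_test = inverted_dropout(a, keep_prob=0.8, training=False, rng=rng)

print("mean activation in training:", a_train.mean())
print("mean activation at test time:", a_test.mean())
```

In training, each kept unit is scaled to about 1.25 and the dropped ones are 0, so the mean stays close to 1, which is the value seen at test time.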


#### Dropout Intuition

* Dropout randomly knocks out units in your network. So it's as if on every iteration you're working with a smaller NN, and so using a smaller NN seems like it should have a regularizing effect.
* With dropout regularization, a unit can't rely on any one feature, so it has to spread out its weights.
* It's possible to show that dropout has a similar effect to L2 regularization.
* Dropout can have different keep_prob per layer.
* The keep_prob of the input layer has to be near 1 (or exactly 1, i.e. no dropout) because you don't want to eliminate a lot of features.
* If you're more worried about some layers overfitting than others, you can set a lower keep_prob for some layers than others. The downside is, this gives you even more hyperparameters to search for using cross-validation. One other alternative might be to have some layers where you apply dropout and some layers where you don't apply dropout and then just have one hyperparameter, which is a keep_prob for the layers for which you do apply dropouts.
* A lot of researchers are using dropout with Computer Vision (CV) because they have a very big input size and almost never have enough data, so overfitting is the usual problem. And dropout is a regularization technique to prevent overfitting.
* A downside of dropout is that the cost function J is not well defined and it will be hard to debug (plot J by iteration).
    * To solve that you'll need to turn off dropout, set all the keep_probs to 1, and then run the code and check that it monotonically decreases J and then turn on the dropouts again.

## 2.2. Other regularization methods

### 2.2.1. Data augmentation

Data augmentation refers to techniques used to increase the amount of data by adding slightly modified copies of already existing data or newly created synthetic data derived from existing data. It acts as a regularizer and helps reduce overfitting when training a machine learning model.

* Data augmentation examples for computer vision data:
    * Flipping all your pictures horizontally will give you more data instances.
    * Applying a random position and rotation to an image to get more data.
    * In OCR, you can impose random rotations and distortions to digits/letters.

New data obtained using this technique isn't as good as the real independent data, but still can be used as a regularization technique.
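A minimal sketch of two such transformations on a batch of images stored as a NumPy array is given below. The random array stands in for real images, and the horizontal shift uses `np.roll`, which wraps pixels around the border; a real pipeline would more likely pad or crop instead.

```python
import numpy as np

rng = np.random.default_rng(5)

# A batch of 4 grayscale "images" of size 8x8 (random placeholder data)
images = rng.random((4, 8, 8))

# Horizontal flip: reverse the column (width) axis
flipped = images[:, :, ::-1]

# Random horizontal shift by -2..2 pixels (np.roll wraps around the border)
shift = int(rng.integers(-2, 3))
shifted = np.roll(images, shift, axis=2)

# The augmented dataset contains the originals plus the modified copies;
# the labels of the copies are the same as those of the originals
augmented = np.concatenate([images, flipped, shifted], axis=0)
print(augmented.shape)
```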

### 2.2.2. Early stopping

In machine learning, early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent. Such methods update the learner so as to make it better fit the training data with each iteration. Up to a point, this improves the learner's performance on data outside of the training set. Past that point, however, improving the learner's fit to the training data comes at the expense of increased generalization error. Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit. Early stopping rules have been employed in many different machine learning methods, with varying amounts of theoretical foundation.
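One common early stopping rule is "patience": stop once the validation loss has not improved for a given number of epochs, and keep the parameters from the best epoch. The sketch below applies this rule to a precomputed list of validation losses, which stands in for a real training loop; the function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops at the first epoch where the loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Hypothetical validation losses: they improve, then start to rise
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60, 0.65]

best, stopped = early_stopping_epoch(losses, patience=3)
print(f"best epoch: {best}, stopped at epoch: {stopped}")
# prints "best epoch: 3, stopped at epoch: 6"
```

In a real training loop, each iteration would run one epoch of training, evaluate the validation loss, and save a copy of the parameters whenever the loss improves.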

<img src="figures/early_stopping.png" alt="early_stopping" style="width: 500px;"/>

### 2.2.3. Model Ensembles

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

* Algorithm:
    * Train multiple independent models.
    * At test time average their results.
* It can typically give a small boost in performance (on the order of 2%).
* It reduces the generalization error.
* You can also save several snapshots of your NN during training, ensemble them, and average their results.
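The averaging step can be sketched as follows. To obtain several different models from one dataset, this example trains each polynomial regressor on a bootstrap resample of the training set (bagging); that choice, the synthetic data and the polynomial degree are assumptions of the sketch rather than something prescribed above.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic regression data (assumed for illustration)
x = rng.uniform(-1, 1, 80)
y = np.sin(3 * x) + rng.normal(0, 0.2, 80)
x_train, y_train = x[:60], y[:60]
x_test, y_test = x[60:], y[60:]

# Train several models, each on a bootstrap resample of the training set
predictions = []
for _ in range(10):
    idx = rng.integers(0, 60, 60)  # sample 60 indices with replacement
    coeffs = np.polyfit(x_train[idx], y_train[idx], deg=5)
    predictions.append(np.polyval(coeffs, x_test))
predictions = np.array(predictions)  # shape (models, test examples)

# At test time, average the predictions of all models
ensemble_prediction = predictions.mean(axis=0)

member_mse = np.mean((predictions - y_test) ** 2, axis=1)
ensemble_mse = np.mean((ensemble_prediction - y_test) ** 2)
print("average member MSE:", member_mse.mean())
print("ensemble MSE:      ", ensemble_mse)
```

Because the squared error is convex, the error of the averaged prediction is never larger than the average error of the individual members.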

# 3. Learning curves

Learning curves are often very useful to plot, whether you want to sanity check that your algorithm is working correctly or you want to improve its performance.

Learning Curve Theory:

* Graph that compares the performance of a model on training and testing data over a varying number of training instances
* We should generally see performance improve as the number of training points increases
* When we separate training and testing sets and graph them individually
    * We can get an idea of how well the model can generalize to new data
* A learning curve allows us to verify when a model has learned as much as it can about the data. When this occurs:
    * The performances on the training and testing sets reach a plateau
    * There is a consistent gap between the two error rates
* The key is to find the sweet spot that minimizes bias and variance by finding the right level of model complexity
* Of course with more data any model can improve, and different models may be optimal
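A learning curve of this kind can be computed by refitting the model on growing subsets of the training data and recording both errors each time. The sketch below does this for a linear and a degree-4 polynomial model on synthetic data; the data, the subset sizes and the helper names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data (assumed): a noisy sine curve
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 200)
x_train, y_train = x[:150], y[:150]
x_val, y_val = x[150:], y[150:]


def cost(coeffs, x, y):
    """J = 1/(2m) * sum of squared residuals of a polynomial model."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2) / 2


def learning_curve(degree, sizes):
    """Return (m, J_train, J_val) for each training set size m."""
    rows = []
    for m in sizes:
        coeffs = np.polyfit(x_train[:m], y_train[:m], deg=degree)
        rows.append((m,
                     cost(coeffs, x_train[:m], y_train[:m]),
                     cost(coeffs, x_val, y_val)))
    return rows


sizes = [10, 20, 40, 80, 150]
curves = {degree: learning_curve(degree, sizes) for degree in (1, 4)}

for degree, rows in curves.items():
    print(f"degree {degree}")
    for m, j_train, j_val in rows:
        print(f"  m={m:3d}  J_train={j_train:.4f}  J_val={j_val:.4f}")
```

For the linear model both errors plateau at a high value however much data is added, which is the high bias pattern. The degree-4 model ends with both errors much lower.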

Types of learning curves:

* **Bad Learning Curve: High Bias**
    - When training and testing errors converge and are high
    - No matter how much data we feed the model, the model cannot represent the underlying relationship and has high systematic errors
    - Poor fit
    - Poor generalization
* **Bad Learning Curve: High Variance**
    - When there is a large gap between the errors
    - Requires more data to improve
    - Can simplify the model with fewer or less complex features
* **Ideal Learning Curve**
    - Model that generalizes to new data
    - Testing and training learning curves converge at similar values
    - The smaller the gap, the better our model generalizes
    
    
The following example is a typical case of high variance:

<img src="figures/learning-curve-high-variance.jpg" alt="learning-curve-high-variance" style="width: 500px;"/>

The following diagram is a typical case of high bias.

<img src="figures/learning-curve-high-bias.jpg" alt="learning-curve-high-bias" style="width: 500px;"/>

A learning curve plots training dataset size against model error, and it will help you answer many questions:
 
* Do we need more data?
* Do we have a bias problem?
* Do we have a variance problem?
* What's the ideal picture?

Give this a try. It might help you discover a thing or two.


<img src="figures/lc1.jpeg" alt="lc1" style="width: 700px;"/>
<img src="figures/lc2.jpeg" alt="lc2" style="width: 700px;"/>
<img src="figures/lc3.jpeg" alt="lc3" style="width: 700px;"/>
<img src="figures/lc4.jpeg" alt="lc4" style="width: 700px;"/>
