# Improving Deep Neural Networks

## Train, Dev and Test Sets

Finding the best solution for a machine learning task involves several rounds of iteration over hyperparameter choices, on top of the (usually iterative) optimization of the parameters themselves.

To do this we usually split the available data into train, dev and test sets, so that parameters, hyperparameters and expected performance can each be estimated in a fair and unbiased way: parameters on the train set, hyperparameters on the dev set, and expected performance on the test set.

The traditional split has been 70% train and 30% dev. (In the past it wasn't common to hold out a separate test set, and keeping one for an unbiased estimate of performance wasn't widely known as good practice.)

Now with big data, depending on the application we might only need 10,000 examples each in dev and test sets. Therefore these days it is common to see splits such as:

- train: 99%
- dev: 0.5%
- test: 0.5%

Depending on the application, as long as the dev and test sets have enough examples for hyperparameter search and a fair estimate of generalization performance, the proportion of data given to each can be pushed down.

Usually, 0.5% for each of dev and test is seen as acceptable for big data problems.
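A minimal sketch of such a split, assuming the examples can be shuffled freely (no time ordering or grouping to respect). The function name `split_indices` and its parameters are hypothetical, chosen only for illustration:

```python
import numpy as np

def split_indices(n_examples, dev_frac=0.005, test_frac=0.005, seed=0):
    """Return shuffled, disjoint index arrays for the train, dev and test sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_examples)          # random order of all example indices
    n_dev = int(round(n_examples * dev_frac))
    n_test = int(round(n_examples * test_frac))
    dev_idx = perm[:n_dev]
    test_idx = perm[n_dev:n_dev + n_test]
    train_idx = perm[n_dev + n_test:]           # everything left over is training data
    return train_idx, dev_idx, test_idx

# With 2,000,000 examples, 0.5% each gives 10,000 dev and 10,000 test examples.
train_idx, dev_idx, test_idx = split_indices(2_000_000)
print(len(train_idx), len(dev_idx), len(test_idx))  # 1980000 10000 10000
```

The index arrays can then be used to slice the feature and label arrays, for example `X[train_idx]` and `y[train_idx]`.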

## Bias vs Variance

For two-dimensional features we can get a good understanding of bias and variance by plotting the fit. High bias corresponds to underfitting, while high variance corresponds to overfitting.

Once we go beyond 3 dimensions it's difficult to get a sense of bias and variance using diagrams.

![](high_bias_high_variance.jpeg)

With more features, comparing the errors on the train, dev and test sets is a good way to diagnose bias and variance. Another useful reference point is the Bayes error, which is often proxied by human-level error (at least for unstructured data).


| Metric | High Bias | Low Bias | High Variance | Low Variance | Low Bias and Low Variance | High Bias and High Variance |
|--|--|--|--|--|--|--|
| Bayes (Almost Human) |0.5%|0.5%|0.5%|0.5%|0.5%|0.5%|
| Train |10%|0.5%|1%|0.5%|0.5%|10%| 
| Dev |11%|0.8%|10%|3%|0.8%|20%| 
| Test |11%|1%|11%|3.5%|0.9%|23%| 
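The table can be read as two gaps: train error minus Bayes error (bias) and dev error minus train error (variance). A rough sketch of that reading follows. The function name `diagnose` and the threshold `tol=0.03` are assumptions picked so that the result agrees with the columns above; they are not a standard rule.

```python
def diagnose(bayes_err, train_err, dev_err, tol=0.03):
    """Label a model from its error rates (fractions, e.g. 0.10 for 10%).

    tol is a hypothetical threshold for what counts as a "large" gap.
    """
    bias_gap = train_err - bayes_err      # how far training error is from the best achievable
    variance_gap = dev_err - train_err    # how much worse the model does on unseen data
    return {
        "high_bias": bias_gap > tol,
        "high_variance": variance_gap > tol,
    }

# "High Bias" column: train 10%, dev 11%
print(diagnose(0.005, 0.10, 0.11))  # {'high_bias': True, 'high_variance': False}
# "High Variance" column: train 1%, dev 10%
print(diagnose(0.005, 0.01, 0.10))  # {'high_bias': False, 'high_variance': True}
```

What counts as a large gap depends on the application, so the threshold should be treated as something to choose per problem.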

## Basic Recipe for Machine Learning

There is a systematic way to remedy problems such as high bias and high variance. There is a lot more to this topic, but at the basic level we suggest the following:

- High Bias (Underfitting)

    For models that suffer high bias, the following can be tried:
    
    - increase model capacity (a bigger network, more parameters)

    - introduce more features/synthesize more features

    - train for longer (more iterations, tighter convergence criteria) if using an iterative algorithm

    - train with a different optimizer (RMSProp, Adam, Ada)

    - train from different initial conditions

    - search extensively for better neural network architecture/hyperparameters    

- High Variance (Overfitting)

    For models that suffer from high variance, try:

    - decrease model capacity (a smaller network, fewer parameters)

    - reduce the number of features (for example with PCA or AIC-based selection)

    - regularize parameters (penalty on size of parameters - L1, L2 or dropout)

    - use more training data to reduce variability in parameter estimates

    - average bootstrapped models (Bagging)

    - if using a classifier, check whether the data is fairly balanced. If not, use a balancing strategy (over/undersampling, SMOTE)

    - search extensively for better neural network architecture/hyperparameters

### Bias Variance Tradeoff

Before deep learning models and the big data era, there used to be a lot of talk of trading off bias against variance. One had to accept either higher variance or higher bias, trying to find the best balance for the task at hand.

However, with more data, more compute and perhaps a better understanding of good practice in the ML community, the two can now often be reduced separately. Changing some hyperparameters, tweaking the algorithm or collecting more data can bring one down without much hurting the other, which helps get closer to the Bayes error rate for a problem.

## Regularization

Regularization is used to remedy overfitting (high variance) problems. It is often tried first because it is cheaper than collecting and processing more data, the other default remedy for overfitting.

There are several kinds of regularization. The most popular ones these days are:

- parameter regularization - L1 (Lasso) or L2 (Ridge)
- dropout 
- data augmentation 
- early stopping: not preferred as it is not an orthogonal technique; it impacts the optimization of the cost function and also regularization at the same time.

Below we start with parameter regularization.

### Parameter Regularization

One way to remedy high variance is to reduce the number of parameters or the model capacity. In this vein we could also try to make parameters very small or zero by placing a penalty on large parameter values. We do this so that the fitting algorithm prefers models with smaller or fewer parameters.

From one perspective this is the same as having a Bayesian prior that has an expectation that model parameters will be small or fewer (many model parameters that are 0).

L2 regularization adds a penalty to the cost function which prefers models to have smaller (in absolute sense) parameter values. L1 adds a penalty to the cost function which prefers models to have fewer parameters (by preferring parameters to be 0).

In both cases, if there is strong data (many occurrences/large correlations) for parameter values to be non-zero/non-small then this will still come through in the final optimized parameters selected.

A controlling hyperparameter $\lambda$ determines the amount of regularization, that is, how strongly large parameters are penalized. Like other hyperparameters, it is tuned over a sensible range of values using the dev set.

#### L2 Regularization

L2 regularization augments the cost function with the squared Frobenius (L2) norm of the weights in each layer, summed over all layers and scaled by $\frac{\lambda}{2m}$. That is to say, if the cost function (loss function averaged over all examples) is:

$$
J(\mathbf{W}, \mathbf{B}) = \frac{1}{m}\sum_i L(y^{(i)}, \hat{y}^{(i)}(\mathbf{W}, \mathbf{B}))
$$

Then the cost function for a regularized model, or regularized cost function is:

$$
J(\mathbf{W}, \mathbf{B}) = \frac{1}{m} \sum_i L(y^{(i)}, \hat{y}^{(i)}(\mathbf{W}, \mathbf{B})) + \frac{\lambda}{2m} \sum_{l=1}^{L}|| W^{[l]} ||_{F}^{2}
$$

Where $|| W^{[l]} ||_{F}^{2} = \sum_i \sum_j (w_{i,j}^{[l]})^2$

This regularized cost function prefers solutions with smaller (in absolute value) parameters. There can still be many non-zero parameters, but they will tend to be small.

Note: Technically the biases $B^{[l]}$ could also be included in the penalty. However, since there are far fewer of them than weight parameters, it is possible to leave them unregularized and still get good results.
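As a rough illustration of the L2 penalty term, here is a NumPy sketch. The function names `l2_penalty` and `regularized_cost` and the example weight matrices are made up for this example, and the unregularized cost is passed in as a plain number rather than computed from a network:

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """(lambda / 2m) times the sum over layers of the squared Frobenius norm."""
    return (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)

def regularized_cost(data_cost, weights, lam, m):
    """Unregularized cost plus the L2 penalty; biases are left unregularized."""
    return data_cost + l2_penalty(weights, lam, m)

# Two hypothetical layers: squared norms are 5.25 and 10.0, so 15.25 in total.
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0, 1.0]])
penalty = l2_penalty([W1, W2], lam=0.1, m=10)   # (0.1 / 20) * 15.25
total = regularized_cost(0.4, [W1, W2], lam=0.1, m=10)
print(penalty, total)
```

Setting `lam=0` recovers the original cost, which is a convenient sanity check when adding regularization to existing code.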

#### L1 Regularization

L1 regularization augments the cost function with the L1 norm of all weight parameters, that is, the scaled sum of their absolute values.

$$
J(\mathbf{W}, \mathbf{B}) = \frac{1}{m} \sum_i L(y^{(i)}, \hat{y}^{(i)}(\mathbf{W}, \mathbf{B})) + \frac{\lambda}{2m} \sum_{l=1}^{L}| W^{[l]} |
$$

Where $| W^{[l]} | = \sum_i \sum_j| w_{i,j}^{[l]}|$

This regularized cost function prefers models in which many weights are exactly zero. It means that sparse models are preferred to dense ones.
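One way to see the preference for sparsity is to compare two weight matrices that have the same squared L2 norm but different numbers of non-zero entries. The sketch below uses a hypothetical helper `l1_penalty` and made-up weights:

```python
import numpy as np

def l1_penalty(weights, lam, m):
    """(lambda / 2m) times the sum over layers of the absolute weight values."""
    return (lam / (2 * m)) * sum(np.sum(np.abs(W)) for W in weights)

# Both matrices have squared L2 norm 1.0, so an L2 penalty cannot tell them apart.
w_sparse = np.array([[1.0, 0.0, 0.0, 0.0]])
w_dense = np.array([[0.5, 0.5, 0.5, 0.5]])

l1_sparse = l1_penalty([w_sparse], lam=1.0, m=1)   # 0.5 * 1.0
l1_dense = l1_penalty([w_dense], lam=1.0, m=1)     # 0.5 * 2.0
print(l1_sparse, l1_dense)  # 0.5 1.0
```

The L1 penalty is lower for the sparse matrix, so for the same fit to the data the optimizer is pushed towards solutions with more zeros.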

### Gradient Descent: L2 Regularization

The gradient descent formulas are very similar, only now with a weight decay element which is related to the regularization parameter $\lambda$.

If the original update equations for gradient descent are:

$$
W = W - \alpha dW
$$

And with L2 regularization, they become (where $dW$ is still the gradient of the unregularized cost):

$$
W = W(1- \frac{\alpha \lambda}{m}) - \alpha dW
$$

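With some rearrangement, we can see this as a weight decay version of the original equations: at every step each weight is first shrunk by the factor $(1 - \frac{\alpha \lambda}{m})$ and then updated as before.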
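The rearrangement follows because the gradient of the penalty $\frac{\lambda}{2m}|| W ||_{F}^{2}$ with respect to $W$ is $\frac{\lambda}{m}W$. The sketch below checks numerically that the two ways of writing the step agree. The function names and the random test values are illustrative only:

```python
import numpy as np

def l2_update_decay(W, dW, alpha, lam, m):
    """Weight-decay form: shrink W, then apply the unregularized gradient step."""
    return W * (1 - alpha * lam / m) - alpha * dW

def l2_update_explicit(W, dW, alpha, lam, m):
    """Plain gradient step on the regularized cost: gradient is dW + (lam / m) * W."""
    return W - alpha * (dW + (lam / m) * W)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
dW = rng.normal(size=(3, 4))    # stands in for the gradient of the unregularized cost

a = l2_update_decay(W, dW, alpha=0.01, lam=0.7, m=50)
b = l2_update_explicit(W, dW, alpha=0.01, lam=0.7, m=50)
print(np.allclose(a, b))  # True
```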

In practice it is a lot more common to see L2 regularization than L1 regularization. This might be related to the additional work and computation that is needed to code up and iterate to solve L1 regularization problems.

## Why Regularization Reduces Overfitting

Regularized cost functions tend to reduce high variance problems. The reasons for this are subtle, but some intuitions are:

* they prefer functions where the weights are close to 0, because of the penalty increasing the cost function. If weights are close to 0, then the model is simpler, with fewer non-linear interactions.

* they make the model more linear: small weights keep each unit's linear combination $z$ within the near-linear region of the activation function (for example around 0 for tanh), and this propagates up the layers.
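The second intuition can be checked numerically for one particular activation. The sketch below assumes tanh (the notes do not fix an activation) and compares it with the identity function on a small and a large range of pre-activation values:

```python
import numpy as np

# Small pre-activations, as produced by small weights, versus large ones.
z_small = np.linspace(-0.1, 0.1, 5)
z_large = np.linspace(-3.0, 3.0, 5)

# Largest deviation of tanh(z) from the straight line z on each range.
err_small = np.max(np.abs(np.tanh(z_small) - z_small))
err_large = np.max(np.abs(np.tanh(z_large) - z_large))
print(err_small, err_large)
```

On the small range tanh is almost indistinguishable from a straight line, while on the large range it saturates, so a network whose pre-activations stay small behaves much more like a linear model.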

### Graph Regularized Cost Function

When graphing the cost function, we expect the cost to decrease at each iteration of gradient descent.

This is true for regularized versions too, just remember to graph the regularized cost function, not the original cost function. This is because gradient descent now applies to the cost function with the penalty, and the parameter search is trying to reduce the cost with the penalty included.
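A small sketch of this bookkeeping, using L2-regularized linear regression on synthetic data as a stand-in for a neural network. The data, learning rate and $\lambda$ are all made-up values, and the plotting step itself is left out:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 5
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=m)   # synthetic targets

alpha, lam = 0.1, 1.0
w = np.zeros(n)
data_costs, reg_costs = [], []
for _ in range(100):
    residual = X @ w - y
    data_cost = np.sum(residual ** 2) / (2 * m)               # original cost
    reg_cost = data_cost + (lam / (2 * m)) * np.sum(w ** 2)   # cost with L2 penalty
    data_costs.append(data_cost)
    reg_costs.append(reg_cost)
    dw = X.T @ residual / m                                   # gradient of the original cost
    w = w * (1 - alpha * lam / m) - alpha * dw                # weight-decay update

# reg_costs is the curve to plot against the iteration number.
print(reg_costs[0], reg_costs[-1])
```

With a suitably small learning rate, `reg_costs` should go down at every iteration, since it is the quantity gradient descent is actually minimizing. The unregularized `data_costs` curve carries no such guarantee.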

## Dropout Regularization


## Understanding Dropout

## Other Regularization Methods

### Early Stopping

### Data Augmentation

## Normalizing Inputs

## Vanishing or Exploding Gradients

## Weight Initialization in a Deep Network

## Numerical Approximations of Gradients

## Gradient Checking

## Minibatch Gradient Descent