# Improving Deep Neural Networks

## Train, Dev and Test Sets

Finding the best solution for a machine learning task involves several rounds of iteration over hyperparameter choices (in addition to the usually iterative parameter optimization).

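To achieve this we usually split the available data into train, dev and test sets. This is so that parameters, hyperparameters and expected performance metrics can be estimated in a fair and unbiased way: parameters are fit on the train set, hyperparameters are chosen on the dev set, and the test set gives the final performance estimate.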

The traditional split was 70% training and 30% dev. (In the past it wasn't common to have a separate test split, and an unbiased estimate of performance wasn't widely recognized as good practice.)

Now with big data, depending on the application we might only need 10,000 examples each in dev and test sets. Therefore these days it is common to see splits such as:

- train: 99%
- dev: 0.5%
- test: 0.5%

Depending on the application, as long as the dev and test sets have the minimum number of examples needed for hyperparameter search and for a fair estimate of generalization performance, it is possible to push down the proportion of data given to each.

Usually, 0.5% for each of dev and test is seen as acceptable for big data problems.
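As a rough illustration of the split described above, here is a minimal sketch that shuffles example indices and carves off small dev and test sets. The function name `split_indices`, the seed and the dataset size of 2,000,000 are assumptions made for the example, not part of the original notes.

```python
import numpy as np

def split_indices(n_examples, dev_frac=0.005, test_frac=0.005, seed=0):
    """Shuffle example indices and split them into train/dev/test index arrays."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_examples)
    n_dev = int(round(n_examples * dev_frac))
    n_test = int(round(n_examples * test_frac))
    dev_idx = indices[:n_dev]
    test_idx = indices[n_dev:n_dev + n_test]
    train_idx = indices[n_dev + n_test:]  # everything left over is training data
    return train_idx, dev_idx, test_idx

# Hypothetical dataset of 2,000,000 examples with a 99% / 0.5% / 0.5% split.
train_idx, dev_idx, test_idx = split_indices(2_000_000)
print(len(train_idx), len(dev_idx), len(test_idx))
```

Shuffling before splitting is a design choice that assumes the examples are independent; for time-ordered data a different strategy would likely be needed.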

## Bias vs Variance

For two-dimensional features we can get a good understanding of bias and variance by plotting the fitted model. High bias indicates underfitting while high variance indicates overfitting.

Once we go beyond 3 dimensions it's difficult to get a sense of bias and variance using diagrams.

![](high_bias_high_variance.jpeg)

With more features, comparing the errors on the train, dev and test sets is a good way to understand bias and variance. Another measure that helps a lot is the Bayes error (which is often proxied by the human-level error, at least for unstructured data).


| Metric | High Bias | Low Bias | High Variance | Low Variance | Low Bias and Low Variance | High Bias and High Variance |
|--|--|--|--|--|--|--|
| Bayes (Almost Human) |0.5%|0.5%|0.5%|0.5%|0.5%|0.5%|
| Train |10%|0.5%|1%|0.5%|0.5%|10%| 
| Dev |11%|0.8%|10%|3%|0.8%|20%| 
| Test |11%|1%|11%|3.5%|0.9%|23%| 
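The reading of the table can be sketched in code: the gap between train error and Bayes error hints at bias, and the gap between dev error and train error hints at variance. The function `diagnose` and its `tolerance` threshold of 2 percentage points are assumptions chosen for illustration; a sensible threshold depends on the application.

```python
def diagnose(bayes_error, train_error, dev_error, tolerance=0.02):
    """Return (bias_gap, variance_gap, labels); errors are fractions, e.g. 0.10 for 10%."""
    bias_gap = train_error - bayes_error     # how far training is from the best achievable
    variance_gap = dev_error - train_error   # how much worse we do on unseen data
    labels = []
    if bias_gap > tolerance:
        labels.append("high bias")
    if variance_gap > tolerance:
        labels.append("high variance")
    return bias_gap, variance_gap, labels or ["low bias and low variance"]

# Values taken from four columns of the table above.
print(diagnose(0.005, 0.10, 0.11)[2])    # "High Bias" column
print(diagnose(0.005, 0.01, 0.10)[2])    # "High Variance" column
print(diagnose(0.005, 0.10, 0.20)[2])    # "High Bias and High Variance" column
print(diagnose(0.005, 0.005, 0.008)[2])  # "Low Bias and Low Variance" column
```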

## Basic Recipe for Machine Learning

There is a systematic way to remedy problems such as high bias and high variance. There is a lot more to this topic, but at the basic level we suggest the following:

- High Bias (Underfitting)

    For models that suffer high bias, the following can be tried:
    
    - increase model capacity/a bigger network/ more parameters

    - introduce more features/synthesize more features

    - train for longer (more iterations, smaller convergence criteria) if using an iterative algorithm

    - train with a different optimizer (RMSProp, Adam, Adagrad)

    - train from different initial conditions

    - search extensively for better neural network architecture/hyperparameters    

- High Variance (Overfitting)

    For models that suffer from high variance, try:

    - decrease model capacity/a smaller network/fewer parameters

    - reduce number of features (for example PCA, AIC)

    - regularize parameters (penalty on size of parameters - L1, L2 or dropout)

    - use more training data to reduce variability in parameter estimates

    - average bootstrapped models (Bagging)

    - if using a classifier, check whether the data is fairly balanced. If not, use a balancing strategy (over/undersampling, SMOTE)

    - search extensively for better neural network architecture/hyperparameters

### Bias Variance Tradeoff

Before deep learning models and the big data era, there used to be a lot of talk of trading off bias against variance. One had to accept either higher variance or higher bias, trying to find the best balance for the task at hand.

However, now with more data, more compute and perhaps a better understanding of good practice in the ML community, we can often change some hyperparameters, tweak the algorithm or collect more data, and that can help get down towards the Bayes error rate for a problem without having to trade bias against variance.

## Regularization

Regularization is used to remedy overfitting (high variance) problems. It is usually tried first because it is often cheaper than the effort needed to collect and process more data, which is the other default remedy for overfitting.

There are several kinds of regularization. The most popular ones these days are:

- parameter regularization - L1 (Lasso) or L2 (Ridge)
- dropout 
- data augmentation 
- early stopping: not preferred as it is not an orthogonal technique; it impacts the optimization of the cost function and also regularization at the same time.

Below we start with parameter regularization.

### Parameter Regularization

One way to remedy high variance is to reduce the number of parameters or the model capacity. In this vein we could also try to make parameters very small or zero by placing a penalty against large parameter values. We do this so as to make the fitting algorithm prefer models with smaller or fewer parameters.

From one perspective this is the same as having a Bayesian prior that has an expectation that model parameters will be small or fewer (many model parameters that are 0).

L2 regularization adds a penalty to the cost function which prefers models to have smaller (in absolute sense) parameter values. L1 adds a penalty to the cost function which prefers models to have fewer parameters (by preferring parameters to be 0).

In both cases, if there is strong data (many occurrences/large correlations) for parameter values to be non-zero/non-small then this will still come through in the final optimized parameters selected.

A controlling hyper parameter $\lambda$ determines the amount of regularization - it chooses the amount of smoothing to be applied. There is a sensible range of values for which hyperparameter tuning can be applied.

#### L2 Regularization

L2 regularization augments the cost function with the squared Frobenius (L2) norm of the weight matrix of each layer, summed over all layers and scaled by $\frac{\lambda}{2m}$. That is to say, if the cost function (loss function averaged over all examples) is:

$$
J(\mathbf{W}, \mathbf{B}) = \frac{1}{m}\sum_i L(y^{(i)}, \hat{y(\mathbf{W}, \mathbf{B})}^{(i)})
$$

Then the cost function for a regularized model, or regularized cost function is:

$$
J(\mathbf{W}, \mathbf{B}) = \frac{1}{m} \sum_i L(y^{(i)}, \hat{y(\mathbf{W}, \mathbf{B})}^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L}|| W^{[l]} ||_{F}^{2}
$$

Where $|| W^{[l]} ||_{F}^{2} = \sum_i \sum_j (w_{i,j}^{[l]})^2$

This regularized cost function prefers solutions that have smaller (in the absolute value sense) parameters. There can still be many parameters, but they will tend to be smaller.

Note: Technically the biases $B^{[l]}$ could also be included in the penalty term. However, since they are far fewer than the weight parameters, it is possible to not regularize them and still get good results.
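The penalty term can be computed directly from the list of weight matrices. The sketch below assumes the unregularized cost has already been averaged over the $m$ examples; the function name `l2_regularized_cost` and the example weights are made up for illustration.

```python
import numpy as np

def l2_regularized_cost(unregularized_cost, weights, lam, m):
    """Add (lam / (2 m)) * sum over layers of ||W||_F^2 to an averaged cost."""
    penalty = sum(np.sum(np.square(W)) for W in weights)  # squared Frobenius norms
    return unregularized_cost + lam / (2 * m) * penalty

# Two hypothetical layers; biases are left out of the penalty, as noted above.
weights = [np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([[3.0, 1.0]])]
l2_cost = l2_regularized_cost(0.4, weights, lam=0.1, m=10)
print(l2_cost)
```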

#### L1 Regularization

L1 regularization augments the cost function with the L1 norm of all weight parameters, that is, the scaled sum of the absolute values of the weights.

$$
J(\mathbf{W}, \mathbf{B}) = \frac{1}{m} \sum_i L(y^{(i)}, \hat{y(\mathbf{W}, \mathbf{B})}^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L}| W^{[l]} |
$$

Where $| W^{[l]} | = \sum_i \sum_j| w_{i,j}^{[l]}|$

This regularized cost function prefers models which have many weights being zero. It means that sparse models are preferred to more complex ones.
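A minimal sketch of the L1 penalty, using the same $\frac{\lambda}{2m}$ scaling as the formula above. The helper `fraction_zero` is a hypothetical convenience for checking how sparse a set of weights is; both function names and the example weights are assumptions.

```python
import numpy as np

def l1_regularized_cost(unregularized_cost, weights, lam, m):
    """Add (lam / (2 m)) * sum over layers of sum |w_ij| to an averaged cost."""
    penalty = sum(np.sum(np.abs(W)) for W in weights)
    return unregularized_cost + lam / (2 * m) * penalty

def fraction_zero(weights):
    """Fraction of weight entries that are exactly zero (a simple sparsity measure)."""
    total = sum(W.size for W in weights)
    zeros = sum(int(np.sum(W == 0)) for W in weights)
    return zeros / total

sparse_weights = [np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([[3.0, 1.0]])]
l1_cost = l1_regularized_cost(0.4, sparse_weights, lam=0.1, m=10)
print(l1_cost, fraction_zero(sparse_weights))
```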

### Gradient Descent: L2 Regularization

The gradient descent formulas are very similar, only now with a weight decay element which is related to the regularization parameter $\lambda$.

If the original update equations for gradient descent are:

$$
W = W - \alpha dW
$$

And with L2 regularization, they become:

$$
W = W(1- \frac{\alpha \lambda}{m}) - \alpha dW
$$

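With some rearrangement, we can see this as a weight decay version of the original equations: each update first shrinks $W$ by the factor $(1 - \frac{\alpha \lambda}{m})$ and then takes the usual gradient step, where $dW$ is the gradient of the unregularized cost.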
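To check the rearrangement numerically, the sketch below compares a gradient step on the regularized cost with the weight decay form. It assumes `dW` is the gradient of the unregularized cost, so the penalty $\frac{\lambda}{2m}||W||_F^2$ contributes an extra $\frac{\lambda}{m}W$; the function names and the random test values are illustrative.

```python
import numpy as np

def l2_update(W, dW, alpha, lam, m):
    """Gradient step on the regularized cost: gradient is dW + (lam / m) * W."""
    return W - alpha * (dW + (lam / m) * W)

def weight_decay_update(W, dW, alpha, lam, m):
    """The same step written as 'shrink the weights, then step'."""
    return W * (1 - alpha * lam / m) - alpha * dW

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
dW = rng.normal(size=(3, 4))
same = np.allclose(l2_update(W, dW, 0.1, 0.7, 50),
                   weight_decay_update(W, dW, 0.1, 0.7, 50))
print(same)
```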

In practice it is a lot more common to see L2 regularization than L1 regularization. This might be related to the additional work and computation that is needed to code up and iterate to solve L1 regularization problems.

## Why Regularization Reduces Overfitting

Optimizing a regularized cost function tends to reduce high variance problems. The reasons for this are subtle, but some intuitions are that:

* they prefer functions where the weights are close to 0, because of the penalty increasing the cost function. If weights are close to 0, then the model is simpler, with fewer non-linear interactions.

* they make the model more linear: with small weights the weighted sums stay small, so activation functions such as tanh operate in their roughly linear region around 0, and this propagates up the layers.

### Graph Regularized Cost Function

When graphing cost functions, we expect the cost to decrease at each iteration of the gradient descent. 

This is true for regularized versions too, just remember to graph against the regularized cost function not the original cost function. This is because gradient descent now applies to the cost function with penalty and the parameter search is trying to reduce the cost with the penalty included. 

## Dropout Regularization

### What is Dropout

Dropout is when neurons are randomly dropped from the network, with some probability set at the layer level. The connections into and out of a dropped neuron are dropped too. The remaining outputs are then scaled up by dividing by the probability of keeping a neuron in that layer (this is known as inverted dropout). This is to ensure the expected value of the sums stays in the same range. The probability of keeping a neuron can vary per layer.

At each iteration of gradient descent random neurons are removed and the update performed. In the following iteration, we again go from the full model to one where dropout has been performed.

For prediction we take the full network, with no neurons dropped. Its weights are the result of all the updates made under dropout; since different neurons were dropped at each iteration, some weights will have been updated at every iteration and others less often. These weights are then used to predict the output for a set of inputs.
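A minimal sketch of a dropout forward pass for one layer, assuming the inverted dropout variant in which kept activations are divided by the keep probability during training and nothing is dropped at prediction time. The function name `dropout_forward`, the array shape and the keep probability of 0.8 are assumptions for the example.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng, training=True):
    """Apply inverted dropout to activations `a`; returns (output, mask)."""
    if not training or keep_prob >= 1.0:
        return a, None                         # prediction: use the full network
    mask = rng.random(a.shape) < keep_prob     # True for neurons that are kept
    return a * mask / keep_prob, mask          # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
a = np.ones((1000, 200))
a_train, mask = dropout_forward(a, keep_prob=0.8, rng=rng)
a_test, _ = dropout_forward(a, keep_prob=0.8, rng=rng, training=False)
print(a_train.mean(), a_test.mean())
```

During back-propagation the same `mask` and scaling would be applied to the gradients of this layer, which is why the mask is returned.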

## Understanding Dropout

### Smaller Network Effect

At each iteration of gradient descent, we are effectively training a smaller network. This helps to reduce high variance problems as there are fewer parameters to train. This is why it has a regularization effect.

### Network can't rely on any one neuron

Since any input to a neuron can be dropped, the network tends to spread out its weights rather than rely on a single input. The effect is to lower the absolute value of parameters, which is very similar in essence to L2 regularization.

### Computer Vision

Dropout is common for Computer Vision tasks, which often don't have enough data and tend to overfit (high variance). When more data isn't available, dropout has been shown to help get good results.

### Irregular Cost Function 

The cost function is no longer consistently defined across iterations, because a different set of neurons is dropped each time. As a consequence it will not necessarily decrease after each iteration of gradient descent. This seems like it needs more investigation to explain why convergence to local optima still works.

However, since the end results (i.e. performance) seem decent for many deep learning problems it has become common practice to use dropout.

## Other Regularization Methods

### Data Augmentation

One way to fix overfitting is to add more variations of the data, which we call data augmentation. For computer vision tasks, we could apply:

* translation
* crops
* random noise
* flips
* random rotations
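Several of the transformations listed above can be sketched with plain NumPy. The function `augment` and its parameters are assumptions for illustration: the translation wraps around the image edges for simplicity, and arbitrary-angle rotations are left out because they need an interpolation routine.

```python
import numpy as np

def augment(image, rng, max_shift=2, crop=24, noise_std=0.05):
    """Return a randomly flipped, shifted, cropped and noised copy of an H x W x C image in [0, 1]."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                              # horizontal flip
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(out, shift=(int(dy), int(dx)), axis=(0, 1))  # translation (wraps around)
    h, w = out.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    out = out[top:top + crop, left:left + crop, :]         # random crop
    out = out + rng.normal(0.0, noise_std, size=out.shape) # random noise
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((28, 28, 3))   # stand-in for a real image
augmented = augment(image, rng)
print(augmented.shape)
```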

### Early Stopping

Looking at dev error and training error as a model is training can be useful for regularization. When the dev error stops falling, one suggestion is to stop training. This is because a rising dev error (while training error keeps falling) is indicative of overfitting taking hold. So one could stop the training and accept the parameters at the point where the dev error is minimized.

However, this is not preferred as it makes regularization and optimization intertwined. We would prefer orthogonal controls for each of regularization and optimization.
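For reference, the stopping rule itself can be sketched as follows. The `patience` parameter (how many epochs without improvement to tolerate) is an assumption added to make the rule concrete, and the dev error values are made up.

```python
def early_stopping_epoch(dev_errors, patience=2):
    """Return the index of the epoch whose parameters would be kept.

    Training stops once the dev error has failed to improve for `patience`
    consecutive epochs; the best epoch seen up to that point is returned.
    """
    best_epoch, best_error, waited = 0, float("inf"), 0
    for epoch, error in enumerate(dev_errors):
        if error < best_error:
            best_epoch, best_error, waited = epoch, error, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Hypothetical dev errors per epoch.
dev_errors = [0.30, 0.22, 0.18, 0.17, 0.19, 0.21, 0.16]
print(early_stopping_epoch(dev_errors))              # stops before seeing the last epoch
print(early_stopping_epoch(dev_errors, patience=3))  # waits long enough to reach it
```

The two calls give different answers on the same curve, which illustrates how the stopping rule ends up tangled with the optimization itself.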


## Normalizing Inputs

Feature normalization helps optimization algorithms converge more quickly. This is needed only if the features are on very different scales (think orders of magnitude). For example:

* x1 ranges from 10 to 10000
* x2 ranges from -1 to 1

This is where centering and scaling matters.

### Centering and Scaling

Normalization of features (inputs) involves centering and scaling of the features.

#### Centering

Centering means subtracting a measure of the center of the features. Usually the mean vector of the data is used. Other alternatives are:

* median vector
* mode vector

#### Scaling

Scaling divides each feature by a measure of its spread, which puts all the features on the same scale. Common choices are:

* standard deviation
* range
* inter-quartile range
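A minimal sketch of centering by the mean and scaling by the standard deviation, assuming the data matrix has one row per example and one column per feature. The function names are illustrative. The statistics are computed on the training set and then reused unchanged for dev and test data, so that all three go through the same transformation.

```python
import numpy as np

def fit_normalizer(X_train):
    """Compute the per-feature mean and standard deviation from the training data."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # leave constant features unscaled
    return mu, sigma

def normalize(X, mu, sigma):
    """Center and scale X using previously fitted statistics."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
# Two features on very different scales, as in the example above.
X_train = np.column_stack([rng.uniform(10, 10000, size=500),
                           rng.uniform(-1, 1, size=500)])
mu, sigma = fit_normalizer(X_train)
X_train_norm = normalize(X_train, mu, sigma)
print(X_train_norm.mean(axis=0), X_train_norm.std(axis=0))
```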

### Normalizing Effect on Gradient Descent

Why do we need to normalize the features? This is done so that features are on the same scale. But why? Well, remember that in gradient descent we have a single learning rate alpha, which essentially stands in for the inverse of the Hessian matrix: it sets the step size, but identically in every direction. If the true function to be optimized varies at different scales in different directions, then a single alpha multiplier will struggle to capture this.

As below to optimize a 2 input function with different scales, we see that the algorithm steps in a zig-zag trying to capture both large and small changes in the inputs.

![](scaling.png)

However the large scale parameter dominates and so the convergence slowly moves in the direction of descent.

By normalizing to the same scale, we can often get better results. (This is not always the case, and most of the improvement may still be in one input. Specialized gradient descent algorithms try to remedy this, for example RMSProp, Adagrad and Adam.)

If we had the Hessian then we would not need such scaling (or specialized algorithms). However since the Hessian is expensive to compute and sometimes not available we use normalization as a computationally inexpensive practical solution.
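The effect of mismatched scales on a single learning rate can be seen on a toy problem. The sketch below assumes a quadratic $f(w) = \frac{1}{2}\sum_i c_i w_i^2$, where the curvatures $c_i$ stand in for feature scales; the function `gd_steps` and the chosen curvatures are illustrative, not taken from the notes.

```python
import numpy as np

def gd_steps(curvature, tol=1e-3, max_steps=100_000):
    """Count gradient descent steps on f(w) = 0.5 * sum(curvature * w**2), starting at w = 1."""
    curvature = np.asarray(curvature, dtype=float)
    alpha = 1.0 / curvature.max()   # one learning rate, limited by the steepest direction
    w = np.ones_like(curvature)
    for step in range(max_steps):
        if np.max(np.abs(w)) < tol:
            return step
        w = w - alpha * curvature * w   # gradient of f is curvature * w
    return max_steps

steps_unscaled = gd_steps([1.0, 100.0])  # directions on very different scales
steps_scaled = gd_steps([1.0, 1.0])      # same problem after rescaling
print(steps_unscaled, steps_scaled)
```

The learning rate has to be small enough for the steep direction, which makes progress along the shallow direction slow; once both directions share a scale the same rule converges almost immediately.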

The above relates to scaling the features. There are good reasons for centering the features also. This link highlights some:

https://stats.stackexchange.com/questions/104528/the-idea-of-making-the-data-have-a-zero-mean

One reason related to optimization is:

```
The idea is that to train a neural network one needs to solve a non-convex optimization problem using some gradient based approach. The gradients are calculated by means of backpropagation. Now, these gradients depend on the inputs, and centering the data removes possible bias in the gradients.

Concretely, a non-zero mean is reflected in a large eigenvalue which means that the gradients tend to be bigger in one direction than others (bias) thus slowing the convergence process, eventually leading to worse solutions.
```

Also note that if one feature is on a very different range of values to the others, it really impacts many classifiers, including deep learning neural networks.

Zero centering also makes the mathematics simpler avoiding cross terms for many models - especially those that depend on a euclidean distance.

This paper also seems relevant:

https://arxiv.org/pdf/1502.03167.pdf

## Vanishing or Exploding Gradients

In deep learning networks, a particular problem can occur at training time. This happens because we rely on back-propagation, which essentially multiplies chained derivatives. As these chained matrix/vector derivatives are multiplied through the layers of the deep network, we find that if the components are consistently much larger than 1 or much smaller than 1, then the gradients become exponentially large or exponentially small.

### Linear Activation Functions

As a simple motivating example we'll consider this vanishing/exploding gradient problem for the linear activation function - which can be extended by Taylor's theorem to linearize other activation functions (such as relu, sigmoid and tanh). However, I accept it's a bad example since deep linear activation function networks just reduce to multiple linear regression, where multiplying by chain rule derivatives is unnecessary, and the explanation below is hand wavy. 

So for now let's assume we have $g(z) = z$ as the activation function at all layers, with no bias terms. What happens? Well, for an $L$-layer neural network, we will observe:

$$
a^{[0]} = X, \qquad
a^{[1]} = W^{[1]} X, \qquad
a^{[L]} = W^{[L]} W^{[L-1]} \cdots W^{[1]} X = \left(\prod_{i=1}^{L} W^{[i]}\right) X
$$

Imagine we have the same simple diagonal weight matrix for each of the layers. That is 

$$
W^{[l]} =   \left[ {\begin{array}{cc}
   \gamma & 0 \\
   0 & \gamma \\
  \end{array} } \right]
$$

Then the expression for the output layer estimate becomes:

$$
a^{[L]} = \left(\prod_{i=1}^{L} W^{[i]}\right) X  =  \left[ {\begin{array}{cc}
   \gamma^{L} & 0 \\
   0 & \gamma^{L} \\
  \end{array} } \right] X
  = \gamma^{L} I X
$$

So for this simple case, we have $a^{[L]} = \gamma^{L} I X$. The gradients computed by back-propagation are multiplied through the same chain of weight matrices, so depending on the value of $\gamma$, we can get vanishing or exploding gradients.

### Exponentially Small

With $a^{[L]} = \gamma^{L} I X$, a value such as $\gamma = 0.5$ makes the activations, and likewise the gradients, vanish as $L$ grows.

If gradients are exponentially small (or *vanishing*), then, apart from floating point errors (which have their own complications), learning slows down: the parameters stop changing materially in the gradient descent updates.

### Exponentially Large

With $a^{[L]} = \gamma^{L} I X$, a value such as $\gamma = 1.5$ makes the activations, and likewise the gradients, explode as $L$ grows.

If gradients are exponentially large, then there can also be floating point errors. More than that, very large gradients will cause the parameters to move a lot, potentially violently away from optima (going too far in the gradient direction), so that the learning rate needs to be reduced.

For a long time these problems were a huge barrier to training neural networks. There are now several partial solutions that help to alleviate this problem.

Of the two problems, vanishing gradients are encountered more often and are more of a problem. A partial remedy is to use thoughtful initial values for the parameter search. These can be chosen so that the initial gradient is not much larger than or much smaller than 1.
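The exponential behaviour is easy to see numerically. The sketch below multiplies together many copies of a $2 \times 2$ matrix $\gamma I$, as in the linear example, and reports the resulting scale factor; the function name `scale_after_layers` and the depth of 50 layers are assumptions for illustration.

```python
import numpy as np

def scale_after_layers(gamma, depth):
    """Multiply `depth` copies of the 2x2 matrix gamma * I and return the diagonal entry."""
    W = gamma * np.eye(2)
    product = np.linalg.matrix_power(W, depth)
    return product[0, 0]

for gamma in (0.5, 1.0, 1.5):
    print(gamma, scale_after_layers(gamma, depth=50))
```

A factor only modestly below or above 1 is enough to shrink the signal to almost nothing or blow it up by many orders of magnitude at this depth.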

## Weight Initialization in a Deep Network

By choosing suitable initial conditions much progress can be made in training neural networks while avoiding these problems. In general we want the initial weight parameters to be small enough that the variance of the weighted sum stays close to 1 from layer to layer, so that it neither vanishes nor explodes.

One way to do this is to first initialize the bias parameters to 0. Following that, the weights can be sampled randomly from a Gaussian distribution with mean 0 and variance $\frac{1}{n}$, where $n$ is the number of inputs feeding into the layer; this works well for linear activations. For the ReLU, a better choice of variance is $\frac{2}{n}$.

In code, for layer $j$, one would see:

```
import numpy as np
W_j = np.random.randn(n_j, n_j_1) * np.sqrt(1 / n_j_1)  # Gaussian with variance 1/n_j_1, where n_j_1 is the size of layer j-1
```

For the tanh function, Xavier initialization is suggested. With this the `stddev` of a normal distribution should be:

```
fan_avg = (n_in + n_out) / 2    # average of incoming and outgoing connections
stddev = np.sqrt(1 / fan_avg)   # equivalently sqrt(2 / (n_in + n_out))
```

where `n_in` and `n_out` are the number of connections going into and out of the layer respectively.

Another one used is:

```
stddev = np.sqrt(2 / (n_in * n_out))
```

Some will suggest making the variance choice a hyper-parameter and tuning it also. This has been shown to have some effect in some cases.
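The variance choices above can be gathered into one helper. This is a sketch under the assumptions stated in this section (variance $\frac{2}{n}$ for ReLU, the `fan_avg` rule for tanh, $\frac{1}{n}$ otherwise, biases at 0); the function name `init_weights` and its `activation` argument are made up for illustration.

```python
import numpy as np

def init_weights(n_out, n_in, activation="relu", rng=None):
    """Return (W, b) for a layer with n_in inputs and n_out units."""
    rng = np.random.default_rng() if rng is None else rng
    if activation == "relu":
        variance = 2.0 / n_in
    elif activation == "tanh":
        variance = 2.0 / (n_in + n_out)   # same as 1 / fan_avg
    else:
        variance = 1.0 / n_in             # linear activations
    W = rng.standard_normal((n_out, n_in)) * np.sqrt(variance)
    b = np.zeros((n_out, 1))              # biases start at 0
    return W, b

W1, b1 = init_weights(500, 400, activation="relu", rng=np.random.default_rng(0))
print(W1.shape, b1.shape, W1.var())
```

If the variance is treated as a hyperparameter, one option is to multiply `variance` by a tunable factor before sampling.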

## Numerical Approximations of Gradients



## Gradient Checking

## Minibatch Gradient Descent