# Improving Deep Neural Networks

## Train, Dev and Test Sets

Finding the best solution for a machine learning task involves several rounds of iteration over hyperparameter choices (on top of the parameter optimization itself, which is usually iterative too).

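To do this, we usually split the available data into train, dev and test sets, so that parameters, hyperparameters and expected performance can each be estimated in a fair and unbiased way: parameters are fit on the train set, hyperparameters are chosen on the dev set, and the test set gives the final performance estimate.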

Traditionally, data was split 70% training and 30% dev. (In the past it was not common to hold out a separate test set, and reporting an unbiased estimate of performance was not yet widely established as good practice.)

Now with big data, depending on the application we might only need 10,000 examples each in dev and test sets. Therefore these days it is common to see splits such as:

- train: 99%
- dev: 0.5%
- test: 0.5%

Depending on the application, as long as the dev and test sets each have enough examples for hyperparameter search and a fair estimate of generalization performance, the proportion of data given to each can be pushed down further.

Usually, 0.5% for each of dev and test is seen as acceptable for big data problems.
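
As a minimal sketch of such a split, the snippet below shuffles example indices and cuts them into train, dev and test lists. The function name `split_indices`, its parameters and the dataset size of 2,000,000 are illustrative assumptions, not something prescribed by these notes; only the Python standard library is used.

```python
import random


def split_indices(n_examples, dev_frac=0.005, test_frac=0.005, seed=0):
    """Shuffle the indices 0..n_examples-1 and split them into train/dev/test.

    All names and defaults here are hypothetical; the defaults mirror the
    99% / 0.5% / 0.5% split mentioned above.
    """
    if not 0 <= dev_frac + test_frac < 1:
        raise ValueError("dev_frac + test_frac must be in [0, 1)")
    indices = list(range(n_examples))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_dev = round(n_examples * dev_frac)
    n_test = round(n_examples * test_frac)
    dev = indices[:n_dev]
    test = indices[n_dev:n_dev + n_test]
    train = indices[n_dev + n_test:]
    return train, dev, test


# With 2,000,000 examples, 0.5% gives 10,000 examples each for dev and test.
train, dev, test = split_indices(2_000_000)
print(len(train), len(dev), len(test))  # 1980000 10000 10000
```

Shuffling before splitting assumes the examples are independent and come from one distribution; data with a time order or grouped examples would need a different strategy.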

## Bias vs Variance

For two-dimensional features we can get a good understanding of bias and variance by plotting the data and the fitted decision boundary. High bias corresponds to underfitting, while high variance corresponds to overfitting.

Once we go beyond 3 dimensions it's difficult to get a sense of bias and variance using diagrams.

![](high_bias_high_variance.jpeg)

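With more features, comparing the errors on the train, dev and test sets is a good way to understand bias and variance. Another useful reference point is the Bayes error (often approximated by human-level error, at least for unstructured data). A large gap between train error and Bayes error indicates high bias, while a large gap between dev error and train error indicates high variance.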


| Error | High Bias | Low Bias | High Variance | Low Variance | Low Bias and Low Variance | High Bias and High Variance |
|--|--|--|--|--|--|--|
| Bayes (approx. human-level) |0.5%|0.5%|0.5%|0.5%|0.5%|0.5%|
| Train |10%|0.5%|1%|0.5%|0.5%|10%|
| Dev |11%|0.8%|10%|3%|0.8%|20%|
| Test |11%|1%|11%|3.5%|0.9%|23%|
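
The way the table is read can be sketched in a few lines of Python: the gap between train error and Bayes error is taken as a measure of bias, and the gap between dev error and train error as a measure of variance. The function name `diagnose` and the threshold `tol=0.03` are assumptions made for illustration; what counts as a "large" gap depends on the application.

```python
def diagnose(train_err, dev_err, bayes_err=0.0, tol=0.03):
    """Label a model from its error rates (given as fractions, e.g. 0.10 for 10%).

    tol is an arbitrary, hypothetical threshold for calling a gap "large".
    """
    avoidable_bias = train_err - bayes_err  # how far training is from the best achievable
    variance = dev_err - train_err          # how much worse the model is on unseen data
    labels = []
    if avoidable_bias > tol:
        labels.append("high bias")
    if variance > tol:
        labels.append("high variance")
    return avoidable_bias, variance, labels or ["low bias and low variance"]


# Bayes, train and dev values taken from three columns of the table above.
print(diagnose(0.10, 0.11, bayes_err=0.005)[2])  # ['high bias']
print(diagnose(0.01, 0.10, bayes_err=0.005)[2])  # ['high variance']
print(diagnose(0.10, 0.20, bayes_err=0.005)[2])  # ['high bias', 'high variance']
```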

## Basic Recipe for Machine Learning

There is a systematic way to remedy problems such as high bias and high variance. There is a lot more to this topic, but at the basic level we suggest the following:

- High Bias (Underfitting)

    For models that suffer from high bias, the following can be tried:

    - increase model capacity (a bigger network, more parameters)

    - introduce or synthesize more features

    - train for longer (more iterations, a tighter convergence criterion) if using an iterative algorithm

    - train with a different optimizer (RMSProp, Adam, AdaGrad)

    - train from different initial conditions

    - search extensively for a better neural network architecture and hyperparameters

- High Variance (Overfitting)

    For models that suffer from high variance, try:

    - decrease model capacity (a smaller network, fewer parameters)

    - reduce the number of features (for example with PCA, or feature selection guided by AIC)

    - regularize (an L1 or L2 penalty on the size of the parameters, or dropout)

    - use more training data to reduce variability in the parameter estimates

    - average bootstrapped models (bagging)

    - if using a classifier, check whether the classes are fairly balanced. If not, use a balancing strategy (over/undersampling, SMOTE)

    - search extensively for a better neural network architecture and hyperparameters
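
To make the "penalty on the size of the parameters" idea from the list above concrete, here is a minimal sketch of an L1 or L2 penalty added to a data-fit cost. The function names, the flat list of weights and the plain `lam * sum(...)` scaling are assumptions for illustration; libraries and courses often use different scaling conventions (for example dividing by the number of examples).

```python
def penalty(weights, lam, kind="l2"):
    """Penalty on parameter size: lam * sum(w^2) for L2, lam * sum(|w|) for L1."""
    if kind == "l2":
        return lam * sum(w * w for w in weights)
    if kind == "l1":
        return lam * sum(abs(w) for w in weights)
    raise ValueError(f"unknown penalty kind: {kind!r}")


def regularized_cost(data_cost, weights, lam, kind="l2"):
    """Training objective = data-fit cost + penalty on the weights."""
    return data_cost + penalty(weights, lam, kind)


weights = [0.5, -2.0, 1.5]  # hypothetical parameter values
unregularized = regularized_cost(0.25, weights, lam=0.0)
with_l2 = regularized_cost(0.25, weights, lam=0.1)
with_l1 = regularized_cost(0.25, weights, lam=0.1, kind="l1")
print(unregularized, with_l2, with_l1)
```

Because large weights now raise the objective, the optimizer is pushed towards smaller weights; a larger `lam` trades a worse fit on the training data for less sensitivity to it.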

### Bias Variance Tradeoff

Before deep learning models and the big data era, there used to be a lot of talk of trading off bias against variance. One had to accept either higher variance or higher bias, trying to find the best balance for the task at hand.

However, with more data, more compute and perhaps a better understanding of good practice in the ML community, the tradeoff is less of a constraint. Changing some hyperparameters, tweaking the algorithm or collecting more data can often reduce bias or variance without hurting the other much, which helps get closer to the Bayes error rate for a problem.

## Regularization




## Why Regularization Reduces Overfitting

## Dropout Regularization

## Understanding Dropout

## Other Regularization Methods

## Normalizing Inputs

## Vanishing or Exploding Gradients

## Weight Initialization in a Deep Network

## Numerical Approximations of Gradients

## Gradient Checking

## Minibatch Gradient Descent