# Improving Neural Networks

Neural networks need tuning to work well. I have already discovered this while playing with them on baseball stats and with neural net simulators: it's not easy to judge what the best network structure will be when we start out. Ng says "it's almost _impossible_ to correctly guess the right values for all [the variables in the network]."

## Some crucial parameters/hyperparameters

- number of layers
- number of hidden units
- learning rates
- activation functions

The process is iterative:
idea > code > experiment > idea

### What works?!?

The short answer is that we don't know. I love this; it reminds me of science and philosophy of science. Problems are idiosyncratic. The skilled practitioner may have a lot of ideas, but rarely knows which ones will generate solutions. The idiosyncrasies emerge from a number of different characteristics of the problem: how much data do you have? how many input features? what sort of hardware are you running? what type of problem (NLP, Vision, structured data, etc.) do you have? Intuitions from one field don't necessarily translate to another. (I would love to know whether he agrees with the statement "sometimes a good solution is found by combining diverse intuitions, which may come from people who are less experienced in the particular domain." That's such a common feature of scientific discovery.)

## Good Process

Our ability to go around the iterative cycle efficiently therefore makes a big difference to our ability to find solutions.

### Train/dev/test sets

You already basically know the train/test stuff. The development (dev) set is basically a test set for use during development. It's where we experiment with our ideas to see whether they improve the model. The final test set is for validating the final model; it's important because good results on the dev data could be a fluke of the tuning, i.e., the tuning might be biased toward features of the dev set, so if you don't have a dev set...

**Size choices** Thinking about size: more training data improves model performance, and NNs really start to outperform other models when they have a lot of training data. Dev and test sets must be large enough to produce reliable statistical information on model performance. Thus, when working with relatively small data sets, a 60/20/20 split is common. However, when working with much larger "big data" sets, we don't need to give such a large portion to dev and test. 10,000 examples each for dev and test are usually ample for statistical validation, and the model will be better if it has more data to learn from.
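
A rough sketch of the size rule above in NumPy. The function names, the choice to store examples as rows of `X`, and the exact rule (20% each for dev and test, capped at 10,000 examples) are my own assumptions for illustration, not something from the course.

```python
import numpy as np

def split_sizes(m, cap=10_000):
    """Return (n_train, n_dev, n_test) for a data set of m examples.

    Assumption: dev and test each get 20% of the data, but never more
    than `cap` examples; everything left over goes to training.
    """
    n_holdout = min(m // 5, cap)
    return m - 2 * n_holdout, n_holdout, n_holdout

def train_dev_test_split(X, y, seed=0):
    """Shuffle the examples (rows of X) once, then cut them into three sets."""
    m = X.shape[0]
    n_train, n_dev, _ = split_sizes(m)
    idx = np.random.default_rng(seed).permutation(m)
    train, dev, test = np.split(idx, [n_train, n_train + n_dev])
    return (X[train], y[train]), (X[dev], y[dev]), (X[test], y[test])
```

With 1,000 examples this gives the 60/20/20 split; with 1,000,000 examples the cap kicks in and 98% of the data goes to training.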

**Distribution** This is a no-brainer: ideally, your train and test set come from the same distribution. It's usually better not to use, say, web-scraped images for training and user uploaded images for testing. Ideally, of course, the distribution from which we do our training and testing is also the distribution which we're deploying on.

**No Test Set** Sometimes when we don't need an unbiased estimate of model performance, we skip the test set and just train and develop until we're satisfied with the model. 



## Bias and Variance

Examples. 

Suppose you have a problem that people can solve, like cat classification. Then...

Train set error = 1% and dev set error = 11% indicates _high variance_.

Train set error = 15% and dev set error = 16% indicates _high bias_.

Train set error = 15% and dev set error = 30% indicates _both_.
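
The three cases above can be turned into a tiny rule of thumb. This is only a sketch: the function name `diagnose`, the 5-point cutoff `tol`, and the baseline `base_err` (the error a person would make, taken as roughly 0 for cat classification) are assumptions I made up for illustration.

```python
def diagnose(train_err, dev_err, base_err=0.0, tol=0.05):
    """Label a (train, dev) error pair.

    Assumptions: errors are fractions in [0, 1]; `base_err` is the error a
    person would make on the task; `tol` is an arbitrary cutoff for what
    counts as a "large" gap.
    """
    high_bias = (train_err - base_err) > tol      # does not even fit the train set
    high_variance = (dev_err - train_err) > tol   # fits train, fails to generalize
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias and low variance"
```

For example, `diagnose(0.01, 0.11)` returns `"high variance"` and `diagnose(0.15, 0.30)` returns `"high bias and high variance"`.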

## Regularization 

Regularization is one of the best ways to handle high variance (overfitting).

When we regularize logistic regression, we add a regularization term.

$
J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_2^2
$


This is L2 regularization. It is the most common form of regularization.
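
A minimal NumPy sketch of the L2-regularized cost for logistic regression. I'm assuming that L is the usual cross-entropy loss and that examples are stored as columns of `X`; the function name is hypothetical. The regularization constant is called `lambd` because `lambda` is a keyword in Python.

```python
import numpy as np

def l2_regularized_cost(w, b, X, y, lambd):
    """Cross-entropy cost plus (lambd / 2m) * ||w||_2^2.

    Assumed shapes: X is (n_features, m), y is (m,), w is (n_features,).
    """
    m = X.shape[1]
    z = w @ X + b                              # linear part, shape (m,)
    y_hat = 1.0 / (1.0 + np.exp(-z))           # sigmoid prediction
    cross_entropy = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    penalty = (lambd / (2 * m)) * np.sum(w ** 2)
    return cross_entropy + penalty
```

Setting `lambd=0` recovers the unregularized cost, so the penalty is easy to switch off when comparing runs.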

An alternative is L1 regularization, so named because it penalizes the L1 norm of the parameters (the sum of their absolute values) instead of the squared L2 norm. It tends to make a model sparse, as you know.
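
For comparison, here is what the L1 version of the logistic regression cost would look like. The factor in front of the penalty is only a scaling convention; I've kept $\frac{\lambda}{2m}$ to mirror the L2 formula.

$
J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_1, \qquad \|w\|_1 = \sum_{j} |w_j|
$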

### What's $\lambda$ ?

It's a constant hyperparameter that you have to tune. Note that in Python `lambda` is a reserved keyword. Use `lambd` or similar in your code.

### For a neural net

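$ J(W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]}) = \frac{1}{m}\sum\limits_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum\limits_{l=1}^L \|W^{[l]}\|_F^2
$

Here the norm of a weight matrix is the Frobenius norm, so $\|W^{[l]}\|_F^2$ is the sum of the squares of all entries of $W^{[l]}$.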
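
A short sketch of the penalty term for a network, assuming the weight matrices are kept in a plain Python list and the unregularized cost has already been computed elsewhere. Both function names are hypothetical.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """(lambd / 2m) times the sum, over layers, of the squared entries of each W."""
    return (lambd / (2 * m)) * sum(np.sum(W ** 2) for W in weights)

def regularized_cost(data_cost, weights, lambd, m):
    """Add the L2 penalty to an already-computed unregularized cost."""
    return data_cost + l2_penalty(weights, lambd, m)
```

Only the weight matrices go into the penalty here, matching the formula; the bias vectors are left out.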

### Why does this work

Short version: the penalty pushes the weights closer to zero, resulting in a simpler network.

Longer, better version: tanh is approximately linear near 0. Smaller weights give smaller values of z, so tanh(z) stays in that roughly linear range and each layer behaves almost like a linear function. As a result, the network can no longer represent the complex, highly non-linear functions that over-fitting needs.
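
A quick numerical check of the tanh argument. The function `linearity_gap` is something I made up for illustration: it draws one random weight matrix and input, scales the weights, and measures how far tanh(z) is from z itself.

```python
import numpy as np

def linearity_gap(scale, seed=0):
    """Largest |tanh(z) - z| for z = W x, with the weights W shrunk by `scale`."""
    rng = np.random.default_rng(seed)
    W = scale * rng.standard_normal((4, 3))   # same random W for a given seed
    x = rng.standard_normal((3, 1))
    z = W @ x
    return float(np.max(np.abs(np.tanh(z) - z)))

# Smaller weights give a smaller gap, i.e. a layer that is closer to linear.
gap_small_weights = linearity_gap(0.01)
gap_large_weights = linearity_gap(1.0)
```

Because the seed is fixed, the two calls differ only in the scale of the weights, so the comparison isolates the effect of shrinking them.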

## Dropout Regularization

This is an alternative to L2 regularization.