# Lab 3: Fully-Connected Neural Networks

# Nonlinear modeling
Logistic regression and linear regression are linear models: the output variable (or the vector of logits, in logistic regression) is a linear combination of the input features.
These models can model nonlinear terms in your data, including interactions between the features, but only if you design those terms by hand.
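
As a small illustration of a hand-designed term, here is a NumPy sketch under made-up assumptions: the target is exactly the interaction `x1 * x2`, the data is synthetic, and `fit_mse` is a hypothetical helper (not part of any library) that fits ordinary least squares and reports the training error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): the target is a pure interaction term.
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = x1 * x2

def fit_mse(features, y):
    """Fit least-squares linear regression (with a bias column) and return the training MSE."""
    design = np.column_stack(features + [np.ones_like(y)])
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.mean((design @ weights - y) ** 2)

# Linear model on the raw features: it has no way to express x1 * x2.
mse_raw = fit_mse([x1, x2], y)

# Same linear model, plus the interaction term designed by hand.
mse_engineered = fit_mse([x1, x2, x1 * x2], y)

print(f"raw features:          MSE = {mse_raw:.4f}")
print(f"with hand-made x1*x2:  MSE = {mse_engineered:.2e}")
```

The model itself never changes; only the features do. The fit becomes essentially exact once someone already knows which term to add, which is the catch.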

By contrast, nonlinear models such as neural networks can "automatically" discover interaction terms and nonlinear terms in the input.
This lets them represent much more complicated functions, since most interesting problems involve terms too difficult to design by hand.

The key distinguishing factor between different kinds of nonlinear models is how those terms are discovered.
Very generic models, like RBF SVMs, can represent almost any kind of mapping from input to output, but as a result can't learn "smart" features because the only property they assume about the data is that similar inputs map to similar outputs.
More "opinionated" models, like neural networks, learn a function from a smaller class but can learn more complicated relationships with less data.

#### Aside: The "no free lunch theorem of machine learning"
This theorem roughly says: no machine learning model strictly outperforms any other model on all problems.
It's kind of controversial whether this really means anything: sure, you can construct a test set whose labels have nothing to do with the input, so that random guessing does better than even mean-fitting.

But the general idea is important for understanding why we'd choose one model over another.
No model does better than any other on random functions.
Instead, when picking a model, you need to think about what _properties of the real world_ the model does well on.
For instance, if you expect classes to be well-separated, then an SVM is a good choice; if you expect neighborhood relationships to be more important, then K-nearest-neighbors might be a good fit.
When thinking about models, think about what _priors_ they impose on the functions they learn.
We'll justify neural networks by showing that they impose priors we'd reasonably expect problems in the real world to follow.

# Neural networks
Neural networks get interpreted in a ton of different ways.
Here are a few of my favorites (though there exist more).

## Representation learning
Prior to deep learning, the dominant approach to machine learning was "feature engineering" -- designing features (interactions, nonlinear terms, etc.) by hand, with the assumption that some simple model of them (linear model, K-NN) will suffice to solve the problem.
One way to think about neural networks is that they perform this process automatically, learning representations in their hidden layers and then fitting a simple model (linear regression or logistic regression) to the output data.

The last layer is a linear model of the activation of the second-to-last layer, and so the set of hidden layers defines a nonlinear function from the input features to the final representation.
We want this function to "disentangle" the input, so that a linear model is sufficient to perform the final step.
The activations of hidden layers form "representations" of the input that contain the same important information, but discard noise and make the information more easily accessible.
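
A tiny sketch of what "disentangling" can look like, using XOR. The hidden-layer weights below are hand-picked for illustration (a well-known textbook construction), not something a training run is guaranteed to find, and all variable names are my own.

```python
import numpy as np

# The four XOR inputs and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Best linear fit on the raw inputs (with a bias column).
design = np.column_stack([X, np.ones(4)])
w_raw, *_ = np.linalg.lstsq(design, y, rcond=None)
linear_pred = design @ w_raw  # every prediction is 0.5: the raw inputs are "entangled"

# One hidden layer of two ReLU units with hand-picked (assumed) weights.
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([0.0, -1.0])
hidden = np.maximum(X @ W_hidden + b_hidden, 0.0)  # the learned "representation"

# The last layer is just a linear model of the hidden activations.
w_out = np.array([1.0, -2.0])
network_pred = hidden @ w_out

print("linear model on raw inputs:  ", linear_pred)
print("hidden representation:\n", hidden)
print("linear model on hidden layer:", network_pred)
```

In the hidden representation the inputs `[0, 1]` and `[1, 0]` collapse onto the same point, and a plain linear readout is then enough to reproduce XOR.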

This is achieved with the backpropagation algorithm and gradient descent.
The gradient of the loss function not only tells the last layer how it should change to fit the data better, but also how the previous layer should change such that the final layer does better. 
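
Here is a minimal NumPy sketch of that idea for a network with one hidden layer. It is an illustration under assumptions, not a reference implementation: the sizes, the tanh activation, the squared-error loss, and the helper names `loss_fn`, `backprop`, and `numerical_grad` are all choices made for this example. The intermediate `d_hidden` is the "how should the previous layer's output change" signal, and the result is checked against finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: 5 samples, 3 input features, 4 hidden units, 1 output.
X = rng.normal(size=(5, 3))
y = rng.normal(size=(5, 1))
params = {
    "W1": rng.normal(size=(3, 4)) * 0.5,
    "b1": np.zeros((1, 4)),
    "W2": rng.normal(size=(4, 1)) * 0.5,
    "b2": np.zeros((1, 1)),
}

def loss_fn(params, X, y):
    """Forward pass: tanh hidden layer, linear output layer, half mean squared error."""
    hidden = np.tanh(X @ params["W1"] + params["b1"])
    pred = hidden @ params["W2"] + params["b2"]
    return 0.5 * np.mean((pred - y) ** 2)

def backprop(params, X, y):
    """Gradients of loss_fn with respect to every parameter, via the chain rule."""
    n = X.shape[0]
    hidden = np.tanh(X @ params["W1"] + params["b1"])
    pred = hidden @ params["W2"] + params["b2"]

    d_pred = (pred - y) / n                  # how the predictions should change
    grads = {
        "W2": hidden.T @ d_pred,             # how the last layer should change
        "b2": d_pred.sum(axis=0, keepdims=True),
    }
    d_hidden = d_pred @ params["W2"].T       # how the hidden activations should change
    d_pre = d_hidden * (1.0 - hidden ** 2)   # back through tanh
    grads["W1"] = X.T @ d_pre                # how the hidden layer should change
    grads["b1"] = d_pre.sum(axis=0, keepdims=True)
    return grads

def numerical_grad(params, name, X, y, eps=1e-5):
    """Central finite differences, used only to sanity-check backprop."""
    grad = np.zeros_like(params[name])
    for idx in np.ndindex(*params[name].shape):
        original = params[name][idx]
        params[name][idx] = original + eps
        plus = loss_fn(params, X, y)
        params[name][idx] = original - eps
        minus = loss_fn(params, X, y)
        params[name][idx] = original
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

grads = backprop(params, X, y)
max_diff = max(
    np.max(np.abs(grads[name] - numerical_grad(params, name, X, y)))
    for name in params
)
print("largest gap between backprop and finite differences:", max_diff)
```

Note that `d_hidden` depends on the current value of `W2`: the hidden layer is told how to help the last layer as it is right now.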
 
This same reasoning applies within hidden layers.
The last hidden layer has the difficult task of giving the output layer a good enough representation to perform the task linearly.
This task is easier if the last hidden layer itself has a good representation as input.
And so on.
That's why we use multiple hidden layers: because each one makes the next one's job easier.

Neural networks of sufficient size are **universal function approximators** (and this is a theorem!): with at least one hidden layer, a nonlinear activation, and enough hidden units, they can approximate any continuous function on a closed, bounded input region to arbitrary accuracy.
Wider networks learn a richer representation per hidden layer, and deeper networks learn more representations, so each one can be simpler.
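
To get a feel for the role of width, here is a hand-built sketch rather than a trained model. It constructs a one-hidden-layer ReLU network that linearly interpolates a target function between evenly spaced knots, so more hidden units means a finer piecewise-linear fit. The function names `build_relu_interpolator` and `network`, the choice of `sin` on `[0, pi]`, and the widths 4 and 32 are all assumptions for this example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def build_relu_interpolator(target, lo, hi, n_hidden):
    """Pick weights so the network linearly interpolates `target` at n_hidden + 1 knots."""
    knots = np.linspace(lo, hi, n_hidden + 1)
    values = target(knots)
    slopes = np.diff(values) / np.diff(knots)    # slope of each linear piece
    out_weights = np.diff(slopes, prepend=0.0)   # change in slope at each knot
    hidden_bias = -knots[:-1]                    # unit i switches on at knot i
    out_bias = values[0]
    return hidden_bias, out_weights, out_bias

def network(x, hidden_bias, out_weights, out_bias):
    """One hidden ReLU layer (all input weights are 1) followed by a linear output layer."""
    hidden = relu(x[:, None] + hidden_bias[None, :])
    return hidden @ out_weights + out_bias

xs = np.linspace(0.0, np.pi, 1000)

def max_error(n_hidden):
    weights = build_relu_interpolator(np.sin, 0.0, np.pi, n_hidden)
    return np.max(np.abs(network(xs, *weights) - np.sin(xs)))

error_narrow = max_error(4)
error_wide = max_error(32)
print(f"4 hidden units:  max error = {error_narrow:.4f}")
print(f"32 hidden units: max error = {error_wide:.4f}")
```

This only shows that good weights exist for a wide enough layer. It says nothing about whether gradient descent would find them, and the theorem doesn't either.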

It's worth noting: at every step, the hidden layers change so that their activations would make the _current_ version of the final layer do better.
But the final layer is also changing, so we can't necessarily change each layer in just the right way.
This problem gets harder with deeper models.

## Hierarchical pattern-matching
TODO

## Function composition
TODO

## Effects of depth and width
TODO

## Mathematically...
TODO

# Activation functions
### Logistic sigmoid
TODO

### ReLU
TODO

# SGD with momentum
TODO

# Layer initialization, saturation, and variance scaling
TODO

### Sigmoid units with Glorot initialization
TODO

### ReLU units with He initialization
TODO

# Keras
TODO

### Sequential models
TODO

# tfdbg, the TensorFlow debugger
TODO

# More TensorBoard visualizations
### Learning dynamics
TODO

### Layer activations
TODO