# Multi-Layer Perceptron

## Model Specification

The multi-layer perceptron (MLP) is a multi-layer generalization of Rosenblatt's perceptron. Rosenblatt's perceptron is a variant of the McCulloch and Pitts neuron with a provably convergent learning rule, which is similar to gradient descent.

![mlp.PNG](attachment:mlp.PNG)
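
In symbols, a minimal sketch of the model could look as follows. The notation ($w$, $b$, $\eta$, $f$, $W^{(l)}$, $h^{(l)}$) is an assumption introduced here, not taken from the figure. A single perceptron applies a threshold to a weighted sum, Rosenblatt's rule adjusts the weights only on misclassified samples, and an MLP stacks such layers:

$$
\hat{y} = \begin{cases} 1 & \text{if } w^\top x + b \ge 0 \\ 0 & \text{otherwise} \end{cases}
\qquad
w \leftarrow w + \eta\,(y - \hat{y})\,x, \quad b \leftarrow b + \eta\,(y - \hat{y})
$$

$$
h^{(0)} = x, \qquad h^{(l)} = f\!\left(W^{(l)} h^{(l-1)} + b^{(l)}\right), \quad l = 1, \dots, L
$$

Here $y \in \{0, 1\}$ is the target label, $\eta > 0$ is a learning rate, and $f$ is the activation function (the threshold function in the classic perceptron).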

### Variants and Generalizations

* McCulloch and Pitts model: Neurons as Boolean threshold units
    
    - Models the brain as performing propositional logic
    
    - But no learning rule

* Hebb's learning rule: Neurons that fire together wire together, but the rule is unstable.
* Modern neural network models are essentially MLPs with different activation functions.
* ADALINE and MADALINE

See the various optimization techniques and tricks in [efficient_deep_learning_and_optimization.ipynb](efficient_deep_learning_and_optimization.ipynb).

## Theoretical Properties

A bit of history

* Neural networks began as computational models of the brain
* Neural network models are connectionist machines
    - They comprise networks of neural units
* McCulloch and Pitts model: Neurons as Boolean threshold units
    - Models the brain as performing propositional logic
    - But no learning rule
* Hebb's learning rule: Neurons that fire together wire together
    - Unstable

Rosenblatt's perceptron in its individual (single-unit) form can express 'and', 'not' and 'or', but not 'xor' (Minsky and Papert, 1969). Its multi-layer version, the MLP, can be shown to be

* able to **model arbitrary Boolean functions**, with individual perceptrons acting as Boolean gates.
    > Even a network with a _single_ hidden layer is a universal Boolean machine, though it may require an exponentially large number of perceptrons. Making the network deeper can reduce its size (the number of neurons per layer), sometimes exponentially. However, to approximate arbitrary functions, each layer still needs sufficient capacity so that information is not lost in the filtering. This can be somewhat alleviated by replacing the perceptron's threshold activation with activations such as ReLU or sigmoid, which indicate how far a sample point is from the decision boundary, so that information is not lost. The same intuition carries over to the cases where an MLP is used as a classifier or to approximate other functions.
* a universal classifier, with individual perceptrons acting as feature detectors, or correlation filters that fire when a pattern is recognized (i.e. the correlation is strong enough to trigger the activation).
* a universal function approximator:
  > It can be shown that this is possible even with just one hidden layer, given an unbounded number of neurons. The construction first builds a 'cylinder' in the high-dimensional input space. This result probably has only theoretical appeal.
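
The Boolean-gate view above can be sketched in a few lines of Python. This is only an illustration: the weights and biases are hand-picked values (an assumption, not from the source), and `perceptron`, `gate_and`, `gate_or`, `gate_not` and `xor_mlp` are hypothetical helper names. It shows single threshold units acting as 'and', 'or' and 'not', and a network with one hidden layer computing 'xor', which no single unit can express.

```python
from itertools import product


def perceptron(weights, bias, inputs):
    """Threshold unit: outputs 1 when the weighted sum plus bias is >= 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total >= 0 else 0


# Single perceptrons as Boolean gates (hand-picked, illustrative weights).
def gate_and(a, b):
    return perceptron([1, 1], -1.5, [a, b])


def gate_or(a, b):
    return perceptron([1, 1], -0.5, [a, b])


def gate_not(a):
    return perceptron([-1], 0.5, [a])


def xor_mlp(a, b):
    """XOR as (a OR b) AND NOT (a AND b), using one hidden layer of two units."""
    h_or = gate_or(a, b)
    h_and = gate_and(a, b)
    # The output unit fires only when h_or is on and h_and is off.
    return perceptron([1, -1], -0.5, [h_or, h_and])


# Truth table of the two-layer network over all Boolean input pairs.
xor_table = {(a, b): xor_mlp(a, b) for a, b in product([0, 1], repeat=2)}
print(xor_table)
```

The hidden layer here has two units for two inputs. For an arbitrary Boolean function of many inputs, a single hidden layer may need exponentially many units, which is the cost the note above refers to.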

### Advantages and Disadvantages

* Disadvantages

    - Learning networks of threshold-activation perceptrons requires solving a hard combinatorial optimization problem, because the threshold function is flat almost everywhere and we cannot compute the influence of small parameter changes on the overall error. That is why modern neural networks use continuous activation functions with non-zero derivatives, which enable us to estimate network parameters by gradient-based optimization.
    - This is not necessarily a disadvantage, but perceptrons and MLPs are methods with low bias and high variance: they are very sensitive to outliers and to any single training sample. This contrasts with neural networks that use other activation functions. More specifically, a neural network with differentiable activations, trained with backprop, will often not find a separating solution **even though the solution is within the class of functions learnable by the network**.
    
* Advantages

    - Perceptrons are guaranteed to eventually separate the classes (albeit possibly after an extremely long time), as long as the problem is linearly separable.
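
The convergence property can be illustrated with a minimal sketch of a Rosenblatt-style learning rule. The function name `train_perceptron`, the zero initialization, the learning rate and the epoch cap are all assumptions made for this example. On the linearly separable 'and' data the rule reaches an error-free pass; on 'xor' it never does, however long it runs.

```python
def train_perceptron(samples, max_epochs=100, lr=1.0):
    """Perceptron rule: update weights only on misclassified samples.

    `samples` is a list of (inputs, label) pairs with labels in {0, 1}.
    Returns (weights, bias, epochs_used); epochs_used is None when no
    error-free pass over the data occurred within max_epochs.
    """
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for inputs, label in samples:
            total = sum(w * x for w, x in zip(weights, inputs)) + bias
            prediction = 1 if total >= 0 else 0
            delta = label - prediction  # -1, 0, or +1
            if delta != 0:
                errors += 1
                weights = [w + lr * delta * x for w, x in zip(weights, inputs)]
                bias += lr * delta
        if errors == 0:
            return weights, bias, epoch
    return weights, bias, None


and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_weights, and_bias, and_epochs = train_perceptron(and_data)
_, _, xor_epochs = train_perceptron(xor_data)

print("AND converged:", and_epochs is not None)
print("XOR converged:", xor_epochs is not None)
```

Note that the update uses only the sign of the error, not a gradient, which is why it works with the non-differentiable threshold activation for a single unit but does not extend directly to hidden layers.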

### Relation to Other Models

The MLP is a special case of the modern neural network, which typically uses **different activation functions**.

| Activation function | Dying neurons or saturation at extreme inputs (killing gradients) | Computationally expensive | Can produce negative output and gradient |
| - | - | - | - |
| sigmoid | yes | relatively yes (exponential functions) | no |
| tanh | yes | relatively yes (exponential functions) | yes |
| ReLU | yes | no | no |
| Leaky ReLU/PReLU (parametric ReLU) | no | no | yes |
| maxout | no, but doubles the number of parameters | no | yes |

And as mentioned above, the choice of activation function can be essential for the performance of the network.
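
To make the saturation and sign properties concrete, here is a small stdlib-only sketch of the activations discussed in this section. The function names, the leaky slope `alpha=0.01`, the two linear pieces used for maxout and the finite-difference helper `numeric_grad` are all illustrative assumptions. It evaluates each activation and an approximate derivative at a large negative and a large positive input.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    # alpha is a fixed small slope here; PReLU would learn it instead.
    return x if x > 0 else alpha * x


def maxout(x, pieces=((1.0, 0.0), (0.5, -1.0))):
    # Maximum over linear pieces (weight, bias); these pieces are made up.
    return max(w * x + b for w, b in pieces)


def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


activations = {
    "sigmoid": sigmoid,
    "tanh": tanh,
    "relu": relu,
    "leaky_relu": leaky_relu,
    "maxout": maxout,
}

# For each activation: (value, approximate gradient) at x = -10 and x = +10.
report = {
    name: [(f(x), numeric_grad(f, x)) for x in (-10.0, 10.0)]
    for name, f in activations.items()
}

for name, rows in report.items():
    print(name, rows)
```

In this sketch the sigmoid and tanh gradients are close to zero at both extremes (saturation), the ReLU gradient is zero for the negative input (a 'dead' unit if this happens for every sample), and the leaky ReLU and maxout keep a non-zero gradient on both sides.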

## Empirical Performance

There is anecdotal evidence that the variance of a neural network decreases with depth.

It is also a popular hypothesis that:
- In large networks, saddle points are far more common than local minima. This is probably because large networks have more parameters, so the search space is higher-dimensional.
- Most local minima are equivalent. As mentioned above, increasing the number of neurons or layers increases model capacity, as well as the risk of over-fitting. But it is a bad idea to prevent over-fitting by restricting the number of neurons or layers, since the local minima of smaller networks tend to have larger losses than those of bigger networks. It is better to use other regularization methods, such as L1/L2 penalties or dropout.
- These hypotheses do not hold for small networks.
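
As a rough sketch of the regularizers named above, the snippet below computes L1 and L2 penalty terms and applies inverted dropout to a list of activations. The helper names, the coefficient `lam`, the `keep_prob` value and the seed are assumptions chosen for illustration. For actual training, see the PyTorch notebooks referenced in the implementation section.

```python
import random


def l1_penalty(weights, lam):
    """L1 term added to the loss: lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 term added to the loss: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)


def inverted_dropout(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob.

    Survivors are scaled by 1 / keep_prob so that the expected value of
    each activation is unchanged and no rescaling is needed at test time.
    """
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


weights = [1.0, -2.0]
print("L1 penalty:", l1_penalty(weights, lam=0.1))
print("L2 penalty:", l2_penalty(weights, lam=0.1))

rng = random.Random(0)  # fixed seed so the example is repeatable
hidden = [1.0] * 8
dropped = inverted_dropout(hidden, keep_prob=0.5, rng=rng)
print("after dropout:", dropped)
```

Both approaches leave the network size unchanged, which matches the advice above to regularize a large network rather than shrink it.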

### Advantages and Disadvantages

## Implementation Details and Practical Tricks

 - See the PyTorch implementation details in CNN.ipynb and RNN.ipynb.

 - See the various optimization techniques and tricks in [efficient_deep_learning_and_optimization.ipynb](efficient_deep_learning_and_optimization.ipynb).

## Use Cases

## Results Interpretation, Metrics and Visualization

## References

- CMU Deep Learning Course Fall 2019.
- Stanford CS231n

### Further Reading

- CMU Deep Learning Course Fall 2019, Lecture 3 about ADALINE and MADALINE


## Misc.