# Chapter 5: Multilayer Perceptron

### Dhuvi Karthikeyan

##### 1/29/2023

## 5.1 Multilayer Perceptron Theory

### 5.1.1 Hidden Layers

* Affine transformation = linear transformation with bias.
* Linearity implies monotonicity
* Linear assumptions can work for monotonic relationships, whether naturally occurring or artificially engineered, but unless we can learn the mapping from raw inputs to monotonic features, the model will be very brittle


**Note:** The deep learning paradigm is to use observational data to jointly learn a representation (via hidden layers) and a linear predictor that acts upon that hidden representation.

#### Incorporating Hidden Layers

When stacking a number of dense (fully connected) layers, we say that the representation is handled by the first L-1 layers and the Lth layer is the "output layer".

   * Interestingly, the representation is not just the output of layer L-1 but the entire stack of the first L-1 layers
   * Adding the nonlinearity $\sigma$ makes the MLP strictly more expressive than a linear model
   
   
**Universal Approximators**: A network with a single hidden layer can, given sufficient width, approximate any continuous function to arbitrary accuracy.

* Kernel methods can solve the problem of fitting a function exactly, even in infinite-dimensional spaces

### 5.1.2 Activation Functions

#### ReLU Function

Rectified linear unit: $\text{ReLU}(x) = \max(x, 0)$.
Piecewise linear, so formally the derivative isn't defined at 0, but in practice we use the left-hand derivative and take it to be 0 there.

* ReLU derivatives are well behaved: they are either 1 or 0, which helps mitigate the vanishing gradient problem
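
As a minimal sketch of the definition above, the following plain-Python functions compute ReLU and its derivative, using the convention that the derivative at 0 is 0. The function names `relu` and `relu_grad` are illustrative choices, not taken from any library.

```python
def relu(x: float) -> float:
    """Rectified linear unit: max(x, 0)."""
    return max(x, 0.0)


def relu_grad(x: float) -> float:
    """Derivative of ReLU; by convention the value at x == 0 is taken to be 0."""
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]
grads = [relu_grad(x) for x in inputs]
print(outputs)
print(grads)
```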
    
#### Sigmoid Function

Known as a squashing function because it maps values from $(-\infty, \infty)$ to $(0, 1)$.
* Although the sigmoid is usually replaced by ReLU for training stability purposes, it is particularly well suited to recurrent neural network architectures for its ability to control information flow over time.
* Derivative: $\frac{d}{dx}\text{sigmoid}(x) = \text{sigmoid}(x)[1-\text{sigmoid}(x)]$
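
The derivative identity can be checked numerically. The sketch below compares the closed form against a central finite difference at a few hand-picked points; the helper `central_difference` and the step size `eps` are assumptions made for this illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Closed-form derivative: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def central_difference(f, x: float, eps: float = 1e-6) -> float:
    """Numerical derivative of f at x (illustrative helper)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


points = [-4.0, -1.0, 0.0, 1.0, 4.0]
max_error = max(
    abs(sigmoid_grad(x) - central_difference(sigmoid, x)) for x in points
)
print(max_error)
```

The derivative peaks at x = 0, where it equals 0.25, and decays toward 0 for large positive or negative inputs.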
    
#### Tanh Function

The tanh function is also a squashing function, this time onto $(-1, 1)$ instead of $(0, 1)$.
It is very similar to the sigmoid but has point symmetry about the origin.

* Derivative: $\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$
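
The similarity to the sigmoid can be made explicit. As a short worked identity (not part of the original notes), tanh is a rescaled and shifted sigmoid, and its derivative follows from the sigmoid derivative by the chain rule:

$$ \tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}} = 2\,\text{sigmoid}(2x) - 1 $$

$$ \frac{d}{dx}\tanh(x) = 4\,\text{sigmoid}(2x)\,[1 - \text{sigmoid}(2x)] = 1 - \tanh^2(x) $$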
    
#### Swish Activation 

$$ \sigma(x) = x*\text{sigmoid}(\beta x)$$
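
A minimal sketch of the Swish formula above, assuming a scalar input and a fixed (not learned) $\beta$. It also illustrates two limiting cases: with $\beta = 0$ the function reduces to $x/2$, and for large $\beta$ it approaches ReLU.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def swish(x: float, beta: float = 1.0) -> float:
    """Swish activation: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


half_linear = swish(2.0, beta=0.0)       # sigmoid(0) = 0.5, so this is x / 2
relu_like_neg = swish(-3.0, beta=50.0)   # close to max(-3, 0) = 0
relu_like_pos = swish(3.0, beta=50.0)    # close to max(3, 0) = 3
print(half_linear, relu_like_neg, relu_like_pos)
```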

## 5.3 Forward Propagation, Backward Propagation, and Computational Graphs

### 5.3.1 Forward Propagation

"Forward pass" is calculating and storing the intermediate variables (outputs) of a neural network, in order, from input to output layer.

### 5.3.2 Computational Graph of Forward Propagation

Visualizing the computational graph is very useful for understanding how information flows through the network and which variables depend on which operators.

### 5.3.3 Backpropagation

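Backpropagation traverses the network in reverse order, from the output layer to the input layer, applying the chain rule from calculus: the gradients of each layer's outputs w.r.t. its inputs are multiplied together to obtain the gradients of the loss w.r.t. every intermediate variable and parameter.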
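
The forward and backward passes can be sketched for a tiny network. Everything specific here is an assumption for illustration: one ReLU hidden layer without biases, a squared-error loss, hand-picked weights, and the names `forward` and `backward`. The analytic gradient is compared against a finite-difference estimate for one weight.

```python
def forward(W1, w2, x, y):
    """Forward pass: compute and store every intermediate variable in order."""
    z = [sum(w * xi for w, xi in zip(row, x)) for row in W1]  # pre-activations
    h = [max(zi, 0.0) for zi in z]                            # ReLU hidden layer
    o = sum(w * hi for w, hi in zip(w2, h))                   # scalar output
    loss = 0.5 * (o - y) ** 2                                 # squared error
    return {"z": z, "h": h, "o": o, "loss": loss}


def backward(w2, x, y, cache):
    """Backward pass: chain rule applied from the loss back to the weights."""
    d_o = cache["o"] - y                                      # dL/do
    d_w2 = [d_o * hi for hi in cache["h"]]                    # dL/dw2
    d_h = [d_o * w for w in w2]                               # dL/dh
    d_z = [dh * (1.0 if zi > 0 else 0.0)                      # through ReLU
           for dh, zi in zip(d_h, cache["z"])]
    d_W1 = [[dz * xi for xi in x] for dz in d_z]              # dL/dW1
    return d_W1, d_w2


W1 = [[0.5, -0.2], [0.3, 0.8]]
w2 = [1.0, -1.5]
x, y = [1.0, 2.0], 0.5

cache = forward(W1, w2, x, y)
d_W1, d_w2 = backward(w2, x, y, cache)

# Finite-difference check on W1[0][0].
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_minus = [row[:] for row in W1]
W1_plus[0][0] += eps
W1_minus[0][0] -= eps
numeric = (forward(W1_plus, w2, x, y)["loss"]
           - forward(W1_minus, w2, x, y)["loss"]) / (2 * eps)
print(d_W1[0][0], numeric)
```

Note that `backward` reuses the values stored in `cache`, which is why the forward pass has to keep its intermediate variables.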


# *SUPPLANT WITH CLASS NOTES.*

### 5.3.4 Training Neural Networks

The forward pass traverses the computational graph in dependency order and computes all the intermediate values that backprop then consumes in reverse. Training alternates forward and backward passes: each prediction error is propagated back through the network, and the magnitude of each weight update depends on the current parameter values (and hence on initialization) and on how much that weight contributed to the final output.
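
A minimal sketch of this alternation, assuming a toy network with two scalar weights, a single training example, plain gradient descent, and a hand-picked learning rate. The function name `train` and all numeric values are illustrative.

```python
def train(w1, w2, x, y, lr=0.1, steps=100):
    """Alternate forward pass, backward pass, and weight update."""
    losses = []
    for _ in range(steps):
        # Forward pass: o = w2 * relu(w1 * x)
        z = w1 * x
        h = max(z, 0.0)
        o = w2 * h
        losses.append(0.5 * (o - y) ** 2)

        # Backward pass: chain rule in reverse order
        d_o = o - y
        d_w2 = d_o * h
        d_h = d_o * w2
        d_w1 = d_h * (1.0 if z > 0 else 0.0) * x

        # Gradient descent update
        w1 -= lr * d_w1
        w2 -= lr * d_w2
    return w1, w2, losses


w1, w2, losses = train(w1=0.5, w2=1.0, x=1.0, y=2.0)
print(losses[0], losses[-1])
```

Each update is proportional to the prediction error and to the other weight on the path to the output, so the step sizes depend on where the weights started.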

## 5.4 Numerical Stability and Initialization

Parameter weight initialization in many cases is intimately linked with choices of activation function. The choice of pairing often drives training stability/emergence of vanishing or exploding gradients, or even gradients that just stall.

### 5.4.1 Vanishing and Exploding Gradients

Vanishing and exploding gradients pose a distinct problem compared to the issue of underflow/overflow encountered when multiplying probabilities or exponentiating. The gradient is a product of many per-layer chain rule matrices, and it is hard to predict whether the eigenvalues of that product will be very large or very small. This affects numerical stability as well as training stability, since optimization may quickly diverge.

#### Vanishing Gradients

Vanishing gradients are the reason practitioners opt for ReLU by default: the sigmoid's gradient vanishes for large positive or negative inputs. ReLU also helped push the field forward with deep learning because stacking many layers became feasible.
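
A rough illustration of why depth is a problem for the sigmoid, assuming a chain of activations with no weights between them (a simplification made only for this sketch). Each sigmoid contributes a factor of at most 0.25 to the chain rule product, while ReLU contributes a factor of exactly 1 for positive inputs.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)


def chained_sigmoid_gradient(x: float, depth: int) -> float:
    """Derivative of sigmoid applied `depth` times in a row, evaluated at x."""
    grad = 1.0
    for _ in range(depth):
        grad *= sigmoid_grad(x)  # local derivative at this layer
        x = sigmoid(x)           # value passed to the next layer
    return grad


def chained_relu_gradient(x: float, depth: int) -> float:
    """Same chain with ReLU in place of the sigmoid."""
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 if x > 0 else 0.0
        x = max(x, 0.0)
    return grad


print(chained_sigmoid_gradient(0.0, 10))
print(chained_relu_gradient(1.0, 10))
```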

#### Exploding Gradients

A product of many matrices can explode given enough layers when the factors have entries (more precisely, eigenvalues or singular values) larger than 1.0.
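
A small sketch of this effect, assuming every layer applies the same hand-picked 2x2 matrix. The matrix is chosen for illustration so that its eigenvalues are 1.6 and 0.8: repeated multiplication blows up along one direction and shrinks along the other.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]


def apply_layers(M, v, depth):
    """Apply the same matrix `depth` times, as in a deep linear chain."""
    for _ in range(depth):
        v = matvec(M, v)
    return v


M = [[1.2, 0.4],
     [0.4, 1.2]]  # eigenvalue 1.6 for [1, 1], eigenvalue 0.8 for [1, -1]

grown = apply_layers(M, [1.0, 1.0], depth=30)    # scales like 1.6 ** 30
shrunk = apply_layers(M, [1.0, -1.0], depth=30)  # scales like 0.8 ** 30
print(grown, shrunk)
```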

#### Breaking the Symmetry

Initializing all the parameters within a layer to the same constant makes every hidden unit in that layer compute the same output and receive the same gradient. This effectively turns our MLP into a single-unit MLP at that layer, and gradient updates alone never break the symmetry. This can be fixed using dropout regularization and random initialization (although neither can guarantee it, due to stochasticity).
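
The symmetry can be seen directly in the gradients. The sketch below assumes a two-unit tanh hidden layer, a squared-error loss, and illustrative values for the input and weights; `hidden_gradients` is a hypothetical helper name.

```python
import math
import random


def hidden_gradients(W1, w2, x, y):
    """Return dL/dW1 for o = w2 . tanh(W1 x) with loss 0.5 * (o - y) ** 2."""
    z = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    h = [math.tanh(zi) for zi in z]
    o = sum(w * hi for w, hi in zip(w2, h))
    d_o = o - y
    d_z = [d_o * w * (1.0 - hi ** 2) for w, hi in zip(w2, h)]
    return [[dz * xi for xi in x] for dz in d_z]


x, y = [1.0, -2.0], 1.0

# Constant initialization: both hidden units are exact copies of each other.
constant_grads = hidden_gradients([[0.3, 0.3], [0.3, 0.3]], [0.3, 0.3], x, y)

# Random initialization: the units start different and receive different gradients.
rng = random.Random(0)
W1_random = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]
w2_random = [rng.uniform(-0.5, 0.5) for _ in range(2)]
random_grads = hidden_gradients(W1_random, w2_random, x, y)

print(constant_grads)
print(random_grads)
```

With the constant initialization, both rows of the gradient are identical, so the two units remain copies of each other after every update.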

### 5.4.2 Parameter Initialization

Default initialization draws weights from a zero-mean normal distribution, usually with a small standard deviation rather than unit variance.

#### Xavier Initialization

Ignoring nonlinearities, the $i$-th output $o_i$ of a fully connected layer with $n_{in}$ inputs $x_j$ and weights $w_{ij}$ is:

$$ o_i = \sum_{j=1}^{n_{in}} w_{ij}x_j$$

If we take the $w_{ij}$ to be drawn independently with mean 0 and variance $\sigma^2$, and assume that the $x_j$ have mean 0 and variance $\gamma^2$ and are independent of the weights and of each other, the output $o_i$ has a mean of 0 and variance of $n_{in} \sigma^2 \gamma^2$.
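
The mean and variance can be worked out in two lines. This derivation assumes the weights and inputs are all mutually independent, so that the cross terms $\mathbb{E}[w_{ij} w_{ik} x_j x_k]$ with $j \neq k$ vanish:

$$
\begin{aligned}
\mathbb{E}[o_i] &= \sum_{j=1}^{n_{in}} \mathbb{E}[w_{ij}]\,\mathbb{E}[x_j] = 0, \\
\mathrm{Var}[o_i] &= \mathbb{E}[o_i^2] - (\mathbb{E}[o_i])^2 = \sum_{j=1}^{n_{in}} \mathbb{E}[w_{ij}^2]\,\mathbb{E}[x_j^2] = n_{in}\sigma^2\gamma^2.
\end{aligned}
$$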

On the forward pass the variance of the outputs scales with $n_{in}$, the dimension of the layer input, whereas on the backward pass the variance of the gradients scales with $n_{out}$, the dimension of the layer output. To keep the variance fixed in both directions we would need $n_{in}\sigma^2 = 1$ **and** $n_{out} \sigma^2 = 1$, which is impossible to satisfy simultaneously unless $n_{in} = n_{out}$.

Xavier initialization (Glorot and Bengio, 2010) instead settles for the following compromise:

$$ \frac{1}{2}(n_{in} + n_{out})\sigma^2 = 1$$

which can be rewritten as a choice of variance for the weights:

$$ \sigma^2 = \frac{2}{n_{in} + n_{out}}$$
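
A minimal sketch of sampling weights with this variance, using only the standard library. The function name `xavier_normal`, the layer sizes, and the seed are assumptions for illustration; deep learning frameworks ship their own initializers.

```python
import math
import random


def xavier_normal(n_in: int, n_out: int, rng: random.Random):
    """Sample an n_out x n_in weight matrix with variance 2 / (n_in + n_out)."""
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)
n_in, n_out = 256, 128
W = xavier_normal(n_in, n_out, rng)

# Compare the empirical variance of the samples with the target.
flat = [w for row in W for w in row]
mean = sum(flat) / len(flat)
empirical_var = sum((w - mean) ** 2 for w in flat) / len(flat)
target_var = 2.0 / (n_in + n_out)
print(empirical_var, target_var)
```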

## 5.5 Generalization in Deep Learning

Deep learning models generalize exceptionally well on a wide range of tasks. While there are a number of hypotheses, how optimization fits the training data and how generalization actually arises remain largely unexplored territory. A brief discussion is offered in the book.

### 5.5.1 Revisiting Overfitting and Regularization

* The "no free lunch" theorem (Wolpert et al 1995) says that any learning algorithm generalizes better on data from certain distributions and worse on data from others

* Careful construction of inductive biases can help boost generalization to unseen examples

* Counterintuitively, several practices that would hurt generalization in classical machine learning can improve it in deep learning:
    * Training for more epochs
    * Increasing model complexity (Double descent)

### 5.5.2 Inspiration from Nonparametrics

Nonparametric methods often grow in complexity along with the data and can fit the training data exactly. Although modern deep networks have parameters that are updated during training, their behavior is often likened to that of nonparametric methods such as kernel methods (Jacot et al 2018).

### 5.5.3 Early Stopping

Deep neural networks can fit labels with ease, whether real or artificially generated, but they tend to fit correctly labeled data first and only later fit randomly labeled data (Rolnick et al 2017). This translates into a guarantee on generalization: whenever a model has fit the cleanly labeled data but not the incorrectly labeled examples, it will generalize to data drawn from that distribution.

EARLY STOPPING IS CRUCIAL WITH LABEL NOISE.
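
One common way to implement early stopping is a patience criterion on the validation loss. The sketch below is an assumption about how this might look, not a method taken from these notes: the function name `early_stopping_epoch`, the `patience` and `min_delta` parameters, and the validation losses are all made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, best_loss), stopping once the validation loss
    has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # stop training here
    return best_epoch, best_loss


# Hypothetical validation losses: improving at first, then rising
# as the model starts to fit noisy labels.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
result = early_stopping_epoch(val_losses, patience=2)
print(result)
```

In practice the parameters saved at `best_epoch` would be restored after stopping.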

### 5.5.4 Classical Regularization for Deep Networks

The rationale for regularization in the deep learning setting often differs from the rationale in the classical setting. Nevertheless, classical regularizers remain effective, albeit somewhat attenuated when used on their own.

## 5.6 Dropout

* In regularization we want to learn smooth functions, so that adding arbitrary noise to our data does not derail the outputs
* 1995 - Christopher Bishop showed that training with input noise on purpose is equivalent to Tikhonov regularization
* 2014 - Dropout: a simple way to prevent neural networks from overfitting 
    * Zero out a random subset of the outputs at each layer; that way we "noise" the intermediate representations during forward prop (How?)
    * As a result, gradients in backprop flow only through the subset of units that were kept on that pass
* How is the noise injected?
    * In the original version in 1995, Bishop stochastically added Gaussian noise to the input
        * In expectation $\mathbb{E}[h'] = h$
    * In the 2014 dropout version, the output of a node is set to 0 with probability $p$ and rescaled to $\frac{h}{1-p}$ with probability $1-p$, so in expectation: $\mathbb{E}[h']= p \cdot 0 + (1-p) \cdot \frac{h}{1-p} = h$
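
The 2014-style dropout rule can be sketched in a few lines of plain Python. The function name `dropout`, the example activations, and the number of trials are illustrative assumptions; averaging over many random masks shows that the expectation matches the original activations.

```python
import random


def dropout(h, p, rng):
    """Zero each activation with probability p; scale survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return [0.0 if rng.random() < p else value / (1.0 - p) for value in h]


rng = random.Random(0)
h = [1.0, 2.0, 3.0, 4.0]
p = 0.5
trials = 20000

# Average the noised activations over many independent dropout masks.
totals = [0.0] * len(h)
for _ in range(trials):
    for i, value in enumerate(dropout(h, p, rng)):
        totals[i] += value
averages = [total / trials for total in totals]
print(averages)
```

Dropout is applied only during training; at test time the activations are used unchanged, which is consistent because the training-time expectation already equals `h`.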

### Class Notes

* Pitfalls/qualifications of "linear models" 
    * An increase in a feature must always increase or always decrease the model's output (monotonic relationship to input features)

    * The effect on the output is proportional to the change in input
    
**How do we model non-linearities?**

* Composing multiple functions (stack multiple transformations)
* However, a composition of linear transformations is itself a linear transformation
* Introduce non-linearity -> (activation function)
* Transformation ~ Layer: composite of linear transformation and nonlinearity
* $\sigma$ the nonlinearity function is *usually* elementwise
* What to use for $\sigma$?
    * Sigmoid(x) = $\frac{1}{1+\exp(-x)}$
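
To see why stacking linear (affine) layers alone does not help, consider two layers with weights $W_1, W_2$ and biases $b_1, b_2$ (symbols introduced here for illustration). Without a nonlinearity in between, they collapse into a single affine map:

$$ W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2) $$

Inserting an elementwise $\sigma$ between the layers, as in $W_2\,\sigma(W_1 x + b_1) + b_2$, prevents this collapse.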
    
#### Universal Approximation Theorem

A sufficiently "wide" MLP with just one hidden layer can approximate any continuous function arbitrarily well.