# Avoid Overfitting Through Regularization
- "With four parameters I can fit an elephant and with five I can make him wiggle his trunk." (John von Neumann)
- preface: DNNs can have **tens of thousands or even millions** of parameters
    - increases flexibility and can fit a huge variety of complex datasets
    - flexibility can make the network prone to overfitting
- Important techniques covered in previous chapters:
    - early stopping
    - Batch normalization (BN)

## l<sub>1</sub> and l<sub>2</sub> Regularization
- l<sub>1</sub> for a sparse model (with many weights equal to 0)
- l<sub>2</sub> to constrain a NN's connection weights
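
To make the two penalties concrete, here is a minimal pure-Python sketch of what each one computes. The helper names `l1_penalty` and `l2_penalty`, the sample weights, and the `data_loss` value are all made up for illustration. The formulas (factor times the sum of absolute values, and factor times the sum of squares) are my understanding of how the Keras `l1` and `l2` regularizers are defined.

```python
def l1_penalty(weights, factor):
    """Sum of absolute weight values, scaled by the regularization factor."""
    return factor * sum(abs(w) for w in weights)


def l2_penalty(weights, factor):
    """Sum of squared weight values, scaled by the regularization factor."""
    return factor * sum(w * w for w in weights)


# Hypothetical connection weights for a single layer (illustration only).
weights = [0.5, -2.0, 0.0, 1.5]
data_loss = 0.30  # made-up value of the unregularized loss

l1 = l1_penalty(weights, 0.01)
l2 = l2_penalty(weights, 0.01)

# The penalty is added to the data loss to get the final training loss.
total_loss_l2 = data_loss + l2

print(f"l1 penalty: {l1:.4f}")
print(f"l2 penalty: {l2:.4f}")
print(f"final loss with l2: {total_loss_l2:.4f}")
```

Note how the weight -2.0 contributes 4.0 to the l<sub>2</sub> sum but only 2.0 to the l<sub>1</sub> sum: l<sub>2</sub> punishes large weights more strongly, while a weight that is exactly 0 costs nothing under either penalty.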

In [1]:
from tensorflow import keras

layer = keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal", 
                          kernel_regularizer=keras.regularizers.l2(0.01))

- The *keras.regularizers.l2()* function returns a regularizer that is called at each step during training to compute the regularization loss
    - this loss is then added to the final loss
- Typically, I want to apply the same regularizer to all layers in my network, and use the same activation function and initialization strategy in all hidden layers
    - Use the *functools.partial()* function to avoid repeating the same arguments
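
A small sketch of how `functools.partial()` behaves, independent of Keras. The function `make_layer` is a hypothetical stand-in for a layer constructor (it just returns its settings as a dict), so the names here are assumptions for illustration only. The point is that pre-filled keyword arguments act as defaults and can still be overridden at call time.

```python
from functools import partial


def make_layer(units, activation="relu", initializer="glorot_uniform"):
    """Hypothetical stand-in for a layer constructor; returns its settings."""
    return {"units": units, "activation": activation, "initializer": initializer}


# Pre-fill the arguments shared by every hidden layer.
HiddenLayer = partial(make_layer, activation="elu", initializer="he_normal")

hidden = HiddenLayer(300)                       # uses the pre-filled defaults
output = HiddenLayer(10, activation="softmax")  # call-time keyword overrides the default

print(hidden)
print(output)
```

This is why an output layer can reuse the same partial and only swap the activation.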

In [4]:

from functools import partial

RegularizedDense = partial(keras.layers.Dense,
                           activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 2
#history = model.fit(X_train_scaled, y_train, epochs=n_epochs, validation_data=(X_valid_scaled, y_valid))

## Dropout

- one of the most popular regularization techniques
- Even state-of-the-art NNs get a 1-2% accuracy boost simply by adding dropout
- At every training step:
    - every neuron has a probability *p* of being "dropped out," meaning that it will be ignored during this training step
        - but it could be active the next step
        - hyperparameter *p* is the dropout rate; typically between 10% and 50%: closer to 20%-30% in RNNs, and closer to 40%-50% in CNNs
        - <img src="images/Dropout.jpeg" width=360/>
- neurons cannot co-adapt w/ neighboring neurons, they have to be as useful as possible on their own
- neurons cannot rely exclusively on just a few input neurons; they must pay attention to all of their input neurons
- result: less sensitivity to slight changes in input, meaning more robust network!
- **alternative way to interpret Dropout**
    - it's like a unique NN is generated at each training step
    - a total of 2<sup>N</sup> possible networks (N = number of droppable neurons) because each neuron can be either present or absent
    - The whole network can be seen as an averaging ensemble of all these smaller NNs
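- Because neurons are dropped only during training, at test time each neuron would receive a total input signal larger than what the network was trained on (about 1/(1-p) times as large). To compensate, either multiply each input connection weight by the *keep probability (1-p)* after training, or divide each neuron's output by the keep probability during training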
- During training, the dropout layer randomly drops some inputs (setting them to 0) and divides the remaining inputs by the keep probability
- After training, the dropout layer does nothing at all. Just passes the inputs to the next layer
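
The training and inference behavior described above can be sketched in pure Python. This is a toy version working on a plain list of floats, not the Keras implementation; the function name `dropout` and its `rng` parameter are assumptions made for this example.

```python
import random


def dropout(inputs, rate, training, rng):
    """Inverted dropout on a list of floats.

    Training: zero each input with probability `rate` and divide the
    survivors by the keep probability. Inference: return inputs unchanged.
    """
    if not training or rate == 0.0:
        return list(inputs)
    keep_prob = 1.0 - rate
    return [x / keep_prob if rng.random() >= rate else 0.0 for x in inputs]


rng = random.Random(42)  # fixed seed so the run is repeatable
inputs = [1.0] * 10_000

train_out = dropout(inputs, rate=0.2, training=True, rng=rng)
test_out = dropout(inputs, rate=0.2, training=False, rng=rng)

dropped_fraction = train_out.count(0.0) / len(train_out)
train_mean = sum(train_out) / len(train_out)

print(f"fraction dropped during training: {dropped_fraction:.3f}")
print(f"mean activation during training:  {train_mean:.3f}")
print(f"mean activation at inference:     {sum(test_out) / len(test_out):.3f}")
```

Dividing the surviving inputs by the keep probability keeps the average signal during training close to what the next layer sees at inference, so no weight rescaling is needed after training.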

In [6]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 2
#history = model.fit(X_train_scaled, y_train, epochs=n_epochs, validation_data=(X_valid_scaled, y_valid))

- **Warning:** dropout is only active during training, so comparing the training loss with the validation loss can be misleading. Evaluate the training loss w/out dropout (e.g., after training)
- If overfitting, increase dropout rate
- If underfitting, decrease dropout rate
- Increase dropout rate for large layers, and decrease for small ones
- Dropout tends to slow convergence significantly, but it usually gives a better model when tuned properly

## Monte Carlo (MC) Dropout
- A 2016 paper by Yarin Gal and Zoubin Ghahramani added more good reasons to use dropout
    1. it established a relationship between dropout networks and approximate Bayesian inference
    2. it introduced MC dropout, which can boost the performance of any trained dropout model without having to retrain it or even modify it at all
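
A rough sketch of the MC dropout mechanics: keep dropout switched on at prediction time, run many stochastic forward passes, and average them. Everything here is a toy assumption (a linear model with made-up "trained" weights, and the hypothetical helpers `predict` and `mc_dropout_predict`). With Keras, my understanding is that the same idea is usually done by calling the model with `training=True` many times and averaging the outputs.

```python
import random
import statistics


def dropout(inputs, rate, rng):
    """Inverted dropout, always active (as it is during training)."""
    keep_prob = 1.0 - rate
    return [x / keep_prob if rng.random() >= rate else 0.0 for x in inputs]


def predict(x, weights, bias, rate, rng, use_dropout):
    """One forward pass of a toy linear model with dropout on its inputs."""
    if use_dropout:
        x = dropout(x, rate, rng)
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def mc_dropout_predict(x, weights, bias, rate, rng, n_samples=100):
    """Average many stochastic passes; the spread hints at the uncertainty."""
    samples = [predict(x, weights, bias, rate, rng, use_dropout=True)
               for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)


# Made-up "already trained" parameters and one input instance.
weights, bias = [0.4, -0.3, 0.8, 0.1], 0.05
x = [1.0, 2.0, 0.5, 3.0]

rng = random.Random(0)
plain = predict(x, weights, bias, rate=0.2, rng=rng, use_dropout=False)
mc_mean, mc_std = mc_dropout_predict(x, weights, bias, rate=0.2, rng=rng, n_samples=2000)

print(f"regular prediction: {plain:.3f}")
print(f"MC dropout mean:    {mc_mean:.3f} (std {mc_std:.3f})")
```

Because this toy model is linear, the MC average lands close to the regular prediction, so it only shows the mechanics and the uncertainty estimate (the standard deviation). Any accuracy gain would only show up in a real nonlinear network.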

## Max-Norm Regularization