# Network Regularization & Normalization


## Overview - Regularization
- Bias vs variance trade-off
- Using train, validation, and test splits.
- Prevent overfitting by adding regularization methods (L1, L2, dropout)
- Optimizing and reducing training time by normalizing inputs
    - Normalizing inputs can drastically decrease computation time and help prevent vanishing/exploding gradients.
    
### Hyperparameters to Tune
- Number of hidden units
- Number of layers
- Learning rate ($\alpha$)
- Activation function

### Training, Validation, and Test Sets
- The fact that there are so many hyperparameters to tune calls for a formalized and unbiased approach to testing/training sets.
- We will use 3 sets when running, selecting, and validating a model:
    - Training set: for training the algorithm
    - Validation set: to decide which model will be the final one after parameter tuning
    - Testing set: after choosing the final model, use the test set for an unbiased estimate of performance.
- Set sizes:
    - With big data, your validation (a.k.a. dev or hold-out) and test sets don't necessarily need to be 20-30% of all the data.
    - You can choose validation and test sets that are each 1-5% of the data.
        - e.g. 96% train, 2% hold-out, 2% test set.
    - It is **VERY IMPORTANT** to make sure the hold-out and test samples come from the same distribution: e.g. the same resolution of santa pictures.
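
A minimal sketch of the 96% / 2% / 2% split described above, using NumPy. The function name `train_val_test_split`, the toy arrays, and the fixed seed are assumptions made for illustration. Shuffling once before slicing is one simple way to keep all three sets on the same distribution.

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.02, test_frac=0.02, seed=0):
    """Shuffle once, then slice into train / validation (hold-out) / test sets."""
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)                 # one shuffle shared by all three sets
    n_val = int(round(m * val_frac))
    n_test = int(round(m * test_frac))
    val_idx = idx[:n_val]
    test_idx = idx[n_val:n_val + n_test]
    train_idx = idx[n_val + n_test:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

# Toy data (made up): 10,000 examples with 5 features each
X = np.arange(50_000, dtype=float).reshape(10_000, 5)
y = np.arange(10_000) % 2

train, val, test = train_val_test_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))
```

With 10,000 examples this leaves 9,600 for training and 200 each for validation and testing.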
    
### Bias vs Variance 
- A model with high bias may result in underfitting.
    - <img src="https://raw.githubusercontent.com/jirvingphd/dsc-04-42-02-tuning-neural-networks-with-regularization-online-ds-ft-021119/master/figures/underfitting.png" width=200>
- A model with high variance may result in overfitting. 
    - <img src="https://raw.githubusercontent.com/jirvingphd/dsc-04-42-02-tuning-neural-networks-with-regularization-online-ds-ft-021119/master/figures/overfitting.png" width=200>

- In deep learning, there is less of a bias-variance trade-off than with simpler models.

**Rules of thumb re: bias/variance trade-off:**

| High bias? (poor training performance) | High variance? (poor validation performance) |
|---------------|-------------|
| Use a bigger network | More data |
| Train longer | Regularization |
| Look for other existing NN architectures | Look for other existing NN architectures |



### L1 & L2 Regularization
- These methods regularize by penalizing coefficients (regression) or weights (neural networks).
    - L1 & L2 exist in regression models as well. There, L1 = 'Lasso regression' and L2 = 'Ridge regression'.

- **L1 & L2 regularization add a term to the cost function.**

$$\text{Cost function} = \text{Loss (say, binary cross entropy)} + \text{Regularization term}$$

$$ J (w^{[1]},b^{[1]},...,w^{[L]},b^{[L]}) = \dfrac{1}{m} \sum^m_{i=1}\mathcal{L}(\hat y^{(i)}, y^{(i)})+ \dfrac{\lambda}{2m}\sum^L_{l=1}||w^{[l]}||^2$$

- where $\lambda$ is the regularization parameter.
- The difference between L1 and L2 is that L1 penalizes the sum of the _absolute values_ of the weights, whereas L2 penalizes the sum of the _squares_ of the weights.

- **L1 Regularization:**
    $$ \text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum ||w||_1$$
    - Uses the absolute value of weights and may reduce the weights down to 0.

- **L2 Regularization:**
    $$ \text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum ||w||^2$$
    - Also known as weight decay, as it forces weights to decay towards zero, but never exactly 0.
    
-  The regularization term $||w^{[l]}||^2_F$ is the squared Frobenius norm:
    - $||w^{[l]}||^2_F = \sum^{n^{[l-1]}}_{i=1} \sum^{n^{[l]}}_{j=1} (w_{ij}^{[l]})^2$

    
- **CHOOSING L1 OR L2:**
    - L1 is very useful when trying to compress a model, since weights can decrease to exactly 0.
    - L2 is generally preferred otherwise.
    
- **USING L1/L2 IN KERAS:**
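    - Add a `kernel_regularizer` to a layer.
```python
from keras import regularizers
from keras.layers import Dense
# `model` is assumed to be an existing Sequential model
model.add(Dense(64, input_dim=64, kernel_regularizer=regularizers.l2(0.01)))
```
    - here 0.01 = $\lambda$ (Keras multiplies it directly by the sum of squared weights, without the $\frac{1}{2m}$ factor)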
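
To make the two penalty terms above concrete, here is a small NumPy sketch that computes them for a list of weight matrices, using the $\frac{\lambda}{2m}$ scaling from the formulas in these notes. The helper names `l1_penalty` and `l2_penalty` and the toy weights are assumptions for illustration only.

```python
import numpy as np

def l1_penalty(weights, lam, m):
    """lambda / (2m) times the sum of absolute weights over all layers."""
    return lam / (2 * m) * sum(np.sum(np.abs(W)) for W in weights)

def l2_penalty(weights, lam, m):
    """lambda / (2m) times the sum of squared weights (squared Frobenius norms)."""
    return lam / (2 * m) * sum(np.sum(np.square(W)) for W in weights)

# Two small, made-up weight matrices
weights = [np.array([[1.0, -2.0], [0.0, 3.0]]), np.array([[0.5, -0.5]])]
lam, m = 0.1, 10

print(l1_penalty(weights, lam, m))   # penalty added to the loss under L1
print(l2_penalty(weights, lam, m))   # penalty added to the loss under L2
```

For these weights the L1 term is about 0.035 and the L2 term about 0.0725. The large weight (3.0) dominates the L2 term because it is squared.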

### Dropout Regularization
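- Randomly leaves out (drops) each node with a specified probability. The set of dropped nodes is re-sampled on every training step (each forward pass over a mini-batch), not once per epoch, and no nodes are dropped at prediction time.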


- **USING DROPOUT IN KERAS:**
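    - The `Dropout` layer can be imported from `keras.layers` (older Keras versions also expose it under `keras.layers.core`).
    - Specify the probability of a unit being excluded/dropped out.
```python
from keras.models import Sequential
from keras.layers import Dense, Dropout
model = Sequential()
# hidden1_num_units and input_num_units are assumed to be defined earlier
model.add(Dense(hidden1_num_units, input_dim=input_num_units, activation='relu'))
model.add(Dropout(0.25))  # drop 25% of the previous layer's outputs during training
```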
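
For intuition about what a dropout layer does to activations, here is a minimal NumPy sketch. It assumes the commonly used "inverted dropout" variant, where kept activations are rescaled by `1 / keep_prob` during training so nothing needs to change at prediction time. The function name `dropout_forward` is hypothetical.

```python
import numpy as np

def dropout_forward(a, drop_prob, rng, training=True):
    """Apply inverted dropout to the activations `a`."""
    if not training or drop_prob == 0.0:
        return a                                  # nothing is dropped at prediction time
    keep_prob = 1.0 - drop_prob
    mask = rng.random(a.shape) < keep_prob        # True where a unit is kept
    return a * mask / keep_prob                   # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
a = np.ones((4, 1000))                            # made-up activations
out = dropout_forward(a, drop_prob=0.25, rng=rng)

print((out == 0).mean())   # fraction of dropped units, roughly 0.25
print(out.mean())          # stays close to 1.0 because of the rescaling
```

A fresh mask is drawn on every call, which matches the idea of dropping a different random subset of nodes on each training step.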

### Data Augmentation (not covered in class)
- The simplest way to reduce overfitting is to increase the size of the training data.
- Collecting more data is often difficult, but for images it can be done by transforming the existing data, as shown below:
- **For augmenting image data:**
    - Can alter the images already present in the training data by shifting, shearing, scaling, rotating.<br><br> <img src ="https://www.dropbox.com/s/9i1hl3quwo294jr/data_augmentation_example.png?raw=1" width=300>
    - This usually provides a big leap in accuracy and is widely treated as a standard trick for improving predictions.

- **In Keras:**
    - `ImageDataGenerator` provides several built-in augmentations.
    - Example below:
    
```python
from keras.preprocessing.image import ImageDataGenerator
# `train` is assumed to be the array of training images
datagen = ImageDataGenerator(horizontal_flip=True)
datagen.fit(train)
```
### Early Stopping (not covered in class)
- Monitor validation performance during training and stop when it degrades or plateaus according to a given criterion.

- **In Keras:**
    - Can be applied using the [callbacks API](https://keras.io/callbacks/)
```python
from keras.callbacks import EarlyStopping
# Pass this callback to model.fit(..., callbacks=[early_stop]) along with validation data
early_stop = EarlyStopping(monitor='val_loss', patience=5)
```
    - `monitor` denotes the quantity to check
    - `'val_loss'` denotes the loss on the validation set
    - `patience` denotes the number of epochs without improvement before stopping.
        - Be careful, as sometimes models _will_ continue to improve after a stagnant period
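
The patience rule above can be sketched in plain Python. This is an illustration of the idea, not Keras's actual implementation; the function name `early_stopping_epoch` and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the epoch index at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss                          # new best validation loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses: progress stalls after epoch 3, then resumes at epoch 9
losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.57, 0.56, 0.58, 0.59, 0.50]

print(early_stopping_epoch(losses, patience=5))    # stops during the stagnant stretch
print(early_stopping_epoch(losses, patience=10))   # waits long enough to see the late improvement
```

With `patience=5` the run stops at epoch 8 and never sees the better loss at epoch 9, which is exactly the risk noted in the last bullet.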

### Reference Links I found:
- https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/
- http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/



## Overview - Network Optimization
### Normalization
- Normalizing inputs to a consistent scale (typically 0 to 1) improves performance and also helps the training process converge to a stable solution.

- Methods:
    - Z-score (subtract the mean, divide by the standard deviation)
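
A minimal NumPy sketch of z-score normalization. The helper names `fit_zscore` and `apply_zscore` and the toy data are assumptions. The sketch also assumes the common practice of computing the mean and standard deviation on the training set only and reusing them for the test set.

```python
import numpy as np

def fit_zscore(X_train, eps=1e-8):
    """Compute per-feature mean and standard deviation on the training set."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + eps        # eps guards against constant features
    return mu, sigma

def apply_zscore(X, mu, sigma):
    """Subtract the mean and divide by the standard deviation, feature by feature."""
    return (X - mu) / sigma

# Made-up data with two features on very different scales
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[0.0, 100.0], scale=[1.0, 50.0], size=(500, 2))
X_test = rng.normal(loc=[0.0, 100.0], scale=[1.0, 50.0], size=(100, 2))

mu, sigma = fit_zscore(X_train)
X_train_n = apply_zscore(X_train, mu, sigma)
X_test_n = apply_zscore(X_test, mu, sigma)   # reuse the training statistics

print(X_train_n.mean(axis=0), X_train_n.std(axis=0))
```

After normalization both training features have mean close to 0 and standard deviation close to 1.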
    
#### Reference Links
- https://www.coursera.org/lecture/deep-neural-network/normalizing-inputs-lXv6U

### Changing Initial Parameters
- The more input features into layer $l$, the smaller we want each weight $w_i$ to be.
- Rule of thumb:
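    - $Var(w_i) = 1/n$ or $2/n$, where $n$ is the number of inputs to the layer
- A common initialization strategy for the ReLU activation function is:

    * $w^{[l]}$ `= np.random.randn(shape) * np.sqrt(2 / n_prev)`, where `n_prev` is $n^{[l-1]}$, the number of inputs to layer $l$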
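
A runnable version of the initialization rule above, as a sketch. The function name `he_init`, the layer sizes, and the `(n_curr, n_prev)` weight shape are assumptions for illustration.

```python
import numpy as np

def he_init(n_prev, n_curr, rng):
    """Weights for a layer with n_prev inputs and n_curr units, with Var(w) = 2 / n_prev."""
    return rng.standard_normal((n_curr, n_prev)) * np.sqrt(2.0 / n_prev)

rng = np.random.default_rng(0)
W = he_init(n_prev=512, n_curr=256, rng=rng)

print(W.shape)
print(W.var(), 2.0 / 512)   # empirical variance vs. the 2/n target
```

The empirical variance of the sampled weights should land very close to $2/n$, and it shrinks as the number of inputs grows.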
    
## Optimization:
Alternatives to plain gradient descent that oscillate less:
### Gradient Descent with Momentum:
- Computes an exponentially weighted average of the gradients and uses it for the update.
    - This dampens oscillations and improves performance.
- How to:
    -  Calculate the current batch's moving averages for the derivatives of $W$ and $b$
        - Compute $V_{dw} = \beta V_{dw} + (1-\beta)dW$
        - $V_{db} = \beta V_{db} + (1-\beta)db$
        - So updated terms become
            - $W:= W- \alpha V_{dw}$
            - $b:= b- \alpha V_{db}$
    -  Generally, $\beta=0.9$ is a good hyperparameter value.
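
The update rule above can be sketched in NumPy as follows. Only $W$ is shown, since $b$ is updated the same way. The toy quadratic objective, the function name `momentum_step`, and the value of `alpha` are assumptions for illustration.

```python
import numpy as np

def momentum_step(W, dW, V_dw, alpha=0.1, beta=0.9):
    """One update: V_dw = beta * V_dw + (1 - beta) * dW, then W := W - alpha * V_dw."""
    V_dw = beta * V_dw + (1 - beta) * dW
    W = W - alpha * V_dw
    return W, V_dw

# Toy objective (an assumption): J(W) = 0.5 * sum(c * W**2), much steeper along the second axis
c = np.array([1.0, 10.0])
W = np.array([5.0, 5.0])
V_dw = np.zeros_like(W)

for _ in range(200):
    dW = c * W                       # gradient of the toy objective
    W, V_dw = momentum_step(W, dW, V_dw)

print(W)                             # both coordinates end up close to the minimum at 0
```

Because each step uses the running average rather than the raw gradient, a gradient that flips sign between iterations largely cancels out in `V_dw`.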
    
### RMSprop
- "Root mean square" prop
- Slows down learning in one direction and speeds it up in another.
    - In the direction where we want to learn fast, the corresponding $S$ will be small, so we divide by a small number and the update is larger.
    - In the direction where we want to learn slowly, the corresponding $S$ will be relatively large, so the update is smaller.
- How to:
    - On each iteration, use an exponentially weighted average again, this time of the squares of the derivatives (squared element-wise):
        - $S_{dw} = \beta S_{dw} + (1-\beta)dW^2$
        - $S_{db} = \beta S_{db} + (1-\beta)db^2$
        - So that:
            - $W:= W- \alpha \dfrac{dW}{\sqrt{S_{dw}}}$
            - $b:= b- \alpha \dfrac{db}{\sqrt{S_{db}}}$
    - Often, a small $\epsilon$ is added to the denominator to make sure that you don't end up dividing by 0.
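
A sketch of a single RMSprop update in NumPy, again for $W$ only. The function name `rmsprop_step`, the hyperparameter values, and the placement of $\epsilon$ outside the square root are assumptions.

```python
import numpy as np

def rmsprop_step(W, dW, S_dw, alpha=0.01, beta=0.9, eps=1e-8):
    """One update: average the squared gradients, then divide the step by their root."""
    S_dw = beta * S_dw + (1 - beta) * dW**2          # element-wise square
    W = W - alpha * dW / (np.sqrt(S_dw) + eps)       # eps avoids division by zero
    return W, S_dw

# Made-up gradients that differ by three orders of magnitude
W = np.array([1.0, 1.0])
dW = np.array([0.01, 10.0])
S_dw = np.zeros_like(W)

W_new, S_dw = rmsprop_step(W, dW, S_dw)
print(W - W_new)    # the step taken along each coordinate
```

Even though the raw gradients differ by a factor of 1000, the two steps come out nearly equal. Dividing by $\sqrt{S_{dw}}$ shrinks the update in the steep direction and enlarges it in the flat one.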


### Adam Optimization Algorithm
- Adaptive Moment Estimation - essentially combines both methods above.
- Works very well in most situations.
- How to: 
    - Initialize: $V_{dw}=0, S_{dw}=0, V_{db}=0, S_{db}=0$.
    - For each iteration: compute $dW, db$ using the current mini-batch.
        -  $V_{dw} = \beta_1 V_{dw} + (1-\beta_1)dW$, $V_{db} = \beta_1 V_{db} + (1-\beta_1)db$ 
        -  $S_{dw} = \beta_2 S_{dw} + (1-\beta_2)dW^2$, $S_{db} = \beta_2 S_{db} + (1-\beta_2)db^2$ 
        
- As with momentum and RMSprop, the moving averages start at 0 and are biased towards 0 during the first iterations, so we apply a bias correction ($t$ is the iteration number). This is sometimes also done in RMSprop, but it is standard in Adam.
    - $V^{corr}_{dw}= \dfrac{V_{dw}}{1-\beta_1^t}$, $V^{corr}_{db}= \dfrac{V_{db}}{1-\beta_1^t}$

    - $S^{corr}_{dw}= \dfrac{S_{dw}}{1-\beta_2^t}$, $S^{corr}_{db}= \dfrac{S_{db}}{1-\beta_2^t}$

    - $W:= W- \alpha \dfrac{V^{corr}_{dw}}{\sqrt{S^{corr}_{dw}}+\epsilon}$ and

    - $b:= b- \alpha \dfrac{V^{corr}_{db}}{\sqrt{S^{corr}_{db}}+\epsilon}$
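
Putting the Adam steps together, a minimal NumPy sketch for $W$ only. The function name `adam_step`, the toy objective, and the default hyperparameter values are assumptions, and $\epsilon$ is added outside the square root here.

```python
import numpy as np

def adam_step(W, dW, V_dw, S_dw, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for W; t is the 1-based iteration number."""
    V_dw = beta1 * V_dw + (1 - beta1) * dW           # momentum-style average
    S_dw = beta2 * S_dw + (1 - beta2) * dW**2        # RMSprop-style average
    V_corr = V_dw / (1 - beta1**t)                   # bias correction
    S_corr = S_dw / (1 - beta2**t)
    W = W - alpha * V_corr / (np.sqrt(S_corr) + eps)
    return W, V_dw, S_dw

# Toy objective (an assumption): J(W) = sum(W**2), so dW = 2 * W
W = np.array([5.0, -3.0])
V_dw = np.zeros_like(W)
S_dw = np.zeros_like(W)

for t in range(1, 2001):
    dW = 2 * W
    W, V_dw, S_dw = adam_step(W, dW, V_dw, S_dw, t, alpha=0.05)

print(W)    # both coordinates end up close to the minimum at 0
```

On the very first iteration the bias correction makes the step size almost exactly `alpha` regardless of the gradient's magnitude, which shows why the correction matters early in training.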


### Learning Rate Decay
- Learning rate decreases across epochs.
    - $\alpha = \dfrac{1}{1+\text{decay_rate} \times \text{epoch_nb}} \alpha_0$

- Other methods:
    - $\alpha = 0.97 ^{\text{epoch_nb}} \alpha_0$ (exponential decay)<br>OR:
    - $\alpha = \dfrac{k}{\sqrt{\text{epoch_nb}}} \alpha_0$<br> OR:
    - Manual decay.
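
The three decay formulas above, written as small Python functions for comparison. The function names and the example values for `alpha0`, `decay_rate`, and `k` are assumptions.

```python
import math

def inverse_time_decay(alpha0, epoch_nb, decay_rate=1.0):
    """alpha = alpha0 / (1 + decay_rate * epoch_nb)"""
    return alpha0 / (1 + decay_rate * epoch_nb)

def exponential_decay(alpha0, epoch_nb, base=0.97):
    """alpha = base ** epoch_nb * alpha0"""
    return alpha0 * base ** epoch_nb

def sqrt_decay(alpha0, epoch_nb, k=1.0):
    """alpha = k / sqrt(epoch_nb) * alpha0, defined for epoch_nb >= 1"""
    return alpha0 * k / math.sqrt(epoch_nb)

alpha0 = 0.2
for epoch_nb in (1, 2, 3, 4):
    print(epoch_nb,
          round(inverse_time_decay(alpha0, epoch_nb), 4),
          round(exponential_decay(alpha0, epoch_nb), 4),
          round(sqrt_decay(alpha0, epoch_nb), 4))
```

With these settings the inverse-time schedule halves the learning rate after the first epoch, while the exponential schedule with base 0.97 shrinks it by only 3% per epoch.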
    
    
    
### HYPERPARAMETER TUNING:
Most important:
- $\alpha$

Important next:
- $\beta$ (momentum)
- Number of hidden units
- Mini-batch size

Finally:
- Number of layers
- Learning rate decay

Almost never tuned:
- $\beta_1$, $\beta_2$, $\epsilon$ (Adam)

- Tip: Don't search over a grid, because it's hard to say in advance which hyperparameters will be important; sample combinations at random instead.
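
One way to search without a grid is to draw each hyperparameter at random for every trial. The sketch below is only an illustration: the function name `sample_hyperparameters`, the ranges, and the choice to sample $\alpha$ and $\beta$ on a log scale are all assumptions, not something these notes prescribe.

```python
import random

def sample_hyperparameters(rng):
    """Draw one random configuration (all ranges are illustrative assumptions)."""
    return {
        "alpha": 10 ** rng.uniform(-4, -1),        # learning rate, log-uniform in [1e-4, 1e-1]
        "beta": 1 - 10 ** rng.uniform(-3, -1),     # momentum, between 0.9 and 0.999
        "hidden_units": rng.randint(50, 200),      # inclusive on both ends
        "batch_size": rng.choice([32, 64, 128, 256]),
    }

rng = random.Random(0)
trials = [sample_hyperparameters(rng) for _ in range(5)]
for config in trials:
    print(config)
```

Each trial tries a different value of every hyperparameter, so five trials explore five distinct learning rates, whereas a grid of the same size would repeat the same few values.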


### OPTIMIZATION REFS:
- https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/
- https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/
- https://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/
- https://stackoverflow.com/questions/37232782/nan-loss-when-training-regression-network
- https://www.springboard.com/blog/free-public-data-sets-data-science-project/