In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

%load_ext tensorboard

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)


## The Vanishing/Exploding Gradient Problems

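Gradients often get smaller and smaller as backpropagation progresses down to the lower layers, so the gradient descent update leaves the lower layers' connection weights virtually unchanged. This is the *vanishing gradients* problem.

The opposite is the *exploding gradients* problem, where the gradients grow larger and larger until the layers get huge weight updates and training diverges.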

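More generally, deep neural networks suffer from unstable gradients: different layers may learn at very different speeds.

Part of the cause turned out to be the combination of the logistic sigmoid activation function and the weight initialization technique that was popular in the early 2000s (a normal distribution with a mean of 0 and a standard deviation of 1).

With that combination, the variance of each layer's outputs is much greater than the variance of its inputs. The variance keeps growing from layer to layer until the sigmoid saturates in the top layers, where its derivative is close to 0.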
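A rough numerical sketch of this effect, assuming a single dense layer with weights drawn from a normal distribution with mean 0 and standard deviation 1. The layer size, sample count, and variable names are arbitrary choices for illustration, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(42)

n_inputs, n_neurons = 100, 100               # arbitrary layer size (assumption)
X = rng.standard_normal((1000, n_inputs))    # standardized inputs, variance ~1

# Weights drawn from N(0, 1)
W = rng.standard_normal((n_inputs, n_neurons))
Z = X @ W                                    # pre-activations of the layer

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

A = sigmoid(Z)
# Derivative of the sigmoid is a * (1 - a): at most 0.25, near 0 when saturated
grad = A * (1 - A)

print("input variance:         ", X.var())
print("pre-activation variance:", Z.var())   # roughly n_inputs times larger
print("mean sigmoid derivative:", grad.mean())
```

Because the pre-activations are spread so widely, most sigmoid units sit in their flat regions, which is why so little gradient is left to propagate back.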

### Glorot and He Initialization

Glorot and Bengio argued that the signal needs to flow properly in both directions: the variance of each layer's outputs should match the variance of its inputs, and the gradients should have equal variance before and after flowing through the layer in the reverse direction.

It is not possible to guarantee both unless the layer has an equal number of inputs and neurons (these numbers are called the *fan-in* and *fan-out* of the layer).

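A good compromise is *Xavier* or *Glorot initialization*: draw the weights from a normal distribution with mean 0 and variance σ<sup>2</sup> = 1 / fan<sub>avg</sub>, or from a uniform distribution between -r and +r with r = √(3 / fan<sub>avg</sub>), where fan<sub>avg</sub> = (fan<sub>in</sub> + fan<sub>out</sub>) / 2.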
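A minimal NumPy sketch of Glorot initialization, assuming the usual definitions: weights with mean 0 and variance 1 / fan<sub>avg</sub>, drawn either from a normal distribution or from a uniform distribution on [-r, r] with r = √(3 / fan<sub>avg</sub>). The helper names `glorot_normal` and `glorot_uniform` below are hypothetical stand-ins written for this note. They are not the Keras implementations, which may differ in details such as truncating the normal distribution.

```python
import numpy as np

rng = np.random.default_rng(42)

def glorot_normal(fan_in, fan_out, rng):
    """Weights from a normal distribution with variance 1 / fan_avg."""
    fan_avg = (fan_in + fan_out) / 2
    return rng.normal(0.0, np.sqrt(1 / fan_avg), size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out, rng):
    """Weights from U(-r, r) with r = sqrt(3 / fan_avg), same variance as above."""
    fan_avg = (fan_in + fan_out) / 2
    r = np.sqrt(3 / fan_avg)
    return rng.uniform(-r, r, size=(fan_in, fan_out))

# With fan_in == fan_out, a linear layer should roughly preserve the variance
fan_in = fan_out = 200
X = rng.standard_normal((1000, fan_in))
for init in (glorot_normal, glorot_uniform):
    W = init(fan_in, fan_out, rng)
    print(init.__name__, "weight var:", W.var(), "output var:", (X @ W).var())
```

Both variants give the weights the same variance, so the output variance stays close to the input variance instead of growing with the number of inputs.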

By default, Keras uses Glorot initialization with uniform distribution.

He initialization is the strategy recommended for ReLU and its variants. It uses a variance of σ<sup>2</sup> = 2 / fan<sub>in</sub>.
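In Keras the initializer is chosen per layer through the `kernel_initializer` argument. A short sketch, assuming TensorFlow 2.x is installed. The layer sizes and activations are arbitrary, and the variable names are made up for this note.

```python
from tensorflow import keras

# Select He initialization (normal distribution) by name
layer = keras.layers.Dense(10, activation="relu",
                           kernel_initializer="he_normal")

# He-style scaling with a uniform distribution, based on fan_avg
# instead of fan_in, built with the VarianceScaling initializer
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode="fan_avg",
                                                 distribution="uniform")
layer_avg = keras.layers.Dense(10, activation="sigmoid",
                               kernel_initializer=he_avg_init)
```

Passing a string is the shortest form. An initializer object is only needed when the scale, mode, or distribution has to differ from the named presets.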

Below is a list of initializers:

In [2]:
[name for name in dir(keras.initializers) if not name.startswith("_")]

['Constant',
 'GlorotNormal',
 'GlorotUniform',
 'HeNormal',
 'HeUniform',
 'Identity',
 'Initializer',
 'LecunNormal',
 'LecunUniform',
 'Ones',
 'Orthogonal',
 'RandomNormal',
 'RandomUniform',
 'TruncatedNormal',
 'VarianceScaling',
 'Zeros',
 'constant',
 'deserialize',
 'get',
 'glorot_normal',
 'glorot_uniform',
 'he_normal',
 'he_uniform',
 'identity',
 'lecun_normal',
 'lecun_uniform',
 'ones',
 'orthogonal',
 'random_normal',
 'random_uniform',
 'serialize',
 'truncated_normal',
 'variance_scaling',
 'zeros']

### Nonsaturating Activation Functions

ReLU is a great choice, but it suffers from *dying ReLUs*: during training some neurons effectively die and only output 0. Huge portions of the network can die this way, especially with a high learning rate.

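To deal with that issue, *leaky ReLU* was introduced: LeakyReLU<sub>α</sub>(z) = max(αz, z). For z < 0 the output is a small negative value αz instead of 0, so the gradient never becomes exactly zero and a neuron that stopped firing has a chance of coming back to life.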
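A small NumPy sketch of leaky ReLU, assuming the common definition max(αz, z) with 0 < α < 1. The function name and the sample values are illustrative only.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: z for z >= 0, alpha * z for z < 0 (assumes 0 < alpha < 1)."""
    return np.maximum(alpha * z, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(z))              # negative inputs keep a small slope
print(leaky_relu(z, alpha=0.2))   # a larger leak
```

The slope for negative inputs is α rather than 0, so some gradient always flows back through the unit.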

Larger leaks seem to perform better than small ones. In parametric leaky ReLU (PReLU), α is not a fixed hyperparameter but a parameter that is learned by backprop. It's really good for large image datasets but tends to overfit on smaller ones.

Then there is the *exponential linear unit* (ELU), which performs better than ReLU. Its main con is that it's slower to compute than ReLU and its variants. Faster convergence during training compensates for that, but at test time an ELU network will be slower than a ReLU network.
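For reference, a NumPy sketch of ELU, assuming the standard definition: z for z ≥ 0 and α(exp(z) - 1) for z < 0. The function name and the default α = 1 are choices made for this illustration.

```python
import numpy as np

def elu(z, alpha=1.0):
    """ELU: z for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    # Clip at 0 before exponentiating so large positive inputs cannot overflow
    negative_part = alpha * np.expm1(np.minimum(z, 0.0))
    return np.where(z < 0, negative_part, z)

z = np.array([-5.0, -1.0, 0.0, 2.0])
print(elu(z))   # approaches -alpha for very negative inputs
```

The exponential in the negative branch is the extra cost compared with ReLU, which only needs a comparison.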

*Scaled ELU* (SELU) makes the network self-normalize if it is a plain stack of dense layers that all use SELU: the output of each layer tends to preserve a mean of 0 and a standard deviation of 1 during training, which solves the vanishing/exploding gradients problem. It often outperforms other activation functions for such networks, but self-normalization comes with conditions:

- Input features must be standardized (mean 0, standard deviation 1).
- Every hidden layer's weights must be initialized with LeCun normal initialization: `kernel_initializer="lecun_normal"`.
- The network architecture must be sequential. In recurrent networks or non-sequential networks such as Wide & Deep, self-normalization is not guaranteed, so SELU will not necessarily outperform other activation functions.
- Self-normalization is only guaranteed when all layers are dense, although SELU can improve performance in convolutional neural nets as well.
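A rough NumPy sketch of self-normalization under these conditions: standardized inputs, LeCun normal weights (mean 0, standard deviation √(1 / fan<sub>in</sub>)), and a purely sequential stack of dense SELU layers. The two constants are the standard SELU values. The depth, layer width, and batch size are arbitrary assumptions for the illustration.

```python
import numpy as np

# Standard SELU constants
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(z):
    """SELU: SCALE * z for z >= 0, SCALE * ALPHA * (exp(z) - 1) for z < 0."""
    return SCALE * np.where(z < 0, ALPHA * np.expm1(np.minimum(z, 0.0)), z)

rng = np.random.default_rng(42)
n_units = 100                                  # arbitrary layer width
Z = rng.standard_normal((500, n_units))        # standardized inputs

for layer in range(1, 101):                    # 100 dense layers, no biases
    # LeCun normal initialization: mean 0, std sqrt(1 / fan_in)
    W = rng.normal(0.0, np.sqrt(1 / n_units), size=(n_units, n_units))
    Z = selu(Z @ W)
    if layer % 25 == 0:
        print(f"layer {layer}: mean {Z.mean():.2f}, std {Z.std():.2f}")
```

The printed mean and standard deviation should stay near 0 and 1 through all the layers, rather than collapsing or blowing up.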
    

The general order of preference is SELU > ELU > leaky ReLU & variants > ReLU > tanh > logistic.

If the network architecture prevents self-normalizing, go to ELU. If I care about runtime latency, leaky ReLU (I can set a value for its leak α). Can also cross-validate other functions, such as RReLU if the network is overfitting or PReLU if there is a huge training set. BUT ReLU has a lot of support in libraries and hardware accelerators and is fast.

### Batch Normalization

