# The Vanishing/Exploding Gradients Problems
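- Gradients often get smaller and smaller as backpropagation progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers’ connection weights virtually unchanged, and training never converges to a good solution. This is the vanishing gradients problem.
- The opposite can also happen: the gradients can grow bigger and bigger until layers get extremely large weight updates and the algorithm diverges. This is the exploding gradients problem.
- To alleviate these problems, the connection weights of each layer must be initialized randomly, with a variance scaled to the layer’s number of inputs and outputs (Glorot initialization, or He initialization for ReLU and its variants). In Keras, the initializer is set per layer: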
```
from tensorflow import keras
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")
```
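To make the scaling concrete, here is a minimal pure-Python sketch of He normal initialization. It assumes the usual rule of drawing weights with variance 2 / fan_in and uses a plain (untruncated) normal distribution for simplicity. The function name `he_normal` and its arguments are hypothetical and are not the Keras implementation.

```python
import math
import random

def he_normal(fan_in, fan_out, seed=0):
    """Return a fan_in x fan_out matrix drawn from N(0, 2 / fan_in).

    Assumption: plain normal draws; library initializers may truncate.
    """
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]

weights = he_normal(fan_in=200, fan_out=100)

# Empirical variance of the sampled weights (should be near 2 / 200 = 0.01).
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
```

The more inputs a layer has, the smaller its initial weights, which helps keep the scale of the signal roughly constant from layer to layer.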
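- ReLU suffers from a problem called dying ReLUs: during training, some neurons effectively die, meaning they stop outputting anything other than 0.
    - A neuron dies when its weights are adjusted so that the weighted sum of its inputs is negative for all instances in the training set.
    - Use a variant such as leaky ReLU, which gives z < 0 a small slope α, such as 0.01 or 0.2.
- In general: SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic.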
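The ReLU variants in this ranking can be sketched as scalar functions. This is an illustrative sketch only: the function names are hypothetical, and the SELU constants are the commonly published values rather than anything stated in these notes.

```python
import math

def leaky_relu(z, alpha=0.01):
    # Small slope alpha for z < 0, so the neuron never fully dies.
    return z if z >= 0 else alpha * z

def elu(z, alpha=1.0):
    # Smooth exponential curve for z < 0, approaching -alpha.
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

# Assumption: commonly published SELU constants.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    # SELU is a scaled ELU with fixed alpha and scale.
    return SELU_SCALE * elu(z, SELU_ALPHA)
```

Unlike plain ReLU, all three return a nonzero value for negative inputs, so the gradient does not vanish entirely when the weighted sum is negative.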

## Batch Normalization
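- The technique consists of adding an operation in the model just before or after the activation function of each hidden layer. This operation zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer.
- If you add a BN layer as the very first layer of your neural network, you do not need to standardize your training set (e.g., using a StandardScaler). The BN layer does it for you, though only approximately, since it looks at one batch at a time.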
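The operation described above can be sketched for a single mini-batch as follows. This is a simplified training-time sketch, not the Keras layer: the name `batch_norm`, the scale vector `gamma`, the offset vector `beta`, and the smoothing term `eps` are assumptions made for illustration.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-3):
    """Normalize each feature over the batch, then scale and shift it."""
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mu = sum(column) / n                          # batch mean
        var = sum((x - mu) ** 2 for x in column) / n  # batch variance
        for i in range(n):
            x_hat = (batch[i][j] - mu) / math.sqrt(var + eps)
            out[i][j] = gamma[j] * x_hat + beta[j]
    return out

# Two instances, two features on very different scales.
normalized = batch_norm([[1.0, 10.0], [3.0, 30.0]],
                        gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma` set to 1 and `beta` set to 0, both features end up zero-centered with roughly unit variance, regardless of their original scale.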

### Implementing Batch Normalization with Keras
- Just add a BatchNormalization layer before or after each hidden layer’s activation function, and optionally add a BN layer as well as the first layer in your model.
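- Whether to add BN before or after the activation function is debated; the better choice depends on the task.
- Hyperparameters:
    - momentum: used to update the exponential moving averages of the mean and variance; a good value is typically close to 1.
    - axis: the axis to normalize; defaults to -1 (the last axis).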
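The role of the momentum hyperparameter can be illustrated with the exponential moving average that a BN layer keeps for its statistics. This is a sketch under the assumption that the update is `running * momentum + batch_stat * (1 - momentum)`; the function name is hypothetical.

```python
def update_moving_average(running, batch_stat, momentum=0.99):
    # A momentum close to 1 means the running value changes slowly.
    return running * momentum + batch_stat * (1.0 - momentum)

# Feed the same batch mean repeatedly; the running mean converges toward it.
running_mean = 0.0
for batch_mean in [5.0] * 1000:
    running_mean = update_moving_average(running_mean, batch_mean)
```

A momentum closer to 1 smooths out batch-to-batch noise but takes more steps to catch up, which is why larger values suit larger datasets and smaller mini-batches.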

## Gradient Clipping
- Another popular technique to mitigate the exploding gradients problem is to clip the gradients during backpropagation so that they never exceed some threshold.
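Two common ways to clip are by value (each component separately) and by norm (the whole vector at once). The sketch below illustrates both on a plain list; the function names and the threshold of 1.0 are assumptions chosen for illustration.

```python
import math

def clip_by_value(grad, threshold=1.0):
    # Clip every component to [-threshold, threshold].
    # This may change the direction of the gradient vector.
    return [max(-threshold, min(threshold, g)) for g in grad]

def clip_by_norm(grad, threshold=1.0):
    # Rescale the whole vector if its L2 norm exceeds the threshold.
    # This preserves the direction of the gradient vector.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    return [g * threshold / norm for g in grad]

grad = [0.9, 100.0]
by_value = clip_by_value(grad)
by_norm = clip_by_norm(grad)
```

Clipping `[0.9, 100.0]` by value yields `[0.9, 1.0]`, which now points mostly along the diagonal, whereas clipping by norm keeps the original direction and almost eliminates the first component.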


## Reusing Pretrained Layers
- It is generally not a good idea to train a very large DNN from scratch: instead, you should always try to find an existing neural network that accomplishes a similar task to the one you are trying to tackle (we will discuss how to find them in Chapter 14), then reuse the lower layers of this network.
- This technique is called transfer learning.
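- If the inputs of your new task do not have the same size as those used in the original task, you will usually have to add a preprocessing step to resize them to the size expected by the original model.
- Try freezing all the reused layers first (i.e., make their weights non-trainable so that Gradient Descent does not modify them), then train your model and see how it performs.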

### Transfer Learning with Keras 
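- Clone model A before reusing its layers (e.g., with `keras.models.clone_model()`, then copy the weights), so that training model B does not also affect model A.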

### Unsupervised Pretraining
- Today, unsupervised pretraining is typically done with autoencoders or GANs rather than RBMs.

### Pretraining on an Auxiliary Task
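- Train a first neural network on an auxiliary task for which labeled data is easy to obtain or generate, then reuse its lower layers for the actual task.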

## Faster Optimizers


### Momentum Optimization
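- Momentum optimization cares a great deal about what previous gradients were: at each iteration, it subtracts the local gradient (multiplied by the learning rate η) from the momentum vector m, then updates the weights by adding m.
- The algorithm introduces a new hyperparameter β, called the momentum, which must be set between 0 (high friction) and 1 (no friction).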
- The one drawback of momentum optimization is that it adds yet another hyperparameter to tune. However, the momentum value of 0.9 usually works well in practice and almost always goes faster than regular Gradient Descent.
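As a rough sketch of the update rule, assuming the form m ← βm − η∇J(θ) followed by θ ← θ + m, the snippet below minimizes a toy cost J(θ) = θ². The function name, the learning rate of 0.1, and the toy cost are all assumptions made for illustration.

```python
def momentum_step(theta, m, grad, lr=0.1, beta=0.9):
    # The gradient acts as an acceleration on m, not as a speed.
    m = [beta * mi - lr * gi for mi, gi in zip(m, grad)]
    theta = [ti + mi for ti, mi in zip(theta, m)]
    return theta, m

# Toy problem: J(theta) = theta^2, whose gradient is 2 * theta.
theta, m = [5.0], [0.0]
for _ in range(200):
    grad = [2.0 * t for t in theta]
    theta, m = momentum_step(theta, m, grad)
```

With β = 0.9 the parameter overshoots and oscillates a little before settling near the minimum at 0, which is the friction effect that β controls.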

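- Nesterov Accelerated Gradient (NAG) is a variant of momentum optimization that measures the gradient of the cost function slightly ahead in the direction of the momentum, at θ + βm, rather than at the local position θ.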
- AdaGrad frequently performs well for simple quadratic problems, but it often stops too early when training neural networks.
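The early-stopping behavior of AdaGrad follows from its update rule, which accumulates squared gradients and divides the step by their square root. The sketch below assumes the form s ← s + g², θ ← θ − η g / √(s + ε); the function name, the learning rate, and the constant gradient are illustrative assumptions.

```python
import math

def adagrad_step(theta, s, grad, lr=0.5, eps=1e-10):
    # s accumulates squared gradients and never shrinks.
    s = [si + gi * gi for si, gi in zip(s, grad)]
    theta = [ti - lr * gi / math.sqrt(si + eps)
             for ti, gi, si in zip(theta, grad, s)]
    return theta, s

theta, s = [1.0], [0.0]
step_sizes = []
for _ in range(3):
    grad = [2.0]  # constant gradient, for illustration only
    new_theta, s = adagrad_step(theta, s, grad)
    step_sizes.append(theta[0] - new_theta[0])
    theta = new_theta
```

Even though the gradient stays the same, each step is smaller than the previous one, because the accumulated sum in `s` keeps growing. On long training runs the effective learning rate can shrink so much that progress stalls before reaching the optimum.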

### Adam and Nadam Optimization
- Table 11-2. Optimizer comparison

### Learning Rate Scheduling
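- Learning schedules reduce the learning rate during training: start with a large learning rate, then lower it once training stops making fast progress.
- Power scheduling: η(t) = η0 / (1 + t/s)^c, where η0 is the initial learning rate, t the iteration number, s the number of steps, and c the power (typically 1).
- Exponential scheduling: η(t) = η0 · 0.1^(t/s), so the learning rate drops by a factor of 10 every s steps.
- Implementing power scheduling in Keras is the easiest option: just set the decay hyperparameter when creating an optimizer. The decay is the inverse of s, and Keras assumes that c is equal to 1.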
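The two schedules can be compared with a small sketch. It assumes the forms η(t) = η0 / (1 + t/s)^c for power scheduling and η(t) = η0 · 0.1^(t/s) for exponential scheduling; the function names and default values are hypothetical.

```python
def power_schedule(step, lr0=0.01, s=10_000, c=1.0):
    # After s steps the rate is lr0 / 2, after 2s steps lr0 / 3, and so on.
    return lr0 / (1.0 + step / s) ** c

def exponential_schedule(step, lr0=0.01, s=10_000):
    # The rate is divided by 10 every s steps.
    return lr0 * 0.1 ** (step / s)

rates_power = [power_schedule(t) for t in (0, 10_000, 20_000)]
rates_exp = [exponential_schedule(t) for t in (0, 10_000, 20_000)]
```

Power scheduling drops quickly at first and then more and more slowly, while exponential scheduling keeps cutting the rate by the same factor every s steps.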

## Avoiding Overfitting Through Regularization
- Just like you did in Chapter 4 for simple linear models, you can use ℓ2 regularization to constrain a neural network’s connection weights, and/or ℓ1 regularization if you want a sparse model (with many weights equal to 0).
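As a small illustration, the two penalties can be written as terms added to the training loss. The function names and the regularization factor of 0.01 are assumptions made for this sketch.

```python
def l2_penalty(weights, factor=0.01):
    # Penalizes large weights; pushes all weights toward small values.
    return factor * sum(w * w for w in weights)

def l1_penalty(weights, factor=0.01):
    # Penalizes the absolute values; tends to drive some weights to 0.
    return factor * sum(abs(w) for w in weights)

weights = [1.0, -2.0, 0.0]
loss_l2 = l2_penalty(weights)
loss_l1 = l1_penalty(weights)
```

Either term is simply added to the data loss at each training step, so the optimizer trades prediction error against weight size.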


### Dropout 
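- Dropout is one of the most popular regularization techniques for deep neural networks. At every training step, every neuron (excluding the output neurons) has a probability p of being temporarily dropped out.
- MC Dropout: averaging over multiple predictions with dropout on gives a Monte Carlo estimate that is generally more reliable than the result of a single prediction with dropout off.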
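The Monte Carlo idea can be sketched with a toy one-layer "model" in pure Python. Everything here is hypothetical (the `dropout` and `predict` helpers, the fixed activations, the output weights), and the sketch assumes inverted dropout, where kept activations are divided by the keep probability.

```python
import random

def dropout(values, rate, rng):
    # Inverted dropout: zero each value with probability `rate`,
    # and scale the survivors by 1 / (1 - rate).
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def predict(hidden, out_weights, rate=0.0, rng=None):
    # Toy model: a linear output on top of fixed hidden activations.
    if rate > 0.0:
        hidden = dropout(hidden, rate, rng)
    return sum(h * w for h, w in zip(hidden, out_weights))

rng = random.Random(42)
hidden = [0.5, 1.0, 1.5, 2.0]
out_weights = [1.0, -1.0, 0.5, 0.25]

# Many stochastic forward passes with dropout on, then average them.
samples = [predict(hidden, out_weights, rate=0.5, rng=rng)
           for _ in range(10_000)]
mc_mean = sum(samples) / len(samples)
deterministic = predict(hidden, out_weights)  # dropout off
```

In this linear toy model the Monte Carlo mean lands close to the deterministic prediction; in a real nonlinear network the two generally differ, and the spread of `samples` also gives a rough measure of the model's uncertainty.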

### Max-Norm Regularization
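- Max-norm regularization constrains the weights w of the incoming connections of each neuron such that ‖w‖2 ≤ r, where r is the max-norm hyperparameter. It is typically implemented by computing ‖w‖2 after each training step and rescaling w if needed (w ← w · r / ‖w‖2).
- Reducing r increases the amount of regularization and helps reduce overfitting.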
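The rescaling step can be sketched for one neuron's incoming weight vector. The function name `max_norm` and the default r = 1.0 are assumptions made for illustration.

```python
import math

def max_norm(w, r=1.0):
    # Leave w alone if its L2 norm is within r; otherwise rescale it
    # so that its norm is exactly r.
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= r:
        return list(w)
    return [x * r / norm for x in w]

clipped = max_norm([3.0, 4.0], r=1.0)    # norm 5 is rescaled to norm 1
unchanged = max_norm([0.3, 0.4], r=1.0)  # norm 0.5 is left as is
```

The constraint is applied after each training step, so it never adds a term to the loss; it only projects the weights back inside the ball of radius r.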

## Summary
- Table 11-3. Default DNN configuration