# TensorFlow for Deep Learning - Training Deep Neural Networks

Credits - [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)

Good reads 
- [Yes you should understand backprop](https://karpathy.medium.com/yes-you-should-understand-backprop-e2f06eab496b#.vt3ax2kg9)
- [Why Momentum Really Works](https://distill.pub/2017/momentum/)

## 1. The Vanishing/Exploding Gradients

[<img src="images/VanGrad.png" width="500"/>](https://youtu.be/W_JJm_5syFw)

Unfortunately, gradients often get smaller and smaller as the backprop algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layer's connection weights virtually unchanged, and training never converges to a good solution. We call this the **vanishing gradients** problem. In some cases, the opposite can happen: the gradients can grow bigger and bigger until layers get insanely large weight updates and the algorithm diverges. This is the **exploding gradients** problem, which surfaces in recurrent neural networks. More generally, deep neural networks suffer from unstable gradients; different layers may learn at widely different speeds.

Xavier Glorot and Yoshua Bengio in 2010 paper) showed that the combination of the popular _logistic sigmoid_ activation function and the weight initialization technique of a _normal distribution_ with a mean of 0 and a standard deviation of 1 results in variance of the outputs of each layer being much greater than the variance of its inputs. Going forward in the network, the variance keeps increasing after each layer until the activation function saturates at the top layers. This saturation is actually made worse by the fact that the logistic function has a mean of 0.5, not 0 (the hyperbolic tangent function has a mean of 0 and behaves slightly better than the logistic function in deep networks). Looking at the logistic activation function, you can see that when inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0. Thus, when backpropagation kicks in it has virtually no gradient to propagate back through the network; and what little gradient exists keeps getting diluted as backpropagation progresses down through the top layers, so there is really nothing left for the lower layers.

### 1a. Glorot and He Initialization

The signal should flow properly in both directions: in the forward direction when making predictions, and in the reverse direction when backpropagating gradients. We don't want the signal to die out, nor do we want it to explode and saturate. For the signal to flow properly, the authors argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we need the gradients to have equal variance before and after flowing through a layer in the reverse direction.  It is actually not possible to guarantee both unless the layer has an equal number of inputs and neurons (these numbers are called the fan-in and fan-out of the layer), but Glorot and Bengio proposed a good compromise that has proven to work very well in practice: the connection weights of each layer must be initialized randomly: 

**Glorot initialization (when using the logistic activation function)**
- Normal distribution with mean 0 and variance $ \sigma^2 = {1}/{fan_{avg}} $, or
- Uniform distribution between - r and + r, with $ r = \sqrt{3/fan_{avg}}$

where, $fan_{avg} = (fan_{in}+fan_{out})/2$, $fan_{in} = $ no. of inputs, $fan_{out} = $ no. of outputs

Using Glorot initialization can speed up training considerably, and it is one of the tricks that led to the success of Deep Learning.

<img src="images/TF_init.png" style="float:center;" width="350"/>

---
When creating a layer, you can change this to He initialization by setting:

```python
keras.layers.Dense(10, activation="relu", kernel_initializer="he_uniform")
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")
```

He initialization with a uniform distribution but based on $fan_{avg}$ rather than $fan_{in}$:

```python
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', 
                                                 distribution='uniform')
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)
```
---
**_Keras default: Glorot with a uniform distribution_**.

### 1b. Nonsaturating Activation Functions

One of the insights in the 2010 paper by Glorot and Bengio was that the problems with unstable gradients were in part due to a poor choice of activation function. Until then most people had assumed that if Mother Nature had chosen to use roughly sigmoid activation functions in biological neurons, they must be an excellent choice. But it turns out that other activation functions behave much better in deep neural networks -- in particular, the **ReLU** activation function, _mostly because it does not saturate for positive values and because it is fast to compute_.

**PROBLEM** - Unfortunately, the ReLU activation function is not perfect. It suffers from a problem known as the _dying ReLUs_: during training, some neurons (for any input) effectively **die**, meaning they stop outputting anything other than 0. In some cases, you may find that half of your network's neurons are dead, especially if you used a large learning rate. A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs are negative for all instances in the training set. When this happens, it just keeps outputting zeros, and Gradient Descent does not affect it anymore because the gradient of the ReLU function is zero when its input is negative.

**SOLN** - To solve this problem, you may want to use a variant of the ReLU function, such as the leaky ReLU. This function is defined as $LeakyReLU_{\alpha}(z) = max(\alpha z, z)$. The hyperparameter $\alpha$ defines how much the function **leaks**: it is the slope of the  function for $z < 0$ and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up.

- _randomized leaky ReLU_ (RReLU), where $\alpha$ is picked randomly in a given range during training and is fixed to an average value during testing.
- _parametric leaky ReLU_ (PReLU), where $\alpha$ is authorized to be learned during training (instead of being a hyperparameter, it becomes a parameter that can be modified by  backpropagation like any other parameter)
- _exponential linear unit_ (ELU) outperformed all the ReLU variants: training time was reduced, and the neural network performed better on the test set
    $$ ELU_{\alpha}(z) =  \alpha(exp(z)-1) \quad if \ z<0 $$
    $$ ELU_{\alpha}(z) =  z \quad if \ z\ge 0 $$

    - It takes on negative values when $z < 0$, which allows the unit to have an average output closer to 0 and helps alleviate the vanishing gradients problem. The hyperparameter $\alpha$ defines the value that the ELU function approaches when z is a large negative number. It is usually set to 1, but you can tweak it like any other hyperparameter. It has a nonzero gradient for $z < 0$, which avoids the dead neurons problem. If $\alpha$ is equal to 1 then the function is smooth  everywhere, including around $z < 0$, which helps speed up Gradient Descent since it does not bounce as much to the left and right of $z<0$.

    - The main drawback of the ELU activation function is that it is slower to compute than the ReLU function and its variants (due to the use of the exponential function). Its faster convergence rate during training  compensates for that slow computation, but still, at test time an ELU network will be slower than a ReLU network.
    
- _scaled ELU_ (SELU) activation function: it is a scaled variant of the ELU activation function. If you build a neural network composed exclusively of a stack of dense layers, and if all hidden layers use the SELU activation function, then the network will self-normalize: the output of each layer will tend to preserve a mean of 0 and standard deviation of 1 during training, which solves the vanishing/exploding gradients problem. As a result, the SELU activation function often significantly outperforms other activation functions for such neural nets (especially deep ones). There are, however, a few conditions for self-normalization to happen:
    - The input features must be standardized
    - Every hidden layer's weights must be initialized with LeCun normal initialization. In Keras, this means setting ```kernel_initializer="lecun_normal"```
    - The network's architecture must be sequential
    - The paper only guarantees self-normalization if all layers are dense, but some researchers have noted that the SELU activation function can improve performance in convolutional neural nets as well

**TIPS**
- Although your mileage will vary, in general SELU > ELU > leaky ReLU
(and its variants) > ReLU > tanh > logistic.
- If the network's architecture prevents it from
self-normalizing, then ELU may perform better than SELU (since SELU is not smooth at z = 0).
- If you care a lot about runtime latency, then you may prefer leaky ReLU.
- If you don't want to tweak yet another hyperparameter, you may use the default $\alpha$ values used by Keras (e.g., 0.3 for leaky ReLU).
- If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, such as RReLU if your network is overfitting or PReLU if you have a huge training set.
- That said, because ReLU is the most used activation function (by far), many libraries and hardware accelerators provide ReLU-specific  ptimizations; therefore, if speed is your priority, ReLU might still be the best choice.

---
To use the leaky ReLU activation function, create a LeakyReLU layer and
add it to your model just after the layer you want to apply it to:

```python
model = keras.models.Sequential([ 
    [...] 
    keras.layers.Dense(10, kernel_initializer="he_normal"), 
    keras.layers.LeakyReLU(alpha=0.2), 
    [...]
])
```

For PReLU, replace LeakyRelu(alpha=0.2) with PReLU().

For SELU activation, set activation="selu" and
kernel_initializer="lecun_normal" when creating a layer:
```python
layer = keras.layers.Dense(10, activation="selu", 
                           kernel_initializer="lecun_normal")
```
---

### 1c. Batch Normalization

Although using He initialization along with ELU (or any variant of ReLU) can significantly reduce the danger of the vanishing/exploding gradients problems at the _beginning_ of training, it doesn't guarantee that they won't
come back during training.

In a 2015 paper,  Sergey Ioffe and Christian Szegedy proposed a technique
called **Batch Normalization** (BN) that addresses these problems. The
technique consists of adding an operation in the model just before or after
the activation function of each hidden layer. This operation simply zero-
centers and normalizes each input, then scales and shifts the result using
two new parameter vectors per layer: one for scaling, the other for  shifting. In other words, the operation lets the model learn the optimal scale and mean of each of the layer's inputs. 

**Train**

In many cases, if you add a BN layer as the very first layer of your neural network, you do not need to standardize your training set (e.g., using a StandardScaler); the BN layer will do it for you (well, approximately, since it only looks at one batch at a time, and it can also rescale and shift each input feature). In order to zero-center and normalize the inputs, the algorithm needs to estimate each input's mean and standard deviation. It does so by evaluating the mean and standard deviation of the input over the current mini-batch (hence the name "Batch Normalization"). 

Standardization + Scaling + Shifting:

<img src="images/TF_BN1.png" width="300"/>
<img src="images/TF_BN2.png" width="300"/>

**Test**

So during training, BN standardizes its inputs, then rescales and offsets
them. Good! What about at test time? We may need to make predictions for individual instances rather than for batches of instances: in this case, we will have no way to compute each input's mean and standard deviation. Moreover, even if we do have a batch of instances, it may be too small, or the instances may not be independent and identically distributed, so computing statistics over the batch instances would be unreliable.

Most implementations of Batch Normalization estimate these final statistics
during training by using a moving average of the layer's input means and
standard deviations. This is what Keras does automatically when you use the BatchNormalization layer. To sum up, four parameter vectors are learned in each batch-normalized layer: $\bold{\gamma}$ (the output scale vector) and $\bold{\beta}$ (the output offset vector) are learned through regular backpropagation, and $\bold{\mu}$ (the final input mean vector) and $\bold{\sigma}$ (the final input standard deviation vector) are estimated using an exponential moving average. Note that $\bold{\mu}$ and $\bold{\sigma}$ are estimated during training, but they are used only after training (to replace the batch input means and standard deviations).

PROS:

- The vanishing gradients problem is strongly reduced, to the point that they could use saturating activation functions such as the tanh and even the logistic activation function. 
- The networks were are much less sensitive to the weight initialization. The authors were able to use much larger learning rates.
- BN acts like a regularizer, reducing the need for other regularization techniques such as dropout.

CONS(ish):

- BN does add some complexity to the model
- Training is rather slow, because each epoch takes much more time when you use Batch Normalization. This is usually counterbalanced by the fact that convergence is much faster with BN, so it will take fewer epochs to reach  the same performance. All in all, wall time will usually be shorter (this is the time measured by the clock on your wall).

AVOIDING PENALTY:

There is a runtime penalty: the neural network makes slower predictions due to the extra computations required at each layer. Fortunately, it's often possible to fuse the BN layer with the previous layer, after training, thereby avoiding the runtime penalty. This is done by updating the previous layer's weights and biases so that it directly produces outputs of the appropriate scale and offset: 

<img src="images/TF_BN3.png" width="500"/>

---
Implementing Batch Normalization with Keras: 

In [1]:
import tensorflow.keras as keras

model = keras.models.Sequential([ 
    keras.layers.Flatten(input_shape=[28, 28]), 
    keras.layers.BatchNormalization(),

    keras.layers.Dense(300,activation="elu",kernel_initializer="he_normal"), 
    keras.layers.BatchNormalization(), 

    keras.layers.Dense(100,activation="elu",kernel_initializer="he_normal"), 
    keras.layers.BatchNormalization(),

    keras.layers.Dense(10, activation="softmax")
])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense (Dense)                (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_1 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1

In [2]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]

Each BN layer adds four parameters per input: $\gamma, \beta, \mu$ and $\sigma$ (for example, the first BN layer adds 3,136 parameters, which is 4 x 784). The last two parameters, $\mu$ and $\sigma$, are the moving averages; they are not affected by backpropagation, so Keras calls them **non-trainable** params.

In this tiny example with just two hidden layers, it's unlikely that BN will have a very positive impact; but for deeper networks it can make a tremendous difference. 

To add the BN layers before the activation functions, you must
remove the activation function from the hidden layers and add them as
separate layers after the BN layers. Moreover, since a Batch Normalization
layer includes one offset parameter per input, you can remove the bias term
from the previous layer (just pass ```use_bias=False``` when creating it):

In [3]:
model = keras.models.Sequential([ 
    keras.layers.Flatten(input_shape=[28, 28]), 
    keras.layers.BatchNormalization(), 

    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False), 
    keras.layers.BatchNormalization(), 
    keras.layers.Activation("elu"), 

    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False), 
    keras.layers.BatchNormalization(), 
    keras.layers.Activation("elu"),
    
    keras.layers.Dense(10, activation="softmax")
])

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 784)               3136      
_________________________________________________________________
dense_3 (Dense)              (None, 300)               235200    
_________________________________________________________________
batch_normalization_4 (Batch (None, 300)               1200      
_________________________________________________________________
activation (Activation)      (None, 300)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 100)               30000     
_________________________________________________________________
batch_normalization_5 (Batch (None, 100)              

The BatchNormalization class has quite a few hyperparameters you can
tweak. The defaults will usually be fine, except for ```momentum``` and ```axis```.

```python
momentum
```
occasionally need to tweak the ```momentum```. This hyperparameter is used by the BatchNormalization layer when it updates the exponential moving
averages; given a new value v (i.e., a new vector of input means or standard
deviations computed over the current batch), the layer updates the running
average $\hat{v}$ using the following equation:

$$ v \leftarrow v \times momentum + v \times (1-momentum) $$

A good momentum value is typically close to 1; for example, 0.9, 0.99, or
0.999 (you want more 9s for larger datasets and smaller mini-batches).

```python
axis
```
it determines which axis should be normalized. It defaults to -1, meaning that by default it will normalize the last axis (using the means and  standard deviations computed across the other axes). When the input batch is 2D (i.e., the batch shape is [_batch size,features_]), this means that each input feature will be normalized based on the mean and standard deviation computed across all the instances in the batch.


Notice that the BN layer does not perform the same computation during
training and after training: it uses batch statistics during training and the "final" statistics after training (i.e., the final values of the moving averages). Let's take a peek at the source code of this class to see how this is handled:

```python
class BatchNormalization(keras.layers.Layer): 
    [...] 
    def call(self, inputs, training=None): 
        [...]
```

The ```call()``` method is the one that performs the computations; as you can
see, it has an extra ```training``` argument, which is set to None by default, but the ```fit()``` method sets to it to 1 during training. If you ever need to write a custom layer, and it must behave differently during training and testing, add a training argument to the ```call()``` method and use this argument in the method to decide what to compute.

---

### 1d. Gradient Clipping

Another popular technique to mitigate the exploding gradients problem is to
clip the gradients during backpropagation so that they never exceed some
threshold. This is called **Gradient Clipping**. This technique is most often
used in recurrent neural networks, as Batch Normalization is tricky to use in
RNNs. For other types of networks, BN is usually sufficient.

---
In Keras, implementing Gradient Clipping is just a matter of setting the
```clipvalue``` or ```clipnorm``` argument when creating an optimizer:

```python
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss="mse", optimizer=optimizer)
```
---

This optimizer will clip every component of the gradient vector to a value
between -1.0 and 1.0. This means that all the partial derivatives of the loss
(with regard to each and every trainable parameter) will be clipped between
-1.0 and 1.0. The threshold is a hyperparameter you can tune. Note that it
may change the orientation of the gradient vector. For instance, if the
original gradient vector is [0.9, 100.0], it points mostly in the direction of the second axis; but once you clip it by value, you get [0.9, 1.0], which
points roughly in the diagonal between the two axes. In practice, this
approach works well. If you want to ensure that Gradient Clipping does not
change the direction of the gradient vector, you should clip by norm by
setting ```clipnorm``` instead of ```clipvalue```. This will clip the whole gradient if its $l_2$ norm is greater than the threshold you picked. For example, if you set ```clipnorm=1.0```, then the vector [0.9, 100.0] will be clipped to [0.00899964, 0.9999595], preserving its orientation but almost eliminating the first component.



## 2. Reusing Pretrained Layers

It will not only speed up training considerably, but also require significantly less training data. Suppose you have access to a DNN that was trained to classify pictures into 100 different categories, including animals, plants, vehicles, and everyday objects. You now want to train a DNN to classify specific types of vehicles. These tasks are very similar, even partly overlapping, so you should try to reuse parts of the first network. 

Different input size: 

If the inputs of the new task don't have the same size as the ones used in  the original task, we usually have to add a preprocessing step to resize them to the size expected by the original model. More generally, transfer learning will work best when the inputs have similar low-level features.

Similar tasks:

The more similar the tasks are, the more layers you want to reuse (starting with the lower layers). For very similar tasks, try keeping all the hidden layers and just replacing the output layer.

Unfreezing hidden layer(s):

Try freezing all the reused layers first (i.e., make their weights non-trainable so that Gradient Descent won't modify them), then train your model and see how it performs. Then try unfreezing one or two of the top hidden layers to let backpropagation tweak them and see if performance improves. The more training data you have, the more layers you can unfreeze. It is also useful to reduce the learning rate when you unfreeze reused layers: this will avoid wrecking their fine-tuned weights.

### 2a. Transfer Learning with Keras:

Suppose the Fashion MNIST dataset only contained eight classes - for example, all the classes except for sandal and shirt. Someone built and trained a Keras model on that set and got reasonably good performance (>90% accuracy). Let's call this model A. You now want to tackle a different task: you have images of sandals and shirts, and you want to train a binary  lassifier (positive=shirt, negative=sandal). Your dataset is quite small; you only have 200 labeled images. When you train a new model for this task (let's call it model B) with the same architecture as model A, it performs reasonably well (97.2% accuracy). But since it's a much easier task (there are just two classes), you were hoping for more. You realize that
your task is quite similar to task A, so perhaps transfer learning can help?

```python
model_A = keras.models.load_model("my_model_A.h5")
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))
```

Note that ```model_A``` and ```model_B_on_A``` now share some layers. When you train ```model_B_on_A```, it will also affect ```model_A```. If you want to avoid that, you need to _clone_ ```model_A``` before you reuse its layers. To do this, you clone model A's architecture with ```clone.model()```, then copy its weights (since ```clone_model()``` does not clone the weights):

```python
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())
```

Now you could train model_B_on_A for task B, but since the new output layer was initialized randomly it will make large errors (at least during the first few epochs), so there will be large error gradients that may wreck the
reused weights. To avoid this, one approach is to freeze the reused layers
during the first few epochs, giving the new layer some time to learn
reasonable weights. To do this, set every layer's trainable attribute to
False and compile the model:

```python
for layer in model_B_on_A.layers[:-1]: 
    layer.trainable = False 
 
model_B_on_A.compile(loss="binary_crossentropy", optimizer="sgd", 
                     metrics=["accuracy"])
```

**Note**:You must always compile your model after you freeze or unfreeze layers.

Now you can train the model for a few epochs, then unfreeze the reused
layers (which requires compiling the model again) and continue training to
fine-tune the reused layers for task B. After unfreezing the reused layers, it is usually a good idea to reduce the learning rate, once again to avoid
damaging the reused weights:

```python
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4, 
                           validation_data=(X_valid_B, y_valid_B)) 
 
for layer in model_B_on_A.layers[:-1]: 
    layer.trainable = True 
 
optimizer = keras.optimizers.SGD(lr=1e-4) # the default lr is 1e-2
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer, 
                     metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16, 
                           validation_data=(X_valid_B, y_valid_B))
```

This model's test accuracy is 99.25%, which means that transfer learning reduced the error rate from 2.8% down to almost 0.7%! Are you convinced? You shouldn't be: I cheated! I tried many configurations until I found one that demonstrated a strong improvement. Why did I cheat? It turns out that transfer learning does not work very well with small dense networks, presumably because small networks learn few patterns, and dense networks learn very specific patterns, which are unlikely to be useful in other tasks. Transfer learning works best with deep convolutional neural networks, which tend to learn feature detectors that are much more general (especially in the lower layers).

## 3. Faster Optimizers

Five ways to speed up training (and reach a better solution):
- applying a good initialization strategy for the connection weights
- using a good activation function
- using Batch Normalization
- reusing parts of a pretrained network
- **faster optimizers than SGD**

<img src="images/TF_FO5.png" style="float:center;" width="500"/>

(* is bad, **is average, and *** is good)

Recall SGD is:

<img src="images/TF_FO0.png" style="float:center;" width="150"/>

### 3a. Momentum Optimization

[<img src="images/Mom.png" width="500"/>](https://youtu.be/r-rYz_PEWC8)

Recall that Gradient Descent updates the weights by directly subtracting the gradient of the cost function with regard to the weights multiplied by the learning rate. It does not care about what the earlier gradients were. If the local gradient is tiny, it goes very slowly. Momentum optimization cares a great deal about what previous gradients were: at each iteration, it subtracts the local gradient from the _momentum vector_ $m$ (multiplied by the learning rate $\eta$), and it updates the weights by adding this momentum vector. 

ACC: In other words, the gradient is used for acceleration, not for speed.  

FRICTION: To simulate some sort of friction mechanism and prevent the momentum from growing too large, the algorithm introduces a new hyperparameter $\beta$, called the _momentum_, which must be set between 0 (high friction) and 1 (no friction). A typical momentum value is 0.9. Due to the momentum, the optimizer may overshoot a bit, then come back, overshoot again, and oscillate like this many times before stabilizing at the minimum. This is one of the reasons it's good to have a bit of friction in the system: it gets rid of these oscillations and thus speeds up convergence.

<img src="images/TF_FO1.png" style="float:center;" width="200"/>

For example, if $\beta$ = 0.9, then the terminal velocity is equal to 10 times the gradient times the learning rate, so momentum optimization ends up going 10 times faster than Gradient Descent! This allows momentum optimization to escape from plateaus much faster than Gradient Descent.

In deep neural networks that don't use Batch Normalization, the upper layers will often end up having inputs with very different scales, so using momentum optimization helps a lot. It can also help roll past local optima.

--- 
Implementing momentum optimization in Keras:

```optimizer = keras.optimizers.SGD(lr=0.001, momentum=0.9)```

---

### 3b. Nesterov Accelerated Gradient

The Nesterov Accelerated Gradient (NAG) method, also known as Nesterov momentum optimization, measures the gradient of the cost function not at the local position $\theta$ but slightly ahead in the direction of the momentum, at $\theta + \beta m$. 

<img src="images/TF_FO2.png" style="float:center;" width="250"/>

This small tweak works because in general the momentum vector will be pointing in the right direction (i.e., toward the optimum), so it will be slightly more accurate to use the gradient measured a bit farther in that direction rather than the gradient at the original position. After a while, these small improvements add up and NAG ends up being significantly faster than regular momentum optimization. 

<img src="images/TF_FO2a.png" style="float:center;" width="350"/>


Moreover, note that when the momentum pushes the weights across a valley, $\nabla_1$ continues to push farther across the valley, while $\nabla_2$ pushes back toward the bottom of the valley. This helps reduce oscillations and thus NAG converges faster.

---
NAG is generally faster than regular momentum optimization. To use it,
simply set nesterov=True when creating the SGD optimizer:

```optimizer = keras.optimizers.SGD(lr=0.001, momentum=0.9, nesterov=True)```

---


### 3c. AdaGrad

GD starts by quickly going down the steepest slope, which may not always point straight toward the global optimum, and then it very slowly goes down to the bottom of the valley. It would be nice if the algorithm could correct its direction earlier to point a bit more toward the global optimum. The AdaGrad algorithm achieves this correction by _scaling_ down the gradient vector along the steepest dimensions. 

<img src="images/TF_FO3.png" style="float:center;" width="250"/>

The first step accumulates the square of the gradients into the vector $s$.  This vectorized form is equivalent to computing s  $ s_i  \leftarrow  s_i + (\partial J(\theta) / \partial \theta )$  for each element $s_i$  of the vector $s$; in other words, each $ s_i$  accumulates the squares of the partial derivative of the cost function with regard to parameter $\theta_i$. If the
cost function is steep along the $i^{th}$ dimension, then $s_i$  will get larger and larger at each iteration. The second step is almost identical to Gradient Descent, but with one big difference: the gradient vector is scaled down by a factor of 
$\sqrt{s+\epsilon}$. This vectorized form is equivalent to simultaneously computing $ \theta_i  \leftarrow  \theta_i - \eta \ \partial J(\theta) / \partial \theta_i \ / \sqrt{s_i+\epsilon}$.

<img src="images/TF_FO3a.png" style="float:center;" width="400"/>

In short, this algorithm decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes. This is called an _adaptive learning rate_. It helps point the resulting updates more directly toward the global optimum. One additional benefit is that it requires much less tuning of the learning rate hyperparameter $\eta$.

AdaGrad frequently performs well for simple quadratic problems, but it often stops too early when training neural networks. The learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum. So even though Keras has an Adagrad optimizer, you should not use it to train deep neural networks.

### 3d. RMSProp

As we've seen, AdaGrad runs the risk of slowing down a bit too fast and never converging to the global optimum. The RMSProp algorithm  fixes this by accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training). It does so by using exponential decay in the first step. 

<img src="images/TF_FO4.png" style="float:center;" width="300"/>

Except on very simple problems, this optimizer almost always performs much better than AdaGrad. In fact, it was the preferred optimization algorithm of many researchers until Adam optimization came around.

---
Keras has an RMSprop optimizer:
```optimizer = keras.optimizers.RMSprop(lr=0.001, rho=0.9)```

---

### 3e. Adam

**Adam** which stands for _adaptive moment estimation_, combines the ideas of momentum optimization and RMSProp: just like momentum optimization, it keeps track of an exponentially decaying average of past gradients; and just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients. 

<img src="images/TF_FO6a.png" style="float:center;" width="300"/>

In this equation, T represents the iteration (epoch) number (starting at 1). 

If you just look at steps 1, 2, and 5, you will notice Adam's close similarity to both momentum optimization and RMSProp. The only difference is that step 1 computes an exponentially decaying average rather than an exponentially decaying sum, but these are actually equivalent except for a constant factor (the decaying average is just $1-\beta$ times the decaying sum). Steps 3 and 4 are somewhat of a technical detail: since $m$ and $s$ are initialized at 0, they will be biased toward 0 at the beginning of training, so these two steps will help boost $m$ and $s$ at the beginning of training.

The momentum decay hyperparameter $\beta_1$  is typically initialized to 0.9, while the scaling decay hyperparameter $\beta_2$  is often initialized to 0.999. Since Adam is an adaptive learning rate algorithm (like AdaGrad and RMSProp), it requires less tuning of the learning rate hyperparameter $\alpha$. You can often use the default value $\alpha$ = 0.001

_AdaMax_: Adam scales down the parameter updates by the square root of $s$. In short, Adam scales down the parameter updates by the $l_2$  norm of the time-decayed gradients. AdaMax replaces the $l_2$  norm with the $l_{\inf}$ norm (a fancy way of saying the max).

_Nadam_: optimization is Adam optimization plus the Nesterov trick, so it
will often converge slightly faster than Adam

---
Adam optimizer using Keras:
```optimizer = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999)```

---


### NOTE: TRAINING SPARSE MODELS

All the optimization algorithms just presented produce dense models, meaning that most parameters will be nonzero. If you need a blazingly fast model at runtime, or if you need it to take up less memory, you may prefer to end up with a sparse model instead. 

One easy way to achieve this is to train the model as usual, then get rid of the tiny weights (set them to zero). Note that this will typically not lead to a very sparse model, and it may degrade the model's performance.

A better option is to apply strong $l_1$  regularization during training (we will see how later in this chapter), as it pushes the optimizer to zero out as many weights as it can. 

If these techniques remain insufficient, check out the [TensorFlow Model Optimization Toolkit (TF-MOT)](https://homl.info/tfmot), which provides a pruning API capable of iteratively removing connections during training based on their magnitude.

### NOTE: 1st vs 2nd ORDER DERIVATIVES

All the optimization techniques discussed so far only rely on the first-order partial derivatives (Jacobians). The optimization literature also contains amazing algorithms based on the second-order partial derivatives (the Hessians, which are the partial derivatives of the Jacobians). Unfortunately, these algorithms are very hard to apply to deep neural networks because there are $n^2$ Hessians per output (where n is the number of parameters), as opposed to just n Jacobians per output. Since DNNs typically have tens of thousands of parameters, the second-order optimization algorithms often don't even fit in  memory, and even when they do, computing the Hessians is just too slow.

## 4. Learning Rate Scheduling

**Optimal Learning Rate**: the optimal learning rate is about half of the maximum learning rate (i.e., the learning rate above which the training algorithm diverges) One way to find a good learning rate is to train the model for a few hundred iterations, starting with a very low learning rate (e.g., 1e-5) and gradually increasing it up to a very large value (e.g., 10). This is done by multiplying the learning rate by a constant factor at each iteration (e.g., by exp(log(1e+6)/500) to go from 1e+5 to 10 in 500
iterations). If you plot the loss as a function of the learning rate (using a log scale for the learning rate), you should see it dropping at first. But after a while, the learning rate will be too large, so the loss will shoot back up: the optimal learning rate will be a bit lower than the point at which the loss starts to climb (typically about 10 times lower than the
turning point). You can then reinitialize your model and train it normally using this good learning rate.

**Learning Schedules**: start with a large learning rate and then reduce it once training stops making fast progress, you can reach a good solution faster than with the optimal constant learning rate. There are many different strategies to reduce the learning rate during training. It can also be beneficial to **_start with a low learning rate, increase it, then drop it again_**. These strategies are called _learning schedules_.

<img src="images/TF_LR1.jpeg" style="float:center;" width="500"/>

---

- Power scheduling in Keras (assumes $c=1$) with $decay=1/s$, ($s$ is the number of steps it takes to divide the
learning rate by one more unit)
    ```python
    optimizer = keras.optimizers.SGD(lr=0.01, decay=1e-4)
    ```

- Exponential scheduling in Keras using a function and a callback
    ```python
    def exponential_decay_fn(epoch): 
        return 0.01 * 0.1**(epoch / 20)
    # OR
    def exponential_decay(lr0, s): 
        def exponential_decay_fn(epoch): 
            return lr0 * 0.1**(epoch / s) 
        return exponential_decay_fn
    exponential_decay_fn = exponential_decay(lr0=0.01, s=20)

    lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
    history = model.fit(X_train_scaled, y_train, [...], callbacks=[lr_scheduler])
    ```

- Piecewise Constant scheduling in Keras using a function and a callback
    ```python
    def piecewise_constant_fn(epoch): 
        if epoch < 5: 
            return 0.01 
        elif epoch < 15: 
            return 0.005 
        else: 
            return 0.001

    lr_scheduler = keras.callbacks.LearningRateScheduler(piecewise_constant_fn)
    history = model.fit(X_train_scaled, y_train, [...], callbacks=[lr_scheduler])
    ```

- Performance scheduling in Keras using a ```ReduceLROnPlateau``` callback
    ```python
    lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
    history = model.fit(X_train_scaled, y_train, [...], callbacks=[lr_scheduler])
    ```


**Saving and Loading**: When you save a model, the optimizer and its learning rate get saved along with it. This means that with this new schedule function, you could just load a trained model and continue training where it left off, no problem. Things
are not so simple if your schedule function uses the epoch argument however: the epoch does not get saved, and it gets reset to 0 every time you call the fit() method. If you were to continue training a model where it left off, this could lead to a very large learning rate, which would likely damage your model's weights. One solution is to manually set the fit() method's ```initial_epoch``` argument so the epoch starts at the right value. 

**keras.optimizers.schedules**: _tf.keras_ (and not _keras_) offers an alternative way to implement learning rate scheduling: define the learning rate using one of the schedules available in ```keras.optimizers.schedules```, then pass this learning rate to any
optimizer. This approach updates the learning rate at each step rather than at each epoch.

```python
s = 20 * len(X_train) // 32 # number of steps in 20 epochs (batch size = 32)
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
optimizer = keras.optimizers.SGD(learning_rate) # or any optimizer
```

This is nice and simple, plus when you save the model, the learning rate and its schedule (including its state) get saved as well. This approach, however, is not part of the Keras API; it is specific to tf.keras.

---

## 5. Avoiding Overfitting Through Regularization

### 5a. $l_1$ and $l_2$ Regularization

[<img src="images/Reg.png" width="500"/>](https://youtu.be/ndYnUrx8xvs)

$l_2$ regularization to a Keras layer's connection weights, using a regularization factor of 0.01:

```python
layer = keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal", 
                           kernel_regularizer=keras.regularizers.l2(0.01))
```

The l2() function returns a regularizer that will be called at each step during training to compute the regularization loss. This is then added to the final loss. As you might expect, you can just use ```keras.regularizers.l1()``` if you want $l_!$   regularization; if you want both $l_1$  and $l_2$  regularization, use ```keras.regularizers.l1_l2()``` (specifying both regularization factors).

Apply the same regularizer to all layers in your network, as well as using the same activation function and the same
initialization strategy in all hidden layers:

```python
from functools import partial 
 
RegularizedDense = partial(keras.layers.Dense, 
                           activation="elu", 
                           kernel_initializer="he_normal", 
                           kernel_regularizer=keras.regularizers.l2(0.01)) 
 
model = keras.models.Sequential([ 
    keras.layers.Flatten(input_shape=[28, 28]), 
    RegularizedDense(300), 
    RegularizedDense(100), 
    RegularizedDense(10, activation="softmax", 
                     kernel_initializer="glorot_uniform")
])
```

### 5b. Dropout

At every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability $p$ of being temporarily "dropped out", meaning it will be entirely ignored during this training step, but it may be active during the next step. The hyperparameter $p$ is called the _dropout rate_, and it is typically set between 10% and 50%: closer to 20-30% in RNNs and closer to 40-50% in CNNs. After training, neurons don't get dropped anymore.

Droupout basically allow NN to **not** depend on any one feature and spreads out the weights. Neurons trained with dropout cannot co-adapt with their neighboring neurons; they have to be as useful as possible on their own. They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes in the inputs. In the end, you get a more robust network that generalizes better.

Another way to understand the power of dropout is to realize that a unique neural network is generated at each training step. Since each neuron can be either present or absent, there are a total of 2  possible networks (where N is the total number of droppable neurons). This is such a huge number that it is virtually impossible for the same neural network to be sampled twice.
Once you have run 10,000 training steps, you have essentially trained 10,000 different neural networks (each with just one training instance). These neural networks are obviously not independent because they share many of their weights, but they are nevertheless all different. The resulting neural network can be seen as an **averaging ensemble** of all these smaller neural networks.

Suppose $p = 25%$ , in which case during testing a neuron would be connected to 4/3 times as many input neurons as it would be (on average) during training. To compensate for this fact, we need to multiply each neuron's input connection weights by 3/4 after training. If we don't, each neuron will get a total input signal roughly 4/3 times as large as what the network was trained on and will be unlikely to perform well. More generally, we need to multiply each input connection weight by the _keep probability_ $(1-p)$ after training. Alternatively, we can divide each neuron's output by the _keep probability_ during training (these alternatives are not perfectly equivalent, but they work equally well).

**NOTE**: In practice, you can usually apply dropout only to the neurons in the top 1-3 layers (excluding the output layer). Moreover, many state-of-the-art architectures only use dropout after the last hidden layer, so you may want to try this if full dropout is too strong.

**NOTE**: Since dropout is only active during training, comparing the training loss and the validation loss can be misleading. In particular, a model may be overfitting the training set and yet have similar training and validation losses. So make sure to evaluate the training loss without dropout (e.g., after training).

**NOTE**: If you want to regularize a self-normalizing network based on the SELU activation function (as discussed earlier), you should use alpha dropout: this is a variant of dropout that preserves the mean and standard deviation of its inputs.

---
Implement dropout using Keras
```python
model = keras.models.Sequential([ 
    keras.layers.Flatten(input_shape=[28, 28]), 
    keras.layers.Dropout(rate=0.2), 
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"), 
    keras.layers.Dropout(rate=0.2), 
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"), 
    keras.layers.Dropout(rate=0.2), 
    keras.layers.Dense(10, activation="softmax")
])
```
During training, keras.layers.Dropout layer randomly drops some inputs (setting them to 0) and divides the remaining inputs by the _keep probability_ (```rate```). After training, it does nothing at all; it just passes the inputs to the next layer. 

---


### 5c. Monte Carlo (MC) Dropout

Boost the performance of any trained dropout model without having to retrain it or even modify it at all, provides a much better measure of the model's uncertainty, and is also amazingly simple to implement. 

We just make 100 predictions over the test set, setting ```training=True``` to ensure that the Dropout layer is active, and stack the predictions. Since dropout is active, all the predictions will be different. ```y_probas``` is an array of shape [100, 10000, 10] for 100 smaples, 10000 instances and 10 classes. Once we average over the first dimension (axis=0), we get ```y_proba```, an array of shape [10000, 10], like we would get with a single prediction.

```python
y_probas = np.stack([model(X_test_scaled, training=True) 
                     for sample in range(100)])
y_proba = y_probas.mean(axis=0)
```

with dropout off: 
```python
np.round(model.predict(X_test_scaled[:1]), 2)
array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.99]], dtype=float32)
```

with dropout on (```training=True```): 
```python
np.round(y_probas[:, :1], 2)
array([[[0.  , 0.  , 0.  , 0.  , 0.  , 0.14, 0.  , 0.17, 0.  , 0.68]],
       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.16, 0.  , 0.2 , 0.  , 0.64]],
       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.02, 0.  , 0.01, 0.  , 0.97]],
       [...]
```

apparently, when we activate dropout, the model is not sure anymore. Once we average over the first dimension, we get the
following MC Dropout predictions:

```python
np.round(y_proba[:1], 2)
array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.22, 0.  , 0.16, 0.  , 0.62]], dtype=float32)
```

The model still thinks this image belongs to class 9, but only with a 62% confidence, which seems much more reasonable than 99%. Plus it's useful to know exactly which other classes it thinks are likely. 

take a look at the standard deviation of the probability estimates:
```python
y_std = y_probas.std(axis=0)
np.round(y_std[:1], 2)
array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.28, 0.  , 0.21, 0.02, 0.32]],
      dtype=float32)
```

Apparently there's quite a lot of variance in the probability estimates: if you were building a risk-sensitive system (e.g., a medical or financial system), you should probably treat such an uncertain prediction with extreme caution. You definitely would not treat it like a 99% confident prediction.

In short, MC Dropout is a fantastic technique that boosts dropout models and provides better uncertainty estimates. And of course, since it is just regular dropout during training, it also acts like a regularizer.

### 5d. Data Augmentation

Data augmentation artificially increases the size of the training set by generating many realistic variants of each training instance. This reduces overfitting, making this a regularization technique. The generated instances should be as realistic as possible: ideally, given an image from the augmented training set, a human should not be able to tell whether it was augmented or not. **Simply adding white noise will not help; the modifications should be learnable (white noise is not)**.

## 6. Understanding Neural Noise

[<img src="images/Noise.png" width="600"/>](https://youtu.be/ubqhh4Iv7O4)

## 7. Understanding Inefficiencies in our Network

[<img src="images/Inefficiency.png" width="600"/>](https://youtu.be/ubqhh4Iv7O4)

## Summary

The following configuration generally works fine in most cases, without requiring much hyperparameter tuning.

<img src="images/TF_TrSumm1.png" style="float:left;" width="400"/>

<img src="images/TF_TrSumm2.png" style="float:left;" width="404"/>




## Exercises

a. Exercise: Build a DNN with 20 hidden layers of 100 neurons each (that's too many, but it's the point of this exercise). Use He initialization and the ELU activation function. input_shape=[32, 32, 3]

In [52]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

tf.random.set_seed(42)
np.random.seed(42)

In [45]:
model = Sequential()
model.add(Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(Dense(100, activation='elu', kernel_initializer='he_uniform'))


b. 
Exercise: Using Nadam optimization and early stopping, train the network on the CIFAR10 dataset. You can load it with keras.datasets.cifar10.load_data(). The dataset is composed of 60,000 32 × 32--pixel color images (50,000 for training, 10,000 for testing) with 10 classes, so you'll need a softmax output layer with 10 neurons. Remember to search for the right learning rate each time you change the model's architecture or hyperparameters.


In [46]:
(X_train_full, y_train_full), (X_test, y_test) = \
    tf.keras.datasets.cifar10.load_data()

X_train = X_train_full[5000:]
y_train = y_train_full[5000:]
X_valid = X_train_full[:5000]
y_valid = y_train_full[:5000]

In [47]:
model.add(Dense(10, activation='softmax'))
model.compile(loss="sparse_categorical_crossentropy", 
              optimizer=tf.keras.optimizers.Nadam(lr=5e-3),
              metrics='accuracy')

In [48]:
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=20)
model_checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    'modules/my_cifar10_model.h5', save_best_only=True)
callbacks = [early_stopping_cb, model_checkpoint_cb]

model.fit(X_train, y_train, epochs=100, batch_size=1000,
          validation_data=(X_valid, y_valid), validation_batch_size=1000,
          callbacks=callbacks)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x2ad11e9a90a0>

In [49]:
model = tf.keras.models.load_model("modules/my_cifar10_model.h5")
model.evaluate(X_valid, y_valid)



[1.6279069185256958, 0.4083999991416931]

c. Now try adding Batch Normalization and compare the learning curves: Is it converging faster than before? Does it produce a better model? How does it affect training speed?

The code below is very similar to the code above, with a few changes:

    I added a BN layer after every Dense layer (before the activation function), except for the output layer. I also added a BN layer before the first hidden layer.
    I changed the learning rate to 5e-4. I experimented with 1e-5, 3e-5, 5e-5, 1e-4, 3e-4, 5e-4, 1e-3 and 3e-3, and I chose the one with the best validation performance after 20 epochs.
    I renamed the run directories to runbn* and the model file name to my_cifar10_bn_model.h5.

In [50]:
from tensorflow.keras.layers import BatchNormalization

model = Sequential()
model.add(Flatten(input_shape=[32, 32, 3]))
model.add(BatchNormalization())
for _ in range(20):
    model.add(Dense(100, activation='elu', kernel_initializer='he_uniform'))
    model.add(BatchNormalization())
model.add(Dense(10, activation='softmax'))

model.compile(loss="sparse_categorical_crossentropy", 
              optimizer=tf.keras.optimizers.Nadam(lr=5e-4),
              metrics='accuracy')

early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=20)
model_checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    'modules/my_cifar10_bn_model.h5', save_best_only=True)
callbacks = [early_stopping_cb, model_checkpoint_cb]

model.fit(X_train, y_train, epochs=100, batch_size=1000,
          validation_data=(X_valid, y_valid), validation_batch_size=1000,
          callbacks=callbacks)

model = tf.keras.models.load_model("modules/my_cifar10_bn_model.h5")
model.evaluate(X_valid, y_valid)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100


[1.5548253059387207, 0.45419999957084656]

- Is the model converging faster than before? Much faster! The previous model took 27 epochs to reach the lowest validation loss, while the new model achieved that same loss in just 5 epochs and continued to make progress until the 16th epoch. The BN layers stabilized training and allowed us to use a much larger learning rate, so convergence was faster.
- Does BN produce a better model? Yes! The final model is also much better, with 45.0% accuracy instead of 40%. It's still not a very good model, but at least it's much better than before (a Convolutional Neural Network would do much better, but that's a different topic).
- How does BN affect training speed? Although the model converged much faster, each epoch took about 12s instead of 8s, because of the extra computations required by the BN layers. But overall the training time (wall time) was shortened significantly!

d. Try replacing Batch Normalization with SELU, and make the necessary adjustements to ensure the network self-normalizes (i.e., standardize the input features, use LeCun normal initialization, make sure the DNN contains only a sequence of dense layers, etc.).


In [53]:
from tensorflow.keras.layers import BatchNormalization

model = Sequential()
model.add(Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(Dense(100, activation='selu', kernel_initializer='lecun_uniform'))
model.add(Dense(10, activation='softmax'))

model.compile(loss="sparse_categorical_crossentropy", 
              optimizer=tf.keras.optimizers.Nadam(lr=5e-4),
              metrics='accuracy')

early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=20)
model_checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    'modules/my_cifar10_selu_model.h5', save_best_only=True)
callbacks = [early_stopping_cb, model_checkpoint_cb]

X_means = X_train.mean(axis=0)
X_stds = X_train.std(axis=0)
X_train_scaled = (X_train - X_means) / X_stds
X_valid_scaled = (X_valid - X_means) / X_stds
X_test_scaled = (X_test - X_means) / X_stds

model.fit(X_train, y_train, epochs=100, batch_size=1000,
          validation_data=(X_valid, y_valid), validation_batch_size=1000,
          callbacks=callbacks)

model = tf.keras.models.load_model("modules/my_cifar10_selu_model.h5")
model.evaluate(X_valid, y_valid)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100


[1.544802188873291, 0.4569999873638153]

- We get 46% accuracy, which is not much better than the original model (40%), as good as the model using batch normalization (45%). However, convergence was almost as fast as with the BN model, plus each epoch took only 1 seconds. So it's by far the fastest model to train so far.

e. Exercise: Try regularizing the model with alpha dropout. Then, without retraining your model, see if you can achieve better accuracy using MC Dropout.

In [55]:
from tensorflow.keras.layers import AlphaDropout

model = Sequential()
model.add(Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(Dense(100, activation='selu', kernel_initializer='lecun_uniform'))
model.add(AlphaDropout(rate=0.1))
model.add(Dense(10, activation='softmax'))

model.compile(loss="sparse_categorical_crossentropy", 
              optimizer=tf.keras.optimizers.Nadam(lr=5e-4),
              metrics='accuracy')

early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=20)
model_checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    'modules/my_cifar10_alpha_dropout_model.h5', save_best_only=True)
callbacks = [early_stopping_cb, model_checkpoint_cb]

X_means = X_train.mean(axis=0)
X_stds = X_train.std(axis=0)
X_train_scaled = (X_train - X_means) / X_stds
X_valid_scaled = (X_valid - X_means) / X_stds
X_test_scaled = (X_test - X_means) / X_stds

model.fit(X_train, y_train, epochs=100, batch_size=1000,
          validation_data=(X_valid, y_valid), validation_batch_size=1000,
          callbacks=callbacks)

model = tf.keras.models.load_model("modules/my_cifar10_alpha_dropout_model.h5")
model.evaluate(X_valid, y_valid)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100


[1.5727181434631348, 0.4580000042915344]

- Same accuracy as before