In [1]:
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow import keras

In [2]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

In [31]:
pixel_means = X_train.mean(axis=0, keepdims=True)
pixel_stds = X_train.std(axis=0, keepdims=True)
X_train_scaled = (X_train - pixel_means) / pixel_stds
X_valid_scaled = (X_valid - pixel_means) / pixel_stds
X_test_scaled = (X_test - pixel_means) / pixel_stds

# The Vanishing/Exploding Gradients Problem

During backpropagation the chain rule multiplies gradients layer by layer, so as they flow backwards through a neural network they can

* Grow larger and larger (exploding gradients)
* Grow smaller and smaller (vanishing gradients)

This means that different layers can learn at very different speeds. The main causes are

* The weight initialization technique
* A bad choice of activation function (sigmoid, for example, which saturates for large inputs)
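As a rough illustration of the vanishing case, the sketch below multiplies sigmoid slopes across a stack of layers. It assumes a toy chain in which every weight is 1 and every layer sees the same pre-activation, a simplification made only for this example. The names `sigmoid_grad`, `best_case`, and `saturated` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

# Toy assumption: the backpropagated gradient picks up one factor of
# sigmoid'(z) per layer (all weights set to 1 for simplicity).
n_layers = 10
best_case = sigmoid_grad(0.0) ** n_layers   # every layer at the steepest point
saturated = sigmoid_grad(4.0) ** n_layers   # every layer in the saturated region

print(best_case)
print(saturated)
```

Even in the best case the product shrinks geometrically with depth, which is why the lower layers of a deep sigmoid network can barely learn.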




## Weight Initialization Technique

1. Need variance of outputs of each layer to be equal to variance of inputs. 
2. Need gradients to have equal variance before and after flowing through a layer.

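Both conditions cannot hold exactly unless a layer has as many inputs as outputs, but there are two ways to get a good compromise in practice (Glorot initialization). Let $f_{in}$ be the number of input neurons and $f_{out}$ be the number of output neurons of a layer, and let $f_{avg}=\frac{1}{2}(f_{in}+f_{out})$. Then choose the weights from either

1. A normal distribution with $\mu=0$ and $\sigma^2 = 1/f_{avg}$
2. A uniform distribution between $-r$ and $r$ with $r=\sqrt{3/f_{avg}}$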
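The two options above describe the same target variance, which a short calculation can confirm. This is a minimal sketch: the helper `glorot_params` and the layer sizes 784 and 300 are illustrative assumptions, not part of Keras.

```python
import math

def glorot_params(fan_in, fan_out):
    """Return (sigma, r) for the normal and uniform variants described above."""
    fan_avg = (fan_in + fan_out) / 2
    sigma = math.sqrt(1 / fan_avg)   # std of the normal distribution
    r = math.sqrt(3 / fan_avg)       # half-width of the uniform distribution
    return sigma, r

sigma, r = glorot_params(fan_in=784, fan_out=300)

# A uniform distribution on (-r, r) has variance r**2 / 3,
# which should equal sigma**2 = 1 / fan_avg.
uniform_variance = r ** 2 / 3
print(sigma ** 2, uniform_variance)
```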

By default Keras uses Glorot initialization with a uniform distribution. This can be changed when creating a layer, for example to He initialization:

In [3]:
keras.layers.Dense(10, activation='relu', kernel_initializer='he_normal')

<tensorflow.python.keras.layers.core.Dense at 0x211b3f0fc50>

See the table on p. 334 and the code on p. 335 for more initialization possibilities.

## Activation Functions

ReLU is generally not the best choice: for $x<0$ its slope is $0$, so neurons can "die" (they stop being tweaked during gradient descent) once they enter this region. **ELU** and **SELU** usually perform better. **ELU** replaces the flat $0$ part of ReLU on the negative $x$ axis with a decaying exponential, and **SELU** is a scaled variant of **ELU**. For **SELU** to self-normalize, the following are required:

1. Input features must be standardized (mean 0, std 1)
2. Hidden layer weights must be initialized with LeCun normal initialization. This means that $\sigma^2=1/f_{in}$ instead of $1/f_{avg}$.
3. The network must be sequential (a plain stack of layers, with no skip connections or other non-sequential topologies)

For implementing these activation functions see pp. 337-338.
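The three activation functions discussed above can be sketched in plain Python as follows. This is only an illustration of the formulas, not the Keras implementation. The two SELU constants are the standard published values, and the function names are chosen for this example.

```python
import math

# Standard SELU constants (alpha and scale)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def relu(x):
    return max(0.0, x)

def elu(x, alpha=1.0):
    # Identity for x > 0, decaying exponential that approaches -alpha for x < 0
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def selu(x):
    # SELU is a scaled ELU with a specific alpha
    return SELU_SCALE * elu(x, alpha=SELU_ALPHA)

for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), elu(x), selu(x))
```

Unlike ReLU, both ELU and SELU have a nonzero slope for negative inputs, so a neuron pushed into that region can still be updated.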

## Batch Normalization

Batch normalization is a technique that zero-centers and normalizes the inputs of each layer, just before or after the activation function. This helps ensure that the vanishing/exploding gradients problem doesn't come back later in training.

See pp. 338-342 for more mathematical details, but know that it essentially "acts" as a standard scaler between layers.

It is called "batch" normalization because it normalizes entries using

$$\hat{x} = (x-\mu_B)/\sigma_B $$

where the vectors $\mu_B$ and $\sigma_B$ are the mean and standard deviation computed over the current batch only (using the full dataset is not practical for stochastic gradient descent). The layer then rescales and shifts the result using

$$ \text{Input} = \gamma \otimes \hat{x} + \beta $$

and thus $\gamma$ (scale) and $\beta$ (offset) are learned parameters that act like effective neuron weights.
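The two equations above can be sketched for a single feature over one mini-batch. This is a simplified, pure-Python illustration. The small `eps` added to the variance is an assumption made here to avoid division by zero (it does not appear in the formula above), and `batch_norm` is a hypothetical helper, not the Keras layer.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    mu = sum(batch) / len(batch)                          # mu_B
    var = sum((x - mu) ** 2 for x in batch) / len(batch)  # sigma_B squared
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]
    return [gamma * xh + beta for xh in x_hat]

batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(batch)                        # zero mean, roughly unit variance
rescaled = batch_norm(batch, gamma=2.0, beta=5.0)     # mean moved to beta

print(normalized)
print(rescaled)
```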



**Important for Coding**: Batch normalization can be accomplished as follows.

In [4]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

One parameter to tweak is **momentum**, which controls how the moving averages of the means and standard deviations are updated from batch to batch. Let $v$ represent the current batch's $\mu_B$ or $\sigma_B$ and $\hat{v}$ the corresponding moving average:

$$\hat{v} \leftarrow \hat{v} \times \text{momentum} + v \times (1-\text{momentum}) $$

If momentum is 0 then the moving average is just the current batch's value. The closer momentum is to 1 (the Keras default is 0.99), the more information is retained from previous batches (like a smoothing function in time series analysis).

Note that after training (e.g. when evaluating on the test set) the batch normalization layers use these final moving averages in place of $\mu_B$ and $\sigma_B$, as estimates of the mean and standard deviation over the whole training set.

## Gradient Clipping

For certain neural networks called *recurrent neural networks*, batch normalization is tricky to use, so other techniques have been developed to deal with exploding gradients. One technique is gradient clipping, where each gradient component is clipped during backpropagation so that it never exceeds some threshold.

In [5]:
optimizer = keras.optimizers.SGD(clipvalue=1.0)

Here every component of the gradient vector is clipped to lie between -1.0 and 1.0. This means that a gradient of [-0.9, 100] is clipped to [-0.9, 1.0], which changes its direction by quite a bit. In practice this still works well. To preserve the direction, use the `clipnorm` argument instead, which rescales the whole gradient whenever its $\ell_2$ norm exceeds the threshold. It is worth trying both approaches to see which one works best on the validation set.
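The difference between the two clipping styles can be checked on the example gradient from the text. This is a plain-Python sketch of the arithmetic, not the Keras internals. `clip_by_value` and `clip_by_norm` are names chosen for this illustration.

```python
import math

def clip_by_value(grad, clip):
    # Clip every component independently to [-clip, clip]
    return [max(-clip, min(clip, g)) for g in grad]

def clip_by_norm(grad, max_norm):
    # Rescale the whole vector if its L2 norm exceeds max_norm
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

grad = [-0.9, 100.0]
by_value = clip_by_value(grad, 1.0)   # components end up with similar size
by_norm = clip_by_norm(grad, 1.0)     # direction preserved, first component nearly vanishes

print(by_value)
print(by_norm)
```

Clipping by value keeps the small component intact but rotates the vector, while clipping by norm keeps the direction but makes the first component tiny.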

# Reusing Pretrained Layers


It is generally not a good idea to train a large DNN from scratch: instead, reuse some layers from an existing network that solves a similar task. This is called **transfer learning**. Generally

* Lower layers are far more useful to reuse than upper layers

Procedure to do this:

1. Select the layers you want to reuse
2. Freeze them
3. Add new upper layers
4. Train the upper layers

## Example

Let's do this on the Fashion MNIST data. Suppose we have

* 200 images of shirts and sandals we want to classify
* A model trained on the other Fashion MNIST classes, which used far more data

We could just train a model on the 200 images, but because the training sample is small we wouldn't expect good accuracy. Instead we can take the model trained on the larger data set and reuse it for the shirt/sandal task.

In [6]:
def split_dataset(X, y):
    y_5_or_6 = (y == 5) | (y == 6) # sandals or shirts
    y_A = y[~y_5_or_6]
    y_A[y_A > 6] -= 2 # class indices 7, 8, 9 should be moved to 5, 6, 7
    y_B = (y[y_5_or_6] == 6).astype(np.float32) # binary classification task: is it a shirt (class 6)?
    return ((X[~y_5_or_6], y_A),
            (X[y_5_or_6], y_B))

(X_train_A, y_train_A), (X_train_B, y_train_B) = split_dataset(X_train, y_train)
(X_valid_A, y_valid_A), (X_valid_B, y_valid_B) = split_dataset(X_valid, y_valid)
(X_test_A, y_test_A), (X_test_B, y_test_B) = split_dataset(X_test, y_test)
X_train_B = X_train_B[:200]
y_train_B = y_train_B[:200]

### 1. Train on full data set

In [7]:
tf.random.set_seed(42)
np.random.seed(42)

# Create
model_A = keras.models.Sequential()
model_A.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_A.add(keras.layers.Dense(n_hidden, activation="selu"))
model_A.add(keras.layers.Dense(8, activation="softmax"))

# Compile
model_A.compile(loss="sparse_categorical_crossentropy",
                optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                metrics=["accuracy"])

# Fit
history = model_A.fit(X_train_A, y_train_A, epochs=20,
                    validation_data=(X_valid_A, y_valid_A))

Train on 43986 samples, validate on 4014 samples


Save

In [8]:
model_A.save("chap11_my_model_A.h5")

### 2. Train model on the 200 Binary Images

In [9]:
# Build
model_B = keras.models.Sequential()
model_B.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_B.add(keras.layers.Dense(n_hidden, activation="selu"))
model_B.add(keras.layers.Dense(1, activation="sigmoid"))

# Compile
model_B.compile(loss="binary_crossentropy",
                optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                metrics=["accuracy"])

# Fit
history = model_B.fit(X_train_B, y_train_B, epochs=20,
                      validation_data=(X_valid_B, y_valid_B))

Train on 200 samples, validate on 986 samples


So now we have two models: a weak one trained on only 200 images, and a much more powerful one trained on a bigger data set (but it doesn't quite do what we want, since it's not a binary classifier). How can we use parts of that model?

### 3. Transfer Learning: Reuse parts of good model for binary classification

Let's strip model A of its final (softmax) layer and replace it with a binary classifier layer.

In [10]:
# Get good model
model_A = keras.models.load_model("chap11_my_model_A.h5")
# Use every layer except last one
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
# Make a sigmoid final layer for binary classification
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

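Note that `model_B_on_A` shares its layers with `model_A`, so training it will also modify model A. To keep an untouched copy of model A, clone it (`clone_model` copies only the architecture, so the weights must be copied separately).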

In [11]:
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

First we freeze the reused layers so that the new final layer can adapt: its weights are initially random, so its large early error gradients could otherwise wreck the reused weights. The model must be compiled again after freezing or unfreezing layers.

In [12]:
# Freeze weights
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

# Compile
model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                     metrics=["accuracy"])

Now train it for a bit...

In [13]:
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_valid_B, y_valid_B))

Train on 200 samples, validate on 986 samples


Now unfreeze layers, compile again, and train some more...

In [14]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                     metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                           validation_data=(X_valid_B, y_valid_B))

Train on 200 samples, validate on 986 samples


### 4. Conclusion: Compare model trained on 200 images to reused model

Model B (the weak one trained on 200 images)

In [15]:
%%capture
B_results = model_B.evaluate(X_test_B, y_test_B)

In [16]:
print(B_results)

[0.14446661925315857, 0.9695]


Model that reused some layers of big model trained on lots of data.

In [17]:
%%capture
BonA_results = model_B_on_A.evaluate(X_test_B, y_test_B)

In [18]:
print(BonA_results)

[0.06893341457843781, 0.9925]


The accuracy went up by quite a bit! **HOWEVER**, this only worked because **of the random seed used**. This is called "torturing the data until it confesses". Some random seeds don't even show any improvement!
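To put a number on the gap between the two test results printed above, one can compare error rates rather than accuracies. This is simple arithmetic on the two reported accuracies. The variable names are chosen for this example.

```python
acc_B = 0.9695        # test accuracy of model B (from the output above)
acc_B_on_A = 0.9925   # test accuracy of model B-on-A (from the output above)

# Ratio of error rates: how many times fewer mistakes the reused model makes
error_ratio = (1 - acc_B) / (1 - acc_B_on_A)
print(round(error_ratio, 2))  # 4.07
```

So for this particular seed the error rate dropped by a factor of about 4, which is exactly the kind of result that deserves the caution expressed here.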

**Moral of the Story**: If a flashy new paper looks too positive, be suspicious: the new technique might not be as good as the authors say it is. They might have tried many variants until they got one that looked successful, without mentioning their many failures. It turns out that transfer learning does not work well with small dense networks, since they learn few patterns and those patterns are very specific. Transfer learning works best with deep convolutional networks. **Transfer learning will be revisited properly in chapter 14**.

# Unsupervised Pretraining

Labels are generally expensive to obtain in the real world, while unlabeled training data is cheap. So one typically does unsupervised pretraining on the many unlabeled instances, and then at the very end reuses the layers of this pretrained network and fine-tunes them on the small set of labeled data.

**Example**: Building a system to classify images of people when you only have a few pictures of each person. Pretrain a network on many images from the web to tell whether two pictures contain the same person, then reuse this network for your task.

# Faster Optimizers

Look at techniques other than the regular SGD method.

## Momentum Optimization

Change update structure to

1. $\mathbf{m} \to \beta \mathbf{m} -\eta \nabla_{\mathbf{\theta}}J(\mathbf{\theta})$
2. $\mathbf{\theta} \to \mathbf{\theta}+\mathbf{m}$


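This is like momentum in physics. It is easy to verify that with a constant gradient (a steady downward slope) the step size converges to a terminal velocity of $\eta \nabla_{\mathbf{\theta}}J/(1-\beta)$ in magnitude, i.e. $1/(1-\beta)$ times the plain gradient descent step (10 times for $\beta=0.9$). The purpose of this is to speed up gradient descent in flat patches.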
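The constant-gradient case can be checked numerically by iterating the momentum update: the step settles at $-\eta g/(1-\beta)$, where $g$ is the constant gradient. This is a minimal scalar sketch. `momentum_steps` and the chosen values are illustrative.

```python
def momentum_steps(grad, eta, beta, n_steps):
    """Iterate m <- beta * m - eta * grad with a constant gradient."""
    m = 0.0
    for _ in range(n_steps):
        m = beta * m - eta * grad
    return m

eta, beta, grad = 0.01, 0.9, 1.0
m_final = momentum_steps(grad, eta, beta, n_steps=500)

# Closed-form limit of the geometric series: -eta * grad / (1 - beta)
terminal = -eta * grad / (1 - beta)
speedup = terminal / (-eta * grad)   # relative to a plain gradient descent step

print(m_final, terminal, speedup)
```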

## Nesterov Accelerated Gradient

Change above to 

1. $\mathbf{m} \to \beta \mathbf{m} -\eta \nabla_{\mathbf{\theta}}J(\mathbf{\theta}+\beta \mathbf{m})$
2. $\mathbf{\theta} \to \mathbf{\theta}+\mathbf{m}$

The only difference is that the gradient is evaluated slightly ahead of the present position, at $\mathbf{\theta}+\beta \mathbf{m}$. This works because the momentum vector generally points in the right direction (it is an accumulated average of past steps and tends to point towards the optimum).
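Both update rules can be compared side by side on a toy problem. The sketch below assumes a one-dimensional cost $J(\theta)=\theta^2$, chosen only for illustration. The function `run` and its default hyperparameters are hypothetical.

```python
def grad_J(theta):
    return 2.0 * theta   # gradient of the toy cost J(theta) = theta**2

def run(nesterov, eta=0.05, beta=0.9, n_steps=30, theta=5.0):
    """Momentum optimization, optionally with the Nesterov look-ahead."""
    m = 0.0
    for _ in range(n_steps):
        # Nesterov evaluates the gradient at theta + beta * m instead of theta
        lookahead = theta + beta * m if nesterov else theta
        m = beta * m - eta * grad_J(lookahead)
        theta = theta + m
    return theta

print(run(nesterov=False))
print(run(nesterov=True))
```

Running both for more steps drives $\theta$ towards the minimum at 0 in either case. The look-ahead mainly damps the oscillations around it.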

Both procedures can be used in Keras as follows (`momentum` is the value of $\beta$):

In [19]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)

## Adagrad

This algorithm decays the learning rate, and does so faster for steep dimensions than for shallow dimensions:

1. $\mathbf{s} \to \mathbf{s} +\nabla_{\mathbf{\theta}}J(\mathbf{\theta}) \otimes \nabla_{\mathbf{\theta}}J(\mathbf{\theta})$
2. $\mathbf{\theta} \to \mathbf{\theta}-\eta \nabla_{\mathbf{\theta}}J(\mathbf{\theta}) \oslash \sqrt{\mathbf{s}+\epsilon} $

Note the elementwise multiply and divide symbols. $\epsilon$ is just there to ensure no division by 0.

This algorithm usually does not work well for neural networks because the learning rate decays so much that it stops too early. But it is important to understand for the upcoming algorithms.
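The two AdaGrad steps can be sketched in plain Python on an elongated bowl, $J = \frac{1}{2}(\theta_0^2 + 100\,\theta_1^2)$. The cost function, the helper names, and the hyperparameter values are assumptions made for this illustration.

```python
import math

def adagrad(grad_fn, theta, eta=0.5, eps=1e-10, n_steps=50):
    """Plain AdaGrad on a list of parameters; returns final theta and accumulator s."""
    s = [0.0] * len(theta)
    for _ in range(n_steps):
        g = grad_fn(theta)
        s = [si + gi * gi for si, gi in zip(s, g)]          # step 1: accumulate squared gradients
        theta = [ti - eta * gi / math.sqrt(si + eps)        # step 2: scaled update
                 for ti, gi, si in zip(theta, g, s)]
    return theta, s

def grad_bowl(theta):
    # Gradient of the elongated bowl: the second dimension is 100 times steeper
    return [theta[0], 100.0 * theta[1]]

theta, s = adagrad(grad_bowl, [1.0, 1.0])
effective_lr = [0.5 / math.sqrt(si + 1e-10) for si in s]

print(theta)
print(effective_lr)
```

The steep dimension accumulates a much larger `s` and therefore gets a much smaller effective learning rate, so both coordinates progress at the same pace. Because `s` only ever grows, the effective learning rate keeps shrinking in every dimension.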

## RMSProp

Fixes AdaGrad by accumulating only the gradients from the most recent iterations, using exponential decay in the first step:

1. $\mathbf{s} \to \beta \mathbf{s} + (1-\beta)\nabla_{\mathbf{\theta}}J(\mathbf{\theta}) \otimes \nabla_{\mathbf{\theta}}J(\mathbf{\theta})$
2. $\mathbf{\theta} \to \mathbf{\theta}-\eta \nabla_{\mathbf{\theta}}J(\mathbf{\theta}) \oslash \sqrt{\mathbf{s}+\epsilon} $

$\beta$ is typically set to 0.9 (in Keras this is the `rho` argument of `keras.optimizers.RMSprop`).
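The effect of the decay can be seen by feeding a constant gradient into both accumulators. This is a scalar sketch. `accumulate` is a hypothetical helper and the gradient value 2.0 is arbitrary.

```python
def accumulate(grad, n_steps, beta=None):
    """Return s after n_steps of a constant gradient.

    beta=None  -> AdaGrad accumulation: s <- s + g*g
    beta=float -> RMSProp accumulation: s <- beta*s + (1 - beta)*g*g
    """
    s = 0.0
    for _ in range(n_steps):
        if beta is None:
            s = s + grad * grad
        else:
            s = beta * s + (1 - beta) * grad * grad
    return s

s_adagrad = accumulate(2.0, n_steps=1000)            # keeps growing with every step
s_rmsprop = accumulate(2.0, n_steps=1000, beta=0.9)  # levels off near g*g

print(s_adagrad, s_rmsprop)
```

Since the RMSProp accumulator levels off, the effective learning rate no longer decays to zero.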

## Adam and Nadam 

Putting everything together (momentum optimization plus RMSProp):

1. $\mathbf{m} \to \beta_1 \mathbf{m} -(1-\beta_1)\nabla_{\mathbf{\theta}}J(\mathbf{\theta})$
2. $\mathbf{s} \to \beta_2 \mathbf{s} + (1-\beta_2)\nabla_{\mathbf{\theta}}J(\mathbf{\theta}) \otimes \nabla_{\mathbf{\theta}}J(\mathbf{\theta})$
3. $\hat{\mathbf{m}} \to (1-\beta_1^T)^{-1} \mathbf{m}$
4. $\hat{\mathbf{s}} \to (1-\beta_2^T)^{-1} \mathbf{s}$
5. $\mathbf{\theta} \to \mathbf{\theta} + \eta \hat{\mathbf{m}} \oslash \sqrt{\hat{\mathbf{s}}+\epsilon}$

$T$ is the iteration number (starting at 1) and appears as an exponent. Steps 3 and 4 are included because $\mathbf{m}$ and $\mathbf{s}$ are initialized at zero and so are biased towards zero early in training; dividing by $1-\beta^T$ boosts them. $\beta_1$ is typically 0.9 while $\beta_2$ is typically 0.999.
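The five steps can be traced with a scalar sketch. It assumes a toy cost $J(\theta)=\theta^2$ and uses hypothetical names (`adam_minimize`, `grad_J`). The value of `eps` is an arbitrary small constant chosen here.

```python
import math

def adam_minimize(grad_fn, theta, eta=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-7, n_steps=1000):
    """Scalar Adam following the five steps above (m tracks the negative gradient)."""
    m, s = 0.0, 0.0
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m - (1 - beta1) * g          # step 1
        s = beta2 * s + (1 - beta2) * g * g      # step 2
        m_hat = m / (1 - beta1 ** t)             # step 3: bias correction
        s_hat = s / (1 - beta2 ** t)             # step 4: bias correction
        theta = theta + eta * m_hat / math.sqrt(s_hat + eps)   # step 5
    return theta

def grad_J(theta):
    return 2.0 * theta   # gradient of the toy cost J(theta) = theta**2

theta_one_step = adam_minimize(grad_J, 5.0, n_steps=1)
theta_after = adam_minimize(grad_J, 5.0)

print(theta_one_step, theta_after)
```

Thanks to the bias correction, the very first step already has magnitude close to $\eta$ regardless of the gradient's scale.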

In [20]:
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

Nadam is just Adam with the Nesterov trick.

## Summary

See table 11-2 for good summary of all optimizers.

# Learning Rate Scheduling

We also need to deal with the learning rate parameter $\eta$ and how it changes during training. Pages 360-361 go over a few techniques for decreasing the learning rate over time, but they all have the same structure:

**Start with high learning rate and generally decrease over time**.

## Power Scheduling

With power scheduling (here with power 1), $\eta(t) = \eta_0/(1+t/s)$: after $s$ steps $\eta = \eta_0/2$, after $s$ more steps $\eta = \eta_0/3$, then $\eta_0/4$, $\eta_0/5$, and so on. The `decay` hyperparameter is the inverse of $s$.
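The schedule described above can be written as a one-line function. This sketch assumes the general form $\eta_0/(1+t/s)^c$ with the power $c$ set to 1. The function name and default values are illustrative, with `s=1000` corresponding to a decay of `1e-3`.

```python
def power_schedule(step, lr0=0.01, s=1000, c=1):
    """Learning rate after `step` training steps under power scheduling."""
    return lr0 / (1 + step / s) ** c

# The learning rate after 0, s, 2s and 3s steps: lr0, lr0/2, lr0/3, lr0/4
for step in (0, 1000, 2000, 3000):
    print(step, power_schedule(step))
```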

In [21]:
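# `decay` is the legacy argument (decay = 1/s); recent Keras versions use
# keras.optimizers.schedules.InverseTimeDecay for the same schedule
optimizer = keras.optimizers.SGD(learning_rate=0.01, decay=1e-3)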

## User Specified (Exponential Decay Example)

Maybe you want the learning rate to decay in some user-specified way. You need a function that returns the learning rate as a function of the epoch. The easiest way of doing this with configurable parameters is a function that returns a function (a closure):

In [22]:
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0*0.1**(epoch/s)
    return exponential_decay_fn

Now get function

In [23]:
exponential_decay_fn = exponential_decay(lr0=0.01, s=20)
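To see what this schedule produces, the closure can be evaluated at a few epochs. The definition is repeated so the snippet runs on its own. `fn` and `rates` are names chosen for this example.

```python
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1 ** (epoch / s)
    return exponential_decay_fn

fn = exponential_decay(lr0=0.01, s=20)

# The learning rate drops by a factor of 10 every s = 20 epochs
rates = [fn(epoch) for epoch in (0, 10, 20, 40)]
print(rates)
```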

Need to create learning rate scheduler

In [24]:
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)

**Need to use this as a callback**

In [25]:
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_valid_B, y_valid_B),
                           callbacks=[lr_scheduler])

Train on 200 samples, validate on 986 samples


Can also update every step (rather than every epoch). See Geron notebook for more details.

**Important**: When you reload a model and fit it again, the epoch counter restarts at 0, so the schedule restarts from the initial (high) learning rate, which can hurt the fit. To get around this, store the epoch number as well and pass it to `fit()` via the `initial_epoch` argument.

## Decreasing Learning Rate when no Improvement is seen

Create a new scheduler that decreases the learning rate when no improvement is seen on the validation set.

In [26]:
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

This one multiplies the learning rate by 0.5 whenever the monitored metric (the validation loss by default) has not improved for 5 consecutive epochs.

# Avoiding Overfitting Through Regularization

## $l_1$ and $l_2$ regularization

Regularization is added per layer. As before, $l_1$ regularization creates sparser models (more weights exactly equal to zero) than $l_2$ regularization. This can be motivated by comparing the two norms of a weight vector $(1, 9)$:

$l_2$ Norm:
$\sqrt{1^2+9^2} = 9.06$

whereas

$l_1$ Norm:
$1+9 = 10$

Setting the small weight to zero lowers the $l_1$ norm by 1 (from 10 to 9) but the $l_2$ norm by only 0.06 (from 9.06 to 9). The 1 thus contributes far more to the $l_1$ norm (second equation) than to the $l_2$ norm (first equation), so it tends to be eliminated more frequently if it is less important.
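The same comparison can be done numerically by zeroing the small weight and measuring how much each norm drops. This is a small sketch. The helper names `l1`, `l2`, and the variables are chosen for this example.

```python
import math

def l1(w):
    return sum(abs(x) for x in w)

def l2(w):
    return math.sqrt(sum(x * x for x in w))

w = [1.0, 9.0]          # original weights
w_sparse = [0.0, 9.0]   # same weights with the small one set to zero

l1_saving = l1(w) - l1(w_sparse)   # how much the l1 norm drops
l2_saving = l2(w) - l2(w_sparse)   # how much the l2 norm drops

print(l1_saving, l2_saving)
```

The $l_1$ norm rewards removing the small weight far more than the $l_2$ norm does.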

In [48]:
layer = keras.layers.Dense(100, activation="elu", kernel_initializer='he_normal',
                          kernel_regularizer=keras.regularizers.l2(0.01))

Since you typically use the same activation functions, initializers, and regularizers for all layers, it is a good idea to create NNs in loops. Another option is Python's `functools.partial`, which creates a thin wrapper around any callable with some default argument values:

In [49]:
from functools import partial

# Create Wrapper
RegularizedDense = partial(keras.layers.Dense,
                          activation='elu',
                          kernel_initializer='he_normal',
                          # note: l2(0) applies no penalty; use a nonzero factor such as 0.01 to regularize
                          kernel_regularizer=keras.regularizers.l2(0))

# Create model
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation='softmax', kernel_initializer='glorot_uniform')
])

Note that the arguments set in the wrapper are only defaults and can be overridden per layer, as in the output layer above.

## Dropout

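With dropout, at every training step each neuron has a probability $p$ of being temporarily "turned off": it outputs 0 for that step, so it is ignored and its weights are not changed during that iteration. This technique often gives a noticeable accuracy boost. The hyperparameter $p$ is called the dropout rate and is usually set between 10 and 50 percent.

However, since some neurons are dropped during training, the weighted sum $\sum w_ix_i$ that a neuron receives is smaller during training than at test time, when all neurons are active. To compensate, the input connection weights are multiplied by the keep probability $1-p$ after training or, equivalently, each neuron's output is divided by $1-p$ during training (which is what Keras does).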
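The rescaling argument can be checked with a small simulation. The sketch below uses the variant that divides the surviving values by the keep probability during training. The `dropout` function, the seed, and the input of all ones are assumptions for this illustration.

```python
import random

def dropout(inputs, p, rng):
    """Zero each input with probability p and rescale survivors by 1 / (1 - p)."""
    keep = 1.0 - p
    return [x / keep if rng.random() >= p else 0.0 for x in inputs]

rng = random.Random(42)   # fixed seed so the run is repeatable
x = [1.0] * 10000
dropped = dropout(x, p=0.2, rng=rng)

# Thanks to the rescaling, the average activation stays close to the original 1.0
mean_out = sum(dropped) / len(dropped)
print(mean_out)
```

Without the division by `keep`, the mean would be close to 0.8 instead, which is the mismatch between training and test time described above.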

It is implemented as follows:

In [51]:
# Create model
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    RegularizedDense(300),
    keras.layers.Dropout(rate=0.2),
    RegularizedDense(100),
    keras.layers.Dropout(rate=0.2),
    RegularizedDense(10, activation='softmax')
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 5
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))

Train on 55000 samples, validate on 5000 samples


**If a model is overfitting it is usually a good idea to increase the dropout rate**.

**For SELU activation functions use alpha dropout (`keras.layers.AlphaDropout`), which preserves the mean and standard deviation of its inputs.**

### Monte Carlo Dropout

Monte Carlo dropout works by making many predictions on the test set while the dropout layers are active, then averaging the predictions.

In [52]:
tf.random.set_seed(42)
np.random.seed(42)

Get 100 predictions for each test sample (100 different Monte Carlo dropout passes), giving an array of shape [100, 10000, 10].

In [53]:
y_probas = np.stack([model(X_test_scaled, training=True)
                     for sample in range(100)])
y_proba = y_probas.mean(axis=0)
y_std = y_probas.std(axis=0)






Let's compare the model's prediction (with training=False) to the Monte Carlo dropout estimates for the first test sample.

In [54]:
# From the model
np.round(model.predict(X_test_scaled[:1]), 2)

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.1 , 0.  , 0.89]],
      dtype=float32)

In [56]:
# Monte Carlo Dropout Mean
np.round(y_proba[:1], 2)

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.05, 0.  , 0.16, 0.  , 0.79]],
      dtype=float32)

In [57]:
# Monte Carlo Dropout STD
np.round(y_std[:1], 2)

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.06, 0.  , 0.18, 0.  , 0.17]],
      dtype=float32)

Clearly there is some standard deviation in the estimates obtained through dropout. These uncertain predictions should be treated with extreme caution in highly sensitive systems.

In [60]:
# Monte Carlo Dropout
y_pred = np.argmax(y_proba, axis=1)
accuracy = np.sum(y_pred == y_test) / len(y_test)
print(accuracy)

0.8668


In [61]:
# Regular Model
y_pred = np.argmax(model.predict(X_test_scaled), axis=1)
accuracy = np.sum(y_pred == y_test) / len(y_test)
print(accuracy)

0.8672


The two accuracies are nearly identical (0.8668 vs. 0.8672), so the difference here is within noise.

**Here we set `training=True` when predicting on the test samples. But what if the model contains other layers that behave differently during training (such as batch normalization) and that we don't want switched on?** In that case, subclass the Dropout layer and force `training=True` only there:

In [62]:
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

Use this class in place of the Dropout layers when building the model:

In [63]:
mc_model = keras.models.Sequential([
    MCDropout(layer.rate) if isinstance(layer, keras.layers.Dropout) else layer
    for layer in model.layers
])