# Training Deep Neural Networks Pt.2


Today we will learn about:


1. Reusing Pretrained Layers
2. Optimizers



In [None]:
import tensorflow as tf
from tensorflow import keras

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

## Load Dataset

In [None]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz


# Reusing Pretrained Layers


Let's split the Fashion MNIST training set in two:

 

*   X_train_A: all images of all items except for sandals and shirts (classes 5 and 6).
*   X_train_B: a much smaller training set of just the first 200 images of sandals or shirts.


The validation set and the test set are also split this way, but without restricting the number of images.

We will train a model on set A (a classification task with 8 classes) and try to reuse it to tackle set B (binary classification). We hope to transfer some knowledge from task A to task B, since the classes in set A (sneakers, ankle boots, coats, T-shirts, etc.) are somewhat similar to the classes in set B (sandals and shirts). However, since we are using Dense layers, only patterns that occur at the same location can be reused. In contrast, convolutional layers transfer much better, since learned patterns can be detected anywhere in the image, as we will see in the CNN chapter.
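
To make the point about Dense layers concrete, here is a minimal NumPy sketch on a toy 1-D signal. The signal, the 3-value pattern and the helper `signal_with_pattern` are illustrative assumptions and not part of the notebook's models; `np.correlate` stands in for a single convolutional filter.

```python
import numpy as np

pattern = np.array([1.0, 2.0, 1.0])

def signal_with_pattern(start, length=10):
    """Return a zero signal with `pattern` written at index `start` (hypothetical helper)."""
    x = np.zeros(length)
    x[start:start + len(pattern)] = pattern
    return x

x_seen = signal_with_pattern(2)      # location the dense unit was tuned for
x_shifted = signal_with_pattern(6)   # same pattern at a different location

# A dense unit has one weight per input position, so its detector is tied to index 2.
w_dense = signal_with_pattern(2)
dense_seen = w_dense @ x_seen        # 1*1 + 2*2 + 1*1 = 6
dense_shifted = w_dense @ x_shifted  # no overlap with the weights, so 0

# A convolutional filter slides the same 3 weights over every position.
conv_seen = np.correlate(x_seen, pattern, mode="valid").max()
conv_shifted = np.correlate(x_shifted, pattern, mode="valid").max()

print(dense_seen, dense_shifted)  # 6.0 0.0
print(conv_seen, conv_shifted)    # 6.0 6.0
```

The dense unit stops responding as soon as the pattern moves, while the sliding filter responds equally strongly at both locations.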

### Train Model A

In [None]:
def split_dataset(X, y):
    y_5_or_6 = (y == 5) | (y == 6) # sandals or shirts
    y_A = y[~y_5_or_6]
    y_A[y_A > 6] -= 2 # class indices 7, 8, 9 should be moved to 5, 6, 7
    y_B = (y[y_5_or_6] == 6).astype(np.float32) # binary classification task: is it a shirt (class 6)?
    return ((X[~y_5_or_6], y_A),
            (X[y_5_or_6], y_B))

(X_train_A, y_train_A), (X_train_B, y_train_B) = split_dataset(X_train, y_train)
(X_valid_A, y_valid_A), (X_valid_B, y_valid_B) = split_dataset(X_valid, y_valid)
(X_test_A, y_test_A), (X_test_B, y_test_B) = split_dataset(X_test, y_test)
X_train_B = X_train_B[:200]
y_train_B = y_train_B[:200]
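
The label handling in `split_dataset` is easier to follow on a tiny example. The sketch below repeats the same three steps on a hand-made label array (the values in `y` are made up for illustration).

```python
import numpy as np

# Toy labels drawn from the ten Fashion MNIST class indices (0..9).
y = np.array([0, 5, 6, 7, 8, 9, 4, 6])

is_5_or_6 = (y == 5) | (y == 6)   # task B: sandals (5) or shirts (6)

y_A = y[~is_5_or_6]               # boolean indexing returns a copy, so y is untouched
y_A[y_A > 6] -= 2                 # class indices 7, 8, 9 become 5, 6, 7
y_B = (y[is_5_or_6] == 6).astype(np.float32)  # 1.0 = shirt, 0.0 = sandal

print(y_A)  # [0 5 6 7 4]
print(y_B)  # [0. 1. 1.]
```

After the shift, task A uses the contiguous class indices 0 to 7, which is what the 8-unit softmax output layer expects.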

In [None]:
model_A = keras.models.Sequential()
model_A.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_A.add(keras.layers.Dense(n_hidden, activation="selu"))
model_A.add(keras.layers.Dense(8, activation="softmax"))

In [None]:
model_A.compile(loss="sparse_categorical_crossentropy",
                optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                metrics=["accuracy"])


In [None]:
history = model_A.fit(X_train_A, y_train_A, epochs=20,
                    validation_data=(X_valid_A, y_valid_A))



In [None]:
model_A.save("my_model_A.h5")

### Train Model B

Let's first train model B from scratch, as a baseline for comparison.

In [None]:
model_B = keras.models.Sequential()
model_B.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_B.add(keras.layers.Dense(n_hidden, activation="selu"))
model_B.add(keras.layers.Dense(1, activation="sigmoid"))

In [None]:
model_B.compile(loss="binary_crossentropy",
                optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                metrics=["accuracy"])

In [None]:
history = model_B.fit(X_train_B, y_train_B, epochs=20,
                      validation_data=(X_valid_B, y_valid_B))

### Train Model B with Model A

In [None]:
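# Reuse every layer of model A except its 8-class output layer,
# then add a fresh sigmoid output layer for the binary task.
model_A = keras.models.load_model("my_model_A.h5")
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

In [None]:
# model_B_on_A shares its layers with model_A, so training it also changes model_A.
# Keep an untouched copy: clone_model() copies only the architecture,
# so the weights have to be copied over separately.
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())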

In [None]:
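# Freeze the reused layers at first, so the randomly initialized output layer
# can learn reasonable weights without disturbing the pretrained ones.
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

# Compile after changing `trainable` so that the change takes effect.
model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                     metrics=["accuracy"])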

In [None]:
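# Phase 1: train only the new output layer for a few epochs.
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_valid_B, y_valid_B))

# Phase 2: unfreeze the reused layers and fine-tune the whole model.
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

# Recompile so that the change to `trainable` takes effect.
model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                     metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                           validation_data=(X_valid_B, y_valid_B))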

# Optimizers

In neural networks, there are many methods for finding good parameter values, but the most popular is gradient descent. Remember that our objective is to minimize a cost function $J$. The idea is that by repeatedly moving the weights against the gradient of the cost function with respect to the weights, we can eventually reach weights with a low cost value.

In general, a weight update can be expressed by the equation below. Keep in mind that the negative sign can be placed either inside the update term ($\Delta \theta_{t}$), as we do here, or outside of it.

$\theta_{t+1}=\theta_{t}+\Delta \theta_{t}$

Standard stochastic gradient descent, usually referred to as SGD, is the simplest form of gradient descent. Its only hyperparameter is the learning rate $\eta$.

$\Delta \theta_{t}=-\eta\frac{\delta J}{\delta{\theta}}$
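
As a rough illustration of this update rule, the sketch below applies it to a single weight. The toy cost $J(\theta)=\theta^2$, the starting value and the learning rate are assumptions chosen for the example, and the gradient is computed on the full cost rather than on mini-batches.

```python
def grad_J(theta):
    """Gradient of the toy cost J(theta) = theta**2 (an assumed example cost)."""
    return 2.0 * theta

eta = 0.1      # learning rate
theta = 5.0    # arbitrary starting weight

for _ in range(50):
    delta = -eta * grad_J(theta)   # Delta theta_t = -eta * dJ/dtheta
    theta = theta + delta          # theta_{t+1} = theta_t + Delta theta_t

print(theta)   # very close to the minimum at 0
```

With this cost, each step multiplies the weight by $1-2\eta=0.8$, so it shrinks steadily toward the minimum.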


In [None]:
optim_sgd = tf.keras.optimizers.SGD(
    learning_rate=0.01, momentum=0.0, nesterov=False, name="SGD"
)

SGD with Momentum: the momentum term helps stabilize the update by adding a short-term memory. Mathematically, this is done by including the previous update in the calculation of the current update. There are now two hyperparameters: the learning rate $\eta$ and the momentum coefficient $\mu$.

$\Delta \theta_{t}=\mu\Delta \theta_{t-1}-\eta\frac{\delta J}{\delta{\theta}}$ 
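
A minimal sketch of the momentum update on a single weight follows. The toy cost $J(\theta)=\theta^2$ and the hyperparameter values are assumptions for illustration only.

```python
def grad_J(theta):
    """Gradient of the toy cost J(theta) = theta**2 (an assumed example cost)."""
    return 2.0 * theta

eta, mu = 0.1, 0.9        # learning rate and momentum coefficient
theta, delta = 5.0, 0.0   # starting weight and previous update (none yet)
history = []

for _ in range(200):
    delta = mu * delta - eta * grad_J(theta)   # previous update is carried over
    theta = theta + delta
    history.append(theta)

print(theta)          # close to the minimum at 0
print(min(history))   # negative: the accumulated velocity overshoots the minimum
```

The overshoot is the price of the short-term memory: the weight keeps moving in the previous direction for a while, even after the gradient has changed sign.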

In [None]:
optim_sgd_mom = tf.keras.optimizers.SGD(
    learning_rate=0.01, momentum=0.9, nesterov=False, name="SGD_Momentum"
)

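Nesterov Momentum: Nesterov momentum adjusts the direction of the regular momentum update by evaluating the gradient at the look-ahead position $\theta_{t}+\mu\Delta \theta_{t-1}$ instead of at the current position $\theta_{t}$.

$\Delta \theta_{t}=\mu\Delta \theta_{t-1}-\eta\frac{\delta J}{\delta{\theta}}(\theta_{t}+\mu\Delta \theta_{t-1})$ 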
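
The only change compared with regular momentum is where the gradient is evaluated. The sketch below shows this on a single weight, again assuming the toy cost $J(\theta)=\theta^2$ and example hyperparameter values.

```python
def grad_J(theta):
    """Gradient of the toy cost J(theta) = theta**2 (an assumed example cost)."""
    return 2.0 * theta

eta, mu = 0.1, 0.9        # learning rate and momentum coefficient
theta, delta = 5.0, 0.0   # starting weight and previous update (none yet)

for _ in range(100):
    lookahead = theta + mu * delta                # where momentum alone would take us
    delta = mu * delta - eta * grad_J(lookahead)  # gradient measured at the look-ahead point
    theta = theta + delta

print(theta)   # close to the minimum at 0
```

Because the gradient is measured slightly ahead, the update starts to brake earlier when the weight is about to pass the minimum.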

In [None]:
optim_sgd_ntv = tf.keras.optimizers.SGD(
    learning_rate=0.01, momentum=0.9, nesterov=True, name="SGD_Nesterov"
)

Adagrad is an optimizer with parameter-specific learning rates, which are adapted relative to how frequently a parameter gets updated during training. The more updates a parameter receives, the smaller the updates.

$v_t=v_{t-1}+(\frac{\delta J}{\delta{\theta}})^{2} $

$\Delta \theta_{t}=-\frac{\eta}{\sqrt{v_t+\epsilon}}(\frac{\delta J}{\delta{\theta}})$

Adagrad essentially needs only one hyperparameter, the initial learning rate $\eta$. In practice, we also set a small value $\epsilon$ to avoid division by zero.
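
The parameter-specific learning rates can be seen in a small NumPy sketch with two weights. The toy cost (steep in the first weight, flat in the second) and the value $\eta=0.5$ are assumptions made for the example.

```python
import numpy as np

def grad_J(theta):
    """Gradient of the toy cost J = 0.5 * (100 * theta[0]**2 + theta[1]**2)."""
    return np.array([100.0, 1.0]) * theta

eta, eps = 0.5, 1e-7
theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)   # running sum of squared gradients, one entry per weight

for _ in range(50):
    g = grad_J(theta)
    v = v + g ** 2                               # v_t = v_{t-1} + g^2, never shrinks
    theta = theta - eta / np.sqrt(v + eps) * g   # per-weight scaled step

effective_lr = eta / np.sqrt(v + eps)
print(theta)          # both weights end up close to 0
print(effective_lr)   # much smaller for the weight with the large gradients
```

The weight that received large gradients ends up with a far smaller effective learning rate, and since $v_t$ only ever grows, both rates keep shrinking as training goes on.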

In [None]:
optim_adagrad = tf.keras.optimizers.Adagrad(
    learning_rate=1,
    epsilon=1e-07,
    name="Adagrad"
)

RMSProp uses a moving average of squared gradients to scale the weight update. It was proposed by Geoffrey Hinton (one of the "godfathers" of deep learning) to solve the problem of diminishing learning rates in AdaGrad. RMSProp uses an exponentially decaying moving average, instead of accumulating all past squared gradients as AdaGrad does.
Expressing the moving average of squared gradients as $v$, we can calculate the weight update as follows:

$v_t=\rho v_{t-1}+(1-\rho)(\frac{\delta J}{\delta{\theta}})^{2} $

$\Delta \theta_{t}=-\frac{\eta}{\sqrt{v_t+\epsilon}}(\frac{\delta J}{\delta{\theta}})$ 

There are two essential hyperparameters: the discounting factor $\rho$ and the learning rate $\eta$. Again, we add a small $\epsilon$ for numerical stability.
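
A minimal single-weight sketch of these two equations is given below. The toy cost $J(\theta)=\theta^2$, the starting value and the hyperparameter values are assumptions for illustration.

```python
import math

def grad_J(theta):
    """Gradient of the toy cost J(theta) = theta**2 (an assumed example cost)."""
    return 2.0 * theta

eta, rho, eps = 0.01, 0.9, 1e-7
theta, v = 5.0, 0.0   # starting weight and moving average of squared gradients

for _ in range(1000):
    g = grad_J(theta)
    v = rho * v + (1.0 - rho) * g * g              # old gradients are gradually forgotten
    theta = theta - eta / math.sqrt(v + eps) * g   # step is roughly eta in size

print(theta)   # ends up near the minimum at 0
```

Because the gradient is divided by its own recent magnitude, each step is roughly $\eta$ in size regardless of the gradient scale, and the effective learning rate does not decay toward zero as it does with AdaGrad.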

In [None]:
optim_rmsprop = tf.keras.optimizers.RMSprop(
    learning_rate=0.001,
    rho=0.9,
    epsilon=1e-07,
    name="RMSprop"
)

Adadelta optimization is a stochastic gradient descent method that uses an adaptive learning rate per dimension to address two drawbacks of AdaGrad:

*   The continual decay of learning rates throughout training.
*   The need for a manually selected global learning rate.

Like RMSProp, AdaDelta tries to solve the diminishing learning rate problem of AdaGrad. The two were developed independently, which is why they are quite similar. AdaDelta uses a second exponentially decaying average, this time not of squared gradients but of squared parameter updates.

$v_t=\rho v_{t-1}+(1-\rho)(\frac{\delta J}{\delta{\theta}})^{2} $

$\Delta \theta_{t}=-\frac{\eta\sqrt{x_{t-1}+\epsilon}}{\sqrt{v_t+\epsilon}}(\frac{\delta J}{\delta{\theta}})$ 

$x_t=\rho x_{t-1}+(1-\rho)(\Delta \theta_{t})^{2} $

In the original paper, AdaDelta does not use a learning rate parameter (which corresponds to $\eta=1$), but in Keras we can set one just as in the other methods, as written in the equation.
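
The sketch below runs the AdaDelta update on a single weight with $\eta=1$, as in the original paper. It uses the previous accumulator $x_{t-1}$ in the numerator, because $x_t$ itself depends on the current update. The toy cost $J(\theta)=\theta^2$ and the values of $\rho$ and $\epsilon$ are assumptions for the example.

```python
import math

def grad_J(theta):
    """Gradient of the toy cost J(theta) = theta**2 (an assumed example cost)."""
    return 2.0 * theta

eta, rho, eps = 1.0, 0.95, 1e-7
theta = 5.0
v = 0.0   # moving average of squared gradients
x = 0.0   # moving average of squared parameter updates

for _ in range(100):
    g = grad_J(theta)
    v = rho * v + (1.0 - rho) * g * g
    delta = -eta * math.sqrt(x + eps) / math.sqrt(v + eps) * g   # uses x from the previous step
    x = rho * x + (1.0 - rho) * delta * delta
    theta = theta + delta

print(theta)   # slightly below the starting value of 5.0
```

In this setup the weight moves only a little in the first 100 steps: with $x_0=0$, the step size has to be bootstrapped from $\epsilon$, so AdaDelta starts slowly.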

In [None]:
optim_adadelta = tf.keras.optimizers.Adadelta(
    learning_rate=0.001,
    rho=0.9,
    epsilon=1e-07,
    name="Adadelta"
)

Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments. Adam is essentially RMSProp (which rescales gradients based on an accumulated second moment, the moving average of squared gradients) plus momentum (which smooths gradients based on an accumulated first moment, the moving average of gradients).

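$s_t=\beta_{1} s_{t-1}+(1-\beta_{1})(\frac{\delta J}{\delta{\theta}}) $

$v_t=\beta_{2} v_{t-1}+(1-\beta_{2})(\frac{\delta J}{\delta{\theta}})^{2} $

$\Delta \theta_{t}=-\frac{\eta s_t}{\sqrt{v_t+\epsilon}}$ 

Here $s_t$ is the moving average of the gradients (first moment) and $v_t$ is the moving average of the squared gradients (second moment). The full algorithm also applies a bias correction to $s_t$ and $v_t$, which is omitted here for brevity.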
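
A minimal single-weight sketch of Adam is shown below. Unlike the simplified equations, it includes the bias-correction step of the full algorithm. The toy cost $J(\theta)=\theta^2$, the starting value and $\eta=0.1$ are assumptions for the example.

```python
import math

def grad_J(theta):
    """Gradient of the toy cost J(theta) = theta**2 (an assumed example cost)."""
    return 2.0 * theta

eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-7
theta = 5.0
s = 0.0   # moving average of gradients (first moment)
v = 0.0   # moving average of squared gradients (second moment)

for t in range(1, 1001):
    g = grad_J(theta)
    s = beta1 * s + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    s_hat = s / (1.0 - beta1 ** t)   # bias correction: s and v start at 0,
    v_hat = v / (1.0 - beta2 ** t)   # so early averages would be too small
    theta = theta - eta * s_hat / math.sqrt(v_hat + eps)

print(theta)   # close to the minimum at 0
```

The first moment plays the role of momentum, while dividing by the square root of the second moment rescales the step as in RMSProp.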


In [None]:
optim_adam = tf.keras.optimizers.Adam(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-07,
    name="Adam"
)

Generally, we prefer adaptive learning rate or momentum methods: as training progresses through more epochs, we need a smaller learning rate to avoid fluctuation of the weights. However, there are cases where SGD with Nesterov momentum gives better performance once we find the right hyperparameters.