1. It is not okay to initialize all the weights to the same value, because this **creates a symmetry** that backpropagation cannot break. Every neuron in a given layer computes the same output and receives the same gradient, so their weights stay identical throughout training. The network then behaves as if it had just one neuron per layer, only slower, and it is virtually impossible for it to converge to a good solution.
2. It is fine to initialize the biases to zero; it makes little difference whether they start at zero or not.
3. Compared with ReLU, the SELU function:
    a. allows the network to self-normalize if all layers are dense and use the SELU activation function
    b. often makes training converge faster
    c. always has a non-zero derivative, which avoids the dying-units problem that can affect ReLU
    d. can output negative values, so the mean output of each layer is closer to 0 than with ReLU, whose outputs are never negative; this helps mitigate the risk of vanishing gradients
4. When to use each activation function:
    a. SELU for general use cases
    b. ELU for networks whose architecture prevents them from self-normalizing
    c. Leaky ReLU if you are trying to minimize runtime latency and do not want to tweak another hyperparameter
    d. RReLU if there is spare time and computing power
    e. PReLU if the training set is huge
    f. Vanilla ReLU because it is the most widely used
    g. tanh for outputs between -1 and 1
    h. logistic sigmoid for estimating a probability; rarely used in hidden layers
    i. softmax for outputting probabilities for mutually exclusive classes; rarely used in hidden layers
5. If the momentum hyperparameter is set too close to 1 (for example, 0.99999), there is too little friction to keep the momentum from growing very large. Gradient descent picks up a lot of speed, overshoots the minimum, turns back, overshoots again, and oscillates like this many times before converging, so training takes much longer than with a smaller momentum value.
6. Three ways to produce a sparse model:
    a. do normal training and make the tiny weights zero
    b. l<sub>1</sub> regularization
    c. use the TF Model Optimization Toolkit
7. Dropout slows down training, but it generally makes the model perform better. It has no impact on inference speed because it is only active during training. MC Dropout is the same as regular Dropout during training, but it stays active during inference, which slows each prediction slightly. More importantly, predictions must be run 10 times or more and averaged to get better results, so inference is slower by a factor of at least 10.
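
The oscillation described in answer 5 can be seen on a toy problem. The sketch below is an illustration only, not part of the exercise: it runs momentum gradient descent on the one-dimensional function f(x) = x²/2 and counts the steps until both the position and the velocity are close to zero. The function name `steps_to_converge`, the learning rate, and the tolerance are arbitrary choices made for this example.

```python
def steps_to_converge(momentum, learning_rate=0.1, tol=1e-3, max_steps=200_000):
    """Run momentum gradient descent on f(x) = x**2 / 2 and count the steps
    until both the position and the velocity are within `tol` of zero."""
    x, velocity = 1.0, 0.0
    for step in range(1, max_steps + 1):
        gradient = x  # derivative of x**2 / 2
        velocity = momentum * velocity - learning_rate * gradient
        x += velocity
        if abs(x) < tol and abs(velocity) < tol:
            return step
    return max_steps  # did not converge within the step budget


for momentum in (0.9, 0.999):
    print(f"momentum={momentum}: {steps_to_converge(momentum)} steps")
```

With the higher momentum the iterate keeps swinging back and forth across the minimum, so it needs far more steps to settle, even though the learning rate is unchanged.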

In [152]:
from tensorflow.keras import layers, models, optimizers
from functools import partial

RegularizedDense = partial(layers.Dense,activation="elu", kernel_initializer="he_normal")

model = models.Sequential(
    [layers.Flatten(input_shape=[32, 32, 3])] +
    [RegularizedDense(100) for _ in range(20)] + 
    [layers.Dense(10, activation="softmax")]
)

In [153]:
model.summary()

Model: "sequential_38"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_47 (Flatten)         (None, 3072)              0         
_________________________________________________________________
dense_1335 (Dense)           (None, 100)               307300    
_________________________________________________________________
dense_1336 (Dense)           (None, 100)               10100     
_________________________________________________________________
dense_1337 (Dense)           (None, 100)               10100     
_________________________________________________________________
dense_1338 (Dense)           (None, 100)               10100     
_________________________________________________________________
dense_1339 (Dense)           (None, 100)               10100     
_________________________________________________________________
dense_1340 (Dense)           (None, 100)             
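
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The following is a small sanity check, assuming the architecture defined above (3,072 flattened inputs, 20 hidden layers of 100 units, 10 output units); `dense_params` is a helper introduced only for this example.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


n_features = 32 * 32 * 3  # size of the Flatten output
first_hidden = dense_params(n_features, 100)
other_hidden = dense_params(100, 100)
output_layer = dense_params(100, 10)

# 1 first hidden layer + 19 further hidden layers + the softmax output layer
total = first_hidden + 19 * other_hidden + output_layer
print(first_hidden, other_hidden, output_layer, total)
```

The first two values should match the `Param #` column for the first and subsequent hidden layers.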

In [154]:
from tensorflow.keras.datasets import cifar10

train, test = cifar10.load_data()

In [155]:
x_train, y_train = train[0], train[1]
x_test, y_test = test[0], test[1]

In [156]:
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.33)

In [157]:
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers.schedules import ExponentialDecay

early_stop_cb = EarlyStopping(patience=10, restore_best_weights=True)
# s = 20 * len(x_train) // 32
# learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)

optimizer = optimizers.Nadam(learning_rate=5e-5) # does not support learning rate scheduling
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
history = model.fit(x_train, y_train, epochs=50, validation_data=(x_val, y_val), callbacks=[early_stop_cb])
model.evaluate(x_test, y_test)

Epoch 1/50
...
Epoch 37/50


[1.5225147008895874, 0.45680001378059387]

In [158]:
import numpy as np
from tensorflow.keras.callbacks import ModelCheckpoint

RegularizedDense = partial(layers.Dense,activation="elu", kernel_initializer="he_normal")

model_BN = models.Sequential()
model_BN.add(layers.Flatten(input_shape=x_train.shape[1:]))
model_BN.add(layers.BatchNormalization())
for _ in range(20):
    model_BN.add(layers.Dense(100, kernel_initializer="he_normal"))
    model_BN.add(layers.BatchNormalization())
    model_BN.add(layers.Activation("elu"))  # `keras` itself was never imported, only `layers`
    
model_BN.add(layers.Dense(10, activation="softmax"))


early_stop_cb = EarlyStopping(patience=10, restore_best_weights=True)
model_BN_checkpoint_cb = ModelCheckpoint("cifar10_bn_model.h5", save_best_only=True)
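# use a fresh optimizer rather than the instance that already trained the previous model
model_BN.compile(optimizer=optimizers.Nadam(learning_rate=5e-5), loss="sparse_categorical_crossentropy", metrics=["accuracy"])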
model_BN.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=50, callbacks=[early_stop_cb, model_BN_checkpoint_cb])

Epoch 1/50
...
Epoch 40/50


<tensorflow.python.keras.callbacks.History at 0x7fcc35cb1640>

- The network using batch normalization (BN) converges much faster than the previous network. From the very first epochs, the BN network had a lower loss than the previous network.
- The BN network also produces a better model.
- BN adds extra computation to the network, so each epoch takes longer to train than it did for the previous network.
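
To make the extra per-layer work concrete, here is a minimal NumPy sketch of what a batch normalization layer computes in training mode: it standardizes each feature over the batch, then applies a learned scale and shift. This is an illustration under simplifying assumptions (no moving averages, no gradient computation), not the Keras implementation; `batch_norm_forward` and the random input batch are made up for the example.

```python
import numpy as np


def batch_norm_forward(x, gamma, beta, eps=1e-3):
    """Standardize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, roughly unit variance
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(32, 100))  # hypothetical layer inputs
out = batch_norm_forward(batch, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])
```

Each BN layer repeats these mean, variance, and rescaling operations for every batch, which is where the longer time per epoch comes from.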

In [159]:
from tensorflow.keras.initializers import LecunNormal
import os

model_self_norm = models.Sequential()
model_self_norm.add(layers.Flatten(input_shape=x_train.shape[1:]))
for _ in range(20):
    model_self_norm.add(layers.Dense(100, activation='selu', kernel_initializer=LecunNormal()))
model_self_norm.add(layers.Dense(10, activation="softmax"))

early_stop_cb = EarlyStopping(patience=10, restore_best_weights=True)
model_self_norm_checkpoint_cb = ModelCheckpoint("cifar10_self_norm_model.h5", save_best_only=True)
# `lr` is a deprecated alias; `learning_rate` is the current argument name
model_self_norm.compile(optimizer=optimizers.SGD(learning_rate=0.001, momentum=0.9), loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model_self_norm.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=50, callbacks=[early_stop_cb, model_self_norm_checkpoint_cb])

Epoch 1/50
...
Epoch 42/50


<tensorflow.python.keras.callbacks.History at 0x7fcc41934910>

In [160]:
model_dropout = models.Sequential()
model_dropout.add(layers.Flatten(input_shape=x_train.shape[1:]))
for _ in range(20):
    model_dropout.add(layers.Dense(100, activation='selu', kernel_initializer=LecunNormal()))
model_dropout.add(layers.AlphaDropout(rate=0.1)) # why is it only at the end of the network?
model_dropout.add(layers.Dense(10, activation="softmax"))

early_stop_cb = EarlyStopping(patience=10, restore_best_weights=True)
# save to its own file so the self-normalizing model's checkpoint is not overwritten
model_dropout_checkpoint_cb = ModelCheckpoint("cifar10_alpha_dropout_model.h5", save_best_only=True)
model_dropout.compile(optimizer=optimizers.SGD(learning_rate=0.001, momentum=0.9), loss="sparse_categorical_crossentropy", metrics=["accuracy"])

x_means = x_train.mean(axis=0)
x_stds = x_train.std(axis=0)
x_train_scaled = (x_train - x_means) / x_stds
x_val_scaled = (x_val - x_means) / x_stds
x_test_scaled = (x_test - x_means) / x_stds

history_dropout = model_dropout.fit(x_train_scaled, y_train, validation_data=(x_val_scaled, y_val), epochs=50, callbacks=[early_stop_cb, model_dropout_checkpoint_cb])

Epoch 1/50
...
Epoch 16/50


In [161]:
class MCAlphaDropout(layers.AlphaDropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

In [162]:
# rebuild the model, swapping each AlphaDropout layer for its always-on MC variant
model_mc_drop = models.Sequential([
    MCAlphaDropout(layer.rate) if isinstance(layer, layers.AlphaDropout) else layer
    for layer in model_dropout.layers
])

In [163]:
# define utility functions
def mc_dropout_predict_probas(mc_model, X, n_samples=10):
    y_probas = [mc_model.predict(X) for sample in range(n_samples)]
    return np.mean(y_probas, axis=0)

def mc_dropout_predict_classes(mc_model, X, n_samples=10):
    y_probas = mc_dropout_predict_probas(mc_model, X, n_samples)
    return np.argmax(y_probas, axis=1)

In [164]:
y_pred = mc_dropout_predict_classes(model_mc_drop, x_val_scaled)
accuracy = np.mean(y_pred == y_val[:, 0])
accuracy

0.4624848484848485

One cycle scheduling:

In [165]:
from tensorflow import keras
from tensorflow.keras import backend as K

class OneCycleScheduler(keras.callbacks.Callback):
    def __init__(self, iterations, max_rate, start_rate=None,
                 last_iterations=None, last_rate=None):
        self.iterations = iterations
        self.max_rate = max_rate
        self.start_rate = start_rate or max_rate / 10
        self.last_iterations = last_iterations or iterations // 10 + 1
        self.half_iteration = (iterations - self.last_iterations) // 2
        self.last_rate = last_rate or self.start_rate / 1000
        self.iteration = 0
    def _interpolate(self, iter1, iter2, rate1, rate2):
        return ((rate2 - rate1) * (self.iteration - iter1)
                / (iter2 - iter1) + rate1)
    def on_batch_begin(self, batch, logs):
        if self.iteration < self.half_iteration:
            rate = self._interpolate(0, self.half_iteration, self.start_rate, self.max_rate)
        elif self.iteration < 2 * self.half_iteration:
            rate = self._interpolate(self.half_iteration, 2 * self.half_iteration,
                                     self.max_rate, self.start_rate)
        else:
            rate = self._interpolate(2 * self.half_iteration, self.iterations,
                                     self.start_rate, self.last_rate)
        self.iteration += 1
        K.set_value(self.model.optimizer.learning_rate, rate)
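
To see the shape of the schedule without training anything, the interpolation logic of the callback can be reproduced as a plain function. This is a sketch that mirrors the arithmetic of `OneCycleScheduler` above; the function name `one_cycle_rate` and the total of 1,000 iterations are assumptions made for the example.

```python
def one_cycle_rate(iteration, iterations, max_rate, start_rate=None,
                   last_iterations=None, last_rate=None):
    """Learning rate at a given iteration of a one-cycle schedule."""
    start_rate = start_rate or max_rate / 10
    last_iterations = last_iterations or iterations // 10 + 1
    half_iteration = (iterations - last_iterations) // 2
    last_rate = last_rate or start_rate / 1000

    def interpolate(iter1, iter2, rate1, rate2):
        # linear interpolation between (iter1, rate1) and (iter2, rate2)
        return (rate2 - rate1) * (iteration - iter1) / (iter2 - iter1) + rate1

    if iteration < half_iteration:       # phase 1: ramp up to max_rate
        return interpolate(0, half_iteration, start_rate, max_rate)
    if iteration < 2 * half_iteration:   # phase 2: ramp back down to start_rate
        return interpolate(half_iteration, 2 * half_iteration, max_rate, start_rate)
    # phase 3: final decay from start_rate toward last_rate
    return interpolate(2 * half_iteration, iterations, start_rate, last_rate)


total = 1000  # hypothetical number of training iterations
for it in (0, 449, 898, 999):
    print(it, one_cycle_rate(it, total, max_rate=0.05))
```

The rate starts at a tenth of `max_rate`, peaks at `max_rate` halfway through the main cycle, returns to the starting rate, and then drops by several orders of magnitude over the last iterations.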

In [171]:
import math
from tensorflow.keras import backend as K

batch_size = 128
n_epochs = 45
onecycle = OneCycleScheduler(math.ceil(len(x_train_scaled) / batch_size) * n_epochs, max_rate=0.05)

model_mc_drop.compile(optimizer=optimizers.SGD(learning_rate=0.001, momentum=0.9), loss="sparse_categorical_crossentropy", metrics=["accuracy"])
history = model_mc_drop.fit(x_train_scaled, y_train, epochs=n_epochs, batch_size=batch_size,
                    validation_data=(x_val_scaled, y_val),
                    callbacks=[onecycle])

Epoch 1/45
...
Epoch 45/45


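The model trains at an average of about 2 seconds per epoch and performs as well as, if not better than, the previous models. Part of the speedup comes from the larger batch size (128 instead of the Keras default of 32), which means fewer weight updates per epoch.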
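
The effect of the batch size on the number of updates per epoch is simple arithmetic. The sketch below assumes 33,500 training images, which is what remains of the 50,000 CIFAR-10 training images after the 33% validation split above; `steps_per_epoch` is a helper introduced only for this example.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of gradient updates needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


n_train = 33_500  # assumed size of the training split
for batch_size in (32, 128):
    print(f"batch_size={batch_size}: {steps_per_epoch(n_train, batch_size)} steps per epoch")
```

Going from 32 to 128 cuts the number of steps per epoch by roughly a factor of four, which is consistent with the shorter epochs observed here.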