# Implementing advanced procedures and algorithms in TensorFlow

This notebook collects advanced procedures relevant to training more complex neural networks. Use it as a reference: look up a procedure when you need it and copy the relevant code. The notebook builds on a GitHub repository by [Aurélien Géron](https://github.com/ageron/handson-ml2).

In [None]:
import tensorflow as tf
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# 1. From the previous notebook: Multiclass-classification

Throughout, we will train a neural network on a dataset of fashion products, each labeled with its product category. The data is loaded directly from TensorFlow (which offers a fairly broad collection of datasets):

In [None]:
fashion_mnist = tf.keras.datasets.fashion_mnist
(X_other, y_other), (X_test, y_test) = fashion_mnist.load_data()

In [None]:
print(X_other.shape)
print(y_other.shape)
print(X_test.shape)
print(y_test.shape)

Each image in X is a 28x28 matrix of pixel values, while each y is a single integer label.

We divide the values of X by 255 (rescaling the pixel values to the range 0-1) and also split off a validation set (of the same size as the test set):

In [None]:
X_other = X_other / 255.
X_test = X_test / 255.
X_train, X_valid, y_train, y_valid = train_test_split(X_other, y_other, train_size = 50000, random_state=152)

print(X_train.shape)
print(y_train.shape)
print(X_valid.shape)
print(y_valid.shape)

Let's plot two examples:

In [None]:
plt.subplot(1, 2, 1)
plt.imshow(X_train[0],cmap="binary")
plt.axis('off')
plt.subplot(1, 2, 2)
plt.imshow(X_train[500],cmap="binary")
plt.axis('off')

As well as the corresponding labels:

In [None]:
print(y_train[0])
print(y_train[500])

That's a bit hard to interpret. Luckily, we have the right names for each label available:

In [None]:
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat", "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

We can now take another look at what the pictures above represent:

In [None]:
print(class_names[y_train[0]])
print(class_names[y_train[500]])

## 1.1 Training the model

Can you create a model with two hidden layers and one (softmax) output layer? The hidden layers should have 100 neurons each. You will have to figure out the number of neurons on the output layer (hint: it depends on the number of classes).

Also, don't forget to flatten the images with an appropriate `input_shape`!

Make sure you save your model as `model`.

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])

Use the `summary` function to check whether everything worked out as it should. If you defined the model as discussed above, you should get a total of 89,610 parameters.
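
The parameter count can also be checked by hand. The sketch below assumes the architecture described above (784 flattened inputs, two hidden layers with 100 neurons each, 10 output neurons); the helper `dense_params` is a name made up for this illustration and is not part of TensorFlow.

```python
def dense_params(n_inputs, n_units):
    """Parameters of a fully connected layer: one weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units

# Layer sizes of the model above: 28*28 flattened inputs, two hidden layers, 10 classes.
layer_sizes = [28 * 28, 100, 100, 10]

# Pair each layer's input size with its output size and count the parameters.
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)      # [78500, 10100, 1010]
print(total_params)   # 89610
```

The `Flatten` layer only reshapes the input, so it contributes no parameters.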

In [None]:
model.summary()

We can now compile the model. Use `optimizer=tf.keras.optimizers.SGD(learning_rate=0.01)` and `metrics=['accuracy']`. For the loss, use `sparse_categorical_crossentropy`. This is because our y's here are **not** one-hot-encoded, but instead are values from 0 to 9.
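
To see why the sparse variant fits integer labels, the following minimal sketch computes the cross-entropy for a single example in both ways. The two functions are hand-written stand-ins for illustration (not the Keras implementations), and the probabilities are hypothetical.

```python
import math

def sparse_categorical_crossentropy(label, probs):
    """Loss when the label is a class index (e.g. 0 to 9)."""
    return -math.log(probs[label])

def categorical_crossentropy(one_hot, probs):
    """Loss when the label is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = [0.1, 0.7, 0.2]    # hypothetical softmax output for three classes
label = 1                  # integer label, as in y_train
one_hot = [0.0, 1.0, 0.0]  # the same label, one-hot encoded

sparse_loss = sparse_categorical_crossentropy(label, probs)
dense_loss = categorical_crossentropy(one_hot, probs)
print(sparse_loss, dense_loss)
```

Both give the same value; the sparse version simply skips the one-hot encoding step.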

In [None]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              metrics=["accuracy"])

Train the model for 30 epochs, also keeping track of the `validation_data`.

In [None]:
log = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_valid, y_valid))

Take a look at the training process:

In [None]:
def create_plot(log):
    plt.plot(log.history['accuracy'],label = "training accuracy",color='green')
    plt.plot(log.history['loss'],label = "training loss",color='darkgreen')
    plt.plot(log.history['val_accuracy'], label = "validation accuracy",color='grey')
    plt.plot(log.history['val_loss'], label = "validation loss",color='darkblue')
    plt.legend()
    plt.show()
create_plot(log)

If we accept our model, we can evaluate it on the test set, using the `evaluate` function of our model:

In [None]:
model.evaluate(X_test, y_test)

We can also take a look at some predictions:

In [None]:
X_new = X_test[:4]
y_predict = np.argmax(model.predict(X_new), axis=-1)
y_predict = [class_names[y] for y in y_predict]

In [None]:
# Show the first four test images side by side
for i in range(4):
    plt.subplot(1, 4, i + 1)
    plt.imshow(X_test[i], cmap="binary")
    plt.axis('off')
print("Predictions are: " + str(y_predict))

Can you do better? Try tweaking the learning rate, the number of layers, and the neurons per layer to see if your validation loss improves. Once you have decided on your final model, evaluate it on the test set and note down your loss there.

# 2. Regularization

We learned about a number of regularization techniques. Here, we will see how to implement early stopping, L2-regularization, and dropout-regularization.

## 2.1 Early stopping

When we implement early stopping, the model definition and compilation are unchanged:

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              metrics=["accuracy"])

However, we now need to add a so-called `callback` to the training process. We define the `EarlyStopping` callback, which interrupts training if the validation loss is no longer improving. In particular, the callback waits for `patience` epochs of no improvement before interrupting:

In [None]:
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)

The other parameter here is `restore_best_weights`. If set to `True`, then once the callback decides to interrupt, it restores the weights of the model version that achieved the best validation loss so far (do you know which epoch this corresponds to?).
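
The interplay of `patience` and `restore_best_weights` can be sketched in plain Python. This is a simplified stand-in for the callback logic, not the Keras implementation; the function name and the validation losses are hypothetical, and epochs are counted from 0.

```python
def early_stopping_epochs(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Training ran through all epochs without triggering the callback.
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses, one per epoch.
val_losses = [0.60, 0.50, 0.45, 0.47, 0.46, 0.48, 0.44, 0.45]

stop_epoch, best_epoch = early_stopping_epochs(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

With `patience=3`, training stops at epoch 5, and the weights that would be restored are those from epoch 2, that is, `patience` epochs before the stop.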

In [None]:
log = model.fit(X_train, y_train, epochs=100,
                validation_data=(X_valid, y_valid),
                callbacks=[early_stopping_cb])
create_plot(log)

Now that we have stopped early, compare the performance of the model on the test set:

In [None]:
model.evaluate(X_test, y_test)

## 2.2 L2-regularization

L2-regularization adds a penalty based on the L2-norm to the loss function. Usually, we add the same penalty for all weights, and we don't add a penalty for the biases. But you could also add a `bias_regularizer`, or even an `activity_regularizer`, which regularizes the output of the neurons instead of the parameters.

Keep in mind that the regularization parameter is another hyperparameter that might need tuning. A good starting point is 0.01, but it can vary quite a bit depending on the problem and network.
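
As a rough illustration of what the penalty adds to the loss, the sketch below computes the regularization parameter times the sum of squared weights for a tiny, made-up kernel. The helper `l2_penalty` is hypothetical and written by hand for this example.

```python
def l2_penalty(weights, reg_param):
    """L2 penalty: reg_param times the sum of all squared weights (biases are not included)."""
    return reg_param * sum(w ** 2 for row in weights for w in row)

# Hypothetical 2x2 kernel of a Dense layer.
kernel = [[0.5, -1.0],
          [2.0, 0.0]]

penalty = l2_penalty(kernel, reg_param=0.01)
print(penalty)
```

Large weights dominate the sum of squares, so the penalty mainly pushes the largest weights toward zero.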

In [None]:
reg_param = 0.01
regularizer = tf.keras.regularizers.l2(reg_param)

Can you rerun the model from above, but using regularization? In particular, to each `Dense` layer, you want to add the argument `kernel_regularizer=regularizer`:

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu",kernel_regularizer=regularizer),
    tf.keras.layers.Dense(100, activation="relu",kernel_regularizer=regularizer),
    tf.keras.layers.Dense(10, activation="softmax",kernel_regularizer=regularizer)
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              metrics=["accuracy"])
log = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_valid, y_valid))
create_plot(log)

As the regularization term is added to the loss, the loss will typically start out quite high, before the optimization routine finds a good way to adjust the weights to reduce the loss.

Note that we have reduced the overfitting issue quite a bit, but unfortunately we have also made it more difficult for the model to learn (we introduced bias). This is frequently the case, and we usually need to do some fiddling to find a good compromise.

## 2.3 Dropout-regularization

Another regularization method is dropout-regularization. At training time, in each iteration, a random subset of neurons is treated as non-existent, which forces the network to distribute weights more evenly across neurons.

This makes the correct computation for activations a bit challenging when doing predictions, but luckily TensorFlow takes care of the added complexity.
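
One common way to handle this is "inverted dropout": during training, the values that survive are scaled up by `1 / (1 - rate)`, so nothing has to change at prediction time. The sketch below illustrates the idea in plain Python; the `dropout` function is a hand-written stand-in, not the Keras layer.

```python
import random

def dropout(values, rate, training, rng):
    """Zero each value with probability `rate` during training and rescale the rest."""
    if not training:
        return list(values)  # at prediction time the values pass through unchanged
    keep_scale = 1.0 / (1.0 - rate)
    return [v * keep_scale if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)    # fixed seed so the example is repeatable
activations = [1.0] * 10  # hypothetical outputs of a layer

train_out = dropout(activations, rate=0.2, training=True, rng=rng)
test_out = dropout(activations, rate=0.2, training=False, rng=rng)
print(train_out)
print(test_out)
```

During training, every entry is either `0.0` or `1.25`, so the expected sum of the activations stays the same as without dropout.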

If we want to ensure that neurons at a certain layer drop out (with probability `rate`), we add a `Dropout` layer before the corresponding `Dense` layer, using
```
tf.keras.layers.Dropout(rate=0.2)
```
Of course, `0.2` is just a particular choice and we can vary that.
Can you repeat the previous (baseline) model, but adding a `Dropout` layer before each `Dense` layer?

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              metrics=["accuracy"])
log = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid))
create_plot(log)

In [None]:
model.evaluate(X_test, y_test)

# 3. Vanishing / exploding gradients

Below, we create a much deeper neural network. As you can see, not much is happening in terms of learning:

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="sigmoid"),
    tf.keras.layers.Dense(30, activation="sigmoid"),
    tf.keras.layers.Dense(30, activation="sigmoid"),
    tf.keras.layers.Dense(30, activation="sigmoid"),
    tf.keras.layers.Dense(30, activation="sigmoid"),
    tf.keras.layers.Dense(30, activation="sigmoid"),
    tf.keras.layers.Dense(30, activation="sigmoid"),
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              metrics=["accuracy"])
log = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))
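
A back-of-the-envelope calculation hints at why so little happens. The slope of the sigmoid is at most 0.25, and during backpropagation one such factor is multiplied in for every sigmoid layer. The sketch below only looks at these activation factors and ignores the weight matrices, which also enter the product and can shrink or enlarge it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative is largest at z = 0.
max_slope = sigmoid_derivative(0.0)

# The network above has 7 sigmoid layers in front of the softmax output.
n_sigmoid_layers = 7
upper_bound = max_slope ** n_sigmoid_layers

print(max_slope)    # 0.25
print(upper_bound)  # roughly 6e-05
```

Under this simplification, the gradient signal reaching the first layer is scaled down by several orders of magnitude.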

## 3.1 Batch normalization

Batch normalization allows us to do normalization at all stages of the network. For each input that is normalized, we need 4 parameters:
1. One that determines how the input is scaled (trainable)
1. One that determines how the input is shifted (trainable)
1. One that keeps track of the average of that input (non-trainable - it is still being adjusted though!)
1. One that keeps track of the standard deviation of that input (non-trainable - it is still being adjusted though!)
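
The four parameters per normalized input translate directly into parameter counts. The sketch below uses a hypothetical helper, `batch_norm_params`, to split them into trainable and non-trainable ones. As far as I know, the Keras layer stores a moving variance rather than a standard deviation, but the count is the same either way.

```python
def batch_norm_params(n_features):
    """Return (trainable, non_trainable) parameter counts for one batch normalization layer."""
    trainable = 2 * n_features      # scale and shift
    non_trainable = 2 * n_features  # running average and running spread
    return trainable, non_trainable

# Example: normalizing the 100 outputs of a hidden layer.
trainable, non_trainable = batch_norm_params(100)
print(trainable, non_trainable)  # 200 200
```

Adding these counts over all `BatchNormalization` layers, together with the `Dense` parameters, should reproduce the totals reported by `model.summary()`.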

### Option 1: After activation (before inputs are weighted)

We can simply add a `BatchNormalization` layer before each of our `Dense` layers:

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation="sigmoid"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(30, activation="sigmoid"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(30, activation="sigmoid"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(30, activation="sigmoid"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(30, activation="sigmoid"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(30, activation="sigmoid"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(30, activation="sigmoid"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax")
])
model.summary()

Can you verify the number of trainable and non-trainable parameters?

Let's now train the network again:

In [None]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              metrics=["accuracy"])
log = model.fit(X_train, y_train, epochs=5,
                    validation_data=(X_valid, y_valid))

We can train even a deep neural network much more easily!

### Option 2: Before activation (after inputs are weighted)

The recommendation by the authors of the original paper on batch normalization is to normalize the weighted sum that goes into the neurons. That is, we first combine the inputs (and add a bias), then we "normalize" that weighted sum, before running the activation function on it. To do so in TensorFlow, we have to split our hidden layers into the combination and the activation. We put the `BatchNormalization` in between:

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.Dense(30),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.Dense(30),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.Dense(30),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.Dense(30),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.Dense(30),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.Dense(30),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.Dense(10),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('softmax')
])
model.summary()

In [None]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              metrics=["accuracy"])
log = model.fit(X_train, y_train, epochs=5,
                    validation_data=(X_valid, y_valid))

In practice, the differences between the two options tend to be small. But if you are really struggling to get your network to learn anything, try it out like this!

## 3.2 Specific initializations

We generally want to initialize our weights in a sensible manner (especially if we are not using batch normalization, for example, because of runtime concerns). Let's start with our baseline model:

In [None]:
tf.random.set_seed(631)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])

weights, biases = model.layers[1].get_weights()
print("First layer: " + str(weights[0,0]))
weights, biases = model.layers[-1].get_weights()
print("Last layer: " + str(weights[0,0]))

Can you copy the model definition, but change the first layer to `kernel_initializer='he_normal'` and the output layer to `kernel_initializer='glorot_uniform'`?

What changes do you observe in the first layer, and what changes in the last layer? Do they make sense?

In [None]:
tf.random.set_seed(631)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax", kernel_initializer="glorot_uniform")
])

weights, biases = model.layers[1].get_weights()
print("First layer: " + str(weights[0,0]))
weights, biases = model.layers[-1].get_weights()
print("Last layer: " + str(weights[0,0]))

Try again. Can you copy the model definition, but change the first layer to `kernel_initializer='he_uniform'` and the output layer to `kernel_initializer='glorot_normal'`?

What changes do you observe in the first layer, and what changes in the last layer? Do they make sense?
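
To judge whether the printed weights are plausible, it helps to know the scale each initializer aims for. The sketch below uses the standard He and Glorot formulas; the function names are made up for this illustration. It assumes `fan_in` and `fan_out` are the number of inputs and outputs of the layer, and that the normal variants are drawn from a truncated normal distribution, so individual draws can differ somewhat from the target scale.

```python
import math

def he_normal_stddev(fan_in):
    return math.sqrt(2.0 / fan_in)

def he_uniform_limit(fan_in):
    return math.sqrt(6.0 / fan_in)

def glorot_normal_stddev(fan_in, fan_out):
    return math.sqrt(2.0 / (fan_in + fan_out))

def glorot_uniform_limit(fan_in, fan_out):
    return math.sqrt(6.0 / (fan_in + fan_out))

first_layer = (784, 100)  # (fan_in, fan_out) of the first Dense layer
last_layer = (100, 10)    # (fan_in, fan_out) of the output layer

print("he_normal stddev, first layer:    ", he_normal_stddev(first_layer[0]))
print("he_uniform limit, first layer:    ", he_uniform_limit(first_layer[0]))
print("glorot_normal stddev, last layer: ", glorot_normal_stddev(*last_layer))
print("glorot_uniform limit, last layer: ", glorot_uniform_limit(*last_layer))
```

Uniform initializers draw from the interval between minus the limit and plus the limit. Note also that `glorot_uniform` is the default `kernel_initializer` of a `Dense` layer.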

In [None]:
tf.random.set_seed(631)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_uniform"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax", kernel_initializer="glorot_normal")
])

weights, biases = model.layers[1].get_weights()
print("First layer: " + str(weights[0,0]))
weights, biases = model.layers[-1].get_weights()
print("Last layer: " + str(weights[0,0]))

# 4. Speeding up learning

## 4.1 Mini-batch gradient descent

You might not have noticed, but we run mini-batch gradient descent by default. We can control the batch size within the `model.fit` function. The default is `32`. Run the code below:

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              metrics=["accuracy"])

log = model.fit(X_train, y_train,
                epochs=3,batch_size=32,
                validation_data=(X_valid, y_valid))

Can you remake the model, but change the `batch_size` to `1024`?

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              metrics=["accuracy"])

log = model.fit(X_train, y_train,
                epochs=3, batch_size=1024,
                validation_data=(X_valid, y_valid))

Notice the number of steps taken in each epoch (the counter just underneath "Epoch x/3"). Can you explain where the number of steps comes from?
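
The step count follows from the size of the training set and the batch size. The sketch below assumes the 50,000 training images from the split at the beginning of the notebook; the last batch of an epoch may be smaller than the others, which is why the division is rounded up.

```python
import math

n_train = 50000  # size of X_train after the train/validation split

def steps_per_epoch(n_samples, batch_size):
    """Number of mini-batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

steps_small = steps_per_epoch(n_train, 32)
steps_large = steps_per_epoch(n_train, 1024)

print(steps_small)  # 1563
print(steps_large)  # 49
```

Larger batches mean fewer parameter updates per epoch, so each epoch runs faster but the model moves less per epoch.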

## 4.2 Using Momentum

We can add momentum to many algorithms. The base case is to add momentum to `SGD`. A typical value is `0.9`, but keep in mind that this is another hyperparameter that may need some tuning. When setting up `SGD`, you can also set `nesterov=True` to use the Nesterov algorithm, a momentum-based algorithm we didn't discuss. Try it out:

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=False),
              metrics=["accuracy"])
log = model.fit(X_train, y_train,epochs=20,
                validation_data=(X_valid, y_valid))
create_plot(log)
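
For intuition, the momentum update used in this section can be sketched on a toy problem. The example minimizes the made-up loss `f(w) = w**2` for a single weight; the update rule follows the one given in the Keras documentation for `SGD`, but the function itself is a hand-written illustration.

```python
def grad(w):
    """Gradient of the toy loss f(w) = w**2."""
    return 2.0 * w

def sgd_momentum_step(w, velocity, learning_rate=0.01, momentum=0.9, nesterov=False):
    g = grad(w)
    # The velocity accumulates past gradients, discounted by the momentum factor.
    velocity = momentum * velocity - learning_rate * g
    if nesterov:
        w = w + momentum * velocity - learning_rate * g
    else:
        w = w + velocity
    return w, velocity

w, velocity = 1.0, 0.0
for _ in range(2):
    w, velocity = sgd_momentum_step(w, velocity, learning_rate=0.1)
print(w, velocity)
```

The second step is larger than the first although the gradient has become smaller, because the velocity carries over part of the previous step.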

## 4.3 RMSprop

RMSprop is an algorithm that pursues a slightly different idea: it normalizes the gradients using a moving average of their squares. It requires specifying a `learning_rate`, as well as the hyperparameters `rho` and `epsilon`. For the latter two, the default values usually do just fine, and even the `learning_rate` is less problematic than in `SGD`.

If you want, you can also add `momentum` to the algorithm.

Try it out:

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, epsilon=1e-07, momentum=0.0),
              metrics=["accuracy"])
log = model.fit(X_train, y_train,epochs=20,
                validation_data=(X_valid, y_valid))
create_plot(log)
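
The RMSprop idea can be sketched for a single weight as follows. This is a simplified, hand-written version without momentum, applied to the toy loss `f(w) = w**2`; where exactly `epsilon` enters the denominator differs between implementations, so treat that detail as an assumption.

```python
import math

def rmsprop_step(w, avg_sq_grad, g, learning_rate=0.001, rho=0.9, epsilon=1e-07):
    # Moving average of the squared gradients.
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
    # Divide the gradient by the root of that average before taking the step.
    w = w - learning_rate * g / (math.sqrt(avg_sq_grad) + epsilon)
    return w, avg_sq_grad

w, avg_sq_grad = 1.0, 0.0
g = 2.0 * w  # gradient of the toy loss f(w) = w**2 at w = 1
w, avg_sq_grad = rmsprop_step(w, avg_sq_grad, g)
print(w, avg_sq_grad)
```

Because the gradient is divided by its own recent magnitude, the step size depends much less on the raw scale of the gradient than in plain `SGD`.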

## 4.4 Adam

Finally, we have `Adam`, which is the most commonly used optimizer. Adam combines the ideas of RMSprop and momentum gradient descent. It also adds a slight adjustment that is particularly relevant for early iterations. The hyperparameters are `learning_rate`, `beta_1`, `beta_2`, and `epsilon`, although most people leave everything but the `learning_rate` alone. Try it out:

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07),
              metrics=["accuracy"])
log = model.fit(X_train, y_train,epochs=20,
                validation_data=(X_valid, y_valid))
create_plot(log)
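
A single Adam step for one weight can be sketched as below. The adjustment for early iterations is the bias correction, which divides the moving averages by `1 - beta**t`. The function is a hand-written illustration on the toy loss `f(w) = w**2`, not the Keras implementation.

```python
import math

def adam_step(w, m, v, t, g, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07):
    m = beta_1 * m + (1 - beta_1) * g      # momentum-like average of gradients
    v = beta_2 * v + (1 - beta_2) * g * g  # RMSprop-like average of squared gradients
    # Bias correction: both averages start at 0 and are too small in early steps.
    m_hat = m / (1 - beta_1 ** t)
    v_hat = v / (1 - beta_2 ** t)
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
g = 2.0 * w  # gradient of the toy loss f(w) = w**2 at w = 1
w, m, v = adam_step(w, m, v, t=1, g=g)
print(w, m, v)
```

In the very first step the corrected averages equal the gradient and its square, so the weight moves by almost exactly the `learning_rate`, regardless of how large the gradient is.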

# 5. Learning rate schedule

## 5.1 Power scheduling

Remember that each epoch contains a number of steps ($\frac{n}{\text{mini-batch-size}}$, rounded up if the last batch is incomplete). If we want to express our decay schedule based on the number of epochs, we first have to make a small adjustment. For example, say that we set the mini-batch-size to 32 and that we want to reach the next "decay step" (i.e., 1/2, 1/3, 1/4, ...) every 5 epochs.

Can you define the correct `s`, which should be the number of steps (not epochs!) until we reach the next "decay step"?

In [None]:
batch_size = 32
epochs_until_change = 5

# Round up: the last, smaller batch of an epoch still counts as a step
steps_per_epoch = int(np.ceil(X_train.shape[0] / batch_size))
s = epochs_until_change * steps_per_epoch

Once we have defined the right `s`, we can train the model by manually defining our optimizer, using the TensorFlow scheduling process.

Here, we use `InverseTimeDecay` which computes
```
current_learning_rate = initial_learning_rate / (1 + decay_rate * step / decay_steps)
```
Increasing the `decay_rate` is equivalent to decreasing the `decay_steps`. Since we have already tuned `decay_steps=s`, we can simply set the `decay_rate` to 1.
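
To check that this choice of `s` produces the intended schedule, the formula can be evaluated by hand. The sketch below re-implements the formula as a plain function (a stand-in, not the TensorFlow class) and assumes 1563 steps per epoch, that is, 50,000 training images with a batch size of 32, rounded up.

```python
def inverse_time_decay(step, initial_learning_rate, decay_steps, decay_rate):
    return initial_learning_rate / (1 + decay_rate * step / decay_steps)

steps_per_epoch = 1563   # assumed: 50000 training images, batch size 32
s = 5 * steps_per_epoch  # steps until the next "decay step"

# Learning rate at the start and after 5, 10, and 15 epochs.
rates = [inverse_time_decay(k * s, initial_learning_rate=0.01, decay_steps=s, decay_rate=1.0)
         for k in range(4)]
print(rates)
```

The values are the initial rate times 1, 1/2, 1/3, and 1/4. Between these points the rate decreases smoothly, because the formula uses the current step rather than the epoch.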

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])

learning_rate = tf.keras.optimizers.schedules.InverseTimeDecay(initial_learning_rate=0.01, decay_steps=s, decay_rate=1)
optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate)

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])
log = model.fit(X_train, y_train,
                epochs=20,batch_size=batch_size,
                validation_data=(X_valid, y_valid))
create_plot(log)

## 5.2 Exponential scheduling

When we use any other type of scheduling, we can follow the same process. In particular, can you redefine `s`, but this time with 10 epochs until we reach the next stage in the schedule?

In [None]:
batch_size = 32
epochs_until_change = 10

# Round up: the last, smaller batch of an epoch still counts as a step
steps_per_epoch = int(np.ceil(X_train.shape[0] / batch_size))
s = epochs_until_change * steps_per_epoch

We now use the `ExponentialDecay` schedule. Here, the computation is
```
current_learning_rate = initial_learning_rate * decay_rate**(step / decay_steps)
```
(Note that `**` means "to the power of".)

Our baseline exponential schedule has a `0.1` base, so we set `decay_rate=0.1`:

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])

learning_rate = tf.keras.optimizers.schedules.ExponentialDecay(initial_learning_rate=0.01, decay_steps=s, decay_rate=0.1)
optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate)

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])
log = model.fit(X_train, y_train,
                epochs=20,batch_size=batch_size,
                validation_data=(X_valid, y_valid))
create_plot(log)
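
As with power scheduling, the exponential schedule can be checked by evaluating the formula directly. The function below is a plain re-implementation of the formula shown above (not the TensorFlow class), again assuming 1563 steps per epoch.

```python
def exponential_decay(step, initial_learning_rate, decay_steps, decay_rate):
    return initial_learning_rate * decay_rate ** (step / decay_steps)

steps_per_epoch = 1563    # assumed: 50000 training images, batch size 32
s = 10 * steps_per_epoch  # steps until the learning rate has dropped by a factor of 10

# Learning rate at the start and after 10 and 20 epochs.
rates = [exponential_decay(k * s, initial_learning_rate=0.01, decay_steps=s, decay_rate=0.1)
         for k in range(3)]
print(rates)
```

After the 20 epochs used above, the learning rate has therefore fallen from 0.01 to about 0.0001.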