In [1]:
import tensorflow as tf
import tensorflow_datasets as tfds
import math

import time
import datetime

# Using TensorBoard to track metrics

In this notebook I will show how to use the TensorBoard extension, which lets you keep track of training runs and the data associated with each epoch (e.g. loss, accuracies, predicted/generated outputs, activation statistics).

First we will look at TensorFlow's summary writer and what it can be used for, before we use it in the training loop to keep track of the metrics.

The summary writer can be used to log scalars, feature maps (e.g. batches of generated images or activations), histograms, audio (batches of 1D sequences), and text.


After that, we will see how this can be used to track the metrics (loss, accuracy etc.) of a subclassed model. For this we will implement the train and test steps as methods of the model and keep the loss, optimizer and metrics as attributes of the model.

Here we iterate over 100 steps and store a loss, some randomly generated images, an audio tensor, text, and a histogram for each step. We do not use a deep learning model yet; we only show how to store different kinds of data with TensorBoard.

In [2]:
# load tensorboard extension
%load_ext tensorboard

# define file-path for log file
file_path = "test_logs/test"

# define the tf file-writer (we usually use a separate one for train and validation)
summary_writer = tf.summary.create_file_writer(file_path)

# write 100 logs for loss

for i in range(100):
    
    # compute loss (here targets and predictions would come from the data and the model)
    targets = tf.constant([0.3,0.3,-0.8])
    predictions = targets + tf.random.normal(shape=targets.shape, stddev=100/(i+1)) # decreasing noise
    
    loss_function = tf.keras.losses.MeanSquaredError()
    loss = loss_function(targets,predictions)
    
    
    # image batch (these would be obtained from the model)
    
    image_batch = tf.random.uniform(shape=(32,28,28,1),dtype=tf.float32)
    
    
    # audio batch (would be obtained from the model; here it is a hard-coded 110 Hz sine wave, 5 s at 32 kHz)
    
    x = 2* math.pi*tf.cast(tf.linspace(0,32000*5, 32000*5), tf.float32)*110/32000
    x = tf.expand_dims(x, axis=0) # add batch dimension
    x = tf.expand_dims(x, axis=-1) # add last dimension
    x = tf.repeat(x, 32, axis=0) # repeat to have a batch of 32
    audio_batch = tf.math.sin(x) # obtain sine wave
    
    
    # text (this would be the output of a language model after one training epoch)
    
    text_batch = tf.constant("This is the sampled output of a language model")
    
    
    # histogram (e.g. of activations of a dense layer during training)
    
    activations_batch = tf.random.normal(shape=(32,20,1))
    min_activations = tf.reduce_min(activations_batch, axis=None)
    max_activations = tf.reduce_max(activations_batch, axis=None)
    histogram = tf.histogram_fixed_width_bins(activations_batch, 
                                              value_range=[min_activations, max_activations])
    
    
    # now we want to write all the data to a log-file.
    with summary_writer.as_default():
        
        # save the loss scalar for the "epoch"
        tf.summary.scalar(name="loss", data=loss, step=i)
        
        # save a batch of images for this epoch (have to be between 0 and 1)
        tf.summary.image(name="generated_images",data = image_batch, step=i, max_outputs=32)
        
        # save the batch of audio for this epoch
        tf.summary.audio(name="generated_audio", data = audio_batch, 
                         sample_rate = 32000, step=i, max_outputs=32)
        
        # save the generated text for that epoch
        tf.summary.text(name="generated_text", data = text_batch, step=i)
        
        # save a histogram (e.g. of activations in a layer); tf.summary.histogram
        # bins the raw values itself, so we pass the activations directly
        tf.summary.histogram(name="layer_N_activations", data=activations_batch, step=i)

# Inspecting the logged data in TensorBoard

We can look at the images, audio, text, histograms and plots for each time-step. 

For plots under the "scalars" section, we can control the amount of smoothing for the plots. This allows us to visually judge whether the loss is decreasing even in the presence of strong oscillations.
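To get a feeling for what the smoothing slider does, the following plain-Python sketch applies an exponential moving average to a noisy loss curve. It is a simplified illustration and not TensorBoard's exact implementation; the function name `smooth` and the `weight` parameter are made up for this example.

```python
def smooth(values, weight=0.6):
    """Exponential moving average; a higher weight means stronger smoothing."""
    if not values:
        return []
    smoothed = []
    last = values[0]  # start from the first observation
    for value in values:
        # blend the previous smoothed value with the new observation
        last = weight * last + (1.0 - weight) * value
        smoothed.append(last)
    return smoothed


noisy_loss = [4.0, 1.0, 3.0, 0.5, 2.0, 0.2]
print(smooth(noisy_loss, weight=0.5))
```

With a weight of 0 the curve is returned unchanged; the closer the weight gets to 1, the more the oscillations are damped and the more the smoothed curve lags behind the raw values.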

In [3]:
# open the tensorboard to inspect the data for the 100 steps
%tensorboard --logdir test_logs/

Reusing TensorBoard on port 6006 (pid 11605), started 3:17:40 ago. (Use '!kill 11605' to kill it.)

# Using TensorBoard to store loss and accuracy of a subclassed model

In this part of the notebook, we will define a subclassed CNN model and log loss and accuracy for both training and validation data to TensorBoard.

To do this in a clean way, we make the Keras metrics that keep track of loss and accuracy in each epoch part of the model. We also define the train and test steps as methods of the model rather than as external functions. This moves us one step closer to being able to use the built-in training and evaluation methods of TensorFlow/Keras, i.e. compile and fit, which are not yet allowed in the homework.

To use train_step and test_step as methods of the model, the loss function, the metrics, and the optimizer need to be part of the model, which is why we define them in the init method.

Note that we need to update the metrics after each training step (i.e. after each batch) and reset them after each epoch and before evaluating the model on the validation data set.

Also note that metrics_list contains a Mean metric for the loss, whose update_state method takes just a scalar rather than targets and predictions. For this reason, we treat it differently from the remaining metrics.
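The difference between the two kinds of metrics can be illustrated without TensorFlow. The sketch below uses two hypothetical stand-in classes, `RunningMean` and `RunningAccuracy` (they are not Keras classes), that mimic the update_state / result / reset_state pattern: the mean metric receives a scalar, while the accuracy metric receives targets and predictions.

```python
class RunningMean:
    """Plain-Python stand-in for a mean metric: averages scalar values."""

    def __init__(self, name):
        self.name = name
        self.reset_state()

    def reset_state(self):
        self.total = 0.0
        self.count = 0

    def update_state(self, value):
        self.total += float(value)
        self.count += 1

    def result(self):
        return self.total / self.count if self.count else 0.0


class RunningAccuracy(RunningMean):
    """Stand-in for an accuracy metric: compares one-hot targets with scores."""

    def update_state(self, targets, predictions):
        for target, prediction in zip(targets, predictions):
            # a hit means the highest score is at the position of the true class
            hit = target.index(max(target)) == prediction.index(max(prediction))
            self.total += float(hit)
            self.count += 1


loss_metric = RunningMean("loss")
acc_metric = RunningAccuracy("acc")

targets = [[1, 0, 0], [0, 1, 0]]
predictions = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5]]

loss_metric.update_state(0.8)  # scalar loss of the first batch
loss_metric.update_state(0.4)  # scalar loss of the second batch
acc_metric.update_state(targets, predictions)

print(loss_metric.name, loss_metric.result())
print(acc_metric.name, acc_metric.result())
```

Calling reset_state on both objects at the end of an epoch puts them back to zero, which is what the model's reset_metrics method does for the real Keras metrics.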

In [4]:
class CNN(tf.keras.Model):
    def __init__(self):
        super(CNN, self).__init__()
    
        self.optimizer = tf.keras.optimizers.Adam()
        
        self.metrics_list = [
                        tf.keras.metrics.Mean(name="loss"),
                        tf.keras.metrics.CategoricalAccuracy(name="acc"),
                        tf.keras.metrics.TopKCategoricalAccuracy(3,name="top-3-acc") 
                       ]
        
        self.loss_function = tf.keras.losses.CategoricalCrossentropy(from_logits=True)   
        
        L2_lambda = 0.01
        dropout_amount = 0.5
        
        self.all_layers = [
            
            tf.keras.layers.Conv2D(filters=32, 
                                   kernel_size=5, 
                                   strides=1, 
                                   padding="same",
                                   kernel_initializer=tf.keras.initializers.glorot_uniform,
                                   activation=None,
                                   kernel_regularizer=tf.keras.regularizers.L2(L2_lambda)),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Activation(tf.nn.relu),
            
            tf.keras.layers.MaxPool2D(pool_size=2,strides=1),
            
            tf.keras.layers.Dropout(dropout_amount),
            
            tf.keras.layers.Conv2D(filters=32, kernel_size=3, strides=1, padding="same",activation=None,
                                  kernel_initializer=tf.keras.initializers.glorot_uniform,
                                   kernel_regularizer=tf.keras.regularizers.L2(L2_lambda)),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Activation(tf.nn.relu),
            
            tf.keras.layers.GlobalAveragePooling2D(),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Dropout(dropout_amount),

            tf.keras.layers.Dense(128, kernel_regularizer=tf.keras.regularizers.L2(L2_lambda)),
            tf.keras.layers.Activation(tf.nn.relu),
            
            tf.keras.layers.Dropout(dropout_amount),
            
            tf.keras.layers.Dense(10, kernel_regularizer=tf.keras.regularizers.L2(L2_lambda)),
        ]
    
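    def call(self, x, training=False):

        for layer in self.all_layers:
            try:
                # layers such as Dropout and BatchNormalization behave differently during training
                x = layer(x, training=training)
            except TypeError:
                # fall back for layers whose call does not accept a training argument
                x = layer(x)

        return x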
    
    def reset_metrics(self):
        
        for metric in self.metrics:
            # reset_state() replaces the deprecated reset_states()
            metric.reset_state()
            
    @tf.function
    def train_step(self, data):
        
        x, targets = data
        
        with tf.GradientTape() as tape:
            predictions = self(x, training=True)
            
            loss = self.loss_function(targets, predictions) + tf.reduce_sum(self.losses)
        
        gradients = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
        
        # update loss metric
        self.metrics[0].update_state(loss)
        
        # for all metrics except loss, update states (accuracy etc.)
        for metric in self.metrics[1:]:
            metric.update_state(targets,predictions)

        # Return a dictionary mapping metric names to current value
        return {m.name: m.result() for m in self.metrics}

    @tf.function
    def test_step(self, data):

        x, targets = data
        
        predictions = self(x, training=False)
        
        loss = self.loss_function(targets, predictions) + tf.reduce_sum(self.losses)
        
        self.metrics[0].update_state(loss)
        
        for metric in self.metrics[1:]:
            metric.update_state(targets, predictions)

        return {m.name: m.result() for m in self.metrics}

# Preparing the training and validation data

Here we use data augmentation, for which we use a small Keras model that applies some operations to the images (random flipping, resizing and random cropping). Note that any kind of model can be used in the input pipeline: even a VAE could be used to encode the images and then add noise to the embedding.

In [5]:
ds = tfds.load("fashion_mnist", as_supervised=True)

train_ds = ds["train"]
val_ds = ds["test"]

data_augmentation_model = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.Resizing(32,32),
    tf.keras.layers.RandomCrop(28,28)
])

def augment(x):
    # pass training=True explicitly so the random layers stay active inside dataset.map
    return data_augmentation_model(x, training=True)

train_ds = train_ds.map(lambda x, y: (augment(x) / 255, tf.one_hot(y, 10, dtype=tf.float32)),
                        num_parallel_calls=tf.data.AUTOTUNE).shuffle(5000).batch(32).prefetch(tf.data.AUTOTUNE)

val_ds = val_ds.map(lambda x, y: (x / 255, tf.one_hot(y, 10, dtype=tf.float32)),
                    num_parallel_calls=tf.data.AUTOTUNE).shuffle(5000).batch(32).prefetch(tf.data.AUTOTUNE)

In [6]:
# instantiate the model
model = CNN()

# run model on input once so the layers are built
model(tf.keras.Input((28,28,1)));

# Instantiate the file-writers for the training

We store the TensorBoard logs in a folder with a meaningful name (e.g. hyperparameter settings + date and time).
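One possible way to build such a folder name is sketched below. The helper `make_run_name` and the example hyperparameters are hypothetical and only serve as an illustration; any naming scheme that keeps runs apart works just as well.

```python
import datetime


def make_run_name(hparams, now=None):
    """Build '<key-value pairs>/<timestamp>' from a dict of hyperparameters."""
    now = now or datetime.datetime.now()
    # sort the keys so the same settings always give the same folder name
    settings = "_".join(f"{key}-{value}" for key, value in sorted(hparams.items()))
    return f"{settings}/{now.strftime('%Y%m%d-%H%M%S')}"


# example settings (assumed values, adapt them to your own experiment)
hparams = {"lr": 0.001, "dropout": 0.5, "l2": 0.01}
run_name = make_run_name(hparams)

example_train_log_path = f"logs/{run_name}/train"
example_val_log_path = f"logs/{run_name}/val"
print(example_train_log_path)
print(example_val_log_path)
```

Because all runs with the same settings share a parent folder, TensorBoard groups them together, while the timestamp keeps repeated runs from overwriting each other.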

In [7]:
# to clear all logs use this line:

#!rm -rf ./logs/

In [8]:
# Define where to save the log
hyperparameter_string= "Your_Settings_Here"
current_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

train_log_path = f"logs/{hyperparameter_string}/{current_time}/train"
val_log_path = f"logs/{hyperparameter_string}/{current_time}/val"

# log writer for training metrics
train_summary_writer = tf.summary.create_file_writer(train_log_path)

# log writer for validation metrics
val_summary_writer = tf.summary.create_file_writer(val_log_path)

# Writing the training loop

Note that you need to re-run the cell above (and hence update the timestamp) if you don't want to overwrite the data of the previous training run.

If you use Keras metrics, do not forget to reset their states between training and validation and between epochs.
We use metric.update_state(...) to update a metric. For a Mean metric this updates the running average with the new value; other Keras metrics, such as CategoricalAccuracy and TopKCategoricalAccuracy, compute a score from targets and predictions.
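As an illustration of what a top-k accuracy score measures, here is a small plain-Python sketch. The function `top_k_accuracy` is a hypothetical helper written for this example, not the Keras implementation: a prediction counts as correct if the true class is among the k highest-scoring classes.

```python
def top_k_accuracy(targets, predictions, k=3):
    """Fraction of examples whose true class is among the k highest scores."""
    hits = 0
    for target, scores in zip(targets, predictions):
        true_class = target.index(max(target))  # position of the 1 in the one-hot target
        # class indices sorted from highest to lowest score
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        hits += true_class in ranked[:k]
    return hits / len(targets)


one_hot_targets = [[0, 0, 1, 0], [1, 0, 0, 0]]
scores = [[0.1, 0.5, 0.3, 0.1], [0.1, 0.2, 0.3, 0.4]]

for k in (1, 2, 4):
    print(f"top-{k} accuracy:", top_k_accuracy(one_hot_targets, scores, k=k))
```

With k=1 this reduces to ordinary accuracy, and the score can only stay the same or increase as k grows.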

For your own training loops, you may want to add tqdm to see the progress of each epoch and an estimate of the remaining time.

Instead of looking at the printed losses and accuracies, we can look at the TensorBoard plots which will be updated after every epoch.

In [9]:
for epoch in range(5):
    
    print(f"Epoch {epoch}:")
    
    # Training:
    
    for data in train_ds:
        metrics = model.train_step(data)
    
    # print the metrics
    print([f"{key}: {value}" for key, value in metrics.items()])
    
    # logging the training metrics to the log file which is used by tensorboard
    with train_summary_writer.as_default():
        for metric in model.metrics:
            tf.summary.scalar(f"{metric.name}", metric.result(), step=epoch)
    
    # reset all metrics (requires a reset_metrics method in the model)
    model.reset_metrics()
    
    
    # Validation:
    
    for data in val_ds:
        metrics = model.test_step(data)
    
    print([f"val_{key}: {value}" for key, value in metrics.items()])
    
    # logging the validation metrics to the log file which is used by tensorboard
    with val_summary_writer.as_default():
        for metric in model.metrics:
            tf.summary.scalar(f"{metric.name}", metric.result(), step=epoch)
    
    # reset all metrics
    model.reset_metrics()
    
    print("\n")

Epoch 0:
['loss: 1.609331727027893', 'acc: 0.5475833415985107', 'top-3-acc: 0.8635333180427551']
['val_loss: 1.2642236948013306', 'val_acc: 0.6227999925613403', 'val_top-3-acc: 0.8938000202178955']


Epoch 1:
['loss: 1.1726185083389282', 'acc: 0.6559500098228455', 'top-3-acc: 0.9244666695594788']
['val_loss: 1.4290522336959839', 'val_acc: 0.51419997215271', 'val_top-3-acc: 0.8436999917030334']


Epoch 2:
['loss: 1.1076161861419678', 'acc: 0.6722833514213562', 'top-3-acc: 0.9319666624069214']
['val_loss: 1.187070369720459', 'val_acc: 0.6211000084877014', 'val_top-3-acc: 0.9394999742507935']


Epoch 3:
['loss: 1.0730130672454834', 'acc: 0.6803666949272156', 'top-3-acc: 0.936033308506012']
['val_loss: 1.6239241361618042', 'val_acc: 0.5058000087738037', 'val_top-3-acc: 0.8766000270843506']


Epoch 4:
['loss: 1.049936294555664', 'acc: 0.6890000104904175', 'top-3-acc: 0.9362499713897705']
['val_loss: 1.0211036205291748', 'val_acc: 0.6879000067710876', 'val_top-3-acc: 0.9570000171661377']




In [10]:
%tensorboard --logdir logs/

# Saving and loading a subclassed model

Because training deep neural networks can take multiple days, weeks or even months, we want to save checkpoints along the way. This is especially useful if you use Google Colab and save the model directly to your Google Drive folder. That way you don't lose any progress if your runtime gets closed.

Note, however, that saving only the weights in this way loses the state of the optimizer. I will provide another notebook that shows how a full model can be saved and loaded by using the built-in compile method.

In [11]:
# save the model with a meaningful name
model.save_weights(f"saved_model_{hyperparameter_string}", save_format="tf")

In [12]:
# instantiate a new model from our CNN class
loaded_model = CNN()

# build the model
inp= tf.keras.Input((28,28,1))
loaded_model(inp)

# load the model weights to continue training. 
loaded_model.load_weights(f"saved_model_{hyperparameter_string}");