# Theory


4. Let x be the K x 1 vector output of the last layer of a xNN and e = crossEntropy(p*, softMax(x)) be the error where p* is a K x 1 vector with a 1 in position k* representing the correct class and 0s elsewhere. Derive ∂e/∂x. Large portions of this are shown in the slides, however, the purpose of this question is for you to derive all of the parts yourself to gain more confidence with error gradients. Here’s a cookbook of steps and hints:

4.1. Derive the gradient of the cross entropy for a 1 hot label at position k*. Use the derivative rule for log (assume base e) and note that only 1 element of the gradient is non zero.

$$e = crossEntropy(p^*,softMax(x))$$

$$crossEntropy(p^*,p)=-\sum_{k} p^*_k \log(p_k)$$
$$=-\log(p_{k^*})$$
$$=-\log(softMax(x)_{k^*})$$

$$\frac{\partial e}{\partial softMax} = [0,0,...,\frac{-1}{softMax(x)_{k^*}},...,0,0]^T$$

4.2. Derive the Jacobian of the soft max. Use the derivative quotient rule and note 2 cases: i != j and i == j (where i and j refer to the Jacobian row and col). Apply a common trick for functions with exponentials and re write the derivatives in terms of original function.

$$S_i = softMax(x)_i = \frac{e^{x_i}}{\sum_m e^{x_m}}$$

$$\frac{\partial S_i}{\partial x_j}\bigg|_{i=j} = \frac{(\sum_m e^{x_m}) e^{x_i} - e^{x_i} e^{x_i}}{(\sum_m e^{x_m})^2}$$


$$=\frac{e^{x_i}}{\sum_m e^{x_m}}\cdot\frac{\sum_m e^{x_m}-e^{x_i}}{\sum_m e^{x_m}}$$


$$=S_i(1-S_i)$$


<br>

$$\frac{\partial S_i}{\partial x_j}\bigg|_{i\neq j} = \frac{0 - e^{x_i} e^{x_j}}{(\sum_m e^{x_m})^2}$$

$$=-S_i S_j$$

For any entry $(i,j)$ of the Jacobian matrix, with $S_i = softMax(x)_i$:
- if $i=j$: $\frac{\partial S_i}{\partial x_j} = S_i(1-S_i)$
- if $i \neq j$: $\frac{\partial S_i}{\partial x_j} = -S_i S_j$
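The two cases above can be checked numerically. The following is a minimal NumPy sketch, not part of the original assignment: it builds the Jacobian from the two cases (written compactly as a diagonal matrix minus an outer product) and compares it against central differences. The helper names `softmax_jacobian` and `numerical_jacobian` and the test vector are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    # subtracting the max leaves the result unchanged and avoids overflow
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def softmax_jacobian(x):
    # entry (i, j) is S_i(1 - S_i) when i == j and -S_i S_j otherwise
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

def numerical_jacobian(f, x, eps=1e-6):
    # central differences, one input coordinate (Jacobian column) at a time
    jac = np.zeros((x.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        jac[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return jac

x = np.array([0.5, -1.0, 2.0, 0.1])
analytic = softmax_jacobian(x)
numeric = numerical_jacobian(softmax, x)
max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

The Jacobian is symmetric and each of its columns sums to zero, because the softmax outputs always sum to 1.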

4.3. Apply the chain rule to derive the gradient of e = crossEntropy(p*, softMax(x)) as the Jacobian matrix times the gradient vector. Take advantage of only 1 element of the gradient vector being non zero effectively selecting the corresponding col of the Jacobian matrix.


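$$\frac{\partial e}{\partial x} = \frac{\partial softMax}{\partial x}\frac{\partial e}{\partial softMax}$$

$$ = [p_0,...,p_{k^*}-1,...,p_{K-1}]^T = p - p^*$$

Here $p = softMax(x)$, and the result is the matrix-vector product of the softmax Jacobian with the gradient $\frac{\partial e}{\partial softMax}$. Because only element $k^*$ of the gradient is non zero, the product selects column $k^*$ of the Jacobian (which equals row $k^*$, since the Jacobian is symmetric) and scales it by $\frac{-1}{p_{k^*}}$. For $j \neq k^*$ this gives $\frac{-1}{p_{k^*}}(-p_{k^*}p_j) = p_j$, and for $j = k^*$ it gives $\frac{-1}{p_{k^*}}p_{k^*}(1-p_{k^*}) = p_{k^*}-1$.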
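As a sanity check on the chain rule step, the sketch below computes the gradient three ways: as the Jacobian times the 1-sparse gradient vector, as the closed form (the softmax output with 1 subtracted at position k*), and by central differences on the error itself. It is a NumPy illustration under assumed values; the logits `x`, the index `k_star`, and the helper names are not from the assignment.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def cross_entropy_of_logits(x, k_star):
    # e = -log(softMax(x)[k*]) for a 1 hot label at position k*
    return -np.log(softmax(x)[k_star])

x = np.array([1.0, -0.5, 0.3, 2.0, 0.0])   # assumed example logits
k_star = 3                                  # assumed correct class
p = softmax(x)

# chain rule: softmax Jacobian times the cross entropy gradient
jacobian = np.diag(p) - np.outer(p, p)
grad_e_wrt_p = np.zeros_like(p)
grad_e_wrt_p[k_star] = -1.0 / p[k_star]
grad_chain = jacobian @ grad_e_wrt_p

# closed form: softmax output with 1 subtracted at position k*
p_star = np.zeros_like(p)
p_star[k_star] = 1.0
grad_closed = p - p_star

# central differences on e as an independent check
eps = 1e-6
basis = np.eye(x.size)
grad_numeric = np.array([
    (cross_entropy_of_logits(x + eps * basis[j], k_star)
     - cross_entropy_of_logits(x - eps * basis[j], k_star)) / (2 * eps)
    for j in range(x.size)
])
print(grad_closed)
```

The closed form never divides by a probability, which is why it stays well behaved even when the probability of the correct class is very small.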

4.4. Note the beautiful and numerically stable result


4.5. Remind yourself in the future when implementing classification networks in software, use a single call to the high level library’s built in combined soft max cross entropy function if it’s available instead of making 2 calls to separate soft max and cross entropy functions. But realize that some libraries combine separate functions as an optimization step behind the scenes for you so if it’s not available then it’s probably still ok.
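To illustrate why the combined function matters, here is a small NumPy sketch (an illustration under assumed inputs, not any library's actual implementation). The naive version calls softmax and then takes the log; the fused version uses the identity e = log(sum(exp(x))) - x[k*] with the maximum subtracted first. The function names and the test logits are assumptions.

```python
import numpy as np

def naive_softmax_cross_entropy(x, k_star):
    # two separate steps: softmax, then -log of the selected probability
    p = np.exp(x) / np.sum(np.exp(x))
    return -np.log(p[k_star])

def fused_softmax_cross_entropy(x, k_star):
    # e = logsumexp(x) - x[k*], with the max subtracted for stability
    shifted = x - np.max(x)
    return np.log(np.sum(np.exp(shifted))) - shifted[k_star]

small = np.array([1.0, 2.0, 3.0])
large = np.array([1000.0, 2000.0, 3000.0])

# silence the overflow warnings that the naive version triggers on purpose
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive_small = naive_softmax_cross_entropy(small, 0)
    naive_large = naive_softmax_cross_entropy(large, 0)

fused_small = fused_softmax_cross_entropy(small, 0)
fused_large = fused_softmax_cross_entropy(large, 0)

print(naive_small, fused_small)
print(naive_large, fused_large)
```

For moderate logits both versions agree. For large logits the naive version overflows in `exp` and returns `nan`, while the fused version returns a finite value. In Keras, the corresponding choice is to leave the last layer without a softmax activation and pass `from_logits=True` to the cross entropy loss.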



5. Consider a simple residual block of the form y = x + f(H x + v) where x is a K x 1 input feature vector, H is K x K linear transformation matrix, v is a K x 1 bias vector, f is a ReLU pointwise nonlinearity and y is a K x 1 output feature vector. Assume that ∂e/∂y is given. Write out a single expression using the chain rule for ∂e/∂x in terms of ∂e/∂y and the Jacobians of the other operations. For the ReLU, define the Jacobian as a K x K diagonal matrix I{0, 1}. Note the clean flow of the gradient from the output to the input, this is a key for training deep networks.


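$$\frac{\partial y}{\partial x} = I_K + I\{0,1\}H$$
$$\frac{\partial e}{\partial x}  = \frac{\partial e}{\partial y}(I_K + I\{0,1\}H)$$

Here $I_K$ is the K x K identity matrix, $I\{0,1\}$ is the diagonal ReLU Jacobian with a 1 wherever the corresponding element of $Hx + v$ is positive and a 0 elsewhere, and $\frac{\partial e}{\partial y}$ is treated as a 1 x K row vector. The identity term passes $\frac{\partial e}{\partial y}$ to the input unchanged.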
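The expression for the input gradient can be verified numerically. The sketch below is a NumPy illustration with randomly drawn `H`, `v`, and `x`; it assumes a hypothetical error e = c · y so that the upstream gradient is simply `c`, and compares the analytic result with central differences.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
H = rng.normal(size=(K, K))
v = rng.normal(size=K)
x = rng.normal(size=K)
c = rng.normal(size=K)   # hypothetical: e = c . y, so de/dy = c

def residual_block(x):
    # y = x + ReLU(H x + v)
    return x + np.maximum(H @ x + v, 0.0)

def error(x):
    return c @ residual_block(x)

# analytic gradient: de/dy times (identity + diagonal ReLU Jacobian times H)
relu_jacobian = np.diag((H @ x + v > 0).astype(float))
de_dy = c
de_dx = de_dy @ (np.eye(K) + relu_jacobian @ H)

# central differences on e as an independent check
eps = 1e-6
basis = np.eye(K)
de_dx_numeric = np.array([
    (error(x + eps * basis[j]) - error(x - eps * basis[j])) / (2 * eps)
    for j in range(K)
])
print(de_dx)
print(de_dx_numeric)
```

The difference `de_dx - de_dy` is exactly the contribution of the `f(H x + v)` branch, so the skip connection always delivers the full upstream gradient to the input, regardless of the ReLU state.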

6. Write out the gradient descent update for H and v in the above example. Define intermediate feature maps as necessary. Note the need to save feature maps from the forward pass which has memory implications for xNN training.

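Define the intermediate feature map $z = Hx + v$, so that $y = x + f(z)$, and treat $\frac{\partial e}{\partial y}$ as a 1 x K row vector:

$$\frac{\partial e}{\partial z} = \frac{\partial e}{\partial y} I\{0,1\}$$
$$\frac{\partial e}{\partial v} = \frac{\partial e}{\partial z}$$
$$\frac{\partial e}{\partial H} = (\frac{\partial e}{\partial z})^T x^T$$

$\frac{\partial e}{\partial H}$ is arranged as a K x K matrix with the same shape as H, where entry $(i,j)$ is $\frac{\partial e}{\partial z_i} x_j$. Both the input $x$ and the sign pattern of $z$ (which determines $I\{0,1\}$) must be saved from the forward pass.

$$v_{t+1}=v_{t}-\alpha (\frac{\partial e}{\partial v})^T$$
$$H_{t+1}=H_{t}-\alpha \frac{\partial e}{\partial H}$$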
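A short NumPy sketch of the backward pass and update for this block follows. It assumes the intermediate feature map z = H x + v, a hypothetical error e = c · y (so the upstream gradient is `c`), and a hypothetical learning rate `alpha`. The gradient with respect to H is formed as the outer product of the masked upstream gradient and the saved input, then checked against central differences.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3
H = rng.normal(size=(K, K))
v = rng.normal(size=K)
x = rng.normal(size=K)
c = rng.normal(size=K)   # hypothetical: e = c . y, so de/dy = c
alpha = 0.1              # hypothetical learning rate

def error(H, v):
    return c @ (x + np.maximum(H @ x + v, 0.0))

# forward pass: x and the sign pattern of z are kept for the backward pass
z = H @ x + v
relu_mask = (z > 0).astype(float)

# backward pass
de_dz = c * relu_mask        # de/dy times the diagonal ReLU Jacobian
de_dv = de_dz
de_dH = np.outer(de_dz, x)   # entry (i, j) is de/dz_i * x_j

# central differences for every entry of H and v
eps = 1e-6
de_dH_numeric = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        dH = np.zeros((K, K))
        dH[i, j] = eps
        de_dH_numeric[i, j] = (error(H + dH, v) - error(H - dH, v)) / (2 * eps)
basis = np.eye(K)
de_dv_numeric = np.array([
    (error(H, v + eps * basis[i]) - error(H, v - eps * basis[i])) / (2 * eps)
    for i in range(K)
])

# one gradient descent step
H_next = H - alpha * de_dH
v_next = v - alpha * de_dv
```

Note that `de_dH` cannot be computed from the upstream gradient alone: it needs `x` and `relu_mask` from the forward pass, which is the memory cost the question refers to.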

# Practice

In [1]:
import tensorflow as tf
from tensorflow import keras
import tensorflow_hub as hub
import os
import math
import datetime
import pathlib
import numpy as np
import PIL.Image as Image
import matplotlib.pyplot as plt
import time
#tf.debugging.set_log_device_placement(True)

AUTOTUNE = tf.data.experimental.AUTOTUNE
BATCH_SIZE = 64
OG_IMAGENET_SIZE = (224,224)
IMAGE_SIZE = (64,64)

tf.__version__ 

'2.1.0'

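## Data Extraction with MobileNet_V2

Data is downloaded from the [official ImageNet website](http://image-net.org/small/download.php). Each image has a resolution of 64x64 with three channels. The cells below resize each image to 224x224, classify it with a pretrained MobileNet_V2, and move it into a folder named after the predicted class.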

In [2]:
data_root = pathlib.Path("E://Data/Imagenet/imageNet_64-64/train_64x64")
file_names = list(map(lambda x: str(data_root/x), os.listdir(data_root)))
file_names[:5]

['E:\\Data\\Imagenet\\imageNet_64-64\\train_64x64\\0224781.png',
 'E:\\Data\\Imagenet\\imageNet_64-64\\train_64x64\\0224782.png',
 'E:\\Data\\Imagenet\\imageNet_64-64\\train_64x64\\0224783.png',
 'E:\\Data\\Imagenet\\imageNet_64-64\\train_64x64\\0224784.png',
 'E:\\Data\\Imagenet\\imageNet_64-64\\train_64x64\\0224785.png']

In [3]:
def decode_image(path):
    img = tf.io.read_file(path)
    img = tf.image.decode_png(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    img = tf.image.resize(img, OG_IMAGENET_SIZE)
    return img

filename_dataset = tf.data.Dataset.from_tensor_slices(file_names)
for x in filename_dataset.take(1):
    print(x)

tf.Tensor(b'E:\\Data\\Imagenet\\imageNet_64-64\\train_64x64\\0224781.png', shape=(), dtype=string)


In [4]:
inference_data = filename_dataset.map(decode_image, num_parallel_calls = AUTOTUNE)
inference_data = tf.data.Dataset.zip((filename_dataset, inference_data))
inference_data = inference_data.batch(BATCH_SIZE)
inference_data = inference_data.prefetch(AUTOTUNE)

In [5]:
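labels_path = tf.keras.utils.get_file('ImageNetLabels.txt','https://storage.googleapis.com/download.tensorflow.org/data/ImageNetLabels.txt')
# one label per line; a NumPy array allows indexing with an array of class ids
with open(labels_path) as f:
    imagenet_labels = np.array(f.read().splitlines())

def get_classes(predictions):
    # map each row of class scores to the label with the highest score
    return imagenet_labels[np.argmax(predictions, axis=-1)]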

In [6]:
classifier_url ="https://tfhub.dev/google/tf2-preview/mobilenet_v2/classification/2" 
classifier = tf.keras.Sequential([
    hub.KerasLayer(classifier_url, input_shape = OG_IMAGENET_SIZE+(3,))
])

In [None]:
for files, x in inference_data:
    with tf.device('/GPU:0'):
        pred = classifier.predict(x)
    classes = get_classes(pred)
    # move each image into a sub-folder named after its predicted class
    for i, file in enumerate(files):
        f = pathlib.Path(file.numpy().decode('ascii'))
        class_dir = f.parent / classes[i]
        class_dir.mkdir(exist_ok=True)
        f.rename(class_dir / f.name)

## Training Data Pipeline

In [None]:
# data
DATA_NUM_CLASSES        = 10
DATA_CHANNELS           = 3
DATA_ROWS               = 32
DATA_COLS               = 32
DATA_CROP_ROWS          = 28
DATA_CROP_COLS          = 28

# model
MODEL_LEVEL_0_REPEATS   = 3
MODEL_LEVEL_1_REPEATS   = 3
MODEL_LEVEL_2_REPEATS   = 3

# training
TRAINING_BATCH_SIZE      = 32
TRAINING_SHUFFLE_BUFFER  = 5000
TRAINING_LR_MAX          = 0.001
# TRAINING_LR_SCALE        = 0.1
# TRAINING_LR_EPOCHS       = 2
TRAINING_LR_INIT_SCALE   = 0.01
TRAINING_LR_INIT_EPOCHS  = 5
TRAINING_LR_FINAL_SCALE  = 0.01
TRAINING_LR_FINAL_EPOCHS = 25

# training (derived)
TRAINING_NUM_EPOCHS = TRAINING_LR_INIT_EPOCHS + TRAINING_LR_FINAL_EPOCHS
TRAINING_LR_INIT    = TRAINING_LR_MAX*TRAINING_LR_INIT_SCALE
TRAINING_LR_FINAL   = TRAINING_LR_MAX*TRAINING_LR_FINAL_SCALE

SAVE_MODEL_PATH = 'E://Models/ImageNet_64/'
!mkdir -p "$SAVE_MODEL_PATH"

AUTOTUNE = tf.data.experimental.AUTOTUNE

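# class names are the sub-folders of data_root created by the MobileNet_V2 step
# NOTE: DATA_NUM_CLASSES has to equal len(CLASSES) for the model output to match
CLASSES = np.array(sorted(p.name for p in data_root.iterdir() if p.is_dir()))

def get_label(file_path):
    # convert the path to a list of path components
    parts = tf.strings.split(file_path, os.path.sep)
    # the second to last component is the class directory; compare it with
    # every class name and return the matching index as an integer label,
    # which is the label format sparse_categorical_crossentropy expects
    one_hot = parts[-2] == CLASSES
    return tf.argmax(tf.cast(one_hot, tf.int32))

def decode_image(img):
    # decode_png returns a rank 3 tensor; the files in this dataset are PNGs
    img = tf.image.decode_png(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    img = tf.image.random_flip_left_right(img)
    #img = tf.image.random_crop(img, size=[DATA_CROP_ROWS, DATA_CROP_COLS, 3])
    return img

def process_path(path):
    """
    Input: file_path of a sample image
    Output: image as a 64x64x3 float32 Tensor and its integer class index
    """
    label = get_label(path)
    image = tf.io.read_file(path)
    image = decode_image(image)
    return image, label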


def prepare_for_training(ds, cache=False, shuffle_buffer_size=TRAINING_SHUFFLE_BUFFER):
    if cache:
        if isinstance(cache, str):
            ds = ds.cache(cache)
        else:
            ds = ds.cache()
    ds = ds.shuffle(buffer_size=shuffle_buffer_size)
    ds = ds.repeat()
    ds = ds.batch(TRAINING_BATCH_SIZE)
    ds = ds.prefetch(buffer_size=AUTOTUNE)
    return ds

In [None]:
data_root = pathlib.Path("E://Data/Imagenet/imageNet_64-64/train_64x64")
list_files = tf.data.Dataset.list_files(str(data_root/'*/*'))
#map the above function to file_name dataset
train_imgs = list_files.map(process_path, num_parallel_calls=AUTOTUNE) 
train_ds_cachefile = prepare_for_training(train_imgs, cache=SAVE_MODEL_PATH+"cache.tfcache", shuffle_buffer_size=TRAINING_SHUFFLE_BUFFER)

## Model Specifications

In [None]:
# create and compile model
def create_model(level_0_repeats, level_1_repeats, level_2_repeats):

    # encoder - input
    model_input = keras.Input(shape=(DATA_CROP_ROWS, DATA_CROP_COLS, DATA_CHANNELS), name='input_image')
    x           = model_input
    
    # encoder - level 0
    for n0 in range(level_0_repeats):
        # x = keras.layers.Conv2D(32, 3, strides=1, padding='same', activation='relu', use_bias=True)(x)
        x = keras.layers.Conv2D(32, 3, strides=1, padding='same', activation=None, use_bias=False)(x)
        x = keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True)(x)
        x = keras.layers.ReLU()(x)
    x = keras.layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2), padding='same')(x)

    # encoder - level 1
    for n1 in range(level_1_repeats):
        # x = keras.layers.Conv2D(64, 3, strides=1, padding='same', activation='relu', use_bias=True)(x)
        x = keras.layers.Conv2D(64, 3, strides=1, padding='same', activation=None, use_bias=False)(x)
        x = keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True)(x)
        x = keras.layers.ReLU()(x)
    x = keras.layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2), padding='same')(x)
        
    # encoder - level 2
    for n2 in range(level_2_repeats):
        # x = keras.layers.Conv2D(128, 3, strides=1, padding='same', activation='relu', use_bias=True)(x)
        x = keras.layers.Conv2D(128, 3, strides=1, padding='same', activation=None, use_bias=False)(x)
        x = keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True)(x)
        x = keras.layers.ReLU()(x)

    # encoder - output
    encoder_output = x

    # decoder
    y              = keras.layers.GlobalAveragePooling2D()(encoder_output)
    decoder_output = keras.layers.Dense(DATA_NUM_CLASSES, activation='softmax')(y)
    
    # forward path
    model = keras.Model(inputs=model_input, outputs=decoder_output, name='cifar_model')

    # loss, backward path (implicit) and weight update
    model.compile(optimizer=tf.keras.optimizers.Adam(TRAINING_LR_MAX), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    # return model
    return model

# create and compile model
model = create_model(MODEL_LEVEL_0_REPEATS, MODEL_LEVEL_1_REPEATS, MODEL_LEVEL_2_REPEATS)

# model description and figure
model.summary()

## Training

In [None]:
# learning rate schedule
def lr_schedule(epoch):

    # staircase
    # lr = TRAINING_LR_MAX*math.pow(TRAINING_LR_SCALE, math.floor(epoch/TRAINING_LR_EPOCHS))

    # linear warmup followed by cosine decay
    if epoch < TRAINING_LR_INIT_EPOCHS:
        lr = (TRAINING_LR_MAX - TRAINING_LR_INIT)*(float(epoch)/TRAINING_LR_INIT_EPOCHS) + TRAINING_LR_INIT
    else:
        lr = (TRAINING_LR_MAX - TRAINING_LR_FINAL)*max(0.0, math.cos(((float(epoch) - TRAINING_LR_INIT_EPOCHS)/(TRAINING_LR_FINAL_EPOCHS - 1.0))*(math.pi/2.0))) + TRAINING_LR_FINAL

    # debug - learning rate display
    # print(epoch)
    # print(lr)

    return lr

# plot training accuracy and loss curves
def plot_training_curves(history):

    # training and validation data accuracy
    acc     = history.history['accuracy']
    val_acc = history.history['val_accuracy']

    # training and validation data loss
    loss     = history.history['loss']
    val_loss = history.history['val_loss']

    # plot accuracy
    plt.figure(figsize=(8, 8))
    plt.subplot(2, 1, 1)
    plt.plot(acc, label='Training Accuracy')
    plt.plot(val_acc, label='Validation Accuracy')
    plt.legend(loc='lower right')
    plt.ylabel('Accuracy')
    plt.ylim([min(plt.ylim()), 1])
    plt.title('Training and Validation Accuracy')

    # plot loss
    plt.subplot(2, 1, 2)
    plt.plot(loss, label='Training Loss')
    plt.plot(val_loss, label='Validation Loss')
    plt.legend(loc='upper right')
    plt.ylabel('Cross Entropy')
    plt.ylim([0, 2.0])
    plt.title('Training and Validation Loss')
    plt.xlabel('epoch')
    plt.show()

# callbacks (learning rate schedule, model checkpointing during training)
callbacks = [keras.callbacks.LearningRateScheduler(lr_schedule),
             keras.callbacks.ModelCheckpoint(filepath=SAVE_MODEL_PATH+'model_{epoch}.h5', save_best_only=True, monitor='val_loss', verbose=1)]

# training
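initial_epoch_num = 0
# train_ds_cachefile repeats indefinitely, so each epoch needs an explicit step count
steps_per_epoch   = len(file_names)//TRAINING_BATCH_SIZE
# NOTE: dataset_test is not created anywhere in this notebook; a held-out dataset has to be built before this cell runs
history           = model.fit(x=train_ds_cachefile, epochs=TRAINING_NUM_EPOCHS, steps_per_epoch=steps_per_epoch, verbose=1, callbacks=callbacks, validation_data=dataset_test, initial_epoch=initial_epoch_num)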

# example of restarting training after a crash from the last saved checkpoint
# model             = create_model(MODEL_LEVEL_0_REPEATS, MODEL_LEVEL_1_REPEATS, MODEL_LEVEL_2_REPEATS)
# model.load_weights(SAVE_MODEL_PATH+'model_X.h5') # replace X with the last saved checkpoint number
# initial_epoch_num = X                            # replace X with the last saved checkpoint number
# history           = model.fit(x=dataset_train, epochs=TRAINING_NUM_EPOCHS, verbose=1, callbacks=callbacks, validation_data=dataset_test, initial_epoch=initial_epoch_num)

# plot accuracy and loss curves
plot_training_curves(history)
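To see what the learning rate schedule above produces, the following standalone sketch evaluates it for every epoch without TensorFlow. The constants are copied from the training configuration in this notebook, and `lr_schedule` is repeated so the block runs on its own. The epochs picked for printing are an arbitrary choice.

```python
import math

# values copied from the training constants of this notebook
TRAINING_LR_MAX          = 0.001
TRAINING_LR_INIT_SCALE   = 0.01
TRAINING_LR_INIT_EPOCHS  = 5
TRAINING_LR_FINAL_SCALE  = 0.01
TRAINING_LR_FINAL_EPOCHS = 25

TRAINING_NUM_EPOCHS = TRAINING_LR_INIT_EPOCHS + TRAINING_LR_FINAL_EPOCHS
TRAINING_LR_INIT    = TRAINING_LR_MAX*TRAINING_LR_INIT_SCALE
TRAINING_LR_FINAL   = TRAINING_LR_MAX*TRAINING_LR_FINAL_SCALE

def lr_schedule(epoch):
    # linear warmup followed by a quarter period of a cosine
    if epoch < TRAINING_LR_INIT_EPOCHS:
        lr = (TRAINING_LR_MAX - TRAINING_LR_INIT)*(float(epoch)/TRAINING_LR_INIT_EPOCHS) + TRAINING_LR_INIT
    else:
        lr = (TRAINING_LR_MAX - TRAINING_LR_FINAL)*max(0.0, math.cos(((float(epoch) - TRAINING_LR_INIT_EPOCHS)/(TRAINING_LR_FINAL_EPOCHS - 1.0))*(math.pi/2.0))) + TRAINING_LR_FINAL
    return lr

schedule = [lr_schedule(epoch) for epoch in range(TRAINING_NUM_EPOCHS)]

# first warmup epoch, peak, middle of the decay, last epoch
for epoch in (0, 5, 17, 29):
    print(epoch, schedule[epoch])
```

The rate starts at `TRAINING_LR_INIT`, reaches `TRAINING_LR_MAX` at the first epoch after warmup, and ends at `TRAINING_LR_FINAL` on the last epoch because the cosine argument runs from 0 to pi/2 over the decay phase.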

## Evaluation

In [None]:
# test
test_loss, test_accuracy = model.evaluate(x=dataset_test)
print('Test loss:     ', test_loss)
print('Test accuracy: ', test_accuracy)

# example of saving and loading the model in Keras H5 format
# this saves both the model and the weights
# model.save('./save/model/model.h5')
# new_model       = keras.models.load_model('./save/model/model.h5')
# predictions     = model.predict(x=dataset_test)
# new_predictions = new_model.predict(x=dataset_test)
# np.testing.assert_allclose(predictions, new_predictions, atol=1e-6)

# example of saving and loading the model in TensorFlow SavedModel format
# this saves both the model and the weights
# keras.experimental.export_saved_model(model, './save/model/')
# new_model       = keras.experimental.load_from_saved_model('./save/model/')
# predictions     = model.predict(x=dataset_test)
# new_predictions = new_model.predict(x=dataset_test)
# np.testing.assert_allclose(predictions, new_predictions, atol=1e-6)

# example of getting a list of all feature maps
# feature_map_list = [layer.output for layer in model.layers]
# print(feature_map_list)

# example of creating a model encoder
# replace X with the layer number of the encoder output
# model_encoder    = keras.Model(inputs=model.input, outputs=model.layers[X].output)
# model_encoder.summary()
