# <center> Using automatic Differentiation to compute Sigma </center>

We will make a demonstration on how to use automatic differenciation in order to compute $\sigma$ for arbitrary neural networks structures. Due to computation limitation we will only show how to compute one $\sigma$ for the Iris dataset that comprises 150 observations and 3 variables. The code is valid for bigger networks and datasets but would necessitate more powerful computers than the one we currently have. Automatic differienciation is done by using **Tensorflow** framework.

We first inport the functions we will need. We use Keras (and tensorflow) for neural networks. We load Iris dataset from scikit-learn. We define a batch size of 1 for our network, we predict 3 classes (the type of flower setosa, versicolor and virginica). We will run 20 epochs.

In [1]:
from __future__ import print_function

import keras
import keras.backend as K
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

import numpy as np
import tensorflow as tf

batch_size = 1
num_classes = 3
epochs = 20


Using TensorFlow backend.


We load Iris data.

In [40]:


data = load_iris()['data']
targets = load_iris()['target']
target_names = load_iris()['target_names']

We make the necessary preprocessing for Keras.

In [41]:
# the data, shuffled and split between train and test sets
x_train, x_test, y_train, y_test = train_test_split(data, targets, test_size=0.33)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

100 train samples
50 test samples


We compile our network. We get an accuracy of 0.6 which is not very good but sufficient for our proof of concept.

In [42]:
K.set_learning_phase(1)
model = Sequential()
model.add(Dense(4, activation='sigmoid', input_shape=(4,)))

model.add(Dense(num_classes, activation="linear"))

model.summary()


model.compile(loss='mean_squared_error',
              optimizer=RMSprop(),
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_5 (Dense)              (None, 4)                 20        
_________________________________________________________________
dense_6 (Dense)              (None, 3)                 15        
Total params: 35
Trainable params: 35
Non-trainable params: 0
_________________________________________________________________
Train on 100 samples, validate on 50 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Test loss: 0.12950145185
Test accuracy: 0.600000007153


We define a function to compute the hessian of the loss. It will do it automatically by auto differentiation.

In [43]:
def compute_hessian(model, x_sample, y_sample):
    # weight tensors
    weights = model.trainable_weights
    nb_params = sum([layer.count_params() for layer in model.layers])
    weights = [weight for weight in weights if model.get_layer(weight.name[:-2].split("/")[0]).trainable] # filter down weights tensors to only ones which are trainable
    weights_flat =([tf.reshape(w, [-1]) for w in weights])
    gradients =  tf.concat(
        [tf.reshape(g, [-1,1]) for g in model.optimizer.get_gradients(model.total_loss, weights)],0
    )

    hessians = tf.stack([tf.concat([tf.reshape(h,[-1,1]) for h in K.gradients(gradients[n], weights)],0)[n]
                for n in range(nb_params)])
    input_tensors = [model.inputs[0], # input data
                     model.sample_weights[0], # how much to weight each sample by
                     model.targets[0], # labels
                     K.learning_phase(), # train or test mode
    ]

    get_gradients = K.function(inputs=input_tensors, outputs=[hessians])

    inputs = [x_sample, # X
              np.ones(len(x_sample)), # sample weights
              y_sample # Y
    ]

    
    
    return get_gradients(inputs)[0]


In [44]:
hessian = compute_hessian(model, x_train, y_train)
hessian = hessian.reshape(-1)

We then make a function to differentiate outputs with respect to weights. This function is really slow for this small example probably because it makes the gradient difficult to compute.

In [45]:
def evaluate_output_j_gradients_wrt_weights(model, trainingExample, j):
    outputTensor = model.output[:,j]
    listOfVariableTensors = model.trainable_weights

    gradients = K.gradients(outputTensor, listOfVariableTensors)
    
    sess = tf.InteractiveSession()
    sess.run(tf.initialize_all_variables())
    evaluated_gradients = sess.run(gradients,feed_dict={model.input:trainingExample})
    return evaluated_gradients

def evaluate_output_gradients_wrt_weights(model, trainingExample):
    return [evaluate_output_j_gradients_wrt_weights(model, trainingExample, j) for j in range(3)]

In [66]:
gradients = evaluate_output_gradients_wrt_weights(model, x_train[0].reshape(1,4))

In [67]:
gradients_flat = [
    np.concatenate(
        [np.array([b**2 for b in list(a)]).reshape(-1) for a in g]
    ) 
    for g in gradients]

We finally get the values for sigma with the formula $$sigma_i = \sum_{i}{\frac{{\gamma_i}}{\beta*\sum_{I}{h_{ii}}}}$$.

And we choose $\beta$ = 10 arbitrarily.

In [74]:
from __future__ import division
[np.sum(g/(10*hessian)) for g in gradients_flat]

[2.1756890623553709, 2.0553389314226469, 2.0066627327254971]

This demonstration would need more investigation but we could not experiment in bigger networks.

In absence of any ways of determining $\beta$ it is very difficult to compute sigma accurately. We find that here we would probably need a bigger value for beta in order to have meaningful values of $\sigma$.