In [1]:
import numpy as np
import tensorflow as tf

In [2]:
np.random.seed(0)
batch_size = 1

W1_val = np.random.rand(4, 3)
b1_val = np.random.rand(4, 1)
W2_val = np.random.rand(1, 4)
b2_val = np.random.rand(1, 1)
x_val = np.random.rand(batch_size, 3)

W1 = tf.constant(W1_val)
b1 = tf.constant(b1_val)
W2 = tf.constant(W2_val)
b2 = tf.constant(b2_val)

In [3]:
x = tf.placeholder(shape=(batch_size, 3), dtype=np.float64)

In [30]:
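# Forward pass; x is a row vector, so transpose it to a column first
xT = tf.transpose(x)
z1 = W1 @ xT + b1  # shape (4, 1)
a1 = tf.tanh(z1)

z2 = W2 @ a1 + b2  # shape (1, 1)
a2 = tf.tanh(z2)

# Output: squared activation, shape (batch_size, 1)
c = tf.transpose(a2)
c = c**2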

In [31]:
sess = tf.Session()

In [40]:
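out = c
inp = W1
grads = tf.gradients(out, inp)  # first derivatives (built but not evaluated here)
hess = tf.hessians(out, inp)    # list with one Hessian per input tensor
res = sess.run([hess], feed_dict={x: x_val})
res = res[0][0]                 # unwrap both lists to get the Hessian array
res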

array([[[[-3.07510379e-05, -1.77573888e-05, -3.00341927e-05],
         [-1.68371619e-06, -9.72272976e-07, -1.64446666e-06],
         [-3.87932697e-06, -2.24014284e-06, -3.78889499e-06],
         [-8.95037125e-06, -5.16845067e-06, -8.74172688e-06]],

        [[-1.77573888e-05, -1.02541208e-05, -1.73434419e-05],
         [-9.72272976e-07, -5.61445416e-07, -9.49608074e-07],
         [-2.24014284e-06, -1.29358520e-06, -2.18792230e-06],
         [-5.16845067e-06, -2.98455579e-06, -5.04796761e-06]],

        [[-3.00341927e-05, -1.73434419e-05, -2.93340580e-05],
         [-1.64446666e-06, -9.49608074e-07, -1.60613208e-06],
         [-3.78889499e-06, -2.18792230e-06, -3.70057110e-06],
         [-8.74172688e-06, -5.04796761e-06, -8.53794627e-06]]],


       [[[-1.68371619e-06, -9.72272976e-07, -1.64446666e-06],
         [-8.36639831e-04, -4.83123166e-04, -8.17136708e-04],
         [-9.82527369e-05, -5.67366883e-05, -9.59623424e-05],
         [-2.26688413e-04, -1.30902713e-04, -2.21404021e-04]],

In [7]:
res.shape

(1, 3, 1, 3)

In [8]:
x[0,0]

<tf.Tensor 'strided_slice:0' shape=() dtype=float64>

### 2nd order derivative

Our model starts with a vector-valued input $\vec{x}$, maps it to an intermediate vector $\vec{a}$, and finally produces a scalar y. The information flow looks like this:
$$
\begin{bmatrix}x_1 \\ x_2 \\ \vdots \\x_n \end{bmatrix} \to 
\begin{bmatrix}a_1 \\ a_2 \\ \vdots \\a_m \end{bmatrix} \to 
y
$$

We are interested in the first and second derivatives of y with respect to x. Since y is a scalar, the Jacobian matrix is a row vector, and the second derivative can be expressed as a Hessian: 
$$
\begin{align}
(J_{yx})_i &= \frac{\partial y}{\partial x_i} \\
(H_{yx})_{ij} &= \frac{\partial^2 y}{\partial x_i \partial x_j}
\end{align}
$$

The first and second derivatives of y with respect to a are given. The naming scheme is the same: we use $J_{ya}$ and $H_{ya}$.

Now we want to use these to calculate $J_{yx}$ and $H_{yx}$.

The Jacobian of y with respect to x is straightforward. First we calculate the Jacobian $J_{ax}$. Since a is a vector, this is a matrix. To get the Jacobian $J_{yx}$, we multiply the row vector $J_{ya}$ by this matrix:
$$
\begin{align}
(J_{ax})_{ij} &= \frac{\partial a_i}{\partial x_j} \\
(J_{yx})_{j} &= \sum_i (J_{ya})_i (J_{ax})_{ij}
\end{align}
$$
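
The matrix product above can be checked numerically. The sketch below uses a hypothetical toy model that is not part of the notebook, a = tanh(W x) and y = sum(a²), and compares the chain-rule result with central finite differences. The names `W`, `layer` and `model` are assumptions made only for this illustration.

```python
import numpy as np

# Hypothetical toy model (assumption): a = tanh(W x), y = sum(a**2)
rng = np.random.default_rng(0)
m, n = 4, 3
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

def layer(x):
    return np.tanh(W @ x)

def model(x):
    return np.sum(layer(x) ** 2)

a = layer(x)
J_ya = 2 * a                      # shape (m,): dy/da_i
J_ax = (1 - a ** 2)[:, None] * W  # shape (m, n): da_i/dx_j
J_yx = J_ya @ J_ax                # shape (n,): sum over i

# Central finite differences of y with respect to each x_j
eps = 1e-6
J_fd = np.array([(model(x + eps * e) - model(x - eps * e)) / (2 * eps)
                 for e in np.eye(n)])

print(np.allclose(J_yx, J_fd, atol=1e-6))
```

The product `J_ya @ J_ax` contracts the index i, which is exactly the sum in the formula.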

The Hessian of y with respect to x is more involved. We have to account for every path through a when computing the second derivatives:
$$
\begin{align}
(H_{yx})_{ij}&= \frac{\partial^2 y}{\partial x_i \partial x_j} \\
&= \sum_{h,k} \frac{\partial^2 y}{\partial a_h \partial a_k} \frac{\partial a_h}{\partial x_i} \frac{\partial a_k}{\partial x_j} \\
&\; +\sum_h \frac{\partial y}{\partial a_h} \frac{\partial^2 a_h}{\partial x_i \partial x_j}
\end{align}
$$

So the Hessian consists of two components:
$$
\begin{align}
(H_{yx})_{ij} &= (I_{yx})_{ij} + (O_{yx})_{ij} \\
(I_{yx})_{ij} &= \sum_{h,k} \frac{\partial^2 y}{\partial a_h \partial a_k} \frac{\partial a_h}{\partial x_i} \frac{\partial a_k}{\partial x_j} \\
(O_{yx})_{ij} &=\sum_h \frac{\partial y}{\partial a_h} \frac{\partial^2 a_h}{\partial x_i \partial x_j}
\end{align}
$$
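
As a sanity check on the two components, consider the scalar case where x, a and y are all single numbers. The sums disappear and the formula reduces to the familiar second-order chain rule. The concrete functions $a = x^2$ and $y = a^3$ are chosen here purely for illustration:
$$
\begin{align}
\frac{d^2 y}{dx^2} &= \underbrace{\frac{d^2 y}{da^2}\left(\frac{da}{dx}\right)^2}_{I_{yx}} + \underbrace{\frac{dy}{da}\frac{d^2 a}{dx^2}}_{O_{yx}} \\
&= 6a \cdot (2x)^2 + 3a^2 \cdot 2 \\
&= 24x^4 + 6x^4 = 30x^4
\end{align}
$$
This agrees with differentiating $y = x^6$ directly.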

We can rewrite this in terms of the given Jacobians and Hessians:
$$
\begin{align}
(I_{yx})_{i,j} &= \sum_{h,k} (H_{ya})_{h,k} (J_{ax})_{h,i} (J_{ax})_{k,j} \\
(O_{yx})_{i,j} &= \sum_{h} (J_{ya})_{1, h} (H_{ax})_{h,i,j}
\end{align}
$$

Note that the Hessian $H_{ax}$ has three indices: a is a vector and we take its second derivative with respect to a vector, so the result is a 3-dimensional tensor. To avoid confusion, I write $(J_{ya})_{1, h}$ with two indices to make explicit that it is a row vector.
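
Both index sums map directly onto `np.einsum`. The following is a minimal sketch under the same assumed toy model as before (a = tanh(W x), y = sum(a²)), which is not taken from the notebook. It builds $I_{yx}$ and $O_{yx}$ from the four given quantities and compares their sum with a finite-difference Hessian.

```python
import numpy as np

# Hypothetical toy model (assumption): a = tanh(W x), y = sum(a**2)
rng = np.random.default_rng(0)
m, n = 4, 3
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

a = np.tanh(W @ x)
d1 = 1 - a ** 2   # tanh'(z)
d2 = -2 * a * d1  # tanh''(z)

J_ya = (2 * a)[None, :]                     # (1, m) row vector
H_ya = 2 * np.eye(m)                        # (m, m)
J_ax = d1[:, None] * W                      # (m, n)
H_ax = np.einsum('h,hi,hj->hij', d2, W, W)  # (m, n, n)

# The two components of the Hessian, written exactly as in the formulas
I_yx = np.einsum('hk,hi,kj->ij', H_ya, J_ax, J_ax)
O_yx = np.einsum('h,hij->ij', J_ya[0], H_ax)
H_yx = I_yx + O_yx

def grad(x):
    # Analytic gradient of y, used only for the numerical check
    a = np.tanh(W @ x)
    return (2 * a) @ ((1 - a ** 2)[:, None] * W)

# Central finite differences of the gradient, one column per x_j
eps = 1e-6
H_fd = np.stack([(grad(x + eps * e) - grad(x - eps * e)) / (2 * eps)
                 for e in np.eye(n)], axis=1)

print(np.allclose(H_yx, H_fd, atol=1e-6))
```

The subscript strings mirror the index names h, k, i, j used in the equations, so each `einsum` call can be read off the corresponding formula.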

### coding samples
Now I will look at concrete implementations. To stay compatible with the notation above, I assume that our layer maps x to a and that the whole model finally outputs y.

In [9]:
import numpy as np

In [10]:
m = 4
n = 4
a = np.empty((m, n, n), dtype='S5')  # 5-byte strings; 'S5' replaces the legacy 'a5' alias

In [11]:
for i, j, k in np.ndindex(*a.shape):
    a[i,j,k] = '{} {} {}'.format(i,j,k)

In [12]:
a[0,1,2]

b'0 1 2'

In [13]:
a[np.arange(n), np.arange(n), np.arange(n)] = 1  # writes only the entries a[i, i, i]

In [14]:
a

array([[[b'1', b'0 0 1', b'0 0 2', b'0 0 3'],
        [b'0 1 0', b'0 1 1', b'0 1 2', b'0 1 3'],
        [b'0 2 0', b'0 2 1', b'0 2 2', b'0 2 3'],
        [b'0 3 0', b'0 3 1', b'0 3 2', b'0 3 3']],

       [[b'1 0 0', b'1 0 1', b'1 0 2', b'1 0 3'],
        [b'1 1 0', b'1', b'1 1 2', b'1 1 3'],
        [b'1 2 0', b'1 2 1', b'1 2 2', b'1 2 3'],
        [b'1 3 0', b'1 3 1', b'1 3 2', b'1 3 3']],

       [[b'2 0 0', b'2 0 1', b'2 0 2', b'2 0 3'],
        [b'2 1 0', b'2 1 1', b'2 1 2', b'2 1 3'],
        [b'2 2 0', b'2 2 1', b'1', b'2 2 3'],
        [b'2 3 0', b'2 3 1', b'2 3 2', b'2 3 3']],

       [[b'3 0 0', b'3 0 1', b'3 0 2', b'3 0 3'],
        [b'3 1 0', b'3 1 1', b'3 1 2', b'3 1 3'],
        [b'3 2 0', b'3 2 1', b'3 2 2', b'3 2 3'],
        [b'3 3 0', b'3 3 1', b'3 3 2', b'1']]], dtype='|S5')
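
One place where this indexing pattern is useful is an element-wise layer. That is an assumption here, since the notebook does not state it: if $a_h = f(x_h)$, then $(H_{ax})_{h,i,j}$ is non-zero only for $h = i = j$. A minimal sketch with `tanh` as the hypothetical activation:

```python
import numpy as np

n = 4
x = np.linspace(-1.0, 1.0, n)  # arbitrary sample input (assumption)
a = np.tanh(x)
d2 = -2 * a * (1 - a ** 2)     # tanh''(x), element-wise

# Hessian of an element-wise layer: only the entries [h, h, h] are written
H_ax = np.zeros((n, n, n))
idx = np.arange(n)
H_ax[idx, idx, idx] = d2

print(np.count_nonzero(H_ax))  # number of non-zero entries
```

Passing the same index array three times selects the entries `[0, 0, 0]`, `[1, 1, 1]`, and so on, just like the cells showing `b'1'` in the output above.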