# Computing gradients in TensorFlow

## Partial derivatives using pure Python

Let's say we have a differentiable function:

\begin{equation}
f(x) = 2 x_1^2 + 3 x_1 x_2
\end{equation}

In [1]:
def f(x1, x2):
    return 2 * x1 ** 2 + 3 * x1 * x2

The partial derivatives of this function are easy to find analytically:

\begin{align}
\frac{\partial{f}}{\partial x_1} & = 4x_1 + 3x_2\\
\frac{\partial{f}}{\partial x_2} & = 3x_1 
\end{align}

So, for the point $x=(2,1)$, the result will be $(11,6)$. To check that everything goes as expected, we can approximate the partial derivatives with respect to both variables using the definition with a small $\epsilon$:

\begin{equation}
\frac{\partial}{{\partial x}}f \left( x \right) = \mathop {\lim }\limits_{\epsilon \to 0} \frac{{f\left( {x + \epsilon } \right) - f\left( x \right)}}{\epsilon }
\end{equation}

In [2]:
x1, x2 = 2, 1
eps = 1e-04

In [3]:
(f(x1 + eps, x2) - f(x1, x2)) / (eps)

11.000200000026439

In [4]:
(f(x1, x2 + eps) - f(x1, x2)) / (eps)

6.00000000000378

Great!
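The check above can be wrapped in a small helper. The sketch below is only an illustration: `numerical_gradient` is a hypothetical name, and it uses central differences instead of the forward difference from the definition, which usually gives a smaller approximation error for the same `eps`.

```python
def f(x1, x2):
    return 2 * x1 ** 2 + 3 * x1 * x2


def numerical_gradient(func, point, eps=1e-4):
    """Approximate every partial derivative of func at point (central differences)."""
    grads = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += eps   # shift only the i-th coordinate
        backward[i] -= eps
        grads.append((func(*forward) - func(*backward)) / (2 * eps))
    return grads


approx = numerical_gradient(f, (2.0, 1.0))
print(approx)  # close to the analytical result (11, 6)
```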

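## Partial derivatives using TensorFlow

We will do the same, but this time we will use TensorFlow to calculate the results. It may not be as interesting, but it will certainly be more efficient: TensorFlow uses automatic differentiation, so the gradients are exact up to floating-point precision instead of finite-difference approximations.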

In [5]:
import tensorflow as tf

In [6]:
x1, x2 = tf.Variable(2.), tf.Variable(1.)

In [7]:
with tf.GradientTape() as tape:
    y = f(x1, x2)
gradients = tape.gradient(y, [x1, x2])

In [8]:
[g.numpy() for g in gradients]

[11.0, 6.0]

Within the `tf.GradientTape` context, TensorFlow records each operation applied to any watched variable. But be careful! To save memory, TensorFlow releases the tape's contents as soon as the `.gradient()` method is called, so a second call on the same tape fails. To avoid this, you can create the tape with `persistent=True`, but do so only when there is a good reason, and delete the tape (`del tape`) once you are done with it.
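A minimal sketch of the difference between a default and a persistent tape. The variable names and the functions `y = x**2` and `z = y**2` are chosen only for illustration.

```python
import tensorflow as tf

x = tf.Variable(3.0)

# Default tape: its contents are released after the first gradient() call.
with tf.GradientTape() as tape:
    y = x ** 2
first = tape.gradient(y, x)  # dy/dx = 2x
try:
    tape.gradient(y, x)
    second_call_failed = False
except RuntimeError:
    second_call_failed = True

# Persistent tape: gradient() can be called several times.
with tf.GradientTape(persistent=True) as tape:
    y = x ** 2
    z = y ** 2  # z = x^4
dy_dx = tape.gradient(y, x)  # 2x
dz_dx = tape.gradient(z, x)  # 4x^3
del tape  # drop the reference so the recorded operations can be freed

print(second_call_failed)
print(dy_dx.numpy(), dz_dx.numpy())
```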

By default, the tape records all the operations involving trainable variables (because the default value of the tape's `watch_accessed_variables` parameter is `True`). We can also track the operations that involve a constant by calling `tape.watch(my_constant)` at the beginning of the context, or set `watch_accessed_variables=False` and select the tensors we want to track through the `watch` method. This is useful, for example, when the loss function includes a term based on the gradient with respect to the inputs.
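Both options can be sketched as follows. The names `c` and `v` and the function `y = v * c**2` are made up for this example; sources that were never watched come back as `None`.

```python
import tensorflow as tf

c = tf.constant(3.0)  # constants are not watched automatically
v = tf.Variable(2.0)  # trainable variables are watched by default

# Option 1: watch a constant explicitly.
with tf.GradientTape() as tape:
    tape.watch(c)
    y = v * c ** 2
dy_dv, dy_dc = tape.gradient(y, [v, c])  # c^2 and 2*v*c

# Option 2: disable automatic watching and choose what to track by hand.
with tf.GradientTape(watch_accessed_variables=False) as tape:
    tape.watch(c)
    y = v * c ** 2
unwatched, watched = tape.gradient(y, [v, c])  # v was never watched

print(dy_dv.numpy(), dy_dc.numpy())
print(unwatched, watched.numpy())
```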

## Higher order derivatives

We can even compute second (or higher) order derivatives by nesting tapes. For example:

In [9]:
def f(x):
    return 5*x**3

In this case,

\begin{align}
\frac{\partial f}{\partial x} & = 15 x^2 \\
\frac{\partial^2 f}{\partial x^2} & = 30 x
\end{align}

In [10]:
x = tf.Variable(0.1)
with tf.GradientTape() as tape1:
    with tf.GradientTape() as tape2:
        y = f(x)
    dy_dx  = tape2.gradient(y, x)
d2y_dx2 = tape1.gradient(dy_dx, x)

print(f"dy/dx at x={x.numpy():.2f}: {dy_dx.numpy():.2f}")
print(f"d2y/dx2 at x={x.numpy():.2f}: {d2y_dx2.numpy():.2f}")

dy/dx at x=0.10: 0.15
d2y/dx2 at x=0.10: 3.00
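As in the first section, the second derivative can be cross-checked without TensorFlow. This is a sketch using the standard central second difference; `second_derivative` and the step `h` are hypothetical choices for the example.

```python
def f(x):
    return 5 * x ** 3


def second_derivative(func, x, h=1e-4):
    """Central second difference: (f(x + h) - 2 f(x) + f(x - h)) / h^2."""
    return (func(x + h) - 2 * func(x) + func(x - h)) / h ** 2


approx = second_derivative(f, 0.1)
print(approx)  # close to 30 * 0.1 = 3.0
```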


## Derivatives of different variables

If we ask for the gradient of several targets at once, TensorFlow returns the sum of the gradients of each target, not one gradient per target.

**Tip:** Until now we have only passed a single tensor as the target and a variable or a list of variables as the sources of the `.gradient()` method. However, both parameters also accept dictionaries.

In [11]:
x = tf.Variable(2.0)
with tf.GradientTape(persistent=True) as tape:
  y0 = x**2
  y1 = -4 * x

print(tape.gradient({'y0': y0, 'y1': y1}, x).numpy())
print(tape.gradient(y0, x).numpy(), tape.gradient(y1, x).numpy())

0.0
4.0 -4.0
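The same behaviour shows up with a list of targets, and a dictionary can also be used for the sources. In this sketch the extra variable `w` and the dictionary keys are assumptions made for illustration.

```python
import tensorflow as tf

x = tf.Variable(2.0)
w = tf.Variable(3.0)

with tf.GradientTape(persistent=True) as tape:
    y0 = w * x ** 2
    y1 = -4 * x

# Dictionary of sources: the result is a dictionary with the same keys.
by_name = tape.gradient(y0, {"x": x, "w": w})  # 2*w*x and x^2

# List of targets: the gradients of y0 and y1 are summed.
summed = tape.gradient([y0, y1], x)  # 2*w*x - 4
del tape

print(by_name["x"].numpy(), by_name["w"].numpy())
print(summed.numpy())
```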


However, if the source is a single tensor containing several components (all of them going through the same element-wise calculation), we get the derivative for each component.

In [12]:
x = tf.linspace(-1.0, 1.0, 3)

with tf.GradientTape() as tape:
  tape.watch(x)  # x is a constant tensor, so it must be watched explicitly
  y = tf.nn.sigmoid(x)

dy_dx = tape.gradient(y, x)
dy_dx.numpy()

array([0.19661194, 0.25      , 0.19661193], dtype=float32)
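These values can be checked against the closed-form derivative of the sigmoid, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. A small pure-Python sketch, where the helper `sigmoid` is defined here only for the check:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Closed-form derivative evaluated at the three points of tf.linspace(-1.0, 1.0, 3)
expected = [sigmoid(x) * (1.0 - sigmoid(x)) for x in (-1.0, 0.0, 1.0)]
print(expected)
```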

## Jacobian

Now we know how to compute the derivatives of a single value with respect to a set of variables. Let's now see how to compute the derivatives of a vector (a one-dimensional tensor).

If you ask for the gradients of, for example, an array of losses, TensorFlow computes the gradient of the sum of all of them. To get the derivative of each component with respect to each variable before they are summed, we need the tape's `jacobian()` method.
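The relation between the two methods can be sketched on a small example. The two-output function below is made up for the illustration; summing the rows of the Jacobian recovers what `gradient()` returns.

```python
import tensorflow as tf

x = tf.Variable([2.0, 3.0])

with tf.GradientTape(persistent=True) as tape:
    # Two outputs that both depend on x[1]
    y = tf.stack([x[0] * x[1], x[1] ** 2])

grad = tape.gradient(y, x)  # gradient of y[0] + y[1]
jac = tape.jacobian(y, x)   # one row per output, one column per input
del tape

print(grad.numpy())
print(jac.numpy())
print(tf.reduce_sum(jac, axis=0).numpy())  # same values as grad
```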

In [13]:
def f(x):
    u1 = 2 * x[0] ** 2
    u2 = x[1] ** 3
    u3 = x[2] + x[1]
    return tf.stack([u1, u2, u3])

\begin{equation}
\mathcal{J}_u(x_1, x_2, x_3) =
\begin{bmatrix}
  \frac{\partial u_1}{\partial x_1} & 
    \frac{\partial u_1}{\partial x_2} & 
    \frac{\partial u_1}{\partial x_3} \\[1ex]
  \frac{\partial u_2}{\partial x_1} & 
    \frac{\partial u_2}{\partial x_2} & 
    \frac{\partial u_2}{\partial x_3} \\[1ex]
  \frac{\partial u_3}{\partial x_1} & 
    \frac{\partial u_3}{\partial x_2} & 
    \frac{\partial u_3}{\partial x_3}
\end{bmatrix}
\end{equation}

In [14]:
x = tf.Variable([1.0, 1.0, 1.0])
with tf.GradientTape() as tape:
    y = f(x)
tape.jacobian(y, x)

<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[4., 0., 0.],
       [0., 3., 0.],
       [0., 1., 1.]], dtype=float32)>
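As a check, we can write the Jacobian by hand. Taking the three components returned by `f` as $u_1 = 2x_1^2$, $u_2 = x_2^3$ and $u_3 = x_3 + x_2$:

\begin{equation}
\mathcal{J}_u(x_1, x_2, x_3) =
\begin{bmatrix}
  4 x_1 & 0 & 0 \\[1ex]
  0 & 3 x_2^2 & 0 \\[1ex]
  0 & 1 & 1
\end{bmatrix}
\end{equation}

At $x = (1, 1, 1)$ this evaluates to the matrix returned by `tape.jacobian`.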

## Derivatives involving matrix operations

We can also calculate derivatives with respect to any variable involved in matrix operations.

In [15]:
W = tf.Variable(tf.random.normal((3, 1)), name='W')
b = tf.Variable(tf.zeros(1, dtype=tf.float32), name='b')
X = tf.constant([[1., 2., 3.], [4., 5., 6.]])
y_true = tf.constant([[5.], [16.]])

with tf.GradientTape() as tape:
  y = X @ W + b
  loss = tf.reduce_mean((y - y_true)**2)

In [16]:
dloss_dW, dloss_db = tape.gradient(loss, [W, b])
print(dloss_dW.numpy(), dloss_db.numpy())

[[ -79.16642]
 [-104.15854]
 [-129.15067]] [-24.992123]
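Since `W` is initialised randomly, the exact numbers change on every run. One way to gain confidence in them is to compare with the analytical gradients of the mean squared error, $\frac{2}{n} X^\top (XW + b - y)$ for `W` and $\frac{2}{n} \sum_i (XW + b - y)_i$ for `b`. The sketch below assumes the same setup as the cell above; `residual`, `expected_dW` and `expected_db` are names introduced only for the check.

```python
import tensorflow as tf

W = tf.Variable(tf.random.normal((3, 1)), name="W")
b = tf.Variable(tf.zeros(1, dtype=tf.float32), name="b")
X = tf.constant([[1., 2., 3.], [4., 5., 6.]])
y_true = tf.constant([[5.], [16.]])

with tf.GradientTape() as tape:
    residual = X @ W + b - y_true
    loss = tf.reduce_mean(residual ** 2)
dloss_dW, dloss_db = tape.gradient(loss, [W, b])

# Analytical gradients of the mean squared error
n = tf.cast(tf.shape(X)[0], tf.float32)
expected_dW = (2.0 / n) * tf.matmul(X, residual, transpose_a=True)
expected_db = (2.0 / n) * tf.reduce_sum(residual, axis=0)

print(tf.reduce_max(tf.abs(dloss_dW - expected_dW)).numpy())
print(tf.reduce_max(tf.abs(dloss_db - expected_db)).numpy())
```

Both printed differences should be zero up to float32 rounding.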


This is especially useful when working with deep learning models and, of course, we can do exactly the same with the variables inside a Keras layer or model:

In [17]:
layer = tf.keras.layers.Dense(1, activation='relu')

with tf.GradientTape() as tape:
  # Forward pass
  y = layer(X)
  loss = tf.reduce_mean((y - y_true)**2)

# Calculate gradients with respect to every trainable variable
grad = tape.gradient(loss, layer.trainable_variables)

In [18]:
[g.numpy() for g in grad]

[array([[-41.165585],
        [-52.70707 ],
        [-64.24856 ]], dtype=float32),
 array([-11.541485], dtype=float32)]
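Gradients like these are what a training loop consumes. As a rough sketch (not a recommended training setup), here is plain gradient descent on the linear model from the previous cells. The zero initialisation, the learning rate and the number of steps are assumptions chosen so that the example is deterministic and stable on this tiny data set.

```python
import tensorflow as tf

X = tf.constant([[1., 2., 3.], [4., 5., 6.]])
y_true = tf.constant([[5.], [16.]])
W = tf.Variable(tf.zeros((3, 1)), name="W")
b = tf.Variable(tf.zeros(1), name="b")
learning_rate = 0.01  # assumed value, small enough for this data

losses = []
for step in range(50):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((X @ W + b - y_true) ** 2)
    dW, db = tape.gradient(loss, [W, b])
    # Move each variable a small step against its gradient
    W.assign_sub(learning_rate * dW)
    b.assign_sub(learning_rate * db)
    losses.append(float(loss.numpy()))

print(losses[0], losses[-1])  # the loss should decrease
```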