In [None]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

TensorFlow version: 2.6.0


One of TensorFlow's advantages is its built-in support for automatic differentiation and gradient computation.

Let's first understand **what automatic differentiation means**.

# What are automatic differentiation and gradients?

<a href="https://en.wikipedia.org/wiki/Automatic_differentiation">Automatic differentiation</a> is a set of techniques for evaluating the derivative of a function specified by a computer program.

# How is automatic differentiation useful in neural networks?

Automatic differentiation is useful for implementing machine learning algorithms such as backpropagation, which trains a neural network using the gradient of the loss with respect to the model's parameters.

# How to compute gradients with TensorFlow?

To differentiate automatically, TensorFlow needs to remember what operations happen in what order during the forward pass. Then, during the backward pass, TensorFlow traverses this list of operations in reverse order to compute gradients.
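
The record-then-replay idea can be sketched in plain Python. This is only a toy illustration and not how TensorFlow is implemented; the names `Value` and `ToyTape` are hypothetical and exist only for this example. The tape stores each operation together with its local derivatives during the forward pass, then walks the list in reverse to accumulate gradients.

```python
class Value:
    """A scalar with a slot for its accumulated gradient."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0


class ToyTape:
    """Hypothetical toy tape: records operations, then replays them in reverse."""
    def __init__(self):
        # Each entry: (output, [(input, local_derivative), ...])
        self.ops = []

    def mul(self, a, b):
        out = Value(a.data * b.data)
        # d(out)/da = b and d(out)/db = a
        self.ops.append((out, [(a, b.data), (b, a.data)]))
        return out

    def gradient(self, target, source):
        target.grad = 1.0
        # Backward pass: traverse the recorded operations in reverse order
        # and apply the chain rule at each step.
        for out, inputs in reversed(self.ops):
            for inp, local in inputs:
                inp.grad += out.grad * local
        return source.grad


tape = ToyTape()
x = Value(3.0)
y = tape.mul(x, x)           # forward pass: y = x * x
dy_dx = tape.gradient(y, x)  # backward pass
print(dy_dx)                 # 6.0
```

Because `x` appears twice as an input of the multiplication, its gradient receives two contributions of 3.0 each, which add up to 2x = 6.0.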


TensorFlow provides the `tf.GradientTape` API for automatic differentiation; that is, computing the gradient of a computation with respect to some inputs, usually `tf.Variable`s. TensorFlow "records" relevant operations executed inside the context of a `tf.GradientTape` onto a "tape". TensorFlow then uses that tape to compute the gradients of a "recorded" computation using reverse-mode differentiation.

Let's look at an example:

In [None]:
x = tf.Variable(3.0)

with tf.GradientTape() as tape:
  y = x**2

Once we've recorded some operations, we can use `GradientTape.gradient(target, sources)` to calculate the gradient of some target (often a loss) relative to some source (often the model's variables).

In [None]:
# dy = 2x * dx
dy_dx = tape.gradient(y, x)
dy_dx.numpy()

6.0

In [None]:
# combined code
x = tf.Variable(3.0)

with tf.GradientTape() as tape:
  y = x**2

# dy = 2x * dx
dy_dx = tape.gradient(y, x)
print(dy_dx.numpy())

6.0
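
As a cross-check that does not depend on TensorFlow, the same derivative can be approximated numerically with a central finite difference. This is a minimal sketch; the helper `f` and the step size `h` are choices made for this illustration only. Unlike automatic differentiation, this approach only approximates the derivative and its accuracy depends on `h`.

```python
def f(x):
    # Same function as in the tape example: y = x**2
    return x ** 2

x0 = 3.0
h = 1e-5  # step size, chosen for illustration

# Central difference: (f(x + h) - f(x - h)) / (2h)
approx = (f(x0 + h) - f(x0 - h)) / (2 * h)
exact = 2 * x0  # analytic derivative dy/dx = 2x

print(approx, exact)
```

The two printed values should agree to several decimal places, matching the `6.0` returned by `tape.gradient`.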


### Using GradientTape with tensors

In [None]:
w = tf.Variable(tf.random.normal((3, 2)), name='w')
b = tf.Variable(tf.zeros(2, dtype=tf.float32), name='b')
x = [[1., 2., 3.]]

with tf.GradientTape(persistent=True) as tape:
  y = x @ w + b
  loss = tf.reduce_mean(y**2)

[dl_dw, dl_db] = tape.gradient(loss, [w, b])
print(w.shape)
print(dl_dw.shape)

(3, 2)
(3, 2)


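Here, to get the gradient of `loss` with respect to both variables, we pass both as sources to the `gradient` method. Each gradient has the same shape as its source. The tape is created with `persistent=True` so that `gradient` can be called on it more than once.

We can also pass a dictionary of variables: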

In [None]:
my_vars = {
    'w': w,
    'b': b
}

grad = tape.gradient(loss, my_vars)
grad['b']

<tf.Tensor: shape=(2,), dtype=float32, numpy=array([6.556717 , 1.8888472], dtype=float32)>
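
As a sanity check on the `w` and `b` example, the tape's gradients can be compared with hand-derived ones. The sketch below assumes the same setup as above. Since `loss = mean(y**2)`, the chain rule gives `dloss/dy = 2 * y / N` (with `N` the number of entries in `y`), `dloss/db` is that quantity summed over the batch axis, and `dloss/dw = transpose(x) @ dloss/dy`. The `expected_*` names are introduced only for this illustration.

```python
import tensorflow as tf

w = tf.Variable(tf.random.normal((3, 2)), name='w')
b = tf.Variable(tf.zeros(2, dtype=tf.float32), name='b')
x = tf.constant([[1., 2., 3.]])

with tf.GradientTape() as tape:
    y = x @ w + b
    loss = tf.reduce_mean(y**2)

dl_dw, dl_db = tape.gradient(loss, [w, b])

# Hand-derived gradients via the chain rule.
n = tf.cast(tf.size(y), tf.float32)
dl_dy = 2.0 * y / n                            # shape (1, 2)
expected_dl_db = tf.reduce_sum(dl_dy, axis=0)  # shape (2,)
expected_dl_dw = tf.transpose(x) @ dl_dy       # shape (3, 2)

# Both differences should be close to zero (up to float32 rounding).
print(tf.reduce_max(tf.abs(dl_db - expected_dl_db)).numpy())
print(tf.reduce_max(tf.abs(dl_dw - expected_dl_dw)).numpy())
```

With two entries in `y`, `dloss/dy` reduces to `y` itself, which is why `grad['b']` above equals the layer output.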

# Default tape behavior and the reasons for it

By default, the tape records all operations after a trainable `tf.Variable` is accessed. The reasons for this are:

* The tape needs to know which operations to record in the forward pass to calculate the gradients in the backward pass.
* The tape holds references to intermediate outputs, so you don't want to record unnecessary operations.
* The most common use case involves calculating the gradient of a loss with respect to all of a model's trainable variables.


The following example fails to compute three of the four gradients: a `tf.Tensor` is not "watched" by default, and a `tf.Variable` created with `trainable=False` is not watched either. Only the trainable variable `x0` gets a gradient.

In [None]:
# A trainable variable
x0 = tf.Variable(3.0, name='x0')

# Not trainable
x1 = tf.Variable(3.0, name='x1', trainable=False)

# Not a Variable: A variable + tensor returns a tensor.
x2 = tf.Variable(2.0, name='x2') + 1.0

# Not a variable
x3 = tf.constant(3.0, name='x3')

with tf.GradientTape() as tape:
  y = (x0**2) + (x1**2) + (x2**2)

grad = tape.gradient(y, [x0, x1, x2, x3])

for g in grad:
  print(g)

tf.Tensor(6.0, shape=(), dtype=float32)
None
None
None


# How to find out which variables the tape is watching?

We can list the variables being watched by the tape using the `GradientTape.watched_variables` method.

In [None]:
[var.name for var in tape.watched_variables()]

['x0:0']

# How to control what the tape watches?



To record gradients with respect to a `tf.Tensor`, we need to call `GradientTape.watch(x)`.

In [None]:
x = tf.constant(3.0)
with tf.GradientTape() as tape:
  tape.watch(x)
  y = x**2

# dy = 2x * dx
dy_dx = tape.gradient(y, x)
print(dy_dx.numpy())

6.0
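
The same mechanism applies to a model's input. The sketch below compares an unwatched and a watched input tensor; the fixed weights `w` and the tape names `unwatched` and `watched` are chosen only for this illustration.

```python
import tensorflow as tf

x = tf.constant([[1., 2., 3.]])  # a plain tensor, not a tf.Variable
w = tf.ones((3, 2))              # fixed weights, for illustration only

# Without watch(): nothing is recorded, so the gradient is None.
with tf.GradientTape() as unwatched:
    loss = tf.reduce_sum(x @ w)
no_grad = unwatched.gradient(loss, x)
print(no_grad)  # None

# With watch(): operations involving x are recorded.
with tf.GradientTape() as watched:
    watched.watch(x)
    loss = tf.reduce_sum(x @ w)
dl_dx = watched.gradient(loss, x)
print(dl_dx.numpy())  # [[2. 2. 2.]]
```

Each entry of `x` feeds two output columns with weight 1, so every entry of the gradient is 2.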


Conversely, to disable the default behavior of watching all `tf.Variable`s, set `watch_accessed_variables=False` when creating the gradient tape. The calculation below uses two variables, but only connects the gradient for one of them.

In [None]:
x0 = tf.Variable(0.0)
x1 = tf.Variable(10.0)

with tf.GradientTape(watch_accessed_variables=False) as tape:
  tape.watch(x1)
  y0 = tf.math.sin(x0)
  y1 = tf.nn.softplus(x1)
  y = y0 + y1
  ys = tf.reduce_sum(y)

# dys/dx1 = exp(x1) / (1 + exp(x1)) = sigmoid(x1)
grad = tape.gradient(ys, {'x0': x0, 'x1': x1})

print('dy/dx0:', grad['x0'])
print('dy/dx1:', grad['x1'].numpy())

dy/dx0: None
dy/dx1: 0.9999546


**Note:** Because `GradientTape.watch` was not called on `x0`, no gradient is computed with respect to it.
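
The printed value for `x1` can be checked by hand, because the derivative of softplus is the sigmoid function, as the code comment states. Below is a small plain-Python check; the variable names are chosen only for this illustration.

```python
import math

x1 = 10.0

# Two equivalent forms of sigmoid(x1).
sigmoid = 1.0 / (1.0 + math.exp(-x1))
same_value = math.exp(x1) / (1.0 + math.exp(x1))

print(round(sigmoid, 7))  # 0.9999546

# If x0 had been watched, its gradient would have been d/dx sin(x) = cos(x) at x0 = 0.
print(math.cos(0.0))      # 1.0
```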

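# How to compute higher-order derivatives?

We can compute higher-order derivatives by nesting tapes: the outer tape records the gradient computation performed by the inner tape, so that gradient can itself be differentiated.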

In [None]:
a = tf.Variable(tf.random.normal(shape=(2, 2)))
b = tf.Variable(tf.random.normal(shape=(2, 2)))

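with tf.GradientTape() as outer_tape:
  with tf.GradientTape() as tape:
    c = tf.sqrt(tf.square(a) + tf.square(b))
  # First derivative: computed inside outer_tape so that it is recorded there.
  dc_da = tape.gradient(c, a)
# Second derivative: differentiate the first derivative with respect to a.
d2c_da2 = outer_tape.gradient(dc_da, a)
print(d2c_da2)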

tf.Tensor(
[[0.31170034 0.32279664]
 [1.153167   0.56869346]], shape=(2, 2), dtype=float32)
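
Since the example above uses random values, its output is hard to verify by eye. A scalar version with fixed inputs can be checked against the analytic result: for `c = sqrt(a**2 + b**2)`, the first derivative is `dc/da = a / c` and the second is `d2c/da2 = b**2 / c**3`. The values `a = 3` and `b = 4` (so `c = 5`) and the name `inner_tape` are chosen only for this sketch.

```python
import tensorflow as tf

a = tf.Variable(3.0)
b = tf.Variable(4.0)

with tf.GradientTape() as outer_tape:
    with tf.GradientTape() as inner_tape:
        c = tf.sqrt(tf.square(a) + tf.square(b))
    # Computed inside outer_tape so the gradient computation is recorded.
    dc_da = inner_tape.gradient(c, a)
d2c_da2 = outer_tape.gradient(dc_da, a)

# Analytic values: a / c = 3 / 5 and b**2 / c**3 = 16 / 125.
print(dc_da.numpy(), 3.0 / 5.0)
print(d2c_da2.numpy(), 16.0 / 125.0)
```

Each line should print two values that agree up to float32 rounding.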


# References

https://www.tensorflow.org/api_docs/python/tf/GradientTape

https://www.tensorflow.org/guide/autodiff


https://github.com/farhadkamangar/CSE5368 

https://cognitiveclass.ai/courses/course-v1:BigDataUniversity+ML0120EN+v2

