
# Automatic Differentiation

**Source**: Aston Zhang et al. 2023. [Dive into deep learning](https://d2l.ai/).

All modern deep learning frameworks take the complex calculus work off our plates by offering *automatic differentiation* (often shortened to *autograd*).

In [1]:
import torch

## A simple function
Let's assume that we are interested in differentiating the function $y=2x^Tx$ with respect to the column vector $x$.

In [3]:
# assign x an initial value
x = torch.arange(4.0, requires_grad=True)
print(f"x={x.detach().numpy()}",
      f"\nGradient: {x.grad}") # the gradient is None by default

x=[0. 1. 2. 3.] 
Gradient: None


In [4]:
# compute y = 2 * dot(x, x), i.e. twice the dot product of x with itself
y = 2 * torch.dot(x, x)
print(f"y = {y.detach().numpy()}")

y = 28.0


In [5]:
y.backward() # compute the gradient of y with respect to x
# x.grad now holds dy/dx, which should equal 4x

# access the gradient
print(f"x={x.detach().numpy()}",
      f"\nGradient={x.grad.detach().numpy()}")

x=[0. 1. 2. 3.] 
Gradient=[ 0.  4.  8. 12.]


We already know that the gradient of the function $y=2x^Tx$ with respect to $x$ should be $4x$.
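
The claim can be checked by hand. A short derivation, writing $x_i$ for the components of $x$ (the index notation is introduced here only for illustration):

$$
y = 2x^Tx = 2\sum_i x_i^2
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x_i} = 4x_i
\quad\Longrightarrow\quad
\nabla_x y = 4x.
$$

For $x = [0, 1, 2, 3]$ this gives $[0, 4, 8, 12]$, which matches the gradient printed above.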

In [6]:
# check elementwise that the computed gradient equals 4x
print(x.grad == 4 * x)

tensor([True, True, True, True])
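
When no closed form is available for comparison, a numerical check is a common alternative. The sketch below is not part of the original notebook: it compares autograd against central finite differences for the same function, using the illustrative names `y_fn` and `x64` and a step size `h` chosen arbitrarily. It uses `float64` so that rounding error stays small.

```python
import torch

def y_fn(x):
    # Same function as in the text: y = 2 * x^T x
    return 2 * torch.dot(x, x)

# Separate tensor (hypothetical name) so the notebook's x is left untouched
x64 = torch.arange(4.0, dtype=torch.float64, requires_grad=True)
y_fn(x64).backward()
autograd_grad = x64.grad.clone()

# Central finite differences: (y(x + h*e_i) - y(x - h*e_i)) / (2h)
h = 1e-6  # assumed step size
x0 = x64.detach()
numeric_grad = torch.zeros_like(x0)
for i in range(x0.numel()):
    e = torch.zeros_like(x0)
    e[i] = h
    numeric_grad[i] = (y_fn(x0 + e) - y_fn(x0 - e)) / (2 * h)

print(autograd_grad)
print(torch.allclose(autograd_grad, numeric_grad, atol=1e-6))
```

The finite-difference loop needs two function evaluations per input component, whereas `backward()` obtains the whole gradient in a single pass. This is one reason autograd is preferred in practice.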


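Now let's compute another function of `x` and take its gradient.

> **Note**: PyTorch accumulates gradients. Each call to `backward()` adds the new gradient to `x.grad` instead of overwriting it, so we reset the buffer with `x.grad.zero_()` before computing an unrelated gradient.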

In [11]:
x.grad.zero_() # reset the gradient in place
y = x.sum()
y.backward() # the gradient of sum(x) with respect to x is a vector of ones
x, x.grad # inspect x and its gradient

(tensor([0., 1., 2., 3.], requires_grad=True), tensor([1., 1., 1., 1.]))
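
To make the accumulation behaviour concrete, here is a small sketch (not from the original notebook) that calls `backward()` twice without resetting, then once more after a reset. The tensor name `w` is illustrative and chosen so that it does not interfere with `x`.

```python
import torch

w = torch.arange(4.0, requires_grad=True)  # hypothetical name, separate from x

w.sum().backward()
first = w.grad.clone()         # gradient of sum(w): all ones

w.sum().backward()             # no reset: the new gradient is added to the old one
accumulated = w.grad.clone()   # all twos

w.grad.zero_()                 # reset the buffer in place
w.sum().backward()
after_reset = w.grad.clone()   # all ones again

print(first, accumulated, after_reset, sep="\n")
```

Accumulation is convenient when a loss is split over several mini-batches, but it silently corrupts results when a reset is forgotten between unrelated computations.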

## Backward for non-scalar variables
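When $y$ is a vector, the most natural representation of the derivative of $y$ with respect to a vector $x$ is a matrix called the *Jacobian* that contains the partial derivatives of each component of $y$ with respect to each component of $x$.

By default, `backward()` only works on a scalar. For a non-scalar $y$ we must either pass a `gradient` vector to `backward()` or reduce $y$ to a scalar first. Here we sum the components, so the result is the gradient of $\sum_i y_i$ rather than the full Jacobian.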

In [None]:
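x.grad.zero_() # reset the gradient
y = x * x # elementwise square, so y is a vector
y.sum().backward() # reduce y to a scalar before calling backward
x, x.grad # the gradient of sum(x * x) is 2x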

(tensor([0., 1., 2., 3.], requires_grad=True), tensor([0., 2., 4., 6.]))
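
The relation between the summed gradient and the Jacobian can be sketched as follows. This example is an addition, not part of the original notebook. It passes a vector of ones through the `gradient` argument of `backward()`, which should be equivalent to summing first, and then builds the full Jacobian with `torch.autograd.functional.jacobian` for comparison. The names `v` and `J` are illustrative.

```python
import torch
from torch.autograd.functional import jacobian

v = torch.arange(4.0, requires_grad=True)  # hypothetical name, separate from x
y = v * v

# Option 1: supply the vector that multiplies the Jacobian from the left.
# A vector of ones gives the same result as y.sum().backward().
y.backward(gradient=torch.ones_like(y))
grad_via_vector = v.grad.clone()

# Option 2: materialise the full Jacobian dy_i / dv_j.
# For an elementwise square it is diagonal, with entries 2 * v_i.
J = jacobian(lambda t: t * t, v.detach())

print(grad_via_vector)
print(J)
print(torch.equal(torch.ones(4) @ J, grad_via_vector))
```

Building the full Jacobian costs far more than a single vector-Jacobian product, so it is usually worth doing only for small problems or for debugging.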

## Detaching Computation
Suppose we have $z = x * y$ and $y = x * x$ but we want to focus on the direct influence of $x$ on $z$ rather than the influence conveyed via $y$. In this case, we can create a new variable $u$ that takes the same value as $y$ but whose provenance (how it was created) has been wiped out.

> `detach()` returns a new tensor with the same values that is treated as a constant: no gradient flows through it during backpropagation. The original tensor stays in the computational graph.

In [None]:
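x.grad.zero_() # reset the gradient
y = x * x
u = y.detach() # same values as y, but treated as a constant
z = u * x

z.sum().backward() # only the direct dependence of z on x is differentiated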
print(x, '\n', y, '\n', u, '\n', x.grad == u)

tensor([0., 1., 2., 3.], requires_grad=True) 
 tensor([0., 1., 4., 9.], grad_fn=<MulBackward0>) 
 tensor([0., 1., 4., 9.]) 
 tensor([True, True, True, True])


Because `u` is treated as a constant, the gradient of `z = x * u` with respect to `x` is `u`, not `3 * x * x` as you might have expected from `z = x * x * x`.
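
The following sketch (an addition, with the illustrative names `p`, `q`, `r`) contrasts the three cases: the detached copy carries no graph information, the original product can still be differentiated, and the fully attached cube yields the usual `3 * p * p`.

```python
import torch

p = torch.arange(4.0, requires_grad=True)  # hypothetical name, separate from x
q = p * p
r = q.detach()

# The detached copy has no gradient history
print(r.requires_grad, r.grad_fn)

# Gradient through the detached copy: only the explicit factor p counts
(r * p).sum().backward()
grad_detached = p.grad.clone()   # equals r, i.e. p * p

# q itself is still attached, so its own gradient is available
p.grad.zero_()
q.sum().backward()
grad_q = p.grad.clone()          # equals 2 * p

# Without detaching, the whole chain is differentiated
p.grad.zero_()
(p * p * p).sum().backward()
grad_full = p.grad.clone()       # equals 3 * p * p

print(grad_detached, grad_q, grad_full, sep="\n")
```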

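## Gradients and Python Control Flow
Automatic differentiation still works when the function is built from Python control flow such as loops and conditionals. The graph is recorded as the code runs, so the number of loop iterations and the branch taken may depend on the input value.
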
In [None]:
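def f(a):
    b = a * 2
    # keep doubling until the norm reaches 1000; the iteration count depends on a
    while b.norm() < 1000:
        b = b * 2
    # the branch taken depends on the sign of b
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c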

In [None]:
a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()

In [None]:
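# f is piecewise linear in a: f(a) = k * a for some constant k that depends on a,
# so the gradient k should equal d / a
a.grad == d / a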

tensor(True)
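
To see the piecewise-linear behaviour with reproducible numbers, the sketch below (an addition to the notebook) runs the same `f` on two fixed inputs, one positive and one negative, instead of a random draw. The input values `0.5` and `-3.0` and the `results` list are arbitrary illustrative choices.

```python
import torch

def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

results = []
for value in (0.5, -3.0):  # assumed sample inputs, one per branch
    a = torch.tensor(value, requires_grad=True)
    d = f(a)
    d.backward()
    slope = (d / a).item()  # the constant k in f(a) = k * a
    results.append((value, a.grad.item(), slope))
    print(f"a={value}, grad={a.grad.item()}, d/a={slope}")
```

Two caveats apply. For `a = 0` the loop in `f` never terminates, because the norm stays at zero. And with arbitrary floating-point inputs, `torch.isclose(a.grad, d / a)` may be a safer comparison than exact equality.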