
In [1]:
import torch

## 2.5.1 A Simple Function

Suppose we want to differentiate the function $y=2\mathbf{x}^\top\mathbf{x}$ with respect to the column vector $\mathbf{x}$.
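
As a reference point for the autograd result, the gradient can be worked out by hand. Writing the dot product as a sum over components (the index $i$ is notation introduced here for illustration):

$$y = 2\mathbf{x}^\top\mathbf{x} = 2\sum_i x_i^2, \qquad \frac{\partial y}{\partial x_i} = 4x_i, \qquad \nabla_{\mathbf{x}}\, y = 4\mathbf{x}.$$

For $\mathbf{x} = (0, 1, 2, 3)^\top$ this gives $y = 28$ and $\nabla_{\mathbf{x}}\, y = (0, 4, 8, 12)^\top$.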

In [2]:
# Create the input vector x that we will differentiate with respect to
x = torch.arange(4.0)
x

tensor([0., 1., 2., 3.])

In [4]:
# Can also create x = torch.arange(4.0, requires_grad = True)
x.requires_grad_(True)
x.grad # The gradient is None by default

In [13]:
# calculate our function of x and assign the result to y
y = 2 * torch.dot(x, x)
y

tensor(28., grad_fn=<MulBackward0>)

In [14]:
# Now we can take the gradient of y with respect to x by calling its backward method.
y.backward()
x.grad

tensor([ 0.,  4.,  8., 12.])

In [15]:
# Verify that the gradient matches the analytic result 4 * x
x.grad == 4 * x

tensor([True, True, True, True])

In [17]:
# PyTorch does not automatically reset the gradient buffer when we record a new gradient;
# instead, the new gradient is added to the one already stored.
# To reset the gradient buffer, we can call x.grad.zero_() as follows:
x.grad.zero_() # reset
y = x.sum()
y.backward()
x.grad

tensor([1., 1., 1., 1.])
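
To make the accumulation behaviour concrete, the following minimal sketch runs the same backward pass twice without resetting, then once more after resetting. The variable names `first`, `accumulated`, and `after_reset` are introduced here for illustration only.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# First backward pass: x.grad holds 4 * x.
y = 2 * torch.dot(x, x)
y.backward()
first = x.grad.clone()

# Second backward pass on a freshly built graph, without zeroing:
# the new gradient is added to the stored one, giving 8 * x.
y = 2 * torch.dot(x, x)
y.backward()
accumulated = x.grad.clone()

# After zeroing the buffer, another pass yields 4 * x again.
x.grad.zero_()
y = 2 * torch.dot(x, x)
y.backward()
after_reset = x.grad.clone()

print(first, accumulated, after_reset)
```

In a notebook, re-running a cell that calls `backward()` has the same effect as the second pass here, so stale gradients are worth ruling out when a check against the analytic gradient fails.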

## 2.5.2 Backward for Non-Scalar Variables
When y is a vector, the most natural interpretation of the derivative of y with respect to a vector x is a matrix called the *Jacobian* that contains the partial derivatives of each component of y with respect to each component of x.
Likewise, for higher-order y and x, the differentiation result could be an even higher-order tensor.

Because deep learning frameworks vary in how they interpret gradients of non-scalar tensors, PyTorch takes some steps to avoid confusion. Invoking backward on a non-scalar raises an error unless we tell PyTorch how to reduce the object to a scalar.

More formally, we need to provide some vector $\mathbf{v}$ such that backward will compute $\mathbf{v}^\top \partial_{\mathbf{x}} \mathbf{y}$ rather than $\partial_{\mathbf{x}} \mathbf{y}$.

In [18]:
x.grad.zero_()
y = x * x
y.backward(gradient=torch.ones(len(y)))
x.grad

tensor([0., 2., 4., 6.])
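
As a hedged illustration of the vector-Jacobian product described above, the sketch below first shows that `backward()` on a non-scalar fails without a `gradient` argument, then compares the result of `backward(gradient=v)` against $\mathbf{v}^\top J$ computed from an explicitly built Jacobian. It uses `torch.autograd.functional.jacobian`, which the original cell does not use; the names `raised`, `vjp`, `J`, and `explicit` are illustrative.

```python
import torch
from torch.autograd.functional import jacobian

x = torch.arange(4.0, requires_grad=True)

# Calling backward() on a non-scalar without `gradient` raises a RuntimeError.
y = x * x
try:
    y.backward()
    raised = False
except RuntimeError:
    raised = True

# Passing v makes backward compute the vector-Jacobian product v^T J.
y = x * x
v = torch.ones(len(y))
y.backward(gradient=v)
vjp = x.grad.clone()

# Build the full Jacobian explicitly; for y = x * x it is diag(2 * x).
J = jacobian(lambda t: t * t, x.detach())
explicit = v @ J

print(raised, vjp, explicit)
```

With $\mathbf{v}$ set to a vector of ones, the product sums the rows of the Jacobian, which is the same as calling `y.sum().backward()`.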

## 2.5.3 Detaching Computation

Sometimes, we wish to move some calculations outside of the recorded computational graph.
In this case we need to *detach* the respective part of the computational graph from the final result, so that the detached value is treated as a constant during backpropagation.


In [28]:
x.grad.zero_()
y = x*x
u = y.detach()
z = u*x

z.sum().backward()
x.grad == u

tensor([True, True, True, True])

In [29]:
# While this procedure detaches y's ancestors from the graph leading to z, the computational graph leading to y persists and thus
# we can calculate the gradient of y with respect to x.
x.grad.zero_()
y.sum().backward()
x.grad == 2 * x

tensor([True, True, True, True])
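
To see what detaching changes, this small sketch computes the gradient of the same product with and without `detach()`. The names `grad_detached` and `grad_full` are illustrative and not part of the original notes.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# With detach: u is treated as a constant, so d(u * x)/dx = u = x^2.
y = x * x
u = y.detach()
(u * x).sum().backward()
grad_detached = x.grad.clone()

# Without detach: z = y * x = x^3, so dz/dx = 3 * x^2.
x.grad.zero_()
(y * x).sum().backward()
grad_full = x.grad.clone()

print(u.requires_grad, grad_detached, grad_full)
```

The detached tensor `u` carries the same values as `y` but no gradient history, which is why only the factor `x` contributes to the first gradient.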

## 2.5.4 Gradients and Python Control Flow

One benefit of using automatic differentiation is that even if building the computational graph of a function required passing through a maze of Python control flow, we can still calculate the gradient of the resulting variable.

In [30]:
def f(a):
  b = a * 2
  while b.norm() < 1000:
    b = b * 2
  if b.sum() > 0:
    c = b
  else:
    c = 100 * b
  return c

In [33]:
# We call this function, passing it a random value as input
a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()

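Its dependence on the input is quite simple: `f` is a linear function of `a` with a piecewise-defined scale, that is, $d = f(a) = k\,a$ where the constant $k$ depends only on which branches were taken. As such, the ratio $d / a = k$ is constant on each piece, and it must match the gradient of $d$ with respect to $a$.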

In [34]:
a.grad == d/a

tensor(True)
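
Because the random input makes the check above hard to reproduce, here is a sketch that repeats it for a few fixed inputs chosen to exercise different branches. The helper `grad_and_ratio` and the specific input values are assumptions made for illustration, and `f` is restated (with a compact return) so that the block runs on its own.

```python
import torch

def f(a):
    # Same logic as the function above: double until the norm reaches 1000,
    # then scale by 100 if the result is not positive.
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    return b if b.sum() > 0 else 100 * b

def grad_and_ratio(value):
    # Hypothetical helper: returns (gradient of f at a, f(a) / a).
    a = torch.tensor(value, requires_grad=True)
    d = f(a)
    d.backward()
    return a.grad.item(), (d / a).item()

# Positive input (loop runs), negative input (else branch), large input (loop skipped).
results = {v: grad_and_ratio(v) for v in (0.5, -3.0, 1500.0)}
for v, (grad, ratio) in results.items():
    print(v, grad, ratio)
```

Each input lands on a different linear piece, so the scale differs between inputs, but on every piece the gradient equals the ratio. Note that an input of exactly zero would never leave the `while` loop, since doubling zero never raises the norm.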