# Automatic Differentiation

Computing gradients is a crucial step in nearly all optimisation algorithms used for deep learning. While the calculations themselves are often not enormously complex, carrying them out by hand is tedious and error-prone.

Thankfully, all modern machine learning frameworks implement automatic differentiation (often termed autograd), which builds a computational graph recording how each value depends on the others, and then applies the chain rule backwards through that graph to compute the gradients. This process is called _backpropagation_.

In [1]:
import torch
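
To make the chain-rule idea concrete, here is a minimal sketch that compares autograd with a derivative worked out by hand. The function $y = \sin(x^2)$ and the input value 1.5 are illustrative assumptions, not part of the original notebook.

```python
import math

import torch

# Hypothetical example: y = sin(x^2), built from two elementary steps.
x_val = 1.5
x = torch.tensor(x_val, requires_grad=True)
u = x ** 2          # inner step, du/dx = 2x
y = torch.sin(u)    # outer step, dy/du = cos(u)

# Autograd walks the recorded graph from y back to x.
y.backward()

# The same derivative from applying the chain rule by hand: dy/dx = cos(x^2) * 2x.
manual = math.cos(x_val ** 2) * 2 * x_val

print(x.grad.item(), manual)
```

The two printed values should agree up to floating-point precision, since `x` is stored in 32-bit floats while the manual calculation uses Python's 64-bit floats.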


## A simple function

Let us say that we want to compute the gradient of a function $y = 2\mathbf{x}^{\intercal}\mathbf{x}$ with respect to the column vector $\mathbf{x}$.
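
The expected gradient can be worked out by hand, which gives something to check the autograd result against. Writing the dot product as a sum over components:

$$
y = 2\mathbf{x}^{\intercal}\mathbf{x} = 2\sum_{i} x_i^2,
\qquad
\frac{\partial y}{\partial x_i} = 4x_i,
\qquad
\nabla_{\mathbf{x}}\, y = 4\mathbf{x}.
$$

For instance, with $\mathbf{x} = (0, 1, 2, 3)^{\intercal}$ this gives $y = 2(0 + 1 + 4 + 9) = 28$ and $\nabla_{\mathbf{x}}\, y = (0, 4, 8, 12)^{\intercal}$.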

In [4]:
x = torch.arange(4.0)
x

tensor([0., 1., 2., 3.])

In [7]:
# Before we compute the gradient of y with respect to x, we need a place to store it.
# Deep learning computes derivatives a great many times, so rather than allocating new memory
# on every pass, the gradient is stored on the tensor itself, in x.grad.
x.requires_grad_(True)
print(x.grad) # None by default

None


In [11]:
y = 2*torch.dot(x, x)
print(y)

tensor(28., grad_fn=<MulBackward0>)


In [13]:
# We can now take the gradient of y with respect to x by calling its .backward() method

y.backward()
x.grad

tensor([ 0.,  4.,  8., 12.])

In [14]:
# We know that the gradient of y with respect to x should be 4x. We can verify this:

x*4 == x.grad


tensor([True, True, True, True])

In [18]:
# Now let's calculate another function. Note that PyTorch does not automatically reset the gradient buffer when computing a new gradient;
# instead, it adds the new gradient to the existing one. This behaviour comes in handy when we want to optimise the sum of multiple objective functions.
x.grad.zero_()
y = x.sum()
y.backward()
x.grad

tensor([1., 1., 1., 1.])
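
The accumulation behaviour mentioned in the comment above can be seen directly by skipping the call to `zero_()`. This is a minimal sketch; the names `first`, `accumulated` and `after_reset` are introduced here for illustration only.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# First backward pass: the gradient of sum(x) is a vector of ones.
x.sum().backward()
first = x.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old one.
x.sum().backward()
accumulated = x.grad.clone()

# Zero the buffer in place, then run the backward pass again.
x.grad.zero_()
x.sum().backward()
after_reset = x.grad.clone()

print(first)        # tensor([1., 1., 1., 1.])
print(accumulated)  # tensor([2., 2., 2., 2.])
print(after_reset)  # tensor([1., 1., 1., 1.])
```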

## Backward for non-scalar variables

When y is a vector, the natural representation of the gradient of y with respect to x is a matrix known as the _Jacobian_, which contains the derivative of each component of y with respect to each component of x. 

While the Jacobian does show up in advanced ML, more often we want to sum the gradients of each component of y with respect to the full vector x, yielding a vector with the same shape as x.

As an example, we will often have a vector holding the value of a loss function calculated separately for each example in a batch. Here, we just want to sum up the gradients computed for each example.

To avoid confusion, PyTorch raises an error when backward is called on a non-scalar unless we tell it how to reduce the object to a scalar. More formally, we need to provide some vector $\mathbf{v}$ so that PyTorch will compute $\mathbf{v}^{\intercal}\partial_{\mathbf{x}}\mathbf{y}$ and not just $\partial_{\mathbf{x}}\mathbf{y}$. Confusingly, this argument is named gradient.

In [35]:
x.grad.zero_()
y = x * x
print(y)
y.backward(gradient=torch.ones(len(y))) # Faster: y.sum().backward()
x.grad

tensor([0., 1., 4., 9.], grad_fn=<MulBackward0>)


tensor([0., 2., 4., 6.])
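
To see how the vector-Jacobian product relates to the full Jacobian, the sketch below builds the Jacobian explicitly with `torch.autograd.functional.jacobian` and compares $\mathbf{v}^{\intercal}\partial_{\mathbf{x}}\mathbf{y}$ with the result of `backward`. The helper name `square` is an assumption made for this example.

```python
import torch
from torch.autograd.functional import jacobian

x = torch.arange(4.0, requires_grad=True)

def square(t):
    # Element-wise square, the same function as y = x * x above.
    return t * t

# Full Jacobian of y = x * x: a 4x4 diagonal matrix with entries 2 * x_i.
J = jacobian(square, x)

# Vector-Jacobian product via backward, with v chosen as a vector of ones.
v = torch.ones(4)
y = square(x)
y.backward(gradient=v)

print(J)
print(torch.allclose(v @ J, x.grad))  # True
```

With $\mathbf{v}$ set to all ones, the product simply sums the rows of the Jacobian, which is why it matches `y.sum().backward()`.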

## Detaching Computations

Sometimes we wish to detach part of a computation, so that an intermediate result is treated as a constant and excluded from the computation of gradients.

For example, suppose $z = x * y$ and $y = x * x$, but we want to focus on the direct influence of $x$ on $z$ rather than the influence conveyed through the intermediary $y$. We define a new variable $u$ that takes the same value as $y$ but whose provenance is detached from the original source. $u$ thus has no ancestors in the computational graph, so taking the gradient of $z = x * u$ returns $u$ (treating $u$ as a constant), which equals $x^2$, rather than $3x^2$, the result when the influence of $x$ on $y$ is taken into account.

In [36]:
x.grad.zero_()
y = x * x
u = y.detach()
z = x * u

z.sum().backward()
x.grad

tensor([0., 1., 4., 9.])

In [37]:
z

tensor([ 0.,  1.,  8., 27.], grad_fn=<MulBackward0>)

Note that while the link from y to z has been cut, the graph connecting x to y is still intact, and so we can compute the derivative of y with respect to x.

In [38]:
x.grad.zero_()
y.sum().backward()
print(x.grad)

print(x.grad == 2*x) # y = x^2 so dy/dx = 2x

tensor([0., 2., 4., 6.])
tensor([True, True, True, True])
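
For comparison, the following sketch computes the gradient of $z$ both with and without detaching the intermediate value, so that the $3x^2$ and $x^2$ results can be seen side by side. The variable names `z_full`, `z_detached`, `grad_full` and `grad_detached` are introduced here for illustration.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# Without detach: z = x * (x * x) = x^3, so dz/dx = 3x^2.
z_full = x * (x * x)
z_full.sum().backward()
grad_full = x.grad.clone()

# With detach: u is treated as a constant, so dz/dx = u = x^2.
x.grad.zero_()
u = (x * x).detach()
z_detached = x * u
z_detached.sum().backward()
grad_detached = x.grad.clone()

print(grad_full)      # tensor([ 0.,  3., 12., 27.])
print(grad_detached)  # tensor([0., 1., 4., 9.])
```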


## Gradients and Python control flow

In the previous examples, we considered only simple functions. In practice, however, a result often depends on intermediate auxiliary functions, loops, and conditionals. With autograd we can still compute gradients through these.

To illustrate this, let's look at the following function, where the number of iterations of the while loop depends on the value of the argument a passed to the function.

In [46]:
def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else: 
        c = 100 * b
    return c

In [47]:
# We initialise a with a random input, so we couldn't possibly know what form the computational graph will take
a = torch.randn(size=(), requires_grad=True)
d = f(a)

# this actually still works though, which is wild
d.backward()

In [42]:
# Even though the function f is pretty contrived, it is piecewise linear in a: for any given input, f(a) = k * a
# for some constant k that depends on which branches were taken. The gradient should therefore equal d / a.
d

tensor(-185575.1406, grad_fn=<MulBackward0>)

In [44]:
a.grad

tensor(819200.)

In [45]:
d/a

tensor(819200., grad_fn=<DivBackward0>)
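
The check above can be repeated for several inputs to confirm that the gradient equals `d / a` regardless of which path through the loop and the conditional is taken. This is a sketch using a handful of fixed, arbitrarily chosen inputs instead of random ones so that it is reproducible; the list `inputs` and the `results` bookkeeping are assumptions added for illustration.

```python
import torch

def f(a):
    # Same function as above: the loop count and the branch depend on a.
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

# Arbitrary non-zero test inputs (a = 0 would never leave the while loop).
inputs = [-7.0, -0.5, 0.3, 2.0]
results = []

for value in inputs:
    a = torch.tensor(value, requires_grad=True)
    d = f(a)
    d.backward()
    # On any single path f(a) = k * a, so the gradient k should equal d / a.
    matches = torch.allclose(a.grad, d / a)
    results.append(matches)
    print(value, a.grad.item(), matches)
```

Each input follows a different number of loop iterations, and the negative inputs take the `else` branch, yet the gradient matches `d / a` in every case.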

### Basics

1. Attach gradients to those variables with respect to which we would like to compute the gradients
2. Record the computation of the target value
3. Execute the backpropagation function to compute the derivatives
4. Access the resulting gradient
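
The four steps can be sketched in a few lines, reusing the function $y = 2\mathbf{x}^{\intercal}\mathbf{x}$ from the start of this notebook:

```python
import torch

# 1. Attach gradients to the variable we want to differentiate with respect to.
x = torch.arange(4.0, requires_grad=True)

# 2. Record the computation of the target value.
y = 2 * torch.dot(x, x)

# 3. Execute backpropagation to compute the derivatives.
y.backward()

# 4. Access the resulting gradient.
print(x.grad)  # tensor([ 0.,  4.,  8., 12.])
```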


## Experimentation

In [61]:
# Define x as a tensor with requires_grad=True to track gradients
x = torch.tensor(2.0, requires_grad=True)

# Define a function y = x^3
y = x**3

# First backward pass: stores dy/dx in x.grad
y.backward(retain_graph=True)  # Keep the graph so it can be differentiated again below

# Compute the first derivative dy/dx; create_graph=True records this computation so it can itself be differentiated
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]

# Compute the second derivative d²y/dx²
d2y_dx2 = torch.autograd.grad(dy_dx, x)[0]

print(f"First derivative (dy/dx) at x=2: {dy_dx.item()}")  # Expected: 3 * 2^2 = 12
print(f"Second derivative (d2y/dx2) at x=2: {d2y_dx2.item()}")  # Expected: 6 * 2 = 12 (equal to dy/dx only because 3x^2 = 6x at x = 2)

First derivative (dy/dx) at x=2: 12.0
Second derivative (d2y/dx2) at x=2: 12.0
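
At $x = 2$ the first and second derivatives of $x^3$ both come out as 12, which makes it hard to tell whether the second derivative is really being computed. The sketch below repeats the calculation at $x = 3$, an arbitrarily chosen point where $3x^2 = 27$ and $6x = 18$ differ.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 3

# create_graph=True records the gradient computation so it can be differentiated again.
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)

print(f"First derivative (dy/dx) at x=3: {dy_dx.item()}")       # Expected: 3 * 3^2 = 27
print(f"Second derivative (d2y/dx2) at x=3: {d2y_dx2.item()}")  # Expected: 6 * 3 = 18
```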
