PyTorch: Autograd
====


**Autograd** is PyTorch's package for automatic differentiation of all operations on Tensors. It is a *define-by-run* framework: backpropagation is defined by how the code runs.

`.requires_grad = True`
----

Setting this attribute makes the tensor track all operations on it. After finishing the computation you can call `.backward()` to compute all the gradients automatically; they are accumulated into the `.grad` attribute of each leaf tensor that requires gradients.
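
Because gradients are accumulated into `.grad` rather than overwritten, calling `.backward()` repeatedly adds up the results. A minimal sketch of this behaviour follows; the tensor `w` and the function $w^2$ are chosen only for illustration:

```python
import torch

# a single tracked scalar (hypothetical example value)
w = torch.tensor(3.0, requires_grad=True)

# first backward pass: d(w^2)/dw = 2 * w = 6
(w ** 2).backward()
first = w.grad.clone()

# second backward pass on a fresh graph: the new gradient is added to the old one
(w ** 2).backward()
second = w.grad.clone()

print(first, second)

# reset the accumulated gradient in-place before the next computation
w.grad.zero_()
print(w.grad)
```

This is why training loops usually reset gradients between iterations.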

`.detach()` returns a new tensor that shares the same data but is detached from the computation history, so future computation on it is not tracked.

To stop tracking history in a block of code you can wrap it in `with torch.no_grad():`.


`Function`
----

Every operation performed on a tracked Tensor creates a new `Function` object that performs the computation and records that it happened. Altogether they build up an acyclic graph that encodes the complete history of the computation. A Tensor's `.grad_fn` attribute refers to the `Function` that created the Tensor (except for Tensors created by the user, where `.grad_fn is None`).

If you want to compute the derivatives, you can call `.backward()` on a `Tensor`. If the `Tensor` is a scalar (holds one-element data), there is no need to pass any arguments to `.backward()`. If it has more elements, you need to pass a `gradient` argument, which is a tensor of matching shape.
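
To make the graph of `Function` objects more tangible, the sketch below walks backwards from a result through `grad_fn.next_functions`. The helper `walk` is hypothetical (it is not part of PyTorch), and the exact node names it prints may differ between PyTorch versions:

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
out = ((x - 4) * 5).mean()

def walk(fn, depth=0, lines=None):
    """Collect the names of the graph nodes reachable from `fn`, indented by depth."""
    lines = [] if lines is None else lines
    if fn is None:  # inputs that do not require gradients have no node
        return lines
    lines.append("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, lines)
    return lines

graph_lines = walk(out.grad_fn)
print("\n".join(graph_lines))

# a user-created tensor has no grad_fn
print(x.grad_fn)
```

The last node in the chain is the one that accumulates the gradient into the leaf tensor `x`.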


In [None]:
import torch


In [None]:
# create a tensor with requires_grad
x = torch.ones(2, 2, requires_grad=True)
print(x)


In [None]:
# perform a simple operation and check the `grad_fn`
y = x - 4
print(y.grad_fn)


In [None]:
# perform some more operations
z = y * y * 5
out = z.mean()

print(z)  # see the grad_fn
print(out)  # see the grad_fn


In [None]:
# `.requires_grad_(...)` changes the `requires_grad` flag of an existing tensor in-place
a = torch.randn(2, 2)
a = ((a * 3) / (a - 1))
print(a.requires_grad)
b = (a * a).sum()
print(b.grad_fn)  # <- this will be None
print('----')

a.requires_grad_(True)
print(a.requires_grad)
b = (a * a).sum()
print(b)


In [None]:
# perform backpropagation on `out` and calculate the gradients
# since `out` contains a single scalar, there is no need to pass arguments
out.backward()
# equivalent: out.backward(torch.tensor(1.))
print(x.grad)  # d(out)/dx

# .backward() accumulates gradient only in the leaf nodes
# that is why for y, z the grad is None
print(y.grad)
print(z.grad)
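
If the gradient of an intermediate (non-leaf) tensor is needed, one option is to call `.retain_grad()` on it before the backward pass. A small sketch, rebuilding the same computation as above so that it runs on its own:

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
y = x - 4
y.retain_grad()  # ask autograd to keep the gradient of this non-leaf tensor

out = (y * y * 5).mean()
out.backward()

print(x.is_leaf, y.is_leaf)
print(x.grad)  # d(out)/dx
print(y.grad)  # d(out)/dy, available only because of retain_grad()
```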


# Small mathematical note

Let `out` be called $o$.

$$o = \frac{1}{4} \sum_i{z_i}$$
$$z_i = 5*\left(x_i - 4\right)^2$$
$$z_i\mid_{x_i=1} = 5 \cdot \left(1 - 4\right)^2 = 45$$

Therefore:

$$\frac{\partial o}{\partial x_i} = \frac{5}{2} \left(x_i - 4\right)$$
$$\frac{\partial o}{\partial x_i}\mid_{x_i=1} = -\frac{15}{2} = -7.5$$
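
The derivation can be checked numerically. The sketch below recomputes $o$ from scratch and compares the autograd result with the closed-form expression $\frac{5}{2}\left(x_i - 4\right)$; the variable name `analytic` is introduced here only for the comparison:

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
z = 5 * (x - 4) ** 2
o = z.mean()
o.backward()

# closed-form gradient from the derivation: 5/2 * (x_i - 4)
analytic = 2.5 * (x.detach() - 4)

print(z)       # every entry is 5 * (1 - 4)**2
print(x.grad)  # gradient computed by autograd
print(torch.allclose(x.grad, analytic))
```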

## Gradient

For a vector-valued function $\vec{y}=f\left(\vec{x}\right)$, the gradient of $\vec{y}$ with respect to $\vec{x}$ is the Jacobian matrix:

$$J = \left(\begin{array}{ccc}\frac{\partial y_1}{\partial x_1} & \dots & \frac{\partial y_1}{\partial x_n}\\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \dots & \frac{\partial y_m}{\partial x_n} \end{array}\right)$$

`torch.autograd` is an engine for computing vector-Jacobian products: given any vector $v = \left(\begin{array}{cccc}v_1 & v_2 & \dots & v_m\end{array}\right)^T$, it computes the product $v^T \cdot J$. If $v$ is the gradient of a scalar function $l=g\left(\vec{y}\right)$ (that is, $v = \left(\begin{array}{ccc}\frac{\partial l}{\partial y_1}& \dots & \frac{\partial l}{\partial y_m}\end{array}\right)^T$), then, by the chain rule, the vector-Jacobian product is the gradient of $l$ with respect to $\vec{x}$:

$$J^T \cdot v = \left(\begin{array}{ccc}\frac{\partial y_1}{\partial x_1} & \dots & \frac{\partial y_m}{\partial x_1}\\
\vdots & \ddots & \vdots \\
\frac{\partial y_1}{\partial x_n} & \dots & \frac{\partial y_m}{\partial x_n} \end{array}\right) \left(\begin{array}{c}\frac{\partial l}{\partial y_1} \\ \vdots \\ \frac{\partial l}{\partial y_m}\end{array}\right)=\left(\begin{array}{c}\frac{\partial l}{\partial x_1} \\ \vdots \\ \frac{\partial l}{\partial x_n}\end{array}\right)$$

> Note: $v^T \cdot J$ gives a row vector which can be treated as a column vector by taking $J^T \cdot v$.

This property of the vector-Jacobian product makes it convenient to feed external gradients into a model that has a non-scalar output.
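
As a sanity check of the formula, the sketch below builds the full Jacobian with `torch.autograd.functional.jacobian` and compares $v^T \cdot J$ with what `.backward(v)` stores in `x.grad`. The function `f` and the values of `x` and `v` are hypothetical, picked only so the numbers are easy to follow:

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # hypothetical function from R^3 to R^2, for illustration only
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([0.5, 2.0])  # stands in for dl/dy

# explicit Jacobian of shape (2, 3)
J = jacobian(f, x.detach())

# vector-Jacobian product via autograd, without forming J
y = f(x)
y.backward(v)

print(J)
print(v @ J)   # v^T J computed by hand
print(x.grad)  # the same vector, computed by backward(v)
```

In practice the full Jacobian is rarely materialised; `backward` only ever needs the product.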


In [None]:
# example of vector-Jacobian product
x = torch.randn(3, requires_grad=True)

y = x * 2
while y.detach().norm() < 1000:  # L2 norm, computed outside the graph
    print(f'L2 norm: {y.detach().norm()}')
    y = y * 2

print(y)


In [None]:
# y is not a scalar, so backward() needs an explicit `gradient` argument
# pass a vector with the same shape as y (3 elements)
v = torch.tensor([0.1, 1.0, 0.00001], dtype=torch.float)  # e.g. from a loss function
y.backward(v)

print(x.grad)
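
The printed gradient can be explained directly: `y` is just `x` multiplied by a power of two, so the Jacobian is that factor times the identity matrix and `x.grad` equals `v` scaled by the same factor. A sketch that tracks the factor explicitly (the name `scale` and the seed are assumptions added for this check):

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only to make the run repeatable
x = torch.randn(3, requires_grad=True)

y = x * 2
scale = 2.0  # total factor applied to x so far
while y.detach().norm() < 1000:
    y = y * 2
    scale *= 2

v = torch.tensor([0.1, 1.0, 0.00001])
y.backward(v)

print(scale)
print(x.grad)
print(torch.allclose(x.grad, v * scale))
```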


In [None]:
# stop autograd from tracking history
print(x.requires_grad)
print((2 * x).requires_grad)
print('-----')

# torch.no_grad()
with torch.no_grad():
    print((2 * x).requires_grad)
print('-----')
    
# .detach()
detached = x.detach()
print(detached.requires_grad)
print((detached == x))


Exercises
-----


1. Create 3 torch Tensors (scalars): $x = 1$, $w = 0.27$ and $b = 3$, so that they will be tracking gradients.


2. Calculate the following equation:

$$y = w \cdot x + b$$


3. Compute and display the gradients for each value.


4. Calculate the result of another equation, compute gradients and display them.

$$z = w \cdot \left(x ^ 2 - b\right)$$
