# AUTOMATIC DIFFERENTIATION WITH TORCH.AUTOGRAD
When training neural networks, the most frequently used algorithm is backpropagation. In this algorithm, parameters (model weights) are adjusted according to the gradient of the loss function with respect to each parameter.

In [8]:
import torch

x = torch.ones(5) # input tensor
y = torch.zeros(3) # expected output
w = torch.randn(5, 3, requires_grad=True) # weight
b = torch.randn(3, requires_grad=True) # bias
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

# Tensors, Functions and computational graph
![](https://pytorch.org/tutorials/_images/comp-graph.png)


In [9]:
print('Gradient Function for z = ', z.grad_fn)
print('Gradient function for loss = ', loss.grad_fn)   

Gradient Function for z =  <AddBackward0 object at 0x7fcb95fece50>
Gradient function for loss =  <BinaryCrossEntropyWithLogitsBackward0 object at 0x7fcb95fec430>
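The `grad_fn` attribute is what links each result back into the computational graph. As a minimal sketch (it rebuilds `x`, `w`, `b` and `z` with the same shapes as above so it can run on its own), the snippet below contrasts leaf tensors created directly by the user, which have no `grad_fn`, with a computed tensor such as `z`, and then peeks at the graph nodes that feed into `z`.

```python
import torch

x = torch.ones(5)                          # input, does not require gradients
w = torch.randn(5, 3, requires_grad=True)  # leaf tensor (created by the user)
b = torch.randn(3, requires_grad=True)     # leaf tensor
z = torch.matmul(x, w) + b                 # result of an operation

# Leaf tensors have no grad_fn; computed tensors record the op that made them.
print(w.is_leaf, w.grad_fn)
print(z.is_leaf, z.grad_fn)

# next_functions lists the graph nodes that feed into this node's backward pass.
for fn, _ in z.grad_fn.next_functions:
    print(fn)
```

The exact node names printed for the matrix product may differ between PyTorch versions, so they are not reproduced here.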


# Computing Gradient
To optimize the parameters of the neural network, we need to compute the derivatives of the loss function with respect to them; namely, we need $\frac{\partial loss}{\partial w}$ and $\frac{\partial loss}{\partial b}$ under some fixed values of `x` and `y`. To compute those derivatives, we call `loss.backward()` and then retrieve the values from `w.grad` and `b.grad`.

In [10]:
loss.backward()
print(w.grad)
print(b.grad)

tensor([[0.0326, 0.1933, 0.1209],
        [0.0326, 0.1933, 0.1209],
        [0.0326, 0.1933, 0.1209],
        [0.0326, 0.1933, 0.1209],
        [0.0326, 0.1933, 0.1209]])
tensor([0.0326, 0.1933, 0.1209])
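Every row of `w.grad` is identical because `x` is a vector of ones. The sketch below checks the autograd result against a hand-derived formula. It assumes the default `mean` reduction of `binary_cross_entropy_with_logits`, for which $\frac{\partial loss}{\partial z_j} = \frac{\sigma(z_j) - y_j}{3}$. The seed and the names `dz`, `w_grad_manual` and `b_grad_manual` are illustrative and not part of the original notebook.

```python
import torch

torch.manual_seed(0)  # illustrative seed, so the run is repeatable

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

with torch.no_grad():
    # d loss / d z for mean-reduced BCE with logits
    dz = (torch.sigmoid(z) - y) / y.numel()
    # z_j = sum_i x_i * w_ij + b_j, so d loss / d w_ij = x_i * dz_j
    w_grad_manual = torch.outer(x, dz)
    b_grad_manual = dz

print(torch.allclose(w.grad, w_grad_manual, atol=1e-6))
print(torch.allclose(b.grad, b_grad_manual, atol=1e-6))
```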


# Disabling gradient tracking

In [13]:
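# By default, a result computed from tensors with requires_grad=True is tracked
z = torch.matmul(x,w) + b
print(z.requires_grad)

# Option 1: run the computation inside a torch.no_grad() block
with torch.no_grad():
    z = torch.matmul(x,w) + b
print(z.requires_grad)

# Option 2: detach the result from the computational graph
z = (torch.matmul(x,w) + b).detach()
print(z.requires_grad)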

True
False
False


# Reasons to disable gradient tracking:
- To mark some parameters in the network as frozen parameters.
- To speed up computations when only doing the forward pass, because computations on tensors that do not track gradients are more efficient.
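The first reason can be illustrated with a small sketch. The two-layer `model` below is hypothetical and not part of the original notebook; it only shows that parameters whose `requires_grad` flag is switched off receive no gradient during `backward()`.

```python
import torch

# Hypothetical two-layer model, used only for illustration.
model = torch.nn.Sequential(
    torch.nn.Linear(5, 4),
    torch.nn.ReLU(),
    torch.nn.Linear(4, 3),
)

# Freeze the first layer: its parameters are excluded from gradient tracking.
for p in model[0].parameters():
    p.requires_grad_(False)

x = torch.ones(2, 5)
loss = model(x).sum()
loss.backward()

print(model[0].weight.grad)        # frozen layer: no gradient was computed
print(model[2].weight.grad.shape)  # trainable layer: gradient is populated
```

For the second reason, wrapping inference code in `with torch.no_grad():`, as in the cell above, is the usual approach.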

# Tensor Gradient and Jacobian Product

In many cases we have a scalar loss function and need to compute its gradient with respect to some parameters. However, there are cases when the output is an arbitrary tensor. In this case, PyTorch lets you compute a Jacobian product rather than the actual gradient.



For a vector function  $\vec{y}=f(\vec{x})$, where
$\vec{x}=\langle x_1,\dots,x_n\rangle$ and
$\vec{y}=\langle y_1,\dots,y_m\rangle$, the gradient of
$\vec{y}$ with respect to $\vec{x}$ is given by the **Jacobian
matrix**:
$J=\left(\begin{array}{ccc}
       \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
       \vdots & \ddots & \vdots\\
       \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
       \end{array}\right)$
Instead of computing the Jacobian matrix itself, PyTorch allows you to compute the **Jacobian product** $v^T\cdot J$ for a given vector $v$, which is passed as the argument to `backward`.
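To make the product $v^T\cdot J$ concrete, the sketch below builds the full Jacobian of a small elementwise function with `torch.autograd.functional.jacobian` and compares $v^T\cdot J$ with what `backward(v)` leaves in `.grad`. The function `f` and the values of `x` and `v` are illustrative choices, not taken from the original notebook.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Elementwise function, so its Jacobian is diagonal: dy_i/dx_i = 2 * (x_i + 1)
    return (x + 1).pow(2)

x = torch.tensor([0.0, 1.0, 2.0], requires_grad=True)
v = torch.tensor([1.0, 0.5, 2.0])

# Explicit route: materialize the 3x3 Jacobian, then multiply by v from the left.
J = jacobian(f, x.detach())
vjp_explicit = v @ J

# Autograd route: backward(v) computes v^T J directly, without building J.
y = f(x)
y.backward(v)

print(J)
print(vjp_explicit)
print(x.grad)
```

Materializing `J` is only practical for small inputs. The `backward(v)` route avoids building the matrix, which is why it is the one used during training.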

In [18]:
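inp = torch.eye(5, requires_grad=True)
out = (inp+1).pow(2)
# out is not a scalar, so backward needs a tensor v of the same shape as out
out.backward(torch.ones_like(inp), retain_graph=True)
print("First Call\n", inp.grad)
# A second backward call adds to the gradient already stored in inp.grad
out.backward(torch.ones_like(inp), retain_graph=True)
print("\nSecond call\n", inp.grad)
# Reset the stored gradient in place before calling backward again
inp.grad.zero_()
out.backward(torch.ones_like(inp), retain_graph=True)
print("\nCall after zeroing gradients\n", inp.grad)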

First Call
 tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.],
        [2., 2., 2., 2., 4.]])

Second call
 tensor([[8., 4., 4., 4., 4.],
        [4., 8., 4., 4., 4.],
        [4., 4., 8., 4., 4.],
        [4., 4., 4., 8., 4.],
        [4., 4., 4., 4., 8.]])

Call after zeroing gradients
 tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.],
        [2., 2., 2., 2., 4.]])
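
The second call prints doubled values because `backward` accumulates into `.grad` instead of overwriting it. In a training loop the stored gradients are therefore usually reset on every iteration. The sketch below assumes a plain `torch.optim.SGD` optimizer and a toy loss $\sum_i w_i^2$; both are illustrative choices, not taken from the original notebook.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

for step in range(3):
    opt.zero_grad()        # clear gradients left over from the previous step
    loss = (w ** 2).sum()  # toy loss with gradient 2 * w
    loss.backward()        # fill w.grad for this step only
    opt.step()             # w <- w - lr * w.grad
    print(step, w.detach())
```

Without the `opt.zero_grad()` call, each step would use the sum of all gradients computed so far, just like the second call above.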
