In [1]:
import torch

In [29]:
print(torch.__version__) # This code was tested on version 1.1.0

1.1.0


### Simple automatic gradient calculation

In [2]:
x=torch.tensor(5, requires_grad=True, dtype=torch.float32) # creates a tensor (think of it as a multi-dimensional array; here it holds a single scalar)

In [3]:
x

tensor(5., requires_grad=True)

In [5]:
x.grad  # x.grad is None initially

In [6]:
y=x*x

In [7]:
y

tensor(25., grad_fn=<MulBackward0>)

In [8]:
y.grad

In [9]:
y.backward() # computes the partial derivative ∂y/∂x and stores it in x.grad

In [10]:
x.grad # ∂y/∂x = 2 * x = 2 * 5 = 10

tensor(10.)

### Why do we need `requires_grad`?

* The argument `requires_grad=True` tells PyTorch that we want the gradient with respect to this tensor (i.e. the partial derivative of some output with respect to it). Once it is set, PyTorch tracks the operations performed on the tensor and uses that record to compute the partial derivatives when `.backward()` is called on an output tensor (i.e. a tensor obtained as the result of one or more operations on this tensor).  
* PyTorch can only compute gradients for tensors of floating-point dtype, which is why we pass `dtype=torch.float32`.

In [11]:
x=torch.tensor(5)
y=x*x
y.backward() # you get an error

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

In [12]:
x=torch.tensor(5, requires_grad=True) # the error is raised here: the integer 5 gives an integer tensor, which cannot require gradients
y=x*x
y.backward() # never reached

RuntimeError: Only Tensors of floating point dtype can require gradients
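As a minimal sketch of two ways to avoid the dtype error above (assuming a reasonably recent PyTorch release): a Python float literal such as `5.0` is inferred as the default floating-point dtype (`torch.float32` unless the default has been changed), so no explicit `dtype` is needed, and an existing integer tensor can be converted with `.float()` before gradient tracking is switched on with `.requires_grad_()`. The names `a` and `b` are only for illustration.

```python
import torch

# A float literal is inferred as the default float dtype (float32 unless changed).
a = torch.tensor(5.0, requires_grad=True)

# Convert an existing integer tensor to float, then enable gradient tracking in place.
b = torch.tensor(5).float().requires_grad_()

y = a * a + b * b
y.backward()

print(a.dtype, b.dtype)  # both are floating-point dtypes
print(a.grad, b.grad)    # ∂y/∂a = 2a and ∂y/∂b = 2b, both evaluated at 5
```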

### One more example

In [13]:
x=torch.tensor(5, requires_grad=True, dtype=torch.float32)
z=torch.tensor(3, requires_grad=True, dtype=torch.float32)

In [14]:
y=x*x + z + torch.log(z)

In [15]:
y.backward()

x.grad= ∂y/∂x = 2x = 2*5 = 10

z.grad = ∂y/∂z = 1 + (1/z) = 1 + 1/3 ≈ 1.33

In [16]:
x.grad, z.grad

(tensor(10.), tensor(1.3333))
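The hand-derived values can also be compared against autograd programmatically. The sketch below rebuilds the same expression and checks both gradients with `math.isclose`; the variable names `expected_dx` and `expected_dz` and the tolerance are illustrative choices, not part of the original notebook.

```python
import math
import torch

x = torch.tensor(5.0, requires_grad=True)
z = torch.tensor(3.0, requires_grad=True)

y = x * x + z + torch.log(z)
y.backward()

# Hand-derived partial derivatives evaluated at x=5, z=3.
expected_dx = 2 * 5.0      # ∂y/∂x = 2x
expected_dz = 1 + 1 / 3.0  # ∂y/∂z = 1 + 1/z

# float32 arithmetic only agrees with the exact values up to rounding,
# so compare with a relative tolerance instead of ==.
print(math.isclose(x.grad.item(), expected_dx, rel_tol=1e-5))
print(math.isclose(z.grad.item(), expected_dz, rel_tol=1e-5))
```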

### A more involved example

In [17]:
x=torch.tensor(5, requires_grad=True, dtype=torch.float32)
z=torch.tensor(3, requires_grad=True, dtype=torch.float32)

In [18]:
r1=x*x + z + torch.log(z) # r1=f(x,z), where f=x*x + z + log(z)

In [19]:
r2=r1+torch.sqrt(x)  # r2=g(r1,x), where g=r1 + √x

In [20]:
r3=100*r1+torch.log(r2) # r3=h(r1,r2), where h=100 * r1 + log(r2)

In [21]:
y=r1+r2+r3 # y=u(r1,r2,r3), where u=r1 + r2 +r3

So our y is:
$$y = u(r1,r2,r3) = u(f(x,z), g(f(x,z), x), h(f(x,z), g(f(x,z), x) ) )$$  

Computing the partial derivatives of y with respect to x (∂y/∂x) and z (∂y/∂z) by hand is no longer easy and takes some time. PyTorch does this automatically for us!

In [22]:
y.backward() # voila! ∂y/∂x and ∂y/∂z are calculated and stored in x.grad and z.grad respectively

In [26]:
print(f"∂y/∂x={x.grad}, ∂y/∂z={z.grad}")

∂y/∂x=1020.5499267578125, ∂y/∂z=136.0425567626953
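One way to gain confidence in these numbers without deriving them by hand is a central finite-difference check. The sketch below is an illustration only: the helper `y_of` and the step size `eps` are hypothetical choices, and `float64` is used so that the numerical estimate is not swamped by rounding error.

```python
import torch

def y_of(x, z):
    # Same computation as the cells above, wrapped in a function (hypothetical helper).
    r1 = x * x + z + torch.log(z)
    r2 = r1 + torch.sqrt(x)
    r3 = 100 * r1 + torch.log(r2)
    return r1 + r2 + r3

x = torch.tensor(5.0, dtype=torch.float64, requires_grad=True)
z = torch.tensor(3.0, dtype=torch.float64, requires_grad=True)
y_of(x, z).backward()  # autograd gradients land in x.grad and z.grad

eps = 1e-6
with torch.no_grad():  # no graph is needed for the numerical estimate
    fd_dx = (y_of(x + eps, z) - y_of(x - eps, z)) / (2 * eps)
    fd_dz = (y_of(x, z + eps) - y_of(x, z - eps)) / (2 * eps)

print(x.grad.item(), fd_dx.item())  # autograd vs. finite difference for ∂y/∂x
print(z.grad.item(), fd_dz.item())  # autograd vs. finite difference for ∂y/∂z
```

The two numbers on each line should agree to several decimal places; a large mismatch would point to a mistake in the expression rather than in autograd.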


In [27]:
print(x.grad, z.grad, r1.grad, r2.grad, r3.grad)

tensor(1020.5499) tensor(136.0426) None None None


Why are `r1.grad`, `r2.grad` and `r3.grad` None?
* By default, PyTorch keeps gradients only for leaf tensors (tensors created directly by the user with `requires_grad=True`, i.e. x and z here). The gradients of intermediate tensors such as r1, r2 and r3 are discarded after the backward pass to save memory
* See here for a detailed explanation - https://discuss.pytorch.org/t/why-cant-i-see-grad-of-an-intermediate-variable/94/2
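If the gradient of an intermediate tensor is needed, one option is to call `.retain_grad()` on it before the backward pass. The sketch below repeats the computation above with that call added; it is a minimal illustration rather than a recommendation to retain every intermediate gradient.

```python
import torch

x = torch.tensor(5.0, requires_grad=True)
z = torch.tensor(3.0, requires_grad=True)

r1 = x * x + z + torch.log(z)
r2 = r1 + torch.sqrt(x)
r3 = 100 * r1 + torch.log(r2)

# Ask autograd to keep the gradients of these non-leaf tensors.
for t in (r1, r2, r3):
    t.retain_grad()

y = r1 + r2 + r3
y.backward()

print(x.is_leaf, r1.is_leaf)       # x is a leaf tensor, r1 is not
print(r1.grad, r2.grad, r3.grad)   # no longer None; r3.grad is ∂y/∂r3 = 1
```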

### Exercise

* What does the following code do?
* How is `y.backward()` different from `r2.backward()`?

In [None]:
x=torch.tensor(5, requires_grad=True, dtype=torch.float32)
z=torch.tensor(3, requires_grad=True, dtype=torch.float32)

r1=x*x + z + torch.log(z) 
r2=r1+torch.sqrt(x)  
r3=100*r1+torch.log(r2)
y=r1+r2+r3

r2.backward()

print(x.grad, z.grad)