Having heard the announcement about Theano from Bengio's lab, as a Theano user I am both sad and happy to see the old hero fade, eclipsed by many rising stars: sad that it can no longer compete with its industrial competitors, and happy to have so many excellent deep learning frameworks to choose from. Recently I started translating some of my old code to PyTorch and have been really impressed by its dynamic nature and clarity. At the very beginning, though, I was very confused by the ``backward()`` function when reading the tutorials and documentation. This motivated me to write this post, to make things a bit easier to understand for other PyTorch beginners. I'll assume that you already know the [``autograd``](http://pytorch.org/docs/master/autograd.html) module and what a [``Variable``](http://pytorch.org/docs/0.1.12/_modules/torch/autograd/variable.html) is, but are a little confused by the definition of ``backward()``.

First, let's recall how a gradient is computed in mathematical terms. For an independent variable $x$ (scalar or vector), let the first operation applied to $x$ be $y = f(x)$, where $y$ is a scalar. Then the gradient of $y$ w.r.t. the components $x_i$ is
$$\nabla y=\begin{bmatrix}
\frac{\partial y}{\partial x_1}\\
\frac{\partial y}{\partial x_2}\\
\vdots
\end{bmatrix}.
$$
Then, at a specific point $x=[X_1, X_2, \dots]$, we get the gradient of $y$ at that point as a vector. With these notions in mind, the confusing part of the function ``torch.autograd.backward(variables, grad_variables=None, retain_graph=None, create_graph=None, retain_variables=None)`` is the parameter ``grad_variables``. And since the function computes the sum of gradients of the given ``variables`` w.r.t. the leaf variables, the ``.grad`` attribute of a leaf variable left me lost in the clouds. Okay, an example is better than a thousand words.

In [1]:
from __future__ import print_function
import torch as T
import torch.autograd
from torch.autograd import Variable
import numpy as np

In [4]:
'''Define a scalar variable, set requires_grad to be true to add it to backward path for computing gradients'''
x = Variable(T.randn(1, 1), requires_grad=True) #x is a leaf created by user, thus grad_fn is none
print('x', x)
#define an operation on x
y = 2 * x
print('y', y)
#define one more operation to check the chain rule
z = y ** 3
print('z', z)


x Variable containing:
-0.8175
[torch.FloatTensor of size 1x1]

y Variable containing:
-3.2701
[torch.FloatTensor of size 1x1]

z Variable containing:
-34.9702
[torch.FloatTensor of size 1x1]



These simple operations define a forward path $z=(2x)^3$. $z$ is the final output ``Variable`` whose gradient we would like to compute, $dz=24x^2dx$, and it is what gets passed as the ``variables`` argument of the ``backward()`` function.
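The differential $dz=24x^2dx$ quoted above follows from the chain rule applied to the two operations in the cell. Spelled out with the same symbols $x$, $y$ and $z$:

$$\frac{dz}{dx}=\frac{dz}{dy}\cdot\frac{dy}{dx}=3y^2\cdot 2=6(2x)^2=24x^2$$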

In [5]:
#yes, it is just as simple as this:
z.backward()

In [6]:
print('z gradient', z.grad)
print('y gradient', y.grad)
print('x gradient', x.grad)

z gradient None
y gradient None
x gradient Variable containing:
 128.3257
[torch.FloatTensor of size 1x1]



The gradients of both $y$ and $z$ are None, since ``backward()`` stores gradients only on the leaves, which is $x$ in this case. At the very beginning, I was expecting something like this:

``x gradient None
y gradient None
z gradient Variable containing:
 128.3257
[torch.FloatTensor of size 1x1]``,
because I thought of the gradient as belonging to the final output $z$.

On a moment's reflection, that convention would be practically chaos if $x$ were a multi-dimensional vector, because $z$ would then have to hold one derivative per component of $x$. Instead, ``x.grad`` should be interpreted as the gradient of $z$ evaluated at $x$. So far, we know the simple usage of ``backward()``. But wait, what the heck is ``grad_variables``? Let's do the above again with some values for it.

In [22]:
x = Variable(T.randn(1, 1), requires_grad=True) #x is a leaf created by user, thus grad_fn is none
print('x', x)
#define an operation on x
y = 2 * x
print('y', y)
#define one more operation to check the chain rule
z = y ** 3
print('z', z)
z.backward(T.FloatTensor([1]), retain_graph=True)
print('Keeping the default value gives')
print('z gradient', z.grad)
print('y gradient', y.grad)
print('x gradient', x.grad)


x Variable containing:
-1.2095
[torch.FloatTensor of size 1x1]

y Variable containing:
-2.4190
[torch.FloatTensor of size 1x1]

z Variable containing:
-14.1552
[torch.FloatTensor of size 1x1]

Keeping the default value gives
z gradient None
y gradient None
x gradient Variable containing:
 35.1098
[torch.FloatTensor of size 1x1]
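
As a sanity check on the number just printed, here is a minimal plain-Python sketch (no PyTorch involved) that compares the closed form $24x^2$ with a central finite difference at the printed value of $x$. The helper names ``z_of``, ``analytic_grad`` and ``numeric_grad`` are made up for this illustration, and $x$ is typed in from the rounded printout, so agreement with ``x.grad`` should only be expected to about three decimals.

```python
def z_of(x):
    """Forward path from the post: z = (2x)^3."""
    return (2.0 * x) ** 3


def analytic_grad(x):
    """Closed-form derivative dz/dx = 24 x^2."""
    return 24.0 * x ** 2


def numeric_grad(f, x, h=1e-5):
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# x copied (rounded) from the printed output above
x0 = -1.2095
print('analytic', analytic_grad(x0))
print('numeric ', numeric_grad(z_of, x0))
```

Both values should land close to the 35.1098 reported by ``x.grad``.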



In [23]:
x.grad.data.zero_()
z.backward(T.FloatTensor([0.1]), retain_graph=True)
print('z gradient', z.grad)
print('y gradient', y.grad)
print('x gradient', x.grad)

z gradient None
y gradient None
x gradient Variable containing:
 3.5110
[torch.FloatTensor of size 1x1]



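As you can see, when the value is set to 0.1, the gradient becomes one tenth of the original gradient (3.5110 versus 35.1098). Let's do it one more time with 0.2. Note that the output below comes from a run with a different random $x$, so it is not 0.2 times 35.1098.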

In [20]:
x.grad.data.zero_()
z.backward(T.FloatTensor([0.2]), retain_graph=True)
print('z gradient', z.grad)
print('y gradient', y.grad)
print('x gradient', x.grad)

z gradient None
y gradient None
x gradient Variable containing:
 4.4785
[torch.FloatTensor of size 1x1]
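
One way to read these results is that the tensor passed to ``backward()`` acts as the gradient arriving from upstream, which the chain rule then multiplies through to the leaf. The toy class below is a plain-Python sketch of that idea. ``Node``, ``scale`` and ``power`` are hypothetical names invented here, and this is not how PyTorch is implemented. It also mimics the behaviour seen above, where only the leaf ends up with a ``.grad``.

```python
class Node:
    """Toy scalar autodiff node (illustration only, not PyTorch's design)."""

    def __init__(self, value, parents=()):
        self.value = value
        # Each entry is (parent_node, local derivative d(self)/d(parent)).
        self.parents = parents
        self.grad = None  # only ever filled in for leaves

    @property
    def is_leaf(self):
        return not self.parents

    def backward(self, upstream=1.0):
        """Push `upstream` back to the leaves, multiplying by local derivatives."""
        if self.is_leaf:
            # Leaves accumulate, like x.grad in the cells above.
            self.grad = (self.grad or 0.0) + upstream
            return
        for parent, local in self.parents:
            parent.backward(upstream * local)


def scale(node, c):
    """y = c * x, with dy/dx = c."""
    return Node(c * node.value, ((node, c),))


def power(node, p):
    """z = y ** p, with dz/dy = p * y ** (p - 1)."""
    return Node(node.value ** p, ((node, p * node.value ** (p - 1)),))


x = Node(-1.2095)   # value copied (rounded) from the In [22] output
y = scale(x, 2.0)
z = power(y, 3)

z.backward(1.0)
grad_full = x.grad

x.grad = None       # plays the role of x.grad.data.zero_()
z.backward(0.1)
grad_tenth = x.grad

print(grad_full, grad_tenth, y.grad, z.grad)
```

With an upstream value of 1 the leaf receives the plain derivative $24x^2$; with 0.1 it receives one tenth of that, while ``y.grad`` and ``z.grad`` stay ``None``.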



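Now let's add one dimension to $x$. We can clearly see that the gradients of $z$ are computed w.r.t. each element of $x$. Note that in this case $z$ is also two-dimensional. Judging from the output, the first call with ``[1, 0]`` fills only the first column of ``x.grad``. Since ``x.grad`` is not zeroed in between, the second call with ``[0, 1]`` adds the second column on top of it.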

In [41]:
x = Variable(T.randn(2, 2), requires_grad=True) #x is a leaf created by user, thus grad_fn is none
print('x', x)
#define an operation on x
y = 2 * x
#print('y', y)
#define one more operation to check the chain rule
z = y ** 3
print('z', z)
z.backward(T.FloatTensor([1, 0]), retain_graph=True)
print('x gradient', x.grad)
z.backward(T.FloatTensor([0, 1]), retain_graph=True)
print('x gradient', x.grad)

x Variable containing:
 1.3448 -0.2924
-0.4746  0.8842
[torch.FloatTensor of size 2x2]

z Variable containing:
 19.4567  -0.2000
 -0.8550   5.5296
[torch.FloatTensor of size 2x2]

x gradient Variable containing:
 43.4041   0.0000
  5.4049   0.0000
[torch.FloatTensor of size 2x2]

x gradient Variable containing:
 43.4041   2.0518
  5.4049  18.7622
[torch.FloatTensor of size 2x2]
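
The two printed gradients can be reproduced by hand. The sketch below is plain Python with no PyTorch. It assumes, based only on the output above, that the length-2 tensor is applied per column of the 2x2 gradient and that the second call accumulates into the first. Newer PyTorch releases may instead require the argument to have the same shape as $z$. The helpers ``weighted_grad`` and ``add`` are hypothetical names for this illustration.

```python
# x copied (rounded) from the printed output above
x = [[1.3448, -0.2924],
     [-0.4746, 0.8842]]


def weighted_grad(x, col_weights):
    """Elementwise dz_ij/dx_ij = 24 * x_ij**2, scaled by a per-column weight."""
    return [[w * 24.0 * v ** 2 for v, w in zip(row, col_weights)]
            for row in x]


def add(a, b):
    """Elementwise sum of two nested lists, mimicking gradient accumulation."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


first = weighted_grad(x, [1.0, 0.0])                 # after backward([1, 0])
second = add(first, weighted_grad(x, [0.0, 1.0]))    # after backward([0, 1])

print(first)
print(second)
```

Up to rounding of the copied inputs, ``first`` should match the first printed ``x.grad`` and ``second`` the accumulated one.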



Then what if we reduce the output to a scalar while $x$ still has several elements? This is a very simplified version of what happens in neural networks, where a scalar loss depends on many parameters.
$$f(x)=\frac{1}{n}\sum_{i=1}^n(2x_i)^3$$
$$\frac{\partial f}{\partial x_i}=\frac{1}{n}\,24x_i^2$$

In [29]:
x = Variable(T.randn(2, 1), requires_grad=True) #x is a leaf created by user, thus grad_fn is none
print('x', x)
#define an operation on x
y = 2 * x
#print('y', y)
#define one more operation to check the chain rule
z = y ** 3
out = z.mean()
print('out', out)
out.backward(T.FloatTensor([1]), retain_graph=True)
print('x gradient', x.grad)

x Variable containing:
-0.4230
-1.2352
[torch.FloatTensor of size 2x1]

out Variable containing:
-7.8402
[torch.FloatTensor of size 1]

x gradient Variable containing:
  2.1467
 18.3074
[torch.FloatTensor of size 2x1]
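
To check this output against the formula for the mean, here is a short plain-Python sketch (no PyTorch) that evaluates $\frac{1}{n}\sum_i(2x_i)^3$ and the per-element derivative $\frac{1}{n}24x_i^2$ at the printed values of $x$. The inputs are typed in from the rounded printout, so the match is only approximate.

```python
# x copied (rounded) from the printed output above
x = [-0.4230, -1.2352]
n = len(x)

# Forward pass: out = mean((2 * x_i) ** 3)
out = sum((2.0 * xi) ** 3 for xi in x) / n

# Gradient of the mean w.r.t. each element: 24 * x_i**2 / n
grad = [24.0 * xi ** 2 / n for xi in x]

print('out', out)
print('grad', grad)
```

Each element of the gradient carries the factor $1/n$ from the mean, which is why the values are half of $24x_i^2$ here.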

