In [7]:
import torch

## Dynamic computational graph in PyTorch
A dynamic computational graph is a data structure that records the operations applied to a tensor so that derivatives can later be computed from them. To activate this tracking for a PyTorch tensor, the tensor must be created with the flag `requires_grad=True`.
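
As a minimal sketch of what the flag changes, the cell below compares a tensor created without tracking and one created with it. The names `a`, `b`, `c` and `d` are illustrative and are not used elsewhere in this notebook.

```python
import torch

# A tensor created without the flag is not tracked by autograd.
a = torch.tensor(3.0)
# The same value with tracking enabled.
b = torch.tensor(3.0, requires_grad=True)

c = a * 2  # no graph is recorded for this result
d = b * 2  # the multiplication is recorded as a node of the graph

print(a.requires_grad, c.grad_fn)
print(b.requires_grad, d.grad_fn)
```

A result computed from a tracked tensor carries a `grad_fn` that points to the operation that produced it, while a result computed only from untracked tensors has `grad_fn` set to `None`.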

#### *tensor.backward()*
By default, the backward method of a PyTorch tensor computes the derivative of that tensor (the result of a chain of operations) with respect to every tensor that requires grad and took part in those operations. Each derivative is stored in the `.grad` attribute of the corresponding tensor.

That is, with `requires_grad=True` the operations applied to a tensor are recorded in the dynamic computational graph. The *backward* method then traverses these recorded operations to compute the derivative of the result with respect to that tensor.

With this mechanism, the derivatives of the cost function with respect to the weights and biases of a deep neural network can be computed (see the next sections).

#### Example 1. Scalar backpropagation
Use the backward method to compute the derivative of the function
$$y(x) = 2 x^{4} + x^{3} + 3 x^{2} + 5 x + 1$$
with respect to $x$, evaluated at $x = 2$.

Defining a tensor-variable that requires grad as x

In [8]:
x = torch.tensor(2.0, requires_grad=True)

Defining the $y$ tensor-operation using $x$

In [9]:
y = 2*x**4 + x**3 + 3*x**2 + 5*x + 1
print(y)

tensor(63., grad_fn=<AddBackward0>)


Calling the backward method on y stores the derivative, evaluated at $x = 2$, in the grad attribute of x:

$$\frac{dy}{dx}=8(x)^3+3(x)^2+6(x)+5 $$
$$\left. \frac{dy}{dx} \right|_{x=2}=8(2)^3+3(2)^2+6(2)+5 = 64+12+12+5 = 93$$

In [10]:
y.backward()
print(x.grad)

tensor(93.)


In [12]:
# verify the solution against the analytic derivative
x.grad == 8*x**3 + 3*x**2 + 6*x +5

tensor(True)
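
Autograd obtains this value by applying the chain rule to the recorded operations, not by finite differences. As a hedged cross-check, the sketch below compares the autograd result with a central finite-difference estimate. The helper `poly`, the tensor `x0` and the step size `h` are illustrative choices that do not come from the original notebook.

```python
import torch

def poly(t):
    # Same polynomial as in Example 1; works for floats and tensors.
    return 2*t**4 + t**3 + 3*t**2 + 5*t + 1

# Derivative from autograd, using double precision.
x0 = torch.tensor(2.0, dtype=torch.float64, requires_grad=True)
poly(x0).backward()

# Central finite difference with an assumed step size h.
h = 1e-4
fd = (poly(2.0 + h) - poly(2.0 - h)) / (2 * h)

print(x0.grad.item(), fd)
print(abs(x0.grad.item() - fd) < 1e-5)
```

The finite-difference estimate only approximates the derivative, so the comparison uses a tolerance instead of exact equality.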

### Multistep vector-matrix backpropagation
A PyTorch tensor can represent a scalar, a vector or a matrix. Arithmetic operators such as `+`, `*` and `**` act element-wise on vectors and matrices: the operation is applied independently to each pair of corresponding elements, so each output element depends only on the input elements at the same position.

Example of element-wise operations

Consider the element-wise sum $\mathbf{C} = \mathbf{A} + \mathbf{B}$

$$
\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix}, \quad \mathbf{B} = \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{pmatrix}
$$

$$
\mathbf{C} = \mathbf{A} + \mathbf{B} = \begin{pmatrix} a_{11} + b_{11} & a_{12} + b_{12} & a_{13} + b_{13} \\ a_{21} + b_{21} & a_{22} + b_{22} & a_{23} + b_{23} \end{pmatrix}
$$

A tensor $x$ that requires grad also tracks these element-wise operations, and when they are backpropagated the derivatives are computed element-wise as well.
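
The element-wise behaviour described above can be sketched in PyTorch as follows. The numeric values of `A` and `B` are arbitrary and chosen only for illustration.

```python
import torch

A = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
B = torch.tensor([[10., 20., 30.], [40., 50., 60.]])

C = A + B  # element-wise sum: c_ij = a_ij + b_ij
P = A * B  # element-wise product, not a matrix product

print(C)
print(P)
```

Note that `*` between two tensors of the same shape multiplies corresponding elements. A matrix product would need `torch.matmul` or the `@` operator, and would not even be defined for two 2x3 matrices.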

### Example 2. Multistep backpropagation

Starting from the matrix $\mathbf{X}$, perform the following operations in PyTorch and backpropagate through the $\mathbf{Z}$ tensor

$$
\mathbf{X} = \begin{pmatrix}
1 & 2 &  3 \\
3 & 2 & 1
\end{pmatrix}
$$

$$ \mathbf{Y} = 3\mathbf{X} + 2$$

$$ \mathbf{Z} = 2\mathbf{Y}^2$$

In [13]:
x = torch.tensor([[1.,2,3],[3 ,2 ,1]], requires_grad=True)
print(x)

tensor([[1., 2., 3.],
        [3., 2., 1.]], requires_grad=True)


In [14]:
y = 3*x + 2
print(y)

tensor([[ 5.,  8., 11.],
        [11.,  8.,  5.]], grad_fn=<AddBackward0>)


In [15]:
z = 2*y**2
print(z)

tensor([[ 50., 128., 242.],
        [242., 128.,  50.]], grad_fn=<MulBackward0>)


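Called without arguments, the backward method only works on scalar outputs. Because $\mathbf{Z}$ is a matrix, the `gradient` argument must be supplied: a tensor with the same shape as $\mathbf{Z}$ that weights each component in the backward pass. With `gradient=torch.ones_like(x)` (here $x$ and $\mathbf{Z}$ have the same shape) every component has weight 1, so `x.grad` holds $\partial z_i / \partial x_i$ for each element.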

In [16]:
z.backward(gradient=torch.ones_like(x))
print(x.grad)

tensor([[ 60.,  96., 132.],
        [132.,  96.,  60.]])
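
One way to read the `gradient` argument: passing a tensor of ones gives the same result as summing the output to a scalar and then calling backward on the sum. The sketch below checks this under the same definitions as Example 2. The names `x1`, `x2`, `z1` and `z2` are illustrative, and the input is built twice so that the two gradients do not accumulate in the same `.grad`.

```python
import torch

# First route: backward on the matrix with a gradient of ones.
x1 = torch.tensor([[1., 2, 3], [3, 2, 1]], requires_grad=True)
z1 = 2 * (3 * x1 + 2) ** 2
z1.backward(gradient=torch.ones_like(z1))

# Second route: reduce to a scalar with sum, then plain backward.
x2 = torch.tensor([[1., 2, 3], [3, 2, 1]], requires_grad=True)
z2 = 2 * (3 * x2 + 2) ** 2
z2.sum().backward()

print(torch.equal(x1.grad, x2.grad))
```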


Verification 

Each component of the tensor $z$, as function of $x_i$ can be written as:

$$z_i = 2(y_i)^2 = 2(3x_i+2)^2$$

To evaluate $\frac {\partial z_i}{\partial x_i}$ the Chain Rule can be used: $\frac{d}{dx} f(g(x)) = f'(g(x))\,g'(x)$

$$f(g(x)) = 2(g(x))^2 $$
$$f'(g(x)) = 4g(x) $$
$$g(x) = 3x+2$$ 
$$g'(x) = 3 $$
$$\frac {\partial z_i}{\partial x_i} = 4g(x_i) 3 = 12(3x_i+2) $$

Evaluating the derivative for the three distinct values taken by the components $x_i$ of the tensor $x$:

$$\left. \frac{\partial z_i}{\partial x_i} \right|_{x_i=1} = 12(3(1)+2) = 60$$
$$\left. \frac{\partial z_i}{\partial x_i} \right|_{x_i=2} = 12(3(2)+2) = 96$$
$$\left. \frac{\partial z_i}{\partial x_i} \right|_{x_i=3} = 12(3(3)+2) = 132$$






In [17]:
# verify the solution against the analytic derivative
x.grad == 12*(3*x+2)

tensor([[True, True, True],
        [True, True, True]])

#### The use of the mean method in backpropagation
In many PyTorch applications the output is reduced to a scalar with *mean* before performing the backward pass

In [18]:
x = torch.tensor([[1.,2,3],[3 ,2 ,1]], requires_grad=True)

In [19]:
y = 3*x + 2

In [24]:
z = 2*y**2
print(z)

tensor([[ 50., 128., 242.],
        [242., 128.,  50.]], grad_fn=<MulBackward0>)


In [21]:
out = z.mean()
print(out)

tensor(140., grad_fn=<MeanBackward0>)


In [22]:
out.backward()

In [23]:
print(x.grad)

tensor([[10., 16., 22.],
        [22., 16., 10.]])
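
As a hedged illustration of how *mean* relates to the `gradient` argument of the backward method, the sketch below compares `z.mean().backward()` with a backward call on the matrix in which every component is weighted by $1/N$, where $N$ is the number of elements. The names `x1`, `x2`, `z1`, `z2` and `weights` are illustrative.

```python
import torch

# First route: average, then backward on the scalar.
x1 = torch.tensor([[1., 2, 3], [3, 2, 1]], requires_grad=True)
z1 = 2 * (3 * x1 + 2) ** 2
z1.mean().backward()

# Second route: backward on the matrix with weights 1/N.
x2 = torch.tensor([[1., 2, 3], [3, 2, 1]], requires_grad=True)
z2 = 2 * (3 * x2 + 2) ** 2
weights = torch.full_like(z2, 1.0 / z2.numel())
z2.backward(gradient=weights)

print(torch.allclose(x1.grad, x2.grad))
```

The comparison uses `torch.allclose` because the factor $1/6$ is not exactly representable in floating point.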


Now the single component of the tensor $out = o$ can be written in terms of $z_j(y_j(x_j))$ as:

$$ o = \frac {1} {6}\sum_{j=1}^{6} z_j(y_j(x_j)) $$

Using the linearity of the derivative:

$$ \left(\displaystyle\sum_{j=1}^nf_j(x)\right)^\prime=\displaystyle\sum_{j=1}^nf_j^\prime(x) $$

the derivative of $o$ with respect to one component $x_i$ is

$$ \frac {\partial o}{\partial x_i} = \displaystyle \frac {1} {6} \sum_{j=1}^6\frac {\partial [z_j(y_j(x_j))]} {\partial x_i} = \displaystyle \frac {1} {6} \sum_{j=1}^6\frac {\partial z_j}{\partial y_j} \frac {\partial y_j}{\partial x_i} $$

Because the operations are element-wise, $y_j$ depends only on $x_j$, so $\frac {\partial y_j}{\partial x_i} = 0$ for $j \neq i$ and only the term $j = i$ of the summation survives:

$$\frac {\partial o}{\partial x_i} = \frac {1} {6} \frac {\partial z_i}{\partial y_i} \frac {\partial y_i}{\partial x_i}  $$

With 

$$\frac {\partial z_i}{\partial y_i} \frac {\partial y_i}{\partial x_i} = \frac {\partial z_i}{\partial x_i} = 12(3x_i+2) $$

$$\frac {\partial o}{\partial x_i} = \frac {1} {6} 12(3x_i+2) = 2(3x_i+2) $$



In [25]:
# verify this solution against the analytic derivative
x.grad == 2*(3*x+2)

tensor([[True, True, True],
        [True, True, True]])
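
The reason only one term of the summation contributes to each $\frac{\partial o}{\partial x_i}$ can also be checked numerically. The sketch below uses `torch.autograd.functional.jacobian` to build the full Jacobian of $\mathbf{Z}$ with respect to $\mathbf{X}$ and shows that it is diagonal once both tensors are flattened. The helper `z_of_x` and the names `x_in`, `J`, `J6`, `off_diag` and `grad_mean` are illustrative.

```python
import torch
from torch.autograd.functional import jacobian

def z_of_x(x):
    # Same element-wise operations as in Example 2.
    return 2 * (3 * x + 2) ** 2

x_in = torch.tensor([[1., 2, 3], [3, 2, 1]])

# Shape is output.shape + input.shape, that is (2, 3, 2, 3).
J = jacobian(z_of_x, x_in)
# Flatten to 6x6 so that J6[j, i] = d z_j / d x_i.
J6 = J.reshape(6, 6)

# Everything outside the diagonal should be zero.
off_diag = J6 - torch.diag(torch.diagonal(J6))
print(torch.count_nonzero(off_diag).item())

# Gradient of the mean: (1/6) * sum over j of d z_j / d x_i.
grad_mean = J6.sum(dim=0) / 6
print(grad_mean.reshape(2, 3))
```

Since every off-diagonal entry vanishes, summing over $j$ leaves only the diagonal term, which matches $2(3x_i+2)$.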