# 2.5 Automatic Differentiation

PyTorch constructs a computational graph that records the dependencies among values (tensors). During backpropagation, gradients are calculated automatically by applying the chain rule along this graph.
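
As a small illustration of this recording, the sketch below inspects the `is_leaf` and `grad_fn` attributes of a few tensors. The tensor values are arbitrary and chosen only for this example; the exact name of the `grad_fn` object may differ between PyTorch versions, so it is printed rather than assumed.

```python
import torch

# Arbitrary example values (not from the text): an input and a weight
x = torch.tensor([[1.], [2.]])
w = torch.tensor([[0.5, 0.5]], requires_grad=True)

# y is produced by an operation on w, so it is recorded in the graph
y = w @ x

# Leaf tensors are created by the user and have no grad_fn
print(f"w: is_leaf={w.is_leaf}, grad_fn={w.grad_fn}")
# Non-leaf tensors keep a reference to the operation that created them
print(f"y: is_leaf={y.is_leaf}, grad_fn={y.grad_fn}")
# next_functions links the node to the nodes of its inputs
print(f"inputs of y's node: {y.grad_fn.next_functions}")
```

A tensor that does not depend on anything requiring gradients, such as `x` here, is also a leaf and has no `grad_fn`.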

Take the example of a single linear layer and squared loss. 

Let the input be a 2-vector $X=[x_0,x_1]^T$ and the output be a scalar. Weight $w=[w_0,w_1]^T$, bias $b \in \mathbb{R}$, and output $y = w^TX + b$. Let the ground-truth target be $z \in \mathbb{R}$ and the squared loss be $l = (y-z)^2$.

Using the chain rule, we can manually (and tediously) calculate the partial derivative with respect to any variable.

For example, 
$$ 
\begin{align}
&\frac{\partial l}{\partial w_0} \\
&= \frac{\partial l}{\partial{y}} \frac{\partial{y}}{\partial (w^TX)} \frac{\partial (w^TX)}{\partial (w_0 * x_0)} \frac{\partial (w_0 * x_0)}{\partial (w_0)} \\ 
&= 2(y-z)* 1 * 1 * x_0 \\
&= 2(y-z)x_0
\end{align} 
$$

If we pick input $X = [1,2]^T$, $w = [0.5,0.5]$, $b = 0.1$ and $z = 1$, we will have $y = 1 * 0.5 + 2 * 0.5 + 0.1 = 1.6$ and $l = (1.6-1)^2 = 0.36$. 

Then $\frac{\partial l}{\partial w_0} = 2 *(1.6-1) * 1 = 1.2$

Similarly, $\frac{\partial l}{\partial w_1} = 2 *(1.6-1) * 2 = 2.4$, and the full gradient $\nabla_w l = \begin{bmatrix}1.2 \\ 2.4\end{bmatrix}$

We can calculate the same result easily with PyTorch automatic differentiation by calling the `backward()` function of a target tensor. This also calculates the gradient for every tensor involved in computing the value of the target tensor (unless a tensor is explicitly marked as not requiring a gradient):

In [None]:
import torch 

# Use the requires_grad argument to control whether the gradient is calculated for the tensor
x = torch.tensor([[1.],[2.]], requires_grad=True)
# The same as setting requires_grad_ later to True:
# x = torch.tensor([[1.],[2.]])
# x.requires_grad_(True)
w = torch.tensor([[0.5,0.5]], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
y = w @ x + b 
# Gradients are disabled by default for tensors created using the torch.tensor constructor
z = torch.tensor(1.0)
loss = (y-z) ** 2
# Use backpropagation to compute gradients
loss.backward()

print(f"Gradient of w:\t{w.grad}")
# flattened gradient of x only for display 
print(f"Gradient of x:\t{x.grad.flatten()}")
print(f"Gradient of b:\t{b.grad}")
# gradient disabled, so is None 
print(f"Gradient of z:\t{z.grad}")


Gradient of w:	tensor([[1.2000, 2.4000]])
Gradient of x:	tensor([0.6000, 0.6000])
Gradient of b:	1.2000000476837158
Gradient of z:	None


We can see that the partial derivative with respect to $w_0$ is correctly calculated. In fact, the gradients with respect to $w$, $x$, and $b$ are all calculated. Note that the gradient of $z$ is `None`, as its `requires_grad` attribute is not set to `True`.
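
One way to gain extra confidence in these values is a central finite-difference check, sketched below. The helper name `loss_fn`, the step size `eps`, and the use of `float64` are choices made for this sketch only, not part of the original example.

```python
import torch

def loss_fn(w, x, b, z):
    """Squared loss of a single linear layer (hypothetical helper for this check)."""
    y = w @ x + b
    return ((y - z) ** 2).squeeze()

# Same values as the example above, in float64 to reduce rounding error
x = torch.tensor([[1.], [2.]], dtype=torch.float64)
w = torch.tensor([[0.5, 0.5]], dtype=torch.float64, requires_grad=True)
b = torch.tensor(0.1, dtype=torch.float64)
z = torch.tensor(1.0, dtype=torch.float64)

# Gradient from automatic differentiation
loss_fn(w, x, b, z).backward()

# Gradient from central differences: (l(w + e) - l(w - e)) / (2 * eps)
eps = 1e-6
numeric = torch.zeros_like(w)
with torch.no_grad():
    for i in range(w.shape[1]):
        step = torch.zeros_like(w)
        step[0, i] = eps
        numeric[0, i] = (loss_fn(w + step, x, b, z) - loss_fn(w - step, x, b, z)) / (2 * eps)

print(f"autograd gradient:\t{w.grad}")
print(f"numeric gradient:\t{numeric}")
print(f"match: {torch.allclose(w.grad, numeric, atol=1e-6)}")
```

PyTorch also ships `torch.autograd.gradcheck` for this kind of comparison; the manual loop is used here only to keep the mechanics visible.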

Gradients are attributes of tensors and accumulate as computation happens. Each new gradient value is added to the old gradient value:

In [None]:
print(f"w gradient after previous calculation \t{w.grad}\n")
# some calculation
result = w.sum()
result.backward()
print(f"Result value: \t{result}")
print(f"Accumulated w gradient: \t{w.grad} \n")

# Repeat the same calculation. The result value is unchanged, but the gradient of w accumulates again
result = w.sum()
result.backward()
print(f"Result value iter2: \t{result}")
print(f"Accumulated w gradient iter2: \t{w.grad}")



w gradient after previous calculation 	tensor([[1.2000, 2.4000]])

Result value: 	1.0
Accumulated w gradient: 	tensor([[2.2000, 3.4000]]) 

Result value iter2: 	1.0
Accumulated w gradient iter2: 	tensor([[3.2000, 4.4000]])


The gradient can be manually reset to 0 by calling `zero_()` on the `.grad` tensor (other in-place methods, such as `fill_()`, set it to a different value; refer to the PyTorch documentation for more options):

In [None]:
w.grad.zero_()

result = w.sum()
result.backward()
print(f"Result value: \t{result}")
print(f"Accumulated w gradient: \t{w.grad} \n")

# If we reset the gradient, both results should agree
w.grad.zero_()
result = w.sum()
result.backward()
print(f"Result value iter2: \t{result}")
print(f"Accumulated w gradient iter2: \t{w.grad} \n")


Result value: 	1.0
Accumulated w gradient: 	tensor([[1., 1.]]) 

Result value iter2: 	1.0
Accumulated w gradient iter2: 	tensor([[1., 1.]]) 
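
Resetting gradients matters most in iterative optimization, where each step should use only the gradient of the current loss. The following is a minimal sketch of plain gradient descent on the linear example from earlier; the learning rate `lr` and the number of steps are arbitrary values assumed for illustration.

```python
import torch

# Same data and initial parameters as the linear example above
x = torch.tensor([[1.], [2.]])
z = torch.tensor(1.0)
w = torch.tensor([[0.5, 0.5]], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

lr = 0.05  # assumed learning rate, chosen only for this sketch
losses = []
for step in range(20):
    loss = ((w @ x + b - z) ** 2).sum()
    loss.backward()
    # Update parameters without recording the update in the graph
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
    # Reset gradients so the next backward() does not add to stale values
    w.grad.zero_()
    b.grad.zero_()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.2e}")
```

Without the two `zero_()` calls, every step would use the sum of all gradients computed so far rather than the gradient of the current loss.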



## 2.5.2 Backward for Non-Scalar Variables

If our calculation outputs a vector instead of a scalar, calling `backward()` directly on the output vector raises an error, because each output element has its own gradient and there is no single gradient to return. For example, if we calculate

$$y = w_0 * X + w_1 * X$$ 

(where $w_0$ and $w_1$ are scalars and $X$ is a 2-vector), then instead of $\nabla_w y$, we will have a Jacobian matrix

$$
J = \begin{bmatrix}
    \frac{\partial y_0}{\partial w_0} & \frac{\partial y_0}{\partial w_1} \\
    \frac{\partial y_1}{\partial w_0} & \frac{\partial y_1}{\partial w_1}
\end{bmatrix}
$$

In [None]:
x = torch.tensor([[1.],[2.]], requires_grad=True)
w = torch.tensor([0.5,0.5], requires_grad=True)
y_r = w[0] * x + w[1] * x
print(f"Shape of y_r: {y_r.shape}")
# Directly calling backward on y_r will raise an error
# y_r.backward()

Shape of y_r: torch.Size([2, 1])


The `gradient` parameter of the `backward()` function can be used to calculate the product of the transposed Jacobian matrix and a vector (a vector-Jacobian product). For example, given a vector $v = [v_0,v_1]^T$ with one entry per output element, `y_r.backward(gradient=v)` will calculate 
$$
J^T v = \begin{bmatrix}
    v_0 * \frac{\partial y_0}{\partial w_0} + v_1 * \frac{\partial y_1}{\partial w_0}  \\
    v_0 * \frac{\partial y_0}{\partial w_1} + v_1 * \frac{\partial y_1}{\partial w_1}
\end{bmatrix}
$$

Setting `gradient` to a vector of ones ($[1,1]^T$ in the above example) produces the sum of the gradients of all output elements, which is the same as the gradient of the sum of the output.

In [None]:
# Set gradient to a vector of ones that matches the shape of y_r
y_r.backward(gradient=torch.ones((2,1)))
print(f"Sum of gradient: {w.grad}")
# Clone so the saved value is independent of later in-place changes to w.grad
sumed_grad = w.grad.clone()

Sum of gradient: tensor([3., 3.])

We can verify this by backpropagating from each output element separately and adding up the resulting gradients:


In [None]:
x = torch.tensor([[1.],[2.]], requires_grad=True)
w = torch.tensor([0.5,0.5], requires_grad=True)
y_r = w[0] * x + w[1] * x
y_r[0].backward()
p_y0_p_w = w.grad.clone()
print(p_y0_p_w)

# Clear gradient. 
x.grad.zero_()
w.grad.zero_()
y_r = w[0] * x + w[1] * x
y_r[1].backward()
p_y1_p_w = w.grad.clone()
print(p_y1_p_w)

# The per-element gradients add up to the gradient obtained with a vector of ones
p_y1_p_w + p_y0_p_w == sumed_grad

tensor([1., 1.])
tensor([2., 2.])


tensor([True, True])
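
To connect `backward(gradient=v)` with the Jacobian more directly, the sketch below builds $J$ explicitly with `torch.autograd.functional.jacobian` and compares $J^T v$ with the accumulated gradient. The function name `f` and the vector `v = [1, 3]` are choices made for this illustration; `x` is kept as a flat 2-vector here so that $J$ is a plain 2x2 matrix.

```python
import torch
from torch.autograd.functional import jacobian

# Flat 2-vector input, treated as a constant in this sketch
x = torch.tensor([1., 2.])

def f(w):
    # Same calculation as above: y = w_0 * x + w_1 * x
    return w[0] * x + w[1] * x

w = torch.tensor([0.5, 0.5], requires_grad=True)

# Explicit Jacobian, J[i, j] = d y_i / d w_j
J = jacobian(f, w.detach())
print(f"Jacobian:\n{J}")

# Arbitrary vector with one entry per output element
v = torch.tensor([1., 3.])
y = f(w)
y.backward(gradient=v)

print(f"w.grad:  {w.grad}")
print(f"J^T v:   {J.T @ v}")
```

Building the full Jacobian costs one backward pass per output element, which is why `backward()` only ever computes the product with a given vector.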

## 2.5.3 Detaching Computation 

If we do not want gradients to flow through some tensor, we can detach it from the computational graph using the `detach()` function, so that no gradient is propagated to the tensors used to compute the detached tensor. Using the example from the book, for $y = x * x$ and $z = y * x$, if $y$ is not detached, $\nabla_x z = \nabla_x x^3 = 3x^2$ (a slight abuse of notation for element-wise power). After detaching $y$, it is treated as a constant: the computation of $y$ no longer affects the gradient of $x$, and the gradient is simply the value of $y$. We can verify this:

In [None]:
x.grad.zero_()
y = x * x
u = y.detach()
z = u * x
# computation x * x is not included in computing the gradient
z.sum().backward()
print(x.grad == u)

tensor([[True],
        [True]])
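
For comparison, the following sketch computes the gradient both with and without `detach()` on a fresh tensor, so the two results $3x^2$ and $x^2$ can be seen side by side. The variable names `grad_full` and `grad_detached` are introduced only for this illustration.

```python
import torch

x = torch.tensor([1., 2.], requires_grad=True)

# Without detach: z = x * x * x, so the gradient is 3 * x^2
z_full = (x * x) * x
z_full.sum().backward()
grad_full = x.grad.clone()

# With detach: u is treated as a constant, so the gradient is u = x^2
x.grad.zero_()
u = (x * x).detach()
z_detached = u * x
z_detached.sum().backward()
grad_detached = x.grad.clone()

print(f"without detach: {grad_full}")      # 3 * x^2
print(f"with detach:    {grad_detached}")  # x^2
```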


Even though the detached copy of $y$ is used when calculating $z$, $y$ itself is still part of its own computational graph, so we can still backpropagate through the calculation of $y$ to get its gradient. For example, $\nabla_x \mathrm{sum}(y) = 2 * x$, as shown in the example from the book:

In [None]:
x.grad.zero_()
# calculation of y can still be backpropagated explicitly 
y.sum().backward()
print(x.grad == 2 * x)

tensor([[True],
        [True]])


## 2.5.4 Gradients and Python Control Flow

In addition to simple arithmetic operations, gradients can also be calculated when Python control flow, such as loops or conditional statements, is used, because the graph is built from the operations that actually run. The book provides a simple example:

In [None]:
def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

# Create a random scalar input
a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()
a.grad == d / a


tensor(True)
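
Since `f` only ever multiplies its input by constants, the result is a linear function of `a` along whichever path is taken, which is why the gradient equals `d / a`. The sketch below repeats the function with two fixed inputs (chosen for this illustration instead of a random one) so that both branches of the `if` are exercised; the dictionary `grads` is a name introduced here.

```python
import torch

def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

grads = {}
for value in (3.0, -3.0):
    a = torch.tensor(value, requires_grad=True)
    d = f(a)
    d.backward()
    grads[value] = a.grad.item()
    # The gradient is the constant factor applied along the executed path
    print(f"a={value}: d={d.item()}, grad={a.grad.item()}, d/a={(d / a).item()}")
```

For `a = 3.0` the input is doubled nine times in total (factor 512); for `a = -3.0` the `else` branch adds another factor of 100.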

## 2.7 Documentation

`dir()` gives the full list of functions and attributes of a module or object. 

`help()` displays the help message for a function or class. In a Jupyter notebook, appending `?` to a name achieves the same result.

In [None]:
import torch 
dir(torch.linalg)

['LinAlgError',
 'Tensor',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_add_docstr',
 '_linalg',
 'cholesky',
 'cholesky_ex',
 'common_notes',
 'cond',
 'cross',
 'det',
 'diagonal',
 'eig',
 'eigh',
 'eigvals',
 'eigvalsh',
 'householder_product',
 'inv',
 'inv_ex',
 'ldl_factor',
 'ldl_factor_ex',
 'ldl_solve',
 'lstsq',
 'lu',
 'lu_factor',
 'lu_factor_ex',
 'lu_solve',
 'matmul',
 'matrix_exp',
 'matrix_norm',
 'matrix_power',
 'matrix_rank',
 'multi_dot',
 'norm',
 'pinv',
 'qr',
 'slogdet',
 'solve',
 'solve_ex',
 'solve_triangular',
 'svd',
 'svdvals',
 'sys',
 'tensorinv',
 'tensorsolve',
 'torch',
 'vander',
 'vecdot',
 'vector_norm']
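
The raw `dir()` output mixes public functions with internal names that start with an underscore. A small sketch of filtering it is shown below; the variable names `public_names` and `norm_related` are introduced only for this example.

```python
import torch

# Keep only names that do not start with an underscore
public_names = [name for name in dir(torch.linalg) if not name.startswith("_")]
print(f"{len(public_names)} public names")

# Narrow the list further, for example to norm-related functions
norm_related = [name for name in public_names if "norm" in name]
print(norm_related)
```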

In [None]:
import torch 
help(torch.ldexp) 
# same as the following in a Jupyter notebook
# torch.ldexp?

Help on built-in function ldexp in module torch:

ldexp(...)
    ldexp(input, other, *, out=None) -> Tensor
    
    Multiplies :attr:`input` by 2**:attr:`other`.
    
    .. math::
        \text{{out}}_i = \text{{input}}_i * 2^\text{{other}}_i
    
    
    Typically this function is used to construct floating point numbers by multiplying
    mantissas in :attr:`input` with integral powers of two created from the exponents
    in :attr:`other`.
    
    Args:
        input (Tensor): the input tensor.
        other (Tensor): a tensor of exponents, typically integers.
    
    Keyword args:
        out (Tensor, optional): the output tensor.
    
    Example::
    
        >>> torch.ldexp(torch.tensor([1.]), torch.tensor([1]))
        tensor([2.])
        >>> torch.ldexp(torch.tensor([1.0]), torch.tensor([1, 2, 3, 4]))
        tensor([ 2.,  4.,  8., 16.])

