# PyTorch Tutorial, Differentiation engine:
#### torch.autograd is PyTorch's built-in differentiation engine
- Compute partial derivatives for parameters with **requires_grad=True**
- Disable gradient tracking
- Use **loss.backward()** for gradient computation
- A taste of **gradient descent** and its effect on loss value

https://github.com/ostad-ai/PyTorch-Tutorial

In [1]:
# importing the necessary modules
import torch
import torch.nn as tnn

In the previous post, we showed how to create a neural network. A neural network contains parameters that need to be trained for a specific task. The most widely used algorithm for training neural networks is **backpropagation**. To use backpropagation, we need to compute the **gradient** of the loss function with respect to each parameter of the neural network. For this purpose, we need a differentiation engine.

For the parameters that need to be optimized, we need to compute the gradients of the loss function with respect to those parameters. In PyTorch, we set the **requires_grad** property of each such parameter to **True** to tell PyTorch to track operations on it and compute its gradient.
<br> In the following, we create a simple linear model for predicting output $y$ from input vector $[x_0,x_1]$
<br> $y=b+w_0x_0+w_1x_1$
<br>We want to find the bias $b$ and the weights $w_0$ and $w_1$ that minimize the loss function below. For simplicity, we use the instantaneous loss for the *n*th sample, where $d_n$ is the desired output and $y_n$ is the model output:
<br>$loss=(d_n-y_n)^2$
<br>We can compute the partial derivatives of the loss function with respect to the parameters directly:
<br>$\frac{\partial loss}{\partial w_0}$, $\frac{\partial loss}{\partial w_1}$,
$\frac{\partial loss}{\partial b}$
<br>$\frac{\partial loss}{\partial w_i}=-2(d_n-y_n)x_i$, $i=0,1$
<br>$\frac{\partial loss}{\partial b}=-2(d_n-y_n)$
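<br>These expressions follow from the chain rule. As a short worked derivation (the symbol $e_n$ is introduced here only for convenience and does not appear elsewhere in the tutorial), let $e_n=d_n-y_n$, so that $loss=e_n^2$:
<br>$\frac{\partial loss}{\partial w_i}=2e_n\frac{\partial e_n}{\partial w_i}=-2e_n\frac{\partial y_n}{\partial w_i}=-2(d_n-y_n)x_i$, because $\frac{\partial y_n}{\partial w_i}=x_i$
<br>$\frac{\partial loss}{\partial b}=2e_n\frac{\partial e_n}{\partial b}=-2e_n\frac{\partial y_n}{\partial b}=-2(d_n-y_n)$, because $\frac{\partial y_n}{\partial b}=1$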

In [2]:
# the direct method to compute partial derivatives of the loss function above
def compute_grads(x,d,w,b):
    y=torch.matmul(x,w.detach())+b.detach()
    Dw=-2*(d-y)*x
    Db=-2*(d-y)
    return Dw,Db

Now we let torch.autograd compute the partial derivatives automatically, as shown below.
<br>We compare the results with the direct method defined above.
<br>In PyTorch, we call **loss.backward()** to compute gradients for parameters with **requires_grad=True**; the results are stored in the *grad* property of each parameter.

In [4]:
# y=b+w0x0+w1x1 with loss=(d-y)^2
x=torch.tensor([3.,5.]) #input vector
d=torch.tensor(.2) # desired value
w=torch.randn(2,requires_grad=True) # weights
b=torch.randn(1,requires_grad=True) #bias
y=torch.matmul(x,w)+b # output of model to input x
loss=(d-y)**2  # the ideal value is zero; the smaller, the better
loss.backward()
print(f'Gradients with autograd: {w.grad},{b.grad}')
print(f'Gradients with direct method: {compute_grads(x,d,w,b)}')

Gradients with autograd: tensor([15.3466, 25.5777]),tensor([5.1155])
Gradients with direct method: (tensor([15.3466, 25.5777]), tensor([5.1155]))
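
As a complementary sketch, the same comparison can be made with `torch.autograd.grad`, which returns the gradients directly instead of storing them in the *grad* property. The setup mirrors the cell above; the seed and the names `grad_w`, `grad_b`, `expected_w`, and `expected_b` are illustrative choices, not part of the original tutorial.

```python
import torch

torch.manual_seed(0)  # fixed seed, chosen only to make the run repeatable

x = torch.tensor([3., 5.])               # input vector
d = torch.tensor(.2)                     # desired value
w = torch.randn(2, requires_grad=True)   # weights
b = torch.randn(1, requires_grad=True)   # bias

y = torch.matmul(x, w) + b               # model output
loss = ((d - y) ** 2).sum()              # scalar loss

# returns one gradient per input, without touching w.grad or b.grad
grad_w, grad_b = torch.autograd.grad(loss, (w, b))

# analytic gradients from the formulas derived earlier
with torch.no_grad():
    expected_w = -2 * (d - y) * x
    expected_b = -2 * (d - y)

print(f'autograd.grad: {grad_w}, {grad_b}')
print(f'analytic:      {expected_w}, {expected_b}')
print(f'match: {torch.allclose(grad_w, expected_w) and torch.allclose(grad_b, expected_b)}')
```

Because nothing is written to `w.grad` or `b.grad` here, this form can be convenient for one-off checks where accumulated gradients would get in the way.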


How to **disable gradient tracking** and why?
<br>We said that tensors with requires_grad=True support gradient computation. However, there are times when we want to disable gradient tracking.
<br>For example, when we want to evaluate or use a trained neural network, we don't need gradient computation.
<br>Also, when we want to **freeze** some parameters during the training of a neural network, we should disable gradient tracking for them.
<br>We may disable gradient tracking with:
 - torch.no_grad()
 - detach()

Alternatively, we can set **requires_grad=False** on a tensor to disable gradient computation for it.

In [5]:
y=torch.matmul(x,w)+b
print(f'requires_grad is {y.requires_grad}')
with torch.no_grad():
    y=torch.matmul(x,w)+b
print(f'requires_grad is {y.requires_grad}')
y=torch.matmul(x,w)+b
y=y.detach()
print(f'requires_grad is {y.requires_grad}')

requires_grad is True
requires_grad is False
requires_grad is False
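
The requires_grad=False option mentioned above can be sketched in the same way. In this minimal example, which reuses the linear model from earlier, the bias is frozen with the in-place method `requires_grad_(False)` while the weights stay trainable; choosing the bias as the frozen parameter is an assumption made purely for illustration.

```python
import torch

x = torch.tensor([3., 5.])               # input vector
d = torch.tensor(.2)                     # desired value
w = torch.randn(2, requires_grad=True)   # trainable weights
b = torch.randn(1, requires_grad=True)   # bias, frozen on the next line

b.requires_grad_(False)                  # stop tracking gradients for b

y = torch.matmul(x, w) + b               # y still depends on w, so it is tracked
loss = ((d - y) ** 2).sum()
loss.backward()

print(f'w.grad: {w.grad}')               # a tensor with two entries
print(f'b.grad: {b.grad}')               # None, since b is frozen
```

Only `w` receives a gradient, so a subsequent update step would leave the frozen bias untouched.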


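**Hint:** Every time we call **loss.backward()**, the gradient values are added to the *grad* property of the leaf tensors of the computational graph that have requires_grad=True. Therefore, a second call to loss.backward() leads to gradient accumulation. To obtain fresh gradients instead of accumulated ones, we should zero out the *grad* property of the parameters before calling loss.backward() again. Note also that calling backward() more than once on the same graph generally requires **retain_graph=True** on the earlier calls, because by default PyTorch frees the graph's intermediate buffers after a backward pass.
<br> In the example below, we check loss.backward() and its effect on the gradients.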

In [6]:
xv=torch.randn(5)
dv=torch.ones(3)
wv=torch.randn(5,3,requires_grad=True)
bv=torch.randn(3,requires_grad=True)
yv=torch.matmul(xv,wv)+bv
loss=(dv-yv).pow(2).sum()
loss.backward(retain_graph=True) #compute gradients of parameters
print(f'First call to loss.backward():')
print(f'weights and bias grads:\n {wv.grad} {bv.grad}')
loss.backward(retain_graph=True)
print(f'Second call to loss.backward():')
print(f'weights and bias grads:\n {wv.grad} {bv.grad}')
# making gradients zero before loss.backward()
wv.grad.zero_(); bv.grad.zero_()
loss.backward()
print(f'Third call to loss.backward() with zeroing gradients:')
print(f'weights and bias grads:\n {wv.grad} {bv.grad}')
print('--------------')
print('The first call and third call have the same values of gradients')

First call to loss.backward():
weights and bias grads:
 tensor([[  0.6597,  -0.6459,   0.6203],
        [-12.6338,  12.3691, -11.8791],
        [  0.1674,  -0.1639,   0.1574],
        [-13.2970,  13.0185, -12.5027],
        [  5.7151,  -5.5954,   5.3737]]) tensor([-6.2246,  6.0942, -5.8527])
Second call to loss.backward():
weights and bias grads:
 tensor([[  1.3194,  -1.2918,   1.2406],
        [-25.2676,  24.7383, -23.7581],
        [  0.3348,  -0.3278,   0.3148],
        [-26.5941,  26.0370, -25.0054],
        [ 11.4302, -11.1907,  10.7474]]) tensor([-12.4492,  12.1884, -11.7055])
Third call to loss.backward() with zeroing gradients:
weights and bias grads:
 tensor([[  0.6597,  -0.6459,   0.6203],
        [-12.6338,  12.3691, -11.8791],
        [  0.1674,  -0.1639,   0.1574],
        [-13.2970,  13.0185, -12.5027],
        [  5.7151,  -5.5954,   5.3737]]) tensor([-6.2246,  6.0942, -5.8527])
--------------
The first call and third call have the same values of gradients
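
In practice, gradients are often reset through an optimizer rather than by hand. The sketch below shows the same accumulation effect on a tiny example and clears the gradients with `zero_grad()` of `torch.optim.SGD`. The loss function, the parameter values, and the helper `compute_loss` are hypothetical and chosen only for illustration. The loss is recomputed before each backward pass, so retain_graph is not needed.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)   # a single parameter tensor
opt = torch.optim.SGD([w], lr=0.1)                  # optimizer that owns w

def compute_loss():
    # illustrative loss: sum of squares, so d(loss)/dw = 2*w
    return (w ** 2).sum()

compute_loss().backward()
first = w.grad.clone()          # gradient after one backward pass

compute_loss().backward()
accumulated = w.grad.clone()    # second pass is added on top of the first

opt.zero_grad()                 # reset the gradients of all parameters in opt
compute_loss().backward()
fresh = w.grad.clone()          # same as the first pass again

print(f'first:       {first}')
print(f'accumulated: {accumulated}')
print(f'fresh:       {fresh}')
```

The optimizer is used here only to reset gradients; no `opt.step()` is called, so `w` itself never changes.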


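A taste of **gradient descent**: updating the parameters in the direction of the negative gradient. Each parameter $\theta$ is changed to $\theta-\eta\frac{\partial loss}{\partial \theta}$, where $\eta$ is the learning rate ($\eta=0.05$ in the cell below).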

In [7]:
xv=torch.randn(5) # input
dv=torch.ones(3)  # desired output
wv=torch.randn(5,3,requires_grad=True) # parameter
bv=torch.randn(3,requires_grad=True)   # parameter
yv=torch.matmul(xv,wv)+bv
loss=(dv-yv).pow(2).sum()
print(f'Loss before gradient descent: {loss}')
# computing gradients for parameters
loss.backward()
# gradient descent step
with torch.no_grad():
    wv-=.05*wv.grad
    bv-=.05*bv.grad
yv=torch.matmul(xv,wv)+bv
loss=(dv-yv).pow(2).sum()
print(f'Loss after gradient descent: {loss}')

Loss before gradient descent: 12.306927680969238
Loss after gradient descent: 8.528040885925293
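
The single step above extends naturally to a loop. The following is a minimal sketch of several gradient descent steps, with the gradients zeroed after every update so they do not accumulate. The fixed input vector, the tensor sizes, the seed, and the number of steps are assumptions chosen so that the run is repeatable; they are not taken from the cell above.

```python
import torch

torch.manual_seed(0)                          # repeatable initial parameters

xv = torch.tensor([0.5, -1.0, 2.0])           # fixed input (illustrative values)
dv = torch.ones(2)                            # desired output
wv = torch.randn(3, 2, requires_grad=True)    # parameter
bv = torch.randn(2, requires_grad=True)       # parameter
lr = 0.05                                     # learning rate

losses = []
for step in range(5):
    yv = torch.matmul(xv, wv) + bv            # forward pass
    loss = (dv - yv).pow(2).sum()
    losses.append(loss.item())
    loss.backward()                           # gradients for wv and bv
    with torch.no_grad():                     # the update itself is not tracked
        wv -= lr * wv.grad
        bv -= lr * bv.grad
    wv.grad.zero_()                           # reset for the next iteration
    bv.grad.zero_()

for step, value in enumerate(losses):
    print(f'step {step}: loss = {value}')
```

With this input and learning rate, each step shrinks the error, so the recorded losses decrease monotonically. A learning rate that is too large for the input's magnitude would instead make the loss grow.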
