## Automatic Differentiation in PyTorch
&nbsp;

In [25]:
# Build a linear layer with a dummy MSE loss. 
import torch

dummy_input = torch.rand(16, )
dummy_gt = torch.randn(10,)

'''
Remember to set requires_grad=True on the tensors that hold the weights to be optimized.
This is akin to calling .watch() on a tensor inside a TensorFlow GradientTape.
'''

w = torch.randn(10, 16, requires_grad=True)
b = torch.randn(10, requires_grad=True)

#Getting ahead of the tutorial here a little bit. PyTorch memories are getting back to me.
optimizer = torch.optim.Adam(params=[w,b])

#A trivial training loop here for 1000 epochs. Gets the job done though.
for i in range(1000):
    z = (w @ dummy_input) + b #feedforward.
    loss = torch.nn.functional.mse_loss(z, dummy_gt) # compute the loss (prediction first, then target).
    if i % 100 == 0:
        print(loss)
        
    optimizer.zero_grad() #reset the accumulated grads on the tensors handed to the optimizer (w, b).
    '''
    Computes the gradient of 'loss' with respect to every 'requires_grad' leaf tensor
    that went into computing it, and accumulates the result into each tensor's .grad.
    '''
    loss.backward() 

    optimizer.step() #optimizer applies the gradients to the tensors it was defined with (w, b here).


tensor(5.4758, grad_fn=<MseLossBackward0>)
tensor(3.1213, grad_fn=<MseLossBackward0>)
tensor(1.8149, grad_fn=<MseLossBackward0>)
tensor(1.0647, grad_fn=<MseLossBackward0>)
tensor(0.6200, grad_fn=<MseLossBackward0>)
tensor(0.3534, grad_fn=<MseLossBackward0>)
tensor(0.1952, grad_fn=<MseLossBackward0>)
tensor(0.1037, grad_fn=<MseLossBackward0>)
tensor(0.0526, grad_fn=<MseLossBackward0>)
tensor(0.0253, grad_fn=<MseLossBackward0>)


### Getting the gradient function and the gradient values

&nbsp;

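**Note that, by default, the actual gradient value (.grad) is populated only on the leaf tensors of the computational graph, i.e. tensors created directly by the user with requires_grad=True (w and b here). Gradients of intermediate tensors such as z are still computed during the backward pass, but they are not stored, which saves memory. To keep the gradient of an intermediate tensor, call .retain_grad() on it before calling .backward(). This is separate from retain_graph, which only controls whether the graph's saved buffers are kept for a second backward pass.**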
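A minimal sketch of inspecting the gradient of an intermediate tensor with .retain_grad(). The tensor names mirror the training cell above, but the squared-mean loss is a stand-in chosen only to keep the example short.

```python
import torch

x = torch.rand(16)
w = torch.randn(10, 16, requires_grad=True)  # leaf tensor
b = torch.randn(10, requires_grad=True)      # leaf tensor

z = w @ x + b        # non-leaf tensor: it is the result of an operation
z.retain_grad()      # ask autograd to keep z.grad after the backward pass

loss = z.pow(2).mean()   # stand-in loss, assumed for illustration only
loss.backward()

print(w.is_leaf, z.is_leaf)   # True False
print(z.grad.shape)           # torch.Size([10])
```

Without the z.retain_grad() call, z.grad would stay None after loss.backward().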

In [26]:
#Printing the gradient function.

print('Gradient function for z:', z.grad_fn)
print('Gradient function for loss', loss.grad_fn)

Gradient function for z: <AddBackward0 object at 0x7f9ba13ea9d0>
Gradient function for loss <MseLossBackward0 object at 0x7f9ba13ea520>


In [27]:
#Print the shapes of the last gradients accumulated on the weights and biases. 
print('Last epoch gradient on w:', w.grad.shape)
print('Last epoch gradient on b', b.grad.shape)

Last epoch gradient on w: torch.Size([10, 16])
Last epoch gradient on b torch.Size([10])
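To see what these gradients actually contain, one can compare them against a hand derivation. For a mean-reduced MSE over N outputs, dL/dz = 2(z - y)/N, so dL/db = dL/dz and dL/dw is the outer product of dL/dz with the input. The sketch below assumes fresh tensors and a fixed seed rather than the trained w and b from above.

```python
import torch

torch.manual_seed(0)  # seed chosen arbitrarily for reproducibility
x = torch.rand(16)
y = torch.randn(10)
w = torch.randn(10, 16, requires_grad=True)
b = torch.randn(10, requires_grad=True)

z = w @ x + b
loss = torch.nn.functional.mse_loss(z, y)
loss.backward()

# Hand-derived gradients, computed without tracking.
with torch.no_grad():
    dz = 2.0 * (z - y) / z.numel()      # dL/dz for mean-reduced MSE
    manual_w_grad = torch.outer(dz, x)  # dL/dw, shape (10, 16)
    manual_b_grad = dz                  # dL/db, shape (10,)

w_matches = torch.allclose(w.grad, manual_w_grad, atol=1e-5)
b_matches = torch.allclose(b.grad, manual_b_grad, atol=1e-5)
print(w_matches, b_matches)
```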


### One can backprop through a computational graph only once. If you call .backward() on the same graph a second time, you get the error below.
&nbsp;

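**If it is absolutely necessary to backprop through the same graph multiple times, use loss.backward(retain_graph=True). That keeps the graph's saved intermediate tensors in memory, which is costly and should be avoided unless needed. Also note that gradients from repeated backward calls accumulate in .grad.**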

In [28]:
loss.backward()

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
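A small sketch of the retain_graph=True escape hatch, using a toy two-element tensor (assumed here for readability) instead of the layer above. It also shows that the second backward pass adds to the existing .grad rather than replacing it.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()   # dloss/dw = 2 * w

loss.backward(retain_graph=True)  # keep the saved buffers for another pass
first = w.grad.clone()

loss.backward()                   # second pass works; gradients accumulate
second = w.grad.clone()

print(first)   # tensor([2., 4.])
print(second)  # tensor([4., 8.])
```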

## Stopping Gradient Tracking on Tensors

### Why do this at all? 
&nbsp;
* Freezing some weights of the model while fine-tuning. 
* Running the model in inference mode after training. 
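As a rough sketch of the first use case, the snippet below freezes the first layer of a hypothetical two-layer model by turning off requires_grad on its parameters and hands only the remaining parameters to the optimizer. The model, its sizes, and the stand-in loss are assumptions made for illustration.

```python
import torch

# Hypothetical model, not part of the notebook above.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 10),
)

# Freeze the first layer: autograd stops tracking its parameters.
for p in model[0].parameters():
    p.requires_grad_(False)

# Only pass the still-trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable)

loss = model(torch.rand(4, 16)).pow(2).mean()  # stand-in loss
loss.backward()

print(model[0].weight.grad)        # None: the frozen layer gets no gradient
print(model[2].weight.grad.shape)  # torch.Size([10, 32])
```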

&nbsp;

**Method 1: Use a torch.no_grad() context as shown below.** 

**Method 2: Call the detach() method on the tensor (or detach_() for the in-place version).**

In [37]:
temp = torch.matmul(w, dummy_input) + b 
print(temp.requires_grad)

'''
Make sure to surround non-training operations on trainable tensors with a no_grad() context.
'''
with torch.no_grad():
    temp = torch.matmul(w, dummy_input) + b

#Notice that the operations were not tracked, so this tensor does not require grad.
print(temp.requires_grad)

#We can start gradient tracking again on this tensor.
temp.requires_grad_(True) # Notice that this is an in-place operation, as the trailing underscore suggests.
print(temp.requires_grad)

'''
Alternatively, use .detach_() to stop gradient tracking in place; .detach() does the same but returns a new tensor.
'''
temp.detach_()
print(temp.requires_grad)

True
False
True
False
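The cell above uses the in-place detach_(). For comparison, here is a minimal sketch of the out-of-place detach(), which leaves the original tensor tracked and returns a new tensor that shares the same underlying data. The tensor names are made up for this example.

```python
import torch

w = torch.randn(3, requires_grad=True)
y = w * 2            # tracked: y has a grad_fn
y_det = y.detach()   # new tensor, same storage, not tracked

print(y.requires_grad, y_det.requires_grad)  # True False
print(y_det.data_ptr() == y.data_ptr())      # True: no copy was made
```

Because the storage is shared, modifying y_det in place also changes the values seen through y.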


## Important Note on the DAGs in PyTorch
&nbsp;

**DAGs are dynamic in PyTorch. An important thing to note is that the graph is recreated from scratch; after each .backward() call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.**

&nbsp;

**This means that when I compute loss = ... in each epoch, a new DAG is being created. This also explains why one can't call .backward() twice on the same DAG without setting retain_graph=True: the buffers saved for the backward pass are freed once it has run.**
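A toy sketch of what "dynamic" means in practice: the number of multiplications in the forward pass is decided by an ordinary Python loop, so each call builds a different graph, and the gradient follows whatever was actually executed. The forward function and its steps argument are hypothetical names used only here.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

def forward(x, steps):
    # Plain Python control flow decides how many ops end up in the graph.
    out = x
    for _ in range(steps):
        out = out * w
    return out

results = {}
for steps in (1, 3):
    w.grad = None                         # clear the previous gradient
    y = forward(torch.tensor(1.0), steps)  # y = w ** steps
    y.backward()                          # dy/dw = steps * w ** (steps - 1)
    results[steps] = (y.item(), w.grad.item())
    print(steps, y.item(), w.grad.item())
# prints:
# 1 2.0 1.0
# 3 8.0 12.0
```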