# Automatic differentiation with torch.autograd
### When training neural networks, the most frequently used algorithm is backpropagation.
### In this algorithm, parameters (model weights) are adjusted according to the gradient of the loss function
### with respect to each parameter. The loss function calculates the difference between the expected output
### and the actual output that a neural network produces. The goal is to get the result of the loss function as
### close to zero as possible. The algorithm traverses backward through the network, from the output to the input,
### computing the gradients used to adjust the weights and biases. That's why it's called backpropagation.
### Repeating this forward and backward pass to update the parameters and reduce the loss is called gradient descent.
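
### As a rough illustration of the update rule described above, here is a minimal sketch of gradient descent
### on a one-parameter quadratic loss. The loss (w - 3)^2, the learning rate, and the names loss_fn and
### loss_grad are assumptions chosen for this example; the derivative is written by hand rather than
### computed automatically.

```python
def loss_fn(w):
    # Hypothetical loss with its minimum at w = 3
    return (w - 3.0) ** 2


def loss_grad(w):
    # Hand-derived derivative of loss_fn with respect to w
    return 2.0 * (w - 3.0)


w = 0.0              # initial parameter value
learning_rate = 0.1  # step size, chosen arbitrarily for this sketch

for step in range(100):
    # Move the parameter against the gradient to reduce the loss
    w -= learning_rate * loss_grad(w)

print(w, loss_fn(w))
```

### Writing the derivative by hand is only practical for tiny examples like this one, which is the
### motivation for an automatic differentiation engine.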

### To compute those gradients, PyTorch has a built-in differentiation engine called torch.autograd.
### It supports automatic computation of gradients for any computational graph.

### Consider the simplest one-layer neural network, with input x, parameters w and b, and some loss function.
### It can be defined in PyTorch in the following manner:

In [1]:
%matplotlib inline
import torch

x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)  # weights, tracked by autograd
b = torch.randn(3, requires_grad=True)  # bias, tracked by autograd
z = torch.matmul(x, w) + b  # logits of the one-layer network
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

### A function that we apply to tensors to construct the computational graph is in fact an object of class Function.
### This object knows how to compute the function in the forward direction, and also how to compute its
### derivative during the backward propagation step. A reference to the backward propagation function is
### stored in the grad_fn property of a tensor.

In [2]:
print('Gradient function for z =',z.grad_fn)
print('Gradient function for loss =', loss.grad_fn)

Gradient function for z = <AddBackward0 object at 0x7fb66c278fd0>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x7fb66c278940>
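
### The grad_fn objects are linked to each other, which is how the graph can be walked backward.
### The sketch below rebuilds the same tensors and inspects one level of that chain through the
### next_functions attribute of a grad_fn node. Tensors created directly by the user (leaf tensors)
### have no grad_fn of their own. The exact node names printed may differ between PyTorch versions.

```python
import torch

x = torch.ones(5)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b

# w was created by the user, not by an operation, so it is a leaf with no grad_fn
print(w.is_leaf, w.grad_fn)

# z was produced by an addition, so its grad_fn is the backward node of that addition
print(type(z.grad_fn).__name__)

# Each entry of next_functions is a (node, input_index) pair pointing one step back in the graph
for parent, _ in z.grad_fn.next_functions:
    print(type(parent).__name__)
```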


## Computing gradients
### To optimize the weights of the parameters in the neural network, we need to compute the derivatives of our
### loss function with respect to those parameters. To compute those derivatives, we call loss.backward(),
### and then retrieve the values from w.grad and b.grad:

In [3]:
loss.backward()
print(w.grad)
print(b.grad)

tensor([[0.0051, 0.0433, 0.1946],
        [0.0051, 0.0433, 0.1946],
        [0.0051, 0.0433, 0.1946],
        [0.0051, 0.0433, 0.1946],
        [0.0051, 0.0433, 0.1946]])
tensor([0.0051, 0.0433, 0.1946])
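
### As a sanity check, the gradients can be compared against a hand-derived formula. For binary cross-entropy
### with logits and the default mean reduction over N elements, the derivative of the loss with respect to
### each logit is (sigmoid(z) - y) / N. The sketch below rebuilds the same network and checks this; the seed
### and the names dz, expected_b_grad, and expected_w_grad are assumptions introduced for the example.

```python
import torch

torch.manual_seed(0)  # arbitrary seed so the sketch is repeatable

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

# Derivative of the mean-reduced loss with respect to each logit z_j
dz = (torch.sigmoid(z) - y) / y.numel()

# Chain rule: z = x @ w + b, so dloss/db = dz and dloss/dw[i, j] = x[i] * dz[j]
expected_b_grad = dz
expected_w_grad = torch.outer(x, dz)

print(torch.allclose(b.grad, expected_b_grad, atol=1e-6))
print(torch.allclose(w.grad, expected_w_grad, atol=1e-6))
```

### Because x is a vector of ones, every row of w.grad equals b.grad, which matches the printed output above.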


## Disabling gradient tracking
### By default, all tensors with requires_grad=True track their computational history and support
### gradient computation. However, there are some cases when we do not need to do that, for example,
### when we have trained the model and just want to apply it to some input data, i.e. we only want to do
### forward computations through the network. We can stop tracking computations by surrounding our
### computation code with a torch.no_grad() block:

In [4]:
z = torch.matmul(x, w)+b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w)+b
print(z.requires_grad)

True
False


### Another way to achieve the same result is to use the detach() method on the tensor:

In [5]:
z = torch.matmul(x, w)+b
z_det = z.detach()
print(z_det.requires_grad)

False


### There are reasons you might want to disable gradient tracking:

### - To mark some parameters in your neural network as frozen parameters.
###   This is a very common scenario for fine-tuning a pre-trained network.
### - To speed up computations when you are only doing a forward pass,
###   because computations on tensors that do not track gradients are more efficient.
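
### The frozen-parameter case can be sketched as follows. The two-layer model here is a hypothetical
### stand-in for a pre-trained network, and freezing its first layer is an arbitrary choice for the
### example. Calling requires_grad_(False) on a parameter excludes it from gradient computation.

```python
import torch

# Hypothetical small model standing in for a pre-trained network
model = torch.nn.Sequential(
    torch.nn.Linear(5, 4),
    torch.nn.ReLU(),
    torch.nn.Linear(4, 3),
)

# Freeze the first layer: its parameters no longer track gradients
for param in model[0].parameters():
    param.requires_grad_(False)

out = model(torch.ones(2, 5))
out.sum().backward()

# The frozen layer received no gradient; the trainable layer did
print(model[0].weight.grad)
print(model[2].weight.grad.shape)
```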

'''More on Computational Graphs
Conceptually, autograd keeps a record of data (tensors) and all executed operations 
(along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects. 
In this DAG, leaves are the input tensors, roots are the output tensors. 
By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:

- run the requested operation to compute a resulting tensor
- maintain the operation’s gradient function in the DAG.

The backward pass kicks off when .backward() is called on the DAG root. autograd then:

- computes the gradients from each .grad_fn,
- accumulates them in the respective tensor’s .grad attribute
- using the chain rule, propagates all the way to the leaf tensors.

DAGs are dynamic in PyTorch

An important thing to note is that the graph is recreated from scratch; after each .backward() call,
autograd starts populating a new graph.
This is exactly what allows you to use control flow statements in your model; 
you can change the shape, size and operations at every iteration if needed.
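
As a small illustration of the point about control flow, the sketch below uses a while loop whose
number of iterations depends on the data, so each call builds a different graph. The function name
forward, the threshold of 10, and the input values are assumptions made up for this example.

```python
import torch


def forward(x, w):
    h = x * w
    steps = 0
    # Data-dependent loop: how many doublings are recorded depends on the values in h
    while h.norm() < 10 and steps < 20:
        h = h * 2
        steps += 1
    return h.sum(), steps


w = torch.tensor([1.0, 2.0], requires_grad=True)
results = []

for x in (torch.tensor([1.0, 1.0]), torch.tensor([3.0, 3.0])):
    w.grad = None              # reset so gradients from the previous input do not accumulate
    out, steps = forward(x, w)
    out.backward()             # backward follows the graph that was built for this input
    results.append((steps, w.grad.clone()))
    print(steps, w.grad)
```

With these inputs the loop runs three times for the first input and once for the second, so the
gradient with respect to w is 8 * x in the first case and 2 * x in the second.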


In [7]:
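inp = torch.eye(5, requires_grad=True)
out = (inp+1).pow(2)
# out is not a scalar, so backward() needs an explicit gradient argument;
# retain_graph=True keeps the graph alive so backward() can be called again.
out.backward(torch.ones_like(inp), retain_graph=True)
print("First call\n", inp.grad)
# A second call adds to the existing inp.grad instead of overwriting it.
out.backward(torch.ones_like(inp), retain_graph=True)
print("\nSecond call\n", inp.grad)
# Zero the accumulated gradient in place to get the first result back.
inp.grad.zero_()
out.backward(torch.ones_like(inp), retain_graph=True)
print("\nCall after zeroing gradients\n", inp.grad)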

First call
 tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.],
        [2., 2., 2., 2., 4.]])

Second call
 tensor([[8., 4., 4., 4., 4.],
        [4., 8., 4., 4., 4.],
        [4., 4., 8., 4., 4.],
        [4., 4., 4., 8., 4.],
        [4., 4., 4., 4., 8.]])

Call after zeroing gradients
 tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.],
        [2., 2., 2., 2., 4.]])
