<a href="https://colab.research.google.com/github/namanphy/pytorch-handson/blob/master/autograd_tensor_functions_pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import torch

## Autograd
Autograd is PyTorch's tool for computing derivatives through a technique called **automatic differentiation**. The official documentation puts it this way: *`torch.autograd` provides classes and functions implementing automatic differentiation of arbitrary scalar valued functions*.

Automatic differentiation is a set of techniques for numerically evaluating the derivative of a function. It is required during the backpropagation pass of neural network training, where the gradient of the loss function w.r.t. the weights is computed.
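
As a minimal sketch of what this means in practice, the snippet below differentiates f(x) = x^3 at x = 2 with autograd and compares the result with the hand-derived 3x^2. The function `f` and the evaluation point are illustrative assumptions, not part of the original notebook.

```python
import torch

def f(x):
    # Hypothetical example function: f(x) = x^3
    return x ** 3

x = torch.tensor(2., requires_grad=True)
out = f(x)
out.backward()                    # autograd fills x.grad with df/dx at x = 2

analytic = 3 * x.detach() ** 2    # hand-derived derivative 3x^2 at the same point
print('autograd:', x.grad)
print('analytic:', analytic)
```

Both values should agree, since autograd applies the same differentiation rules operation by operation.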


## Computation Graph
So how does PyTorch (or any other DL library, for that matter) calculate gradients during backpropagation? It does so by generating a data structure called a **computation graph**.
In a complex setup where gradients must be calculated for thousands of variables, the computation graph keeps track of how they all depend on one another.

**A computation graph is nothing but a map of references between variables (tensors) and operators (functions), generated for a set of algebraic equations, which autograd can traverse back to the leaves to calculate gradients.**

Because PyTorch generates these graphs at runtime during the forward pass (the plain calculation of outputs from inputs), they are called dynamic computation graphs.
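
One way to see the graph that a forward pass records is to follow the `next_functions` attribute of each `grad_fn` node. The sketch below does this for a small expression; the helper `walk` is a hypothetical name introduced only for illustration.

```python
import torch

x = torch.tensor(3., requires_grad=True)
a = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)
z = a * x + b   # the forward pass records the backward graph

def walk(fn, depth=0, names=None):
    """Hypothetical helper: print and collect node names depth-first."""
    if names is None:
        names = []
    if fn is None:          # inputs that do not require grad have no node
        return names
    names.append(type(fn).__name__)
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, names)
    return names

names = walk(z.grad_fn)
```

The root node belongs to the addition, its child to the multiplication, and the leaves appear as `AccumulateGrad` nodes, which are responsible for writing into `.grad`.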



In this notebook we are going to discuss the following functions and properties associated with tensors.

**1.** 3 basic properties of tensors
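*   grad_fn : the backward function that created a tensor, used during backprop.
*   requires_grad : whether autograd should track operations on the tensor.
*   is_leaf : whether the tensor is a leaf in the graph.

**2.** backward() : computes the gradient of the current tensor w.r.t. the graph leaves.

**3.** retain_grad() : lets a non-leaf tensor store its gradient.

**4.** register_hook() : runs a function every time a gradient w.r.t. the tensor is computed.

**5.** detach() : returns a tensor detached from the graph that created it.
  
**6.** EXTRA - torch.no_grad() :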





## 1. Important properties : requires_grad, grad_fn, is_leaf
The `requires_grad` attribute tells autograd to track the operations performed on a tensor. So if you want PyTorch to build a graph for these operations, you have to set the `requires_grad` attribute of the tensor to True.

There are 2 ways to do this: either pass it as an argument to `torch.tensor`, or explicitly set the `requires_grad` property to True afterwards.

In [4]:
t0 = torch.tensor([1., 2.], requires_grad=True)
print(f't0 : {t0}')

t1 = torch.FloatTensor([1., 2.])  # dtype = torch.float32
t1.requires_grad=True
print(f't1 : {t1}')

t0 : tensor([1., 2.], requires_grad=True)
t1 : tensor([1., 2.], requires_grad=True)


Remember that only tensors of floating-point (or complex) dtypes can require gradients, i.e. ask `autograd` to record their operations.
Besides `torch.FloatTensor` (torch.float32), two other floating-point tensor types used in this notebook are:

1. torch.HalfTensor (torch.float16)
2. torch.DoubleTensor (torch.float64)
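
To illustrate the dtype restriction, here is a small sketch that tries to request gradients for an integer tensor. The exact wording of the error message may differ between PyTorch versions, so the snippet only prints it.

```python
import torch

try:
    torch.tensor([1, 2], requires_grad=True)   # int64 tensor
    raised = False
except RuntimeError as err:
    raised = True
    print(f'RuntimeError: {err}')

# The same values as floats are accepted.
ok = torch.tensor([1., 2.], requires_grad=True)
print(ok.dtype, ok.requires_grad)
```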

**`requires_grad` has two useful properties:**
  - It propagates: a tensor produced by an operation has `requires_grad = True` whenever at least one of the input tensors has `requires_grad = True`.
  - It lets you freeze parts of a network: when you don't want gradients computed for some tensors (and hence don't want to update the weights associated with them), set their `requires_grad` to False and they won't participate in the computation graph.
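
Both properties can be checked with a short sketch. The tensor names `w`, `c` and `frozen` are illustrative assumptions.

```python
import torch

w = torch.tensor([1., 2.], requires_grad=True)   # tracked tensor
c = torch.tensor([3., 4.])                       # requires_grad defaults to False

print((c * 2).requires_grad)   # no input requires grad
print((w * c).requires_grad)   # at least one input requires grad

# "Freeze" a tensor so that no gradient is computed for it.
frozen = torch.tensor([5., 6.], requires_grad=True)
frozen.requires_grad_(False)

loss = (w * frozen).sum()
loss.backward()
print('w.grad      :', w.grad)
print('frozen.grad :', frozen.grad)
```

Only `w` receives a gradient; `frozen` is treated as a constant during the backward pass.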

In [5]:
t2 = torch.HalfTensor([1., 2.])  # dtype = torch.float16
t2.requires_grad=True
print(f't2 : {t2}')

t3 = torch.DoubleTensor([1., 2.])  # dtype = torch.float64
t3.requires_grad=True
print(f't3 : {t3}')


t2 : tensor([1., 2.], dtype=torch.float16, requires_grad=True)
t3 : tensor([1., 2.], dtype=torch.float64, requires_grad=True)


---

The `grad_fn` property holds a reference to the function (mathematical operator) that created the tensor. It is very important during the backward pass, because this function is responsible for calculating the gradient and passing it to the appropriate next function in the pass.
  - If `requires_grad` is False, `grad_fn` is None.

The `is_leaf` property tells whether a tensor is a leaf node or not. Essentially, leaf tensors are the tensors for which we want to accumulate gradients; they sit at the edge of the computation graph. **Only leaf Tensors will have their grad populated during a call to** `backward()`. Technically, leaf tensors are tensors created in any of the following ways:
1. Tensors resulting from operations on tensors that have `requires_grad = False` are leaf tensors.

2. Any tensor that is explicitly created by the user is a leaf tensor.
This means that it is not the result of an operation, and so its `grad_fn` is None.

3. Tensors obtained from the `detach()` function.
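
The three cases can be checked directly. This is a small sketch with made-up tensor names; `r` is included as a counter-example that is not a leaf.

```python
import torch

p = torch.tensor(1.)                       # requires_grad=False
q = p * 2                                  # case 1: result of ops on non-tracked tensors
u = torch.tensor(2., requires_grad=True)   # case 2: created explicitly by the user
r = u * 2                                  # counter-example: result of a tracked op
d = r.detach()                             # case 3: obtained from detach()

for name, t in [('q', q), ('u', u), ('r', r), ('d', d)]:
    print(f'{name}: is_leaf={t.is_leaf}, requires_grad={t.requires_grad}, grad_fn={t.grad_fn}')
```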

In [6]:
x = torch.tensor(3., requires_grad=True)

a = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)

y = a * x

z = y + b

print("Tensor x")
print(f'grad function = {x.grad_fn}')
print(f'is leaf = {x.is_leaf}')
print(x.requires_grad)

print("\nTensor y")
print(f'grad function = {y.grad_fn}')
print(f'is leaf = {y.is_leaf}')
print(y.requires_grad)

print("\nTensor z")
print(f'grad function = {z.grad_fn}')
print(f'is leaf = {z.is_leaf}')
print(z.requires_grad)

Tensor x
grad function = None
is leaf = True
True

Tensor y
grad function = <MulBackward0 object at 0x7f63d5fd2ba8>
is leaf = False
True

Tensor z
grad function = <AddBackward0 object at 0x7f63d5fd2ba8>
is leaf = False
True


In the example above, the tensors `x`, `a` and `b` are the leaf nodes (only `x` is printed). Since `x` is a leaf node, its `grad_fn` is None, as it is not obtained from any operation.

The tensor `y` has a multiplication operator as its `grad_fn`, since `y` is obtained by multiplying `a` and `x`. Similarly, `z` has an addition operator as its `grad_fn`.


## 2. backward(gradient=None, retain_graph=None, create_graph=False)

This is the most important of the tensor methods presented here. It computes the gradient of the current tensor w.r.t. the graph leaves, and so it is responsible for calculating gradients during a backward pass.

*Note: `backward` only calculates gradients by traversing an already built backward graph. As discussed, that graph is generated during the forward pass.*

1. The backward function **takes an incoming gradient** from the part of the network in front of it.
2. Then it **calculates the local gradient** at a particular tensor.
3. Then it **multiplies the local gradient by the incoming gradient**.
4. Finally it **passes the computed gradient to the tensor's inputs** by invoking the backward method of their `grad_fn`, or simply **saves the gradient in the `grad` property for leaf nodes**.

Suppose we call `z.backward()` in the above example. The `grad_fn` of `z` is `<AddBackward>`.
1. The backward function of `<AddBackward>` receives the default incoming gradient `torch.tensor(1.)`.
2. It then calculates the local gradients for `y` and `b`. Both are `1.` because the operation is an addition.
3. Each local gradient is multiplied by the incoming gradient, i.e. `1. * 1.`.
4. `b` is a leaf whose `grad_fn` is `None`, so its gradient is stored directly in the `grad` property of `b`. For `y`, the gradient is passed on to its `grad_fn` (i.e. `<MulBackward>`, since `y` is formed by multiplying `a` and `x`).
5. Similarly, the backward function of `y`'s `<MulBackward>` is called, which stores the gradients in the `grad` properties of `x` and `a`.

As you can see, the backward function is called recursively throughout the graph as we backtrack. You can access the gradients through the `grad` attribute of each tensor.
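
The walkthrough above can be reproduced by hand. The sketch below applies the chain rule with plain Python floats and compares the results with what `backward()` stores; the variable names `incoming`, `dz_dy` and so on are assumptions made for this illustration.

```python
import torch

x = torch.tensor(3., requires_grad=True)
a = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)
y = a * x
z = y + b
z.backward()

# Manual chain rule, mirroring the steps described above.
incoming = 1.0                 # default incoming gradient dz/dz
dz_dy = incoming * 1.0         # addition: local gradient is 1 for each input
dz_db = incoming * 1.0
dz_da = dz_dy * x.item()       # multiplication: d(a*x)/da = x
dz_dx = dz_dy * a.item()       # multiplication: d(a*x)/dx = a

print('manual  :', dz_dx, dz_da, dz_db)
print('autograd:', x.grad.item(), a.grad.item(), b.grad.item())
```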

In [7]:
x = torch.tensor(3., requires_grad=True)

a = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)

y = a * x

z = y + b

z.backward()

print("Tensor x")
print(f'grad function = {x.grad_fn}')

print("\nTensor a")
print(f'grad function = {a.grad_fn}')

print("\nTensor b")
print(f'grad function = {b.grad_fn}')

print("\nTensor y")
print(f'grad function = {y.grad_fn}')

print("\nTensor z")
print(f'grad function = {z.grad_fn}')
print("\n")
print('dz/dx:', x.grad) 
print('dz/da:', a.grad) 
print('dz/db:', b.grad) 

Tensor x
grad function = None

Tensor a
grad function = None

Tensor b
grad function = None

Tensor y
grad function = <MulBackward0 object at 0x7f63f1d01be0>

Tensor z
grad function = <AddBackward0 object at 0x7f63f1d01be0>


dz/dx: tensor(4.)
dz/da: tensor(3.)
dz/db: tensor(1.)


### Calling backward on a non-scalar tensor

For a non-scalar (e.g. vector-valued) tensor, calling `backward()` without arguments raises a RuntimeError: `grad can be implicitly created only for scalar outputs`.

This is because for a non-scalar tensor a vector-Jacobian product has to be computed, so `backward` expects the incoming gradient as its input (usually the gradient of the differentiated function w.r.t. the current tensor). That incoming gradient must be a tensor of the same shape as the current tensor; only then can `backward` backpropagate.

So you can either pass a tensor of the same shape, or reduce the current tensor to a scalar (shape `torch.Size([])`), for example with `mean()` or `sum()`, as the argument-free form of `backward` expects.

*NOTE: If you pass a tensor other than all ones to `backward`, the gradients are scaled accordingly.*
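
As a sketch of both points, the snippet below first triggers the error on a two-element tensor and then passes a non-ones gradient. The name `weights` and its values are assumptions chosen for illustration.

```python
import torch

x = torch.tensor(3., requires_grad=True)
a = torch.tensor([4., 2.], requires_grad=True)
b = torch.tensor(5., requires_grad=True)

try:
    (a * x + b).backward()       # non-scalar output, no gradient argument
    raised = False
except RuntimeError as err:
    raised = True
    print(f'RuntimeError: {err}')

z = a * x + b
weights = torch.tensor([1., 2.])  # incoming gradient, same shape as z
z.backward(weights)

print('dz/dx:', x.grad)   # 1*4 + 2*2
print('dz/da:', a.grad)   # weights * x
print('dz/db:', b.grad)   # 1 + 2
```

Each element of `z` contributes to the leaf gradients in proportion to its entry in `weights`.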

In [9]:
x = torch.tensor(3., requires_grad=True)
a = torch.tensor([4.,2.], requires_grad=True)
b = torch.tensor(5., requires_grad=True)

y = a * x
z = y + b 

z.backward(torch.tensor([1.,1.])) # passing a gradient to backward.

print('dz/dx:', x.grad)
print('dz/da:', a.grad) 
print('dz/db:', b.grad) 

dz/dx: tensor(6.)
dz/da: tensor([3., 3.])
dz/db: tensor(2.)


In [10]:
x = torch.tensor(3., requires_grad=True)
a = torch.tensor([4.,2.], requires_grad=True)
b = torch.tensor(5., requires_grad=True)

y = a * x
z = y + b 

z = z.mean()
z.backward()

print('dz/dx:', x.grad)
print('dz/da:', a.grad) 
print('dz/db:', b.grad) 

dz/dx: tensor(3.)
dz/da: tensor([1.5000, 1.5000])
dz/db: tensor(1.)


### Calling backward twice

The computation graph that is generated at runtime and used to calculate gradients during the backward pass is freed when you call `backward`.
Since the gradients have already been computed at that point, the graph is destroyed and is built again the next time you run a forward pass.

Hence, if you set `retain_graph=True`, the graph is not freed and you can backpropagate through the same graph again to calculate the gradients.

In [0]:
x = torch.tensor(3., requires_grad=True)

a = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)

y = a * x
z = y + b

z.backward(retain_graph=True)

In [12]:
print('dz/dx:', x.grad)
print('dz/da:', a.grad) 
print('dz/db:', b.grad) 

dz/dx: tensor(4.)
dz/da: tensor(3.)
dz/db: tensor(1.)
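
The cell above calls `backward` only once. The following sketch makes the second call explicit, under the assumption that gradients have not been zeroed in between: the second pass adds to the values already stored in `grad`, and a further pass fails because the graph was freed.

```python
import torch

x = torch.tensor(3., requires_grad=True)
a = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)
z = a * x + b

z.backward(retain_graph=True)    # first pass, graph is kept
first = x.grad.clone()

z.backward()                     # second pass, graph is freed afterwards
second = x.grad.clone()          # gradients accumulate across calls

try:
    z.backward()                 # third pass on a freed graph
    third_failed = False
except RuntimeError as err:
    third_failed = True
    print(f'RuntimeError: {err}')

print('after first pass :', first)
print('after second pass:', second)
```

If accumulation is not wanted, the stored gradients can be reset between passes, for example with `x.grad.zero_()`.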


## 3. retain_grad()

This function allows a tensor that is not a leaf node to store the gradient that passes through it during backward pass. It can help to troubleshoot the flow of gradients in the graph.


In [13]:
x = torch.tensor(3., requires_grad=True)
a = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)

y = a * x + b
y.retain_grad()

c = torch.tensor(4., requires_grad=True)
d = torch.tensor(4., requires_grad=True)

w = c * y + d
w.retain_grad()

w.backward()

print('dw/dx:', x.grad) 
print('dw/da:', a.grad) 
print('dw/db:', b.grad) 
print('dw/dy:', y.grad)
print('dw/dc:', c.grad) 
print('dw/dd:', d.grad) 
print('dw/dw:', w.grad)


dw/dx: tensor(16.)
dw/da: tensor(12.)
dw/db: tensor(4.)
dw/dy: tensor(4.)
dw/dc: tensor(17.)
dw/dd: tensor(1.)
dw/dw: tensor(1.)
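
For contrast, here is a small sketch of what happens without `retain_grad()`: the non-leaf tensor `y` ends up with no stored gradient, while `kept` (a hypothetical second intermediate that does call `retain_grad()`) keeps its own. Reading `.grad` on a non-leaf tensor makes PyTorch emit a warning, which the snippet silences.

```python
import warnings
import torch

x = torch.tensor(3., requires_grad=True)
a = torch.tensor(4., requires_grad=True)

y = a * x            # non-leaf, gradient is not retained
kept = a * x         # non-leaf, gradient is retained
kept.retain_grad()

(y + kept).backward()

with warnings.catch_warnings():
    warnings.simplefilter('ignore')   # silence the non-leaf .grad warning
    y_grad = y.grad

print('y.grad   :', y_grad)
print('kept.grad:', kept.grad)
print('x.grad   :', x.grad)
```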


## 4. register_hook()

The hook will be called every time a gradient with respect to the Tensor is computed. The hook should have the following signature:

`hook(grad) -> Tensor or None`

So, **the hook can read the value of `grad` and either return a new value to be used in its place or just perform some operation with it.**
This is the best part, because it lets you
1. Modify `grad` on the fly during a backward pass without waiting for the pass to complete. This changes the gradients that flow on to the rest of the graph.
2. Debug the flow of gradients in your graph by inspecting the gradient at each step, even for non-leaf nodes.

Here is an example from the PyTorch documentation.

In [22]:
v = torch.tensor([0., 0., 0.], requires_grad=True)
h = v.register_hook(lambda grad: grad * 2)  # double the gradient
v.backward(torch.tensor([1., 2., 3.]))
print(v.grad)
h.remove()

tensor([2., 4., 6.])
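
The documentation example registers the hook on a leaf. As a sketch of the two uses listed earlier, the snippet below registers a hook on the non-leaf tensor `y`, records the gradient that reaches it and doubles it before it flows further back. The hook name `log_and_double` and the list `seen` are assumptions made for this example.

```python
import torch

x = torch.tensor(3., requires_grad=True)
a = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)

y = a * x
seen = []

def log_and_double(grad):
    seen.append(grad.clone())   # debugging: record the gradient reaching y
    return grad * 2             # modification: double it for everything upstream of y

handle = y.register_hook(log_and_double)

z = y + b
z.backward()
handle.remove()

print('gradient seen at y:', seen[0])
print('dz/dx:', x.grad)   # doubled: 2 * a
print('dz/da:', a.grad)   # doubled: 2 * x
print('dz/db:', b.grad)   # not behind y, so unchanged
```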


## 5. detach()

It returns a new tensor that is not part of the computation graph, i.e. it detaches the tensor from the graph that created it, making it a leaf.
The two tensors share the same memory; the output tensor is a leaf with `grad_fn = None` and `requires_grad = False`.

In [23]:
print(f'Original V : {v}')
new = v.detach()
new[0] = 5
print(f'New detached V : {new}')
print(f'New V : {v}')

Original V : tensor([0., 0., 0.], requires_grad=True)
New detached V : tensor([5., 0., 0.])
New V : tensor([5., 0., 0.], requires_grad=True)
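
A further sketch, with illustrative tensor names, checks the stated properties of a detached tensor and shows that no gradient flows back through it: in `z = x * d`, the detached `d` is treated as a constant.

```python
import torch

x = torch.tensor(3., requires_grad=True)
y = x * 2
d = y.detach()

print(d.requires_grad, d.grad_fn, d.is_leaf)
print('shares memory:', d.data_ptr() == y.data_ptr())

z = x * d          # d behaves like the constant 6 here
z.backward()
print('dz/dx:', x.grad)
```

Had `y` been used instead of `d`, the product would be `2 * x**2` and the gradient would also include the contribution through `y`.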
