# Automatic Differentiation with `torch.autograd`
In this notebook, we will explore the `torch.autograd` module in PyTorch, which provides automatic differentiation for all operations on Tensors. This is a key feature for training neural networks using backpropagation.

Let's consider the simplest one-layer neural network, with input `x`, parameters `w` and `b` (weights and biases), and some loss function:

In [2]:
import torch

x = torch.ones(5)   # Input tensor
y = torch.zeros(3)  # Expected output (target)

w = torch.randn(5, 3, requires_grad=True)  # Weights
b = torch.randn(3, requires_grad=True)  # Bias

z = torch.matmul(x, w) + b  # Linear transformation

# Calculate the loss
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y) # Binary Cross Entropy Loss

The code above defines the following **Computational Graph**:

![Computational Graph](./img/comp-graph.png "Computational Graph")

In this network, `w` and `b` are *parameters*, which we need to optimize. Thus, we need to be able to compute the gradients of the loss with respect to those variables, i.e., $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$. 

This is where `torch.autograd` comes in handy. It allows us to compute these gradients automatically using the chain rule of calculus. That is why we set the `requires_grad` property to `True` for the parameters we want to optimize.

Any function that we apply to tensors to construct the computational graph is in fact an object of class `torch.autograd.Function`. This object has two methods: `forward` and `backward`.
1. The `forward` method computes the output of the function.
2. The `backward` method computes the gradient with respect to the function's inputs, given the gradient with respect to its output.

A reference to the backward propagation function is stored in the `grad_fn` attribute of the tensor.

In [4]:
print(f"Gradient function for z: {z.grad_fn}\n")
print(f"Gradient function for loss: {loss.grad_fn}\n")

Gradient function for z: <AddBackward0 object at 0x7f0048631ea0>

Gradient function for loss: <BinaryCrossEntropyWithLogitsBackward0 object at 0x7f00486331f0>
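
To make the `forward`/`backward` pair described above concrete, here is a minimal sketch of a custom `torch.autograd.Function`. The class name `Square` and the tensor `t` are hypothetical and chosen only for illustration; built-in operations such as `matmul` follow the same idea but are implemented inside PyTorch.

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical example: elementwise square with a hand-written backward."""

    @staticmethod
    def forward(ctx, inp):
        # Save the input, since the derivative 2 * inp needs it later
        ctx.save_for_backward(inp)
        return inp.pow(2)

    @staticmethod
    def backward(ctx, grad_output):
        # Chain rule: incoming gradient times the local derivative
        (inp,) = ctx.saved_tensors
        return grad_output * 2 * inp

t = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
s = Square.apply(t).sum()
s.backward()
print(t.grad)  # equals 2 * t
```

Custom functions are invoked through `Square.apply(...)` rather than by instantiating the class.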



## Computing Gradients
To optimize the parameters of our model, we need to compute the derivatives of our loss function with respect to the parameters, under some fixed values of `x` and `y`. To compute those derivatives, we call `loss.backward()`, and then retrieve the values from `w.grad` and `b.grad`.

In [5]:
loss.backward()  # Apply Backpropagation to compute gradients

print(f"Gradient of loss w.r.t. weights: {w.grad}\n")
print(f"Gradient of loss w.r.t. bias: {b.grad}\n")

Gradient of loss w.r.t. weights: tensor([[0.1531, 0.3012, 0.1102],
        [0.1531, 0.3012, 0.1102],
        [0.1531, 0.3012, 0.1102],
        [0.1531, 0.3012, 0.1102],
        [0.1531, 0.3012, 0.1102]])

Gradient of loss w.r.t. bias: tensor([0.1531, 0.3012, 0.1102])
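
The gradients returned by `backward()` can be checked against a closed form. As a sketch, assuming the default mean reduction of `binary_cross_entropy_with_logits`, the chain rule gives $\frac{\partial L}{\partial z} = \frac{\sigma(z) - y}{N}$ (with $N = 3$ outputs), $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z}$, and $\frac{\partial L}{\partial w} = x \otimes \frac{\partial L}{\partial z}$. Because `x` is all ones, every row of `w.grad` equals `b.grad`, which is the pattern visible in the output above. The seed and the `expected_*` names are illustrative choices.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

# Hand-derived gradients for the mean-reduced loss
with torch.no_grad():
    dL_dz = (torch.sigmoid(z) - y) / z.numel()
    expected_w_grad = torch.outer(x, dL_dz)  # shape (5, 3)
    expected_b_grad = dL_dz                  # shape (3,)

w_matches = torch.allclose(w.grad, expected_w_grad, atol=1e-6)
b_matches = torch.allclose(b.grad, expected_b_grad, atol=1e-6)
print(w_matches, b_matches)
```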



### Disabling Gradient Tracking
By default, all operations on tensors with `requires_grad=True` are tracked. However, there are cases when we do not need that, e.g., when we are evaluating the model on a validation or test set. Since we will not backpropagate the loss in those cases, we can disable gradient tracking with the `torch.no_grad()` context manager. This reduces memory consumption, because the intermediate results needed for a backward pass are not stored.

In [6]:
z = torch.matmul(x,w) + b
print(z.requires_grad)  # Check if requires_grad is set to True

with torch.no_grad():
    z = torch.matmul(x, w) + b  # No gradient tracking
print(z.requires_grad)  # Check if requires_grad is set to False

True
False


Another way to achieve the same result is to use the `detach()` method, which returns a tensor that shares the same data but is detached from the computational graph.

In [7]:
z = torch.matmul(x, w) + b
z_det = z.detach()  # Detach from the computation graph
print(z_det.requires_grad)  # Check if requires_grad is set to False

False


#### Some reasons why you might want to disable gradient tracking:
- Mark some parameters as *frozen parameters*.
- Speed up computations when you are only doing forward passes.
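
As a sketch of the frozen-parameters case, the snippet below turns off `requires_grad` for the first layer of a small model, so only the last layer receives gradients. The two-layer `nn.Sequential` model and its sizes are hypothetical and serve only as an illustration.

```python
import torch
from torch import nn

# Hypothetical model: pretend the first layer is pretrained
model = nn.Sequential(nn.Linear(5, 4), nn.ReLU(), nn.Linear(4, 3))

# Freeze the first layer so autograd does not track its parameters
for p in model[0].parameters():
    p.requires_grad_(False)

out = model(torch.ones(2, 5)).sum()
out.backward()

frozen_grads = [p.grad for p in model[0].parameters()]
trainable_grads = [p.grad for p in model[2].parameters()]

print(all(g is None for g in frozen_grads))         # frozen layer: no gradients
print(all(g is not None for g in trainable_grads))  # last layer: gradients computed
```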

## Forward and Backward Pass
Conceptually, `autograd` keeps a record of all operations performed on tensors with `requires_grad=True` in a directed acyclic graph (DAG) consisting of `Function` objects. In this DAG, leaves are the input tensors and roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

#### In a forward pass, `autograd` does two things:
- Run the requested operation to compute a resulting tensor.
- Maintain the operation's gradient function in the *DAG*.
  - Store the context needed for computing the gradients in the `grad_fn` attribute of the resulting tensor.

#### In a backward pass, `backward()` then:
- Computes the gradients from each `.grad_fn` in the *DAG*.
- Accumulates them in the respective tensor's `.grad` attribute.
- Using the chain rule, propagates them all the way back to the leaves of the graph.
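
The graph described above can be inspected directly. The sketch below walks from `z.grad_fn` toward the leaves through the `next_functions` attribute of each node; the helper `collect_nodes` is a hypothetical name introduced here. The intermediate node names depend on the PyTorch version, so only the root (`AddBackward0`) and the `AccumulateGrad` nodes attached to leaf tensors should be relied on.

```python
import torch

x = torch.ones(5)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b

def collect_nodes(fn, depth=0, found=None):
    """Depth-first walk from a grad_fn toward the leaves, recording node names."""
    if found is None:
        found = []
    if fn is None:  # inputs that do not require gradients have no node
        return found
    found.append((depth, type(fn).__name__))
    for next_fn, _ in fn.next_functions:
        collect_nodes(next_fn, depth + 1, found)
    return found

nodes = collect_nodes(z.grad_fn)
for depth, name in nodes:
    print("  " * depth + name)

print(w.is_leaf, z.is_leaf)  # w is a leaf tensor, z is not
```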

### Tensor Gradients and Jacobian Product
In many cases, we have a scalar loss function, and we need to compute the gradient with respect to the parameters. However, there are cases when the network's output is an arbitrary tensor. In this case, PyTorch allows computing the *Jacobian Product* instead of the actual gradient.

For a vector function $y = f(x)$, where $x=(x_1, \ldots, x_n)$ and $y=(y_1, \ldots, y_m)$, the gradient of $y$ with respect to $x$ ($\frac{\partial y}{\partial x}$) is given by the Jacobian matrix $J$ of the function $f$:
$$
J = \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{bmatrix}
$$


Instead of computing the full Jacobian matrix, PyTorch allows us to compute the product of the Jacobian with a given vector $v$. (`torch.autograd.functional.jacobian`, by contrast, materializes the full matrix.) This is useful when the Jacobian is too large to compute explicitly, or when only the product is needed, as in backpropagation.

This is achieved by calling `backward()` on the output tensor with $v$ as an argument. The shape of $v$ must match the shape of the output tensor.

The Jacobian product is $v^T \cdot J$. For example, if $v$ is the gradient of a scalar loss $l = g(y)$ with respect to $y$, then by the chain rule $v^T \cdot J$ is the gradient of $l$ with respect to $x$.
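
As a small check of this identity, the sketch below compares the result of `backward(v)` with an explicit $v^T \cdot J$, where the full Jacobian is built with `torch.autograd.functional.jacobian`. The function `f` and the values of `x` and `v` are hypothetical, chosen so that $J$ is easy to verify by hand: at $x = (1, 2, 3)$ it is $\begin{bmatrix} 2 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}$.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Hypothetical map from R^3 to R^2
    return torch.stack([x[0] * x[1], x[1] + x[2]])

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 10.0])

# Jacobian product via backward: the result lands in x.grad
y = f(x)
y.backward(v)

# Explicit product with the full 2 x 3 Jacobian, for comparison only
J = jacobian(f, x.detach())
explicit = v @ J

print(x.grad)    # both results are [2, 11, 10]
print(explicit)
```

For large models the explicit route is impractical, since $J$ has one entry per output-input pair, while `backward(v)` never forms the matrix.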

In [16]:
inp = torch.eye(4, 5, requires_grad=True)  # Input tensor
out = (inp+1).pow(2).t()

print(f"inp: {inp}\n")
print(f"out: {out}\n")
print(f"Gradient function for out: {out.grad_fn}\n")

inp: tensor([[1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.]], requires_grad=True)

out: tensor([[4., 1., 1., 1.],
        [1., 4., 1., 1.],
        [1., 1., 4., 1.],
        [1., 1., 1., 4.],
        [1., 1., 1., 1.]], grad_fn=<TBackward0>)

Gradient function for out: <TBackward0 object at 0x7f0048633310>



In [27]:
# Perform backpropagation on the output tensor
out.backward(torch.ones_like(out), retain_graph=True)  # Retain the graph for further backward calls
print(f"First Call\nGradient of out w.r.t. inp: {inp.grad}\n")
out.backward(torch.ones_like(out), retain_graph=True)  # Perform backpropagation again
print(f"Second Call\nGradient of out w.r.t. inp after second backward: {inp.grad}\n")

# Clear the gradient
inp.grad.zero_() 
# Perform another backward pass
out.backward(torch.ones_like(out), retain_graph=True)
print(f"Third Call\nGradient of out w.r.t. inp after clearing gradient: {inp.grad}\n")

# Clear the gradient
inp.grad.zero_()


First Call
Gradient of out w.r.t. inp: tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]])

Second Call
Gradient of out w.r.t. inp after second backward: tensor([[8., 4., 4., 4., 4.],
        [4., 8., 4., 4., 4.],
        [4., 4., 8., 4., 4.],
        [4., 4., 4., 8., 4.]])

Third Call
Gradient of out w.r.t. inp after clearing gradient: tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]])



tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])
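
The second call above returns doubled values because `backward()` accumulates into `.grad` rather than overwriting it, which is why the gradient is zeroed before the third call. As a sketch on a smaller tensor, the snippet below reproduces the accumulation and contrasts it with `torch.autograd.grad`, which returns the gradient without touching `.grad`. The 2 x 2 input and the variable names are illustrative.

```python
import torch

inp = torch.eye(2, requires_grad=True)
out = (inp + 1).pow(2)
v = torch.ones_like(out)

out.backward(v, retain_graph=True)
first = inp.grad.clone()   # 2 * (inp + 1)

out.backward(v, retain_graph=True)
second = inp.grad.clone()  # accumulated: twice the first result

# torch.autograd.grad returns the gradient and leaves inp.grad unchanged
(fresh,) = torch.autograd.grad(out, inp, grad_outputs=v)

print(first)
print(second)
print(fresh)
```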