# Automatic differentiation in PyTorch

Markus Enzweiler, markus.enzweiler@hs-esslingen.de

This is a demo used in a Computer Vision & Machine Learning lecture. Feel free to use and contribute.


## Setup

Adapt `package_path` to point to the directory containing this notebook, e.g. on Colab or locally.

In [None]:
# Imports
import sys
import os

In [None]:
# Package Path
package_path = "./" # local
print(f"Package path: {package_path}")

In [None]:
# Additional imports

# Repository Root
repo_root = os.path.abspath(os.path.join("..", ".."))
# Add the repository root to the system path
sys.path.append(repo_root)

# Package Imports
from nbutils import requirements as reqs

In [None]:
# Install requirements from requirements.txt
req_file = os.path.join(package_path, "requirements.txt")
reqs.pip_install_reqs(req_file)

In [None]:
# Now we should be able to import the additional packages
import torch

## Autograd

Autograd in PyTorch is a powerful tool for automatic differentiation, enabling the efficient computation of gradients in neural networks and other computational graphs. Here's a brief overview:

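1. **Graph Construction**: During the forward pass, PyTorch records the operations applied to tracked tensors in a computational graph. Each result tensor stores a reference to the function that produced it in its `grad_fn` attribute; the input tensors are the leaves of the graph and the outputs are its roots.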

2. **Enable Gradient Tracking**: By setting `requires_grad=True` for a tensor, you tell PyTorch to track all operations on it. This is crucial for gradient computation.

3. **Backward Propagation**: In the backward pass, PyTorch computes gradients by traversing this graph from outputs to inputs. This is done using the chain rule of calculus.

4. **Gradient Calculation**: Gradients are computed with the `torch.autograd.grad` function or by calling `.backward()` on the output tensor. For $y = f(x)$, PyTorch computes $\frac{\partial y}{\partial x}$ by walking back through the graph.


This system allows for efficient and flexible gradient computations, which is essential for training neural networks using gradient-based optimization methods.

See:
- https://pytorch.org/docs/stable/autograd.html
- https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html
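
The four steps above can be sketched on a single scalar. This is a minimal illustration, not part of the original demo; the function $y = x^3 + 2x$ and the evaluation point $x = 2$ are arbitrary choices.

```python
import torch

# Enable gradient tracking on the input (step 2)
x = torch.tensor(2.0, requires_grad=True)

# Forward pass: the operations are recorded in the graph (step 1)
y = x**3 + 2 * x

# y is the result of an operation, so it carries a grad_fn; x is a leaf and does not
print(f"x.grad_fn = {x.grad_fn}")
print(f"y.grad_fn = {y.grad_fn}")

# Backward pass: apply the chain rule from y back to x (steps 3 and 4)
y.backward()

# Analytically, dy/dx = 3x^2 + 2, evaluated here at x = 2
print(f"x.grad = {x.grad}")
```

The printed gradient can be compared with the analytic derivative $3x^2 + 2$ at the chosen point.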

### Autograd with scalar functions

In [None]:
# Torch has "autograd" to automatically compute gradients
# Let's try it out with simple functions first.

# Define the function, x^2+3x+2
def f(x):
    return x**2 + 3*x + 2

# Manual gradient w.r.t x
def f_grad(x):
    return 2*x + 3

def autograd(func, x):
    # Initialize an empty list for gradients
    gradients = []

    # Compute the gradient for each element in the tensor
    for xi in x:
        # Compute the function on the i-th element
        y = func(xi)

        # Compute the gradient for the i-th element
        gradients.append(torch.autograd.grad(outputs=y, inputs=xi)[0])

        # Without a grad_outputs argument, torch.autograd.grad only accepts scalar outputs.
        # func applied to the whole tensor x would return a vector (a tensor with multiple
        # elements), not a scalar. We therefore loop over the elements of x, treat each
        # evaluation func(x[i]) as a scalar output, and compute its gradient with respect
        # to x[i] individually.

    return torch.stack(gradients)



# Compute some function values and gradients
# make sure to set requires_grad=True to enable gradient tracking on the computational graph
x_tensor = torch.arange(-5, 5, 1, dtype=torch.float32, requires_grad=True)

f_value      = f(x_tensor)
f_grad_value = f_grad(x_tensor)  # separate name, so the f_grad function is not overwritten
f_autograd   = autograd(f, x_tensor)

for i in range(len(x_tensor)):
    print(f"x = {x_tensor[i].item():5.2f}: "
          f"f(x) = {f_value[i].item():5.2f}, "
          f"f_grad(x) = {f_grad_value[i].item():5.2f}, "
          f"autograd(x) = {f_autograd[i].item():5.2f}")
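
The element-wise loop is not the only option. As a sketch of an alternative (an addition to the demo, not the author's method), `torch.autograd.grad` accepts a `grad_outputs` argument that supplies the vector of a vector-Jacobian product. Because each output element here depends only on the input element at the same index, a vector of ones yields all element-wise derivatives in a single call.

```python
import torch

# Same function as in the cell above: x^2 + 3x + 2
def f(x):
    return x**2 + 3*x + 2

x = torch.arange(-5, 5, 1, dtype=torch.float32, requires_grad=True)
y = f(x)  # vector output, one value per element of x

# grad_outputs is the vector v in the vector-Jacobian product v^T J.
# y[i] depends only on x[i], so J is diagonal and v = 1 returns dy[i]/dx[i].
(grad_vec,) = torch.autograd.grad(outputs=y, inputs=x,
                                  grad_outputs=torch.ones_like(y))

# Compare with the manual derivative 2x + 3
manual = 2 * x.detach() + 3
print(f"autograd: {grad_vec}")
print(f"manual:   {manual}")
print(f"match:    {torch.allclose(grad_vec, manual)}")
```

This shortcut relies on the function being applied element-wise. If an output element depended on several input elements, the result would be a sum over Jacobian rows rather than per-element derivatives.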

### Autograd with tensors

In [None]:
# Define two tensors and track computations
t1 = torch.tensor([[1, 2, 3],
                   [4, 5, 6]], dtype=torch.float32, requires_grad=True)

t2 = torch.tensor([[7, 8, 9],
                   [10, 11, 12]], dtype=torch.float32, requires_grad=True)

# Initially, t1.grad is None because backward() has not been called yet
print(f"t1.grad = {t1.grad}")

# grad_fn for t1 is None because t1 is a leaf tensor: it was created directly
# from data and is not the result of an operation in the computational graph
print(f"t1.grad_fn = {t1.grad_fn}")

In [None]:
# Perform element-wise multiplication of t1 and t2
t1_mul_t2 = t1 * t2

# The resulting tensor t1_mul_t2 has grad_fn set to MulBackward0,
# indicating that it's a result of a multiplication operation
print(f"t1_mul_t2 = {t1_mul_t2}")

# Gradients for t1 are still None because backward() has not been called yet
print(f"t1.grad = {t1.grad}")

After calling `backward()` in the next cell, `t1.grad` and `t2.grad` are populated.

The gradient of each element of `t1` equals the corresponding element of `t2`, and vice versa. This is because the derivative of `t1[i, j] * t2[i, j]` w.r.t. `t1[i, j]` is `t2[i, j]`, and w.r.t. `t2[i, j]` it is `t1[i, j]`.

In [None]:
# Compute gradients of the sum of all elements in t1_mul_t2 with respect to t1 and t2
t1_mul_t2.sum().backward()

# After backward(), t1.grad and t2.grad are populated.
# Each entry is the rate of change of the sum with respect to that element.
# For element-wise multiplication, t1.grad equals t2 and t2.grad equals t1.

print(f"t1.grad = {t1.grad}")
print(f"t2.grad = {t2.grad}")

### Why `.sum()` is needed for `.backward()`:


- Called without arguments, `.backward()` computes the gradients of a scalar value. A gradient is the rate of change of a scalar with respect to its inputs, so a tensor with more than one element must first be reduced to a scalar. The result of `t1 * t2` is itself a tensor; `.sum()` combines all of its elements into a single scalar that can be differentiated with respect to `t1` and `t2`.
- When `.backward()` is called on the scalar `t1_mul_t2.sum()`, PyTorch applies the chain rule in reverse through the computational graph. It computes the gradient of the scalar with respect to each element of the tensors involved in the computation (`t1` and `t2`), propagating the gradients backwards.
- Each gradient entry indicates how much the scalar sum changes per unit change in the corresponding element of `t1` or `t2`. Optimization problems rely on the same mechanism, with the scalar usually being a loss function.

**In summary**, `.sum()` is employed to convert the tensor resulting from `t1 * t2` into a scalar, enabling `.backward()` to compute gradients. This procedure is standard in many deep learning applications, particularly in the computation of loss functions, where errors are backpropagated from a single scalar value (the loss) to update model parameters.
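
The following sketch illustrates the point with the same tensor values as above. It is an addition to the demo and makes two assumptions worth stating: it uses the `gradient` argument of `.backward()` as an alternative to `.sum()`, and it resets `.grad` between the two variants because PyTorch accumulates gradients across `backward()` calls.

```python
import torch

t1 = torch.tensor([[1, 2, 3],
                   [4, 5, 6]], dtype=torch.float32, requires_grad=True)
t2 = torch.tensor([[7, 8, 9],
                   [10, 11, 12]], dtype=torch.float32, requires_grad=True)

# Calling backward() without arguments on a non-scalar tensor raises a RuntimeError
try:
    (t1 * t2).backward()
    raised = False
except RuntimeError as err:
    raised = True
    print(f"RuntimeError: {err}")

# Variant A: reduce to a scalar first
(t1 * t2).sum().backward()
grad_via_sum = t1.grad.clone()

# Reset the gradients so that the second variant does not add to the first
t1.grad = None
t2.grad = None

# Variant B: pass a tensor of ones as the gradient argument
prod = t1 * t2
prod.backward(gradient=torch.ones_like(prod))
grad_via_arg = t1.grad.clone()

print(f"Both variants agree: {torch.equal(grad_via_sum, grad_via_arg)}")
```

Passing a tensor of ones weights every output element equally, which is what summing does as well, so both variants give the same gradients.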

In [None]:
# Analyzing the gradient at t2[0,1]. If t2.grad[0,1] is 2, it means that a unit change in t2[0,1] results in a
# change of 2 in the sum. Therefore, increasing t2[0,1] by 3 should increase the sum by 3 * t2.grad[0,1], under
# linear approximation.

# Create a new tensor and add 3 to t2[0,1]
t2_modified = t2.clone()
t2_modified[0,1] = t2[0,1] + 3

# Perform the computation again with the modified t2
t1_mul_t2_updated = t1 * t2_modified
updated_sum = t1_mul_t2_updated.sum()

# Compare the change in sum
change_in_sum = updated_sum - t1_mul_t2.sum()
print(f"Change in sum: {change_in_sum}")
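
As a self-contained cross-check (a sketch added for illustration, with the hypothetical names `delta`, `predicted_change` and `actual_change`), the measured change can be compared directly with the value predicted from the gradient. The sum is linear in `t2[0,1]`, so the linear approximation should be exact here, up to floating-point rounding.

```python
import torch

t1 = torch.tensor([[1, 2, 3],
                   [4, 5, 6]], dtype=torch.float32, requires_grad=True)
t2 = torch.tensor([[7, 8, 9],
                   [10, 11, 12]], dtype=torch.float32, requires_grad=True)

# Forward and backward pass on the original tensors
total = (t1 * t2).sum()
total.backward()

delta = 3.0  # perturbation applied to t2[0,1]

# Change predicted from the gradient
predicted_change = delta * t2.grad[0, 1]

# Change measured by recomputing the sum; no gradient tracking is needed for this
with torch.no_grad():
    t2_shifted = t2.clone()
    t2_shifted[0, 1] += delta
    actual_change = (t1 * t2_shifted).sum() - total

print(f"Predicted change: {predicted_change.item():.2f}")
print(f"Actual change:    {actual_change.item():.2f}")
```

For a nonlinear function of `t2`, the two numbers would agree only approximately, and only for small values of `delta`.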