#back propagation
-most frequently used algorithm for training neural networks
-parameters (weights) are adjusted according to the gradient of the loss function with respect to each parameter
-torch.autograd: built-in differentiation engine that computes these gradients

In [1]:
import torch

w, b: parameters to be optimized
*set requires_grad=True when creating a tensor, or later with the x.requires_grad_(True) method
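
A minimal sketch of the second option, turning tracking on after creation. The tensor name `w` and the helper variable `tracked_before` are only illustrative.

```python
import torch

# Tensor created without gradient tracking (the default).
w = torch.randn(5, 3)
tracked_before = w.requires_grad
print(tracked_before)        # False

# requires_grad_ flips the flag in place (trailing underscore = in-place).
w.requires_grad_(True)
print(w.requires_grad)       # True
```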

In [3]:
x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

*Function: object that knows how to compute the operation in the forward direction and its derivative during backward propagation
-a reference to the backward propagation function is stored in the tensor's grad_fn

In [4]:
print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")

Gradient function for z = <AddBackward0 object at 0x7febfcd295a0>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x7febfcd2b850>
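
For contrast, a small sketch of what grad_fn and is_leaf look like on tensors created directly versus tensors produced by an operation. The names mirror the cell above; the exact class name of z.grad_fn depends on the PyTorch version, so it is not printed here.

```python
import torch

x = torch.ones(5)                           # input, no gradient tracking
w = torch.randn(5, 3, requires_grad=True)   # parameter created by the user
z = torch.matmul(x, w)                      # result of an operation

# Tensors created directly are leaves and have no grad_fn.
print(w.is_leaf, w.grad_fn)                 # True None
print(x.is_leaf, x.requires_grad)           # True False

# z was produced by an operation on a tracked tensor, so it has a grad_fn.
print(z.is_leaf, z.grad_fn is not None)     # False True
```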


-need to compute ∂loss/∂w, ∂loss/∂b -> call loss.backward(), then read w.grad and b.grad
-.grad is only available for leaf tensors that have requires_grad=True
-backward can only be performed once on a given graph -> for several backward calls on the same graph, pass retain_graph=True to backward
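
A sketch of the retain_graph behaviour on a toy loss (sum of squares, chosen here only for illustration). The flag `second_call_failed` is a made-up name for this example.

```python
import torch

w = torch.randn(3, requires_grad=True)
loss = (w ** 2).sum()             # d(loss)/dw = 2 * w

loss.backward(retain_graph=True)  # graph buffers are kept
loss.backward()                   # works, but the buffers are freed now

# Gradients from both calls were added together: 2*w + 2*w.
print(torch.allclose(w.grad, 4 * w.detach()))  # True

# A further call on the freed graph raises a RuntimeError.
second_call_failed = False
try:
    loss.backward()
except RuntimeError:
    second_call_failed = True
print(second_call_failed)         # True
```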

In [5]:
loss.backward()
print(w.grad)
print(b.grad)

tensor([[0.0258, 0.2917, 0.0432],
        [0.0258, 0.2917, 0.0432],
        [0.0258, 0.2917, 0.0432],
        [0.0258, 0.2917, 0.0432],
        [0.0258, 0.2917, 0.0432]])
tensor([0.0258, 0.2917, 0.0432])
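
The printed gradients can be checked by hand. This is a sketch assuming the default mean reduction of binary_cross_entropy_with_logits, for which ∂loss/∂z = (sigmoid(z) - y) / N. The seed and the names `dloss_dz`, `manual_w_grad`, `manual_b_grad` are assumptions for this example, so the numbers differ from the output above.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

with torch.no_grad():
    # Chain rule by hand: loss -> z -> (w, b)
    dloss_dz = (torch.sigmoid(z) - y) / y.numel()
    manual_w_grad = torch.outer(x, dloss_dz)  # dz_j/dw_ij = x_i
    manual_b_grad = dloss_dz                  # dz_j/db_j = 1

print(torch.allclose(w.grad, manual_w_grad, atol=1e-6))  # True
print(torch.allclose(b.grad, manual_b_grad, atol=1e-6))  # True
```

Since x is all ones, every row of w.grad equals b.grad, which matches the pattern in the output above.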


-disabling gradient tracking
 e.g. the model is already trained and is only applied to input data, some parameters are marked as frozen, or to speed up computations
 -> only forward computation is needed. wrap it in a torch.no_grad() block

In [6]:
z = torch.matmul(x, w) + b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w) + b
print(z.requires_grad)

True
False
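
For the frozen-parameter case, one possible sketch: switch off requires_grad on the parameter itself so that backward skips it. Freezing w rather than b is an arbitrary choice here.

```python
import torch

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

# Freeze w: autograd will no longer track operations on it.
w.requires_grad_(False)

z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

print(w.grad)               # None, no gradient computed for the frozen tensor
print(b.grad is not None)   # True, b is still trained
```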


-same result with the detach() method on the tensor

In [7]:
z = torch.matmul(x, w) + b
z_det = z.detach()
print(z_det.requires_grad)

False
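
A sketch of what detach() returns: a tensor that shares memory with the original but is cut off from the graph. The original tensor stays tracked. Tensor names and values are illustrative.

```python
import torch

w = torch.ones(3, requires_grad=True)
z = w * 2
z_det = z.detach()

# The detached tensor has no history.
print(z_det.requires_grad, z_det.grad_fn)      # False None

# It points at the same memory as z (no copy is made).
print(z_det.data_ptr() == z.data_ptr())        # True

# z itself is still part of the graph, so backward through it works.
z.sum().backward()
print(w.grad)                                  # tensor([2., 2., 2.])
```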


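autograd keeps a record of tensors and executed operations in a DAG (directed acyclic graph) of Function objects
-leaves = input tensors, roots = output tensors
-tracing the graph from roots to leaves -> gradients are computed using the chain rule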

*forward pass
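-run the requested operation to compute the result tensor
-maintain the operation's gradient function in the DAG

*backward pass - when .backward() is called on the DAG root
-compute the gradients from each .grad_fn
-accumulate them in the respective tensor's .grad attribute
-propagate all the way to the leaf tensors using the chain rule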
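
The graph can be inspected by following grad_fn.next_functions from a root toward the leaves. A rough sketch: `collect_leaves` is a hypothetical helper written for this note, and it assumes leaf tensors appear as AccumulateGrad nodes exposing a `.variable` attribute. Intermediate node names vary between PyTorch versions.

```python
import torch

x = torch.ones(5)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b

def collect_leaves(fn, found):
    """Walk grad_fn nodes from a root toward the leaves (illustrative helper)."""
    if fn is None:                  # branch that needs no gradient, e.g. x
        return
    if hasattr(fn, "variable"):     # AccumulateGrad node: holds a leaf tensor
        found.append(fn.variable)
        return
    for next_fn, _ in fn.next_functions:
        collect_leaves(next_fn, found)

leaves = []
collect_leaves(z.grad_fn, leaves)

# The walk should reach exactly the two parameters w and b.
print(sorted(tuple(t.shape) for t in leaves))
```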

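usually the loss is a scalar, but the output can also be an arbitrary tensor
-in that case autograd computes a Jacobian product, not the actual gradient
-J = Jacobian matrix of the output y with respect to the input x, J[i][j] = ∂y_i/∂x_j
-compute the vector-Jacobian product v^T . J for a given vector v with the same shape as the output
 -> call backward with v as an argument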
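
To make v^T . J concrete, a sketch that builds the full Jacobian with torch.autograd.functional.jacobian and compares it with what backward(v) leaves in .grad. The function `f` and the values of x and v are invented for this example.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Illustrative function from R^3 to R^2.
    return torch.stack([x[0] * x[1], x[1] + x[2]])

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 2.0])      # one weight per output element

out = f(x)
out.backward(v)                   # stores v^T . J in x.grad

J = jacobian(f, x.detach())       # full 2x3 Jacobian, for comparison only
print(J)                          # rows: [2., 1., 0.] and [0., 1., 1.]
print(x.grad)                     # tensor([2., 3., 2.])
print(v @ J)                      # tensor([2., 3., 2.])
```

backward never materialises J itself. The explicit Jacobian is computed here only to check the product.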

In [10]:
inp = torch.eye(4, 5, requires_grad=True)
out = (inp+1).pow(2).t()
out.backward(torch.ones_like(out), retain_graph=True)
print(f"First call\n{inp.grad}")

out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nSecond call\n{inp.grad}")

inp.grad.zero_()
out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nCall after zeroing gradients\n{inp.grad}")

First call
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]])

Second call
tensor([[8., 4., 4., 4., 4.],
        [4., 8., 4., 4., 4.],
        [4., 4., 8., 4., 4.],
        [4., 4., 4., 8., 4.]])

Call after zeroing gradients
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]])


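first -> gradient is 2*(inp+1): 4 on the diagonal (inp = 1), 2 elsewhere (inp = 0)
second -> different because PyTorch accumulates gradients: each backward call adds to .grad
zero -> zero the .grad values first to get the proper gradient. in training, the optimizer helps to do it (optimizer.zero_grad())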
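
A sketch of how this usually looks in a training loop, assuming plain SGD and a toy loss (sum of squares). The learning rate, the starting values and the list `grads` are illustrative choices.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

grads = []
for step in range(2):
    optimizer.zero_grad()         # clear gradients left over from the last step
    loss = (w ** 2).sum()         # d(loss)/dw = 2 * w
    loss.backward()
    grads.append(w.grad.clone())  # keep a copy for inspection
    optimizer.step()              # w <- w - lr * w.grad

print(grads[0])                   # tensor([2., 4.])
print(grads[1])                   # tensor([1.6000, 3.2000])
```

Without the zero_grad() call, the second gradient would be the sum of both steps instead of 2*w for the updated w.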