In [375]:
import torch

### Automatic Differentiation

Calculating derivatives is the crucial step in all the optimization algorithms that we will use to train deep networks. Modern deep learning frameworks take this work off our plates by offering automatic differentiation (often shortened to autograd).

### Explanation based on a simple function

**y = 2x<sup>T</sup>x**, where **x** is a vector

In [376]:
x = torch.arange(4.0)
x

tensor([0., 1., 2., 3.])

Before calculating the gradient, we need a place to store it. In real-life scenarios the same gradients are recomputed many times over large amounts of data, so allocating fresh memory on every pass would risk running out of it.
For this reason, the gradient with respect to the vector **x** is stored **on that vector** (in its `grad` attribute) and has the same shape as **x**.

In [377]:
x.requires_grad_(True)
# We can also define this when creating a tensor by x = torch.arange(4.0, requires_grad=True)

# Gradient for now is None by default
print(x.grad)

None


In [378]:
# You can also use "matmul", but the algorithm it executes varies with the input shapes, while "dot" states exactly which operation you want
# dot product = scalar product
y = 2 * torch.dot(x, x)
y

tensor(28., grad_fn=<MulBackward0>)
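The difference between `dot` and `matmul` mentioned in the comment above can be seen in a minimal sketch. The tensors `a` and `M` are hypothetical inputs chosen for illustration and are not part of the notebook.

```python
import torch

a = torch.arange(4.0)        # hypothetical 1-D input
M = torch.ones(4, 4)         # hypothetical 2-D input

dot_result = torch.dot(a, a)        # 1-D with 1-D: scalar product
matmul_vec = torch.matmul(a, a)     # same inputs: matmul also returns the scalar product
matmul_mat = torch.matmul(M, a)     # 2-D with 1-D: matmul switches to a matrix-vector product

# torch.dot only accepts 1-D tensors, so a 2-D argument is rejected
dot_rejected = False
try:
    torch.dot(M, a)
except RuntimeError:
    dot_rejected = True

print(dot_result, matmul_vec)
print(matmul_mat)
print(dot_rejected)
```

Using `dot` therefore documents the intent and fails loudly if the input is not a vector.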

In [379]:
x.grad # Still None
y.backward() # Take the gradient of y with respect to x by calling its backward method - "x" gradient will be now filled
x.grad

tensor([ 0.,  4.,  8., 12.])

In [380]:
# The gradient of y with respect to x should be 4 * x:

# y = 2 * dot(x, x) = 2 * sum(x_i ** 2)
# dy/dx_i = 2 * 2 * x_i
# dy/dx_i = 4 * x_i
# so the full gradient is 4 * x

x.grad == 4 * x

tensor([True, True, True, True])
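As an independent cross-check of the 4 * x result, the gradient can also be approximated numerically. The following is a rough sketch using central finite differences; the helper `f`, the step size `eps`, and the use of `float64` are assumptions made for this illustration only.

```python
import torch

def f(t):
    # Same function as in the notebook: y = 2 * x^T x
    return 2 * torch.dot(t, t)

# Gradient from autograd (float64 keeps the numerical comparison tight)
x64 = torch.arange(4.0, dtype=torch.float64, requires_grad=True)
(autograd_grad,) = torch.autograd.grad(f(x64), x64)

# Central finite differences on a plain tensor, one coordinate at a time
x_plain = torch.arange(4.0, dtype=torch.float64)
eps = 1e-6
numeric_grad = torch.zeros_like(x_plain)
for i in range(x_plain.numel()):
    step = torch.zeros_like(x_plain)
    step[i] = eps
    numeric_grad[i] = (f(x_plain + step) - f(x_plain - step)) / (2 * eps)

print(autograd_grad)
print(torch.allclose(autograd_grad, numeric_grad, atol=1e-6))
```

Finite differences need two function evaluations per coordinate, which is one reason autograd is preferred for anything beyond a sanity check.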

In [381]:
y = x.sum()
y.backward()
y

tensor(6., grad_fn=<SumBackward0>)

In [382]:
x.grad
# Result - tensor([ 1.,  5.,  9., 13.]), i.e. the earlier gradient [0, 4, 8, 12] plus the new gradient [1, 1, 1, 1]
# Because PyTorch does NOT automatically reset the gradient buffer.
# Instead, the new gradient is added to the already-stored gradient.
# This behavior comes in handy when we want to optimize the sum of multiple objective functions.

tensor([ 1.,  5.,  9., 13.])
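To illustrate why accumulation suits a sum of objectives, here is a small sketch comparing two separate `backward()` calls with a single call on the summed objective. The names `loss_a`, `loss_b`, `xa`, and `xb` are hypothetical and introduced only for this example.

```python
import torch

# Two objectives, differentiated one after the other on the same tensor
xa = torch.arange(4.0, requires_grad=True)
loss_a = 2 * torch.dot(xa, xa)   # gradient: 4 * x
loss_b = xa.sum()                # gradient: all ones
loss_a.backward()
loss_b.backward()                # added on top of the stored gradient

# The summed objective, differentiated in one call on a fresh tensor
xb = torch.arange(4.0, requires_grad=True)
(2 * torch.dot(xb, xb) + xb.sum()).backward()

print(xa.grad)
print(torch.equal(xa.grad, xb.grad))
```

Both routes give the same gradient, because the gradient of a sum is the sum of the gradients.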

In [383]:
# In order to reset the gradient use "grad.zero_()"
x.grad.zero_()
y = x.sum()
y.backward()

print(f"x: {x}")
print(f"y: {y}")
print(f"x.grad: {x.grad}")

x: tensor([0., 1., 2., 3.], requires_grad=True)
y: 6.0
x.grad: tensor([1., 1., 1., 1.])


### Backward for Non-Scalar Variables

In [384]:
x.grad.zero_()
y = x * x
# Just running "y.backward()" raises an error - RuntimeError: grad can be implicitly created only for scalar outputs (y is a vector here, not a scalar)
# The "gradient" argument supplies a vector v with the same shape as y; backward() then computes the vector-Jacobian product v^T * (dy/dx)
# instead of the full Jacobian dy/dx. With v = ones, this is the gradient of y.sum().

y.backward(gradient=torch.ones(len(y))) # Faster: y.sum().backward()

print(f"x: {x}")
print(f"y: {y}")
print(f"x.grad: {x.grad}")

x: tensor([0., 1., 2., 3.], requires_grad=True)
y: tensor([0., 1., 4., 9.], grad_fn=<MulBackward0>)
x.grad: tensor([0., 2., 4., 6.])
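The role of the `gradient` argument becomes clearer with a vector other than all ones. The sketch below uses a hypothetical weight vector `v` (not from the notebook) and compares the result with an explicit Jacobian computed by `torch.autograd.functional.jacobian`.

```python
import torch

xv = torch.arange(4.0, requires_grad=True)
yv = xv * xv                                  # elementwise, so the Jacobian is diag(2 * x)

# Hypothetical weights chosen for illustration
v = torch.tensor([1.0, 0.5, 0.0, 2.0])
yv.backward(gradient=v)                       # stores v^T J in xv.grad

# Explicit Jacobian for comparison (4 x 4 here; impractical for large tensors)
J = torch.autograd.functional.jacobian(lambda t: t * t, torch.arange(4.0))
explicit = v @ J

print(xv.grad)
print(torch.equal(xv.grad, explicit))
```

Each entry of `v` scales the contribution of the matching entry of `y`, so `backward(gradient=v)` is equivalent to calling `(v * y).sum().backward()`.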


### Detaching Computation

To keep some calculations outside of the recorded computational graph, so that they are not included in the gradient, we need to detach the respective part of the computational graph from the final result.

Suppose we have **z = x * y** and **y = x * x** but we want to focus on the direct influence of **x** on **z** rather than the influence conveyed via **y**.

So we want **x affects z**, NOT **x affects y affects z**.

In [385]:
x.grad.zero_()

y = x * x
u = y.detach()
# Create a new variable u that takes the same value as y but whose provenance (how it was created) has been wiped out. 
# Thus u has no ancestors in the graph and gradients do not flow through u to x

z = u * x
z.sum().backward()

print(f"x: {x}")
print(f"y: {y}")
print(f"z: {z}")
print(f"x.grad: {x.grad}")
x.grad == u

x: tensor([0., 1., 2., 3.], requires_grad=True)
y: tensor([0., 1., 4., 9.], grad_fn=<MulBackward0>)
z: tensor([ 0.,  1.,  8., 27.], grad_fn=<MulBackward0>)
x.grad: tensor([0., 1., 4., 9.])


tensor([True, True, True, True])
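What `detach()` changes can be inspected directly on the tensors. This is a brief sketch with freshly created tensors; the names `xd`, `yd`, and `ud` are hypothetical stand-ins for `x`, `y`, and `u` above.

```python
import torch

xd = torch.arange(4.0, requires_grad=True)
yd = xd * xd
ud = yd.detach()

# yd is still part of the graph; ud is a leaf with no history
print(yd.requires_grad, yd.grad_fn is not None)
print(ud.requires_grad, ud.grad_fn is None)

# Same values, and the detached tensor shares storage with the original
print(torch.equal(ud, yd))
print(ud.data_ptr() == yd.data_ptr())
```

Since the storage is shared, `detach()` does not copy data; it only returns a view of the same values that autograd treats as a constant.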

In [386]:
# The procedure above detaches y's ancestors from the graph leading to z, but
# the computational graph leading to y itself persists, so we can still calculate the gradient of y with respect to x.
x.grad.zero_()
y.sum().backward()

print(f"x: {x}")
print(f"y: {y}")
x.grad, x.grad == 2 * x

x: tensor([0., 1., 2., 3.], requires_grad=True)
y: tensor([0., 1., 4., 9.], grad_fn=<MulBackward0>)


(tensor([0., 2., 4., 6.]), tensor([True, True, True, True]))

In [387]:
x.grad.zero_()
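# For comparison, without detach(): o = x ** 3, so gradients flow through every factor and the result is 3 * x ** 2
o = x * x * x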
o.sum().backward()

x, o, x.grad

(tensor([0., 1., 2., 3.], requires_grad=True),
 tensor([ 0.,  1.,  8., 27.], grad_fn=<MulBackward0>),
 tensor([ 0.,  3., 12., 27.]))