# In this section, we explore PyTorch's ability to track gradients automatically through its computational graph. Imagine using gradient descent to update a neural network with a very large number of parameters: PyTorch's autograd system records the computational graph during the forward pass and computes all the gradients during the backward pass.

*Reference video: https://www.youtube.com/watch?v=DbeIqrwb_dE&list=PLqnslRFeH2UrcDBWF5mfPGpqQDSta6VK4&index=3*

In [None]:
import torch

In [None]:
# Let's start with a very simple example

x = torch.randn(3)
print(x)

"""
Let's consider that this `x` is a parameter of some model.
Our aim is to update this `x` to make better predictions.

Let's first apply a simple check:
"""

print(x.requires_grad)
# This returns False.

"""
`.requires_grad` is an attribute in PyTorch that indicates whether a tensor requires gradient tracking.
In other words, it determines whether this tensor is part of the computational graph for later gradient computation.

The return value of False simply indicates that `x` is not being tracked for gradients yet.
To enable gradient tracking, we can do the following:
"""

x.requires_grad = True
print(x.requires_grad)

"""
Alternatively, another simple way to achieve this when creating the tensor is:
"""
x = torch.randn(3, requires_grad=True)
print(x.requires_grad)

# Now, x is ready for gradient tracking and can participate in the computational graph.

"""
Let us now perform a very simple forward pass, i.e., the series of operations
applied to the input as the data flows through the model.
"""

y = x + 2
z = y * y * 2
z = z.mean()

"""
The forward pass (computational graph) can be described as:
x --> addition operation (+ 2) --> output y --> multiplication operation (y * y * 2)
--> output z --> mean operation --> final output
"""
print(z)  # Note that the printed `z` displays the latest operation: grad_fn=<MeanBackward0>

"""
Exercise: Besides <MeanBackward0>, what are the `grad_fn` values for other types of operations?
"""


"""
Now, we can apply the built-in function `.backward()` to compute the gradients of `z` with respect
to the `requires_grad=True` parameters recorded in the computational graph. PyTorch automatically tracks
the backward path through the computational graph to calculate all these gradients. This process is known as
"backpropagation."

IMPORTANT: Intermediate variables, like `y` in this case, are automatically created with `requires_grad=True`
when they are the result of operations involving other `requires_grad=True` variables, such as `x`.

Note: Gradient computation in PyTorch is based on the vector-Jacobian product.
If `z` is a scalar (a single value), `.backward()` can be called without arguments. However,
if `z` is not a scalar (i.e., it is a tensor with multiple values), you must pass `.backward()` a tensor of the
same shape as `z` that weights the gradient of each element (this is the `gradient` argument).
For an explanation of the Jacobian in neural networks, see: https://www.youtube.com/watch?v=AdV5w8CY3pw&t=3s
"""

# To find out ∂z / ∂x, we simply do:
z.backward()
print(x.grad)  # Gradient of z with respect to x


tensor([-0.6120,  0.5023, -0.3509])
False
True
True
tensor(15.6566, grad_fn=<MeanBackward0>)
tensor([5.2619, 3.5785, 1.1214])
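The printed gradient can be checked by hand. Since z is the mean of 2 * (x_i + 2)^2 over three elements, each component is ∂z / ∂x_i = 4 * (x_i + 2) / 3. A minimal sketch of that check, using illustrative names `p`, `q`, `r` in place of the notebook's `x`, `y`, `z`:

```python
import torch

# Same forward pass as above, on a fresh illustrative tensor.
p = torch.randn(3, requires_grad=True)
q = p + 2
r = (q * q * 2).mean()
r.backward()

# Hand-derived gradient: d(r)/d(p_i) = 4 * (p_i + 2) / 3
analytic = 4 * (p.detach() + 2) / 3
print(p.grad)
print(torch.allclose(p.grad, analytic))
```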


In [None]:
"""
The following points (related to "leaf variables") go beyond what we need to know for now, but can be important:
(feel free to skip if you are not ready)

In a nutshell, the tensors we MANUALLY create with `requires_grad=True` are called "leaf variables."
On the other hand, `y` here is an intermediate variable created through an operation.
Although `y` has `requires_grad=True` because it is part of the computational graph, it is not a "leaf variable."
When a tensor is not a leaf variable, PyTorch does not store its gradient during `backward()` to save memory.
This makes sense since we typically do not need to update these intermediate variables.

If, for some reason, you need gradients for such intermediate variables, you can use `retain_grad()` to ensure
the gradient is stored during backpropagation.
"""

# Check gradients and leaf status
print(y.grad)  # You will find ∂z / ∂y is not recorded and simply returns None
print(x.is_leaf)  # x is a leaf variable
print(y.is_leaf)  # y is not a leaf variable

# Using retain_grad() to retain gradients for the intermediate variable y
y = x + 2
z = y * y * 2
z = z.mean()

y.retain_grad()
z.backward()

# Now the gradient on y is also retained
print(f'Now the grad on y is also retained: {y.grad}')

None
True
False
Now the grad on y is also retained: tensor([5.2619, 3.5785, 1.1214])



In [None]:
"""
IMPORTANT:

When calling `.backward()`, the gradients are recorded (also called "populated"),
but at the same time, the involved "forward computational graph" is freed from memory.
This means an error will occur if you try to call `z.backward()` again without recomputing
the forward pass (i.e., `y = x + 2; z = y * y * 2; z = z.mean()`).

To prevent the system from freeing the computational graph during the backward pass, you can use
`z.backward(retain_graph=True)`. However, this is for more advanced use cases, and we will discuss its
applications in a later tutorial.
"""

In [None]:
"""
Another important characteristic of PyTorch's `.backward()` is that the calculated gradients
are accumulated (added) into the `.grad` attribute of each `requires_grad=True` leaf tensor.
Therefore, if a tensor already holds a gradient from a previous backward pass, calling
`.backward()` again adds the newly calculated gradient to the existing one.

Here’s an example of what could happen:
"""

# Example
weights = torch.ones(4, requires_grad=True)

model_output = (weights * 3).sum()
model_output.backward()
print(weights.grad)  # The gradient is calculated and printed

# Now, let's try calling .backward() again.
# Remember, before we do another backward pass, we need to recompute the forward pass
# because the previous computational graph was freed.
model_output = (weights * 3).sum()
model_output.backward()
print(weights.grad)

"""
The gradient is now [6, 6, 6, 6] because this is the second time we called `.backward()`:
PyTorch accumulates gradients, adding the new ones to the existing ones.
A single backward pass through `(weights * 3).sum()` gives only [3, 3, 3, 3].

This behavior is useful when you want to accumulate gradients by calling `.backward()`
on several losses one after another.

In this case, however, we want each backward pass to yield its own gradient, so we zero out
the previous gradients before calling `.backward()` again:
"""

weights.grad.zero_()  # Zero out the gradients
print(weights.grad)  # The gradients are now reset to zero

# Hence, for multiple iterations:
for epoch in range(5):
    weights.grad.zero_()  # Zero out gradients each time
    model_output = (weights * 3).sum()
    model_output.backward()

print(weights.grad)

# No matter how many iterations we apply, the gradients are correct now.

tensor([3., 3., 3., 3.])
tensor([6., 6., 6., 6.])
tensor([0., 0., 0., 0.])
tensor([3., 3., 3., 3.])
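The useful side of accumulation can be sketched with two losses. The second loss `(w * w).sum()` and the names `w`, `loss_a`, `loss_b` are hypothetical, chosen only for illustration: calling `.backward()` on each loss in turn should leave the same gradient as one backward pass through their sum.

```python
import torch

w = torch.ones(4, requires_grad=True)

# Backward through two losses one after another: gradients add up in w.grad.
loss_a = (w * 3).sum()  # contributes 3 per element
loss_b = (w * w).sum()  # contributes 2 * w = 2 per element
loss_a.backward()
loss_b.backward()
accumulated = w.grad.clone()

# Reset, recompute the forward pass, and backward once through the summed loss.
w.grad.zero_()
total = (w * 3).sum() + (w * w).sum()
total.backward()

print(accumulated)
print(torch.allclose(accumulated, w.grad))
```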


In [None]:
"""
In some situations, we may not want PyTorch to track gradients.
For example, when we are testing our model, or when we need some intermediate
quantities in the computational graph but want to cut their relation with the original graph.

In these situations, there are three ways to stop PyTorch from tracking gradients:
1. `x.requires_grad_(False)` or `x.requires_grad = False`
2. `x.detach()`
3. Wrapping the operations in `with torch.no_grad():`
"""

# For the 1st way
x = torch.randn(3, requires_grad=True)
print(x)  # x requires gradients here

# Now we can choose to call the function
x.requires_grad_(False)
print(x)  # x doesn't require gradients anymore

# Similarly, we can modify the attribute directly
x = torch.randn(3, requires_grad=True)
x.requires_grad = False
print(x)  # x doesn't require gradients

"""
Setting `x.requires_grad = False` and calling `x.requires_grad_(False)` have the same effect.
The only difference is that `requires_grad` is an attribute, while `requires_grad_()` is a method.
Note that `requires_grad_()` works in place, as indicated by the trailing underscore.
"""

# For the 2nd way
x = torch.randn(3, requires_grad=True)
print(x)  # x requires gradients
y = x.detach()
print(y)  # y shares x's data (it is not a separate copy) and doesn't require gradients

"""
The critical difference between `.detach()` and `requires_grad_()` is that `.detach()` returns
a new tensor `y` that shares the same data as the original tensor but doesn't track gradients.
It is an out-of-place operation: the original tensor remains intact in the computational graph, while `y` is cut off from it.

Example of using `.detach()`:

input -> operation1 -> quantity1 -> operation2 -> quantity2 -> loss
                            |
                            --> y = quantity1.detach() -> (operation on y not for backward) -> some information wanted

In this example, detaching `quantity1` allows you to perform additional operations on `y` without affecting the backward
pass for the loss. The original `quantity1` remains part of the computational graph.

Note: `quantity1` is an intermediate (non-leaf) tensor, so `quantity1.requires_grad = False` is not an option here:
PyTorch raises a RuntimeError, because the `requires_grad` flag can only be changed on leaf tensors.
`.detach()` gives you a gradient-free tensor with the same data while keeping the computational graph intact for `quantity1`.
"""

# For the 3rd way
x = torch.randn(3, requires_grad=True)
with torch.no_grad():
    y = x + 2
    print(y)  # y doesn't require gradients

"""
The `torch.no_grad()` context is commonly used when testing a model or when passing input
through a neural network without wanting to track the computations in the computational graph.
Because no graph is recorded inside the block, memory use drops and computation is typically faster.
"""


tensor([-0.9308,  0.7550,  0.9499], requires_grad=True)
tensor([-0.9308,  0.7550,  0.9499])
tensor([ 0.8912,  1.7899, -1.6068])
tensor([-1.7852, -0.9613,  0.9818], requires_grad=True)
tensor([-1.7852, -0.9613,  0.9818])
tensor([2.5161, 0.6254, 1.2002])
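`torch.no_grad()` also appears in the parameter update itself, since the update step should not be recorded in the graph. A minimal sketch of manual gradient descent that combines the pieces of this section; the squared-error loss, `target`, and `lr` are hypothetical choices for illustration:

```python
import torch

# Hypothetical setup: pull w toward target by minimizing a squared error.
w = torch.zeros(3, requires_grad=True)
target = torch.tensor([1.0, -2.0, 0.5])
lr = 0.1  # arbitrary learning rate

for step in range(100):
    loss = ((w - target) ** 2).sum()  # forward pass builds a fresh graph
    loss.backward()                   # populates w.grad
    with torch.no_grad():
        w -= lr * w.grad              # in-place update, not tracked by autograd
    w.grad.zero_()                    # reset so gradients do not accumulate

print(w)
print(w.requires_grad)
```

After the loop, `w` should be close to `target`, and it still has `requires_grad=True` because only the update step ran without tracking.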
