When looking at the training loop, I noticed something odd.

```
y_pred = model_0(X_train)
loss = loss_fn(y_pred, y_train)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

How does the `loss.backward()` step allow the optimizer to update the weights/biases accordingly?
There is seemingly no reference between loss and optimizer. The optimizer has the model since we initialized it with the model parameters earlier, but the loss is generated simply from y_pred and y_train, two tensors. So how does the loss get communicated to the model?

The answer lies in PyTorch's autograd mechanism.

In [90]:
import torch
import torch.nn as nn

In [91]:
torch.manual_seed(0)

<torch._C.Generator at 0x7f5bfc176250>

In [92]:
# Create a Linear Regression model class
class LinearRegressionModel(nn.Module): # <- almost everything in PyTorch is an nn.Module (think of this as neural network lego blocks)
    def __init__(self):
        super().__init__() 
        self.weights = nn.Parameter(torch.randn(1, # <- start with random weights (this will get adjusted as the model learns)
                                                dtype=torch.float), # <- PyTorch loves float32 by default
                                   requires_grad=True) # <- can we update this value with gradient descent?

        self.bias = nn.Parameter(torch.randn(1, # <- start with random bias (this will get adjusted as the model learns)
                                            dtype=torch.float), # <- PyTorch loves float32 by default
                                requires_grad=True) # <- can we update this value with gradient descent?

    # Forward defines the computation in the model
    def forward(self, x: torch.Tensor) -> torch.Tensor: # <- "x" is the input data (e.g. training/testing features)
        return self.weights * x + self.bias # <- this is the linear regression formula (y = m*x + b)

In [93]:
def print_grad_graph(tensor, indent=0):
    """
    Recursively prints the computation graph of a tensor's gradient function.
    """
    if hasattr(tensor, 'grad_fn') and tensor.grad_fn is not None:
        print(" " * indent + f"{tensor.grad_fn}")
        for f, _ in tensor.grad_fn.next_functions:
            if f is not None:
                print_grad_graph(f, indent + 4)
    elif hasattr(tensor, 'next_functions'):
        print(" " * indent + f"{tensor}")
        for f, _ in tensor.next_functions:
            if f is not None:
                print_grad_graph(f, indent + 4)

The grad_fn attribute of a tensor references the backward node of the last operation that produced the tensor.

Our linear regression model is just y = mx + b, and adding the bias is the last step, so we should see an add operation here.


In [94]:

model = LinearRegressionModel()
x = torch.randn(1,1, requires_grad=False)
y_pred = model(x)

print(f"y_pred.grad_fn: {y_pred.grad_fn}")  # Points to AddBackward0

y_pred.grad_fn: <AddBackward0 object at 0x7f5adac7ce20>
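
As a side note, `grad_fn` is only set on tensors that were produced by an operation. The sketch below uses hand-made tensors `w`, `b`, and `x` instead of the model (these names are assumptions for illustration only) and suggests that leaf tensors, such as parameters, have `grad_fn` set to `None`.

```python
import torch

# Leaf tensors: created directly, not produced by an operation.
w = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([0.5], requires_grad=True)
x = torch.tensor([3.0])  # plain input, no gradient tracking

# y is the result of a multiply followed by an add, so it is not a leaf.
y = w * x + b

print(w.is_leaf, w.grad_fn)         # leaf tensor, grad_fn is None
print(y.is_leaf, y.grad_fn.name())  # non-leaf, last op was the add
```

This is why the walk through `next_functions` always ends at 'AccumulateGrad' nodes or `None`: there is no earlier operation left to point to.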


If we want to see the operations before the last one, we can use the .next_functions attribute.

In [95]:
print(f"y_pred.grad_fn.next_functions: {y_pred.grad_fn.next_functions}")

y_pred.grad_fn.next_functions: ((<MulBackward0 object at 0x7f5adac820e0>, 0), (<AccumulateGrad object at 0x7f5adac80fd0>, 0))


Here we see two items: the multiplication operation and an 'AccumulateGrad' node.

An 'AccumulateGrad' node marks a leaf tensor that requires gradients. During the backward pass it adds the incoming gradient to that tensor's `.grad` attribute. The one feeding the add operation belongs to the bias.

We can verify this by showing that the variable in the 'AccumulateGrad' object is the bias tensor.

In [96]:
print(f"y_pred.grad_fn.next_functions[1][0].variable:\n{y_pred.grad_fn.next_functions[1][0].variable}\n")  # Should show bias tensor
print(f"Bias tensor from model:\n{model.bias}")

y_pred.grad_fn.next_functions[1][0].variable:
Parameter containing:
tensor([-0.2934], requires_grad=True)

Bias tensor from model:
Parameter containing:
tensor([-0.2934], requires_grad=True)
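
The name 'AccumulateGrad' is quite literal: each backward pass adds to whatever is already stored in `.grad`, which is why the training loop calls `optimizer.zero_grad()` before `loss.backward()`. A minimal sketch, using a hand-made tensor `b` as a hypothetical stand-in for the bias:

```python
import torch

b = torch.tensor([0.0], requires_grad=True)  # stand-in for the bias
x = torch.tensor([2.0])

# First backward pass: d(b + x)/db = 1
(b + x).sum().backward()
first = b.grad.clone()

# Second backward pass without resetting: the new gradient is added on top.
(b + x).sum().backward()
second = b.grad.clone()

print(first, second)  # the second value is double the first

# Reset by hand here; in a training loop optimizer.zero_grad() does this job.
b.grad = None
```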


The weight can be traced the same way: the multiplication node leads to an 'AccumulateGrad' node whose variable is the weights tensor.

In [97]:
weight_fn = y_pred.grad_fn.next_functions[0][0]
print(f"weight_fn: {weight_fn}")  # Should point to MulBackward0
print(weight_fn.next_functions)
print()

print(f"weight_fn.next_functions[0][0].variable:\n{weight_fn.next_functions[0][0].variable}\n")  # Should show weights tensor
print(f"Weights tensor from model:\n{model.weights}")

weight_fn: <MulBackward0 object at 0x7f5adac81270>
((<AccumulateGrad object at 0x7f5adac822c0>, 0), (None, 0))

weight_fn.next_functions[0][0].variable:
Parameter containing:
tensor([1.5410], requires_grad=True)

Weights tensor from model:
Parameter containing:
tensor([1.5410], requires_grad=True)


Notice that the next_functions tuple of the multiplication node has a second entry, which is None.

That entry corresponds to the input data X. It is None because X was created with requires_grad=False, so autograd does not track gradients for it.

If we want to see a non-None value there, we can turn on requires_grad for X.

In [98]:
x_with_grad = torch.randn(1,1, requires_grad=True)
y_pred_with_grad = model(x_with_grad)

print(f"Multiplication op next_functions: {y_pred_with_grad.grad_fn.next_functions[0][0].next_functions}")
print(f"\nInput object from multiplication op:\n{y_pred_with_grad.grad_fn.next_functions[0][0].next_functions[1][0].variable}\n")  # Should show x tensor
print(f"Input tensor from model:\n{x_with_grad}")

Multiplication op next_functions: ((<AccumulateGrad object at 0x7f5adabbbd30>, 0), (<AccumulateGrad object at 0x7f5adac82080>, 0))

Input object from multiplication op:
tensor([[0.5684]], requires_grad=True)

Input tensor from model:
tensor([[0.5684]], requires_grad=True)


All of this is PyTorch's autograd mechanism. For more details, see https://docs.pytorch.org/docs/stable/notes/autograd.html.

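This mechanism is how `loss.backward()` lets the optimizer update the weights and biases even though the two never reference each other. `loss.backward()` walks the graph from the loss down to the 'AccumulateGrad' nodes and stores each gradient in the parameter's `.grad` attribute. The optimizer was initialized with those same parameter objects, so `optimizer.step()` simply reads `.grad` from each one.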
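
To make the hand-off concrete, here is a rough sketch of one training step. It assumes `nn.Linear(1, 1)` as a stand-in for the linear regression model, plain `torch.optim.SGD` with a learning rate of 0.1, and made-up data; none of these choices come from the original loop. It checks that the optimizer holds the very same parameter objects as the model, and that `optimizer.step()` matches a manual `p - lr * p.grad` update.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(1, 1)  # assumed stand-in for the linear regression model
lr = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
loss_fn = nn.L1Loss()

x = torch.randn(8, 1)  # made-up inputs
y_true = 3 * x + 1     # made-up targets

# The optimizer stores references to the model's Parameter objects, not copies.
opt_params = optimizer.param_groups[0]["params"]
same_objects = all(p is q for p, q in zip(opt_params, model.parameters()))

loss = loss_fn(model(x), y_true)
no_grads_yet = all(p.grad is None for p in model.parameters())

# backward() writes a gradient into every parameter's .grad attribute.
loss.backward()
grads = [p.grad.clone() for p in model.parameters()]

# Plain SGD reads .grad and applies p <- p - lr * grad.
before = [p.detach().clone() for p in model.parameters()]
optimizer.step()
manual = [w - lr * g for w, g in zip(before, grads)]
matches = all(torch.allclose(p.detach(), m)
              for p, m in zip(model.parameters(), manual))

print(same_objects, no_grads_yet, matches)
```

The loss and the optimizer never touch each other directly; the parameters' `.grad` attributes are the only shared state between the two calls.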

Here we can see the full graph of the loss. As a reminder, the L1 loss used here is the mean of |y_pred - y_true|, which is why the graph starts with mean, abs, and sub nodes.

In [99]:
model = LinearRegressionModel()
x = torch.randn(1, 1)
loss_fn = nn.L1Loss()
y_true = torch.randn(1, 1)

y_pred = model(x)
loss = loss_fn(y_pred, y_true)
print_grad_graph(loss)

<MeanBackward0 object at 0x7f5adabbb550>
    <AbsBackward0 object at 0x7f5adac82050>
        <SubBackward0 object at 0x7f5adac979a0>
            <AddBackward0 object at 0x7f5adac97460>
                <MulBackward0 object at 0x7f5adac974c0>
                    <AccumulateGrad object at 0x7f5adac97a30>
                <AccumulateGrad object at 0x7f5adac975b0>
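
Reading this graph from top to bottom is the chain rule in action. As a sanity check, the sketch below works out the L1 gradients by hand and compares them with what autograd stores. The tensors `w`, `b`, `x`, and `y_true` hold made-up values chosen for illustration, and the loss is written out as `abs().mean()` rather than `nn.L1Loss` so that each graph node is visible in the code.

```python
import torch

w = torch.tensor([1.5], requires_grad=True)   # made-up weight
b = torch.tensor([-0.3], requires_grad=True)  # made-up bias
x = torch.tensor([[0.5]])
y_true = torch.tensor([[2.0]])

y_pred = w * x + b                     # Mul, then Add
loss = (y_pred - y_true).abs().mean()  # Sub, Abs, Mean
loss.backward()

# Chain rule by hand: d|u|/du = sign(u), then du/dw = x and du/db = 1,
# averaged over the samples because of the mean.
sign = torch.sign(y_pred.detach() - y_true)
expected_w = (sign * x).mean().reshape(1)
expected_b = sign.mean().reshape(1)

print(w.grad, expected_w)  # both -0.5 for these values
print(b.grad, expected_b)  # both -1.0 for these values
```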
