# 0x05 Gradient descent and backpropagation

In this tutorial, we will cover how to interact with the backpropagation
engine of PyTorch.

PyTorch is the go-to library for deep learning in Python, especially if you are building a custom model on your own.
You will very likely be using PyTorch when doing your research.

> 💡 **NOTE**: 
>
> We assume you have already learnt the fundamentals of derivatives and gradients.
>
> If you need a quick recap, check out [this explanation](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#optional-reading-vector-calculus-using-autograd) by the PyTorch team.
>
> For a deeper walkthrough, see this [3blue1brown video](https://www.youtube.com/watch?v=tIeHLnjs5U8) on the topic.

## 1. Backpropagation

One of the biggest differences between PyTorch tensors and NumPy arrays is that tensors can track gradients.

The two main ways to interact with gradient tracking are the `requires_grad` flag and the `.grad` property.
Let us see how they work with a simple example.

Consider this formula:
$$
y = w_1x^2 + w_2x + b
$$

where $x$ is the input.

> 🤔 **THINKING**
>
> - What is $\frac{\partial y}{\partial w_1}$, $\frac{\partial y}{\partial w_2}$, $\frac{\partial y}{\partial b}$? Compute by hand.

To check your computation, let us implement this formula in PyTorch.

In [1]:
import torch

x = torch.tensor([2.0])

# Identify the parameters that you need to compute gradients for
# and set requires_grad=True
w1 = torch.tensor([1.0], requires_grad=True)
w2 = torch.tensor([3.0], requires_grad=True)
b = torch.tensor([4.0], requires_grad=True)
# You will see why we used an intermediate variable z here.
z = w1 * torch.pow(x, 2) + w2 * x
y = z + b
print(y)

tensor([14.], grad_fn=<AddBackward0>)


In [2]:
# Run backpropagation to compute the gradients
y.backward()

In [3]:
# Gradients w.r.t. w1, w2, and b are stored in the .grad property of each tensor
print(w1.grad)  # dy/dw1
print(w2.grad)  # dy/dw2
print(b.grad)   # dy/db

tensor([4.])
tensor([2.])
tensor([1.])


Is your computation correct?
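
For reference, here is a worked version of the hand computation, evaluated at $x = 2$ (the value used in the code above):

$$
\frac{\partial y}{\partial w_1} = x^2 = 4, \qquad
\frac{\partial y}{\partial w_2} = x = 2, \qquad
\frac{\partial y}{\partial b} = 1
$$

These match the three values printed from `.grad`.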

Now, you may want to also get $\frac{dy}{dz}$ from PyTorch.

In [4]:
print(z.grad)  # dy/dz

None



Oh no 😱, we cannot do it!

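This is because PyTorch does not retain gradients on **non-leaf** tensors by default. A leaf tensor is one you create directly (such as `w1`, `w2`, and `b`), whereas `z` is the result of an operation on other tensors. This default makes sense because model parameters, which are the tensors you actually update, are leaf tensors. If you really **DO** need the gradient of a non-leaf tensor for a specific use case, call [`.retain_grad()`](https://pytorch.org/docs/stable/generated/torch.Tensor.retain_grad.html) on it before running `backward()`.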
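
As a minimal sketch of how `.retain_grad()` could be used, the snippet below rebuilds the same expression with the same values as above and asks PyTorch to keep the gradient of `z`. Nothing here is new apart from the `retain_grad()` call and the `is_leaf` check.

```python
import torch

x = torch.tensor([2.0])
w1 = torch.tensor([1.0], requires_grad=True)
w2 = torch.tensor([3.0], requires_grad=True)
b = torch.tensor([4.0], requires_grad=True)

z = w1 * torch.pow(x, 2) + w2 * x
# Ask autograd to keep the gradient of this non-leaf tensor.
# This must be called before backward().
z.retain_grad()
y = z + b

y.backward()

print(w1.is_leaf, z.is_leaf)  # True False
print(z.grad)                 # dy/dz = 1, so this prints tensor([1.])
```

Since $y = z + b$, the derivative $\frac{dy}{dz}$ is 1, which is what `z.grad` now holds instead of `None`.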

> 📚 **EXERCISE**
>
> - Define an expression on your own and get the gradients using backpropagation.
> - Currently, our `y` is a scalar. Although loss is usually scalar in deep learning, what if we have a vector as `y`? How do we compute the gradients in this case?

In [5]:
# === Your code here ===

## 2. Toggling gradient tracking

When you are evaluating a model, you do not need to update the weights, and hence you do not need to track the gradients.

Disabling gradient tracking will reduce memory consumption and accelerate computations.

In PyTorch, we use `torch.no_grad()` to disable gradient tracking.
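
The effect is easy to see on a tiny example. The sketch below uses two illustrative tensors (`w` and `x` are placeholders, not part of the model used later) and compares a result computed normally with one computed inside `torch.no_grad()`.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
x = torch.tensor([2.0])

# Normal computation: the result is attached to the autograd graph.
y_tracked = w * x

# Inside no_grad(): no graph is recorded for the result.
with torch.no_grad():
    y_untracked = w * x

print(y_tracked.requires_grad, y_tracked.grad_fn is not None)  # True True
print(y_untracked.requires_grad, y_untracked.grad_fn)          # False None
```

Because `y_untracked` has no `grad_fn`, calling `backward()` on it would fail, which is fine for evaluation where no weight update is needed.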

Let us run an experiment to see how much memory and time we save by disabling gradient tracking.

In [9]:
# Load a pretrained ViT-B/16 model
# ViT, or Vision Transformer, is a neural network architecture designed for vision tasks.
from torchvision.models import vit_b_16
model = vit_b_16(weights="DEFAULT")
# Set the model to evaluation mode - you will learn more about this later
model.eval()
# Create random input data
x = torch.randn(8, 3, 224, 224) # 8 images, 3 channels, 224x224 pixels

In [10]:
import timeit
import tracemalloc
import gc
# Function to measure time with timeit
def time_execution(stmt, setup="pass", number=10):
    return timeit.timeit(stmt=stmt, setup=setup, number=number) / number

# This cell takes a while to run

# Profile with gradients
print("PROFILE RESULTS: WITH GRADIENT")
gc.collect()
tracemalloc.start()
# Time the execution using timeit (average of 10 runs)
avg_time = time_execution(
    stmt="model(x)",
    setup="from __main__ import torch, model, x",
    number=10
)
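# Memory usage
# Note: tracemalloc traces allocations made through Python's own memory allocator,
# so tensor storage allocated natively by PyTorch is likely not counted here.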
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**3:.5f}KB")
print(f"Peak memory usage: {peak / 10**3:.5f}KB")
print(f"Average time taken: {avg_time:.5f} seconds")
tracemalloc.stop()

PROFILE RESULTS: WITH GRADIENT
Current memory usage: 14.64000KB
Peak memory usage: 43.67900KB
Average time taken: 0.38417 seconds


In [11]:
# Remove the model
del model

# Load again
model = vit_b_16(weights="DEFAULT")
# Set the model to evaluation mode - you will learn more about this later
model.eval()

VisionTransformer(
  (conv_proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
  (encoder): Encoder(
    (dropout): Dropout(p=0.0, inplace=False)
    (layers): Sequential(
      (encoder_layer_0): EncoderBlock(
        (ln_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
        )
        (dropout): Dropout(p=0.0, inplace=False)
        (ln_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): MLPBlock(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU(approximate='none')
          (2): Dropout(p=0.0, inplace=False)
          (3): Linear(in_features=3072, out_features=768, bias=True)
          (4): Dropout(p=0.0, inplace=False)
        )
      )
      (encoder_layer_1): EncoderBlock(
        (ln_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (self_a

In [12]:
# Profile without gradients
print("\nPROFILE RESULTS: WITHOUT GRADIENT")
gc.collect()
tracemalloc.start()
# Time the execution using timeit (average of 10 runs)
avg_time = time_execution(
    stmt="with torch.no_grad(): model(x)",
    setup="from __main__ import torch, model, x",
    number=10
)
# Memory usage
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**3:.5f}KB")
print(f"Peak memory usage: {peak / 10**3:.5f}KB")
print(f"Average time taken: {avg_time:.5f} seconds")
tracemalloc.stop()


PROFILE RESULTS: WITHOUT GRADIENT
Current memory usage: 14.63700KB
Peak memory usage: 48.62600KB
Average time taken: 0.36842 seconds
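
The memory numbers in the two runs are nearly identical, most likely because `tracemalloc` does not see the tensor storage that PyTorch allocates natively. One way to look at what gradient tracking keeps alive is to count the tensors autograd saves for the backward pass. The sketch below is an illustration rather than a rigorous profiler: `net`, `inp`, and `saved_bytes_during_forward` are hypothetical names, the network is a small stand-in for the ViT above, and the total may count the same tensor more than once if several operations save it.

```python
import contextlib

import torch
from torch import nn

# Hypothetical stand-in network, much smaller than the ViT, so this runs quickly.
net = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
inp = torch.randn(32, 256)


def saved_bytes_during_forward(track_gradients):
    """Sum the sizes of the tensors autograd saves for backward in one forward pass."""
    total = {"bytes": 0}

    def pack(t):
        # Called each time an operation saves a tensor for the backward pass.
        total["bytes"] += t.numel() * t.element_size()
        return t

    def unpack(t):
        return t

    grad_ctx = contextlib.nullcontext() if track_gradients else torch.no_grad()
    with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
        with grad_ctx:
            net(inp)
    return total["bytes"]


tracked = saved_bytes_during_forward(True)
untracked = saved_bytes_during_forward(False)
print(f"Saved for backward with gradient tracking: {tracked} bytes")
print(f"Saved for backward under no_grad():        {untracked} bytes")
```

With gradient tracking on, the forward pass has to hold on to intermediate tensors until `backward()` runs. Under `torch.no_grad()` nothing is saved, which is where the memory saving comes from.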


## 3. Gradient descent and optimizing process