In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Autograd

## Why is autograd important?
As functions become more deeply nested, computing their derivatives by hand becomes increasingly complex and error-prone.
Consider nested functions such as y = f(x), z = g(y), u = h(z), where we need to find du/dx. \
For example:
```
Linear transformation: z = w.x + b
Activation function: y_pred = sigmoid(z)
Loss function: L = -(y_target * ln(y_pred) + (1 - y_target) * ln(1 - y_pred))
```
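For the example above, the chain rule gives one factor per nesting level. As a worked sketch (treating `w` and `x` as scalars, and writing the derivatives as partials with respect to `w`):

$$
\frac{\partial L}{\partial w}
= \frac{\partial L}{\partial y_{pred}} \cdot \frac{\partial y_{pred}}{\partial z} \cdot \frac{\partial z}{\partial w}
= \frac{y_{pred} - y_{target}}{y_{pred}\,(1 - y_{pred})} \cdot y_{pred}\,(1 - y_{pred}) \cdot x
= (y_{pred} - y_{target})\,x
$$

Even for this single neuron, three derivatives have to be worked out and multiplied; every extra layer adds more factors.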

Neural networks are deeply nested functions, and computing their derivatives manually quickly becomes extremely difficult. This is where PyTorch's Autograd module comes into the picture.

***Definition:*** Autograd is a core component of PyTorch that provides automatic differentiation for tensor operations. It enables the gradient computation that is essential for training machine learning and deep learning models with optimization algorithms such as gradient descent.

In [2]:
!pip3 install torch torchvision



In [3]:
import torch
torch.__version__

'2.4.0+cpu'

# Example 1

In [4]:
x = torch.tensor(5.0, requires_grad=True)
y = x ** 2

In [5]:
x

tensor(5., requires_grad=True)

In [6]:
y

tensor(25., grad_fn=<PowBackward0>)

## What is `grad_fn=<PowBackward0>`?
During the forward pass, PyTorch tracks every operation in which at least one of the involved tensors requires gradients (i.e. its `.requires_grad` attribute is set to True) and builds a computation graph from these operations. To be able to backpropagate through this graph and calculate the gradients for all involved parameters, PyTorch also attaches the corresponding "gradient function" (or "backward function") of each executed operation to its output tensor, stored as the `.grad_fn` attribute. Once the forward pass is done, you can call `.backward()` on the output (or loss) tensor, which backpropagates through the computation graph using the functions stored in `.grad_fn`.
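The graph bookkeeping described above can be inspected directly. The following is a minimal sketch that reuses the tensors from Example 1; it only reads attributes that PyTorch exposes on tensors (`is_leaf`, `grad_fn`, `grad_fn.next_functions`).

```python
import torch

x = torch.tensor(5.0, requires_grad=True)  # leaf tensor created by the user
y = x ** 2                                 # result of a tracked operation

# Leaf tensors are not produced by an operation, so they have no grad_fn
print(x.is_leaf, x.grad_fn)

# y is an intermediate result and stores the backward function of the pow op
print(y.is_leaf, y.grad_fn)

# next_functions links this backward node to the nodes it feeds during backprop
print(y.grad_fn.next_functions)
```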

In [7]:
y.backward()

In [8]:
x.grad # derivative of y w.r.t. x evaluated at x = 5: dy/dx = 2x = 10

tensor(10.)

# Example 2

In [9]:
x = torch.tensor(10.5, requires_grad=True)
y = x ** 4.5
z = torch.sin(y)

In [10]:
x

tensor(10.5000, requires_grad=True)

In [11]:
y

tensor(39386.9023, grad_fn=<PowBackward0>)

In [12]:
z

tensor(-0.6855, grad_fn=<SinBackward0>)

In [13]:
z.backward()

In [14]:
x.grad

tensor(-12290.4551)
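As a sanity check, the same gradient can be derived by hand with the chain rule, dz/dx = cos(y) * 4.5 * x^3.5. The sketch below recomputes Example 2 and compares the two values; the name `manual_grad` is introduced here purely for illustration.

```python
import torch

x = torch.tensor(10.5, requires_grad=True)
y = x ** 4.5
z = torch.sin(y)
z.backward()

# Chain rule by hand: dz/dx = dz/dy * dy/dx = cos(y) * 4.5 * x**3.5
# (no_grad keeps this check itself out of the computation graph)
with torch.no_grad():
    manual_grad = torch.cos(y) * 4.5 * x ** 3.5

print(x.grad, manual_grad)
```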

## Error!
Accessing `y.grad` does not work: it returns `None` and PyTorch emits a warning.
This is because the input tensors are the leaves, the output tensor is the root, and the layers form the intermediate nodes in the computation graph. By default, the derivatives are computed w.r.t. the leaves, and only those tensors have their `.grad` property populated. The `.grad` of an intermediate node is not kept by default.
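If the gradient of an intermediate tensor is needed, `retain_grad()` can be called on it before the backward pass. A minimal sketch reusing the tensors from Example 2:

```python
import torch

x = torch.tensor(10.5, requires_grad=True)
y = x ** 4.5
y.retain_grad()   # ask autograd to keep the gradient of this non-leaf tensor
z = torch.sin(y)
z.backward()

print(y.grad)     # dz/dy = cos(y)
print(x.grad)     # leaf gradient, populated as usual
```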

# Example 3: A single layer perceptron

In [15]:
x = torch.tensor(6.7) # input (CGPA)
y = torch.tensor(0.0) # output (Placement yes(1)/no(0))

w = torch.rand((), requires_grad=True) # weight
b = torch.rand((), requires_grad=True) # bias

In [16]:
# Binary cross entropy loss for scalar
def binary_cross_entropy_loss(prediction, target):
    EPSILON = 1e-7 # 1e-8 is too small for float32: 1 - 1e-8 rounds to exactly 1.0
    prediction = torch.clamp(prediction, EPSILON, 1 - EPSILON) # keep prediction strictly inside (0, 1) to prevent log(0)
    loss = - (target * torch.log(prediction) + (1 - target) * torch.log(1 - prediction))
    return loss

In [17]:
# Forward Pass
# Linear transformation:
z = w * x + b

# Activation function:
y_pred = torch.sigmoid(z)

# Loss function:
loss = binary_cross_entropy_loss(y_pred, y)

In [18]:
# Manual Backpropagation
# 1. dL/d(y_pred): Loss with respect to the prediction (y_pred)
dloss_dy_pred = (y_pred - y)/(y_pred*(1-y_pred))

# 2. dy_pred/dz: Prediction (y_pred) with respect to z (sigmoid derivative)
dy_pred_dz = y_pred * (1 - y_pred)

# 3. dz/dw and dz/db: z with respect to w and b
dz_dw = x  # dz/dw = x
dz_db = 1  # dz/db = 1 (bias contributes directly to z)

dL_dw = dloss_dy_pred * dy_pred_dz * dz_dw
dL_db = dloss_dy_pred * dy_pred_dz * dz_db

In [19]:
print(f"Manual Gradient of loss w.r.t weight (dw): {dL_dw}")
print(f"Manual Gradient of loss w.r.t bias (db): {dL_db}")

Manual Gradient of loss w.r.t weight (dw): 5.841495990753174
Manual Gradient of loss w.r.t bias (db): 0.8718650937080383


In [20]:
# Autograd
loss.backward()

print(f"Autograd gradient of loss w.r.t weight (dw): {w.grad}")
print(f"Autograd gradient of loss w.r.t bias (db): {b.grad}")

Autograd gradient of loss w.r.t weight (dw): 5.841495990753174
Autograd gradient of loss w.r.t bias (db): 0.8718650937080383
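The manual and autograd gradients agree because the first two chain-rule factors cancel, leaving dL/dw = (y_pred - y) * x and dL/db = y_pred - y. The sketch below checks this simplified form against autograd. It uses fixed values for `w` and `b` (an assumption made here for reproducibility; the notebook samples them with `torch.rand`) and writes the loss inline without clamping.

```python
import torch

x = torch.tensor(6.7)  # input (CGPA)
y = torch.tensor(0.0)  # target

# Fixed illustrative parameters (hypothetical values, not from the notebook)
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

# Forward pass: linear transformation, sigmoid, binary cross entropy
y_pred = torch.sigmoid(w * x + b)
loss = -(y * torch.log(y_pred) + (1 - y) * torch.log(1 - y_pred))
loss.backward()

# Simplified closed form after the sigmoid terms cancel
simplified_dw = ((y_pred - y) * x).detach()
simplified_db = (y_pred - y).detach()

print(w.grad, simplified_dw)
print(b.grad, simplified_db)
```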


# Clearing gradients
Gradients do not clear on their own after a backward pass. Each call to `.backward()` adds to whatever is already stored in `.grad`, so running backprop again accumulates gradients. For instance, if `y = x ** 2` and `x.grad == 4.0`, then recomputing `y`, calling `y.backward()` and reading `x.grad` again gives `x.grad == 8.0`, which is no longer the derivative of `y` at `x`.

In [21]:
x = torch.tensor(2.0, requires_grad=True)

In [22]:
# Re-run the cells from here through x.grad to see the gradients accumulate
y = x ** 2
y

tensor(4., grad_fn=<PowBackward0>)

In [23]:
y.backward()

In [24]:
x.grad

tensor(4.)

In [25]:
x.grad.zero_() # reset the stored gradient in place, so the next backward pass gives the correct gradient again

tensor(0.)
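The accumulation and the reset can be seen in one self-contained run. This is a minimal sketch; the list `grads` is introduced here only to record the values.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
grads = []

for step in range(3):
    y = x ** 2      # rebuild the graph for every backward pass
    y.backward()    # adds dy/dx = 4 to whatever is already in x.grad
    grads.append(x.grad.item())

print(grads)        # gradients are summed across the three passes

x.grad.zero_()      # clear the accumulated gradient in place
y = x ** 2
y.backward()
print(x.grad)       # back to the gradient of a single pass
```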

# Disabling gradient tracking
Autograd is only needed while training a neural network, because that is when we compute derivatives and run backward passes to update the weights and biases. Once the model is trained and we only want predictions, we need the forward pass but not the backward pass. In that scenario we can disable gradient tracking, using any of the following three options:
- requires_grad_(False)
- detach()
- torch.no_grad()

## requires_grad_(False)

In [26]:
x.requires_grad_(False) # in place disabling requires_grad

tensor(2.)

In [27]:
y = x ** 2
y

# here y.backward() would fail since there is no grad_fn

tensor(4.)
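To see the failure mentioned in the comment above without stopping the notebook, the call can be wrapped in a `try`/`except`. A minimal sketch; the flag `backward_failed` is a name introduced here for illustration.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
x.requires_grad_(False)   # turn tracking off in place
y = x ** 2

print(y.requires_grad, y.grad_fn)  # no tracking, so no backward function

backward_failed = False
try:
    y.backward()
except RuntimeError as err:
    backward_failed = True
    print(f"backward() failed: {err}")
```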

## detach()

In [28]:
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
y

tensor(4., grad_fn=<PowBackward0>)

In [29]:
z = x.detach() # same data as x, but detached from the computation graph (no gradient tracking)
y1 = z ** 2
y1

tensor(4.)
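Note that `detach()` returns a new tensor and leaves `x` itself untouched, so gradients can still flow through operations on `x`. The following sketch illustrates this, and also that the detached tensor shares its underlying memory with the original.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
z = x.detach()

print(z.requires_grad)               # the detached tensor is not tracked
print(z.data_ptr() == x.data_ptr())  # both tensors point at the same storage

# x is still tracked, so backprop through x works as before
y = x ** 2
y.backward()
print(x.grad)
```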

## torch.no_grad()

In [30]:
x = torch.tensor(2.0, requires_grad=True)
x

tensor(2., requires_grad=True)

In [31]:
with torch.no_grad():
    y = x ** 2
y
# y.backward() would fail here as well, since y has no grad_fn
# outside the `with torch.no_grad()` block, gradient tracking is enabled again

tensor(4.)
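Unlike `requires_grad_(False)`, the context manager does not change `x` itself; it only affects operations executed inside the block. A minimal sketch comparing the same computation inside and outside the block (the names `y_inference` and `y_tracked` are introduced here for illustration):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

with torch.no_grad():
    y_inference = x ** 2   # computed without building a graph

y_tracked = x ** 2         # computed outside the block, tracked as usual

print(y_inference.requires_grad)  # not tracked
print(y_tracked.requires_grad)    # tracked
print(x.requires_grad)            # x keeps its setting
```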