# 🔥**Autograd**


### **What is Autograd & why is it used?**

- **Gradient:** The derivative of a function with respect to its inputs

- Autograd is PyTorch's **automatic differentiation engine**. It **computes gradients automatically**, which makes backpropagation efficient and easy, and it is what optimization algorithms like gradient descent rely on when training machine learning models.

- Autograd in PyTorch solves the problem of **manually calculating derivatives (gradients)** in neural networks.

- Neural networks are essentially **complex, nested functions**, and training them requires finding the derivative of the loss function with respect to the model's parameters (weights and biases).

- Calculating these derivatives manually is **tedious, error-prone, and practically infeasible** for large networks, because the **chain rule** has to be applied through every nested operation.

- Autograd provides **automatic differentiation**, allowing for the automatic calculation of these essential gradients for optimization algorithms like Gradient Descent used in training.
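As a rough illustration of the points above, the sketch below compares Autograd's gradient for a small nested function against a numerical (finite-difference) estimate. The function `f`, the evaluation point, and the step size `h` are arbitrary choices made for this example, not part of the original notes.

```python
import torch

def f(x):
    # A small nested function standing in for a network: tanh(3 * sin(x^2))
    return torch.tanh(3.0 * torch.sin(x ** 2))

# float64 keeps the finite-difference estimate accurate
x = torch.tensor(0.5, dtype=torch.float64, requires_grad=True)

# Gradient from Autograd
f(x).backward()
autograd_grad = x.grad.item()

# Central finite difference: (f(x + h) - f(x - h)) / (2h)
h = 1e-6
with torch.no_grad():
    numeric_grad = ((f(x + h) - f(x - h)) / (2 * h)).item()

print("autograd:", autograd_grad)
print("numeric :", numeric_grad)
```

The two values should agree to several decimal places, without the derivative of `f` ever being written out by hand.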






### **How Autograd Works in PyTorch**

#### **1. Tracking Operations**  
- Autograd **records all operations** performed on tensors with `requires_grad=True`.  
- These operations form a **computational graph**, where:  
  - **Nodes** are tensors (inputs, intermediates, outputs).  
  - **Edges** represent the operations (functions) transforming them.  
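A minimal sketch of what "tracking" looks like in practice, assuming two throwaway tensors `a` and `c` invented for this example: only results derived from a tensor with `requires_grad=True` carry a `grad_fn`.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)  # tracked by Autograd
c = torch.tensor(5.0)                      # not tracked

tracked = a * 3.0      # operation is recorded
untracked = c * 3.0    # operation is not recorded

print(tracked.requires_grad, tracked.grad_fn)      # True, a MulBackward0 node
print(untracked.requires_grad, untracked.grad_fn)  # False, None

# Tensors created directly by the user are leaves; results of tracked ops are not
print(a.is_leaf, tracked.is_leaf)
```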

#### **2. Building a Dynamic Computation Graph**  
- PyTorch constructs a **Directed Acyclic Graph (DAG)** in real-time as operations are executed.  
- Each tensor stores:  
  - Its data (the tensor values).  
  - A reference to the backward function (`grad_fn`) of the operation that created it, used for gradient computation. Tensors created directly by the user have `grad_fn=None`.  
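The graph can be inspected by following `grad_fn` backward from an output. The sketch below walks the chain for `z = sin(x^2)` using the `next_functions` attribute of each backward node; the loop only follows the first parent, which is enough for this linear chain but is a simplification for general graphs.

```python
import torch

x = torch.tensor(4.0, requires_grad=True)
z = torch.sin(x ** 2)

# Walk from the output back toward the leaf, collecting node names
names = []
node = z.grad_fn
while node is not None:
    names.append(type(node).__name__)
    # next_functions holds (node, input_index) pairs for this node's inputs
    parents = [fn for fn, _ in node.next_functions if fn is not None]
    node = parents[0] if parents else None

print(" -> ".join(names))
```

The last node in the chain is an `AccumulateGrad` node, which is what writes the result into `x.grad` for the leaf tensor.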

#### **3. Computing Gradients with `.backward()`**  
- When `.backward()` is called, Autograd:  
  1. **Traverses the graph in reverse** (from output to input).  
  2. **Applies the chain rule** to compute gradients for all tracked tensors.  
  3. **Stores gradients** in the `.grad` attribute of leaf tensors (parameters).  

#### **4. Updating Model Parameters**  
- Gradients are used by optimizers (e.g., **SGD, Adam**) to update weights via:  
  ```python
  optimizer.step()  # Updates parameters using gradients
  optimizer.zero_grad()  # Clears old gradients
  ```
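To make the two calls above concrete, here is a minimal single-step sketch using `torch.optim.SGD`. The one-parameter model, the data point, and the learning rate are made up for illustration.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.01)

x, y_true = torch.tensor(3.0), torch.tensor(6.0)

optimizer.zero_grad()              # clear any old gradients
loss = (w * x - y_true) ** 2       # forward pass and squared-error loss
loss.backward()                    # dL/dw = 2 * (w*x - y) * x = -18
grad_before_step = w.grad.clone()
optimizer.step()                   # w <- w - lr * grad = 1.0 + 0.18

print("grad:", grad_before_step)
print("updated w:", w)
```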

### **Key Features**  
- **Dynamic Graph**: The graph is rebuilt on-the-fly in each forward pass (*define-by-run*).  
- **Efficiency**: Only computes gradients for tensors with `requires_grad=True`.  
- **Memory Management**: The buffers saved for the backward pass are freed once `.backward()` finishes, unless you pass `retain_graph=True` (e.g., to run backward again on the same graph).
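The sketch below illustrates two of these features under simple assumptions: a hypothetical function `f` whose recorded operations depend on ordinary Python control flow (define-by-run), and a graph that can only be reused for a second backward pass when `retain_graph=True` was passed to the first.

```python
import torch

def f(x):
    # Ordinary Python control flow decides which operations get recorded
    if x.item() > 0:
        return x ** 2
    return -x

grads = []
for value in (3.0, -3.0):
    x = torch.tensor(value, requires_grad=True)
    out = f(x)
    out.backward()
    grads.append(x.grad.item())
    print(value, out.grad_fn, x.grad)

# Reusing a graph: the first call keeps the saved buffers, the second frees them
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2
y.backward(retain_graph=True)
y.backward()                 # allowed; gradients accumulate (6 + 6)
print(x.grad)

try:
    y.backward()             # buffers were freed by the previous call
    third_call_failed = False
except RuntimeError:
    third_call_failed = True
print("third backward failed:", third_call_failed)
```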

### **Why does PyTorch build computation graphs?**

PyTorch creates computation graphs (specifically, directed acyclic graphs or DAGs) to track the operations performed on tensors. This tracking allows the Autograd module to automatically calculate derivatives (gradients) by tracing the graph backward from the output (roots) to the input (leaves) using the chain rule. This automatic differentiation is essential for efficiently training neural networks as manually calculating gradients for complex, nested network functions is difficult.

## ✅**Example: 1**



**Write a program to calculate the gradient `dy/dx`.**

- $y = x^2$

    - $\frac{dy}{dx} = 2x$


In [1]:
def dy_dx(x):
  return 2*x

dy_dx(2)

4

In [2]:
# Using Autograd

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x**2

print("x -> ", x)
print("y -> ", y)

y.backward()
print("x.grad -> ", x.grad)

x ->  tensor(3., requires_grad=True)
y ->  tensor(9., grad_fn=<PowBackward0>)
x.grad ->  tensor(6.)


## ✅**Example: 2**



**Write a program to calculate the gradient `dz/dx`.**

- $y = x^2$

    - $\frac{dy}{dx} = 2x$

- $z = \sin(y)$
    - $\frac{dz}{dy} = \cos(y)$

- $\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} = 2x \cos(y) = 2x \cos(x^2)$

**Manual Gradient**

In [3]:
import math

def dz_dx(x):
    return 2 * x * math.cos(x**2)

dz_dx(4)

-7.661275842587077

**Using Autograd**

In [4]:
# Autograd demonstration

import torch

# Input tensor with gradient tracking
x = torch.tensor(4.0, requires_grad=True)
# Operation 1
y = x ** 2
# Operation 2
z = torch.sin(y)

print("x:", x)
print("y:", y)
print("z:", z)

# Compute gradients
z.backward()

# Gradient of z w.r.t. x
print("x.grad:", x.grad)

x: tensor(4., requires_grad=True)
y: tensor(16., grad_fn=<PowBackward0>)
z: tensor(-0.2879, grad_fn=<SinBackward0>)
x.grad: tensor(-7.6613)


> `PowBackward0` and `SinBackward0` are the backward nodes (`grad_fn`) that PyTorch attaches to the results of the power and sine operations. They compute those operations' gradients during the backward pass.

In [5]:
y.grad

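> `y.grad` evaluates to `None` here, and PyTorch emits a `UserWarning`, because `y` is not a leaf tensor and its `.grad` attribute is not populated during the backward pass.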


**Nodes in PyTorch's computation graph:**

- 🟡**`Leaf nodes`**: These are the **input tensors** to the computation. By default, gradients are stored in `.grad` for leaf tensors created with `requires_grad=True` during the backward pass.
- 🟡**`Root nodes`**: This is the **output tensor** (or tensors) from which the backward calculation starts.
- 🟡**`Intermediate nodes`**: These represent the **intermediate tensors** in the computation graph, between the input and output.
    - *Gradients flow through these nodes during the backward pass, but by default they are **not stored** in `.grad`.*


>  PyTorch's Autograd builds a directed acyclic graph (DAG) where **leaf nodes are the input tensors** and **root nodes are the output tensors**. In this example, `x` is the input (leaf), `z` is the final output (root), and `y` is an intermediate step.

By default, Autograd **does not store gradients for non-leaf tensors**. The backward pass starts at the root and propagates gradients through the intermediate nodes, but only leaf tensors with `requires_grad=True` keep the result in `.grad`. Therefore `x.grad` is populated after `z.backward()`, while `y.grad` (an intermediate node) and `z.grad` (the root, which is also a non-leaf tensor) are both `None`.
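If the gradient of an intermediate tensor is needed, one option is to call `retain_grad()` on it before the backward pass. The sketch below reuses the `x`, `y`, `z` setup from this example; the comparison against `math.cos` is only there as a sanity check.

```python
import math
import torch

x = torch.tensor(4.0, requires_grad=True)
y = x ** 2
y.retain_grad()          # ask Autograd to keep the gradient for this non-leaf tensor
z = torch.sin(y)

z.backward()

print("y.grad:", y.grad)             # dz/dy = cos(y) = cos(16)
print("cos(16):", math.cos(16.0))
print("x.grad:", x.grad)             # dz/dx = 2x * cos(x^2)
```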

## ✅**Example: 3**

### 🟡 **Training Process for a Neural Network**

1. **Forward Pass** – Compute the output of the network for a given input.  
2. **Loss Calculation** – Measure the error using a loss function.  
3. **Backward Pass** – Compute the gradients of the loss with respect to the model parameters.  
4. **Parameter Update** – Adjust the parameters using an optimizer (e.g., gradient descent).

---

### ⭕ **Gradient Computation for Weights and Bias**

---

#### 🔹 **Forward Pass Computations**

1. **Linear Transformation**  
   $$
   z = w \cdot x + b
   $$  
   Gradients:  
   $$
   \frac{dz}{dw} = x, \quad \frac{dz}{db} = 1
   $$

2. **Activation (Sigmoid Function)**  
   $$
   \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}
   $$  
   Gradient:  
   $$
   \frac{d\hat{y}}{dz} = \hat{y}(1 - \hat{y})
   $$

3. **Loss Function (Binary Cross-Entropy)**  
   $$
   L = -\left[y \cdot \ln(\hat{y}) + (1 - y) \cdot \ln(1 - \hat{y})\right]
   $$  
   Gradient:  
   $$
   \frac{dL}{d\hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}
   $$

---

#### 🔹 **Using the Chain Rule**

To compute the gradient of the loss with respect to the weights and bias:

- **Gradient with respect to weight $w$:**
  $$
  \frac{dL}{dw} = \frac{dL}{d\hat{y}} \cdot \frac{d\hat{y}}{dz} \cdot \frac{dz}{dw}
  $$

- **Gradient with respect to bias $b$:**
  $$
  \frac{dL}{db} = \frac{dL}{d\hat{y}} \cdot \frac{d\hat{y}}{dz} \cdot \frac{dz}{db}
  $$

---

### ⭐ **Final Simplified Gradient Expressions**

Since:  
$$
\frac{dL}{d\hat{y}} \cdot \frac{d\hat{y}}{dz} = \hat{y} - y
$$

We get:

- **Weight Gradient:**
  $$
  \frac{dL}{dw} = (\hat{y} - y) \cdot x
  $$

- **Bias Gradient:**
  $$
  \frac{dL}{db} = \hat{y} - y
  $$
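As a quick sanity check of the simplified expressions, the sketch below compares them against Autograd. It swaps in `torch.nn.functional.binary_cross_entropy_with_logits`, which combines the sigmoid and the binary cross-entropy loss in one call; this is a substitution made for the example, not the hand-written loss used elsewhere in these notes. The values of `x`, `y`, `w`, and `b` are illustrative.

```python
import torch
import torch.nn.functional as F

x = torch.tensor(6.7)   # input feature
y = torch.tensor(0.0)   # true label
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# Forward pass: z = w*x + b, then sigmoid + BCE in one call
z = w * x + b
loss = F.binary_cross_entropy_with_logits(z, y)
loss.backward()

# Simplified closed-form gradients: (y_hat - y) * x and (y_hat - y)
with torch.no_grad():
    y_hat = torch.sigmoid(z)
    expected_dw = (y_hat - y) * x
    expected_db = y_hat - y

print("dL/dw:", w.grad, "expected:", expected_dw)
print("dL/db:", b.grad, "expected:", expected_db)
```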

---


### 🔗**Computational graph**

$$
\begin{array}{ccccccccccccccc}
\textbf{w} & \searrow & & & & & & & \\
 & & \boxed{\ast} & \rightarrow & \boxed{+} & \rightarrow & z & \rightarrow & \boxed{\sigma} & \rightarrow & \hat{y} & \rightarrow & \boxed{LF} & \rightarrow & \text{Loss} \\
\textbf{x} & \nearrow & & & \uparrow & & & & & & & & \uparrow \\
 & & & & \textbf{b} & & & & & & & & \textbf{y} \\
\end{array}
$$


### 🟩**Manual Gradient**

In [6]:
import torch

# Inputs
x = torch.tensor(6.7)  # Input feature
y = torch.tensor(0.0)  # True label (binary)

w = torch.tensor(1.0)  # Weight
b = torch.tensor(0.0)  # Bias

# Binary Cross-Entropy Loss for scalar
def binary_cross_entropy_loss(prediction, target):
    epsilon = 1e-7  # To prevent log(0); in float32, 1 - 1e-8 rounds to exactly 1.0
    prediction = torch.clamp(prediction, epsilon, 1 - epsilon)
    return -(target * torch.log(prediction) + (1 - target) * torch.log(1 - prediction))


# Forward pass
z = w * x + b  # Weighted sum (linear part)
y_pred = torch.sigmoid(z)  # Predicted probability

# Compute binary cross-entropy loss
loss = binary_cross_entropy_loss(y_pred, y)
print(loss)


# Derivatives:
# 1. dL/d(y_pred): Loss with respect to the prediction (y_pred)
dloss_dy_pred = (y_pred - y)/(y_pred*(1-y_pred))

# 2. dy_pred/dz: Prediction (y_pred) with respect to z (sigmoid derivative)
dy_pred_dz = y_pred * (1 - y_pred)

# 3. dz/dw and dz/db: z with respect to w and b
dz_dw = x  # dz/dw = x
dz_db = 1  # dz/db = 1 (bias contributes directly to z)

dL_dw = dloss_dy_pred * dy_pred_dz * dz_dw
dL_db = dloss_dy_pred * dy_pred_dz * dz_db

print(f"Manual Gradient of loss w.r.t weight (dw): {dL_dw}")
print(f"Manual Gradient of loss w.r.t bias (db): {dL_db}")

tensor(6.7012)
Manual Gradient of loss w.r.t weight (dw): 6.691762447357178
Manual Gradient of loss w.r.t bias (db): 0.998770534992218


### 🟨**Using AutoGrad**

In [7]:
# Autograd for binary cross-entropy loss
import torch

# Input and target
x = torch.tensor(6.7)
y = torch.tensor(0.0)

# Weights and bias with gradient tracking
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

print("Initial weight:", w)
print("Initial bias:", b)

# Linear layer output
z = w * x + b
print("Linear output (z):", z)

# Predicted probability
y_pred = torch.sigmoid(z)
print("Predicted probability:", y_pred)

# Binary cross-entropy loss
def binary_cross_entropy_loss(prediction, target):
    epsilon = 1e-7  # To prevent log(0); in float32, 1 - 1e-8 rounds to exactly 1.0
    prediction = torch.clamp(prediction, epsilon, 1 - epsilon)
    return -(target * torch.log(prediction) + (1 - target) * torch.log(1 - prediction))

loss = binary_cross_entropy_loss(y_pred, y)
print("Loss:", loss)

# Compute gradients
loss.backward()

# Gradients of the loss w.r.t. weights
print("Gradient of w:", w.grad)
# Gradient of the loss w.r.t. bias
print("Gradient of b:", b.grad)

Initial weight: tensor(1., requires_grad=True)
Initial bias: tensor(0., requires_grad=True)
Linear output (z): tensor(6.7000, grad_fn=<AddBackward0>)
Predicted probability: tensor(0.9988, grad_fn=<SigmoidBackward0>)
Loss: tensor(6.7012, grad_fn=<NegBackward0>)
Gradient of w: tensor(6.6918)
Gradient of b: tensor(0.9988)


### 🟧**Now using vector inputs instead of scalars**




In [8]:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
print(x)

y = (x**2).mean()
print(y)

y.backward()
print(x.grad)  # one gradient per input element: dy/dx_i = 2 * x_i / 3

tensor([1., 2., 3.], requires_grad=True)
tensor(4.6667, grad_fn=<MeanBackward0>)
tensor([0.6667, 1.3333, 2.0000])
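The printed gradient can be checked against the closed form. For $y = \frac{1}{n}\sum_i x_i^2$ the partial derivative is $\frac{\partial y}{\partial x_i} = \frac{2 x_i}{n}$, and a minimal sketch of that comparison (with `analytic` as a name introduced here) looks like this:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).mean()
y.backward()

# Closed form: dy/dx_i = 2 * x_i / n
analytic = 2 * x.detach() / x.numel()

print("autograd:", x.grad)
print("analytic:", analytic)
```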


### 🟦**Clearing gradient**





> Clearing gradients means **setting the accumulated gradient values of tensors back to zero**.

> We need to do this because PyTorch **accumulates gradients** by adding them up every time `backward()` is called. If we don't clear them before each training step (forward and backward pass), the gradients from previous steps will be included, leading to **incorrect gradient values** for the current step's parameter updates. This ensures that the optimization (like Gradient Descent) uses the correct gradients for the current batch or iteration.

In [9]:
x = torch.tensor(2.0, requires_grad=True)
print(x)

tensor(2., requires_grad=True)


In [10]:
# cell-1
# forward
y = x ** 2
print(y)

tensor(4., grad_fn=<PowBackward0>)


In [11]:
# cell-2
# backward
y.backward()

In [12]:
# cell-3
x.grad

tensor(4.)

>✅ Because PyTorch adds to `x.grad` every time `y.backward()` is called, re-running `cell-1`, `cell-2`, and `cell-3` in sequence makes `x.grad` grow on each pass: `tensor(4.)` → `tensor(8.)` → `tensor(12.)` → `tensor(16.)`.

In [13]:
# cell-4
x.grad.zero_()

tensor(0.)

> `x.grad.zero_()` is used to reset the accumulated gradients in PyTorch to zero before each new backward pass. Since PyTorch accumulates gradients by default, not clearing them leads to incorrect updates in optimization. This ensures that only the current batch’s gradients are used for updating the model’s parameters.
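The accumulation behaviour can also be reproduced in a single script rather than by re-running cells. This is a small sketch; the `history` list is just a helper introduced here to record the values.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

history = []
for _ in range(3):
    y = x ** 2            # forward
    y.backward()          # backward: adds dy/dx = 4 to x.grad
    history.append(x.grad.item())

print(history)            # [4.0, 8.0, 12.0]

# Reset, then run one more forward/backward pass
x.grad.zero_()
y = x ** 2
y.backward()
fresh_grad = x.grad.item()
print(fresh_grad)         # 4.0
```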

## ✅**Example 4: Disabling Gradient Tracking**

**Disabling gradient tracking** means **stopping PyTorch's Autograd from recording operations and building the computation graph** for specific tensors or sections of code.

> ✅ We need this **primarily during inference or prediction** after a neural network has been trained.

> During prediction, we only need to perform a forward pass, and we **don't need to calculate gradients** to update model parameters.

> Disabling tracking saves **memory and computational resources** that would otherwise be used for gradient calculation.

In [14]:
x = torch.tensor(2.0, requires_grad=True)
x

tensor(2., requires_grad=True)

In [15]:
y = x ** 2
y

tensor(4., grad_fn=<PowBackward0>)

In [16]:
y.backward()

In [17]:
x.grad

tensor(4.)

#### **Option 1 - `requires_grad_(False)`**



In [18]:
x.requires_grad_(False)

tensor(2.)

In [19]:
x

tensor(2.)

In [20]:
y = x ** 2

In [21]:
y

tensor(4.)

In [22]:
y.backward()

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

#### **Option 2 - `detach()`**

In [23]:
x = torch.tensor(2.0, requires_grad=True)
x

tensor(2., requires_grad=True)

In [24]:
z = x.detach()
z

tensor(2.)

In [25]:
y = x ** 2
y

tensor(4., grad_fn=<PowBackward0>)

In [26]:
y1 = z ** 2
y1

tensor(4.)

In [27]:
y.backward()

In [28]:
y1.backward()

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

#### **Option 3 - `torch.no_grad()`**



In [29]:
x = torch.tensor(2.0, requires_grad=True)
x

tensor(2., requires_grad=True)

In [30]:
with torch.no_grad():
    y = x ** 2

In [31]:
y

tensor(4.)

In [32]:
y.backward()

# x.grad

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
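For comparison, the three options can be placed side by side in one script. This is only a sketch; the names `frozen`, `y1`, `y2`, `y3`, and `y_tracked` are introduced here for illustration.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Option 1: switch tracking off on the tensor itself (in place)
frozen = torch.tensor(2.0, requires_grad=True)
frozen.requires_grad_(False)
y1 = frozen ** 2

# Option 2: detach() returns a new tensor that is cut off from the graph
y2 = x.detach() ** 2

# Option 3: no_grad() disables tracking for everything inside the block
with torch.no_grad():
    y3 = x ** 2

# Reference: the same operation with tracking left on
y_tracked = x ** 2

for name, t in [("requires_grad_(False)", y1), ("detach()", y2),
                ("no_grad()", y3), ("tracked", y_tracked)]:
    print(f"{name:>22}: requires_grad={t.requires_grad}, grad_fn={t.grad_fn}")
```

Note that options 2 and 3 leave `x` itself unchanged, so it can still be used for training afterwards, whereas option 1 modifies the tensor it is called on.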

## ✅**Example 5: Linear Regression**

In [33]:
import torch  # Import PyTorch for tensor operations and automatic differentiation

# Linear regression with autograd
# Goal: Learn parameters w and b for the model y = w * x + b

# Input data
x = torch.tensor([1., 2., 3., 4.])  # Input features (independent variable)
y_true = torch.tensor([2., 4., 6., 8.])  # Target values (dependent variable, where y = 2 * x ideally)

# Parameters (weights and bias)
w = torch.tensor(0.0, requires_grad=True)  # Initialize weight w with gradient tracking enabled
b = torch.tensor(0.0, requires_grad=True)  # Initialize bias b with gradient tracking enabled

# Forward pass
def forward(x):
    """
    Compute model predictions: y_pred = w * x + b
    Args:
        x (torch.Tensor): Input tensor
    Returns:
        torch.Tensor: Predicted output
    """
    return w * x + b

# Loss function (Mean Squared Error)
def loss(y_pred, y_true):
    """
    Compute Mean Squared Error (MSE) loss between predictions and true values
    Args:
        y_pred (torch.Tensor): Predicted values
        y_true (torch.Tensor): True target values
    Returns:
        torch.Tensor: MSE loss
    """
    return ((y_pred - y_true) ** 2).mean()

# Training loop
learning_rate = 0.01  # Step size for parameter updates
epochs = 100  # Number of training iterations

for epoch in range(epochs):
    # Forward pass: Compute predictions
    y_pred = forward(x)  # Predict y using current w and b

    # Compute loss: Measure error between predictions and true values
    l = loss(y_pred, y_true)  # Calculate MSE loss

    # Backward pass: Compute gradients of loss with respect to w and b
    l.backward()  # Autograd computes gradients and stores them in w.grad and b.grad

    # Update parameters and zero gradients
    with torch.no_grad():  # Keep the in-place updates out of the graph (PyTorch raises an error if a leaf that requires grad is modified in place while tracking is on)
        w -= learning_rate * w.grad  # Update weight using gradient descent: w = w - learning_rate * gradient
        b -= learning_rate * b.grad  # Update bias using gradient descent: b = b - learning_rate * gradient

        # Zero gradients to prevent accumulation for the next iteration
        w.grad.zero_()  # Clear the gradient of w
        b.grad.zero_()  # Clear the gradient of b

    # Print progress every 10 epochs
    if epoch % 10 == 0:
        print(f'Epoch {epoch}: w = {w.item():.3f}, loss = {l.item():.3f}')

# Print final learned model
print(f'Final prediction: y = {w.item():.3f} * x + {b.item():.3f}')

Epoch 0: w = 0.300, loss = 30.000
Epoch 10: w = 1.559, loss = 0.833
Epoch 20: w = 1.767, loss = 0.075
Epoch 30: w = 1.805, loss = 0.052
Epoch 40: w = 1.816, loss = 0.049
Epoch 50: w = 1.822, loss = 0.046
Epoch 60: w = 1.827, loss = 0.043
Epoch 70: w = 1.832, loss = 0.041
Epoch 80: w = 1.837, loss = 0.038
Epoch 90: w = 1.842, loss = 0.036
Final prediction: y = 1.846 * x + 0.452
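The same training loop can be written with PyTorch's built-in building blocks. The sketch below assumes `torch.nn.Linear`, `torch.nn.MSELoss`, and `torch.optim.SGD` as stand-ins for the hand-written `forward`, `loss`, and update step, and it zero-initializes the parameters to mirror the setup above (by default `nn.Linear` initializes them randomly).

```python
import torch

# Same data, reshaped to (n_samples, n_features) as nn.Linear expects
x = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y_true = torch.tensor([[2.0], [4.0], [6.0], [8.0]])

model = torch.nn.Linear(1, 1)          # y = w * x + b
with torch.no_grad():                  # start from w = 0, b = 0
    model.weight.zero_()
    model.bias.zero_()

loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

losses = []
for epoch in range(100):
    optimizer.zero_grad()                   # clear old gradients
    loss_value = loss_fn(model(x), y_true)  # forward pass and loss
    loss_value.backward()                   # compute gradients
    optimizer.step()                        # gradient descent update
    losses.append(loss_value.item())

w_learned = model.weight.item()
b_learned = model.bias.item()
print(f"y = {w_learned:.3f} * x + {b_learned:.3f}, final loss = {losses[-1]:.3f}")
```

Here `optimizer.zero_grad()` and `optimizer.step()` replace the manual `w.grad.zero_()` calls and the `torch.no_grad()` update block.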
