📝 **Author:** Amirhossein Heydari - 📧 **Email:** amirhosseinheydari78@gmail.com - 📍 **Linktree:** [linktr.ee/mr_pylin](https://linktr.ee/mr_pylin)

---

**Table of contents**<a id='toc0_'></a>    
- [Dependencies](#toc1_)    
- [A Simple Neuron Structure (Perceptron)](#toc2_)    
  - [How to estimate **y**?](#toc2_1_)    
- [Gradient](#toc3_)    
  - [autograd](#toc3_1_)    
    - [AutoGrad in Details : Example 1](#toc3_1_1_)    
    - [AutoGrad in Details : Example 2](#toc3_1_2_)    
    - [PyTorch Automatic Derivatives](#toc3_1_3_)    
      - [Example 1: $f(x) = 2x + 3 \rightarrow \nabla f(x) = \frac{\partial f}{\partial x} = 2$](#toc3_1_3_1_)    
      - [Example 2: $f(x) = 3x^2 - 2x + 5 \quad\rightarrow\quad \nabla f(x) = \frac{\partial f}{\partial x} = 6x - 2$](#toc3_1_3_2_)    
      - [Example 3: $f(w_1, w_2) = w_1x_1 + w_2x_2 \quad\rightarrow\quad \nabla f(W) = \left( \frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2} \right) = (x_1, x_2)$](#toc3_1_3_3_)    
    - [In-place Operations with `requires_grad=True` on Leaf Nodes](#toc3_1_4_)    
  - [Gradient Descent](#toc3_2_)    
    - [Example 4: $f(w_1, w_2, b) = w_1x_1 + w_2x_2 + b \quad\rightarrow\quad \nabla f(W) = \left( \frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2}, \frac{\partial f}{\partial b} \right) = (x_1, x_2, 1)$](#toc3_2_1_)    
      - [Chain Rule](#toc3_2_1_1_)    
      - [Updating Weights](#toc3_2_1_2_)    
    - [Gradient Descent Optimization Example](#toc3_2_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Dependencies](#toc0_)

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import torch
from torch.autograd import Function

# <a id='toc2_'></a>[A Simple Neuron Structure (Perceptron)](#toc0_)
   - In many contexts, the terms **Neuron** and **Perceptron** are used interchangeably

<div style="display:flex; margin-top:50px;">
   <div style="width:20%; margin-right:auto; margin-left:auto;">
      <table style="margin:0 auto; width:80%; text-align:center">
         <caption style="font-weight:bold;">Dataset</caption>
         <thead>
            <tr>
               <th style="width:25%; text-align:center"><span style="color:magenta;">#</span></th>
               <th style="width:25%; text-align:center"><span style="color:#9090ff;">x<sub>1</sub></span></th>
               <th style="width:25%; text-align:center"><span style="color:#9090ff;">x<sub>2</sub></span></th>
               <th style="width:25%; text-align:center"><span style="color:red;">y</span></th>
            </tr>
         </thead>
         <tbody>
            <tr><th>1</th><td>1</td><td>1</td><td>2</td></tr>
            <tr><th>2</th><td>2</td><td>3</td><td>5</td></tr>
            <tr><th>3</th><td>1</td><td>2</td><td>3</td></tr>
            <tr><th>4</th><td>3</td><td>1</td><td>4</td></tr>
            <tr><th>5</th><td>2</td><td>4</td><td>6</td></tr>
            <tr><th>6</th><td>3</td><td>2</td><td>5</td></tr>
            <tr><th>7</th><td>4</td><td>1</td><td>5</td></tr>
         </tbody>
      </table>
   </div>
   <div style="width:80%; padding:10px;">
      <figure style="text-align:center; margin:0;">
         <img src="../assets/images/original/perceptron/perceptron-1.svg" alt="perceptron-1.svg" style="max-width:80%; height:auto;">
         <figcaption style="font-size:smaller; text-align:center;">A simple Neuron (Perceptron)</figcaption>
      </figure>
   </div>
</div>

## <a id='toc2_1_'></a>[How to estimate **y**?](#toc0_)
   1. **System of Equations**
      $$
      \left\{
      \begin{aligned}
      1w_1 + 1w_2 &= 2 \\
      2w_1 + 3w_2 &= 5 \\
      1w_1 + 2w_2 &= 3 \\
      3w_1 + 1w_2 &= 4 \\
      2w_1 + 4w_2 &= 6 \\
      3w_1 + 2w_2 &= 5 \\
      4w_1 + 1w_2 &= 5 \\
      \end{aligned}
      \right.
      $$

      - **Disadvantages**
        - `Complexity`: Neural networks are highly complex systems with millions of parameters or more ([GPT-4 is rumored to have about 1.76 trillion parameters](https://en.wikipedia.org/wiki/GPT-4#:~:text=Rumors%20claim%20that%20GPT%2D4,running%20and%20by%20George%20Hotz.)).
        - `Non-linearity`: Neural networks use activation functions like Sigmoid, which introduce non-linearity into the network.
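      - **Critical issue: Overdetermined system**
        - The number of equations is greater than the number of unknowns.
        - With real (noisy) data, such a system is usually inconsistent: it has no exact solution and can only be approximated, for example in the least-squares sense.
        - This toy dataset is a special case: every row satisfies $y = x_1 + x_2$, so $w_1 = w_2 = 1$ solves all seven equations exactly.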

   1. **Delta Rule**
      - The delta rule is also known as the Widrow-Hoff rule or the LMS (least mean squares) rule.
      - The delta rule is commonly associated with the AdaLiNe (Adaptive Linear Neuron) model.
      - It is a simple supervised learning rule used for training single-layer neural networks (perceptrons).

   1. **Backpropagation**
      - Backpropagation generalizes the delta rule to multi-layer neural networks.
      - It allows the network to learn from its mistakes by updating the weights iteratively using **Gradient Descent** (aka Steepest Descent).
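
As a rough illustration of the first two options above, the sketch below fits the toy dataset in two ways: a least-squares solve of the overdetermined system with `np.linalg.lstsq`, and a per-sample delta-rule (LMS) update. The learning rate, the number of epochs, and the variable names are illustrative assumptions, not values taken from the text.

```python
import numpy as np

# toy dataset from the table above: columns are x1 and x2
X = np.array([[1, 1], [2, 3], [1, 2], [3, 1], [2, 4], [3, 2], [4, 1]], dtype=float)
y = np.array([2, 5, 3, 4, 6, 5, 5], dtype=float)

# 1) least-squares solution of the overdetermined system X @ w = y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2) delta rule (LMS): nudge w along the error of one sample at a time
w_delta = np.zeros(2)
learning_rate = 0.02  # assumed value, small enough for these inputs
for epoch in range(300):  # assumed number of passes over the data
    for x_i, y_i in zip(X, y):
        error = y_i - w_delta @ x_i
        w_delta += learning_rate * error * x_i

print("least squares:", w_lstsq)
print("delta rule   :", w_delta)
```

Because every row of this dataset satisfies $y = x_1 + x_2$, both approaches should end up close to $w_1 = w_2 = 1$.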

# <a id='toc3_'></a>[Gradient](#toc0_)
   - **Definition**:
      - The gradient represents the rate of change of the output of a function with respect to its inputs. 
      - For functions with multiple variables, it generalizes the concept of a derivative, forming a vector of partial derivatives.
   - **Intuition**:
      - In one-dimensional functions, the gradient (or derivative) corresponds to the slope of the function.
      - In multi-dimensional functions, the gradient points in the direction of the steepest ascent of the function, with its magnitude indicating the rate of change.
   - **Applications**:
      - Crucial for optimization techniques like **Gradient Descent**, where gradients guide the updates to minimize loss functions in machine learning.
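
To make the "vector of partial derivatives" idea concrete, here is a minimal sketch that approximates a gradient with central finite differences. The function `f`, the helper `numerical_gradient`, and the step size `eps` are hypothetical names and values chosen only for illustration.

```python
import numpy as np


def f(v):
    # hypothetical example function: f(x, y) = x^2 + 3y
    x, y = v
    return x**2 + 3 * y


def numerical_gradient(func, v, eps=1e-6):
    # central difference along each coordinate axis
    v = np.asarray(v, dtype=float)
    grad = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = eps
        grad[i] = (func(v + step) - func(v - step)) / (2 * eps)
    return grad


point = np.array([2.0, 1.0])
grad = numerical_gradient(f, point)

# the analytic gradient is (2x, 3), which is (4, 3) at this point
print("numerical gradient:", grad)
```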

## <a id='toc3_1_'></a>[autograd](#toc0_)
   - **Overview**:
      - PyTorch's **automatic differentiation engine**, which computes gradients efficiently for tensor operations.
      - It enables dynamic computation graphs, making it flexible for building and training complex neural networks.
   - **How it Works**:
      1. **Backward Pass**:
         - Calling `torch.Tensor.backward()` computes the gradients of that tensor with respect to the leaf tensors in the computation graph that have `requires_grad=True`. These gradients are accumulated (summed) into the `grad` attribute of those leaf tensors.
      2. **Accessing Gradients**:
         - Gradients are stored in `torch.Tensor.grad` after the backward pass.
         - Optimizers (e.g., `torch.optim.SGD`, `torch.optim.Adam`) use these gradients to update model parameters during training.
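
The accumulation behaviour described above can be sketched as follows. This is a minimal example with an arbitrary scalar input; it only shows that a second backward pass adds to `grad` rather than replacing it, which is why training loops usually reset gradients between iterations.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# the forward pass builds the graph; the backward pass fills x.grad with d(x^2)/dx = 2x
(x**2).backward()
first = x.grad.clone()

# a second backward pass adds to x.grad instead of replacing it
(x**2).backward()
second = x.grad.clone()

# reset the accumulated gradient, as a training loop would do
x.grad.zero_()

print("after 1st backward:", first)
print("after 2nd backward:", second)
print("after zero_()     :", x.grad)
```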

📚 **Tutorials**:
   - A Gentle Introduction to `torch.autograd`: [pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html)

### <a id='toc3_1_1_'></a>[AutoGrad in Details : Example 1](#toc0_)

<figure style="text-align: center;">
    <img src="../assets/images/original/gradient/autograd.svg" alt="autograd.svg" style="width: 80%;">
    <figcaption style="text-align: center;">Lower-Level AutoGrad Mechanism</figcaption>
</figure>

©️ **Credits**:
   - more info about autograd: [https://www.youtube.com/@elliotwaite](https://www.youtube.com/watch?v=MswxJw-8PvE)

In [2]:
class CustomMul(Function):
    @staticmethod
    def forward(ctx, input1, input2):
        ctx.save_for_backward(input1, input2)
        return input1 * input2

    @staticmethod
    def backward(ctx, grad_output):
        input1, input2 = ctx.saved_tensors
        grad_input1 = grad_output * input2
        grad_input2 = grad_output * input1
        return grad_input1, grad_input2

In [None]:
# leaf nodes
t_1 = torch.tensor(2.0)
t_2 = torch.tensor(3.0, requires_grad=True)

# perform a multiplication operation
t_3 = CustomMul.apply(t_1, t_2)

# backward
t_3.backward()

# log
print(f"t_1.grad: {t_1.grad}")
print(f"t_2.grad: {t_2.grad}")
print(f"t_3.grad_fn.next_functions : {t_3.grad_fn.next_functions}")
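
One way to sanity-check a hand-written `backward` like the one above is `torch.autograd.gradcheck`, which compares the analytic gradients against finite differences. The sketch below repeats the class so that it runs on its own; the input shapes, the seed, and the tolerances are assumptions chosen for illustration.

```python
import torch
from torch.autograd import Function, gradcheck


class CustomMul(Function):
    # same forward/backward as the class above, repeated so this cell is self-contained
    @staticmethod
    def forward(ctx, input1, input2):
        ctx.save_for_backward(input1, input2)
        return input1 * input2

    @staticmethod
    def backward(ctx, grad_output):
        input1, input2 = ctx.saved_tensors
        return grad_output * input2, grad_output * input1


# gradcheck expects double-precision inputs that require gradients
torch.manual_seed(0)
a = torch.randn(3, dtype=torch.double, requires_grad=True)
b = torch.randn(3, dtype=torch.double, requires_grad=True)

is_correct = gradcheck(CustomMul.apply, (a, b), eps=1e-6, atol=1e-4)
print("gradcheck passed:", is_correct)
```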

### <a id='toc3_1_2_'></a>[AutoGrad in Details : Example 2](#toc0_)

In [None]:
# grad_fn
x = torch.tensor(2.0, requires_grad=True)

# perform operations
y = x + 1
z = y**2 * 3
out = z.mean()


# function to traverse the graph
def print_computation_graph(grad_fn, level=0):
    if grad_fn is not None:
        print(" " * level, grad_fn)
        if hasattr(grad_fn, "next_functions"):
            for fn in grad_fn.next_functions:
                print_computation_graph(fn[0], level + 4)


# start from the output node (out) and traverse backward
print("computation graph:")
print_computation_graph(out.grad_fn)

### <a id='toc3_1_3_'></a>[PyTorch Automatic Derivatives](#toc0_)

#### <a id='toc3_1_3_1_'></a>[Example 1: $f(x) = 2x + 3 \rightarrow \nabla f(x) = \frac{\partial f}{\partial x} = 2$](#toc0_)
   - $\nabla f(4) = 2$
   - $\nabla f(0) = 2$
   - $\nabla f(1) = 2$

In [None]:
def f(x: torch.Tensor):
    return 2 * x + 3  # equivalent to torch.add(torch.mul(x, 2), 3)


# x: independent variable
x = torch.tensor(1, dtype=torch.float32, requires_grad=True)

# f(x) or y : dependent variable
y = f(x)

# compute the gradients with respect to all Tensors that have `requires_grad=True`
y.backward()

# access computed gradients
# if x at 1 moves by ε, then y moves by 2ε
gradients = x.grad

# log
print("x     :", x)
print("y     :", y)
print("x.grad:", gradients)

In [None]:
# plot
_ = np.linspace(-4, 6, 100)
plt.figure(figsize=(6, 4))
plt.title(f"x.grad: {x.grad}")
plt.plot(_, f(_), label="f(x) = 2x + 3", color="blue")
plt.axvline(x=x.item(), color="red", linestyle="--", label=f"x = {x.item()}")
plt.axhline(y=f(x).item(), color="green", linestyle="--", label=f"y = {f(x).item()}")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.xticks(range(-10, 16, 2))
plt.yticks(range(-10, 16, 2))
plt.grid(True)
plt.legend()
plt.show()

#### <a id='toc3_1_3_2_'></a>[Example 2: $f(x) = 3x^2 - 2x + 5 \quad\rightarrow\quad \nabla f(x) = \frac{\partial f}{\partial x} = 6x - 2$](#toc0_)
   - $\nabla f(3) = 16$
   - $\nabla f(0) = -2$
   - $\nabla f(1) = 4$

In [None]:
def f(x):
    # equivalent to torch.add(torch.sub(torch.mul(torch.pow(x, 2), 3), torch.mul(x, 2)), 5)
    return 3 * x**2 - 2 * x + 5


x = torch.tensor(3, dtype=torch.float32, requires_grad=True)
y = f(x)

# compute the gradients with respect to all Tensors that have `requires_grad=True`
y.backward()

# access computed gradients
# if x at 3 moves by ε, then y moves by (6 * 3 - 2)ε
gradients = x.grad

# log
print("x     :", x)
print("y     :", y)
print(f"x.grad: {gradients} [at x={x}]")

In [None]:
# plot
_ = np.linspace(-5, 5, 100)
plt.figure(figsize=(6, 4))
plt.title(f"x.grad: {x.grad}")
plt.plot(_, f(_), label="f(x) = 3x^2 - 2x + 5", color="blue")
plt.axvline(x=x.item(), color="red", linestyle="--", label=f"x = {x.item()}")
plt.axhline(y=f(x).item(), color="green", linestyle="--", label=f"y = {f(x).item()}")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.xticks(range(-5, 6))
plt.yticks(range(0, 101, 10))
plt.grid(True)
plt.legend()
plt.show()

#### <a id='toc3_1_3_3_'></a>[Example 3: $f(w_1, w_2) = w_1x_1 + w_2x_2 \quad\rightarrow\quad \nabla f(W) = \left( \frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2} \right) = (x_1, x_2)$](#toc0_)
   - `magnitude:` $|\nabla f(w_1, w_2)| = \sqrt{x_1^2 + x_2^2}$

   - `direction:` $\frac{\nabla f(w_1, w_2)}{|\nabla f(w_1, w_2)|} = \frac{x_1}{\sqrt{x_1^2 + x_2^2}} \hat{i} + \frac{x_2}{\sqrt{x_1^2 + x_2^2}} \hat{j}$

In [None]:
def f(X, W):
    return torch.dot(X, W)


W = torch.tensor([1, 2], dtype=torch.float32, requires_grad=True)
X = torch.tensor([2, 3], dtype=torch.float32)
y = f(X, W)

# compute the gradients
y.backward()

# access the gradients
gradients = W.grad

magnitude_grad = torch.norm(gradients)  # same as (gradients ** 2).sum().sqrt()
direction_grad = gradients / magnitude_grad  # normalized (unit vector)

# log
print("W:", W)
print("X:", X)
print("y:", y)
print("-" * 50)
print("magnitude of gradients:", magnitude_grad.item())
print("direction of gradients:", direction_grad)

In [None]:
# plot
w1 = np.linspace(-10, 10, 100)
w2 = np.linspace(-10, 10, 100)
W1, W2 = np.meshgrid(w1, w2)

# evaluate f over a grid of weights, with the inputs X = [2, 3] held fixed
Z = W1 * X[0].item() + W2 * X[1].item()

fig = plt.figure(figsize=(12, 4), layout="compressed")

ax1 = fig.add_subplot(121, projection="3d")
ax1.plot_surface(W1, W2, Z, cmap="viridis")
ax1.set_xlabel("w1")
ax1.set_ylabel("w2")
ax1.set_zlabel("f(w1, w2)")
ax1.set_title("f(w1, w2) = 2w1 + 3w2")
ax2 = fig.add_subplot(122)
ax2.quiver(0, 0, direction_grad[0], direction_grad[1], angles="xy", scale_units="xy", scale=1, color="red")
ax2.set_xlim(-2, 2)
ax2.set_ylim(-2, 2)
ax2.set_xlabel("w1")
ax2.set_ylabel("w2")
ax2.set_title("Direction of gradients")
ax2.grid(True)

plt.show()

### <a id='toc3_1_4_'></a>[In-place Operations with `requires_grad=True` on Leaf Nodes](#toc0_)
   - **In-place operations** modify the content of a tensor **directly** without creating a new tensor.
   - Examples include operations like `+=`, `-=` or using functions with an underscore like `.add_()`, `.mul_()`, etc.

**Why Are In-place Operations Problematic for Gradients?**
   - **Loss of Original Data:**  
      - During the backward pass, PyTorch often needs the values a tensor had during the forward pass. An in-place operation **overwrites** those values, so the gradient can no longer be computed correctly.
      - For a leaf tensor with `requires_grad=True`, PyTorch refuses the in-place operation outright (outside of `torch.no_grad()`) and raises a `RuntimeError`.

In [None]:
x1 = torch.tensor(0, dtype=torch.float64)
x2 = torch.tensor(0, dtype=torch.float64, requires_grad=True)

# out-of-place assignment
x1 = x1 + 1  # x1 = x1.add(1)
x2 = x2 + 1  # x2 = x2.add(1)

# log
print("x1:", x1)
print("x2:", x2)

In [None]:
x1 = torch.tensor(0, dtype=torch.float64)
x2 = torch.tensor(0, dtype=torch.float64, requires_grad=True)

# in-place assignment
x1 += 1  # x1.add_(1)

try:
    x2 += 1  # x2.add_(1)
except RuntimeError as e:
    print(e)

# log
print("x1:", x1)
print("x2:", x2)
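
The cells above show the error raised for a leaf tensor. The "overwritten value" problem itself is easier to see on a non-leaf tensor, as in the minimal sketch below. It relies on the fact that `torch.exp` keeps its output for the backward pass; the variable names are illustrative.

```python
import torch

x = torch.tensor(1.0, requires_grad=True)

# exp() keeps its output for the backward pass, because d(exp(x))/dx = exp(x)
y = torch.exp(x)

# an in-place edit of a non-leaf tensor is allowed, but it overwrites the saved value
y.add_(1)

error_message = None
try:
    y.backward()
except RuntimeError as e:
    # autograd detects that a tensor it needs was modified after being saved
    error_message = str(e)

print(error_message)
```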

## <a id='toc3_2_'></a>[Gradient Descent](#toc0_)
   - The gradient points in the direction in which a function increases most rapidly.
   - To minimize the loss function, we move in the direction opposite to the gradient.
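
As a minimal sketch of this idea, the loop below runs gradient descent on the quadratic from Example 2, $f(x) = 3x^2 - 2x + 5$, whose minimum lies at $x = \frac{1}{3}$ (where $6x - 2 = 0$). The starting point, learning rate, and number of steps are assumed values.

```python
import torch


def f(x):
    # quadratic from Example 2; its derivative is 6x - 2
    return 3 * x**2 - 2 * x + 5


x = torch.tensor(3.0, requires_grad=True)
learning_rate = 0.1  # assumed value

for _ in range(50):  # assumed number of steps
    y = f(x)
    y.backward()
    with torch.no_grad():
        x -= learning_rate * x.grad  # move against the gradient
    x.grad.zero_()

print("x after descent:", x.item())
```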

### <a id='toc3_2_1_'></a>[Example 4: $f(w_1, w_2, b) = w_1x_1 + w_2x_2 + b \quad\rightarrow\quad \nabla f(W) = \left( \frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2}, \frac{\partial f}{\partial b} \right) = (x_1, x_2, 1)$](#toc0_)

<figure style="text-align: center;">
    <img src="../assets/images/original/perceptron/adaline.svg" alt="adaline.svg" style="width: 80%;">
    <figcaption style="text-align: center;">ADAptive LInear NEuron (ADALINE)</figcaption>
</figure>

$
    W = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}\quad,\quad
    X = \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix}\quad,\quad
    \text{output} = W^T X = \begin{bmatrix} w_0 & w_1 & w_2 \end{bmatrix} \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix} = w_0 + w_1x_1 + w_2x_2 \quad \text{(with } w_0 = b \text{)}
$

#### <a id='toc3_2_1_1_'></a>[Chain Rule](#toc0_)
   - The activation function must be differentiable.
   - The loss (error) function must be differentiable.
$$
\nabla L(W) = (\frac{\partial \text{loss}}{\partial y_{\text{pred}}} \cdot \frac{\partial y_{\text{pred}}}{\partial \text{output}} \cdot \frac{\partial \text{output}}{\partial W})
$$
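
The three factors of this product can be checked against autograd. The sketch below assumes a sigmoid activation and a squared-error loss, and borrows the input, weights, and label from the optimization example later in this notebook; the intermediate variable names are illustrative.

```python
import torch

# values borrowed from the optimization example in this notebook
X = torch.tensor([1.0, 2.0, 3.0])
W = torch.tensor([0.3, 0.7, 0.5], requires_grad=True)
y_true = torch.tensor(0.0)

# forward pass, then let autograd compute the gradient
output = torch.dot(X, W)
y_pred = torch.sigmoid(output)
loss = (y_pred - y_true) ** 2
loss.backward()

# chain rule by hand, one factor per term of the formula
with torch.no_grad():
    dloss_dypred = 2 * (y_pred - y_true)  # derivative of the squared error
    dypred_doutput = y_pred * (1 - y_pred)  # derivative of the sigmoid
    doutput_dW = X  # derivative of the dot product with respect to W
    manual_grad = dloss_dypred * dypred_doutput * doutput_dW

print("autograd:", W.grad)
print("manual  :", manual_grad)
```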

#### <a id='toc3_2_1_2_'></a>[Updating Weights](#toc0_)
$$
W_{new} = W_{old} - \alpha \nabla L(W_{old})
$$

### <a id='toc3_2_2_'></a>[Gradient Descent Optimization Example](#toc0_)
   - $x = [2, 3] \quad,\quad y = 0$
   - Note: $x$ is a single sample with two features

In [None]:
# y = 0
y_true = torch.tensor(0, dtype=torch.int64)

# the leading 1 multiplies the bias weight
X = torch.tensor([1, 2, 3], dtype=torch.float32)

# initial weights [bias = .3]
W = torch.tensor([0.3, 0.7, 0.5], dtype=torch.float32, requires_grad=True)

# hyper parameters
epochs = 10
learning_rate = 0.5

for epoch in range(epochs):
    print(f"epoch      : {epoch}")

    # feed-forward
    output = torch.dot(X, W)
    y_pred = torch.sigmoid(output)
    print(f"y_true     : {y_true.item()} (label)")
    print(f"y_pred     : {y_pred.item()}")
    print(f"prediction : {int(y_pred.item() >= 0.5)} (label)")

    # loss
    loss = (y_pred - y_true) ** 2
    print(f"loss       : {loss.item()}")

    # backward
    loss.backward()
    dW = W.grad
    step = learning_rate * dW
    print(f"grad       : {dW}")
    print(f"step       : {step}")

    # update weights [method 1]
    # W.requires_grad_(False)
    # W -= step
    # W.grad.zero_()
    # W.requires_grad_(True)

    # update weights [method 2]
    # W = W.detach() - step
    # W.requires_grad_(True)

    # update weights [method 3] : preferred
    with torch.no_grad():
        W -= step
        W.grad.zero_()

    print(f"W_new      : {W}")
    print("-" * 50)
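
The manual update in the loop above corresponds to what `torch.optim.SGD` does when used without momentum. The sketch below runs both variants side by side on the same sample and hyperparameters; the names `W_manual` and `W_sgd` are illustrative, and the label is stored as a float here for simplicity.

```python
import torch

X = torch.tensor([1.0, 2.0, 3.0])
y_true = torch.tensor(0.0)
learning_rate = 0.5
epochs = 10

# manual update, as in the loop above
W_manual = torch.tensor([0.3, 0.7, 0.5], requires_grad=True)
for _ in range(epochs):
    loss = (torch.sigmoid(torch.dot(X, W_manual)) - y_true) ** 2
    loss.backward()
    with torch.no_grad():
        W_manual -= learning_rate * W_manual.grad
        W_manual.grad.zero_()

# the same update delegated to an optimizer (plain SGD, no momentum)
W_sgd = torch.tensor([0.3, 0.7, 0.5], requires_grad=True)
optimizer = torch.optim.SGD([W_sgd], lr=learning_rate)
for _ in range(epochs):
    optimizer.zero_grad()
    loss = (torch.sigmoid(torch.dot(X, W_sgd)) - y_true) ** 2
    loss.backward()
    optimizer.step()

print("manual:", W_manual)
print("SGD   :", W_sgd)
```

Using an optimizer keeps the training loop unchanged if the update rule is later swapped, for example for `torch.optim.Adam`.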