<div style="display: flex; justify-content: space-between; align-items: center;">
    <div style="text-align: left; flex: 4">
        <strong>Author:</strong> Amirhossein Heydari — 
        📧 <a href="mailto:amirhosseinheydari78@gmail.com">amirhosseinheydari78@gmail.com</a> — 
        🐙 <a href="https://github.com/mr-pylin/pytorch-workshop" target="_blank" rel="noopener">github.com/mr-pylin</a>
    </div>
    <div style="text-align: right; flex: 1;">
        <a href="https://pytorch.org/" target="_blank" rel="noopener noreferrer">
            <img src="../assets/images/pytorch/logo/pytorch-logo-dark.svg" 
                 alt="PyTorch Logo"
                 style="max-height: 48px; width: auto; background-color: #ffffff; border-radius: 8px;">
        </a>
    </div>
</div>
<hr>


**Table of contents**<a id='toc0_'></a>    
- [Dependencies](#toc1_)    
- [A Simple Neuron Structure (Perceptron)](#toc2_)    
  - [How to estimate **y**?](#toc2_1_)    
- [Gradient](#toc3_)    
  - [autograd](#toc3_1_)    
    - [Example 1](#toc3_1_1_)    
    - [Example 2](#toc3_1_2_)    
  - [PyTorch Automatic Derivatives](#toc3_2_)    
    - [Example 1: Linear Function](#toc3_2_1_)    
    - [Example 2: Quadratic Function](#toc3_2_2_)    
    - [Example 3: Quadratic Function in 2D](#toc3_2_3_)    
    - [Example 4: Neuron-Style Squared Loss in 2D](#toc3_2_4_)    
  - [Handling `requires_grad=True` Issues](#toc3_3_)    
    - [Tensor Conversion](#toc3_3_1_)    
    - [In-place Operations with `requires_grad=True` on Leaf Nodes](#toc3_3_2_)    
  - [Gradient Descent](#toc3_4_)    
    - [Example: A Simple Neuron](#toc3_4_1_)    
      - [Chain Rule](#toc3_4_1_1_)    
      - [Weight Update Rule](#toc3_4_1_2_)    
    - [Gradient Descent Optimization Example](#toc3_4_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Dependencies](#toc0_)


In [None]:
from typing import Any

import matplotlib.pyplot as plt
import numpy as np
import torch
from torch.autograd import Function

In [None]:
# reduce default grid line width
plt.rcParams["grid.linewidth"] = 0.4

# <a id='toc2_'></a>[A Simple Neuron Structure (Perceptron)](#toc0_)

- In many contexts, the terms **Neuron** and **Perceptron** are used interchangeably.


<div style="display:flex; margin-top:50px;">
  <div style="width:20%; margin-right:auto; margin-left:auto;">
    <table style="margin:0 auto; width:80%; text-align:center">
      <caption style="font-weight:bold;">Dataset</caption>
      <thead>
        <tr>
          <th style="width:25%; text-align:center"><span style="color:magenta;">#</span></th>
          <th style="width:25%; text-align:center"><span style="color:#9090ff;">x<sub>1</sub></span></th>
          <th style="width:25%; text-align:center"><span style="color:#9090ff;">x<sub>2</sub></span></th>
          <th style="width:25%; text-align:center"><span style="color:red;">y</span></th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>1</th>
          <td>1</td>
          <td>1</td>
          <td>2</td>
        </tr>
        <tr>
          <th>2</th>
          <td>2</td>
          <td>3</td>
          <td>5</td>
        </tr>
        <tr>
          <th>3</th>
          <td>1</td>
          <td>2</td>
          <td>3</td>
        </tr>
        <tr>
          <th>4</th>
          <td>3</td>
          <td>1</td>
          <td>4</td>
        </tr>
        <tr>
          <th>5</th>
          <td>2</td>
          <td>4</td>
          <td>6</td>
        </tr>
        <tr>
          <th>6</th>
          <td>3</td>
          <td>2</td>
          <td>5</td>
        </tr>
        <tr>
          <th>7</th>
          <td>4</td>
          <td>1</td>
          <td>5</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div style="width:80%; padding:10px;">
    <figure style="text-align:center; margin:0;">
      <img src="../assets/images/original/perceptron/perceptron-1.svg" alt="perceptron-1.svg" style="max-width:80%; height:auto;">
      <figcaption style="text-align:center;">A simple Neuron (Perceptron)</figcaption>
    </figure>
  </div>
</div>

## <a id='toc2_1_'></a>[How to estimate **y**?](#toc0_)

1. **System of Equations**
    $$
    \left\{
    \begin{aligned}
    1w_1 + 1w_2 &= 2 \\
    2w_1 + 3w_2 &= 5 \\
    1w_1 + 2w_2 &= 3 \\
    3w_1 + 1w_2 &= 4 \\
    2w_1 + 4w_2 &= 6 \\
    3w_1 + 2w_2 &= 5 \\
    4w_1 + 1w_2 &= 5 \\
    \end{aligned}
    \right.
    $$

    - **Disadvantages**
      - `Complexity`: Neural networks are highly complex systems, often with millions or billions of parameters ([GPT-4 is rumored to have about 1.76 trillion parameters](https://en.wikipedia.org/wiki/GPT-4#:~:text=Rumors%20claim%20that%20GPT%2D4,running%20and%20by%20George%20Hotz.)).
      - `Non-linearity`: Neural networks use activation functions like Sigmoid, which introduce non-linearity into the network.
    - **Critical issue: Overdetermined system**
      - The number of equations is greater than the number of unknowns.
      - With noisy real-world data, such a system is usually inconsistent and has no exact solution.
      - This toy dataset is an exception: $w_1 = w_2 = 1$ satisfies all seven equations, because $y = x_1 + x_2$ in every row.

1. **Delta Rule**
    - The delta rule is also known as the Widrow-Hoff rule or the LMS (least mean squares) rule.
    - The delta rule is commonly associated with the AdaLiNe (Adaptive Linear Neuron) model.
    - It is a simple supervised learning rule used for training single-layer neural networks (perceptrons).

1. **Backpropagation**
    - Backpropagation generalizes the delta rule to multi-layer neural networks.
    - It computes the gradient of the loss with respect to every weight, so the network can learn from its mistakes by updating the weights iteratively with **Gradient Descent** (aka Steepest Descent).
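
The first two approaches can be tried directly on the toy dataset above. The following is a minimal sketch rather than part of the workshop code: it solves the overdetermined system in the least-squares sense with `np.linalg.lstsq`, then runs a per-sample delta-rule loop. The zero initialization, the learning rate `lr`, and the epoch count are illustrative assumptions.

```python
import numpy as np

# dataset from the table above: columns are x1 and x2, y is the target
X = np.array([[1, 1], [2, 3], [1, 2], [3, 1], [2, 4], [3, 2], [4, 1]], dtype=float)
y = np.array([2, 5, 3, 4, 6, 5, 5], dtype=float)

# 1) least-squares solution of the overdetermined system X @ w = y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2) delta rule (Widrow-Hoff / LMS): adjust w after every sample
w_delta = np.zeros(2)  # assumed initial weights
lr = 0.02              # assumed learning rate
for _ in range(500):   # assumed number of epochs
    for x_i, y_i in zip(X, y):
        error = y_i - w_delta @ x_i  # target minus prediction
        w_delta += lr * error * x_i  # move w to reduce this sample's error

print("least squares:", w_lstsq)
print("delta rule   :", w_delta)
```

Both results should land close to $w_1 = w_2 = 1$, since every row of this dataset satisfies $y = x_1 + x_2$.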


# <a id='toc3_'></a>[Gradient](#toc0_)

- **Definition**:
  - The gradient represents the rate of change of the output of a function with respect to its inputs.
  - For functions with multiple variables, it generalizes the concept of a derivative, forming a vector of partial derivatives.
- **Intuition**:
  - In one-dimensional functions, the gradient (or derivative) corresponds to the slope of the function.
  - In multi-dimensional functions, the gradient points in the direction of the steepest ascent of the function, with its magnitude indicating the rate of change.
- **Applications**:
  - Crucial for optimization techniques like **Gradient Descent**, where gradients guide the updates to minimize loss functions in machine learning.
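
The "vector of partial derivatives" idea can be checked numerically. The sketch below uses a hypothetical two-variable function (not from the notebook) and compares the gradient reported by autograd with a central finite-difference estimate; the step size `eps` is an assumption.

```python
import torch


def f(w: torch.Tensor) -> torch.Tensor:
    # hypothetical test function: f(w) = w0^2 + 3 * w0 * w1
    return w[0] ** 2 + 3 * w[0] * w[1]


w = torch.tensor([1.0, 2.0], dtype=torch.float64, requires_grad=True)

# analytic gradient via autograd: [2*w0 + 3*w1, 3*w0]
f(w).backward()
grad_autograd = w.grad

# numerical gradient: perturb one coordinate at a time (central differences)
eps = 1e-6
grad_numeric = torch.zeros(2, dtype=torch.float64)
with torch.no_grad():
    for i in range(2):
        step = torch.zeros(2, dtype=torch.float64)
        step[i] = eps
        grad_numeric[i] = (f(w + step) - f(w - step)) / (2 * eps)

print("autograd:", grad_autograd)
print("numeric :", grad_numeric)
```

Each entry of the gradient answers the same question as a 1D derivative: how fast does $f$ change when only that one input moves?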


## <a id='toc3_1_'></a>[autograd](#toc0_)

- **Overview**:
  - PyTorch's **automatic differentiation engine**, which computes gradients efficiently for tensor operations.
  - It enables dynamic computation graphs, making it flexible for building and training complex neural networks.
- **How it Works**:
  1. **Backward Pass**:
      - Calling `torch.Tensor.backward()` on a tensor computes its gradients with respect to every leaf tensor in the computation graph that has `requires_grad=True`. These gradients are accumulated (summed) into the `grad` attribute of the respective tensors.
  1. **Accessing Gradients**:
      - Gradients are stored in `torch.Tensor.grad` after the backward pass.
      - Optimizers (e.g., `torch.optim.SGD`, `torch.optim.Adam`) use these gradients to update model parameters during training.
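
Because gradients are accumulated rather than overwritten, repeated backward passes add up unless `grad` is reset. A minimal sketch of that behavior, using an illustrative scalar `x` that is not part of the examples below:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# first backward pass: d(x^2)/dx = 2x = 6
(x**2).backward()
first = x.grad.clone()

# second backward pass without a reset: the new 6 is added to the old 6
(x**2).backward()
second = x.grad.clone()

# reset the gradient, as a training loop does before each step
x.grad.zero_()
(x**2).backward()
third = x.grad.clone()

print(first, second, third)
```

This is why training loops clear gradients (for example with `optimizer.zero_grad()`) before each backward pass.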

📚 **Tutorials**:

- A Gentle Introduction to `torch.autograd`: [pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html)


### <a id='toc3_1_1_'></a>[Example 1](#toc0_)

<figure style="text-align: center;">
  <img src="../assets/images/original/gradient/autograd.svg" alt="autograd.svg" style="width: 80%;">
  <figcaption style="text-align: center;">Lower-Level AutoGrad Mechanism</figcaption>
</figure>

©️ **Credits**:

- Detailed info about **autograd**: [youtube.com/@elliotwaite](https://www.youtube.com/watch?v=MswxJw-8PvE)


In [None]:
class CustomMul(Function):
    @staticmethod
    def forward(
        ctx: torch.autograd.function.FunctionCtx,
        input1: torch.Tensor,
        input2: torch.Tensor,
    ) -> torch.Tensor:
        ctx.save_for_backward(input1, input2)
        return input1 * input2

    @staticmethod
    def backward(
        ctx: Any,
        *grad_output: torch.Tensor,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        input1, input2 = ctx.saved_tensors
        grad_input1 = grad_output[0] * input2
        grad_input2 = grad_output[0] * input1
        return grad_input1, grad_input2

In [None]:
# leaf nodes
t_1 = torch.tensor(2.0)
t_2 = torch.tensor(3.0, requires_grad=True)

# perform a multiplication operation
t_3 = CustomMul.apply(t_1, t_2)

# backward
t_3.backward()  # type: ignore[attr-defined]

# log
print(f"t_1.grad: {t_1.grad}")
print(f"t_2.grad: {t_2.grad}")
print(f"t_3.grad_fn.next_functions : {t_3.grad_fn.next_functions}")  # type: ignore

### <a id='toc3_1_2_'></a>[Example 2](#toc0_)


In [None]:
# grad_fn
x = torch.tensor(2.0, requires_grad=True)

# perform operations
y = x + 1
z = y**2 * 3
out = z.mean()


# function to traverse the graph
def print_computation_graph(grad_fn: torch.autograd.Function, level: int = 0):
    if grad_fn is not None:
        print(" " * level, grad_fn)
        if hasattr(grad_fn, "next_functions"):
            for fn in grad_fn.next_functions:
                print_computation_graph(fn[0], level + 4)


# start from the output node (out) and traverse backward
print("computation graph:")
print_computation_graph(out.grad_fn)  # type: ignore

## <a id='toc3_2_'></a>[PyTorch Automatic Derivatives](#toc0_)


### <a id='toc3_2_1_'></a>[Example 1: Linear Function](#toc0_)

- **Function**:
  $$f(x) = 2x + 3$$

- **Gradient (derivative)**:
  $$\nabla f(x) = \frac{\partial f}{\partial x} = 2$$

- **Key observations**:
  - The gradient is **constant**: it does not depend on $x$.
  - This means the slope of the line is always the same.
  - Linear functions **do not have a minimum or maximum** (they go to $\pm \infty$).

- **Examples**:
  - $\nabla f(4) = 2$
  - $\nabla f(0) = 2$
  - $\nabla f(1) = 2$

- **Interpretation**:
  - At every point on the line, the function increases at the same rate.
  - Gradient descent (or ascent) would **never converge**, since there is no *valley* or *peak*.


In [None]:
def f(x: torch.Tensor) -> torch.Tensor:
    return 2 * x + 3  # torch.add(torch.multiply(2, x), 3)


# x: independent variable
x = torch.tensor(1, dtype=torch.float32, requires_grad=True)

# f(x) or y : dependent variable
y = f(x)

# compute the gradients with respect to all Tensors that have `requires_grad=True`
y.backward()

# access computed gradients
# if x at 1 moves by ε, then y moves by 2ε
gradients = x.grad

# log
print("x     :", x)
print("y     :", y)
print("x.grad:", gradients)

In [None]:
# plot
_ = np.linspace(-4, 6, 100)
plt.figure(figsize=(6, 4))
plt.title(f"x.grad: {x.grad}")
plt.plot(_, f(_), label="f(x) = 2x + 3", color="blue")  # type: ignore
plt.axvline(x=x.item(), color="red", linestyle="--", label=f"x = {x.item()}")
plt.axhline(y=f(x).item(), color="green", linestyle="--", label=f"y = {f(x).item()}")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.xticks(range(-10, 16, 2))
plt.yticks(range(-10, 16, 2))
plt.grid(True)
plt.legend()
plt.show()

### <a id='toc3_2_2_'></a>[Example 2: Quadratic Function](#toc0_)

- **Function**:
  $$f(x) = 3x^2 - 2x + 5$$

- **Gradient (derivative)**:
  $$\nabla f(x) = \frac{\partial f}{\partial x} = 6x - 2$$

- **Key observations**:
  - Unlike a linear function, the gradient **depends on $x$**.
  - When $\nabla f(x) = 0$, the slope is flat → a **critical point**.
  - Since the coefficient of $x^2$ is positive ($3 > 0$), the parabola opens upward → the critical point is a **minimum**.

- **Examples**:
  - $\nabla f(3) = 16$ → positive slope (function increasing).
  - $\nabla f(0) = -2$ → negative slope (function decreasing).
  - $\nabla f(1) = 4$ → positive slope (function increasing).

- **Finding the minimum**:
  - Solve $6x - 2 = 0 \;\;\Rightarrow\;\; x = \tfrac{1}{3}$.
  - At $x = \tfrac{1}{3}$, $f\!\left(\tfrac{1}{3}\right) = 3\left(\tfrac{1}{9}\right) - 2\left(\tfrac{1}{3}\right) + 5 = \tfrac{14}{3}$.
  - So the **absolute minimum** is at:
    $$\left(x, f(x)\right) = \left(\tfrac{1}{3}, \tfrac{14}{3}\right)$$

- **Interpretation**:
  - For $x < \tfrac{1}{3}$, the gradient is **negative** → function decreasing.
  - For $x > \tfrac{1}{3}$, the gradient is **positive** → function increasing.
  - Gradient descent would naturally converge to the minimum at $x = \tfrac{1}{3}$.
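
The convergence claim can be tried out with a few lines of gradient descent. This is a sketch under assumed settings: the starting point $x = 3$, the learning rate `lr = 0.1`, and the 50 iterations are illustrative choices.

```python
import torch


def f(x: torch.Tensor) -> torch.Tensor:
    return 3 * x**2 - 2 * x + 5


x = torch.tensor(3.0, requires_grad=True)  # assumed starting point
lr = 0.1                                   # assumed learning rate

for _ in range(50):
    y = f(x)
    y.backward()              # x.grad = 6x - 2
    with torch.no_grad():
        x -= lr * x.grad      # step against the gradient
    x.grad.zero_()            # reset the accumulated gradient

print("x   :", x.item())
print("f(x):", f(x).item())
```

With this learning rate, each step multiplies the distance to $x = \tfrac{1}{3}$ by $1 - 6 \cdot 0.1 = 0.4$, so the iterate approaches the minimum quickly.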


In [None]:
def f(x: torch.Tensor) -> torch.Tensor:
    # torch.add(torch.sub(torch.mul(3, torch.pow(x, 2)), torch.mul(2, x)), 5)
    return 3 * x**2 - 2 * x + 5


x = torch.tensor(3, dtype=torch.float32, requires_grad=True)
y = f(x)

# compute the gradients with respect to all Tensors that have `requires_grad=True`
y.backward()

# access computed gradients
# if x at 3 moves by ε, then y moves by (6 * 3 - 2)ε
gradients = x.grad

# log
print("x     :", x)
print("y     :", y)
print(f"x.grad: {gradients} [at x={x}]")

In [None]:
# plot
_ = np.linspace(-5, 5, 100)
plt.figure(figsize=(6, 4))
plt.title(f"x.grad: {x.grad}")
plt.plot(_, f(_), label="f(x) = 3x^2 - 2x + 5", color="blue")  # type: ignore
plt.axvline(x=x.item(), color="red", linestyle="--", label=f"x = {x.item()}")
plt.axhline(y=f(x).item(), color="green", linestyle="--", label=f"y = {f(x).item()}")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.xticks(range(-5, 6))
plt.yticks(range(0, 101, 10))
plt.grid(True)
plt.legend()
plt.show()

### <a id='toc3_2_3_'></a>[Example 3: Quadratic Function in 2D](#toc0_)

- **Function**:
  $$f(w_1, w_2) = (w_1 - x_1)^2 + (w_2 - x_2)^2$$
  - This is a standard convex quadratic function.
  - The function measures the squared distance between the weight vector $W = [w_1, w_2]$ and a fixed point $X = [x_1, x_2]$.

- **Gradient (vector of partial derivatives)**:
  $$\nabla f(W) = \left( \frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2} \right) = 2 \begin{bmatrix} w_1 - x_1 \\ w_2 - x_2 \end{bmatrix}$$
  - The gradient points **directly away from the minimum** at $W = X$; the negative gradient points directly toward it.

- **Magnitude of the gradient**:
  $$|\nabla f(W)| = 2 \sqrt{(w_1 - x_1)^2 + (w_2 - x_2)^2}$$
  - Proportional to the distance from the minimum.

- **Direction of the gradient**:
  $$\text{direction} = \frac{\nabla f(W)}{|\nabla f(W)|} = \frac{[w_1 - x_1, w_2 - x_2]}{\sqrt{(w_1 - x_1)^2 + (w_2 - x_2)^2}}$$
  - The gradient points **radially outward from the minimum**; reversing the sign, as gradient descent does, gives a direction that points straight at the minimum.

- **Example**:
  - **Given**:
    $$X = [2, 3], \quad W = [1, 2]$$
  - **Compute gradient**:
    $$\nabla f(W) = 2 \cdot ([1, 2] - [2, 3]) = 2 \cdot [-1, -1] = [-2, -2]$$
  - **Gradient magnitude**:
    $$|\nabla f(W)| = \sqrt{(-2)^2 + (-2)^2} = \sqrt{8} \approx 2.83$$

- **Critical point and minimum**:
  - Solve $\nabla f(W) = 0 \Rightarrow W = X = [2,3]$.
  - This is the **unique global minimum** of the function.

- **Interpretation**:
  - Contours are **concentric circles** around $X$.
  - Gradient descent moves weights **directly toward the minimum**.
  - This example illustrates a **perfectly convex 2D quadratic** for teaching optimization intuitively.
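
Since $\nabla f(W) = 2(W - X)$, the negative gradient is a scaled copy of $X - W$, the vector from the current weights to the minimum. A short sketch with the values from the example above; the step size `lr = 0.5` is an assumption chosen to cancel the factor 2:

```python
import torch

X = torch.tensor([2.0, 3.0])
W = torch.tensor([1.0, 2.0], requires_grad=True)

loss = ((W - X) ** 2).sum()
loss.backward()  # W.grad = 2 * (W - X) = [-2, -2]

lr = 0.5  # assumed step size: W - 0.5 * 2 * (W - X) = X
with torch.no_grad():
    W_new = W - lr * W.grad

print("gradient:", W.grad)
print("W_new   :", W_new)
```

For this particular function, one step of that size lands exactly on the minimum; smaller steps move along the same straight line.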


In [None]:
def f(W: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
    return (W[0] - X[0])**2 + (W[1] - X[1])**2


W = torch.tensor([1.0, 2.0], dtype=torch.float32, requires_grad=True)
X = torch.tensor([2.0, 3.0], dtype=torch.float32)

# compute function value
y = f(W, X)

# compute gradients w.r.t W
y.backward()

# access gradients
gradients = W.grad
magnitude_grad = torch.norm(gradients)
direction_grad = gradients / magnitude_grad  # normalized unit vector

# log
print("W           :", W)
print("X (minimum) :", X)
print("y (loss)    :", y)
print("-" * 50)
print("gradients             :", gradients)
print("magnitude of gradients:", magnitude_grad.item())
print("direction of gradients:", direction_grad)

In [None]:
# input vector
X_arr = X.numpy()

# create weight grid
w1 = np.linspace(-1, 5, 100)
w2 = np.linspace(-1, 5, 100)
W1, W2 = np.meshgrid(w1, w2)
Z = (W1 - X_arr[0])**2 + (W2 - X_arr[1])**2  # unique minimum at W=X

# example point
point = np.array([1, 2])
direction_grad = 2 * (point - X_arr)  # gradient at W=point

# plot
fig = plt.figure(figsize=(12, 5), layout="compressed")

# 3D surface
ax1 = fig.add_subplot(121, projection="3d")
ax1.plot_surface(W1, W2, Z, cmap="viridis")
ax1.set_xlabel("w1")
ax1.set_ylabel("w2")
ax1.set_zlabel("f(W)")
ax1.set_title("f(W) = (w1-x1)^2 + (w2-x2)^2")

# contours + gradient
ax2 = fig.add_subplot(122)
contours = ax2.contour(W1, W2, Z, levels=30, cmap="viridis")
ax2.clabel(contours, inline=True, fontsize=8)
ax2.plot(X[0], X[1], "ro", markersize=8, label="Minimum W=X")
ax2.quiver(
    point[0],
    point[1],
    direction_grad[0],
    direction_grad[1],
    angles="xy",
    scale_units="xy",
    scale=1,
    color="red",
)
ax2.set_xlim(-1, 5)
ax2.set_ylim(-1, 5)
ax2.set_xlabel("w1")
ax2.set_ylabel("w2")
ax2.set_title("Contours + Gradient direction at W=[1,2]")
ax2.legend()
ax2.grid(True)

plt.show()


### <a id='toc3_2_4_'></a>[Example 4: Neuron-Style Squared Loss in 2D](#toc0_)

- **Function** (squared error for a linear neuron):
  $$f(w_1, w_2) = (w_1 x_1 + w_2 x_2 - y)^2$$

- **Gradient (vector of partial derivatives)**:
  $$\nabla f(W) = \left( \frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2} \right) = 2 (w_1 x_1 + w_2 x_2 - y) \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

- **Magnitude of the gradient**:
  $$|\nabla f(W)| = 2 |\,w_1x_1 + w_2x_2 - y\,| \,\sqrt{x_1^2 + x_2^2}$$
  - Scales with both the **prediction error** $(w_1 x_1 + w_2 x_2 - y)$ and the **input magnitude** $\|X\|$.
  - Gradient is zero when the prediction matches the target.

- **Direction of the gradient**:
  $$\nabla f(W) \text{ points along the input vector } X = [x_1, x_2]$$
  - The gradient points along $+X$ when the prediction is above the target and along $-X$ when it is below; in both cases it points away from the minimum line, in the direction of increasing loss.
  - Direction is always **parallel or anti-parallel to the input vector**.

- **Example**:
  - **Given**:
    $$x = [2, 3], \quad y = 10, \quad W = [1, 2]$$
  - **Compute prediction error**:
    $$w_1 x_1 + w_2 x_2 - y = 1\cdot2 + 2\cdot3 - 10 = 8 - 10 = -2$$
  - **Compute gradient**:
    $$\nabla f(W) = 2 \cdot (-2) \cdot [2, 3] = [-8, -12]$$
  - **Magnitude**:
    $$|\nabla f(W)| = \sqrt{(-8)^2 + (-12)^2} = \sqrt{64 + 144} = \sqrt{208} \approx 14.42$$

- **Critical point and minimum**:
  - $\nabla f(W) = 0$ when $w_1 x_1 + w_2 x_2 = y$.
  - The minimum loss value (zero) is unique, but the minimizer is not: it is reached whenever the neuron’s output matches the target.
  - Every $W$ satisfying this linear equation lies on a line in 2D weight space, and all of these points are global minima.

- **Interpretation**:
  - This is a **squared error loss** for a single linear neuron.
  - The negative gradient points toward the weights that produce the correct output.
  - Contours are straight lines (hyperplanes) perpendicular to the input vector.
  - Gradient descent moves weights along the input direction to reduce error.
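
Two of these properties can be verified with the example values: the gradient is parallel to $X$, and a step against it moves $W$ onto the minimum line. The step size $1 / (2\|X\|^2)$ used here is an assumption chosen so that a single step removes the whole error.

```python
import torch

X = torch.tensor([2.0, 3.0])
y_target = torch.tensor(10.0)
W = torch.tensor([1.0, 2.0], requires_grad=True)

loss = (torch.dot(X, W) - y_target) ** 2
loss.backward()  # W.grad = 2 * (-2) * [2, 3] = [-8, -12]

# the 2D cross product is zero when two vectors are parallel
cross = W.grad[0] * X[1] - W.grad[1] * X[0]

lr = 1.0 / (2.0 * torch.dot(X, X))  # assumed step size: 1 / (2 * ||X||^2)
with torch.no_grad():
    W_new = W - lr * W.grad
    new_error = torch.dot(X, W_new) - y_target

print("cross product:", cross.item())
print("W_new        :", W_new)
print("new error    :", new_error.item())
```

The updated weights differ from the original ones only by a multiple of $X$, which is the sense in which gradient descent moves along the input direction.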


In [None]:
def f(X: torch.Tensor, W: torch.Tensor, y_target: torch.Tensor) -> torch.Tensor:
    return (torch.dot(X, W) - y_target)**2


W = torch.tensor([1, 2], dtype=torch.float32, requires_grad=True)
X = torch.tensor([2, 3], dtype=torch.float32)
y_target = torch.tensor(10.0, dtype=torch.float32)  # target output

# compute function value
y = f(X, W, y_target)

# compute the gradients
y.backward()

# access the gradients
gradients = W.grad
assert gradients is not None, "Gradient is None - did you call backward()?"

# magnitude and direction
magnitude_grad = torch.norm(gradients)  # ||grad||
direction_grad = gradients / magnitude_grad  # normalized (unit vector)

# log
print("W:", W)
print("X:", X)
print("y_target:", y_target)
print("y (loss):", y)
print("-" * 50)
print("gradients             :", gradients)
print("magnitude of gradients:", magnitude_grad.item())
print("direction of gradients:", direction_grad)


In [None]:
# input vector and target
X_arr = np.array([2, 3])
y_target = 10

# create weight grid
w1 = np.linspace(-1, 6, 100)
w2 = np.linspace(-1, 6, 100)
W1, W2 = np.meshgrid(w1, w2)

# neuron-style squared loss: f(W) = (X·W - y_target)^2
Z = (W1 * X_arr[0] + W2 * X_arr[1] - y_target) ** 2

# example point for gradient arrow
point = np.array([1, 2])              
direction_grad = 2 * (point @ X_arr - y_target) * X_arr  # gradient at W = point

# normalize gradient for plotting
arrow_grad = direction_grad / np.linalg.norm(direction_grad)
arrow_scale = 1.5  # adjust arrow length for visibility
arrow_grad_scaled = arrow_grad * arrow_scale

# compute points for the minimum line
w1_min = np.linspace(-1, 6, 100)
w2_min = (y_target - X_arr[0] * w1_min) / X_arr[1]

# plot
fig = plt.figure(figsize=(12, 5), layout="compressed")

# 3D surface
ax1 = fig.add_subplot(121, projection="3d")
ax1.plot_surface(W1, W2, Z, cmap="viridis")
ax1.set_xlabel("w1")
ax1.set_ylabel("w2")
ax1.set_zlabel("f(W)")
ax1.set_title("f(W) = (X·W - y)^2")

# contours + gradient
ax2 = fig.add_subplot(122)
contours = ax2.contour(W1, W2, Z, levels=30, cmap="viridis")
ax2.clabel(contours, inline=True, fontsize=8)

# plot the entire minimum line
ax2.plot(w1_min, w2_min, "r-", linewidth=2, label="Minimum line (X·W=y)")

# plot normalized gradient arrow
ax2.quiver(
    point[0],
    point[1],
    arrow_grad_scaled[0],
    arrow_grad_scaled[1],
    angles="xy",
    scale_units="xy",
    scale=1,
    color="red",
)
ax2.set_xlim(-1, 6)
ax2.set_ylim(-1, 6)
ax2.set_xlabel("w1")
ax2.set_ylabel("w2")
ax2.set_title("Contours + Gradient direction at W=[1,2]")
ax2.legend()
ax2.grid(True)

plt.show()

## <a id='toc3_3_'></a>[Handling `requires_grad=True` Issues](#toc0_)


### <a id='toc3_3_1_'></a>[Tensor Conversion](#toc0_)

- Tensors with `requires_grad=True` must be **detached** from the **computation graph** before they are converted to other `array-like` formats (e.g., **NumPy** arrays).

In [None]:
x1 = torch.tensor(0, dtype=torch.float64)
x2 = torch.tensor(0, dtype=torch.float64, requires_grad=True)

# tensor to NDArray
arr1 = x1.numpy()

try:
    arr2 = x2.numpy()
except RuntimeError as e:
    print(f"Error: {e}")
    arr2 = x2.detach().numpy()  # detach x2 from the graph first

# log
print("x1:", x1)
print("x2:", x2)

### <a id='toc3_3_2_'></a>[In-place Operations with `requires_grad=True` on Leaf Nodes](#toc0_)

- **In-place operations** modify the content of a tensor **directly** without creating a new tensor.
- Examples include operations like `+=`, `-=` or using functions with an underscore like `.add_()`, `.mul_()`, etc.

**Why are in-place operations problematic for gradients?**

- **Loss of Original Data:**  
  - When you perform an in-place operation on a tensor that requires gradients, PyTorch **loses track** of the original tensor values, which is essential for correctly calculating the gradient during the backward pass.
  - This happens because, during the backward pass, PyTorch needs the original values to compute the gradients. If the tensor is modified in place, the **original value is overwritten** and cannot be accessed later for the backward calculation.
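
The overwritten-value problem is not limited to leaf tensors. The sketch below relies on the assumption that `torch.exp` keeps its output for the backward pass (its derivative is the output itself), so modifying that output in place should make the later `backward()` call fail. The tensor values are illustrative.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

y = x.exp()  # non-leaf tensor; exp needs this result again in backward
y.add_(1)    # in-place edit overwrites the value saved for backward

try:
    y.sum().backward()
    raised = False
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)

print("backward failed:", raised)
```

Replacing `y.add_(1)` with the out-of-place `y = y + 1` keeps the saved value intact and lets the backward pass run.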


In [None]:
x1 = torch.tensor(0, dtype=torch.float64)                      # leaf node
x2 = torch.tensor(0, dtype=torch.float64, requires_grad=True)  # leaf node

# out-of-place assignment
x1 = x1 + 1  # x1 = x1.add(1)
x2 = x2 + 1  # x2 = x2.add(1)

# log
print("x1:", x1)
print("x2:", x2)

In [None]:
x1 = torch.tensor(0, dtype=torch.float64)
x2 = torch.tensor(0, dtype=torch.float64, requires_grad=True)

# in-place assignment
x1 += 1  # x1.add_(1)

try:
    x2 += 1  # x2.add_(1)
except RuntimeError as e:
    print(e)
    x2 = x2 + 1  # or detach first

# log
print("x1:", x1)
print("x2:", x2)

## <a id='toc3_4_'></a>[Gradient Descent](#toc0_)

- The gradient points in the direction in which a function increases most rapidly.
- To minimize the loss function, we therefore move in the direction opposite to the gradient.


### <a id='toc3_4_1_'></a>[Example: A Simple Neuron](#toc0_)

- **Function**:  
  $$f(w_1, w_2, b) = w_1x_1 + w_2x_2 + b$$

- **Gradient (with respect to parameters)**:  
  $$\nabla f(W) = 
  \left( \frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2}, \frac{\partial f}{\partial b} \right) 
  = (x_1, x_2, 1)$$


<figure style="text-align: center;">
  <img src="../assets/images/original/perceptron/adaline.svg" alt="adaline.svg" style="width: 80%;">
  <figcaption style="text-align: center;">ADAptive LInear NEuron (ADALINE)</figcaption>
</figure>


- **Vector form** (with the bias $b$ written as $w_0$):  
  $$
  W = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix},
  \quad
  X = \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix},
  \quad
  y_{\text{pred}} = W^T X = w_0 + w_1x_1 + w_2x_2
  $$


#### <a id='toc3_4_1_1_'></a>[Chain Rule](#toc0_)

- The **activation function** must be differentiable.  
- The **loss function** must be differentiable.  

Gradient of the loss with respect to weights:

$$
\nabla L(W) 
= \frac{\partial L}{\partial y_{\text{pred}}} 
   \cdot \frac{\partial y_{\text{pred}}}{\partial \text{output}} 
   \cdot \frac{\partial \text{output}}{\partial W}
$$

- Here:  
  - $\tfrac{\partial y_{\text{pred}}}{\partial \text{output}} = 1$ (linear output).  
  - $\tfrac{\partial \text{output}}{\partial W} = X$.  
  - So effectively:  
    $$
    \nabla L(W) = \frac{\partial L}{\partial y_{\text{pred}}} \cdot X
    $$
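
The simplified chain-rule result can be compared with autograd. This sketch assumes a squared-error loss $L = (y_{\text{pred}} - y)^2$ and illustrative values for $X$, $W$, and $y$, so that $\tfrac{\partial L}{\partial y_{\text{pred}}} = 2(y_{\text{pred}} - y)$.

```python
import torch

X = torch.tensor([1.0, 2.0, 3.0])  # the leading 1 multiplies the bias w_0
W = torch.tensor([0.3, 0.7, 0.5], requires_grad=True)
y_true = torch.tensor(0.0)

# forward pass with a linear output
y_pred = torch.dot(W, X)
loss = (y_pred - y_true) ** 2  # assumed squared-error loss

# gradient from autograd
loss.backward()

# gradient from the chain rule: dL/dy_pred * X
with torch.no_grad():
    dL_dy = 2 * (y_pred - y_true)
    manual_grad = dL_dy * X

print("autograd  :", W.grad)
print("chain rule:", manual_grad)
```

If the output passed through a non-linear activation instead, the middle factor would be the derivative of that activation rather than 1.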

#### <a id='toc3_4_1_2_'></a>[Weight Update Rule](#toc0_)

- General gradient descent update:  
  $$
  W_{\text{new}} = W_{\text{old}} - \alpha \nabla L(W_{\text{old}})
  $$

- Intuition:  
  - Move weights **opposite to the gradient** (downhill on the loss surface).  
  - $\alpha$ is the **learning rate**: step size of the update.  
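
The update rule is the same step that a plain `torch.optim.SGD` optimizer performs when momentum and weight decay are left at their defaults. A minimal sketch comparing one manual update with one optimizer step, assuming a linear output, a squared-error loss, and an illustrative learning rate `alpha`:

```python
import torch

X = torch.tensor([1.0, 2.0, 3.0])
y_true = torch.tensor(0.0)
alpha = 0.01  # assumed learning rate

W_manual = torch.tensor([0.3, 0.7, 0.5], requires_grad=True)
W_sgd = W_manual.detach().clone().requires_grad_(True)

# manual update: W_new = W_old - alpha * grad
loss = (torch.dot(W_manual, X) - y_true) ** 2
loss.backward()
with torch.no_grad():
    W_manual -= alpha * W_manual.grad

# the same step through the optimizer
optimizer = torch.optim.SGD([W_sgd], lr=alpha)
optimizer.zero_grad()
loss = (torch.dot(W_sgd, X) - y_true) ** 2
loss.backward()
optimizer.step()

print("manual:", W_manual)
print("SGD   :", W_sgd)
```

Writing the update by hand keeps every step visible, while the optimizer handles the same bookkeeping for models with many parameters.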


### <a id='toc3_4_2_'></a>[Gradient Descent Optimization Example](#toc0_)

- $x = [2, 3] \quad,\quad y = 0$
- Note: $x$ is a single sample with two features


In [None]:
# y = 0
y_true = torch.tensor(0, dtype=torch.int64)

# the leading 1 is the constant input that multiplies the bias weight
X = torch.tensor([1, 2, 3], dtype=torch.float32)

# initial weights [bias = .3]
W = torch.tensor([0.3, 0.7, 0.5], dtype=torch.float32, requires_grad=True)

# hyper parameters
epochs = 10
learning_rate = 0.5

for epoch in range(epochs):
    print(f"epoch      : {epoch}")

    # feed-forward
    output = torch.dot(X, W)
    y_pred = torch.sigmoid(output)

    # loss
    loss = (y_pred - y_true) ** 2

    # backward
    loss.backward()
    dW = W.grad
    step = learning_rate * dW  # type: ignore

    # log
    print(f"y_true     : {y_true.item()} (label)")
    print(f"y_pred     : {y_pred.item()}")
    print(f"prediction : {torch.where(y_pred < .5, 0, 1)} (label)")
    print(f"loss       : {loss.item()}")
    print(f"grad       : {dW}")
    print(f"step       : {step}")

    # update weights
    with torch.no_grad():
        W -= step
        W.grad.zero_()  # type: ignore
    
    # log
    print(f"W_new      : {W}")
    print("-" * 50)