📝 **Author:** Amirhossein Heydari - 📧 **Email:** amirhosseinheydari78@gmail.com - 📍 **Linktree:** [linktr.ee/mr_pylin](https://linktr.ee/mr_pylin)

---

# Dependencies

In [30]:
import matplotlib.pyplot as plt
import numpy as np
import torch


# A Simple Neuron Structure (Perceptron)
   - In many contexts, the terms "neuron" and "perceptron" are used interchangeably

<div style="display: flex; margin-top: 50px;">
    <div style="width: 20%;">
        <table style="margin-left: auto; margin-right: auto;">
            <caption>Dataset</caption>
            <tr>
                <th><span style="color: magenta;">#</span></th>
                <th><span style="color: #9090ff;">x<sub>1</sub></span></th>
                <th><span style="color: #9090ff;">x<sub>2</sub></span></th>
                <th><span style="color: red;">y</span></th>
            </tr>
            <tr>
                <th>1</th>
                <td>1</td>
                <td>1</td>
                <td>2</td>
            </tr>
            <tr>
                <th>2</th>
                <td>2</td>
                <td>3</td>
                <td>5</td>
            </tr>
            <tr>
                <th>3</th>
                <td>1</td>
                <td>2</td>
                <td>3</td>
            </tr>
            <tr>
                <th>4</th>
                <td>3</td>
                <td>1</td>
                <td>4</td>
            </tr>
            <tr>
                <th>5</th>
                <td>2</td>
                <td>4</td>
                <td>6</td>
            </tr>
        </table>
    </div>
    <div style="width: 80%;">
        <figure style="text-align: center;">
            <img src="../assets/images/original/perceptron/perceptron-1.svg" alt="perceptron-1.png" style="width: 100%;">
            <figcaption style="text-align: center;">A simple Neuron (Perceptron)</figcaption>
        </figure>
    </div>
</div>

## How to estimate <span style="color: red;">y</span>?

   1. <span>System of Equations</span>
        $$\left\{
        \begin{aligned}
        1w_1 + 1w_2 &= 2 \\
        2w_1 + 3w_2 &= 5 \\
        1w_1 + 2w_2 &= 3 \\
        3w_1 + 1w_2 &= 4 \\
        2w_1 + 4w_2 &= 6 \\
        \end{aligned}
        \right.$$

      <ul>
        <li>Disadvantages</li>
            <ul>
                <li><span style="font-family: Consolas;">Complexity &nbsp;&nbsp;:</span> Neural networks can have millions to trillions of parameters, far too many to solve for directly <a href="https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/">[GPT-4 reportedly has 1.76 trillion parameters]</a></li>
                <li><span style="font-family: Consolas;">Non-linearity:</span> Neural networks use <u>activation functions like Sigmoid</u>, which make the equations non-linear in the weights, so linear-system solvers no longer apply</li>
            </ul>
        <li>Critical issue [Overdetermined system]</li>
            <ul>
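                <li>the number of <u>equations</u> (5) is greater than the number of <u>unknowns</u> (2)</li>
                <li>such a system is usually inconsistent and cannot be solved exactly (this toy dataset is an exception: w<sub>1</sub> = w<sub>2</sub> = 1 satisfies all five equations)</li>
                <li>in general, a linear system has no solution, exactly one solution, or infinitely many solutions; with real, noisy data an overdetermined system almost always has none</li>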
            </ul>
      </ul>

   1. <span>Delta Rule</span>
      - The delta rule is also known as the <u>Widrow-Hoff</u> rule or the <u>LMS</u> (least mean squares) rule
      - The delta rule is commonly associated with the <u>Adaline</u> (Adaptive Linear Neuron) model
      - It is a simple supervised learning rule used for training single-layer neural networks <u>(perceptrons)</u>

   1. <span>Backpropagation</span>
      - Backpropagation generalizes the Delta Rule to multi-layer neural networks.
      - It uses the chain rule to compute the gradient of the loss with respect to every weight, so the network can learn from its mistakes: the weights are then updated iteratively using <span style="color: tomato;">Gradient Descent</span> (aka Steepest Descent).
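
As a rough illustration of the delta rule on the five-sample dataset above, the sketch below updates two weights one sample at a time. The helper name `train_delta_rule`, the learning rate, the epoch count, and the zero starting weights are illustrative assumptions, not values from the original.

```python
# Dataset from the table above: ((x1, x2), y)
samples = [((1, 1), 2), ((2, 3), 5), ((1, 2), 3), ((3, 1), 4), ((2, 4), 6)]


def train_delta_rule(samples, learning_rate=0.02, epochs=500):
    """Hypothetical helper: fit y ~ w1*x1 + w2*x2 with the delta (LMS) rule."""
    w1, w2 = 0.0, 0.0  # assumed starting point
    for _ in range(epochs):
        for (x1, x2), y in samples:
            error = y - (w1 * x1 + w2 * x2)  # delta = target - prediction
            # move each weight in proportion to the error and its own input
            w1 += learning_rate * error * x1
            w2 += learning_rate * error * x2
    return w1, w2


w1, w2 = train_delta_rule(samples)
print(f"w1 = {w1:.4f}, w2 = {w2:.4f}")
```

With these settings the weights settle near $w_1 = w_2 = 1$, which reproduces every row of the dataset.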

# Gradient
   - The rate of change of a function's `output` with respect to its `input` at a particular point.
   - It generalizes the derivative to functions of multiple variables: a vector with one partial derivative per input.
   - For one-dimensional functions, the gradient reduces to the derivative (the slope of the function). Only its sign distinguishes increasing from decreasing, so it has no direction in the multi-dimensional sense.
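
One way to make "one partial derivative per input" concrete is a finite-difference estimate. The sketch below is only an approximation technique, not how autograd works; the helper `numerical_gradient`, the example function `g`, and the step size `h` are assumptions chosen for illustration.

```python
def numerical_gradient(func, point, h=1e-5):
    """Hypothetical helper: central-difference estimate of the gradient."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h   # nudge only the i-th input up
        backward[i] -= h  # and down
        grad.append((func(forward) - func(backward)) / (2 * h))
    return grad


def g(p):
    # assumed example function: g(x1, x2) = x1^2 + 3 * x2
    return p[0] ** 2 + 3 * p[1]


approx = numerical_gradient(g, [2.0, 5.0])
print(approx)  # close to the analytic gradient (2 * x1, 3) = (4, 3)
```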

# autograd
   - PyTorch’s automatic differentiation engine that powers neural network training
   - `torch.Tensor.backward()` computes the gradients and accumulates (adds) them into the `.grad` attribute of the leaf tensors that have `requires_grad=True`
   - `torch.Tensor.grad` holds the computed gradients. An optimizer (e.g., `torch.optim.SGD`) later uses them to update the model parameters.
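
Because `backward()` accumulates rather than overwrites, calling it twice without a reset sums the gradients. A minimal sketch, with the input value chosen arbitrarily:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# first backward pass: d(2x)/dx = 2
(2 * x).backward()
first = x.grad.item()

# second backward pass without a reset: the new gradient is added to the old one
(2 * x).backward()
second = x.grad.item()

# reset the accumulated gradient in place before the next pass
x.grad.zero_()
after_reset = x.grad.item()

print(first, second, after_reset)  # 2.0 4.0 0.0
```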

### Example 1: $f(x) = 2x + 3 \rightarrow \nabla f(x) = \frac{\partial f}{\partial x} = 2$
   - $\nabla f(4) = 2$
   - $\nabla f(0) = 2$
   - $\nabla f(1) = 2$

In [31]:
def f(x: torch.Tensor):
    return 2 * x + 3  # torch.add(torch.multiply(2, x), 3)


# x: independent variable
x = torch.tensor(1, dtype=torch.float32, requires_grad=True)

# f(x) or y : dependent variable
y = f(x)

# compute the gradients with respect to all Tensors that have `requires_grad=True`
y.backward()

# access computed gradients
# if x at 1 moves by ε, then y moves by 2ε
gradients = x.grad

# log
print('x     :', x)
print('y     :', y)
print("x.grad:", gradients)

x     : tensor(1., requires_grad=True)
y     : tensor(5., grad_fn=<AddBackward0>)
x.grad: tensor(2.)


In [None]:
# plot
xs = np.linspace(-4, 6, 100)  # sample points for the curve
plt.figure(figsize=(6, 4))
plt.title(f"x.grad: {x.grad}")
plt.plot(xs, f(xs), label="f(x) = 2x + 3", color='blue')
plt.axvline(x=x.item(), color='red', linestyle='--', label=f"x = {x}")
plt.axhline(y=f(x).item(), color='green', linestyle='--', label=f"y = {f(x)}")
plt.xlabel('x')
plt.ylabel('f(x)')
plt.xticks(range(-10, 16, 2))
plt.yticks(range(-10, 16, 2))
plt.grid(True)
plt.legend()
plt.show()

### Example 2: $f(x) = 3x^2 - 2x + 5 \quad\rightarrow\quad \nabla f(x) = \frac{\partial f}{\partial x} = 6x - 2$
   - $\nabla f(3) = 16$
   - $\nabla f(0) = -2$
   - $\nabla f(1) = 4$

In [33]:
def f(x):
    # torch.add(torch.sub(torch.mul(3, torch.pow(x, 2)), torch.mul(2, x)), 5)
    return 3 * x ** 2 - 2 * x + 5


x = torch.tensor(3, dtype=torch.float32, requires_grad=True)
y = f(x)

# compute the gradients with respect to all Tensors that have `requires_grad=True`
y.backward()

# access computed gradients
# if x at 3 moves by ε, then y moves by (6 * 3 - 2)ε
gradients = x.grad

# log
print('x     :', x)
print('y     :', y)
print(f"x.grad: {gradients} [at x={x}]")

x     : tensor(3., requires_grad=True)
y     : tensor(26., grad_fn=<AddBackward0>)
x.grad: 16.0 [at x=3.0]


In [None]:
# plot
xs = np.linspace(-5, 5, 100)  # sample points for the curve
plt.figure(figsize=(6, 4))
plt.title(f"x.grad: {x.grad}")
plt.plot(xs, f(xs), label="f(x) = 3x^2 - 2x + 5", color='blue')
plt.axvline(x=x.item(), color='red', linestyle='--', label=f"x = {x}")
plt.axhline(y=f(x).item(), color='green', linestyle='--', label=f"y = {f(x).item()}")
plt.xlabel('x')
plt.ylabel('f(x)')
plt.xticks(range(-5, 6))
plt.yticks(range(0, 101, 10))
plt.grid(True)
plt.legend()
plt.show()

### Example 3: $f(w_1, w_2) = w_1x_1 + w_2x_2 \quad\rightarrow\quad \nabla f(W) = \left( \frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2} \right) = (x_1, x_2)$
   - `magnitude:` $|\nabla f(w_1, w_2)| = \sqrt{x_1^2 + x_2^2}$

   - `direction:` $\frac{\nabla f(w_1, w_2)}{|\nabla f(w_1, w_2)|} = \frac{x_1}{\sqrt{x_1^2 + x_2^2}} \hat{i} + \frac{x_2}{\sqrt{x_1^2 + x_2^2}} \hat{j}$

In [35]:
def f(X, W):
    return torch.dot(X, W)


W = torch.tensor([1, 2], dtype=torch.float32, requires_grad=True)
X = torch.tensor([2, 3], dtype=torch.float32)
y = f(X, W)

# compute the gradients
y.backward()

# access the gradients
gradients = W.grad

magnitude_grad = torch.norm(gradients)      # same as (gradients ** 2).sum().sqrt()
direction_grad = gradients / magnitude_grad  # normalized (unit vector)

# log
print('W:', W)
print('X:', X)
print('y:', y)
print('-' * 50)
print('magnitude of gradients:', magnitude_grad.item())
print('direction of gradients:', direction_grad)

W: tensor([1., 2.], requires_grad=True)
X: tensor([2., 3.])
y: tensor(8., grad_fn=<DotBackward0>)
--------------------------------------------------
magnitude of gradients: 3.605551242828369
direction of gradients: tensor([0.5547, 0.8321])


In [None]:
# plot
w1 = np.linspace(-10, 10, 100)
w2 = np.linspace(-10, 10, 100)
W1, W2 = np.meshgrid(w1, w2)

# f(w1, w2) = x1 * w1 + x2 * w2, with the inputs X = [2, 3] held fixed
x1, x2 = X.numpy()
Z = W1 * x1 + W2 * x2

fig = plt.figure(figsize=(12, 4), layout='compressed')

ax1 = fig.add_subplot(121, projection='3d')
ax1.plot_surface(W1, W2, Z, cmap='viridis')
ax1.set_xlabel('w1')
ax1.set_ylabel('w2')
ax1.set_zlabel('f(w1, w2)')
ax1.set_title("f(w1, w2) = 2w1 + 3w2")
ax2 = fig.add_subplot(122)
ax2.quiver(0, 0, direction_grad[0], direction_grad[1], angles='xy', scale_units='xy', scale=1, color='red')
ax2.set_xlim(-2, 2)
ax2.set_ylim(-2, 2)
ax2.set_xlabel('w1')
ax2.set_ylabel('w2')
ax2.set_title("Direction of gradients")
ax2.grid(True)

plt.show()

## Gradient Descent
   - The gradient points in the direction in which a function increases most rapidly.
   - To minimize the loss function, we move in the direction opposite to the gradient.
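
As a small sketch of this idea, the loop below repeatedly steps against the gradient of the quadratic from Example 2, whose minimum is at $x = 1/3$ (where $6x - 2 = 0$). The starting point, learning rate, and number of steps are assumptions chosen for illustration.

```python
import torch


def f(x):
    return 3 * x ** 2 - 2 * x + 5


x = torch.tensor(3.0, requires_grad=True)  # assumed starting point
learning_rate = 0.1                        # assumed step size

for _ in range(50):
    y = f(x)
    y.backward()                      # x.grad = 6x - 2 at the current x
    with torch.no_grad():
        x -= learning_rate * x.grad   # step opposite to the gradient
    x.grad.zero_()                    # reset before the next pass

print(x.item())  # approaches 1/3
```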

### Example 4: $f(w_1, w_2, b) = w_1x_1 + w_2x_2 + b \quad\rightarrow\quad \nabla f(W) = \left( \frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2}, \frac{\partial f}{\partial b} \right) = (x_1, x_2, 1)$

<figure style="text-align: center;">
    <img src="../assets/images/original/perceptron/logistic-regression.svg" alt="logistic-regression.svg" style="width: 80%;">
    <figcaption style="text-align: center;">Logistic Regression</figcaption>
</figure>

$
    W = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}\quad,\quad
    X = \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix}\quad,\quad
    \text{output} = W^T X = \begin{bmatrix} w_0 & w_1 & w_2 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix} = w_0 + w_1x_1 + w_2x_2
    \quad (w_0 = b)
$

#### Chain Rule
   - Activation function must be differentiable
   - Loss (error) function must be differentiable
$$
\nabla L(W) = (\frac{\partial \text{loss}}{\partial y_{\text{pred}}} \cdot \frac{\partial y_{\text{pred}}}{\partial \text{output}} \cdot \frac{\partial \text{output}}{\partial W})
$$
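
The three factors can be checked against autograd on a single sample. This is a sketch under assumptions: a sigmoid activation, a squared-error loss, and arbitrary example values for `X`, `W`, and `y_true`.

```python
import torch

X = torch.tensor([1.0, 2.0, 3.0])                      # assumed input (leading 1 for the bias)
W = torch.tensor([0.3, 0.7, 0.5], requires_grad=True)  # assumed weights
y_true = torch.tensor(0.0)                             # assumed target

# forward pass
output = torch.dot(X, W)
y_pred = torch.sigmoid(output)
loss = (y_pred - y_true) ** 2

# autograd applies the chain rule for us
loss.backward()

# the same three factors, written out by hand
with torch.no_grad():
    dloss_dypred = 2 * (y_pred - y_true)    # d(loss)/d(y_pred)
    dypred_doutput = y_pred * (1 - y_pred)  # derivative of the sigmoid
    doutput_dW = X                          # d(W . X)/dW
    manual_grad = dloss_dypred * dypred_doutput * doutput_dW

print(W.grad)
print(manual_grad)
```

Both printed tensors should agree up to floating-point rounding.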

#### Updating Weights
   - $\alpha$ is the learning rate (step size)
$$
W_{\text{new}} = W_{\text{old}} - \alpha \nabla L(W_{\text{old}})
$$
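
For plain stochastic gradient descent without momentum or weight decay, `torch.optim.SGD` performs this same update. The sketch below compares one optimizer step with the rule written out by hand; the input, weights, target of 0, sigmoid activation, and squared-error loss are assumptions chosen for illustration.

```python
import torch

X = torch.tensor([1.0, 2.0, 3.0])                      # assumed input
W = torch.tensor([0.3, 0.7, 0.5], requires_grad=True)  # assumed weights
alpha = 0.5                                            # assumed learning rate

optimizer = torch.optim.SGD([W], lr=alpha)

# squared error against an assumed target of 0
loss = (torch.sigmoid(torch.dot(X, W)) - 0.0) ** 2
loss.backward()

# result of the update rule, computed by hand before the optimizer runs
W_expected = W.detach() - alpha * W.grad

optimizer.step()       # W <- W - alpha * W.grad
optimizer.zero_grad()  # clear gradients for the next iteration

print(W.detach())
print(W_expected)
```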

## in-place operation with `requires_grad=True` on a leaf node
   - a leaf tensor is one created directly by the user rather than produced by an operation
   - you can't perform **in-place** operations on leaf tensors that require gradients `[e.g. updating weights]` unless gradient tracking is disabled (e.g., inside `torch.no_grad()`)
   - An **in-place** operation (e.g., `+=`, or a method like `.add_()`) overwrites the tensor's values, which autograd may still need to compute gradients.
   - methods whose names end with an underscore (e.g., `add_()`) operate in place
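
A short sketch of what "leaf" means in practice, using the `is_leaf` attribute, and of an in-place update that is allowed because autograd is not recording. The values are arbitrary.

```python
import torch

x = torch.tensor(0.0, requires_grad=True)  # created directly: a leaf
y = x + 1                                  # produced by an operation: not a leaf

print(x.is_leaf, y.is_leaf)  # True False

# in-place update of the leaf is allowed while autograd is not recording
with torch.no_grad():
    x += 1

print(x)  # tensor(1., requires_grad=True)
```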

In [37]:
x1 = torch.tensor(0, dtype=torch.float64)
x2 = torch.tensor(0, dtype=torch.float64, requires_grad=True)

# regular assignment
x1 = x1 + 1  # x1 = x1.add(1)
x2 = x2 + 1  # x2 = x2.add(1)

# this operation creates a new tensor with requires_grad=True, and a node is added to the computation graph to track the operation (x2 + 1).

# log
print('x1:', x1)
print('x2:', x2)

x1: tensor(1., dtype=torch.float64)
x2: tensor(1., dtype=torch.float64, grad_fn=<AddBackward0>)


In [38]:
x1 = torch.tensor(0, dtype=torch.float64)
x2 = torch.tensor(0, dtype=torch.float64, requires_grad=True)

# in-place assignment
x1 += 1  # x1.add_(1)

try:
    x2 += 1  # x2.add_(1)
except RuntimeError as e:
    print(e)

# log
print('x1:', x1)
print('x2:', x2)

a leaf Variable that requires grad is being used in an in-place operation.
x1: tensor(1., dtype=torch.float64)
x2: tensor(0., dtype=torch.float64, requires_grad=True)


In [39]:
# grad_fn
x = torch.tensor(2.0, requires_grad=True)

# perform operations
y = x + 1
z = y ** 2 * 3
out = z.mean()

# function to traverse the graph
def print_computation_graph(grad_fn, level=0):
    if grad_fn is not None:
        print(" " * level, grad_fn)
        if hasattr(grad_fn, 'next_functions'):
            for fn in grad_fn.next_functions:
                print_computation_graph(fn[0], level + 4)

# start from the output node (out) and traverse backward
print("computation graph:")
print_computation_graph(out.grad_fn)

computation graph:
 <MeanBackward0 object at 0x000001CBB0C8C850>
     <MulBackward0 object at 0x000001CB90E44550>
         <PowBackward0 object at 0x000001CBBB1B5C60>
             <AddBackward0 object at 0x000001CBB8B0F190>
                 <AccumulateGrad object at 0x000001CBB8B0FFD0>


## Example
   - $x = [2, 3] \quad,\quad y = 0$
   - Note: $x$ is a single sample with two features

In [40]:
# y = 0
y_true = torch.tensor(0, dtype=torch.int64)

# the leading 1 multiplies the bias weight, so X = [1, x1, x2]
X = torch.tensor([1, 2, 3], dtype=torch.float32)

# initial weights [bias = .3]
W = torch.tensor([.3, .7, .5], dtype=torch.float32, requires_grad=True)

# hyperparameters
epochs = 10
learning_rate = .5

for epoch in range(epochs):
    print('epoch:', epoch)

    # feed-forward
    output = torch.dot(X, W)
    y_pred = torch.sigmoid(output)
    print(f"y_true    : {y_true.item()} (label)")
    print(f"y_pred    : {y_pred.item()}")
    print(f"prediction: {torch.where(y_pred < .5, 0, 1)} (label)")

    # loss
    loss = (y_pred - y_true) ** 2
    print('loss:', loss.item())

    # backward
    loss.backward()
    dW = W.grad
    step = learning_rate * dW
    print('grad:', dW)
    print('step:', step)

    # update weights [method 1]
    # W.requires_grad_(False)
    # W -= step
    # W.grad.zero_()
    # W.requires_grad_(True)

    # update weights [method 2]
    # W = W.detach() - step
    # W.requires_grad_(True)

    # update weights [method 3] : preferred
    with torch.no_grad():
        W -= step
        W.grad.zero_()

    print('W_new', W)
    print('-' * 50)

epoch: 0
y_true    : 0 (label)
y_pred    : 0.960834264755249
prediction: 1 (label)
loss: 0.9232024550437927
grad: tensor([0.0723, 0.1446, 0.2169])
step: tensor([0.0362, 0.0723, 0.1085])
W_new tensor([0.2638, 0.6277, 0.3915], requires_grad=True)
--------------------------------------------------
epoch: 1
y_true    : 0 (label)
y_pred    : 0.9366591572761536
prediction: 1 (label)
loss: 0.8773303627967834
grad: tensor([0.1111, 0.2223, 0.3334])
step: tensor([0.0556, 0.1111, 0.1667])
W_new tensor([0.2083, 0.5165, 0.2248], requires_grad=True)
--------------------------------------------------
epoch: 2
y_true    : 0 (label)
y_pred    : 0.8716690540313721
prediction: 1 (label)
loss: 0.7598069310188293
grad: tensor([0.1950, 0.3900, 0.5850])
step: tensor([0.0975, 0.1950, 0.2925])
W_new tensor([ 0.1108,  0.3215, -0.0677], requires_grad=True)
--------------------------------------------------
epoch: 3
y_true    : 0 (label)
y_pred    : 0.6342986822128296
prediction: 1 (label)
loss: 0.402334809303283