<div style="display: flex; justify-content: space-between; align-items: center;">
    <div style="text-align: left; flex: 4">
        <strong>Author:</strong> Amirhossein Heydari — 
        📧 <a href="mailto:amirhosseinheydari78@gmail.com">amirhosseinheydari78@gmail.com</a> — 
        🐙 <a href="https://github.com/mr-pylin/pytorch-workshop" target="_blank" rel="noopener">github.com/mr-pylin</a>
    </div>
    <div style="text-align: right; flex: 1;">
        <a href="https://pytorch.org/" target="_blank" rel="noopener noreferrer">
            <img src="../assets/images/pytorch/logo/pytorch-logo-dark.svg" 
                 alt="PyTorch Logo"
                 style="max-height: 48px; width: auto; background-color: #ffffff; border-radius: 8px;">
        </a>
    </div>
</div>
<hr>


**Table of contents**<a id='toc0_'></a>    
- [Dependencies](#toc1_)    
- [Multilayer Perceptron (MLP)](#toc2_)    
  - [Forward Propagation Using Linear Algebra](#toc2_1_)    
  - [Gradient Computation and Backpropagation](#toc2_2_)    
  - [Multilayer Perceptron Using PyTorch](#toc2_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Dependencies](#toc0_)


In [1]:
import torch
import torch.nn.functional as F
from torch import nn
from torchinfo import summary

In [2]:
# set a seed for deterministic results
seed = 42
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [None]:
# check if cuda is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# log
device

# <a id='toc2_'></a>[Multilayer Perceptron (MLP)](#toc0_)

- A [**Multilayer Perceptron (MLP)**](https://en.wikipedia.org/wiki/Multilayer_perceptron) is a type of feedforward artificial neural network, also known as a **Fully-Connected Network** or **Dense Network**.
- It consists of at least three layers of nodes: an **input layer**, one or more **hidden layers**, and an **output layer**.

🧬 **Key Characteristics**:

- **Fully Connected**: Every node (neuron) in one layer is connected to every node in the next layer.
- **Non-Linear [Activations](./utils/activation-functions.ipynb)**: Each neuron applies a non-linear activation function, enabling the network to model complex patterns.
- **[Feedforward](https://en.wikipedia.org/wiki/Feedforward_neural_network)**: Data flows in a single direction, from input to output, with no cycles or loops.

🏛️ **Basic Architecture**:

- **Input Layer**: Receives input features. The number of neurons equals the number of features in the dataset.
- **Hidden Layers**: These layers contain neurons that compute weighted sums and apply activation functions.
- **Output Layer**: Produces the final output, which can be a single value or a vector of values depending on the task, e.g. [**regression**](https://en.wikipedia.org/wiki/Regression_analysis) or [**classification**](https://en.wikipedia.org/wiki/Classification).

<figure style="text-align: center;">
  <img src="../assets/images/original/mlp/multi-layer-perceptrons.svg" alt="multi-layer-perceptrons.svg" style="width: 100%;">
  <figcaption style="text-align: center;">Multi-Layer-Perceptron (aka fully connected layers)</figcaption>
</figure>

<table style="margin: 0 auto; text-align:center;">
  <thead>
    <tr>
      <th colspan="2">hidden<sub>1</sub> parameters</th>
      <th colspan="2">hidden<sub>2</sub> parameters</th>
      <th colspan="2">logits parameters</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Weights</td>
      <td>Biases</td>
      <td>Weights</td>
      <td>Biases</td>
      <td>Weights</td>
      <td>Biases</td>
    </tr>
    <tr>
      <td>A × B</td>
      <td>B</td>
      <td>B × C</td>
      <td>C</td>
      <td>C × D</td>
      <td>D</td>
    </tr>
  </tbody>
  <tfoot>
    <tr>
      <td colspan="2">(A + 1) × B</td>
      <td colspan="2">(B + 1) × C</td>
      <td colspan="2">(C + 1) × D</td>
    </tr>
  </tfoot>
</table>
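
The totals in the table footer can be checked by counting the parameters of a stack of `nn.Linear` layers. This is a minimal sketch; the concrete sizes chosen for `A`, `B`, `C`, and `D` are hypothetical and only serve the illustration.

```python
import torch
from torch import nn

# hypothetical layer sizes standing in for A, B, C, D in the table
A, B, C, D = 10, 5, 3, 2

layers = nn.Sequential(
    nn.Linear(A, B),  # hidden_1: A x B weights + B biases
    nn.ReLU(),
    nn.Linear(B, C),  # hidden_2: B x C weights + C biases
    nn.ReLU(),
    nn.Linear(C, D),  # logits:   C x D weights + D biases
)

# closed-form count from the table footer
expected = (A + 1) * B + (B + 1) * C + (C + 1) * D

# count reported by PyTorch (ReLU has no parameters)
actual = sum(p.numel() for p in layers.parameters())

print(f"expected: {expected} | actual: {actual}")
```

Both numbers should agree, since each fully connected layer contributes one weight per input-output pair plus one bias per output neuron.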

📉 **Limitations of MLPs**:

- **Fixed Input and Output Sizes**:
  - MLPs require a fixed size for both input and output, making them less flexible for tasks involving variable-length sequences.
- **Lack of Temporal Awareness**:
  - MLPs do not inherently handle temporal data well.
  - They treat each input independently, which means they can't capture the temporal dependencies in sequential data.
- **Scalability Issues**:
  - As the size of the input data grows, the number of parameters in an MLP increases significantly, leading to higher computational costs and potential **overfitting**.
- **Stateless Nature**:
  - MLPs learn a fixed function approximation and do not maintain any state between inputs, which limits their ability to model dynamic processes.

⚔️ **MLPs vs. Other Architectures**:

- MLPs vs. [CNNs (Convolutional Neural Networks)](./08-convolutional-neural-networks.ipynb): CNNs are better suited for image data because they can capture spatial hierarchies, while MLPs are more general-purpose.
- MLPs vs. [RNNs (Recurrent Neural Networks)](./12-recurrent-neural-networks.ipynb): RNNs are used for sequential data (e.g., time series, language modeling) because they can handle temporal dependencies.

🛠️ **Weight and Bias Initialization**:

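- **Weight**:
  - `nn.Linear` initializes its weights with Kaiming (He) uniform initialization using `a = sqrt(5)` by default, which reduces to a uniform distribution whose range depends only on the number of input units:
      $$W \sim \mathcal{U}\left(-\frac{1}{\sqrt{n_{\text{in}}}}, \frac{1}{\sqrt{n_{\text{in}}}}\right)$$
  - For comparison, Xavier (Glorot) uniform initialization (`nn.init.xavier_uniform_`) uses a range based on the number of input and output units:
      $$W \sim \mathcal{U}\left(-{gain}\times\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, {gain}\times\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)$$
- **Bias**:
  - Biases of `nn.Linear` are not zero by default; they are drawn from the same range as the weights:
      $$b \sim \mathcal{U}\left(-\frac{1}{\sqrt{n_{\text{in}}}}, \frac{1}{\sqrt{n_{\text{in}}}}\right)$$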
- More Details about Initialization: [hyperparameters.ipynb](./utils/hyperparameters.ipynb)
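
The default initialization of `nn.Linear` can be inspected directly. The sketch below assumes the bound documented for `nn.Linear`, $1/\sqrt{n_{\text{in}}}$ for both weights and biases, and then shows one way to switch to Xavier uniform weights and zero biases explicitly. The layer sizes are hypothetical.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)

fan_in, fan_out = 64, 32  # hypothetical layer sizes
layer = nn.Linear(fan_in, fan_out)

# documented default range for nn.Linear: U(-1/sqrt(fan_in), 1/sqrt(fan_in))
bound = 1.0 / math.sqrt(fan_in)
w_max = layer.weight.abs().max().item()
b_max = layer.bias.abs().max().item()
print(f"bound: {bound:.4f} | max |W|: {w_max:.4f} | max |b|: {b_max:.4f}")

# explicitly re-initialize: Xavier uniform weights (gain = 1) and zero biases
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)

xavier_bound = math.sqrt(6.0 / (fan_in + fan_out))
xavier_w_max = layer.weight.abs().max().item()
print(f"xavier bound: {xavier_bound:.4f} | max |W|: {xavier_w_max:.4f}")
```

Re-initializing in place like this is only needed when the default scheme does not suit the chosen activation function.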

🛝 **Playgrounds**:

- [deeperplayground.org](https://deeperplayground.org/)
- [alexlenail.me/NN-SVG](https://alexlenail.me/NN-SVG/)


## <a id='toc2_1_'></a>[Forward Propagation Using Linear Algebra](#toc0_)

- **Layer 1 (First Hidden Layer)**
  - **Input**: $\mathbf{x} \in \mathbb{R}^d$, where $d$ is the number of input features.
  - **Weights**: $\mathbf{W}^{(1)} \in \mathbb{R}^{h_1 \times d}$, where $h_1$ is the number of neurons in the first hidden layer.
  - **Biases**: $\mathbf{b}^{(1)} \in \mathbb{R}^{h_1}$.
  - The transformation for the first hidden layer (pre-activation $\mathbf{z}^{(1)}$, activation $\mathbf{a}^{(1)}$) is:
      $$\mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}, \qquad \mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)})$$

- **For each subsequent hidden layer** $l$, where $l \in \{2, 3, \ldots, L - 1\}$
  - **Input** from the previous layer: $\mathbf{a}^{(l-1)} \in \mathbb{R}^{h_{l-1}}$.
  - **Weights**: $\mathbf{W}^{(l)} \in \mathbb{R}^{h_l \times h_{l-1}}$, where $h_l$ is the number of neurons in the $l$-th hidden layer.
  - **Biases**: $\mathbf{b}^{(l)} \in \mathbb{R}^{h_l}$.
  - The transformation for each hidden layer is:
      $$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)})$$

- **Output Layer**
  - **Weights**: $\mathbf{W}^{(L)} \in \mathbb{R}^{o \times h_{L-1}}$, where $o$ is the number of output neurons.
  - **Biases**: $\mathbf{b}^{(L)} \in \mathbb{R}^{o}$.
  - The transformation for the output, where $\sigma_L$ is the output activation (the identity when the network returns raw logits), is:
      $$\mathbf{z}^{(L)} = \mathbf{W}^{(L)} \mathbf{a}^{(L-1)} + \mathbf{b}^{(L)}, \qquad \mathbf{\hat{y}} = \sigma_L(\mathbf{z}^{(L)})$$
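
The equations above treat a single sample as a column vector, while PyTorch code usually processes a batch with one sample per row. The following sketch, with hypothetical sizes, suggests that the per-sample form $\mathbf{W}\mathbf{x} + \mathbf{b}$ and the batched form `X @ W.T + b` produce the same pre-activations.

```python
import torch

torch.manual_seed(0)

d, h1, batch_size = 4, 3, 5  # hypothetical sizes
W = torch.randn(h1, d)  # weights of shape (h1, d), as in the equations
b = torch.randn(h1)
X = torch.randn(batch_size, d)  # one sample per row

# column-vector form, one sample at a time: z = W x + b
z_per_sample = torch.stack([W @ x + b for x in X])

# batched row-vector form: Z = X W^T + b (b is broadcast over the rows)
z_batched = X @ W.T + b

print(z_batched.shape)
print(torch.allclose(z_per_sample, z_batched, atol=1e-5))
```

The transpose appears only because the batch dimension comes first; the weight matrix keeps the $h_1 \times d$ shape used in the math.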


In [4]:
class MLP(torch.nn.Module):
    def __init__(self, input_size: int, hidden_size1: int, hidden_size2: int, output_size: int):
        super().__init__()

        # initialize weights and biases for the first hidden layer
        self.W1 = nn.Parameter(torch.randn(hidden_size1, input_size))
        self.b1 = nn.Parameter(torch.randn(hidden_size1))

        # initialize weights and biases for the second hidden layer
        self.W2 = nn.Parameter(torch.randn(hidden_size2, hidden_size1))
        self.b2 = nn.Parameter(torch.randn(hidden_size2))

        # initialize weights and biases for the output layer
        self.W3 = nn.Parameter(torch.randn(output_size, hidden_size2))
        self.b3 = nn.Parameter(torch.randn(output_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.z1 = x @ self.W1.T + self.b1
        self.a1 = F.relu(self.z1)

        self.z2 = self.a1 @ self.W2.T + self.b2
        self.a2 = F.relu(self.z2)

        self.z3 = self.a2 @ self.W3.T + self.b3
        return self.z3

In [None]:
# example input
batch_size = 3
x = torch.randn(batch_size, 10)
y = torch.randn(batch_size, 2)

# initialize the MLP
input_size = 10  # number of input features
hidden_size1 = 5  # number of neurons in the first hidden layer
hidden_size2 = 3  # number of neurons in the second hidden layer
output_size = 2  # number of output neurons (e.g., for binary classification)

model_1 = MLP(input_size, hidden_size1, hidden_size2, output_size)
model_1

In [None]:
summary(model_1, input_size=(x.size()), device="cpu")

In [None]:
# perform forward propagation (calling the module invokes forward)
with torch.no_grad():
    y_pred = model_1(x)

# log
print(f"y_pred:\n{y_pred}")

## <a id='toc2_2_'></a>[Gradient Computation and Backpropagation](#toc0_)

- **Compute the Loss**:
   $$\mathcal{L}(\mathbf{\hat{y}}, \mathbf{y})$$
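- **Backpropagation**
  - Compute the gradient of the loss with respect to the output layer weights and biases, where $\mathbf{z}^{(l)}$ is the pre-activation of layer $l$ and $\mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)})$ is its activation:
      $$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(L)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}} \cdot \frac{\partial \mathbf{z}^{(L)}}{\partial \mathbf{W}^{(L)}}, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(L)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}}$$
  - Propagate the gradient backward through each preceding layer, then compute the gradients for its weights and biases:
      $$\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}} = \left( {\mathbf{W}^{(l+1)}}^{\top} \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l+1)}} \right) \odot \sigma'(\mathbf{z}^{(l)})$$
      $$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}} \cdot \frac{\partial \mathbf{z}^{(l)}}{\partial \mathbf{W}^{(l)}}, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}}$$
- **Update the Parameters**
  - Apply a gradient-based optimization algorithm such as Gradient Descent or Adam. For plain Gradient Descent with learning rate $\eta$, the update is:
      $$\mathbf{W}^{(l)} = \mathbf{W}^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}, \qquad \mathbf{b}^{(l)} = \mathbf{b}^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}}$$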
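
One way to gain confidence in hand-derived gradients is to compare them with autograd. The sketch below is not part of the `MLP` class in this notebook; it uses a hypothetical one-hidden-layer network with an MSE loss, computes the gradients manually with the chain rule, and checks them against `loss.backward()`. Note that `torch.mean` averages over every element, so the manual gradient divides by `y.numel()`.

```python
import torch

torch.manual_seed(0)

# hypothetical sizes: 4 samples, 6 features, 5 hidden units, 2 outputs
x = torch.randn(4, 6)
y = torch.randn(4, 2)
W1 = torch.randn(5, 6, requires_grad=True)
b1 = torch.randn(5, requires_grad=True)
W2 = torch.randn(2, 5, requires_grad=True)
b2 = torch.randn(2, requires_grad=True)

# forward pass
z1 = x @ W1.T + b1
a1 = torch.relu(z1)
z2 = a1 @ W2.T + b2
loss = torch.mean((z2 - y) ** 2)

# reference gradients from autograd
loss.backward()

# manual gradients via the chain rule
with torch.no_grad():
    grad_z2 = 2 * (z2 - y) / y.numel()  # d(loss)/d(z2)
    grad_W2 = grad_z2.T @ a1  # d(loss)/d(W2)
    grad_b2 = grad_z2.sum(dim=0)  # d(loss)/d(b2)
    grad_a1 = grad_z2 @ W2  # d(loss)/d(a1)
    grad_z1 = grad_a1 * (z1 > 0).float()  # ReLU'(z1) is 1 where z1 > 0, else 0
    grad_W1 = grad_z1.T @ x  # d(loss)/d(W1)
    grad_b1 = grad_z1.sum(dim=0)  # d(loss)/d(b1)

# compare manual and autograd gradients
pairs = {"W1": (grad_W1, W1.grad), "b1": (grad_b1, b1.grad), "W2": (grad_W2, W2.grad), "b2": (grad_b2, b2.grad)}
for name, (manual, auto) in pairs.items():
    print(f"{name}: match = {torch.allclose(manual, auto, atol=1e-5)}")
```

If any pair fails to match, the usual suspects are the loss normalizer and a transposed operand in one of the matrix products.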


In [8]:
def backward(self, x: torch.Tensor, y: torch.Tensor, learning_rate: float) -> None:
    # compute the loss (Mean Squared Error - MSE)
    # loss = (1 / (N * O)) * sum((z3 - y)^2) over all N batch samples and O output units
    loss = torch.mean((self.z3 - y) ** 2)

    # compute the gradient of the loss with respect to z3 (output layer pre-activation)
    # this is the local gradient for the loss function with respect to z3
    # torch.mean averages over every element, so the normalizer is y.numel() = N * O, not N
    # d(loss)/d(z3) = 2 * (z3 - y) / (N * O)
    loss_grad = 2 * (self.z3 - y) / y.numel()

    # compute the gradient of the loss with respect to W3 (weights between hidden layer 2 and output layer)
    # d(loss)/d(W3) = d(loss)/d(z3) * d(z3)/d(W3)
    # d(z3)/d(W3) = a2^T (activation of hidden layer 2)
    # grad_W3 = (loss_grad)^T * a2
    grad_W3 = torch.matmul(loss_grad.T, self.a2)

    # compute the gradient of the loss with respect to b3 (biases of the output layer)
    # d(loss)/d(b3) = d(loss)/d(z3) * d(z3)/d(b3)
    # d(z3)/d(b3) = 1 (bias gradient accumulates over the batch dimension)
    # grad_b3 = sum(loss_grad) across batch dimension
    grad_b3 = torch.sum(loss_grad, dim=0)

    # backpropagate the gradient to the second hidden layer (w.r.t. a2)
    # compute the gradient of the loss with respect to a2 (activation of hidden layer 2)
    # d(loss)/d(a2) = d(loss)/d(z3) * d(z3)/d(a2)
    # d(z3)/d(a2) = W3 (weights between hidden layer 2 and output layer)
    grad_a2 = torch.matmul(loss_grad, self.W3)

    # compute the gradient of the loss with respect to z2 (pre-activation of hidden layer 2)
    # this is the local gradient for ReLU at the second hidden layer
    # d(a2)/d(z2) = ReLU'(z2) (element-wise derivative of ReLU)
    # grad_z2 = grad_a2 * ReLU'(z2) (ReLU'(z2) is 1 where z2 > 0, else 0)
    grad_z2 = grad_a2 * (self.z2 > 0).float()

    # compute the gradient of the loss with respect to W2 (weights between hidden layer 1 and hidden layer 2)
    # d(loss)/d(W2) = d(loss)/d(z2) * d(z2)/d(W2)
    # d(z2)/d(W2) = a1^T (activation of hidden layer 1)
    # grad_W2 = (grad_z2)^T * a1
    grad_W2 = torch.matmul(grad_z2.T, self.a1)

    # compute the gradient of the loss with respect to b2 (biases of hidden layer 2)
    # d(loss)/d(b2) = d(loss)/d(z2) * d(z2)/d(b2)
    # d(z2)/d(b2) = 1 (bias gradient accumulates over the batch dimension)
    # grad_b2 = sum(grad_z2) across batch dimension
    grad_b2 = torch.sum(grad_z2, dim=0)

    # backpropagate the gradient to the first hidden layer (w.r.t. a1)
    # compute the gradient of the loss with respect to a1 (activation of hidden layer 1)
    # d(loss)/d(a1) = d(loss)/d(z2) * d(z2)/d(a1)
    # d(z2)/d(a1) = W2 (weights between hidden layer 1 and hidden layer 2)
    grad_a1 = torch.matmul(grad_z2, self.W2)

    # compute the gradient of the loss with respect to z1 (pre-activation of hidden layer 1)
    # this is the local gradient for ReLU at the first hidden layer
    # d(a1)/d(z1) = ReLU'(z1) (element-wise derivative of ReLU)
    # grad_z1 = grad_a1 * ReLU'(z1) (ReLU'(z1) is 1 where z1 > 0, else 0)
    grad_z1 = grad_a1 * (self.z1 > 0).float()

    # compute the gradient of the loss with respect to W1 (weights between input layer and hidden layer 1)
    # d(loss)/d(W1) = d(loss)/d(z1) * d(z1)/d(W1)
    # d(z1)/d(W1) = x^T (input features)
    # grad_W1 = (grad_z1)^T * x
    grad_W1 = torch.matmul(grad_z1.T, x)

    # compute the gradient of the loss with respect to b1 (biases of hidden layer 1)
    # d(loss)/d(b1) = d(loss)/d(z1) * d(z1)/d(b1)
    # d(z1)/d(b1) = 1 (bias gradient accumulates over the batch dimension)
    # grad_b1 = sum(grad_z1) across batch dimension
    grad_b1 = torch.sum(grad_z1, dim=0)

    # update parameters using gradients (Gradient Descent step)
    with torch.no_grad():
        self.W1 -= learning_rate * grad_W1
        self.b1 -= learning_rate * grad_b1
        self.W2 -= learning_rate * grad_W2
        self.b2 -= learning_rate * grad_b2
        self.W3 -= learning_rate * grad_W3
        self.b3 -= learning_rate * grad_b3

In [9]:
MLP.backward = backward

In [None]:
# example input
batch_size = 3
x = torch.randn(batch_size, 10)
y = torch.randn(batch_size, 2)

# initialize the MLP
input_size = 10  # number of input features
hidden_size1 = 5  # number of neurons in the first hidden layer
hidden_size2 = 3  # number of neurons in the second hidden layer
output_size = 2  # number of output neurons

model_2 = MLP(input_size, hidden_size1, hidden_size2, output_size)
summary(model_2, input_size=x.size(), device="cpu")

In [None]:
# perform forward propagation (this also caches z1, a1, z2, a2, z3 on the model)
with torch.no_grad():
    y_pred_1 = model_2(x)

# perform backward propagation and update weights
learning_rate = 0.01
model_2.backward(x, y, learning_rate)

# perform forward propagation again to see the updated output
with torch.no_grad():
    y_pred_2 = model_2(x)

# log
print(f"y_true:\n{y}\n")
print(f"output before backpropagation:\n{y_pred_1}\n")
print(f"output after backpropagation:\n{y_pred_2}")

## <a id='toc2_3_'></a>[Multilayer Perceptron Using PyTorch](#toc0_)

- Refer to this [**notebook**](./projects/01-multi-layer-perceptrons.ipynb) for a comprehensive, end-to-end MLP example.

📚 **Tutorials**:

- Neural Networks: [pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial](https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial)
- Training a Classifier: [pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html)


In [12]:
class MLP2(nn.Module):
    def __init__(self, input_size: int, hidden_size1: int, hidden_size2: int, output_size: int):
        super().__init__()
        # define layers using nn.Linear
        self.fc1 = nn.Linear(input_size, hidden_size1)  # first hidden layer
        self.fc2 = nn.Linear(hidden_size1, hidden_size2)  # second hidden layer
        self.fc3 = nn.Linear(hidden_size2, output_size)  # output layer

        # define activation function (ReLU)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # forward pass through the network
        x = self.relu(self.fc1(x))  # first hidden layer with ReLU
        x = self.relu(self.fc2(x))  # second hidden layer with ReLU
        x = self.fc3(x)  # output layer (no activation here)
        return x

In [None]:
input_size = 500  # number of input features
hidden_size1 = 10  # size of the first hidden layer
hidden_size2 = 8  # size of the second hidden layer
num_classes = 3  # number of output classes (one logit per class)

# initialize the model
model_3 = MLP2(input_size, hidden_size1, hidden_size2, num_classes)

# log
model_3

In [14]:
# example input
batch_size = 32
x = torch.randn(batch_size, input_size)
y = torch.randint(0, num_classes, (batch_size,))

In [None]:
summary(model_3, input_size=x.size(), device="cpu")

In [None]:
# initialize criterion and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model_3.parameters(), lr=0.01)

num_epochs = 100

# training loop
for epoch in range(num_epochs):

    # forward pass
    y_pred = model_3(x)

    # compute the loss
    loss = criterion(y_pred, y)

    # perform backward propagation automatically
    loss.backward()

    # update the weights & zero the gradients
    optimizer.step()
    optimizer.zero_grad()

    # compute accuracy
    acc = (y_pred.argmax(dim=1) == y).sum().item() / batch_size

    # log
    if epoch % 10 == 0 or (epoch + 1) == num_epochs:
        print(
            f"epoch {epoch+1:0{len(str(num_epochs))}}/{num_epochs} -> loss: {loss.item():6.4f} | acc: {acc*100:5.2f}%"
        )
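
The model's output layer has no activation because `nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally. The following sketch, using hypothetical logits and targets, illustrates that equivalence and shows how probabilities could be obtained at inference time.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(4, 3)  # hypothetical raw outputs: 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])  # hypothetical class indices

# cross-entropy on raw logits
loss_ce = F.cross_entropy(logits, targets)

# the same loss written as log-softmax followed by negative log-likelihood
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# probabilities are only needed for reporting; argmax is unchanged by softmax
probs = F.softmax(logits, dim=1)
preds = probs.argmax(dim=1)

print(f"losses match: {torch.allclose(loss_ce, loss_manual)}")
print(f"same predictions: {torch.equal(preds, logits.argmax(dim=1))}")
```

Applying softmax inside the model and then passing the result to `nn.CrossEntropyLoss` would normalize twice, so keeping the output layer linear is the safer design.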