📝 **Author:** Amirhossein Heydari - 📧 **Email:** amirhosseinheydari78@gmail.com - 📍 **Linktree:** [linktr.ee/mr_pylin](https://linktr.ee/mr_pylin)

---

# Dependencies

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn.functional as F
from torch import nn
from torchinfo import summary

In [2]:
# set a seed for deterministic results
random_state = 42
torch.manual_seed(random_state)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [3]:
# check if cuda is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

# Multilayer Perceptron
   - A [Multilayer Perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron) (aka Fully-Connected Network or Dense Network) is a class of feedforward artificial neural networks
   - An MLP consists of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer

**Key Characteristics**:
   - Fully Connected: Every node (neuron) in one layer is connected to every node in the next layer.
   - Non-Linear [Activation](https://en.wikipedia.org/wiki/Activation_function): Each neuron applies a non-linear activation function, allowing the network to model complex patterns.
   - [Feedforward](https://en.wikipedia.org/wiki/Feedforward_neural_network): Data moves in one direction, from input to output, with no cycles or loops.

**Basic Architecture**:
   - Input Layer: Receives the input features. The number of neurons here equals the number of features in the input data.
   - Hidden Layers: Each hidden layer contains a set of neurons that apply weighted sums and activation functions. The number of neurons and layers is a [hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_%28machine_learning%29).
   - Output Layer: Produces the final output, which could be a single value ([regression](https://en.wikipedia.org/wiki/Regression_analysis)) or a set of values ([classification](https://en.wikipedia.org/wiki/Classification)).

<figure style="text-align: center;">
    <img src="../assets/images/SVGs/multi-layer-perceptron.svg" alt="multi-layer-perceptron.svg" style="width: 100%;">
    <figcaption style="text-align: center;">Multi-Layer-Perceptron (aka fully connected layers)</figcaption>
</figure>

<table style="margin-left:auto;margin-right:auto;text-align:center;">
  <thead>
    <tr>
      <th colspan="2">hidden<sub>1</sub> parameters</th>
      <th colspan="2">hidden<sub>2</sub> parameters</th>
      <th colspan="2">logits parameters</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Weights</td>
      <td>Biases</td>
      <td>Weights</td>
      <td>Biases</td>
      <td>Weights</td>
      <td>Biases</td>
    </tr>
    <tr>
      <td>A × B</td>
      <td>B</td>
      <td>B × C</td>
      <td>C</td>
      <td>C × D</td>
      <td>D</td>
    </tr>
  </tbody>
  <tfoot>
    <tr>
      <td colspan="2">(A + 1) × B</td>
      <td colspan="2">(B + 1) × C</td>
      <td colspan="2">(C + 1) × D</td>
    </tr>
  </tfoot>
</table>

**Activation Functions**:
   - [Activation functions](./04_activation-functions.ipynb) introduce non-linearity into the network, enabling it to learn and model complex data.
   - Common activation functions include:
      - Sigmoid: $\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$ (good for binary classification)
      - Tanh: $\text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ (outputs range from -1 to 1)
      - ReLU (Rectified Linear Unit): $\text{ReLU}(x) = \max(0, x)$ (popular in deep learning)

**Training an MLP**:
   - Forward Pass: Calculate the output using the current weights and biases.
   - Loss Calculation: Compute the loss using a [loss function](./05_loss-functions.ipynb), such as Mean Squared Error (MSE) for regression or Cross-Entropy for classification.
   - Backward Pass (Backpropagation): Calculate the gradient of the loss function with respect to each weight and bias.
   - Weight Update: Update the weights and biases using an optimization algorithm like Gradient Descent or Adam.

**Limitations of MLPs**:
   - Scalability: MLPs with many layers and neurons require significant computational resources.
   - [Vanishing Gradients](https://en.wikipedia.org/wiki/Vanishing_gradient_problem): In deep networks, gradients can become very small, making training difficult.
   - Data Efficiency: MLPs generally require a large amount of data to perform well.

**MLPs vs. Other Architectures**:
   - MLPs vs. [CNNs (Convolutional Neural Networks)](./12_convolutional-neural-networks.ipynb): CNNs are better suited for image data because they can capture spatial hierarchies, while MLPs are more general-purpose.
   - MLPs vs. [RNNs (Recurrent Neural Networks)](./18_recurrent-neural-networks.ipynb): RNNs are used for sequential data (e.g., time series, language modeling) because they can handle temporal dependencies.

**Notes**:
   - loss function : 
      - multi-class classification : `torch.nn.CrossEntropyLoss` = `torch.nn.LogSoftmax` + `torch.nn.NLLLoss`
      - [pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)
      - [pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html)
   - activation function for the last layer:
      - when using `torch.nn.CrossEntropyLoss` as a loss function, the output layer doesn't need an activation function
      - `torch.nn.CrossEntropyLoss` calculates `torch.nn.LogSoftmax` and `torch.nn.NLLLoss` internally.
      - [pytorch.org/docs/stable/generated/torch.nn.Softmax.html](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html)
      - [pytorch.org/docs/stable/generated/torch.nn.LogSoftmax.html](https://pytorch.org/docs/stable/generated/torch.nn.LogSoftmax.html)
   - `torch.nn.Linear`
      - Weights
         - Initialized based on a scheme similar to Xavier/Glorot initialization
         - Uniform Distribution [default]: $W \sim \mathcal{U}\left(-{gain}\times\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, {gain}\times\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)$
         - Normal Distribution: $W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$
      - Biases:
         - Initialized to zero
      - [pytorch.org/docs/stable/nn.init.html](https://pytorch.org/docs/stable/nn.init.html)
      - Paper: [Understanding the difficulty of training deep feedforward neural networks - Glorot, X. & Bengio, Y. (2010).](https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)

**Playground**:
   - [deeperplayground.org](https://deeperplayground.org/)
   - [alexlenail.me/NN-SVG](https://alexlenail.me/NN-SVG/)

## Forward Propagation Using Linear Algebra
   - Layer 1 (First Hidden Layer)
      - Input: $x \in ℝ^d$, where $d$ is the number of input features.
      - Weights: $W^{(1)} \in ℝ^{h_1 \times d}$, where $h_1$​ is the number of neurons in the first hidden layer.
      - Biases: $b^{(1)} \in ℝ^{h_1}$.
      - The transformation for the first hidden layer is:
      $$\mathbf{z}^{(1)} = \sigma(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)})$$

   - For each subsequent layer $l$, where $l = \{2, 3, \ldots, 𝐿 − 1\}$
      - Input from the previous layer: $z^{(l-1)} \in ℝ^{h_{l-1}}$.
      - Weights: $W^{(l)} \in ℝ^{h_l \times h_{l-1}}$, where $h_l$​ is the number of neurons in the $l$-th hidden layer.
      - Biases: $b^{(1)} \in ℝ^{h_l}$.
      - The transformation for each hidden layer is:
      $$\mathbf{z}^{(l)} = \sigma(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)})$$

   - Output Layer
      - Weights: $W^{(L)} \in ℝ^{o \times h_{L-1}}$, where $o$ is the number of output neurons.
      - Biases: $b^{(L)} \in ℝ^{o}$.
      - The transformation for the output is:
      $$\mathbf{\hat{y}} = \sigma_L(\mathbf{W}^{(L)} \mathbf{a}^{(L-1)} + \mathbf{b}^{(L)})$$

In [4]:
class MLP(torch.nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size) -> None:
        super(MLP, self).__init__()
        
        # initialize weights and biases for the first hidden layer
        self.W1 = nn.Parameter(torch.randn(hidden_size1, input_size))
        self.b1 = nn.Parameter(torch.randn(hidden_size1))
        
        # initialize weights and biases for the second hidden layer
        self.W2 = nn.Parameter(torch.randn(hidden_size2, hidden_size1))
        self.b2 = nn.Parameter(torch.randn(hidden_size2))
        
        # initialize weights and biases for the output layer
        self.W3 = nn.Parameter(torch.randn(output_size, hidden_size2))
        self.b3 = nn.Parameter(torch.randn(output_size))
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.z1 = x @ self.W1.T + self.b1
        self.a1 = F.relu(self.z1)
        
        self.z2 = self.a1 @ self.W2.T + self.b2
        self.a2 = F.relu(self.z2)
        
        self.z3 = self.a2 @ self.W3.T + self.b3
        return self.z3

In [5]:
# example input
batch_size = 3
x = torch.randn(batch_size, 10)
y = torch.randn(batch_size, 2)

# initialize the MLP
input_size = 10   # number of input features
hidden_size1 = 5  # number of neurons in the first hidden layer
hidden_size2 = 3  # number of neurons in the second hidden layer
output_size = 2   # number of output neurons (e.g., for binary classification)

model_1 = MLP(input_size, hidden_size1, hidden_size2, output_size)
model_1

MLP()

In [6]:
summary(model_1, input_size=(x.size()), device='cpu')

Layer (type:depth-idx)                   Output Shape              Param #
MLP                                      [3, 2]                    81
Total params: 81
Trainable params: 81
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 0
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00

In [7]:
# perform forward propagation
with torch.no_grad():
    y_pred = model_1.forward(x)

# log
print(f"y_pred:\n{y_pred}")

y_pred:
tensor([[ 5.2491,  6.4449],
        [-0.6936,  0.9967],
        [-0.0099,  1.6234]])


## Gradient Computation and Backpropagation
   - Compute the Loss:
   $$\mathcal{L}(\mathbf{\hat{y}}, \mathbf{y})$$
   - Backpropagation
      - Compute the gradient of the loss with respect to the output layer weights and biases:
      $$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(L)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}} \cdot \frac{\partial \mathbf{z}^{(L)}}{\partial \mathbf{W}^{(L)}}$$
      - Compute gradients for the weights and biases of each preceding layer:
      $$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}} \cdot \frac{\partial \mathbf{z}^{(l)}}{\partial \mathbf{W}^{(l)}}$$
   - Update the Parameters using a gradient-based optimization algorithm like Gradient Descent or Adam:
   $$\mathbf{W}^{(l)} = \mathbf{W}^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}$$

In [8]:
def backward(self, x: torch.Tensor, y: torch.Tensor, learning_rate: float):
    # compute the loss (Mean Squared Error - MSE)
    # loss = (1/N) * sum((z3 - y)^2) over all batch samples
    loss = torch.mean((self.z3 - y) ** 2)

    # compute the gradient of the loss with respect to z3 (output layer pre-activation)
    # this is the local gradient for the loss function with respect to z3
    # d(loss)/d(z3) = 2 * (z3 - y) / N
    loss_grad = 2 * (self.z3 - y) / y.size(0)

    # compute the gradient of the loss with respect to W3 (weights between hidden layer 2 and output layer)
    # d(loss)/d(W3) = d(loss)/d(z3) * d(z3)/d(W3)
    # d(z3)/d(W3) = a2^T (activation of hidden layer 2)
    # grad_W3 = (loss_grad)^T * a2
    grad_W3 = torch.matmul(loss_grad.T, self.a2)

    # compute the gradient of the loss with respect to b3 (biases of the output layer)
    # d(loss)/d(b3) = d(loss)/d(z3) * d(z3)/d(b3)
    # d(z3)/d(b3) = 1 (bias gradient accumulates over the batch dimension)
    # grad_b3 = sum(loss_grad) across batch dimension
    grad_b3 = torch.sum(loss_grad, dim=0)

    # backpropagate the gradient to the second hidden layer (w.r.t. a2)
    # compute the gradient of the loss with respect to a2 (activation of hidden layer 2)
    # d(loss)/d(a2) = d(loss)/d(z3) * d(z3)/d(a2)
    # d(z3)/d(a2) = W3 (weights between hidden layer 2 and output layer)
    grad_a2 = torch.matmul(loss_grad, self.W3)

    # compute the gradient of the loss with respect to z2 (pre-activation of hidden layer 2)
    # this is the local gradient for ReLU at the second hidden layer
    # d(z2)/d(a2) = ReLU'(z2) (element-wise derivative of ReLU)
    # grad_z2 = grad_a2 * ReLU'(z2) (ReLU'(z2) is 1 where z2 > 0, else 0)
    grad_z2 = grad_a2 * (self.a2 > 0).float()

    # compute the gradient of the loss with respect to W2 (weights between hidden layer 1 and hidden layer 2)
    # d(loss)/d(W2) = d(loss)/d(z2) * d(z2)/d(W2)
    # d(z2)/d(W2) = a1^T (activation of hidden layer 1)
    # grad_W2 = (grad_z2)^T * a1
    grad_W2 = torch.matmul(grad_z2.T, self.a1)

    # compute the gradient of the loss with respect to b2 (biases of hidden layer 2)
    # d(loss)/d(b2) = d(loss)/d(z2) * d(z2)/d(b2)
    # d(z2)/d(b2) = 1 (bias gradient accumulates over the batch dimension)
    # grad_b2 = sum(grad_z2) across batch dimension
    grad_b2 = torch.sum(grad_z2, dim=0)

    # backpropagate the gradient to the first hidden layer (w.r.t. a1)
    # compute the gradient of the loss with respect to a1 (activation of hidden layer 1)
    # d(loss)/d(a1) = d(loss)/d(z2) * d(z2)/d(a1)
    # d(z2)/d(a1) = W2 (weights between hidden layer 1 and hidden layer 2)
    grad_a1 = torch.matmul(grad_z2, self.W2)

    # compute the gradient of the loss with respect to z1 (pre-activation of hidden layer 1)
    # this is the local gradient for ReLU at the first hidden layer
    # d(z1)/d(a1) = ReLU'(z1) (element-wise derivative of ReLU)
    # grad_z1 = grad_a1 * ReLU'(z1) (ReLU'(z1) is 1 where z1 > 0, else 0)
    grad_z1 = grad_a1 * (self.a1 > 0).float()

    # compute the gradient of the loss with respect to W1 (weights between input layer and hidden layer 1)
    # d(loss)/d(W1) = d(loss)/d(z1) * d(z1)/d(W1)
    # d(z1)/d(W1) = x^T (input features)
    # grad_W1 = (grad_z1)^T * x
    grad_W1 = torch.matmul(grad_z1.T, x)

    # compute the gradient of the loss with respect to b1 (biases of hidden layer 1)
    # d(loss)/d(b1) = d(loss)/d(z1) * d(z1)/d(b1)
    # d(z1)/d(b1) = 1 (bias gradient accumulates over the batch dimension)
    # grad_b1 = sum(grad_z1) across batch dimension
    grad_b1 = torch.sum(grad_z1, dim=0)

    # update parameters using gradients (Gradient Descent step)
    with torch.no_grad():
        self.W1 -= learning_rate * grad_W1
        self.b1 -= learning_rate * grad_b1
        self.W2 -= learning_rate * grad_W2
        self.b2 -= learning_rate * grad_b2
        self.W3 -= learning_rate * grad_W3
        self.b3 -= learning_rate * grad_b3

In [9]:
MLP.backward = backward

In [10]:
# example input
batch_size = 3
x = torch.randn(batch_size, 10)
y = torch.randn(batch_size, 2)

# initialize the MLP
input_size = 10   # Number of input features
hidden_size1 = 5  # Number of neurons in the first hidden layer
hidden_size2 = 3  # Number of neurons in the second hidden layer
output_size = 2   # Number of output neurons

model_2 = MLP(input_size, hidden_size1, hidden_size2, output_size)
summary(model_2, input_size= x.size(), device='cpu')

Layer (type:depth-idx)                   Output Shape              Param #
MLP                                      [3, 2]                    81
Total params: 81
Trainable params: 81
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 0
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00

In [11]:
# perform forward propagation
with torch.no_grad():
    y_pred_1 = model_2.forward(x)

# Perform backward propagation and update weights
learning_rate = 0.01
model_2.backward(x, y, learning_rate)

# Perform forward propagation again to see updated output
with torch.no_grad():
    y_pred_2 = model_2.forward(x)

# log
print(f"y_true:\n{y}\n")
print(f"output before backpropagation:\n{y_pred_1}\n")
print(f"output after backpropagation:\n{y_pred_2}")

y_true:
tensor([[ 0.6131, -1.0648],
        [ 0.1055,  1.9739],
        [ 1.0703, -1.7379]])

output before backpropagation:
tensor([[-1.6613,  6.9142],
        [-0.9228,  5.4186],
        [-1.0267,  5.6290]])

output after backpropagation:
tensor([[0.8538, 1.7682],
        [0.3796, 2.6229],
        [0.8538, 1.7682]])


## Multilayer Perceptron Using PyTorch
   - Refer to this [notebook](./projects/00_multi-layer-perceptron.ipynb) for a comprehensive example on the MLP concept.

In [12]:
class MLP2(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size):
        super(MLP2, self).__init__()
        # define layers using nn.Linear
        self.fc1 = nn.Linear(input_size, hidden_size1)    # first hidden layer
        self.fc2 = nn.Linear(hidden_size1, hidden_size2)  # second hidden layer
        self.fc3 = nn.Linear(hidden_size2, output_size)   # output layer
        
        # define activation function (ReLU)
        self.relu = nn.ReLU()

    def forward(self, x):
        # forward pass through the network
        x = self.relu(self.fc1(x))  # first hidden layer with ReLU
        x = self.relu(self.fc2(x))  # second hidden layer with ReLU
        x = self.fc3(x)             # output layer (no activation here)
        return x

In [13]:
# example input
batch_size = 3
x = torch.randn(batch_size, 10)
y = torch.randn(batch_size, 2)

In [14]:
# initialize the MLP
input_size = 10   # number of input features
hidden_size1 = 5  # number of neurons in the first hidden layer
hidden_size2 = 3  # number of neurons in the second hidden layer
output_size = 2   # number of output neurons

model_3 = MLP2(input_size, hidden_size1, hidden_size2, output_size)
model_3

MLP2(
  (fc1): Linear(in_features=10, out_features=5, bias=True)
  (fc2): Linear(in_features=5, out_features=3, bias=True)
  (fc3): Linear(in_features=3, out_features=2, bias=True)
  (relu): ReLU()
)

In [15]:
summary(model_3, input_size= x.size(), device='cpu')

Layer (type:depth-idx)                   Output Shape              Param #
MLP2                                     [3, 2]                    --
├─Linear: 1-1                            [3, 5]                    55
├─ReLU: 1-2                              [3, 5]                    --
├─Linear: 1-3                            [3, 3]                    18
├─ReLU: 1-4                              [3, 3]                    --
├─Linear: 1-5                            [3, 2]                    8
Total params: 81
Trainable params: 81
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 0.00
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00

In [16]:
# define a loss function
criterion = nn.MSELoss()

# define an optimizer (e.g., SGD)
optimizer = torch.optim.SGD(model_3.parameters(), lr=0.01)

# training loop
num_epochs = 100  # Number of epochs

for epoch in range(num_epochs):
    
    # forward pass
    output = model_3(x)
    
    # compute the loss
    loss = criterion(output, y)
    
    # perform backward propagation automatically
    loss.backward()
    
    # update the weights & zero the gradients
    optimizer.step()
    optimizer.zero_grad()
    
    # log
    if (epoch + 1) % 10 == 0:
        print(f'epoch {epoch+1:3}/{num_epochs}  ->  Loss: {loss.item():.4f}')

epoch  10/100  ->  Loss: 1.0662
epoch  20/100  ->  Loss: 0.8673
epoch  30/100  ->  Loss: 0.7438
epoch  40/100  ->  Loss: 0.6574
epoch  50/100  ->  Loss: 0.5912
epoch  60/100  ->  Loss: 0.5368
epoch  70/100  ->  Loss: 0.4932
epoch  80/100  ->  Loss: 0.4588
epoch  90/100  ->  Loss: 0.4289
epoch 100/100  ->  Loss: 0.4014
