<div style="display: flex; justify-content: space-between; align-items: center;">
    <div style="text-align: left; flex: 4">
        <strong>Author:</strong> Amirhossein Heydari -
        📧 <a href="mailto:amirhosseinheydari78@gmail.com">amirhosseinheydari78@gmail.com</a> -
        🐙 <a href="https://github.com/mr-pylin/pytorch-workshop" target="_blank" rel="noopener">github.com/mr-pylin</a>
    </div>
    <div style="text-align: right; flex: 1;">
        <a href="https://pytorch.org/" target="_blank" rel="noopener noreferrer">
            <img src="../assets/images/pytorch/logo/pytorch-logo-dark.svg" 
                 alt="PyTorch Logo"
                 style="max-height: 48px; width: auto; background-color: #ffffff; border-radius: 8px;">
        </a>
    </div>
</div>
<hr>


**Table of contents**<a id='toc0_'></a>    
- [Dependencies](#toc1_)    
- [Multilayer Perceptron (MLP)](#toc2_)    
  - [Forward and Backward Propagation](#toc2_1_)    
    - [Forward Propagation](#toc2_1_1_)    
      - [Input Layer $\rightarrow$ First Hidden Layer](#toc2_1_1_1_)    
      - [Hidden Layer $l$ $\rightarrow$ Hidden Layer $l+1$](#toc2_1_1_2_)    
      - [Last Hidden Layer $\rightarrow$ Output Layer](#toc2_1_1_3_)    
    - [Backward Propagation](#toc2_1_2_)    
      - [Output Layer (Last Layer `L`)](#toc2_1_2_1_)    
      - [Hidden Layers `l = L-1, ..., 1`](#toc2_1_2_2_)    
  - [Limitations](#toc2_2_)    
    - [MLPs vs. Other Architectures](#toc2_2_1_)    
  - [Parameter Initialization](#toc2_3_)    
    - [Weight](#toc2_3_1_)    
    - [Bias](#toc2_3_2_)    
  - [MLP Implementation](#toc2_4_)    
    - [Manual](#toc2_4_1_)    
    - [Using PyTorch](#toc2_4_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Dependencies](#toc0_)


In [None]:
import torch
import torch.nn.functional as F
from torch import nn
from torchinfo import summary

In [None]:
# set a seed for deterministic results
seed = 42
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [None]:
# check if cuda is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# log
device

# <a id='toc2_'></a>[Multilayer Perceptron (MLP)](#toc0_)

- A [**Multilayer Perceptron (MLP)**](https://en.wikipedia.org/wiki/Multilayer_perceptron) is a type of feedforward artificial neural network, also known as a **Fully-Connected Network** or **Dense Network**.
- It consists of at least three layers of nodes: an **input layer**, one or more **hidden layers**, and an **output layer**.

🧬 **Key Characteristics**:

- **Fully Connected**: Every node (neuron) in one layer is connected to every node in the next layer.
- **[Non-Linear Activations](./utils/activation.ipynb)**: Each hidden neuron applies a non-linear activation function to its weighted sum, enabling the network to model complex patterns.
- **[Feedforward](https://en.wikipedia.org/wiki/Feedforward_neural_network)**: Data flows in a single direction, from input to output, with no cycles or loops.

🏛️ **Basic Architecture**:

- **Input Layer**: Receives input features. The number of neurons equals the number of features in the dataset.
- **Hidden Layers**: These layers contain neurons that compute weighted sums and apply activation functions.
- **Output Layer**: Produces the final output, either a single value or a vector of values depending on the task, e.g. [**Regression**](https://en.wikipedia.org/wiki/Regression_analysis) or [**Classification**](https://en.wikipedia.org/wiki/Classification).

<div style="text-align: center; padding-top: 10px;">
    <img src="../assets/images/original/mlp/mlp-general.svg" alt="mlp-general.svg" style="min-width: 512px; width: 80%; height: auto; border-radius: 16px;">
    <p><em>Figure 1: Multilayer Perceptron (a stack of fully connected layers)</em></p>
</div>

**Calculating the number of parameters**:

<table style="margin: 0 auto; text-align:center;">
  <thead>
    <tr>
      <th colspan="2">hidden<sub>1</sub> parameters</th>
      <th colspan="2">hidden<sub>2</sub> parameters</th>
      <th colspan="2">hidden<sub>L-1</sub> parameters</th>
      <th colspan="2">output parameters</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Weights</td>
      <td>Biases</td>
      <td>Weights</td>
      <td>Biases</td>
      <td>Weights</td>
      <td>Biases</td>
      <td>Weights</td>
      <td>Biases</td>
    </tr>
    <tr>
      <td>n × h<sub>1</sub></td>
      <td>h<sub>1</sub></td>
      <td>h<sub>1</sub> × h<sub>2</sub></td>
      <td>h<sub>2</sub></td>
      <td>h<sub>L-2</sub> × h<sub>L-1</sub></td>
      <td>h<sub>L-1</sub></td>
      <td>h<sub>L-1</sub> × o</td>
      <td>o</td>
    </tr>
  </tbody>
  <tfoot>
    <tr>
      <td colspan="2">(n + 1) × h<sub>1</sub></td>
      <td colspan="2">(h<sub>1</sub> + 1) × h<sub>2</sub></td>
      <td colspan="2">(h<sub>L-2</sub> + 1) × h<sub>L-1</sub></td>
      <td colspan="2">(h<sub>L-1</sub> + 1) × o</td>
    </tr>
  </tfoot>
</table>
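
The per-layer formula $(\text{fan-in} + 1) \times \text{fan-out}$ from the table can be checked against `nn.Linear`. This is a minimal sketch: the helper `count_mlp_params` and the layer sizes are hypothetical and chosen only for illustration.

```python
import torch
from torch import nn


def count_mlp_params(layer_sizes: list[int]) -> int:
    """Sum (fan_in + 1) * fan_out over every pair of consecutive layers."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))


# hypothetical sizes: n = 4 inputs, hidden layers of 5 and 3 nodes, o = 2 outputs
layer_sizes = [4, 5, 3, 2]
formula_count = count_mlp_params(layer_sizes)

# the same network built from nn.Linear layers (activation functions hold no parameters)
layers = [nn.Linear(n_in, n_out) for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
actual_count = sum(p.numel() for layer in layers for p in layer.parameters())

print(f"formula: {formula_count}, nn.Linear: {actual_count}")
```

Here the three layers contribute $(4+1) \times 5 = 25$, $(5+1) \times 3 = 18$, and $(3+1) \times 2 = 8$ parameters.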

🛝 **Playgrounds**:

- [deeperplayground.org](https://deeperplayground.org/)
- [alexlenail.me/NN-SVG](https://alexlenail.me/NN-SVG/)


## <a id='toc2_1_'></a>[Forward and Backward Propagation](#toc0_)

### <a id='toc2_1_1_'></a>[Forward Propagation](#toc0_)


#### <a id='toc2_1_1_1_'></a>[Input Layer $\rightarrow$ First Hidden Layer](#toc0_)

$$
\mathbf{Z}^{[1]} = \mathbf{X} \mathbf{W}^{[1]} + \mathbf{1}_m \mathbf{b}^{[1]}
$$

$$
\mathbf{A}^{[1]} = \sigma(\mathbf{Z}^{[1]})
$$

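- $\mathbf{X} \in \mathbb{R}^{m \times n}$ ($m$ samples, $n$ features)
- $\mathbf{W}^{[1]} \in \mathbb{R}^{n \times h_1}$
- $\mathbf{b}^{[1]} \in \mathbb{R}^{1 \times h_1}$ is a row vector with one bias per hidden unit.
- $\mathbf{1}_m \in \mathbb{R}^{m \times 1}$ is a column of ones that repeats the bias for every sample.
- $\mathbf{1}_m \mathbf{b}^{[1]}, \mathbf{Z}^{[1]}, \mathbf{A}^{[1]} \in \mathbb{R}^{m \times h_1}$
- $\sigma$ is the activation function, applied element-wise.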
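
As a small sketch of the first-layer equation, the explicit $\mathbf{1}_m \mathbf{b}^{[1]}$ product can be compared with the broadcasting that PyTorch applies automatically. The dimensions are hypothetical, and ReLU stands in for $\sigma$ as an assumption.

```python
import torch

torch.manual_seed(0)

# hypothetical sizes: batch size m, input features n, hidden units h1
m, n, h1 = 6, 4, 5

X = torch.randn(m, n)
W1 = torch.randn(n, h1)
b1 = torch.randn(1, h1)    # row vector: one bias per hidden unit
ones_m = torch.ones(m, 1)  # column of ones

# literal form of the equation: the outer product copies b1 into every row
Z_explicit = X @ W1 + ones_m @ b1

# equivalent form: broadcasting repeats b1 across the batch dimension
Z_broadcast = X @ W1 + b1

# activation (ReLU is an assumption; any element-wise sigma works the same way)
A1 = torch.relu(Z_broadcast)

print(Z_explicit.shape, torch.allclose(Z_explicit, Z_broadcast))
```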


#### <a id='toc2_1_1_2_'></a>[Hidden Layer $l$ $\rightarrow$ Hidden Layer $l+1$](#toc0_)

For `l = 1, ..., L-2` (transitions between hidden layers; the output layer `L` is handled separately):

$$
\mathbf{Z}^{[l+1]} = \mathbf{A}^{[l]} \mathbf{W}^{[l+1]} + \mathbf{1}_m \mathbf{b}^{[l+1]}
$$

$$
\mathbf{A}^{[l+1]} = \sigma(\mathbf{Z}^{[l+1]})
$$

- $\mathbf{W}^{[l+1]} \in \mathbb{R}^{h_l \times h_{l+1}}$
- $\mathbf{1}_m \mathbf{b}^{[l+1]}, \mathbf{Z}^{[l+1]}, \mathbf{A}^{[l+1]} \in \mathbb{R}^{m \times h_{l+1}}$


#### <a id='toc2_1_1_3_'></a>[Last Hidden Layer $\rightarrow$ Output Layer](#toc0_)

$$
\mathbf{Z}^{[L]} = \mathbf{A}^{[L-1]} \mathbf{W}^{[L]} + \mathbf{1}_m \mathbf{b}^{[L]}
$$

$$
\mathbf{Y}_{\text{logits}} = \mathbf{Z}^{[L]}
$$

$$
\mathbf{Y}_{\text{pred}} = \text{softmax}(\mathbf{Y}_{\text{logits}})
$$

- $\mathbf{W}^{[L]} \in \mathbb{R}^{h_{L-1} \times o}$
- $\mathbf{Z}^{[L]}, \mathbf{Y}_{\text{logits}}, \mathbf{Y}_{\text{pred}} \in \mathbb{R}^{m \times o}$
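
The relation between logits, softmax probabilities, and the cross-entropy loss can be sketched as follows. The shapes and random targets are hypothetical; `F.cross_entropy` takes the logits directly and applies log-softmax internally.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# hypothetical sizes: m samples, o classes
m, o = 6, 3
logits = torch.randn(m, o)           # Y_logits = Z^[L]
targets = torch.randint(0, o, (m,))  # integer class labels

# Y_pred = softmax(Y_logits): every row is a probability distribution
y_pred = torch.softmax(logits, dim=1)
rows_sum_to_one = torch.allclose(y_pred.sum(dim=1), torch.ones(m))

# cross-entropy computed by hand from the probabilities (mean over the batch)
manual_loss = -torch.log(y_pred[torch.arange(m), targets]).mean()

# the built-in loss works on logits, not on probabilities
builtin_loss = F.cross_entropy(logits, targets)

print(rows_sum_to_one, torch.allclose(manual_loss, builtin_loss))
```

Passing probabilities instead of logits to `F.cross_entropy` would apply softmax twice, which is why the PyTorch model later in this notebook returns logits.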


### <a id='toc2_1_2_'></a>[Backward Propagation](#toc0_)


#### <a id='toc2_1_2_1_'></a>[Output Layer (Last Layer `L`)](#toc0_)

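$$
\Delta^{[L]} = \mathbf{Y}_{\text{pred}} - \mathbf{Y} \in \mathbb{R}^{m \times o}
$$

- $\mathbf{Y} \in \mathbb{R}^{m \times o}$ holds the one-hot labels.
- This form holds for softmax combined with a cross-entropy loss summed over the batch; for a batch-averaged loss, divide $\Delta^{[L]}$ by $m$.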

**Gradients:**

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[L]}} = (\mathbf{A}^{[L-1]})^\top \Delta^{[L]} \in \mathbb{R}^{h_{L-1} \times o}
$$

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[L]}} = \sum_{j=1}^{m} \Delta^{[L](j)}
$$


#### <a id='toc2_1_2_2_'></a>[Hidden Layers `l = L-1, ..., 1`](#toc0_)

$$
\Delta^{[l]} = \Delta^{[l+1]} (\mathbf{W}^{[l+1]})^\top \odot \sigma'(\mathbf{Z}^{[l]}) \in \mathbb{R}^{m \times h_l}
$$

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} = (\mathbf{A}^{[l-1]})^\top \Delta^{[l]} \in \mathbb{R}^{h_{l-1} \times h_l}
$$

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[l]}} = \sum_{j=1}^{m} \Delta^{[l](j)}
$$

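- $\mathbf{A}^{[0]} = \mathbf{X}$ (input matrix)  
- $\sigma'(\mathbf{Z}^{[l]})$ = element-wise derivative of activation function  
- $\odot$ = element-wise (Hadamard) product, applied after the matrix product  
- $\Delta^{[l](j)}$ = $j$-th row of $\Delta^{[l]}$ (the error of sample $j$)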
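
The backward equations can be checked numerically against autograd. The sketch below assumes a network with one hidden layer, ReLU as $\sigma$, and a cross-entropy loss summed over the batch (so no $1/m$ factor appears); all sizes and tensors are hypothetical.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# hypothetical sizes: m samples, n features, h hidden units, o classes
m, n, h, o = 6, 4, 5, 3
X = torch.randn(m, n)
Y = F.one_hot(torch.randint(0, o, (m,)), num_classes=o).float()

W1 = (0.5 * torch.randn(n, h)).requires_grad_()
b1 = torch.zeros(1, h, requires_grad=True)
W2 = (0.5 * torch.randn(h, o)).requires_grad_()
b2 = torch.zeros(1, o, requires_grad=True)

# forward pass
Z1 = X @ W1 + b1
A1 = torch.relu(Z1)
Z2 = A1 @ W2 + b2
Y_pred = torch.softmax(Z2, dim=1)

# summed cross-entropy, differentiated by autograd
loss = -(Y * torch.log_softmax(Z2, dim=1)).sum()
loss.backward()

# manual backward pass following the equations above
with torch.no_grad():
    delta2 = Y_pred - Y                          # output-layer error
    dW2 = A1.T @ delta2
    db2 = delta2.sum(dim=0, keepdim=True)
    delta1 = (delta2 @ W2.T) * (Z1 > 0).float()  # ReLU derivative is 1 where Z1 > 0
    dW1 = X.T @ delta1                           # A^[0] = X
    db1 = delta1.sum(dim=0, keepdim=True)

pairs = [(dW1, W1.grad), (db1, b1.grad), (dW2, W2.grad), (db2, b2.grad)]
matches = [torch.allclose(manual, auto, rtol=1e-4, atol=1e-4) for manual, auto in pairs]
print(matches)
```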


## <a id='toc2_2_'></a>[Limitations](#toc0_)

- **Fixed Input and Output Sizes**:
  - Standard MLPs require fixed-size input and output tensors.
  - This makes them less directly applicable to variable-length data without preprocessing (e.g., padding, pooling, or embedding).

- **No Built-in Structure Awareness**:
  - MLPs treat inputs as flat vectors.
  - They do not inherently exploit spatial structure (images), temporal structure (sequences), or relational structure (graphs).
  - This often makes them less parameter-efficient compared to specialized architectures.

- **Scalability Issues**:
  - The number of parameters grows rapidly with input size:
    $$
    \#\text{params} \propto d_{\text{in}} \times d_{\text{out}}
    $$
  - This leads to:
    - higher memory usage
    - higher computational cost
    - increased risk of overfitting

- **Stateless Nature**:
  - MLPs are stateless: the output depends only on the current input.
  - They do not maintain internal memory across multiple inputs.
  - This limits their ability to model sequential or dynamic processes directly.

- **Limited Inductive Bias**:
  - MLPs do not assume any structure in the data.
  - This makes them highly general but often less efficient than architectures designed for specific data types.
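
To make the scalability point concrete, the sketch below counts the parameters of a single dense layer applied to flattened RGB images of growing resolution. The image sizes and the hidden width of 512 are hypothetical; the count uses the $(d_{\text{in}} + 1) \times d_{\text{out}}$ formula rather than allocating the layers.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Weights (d_in * d_out) plus biases (d_out) of one fully connected layer."""
    return (d_in + 1) * d_out


hidden = 512  # hypothetical width of the first hidden layer

counts = {}
for side in (28, 64, 224):
    d_in = 3 * side * side  # flattened RGB image of shape (3, side, side)
    counts[side] = dense_params(d_in, hidden)
    print(f"{side}x{side} RGB -> d_in = {d_in:>7,} -> params = {counts[side]:>11,}")
```

Under these assumptions, the first layer alone grows from roughly 1.2 million parameters at 28×28 to roughly 77 million at 224×224.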


### <a id='toc2_2_1_'></a>[MLPs vs. Other Architectures](#toc0_)

- **MLP vs. [CNN](./07-convolutional-neural-networks.ipynb)**  
  - CNNs exploit spatial locality and weight sharing.
  - This makes CNNs far more parameter-efficient and effective for image data.

- **MLP vs. [RNN](./11-recurrent-neural-networks.ipynb)**  
  - RNNs maintain hidden state across time steps.
  - This enables modeling temporal dependencies in sequential data.

- **MLP vs. Transformer**
  - Transformers use attention mechanisms to model relationships between all input elements.
  - They handle sequential and structured data more effectively than standard MLPs.
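
The parameter-efficiency gap in the CNN comparison can be illustrated with a small sketch. It assumes a hypothetical 3×32×32 input mapped to 16 feature maps of the same spatial size; the dense count is computed by formula instead of allocating the layer.

```python
import torch
from torch import nn

# hypothetical setting: 3x32x32 input -> 16 feature maps of size 32x32
in_ch, out_ch, side = 3, 16, 32

# convolution: one 3x3 kernel per (input channel, output channel) pair, shared across positions
conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
conv_params = sum(p.numel() for p in conv.parameters())

# dense layer producing the same number of outputs from the flattened input
d_in = in_ch * side * side
d_out = out_ch * side * side
dense_params = (d_in + 1) * d_out

out = conv(torch.randn(1, in_ch, side, side))
print(f"conv output shape: {tuple(out.shape)}")
print(f"conv params : {conv_params:,}")
print(f"dense params: {dense_params:,}")
```

The convolution needs $16 \times 3 \times 9 + 16 = 448$ parameters, while the equivalent dense mapping needs more than 50 million.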


## <a id='toc2_3_'></a>[Parameter Initialization](#toc0_)

- Initialization occurs once when the layer is created.
- Parameters are updated during training by the optimizer.
- Initialization is defined in `reset_parameters()` of `nn.Linear`.
- More Details about Initialization: [**hyperparameter.ipynb**](./utils/hyperparameter.ipynb)


In [None]:
# example dimensions
fan_in = 3
fan_out = 4

### <a id='toc2_3_1_'></a>[Weight](#toc0_)

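- By default, `nn.Linear` initializes weights with **Kaiming (He) uniform initialization**, called with `a=sqrt(5)` (the negative-slope argument of `kaiming_uniform_`).
- With this setting the gain is $\sqrt{2 / (1 + a^2)} = \sqrt{1/3}$, so the Kaiming bound $\text{gain} \cdot \sqrt{3 / n_{\text{in}}}$ reduces to $1 / \sqrt{n_{\text{in}}}$ and the weights are sampled from:

  $$
  W_{ij} \sim \mathcal{U}\left(
  -\frac{1}{\sqrt{n_{\text{in}}}},
  \frac{1}{\sqrt{n_{\text{in}}}}
  \right)
  $$

- Kaiming uniform with the ReLU gain ($\sqrt{2}$) is not the default. It is well suited for layers followed by ReLU or similar activation functions (see [**activation.ipynb**](./utils/activation.ipynb) for more info) and samples from the wider range:

  $$
  W_{ij} \sim \mathcal{U}\left(
  -\sqrt{\frac{6}{n_{\text{in}}}},
  \sqrt{\frac{6}{n_{\text{in}}}}
  \right)
  $$

- where:
  - $n_{\text{in}}$ is the number of input features (fan-in)


In [None]:
# create an empty tensor (uninitialized memory) in the <nn.Linear> weight layout: (fan_out, fan_in)
W = torch.empty((fan_out, fan_in))

# default distribution for <nn.Linear> weights: bound = 1 / sqrt(fan_in)
nn.init.kaiming_uniform_(W, a=5**0.5)

# ReLU-gain variant (not the default): bound = sqrt(6 / fan_in)
W_relu = nn.init.kaiming_uniform_(torch.empty((fan_out, fan_in)), mode="fan_in", nonlinearity="relu")

# log
print(f"W:\n{W}")
print(f"W_relu:\n{W_relu}")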

### <a id='toc2_3_2_'></a>[Bias](#toc0_)

- Biases are initialized from a uniform distribution:

  $$
  b_i \sim \mathcal{U}\left(
  -\frac{1}{\sqrt{n_{\text{in}}}},
  \frac{1}{\sqrt{n_{\text{in}}}}
  \right)
  $$

- Biases are **not initialized to zero** by default.
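
A quick empirical check of the bounds that `nn.Linear` applies at construction time. This sketch assumes the behavior of `reset_parameters()` in current PyTorch releases, where both weights and biases end up within $\pm 1 / \sqrt{n_{\text{in}}}$; the layer sizes are hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)

# hypothetical sizes; many output units so the extremes of the range are well sampled
in_features, out_features = 3, 4000
layer = nn.Linear(in_features, out_features)

bound = 1 / in_features**0.5
w_max = layer.weight.abs().max().item()
b_max = layer.bias.abs().max().item()

# small tolerance for float32 rounding at the edge of the interval
eps = 1e-6
print(f"weight shape: {tuple(layer.weight.shape)}")  # stored as (out_features, in_features)
print(f"bound = {bound:.4f}, max|W| = {w_max:.4f}, max|b| = {b_max:.4f}")
print(w_max <= bound + eps, b_max <= bound + eps)
```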


In [None]:
# create an empty tensor (uninitialized memory): one bias per output unit
b = torch.empty(fan_out)

# default distribution for <nn.Linear> biases: bound = 1 / sqrt(fan_in)
nn.init.uniform_(b, -1 / fan_in**0.5, 1 / fan_in**0.5)

# log
print(f"b:\n{b}")

## <a id='toc2_4_'></a>[MLP Implementation](#toc0_)


### <a id='toc2_4_1_'></a>[Manual](#toc0_)


In [None]:
class CustomMLP(torch.nn.Module):
    def __init__(self, n_input: int, hidden_sizes: list[int], n_output: int):
        super().__init__()

        # layer sizes
        self.layer_sizes = [n_input] + hidden_sizes + [n_output]
        self.L = len(self.layer_sizes) - 1  # number of layers

        # create weight and bias parameters manually
        self.weights = torch.nn.ParameterList()
        self.biases = torch.nn.ParameterList()
        for l in range(self.L):
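            W = torch.nn.Parameter(torch.empty(self.layer_sizes[l], self.layer_sizes[l + 1]))
            b = torch.nn.Parameter(torch.zeros(1, self.layer_sizes[l + 1]))
            # Kaiming initialization for weights; W is stored as (in, out), the transpose of the
            # <nn.Linear> layout, so mode="fan_out" (which reads W.shape[0]) selects the true fan-in
            torch.nn.init.kaiming_uniform_(W, mode="fan_out", nonlinearity="relu")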
            self.weights.append(W)
            self.biases.append(b)

    def forward(self, x: torch.Tensor, return_all: bool = False):
        """Forward pass storing pre-activations and activations"""
        a = x
        activations = [a]  # a^{[0]} = X
        pre_acts = []

        for l in range(self.L - 1):
            z = torch.matmul(a, self.weights[l]) + self.biases[l]
            pre_acts.append(z)
            a = torch.relu(z)
            activations.append(a)

        # output layer
        z = torch.matmul(a, self.weights[-1]) + self.biases[-1]
        pre_acts.append(z)
        y_pred = torch.softmax(z, dim=1)
        activations.append(y_pred)

        if return_all:
            return y_pred, pre_acts, activations
        return y_pred

    def backward(self, x: torch.Tensor, y_true: torch.Tensor):
        """
        Manual backward propagation for cross-entropy loss + softmax
        x: input batch (m x n)
        y_true: one-hot labels (m x o)
        """
        m = x.shape[0]

        # forward pass and store intermediate values
        y_pred, pre_acts, activations = self.forward(x, return_all=True)

        # initialize gradient containers
        dW = [torch.zeros_like(W) for W in self.weights]
        db = [torch.zeros_like(b) for b in self.biases]

        # output layer error (softmax + cross-entropy)
        delta = (y_pred - y_true) / m  # shape: (m x o)
        dW[-1] = torch.matmul(activations[-2].T, delta)
        db[-1] = delta.sum(dim=0, keepdim=True)

        # backprop through hidden layers
        for l in reversed(range(self.L - 1)):
            # derivative of ReLU
            dz = delta.matmul(self.weights[l + 1].T) * (pre_acts[l] > 0).float()
            delta = dz
            dW[l] = torch.matmul(activations[l].T, delta)
            db[l] = delta.sum(dim=0, keepdim=True)

        return dW, db

In [None]:
# parameters
n_input = 4            # number of input features
hidden_sizes = [5, 3]  # two hidden layers: 5 and 3 nodes
n_output = 2           # number of classes
batch_size = 6

# create random input data
X = torch.randn(batch_size, n_input)

# create random one-hot labels
y_indices = torch.randint(0, n_output, (batch_size,))
y_onehot = torch.zeros(batch_size, n_output)
y_onehot[torch.arange(batch_size), y_indices] = 1

In [None]:
# instantiate the MLP
custom_model = CustomMLP(n_input=n_input, hidden_sizes=hidden_sizes, n_output=n_output)
custom_model

In [None]:
# model summary
summary(custom_model, input_size=(batch_size, n_input), device="cpu")

In [None]:
# forward pass (call the module itself rather than .forward() so that hooks are honored)
y_pred = custom_model(X)

# log
print(f"predictions:\n{y_pred}")

In [None]:
# backward pass
dW, db = custom_model.backward(X, y_onehot)

# log
for l in range(len(dW)):
    print(f"layer {l+1} gradients:")
    print(f"dW:\n{dW[l]}")
    print(f"db:\n{db[l]}\n")

### <a id='toc2_4_2_'></a>[Using PyTorch](#toc0_)

- Refer to [**mnist-classification.ipynb**](./projects/mnist-classification/implementation-1/mnist-classification.ipynb) for a complete, end-to-end example of training an MLP.

📚 **Tutorials**:

- Neural Networks: [pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial](https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial)
- Training a Classifier: [pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html)


In [None]:
class PytorchMLP(nn.Module):
    def __init__(self, n_input: int, hidden_sizes: list[int], n_output: int):
        super().__init__()

        # build layers
        layers = []
        in_features = n_input
        for h in hidden_sizes:
            layers.append(nn.Linear(in_features, h))
            layers.append(nn.ReLU())
            in_features = h
            
        # output layer
        layers.append(nn.Linear(in_features, n_output))
        self.model = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.model(x)

In [None]:
# parameters
n_input = 4
hidden_sizes = [3]
n_output = 2
batch_size = 6

# random input data
X = torch.randn(batch_size, n_input)

# random label indices for classification
y_true = torch.randint(0, n_output, (batch_size,))

In [None]:
# instantiate model
pytorch_model = PytorchMLP(n_input, hidden_sizes, n_output)
pytorch_model

In [None]:
summary(pytorch_model, input_size=(batch_size, n_input), device="cpu")

In [None]:
# forward pass
logits = pytorch_model(X)

# log
print("Logits:\n", logits)

In [None]:
# define loss function
criterion = nn.CrossEntropyLoss()  # expects logits + integer labels

# compute loss
loss = criterion(logits, y_true)
print(f"loss: {loss.item()}")

# backward pass
loss.backward()  # computes gradients for all parameters

# log
print(f"gradients for first layer weights:\n{pytorch_model.model[0].weight.grad}")