# Understanding Multi-Layer Perceptrons (MLP)

## Introduction to MLP

Multi-Layer Perceptrons (MLPs) are a type of artificial neural network that model complex relationships in data through multiple layers of neurons. An MLP consists of at least three layers: an input layer, one or more hidden layers, and an output layer. Each neuron in a layer connects to every neuron in the subsequent layer; each connection carries a weight, and each hidden or output neuron has a bias.

## Mathematical Background

The mathematical operations within an MLP can be broken down into two major phases: the forward pass and the backward pass (used for training via backpropagation).

### Forward Pass

The forward pass involves computing the output of the neural network for a given input. This is achieved through the following steps:

1. **Input Layer**: Receives the input features.
2. **Hidden Layers**:
   - Each neuron in a hidden layer computes the weighted sum of its inputs, which is then passed through a nonlinear activation function. The output of each neuron can be mathematically represented as follows:
     $$ a_j^{(l)} = \sigma\left(\sum_{i} w_{ij}^{(l)} x_i + b_j^{(l)} \right) $$
   - Where:
     - $a_j^{(l)}$ is the activation of the $j$-th neuron in the $l$-th layer,
     - $w_{ij}^{(l)}$ are the weights connecting the $i$-th neuron in the $(l-1)$-th layer to the $j$-th neuron in the $l$-th layer,
     - $b_j^{(l)}$ is the bias of the $j$-th neuron in the $l$-th layer,
     - $\sigma$ is the activation function (e.g., Sigmoid, ReLU),
     - $x_i$ are the inputs from the previous layer or the input layer.

3. **Output Layer**:
   - The final layer's output calculation is similar to that of the hidden layers, but its function may differ depending on the task (e.g., softmax for classification).
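
The layer-by-layer computation described above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the `forward` helper, the 2-3-1 layer sizes, and the constant weights are assumptions chosen for the example, and sigmoid is used at every layer for simplicity.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Propagate x through every layer and return the activations of each layer."""
    activations = [x]
    for W, b in zip(weights, biases):
        # W has shape (n_in, n_out), so W[i, j] plays the role of w_ij
        z = activations[-1] @ W + b
        activations.append(sigmoid(z))
    return activations

# Hypothetical 2-3-1 network with fixed illustrative parameters
weights = [np.full((2, 3), 0.1), np.full((3, 1), 0.2)]
biases = [np.zeros(3), np.zeros(1)]

activations = forward(np.array([1.0, 2.0]), weights, biases)
print("hidden activations:", activations[1])
print("network output:", activations[-1])
```

Storing each weight matrix with shape `(n_in, n_out)` keeps the code aligned with the $w_{ij}$ indexing used in the formula.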

### Backward Pass (Backpropagation)

Backpropagation is used to update the weights and biases of the network based on the error in output. The process includes:

1. **Error Calculation**:
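   - The error between the actual output and the predicted output is calculated, often using a squared-error loss closely related to mean squared error (MSE); the factor $\frac{1}{2}$ is a convention that cancels when the loss is differentiated: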
     $$ E = \frac{1}{2} \sum (y - \hat{y})^2 $$
   - Where $y$ is the true value and $\hat{y}$ is the predicted value.

2. **Gradient Descent**:
   - The gradients of the error with respect to each weight and bias are calculated to update the parameters:
     $$ w_{ij}^{(l)} = w_{ij}^{(l)} - \eta \frac{\partial E}{\partial w_{ij}^{(l)}} $$
   - Where $\eta$ is the learning rate.

3. **Propagation of Error**:
   - The error is propagated back through the network, updating each weight and bias according to its contribution to the output error.
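
To make the update rule concrete, here is a small sketch of gradient descent on the simplest possible model, a single weight with prediction $\hat{y} = w x$ and no activation function. The model, the data point, and the learning rate are assumptions chosen only to show the repeated application of $w \leftarrow w - \eta \, \partial E / \partial w$.

```python
def squared_error(w, x, y):
    """E = 1/2 * (y - y_hat)^2 with y_hat = w * x."""
    return 0.5 * (y - w * x) ** 2

def gradient(w, x, y):
    """dE/dw = (y_hat - y) * x for y_hat = w * x."""
    return (w * x - y) * x

# Illustrative values (assumptions): one data point and a fixed learning rate
w, x, y, eta = 0.0, 2.0, 4.0, 0.1

losses = []
for step in range(20):
    losses.append(squared_error(w, x, y))
    w = w - eta * gradient(w, x, y)  # gradient descent update

print("final weight:", w)
print("first and last loss:", losses[0], losses[-1])
```

With these values the weight approaches 2, the value at which $w x$ equals the target.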

### Derivatives Calculation

To update the weights and biases, we calculate the derivatives as follows:

- For the output layer, with $w_{jk}$ the weight from hidden neuron $j$ to output neuron $k$:
  $$ \frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_k} \cdot \frac{\partial z_k}{\partial w_{jk}} $$
  Where:
  - $\frac{\partial E}{\partial \hat{y}} = \hat{y} - y$ (for MSE),
  - $\frac{\partial \hat{y}}{\partial z_k} = \sigma'(z_k)$ (derivative of the activation function),
  - $\frac{\partial z_k}{\partial w_{jk}} = a_j$ (activation from the previous layer).

- For hidden layers, similarly, with $w_{ij}$ the weight from input $i$ to hidden neuron $j$:
  $$ \frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_j} \cdot \frac{\partial a_j}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{ij}} $$
  Where:
  - $\frac{\partial E}{\partial a_j}$ is propagated error from subsequent layers,
  - $\frac{\partial a_j}{\partial z_j} = \sigma'(z_j)$,
  - $\frac{\partial z_j}{\partial w_{ij}} = x_i$ (input to the neuron).
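
One way to sanity-check a chain-rule derivation like this is to compare it with a finite-difference estimate. The sketch below does so for a single sigmoid output neuron fed by one hidden activation; the numeric values and the helper names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, a_prev, y):
    """E = 1/2 * (y - y_hat)^2 for one sigmoid output neuron."""
    y_hat = sigmoid(w * a_prev + b)
    return 0.5 * (y - y_hat) ** 2

# Hypothetical values: one weight, one bias, one incoming activation, one target
w, b, a_prev, y = 0.8, -0.2, 0.6, 1.0

# The three chain-rule factors
z = w * a_prev + b
y_hat = sigmoid(z)
dE_dyhat = y_hat - y               # derivative of the loss w.r.t. the output
dyhat_dz = y_hat * (1.0 - y_hat)   # sigmoid'(z) written in terms of the output
dz_dw = a_prev                     # activation from the previous layer
analytic = dE_dyhat * dyhat_dz * dz_dw

# Central finite difference as an independent check
eps = 1e-6
numeric = (loss(w + eps, b, a_prev, y) - loss(w - eps, b, a_prev, y)) / (2 * eps)

print("analytic:", analytic)
print("numeric: ", numeric)
```

The two numbers should agree to several decimal places; a large gap usually points to an indexing or sign error in the derivation.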

### Numerical Example

Consider a simple network with one input neuron, one hidden neuron, and one output neuron, with a single data point (x=1, y=2):

1. **Forward Pass**:
   - Input: x = 1
   - Hidden Layer Weight: w_1 = 0.5, Bias: b_1 = 0.1
   - Output Layer Weight: w_2 = 1.5, Bias: b_2 = -0.3
   - Activation Function: Sigmoid
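   - Output Calculation:
     - Hidden Layer Activation: $ a_1 = \sigma(0.5 \times 1 + 0.1) = \sigma(0.6) \approx 0.6457 $
     - Output: $ \hat{y} = \sigma(1.5 \times a_1 - 0.3) \approx \sigma(0.6685) \approx 0.6612 $

2. **Backward Pass**:
   - Loss: $ E = \frac{1}{2} (2 - \hat{y})^2 \approx 0.8962 $
   - Update Weights using gradient descent. Because a sigmoid output lies in $(0, 1)$, the prediction can move toward the target $y = 2$ but never reach it; the example only illustrates the mechanics of one update.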
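
The example can be traced end to end in a few lines of Python. The weights, biases, and data point come from the text; the learning rate `eta = 0.1` and the bias updates are assumptions added here, since the text does not specify them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values taken from the example
x, y = 1.0, 2.0
w1, b1 = 0.5, 0.1
w2, b2 = 1.5, -0.3
eta = 0.1  # assumption: the text does not give a learning rate

# Forward pass
a1 = sigmoid(w1 * x + b1)
y_hat = sigmoid(w2 * a1 + b2)
E = 0.5 * (y - y_hat) ** 2

# Backward pass: error terms for the output and hidden neuron
delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)
delta1 = delta2 * w2 * a1 * (1.0 - a1)

# Gradient descent updates
w2_new = w2 - eta * delta2 * a1
b2_new = b2 - eta * delta2
w1_new = w1 - eta * delta1 * x
b1_new = b1 - eta * delta1

print("a1, y_hat, E:", a1, y_hat, E)
print("updated w1, w2:", w1_new, w2_new)
```

Both weights increase after the step, because the prediction is below the target and every term in the chain is positive apart from the error itself.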

By understanding and implementing these computations, one can effectively use MLPs to tackle various predictive modeling tasks.


# Understanding Multi-Layer Perceptrons (MLP)

## Introduction to MLP

Multi-Layer Perceptrons (MLPs) are a foundational type of artificial neural network designed for modeling complex patterns in data. An MLP comprises an input layer, one or more hidden layers, and an output layer, and it learns by adjusting the weights and biases of its fully connected neurons.

## Mathematical Background

### Forward Pass

During the forward pass, the MLP computes outputs by processing the input data layer-by-layer from the input to the output, applying weights, biases, and activation functions at each neuron.

- **Input Layer**: Receives the input features.
- **Hidden Layers**:
  - Each neuron in a hidden layer computes the weighted sum of its inputs and applies a nonlinear activation function:
    $$ a_j^{(l)} = \sigma\left(\sum_{i} w_{ij}^{(l)} x_i + b_j^{(l)} \right) $$
  - Where:
    - $a_j^{(l)}$ is the activation of the $j$-th neuron in the $l$-th layer,
    - $w_{ij}^{(l)}$ are the weights from the $i$-th neuron in the $(l-1)$-th layer to the $j$-th neuron,
    - $b_j^{(l)}$ is the bias of the $j$-th neuron,
    - $\sigma$ is the activation function (e.g., Sigmoid, ReLU),
    - $x_i$ are the inputs from the previous layer (or external inputs for the first hidden layer).

- **Output Layer**:
  - Often differs in function based on the specific application, such as using softmax for classification or a linear function for regression.
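
The difference between output functions can be seen on a small vector of pre-activations. The sketch below applies softmax (classification) and the identity (regression) to the same values; the `softmax` helper and the numbers are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    """Convert a vector of pre-activations into probabilities that sum to 1."""
    shifted = z - np.max(z)  # subtracting the max avoids overflow in exp
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

# Hypothetical pre-activations of a three-neuron output layer
z_out = np.array([2.0, 1.0, 0.1])

class_probabilities = softmax(z_out)   # classification: one probability per class
regression_output = z_out              # regression: linear (identity) output

print("softmax:", class_probabilities)
print("linear: ", regression_output)
```

Subtracting the maximum before exponentiating does not change the result, since softmax is invariant to adding a constant to every input.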

### Backward Pass (Backpropagation)

Backpropagation is used to update the weights and biases of the network based on the error in output. The process includes:

1. **Error Calculation**:
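   - The error between the actual output and the predicted output is calculated, often using a squared-error loss closely related to mean squared error (MSE); the factor $\frac{1}{2}$ is a convention that cancels when the loss is differentiated: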
     $$ E = \frac{1}{2} \sum (y - \hat{y})^2 $$
   - Where $y$ is the true value and $\hat{y}$ is the predicted value.

2. **Gradient Descent**:
   - The gradients of the error with respect to each weight and bias are calculated to update the parameters:
     $$ w_{ij}^{(l)} = w_{ij}^{(l)} - \eta \frac{\partial E}{\partial w_{ij}^{(l)}} $$
   - Where $\eta$ is the learning rate.

3. **Propagation of Error**:
   - The error is propagated back through the network, updating each weight and bias according to its contribution to the output error.

#### Derivatives Calculation:

1. **Output Layer Error Gradient**:
   - By the chain rule, the gradient for a weight $w_{jk}$ from hidden neuron $j$ to output neuron $k$ is:
     $$ \frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial \hat{y}_k} \cdot \frac{\partial \hat{y}_k}{\partial z_k} \cdot \frac{\partial z_k}{\partial w_{jk}} $$
   - Where:
     - $\frac{\partial E}{\partial \hat{y}_k} = \hat{y}_k - y_k$ (for MSE),
     - $\frac{\partial \hat{y}_k}{\partial z_k} = \sigma'(z_k)$ (derivative of the activation function),
     - $\frac{\partial z_k}{\partial w_{jk}} = a_j$ (activation from the previous layer).
   - The first two factors define the error gradient for each output neuron:
     $$ \delta_k = (\hat{y}_k - y_k) \sigma'(z_k) $$
   - Where $z_k$ is the total input to the $k$-th output neuron, and $\sigma'(z_k)$ is the derivative of the activation function at $z_k$.

2. **Hidden Layer Error Gradient**:
   - Error gradients for hidden layer neurons are calculated by propagating the output layer gradients backward:
     $$ \delta_j = \left(\sum_{k} w_{jk} \delta_k\right) \sigma'(z_j) $$
   - Where $w_{jk}$ are the weights from the $j$-th neuron to the $k$-th neuron in the next layer.
   - Equivalently, for a weight $w_{ij}$ from neuron $i$ of the previous layer to hidden neuron $j$:
     $$ \frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_j} \cdot \frac{\partial a_j}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{ij}} $$
   - Where:
     - $\frac{\partial E}{\partial a_j} = \sum_{k} w_{jk} \delta_k$ is the error propagated from the subsequent layer,
     - $\frac{\partial a_j}{\partial z_j} = \sigma'(z_j)$,
     - $\frac{\partial z_j}{\partial w_{ij}} = x_i$ (input to the neuron).

3. **Update Rules**:
   - Weights are updated by subtracting a portion of the gradient:
     $$ w_{ij} = w_{ij} - \eta \cdot \delta_j \cdot a_i $$
   - Where $\eta$ is the learning rate and $a_i$ is the activation feeding the weight (the input $x_i$ for the first hidden layer).
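
The error terms and update rule translate almost directly into vectorized NumPy. The following is a sketch for a network with one hidden layer and a single training sample; the function names, the 2-2-1 layout, and all parameter values are assumptions, and sigmoid is assumed at both layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(x, y, W1, b1, W2, b2):
    """Half the sum of squared errors for one sample."""
    y_hat = sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2)
    return 0.5 * np.sum((y_hat - y) ** 2)

def backprop_step(x, y, W1, b1, W2, b2, eta):
    """One gradient descent step; returns the updated parameters."""
    # Forward pass
    a1 = sigmoid(x @ W1 + b1)
    y_hat = sigmoid(a1 @ W2 + b2)
    # Output error terms: delta_k = (y_hat_k - y_k) * sigma'(z_k)
    delta_out = (y_hat - y) * y_hat * (1.0 - y_hat)
    # Hidden error terms: delta_j = (sum_k w_jk * delta_k) * sigma'(z_j)
    delta_hid = (W2 @ delta_out) * a1 * (1.0 - a1)
    # Updates: w_ij <- w_ij - eta * delta_j * a_i
    W2_new = W2 - eta * np.outer(a1, delta_out)
    b2_new = b2 - eta * delta_out
    W1_new = W1 - eta * np.outer(x, delta_hid)
    b1_new = b1 - eta * delta_hid
    return W1_new, b1_new, W2_new, b2_new

# Illustrative 2-2-1 network (all values are assumptions)
x = np.array([1.0, 0.5])
y = np.array([1.0])
W1 = np.array([[0.1, -0.2], [0.4, 0.3]])
b1 = np.zeros(2)
W2 = np.array([[0.3], [-0.1]])
b2 = np.zeros(1)

before = loss(x, y, W1, b1, W2, b2)
W1, b1, W2, b2 = backprop_step(x, y, W1, b1, W2, b2, eta=0.1)
after = loss(x, y, W1, b1, W2, b2)
print("loss before:", before, "loss after:", after)
```

`np.outer(a, delta)` builds the matrix whose entry `[i, j]` is $a_i \delta_j$, which is the per-weight gradient in the update rule.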

### Numerical Example

Consider a simple network with one input (x=3), two hidden neurons, and one output neuron, trained on a single data point with target output y=1, using the sigmoid activation function and mean squared error (MSE) loss.

#### Forward Pass Calculation:

1. **Input to Hidden Layer**:
   - Weights: $w_{11} = 0.1, w_{21} = -0.2$
   - Biases: $b_1 = 0, b_2 = 0.1$
   - Outputs:
     - $ a_1 = \sigma(0.1 \times 3 + 0) = \sigma(0.3) $
     - $ a_2 = \sigma(-0.2 \times 3 + 0.1) = \sigma(-0.5) $

2. **Hidden to Output Layer**:
   - Weights: both hidden neurons connect to the output neuron with the same initial weight, written $w_{12} = 0.3$ here
   - Bias: $b_3 = -0.1$
   - Output: $ \hat{y} = \sigma(0.3 \times a_1 + 0.3 \times a_2 - 0.1) $

#### Backward Pass Calculation:

1. **Output Error**:
   - Error term: $ \delta_3 = (\hat{y} - 1) \sigma'(\text{input to output neuron}) $

2. **Hidden Layer Errors**:
   - Error terms:
     - $ \delta_1 = (w_{12} \cdot \delta_3) \sigma'(\text{input to neuron 1}) $
     - $ \delta_2 = (w_{12} \cdot \delta_3) \sigma'(\text{input to neuron 2}) $

3. **Update Weights**:
   - $ w_{11} = w_{11} - \eta \cdot \delta_1 \cdot x $
   - $ w_{21} = w_{21} - \eta \cdot \delta_2 \cdot x $
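   - $ w_{12} = w_{12} - \eta \cdot \delta_3 \cdot a_1 $ for the weight from hidden neuron 1; the weight from hidden neuron 2 is updated the same way with $a_2$ in place of $a_1$, so the two weights differ after the first step.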
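
The calculation above can be evaluated numerically with a short script. The inputs, weights, and biases follow the example; the learning rate `eta = 0.1` is an assumption, and the names `v1` and `v2` are hypothetical labels for the two hidden-to-output weights, which both start at 0.3.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values from the example
x, y = 3.0, 1.0
w11, w21 = 0.1, -0.2      # input-to-hidden weights
b1, b2 = 0.0, 0.1         # hidden biases
v1, v2 = 0.3, 0.3         # hidden-to-output weights (hypothetical names)
b3 = -0.1                 # output bias
eta = 0.1                 # assumption: not given in the text

# Forward pass
a1 = sigmoid(w11 * x + b1)
a2 = sigmoid(w21 * x + b2)
y_hat = sigmoid(v1 * a1 + v2 * a2 + b3)

# Backward pass: sigma'(z) is written as a * (1 - a) for sigmoid
delta3 = (y_hat - y) * y_hat * (1.0 - y_hat)
delta1 = v1 * delta3 * a1 * (1.0 - a1)
delta2 = v2 * delta3 * a2 * (1.0 - a2)

# Weight updates
w11_new = w11 - eta * delta1 * x
w21_new = w21 - eta * delta2 * x
v1_new = v1 - eta * delta3 * a1
v2_new = v2 - eta * delta3 * a2

print("a1, a2, y_hat:", a1, a2, y_hat)
print("delta3, delta1, delta2:", delta3, delta1, delta2)
print("updated weights:", w11_new, w21_new, v1_new, v2_new)
```

Since $a_1 > a_2$, the weight from the first hidden neuron receives the larger update, even though both weights started at the same value.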

This tutorial aims to demystify the training of an MLP by detailing each computational step involved, from data input through to the adjustments made based on the output error.


## Assumptions Behind MLP

MLPs are based on several key assumptions:
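- **Universal Approximation**: By the universal approximation theorem, an MLP with a single hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy, given a sufficient number of hidden neurons and a suitable nonlinear activation function. The theorem guarantees that such a network exists, not that training will find it.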
- **Data Scaling**: Input features should be normalized or standardized to ensure efficient convergence during training.
- **Independence and Identical Distribution (i.i.d.)**: The data samples are assumed to be drawn independently from the same distribution, which is crucial for the generalizability of the model.
- **Differentiability**: The activation and loss functions must be differentiable almost everywhere, because backpropagation relies on their gradients.
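
As an illustration of the data scaling point, the sketch below standardizes each feature to zero mean and unit variance. The `standardize` helper, the small `eps` guard against constant features, and the sample values are assumptions; the statistics are computed on the training data and then reused for new data.

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Scale each column of X to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps), mean, std

# Hypothetical training data with features on very different scales
X_train = np.array([[1.0, 200.0],
                    [2.0, 300.0],
                    [3.0, 400.0]])
X_scaled, mean, std = standardize(X_train)

# New data is scaled with the training statistics, not its own
X_new = np.array([[2.5, 250.0]])
X_new_scaled = (X_new - mean) / (std + 1e-8)

print(X_scaled)
print(X_new_scaled)
```

Without scaling, the second feature would dominate the weighted sums and push sigmoid units toward saturation.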

## Bias-Variance Trade-off

In MLPs:
- **Bias**: High bias occurs if the network is too simple to capture the underlying patterns in the data.
- **Variance**: High variance can cause overfitting, where the model learns noise from the training data rather than the actual trends.

Balancing bias and variance is crucial to building effective MLP models. Regularization techniques such as L2 weight penalties and dropout are often used to manage this balance.
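
Two common ways to reduce variance can be sketched in a few lines: an L2 penalty added to the weight update, and inverted dropout applied to hidden activations during training. The function names, the penalty strength `lam`, and the keep probability are assumptions for illustration.

```python
import numpy as np

def l2_regularized_update(W, grad, eta, lam):
    """Gradient step on loss + (lam / 2) * sum(W ** 2)."""
    return W - eta * (grad + lam * W)

def dropout(a, keep_prob, rng):
    """Inverted dropout: zero units at random and rescale the survivors."""
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob

rng = np.random.default_rng(0)

# With a zero data gradient, the L2 term alone shrinks the weights
W = np.array([[1.0, -2.0], [0.5, 0.0]])
W_shrunk = l2_regularized_update(W, np.zeros_like(W), eta=0.1, lam=0.5)

# Each activation is either dropped (0) or scaled up by 1 / keep_prob
a = np.ones(10)
dropped = dropout(a, keep_prob=0.8, rng=rng)

print(W_shrunk)
print(dropped)
```

Dropout is applied only during training; at prediction time the activations are used unchanged, which is why the surviving units are rescaled here.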

## Advantages of MLPs

1. **Non-Linear Modeling**: MLPs are capable of modeling complex non-linear relationships due to their layered structure and activation functions.
2. **Flexibility**: They can be used across various tasks including classification, regression, and feature learning.
3. **Scalability**: Given adequate data and computational resources, MLPs can be scaled to improve performance.

## Disadvantages of MLPs

1. **Overfitting Risk**: Without proper regularization, MLPs can easily overfit, especially in cases with noisy training data or when the model is too complex.
2. **Computationally Intensive**: Training MLPs can be resource-intensive, requiring substantial computing power and time, particularly as the network size increases.
3. **Local Minima**: The non-convex nature of neural network training can lead to suboptimal solutions if the optimization gets stuck in local minima.
4. **Require Large Datasets**: MLPs generally require large amounts of data to perform well without overfitting and to generalize effectively to new data.
5. **Black Box Nature**: Neural networks, including MLPs, are often considered "black boxes" because it can be difficult to interpret how they are making predictions, which can be a drawback in applications where transparency is important.



```python
import numpy as np

# Sigmoid activation function and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(a):
    # Expects a = sigmoid(z), an already-activated value, not the raw input z
    return a * (1 - a)

# Create a class for the MLP
class MLP:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights uniformly in [0, 1) and biases at zero
        self.weights_input_hidden = np.random.rand(input_size, hidden_size)
        self.bias_hidden = np.zeros((1, hidden_size))
        self.weights_hidden_output = np.random.rand(hidden_size, output_size)
        self.bias_output = np.zeros((1, output_size))

    def feedforward(self, X):
        # Forward pass; activations are cached for use in backpropagation
        self.hidden = sigmoid(np.dot(X, self.weights_input_hidden) + self.bias_hidden)
        self.output = sigmoid(np.dot(self.hidden, self.weights_hidden_output) + self.bias_output)
        return self.output

    def backpropagation(self, X, y, learning_rate):
        # Error in output: y - output is the negative gradient of
        # 0.5 * (y - output)^2, which is why the updates below use +=
        output_error = y - self.output
        output_delta = output_error * sigmoid_derivative(self.output)

        # Error in hidden layer
        hidden_error = np.dot(output_delta, self.weights_hidden_output.T)
        hidden_delta = hidden_error * sigmoid_derivative(self.hidden)

        # Update parameters (gradients are summed over the batch)
        self.weights_hidden_output += np.dot(self.hidden.T, output_delta) * learning_rate
        self.bias_output += np.sum(output_delta, axis=0, keepdims=True) * learning_rate
        self.weights_input_hidden += np.dot(X.T, hidden_delta) * learning_rate
        self.bias_hidden += np.sum(hidden_delta, axis=0, keepdims=True) * learning_rate

    def train(self, X, y, learning_rate, epochs):
        for epoch in range(epochs):
            output = self.feedforward(X)
            self.backpropagation(X, y, learning_rate)
            if epoch % 1000 == 0:
                # Loss is computed from the output before this epoch's update
                loss = np.mean((y - output) ** 2)
                print(f"Epoch {epoch}, Loss {loss}")

# Example usage
if __name__ == "__main__":
    # Input data (e.g., XOR problem)
    X = np.array([[0, 0],
                  [0, 1],
                  [1, 0],
                  [1, 1]])
    # Labels
    y = np.array([[0], [1], [1], [0]])

    # Create MLP object
    mlp = MLP(input_size=2, hidden_size=2, output_size=1)
    mlp.train(X, y, learning_rate=0.5, epochs=10000)

    # Test
    for inputs in X:
        print(f"Input: {inputs} - Predicted: {mlp.feedforward(inputs)}")
```


Output (exact values vary between runs because the weights are initialized randomly):

```text
Epoch 0, Loss 0.2879688550631573
Epoch 1000, Loss 0.07409680852610133
Epoch 2000, Loss 0.003439088578271777
Epoch 3000, Loss 0.0015734654780032672
Epoch 4000, Loss 0.0010027637903330922
Epoch 5000, Loss 0.0007309150731999721
Epoch 6000, Loss 0.0005730270921955388
Epoch 7000, Loss 0.0004702562527598223
Epoch 8000, Loss 0.00039820172101227787
Epoch 9000, Loss 0.0003449676921624329
Input: [0 0] - Predicted: [[0.01915486]]
Input: [0 1] - Predicted: [[0.98339885]]
Input: [1 0] - Predicted: [[0.98339169]]
Input: [1 1] - Predicted: [[0.0172618]]
```
