<a href="https://colab.research.google.com/github/mjmousavi97/Deep-Learning-Tehran-uni/blob/main/HomeWorks/01%20HW/Q4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Learning Process of MLP with One Hidden Layer (Backpropagation)

This document provides a detailed derivation and explanation of the backpropagation learning algorithm for a Multi-Layer Perceptron (MLP) with **one hidden layer**, trained using the **Sum of Squared Errors (SSE)** loss.

---

## 🔷 Network Architecture

We consider a standard feedforward neural network with:
- One input layer
- One hidden layer with activation function $f$
- One output layer with activation function $g$ (can be identity for regression)

The forward propagation formula is:

$$
h_{\mathbf{w}, \mathbf{b}}(\mathbf{x}) = g\left(W^2 f\left(W^1 \mathbf{x} + \mathbf{b}^1 \right) + \mathbf{b}^2\right)
$$

Where:
- $W^1$: weights from input to hidden layer  
- $b^1$: biases of hidden layer  
- $W^2$: weights from hidden to output layer  
- $b^2$: biases of output layer  
- $f$: activation function of hidden layer  
- $g$: activation function of output layer (e.g., identity for regression)
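
The forward propagation formula can be sketched in plain Python, with explicit sums that mirror the index notation. This is a minimal illustration, not a reference implementation: the function names, the sigmoid choice for $f$, the identity choice for $g$, and all numeric values are assumptions made for the example.

```python
import math

def sigmoid(z):
    """Assumed hidden activation f."""
    return 1.0 / (1.0 + math.exp(-z))

def identity(z):
    """Assumed output activation g (regression)."""
    return z

def forward(x, W1, b1, W2, b2, f=sigmoid, g=identity):
    """Compute g(W2 f(W1 x + b1) + b2) for one input vector x."""
    # Hidden pre-activation: one weighted sum per hidden unit.
    z_hidden = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a_hidden = [f(z) for z in z_hidden]
    # Output pre-activation: one weighted sum per output unit.
    z_out = [sum(w * a for w, a in zip(row, a_hidden)) + b for row, b in zip(W2, b2)]
    h = [g(z) for z in z_out]
    return z_hidden, a_hidden, z_out, h

# Hypothetical network: 2 inputs, 2 hidden units, 1 output.
W1 = [[1.0, -1.0], [0.5, 0.5]]
b1 = [0.0, -1.0]
W2 = [[2.0, -1.0]]
b2 = [0.5]

z_hidden, a_hidden, z_out, h = forward([1.0, 1.0], W1, b1, W2, b2)
# Both hidden pre-activations are 0, so both hidden activations are 0.5,
# and the output is 2.0 * 0.5 - 1.0 * 0.5 + 0.5 = 1.0.
```

The function returns the intermediate values as well as the output because backpropagation reuses them.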



## 🔷 Training Set

$$
\left\{ (\mathbf{x}^q, \mathbf{y}^q) \right\}_{q=1}^Q
$$

- $\mathbf{x}^q \in \mathbb{R}^n$: input vector  
- $\mathbf{y}^q \in \mathbb{R}^r$: target output with components $y_k^q$ (a scalar when $r = 1$)



## 🔷 Loss Function (Sum of Squared Errors)

$$
J(\mathbf{w}, \mathbf{b}) = \frac{1}{2} \sum_{q=1}^{Q} \left\| h_{\mathbf{w},\mathbf{b}}(\mathbf{x}^q) - \mathbf{y}^q \right\|^2
$$
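
As a small worked example of the loss, the sketch below evaluates the sum of squared errors for two hypothetical predictions. The function name `sse_loss` and the numbers are assumptions chosen so the arithmetic is easy to follow.

```python
def sse_loss(predictions, targets):
    """J = 1/2 * sum over examples of the squared distance between output and target."""
    total = 0.0
    for h, y in zip(predictions, targets):
        total += sum((hk - yk) ** 2 for hk, yk in zip(h, y))
    return 0.5 * total

# Two examples with one output each (hypothetical values).
predictions = [[1.0], [3.0]]
targets = [[0.0], [1.0]]

loss = sse_loss(predictions, targets)
# Squared errors are 1 and 4, so J = 0.5 * (1 + 4) = 2.5.
```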



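## 🔷 Backpropagation Derivations

We derive the gradient with respect to each parameter using the chain rule. The expressions below are written for a single training example, so the superscript $q$ is dropped; the gradient of the full loss $J$ is the sum of these per-example gradients over $q = 1, \dots, Q$. Superscripts on $z$ and $a$ are layer indices (layer 1 is the input), not exponents, and $r$ denotes the number of output units. Let:
- $z_j^2 = \sum_i w_{ji}^1 x_i + b_j^1$
- $a_j^2 = f(z_j^2)$
- $z_k^3 = \sum_j w_{kj}^2 a_j^2 + b_k^2$
- $h_k = g(z_k^3)$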



### ✅ Gradient of Loss w.r.t. Output Weights $w_{kj}^2$

$$
\frac{\partial J}{\partial w_{kj}^2} = \frac{\partial J}{\partial h_k} \cdot \frac{\partial h_k}{\partial z_k^3} \cdot \frac{\partial z_k^3}{\partial w_{kj}^2}
= (h_k - y_k) g'(z_k^3) a_j^2
$$

Update rule:
$$
w_{kj}^{2(new)} = w_{kj}^{2(old)} - \alpha (h_k - y_k) g'(z_k^3) a_j^2
$$



### ✅ Gradient of Loss w.r.t. Output Bias $b_k^2$

$$
\frac{\partial J}{\partial b_k^2} = \frac{\partial J}{\partial h_k} \cdot \frac{\partial h_k}{\partial z_k^3} \cdot \frac{\partial z_k^3}{\partial b_k^2} = (h_k - y_k) g'(z_k^3)
$$

Update rule:
$$
b_k^{2(new)} = b_k^{2(old)} - \alpha (h_k - y_k) g'(z_k^3)
$$
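
One way to sanity-check the output-layer gradients is to compare them with a central finite difference on a single example. The sketch below does this under stated assumptions: a sigmoid hidden activation, an identity output activation (so $g'(z_k^3) = 1$), and hypothetical weights and data. All function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    # Assumed activations: f = sigmoid (hidden), g = identity (output).
    z2 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a2 = [sigmoid(z) for z in z2]
    h = [sum(w * a for w, a in zip(row, a2)) + b for row, b in zip(W2, b2)]
    return a2, h

def loss(x, y, W1, b1, W2, b2):
    _, h = forward(x, W1, b1, W2, b2)
    return 0.5 * sum((hk - yk) ** 2 for hk, yk in zip(h, y))

def output_grads(x, y, W1, b1, W2, b2):
    a2, h = forward(x, W1, b1, W2, b2)
    # (h_k - y_k) * g'(z_k), with g' = 1 for the identity output.
    err = [(hk - yk) * 1.0 for hk, yk in zip(h, y)]
    dW2 = [[e * aj for aj in a2] for e in err]  # dJ/dw_kj = err_k * a_j
    db2 = list(err)                             # dJ/db_k  = err_k
    return dW2, db2

# Hypothetical example and parameters.
x, y = [1.0, 2.0], [1.0]
W1, b1 = [[0.1, -0.2], [0.3, 0.4]], [0.0, 0.1]
W2, b2 = [[0.5, -0.5]], [0.2]

dW2, db2 = output_grads(x, y, W1, b1, W2, b2)

# Central finite difference for the weight w_{01} of the output layer.
eps = 1e-5
W2_plus = [[0.5, -0.5 + eps]]
W2_minus = [[0.5, -0.5 - eps]]
numeric_w = (loss(x, y, W1, b1, W2_plus, b2) - loss(x, y, W1, b1, W2_minus, b2)) / (2 * eps)

# Central finite difference for the output bias b_0.
numeric_b = (loss(x, y, W1, b1, W2, [0.2 + eps]) - loss(x, y, W1, b1, W2, [0.2 - eps])) / (2 * eps)
```

If the analytic expressions are right, `numeric_w` should agree with `dW2[0][1]` and `numeric_b` with `db2[0]` up to rounding error.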



### ✅ Gradient of Loss w.r.t. Hidden Weights $w_{ji}^1$

We apply the chain rule through the network:

$$
\frac{\partial J}{\partial w_{ji}^1} = \sum_{k=1}^r \left( \frac{\partial J}{\partial h_k} \cdot \frac{\partial h_k}{\partial z_k^3} \cdot \frac{\partial z_k^3}{\partial a_j^2} \cdot \frac{\partial a_j^2}{\partial z_j^2} \cdot \frac{\partial z_j^2}{\partial w_{ji}^1} \right)
$$

Plugging in each derivative:
- $\frac{\partial J}{\partial h_k} = h_k - y_k$
- $\frac{\partial h_k}{\partial z_k^3} = g'(z_k^3)$
- $\frac{\partial z_k^3}{\partial a_j^2} = w_{kj}^2$
- $\frac{\partial a_j^2}{\partial z_j^2} = f'(z_j^2)$
- $\frac{\partial z_j^2}{\partial w_{ji}^1} = x_i$

Final expression:
$$
\frac{\partial J}{\partial w_{ji}^1} = \left( \sum_{k=1}^{r} (h_k - y_k) g'(z_k^3) w_{kj}^2 \right) f'(z_j^2) x_i
$$

Update rule:
$$
w_{ji}^{1(new)} = w_{ji}^{1(old)} - \alpha \left( \sum_{k=1}^{r} (h_k - y_k) g'(z_k^3) w_{kj}^2 \right) f'(z_j^2) x_i
$$



### ✅ Gradient of Loss w.r.t. Hidden Bias $b_j^1$

The same chain applies, except that the last factor is $\frac{\partial z_j^2}{\partial b_j^1} = 1$ instead of $x_i$:

$$
\frac{\partial J}{\partial b_j^1} = \left( \sum_{k=1}^{r} (h_k - y_k) g'(z_k^3) w_{kj}^2 \right) f'(z_j^2)
$$

Update rule:
$$
b_j^{1(new)} = b_j^{1(old)} - \alpha \left( \sum_{k=1}^{r} (h_k - y_k) g'(z_k^3) w_{kj}^2 \right) f'(z_j^2)
$$
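
The hidden-layer gradients can be checked numerically in the same spirit. The sketch below uses two output units so that the sum over $k$ is actually exercised. The sigmoid hidden activation, the identity output activation, and all numeric values are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    # Assumed activations: f = sigmoid (hidden), g = identity (output).
    z2 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a2 = [sigmoid(z) for z in z2]
    h = [sum(w * a for w, a in zip(row, a2)) + b for row, b in zip(W2, b2)]
    return a2, h

def loss(x, y, W1, b1, W2, b2):
    _, h = forward(x, W1, b1, W2, b2)
    return 0.5 * sum((hk - yk) ** 2 for hk, yk in zip(h, y))

def hidden_grads(x, y, W1, b1, W2, b2):
    a2, h = forward(x, W1, b1, W2, b2)
    err = [hk - yk for hk, yk in zip(h, y)]  # g' = 1 for the identity output
    delta2 = []
    for j, aj in enumerate(a2):
        # Sum over output units k of (h_k - y_k) g'(z_k) w_kj ...
        back = sum(err[k] * W2[k][j] for k in range(len(err)))
        # ... times f'(z_j), which for the sigmoid equals a_j * (1 - a_j).
        delta2.append(back * aj * (1.0 - aj))
    dW1 = [[dj * xi for xi in x] for dj in delta2]  # extra factor x_i
    db1 = list(delta2)                              # no x_i factor
    return dW1, db1

# Hypothetical example: 2 inputs, 2 hidden units, 2 outputs.
x, y = [1.0, 2.0], [1.0, 0.0]
W1, b1 = [[0.1, -0.2], [0.3, 0.4]], [0.0, 0.1]
W2, b2 = [[0.5, -0.5], [0.25, 0.75]], [0.2, -0.1]

dW1, db1 = hidden_grads(x, y, W1, b1, W2, b2)

eps = 1e-5

def shifted_matrix(W, j, i, delta):
    W_new = [row[:] for row in W]
    W_new[j][i] += delta
    return W_new

def shifted_vector(b, j, delta):
    b_new = b[:]
    b_new[j] += delta
    return b_new

# Central finite differences for w_{10} and b_1 of the hidden layer.
numeric_w = (loss(x, y, shifted_matrix(W1, 1, 0, eps), b1, W2, b2)
             - loss(x, y, shifted_matrix(W1, 1, 0, -eps), b1, W2, b2)) / (2 * eps)
numeric_b = (loss(x, y, W1, shifted_vector(b1, 1, eps), W2, b2)
             - loss(x, y, W1, shifted_vector(b1, 1, -eps), W2, b2)) / (2 * eps)
```

Agreement between `numeric_w` and `dW1[1][0]`, and between `numeric_b` and `db1[1]`, is evidence that the derivation is correct for this configuration, although it is not a proof.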




# Backpropagation Derivations for Stochastic and Mini-Batch Gradient Descent

This section derives the gradients and update rules for training the same one-hidden-layer Multi-Layer Perceptron (MLP) using:

- **Stochastic Gradient Descent (SGD)**
- **Mini-Batch Gradient Descent**

---

## ✅ Network Architecture

Let the MLP be defined as follows:

- Input: $ \mathbf{x} \in \mathbb{R}^n $
- Hidden layer: weights $ W^1 $, biases $ b^1 $, activation $ f $
- Output layer: weights $ W^2 $, biases $ b^2 $, activation $ g $

### Forward Propagation:

- Hidden pre-activation:  
  $$
  z_j^2 = \sum_{i} w_{ji}^1 x_i + b_j^1
  $$

- Hidden activation:  
  $$
  a_j^2 = f(z_j^2)
  $$

- Output pre-activation:  
  $$
  z_k^3 = \sum_{j} w_{kj}^2 a_j^2 + b_k^2
  $$

- Output activation:  
  $$
  h_k = g(z_k^3)
  $$

---

## ✅ 1. Stochastic Gradient Descent (SGD)

### 🔹 Loss Function:

For a single training example $ (\mathbf{x}^q, y^q) $:

$$
J^{(q)}(\mathbf{w}, \mathbf{b}) = \frac{1}{2} \sum_{k=1}^{r} (h_k^q - y_k^q)^2
$$

### 🔹 Gradients:

#### Output Layer

- $ \frac{\partial J}{\partial z_k^3} = (h_k - y_k) \cdot g'(z_k^3) $

- $ \frac{\partial J}{\partial w_{kj}^2} = (h_k - y_k) g'(z_k^3) a_j^2 $

- $ \frac{\partial J}{\partial b_k^2} = (h_k - y_k) g'(z_k^3) $

#### Hidden Layer

- Backpropagate error:  
  $$
  \delta_j^2 = f'(z_j^2) \sum_k (h_k - y_k) g'(z_k^3) w_{kj}^2
  $$

- $ \frac{\partial J}{\partial w_{ji}^1} = \delta_j^2 x_i $

- $ \frac{\partial J}{\partial b_j^1} = \delta_j^2 $
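
The component-wise gradients above can also be collected into matrix form, which is how they are typically implemented. The following is a sketch in notation that the derivation itself does not use: $\boldsymbol{\delta}^3$ for the output error vector, $\odot$ for the elementwise product, and $f'$, $g'$ applied elementwise.

$$
\boldsymbol{\delta}^3 = (\mathbf{h} - \mathbf{y}) \odot g'(\mathbf{z}^3), \qquad
\boldsymbol{\delta}^2 = f'(\mathbf{z}^2) \odot \left( (W^2)^\top \boldsymbol{\delta}^3 \right)
$$

$$
\frac{\partial J}{\partial W^2} = \boldsymbol{\delta}^3 (\mathbf{a}^2)^\top, \qquad
\frac{\partial J}{\partial \mathbf{b}^2} = \boldsymbol{\delta}^3, \qquad
\frac{\partial J}{\partial W^1} = \boldsymbol{\delta}^2 \mathbf{x}^\top, \qquad
\frac{\partial J}{\partial \mathbf{b}^1} = \boldsymbol{\delta}^2
$$

Reading off entry $(k, j)$ of $\boldsymbol{\delta}^3 (\mathbf{a}^2)^\top$ gives $(h_k - y_k) g'(z_k^3) a_j^2$, and entry $j$ of $(W^2)^\top \boldsymbol{\delta}^3$ gives $\sum_k w_{kj}^2 (h_k - y_k) g'(z_k^3)$, matching the component-wise expressions.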

### 🔹 Update Rules:

- $ w_{kj}^{2(new)} = w_{kj}^{2(old)} - \alpha (h_k - y_k) g'(z_k^3) a_j^2 $

- $ b_k^{2(new)} = b_k^{2(old)} - \alpha (h_k - y_k) g'(z_k^3) $

- $ w_{ji}^{1(new)} = w_{ji}^{1(old)} - \alpha \delta_j^2 x_i $

- $ b_j^{1(new)} = b_j^{1(old)} - \alpha \delta_j^2 $
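
The four update rules can be combined into a single per-example training step. The sketch below is one possible arrangement rather than a definitive implementation. It assumes a sigmoid hidden activation and an identity output activation, stores the parameters in a dictionary named `params`, and trains on a tiny hypothetical dataset for the target $y = x_1 + x_2$.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, p):
    # Assumed activations: f = sigmoid (hidden), g = identity (output).
    z2 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(p["W1"], p["b1"])]
    a2 = [sigmoid(z) for z in z2]
    h = [sum(w * a for w, a in zip(row, a2)) + b for row, b in zip(p["W2"], p["b2"])]
    return a2, h

def sgd_step(x, y, p, alpha):
    """Update the parameters in place using one example (x, y)."""
    a2, h = forward(x, p)
    err = [hk - yk for hk, yk in zip(h, y)]  # (h_k - y_k) g'(z_k), g' = 1
    # The hidden error must use the old output weights, so compute it first.
    delta2 = [aj * (1.0 - aj) * sum(e * p["W2"][k][j] for k, e in enumerate(err))
              for j, aj in enumerate(a2)]
    for k, e in enumerate(err):
        for j, aj in enumerate(a2):
            p["W2"][k][j] -= alpha * e * aj
        p["b2"][k] -= alpha * e
    for j, dj in enumerate(delta2):
        for i, xi in enumerate(x):
            p["W1"][j][i] -= alpha * dj * xi
        p["b1"][j] -= alpha * dj

def total_loss(data, p):
    return 0.5 * sum((hk - yk) ** 2
                     for x, y in data
                     for hk, yk in zip(forward(x, p)[1], y))

# Hypothetical dataset and initial parameters.
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [2.0])]
params = {"W1": [[0.1, -0.2], [0.3, 0.4]], "b1": [0.0, 0.0],
          "W2": [[0.5, -0.5]], "b2": [0.0]}

loss_before = total_loss(data, params)
rng = random.Random(0)  # fixed seed so the run is reproducible
for epoch in range(50):
    order = data[:]
    rng.shuffle(order)  # visit the examples in a random order each epoch
    for x, y in order:
        sgd_step(x, y, params, alpha=0.1)
loss_after = total_loss(data, params)
```

Computing `delta2` before touching `W2` matters: the hidden error is defined in terms of the weights that produced the current output, not the freshly updated ones.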

---

## ✅ 2. Mini-Batch Gradient Descent

### 🔹 Loss Function:

For a mini-batch of size $ B $:

$$
J^{(\text{batch})}(\mathbf{w}, \mathbf{b}) = \frac{1}{2B} \sum_{q=1}^{B} \sum_{k=1}^{r} (h_k^q - y_k^q)^2
$$

### 🔹 Gradients:

**Average over the batch**:

- $ \frac{\partial J}{\partial w_{kj}^2} = \frac{1}{B} \sum_{q=1}^B (h_k^q - y_k^q) g'(z_k^{3,q}) a_j^{2,q} $

- $ \frac{\partial J}{\partial b_k^2} = \frac{1}{B} \sum_{q=1}^B (h_k^q - y_k^q) g'(z_k^{3,q}) $

- Hidden error term for each $ j $:  
  $$
  \delta_j^{2,q} = f'(z_j^{2,q}) \sum_k (h_k^q - y_k^q) g'(z_k^{3,q}) w_{kj}^2
  $$

- $ \frac{\partial J}{\partial w_{ji}^1} = \frac{1}{B} \sum_{q=1}^B \delta_j^{2,q} x_i^q $

- $ \frac{\partial J}{\partial b_j^1} = \frac{1}{B} \sum_{q=1}^B \delta_j^{2,q} $

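### 🔹 Update Rules:

- $ w_{kj}^{2(new)} = w_{kj}^{2(old)} - \alpha \cdot \frac{1}{B} \sum_{q=1}^B (h_k^q - y_k^q) g'(z_k^{3,q}) a_j^{2,q} $

- $ b_k^{2(new)} = b_k^{2(old)} - \alpha \cdot \frac{1}{B} \sum_{q=1}^B (h_k^q - y_k^q) g'(z_k^{3,q}) $

- $ w_{ji}^{1(new)} = w_{ji}^{1(old)} - \alpha \cdot \frac{1}{B} \sum_{q=1}^B \delta_j^{2,q} x_i^q $

- $ b_j^{1(new)} = b_j^{1(old)} - \alpha \cdot \frac{1}{B} \sum_{q=1}^B \delta_j^{2,q} $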
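
A mini-batch gradient is the average of the per-example gradients, as the sketch below illustrates. It assumes a sigmoid hidden activation, an identity output activation, and a hypothetical batch of three examples. It then checks one entry of the averaged gradient against a central finite difference of the batch loss. All names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    # Assumed activations: f = sigmoid (hidden), g = identity (output).
    z2 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a2 = [sigmoid(z) for z in z2]
    h = [sum(w * a for w, a in zip(row, a2)) + b for row, b in zip(W2, b2)]
    return a2, h

def example_grads(x, y, W1, b1, W2, b2):
    """Per-example gradients (dW1, db1, dW2, db2)."""
    a2, h = forward(x, W1, b1, W2, b2)
    err = [hk - yk for hk, yk in zip(h, y)]  # g' = 1
    delta2 = [aj * (1.0 - aj) * sum(e * W2[k][j] for k, e in enumerate(err))
              for j, aj in enumerate(a2)]
    dW1 = [[dj * xi for xi in x] for dj in delta2]
    dW2 = [[e * aj for aj in a2] for e in err]
    return dW1, delta2, dW2, err

def mean_vector(vectors):
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(vectors[0]))]

def mean_matrix(matrices):
    return [mean_vector([m[r] for m in matrices]) for r in range(len(matrices[0]))]

def batch_grads(batch, W1, b1, W2, b2):
    """Average the per-example gradients over the mini-batch."""
    grads = [example_grads(x, y, W1, b1, W2, b2) for x, y in batch]
    return (mean_matrix([g[0] for g in grads]), mean_vector([g[1] for g in grads]),
            mean_matrix([g[2] for g in grads]), mean_vector([g[3] for g in grads]))

def batch_loss(batch, W1, b1, W2, b2):
    total = sum((hk - yk) ** 2
                for x, y in batch
                for hk, yk in zip(forward(x, W1, b1, W2, b2)[1], y))
    return total / (2 * len(batch))

# Hypothetical mini-batch (B = 3) and parameters.
batch = [([0.0, 1.0], [1.0]), ([1.0, 0.0], [1.0]), ([1.0, 1.0], [2.0])]
W1, b1 = [[0.1, -0.2], [0.3, 0.4]], [0.0, 0.1]
W2, b2 = [[0.5, -0.5]], [0.2]

dW1, db1, dW2, db2 = batch_grads(batch, W1, b1, W2, b2)

# Central finite difference of the batch loss with respect to w_{01} in W1.
eps = 1e-5
W1_plus = [[0.1, -0.2 + eps], [0.3, 0.4]]
W1_minus = [[0.1, -0.2 - eps], [0.3, 0.4]]
numeric = (batch_loss(batch, W1_plus, b1, W2, b2)
           - batch_loss(batch, W1_minus, b1, W2, b2)) / (2 * eps)

# One mini-batch update of the output weights with learning rate alpha.
alpha = 0.1
W2_new = [[w - alpha * g for w, g in zip(w_row, g_row)] for w_row, g_row in zip(W2, dW2)]
```

Setting the batch size to 1 recovers the stochastic update, and setting it to the full training set recovers batch gradient descent with a $1/Q$ scaling of the loss.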


