<a href="https://colab.research.google.com/github/mjmousavi97/Deep-Learning-Tehran-uni/blob/main/Q4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Batch Gradient Descent in MLP – Derivation and Update Rules

---

## Notation

- $N$: Total number of training samples (full batch)
- $x^{(n)}$: Input vector for sample $n$
- $y^{(n)}$: True output (target) for sample $n$
- $h^{(n)}$: Network output for sample $n$
- $f(\cdot)$: Activation function at hidden layer
- $g(\cdot)$: Activation function at output layer
- Loss function:

  $$
  J = \frac{1}{N} \sum_{n=1}^N \frac{1}{2} \| h^{(n)} - y^{(n)} \|^2
  $$
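
As a quick numerical illustration of the loss above, the following minimal sketch evaluates $J$ for a list of network outputs and targets. The function name `batch_loss` and the numbers are made up for this example; plain Python lists are used so that nothing beyond the standard library is needed.

```python
def batch_loss(outputs, targets):
    """J = (1/N) * sum_n 0.5 * ||h^(n) - y^(n)||^2 for lists of vectors."""
    n_samples = len(outputs)
    total = 0.0
    for h, y in zip(outputs, targets):
        # Half the squared Euclidean distance for one sample.
        total += 0.5 * sum((hk - yk) ** 2 for hk, yk in zip(h, y))
    return total / n_samples

# Two samples with two outputs each (illustrative values only).
outputs = [[1.0, 0.0], [0.5, 0.5]]
targets = [[0.0, 0.0], [0.5, 1.5]]
loss = batch_loss(outputs, targets)
print(loss)  # each sample contributes 0.5, so the mean is 0.5
```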

---

## Forward Propagation (Per Sample)

1. **Hidden layer**:
   $$
   z_j^{1(n)} = \sum_i w^1_{ji} x_i^{(n)} + b^1_j
   $$
   $$
   a_j^{1(n)} = f(z_j^{1(n)})
   $$

2. **Output layer**:
   $$
   z_k^{2(n)} = \sum_j w^2_{kj} a_j^{1(n)} + b^2_k
   $$
   $$
   h_k^{(n)} = g(z_k^{2(n)})
   $$
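
The forward pass above can be sketched in a few lines. The derivation leaves $f$ and $g$ generic; this sketch assumes $f = \tanh$ and a sigmoid for $g$ purely for illustration, and the helper names (`sigmoid`, `forward`) and parameter values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2, f=math.tanh, g=sigmoid):
    """Return (z1, a1, z2, h) for one sample x.

    W1[j][i] is the weight from input i to hidden unit j;
    W2[k][j] is the weight from hidden unit j to output k.
    """
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [f(z) for z in z1]
    z2 = [sum(w * aj for w, aj in zip(row, a1)) + b for row, b in zip(W2, b2)]
    h = [g(z) for z in z2]
    return z1, a1, z2, h

# 2 inputs -> 2 hidden units -> 1 output, with made-up parameters.
W1 = [[0.5, -0.5], [1.0, 1.0]]
b1 = [0.0, 0.0]
W2 = [[1.0, -1.0]]
b2 = [0.0]
z1, a1, z2, h = forward([1.0, 1.0], W1, b1, W2, b2)
print(z1, a1, z2, h)
```

Returning the intermediate values $z$ and $a$ along with $h$ is a deliberate choice: the gradient formulas reuse them, so a backward pass does not have to recompute the forward pass.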

---

## Gradients (Full-Batch)

### Output Weights $w^2_{kj}$

$$
\frac{\partial J}{\partial w^2_{kj}} =
\frac{1}{N} \sum_{n=1}^N (h_k^{(n)} - y_k^{(n)}) \cdot g'(z_k^{2(n)}) \cdot a_j^{1(n)}
$$

---

### Output Biases $b^2_k$

$$
\frac{\partial J}{\partial b^2_k} =
\frac{1}{N} \sum_{n=1}^N (h_k^{(n)} - y_k^{(n)}) \cdot g'(z_k^{2(n)})
$$

---

### Hidden Weights $w^1_{ji}$

$$
\frac{\partial J}{\partial w^1_{ji}} =
\frac{1}{N} \sum_{n=1}^N \sum_k
\left[ (h_k^{(n)} - y_k^{(n)}) \cdot g'(z_k^{2(n)}) \cdot w^2_{kj} \right]
\cdot f'(z_j^{1(n)}) \cdot x_i^{(n)}
$$

---

### Hidden Biases $b^1_j$

$$
\frac{\partial J}{\partial b^1_j} =
\frac{1}{N} \sum_{n=1}^N \sum_k
\left[ (h_k^{(n)} - y_k^{(n)}) \cdot g'(z_k^{2(n)}) \cdot w^2_{kj} \right]
\cdot f'(z_j^{1(n)})
$$
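
One way to gain confidence in the four gradient formulas above is to implement them and compare one entry against a central finite difference of $J$. The sketch below does this under stated assumptions: $f = \tanh$ (so $f' = 1 - a^2$), $g$ is a sigmoid (so $g' = h(1 - h)$), and all function names, data, and parameter values are invented for the check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    a1 = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
          for row, b in zip(W1, b1)]
    h = [sigmoid(sum(w * aj for w, aj in zip(row, a1)) + b)
         for row, b in zip(W2, b2)]
    return a1, h

def loss(X, Y, W1, b1, W2, b2):
    total = 0.0
    for x, y in zip(X, Y):
        _, h = forward(x, W1, b1, W2, b2)
        total += 0.5 * sum((hk - yk) ** 2 for hk, yk in zip(h, y))
    return total / len(X)

def batch_gradients(X, Y, W1, b1, W2, b2):
    """Full-batch gradients of J, accumulated sample by sample."""
    N = len(X)
    gW1 = [[0.0] * len(W1[0]) for _ in W1]
    gb1 = [0.0] * len(b1)
    gW2 = [[0.0] * len(W2[0]) for _ in W2]
    gb2 = [0.0] * len(b2)
    for x, y in zip(X, Y):
        a1, h = forward(x, W1, b1, W2, b2)
        # Output term (h_k - y_k) * g'(z2_k), with sigmoid' = h * (1 - h).
        d2 = [(hk - yk) * hk * (1.0 - hk) for hk, yk in zip(h, y)]
        # Hidden term f'(z1_j) * sum_k d2_k * w2_kj, with tanh' = 1 - a^2.
        d1 = [(1.0 - a1[j] ** 2) * sum(d2[k] * W2[k][j] for k in range(len(d2)))
              for j in range(len(a1))]
        for k in range(len(d2)):
            gb2[k] += d2[k] / N
            for j in range(len(a1)):
                gW2[k][j] += d2[k] * a1[j] / N
        for j in range(len(d1)):
            gb1[j] += d1[j] / N
            for i in range(len(x)):
                gW1[j][i] += d1[j] * x[i] / N
    return gW1, gb1, gW2, gb2

# Made-up data and parameters: 2 inputs, 2 hidden units, 1 output.
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [[1.0], [1.0], [0.0]]
W1 = [[0.3, -0.2], [0.1, 0.4]]
b1 = [0.05, -0.05]
W2 = [[0.6, -0.7]]
b2 = [0.1]

gW1, gb1, gW2, gb2 = batch_gradients(X, Y, W1, b1, W2, b2)

# Central finite difference with respect to the single weight w1[0][1].
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_minus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus[0][1] -= eps
numeric = (loss(X, Y, W1_plus, b1, W2, b2)
           - loss(X, Y, W1_minus, b1, W2, b2)) / (2 * eps)
print(gW1[0][1], numeric)  # the two values should agree closely
```

The same comparison can be repeated for any other weight or bias by perturbing that entry instead.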

---

## Update Rules

Using learning rate $\eta$:

- **Output weights**:
  $$
  w^2_{kj} \leftarrow w^2_{kj} - \eta \cdot \frac{\partial J}{\partial w^2_{kj}}
  $$

- **Output biases**:
  $$
  b^2_k \leftarrow b^2_k - \eta \cdot \frac{\partial J}{\partial b^2_k}
  $$

- **Hidden weights**:
  $$
  w^1_{ji} \leftarrow w^1_{ji} - \eta \cdot \frac{\partial J}{\partial w^1_{ji}}
  $$

- **Hidden biases**:
  $$
  b^1_j \leftarrow b^1_j - \eta \cdot \frac{\partial J}{\partial b^1_j}
  $$
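
All four update rules share the form $p \leftarrow p - \eta \, \partial J / \partial p$, so a single helper can apply them to any parameter. The sketch below is one possible way to write that step for nested Python lists; the name `gd_step` and the gradient values are hypothetical.

```python
def gd_step(params, grads, eta):
    """Return p - eta * dJ/dp for a scalar, vector, or matrix of parameters."""
    if isinstance(params, list):
        # Recurse into rows (matrix) or entries (vector).
        return [gd_step(p, g, eta) for p, g in zip(params, grads)]
    return params - eta * grads

# Illustrative output-layer weights and a made-up gradient.
W2 = [[0.6, -0.7]]
gW2 = [[0.2, -0.1]]
W2_new = gd_step(W2, gW2, eta=0.5)
print(W2_new)  # approximately [[0.5, -0.65]]
```

Returning a new list rather than modifying `W2` in place keeps the old parameters available, which matters because every gradient in a full-batch step must be computed from the same, not yet updated, weights.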


# Backpropagation Derivations for Stochastic and Mini-Batch Gradient Descent

This part derives the backpropagation gradients and update rules for a Multi-Layer Perceptron (MLP) with one hidden layer trained with:

- **Stochastic Gradient Descent (SGD)**
- **Mini-Batch Gradient Descent**

---

## ✅ Network Architecture

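Let the MLP be defined as follows. In this part, the superscripts on $ z $ and $ a $ count layers with the input as layer 1, so hidden-layer quantities are written $ z^2, a^2 $ and output pre-activations $ z^3 $; individual samples are indexed by $ q $.

- Input: $ \mathbf{x} \in \mathbb{R}^n $ (here $ n $ is the input dimension)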
- Hidden layer: weights $ W^1 $, biases $ b^1 $, activation $ f $
- Output layer: weights $ W^2 $, biases $ b^2 $, activation $ g $

### Forward Propagation:

- Hidden pre-activation:  
  $$
  z_j^2 = \sum_{i} w_{ji}^1 x_i + b_j^1
  $$

- Hidden activation:  
  $$
  a_j^2 = f(z_j^2)
  $$

- Output pre-activation:  
  $$
  z_k^3 = \sum_{j} w_{kj}^2 a_j^2 + b_k^2
  $$

- Output activation:  
  $$
  h_k = g(z_k^3)
  $$

---

## ✅ 1. Stochastic Gradient Descent (SGD)

### 🔹 Loss Function:

For a single training example $ (\mathbf{x}^q, y^q) $:

$$
J^{(q)}(\mathbf{w}, \mathbf{b}) = \frac{1}{2} \sum_k (h_k^q - y_k^q)^2
$$

### 🔹 Gradients:

#### Output Layer

- $ \frac{\partial J}{\partial z_k^3} = (h_k - y_k) \cdot g'(z_k^3) $

- $ \frac{\partial J}{\partial w_{kj}^2} = (h_k - y_k) g'(z_k^3) a_j^2 $

- $ \frac{\partial J}{\partial b_k^2} = (h_k - y_k) g'(z_k^3) $

#### Hidden Layer

- Backpropagate error:  
  $$
  \delta_j^2 = f'(z_j^2) \sum_k (h_k - y_k) g'(z_k^3) w_{kj}^2
  $$

- $ \frac{\partial J}{\partial w_{ji}^1} = \delta_j^2 x_i $

- $ \frac{\partial J}{\partial b_j^1} = \delta_j^2 $

### 🔹 Update Rules:

Using learning rate $ \alpha $:

- $ w_{kj}^{2(new)} = w_{kj}^{2(old)} - \alpha (h_k - y_k) g'(z_k^3) a_j^2 $

- $ b_k^{2(new)} = b_k^{2(old)} - \alpha (h_k - y_k) g'(z_k^3) $

- $ w_{ji}^{1(new)} = w_{ji}^{1(old)} - \alpha \delta_j^2 x_i $

- $ b_j^{1(new)} = b_j^{1(old)} - \alpha \delta_j^2 $
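
The SGD update above can be sketched as a single function that processes one sample. As before, the activations are assumptions made for illustration ($f = \tanh$, $g$ a sigmoid), and `sgd_step` and the numbers are hypothetical. Variable names follow the layer numbering of this part (`a2`, `d2` for the hidden layer, `d3` for the output layer).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(x, y, W1, b1, W2, b2, alpha):
    """One in-place SGD update on a single sample.

    Returns the sample loss measured BEFORE the update.
    """
    # Forward pass.
    a2 = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
          for row, b in zip(W1, b1)]
    h = [sigmoid(sum(w * aj for w, aj in zip(row, a2)) + b)
         for row, b in zip(W2, b2)]
    loss = 0.5 * sum((hk - yk) ** 2 for hk, yk in zip(h, y))
    # Output error (h_k - y_k) * g'(z3_k); for the sigmoid, g' = h * (1 - h).
    d3 = [(hk - yk) * hk * (1.0 - hk) for hk, yk in zip(h, y)]
    # Hidden error delta_j^2, computed with the OLD output weights.
    d2 = [(1.0 - a2[j] ** 2) * sum(d3[k] * W2[k][j] for k in range(len(d3)))
          for j in range(len(a2))]
    # Parameter updates.
    for k in range(len(d3)):
        b2[k] -= alpha * d3[k]
        for j in range(len(a2)):
            W2[k][j] -= alpha * d3[k] * a2[j]
    for j in range(len(d2)):
        b1[j] -= alpha * d2[j]
        for i in range(len(x)):
            W1[j][i] -= alpha * d2[j] * x[i]
    return loss

# One made-up sample and made-up initial parameters.
x, y = [1.0, 0.5], [1.0]
W1 = [[0.3, -0.2], [0.1, 0.4]]
b1 = [0.0, 0.0]
W2 = [[0.6, -0.7]]
b2 = [0.1]

loss_before = sgd_step(x, y, W1, b1, W2, b2, alpha=0.1)
# alpha=0.0 leaves the parameters unchanged, so this call only evaluates the loss.
loss_after = sgd_step(x, y, W1, b1, W2, b2, alpha=0.0)
print(loss_before, loss_after)
```

Note that $\delta_j^2$ is computed before any weight is modified; updating $W^2$ first and then using the new values in the hidden error would no longer match the derivation.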

---

## ✅ 2. Mini-Batch Gradient Descent

### 🔹 Loss Function:

For a mini-batch of size $ B $:

$$
J^{(\text{batch})}(\mathbf{w}, \mathbf{b}) = \frac{1}{2B} \sum_{q=1}^{B} \sum_k (h_k^q - y_k^q)^2
$$

### 🔹 Gradients:

**Average over the batch**:

- $ \frac{\partial J}{\partial w_{kj}^2} = \frac{1}{B} \sum_{q=1}^B (h_k^q - y_k^q) g'(z_k^{3,q}) a_j^{2,q} $

- $ \frac{\partial J}{\partial b_k^2} = \frac{1}{B} \sum_{q=1}^B (h_k^q - y_k^q) g'(z_k^{3,q}) $

- Hidden error term for hidden unit $ j $ and sample $ q $:  
  $$
  \delta_j^{2,q} = f'(z_j^{2,q}) \sum_k (h_k^q - y_k^q) g'(z_k^{3,q}) w_{kj}^2
  $$

- $ \frac{\partial J}{\partial w_{ji}^1} = \frac{1}{B} \sum_{q=1}^B \delta_j^{2,q} x_i^q $

- $ \frac{\partial J}{\partial b_j^1} = \frac{1}{B} \sum_{q=1}^B \delta_j^{2,q} $

### 🔹 Update Rules:

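- $ w_{kj}^{2(new)} = w_{kj}^{2(old)} - \alpha \cdot \frac{1}{B} \sum_{q=1}^B (h_k^q - y_k^q) g'(z_k^{3,q}) a_j^{2,q} $

- $ b_k^{2(new)} = b_k^{2(old)} - \alpha \cdot \frac{1}{B} \sum_{q=1}^B (h_k^q - y_k^q) g'(z_k^{3,q}) $

- $ w_{ji}^{1(new)} = w_{ji}^{1(old)} - \alpha \cdot \frac{1}{B} \sum_{q=1}^B \delta_j^{2,q} x_i^q $

- $ b_j^{1(new)} = b_j^{1(old)} - \alpha \cdot \frac{1}{B} \sum_{q=1}^B \delta_j^{2,q} $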
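
Putting the mini-batch rules together, a training loop shuffles the data each epoch, slices it into batches of size $B$, and applies the averaged update once per batch. The sketch below is one possible arrangement, not a reference implementation: it assumes $f = \tanh$ and a sigmoid $g$, uses the logical OR function as toy data, and all names (`minibatch_step`, `full_loss`) and hyperparameters are invented for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    a2 = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
          for row, b in zip(W1, b1)]
    h = [sigmoid(sum(w * aj for w, aj in zip(row, a2)) + b)
         for row, b in zip(W2, b2)]
    return a2, h

def full_loss(X, Y, params):
    total = 0.0
    for x, y in zip(X, Y):
        _, h = forward(x, *params)
        total += 0.5 * sum((hk - yk) ** 2 for hk, yk in zip(h, y))
    return total / len(X)

def minibatch_step(batch, params, alpha):
    """Apply one averaged update for a list of (x, y) pairs, in place."""
    W1, b1, W2, b2 = params
    B = len(batch)
    # Compute all error terms first so every sample sees the same weights.
    terms = []
    for x, y in batch:
        a2, h = forward(x, W1, b1, W2, b2)
        d3 = [(hk - yk) * hk * (1.0 - hk) for hk, yk in zip(h, y)]
        d2 = [(1.0 - a2[j] ** 2) * sum(d3[k] * W2[k][j] for k in range(len(d3)))
              for j in range(len(a2))]
        terms.append((x, a2, d2, d3))
    # Subtract alpha * (1/B) * sum over the batch, one sample at a time.
    for x, a2, d2, d3 in terms:
        for k in range(len(d3)):
            b2[k] -= alpha * d3[k] / B
            for j in range(len(a2)):
                W2[k][j] -= alpha * d3[k] * a2[j] / B
        for j in range(len(d2)):
            b1[j] -= alpha * d2[j] / B
            for i in range(len(x)):
                W1[j][i] -= alpha * d2[j] * x[i] / B

# Toy data: logical OR. Parameters are made-up starting values.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [[0.0], [1.0], [1.0], [1.0]]
params = ([[0.3, -0.2], [0.1, 0.4]], [0.0, 0.0], [[0.6, -0.7]], [0.1])

rng = random.Random(0)  # fixed seed so the shuffling is reproducible
data = list(zip(X, Y))
B = 2
loss_before = full_loss(X, Y, params)
for epoch in range(200):
    rng.shuffle(data)
    for start in range(0, len(data), B):
        minibatch_step(data[start:start + B], params, alpha=0.5)
loss_after = full_loss(X, Y, params)
print(loss_before, loss_after)
```

Setting `B = 1` in this loop reduces it to SGD, and setting `B = len(data)` reduces it to full-batch gradient descent, which is why the three variants share the same gradient formulas and differ only in how many samples are averaged per update.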


