## Lecture Notes: Backpropagation in CNN (Part 1)

### I. Context and Difficulty

Backpropagation in CNNs is an **important and conceptually demanding topic**. It involves a significant amount of mathematics, so you may need to go through the material more than once and work the calculations by hand with pen and paper. Deep learning libraries handle backpropagation automatically, but understanding what they do makes it easier to use them well.

This overall topic is divided into three parts:

1.  **Part 1 (Current focus):** Understanding the overall application of backpropagation in the CNN architecture.
2.  **Part 2 (Next video):** Discussing how backpropagation applies specifically to the **Convolution, Max Pooling, and Flattening operations**.
3.  **Part 3 (Last video):** Applying backpropagation to a more complex CNN architecture.

### II. Simple CNN Architecture and Data Flow

To simplify the explanation, a basic CNN architecture is used.

#### A. Architecture Overview

1.  **Input:** An image ($X$), assumed to be $6 \times 6$.
2.  **Convolution:** The image is passed through a $3 \times 3$ filter.
    *   *Output Shape:* $4 \times 4$ Feature Map.
3.  **Activation:** The Feature Map passes through a **ReLU operation**.
    *   *Output Shape:* Remains $4 \times 4$.
4.  **Pooling:** Max Pooling is applied (e.g., $2 \times 2$ window and stride of 2).
    *   *Output Shape:* $2 \times 2$.
5.  **Flattening:** The $2 \times 2$ output is flattened, resulting in a 1D vector of length 4.
6.  **Fully Connected Layer:** The 4 inputs are sent to a single output neuron.
7.  **Prediction:** The output provides a prediction (e.g., $1 \times 1$ scalar).
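
The shapes above follow from the standard output-size formula for a sliding window. The worked arithmetic below assumes a stride of 1 and no padding for the convolution, which the notes do not state explicitly but which is the setting that turns $6 \times 6$ into $4 \times 4$. $N$ is the input size, $K$ the window size, and $S$ the stride.

$$O = \left\lfloor \frac{N - K}{S} \right\rfloor + 1$$

$$\text{Convolution: } \frac{6 - 3}{1} + 1 = 4 \qquad \text{Max Pooling: } \frac{4 - 2}{2} + 1 = 2$$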

#### B. Trainable Parameters

There are only two locations in this simple architecture where **trainable parameters** (weights and biases) exist.

| Location | Parameter | Count | Description |
| :--- | :--- | :--- | :--- |
| **Convolution Filter** | $W_1$ | 9 weights ($3 \times 3$) | The values inside the $3 \times 3$ filter. |
| | $B_1$ | 1 bias (Shape $1 \times 1$) | The filter's bias. |
| **Fully Connected Layer** | $W_2$ | 4 weights (Shape $1 \times 4$) | Weights connecting the 4 flattened inputs to the single neuron. |
| | $B_2$ | 1 bias (Shape $1 \times 1$) | The output neuron's bias. |
| **Total** | | **15 trainable parameters** | Optimising these 15 parameters is the goal of backpropagation. |

The Loss Function used is **Binary Cross-Entropy** (applicable for binary classification, like determining if an image is a dog or not).

### III. Logical Flow and Forward Propagation

To understand backpropagation, the entire CNN architecture is represented by a logical diagram.

#### A. Logical Diagram Flow

$$X \xrightarrow{\text{Conv}, W_1, B_1} Z_1 \xrightarrow{\text{ReLU}} A_1 \xrightarrow{\text{Max Pooling}} P_1 \xrightarrow{\text{Flatten}} F \xrightarrow{\text{FC, } W_2, B_2} Z_2 \xrightarrow{\text{Sigmoid}} A_2 (\hat{y}) \rightarrow L$$

#### B. Forward Propagation Equations

The forward propagation describes the calculation flow:

1.  $$Z_1 = \text{Convolution}(X, W_1) + B_1$$
2.  $$A_1 = \text{ReLU}(Z_1)$$
3.  $$P_1 = \text{Max Pooling}(A_1)$$
4.  $$F = \text{Flatten}(P_1)$$
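5.  $$Z_2 = \text{Dot Product}(F, W_2) + B_2$$ (Note: with $W_2$ of shape $1 \times 4$ and $F$ stored as a $4 \times 1$ column, this is the matrix product $W_2 F$; the operand order depends on whether $F$ is laid out as a row or a column).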
6.  $$A_2 = \text{Sigmoid}(Z_2)$$ (This is the prediction $\hat{y}$).
7.  $$L = \text{Loss}(Y, A_2)$$.
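
The seven equations above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the helper names (`conv2d_valid`, `max_pool_2x2`, `forward`), the random inputs, and the label `Y = 1.0` are all assumptions made for the example, and the convolution is assumed to use stride 1 with no padding.

```python
import numpy as np

def conv2d_valid(x, w, b):
    """Slide w over x with stride 1 and no padding, then add the bias."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w) + b
    return out

def max_pool_2x2(a):
    """2x2 max pooling with stride 2 (height and width must be even)."""
    h, w = a.shape
    return a.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, B1, W2, B2, Y):
    Z1 = conv2d_valid(X, W1, B1)      # (4, 4)
    A1 = np.maximum(Z1, 0.0)          # ReLU, (4, 4)
    P1 = max_pool_2x2(A1)             # (2, 2)
    F = P1.reshape(-1, 1)             # flatten to a (4, 1) column
    Z2 = W2 @ F + B2                  # (1, 1)
    A2 = sigmoid(Z2)                  # prediction y_hat, (1, 1)
    L = -(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))  # binary cross-entropy
    return Z1, A1, P1, F, Z2, A2, L

# Hypothetical inputs, chosen only to exercise the shapes.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 6))
W1 = rng.standard_normal((3, 3)); B1 = 0.1
W2 = rng.standard_normal((1, 4)); B2 = 0.2
Z1, A1, P1, F, Z2, A2, L = forward(X, W1, B1, W2, B2, Y=1.0)
```

Every intermediate value is returned because the backward pass needs them: $Z_1$ for the ReLU derivative, $A_1$ for the max locations, and $F$ for the gradient of $W_2$.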

### IV. Backpropagation: The Objective

The end goal of backpropagation is to apply the **Gradient Descent algorithm** to find the optimal values for $W_1, B_1, W_2, \text{and } B_2$ that minimise the loss ($L$).

To achieve this, four specific derivatives (gradients) must be calculated:

1.  $$\frac{\partial L}{\partial W_1}$$
2.  $$\frac{\partial L}{\partial B_1}$$
3.  $$\frac{\partial L}{\partial W_2}$$
4.  $$\frac{\partial L}{\partial B_2}$$
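
Once these four gradients are known, gradient descent uses them in its update step. The rule below is the standard form; the learning rate symbol $\eta$ is introduced here for illustration and does not appear elsewhere in these notes.

$$W_1 \leftarrow W_1 - \eta \frac{\partial L}{\partial W_1}, \quad B_1 \leftarrow B_1 - \eta \frac{\partial L}{\partial B_1}, \quad W_2 \leftarrow W_2 - \eta \frac{\partial L}{\partial W_2}, \quad B_2 \leftarrow B_2 - \eta \frac{\partial L}{\partial B_2}$$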

#### A. Conceptual Segmentation

The architecture can be logically separated into two components for easier analysis:

1.  **ANN Part (Fully Connected):** Everything after the Flattening layer (from $F$ onwards).
2.  **CNN Part:** Everything up to the Flattening layer (Convolution, ReLU, Max Pooling).

Backpropagation starts from the loss function ($L$) and moves backwards.

### V. Derivation of Gradients for the ANN Part ($W_2, B_2$)

The calculation begins with the derivatives related to the Fully Connected (ANN) part. These calculations use the Chain Rule.

#### A. Gradient for $W_2$ ($\frac{\partial L}{\partial W_2}$)

Changing $W_2$ affects $Z_2$, which affects $A_2$, which finally affects $L$. The chain rule therefore gives:

$$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial A_2} \times \frac{\partial A_2}{\partial Z_2} \times \frac{\partial Z_2}{\partial W_2}$$

The final calculated derivative for $W_2$ is:
$$\frac{\partial L}{\partial W_2} = (A_2 - Y_i) \times F^T$$

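The resulting shape of $\frac{\partial L}{\partial W_2}$ must match the shape of $W_2$ (which is $1 \times 4$): $(A_2 - Y_i)$ is a $1 \times 1$ value and $F^T$ is $1 \times 4$. Here $Y_i$ is the true label of the single training example being processed.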
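
The step from the three-term chain rule to this compact result can be worked out term by term. The derivation below assumes the single-example Binary Cross-Entropy loss and the Sigmoid activation used in these notes, and writes $Y$ for the label of that example.

$$L = -\left[ Y \log A_2 + (1 - Y) \log (1 - A_2) \right]$$

$$\frac{\partial L}{\partial A_2} = -\frac{Y}{A_2} + \frac{1 - Y}{1 - A_2} = \frac{A_2 - Y}{A_2 (1 - A_2)}$$

$$\frac{\partial A_2}{\partial Z_2} = A_2 (1 - A_2), \qquad \frac{\partial Z_2}{\partial W_2} = F^T$$

$$\frac{\partial L}{\partial W_2} = \frac{A_2 - Y}{A_2 (1 - A_2)} \times A_2 (1 - A_2) \times F^T = (A_2 - Y) \times F^T$$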

#### B. Gradient for $B_2$ ($\frac{\partial L}{\partial B_2}$)

The chain rule for the bias $B_2$ follows a similar structure.

$$\frac{\partial L}{\partial B_2} = \frac{\partial L}{\partial A_2} \times \frac{\partial A_2}{\partial Z_2} \times \frac{\partial Z_2}{\partial B_2}$$

The final calculated derivative for $B_2$ in the single instance case is:
$$\frac{\partial L}{\partial B_2} = A_2 - Y_i$$
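
One way to gain confidence in these two formulas is to compare them against finite-difference estimates. The sketch below does that for the fully connected part only; the function name `loss_fc`, the random values of `F` and `W2`, and the label `Y = 1.0` are assumptions made for the check.

```python
import numpy as np

def loss_fc(F, W2, B2, Y):
    """Forward pass of the ANN part: FC layer, Sigmoid, binary cross-entropy."""
    Z2 = W2 @ F + B2
    A2 = 1.0 / (1.0 + np.exp(-Z2))
    L = -(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
    return L.item(), A2

# Hypothetical values standing in for the flattened features and parameters.
rng = np.random.default_rng(1)
F = rng.standard_normal((4, 1))
W2 = rng.standard_normal((1, 4))
B2 = np.array([[0.3]])
Y = 1.0

# Analytic gradients from the formulas in this section.
_, A2 = loss_fc(F, W2, B2, Y)
dW2 = (A2 - Y) @ F.T          # (1, 1) @ (1, 4) -> (1, 4), same shape as W2
dB2 = A2 - Y                  # (1, 1)

# Central finite differences for comparison.
eps = 1e-6
num_dW2 = np.zeros_like(W2)
for k in range(W2.shape[1]):
    Wp = W2.copy(); Wp[0, k] += eps
    Wm = W2.copy(); Wm[0, k] -= eps
    num_dW2[0, k] = (loss_fc(F, Wp, B2, Y)[0] - loss_fc(F, Wm, B2, Y)[0]) / (2 * eps)
num_dB2 = (loss_fc(F, W2, B2 + eps, Y)[0] - loss_fc(F, W2, B2 - eps, Y)[0]) / (2 * eps)
```

If the formulas are right, `dW2` and `num_dW2` agree to several decimal places, and likewise for the bias.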

### CNN Architecture and Chain Rule
A basic CNN architecture involves the following steps:
Input $X$ $\rightarrow$ Convolution ($Z_1$) $\rightarrow$ ReLU ($A_1$) $\rightarrow$ Max Pooling ($P_1$) $\rightarrow$ Flatten ($F$) $\rightarrow$ Fully Connected ($Z_2$) $\rightarrow$ Sigmoid ($A_2$ or $\hat{y}$) $\rightarrow$ Loss ($L$).

To calculate $\frac{\partial L}{\partial W_1}$, the chain rule is applied:
$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial A_2} \times \frac{\partial A_2}{\partial Z_2} \times \frac{\partial Z_2}{\partial F} \times \frac{\partial F}{\partial P_1} \times \frac{\partial P_1}{\partial A_1} \times \frac{\partial A_1}{\partial Z_1} \times \frac{\partial Z_1}{\partial W_1}$$

*Note: The product of the initial terms $\frac{\partial L}{\partial A_2} \times \frac{\partial A_2}{\partial Z_2}$ simplifies to $A_2 - Y$ (where $Y$ is the target label). The next term, $\frac{\partial Z_2}{\partial F}$, is effectively the weights of the fully connected layer, $W_2$. The resulting product of the first three terms yields $\frac{\partial L}{\partial F}$.*
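
Written out with shapes, the product described in the note looks as follows. This assumes the column layout used earlier, with $W_2$ of shape $1 \times 4$ and $F$ of shape $4 \times 1$, so that $Z_2 = W_2 F + B_2$.

$$\frac{\partial L}{\partial F} = W_2^T \times (A_2 - Y)$$

$W_2^T$ is $4 \times 1$ and $(A_2 - Y)$ is $1 \times 1$, so $\frac{\partial L}{\partial F}$ is $4 \times 1$, the same shape as $F$.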

***

## Backpropagation on Specific CNN Layers

### 1. Backpropagation on the Flatten Layer
The Flatten operation converts a 2D tensor (like the Max Pooling output $P_1$) into a 1D vector ($F$).

*   **No Trainable Parameters:** The Flatten layer has no trainable parameters and only rearranges values, so there is no derivative formula to work out; each gradient value simply travels back to the position its input came from.
*   **Reverse Operation:** Backpropagation is achieved by performing the exact reverse of the forward operation.
*   **Procedure:** The incoming gradient ($\frac{\partial L}{\partial F}$, which is a $1\text{D}$ vector) is simply **reshaped** back into the multidimensional format of $P_1$ (e.g., converting a $4 \times 1$ vector back to a $2 \times 2$ matrix). This reshaped matrix represents $\frac{\partial L}{\partial P_1}$.
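
The reshape described above can be sketched in a few lines of NumPy. The gradient values in `dF` are made up for the example.

```python
import numpy as np

# Forward: flatten the (2, 2) pooled output into a (4, 1) column.
P1 = np.arange(4.0).reshape(2, 2)
F = P1.reshape(-1, 1)

# Backward: reshape the incoming gradient back to the shape of P1.
dF = np.array([[0.1], [-0.2], [0.3], [0.4]])   # hypothetical dL/dF
dP1 = dF.reshape(P1.shape)                     # dL/dP1, shape (2, 2)
```

Because the same row-major ordering is used in both directions, the gradient for `F[k]` lands on the element of `P1` that produced `F[k]`.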

### 2. Backpropagation on the Max Pooling Layer
The Max Pooling operation selects the maximum item in each window (e.g., $4 \times 4$ input $A_1$ becomes $2 \times 2$ output $P_1$).

*   **No Trainable Parameters:** The Max Pooling layer also has **no trainable parameters**.
*   **Reverse Operation (Unpooling):** The gradient is propagated backward by reversing the pooling operation.
*   **Procedure:** Only the numbers that were **maximum** in their respective windows during the forward pass contributed to the final prediction and, therefore, the loss.
    *   The incoming gradient matrix ($\frac{\partial L}{\partial P_1}$, e.g., $2 \times 2$) is used to create a larger matrix (the shape of $A_1$, e.g., $4 \times 4$).
    *   The values of the incoming gradient are placed precisely in the locations (indices) where the maximum items were located in the original matrix $A_1$ (before pooling).
    *   All other positions in the resulting matrix are set to **zero**, because the non-maximum values did not contribute to the prediction or the loss.

Mathematically, $\frac{\partial L}{\partial A_1}(x, y)$ equals $\frac{\partial L}{\partial P_1}(m, n)$ if $A_1(x, y)$ was the maximum element in its pooling region, and 0 otherwise, where $(m, n)$ is the index of the pooling window that contains position $(x, y)$.
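
A small worked example makes the routing concrete. The sketch below assumes a $2 \times 2$ window with stride 2; the function name `max_pool_backward_2x2` and the numbers in `A1` and `dP1` are invented for illustration. When a window contains ties, `np.argmax` picks the first maximum, which is one possible convention.

```python
import numpy as np

def max_pool_backward_2x2(A1, dP1):
    """Route each value of dP1 to the position of the max in its 2x2 window."""
    dA1 = np.zeros_like(A1, dtype=float)
    for m in range(dP1.shape[0]):
        for n in range(dP1.shape[1]):
            window = A1[2 * m:2 * m + 2, 2 * n:2 * n + 2]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            dA1[2 * m + r, 2 * n + c] = dP1[m, n]
    return dA1

# Hypothetical ReLU output and incoming gradient.
A1 = np.array([[1., 3., 0., 2.],
               [4., 2., 1., 0.],
               [0., 0., 5., 1.],
               [2., 1., 0., 3.]])
dP1 = np.array([[10., 20.],
                [30., 40.]])

dA1 = max_pool_backward_2x2(A1, dP1)
# The maxima sit at (1, 0), (0, 3), (3, 0) and (2, 2), so those four
# positions receive 10, 20, 30 and 40; every other entry stays 0.
```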

### 3. Backpropagation on the ReLU Activation Layer
The relationship between $A_1$ (ReLU output) and $Z_1$ (convolution output) is used to find $\frac{\partial A_1}{\partial Z_1}$.

*   **Differentiation of ReLU:** The derivative of ReLU is used to calculate $\frac{\partial A_1}{\partial Z_1}$.
    *   If the corresponding item in $Z_1$ is **greater than zero**, the derivative is **1**.
    *   If the corresponding item in $Z_1$ is **less than or equal to zero**, the derivative is **0** (at exactly zero the derivative is undefined, and 0 is used by convention).
*   **Procedure:** To obtain $\frac{\partial L}{\partial Z_1}$, the gradient coming from the Max Pooling layer ($\frac{\partial L}{\partial A_1}$) is **multiplied element-wise** by the ReLU derivative matrix ($\frac{\partial A_1}{\partial Z_1}$).
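
The element-wise masking can be sketched as below. A $2 \times 2$ matrix is used instead of the full $4 \times 4$ to keep the example short, and all values are made up.

```python
import numpy as np

Z1 = np.array([[1.5, -2.0],
               [0.0,  3.0]])      # hypothetical convolution output
dA1 = np.array([[0.5,  0.7],
                [0.9, -1.1]])     # hypothetical dL/dA1 from the pooling layer

relu_grad = (Z1 > 0).astype(float)   # dA1/dZ1: 1 where Z1 > 0, else 0
dZ1 = dA1 * relu_grad                # element-wise product gives dL/dZ1
```

Only the entries where `Z1` was positive keep their gradient, so `dZ1` here is `[[0.5, 0.0], [0.0, -1.1]]`.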

### 4. Backpropagation on the Convolution Layer
This involves calculating the gradients for the bias ($B_1$) and the weights ($W_1$).

#### A. Gradient with Respect to Bias ($\frac{\partial L}{\partial B_1}$)
The bias $B_1$ is added uniformly to every element of the convolution output $Z_1$.

*   **Procedure:** Because $Z_{1, ij}$ (any individual output element) has a derivative of 1 with respect to $B_1$, the total derivative $\frac{\partial L}{\partial B_1}$ is simply the **summation of all the values** in the incoming gradient matrix ($\frac{\partial L}{\partial Z_1}$).
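
Written as a formula for the $4 \times 4$ feature map in this example, the summation is:

$$\frac{\partial L}{\partial B_1} = \sum_{i=1}^{4} \sum_{j=1}^{4} \frac{\partial L}{\partial Z_{1,ij}} \times \frac{\partial Z_{1,ij}}{\partial B_1} = \sum_{i=1}^{4} \sum_{j=1}^{4} \frac{\partial L}{\partial Z_{1,ij}}$$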

#### B. Gradient with Respect to Weights ($\frac{\partial L}{\partial W_1}$)
Calculating the derivative for the weights involves applying the chain rule to individual weights and their corresponding inputs ($X$).

*   **Key Insight (Pattern Recognition):** While the expansion of the derivative results in complex summation terms, the overall operation to find $\frac{\partial L}{\partial W_1}$ simplifies significantly.
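*   **Procedure:** The derivative of the loss with respect to the filter weights ($\frac{\partial L}{\partial W_1}$) is obtained by performing a **convolution operation of the input image ($X$) with the incoming loss gradient matrix ($\frac{\partial L}{\partial Z_1}$)**. This is the same sliding-window operation as in the forward pass, with the $4 \times 4$ gradient matrix acting as the filter over the $6 \times 6$ input, so the result is $3 \times 3$ and matches the shape of $W_1$.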
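
The pattern can be checked numerically. The sketch below is an illustration under stated assumptions: stride 1 and no padding, random values for `X`, `W1`, and the incoming gradient `G`, and a stand-in loss `toy_loss` constructed so that its gradient with respect to $Z_1$ is exactly `G`. None of these names come from the lecture.

```python
import numpy as np

def conv2d_valid(x, k):
    """Slide k over x with stride 1 and no padding (no kernel flip)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 6))     # hypothetical input image
W1 = rng.standard_normal((3, 3))    # hypothetical filter
B1 = 0.1
G = rng.standard_normal((4, 4))     # hypothetical dL/dZ1

def toy_loss(W):
    """Stand-in loss L = sum(G * Z1), whose gradient w.r.t. Z1 is G."""
    return np.sum(G * (conv2d_valid(X, W) + B1))

# Gradients according to the procedures in this section.
dW1 = conv2d_valid(X, G)            # (6, 6) input, (4, 4) "filter" -> (3, 3)
dB1 = G.sum()                       # sum of all entries of dL/dZ1

# Central finite differences on each filter weight for comparison.
eps = 1e-6
num_dW1 = np.zeros_like(W1)
for a in range(3):
    for b in range(3):
        Wp = W1.copy(); Wp[a, b] += eps
        Wm = W1.copy(); Wm[a, b] -= eps
        num_dW1[a, b] = (toy_loss(Wp) - toy_loss(Wm)) / (2 * eps)
```

Each entry `dW1[a, b]` is the sum over all output positions `(i, j)` of `G[i, j] * X[i + a, j + b]`, which is exactly what sliding `G` over `X` computes.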
