# In-depth ResNet Tutorial with Detailed Mathematical Formulations

ResNet, short for Residual Network, revolutionized deep learning by enabling the training of much deeper networks through the use of residual blocks. This architecture was introduced by Kaiming He et al. in 2015 and has proven to be highly effective across various tasks. This tutorial provides a detailed mathematical analysis of ResNet's operations, including its characteristic skip connections.

## ResNet Architecture Overview

ResNet architectures come in various depths, including ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152. Here, we'll focus on the general structure that is common to all variants, which consists of:
1. **Input Layer**: This processes the input image.
2. **Initial Convolution and Max Pooling Layer**: Prepares the data for the residual blocks.
3. **Residual Blocks**: These are the core of ResNet.
4. **Global Average Pooling**: Averages each feature map over its spatial dimensions, leaving one value per channel.
5. **Fully Connected Output Layer**: Produces the final classification output.
6. **Output Layer**: Softmax for classification.

### Residual Blocks

A key component of ResNet is the residual block, which includes skip connections that allow activations to be forwarded directly across layers. For simplicity, let's consider a two-layer block used in ResNet-34 as an example.

## Detailed Mathematical Operations

### Initial Convolution Layer
- **Forward Pass**:
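  - **Formula**: $O = \sigma(W * X + b)$, where $*$ denotes convolution, $W$ is the kernel, $b$ is the bias, and $\sigma$ is the activation function (ReLU in ResNet).
- **Backward Pass**:
  - **Gradient w.r.t. input**: $\frac{\partial L}{\partial X} = W^T * \left(\frac{\partial L}{\partial O} \cdot \sigma'(W * X + b)\right)$, where $\cdot$ is element-wise multiplication and $W^T *$ denotes the transposed convolution (the adjoint of the forward operation).
  - **Gradient w.r.t. weights**: $\frac{\partial L}{\partial W} = X * \left(\frac{\partial L}{\partial O} \cdot \sigma'(W * X + b)\right)$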
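
The convolution gradients can be checked numerically on a toy case. The sketch below is a simplification rather than ResNet's actual layer: it assumes a single-channel 1D signal, a "valid" cross-correlation, a ReLU activation, and a toy loss equal to the sum of the outputs (so $\frac{\partial L}{\partial O} = 1$). The function names are illustrative only.

```python
def conv1d_forward(x, w, b):
    """Valid 1D cross-correlation plus bias, then ReLU. Returns (output, pre-activation)."""
    n_out = len(x) - len(w) + 1
    pre = [sum(w[k] * x[i + k] for k in range(len(w))) + b for i in range(n_out)]
    return [max(0.0, p) for p in pre], pre

def conv1d_backward(x, w, pre, grad_out):
    """Gradients of the loss w.r.t. input, weights and bias."""
    # delta = dL/dO * sigma'(pre); for ReLU, sigma' is 1 where pre > 0, else 0
    delta = [g if p > 0 else 0.0 for g, p in zip(grad_out, pre)]
    grad_w = [sum(delta[i] * x[i + k] for i in range(len(delta))) for k in range(len(w))]
    grad_x = [0.0] * len(x)
    for i, d in enumerate(delta):
        for k in range(len(w)):
            grad_x[i + k] += d * w[k]  # each output position scatters back to its inputs
    return grad_x, grad_w, sum(delta)

def loss(x, w, b):
    return sum(conv1d_forward(x, w, b)[0])  # toy loss (an assumption of this sketch)

x = [1.0, -2.0, 3.0, 0.5, -1.0]
w = [0.5, -1.0, 2.0]
b = 0.1
out, pre = conv1d_forward(x, w, b)
grad_x, grad_w, grad_b = conv1d_backward(x, w, pre, [1.0] * len(out))

# Central finite differences on the input as an independent check
eps = 1e-6
numeric_grad_x = []
for j in range(len(x)):
    x_plus, x_minus = list(x), list(x)
    x_plus[j] += eps
    x_minus[j] -= eps
    numeric_grad_x.append((loss(x_plus, w, b) - loss(x_minus, w, b)) / (2 * eps))
```

With these values only the first output position is active, so the analytic input gradient is just the kernel placed at that position, and the finite-difference estimate should agree with it up to rounding.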

### Residual Block
Each residual block has an identity shortcut connection that skips one or more layers:
- **Forward Pass**:
  - **Input to Block**: $x$
  - **First Layer**: $y = \sigma(W_1 * x + b_1)$
  - **Second Layer**: $z = W_2 * y + b_2$
  - **Output of Block**: $O = \sigma(z + x)$
- **Backward Pass** (using chain rule and assuming ReLU activations for simplicity):
  - **Gradient through Second Layer**:
    - $\frac{\partial L}{\partial z} = \frac{\partial L}{\partial O} \cdot \sigma'(z + x)$
    - $\frac{\partial L}{\partial W_2} = y \cdot \frac{\partial L}{\partial z}$
    - $\frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial z}$
  - **Gradient through First Layer**:
    - $\frac{\partial L}{\partial y} = W_2^T \cdot \frac{\partial L}{\partial z}$
    - $\frac{\partial L}{\partial W_1} = x \cdot (\frac{\partial L}{\partial y} \cdot \sigma'(W_1 * x + b_1))$
    - $\frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial y} \cdot \sigma'(W_1 * x + b_1)$
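  - **Gradient w.r.t. input of Block**:
    - $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial O} \cdot \sigma'(z + x) + W_1^T \cdot (\frac{\partial L}{\partial y} \cdot \sigma'(W_1 * x + b_1))$
    - The first term equals $\frac{\partial L}{\partial z}$ and arrives through the skip connection without passing through $W_1$ or $W_2$; the second term is the usual path through both layers.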
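
The block's forward and backward passes can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: dense matrices stand in for the convolution kernels $W_1$ and $W_2$, the activation is ReLU, batch normalization is omitted (as in the formulas above), and the loss is the sum of the block outputs. The helper names are made up for this example.

```python
def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def matTvec(W, v):
    """Multiply by the transpose of W."""
    return [sum(W[i][j] * v[i] for i in range(len(W))) for j in range(len(W[0]))]

def relu(v):
    return [max(0.0, a) for a in v]

def relu_grad(v):
    return [1.0 if a > 0 else 0.0 for a in v]

def block_forward(x, W1, b1, W2, b2):
    a1 = [p + q for p, q in zip(matvec(W1, x), b1)]  # W1 x + b1
    y = relu(a1)                                     # first layer
    z = [p + q for p, q in zip(matvec(W2, y), b2)]   # second layer
    s = [p + q for p, q in zip(z, x)]                # add the skip connection
    return relu(s), (a1, s)

def block_backward(W1, W2, cache, grad_out):
    a1, s = cache
    grad_z = [g * d for g, d in zip(grad_out, relu_grad(s))]
    grad_y = matTvec(W2, grad_z)
    grad_a1 = [g * d for g, d in zip(grad_y, relu_grad(a1))]
    # skip-path term (grad_z) plus the path through both layers
    return [skip + main for skip, main in zip(grad_z, matTvec(W1, grad_a1))]

x = [1.0, -0.5]
W1, b1 = [[0.5, -1.0], [1.5, 0.5]], [0.1, -0.2]
W2, b2 = [[1.0, 0.5], [-0.5, 2.0]], [0.0, 0.3]

out, cache = block_forward(x, W1, b1, W2, b2)
grad_x = block_backward(W1, W2, cache, [1.0, 1.0])  # dL/dO = 1 for L = sum(O)

# Central finite differences as an independent check
eps = 1e-6
numeric_grad_x = []
for i in range(len(x)):
    x_plus, x_minus = list(x), list(x)
    x_plus[i] += eps
    x_minus[i] -= eps
    numeric_grad_x.append((sum(block_forward(x_plus, W1, b1, W2, b2)[0])
                           - sum(block_forward(x_minus, W1, b1, W2, b2)[0])) / (2 * eps))
```

Even if `W1` or `W2` were scaled toward zero, `grad_x` would still contain the `grad_z` term from the skip path, which is the property that keeps gradients from shrinking as blocks are stacked.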

### Global Average Pooling
- **Forward Pass**:
  - **Formula**: $O_k = \frac{1}{N} \sum_{i=1}^N x_{ik}$
- **Backward Pass**:
  - **Gradient w.r.t. input**: $\frac{\partial L}{\partial x_{ik}} = \frac{1}{N} \frac{\partial L}{\partial O_k}$
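
As a small illustration of both formulas, the sketch below stores a feature map as a list of $N$ spatial positions, each holding $K$ channel values. That layout and the function names are assumptions made for this example.

```python
def gap_forward(x):
    """x[i][k] is the value at spatial position i, channel k. Returns K channel means."""
    n = len(x)
    return [sum(row[k] for row in x) / n for k in range(len(x[0]))]

def gap_backward(grad_out, n):
    """Every spatial position receives 1/N of the upstream gradient for its channel."""
    return [[g / n for g in grad_out] for _ in range(n)]

# 4 spatial positions (e.g. a flattened 2x2 map), 2 channels
x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [6.0, 40.0]]
pooled = gap_forward(x)                          # [3.0, 25.0]
grad_x = gap_backward([4.0, 8.0], n=len(x))      # each row is [1.0, 2.0]
```

The layer has no parameters, so the backward pass only redistributes the incoming gradient evenly over the positions that were averaged.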

### Fully Connected and Output Layers
- **Forward Pass**:
  - **Formula**: $O = Wx + b$
  - **Softmax**: $S_k = \frac{e^{O_k}}{\sum_{i} e^{O_i}}$
- **Backward Pass**:
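  - **Gradient w.r.t. output**: $\frac{\partial L}{\partial O_k} = S_k - y_k$, assuming a cross-entropy loss $L = -\sum_k y_k \log S_k$ (where $y_k$ is the target probability for class $k$).
  - **Gradient w.r.t. weights and input**: By the chain rule, $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial O} x^T$, $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial O}$, and $\frac{\partial L}{\partial x} = W^T \frac{\partial L}{\partial O}$.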
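
The $S_k - y_k$ result holds when softmax is paired with a cross-entropy loss, and it can be verified numerically. The sketch below assumes that loss, uses a tiny 2-input, 3-class layer with arbitrary values, and propagates the gradient to the weights, bias and input with the standard chain-rule expressions. All names are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(o - m) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    return -sum(t * math.log(p) for t, p in zip(target, softmax(logits)) if t > 0)

def fc_forward(W, x, b):
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]

def fc_softmax_backward(W, x, b, target):
    grad_o = [s - t for s, t in zip(softmax(fc_forward(W, x, b)), target)]  # S_k - y_k
    grad_W = [[g * v for v in x] for g in grad_o]  # outer product of grad_o and x
    grad_b = list(grad_o)
    grad_x = [sum(W[k][j] * grad_o[k] for k in range(len(W))) for j in range(len(x))]
    return grad_o, grad_W, grad_b, grad_x

W = [[0.2, -0.5], [1.0, 0.3], [-0.7, 0.8]]
x = [1.5, -2.0]
b = [0.0, 0.1, -0.1]
target = [0.0, 1.0, 0.0]  # one-hot label for class 1
grad_o, grad_W, grad_b, grad_x = fc_softmax_backward(W, x, b, target)

# Central finite differences on the input as an independent check
eps = 1e-6
numeric_grad_x = []
for j in range(len(x)):
    x_plus, x_minus = list(x), list(x)
    x_plus[j] += eps
    x_minus[j] -= eps
    numeric_grad_x.append((cross_entropy(fc_forward(W, x_plus, b), target)
                           - cross_entropy(fc_forward(W, x_minus, b), target)) / (2 * eps))
```

Because both the softmax outputs and the target sum to one, the entries of `grad_o` sum to zero, which is a quick sanity check on any implementation.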

## Conclusion

This comprehensive mathematical breakdown of ResNet illustrates the power of residual learning and its implications for training deep neural networks. By adding shortcut connections, ResNet allows gradients to flow directly through the network, making it possible to train much deeper networks effectively.


## ResNet-50 Overview

ResNet-50, developed by Kaiming He et al., represents a significant advancement in deep learning architectures through its innovative use of residual learning. It is widely recognized for its ability to train very deep networks effectively, vastly improving performance on a range of computer vision tasks.

### Innovations of ResNet-50

ResNet-50 combines several architectural ideas, some new and some adopted from earlier work:

1. **Residual Blocks**: These blocks help address the vanishing gradient problem by introducing shortcut connections that skip one or more layers.
2. **Bottleneck Layers**: These layers reduce the number of parameters and computational complexity by using a 1x1 convolution to reduce and then increase dimensions, sandwiching a 3x3 convolution.
3. **Identity Mappings**: These shortcuts are used directly when the input and output dimensions are the same, aiding in the uninterrupted flow of gradients. When the dimensions differ, the shortcut uses a 1x1 projection instead.
4. **Global Average Pooling**: This replaces the stack of large fully connected layers used at the top of earlier networks such as VGG, leaving a single fully connected classifier and significantly reducing the number of parameters.
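
To give a sense of the savings from the bottleneck design, the sketch below counts convolution weights for a 256-channel bottleneck (1x1 down to 64, 3x3 at 64, 1x1 back up to 256) and compares it with a hypothetical block of two 3x3 convolutions operating directly on 256 channels. The comparison block is an assumption chosen for illustration, and biases and batch-normalization parameters are left out of both counts.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out

# Bottleneck: reduce with 1x1, convolve with 3x3, restore with 1x1
bottleneck = (conv_weights(1, 256, 64)
              + conv_weights(3, 64, 64)
              + conv_weights(1, 64, 256))

# Hypothetical alternative: two 3x3 convolutions at full width
two_3x3 = 2 * conv_weights(3, 256, 256)

print(bottleneck)                      # 69632
print(two_3x3)                         # 1179648
print(round(two_3x3 / bottleneck, 1))  # 16.9
```

Under these assumptions the bottleneck needs roughly one seventeenth of the weights, because the expensive 3x3 convolution runs on 64 channels instead of 256.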

### Detailed Architecture and Parameter Calculation

Below is a breakdown of each significant layer of ResNet-50, detailing dimensions, configurations, and parameter calculations:

| Layer                 | Input Dimension              | Output Dimension             | Kernel Size/Stride/Pad | Parameters Formula                                                  | Number of Parameters |
|-----------------------|------------------------------|------------------------------|------------------------|---------------------------------------------------------------------|----------------------|
| **Input**             | $224 \times 224 \times 3$    | N/A                          | N/A                    | N/A                                                                 | 0                    |
| **Initial Conv**      | $224 \times 224 \times 3$    | $112 \times 112 \times 64$   | $7 \times 7$, S=2, P=3 | $(7 \times 7 \times 3) \times 64 + 64$                              | 9,472                |
| **Max Pooling**       | $112 \times 112 \times 64$   | $56 \times 56 \times 64$     | $3 \times 3$, S=2, P=1 | 0                                                                   | 0                    |
| **Bottleneck Block**  | $56 \times 56 \times 256$    | $56 \times 56 \times 256$    | Mixed                  | Varies by sub-layer                                                 | Varies               |
| **Global Avg Pooling**| $7 \times 7 \times 2048$     | $1 \times 1 \times 2048$     | Global                | 0                                                                   | 0                    |
| **Fully Connected**   | $2048$                       | $1000$                       | N/A                    | $2048 \times 1000 + 1000$                                           | 2,049,000            |

### Calculation Formulas

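- **Parameter Formula for Conv Layers**: For convolution layers, the formula is $(K \times K \times C_{\text{in}}) \times C_{\text{out}} + C_{\text{out}}$, where $K$ is the kernel size, $C_{\text{in}}$ is the number of input channels, and $C_{\text{out}}$ is the number of output channels. The $+ C_{\text{out}}$ term counts the biases; implementations that follow each convolution with batch normalization often omit the bias, in which case the initial convolution has $9{,}408$ weights instead of $9{,}472$ parameters.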
- **Output Dimension for Conv Layers**: Given by $\left\lfloor \frac{W-K+2P}{S} + 1 \right\rfloor \times \left\lfloor \frac{H-K+2P}{S} + 1 \right\rfloor$, where $W$ and $H$ are the width and height of the input, $K$ is the kernel size, $P$ is the padding, and $S$ is the stride.
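
Both formulas are easy to check against the table. The helper functions below are a minimal sketch with names chosen for this example; they assume square kernels and the same stride and padding along both axes.

```python
def conv_params(kernel, c_in, c_out, bias=True):
    """(K * K * C_in) * C_out weights, plus C_out biases when bias=True."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)

def output_size(size, kernel, stride, padding):
    """Output size along one axis: floor((size - K + 2P) / S) + 1."""
    return (size - kernel + 2 * padding) // stride + 1

def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

after_conv = output_size(224, kernel=7, stride=2, padding=3)         # initial conv
after_pool = output_size(after_conv, kernel=3, stride=2, padding=1)  # max pooling
initial_conv = conv_params(7, 3, 64)
classifier = fc_params(2048, 1000)

print(after_conv, after_pool, initial_conv, classifier)  # 112 56 9472 2049000
```

Since the added 1 is an integer, taking the floor before or after adding it gives the same result, so the integer division here matches the formula as written.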

### Advantages of ResNet-50

- **Allows Deeper Networks**: Mitigates the vanishing gradient problem, enabling the training of networks that are much deeper than was previously practical.
- **Improved Accuracy**: Demonstrates significantly better accuracy on complex tasks due to its deeper and more sophisticated architecture.
- **Efficient Use of Parameters**: Uses bottleneck layers and global average pooling to reduce the total number of parameters, making it computationally more efficient.

### Disadvantages of ResNet-50

- **High Computational Requirement**: Despite its efficient use of parameters, the training and inference are still computationally intensive.
- **Complexity**: The architecture's complexity can make it challenging to implement and optimize for specific tasks or hardware.

### Key Properties of ResNet

- **Skip Connections**: Facilitate training by allowing gradients to flow through the network without hindrance.
- **Feature Reusability**: Enhances the network's ability to reuse features, which improves learning efficiency and effectiveness.
- **Scalability**: The architecture can be scaled up (e.g., ResNet-101, ResNet-152) effectively to improve performance without substantial modifications.

ResNet-50 continues to be one of the most influential models in the field of deep learning, illustrating a significant shift in how deep learning architectures are constructed and understood.
