# Comprehensive EfficientNet Tutorial with Detailed Mathematical Formulations

EfficientNet is a convolutional neural network architecture introduced by Tan and Le, which optimizes performance through a compound scaling method. This tutorial provides a detailed mathematical breakdown of EfficientNet operations, including forward and backward passes for each layer.

## EfficientNet Architecture Overview

EfficientNet uses a systematic scaling method to scale the depth, width, and resolution of the network uniformly. Key components include:

1. **Input Layer**: Processes the input image.
2. **Stem Layer**: Initial convolution layer.
3. **MBConv Blocks**: Mobile Inverted Bottleneck Convolutional Blocks with SE (Squeeze-and-Excitation).
4. **Head Layers**: A final $1 \times 1$ convolution followed by global average pooling, applied before classification.
5. **Fully Connected Output Layer**: Produces the classification output.
6. **Output Layer**: Applies a softmax function for classification.

### Stem Layer
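- **Forward Pass**:
  - **Formula**: $O = \sigma(W \ast X + b)$
    - Where $\ast$ denotes the convolution operation and $\sigma$ is the activation function (Swish, $\sigma(a) = a \cdot \text{sigmoid}(a)$, in EfficientNet). Batch normalization, which sits between the convolution and the activation in the actual network, is omitted here for brevity.
- **Backward Pass** (with $\delta = \frac{\partial L}{\partial O} \odot \sigma'(W \ast X + b)$, where $\odot$ is the elementwise product):
  - **Gradient w.r.t. input**: $\frac{\partial L}{\partial X} = W^T \ast \delta$, where $W^T$ denotes the spatially flipped kernel with input and output channels swapped
  - **Gradient w.r.t. weights**: $\frac{\partial L}{\partial W} = X \ast \delta$
  - **Gradient w.r.t. bias**: $\frac{\partial L}{\partial b} = \sum_{i,j} \delta_{ij}$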
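
As a minimal sketch of the stem formulas, the NumPy code below implements a single-channel, stride-1, unpadded convolution (computed as cross-correlation, as deep learning frameworks do) followed by Swish, then checks one weight gradient against a finite difference. The helper names `conv2d_forward` and `conv2d_backward` and the choice of Swish for $\sigma$ are assumptions for illustration; the real stem uses 3 input channels, 32 filters, stride 2, and batch normalization.

```python
import numpy as np

def swish(a):
    return a / (1.0 + np.exp(-a))

def swish_grad(a):
    sig = 1.0 / (1.0 + np.exp(-a))
    return sig + a * sig * (1.0 - sig)

def conv2d_forward(X, W, b):
    """Valid cross-correlation of single-channel X with kernel W, then Swish."""
    k = W.shape[0]
    out_h, out_w = X.shape[0] - k + 1, X.shape[1] - k + 1
    A = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            A[i, j] = np.sum(X[i:i + k, j:j + k] * W) + b
    return swish(A), A

def conv2d_backward(X, W, A, dO):
    """Gradients of the loss w.r.t. X, W and b, given dO = dL/dO."""
    k = W.shape[0]
    delta = dO * swish_grad(A)  # dL/dO times sigma'(pre-activation), elementwise
    dX = np.zeros_like(X)
    dW = np.zeros_like(W)
    for i in range(delta.shape[0]):
        for j in range(delta.shape[1]):
            dX[i:i + k, j:j + k] += delta[i, j] * W
            dW += delta[i, j] * X[i:i + k, j:j + k]
    return dX, dW, delta.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 6))
W = rng.normal(size=(3, 3))
b = 0.1
O, A = conv2d_forward(X, W, b)
dX, dW, db = conv2d_backward(X, W, A, np.ones_like(O))  # loss L = sum(O)

# Central finite-difference check on one weight entry
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (conv2d_forward(X, W_plus, b)[0].sum()
           - conv2d_forward(X, W_minus, b)[0].sum()) / (2 * eps)
print(abs(numeric - dW[1, 2]) < 1e-5)
```

The explicit loops keep the correspondence with the formulas visible; a practical implementation would vectorize them or rely on a framework's autograd.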

### MBConv Block
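An MBConv block consists of a $1 \times 1$ expansion convolution, a depthwise convolution, a squeeze-and-excitation layer, and a $1 \times 1$ projection convolution. When the stride is 1 and the input and output channel counts match, the block input is added back to the output as a residual connection.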

- **Forward Pass**:
  - **Expansion Phase** ($1 \times 1$ convolution):
    - **Formula**: $z = \sigma(a_e)$, with $a_e = W_e \ast x + b_e$
  - **Depthwise Convolution**:
    - **Formula**: $d = \sigma(a_d)$, with $a_d = W_d \ast z + b_d$
  - **Squeeze-and-Excitation**:
    - **Squeeze**: $s = \text{avg\_pool}(d)$, i.e. $s_c = \frac{1}{H \times W} \sum_{i,j} d_{ijc}$
    - **Excite**: $e = g \odot d$, with $h = \sigma(W_{se1} \cdot s + b_{se1})$ and $g = \text{sigmoid}(W_{se2} \cdot h + b_{se2})$
      - The gate $g$ has one value per channel and is broadcast over all spatial positions. The outer nonlinearity is a sigmoid; the inner one is the block activation $\sigma$.
  - **Projection Phase** ($1 \times 1$ convolution, no activation):
    - **Formula**: $O = W_p \ast e + b_p$
- **Backward Pass**:
  - **Gradient w.r.t. Projection Phase**:
    - $\frac{\partial L}{\partial e} = W_p^T \ast \frac{\partial L}{\partial O}$
    - $\frac{\partial L}{\partial W_p} = e \ast \frac{\partial L}{\partial O}$
  - **Gradient w.r.t. Squeeze-and-Excitation**:
    - **Excitation Phase**:
      - $\frac{\partial L}{\partial g_c} = \sum_{i,j} \frac{\partial L}{\partial e_{ijc}} \, d_{ijc}$
      - $\delta_2 = \frac{\partial L}{\partial g} \odot g \odot (1 - g)$ and $\frac{\partial L}{\partial W_{se2}} = \delta_2 \cdot h^T$
      - $\delta_1 = (W_{se2}^T \cdot \delta_2) \odot \sigma'(W_{se1} \cdot s + b_{se1})$ and $\frac{\partial L}{\partial W_{se1}} = \delta_1 \cdot s^T$
      - $\frac{\partial L}{\partial s} = W_{se1}^T \cdot \delta_1$
    - **Squeeze Phase** ($d$ receives gradient both directly through $e = g \odot d$ and through the pooled vector $s$):
      - $\frac{\partial L}{\partial d_{ijc}} = g_c \frac{\partial L}{\partial e_{ijc}} + \frac{1}{H \times W} \frac{\partial L}{\partial s_c}$
  - **Gradient w.r.t. Depthwise Convolution** (with $\delta_d = \frac{\partial L}{\partial d} \odot \sigma'(a_d)$):
    - $\frac{\partial L}{\partial z} = W_d^T \ast \delta_d$
    - $\frac{\partial L}{\partial W_d} = z \ast \delta_d$
  - **Gradient w.r.t. Expansion Phase** (with $\delta_e = \frac{\partial L}{\partial z} \odot \sigma'(a_e)$):
    - $\frac{\partial L}{\partial x} = W_e^T \ast \delta_e$
    - $\frac{\partial L}{\partial W_e} = x \ast \delta_e$
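
The squeeze-and-excitation step is the easiest part of the block to get wrong, because the depthwise output $d$ influences the loss along two paths: directly through the channel-wise rescaling, and indirectly through the pooled vector $s$ that produces the gate. The NumPy sketch below implements that step alone and compares the analytic gradient with a finite difference. The names `se_forward`, `se_backward`, `h` (hidden vector), and `g` (sigmoid gate) are illustrative assumptions, as is the use of Swish for the inner activation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def swish(a):
    return a * sigmoid(a)

def swish_grad(a):
    sig = sigmoid(a)
    return sig * (1.0 + a * (1.0 - sig))

def se_forward(d, W1, b1, W2, b2):
    """d: (H, W, C). Returns the rescaled map e and a cache for the backward pass."""
    s = d.mean(axis=(0, 1))        # squeeze: one value per channel, shape (C,)
    u1 = W1 @ s + b1               # reduce to C_r channels
    h = swish(u1)
    g = sigmoid(W2 @ h + b2)       # per-channel gate in (0, 1), shape (C,)
    e = d * g                      # gate broadcast over height and width
    return e, (d, s, u1, h, g)

def se_backward(de, cache, W1, W2):
    """de = dL/de. Returns dL/dd, dL/dW1, dL/dW2."""
    d, s, u1, h, g = cache
    height, width, _ = d.shape
    dg = (de * d).sum(axis=(0, 1))         # gate is shared by all pixels
    delta2 = dg * g * (1.0 - g)            # back through the sigmoid
    dW2 = np.outer(delta2, h)
    delta1 = (W2.T @ delta2) * swish_grad(u1)
    dW1 = np.outer(delta1, s)
    ds = W1.T @ delta1
    dd = de * g + ds / (height * width)    # direct path + squeeze path
    return dd, dW1, dW2

rng = np.random.default_rng(0)
C, C_r = 8, 2
d = rng.normal(size=(4, 4, C))
W1, b1 = rng.normal(size=(C_r, C)), np.zeros(C_r)
W2, b2 = rng.normal(size=(C, C_r)), np.zeros(C)
R = rng.normal(size=d.shape)               # loss L = sum(e * R), so dL/de = R

e, cache = se_forward(d, W1, b1, W2, b2)
dd, dW1, dW2 = se_backward(R, cache, W1, W2)

# Central finite-difference check on one input entry
eps = 1e-6
d_plus, d_minus = d.copy(), d.copy()
d_plus[1, 2, 0] += eps
d_minus[1, 2, 0] -= eps
numeric = (np.sum(se_forward(d_plus, W1, b1, W2, b2)[0] * R)
           - np.sum(se_forward(d_minus, W1, b1, W2, b2)[0] * R)) / (2 * eps)
print(abs(numeric - dd[1, 2, 0]) < 1e-6)
```

Dropping the `ds / (height * width)` term makes the check fail, which is a quick way to confirm that the squeeze path matters.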

### Head Layers
- **Forward Pass**:
  - **Formula**: $O = \sigma(W \ast x + b)$, where the convolution is $1 \times 1$ (320 to 1280 channels in EfficientNet-B0)
- **Backward Pass** (with $\delta = \frac{\partial L}{\partial O} \odot \sigma'(W \ast x + b)$, where $\odot$ is the elementwise product):
  - **Gradient w.r.t. input**: $\frac{\partial L}{\partial x} = W^T \ast \delta$
  - **Gradient w.r.t. weights**: $\frac{\partial L}{\partial W} = x \ast \delta$
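
Because the head convolution is $1 \times 1$, it reduces to the same linear map applied at every pixel, so both passes can be written with plain matrix products. The following NumPy sketch shows this under assumed names (`pointwise_forward`, `pointwise_backward`), with small channel counts in place of the 320 and 1280 used by B0, and without batch normalization.

```python
import numpy as np

def swish(a):
    return a / (1.0 + np.exp(-a))

def swish_grad(a):
    sig = 1.0 / (1.0 + np.exp(-a))
    return sig * (1.0 + a * (1.0 - sig))

def pointwise_forward(x, W, b):
    """x: (H, W, C_in), W: (C_in, C_out), b: (C_out,)."""
    a = x @ W + b                 # same linear map at every pixel
    return swish(a), a

def pointwise_backward(x, W, a, dO):
    delta = dO * swish_grad(a)                 # (H, W, C_out)
    dx = delta @ W.T                           # (H, W, C_in)
    dW = np.einsum("hwi,hwo->io", x, delta)    # accumulate over all pixels
    db = delta.sum(axis=(0, 1))
    return dx, dW, db

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 7, 8))       # stand-in for the 7 x 7 x 320 feature map
W = rng.normal(size=(8, 16)) * 0.1   # stand-in for the 320 -> 1280 projection
b = np.zeros(16)

O, a = pointwise_forward(x, W, b)
dx, dW, db = pointwise_backward(x, W, a, np.ones_like(O))
print(O.shape, dx.shape, dW.shape, db.shape)  # (7, 7, 16) (7, 7, 8) (8, 16) (16,)
```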

### Global Average Pooling
- **Forward Pass**:
  - **Formula**: $O_k = \frac{1}{H \times W} \sum_{i=1}^H \sum_{j=1}^W x_{ijk}$
- **Backward Pass**:
  - **Gradient w.r.t. input**: $\frac{\partial L}{\partial x_{ijk}} = \frac{1}{H \times W} \frac{\partial L}{\partial O_k}$
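
A short NumPy sketch of both passes, assuming a channels-last layout `(H, W, C)` and hypothetical helper names `gap_forward` and `gap_backward`:

```python
import numpy as np

def gap_forward(x):
    """Average each channel over height and width: (H, W, C) -> (C,)."""
    return x.mean(axis=(0, 1))

def gap_backward(x_shape, dO):
    """Every pixel of channel k receives dL/dO_k divided by H * W."""
    height, width, _ = x_shape
    return np.broadcast_to(dO / (height * width), x_shape).copy()

x = np.arange(12, dtype=float).reshape(2, 2, 3)
O = gap_forward(x)
dx = gap_backward(x.shape, np.array([1.0, 2.0, 3.0]))
print(O)         # [4.5 5.5 6.5]
print(dx[0, 0])  # [0.25 0.5  0.75]
```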

### Fully Connected Layers
- **Forward Pass**:
  - **Formula**: $O = W \cdot x + b$
- **Backward Pass**:
  - **Gradient w.r.t. input**: $\frac{\partial L}{\partial x} = W^T \cdot \frac{\partial L}{\partial O}$
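  - **Gradient w.r.t. weights**: $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial O} \cdot x^T$ (an outer product, so the result has the same shape as $W$)
  - **Gradient w.r.t. bias**: $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial O}$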
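
The shapes are the main thing to keep straight here: the weight gradient is an outer product with the same shape as $W$. A minimal NumPy sketch, with `fc_forward` and `fc_backward` as assumed helper names and a toy 3-input, 2-output layer:

```python
import numpy as np

def fc_forward(x, W, b):
    return W @ x + b

def fc_backward(x, W, dO):
    dx = W.T @ dO          # gradient w.r.t. the input, shape of x
    dW = np.outer(dO, x)   # gradient w.r.t. the weights, shape of W
    db = dO                # gradient w.r.t. the bias
    return dx, dW, db

W = np.array([[1.0, 2.0, 3.0],
              [0.0, -1.0, 1.0]])
b = np.array([0.5, 0.0])
x = np.array([1.0, 0.5, -1.0])

O = fc_forward(x, W, b)
dx, dW, db = fc_backward(x, W, np.array([1.0, -2.0]))
print(O)   # [-0.5 -1.5]
print(dx)  # [1. 4. 1.]
```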

### Output Layer - Softmax
- **Forward Pass**:
  - **Formula**: $S_k = \frac{e^{O_k}}{\sum_i e^{O_i}}$
- **Backward Pass**:
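  - **Gradient w.r.t. output of last fully connected layer**: $\frac{\partial L}{\partial O_k} = S_k - y_k$ for the cross-entropy loss $L = -\sum_k y_k \log S_k$, where $y$ is the one-hot target vector ($y_k = 1$ for the target class and $0$ otherwise).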
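
The compact form $S_k - y_k$ relies on pairing softmax with the cross-entropy loss. The sketch below checks that identity numerically; the function names and the example logits are illustrative assumptions.

```python
import numpy as np

def softmax(o):
    shifted = o - o.max()      # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(o, y):
    return -np.sum(y * np.log(softmax(o)))

o = np.array([2.0, 1.0, 0.1])  # logits from the fully connected layer
y = np.array([0.0, 1.0, 0.0])  # one-hot target

analytic = softmax(o) - y

# Central finite differences, one logit at a time
eps = 1e-6
numeric = np.zeros_like(o)
for k in range(o.size):
    step = np.zeros_like(o)
    step[k] = eps
    numeric[k] = (cross_entropy(o + step, y) - cross_entropy(o - step, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```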




## EfficientNet Overview

EfficientNet, developed by Mingxing Tan and Quoc V. Le and introduced in 2019, is a family of convolutional neural networks designed to balance accuracy against parameter count and computation. It relies on a compound scaling method that scales depth, width, and input resolution together using a fixed set of scaling coefficients.

### Innovations of EfficientNet

EfficientNet introduced significant advancements in scaling deep learning models:

1. **Compound Scaling**: Unlike conventional scaling methods that independently increase depth, width, or resolution, EfficientNet scales these factors in a more balanced manner based on a compound coefficient.
2. **Efficient Blocks**: Uses mobile inverted bottleneck convolutions, which include lightweight depthwise separable convolutions, to enhance efficiency.
3. **Squeeze-and-Excitation Optimization**: Each block incorporates squeeze-and-excitation layers that recalibrate channel-wise feature responses to further boost performance.
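4. **Fixed Scaling Coefficients**: Through a small grid search on the baseline network, the creators derived fixed coefficients $\alpha$ (depth), $\beta$ (width), and $\gamma$ (resolution). For a compound coefficient $\phi$, depth scales as $\alpha^\phi$, width as $\beta^\phi$, and resolution as $\gamma^\phi$, subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$, so that each increment of $\phi$ roughly doubles the FLOPs.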
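
To make compound scaling concrete, the sketch below computes the depth, width, and resolution multipliers $\alpha^\phi$, $\beta^\phi$, and $\gamma^\phi$ for a few values of the compound coefficient $\phi$. The values $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$ are the ones reported for the B0 baseline in the original paper; the function names are assumptions for illustration.

```python
# Coefficients reported for the B0 baseline in the original paper (assumed here).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution

def compound_scale(phi):
    """Return depth, width and resolution multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def flops_multiplier(phi):
    """Approximate FLOPs growth: depth * width^2 * resolution^2."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2

for phi in range(4):
    depth, width, resolution = compound_scale(phi)
    print(f"phi={phi}: depth x{depth:.2f}, width x{width:.2f}, "
          f"resolution x{resolution:.2f}, FLOPs x{flops_multiplier(phi):.2f}")
```

With these values, $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 1.92$, so each unit increase in $\phi$ roughly doubles the FLOPs. Public reference implementations use rounded, per-variant multipliers, so the released B1 to B7 models do not follow this rule to the last digit.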


### Variants of EfficientNet

EfficientNet comes in several variants, labeled B0 through B7, that offer a spectrum of capabilities catering to different computational and memory requirements:

- **EfficientNet-B0**: The baseline model designed to provide a balance between efficiency and accuracy.
- **EfficientNet-B1 to B7**: Each subsequent model increases in size and capacity, systematically scaled up using the compound coefficient method to achieve higher accuracy.

### Detailed Architecture and Parameter Calculation

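Below is the general structure of the original EfficientNet-B0, which forms the basis for the other scaled versions. Parameter counts follow the with-bias convolution formula from the Calculation Formulas section and exclude batch normalization and squeeze-and-excitation parameters; reference implementations typically drop the convolution bias in favor of batch normalization, so their counts differ slightly. For multi-block stages, the listed stride applies to the first block only.

| Layer Type                | Input Dimension              | Output Dimension             | Kernel Size/Stride/Pad | Parameters Formula                                            | Number of Parameters |
|---------------------------|------------------------------|------------------------------|------------------------|---------------------------------------------------------------|----------------------|
| **Input**                 | $224 \times 224 \times 3$    | N/A                          | N/A                    | N/A                                                           | 0                    |
| **Conv1**                 | $224 \times 224 \times 3$    | $112 \times 112 \times 32$   | $3 \times 3$, S=2, P=1 | $(3 \times 3 \times 3 + 1) \times 32$                         | 896                  |
| **MBConv1**               | $112 \times 112 \times 32$   | $112 \times 112 \times 16$   | $3 \times 3$, S=1, P=1 | Depthwise: $(3 \times 3 \times 32) \times 1 = 288$, Pointwise: $(32 + 1) \times 16 = 528$ | 816 |
| **MBConv6 x2**            | $112 \times 112 \times 16$   | $56 \times 56 \times 24$     | $3 \times 3$, S=2, P=1 | Varies per layer                                              | Varies               |
| **MBConv6 x2**            | $56 \times 56 \times 24$     | $28 \times 28 \times 40$     | $5 \times 5$, S=2, P=2 | Varies per layer                                              | Varies               |
| **MBConv6 x3**            | $28 \times 28 \times 40$     | $14 \times 14 \times 80$     | $3 \times 3$, S=2, P=1 | Varies per layer                                              | Varies               |
| **MBConv6 x3**            | $14 \times 14 \times 80$     | $14 \times 14 \times 112$    | $5 \times 5$, S=1, P=2 | Varies per layer                                              | Varies               |
| **MBConv6 x4**            | $14 \times 14 \times 112$    | $7 \times 7 \times 192$      | $5 \times 5$, S=2, P=2 | Varies per layer                                              | Varies               |
| **MBConv6 x1**            | $7 \times 7 \times 192$      | $7 \times 7 \times 320$      | $3 \times 3$, S=1, P=1 | Varies per layer                                              | Varies               |
| **Conv 1x1 (Head)**       | $7 \times 7 \times 320$      | $7 \times 7 \times 1280$     | $1 \times 1$, S=1, P=0 | $(1 \times 1 \times 320 + 1) \times 1280$                     | 410,880              |
| **Global Avg Pooling**    | $7 \times 7 \times 1280$     | $1 \times 1 \times 1280$     | Global                 | 0                                                             | 0                    |
| **Fully Connected**       | $1280$                       | Number of classes            | N/A                    | $(1280 + 1) \times \text{Number of classes}$                  | Varies               |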

### Calculation Formulas

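- **Parameter Formula for Conv Layers**: $(K \times K \times C_{\text{in}} + 1) \times C_{\text{out}}$, where $K$ is the kernel size, $C_{\text{in}}$ is the number of input channels, $C_{\text{out}}$ is the number of output channels, and the $+1$ accounts for one bias per output channel. A depthwise convolution applies a single $K \times K$ filter to each channel, giving $K \times K \times C_{\text{in}}$ weights.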
- **Output Dimension for Conv Layers**: $\left\lfloor\frac{W-K+2P}{S}+1\right\rfloor \times \left\lfloor\frac{H-K+2P}{S}+1\right\rfloor$, where $W$ and $H$ are the width and height of the input, $K$ is the kernel size, $P$ is the padding, and $S$ is the stride.
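
Both formulas are easy to check in code. The helpers below are a small sketch with assumed names (`conv_params`, `conv_output_size`), applied to the stem and MBConv1 pointwise figures from the table above:

```python
def conv_params(kernel, c_in, c_out, bias=True):
    """Parameter count of a standard K x K convolution."""
    return (kernel * kernel * c_in + (1 if bias else 0)) * c_out

def conv_output_size(size, kernel, stride, padding):
    """Output width or height: floor((size - K + 2P) / S) + 1."""
    return (size - kernel + 2 * padding) // stride + 1

print(conv_params(3, 3, 32))              # 896  (stem, with bias)
print(conv_params(3, 3, 32, bias=False))  # 864  (stem, bias dropped)
print(conv_params(1, 32, 16))             # 528  (MBConv1 pointwise projection)
print(conv_output_size(224, 3, 2, 1))     # 112  (stem output resolution)
```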

### Advantages of EfficientNet

- **Highly Efficient**: Reaches accuracy comparable to or better than earlier CNNs with significantly fewer parameters and FLOPs; the original paper reports EfficientNet-B7 as 8.4x smaller and 6.1x faster at inference than the best previous ConvNet on ImageNet.
- **Scalable Architecture**: Can be systematically scaled up or down to meet a wide range of accuracy targets and resource budgets.
- **State-of-the-Art Performance**: On benchmarks like ImageNet, EfficientNets often outperform models that are much larger and more computationally expensive.

### Disadvantages of EfficientNet

- **Complexity in Implementation**: The novel components such as squeeze-and-excitation and MBConv blocks can be complex to implement from scratch.
- **Resource Intensity at Higher Scales**: While EfficientNet-B0 is quite efficient, larger versions like B7 require substantial computational resources.

### Key Properties of EfficientNet

- **Optimized Scaling**: The network efficiently utilizes resources by scaling width, depth, and resolution based on fixed coefficients.
- **Robust Performance**: Demonstrates strong robustness and lower susceptibility to overfitting despite its depth and complexity.
- **Innovative Design**: The integration of techniques like squeeze-and-excitation and MBConv improves both the performance and the efficiency of the network.

EfficientNet remains a highly influential model in deep learning. It demonstrates that careful, principled scaling of depth, width, and resolution can deliver both high efficiency and high accuracy in convolutional neural networks.
