# Tutorial 2.4: ResNet

Author: [Erik Syniawa](mailto:erik.syniawa@informatik.tu-chemnitz.de)

ResNet was introduced in "Deep Residual Learning for Image Recognition" by He et al. (2016) [[1](#6-references)]. The implementation is based on the torchvision implementation [[2](#6-references)], with slight modifications to make the code more consistent with later models.

In [None]:
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import numpy as np
from typing import Optional, List, Tuple, Type, Union

import os, sys
notebook_dir = os.getcwd()
root_path = os.path.abspath(os.path.join(notebook_dir, ".."))
if root_path not in sys.path:
    sys.path.append(root_path)
    print(f"Added {root_path} to sys.path")
    
from Utils.little_helpers import timer, set_seed

set_seed(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'PyTorch version: {torch.__version__} running on {device}')


## 1. The Degradation Problem
As networks get deeper, they become harder to train, with both training and test accuracy getting worse. This is counterintuitive because a deeper network should be able to represent everything a shallower network can (by making some layers act as identity mappings).

<div align="center">
    <img src="figures/training_error.png" width="750"/>
    <p><i>Figure 1: Training error for plain and residual networks. Source: [1]</i></p>
</div>

Figure 1 shows that in the plain network the deeper 34-layer network has a __higher__ error than the 18-layer network. In residual networks, the deeper 34-layer network has __lower__ error than the 18-layer network. This is a key insight of the paper [1] - residual connections help deeper networks to train better. 

## 2. Understanding Residual Learning

The key insight of ResNet is to reformulate the layers as learning a residual mapping instead of the underlying mapping.
Instead of hoping a stack of layers directly fits a desired mapping $H(x)$, we let these layers fit a residual function $F(x) = H(x) - x$.
The original mapping becomes $F(x) + x$, which is implemented with a "shortcut connection" that performs identity mapping and element-wise addition (see Figure 2).

<div align="center">
    <img src="figures/residual_block.png" width="750"/>
    <p><i>Figure 2: Comparison of residual and non-residual blocks</i></p>
</div>

This reformulation helps for three reasons:

1. If an identity mapping is optimal, it is easier to push the residual to zero than for a stack of nonlinear layers to learn the identity function.
2. If the optimal mapping is close to identity, small perturbations are easier to learn than learning the full mapping from scratch.
3. Shortcut connections allow gradients to flow directly through the network, mitigating vanishing gradient problems.
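
Point 1 can be illustrated with a minimal sketch. The `TinyResidual` module below is hypothetical (it is not part of the tutorial's model) and assumes that the last layer of the residual branch is zero-initialized, so that $F(x) = 0$ and the block starts out as an exact identity mapping:

```python
import torch
import torch.nn as nn

class TinyResidual(nn.Module):
    """Hypothetical toy block computing F(x) + x with a two-layer residual branch F."""
    def __init__(self, dim: int):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Assumption for this demo: zero the last layer so that F(x) = 0 at initialization
        nn.init.zeros_(self.branch[-1].weight)
        nn.init.zeros_(self.branch[-1].bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.branch(x) + x  # residual F(x) plus identity shortcut

torch.manual_seed(0)
block = TinyResidual(8)
x = torch.randn(4, 8)
is_identity = torch.equal(block(x), x)
print(is_identity)  # → True
```

A plain stack of the same layers would have to learn weights that reproduce `x`, whereas here the identity is available for free and only the deviation from it has to be learned.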

### 2.1 The Loss Landscape of ResNets ([Li et al., 2018](https://proceedings.neurips.cc/paper/2018/hash/a41b3bb3e6b050b6c9067c67f663b915-Abstract.html))

<div align="center">
    <img src="figures/loss_landscape.png" width="750"/>
    <p><i>Figure 3: The loss surfaces of ResNet-56 with/without skip connections. Source: Li et al. (2018)</i></p>
</div>

The visualization above reveals several critical insights:

1. **Without skip connections** (left): The loss landscape has many sharp local minima and steep cliff-like regions. Such landscapes are extremely difficult to optimize because:
   - Gradients can point in unhelpful directions
   - Learning can easily get trapped in poor local minima
   - Small perturbations in weights can cause dramatic changes in loss

2. **With skip connections** (right): The loss landscape becomes much smoother and nearly convex, with a clear, wide basin around good solutions. In summary, the smoother landscape:
   - Allows for more reliable gradient flow
   - Makes optimization much less sensitive to initialization
   - Creates wider minima that tend to generalize better

This visualization helps explain why very deep networks become trainable with residual connections. Skip connections essentially provide "highways" for gradient flow, mitigating vanishing/exploding gradients while creating a more navigable optimization landscape.

The dramatic smoothing effect becomes increasingly important as networks get deeper. For shallow networks, the benefits are less pronounced, but for deep architectures, skip connections transform what would be untrainable networks into highly effective models. This is not only important for ResNets but also for other architectures that use skip connections, such as Vision Transformers that we will see later.

## 3. Implementation of ResNet Building Blocks

### 3.1 Key components

#### 3.1.1 Batch Normalization ([Ioffe & Szegedy, 2015](https://proceedings.mlr.press/v37/ioffe15.html))

Batch Normalization (BatchNorm) is a technique that normalizes the activations of each layer, which helps to stabilize and accelerate training. For each feature channel, BatchNorm:

- Normalizes the activations to have zero mean and unit variance across the batch
- Applies learnable scale ($\gamma$) and shift ($\beta$) parameters to preserve the network's representational capacity
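
The two steps above can be checked by hand. The following sketch compares a manual computation with `nn.BatchNorm2d` in training mode; the tensor sizes are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3, 5, 5)  # (B, C, H, W)

bn = nn.BatchNorm2d(3)  # gamma is initialized to 1, beta to 0
bn.train()              # training mode: normalize with the statistics of the current batch
out_ref = bn(x)

# Manual computation: one mean/variance per channel, taken over batch and spatial dimensions
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)  # biased variance is used for normalization
x_hat = (x - mean) / torch.sqrt(var + bn.eps)
gamma = bn.weight.view(1, -1, 1, 1)
beta = bn.bias.view(1, -1, 1, 1)
out_manual = gamma * x_hat + beta

max_diff = (out_ref - out_manual).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

In evaluation mode the layer uses running estimates of the mean and variance instead of the batch statistics, so the two results would generally differ there.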

BatchNorm offers several benefits:

- Reduces internal covariate shift (the distribution of activations changing during training)
- Provides regularization, reducing the need for dropout and potentially allowing higher learning rates 
- Makes networks less sensitive to weight initialization

In ResNet, BatchNorm is applied after each convolution and before the activation function:

```python
# Correct structure:
self.conv = nn.Conv2d(...)
self.bn = nn.BatchNorm2d(...)
# In forward:
out = F.relu(self.bn(self.conv(x)))  # BN before ReLU
```

#### 3.1.2 ReLU Activation Function
The Rectified Linear Unit (ReLU) is defined as:

$f(x) = \max(0, x)$

ReLU became the standard activation function in early modern CNNs for several reasons:

- It allows faster convergence compared to sigmoid or tanh due to non-saturating behavior for positive values
- It's computationally efficient (simple thresholding operation)
- It induces sparsity in the network, as negative values become zero

However, ReLU has important limitations:

- Dying ReLU problem: Neurons can "die" during training when they consistently receive negative inputs, causing their gradients to become zero and preventing further learning
- Non-zero centered: Unlike tanh, ReLU outputs are not zero-centered, which can lead to zig-zagging dynamics during gradient descent

To address these issues, several improved activation functions have been developed:

- **Leaky ReLU**: $f(x) = \max(0, x) + \alpha \min(0, x)$ where $\alpha$ is a small constant (e.g., 0.01)
- **PReLU** (Parametric ReLU): Similar to Leaky ReLU but with learnable $\alpha$
- **ELU** (Exponential Linear Unit): $f(x) = x$ if $x > 0$ else $\alpha(e^x - 1)$
- **SiLU/Swish**: $f(x) = x \cdot \sigma(x)$ where $\sigma$ is the sigmoid function
- **GELU** (Gaussian Error Linear Unit): $f(x) = x \cdot \Phi(x)$ where $\Phi$ is the cumulative distribution function of the standard normal distribution

More recent architectures often use SiLU/Swish or GELU activations, which provide better gradient flow and typically lead to better performance, especially in deeper networks. However, the original ResNet architecture uses ReLU activations throughout.
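
As a quick sanity check, the formulas listed above can be written out directly and compared with the built-in functions in `torch.nn.functional`. The sketch assumes $\alpha = 0.01$ for Leaky ReLU and $\alpha = 1$ for ELU:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=13)
alpha = 0.01  # assumed slope for Leaky ReLU

leaky = torch.clamp(x, min=0) + alpha * torch.clamp(x, max=0)
elu = torch.where(x > 0, x, 1.0 * (torch.exp(x) - 1))   # ELU with alpha = 1
silu = x * torch.sigmoid(x)
gelu = x * 0.5 * (1 + torch.erf(x / 2 ** 0.5))           # Phi(x) written with the error function

checks = {
    "leaky_relu": torch.allclose(leaky, F.leaky_relu(x, negative_slope=alpha), atol=1e-6),
    "elu": torch.allclose(elu, F.elu(x, alpha=1.0), atol=1e-6),
    "silu": torch.allclose(silu, F.silu(x), atol=1e-6),
    "gelu": torch.allclose(gelu, F.gelu(x), atol=1e-6),
}
print(checks)
```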

In ResNets, ReLU is typically applied after BatchNorm:

```python
out = F.relu(self.bn(self.conv(x)))  # BN before ReLU
```

#### 3.1.3 Weight Initialization

Proper initialization is crucial for training very deep networks. ResNet typically uses He initialization (also known as Kaiming initialization). PyTorch applies its own default initialization when a layer is created, so the model in this tutorial sets He initialization explicitly. For details on the available initialization schemes and the recommended gain for each activation function, see the [PyTorch documentation](https://pytorch.org/docs/stable/nn.init.html).

```python
# He initialization, as used in the torchvision implementation:
for m in self.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.constant_(m.weight, 1)
        nn.init.constant_(m.bias, 0)
```
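
To make the effect of `kaiming_normal_` concrete, the sketch below checks that the resulting weights have a standard deviation of $\sqrt{2 / \text{fan\_out}}$, where the factor 2 comes from the ReLU gain. The layer sizes are arbitrary and only serve as an example:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(64, 128, kernel_size=3, bias=False)
nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')

# fan_out = out_channels * kernel_height * kernel_width
fan_out = conv.out_channels * conv.kernel_size[0] * conv.kernel_size[1]
expected_std = math.sqrt(2.0 / fan_out)  # gain sqrt(2) for ReLU
empirical_std = conv.weight.std().item()
print(f"expected {expected_std:.4f}, empirical {empirical_std:.4f}")
```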

### 3.2 Residual Block (used in ResNet-18/34)

This is the basic block in smaller ResNets. Each block contains:

- Two 3×3 convolutional layers, each followed by BatchNorm; ReLU is applied after the first BatchNorm and again after the addition
- A skip connection that adds the input to the output of the convolutional layers


In [None]:
class ResidualBlock(nn.Module):
    """Basic residual block for ResNet18/34"""
    expansion = 1  # Block output channels = out_channels * expansion
    def __init__(self, 
                 in_channels: int, 
                 out_channels: int, 
                 stride: int = 1,
                 **kwargs):  # accepts and ignores groups/base_width/dilation passed by ResNet._make_layer
        super(ResidualBlock, self).__init__()
        
        # First conv layer
        self.conv1 = nn.Conv2d(in_channels, out_channels, 
                               kernel_size=3,
                               stride=stride, 
                               padding=1, 
                               bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        # Second conv layer
        self.conv2 = nn.Conv2d(out_channels, out_channels, 
                               kernel_size=3,
                               stride=1, 
                               padding=1, 
                               bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        
        # Shortcut connection
        if stride == 1 and in_channels == out_channels:
            # Identity shortcut - no transformation needed
            self.shortcut = nn.Identity()
        else:
            # Projection shortcut - transform input dimensions to match output
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 
                          kernel_size=1,
                          stride=stride, 
                          bias=False),
                nn.BatchNorm2d(out_channels)
            )
        self.stride = stride

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        out += self.shortcut(identity)  # Skip connection
        out = self.relu(out)

        return out
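
The projection shortcut has to produce exactly the shape of the main branch, otherwise the addition fails. The following minimal shape check uses bare `nn.Conv2d` layers (not the `ResidualBlock` class itself) with example sizes of 64 input channels, 128 output channels and stride 2:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 64, 32, 32)  # (B, C, H, W)

# Main branch of a downsampling block: 3x3 conv with stride 2, then 3x3 conv with stride 1
main = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1, bias=False),
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1, bias=False),
)
# Projection shortcut: 1x1 conv with the same stride and the same number of output channels
projection = nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False)

main_shape = tuple(main(x).shape)
shortcut_shape = tuple(projection(x).shape)
print(main_shape, shortcut_shape)  # → (2, 128, 16, 16) (2, 128, 16, 16)
```

With stride 1 and equal channel counts, the input already has the right shape, which is why the block can fall back to `nn.Identity()` in that case.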

### 3.3 Bottleneck Block (used in ResNet-50/101/152)

The Bottleneck Block is an alternative building block used in deeper ResNet architectures. The key idea is to reduce computational complexity while maintaining performance through a bottleneck design.

#### Structure of the Bottleneck Block

The Bottleneck Block contains three convolution layers (see Figure 4):

- 1×1 Convolution (Dimensionality Reduction): Reduces the number of channels/feature maps, typically to 1/4 of the input channels
- 3×3 Convolution (Spatial Feature Extraction): Processes spatial features with the reduced channel dimension
- 1×1 Convolution (Dimensionality Expansion): Restores channel dimension to the desired output size (typically 4× the middle layer)

This design creates a "bottleneck" in the middle where the representation has fewer channels, significantly reducing computation:

<div align="center">
    <img src="figures/bottleneck.png" width="500"/>
    <p><i>Figure 4: Comparison of Basic Block vs. Bottleneck Block. Source [1].</i></p>
</div>

#### Computational Advantage

To understand the computational benefit, consider a block with 256 input and output channels:

- Basic Block: Two 3×3 convolutions, each with 256 channels
    - Computation cost: $2 × (3×3×256×256) = 1,179,648$ operations per spatial location

- Bottleneck Block: 1×1 → 3×3 → 1×1 convolutions, with the middle layer having 64 channels
    - Computation cost: $(1×1×256×64) + (3×3×64×64) + (1×1×64×256) = 69,632$ operations per spatial location

The bottleneck design reduces computation by about 17× while maintaining similar expressiveness!
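
The numbers above can be reproduced by counting convolution weights, since for a bias-free convolution the number of weights equals the number of multiply-accumulate operations per output location. The helper `conv_weight_count` is a hypothetical name introduced only for this sketch:

```python
import torch.nn as nn

def conv_weight_count(layers) -> int:
    """Sum of conv weights, i.e. multiply-accumulates per spatial output location."""
    return sum(layer.weight.numel() for layer in layers)

basic = [
    nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False),
]
bottleneck = [
    nn.Conv2d(256, 64, kernel_size=1, bias=False),             # reduce
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),   # spatial features
    nn.Conv2d(64, 256, kernel_size=1, bias=False),             # expand
]

basic_cost = conv_weight_count(basic)
bottleneck_cost = conv_weight_count(bottleneck)
print(basic_cost, bottleneck_cost, round(basic_cost / bottleneck_cost, 1))
# → 1179648 69632 16.9
```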

> Why do you think the bottleneck block is not used in ResNet-18/34?

#### Expansion Factor
The bottleneck block uses an expansion factor (typically 4) to define the relationship between the bottleneck width and output width. This is why we set `expansion = 4` in the implementation, meaning the output channels will be 4× the number of bottleneck channels.
In the basic block the expansion factor is 1, meaning the block's output has exactly `out_channels` channels, but we keep it as an attribute for consistency and a simpler implementation of the whole model.


In [None]:
class Bottleneck(nn.Module):
    """Bottleneck block for ResNet50/101/152"""
    expansion = 4  # Block output channels = out_channels * expansion
    
    def __init__(self, 
                 in_channels: int, 
                 out_channels: int, 
                 stride: int = 1, 
                 groups: int = 1, 
                 base_width: int = 64, 
                 dilation: int = 1):
        super(Bottleneck, self).__init__()
        
        width = int(out_channels * (base_width / 64.)) * groups
        
        # First 1x1 conv - dimensionality reduction
        self.conv1 = nn.Conv2d(in_channels, width,
                              kernel_size=1, stride=1, bias=False)
        self.bn1 = nn.BatchNorm2d(width)
        
        # 3x3 conv - spatial feature extraction
        self.conv2 = nn.Conv2d(width, width, kernel_size=3, stride=stride,
                              groups=groups, dilation=dilation,
                              padding=dilation, bias=False)
        self.bn2 = nn.BatchNorm2d(width)
        
        # Second 1x1 conv - dimensionality expansion
        self.conv3 = nn.Conv2d(width, out_channels * self.expansion,
                              kernel_size=1, stride=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
        
        self.relu = nn.ReLU(inplace=True)
        if stride == 1 and in_channels == out_channels * self.expansion:
            # Identity shortcut - no transformation needed
            self.shortcut = nn.Identity()
        else:
            # Projection shortcut - transform input dimensions to match output
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels * self.expansion, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels * self.expansion)
            )
        self.stride = stride

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x

        # Dimensionality reduction
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        
        # Spatial feature extraction
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)
        
        # Dimensionality expansion
        out = self.conv3(out)
        out = self.bn3(out)

        out += self.shortcut(identity)  # Skip connection
        out = self.relu(out)

        return out
    

### 3.4 Projection Shortcuts vs. Identity Shortcuts

There are three ways to handle shortcuts when dimensions change:
- Option A: Zero-padding identity shortcuts (no extra parameters)
- Option B: Projection shortcuts only when dimensions change
- Option C: Projection shortcuts for all connections
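
Option A is the least familiar of the three, so here is a minimal sketch of what a parameter-free shortcut could look like. The function `zero_pad_shortcut` is hypothetical, and padding the new channels at the end of the channel dimension is an assumption of this sketch:

```python
import torch
import torch.nn.functional as F

def zero_pad_shortcut(x: torch.Tensor, out_channels: int, stride: int = 2) -> torch.Tensor:
    """Hypothetical parameter-free shortcut: subsample spatially, then zero-pad the channels."""
    x = x[:, :, ::stride, ::stride]        # spatial subsampling without any weights
    extra = out_channels - x.shape[1]      # number of zero channels to append
    # F.pad pads from the last dimension backwards: (W_left, W_right, H_top, H_bottom, C_front, C_back)
    return F.pad(x, (0, 0, 0, 0, 0, extra))

x = torch.randn(2, 64, 32, 32)
y = zero_pad_shortcut(x, out_channels=128)
print(tuple(y.shape))  # → (2, 128, 16, 16)
```

The first 64 output channels are a subsampled copy of the input and the remaining 64 are zero, so the shortcut adds no learnable parameters.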

Let's examine the trade-offs:

<div align="center">
    <img src="figures/effect_shortcuts.png" width="800"/>
    <p><i>Figure 5: Effect of Shortcut Types. Values taken from [1].</i></p>
</div>

He et al. [[1](#6-references)] concluded that projection shortcuts provide small accuracy gains, but identity shortcuts are sufficient for addressing the degradation problem and are more memory- and computation-efficient.

> Looking at our code: What type of shortcut is used in our building blocks?

## 3.5 Regularization Techniques for Deep CNNs

### 3.5.1 Stochastic Depth with DropPath ([Huang et al., 2016](https://arxiv.org/abs/1603.09382))

DropPath is a regularization technique specifically designed for very deep residual networks. The core idea is beautifully simple: during training, randomly drop entire layers (by skipping the residual branch) with some probability. During inference, use the full network.

#### How DropPath works

1. During training, the entire residual path (the non-identity branch) of a block is dropped with probability `p`, independently for each sample in the batch
2. When a path is dropped, the input simply passes through the identity connection
3. When kept, the output of the residual path is scaled by `1/(1-p)` to maintain its expected value
4. During inference (evaluation mode), no paths are dropped

Here's the core implementation taken from [timm (torch image models)](https://github.com/huggingface/pytorch-image-models):

In [None]:
# Thanks to rwightman's timm package
# github.com:rwightman/pytorch-image-models

def drop_path(x, drop_prob: float = 0., training: bool = False):
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    random_tensor.floor_()  # binarize
    output = x.div(keep_prob) * random_tensor
    return output


class DropPath(nn.Module):
    def __init__(self, drop_prob: float = 0.):  # a None default would fail in drop_path during training
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)
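
To show where such a module would sit inside a residual block, here is a minimal self-contained sketch. `StochasticDepthResidual` is a hypothetical wrapper (it is not part of timm or of this tutorial's model) that re-implements the per-sample masking inline and applies it to the residual branch only:

```python
import torch
import torch.nn as nn

class StochasticDepthResidual(nn.Module):
    """Hypothetical wrapper computing x + DropPath(branch(x)) with a per-sample mask."""
    def __init__(self, branch: nn.Module, drop_prob: float = 0.0):
        super().__init__()
        self.branch = branch
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.branch(x)
        if self.training and self.drop_prob > 0.0:
            keep_prob = 1.0 - self.drop_prob
            # One Bernoulli(keep_prob) draw per sample, broadcast over the remaining dimensions
            shape = (x.shape[0],) + (1,) * (x.ndim - 1)
            mask = torch.bernoulli(torch.full(shape, keep_prob, dtype=x.dtype, device=x.device))
            out = out * mask / keep_prob  # rescale the branches that are kept
        return x + out  # the identity path is never dropped

torch.manual_seed(0)
block = StochasticDepthResidual(nn.Conv2d(4, 4, kernel_size=3, padding=1), drop_prob=0.5)
x = torch.randn(16, 4, 8, 8)

block.train()
y_train = block(x)
# Samples whose branch was dropped pass through unchanged
dropped = [torch.equal(y_train[i], x[i]) for i in range(x.shape[0])]

block.eval()
y_eval = block(x)  # deterministic: the full branch is always used
print(sum(dropped), "of", len(dropped), "residual branches dropped in this training pass")
```

The important design point is that the mask is applied before the addition, so a dropped block reduces to the identity mapping instead of zeroing the whole activation.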


#### Why it works

Stochastic Depth improves deep network training in several ways:

1. **Implicit Network Ensemble**: By randomly dropping different layers, the network effectively becomes an implicit ensemble of networks with varying depths, contributing to better generalization.

2. **Reducing Vanishing Gradients**: With some layers dropped, gradients have shorter paths to flow back during backpropagation, which helps mitigate the vanishing gradient problem.

3. **Improved Information Flow**: By sometimes skipping residual blocks, the network can maintain better information flow from earlier layers to later ones.

4. **Regularization Effect**: The randomness introduced during training prevents the network from relying too heavily on specific layers, forcing it to learn more robust features.

#### Implementation Pattern

A common implementation pattern is to gradually increase the drop probability for deeper layers:

- Earlier layers are dropped with lower probability (or not at all)
- Deeper layers are dropped with higher probability

This makes intuitive sense: earlier layers extract more fundamental features that shouldn't be dropped as often, while deeper layers focus on more specialized feature refinement.

With the standard linear scaling approach, if `p` is the maximum drop probability:

```
drop_rate_for_block[i] = p * i / (total_blocks - 1)
```
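
As a small worked example of this rule, the sketch below computes the per-block rates for a maximum drop probability of 0.2 and five blocks. The function name `linear_drop_rates` is hypothetical, and the guard for a single block (which then gets rate 0) is an assumption added to avoid a division by zero:

```python
from typing import List

def linear_drop_rates(max_drop_prob: float, total_blocks: int) -> List[float]:
    """Drop probability per block, rising linearly from 0 to max_drop_prob."""
    denom = max(total_blocks - 1, 1)  # assumption: avoid division by zero for a single block
    return [max_drop_prob * i / denom for i in range(total_blocks)]

rates = linear_drop_rates(0.2, total_blocks=5)
print([round(r, 3) for r in rates])  # → [0.0, 0.05, 0.1, 0.15, 0.2]
```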


### 3.5.2 DropBlock ([Ghiasi et al., 2018](https://proceedings.neurips.cc/paper/2018/hash/7edcfb2d8f6a659ef4cd1e6c9b6d7079-Abstract.html))

Standard dropout randomly drops individual activations during training, which works well for fully connected layers. However, it's less effective for convolutional layers due to the inherent spatial correlation within feature maps:

<div align="center">
    <img src="figures/dropblock.png" width="750"/>
    <p><i>Figure 6: (a) Input image to a CNN. The green regions in (b) and (c) show activations containing semantic information. (b) Standard dropout randomly removes individual activations, but nearby units still preserve semantic information. (c) DropBlock drops contiguous regions, forcing the network to use other features for classification. Source: Ghiasi et al. (2018)</i></p>
</div>

As illustrated above, in convolutional networks:
- Features are spatially correlated - adjacent units in a feature map contain similar information
- Dropping random individual activations isn't effective because neighboring activations can still propagate the same information
- The network doesn't learn to rely on diverse features because the same semantic information can flow through multiple pathways

#### How it works

DropBlock addresses this problem by dropping entire contiguous regions of feature maps, rather than individual units:

1. **Block-wise dropping**: Instead of dropping individual activations, DropBlock zeros out square regions of activations
2. **Contiguous regions**: By removing spatially contiguous areas, entire semantic concepts are blocked
3. **Feature diversification**: The network is forced to learn from other regions and features

The method has two main parameters:
- `block_size`: The size of the square blocks to drop (e.g., 7×7)
- `drop_prob`: The probability of dropping a feature unit

#### Implementation

Below is a PyTorch implementation of DropBlock that you can use in your models:


In [None]:
import torch.nn.functional as F


class DropBlock2D(nn.Module):
    """
    Implements DropBlock2D: structured form of dropout for convolutional layers. 
    
    :param drop_prob: Probability of activations to be dropped.
    :param block_size: Side length of the square blocks to drop.
    """
    def __init__(self, 
                 drop_prob: float, 
                 block_size: int):
        super(DropBlock2D, self).__init__()
        self.drop_prob = drop_prob
        self.block_size = block_size
    
    def forward(self, x):
        # If not in training phase or drop probability is 0, return input
        if not self.training or self.drop_prob == 0:
            return x
                
        assert x.dim() == 4 # shape: (B, C, H, W)
        # Get dimensions
        _, _, height, width = x.shape
        
        # Calculate gamma (sampling rate)
        # Equation (1) from the paper
        gamma = self.drop_prob / (self.block_size ** 2) * (
            (height * width) / ((height - self.block_size + 1) * (width - self.block_size + 1))
        )
        
        # sample block centers from a Bernoulli distribution
        # (one mask per sample, shared across channels, created directly on the input's device)
        mask = (torch.rand(x.shape[0], *x.shape[2:], device=x.device) < gamma).float()
        
        # compute block mask and apply it
        block_mask = self._compute_block_mask(mask)
        out = x * block_mask[:, None, :, :]

        # Normalize by keeping the sum of the input the same
        out = out * (block_mask.numel() / (block_mask.sum() + 1e-8))
        
        return out
    
    def _compute_block_mask(self, mask):
        # https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.max_pool2d.html
        block_mask = F.max_pool2d(
            input=mask[:, None, :, :],
            kernel_size=(self.block_size, self.block_size),
            stride=(1, 1),
            padding=self.block_size // 2
        )
        
        # if block size is even, trim the last row and column due to padding
        if self.block_size % 2 == 0:
            block_mask = block_mask[:, :, :-1, :-1]
        
        block_mask = 1 - block_mask.squeeze(1)
        
        return block_mask    
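
The max-pooling trick in `_compute_block_mask` is easier to see on a tiny example. The sketch below is independent of the class above and uses a hand-placed seed on a 7×7 map with an assumed `block_size` of 3:

```python
import torch
import torch.nn.functional as F

block_size = 3
seeds = torch.zeros(1, 1, 7, 7)
seeds[0, 0, 3, 3] = 1.0  # a single sampled block center

# Max pooling with stride 1 spreads each seed over a block_size x block_size window
dropped = F.max_pool2d(seeds, kernel_size=block_size, stride=1, padding=block_size // 2)
keep_mask = 1 - dropped

print(keep_mask[0, 0].int())
num_dropped = int(dropped.sum().item())
print(num_dropped)  # → 9
```

A single seed therefore removes a full 3×3 region, which is why the sampling rate `gamma` has to be much smaller than `drop_prob`.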

#### Tips

- Like DropPath, DropBlock is most effective in the later layers of the network, where features are more semantic.
- Choose appropriate block_size:
    - For feature maps with high resolution (early layers): smaller block_size (e.g., 3×3)
    - For feature maps with low resolution (later layers): larger block_size (e.g., 5×5 or 7×7)
- Schedule the drop probability: Start with a low probability and gradually increase it during training


### 3.5.3 DropPath vs. DropBlock: Complementary Approaches

| Aspect                        | DropPath                                                          | DropBlock                                         |
|-------------------------------|-------------------------------------------------------------------|---------------------------------------------------|
| **What is dropped**           | Entire residual branches                                          | Contiguous blocks in feature maps                 |
| **Applicability**             | Networks with branching paths (e.g., ResNets, Vision Transformer) | Any convolutional layer                           |
| **Main benefit**              | Creates implicit ensemble of networks with varying depths         | Forces learning of spatially distributed features |
| **Level of operation**        | Network architecture level                                        | Feature map level                                 |

**Key differences and complementary benefits**:

1. **Scope of operation**: 
   - DropPath works at the architectural level by disabling entire branches
   - DropBlock works at the feature map level by masking contiguous regions

2. **Feature diversity**:
   - DropPath encourages the network to not rely too heavily on any particular residual path
   - DropBlock encourages the network to use diverse spatial features within a layer

3. **Combining both techniques**:
   - DropPath helps with network-level robustness
   - DropBlock helps with feature-level robustness
   - Together, they provide complementary regularization effects


> Before you implement either DropPath or DropBlock in your ResNet, you should first run the code without them to see how the model performs. This will give you a baseline to compare against when you add these techniques. After that, you can experiment with different configurations of DropPath and DropBlock to see how they affect the model's performance.


## 4. ResNet implementation

Here we bring it all together. Note that we use the `ResidualBlock` and `Bottleneck` classes defined above to create the ResNet architecture without `DropPath` or `DropBlock` (see above).

In [None]:
class ResNet(nn.Module):
    def __init__(self, 
                 block: Type[Union[ResidualBlock, Bottleneck]],  # If you define another block architecture, e.g. with DropPath or DropBlock, add it here
                 layers: List[int],  # number of blocks in each layer
                 num_classes: int = 1000,  # if <= 0 head will be identity
                 in_channels: int = 3,  # input channels (RGB or grayscale)
                 zero_init_residual: bool = False,  # zero-initialize the last BN in each residual branch
                 groups: int = 1,  # number of groups for group convolution
                 width_per_group: int = 64,  # width of each group
                 replace_stride_with_dilation: Optional[List[bool]] = None, 
                 return_features: bool = False):
        super(ResNet, self).__init__()
        
        self.return_features = return_features
        self.in_channels = 64
        self.dilation = 1
        
        if replace_stride_with_dilation is None:
            # Create a list of False values with same length as layers
            replace_stride_with_dilation = [False] * len(layers)
        else:
            assert len(replace_stride_with_dilation) == len(layers), \
                f"replace_stride_with_dilation should be of length {len(layers)}, but has length {len(replace_stride_with_dilation)}"
            
        self.groups = groups
        self.base_width = width_per_group
        
        # Initial layers - can be adjusted based on input size
        if in_channels == 1:  # grayscale
            self.conv1 = nn.Conv2d(in_channels, self.in_channels, 
                                   kernel_size=3, 
                                   stride=1, 
                                   padding=1, 
                                   bias=False)
        else:  # RGB - For larger inputs like Imagenette
            self.conv1 = nn.Conv2d(in_channels, self.in_channels, 
                                   kernel_size=7, 
                                   stride=2, 
                                   padding=3, 
                                   bias=False)
            
        self.bn1 = nn.BatchNorm2d(self.in_channels)
        self.relu = nn.ReLU(inplace=True)
        
        if in_channels != 1:  # For larger inputs, add maxpool
            self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        else:
            self.maxpool = nn.Identity()  # No maxpool for small inputs
        
        # Define channel sizes for each layer
        channels = [64, 128, 256, 512, 1024, 2048]  # Support for more layers if needed
        
        # Create ResNet layers using ModuleList for flexibility
        self.layers = nn.ModuleList()
        
        for i, num_blocks in enumerate(layers):
            # Only apply stride=2 from the second layer onwards
            stride = 1 if i == 0 else 2
            
            layer = self._make_layer(
                block=block, 
                out_channels=channels[i], 
                num_blocks=num_blocks, 
                stride=stride,
                dilate=replace_stride_with_dilation[i]
            )
            
            self.layers.append(layer)
            
        # Global average pooling
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        
        # Calculate final feature dimension based on architecture
        final_dim = channels[len(layers)-1] * block.expansion
            
        # Classifier head. If num_classes <= 0, the head is an identity so the model can be used as a backbone
        self.head = nn.Linear(final_dim, num_classes) if num_classes > 0 else nn.Identity()
        
        # Initialize weights
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
                
        # Zero-initialize the last BN in each residual branch
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck):
                    nn.init.constant_(m.bn3.weight, 0)
                elif isinstance(m, ResidualBlock):
                    nn.init.constant_(m.bn2.weight, 0)

    def _make_layer(self, 
                    block: Type[Union[ResidualBlock, Bottleneck]],
                    out_channels: int, 
                    num_blocks: int, 
                    stride: int = 1, 
                    dilate: bool = False) -> nn.Sequential:
        previous_dilation = self.dilation
        
        if dilate:
            self.dilation *= stride
            stride = 1
            
        layers = []
        
        # First block may have stride > 1
        layers.append(block(
            in_channels=self.in_channels, 
            out_channels=out_channels, 
            stride=stride, 
            groups=self.groups,
            base_width=self.base_width, 
            dilation=previous_dilation
        ))
        
        # Update in_channels for subsequent blocks
        self.in_channels = out_channels * block.expansion
        
        # Remaining blocks
        for _ in range(1, num_blocks):
            layers.append(block(
                in_channels=self.in_channels, 
                out_channels=out_channels, 
                groups=self.groups,
                base_width=self.base_width, 
                dilation=self.dilation
            ))
            
        return nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        # Apply all ResNet layers
        for layer in self.layers:
            x = layer(x)

        x = self.avgpool(x)
        features = torch.flatten(x, 1)
        logits = self.head(features)
        
        if self.return_features:
            return logits, features
        else:
            return logits
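
To get a feeling for how the spatial resolution shrinks through the network, the following sketch applies the standard output-size formula for convolution and pooling layers. It assumes a 224×224 RGB input (so the 7×7 stem and the max pooling are used) and four stages; `conv_out` is a hypothetical helper:

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of a convolution or pooling layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224                                              # assumed input resolution
sizes = [size]
size = conv_out(size, kernel=7, stride=2, padding=3)    # stem convolution
sizes.append(size)
size = conv_out(size, kernel=3, stride=2, padding=1)    # max pooling
sizes.append(size)
for stage_stride in (1, 2, 2, 2):                       # first block of each of the four stages
    size = conv_out(size, kernel=3, stride=stage_stride, padding=1)
    sizes.append(size)
print(sizes)  # → [224, 112, 56, 56, 28, 14, 7]
```

The final 7×7 feature map is then reduced to a single vector per image by the adaptive average pooling layer.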


### 4.1 Creating ResNet Models

Here we define a function to create ResNet models with different configurations so you don't have to do it manually below. You can specify the model type (e.g., `resnet18`, `resnet50`, etc.) or provide a custom block type and number of blocks per stage.
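
The numbers in model names such as `resnet18` count the weighted layers of the network. As a rough sketch of that bookkeeping, assuming one stem convolution, two convolutions per basic block or three per bottleneck block, and one final linear layer (shortcut projections are not counted), the hypothetical helper `resnet_depth` reproduces the familiar names:

```python
from typing import List

def resnet_depth(blocks_per_stage: List[int], convs_per_block: int) -> int:
    """Weighted layers: stem conv + all block convs + final linear layer."""
    return 1 + convs_per_block * sum(blocks_per_stage) + 1

depths = {
    "resnet18": resnet_depth([2, 2, 2, 2], convs_per_block=2),    # basic block: two 3x3 convs
    "resnet34": resnet_depth([3, 4, 6, 3], convs_per_block=2),
    "resnet50": resnet_depth([3, 4, 6, 3], convs_per_block=3),    # bottleneck: 1x1, 3x3, 1x1
    "resnet101": resnet_depth([3, 4, 23, 3], convs_per_block=3),
    "resnet152": resnet_depth([3, 8, 36, 3], convs_per_block=3),
}
print(depths)
# → {'resnet18': 18, 'resnet34': 34, 'resnet50': 50, 'resnet101': 101, 'resnet152': 152}
```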

In [None]:
# Function to create ResNet models with different configurations
def create_resnet(model_type: Optional[str] = None,
                  block_type: str = 'basic',
                  num_blocks: Optional[List[int]] = None,
                  num_classes: int = 1000,
                  in_channels: int = 3, 
                  return_features: bool = False, 
                  **kwargs) -> ResNet:
    # Predefined configurations | Add more if you create more block types
    configs = {
        'resnet18': (ResidualBlock, [2, 2, 2, 2]),
        'resnet34': (ResidualBlock, [3, 4, 6, 3]),
        'resnet50': (Bottleneck, [3, 4, 6, 3]),
        'resnet101': (Bottleneck, [3, 4, 23, 3]),
        'resnet152': (Bottleneck, [3, 8, 36, 3]),
        # Smaller 3-stage ResNets for smaller datasets (the block counts for depths 20-56 follow
        # the CIFAR-10 variants in [1]; 'resnet_mnist' is not from the paper)
        'resnet20': (ResidualBlock, [3, 3, 3]),
        'resnet32': (ResidualBlock, [5, 5, 5]),
        'resnet44': (ResidualBlock, [7, 7, 7]),
        'resnet56': (ResidualBlock, [9, 9, 9]),
        'resnet_mnist': (ResidualBlock, [2, 2, 2]),
    }
    
    # If model_type is provided, use predefined configuration
    if model_type is not None:
        if model_type not in configs:
            raise ValueError(f"Unsupported model type: {model_type}. "
                           f"Available types: {list(configs.keys())}")
        block, layers = configs[model_type]
    
    # Otherwise, use custom configuration
    else:
        if num_blocks is None:
            raise ValueError("Either model_type or num_blocks must be provided")
        
        if block_type.lower() == 'basic':
            block = ResidualBlock
        elif block_type.lower() == 'bottleneck':
            block = Bottleneck
        else:
            raise ValueError("block_type must be 'basic' or 'bottleneck'")
        
        layers = num_blocks
    
    return ResNet(block, layers, 
                  num_classes=num_classes, 
                  in_channels=in_channels,
                  return_features=return_features,  # For returning the embeddings before classifier
                  **kwargs)
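
The model names in `configs` encode the number of weighted layers, which gives a quick sanity check for the block counts. The following is a minimal sketch and not part of the notebook's API: `weighted_layers` and `depth_configs` are hypothetical names, and the count assumes two convolutions per basic block, three per bottleneck block, plus the stem convolution and the final linear layer (projection shortcuts are not counted).

```python
def weighted_layers(blocks_per_stage, convs_per_block):
    """Count block convolutions plus the stem convolution and the final linear layer."""
    return sum(blocks_per_stage) * convs_per_block + 2


# (blocks per stage, convolutions per block): 2 for basic blocks, 3 for bottlenecks
depth_configs = {
    "resnet18": ([2, 2, 2, 2], 2),
    "resnet34": ([3, 4, 6, 3], 2),
    "resnet50": ([3, 4, 6, 3], 3),
    "resnet101": ([3, 4, 23, 3], 3),
    "resnet152": ([3, 8, 36, 3], 3),
    "resnet20": ([3, 3, 3], 2),
    "resnet56": ([9, 9, 9], 2),
}

for name, (blocks, convs) in depth_configs.items():
    print(f"{name}: {weighted_layers(blocks, convs)} weighted layers")
```

If you add your own configuration, the same count tells you which depth the name should carry.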

### 4.2 Channel Width and Feature Map Progression

In the standard ResNet architecture [[2](#6-references)], the number of feature channels follows a specific progression through the network stages:

1. **Initial Convolution**: Starts with 64 channels
2. **Four Stages**: Each subsequent stage doubles the number of channels
   - Stage 1: 64 channels (or 64 × block.expansion for bottleneck blocks)
   - Stage 2: 128 channels (or 128 × block.expansion)
   - Stage 3: 256 channels (or 256 × block.expansion)
   - Stage 4: 512 channels (or 512 × block.expansion)

For standard ResNets with basic blocks (ResNet-18/34), the `expansion` factor is 1, resulting in 64, 128, 256, and 512 channels. For bottleneck architectures (ResNet-50/101/152), the `expansion` factor is 4, resulting in 256, 512, 1024, and 2048 channels at the output of each stage.

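This doubling of channels is coordinated with spatial downsampling (via strided convolutions), which halves the feature map resolution between stages. Doubling the channels while halving the resolution keeps the computational cost per layer roughly constant across stages while increasing representational capacity. Note that the name *bottleneck* does not refer to this downsampling: it describes the 1×1 convolutions that first reduce and then restore the channel count inside a block.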

<div align="center">
    <img src="figures/resnet_channels.png" width="500"/>
    <p><i>Figure 7: Channel width progression in ResNet with layers = [2, 2, 2]</i></p>
</div>
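
The progression described above can be tabulated with a few lines of plain Python. This is a sketch under stated assumptions, not code from the notebook: `stage_shapes` is a hypothetical helper, and it assumes the standard ImageNet stem (a stride-2 convolution followed by a stride-2 max pooling, so the input is downsampled by 4 before the first stage) and a stride of 2 at the start of every stage after the first.

```python
def stage_shapes(expansion, input_size=224, base_channels=64, num_stages=4):
    """Return (output channels, feature map size) for every stage."""
    size = input_size // 4  # assumed stem: stride-2 conv + stride-2 max pooling
    shapes = []
    for stage in range(num_stages):
        if stage > 0:
            size //= 2  # strided convolution at the start of the stage
        channels = base_channels * 2 ** stage * expansion
        shapes.append((channels, size))
    return shapes


print("basic blocks (expansion=1):     ", stage_shapes(expansion=1))
print("bottleneck blocks (expansion=4):", stage_shapes(expansion=4))
```

For a 224×224 input this gives feature maps of size 56, 28, 14, and 7, with the channel counts listed above.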

#### 4.2.1 ResNet Variants with Modified Channel Widths

Several ResNet variants modify the standard channel progression to achieve specific goals:

1. **Wide ResNet**: Increases the base width of the network by using a wider channel multiplier
   - For example, Wide-ResNet-50-2 doubles the internal width of bottleneck blocks
   - The outer channels remain the same (256, 512, 1024, 2048), but the internal 3×3 convolution width doubles in every stage (64→128, 128→256, 256→512, 512→1024)
   - This increases model capacity while maintaining the same depth

2. **ResNeXt**: Introduces grouped convolutions with expanded width
   - ResNeXt models are parameterized by their cardinality (number of groups) and width
   - For example, ResNeXt-50 32×4d has 32 groups and a base width of 4 channels per group
   - This enables wider networks with controlled parameter growth

3. **EfficientNet** (see later section): Some adaptations use channel width multipliers to scale the entire network
   - These typically scale all channel dimensions by a constant factor
   - Helps balance model size and computational requirements

In the torchvision implementation, channel width modification is controlled by two parameters:
- `groups`: Controls the number of groups in 3×3 convolutions (default=1 for standard ResNets)
- `width_per_group`: Base width for each group (default=64 for standard ResNets)

The effective width of the internal bottleneck is then calculated as:
```python
width = int(planes * (base_width / 64.0)) * groups
```

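Here `base_width` is the value passed as `width_per_group`. For standard ResNets, this formula simplifies to `width = planes` (since `base_width=64` and `groups=1`). For Wide ResNet-50-2 (`base_width=128`), it becomes `width = planes * 2`, and for ResNeXt-50 32×4d (`base_width=4`, `groups=32`), the first stage (`planes=64`) gets `width = 4 * 32 = 128`.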
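
To see how the formula behaves across all four stages, it can be evaluated directly. This is a small sketch: `bottleneck_width` is a hypothetical helper that only wraps the formula above, and the per-variant settings are the `groups` / `width_per_group` values these models are commonly configured with.

```python
def bottleneck_width(planes, base_width=64, groups=1):
    """Width of the internal 3x3 convolution of a bottleneck block."""
    return int(planes * (base_width / 64.0)) * groups


stage_planes = [64, 128, 256, 512]
variants = {
    "ResNet-50": {"base_width": 64, "groups": 1},
    "Wide ResNet-50-2": {"base_width": 128, "groups": 1},
    "ResNeXt-50 32x4d": {"base_width": 4, "groups": 32},
}

for name, kwargs in variants.items():
    widths = [bottleneck_width(p, **kwargs) for p in stage_planes]
    print(f"{name}: {widths}")
```

Wide ResNet-50-2 and ResNeXt-50 32×4d end up with the same internal widths; they differ in that ResNeXt splits each 3×3 convolution into 32 groups, which keeps the parameter count much lower.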

### 4.3 Loading Pre-trained ResNet Models

Torchvision provides pre-trained ResNet models that can be easily loaded and used:

```python
import torchvision.models as models

# Load a pretrained ResNet model (you can choose from several variants)
# Common options: resnet18, resnet34, resnet50, resnet101, resnet152
pretrained_resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Print the model architecture to see the structure
print("Model structure summary:")
print(f"Number of parameters: {sum(p.numel() for p in pretrained_resnet.parameters())}")
print("First few layers:")
for name, module in list(pretrained_resnet.named_children())[:4]:
    print(f"{name}: {module}")
```

You can also load pre-trained models from `timm`, a popular library for image models that we discussed in the [DropPath section](#3-implementation-of-resnet-building-blocks). You can also download models from [huggingface.co/models](https://huggingface.co/models) or directly from [GitHub](https://github.com/KaimingHe/deep-residual-networks) if the authors provide them.
Although there are conventions for naming the attributes of a model, the safest way is to check the documentation of the library you are using, or to print the model as above, to ensure that you reference the right part (for example `head` for the classifier).

### 4.4 Transfer Learning with ResNet

ResNet is commonly used as a backbone for transfer learning in computer vision tasks (see YOLO notebook). Here's how you can use a pre-trained ResNet for a new classification task:

```python
def create_transfer_model(num_classes):
    # Load pre-trained ResNet (`pretrained=True` is deprecated in favor of `weights=`)
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    
    # Freeze all the parameters
    for param in model.parameters():
        param.requires_grad = False
    
    # Replace the last fully connected layer
    num_features = model.fc.in_features
    model.fc = nn.Linear(num_features, num_classes)
    
    return model
```

> As mentioned above: Please check that the attributes you are using are the right ones. For example, in the code above we use `model.fc` to access the classifier head, but in other libraries it might be called `head` (as in our implementation), `classifier`, or something else entirely. Sometimes the model is also wrapped in another module, so you have to access it like this: `model.module.fc`! `print(model)` is your friend here.
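
The freeze-and-replace pattern can be verified without downloading any weights by counting trainable parameters. The sketch below uses a tiny stand-in network instead of a real ResNet; `TinyNet`, `freeze_and_replace_head`, and `count_parameters` are hypothetical names introduced only for this illustration.

```python
import torch.nn as nn


class TinyNet(nn.Module):
    """Stand-in for a pretrained model with a classifier attribute called `fc`."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Linear(8, 1000)

    def forward(self, x):
        return self.fc(self.backbone(x))


def freeze_and_replace_head(model, head_name, num_classes):
    """Freeze all existing parameters, then attach a new trainable linear head."""
    for param in model.parameters():
        param.requires_grad = False
    old_head = getattr(model, head_name)
    setattr(model, head_name, nn.Linear(old_head.in_features, num_classes))
    return model


def count_parameters(model):
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


model = freeze_and_replace_head(TinyNet(), head_name="fc", num_classes=10)
trainable, total = count_parameters(model)
print(f"trainable: {trainable} / total: {total}")
```

Only the new head should show up as trainable. If the trainable count equals the total, the freezing step did not take effect.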

## 5. Beyond ResNet: Efficient Model Scaling

After exploring the ResNet architecture and its implementation, let's look at a more recent advancement in convolutional neural networks: **EfficientNet**. Introduced in 2019 by Tan and Le [[3](#6-references)], EfficientNet addresses a key question in neural network design:

> How do we effectively scale models to achieve better accuracy with limited computational resources?

ResNet showed that deeper networks can be successfully trained through residual connections. However, EfficientNet demonstrates that we need to carefully balance **depth**, **width**, and **resolution** to achieve optimal performance.

### The Problem with Traditional Scaling Methods

Traditional methods for scaling neural networks typically focus on one dimension:

1. **Depth Scaling**: Adding more layers (like going from ResNet-18 to ResNet-50, ResNet-101, etc.)
2. **Width Scaling**: Increasing the number of channels in each layer
3. **Resolution Scaling**: Using higher resolution input images

<div align="center">
    <img src="figures/scaling.PNG" width="900"/>
    <p><i>Figure 8: Model scaling methods. Source: [3]</i></p>
</div>

While each approach improves accuracy, scaling any single dimension quickly reaches diminishing returns. The key insight of EfficientNet is that these dimensions are not independent and should be scaled together.

### Compound Scaling: The Key Insight of EfficientNet

EfficientNet's major contribution is **compound scaling**, which scales all three dimensions together using a single compound coefficient $\phi$:

- Depth: $d = \alpha^\phi$
- Width: $w = \beta^\phi$
- Resolution: $r = \gamma^\phi$

Where $\alpha$, $\beta$, $\gamma$ are constants determined by a small grid search, and $\phi$ is the compound coefficient that controls the overall scaling.

The intuition behind compound scaling is straightforward:

- If input resolution increases, we need more layers (depth) to increase the receptive field
- With higher resolution and more layers, we need more channels (width) to capture fine-grained patterns
- These dimensions should be balanced together rather than scaled independently
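
The scaling rule can be made concrete with a short calculation. The sketch below assumes the constants reported in [3] for the EfficientNet-B0 baseline ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$), which were chosen under the constraint $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ so that the FLOPs roughly double with every increment of $\phi$. `compound_scale` is a hypothetical helper, and the released B1 to B7 models do not necessarily use exactly these multipliers.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # constants reported in [3]


def compound_scale(phi):
    """Return the depth, width, resolution, and approximate FLOPs multipliers."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs of a conv net grow linearly with depth and quadratically with width and resolution
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

Because FLOPs depend quadratically on width and resolution but only linearly on depth, the depth constant can be the largest of the three while still keeping the total cost near $2^\phi$.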

### ResNet vs. EfficientNet

Compared to ResNet-50, EfficientNet-B4 uses a similar number of FLOPs, yet it improves the top-1 accuracy on ImageNet from about 76% to around 83% (see Figure 9). This suggests that compound scaling yields not only higher accuracy but also better computational efficiency. Please consider this for your own UTKFace age task.

<div align="center">
    <img src="figures/effnet_resnet.png" width="500"/>
    <p><i>Figure 9: FLOPS vs. ImageNet Accuracy with different architectures. Source: [3]</i></p>
</div>

You can use pre-trained EfficientNets from [`timm`](https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/efficientnet.py).

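```python
import timm
import torch

# Load a pretrained EfficientNet-B0 model
model = timm.create_model('efficientnet_b0', pretrained=True)

# If you want to modify the classifier for a different number of classes
# model = timm.create_model('efficientnet_b0', pretrained=True, num_classes=10)

# For inference, switch to evaluation mode
model.eval()

# For fine-tuning, you can instead give the backbone a smaller learning rate than the head.
# In timm's EfficientNet the classifier head is `model.classifier`; check with print(model).
base_lr = 1e-3      # example value for the head
backbone_lr = 1e-4  # example value for the backbone
head_params = [p for n, p in model.named_parameters() if n.startswith('classifier.')]
backbone_params = [p for n, p in model.named_parameters() if not n.startswith('classifier.')]
lr_params = [
    {'params': head_params, 'lr': base_lr},
    {'params': backbone_params, 'lr': backbone_lr},
]

optimizer = torch.optim.SGD(lr_params, weight_decay=0.01, momentum=0.9)
```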

## 6. References

[1] [He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).](https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html)

[2] [torchvision/models/resnet.py](https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py)

[3] [Tan, M., & Le, Q. (2019, May). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105-6114). PMLR.](https://arxiv.org/pdf/1905.11946.pdf)

# Training and evaluation of a ResNet on Imagenette

Let's put this into action and train a ResNet on the Imagenette dataset. We will use the `create_resnet` function to create a ResNet-18 model. Depending on your hardware, you can choose to train a larger ResNet model (e.g., ResNet-34). But first, let's load the dataset and define the image augmentation and normalization pipelines.

In [None]:
import torchvision.transforms.v2 as v2
from Utils.dataloaders import prepare_imagenette

# define hyperparameters
batch_size = 256
num_workers = 4

transform_augm = transforms.Compose([
    v2.ToImage(),
    # Core transformations
    v2.RandomResizedCrop(size=224, scale=(0.75, 1.0), ratio=(0.9, 1.05)),
    v2.RandomHorizontalFlip(p=0.5),  # Objects can appear mirrored
    v2.RandomRotation(degrees=(-10, 10)),  # Small rotations
    
    # Lighting and appearance variations
    v2.ColorJitter(brightness=0.15, contrast=0.15, saturation=0.1, hue=0.05),
    v2.RandomAutocontrast(p=0.2),
    
    # Occasional realistic variations - with proper probability handling
    v2.RandomApply([v2.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0))], p=0.3),
    v2.RandomAdjustSharpness(sharpness_factor=1.5, p=0.3),
    v2.RandomPerspective(distortion_scale=0.15, p=0.3),
    v2.RandomErasing(p=0.1, scale=(0.02, 0.08), ratio=(0.3, 3.3)),
    
    # Normalization
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

transform_norm = transforms.Compose([
    v2.ToImage(),
    v2.ToDtype(torch.float32, scale=True),
    v2.Resize(size=(224, 224)),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Load the Imagenette dataset
train_loader, val_loader, classes = prepare_imagenette(train_compose=transform_augm, 
                                                       test_compose=transform_norm, 
                                                       save_path='../Dataset/',
                                                       batch_size=batch_size, 
                                                       num_workers=num_workers)


### Initialize model

In [None]:
resnet_type = 'resnet18'
num_classes = len(classes)
resnet_model = create_resnet(model_type=resnet_type, 
                             num_classes=num_classes, 
                             in_channels=3, 
                             return_features=True)  # We will later visualize the features

# model summary
print(resnet_model)

# learnable parameters
from Utils.little_helpers import get_parameters

print(f"Number of trainable parameters: {get_parameters(resnet_model):.3f}M")

### Training

In [None]:
# Let's define our optimizer and loss function. Normally ResNets are trained with SGD, but Adam is also a good choice.
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

num_epochs = 10

init_lr = 1e-2
w_decay = 1e-3
optimizer = optim.SGD(resnet_model.parameters(), lr=init_lr, momentum=0.9, weight_decay=w_decay)
scheduler = StepLR(optimizer, step_size=3, gamma=0.5)  # Reduce lr by a factor `gamma` every `step_size` epochs
loss_fn = nn.CrossEntropyLoss()  

# Training loop
from Utils.functions import train_model

results_folder = 'resnet_model/'
os.makedirs(results_folder, exist_ok=True)

with timer("Training process"):
    history = train_model(model=resnet_model, 
                          train_loader=train_loader, 
                          val_loader=val_loader, 
                          criterion=loss_fn,
                          optimizer=optimizer,
                          scheduler=scheduler,
                          device=device,
                          num_epochs=num_epochs,
                          checkpoint_path=results_folder,
                          patience=5)
    
torch.save(resnet_model.state_dict(), f'{results_folder}{resnet_type}_dict.pth')

# save training history (reload with np.load(..., allow_pickle=True).item())
np.save(f'{results_folder}history.npy', history)
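
The learning rate schedule configured above can be written in closed form, which makes it easy to see what rate each epoch is trained with. This is a sketch that assumes `train_model` calls `scheduler.step()` once per epoch (not verified here); `step_lr` is a hypothetical helper that mirrors the `StepLR` settings used in this notebook.

```python
def step_lr(epoch, init_lr=1e-2, step_size=3, gamma=0.5):
    """Learning rate after `epoch` scheduler steps under a step decay schedule."""
    return init_lr * gamma ** (epoch // step_size)


for epoch in range(10):
    print(f"epoch {epoch}: lr = {step_lr(epoch):.5f}")
```

With 10 epochs the rate is halved three times, so the last epoch runs at one eighth of the initial learning rate.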

In [None]:
from Utils.plotting import visualize_training_results

visualize_training_results(train_losses=history['train_loss'],
                           train_accs=history['train_acc'],
                           test_losses=history['val_loss'],
                           test_accs=history['val_acc'],
                           output_dir=None)

### Evaluation

In [None]:
from Utils.functions import test_model
from Utils.plotting import visualize_test_results

# evaluate model on validation set
with timer("Evaluating process"):
    aggregate_df, per_image_df, overall_accuracy, embeddings = test_model(model=resnet_model,
                                                                          test_loader=val_loader,
                                                                          device=device,
                                                                          class_names=classes,
                                                                          print_per_class_summary=True,
                                                                          collect_embeddings=True,)

# save dataframes as parquet (requires pyarrow or fastparquet); fall back to pickle otherwise
try:
    aggregate_df.to_parquet(os.path.join(results_folder, 'aggregate_df.parquet'))
    per_image_df.to_parquet(os.path.join(results_folder, 'per_image_df.parquet'))
except ImportError:
    aggregate_df.to_pickle(os.path.join(results_folder, 'aggregate_df.pkl'))
    per_image_df.to_pickle(os.path.join(results_folder, 'per_image_df.pkl'))
    
visualize_test_results(aggregate_df=aggregate_df,
                       per_image_df=per_image_df,
                       overall_accuracy=overall_accuracy,
                       output_dir=None)


### Visualize features

In [None]:
from Utils.plotting import visualize_embeddings_tsne, visualize_embeddings_pca

# Visualize the embeddings using t-SNE
visualize_embeddings_tsne(embeddings=embeddings['all_embeddings'],
                          labels=embeddings['all_labels'],
                          class_names=classes,
                          output_dir=None)

# Visualize the embeddings using PCA
visualize_embeddings_pca(embeddings=embeddings['all_embeddings'],
                         labels=embeddings['all_labels'],
                         class_names=classes,
                         output_dir=None)


## 1. Exercise: Implement DropPath and DropBlock in ResNet

As mentioned in the text, our implementation of ResNet does not include any regularization techniques like DropPath or DropBlock. Your task is to implement these techniques in the ResNet architecture. You can choose to implement either one or both of them. Choose either the `ResidualBlock` or `Bottleneck` class to implement them. You can also use the `DropPath` and `DropBlock` classes provided above. Make sure to test your implementation on the Imagenette dataset and compare the results with the baseline ResNet model **without** these techniques.

In [None]:
# TODO: Replace the following code with your implementation of DropPath and DropBlock in the Bottleneck or ResidualBlock class


## 2. Exercise: Experiment with a pretrained ResNet model

As mentioned in the text, you can use a pretrained ResNet model from torchvision or timm. Your task is to load a pretrained ResNet model and fine-tune it on the Imagenette dataset. You can choose to freeze some layers of the model or to train all of them. Compare the results with the baseline ResNet model and the one with DropPath or DropBlock.

> Note that you can add **Layer-specific learning rates**: The more semantic the information is (later layers), the higher the learning rate should be. Early layers capture generic features like edges and textures, while later layers capture more dataset-specific features.

```python
# Define different learning rates for different parts of the network (for a pretrained torchvision model)
# Lower learning rate for early layers (backbone)
backbone_lr = 1e-5
# Medium learning rate for middle layers
middle_lr = 5e-5
# Higher learning rate for the classifier (final layers)
head_lr = 1e-3

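# Group parameters by their position in the network (example).
# Parameters that are not in any group are not updated by the optimizer.
param_groups = [
    # Stem (conv1 + bn1, most generic features) - lowest learning rate
    {'params': list(model.conv1.parameters()) + list(model.bn1.parameters()), 'lr': backbone_lr},
    # Layer1 (early features) - lowest learning rate
    {'params': model.layer1.parameters(), 'lr': backbone_lr},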
    # Layer2 (early-mid features)
    {'params': model.layer2.parameters(), 'lr': backbone_lr * 2},
    # Layer3 (mid features)
    {'params': model.layer3.parameters(), 'lr': middle_lr},
    # Layer4 (semantic features)
    {'params': model.layer4.parameters(), 'lr': middle_lr * 2},
    # Classifier (task-specific) - highest learning rate
    {'params': model.fc.parameters(), 'lr': head_lr}
]

optimizer = optim.SGD(param_groups, momentum=0.9)
```

In [None]:
# Your code here