# Building the YOLOv11 Backbone

*Notebook 2 of 5 in the YOLOv11 from-scratch series*

## Introduction

The backbone is the feature extraction engine of any object detection model. In YOLOv11, the backbone extracts **hierarchical features at multiple spatial scales**, enabling the detector to find objects ranging from small pedestrians to large vehicles in a single forward pass.

### Key innovations in the YOLOv11 backbone

1. **C3k2 block** - A Cross Stage Partial (CSP) bottleneck that uses 2 projection convolutions instead of 3. It projects the input with a 1x1 convolution, splits the result into two halves, passes one half through a series of bottleneck blocks, collects every intermediate output, concatenates everything, and projects back. This is more parameter-efficient than the older C3 block while achieving similar representational power.

2. **SPPF (Spatial Pyramid Pooling - Fast)** - Applies three sequential 5x5 max-pooling operations (equivalent to 5x5, 9x9, and 13x13 receptive fields) to capture multi-scale contextual information without increasing computational cost significantly.

3. **Multi-scale outputs** - The backbone produces three feature maps at different resolutions:
   - **P3**: stride 8 (80x80 for 640x640 input) - fine-grained features for small objects
   - **P4**: stride 16 (40x40) - mid-level features for medium objects
   - **P5**: stride 32 (20x20) - coarse features with large receptive field for large objects

By the end of this notebook, you will have a fully functional YOLOv11 backbone implemented from scratch in PyTorch.

In [None]:
# --- Colab Environment Setup ---
import sys
IN_COLAB = "google.colab" in sys.modules
if IN_COLAB:
    %pip install -q matplotlib seaborn scikit-learn scipy tqdm
    print("Colab dependencies installed")


## Imports

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np
from typing import List, Tuple

## Building blocks: Conv-BN-SiLU

Every convolutional layer in modern YOLO architectures follows the same pattern:

1. **Convolution** (`nn.Conv2d`) - the learnable spatial filter, with `bias=False` since batch normalization handles the bias term.
2. **Batch Normalization** (`nn.BatchNorm2d`) - normalizes activations across the batch, stabilizing training and allowing higher learning rates.
3. **SiLU activation** (also known as Swish: $f(x) = x \cdot \sigma(x)$) - a smooth, non-monotonic activation that recent YOLO versions use in place of ReLU or Leaky ReLU.

This pattern is so pervasive that we encapsulate it in a single `ConvBNSiLU` module. The `padding` parameter defaults to `kernel_size // 2`, which preserves spatial dimensions for odd kernel sizes (the standard choice).
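
The two claims above (the convolution bias is redundant before batch normalization, and `kernel_size // 2` padding preserves spatial size at stride 1) can be checked numerically. The following is a minimal sketch using plain `nn.Conv2d` and `nn.BatchNorm2d` layers; the tensor sizes and the bias range are arbitrary choices made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two convolutions sharing the same weights; only one carries a bias.
conv_bias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True)
conv_nobias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    conv_nobias.weight.copy_(conv_bias.weight)
    conv_bias.bias.uniform_(-1.0, 1.0)  # make the bias clearly non-zero

bn = nn.BatchNorm2d(8)  # training mode: normalizes with batch statistics
x = torch.randn(4, 3, 16, 16)

with torch.no_grad():
    out_bias = bn(conv_bias(x))
    out_nobias = bn(conv_nobias(x))

# BatchNorm subtracts the per-channel mean, so a constant bias cancels out.
max_diff = (out_bias - out_nobias).abs().max().item()
print(f"max difference after BN: {max_diff:.2e}")

# padding = kernel_size // 2 keeps H and W unchanged at stride 1.
same_shape = conv_nobias(x).shape[-2:] == x.shape[-2:]
print("spatial size preserved:", same_shape)
```

The difference should be at the level of floating-point rounding, which is why the bias can be dropped without changing what the layer can represent.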

In [None]:
class ConvBNSiLU(nn.Module):
    """Standard Conv + BatchNorm + SiLU (Swish) activation block."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 1,
                 stride: int = 1, padding: int = None, groups: int = 1):
        super().__init__()
        if padding is None:
            padding = kernel_size // 2
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride,
                              padding, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

## Bottleneck block

The `Bottleneck` is the fundamental processing unit inside CSP blocks. It consists of two convolutions:

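1. A first convolution (`cv1`) that maps the input to `hidden = int(out_channels * expansion)` channels. With the default `expansion=0.5` this is a **squeeze** to half the output width; C3k2 below passes `expansion=1.0`, so its bottlenecks keep the width constant.
2. A second convolution (`cv2`) that **expands** the hidden channels to `out_channels`.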

When `shortcut=True` and the input/output channel counts match, a **residual connection** adds the input directly to the output. This identity shortcut helps gradients flow through deep networks and has been a cornerstone of modern architectures since ResNet.

The `kernel_size` parameter accepts a tuple `(k1, k2)` to independently set the kernel size for each convolution. YOLOv11's C3k2 block uses `(3, 3)` by default.

In [None]:
class Bottleneck(nn.Module):
    """Standard bottleneck with optional residual connection."""

    def __init__(self, in_channels: int, out_channels: int, shortcut: bool = True,
                 kernel_size: Tuple[int, int] = (3, 3), expansion: float = 0.5):
        super().__init__()
        hidden = int(out_channels * expansion)
        self.cv1 = ConvBNSiLU(in_channels, hidden, kernel_size[0])
        self.cv2 = ConvBNSiLU(hidden, out_channels, kernel_size[1])
        self.add = shortcut and in_channels == out_channels

    def forward(self, x):
        out = self.cv2(self.cv1(x))
        return x + out if self.add else out

## C3k2 block: CSP with 2 convolutions

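The **C3k2** (Cross Stage Partial with 2 convolutions) block is a key architectural element of YOLOv11. It is designed to be more parameter-efficient than the older C3 block. The version built here uses plain `Bottleneck` units, which gives it the same layout as the C2f block from YOLOv8.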

### How CSP works

The Cross Stage Partial (CSP) design philosophy is:

1. **Split**: A 1x1 convolution (`cv1`) projects the input into `2 * hidden_channels`, then the output is split (chunked) into two equal halves along the channel dimension.
2. **Transform**: One half passes through a series of `n` bottleneck blocks. Crucially, **each bottleneck's output is collected** (not just the final one), creating a dense connection pattern.
3. **Concatenate**: Both split halves, plus all `n` bottleneck outputs (a total of `2 + n` feature groups of `hidden_channels` each), are concatenated along the channel dimension.
4. **Project**: A final 1x1 convolution (`cv2`) fuses the concatenated features back to the desired output channel count.

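The "2 convolutions" in C3k2 refers to the two projection convolutions (`cv1` and `cv2`), distinguishing it from C3 which uses three. The "k" is best read as part of the name, not as a kernel count: in the Ultralytics code base, `C3k` denotes a C3 block with a configurable kernel size, which C3k2 can optionally use as its inner block. This notebook uses plain bottlenecks with the kernel size pair `(3, 3)` instead.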
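
The channel bookkeeping of the split, transform, and concatenate steps can be traced on plain tensors. This is a sketch only: `torch.tanh` is an arbitrary shape-preserving stand-in for a real bottleneck, and the sizes `c = 32`, `n = 2` are illustrative assumptions.

```python
import torch

batch, c, n = 1, 32, 2  # c = hidden channels, n = number of bottlenecks
features = torch.randn(batch, 2 * c, 40, 40)  # stand-in for the output of cv1

# Split: two halves of c channels each.
y = list(features.chunk(2, dim=1))

# Transform: each step consumes the most recent tensor and appends its output.
# torch.tanh is only a shape-preserving placeholder for a bottleneck (assumption).
for _ in range(n):
    y.append(torch.tanh(y[-1]))

# Concatenate: both halves plus n outputs give (2 + n) * c channels.
merged = torch.cat(y, dim=1)
print(len(y), tuple(merged.shape))
```

This is why `cv2` in the block must accept `(2 + n) * hidden_channels` input channels.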

In [None]:
class C3k2(nn.Module):
    """CSP Bottleneck with 2 convolutions (YOLOv11 variant).

    Splits input channels, processes one part through bottleneck blocks,
    concatenates, and projects back. More efficient than C3 with similar performance.
    """

    def __init__(self, in_channels: int, out_channels: int, n: int = 1,
                 shortcut: bool = True, expansion: float = 0.5):
        super().__init__()
        self.c = int(out_channels * expansion)  # hidden channels
        self.cv1 = ConvBNSiLU(in_channels, 2 * self.c, 1)
        self.cv2 = ConvBNSiLU((2 + n) * self.c, out_channels, 1)
        self.bottlenecks = nn.ModuleList(
            Bottleneck(self.c, self.c, shortcut, kernel_size=(3, 3), expansion=1.0)
            for _ in range(n)
        )

    def forward(self, x):
        # Split into two branches
        y = list(self.cv1(x).chunk(2, dim=1))
        # Pass through sequential bottlenecks, collecting outputs
        for block in self.bottlenecks:  # `block` avoids confusion with BatchNorm (`bn`)
            y.append(block(y[-1]))
        return self.cv2(torch.cat(y, dim=1))

## SPPF: Spatial Pyramid Pooling - Fast

The **SPPF** (Spatial Pyramid Pooling - Fast) module addresses a fundamental challenge: how to capture context at multiple spatial scales without drastically increasing computation.

### Design

The original SPP module applied max-pooling with three different kernel sizes (5, 9, 13) in parallel. SPPF achieves the **same effective receptive fields** by applying a single 5x5 max-pool operation **three times sequentially**:

- After 1 pool: effective receptive field of 5x5
- After 2 pools: effective receptive field of 9x9
- After 3 pools: effective receptive field of 13x13

The four feature maps (original + 3 pooled versions) are concatenated and projected through a 1x1 convolution. Using `stride=1` and `padding=k//2` preserves the spatial dimensions throughout.

This sequential design is faster than parallel pooling because it reuses intermediate results, and it is applied only at the deepest stage of the backbone where feature maps are smallest (20x20).
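
The equivalence between sequential and parallel pooling can be verified directly with `torch.nn.functional.max_pool2d`. The sketch below uses a small random tensor; the helper name `pool` is introduced here for illustration only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 4, 20, 20)

def pool(t: torch.Tensor, k: int) -> torch.Tensor:
    """Stride-1 max pool with padding k // 2, so H and W are preserved."""
    return F.max_pool2d(t, kernel_size=k, stride=1, padding=k // 2)

# Sequential 5x5 pools, as in SPPF.
y1 = pool(x, 5)
y2 = pool(y1, 5)
y3 = pool(y2, 5)

# Parallel pools with larger kernels, as in the original SPP.
two_pools_match_9 = torch.equal(y2, pool(x, 9))
three_pools_match_13 = torch.equal(y3, pool(x, 13))
print(two_pools_match_9, three_pools_match_13)
```

Both comparisons should hold exactly, because a maximum over overlapping 5x5 maxima is the maximum over the union of their windows.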

In [None]:
class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast (SPPF).

    Three sequential 5x5 max-pool operations (equivalent to 5x5, 9x9, 13x13 pooling)
    capture multi-scale context efficiently.
    """

    def __init__(self, in_channels: int, out_channels: int, k: int = 5):
        super().__init__()
        hidden = in_channels // 2
        self.cv1 = ConvBNSiLU(in_channels, hidden, 1)
        self.cv2 = ConvBNSiLU(hidden * 4, out_channels, 1)
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

## Full backbone assembly

Now we assemble the complete YOLOv11 backbone by stacking the building blocks we have defined. The backbone is organized into a **stem** followed by **four stages**, each performing spatial downsampling (stride 2) and feature refinement:

| Component | Operation | Output Shape | Notes |
|-----------|-----------|-------------|-------|
| **Stem** | Conv 3x3, s=2 | 64 x 320 x 320 | Initial feature extraction |
| **Stage 1** | Conv 3x3, s=2 + C3k2(n=2) | 128 x 160 x 160 | Low-level features |
| **Stage 2** | Conv 3x3, s=2 + C3k2(n=2) | 256 x 80 x 80 | **P3 output** (stride 8) |
| **Stage 3** | Conv 3x3, s=2 + C3k2(n=2) | 512 x 40 x 40 | **P4 output** (stride 16) |
| **Stage 4** | Conv 3x3, s=2 + C3k2(n=2) + SPPF | 1024 x 20 x 20 | **P5 output** (stride 32) |

The three outputs (P3, P4, P5) form a **feature pyramid** that will be further refined by the neck (FPN/PAN) in the next notebook. Small objects are detected at P3 (high resolution, low-level features), while large objects are detected at P5 (low resolution, high-level semantic features).
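
The output shapes in the table follow from the standard convolution size formula, $\lfloor (H + 2p - k) / s \rfloor + 1$, applied once per stride-2 convolution. As a small sketch (the helper `conv_out_size` is a hypothetical name, not part of the backbone code):

```python
def conv_out_size(size: int, kernel: int = 3, stride: int = 2) -> int:
    """Output height/width of a convolution with padding = kernel // 2."""
    padding = kernel // 2
    return (size + 2 * padding - kernel) // stride + 1

input_size = 640
stages = [("Stem", 64), ("Stage 1", 128), ("Stage 2 (P3)", 256),
          ("Stage 3 (P4)", 512), ("Stage 4 (P5)", 1024)]

size = input_size
sizes = []
for name, channels in stages:
    size = conv_out_size(size)  # each stage starts with a stride-2 3x3 conv
    sizes.append(size)
    print(f"{name:<13} {channels:>4} x {size} x {size}  (stride {input_size // size})")
```

The C3k2 and SPPF blocks use stride 1 throughout, so only the five downsampling convolutions change the spatial size.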

In [None]:
class YOLOv11Backbone(nn.Module):
    """YOLOv11 backbone producing P3, P4, P5 feature maps.

    Architecture:
        Stem (3->64) -> Stage1 (64->128) -> Stage2 (128->256, P3)
        -> Stage3 (256->512, P4) -> Stage4 (512->1024) -> SPPF (P5)
    """

    def __init__(self, in_channels: int = 3, base_channels: int = 64):
        super().__init__()
        c1 = base_channels       # 64
        c2 = c1 * 2              # 128
        c3 = c2 * 2              # 256
        c4 = c3 * 2              # 512
        c5 = c4 * 2              # 1024

        # Stem
        self.stem = ConvBNSiLU(in_channels, c1, 3, stride=2)

        # Stage 1: downsample + C3k2
        self.stage1_down = ConvBNSiLU(c1, c2, 3, stride=2)
        self.stage1_c3k2 = C3k2(c2, c2, n=2, shortcut=True)

        # Stage 2: downsample + C3k2 -> P3 output
        self.stage2_down = ConvBNSiLU(c2, c3, 3, stride=2)
        self.stage2_c3k2 = C3k2(c3, c3, n=2, shortcut=True)

        # Stage 3: downsample + C3k2 -> P4 output
        self.stage3_down = ConvBNSiLU(c3, c4, 3, stride=2)
        self.stage3_c3k2 = C3k2(c4, c4, n=2, shortcut=True)

        # Stage 4: downsample + C3k2 + SPPF -> P5 output
        self.stage4_down = ConvBNSiLU(c4, c5, 3, stride=2)
        self.stage4_c3k2 = C3k2(c5, c5, n=2, shortcut=True)
        self.sppf = SPPF(c5, c5)

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """Forward pass returning multi-scale features.

        Args:
            x: (B, 3, 640, 640) input images
        Returns:
            p3: (B, 256, 80, 80) - stride 8
            p4: (B, 512, 40, 40) - stride 16
            p5: (B, 1024, 20, 20) - stride 32
        """
        # Stem: 640 -> 320
        x = self.stem(x)

        # Stage 1: 320 -> 160
        x = self.stage1_c3k2(self.stage1_down(x))

        # Stage 2: 160 -> 80 (P3)
        x = self.stage2_c3k2(self.stage2_down(x))
        p3 = x  # 256 channels, 80x80

        # Stage 3: 80 -> 40 (P4)
        x = self.stage3_c3k2(self.stage3_down(x))
        p4 = x  # 512 channels, 40x40

        # Stage 4: 40 -> 20 (P5)
        x = self.stage4_c3k2(self.stage4_down(x))
        p5 = self.sppf(x)  # 1024 channels, 20x20

        return p3, p4, p5

## Shape verification

Let us instantiate the backbone and verify that the output feature maps have the expected shapes. This is a critical sanity check: if the shapes are wrong, the downstream neck and head will fail.

In [None]:
# Verify output shapes
backbone = YOLOv11Backbone().eval()  # eval mode: BatchNorm running stats are not updated by test inputs
dummy_input = torch.randn(1, 3, 640, 640)

with torch.no_grad():
    p3, p4, p5 = backbone(dummy_input)

print("Input shape:", dummy_input.shape)
print(f"P3 shape: {p3.shape}  (stride 8,  {p3.shape[1]} channels)")
print(f"P4 shape: {p4.shape}  (stride 16, {p4.shape[1]} channels)")
print(f"P5 shape: {p5.shape}  (stride 32, {p5.shape[1]} channels)")

# Verify spatial dimensions
assert p3.shape == (1, 256, 80, 80), f"P3 expected (1, 256, 80, 80), got {p3.shape}"
assert p4.shape == (1, 512, 40, 40), f"P4 expected (1, 512, 40, 40), got {p4.shape}"
assert p5.shape == (1, 1024, 20, 20), f"P5 expected (1, 1024, 20, 20), got {p5.shape}"
print("\nAll shape checks passed!")

## Parameter count

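Understanding the parameter distribution across stages helps with model analysis and debugging. A convolution's weight count scales with `in_channels * out_channels`, so each doubling of the channel width roughly quadruples the parameters of a stage, and the deepest stage dominates the total.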
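
How channel width drives the count can be seen on a single 3x3 convolution. This is a minimal sketch; `conv3x3_params` is a hypothetical helper and the widths are chosen only for illustration.

```python
import torch.nn as nn

def conv3x3_params(channels: int) -> int:
    """Weight count of a bias-free 3x3 conv with equal input and output width."""
    conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
    return sum(p.numel() for p in conv.parameters())

counts = {c: conv3x3_params(c) for c in (64, 128, 256, 512)}
previous = None
for c, count in counts.items():
    ratio = "" if previous is None else f"  ({count // previous}x the previous width)"
    print(f"{c:>4} channels: {count:>9,} weights{ratio}")
    previous = count
```

Each count equals `3 * 3 * channels * channels`, so doubling the width multiplies it by four.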

In [None]:
def count_parameters(model):
    """Count trainable and total parameters."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total parameters: {total:,}")
    print(f"Trainable parameters: {trainable:,}")
    print(f"Size (MB, float32): {total * 4 / 1024 / 1024:.1f}")  # 4 bytes per parameter
    return total

# Per-stage breakdown
print("=== Parameter Breakdown ===")
for name, module in backbone.named_children():
    params = sum(p.numel() for p in module.parameters())
    print(f"  {name}: {params:,}")
print()
count_parameters(backbone)

## Feature map visualization

Visualizing the feature maps at each scale gives intuition for how spatial resolution changes across the pyramid. With random weights the activations carry no learned semantics, but the difference in granularity is already visible:

- **P3** (80x80) retains fine spatial detail
- **P4** (40x40) captures medium-scale structure
- **P5** (20x20) shows coarse, high-level patterns
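
Random noise produces noise-like activations at every scale. A structured test image can make the loss of spatial detail easier to see. The following is a sketch only: `make_structured_input` is a hypothetical helper, and the checkerboard and gradient pattern is an arbitrary choice.

```python
import torch

def make_structured_input(size: int = 640, tile: int = 80) -> torch.Tensor:
    """Hypothetical helper: checkerboard in channel 0, gradients in channels 1 and 2."""
    coords = torch.arange(size)
    yy, xx = torch.meshgrid(coords, coords, indexing="ij")
    checker = ((yy // tile + xx // tile) % 2).float()  # alternating tile x tile squares
    grad_x = xx.float() / (size - 1)                   # left-to-right ramp in [0, 1]
    grad_y = yy.float() / (size - 1)                   # top-to-bottom ramp in [0, 1]
    return torch.stack([checker, grad_x, grad_y]).unsqueeze(0)  # (1, 3, size, size)

x = make_structured_input()
print(tuple(x.shape), x.min().item(), x.max().item())
```

Passing such a tensor to the backbone in place of `torch.randn(1, 3, 640, 640)` is one way to get feature maps with recognizable edges.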

In [None]:
def visualize_feature_maps(features, names, num_channels=8):
    """Visualize the first few channels of each feature map."""
    # squeeze=False keeps `axes` 2-D even for a single row or column
    fig, axes = plt.subplots(len(features), num_channels, squeeze=False,
                             figsize=(20, 3 * len(features)))

    for i, (feat, name) in enumerate(zip(features, names)):
        feat_np = feat[0].detach().cpu().numpy()  # drop batch dim; .cpu() also covers GPU tensors
        for j in range(num_channels):
            ax = axes[i, j]
            # Hide ticks rather than the whole axis so the row label stays visible
            ax.set_xticks([])
            ax.set_yticks([])
            if j >= feat_np.shape[0]:
                ax.axis('off')  # fewer channels than columns: leave the cell empty
                continue
            ax.imshow(feat_np[j], cmap='viridis')
            if j == 0:
                ax.set_ylabel(name, fontsize=12, rotation=0, labelpad=60)

    plt.suptitle(f'Feature Map Activations (first {num_channels} channels per scale)', fontsize=14)
    plt.tight_layout()
    plt.show()

# Random-noise input: enough to show how resolution changes across scales
with torch.no_grad():
    x = torch.randn(1, 3, 640, 640)
    p3, p4, p5 = backbone(x)

visualize_feature_maps([p3, p4, p5], ['P3 (80x80)', 'P4 (40x40)', 'P5 (20x20)'])

## Architecture diagram

The following diagram summarizes the complete backbone data flow:

```
Input (3x640x640)
      |
   +------+
   | Stem |  Conv 3x3, s=2
   +------+  -> 64x320x320
      |
   +------+
   |  S1  |  Conv 3x3, s=2 -> C3k2(n=2)
   +------+  -> 128x160x160
      |
   +------+
   |  S2  |  Conv 3x3, s=2 -> C3k2(n=2)
   +------+  -> 256x80x80 --------> P3
      |
   +------+
   |  S3  |  Conv 3x3, s=2 -> C3k2(n=2)
   +------+  -> 512x40x40 --------> P4
      |
   +------+
   |  S4  |  Conv 3x3, s=2 -> C3k2(n=2) -> SPPF
   +------+  -> 1024x20x20 -------> P5
```

Each stage doubles the channel count while halving the spatial resolution. The SPPF module is applied only at the deepest level where the computational overhead is minimal but the benefit of multi-scale pooling is greatest.

## Summary

In this notebook, we built the complete YOLOv11 backbone from scratch. Here is a recap of the key design choices:

1. **ConvBNSiLU** provides a clean, reusable primitive that appears throughout the architecture. Disabling the convolution bias (since batch normalization subsumes it) saves parameters.

2. **Bottleneck** blocks with residual connections ease gradient flow, which makes deeper networks easier to train. The expansion factor controls the compute/accuracy tradeoff.

3. **C3k2** (CSP with 2 convolutions) is more parameter-efficient than C3 while maintaining strong feature extraction. The dense connections (collecting all bottleneck outputs) improve gradient flow and feature reuse.

4. **SPPF** captures multi-scale context through sequential max-pooling, enriching the deepest feature map with information from multiple receptive field sizes.

5. The **multi-scale output** design (P3, P4, P5) is essential for detecting objects of varying sizes. This feature pyramid will be further refined in the next notebook.

### Next steps

In **Notebook 3**, we will build the **FPN/PAN neck** that fuses these multi-scale features bidirectionally, and the **detection head** that produces bounding box predictions and class scores at each scale.