---
title: ResNet Skip-Connection Dimensioning and FPN
description: Why addition forces strict shape equality in ResNets, how 1×1 projection shortcuts handle dimension mismatches, and the canonical FPN featurizer that unifies backbone channels to d=256.
---

This notebook does two things:

1. Extracts dimensioning-relevant excerpts from the three foundational papers — ResNet (He et al., CVPR 2016), Identity Mappings (He et al., ECCV 2016), and FPN (Lin et al., CVPR 2017).

2. Demonstrates the **preferred, modern featurizer** pattern: a ResNet-style backbone with identity shortcuts when shapes match, 1×1 projection shortcuts when they don't, feeding into an FPN neck that unifies all pyramid levels to $d=256$ channels.

Papers (arXiv identifiers):
- ResNet: arXiv:1512.03385
- Identity mappings: arXiv:1603.05027
- FPN: arXiv:1612.03144


## 1) Dimensioning-relevant excerpts (text) and figure takeaways

### A. ResNet (He et al., CVPR 2016)

Short excerpt (dimension matching via projection; note the explicit stride-2 handling):

> “The projection shortcut … is used to match dimensions (done by 1×1 convolutions).”  
(He et al., 2016, Sec. 3.3)

> “When the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.”  
(He et al., 2016, Sec. 3.3)

Short excerpt (options for increased dimensions):

> “(A) … identity mapping, with extra zero entries padded … (B) … projection shortcut … to match dimensions”  
(He et al., 2016, Sec. 3.3)

Figure takeaways (do not reproduce the copyrighted figures here; consult the paper figures directly):
- ResNet Fig. 2 (residual building block): illustrates the residual branch $F(x)$ and the shortcut branch being added; addition requires exact shape match.
- ResNet Fig. 5 (bottleneck block): shows the 1×1–3×3–1×1 pattern that reduces then restores channels (e.g., 256→64→64→256); the shortcut is typically identity when input/output shapes match.
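
The channel arithmetic behind the bottleneck pattern can be made concrete by counting convolution weights. The sketch below assumes bias-free convolutions and ignores BatchNorm parameters; `conv_weights` is an illustrative helper, not something defined in the paper. It compares the 256→64→64→256 bottleneck with two 3×3 convolutions kept at 256 channels.

```python
def conv_weights(cin: int, cout: int, kernel: int) -> int:
    """Weight count of a bias-free 2D convolution with a square kernel."""
    return cin * cout * kernel * kernel

bottleneck = (
    conv_weights(256, 64, 1)    # 1x1 conv: reduce 256 -> 64
    + conv_weights(64, 64, 3)   # 3x3 conv at the reduced width
    + conv_weights(64, 256, 1)  # 1x1 conv: restore 64 -> 256
)
two_3x3 = 2 * conv_weights(256, 256, 3)  # two 3x3 convs at full width

print(bottleneck)  # 69632
print(two_3x3)     # 1179648
```

Because the block's input and output both have 256 channels, an identity shortcut fits here and adds no parameters.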

### B. Identity Mappings (He et al., ECCV 2016)

Short excerpt (why identity shortcuts are special in signal propagation):

> “forward and backward signals can be directly propagated … when using identity mappings as the skip connections …”  
(He et al., 2016, Abstract)

Figure takeaway:
- The paper analyzes variants of residual units and shows that moving toward “cleaner” identity shortcuts improves optimization/propagation (conceptual motivation for using projections only when necessary).
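
The propagation claim can be written out explicitly. Assuming identity shortcuts and no nonlinearity after the addition, the paper unrolls the recursion $x_{l+1} = x_l + F(x_l, W_l)$ from any unit $l$ to any deeper unit $L$ (here $\mathcal{E}$ denotes the loss):

\[
x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i),
\qquad
\frac{\partial \mathcal{E}}{\partial x_l}
= \frac{\partial \mathcal{E}}{\partial x_L}
\left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right).
\]

The constant term $1$ carries the gradient from $x_L$ back to $x_l$ without passing through any weight layer. When the shortcut is a scaling or other transform instead, the paper's analysis replaces that constant with a product of per-unit shortcut factors, which can shrink or grow with depth.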

### C. FPN (Lin et al., CVPR 2017)

Short excerpt (lateral dimension reduction for addition):

> “the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce channel dimensions) [is merged] by element-wise addition.”  
(Lin et al., 2017, Sec. 3)

Short excerpt (the canonical $d=256$ design):

> “We set d = 256 … thus all extra convolutional layers have 256-channel outputs.”  
(Lin et al., 2017, Sec. 3)

Figure takeaways:
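- FPN Fig. 3 (building block): shows the top-down map **upsampled (×2)** and the bottom-up map passed through a **1×1 lateral conv**, merged by **addition**. The **3×3 “smoothing” conv** applied to each merged map is described in the text of Sec. 3. The addition imposes strict spatial and channel alignment.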


## 2) The core dimensional constraint

Let $x \in \mathbb{R}^{B \times C_{in} \times H \times W}$. A residual unit computes

\[
y = F(x) + \mathcal{S}(x),
\]

and **addition requires identical tensor shapes**:

\[
F(x), \mathcal{S}(x) \in \mathbb{R}^{B \times C_{out} \times H' \times W'}.
\]

Hence the shortcut must handle two mismatches:
- channel mismatch: $C_{in} \neq C_{out}$
- spatial mismatch: $(H,W) \neq (H',W')$ (typically caused by stride-2 downsampling)
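
The spatial half of this constraint can be checked with the standard convolution output-size formula, $H' = \lfloor (H + 2p - k)/s \rfloor + 1$. The sketch below (the helper name `conv_out` is illustrative) suggests why a strided 1×1 convolution is a natural shortcut: with padding 0 it produces the same output size as a 3×3 convolution with padding 1 at the same stride, including for odd input sizes.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one spatial axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

for h in (56, 57, 7):
    for s in (1, 2):
        residual = conv_out(h, kernel=3, stride=s, padding=1)  # first 3x3 conv of the block
        shortcut = conv_out(h, kernel=1, stride=s, padding=0)  # 1x1 projection shortcut
        print(f"H={h}, stride={s}: residual {residual}, shortcut {shortcut}")
```

Both expressions reduce to $\lfloor (H - 1)/s \rfloor + 1$, so the two branches agree spatially whenever they share the stride.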


### Residual block — shortcut options

```mermaid
flowchart LR
    x["x\n[B, Cᵢₙ, H, W]"]
    x --> c1["Conv 3×3\nstride s"]
    c1 --> b1["BN + ReLU"]
    b1 --> c2["Conv 3×3\nstride 1"]
    c2 --> b2["BN"]
    b2 --> add["⊕ Add"]
    x -->|"Cᵢₙ=Cₒᵤₜ, s=1"| id["Identity"]
    x -->|"else"| proj["1×1 Conv, stride s\nOption B"]
    id --> add
    proj --> add
    add --> relu["ReLU"]
    relu --> y["y\n[B, Cₒᵤₜ, H/s, W/s]"]
```

Both inputs to ⊕ must have shape $[B, C_{out}, H', W']$: the identity path is valid only when $x$ already has that shape, and the projection path is configured (output channels $C_{out}$, stride $s$) to produce it.


In [None]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def shape(x):
    return tuple(x.shape)

def report(name, x):
    print(f"{name}: {shape(x)}")


## 3) ResNet-style block with correct shortcut dimensioning

We implement a standard BasicBlock with:
- residual branch: 3×3 conv → BN → ReLU → 3×3 conv → BN
- shortcut:
  - identity if stride=1 and $C_{in}=C_{out}$
  - otherwise a 1×1 conv (projection), with the same stride as the residual branch’s downsampling


In [None]:
class BasicBlock(nn.Module):
    def __init__(self, cin: int, cout: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(cin, cout, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(cout)
        self.conv2 = nn.Conv2d(cout, cout, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(cout)
        self.relu  = nn.ReLU(inplace=True)

        if stride != 1 or cin != cout:
            # Projection shortcut: matches channels and spatial size.
            self.shortcut = nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(cout),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)
        out = self.relu(out)
        return out


### 3.1) What goes wrong if you try to add mismatched tensors?

Below we show:
- a block that downsamples and increases channels (stride=2, 64→128)
- naive identity shortcut fails (shape mismatch)
- projection shortcut works


In [None]:
# Dummy input
x = torch.randn(2, 64, 56, 56)

# Residual branch that downsamples and changes channels
residual = nn.Sequential(
    nn.Conv2d(64, 128, 3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, 3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(128),
)

Fx = residual(x)
report("x", x)
report("F(x)", Fx)

print("\nAttempting F(x) + x (naive identity shortcut):")
try:
    _ = Fx + x
except RuntimeError as e:
    print("RuntimeError:", str(e).split("\n")[0])

print("\nUsing a projection shortcut (1×1 conv, stride=2):")
proj = nn.Conv2d(64, 128, 1, stride=2, bias=False)
Sx = proj(x)
report("S(x)", Sx)
y = Fx + Sx
report("F(x)+S(x)", y)


### 3.2) Option A vs. Option B (ResNet paper terminology)

In the ResNet paper’s discussion:
- Option A: downsample the shortcut (stride 2) and **zero-pad channels** to match $C_{out}$.
- Option B: downsample and **project with 1×1 conv** to match dimensions.

For FPN-style backbones, **Option B is the preferred practical choice** because:
- the feature hierarchy is consumed downstream (e.g., lateral merges), so having a learned projection at stage transitions is robust,
- and it matches the canonical ResNet-{50,101,152} “option B” design in the CVPR paper.
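
To put a number on the trade-off, the sketch below counts the parameters each option adds at the three stage transitions of a BasicBlock backbone with widths 64/128/256/512. It assumes a bias-free 1×1 convolution followed by BatchNorm (one scale and one shift per channel) for Option B; `option_b_params` is an illustrative name. Option A is parameter-free by construction.

```python
def option_b_params(cin: int, cout: int) -> int:
    """Learnable parameters in a 1x1 projection shortcut (bias-free conv + BatchNorm)."""
    conv = cin * cout  # 1x1 kernel, no bias
    bn = 2 * cout      # BatchNorm scale and shift
    return conv + bn

transitions = [(64, 128), (128, 256), (256, 512)]
for cin, cout in transitions:
    print(f"{cin}->{cout}: option A adds 0, option B adds {option_b_params(cin, cout)}")
```

The projection cost grows with the product of the two widths, which is why it is paid only at stage transitions and not in every block.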

Below is a small functional illustration of “Option A-like” padding for the channel mismatch (spatial downsample uses strided slicing for simplicity).


In [None]:
def option_a_shortcut(x, cout: int, stride: int):
    # Spatial downsample: emulate stride-2 shortcut by subsampling.
    if stride == 2:
        x_ds = x[:, :, ::2, ::2]
    elif stride == 1:
        x_ds = x
    else:
        raise ValueError("This demo only supports stride 1 or 2.")
    cin = x_ds.shape[1]
    if cin > cout:
        raise ValueError("Option A padding demo expects cin <= cout.")
    if cin == cout:
        return x_ds
    pad_c = cout - cin
    # Pad channels: (N,C,H,W). We pad on the channel dimension by concatenating zeros.
    zeros = torch.zeros(x_ds.shape[0], pad_c, x_ds.shape[2], x_ds.shape[3], device=x_ds.device, dtype=x_ds.dtype)
    return torch.cat([x_ds, zeros], dim=1)

# Demonstrate option A-like shortcut shape matching
x = torch.randn(2, 64, 56, 56)
Sx_a = option_a_shortcut(x, cout=128, stride=2)
report("Option-A-like S(x)", Sx_a)


## 4) A minimal ResNet-like backbone that exposes {C2, C3, C4, C5}

FPN (Lin et al.) uses the outputs of each ResNet stage’s last block:
$\{C2, C3, C4, C5\}$ with strides $\{4, 8, 16, 32\}$ relative to the input.

We build a small backbone with the same stage layout as ResNet-18: two BasicBlocks per stage, with channel widths 64/128/256/512.
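
The stride sequence can be traced by hand with the convolution output-size formula. The sketch below assumes the hyperparameters used in this notebook (7×7 stride-2 conv with padding 3, 3×3 stride-2 max pool with padding 1, then a stride-2 3×3 conv with padding 1 at the start of each later stage); `out_size` and `pyramid_sizes` are illustrative names.

```python
def out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a conv or pooling window along one axis (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1

def pyramid_sizes(input_size: int) -> dict:
    """Spatial side length of C2..C5 for a square input."""
    s = out_size(input_size, kernel=7, stride=2, padding=3)  # stem conv
    s = out_size(s, kernel=3, stride=2, padding=1)           # stem max pool
    sizes = {"C2": s}                                        # stage 1 keeps the size
    for name in ("C3", "C4", "C5"):
        s = out_size(s, kernel=3, stride=2, padding=1)       # stride-2 first block of the stage
        sizes[name] = s
    return sizes

print(pyramid_sizes(224))  # {'C2': 56, 'C3': 28, 'C4': 14, 'C5': 7}
print(pyramid_sizes(200))  # {'C2': 50, 'C3': 25, 'C4': 13, 'C5': 7}
```

For a 200×200 input the sizes are no longer exact halvings (25 → 13), so the nominal strides hold only up to rounding.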


### Backbone stage layout — strides and channel widths

```mermaid
flowchart LR
    img["Image\n3×224×224"]
    stem["Stem\nConv7 s2 + MaxPool s2\n64×56×56"]
    c2["Stage 1\n64×56×56\n→ C2 stride 4"]
    c3["Stage 2\n128×28×28\n→ C3 stride 8"]
    c4["Stage 3\n256×14×14\n→ C4 stride 16"]
    c5["Stage 4\n512×7×7\n→ C5 stride 32"]
    img --> stem --> c2 -->|"stride 2"| c3 -->|"stride 2"| c4 -->|"stride 2"| c5
```

Each stage transition uses a stride-2 first block with a **1×1 projection shortcut** (Option B) to match dimensions.


In [None]:
class TinyResNetBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Stem (like ResNet): stride-2 conv + stride-2 maxpool => output stride 4
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # Stages: produce C2..C5
        self.layer1 = nn.Sequential(BasicBlock(64,  64, stride=1), BasicBlock(64,  64, stride=1))  # C2, stride 4
        self.layer2 = nn.Sequential(BasicBlock(64, 128, stride=2), BasicBlock(128, 128, stride=1)) # C3, stride 8
        self.layer3 = nn.Sequential(BasicBlock(128,256, stride=2), BasicBlock(256, 256, stride=1)) # C4, stride 16
        self.layer4 = nn.Sequential(BasicBlock(256,512, stride=2), BasicBlock(512, 512, stride=1)) # C5, stride 32

    def forward(self, x):
        x = self.stem(x)
        c2 = self.layer1(x)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        return {"C2": c2, "C3": c3, "C4": c4, "C5": c5}

backbone = TinyResNetBackbone()
x = torch.randn(1, 3, 224, 224)
C = backbone(x)
for k in ["C2","C3","C4","C5"]:
    report(k, C[k])


In [None]:
import matplotlib.pyplot as plt
import numpy as np

stages = ['C2\n(stride 4)', 'C3\n(stride 8)', 'C4\n(stride 16)', 'C5\n(stride 32)']
channels_bb = [64, 128, 256, 512]
spatial_bb  = [56, 28, 14, 7]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))
colors = ['#4e79a7', '#f28e2b', '#e15759', '#76b7b2']

bars1 = ax1.bar(stages, channels_bb, color=colors)
ax1.set_ylabel('Channels')
ax1.set_title('Channel width per backbone stage')
for b, v in zip(bars1, channels_bb):
    ax1.text(b.get_x() + b.get_width()/2, b.get_height() + 4, str(v),
             ha='center', fontweight='bold')

bars2 = ax2.bar(stages, spatial_bb, color=colors)
ax2.set_ylabel('Spatial size (H = W, pixels)')
ax2.set_title('Feature map spatial size per backbone stage\n(input 224×224)')
for b, v in zip(bars2, spatial_bb):
    ax2.text(b.get_x() + b.get_width()/2, b.get_height() + 0.4, f'{v}×{v}',
             ha='center', fontweight='bold')

plt.tight_layout()
plt.savefig('backbone_dimensions.png', dpi=120, bbox_inches='tight')
plt.show()


## 5) Preferred FPN module (top-down + lateral, with $d=256$)

Canonical FPN design choices (as in Lin et al.):
- 1×1 lateral conv to unify channels to $d=256$
- top-down upsample by factor 2 (nearest neighbor is typical)
- element-wise addition (requires same $H \times W$ and same $d$)
- 3×3 conv “smoothing” on each merged map
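- optional $P6$: Lin et al. obtain it by stride-2 subsampling of $P5$; the stride-2 3×3 conv on $P5$ used in this notebook is a learned variant common in later detection systems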


### FPN top-down pathway — lateral merges and channel unification

```mermaid
flowchart TB
    C5["C5: 512×7×7"] -->|"Lat 1×1"| M5["M5: 256×7×7"]
    C4["C4: 256×14×14"] -->|"Lat 1×1"| lat4["256×14×14"]
    C3["C3: 128×28×28"] -->|"Lat 1×1"| lat3["256×28×28"]
    C2["C2: 64×56×56"] -->|"Lat 1×1"| lat2["256×56×56"]
    M5 -->|"Up ×2"| up5["256×14×14"]
    up5 -->|"⊕"| M4["M4: 256×14×14"]
    lat4 --> M4
    M4 -->|"Up ×2"| up4["256×28×28"]
    up4 -->|"⊕"| M3["M3: 256×28×28"]
    lat3 --> M3
    M3 -->|"Up ×2"| up3["256×56×56"]
    up3 -->|"⊕"| M2["M2: 256×56×56"]
    lat2 --> M2
    M5 -->|"3×3 Conv"| P5["P5"]
    M4 -->|"3×3 Conv"| P4["P4"]
    M3 -->|"3×3 Conv"| P3["P3"]
    M2 -->|"3×3 Conv"| P2["P2"]
    P5 -->|"3×3 s2"| P6["P6"]
```

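The 1×1 lateral convolutions unify **heterogeneous backbone channels** (64/128/256/512) to a **uniform $d=256$** before the element-wise additions. The additions require strict spatial and channel alignment: the lateral convolutions provide the channel match, and the ×2 upsample provides the spatial match whenever each level is exactly twice the size of the next coarser one (true for the 224×224 input used here).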
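
Whether a fixed ×2 upsample lands exactly on the lateral map's size depends on the backbone's feature-map sizes. The sketch below checks that condition on spatial sizes alone; `x2_merges_align` is an illustrative name, and the second set of sizes is what this notebook's backbone layout would give for a 200×200 input.

```python
def x2_merges_align(sizes: dict) -> dict:
    """For each top-down merge, report whether 2x upsampling the coarser
    level reproduces the spatial size of the finer lateral map."""
    pairs = [("C5", "C4"), ("C4", "C3"), ("C3", "C2")]
    return {f"{top}->{bottom}": 2 * sizes[top] == sizes[bottom] for top, bottom in pairs}

print(x2_merges_align({"C2": 56, "C3": 28, "C4": 14, "C5": 7}))
# {'C5->C4': True, 'C4->C3': True, 'C3->C2': True}
print(x2_merges_align({"C2": 50, "C3": 25, "C4": 13, "C5": 7}))
# {'C5->C4': False, 'C4->C3': False, 'C3->C2': True}
```

When a merge does not align, one common workaround is to resize to the lateral map's spatial size instead of using a fixed factor, for example by passing `size=` rather than `scale_factor=` to `F.interpolate`.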


In [None]:
class FPN(nn.Module):
    def __init__(self, c2: int, c3: int, c4: int, c5: int, d: int = 256, make_p6: bool = True):
        super().__init__()
        # Lateral 1×1 convs: Ck -> d
        self.lat2 = nn.Conv2d(c2, d, kernel_size=1)
        self.lat3 = nn.Conv2d(c3, d, kernel_size=1)
        self.lat4 = nn.Conv2d(c4, d, kernel_size=1)
        self.lat5 = nn.Conv2d(c5, d, kernel_size=1)

        # Smoothing 3×3 convs on each pyramid level
        self.smooth2 = nn.Conv2d(d, d, kernel_size=3, padding=1)
        self.smooth3 = nn.Conv2d(d, d, kernel_size=3, padding=1)
        self.smooth4 = nn.Conv2d(d, d, kernel_size=3, padding=1)
        self.smooth5 = nn.Conv2d(d, d, kernel_size=3, padding=1)

        self.make_p6 = make_p6
        self.p6 = nn.Conv2d(d, d, kernel_size=3, stride=2, padding=1) if make_p6 else None

    def forward(self, C):
        c2, c3, c4, c5 = C["C2"], C["C3"], C["C4"], C["C5"]

        m5 = self.lat5(c5)
        m4 = self.lat4(c4) + F.interpolate(m5, scale_factor=2.0, mode="nearest")
        m3 = self.lat3(c3) + F.interpolate(m4, scale_factor=2.0, mode="nearest")
        m2 = self.lat2(c2) + F.interpolate(m3, scale_factor=2.0, mode="nearest")

        p5 = self.smooth5(m5)
        p4 = self.smooth4(m4)
        p3 = self.smooth3(m3)
        p2 = self.smooth2(m2)

        out = {"P2": p2, "P3": p3, "P4": p4, "P5": p5}
        if self.make_p6:
            out["P6"] = self.p6(p5)
        return out

fpn = FPN(c2=64, c3=128, c4=256, c5=512, d=256, make_p6=True)

P = fpn(C)
for k in ["P2","P3","P4","P5","P6"]:
    report(k, P[k])


### 5.1) Sanity checks: the additions are well-defined

Each merge is of the form:
\[
M_\ell = \text{Lat}(C_\ell) + \text{Upsample}(M_{\ell+1})
\]
so we assert shape equality at each merge point.


In [None]:
with torch.no_grad():
    c2, c3, c4, c5 = C["C2"], C["C3"], C["C4"], C["C5"]

    m5 = fpn.lat5(c5)
    m4_up = F.interpolate(m5, scale_factor=2.0, mode="nearest")
    m4_lat = fpn.lat4(c4)
    assert m4_up.shape == m4_lat.shape, (m4_up.shape, m4_lat.shape)

    m4 = m4_lat + m4_up
    m3_up = F.interpolate(m4, scale_factor=2.0, mode="nearest")
    m3_lat = fpn.lat3(c3)
    assert m3_up.shape == m3_lat.shape, (m3_up.shape, m3_lat.shape)

    m3 = m3_lat + m3_up
    m2_up = F.interpolate(m3, scale_factor=2.0, mode="nearest")
    m2_lat = fpn.lat2(c2)
    assert m2_up.shape == m2_lat.shape, (m2_up.shape, m2_lat.shape)

print("All FPN merge-shape assertions passed.")


In [None]:
labels = ['C2/P2\n(stride 4)', 'C3/P3\n(stride 8)', 'C4/P4\n(stride 16)', 'C5/P5\n(stride 32)']
backbone_ch = [64, 128, 256, 512]
fpn_ch      = [256, 256, 256, 256]

xpos = np.arange(len(labels))  # bar positions; a separate name avoids rebinding the input tensor `x`
w = 0.35

fig, ax = plt.subplots(figsize=(11, 5))
b1 = ax.bar(xpos - w/2, backbone_ch, w, label='Backbone Cₖ (heterogeneous)', color='#4e79a7', alpha=0.85)
b2 = ax.bar(xpos + w/2, fpn_ch,      w, label='FPN output Pₖ (d = 256)',    color='#59a14f', alpha=0.85)

ax.set_ylabel('Number of channels')
ax.set_title('FPN channel unification: heterogeneous backbone → uniform 256-channel pyramid')
ax.set_xticks(xpos)
ax.set_xticklabels(labels)
ax.legend()
ax.set_ylim(0, 600)
for b, v in [(b1, backbone_ch), (b2, fpn_ch)]:
    for bar, val in zip(b, v):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 6,
                str(val), ha='center', fontweight='bold')

plt.tight_layout()
plt.savefig('fpn_channel_unification.png', dpi=120, bbox_inches='tight')
plt.show()


## 6) What “preferred approach for FPN” means (operationally)

In a modern featurizer intended for FPN-style consumption, the pragmatic default is:

1. Backbone (ResNet-style):
   - Identity shortcut if $(C_{in}, H, W)$ matches $(C_{out}, H', W')$
   - 1×1 projection shortcut (with stride=2 when downsampling) otherwise  
   This matches the ResNet paper’s “projection to match dimensions” guidance and the widespread “option B” practice in deep variants.

2. FPN neck:
   - 1×1 lateral convs to unify all $C2..C5$ to $d=256$ channels
   - top-down nearest-neighbor upsample by 2
   - elementwise addition
   - 3×3 smoothing conv
   - optional $P6$ from $P5$ via stride-2 3×3 conv

The key theme is the same in both ResNet and FPN: **addition enforces strict shape equality**, so dimensioning is not a detail—it is the design constraint.
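
The two rules above can be summarized as a small shape planner. The sketch below is illustrative only: `shortcut_kind`, `plan_featurizer`, and the `(level, cin, cout, stride)` tuple layout are hypothetical names chosen for this example, and the code reasons about shapes without building any layers.

```python
def shortcut_kind(cin: int, cout: int, stride: int) -> str:
    """Identity when input and output shapes match, otherwise a 1x1 projection."""
    if cin == cout and stride == 1:
        return "identity"
    return f"1x1 projection, stride {stride}"

def plan_featurizer(blocks, d: int = 256):
    """blocks: list of (level, cin, cout, stride), one entry per residual block."""
    shortcuts = [(level, shortcut_kind(cin, cout, stride)) for level, cin, cout, stride in blocks]
    # The last block listed for each level fixes that level's channel width.
    level_channels = {level: cout for level, _, cout, _ in blocks}
    # Each lateral 1x1 conv maps a level's channel width to the shared width d.
    laterals = {level: (c, d) for level, c in level_channels.items()}
    return shortcuts, laterals

blocks = [
    ("C2", 64, 64, 1), ("C2", 64, 64, 1),
    ("C3", 64, 128, 2), ("C3", 128, 128, 1),
    ("C4", 128, 256, 2), ("C4", 256, 256, 1),
    ("C5", 256, 512, 2), ("C5", 512, 512, 1),
]
shortcuts, laterals = plan_featurizer(blocks)
for level, kind in shortcuts:
    print(level, kind)
print(laterals)  # {'C2': (64, 256), 'C3': (128, 256), 'C4': (256, 256), 'C5': (512, 256)}
```

Only the first block of C3, C4, and C5 needs a projection; every other block keeps the identity shortcut.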


## References (primary sources)

- Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. *Deep Residual Learning for Image Recognition*. CVPR 2016. arXiv:1512.03385.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. *Identity Mappings in Deep Residual Networks*. ECCV 2016. arXiv:1603.05027.
- Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie. *Feature Pyramid Networks for Object Detection*. CVPR 2017. arXiv:1612.03144.
