# Padding and Stride
:label:`sec_padding`

Recall the example of a convolution in :numref:`fig_correlation`. 
The input had both a height and width of 3
and the convolution kernel had both a height and width of 2,
yielding an output representation with dimension $2\times2$.
Assuming that the input shape is $n_\textrm{h}\times n_\textrm{w}$
and the convolution kernel shape is $k_\textrm{h}\times k_\textrm{w}$,
the output shape will be $(n_\textrm{h}-k_\textrm{h}+1) \times (n_\textrm{w}-k_\textrm{w}+1)$: 
we can only shift the convolution kernel so far until it runs out
of pixels to apply the convolution to. 

In the following we will explore a number of techniques, 
including padding and strided convolutions,
that offer more control over the size of the output. 
As motivation, note that since kernels generally
have width and height greater than $1$,
after applying many successive convolutions,
we tend to wind up with outputs that are
considerably smaller than our input.
If we start with a $240 \times 240$ pixel image,
ten layers of $5 \times 5$ convolutions
reduce the image to $200 \times 200$ pixels,
slicing off $30 \%$ of the image and with it
obliterating any interesting information
on the boundaries of the original image.
*Padding* is the most popular tool for handling this issue.
In other cases, we may want to reduce the dimensionality drastically,
e.g., if we find the original input resolution to be unwieldy.
*Strided convolutions* are a popular technique that can help in these instances.
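
To make the shrinkage concrete, the following minimal sketch applies the rule $n - k + 1$ once per layer for the ten-layer example above. The helper name `shrink` is our own and is introduced only for illustration.

```python
def shrink(n, k, num_layers):
    """Side length after `num_layers` unpadded, stride-1 convolutions with a k x k kernel."""
    for _ in range(num_layers):
        n = n - k + 1
    return n

side = shrink(240, 5, 10)
lost = 1 - side**2 / 240**2  # fraction of pixels no longer represented
print(side, f"{lost:.1%}")   # 200 30.6%
```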

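## Padding and Stride: Overview

Recall the convolution example in :numref:`fig_correlation`. The input has a height and width of 3, the kernel has a height and width of 2, and the output is $2\times2$. In general, for an input of shape $n_\textrm{h}\times n_\textrm{w}$ and a kernel of shape $k_\textrm{h}\times k_\textrm{w}$, the output shape is $(n_\textrm{h}-k_\textrm{h}+1) \times (n_\textrm{w}-k_\textrm{w}+1)$: the kernel can only slide over positions where it fits entirely inside the input.

Below we look at padding and stride, two techniques that give finer control over the output size.

### Padding
When many convolutional layers are applied in succession, the shrinking output discards information near the boundary. For example, a $240 \times 240$ image passed through 10 layers of $5 \times 5$ convolutions shrinks to $200 \times 200$, losing about 30% of its pixels. Padding is the most common remedy.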

```mermaid
graph LR
    A[Input] --> B[Padding]
    B --> C[Convolution]
    C --> D[Output with preserved size]
    
    style A fill:#e6f3ff,stroke:#333
    style B fill:#ffe6e6,stroke:#333
    style C fill:#e6ffe6,stroke:#333
    style D fill:#f0f0f0,stroke:#333
```

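In formulas: padding an input of shape $n_\textrm{h}\times n_\textrm{w}$ with a total of $p_\textrm{h}$ rows and $p_\textrm{w}$ columns changes the output shape to
$$(n_\textrm{h} - k_\textrm{h} + p_\textrm{h} + 1) \times (n_\textrm{w} - k_\textrm{w} + p_\textrm{w} + 1)$$

Common padding modes:
- Zero padding: fill the border with zeros.
- Mirror (reflection) padding: fill the border with values mirrored across the edge.
- Circular (periodic) padding: wrap around and repeat values from the opposite side of the input.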
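### Stride
When we want to reduce the resolution substantially, the stride controls how far the kernel moves between positions. The strides $s_\textrm{h}$ and $s_\textrm{w}$ are the step sizes along the height and width, respectively.

The output shape is
$$\left\lfloor(n_\textrm{h} - k_\textrm{h} + p_\textrm{h} + s_\textrm{h})/s_\textrm{h}\right\rfloor \times \left\lfloor(n_\textrm{w} - k_\textrm{w} + p_\textrm{w} + s_\textrm{w})/s_\textrm{w}\right\rfloor$$

Comparison, assuming the padding is chosen so that stride 1 preserves the input size. The reduction column counts output positions, to which the cost of the layer is proportional:

| Setting | Input size | Output size | Reduction in computation |
|---------|---------|---------|----------|
| stride=1 | 224x224 | 224x224 | baseline |
| stride=2 | 224x224 | 112x112 | 75%      |
| stride=4 | 224x224 | 56x56   | 93.75%   |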
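
The percentages in the table can be reproduced with a few lines of arithmetic. This is only a sketch: the helper `output_side` is a name of our own, and it assumes the padding is chosen so that stride 1 preserves the size and that the input side is divisible by the stride.

```python
def output_side(n, stride):
    """Output side length, assuming size-preserving padding and n divisible by stride."""
    return n // stride

n = 224
baseline = output_side(n, 1) ** 2  # number of output positions at stride 1
reductions = {}
for stride in (1, 2, 4):
    positions = output_side(n, stride) ** 2
    reductions[stride] = 1 - positions / baseline
    print(stride, output_side(n, stride), f"{reductions[stride]:.2%}")
```

Each output position costs the same number of multiply-adds, so the reduction in positions is also the reduction in the work done by the layer.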

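```python
# PyTorch example
from torch import nn

conv = nn.Conv2d(in_channels=3, out_channels=64,
                 kernel_size=3, stride=2, padding=1)
# With padding=1, the 3x3 kernel at stride 2 halves an even-sized input exactly
# while border pixels still fall inside at least one window
```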

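### Combining padding and stride
A typical configuration in modern CNN architectures:
- Early layers use small strides (1 or 2) to preserve detail.
- Deeper layers use a stride of 2 to downsample while building more abstract features.
- Pooling is combined with strided convolutions to build hierarchical features.

This design balances:
1. Preserving feature resolution
2. Growing the receptive field
3. Computational efficiency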

```python
import torch
from torch import nn
```

## Padding

As described above, one tricky issue when applying convolutional layers
is that we tend to lose pixels on the perimeter of our image. Consider :numref:`img_conv_reuse` that depicts the pixel utilization as a function of the convolution kernel size and the position within the image. The pixels in the corners are hardly used at all. 

![Pixel utilization for convolutions of size $1 \times 1$, $2 \times 2$, and $3 \times 3$ respectively.](../img/conv-reuse.svg)
:label:`img_conv_reuse`

Since we typically use small kernels,
for any given convolution
we might only lose a few pixels
but this can add up as we apply
many successive convolutional layers.
One straightforward solution to this problem
is to add extra pixels of filler around the boundary of our input image,
thus increasing the effective size of the image.
Typically, we set the values of the extra pixels to zero.
In :numref:`img_conv_pad`, we pad a $3 \times 3$ input,
increasing its size to $5 \times 5$.
The corresponding output then increases to a $4 \times 4$ matrix.
The shaded portions are the first output element as well as the input and kernel tensor elements used for the output computation: $0\times0+0\times1+0\times2+0\times3=0$.

![Two-dimensional cross-correlation with padding.](../img/conv-pad.svg)
:label:`img_conv_pad`
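
The padded example can be checked numerically. The sketch below assumes the input holds the values 0 to 8 and the kernel the values 0 to 3, as in the earlier cross-correlation example; `torch.nn.functional.conv2d` computes a cross-correlation, and `padding=1` adds one row or column of zeros on every side.

```python
import torch
import torch.nn.functional as F

# Assumed values: input 0..8 and kernel 0..3, as in the earlier example
X = torch.arange(9.0).reshape(1, 1, 3, 3)  # (batch, channel, height, width)
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]]).reshape(1, 1, 2, 2)

# One row/column of zeros per side, i.e. p_h = p_w = 2 in total
Y = F.conv2d(X, K, padding=1)
print(Y.shape)  # torch.Size([1, 1, 4, 4])
print(Y[0, 0])  # the top-left element is the shaded output, 0
```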

In general, if we add a total of $p_\textrm{h}$ rows of padding
(roughly half on top and half on bottom)
and a total of $p_\textrm{w}$ columns of padding
(roughly half on the left and half on the right),
the output shape will be

$$(n_\textrm{h}-k_\textrm{h}+p_\textrm{h}+1)\times(n_\textrm{w}-k_\textrm{w}+p_\textrm{w}+1).$$

This means that the height and width of the output
will increase by $p_\textrm{h}$ and $p_\textrm{w}$, respectively.

In many cases, we will want to set $p_\textrm{h}=k_\textrm{h}-1$ and $p_\textrm{w}=k_\textrm{w}-1$
to give the input and output the same height and width.
This will make it easier to predict the output shape of each layer
when constructing the network.
Assuming that $k_\textrm{h}$ is odd here,
we will pad $p_\textrm{h}/2$ rows on both sides of the height.
If $k_\textrm{h}$ is even, one possibility is to
pad $\lceil p_\textrm{h}/2\rceil$ rows on the top of the input
and $\lfloor p_\textrm{h}/2\rfloor$ rows on the bottom.
We will pad both sides of the width in the same way.
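
As a small illustration of this splitting rule, the sketch below pads an input by hand for a kernel with even height and checks that the shape is preserved. The helper `split_padding` and the kernel shape $4 \times 3$ are our own choices for illustration.

```python
import math
import torch
import torch.nn.functional as F

def split_padding(k):
    """Split the total padding p = k - 1 into (ceil(p / 2), floor(p / 2))."""
    p = k - 1
    return math.ceil(p / 2), math.floor(p / 2)

k_h, k_w = 4, 3                   # even height, odd width (illustrative values)
top, bottom = split_padding(k_h)  # (2, 1): asymmetric
left, right = split_padding(k_w)  # (1, 1): symmetric

X = torch.rand(1, 1, 8, 8)
# F.pad lists the amounts for the last two dimensions as (left, right, top, bottom)
X_padded = F.pad(X, (left, right, top, bottom))
K = torch.rand(1, 1, k_h, k_w)
Y = F.conv2d(X_padded, K)
print(Y.shape)  # torch.Size([1, 1, 8, 8])
```

Padding manually with `F.pad` is used here because the `padding` argument of `nn.Conv2d` applies the same amount to both sides of a dimension.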

CNNs commonly use convolution kernels
with odd height and width values, such as 1, 3, 5, or 7.
Choosing odd kernel sizes has the benefit
that we can preserve the dimensionality
while padding with the same number of rows on top and bottom,
and the same number of columns on left and right.

Moreover, this practice of using odd kernels
and padding to precisely preserve dimensionality
offers a clerical benefit.
For any two-dimensional tensor `X`,
when the kernel's size is odd
and the number of padding rows and columns
on all sides are the same,
thereby producing an output with the same height and width as the input,
we know that the output `Y[i, j]` is calculated
by cross-correlation of the input and convolution kernel
with the window centered on `X[i, j]`.
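
This centering property is easy to verify. The sketch below uses an all-ones $3 \times 3$ kernel (our choice, so that each output is simply a window sum) and compares one output element with the sum over the window centered on the same input position.

```python
import torch
import torch.nn.functional as F

X = torch.arange(64.0).reshape(8, 8)
K = torch.ones(3, 3)  # odd kernel; all ones makes the check easy to read

# One row/column of zero padding on every side keeps the output at 8 x 8
Y = F.conv2d(X.reshape(1, 1, 8, 8), K.reshape(1, 1, 3, 3), padding=1)[0, 0]

i, j = 3, 4                           # an interior position (arbitrary choice)
window = X[i - 1:i + 2, j - 1:j + 2]  # 3 x 3 window centered on X[i, j]
match = torch.isclose(Y[i, j], (window * K).sum()).item()
print(match)  # True
```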

In the following example, we create a two-dimensional convolutional layer
with a height and width of 3
and (**apply 1 pixel of padding on all sides.**)
Given an input with a height and width of 8,
we find that the height and width of the output are also 8.


## Padding: Recap

As noted earlier, applying convolutional layers tends to lose pixels along the image border. The visualization of pixel utilization for different kernel sizes in :numref:`img_conv_reuse` shows that corner pixels are used least.

### How padding works
To counter the loss of boundary information, we add filler elements (typically zeros) around the input image. As shown in :numref:`img_conv_pad`, padding a $3 \times 3$ input to $5 \times 5$ increases the output to $4 \times 4$. The shaded portions show the computation of the first output element: $0\times0+0\times1+0\times2+0\times3=0$.

When a total of $p_\textrm{h}$ rows and $p_\textrm{w}$ columns of padding are added, the output shape becomes
$$(n_\textrm{h}-k_\textrm{h}+p_\textrm{h}+1)\times(n_\textrm{w}-k_\textrm{w}+p_\textrm{w}+1)$$
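### Padding strategy
Common choices, with total padding $p = k - 1$:
```mermaid
graph TD
    A["Input size n×n"] --> B{"Kernel size parity"}
    B -->|"odd k"| C["Symmetric padding: (k-1)/2 on each side"]
    B -->|"even k"| D["Asymmetric padding: ⌈p/2⌉ on one side, ⌊p/2⌋ on the other"]
    C --> E["Output stays n×n"]
    D --> E
```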

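Typical configuration:
- Set $p_\textrm{h} = k_\textrm{h} - 1$ (total padding along the height). When $k_\textrm{h}$ is odd, this splits evenly into $(k_\textrm{h}-1)/2$ rows on top and on bottom.
- Set $p_\textrm{w} = k_\textrm{w} - 1$ (total padding along the width). When $k_\textrm{w}$ is odd, this splits evenly into $(k_\textrm{w}-1)/2$ columns on the left and on the right.

### Advantages of odd kernels
Why CNNs commonly use odd kernel sizes (1, 3, 5, 7):
1. Symmetric padding: the padding can be split evenly between opposite sides.
2. Centering: the output element `Y[i, j]` corresponds to the window centered on the input element `X[i, j]`.
3. Size preservation: with the same padding on every side, input and output have the same shape, which simplifies network design.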

### Code example
```python
import torch
from torch import nn

# A convolutional layer with padding: 3x3 kernel, 1 pixel of padding per side
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)

# Check that the size is preserved
X = torch.rand(size=(8, 8)).reshape(1, 1, 8, 8)
Y = conv(X)
print(Y.shape)  # torch.Size([1, 1, 8, 8])
```

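### Practical guidelines
These are rules of thumb rather than firm requirements:
1. For shallow networks, a 3×3 kernel with 1 pixel of padding is a sensible default.
2. Deeper networks may alternate 5×5 and 3×3 kernels.
3. With 7×7 kernels, watch the computational cost, which grows with the kernel area.
4. Reserve even kernel sizes for special needs (e.g., when an asymmetric receptive field is required).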

```python
# We define a helper function to calculate convolutions. It initializes the
# convolutional layer weights and performs corresponding dimensionality
# elevations and reductions on the input and output
def comp_conv2d(conv2d, X):
    # (1, 1) indicates that batch size and the number of channels are both 1
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    # Strip the first two dimensions: examples and channels
    return Y.reshape(Y.shape[2:])

# 1 row and column is padded on either side, so a total of 2 rows or columns
# are added
conv2d = nn.LazyConv2d(1, kernel_size=3, padding=1)
X = torch.rand(size=(8, 8))
comp_conv2d(conv2d, X).shape
```

```
torch.Size([8, 8])
```

When the height and width of the convolution kernel are different,
we can make the output and input have the same height and width
by [**setting different padding numbers for height and width.**]


```python
# We use a convolution kernel with height 5 and width 3. The padding on either
# side of the height and width are 2 and 1, respectively
conv2d = nn.LazyConv2d(1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape
```

```
torch.Size([8, 8])
```

## Stride

When computing the cross-correlation,
we start with the convolution window
at the upper-left corner of the input tensor,
and then slide it over all locations both down and to the right.
In the previous examples, we defaulted to sliding one element at a time.
However, sometimes, either for computational efficiency
or because we wish to downsample,
we move our window more than one element at a time,
skipping the intermediate locations. This is particularly useful if the convolution 
kernel is large since it captures a large area of the underlying image.

We refer to the number of rows and columns traversed per slide as *stride*.
So far, we have used strides of 1, both for height and width.
Sometimes, we may want to use a larger stride.
:numref:`img_conv_stride` shows a two-dimensional cross-correlation operation
with a stride of 3 vertically and 2 horizontally.
The shaded portions are the output elements as well as the input and kernel tensor elements used for the output computation: $0\times0+0\times1+1\times2+2\times3=8$, $0\times0+6\times1+0\times2+0\times3=6$.
We can see that when the second element of the first column is generated,
the convolution window slides down three rows.
The convolution window slides two columns to the right
when the second element of the first row is generated.
When the convolution window continues to slide two columns to the right on the input,
there is no output because the input element cannot fill the window
(unless we add another column of padding).

![Cross-correlation with strides of 3 and 2 for height and width, respectively.](../img/conv-stride.svg)
:label:`img_conv_stride`
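
The strided example can be reproduced with `torch.nn.functional.conv2d`. As before, the sketch assumes the input holds the values 0 to 8 and the kernel the values 0 to 3, which matches the two computations quoted above, and it uses one row or column of zero padding on every side.

```python
import torch
import torch.nn.functional as F

# Assumed values: input 0..8 and kernel 0..3
X = torch.arange(9.0).reshape(1, 1, 3, 3)
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]]).reshape(1, 1, 2, 2)

# stride=(3, 2): move 3 rows down and 2 columns right per step;
# padding=1 adds one row/column of zeros on every side
Y = F.conv2d(X, K, stride=(3, 2), padding=1)
print(Y[0, 0])  # 2 x 2 output containing the values 8 and 6 derived above
```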

In general, when the stride for the height is $s_\textrm{h}$
and the stride for the width is $s_\textrm{w}$, the output shape is

$$\lfloor(n_\textrm{h}-k_\textrm{h}+p_\textrm{h}+s_\textrm{h})/s_\textrm{h}\rfloor \times \lfloor(n_\textrm{w}-k_\textrm{w}+p_\textrm{w}+s_\textrm{w})/s_\textrm{w}\rfloor.$$

If we set $p_\textrm{h}=k_\textrm{h}-1$ and $p_\textrm{w}=k_\textrm{w}-1$,
then the output shape can be simplified to
$\lfloor(n_\textrm{h}+s_\textrm{h}-1)/s_\textrm{h}\rfloor \times \lfloor(n_\textrm{w}+s_\textrm{w}-1)/s_\textrm{w}\rfloor$.
Going a step further, if the input height and width
are divisible by the strides on the height and width,
then the output shape will be $(n_\textrm{h}/s_\textrm{h}) \times (n_\textrm{w}/s_\textrm{w})$.
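
The general formula translates directly into code. The sketch below (the function name `conv_output_size` is our own) evaluates it for the layer configurations used in this section's code cells, all on an $8 \times 8$ input. Note that `p` is the total padding along a dimension, whereas PyTorch's `padding` argument is given per side.

```python
def conv_output_size(n, k, p, s):
    """floor((n - k + p + s) / s), where p is the TOTAL padding along this dimension."""
    return (n - k + p + s) // s

# kernel 3, padding 1 per side (total 2), stride 1
same = (conv_output_size(8, 3, 2, 1), conv_output_size(8, 3, 2, 1))
# kernel 3, padding 1 per side (total 2), stride 2
halved = (conv_output_size(8, 3, 2, 2), conv_output_size(8, 3, 2, 2))
# kernel (3, 5), padding (0, 1) per side (totals 0 and 2), stride (3, 4)
mixed = (conv_output_size(8, 3, 0, 3), conv_output_size(8, 5, 2, 4))
print(same, halved, mixed)  # (8, 8) (4, 4) (2, 2)
```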

Below, we [**set the strides on both the height and width to 2**],
thus halving the input height and width.


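## Stride: Recap

When computing the cross-correlation, we start with the convolution window at the upper-left corner of the input tensor. To control the downsampling rate of the feature map or to improve computational efficiency, we can adjust the stride so that the window skips positions as it slides.

### Key concepts
- **Stride**: the number of rows and columns the window moves per step, called the vertical stride $s_\textrm{h}$ and the horizontal stride $s_\textrm{w}$.
- **Default**: $s_\textrm{h}=1$, $s_\textrm{w}=1$.
- **Downsampling**: doubling the stride roughly halves each spatial dimension of the output.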

### Worked example with stride (3, 2)
```mermaid
graph TD
    A[Input tensor] --> B[First window position]
    B --> C[Output 8]
    A --> D[Move down 3 rows]
    D --> E[Second window position]
    E --> F[Output 6]
    A --> G[Move right 2 columns]
    G --> H[Stop at the boundary]
    
    style A fill:#e6f3ff,stroke:#333
    style B fill:#ffe6e6,stroke:#333
    style C fill:#e6ffe6,stroke:#333
    style D fill:#ffebcc,stroke:#333
```

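### Output shape formula
With total padding $p_\textrm{h}$ and $p_\textrm{w}$, the output shape is
$$
\left\lfloor \frac{n_\textrm{h} - k_\textrm{h} + p_\textrm{h} + s_\textrm{h}}{s_\textrm{h}} \right\rfloor \times
\left\lfloor \frac{n_\textrm{w} - k_\textrm{w} + p_\textrm{w} + s_\textrm{w}}{s_\textrm{w}} \right\rfloor
$$

### Special case
With size-preserving padding ($p_\textrm{h}=k_\textrm{h}-1$, $p_\textrm{w}=k_\textrm{w}-1$), the output shape simplifies to
$$
\left\lfloor \frac{n_\textrm{h} + s_\textrm{h} - 1}{s_\textrm{h}} \right\rfloor \times
\left\lfloor \frac{n_\textrm{w} + s_\textrm{w} - 1}{s_\textrm{w}} \right\rfloor
$$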

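### Best practice
When the input size is divisible by the stride (and $p_\textrm{h}=k_\textrm{h}-1$, $p_\textrm{w}=k_\textrm{w}-1$), the output size is simply the input size divided by the stride:
```python
# Input 256x256, kernel 3x3 with 1 pixel of padding per side, stride 2x2
output_shape = (256 // 2, 256 // 2)  # (128, 128)
```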

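### Example configurations
The output sizes assume size-preserving padding, i.e., $(k-1)/2$ per side:

| Input size | Kernel | Padding (per side) | Stride | Output size |
|---------|--------|------|------|---------|
| 224x224 | 3x3    | 1    | 2x2  | 112x112 |
| 112x112 | 5x5    | 2    | 3x3  | 38x38   |
| 512x512 | 7x7    | 3    | 4x4  | 128x128 |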


```python
conv2d = nn.LazyConv2d(1, kernel_size=3, padding=1, stride=2)
comp_conv2d(conv2d, X).shape
```

```
torch.Size([4, 4])
```

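For the layer `nn.LazyConv2d(1, kernel_size=3, padding=1, stride=2)`, the output size is computed as follows.

### Formula
$$
n_\textrm{out} = \left\lfloor \frac{n_\textrm{in} + 2p - k}{s} \right\rfloor + 1
$$
where:
- $n_\textrm{in}$: input size (height or width)
- $p=1$: padding per side
- $k=3$: kernel size
- $s=2$: stride

This is the general formula $\lfloor(n_\textrm{h}-k_\textrm{h}+p_\textrm{h}+s_\textrm{h})/s_\textrm{h}\rfloor$ with total padding $p_\textrm{h} = 2p$.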

### Worked example
Suppose the input feature map is 5×5:
```python
import torch
import torch.nn as nn

conv = nn.LazyConv2d(1, kernel_size=3, padding=1, stride=2)
x = torch.randn(1, 3, 5, 5)  # (batch, channels, height, width)
output = conv(x)
print(output.shape)  # torch.Size([1, 1, 3, 3])
```

Derivation:
$$
n_\textrm{out} = \left\lfloor \frac{5 + 2\times 1 - 3}{2} \right\rfloor + 1 = \left\lfloor \frac{4}{2} \right\rfloor + 1 = 3
$$

### Size reference table
| Input size | Output size | Calculation |
|---------|---------|---------|
| 5×5     | 3×3     | ⌊(5+2-3)/2⌋ = ⌊2⌋ = 2, then 2+1 = 3 |
| 7×7     | 4×4     | ⌊(7+2-3)/2⌋ = ⌊3⌋ = 3, then 3+1 = 4 |
| 6×6     | 3×3     | ⌊(6+2-3)/2⌋ = ⌊2.5⌋ = 2, then 2+1 = 3 |
| 4×4     | 2×2     | ⌊(4+2-3)/2⌋ = ⌊1.5⌋ = 1, then 1+1 = 2 |

### Checking an even input size
When the input size is even, the floor in the formula takes effect:
```python
x = torch.randn(1, 3, 6, 6)
output = conv(x)
print(output.shape)  # torch.Size([1, 1, 3, 3])
# Calculation: (6+2-3)/2 = 2.5, floor(2.5) = 2, then 2+1 = 3
```

### Visualizing the computation
```mermaid
graph LR
    A[Input 5x5] --> B[Padded to 7x7]
    B --> C[3x3 kernel]
    C --> D[Slide with stride 2]
    D --> E[Output 3x3]
    
    style A fill:#e6f3ff,stroke:#333
    style B fill:#ffe6e6,stroke:#333
    style C fill:#e6ffe6,stroke:#333
```

Let's look at (**a slightly more complicated example**).

```python
conv2d = nn.LazyConv2d(1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))
comp_conv2d(conv2d, X).shape
```

```
torch.Size([2, 2])
```

## Summary and Discussion

Padding can increase the height and width of the output. This is often used to give the output the same height and width as the input to avoid undesirable shrinkage of the output. Moreover, it lets pixels near the border contribute to more outputs, so that pixels are used more evenly. Typically we pick symmetric padding on both sides of the input height and width. In this case we refer to $(p_\textrm{h}, p_\textrm{w})$ padding. Most commonly we set $p_\textrm{h} = p_\textrm{w}$, in which case we simply state that we choose padding $p$. 

A similar convention applies to strides. When the vertical stride $s_\textrm{h}$ and the horizontal stride $s_\textrm{w}$ match, we simply talk about stride $s$. The stride can reduce the resolution of the output, for example reducing the height and width of the output to only $1/n$ of the height and width of the input for $n > 1$. By default, the padding is 0 and the stride is 1. 

So far all padding that we discussed simply extended images with zeros. This has significant computational benefit since it is trivial to accomplish. Moreover, operators can be engineered to take advantage of this padding implicitly without the need to allocate additional memory. At the same time, it allows CNNs to encode implicit position information within an image, simply by learning where the "whitespace" is. There are many alternatives to zero-padding. :citet:`Alsallakh.Kokhlikyan.Miglani.ea.2020` provided an extensive overview of those (albeit without a clear case for when to use nonzero paddings unless artifacts occur). 
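
For a feel of what some of these alternatives look like, the sketch below pads a small tensor with the modes offered by `torch.nn.functional.pad`. The $3 \times 3$ input is an arbitrary choice for illustration; `nn.Conv2d` exposes the same options through its `padding_mode` argument.

```python
import torch
import torch.nn.functional as F

X = torch.arange(9.0).reshape(1, 1, 3, 3)
pad = (1, 1, 1, 1)  # (left, right, top, bottom)

padded = {
    "zeros": F.pad(X, pad, mode="constant", value=0.0),
    "reflect": F.pad(X, pad, mode="reflect"),      # mirror without repeating the edge
    "replicate": F.pad(X, pad, mode="replicate"),  # repeat the edge value
    "circular": F.pad(X, pad, mode="circular"),    # wrap around to the opposite side
}
for name, Xp in padded.items():
    print(name, Xp[0, 0, 0].tolist())  # first row of each padded 5 x 5 result
```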


## Exercises

1. Given the final code example in this section with kernel size $(3, 5)$, padding $(0, 1)$, and stride $(3, 4)$, 
   calculate the output shape to check if it is consistent with the experimental result.
1. For audio signals, what does a stride of 2 correspond to?
1. Implement mirror padding, i.e., padding where the border values are simply mirrored to extend tensors. 
1. What are the computational benefits of a stride larger than 1?
1. What might be statistical benefits of a stride larger than 1?
1. How would you implement a stride of $\frac{1}{2}$? What does it correspond to? When would this be useful?


[Discussions](https://discuss.d2l.ai/t/68)


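### Exercise Solutions

1. **Verifying the output shape**  
The input `X` in the final example is $8 \times 8$:
$$
\begin{aligned}
h_\textrm{out} &= \lfloor (8 + 2\times 0 - 3)/3 \rfloor + 1 = \lfloor 5/3 \rfloor + 1 = 2 \\
w_\textrm{out} &= \lfloor (8 + 2\times 1 - 5)/4 \rfloor + 1 = \lfloor 5/4 \rfloor + 1 = 2
\end{aligned}
$$
Check in code:
```python
import torch
from torch import nn

conv = nn.Conv2d(1, 1, kernel_size=(3, 5), stride=(3, 4), padding=(0, 1))
x = torch.randn(1, 1, 8, 8)
print(conv(x).shape)  # torch.Size([1, 1, 2, 2])
```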

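2. **Meaning of stride for audio signals**  
For audio signals:
- A stride of 2 along the time axis halves the temporal resolution.
- The output has an effective sampling rate of half the original.
- Typical use: downsampling in the time domain.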
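
A one-dimensional convolution makes this concrete. In the sketch below, the 16 kHz sampling rate and the random waveform are assumptions chosen only for illustration.

```python
import torch
from torch import nn

sample_rate = 16000                    # assumed rate, for illustration only
wave = torch.randn(1, 1, sample_rate)  # (batch, channels, time): one second of noise

# kernel 3 with 1 sample of padding per side; stride 2 keeps every other position
conv = nn.Conv1d(1, 1, kernel_size=3, stride=2, padding=1)
out = conv(wave)
print(out.shape)  # torch.Size([1, 1, 8000])
```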

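3. **Implementing mirror padding**  
In PyTorch:
```python
import torch
from torch import nn

# Functional form: a thin wrapper around the built-in reflection mode
def mirror_pad(x, padding):
    return torch.nn.functional.pad(x, padding, mode='reflect')

# Inside a convolutional layer
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1, padding_mode='reflect')
```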

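4. **Computational benefits**  
Stride 1 compared with stride 2:

| Factor | Stride 1 | Stride 2 | Reduction |
|-----|---------|------|---------|
| Multiply-accumulate operations | O(n²k²) | O(n²k²/4) | 75% |
| Activation memory | H×W×C | (H/2)×(W/2)×C | 75% |
| Data passed between layers | high | low | - |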

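5. **Statistical benefits**  
- Can increase tolerance to small translations
- May reduce the risk of overfitting, since the receptive field grows faster
- Feature responses can become more robust (an effect similar to max-pooling)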

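6. **Implementing a fractional stride**  
One way to implement it:
```python
from torch import nn

# A stride of 1/2 corresponds to upsampling; it can be realized with a
# transposed convolution of stride 2
conv = nn.ConvTranspose2d(3, 3, kernel_size=3, stride=2, padding=1)
```
Applications:
- Image super-resolution
- Upsampling in semantic segmentation
- Generating high-resolution images with GANs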
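
To see the upsampling effect, the sketch below applies such a transposed convolution to an $8 \times 8$ input (an arbitrary size chosen for illustration). With these settings the output side is $2n - 1$; adding `output_padding=1` gives exactly $2n$.

```python
import torch
from torch import nn

x = torch.randn(1, 3, 8, 8)

up = nn.ConvTranspose2d(3, 3, kernel_size=3, stride=2, padding=1)
up_exact = nn.ConvTranspose2d(3, 3, kernel_size=3, stride=2, padding=1,
                              output_padding=1)

print(up(x).shape)        # torch.Size([1, 3, 15, 15])
print(up_exact(x).shape)  # torch.Size([1, 3, 16, 16])
```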