# Pooling
:label:`sec_pooling`

In many cases our ultimate task asks some global question about the image,
e.g., *does it contain a cat?* Consequently, the units of our final layer 
should be sensitive to the entire input.
By gradually aggregating information, yielding coarser and coarser maps,
we accomplish this goal of ultimately learning a global representation,
while keeping all of the advantages of convolutional layers at the intermediate layers of processing.
The deeper we go in the network,
the larger the receptive field (relative to the input)
to which each hidden node is sensitive. Reducing spatial resolution 
accelerates this process, 
since the convolution kernels cover a larger effective area. 
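
To make the effect of downsampling on the receptive field concrete, the following sketch tracks the receptive field size along one spatial dimension for a stack of layers. The helper `receptive_field` and the two example layer stacks are illustrative assumptions, not part of the text. It assumes no dilation and uses the standard recurrence in which a layer with kernel size $k$ enlarges the receptive field by $(k-1)$ times the product of all preceding strides.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a single output unit.

    `layers` is a list of (kernel_size, stride) pairs along one spatial
    dimension, ordered from input to output. Hypothetical helper.
    """
    r, jump = 1, 1  # receptive field size; spacing of adjacent units in pixels
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

# Three 3-wide convolutions, with and without 2-wide, stride-2 pooling between
convs_only = [(3, 1), (3, 1), (3, 1)]
with_pooling = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]

print(receptive_field(convs_only))    # 7
print(receptive_field(with_pooling))  # 18
```

With the same three convolutions, interleaving two downsampling steps more than doubles the receptive field in this example.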

Moreover, when detecting lower-level features, such as edges
(as discussed in :numref:`sec_conv_layer`),
we often want our representations to be somewhat invariant to translation.
For instance, if we take the image `X`
with a sharp delineation between black and white
and shift the whole image by one pixel to the left,
i.e., `Z[i, j] = X[i, j + 1]`,
then the output for the new image `Z` might be vastly different.
The edge will have shifted by one pixel.
In reality, objects hardly ever occur exactly at the same place.
In fact, even with a tripod and a stationary object,
vibration of the camera due to the movement of the shutter
might shift everything by a pixel or so
(high-end cameras are loaded with special features to address this problem).
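
As a small illustration of this sensitivity, the sketch below builds two images that differ only in the horizontal position of a black/white boundary and applies a simple horizontal difference (equivalent to cross-correlation with the kernel `[1, -1]`). The image size and the helper name `edge_response` are assumptions made for this example.

```python
import torch

def edge_response(img):
    # Difference of horizontally adjacent pixels, i.e., cross-correlation
    # with the 1x2 kernel [1, -1]
    return img[:, :-1] - img[:, 1:]

# White (1) on the left, black (0) on the right
X = torch.ones((4, 6))
X[:, 3:] = 0

# The same scene with the boundary displaced by one pixel
Z = torch.ones((4, 6))
Z[:, 4:] = 0

print(edge_response(X)[0])  # tensor([0., 0., 1., 0., 0.])
print(edge_response(Z)[0])  # tensor([0., 0., 0., 1., 0.])
print(torch.equal(edge_response(X), edge_response(Z)))  # False
```

The detector fires in both cases, but at a different location, so any later layer that reads a fixed position sees a different value.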

This section introduces *pooling layers*,
which serve the dual purposes of
mitigating the sensitivity of convolutional layers to location
and of spatially downsampling representations.

In [2]:
import torch
from torch import nn
from d2l import torch as d2l

## Maximum Pooling and Average Pooling

Like convolutional layers, *pooling* operators
consist of a fixed-shape window that is slid over
all regions in the input according to its stride,
computing a single output for each location traversed
by the fixed-shape window (sometimes known as the *pooling window*).
However, unlike the cross-correlation computation
of the inputs and kernels in the convolutional layer,
the pooling layer contains no parameters (there is no *kernel*).
Instead, pooling operators are deterministic,
typically calculating either the maximum or the average value
of the elements in the pooling window.
These operations are called *maximum pooling* (*max-pooling* for short)
and *average pooling*, respectively.

*Average pooling* is essentially as old as CNNs. The idea is akin to 
downsampling an image. Rather than just taking the value of every second (or third) 
pixel for the lower resolution image, we can average over adjacent pixels to obtain 
an image with better signal-to-noise ratio since we are combining the information 
from multiple adjacent pixels. *Max-pooling* was introduced in 
:citet:`Riesenhuber.Poggio.1999` in the context of cognitive neuroscience to describe 
how information might be aggregated hierarchically for the purpose 
of object recognition; an earlier version had already appeared in speech recognition :cite:`Yamaguchi.Sakamoto.Akabane.ea.1990`. In almost all cases, max-pooling 
is preferable to average pooling. 

In both cases, as with the cross-correlation operator,
we can think of the pooling window
as starting from the upper-left of the input tensor
and sliding across it from left to right and top to bottom.
At each location that the pooling window hits,
it computes the maximum or average
value of the input subtensor in the window,
depending on whether max or average pooling is employed.
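
Written out for a stride of 1 and no padding, as in the description above, a pooling window of shape $p_\textrm{h} \times p_\textrm{w}$ applied to an input $\mathbf{X}$ produces the following outputs. The index symbols $a$ and $b$ and the superscripts are notation introduced here for illustration only.

$$
Y^{\max}_{i, j} = \max_{0 \leq a < p_\textrm{h},\; 0 \leq b < p_\textrm{w}} X_{i+a,\, j+b},
\qquad
Y^{\textrm{avg}}_{i, j} = \frac{1}{p_\textrm{h}\, p_\textrm{w}} \sum_{a=0}^{p_\textrm{h}-1} \sum_{b=0}^{p_\textrm{w}-1} X_{i+a,\, j+b}.
$$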


![Max-pooling with a pooling window shape of $2\times 2$. The shaded portions are the first output element as well as the input tensor elements used for the output computation: $\max(0, 1, 3, 4)=4$.](../img/pooling.svg)
:label:`fig_pooling`

The output tensor in :numref:`fig_pooling`  has a height of 2 and a width of 2.
The four elements are derived from the maximum value in each pooling window:

$$
\max(0, 1, 3, 4)=4,\\
\max(1, 2, 4, 5)=5,\\
\max(3, 4, 6, 7)=7,\\
\max(4, 5, 7, 8)=8.\\
$$

More generally, we can define a $p \times q$ pooling layer by aggregating over 
a region of said size. Returning to the problem of edge detection, 
we use the output of the convolutional layer
as input for $2\times 2$ max-pooling.
Denote by `X` the input of the convolutional layer and by `Y` the pooling layer output. 
Regardless of whether or not the values of `X[i, j]`, `X[i, j + 1]`, 
`X[i+1, j]` and `X[i+1, j + 1]` are different,
the pooling layer always outputs `Y[i, j] = 1`.
That is to say, using the $2\times 2$ max-pooling layer,
we can still detect the pattern recognized by the convolutional layer
as long as it moves no more than one element in height or width.
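
The following sketch illustrates this tolerance. It does not run an actual convolution: `edge_map` is a hypothetical stand-in for an edge detector's output, with a 1 in the column where the edge was found. The $2\times 2$ max-pooling uses stride 1 to match the sliding window described above.

```python
import torch
import torch.nn.functional as F

def edge_map(col, shape=(4, 6)):
    # Hypothetical stand-in for a convolutional edge detector's output:
    # 1 in the column where the edge was detected, 0 elsewhere
    out = torch.zeros(shape)
    out[:, col] = 1
    return out

def max_pool_2x2(t):
    # 2x2 max-pooling with stride 1; max_pool2d expects a 4D (N, C, H, W) input
    return F.max_pool2d(t[None, None], kernel_size=2, stride=1)[0, 0]

A = edge_map(2)  # edge detected in column 2
B = edge_map(3)  # the same edge, shifted by one pixel

Y_A, Y_B = max_pool_2x2(A), max_pool_2x2(B)
print(torch.equal(A[:, 2], B[:, 2]))      # False: raw responses disagree
print(Y_A[0])                             # tensor([0., 1., 1., 0., 0.])
print(Y_B[0])                             # tensor([0., 0., 1., 1., 0.])
print(torch.equal(Y_A[:, 2], Y_B[:, 2]))  # True: pooled responses agree
```

Before pooling, column 2 responds to only one of the two inputs. After pooling, column 2 responds to both, because its window covers columns 2 and 3.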

In the code below, we (**implement the forward propagation
of the pooling layer**) in the `pool2d` function.
This function is similar to the `corr2d` function
in :numref:`sec_conv_layer`.
However, no kernel is needed: the output is computed
as either the maximum or the average of each region in the input.

In [3]:
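def pool2d(X, pool_size, mode='max'):
    """Pool a 2D tensor with stride 1 and no padding."""
    p_h, p_w = pool_size
    Y = torch.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Aggregate the window whose upper-left corner is at (i, j)
            if mode == 'max':
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
            else:
                raise ValueError(f"unknown mode: {mode!r}")
    return Y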

We can construct the input tensor `X` in :numref:`fig_pooling` to [**validate the output of the two-dimensional max-pooling layer**].

In [4]:
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
pool2d(X, (2, 2))

tensor([[4., 5.],
        [7., 8.]])

Also, we can experiment with (**the average pooling layer**).

In [5]:
pool2d(X, (2, 2), 'avg')

tensor([[2., 3.],
        [5., 6.]])
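
As a cross-check, the sketch below reimplements the same stride-1, unpadded pooling without explicit loops and compares it against the framework's functional pooling operators. The name `pool2d_unfold` is hypothetical and the approach (extracting all windows with `Tensor.unfold`) is one possible vectorization, not the implementation used elsewhere in this section.

```python
import torch
import torch.nn.functional as F

def pool2d_unfold(X, pool_size, mode='max'):
    # Hypothetical vectorized variant of the loop-based function above
    p_h, p_w = pool_size
    # windows has shape (H - p_h + 1, W - p_w + 1, p_h, p_w)
    windows = X.unfold(0, p_h, 1).unfold(1, p_w, 1)
    if mode == 'max':
        return windows.amax(dim=(-2, -1))
    if mode == 'avg':
        return windows.mean(dim=(-2, -1))
    raise ValueError(f"unknown mode: {mode!r}")

X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
max_out = pool2d_unfold(X, (2, 2))
avg_out = pool2d_unfold(X, (2, 2), 'avg')

# The built-in operators expect a 4D (batch, channel, height, width) input
ref_max = F.max_pool2d(X[None, None], kernel_size=2, stride=1)[0, 0]
ref_avg = F.avg_pool2d(X[None, None], kernel_size=2, stride=1)[0, 0]
print(torch.equal(max_out, ref_max), torch.equal(avg_out, ref_avg))  # True True
```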

## [**Padding and Stride**]

As with convolutional layers, pooling layers
change the output shape.
And as before, we can adjust the operation to achieve a desired output shape
by padding the input and adjusting the stride.
We can demonstrate the use of padding and strides
in pooling layers via the built-in two-dimensional max-pooling layer from the deep learning framework.
We first construct an input tensor `X` whose shape has four dimensions,
where the number of examples (batch size) and number of channels are both 1.

In [15]:
X = torch.arange(16, dtype=torch.float32).reshape((1, 1, 4, 4))
X

tensor([[[[ 0.,  1.,  2.,  3.],
          [ 4.,  5.,  6.,  7.],
          [ 8.,  9., 10., 11.],
          [12., 13., 14., 15.]]]])

Since pooling aggregates information from an area, (**deep learning frameworks default to matching pooling window sizes and stride.**) For instance, if we use a pooling window of shape `(3, 3)`
we get a stride shape of `(3, 3)` by default.


In [7]:
pool2d = nn.MaxPool2d(3)
# Pooling has no model parameters, hence it needs no initialization
pool2d(X)

tensor([[[[10.]]]])
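
The default can also be inspected directly. The sketch below assumes the `stride` attribute that `nn.MaxPool2d` stores on the module, checks that leaving the stride unspecified behaves like passing `stride=3`, and confirms that the layer has no learnable parameters. The variable names are chosen for this example.

```python
import torch
from torch import nn

X = torch.arange(16, dtype=torch.float32).reshape((1, 1, 4, 4))

default_pool = nn.MaxPool2d(3)             # stride not specified
explicit_pool = nn.MaxPool2d(3, stride=3)  # stride set to the window size
overlapping_pool = nn.MaxPool2d(3, stride=1)

print(default_pool.kernel_size, default_pool.stride)           # 3 3
print(torch.equal(default_pool(X), explicit_pool(X)))          # True
print(overlapping_pool(X).shape)                               # torch.Size([1, 1, 2, 2])
print(sum(p.numel() for p in default_pool.parameters()))       # 0
```

With a stride of 1 the windows overlap and the output keeps more positions; with the default stride the windows tile the input without overlap.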

Needless to say, [**the stride and padding can be manually specified**] to override framework defaults if required.

In [8]:
pool2d = nn.MaxPool2d(3, padding=1, stride=2)
pool2d(X)

tensor([[[[ 5.,  7.],
          [13., 15.]]]])

Of course, we can specify a rectangular pooling window by giving its height and width separately, and likewise for the stride and padding, as the example below shows.

In [9]:
pool2d = nn.MaxPool2d((2, 3), stride=(2, 3), padding=(0, 1))
pool2d(X)

tensor([[[[ 5.,  7.],
          [13., 15.]]]])
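
The output shapes seen in these examples follow the same rule as for convolutions: along each dimension, an input of size $n$ with window size $k$, stride $s$, and padding $p$ on both sides gives $\lfloor (n + 2p - k)/s \rfloor + 1$ outputs. The sketch below checks this against `nn.MaxPool2d`. The helpers `pool_output_size` and `as_pair` are hypothetical, and the formula assumes no dilation and the default floor rounding (`ceil_mode=False`).

```python
import torch
from torch import nn

def pool_output_size(n, k, s, p=0):
    # Assumes dilation 1 and floor rounding (ceil_mode=False)
    return (n + 2 * p - k) // s + 1

def as_pair(v):
    # Expand an int into a (height, width) pair
    return v if isinstance(v, tuple) else (v, v)

X = torch.arange(16, dtype=torch.float32).reshape((1, 1, 4, 4))
results = []
# (kernel_size, stride, padding) settings, including those used above
for kernel_size, stride, padding in [(3, 3, 0), (3, 2, 1),
                                     ((2, 3), (2, 3), (0, 1)), (2, 2, 0)]:
    pool = nn.MaxPool2d(kernel_size, stride=stride, padding=padding)
    (k_h, k_w), (s_h, s_w) = as_pair(kernel_size), as_pair(stride)
    pad_h, pad_w = as_pair(padding)
    expected = (pool_output_size(4, k_h, s_h, pad_h),
                pool_output_size(4, k_w, s_w, pad_w))
    actual = tuple(pool(X).shape[2:])
    results.append((actual, expected))
    print(actual, expected)
```

The last setting, a $2 \times 2$ window with stride 2, halves both the height and the width of the $4 \times 4$ input.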

## Multiple Channels

When processing multi-channel input data,
[**the pooling layer pools each input channel separately**],
rather than summing the inputs up over channels
as in a convolutional layer.
This means that the number of output channels for the pooling layer
is the same as the number of input channels.
Below, we will concatenate tensors `X` and `X + 1`
on the channel dimension to construct an input with two channels.

In [16]:
X = torch.cat((X, X + 1), 1)
X

tensor([[[[ 0.,  1.,  2.,  3.],
          [ 4.,  5.,  6.,  7.],
          [ 8.,  9., 10., 11.],
          [12., 13., 14., 15.]],

         [[ 1.,  2.,  3.,  4.],
          [ 5.,  6.,  7.,  8.],
          [ 9., 10., 11., 12.],
          [13., 14., 15., 16.]]]])

As we can see, the number of output channels is still two after pooling.

In [17]:
pool2d = nn.MaxPool2d(3, padding=1, stride=2)
pool2d(X)

tensor([[[[ 5.,  7.],
          [13., 15.]],

         [[ 6.,  8.],
          [14., 16.]]]])
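
To confirm that channels are handled independently, the sketch below pools each channel on its own and compares the stacked result with pooling the two-channel tensor in one call. The variable names `X2` and `per_channel` are chosen for this example.

```python
import torch
from torch import nn

X = torch.arange(16, dtype=torch.float32).reshape((1, 1, 4, 4))
X2 = torch.cat((X, X + 1), 1)  # shape (1, 2, 4, 4): two channels
pool = nn.MaxPool2d(3, padding=1, stride=2)

# Pool every channel separately, then concatenate along the channel dimension
per_channel = torch.cat(
    [pool(X2[:, c:c + 1]) for c in range(X2.shape[1])], 1)

print(torch.equal(pool(X2), per_channel))  # True
print(X2.shape[1], pool(X2).shape[1])      # 2 2
```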

## Summary

Pooling is an exceedingly simple operation. It does exactly what its name indicates: it aggregates results over a window of values. All convolution semantics, such as strides and padding, apply in the same way as they did previously. Note that pooling is indifferent to channels, i.e., it leaves the number of channels unchanged and it applies to each channel separately. Lastly, of the two popular pooling choices, max-pooling is preferable to average pooling, as it confers some degree of translation invariance on the output. A popular choice is a pooling window size of $2 \times 2$, which with the default matching stride halves the height and width and thus quarters the spatial resolution of the output. 

Note that there are many more ways of reducing resolution beyond pooling. For instance, in stochastic pooling :cite:`Zeiler.Fergus.2013` and fractional max-pooling :cite:`Graham.2014`, aggregation is combined with randomization. This can slightly improve the accuracy in some cases. Lastly, as we will see later with the attention mechanism, there are more refined ways of aggregating over outputs, e.g., by using the alignment between a query and representation vectors. 


## Exercises

1. Implement average pooling through a convolution. 
1. Prove that max-pooling cannot be implemented through a convolution alone. 
1. Max-pooling can be accomplished using ReLU operations, i.e., $\textrm{ReLU}(x) = \max(0, x)$.
    1. Express $\max (a, b)$ by using only ReLU operations.
    1. Use this to implement max-pooling by means of convolutions and ReLU layers. 
    1. How many channels and layers do you need for a $2 \times 2$ convolution? How many for a $3 \times 3$ convolution?
1. What is the computational cost of the pooling layer? Assume that the input to the pooling layer is of size $c\times h\times w$, the pooling window has a shape of $p_\textrm{h}\times p_\textrm{w}$ with a padding of $(p_\textrm{h}, p_\textrm{w})$ and a stride of $(s_\textrm{h}, s_\textrm{w})$.
1. Why do you expect max-pooling and average pooling to work differently?
1. Do we need a separate minimum pooling layer? Can you replace it with another operation?
1. We could use the softmax operation for pooling. Why might it not be so popular?

[Discussions](https://discuss.d2l.ai/t/72)
