# Convolutions for Images
:label:`sec_conv_layer`

Now that we understand how convolutional layers work in theory,
we are ready to see how they work in practice.
Building on our motivation of convolutional neural networks
as efficient architectures for exploring structure in image data,
we stick with images as our running example.


In [2]:
import torch
from torch import nn
from d2l import torch as d2l

## The Cross-Correlation Operation

Recall that strictly speaking, convolutional layers
are a misnomer, since the operations they express
are more accurately described as cross-correlations.
Based on our descriptions of convolutional layers in :numref:`sec_why-conv`,
in such a layer, an input tensor
and a kernel tensor are combined
to produce an output tensor through a (**cross-correlation operation.**)

Let's ignore channels for now and see how this works
with two-dimensional data and hidden representations.
In :numref:`fig_correlation`,
the input is a two-dimensional tensor
with a height of 3 and width of 3.
We mark the shape of the tensor as $3 \times 3$ or ($3$, $3$).
The height and width of the kernel are both 2.
The shape of the *kernel window* (or *convolution window*)
is given by the height and width of the kernel
(here it is $2 \times 2$).

![Two-dimensional cross-correlation operation. The shaded portions are the first output element as well as the input and kernel tensor elements used for the output computation: $0\times0+1\times1+3\times2+4\times3=19$.](../img/correlation.svg)
:label:`fig_correlation`

In the two-dimensional cross-correlation operation,
we begin with the convolution window positioned
at the upper-left corner of the input tensor
and slide it across the input tensor,
both from left to right and top to bottom.
When the convolution window slides to a certain position,
the input subtensor contained in that window
and the kernel tensor are multiplied elementwise
and the resulting tensor is summed up
yielding a single scalar value.
This result gives the value of the output tensor
at the corresponding location.
Here, the output tensor has a height of 2 and width of 2
and the four elements are derived from
the two-dimensional cross-correlation operation:

$$
0\times0+1\times1+3\times2+4\times3=19,\\
1\times0+2\times1+4\times2+5\times3=25,\\
3\times0+4\times1+6\times2+7\times3=37,\\
4\times0+5\times1+7\times2+8\times3=43.
$$

Note that along each axis, the output size
is slightly smaller than the input size.
Because the kernel has width and height greater than $1$,
we can only properly compute the cross-correlation
for locations where the kernel fits wholly within the image.
The output size is therefore given by the input size $n_\textrm{h} \times n_\textrm{w}$
minus the size of the convolution kernel $k_\textrm{h} \times k_\textrm{w}$
via

$$(n_\textrm{h}-k_\textrm{h}+1) \times (n_\textrm{w}-k_\textrm{w}+1).$$

This is the case since we need enough space
to "shift" the convolution kernel across the image.
Later we will see how to keep the size unchanged
by padding the image with zeros around its boundary
so that there is enough space to shift the kernel.
Next, we implement this process in the `corr2d` function,
which accepts an input tensor `X` and a kernel tensor `K`
and returns an output tensor `Y`.



In [3]:
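def corr2d(X, K):  #@save
    """Compute 2D cross-correlation."""
    h, w = K.shape
    # The output shrinks by (h - 1, w - 1) because the kernel must fit inside X
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Multiply the window at (i, j) elementwise by K and sum the result
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y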

We can construct the input tensor `X` and the kernel tensor `K`
from :numref:`fig_correlation`
to [**validate the output of the above implementation**]
of the two-dimensional cross-correlation operation.




In [4]:
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)

tensor([[19., 25.],
        [37., 43.]])
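
As a sanity check, the same numbers can be reproduced with PyTorch's built-in `torch.nn.functional.conv2d`, which, despite its name, computes a cross-correlation. The sketch below is an addition rather than part of the implementation above; it assumes the built-in's four-dimensional (batch, channel, height, width) layout and also confirms the output shape formula.

```python
import torch
import torch.nn.functional as F

X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])

# F.conv2d expects an input of shape (batch, channels, height, width) and a
# weight of shape (out_channels, in_channels, kernel_height, kernel_width)
Y_builtin = F.conv2d(X.reshape(1, 1, 3, 3), K.reshape(1, 1, 2, 2))
Y_builtin = Y_builtin.reshape(2, 2)

# Output shape predicted by (n_h - k_h + 1) x (n_w - k_w + 1)
expected_shape = (X.shape[0] - K.shape[0] + 1, X.shape[1] - K.shape[1] + 1)
print(Y_builtin)
print(tuple(Y_builtin.shape) == expected_shape)
```

If the two implementations agree, the nested loops in `corr2d` can be trusted as a reference, while the built-in is the faster choice for anything beyond toy inputs.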

## Convolutional Layers

A convolutional layer cross-correlates the input and kernel
and adds a scalar bias to produce an output.
The two parameters of a convolutional layer
are the kernel and the scalar bias.
When training models based on convolutional layers,
we typically initialize the kernels randomly,
just as we would with a fully connected layer.

We are now ready to [**implement a two-dimensional convolutional layer**]
based on the `corr2d` function defined above.
In the `__init__` constructor method,
we declare `weight` and `bias` as the two model parameters.
The forward propagation method
calls the `corr2d` function and adds the bias.


In [5]:
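class Conv2D(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        # Kernel entries are drawn uniformly at random from [0, 1)
        self.weight = nn.Parameter(torch.rand(kernel_size))
        # A single scalar bias, broadcast over every output element
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return corr2d(x, self.weight) + self.bias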

In an $h \times w$ convolution
or an $h \times w$ convolution kernel,
the height and width of the convolution kernel are $h$ and $w$, respectively.
We also refer to
a convolutional layer with an $h \times w$
convolution kernel simply as an $h \times w$ convolutional layer.
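
To make the roles of the kernel and the bias concrete, the following sketch uses PyTorch's built-in `nn.Conv2d` (rather than the `Conv2D` class above) and overwrites its two parameters by hand. Reusing the kernel from :numref:`fig_correlation` and choosing a bias of `0.5` are illustrative assumptions.

```python
import torch
from torch import nn

# A 2 x 2 convolutional layer with one input and one output channel
layer = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=(2, 2))

# Overwrite the randomly initialized parameters with illustrative values
with torch.no_grad():
    layer.weight.copy_(torch.tensor([[[[0.0, 1.0], [2.0, 3.0]]]]))
    layer.bias.fill_(0.5)

# Built-in layers expect (batch, channels, height, width) inputs
X = torch.arange(9.0).reshape(1, 1, 3, 3)
with torch.no_grad():
    Y = layer(X).reshape(2, 2)
print(Y)  # the cross-correlation result with 0.5 added to every element
```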


## Object Edge Detection in Images

Let's take a moment to parse [**a simple application of a convolutional layer:
detecting the edge of an object in an image**]
by finding the location of the pixel change.
First, we construct an "image" of $6\times 8$ pixels.
The middle four columns are black ($0$) and the rest are white ($1$).




In [6]:
X = torch.ones((6, 8))
X[:, 2:6] = 0
X

tensor([[1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.]])

Next, we construct a kernel `K` with a height of 1 and a width of 2.
When we perform the cross-correlation operation with the input,
if the horizontally adjacent elements are the same,
the output is 0. Otherwise, the output is nonzero.
Note that this kernel is a special case of a finite difference operator. With $i$ indexing rows and $j$ indexing columns, at location $(i,j)$ it computes $x_{i,j} - x_{i,(j+1)}$, i.e., it computes the difference between the values of horizontally adjacent pixels. This is a discrete approximation of the first derivative in the horizontal direction. After all, for a function $f(i,j)$ its derivative $-\partial_j f(i,j) = \lim_{\epsilon \to 0} \frac{f(i,j) - f(i,j+\epsilon)}{\epsilon}$. Let's see how this works in practice.



In [9]:
K = torch.tensor([[1.0, -1.0]])

We are ready to perform the cross-correlation operation
with arguments `X` (our input) and `K` (our kernel).
As you can see, [**we detect $1$ for the edge from white to black
and $-1$ for the edge from black to white.**]
All other outputs take value $0$.

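In other words, the sign of the output encodes the direction of the transition: a positive value marks a change from a high to a low pixel value (white to black), and a negative value marks the opposite change (black to white). The hand-designed kernel therefore identifies both where an edge lies and which way it runs. This derivative-like behavior is what makes convolutional layers useful for extracting spatial features from images.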

In [10]:
Y = corr2d(X, K)
Y

tensor([[ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.]])
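
Because the kernel `[1, -1]` is a finite difference, the same output can be obtained without any convolution machinery by subtracting each column from its left-hand neighbor. The following is a minimal sketch of that equivalence; the name `Y_diff` is introduced here for illustration only.

```python
import torch

X = torch.ones((6, 8))
X[:, 2:6] = 0

# Difference between each pixel and its right-hand neighbor:
# entry (i, j) equals X[i, j] - X[i, j + 1]
Y_diff = X[:, :-1] - X[:, 1:]
print(Y_diff.shape)
print(Y_diff[0])
```

Every row of `Y_diff` is identical, since the image does not change along the vertical direction.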

We can now apply the kernel to the transposed image.
As expected, it vanishes. [**The kernel `K` only detects vertical edges.**]

The kernel is selective because its weights vary only along the horizontal axis, so it responds only to horizontal changes in pixel value. Detecting edges of a different orientation requires a kernel oriented accordingly.

In [11]:
corr2d(X.t(), K)

tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])
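
To see that orientation matters, one can transpose the kernel as well as the image. The sketch below is an addition to the text: it uses the built-in `torch.nn.functional.conv2d` (which computes a cross-correlation) so that it runs on its own, and the names `X_t` and `K_t` are illustrative.

```python
import torch
import torch.nn.functional as F

X = torch.ones((6, 8))
X[:, 2:6] = 0
K = torch.tensor([[1.0, -1.0]])

X_t = X.t()  # shape (8, 6): the edges now run horizontally
K_t = K.t()  # shape (2, 1): differences are now taken between rows

# Reshape to the (batch, channel, height, width) layout that F.conv2d expects
Y_t = F.conv2d(X_t.reshape(1, 1, 8, 6), K_t.reshape(1, 1, 2, 1)).reshape(7, 6)
print(Y_t)
```

With the transposed kernel the edges reappear, this time as one row of positive values and one row of negative values.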

## Learning a Kernel

Designing an edge detector by finite differences `[1, -1]` is neat
if we know this is precisely what we are looking for.
However, as we look at larger kernels,
and consider successive layers of convolutions,
it might be impossible to specify
precisely what each filter should be doing manually.

Now let's see whether we can [**learn the kernel that generated `Y` from `X`**]
by looking at the input--output pairs only.
We first construct a convolutional layer
and initialize its kernel as a random tensor.
Next, in each iteration, we will use the squared error
to compare `Y` with the output of the convolutional layer.
We can then calculate the gradient to update the kernel.
For the sake of simplicity,
in the following
we use the built-in class
for two-dimensional convolutional layers
and ignore the bias.



In [12]:
# Construct a two-dimensional convolutional layer with 1 output channel and a
# kernel of shape (1, 2). For the sake of simplicity, we ignore the bias here
conv2d = nn.LazyConv2d(1, kernel_size=(1, 2), bias=False)

# The two-dimensional convolutional layer uses four-dimensional input and
# output in the format of (example, channel, height, width), where the batch
# size (number of examples in the batch) and the number of channels are both 1
X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))
lr = 3e-2  # Learning rate

for i in range(10):
    Y_hat = conv2d(X)
    l = (Y_hat - Y) ** 2
    conv2d.zero_grad()
    l.sum().backward()
    # Update the kernel
    conv2d.weight.data[:] -= lr * conv2d.weight.grad
    if (i + 1) % 2 == 0:
        print(f'epoch {i + 1}, loss {l.sum():.3f}')

epoch 2, loss 15.728
epoch 4, loss 4.840
epoch 6, loss 1.714
epoch 8, loss 0.657
epoch 10, loss 0.261


Note that the error has dropped to a small value after 10 iterations. Now we will [**take a look at the kernel tensor we learned.**]


In [18]:
conv2d.weight.data.reshape((1, 2))

tensor([[ 1.0389, -0.9343]])

Indeed, the learned kernel tensor is remarkably close
to the kernel tensor `K` we defined earlier.
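
The manual parameter update above can also be delegated to an optimizer. The following is a sketch under stated assumptions rather than the procedure used in this section: it swaps `nn.LazyConv2d` for `nn.Conv2d` with explicit channel counts, uses `torch.optim.SGD` for the update, fixes a random seed, and runs 50 iterations instead of 10 so that the kernel gets closer to `[1, -1]`.

```python
import torch
from torch import nn

torch.manual_seed(0)  # fixed seed so that repeated runs behave identically

X = torch.ones((6, 8))
X[:, 2:6] = 0
Y = X[:, :-1] - X[:, 1:]  # target produced by the kernel [1, -1]
X, Y = X.reshape(1, 1, 6, 8), Y.reshape(1, 1, 6, 7)

conv2d = nn.Conv2d(1, 1, kernel_size=(1, 2), bias=False)
optimizer = torch.optim.SGD(conv2d.parameters(), lr=3e-2)

for _ in range(50):
    optimizer.zero_grad()
    loss = ((conv2d(X) - Y) ** 2).sum()  # summed squared error
    loss.backward()
    optimizer.step()                     # w <- w - lr * grad

learned = conv2d.weight.detach().reshape(1, 2)
print(learned)
```

The loss is quadratic in the two kernel weights, so plain gradient descent converges for a sufficiently small learning rate; the rate of `3e-2` is carried over from the loop above.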

## Cross-Correlation and Convolution

Recall our observation from :numref:`sec_why-conv` of the correspondence
between the cross-correlation and convolution operations.
Here let's continue to consider two-dimensional convolutional layers.
What if such layers
perform strict convolution operations
as defined in :eqref:`eq_2d-conv-discrete`
instead of cross-correlations?
In order to obtain the output of the strict *convolution* operation, we only need to flip the two-dimensional kernel tensor both horizontally and vertically, and then perform the *cross-correlation* operation with the input tensor.

It is noteworthy that since kernels are learned from data in deep learning,
the outputs of convolutional layers remain unaffected
regardless of whether such layers
perform strict convolution operations
or cross-correlation operations.

To illustrate this, suppose that a convolutional layer performs *cross-correlation* and learns the kernel in :numref:`fig_correlation`, which is here denoted as the matrix $\mathbf{K}$.
Assuming that other conditions remain unchanged,
when this layer instead performs strict *convolution*,
the learned kernel $\mathbf{K}'$ will be the same as $\mathbf{K}$
after $\mathbf{K}'$ is
flipped both horizontally and vertically.
That is to say,
when the convolutional layer
performs strict *convolution*
for the input in :numref:`fig_correlation`
and $\mathbf{K}'$,
the same output in :numref:`fig_correlation`
(cross-correlation of the input and $\mathbf{K}$)
will be obtained.
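
The flipping argument can be checked numerically. The sketch below is an illustration, not part of the original text: `conv2d_strict` is a hypothetical helper that evaluates the strict convolution sum directly over the positions where the kernel fits inside the input, and `corr2d` is repeated so that the block runs on its own.

```python
import torch

def corr2d(X, K):
    """Cross-correlation, as defined earlier in this section."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def conv2d_strict(X, K):
    """Strict convolution (hypothetical helper): the kernel index runs
    in the opposite direction to the input index."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            for a in range(h):
                for b in range(w):
                    Y[i, j] += K[a, b] * X[i + h - 1 - a, j + w - 1 - b]
    return Y

X = torch.arange(9.0).reshape(3, 3)
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
K_flipped = torch.flip(K, dims=[0, 1])  # flip vertically and horizontally

print(conv2d_strict(X, K_flipped))  # strict convolution with the flipped kernel
print(corr2d(X, K))                 # cross-correlation with the original kernel
```

Both calls print the same tensor, which is the output shown in :numref:`fig_correlation`.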

In keeping with standard terminology in deep learning literature,
we will continue to refer to the cross-correlation operation
as a convolution even though, strictly speaking, it is slightly different.
Furthermore,
we use the term *element* to refer to
an entry (or component) of any tensor representing a layer representation or a convolution kernel.


## Feature Map and Receptive Field

As described in :numref:`subsec_why-conv-channels`,
the convolutional layer output in
:numref:`fig_correlation`
is sometimes called a *feature map*,
as it can be regarded as
the learned representations (features)
in the spatial dimensions (e.g., width and height)
that are passed to the subsequent layer.
In CNNs,
for any element $x$ of some layer,
its *receptive field* refers to
all the elements (from all the previous layers)
that may affect the calculation of $x$
during the forward propagation.
Note that the receptive field
may be larger than the actual size of the input.

Let's continue to use :numref:`fig_correlation` to explain the receptive field.
Given the $2 \times 2$ convolution kernel,
the receptive field of the shaded output element (of value $19$)
is
the four elements in the shaded portion of the input.
Now let's denote the $2 \times 2$
output as $\mathbf{Y}$
and consider a deeper CNN
with an additional $2 \times 2$ convolutional layer that takes $\mathbf{Y}$
as its input, outputting
a single element $z$.
In this case,
the receptive field of $z$
on $\mathbf{Y}$ includes all the four elements of $\mathbf{Y}$,
while
the receptive field
on the input includes all the nine input elements.
Thus,
when any element in a feature map
needs a larger receptive field
to detect input features over a broader area,
we can build a deeper network.
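
The growth of the receptive field with depth can be computed directly. The sketch below assumes stride 1 and no dilation, as in the example above; `receptive_field` is a hypothetical helper introduced for illustration, and it returns the extent along a single axis.

```python
def receptive_field(kernel_sizes):
    """Receptive-field extent along one axis for stacked convolutional
    layers, assuming stride 1 and no dilation (hypothetical helper)."""
    size = 1
    for k in kernel_sizes:
        size += k - 1  # each layer widens the field by k - 1 elements
    return size

print(receptive_field([2]))     # one 2 x 2 layer
print(receptive_field([2, 2]))  # two stacked 2 x 2 layers
```

For two stacked $2 \times 2$ layers the extent is 3 along each axis, which matches the nine input elements that influence $z$.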


Receptive fields derive their name from neurophysiology.
A series of experiments on a range of animals using different stimuli
:cite:`Hubel.Wiesel.1959,Hubel.Wiesel.1962,Hubel.Wiesel.1968` explored the response of what is called the visual
cortex to said stimuli. By and large, they found that lower levels respond to edges and related
shapes. Later on, :citet:`Field.1987` illustrated this effect on natural
images with what can only be called convolutional kernels.
We reprint a key figure in :numref:`field_visual` to illustrate the striking similarities.

![Figure and caption taken from :citet:`Field.1987`: An example of coding with six different channels. (Left) Examples of the six types of sensor associated with each channel. (Right) Convolution of the image in (Middle) with the six sensors shown in (Left). The response of the individual sensors is determined by sampling these filtered images at a distance proportional to the size of the sensor (shown with dots). This diagram shows the response of only the even symmetric sensors.](../img/field-visual.png)
:label:`field_visual`

As it turns out, this relation even holds for the features computed by deeper layers of networks trained on image classification tasks, as demonstrated in, for example, :citet:`Kuzovkin.Vicente.Petton.ea.2018`. Suffice it to say, convolutions have proven to be an incredibly powerful tool for computer vision, both in biology and in code. As such, it is not surprising (in hindsight) that they heralded the recent success in deep learning.

## Summary

The core computation required for a convolutional layer is a cross-correlation operation. We saw that a simple nested for-loop is all that is required to compute its value. If we have multiple input and multiple output channels, we are performing a matrix--matrix operation between channels. As can be seen, the computation is straightforward and, most importantly, highly *local*. This affords significant hardware optimization and many recent results in computer vision are only possible because of that. After all, it means that chip designers can invest in fast computation rather than memory when it comes to optimizing for convolutions. While this may not lead to optimal designs for other applications, it does open the door to ubiquitous and affordable computer vision.

In terms of convolutions themselves, they can be used for many purposes, for example detecting edges and lines, blurring images, or sharpening them. Most importantly, it is not necessary that the statistician (or engineer) invents suitable filters. Instead, we can simply *learn* them from data. This replaces feature engineering heuristics with evidence-based statistics. Lastly, and quite delightfully, these filters are not just advantageous for building deep networks but they also correspond to receptive fields and feature maps in the brain. This gives us confidence that we are on the right track.

## Exercises

1. Construct an image `X` with diagonal edges.
    1. What happens if you apply the kernel `K` in this section to it?
    1. What happens if you transpose `X`?
    1. What happens if you transpose `K`?
1. Design some kernels manually.
    1. Given a directional vector $\mathbf{v} = (v_1, v_2)$, derive an edge-detection kernel that detects
       edges orthogonal to $\mathbf{v}$, i.e., edges in the direction $(v_2, -v_1)$.
    1. Derive a finite difference operator for the second derivative. What is the minimum
       size of the convolutional kernel associated with it? Which structures in images respond most strongly to it?
    1. How would you design a blur kernel? Why might you want to use such a kernel?
    1. What is the minimum size of a kernel to obtain a derivative of order $d$?
1. When you try to automatically find the gradient for the `Conv2D` class we created, what kind of error message do you see?
1. How do you represent a cross-correlation operation as a matrix multiplication by changing the input and kernel tensors?


[Discussions](https://discuss.d2l.ai/t/66)

