# From Fully Connected Layers to Convolutions
:label:`sec_why-conv`

To this day,
the models that we have discussed so far
remain appropriate options
when we are dealing with tabular data.
By tabular, we mean that the data consist
of rows corresponding to examples
and columns corresponding to features.
With tabular data, we might anticipate
that the patterns we seek could involve
interactions among the features,
but we do not assume any structure *a priori*
concerning how the features interact.

Sometimes, we truly lack the knowledge to be able to guide the construction of fancier architectures.
In these cases, an MLP
may be the best that we can do.
However, for high-dimensional perceptual data,
such structureless networks can grow unwieldy.

For instance, let's return to our running example
of distinguishing cats from dogs.
Say that we do a thorough job in data collection,
collecting an annotated dataset of one-megapixel photographs.
This means that each input to the network has one million dimensions.
Even an aggressive reduction to one thousand hidden dimensions
would require a fully connected layer
characterized by $10^6 \times 10^3 = 10^9$ parameters.
Unless we have lots of GPUs, a talent
for distributed optimization,
and an extraordinary amount of patience,
learning the parameters of this network
may turn out to be infeasible.
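To make the arithmetic above concrete, here is a minimal sketch that counts the weights of a single fully connected layer. The helper name `dense_params` is hypothetical, and the storage estimate assumes 4 bytes per weight (32-bit floats), which is not stated in the text.

```python
# Weights of one fully connected layer (biases ignored for simplicity).
def dense_params(num_inputs, num_hidden):
    return num_inputs * num_hidden

megapixel = 1000 * 1000  # one-megapixel grayscale image, flattened

print(dense_params(megapixel, 1000))            # 1000000000, i.e., 10^9
# Assumption: each weight is stored as a 4-byte float; size in gigabytes
print(dense_params(megapixel, 1000) * 4 / 1e9)  # 4.0
```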

A careful reader might object to this argument
on the basis that one megapixel resolution may not be necessary.
However, while we might be able
to get away with one hundred thousand pixels,
our hidden layer of size 1000 grossly underestimates
the number of hidden units that it takes
to learn good representations of images,
so a practical system will still require billions of parameters.
Moreover, learning a classifier by fitting so many parameters
might require collecting an enormous dataset.
And yet today both humans and computers are able
to distinguish cats from dogs quite well,
seemingly contradicting these intuitions.
That is because images exhibit rich structure
that can be exploited by humans
and machine learning models alike.
Convolutional neural networks (CNNs) are one creative way
that machine learning has embraced for exploiting
some of the known structure in natural images.


## Invariance

Imagine that we want to detect an object in an image.
It seems reasonable that whatever method
we use to recognize objects should not be overly concerned
with the precise location of the object in the image.
Ideally, our system should exploit this knowledge.
Pigs usually do not fly and planes usually do not swim.
Nonetheless, we should still recognize
a pig were one to appear at the top of the image.
We can draw some inspiration here
from the children's game "Where's Waldo"
(which itself has inspired many real-life imitations, such as that depicted in :numref:`img_waldo`).
The game consists of a number of chaotic scenes
bursting with activities.
Waldo shows up somewhere in each,
typically lurking in some unlikely location.
The reader's goal is to locate him.
Despite his characteristic outfit,
this can be surprisingly difficult,
due to the large number of distractions.
However, *what Waldo looks like*
does not depend upon *where Waldo is located*.
We could sweep the image with a Waldo detector
that could assign a score to each patch,
indicating the likelihood that the patch contains Waldo. 
In fact, many object detection and segmentation algorithms 
are based on this approach :cite:`Long.Shelhamer.Darrell.2015`. 
CNNs systematize this idea of *spatial invariance*,
exploiting it to learn useful representations
with fewer parameters.

![Can you find Waldo (image courtesy of William Murphy (Infomatique))?](../img/waldo-football.jpg)
:width:`400px`
:label:`img_waldo`

We can now make these intuitions more concrete 
by enumerating a few desiderata to guide our design
of a neural network architecture suitable for computer vision:

1. In the earliest layers, our network
   should respond similarly to the same patch,
   regardless of where it appears in the image. This principle is called *translation invariance* (or *translation equivariance*).
1. The earliest layers of the network should focus on local regions,
   without regard for the contents of the image in distant regions. This is the *locality* principle.
   Eventually, these local representations can be aggregated
   to make predictions at the whole image level.
1. As we proceed, deeper layers should be able to capture longer-range features of the 
   image, in a way similar to higher level vision in nature. 

Let's see how this translates into mathematics.


## Constraining the MLP

To start off, we can consider an MLP
with two-dimensional images $\mathbf{X}$ as inputs
and their immediate hidden representations
$\mathbf{H}$ similarly represented as matrices (they are two-dimensional tensors in code), where both $\mathbf{X}$ and $\mathbf{H}$ have the same shape.
Let that sink in.
We now imagine that not only the inputs but
also the hidden representations possess spatial structure.

Let $[\mathbf{X}]_{i, j}$ and $[\mathbf{H}]_{i, j}$ denote the pixel
at location $(i,j)$
in the input image and hidden representation, respectively.
Consequently, to have each of the hidden units
receive input from each of the input pixels,
we would switch from using weight matrices
(as we did previously in MLPs)
to representing our parameters
as fourth-order weight tensors $\mathsf{W}$.
Supposing that $\mathbf{U}$ contains biases,
we could formally express the fully connected layer as

$$\begin{aligned} \left[\mathbf{H}\right]_{i, j} &= [\mathbf{U}]_{i, j} + \sum_k \sum_l[\mathsf{W}]_{i, j, k, l}  [\mathbf{X}]_{k, l}\\ &=  [\mathbf{U}]_{i, j} +
\sum_a \sum_b [\mathsf{V}]_{i, j, a, b}  [\mathbf{X}]_{i+a, j+b}.\end{aligned}$$

The switch from $\mathsf{W}$ to $\mathsf{V}$ is entirely cosmetic for now
since there is a one-to-one correspondence
between coefficients in both fourth-order tensors.
We simply re-index the subscripts $(k, l)$
such that $k = i+a$ and $l = j+b$.
In other words, we set $[\mathsf{V}]_{i, j, a, b} = [\mathsf{W}]_{i, j, i+a, j+b}$.
The indices $a$ and $b$ run over both positive and negative offsets,
covering the entire image.
For any given location ($i$, $j$) in the hidden representation $[\mathbf{H}]_{i, j}$,
we compute its value by summing over pixels in $\mathbf{X}$,
centered around $(i, j)$ and weighted by $[\mathsf{V}]_{i, j, a, b}$.

Before we carry on, let's consider the total number of parameters required for a *single* layer in this parametrization: a $1000 \times 1000$ image (1 megapixel) is mapped to a $1000 \times 1000$ hidden representation. This requires $10^6 \times 10^6 = 10^{12}$ parameters, far beyond what computers can currently handle.
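The re-indexing from $\mathsf{W}$ to $\mathsf{V}$ can be checked numerically on a tiny example. The sketch below is only an illustration: it uses a $3 \times 3$ image with random weights, omits the bias $\mathbf{U}$, and the function names are hypothetical.

```python
import random

random.seed(0)
n = 3  # tiny n x n "image" so that the fourth-order tensor stays small

X = [[random.random() for _ in range(n)] for _ in range(n)]
# W[i][j][k][l]: weight connecting input pixel (k, l) to hidden unit (i, j)
W = [[[[random.random() for _ in range(n)] for _ in range(n)]
      for _ in range(n)] for _ in range(n)]

def hidden_via_W(i, j):
    # Sum over absolute pixel positions (k, l)
    return sum(W[i][j][k][l] * X[k][l] for k in range(n) for l in range(n))

def V(i, j, a, b):
    # Re-indexed weights: [V]_{i,j,a,b} = [W]_{i,j,i+a,j+b}
    return W[i][j][i + a][j + b]

def hidden_via_V(i, j):
    # Offsets a, b are chosen so that (i+a, j+b) stays inside the image
    return sum(V(i, j, a, b) * X[i + a][j + b]
               for a in range(-i, n - i) for b in range(-j, n - j))

for i in range(n):
    for j in range(n):
        assert abs(hidden_via_W(i, j) - hidden_via_V(i, j)) < 1e-12

print(n ** 4)  # number of weights in W: 81
```

Even for this toy image the weight count grows as $n^4$, which is what makes the megapixel case hopeless.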

### Translation Invariance

Now let's invoke the first principle
established above: translation invariance :cite:`Zhang.ea.1988`.
This implies that a shift in the input $\mathbf{X}$
should simply lead to a shift in the hidden representation $\mathbf{H}$.
This is only possible if $\mathsf{V}$ and $\mathbf{U}$ do not actually depend on $(i, j)$. As such,
we have $[\mathsf{V}]_{i, j, a, b} = [\mathbf{V}]_{a, b}$ and $\mathbf{U}$ is a constant, say $u$.
As a result, we can simplify the definition for $\mathbf{H}$:

$$[\mathbf{H}]_{i, j} = u + \sum_a\sum_b [\mathbf{V}]_{a, b}  [\mathbf{X}]_{i+a, j+b}.$$


This is a *convolution*!
We are effectively weighting pixels at $(i+a, j+b)$
in the vicinity of location $(i, j)$ with coefficients $[\mathbf{V}]_{a, b}$
to obtain the value $[\mathbf{H}]_{i, j}$.
Note that $[\mathbf{V}]_{a, b}$ needs many fewer coefficients than $[\mathsf{V}]_{i, j, a, b}$ since it
no longer depends on the location within the image. Consequently, the number of parameters required is no longer $10^{12}$ but a much more reasonable $4 \times 10^6$: we still have the dependency on $a, b \in (-1000, 1000)$. In short, we have made significant progress. Time-delay neural networks (TDNNs) are some of the first examples to exploit this idea :cite:`Waibel.Hanazawa.Hinton.ea.1989`.
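The claim that a shift in the input leads to a shift in the hidden representation can be illustrated with a small sketch. The assumptions here are not from the text: the kernel is stored as a dictionary from offsets $(a, b)$ to weights, pixels outside the image are read as zero, and the bias is set to $u = 0$ so that zero-filled shifts compare cleanly. The names `shared_kernel_layer` and `shift` are hypothetical.

```python
def shared_kernel_layer(X, V, u=0.0):
    """[H]_{i,j} = u + sum_{a,b} [V]_{a,b} [X]_{i+a,j+b}.

    V maps offsets (a, b) to weights; pixels outside the image count as zero.
    """
    h, w = len(X), len(X[0])
    H = [[u] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for (a, b), weight in V.items():
                if 0 <= i + a < h and 0 <= j + b < w:
                    H[i][j] += weight * X[i + a][j + b]
    return H

def shift(M, di, dj):
    """Move every entry of M by (di, dj), filling vacated cells with zero."""
    h, w = len(M), len(M[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if 0 <= i + di < h and 0 <= j + dj < w:
                out[i + di][j + dj] = M[i][j]
    return out

# A 6x6 image with a single bright pixel, and a kernel with three weights
X = [[0.0] * 6 for _ in range(6)]
X[2][2] = 1.0
V = {(0, 0): 2.0, (0, 1): -1.0, (1, 0): 0.5}

H = shared_kernel_layer(X, V)
H_of_shifted_X = shared_kernel_layer(shift(X, 2, 2), V)

# Shifting the input and then applying the layer matches
# applying the layer and then shifting the output.
print(H_of_shifted_X == shift(H, 2, 2))  # True
```

The equality holds here because nothing is pushed past the image boundary. Near the edges, the zero-padding assumption would break exact equivariance.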

### Locality

Now let's invoke the second principle: locality.
As motivated above, we believe that we should not have
to look very far away from location $(i, j)$
in order to glean relevant information
to assess what is going on at $[\mathbf{H}]_{i, j}$.
This means that outside some range $|a|> \Delta$ or $|b| > \Delta$,
we should set $[\mathbf{V}]_{a, b} = 0$.
Equivalently, we can rewrite $[\mathbf{H}]_{i, j}$ as

$$[\mathbf{H}]_{i, j} = u + \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} [\mathbf{V}]_{a, b}  [\mathbf{X}]_{i+a, j+b}.$$
:eqlabel:`eq_conv-layer`

This reduces the number of parameters from $4 \times 10^6$ to $(2\Delta + 1)^2 \approx 4 \Delta^2$, where $\Delta$ is typically smaller than $10$. As such, we reduced the number of parameters by another four orders of magnitude. Note that :eqref:`eq_conv-layer` is, in a nutshell, what is called a *convolutional layer*.
*Convolutional neural networks* (CNNs)
are a special family of neural networks that contain convolutional layers.
In the deep learning research community,
$\mathbf{V}$ is referred to as a *convolution kernel*,
a *filter*, or simply the layer's *weights*, which are learnable parameters.
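As a minimal sketch, :eqref:`eq_conv-layer` can be evaluated directly with nested loops. The storage convention `V[a + delta][b + delta]` for $[\mathbf{V}]_{a, b}$, the treatment of pixels outside the image as zero, and the function name `conv_layer` are assumptions made for this illustration. Practical implementations are far more efficient.

```python
def conv_layer(X, V, u, delta):
    """Evaluate [H]_{i,j} = u + sum_{a,b=-delta}^{delta} [V]_{a,b} [X]_{i+a,j+b}.

    V[a + delta][b + delta] stores [V]_{a,b}.
    Pixels outside the image are treated as zero (an assumption).
    """
    h, w = len(X), len(X[0])
    H = [[u] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for a in range(-delta, delta + 1):
                for b in range(-delta, delta + 1):
                    if 0 <= i + a < h and 0 <= j + b < w:
                        H[i][j] += V[a + delta][b + delta] * X[i + a][j + b]
    return H

delta = 1
k = 2 * delta + 1                      # window width: 3
V = [[1.0] * k for _ in range(k)]      # kernel that sums its 3x3 window
X = [[1.0] * 4 for _ in range(4)]      # constant 4x4 image

H = conv_layer(X, V, u=0.0, delta=delta)
print(H[1][1])  # interior unit sees all 9 pixels: 9.0
print(H[0][0])  # corner unit sees only 4 in-range pixels: 4.0
print(k * k)    # learnable weights in V: 9
```

The hidden representation keeps the shape of the input, while the number of weights depends only on $\Delta$.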

While previously, we might have required billions of parameters
to represent just a single layer in an image-processing network,
we now typically need just a few hundred, without
altering the dimensionality of either
the inputs or the hidden representations.
The price paid for this drastic reduction in parameters
is that our features are now translation invariant
and that our layer can only incorporate local information
when determining the value of each hidden activation.
All learning depends on imposing inductive bias.
When that bias agrees with reality,
we get sample-efficient models
that generalize well to unseen data.
But of course, if those biases do not agree with reality,
e.g., if images turned out not to be translation invariant,
our models might struggle even to fit our training data.

This dramatic reduction in parameters brings us to our last desideratum, 
namely that deeper layers should represent larger and more complex aspects 
of an image. This can be achieved by interleaving nonlinearities and convolutional 
layers repeatedly. 
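One way to see why stacking layers lets deeper units capture longer-range features is to track which input positions can influence a single unit. The sketch below does this along one axis, assuming every layer uses the same half-width $\Delta$ with no striding or pooling. The function name `receptive_field` is hypothetical.

```python
def receptive_field(num_layers, delta):
    """Input offsets (along one axis) that can influence one output unit."""
    positions = {0}
    for _ in range(num_layers):
        # Each layer lets a unit see offsets -delta..delta of the layer below
        positions = {p + a for p in positions for a in range(-delta, delta + 1)}
    return positions

for n in (1, 2, 3):
    print(n, len(receptive_field(n, delta=1)))  # widths 3, 5, 7
```

Under these assumptions the width grows as $2 n \Delta + 1$ for $n$ layers, so a deep stack of small kernels eventually covers large parts of the image.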

## Convolutions

Let's briefly review why :eqref:`eq_conv-layer` is called a convolution. 
In mathematics, the *convolution* between two functions :cite:`Rudin.1973`,
say $f, g: \mathbb{R}^d \to \mathbb{R}$, is defined as

$$(f * g)(\mathbf{x}) = \int f(\mathbf{z}) g(\mathbf{x}-\mathbf{z}) d\mathbf{z}.$$

That is, we measure the overlap between $f$ and $g$
when one function is "flipped" and shifted by $\mathbf{x}$.
Whenever we have discrete objects, the integral turns into a sum.
For instance, for vectors from
the set of square-summable infinite-dimensional vectors
with index running over $\mathbb{Z}$ we obtain the following definition:

$$(f * g)(i) = \sum_a f(a) g(i-a).$$

For two-dimensional tensors, we have a corresponding sum
with indices $(a, b)$ for $f$ and $(i-a, j-b)$ for $g$, respectively:

$$(f * g)(i, j) = \sum_a\sum_b f(a, b) g(i-a, j-b).$$
:eqlabel:`eq_2d-conv-discrete`

This looks similar to :eqref:`eq_conv-layer`, with one major difference.
Rather than using $(i+a, j+b)$, we are using the difference instead.
Note, though, that this distinction is mostly cosmetic
since we can always match the notation between
:eqref:`eq_conv-layer` and :eqref:`eq_2d-conv-discrete`.
Our original definition in :eqref:`eq_conv-layer` more properly
describes a *cross-correlation*.
We will come back to this in the following section.
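The relation between the two operations is easiest to see in one dimension. The following sketch, with hypothetical helper names `conv1d` and `xcorr1d` and finite sequences standing in for the infinite ones, shows that cross-correlating with a flipped kernel reproduces the convolution on the positions where the kernel fully overlaps the input.

```python
def conv1d(f, g):
    """Full discrete convolution: (f * g)(i) = sum_a f(a) g(i - a)."""
    out = [0] * (len(f) + len(g) - 1)
    for a, fa in enumerate(f):
        for c, gc in enumerate(g):
            out[a + c] += fa * gc  # c plays the role of i - a, so i = a + c
    return out

def xcorr1d(x, v):
    """Cross-correlation on fully overlapping positions: h(i) = sum_a v(a) x(i + a)."""
    n = len(x) - len(v) + 1
    return [sum(v[a] * x[i + a] for a in range(len(v))) for i in range(n)]

x = [1, 2, 3, 4]
v = [1, 0, -1]

print(conv1d(x, v))         # [1, 2, 2, 2, -3, -4]
print(xcorr1d(x, v))        # [-2, -2]
print(xcorr1d(x, v[::-1]))  # [2, 2], the fully overlapping part of conv1d(x, v)
```

Since the kernel is learned from data, a layer that uses cross-correlation can simply learn the flipped kernel, which is why the distinction is mostly cosmetic.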


## Channels
:label:`subsec_why-conv-channels`

Returning to our Waldo detector, let's see what this looks like.
The convolutional layer picks windows of a given size
and weighs intensities according to the filter $\mathsf{V}$, as demonstrated in :numref:`fig_waldo_mask`.
We might aim to learn a model so that
wherever the "waldoness" is highest,
we should find a peak in the hidden layer representations.

![Detect Waldo (image courtesy of William Murphy (Infomatique)).](../img/waldo-mask.jpg)
:width:`400px`
:label:`fig_waldo_mask`

There is just one problem with this approach.
So far, we blissfully ignored that images consist
of three channels: red, green, and blue. 
In sum, images are not two-dimensional objects
but rather third-order tensors,
characterized by a height, width, and channel,
e.g., with shape $1024 \times 1024 \times 3$ pixels. 
While the first two of these axes concern spatial relationships,
the third can be regarded as assigning
a multidimensional representation to each pixel location.
We thus index $\mathsf{X}$ as $[\mathsf{X}]_{i, j, k}$.
The convolutional filter has to adapt accordingly.
Instead of $[\mathbf{V}]_{a,b}$, we now have $[\mathsf{V}]_{a,b,c}$.

Moreover, just as our input consists of a third-order tensor,
it turns out to be a good idea to similarly formulate
our hidden representations as third-order tensors $\mathsf{H}$.
In other words, rather than just having a single hidden representation
corresponding to each spatial location,
we want an entire vector of hidden representations
corresponding to each spatial location.
We could think of the hidden representations as comprising
a number of two-dimensional grids stacked on top of each other.
As in the inputs, these are sometimes called *channels*.
They are also sometimes called *feature maps*,
as each provides a spatialized set
of learned features for the subsequent layer.
Intuitively, you might imagine that at lower layers that are closer to inputs,
some channels could become specialized to recognize edges while
others could recognize textures.

To support multiple channels in both inputs ($\mathsf{X}$) and hidden representations ($\mathsf{H}$),
we can add a fourth coordinate to $\mathsf{V}$: $[\mathsf{V}]_{a, b, c, d}$.
Putting everything together we have:

$$[\mathsf{H}]_{i,j,d} = \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} \sum_c [\mathsf{V}]_{a, b, c, d} [\mathsf{X}]_{i+a, j+b, c},$$
:eqlabel:`eq_conv-layer-channels`

where $d$ indexes the output channels in the hidden representations $\mathsf{H}$. The subsequent convolutional layer will go on to take a third-order tensor, $\mathsf{H}$, as input.
We take
:eqref:`eq_conv-layer-channels`,
because of its generality, as
the definition of a convolutional layer for multiple channels, where $\mathsf{V}$ is a kernel or filter of the layer.
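A direct, unoptimized evaluation of :eqref:`eq_conv-layer-channels` might look as follows. As before, the nested-list layout, the storage convention `V[a + delta][b + delta][c][d]`, the zero treatment of out-of-range pixels, and the name `conv_layer_channels` are assumptions made for illustration only.

```python
def conv_layer_channels(X, V, delta):
    """Evaluate [H]_{i,j,d} = sum_{a,b,c} [V]_{a,b,c,d} [X]_{i+a,j+b,c}.

    X[i][j][c]: input of shape (height, width, input channels).
    V[a + delta][b + delta][c][d]: kernel with offsets a, b in [-delta, delta].
    Pixels outside the image count as zero (an assumption).
    """
    h, w, c_in = len(X), len(X[0]), len(X[0][0])
    c_out = len(V[0][0][0])
    H = [[[0.0] * c_out for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for a in range(-delta, delta + 1):
                for b in range(-delta, delta + 1):
                    if not (0 <= i + a < h and 0 <= j + b < w):
                        continue
                    for c in range(c_in):
                        for d in range(c_out):
                            H[i][j][d] += (V[a + delta][b + delta][c][d]
                                           * X[i + a][j + b][c])
    return H

delta, c_in, c_out = 1, 3, 2
k = 2 * delta + 1
V = [[[[0.0] * c_out for _ in range(c_in)] for _ in range(k)] for _ in range(k)]
for c in range(c_in):
    V[delta][delta][c][0] = 1.0  # output channel 0: sum all channels at the center pixel
for a in range(k):
    for b in range(k):
        V[a][b][0][1] = 1.0      # output channel 1: sum input channel 0 over the window

X = [[[1.0, 2.0, 3.0] for _ in range(3)] for _ in range(3)]  # 3x3 image, 3 channels
H = conv_layer_channels(X, V, delta)

print(H[1][1])                # [6.0, 9.0]
print(k * k * c_in * c_out)   # weights in V: 54
```

Each output channel $d$ has its own $(2\Delta+1) \times (2\Delta+1) \times c_\textrm{in}$ slice of the kernel, so the weight count scales with the product of input and output channels.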

There are still many operations that we need to address.
For instance, we need to figure out how to combine all the hidden representations
into a single output, e.g., whether there is a Waldo *anywhere* in the image.
We also need to decide how to compute things efficiently,
how to combine multiple layers,
which activation functions are appropriate,
and how to make reasonable design choices
to yield networks that are effective in practice.
We turn to these issues in the remainder of the chapter.

## Summary and Discussion

In this section we derived the structure of convolutional neural networks from first principles. While it is unclear whether this was the route taken to the invention of CNNs, it is satisfying to know that they are the *right* choice when applying reasonable principles to how image processing and computer vision algorithms should operate, at least at lower levels. In particular, translation invariance in images implies that all patches of an image will be treated in the same manner. Locality means that only a small neighborhood of pixels will be used to compute the corresponding hidden representations. Some of the earliest references to CNNs are in the form of the Neocognitron :cite:`Fukushima.1982`. 

A second principle that we encountered in our reasoning is how to reduce the number of parameters in a function class without limiting its expressive power, at least, whenever certain assumptions on the model hold. We saw a dramatic reduction of complexity as a result of this restriction, turning computationally and statistically infeasible problems into tractable models. 

Adding channels allowed us to bring back some of the complexity that was lost due to the restrictions imposed on the convolutional kernel by locality and translation invariance. Note that it is quite natural to add channels other than just red, green, and blue. Many satellite 
images, in particular for agriculture and meteorology, have tens to hundreds of channels, 
generating hyperspectral images instead. They report data on many different wavelengths. In the following we will see how to use convolutions effectively to manipulate the dimensionality of the images they operate on, how to move from location-based to channel-based representations, and how to deal with large numbers of categories efficiently. 

## Exercises

1. Assume that the size of the convolution kernel is $\Delta = 0$.
   Show that in this case the convolution kernel
   implements an MLP independently for each set of channels. This leads to the Network in Network 
   architectures :cite:`Lin.Chen.Yan.2013`. 
1. Audio data is often represented as a one-dimensional sequence. 
    1. When might you want to impose locality and translation invariance for audio? 
    1. Derive the convolution operations for audio.
    1. Can you treat audio using the same tools as computer vision? Hint: use the spectrogram.
1. Why might translation invariance not be a good idea after all? Give an example. 
1. Do you think that convolutional layers might also be applicable for text data?
   Which problems might you encounter with language?
1. What happens with convolutions when an object is at the boundary of an image?
1. Prove that the convolution is symmetric, i.e., $f * g = g * f$.

[Discussions](https://discuss.d2l.ai/t/64)


