# Multiple Input and Multiple Output Channels
:label:`sec_channels`

In :numref:`subsec_why-conv-channels` we described the multiple channels
that comprise each image (e.g., color images have the standard RGB channels
to indicate the amount of red, green, and blue) and convolutional layers for multiple channels.
Until now, however, we simplified all of our numerical examples
by working with just a single input and a single output channel.
This allowed us to think of our inputs, convolution kernels,
and outputs each as two-dimensional tensors.

When we add channels into the mix,
our inputs and hidden representations
both become three-dimensional tensors.
For example, each RGB input image has shape $3\times h\times w$.
We refer to this axis, with a size of 3, as the *channel* dimension. The notion of
channels is as old as CNNs themselves: for instance, LeNet-5 :cite:`LeCun.Jackel.Bottou.ea.1995` uses them.
In this section, we will take a deeper look
at convolution kernels with multiple input and multiple output channels.

In [1]:
import torch
from d2l import torch as d2l

## Multiple Input Channels

When the input data contains multiple channels,
we need to construct a convolution kernel
with the same number of input channels as the input data,
so that it can perform cross-correlation with the input data.
Assuming that the number of channels for the input data is $c_\textrm{i}$,
the number of input channels of the convolution kernel also needs to be $c_\textrm{i}$. If our convolution kernel's window shape is $k_\textrm{h}\times k_\textrm{w}$,
then, when $c_\textrm{i}=1$, we can think of our convolution kernel
as just a two-dimensional tensor of shape $k_\textrm{h}\times k_\textrm{w}$.

However, when $c_\textrm{i}>1$, we need a kernel
that contains a tensor of shape $k_\textrm{h}\times k_\textrm{w}$ for *every* input channel. Concatenating these $c_\textrm{i}$ tensors together
yields a convolution kernel of shape $c_\textrm{i}\times k_\textrm{h}\times k_\textrm{w}$.
Since the input and convolution kernel each have $c_\textrm{i}$ channels,
we can perform a cross-correlation operation
on the two-dimensional tensor of the input
and the two-dimensional tensor of the convolution kernel
for each channel, adding the $c_\textrm{i}$ results together
(summing over the channels)
to yield a two-dimensional tensor.
This is the result of a two-dimensional cross-correlation
between a multi-channel input and
a multi-input-channel convolution kernel.
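
Written out element-wise, the operation described above can be sketched as follows. This assumes zero-based indexing, no padding, and a stride of 1; the index names $i, j, a, b, c$ are introduced here only for illustration, with $\mathsf{X}$ the input, $\mathsf{K}$ the kernel, and $\mathsf{Y}$ the two-dimensional output:

$$\mathsf{Y}_{i,j} = \sum_{c=0}^{c_\textrm{i}-1} \sum_{a=0}^{k_\textrm{h}-1} \sum_{b=0}^{k_\textrm{w}-1} \mathsf{X}_{c,\, i+a,\, j+b}\, \mathsf{K}_{c,\, a,\, b}.$$

The two inner sums are the familiar single-channel cross-correlation; the outer sum adds the per-channel results together.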

:numref:`fig_conv_multi_in` provides an example 
of a two-dimensional cross-correlation with two input channels.
The shaded portions are the first output element
as well as the input and kernel tensor elements used for the output computation:
$(1\times1+2\times2+4\times3+5\times4)+(0\times0+1\times1+3\times2+4\times3)=56$.

![Cross-correlation computation with two input channels.](../img/conv-multi-in.svg)
:label:`fig_conv_multi_in`
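
The arithmetic for the first output element can be replayed in a few lines of plain Python. This is only a sketch: the variable names and the helper `window_dot` are made up for illustration, and the numbers are the shaded values quoted in the formula above.

```python
# Top-left 2x2 windows of the two input channels (shaded values in the figure)
window_ch0 = [[0, 1], [3, 4]]
window_ch1 = [[1, 2], [4, 5]]
# The matching 2x2 kernel slices, one per input channel
kernel_ch0 = [[0, 1], [2, 3]]
kernel_ch1 = [[1, 2], [3, 4]]


def window_dot(window, kernel):
    """Sum of elementwise products of two equally sized 2D lists."""
    return sum(x * k
               for w_row, k_row in zip(window, kernel)
               for x, k in zip(w_row, k_row))


# One partial result per channel, then a sum over the channel dimension
per_channel = [window_dot(window_ch0, kernel_ch0),
               window_dot(window_ch1, kernel_ch1)]
first_output = sum(per_channel)
print(per_channel, first_output)  # [19, 37] 56
```

The two partial sums, 19 and 37, are the bracketed terms of the formula, and their total is the first output element.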


To make sure we really understand what is going on here,
we can (**implement cross-correlation operations with multiple input channels**) ourselves.
Notice that all we are doing is performing a cross-correlation operation
per channel and then adding up the results.


In [2]:
def corr2d_multi_in(X, K):
    # Pair up the channel slices of X and K (their 0th dimension),
    # cross-correlate each pair, and sum the per-channel results
    return sum(d2l.corr2d(x, k) for x, k in zip(X, K))
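
The function above relies on `d2l.corr2d`, the single-channel routine from the earlier section. For readers who want to run the idea without the `d2l` package, the following is a minimal standalone sketch. The names `corr2d_single` and `corr2d_multi_in_standalone` are hypothetical stand-ins, and the single-channel helper is assumed to use no padding and a stride of 1.

```python
import torch


def corr2d_single(X, K):
    """2D cross-correlation of one channel (no padding, stride 1)."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Elementwise product of the current window with the kernel
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y


def corr2d_multi_in_standalone(X, K):
    # One cross-correlation per channel, summed over the channel dimension
    return sum(corr2d_single(x, k) for x, k in zip(X, K))


# Two-channel input and kernel with small integer values
X_demo = torch.arange(9.0).reshape(3, 3)
X_demo = torch.stack((X_demo, X_demo + 1))  # shape (2, 3, 3)
K_demo = torch.arange(4.0).reshape(2, 2)
K_demo = torch.stack((K_demo, K_demo + 1))  # shape (2, 2, 2)
Y_demo = corr2d_multi_in_standalone(X_demo, K_demo)
print(Y_demo)
```

The explicit Python loops keep the computation easy to follow; they are not meant to be fast.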

We can construct the input tensor `X` and the kernel tensor `K`
corresponding to the values in :numref:`fig_conv_multi_in`
to (**validate the output**) of the cross-correlation operation.


In [3]:
X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
               [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])

corr2d_multi_in(X, K)

tensor([[ 56.,  72.],
        [104., 120.]])

## Multiple Output Channels
:label:`subsec_multi-output-channels`

Regardless of the number of input channels,
so far we always ended up with one output channel.
However, as we discussed in :numref:`subsec_why-conv-channels`,
it turns out to be essential to have multiple channels at each layer.
In the most popular neural network architectures,
we actually increase the channel dimension
as we go deeper in the neural network,
typically downsampling to trade off spatial resolution
for greater *channel depth*.
Intuitively, you could think of each channel
as responding to a different set of features.
The reality is a bit more complicated than this. A naive interpretation would suggest
that representations are learned independently per pixel or per channel.
Instead, channels are optimized to be jointly useful.
Rather than a single channel acting as an edge detector,
it may be that some direction in channel space corresponds to detecting edges.

Denote by $c_\textrm{i}$ and $c_\textrm{o}$ the number
of input and output channels, respectively,
and by $k_\textrm{h}$ and $k_\textrm{w}$ the height and width of the kernel.
To get an output with multiple channels,
we can create a kernel tensor
of shape $c_\textrm{i}\times k_\textrm{h}\times k_\textrm{w}$
for *every* output channel.
We concatenate them on the output channel dimension,
so that the shape of the convolution kernel
is $c_\textrm{o}\times c_\textrm{i}\times k_\textrm{h}\times k_\textrm{w}$.
In cross-correlation operations,
the result on each output channel is calculated
from the convolution kernel corresponding to that output channel
and takes input from all channels in the input tensor.
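
As a quick sanity check of the kernel shape, one can inspect the weight tensor of PyTorch's built-in `nn.Conv2d` layer, which is stored in the same $c_\textrm{o}\times c_\textrm{i}\times k_\textrm{h}\times k_\textrm{w}$ layout. The sizes below are arbitrary illustrative choices, and the input carries an extra leading batch dimension because the built-in layer expects one.

```python
import torch
from torch import nn

c_i, c_o, k_h, k_w = 2, 3, 2, 2  # illustrative sizes, not prescribed by the text
conv = nn.Conv2d(in_channels=c_i, out_channels=c_o,
                 kernel_size=(k_h, k_w), bias=False)
print(conv.weight.shape)  # torch.Size([3, 2, 2, 2])

# The leading 1 is the batch dimension expected by nn.Conv2d
X_demo = torch.rand(1, c_i, 3, 3)
Y_demo = conv(X_demo)
print(Y_demo.shape)  # torch.Size([1, 3, 2, 2])
```

Each of the `c_o` slices `conv.weight[o]` is a full $c_\textrm{i}\times k_\textrm{h}\times k_\textrm{w}$ kernel that reads all input channels and produces one output channel.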

We implement a cross-correlation function
to [**calculate the output of multiple channels**] as shown below.


In [4]:
def corr2d_multi_in_out(X, K):
    # Iterate through the 0th dimension of K, and each time, perform
    # cross-correlation operations with input X. All of the results are
    # stacked together
    return torch.stack([corr2d_multi_in(X, k) for k in K], 0)

We construct a simple convolution kernel with three output channels
by stacking the kernel tensor `K` with `K+1` and `K+2` along a new leading dimension.


In [5]:
K = torch.stack((K, K + 1, K + 2), 0)
K.shape

torch.Size([3, 2, 2, 2])

Below, we perform cross-correlation operations
on the input tensor `X` with the kernel tensor `K`.
Now the output contains three channels.
The first output channel matches the result we obtained earlier
with the same input `X` and the single-output-channel kernel,
because the first slice of the new `K` is the original kernel.


In [6]:
corr2d_multi_in_out(X, K)

tensor([[[ 56.,  72.],
         [104., 120.]],

        [[ 76., 100.],
         [148., 172.]],

        [[ 96., 128.],
         [192., 224.]]])
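
The same numbers can be cross-checked against PyTorch's built-in `torch.nn.functional.conv2d`, which also computes a cross-correlation. The sketch below rebuilds the input and kernel values from the example under new names (`X_demo`, `K_base`, `K_demo`, chosen here for illustration) so that it runs on its own.

```python
import torch
import torch.nn.functional as F

# Rebuild the example input (2 channels) and stacked kernel (3 output channels)
X_demo = torch.stack((torch.arange(9.0).reshape(3, 3),
                      torch.arange(1.0, 10.0).reshape(3, 3)))   # (2, 3, 3)
K_base = torch.stack((torch.arange(4.0).reshape(2, 2),
                      torch.arange(1.0, 5.0).reshape(2, 2)))    # (2, 2, 2)
K_demo = torch.stack((K_base, K_base + 1, K_base + 2))          # (3, 2, 2, 2)

# F.conv2d expects a batch dimension on the input, so add and remove one
Y_builtin = F.conv2d(X_demo.unsqueeze(0), K_demo).squeeze(0)    # (3, 2, 2)
print(Y_builtin)
```

If the hand-written function and the built-in agree, `Y_builtin` should reproduce the three-channel output shown above.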

## $1\times 1$ Convolutional Layer
:label:`subsec_1x1`

At first, a [**$1 \times 1$ convolution**], i.e., $k_\textrm{h} = k_\textrm{w} = 1$,
does not seem to make much sense.
After all, a convolution correlates adjacent pixels.
A $1 \times 1$ convolution obviously does not.
Nonetheless, they are popular operations that are sometimes included
in the designs of complex deep networks :cite:`Lin.Chen.Yan.2013,Szegedy.Ioffe.Vanhoucke.ea.2017`.
Let's see in some detail what it actually does.

Because the minimum window is used,
the $1\times 1$ convolution loses the ability
of larger convolutional layers
to recognize patterns consisting of interactions
among adjacent elements in the height and width dimensions.
The only computation of the $1\times 1$ convolution occurs
on the channel dimension.

:numref:`fig_conv_1x1` shows the cross-correlation computation
using the $1\times 1$ convolution kernel
with 3 input channels and 2 output channels.
Note that the inputs and outputs have the same height and width.
Each element in the output is derived
from a linear combination of elements *at the same position*
in the input image.
You could think of the $1\times 1$ convolutional layer
as constituting a fully connected layer applied at every single pixel location
to transform the $c_\textrm{i}$ corresponding input values into $c_\textrm{o}$ output values.
Because this is still a convolutional layer,
the weights are tied across pixel location.
Thus the $1\times 1$ convolutional layer requires $c_\textrm{o}\times c_\textrm{i}$ weights
(plus the bias). Also note that convolutional layers are typically followed 
by nonlinearities. This ensures that $1 \times 1$ convolutions cannot simply be 
folded into other convolutions. 

![The cross-correlation computation uses the $1\times 1$ convolution kernel with three input channels and two output channels. The input and output have the same height and width.](../img/conv-1x1.svg)
:label:`fig_conv_1x1`
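
The parameter count can be checked on PyTorch's built-in layer. The sketch below builds an `nn.Conv2d` with a $1\times 1$ kernel, three input channels, and two output channels, then applies it to inputs of two different (arbitrarily chosen) spatial sizes to illustrate that the weights are shared across pixel locations.

```python
import torch
from torch import nn

conv1x1 = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=1)
n_weights = conv1x1.weight.numel()  # c_o * c_i = 2 * 3
n_biases = conv1x1.bias.numel()     # one bias per output channel

# The same few parameters handle any spatial size (leading 1 is the batch)
small = conv1x1(torch.rand(1, 3, 4, 4))
large = conv1x1(torch.rand(1, 3, 32, 32))
print(n_weights, n_biases, small.shape, large.shape)
```

The number of parameters depends only on the channel counts, not on the height or width of the input.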

Let's check whether this works in practice:
we implement a $1 \times 1$ convolution
using a fully connected layer.
The only caveat is that we need to adjust
the data shape before and after the matrix multiplication.



In [7]:
def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = X.reshape((c_i, h * w))
    K = K.reshape((c_o, c_i))
    # Matrix multiplication in the fully connected layer
    Y = torch.matmul(K, X)
    return Y.reshape((c_o, h, w))

When performing $1\times 1$ convolutions,
the above function is equivalent to the previously implemented cross-correlation function `corr2d_multi_in_out`.
Let's check this with some sample data.


In [8]:
X = torch.normal(0, 1, (3, 3, 3))
K = torch.normal(0, 1, (2, 3, 1, 1))
Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)
assert float(torch.abs(Y1 - Y2).sum()) < 1e-6
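
As an alternative view of the same computation, the channel mixing can be written as a single `torch.einsum` contraction over the input-channel index, without any explicit reshaping of the input. This is only a sketch; the function name `conv1x1_einsum` and the `_demo` variables are hypothetical, and the comparison uses a small tolerance because floating-point summation order may differ between the two paths.

```python
import torch


def conv1x1_einsum(X, K):
    # Drop the two trailing size-1 kernel axes, then contract over input
    # channels: Y[o, h, w] = sum_i K[o, i] * X[i, h, w]
    return torch.einsum('oi,ihw->ohw', K[:, :, 0, 0], X)


X_demo = torch.randn(3, 3, 3)     # (c_i, h, w)
K_demo = torch.randn(2, 3, 1, 1)  # (c_o, c_i, 1, 1)
Y_einsum = conv1x1_einsum(X_demo, K_demo)

# Reference: the reshape-and-matmul formulation used in this section
Y_matmul = (K_demo.reshape(2, 3) @ X_demo.reshape(3, 9)).reshape(2, 3, 3)
print(torch.allclose(Y_einsum, Y_matmul, atol=1e-6))
```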

## Discussion

Channels allow us to combine the best of both worlds: MLPs that allow for significant nonlinearities and convolutions that allow for *localized* analysis of features. In particular, channels allow the CNN to reason with multiple features, such as edge and shape detectors at the same time. They also offer a practical trade-off between the drastic parameter reduction arising from translation invariance and locality, and the need for expressive and diverse models in computer vision. 

Note, though, that this flexibility comes at a price. Given an image of size $(h \times w)$, the cost for computing a $k \times k$ convolution is $\mathcal{O}(h \cdot w \cdot k^2)$. For $c_\textrm{i}$ and $c_\textrm{o}$ input and output channels respectively this increases to $\mathcal{O}(h \cdot w \cdot k^2 \cdot c_\textrm{i} \cdot c_\textrm{o})$. For a $256 \times 256$ pixel image with a $5 \times 5$ kernel and $128$ input and output channels respectively this amounts to over 53 billion operations (we count multiplications and additions separately). Later on we will encounter effective strategies to cut down on the cost, e.g., by requiring the channel-wise operations to be block-diagonal, leading to architectures such as ResNeXt :cite:`Xie.Girshick.Dollar.ea.2017`. 
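
The operation count quoted above can be reproduced with a short back-of-the-envelope calculation. The helper `conv_ops` is a hypothetical function written for this illustration. It assumes $h \times w$ output locations (i.e., padding that preserves the spatial size) and counts each multiplication and each addition as one operation. Its `groups` argument is likewise an assumption of this sketch, modeling block-diagonal channel mixing in which each block connects only its own share of the input and output channels.

```python
def conv_ops(h, w, k, c_in, c_out, groups=1):
    """Approximate multiply-plus-add count for a k x k convolution.

    Assumes h x w output locations and that c_in and c_out are
    divisible by groups.
    """
    # Each group mixes c_in/groups inputs into c_out/groups outputs
    macs = h * w * k * k * (c_in // groups) * (c_out // groups) * groups
    return 2 * macs  # one multiplication and one addition per term


dense = conv_ops(256, 256, 5, 128, 128)
grouped = conv_ops(256, 256, 5, 128, 128, groups=4)
print(f"{dense:,}")    # 53,687,091,200
print(f"{grouped:,}")  # 13,421,772,800
```

Under these assumptions, splitting the channels into four blocks cuts the count by a factor of four.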

## Exercises

1. Assume that we have two convolution kernels of size $k_1$ and $k_2$, respectively 
   (with no nonlinearity in between).
    1. Prove that the result of the operation can be expressed by a single convolution.
    1. What is the dimensionality of the equivalent single convolution?
    1. Is the converse true, i.e., can you always decompose a convolution into two smaller ones?
1. Assume an input of shape $c_\textrm{i}\times h\times w$ and a convolution kernel of shape 
   $c_\textrm{o}\times c_\textrm{i}\times k_\textrm{h}\times k_\textrm{w}$, padding of $(p_\textrm{h}, p_\textrm{w})$, and stride of $(s_\textrm{h}, s_\textrm{w})$.
    1. What is the computational cost (multiplications and additions) for the forward propagation?
    1. What is the memory footprint?
    1. What is the memory footprint for the backward computation?
    1. What is the computational cost for the backpropagation?
1. By what factor does the number of calculations increase if we double both the number of input channels 
   $c_\textrm{i}$ and the number of output channels $c_\textrm{o}$? What happens if we double the padding?
1. Are the variables `Y1` and `Y2` in the final example of this section exactly the same? Why?
1. Express convolutions as a matrix multiplication, even when the convolution window is not $1 \times 1$. 
1. Your task is to implement fast convolutions with a $k \times k$ kernel. One of the algorithm candidates 
   is to scan horizontally across the source, reading a $k$-wide strip and computing the $1$-wide output strip 
   one value at a time. The alternative is to read a $k + \Delta$ wide strip and compute a $\Delta$-wide 
   output strip. Why is the latter preferable? Is there a limit to how large you should choose $\Delta$?
1. Assume that we have a $c \times c$ matrix. 
    1. How much faster is it to multiply with a block-diagonal matrix if the matrix is broken up into $b$ blocks?
    1. What is the downside of having $b$ blocks? How could you fix it, at least partly?


[Discussions](https://discuss.d2l.ai/t/70)
