# Multiple Input and Output Channels

Until now, we have ignored the fact that most images have three colour channels (R, G, B). This allowed us to treat our inputs, convolutional kernels and outputs as two-dimensional tensors.

When we consider channels, our inputs and hidden representations become three-dimensional tensors. The additional axis is known as the _channel_ dimension.
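
As a small illustration, the sketch below builds a random tensor standing in for an RGB image. The 32x32 size and the channel-first layout `(channels, height, width)` are assumptions chosen to match the code later in this section.

```python
import torch

# A hypothetical 32x32 RGB image in channel-first layout: (channels, height, width)
image = torch.rand(3, 32, 32)

print(image.shape)     # torch.Size([3, 32, 32])
print(image[0].shape)  # one channel on its own is a 2D tensor: torch.Size([32, 32])
```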

In [1]:
import torch
from d2l import torch as d2l

## Multiple Input Channels

Our convolutional kernel is essentially the same, but stretched along a third dimension for the channels, so it becomes a $c_i \times k_h \times k_w$ tensor, where $c_i$ is the number of input channels.

We slide the kernel over the image as before, but now each input channel is cross-correlated with its own two-dimensional slice of the kernel. We then sum the per-channel results to produce a single two-dimensional output.

![Screenshot 2025-03-23 at 19.34.26.png](attachment:9ec12517-1af6-44e0-9b78-808dd392f87a.png)
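
Written out as a formula, the operation described above looks roughly as follows. The index names $u, v, c, a, b$ are introduced here for illustration only, indices start at zero, and the channel axis comes first, as in the code in this section.

$$
Y_{u,v} = \sum_{c=0}^{c_i - 1} \; \sum_{a=0}^{k_h - 1} \; \sum_{b=0}^{k_w - 1} X_{c,\,u+a,\,v+b} \, K_{c,\,a,\,b}
$$

The two inner sums are the ordinary two-dimensional cross-correlation for channel $c$, and the outer sum adds the per-channel results together.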

In [4]:
def corr2d_multi_in(X, K):
    # Iterate over the channel dimensions then sum them
    return sum(d2l.corr2d(x, k) for x, k in zip(X, K))
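
The function above relies on `d2l.corr2d`, which is not defined in these notes. Assuming it is the plain single-channel cross-correlation (no padding, stride 1), a minimal stand-in might look like the sketch below. The names `corr2d_sketch`, `X_single` and `K_single` are hypothetical.

```python
import torch

def corr2d_sketch(X, K):
    """Single-channel 2D cross-correlation (no padding, stride 1)."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Multiply the kernel element-wise with the window at (i, j), then sum
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X_single = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K_single = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
print(corr2d_sketch(X_single, K_single))  # values: [[19, 25], [37, 43]]
```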

In [15]:
X = torch.tensor(
[
    [   # Channel 1
        [0.0, 1.0, 2.0], 
        [3.0, 4.0, 5.0], 
        [6.0, 7.0, 8.0]
    ],
    [   # Channel 2
        [1.0, 2.0, 3.0], 
        [4.0, 5.0, 6.0], 
        [7.0, 8.0, 9.0]
    ]
])

K = torch.tensor([
    [   # Channel 1
        [0.0, 1.0],
        [2.0, 3.0]
    ],
    [   # Channel 2
        [1.0, 2.0],
        [3.0, 4.0]
    ]
])

corr2d_multi_in(X, K)

tensor([[ 56.,  72.],
        [104., 120.]])
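
As a check on the top-left entry of this output, the contribution of each channel can be worked out by hand from the top-left $2 \times 2$ window of each input channel and the matching kernel slice:

$$
Y_{0,0} = \underbrace{(0 \cdot 0 + 1 \cdot 1 + 3 \cdot 2 + 4 \cdot 3)}_{\text{channel 1: } 19} + \underbrace{(1 \cdot 1 + 2 \cdot 2 + 4 \cdot 3 + 5 \cdot 4)}_{\text{channel 2: } 37} = 56
$$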

## Multiple Output Channels

Summing everything into a single output channel is rarely what we want, though: each layer should be able to produce multiple channels. In fact, we commonly _increase_ the number of channels as we go deeper into the network, typically downsampling the spatial resolution in exchange for greater channel depth.
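
The sketch below shows one possible way this trade-off can appear in practice. The channel counts, the input size and the use of a strided `nn.Conv2d` for downsampling are assumptions made for illustration, not something these notes prescribe.

```python
import torch
from torch import nn

# Hypothetical sizes: 16 input channels on a 32x32 grid
x = torch.rand(1, 16, 32, 32)  # (batch, channels, height, width)

# Stride 2 halves the height and width while the layer doubles the channel count
conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=2, padding=1)
y = conv(x)

print(y.shape)  # torch.Size([1, 32, 16, 16])
```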

We can create a kernel tensor of shape $c_i \times k_h \times k_w$ for every output channel, so that with $c_o$ output channels the full kernel has shape $c_o \times c_i \times k_h \times k_w$.

So, for multiple output channels, we add an output-channel dimension to the kernel, iterate over that dimension, and stack the results together to produce the output tensor.
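
PyTorch's built-in layer stores its weights in the same $c_o \times c_i \times k_h \times k_w$ layout, which can be confirmed with a short sketch. The sizes here ($c_i = 2$, $c_o = 3$, a $2 \times 2$ kernel) are chosen to match the example used in this section.

```python
from torch import nn

# Hypothetical layer with c_i = 2, c_o = 3 and a 2x2 kernel; bias omitted for simplicity
conv = nn.Conv2d(in_channels=2, out_channels=3, kernel_size=2, bias=False)

print(conv.weight.shape)  # torch.Size([3, 2, 2, 2]), i.e. (c_o, c_i, k_h, k_w)
```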

In [25]:
def corr2d_multi_in_out(X, K):
    # Iterate through the 0th dimension of K, and each time, perform
    # cross-correlation operations with input X. All of the results are
    # stacked together
    return torch.stack([corr2d_multi_in(X, k) for k in K], 0)

In [26]:
X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])
corr2d_multi_in(X, K)

tensor([[ 56.,  72.],
        [104., 120.]])

In [27]:
K = torch.stack((K, K + 1, K + 2), 0)
K.shape

torch.Size([3, 2, 2, 2])

In [28]:
corr2d_multi_in_out(X, K)

tensor([[[ 56.,  72.],
         [104., 120.]],

        [[ 76., 100.],
         [148., 172.]],

        [[ 96., 128.],
         [192., 224.]]])
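
As a sanity check, the same result should come out of PyTorch's built-in `torch.nn.functional.conv2d`, which also computes a cross-correlation. The sketch below is self-contained and rebuilds the same input and kernel. The names `X_in`, `K_base` and `K_stacked` are hypothetical.

```python
import torch
import torch.nn.functional as F

X_in = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
                     [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K_base = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])
K_stacked = torch.stack((K_base, K_base + 1, K_base + 2), 0)  # shape (3, 2, 2, 2)

# F.conv2d expects a leading batch axis on the input: (batch, c_i, h, w)
Y_builtin = F.conv2d(X_in.unsqueeze(0), K_stacked).squeeze(0)

print(Y_builtin.shape)  # torch.Size([3, 2, 2])
print(Y_builtin)
```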

## The 1x1 Convolutional Layer

At first, a $1 \times 1$ convolution does not appear to make sense: with a kernel this small, the layer loses the ability to correlate neighbouring pixels across the spatial dimensions of the input and only combines values across the channel dimension.

It produces an output with the same height and width as the input, where each output pixel is a linear combination of the input values at the same pixel location _only_. It acts like a fully connected layer applied independently at every pixel, mapping the $c_i$ input channels to $c_o$ output channels.

$1 \times 1$ convolutions are commonly used in deep networks, for example to change the number of channels without changing the spatial resolution.

In [29]:
def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]

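    # Flatten the spatial axes so that each column holds one pixel's channel values
    X = X.reshape((c_i, h * w))
    # A 1x1 kernel is just a (c_o, c_i) weight matrix
    K = K.reshape((c_o, c_i))
    # Matrix multiplication, as in a fully connected layer applied at every pixel
    Y = torch.matmul(K, X)
    return Y.reshape((c_o, h, w))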

In [30]:
X = torch.normal(0, 1, (3, 3, 3))
K = torch.normal(0, 1, (2, 3, 1, 1))
Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)
assert float(torch.abs(Y1 - Y2).sum()) < 1e-6
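
The "fully connected layer per pixel" view can also be checked against the built-in convolution. The sketch below is self-contained and uses `torch.einsum` to apply the weight matrix at every pixel. The seed and the names `X_rand`, `K_1x1` and `W` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X_rand = torch.normal(0, 1, (3, 3, 3))     # (c_i, h, w)
K_1x1 = torch.normal(0, 1, (2, 3, 1, 1))   # (c_o, c_i, 1, 1)

# Built-in convolution with a 1x1 kernel (a batch axis is added and removed)
Y_conv = F.conv2d(X_rand.unsqueeze(0), K_1x1).squeeze(0)

# The same weights as a linear map at every pixel:
# Y[o, h, w] = sum over i of W[o, i] * X[i, h, w]
W = K_1x1.reshape(2, 3)
Y_linear = torch.einsum('oi,ihw->ohw', W, X_rand)

print(torch.allclose(Y_conv, Y_linear, atol=1e-5))
```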

Channels let a CNN combine the strengths of MLPs, which can model significant nonlinearities, with those of convolutions, which analyse features locally. They also allow the CNN to reason about multiple features at the same time, such as edges and shapes.