========= From Fully Connected Layers to Convolutions =========

### Invariance

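The chapter motivates CNNs through the "Where's Waldo" analogy: Waldo looks the same wherever he appears, and finding him only requires inspecting one local patch at a time. This suggests three principles for image processing:

1. **Translation Invariance**: The network should respond similarly regardless of where an object appears in the image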
2. **Locality**: Early layers should focus on local regions, not distant pixels
3. **Hierarchical Features**: Deeper layers capture longer-range features by aggregating local information

### Constraining the MLP

#### Starting Point: Fully Connected Layer

For a 2D image input $\mathbf{X}$ and hidden representation $\mathbf{H}$, a fully connected layer would be:

$$[\mathbf{H}]_{i, j} = [\mathbf{U}]_{i, j} + \sum_k \sum_l [\mathsf{W}]_{i, j, k, l} [\mathbf{X}]_{k, l}$$

where:
- $[\mathbf{X}]_{i,j}$ is the pixel at position $(i, j)$
- $[\mathsf{W}]_{i,j,k,l}$ are the weights connecting input $(k, l)$ to output $(i, j)$
- $[\mathbf{U}]_{i,j}$ is the bias

**Problem**: For a $1000 \times 1000$ image, this requires $10^{12}$ parameters!

#### Translation Invariance

We invoke the first principle: the detector should work the same regardless of position.

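Reindex the weights with offsets $a = k - i$ and $b = l - j$, so that $[\mathsf{V}]_{i, j, a, b} = [\mathsf{W}]_{i, j, i+a, j+b}$. A shift in the input should simply produce the same shift in the hidden representation, which is only possible if neither the weights nor the bias depend on $(i, j)$. Hence $[\mathsf{V}]_{i, j, a, b} = [\mathbf{V}]_{a, b}$ and $\mathbf{U}$ is a constant $u$:

$$[\mathbf{H}]_{i, j} = u + \sum_a \sum_b [\mathbf{V}]_{a, b} [\mathbf{X}]_{i+a, j+b}$$

Key insight: $\mathbf{V}$ no longer depends on $(i, j)$, so **weights are shared across all positions**.

This is a **convolution**! We weight pixels at $(i+a, j+b)$ near $(i, j)$ using coefficients $[\mathbf{V}]_{a, b}$.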
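Weight sharing can be checked numerically. The following is a minimal sketch in plain Python (nested lists, no libraries); the helper `place_blob` and the all-ones kernel are illustrative assumptions, not part of the chapter. It applies one shared kernel to an image and to a shifted copy, then checks that the response is the same pattern moved by the same offset.

```python
def cross_correlate(image, kernel):
    """Apply one shared kernel at every valid position of a 2D list."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def place_blob(top, left, size=6):
    """Hypothetical helper: a size x size image of zeros with a 2x2 patch of ones."""
    image = [[0] * size for _ in range(size)]
    for di in range(2):
        for dj in range(2):
            image[top + di][left + dj] = 1
    return image


kernel = [[1, 1], [1, 1]]  # illustrative kernel, chosen by hand
original = cross_correlate(place_blob(1, 1), kernel)
shifted = cross_correlate(place_blob(2, 3), kernel)  # blob moved 1 down, 2 right

# The response to the shifted blob is the original response moved by (1, 2).
same_pattern = all(original[i][j] == shifted[i + 1][j + 2]
                   for i in range(4) for j in range(3))
print(same_pattern)  # True
```

Because the same coefficients are used at every position, the detector's peak response simply follows the object.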

#### Locality

We invoke the second principle: only nearby pixels matter for local features.

Restrict the sum to a local neighborhood of size $\Delta$:

$$[\mathbf{H}]_{i, j} = u + \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} [\mathbf{V}]_{a, b} [\mathbf{X}]_{i+a, j+b}$$

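**Parameter reduction**: For a $1000 \times 1000$ image, the translation-invariant layer still needs about $4 \times 10^6$ parameters, because $a$ and $b$ each range over roughly 2000 offsets. Locality cuts this to $(2\Delta + 1)^2$ parameters, which is at most a few hundred because $\Delta$ is typically smaller than 10 (e.g., 49 for $\Delta = 3$).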
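The three parameter counts in this derivation can be reproduced with a few lines of arithmetic. This is a small sketch; the function names are made up for illustration, and the shared-kernel count assumes offsets spanning roughly twice the image size in each direction.

```python
def fully_connected_params(height, width):
    """One weight per (input pixel, hidden unit) pair."""
    return (height * width) ** 2


def shared_kernel_params(height, width):
    """Offsets a and b each span about twice the image size (assumption)."""
    return (2 * height) * (2 * width)


def local_kernel_params(delta):
    """Offsets a and b restricted to [-delta, delta]."""
    return (2 * delta + 1) ** 2


print(fully_connected_params(1000, 1000))  # 1000000000000
print(shared_kernel_params(1000, 1000))    # 4000000
print(local_kernel_params(9))              # 361
```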

This is a **convolutional layer** — the foundation of CNNs.

### Convolutions

#### Mathematical Definition

**Continuous convolution**:
$$(f * g)(\mathbf{x}) = \int f(\mathbf{z}) g(\mathbf{x} - \mathbf{z}) \, d\mathbf{z}$$

**Discrete convolution (1D)**:
$$(f * g)(i) = \sum_a f(a) g(i - a)$$

**Discrete convolution (2D)**:
$$(f * g)(i, j) = \sum_a \sum_b f(a, b) g(i - a, j - b)$$
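To make the discrete 1D definition concrete, here is a minimal sketch in plain Python that treats `f` and `g` as finite sequences that are zero outside their index range (an assumption added for the example). The function name `conv1d` is illustrative only.

```python
def conv1d(f, g):
    """Full discrete convolution: (f * g)(i) = sum_a f(a) * g(i - a)."""
    out = [0] * (len(f) + len(g) - 1)
    for i in range(len(out)):
        for a in range(len(f)):
            if 0 <= i - a < len(g):  # g is treated as zero outside its range
                out[i] += f[a] * g[i - a]
    return out


print(conv1d([1, 2, 3], [0, 1, 2]))  # [0, 1, 4, 7, 6]
```

The result matches the coefficients of the polynomial product $(1 + 2x + 3x^2)(x + 2x^2)$, a handy way to sanity-check a convolution by hand.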

#### Convolution vs Cross-Correlation

Note the sign difference between:
- **Convolution**: $g(i - a, j - b)$
- **Cross-correlation**: $g(i + a, j + b)$

Our formula in (7.1.3) is technically a **cross-correlation**, but in deep learning we call it "convolution" by convention. The difference doesn't matter because the kernel weights are learned.

### Channels

Real images have multiple input channels (e.g., RGB = 3 channels). We extend to handle this:

$$[\mathsf{H}]_{i,j,d} = \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} \sum_c [\mathsf{V}]_{a, b, c, d} [\mathsf{X}]_{i+a, j+b, c}$$

where:
- $c$ indexes **input channels** (e.g., R, G, B)
- $d$ indexes **output channels** (feature maps in the hidden layer)
- $[\mathsf{V}]_{a, b, c, d}$ is a 4D tensor of learnable parameters

This is the **general form of a convolutional layer**.
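The multi-channel formula can be evaluated directly at a single pixel. The sketch below uses nested Python lists and stores the kernel with shifted indices, `V[a + delta][b + delta][c][d]`, since list indices cannot be negative. That storage layout and the names `conv_at` and `kernel_param_count` are assumptions made for this example.

```python
def conv_at(X, V, i, j, delta):
    """Evaluate [H]_{i,j,d} for every output channel d at one interior pixel."""
    c_in = len(X[0][0])
    c_out = len(V[0][0][0])
    return [
        sum(V[a + delta][b + delta][c][d] * X[i + a][j + b][c]
            for a in range(-delta, delta + 1)
            for b in range(-delta, delta + 1)
            for c in range(c_in))
        for d in range(c_out)
    ]


def kernel_param_count(delta, c_in, c_out):
    """Number of entries in the 4D kernel tensor V."""
    return (2 * delta + 1) ** 2 * c_in * c_out


# 3x3 image with 2 input channels: channel 0 is all ones, channel 1 all twos.
X = [[[1, 2] for _ in range(3)] for _ in range(3)]
# 3x3 kernel, 2 input channels, 3 output channels; output channel d has weight d + 1.
V = [[[[d + 1 for d in range(3)] for _ in range(2)]
      for _ in range(3)] for _ in range(3)]

print(conv_at(X, V, 1, 1, delta=1))  # [27, 54, 81]
print(kernel_param_count(1, 2, 3))   # 54
```

Each output channel sums over all offsets and all input channels, so the parameter count scales with the product $c_{in} \cdot c_{out}$.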

### Summary

| Aspect | Fully Connected | Convolutional |
|--------|-----------------|---------------|
| **Parameters** | $O(n^4)$ for $n \times n$ images | $O(\Delta^2 \cdot c_{in} \cdot c_{out})$ |
| **Translation Invariance** | No | Yes (weight sharing) |
| **Locality** | No (all pixels connected) | Yes (kernel size $\Delta$) |
| **Inductive Bias** | None | Matches image structure |

### Key Principles Derived:

1. **Translation Invariance** → **Weight Sharing**: Same kernel applied everywhere
2. **Locality** → **Small Kernels**: Only neighboring pixels influence each output
3. **Channels** → **Feature Maps**: Multiple learned filters capture different patterns

### Why This Matters:

- **Massive parameter reduction**: Makes training feasible on images
- **Built-in inductive bias**: Reflects true structure of visual data
- **Hierarchical learning**: 
  - Early layers: edges, textures
  - Middle layers: parts, patterns
  - Deep layers: objects, scenes

================ Convolutions for Images =================

### The Cross-Correlation Operation

Strictly speaking, convolutional layers perform **cross-correlation**, not convolution (the kernel is not flipped).

#### How It Works

Given:
- Input tensor $\mathbf{X}$ of shape $(n_h \times n_w)$
- Kernel tensor $\mathbf{K}$ of shape $(k_h \times k_w)$

The kernel slides over the input, computing element-wise products and summing:

**Example** (from Fig 7.2.1): a $2 \times 2$ kernel with entries $0, 1, 2, 3$ applied to a $3 \times 3$ input with entries $0, \ldots, 8$, both in row-major order:

$$\begin{aligned}
0\times0+1\times1+3\times2+4\times3&=19\\
1\times0+2\times1+4\times2+5\times3&=25\\
3\times0+4\times1+6\times2+7\times3&=37\\
4\times0+5\times1+7\times2+8\times3&=43
\end{aligned}$$

#### Output Size Formula

$$\text{Output size} = (n_h - k_h + 1) \times (n_w - k_w + 1)$$

For the example: $(3-2+1) \times (3-2+1) = 2 \times 2$
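The output size formula is easy to wrap in a helper for quick shape checks. This is a small sketch; the function name and the error handling are illustrative additions, and it assumes no padding and a stride of 1, as in the formula.

```python
def corr2d_output_shape(input_shape, kernel_shape):
    """Output shape of a cross-correlation with no padding and stride 1."""
    n_h, n_w = input_shape
    k_h, k_w = kernel_shape
    if k_h > n_h or k_w > n_w:
        raise ValueError("kernel must not be larger than the input")
    return (n_h - k_h + 1, n_w - k_w + 1)


print(corr2d_output_shape((3, 3), (2, 2)))  # (2, 2)
print(corr2d_output_shape((6, 8), (1, 2)))  # (6, 7)
```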

### Convolutional Layers

A convolutional layer consists of:
1. **Kernel (weights)**: Learnable parameters
2. **Bias**: Scalar added to output

$$\mathbf{Y} = \text{corr2d}(\mathbf{X}, \mathbf{K}) + b$$

An $h \times w$ convolution (or $h \times w$ convolution kernel) refers to a kernel of height $h$ and width $w$.

### Object Edge Detection in Images

#### Example: Vertical Edge Detection

**Input**: $6 \times 8$ image with vertical edges
```
[[1, 1, 0, 0, 0, 0, 1, 1],
 [1, 1, 0, 0, 0, 0, 1, 1],
 ...]
```

**Kernel**: $1 \times 2$ edge detector
$$\mathbf{K} = [1, -1]$$

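**Output**: Each element is $\mathbf{Y}_{i,j} = \mathbf{X}_{i,j} - \mathbf{X}_{i,j+1}$, so it is nonzero only where horizontally adjacent pixels differ (with $1$ as white and $0$ as black):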
- White-to-black edge: outputs $1$
- Black-to-white edge: outputs $-1$
- No change: outputs $0$

**Key insight**: This kernel only detects **vertical** edges. Applying it to the transposed image (horizontal edges) produces all zeros.

### Learning a Kernel

Instead of hand-designing kernels, we can **learn** them from data.

**Training loop** (simplified):
```python
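# X is the 6x8 edge image and Y the output of the [1, -1] detector
conv2d = Conv2D(kernel_size=(1, 2))  # kernel starts from random values
lr = 3e-2  # learning rate
for i in range(10):
    Y_hat = conv2d(X)
    loss = (Y_hat - Y) ** 2  # squared error per output element
    conv2d.zero_grad()
    loss.sum().backward()
    # Plain gradient descent on the kernel; the bias is left unchanged
    conv2d.weight.data -= lr * conv2d.weight.grad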
```

After training, the learned kernel approximates $[1, -1]$.

### Cross-Correlation and Convolution

#### Mathematical Difference

**Cross-correlation**:
$$(\mathbf{X} \star \mathbf{K})_{i,j} = \sum_a \sum_b \mathbf{K}_{a,b} \cdot \mathbf{X}_{i+a, j+b}$$

**True convolution** (kernel flipped):
$$(\mathbf{X} * \mathbf{K})_{i,j} = \sum_a \sum_b \mathbf{K}_{a,b} \cdot \mathbf{X}_{i-a, j-b}$$

#### Why It Doesn't Matter

- Flipping the kernel $\mathbf{K}$ gives $\mathbf{K}'$
- Cross-correlation with $\mathbf{K}' =$ Convolution with $\mathbf{K}$
- Since kernels are **learned**, the network learns the appropriate (possibly flipped) version
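The flipping argument can be verified on the small example used earlier in these notes. The sketch below is plain Python with illustrative function names; `convolve` is an assumed "valid"-window implementation of true convolution, written only for this check.

```python
def cross_correlate(X, K):
    """Cross-correlation over all valid windows (kernel not flipped)."""
    kh, kw = len(K), len(K[0])
    return [[sum(K[a][b] * X[i + a][j + b]
                 for a in range(kh) for b in range(kw))
             for j in range(len(X[0]) - kw + 1)]
            for i in range(len(X) - kh + 1)]


def convolve(X, K):
    """True convolution over the same windows: the input is indexed in reverse."""
    kh, kw = len(K), len(K[0])
    return [[sum(K[a][b] * X[i + kh - 1 - a][j + kw - 1 - b]
                 for a in range(kh) for b in range(kw))
             for j in range(len(X[0]) - kw + 1)]
            for i in range(len(X) - kh + 1)]


def flip(K):
    """Flip a kernel both vertically and horizontally."""
    return [row[::-1] for row in K[::-1]]


X = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
K = [[0, 1], [2, 3]]

print(cross_correlate(X, K))        # [[19, 25], [37, 43]]
print(convolve(X, K))               # [[5, 11], [23, 29]]
print(cross_correlate(X, flip(K)))  # [[5, 11], [23, 29]]
```

The two operations differ for a fixed kernel, but cross-correlating with the flipped kernel reproduces the true convolution exactly. A layer that learns its kernel can therefore represent either.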

**Convention**: Deep learning uses "convolution" to mean cross-correlation.

### Feature Map and Receptive Field

#### Feature Map
The output of a convolutional layer is called a **feature map** — it represents learned spatial features of the input.

#### Receptive Field

The **receptive field** of an output element is the region in the input that affects its computation.

- For a single $k \times k$ conv layer: receptive field = $k \times k$
- For stacked layers: receptive field grows with depth

**Biological inspiration**: The term comes from neuroscience, where neurons in the visual cortex respond to specific regions of the visual field (Hubel & Wiesel, Nobel Prize 1981).

#### Deep Networks and Receptive Fields

With each convolutional layer, the receptive field grows:
- Layer 1: $k \times k$
- Layer 2: $(2k-1) \times (2k-1)$
- Layer $n$: $(n(k-1)+1) \times (n(k-1)+1)$ for stride-1 layers without dilation, so deeper layers capture larger-scale patterns

This is why **deep** CNNs can recognize complex objects — early layers detect edges, deeper layers detect parts, even deeper layers detect whole objects.
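The growth of the receptive field with depth can be tabulated with a short helper. This sketch assumes a stack of identical $k \times k$ layers with stride 1 and no dilation or pooling; the function name is illustrative.

```python
def receptive_field(kernel_size, num_layers):
    """Side length of the receptive field after stacking stride-1 conv layers."""
    size = 1  # a single input pixel before any layer
    for _ in range(num_layers):
        size += kernel_size - 1  # each layer extends the view by k - 1 pixels
    return size


print([receptive_field(3, n) for n in (1, 2, 3)])  # [3, 5, 7]
```

For $k = 3$ the second layer sees a $5 \times 5$ region, matching $(2k-1) \times (2k-1)$. Under these assumptions the growth is only linear in depth.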

### Summary Table

| Concept | Formula/Description |
|---------|---------------------|
| Cross-correlation | $\sum_a \sum_b K_{a,b} \cdot X_{i+a, j+b}$ |
| Output size | $(n_h - k_h + 1) \times (n_w - k_w + 1)$ |
| Convolutional layer | $Y = \text{corr2d}(X, K) + b$ |
| Edge detection kernel | $[1, -1]$ for vertical edges |
| Receptive field | Input region affecting one output element |

### Key Takeaways

1. **Cross-correlation** is the actual operation (kernel not flipped), but called "convolution" by convention
2. **Kernels can be learned** — no need for manual design
3. **Feature maps** are spatial representations of detected patterns
4. **Receptive field** grows with network depth, enabling hierarchical feature learning
5. **Edge detection** is a simple but powerful example of what convolutions can do

```python
import torch
from torch import nn
from d2l import torch as d2l
```

```python
def corr2d(X, K):  #@save
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y
```

```python
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)
```

```
tensor([[19., 25.],
        [37., 43.]])
```

```python
class Conv2D(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return corr2d(x, self.weight) + self.bias
```

```python
X = torch.ones((6, 8))
X[:, 2:6] = 0
X
```

```
tensor([[1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.]])
```

```python
K = torch.tensor([[1.0, -1.0]])
```

```python
Y = corr2d(X, K)
Y
```

```
tensor([[ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.]])
```

```python
corr2d(X.t(), K)
```

```
tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])
```

```python
# Construct a two-dimensional convolutional layer with 1 output channel and a
# kernel of shape (1, 2). For the sake of simplicity, we ignore the bias here
conv2d = nn.LazyConv2d(1, kernel_size=(1, 2), bias=False)

# The two-dimensional convolutional layer uses four-dimensional input and
# output in the format of (example, channel, height, width), where the batch
# size (number of examples in the batch) and the number of channels are both 1
X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))
lr = 3e-2  # Learning rate

for i in range(10):
    Y_hat = conv2d(X)
    l = (Y_hat - Y) ** 2
    conv2d.zero_grad()
    l.sum().backward()
    # Update the kernel
    conv2d.weight.data[:] -= lr * conv2d.weight.grad
    if (i + 1) % 2 == 0:
        print(f'epoch {i + 1}, loss {l.sum():.3f}')
```

```
epoch 2, loss 5.894
epoch 4, loss 1.783
epoch 6, loss 0.624
epoch 8, loss 0.238
epoch 10, loss 0.094
```

```python
conv2d.weight.data.reshape((1, 2))
```

```
tensor([[ 1.0230, -0.9602]])
```