# Convolution and Pooling Layers

This notebook shows how kernel size, padding, and stride combine to determine the output size of convolutional and pooling layers. When adapting a network architecture to input images of a different size than the original, you'll need to determine the output sizes of the layers in your network.

## Convolutional Layers

To work with Conv2D layers, you need to understand how they change the size and structure of the input image. The output size depends on the kernel size, padding, and stride. Below we give the formula for the output size, its motivation, and related concepts such as feature maps.

### Formula for Output Size

The formula for the output height and width of a Conv2D layer is:

$$\text{Output size} = \left\lfloor \frac{\text{Input size} + 2 \times \text{Padding} - \text{Kernel size}}{\text{Stride}} \right\rfloor + 1$$

Where:
- **Input size**: The height (or width) of the input image.
- **Kernel size**: The height (or width) of the convolutional filter (kernel).
- **Padding**: The number of pixels added around the input image.
- **Stride**: The number of pixels by which the filter is moved across the input image.
- **Floor**: The result is rounded down to the nearest integer, e.g. $\lfloor 4.7 \rfloor = 4$ and $\lfloor -3.2 \rfloor = -4$.
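
The formula translates directly into a few lines of Python. The sketch below is a hypothetical helper (`conv_output_size` is not part of PyTorch or of this notebook's widget); it assumes square inputs and kernels and the same padding on both sides.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output height (or width) from the formula above.

    Hypothetical helper for illustration; assumes symmetric padding.
    """
    # Python's // operator is floor division, matching the floor in the formula.
    return (input_size + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(5, 3))                       # 5x5 input, 3x3 kernel
print(conv_output_size(8, 3, padding=1, stride=2))  # 8x8 input, stride 2, padding 1
```

The two calls use the same settings as the Conv2D examples in this notebook, so the printed values can be compared with the shapes PyTorch reports there.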

### Motivation for the Formula

When a convolutional filter is applied to an image, it slides over the input image in both height and width. The size of the output depends on how the filter fits on the image and how much the filter moves (stride). 

- **Padding**: We may pad the input to control the spatial dimensions of the output, preserving the size or allowing the filter to "see" edge pixels more effectively.
- **Stride**: The stride controls how far the filter moves at each step, which affects the output size. A stride of 1 moves the filter one pixel at a time, while a stride of 2 moves it two pixels at a time, producing a smaller output. The larger the stride, the smaller the output, which is why the stride appears in the denominator of the output size formula.
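
One way to see where the formula comes from is to count the positions at which the filter fits entirely inside the padded input. The sketch below is plain Python with a hypothetical helper name, `window_starts`. It lists the left edge of each valid placement and compares the count with the formula.

```python
def window_starts(input_size, kernel_size, padding=0, stride=1):
    """Left-edge positions (in padded coordinates) where the kernel fits."""
    padded = input_size + 2 * padding
    # The last valid left edge is padded - kernel_size; step by the stride.
    return list(range(0, padded - kernel_size + 1, stride))


for stride in (1, 2, 3):
    starts = window_starts(8, 3, padding=1, stride=stride)
    formula = (8 + 2 * 1 - 3) // stride + 1
    print(f"stride={stride}: starts={starts}, count={len(starts)}, formula={formula}")
```

For each stride the number of start positions equals the value given by the formula: dividing by the stride counts the steps, and the `+ 1` accounts for the first placement.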

#### L02_2_Convolutional_Formulas Video

<iframe 
    src="https://media.uwex.edu/content/ds/ds776/ds776_l02_2_convolutional_formulas/" 
    width="800" 
    height="450" 
    style="border: 5px solid cyan;"  
    allowfullscreen>
</iframe>
<br>
<a href="https://media.uwex.edu/content/ds/ds776/ds776_l02_2_convolutional_formulas/" target="_blank">Open UWEX version of video in new tab</a>
<br>
<a href="https://share.descript.com/view/iDFKKdBxz74" target="_blank">Open Descript version of video in new tab</a>

In [1]:
# formula demo
from conv_size_widget import create_widget
create_widget()



### Feature Maps and Channels

Each convolution operation produces what is called a **feature map**. The feature map represents a set of features learned by applying the kernel to the input image.

- **Number of Channels**: The input to a Conv2D layer can have multiple channels. For example, a color image has 3 channels (red, green, blue). A convolutional layer can apply multiple filters, each generating a separate feature map. If a layer has $N$ filters, the result is $N$ feature maps (output channels).

- **Multiple Filters**: By using multiple filters, a convolutional layer can detect different types of patterns from the input, such as edges, corners, or textures. Each filter produces a distinct feature map, and these maps stack together to form the output with multiple channels.

For example, if the input has 3 channels and we use 6 filters, the output will have 6 feature maps, each capturing different features across the input.
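
With PyTorch's default settings, each filter spans all input channels. So with 3 input channels and 6 filters of size 3x3, the layer stores a weight tensor of shape (6, 3, 3, 3) plus one bias per filter. The sketch below inspects the `weight` and `bias` attributes of `nn.Conv2d`; the layer sizes are chosen to match the example above.

```python
import torch.nn as nn

# 3 input channels (e.g. RGB), 6 filters, 3x3 kernel
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=3)

# weight is (out_channels, in_channels, kernel_height, kernel_width)
print("Weight shape:", tuple(conv.weight.shape))
# one bias value per filter (bias=True is the default)
print("Bias shape:", tuple(conv.bias.shape))

n_params = sum(p.numel() for p in conv.parameters())
print("Total parameters:", n_params)  # 6*3*3*3 weights + 6 biases
```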

#### **Example 1: Single Channel Input with Single Filter**

In [2]:
import torch
import torch.nn as nn

# Example image (1 batch, 1 channel, 5x5 image)
x = torch.rand(1, 1, 5, 5)

# Conv2D layer: 3x3 kernel, stride 1, no padding
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=1, padding=0)

# Apply the convolution
output = conv(x)

print("Input shape:", x.shape)
print("Output shape:", output.shape)

Input shape: torch.Size([1, 1, 5, 5])
Output shape: torch.Size([1, 1, 3, 3])



**Explanation:**
- Input shape: (1, 1, 5, 5) → 1 batch, 1 channel, 5x5 image
- Conv2D layer: 1 input channel, 1 output channel (1 filter), 3x3 kernel, stride 1, padding 0
- Output shape: (1, 1, 3, 3)

Using the formula:

$$\text{Output size} = \left\lfloor \frac{5 + 2(0) - 3}{1} \right\rfloor + 1 = 3$$

---

#### **Example 2: RGB Image (3 Channels) with Multiple Filters**

In [3]:
# Example RGB image (1 batch, 3 channels, 5x5 image)
x = torch.rand(1, 3, 5, 5)

# Conv2D layer: 3x3 kernel, stride 1, no padding, 6 output channels (filters)
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=3, stride=1, padding=0)

# Apply the convolution
output = conv(x)

print("Input shape:", x.shape)
print("Output shape:", output.shape)

Input shape: torch.Size([1, 3, 5, 5])
Output shape: torch.Size([1, 6, 3, 3])




**Explanation:**
- Input shape: (1, 3, 5, 5) → 1 batch, 3 channels (RGB), 5x5 image
- Conv2D layer: 3 input channels, 6 output channels (filters), 3x3 kernel, stride 1, padding 0
- Output shape: (1, 6, 3, 3) → 6 feature maps of size 3x3

Each of the 6 output channels represents a feature map extracted by one of the filters.

---

#### **Example 3: Using Padding and Stride**


In [4]:
# Example image (1 batch, 1 channel, 8x8 image)
x = torch.rand(1, 1, 8, 8)

# Conv2D layer: 3x3 kernel, stride 2, padding 1, 4 output channels
conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3, stride=2, padding=1)

# Apply the convolution
output = conv(x)

print("Input shape:", x.shape)
print("Output shape:", output.shape)

Input shape: torch.Size([1, 1, 8, 8])
Output shape: torch.Size([1, 4, 4, 4])




**Explanation:**
- Input shape: (1, 1, 8, 8) → 1 batch, 1 channel, 8x8 image
- Conv2D layer: 1 input channel, 4 output channels (filters), 3x3 kernel, stride 2, padding 1
- Output shape: (1, 4, 4, 4) → 4 feature maps of size 4x4

Using the formula:

$$\text{Output size} = \left\lfloor \frac{8 + 2(1) - 3}{2} \right\rfloor + 1 = 4$$

Each of the 4 output channels represents a feature map extracted by one of the filters.

---

## Pooling Layers

The formula and examples below apply to both max and average pooling. The formula is the same as the one for Conv2D. The difference is that a pooling layer has no filters to learn: it is applied to each input channel independently, so the number of channels is unchanged.

### **Formula for Pooling Layer Output Size**

For a 2D pooling operation (e.g., MaxPool2D or AvgPool2D), the output height and width are given by:


$$\text{Output size} = \left\lfloor \frac{\text{Input size} + 2 \times \text{Padding} - \text{Kernel size}}{\text{Stride}} \right\rfloor + 1$$

Where:
- **Input size**: The height (or width) of the input feature map.
- **Kernel size**: The height (or width) of the pooling window.
- **Padding**: The number of pixels added around the input.
- **Stride**: The number of pixels the pooling window moves each step.
- **Floor**: The result is rounded down to the nearest integer.

### **Motivation for the Formula**

Pooling layers are typically used for **downsampling**, which reduces the spatial dimensions of feature maps. Pooling works by sliding a window (of size `kernel_size`) over the input and taking the maximum or average value within the window, depending on the pooling type. The stride controls how far the window moves at each step.

The output size formula follows the same idea as for convolution: it depends on how many times the window fits onto the (padded) input as it moves by the stride.
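
Because convolution and pooling share the same formula, the spatial size can be traced through a stack of layers with one helper. The sketch below uses a made-up layer list (not an architecture from this notebook) and a hypothetical `output_size` function to follow an assumed 32x32 input through two conv/pool pairs.

```python
def output_size(input_size, kernel_size, padding=0, stride=1):
    """Shared output-size formula for Conv2d and pooling layers."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Hypothetical stack: (name, kernel_size, padding, stride)
layers = [
    ("conv1", 3, 1, 1),
    ("pool1", 2, 0, 2),
    ("conv2", 3, 1, 1),
    ("pool2", 2, 0, 2),
]

size = 32  # assumed input height/width
sizes = [size]
for name, kernel, padding, stride in layers:
    size = output_size(size, kernel, padding, stride)
    sizes.append(size)
    print(f"{name}: {size}x{size}")
```

Changing the starting value of `size` shows how a different input resolution propagates to every later layer.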

### **Examples Using Pooling Layers in PyTorch**

#### **Example 1: MaxPooling Without Padding**

In [4]:
# Example image (1 batch, 1 channel, 5x5 image)
x = torch.rand(1, 1, 5, 5)

# MaxPool2D layer: 2x2 window, stride 2, no padding
pool = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)

# Apply the pooling
output = pool(x)

print("Input shape:", x.shape)
print("Output shape:", output.shape)

Input shape: torch.Size([1, 1, 5, 5])
Output shape: torch.Size([1, 1, 2, 2])


**Explanation:**
- Input size: 5x5
- Kernel size: 2x2
- Padding: 0
- Stride: 2

Using the formula:

$$\text{Output size} = \left\lfloor \frac{5 + 2(0) - 2}{2} \right\rfloor + 1 = 2$$

So the output will be a 2x2 feature map.
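
Note that a 2x2 window with stride 2 cannot cover all of a 5x5 input: the last row and column are never visited. The plain-Python sketch below makes this visible on a grid holding the values 0 to 24. It uses a hypothetical `max_pool2d` on nested lists, which is not PyTorch's implementation.

```python
def max_pool2d(grid, kernel_size, stride):
    """Naive max pooling on a square nested list, no padding."""
    n = len(grid)
    out_size = (n - kernel_size) // stride + 1
    return [
        [
            max(
                grid[r * stride + i][c * stride + j]
                for i in range(kernel_size)
                for j in range(kernel_size)
            )
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]


# 5x5 grid with values 0..24, row by row
grid = [[5 * r + c for c in range(5)] for r in range(5)]
pooled = max_pool2d(grid, kernel_size=2, stride=2)
print(pooled)
```

The largest value in the grid, 24, sits in the last row and column and so never appears in the pooled output.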

#### **Example 2: MaxPooling with Padding**

In [5]:
# Example image (1 batch, 1 channel, 5x5 image)
x = torch.rand(1, 1, 5, 5)

# MaxPool2D layer: 3x3 window, stride 2, padding 1
pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

# Apply the pooling
output = pool(x)

print("Input shape:", x.shape)
print("Output shape:", output.shape)

Input shape: torch.Size([1, 1, 5, 5])
Output shape: torch.Size([1, 1, 3, 3])


**Explanation:**
- Input size: 5x5
- Kernel size: 3x3
- Padding: 1
- Stride: 2

Using the formula:

$$\text{Output size} = \left\lfloor \frac{5 + 2(1) - 3}{2} \right\rfloor + 1 = 3$$

So the output will be a 3x3 feature map.

---