# Exercise 1.4.1 - Pooling Layers in CNNs
#### By Jonathan L. Moran (jonathan.moran107@gmail.com)
From the Self-Driving Car Engineer Nanodegree programme offered at Udacity.

## Objectives

* Implement a simplified version of the Pooling layer using Numpy.

## 1. Introduction

In [None]:
### Importing required modules

In [1]:
import os
import numpy as np

In [None]:
### Setting the environment variables

In [None]:
ENV_COLAB = True                # True if running in Google Colab instance

In [None]:
# Root directory
DIR_BASE = '' if not ENV_COLAB else '/content/'

In [None]:
# Subdirectory to save output files
DIR_OUT = os.path.join(DIR_BASE, 'out/')
# Subdirectory pointing to input data
DIR_SRC = os.path.join(DIR_BASE, 'data/')

In [None]:
### Creating subdirectories (if not exists)
os.makedirs(DIR_OUT, exist_ok=True)

### 1.1. Pooling Layers

#### Background

* What are they?
* Why are they useful?
* How are they implemented?

_Pooling layers_ are fundamental building blocks in [convolutional neural network](https://en.wikipedia.org/wiki/Convolutional_neural_network) architectures. These intermediate layers are used in CNNs to reduce the dimensionality of input data. Pooling layers function by aggregating an input's spatial information in a process akin to [downsampling](https://en.wikipedia.org/wiki/Downsampling_\(signal_processing\)). By reducing the dimensionality of the input data, pooling layers help mitigate kernel sensitivity to granular features (finer details). As the input data is aggregated and its area reduced in size, convolution kernels of a fixed size are able to cover a larger subregion of the original input.

Pooling layers, like their convolutional counterparts, operate over an input using a sliding-window approach. Because of this, the two layer types share several hyperparameters: a _window size_ (the height and width of the _pooling window_) and a _stride_ (the number of pixels by which the sliding window shifts on each iteration). These hyperparameters allow practitioners to control how much the spatial resolution of the input data is reduced.

![Fig. 1. The sliding window technique used to compute the pooling layer output.](figures/2022-09-07-Figure-1-Sliding-Window-Technique.png)

$$\textrm{Fig. 1. The sliding window technique for a 4x4 kernel with a stride length of 3.}$$

One key difference between pooling layers and other types of convolutional layers is that the pooling layers contain _zero trainable parameters_. Therefore, pooling layers help downsample input data without introducing significant complexity to the model.

#### Types of Pooling layers

Suppose we have an input matrix $X$,

$$
\begin{align}
X = 
    \begin{bmatrix}
        0 & 1 & 2 \\
        3 & 4 & 5 \\
        6 & 7 & 8 \\
    \end{bmatrix} 
\end{align}
$$

In order to calculate the kernel dimensions, we need to first know the dimensions of our input. From inspection we see that $X$ is a 2D matrix with the following dimensions: a width of $3$, a height of $3$ and a depth of $1$.

We can select a kernel/filter with dimensions $\left(\mathrm{f}_{\mathrm{w}}, \ \mathrm{f}_{\mathrm{h}}, \ \mathrm{d}_{0}\right)$ using the following equations defined in terms of input dimensions $\left(\mathrm{w}_{0}, \ \mathrm{h}_{0}, \ \mathrm{d}_{0}\right)$,

$$
\begin{align}
    \mathrm{w}_{\mathrm{output}} &= \frac{\mathrm{w}_{0} - \mathrm{f}_{\mathrm{w}} + 2*p}{s} + 1, \\
    \
    \mathrm{h}_{\mathrm{output}} &= \frac{\mathrm{h}_{0} - \mathrm{f}_{\mathrm{h}} + 2*p}{s} + 1. \\
\end{align}
$$

In the formulas above we have several hyperparameters available to the practitioner to modify: the filter size $\mathrm{f}$, the padding amount $\mathrm{p}$ (discussed below), and the stride length $\mathrm{s}$. The filter depth, $\mathrm{d}_{\mathrm{kernel}}$, is the only fixed parameter and its value is equal to the depth $\mathrm{d}_{0}$ of the input.

In the following examples, we will use a kernel of dimensions $\left(\mathrm{w}_{\mathrm{kernel}} \times \mathrm{h}_{\mathrm{kernel}} \times \mathrm{d}_{\mathrm{kernel}}\right) = \left(2 \times 2 \times 1\right)$. We will also select a stride length $s = 1$.
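The output-size formulas above can be checked numerically with a small helper. This is only a sketch: the name `pooled_size` is hypothetical, and the floor division is an assumption for cases where the stride does not divide the numerator evenly (the formulas in the text do not say how to round).

```python
def pooled_size(n: int, f: int, p: int = 0, s: int = 1) -> int:
    """Output length along one axis: floor((n - f + 2p) / s) + 1."""
    return (n - f + 2 * p) // s + 1


# The 3x3 input from above with a 2x2 window, no padding and a stride of 1
w_out = pooled_size(3, 2, p=0, s=1)
h_out = pooled_size(3, 2, p=0, s=1)
print(w_out, h_out)  # 2 2
```

The same helper applies to either axis, since the width and height formulas differ only in which input and filter dimension they use.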

##### The Max Pooling layer

Rather than subsampling an input image by selecting e.g., every other pixel value, we can choose the _maximum_ value over a set of adjacent pixels in the kernel window.

$$
\begin{align}
X^{\prime} &= 
    \begin{bmatrix}
        max
        \begin{pmatrix}
        0 & 1 \\
        3 & 4 \\
        \end{pmatrix} 
        &
        max
        \begin{pmatrix}
        1 & 2 \\
        4 & 5 \\
        \end{pmatrix}
        \\
        max
        \begin{pmatrix}
        3 & 4 \\
        6 & 7 \\
        \end{pmatrix}
        &
        max
        \begin{pmatrix}
        4 & 5 \\
        7 & 8 \\
        \end{pmatrix}
    \end{bmatrix}. \\
\end{align}
$$


Therefore we obtain the output matrix

$$
\begin{align}
X^{\prime} &= 
\begin{bmatrix}
    4 & 5 \\
    7 & 8 \\
\end{bmatrix}.
\end{align}
$$

##### The Min Pooling layer

In a Min Pooling layer, we preserve the _minimum_ pixel value taken from the adjacent pixels in each kernel window, 


$$
\begin{align}
X^{\prime} &= 
    \begin{bmatrix}
        min
        \begin{pmatrix}
        0 & 1 \\
        3 & 4 \\
        \end{pmatrix} 
        &
        min
        \begin{pmatrix}
        1 & 2 \\
        4 & 5 \\
        \end{pmatrix}
        \\
        min
        \begin{pmatrix}
        3 & 4 \\
        6 & 7 \\
        \end{pmatrix}
        &
        min
        \begin{pmatrix}
        4 & 5 \\
        7 & 8 \\
        \end{pmatrix}
    \end{bmatrix}. \\
\end{align}
$$

Therefore we obtain the output matrix

$$
\begin{align}
X^{\prime} &= 
\begin{bmatrix}
    0 & 1 \\
    3 & 4 \\
\end{bmatrix}.
\end{align}
$$

##### The Average Pooling layer

We can also use an Average Pooling layer to downsample the input. The _average_ value is computed for every window of pixels,

$$
\begin{align}
X^{\prime} &= 
    \begin{bmatrix}
        avg
        \begin{pmatrix}
        0 & 1 \\
        3 & 4 \\
        \end{pmatrix} 
        &
        avg
        \begin{pmatrix}
        1 & 2 \\
        4 & 5 \\
        \end{pmatrix}
        \\
        avg
        \begin{pmatrix}
        3 & 4 \\
        6 & 7 \\
        \end{pmatrix}
        &
        avg
        \begin{pmatrix}
        4 & 5 \\
        7 & 8 \\
        \end{pmatrix}
    \end{bmatrix}. \\
\end{align}
$$


Therefore we obtain the output matrix

$$
\begin{align}
X^{\prime} &= 
\begin{bmatrix}
    2 & 3 \\
    5 & 6 \\
\end{bmatrix}.
\end{align}
$$

We will see in our implementation how the above layers act to reduce an input image's spatial dimensionality while preserving information in various ways.
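The three worked examples above can be reproduced in a few lines of Numpy. This is a minimal sketch rather than the exercise solution; it assumes a Numpy version that provides `sliding_window_view` (1.20 or later) and only covers the 2D, stride-1 case used in the examples.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

X = np.arange(9).reshape(3, 3)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

# All 2x2 windows at stride 1; shape (2, 2, 2, 2) = (out_h, out_w, f_h, f_w)
windows = sliding_window_view(X, (2, 2))

# Reduce over the two window axes to obtain one value per window
max_pooled = windows.max(axis=(2, 3))   # [[4, 5], [7, 8]]
min_pooled = windows.min(axis=(2, 3))   # [[0, 1], [3, 4]]
avg_pooled = windows.mean(axis=(2, 3))  # [[2., 3.], [5., 6.]]

print(max_pooled, min_pooled, avg_pooled, sep="\n")
```

A larger stride could be emulated by slicing the first two axes of `windows`, e.g. `windows[::2, ::2]`, before reducing.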

### 1.2. Padding and Stride

#### Padding

_Padding_ is a technique that allows us to specify the output size we want to obtain after performing a strided pooling or convolution operation. In the above example, we saw that our input matrix $X$ of dimensions $\left(3 \times 3 \times 1\right)$ resulted in an output matrix $X^{\prime}$ of dimensions $\left(2 \times 2 \times 1\right)$ after performing a pooling operation over $X$ with a selected kernel size of $\left(2 \times 2 \times 1\right)$ and a stride length of $1$.

Suppose instead we wanted to preserve the original dimensions of the input matrix after applying the sliding window. This is possible, but it requires a few extra steps before we can carry out the sliding window calculations as before.

By revisiting the output width and height formulas above, we can find the padding amount $p$ needed to obtain the desired output dimensions. That is, if we want an output of dimensions $\left(\mathrm{w}_{\mathrm{output}}, \ \mathrm{h}_{\mathrm{output}}, \ \mathrm{d}_{\mathrm{output}}\right) = \left(\mathrm{w}_{0}, \ \mathrm{h}_{0}, \ \mathrm{d}_{0}\right)$, then we need to solve the equations for the padding amount $p$ that makes this true,

$$
\begin{align}
    \mathrm{w}_{\mathrm{output}} &= \frac{\mathrm{w}_{0} - \mathrm{f}_{\mathrm{w}} + 2*p}{s} + 1 &= \ \mathrm{w}_{0}, \\
    \
    \mathrm{h}_{\mathrm{output}} &= \frac{\mathrm{h}_{0} - \mathrm{f}_{\mathrm{h}} + 2*p}{s} + 1 &=  \ \mathrm{h}_{0}. \\
\end{align}
$$


Rearranging the above expressions in terms of padding $p$, we obtain

$$
\begin{align}
    p_{\mathrm{w}} &= \frac{s*\mathrm{w}_{\mathrm{output}} - s + \mathrm{f}_{\mathrm{w}} - \mathrm{w}_{0}}{2}, \\
    \
    p_{\mathrm{h}} &= \frac{s*\mathrm{h}_{\mathrm{output}} - s + \mathrm{f}_{\mathrm{h}} - \mathrm{h}_{0}}{2}. \\
\end{align}
$$



Let's assume we have an input matrix $X$ with dimensions $\left(6 \times 6 \times 1\right)$. If we want to preserve the dimensionality of $X$ after passing it through one of the pooling layers above, we need to find a value of $p$ that makes the output dimensions equal the initial dimensions of $X$. That is, $\left(\mathrm{w}_{\mathrm{output}}, \ \mathrm{h}_{\mathrm{output}}, \ \mathrm{d}_{\mathrm{output}}\right) = \left(\mathrm{w}_{0}, \ \mathrm{h}_{0}, \ \mathrm{d}_{0}\right)$. For a given kernel size $\left(3, 3, 1\right)$ and a stride length of $s=1$, we can solve for $p$ after substituting in these known values,
$$
\begin{align}
    p_{\mathrm{w}} &= \frac{(1)*6 - 1 + 3 - 6}{2} &= 1, \\
    \
    p_{\mathrm{h}} &= \frac{(1)*6 - 1 + 3 - 6}{2} &= 1. \\
\end{align}
$$
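As a quick numerical check, the output-size equation can be solved for $p$ with the output size set equal to the input size, which gives $2p = s(\mathrm{w}_{0} - 1) + \mathrm{f}_{\mathrm{w}} - \mathrm{w}_{0}$. The sketch below uses a hypothetical helper name, `same_padding`, and returns a float on purpose so that a fractional result (no exact symmetric padding exists) stays visible.

```python
def same_padding(n: int, f: int, s: int = 1) -> float:
    """Per-side padding p that solves (n - f + 2p) / s + 1 == n."""
    return (s * (n - 1) + f - n) / 2


# 6x6 input, 3x3 kernel, stride 1
p = same_padding(6, 3, 1)
print(p)  # 1.0

# Substitute back into the output-size formula to confirm the size is preserved
print((6 - 3 + 2 * p) / 1 + 1)  # 6.0
```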

From the above calculations we determined that we need to _pad_ our input matrix $X$ by $1$ along each edge. In other words, we must add one column of $\mathrm{h}_{0}$ values to both the left and right of matrix $X$, and one row of values to both the top and bottom of matrix $X$ (each row spanning the padded width, $\mathrm{w}_{0} + 2$, so that the corners are filled). Now that we know the dimensions of our padding, let's see which values we should use to fill in the padding arrays.

##### Zero padding

As the name implies, _zero padding_ is a technique that adds an array of _zeros_ along the perimeter of an input image. When used together with Max Pooling on non-negative inputs (e.g., pixel intensities), the zero-padded arrays will not affect the resulting values of a pooling operation. That is, taking the $max$ of a non-negative pixel value together with a neighbouring padding value of zero results in the original pixel value. Therefore, no information along the perimeter of an image is lost or discarded.

To use zero padding in Numpy, pass the `mode="constant"` argument along with `constant_values=0` into the [`numpy.pad()`](https://numpy.org/doc/stable/reference/generated/numpy.pad.html) function.

##### Nearest-neighbour interpolation padding

Another padding technique known as [nearest-neighbour interpolation](https://en.wikipedia.org/wiki/Nearest-neighbor_interpolation) is to fill the padding arrays with values that equal the image pixel values along the perimeter. Suppose we have a row vector `X[0,:] = [0, 1, 2, 3]`. Using a padding size of $p=1$, we would therefore add a row vector of equal length to the top of the matrix and fill it with the _same_ adjacent pixel values along the vector `X[0, :]`. This technique could be used with Average Pooling to avoid information loss.

To use nearest-neighbour interpolation padding in Numpy, simply pass the `mode="edge"` argument into the [`numpy.pad()`](https://numpy.org/doc/stable/reference/generated/numpy.pad.html) function.
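To make the two padding modes concrete, here is a small sketch that pads the same matrix both ways with a padding size of $p=1$. The $2 \times 2$ matrix is an illustrative input, not one taken from the exercise.

```python
import numpy as np

X = np.array([[1, 2],
              [3, 4]])

# Zero padding: surround X with a one-pixel border of zeros
zero_padded = np.pad(X, pad_width=1, mode="constant", constant_values=0)
# [[0, 0, 0, 0],
#  [0, 1, 2, 0],
#  [0, 3, 4, 0],
#  [0, 0, 0, 0]]

# Nearest-neighbour padding: repeat the values found on the perimeter of X
edge_padded = np.pad(X, pad_width=1, mode="edge")
# [[1, 1, 2, 2],
#  [1, 1, 2, 2],
#  [3, 3, 4, 4],
#  [3, 3, 4, 4]]

print(zero_padded)
print(edge_padded)
```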

##### Note on the practicality of pooling

As mentioned earlier, the primary use case for pooling layers in convolutional neural networks is to intentionally _reduce_ the spatial dimensionality of the input. Therefore, it is often not the case that one would use padding to _preserve_ input dimensionality. However, unexpected reductions in dimensionality may occur when using a sliding window approach without a carefully selected kernel size.

A more practical use case for padding is often to handle the unexpected _edge cases_ (pun intended) when performing the sliding window calculations. When the windows do not tile the input exactly, e.g., when the input dimensions are _not_ an integer multiple of the kernel size and stride, one ends up with something like the following:

![Fig. 2. Sliding window "overshoot" occurring along the perimeter of the input.](figures/2022-09-07-Figure-2-Sliding-Window-Overshoot.png)

$$
\textrm{Fig. 2. Sliding window "overshoot" occurring along the perimeter of the input.}
$$

A kernel with this type of mismatch will tend to "overshoot" the bounds of the input and lead to an _incomplete_ window of pixel values (shown in red). When this happens, the standard practice is to _discard_ the window and _drop_ all pixel values in that window segment. When using the TensorFlow [`tf.nn.convolution`](https://www.tensorflow.org/api_docs/python/tf/nn/convolution) or [`tf.nn.pool`](https://www.tensorflow.org/api_docs/python/tf/nn/pool) operations, this is the expected behaviour when using the [`padding='VALID'`](https://www.tensorflow.org/api_docs/python/tf/nn#valid_padding_2) argument. 
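The number of pixels dropped by this "discard the incomplete window" behaviour can be estimated along one axis as follows. This is a sketch under stated assumptions: `valid_windows` is a hypothetical helper, not a TensorFlow function, and it assumes the input is at least as long as the window.

```python
def valid_windows(n: int, f: int, s: int) -> tuple:
    """Return (number of complete windows, trailing pixels never covered)."""
    count = (n - f) // s + 1       # complete windows that fit along the axis
    covered = (count - 1) * s + f  # index just past the last complete window
    return count, n - covered


print(valid_windows(7, 3, 3))    # (2, 1): one trailing pixel is dropped
print(valid_windows(224, 3, 3))  # (74, 2): two trailing pixels are dropped
```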

While losing a few windows of computation might not seem like much, this effect compounds as more and more convolution or pooling layers are added to a network. With modern networks having handfuls of these layers, this can add up to a significant, unexpected reduction in resolution as your images pass through the network. To give [some perspective](https://blog.paperspace.com/padding-in-convolutional-neural-networks/), suppose we have a CNN with just four convolution layers, each with a kernel size of $3 \times 3$ (stride $1$, no padding). When we apply these layers to a $\left(28, 28\right)$ input image (e.g., an MNIST dataset image), we end up with a $\left(20, 20\right)$ image. By reducing the image down to this size, we lose a total of _eight rows and columns_. This is equivalent to a loss of nearly *29%* of the original image *along each dimension*. All together, we have a total pixel loss of a whopping **49%**, and that's not even factoring in our intended downsampling!
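The arithmetic behind this example can be reproduced directly. The sketch assumes each of the four layers uses a $3 \times 3$ kernel with a stride of $1$ and no padding, which is what takes a 28-pixel axis down to 20.

```python
size = 28
for _ in range(4):
    # Each unpadded 3x3 layer at stride 1 shrinks every axis by 2 pixels
    size = (size - 3) // 1 + 1

per_axis_loss = (28 - size) / 28    # fraction of rows (or columns) lost
pixel_loss = 1 - size**2 / 28**2    # fraction of all pixels lost

print(size)                      # 20
print(round(per_axis_loss, 4))   # 0.2857
print(round(pixel_loss, 4))      # 0.4898
```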

To circumvent this behaviour, padding is added to the edges of the input to "fill in" the missing values in the window. We described above two techniques for padding: _zero padding_ (left) or _[nearest-neighbour interpolation](https://en.wikipedia.org/wiki/Nearest-neighbor_interpolation) padding_ (right):

![Fig. 3. Two common padding techniques: (a) zero padding and (b) interpolation.](figures/2022-09-07-Figure-3-Sliding-Window-Padding-Techniques.png)

$$
\textrm{Fig. 3. Two common padding techniques: (a) zero padding and (b) nearest-neighbour interpolation padding.}
$$

With padding enabled (i.e., when passing in the [`padding='SAME'`](https://www.tensorflow.org/api_docs/python/tf/nn#same_padding_2) argument in the TensorFlow operation), we can preserve the image dimensions using one of these two techniques. Note that other techniques exist for populating padding values with e.g., summary statistics using an arithmetic mean.

#### Stride

Since we gave a disproportionately long time to padding, we won't go into that much detail about _stride_. Just know that this parameter also has an impact on the resulting dimensions of an input, and that stride can be expressed in terms of displacement in both the $x$- and $y$-direction. A stride of $\left(2, 3\right)$ indicates that the filter will be displaced by $2$ pixels on each horizontal pass, and that the filter will be displaced by $3$ pixels on each vertical pass. 
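To illustrate a stride that differs per direction, the following sketch applies max pooling to a single 2D array with a horizontal stride of $2$ and a vertical stride of $3$. The function name `max_pool_2d` and the $9 \times 8$ test input are assumptions made for illustration; incomplete windows at the edges are simply dropped.

```python
import numpy as np


def max_pool_2d(x, f_h, f_w, s_y, s_x):
    """Max pooling over a 2D array with separate vertical/horizontal strides."""
    out_h = (x.shape[0] - f_h) // s_y + 1
    out_w = (x.shape[1] - f_w) // s_x + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # The top-left corner of the window moves by the stride in each direction
            window = x[i * s_y:i * s_y + f_h, j * s_x:j * s_x + f_w]
            out[i, j] = window.max()
    return out


x = np.arange(9 * 8).reshape(9, 8)       # height 9, width 8
y = max_pool_2d(x, 2, 2, s_y=3, s_x=2)   # 2x2 window, stride (2, 3)
print(y.shape)  # (3, 4)
```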

## 2. Programming Task

### 2.1. Padding and Stride

You will have to implement two functions and a small script. The first function is a padding 
function. Using the input size and the pooling layer parameters (stride and filter size), 
this function finds the padding `wpad` and `hpad` (width and height padding) such that the 
input dimensions are padded.

**Note**: Udacity doesn't explicitly mention that `get_paddings` should return the padding dimensions such that the original input `array` dimensions are preserved after the intended pooling operation. Therefore, we will assume this to be the desired output in order to make use of the padding formulas we highlighted in Sect. 1.2. Udacity also fails to mention the desired output format of `get_paddings`, which we know is to be used with the Numpy [`np.pad()`](https://numpy.org/doc/stable/reference/generated/numpy.pad.html) function. Here we are calculating the array-like sequence for the `pad_width` argument, which is expected to be the sequence of values with which to pad _each_ dimension of the input `array`.

Suppose we have a $4 \times 4$ input matrix of zeros,

In [33]:
X = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

We can define a `pad_width` sequence in the following form:
```
pad_width = [[upper_width, lower_width], [left_width, right_width]]
```

Given that we want to, say, pad the edges of `X` such that the:
* Upper-most (top) edge has a padding width of `1`;
* Lower-most (bottom) edge has a padding width of `2`;
* Left-most edge has a padding width of `0`;
* Right-most edge has a padding width of `3`.

We obtain the following sequence:

In [34]:
pad_width = [[1, 2], [0, 3]]

Now testing this with the Numpy [`pad()`](https://numpy.org/doc/stable/reference/generated/numpy.pad.html) function using a constant padding value of `1` we obtain:

In [35]:
np.pad(X, pad_width, constant_values=1)

array([[1, 1, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1]])

Comparing the result against our specification, we see that we indeed obtained our matrix `X` padded with `1`s as intended.

In [None]:
### From Udacity's `pooling.py`

In [None]:
from typing import List


def get_paddings(array: np.ndarray, pool_size: int, 
                 pool_stride: int) -> List[List[int]]:
    """Returns the range of values to pad on each edge using Numpy.
    
    Computes the padding dimensions such that the input
    spatial dimensionality is preserved (akin to the
    TensorFlow Keras `padding="same"` argument).

    For more information on the Numpy `pad()` function, see:
    https://numpy.org/doc/stable/reference/generated/numpy.pad.html
    
    :param array: np.ndarray, the input data of dimensions [NxWxHxC], 
        i.e., [num. samples x width x height x channel count].
    :param pool_size: int, the symmetric kernel window dimension.
    :param pool_stride: int, the amount to displace the kernel window
        in the horizontal and vertical direction (symmetric).
    :returns: paddings, the Numpy `pad_width` padding sequences.
    """
    # Get the per-sample spatial dimensions
    _, w, h, _ = array.shape
    # Solve the output-size formula from Sect. 1.2 for the per-side padding `p`,
    # with the output size set equal to the input size. Floor division rounds
    # down, so the size is preserved exactly only when the total padding is even.
    wpad = (pool_stride * w - pool_stride + pool_size - w) // 2
    hpad = (pool_stride * h - pool_stride + pool_size - h) // 2
    # Pad `p` values on both sides of each spatial axis (the formula uses 2 * p);
    # the batch and channel axes are left unpadded
    return [[0, 0], [wpad, wpad], [hpad, hpad], [0, 0]]
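A different reading of the task is also possible. The `check_output` helper shown further down expects a `(1, 75, 75, 16)` output for a `(1, 224, 224, 16)` input with a 3x3 window and a stride of 3, which corresponds to padding each spatial axis to 225 rather than preserving 224 after pooling. The sketch below pads only as much as is needed for the windows to tile the input exactly; the name `get_tiling_paddings` is hypothetical, and the choice to pad only the trailing edge is an assumption.

```python
import numpy as np


def get_tiling_paddings(array, pool_size, pool_stride):
    """Pad the trailing edges so that no pooling window is incomplete."""
    _, w, h, _ = array.shape

    def extra(n):
        # Pixels left over after the last complete window along this axis
        remainder = (n - pool_size) % pool_stride
        # Pixels needed to complete one more window (0 if already exact)
        return (pool_stride - remainder) % pool_stride

    return [[0, 0], [0, extra(w)], [0, extra(h)], [0, 0]]


x = np.zeros((1, 224, 224, 16))
paddings = get_tiling_paddings(x, 3, 3)
padded = np.pad(x, paddings, mode="constant", constant_values=0)
print(paddings)      # [[0, 0], [0, 1], [0, 1], [0, 0]]
print(padded.shape)  # (1, 225, 225, 16)
```

With a padded width of 225, the output-size formula gives $(225 - 3) / 3 + 1 = 75$ along each spatial axis.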

The next function calculates the output dimensions after pooling given the padded array
dimensions and the pooling parameters (stride and filter size).

In [None]:
### From Udacity's `pooling.py`

In [None]:
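def get_output_size(shape, pool_size, pool_stride):
    """ 
    given input shape, pooling window and stride, output shape 
    args:
    - shape [list]: input shape
    - pool_size [int]: window size
    - pool_stride [int]: stride
    returns
    - output_shape [list]: output array shape
    """
    # `shape` is the already-padded shape, so the padding term of the
    # output-size formula drops out; incomplete windows are discarded
    new_w = (shape[1] - pool_size) // pool_stride + 1
    new_h = (shape[2] - pool_size) // pool_stride + 1
    # Batch size (first axis) and channel count (last axis) are unchanged
    return [shape[0], int(new_w), int(new_h), shape[3]]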

Finally, the script calculates the pooling layer output.

In [None]:
### From Udacity's `pooling.py`

In [None]:
import argparse

parser = argparse.ArgumentParser(description='Download and process tf files')
parser.add_argument('-f', '--pool_size', required=True, type=int, default=3,
                    help='pool filter size')
parser.add_argument('-s', '--stride', required=True, type=int, default=3,
                    help='stride size')
args = parser.parse_args()

input_array = np.random.rand(1, 224, 224, 16)
pool_size = args.pool_size
pool_stride = args.stride

# pad the input layer
paddings = get_paddings(input_array, pool_size, pool_stride)
padded = np.pad(input_array, paddings, mode='constant', constant_values=0)

# get output size
output_size = get_output_size(padded.shape, pool_size, pool_stride)
output = np.zeros(output_size)

# IMPLEMENT THE POOLING CALCULATION 
check_output(output)

You can run `python pooling.py` to check your implementation - note that the checking of the output will require input of a 3x3 filter and stride of 3.
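The pooling calculation left open in the script could be filled in along the following lines. This is a minimal sketch and not Udacity's reference solution: the function name `pool` and its `reduce_fn` parameter are hypothetical, the input is assumed to be already padded and laid out as `[N x W x H x C]`, and a plain double loop is used for readability rather than speed.

```python
import numpy as np


def pool(padded, pool_size, pool_stride, reduce_fn=np.max):
    """Apply a pooling reduction over the two spatial axes of an NWHC array."""
    n, w, h, c = padded.shape
    out_w = (w - pool_size) // pool_stride + 1
    out_h = (h - pool_size) // pool_stride + 1
    output = np.zeros((n, out_w, out_h, c))
    for i in range(out_w):
        for j in range(out_h):
            # Top-left corner of the current window
            wi, hj = i * pool_stride, j * pool_stride
            window = padded[:, wi:wi + pool_size, hj:hj + pool_size, :]
            # Reduce over the spatial axes only; batch and channels are kept
            output[:, i, j, :] = reduce_fn(window, axis=(1, 2))
    return output


padded = np.random.rand(1, 225, 225, 16)
output = pool(padded, pool_size=3, pool_stride=3)
print(output.shape)  # (1, 75, 75, 16)
```

Passing `reduce_fn=np.min` or `reduce_fn=np.mean` gives the Min and Average Pooling variants described in Sect. 1.1.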

In [None]:
### From Udacity's `utils.py`

In [None]:
def check_output(output):
    """
    checking the shape of the output
    """
    if output.shape == (1, 75, 75, 16):
        print('Success!')
    else:
        print('Failure')

## Tips

Pooling only affects the spatial dimensions and preserves the batch size (first axis of the padded array) 
and the number of channels (last axis).

## Credits


Helpful resources:
* [7.3. Padding and Strides | Dive into Deep Learning](https://d2l.ai/chapter_convolutional-neural-networks/padding-and-strides.html)
* [Tutorial 22 - Padding in Convolutional Neural Network by K. Naik | YouTube](https://www.youtube.com/watch?v=PGBop7Ka9AU)