# Convolutional Neural Networks

A good way to get an introduction to convolutional layers is an interactive widget that you can explore to see what the calculation looks like.

By default, images and other 2D feature maps in TensorFlow are represented as tensors of shape (B, H, W, C), where B is the batch size, H is the height, W is the width, and C is the number of channels or feature maps.

**Try to answer the following questions in as much detail as possible by playing with the parameters of the widget and observing the consequences**:

- How does the kernel size affect the number of parameters?


- How does the number of filters affect the number of parameters?


- How does the kernel size affect the height and width of the resulting feature maps?


- How does that change when padding is set to "same" instead of "valid"?


- What does the strides parameter do and how does it affect the output shape?


- How are different feature maps mixed with each other?

In [8]:
from IPython.display import IFrame
IFrame(src='https://spinkk.github.io/tensorflow_convolution_widget.html', width=1000, height=800)

# Different interpretations of convolutional inductive biases

Convolutional layers introduce an inductive bias, that is, a restriction of or preference for certain solutions to the modeling problem: they share weights across the spatial structure of their input or representations.

You can think of a convolution as sliding a filter template over the feature maps and computing, at each location, the correlation between the template and the feature map contents. The output of a convolutional layer can thus be read as values indicating how similar the kernel weights (the convolutional filter) are to the input at that location. In fact, a convolutional layer computes the dot product between the filter and the input patch, which, for a patch of fixed norm, is highest when the patch has the same spatial structure as the kernel (a template is found to match the input). While this is a nice and intuitive interpretation of convolutional layers in ANNs, it is certainly an oversimplification: each filter also linearly combines all input feature maps into a single new feature map.
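The template-matching reading can be sketched in plain NumPy. This is a minimal illustration and not how TensorFlow implements convolutions: the 3x3 template, the 12x12 image, and the paste location are made-up values, and only a single channel is used, so the mixing of feature maps is left out.

```python
import numpy as np

# Hypothetical 3x3 template with +1/-1 entries (a "plus" shape).
template = np.array([[-1.0,  1.0, -1.0],
                     [ 1.0,  1.0,  1.0],
                     [-1.0,  1.0, -1.0]])

# Empty 12x12 single-channel image with the template pasted at row 5, column 7.
image = np.zeros((12, 12))
image[5:8, 7:10] = template

# Slide the template over every valid position (stride 1, no padding) and
# take the dot product with the patch underneath.
k = template.shape[0]
out_size = image.shape[0] - k + 1
response = np.zeros((out_size, out_size))
for i in range(out_size):
    for j in range(out_size):
        patch = image[i:i + k, j:j + k]
        response[i, j] = np.sum(patch * template)

# Location of the strongest response in the output map.
best = tuple(int(v) for v in np.unravel_index(np.argmax(response), response.shape))
print("strongest response at:", best)
print("response value there:", response[best])
```

The response map peaks where the patch equals the template. In a trained network the filters are learned rather than hand-written, and each filter sums such responses over all input channels.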

Another perspective on convolutional layers comes from comparing them to linear or dense/fully connected layers. When the kernel size equals the input size (and padding is "valid"), a convolutional layer is equivalent to a fully connected layer in which the number of filters becomes the number of units (try it in the interactive widget!). Once we decrease the kernel size, we get spatial weight sharing, that is, we have fewer parameters than if we flattened the feature maps and fed the resulting vector to a fully connected layer. From this perspective, convolutional layers introduce weight sharing that acts as a regularizer. Indeed, on image-like data, convolutional architectures typically overfit less than fully connected architectures.
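The equivalence with a fully connected layer described above can be checked numerically. The following is a sketch under assumed sizes (a batch of 4 inputs of shape 8x8x3 and 5 filters, chosen only to keep the numbers small): it copies the weights of a full-size-kernel convolution into a dense layer and compares outputs and parameter counts.

```python
import numpy as np
import tensorflow as tf

# Small hypothetical input: batch of 4, 8x8 feature maps, 3 channels.
x = tf.random.normal((4, 8, 8, 3))

# Convolution whose kernel covers the whole input: one output position per filter.
full_conv = tf.keras.layers.Conv2D(filters=5, kernel_size=(8, 8), padding="valid")
conv_out = full_conv(x)  # shape (4, 1, 1, 5)

# Dense layer on the flattened input with as many units as there are filters.
x_flat = tf.reshape(x, (4, -1))  # shape (4, 192)
dense = tf.keras.layers.Dense(units=5)
dense(x_flat)  # call once so that the weights are created

# Copy the convolution weights into the dense layer.
kernel, bias = full_conv.get_weights()  # kernel shape (8, 8, 3, 5)
dense.set_weights([kernel.reshape(-1, 5), bias])
dense_out = dense(x_flat)

same_output = np.allclose(conv_out.numpy().reshape(4, 5), dense_out.numpy(), atol=1e-4)
print("same output:", same_output)
print("parameters, full-size kernel conv:", full_conv.count_params())
print("parameters, dense on flattened input:", dense.count_params())

# A 3x3 kernel reuses its weights at every position and needs far fewer parameters.
small_conv = tf.keras.layers.Conv2D(filters=5, kernel_size=(3, 3), padding="valid")
small_conv(x)  # call once so that the weights are created
print("parameters, 3x3 conv:", small_conv.count_params())
```

Both the full-size convolution and the dense layer have 8·8·3·5 + 5 parameters, while the 3x3 convolution only has 3·3·3·5 + 5.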


In [9]:
import tensorflow as tf
import numpy as np

In [20]:
input_tensor = tf.random.normal((32,64,64,16))

layer = tf.keras.layers.Conv2D(filters=12,
                               kernel_size=(3,3),
                               strides=(1,1),
                               padding="same")

print("output shape:", layer(input_tensor).shape)
# count_params() sums the sizes of all weights: kernel (3*3*16*12) plus bias (12)
print("number of parameters:", layer.count_params())

output shape: (32, 64, 64, 12)
number of parameters: 1740
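
The output shape and parameter count printed above can also be reproduced by hand. The helper functions below are hypothetical (they are not part of TensorFlow) and assume a dilation rate of 1; they follow the usual rules that "same" padding gives ceil(size / stride) and "valid" padding gives ceil((size - kernel + 1) / stride) along each spatial dimension.

```python
import math

def conv2d_output_size(size, kernel, stride=1, padding="valid"):
    """Output height or width of a 2D convolution along one spatial dimension."""
    if padding == "same":
        return math.ceil(size / stride)
    if padding == "valid":
        return math.ceil((size - kernel + 1) / stride)
    raise ValueError("padding must be 'same' or 'valid'")

def conv2d_param_count(kernel_h, kernel_w, in_channels, filters, use_bias=True):
    """One kernel of shape (kernel_h, kernel_w, in_channels) per filter, plus one bias each."""
    weights = kernel_h * kernel_w * in_channels * filters
    return weights + (filters if use_bias else 0)

# The layer from the cell above: 64x64 input, 16 channels, 12 filters, 3x3 kernel.
print("same, stride 1:", conv2d_output_size(64, 3, stride=1, padding="same"))
print("valid, stride 1:", conv2d_output_size(64, 3, stride=1, padding="valid"))
print("valid, stride 2:", conv2d_output_size(64, 3, stride=2, padding="valid"))
print("number of parameters:", conv2d_param_count(3, 3, 16, 12))
```

Note that the parameter count depends on the number of input channels but not on the height or width of the input, which is a direct consequence of spatial weight sharing.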


# Pooling operations

Pooling layers are often used in conjunction with convolutional layers. A pooling layer can be thought of as a convolution in the sense that it slides a window over the feature maps in a similar way. Instead of a learned weighted sum, however, it applies a fixed aggregation to each window: the mean for average pooling, or the maximum (a rank filter) for max pooling. Unlike a convolution, pooling is applied to each feature map separately and does not mix channels.

If the pooling window is set to the size of the input feature maps, we call this global pooling. It reduces n 2D feature maps to an n-dimensional vector: each feature map is reduced to a single number, either its mean or its maximum.
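
What pooling computes can be sketched in NumPy with reshapes and reductions. This is an illustrative sketch, not the TensorFlow implementation; the input shape (2, 4, 4, 3) is an arbitrary assumption, and the reshape trick only covers non-overlapping 2x2 windows on inputs with even height and width.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 4, 3))  # (B, H, W, C)

# 2x2 max pooling with stride 2: split H and W into blocks of 2 and
# take the maximum inside each block, separately for every channel.
b, h, w, c = x.shape
blocks = x.reshape(b, h // 2, 2, w // 2, 2, c)
max_pooled = blocks.max(axis=(2, 4))  # shape (2, 2, 2, 3)

# Global pooling: reduce over the full height and width of every feature map.
global_avg = x.mean(axis=(1, 2))  # shape (2, 3)
global_max = x.max(axis=(1, 2))   # shape (2, 3)

print("max pooled shape:", max_pooled.shape)
print("global average pooled shape:", global_avg.shape)
print("global max pooled shape:", global_max.shape)
```

In both cases the channel axis is untouched, so the number of feature maps stays the same.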

Pooling layers do not have any trainable variables. They can, however, contribute to inductive biases such as increased translation invariance (that is, the same object appearing in different places in a scene leads to the same or similar representations).
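
A toy example may help to see where this invariance comes from and where it ends. The sketch below uses a hypothetical helper `max_pool_2x2` and a made-up 4x4 feature map with a single active value: a shift that stays inside the same pooling window leaves the pooled output unchanged, a larger shift does not, and global max pooling ignores the position entirely.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling of a single (H, W) feature map."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# One active "feature" at (0, 0) ...
original = np.zeros((4, 4))
original[0, 0] = 1.0

# ... shifted within the same 2x2 pooling window ...
shifted_small = np.zeros((4, 4))
shifted_small[1, 1] = 1.0

# ... and shifted into a different pooling window.
shifted_large = np.zeros((4, 4))
shifted_large[2, 3] = 1.0

print("small shift, same pooled output:",
      np.array_equal(max_pool_2x2(original), max_pool_2x2(shifted_small)))
print("large shift, same pooled output:",
      np.array_equal(max_pool_2x2(original), max_pool_2x2(shifted_large)))
print("large shift, same global max:", original.max() == shifted_large.max())
```

Local pooling therefore only gives invariance to small shifts, which is why the text above speaks of increased rather than complete translation invariance.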

In convolutional architectures, global average pooling is often preferred over flattening the feature maps before feeding the resulting vector to a fully connected classification layer. The reason is that the resulting dimensionality is much smaller, so the fully connected layer has fewer weights, which tends to reduce overfitting.
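
The difference in the number of weights can be made concrete with a small comparison. The feature map shape matches the examples in this notebook; the 10 output classes are an assumption made only for illustration.

```python
import tensorflow as tf

# Feature maps as they might come out of the last convolutional layer.
feature_maps = tf.random.normal((32, 64, 64, 16))

# Option 1: flatten every feature map into one long vector per example.
flat = tf.keras.layers.Flatten()(feature_maps)  # shape (32, 65536)
dense_on_flat = tf.keras.layers.Dense(units=10)  # 10 classes (assumed)
dense_on_flat(flat)  # call once so that the weights are created

# Option 2: reduce every feature map to its mean first.
pooled = tf.keras.layers.GlobalAveragePooling2D()(feature_maps)  # shape (32, 16)
dense_on_pooled = tf.keras.layers.Dense(units=10)
dense_on_pooled(pooled)  # call once so that the weights are created

print("classifier parameters after flattening:", dense_on_flat.count_params())
print("classifier parameters after global average pooling:", dense_on_pooled.count_params())
```

After flattening, the classifier needs 64·64·16·10 + 10 parameters; after global average pooling it needs only 16·10 + 10.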

In [23]:
input_tensor = tf.random.normal((32,64,64,16))
pooling_layer = tf.keras.layers.MaxPool2D(pool_size=(2,2))
print("output shape (max pool):", pooling_layer(input_tensor).shape)

global_pooling_layer = tf.keras.layers.GlobalAveragePooling2D()
print("output shape (global avg pool):", global_pooling_layer(input_tensor).shape)

output shape (max pool): (32, 32, 32, 16)
output shape (global avg pool): (32, 16)
