# Exercise 1.6.1 — Fully Convolutional Networks
#### By Jonathan L. Moran (jonathan.moran107@gmail.com)
From the Self-Driving Car Engineer Nanodegree offered at Udacity.

## Objectives

* Use the $1\times1$ [`Conv2D`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D) convolutional layer to effectively "replace" a [`Dense`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) layer while preserving (not modifying) the spatial dimensions of the input.

## 1. Introduction

In [None]:
### Importing the required modules

In [None]:
import numpy as np
import os
import tensorflow as tf
from typing import List, Union, Tuple

In [None]:
tf.__version__

'2.9.2'

In [None]:
tf.test.gpu_device_name()

''

In [None]:
### Setting the environment variables

In [None]:
ENV_COLAB = True                # True if running in Google Colab instance

In [None]:
# Root directory
DIR_BASE = '' if not ENV_COLAB else '/content/'

In [None]:
# Subdirectory to save output files
DIR_OUT = os.path.join(DIR_BASE, 'out/')
# Subdirectory pointing to input data
DIR_SRC = os.path.join(DIR_BASE, 'data/')

In [None]:
### Creating subdirectories (if not exists)
os.makedirs(DIR_OUT, exist_ok=True)

### 1.1. Fully Convolutional Networks

A [fully convolutional network](https://d2l.ai/chapter_computer-vision/fcn.html#fully-convolutional-networks) is a type of convolutional neural network (CNN) that extracts feature maps from input images. Unlike traditional CNNs, a fully convolutional network transforms the dimensions of the intermediate feature maps (i.e., output tensors) back to the original height and width of the input image using [transposed convolutions](https://d2l.ai/chapter_computer-vision/transposed-conv.html#sec-transposed-conv) (covered in Sect. 1.2). What makes fully convolutional networks unique is their ability to preserve 2D spatial information; this is unlike [fully _connected_ networks](https://d2l.ai/chapter_convolutional-neural-networks/why-conv.html), which flatten a 2D image into a 1D vector.

Convolution layers serve as the backbone of modern-day perception nets. They operate under a set of working principles and constraints (translation equivariance, spatial locality, efficiency) that make otherwise intractable image processing workloads feasible. With fully convolutional networks we can reduce spatial dimensionality and process images of arbitrary depth (i.e., number of channels), from three-channel RGB images to n-channel satellite [spectral images](https://en.wikipedia.org/wiki/Spectral_imaging), by modifying the number of filters used and the number of output channels expected in the convolutional layers.

In summary, fully convolutional networks (FCNs) allow us to take advantage of the spatial relationships between pixels in input images. The convolutional layers making up an FCN can be configured to handle images with three or more channels and can either preserve, reduce or upscale the spatial dimensions of their respective inputs. This flexibility is what often motivates the use of fully convolutional networks in both research and practice today.

#### From Fully-Connected to Fully-Convolutional Layers

In this task, we rewrite a Dense fully-connected layer ([`tf.keras.layers.Dense`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense)) as a 2D Convolutional layer ([`tf.keras.layers.Conv2D`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D)). To do so, we follow two simple guidelines for converting a fully-connected layer to a fully-convolutional layer:
   * The number of _outputs_ of the fully-connected layer becomes the number of _kernels_ (i.e., filters, or output channels) of the fully-convolutional layer;
   * The number of _inputs_ of the fully-connected layer becomes the number of _weights_ in each kernel of the fully-convolutional layer.
    
In addition to the above rules-of-thumb, we also enforce the following specifications for our 1x1 2D Convolutional layer:
   * **Filter size**: $1\times 1$;
   * **Stride**: $1$;
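   * **Padding**: Zero-padding (with a $1\times 1$ filter and a stride of $1$ no padding is actually added, so the `'SAME'` and `'VALID'` settings produce the same output shape).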
   
Note that the filter size is used interchangeably with **kernel size** here. By setting our filter to a size of $1\times 1$, we make the convolutional layer analogous to a fully-connected layer applied independently at every pixel; unlike the fully-connected layer, however, the convolutional layer lets the network preserve the spatial information in the input tensor. This is in contrast with the conventional use of a Dense fully-connected layer, where the input image is first flattened into a one-dimensional vector and its spatial information is lost.
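The analogy between a $1\times 1$ convolution and a fully-connected layer can be checked numerically. The following is a minimal NumPy sketch rather than part of the original exercise; the helper names `conv_1x1_naive` and `dense_last_axis`, as well as the tensor sizes, are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration
H, W, C_IN, C_OUT = 4, 5, 3, 2
x = rng.standard_normal((H, W, C_IN))          # one input "image"
weights = rng.standard_normal((C_IN, C_OUT))   # C_IN weights per kernel, C_OUT kernels
bias = rng.standard_normal(C_OUT)


def conv_1x1_naive(x, weights, bias):
    """Slides a 1x1 window over every pixel (stride 1, no padding)."""
    out = np.zeros(x.shape[:2] + (weights.shape[1],))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            # Each 1x1 kernel only sees the channels of a single pixel
            out[i, j] = x[i, j] @ weights + bias
    return out


def dense_last_axis(x, weights, bias):
    """Applies the same fully-connected mapping along the channel axis."""
    return x @ weights + bias


conv_out = conv_1x1_naive(x, weights, bias)
dense_out = dense_last_axis(x, weights, bias)

print(conv_out.shape)                    # height and width are unchanged
print(np.allclose(conv_out, dense_out))  # both mappings agree
```

Here the number of kernels (`C_OUT`) plays the role of the number of outputs of the fully-connected layer, and each kernel holds `C_IN` weights, one per input.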

The ability to preserve 2D spatial information can be used to identify features in the input image and make predictions more accurately when compared to the original fully-connected network. Additionally, because the weights of a $1\times 1$ convolutional layer are shared across all pixel locations, it can use far fewer parameters than a fully-connected layer applied to the flattened image, which makes it more computationally efficient. By choosing a filter size of $1\times 1$, we keep the number of network parameters low while maintaining the full height and width of the input throughout the network. This might not be the case when selecting larger filter sizes (e.g., $3\times 3$ or $5\times 5$), which require more weights per kernel and, without padding, shrink the height and width of the output.
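To make the parameter argument concrete, we can count the weights and biases by hand. This is a rough sketch under stated assumptions: the helper functions `dense_params` and `conv_params` are hypothetical, the $32\times 32$ RGB input is an arbitrary choice, and the comparison is not like-for-like, since the Dense layer on a flattened image yields 10 values in total whereas the convolutional layers yield 10 channels per pixel.

```python
def dense_params(height, width, channels_in, units):
    """Weights + biases of a Dense layer fed with the flattened image."""
    return height * width * channels_in * units + units


def conv_params(kernel_h, kernel_w, channels_in, filters):
    """Weights + biases of a 2D convolutional layer."""
    return kernel_h * kernel_w * channels_in * filters + filters


# Hypothetical 32x32 RGB input mapped to 10 outputs / output channels
n_dense = dense_params(32, 32, 3, 10)
n_conv_1x1 = conv_params(1, 1, 3, 10)
n_conv_3x3 = conv_params(3, 3, 3, 10)
n_conv_5x5 = conv_params(5, 5, 3, 10)

print(n_dense, n_conv_1x1, n_conv_3x3, n_conv_5x5)
# 30730 40 280 760
```

The convolutional parameter counts do not depend on the height or width of the input, which is what allows the same layer to process images of different sizes.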

### 1.2. Transposed Convolutions

[Transposed convolutions](https://d2l.ai/chapter_computer-vision/transposed-conv.html) [upsample](https://en.wikipedia.org/wiki/Upsampling) the output of a previous layer to a higher resolution, i.e., to larger spatial dimensions (height and width). This is in contrast with typical convolutional layers, which tend to downsample (reduce) the spatial dimensions of their input. Because they reverse this change in spatial dimensions, transposed convolution layers are sometimes referred to as "deconvolution" layers.

The word "transposed" here refers to the act of _transferring something_ to a different place or context. Here, transposed convolutions _transfer_ patches of data from a matrix onto a [sparse tensor](https://nvidia.github.io/MinkowskiEngine/tutorial/sparse_tensor_basic.html) (i.e., a tensor with many zero-valued entries). Using the patches of data, the sparse regions of the tensor are assigned values ("filled"). A neat animation of this process created by [V. Dumoulin](https://github.com/vdumoulin/conv_arithmetic) [1] is replicated below:

<table style="width:100%; table-layout:fixed;">
  <tr>
    <td><img width="150px" src="figures/conv_arithmetic/no_padding_no_strides_transposed.gif"></td>
    <td><img width="150px" src="figures/conv_arithmetic/arbitrary_padding_no_strides_transposed.gif"></td>
    <td><img width="150px" src="figures/conv_arithmetic/same_padding_no_strides_transposed.gif"></td>
    <td><img width="150px" src="figures/conv_arithmetic/full_padding_no_strides_transposed.gif"></td>
  </tr>
  <tr>
    <td>No padding, no strides, transposed</td>
    <td>Arbitrary padding, no strides, transposed</td>
    <td>Half padding, no strides, transposed</td>
    <td>Full padding, no strides, transposed</td>
  </tr>
  <tr>
    <td><img width="150px" src="figures/conv_arithmetic/no_padding_strides_transposed.gif"></td>
    <td><img width="150px" src="figures/conv_arithmetic/padding_strides_transposed.gif"></td>
    <td><img width="150px" src="figures/conv_arithmetic/padding_strides_odd_transposed.gif"></td>
    <td></td>
  </tr>
  <tr>
    <td>No padding, strides, transposed</td>
    <td>Padding, strides, transposed</td>
    <td>Padding, strides, transposed (odd)</td>
    <td></td>
  </tr>
</table>


$$
\begin{align}
\textrm{Figure 1. Transposed Convolution Operations — Visualised (credit: V. Dumoulin).}
\end{align}
$$

In the above animations by V. Dumoulin [1], we note that the blue grid corresponds to an input feature map, whereas the cyan grid is the output feature map.
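The "no padding, no strides" case can also be written out in a few lines of NumPy. The sketch below is an assumption-laden illustration (single channel, stride of one, no padding; the function name `trans_conv2d` is hypothetical) and is not the implementation used by TensorFlow.

```python
import numpy as np


def trans_conv2d(x, kernel):
    """Basic transposed convolution (stride 1, no padding, single channel)."""
    in_h, in_w = x.shape
    k_h, k_w = kernel.shape
    out = np.zeros((in_h + k_h - 1, in_w + k_w - 1))
    for i in range(in_h):
        for j in range(in_w):
            # Each input element scales the whole kernel; the scaled copy is
            # added to the output window that starts at offset (i, j)
            out[i:i + k_h, j:j + k_w] += x[i, j] * kernel
    return out


x = np.array([[0.0, 1.0], [2.0, 3.0]])
kernel = np.array([[0.0, 1.0], [2.0, 3.0]])
y = trans_conv2d(x, kernel)
print(y.shape)  # (3, 3): the 2x2 input has been upsampled
# y equals [[0, 0, 1], [0, 4, 6], [4, 12, 9]]
```

Where neighbouring output windows overlap, the contributions are summed, which is why the centre value of `y` receives terms from three non-zero input elements.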

#### Why "transposed"?

The name comes from the matrix transpose. To see why, recall that a convolution can be implemented as a matrix multiplication:

Given an input vector $\mathrm{x}$ and a weight matrix $\mathrm{W}$, the forward propagation function of the convolution is given by $\mathrm{y} = \mathrm{W}\cdot \mathrm{x}$, where $\mathrm{y}$ is the output vector. The backpropagation function of the convolution multiplies the gradient arriving from the next layer by the transposed weight matrix $\mathrm{W}^{\top}$, since by the chain rule $\nabla_{\mathrm{x}}\mathrm{y} = \mathrm{W}^{\top}$. The transposed convolution layer simply exchanges these two functions: its forward propagation multiplies the input by $\mathrm{W}^{\top}$, and its backpropagation multiplies by $\mathrm{W}$. Since the math is essentially the same as with traditional convolution layers, differentiability is retained, and transposed convolution layers can be trained with backpropagation in the same way as regular convolution layers.
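The matrix view can be verified on a small example. In the sketch below, the helper `conv_as_matrix` is a hypothetical name, and the setup (a single-channel $3\times 3$ input, a $2\times 2$ kernel, stride of one, no padding) is an assumption made to keep $\mathrm{W}$ small enough to inspect.

```python
import numpy as np


def conv_as_matrix(kernel, in_h, in_w):
    """Builds W such that W @ x.ravel() is the un-padded, stride-1 convolution."""
    k_h, k_w = kernel.shape
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    W = np.zeros((out_h * out_w, in_h * in_w))
    for i in range(out_h):
        for j in range(out_w):
            # Row (i, j) of W holds the kernel placed at offset (i, j)
            placed = np.zeros((in_h, in_w))
            placed[i:i + k_h, j:j + k_w] = kernel
            W[i * out_w + j] = placed.ravel()
    return W


rng = np.random.default_rng(0)
kernel = rng.standard_normal((2, 2))
x = rng.standard_normal((3, 3))   # input of the regular convolution
z = rng.standard_normal((2, 2))   # input of the transposed convolution

W = conv_as_matrix(kernel, 3, 3)  # shape (4, 9)

# Regular convolution: y = W x, compared against a sliding-window loop
y = (W @ x.ravel()).reshape(2, 2)
y_direct = np.array([[np.sum(x[i:i + 2, j:j + 2] * kernel) for j in range(2)]
                     for i in range(2)])

# Transposed convolution: multiply by W^T instead, compared against
# adding a scaled copy of the kernel for every input element
up = (W.T @ z.ravel()).reshape(3, 3)
up_direct = np.zeros((3, 3))
for i in range(2):
    for j in range(2):
        up_direct[i:i + 2, j:j + 2] += z[i, j] * kernel

print(np.allclose(y, y_direct), np.allclose(up, up_direct))
```

Multiplying by $\mathrm{W}$ maps 9 values to 4 (downsampling), whereas multiplying by $\mathrm{W}^{\top}$ maps 4 values to 9 (upsampling), using the same kernel weights in both directions.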

#### More details

The transposed convolution layer, like the regular convolution layer, also depends on the configuration of the [padding and stride](https://d2l.ai/chapter_convolutional-neural-networks/padding-and-strides.html) amounts. Unlike the traditional convolution layer, the transposed convolution layer applies the padding to the _output_ rather than the _input_. So with a $\left(\textrm{height}, \textrm{width}\right)$ padding of e.g., $(1, 1)$, the transposed layer will _remove_ the first and last rows and columns of the output tensor.

The _stride_ of a transposed convolution layer also works slightly differently than in the regular convolution layer: it is specified for the intermediate results (and thus the _output_ feature map) rather than for the input. Each input element produces an intermediate kernel-sized tensor, and the stride sets how far apart adjacent intermediate tensors are placed in the output. Changing the $\left(\textrm{height}, \textrm{width}\right)$ stride from e.g., $(1, 1)$ to $(2, 2)$ therefore spreads the intermediate tensors further apart, which reduces the overlap between them and increases both the height and width of the _output_ feature map. With a regular convolution layer, by contrast, the stride sets how far the kernel is shifted over the _input_ feature map between adjacent sliding-window positions.
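Both effects can be sketched in NumPy. The function below is a hypothetical single-channel helper, not TensorFlow's implementation; its integer `padding` argument follows the explicit-padding convention described above, which is an assumption and differs from the string-valued `'SAME'` and `'VALID'` settings of Keras.

```python
import numpy as np


def trans_conv2d(x, kernel, stride=1, padding=0):
    """Transposed convolution with stride and padding (single channel)."""
    in_h, in_w = x.shape
    k_h, k_w = kernel.shape
    # Size of the full (un-cropped) result
    full_h = (in_h - 1) * stride + k_h
    full_w = (in_w - 1) * stride + k_w
    out = np.zeros((full_h, full_w))
    for i in range(in_h):
        for j in range(in_w):
            # The stride spaces out where each scaled kernel copy lands
            r, c = i * stride, j * stride
            out[r:r + k_h, c:c + k_w] += x[i, j] * kernel
    # Padding is applied to the output: crop rows and columns on every side
    return out[padding:full_h - padding, padding:full_w - padding]


x = np.array([[0.0, 1.0], [2.0, 3.0]])
kernel = np.array([[0.0, 1.0], [2.0, 3.0]])

y_plain = trans_conv2d(x, kernel)              # shape (3, 3)
y_padded = trans_conv2d(x, kernel, padding=1)  # shape (1, 1), equals [[4]]
y_strided = trans_conv2d(x, kernel, stride=2)  # shape (4, 4), no overlap

print(y_plain.shape, y_padded.shape, y_strided.shape)
```

Under this convention, each output dimension is $(n - 1)\cdot s + k - 2p$ for an input size $n$, kernel size $k$, stride $s$ and padding $p$.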

For example, assume an input feature map of dimensions $3\times 3 \times 1$ that we wish to upsample to dimensions $6\times 6\times 1$. We can initialise a [`tf.keras.layers.Conv2DTranspose`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2DTranspose) layer with a stride of $(2, 2)$, a kernel size of $(1, 1)$ and a padding configuration of `"SAME"` to obtain an output feature map of dimensions $6\times 6\times 1$. To illustrate this in code, consider the following:

In [None]:
# Defining our input feature map
# Input dimensions: [BATCH_SIZE, HEIGHT_IN, WIDTH_IN, CHANNELS_IN]
x = tf.convert_to_tensor(
    np.random.randn(1, 3, 3, 1), 
    dtype=tf.float32
)
x.shape

TensorShape([1, 3, 3, 1])

In [None]:
### Creating the Conv2DTranspose layer with above configurations
model_conv_t = tf.keras.models.Sequential()
model_conv_t.add(
    tf.keras.layers.Conv2DTranspose(
        filters=1,                  # Num. output channels
        kernel_size=(1, 1),
        strides=(2, 2),
        padding='SAME'
    )
)

In [None]:
### Applying the input feature map to the transposed convolution layer
model_conv_t(x).shape

TensorShape([1, 6, 6, 1])

With this example we see that the dimensions of the output feature map are indeed $6\times 6 \times 1$ (note: the first dimension of value $1$ is the batch size; here `x` contains only one input sample).
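It is worth looking at which output positions this configuration can actually fill. The NumPy sketch below mimics what we would expect from a $1\times 1$ kernel with a stride of $2$; it assumes a zero bias and a single hypothetical kernel weight `w`, and it is an illustration rather than a reproduction of the Keras layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 3))   # stands in for the 3x3x1 input feature map
w = 0.5                           # hypothetical value of the single 1x1 kernel weight

# With a 1x1 kernel and a stride of 2, each input value is scaled by the
# kernel weight and written to every other row and column of the output
out = np.zeros((6, 6))
out[::2, ::2] = w * x

zero_fraction = np.mean(out == 0.0)
print(out.shape, zero_fraction)   # three quarters of the entries stay zero
```

The output has the desired $6\times 6$ size, but only the 9 positions reached by the kernel carry information. A kernel at least as large as the stride would be needed for every output position to receive a contribution from the input.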

## 2. Programming Task

NOTE: the code provided here has been migrated to the TensorFlow 2.x API. Some functionality may differ from the original implementation.

### 2.1. Fully Convolutional Layer

In [None]:
### From Udacity's `quiz.py`

In [None]:
# custom init with the seed set to 0 by default
def custom_init(
        shape: Union[tf.Tensor, List[int], Tuple[int]], 
        dtype: tf.dtypes.DType=tf.float32,
        seed: int=0,
        partition_info=None
) -> tf.Tensor:
    """Initialises the weights of a layer.
    
    Samples the values at random from a standard normal distribution
    (zero mean, unit standard deviation).
    
    :param shape: Shape of the weight tensor to create.
    :param dtype: Data type of the weight tensor values to return.
    :param seed: Value of the random seed to use.
    :param partition_info: Optional info about partitioning of a tensor,
        not used in TF2.x API.
    :returns: weight tensor of randomly initialised values.
    """
    return tf.random.normal(
        shape=shape, 
        dtype=dtype, 
        seed=seed
    )


def conv_1x1(
        filters: int,
        kernel_size: Union[List[int], Tuple[int], tf.Tensor]=(1, 1),
        stride: Union[int, List[int], Tuple[int, int]]=1
) -> tf.keras.layers.Conv2D:
    """Initialises a 1x1 2D Convolutional layer.
    
    To convert a fully-connected to a fully-convolutional layer, we initialise
    the 2D Convolutional layer parameters according to the following:
       1. The number of outputs becomes the number of kernels (filters),
       2. The number of inputs becomes the number of weights in each kernel.
    
    NOTE: The `tf.layers.conv2d` API has been deprecated since TF1.15,
    therefore we use the `tf.keras.layers.Conv2D` layer in TF2.x to initialise
    the layer with modified arguments.
    
    :param filters: the number of output channels (i.e., number of kernels).
    :param kernel_size: the dimensions of the kernel, i.e., size of the window
    used in the convolution / sliding window operations.
    :param stride: the amount of pixels to "shift" the filter over the input on
    each sliding window operation.
    :returns: the configured `Conv2D` layer.
    """
    return tf.keras.layers.Conv2D(
        filters=filters,
        kernel_size=kernel_size,
        strides=stride,
        padding='VALID',
        kernel_initializer=custom_init
    )

#### Testing the FCN layer

In [None]:
### Setting the parameters
# Number of output channels (i.e., number of kernels)
NUM_OUTPUTS = 2
KERNEL_SIZE = (1, 1)
# Number of pixels to "move over" for each sliding window operation
STRIDE = (1, 1)
# Batch size (i.e., number of samples per iteration)
BATCH_SIZE = 1
# Number of input channels
CHANNELS_IN = 1

In [None]:
### Creating an input tensor

In [None]:
x = tf.convert_to_tensor(
    np.random.randn(BATCH_SIZE, NUM_OUTPUTS, NUM_OUTPUTS, CHANNELS_IN), 
    dtype=tf.float32
)
x

<tf.Tensor: shape=(1, 2, 2, 1), dtype=float32, numpy=
array([[[[ 0.5990718 ],
         [ 0.34034762]],

        [[-1.0601754 ],
         [-0.4629273 ]]]], dtype=float32)>

The above "dataset" is defined as a 4-D tensor with a number of samples equal to one (i.e., `BATCH_SIZE = 1`). Since we are going to be comparing the effect of the `Conv2D` layer to the `Dense` layer on the output image size, we set our input "image" to be of size (`NUM_OUTPUTS`, `NUM_OUTPUTS`, `CHANNELS_IN`). That is, we expect the height and width of the tensor to be the same (unmodified) after it is passed through either the `Conv2D` or the `Dense` layer; only the number of channels changes, from `CHANNELS_IN` to `NUM_OUTPUTS`.

In [None]:
type(x)

tensorflow.python.framework.ops.EagerTensor

In [None]:
# [batch_size, in_height, in_width, in_channels]
x.shape

TensorShape([1, 2, 2, 1])

##### Creating the Conv2D layer model

In [None]:
model_conv = tf.keras.models.Sequential()
model_conv.add(
    conv_1x1(
        filters=NUM_OUTPUTS, 
        kernel_size=KERNEL_SIZE, 
        stride=STRIDE
    )
)

##### Creating the Dense layer model

In [None]:
model_dense = tf.keras.models.Sequential()
model_dense.add(
    tf.keras.layers.Dense(
        units=NUM_OUTPUTS,
        kernel_initializer=custom_init
    )
)

##### Passing the input tensor through each model

In [None]:
conv_out = model_conv(x)

In [None]:
dense_out = model_dense(x)

##### Comparing the output shape

In [None]:
conv_out.shape == dense_out.shape

True

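With the $1\times 1$ convolutional layer configuration, we observe here that both network layers produce the same output shape of $(1, 2, 2, 2)$. Note that the Keras `Dense` layer acts on the last axis of its input, so no flattening takes place here and the height and width of the input are preserved by both layers.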

### 2.2. Transposed Convolutions

In [None]:
# TODO.

## 3. Closing Remarks

##### Alternatives
* TODO.

##### Extensions of task
* TODO.

## 4. Future Work

* TODO.

## Credits

This assignment was prepared by David Siller, Kelvin Lwin et al., 2020 (link [here]).

References:
* [1] Dumoulin, V. et al. A Guide to Convolution Arithmetic for Deep Learning. arXiv. [doi:10.48550/arXiv.1603.07285](https://arxiv.org/abs/1603.07285).


Helpful resources:
* [14.10. Transposed Convolution | Dive Into Deep Learning](https://d2l.ai/chapter_computer-vision/transposed-conv.html).