# <center>Week-6: Starting with Semantic Segmentation

# <center> Transpose Convolution

#### Transposed Convolutions are used to upsample the input feature map to a desired output feature map using some learnable parameters.



## Why do we need transpose convolution?




One of the best ways for us to gain some intuition is by looking at examples from Computer Vision that use the transposed convolution. Most of these examples start with a series of regular convolutions to compress the input data into an abstract spatial representation, and then use transposed convolutions to decompress the abstract representation into something of use.


![](https://miro.medium.com/max/1400/1*1OabPemOWCLrpCwIUmIsCg.png)


A convolutional auto-encoder is tasked with recreating its input image after passing intermediate results through a ‘bottleneck’ of limited size. Uses of auto-encoders include compression, noise removal, colourisation and in-painting. Success depends on learning dataset-specific compression in the convolution kernels and dataset-specific decompression in the transposed convolution kernels. Why stop there though?


## Super Resolution
![](https://miro.medium.com/max/1400/1*eTMo62iQxKp9aR3b5iG14w.png)


With ‘super resolution’ the objective is to upscale the input image to a higher resolution, so transposed convolutions can be used as an alternative to classical methods such as bicubic interpolation. The argument that favours learnable convolution kernels over hand-crafted ones applies here too: the upsampling kernel is learned from data instead of being fixed in advance.

## Semantic Segmentation


Semantic segmentation is an example of using transposed convolution layers to decompress the abstract representation into a different domain (from the RGB image input). We output a class for each pixel of the input image, and then just for visualisation purposes, we render each class as a distinct colour (e.g. the person class shown in red, cars in dark blue, etc.).


![](https://miro.medium.com/max/1400/1*KICInky28yGdU9T45kIL5Q.jpeg)
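The per-pixel classification and colour rendering described above can be sketched in a few lines of NumPy. This is only an illustration: the `scores` array is random data standing in for a network output, and the three-class `palette` is an assumption made for the example, not the colour scheme of any particular dataset.

```python
import numpy as np

# Hypothetical network output: one score per class at every pixel,
# shape (height, width, num_classes). Random values stand in for real scores.
rng = np.random.default_rng(0)
height, width, num_classes = 4, 6, 3
scores = rng.normal(size=(height, width, num_classes))

# The predicted class of each pixel is the index of its largest score.
class_map = scores.argmax(axis=-1)            # shape (height, width)

# Illustrative palette (assumed for this sketch): one RGB colour per class.
palette = np.array([[0, 0, 0],        # class 0: background, black
                    [255, 0, 0],      # class 1: person, red
                    [0, 0, 139]],     # class 2: car, dark blue
                   dtype=np.uint8)

# Indexing the palette with the class map replaces each class index by its colour.
rgb_mask = palette[class_map]                 # shape (height, width, 3)

print(class_map.shape, rgb_mask.shape)
```

Note that the class map has the same height and width as the score map, which is why the network output needs to match the spatial size of the input image.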

1. The CNN layers we have seen so far, such as convolutional layers and pooling layers, typically reduce (downsample) the spatial dimensions (height and width) of the input, or keep them unchanged.

2. In semantic segmentation, which classifies at the pixel level, it is convenient if the spatial dimensions of the input and output are the same. For example, the channel dimension at one output pixel can hold the classification results for the input pixel at the same spatial position.

3. To achieve this, especially after the spatial dimensions have been reduced by CNN layers, we can use another type of CNN layer that increases (upsamples) the spatial dimensions of intermediate feature maps.

4. In this session, we will introduce transposed convolution, also called fractionally-strided convolution, which reverses the downsampling of spatial dimensions performed by a convolution (it restores the shape, not the original values).
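To make the downsample-then-upsample idea in the list above concrete, here is a minimal sketch of the usual output-size arithmetic. The function names are hypothetical helpers written for this example, and `padding` is assumed to be an integer number of pixels added on each side (Keras instead takes `'valid'` or `'same'`); any extra output padding is ignored.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial size after a regular convolution on an input of size n."""
    return (n + 2 * padding - kernel) // stride + 1


def transposed_conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial size after a transposed convolution with the same hyperparameters."""
    return (n - 1) * stride + kernel - 2 * padding


size = 16
down = conv_output_size(size, kernel=4, stride=2, padding=1)
up = transposed_conv_output_size(down, kernel=4, stride=2, padding=1)
print(size, "->", down, "->", up)   # prints: 16 -> 8 -> 16
```

With matching hyperparameters, the transposed convolution brings the feature map back to the size it had before the convolution.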

# Now let's understand how transpose convolution works 



### The basic operation that goes in a transposed convolution is explained below:
### 1. Consider a 2x2 encoded feature map which needs to be upsampled to a 3x3 feature map

![](https://miro.medium.com/max/97/1*BMJnnOKPhK8hoFP6sQ9edQ.png) 

Input Feature map

![](https://miro.medium.com/max/252/1*VxtMdM-DsGwIa51GyDx-XQ.png)

Output feature shape





### 2. We take a kernel of size 2x2 with unit stride and no padding (padding of zero).

![](https://miro.medium.com/max/102/1*e6UnrcsFRaOidCq7mwJpTA.png)

Kernel of size 2x2

### 3. Now we take the upper left element of the input feature map and multiply it with every element of the kernel

![](https://miro.medium.com/max/429/1*7hVid7EAqCPkG6sEjHMI5w.png)



### 4. Similarly, we do it for all the remaining elements of the input feature map

![](https://miro.medium.com/max/700/1*yxBd_pCiEVVwEQFmc-Heog.png)

### 5. As you can see, some elements of the resulting upsampled feature maps overlap. To resolve this, we simply add the elements at the overlapping positions

![](https://miro.medium.com/max/700/1*faRskFzI7GtvNCLNeCN8cg.png)




### 6. The resulting output will be the final upsampled feature map having the required spatial dimensions of 3x3

![](https://miro.medium.com/max/790/1*Lpn4nag_KRMfGkx1k6bV-g.gif)

In [None]:
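import numpy as np

def trans_conv(X, K):
    # Stride 1, no padding: the output grows by (kernel size - 1) along each axis.
    h, w = K.shape
    Y = np.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            # Scale the kernel by one input element and add it into the
            # matching output window; overlapping positions accumulate.
            Y[i: i + h, j: j + w] += X[i, j] * K
    return Y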

In [None]:
X = np.array([[0.0, 1.0], [2.0, 3.0]])
K = np.array([[0.0, 1.0], [2.0, 3.0]])
trans_conv(X, K)
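One way to see where the name "transposed" comes from is to write a regular convolution as a matrix multiplication and then multiply by the transpose of that matrix. The sketch below checks this numerically for the 2x2 example. `conv_matrix` is a hypothetical helper written for this illustration, and it assumes the cross-correlation convention used by deep learning libraries. `trans_conv` is repeated from the cell above so that the block runs on its own.

```python
import numpy as np


def trans_conv(X, K):
    # Same scatter-and-add procedure as in the cell above.
    h, w = K.shape
    Y = np.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Y[i:i + h, j:j + w] += X[i, j] * K
    return Y


def conv_matrix(K, in_shape):
    """Matrix W such that W @ x.ravel() is the unpadded, stride-1
    convolution (cross-correlation) of an input x of shape in_shape with K."""
    kh, kw = K.shape
    in_h, in_w = in_shape
    out_h, out_w = in_h - kh + 1, in_w - kw + 1
    W = np.zeros((out_h * out_w, in_h * in_w))
    for i in range(out_h):
        for j in range(out_w):
            for a in range(kh):
                for b in range(kw):
                    W[i * out_w + j, (i + a) * in_w + (j + b)] = K[a, b]
    return W


X = np.array([[0.0, 1.0], [2.0, 3.0]])
K = np.array([[0.0, 1.0], [2.0, 3.0]])

W = conv_matrix(K, in_shape=(3, 3))           # maps 9 input values to 4 outputs
Y_matrix = (W.T @ X.ravel()).reshape(3, 3)    # the transpose maps 4 values back to 9

print(trans_conv(X, K))
print(np.allclose(Y_matrix, trans_conv(X, K)))
```

Both routes give the 3x3 map `[[0, 0, 1], [0, 4, 6], [4, 12, 9]]`. The transposed convolution shares its connectivity pattern with the convolution, but it is not the inverse: it restores the shape, not the original values.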

## Experimenting with Transpose Convolution in TensorFlow

In [None]:
import tensorflow as tf


In [None]:
kz = 2 # kernel_size
st = 1 # strides

inputs = tf.random.uniform((1, 2, 2, 2)) # batch size, height, width, channels

trans_conv1 = tf.keras.layers.Conv2DTranspose(filters=3, kernel_size=kz, strides=st, padding='valid')

output = trans_conv1(inputs)

print(output)


tf.Tensor(
[[[[ 0.18330155 -0.5533184   0.22738452]
   [ 0.7073788  -0.4854805  -0.18231249]
   [-0.0832573  -0.05566709 -0.47303542]]

  [[ 0.08941013 -0.40070945 -0.5542858 ]
   [ 0.22750475 -0.39118344 -0.41565126]
   [ 0.25596932 -0.40971956 -0.1890703 ]]

  [[ 0.05023995 -0.06407139 -0.15725702]
   [-0.3692479  -0.09256556 -0.32696834]
   [ 0.3253673  -0.3716717   0.17068009]]]], shape=(1, 3, 3, 3), dtype=float32)


In [None]:
kz = 3 # kernel_size
st = 1 # strides

inputs = tf.random.uniform((1, 2, 2, 2)) # batch size, height, width, channels

trans_conv1 = tf.keras.layers.Conv2DTranspose(filters=3, kernel_size=kz, strides=st, padding='valid')

output = trans_conv1(inputs)

print(output)

tf.Tensor(
[[[[-1.08339183e-01  3.15186717e-02  1.78843811e-01]
   [ 5.68277240e-02  1.62411064e-01  6.94189072e-02]
   [ 1.70952469e-01  2.63806432e-04  2.94180959e-01]
   [ 3.30078453e-02  8.03290494e-03  8.00928026e-02]]

  [[ 4.76946235e-02 -2.31726453e-01  2.56205797e-01]
   [-2.77556658e-01 -1.52992055e-01  2.03182995e-01]
   [ 2.58328676e-01  1.91317558e-01  1.65269583e-01]
   [ 2.04924539e-01  6.48251176e-03  1.74412012e-01]]

  [[ 1.16140537e-01 -3.86820316e-01  3.94807868e-02]
   [ 4.58863080e-02 -1.90832764e-01  7.97833130e-02]
   [ 1.58888936e-01 -1.30844653e-01 -3.30444276e-01]
   [ 6.86421692e-02  2.61069722e-02 -1.96500689e-01]]

  [[-1.20005995e-01 -1.76531971e-01 -6.61680698e-02]
   [ 1.04666509e-01  5.34620732e-02 -8.39624256e-02]
   [ 3.54047775e-01 -1.57816976e-01 -9.18743908e-02]
   [ 1.59025311e-01 -2.01287925e-01 -2.54352335e-02]]]], shape=(1, 4, 4, 3), dtype=float32)
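The layer configurations tried in this section differ only in kernel size and stride. As a rough single-channel sketch (not how Keras implements the layer), the NumPy function from earlier can be extended with a stride to show where the spatial sizes come from. `trans_conv_strided` is a hypothetical name, and padding is assumed to be `'valid'`, i.e. none.

```python
import numpy as np


def trans_conv_strided(X, K, stride=1):
    """Single-channel transposed convolution with a stride and no padding."""
    h, w = K.shape
    out_h = (X.shape[0] - 1) * stride + h
    out_w = (X.shape[1] - 1) * stride + w
    Y = np.zeros((out_h, out_w))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            # Neighbouring input elements are written `stride` pixels apart.
            Y[i * stride:i * stride + h, j * stride:j * stride + w] += X[i, j] * K
    return Y


X = np.ones((2, 2))
for kernel_size, stride in [(2, 1), (3, 1), (3, 2)]:
    K = np.ones((kernel_size, kernel_size))
    print(kernel_size, stride, trans_conv_strided(X, K, stride).shape)
```

For a 2x2 input this gives spatial sizes 3x3, 4x4 and 5x5 respectively, following `(input - 1) * stride + kernel`. A larger stride spreads the input elements further apart in the output, which is what makes the upsampling factor grow.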


In [None]:
kz = 3 # kernel_size
st = 2 # strides

inputs = tf.random.uniform((1, 2, 2, 2)) # batch size, height, width, channels

trans_conv1 = tf.keras.layers.Conv2DTranspose(filters=3, kernel_size=kz, strides=st, padding='valid')

output = trans_conv1(inputs)

print(output)

tf.Tensor(
[[[[-0.04421911 -0.06894263  0.04900293]
   [-0.07226346  0.04708602  0.01601166]
   [-0.12643951 -0.23452944  0.17883417]
   [-0.3032666   0.19758974  0.06748375]
   [ 0.25124922  0.23036814 -0.11326155]]

  [[ 0.0310073  -0.09268826  0.00505753]
   [-0.01722443  0.02740151 -0.05210368]
   [ 0.0937487  -0.32524535  0.00128413]
   [-0.0727782   0.11478873 -0.21866834]
   [-0.15160584  0.266193   -0.08369448]]

  [[-0.14748995 -0.23048097  0.12484887]
   [-0.26959863  0.15750004  0.04686622]
   [-0.09692953  0.2968594  -0.34635746]
   [ 0.0451934  -0.11400144  0.10167167]
   [-0.03372756 -0.07746136 -0.28200796]]

  [[ 0.15572926 -0.41741964  0.01743059]
   [-0.00854215  0.14005695 -0.21540165]
   [-0.14558351  0.19590788 -0.0897291 ]
   [-0.03839193  0.02290654 -0.05892329]
   [-0.03401329  0.06334916 -0.01837292]]

  [[-0.18158177  0.21949588 -0.2511167 ]
   [ 0.1378204  -0.22878046  0.02679643]
   [-0.13393018  0.0150971  -0.25914615]
   [ 0.03021458 -0.02557989  0.0328286

## Difference between Conv2DTranspose and UpSampling2D in Keras

Transpose: https://keras.io/api/layers/convolution_layers/convolution2d_transpose/

UpSampling: https://keras.io/api/layers/reshaping_layers/up_sampling2d/
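In short, `UpSampling2D` has no trainable weights: with its default nearest-neighbour interpolation it simply repeats rows and columns. `Conv2DTranspose` learns its kernel during training. The NumPy sketch below illustrates the relationship on a single channel. The helper names are hypothetical and this is not the Keras implementation: nearest-neighbour upsampling by a factor of 2 coincides with a stride-2 transposed convolution whose 2x2 kernel is fixed to all ones, while any other kernel gives a different upsampling.

```python
import numpy as np


def nearest_upsample(X, factor=2):
    """Parameter-free upsampling: repeat every row and column `factor` times."""
    return np.repeat(np.repeat(X, factor, axis=0), factor, axis=1)


def trans_conv_strided(X, K, stride):
    """Single-channel transposed convolution with a stride and no padding."""
    h, w = K.shape
    Y = np.zeros(((X.shape[0] - 1) * stride + h, (X.shape[1] - 1) * stride + w))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Y[i * stride:i * stride + h, j * stride:j * stride + w] += X[i, j] * K
    return Y


X = np.array([[1.0, 2.0], [3.0, 4.0]])

fixed = nearest_upsample(X)                                   # nothing to learn
ones_kernel = trans_conv_strided(X, np.ones((2, 2)), stride=2)
other_kernel = trans_conv_strided(X, np.array([[1.0, 0.5], [0.5, 0.25]]), stride=2)

print(np.array_equal(fixed, ones_kernel))   # prints: True
print(other_kernel)
```

A transposed convolution can therefore represent plain nearest-neighbour upsampling as a special case, at the cost of extra parameters that have to be trained.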

In [None]:
# Reference: https://www.matthewzeiler.com/mattzeiler/deconvolutionalnetworks.pdf