# Introduction

The semantic segmentation task requires predicting object masks at the <u>original resolution</u> of the input image.
But traditional convolutional networks construct feature maps that are smaller and coarser than the input.

How do we deal with this?

**Deconvolutional networks** extend standard convolutional models with a series of unpooling and deconvolution (transposed convolution) layers that gradually restore the spatial dimensions to the original size.

<img src="img/deconvnet.png" width=1000>

While reconstructing the high-resolution output, such networks use almost no context information. Nevertheless, this architecture works well in practice.

So these deconvolutional networks basically learn how to interpolate.
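To make the "learning to interpolate" remark concrete, here is a minimal 1D sketch in plain Python. The function name `transposed_conv1d` and the fixed triangular kernel are assumptions chosen for illustration; a real network would learn the kernel weights rather than fix them.

```python
def transposed_conv1d(x, kernel, stride):
    """1D transposed convolution: every input value adds a scaled copy
    of the kernel into the output, shifted by `stride` per input step."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out


# With this hand-picked triangular kernel the layer performs 2x linear
# interpolation: new samples are the average of their two neighbours.
coarse = [1.0, 3.0, 5.0]
upsampled = transposed_conv1d(coarse, kernel=[0.5, 1.0, 0.5], stride=2)

# The first and last output samples only see one input value, so drop them.
print(upsampled[1:-1])
```

With learned weights the same layer can represent this interpolation or something more adapted to the data.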

### Without upsampling

One way to preserve resolution without upsampling is to set all convolution strides to 1. 

This lets us remove all the pooling layers from the network, which is the main idea of **Fully Convolutional Networks** (FCNs): they are built only from convolutions. This is now a mainstream architecture.

**But** if we want a large receptive field (i.e. large kernels), the number of parameters grows quadratically with the kernel size, and the layers quickly become too expensive to train and run.
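A quick back-of-the-envelope count shows how fast a dense kernel grows. The helper `conv2d_params` and the choice of 256 input and output channels are assumptions made only for this illustration.

```python
def conv2d_params(kernel_size, in_channels, out_channels, bias=True):
    """Number of learnable parameters in one dense 2D convolution layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# Parameter count of a single layer as the kernel grows (256 -> 256 channels).
for k in (3, 7, 15, 31):
    print(f"{k}x{k} kernel: {conv2d_params(k, 256, 256):,} parameters")
```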

Do FCNs use dilated convolutions too?

### Dilated / Atrous Convolutions

Dilated (or atrous) convolutions are convolutions with a sparse kernel. Only a subset of the positions in such a kernel holds learnable parameters; the others are always set to zero.

Dilation factor D = the step between learnable kernel positions along each dimension.
- D = 1, every pixel is learnable (standard convolution)
- D = 2, every second pixel is learnable
- D = 3, every third pixel is learnable
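The sparse-kernel view can be sketched in one dimension. The helpers `dilate_kernel` and `conv1d_valid` are hypothetical names written for this note; deep learning libraries usually skip the zero positions instead of materialising them.

```python
def dilate_kernel(weights, dilation):
    """Place the learnable weights `dilation` positions apart, zeros in between."""
    size = (len(weights) - 1) * dilation + 1
    sparse = [0.0] * size
    for i, w in enumerate(weights):
        sparse[i * dilation] = w
    return sparse


def conv1d_valid(signal, kernel):
    """Plain sliding-window product without padding ('valid' mode)."""
    n_out = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n_out)
    ]


weights = [1.0, 2.0, 3.0]            # the only learnable values
for d in (1, 2, 3):
    print(d, dilate_kernel(weights, d))

# A dilated convolution equals a normal convolution with the sparse kernel.
signal = [float(v) for v in range(8)]
print(conv1d_valid(signal, dilate_kernel(weights, 2)))
```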

<img src="img/dilated1.png" width=500>

**Purpose:** Dilated convolutions make it possible to increase the receptive field (the area covered by the kernel) while maintaining the same number of parameters.
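The trade-off can be checked with simple arithmetic. The sketch below assumes stride-1 layers, for which each layer widens the receptive field by `(kernel_size - 1) * dilation`; the function names and the particular dilation schedule (1, 2, 4) are illustrative assumptions.

```python
def receptive_field(layers):
    """Receptive field along one axis for a stack of stride-1 convolutions.
    `layers` is a list of (kernel_size, dilation) pairs."""
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf


def weights_per_channel_pair(layers):
    """Learnable 2D weights per (input channel, output channel) pair."""
    return sum(kernel_size * kernel_size for kernel_size, _ in layers)


plain = [(3, 1), (3, 1), (3, 1)]      # three ordinary 3x3 layers
dilated = [(3, 1), (3, 2), (3, 4)]    # same layers with growing dilation
dense = [(15, 1)]                     # one big dense kernel

for name, stack in (("plain", plain), ("dilated", dilated), ("dense", dense)):
    print(name, receptive_field(stack), weights_per_channel_pair(stack))
```

Under these assumptions the dilated stack sees as far as the single 15x15 kernel while keeping the parameter count of three 3x3 layers.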



Of course, they may fail to detect some high-frequency patterns, but in practice they tend to work pretty well.


So, a fully convolutional approach that uses atrous convolutions is a more efficient way to maintain the original spatial resolution:

<img src="img/modern_segmentation_2.png" width=500>

### Modern architecture for image segmentation

Let's consider a traditional classification model:

<img src="img/modern_segmentation_1.png" width=900>

State-of-the-art network architectures for segmentation rely on the following 3 ideas:

1. They append a Fully Convolutional part with dilated convolutions

    A complete FCN would be too expensive, so the first layers still do some downsampling; hence an upsampling transformation is needed before the output


2. They replace the fully-connected head with a series of 1x1 convolutions

3. They do x8 bicubic interpolation to get back to the original resolution
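The second idea works because a 1x1 convolution is the same dense layer applied independently at every pixel, so the head no longer depends on a fixed input size. Below is a minimal sketch on nested lists; `conv1x1` and the toy weights are assumptions for illustration only.

```python
def conv1x1(feature_map, weights, bias):
    """Apply one dense layer at every spatial position.

    feature_map: H x W x C_in nested lists
    weights:     C_out x C_in
    bias:        C_out
    Returns an H x W x C_out map of class scores.
    """
    return [
        [
            [
                sum(w * x for w, x in zip(class_weights, pixel)) + b
                for class_weights, b in zip(weights, bias)
            ]
            for pixel in row
        ]
        for row in feature_map
    ]


# Toy 2 x 3 feature map with 2 channels, mapped to 3 class scores per pixel.
features = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]],
]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, -1.0]

scores = conv1x1(features, weights, bias)
print(len(scores), len(scores[0]), len(scores[0][0]))  # spatial size is kept
```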

<img src="img/modern_segmentation_3.png" width=900>


# DeepLab

The version of the algorithm linked below was posted to arXiv in 2016 and, at the time, set the state of the art in semantic segmentation on the Pascal VOC dataset.

[arxiv](https://arxiv.org/pdf/1606.00915.pdf)

Main ideas:
1. They use dilated convolutions
2. They use probabilistic post-processing (CRFs) to refine segment borders


## DeepLab :: Conditional Random Fields

#### What is CRF
A CRF (conditional random field) is a probabilistic graphical model over multiple variables in which each variable can depend on the others.

#### Why
CRFs are a standard computer vision trick for smoothing class boundaries, known since the early 2000s. We don't want too much jitter in pixel classes.

#### How it works
We need to model the probability of every segmentation (given our coarse scores) and choose the most probable one.

If we considered mask pixels independent, the probability of a segmentation would be just the product of the pixel probabilities. But of course they are not independent: each mask pixel depends on all the other pixels.

Calculating the probabilities of all possible segmentations is intractable, but we can decompose the joint probability into a product of independent blocks of pixels.

We assume that the pixel probability is a product of:
1. segment score (output of the network)
2. all interactions with other pixels

If there were no interactions, we would just assign each pixel to the class with the highest probability. That would be our most probable mask.
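In symbols, this decomposition is commonly written as an energy function. The notation below is an assumption taken from the usual fully connected CRF formulation, not something these notes define: $x_i$ is the label of pixel $i$, $I$ is the image, and $P_{\text{net}}$ is the network's per-pixel class probability.

$$
P(x \mid I) \propto \exp\bigl(-E(x)\bigr), \qquad
E(x) = \sum_i \theta_i(x_i) + \sum_{i<j} \theta_{ij}(x_i, x_j), \qquad
\theta_i(x_i) = -\log P_{\text{net}}(x_i)
$$

If every pairwise term $\theta_{ij}$ is zero, the energy splits into independent per-pixel terms and the minimum is reached at $x_i^* = \arg\max_{l} P_{\text{net}}(x_i = l)$, which is exactly the per-pixel argmax described above.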

How interactions are encoded:
- pairs of pixels with the same class are not penalized
- the penalty is larger if two pixels of different classes are close to each other (this is the border between 2 classes)
- the penalty is larger if the colors of the two pixels are similar
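The three rules above can be sketched as a single function: a Potts-style label check multiplied by a Gaussian in position and color. The function name, the weight and the two bandwidths (`sigma_pos`, `sigma_color`) are illustrative assumptions, not values from the paper, which also adds a second, position-only kernel that is omitted here.

```python
import math


def pairwise_penalty(label_i, label_j, pos_i, pos_j, color_i, color_j,
                     weight=1.0, sigma_pos=3.0, sigma_color=10.0):
    """Penalty for giving two pixels the labels label_i and label_j.
    All hyperparameter defaults are hypothetical."""
    # Rule 1: pixels with the same class are never penalized.
    if label_i == label_j:
        return 0.0
    dist2 = sum((a - b) ** 2 for a, b in zip(pos_i, pos_j))
    color2 = sum((a - b) ** 2 for a, b in zip(color_i, color_j))
    # Rules 2 and 3: the penalty decays with spatial and color distance.
    return weight * math.exp(
        -dist2 / (2 * sigma_pos ** 2) - color2 / (2 * sigma_color ** 2)
    )


grey = (10, 10, 10)
white = (200, 200, 200)
print(pairwise_penalty(0, 0, (0, 0), (0, 1), grey, grey))   # same class
print(pairwise_penalty(0, 1, (0, 0), (0, 1), grey, grey))   # close, similar
print(pairwise_penalty(0, 1, (0, 0), (0, 10), grey, grey))  # far apart
print(pairwise_penalty(0, 1, (0, 0), (0, 1), grey, white))  # different colors
```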

<img src="img/crf_encoding.png" width=700>

That way we find a segmentation with as few inconsistencies as possible, in which class boundaries tend to follow color borders, since that is where the penalty is lowest.

### Inference
Inference for CRFs is iterative. On each iteration, the class probabilities of every pixel are updated.

Those updates can be implemented as a convolution with a Gaussian filter.

<img src="img/crf_inference.png" width=900>

A couple of optimizations:
- we can ignore far-away pixels
- we can use just a subsample of nearby pixels
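The iterative update can be sketched as a mean-field loop on a 1D "image". This is a simplified illustration under stated assumptions, not DeepLab's implementation: the function name `mean_field`, the hyperparameters, and the brute-force loop over a small window (standing in for efficient Gaussian filtering and for the "ignore far-away pixels" optimization) are all choices made for this note.

```python
import math


def mean_field(unary_probs, colors, n_iters=5, window=3,
               weight=2.0, sigma_pos=2.0, sigma_color=0.2):
    """Refine per-pixel class probabilities with a Potts-style pairwise term.
    All hyperparameter defaults are hypothetical."""
    n, n_classes = len(unary_probs), len(unary_probs[0])
    unary = [[-math.log(p) for p in probs] for probs in unary_probs]
    q = [list(probs) for probs in unary_probs]
    for _ in range(n_iters):
        new_q = []
        for i in range(n):
            # Message passing: Gaussian-weighted sum of the neighbours'
            # current beliefs, restricted to a local window.
            message = [0.0] * n_classes
            for j in range(max(0, i - window), min(n, i + window + 1)):
                if j == i:
                    continue
                k = weight * math.exp(
                    -(i - j) ** 2 / (2 * sigma_pos ** 2)
                    - (colors[i] - colors[j]) ** 2 / (2 * sigma_color ** 2)
                )
                for label in range(n_classes):
                    message[label] += k * q[j][label]
            total = sum(message)
            # A label is penalized by the neighbours' belief in other labels.
            energy = [unary[i][l] + total - message[l] for l in range(n_classes)]
            # Normalize with a numerically stable softmax of -energy.
            lowest = min(energy)
            exps = [math.exp(lowest - e) for e in energy]
            z = sum(exps)
            new_q.append([e / z for e in exps])
        q = new_q  # all pixels are updated in parallel
    return q


# Two flat color regions; pixel 2 has a noisy network prediction.
colors = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
unary_probs = [[0.9, 0.1], [0.9, 0.1], [0.4, 0.6], [0.9, 0.1],
               [0.1, 0.9], [0.1, 0.9], [0.1, 0.9], [0.1, 0.9]]
refined = mean_field(unary_probs, colors)
labels = [max(range(2), key=lambda l: probs[l]) for probs in refined]
print(labels)
```

In this toy setup the noisy pixel is pulled toward its same-colored neighbours, while the class change at the color border survives because the pairwise weight across it is close to zero.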
