# Semantic Segmentation

In this exercise we will implement a convolutional neural network for semantic segmentation.
The goal of semantic segmentation is to classify an image at the pixel level: for each pixel
we want to determine the class of the object it belongs to.




## 1. Cityscapes dataset

The [Cityscapes dataset](https://www.cityscapes-dataset.com/dataset-overview/) contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high-quality pixel-level annotations. It contains 2975 training and 500 validation images of size 2048x1024. Here we will use downsampled images of size 384x160. The original dataset has 19 classes, but we reduce that to 7 because very small objects are barely visible in the downsampled images. An additional label, 7, marks pixels that are ignored. The class IDs are:

- 0: road
- 1: building
- 2: infrastructure
- 3: nature
- 4: sky
- 5: person
- 6: vehicle
- 7: ignore

## 2. Building the graph

Let's begin by importing all the modules and fixing the random seed.

In [13]:
import time

import tensorflow as tf
import numpy as np

import utils
from data import Dataset

tf.set_random_seed(31415)

### Dataset

The `Dataset` class implements an iterator which returns the next batch of data in each iteration. The data is already normalized to have zero mean and unit variance. The iteration terminates when we reach the end of the dataset (one epoch).

In [12]:
batch_size = 10
# create the Dataset for training and validation
train_data = Dataset('train', batch_size)
val_data = Dataset('val', batch_size, shuffle=False)

print('Train shape:', train_data.x.shape)
print('Validation shape:', val_data.x.shape)

print('mean = ', train_data.x.mean((0,1,2)))
print('std = ', train_data.x.std((0,1,2)))

# store the input image dimensions
height = train_data.height
width = train_data.width
channels = train_data.channels

Train shape: (2975, 160, 384, 3)
Validation shape: (500, 160, 384, 3)
mean =  [-0.03841928 -0.04215339 -0.05894543]
std =  [ 1.01808784  1.02022915  1.02732566]
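The iterator behaviour described above can be sketched in plain NumPy. This is only an illustration: `ToyDataset` is a hypothetical stand-in, not the `Dataset` class from `data.py`, and whether the real class keeps the final, smaller batch is an assumption made here.

```python
import numpy as np


class ToyDataset:
    """Hypothetical stand-in for `Dataset`; yields (x, y) mini-batches for one epoch."""

    def __init__(self, x, y, batch_size, shuffle=True, seed=0):
        self.x, self.y = x, y
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = np.random.RandomState(seed)

    def __iter__(self):
        # A fresh permutation per epoch when shuffling, otherwise the stored order.
        n = len(self.x)
        order = self.rng.permutation(n) if self.shuffle else np.arange(n)
        # Assumption: the last batch may be smaller than batch_size.
        for start in range(0, n, self.batch_size):
            idx = order[start:start + self.batch_size]
            yield self.x[idx], self.y[idx]


# Random arrays standing in for images and labels (25 examples of size 4x6).
images = np.random.RandomState(1).randn(25, 4, 6, 3).astype(np.float32)
labels = np.random.RandomState(2).randint(0, 8, size=(25, 4, 6))
toy_data = ToyDataset(images, labels, batch_size=10, shuffle=False)
batch_shapes = [xb.shape for xb, yb in toy_data]
print(batch_shapes)
```

Iterating over the object once corresponds to one epoch; the loop simply ends when the examples run out.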


### Inputs

First, we will create input placeholders for the TensorFlow computational graph of the model. For a supervised learning model, we need to declare placeholders which will hold the input images (x) and target labels (y) of the mini-batches as
we feed them to the network.

In [15]:
# create placeholders for inputs
def build_inputs():
    with tf.name_scope('data'):
        x = tf.placeholder(tf.float32, shape=(None, height, width, channels), name='rgb_images')
        y = tf.placeholder(tf.int32, shape=(None, height, width), name='labels')
    return x, y

### Model

Now we can define the computational graph. Here we will rely heavily on the [`tf.layers`](https://www.tensorflow.org/api_docs/python/tf/layers) high-level API.

### Loss

Now we are going to implement the `build_loss` function, which will create the operations for our loss and return the final `tf.Tensor` holding the scalar loss value.
Because segmentation is just classification at the pixel level, we can again use the cross-entropy loss function \\(L\\) between the target one-hot distribution \\( \mathbf{y} \\) and the predicted distribution from a softmax layer \\( \mathbf{s} \\). Compared to image classification, however, here we need to define the loss at each pixel. Below are the equations describing the loss for one example (one pixel in our case), where \\( \mathbf{x} \\) denotes the logits and \\(C\\) the number of classes.
$$
L = - \sum_{i=1}^{C} y_i \log(s_i(\mathbf{x})) \\
s_i(\mathbf{x}) = \frac{e^{x_i}}{\sum_{j=1}^{C} e^{x_j}} \\
$$

In [19]:
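def build_loss(logits, y):
    with tf.name_scope('loss'):
        # flatten so that every pixel becomes one classification example
        y = tf.reshape(y, shape=[-1])
        logits = tf.reshape(logits, [-1, num_classes])

        # drop pixels with the ignore label (any id >= num_classes)
        mask = y < num_classes
        y = tf.boolean_mask(y, mask)
        logits = tf.boolean_mask(logits, mask)

        # per-pixel cross-entropy against the one-hot targets
        y_one_hot = tf.one_hot(y, num_classes)
        xent = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_one_hot)

        # average over all valid pixels in the mini-batch
        xent = tf.reduce_mean(xent)
        tf.summary.scalar('cross_entropy', xent)
        loss = add_regularization(xent)
        return loss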
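To check the masking and the per-pixel cross-entropy by hand, the same computation can be sketched in NumPy. This is a minimal illustration, not part of the graph: `masked_cross_entropy` is a hypothetical helper, it leaves out the regularization term, and it assumes (as in `build_loss`) that every label id greater than or equal to `num_classes` is ignored.

```python
import numpy as np


def masked_cross_entropy(logits, labels, num_classes):
    """Mean softmax cross-entropy over pixels whose label is a valid class."""
    logits = logits.reshape(-1, num_classes)
    labels = labels.reshape(-1)
    valid = labels < num_classes  # drops the ignore label
    logits, labels = logits[valid], labels[valid]
    # Numerically stable log-softmax: subtract the row maximum first.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # With one-hot targets the sum over classes keeps only the true-class term.
    return -log_probs[np.arange(labels.size), labels].mean()


num_classes = 7
labels = np.array([[0, 3], [7, 6]])              # one ignored pixel (label 7)
uniform_logits = np.zeros((2, 2, num_classes))   # equal score for every class
loss_value = masked_cross_entropy(uniform_logits, labels, num_classes)
print(loss_value, np.log(num_classes))
```

With equal logits the softmax is uniform, so the loss of every valid pixel equals \\( \log C \\); this gives a quick sanity value for an untrained network.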

### Putting it all together

In [10]:
x, y = build_inputs()
logits, is_training = build_model(x, num_classes)
loss = build_loss(logits, y)


## 3. Training the model

### Training one epoch

In [None]:
def train(sess):

### Validation

We usually evaluate semantic segmentation results with the [Intersection over Union](https://en.wikipedia.org/wiki/Jaccard_index) measure (IoU, also known as the Jaccard index). Note that the accuracy we used for the MNIST image classification problem is a poor measure in this case because semantic segmentation datasets are often heavily imbalanced. First we compute the IoU for each class in a one-vs-all fashion (shown below) and then take the mean IoU (mIoU) over all classes. By taking the mean we treat all classes as equally important.
$$
\mathrm{IoU} = \frac{TP}{TP + FN + FP}
$$

![iou](assets/iou.png)
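The per-class IoU and the mean IoU can be computed from a confusion matrix accumulated over the validation set. The NumPy sketch below is one possible way to do this; `confusion_matrix` and `mean_iou` are hypothetical helper names, and leaving classes that never occur out of the mean is an assumption, not something the exercise prescribes.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns the predicted class; ignored pixels are skipped."""
    y_true, y_pred = y_true.reshape(-1), y_pred.reshape(-1)
    valid = y_true < num_classes
    index = y_true[valid] * num_classes + y_pred[valid]
    counts = np.bincount(index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def mean_iou(conf):
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp  # predicted as the class, but belongs to another
    fn = conf.sum(axis=1) - tp  # belongs to the class, but predicted as another
    denom = tp + fp + fn
    present = denom > 0         # assumption: classes with no pixels are left out of the mean
    class_iou = np.where(present, tp / np.maximum(denom, 1), 0.0)
    return class_iou, class_iou[present].mean()


y_true = np.array([0, 0, 1, 1, 2, 3])  # toy example with 3 classes; label 3 is ignored
y_pred = np.array([0, 1, 1, 1, 2, 0])
class_iou, miou = mean_iou(confusion_matrix(y_true, y_pred, num_classes=3))
print(class_iou, miou)
```

Because the matrices of individual batches can simply be added, the IoU is computed once at the end of the epoch rather than averaged per batch.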

In [11]:
def validate(sess):


## 4. Visualizing the results

In [8]:
# restore the saved model

## 5. Better model with skip connections

TODO: figure