<a href="https://colab.research.google.com/github/lblogan14/deep_learning_for_computer_vision/blob/master/ch5_semantic_segmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
%cd /content/drive/My' 'Drive/Colab' 'Notebooks/Deep_Learning_for_Computer_Vision/

/content/drive/My Drive/Colab Notebooks/Deep_Learning_for_Computer_Vision


In [0]:
import tensorflow as tf

#Predict Pixels
Image classification is the task of predicting labels or categories. 

Object detection is the task of predicting a list of objects, each
with its corresponding bounding box.

**Semantic segmentation** is the task of predicting pixel-wise labels.

**Instance segmentation** is the task of segmenting every instance with a pixel-wise label. Instance segmentation can be thought of as an extension of object
detection with pixel-level labels.

#Datasets
The `PASCAL` and `COCO` datasets can be used for the segmentation task as well. Their segmentation annotations differ from
the detection ones because they are labelled pixel-wise. New algorithms are usually benchmarked against
the `COCO` dataset. `COCO` also annotates "stuff" classes such as grass, wall, and sky. Pixel
accuracy, the fraction of pixels assigned the correct label, can be used as a metric for evaluating algorithms.
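
As a rough illustration of the pixel accuracy metric, here is a minimal NumPy sketch. The function name and the toy class ids are assumptions made for this example, not part of either dataset's tooling.

```python
import numpy as np

def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label equals the ground-truth label."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    return float(np.mean(pred == target))

# Toy 2x3 label maps (hypothetical class ids: 0 = sky, 1 = grass, 2 = wall)
target = np.array([[0, 0, 1],
                   [2, 1, 1]])
pred = np.array([[0, 1, 1],
                 [2, 1, 0]])

accuracy = pixel_accuracy(pred, target)  # 4 of the 6 pixels match
print(accuracy)
```

Pixel accuracy is easy to compute, but large classes dominate it, which is worth keeping in mind when a dataset is mostly background.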

Other datasets:
* http://www.cs.bu.edu/~betke/BiomedicalImageSegmentation
* https://www.kaggle.com/c/intel-mobileodt-cervical-cancer-screening/data
* https://www.kaggle.com/c/diabetic-retinopathy-detection
* https://grand-challenge.org/all_challenges
* http://www.via.cornell.edu/databases
* https://www.kaggle.com/c/dstl-satellite-imagery-feature-detection
* https://aws.amazon.com/public-datasets/spacenet
* https://www.iarpa.gov/challenges/fmow.html
* https://www.kaggle.com/c/planet-understanding-the-amazon-from-space

#Algorithms for Semantic Segmentation
A sliding window approach can be applied at the pixel
level for segmentation. It takes an image and breaks
it into smaller crops, and every crop is classified to obtain a label.
This approach is expensive and inefficient because it doesn't reuse the shared
features between the overlapping patches.
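
To make the cost of overlapping crops concrete, the following sketch enumerates the patches of a tiny image. The helper name, patch size, and stride are hypothetical choices for illustration only.

```python
import numpy as np

def sliding_window_patches(image, patch_size, stride):
    """Yield (row, col, patch) for every full patch that fits in the image."""
    height, width = image.shape[:2]
    for row in range(0, height - patch_size + 1, stride):
        for col in range(0, width - patch_size + 1, stride):
            yield row, col, image[row:row + patch_size, col:col + patch_size]

image = np.arange(36).reshape(6, 6)
patches = list(sliding_window_patches(image, patch_size=4, stride=1))

num_patches = len(patches)              # 3 positions per axis, so 9 patches
pixels_processed = num_patches * 4 * 4  # 144 pixel reads for a 36-pixel image
print(num_patches, pixels_processed)
```

Each pixel is visited several times, once for every patch that covers it, which is the redundancy a fully convolutional design avoids.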

##Fully Convolutional Network (FCN)
The **Fully Convolutional Network (FCN)** introduced the idea of an end-to-end convolutional network.

Any standard CNN architecture can be turned into an FCN by replacing the fully connected layers with
convolution layers. In the final layers the depth is higher and the spatial size is smaller.
Hence, a 1×1 convolution can be performed to reach the desired number of labels.
But for segmentation, the spatial dimension has to be preserved. Hence, the fully
convolutional network is constructed without max pooling, as shown below:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/fcn.JPG?raw=true)

The loss for this network is computed by averaging the cross-entropy loss of
every pixel and mini-batch.
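
The averaging can be sketched in NumPy as follows. This is only an illustration under assumed shapes (channels-last logits and integer label maps); the function name is made up for this example.

```python
import numpy as np

def pixelwise_cross_entropy(logits, labels):
    """Mean cross-entropy over all pixels of all images in the batch.

    logits: float array of shape (batch, height, width, num_classes)
    labels: int array of shape (batch, height, width) holding class ids
    """
    # Numerically stable log-softmax over the class axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the true class at every pixel
    true_log_probs = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return float(-true_log_probs.mean())

logits = np.zeros((2, 4, 4, 3))            # uniform scores over 3 classes
labels = np.zeros((2, 4, 4), dtype=int)
loss = pixelwise_cross_entropy(logits, labels)  # equals log(3) for uniform predictions
print(loss)
```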

The final layer has a depth equal to the number of
classes. FCN is similar to object detection except that the spatial dimension is
preserved. The output produced by the architecture will be coarse as some pixels
may be mispredicted.

##The SegNet Architecture
**SegNet** takes an encoder-decoder approach. The encoder consists of
convolution layers and the decoder of deconvolution layers. SegNet
improved on the coarse outputs produced by FCN, and it is also less
intensive on memory. Once the encoder has reduced the features in spatial dimensions, the decoder
upsamples them back to the image size by deconvolution, reversing the convolution
effects. Deconvolution learns the parameters for upsampling. The output of such
an architecture can still be coarse due to the loss of information in pooling layers.
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/segnet.JPG?raw=true)

###Upsample the layers by pooling
Max pooling is a sampling strategy that picks the maximum value from a window.

Upsampling reverses this process. Each value can be surrounded with zeros to upsample the layer:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/upsample1.JPG?raw=true)

Another way to place the zeros is to remember the locations of the maxima during downsampling and reuse them for upsampling:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/upsample2.JPG?raw=true)
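
Both unpooling strategies can be sketched on a single-channel map with NumPy. The function names are hypothetical, and placing the value in the top-left corner for the first strategy is an assumption of this sketch.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling; also returns the flat index of each maximum inside its window."""
    h, w = x.shape
    windows = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(h // 2, w // 2, 4)
    return windows.max(axis=-1), windows.argmax(axis=-1)

def unpool_fixed(pooled):
    """Place every value in the top-left corner of its 2x2 window, zeros elsewhere."""
    out = np.zeros((pooled.shape[0] * 2, pooled.shape[1] * 2), dtype=pooled.dtype)
    out[::2, ::2] = pooled
    return out

def unpool_with_indices(pooled, indices):
    """Place every value back at the position where the maximum was found."""
    out = np.zeros((pooled.shape[0] * 2, pooled.shape[1] * 2), dtype=pooled.dtype)
    rows, cols = np.indices(pooled.shape)
    out[rows * 2 + indices // 2, cols * 2 + indices % 2] = pooled
    return out

x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 5],
              [0, 0, 7, 0],
              [6, 0, 0, 0]])
pooled, indices = max_pool_2x2(x)
fixed = unpool_fixed(pooled)
restored = unpool_with_indices(pooled, indices)
print(pooled)
print(fixed)
print(restored)
```

Remembering the indices keeps each maximum at its original location, so edges in the upsampled map stay aligned with the input.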

###Sample the layers by convolution
The layers can be upsampled or downsampled directly using convolution. The
stride used for convolution can be increased to cause downsampling:

![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/downsample.JPG?raw=true)

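Downsampling by convolution is called **strided convolution**. It should not be confused with **atrous convolution** or **dilated
convolution**, which spaces out the kernel taps and is covered in the Dilated Convolutions section.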

Similarly, it can be reversed to upsample
by learning a kernel:

![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/upsample.JPG?raw=true)

Upsampling directly using a convolution is known as **transposed
convolution**, **deconvolution**, **fractionally strided
convolution**, or **up-convolution**.
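
A one-dimensional NumPy sketch shows how the stride changes the output length in both directions. The function names are assumptions, and the kernel is fixed here, whereas in a network it would be learned.

```python
import numpy as np

def strided_conv1d(x, kernel, stride):
    """Valid cross-correlation that moves the kernel `stride` positions at a time."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel)
                     for i in range(0, len(x) - k + 1, stride)])

def transposed_conv1d(x, kernel, stride):
    """Each input value adds a scaled copy of the kernel to the output, `stride` apart."""
    k = len(kernel)
    out = np.zeros((len(x) - 1) * stride + k)
    for i, value in enumerate(x):
        out[i * stride:i * stride + k] += value * kernel
    return out

signal = np.arange(8, dtype=float)
kernel = np.array([1.0, 1.0])

down = strided_conv1d(signal, kernel, stride=2)   # length 8 -> 4
up = transposed_conv1d(down, kernel, stride=2)    # length 4 -> 8
print(down)
print(up)
```

The transposed convolution restores the length, not the original values; what it fills in depends on the kernel.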

###Build SegNet using TensorFlow

In [0]:
tf.reset_default_graph()
tf.keras.backend.clear_session()

In [0]:
input_height = 360
input_width = 480
kernel = 3
filter_size = 64
pad = 1
pool_size = 2
nClasses = 100

In [0]:
model = tf.keras.models.Sequential()
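# InputLayer declares the (height, width, channels) input so later layers can infer their shapes
model.add(tf.keras.layers.InputLayer(input_shape=(input_height, input_width, 3)))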

In [0]:
# Encoder: four 3x3 convolution blocks; the first three are followed by 2x2 max pooling.
# The kernel size is passed as a tuple; a third positional argument would be read as the stride.
model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(filter_size,
                                 (kernel, kernel),
                                 padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(pool_size, pool_size)))

model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(128,
                                 (kernel, kernel),
                                 padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(pool_size, pool_size)))

model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(256,
                                 (kernel, kernel),
                                 padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(pool_size, pool_size)))

model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(512,
                                 (kernel, kernel),
                                 padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Activation('relu'))

In [0]:
# Decoder: mirrors the encoder, upsampling by 2 three times to undo the three poolings
model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(512,
                                 (kernel, kernel),
                                 padding='valid'))
model.add(tf.keras.layers.BatchNormalization())

model.add(tf.keras.layers.UpSampling2D(size=(pool_size, pool_size)))
model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(256,
                                 (kernel, kernel),
                                 padding='valid'))
model.add(tf.keras.layers.BatchNormalization())

model.add(tf.keras.layers.UpSampling2D(size=(pool_size, pool_size)))
model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(128,
                                 (kernel, kernel),
                                 padding='valid'))
model.add(tf.keras.layers.BatchNormalization())

model.add(tf.keras.layers.UpSampling2D(size=(pool_size, pool_size)))
model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(filter_size,
                                 (kernel, kernel),
                                 padding='valid'))
model.add(tf.keras.layers.BatchNormalization())

# 1x1 convolution maps the features to one score per class at every pixel
model.add(tf.keras.layers.Conv2D(nClasses, (1, 1), padding='valid'))

# tf.keras defaults to channels-last, so the output shape is (batch, height, width, nClasses)
model.outputHeight = model.output_shape[1]
model.outputWidth = model.output_shape[2]

# Flatten the spatial dimensions so softmax runs over the class axis of every pixel.
# With channels-last data no Permute is needed after the reshape.
model.add(tf.keras.layers.Reshape((model.outputHeight * model.outputWidth, nClasses)))
model.add(tf.keras.layers.Activation('softmax'))

# The optimizer has to be an instance, not the class itself
model.compile(loss="categorical_crossentropy", optimizer=tf.keras.optimizers.Adam(), metrics=['accuracy'])

###Skipping Connections
The coarseness of the segmentation output can be reduced with a skip architecture, which
recovers higher resolution by reusing features from earlier layers.

An alternative is to scale up the outputs of the last
three layers and average them, as shown below:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/skip_connection.JPG?raw=true)
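
The scale-up-and-average idea can be sketched with NumPy on toy score maps. The map sizes, the constant values, and the use of nearest-neighbour upscaling are assumptions made to keep the example small.

```python
import numpy as np

def upscale_nearest(feature_map, factor):
    """Repeat every value `factor` times along both spatial axes."""
    return np.repeat(np.repeat(feature_map, factor, axis=0), factor, axis=1)

# Hypothetical score maps from three layers at decreasing resolution
score_8 = np.full((8, 8), 9.0)
score_4 = np.full((4, 4), 6.0)
score_2 = np.full((2, 2), 3.0)

# Bring all maps to the finest resolution, then average them
fused = (score_8 + upscale_nearest(score_4, 2) + upscale_nearest(score_2, 4)) / 3.0
print(fused.shape, fused[0, 0])
```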

###Dilated Convolutions
Pixel-wise classification and image classification are structurally different.

Pooling layers discard information, which produces coarse segmentation.

Yet pooling is what gives the network a wider view of the image. The **dilated convolution** addresses
this trade-off: it samples with less loss while still having a wider view.

A dilated convolution is essentially a convolution that skips pixels at regular intervals within the window, as shown below:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/dilated_conv.JPG?raw=true)

The dilation distance varies from layer to layer. The output of such a
segmentation network is upscaled for a finer resolution.
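
A one-dimensional NumPy sketch illustrates how dilation widens the view without adding weights. The function name and the all-ones kernel are assumptions for this example.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Valid cross-correlation whose kernel taps are `dilation` samples apart."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # input width seen by a single output value
    return np.array([np.dot(x[i:i + span:dilation], kernel)
                     for i in range(len(x) - span + 1)])

signal = np.arange(10, dtype=float)
kernel = np.ones(3)

dense = dilated_conv1d(signal, kernel, dilation=1)    # each output sees 3 neighbouring samples
dilated = dilated_conv1d(signal, kernel, dilation=2)  # same 3 weights, span of 5 samples
print(dense)
print(dilated)
```

The kernel still has three weights in both cases; only the spacing between the samples it reads changes.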

##DeepLab
DeepLab performs convolutions on multiple scales and uses the features from the various scales to obtain a score map. The score map is interpolated and passed through a **conditional random field (CRF)** for final segmentation.

The multi-scale processing is performed either by running images of various sizes through their own CNN, or by parallel convolutions with varying levels of dilation.

![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/deeplab.JPG?raw=true)

##RefineNet
When it comes to high-resolution pictures, the computational complexity increases and causes problems for dilated convolutions, because they need bigger inputs and are memory intensive.

RefineNet is proposed to overcome this problem:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/refinet.JPG?raw=true)

RefineNet applies the encoder-decoder structure: the encoder is a CNN and the decoder concatenates its features of various sizes.
The concatenation is done by upscaling the low-dimensional features.

![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/refinet2.JPG?raw=true)

##PSPnet
PSPNet increases the kernel size of the pooling layers. The pooling is carried out in a pyramid fashion, so the pyramid covers various portions and sizes of the image simultaneously. A loss function in the middle of the architecture provides intermediate supervision.
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/pspnet.JPG?raw=true)
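
The pyramid idea can be sketched on a single-channel feature map with NumPy. The bin sizes, average pooling, and nearest-neighbour upsampling below are assumptions chosen for brevity and may differ from the published architecture.

```python
import numpy as np

def average_pool_to_bins(feature_map, bins):
    """Average-pool a (height, width) map into a bins x bins grid."""
    h, w = feature_map.shape
    return feature_map.reshape(bins, h // bins, bins, w // bins).mean(axis=(1, 3))

def pyramid_pool(feature_map, bin_sizes=(1, 2, 4)):
    """Pool at several grid sizes, upsample each result, and stack it with the input."""
    h, w = feature_map.shape
    levels = [feature_map]
    for bins in bin_sizes:
        pooled = average_pool_to_bins(feature_map, bins)
        # Nearest-neighbour upsampling back to the input resolution
        upsampled = np.repeat(np.repeat(pooled, h // bins, axis=0), w // bins, axis=1)
        levels.append(upsampled)
    return np.stack(levels, axis=-1)

features = np.arange(64, dtype=float).reshape(8, 8)
stacked = pyramid_pool(features)
print(stacked.shape)
```

The coarsest level summarizes the whole image, while finer levels summarize progressively smaller regions, so every pixel carries context at several scales.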

##Large Kernels
Large kernels have bigger receptive fields than small kernels. The
computational complexity of large kernels can be overcome by
approximating them with smaller kernels.

![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/large_kernel.JPG?raw=true)
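
To get a feel for the savings, the sketch below counts weights for one possible approximation: a k×1 convolution followed by a 1×k convolution. That particular factorization, the kernel size, and the channel counts are assumptions for illustration, and biases are ignored.

```python
def dense_kernel_weights(k, channels_in, channels_out):
    """Weights in a single k x k convolution."""
    return k * k * channels_in * channels_out

def separable_pair_weights(k, channels_in, channels_out):
    """Weights in a k x 1 convolution followed by a 1 x k convolution."""
    return k * channels_in * channels_out + k * channels_out * channels_out

k, c_in, c_out = 15, 21, 21  # hypothetical kernel size and channel counts
dense_weights = dense_kernel_weights(k, c_in, c_out)   # 99225
pair_weights = separable_pair_weights(k, c_in, c_out)  # 13230
print(dense_weights, pair_weights, dense_weights / pair_weights)
```

The dense kernel grows quadratically with k while the pair grows linearly, so the gap widens as the kernel gets larger.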

##DeepLab V3
In this updated version, batch normalization is applied to improve performance.

The multi-scale features are encoded in a cascaded fashion to improve performance:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/deeplab_v3.JPG?raw=true)

![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/deeplab_v3_2.JPG?raw=true)

#Ultrasound Nerve Segmentation
The original dataset for segmenting nerve structures in ultrasound images of the neck can be downloaded from https://www.kaggle.com/c/ultrasound-nerve-segmentation.

The UNET model resembles an autoencoder but with convolutions instead of fully connected layers. The encoder part uses convolutions to decrease the dimensions and the decoder part increases them again:

![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/unet.JPG?raw=true)

The encoder and decoder feature maps of matching size are linked by skip connections.

The model output is a mask with values ranging from 0 to 1.

Let's start with data preparation.

In [0]:
from __future__ import print_function

import os
import numpy as np

from skimage.io import imsave, imread

data_path = './data/ch5'

image_rows = 420
image_cols = 580


In [0]:
def create_train_data():
  train_data_path = os.path.join(data_path, 'train')
  images = os.listdir(train_data_path)
  total = int(len(images) / 2)
  
  imgs = np.ndarray((total, image_rows, image_cols), dtype=np.uint8)
  imgs_mask = np.ndarray((total, image_rows, image_cols), dtype=np.uint8)
  
  i = 0
  print('-'*30)
  print('Creating training images...')
  print('-'*30)
  for image_name in images:
    if 'mask' in image_name:
      continue
    image_mask_name = image_name.split('.')[0] + '_mask.tif'
    # Newer scikit-image releases spell this keyword as_gray (as_grey was removed)
    img = imread(os.path.join(train_data_path, image_name), as_gray=True)
    img_mask = imread(os.path.join(train_data_path, image_mask_name), as_gray=True)
    
    img = np.array([img])
    img_mask = np.array([img_mask])
    
    imgs[i] = img
    imgs_mask[i] = img_mask
    
    if i%100 == 0:
      print('Done: {0}/{1} images'.format(i, total))
    i += 1
  print('Loading done.')
  
  np.save('imgs_train.npy', imgs)
  np.save('imgs_mask_train.npy', imgs_mask)
  print('Saving to .npy files done.')

In [0]:
def load_train_data():
  imgs_train = np.load('imgs_train.npy')
  imgs_mask_train = np.load('imgs_mask_train.npy')
  return imgs_train, imgs_mask_train

In [0]:
def create_test_data():
  test_data_path = os.path.join(data_path, 'test')
  images = os.listdir(test_data_path)
  total = len(images)
  
  imgs = np.ndarray((total, image_rows, image_cols), dtype=np.uint8)
  imgs_id = np.ndarray((total, ), dtype=np.int32)
  
  i = 0
  print('-'*30)
  print('Creating test images...')
  print('-'*30)
  for image_name in images:
    img_id = int(image_name.split('.')[0])
    # Read from the test folder; train_data_path is not defined in this function
    img = imread(os.path.join(test_data_path, image_name), as_gray=True)
    
    img = np.array([img])
    
    imgs[i] = img
    imgs_id[i] = img_id
    
    if i%100 == 0:
      print('Done: {0}/{1} images'.format(i, total))
    i += 1
  print('Loading done.')
  
  np.save('imgs_test.npy', imgs)
  np.save('imgs_id_test.npy', imgs_id)
  print('Saving to .npy files done.')

In [0]:
def load_test_data():
  imgs_test = np.load('imgs_test.npy')
  imgs_id = np.load('imgs_id_test.npy')
  return imgs_test, imgs_id

In [0]:
create_train_data()
create_test_data()

------------------------------
Creating training images...
------------------------------
Done: 0/5635 images




Done: 100/5635 images
Done: 200/5635 images
Done: 300/5635 images
Done: 400/5635 images
Done: 500/5635 images
Done: 600/5635 images
Done: 700/5635 images
Done: 800/5635 images
Done: 900/5635 images
Done: 1000/5635 images
Done: 1100/5635 images
Done: 1200/5635 images
Done: 1300/5635 images
Done: 1400/5635 images
Done: 1500/5635 images
Done: 1600/5635 images
Done: 1700/5635 images
Done: 1800/5635 images
Done: 1900/5635 images
Done: 2000/5635 images
Done: 2100/5635 images
Done: 2200/5635 images
Done: 2300/5635 images


Now we can define the model.

Start with the size of images,

In [0]:
import os
from skimage.transform import resize
from skimage.io import imsave, imread
import numpy as np
import tensorflow as tf

In [0]:
image_height, image_width = 96, 96
smoothness = 1.0
work_dir = './data/ch5'

Define the Dice coefficient and its loss function:

In [0]:
def dice_coefficient(y1, y2):
  # Flatten both masks so the overlap is summed over every pixel
  y1 = tf.reshape(y1, [-1])
  y2 = tf.reshape(y2, [-1])
  return (2.*tf.reduce_sum(y1*y2)+smoothness) / (tf.reduce_sum(y1)+tf.reduce_sum(y2)+smoothness)

In [0]:
def dice_coefficient_loss(y1, y2):
  # Negate the coefficient so that minimizing the loss maximizes the overlap
  return -dice_coefficient(y1, y2)
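
As a quick sanity check of the formula, here is a NumPy version worked through on tiny masks. The function name `dice_coefficient_np` is made up for this sketch; it mirrors the formula above with the same smoothing term.

```python
import numpy as np

def dice_coefficient_np(mask_true, mask_pred, smoothness=1.0):
    """(2 * overlap + s) / (sum of both masks + s), computed over all pixels."""
    mask_true = np.asarray(mask_true, dtype=float).ravel()
    mask_pred = np.asarray(mask_pred, dtype=float).ravel()
    intersection = np.sum(mask_true * mask_pred)
    return (2.0 * intersection + smoothness) / (mask_true.sum() + mask_pred.sum() + smoothness)

mask_true = np.array([[1, 1, 0],
                      [0, 0, 0]])
mask_pred = np.array([[1, 0, 0],
                      [0, 1, 0]])

score = dice_coefficient_np(mask_true, mask_pred)               # (2*1 + 1) / (2 + 2 + 1) = 0.6
empty = dice_coefficient_np(np.zeros((2, 3)), np.zeros((2, 3)))  # smoothing avoids 0/0, giving 1.0
print(score, empty)
```

The smoothing term matters for this dataset style, where a mask can be entirely empty and the unsmoothed ratio would be undefined.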

Define the layers to be used:

In [0]:
def preprocess(imgs):
  imgs_p = np.ndarray((imgs.shape[0], image_height, image_width), dtype=np.uint8)
  for i in range(imgs.shape[0]):
    # skimage's resize expects the output shape as (rows, cols)
    imgs_p[i] = resize(imgs[i], (image_height, image_width), preserve_range=True)
  imgs_p = imgs_p[..., np.newaxis]
  return imgs_p

In [0]:
def convolution_layer(filters, kernel=(3,3), activation='relu'):
  # padding='same' keeps the spatial size, so encoder and decoder feature maps line up
  return tf.keras.layers.Conv2D(filters=filters,
                                kernel_size=kernel,
                                activation=activation,
                                padding='same')

In [0]:
def concatenated_deconvolution_layer(filters, inputs, skip):
  # Upsample by 2 with a transposed convolution, then concatenate the
  # same-sized encoder output along the channel axis (the skip connection)
  upsampled = tf.keras.layers.Conv2DTranspose(filters=filters,
                                              kernel_size=(2,2),
                                              strides=(2,2),
                                              padding='same')(inputs)
  return tf.keras.layers.concatenate([upsampled, skip], axis=3)

In [0]:
def pooling_layer():
  return tf.keras.layers.MaxPooling2D(pool_size=(2,2))

Define the UNET model:

In [0]:
# Skip connections need the functional API; a Sequential model cannot express them
inputs = tf.keras.layers.Input((image_height, image_width, 1))

# Encoder
conv1 = convolution_layer(32)(inputs)
conv1 = convolution_layer(32)(conv1)
pool1 = pooling_layer()(conv1)

conv2 = convolution_layer(64)(pool1)
conv2 = convolution_layer(64)(conv2)
pool2 = pooling_layer()(conv2)

conv3 = convolution_layer(128)(pool2)
conv3 = convolution_layer(128)(conv3)
pool3 = pooling_layer()(conv3)

conv4 = convolution_layer(256)(pool3)
conv4 = convolution_layer(256)(conv4)
pool4 = pooling_layer()(conv4)

conv5 = convolution_layer(512)(pool4)
conv5 = convolution_layer(512)(conv5)

In [0]:
# Decoder: each step upsamples and concatenates the matching encoder output
up6 = concatenated_deconvolution_layer(256, conv5, conv4)
conv6 = convolution_layer(256)(up6)
conv6 = convolution_layer(256)(conv6)

up7 = concatenated_deconvolution_layer(128, conv6, conv3)
conv7 = convolution_layer(128)(up7)
conv7 = convolution_layer(128)(conv7)

up8 = concatenated_deconvolution_layer(64, conv7, conv2)
conv8 = convolution_layer(64)(up8)
conv8 = convolution_layer(64)(conv8)

up9 = concatenated_deconvolution_layer(32, conv8, conv1)
conv9 = convolution_layer(32)(up9)
conv9 = convolution_layer(32)(conv9)

# 1x1 convolution with sigmoid produces the mask with values between 0 and 1
outputs = convolution_layer(1, kernel=(1,1), activation='sigmoid')(conv9)

unet = tf.keras.models.Model(inputs=inputs, outputs=outputs)
unet.summary()
unet.compile(optimizer=tf.keras.optimizers.Adam(lr=1e-5),
             loss=dice_coefficient_loss,
             metrics=[dice_coefficient])

Train the model using the training images:

In [0]:
x_train, y_train_mask = load_train_data()

# Preprocessing the training images first
x_train = preprocess(x_train)
y_train_mask = preprocess(y_train_mask)

x_train = x_train.astype('float32')
mean = np.mean(x_train)
std = np.std(x_train)

x_train -= mean
x_train /= std

y_train_mask = y_train_mask.astype('float32')
y_train_mask /= 255.

In [0]:
unet.fit(x_train, 
         y_train_mask, 
         batch_size=32, 
         epochs=20, 
         verbose=1, 
         shuffle=True,
         validation_split=0.2)

After training, run the model on the test data:

In [0]:
# load_test_data returns the images and their ids; the test set has no masks
x_test, test_ids = load_test_data()

x_test = preprocess(x_test)
x_test = x_test.astype('float32')
# Normalize with the statistics computed on the training images
x_test -= mean
x_test /= std

y_test_pred = unet.predict(x_test, verbose=1)

In [0]:
# Save the prediction result
for image, image_id in zip(y_test_pred, test_ids):
  image = (image[:,:,0] * 255.).astype(np.uint8)
  imsave(os.path.join(work_dir, str(image_id)+'.png'), image)

#FCN Model for Segmentation

In [0]:
import tensorflow as tf
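# Assumes a local resnet50.py next to the notebook whose model has layers named
# 'final_32', 'final_16' and 'final_x8'; a relative import does not work in a notebook cell
from resnet50 import ResNet50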

In [0]:
nb_labels = 6
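# (height, width, channels); the unpacking below needs three values. The spatial size
# may need to be larger, depending on the minimum input size of the ResNet50 implementation
input_shape = [28, 28, 3]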

img_height, img_width, _ = input_shape
input_tensor = tf.keras.layers.Input(shape=input_shape)
weights = 'imagenet'

Initialize the `ResNet` model:

In [0]:
resnet50_model = ResNet50(include_top=False,
                          weights=weights,
                          input_tensor=input_tensor)

Take the last three layers from `ResNet` model:

In [0]:
final_32 = resnet50_model.get_layer('final_32').output
final_16 = resnet50_model.get_layer('final_16').output
final_x8 = resnet50_model.get_layer('final_x8').output

Each skip connection has to be compressed so that its number of channels equals
the number of labels:

In [0]:
c32 = tf.keras.layers.Conv2D(nb_labels, (1,1))(final_32)
c16 = tf.keras.layers.Conv2D(nb_labels, (1,1))(final_16)
c8 = tf.keras.layers.Conv2D(nb_labels, (1,1))(final_x8)

The output of each compressed skip connection can be resized using **bilinear
interpolation**. The interpolation can be implemented with a `Lambda` layer, which
can wrap an arbitrary TensorFlow operation.

In [0]:
def resize_bilinear(images):
  return tf.image.resize_bilinear(images, [img_height, img_width])

In [0]:
r32 = tf.keras.layers.Lambda(resize_bilinear)(c32)
r16 = tf.keras.layers.Lambda(resize_bilinear)(c16)
r8 = tf.keras.layers.Lambda(resize_bilinear)(c8)

Merge the three resized outputs by adding them element-wise:

In [0]:
m = tf.keras.layers.Add()([r32, r16, r8])

The class probabilities are obtained with a softmax activation. The
tensor is reshaped before and after applying softmax:

In [0]:
# Flatten the pixels so softmax is applied over the label axis of every pixel
x = tf.keras.layers.Reshape((img_height * img_width, nb_labels))(m)
x = tf.keras.layers.Activation('softmax')(x)
x = tf.keras.layers.Reshape((img_height, img_width, nb_labels))(x)

In [0]:
fcn_model = tf.keras.models.Model(inputs=input_tensor, outputs=x)

Thus, a simple FCN model has been defined.

#Instance Segmentation
Separating each required object from the rest of the image is
widely known as **segmenting instances**. The input image is
first taken, then the objects are localized with bounding boxes, and at last a
pixel-wise mask is predicted for each of the classes. For each of the objects,
pixel-level accuracy is calculated.

**Mask RCNN**: 

![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/mask_rcnn.JPG?raw=true)

The architecture looks similar to R-CNN with the addition of a segmentation branch. It
is a multi-stage network with end-to-end training. The region proposals are
learned. The network is split into two branches, one for detection (class score and
bounding box) and the other for the segmentation mask.