I participated in a competition organized by Lyft and Udacity in May 2018. Our task was to build a system that segments cars and road surfaces in videos recorded from the [CARLA simulator](http://carla.org/).

I used a deep learning semantic segmentation model, which achieved a car F-score of 0.8291 and a road F-score of 0.9773 at 11.363 FPS. The model is a MobileUNet network trained on 22300 training samples, augmented by horizontal flipping, color intensity adjustments, and image rotations.

## 1. Model Architecture

MobileUNet [\[1\]](#ref1) combines ideas from MobileNet [\[2\]](#ref2) and UNet [\[3\]](#ref3). I chose this architecture because its inference speed is higher than that of most semantic segmentation models.

My implementation uses the following structure:

| Layer               |     Description                                      |
|:-------------------:|:----------------------------------------------------:|
| Input               | 256x256x3 RGB image                                  |
| *Downsampling path* | |
| **Block 1**         | Skip connection - add to **Block 8**                 |
| ConvBlock           | Number of filters set to 64                          |
| DSConvBlock         | Number of filters set to 64                          |
| Max Pooling         | stride = [2, 2], pool size = [2, 2], padding = VALID |
| **Block 2**         | Skip connection - add to **Block 7**                 |
| ConvBlock           | Number of filters set to 128                         |
| DSConvBlock         | Number of filters set to 128                         |
| Max Pooling         | stride = [2, 2], pool size = [2, 2], padding = VALID |
| **Block 3**         | Skip connection - add to **Block 6**                 |
| ConvBlock           | Number of filters set to 256                         |
| DSConvBlock         | Number of filters set to 256                         |
| Max Pooling         | stride = [2, 2], pool size = [2, 2], padding = VALID |
| **Block 4**         | Skip connection - add to **Block 5**                 |
| ConvBlock           | Number of filters set to 512                         |
| DSConvBlock         | Number of filters set to 512                         |
| Max Pooling         | stride = [2, 2], pool size = [2, 2], padding = VALID |
| *Upsampling path* | |
| **Block 5**         | Skip connection - add by **Block 4**                 |
| ConvTransposeBlock  | Number of filters set to 512                         |
| DSConvBlock         | Number of filters set to 512                         |
| DSConvBlock         | Number of filters set to 512                         |
| DSConvBlock         | Number of filters set to 512                         |
| Add by **Block 4**  | Arithmetic add                                       |
| **Block 6**         | Skip connection - add by **Block 3**                 |
| ConvTransposeBlock  | Number of filters set to 512                         |
| DSConvBlock         | Number of filters set to 512                         |
| DSConvBlock         | Number of filters set to 512                         |
| DSConvBlock         | Number of filters set to 256                         |
| Add by **Block 3**  | Arithmetic add                                       |
| **Block 7**         | Skip connection - add by **Block 2**                 |
| ConvTransposeBlock  | Number of filters set to 256                         |
| DSConvBlock         | Number of filters set to 128                         |
| DSConvBlock         | Number of filters set to 128                         |
| DSConvBlock         | Number of filters set to 128                         |
| Add by **Block 2**  | Arithmetic add                                       |
| **Block 8**         | Skip connection - add by **Block 1**                 |
| ConvTransposeBlock  | Number of filters set to 128                         |
| DSConvBlock         | Number of filters set to 128                         |
| DSConvBlock         | Number of filters set to 64                          |
| Add by **Block 1**  | Arithmetic add                                       |
| ConvTransposeBlock  | Number of filters set to 64                          |
| DSConvBlock         | Number of filters set to 64                          |
| DSConvBlock         | Number of filters set to 64                          |
| *Softmax* | |
| Convolution         | filters = 3 (num. of classes), kernel = [1, 1], padding = SAME |

The model predicts 3 classes: *Background*, *Road*, and *Car*.
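To illustrate how the final 3-channel output turns into a label, here is a minimal pure-Python sketch for a single pixel. It assumes the last convolution emits one logit per class and that the class indices follow the reindexing used for the training labels (0 for background, 1 for road, 2 for cars). The logit values and the function names `softmax` and `predict_class` are made up for illustration and are not taken from the project code.

```python
import math

# Class order follows the label reindexing: 0 = background, 1 = road, 2 = car.
CLASS_NAMES = ["Background", "Road", "Car"]


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits):
    """Return the most likely class name and the class probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASS_NAMES[best], probs


# Hypothetical logits for one pixel.
label, probs = predict_class([0.2, 2.5, -1.0])
print(label, [round(p, 3) for p in probs])
```

In the real network the same operation would run over every pixel of the 256x256 output at once.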

Each **ConvBlock** is an operation with the following architecture:

| Layer                          |     Description                                   | 
|:------------------------------:|:-------------------------------------------------:|
| Convolution                    | variable filters, kernel = [1, 1], padding = SAME |
| Fused Batch Normalization      |  |
| ReLU Activation                |  |

All **Batch Normalization** layers in the architecture use the fused implementation to improve speed.

**DSConvBlock** is short for Depthwise Separable Convolutional Block. Depthwise separable convolutions are popular on mobile devices because they need far fewer parameters than standard convolutions. The block has the following architecture:

| Layer                                    |     Description                         | 
|:----------------------------------------:|:---------------------------------------:|
| Separable Convolution       | kernel = [3, 3], depth multiplier = 1, padding = SAME |
| Fused Batch Normalization   |  |
| ReLU Activation             |  |
| Convolution                 | variable filters, kernel = [1, 1], padding = SAME    |
| Fused Batch Normalization   |  |
| ReLU Activation             |  |
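The parameter savings can be made concrete with a little arithmetic. The sketch below compares the weight count of a standard 3x3 convolution with that of a DSConvBlock, assuming the separable convolution step acts as a depthwise 3x3 filter (one filter per input channel, depth multiplier 1) followed by the 1x1 pointwise convolution. Biases and batch normalization parameters are ignored, and the helper names are hypothetical.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out


def separable_conv_params(c_in, c_out, k=3, depth_multiplier=1):
    """Weights in a depthwise k x k convolution plus a 1x1 pointwise convolution."""
    depthwise = k * k * c_in * depth_multiplier
    pointwise = c_in * depth_multiplier * c_out
    return depthwise + pointwise


# Filter counts used by the blocks in the architecture table.
for channels in (64, 128, 256, 512):
    std = standard_conv_params(channels, channels)
    sep = separable_conv_params(channels, channels)
    print(f"{channels:>3} filters: standard={std:>9,} "
          f"separable={sep:>7,} ratio={std / sep:.1f}x")
```

Under these assumptions, a 64-to-64 channel block needs 4,672 weights instead of 36,864, roughly an 8x reduction, and the ratio approaches 9x as the channel count grows.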

**ConvTransposeBlock** is the upsampling operation that decodes the activations. It has the following architecture:

| Layer                                    |     Description                                      | 
|:----------------------------------------:|:----------------------------------------------------:|
| Transpose Convolution       | variable filters, kernel = [3, 3], stride = [2, 2], padding = SAME |
| Batch Normalization         |  |
| ReLU Activation             |  |
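The stride-2 transposed convolution here mirrors the 2x2 max pooling in the downsampling path: pooling halves the spatial size and the transposed convolution doubles it. The following sketch computes those sizes, assuming the usual TensorFlow-style conventions for `VALID` pooling and `SAME` transposed convolution. The helper names are hypothetical.

```python
def max_pool_valid(size, pool=2, stride=2):
    """Output size of max pooling with VALID padding along one dimension."""
    return (size - pool) // stride + 1


def conv_transpose_same(size, stride=2):
    """Output size of a transposed convolution with SAME padding."""
    return size * stride


# Apply the four Max Pooling rows from the architecture table to a 256-pixel side.
size = 256
sizes = [size]
for _ in range(4):
    size = max_pool_valid(size)
    sizes.append(size)
print(sizes)  # [256, 128, 64, 32, 16]

# A single ConvTransposeBlock doubles the side length again.
print(conv_transpose_same(16))  # 32
```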

## 2. Training Data

This section describes the data collection and preprocessing steps used in this project. All data were gathered by recording images from the CARLA simulator at a resolution of 800 x 600 pixels. In addition to the 1000 images provided by Lyft, I recorded 72 more runs, each containing 274 screenshots, for a total of `1000 + (72 * 274) = 20728` images.

### 2.1. Initial preprocessing

These images are resized to 256 x 256 pixels to match the model's input size. The segmentation labels are processed to keep only the road and cars, convert everything else to background (including our own car's hood), and reindex the classes (0 for background, 1 for road, and 2 for cars). Below are some examples of the training data: