# Exercise 1.7.1 — Scene Understanding
#### By Jonathan L. Moran (jonathan.moran107@gmail.com)
From the Self-Driving Car Engineer Nanodegree programme offered at Udacity.

## Objectives

* Compute the Mean Intersection over Union (IoU) of the multi-class segmentation label predictions;
* Implement a standard convolution block and compare it to the MobileNet convolution block. 

## 1. Introduction

In [1]:
### Importing required modules

In [2]:
import numpy as np
import os
import tensorflow as tf
from typing import List, Union, Tuple

In [3]:
tf.__version__

'2.9.2'

In [4]:
tf.test.gpu_device_name()

''

In [5]:
### Setting the environment variables

In [6]:
ENV_COLAB = True                # True if running in Google Colab instance

In [7]:
# Root directory
DIR_BASE = '' if not ENV_COLAB else '/content/'

In [8]:
# Subdirectory to save output files
DIR_OUT = os.path.join(DIR_BASE, 'out/')
# Subdirectory pointing to input data
DIR_SRC = os.path.join(DIR_BASE, 'data/')

In [9]:
### Creating subdirectories (if not exists)
os.makedirs(DIR_OUT, exist_ok=True)

### 1.1. Scene Understanding 

#### Background

TODO.

#### Metrics — Intersection over Union (IoU)

In the [very first exercise](https://github.com/jonathanloganmoran/ND0013-Self-Driving-Car-Engineer/blob/main/1-Computer-Vision/Exercises/1-1-1-Choosing-Metrics/2022-07-25-Choosing-Metrics-IoU.ipynb) of this course, we covered the Intersection over Union (IoU) metric and its application to the bounding box prediction task. Now, we use the IoU metric again but this time for semantic segmentation and scene understanding.

We start with the same general formula for the IoU score given by:

$$
\begin{align}
\mathrm{IoU} &= \frac{\textrm{Area of Intersection}}{\textrm{Area of Union}},
\end{align}
$$

but now we calculate the IoU score using the following binary classification metrics:

$$
\begin{align}
\mathrm{IoU} &= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN} + \mathrm{TN}}.
\end{align}
$$

With this form of the IoU equation, all we need to do is compute the true positive ($\mathrm{TP}$), true negative ($\mathrm{TN}$), and the false positive ($\mathrm{FN}$), false negative ($\mathrm{FN}$) rates. For the image segmentation task, this boils down to the pixel-wise classification predictions. Thankfully, the algorithms we designed in [Sect. 2.1](https://github.com/jonathanloganmoran/ND0013-Self-Driving-Car-Engineer/blob/main/1-Computer-Vision/Exercises/1-1-1-Choosing-Metrics/2022-07-25-Choosing-Metrics-IoU.ipynb) of Exercise 1.1.1 hold; all we need to do is compute the pixel-wise classification metrics for each class using the same tabular approach as before. With these metrics, we evaluate the $\mathrm{IoU}$ formula and obtain a score indicating the amount of "overlap" between the predicted region and the true region of each segmented object. 

Let's illustrate this with a simple example:

```python
ground_truth_labels = [
    [0, 0, 0, 0], 
    [1, 1, 1, 1],
    [2, 2, 2, 2], 
    [3, 3, 3, 3],
]
predicted_labels = [
    [1, 0, 0, 0],
    [1, 3, 0, 1],
    [2, 2, 2, 3],
    [3, 1, 0, 0],
]
```

Above we define a set of _ground-truth_ and _predicted_ labels. Each row in the matrix corresponds to a class; looking at the first row of the `ground_truth_labels` ("$\mathrm{A}$") matrix, we see that class `0` should appear at all four pixel locations. Looking at the first row of `predicted_labels` ("$\mathrm{B}$" matrix), we see instead that only three of the four pixel locations were given a correct prediction of class `0`. In other words, we have in the first row a $\mathrm{TP} = 3$. Now, we need to compute for class `0` the false positive ($\mathrm{FP}$) rate. To do this, we examine the _other_ pixel locations (i.e., other rows of the `predicted_labels` matrix), and add up any occurrences of class label `0` where the corresponding entries in `ground_truth_labels` do not match. Since class label `0` was predicted _incorrectly_ at pixel locations $\mathrm{B}_{2, 3}$, $\mathrm{B}_{3, 3}$, and $\mathrm{B}_{4, 4}$, we have a $\mathrm{FP} = 3$. Now let's complete the calculations for the two other metrics: true negative ($\mathrm{TN}$) and false negative ($\mathrm{FN}$). The $\mathrm{TN}$ value for this problem is easy to compute, since we assume all predictions here were valid (i.e., that we expected a class label to be predicted for every pixel in `predicted_labels`). That means our $\mathrm{TN} = 0$. Lastly, our $\mathrm{FN}$ rate is computed as the number of _incorrect_ predictions for class `0`. Looking at the first row of the `predicted_labels` matrix (i.e., the "predictions" for class `0`), we count the number of label predictions that are _not_ equal to class `0` to get our false negative rate. With _one_ incorrect class `0` prediction at the first index $\mathrm{B}_{1, 1} = $ `1`, we have therefore a $\mathrm{FN} = 1$. 

With these four classification metrics out of the way, we can obtain the $\mathrm{IoU}$ score for class `0` as follows:

$$
\begin{align}
\mathrm{IoU}_{0} &= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN} + \mathrm{TN}} = \frac{3}{3 + 3 + 1 + 0} = \frac{3}{7} \approx 0.4286.
\end{align}
$$

Now that we have the $\mathrm{IoU}$ score for class `0` computed, we repeat the process for the other three rows (classes) in `predicted_labels` to obtain each classes' respective $\mathrm{IoU}$ score. Once we have completed the calculations of all four classes, we can take the average to obtain the $\mathrm{IoU}_{\textrm{mean}}$, as simply:

$$
\begin{align}
\mathrm{IoU}_{\textrm{mean}} &= \frac{1}{n}\sum_{i=0}^{n} \mathrm{IoU}_{i},
\end{align}
$$

which is nothing but the sum of the per-class $\mathrm{IoU}$ scores divided by the total number of classes.

### 1.2. Convolutions

[Convolution](https://en.wikipedia.org/wiki/Convolution) is a measure of overlap between two functions as one slides over the other. Mathematically, it is a sum of products given by the following:
$$
\begin{align}
\left(f * g\right)\left(t\right) := \int_{-\infty}^{\infty} f\left(\tau\right)\cdot g\left(t-\tau\right)d\tau.
\end{align}
$$

A [convolutional layer](https://en.wikipedia.org/wiki/Convolutional_neural_network#Convolutional_layers) in a neural network performs the convolution operation by applying a filter over the input tensor. After the same filter is repeatedly applied to the input, a feature map is created which shows the positions and intensity of a detected feature in an input. 

Convolutional layers are extremely useful in neural networks, but they often come with a high computational cost due to the number of parameters required for each input (and each channel of the input). In this section we will look at the basics of regular convolution and review the number of parameters required. Then, we will introduce several alternative convolution methods — namely, depth-wise, depth-wise separable and point-wise convolution, which help reduce this computational cost. These alternatives are especially useful for networks intended to be run on mobile devices and embedded hardware where computational resources are limited.

#### Regular Convolutions

In a typical [convolutional layer](https://en.wikipedia.org/wiki/Convolutional_neural_network#Convolutional_layers), we have a set of $\mathrm{N}$ kernels, each with a size of $\mathrm{D}_{k} * \mathrm{D}_{k}$. Each of these kernels convolves ("slides over") the entire input, which is a $\mathrm{D}_{f} * \mathrm{D}_{f} * \mathrm{M}$ sized feature map (a tensor). When considering the computational cost of a convolution operation, we must consider the number of _parameters_ required for each convolutional layer. Generally, models with more convolutional layers have more parameters, and therefore require more computational resources to use. While this can lead to higher accuracy in image processing tasks, special attention needs to be paid to the computational cost of convolutional networks when utilising low-end hardware, such as with in-vehicle embedded devices which will be powering these type of networks with real-time inference demands.

To understand the computational cost of a regular convolutional layer, we have the following cost:
$$
\begin{align}
\mathrm{D}_{g} * \mathrm{D}_{g} * \mathrm{M} * \mathrm{N} * \mathrm{D}_{k} * \mathrm{D}_{k},
\end{align}
$$

where $\mathrm{D}_{g} * \mathrm{D}_{g}$ is the size of the output feature map. A regular convolution takes in a $\mathrm{D}_{f} * \mathrm{D}_{f} * \mathrm{M}$ input feature map and returns a $\mathrm{D}_{g} * \mathrm{D}_{g} * \mathrm{N}$ feature map as output.

This is illustrated below:

<img src="figures/2023-01-26-Figure-1-Standard-Convolution-Filters.png" alt="Figure 1. Filters in a regular convolutional layer.">

$$
\begin{align}
\textrm{Figure 1. Filters in a regular convolutional layer (credit: Howard et al., 2017).}
\end{align}
$$

#### Depth-wise Convolutions

[Depth-wise convolution](https://tvm.d2l.ai/chapter_common_operators/depthwise_conv.html) and depth-wise separable convolution are two atypical convolution operations that have less parameters and therefore require less computational power to compute. Depth-wise convolutions are used in the [MobileNets](https://arxiv.org/abs/1704.04861) [1] architecture designed for mobile and embedded applications.

Depth-wise convolution acts on each input channel separately with a different kernel for each. With a number of input channels $\mathrm{M}$ we have $\mathrm{M} * \mathrm{D}_{k} * \mathrm{D}_{k}$ kernels. Since depth-wise convolution only acts on a single input channel at a time, the kernel depth $\mathrm{N}$ is set equal to $1$. Therefore, the computational cost of the depth-wise convolution is:
$$
\begin{align}
\mathrm{D}_{g} * \mathrm{D}_{g} * \mathrm{M} * \mathrm{D}_{k} * \mathrm{D}_{k}.
\end{align}
$$

To produce the same effect with regular convolution, each channel of the input requires its own kernel. To compute the convolution, each channel is selected individually, and all elements in the kernel are set to zero except for those corresponding to the respective input channel. The final output of the stacked convolutions is one $\mathrm{M}$-channel output feature map. As shown above, depth-wise convolution reduces the number of parameters required by a factor $\mathrm{N}$, i.e., the number of filters required _per input channel_. For a three-channel input, we require _three times less_ number of parameters with the depth-wise convolution approach.   

This is illustrated below:

<img src="figures/2023-01-26-Figure-2-Depthwise-Convolution-Filters.png" alt="Figure 2. Filters in a depth-wise convolutional layer.">

$$
\begin{align}
\textrm{Figure 2. Filters in a depth-wise convolutional layer (credit: Howard et al., 2017).}
\end{align}
$$

In order to understand depth-wise separable convolution, we first introduce the point-wise convolution.

#### Point-wise Convolutions

Point-wise convolution is a form of convolution that applies a $1\times 1$ kernel across an input. Unlike the depth-wise convolution, the $1\times 1$ kernel used here has a depth equal to the number of channels in the input. The computational complexity of the point-wise convolution is similar to the regular convolution, but instead the filter size $\mathrm{D}_{k} * \mathrm{D}_{k}$ is equal to $1 \times 1$. Therefore, we have the following number of parameters:
$$
\begin{align}
\mathrm{D}_{g} * \mathrm{D}_{g} * \mathrm{M} * \mathrm{N}.
\end{align}
$$

This is illustrated below:

<img src="figures/2023-01-26-Figure-3-Pointwise-Convolution-Filters.png" alt="Figure 3. Filters in a point-wise convolutional layer.">

$$
\begin{align}
\textrm{Figure 3. Filters in a point-wise convolutional layer (credit: Howard et al., 2017).}
\end{align}
$$

#### Depth-wise Separable Convolutions

Point-wise convolution is used in conjunction with depth-wise convolution in order to perform depth-wise separable convolution. Here, depth-wise separable convolution borrows the idea that the depth and the spatial dimension of a filter can be separated, as is the case with e.g., the [Sobel](https://en.wikipedia.org/wiki/Sobel_operator) filter for edge detection. The depth-wise separable convolution separates a kernel into two independent kernels, each of which performs two convolutions: the depth-wise and the point-wise convolution. Thus, the total computational cost for the depth-wise separable convolution is:
$$
\begin{align}
\left(\mathrm{D}_{g} * \mathrm{D}_{g} * \mathrm{M} * \mathrm{N}\right) + \left(\mathrm{D}_{g} * \mathrm{D}_{g} * \mathrm{M} * \mathrm{N}\right),
\end{align}
$$

which results in a $\frac{1}{\mathrm{N}} + \frac{1}{\mathrm{D}_{k} * \mathrm{D}_{k}}$ reduction in total computation, which can be observed in the following:
$$
\begin{align}
\frac{\left(\mathrm{D}_{g} * \mathrm{D}_{g} * \mathrm{M} * \mathrm{N}\right) + \left(\mathrm{D}_{g} * \mathrm{D}_{g} * \mathrm{M} * \mathrm{N}\right)}{\mathrm{D}_{g} * \mathrm{D}_{g} * \mathrm{M} * \mathrm{N} * \mathrm{D}_{k} * \mathrm{D}_{k}} &= \frac{\mathrm{D}_{k}^{2} + \mathrm{N}}{\mathrm{D}_{k}^{2} * \mathrm{N}} = \frac{1}{\mathrm{N}} + \frac{1}{\mathrm{D}_{k}^{2}}.
\end{align}
$$

When $\mathrm{N}$ is selected to be large enough, depth-wise separable convolution networks can be immensely more computationally efficient. For example, the MobileNet architecture uses a $3\times 3$ kernel, which results in $\sim 9\mathrm{x}$ efficiency improvement over regular convolution layers.

#### Width Multiplier

In addition to utilising alternative convolution methods, scaling the number of input and output channels proportional to a _width multiplier_ is often performed. The width multiplier $\alpha$ is a hyperparameter set to a value in the range $\alpha \in \left[0, 1\right]$. This results in a computational cost:
$$
\begin{align}
\left(\mathrm{D}_{f} * \mathrm{D}_{f} * \alpha\mathrm{M} * \mathrm{D}_{k} * \mathrm{D}_{k}\right) + \left(\mathrm{D}_{f} * \mathrm{D}_{f} * \alpha\mathrm{M} * \alpha\mathrm{N}\right).
\end{align}
$$

#### Resolution Multiplier

Similar to the width multiplier, a _resolution multiplier_ is used to scale the size of the input feature map. The resolution multiplier $\rho$ is a hyperparameter set to a value in the range $\rho \in \left[0, 1\right]$. This results in a computational cost:
$$
\begin{align}
\left(\rho\mathrm{D}_{f} * \rho\mathrm{D}_{f} * \mathrm{M} * \mathrm{D}_{k} * \mathrm{D}_{k}\right) + \left(\rho\mathrm{D}_{f} * \rho\mathrm{D}_{f} * \mathrm{M} * \mathrm{N}\right).
\end{align}
$$

Combining the width and resolution multipliers results in a scaled computational cost of:
$$
\begin{align}
\left(\rho\mathrm{D}_{f} * \rho\mathrm{D}_{f} * \alpha\mathrm{M} * \mathrm{D}_{k} * \mathrm{D}_{k}\right) + \left(\rho\mathrm{D}_{f} * \rho\mathrm{D}_{f} * \alpha\mathrm{M} * \alpha\mathrm{N}\right).
\end{align}
$$

In the MobileNets architecture, for example, these values of $\alpha$ and $\rho$ are selected w.r.t. the speed versus accuracy versus size trade-off. In the paper by original author Howard et al., 2017, the authors found that the resolution multiplier has the effect of reducing computational cost by $\rho^{2}$, whereas the width multiplier has the effect of reducing computational cost and number of parameters quadratically by roughly $\alpha^{2}$.

### 1.3. Convolutional Network Architectures

#### FCN-8 

The [FCN-8]() architecture by Long et al., 2014 [2] is an architecture that uses $1\times 1$ convolutional layers to replace the fully-connecte—d layers of a standard neural network. As a result, the FCN-8 architecture is able to preserve spatial information of the input tensor and perform the down-sampling and feature extraction routines of a convolutional network.

Fully-Convolutional Network (FCN) architectures have two primary components — an _encoder_ and a _decoder_. The encoder extracts features from an image using a series of sliding window convolution operations. The decoder in an FCN is used to up-scale the down-sampled intermediate feature maps generated from the encoder to a higher resolution — usually matching the original input dimensions. In the FCN-8 architecture, the encoder block is a set of $1\times 1$ convolution layers. The decoder block of the FCN-8 is a set of transposed convolution layers which upsample the feature maps to the size of the original input. This process is usually referred to as "reverse convolution" or deconvolution since its effect is essentially reversing (with some loss) the downsampling of the input. 

In order to preserve fine-grained segmentation maps through the network to the decoder block, a set of skip _connections_ are used. Essentially, these "connections" between non-adjacent layers of differing resolutions help retain information from the original input by combining the output feature maps of each respective layer using an element-wise addition operation. As a result, the FCN-8 with skip connections is able to use information from multiple resolutions to make more precise segmentation decisions. Skip connections, along with the other advancements from the FCN-8 architecture, have proven successful in empirical studies between the FCN-8 and its "sister" networks — the FCN-16 and FCN-32. For more information on Fully-Convolutional Networks, see the [notebook from the previous lesson](https://github.com/jonathanloganmoran/ND0013-Self-Driving-Car-Engineer/blob/1.7/1-Computer-Vision/Exercises/1-6-1-Fully-Convolutional-Networks/2023-01-23-Fully-Convolutional-Networks.ipynb).  

#### MobileNets

The [MobileNets](https://arxiv.org/abs/1704.04861) architecture by Howard et al., 2017 [1] is an architecture that uses depth-wise separable convolutions to build light-weight deep neural networks. The MobileNets architecture, as the name suggests, is designed to run object detection and classification tasks efficiently (i.e., with high FPS and low memory footprint) on mobile and embedded devices. The MobileNets architecture achieves this in a three-part approach:
1. **Depth-wise separable convolutions** — Perform a depth-wise convolution followed by a $1\times 1$ convolution (instead of a standard convolution). The $1\times 1$ convolution is called a point-wise convolution if it follows after a depth-wise convolution;
2. **Width multipliers** — Reduces the size of the input / output channels using a scaling factor set to a value between $0.0$ and $1.0$;
3. **Resolution multipliers** — Reduces the size of the original input using a scaling factor set to a value between $0.0$ and $1.0$.

These three techniques reduce the cummulative number of parameters in the network and therefore the amount of computation required. The downside to models exploiting the parameter reduction approach is that accuracy is often the trade-off. 

#### Single Shot Detector (SSD)

Many of the earlier deep neural network architectures involved networks with more than one training phase; the [Faster-RCNN](https://arxiv.org/abs/1506.01497) for example, first trains a Region Proposal Network (RPN) which is then merged with a pre-trained classification sub-network. The [Single Shot Detector](https://arxiv.org/abs/1512.02325) (SSD) by Liu et al., 2015 [3] combines these two sub-networks into a single-pass network that predicts bounding box locations and classifies the corresponding object classes. The major difference with single-shot networks is that they can be trained end-to-end, whereas architectures with multiple sub-networks, such as the Faster-RCNN, must train each module separately. The following is an outline of the original SSD architecture proposed by Liu et al., 2015 [3]:

<img src="figures/2023-01-26-Figure-4-Single-Shot-Detector-Network-Architecture.png" alt="Figure 4. Architecture of the Single Shot Detector (SSD) proposed by Liu et al., 2015.">

$$
\begin{align}
\textrm{Figure 4. Architecture of the Single Shot Detector (SSD) proposed by Liu et al., 2015.}
\end{align}
$$

In the above architecture, we note the use of the VGG-16 pre-trained convolutional base. In this notebook, we will instead be using the MobileNet pre-trained base from Howard et al., 2017 [1].

##### Bounding box detection with SSD

SSD operates on feature maps to predict bounding box locations. Recall a feature map of $\mathrm{D}_{f} * \mathrm{D}_{f} * \mathrm{M}$. For each feature map location, $k$ bounding boxes are predicted. Each bounding box carries with it the following information:
* $\left(\mathrm{c}x, \mathrm{c}y, w, h\right)$ — Four bounding box corner offset locations;
* $C = \left(c_{1}, c_{2},\ldots, c_{p}\right)$ — class probabilities.

The SSD does not predict the _shape_ of the box but rather the location of where the box is in the image. The $k$ bounding boxes each have a pre-determined shape (i.e., the anchors). This is illustrated in the figure below:

<img src="figures/2023-01-26-Figure-5-Bounding-Box-Internal-Representation-with-SSD.png" alt="Figure 5. Internal representation of bounding boxes using anchors with the Single Shot Detector (SSD) network.">

$$
\begin{align}
\textrm{Figure 5. Internal representation of bounding boxes using anchors in Single Shot Detector (SSD) network.}
\end{align}
$$

The anchor boxes used in the SSD have coordinates that are manually configured prior to training. Shown in Figure 5(c) is a set of $k = 4$ anchor boxes of varying size used to isolate an object for detection.

In order to filter nonsensical bounding boxes, we use a loss function. For the final set of $N$ matched boxes, we compute the loss as:
$$
\begin{align}
L &= \frac{1}{N}\left(L_{\textrm{class}} + L_{\textrm{box}}\right),
\end{align}
$$
where $L_{\textrm{class}}$ is a softmax loss for classification, and $L_{\textrm{box}}$ is an L1 smooth loss representing the error of the matched boxes to the ground-truth boxes. Note that L1 smooth loss is a modification of the standard L1 loss which is more robust to outliers. Also note that when $N = 0$, the loss is set to $0.0$.

##### In Summary

* A pre-trained convolutional base is used (e.g., VGG-16 or MobileNet);
* The base model is extended with several convolutional blocks;
* Each feature map is used to predict bounding boxes, and therefore diversity in feature map size allows for object detection at different resolutions;
* Boxes are filtered by the Intersection over Union (IoU) metric and with hard negative mining (looking for difficult-to-detect examples);
* The loss functions are softmax for classification and smooth L1 for detection;
* The entire SSD network can be trained end-to-end.

## 2. Programming Task

### 2.1. Intersection over Union (IoU)

Here we use the TensorFlow [`tf.keras.metrics.MeanIoU`](https://www.tensorflow.org/api_docs/python/tf/keras/metrics/MeanIoU) function to compute the mean Intersection over Union (IoU) across all classes $i=0,\ldots, n$.

In order to use the metric as a standalone function, we have to first initialise the respective [`tf.keras.metrics.Metric`](https://www.tensorflow.org/versions/r2.9/api_docs/python/tf/keras/metrics/Metric) subclass instance (i.e., `MeanIoU`), then perform a single "state update" using the [`update_state()`](https://www.tensorflow.org/api_docs/python/tf/keras/metrics/MeanIoU#update_state) class method. As arguments to this function, we pass in the `y_true` and `y_pred` tensors that we wish to evaluate. Optionally, we can provide a `sample_weight` scalar value or vector of rank equal to `y_true`. 

In [10]:
### Defining the number of distinct class labels (i.e., classes)
N_CLASSES = 4

In [11]:
### Initialising the `tf.keras.metrics.Metric` instance
iou_mean = tf.keras.metrics.MeanIoU(
    num_classes=N_CLASSES,
    name='Mean IoU for multi-class object segmentation data',
    dtype=tf.dtypes.float32,
    ### Additional arguments for TF2.10+ API:
    #ignore_class=None,
    #sparse_y_true=True,    # `True` if class labels are integers, `False` if floating-point
    #sparse_y_pred=True,
    #axis=-1
)

#### Testing the `MeanIoU` metric

In [12]:
### Defining our prediction and ground-truth sets
ground_truth_labels = [
    [0, 0, 0, 0], 
    [1, 1, 1, 1],
    [2, 2, 2, 2], 
    [3, 3, 3, 3],
]
predicted_labels = [
    [1, 0, 0, 0],
    [1, 3, 0, 1],
    [2, 2, 2, 3],
    [3, 1, 0, 0],
]

In [13]:
### Converting the matrices to n-rank tensors
y_true = tf.convert_to_tensor(
    np.array(
        ground_truth_labels).reshape(1, -1, len(ground_truth_labels), N_CLASSES
    ),
    dtype=tf.float32
)
y_pred = tf.convert_to_tensor(
    np.array(
        predicted_labels).reshape(1, -1, len(predicted_labels), N_CLASSES
    ),
    dtype=tf.float32
)

In [14]:
### Computing the mean IoU
iou_mean.update_state(
    y_true=y_true,
    y_pred=y_pred
)
iou_mean.result().numpy()

0.41964284

As shown above, we obtain a mean IoU score for the set of predictions of $\mathrm{IoU}_{\textrm{mean}} \ \approx 0.420$, which matches our expected value for this test set.

Now, we repeat the Mean IoU calculation using a second test set of predictions. Note that [`tf.keras.metrics.Metric`](https://www.tensorflow.org/api_docs/python/tf/keras/metrics/Metric) instances are stateful by default, meaning that each call to `update_state()` computes the Mean IoU for the input _mini-batch of predictions / ground-truth labels_. In other words, the default behaviour for this `Metric` instance is to accumulate the Mean IoU score across calls to `update_state()`. Since we are computing the Mean IoU score of the full batch (i.e., `batch_size=1`) manually, we must use the [`reset_state()`](https://www.tensorflow.org/api_docs/python/tf/keras/metrics/MeanIoU#reset_state) method before performing the IoU calculation with [`update_state()`](https://www.tensorflow.org/api_docs/python/tf/keras/metrics/MeanIoU#update_state).

In [15]:
### Defining our new prediction set (assuming same `ground_truth_labels`)
predicted_labels = [
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 2, 2, 1],
    [3, 3, 0, 3],
]

In [16]:
### Converting the matrix to an n-rank tensor
y_pred = tf.convert_to_tensor(
    np.array(
        predicted_labels).reshape(1, -1, len(predicted_labels), N_CLASSES
    ),
    dtype=tf.float32
)

In [17]:
### Resetting the state of the `MeanIoU` instance
# i.e., current cumulative mean IoU becomes `0.0`
iou_mean.reset_state()

In [18]:
### Computing the new mean IoU
iou_mean.update_state(
    y_true=y_true,
    y_pred=y_pred
)
iou_mean.result().numpy()

0.53869045

As shown above, we obtain a new mean IoU score for this second test of predictions of $\mathrm{IoU}_{\textrm{mean}} \ \approx 0.539$, which matches our expected value for this second test set.

### 2.2. Separable Depthwise Convolution

NOTE: the code provided here has been migrated to the TensorFlow 2.x API. Some functionality may differ from the original implementation.

Here we implement the a MobileNets depth-wise separable convolution block using the following [`tf.keras.layers.Layer`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Layer) components:
* [`tf.nn.depthwise_conv2d`](https://www.tensorflow.org/api_docs/python/tf/nn/depthwise_conv2d);
* [`tf.keras.layers.BatchNormalization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization);
* [`tf.keras.layers.ReLU`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/ReLU);
* [`tf.keras.layers.Conv2D`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D).


We will then compare the number of parameters of the depth-wise separable convolution block to a regular convolution block, which is formed using the following [`tf.keras.layers.Layer`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Layer) components:
* [`tf.keras.layers.Conv2D`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D) layer;
* [`tf.keras.layers.BatchNormalization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization).

In [19]:
### From Udacity's `CarND-Object-Detection-Lab.ipynb`

In [20]:
def vanilla_conv_block(
        x: Union[np.ndarray, tf.Tensor], 
        kernel_size: Union[Tuple[int], List[int]],
        output_channels: int
) -> tf.keras.models.Sequential:
    """Implements a vanilla (regular) convolution block.
    
    A convolution block here is defined as the following:
        Vanilla Conv -> Batch Norm -> ReLU,
    where 'Vanilla Conv' corresponds to the `Conv2D` layer
    provided in the TensorFlow 2.x API.
    
    :param x: Input tensor used to build the convolutional block,
        i.e., input shape of `x` is provided for delayed-build pattern.
    :param kernel_size: Kernel size to assign the convolutional layer.
    :param output_channels: Depth of the output tensor.
    :returns: The 'vanilla' convolutional block,
        buit for input tensors of shape given by `x`.
    """
    
    ### Build the convolutional block
    conv = tf.keras.models.Sequential() 
    conv.add(
        tf.keras.layers.Conv2D(
            filters=output_channels,
            kernel_size=kernel_size,
            strides=(2, 2),
            padding='SAME'
        )
    )
    conv.add(
        tf.keras.layers.BatchNormalization()
    )
    conv.add(
        tf.keras.layers.ReLU()
    )
    ### Build the block by setting expected input shape to shape of tensor `x`
    conv.build(x.shape)
    ### Return the built convolutional block
    return conv

In [21]:
def mobilenet_conv_block(
        x: Union[np.ndarray, tf.Tensor], 
        kernel_size: Union[Tuple[int], List[int]], 
        output_channels: int
) -> tf.keras.models.Sequential:
    """Implements a depth-wise separable convolution block (Howard, 2017).

    A depth-wise separable convolution block is defined as the following:
        Depth-wise -> Batch Norm -> ReLU -> Point-wise -> Batch Norm -> ReLU,
    where the 'Point-wise' is implemented using the `Conv2D` layer provided
    in the TensorFlow 2.x API.

    :param x: Input tensor used to build the MobileNet convolutional block,
        i.e., input shape of `x` is provided for delayed-build pattern.
    :param kernel_size: Kernel size to assign the convolutional layers.
    :param output_channels: Depth of the output tensor
    :returns: The 'MobileNet' convolutional block,
        built for input tensors of shape given by `x`.
    """
    
    ### Build the MobileNet convolutional block
    conv = tf.keras.models.Sequential()
    conv.add(
        # Should have 3x3 kernel
        tf.keras.layers.DepthwiseConv2D(
            kernel_size=KERNEL_SIZE,
            strides=(1, 1),
            padding='SAME'
        )
    )
    conv.add(
        tf.keras.layers.BatchNormalization()
    )
    conv.add(
        tf.keras.layers.ReLU()
    )
    # Should have 1x1 kernel
    conv.add(
        tf.keras.layers.Conv2D(
            filters=1,
            kernel_size=(1, 1),
            strides=(1, 1),
            padding='SAME',
        )
    )
    conv.add(
        tf.keras.layers.BatchNormalization()
    )
    conv.add(
        tf.keras.layers.ReLU()
    )
    ### Build the block by setting expected input shape to shape of tensor `x`
    conv.build(x.shape)
    ### Return the built MobileNet convolutional block
    return conv

#### Testing the MobileNets convolutional block 

In [22]:
### From Udacity's `CarND-Object-Detection-Lab.ipynb`

In [23]:
### Setting the input parameters
INPUT_CHANNELS = 32
OUTPUT_CHANNELS = 512
KERNEL_SIZE = 3
IMG_HEIGHT = 256
IMG_WIDTH = 256

In [24]:
### Creating the input tensor
x = tf.constant(
    np.random.rand(1, IMG_HEIGHT, IMG_WIDTH, INPUT_CHANNELS),
    dtype=tf.float32
)

In [25]:
### Testing the 'vanilla' convolutional block
conv_vanilla = vanilla_conv_block(
    x=x,
    kernel_size=KERNEL_SIZE,
    output_channels=OUTPUT_CHANNELS
)

In [26]:
### Printing the 'vanilla' convolutional block summary (no. parameters)
conv_vanilla.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (1, 128, 128, 512)        147968    
                                                                 
 batch_normalization (BatchN  (1, 128, 128, 512)       2048      
 ormalization)                                                   
                                                                 
 re_lu (ReLU)                (1, 128, 128, 512)        0         
                                                                 
Total params: 150,016
Trainable params: 148,992
Non-trainable params: 1,024
_________________________________________________________________


In [27]:
### Compute the filter of the Depth-wise Conv2D
# Assuming the 'BHWC' format
input_channel_dim = x.get_shape().as_list()[-1]
tf.Variable(tf.random.truncated_normal(
    shape=(KERNEL_SIZE, KERNEL_SIZE, input_channel_dim, 1)
)).shape

TensorShape([3, 3, 32, 1])

In [28]:
### Testing the 'MobileNet' depth-wise separable convolutional block
conv_mobilenet = mobilenet_conv_block(
    x=x,
    kernel_size=KERNEL_SIZE,    # Should be (3, 3) for MobileNet
    output_channels=OUTPUT_CHANNELS
)

In [29]:
conv_mobilenet.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 depthwise_conv2d (Depthwise  (1, 256, 256, 32)        320       
 Conv2D)                                                         
                                                                 
 batch_normalization_1 (Batc  (1, 256, 256, 32)        128       
 hNormalization)                                                 
                                                                 
 re_lu_1 (ReLU)              (1, 256, 256, 32)         0         
                                                                 
 conv2d_1 (Conv2D)           (1, 256, 256, 1)          33        
                                                                 
 batch_normalization_2 (Batc  (1, 256, 256, 1)         4         
 hNormalization)                                                 
                                                      

##### Comparing the number of parameters

In [30]:
cv_params = conv_vanilla.count_params()
cm_params = conv_mobilenet.count_params()
diff_percent = (cv_params - cm_params) / cv_params * 100
print(f"Total parameters for 'vanilla' ConvNet block: {cv_params}")
print(f"Total parameters for 'MobileNet' ConvNet block: {cm_params}")
print(f"Reduction in parameters with 'MobileNet' ConvNet block: {diff_percent:.3f}%")

Total parameters for 'vanilla' ConvNet block: 150016
Total parameters for 'MobileNet' ConvNet block: 485
Reduction in parameters with 'MobileNet' ConvNet block: 99.677%


With the MobileNet's depth-wise separable convolutional block, we have a $99.677\%$ reduction in the total number of parameters in each block when compared to the standard "vanilla" convolutional block (i.e., the `Conv2D -> Batch Norm -> ReLU` block).

If minimising the total number of parameters is your goal, then the MobileNet architecture is sure to suit your needs.

### 2.3. Object Detection Inference

In this section we will detect objects using an object detection model and its pre-trained weights made available at the [TensorFlow Model Zoo](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md). All models and weights provided by the TensorFlow team have been pre-trained on the [COCO 2017](http://cocodataset.org/) dataset. The TensorFlow team even provides documentation regarding the use of the [Zoo models on mobile devices](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_on_mobile_tf2.md).

Here we will be experimenting with the following set of pre-trained models:
* [SSD MobileNet V1 FPN 640x640](http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_coco_11_06_2017.tar.gz);
* [RFCN ResNet101](http://download.tensorflow.org/models/object_detection/rfcn_resnet101_coco_11_06_2017.tar.gz) — DEPRECATED (archive [here](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/rfcn-resnet101-coco-tf));
* [Faster R-CNN Inception ResNet V2 640x640](http://download.tensorflow.org/models/object_detection/faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017.tar.gz).

Note that the above links are for the 11.6.17 versions and are intended for use with TensorFlow v1 models. Since we are instead going to be using the TensorFlow 2.x API, it's best to download these files instead from the [TensorFlow 2 Detection Model Zoo](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md).

Each of these three models produces a set of regressed bounding box coordinates as outputs. We will define here a set of utility / helper functions which:
1. Filter detected bounding boxes;
2. Convert the detected bounding boxes to the original image coordinates;
3. Render the converted bounding boxes onto the original image.

#### Setting up the TensorFlow Object Detection API

In [None]:
import pathlib

In [None]:
### Clone the `tensorflow/models` repository
if "models" in pathlib.Path.cwd().parts:
    while "models" in pathlib.Path.cwd().parts:
        os.chdir('..')
elif not pathlib.Path('models').exists():
    !git clone --depth 1 https://github.com/tensorflow/models

In [None]:
### Install the TF Object Detection API
%%bash
cd models/research
protoc object_detection/protos/*.proto --python_out=.
cp object_detection/packages/tf2/setup.py .
python -m pip install

In [None]:
### Import required Python modules
import io
import maptlotlib.pyplot as plt
from PIL import Image, ImageColor, ImageDraw, ImageFront
from scipy.stats import norm
import scipy.misc
from six import BytesIO

In [None]:
### Import TensorFlow modules
from object_detection.utils import label_map_util
from object_detection.utils import config_util
from object_detection.builders import visualization_utils as viz_utils
from object_detection.builders import model_builder

#### Defining the utility functions

In [None]:
### From TensorFlow's `inference_tf2_colab.ipynb`
# Credit: https://github.com/tensorflow/models/blob/master/research/object_detection/colab_tutorials/inference_tf2_colab.ipynb

In [None]:
def load_image_into_numpy_array(path):
  """Load an image from file into a numpy array.

  Puts image into numpy array to feed into tensorflow graph.
  Note that by convention we put it into a numpy array with shape
  (height, width, channels), where channels=3 for RGB.

  Args:
    path: the file path to the image

  Returns:
    uint8 numpy array with shape (img_height, img_width, 3)
  """
  img_data = tf.io.gfile.GFile(path, 'rb').read()
  image = Image.open(BytesIO(img_data))
  (im_width, im_height) = image.size
  return np.array(image.getdata()).reshape(
      (im_height, im_width, 3)).astype(np.uint8)

def get_keypoint_tuples(eval_config):
  """Return a tuple list of keypoint edges from the eval config.
  
  Args:
    eval_config: an eval config containing the keypoint edges
  
  Returns:
    a list of edge tuples, each in the format (start, end)
  """
  tuple_list = []
  kp_list = eval_config.keypoint_edge
  for edge in kp_list:
    tuple_list.append((edge.start, edge.end))
  return tuple_list

In [None]:
### From Udacity's `CarND-Object-Detection-Lab.ipynb`

In [None]:
COLOUR_LIST = sorted([c for c in ImageColor.colormap.keys()])

In [None]:

# Colors (one for each class)
cmap = ImageColor.colormap
print("Number of colors =", len(cmap))
COLOR_LIST = sorted([c for c in cmap.keys()])

#
# Utility funcs
#

def filter_boxes(min_score, boxes, scores, classes):
    """Return boxes with a confidence >= `min_score`"""
    n = len(classes)
    idxs = []
    for i in range(n):
        if scores[i] >= min_score:
            idxs.append(i)
    
    filtered_boxes = boxes[idxs, ...]
    filtered_scores = scores[idxs, ...]
    filtered_classes = classes[idxs, ...]
    return filtered_boxes, filtered_scores, filtered_classes

def to_image_coords(boxes, height, width):
    """
    The original box coordinate output is normalized, i.e [0, 1].
    
    This converts it back to the original coordinate based on the image
    size.
    """
    box_coords = np.zeros_like(boxes)
    box_coords[:, 0] = boxes[:, 0] * height
    box_coords[:, 1] = boxes[:, 1] * width
    box_coords[:, 2] = boxes[:, 2] * height
    box_coords[:, 3] = boxes[:, 3] * width
    
    return box_coords

def draw_boxes(image, boxes, classes, thickness=4):
    """Draw bounding boxes on the image"""
    draw = ImageDraw.Draw(image)
    for i in range(len(boxes)):
        bot, left, top, right = boxes[i, ...]
        class_id = int(classes[i])
        color = COLOR_LIST[class_id]
        draw.line([(left, top), (left, bot), (right, bot), (right, top), (left, top)], width=thickness, fill=color)
        
def load_graph(graph_file):
    """Loads a frozen inference graph"""
    graph = tf.Graph()
    with graph.as_default():
        od_graph_def = tf.GraphDef()
        with tf.gfile.GFile(graph_file, 'rb') as fid:
            serialized_graph = fid.read()
            od_graph_def.ParseFromString(serialized_graph)
            tf.import_graph_def(od_graph_def, name='')
    return graph

#### Loading input data

In [None]:
### From Udacity's `CarND-Object-Detection-Lab.ipynb`

In [None]:

detection_graph = load_graph(SSD_GRAPH_FILE)
# detection_graph = load_graph(RFCN_GRAPH_FILE)
# detection_graph = load_graph(FASTER_RCNN_GRAPH_FILE)

# The input placeholder for the image.
# `get_tensor_by_name` returns the Tensor with the associated name in the Graph.
image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')

# Each box represents a part of the image where a particular object was detected.
detection_boxes = detection_graph.get_tensor_by_name('detection_boxes:0')

# Each score represent how level of confidence for each of the objects.
# Score is shown on the result image, together with the class label.
detection_scores = detection_graph.get_tensor_by_name('detection_scores:0')

# The classification of the object (integer id).
detection_classes = detection_graph.get_tensor_by_name('detection_classes:0')

#### Running inference on test images

In [None]:
### From Udacity's `CarND-Object-Detection-Lab.ipynb`

In [None]:
# Load a sample image.
image = Image.open('./assets/sample1.jpg')
image_np = np.expand_dims(np.asarray(image, dtype=np.uint8), 0)

with tf.Session(graph=detection_graph) as sess:                
    # Actual detection.
    (boxes, scores, classes) = sess.run([detection_boxes, detection_scores, detection_classes], 
                                        feed_dict={image_tensor: image_np})

    # Remove unnecessary dimensions
    boxes = np.squeeze(boxes)
    scores = np.squeeze(scores)
    classes = np.squeeze(classes)

    confidence_cutoff = 0.8
    # Filter boxes with a confidence score less than `confidence_cutoff`
    boxes, scores, classes = filter_boxes(confidence_cutoff, boxes, scores, classes)

    # The current box coordinates are normalized to a range between 0 and 1.
    # This converts the coordinates actual location on the image.
    width, height = image.size
    box_coords = to_image_coords(boxes, height, width)

    # Each class with be represented by a differently colored box
    draw_boxes(image, box_coords, classes)

    plt.figure(figsize=(12, 8))
    plt.imshow(image) 

### 2.4. Timing Detection

### 2.5. Object Detection Pipeline

## 3. Closing Remarks

##### Alternatives
* TODO.
##### Extensions of task
* TODO.

## 4. Future Work

- ⬜️ TODO.

## Credits

This assignment was prepared by Kelvin Lwin, Andrew Bauman, Dominique Luna et al., 2021 (link [here](https://github.com/udacity/CarND-Object-Detection-Lab)).

References
* [1] Howard, A. G. et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv. 2017. [doi:10.48550/arXiv.1704.04861](https://arxiv.org/abs/1704.04861).
* [2] Shelhamer, E. et al. Fully Convolutional Networks for Semantic Segmentation. arXiv. 2016. [doi:10.48550/arXiv.1605.06211](https://arxiv.org/abs/1605.06211).
* [3] Liu, W. et al. SSD: Single Shot MultiBox Detector. European Conference on Computer Vision, ECCV. Lecture Notes in Computer Science, 9905:21-37. 2016. [doi:10.1007/978-3-319-46448-0_2](https://doi.org/10.1007/978-3-319-46448-0_2).

Helpful resources:
* [`CarND-Object-Detection-Lab` by @udacity | GitHub](https://github.com/udacity/CarND-Object-Detection-Lab);
* [3.4. Depthwise Convolution | Dive Into Deep Learning](https://tvm.d2l.ai/chapter_common_operators/depthwise_conv.html);
* [Depth-wise Convolution and Depth-wise Separable Convolution by A. Pandey | Medium](https://medium.com/@zurister/depth-wise-convolution-and-depth-wise-separable-convolution-37346565d4ec);
* [Pointwise Convolution by A. Shrivastav | OpenGenus](https://iq.opengenus.org/pointwise-convolution/);
* [Depthwise Separable Convolution - A FASTER CONVOLUTION! | YouTube](https://www.youtube.com/watch?v=T7o3xvJLuHk)