# Object Detection 

- Object detection is the process of localizing an object in an image by predicting the coordinates of a bounding box that contains it, while at the same time correctly classifying it.

- Regressing the bounding box coordinates of a single object is called localization; assigning a class to its content is called classification.

In [0]:
%tensorflow_version 2.x

In [0]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

In [0]:
import matplotlib.pyplot as plt

In [0]:
print(tf.__version__)

2.1.0-rc1


In [0]:
(train, test, validation), info = tfds.load( "voc", split=["train", "test", "validation"], with_info=True )

In [0]:
print(info)

For every image, there is a SequenceDict object that contains the information of every labeled object present. 

In [0]:
with tf.device("/GPU:0"): 
    for row in train.take(5): 
        obj = row["objects"] 
        image = tf.image.convert_image_dtype(row["image"], tf.float32) 
 
        for idx in tf.range(tf.shape(obj["label"])[0]): 
            image = tf.squeeze( 
                tf.image.draw_bounding_boxes( 
                    images=tf.expand_dims(image, axis=[0]), 
                    boxes=tf.reshape(obj["bbox"][idx], (1, 1, 4)), 
                    colors=tf.reshape(tf.constant((1.0, 1.0, 0, 0)), (1, 4)), 
                ), 
                axis=[0], 
            ) 

            print( 
                "label: ", info.features["objects"]["label"].int2str(obj["label"][idx]) 
            ) 
        plt.imshow(image)
        plt.show()
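
TFDS stores each VOC bounding box as normalized `[y_min, x_min, y_max, x_max]` values in the [0, 1] range, which is also the layout `tf.image.draw_bounding_boxes` expects. As a minimal sketch (the helper name `to_pixels` and the example numbers are illustrative assumptions, not part of the dataset API), a normalized box can be converted to pixel coordinates like this:

```python
def to_pixels(bbox, height, width):
    """Convert a normalized [y_min, x_min, y_max, x_max] box to integer pixels."""
    y_min, x_min, y_max, x_max = bbox
    return (
        round(y_min * height),
        round(x_min * width),
        round(y_max * height),
        round(x_max * width),
    )


# Hypothetical box inside a 200x400 (height x width) image.
print(to_pixels([0.25, 0.25, 0.75, 0.5], height=200, width=400))
```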


In [0]:
def filter_single_object(dataset):
    # Keep only the images that contain exactly one labeled object.
    return dataset.filter(lambda row: tf.equal(tf.shape(row["objects"]["label"])[0], 1))

train, test, validation = [filter_single_object(ds) for ds in (train, test, validation)]

# Object Localization

- Object localization is a regression problem: the network predicts the four coordinates of the bounding box.

In [0]:
inputs = tf.keras.layers.Input(shape=(299,299,3))
net = hub.KerasLayer(
    "https://tfhub.dev/google/tf2-preview/inception_v3/feature_vector/2",
    output_shape = [2048],
    trainable = False,
) (inputs)

net = tf.keras.layers.Dense(512) (net)
net = tf.keras.layers.ReLU() (net)
# Regression head: the four bounding box coordinates (y_min, x_min, y_max, x_max).
coordinates = tf.keras.layers.Dense(4, use_bias=False) (net)

regressor = tf.keras.Model(inputs = inputs, outputs=coordinates)

In [0]:
def prepare(dataset):
    def _fn(row):
        row["image"] = tf.image.convert_image_dtype(row["image"], tf.float32)
        row["image"] = tf.image.resize(row["image"], (299, 299))
        return row

    return dataset.map(_fn)


In [0]:
train, test, validation = prepare(train), prepare(test), prepare(validation)

The loss is the mean squared error (L2 loss) between the predicted and the ground truth coordinates:

In [0]:
def l2(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_pred - tf.squeeze(y_true, axis=[1])))
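
To make the arithmetic of the loss concrete, here is a small plain-Python sketch of the same mean squared error on a single pair of boxes. The helper name `mse` and the box values are assumptions made for illustration only.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between two equally sized coordinate lists."""
    assert len(y_true) == len(y_pred)
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical normalized boxes in [y_min, x_min, y_max, x_max] order.
gt_box = [0.0, 0.0, 0.5, 0.5]
pred_box = [0.0, 0.0, 1.0, 1.0]

# Squared errors are 0, 0, 0.25 and 0.25, so the mean is 0.125.
print(mse(gt_box, pred_box))
```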


In [0]:
def draw(dataset, regressor, step):
    with tf.device("/CPU:0"):
        row = next(iter(dataset.take(3).batch(3)))
        images = row["image"]
        obj = row["objects"]
        boxes = regressor(images)
        tf.print(boxes)

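        # Predicted boxes in red, ground truth boxes in green (RGBA floats),
        # so that the two can be told apart in TensorBoard.
        images = tf.image.draw_bounding_boxes(
            images=images,
            boxes=tf.reshape(boxes, (-1, 1, 4)),
            colors=[[1.0, 0.0, 0.0, 1.0]],
        )
        images = tf.image.draw_bounding_boxes(
            images=images,
            boxes=tf.reshape(obj["bbox"], (-1, 1, 4)),
            colors=[[0.0, 1.0, 0.0, 1.0]],
        )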
        tf.summary.image("images", images, step=step)

In [0]:
optimizer = tf.optimizers.Adam()
epochs = 10
batch_size = 3

global_step = tf.Variable(0, name="global_step", trainable=False, dtype=tf.int64)

train_writer = tf.summary.create_file_writer("log/train")
validation_writer = tf.summary.create_file_writer("log/test")

with validation_writer.as_default():
    draw(validation, regressor, global_step)

[[-0.996103168 -0.690272868 0.609899282 -0.544541836]
 [-1.04774523 -1.03945589 0.701453269 -0.593110919]
 [-0.624802232 -0.333296239 0.730165124 -0.145599976]]


In [0]:
@tf.function
def train_step(image, coordinates):
    with tf.GradientTape() as tape:
        loss = l2(coordinates, regressor(image))
    gradients = tape.gradient(loss, regressor.trainable_variables)
    optimizer.apply_gradients(zip(gradients, regressor.trainable_variables))
    return loss


In [0]:
train_batches = train.cache().batch(batch_size).prefetch(1)
with train_writer.as_default():
    for _ in tf.range(epochs):
        for batch in train_batches:
            obj = batch["objects"]
            coordinates = obj["bbox"]
            loss = train_step(batch["image"], coordinates)
            tf.summary.scalar("loss", loss, step=global_step)
            global_step.assign_add(1)
            if (global_step % 10 == 0):
                tf.print("step ", global_step, " loss: ", loss)
                with validation_writer.as_default():
                    draw(validation, regressor, global_step)
                with train_writer.as_default():
                    draw(train, regressor, global_step)


The training loop previously defined has various problems:

- The only measured metric is the L2 loss.
- The validation set is never used to measure any numerical score.
- No check for overfitting is present.
- There is a complete lack of a metric that measures how good the regression of the bounding box is, measured on both the training and the validation set.

A perfect match between the predicted and the ground truth box is rare; for this reason, we need a function that scores how good a detected bounding box is with respect to the ground truth. The most widely used measure of localization quality is the Intersection over Union.

# Intersection Over Union

Intersection over Union (IoU) is defined as the ratio between the area of overlap and the area of union.

In practice, the IoU measures how much the predicted bounding box overlaps with the ground truth. Since IoU is a metric that uses the areas of the objects, it can be easily expressed treating the ground truth and the detected area like sets.
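
Written in set notation, with $A$ the predicted box, $B$ the ground truth box, and $|\cdot|$ denoting area (the symbols are chosen here for illustration), the definition reads:

```latex
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}
                   = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
```

The second form is the one normally computed in practice, since the area of the union follows from the two box areas and the area of the intersection.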

The IoU value is in the [0,1] range, where 0 is a no-match (no overlap), and 1 is the perfect match. The IoU value is used as an overlap criterion; usually, an IoU value greater than 0.5 is considered as a true positive (match), while any other value is regarded as a false positive. There are no true negatives.

In [0]:
def iou(pred_box, gt_box, h, w):
    """
    Compute the IoU between a predicted box and a ground truth box.
    Args:
        pred_box: shape (4,): y_min, x_min, y_max, x_max - predicted box
        gt_box: shape (4,): y_min, x_min, y_max, x_max - ground truth
        h: image height
        w: image width
    """

    def _swap(box):
        # Scale to pixels and reorder to (x_min, y_min, x_max, y_max).
        return tf.stack([box[1] * w, box[0] * h, box[3] * w, box[2] * h])

    pred_box = _swap(pred_box)
    gt_box = _swap(gt_box)

    box_area = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])

    # Corners of the intersection rectangle.
    xx1 = tf.maximum(pred_box[0], gt_box[0])
    yy1 = tf.maximum(pred_box[1], gt_box[1])
    xx2 = tf.minimum(pred_box[2], gt_box[2])
    yy2 = tf.minimum(pred_box[3], gt_box[3])

    # Width and height of the intersection; zero when the boxes do not overlap.
    inter_w = tf.maximum(0.0, xx2 - xx1)
    inter_h = tf.maximum(0.0, yy2 - yy1)

    inter = inter_w * inter_h
    return inter / (box_area + area - inter)
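
The arithmetic is easy to check by hand on small boxes. The following plain-Python sketch (the name `box_iou` and the test boxes are illustrative assumptions) covers three cases: partial overlap, no overlap, and a perfect match. A correct IoU implementation should reproduce these values.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as [y_min, x_min, y_max, x_max]."""
    # Intersection height and width, clamped at zero for disjoint boxes.
    inter_h = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_w = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_h * inter_w
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


print(box_iou([0, 0, 2, 2], [1, 1, 3, 3]))  # intersection 1, union 7
print(box_iou([0, 0, 1, 1], [2, 2, 3, 3]))  # disjoint boxes
print(box_iou([0, 0, 2, 2], [0, 0, 2, 2]))  # identical boxes
```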


# Average precision

A value of IoU greater than a specified threshold (usually 0.5) allows us to treat the regressed bounding box as a match.

Counting every box whose IoU reaches the threshold as a true positive (TP) and every other box as a false positive (FP):

Average precision = TP / (TP + FP)

In object detection challenges, the Average Precision (AP) is often measured for different values of IoU. The minimum requirement is to measure the AP for an IoU value of 0.5.

- Average precision and IoU are not object-detection-specific metrics: IoU can be used whenever a localization task is performed, and the mean average precision (mAP) whenever the precision of a detection is measured.

- Measuring the mean average precision (over a single class) requires fixing a threshold for the IoU measurement and defining a tf.metrics.Precision object, which accumulates the precision over the batches.
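
At a fixed IoU threshold, the precision reduces to counting how many predicted boxes qualify as matches. A minimal plain-Python sketch, where the helper name `precision_at` and the IoU scores are made up for illustration:

```python
def precision_at(iou_values, threshold):
    """Fraction of predicted boxes whose IoU reaches the threshold."""
    true_positives = sum(1 for value in iou_values if value >= threshold)
    false_positives = len(iou_values) - true_positives
    total = true_positives + false_positives
    return true_positives / total if total else 0.0


ious = [0.9, 0.8, 0.6, 0.3]  # hypothetical IoU scores of four predictions
print(precision_at(ious, 0.5))   # 3 of 4 boxes match
print(precision_at(ious, 0.75))  # 2 of 4 boxes match
```

Raising the threshold can only keep or lower the score, which is why results are usually reported together with the IoU value they were measured at.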

In [0]:
# IoU threshold
threshold = 0.75
# Metric object
precision_metric = tf.metrics.Precision()

def draw(dataset, regressor, step):
    with tf.device("/CPU:0"):
        row = next(iter(dataset.take(3).batch(3)))
        images = row["image"]
        obj = row["objects"]
        boxes = regressor(images)

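        # In TF 2.x the `colors` argument is required. Predicted boxes are
        # drawn in red, ground truth boxes in green (RGBA floats).
        images = tf.image.draw_bounding_boxes(
            images=images,
            boxes=tf.reshape(boxes, (-1, 1, 4)),
            colors=[[1.0, 0.0, 0.0, 1.0]],
        )
        images = tf.image.draw_bounding_boxes(
            images=images,
            boxes=tf.reshape(obj["bbox"], (-1, 1, 4)),
            colors=[[0.0, 1.0, 0.0, 1.0]],
        )
        tf.summary.image("images", images, step=step)

        # Every regressed box is a detection (predicted positive); it is a
        # true positive only when its IoU with the ground truth reaches the
        # threshold, otherwise it is a false positive.
        true_labels, predicted_labels = [], []
        for idx, predicted_box in enumerate(boxes):
            iou_value = iou(predicted_box, tf.squeeze(obj["bbox"][idx]), 299, 299)
            true_labels.append(1 if iou_value >= threshold else 0)
            predicted_labels.append(1)

        precision_metric.update_state(true_labels, predicted_labels)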
        tf.summary.scalar("precision", precision_metric.result(), step=step)

# Multi-Task Learning

Multi-task learning is a machine learning subfield with the explicit goal of solving multiple different tasks, exploiting commonalities and differences across tasks. It has been empirically shown that using the same network to solve multiple tasks usually results in improved learning efficiency and prediction accuracy compared to the same network trained on each task separately.

Multi-task learning also helps to fight overfitting: since the network cannot adapt its parameters to a single task, it has to learn to extract meaningful features that are useful for several tasks.

Using a double-headed neural network also gives faster inference, since a single forward pass of a single model produces both outputs.

In [0]:
num_classes = 20

In [0]:
inputs = tf.keras.layers.Input(shape=(299,299,3))
net = hub.KerasLayer("https://tfhub.dev/google/tf2-preview/inception_v3/feature_vector/2", output_shape=[2048],
                     trainable=False) (inputs)

In [0]:
# Regression head: the four bounding box coordinates.
reg = tf.keras.layers.Dense(512, activation='relu') (net)
coordinates = tf.keras.layers.Dense(4, use_bias=False) (reg)

In [0]:
# Classification head: a probability for each of the num_classes labels.
clasf = tf.keras.layers.Dense(512, activation='relu') (net)
clasf = tf.keras.layers.Dense(256, activation='relu') (clasf)
clasf = tf.keras.layers.Dense(num_classes, activation='softmax', use_bias=False) (clasf)

In [0]:
model = tf.keras.Model(inputs=inputs, outputs=[coordinates, clasf])

In [0]:
model.summary()
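
The notebook does not show how the two heads are trained together. One common option, sketched here in plain Python under stated assumptions, is to add the regression loss and the classification loss into a single value. The function names and the `weight` parameter are hypothetical and only illustrate the idea.

```python
import math


def l2_loss(y_true, y_pred):
    """Mean squared error between two coordinate lists."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy(true_class, probabilities):
    """Negative log-likelihood of the true class under a softmax output."""
    return -math.log(probabilities[true_class])


def multi_task_loss(gt_box, pred_box, true_class, probabilities, weight=1.0):
    # `weight` is a hypothetical knob that balances the two tasks.
    return l2_loss(gt_box, pred_box) + weight * cross_entropy(true_class, probabilities)


# Hypothetical outputs of the two heads for one image with three classes.
total = multi_task_loss(
    gt_box=[0.0, 0.0, 0.5, 0.5],
    pred_box=[0.0, 0.0, 1.0, 1.0],
    true_class=0,
    probabilities=[0.5, 0.25, 0.25],
)
print(total)  # 0.125 (regression) + ln(2) (classification)
```

Because both terms depend on the shared feature extractor, a single backward pass through the summed loss updates the parameters for both tasks.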

Classifying images that contain a single object and regressing the coordinates of the only bounding box present applies only to a limited number of real-life scenarios. More often, given an input image, it is required to localize and classify multiple objects at the same time (the real object detection problem).

# Anchor Boxes and Anchor Based Detectors

Anchor-based detectors rely upon the concept of anchor boxes to detect objects in images in a single pass, using a single architecture.

The intuitive idea behind anchor-based detectors is to split the input image into several regions of interest (the anchor boxes) and apply a localization and classification network to each of them. The idea is to make the network learn not only to regress the coordinates of a bounding box and classify its content, but also to use the same network to look at different regions of the image in a single forward pass.

To train these models, it is required not only to have a dataset with the annotated ground truth boxes, but also to add to every input image a new collection of boxes that overlap (with the desired amount of IoU) the ground truth boxes.
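
One way to build that collection is to label every candidate box by its best IoU with the ground truth. The sketch below is an illustration only: the names `_iou` and `match_anchors`, the 0.5 threshold, and the boxes are assumptions, and real detectors use more elaborate matching rules.

```python
def _iou(box_a, box_b):
    """IoU of two boxes given as [y_min, x_min, y_max, x_max]."""
    inter_h = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_w = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_h * inter_w
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def match_anchors(anchors, gt_boxes, threshold=0.5):
    """For each anchor, return the index of the best-overlapping ground
    truth box, or -1 when no ground truth box reaches the IoU threshold."""
    matches = []
    for anchor in anchors:
        scores = [_iou(anchor, gt) for gt in gt_boxes]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append(best if scores[best] >= threshold else -1)
    return matches


# Three hypothetical candidate boxes and one ground truth box.
anchors = [[0.0, 0.0, 0.5, 0.5], [0.5, 0.5, 1.0, 1.0], [0.0, 0.5, 0.5, 1.0]]
gt_boxes = [[0.0, 0.0, 0.5, 0.6]]
print(match_anchors(anchors, gt_boxes))  # only the first anchor overlaps enough
```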

Anchor boxes are a discretization of the input image into different regions, also called anchors or bounding box priors. The idea is that the input can be discretized into different regions, each with a different appearance. An input image can contain both big and small objects, so the discretization should be made at different scales in order to detect objects at different resolutions at the same time.

When discretizing the input in anchor boxes, the important parameters are as follows:

- The grid size: How the input is evenly divided
- The box scale levels: Given the parent box, how to resize the current box
- The aspect ratio levels: For every box, the ratio between width and height
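
The three parameters above can be combined into a simple anchor generator. The following is a minimal sketch, not a specific detector's scheme: the function name `generate_anchors`, the choice to center each anchor on its grid cell, and the area-preserving use of the aspect ratio are all assumptions.

```python
def generate_anchors(grid_size, scales, aspect_ratios):
    """Return normalized [y_min, x_min, y_max, x_max] anchors.

    One anchor is created per (grid cell, scale, aspect ratio), centered on
    the cell. Anchors are not clipped, so large ones may exceed [0, 1].
    """
    cell = 1.0 / grid_size  # side of one grid cell in normalized units
    anchors = []
    for row in range(grid_size):
        for col in range(grid_size):
            cy, cx = (row + 0.5) * cell, (col + 0.5) * cell
            for scale in scales:
                for ratio in aspect_ratios:  # ratio = width / height
                    # Keep the area fixed for a given scale while changing the shape.
                    h = cell * scale / ratio ** 0.5
                    w = cell * scale * ratio ** 0.5
                    anchors.append([cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2])
    return anchors


anchors = generate_anchors(grid_size=2, scales=[1.0, 2.0], aspect_ratios=[1.0, 2.0])
print(len(anchors))  # 2 * 2 cells * 2 scales * 2 ratios = 16
print(anchors[0])    # first cell, scale 1.0, square anchor
```

The number of anchors grows as the product of the three parameters, so the grid size, scale levels, and aspect ratio levels directly control how many regions the network has to evaluate.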