# Object detection and segmentation

Different types of tasks for computer vision:
- Image classification: Assign one or more labels to an image
- **Object localization**: Assign a label to the most prominent object and define a box around that object
- **Object detection**: Assign a label and define a box for all objects in an image
- **Semantic segmentation**: Determine the class of each pixel in the image
- Instance segmentation: Determine the class of each pixel in the image distinguishing different instances of the same class

In this lesson, we focus on the tasks in bold.

Annotations required for each task:
- Image label(s) -> Image classification
- Object label, bounding box of the object of interest -> Object localization
- Object labels, bounding boxes of all objects of interest -> Object detection
- Semantic mask with foreground and background -> Binary semantic segmentation
- Semantic masks for all classes (including background) -> Multi-class semantic segmentation

## Object localization



Definitions:
- Object localization = the task of assigning a label and determining the bounding box of an object of interest in an image.
- A bounding box = a rectangular box that completely encloses the object, whose sides are parallel to the sides of the image.
- **Multi-head model** = a CNN where we have a backbone as typical for CNNs, but more than one head.

Example of a multi-head model in PyTorch:
```python
from torch import nn


class MultiHead(nn.Module):
    def __init__(self, out_feature, n_classes):
        # out_feature: number of features produced by the layers of each
        # head that are elided with `...` below
        super().__init__()
        # Backbone: this can be a custom network, or a
        # pre-trained network such as a resnet50 where the
        # original classification head has been removed. It computes
        # an embedding for the input image
        self.backbone = nn.Sequential(..., nn.Flatten())

        # Classification head: an MLP or some other neural network
        # ending with a fully-connected layer with an output vector
        # of size n_classes
        self.class_head = nn.Sequential(..., nn.Linear(out_feature, n_classes))

        # Localization head: an MLP or some other neural network
        # ending with a fully-connected layer with an output vector
        # of size 4 (the numbers defining the bounding box)
        self.loc_head = nn.Sequential(..., nn.Linear(out_feature, 4))

    def forward(self, x):
        # Both heads receive the same embedding from the backbone
        x = self.backbone(x)

        class_scores = self.class_head(x)
        bounding_box = self.loc_head(x)

        return class_scores, bounding_box
```

A multi-head model produces one output per head, so we need to combine the corresponding loss functions (one for the class and one for the bounding box) into a single loss, typically as a weighted sum. In PyTorch we can implement this as follows:

```python
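from torch import nn

class_loss = nn.CrossEntropyLoss()
loc_loss = nn.MSELoss()
# Relative weight of the localization loss
alpha = 0.5

...
# Assumes the data loader also yields the ground-truth bounding boxes
for images, labels, boxes in train_data_loader:
    ...

    # Reset the gradients accumulated in the previous iteration
    optimizer.zero_grad()

    # Get predictions
    class_scores, bounding_box = model(images)

    # Compute the weighted sum of the two losses
    loss = class_loss(class_scores, labels) + alpha * loc_loss(bounding_box, boxes)

    # Backpropagation
    loss.backward()

    # Update the weights
    optimizer.step()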
```


## Object detection

The task of object detection consists of detecting and localizing all the instances of the objects of interest.

An image can contain a variable number of objects, with different aspect ratios and scales. There are two main approaches to handling this:

### One-stage object detection
We consider a fixed number of windows with different scales and aspect ratios, centered at fixed locations (anchors). The output of the network then has a fixed size. The localization head will output a vector with a size of 4 times the number of anchors, while the classification head will output a vector with a size equal to the number of anchors multiplied by the number of classes.

### Two-stage object detection
In the first stage, an algorithm or a neural network proposes a fixed number of windows in the image. These are the places in the image with the highest likelihood of containing objects. The second stage then applies object localization to each proposed window, returning its class scores and bounding box.

In practice, the difference between the two is that while the first type has fixed anchors (fixed windows in fixed places), the second one optimizes the windows based on the content of the image.


## One-stage object detection: RetinaNet

We divide the image with a regular grid. Then for each grid cell we consider a certain number of windows with different aspect ratios and different sizes. We then "anchor" the windows in the center of each cell. If we have 4 windows and 45 cells, then we have 180 anchors.
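The anchor count above can be reproduced with a small sketch. This is only an illustration of the grid-and-windows idea, not RetinaNet's actual anchor generator: the helper `make_anchors`, the image size, the 9 x 5 grid, the four window shapes and the number of classes are all assumptions made for the example.

```python
from itertools import product


def make_anchors(image_w, image_h, n_cols, n_rows, window_shapes):
    """Return one (cx, cy, w, h) anchor per grid cell and window shape."""
    cell_w = image_w / n_cols
    cell_h = image_h / n_rows
    anchors = []
    for row, col in product(range(n_rows), range(n_cols)):
        # Center of the current grid cell
        cx = (col + 0.5) * cell_w
        cy = (row + 0.5) * cell_h
        # Every window shape is anchored at the same cell center
        for w, h in window_shapes:
            anchors.append((cx, cy, w, h))
    return anchors


# Hypothetical setup: a 9 x 5 grid (45 cells) and 4 window shapes per cell
window_shapes = [(32, 32), (64, 64), (64, 32), (32, 64)]
anchors = make_anchors(image_w=288, image_h=160, n_cols=9, n_rows=5,
                       window_shapes=window_shapes)

n_classes = 3  # hypothetical number of classes
n_anchors = len(anchors)  # 45 cells * 4 windows = 180 anchors

# Fixed output sizes of the two heads of a one-stage detector
loc_output_size = 4 * n_anchors            # 4 box numbers per anchor
class_output_size = n_classes * n_anchors  # one score per class per anchor
```

Because the number of anchors depends only on the grid and the window shapes, the output sizes of both heads are fixed regardless of how many objects the image contains.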

Summarizing, RetinaNet is characterized by three key features:
- Anchors - Anchors are windows with different sizes and different aspect ratios, placed in the center of cells defined by a grid on the image.
- Feature Pyramid Networks - The Feature Pyramid Network (FPN) is an architecture that extracts multi-level, semantically-rich feature maps from an image.
- Focal loss - The Focal Loss adds a factor in front of the normal cross-entropy loss to dampen the loss due to examples that are already well-classified so that they do not dominate. This factor introduces a hyperparameter γ: the larger γ, the more the loss of well-classified examples is suppressed.
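To make the effect of γ concrete, here is a minimal sketch of the focal loss for a single example, assuming the commonly used form in which the cross-entropy is multiplied by `(1 - p) ** gamma`, where `p` is the probability the model assigns to the true class. The function names are illustrative, and the class-balancing weight that is often added on top is left out.

```python
import math


def cross_entropy(p_true):
    """Cross-entropy loss given the probability assigned to the true class."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0):
    """Cross-entropy scaled by the modulating factor (1 - p_true) ** gamma."""
    return (1.0 - p_true) ** gamma * cross_entropy(p_true)


# Compare a well-classified, an uncertain and a misclassified example
for p in (0.9, 0.5, 0.1):
    print(f"p={p}: CE={cross_entropy(p):.4f}, focal={focal_loss(p):.4f}")
```

With `gamma=2`, the loss of the well-classified example (`p=0.9`) is scaled by about 0.01, while the loss of the misclassified one (`p=0.1`) is scaled by about 0.81. With `gamma=0` the factor is 1 and the focal loss reduces to the normal cross-entropy.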

## Object detection metrics

The following metrics are often used:
- **Intersection over Union** - The IoU is a measure of how much two boxes (or other polygons) coincide. As the name suggests, it is the ratio between the area of the intersection, or overlap, and the area of the union of the two boxes or polygons
- **Average Precision** - Area under the precision-recall curve
- **Mean Average Precision (mAP)** - Average over all classes of the Average Precision
- **Average Recall** - Twice the area under the recall vs. IoU curve, computed for IoU thresholds between 0.5 and 1
- **Mean Average Recall (mAR)** - Average over all classes of the Average Recall
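The first two metrics can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: boxes are given as `(x_min, y_min, x_max, y_max)`, the detections passed to `average_precision` are already sorted by decreasing confidence and already matched against the ground truth, and the area under the precision-recall curve is computed without the interpolation that standard benchmarks apply. Both function names are made up for the example.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle
    x_min = max(box_a[0], box_b[0])
    y_min = max(box_a[1], box_b[1])
    x_max = min(box_a[2], box_b[2])
    y_max = min(box_a[3], box_b[3])

    # Width and height are clamped at 0 when the boxes do not overlap
    intersection = max(0.0, x_max - x_min) * max(0.0, y_max - y_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def average_precision(is_true_positive, n_ground_truth):
    """Area under the precision-recall curve (no interpolation).

    `is_true_positive` lists, for detections sorted by decreasing
    confidence, whether each one matched a ground-truth box.
    """
    true_positives = 0
    previous_recall = 0.0
    area = 0.0
    for rank, hit in enumerate(is_true_positive, start=1):
        if hit:
            true_positives += 1
        precision = true_positives / rank
        recall = true_positives / n_ground_truth
        # Rectangle rule: precision times the recall gained at this rank
        area += precision * (recall - previous_recall)
        previous_recall = recall
    return area


# Two 2 x 2 boxes overlapping in a 1 x 1 square: intersection 1, union 7
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))

# Three detections for one class, two ground-truth objects
ap = average_precision([True, False, True], n_ground_truth=2)
```

A detection is usually counted as a true positive when its IoU with a ground-truth box exceeds a chosen threshold, which is how the two functions connect in practice. The mAP is then the mean of `average_precision` over all classes.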

## Semantic segmentation 

Semantic segmentation is one of the main forms of image segmentation.

The task of semantic segmentation consists of classifying each pixel of an image to determine to which class it belongs.
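As a toy illustration of per-pixel classification, the sketch below turns a set of per-pixel class scores into a mask by picking, for each pixel, the class with the highest score. The helper `scores_to_mask` and the 2 x 2 scores are hypothetical; a real model would produce the scores as a tensor.

```python
def scores_to_mask(scores):
    """Turn per-pixel class scores into a mask of class indices.

    `scores[c][y][x]` is the score of class `c` at pixel (x, y).
    """
    n_classes = len(scores)
    height = len(scores[0])
    width = len(scores[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            pixel_scores = [scores[c][y][x] for c in range(n_classes)]
            # The predicted class is the one with the highest score
            row.append(pixel_scores.index(max(pixel_scores)))
        mask.append(row)
    return mask


# Hypothetical scores for a 2 x 2 image and 2 classes (0 = background)
scores = [
    [[0.9, 0.2], [0.6, 0.1]],  # class 0
    [[0.1, 0.8], [0.4, 0.9]],  # class 1
]
mask = scores_to_mask(scores)  # [[0, 1], [0, 1]]
```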

### UNet architecture
The UNet is a specific architecture for semantic segmentation. It has the structure of a standard autoencoder, with an encoder that takes the input image and encodes it through a series of convolutional and pooling layers into a low-dimensional representation.

The decoder then starts from this representation and constructs the output mask using transposed convolutions. Unlike a standard autoencoder, however, the UNet adds skip connections between the feature maps at the same level in the encoder and in the decoder.

Example in PyTorch:
```python
import segmentation_models_pytorch as smp

# Binary segmentation?
binary = True
n_classes = 1

model = smp.Unet(
        encoder_name='resnet50',
        encoder_weights='imagenet',
        in_channels=3,
        # +1 is for the background
        classes=n_classes if binary else n_classes + 1)
```

### Dice loss
The Dice Loss function is often used for semantic segmentation. The Dice loss derives from the F1 score, which is the harmonic mean of precision and recall. Consequently, the Dice loss tends to balance precision and recall at the pixel level. A perfect Dice loss is 0, which corresponds to a Dice score of 1.

In PyTorch:
```python
import segmentation_models_pytorch as smp

loss = smp.losses.DiceLoss(smp.losses.BINARY_MODE, from_logits=True)
```
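To see what the Dice loss measures, here is a minimal sketch for binary masks in plain Python. It is not the `smp` implementation: the function `dice_loss`, the flat-list input format and the small `eps` term (added only to avoid a division by zero on empty masks) are assumptions made for the illustration.

```python
def dice_loss(predicted, target, eps=1e-7):
    """Dice loss for flat lists of per-pixel values in [0, 1]."""
    # Soft count of the pixels that are foreground in both masks
    intersection = sum(p * t for p, t in zip(predicted, target))
    total = sum(predicted) + sum(target)
    dice_score = (2.0 * intersection + eps) / (total + eps)
    return 1.0 - dice_score


target = [1, 1, 0, 0]
print(dice_loss([1, 1, 0, 0], target))  # perfect prediction: loss is 0
print(dice_loss([0, 0, 1, 1], target))  # no overlap: loss is close to 1
print(dice_loss([1, 0, 0, 0], target))  # half of the object found
```

In the last case precision is 1 and recall is 0.5, so the F1 score is 2/3 and the Dice loss is about 1/3, which shows the link between the two quantities.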

## Glossary

Object localization: The task of determining whether an image contains an object and localizing it with a bounding box.

Bounding box: A rectangular box that completely encloses a given object in an image, whose sides are parallel to the sides of the image.

Multi-head model: A CNN where we have one backbone but more than one head.

Object detection: The task of localizing every object of interest in an image with a bounding box.

Anchors: Windows with different sizes and different aspect ratios, placed in the center of cells defined by a grid on an image.

Feature Pyramid Network (FPN): An architecture that extracts multi-level, semantically-rich feature maps from an image.

Focal Loss: A modification of the Cross-Entropy (CE) Loss that includes a factor in front of the CE Loss to dampen the loss due to examples that are already well-classified, so that they do not dominate.

Mean Average Recall (mAR): A metric for object detection algorithms. It is obtained by computing the Average Recall for each class of objects, as twice the integral of the Recall vs IoU curve, and then by averaging the Average Recall for each class.

Mean Average Precision (mAP): A metric for object detection algorithms. It is obtained by computing the Average Precision (AP) for each class. The AP is computed by integrating an interpolation of the Precision-Recall curve. The mAP is the mean of the AP over the classes.

Intersection over Union (IoU): The ratio between the area of the intersection, or overlap, and the area of the union of two boxes or polygons. Used to measure how much two boxes coincide.

Semantic segmentation: The task of assigning a class to each pixel in an image.

Dice loss: A useful measure of loss for semantic segmentation derived from the F1 score, which is the harmonic mean of precision and recall. The Dice loss tends to balance precision and recall at the pixel level.