# Focal Loss for Dense Object Detection

## TL;DR
* Dense object detection suffers from extreme class imbalance: there are many candidate locations but only a few positives.
* Focal loss reduces the contribution of easy examples to the total loss, so training focuses on the hard examples.

<img src="figs/focal-loss-fig-1.png" width="40%">

## Introduction
* Object detectors are usually based on either a two-stage or a one-stage approach.
* Two-stage:
    * First stage generates candidate object locations.
    * Second stage classifies each candidate object location.
    * (So far) SoTA but more complex pipeline.
* One-stage:
    * Dense predictions.
    * (So far) Worse results but less complex pipeline.
* The authors identify class imbalance during training as the main cause of the worse results of one-stage detectors.

## Focal Loss
* A small modification of the standard cross-entropy (CE) loss that adds a dynamic scaling factor.
* Intuitively:
    * Scale down the loss of easy (well-classified) examples.
    * Narrow the range of predicted probabilities in which an example receives a high loss.

**Standard CE loss:**
$$
\begin{align*}
    CE(p, y) &= \left\{
        \begin{array}{ll}
            -\log(p) & \quad y = 1 \\
            -\log(1 - p) & \quad y = 0
        \end{array}
    \right. \\
\end{align*}
$$
* $p \in [0, 1]$ is the model's estimated probability for the class $y = 1$.

**Balanced CE loss:**
$$
\begin{align*}
    CE_b(p, y) &= \left\{
        \begin{array}{ll}
            -\alpha \log(p) & \quad y = 1 \\
            -(1 - \alpha) \log(1 - p) & \quad y = 0
        \end{array}
    \right. \\
\end{align*}
$$
* $\alpha \in [0, 1]$ weights the positive class. It can be set from the inverse class frequency or treated as a hyperparameter.
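As a rough illustration of why $\alpha$ alone may not be enough, the sketch below evaluates the standard and balanced CE losses on a toy batch. The batch composition (10 hard positives, 10,000 easy negatives) and the helper names `ce`, `balanced_ce`, and `total` are assumptions made for this example, not values or code from the paper.

```python
import math

def ce(p, y):
    """Standard cross entropy for one binary prediction p = P(y = 1)."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def balanced_ce(p, y, alpha):
    """Alpha-balanced CE: weight alpha for positives, (1 - alpha) for negatives."""
    weight = alpha if y == 1 else 1.0 - alpha
    return weight * ce(p, y)

def total(loss_fn, examples):
    """Sum a per-example loss over a list of (p, y) pairs."""
    return sum(loss_fn(p, y) for p, y in examples)

# Hypothetical batch (assumption): a few hard positives, many easy negatives.
positives = [(0.1, 1)] * 10       # true class predicted with probability 0.1
negatives = [(0.01, 0)] * 10_000  # true class predicted with probability 0.99

pos_ce, neg_ce = total(ce, positives), total(ce, negatives)

alpha = 0.75
pos_bal = total(lambda p, y: balanced_ce(p, y, alpha), positives)
neg_bal = total(lambda p, y: balanced_ce(p, y, alpha), negatives)

print(f"CE:          positives={pos_ce:.1f}  negatives={neg_ce:.1f}")
print(f"balanced CE: positives={pos_bal:.1f}  negatives={neg_bal:.1f}")
```

In this toy batch the easy negatives contribute roughly 100.5 against 23.0 for the positives under plain CE, and still about 25.1 against 17.3 with $\alpha = 0.75$. The class weight shifts the balance but does not distinguish easy from hard examples.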

**Focal loss:**
$$
\begin{align*}
    FL(p, y) &= \left\{
        \begin{array}{ll}
            -(1 - p)^\gamma \log(p) & \quad y = 1 \\
            -p^\gamma \log(1 - p) & \quad y = 0
        \end{array}
    \right. \\
\end{align*}
$$
* $\gamma \ge 0$ is the focusing parameter; $\gamma = 0$ is the same as the standard CE loss.
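The focal loss formula above can be sketched in a few lines of plain Python. This is a minimal per-example version for illustration; the function name `focal_loss` and the helper variable `p_t` (the probability assigned to the true class) are choices made here, and a real implementation would work on batched logits for numerical stability.

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Focal loss for one binary prediction p = P(y = 1)."""
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# gamma = 0 removes the modulating factor and gives standard CE.
easy_ce = focal_loss(0.9, 1, gamma=0.0)
hard_ce = focal_loss(0.1, 1, gamma=0.0)

# gamma = 2 scales each loss by (1 - p_t)^2.
easy_fl = focal_loss(0.9, 1, gamma=2.0)
hard_fl = focal_loss(0.1, 1, gamma=2.0)

print(f"easy example: FL / CE = {easy_fl / easy_ce:.4f}")
print(f"hard example: FL / CE = {hard_fl / hard_ce:.4f}")
```

With $\gamma = 2$, an easy example with $p = 0.9$ is scaled by $(1 - 0.9)^2 = 0.01$, while a hard example with $p = 0.1$ is scaled by $0.9^2 = 0.81$, so the easy example loses far more of its weight.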

**Balanced Focal loss:**
$$
\begin{align*}
    FL_b(p, y) &= \left\{
        \begin{array}{ll}
            -\alpha (1 - p)^\gamma \log(p) & \quad y = 1 \\
            -(1 - \alpha) p^\gamma \log(1 - p) & \quad y = 0
        \end{array}
    \right. \\
\end{align*}
$$
* $\gamma$ and $\alpha$ interact, so they should be picked together. Guideline: lower $\alpha$ for higher $\gamma$.
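To make the interaction between $\alpha$ and $\gamma$ more concrete, the sketch below computes the share of the total loss that comes from easy negatives in a toy batch. The batch (10 hard positives with $p = 0.1$, 10,000 easy negatives with $p = 0.01$) and the helper `negative_share` are assumptions for illustration only.

```python
import math

def balanced_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Alpha-balanced focal loss for one binary prediction p = P(y = 1)."""
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class weight of the true class
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def negative_share(alpha, gamma, n_pos=10, n_neg=10_000):
    """Fraction of the summed loss contributed by the easy negatives."""
    pos = n_pos * balanced_focal_loss(0.1, 1, alpha, gamma)   # hard positives
    neg = n_neg * balanced_focal_loss(0.01, 0, alpha, gamma)  # easy negatives
    return neg / (pos + neg)

for gamma in (0.0, 1.0, 2.0):
    share = negative_share(alpha=0.25, gamma=gamma)
    print(f"gamma={gamma}: negatives contribute {share:.1%} of the loss")
```

In this toy setting the negatives dominate the loss at $\gamma = 0$ and contribute well under 1% at $\gamma = 2$. This is one way to read the guideline: once $\gamma$ suppresses the easy negatives, the positives need less extra weight from $\alpha$.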

**Model initialization:**
$$
\begin{align*}
    b = -\log\left(\frac{1 - \pi}{\pi}\right)
\end{align*}
$$
* $b$ is the bias of the final conv layer of the classification head.
* $\pi$ is the probability the model should assign to the rare class at the start of training.
* With this initialization the model predicts the rare class with low probability (about $\pi$) at the start of training, which improves training stability.
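The bias formula is the inverse of the sigmoid, which a short check can confirm. The sketch assumes the classification logit at initialization is dominated by the bias (weights close to zero), and $\pi = 0.01$ is used here as an illustrative value. The helper names `prior_bias` and `sigmoid` are chosen for this example.

```python
import math

def prior_bias(pi):
    """Bias b = -log((1 - pi) / pi), chosen so that sigmoid(b) equals pi."""
    return -math.log((1.0 - pi) / pi)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

pi = 0.01  # illustrative prior for the rare (foreground) class; an assumption here
b = prior_bias(pi)

# Per-example CE loss of a negative at initialization, with and without the prior.
loss_with_prior = -math.log(1.0 - sigmoid(b))
loss_zero_bias = -math.log(1.0 - sigmoid(0.0))

print(f"b = {b:.3f}, sigmoid(b) = {sigmoid(b):.3f}")
print(f"initial loss per negative: {loss_with_prior:.4f} (prior) vs {loss_zero_bias:.4f} (b = 0)")
```

With a zero bias every location starts at $p = 0.5$, so each of the many negatives contributes $-\log(0.5) \approx 0.69$ to the loss. With the prior bias each negative starts near $-\log(0.99) \approx 0.01$, which keeps the summed loss from the negatives much smaller in the first iterations.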


## RetinaNet
* The backbone is a *Feature Pyramid Network* (FPN) built on top of ResNet (50/101).
* FPN outputs feature maps at different scales.
* Two heads per pyramid level (with parameters shared across levels): one for classification and one for bounding box regression.
* Focal loss is applied to the classification task only.

## Experiments
* The task is dense detection on the COCO dataset, image scale 600x600.
* The metric is average precision (AP).
* Standard CE loss + standard initialization causes the network to diverge.
* Standard CE loss + proposed initialization trains successfully and reaches 30.2 AP.
* Balanced CE loss ($\alpha = 0.75$) improves further (31.1 AP).
* Balanced focal loss ($\alpha = 0.25, \gamma = 2.0$) improves further (34.0 AP).