# Focal Loss for Dense Object Detection


These notes follow [Andrew Ng's](https://www.youtube.com/watch?v=733m6qBH-jI&list=PLoROMvodv4rOABXSygHTsbvUz4G_YQhOb&index=9&t=0s) suggested strategy of making multiple passes over a paper, namely:


1. Title, Abstract and Figures
2. Intro/Conclusion and Skim the rest (Skipping related work)
3. Read All Remaining Sections (however Skip the maths)
4. Read Everything (but skip the parts that don't make sense)

After some background reading (medium posts, blog posts, other papers, youtube videos from [cs231 2016](https://www.youtube.com/watch?v=GxZrEKZfW2o&t=3s) [cs231 2017](https://www.youtube.com/watch?v=nDPWywWRIRo&t=3909s)), I have made notes below following each pass.

## Title, Abstract and Figures

### Title
Focal Loss for Dense Object Detection
Comments: 

What is `Focal Loss`? A new type of loss function called focal loss is introduced in this paper.

What is `Dense Object Detection`? Dense object detection is looking for objects in the same image over many densely covered spatial positions, scales and aspect ratios. 

### Abstract

**Comments**

Object detection is the task of finding bounding boxes for zero or more instances of predefined classes in an image. The bounding box is a rectangle that locates and bounds the instance of a class. A solution which does this is called an object detector.

Here is an example:

![](object_detection.jpeg)

There are two classes of object detection algorithm, each state of the art on a different metric:

1. Dense object detection is state of the art on speed of prediction on test cases, at a respectable but not excellent level of accuracy on the average precision metric.

![](yolo.jpeg)

2. Sparse object detection is state of the art on accuracy, as measured by average precision (area under the precision-recall curve, taken over different classification thresholds).

![](selective_search.jpeg)

```
Aside: The dense object detection approach is for one stage learning models - those that do proposal, classification, and localization in one shot - such as OverFeat, SSD or YOLO - many (dense) blind possibly overlapping candidate boxes are made and fitted over a grid.

The sparse object detection approach is a two step approach - where first regions that may contain an object are proposed and then these boxes are put through a classifier - some examples being R-CNN, Fast R-CNN, Faster R-CNN and Mask R-CNN.
```

Two stage (sparse) detectors have produced the most accurate results. The abstract contrasts them with one stage (dense) detectors:

> In contrast, one-stage detectors that are applied
over a regular, dense sampling of possible object locations
have the potential to be faster and simpler, but have trailed
the accuracy of two-stage detectors thus far.

`Question: Simpler in what sense? Still very difficult to interpret or understand a sparse object detection model.`

The authors suggest that the paper identifies the reason for the poorer accuracy of dense (one stage) object detection. They claim that they are able to pinpoint the blame on *extremely unbalanced data* (no objects >> objects) occurring at training time. That is to say, dense detectors see many more examples of boxes with no object than examples of boxes with objects.

They have a way to address this imbalance. They notice that *no object* boxes are assigned a much lower cost $-\log \Pr(\text{no object} \mid \text{box})$ than the cost $-\log \Pr(\text{class } j \mid \text{box})$ assigned to boxes that contain an object of class $j$.

Effectively, they are noticing an artifact that the probabilities of no object are more peaked than they should be and the probabilities of the object classes are flatter than they should be. 

`Idea: Could what the authors do be translated into applying a regularization or prior to make the optimizer prefer models with flatter no-object probability and more peaky object probabilities. Could we explicitly set the Bayesian prior to achieve the same result`

The authors view their algorithm as having a *focal loss* which focuses (by having a higher cost) on harder examples (examples with objects present - those with lower probabilities) and having less focus on easy negatives (examples with no object and higher probability - the no object class). They believe this alleviates the problem of the detector being overwhelmed by the imbalanced data, so that learning can still take place. 

To see how well their loss does for dense object detectors, the authors present a network they dub RetinaNet. Their results show that with the *focal loss* RetinaNet is as fast as the popular one stage detectors, and yet has accuracy (as measured by average precision) as good as the state of the art two stage detectors.

### Figures

#### Figure 1

A graph showing how the proposed novel loss function - focal loss - has the desired properties of increasing cost for low probability values and reducing cost for high probability values. A hyperparameter $\gamma$ is shown over a range of values. 

Note that the proposed loss function is pinned at the extremes: the loss is 0 when the ground truth is predicted with probability 1, and it goes to infinity as the predicted probability of the ground truth goes to 0.

#### Figure 2

This is a plot of speed vs accuracy (Average Precision) for RetinaNet with ResNet-101 or ResNet-50 as the base feature extractor network, compared to other popular competitive algorithms. It shows that RetinaNet forms an upper envelope over the other methods and so is superior in speed and accuracy to the popular algorithms - with one exception. YOLOv2 is faster, but at a much lower accuracy than RetinaNet.

#### Figure 3
Outline of RetinaNet. The backbone of the network is a Feature Pyramid Network (FPN), which allows the network to work at various scales. The FPN receives inputs from a ConvNet feature extractor and is connected to two heads: a classifier for the anchor boxes, and a regressor that fine tunes the anchor boxes to get estimates for the bounding boxes (the ground truth boxes and the IoU metric are used for this). The authors claim that this design is simple enough to allow the focal loss to show its power.

```
Question: If the network as well as the loss function are being changed - then doesn't this make it hard to ensure the changes are orthogonal and not interfering and impacting the results?
```

#### Figure 4

This graph shows the q-q plot or CDF of normalized loss against the CDF of sample, for each of foreground(object) and background (no object) examples from the ResNet-101 predictions, at different values of gamma for Focal Loss (including Cross Entropy Loss).

This graph is used to identify that the loss on 


```
normalized loss: I guess this is the normalized cost function
CDF to CDF - p-p plot.
```

Actually, reading the section for this figure - I don't get this section well - there is the dataset, the model, some normalization of loss, splits into unbalanced foreground and background labels, quite a lot that needs to be teased out. I will come back to this on the third or fourth pass.


#### Figure 5

This graph and the following two are used to show that a similar function to the Focal Loss (FL), $FL^*$ (which depends on a $\beta$ as well as $\gamma$ parameter) is close to FL. 

```
There is no proper rigorous analysis, but empirical results and plots - is this good enough?
```


#### Figure 6

Figure 6 shows that the derivatives are similar in nature and magnitude for FL and $FL^*$.

#### Figure 7

This shows the effective variations of the parameters $\beta$ and $\gamma$ such that the loss function has desired behavior. It also shows less desirable loss functions in light black lines vs the better loss functions in heavier blue lines.


## Introduction and Conclusion

### 1. Introduction

The benchmark data set for researchers working on the object detection task is MS COCO (or COCO).

The reference metric is AP (Average Precision).

```
Need to find references and blog posts on AP
```
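As a starting point on AP, here is a minimal sketch of average precision for a single class, computed as the area under the precision-recall curve. It is an illustration only: the function name and inputs are hypothetical, the area is not interpolated, and the COCO benchmark uses its own interpolated variant averaged over several IoU thresholds.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Non-interpolated area under the precision-recall curve for one class.

    scores: confidence of each detection.
    is_true_positive: whether each detection matched a ground truth box.
    num_ground_truth: number of ground truth boxes for this class.
    """
    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_ground_truth
        # Add the rectangle under the curve for this step in recall.
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


# Hypothetical detections: the second most confident one is a false positive.
scores = [0.9, 0.8, 0.7, 0.6]
hits = [True, False, True, True]
ap = average_precision(scores, hits, num_ground_truth=3)
```

Only steps that raise recall (true positives) add area, so a false positive ranked high lowers the precision of every true positive below it.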

State Of The Art (SOTA) object detectors on the AP (average precision) metric are all from a family of sparse object detectors that rely on two stage (proposal then classification+localization) technologies. 

Although their performance is to be applauded (still not good enough for many production applications) these two stage detectors are quite slow - as measured in Frames Per Second, FPS.

A promising direction has been the development of one stage detectors, which run several times faster (7 FPS for the `Faster R-CNN` two stage detector vs 21 FPS for `YOLO Fast` using VGG-16 as a CNN feature extractor, a threefold speed-up). However, the accuracy of one stage detectors has lagged significantly (73.2 for `Faster R-CNN` vs 66.4 for `YOLO Fast` AP).

This paper introduces a few technologies that demonstrate a one stage detector with speed comparable to SOTA one stage detectors and AP comparable to SOTA two stage detectors - what they call an upper envelope of performance. The authors assert that the main driver of this improved performance is their new loss function, which combats extremely imbalanced foreground/background data in one stage detectors. (The authors believe this imbalance to be the main impediment to better one stage detector performance.)

Although two stage detectors also see class imbalance, it is less severe and can be remedied by:

- sampling heuristics (fixing the imbalance ratio by sampling)
- online hard example mining (OHEM is a technique applied inside a minibatch to only select *high loss function* examples from the minibatch forward pass and use them in the backward pass)
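The OHEM idea in the list above can be sketched as follows. This is a simplified illustration, not the published OHEM procedure: the helper name `ohem_select` and the example probabilities are hypothetical, and the loss is plain cross entropy on the probability of the true class.

```python
import math


def ohem_select(probs_true_class, k):
    """Return indices of the k examples with the highest loss in a minibatch."""
    # Forward pass: cross entropy loss per example.
    losses = [-math.log(p) for p in probs_true_class]
    # Rank examples from highest to lowest loss and keep the k hardest.
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical minibatch: probability assigned to the true class of each example.
p_t = [0.99, 0.20, 0.95, 0.60, 0.05, 0.98]
kept = ohem_select(p_t, k=2)  # only these examples would reach the backward pass
```

The easy examples (probabilities near 1) are dropped entirely, which is the key difference from a loss that merely down-weights them.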

Imbalance is less of an issue for two stage detectors because the first stage proposals are chosen as very likely to have foreground objects - most two stage detectors only present 1-2k proposal regions. 

One stage detectors need to deal with orders of magnitude more locations, which densely cover the image with many overlapping boxes of various aspect ratios. As a consequence imbalances of 1000 to 1 between background and foreground object classes are normal. At training time this imbalance dominates the loss function, enough that effective learning for the foreground object classes doesn't happen.

Strategies like sampling and OHEM can help but are not as efficient because the training algorithm still concentrates weight changing efforts on the easily classified background examples - there are so many more background examples due to the imbalance.

The typical treatment for this imbalance is to use:

- bootstrapping the dataset to create a distribution of parameters and averaging the resultant probabilities.

- hard example mining (produce many data augmented examples possibly via crops/reflections of mis-classifications from prior training -foregrounds predicted as backgrounds - so that algorithm becomes better at learning these mappings.)


The authors believe their novel loss function alleviates the imbalance problem - by down-weighting easy examples and up-weighting hard examples. The exact form of the loss function is not important - the authors show that similar functions work.
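To make the down-weighting concrete, here is a small numeric sketch. The counts and probabilities are made up for illustration (100,000 easy background boxes predicted with $p_t = 0.99$ and 100 hard foreground boxes with $p_t = 0.1$), and $\gamma = 2$ is assumed for the focal loss.

```python
import math


def cross_entropy(p_t):
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    # Cross entropy scaled by the modulating factor (1 - p_t) ** gamma.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Hypothetical, heavily imbalanced set of examples.
n_easy, p_easy = 100_000, 0.99   # easy background boxes
n_hard, p_hard = 100, 0.10       # hard foreground boxes


def easy_share(loss_fn):
    """Fraction of the total loss that comes from the easy examples."""
    easy = n_easy * loss_fn(p_easy)
    hard = n_hard * loss_fn(p_hard)
    return easy / (easy + hard)


ce_share = easy_share(cross_entropy)  # easy examples dominate the total
fl_share = easy_share(focal_loss)     # easy examples become negligible
```

With these made-up numbers the easy examples account for roughly 81% of the total cross entropy loss but well under 0.1% of the total focal loss.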

To show how good the focal loss function is, the authors produced a one stage learning algorithm with many state of the art innovations (FPN backbone, ResNet) - called RetinaNet. Its performance is 5 FPS with a very good AP of 39.1 - by comparison YOLOv2 achieves 21.2 AP, with undocumented speed.

### 6. Conclusion

The authors reiterate that the problem with one stage detector AP performance has been extreme background/foreground class imbalance. 

They present a modulating term applied to the cross entropy loss to create a new focal loss function. They believe this is a high quality solution as it results in a better detector. The authors are pleased with a new one stage detector network trained using this focal loss, one that is better on speed and accuracy metrics.


## All Remaining Sections (except the Mathematics)

### 2. Related Work

**Classic Object Detectors**

There have been some old school object detectors, starting around 2001, including:

**Viola-Jones** was the first set of algorithms to produce useful results for object detection - applied to face detection. Features are extracted essentially by taking rectangles and matching them to eyes, cheeks, nose and mouth. Then a boosting algorithm of the AdaBoost variety is applied to these features to detect faces. **LeCun's** seminal paper proposed sliding windows to detect characters. **Histogram of Oriented Gradients** and integral channel features produced good results for pedestrian detection - essentially finding edges in images, again powered by sliding windows. With the onset of Deep Learning, these fell behind in SOTA performance.

**Two Stage Detectors** have been suggested by some of the authors of this paper. They have pioneered innovations and improvements in this space. The main idea is to propose a smaller number of regions (relative to sliding windows), where the proposal step is done so that each region has a high probability of containing a foreground object class. Then a feature extractor CNN is used to get features, and a classifier and a box regressor identify and locate at most one object class in each region proposal.

The first iteration of the methodology was R-CNN, which used a fixed (not learned) method for region proposal: the selective search algorithm. A CNN feature extractor was run on each proposed region, then SVMs were used to classify objects and a regression head to get the bounding boxes. However, this was slow.

Later this was improved by the Fast R-CNN algorithm. This moves things around a little and replaces the SVM. An image is processed by a CNN feature extractor, then the region proposal takes place, then this is fed into a CNN which outputs two heads - a classifier using a fully connected layer and a bounding box regression, both of which take the CNN as input.

This was still too slow for many applications. So, the Faster R-CNN algorithm was proposed. With this algorithm, the region proposal step is also trained using a CNN, using an IoU criterion to train region finding - taking input from a CNN feature extractor network that is pre-trained. Then the rest follows as before for the Fast R-CNN algorithm.
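Since IoU (intersection over union) comes up repeatedly in these notes, here is a minimal sketch of how it is computed for two axis-aligned boxes. The `(x1, y1, x2, y2)` corner format is an assumption made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if there is one).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes have zero intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
```

A proposal or anchor is typically treated as foreground when its IoU with a ground truth box exceeds some threshold, and as background when it falls below another.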

There have been several extensions since then to the same basic idea - first classify/set regions likely to have foreground objects, then classify and regress bounding boxes on them.

**One Stage Detectors**

One stage detectors are all from the dense object detector family. They propose several orders of magnitude more regions - using anchors has been a key insight. Then they finesse the bounding boxes around these anchors and classify the object contained.  Among those that have been suggested are:

- OverFeat
- SSD
- YOLO

They all are highly performant on speed, but lag accuracy materially compared to two stage detectors.

RetinaNet is a one stage detector taken from this family of dense detectors using FPN for the scaling to boost performance. It achieves higher accuracy and maintains speed by having the novel loss function - focal loss. 

**Class Imbalance**

The one stage detectors suffer from extreme imbalance, with $\sim 10^5$ candidate regions per image, of which only a few contain objects. This poses two problems:

- overwhelming the cost function at training time with easy negative (background) classifications 
- inefficient estimation as most regions produce no signal


Hard negative mining has been the traditional solution. Other more complex re-sampling strategies can be used.

The central theme of this paper, focal loss is shown to be good at dealing with such imbalance.

**Robust Estimation**

Robust loss functions have been used to down-weight or reduce the influence of outliers - outliers being hard examples with large errors. This paper instead focuses on down-weighting or reducing the impact of easy examples (inliers) - even if they are the majority of the imbalanced data.

### 3. Focal Loss

Traditional cross entropy loss is given in equation 1. It is the average number of bits needed to encode the outcome of a random draw from the true distribution when using the model distribution. One can also think of it as a measure of how different the true distribution q is from the distribution being used p.

One hot data vectors are used as instances to compute the cross entropy for data against a model distribution. Then this averaged cross entropy function is the cost function which is usually run through an optimization algorithm (often from the gradient descent family of optimizers) to find the best parameters that minimize the cost.

By minimizing the cross entropy function, we can get a best fit. For classifiers this is equivalent to minimizing the negative log likelihood (NLL), that is, to maximum likelihood estimation (MLE).

#### 3.1 Balanced Cross Entropy

Although the above is a good strategy, the reality is that sometimes we come across imbalanced data in classification tasks, for example our current case of dense object detection. What this imbalance does is make it very difficult for gradient descent algorithms to find a good local optimum. The reason is that the class(es) with the highest frequency contribute so much more to the loss function that they dominate it, and the optimizer does not concentrate enough on the other examples.

Usually re-sampling strategies are run to artificially add more examples of the lower/least frequency classes. This can be done through:

- over-sampling the lower frequency classes
- under-sampling the higher frequency classes
- setting a weight hyper-parameter to increase the contribution of the lower frequency classes and/or reduce the contribution of the higher frequency classes.

Equation (3) lists the expression for the last of these strategies - in a compacted form.
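The weighting strategy can be sketched in a few lines for the binary case. This is a minimal illustration, assuming labels $y \in \{1, -1\}$, weight $\alpha$ for the class $y = 1$ and $1 - \alpha$ for the class $y = -1$; the function name is hypothetical.

```python
import math


def balanced_cross_entropy(p, y, alpha):
    """Alpha-weighted cross entropy for one example.

    p: model probability of the class y = 1.
    y: ground truth label, 1 or -1.
    alpha: weight for the class y = 1; the other class gets 1 - alpha.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * math.log(p_t)


# With alpha = 0.75 a foreground mistake costs three times a background one.
loss_fg = balanced_cross_entropy(p=0.5, y=1, alpha=0.75)
loss_bg = balanced_cross_entropy(p=0.5, y=-1, alpha=0.75)
```

Note that the weight depends only on the class label, not on how confident the prediction is, so easy and hard examples of the same class are scaled identically.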

#### 3.2 Focal Loss Definition

The tunable focal loss equation is introduced in equations 4 (one parameter) and 5 (two parameter). Somewhat strange, but at training time an approximation to this is used.

The authors outline how the focal loss has the desired property of reducing the loss contribution of well classified examples (such as easy background boxes predicted with high probability), so that relatively more weight falls on the harder, misclassified examples.

#### 3.3 Class Imbalance and Model Initialization

With extreme class imbalance of around $10^3-10^6$ to 1, the usual parameter initialization strategy for binary classification leads to unstable training early on for focal loss. The authors suggest instead giving initial conditions that correspond to 0.01 probability for the foreground (lower frequency) class. They even call this a prior - which it doesn't feel like in a Bayesian sense - but given that neural networks can have many local optima - perhaps it is like a Bayesian prior, in that it impacts the parameter choices and final probability estimates.

#### 3.4 Class Imbalance and Two Stage Detectors

Class imbalance is less of an issue for two stage detectors because of the proposal layer, that filters for likely regions. There are still 1000-2000 proposal locations usually. Of these, again many will not have a foreground object. However, the regions are proposed to be much more likely to have a foreground object. This helps reduce the class imbalance. Secondly at training time, the minibatches are constructed to have reasonable ratios of classes - say 1:3, effectively re-sampling within the minibatches.

The authors note this is like re-weighting the terms of the loss function with an $\alpha$ weight hyperparameter.

By contrast the focal loss function works at a different level of abstraction.

```
It would have been interesting to see how focal loss does with two stage detectors.
```

### 4. RetinaNet Detector

**Feature Pyramid Network**
Given a single image input resolution, it is normal that some objects are at very different scales to others. In that case, performance can be improved by having classification go through a range of scales. A Feature Pyramid Network is a collection of layers that achieves this. It works as follows:

- lateral connections are made from the convolutional layers to each scale layer of the pyramid.

- top down connections feed into the scale layers through the convolutions.

- the scale layers connect to their own classification and box sub networks. 

These then feed into the final set of classification and bounding box predictions.

Scale layer $P_l$ has a resolution $2^l$ times lower than the input.
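As a quick illustration of the $2^l$ factor, the sketch below lists the feature map size at each pyramid level. The paper uses levels $P_3$ to $P_7$; the 640 x 640 input size and the rounding up are assumptions made for this example.

```python
import math


def pyramid_sizes(height, width, levels=range(3, 8)):
    """Feature map (height, width) at each level P_l, with stride 2 ** l."""
    return {
        level: (math.ceil(height / 2 ** level), math.ceil(width / 2 ** level))
        for level in levels
    }


# Hypothetical 640 x 640 input image.
sizes = pyramid_sizes(640, 640)
for level, (h, w) in sizes.items():
    print(f"P{level}: stride {2 ** level}, feature map {h} x {w}")
```

Coarser levels have far fewer spatial positions, so most of the dense candidate locations come from the finest levels.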

**Anchors**

Default boxes, or anchors, are spaced out across the image. Each anchor is assigned a K-dimensional one hot vector of classification targets and a 4-vector of bounding box regression targets. These anchors are presented at the various scales corresponding to the FPN levels.
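A rough sketch of the anchor shapes at one pyramid level follows. The settings are the ones I understand the paper to use (base areas $32^2$ to $512^2$ on $P_3$ to $P_7$, three aspect ratios and three scales, giving 9 anchors per location); the function name and the height/width convention for the ratio are assumptions of this example.

```python
import math


def anchor_shapes(base_size, ratios=(0.5, 1.0, 2.0),
                  scales=(2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3))):
    """Return (width, height) pairs for the anchors at one location.

    ratio is height / width; each anchor keeps area (base_size * scale) ** 2.
    """
    shapes = []
    for scale in scales:
        side = base_size * scale
        for ratio in ratios:
            width = side / math.sqrt(ratio)
            height = side * math.sqrt(ratio)
            shapes.append((width, height))
    return shapes


# Base sizes 32, 64, ..., 512 for levels P3 to P7.
base_sizes = {level: 32 * 2 ** (level - 3) for level in range(3, 8)}
p3_anchors = anchor_shapes(base_sizes[3])
```

These shapes are then centred on every spatial position of the level's feature map, which is where the very large number of candidate boxes comes from.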

**Classification Subnets**

This head predicts the classification probability for each of the classes for each anchor box.

**Box Regression Subnet**

This finesses the anchor box to get the bounding box for the object in the anchor box.

#### 4.1 Inference and Training

**Inference**

This is a FCN with two heads per bounding box, a classification head and a bounding box regression head. The architecture includes the backbone FPN and a CNN extractor.

To improve speed at inference, box predictions are decoded only from the top-scoring predictions (at most 1000 per FPN level) after thresholding detector confidence at 0.05. The top predictions from all levels are merged, and non-max suppression with a threshold of 0.5 is applied to keep only the most promising predictions.
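Non-max suppression itself can be sketched as a greedy loop. This is a minimal, unoptimized illustration, assuming boxes in `(x1, y1, x2, y2)` format and an IoU threshold of 0.5; the helper names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression; returns indices of the boxes kept."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: the first two overlap heavily, the third is separate.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In practice this would be run per class, so that overlapping boxes of different classes do not suppress each other.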

**Focal Loss**

The authors found in their hyperparameter tuning that focal loss with $\gamma = 2$, $\alpha = 0.25$, worked best. They suggest reducing $\alpha$ as $\gamma$ is increased.

**Initialization**

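The hidden layers were initialized with the usual bias of 0 and Gaussian random weights with a small standard deviation ($\sigma=0.01$).

For the final layer of the classification head, the bias parameter is set to the logit of 0.01, $b = -\log((1-\pi)/\pi)$ with $\pi = 0.01$, so that the initial predicted foreground probability is about 0.01.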
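A small sketch of this bias initialization: the bias $b$ is chosen so that the sigmoid of $b$ equals the prior probability $\pi = 0.01$, which gives $b = -\log((1-\pi)/\pi)$. The function names below are illustrative only.

```python
import math


def prior_bias(pi):
    """Bias b such that sigmoid(b) == pi."""
    return -math.log((1.0 - pi) / pi)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


b = prior_bias(0.01)       # equals -log(99), a fairly negative bias
initial_prob = sigmoid(b)  # recovers the prior of 0.01
```

With near-zero weights in the final layer, every anchor therefore starts out predicted as foreground with probability close to 0.01, so the huge number of background anchors contributes only a small loss at the first iterations.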

**Optimization**

Minibatch (16 images) stochastic gradient descent using GPUs is used to train RetinaNet for 90000 iterations with a decaying learning rate. Focal loss is used for the classification part and smooth L1 loss for the bounding box regression. Training took up to around 35 hours.


### 5. Experiments

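Training is done on the COCO `trainval35k` split (the 80k `train` images plus a random 35k subset of the 40k `val` images); the remaining 5k `val` images form `minival`, which is used for the ablation studies. The main results are reported on the `test-dev` split, which has no public labels, so predictions are scored using an evaluation server.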

#### 5.1 Training Dense Detection

**Network Initialization**
If the usual initialization techniques are used, the network diverges. However, the strategy outlined earlier, using a prior probability of 0.01 and its logit as the bias parameter of the final layer of the network, makes training converge to good AP values. This is all for the cross entropy loss.

**Balanced Cross Entropy**

Weighted or balanced cross entropy loss with $\alpha=0.75$ gives the best result, a gain of 0.9 AP points over the unweighted loss.

**Focal Loss**
The hyperparameter search worked as explained above. The table shows a range of values used.

**Analysis of the Focal Loss**

This section is important as it shows that, under FL, all but the most extreme (hardest) negative background examples contribute almost nothing to the loss over the dataset. That is, the majority of easy background classifications do not contribute very much to the loss function.

I feel some other kind of chart could be used to indicate this - maybe plotting the cumulative loss against the probability estimate for the foreground and background classes.

**Online Hard Example Mining**

OHEM discards the easy examples and has the loss function optimized on the hard examples that were misclassified - this happens inside the gradient descent algorithm where examples are included or discarded. Note that focal loss still keeps the easy examples, but reduces their effect on the loss function. 

The authors note that the AP was higher (better) for focal loss than for OHEM by 3.2 points (36.0 vs 32.8 for the best OHEM setting).

**Hinge Loss**

Lin et al. tried to train with hinge loss - setting the loss to zero after a certain threshold - but it didn't work, producing unstable results that didn't converge.

#### 5.2 Model Architecture Design

**Anchor Density**
How high a density to have for the anchors, and whether to have several tall and wide boxes at the same anchor locations, is not an exact science, but it impacts performance a great deal for one stage detectors. The authors find that beyond a certain density, increasing the number and spread of anchors doesn't help performance.

**Speed Vs Accuracy**

Faster networks need smaller backbone networks. The speed accuracy tradeoff is shown in the figures. It shows an upper envelope of better accuracy and faster speed compared to most models that are at the SOTA.


#### 5.3 Comparison to State of the Art

The authors reiterate their faster and more accurate model over a range of backbone networks. They show exact numbers.


## Everything Else

### Mathematics

#### Equation 1

Under this binary regime either $y=1$ or $y=-1$ are the two classes.

$$
\begin{equation}
  CE(p,y) = \left \{
  \begin{aligned}
    &-\log(p), && \text{if}\ y=1 \\
    &-\log(1-p), && \text{if}\ y=-1 
  \end{aligned} \right.
\end{equation} 
$$

This is the equation for cross entropy loss for a binary classifier - it is the average number of bits needed under assumed probability distribution $p$ to identify an event drawn using the true distribution of $y$.

This is the natural loss function for binary classification and nicely links to MLE estimation/NLL cost function.

#### Equation 2a

$$
\begin{equation}
  p_t = \left \{
  \begin{aligned}
    & p, && \text{if}\ y=1 \\
    & 1-p, && \text{if}\ y=-1 
  \end{aligned} \right.
\end{equation} 
$$

This equation is a little odd because there is no index or variable $t$ being referenced by the subscript. It might have been better to replace that with something like $\tilde{p} = \tilde{p}(p,y)$, to keep a relation with $p$ and indicate that it is a transform of $p$ and $y$.

One way that it makes sense to keep the subscript t notation is if we were looking at multiclass classification. In that case,

$$
\begin{equation}
  p_t = \left \{
  \begin{aligned}
    & p(y_t), && \text{if}\ y=y_t \\
    & 1-p(y_t), && \text{if}\ y \neq y_t 
  \end{aligned} \right.
\end{equation} 
$$

However, even then this is messy notation.

#### Equation 2b

From the equation above we can rewrite $CE(p, y) = CE(p_t) = -\log(p_t)$, keeping the same odd subscript $t$ notation.

#### Equation 3

$CE(p_t) = -\alpha_t log(p_t)$

This is an equation made to put weight ($\alpha_t \in [0,1]$) to increase or decrease the cost of different outcomes. It's making adjustments (unnatural from some perspectives) to the cross entropy loss function.

This is a well known practice for dealing with imbalanced data.

In fact, note that this can be expressed as:

$CE(p_t) = - log(p_t^{\alpha_t})$

Now if we have $y_0 = 1$ and $y_1 = -1$ (so $t=0$ or $t=1$) for the binary case, then in order to be consistent probability distributions in this representation we need:

$(1-p_0)^{\alpha_1} + p_0^{\alpha_0} = 1$

This follows because $p_1 = 1 - p_0$.

We can read this as saying that $\alpha_0$ is free to be any value in [0,1], but that the following must hold for probability consistency:

${\alpha_1} = \frac{\log(1 - p_0^{\alpha_0})}{\log(1-p_0)}$

With a similar constraint existing if we extend to multiclass classification. 

The effect is to increase the probability $p_0$ and decrease the probability $p_1$ with a functional form that does not depend on the weights or features of the network. If we assigned foreground images to the label $y_0=1$, then it makes sense to upweight the class with fewer examples.

If this modification was acting on the weights, then it would be called a regularizer, or a flat Bayesian prior. But since it is at the level of the loss function we don't call it that but the effect is similar.


Of course one downside is that we need to use hyperparameter tuning to find the best value of the loss function weight parameter.

#### Equation 4

The central topic of this paper is the tunable focal loss, given by the equation:

$FL(p_t) = −(1 − p_t)^{\gamma} log(p_t)$

for $\gamma \geq 0$. 

The authors find that in practice the exact choice of $\gamma$ does not affect accuracy much: they report that RetinaNet is relatively robust to $\gamma \in [0.5, 5]$, with $\gamma = 2$ working best in their experiments.
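A minimal sketch of equation 4 for a single example, written directly in terms of $p_t$. It checks two properties: $\gamma = 0$ recovers the cross entropy loss, and with $\gamma = 2$ an example with $p_t = 0.9$ has its loss reduced by a factor of about 100. The function name is hypothetical.

```python
import math


def focal_loss(p_t, gamma):
    """FL(p_t) = -(1 - p_t) ** gamma * log(p_t) for one example."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


p_t = 0.9
ce = focal_loss(p_t, gamma=0.0)  # gamma = 0 is plain cross entropy
fl = focal_loss(p_t, gamma=2.0)
reduction = ce / fl              # 1 / (1 - 0.9) ** 2, about 100

# A hard example is barely down-weighted: (1 - 0.1) ** 2 = 0.81.
hard_factor = focal_loss(0.1, gamma=2.0) / focal_loss(0.1, gamma=0.0)
```

The modulating factor depends on how well each example is classified, which is what separates it from a fixed per-class weight.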

Let's do our analysis of this function via Taylor series. Using $\log(x) \approx (x-1) - (1/2)(x-1)^2$ for $x$ close to 1, we get:

$FL(p_t) \approx -(1 - p_t)^{\gamma} ((p_t-1) - (1/2)(p_t-1)^2)$

$ = -(1 - p_t)^{\gamma} ((p_t-1) - (1/2)(p_t^2 - 2p_t + 1))$

$ = -(1 - p_t)^{\gamma} (1/2)(p_t-1)(3 - p_t)$

$ = (1/2)(1 - p_t)^{\gamma + 1}(3 - p_t)$

There is no real insight to be had from this expansion, but it might have been useful.


This $FL$ function is still pinned at 0 for a perfect fit and goes to infinity for the worst fit of the estimated $p_t$ to the one hot vector of the actual observation $y$. What it does, however, is scale the cross entropy loss by $(1-p_t)^{\gamma} \leq 1$: the loss for good estimates ($p_t$ close to 1) is reduced a lot, while the loss for bad estimates ($p_t$ close to 0) is left almost unchanged, so bad estimates carry relatively more weight.

In effect, the focal loss is applying a modulating factor to the cross entropy loss. This might be seen as a kind of flat Bayesian prior, or regularizer to encourage the kind of fit that does well for imbalanced data. Only this flattening is not at the level of the weights, but higher up in the model, acting at the loss function level.

```
There is something not satisfying that the true probabilities cannot be estimated well due to imbalanced data, so much so that we need to resort to modifying the cross entropy loss function using heuristics. 

The hope would be to have methods which deal with imbalance at a lower level, at the parameter or architecture level. The reason I believe this to be important is that the cross entropy function is well understood with connections to maximum likelihood estimation. To introduce extra parameters - some of which might be better pushed down the network parameter estimation level feels wrong.

However, since focal loss produces state of the art results we should applaud its contribution to the field. It also is specifically used for data imbalance - where other strategies have failed to get solutions efficiently. Also, it is not post processing the estimated probabilities - but directly optimizing them, albeit in a novel way.

Perhaps it is time to consider other families of loss functions.
```
#### Equation 5

In practice the authors suggest that a weight factor applied to the modulated loss function produces good results. So one could express focal loss as:

$FL(p_t) = -\alpha_t (1 − p_t)^{\gamma} log(p_t)$

Now with two tunable parameters.

```
Let's see what happens if we try to express the probability model implied by this loss - for the binary case:
```
Say, we set:

$$ \phi_t = -\alpha_t (1 − p_t)^{\gamma} log(p_t)$$

Then:

$$ exp(- \frac{\phi_t}{\alpha_t} (1 − p_t)^{-\gamma}) = p_t$$

replacing the ratio $\frac{\phi_t}{\alpha_t} = \Gamma_t$

$$ exp( - \Gamma_t (1 − p_t)^{-\gamma}) = p_t $$

```
So as we are optimizing the loss function, we are restricting the solution to be on a level set with the form above. I wonder if constrained optimization could be used, maybe not relevant
```

#### Equation 6

This and the following two equations are in the appendix and are supposed to show that there is an easier to optimize function that is very similar to the focal loss described earlier.
 
$x_t = yx$

It really isn't clear what x is supposed to represent. It may just be that for any scalar $x$, we denote by $x_t$ the product of $y$ and $x$. We are told that $y \in \{-1,1\}$ is the ground truth label. I initially thought $x$ was a reference to the features $x$, but that doesn't seem to be the case. We are also told that 

$p_t = \sigma(x_t)$

is compatible with equation 2. There isn't an explicit definition of $\sigma(\cdot)$. I had thought that it might be the sigmoid function. However matching up the earlier definition from equation 2

$$
\begin{equation}
  p_t = \left \{
  \begin{aligned}
    & p, && \text{if}\ y=1 \\
    & 1-p, && \text{if}\ y=-1 
  \end{aligned} \right.
\end{equation} 
$$

it's hard to tell. So, let's investigate:

$p_t = \sigma(x_t) = \sigma(yx)$ is to be $p$ if $y=1$ and $1-p$ if $y=-1$.

Again, it's difficult to be sure. So what we'll do is look further ahead at equation 9. From this, we can see that the derivative works out as stated only if $\sigma(\cdot)$ is the sigmoid function. It would have been nice if the authors had made this explicit.


#### Equation 7

$p_t^{∗} = \sigma(\gamma x_t + \beta)$

#### Equation 8

$FL^{*} = -\frac{\log(p_t^{*})}{\gamma}$

The authors assert that $FL^*$ for the settings of $\gamma=2$ and $\beta=1$ is very close to FL and computationally easier. They show graphs where the two functions are similar within a good range.
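To get a feel for how close the two functions are, here is a small sketch that evaluates both as functions of $x_t$, assuming $\sigma(\cdot)$ is the sigmoid function, $\gamma = 2$ for both losses and $\beta = 1$ for $FL^*$. The function names and the grid of $x_t$ values are choices made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def focal_loss(x_t, gamma=2.0):
    """FL as a function of x_t, with p_t = sigmoid(x_t)."""
    p_t = sigmoid(x_t)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def focal_loss_star(x_t, gamma=2.0, beta=1.0):
    """FL* = -log(sigmoid(gamma * x_t + beta)) / gamma."""
    # -log(sigmoid(z)) is written as log(1 + exp(-z)).
    return math.log1p(math.exp(-(gamma * x_t + beta))) / gamma


for x_t in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x_t={x_t:+.1f}  FL={focal_loss(x_t):.4f}  FL*={focal_loss_star(x_t):.4f}")

gap_at_zero = abs(focal_loss(0.0) - focal_loss_star(0.0))
```

Both losses decrease as $x_t$ grows, and near $x_t = 0$ they differ by less than 0.02 under these settings.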

```
Question: As neural networks are excellent function approximators, I'm inclined to wonder why they couldn't just approximate these specific transformations. Perhaps the extra help is needed because of the overwhelming imbalance. 
```

#### Equation 9

$\frac{dCE}{dx} = y(p_t − 1)$

This equation only works if $\sigma(\cdot)$ is the sigmoid function. We'll do the arithmetic here:

$CE = -log(p_t)  = -log(\sigma(x_t)) = -log(\sigma(yx))$

$\frac{dCE}{dx}  = -[\sigma(yx)]^{-1} * \frac{d[\sigma(yx)]}{dx}$

$ = -[\sigma(yx)]^{-1} * \sigma(yx) * [1- \sigma(yx)] y$

$= y(p_t − 1)$



The reason for giving this and the following two equations seems to be to allow parameter updates via gradient descent when training. In addition, they serve to demonstrate the closeness of $FL$ to $FL^*$, particularly with reference to the figures with the chosen values of the tuning parameters.

#### Equation 10

$\frac{dFL}{dx} = y(1 - p_t)^\gamma (\gamma p_t \log(p_t) + p_t - 1)$

#### Equation 11

$\frac{dFL^{∗}}{dx} = y(p_t^{*} − 1)$
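The three derivative formulas can be checked numerically with central finite differences. The sketch below assumes $\sigma(\cdot)$ is the sigmoid function and uses arbitrary test values ($x = 0.3$, $y = -1$, $\gamma = 2$, $\beta = 1$); all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ce(x, y):
    return -math.log(sigmoid(y * x))


def fl(x, y, gamma):
    p_t = sigmoid(y * x)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def fl_star(x, y, gamma, beta):
    return -math.log(sigmoid(gamma * y * x + beta)) / gamma


# Closed-form derivatives from equations 9, 10 and 11.
def d_ce(x, y):
    return y * (sigmoid(y * x) - 1.0)


def d_fl(x, y, gamma):
    p_t = sigmoid(y * x)
    return y * (1.0 - p_t) ** gamma * (gamma * p_t * math.log(p_t) + p_t - 1.0)


def d_fl_star(x, y, gamma, beta):
    return y * (sigmoid(gamma * y * x + beta) - 1.0)


def numeric(f, x, h=1e-6):
    """Central finite difference approximation of df/dx."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x, y, gamma, beta = 0.3, -1, 2.0, 1.0
errors = [
    abs(numeric(lambda v: ce(v, y), x) - d_ce(x, y)),
    abs(numeric(lambda v: fl(v, y, gamma), x) - d_fl(x, y, gamma)),
    abs(numeric(lambda v: fl_star(v, y, gamma, beta), x) - d_fl_star(x, y, gamma, beta)),
]
max_error = max(errors)
```

A very small `max_error` supports both the formulas and the reading of $\sigma(\cdot)$ as the sigmoid function.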


### Tables
#### Table 1
#### Table 2
#### Table 3

### Appendix
#### Appendix A
#### Appendix B
