The properness of cross entropy
===========================

Definition of a proper loss function, taken from the first paragraph of Section 3 (page 3) of Cid-Sueiro (2012):

A loss function $l(\mathbf{y}, \mathbf{q})$ is proper if the true probability vector $\mathbf{p}$ is a member of $\arg \min_\mathbf{q} \mathbb{E}_{\mathbf{y} \sim \mathbf{p}}[l(\mathbf{y}, \mathbf{q})]$, where the expectation is taken over labels $\mathbf{y}$ drawn from $\mathbf{p}$. It is strictly proper if $\mathbf{p}$ is the only member of that set.

Cid-Sueiro, Jesús. 2012. “Proper Losses for Learning from Partial Labels.” Pp. 1574–82 in Advances in Neural Information Processing Systems 25.
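
Why cross entropy (CE) satisfies this definition can be sketched with a short derivation. The symbols $H(\mathbf{p})$ (entropy) and $D_{KL}(\mathbf{p} \,\|\, \mathbf{q})$ (Kullback-Leibler divergence) are notation introduced for this sketch only, and $0\log(0)$ is taken to be $0$.

$$\mathbb{E}_{\mathbf{y} \sim \mathbf{p}}[CE(\mathbf{y}, \mathbf{q})] = -\sum_i p_i \log(q_i) = \underbrace{-\sum_i p_i \log(p_i)}_{H(\mathbf{p})} + \underbrace{\sum_i p_i \log\left(\frac{p_i}{q_i}\right)}_{D_{KL}(\mathbf{p} \,\|\, \mathbf{q})}$$

The first term does not depend on $\mathbf{q}$, and by Gibbs' inequality the second term is non-negative and equals zero only when $\mathbf{q} = \mathbf{p}$. The expectation is therefore minimized only at $\mathbf{q} = \mathbf{p}$, which is what strict properness requires.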

Example 1: True probability $\mathbf{p} = [0.8, 0.2, 0.0]$
---------------------------------------------------

Imagine that our model always predicts the probability $\mathbf{q} = [0.8, 0.2, 0.0]$.

We can sample once from the population and obtain a label such as $\mathbf{y}=[1, 0, 0]$. Using the convention $0\log(0) = 0$, the CE on this sample is

$$CE([1, 0, 0], [0.8, 0.2, 0.0]) = -1\log(0.8) - 0\log(0.2) - 0\log(0.0) = -\log(0.8)$$

However, if we keep sampling from the true distribution $\mathbf{p} = [0.8, 0.2, 0.0]$, 20\% of the time we obtain the label $\mathbf{y}=[0, 1, 0]$ instead, for which

$$CE([0, 1, 0], [0.8, 0.2, 0.0]) = -0\log(0.8) - 1\log(0.2) - 0\log(0.0) = -\log(0.2)$$

The definition of a proper loss does not consider an individual sample; it concerns the minimization of the expected loss over the full population. In the previous case, over the entire population, we should expect

- that 80\% of the time $CE([1, 0, 0], [0.8, 0.2, 0.0]) = -\log(0.8)$
- that 20\% of the time $CE([0, 1, 0], [0.8, 0.2, 0.0]) = -\log(0.2)$
- that 0\% of the time $CE([0, 0, 1], [0.8, 0.2, 0.0]) = -\log(0.0)$

This corresponds to the following expectation, in which the third outcome never occurs and so contributes nothing:

$$\mathbb{E}_{\mathbf{y} \sim \mathbf{p}}[CE(\mathbf{y}, [0.8, 0.2, 0.0])] = -(0.8\log(0.8) + 0.2\log(0.2))$$

In [1]:
import numpy as np

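def expected_ce(p, q):
    """Expected cross entropy of prediction q under true distribution p."""
    # Optional: clip q away from zero and renormalise to avoid log(0).
    #q = np.clip(q, np.finfo(float).eps, 1)
    #q /= q.sum()
    # Classes with p = 0 are never sampled, so they are left out of the sum.
    non_zero_p = p != 0
    return -p[non_zero_p] @ np.log(q[non_zero_p])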

p = np.array([0.8, 0.2, 0.0])
q = np.array([0.8, 0.2, 0.0])

print('Expected CE = {}'.format(expected_ce(p, q)))

Expected CE = 0.5004024235381879
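
The 80\%/20\% reasoning above can also be checked by simulation. The following is a minimal sketch that is not part of the original notebook: the function name `empirical_ce`, the sample size and the seed are arbitrary choices made for illustration.

```python
import numpy as np

def empirical_ce(p, q, n_samples=100_000, seed=0):
    """Average CE over labels sampled from the true distribution p."""
    rng = np.random.default_rng(seed)
    # Draw the index of the observed class for every sample.
    classes = rng.choice(len(p), size=n_samples, p=p)
    # For a one-hot label, CE reduces to -log of the predicted
    # probability of the observed class.
    return -np.log(q[classes]).mean()

p = np.array([0.8, 0.2, 0.0])
q = np.array([0.8, 0.2, 0.0])

print('Empirical CE = {}'.format(empirical_ce(p, q)))
```

With enough samples, the empirical average should approach the expected CE computed analytically, because the class with $p_i = 0$ is never drawn.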


In contrast, if the model predicts $\mathbf{q} = [0.8, 0.1, 0.1]$ while the true distribution is still $\mathbf{p} = [0.8, 0.2, 0.0]$, we should expect

- that 80\% of the time CE = $-\log(0.8)$
- that 20\% of the time CE = $-\log(0.1)$
- that 0\% of the time CE = $-\log(0.1)$

This corresponds to the following expectation:

$$\mathbb{E}_{\mathbf{y} \sim \mathbf{p}}[CE(\mathbf{y}, [0.8, 0.1, 0.1])] = -(0.8\log(0.8) + 0.2\log(0.1))$$

In [2]:
p = np.array([0.8, 0.2, 0.0])
q = np.array([0.8, 0.1, 0.1])

print('Expected CE = {}'.format(expected_ce(p, q)))

Expected CE = 0.639031859650177
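
The two cells above compare only two candidate predictions. As a hedged illustration of properness, the sketch below searches a coarse grid over the probability simplex for the prediction with the lowest expected CE. The helper `simplex_grid` and the grid resolution are assumptions made for this example, and `expected_ce` is redefined here (with the `log(0)` warning silenced) so that the block is self-contained.

```python
import itertools
import numpy as np

def expected_ce(p, q):
    """Expected cross entropy, skipping classes that are never sampled."""
    non_zero_p = p != 0
    with np.errstate(divide='ignore'):  # log(0) = -inf is expected here
        return -p[non_zero_p] @ np.log(q[non_zero_p])

def simplex_grid(n_classes, steps):
    """Yield all probability vectors whose entries are multiples of 1/steps."""
    for counts in itertools.product(range(steps + 1), repeat=n_classes):
        if sum(counts) == steps:
            yield np.array(counts) / steps

p = np.array([0.8, 0.2, 0.0])
# Pick the grid point with the smallest expected CE under p.
best_q = min(simplex_grid(len(p), steps=10), key=lambda q: expected_ce(p, q))

print('Minimiser on the grid = {}'.format(best_q))
```

Because $\mathbf{p}$ lies on this grid, the search should return $\mathbf{p}$ itself. A grid search only illustrates the property for one choice of $\mathbf{p}$; it is not a proof.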


Example 2: True probability $\mathbf{p} = [0.8, 0.1, 0.1]$
---------------------------------------------------

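If the true probability distribution is $\mathbf{p} = [0.8, 0.1, 0.1]$ and our model predicts $\mathbf{q} = [0.8, 0.2, 0.0]$ we should expect

- that 80\% of the time CE = $-\log(0.8)$
- that 10\% of the time CE = $-\log(0.2)$
- that 10\% of the time CE = $-\log(0.0) = \infty$

The third class is observed 10\% of the time but is assigned zero probability, so the expected CE is infinite. Predicting $\mathbf{q} = \mathbf{p} = [0.8, 0.1, 0.1]$ instead gives a finite expected CE. The next two cells compute both cases, starting with $\mathbf{q} = \mathbf{p}$.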

In [3]:
p = np.array([0.8, 0.1, 0.1])
q = np.array([0.8, 0.1, 0.1])

print('Expected CE = {}'.format(expected_ce(p, q)))

Expected CE = 0.639031859650177


In [4]:
p = np.array([0.8, 0.1, 0.1])
q = np.array([0.8, 0.2, 0.0])

print('Expected CE = {}'.format(expected_ce(p, q)))

Expected CE = inf
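
An infinite loss is awkward to work with numerically. One common workaround, shown here as an assumption rather than as part of the original analysis, is to clip the prediction away from zero and renormalise it before taking the logarithm. The name `expected_ce_clipped` and the default `eps` (machine epsilon) are illustrative choices.

```python
import numpy as np

def expected_ce_clipped(p, q, eps=np.finfo(float).eps):
    """Expected CE after clipping q to [eps, 1] and renormalising it."""
    q = np.clip(q, eps, 1)   # replace zeros by a tiny positive value
    q = q / q.sum()          # make the clipped vector sum to one again
    non_zero_p = p != 0
    return -p[non_zero_p] @ np.log(q[non_zero_p])

p = np.array([0.8, 0.1, 0.1])
q = np.array([0.8, 0.2, 0.0])

print('Clipped expected CE = {}'.format(expected_ce_clipped(p, q)))
```

The result is finite but depends on the chosen `eps`, so it should be read as a large penalty rather than as the true expected CE. It remains larger than the value obtained with $\mathbf{q} = \mathbf{p}$, so the comparison between the two predictions is unchanged.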


