# Loss functions in DNN

We use the **cross-entropy loss** (or log-loss).

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Entropy

The cross-entropy loss is based on ideas from information theory. In that context, the **cross entropy** between two probability distributions $p$ and $q$ over the same set of events measures the average number of bits needed to identify an event drawn from the set if the coding scheme is optimized for the "wrong" distribution $q$ rather than the true distribution $p$. In other words, it is the average "message length" we pay when events actually follow $p$ but we encoded them as if they followed $q$.

The _entropy_ of a distribution is the expected value of the _information_. For an event $x_i$ that occurs with probability $p \left(x_i\right)$, the _Shannon information_ (named after Claude Shannon) is

\begin{equation}
I(x_i) = -\log_2 p \left(x_i\right)
\end{equation}
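
As a quick check of the definition, the sketch below evaluates the Shannon information for a few probabilities. The helper name `information_bits` and the example probabilities are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def information_bits(p):
    """Shannon information, in bits, of an event with probability p."""
    return -np.log2(p)

# Rarer events carry more information: halving p adds one bit.
probs = [0.5, 0.25, 0.125]
bits = [float(information_bits(p)) for p in probs]
print(bits)  # [1.0, 2.0, 3.0]
```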

The entropy, as an expectation value, is weighted by the probability:

\begin{equation}
H = -\sum_{i} p \left(x_i\right) \log_2 p \left(x_i\right)
\end{equation}

In [2]:
def shannon_entropy(pvec):
    pvec_ = np.array(pvec)
    return np.sum(-pvec_ * np.log2(pvec_))

The more "mixed up" the data is (the more spread across classes), the higher the entropy:

In [3]:
a = [0.5, 0.5]
b = [0.4, 0.6]
c = [0.1, 0.9]
[shannon_entropy(vec) for vec in [a, b, c]]

[1.0, 0.97095059445466858, 0.46899559358928122]

In [4]:
a = [0.2] * 5
b = [0.3] + [0.2] * 3 + [0.1]
c = [0.8] + [0.05] * 4
[shannon_entropy(vec) for vec in [a, b, c]]

[2.3219280948873622, 2.2464393446710154, 1.1219280948873622]
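
The first value in each list above is the uniform case, which gives the largest entropy, $\log_2 n$ for $n$ outcomes. A minimal sketch of that check follows; `entropy_bits` is a hypothetical variant of `shannon_entropy` that assumes terms with $p = 0$ contribute nothing, which the original function does not handle.

```python
import numpy as np

def entropy_bits(pvec):
    """Shannon entropy in bits; terms with p = 0 are taken to contribute 0."""
    p = np.asarray(pvec, dtype=float)
    nz = p > 0                      # skip zeros to avoid 0 * log2(0) = nan
    return float(-np.sum(p[nz] * np.log2(p[nz])))

n = 5
uniform = np.full(n, 1.0 / n)
print(entropy_bits(uniform), np.log2(n))  # both about 2.32 bits

# A certain outcome has zero entropy.
print(entropy_bits([1.0, 0.0, 0.0]) == 0.0)  # True
```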

### Cross entropy

The _cross entropy_ is defined according to

\begin{equation}
\begin{split}
H\left(p, q\right) =&~ \sum_{x_i} p \left(x_i\right) \log \frac{1}{q\left(x_i\right)} \\
=&~ - \sum_{x_i} p \left(x_i\right) \log q\left(x_i\right) 
\end{split}
\end{equation}
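
To make the definition concrete, here is a small sketch that evaluates $H(p, q)$ for a matching and a mismatched $q$. It assumes base-2 logarithms (so the result is in bits) and uses made-up distributions; `cross_entropy_bits` is a hypothetical helper name.

```python
import numpy as np

def cross_entropy_bits(p, q):
    """H(p, q) = -sum_i p_i log2 q_i, in bits. Assumes q_i > 0 wherever p_i > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(q[nz])))

p = [0.5, 0.25, 0.25]        # "true" distribution (illustrative values)
q_good = [0.5, 0.25, 0.25]   # coding scheme optimized for the right distribution
q_bad = [0.25, 0.25, 0.5]    # coding scheme optimized for the wrong distribution

print(cross_entropy_bits(p, q_good))  # 1.5 bits, equal to the entropy of p
print(cross_entropy_bits(p, q_bad))   # 1.75 bits, larger than the entropy of p
```

The cross entropy is smallest when $q = p$, where it reduces to the entropy of $p$; any mismatch costs extra bits.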

Information theory can be confusing, but we can recast this in terms of optimization. Consider simple logistic regression for two classes. Here, we model probability with the logistic function:

\begin{equation}
g\left(z\right) = 1 / \left(1 + e^{-z}\right)
\end{equation}

The probability of finding $y = 1$ is:

\begin{equation}
q_{y=1} = \hat{y} \equiv g(\mathbf{w} \cdot \mathbf{x} + b)
\end{equation}

where $g$ is the logistic function above, $\mathbf{x}$ is our feature vector, and $\mathbf{w}$ and $b$ are the weights and bias of a linear model (linearity is not a requirement). Similarly, the probability of finding $y = 0$ is:

\begin{equation}
q_{y=0} = 1 - \hat{y}
\end{equation}

For the true (observed) probabilities, we can write:

\begin{equation}
\begin{split}
p_{y=1} =&~ y \\
p_{y=0} =&~ 1 - y \\
\end{split}
\end{equation}

Now we can write the cross entropy as a measure of the dissimilarity between $p$ and $q$:

\begin{equation}
\begin{split}
H\left(p, q\right) =&~ -\sum_i p_i \log q_i \\
=&~ -y \log \hat{y} - \left(1 - y\right) \log \left(1 - \hat{y}\right)\\
\end{split}
\end{equation}
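
The binary cross entropy above can be sketched in a few lines. The weights, bias, and feature values below are hypothetical numbers chosen only so that the arithmetic is easy to follow; the natural logarithm is assumed.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, y_hat):
    """-y log(y_hat) - (1 - y) log(1 - y_hat), using the natural log."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# Hypothetical weights, bias, and features, chosen only for illustration.
w = np.array([1.0, -2.0])
b = 0.5
x = np.array([2.0, 0.25])

y_hat = logistic(np.dot(w, x) + b)     # z = 2.0 - 0.5 + 0.5 = 2.0
print(y_hat)                           # about 0.88
print(binary_cross_entropy(1, y_hat))  # small loss: the model favors y = 1 and is right
print(binary_cross_entropy(0, y_hat))  # larger loss: the model favors y = 1 and is wrong
```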

### Multinomial regression and the softmax

Logistic regression uses the logistic function to bound model outputs between zero and one, making them interpretable as probabilities:

\begin{equation}
\sigma \left(t \right) = \frac{1}{1 + e^{-t}}
\end{equation}

If we assume $t$ is a linear function of a single explanatory variable, we may write it as $t = \beta_0 + \beta_1 x$ and the logistic function as

\begin{equation}
F(x) = \frac{1}{1 + e^{- \left(\beta_0 + \beta_1 x \right)}}
\end{equation}

and $F(x)$ is interpreted as the probability of the dependent variable being a "success". We can compute the inverse of the logistic function, or _logit_, as

\begin{equation}
g \left(F \left(x \right) \right) = \ln \left(\frac{F \left(x \right)}{1 - F \left(x \right)} \right) = \beta_0 + \beta_1 x
\end{equation}
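
A short numerical check that the logit, the log-odds $\ln \left(F / (1 - F)\right)$, undoes the logistic function. The coefficients `beta0` and `beta1` are hypothetical values picked for illustration.

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    """Log-odds: the inverse of the logistic function for 0 < p < 1."""
    return np.log(p / (1.0 - p))

beta0, beta1 = -1.0, 0.5          # hypothetical coefficients
x = np.linspace(-4.0, 4.0, 9)
t = beta0 + beta1 * x

# Applying the logit to F(x) recovers the linear predictor beta0 + beta1 * x.
print(np.allclose(logit(logistic(t)), t))  # True
```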

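The linear predictor function in multinomial logistic regression assigns a score to outcome $k$ for observation $i$, given $M$ explanatory variables (with a constant 1 included in $\mathbf{x_i}$ so that the intercept $\beta_{0,k}$ is absorbed into the dot product):

\begin{equation}
\begin{split}
f \left(k, i \right) =&~ \beta_{0,k} + \beta_{1, k} x_{1, i} + \cdots + \beta_{M,k} x_{M, i} \\
=&~ \beta_k \cdot \mathbf{x_i} \\
\end{split}
\end{equation}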

A simple way to derive the multinomial model is to consider, for $k$ possible outcomes, running $k-1$ independent binary regression models in which one outcome is chosen as a pivot and each of the other $k-1$ outcomes is regressed against the pivot outcome. If we choose the last outcome as the pivot, this looks like:

\begin{equation}
\begin{split}
\ln \frac{Pr \left(y_i = 1 \right)}{Pr \left(y_i = k \right)} =&~ \beta_1 \cdot \mathbf{x_i} \\
\ln \frac{Pr \left(y_i = 2 \right)}{Pr \left(y_i = k \right)} =&~ \beta_2 \cdot \mathbf{x_i} \\
\cdots&~ \\
\ln \frac{Pr \left(y_i = k - 1 \right)}{Pr \left(y_i = k \right)} =&~ \beta_{k - 1} \cdot \mathbf{x_i} \\
\end{split}
\end{equation}

If we exponentiate both sides we have:

\begin{equation}
\begin{split}
Pr \left(y_i = 1 \right) =&~ Pr \left(y_i = k \right) e^{\beta_1 \cdot \mathbf{x_i}} \\
Pr \left(y_i = 2 \right) =&~ Pr \left(y_i = k \right) e^{\beta_2 \cdot \mathbf{x_i}} \\
\cdots&~ \\
Pr \left(y_i = k - 1 \right) =&~ Pr \left(y_i = k \right) e^{\beta_{k - 1} \cdot \mathbf{x_i}} \\
\end{split}
\end{equation}

We may then use the fact that all $k$ probabilities must sum to 1 to solve for the pivot probability:

\begin{equation}
Pr \left(y_i = k \right) = \frac{1}{1 + \sum_{j=1}^{k-1} e^{\beta_j \cdot \mathbf{x_i}} }
\end{equation}

And so we have for the probabilities:

\begin{equation}
\begin{split}
Pr \left(y_i = 1 \right) =&~ \frac{e^{\beta_1 \cdot \mathbf{x_i}}}{1 + \sum_{j=1}^{k-1} e^{\beta_j \cdot \mathbf{x_i}} } \\
Pr \left(y_i = 2 \right) =&~ \frac{e^{\beta_2 \cdot \mathbf{x_i}}}{1 + \sum_{j=1}^{k-1} e^{\beta_j \cdot \mathbf{x_i}} } \\
\cdots&~ \\
Pr \left(y_i = k - 1 \right) =&~ \frac{e^{\beta_{k - 1} \cdot \mathbf{x_i}}}{1 + \sum_{j=1}^{k-1} e^{\beta_j \cdot \mathbf{x_i}} } \\
\end{split}
\end{equation}

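Or, more generally, treating all $k$ outcomes symmetrically (the pivot form is recovered by fixing $\beta_k = 0$, so that $e^{\beta_k \cdot \mathbf{x_i}} = 1$), the probability of any outcome $c$ is:

\begin{equation}
Pr \left(y_i = c \right) = \frac{e^{\beta_c \cdot \mathbf{x_i}}}{\sum_{j=1}^{k} e^{\beta_j \cdot \mathbf{x_i}}}
\end{equation}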

This is called the _softmax function_, and it is common for neural networks to encode their final output using it.
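
The pivot derivation and the symmetric softmax form can be compared numerically. The sketch below uses random, purely illustrative coefficients and features, fixes the pivot coefficients to zero, and checks that both forms give the same probabilities. It also checks that adding the same vector to every $\beta_j$ leaves the probabilities unchanged, which is why fixing the pivot loses no generality.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m = 4, 3                        # 4 outcomes, 3 features (illustrative sizes)
x_i = rng.normal(size=m)
betas = rng.normal(size=(k, m))
betas[-1] = 0.0                    # pivot: the last outcome has beta_k = 0

scores = betas @ x_i               # beta_j . x_i for every outcome j

# Pivot form: Pr(y_i = k) = 1 / (1 + sum_{j<k} exp(beta_j . x_i))
denom = 1.0 + np.exp(scores[:-1]).sum()
pivot_probs = np.append(np.exp(scores[:-1]) / denom, 1.0 / denom)

# Softmax form over all k outcomes
softmax_probs = np.exp(scores) / np.exp(scores).sum()

print(np.allclose(pivot_probs, softmax_probs))    # True
print(softmax_probs.sum())                        # 1.0 up to rounding

# Shifting every beta_j by the same vector does not change the probabilities.
c = rng.normal(size=m)
shifted = (betas + c) @ x_i
shifted_probs = np.exp(shifted) / np.exp(shifted).sum()
print(np.allclose(shifted_probs, softmax_probs))  # True
```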

Generally, when working with a multiclass classifier, we encode the state vector as a "one-hot" vector. So, for example, if we are identifying an event as coming from target 1, 2, 3, 4, or 5, we write those states as `[1, 0, 0, 0, 0]`, `[0, 1, 0, 0, 0]`, `[0, 0, 1, 0, 0]`, etc.

Therefore, we can imagine the final output of a neural network being five neurons that will be interpreted _through a softmax function_ as the probability of being in target 1, 2, 3, 4, or 5. Regardless of what the weights actually are, by passing the final five outputs as a vector to a softmax, we get a number that we may interpret as a probability for each outcome (all are between 0 and 1 and the sum will be 1).
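
A minimal sketch of both ideas: one-hot encoding of the five targets, and a softmax applied to five raw outputs. The helper `one_hot` and the raw output values are hypothetical, chosen only to illustrate the shapes involved.

```python
import numpy as np

n_targets = 5

def one_hot(target, n_classes=n_targets):
    """Encode a 1-based target label as a one-hot vector."""
    return np.eye(n_classes, dtype=int)[target - 1]

print(one_hot(1))  # [1 0 0 0 0]
print(one_hot(3))  # [0 0 1 0 0]

# Hypothetical raw outputs of the five final neurons
raw = np.array([2.0, -1.0, 0.5, 3.0, 0.0])
probs = np.exp(raw) / np.exp(raw).sum()

print(np.all((probs > 0) & (probs < 1)))  # True
print(np.isclose(probs.sum(), 1.0))       # True
print(int(np.argmax(probs)) + 1)          # 4: the most probable target
```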

### Softmax cross entropy

The cross entropy can be extended to include this multi-output form.

In [5]:
def softmax(vec):
    """
    note, you wouldn't write a 'real' softmax like this - you need numerical stability tricks
    """
    exp = np.exp(vec)
    return exp / exp.sum(axis=0)
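
The docstring above mentions numerical stability tricks. One common trick, sketched here as an assumption about what is meant, is to subtract the maximum input before exponentiating. The softmax is unchanged by a constant shift of its inputs, so the result is the same, but `np.exp` no longer overflows. The name `stable_softmax` is hypothetical.

```python
import numpy as np

def stable_softmax(vec):
    """Softmax with the max-subtraction trick to avoid overflow in exp."""
    vec = np.asarray(vec, dtype=float)
    shifted = vec - vec.max(axis=0)   # the largest entry becomes 0
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0)

big = np.array([1000.0, 1001.0, 1002.0])
small = np.array([0.0, 1.0, 2.0])

# np.exp(1000.0) overflows to inf, so the naive form would return nan here.
print(stable_softmax(big))
print(np.allclose(stable_softmax(big), stable_softmax(small)))  # True
```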

In [6]:
y_true = np.array([0, 1, 0, 0, 0])

In [7]:
def cross_entropy(final_softmax, true_labels):
    return np.sum(-true_labels * np.log(final_softmax))

In [8]:
nn_final_cases = np.array([
    [15, 30, 25, 6, 10],    # moderately strong correct reco
    [32, 30, 20, 8, 12],    # incorrect
    [10, 31, 20, 1, 1],     # strongly correct reco
    [30, 30, 30, 12, 10],   # no clear signal
    [29, 30, 31, 20, 12],   # weakly incorrect
])
[cross_entropy(softmax(vec), y_true) for vec in nn_final_cases]

[0.0067156544288905244,
 2.1269334246813152,
 1.6702319748914317e-05,
 1.0986122944318208,
 1.4076170787652222]

Some notable properties of this loss function:

1. it is bounded from below by zero
2. the function goes to zero when the softmax output is close to the true label
3. there is no notion of "distance" in the "wrongness" - off by 1 is as bad as off by 2, etc.
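
The third property can be illustrated with two made-up output vectors that give the true class the same score but place the wrong peak at different distances from it. The functions are redefined here so the sketch is self-contained; the input values are hypothetical.

```python
import numpy as np

def softmax(vec):
    exp = np.exp(vec - np.max(vec))
    return exp / exp.sum()

def cross_entropy(probs, true_labels):
    return float(np.sum(-true_labels * np.log(probs)))

y_true = np.array([0, 1, 0, 0, 0])

# Hypothetical outputs: same score on the true class, but the wrong peak sits
# one slot away in the first case and three slots away in the second.
off_by_one = np.array([1.0, 2.0, 6.0, 1.0, 1.0])
off_by_three = np.array([1.0, 2.0, 1.0, 1.0, 6.0])

loss_1 = cross_entropy(softmax(off_by_one), y_true)
loss_3 = cross_entropy(softmax(off_by_three), y_true)

print(np.isclose(loss_1, loss_3))  # True: the loss ignores where the wrong peak is
print(loss_1 >= 0.0)               # True: probabilities are <= 1, so -log(p) >= 0
```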

Softmax cross entropy is very popular in classification tasks. It is not obviously the right choice for _localization_ tasks, where one might prefer the squared distance (or some other distance metric). We choose the softmax cross entropy anyway because the space we are operating in is not linear: when mapping a kernel across pixels, the "distance" between two columns (two planes) is not the same in every view, or even across one view, because of the targets, the UXVX interleaving, etc.

We also tried a squared-distance loss to predict `z` directly as a regression value, but this was not very successful.