# Cross Entropy Loss Function

Also known as logarithmic loss, cross-entropy is a cost function, like the MSE loss function. It measures how different the **predicted** probability distribution is from the **true** probability distribution: the closer the two are, the lower the loss.

In [8]:
import numpy as np
import torch
import matplotlib.pyplot as plt

# Initial Problem

Suppose there's a set of numbers $\vec {p} = [1, 3, 5, 2, ...]$ of length $N$. What set of corresponding numbers $\vec {q} = [q_1, q_2, q_3, ...]$ minimizes the cross-entropy loss function defined as:

$$H(\vec{p}, \vec{q}) = -\sum_{i = 1}^{N} p_i \ln(q_i)$$

subject to the constraint $\sum_i p_i = \sum_i q_i$.

$p_i$ denotes the actual distribution of values; $q_i$ denotes the predicted distribution of values.


In [9]:
p = np.array([5,4,3,2,1])
q1 = p
q2 = np.array([3,2,5,7,8])
q3 = np.array([9,3,6,4,1])

In [10]:
-sum(p*np.log(q3))

np.float64(-23.528439171277483)

Because this is a constrained optimization problem, we can use a Lagrange multiplier $\lambda$ and define the Lagrangian:

$$f(\vec{q}, \lambda) = -\sum_{i = 1}^{N} p_i \ln(q_i) - \lambda \left(\sum_{i}p_i - \sum_{i} q_i \right)$$

To find the extremum of the loss function (for machine learning, the minimum), we set the partial derivative with respect to each $q_i$ equal to 0, and do the same for the partial derivative with respect to $\lambda$.

$$\frac{\partial f}{\partial q_i} = -\frac{p_i}{q_i} + \lambda = 0$$

$$\frac{\partial f}{\partial \lambda} = -\left(\sum_{i} p_i - \sum_{i} q_i \right) = 0$$

From this, we get the 2 equations:

- $p_i = \lambda q_i \rightarrow \Sigma p_i = \lambda \Sigma q_i$
- $\Sigma p_i = \Sigma q_i$

Both equations can hold only if $\lambda = 1$ (assuming $\Sigma p_i \neq 0$), so we must have $q_i = p_i$ for every $i$, and thus $\vec q = \vec p$.
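As a numerical sanity check on this result, the sketch below draws random positive candidates for $\vec q$, rescales each one so that it satisfies the constraint, and compares its cross-entropy against $H(\vec p, \vec p)$. The rescaling step matters: without the constraint, scaling every $q_i$ up lowers $H$ without bound. The helper name `cross_entropy`, the seed, and the number of candidates are choices made for this illustration.

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i ln(q_i); assumes every q_i is strictly positive."""
    return -np.sum(p * np.log(q))

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
p = np.array([5.0, 4.0, 3.0, 2.0, 1.0])

# Draw random positive candidates and rescale each row so that sum(q) == sum(p).
candidates = rng.uniform(0.1, 10.0, size=(1000, p.size))
candidates *= p.sum() / candidates.sum(axis=1, keepdims=True)

H_min = cross_entropy(p, p)
H_candidates = np.array([cross_entropy(p, q) for q in candidates])

print(H_min)                        # cross-entropy at q = p
print(H_candidates.min() >= H_min)  # whether every constrained candidate is at least as large
```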

# Classification Problem

A classification problem takes an input, such as an image, which we call $x$. This image may contain an object like a ball, a shoe, etc. We'll write the true probability that image $x$ belongs to class $i$ as $p_i$. The goal of a classifier is to create a function $f$ such that

$$f(x) = \vec q$$

where $\vec q$ is as close to $\vec p$ as possible.

- Note that $\vec p$ and $\vec q$ are probability mass functions (PMFs), and each element of a vector corresponds to a different class. Because probability is involved, it follows that $\Sigma p_i = \Sigma q_i = 1$. This is the constraint enforced by the Lagrange multiplier above.

- Typically, we know what class $\tilde c$ an image $x$ belongs to. In this case, $p_{\tilde c} = 1$ for the known class $i = \tilde c$, and all other $p_i$ s are equal to 0.

To minimize the difference between $\vec p$ and $\vec q$, we minimize the cross-entropy loss function:

$$H(\vec{p}, \vec{q}) = -\sum_{i} p_i \ln(q_i)$$

This works because the minimum of this function occurs exactly when $p_i = q_i$ for all $i$. When we know which class an image belongs to (one of the $p_i$ s equals 1 and the rest are 0), only one term of the sum survives:

$$H(\vec{p}, \vec{q}) = -\ln(q_{\tilde c})$$
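A quick check that the full sum and the single-term shortcut agree for a one-hot $\vec p$. This sketch assumes a strictly positive, normalized $\vec q$ (drawn here with `rng.random`, an illustrative choice) so that every logarithm is defined; `true_class` and `num_classes` are names introduced for the example.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

num_classes = 10
true_class = 4  # hypothetical known class index

p = np.zeros(num_classes)
p[true_class] = 1.0

# Strictly positive entries normalized to sum to 1, so q is a valid PMF.
q = rng.random(num_classes) + 1e-3
q /= q.sum()

H_full = -np.sum(p * np.log(q))   # full sum over all classes
H_short = -np.log(q[true_class])  # only the true-class term survives

print(H_full, H_short)
```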

In [11]:
p = np.zeros(10); p[4] = 1
p

array([0., 0., 0., 0., 1., 0., 0., 0., 0., 0.])

In [15]:
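q = np.random.randn(10) # Note: randn draws can be negative, so q is not a true PMF; np.random.rand would give non-negative entries.
q = q/sum(q) # Rescale so the entries sum to 1.
q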

array([ 0.20855977,  0.01010575, -0.00462332,  0.15392126,  0.06783992,
       -0.09038419, -0.1227637 ,  0.2720416 ,  0.12023831,  0.38506459])

In [16]:
H = -np.log(q[p > 0])
H

array([2.69060445])

In [17]:
q[4] = 20
q = q/sum(q)
H = -np.log(q[p > 0])
H

array([0.04555446])

# For Multiple Images

In this scenario, let's compute the loss over $N$ images $x_n$. Suppose we know exactly which class each image belongs to. We write the true class of the $n$th image as $\tilde c (n)$, and the predicted probability of image $n$ belonging to class $c$ as $q_n (c)$.

- Thus, the **predicted** probability of an image $n$ belonging to its true class $\tilde c (n)$ is $q_n(\tilde c (n))$.

Summing the per-image losses gives the total loss:

$$L(p, q) = \sum_{n = 1}^{N}H({p_n}, {q_n}) = -\sum_{n = 1}^{N} \ln(q_n(\tilde c (n)))$$

Create a sample of one-hot $p$ s for 4 images and 10 classes:

In [18]:
p = np.zeros((4, 10), dtype = int)
p[0][4] = 1
p[3][3] = 1
p[2][8] = 1
p[1][2] = 1
p

array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]])
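The same kind of one-hot array can be built in one step from a list of class indices. This sketch indexes rows of an identity matrix; the names `labels`, `num_classes`, and `p_onehot` are introduced here for illustration.

```python
import numpy as np

labels = np.array([4, 2, 8, 3])  # true class index of each image
num_classes = 10

# Row k of the identity matrix is the one-hot vector for class k.
p_onehot = np.eye(num_classes, dtype=int)[labels]
print(p_onehot)
```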

Create probability array of $q$ s:

In [24]:
q = np.random.randn(40).reshape(4, 10) # Note: randn draws can be negative, so these rows are not true PMFs.
q = q/np.expand_dims(np.sum(q, axis = 1), axis = 1) # np.sum with axis = 1 sums each row (over the 10 classes).
q # Predicted probability of each of the 4 images belonging to each of the 10 classes.

array([[ 0.1531486 , -0.33646045,  0.17881717,  1.12984208,  0.01711937,
         0.09428152, -0.10180105, -0.41069267,  0.0230275 ,  0.25271793],
       [-0.13861425, -0.10440734,  0.2918362 ,  0.10760671,  0.3639906 ,
         0.21897043,  0.66270117,  0.17414809, -0.35027397, -0.22595765],
       [-0.0932851 ,  0.4827172 ,  0.30793316,  0.00392941, -0.08734484,
        -0.02598965, -0.06607129,  0.0031672 ,  0.50359873, -0.02865483],
       [ 0.54019793, -1.88490379,  0.25124035,  0.70971842,  0.42960954,
        -0.08397891, -0.63278626,  1.06370509,  0.40170946,  0.20548817]])

Compute $H$ for every term in the sum. Then, sum together to calculate $L$.

In [25]:
q[p>0]

array([0.01711937, 0.2918362 , 0.50359873, 0.70971842])

In [26]:
Hs = -np.log(q[p>0]) # Take natural log of every value of q where p is 1.
Hs

array([4.06754454, 1.23156259, 0.68597551, 0.34288698])

In [27]:
L = sum(Hs) # Total loss L(p, q).
L

np.float64(6.327969625576402)
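When the true classes are stored as integer indices instead of one-hot rows, the same loss can be computed with integer indexing. A minimal sketch, assuming each row of `q` is a valid PMF; `batch_loss` is a hypothetical helper name and the small `q` below is made-up example data.

```python
import numpy as np

def batch_loss(q, labels):
    """Total cross-entropy: -sum_n ln(q[n, labels[n]])."""
    rows = np.arange(q.shape[0])
    return -np.sum(np.log(q[rows, labels]))

# Made-up predictions for 3 images and 4 classes; each row sums to 1.
q = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.25, 0.25, 0.25, 0.25],
              [0.05, 0.05, 0.10, 0.80]])
labels = np.array([0, 2, 3])

loss = batch_loss(q, labels)
print(loss)  # equals -(ln 0.70 + ln 0.25 + ln 0.80)
```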

# Obtaining the $q$ s in Machine Learning

$\vec q$ should behave like a probability mass function:

- Each entry is bounded between 0 and 1.
- The closer to 0, the less confidence we have that image $n$ belongs to class $c$
- The closer to 1, the more confidence we have that image $n$ belongs to class $c$
- $\sum_{c=0}^{C} q_c = 1$ for each image.

Suppose a neural network outputs $f(x_n) = \hat{y}_n$, where $\hat{y}_n$ is a vector with one entry per class, but $\hat{y}_n$ is not normalized like $\vec q$ should be. We can enforce the conditions above by normalizing with the softmax function:

$$q_n(c) = \frac{\exp \left (\hat{y}_n(c) \right)}{\sum_{c^{'} = 0}^{C} \exp \left (\hat{y}_n(c^{'}) \right)}$$

So we can write the loss of the output vector as:

$$L(\hat{y}_n) = -\sum_{n = 1}^{N} \ln(q_n(\tilde c (n))) = - \sum_{n = 1}^{N} \ln \left(\frac{\exp \left (\hat{y}_n(\tilde{c}(n)) \right)}{\sum_{c = 0}^{C} \exp \left (\hat{y}_n(c) \right)} \right)$$

- Note that for a given $n$, $\hat{y}_n(c)$ is the network output for class $c$ given image $x_n$.
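Expanding the logarithm of the quotient gives an equivalent form that does not require computing $q$ explicitly (same symbols and index ranges as above):

$$L(\hat{y}_n) = \sum_{n = 1}^{N} \left[ -\hat{y}_n(\tilde{c}(n)) + \ln \left( \sum_{c = 0}^{C} \exp \left (\hat{y}_n(c) \right) \right) \right]$$

The second term is the log-sum-exp of the network outputs for image $n$; the first term is the network output for the true class.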

Create a sample $\hat{y}$:

In [33]:
yhat = 20 * np.random.randn(40).reshape(4, 10)**2 # ** binds tighter than *, so this is 20 * (randn squared): all entries are non-negative.
yhat

array([[3.00131002e+00, 1.98733066e+01, 5.96261095e+01, 1.09678714e+01,
        2.46671490e+00, 9.13040898e-01, 4.12436157e-02, 7.83721322e+00,
        7.48304803e+00, 9.69454301e-01],
       [2.49066194e+00, 5.20570255e+01, 4.61273034e+01, 1.52378361e+01,
        3.38444331e-01, 9.44825774e-01, 2.55229482e-01, 1.07898274e+01,
        4.50163385e+01, 1.73982761e+01],
       [4.34726323e+01, 2.64560271e-01, 1.38631843e+01, 7.49144768e+01,
        6.65430380e+01, 3.61217176e+01, 2.07193740e+00, 1.21724136e+01,
        1.50934946e+01, 4.90860942e+00],
       [4.17048673e+01, 1.98287359e+01, 4.53426593e-01, 3.36044786e+00,
        6.19356333e+01, 4.75274865e-02, 2.27418740e+01, 4.76039078e+00,
        6.63599050e+00, 6.41687811e+01]])

The first axis indexes the image $n$. The second indexes the class $c$.

In [34]:
yhat.shape

(4, 10)

Compute the $q$ s from the exponential fraction above (this can be numerically delicate, since `np.exp` overflows for large inputs):

In [37]:
q = np.exp(yhat)
q = q/np.expand_dims(np.sum(q, axis = 1), axis = 1)
q

array([[2.55954081e-25, 5.43972662e-18, 1.00000000e+00, 7.37896904e-22,
        1.49965151e-25, 3.17130712e-26, 1.32623826e-26, 3.22380031e-23,
        2.26233099e-23, 3.35535389e-26],
       [2.96530533e-22, 9.96477704e-01, 2.64985454e-03, 1.01881851e-16,
        3.44645957e-23, 6.32006147e-23, 3.17127160e-23, 1.19220562e-18,
        8.72441014e-04, 8.83820338e-16],
       [2.21248300e-14, 3.80060471e-33, 3.05956774e-27, 9.99768671e-01,
        2.31328887e-04, 1.42042488e-17, 2.31625631e-32, 5.64115053e-28,
        1.04707325e-26, 3.95127831e-31],
       [1.58425050e-10, 5.00196535e-20, 1.92553214e-28, 3.52414741e-27,
        9.68130457e-02, 1.28313109e-28, 9.21086641e-19, 1.42903067e-26,
        9.32404065e-26, 9.03186954e-01]])

Show that they are normalized.

In [36]:
q.sum(axis = 1)

array([1., 1., 1., 1.])
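A common way to guard against overflow in `np.exp`, sketched here as an assumption and not something the computation above does, is to subtract each row's maximum before exponentiating. The shift cancels in the ratio, so the resulting probabilities are unchanged. `stable_softmax` is a name chosen for this example and the input is made up.

```python
import numpy as np

def stable_softmax(yhat):
    """Row-wise softmax that shifts each row by its maximum to avoid overflow."""
    shifted = yhat - yhat.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

# Made-up outputs; the second row would overflow a naive np.exp.
yhat_demo = np.array([[1.0, 2.0, 3.0],
                      [1000.0, 1001.0, 1002.0]])
q_demo = stable_softmax(yhat_demo)
print(q_demo.sum(axis=1))  # each row sums to 1
```

Both rows differ only by a constant shift, so they map to the same probabilities.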

Compute $\tilde {c}(n)$, the true class index of each image, from the one-hot array $p$:

In [None]:
c_tilde = np.where(p) # np.where(p) returns the (row, column) indices of the nonzero entries of p.
c_tilde # Indices start at 0: the 1st image has class index 4, the 3rd image has class index 8.

(array([0, 1, 2, 3]), array([4, 2, 8, 3]))

Compute the loss function:

In [40]:
Hs = -np.log(q[c_tilde])
L = sum(Hs)
L

np.float64(183.82401760653408)
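Some of the $q$ values above are as small as about $10^{-33}$; values much smaller than that would underflow to 0 and make the logarithm infinite. A sketch of an alternative that works in log space and never forms $q$, assuming the true classes are given as integer indices. `loss_from_logits` is a hypothetical helper and the inputs are made-up.

```python
import numpy as np

def loss_from_logits(yhat, labels):
    """Summed cross-entropy computed as logsumexp(yhat_n) - yhat_n[label_n]."""
    row_max = yhat.max(axis=1, keepdims=True)
    # log-sum-exp with the row maximum factored out for numerical stability
    lse = row_max[:, 0] + np.log(np.exp(yhat - row_max).sum(axis=1))
    true_scores = yhat[np.arange(yhat.shape[0]), labels]
    return np.sum(lse - true_scores)

yhat_demo = np.array([[2.0, 0.5, -1.0],
                      [0.1, 0.2, 3.0]])
labels_demo = np.array([0, 2])

# Reference: explicit softmax followed by -log of the true-class probability.
q_demo = np.exp(yhat_demo) / np.exp(yhat_demo).sum(axis=1, keepdims=True)
reference = -np.log(q_demo[np.arange(2), labels_demo]).sum()

print(loss_from_logits(yhat_demo, labels_demo), reference)
```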

# Check That This Is Equivalent to PyTorch

Create loss function:

In [52]:
L = torch.nn.CrossEntropyLoss(reduction = 'sum') # reduction = 'sum' adds up the per-image losses instead of averaging them.

Evaluate on data above:

In [56]:
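L(torch.tensor(yhat), torch.tensor(p, dtype = torch.float)) # A float target with the same shape as the input is treated as class probabilities (here, one-hot rows).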

tensor(183.8240, dtype=torch.float64)
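`torch.nn.CrossEntropyLoss` also accepts integer class indices as the target, which avoids building one-hot rows. A small sketch on made-up logits (not the `yhat` above), compared against the explicit formula via `torch.log_softmax`; the variable names are illustrative.

```python
import torch

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]], dtype=torch.float64)
targets = torch.tensor([0, 2])  # class indices (int64)

loss_fn = torch.nn.CrossEntropyLoss(reduction="sum")
loss_torch = loss_fn(logits, targets)

# Manual version: -sum_n log softmax(logits_n)[target_n]
log_q = torch.log_softmax(logits, dim=1)
loss_manual = -log_q[torch.arange(logits.shape[0]), targets].sum()

print(loss_torch.item(), loss_manual.item())
```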