# Cross Entropy Loss

Suppose for a single sample, we have raw scores (logits) $[v_1, ..., v_n]$, one per class.

Then probabilities are defined as **softmax**, or $p_k = \frac{e^{v_k}}{\sum_j{e^{v_j}}}$.

The loss for the sample is $l = \sum_k {-y_k \log p_k}$, where $y_k$ is the target probability of class $k$.

Note:
- $y_k$'s do not necessarily need to be $1$'s and $0$'s. But they often are for known labels with one-hot encoding.
- The loss is not symmetric wrt. $y$'s and $p$'s.
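
The definitions and both notes above can be checked with a small plain-Python sketch. The scores and soft labels are made-up illustrative values, and the helper names `softmax` and `cross_entropy` are hypothetical, not part of any library.

```python
import math


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the max does not change the result; it only avoids overflow.
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y, p):
    """l = sum_k -y_k * log(p_k); terms with y_k == 0 contribute nothing."""
    return sum(-y_k * math.log(p_k) for y_k, p_k in zip(y, p) if y_k > 0)


v = [2.0, 1.0, 0.0]  # illustrative scores for 3 classes
y = [0.7, 0.2, 0.1]  # soft labels: valid even though they are not 0/1
p = softmax(v)

loss_yp = cross_entropy(y, p)
loss_py = cross_entropy(p, y)  # arguments swapped: a different value

print(p)
print(loss_yp, loss_py)
```

Swapping the two arguments changes the result, which is what "not symmetric" means here.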

## Notes on Usage

For experiments see the relevant sections below.

### Class IDs

The targets could also just be integers denoting the class ids.
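
To see why a class id carries the same information as a one-hot target, assume the true class is $c$, so that $y_c = 1$ and every other $y_k = 0$. Then the sum collapses to a single term:

$$
l = \sum_k -y_k \log p_k = -\log p_c = -v_c + \log \sum_j e^{v_j}
$$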

### Multi-Target

Suppose you simultaneously want multiple independent targets to be inferred.

For example, you want different cells each with labels 'X', 'O' or '.'.

Then you must arrange your data as follows (dimensions are 0-indexed, as in PyTorch):

- Dimension 0: Batch
- Dimension 1: Class
- Dimension 2 onwards: All independent sections
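
As a rough sketch of this layout, assume a batch of 4 boards of 3x3 cells with 3 possible labels per cell. The sizes and the names `board_scores` and `board_labels` are hypothetical and chosen only for illustration.

```python
import torch
from torch import nn

# Hypothetical sizes: 4 boards, each 3x3, every cell is one of 3 classes.
batch, classes, height, width = 4, 3, 3, 3

torch.manual_seed(0)
# Scores: batch first, class second, then all independent dimensions.
board_scores = torch.randn(batch, classes, height, width)
# Class-id targets have no class dimension: batch, then independent dimensions.
board_labels = torch.randint(classes, (batch, height, width))

per_cell = nn.CrossEntropyLoss(reduction="none")(board_scores, board_labels)
mean_loss = nn.CrossEntropyLoss()(board_scores, board_labels)

print(per_cell.shape)  # one loss per cell of every board
print(mean_loss)       # mean over all batch * height * width cells
```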


In [22]:
import torch
from torch import nn

## Generate Data for Experiments

In [23]:
torch.manual_seed(42)

# _ROWS is the number of samples in the batch.
_ROWS = 10
# _COLS is the number of classes.
_COLS = 3

input = torch.randn((_ROWS, _COLS), requires_grad=True)
print(input)

target = torch.zeros((_ROWS, _COLS))
for index in range(_ROWS):
    target[index, torch.randint(_COLS, (1,))] = 1
print(target)

tensor([[ 1.9269,  1.4873,  0.9007],
        [-2.1055,  0.6784, -1.2345],
        [-0.0431, -1.6047, -0.7521],
        [ 1.6487, -0.3925, -1.4036],
        [-0.7279, -0.5594, -2.3169],
        [-0.2168, -1.3847, -0.8712],
        [-0.2234,  1.7174,  0.3189],
        [-0.4245, -0.8286,  0.3309],
        [-1.5576,  0.9956, -0.8798],
        [-0.6011, -1.2742,  2.1228]], requires_grad=True)
tensor([[0., 0., 1.],
        [0., 1., 0.],
        [1., 0., 0.],
        [0., 0., 1.],
        [1., 0., 0.],
        [1., 0., 0.],
        [0., 1., 0.],
        [1., 0., 0.],
        [0., 1., 0.],
        [0., 1., 0.]])


## Dissecting the Cross Entropy implementation in PyTorch

In [24]:
loss = nn.CrossEntropyLoss()
loss_unreduced = nn.CrossEntropyLoss(reduction='none')

loss(input, target)

tensor(1.2496, grad_fn=<DivBackward1>)

It is the same as computing the loss **per sample, and then taking the mean** over the batch.

NB. This behavior can be controlled with the parameter `reduction` which defaults to `"mean"`.

In [25]:
sum(loss(input[i, :], target[i, :]) for i in range(_ROWS)) / _ROWS

tensor(1.2496, grad_fn=<DivBackward0>)
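
As a sketch of how the three `reduction` settings relate, the following uses small made-up data (the names `scores` and `labels` are illustrative and separate from the experiment tensors). It assumes no class weights are set, in which case `"mean"` is the plain average of the per-sample losses.

```python
import torch
from torch import nn

torch.manual_seed(0)
scores = torch.randn(4, 3)           # 4 samples, 3 classes (illustrative)
labels = torch.tensor([0, 2, 1, 2])  # class ids

per_sample = nn.CrossEntropyLoss(reduction="none")(scores, labels)
total = nn.CrossEntropyLoss(reduction="sum")(scores, labels)
mean = nn.CrossEntropyLoss(reduction="mean")(scores, labels)

print(per_sample)  # shape (4,): one loss per sample
print(total, mean)
```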

We can check the mean loss by computing it ourselves with NumPy.

In [26]:
import numpy as np


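def alt_defn(input, target):
    """Recompute the mean cross entropy with NumPy, without autograd."""
    with torch.no_grad():
        # Softmax over the class dimension.
        # This could also be computed as -
        # probs = torch.softmax(input, dim=1)
        exp = np.exp(input.numpy())
        probs = exp / exp.sum(axis=1, keepdims=True)

        # One loss per sample (row), then the mean over the batch.
        per_sample_loss = (-target.numpy() * np.log(probs)).sum(axis=1)
        return per_sample_loss.mean()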


alt_defn(input, target)

np.float32(1.2496029)

### Class ID Instead of One-Hot

Instead of one-hot, we could also directly pass the class ids.

In [27]:
class_ids = target.argmax(dim=1)
class_ids

tensor([2, 1, 0, 2, 0, 0, 1, 0, 1, 1])

In [28]:
loss(input, class_ids)

tensor(1.2496, grad_fn=<NllLossBackward0>)
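
One way to read this result: with class ids, the loss simply picks out $-\log p_c$ for the target class of each sample. Below is a minimal sketch with made-up data (`scores` and `ids` are illustrative names), assuming a PyTorch version that accepts probability targets, as the cells above do.

```python
import torch
from torch import nn

torch.manual_seed(0)
scores = torch.randn(5, 3)            # 5 samples, 3 classes (illustrative)
ids = torch.tensor([2, 0, 1, 1, 0])   # class ids

log_probs = torch.log_softmax(scores, dim=1)
# For each row, pick the log-probability of its target class.
picked = log_probs[torch.arange(len(ids)), ids]
manual = -picked.mean()

one_hot = nn.functional.one_hot(ids, num_classes=3).float()
loss = nn.CrossEntropyLoss()
from_ids = loss(scores, ids)
from_one_hot = loss(scores, one_hot)

print(manual, from_ids, from_one_hot)
```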

## Multi Target

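Note that just changing the view (for example `input.view(2, 3, 5)`) wouldn't work: a view keeps the values in the same memory order, so the scores of one sample would no longer line up along a single dimension.

You also have to ensure that dim 1 is the class. This can be done by transposing after the view, and we see the loss indeed matches.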

In [29]:
input_m = input.view(2, 5, 3).transpose(1, 2)
target_m = target.view(2, 5, 3).transpose(1, 2)

In [30]:
loss(input_m, target_m)

tensor(1.2496, grad_fn=<DivBackward1>)
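
To illustrate the difference between the two reshapes, here is a sketch on separate made-up data using class-id targets (all names are illustrative). The naive view has the right shape but scrambles which scores belong together, so its loss should generally not match the reference.

```python
import torch
from torch import nn

torch.manual_seed(0)
flat_scores = torch.randn(10, 3)      # (sample, class)
flat_ids = torch.randint(3, (10,))    # one class id per sample

loss = nn.CrossEntropyLoss()
reference = loss(flat_scores, flat_ids)

# Correct: split into 2 groups of 5 positions, then move class to dim 1.
good_scores = flat_scores.view(2, 5, 3).transpose(1, 2)  # shape (2, 3, 5)
good_ids = flat_ids.view(2, 5)                           # shape (2, 5)
good = loss(good_scores, good_ids)

# Naive: same shape, but the memory is merely reinterpreted, so the three
# scores of a sample are no longer aligned along dim 1.
naive_scores = flat_scores.view(2, 3, 5)
naive = loss(naive_scores, good_ids)

print(reference, good, naive)
```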