In [2]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [3]:
import torch
from torch import Tensor, nn
import torch.nn.functional as F


## Loss functions
Loss functions provide a <mark>quantitative measure of the current performance of a neural network</mark>. There are many to choose from, but the <mark>most appropriate will often depend on your task and the form of the targets and outputs</mark>. Similar to layers in a neural network, <mark>losses are either offered as classes (inheriting from `nn.Module`) or as functions in `nn.functional`.</mark>

PyTorch provides implementations for many common losses (https://pytorch.org/docs/stable/nn.html#loss-functions), and more advanced ones can be written by the user.

In general, <mark>PyTorch losses will:</mark>
- <mark>Take an `input` argument of predictions and a `target` argument of true values</mark>. Usually, the first dimension is expected to be a batch dimension.
- <mark>Have a *reduction* method, which determines how the final value is produced</mark>. The loss of each item in the batch will first be computed in isolation, then <mark>these can either be returned unreduced (`reduction='none'`; e.g. an (N,) tensor for `CrossEntropyLoss`, or a tensor matching the input shape for element-wise losses such as `BCELoss`), or they can be reduced to the mean (`reduction='mean'`, the default) or the sum (`reduction='sum'`).</mark>
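
As a small illustration of the three reduction modes, here is a sketch using `nn.MSELoss` on hand-picked values (the numbers are made up for the example, so the results can be checked by eye):

```python
import torch
from torch import nn

preds = torch.tensor([[1.0], [2.0], [3.0], [4.0]])  # made-up predictions, batch size 4
targs = torch.ones(4, 1)                            # made-up targets, all equal to 1

per_elem = nn.MSELoss(reduction='none')(preds, targs)  # squared errors (0, 1, 4, 9), shape (4, 1)
mean = nn.MSELoss(reduction='mean')(preds, targs)      # scalar: 14 / 4 = 3.5 (the default)
total = nn.MSELoss(reduction='sum')(preds, targs)      # scalar: 0 + 1 + 4 + 9 = 14
```

Note that for an element-wise loss like this one, `reduction='none'` keeps the shape of the input rather than collapsing to one value per item.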

The <mark>losses in PyTorch make strong assumptions on the inputs and targets (shapes, normalisation, log-space, logits, etc.), and often this isn't indicated in the name</mark>, so it is best to check the docs to see what exactly is expected.

Additionally, <mark>most losses have a `weight` argument, the effect of which varies between loss functions and doesn't always behave as expected (to a HEP person). The weights must also be provided during initialisation</mark>, rather than varying per batch. <mark>If decent weight handling is required, write your own inheriting losses, or see mine: https://github.com/GilesStrong/lumin/blob/master/lumin/nn/losses/basic_weighted.py</mark>
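
One possible way to get weights that change every batch is to ask for the unreduced loss and do the weighting and reduction by hand. The sketch below assumes one weight per item and a weighted-mean normalisation; `weighted_bce` is a hypothetical helper written for this example, not a PyTorch function and not necessarily how the linked LUMIN losses work.

```python
import math
import torch
import torch.nn.functional as F

def weighted_bce(logits, targets, weights):
    """Hypothetical helper: per-item weighted mean of the BCE, computed from logits."""
    per_item = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    return (weights * per_item).sum() / weights.sum()  # weighted mean over the batch

logits = torch.zeros(4, 1)                         # sigmoid(0) = 0.5 for every item
targets = torch.tensor([[1.], [0.], [1.], [0.]])
weights = torch.tensor([[1.], [2.], [3.], [4.]])   # can be different for every batch

loss = weighted_bce(logits, targets, weights)      # every item contributes log(2), so the mean is log(2)
```

Dividing by the sum of weights keeps the loss scale independent of the overall weight normalisation; dividing by the batch size instead is an equally valid choice, depending on what the weights mean for your task.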

Below are a few common losses.

### Binary classification
For classification tasks with only two classes, the DNN can have a single output with a sigmoid output activation. The <mark>binary cross entropy function can then be used to quantify performance.</mark>

In [4]:
logit = torch.rand(10,1)  # pre-activation values of the DNN output, for a batch size of 10
targs = torch.randint(0,2, size=(10,1)).float()  # random binary targets

In [5]:
loss_fn = nn.BCELoss()

In [6]:
loss_fn(torch.sigmoid(logit), targs)

tensor(0.7286)

This is the mean binary cross-entropy for our batch. We could instead get the <mark>raw BCE per element:</mark>

In [8]:
loss_fn = nn.BCELoss(reduction='none')

In [9]:
loss_fn(torch.sigmoid(logit), targs)

tensor([[0.9069],
        [0.8259],
        [0.3929],
        [0.6429],
        [1.1332],
        [0.4033],
        [0.8378],
        [1.1399],
        [0.3644],
        [0.6391]])

In the above, we took the logits and applied a sigmoid activation to them, which involves taking the exponential of the logits. The BCE then computes the natural log of the predictions. <mark>One can save time and gain numerical precision by instead computing the BCE directly from the logits:</mark>

In [10]:
loss_fn = nn.BCEWithLogitsLoss()

In [11]:
loss_fn(logit, targs)

tensor(0.7286)
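
To see where the precision gain comes from, here is a sketch with a deliberately extreme logit (the value 200 is chosen just for illustration). In float32 the sigmoid saturates to exactly 1, so `BCELoss` has to take the log of zero, which it clamps at -100 according to the docs; `BCEWithLogitsLoss` never leaves log-space and should still report how wrong the prediction is.

```python
import torch
from torch import nn

logit = torch.tensor([[200.]])  # a very confident prediction of class 1...
targ = torch.tensor([[0.]])     # ...which turns out to be wrong

naive = nn.BCELoss()(torch.sigmoid(logit), targ)  # sigmoid(200) rounds to 1.0, log(1 - 1) is clamped
stable = nn.BCEWithLogitsLoss()(logit, targ)      # computed directly from the logit

print(naive, stable)  # the naive value saturates, the stable one keeps growing with the logit
```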

### Multi-label classification
This is similar to binary classification, except now we are predicting which <mark>non-mutually-exclusive Boolean properties the inputs have. Again we can use sigmoids for each of the targets, and BCE for the loss.</mark>

In [15]:
logit = torch.rand(10,5)  # pre-activation values of the DNN output, for a batch size of 10 for 5 labels
targs = torch.randint(0,2, size=(10,5)).float()  # random binary targets

In [16]:
loss_fn = nn.BCELoss()

In [17]:
loss_fn(torch.sigmoid(logit), targs)

tensor(0.7243)

In [18]:
loss_fn = nn.BCELoss(reduction='none')

In [19]:
loss = loss_fn(torch.sigmoid(logit), targs)
loss  # reduction none, now gives the BCE per label per item in the batch

tensor([[0.5836, 1.1112, 0.7216, 0.7140, 1.0407],
        [1.1576, 0.3898, 0.5819, 0.6702, 1.1884],
        [0.6334, 1.3086, 0.4543, 0.5928, 0.4659],
        [0.8954, 0.5466, 0.6364, 1.2228, 0.8511],
        [0.4344, 0.3651, 1.1878, 0.3831, 0.7508],
        [0.7620, 0.6174, 0.5119, 0.8727, 0.4796],
        [1.0074, 0.4904, 1.0814, 1.1764, 1.1757],
        [0.6722, 0.4820, 1.1738, 0.3276, 1.0157],
        [0.3354, 0.4324, 0.4923, 0.6751, 0.5781],
        [0.3370, 0.4880, 1.0552, 0.5384, 0.5494]])

In [20]:
loss.mean(-1, keepdim=True)  # we can get the mean loss per item ourselves, though

tensor([[0.8342],
        [0.7976],
        [0.6910],
        [0.8305],
        [0.6243],
        [0.6487],
        [0.9862],
        [0.7343],
        [0.5027],
        [0.5936]])

### Multi-class classification
This extends binary classification to the case where items belong to <mark>one and only one class, and there are more than two classes. The loss here is the categorical cross-entropy (CCE), which works by comparing the predicted probabilities that an item belongs to each of the classes to the true class it belongs to</mark>. This requires that <mark>per item, the predicted probabilities sum to one: the softmax activation will perform this normalisation. **However**, none of the PyTorch CCE losses actually expect a softmaxed input...</mark>

In [21]:
logit = torch.rand(10,5)  # pre-activation values of the DNN output, for a batch size of 10 for 5 classes
targs = torch.randint(0,5, size=(10,))  # random targets for five classes

In [22]:
loss_fn = nn.CrossEntropyLoss()

In [23]:
loss_fn(logit, targs)  # Unlike BCELoss, the CrossEntropyLoss expects the logits. Really this should be called CrossEntropyWithLogitsLoss, but hey ho

tensor(1.5307)

Alternatively, <mark>if you do want to have a softmax output, there is the negative log-likelihood loss, which expects... the log of the softmaxed outputs.</mark>

In [24]:
loss_fn = nn.NLLLoss()

In [25]:
loss_fn(F.softmax(logit, dim=-1).log(), targs)  # the dim=-1 indicates to normalise over the last dimension

tensor(1.5307)

<mark>Equivalently, and with better numerical stability, we can use the log-softmax activation function:</mark>

In [26]:
loss_fn(F.log_softmax(logit, dim=-1), targs)

tensor(1.5307)
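
To make the relationship between these losses explicit, here is a sketch that computes the categorical cross-entropy by hand: take the log-softmax, pick out the log-probability of the true class for each item, and average the negatives. The seed and variable names are only for this example.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)  # fixed seed so the example is repeatable
logit = torch.rand(10, 5)                 # batch of 10, 5 classes
targs = torch.randint(0, 5, size=(10,))   # true class index per item

log_probs = F.log_softmax(logit, dim=-1)                      # (10, 5) log-probabilities
picked = log_probs.gather(1, targs.unsqueeze(1)).squeeze(1)   # log-probability of the true class, (10,)
manual = -picked.mean()                                       # mean negative log-likelihood

builtin_ce = nn.CrossEntropyLoss()(logit, targs)   # expects raw logits
builtin_nll = nn.NLLLoss()(log_probs, targs)       # expects log-probabilities
```

All three values should agree, which is why `CrossEntropyLoss` can be thought of as log-softmax followed by `NLLLoss`.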

#### Multi-d multi-class classification
If predicting the class of 2D data, or higher, the expected tensor shape for:
 - inputs is (batch, class, x, y,...)
 - targets is (batch, x, y,...)

In [27]:
logit = torch.rand(10,5,2,3,4)  # pre-activation values of the DNN output, for a batch size of 10 for 5 classes over a cuboid
targs = torch.randint(0,5, size=(10,2,3,4))  # random targets for five classes

In [28]:
loss_fn = nn.CrossEntropyLoss()

In [29]:
loss_fn(logit, targs)  # Unlike BCELoss, the CrossEntropyLoss expects the logits. Really this should be called CrossEntropyWithLogitsLoss, but hey ho

tensor(1.6519)

In [30]:
loss_fn = nn.NLLLoss()

<mark>Remember to normalise over the class dimension!!</mark>

In [31]:
loss_fn(F.softmax(logit, dim=1).log(), targs)  # remember to normalise over the class dimension

tensor(1.6519)
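
One way to convince yourself of what the multi-dimensional case is doing: with the default mean reduction, it should match treating every cell of the cuboid as its own item in one big batch. The sketch below checks this by moving the class dimension last and flattening the rest (shapes are the same as above; the seed is only for repeatability).

```python
import torch
from torch import nn

torch.manual_seed(0)
logit = torch.rand(10, 5, 2, 3, 4)               # (batch, class, x, y, z)
targs = torch.randint(0, 5, size=(10, 2, 3, 4))  # (batch, x, y, z)

loss_nd = nn.CrossEntropyLoss()(logit, targs)

# Move the class dimension last, then flatten everything else into one big batch
flat_logit = logit.permute(0, 2, 3, 4, 1).reshape(-1, 5)  # (10*2*3*4, 5)
flat_targs = targs.reshape(-1)                            # (10*2*3*4,)
loss_flat = nn.CrossEntropyLoss()(flat_logit, flat_targs)
```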

### Regression
Regression problems involve predicting float targets. <mark>Typically no output activation is used, such that the outputs are linear and can take any value in (-inf, inf). In such problems, the loss should scale with the error on the prediction.</mark> Common choices are:
- squared error (p-t)**2
- absolute error |p-t|

In [34]:
logit = torch.rand(10,1)  # outputs of the DNN
targs = torch.rand(10,1)  # random target values

In [35]:
loss_fn = nn.MSELoss()  # Mean square error

In [36]:
loss_fn(logit, targs)

tensor(0.0881)

In [37]:
loss_fn = nn.L1Loss()  # L1 loss is the absolute error

In [38]:
loss_fn(logit, targs)

tensor(0.2216)
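
With random inputs it is hard to check these numbers by eye, so here is a small worked example on made-up values where the squared and absolute errors can be computed by hand:

```python
import torch
from torch import nn

preds = torch.tensor([[1.0], [2.0], [3.0]])  # made-up predictions
targs = torch.tensor([[1.0], [4.0], [0.0]])  # made-up targets

mse = nn.MSELoss()(preds, targs)  # mean of the squared errors (0, 4, 9) = 13/3
mae = nn.L1Loss()(preds, targs)   # mean of the absolute errors (0, 2, 3) = 5/3
```

The squared error penalises the largest miss (the third item) much more heavily than the absolute error does, which is the usual consideration when choosing between the two.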

## Functional losses
As mentioned, <mark>functional versions of the losses exist, too, e.g.:</mark>

In [39]:
logit = torch.rand(10,1)  # outputs of the DNN
targs = torch.rand(10,1)  # random target values

In [40]:
F.mse_loss(logit, targs)

tensor(0.1216)
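
The other losses used above have functional counterparts as well. As a quick sketch (random data, seed fixed only for repeatability), the functional and class-based forms should give identical values, with the functional form taking `reduction` as a per-call argument rather than at initialisation:

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
logit = torch.rand(10, 5)
cls_targs = torch.randint(0, 5, size=(10,))            # class indices for multi-class
bin_targs = torch.randint(0, 2, size=(10, 5)).float()  # binary targets for multi-label

ce_fn = F.cross_entropy(logit, cls_targs)
ce_cls = nn.CrossEntropyLoss()(logit, cls_targs)

bce_fn = F.binary_cross_entropy_with_logits(logit, bin_targs)
bce_cls = nn.BCEWithLogitsLoss()(logit, bin_targs)

per_elem = F.l1_loss(logit, bin_targs, reduction='none')  # reduction chosen at call time, shape (10, 5)
```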

## Custom loss function
<mark>Class-based losses inherit from `nn.Module` so making our own is quite easy. We can even inherit from existing losses that are close to what we want.</mark>
Let's make a loss that takes the squared-error on predictions and then divides it by the target:

In [41]:
class FractionalMSE(nn.MSELoss):  # Inherit from the basic MSELoss
    def __init__(self):
        super().__init__(reduction='none')  # Set the reduction to none such that the SE shape matches the targets

    def forward(self, input, target):
        se = super().forward(input, target)  # Compute the per-element squared error (no reduction)
        fse = se/target
        return torch.mean(fse)  # return the mean fractional squared error

In [42]:
logit = torch.rand(10,2)  # outputs of the DNN
targs = torch.rand(10,2)  # random target values

In [43]:
loss_fn = FractionalMSE()

In [44]:
loss_fn(logit, targs)

tensor(1.2730)
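
For comparison, the same idea can be written from scratch by inheriting directly from `nn.Module`, which also makes it straightforward to support the usual `reduction` options. This is only a sketch: `FractionalSE` is a hypothetical name, and it assumes strictly positive targets, since it divides by them.

```python
import torch
from torch import nn

class FractionalSE(nn.Module):  # hypothetical name; assumes strictly positive targets
    def __init__(self, reduction='mean'):
        super().__init__()
        if reduction not in ('none', 'mean', 'sum'):
            raise ValueError(f'Unknown reduction: {reduction}')
        self.reduction = reduction

    def forward(self, input, target):
        fse = (input - target) ** 2 / target  # fractional squared error per element
        if self.reduction == 'mean':
            return fse.mean()
        if self.reduction == 'sum':
            return fse.sum()
        return fse  # 'none': same shape as the targets

preds = torch.tensor([[2.0], [3.0]])  # made-up predictions
targs = torch.tensor([[1.0], [2.0]])  # made-up positive targets

mean_loss = FractionalSE()(preds, targs)                 # mean of (1.0, 0.5) = 0.75
per_elem = FractionalSE(reduction='none')(preds, targs)  # shape (2, 1)
```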