# Loss Functions

In this tutorial, we will learn several commonly used loss functions for regression, classification and other tasks.

In [1]:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'                # CPU only

import numpy as np

import torch
import torch.nn as nn

import tensorflow as tf

tf.enable_eager_execution()              # Eager mode for TensorFlow

tfe = tf.contrib.eager                   # it may become `tf.eager` or just use `tf` in 2.0+

## For regression

### L1 loss (absolute difference)

$$\ell_{sum}(\hat{y}, y) = \sum_{i=1}^{B}\left|\hat{y}_i - y_i\right| $$
$$\ell_{mean}(\hat{y}, y) = \frac{1}{B}\ell_{sum}(\hat{y}, y)$$

Here, $B$ is the batch size, $\hat{y}$ denotes the model outputs and $y$ denotes the true values.

#### About the mean version 

Averaging the loss over the batch (so-called mean reduction) serves two purposes:
1. When a regularization term is added to the loss, mean reduction keeps the ratio between the data term and the regularizer fixed, regardless of the batch size.
2. If the batch size changes during training, mean reduction may keep the optimization more stable.

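Moreover, when we update a weight $w$ with SGD,
$$ w := w - \lambda_{mean} \frac{\partial \ell_{mean}}{\partial w} = w - \frac{\lambda_{mean}}{B} \frac{\partial \ell_{sum}}{\partial w},$$
so mean reduction with learning rate $\lambda_{mean}$ is equivalent to sum reduction with learning rate $\lambda_{sum} = \lambda_{mean} / B$.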

We will just use $\ell = \ell_{mean}$ in the following.
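
The relation between the two reductions can be checked numerically. The following is a minimal plain-Python sketch (not the PyTorch or TensorFlow implementation); the helper `l1_grad` and the learning rate `lr_mean` are hypothetical names chosen for illustration. It shows that an SGD step on the mean-reduced L1 loss equals a step on the sum-reduced loss when the learning rate is divided by $B$.

```python
def l1_grad(y_hat, y, reduction="mean"):
    """Gradient of the L1 loss w.r.t. each prediction (0 where y_hat == y)."""
    signs = [(p > t) - (p < t) for p, t in zip(y_hat, y)]
    scale = 1.0 / len(y_hat) if reduction == "mean" else 1.0
    return [scale * s for s in signs]

y_hat = [1.6, 0.5, -2.0]
y = [0.8, 0.0, -1.5]
B = len(y_hat)

lr_mean = 0.3          # hypothetical learning rate for the mean-reduced loss
lr_sum = lr_mean / B   # matching learning rate for the sum-reduced loss

# Size of the SGD step on each prediction under both reductions.
step_mean = [lr_mean * g for g in l1_grad(y_hat, y, "mean")]
step_sum = [lr_sum * g for g in l1_grad(y_hat, y, "sum")]
print(step_mean)
print(step_sum)
```

Both lists agree up to floating-point rounding, which is why switching the reduction is equivalent to rescaling the learning rate.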

In [2]:
# Pytorch

criterion = nn.L1Loss(reduction='elementwise_mean')       # default; it will be 'mean' in 1.0+
input = torch.tensor([1.6, 0.5, -2.], requires_grad=True)
target = torch.tensor([0.8, 0, -1.5])
loss = criterion(input, target)
loss.backward()
print(loss)
print(input.grad)

tensor(0.6000, grad_fn=<L1LossBackward>)
tensor([ 0.3333,  0.3333, -0.3333])


In [3]:
criterion = nn.L1Loss(reduction='sum') 
input = torch.tensor([1.6, 0.5, -2.], requires_grad=True)
target = torch.tensor([0.8, 0, -1.5])
loss = criterion(input, target)
loss.backward()
print(loss)
print(input.grad)

tensor(1.8000, grad_fn=<L1LossBackward>)
tensor([ 1.,  1., -1.])


In [4]:
# TensorFlow

input = tfe.Variable([1.6, 0.5, -2.])
target = [0.8, 0, -1.5]
with tfe.GradientTape() as tape:
    loss = tf.losses.absolute_difference(target, input, 
                                         reduction=tf.losses.Reduction.SUM_BY_NONZERO_WEIGHTS) # default
grad = tape.gradient(loss, input)
print(loss)
print(grad)

tf.Tensor(0.59999996, shape=(), dtype=float32)
tf.Tensor([ 0.33333334  0.33333334 -0.33333334], shape=(3,), dtype=float32)


In [5]:
input = tfe.Variable([1.6, 0.5, -2.])
target = [0.8, 0, -1.5]
with tfe.GradientTape() as tape:
    loss = tf.losses.absolute_difference(target, input, 
                                         reduction=tf.losses.Reduction.SUM) 
grad = tape.gradient(loss, input)
print(loss)
print(grad)

tf.Tensor(1.8, shape=(), dtype=float32)
tf.Tensor([ 1.  1. -1.], shape=(3,), dtype=float32)


### Mean squared error (squared L2 norm)

$$\ell = \frac{1}{B} \sum_i \left(\hat{y}_i - y_i\right)^2$$

Because the errors are squared, it is more sensitive to large errors (outliers) than L1 loss.

In [6]:
# Pytorch
criterion = nn.MSELoss()
input = torch.tensor([1.6, 0.5, -2.], requires_grad=True)
target = torch.tensor([0.8, 0, -1.5])
loss = criterion(input, target)
loss.backward()
print(loss)
print(input.grad)

tensor(0.3800, grad_fn=<MseLossBackward>)
tensor([ 0.5333,  0.3333, -0.3333])


In [7]:
# TensorFlow

input = tfe.Variable([1.6, 0.5, -2.])
target = [0.8, 0, -1.5]
with tfe.GradientTape() as tape:
    loss = tf.losses.mean_squared_error(target, input) 
grad = tape.gradient(loss, input)
print(loss)
print(grad)

tf.Tensor(0.38000003, shape=(), dtype=float32)
tf.Tensor([ 0.53333336  0.33333334 -0.33333334], shape=(3,), dtype=float32)


### Huber loss (Smooth L1 loss)

$$\ell = \frac{1}{B} \sum_i \begin{cases}  0.5\left(\hat{y}_i-y_i\right)^2, & \text{ if } \left|\hat{y}_i-y_i\right|<1 \\ 
\left|\hat{y}_i-y_i\right|-0.5, & \text{ otherwise} \end{cases} $$

It is less sensitive to outliers than `MSELoss` (i.e. more robust) and in some cases prevents exploding gradients (see, e.g., the “Fast R-CNN” paper by Ross Girshick).

Pytorch:
```python
nn.SmoothL1Loss()
```

TensorFlow:
```python
tf.losses.huber_loss()
```
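
To make the robustness claim concrete, here is a small plain-Python sketch of the three regression losses, using the threshold of 1 from the formula above. The functions `l1`, `mse` and `huber` are illustrative helpers, not the library implementations, and the fourth sample is a made-up outlier.

```python
def l1(y_hat, y):
    return sum(abs(p - t) for p, t in zip(y_hat, y)) / len(y)

def mse(y_hat, y):
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

def huber(y_hat, y):
    total = 0.0
    for p, t in zip(y_hat, y):
        r = abs(p - t)
        # Quadratic near zero, linear for residuals of size 1 or more.
        total += 0.5 * r * r if r < 1 else r - 0.5
    return total / len(y)

y_hat = [1.6, 0.5, -2.0, 10.0]   # the last prediction is a made-up outlier
y = [0.8, 0.0, -1.5, 0.0]
for name, fn in [("L1", l1), ("MSE", mse), ("Huber", huber)]:
    print(name, fn(y_hat, y))
```

On this data the MSE is dominated by the single outlier, while the Huber loss stays on the same scale as the L1 loss.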

## For classification

### Negative log likelihood

$$ \ell = \sum_{i=1}^{B} \frac{-w_{y_i}}{\sum_{k=1}^{B} w_{y_k}}\hat{y}_{i,y_i} $$

Pytorch:
```python
nn.NLLLoss()
```
Here, $\hat{y}$ should be log-probabilities of each class. 
In Pytorch, we can use `nn.LogSigmoid` for binary classification or `nn.LogSoftmax` for multiclass classification.



In [8]:
m = nn.LogSoftmax(dim=1)
criterion = nn.NLLLoss()
input = torch.tensor([[1.5, 1., 0.5, 0.2], 
                      [10., -0.3, 0.4, 0.7], 
                      [1.2, 5, -1., 5]], requires_grad=True)
output = m(input)
print(output)

target = torch.tensor([1, 0, 3])
loss = criterion(output, target)
loss.backward()
print(loss)
print(input.grad)

tensor([[ -0.8096,  -1.3096,  -1.8096,  -2.1096],
        [ -0.0002, -10.3002,  -9.6002,  -9.3002],
        [ -4.5055,  -0.7055,  -6.7055,  -0.7055]],
       grad_fn=<LogSoftmaxBackward>)
tensor(0.6718, grad_fn=<NllLossBackward>)
tensor([[ 0.1483, -0.2434,  0.0546,  0.0404],
        [-0.0001,  0.0000,  0.0000,  0.0000],
        [ 0.0037,  0.1646,  0.0004, -0.1687]])


In [10]:
# For comparison: MSE on softmax probabilities with one-hot targets
m = nn.Softmax(dim=1)
criterion = nn.MSELoss()
input = torch.tensor([[1.5, 1., 0.5, 0.2], 
                      [10.0, -0.3, 0.4, 0.7], 
                      [1.2, 0, -1., 5]], requires_grad=True)
target = torch.tensor([[0, 1, 0, 0],
                       [1, 0, 0, 0],
                       [0, 0, 0, 1]])
output = m(input)
print(output)
loss = criterion(output, target.float())
loss.backward()
print(loss)
print(input.grad)

tensor([[0.4450, 0.2699, 0.1637, 0.1213],
        [0.9998, 0.0000, 0.0001, 0.0001],
        [0.0217, 0.0065, 0.0024, 0.9694]], grad_fn=<SoftmaxBackward>)
tensor(0.0645, grad_fn=<MseLossBackward>)
tensor([[ 2.9858e-02, -3.4758e-02,  3.3075e-03,  1.5924e-03],
        [-8.5331e-09,  1.2682e-09,  2.9385e-09,  4.3275e-09],
        [ 1.8379e-04,  3.8858e-05,  1.2642e-05, -2.3529e-04]])


### Cross entropy loss

$$\ell = \sum_{i}\frac{-w_{y_i}}{\sum_{k}w_{y_k}} \left(\hat{y}_{i,y_i} - \log\left(\sum_j \exp\left(\hat{y}_{i, j}\right) \right) \right)$$

In Pytorch, `nn.CrossEntropyLoss()` combines `nn.LogSoftmax()` and `nn.NLLLoss()`.

In TensorFlow, `tf.losses.softmax_cross_entropy` can be used.

For more detail about this loss function, please refer to http://neuralnetworksanddeeplearning.com/chap3.html#what_does_the_cross-entropy_mean_where_does_it_come_from
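
The combination of log-softmax and negative log likelihood can be cross-checked without any framework. The sketch below is plain Python with unit class weights; `log_softmax` and `cross_entropy` are illustrative helpers, not the PyTorch functions. It reuses the logits and targets from the NLL example above, so the result should agree with the loss printed there (about 0.6718).

```python
import math

def log_softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def cross_entropy(logits, targets):
    # Negative log-probability of the target class, averaged over the batch.
    losses = [-log_softmax(row)[t] for row, t in zip(logits, targets)]
    return sum(losses) / len(losses)

logits = [[1.5, 1.0, 0.5, 0.2],
          [10.0, -0.3, 0.4, 0.7],
          [1.2, 5.0, -1.0, 5.0]]
targets = [1, 0, 3]
loss = cross_entropy(logits, targets)
print(round(loss, 4))
```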


In [None]:
criterion = nn.CrossEntropyLoss()
input = torch.tensor([[1.5, 1., 0.5, 0.2], 
                      [10., -0.3, 0.4, 0.7], 
                      [1.2, 5, -1., 5]], requires_grad=True)
target = torch.tensor([1, 0, 3])
loss = criterion(input, target)
loss.backward()
print(loss)
print(input.grad)



### Binary Cross entropy loss

$$\ell = \frac{1}{B}\sum_{i}-w_{i} \left[y_i\log\hat{y}_{i} + \left(1 - y_i\right) \log\left( 1 - \hat{y}_i \right) \right]$$

Pytorch:
```python
nn.BCELoss()
```

TensorFlow:
```python
tf.losses.sigmoid_cross_entropy()
```
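
As a rough illustration, binary cross entropy with unit weights ($w_i = 1$) can be written in a few lines of plain Python. The helpers `sigmoid` and `binary_cross_entropy` and the clipping constant `eps` are assumptions of this sketch, not library code. Note that `nn.BCELoss` expects probabilities, whereas `tf.losses.sigmoid_cross_entropy` takes logits, so the sigmoid is applied explicitly here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(probs, labels, eps=1e-12):
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

logits = [2.0, -1.0, 0.0]   # made-up raw scores
labels = [1, 0, 1]
probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(probs, labels)
print(loss)
```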


## For ranking

### Margin ranking loss (Hinge loss)

$$\ell = \frac{1}{B}\sum_{i} \max\left(0, -y_i \left(\hat{y}_{i, 1} - \hat{y}_{i, 2}\right) + m\right) $$

Here, $y_i = 1$ means that $\hat{y}_{i,1}$ should be ranked higher than $\hat{y}_{i,2}$, and $y_i = -1$ means the opposite; $m$ is the margin between the two inputs.

Pytorch:
```python
nn.MarginRankingLoss()
```
Also see `nn.TripletMarginLoss()` (anchor-based) and `nn.MultiMarginLoss()` (for multi-class classification).

TensorFlow:
```python
tf.losses.hinge_loss()
```
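
A plain-Python sketch of the pairwise hinge term $\max\left(0, -y\left(\hat{y}_1 - \hat{y}_2\right) + m\right)$, averaged over the batch, may help to see when a pair is penalized. The function name and the sample scores are made up for illustration; this is not the `nn.MarginRankingLoss` source.

```python
def margin_ranking_loss(x1, x2, y, margin=0.0):
    # A pair contributes only if it is ranked wrongly or within the margin.
    terms = [max(0.0, -yi * (a - b) + margin) for a, b, yi in zip(x1, x2, y)]
    return sum(terms) / len(terms)

x1 = [0.9, 0.2, 0.4]
x2 = [0.1, 0.8, 0.5]
y = [1, -1, 1]   # 1: x1 should rank higher, -1: x2 should rank higher
loss = margin_ranking_loss(x1, x2, y, margin=0.5)
print(loss)
```

Only the third pair contributes here: it is ranked the wrong way round, so it is penalized by its score gap plus the margin.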

## For similarity

### Cosine loss
$$\ell = \frac{1}{B}\sum_i \begin{cases} 1 - \cos\left(\hat{y}_{i,1},\hat{y}_{i,2} \right),& \text{ if similar} \\ \max\left(0, \cos\left(\hat{y}_{i,1},\hat{y}_{i,2} \right) - m\right),& \text{if dissimilar} \end{cases} $$

Pytorch:
```python
nn.CosineEmbeddingLoss()
```

TensorFlow:
```python
tf.losses.cosine_distance()
```
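
The two cases of the cosine loss can be sketched in plain Python as follows. The label convention (1 for similar, -1 for dissimilar), the helper names and the toy vectors are assumptions of this sketch rather than a copy of either library.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_embedding_loss(pairs, labels, margin=0.0):
    terms = []
    for (u, v), label in zip(pairs, labels):
        c = cosine(u, v)
        # Similar pairs are pulled towards cos = 1; dissimilar pairs are
        # penalized only while their cosine exceeds the margin.
        terms.append(1.0 - c if label == 1 else max(0.0, c - margin))
    return sum(terms) / len(terms)

pairs = [([1.0, 0.0], [1.0, 0.0]),   # same direction, labelled similar
         ([1.0, 0.0], [0.0, 1.0]),   # orthogonal, labelled dissimilar
         ([1.0, 0.0], [1.0, 0.0])]   # same direction, labelled dissimilar
labels = [1, -1, -1]
loss = cosine_embedding_loss(pairs, labels, margin=0.5)
print(loss)
```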

## For distribution

### Kullback-Leibler (KL) divergence

$$\begin{align}
\ell &= \frac{1}{B}\sum_i y_i\log\frac{y_i}{\hat{y}_i} \\
     &= \frac{1}{B}\sum_i \left(y_i\log y_i - y_i\log\hat{y}_i\right)\end{align}$$
     
Pytorch:
```python
nn.KLDivLoss()             # the input given is expected to contain log-probabilities
```

TensorFlow:
```python
tf.distributions.kl_divergence()
```
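
For two discrete distributions, the divergence can be computed directly from the definition. This plain-Python sketch sums over the categories of a single pair of distributions (no batch averaging); `kl_divergence` and the example distributions are illustrative only.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    # Terms with p_i == 0 contribute nothing, by the convention 0 * log 0 = 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]     # target distribution y
q = [0.25, 0.75]   # predicted distribution y_hat
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(kl_pq, kl_qp)
```

The two printed values differ, which illustrates that the KL divergence is not symmetric in its arguments.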

## For count

### Log Poisson loss

$$\begin{align}
\ell &= \frac{1}{B}\sum_i \left[\hat{y}_i - y_i\log\hat{y}_i + \log \left(y_i!\right) \right]\\
     &\approx \frac{1}{B}\sum_i \left[\hat{y}_i - y_i\log\hat{y}_i + y_i\log y_i - y_i + 0.5\log(2\pi y_i)\right]
\end{align}$$

It assumes that $y \sim \mathrm{Poisson}(\hat{y})$. The term $y\log y - y + 0.5\log(2\pi y)$ is [Stirling's approximation](https://en.wikipedia.org/wiki/Stirling%27s_approximation) of $\log(y!)$. When the full loss is computed, this term is added only for targets with $y > 1$.


Pytorch:
```python
nn.PoissonNLLLoss()                 # full=False by default
```

TensorFlow:
```python
tf.nn.log_poisson_loss()            # compute_full_loss=False by default
```
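
To see how good Stirling's approximation of $\log(y!)$ is, the sketch below compares it with the exact value from `math.lgamma`. It is plain Python; `poisson_nll` and `stirling_log_factorial` are illustrative helpers, and the sketch assumes the prediction is the rate itself rather than its logarithm.

```python
import math

def stirling_log_factorial(y):
    # Stirling: log(y!) is approximately y*log(y) - y + 0.5*log(2*pi*y)
    return y * math.log(y) - y + 0.5 * math.log(2 * math.pi * y)

def poisson_nll(rate, target, full=False):
    """Per-sample Poisson NLL for a predicted rate (not a log-rate)."""
    loss = rate - target * math.log(rate)
    if full and target > 1:
        loss += stirling_log_factorial(target)
    return loss

y = 10
exact = math.lgamma(y + 1)           # exact log(10!)
approx = stirling_log_factorial(y)
print(exact, approx)
print(poisson_nll(8.0, y), poisson_nll(8.0, y, full=True))
```

The extra term does not depend on the prediction, so it changes the reported loss value but not the gradient with respect to $\hat{y}$.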