## Loss Functions
A loss function is a mathematical function used to measure how well a model performs on a dataset. It takes a ground-truth value (y) and a prediction (ŷ) as input and produces a real-valued score. The higher the score, the worse the model's prediction.

#### Loss functions in PyTorch

PyTorch's loss functions are packaged in the torch.nn module, which also provides nn.Module, the base class for all neural networks. This makes it easy to add a loss function to model training. The sections below cover the following losses:

- Mean squared error (MSE)
- L1 loss
- Cross entropy
- Binary cross entropy
- Negative log likelihood
- Smooth L1
- Hinge embedding
- Margin ranking
- Triplet margin
- Cosine embedding
- Custom loss function

### Mean Square Error (MSE)

MSE is similar to mean absolute error (MAE), which is covered in the next section. Instead of computing the absolute difference between values in the prediction tensor and the target, as MAE does, it computes the squared difference between each value in the prediction tensor and the corresponding value in the target tensor. As a result, large differences are penalized more heavily and small differences less. This also makes MSE less robust to outliers and noise than MAE.
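To make the difference in outlier sensitivity concrete, here is a minimal sketch in plain Python rather than PyTorch, so the arithmetic stays visible. The helper names mean_squared_error and mean_absolute_error and the sample values are illustrative assumptions, not part of the original notebook.

```python
def mean_squared_error(pred, target):
    # Average of the squared element-wise differences.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mean_absolute_error(pred, target):
    # Average of the absolute element-wise differences.
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


target = [1.0, 2.0, 3.0, 4.0]
close = [1.5, 2.5, 3.5, 4.5]    # every prediction is off by 0.5
outlier = [1.0, 2.0, 3.0, 8.0]  # a single prediction is off by 4.0

print(mean_squared_error(close, target), mean_absolute_error(close, target))
print(mean_squared_error(outlier, target), mean_absolute_error(outlier, target))
```

With these values the single outlier raises the MSE from 0.25 to 4.0, while the MAE only rises from 0.5 to 1.0.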

In [1]:
import torch
import torch.nn as nn
import numpy as np

In [2]:
torch.randn(3, 5)

tensor([[ 0.1142, -0.3803,  0.5391,  0.9389,  0.7259],
        [ 1.5444, -0.3720, -1.3497, -0.1036,  0.7893],
        [ 0.7241,  0.6515,  0.9762,  0.1345,  0.3507]])

In [4]:
mse_loss = nn.MSELoss(reduction='mean')  # size_average and reduce are deprecated in favor of reduction

x = torch.randn(3, 5, requires_grad=True) # input
target = torch.randn(3, 5)
output = mse_loss(x, target)
print(output)

tensor(1.9189, grad_fn=<MseLossBackward0>)


### L1 loss function

The L1 loss function computes the mean absolute error (MAE) between the predicted tensor and the target. It first calculates the absolute difference between each predicted value and the corresponding target value, then sums these differences and divides by the number of elements to obtain the MAE. Because errors are not squared, L1 loss is more robust to outliers and noise than MSE.

In [5]:
# reduction methods
# mean: (default) compute the average of the output
# sum: output is summed 
# none: no reduction to output

l1_loss = nn.L1Loss(reduction='mean')

x = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
output = l1_loss(x, target)
print(output)

tensor(1.1165, grad_fn=<MeanBackward0>)
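The three reduction modes listed in the cell above can be sketched in plain Python. This is an illustration of what each mode returns, not PyTorch's implementation; the function name l1_loss and the sample values are assumptions made for this example.

```python
def l1_loss(pred, target, reduction="mean"):
    # Element-wise absolute differences.
    diffs = [abs(p - t) for p, t in zip(pred, target)]
    if reduction == "none":
        return diffs                    # one loss value per element
    if reduction == "sum":
        return sum(diffs)               # total absolute error
    if reduction == "mean":
        return sum(diffs) / len(diffs)  # mean absolute error (the default)
    raise ValueError(f"unknown reduction: {reduction}")


pred = [2.0, 0.0, 5.0]
target = [1.0, 2.0, 5.0]

print(l1_loss(pred, target, reduction="none"))
print(l1_loss(pred, target, reduction="sum"))
print(l1_loss(pred, target, reduction="mean"))
```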


### Cross entropy

Cross-entropy loss is used in classification problems with a number of discrete classes. It measures the difference between two probability distributions over the same set of classes: the predicted distribution and the true one. Usually, when using cross-entropy, the last layer of the network is a softmax, which ensures that each output is a probability between 0 and 1 and that the outputs sum to 1.

Cross-entropy and the expression for it have origins in information theory. We want the probability of the correct class to be close to 1, whereas the other classes have a probability close to 0.

For PyTorch's CrossEntropyLoss, it helps to understand the relationship between the network output and the loss function. Four facts matter:
1. Floating-point numbers have a limit on how small or how large they can be.
2. If the input to the exponential function in the softmax formula is a large negative number, the result is an exponentially small number; if it is a large positive number, the result is an exponentially large number.
3. The network's output is assumed to be the vector just before the softmax function is applied (the logits).
4. The log function is the inverse of the exponential function, so log(exp(x)) is equal to x.

PyTorch uses these facts to simplify the math. Because the exponential at the core of softmax and the log in the cross-entropy computation cancel, the two can be combined into a single, more numerically stable computation that avoids very small or very large intermediate values. As a consequence, the raw network output, without a softmax layer, is passed directly to CrossEntropyLoss() during training. Once the network has been trained, the softmax function can be applied to the output to obtain a probability distribution.
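One common way to realize this simplification is the log-sum-exp trick, sketched below in plain Python for a single sample. It illustrates the idea under that assumption and is not PyTorch's actual implementation; the function name cross_entropy_from_logits and the sample logits are made up for this example.

```python
import math


def cross_entropy_from_logits(logits, target_index):
    # Subtracting the maximum keeps every exponent <= 0, so exp() cannot overflow.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # -log(softmax(logits)[target]) simplifies to log_sum_exp - logits[target].
    return log_sum_exp - logits[target_index]


# A naive softmax would call math.exp(1000.0), which raises OverflowError.
logits = [1000.0, 0.0, -1000.0]
print(cross_entropy_from_logits(logits, 0))  # correct class has the largest logit
print(cross_entropy_from_logits(logits, 1))  # correct class is far behind
```

The loss is computed from the raw logits directly, so no probability is ever materialized as a tiny or huge floating-point number.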

In [6]:
# CrossEntropyLoss() assumes that each input has one particular class
# and each class has a unique index
# therefore ground truth vector (targets) is created as a vector of integers
# and has index representing the correct class for each input

ce_loss = nn.CrossEntropyLoss()
outputs = torch.randn(3, 5, requires_grad=True) 
targets = torch.tensor([1, 0, 3], dtype=torch.int64) 
loss = ce_loss(outputs, targets)
print(loss)

tensor(2.0496, grad_fn=<NllLossBackward0>)


### Binary cross entropy

Binary cross-entropy (BCE) loss is a special case of cross-entropy loss used for binary classification. Usually, when using BCE loss, the last layer of the network is a sigmoid, which ensures that the output is a value between 0 and 1 that can be interpreted as the probability of the positive class.
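The quantity being averaged is -[y·log(p) + (1 - y)·log(1 - p)] for each probability p and label y. Here is a minimal plain-Python sketch of that formula; the function name binary_cross_entropy and the sample values are illustrative assumptions, and the sketch omits any safeguards against p being exactly 0 or 1.

```python
import math


def binary_cross_entropy(probabilities, targets):
    # Mean of -[y * log(p) + (1 - y) * log(1 - p)] over all samples.
    total = 0.0
    for p, y in zip(probabilities, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probabilities)


confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
print(confident_right, confident_wrong)
```

Confident predictions that match the labels give a small loss, while confident predictions that contradict them are penalized heavily.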

In [7]:
bce_loss = nn.BCELoss()
sigmoid = nn.Sigmoid()
probabilities = sigmoid(torch.randn(4, 1, requires_grad=True)) 
targets = torch.tensor([1, 0, 1, 0], dtype=torch.float32).view(4, 1)
loss = bce_loss(probabilities, targets)
print(probabilities)
print(f'{loss=}')

tensor([[0.1922],
        [0.5377],
        [0.6821],
        [0.6447]], grad_fn=<SigmoidBackward0>)
loss=tensor(0.9595, grad_fn=<BinaryCrossEntropyBackward0>)


### Negative Log likelihood

The NLL loss function works much like the cross-entropy loss function; in fact, cross-entropy loss combines a log-softmax layer and NLL loss. This means NLL loss yields the cross-entropy loss value when the last layer of the network is a log-softmax layer instead of a plain softmax layer.
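The two-step structure, log-softmax followed by NLL, can be sketched in plain Python as follows. The helper names log_softmax and nll_loss mirror the PyTorch layer names but are hypothetical stand-ins written for this example, and the logits are made up.

```python
import math


def log_softmax(logits):
    # Stable log-softmax: subtract the log of the summed exponentials.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]


def nll_loss(log_probs_batch, targets):
    # Pick the log-probability of the correct class, negate it,
    # and average over the batch.
    picked = [-row[t] for row, t in zip(log_probs_batch, targets)]
    return sum(picked) / len(picked)


logits_batch = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]  # N x C = 2 x 3
targets = [0, 2]
log_probs = [log_softmax(row) for row in logits_batch]
print(nll_loss(log_probs, targets))
```

NLL itself only selects and averages values; all of the probabilistic work is done by the log-softmax step that precedes it.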

In [8]:
m = nn.LogSoftmax(dim=1)
loss = nn.NLLLoss()

x = torch.randn(3, 5, requires_grad=True)           # input size N x C = 3 x 5
target = torch.tensor([1, 0, 4])                    # each element satisfies 0 <= value < C
output = loss(m(x), target)
output.backward()

print(output)

tensor(2.0164, grad_fn=<NllLossBackward0>)


### Smooth L1 loss

The smooth L1 loss function combines the benefits of MSE loss and MAE loss through a threshold value beta. When the absolute difference between the ground truth and the prediction is below beta, the criterion uses a squared difference, much like MSE loss. In this region the loss is differentiable everywhere and the gradient shrinks as the error shrinks, which is convenient during gradient descent. For large errors, however, the gradient of a squared term grows with the error and can destabilize training. When the absolute difference is at least beta, the criterion therefore switches to an absolute difference, like MAE, whose gradient has constant magnitude, so large errors cannot produce large gradients.
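The piecewise rule can be written out in a few lines of plain Python. This sketch follows the formula documented for nn.SmoothL1Loss (0.5·d²/beta below the threshold, d - 0.5·beta otherwise, with beta defaulting to 1.0); the function name smooth_l1 and the sample values are assumptions made for illustration.

```python
def smooth_l1(pred, target, beta=1.0):
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            # Quadratic region: behaves like a scaled squared error.
            total += 0.5 * diff ** 2 / beta
        else:
            # Linear region: behaves like absolute error, shifted so the
            # two pieces meet at diff == beta.
            total += diff - 0.5 * beta
    return total / len(pred)


print(smooth_l1([0.5], [0.0]))  # small error, quadratic branch
print(smooth_l1([3.0], [0.0]))  # large error, linear branch
```

The - 0.5·beta offset in the linear branch makes the two pieces agree at the threshold, so the loss has no jump there.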

In [9]:
sl1_loss = nn.SmoothL1Loss()
x = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
output = sl1_loss(x, target)
print(output)

tensor(0.7836, grad_fn=<SmoothL1LossBackward0>)


### Hinge embedding loss

Hinge embedding loss measures whether two inputs are similar or dissimilar. It takes an input tensor, typically a distance between the two inputs, and a label tensor containing values of 1 or -1. It is mostly used in problems involving non-linear embeddings and semi-supervised learning.
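As a rough illustration of how the label changes the penalty, here is a plain-Python sketch of the per-element rule: the loss is x when the label is 1, and max(0, margin - x) when the label is -1. The function name hinge_embedding and the sample distances are assumptions made for this example; the margin of 1.0 matches the PyTorch default.

```python
def hinge_embedding(x, y, margin=1.0):
    losses = []
    for xi, yi in zip(x, y):
        if yi == 1:
            # Similar pair: any distance is penalized directly.
            losses.append(xi)
        else:
            # Dissimilar pair: penalized only when closer than the margin.
            losses.append(max(0.0, margin - xi))
    return sum(losses) / len(losses)


distances = [0.2, 0.2, 1.5]
labels = [1, -1, -1]
print(hinge_embedding(distances, labels))
```

In this example the third pair is dissimilar and already farther apart than the margin, so it contributes nothing to the loss.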

In [10]:
x = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5).sign()                   # labels must be 1 or -1

hinge_loss = nn.HingeEmbeddingLoss()
output = hinge_loss(x, target)
output.backward()

print(f'{x=}')
print(f'{target=}')
print(f'{output=}')

x=tensor([[-0.0382, -1.1631, -3.0978,  1.3347,  0.9333],
        [-1.1167,  0.1461, -0.4923, -0.3534,  0.5138],
        [ 1.3475,  0.2561, -2.8684,  1.5239, -0.0787]], requires_grad=True)
target=tensor([[-1., -1.,  1., -1., -1.],
        [-1.,  1.,  1., -1.,  1.],
        [-1.,  1., -1.,  1., -1.]])
output=tensor(0.7023, grad_fn=<MeanBackward0>)


### Margin ranking loss

Margin ranking loss is a ranking loss, used to measure the relative distance between inputs in a dataset. It takes two inputs and a label containing only 1 or -1.
- If the label is 1, then it is assumed that the first input should have a higher ranking than the second input.
- If the label is -1, then it is assumed that the second input should have a higher ranking than the first input.
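The two label cases above reduce to a single expression, max(0, -y·(x1 - x2) + margin), sketched here in plain Python. The function name margin_ranking and the sample scores are illustrative assumptions; the margin of 0.0 matches the PyTorch default.

```python
def margin_ranking(x1, x2, y, margin=0.0):
    # A pair contributes zero loss when it is ranked in the order the
    # label asks for, by at least the margin.
    losses = [max(0.0, -yi * (a - b) + margin) for a, b, yi in zip(x1, x2, y)]
    return sum(losses) / len(losses)


scores1 = [2.0, 1.0]
scores2 = [1.0, 3.0]

# Label 1 for both pairs: the second pair is ranked the wrong way round.
print(margin_ranking(scores1, scores2, [1, 1]))
# Label -1 for the second pair: now both pairs agree with their labels.
print(margin_ranking(scores1, scores2, [1, -1]))
```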

In [11]:
mr_loss = nn.MarginRankingLoss()
input1 = torch.randn(3, requires_grad=True)
input2 = torch.randn(3, requires_grad=True)
target = torch.randn(3).sign()
output = mr_loss(input1, input2, target)
print(f'{input1=}')
print(f'{input2=}')
print(f'{output=}')

input1=tensor([-0.2377, -0.7254, -2.4694], requires_grad=True)
input2=tensor([-0.0987, -1.4337,  1.0014], requires_grad=True)
output=tensor(0.2824, grad_fn=<MeanBackward0>)


### Triplet margin loss

Triplet margin loss measures the relative similarity between data points using triplets of training samples. Each triplet consists of an anchor sample, a positive sample and a negative sample. The loss has two goals:
1. To make the distance between the positive sample and the anchor as small as possible.
2. To make the distance between the anchor and the negative sample greater than the distance between the positive sample and the anchor plus a margin value.

Usually, the positive sample belongs to the same class as the anchor, but the negative sample does not. We use triplet margin loss so that the model learns a high similarity between the anchor and the positive sample and a low similarity between the anchor and the negative sample.
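For a single triplet, the two goals combine into max(d(anchor, positive) - d(anchor, negative) + margin, 0). The plain-Python sketch below uses Euclidean distance, which corresponds to p=2; the function names euclidean and triplet_margin and the sample points are assumptions made for this example, and the small epsilon PyTorch adds inside its distance computation is ignored.

```python
import math


def euclidean(u, v):
    # Straight-line (L2) distance between two points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin(anchor, positive, negative, margin=1.0):
    # Zero loss once the negative is farther from the anchor than the
    # positive by at least the margin.
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]                                 # distance 1.0 from the anchor
print(triplet_margin(anchor, positive, [3.0, 4.0]))  # negative is far away
print(triplet_margin(anchor, positive, [0.0, 1.5]))  # negative is too close
```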

In [12]:
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)
anchor = torch.randn(100, 128, requires_grad=True)
positive = torch.randn(100, 128, requires_grad=True)
negative = torch.randn(100, 128, requires_grad=True)
output = triplet_loss(anchor, positive, negative)
print(output) 

tensor(1.1436, grad_fn=<MeanBackward0>)


### Cosine embedding loss

Cosine embedding loss measures the loss given inputs x1 and x2 and a label tensor y containing values 1 or -1. It measures how similar or dissimilar two inputs are by computing the cosine distance between the two vectors. The cosine distance depends only on the angle between them: the smaller the angle, the more similar the inputs.
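For one pair of vectors, the rule is 1 - cos(x1, x2) when the label is 1 and max(0, cos(x1, x2) - margin) when the label is -1. Here is a minimal plain-Python sketch; the function names cosine_similarity and cosine_embedding and the sample vectors are illustrative assumptions, and the margin of 0.0 matches the PyTorch default.

```python
import math


def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_embedding(x1, x2, y, margin=0.0):
    cos = cosine_similarity(x1, x2)
    if y == 1:
        return 1.0 - cos            # similar pair: penalize any angle between them
    return max(0.0, cos - margin)   # dissimilar pair: penalize similarity above the margin


same = ([1.0, 0.0], [1.0, 0.0])
orthogonal = ([1.0, 0.0], [0.0, 1.0])
print(cosine_embedding(*same, 1), cosine_embedding(*orthogonal, 1))
print(cosine_embedding(*same, -1), cosine_embedding(*orthogonal, -1))
```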

In [13]:
cosine_loss = nn.CosineEmbeddingLoss()
input1 = torch.randn(3, 6, requires_grad=True)
input2 = torch.randn(3, 6, requires_grad=True)
target = torch.randn(3).sign()
output = cosine_loss(input1, input2, target)
print(f'{input1=}')
print(f'{input2=}')
print(f'{output=}')

input1=tensor([[ 0.3354, -0.7056, -0.0219, -0.0693, -0.8870, -1.7239],
        [-0.3579,  0.7590,  0.5300,  0.2401, -0.9892,  0.4002],
        [ 1.3142, -0.3310, -0.1450, -0.5311,  0.0077,  0.2826]],
       requires_grad=True)
input2=tensor([[ 0.0281, -0.3903,  0.7034,  0.3182, -0.3571,  0.1982],
        [-0.4479,  0.1134,  0.9433, -0.0840, -0.0399, -2.2796],
        [ 0.3697,  0.0200, -1.2353,  0.2872,  0.6938,  0.2455]],
       requires_grad=True)
output=tensor(0.7287, grad_fn=<MeanBackward0>)


### Custom function

PyTorch also lets us build our own loss function from tensor operations. As an example, here is a custom loss function that calculates the mean squared error; it returns the same value as nn.MSELoss.

In [14]:
def custom_mean_square_error(y_pred, target):
    # element-wise squared difference, averaged over all elements
    square_difference = torch.square(y_pred - target)
    loss_value = torch.mean(square_difference)
    return loss_value

In [15]:
y_pred = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
pytorch_loss = nn.MSELoss()
p_loss = pytorch_loss(y_pred, target)
custom_loss = custom_mean_square_error(y_pred, target)
print(f'{custom_loss=}')
print(f'{p_loss=}')

custom_loss=tensor(1.2872, grad_fn=<MeanBackward0>)
p_loss=tensor(1.2872, grad_fn=<MseLossBackward0>)
