Defining the cost (loss) function is a critical step in model training. The purpose of backpropagation during training is to compute the gradients that the optimizer uses to reduce this cost.

1) Loss <br>
&emsp;     a) Regression <br>
&emsp;&emsp;         i)   Mean Absolute Error (L1 loss) <br>
&emsp;&emsp;         ii)  Mean Squared Error (L2 loss) <br>
&emsp;&emsp;         iii) Root Mean Squared Error <br>
&emsp;&emsp;         iv)  Huber loss <br>
&emsp;    b) Classification <br>
&emsp;&emsp;        i)   Cross Entropy (Binary, Multiclass, Multilabel)<br>
&emsp;&emsp;        ii)  KL divergence <br>
&emsp;&emsp;        iii) Hinge loss <br>
&emsp;&emsp;        iv)  Focal loss <br>
&emsp;    c) Ranking <br>
&emsp;&emsp;        i)   Margin Ranking loss<br>
&emsp;&emsp;        ii)  Triplet Ranking loss <br>
2) Metrics <br>
&emsp;    a) Regression <br>
&emsp;    b) Classification <br>
&emsp;&emsp;        i)   Accuracy, Precision, Recall, F1, Sensitivity, Specificity, AUC<br>
&emsp;    c) Ranking <br>
&emsp;&emsp;        i)   MAP (Mean Average Precision)<br>
&emsp;&emsp;        ii)  NDCG (Normalized Discounted Cumulative Gain)<br>

In [114]:
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchmetrics.classification import BinaryHingeLoss
from torchmetrics.classification import MulticlassHingeLoss
from sklearn.metrics import ndcg_score

# Loss

## Regression Loss

### MAE (Mean Absolute Error)

--> It is also called L1 loss. It computes the average of absolute differences between actual values and predicted values. 

$
\begin{align}
& L = \frac{1}{N}\sum_{i=1}^{N}|\hat{y_i}-y_i| \\
& L - \text{loss} \\
& i - i^{th} \text{ training sample} \\
& y_{i} - \text{actual value of the } i^{th} \text{ training sample} \\
& \hat{y_{i}} - \text{predicted value of the } i^{th} \text{ training sample} \\
& N - \text{total number of training samples}
\end{align}
$

In [9]:
pred = np.array([1.7204,  3.0735, -0.5096])
pred = torch.tensor(pred, requires_grad=True)

target = np.array([1.2,  1.9, 0.0])
target = torch.tensor(target)

loss = nn.L1Loss()
output = loss(pred, target)
output.backward()

print('predicted: ', pred)
print('target: ', target)
print('output: ', output)

predicted:  tensor([ 1.7204,  3.0735, -0.5096], dtype=torch.float64, requires_grad=True)
target:  tensor([1.2000, 1.9000, 0.0000], dtype=torch.float64)
output:  tensor(0.7345, dtype=torch.float64, grad_fn=<MeanBackward0>)


In [10]:
## if we have multi-output regression (one row per sample, one column per output)

pred = np.array([[ 0.1158,  0.6110,  1.4007,  0.9448, -1.3474],
                [ 0.5122,  1.9972, -0.4636,  0.4942, -0.7275],
                [ 0.7347,  1.0200,  0.6460, -1.0006,  0.5116]])
pred = torch.tensor(pred, requires_grad=True)

target = np.array([[-0.0830, -0.7957,  1.0230, -1.5763,  0.3064],
                   [ 0.0061, -0.0928,  0.4557, -1.2883,  0.0917],
                   [-0.3364, -1.3713,  0.1138, -0.3415,  0.4300]])
target = torch.tensor(target)

loss = nn.L1Loss()
output = loss(pred, target)
output.backward()

print('predicted: ', pred)
print('target: ', target)
print('output: ', output)

predicted:  tensor([[ 0.1158,  0.6110,  1.4007,  0.9448, -1.3474],
        [ 0.5122,  1.9972, -0.4636,  0.4942, -0.7275],
        [ 0.7347,  1.0200,  0.6460, -1.0006,  0.5116]], dtype=torch.float64,
       requires_grad=True)
target:  tensor([[-0.0830, -0.7957,  1.0230, -1.5763,  0.3064],
        [ 0.0061, -0.0928,  0.4557, -1.2883,  0.0917],
        [-0.3364, -1.3713,  0.1138, -0.3415,  0.4300]], dtype=torch.float64)
output:  tensor(1.1340, dtype=torch.float64, grad_fn=<MeanBackward0>)


### MSE (Mean Squared Error) (Quadratic loss)

--> It is also called L2 loss. It computes the average of the squared differences between actual values and predicted values. 

$
\begin{align}
& L = \frac{1}{N}\sum_{i=1}^{N}(\hat{y_i}-y_i)^2 \\
& L - \text{loss} \\
& i - i^{th} \text{ training sample} \\
& y_{i} - \text{actual value of the } i^{th} \text{ training sample} \\
& \hat{y_{i}} - \text{predicted value of the } i^{th} \text{ training sample} \\
& N - \text{total number of training samples}
\end{align}
$

In [4]:
pred = np.array([1.7204,  3.0735, -0.5096])
pred = torch.tensor(pred, requires_grad=True)

target = np.array([1.2,  1.9, 0.0])
target = torch.tensor(target)

loss = nn.MSELoss()
output = loss(pred, target)
output.backward()

print('predicted: ', pred)
print('target: ', target)
print('output: ', output)

predicted:  tensor([ 1.7204,  3.0735, -0.5096], dtype=torch.float64, requires_grad=True)
target:  tensor([1.2000, 1.9000, 0.0000], dtype=torch.float64)
output:  tensor(0.6359, dtype=torch.float64, grad_fn=<MseLossBackward0>)


In [7]:
## if we have multi-output regression (one row per sample, one column per output)

pred = np.array([[ 0.1158,  0.6110,  1.4007,  0.9448, -1.3474],
                [ 0.5122,  1.9972, -0.4636,  0.4942, -0.7275],
                [ 0.7347,  1.0200,  0.6460, -1.0006,  0.5116]])
pred = torch.tensor(pred, requires_grad=True)

target = np.array([[-0.0830, -0.7957,  1.0230, -1.5763,  0.3064],
                   [ 0.0061, -0.0928,  0.4557, -1.2883,  0.0917],
                   [-0.3364, -1.3713,  0.1138, -0.3415,  0.4300]])
target = torch.tensor(target)

loss = nn.MSELoss()
output = loss(pred, target)
output.backward()

print('predicted: ', pred)
print('target: ', target)
print('output: ', output)

predicted:  tensor([[ 0.1158,  0.6110,  1.4007,  0.9448, -1.3474],
        [ 0.5122,  1.9972, -0.4636,  0.4942, -0.7275],
        [ 0.7347,  1.0200,  0.6460, -1.0006,  0.5116]], dtype=torch.float64,
       requires_grad=True)
target:  tensor([[-0.0830, -0.7957,  1.0230, -1.5763,  0.3064],
        [ 0.0061, -0.0928,  0.4557, -1.2883,  0.0917],
        [-0.3364, -1.3713,  0.1138, -0.3415,  0.4300]], dtype=torch.float64)
output:  tensor(1.8773, dtype=torch.float64, grad_fn=<MseLossBackward0>)


### RMSE (Root Mean Squared Error)
--> PyTorch does not provide a built-in RMSE loss. One simple way to implement it is to take the square root of `nn.MSELoss`, as below.

In [15]:
class RMSELoss(nn.Module):
    def __init__(self, eps=1e-6):
        super().__init__()
        self.mse = nn.MSELoss()
        self.eps = eps
        
    def forward(self,yhat,y):
        #  if the mse=0, there will be issue during the backward pass as you multiply 0 by infinity (derivative of sqrt at 0).
        mse_val = self.mse(yhat,y)
        mse_val = (mse_val + self.eps) if mse_val == 0 else mse_val
        loss = torch.sqrt(mse_val)
        return loss

In [16]:
pred = np.array([1.7204,  3.0735, -0.5096])
pred = torch.tensor(pred, requires_grad=True)

target = np.array([1.2,  1.9, 0.0])
target = torch.tensor(target)

loss = RMSELoss()
output = loss(pred, target)
output.backward()

print('predicted: ', pred)
print('target: ', target)
print('output: ', output)

predicted:  tensor([ 1.7204,  3.0735, -0.5096], dtype=torch.float64, requires_grad=True)
target:  tensor([1.2000, 1.9000, 0.0000], dtype=torch.float64)
output:  tensor(0.7974, dtype=torch.float64, grad_fn=<SqrtBackward0>)


In [17]:
## if we have multi-output regression (one row per sample, one column per output)

pred = np.array([[ 0.1158,  0.6110,  1.4007,  0.9448, -1.3474],
                [ 0.5122,  1.9972, -0.4636,  0.4942, -0.7275],
                [ 0.7347,  1.0200,  0.6460, -1.0006,  0.5116]])
pred = torch.tensor(pred, requires_grad=True)

target = np.array([[-0.0830, -0.7957,  1.0230, -1.5763,  0.3064],
                   [ 0.0061, -0.0928,  0.4557, -1.2883,  0.0917],
                   [-0.3364, -1.3713,  0.1138, -0.3415,  0.4300]])
target = torch.tensor(target)

loss = RMSELoss()
output = loss(pred, target)
output.backward()

print('predicted: ', pred)
print('target: ', target)
print('output: ', output)

predicted:  tensor([[ 0.1158,  0.6110,  1.4007,  0.9448, -1.3474],
        [ 0.5122,  1.9972, -0.4636,  0.4942, -0.7275],
        [ 0.7347,  1.0200,  0.6460, -1.0006,  0.5116]], dtype=torch.float64,
       requires_grad=True)
target:  tensor([[-0.0830, -0.7957,  1.0230, -1.5763,  0.3064],
        [ 0.0061, -0.0928,  0.4557, -1.2883,  0.0917],
        [-0.3364, -1.3713,  0.1138, -0.3415,  0.4300]], dtype=torch.float64)
output:  tensor(1.3701, dtype=torch.float64, grad_fn=<SqrtBackward0>)


### Huber Loss (Smooth MAE)

$
\begin{align}
& L_{\delta}(y, \hat{y}) =
    \begin{cases}
        \frac{1}{2}(y - \hat{y})^{2} & \text{if } |y - \hat{y}| < \delta \\
        \delta \left(|y - \hat{y}| - \frac{1}{2}\delta\right) & \text{otherwise}
    \end{cases} \\
& L = \frac{1}{N} \sum_{i=1}^{N} L_{\delta}(y_{i}, \hat{y_{i}}) \\
& L - \text{loss} \\
& i - i^{th} \text{ training sample} \\
& y_{i} - \text{actual value of the } i^{th} \text{ training sample} \\
& \hat{y_{i}} - \text{predicted value of the } i^{th} \text{ training sample} \\
& N - \text{total number of training samples}
\end{align}
$
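To make the two branches of the formula concrete, here is a minimal sketch that computes the Huber loss by hand and compares it with `nn.HuberLoss`. The helper name `huber_manual` is hypothetical and only used for illustration.

```python
import torch
import torch.nn as nn

def huber_manual(pred, target, delta=1.0):
    # Element-wise absolute error |y - y_hat|
    abs_err = (pred - target).abs()
    # Quadratic branch for small errors, linear branch for large errors
    quadratic = 0.5 * abs_err ** 2
    linear = delta * (abs_err - 0.5 * delta)
    per_sample = torch.where(abs_err < delta, quadratic, linear)
    # Average over all samples (reduction='mean')
    return per_sample.mean()

pred = torch.tensor([1.7204, 3.0735, -0.5096], dtype=torch.float64)
target = torch.tensor([1.2, 1.9, 0.0], dtype=torch.float64)

manual = huber_manual(pred, target, delta=1.0)
builtin = nn.HuberLoss(reduction='mean', delta=1.0)(pred, target)
print('manual: ', manual)
print('builtin: ', builtin)
```

Only the second sample has an error larger than delta (1.1735), so it is the only one that takes the linear branch.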

In [18]:
pred = np.array([1.7204,  3.0735, -0.5096])
pred = torch.tensor(pred, requires_grad=True)

target = np.array([1.2,  1.9, 0.0])
target = torch.tensor(target)

loss = nn.HuberLoss(reduction='mean', delta=1.0)
output = loss(pred, target)
output.backward()

print('predicted: ', pred)
print('target: ', target)
print('output: ', output)

predicted:  tensor([ 1.7204,  3.0735, -0.5096], dtype=torch.float64, requires_grad=True)
target:  tensor([1.2000, 1.9000, 0.0000], dtype=torch.float64)
output:  tensor(0.3129, dtype=torch.float64, grad_fn=<HuberLossBackward0>)


In [19]:
## if we have multi-output regression (one row per sample, one column per output)

pred = np.array([[ 0.1158,  0.6110,  1.4007,  0.9448, -1.3474],
                [ 0.5122,  1.9972, -0.4636,  0.4942, -0.7275],
                [ 0.7347,  1.0200,  0.6460, -1.0006,  0.5116]])
pred = torch.tensor(pred, requires_grad=True)

target = np.array([[-0.0830, -0.7957,  1.0230, -1.5763,  0.3064],
                   [ 0.0061, -0.0928,  0.4557, -1.2883,  0.0917],
                   [-0.3364, -1.3713,  0.1138, -0.3415,  0.4300]])
target = torch.tensor(target)

loss = nn.HuberLoss(reduction='mean', delta=1.0)
output = loss(pred, target)
output.backward()

print('predicted: ', pred)
print('target: ', target)
print('output: ', output)

predicted:  tensor([[ 0.1158,  0.6110,  1.4007,  0.9448, -1.3474],
        [ 0.5122,  1.9972, -0.4636,  0.4942, -0.7275],
        [ 0.7347,  1.0200,  0.6460, -1.0006,  0.5116]], dtype=torch.float64,
       requires_grad=True)
target:  tensor([[-0.0830, -0.7957,  1.0230, -1.5763,  0.3064],
        [ 0.0061, -0.0928,  0.4557, -1.2883,  0.0917],
        [-0.3364, -1.3713,  0.1138, -0.3415,  0.4300]], dtype=torch.float64)
output:  tensor(0.7171, dtype=torch.float64, grad_fn=<HuberLossBackward0>)


### Quantile loss

In [1]:
#TODO
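The cell above is still a TODO. As a possible starting point, here is a minimal sketch of the quantile (pinball) loss for a single quantile. The class name `QuantileLoss` and the parameter `q` are assumptions made for this example and are not a PyTorch built-in.

```python
import torch
import torch.nn as nn

class QuantileLoss(nn.Module):
    """Pinball loss for one quantile q in (0, 1) (hypothetical helper)."""
    def __init__(self, q=0.5):
        super().__init__()
        self.q = q

    def forward(self, pred, target):
        err = target - pred
        # Under-prediction (err > 0) is weighted by q,
        # over-prediction (err < 0) is weighted by (1 - q)
        loss = torch.max(self.q * err, (self.q - 1) * err)
        return loss.mean()

pred = torch.tensor([1.7204, 3.0735, -0.5096], dtype=torch.float64, requires_grad=True)
target = torch.tensor([1.2, 1.9, 0.0], dtype=torch.float64)

median_loss = QuantileLoss(q=0.5)(pred, target)
upper_loss = QuantileLoss(q=0.9)(pred, target)
upper_loss.backward()

print('q=0.5: ', median_loss)
print('q=0.9: ', upper_loss)
```

With q = 0.5 the loss is exactly half of the MAE, so the median case reduces to L1 loss up to a constant factor.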

## Classification Loss

### Cross Entropy

--> In machine learning, entropy is a measure of the randomness (uncertainty) in information. High entropy means the information is so noisy/random <br>
that we cannot draw a reliable conclusion from it. <br>
--> Cross-entropy is a measure of the difference between two probability distributions for a given random variable or set of events. <br>
--> Cross-entropy loss measures the performance of a classification model whose output is a probability value between 0 and 1. <br>
--> It is also known as log loss or logistic loss. <br>

#### Binary Cross Entropy

--> It is used when we have two classes. <br>

$
\begin{align}
& L = - \frac{1}{N} \sum_{i=1}^{N} {(y_{i}\log(\hat{y_{i}}) + (1 - y_{i})\log(1 - \hat{y_{i}}))} \\
& L - \text{loss} \\
& i - i^{th} \text{ training sample} \\
& y_{i} - \text{actual label (0 or 1) of the } i^{th} \text{ training sample} \\
& \hat{y_{i}} - \text{predicted probability for the } i^{th} \text{ training sample} \\
& N - \text{total number of training samples}
\end{align}
$

In [16]:
pred = np.array([1.7204,  3.0735, -0.5096])
pred = torch.tensor(pred, requires_grad=True)

m = nn.Sigmoid()
pred = m(pred)

target = np.array([1.,  1., 0])
target = torch.tensor(target)

loss = nn.BCELoss()
output = loss(pred, target)
output.backward()

print('predicted: ', pred)
print('target: ', target)
print('output: ', output)

predicted:  tensor([0.8482, 0.9558, 0.3753], dtype=torch.float64,
       grad_fn=<SigmoidBackward0>)
target:  tensor([1., 1., 0.], dtype=torch.float64)
output:  tensor(0.2268, dtype=torch.float64, grad_fn=<BinaryCrossEntropyBackward0>)


In [33]:
pred = np.array([1.7204,  3.0735, -0.5096])
pred = torch.tensor(pred, requires_grad=True)

target = np.array([1.,  1., 0])
target = torch.tensor(target)

loss = nn.BCEWithLogitsLoss() # no need to use sigmoid - it is combination of a Sigmoid layer and the BCELoss in one single class
output = loss(pred, target)
output.backward()

print('predicted: ', pred)
print('target: ', target)
print('output: ', output)

predicted:  tensor([ 1.7204,  3.0735, -0.5096], dtype=torch.float64, requires_grad=True)
target:  tensor([1., 1., 0.], dtype=torch.float64)
output:  tensor(0.2268, dtype=torch.float64,
       grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)


<b> Q) Why is log used in BCE? </b> <br>
The likelihood of a single output is <br>
$
\begin{align}
& \hat{y_{i}}^{y_i}(1-\hat{y_{i}})^{(1-y_{i})}
\end{align}
$
<br>
$y_{i}$ is encoded as 0 or 1. Maximum likelihood estimation maximizes the product of these terms over all samples, and a product of many probabilities is awkward to compute and differentiate. <br>
Taking the log turns the product into a sum, which is much easier to work with. <br>

Another reason is, <br>
The log gives a small penalty when the predicted probability is close to the correct probability, and the penalty grows rapidly as the difference gets large.
$
\log(x) \to -\infty \text{ as } x \to 0 \\
\log(1) = 0
$

<b> Q) Why is there a negative sign in BCE? </b> <br>
Probabilities lie between 0 and 1, so all the log values are negative (or zero). The negative sign compensates for this and turns the average into a non-negative loss that can be minimized.
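The link between the likelihood and BCE can be checked numerically. The sketch below reuses the predictions and targets from the BCE cells above, builds the per-sample likelihood, and takes the negative mean of its log; the result should match `F.binary_cross_entropy`.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([1.7204, 3.0735, -0.5096], dtype=torch.float64)
p = torch.sigmoid(logits)                      # predicted probabilities
y = torch.tensor([1., 1., 0.], dtype=torch.float64)

# Per-sample likelihood p^y * (1 - p)^(1 - y)
likelihood = p ** y * (1 - p) ** (1 - y)

# The log turns the product over samples into a sum;
# the negative sign makes the result a non-negative loss
nll = -torch.log(likelihood).mean()

bce = F.binary_cross_entropy(p, y)
print('negative mean log-likelihood: ', nll)
print('BCE: ', bce)
```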

#### Multi-Class Cross Entropy

--> It is used when we have multiple classes. <br>

$
\begin{align}
& L = - \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c}\log(\hat{y_{i,c}}) \\
& L - \text{loss} \\
& i - i^{th} \text{ training sample} \\
& y_{i,c} - \text{1 if the } i^{th} \text{ training sample belongs to class } c \text{, else 0} \\
& \hat{y_{i,c}} - \text{predicted probability of class } c \text{ for the } i^{th} \text{ training sample} \\
& N - \text{total number of training samples} \\
& C - \text{total number of classes}
\end{align}
$
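As a sanity check of the double sum, here is a small sketch that evaluates the formula directly with a one-hot target and compares it with `F.cross_entropy`. It uses the same logits and targets as the cells in this section; the variable names are chosen for this example only.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([
    [-0.2412, -0.1822, -0.1700],
    [ 0.3150, -0.8486,  2.0958],
    [-0.2112, -0.2039, -4.1485],
    [ 0.9562,  0.3474,  0.1844],
], dtype=torch.float64)
target = torch.tensor([2, 1, 1, 0])

# log of the predicted class probabilities, shape (N, C)
log_probs = F.log_softmax(logits, dim=1)

# y_{i,c}: one-hot encoding of the target class, shape (N, C)
one_hot = F.one_hot(target, num_classes=3).to(log_probs.dtype)

# Inner sum over classes, outer mean over samples
manual = -(one_hot * log_probs).sum(dim=1).mean()
builtin = F.cross_entropy(logits, target)

print('manual: ', manual)
print('builtin: ', builtin)
```

Because the target is one-hot, the inner sum simply picks the log-probability of the true class for each sample.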

In [112]:
# nsamples X nclasses
# this is very important. let's say we have 3 classes - 0, 1, 2
# From Deep learning model, 
#  the idea is, at the very last layer, it should produce 3 values (same as number of classes)
#   
pred = np.array([
        [-0.2412, -0.1822, -0.1700],
        [ 0.3150, -0.8486,  2.0958],
        [ -0.2112, -0.2039, -4.1485],
        [0.9562, 0.3474, 0.1844]
    ])
pred = torch.tensor(pred, requires_grad=True)


m = nn.LogSoftmax(dim=1)
pred = m(pred)

print('predicted: ', pred)

target = np.array([2.,  1., 1., 0])
target = torch.tensor(target, dtype=torch.long)

print('target: ', target)

loss = nn.NLLLoss()
output = loss(pred, target)
output.backward()

print('predicted: ', pred)
print('target: ', target)
print('output: ', output)

# during inference we have to use torch max
_, inf_pred = torch.max(pred, 1)
print('predicted: ', inf_pred)

predicted:  tensor([[-1.1425, -1.0835, -1.0713],
        [-1.9806, -3.1442, -0.1998],
        [-0.7065, -0.6992, -4.6438],
        [-0.6962, -1.3050, -1.4680]], dtype=torch.float64,
       grad_fn=<LogSoftmaxBackward0>)
target:  tensor([2, 1, 1, 0])
predicted:  tensor([[-1.1425, -1.0835, -1.0713],
        [-1.9806, -3.1442, -0.1998],
        [-0.7065, -0.6992, -4.6438],
        [-0.6962, -1.3050, -1.4680]], dtype=torch.float64,
       grad_fn=<LogSoftmaxBackward0>)
target:  tensor([2, 1, 1, 0])
output:  tensor(1.4027, dtype=torch.float64, grad_fn=<NllLossBackward0>)
predicted:  tensor([2, 2, 1, 0])


In [49]:
# nsamples X nclasses
# this is very important. let's say we have 3 classes - 0, 1, 2
# From Deep learning model, 
#  the idea is, at the very last layer, it should produce 3 values (same as number of classes)
#   
pred = np.array([
        [-0.2412, -0.1822, -0.1700],
        [ 0.3150, -0.8486,  2.0958],
        [ -0.2112, -0.2039, -4.1485],
        [0.9562, 0.3474, 0.1844]
    ])
pred = torch.tensor(pred, requires_grad=True)

print('predicted: ', pred)

target = np.array([2.,  1., 1., 0])
target = torch.tensor(target, dtype=torch.long)

print('target: ', target)

loss = nn.CrossEntropyLoss() # no need to apply softmax - it combines LogSoftmax and NLLLoss (negative log likelihood loss) in one single class
output = loss(pred, target)
output.backward()

print('predicted: ', pred)
print('target: ', target)
print('output: ', output)

# during inference we have to use torch max
_, inf_pred = torch.max(pred, 1)
print('predicted: ', inf_pred)

predicted:  tensor([[-0.2412, -0.1822, -0.1700],
        [ 0.3150, -0.8486,  2.0958],
        [-0.2112, -0.2039, -4.1485],
        [ 0.9562,  0.3474,  0.1844]], dtype=torch.float64, requires_grad=True)
target:  tensor([2, 1, 1, 0])
predicted:  tensor([[-0.2412, -0.1822, -0.1700],
        [ 0.3150, -0.8486,  2.0958],
        [-0.2112, -0.2039, -4.1485],
        [ 0.9562,  0.3474,  0.1844]], dtype=torch.float64, requires_grad=True)
target:  tensor([2, 1, 1, 0])
output:  tensor(1.4027, dtype=torch.float64, grad_fn=<NllLossBackward0>)
predicted:  tensor([2, 2, 1, 0])


<b> Q) Difference between Sigmoid and Softmax </b> <br>

Sigmoid:  <br>
It squashes each element of a vector into the range (0, 1). It is applied independently to each element $s_{i}$ of $s$, so the outputs need not add up to 1. It’s also called the logistic function.  <br>
$
\sigma(s_{i}) = \frac{1} {1 + e^{-s_{i}}}
$
<br>

Softmax: <br>
It squashes each element of a vector into the range (0, 1), and all the resulting elements add up to 1. It is applied to the whole vector of output scores $s$. As each element represents a class, the outputs can be interpreted as class probabilities.  <br>
$
\text{softmax}(s)_{i} = \frac{e^{s_{i}}}{\sum_{c=1}^C e^{s_{c}}} \ \ \ \text{for } i=1,2,\dots,C
$
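A quick way to see the difference is to apply both functions to the same score vector. The scores below are made-up values for illustration.

```python
import torch

s = torch.tensor([2.0, 1.0, 0.1])  # hypothetical output scores for 3 classes

sig = torch.sigmoid(s)         # each element squashed independently
soft = torch.softmax(s, dim=0) # elements normalized jointly

print('sigmoid: ', sig, ' sum = ', sig.sum())
print('softmax: ', soft, ' sum = ', soft.sum())
```

The sigmoid outputs do not add up to 1, which is why sigmoid suits binary and multi-label problems, while softmax suits single-label multi-class problems.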

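#### Multi-Label Cross Entropy

--> In a multi-label problem, each sample can belong to several classes at once, so the target is represented as a multi-hot vector (one 0/1 entry per class). <br>
--> We can use BCE loss for this purpose, applied independently to each class.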

In [83]:
### batch_size/num_of_samples = 3
### num_classes = 5

pred_before_sigmoid = np.array([
        [ 1.4397182 , -0.7993438 ,  4.113389  ,  3.2199187 ,  4.5777845 ],
        [ 0.30619335,  0.10168511,  4.253479  ,  2.3782277 ,  4.7390924 ],
        [ 1.124632  ,  1.6056736 ,  2.9778094 ,  2.0808482 ,  2.0735667 ]
    ])
pred_before_sigmoid = torch.tensor(pred_before_sigmoid, requires_grad=True)

pred_after_sigmoid = torch.sigmoid(pred_before_sigmoid)


target_classes = np.array([
        [1., 1., 0., 0., 0.],
        [0., 1., 0., 0., 1.],
        [1., 1., 1., 1., 0.]
    ])
target_classes = torch.tensor(target_classes, dtype=torch.double)



fn_bce_loss = torch.nn.BCELoss()
bce_loss = fn_bce_loss(pred_after_sigmoid, target_classes)

fn_bce_loss_logits = torch.nn.BCEWithLogitsLoss()
bce_loss_logit = fn_bce_loss_logits(pred_before_sigmoid, target_classes)


bce_loss = round(bce_loss.detach().item(), 6)
bce_loss_logit = round(bce_loss_logit.detach().item(), 6)

print("Loss:- ",bce_loss, bce_loss_logit)

assert bce_loss == bce_loss_logit

prediction = (pred_after_sigmoid > 0.7).type(torch.float64)
print(prediction)

Loss:-  1.628549 1.628549
tensor([[1., 0., 1., 1., 1.],
        [0., 0., 1., 1., 1.],
        [1., 1., 1., 1., 1.]], dtype=torch.float64)


### KL divergence

--> KL divergence is different from cross entropy. It calculates the relative entropy between two probability distributions: the cross entropy $H(P, Q)$ equals the entropy $H(P)$ plus $D_{KL}(P\|Q)$. <br>
--> The KL divergence between two probability distributions measures how different the two distributions are. <br>
--> When two probability distributions are identical, the KL divergence between them is 0. <br>
 <br>
For example, let’s say that we have a true distribution $P$ and an approximate distribution $Q$. Then KL divergence measures how much $Q$ differs from $P$.  <br>

$
D_{KL}(P\|Q) = \sum_{x \in \mathcal{X}} P(x) \log{\frac{P(x)}{Q(x)}}
$ <br>
where $\mathcal{X}$ is the sample space (the set of possible outcomes). <br>
--> We need to keep in mind that although KL divergence tells us how one probability distribution differs from another, it is not a distance metric. It is not symmetric, so in general $D_{KL}(P\|Q) \neq D_{KL}(Q\|P)$. <br>
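The sketch below evaluates the formula directly on two small, made-up distributions to show that the divergence of a distribution from itself is 0 and that swapping the arguments changes the result. The helper `kl` is hypothetical; the last line also shows that `F.kl_div(input, target)` expects the approximate distribution as log-probabilities in `input` and the true distribution in `target`.

```python
import torch
import torch.nn.functional as F

def kl(p, q):
    # D_KL(p || q) = sum_x p(x) * log(p(x) / q(x))
    return (p * torch.log(p / q)).sum()

P = torch.tensor([0.8, 0.1, 0.1], dtype=torch.float64)  # "true" distribution
Q = torch.tensor([0.4, 0.3, 0.3], dtype=torch.float64)  # approximation

kl_pq = kl(P, Q)
kl_qp = kl(Q, P)
kl_pp = kl(P, P)

# PyTorch argument order: input = log Q, target = P
kl_torch = F.kl_div(Q.log(), P, reduction='sum')

print('D_KL(P||Q) = ', kl_pq)
print('D_KL(Q||P) = ', kl_qp)
print('D_KL(P||P) = ', kl_pp)
print('F.kl_div(log Q, P) = ', kl_torch)
```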

<br>
<br>

Use cases of KL divergence: <br>
    1) Autoencoder,  Variational Autoencoders <br>

In [111]:
p1 = np.array([
        [-1.0156, -0.2873, -0.4043,  0.2866,  0.0977],
        [-1.1921,  0.4708,  2.1700, -0.1780, -0.2064],
        [ 1.5237, -0.5784, -0.4806,  1.5462, -1.4848]
    ])
p1 = torch.tensor(p1, requires_grad=True)
p2 = np.array([
        [0.4866, 0.6380, 0.0370, 0.3784, 0.5492],
        [0.4328, 0.9565, 0.5474, 0.4046, 0.2326],
        [0.4962, 0.7836, 0.2754, 0.3386, 0.6686]
    ])
p2 = torch.tensor(p2)
p3 = np.array([
        [0.5202, 0.7551, 0.7836, 0.0691, 0.4091],
        [-1.1921,  0.4708,  2.1700, -0.1780, -0.2064],
        [0.9314, 0.4530, 0.7055, 0.0060, 0.4177]
    ])
p3 = torch.tensor(p3)

kl_loss = nn.KLDivLoss(reduction="batchmean", log_target=True)
p1 = F.log_softmax(p1, dim=1) # prediction should be a distribution in the log space

p2 = F.log_softmax(p2, dim=1) # target
p3 = F.log_softmax(p3, dim=1) # target

output_1 = kl_loss(p1, p2)
output_2 = kl_loss(p1, p3)

print('KL Divergence of p1 and p2 = {}'.format(output_1))
print('KL Divergence of p1 and p3 = {}'.format(output_2))

# p1 and p3 are more similar distributions than p1 and p2, so their KL divergence is smaller

KL Divergence of p1 and p2 = 0.511531182594443
KL Divergence of p1 and p3 = 0.2787421654654177


### Hinge Loss

--> It is most closely associated with the <b>SVM algorithm</b> and is mainly used for maximum-margin classification models. <br>
--> It takes the margin (distance from the decision boundary) into account. Even if a new observation is classified correctly, a penalty is still incurred if its margin from the decision boundary is not large enough.

#### Binary Hinge Loss

$
\begin{align}
& L = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_{i} \cdot \hat{y_{i}}) \\
& L - \text{loss} \\
& i - i^{th} \text{ training sample} \\
& y_{i} - \text{actual value of the } i^{th} \text{ training sample} \in \{-1, 1\} \\
& \hat{y_{i}} - \text{predicted score of the } i^{th} \text{ training sample} \\
& N - \text{total number of training samples}
\end{align}
$
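The formula can be evaluated by hand with plain PyTorch. This is a sketch that assumes the labels arrive as 0/1 and are mapped to -1/+1, and that the per-sample terms are averaged over N; the variable names are chosen for this example only.

```python
import torch

raw = torch.tensor([0.25, 0.25, 0.55, 0.75, 0.75], dtype=torch.float64)
scores = torch.sigmoid(raw)                 # predicted scores in (0, 1)
labels01 = torch.tensor([0, 0, 1, 1, 1])

# Map labels {0, 1} -> {-1, +1} as required by the hinge formula
y = 2 * labels01 - 1

# max(0, 1 - y * y_hat) for every sample
per_sample = torch.clamp(1 - y * scores, min=0)

hinge = per_sample.mean()
squared_hinge = (per_sample ** 2).mean()

print('hinge: ', hinge)
print('squared hinge: ', squared_hinge)
```

Note that the last three samples are on the correct side (positive score, positive label) but still contribute to the loss because their margin is below 1.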

In [7]:
pred = np.array([0.25, 0.25, 0.55, 0.75, 0.75])
pred = torch.tensor(pred, requires_grad=True)

m = nn.Sigmoid()
pred = m(pred)

target = np.array([0, 0, 1, 1, 1])
target = torch.tensor(target)

loss = BinaryHingeLoss() # Binary Hinge loss
output = loss(pred, target)
# output.backward()

print('predicted: ', pred)
print('target: ', target)
print('output: ', output)

loss_squared = BinaryHingeLoss(squared=True) # Binary squared Hinge loss
output = loss_squared(pred, target)
output.backward()

print('predicted: ', pred)
print('target: ', target)
print('output: ', output)

predicted:  tensor([0.5622, 0.5622, 0.6341, 0.6792, 0.6792], dtype=torch.float64,
       grad_fn=<SigmoidBackward0>)
target:  tensor([0, 0, 1, 1, 1])
output:  tensor(0.8264, grad_fn=<SqueezeBackward0>)
predicted:  tensor([0.5622, 0.5622, 0.6341, 0.6792, 0.6792], dtype=torch.float64,
       grad_fn=<SigmoidBackward0>)
target:  tensor([0, 0, 1, 1, 1])
output:  tensor(1.0441, grad_fn=<SqueezeBackward0>)


#### Multi-Class Hinge Loss

$
\begin{align}
& L = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - \hat{y}_{i,y_{i}} + \max_{c \neq y_{i}}{\hat{y}_{i,c}}) \\
& L - \text{loss} \\
& i - i^{th} \text{ training sample} \\
& y_{i} - \text{target class of the } i^{th} \text{ training sample} \in \{0, \dots, C-1\} \\
& \hat{y}_{i,c} - \text{predicted score of class } c \text{ for the } i^{th} \text{ training sample} \\
& N - \text{total number of training samples} \\
& C - \text{total number of classes}
\end{align}
$

In [11]:
pred = np.array([
                [0.25, 0.20, 0.55],
                [0.55, 0.05, 0.40],
                [0.10, 0.30, 0.60],
                [0.90, 0.05, 0.05]
            ])
pred = torch.tensor(pred, requires_grad=True)

m = nn.Sigmoid()
pred = m(pred)

target = np.array([0, 1, 2, 0])
target = torch.tensor(target)

loss = MulticlassHingeLoss(num_classes=3) # Multiclass Hinge loss
output = loss(pred, target)
# output.backward()

print('predicted: ', pred)
print('target: ', target)
print('output: ', output)

loss_squared = MulticlassHingeLoss(num_classes=3, multiclass_mode='one-vs-all', squared=True) # Multiclass squared Hinge loss
output = loss_squared(pred, target)
output.mean().backward() # *** backward is only valid if loss is a tensor containing a single element.

print('predicted: ', pred)
print('target: ', target)
print('output: ', output)

predicted:  tensor([[0.5622, 0.5498, 0.6341],
        [0.6341, 0.5125, 0.5987],
        [0.5250, 0.5744, 0.6457],
        [0.7109, 0.5125, 0.5125]], dtype=torch.float64,
       grad_fn=<SigmoidBackward0>)
target:  tensor([0, 1, 2, 0])
output:  tensor(0.9810, grad_fn=<SqueezeBackward0>)
predicted:  tensor([[0.5622, 0.5498, 0.6341],
        [0.6341, 0.5125, 0.5987],
        [0.5250, 0.5744, 0.6457],
        [0.7109, 0.5125, 0.5125]], dtype=torch.float64,
       grad_fn=<SigmoidBackward0>)
target:  tensor([0, 1, 2, 0])
output:  tensor([1.3178, 1.8515, 1.9099], grad_fn=<DivBackward0>)


### Focal Loss

--> Focal Loss is particularly useful in cases where there is a class imbalance. <br>
--> Focal Loss was introduced by Lin et al. of Facebook AI Research in 2017 as a means of combating extremely imbalanced datasets <br>
where positive cases were relatively rare. Their paper "Focal Loss for Dense Object Detection" is available here: https://arxiv.org/abs/1708.02002. <br>
--> The idea is to give high weights to the rare class and small weights to the dominant (common) class via $\alpha_{t}$, while the modulating factor $(1 - p_{t})^{\gamma}$ down-weights easy, well-classified examples. <br>
$
\begin{align}
& FL(p_{t}) = -\alpha_{t}(1 - p_{t})^{\gamma}\log(p_{t}) \\
& p_{t} = \text{predicted probability of the ground truth class} \\
& \alpha_{t} = \text{weighting factor of the ground truth class} \\
& \gamma = \text{exponent of the modulating factor}
\end{align}
$ <br>
--> https://github.com/pytorch/vision/blob/main/torchvision/ops/focal_loss.py#L38

In [81]:
'''
Args:
        inputs (Tensor): A float tensor of arbitrary shape.
                The predictions for each example.
        targets (Tensor): A float tensor with the same shape as inputs. Stores the binary
                classification label for each element in inputs
                (0 for the negative class and 1 for the positive class).
        alpha (float): Weighting factor in range (0,1) to balance
                positive vs negative examples or -1 for ignore. Default: ``0.25``.
        gamma (float): Exponent of the modulating factor (1 - p_t) to
                balance easy vs hard examples. Default: ``2``.
        reduction (string): ``'none'`` | ``'mean'`` | ``'sum'``
                ``'none'``: No reduction will be applied to the output.
                ``'mean'``: The output will be averaged.
                ``'sum'``: The output will be summed. Default: ``'mean'``.

'''
class FocalLoss(nn.Module):
    def __init__(self, alpha=.25, gamma=2, reduction='mean'):
        super(FocalLoss, self).__init__()
        self.alpha = alpha # weight assigned to class 1
        self.gamma = gamma   
        self.reduction = reduction

    def forward(self, preds, targets):
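        '''
            :param preds: raw logits (before sigmoid), any shape
            :param targets: binary labels (0. or 1.) with the same shape as preds
            :return: focal loss, reduced according to self.reduction
        '''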
        
        BCE_loss = nn.BCEWithLogitsLoss(reduction='none')(preds, targets) # reduction should be none
        
        p = torch.sigmoid(preds)
        pt = p * targets + (1 - p) * (1 - targets)
        
        at = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        
        loss = at*(1-pt)**self.gamma * BCE_loss
        
        
        if self.reduction == "none":
            pass
        elif self.reduction == "mean":
            loss = loss.mean()
        elif self.reduction == "sum":
            loss = loss.sum()
        
        print("preds:- ",preds)
        print("targets:- ", targets)
        print("pt:- ", pt)
        print("at:- ", at)
        print("BCE_loss:- ", BCE_loss)
        print("F_loss:- ", loss)
        
        
        return loss

In [92]:
pred = np.array([1.7204,  3.0735, -0.5096])
pred = torch.tensor(pred, requires_grad=True)

target = np.array([1.,  0., 0.]) 
target = torch.tensor(target)

o_cls_num = torch.numel(target[target == 0])
total_cls_num = torch.numel(target)

pos_weight = o_cls_num/total_cls_num
print("pos_weight:- ",pos_weight)
loss = FocalLoss(alpha=pos_weight)

f_loss = loss(pred, target)

target = np.array([1.,  1., 0.]) 
target = torch.tensor(target)

o_cls_num = torch.numel(target[target == 0])
total_cls_num = torch.numel(target)

pos_weight = o_cls_num/total_cls_num
print("pos_weight:- ",pos_weight)
loss = FocalLoss(alpha=pos_weight)

f_loss = loss(pred, target)

pos_weight:-  0.6666666666666666
preds:-  tensor([ 1.7204,  3.0735, -0.5096], dtype=torch.float64, requires_grad=True)
targets:-  tensor([1., 0., 0.], dtype=torch.float64)
pt:-  tensor([0.8482, 0.0442, 0.6247], dtype=torch.float64, grad_fn=<AddBackward0>)
at:-  tensor([0.6667, 0.3333, 0.3333], dtype=torch.float64)
BCE_loss:-  tensor([0.1647, 3.1187, 0.4705], dtype=torch.float64,
       grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
F_loss:-  tensor(0.3248, dtype=torch.float64, grad_fn=<MeanBackward0>)
pos_weight:-  0.3333333333333333
preds:-  tensor([ 1.7204,  3.0735, -0.5096], dtype=torch.float64, requires_grad=True)
targets:-  tensor([1., 1., 0.], dtype=torch.float64)
pt:-  tensor([0.8482, 0.9558, 0.6247], dtype=torch.float64, grad_fn=<AddBackward0>)
at:-  tensor([0.3333, 0.3333, 0.6667], dtype=torch.float64)
BCE_loss:-  tensor([0.1647, 0.0452, 0.4705], dtype=torch.float64,
       grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
F_loss:-  tensor(0.0152, dtype=torch.float64, grad_fn

<b>Q) How can we calculate alpha? </b> <br>

The idea is to give high weights to the rare class and small weights to the dominant (common) class.<br>
<br>
We know that alpha = weight assigned to class 1. <br>
<br>
If 1 is the common class, then alpha should be small.<br>
If 1 is the rare class, then alpha should be high.<br>
<br>
So alpha = [1 - (1_num/total_num)]<br>
which means, <br>
if 1_num is high then alpha will be small<br>
if 1_num is low then alpha will be high<br>
<br>
Since 0_num + 1_num = total_num, the above equation can be written as<br>
alpha = [0_num/total_num]<br>
<br>
That is why, in the code, we did <br>
o_cls_num = torch.numel(target[target == 0])<br>
total_cls_num = torch.numel(target)<br>

pos_weight = o_cls_num/total_cls_num<br>


## Ranking Loss

--> Learning to Rank (LTR) refers to ML techniques for solving ranking tasks (applicable to any search/recommendation system). <br>
--> We try to learn a function f(q, D) that, given a query q and a list of relevant items D, predicts the order (ranking) of all items within the list. <br>

--> All learning-to-rank models use an ML model (e.g. decision tree, neural network) to compute f(q, D). <br>
--> The choice of loss function is the distinctive element for LTR models. <br>
--> There are three approaches, depending on how the loss is calculated. <br>
1) Pointwise <br>
Input: a single candidate, Y_i = f(q, d_i) <br>
Loss: on the score (treated as regression), i.e. how close the predicted score Y_i is to the actual score s_i. <br>
2) Pairwise <br>
Input: a pair of candidates, Y_i = f(q, d_i) and Y_j = f(q, d_j) <br>
Loss: if s_i > s_j, is Y_i > Y_j? (treated as binary classification), e.g. RankNet <br>
3) Listwise <br>
Input: the whole list of candidates (q, d_1), (q, d_2), ..., (q, d_n) <br>
Loss: computed over the whole list, e.g. LambdaRank, LambdaMART, Plackett-Luce model <br>
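To make the three approaches concrete, here is a minimal sketch on one made-up query with four documents. The relevance labels `s` and scores `Y` are hypothetical; the pairwise term is a RankNet-style logistic loss on score differences and the listwise term is a ListNet-style softmax cross entropy, both chosen here as illustrative examples rather than the only options.

```python
import torch
import torch.nn.functional as F

# Hypothetical relevance labels s_i and model scores Y_i = f(q, d_i)
s = torch.tensor([3.0, 1.0, 2.0, 0.0])
Y = torch.tensor([2.5, 0.3, 0.1, -1.0], requires_grad=True)

# Pointwise: regress every score onto its label
pointwise = F.mse_loss(Y, s)

# Pairwise: for every pair (i, j) with s_i > s_j, push Y_i above Y_j
i, j = torch.where(s.unsqueeze(1) > s.unsqueeze(0))
pairwise = F.softplus(-(Y[i] - Y[j])).mean()

# Listwise: compare the score distribution over the whole list
# with the label distribution
listwise = -(F.softmax(s, dim=0) * F.log_softmax(Y, dim=0)).sum()

print('pointwise: ', pointwise)
print('pairwise:  ', pairwise)
print('listwise:  ', listwise)
```

The pairwise loss only depends on score differences, so adding a constant to every score leaves it unchanged, whereas the pointwise loss would change.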

--> The objective of ranking losses is to predict relative distances between inputs. <br>
--> They are mainly used in Siamese nets and Triplet nets. <br>
--> There are two types of ranking loss: <br>
1) Pairwise ranking loss <br>
2) Triplet ranking loss <br>

### Margin Ranking loss

--> It computes the relative distances between inputs. <br>
$
\begin{align}
& loss(x_{1}, x_{2}, y) = \max(0, -y \cdot (x_{1} - x_{2}) + \text{margin}) \\
& x_{1}, x_{2} - \text{inputs} \\
& y - \text{label (1 or -1)} \\
& \text{when } y = 1 \text{, the first input should be ranked higher; when } y = -1 \text{, the second input should be ranked higher.}
\end{align}
$

In [98]:
input_one = np.array([-0.2305,  0.1481,  1.7113])
input_one = torch.tensor(input_one, requires_grad=True)


input_two = np.array([-1.6993,  2.2877, -0.4003])
input_two = torch.tensor(input_two, requires_grad=True)

target = np.array([-1.,  1.,  1.])
target = torch.tensor(target)

ranking_loss = nn.MarginRankingLoss()
output = ranking_loss(input_one, input_two, target)
output.backward()

print('input one: ', input_one)
print('input two: ', input_two)
print('target: ', target)
print('output: ', output)

input one:  tensor([-0.2305,  0.1481,  1.7113], dtype=torch.float64, requires_grad=True)
input two:  tensor([-1.6993,  2.2877, -0.4003], dtype=torch.float64, requires_grad=True)
target:  tensor([-1.,  1.,  1.], dtype=torch.float64)
output:  tensor(1.2028, dtype=torch.float64, grad_fn=<MeanBackward0>)
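The printed value can be checked by hand. The sketch below applies max(0, -y * (x1 - x2) + margin) to each pair in plain Python and averages the results, assuming a margin of 0 (the default of `nn.MarginRankingLoss`) and mean reduction.

```python
input_one = [-0.2305, 0.1481, 1.7113]
input_two = [-1.6993, 2.2877, -0.4003]
target = [-1.0, 1.0, 1.0]
margin = 0.0  # assumed: default margin of nn.MarginRankingLoss

# Per-pair hinge: zero when the pair is already ordered as the label asks.
per_pair = [
    max(0.0, -y * (x1 - x2) + margin)
    for x1, x2, y in zip(input_one, input_two, target)
]
mean_loss = sum(per_pair) / len(per_pair)

print([round(v, 4) for v in per_pair])
print(round(mean_loss, 4))
```

Only the third pair is ordered as its label requires, so it contributes zero; the first two pairs carry the whole loss.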


### Triplet Ranking loss

--> It compares an anchor with a positive and a negative example, and penalizes the triplet unless the anchor is closer to the positive than to the negative by at least the margin. <br>
$
\begin{align}
& \text{loss}(a, p, n) = \max\{d(a_{i}, p_{i}) - d(a_{i}, n_{i}) + \text{margin}, 0\} \\
& a, p, n - \text{inputs: a (anchor), p (positive example), n (negative example)} \\
& d - \text{distance function (e.g. Euclidean distance)}
\end{align}
$
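The formula can be sketched in a few lines of plain Python. This version assumes Euclidean distance for d and a margin of 1.0 (which match the documented defaults of `nn.TripletMarginLoss`); the helper name and the toy triplets are made up for illustration.

```python
import math

def triplet_margin_loss(anchors, positives, negatives, margin=1.0):
    """Mean over triplets of max(d(a, p) - d(a, n) + margin, 0), with Euclidean d."""
    losses = [
        max(math.dist(a, p) - math.dist(a, n) + margin, 0.0)
        for a, p, n in zip(anchors, positives, negatives)
    ]
    return sum(losses) / len(losses)

# Two hypothetical triplets in 2-D.
anchors = [[0.0, 0.0], [0.0, 0.0]]
positives = [[3.0, 4.0], [1.0, 0.0]]   # distances to the anchor: 5.0 and 1.0
negatives = [[6.0, 8.0], [0.0, 0.5]]   # distances to the anchor: 10.0 and 0.5

loss = triplet_margin_loss(anchors, positives, negatives)
print(loss)
```

The first triplet already satisfies the margin (5 - 10 + 1 < 0) and contributes nothing; the second has its negative closer than its positive and contributes 1 - 0.5 + 1 = 1.5.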

In [104]:
a = np.array([[-0.8134, -0.8855],
        [-1.7437,  1.3037],
        [-1.3493,  0.3088]]) # anchor
a = torch.tensor(a, requires_grad=True)


p = np.array([[ 0.1226, -1.6654],
        [-0.1654, -0.1100],
        [ 0.9043,  0.4446]]) # positive
p = torch.tensor(p, requires_grad=True)

n = np.array([[-0.3027,  1.3877],
        [ 1.1685,  0.9808],
        [-0.4683,  0.3750]]) # negative
n = torch.tensor(n, requires_grad=True)

ranking_loss = nn.TripletMarginLoss()  # defaults: margin=1.0, Euclidean distance (p=2)
output = ranking_loss(a, p, n)
output.backward()

print('a: ', a)
print('p: ', p)
print('n: ', n)
print('output: ', output)

a:  tensor([[-0.8134, -0.8855],
        [-1.7437,  1.3037],
        [-1.3493,  0.3088]], dtype=torch.float64, requires_grad=True)
p:  tensor([[ 0.1226, -1.6654],
        [-0.1654, -0.1100],
        [ 0.9043,  0.4446]], dtype=torch.float64, requires_grad=True)
n:  tensor([[-0.3027,  1.3877],
        [ 1.1685,  0.9808],
        [-0.4683,  0.3750]], dtype=torch.float64, requires_grad=True)
output:  tensor(0.8543, dtype=torch.float64, grad_fn=<MeanBackward0>)


## Reinforcement Loss

In [2]:
#TODO

## Custom Loss

In [None]:
#PyTorch
class CustomLoss(nn.Module):
    def __init__(self, weight=None, size_average=True):
        # weight and size_average are accepted but not used in this example
        super().__init__()

    def forward(self, inputs, targets, smooth=1):
        
        # comment out if your model already ends with a sigmoid or equivalent activation layer
        inputs = torch.sigmoid(inputs)
        
        # flatten label and prediction tensors
        inputs = inputs.view(-1)
        targets = targets.view(-1)
        
        # mean squared error between predicted probabilities and targets
        loss = ((inputs - targets) ** 2).mean()
        
        return loss
    

# Forward pass through the network to get `outputs`, then:
# loss = CustomLoss()(outputs, targets)
# loss.backward()
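A hedged end-to-end sketch of how such a custom loss is typically wired into a training step. The class name `SigmoidMSELoss`, the tiny `nn.Linear` model, and the random data are all hypothetical and only serve to show the forward pass, the loss call, and `backward()`.

```python
import torch
import torch.nn as nn

class SigmoidMSELoss(nn.Module):
    """Hypothetical custom loss: MSE between sigmoid(logits) and targets."""

    def forward(self, inputs, targets):
        probs = torch.sigmoid(inputs).view(-1)
        return ((probs - targets.view(-1)) ** 2).mean()

torch.manual_seed(0)
model = nn.Linear(4, 1)                          # stand-in network
criterion = SigmoidMSELoss()

features = torch.randn(8, 4)                     # made-up batch of 8 samples
targets = torch.randint(0, 2, (8, 1)).float()    # made-up binary labels

logits = model(features)                         # forward pass
loss = criterion(logits, targets)                # scalar loss
loss.backward()                                  # gradients land in model.weight.grad

print(loss.item(), model.weight.grad.shape)
```

Because the loss is built only from differentiable tensor operations, autograd derives the backward pass automatically; no custom `backward` method is needed.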

# Metric

## Regression

## Classification

$
\begin{align}
& Accuracy = \frac{TP+TN}{TP+TN+FP+FN} \\
& Precision = \frac{TP}{TP+FP} \\
& Recall = \frac{TP}{TP+FN} \\
& F1 = \frac{2*Precision*Recall}{Precision+Recall} = \frac{2*TP}{2*TP+FP+FN} \\
\end{align}
$
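The four formulas can be checked on a small example. The labels and predictions below are made up, and the sketch assumes binary 0/1 labels with at least one predicted and one actual positive (so no denominator is zero).

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
f1_from_counts = 2 * tp / (2 * tp + fp + fn)   # second form of the F1 formula

print(tp, tn, fp, fn)
print(accuracy, precision, recall, f1, f1_from_counts)
```

Both forms of F1 give the same number, which is a quick sanity check on the algebra.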

--> <b>ROC curve </b> <br>
--> An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds.  <br>
This curve plots two parameters: <br>
1) True Positive Rate (Y axis) <br>
2) False Positive Rate (X axis) <br>

$
\begin{align}
& TPR = \frac{TP}{TP+FN} \\
& FPR = \frac{FP}{FP+TN} \\
\end{align}
$

--> <b>AUC</b> stands for "Area under the ROC Curve." <br>
That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1).<br>
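A minimal sketch of how the ROC points and the area under them can be computed by sweeping the threshold from high to low. It assumes binary 0/1 labels, at least one sample of each class, and no tied scores; the function names and the toy data are made up for illustration.

```python
def roc_curve_points(y_true, scores):
    """Return (FPR, TPR) points, lowering the threshold one sample at a time."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), reverse=True)  # highest score first
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the piecewise-linear curve through the points (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_curve_points(y_true, scores)
roc_auc = auc(points)
print(points)
print(roc_auc)
```

The curve always starts at (0, 0) and ends at (1, 1); a perfect ranking of positives above negatives gives an area of 1, and a random one gives about 0.5.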


$
\begin{align}
& Sensitivity = Recall = \frac{TP}{TP+FN} \\
& Specificity = \frac{TN}{FP+TN} \\
\end{align}
$

## Ranking

### Mean Average Precision (mAP)

$
\begin{align}
& Precision = \frac{TP}{TP+FP} \\
& Recall = \frac{TP}{TP+FN} \\
\end{align}
$

--> In <b>information retrieval</b>, the definition is different. <br>
$
\begin{align}
& Precision = \frac{|\{\text{relevant doc}\} \cap \{\text{retrieved doc}\}|}{|\{\text{retrieved doc}\}|} \\
& Recall = \frac{|\{\text{relevant doc}\} \cap \{\text{retrieved doc}\}|}{|\{\text{relevant doc}\}|} \\
\end{align}
$

By default, precision takes all the retrieved documents into account. However, it can also be evaluated at a given number of retrieved documents, <br>
commonly known as the cut-off rank, where the model is assessed on only its top-most results. This measure is called precision at k or <b> P@K </b>.

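$
\begin{align}
& AP@n = \frac{1}{GTP} \sum_{k=1}^{n} (P@k \cdot rel@k) \\
& GTP - \text{total number of ground truth positives} \\
& P@k - \text{precision at cut-off rank } k \\
& rel@k - \text{relevance at rank } k \text{ (1 if the item at rank } k \text{ is relevant, 0 otherwise)}
\end{align}
$

$
\begin{align}
& mAP = \frac{1}{N} \sum_{i = 1}^{N} AP_{i} \\
& N - \text{total number of queries} \\
& AP_{i} - \text{average precision of the } i^{th} \text{ query}
\end{align}
$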
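The two formulas translate almost directly into code. In this sketch each query is a list of 0/1 relevance flags in ranked order; the function names and the two toy queries are made up, and GTP defaults to the number of relevant items in the list (an assumption that every relevant item was retrieved).

```python
def average_precision(ranked_relevance, total_relevant=None):
    """AP@n for one query; ranked_relevance[k-1] is 1 if the item at rank k is relevant."""
    if total_relevant is None:
        # Assumption: every ground-truth positive appears somewhere in the list.
        total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    weighted_precision = 0.0
    for k, rel in enumerate(ranked_relevance, start=1):
        hits += rel
        weighted_precision += (hits / k) * rel   # P@k * rel@k
    return weighted_precision / total_relevant

def mean_average_precision(queries):
    """mAP: the mean of AP over all queries."""
    return sum(average_precision(q) for q in queries) / len(queries)

queries = [
    [1, 0, 1, 1, 0],   # relevant items at ranks 1, 3, 4
    [0, 1, 0, 0, 1],   # relevant items at ranks 2, 5
]

ap_first = average_precision(queries[0])
ap_second = average_precision(queries[1])
map_score = mean_average_precision(queries)
print(ap_first, ap_second, map_score)
```

For the first query the sum is 1/1 + 2/3 + 3/4 over 3 relevant items; for the second it is 1/2 + 2/5 over 2 relevant items.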

[source:- https://www.youtube.com/watch?v=pM6DJ0ZZee0] <br>

<div>
<img src="images/MAP.png" width="500"/>
</div>

### Normalized Discounted Cumulative Gain (NDCG)

--> Discounted Cumulative Gain (DCG) is a metric for measuring <b>ranking quality</b>. <br>
--> It is mostly used in information retrieval problems, such as measuring the effectiveness of a search engine by checking how well the <br>
articles it displays are ordered by their relevance to the search keyword. <br>
--> For <b>example</b>, <br>
suppose Google shows the documents below for a search query. <br>
D_1 <br>
D_2 <br>
D_3 <br>
D_4 <br>
D_5 <br>
--> Now assign a relevance score to every document. [0 : not relevant, 1-2 : somewhat relevant, 3 : completely relevant] <br>
D_1 :- 3 <br>
D_2 :- 2 <br>
D_3 :- 0 <br>
D_4 :- 0 <br>
D_5 :- 1 <br>

<b>STEP 1: </b> <br>
Calculate <b>Cumulative Gain</b>. $CG_{p}$ at a particular rank position p is defined as <br>
$
\begin{align}
& CG_{p} = \sum_{i=1}^{p}rel_{i} \\
& rel_{i} :-  \text{graded relevance of the result at position i}
\end{align}
$ <br>

$
\begin{align}
& CG = \sum_{i=1}^{5}rel_{i} = 3 + 2 + 0 + 0 + 1 = 6 
\end{align}
$ <br>

<b>STEP 2: </b> <br>
Calculate <b>Discounted Cumulative Gain</b>. The idea behind DCG is that highly relevant documents appearing lower in a search result list should be penalized, so <br>
the graded relevance value is discounted logarithmically with the position of the result. <br>
$
\begin{align}
& DCG_{p} = \sum_{i=1}^{p} \frac{rel_{i}}{\log_{2} (i+1)} \\
& rel_{i} :-  \text{graded relevance of the result at position i}
\end{align}
$ <br>
A common alternative uses $2^{rel_{i}} - 1$ as the numerator, which puts more weight on highly relevant documents. The worked example here uses the linear gain $rel_{i}$. <br>

$
\begin{align}
& DCG_{5} = \sum_{i=1}^{5} \frac{rel_{i}}{\log_{2} (i+1)} \\
& = \frac{3}{\log_{2}(2)} + \frac{2}{\log_{2}(3)} + \frac{0}{\log_{2}(4)} + \frac{0}{\log_{2}(5)} + \frac{1}{\log_{2}(6)} \\
& \approx 4.65
\end{align}
$ <br>

<b>STEP 3: </b> <br>
Search result lists vary in length depending on the query. Comparing a search engine's performance from one query to the next cannot be consistently <br>
achieved using DCG alone, so the cumulative gain at each position for a chosen value of p should be normalized across queries.  <br>
$
\begin{align}
& nDCG_{p} = \frac{DCG_{p}}{IDCG_{p}} \\
& DCG_{p} = \sum_{i=1}^{p} \frac{rel_{i}}{\log_{2} (i+1)} \\
& IDCG_{p} = \sum_{i=1}^{|REL_{p}|} \frac{rel_{i}}{\log_{2} (i+1)} \\
& REL_{p} :-  \text{represents the list of relevant documents ordered by their relevance in the corpus up to position p.} \\
& rel_{i} :-  \text{graded relevance of the result at position i}
\end{align}
$ <br>

$
\begin{align}
& DCG_{5} = \sum_{i=1}^{5} \frac{rel_{i}}{\log_{2} (i+1)} \\
& = \frac{3}{\log_{2}(2)} + \frac{2}{\log_{2}(3)} + \frac{0}{\log_{2}(4)} + \frac{0}{\log_{2}(5)} + \frac{1}{\log_{2}(6)} \\
& \approx 4.65 \\
& \text{Now arrange the documents in descending order of relevance and calculate DCG again to get the Ideal Discounted Cumulative Gain (IDCG).} \\
& IDCG_{5} = \sum_{i=1}^{|REL_{5}|} \frac{rel_{i}}{\log_{2} (i+1)} \\
& = \frac{3}{\log_{2}(2)} + \frac{2}{\log_{2}(3)} + \frac{1}{\log_{2}(4)} + \frac{0}{\log_{2}(5)} + \frac{0}{\log_{2}(6)}  \\
& \approx 4.76  \\
& nDCG_{5} = \frac{4.65}{4.76} \approx 0.98
\end{align}
$ <br>
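The three steps can be reproduced in a few lines. This sketch uses the linear gain rel_i / log2(i + 1), which is the form the worked terms above follow (3/log2(2), 2/log2(3), and so on); the function names are made up for illustration.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending relevance) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

relevances = [3, 2, 0, 0, 1]          # D_1 .. D_5 in the order shown to the user

cg = sum(relevances)                  # STEP 1: cumulative gain
dcg_value = dcg(relevances)           # STEP 2: discounted cumulative gain
idcg_value = dcg(sorted(relevances, reverse=True))
ndcg_value = ndcg(relevances)         # STEP 3: normalized DCG

print(cg, round(dcg_value, 2), round(idcg_value, 2), round(ndcg_value, 2))
```

The only difference between the actual and ideal orderings is that D_5 (relevance 1) sits at rank 5 instead of rank 3, which is why the normalized score ends up close to 1.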

'''
For code follow 
https://github.com/dkaterenchuk/ranking_measures
https://github.com/karlhigley/ranking-metrics-torch
'''

# Extra

<b>Q) Difference between loss, cost function, Objective function</b> <br>
<b>Loss function</b> is usually a function defined on a single data point, its prediction and its label, and measures the penalty. For example, for a linear model: <br>
$
\begin{align}
& \text{square loss} :->  \\
& l(f(x_{i}|\theta), y_{i}) = \left(\sum_{j=0}^{d} x_{ij}\theta_{j} - y_{i}\right)^2
\end{align}
$ <br>
<br>
<b>Cost function</b> is usually more general. It might be a sum of loss functions over your training set plus some model complexity penalty (regularization). For example: <br>
$
\begin{align}
& \text{Mean Squared Error with an L2 penalty} \\
& MSE(\theta) = \frac{1}{N} \sum_{i=1}^{N}\left(\sum_{j=0}^{d} x_{ij}\theta_{j} - y_{i}\right)^2 + \lambda \sum_{j=0}^{d}\theta_{j}^2
\end{align}
$ <br>
<br>
<b>Objective function</b> is the most general term for any function that you optimize during training. <br>
Not all objective functions are cost functions. For example, <br>
maximum likelihood estimation (MLE) uses the likelihood as its objective function, and the likelihood is maximized rather than minimized. <br>
A loss function is part of a cost function, which in turn is a type of objective function.
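A small sketch of the distinction, assuming a linear model and an L2 penalty. The function names, the toy data, and the value of `lam` are made up for illustration: `square_loss` scores one data point, while `cost` averages that loss over the training set and adds the regularization term.

```python
def predict(x, theta):
    """Linear model: sum_j x_j * theta_j."""
    return sum(xj * tj for xj, tj in zip(x, theta))

def square_loss(x, y, theta):
    """Loss: penalty for a single data point."""
    return (predict(x, theta) - y) ** 2

def cost(xs, ys, theta, lam):
    """Cost: mean loss over the training set plus an L2 penalty on theta."""
    data_term = sum(square_loss(x, y, theta) for x, y in zip(xs, ys)) / len(xs)
    penalty = lam * sum(t ** 2 for t in theta)
    return data_term + penalty

# Toy training set; the first feature is a constant 1 acting as the bias input.
xs = [[1.0, 2.0], [1.0, 0.0], [1.0, -1.0]]
ys = [5.0, 1.0, -1.0]

perfect_theta = [1.0, 2.0]   # reproduces every y exactly
zero_theta = [0.0, 0.0]

print(square_loss(xs[0], ys[0], zero_theta))   # one point, no regularization
print(cost(xs, ys, perfect_theta, lam=0.1))    # zero data term, only the penalty
print(cost(xs, ys, zero_theta, lam=0.1))       # zero penalty, only the data term
```

A perfect fit still has a non-zero cost here, because the cost also charges for the size of the parameters. Maximizing a likelihood instead of minimizing this cost would be an objective function that is not a cost function.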
    

<b> Q) When to use L1 loss and L2 loss? </b> <br>
L2 is much more sensitive to outliers because the differences are squared, whilst L1 uses the absolute difference and is therefore not as sensitive. <br>
The choice between L1 and L2 comes down to how much you want to punish outliers in your predictions. If minimising large outliers is important for your <br>
model, then L2 is best, as the squaring highlights them more; however, if occasional large outliers are not an issue, then L1 may be best. <br>

<b>Huber Loss</b> is a combination of MAE and MSE that addresses the weaknesses of both. <br>
The problem with MAE when training neural nets is that its gradient has the same magnitude everywhere, even for small errors, which can lead to overshooting the minimum at the end of training with gradient descent. <br>
For MSE, the gradient decreases as the loss gets close to its minimum, making it more precise, but it is less robust to outliers. <br>

Huber loss helps in such cases: it is quadratic around the minimum, so the gradient shrinks there, and linear for large errors, so it is more robust to outliers than MSE. <br>
Therefore, it combines good properties of both MSE and MAE. However, the drawback of Huber loss is that the hyperparameter delta has to be tuned, which is an iterative process.
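The effect of a single outlier on the three losses can be seen with a short sketch. The error values are made up, and the Huber loss uses the standard piecewise definition with an assumed delta of 1.0.

```python
def mae(errors):
    """Mean absolute error (L1)."""
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    """Mean squared error (L2)."""
    return sum(e ** 2 for e in errors) / len(errors)

def huber(errors, delta=1.0):
    """Quadratic for |e| <= delta, linear beyond it."""
    total = 0.0
    for e in errors:
        if abs(e) <= delta:
            total += 0.5 * e ** 2
        else:
            total += delta * (abs(e) - 0.5 * delta)
    return total / len(errors)

clean = [0.5, -0.2, 0.1]        # prediction errors without outliers
with_outlier = clean + [10.0]   # the same errors plus one large outlier

for name, fn in [("MAE", mae), ("MSE", mse), ("Huber", huber)]:
    print(name, round(fn(clean), 4), round(fn(with_outlier), 4))
```

One outlier moves MSE from 0.1 to about 25, while MAE and Huber both stay below 3: the squared term lets a single point dominate, whereas the linear regime of Huber keeps the outlier's contribution proportional to its size.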

# Resources
1) https://machinelearningmastery.com/cross-entropy-for-machine-learning/
2) https://neptune.ai/blog/pytorch-loss-functions
3) https://www.youtube.com/watch?v=7q7E91pHoW4
4) https://towardsdatascience.com/sigmoid-and-softmax-functions-in-5-minutes-f516c80ea1f9
5) https://stats.stackexchange.com/questions/207794/what-loss-function-for-multi-class-multi-label-classification-tasks-in-neural-n
6) https://stackoverflow.com/questions/52855843/multi-label-classification-in-pytorch
7) https://discuss.pytorch.org/t/what-kind-of-loss-is-better-to-use-in-multilabel-classification/32203/43
8) https://github.com/christianversloot/machine-learning-articles/blob/main/how-to-use-kullback-leibler-divergence-kl-divergence-with-keras.md
9) https://debuggercafe.com/sparse-autoencoders-using-kl-divergence-with-pytorch/
10) https://shangeth.com/post/kl-divergence/
11) https://gombru.github.io/2018/05/23/cross_entropy_loss/
12) https://gombru.github.io/2019/04/03/ranking_loss/
13) https://amaarora.github.io/2020/06/29/FocalLoss.html
14) https://www.kaggle.com/code/bigironsphere/loss-function-library-keras-pytorch/notebook
15) https://www.geeksforgeeks.org/normalized-discounted-cumulative-gain-multilabel-ranking-metrics-ml/
16) https://en.wikipedia.org/wiki/Discounted_cumulative_gain
17) https://www.youtube.com/watch?v=pM6DJ0ZZee0&t=141s
18) https://everdark.github.io/k9/notebooks/ml/learning_to_rank/learning_to_rank.html
19) https://towardsdatascience.com/learning-to-rank-a-complete-guide-to-ranking-using-machine-learning-4c9688d370d4
20) https://stats.stackexchange.com/questions/179026/objective-function-cost-function-loss-function-are-they-the-same-thing
21) https://stephenallwright.com/l1-vs-l2-loss/
22) https://heartbeat.comet.ml/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0
23) https://www.kaggle.com/questions-and-answers/173937
24) https://stackoverflow.com/questions/53613722/loss-function-for-simple-reinforcement-learning-algorithm