In [133]:
"""
Title: Learning various pytorch loss function APIs via a video blog
Author: Kaikai Zhao
Reference: 
https://www.bilibili.com/video/BV1Sv4y1A7dz/?spm_id_from=333.788&vd_source=295aeb7cc6407338dd3e15d41a6b90ed
"""

import torch
import torch.nn as nn
import torch.nn.functional as F

# generate data
# logits shape: [Batchsize, Number of classes]
batchsize = 2
num_class = 4

logits = torch.randn(batchsize, num_class) # the output of a model
target_indices = torch.randint(num_class, size=(batchsize,)) # delta-type target distribution
target_logits = torch.randn(batchsize, num_class).softmax(dim=1) # the target is a distribution (despite the name, these are probabilities, not logits)

The following description is taken from the English Wikipedia article on cross-entropy.
## Cross entropy
In information theory, the cross-entropy between two probability distributions $p$ and $q$ over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is optimized for an estimated probability distribution $q$, rather than the true distribution $p$.
### Definition
The cross-entropy of the distribution $q$ relative to a distribution $p$ over a given set is defined as follows.
$$
H(p,q)=-\mathrm{E}_p[\log q],
$$
where $\mathrm{E}_p$ is the expected value operator with respect to the distribution $p$.
### Discrete case
For discrete probability distributions $p$ and $q$ with the same support $\mathcal{X}$, this means
$$
H(p,q)=-\sum_{x\in\mathcal{X}}p(x)\log q(x)
$$
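The discrete formula can be checked by hand with a few lines of plain Python. This is only a sketch: the two distributions are made-up numbers, and the natural logarithm is used (so the result is in nats rather than bits), which matches what PyTorch's loss functions compute.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x), in nats (natural log)."""
    # Terms with p(x) == 0 contribute nothing, by the convention 0 * log q = 0.
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]   # "true" distribution (made-up numbers)
q = [0.25, 0.25, 0.5]   # estimated distribution (made-up numbers)

h_pq = cross_entropy(p, q)
h_pp = cross_entropy(p, p)  # cross-entropy of p with itself is the entropy H(p)

print(f"H(p, q) = {h_pq:.4f}")
print(f"H(p)    = {h_pp:.4f}")
```

The cross-entropy is never smaller than the entropy of $p$, and the two are equal only when $q=p$.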
### Continuous case
The situation for continuous distributions is analogous. We have to assume that $p$ and $q$ are absolutely continuous w.r.t. some reference measure $r$ (usually $r$ is a Lebesgue measure on a Borel $\sigma$-algebra). Let $P$ and $Q$ be probability density functions of $p$ and $q$ w.r.t. $r$. Then
$$
-\int_{\mathcal{X}}P(x)\log Q(x)\mathrm{d}r(x)=\mathrm{E}_{p}[-\log Q]
$$
and therefore
$$
H(p,q)=-\int_{\mathcal{X}}P(x)\log Q(x)\mathrm{d}r(x)
$$
Note that the notation $H(p,q)$ is also used for a different concept, the joint entropy of $p$ and $q$.

In [134]:
## 1. Call cross entropy loss

### method 1 for CE loss: when targets are integers, like classification tasks
ce_loss_fn = nn.CrossEntropyLoss() # instantiation; this loss applies log-softmax to the input internally
ce_loss1 = ce_loss_fn(logits, target_indices) # the input is expected to contain raw, unnormalized scores (logits) for each class

# The leading f makes this an f-string (formatted string literal, Python 3.6+).
# Expressions inside the curly brackets are evaluated, so you can also do arithmetic there.
print(f"ce loss1: {ce_loss1}")  

### method 2 for CE loss: when target is a distribution
ce_loss2 = ce_loss_fn(logits, target_logits)
print(f"ce loss2: {ce_loss2}")  

ce loss1: 1.7834562063217163
ce loss2: 1.9118367433547974
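To see what method 2 computes, the same value can be reproduced from the discrete definition of $H(p,q)$. The sketch below uses its own seeded random tensors (the seed and the names `target_probs`, `ce_builtin`, `ce_manual` are illustrative, not from the notebook) and assumes a PyTorch version that accepts class-probability targets (1.10 or later).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for repeatability
logits = torch.randn(2, 4)                       # [batch, classes]
target_probs = torch.randn(2, 4).softmax(dim=1)  # each row sums to 1

# Built-in: cross entropy with class-probability targets.
ce_builtin = F.cross_entropy(logits, target_probs)

# Manual: H(p, q) = -sum_x p(x) log q(x), with q = softmax(logits),
# averaged over the batch (the default reduction="mean").
log_q = F.log_softmax(logits, dim=1)
ce_manual = -(target_probs * log_q).sum(dim=1).mean()

print(torch.allclose(ce_builtin, ce_manual))
```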


## Relation to Maximum likelihood
This part is taken from the Wikipedia article on cross-entropy.

In classification problems we want to estimate the probability of different outcomes. Let the estimated probability of outcome $i$ be $q_{\theta}(X=i)$ with to-be-optimized parameters $\theta$, and let the frequency (empirical probability) of outcome $i$ in the training set be $p(X=i)$. Given $N$ conditionally independent samples in the training set, the likelihood of the parameters $\theta$ of the model $q_{\theta}(X=i)$ on the training set is
$$
\mathcal{L}(\theta)=\prod_{i\in X}(\text{est. probability of }i)^{\text{number of occurrences of }i}=\prod_{i\in X}(q_{\theta}(X=i))^{Np(X=i)}
$$
so the log-likelihood divided by $N$ is
$$
\frac{1}{N}\log(\mathcal{L}(\theta))=\frac{1}{N}\log\prod_{i\in X}(q_{\theta}(X=i))^{Np(X=i)}=\sum_{i}p(X=i)\log q_{\theta}(X=i)=-H(p,q)
$$
so that maximizing the log-likelihood w.r.t. the parameters $\theta$ is the same as minimizing the cross-entropy. 
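The identity above can be verified numerically on a toy training set. This is a sketch with made-up samples and made-up model probabilities; it only checks that the average log-likelihood equals $-H(p,q)$ when $p$ is the empirical frequency.

```python
import math
from collections import Counter

samples = [0, 0, 1, 2, 0, 1, 0, 2]  # made-up training outcomes
q_theta = [0.5, 0.3, 0.2]           # made-up model probabilities

N = len(samples)
counts = Counter(samples)
p = [counts[i] / N for i in range(len(q_theta))]  # empirical frequencies

# (1/N) * log-likelihood = (1/N) * sum over samples of log q_theta(sample)
avg_log_lik = sum(math.log(q_theta[s]) for s in samples) / N

# Cross-entropy H(p, q_theta) = -sum_i p(i) log q_theta(i)
h_pq = -sum(pi * math.log(qi) for pi, qi in zip(p, q_theta) if pi > 0)

print(math.isclose(avg_log_lik, -h_pq))
```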

In [127]:
## 2. call Negative log-likelihood loss
m = nn.LogSoftmax(dim=-1)
nll_loss_fn = nn.NLLLoss()
nll_loss = nll_loss_fn(m(logits),target_indices)
print(f"nll loss: {nll_loss}") 
### NLL loss = CE loss when the target distribution is a delta (one-hot) distribution: nn.CrossEntropyLoss is equivalent to nn.LogSoftmax followed by nn.NLLLoss

nll loss: 0.7282825708389282
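The printed NLL value differs from `ce loss1` above, presumably because the cells were executed on different random draws (the execution counters are out of order). A self-contained check on one fixed set of tensors makes the equality explicit. The seed and the variable names `ce`, `nll`, `manual` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability
logits = torch.randn(2, 4)
target_indices = torch.randint(4, size=(2,))

ce = nn.CrossEntropyLoss()(logits, target_indices)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=-1)(logits), target_indices)

# Manual: pick -log q(target) for each sample, then average over the batch.
log_q = torch.log_softmax(logits, dim=-1)
manual = -log_q[torch.arange(2), target_indices].mean()

print(torch.allclose(ce, nll), torch.allclose(ce, manual))
```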


## KL divergence
### Definition
For discrete probability distributions $P$ and $Q$ defined on the same probability space, the relative entropy from $Q$ to $P$ is defined to be
$$
D_{KL}(P\|Q)=\sum_{x\in\mathcal{X}}P(x)\log\frac{P(x)}{Q(x)}
$$
In other words, it is the expectation of the logarithmic difference between the probabilities $P$ and $Q$, where the expectation is taken using the probabilities $P$.

For distributions $P$ and $Q$ of a continuous random variable, relative entropy is defined to be the integral:
$$
D_{KL}(P\|Q)=\int_{-\infty}^{+\infty}p(x)\log\frac{p(x)}{q(x)}\mathrm{d}x
$$
where $p$ and $q$ denote the probability densities of $P$ and $Q$. In this notebook we abuse notation and use $P,Q$ and $p,q$ interchangeably.
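A small plain-Python sketch of the discrete definition, with made-up distributions, illustrates two basic properties: the divergence of a distribution from itself is zero, and $D_{KL}(P\|Q)$ is in general not equal to $D_{KL}(Q\|P)$.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)), in nats."""
    # Terms with P(x) == 0 contribute nothing, by the convention 0 * log 0 = 0.
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]  # made-up distribution
q = [0.8, 0.1, 0.1]    # made-up distribution

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)

print(f"D_KL(P||Q) = {kl_pq:.4f}")
print(f"D_KL(Q||P) = {kl_qp:.4f}")
print(f"D_KL(P||P) = {kl_pp:.4f}")
```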

In [128]:
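## 3. KL divergence
# nn.KLDivLoss expects log-probabilities as input and probabilities as target.
# The default reduction="mean" averages over all batchsize * num_class elements, not per sample.
kl_loss_fn = nn.KLDivLoss()
kl_loss = kl_loss_fn(m(logits), target_logits.softmax(dim=-1))
print(f"KL loss: {kl_loss}") 

# reduction="batchmean" sums over classes and divides by the batch size, which matches the definition
kl_loss_fn_mathematically_correct = nn.KLDivLoss(reduction="batchmean")
kl_loss = kl_loss_fn_mathematically_correct(m(logits), target_logits.softmax(dim=-1))
print(f"KL loss: {kl_loss}") 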

KL loss: 0.0361519530415535
KL loss: 0.144607812166214
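The two printed values differ by a factor of `num_class` (0.1446 is about 4 × 0.0362). The sketch below reproduces that relationship on its own seeded tensors and compares `batchmean` against a manual evaluation of the definition. Variable names and the seed are illustrative. Note the argument order: the first argument is $\log q$ (the model), the second is $p$ (the target).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for repeatability
batchsize, num_class = 2, 4
log_q = F.log_softmax(torch.randn(batchsize, num_class), dim=-1)  # input: log-probabilities
p = torch.randn(batchsize, num_class).softmax(dim=-1)             # target: probabilities

kl_sum = nn.KLDivLoss(reduction="sum")(log_q, p)
kl_batchmean = nn.KLDivLoss(reduction="batchmean")(log_q, p)
kl_mean = nn.KLDivLoss(reduction="mean")(log_q, p)  # may emit a UserWarning

# Manual: sum_x p(x) * (log p(x) - log q(x)) per sample, then average over the batch.
manual = (p * (p.log() - log_q)).sum(dim=-1).mean()

print(torch.allclose(kl_batchmean, manual))
print(torch.allclose(kl_mean, kl_batchmean / num_class))
```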




The definition of cross-entropy may be formulated using KL divergence, $D_{KL}(p\|q)$, divergence of $p$ from $q$ (a.k.a the relative entropy of $p$ w.r.t. $q$).
$$
H(p,q)=H(p)+D_{KL}(p\|q)
$$
In general, the target entropy $H(p)$ does not depend on the model parameters, so it is a constant during optimization, and minimizing the CE loss is equivalent to minimizing the KL loss. In particular, when the target distribution is a delta distribution, $H(p)=0$ and the two losses coincide.

In [129]:
## 4. verify CE loss = Info entropy + KL loss
ce_loss_sample_fn = nn.CrossEntropyLoss(reduction="none")
ce_loss_sample = ce_loss_sample_fn(logits, torch.softmax(target_logits,dim=-1))
print(f"ce loss sample: {ce_loss_sample}") 

kl_loss_sample_fn = nn.KLDivLoss(reduction="none")
kl_loss_sample = kl_loss_sample_fn(m(logits), target_logits.softmax(dim=-1)).sum(dim=-1)
print(f"KL loss sample: {kl_loss_sample}") 

target_info_entropy_sample = torch.distributions.Categorical(probs=target_logits.softmax(dim=-1)).entropy()
print(f"target_info_entropy_sample: {target_info_entropy_sample}") # when the target distribution is a delta distribution, the info entropy is 0.

# The target info entropy does not depend on the model, so optimizing the CE loss is equivalent to optimizing the KL loss
print(torch.allclose(ce_loss_sample,kl_loss_sample+target_info_entropy_sample)) 

ce loss sample: tensor([1.5688, 1.4553])
KL loss sample: tensor([0.1949, 0.0943])
target_info_entropy_sample: tensor([1.3739, 1.3610])
True


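## Binary cross entropy loss
$$
l_n(x_n,y_n)=-[y_n\log p(x_n) + (1-y_n)\log(1-p(x_n))]
$$
where $y_n\in\{0,1\}$ is the target label and $p(x_n)$ is the predicted probability that sample $n$ belongs to class 1.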

In [130]:
## 5. Binary cross entropy loss
BCE_loss_fn = nn.BCELoss() # expects probabilities in [0, 1], not logits
logits = torch.randn(batchsize)
prob_1 = torch.sigmoid(logits)
target = torch.randint(2, size=(batchsize, ))
BCE_loss = BCE_loss_fn(prob_1, target.float())
print(f"BCE_loss: {BCE_loss}")

### calculate BCE loss using NLL loss
prob_1 = prob_1.unsqueeze(dim=-1)
prob_0 = 1 - prob_1
prob = torch.cat([prob_0, prob_1], dim=-1)
nll_loss_binary = nll_loss_fn(torch.log(prob), target)
print(f"nll_loss binary: {nll_loss_binary}")
print(torch.allclose(nll_loss_binary,BCE_loss))

BCE_loss: 0.9651271104812622
nll_loss binary: 0.9651271104812622
True
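Instead of applying the sigmoid by hand, the logits can be passed directly to `binary_cross_entropy_with_logits`. The following sketch, on its own seeded tensors with illustrative names, checks that both routes agree with the formula above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for repeatability
logits = torch.randn(5)
target = torch.randint(2, size=(5,)).float()

# Route 1: sigmoid first, then BCE on probabilities.
bce_from_probs = F.binary_cross_entropy(torch.sigmoid(logits), target)

# Route 2: BCE directly on logits (sigmoid is applied internally).
bce_from_logits = F.binary_cross_entropy_with_logits(logits, target)

# Manual: -[y log p + (1 - y) log(1 - p)], averaged over the batch.
p = torch.sigmoid(logits)
manual = -(target * p.log() + (1 - target) * (1 - p).log()).mean()

print(torch.allclose(bce_from_probs, bce_from_logits, atol=1e-6))
print(torch.allclose(bce_from_probs, manual, atol=1e-6))
```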


In [135]:
## 6. cosine similarity loss
cosine_loss_fn = nn.CosineEmbeddingLoss()
v1 = torch.randn(batchsize, 512)
v2 = torch.randn(batchsize, 512)
target = torch.randint(2, size=(batchsize, ))*2 - 1 # target is supposed to be -1 or 1
cosine_loss = cosine_loss_fn(v1,v2,target)
print(f"cosine_loss: {cosine_loss}")

cosine_loss: 0.9742875099182129
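For reference, the cosine embedding loss can be reproduced by hand: for target 1 the per-sample loss is $1-\cos(v_1,v_2)$, and for target -1 it is $\max(0,\cos(v_1,v_2)-\text{margin})$. The sketch below assumes the default `margin=0` and uses its own seeded tensors with illustrative names and sizes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for repeatability
v1 = torch.randn(4, 16)
v2 = torch.randn(4, 16)
target = torch.tensor([1, -1, 1, -1])
margin = 0.0  # the default margin

builtin = F.cosine_embedding_loss(v1, v2, target, margin=margin)

# Manual: 1 - cos for similar pairs, max(0, cos - margin) for dissimilar pairs.
cos = F.cosine_similarity(v1, v2, dim=-1)
per_sample = torch.where(target == 1, 1 - cos, torch.clamp(cos - margin, min=0))
manual = per_sample.mean()  # default reduction="mean"

print(torch.allclose(builtin, manual, atol=1e-6))
```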


# Summary of PyTorch Loss-Input Confusion (Cheatsheet)
https://github.com/rasbt/stat479-deep-learning-ss19/blob/master/other/pytorch-lossfunc-cheatsheet.md

- `torch.nn.functional.binary_cross_entropy` takes probabilities (logistic sigmoid outputs) as inputs
- `torch.nn.functional.binary_cross_entropy_with_logits` takes logits as inputs 
- `torch.nn.functional.cross_entropy` takes logits as inputs (performs `log_softmax` internally)
- `torch.nn.functional.nll_loss` is like `cross_entropy` but takes log-probabilities (log-softmax) values as inputs
- `torch.nn.functional.cosine_embedding_loss(input1, input2, target)` takes two vectors as inputs, and also a target which is required to be -1 or 1.