# A Summary of Loss Functions in Machine Learning

In [3]:
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

## Classification loss

### entropy
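* Motivation \
Entropy measures how uncertain a distribution p is: the more uncertain the distribution, the more bits are needed to encode it. For example: \
[$\frac{1}{4}$, $\frac{1}{4}$, $\frac{1}{4}$, $\frac{1}{4}$] is more complex than [$\frac{1}{2}$, $\frac{1}{2}$]. The former has entropy 2 and needs 2 bits to encode, while the latter has entropy 1 and needs only 1 bit.
* Definition\
$$
H(p) = -\sum_{i}p_i\log_2(p_i)
$$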
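
The two distributions from the motivation can be checked numerically. The following is a minimal pure-Python sketch; the helper name `entropy` is an assumption introduced here for illustration and is not part of the original notebook.

```python
import math

def entropy(p):
    """Shannon entropy in bits; terms with p_i == 0 contribute 0 by convention."""
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

uniform_4 = [0.25, 0.25, 0.25, 0.25]
uniform_2 = [0.5, 0.5]

print(entropy(uniform_4))  # 2.0
print(entropy(uniform_2))  # 1.0
```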

### cross entropy
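* Motivation\
Cross entropy measures how close a distribution q is to a distribution p when q is used to approximate p. The closer q is to p, the smaller the cross entropy, and vice versa. Its minimum is $H(p)$, reached when $q = p$.
* Definition\
$$
H(p,q) = -\sum_{i}p_i\log_2(q_i)
$$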
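
To illustrate the claim that cross entropy shrinks as q approaches p, here is a small pure-Python sketch. The helper `cross_entropy` and the three candidate distributions are illustrative assumptions, not taken from the original notebook.

```python
import math

def cross_entropy(p, q):
    """H(p, q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return -sum(p_i * math.log2(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.5, 0.25, 0.25]
q_exact = [0.5, 0.25, 0.25]  # identical to p
q_close = [0.4, 0.3, 0.3]    # near p
q_far = [0.05, 0.05, 0.9]    # far from p

h_exact = cross_entropy(p, q_exact)  # equals the entropy of p (1.5 bits)
h_close = cross_entropy(p, q_close)
h_far = cross_entropy(p, q_far)

# The values increase as q moves away from p
print(h_exact, h_close, h_far)
```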

### KL divergence
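* Motivation\
It measures how far apart two probability distributions are (it is not a true distance, because in general $D_{KL}(p,q) \neq D_{KL}(q,p)$).\
KL divergence, also called relative entropy, is the cross entropy between p and q minus the entropy of p itself.
* Definition\
$$
\begin{aligned}
D_{KL}(p,q) &= \sum_{i}p_i\log_2\left(\frac{p_i}{q_i}\right)\\
         &= \sum_{i}p_i\left[\log_2(p_i)-\log_2(q_i)\right]\\
         &= \sum_{i}p_i\log_2(p_i)-\sum_{i}p_i\log_2(q_i)\\
         &= -H(p) + H(p,q) \\
         &= H(p,q) - H(p)
\end{aligned}
$$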
An intuitive explanation: https://www.youtube.com/watch?v=SxGYPqCgJWM \
* Example\
Consider two coins, 1 and 2:
$$
\text{coin}_1 = \begin{cases}
p_1, & \text{head} \\
p_2, & \text{tail}
\end{cases}
$$
$$
\text{coin}_2 = \begin{cases}
q_1, & \text{head} \\
q_2, & \text{tail}
\end{cases}
$$
An observed sequence of coin tosses: $[H, H, T, H, T, H, H, T, H, T, T, H]$
Let $N_h$ and $N_t$ be the numbers of heads and tails among the $N = N_h + N_t$ tosses. If the tosses come from coin 1, then $N_h/N \to p_1$ and $N_t/N \to p_2$ as $N \to \infty$, so:
$$
\begin{aligned}
P(\text{observations} \mid \text{coin}_1) &= p_1^{N_h}p_2^{N_t}\\
P(\text{observations} \mid \text{coin}_2) &= q_1^{N_h}q_2^{N_t}\\
\frac{P(\text{observations} \mid \text{coin}_1)}{P(\text{observations} \mid \text{coin}_2)} &= \frac{p_1^{N_h}p_2^{N_t}}{q_1^{N_h}q_2^{N_t}} \\
\frac{1}{N}\log\left(\frac{P(\text{observations} \mid \text{coin}_1)}{P(\text{observations} \mid \text{coin}_2)}\right) &= \frac{1}{N}\log\left(\frac{p_1^{N_h}p_2^{N_t}}{q_1^{N_h}q_2^{N_t}}\right)\\
&= \frac{1}{N}\left[\log(p_1^{N_h}) + \log(p_2^{N_t})-\log(q_1^{N_h}) - \log(q_2^{N_t})\right]\\
&= \frac{N_h}{N}\log(p_1) + \frac{N_t}{N}\log(p_2)-\frac{N_h}{N}\log(q_1) - \frac{N_t}{N}\log(q_2)\\
\lim_{N \to \infty}\frac{1}{N}\log\left(\frac{P(\text{observations} \mid \text{coin}_1)}{P(\text{observations} \mid \text{coin}_2)}\right) &= p_1\log(p_1) + p_2\log(p_2)-p_1\log(q_1) - p_2\log(q_2)\\
&= p_1\log(p_1)-p_1\log(q_1) + p_2\log(p_2) - p_2\log(q_2)\\
&= p_1\log\left(\frac{p_1}{q_1}\right) + p_2\log\left(\frac{p_2}{q_2}\right)
\end{aligned}
$$
Generalizing to multiple classes:
$$
D_{KL}(p,q) = \sum_{i}p_i\log_2\left(\frac{p_i}{q_i}\right)
$$
* Applications
    * R-Drop: https://github.com/dropreg/R-Drop
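
The coin derivation can be checked by simulation: when the tosses are generated by coin 1, the average log-likelihood ratio should approach $D_{KL}(p,q)$ as $N$ grows. The sketch below is only an illustration under stated assumptions: the probabilities `p` and `q`, the seed, the sample size, and both helper names are made up for this example, and base-2 logarithms are used to match the definition above.

```python
import math
import random

def kl_divergence(p, q):
    """D_KL(p, q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def avg_log_likelihood_ratio(n_heads, n_tails, p, q):
    """(1/N) * log2(P(obs | coin_1) / P(obs | coin_2)), computed in log space."""
    n = n_heads + n_tails
    return (n_heads * math.log2(p[0] / q[0]) + n_tails * math.log2(p[1] / q[1])) / n

p = [0.7, 0.3]  # coin_1: P(head), P(tail), illustrative values
q = [0.5, 0.5]  # coin_2: P(head), P(tail), illustrative values

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 200_000
n_heads = sum(rng.random() < p[0] for _ in range(n))  # tosses drawn from coin_1
estimate = avg_log_likelihood_ratio(n_heads, n - n_heads, p, q)

# The simulated average and the closed-form KL divergence should be close
print(estimate, kl_divergence(p, q))
```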

In [26]:
import numpy as np

In [32]:
# True distribution y: [0.1, 0.2, 0.7]
# Predicted distribution y_hat: [0.1, 0.2, 0.7]
# i.e. the prediction matches the true distribution exactly
# 1) Cross entropy H(y, y_hat)
H_y_yhat = -(0.1*np.log2(0.1) + 0.2*np.log2(0.2) + 0.7*np.log2(0.7)) # ≈ 1.157

# 2) Entropy of the true distribution y itself
H_y = -(0.1*np.log2(0.1) + 0.2*np.log2(0.2) + 0.7*np.log2(0.7)) # ≈ 1.157

# 3) KL divergence between y and y_hat
D_kl_y_yhat = 0.1*np.log2(0.1/0.1) + 0.2*np.log2(0.2/0.2) + 0.7*np.log2(0.7/0.7) # 0

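The example above shows:\
A large cross entropy between the predicted and true distributions may simply be caused by the true distribution itself having high entropy. Here the prediction is perfect, yet the cross entropy equals $H(y)$, which is well above 0. The KL divergence between the two distributions is a better indicator of how close they are.\
When the true distribution is one-hot, its entropy is 0, so the cross entropy equals the KL divergence: the closer the prediction is to the true distribution, the closer the cross entropy is to 0, and vice versa.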
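
Both statements can be illustrated with an imperfect prediction and with a one-hot target. This is a hedged pure-Python sketch: the predicted distributions and the helper functions are assumptions chosen for illustration, not values from the original notebook.

```python
import math

def entropy(p):
    """H(p) in bits; zero-probability terms contribute 0."""
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

def cross_entropy(p, q):
    """H(p, q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return -sum(p_i * math.log2(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# Soft target: cross entropy stays large even for a good prediction,
# while the KL divergence H(y, y_hat) - H(y) is small.
y = [0.1, 0.2, 0.7]
y_hat = [0.2, 0.2, 0.6]  # illustrative, slightly wrong prediction
h_y_yhat = cross_entropy(y, y_hat)
d_kl = h_y_yhat - entropy(y)
print(h_y_yhat, d_kl)

# One-hot target: H(y) = 0, so cross entropy and KL divergence coincide.
y_onehot = [0.0, 0.0, 1.0]
ce_good = cross_entropy(y_onehot, [0.01, 0.01, 0.98])  # near the target
ce_bad = cross_entropy(y_onehot, [0.4, 0.4, 0.2])      # far from the target
print(ce_good, ce_bad)
```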

In [None]:
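# R-Drop
# Core idea: with dropout enabled, two forward passes of the same input go through two
# different sub-models, and their output distributions should stay consistent. For example,
# the representations of one sentence after two passes through ERNIE + dropout should agree.
# A KL-divergence penalty is used as the regularizer.
# Note: this import rebinds F, which referred to paddle.nn.functional earlier in the notebook.
import torch.nn.functional as F

# define your task model, which outputs the classifier logits
# (TaskModel, x and label are placeholders to be supplied by your task)
model = TaskModel()

def compute_kl_loss(p, q, pad_mask=None):
    # F.kl_div takes log-probabilities as input and probabilities as target
    p_loss = F.kl_div(F.log_softmax(p, dim=-1), F.softmax(q, dim=-1), reduction='none')
    q_loss = F.kl_div(F.log_softmax(q, dim=-1), F.softmax(p, dim=-1), reduction='none')
    # pad_mask is for seq-level tasks
    if pad_mask is not None:
        p_loss.masked_fill_(pad_mask, 0.)
        q_loss.masked_fill_(pad_mask, 0.)
    # You can choose whether to use function "sum" and "mean" depending on your task
    p_loss = p_loss.sum()
    q_loss = q_loss.sum()
    # symmetric KL: average of the two directions
    loss = (p_loss + q_loss) / 2
    return loss

# keep dropout and forward twice
logits = model(x)
logits2 = model(x)

# cross entropy loss for classifier
ce_loss = 0.5 * (F.cross_entropy(logits, label) + F.cross_entropy(logits2, label))
kl_loss = compute_kl_loss(logits, logits2)

# carefully choose hyper-parameters
alpha = 1.0  # placeholder weight for the KL term; tune it for your task
loss = ce_loss + alpha * kl_loss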

## Ranking loss

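## reduction: sum or mean
Every loss has a `reduction` parameter, usually with three options: `none`, `sum`, and `mean`. It controls whether the per-sample losses in a batch are summed or averaged (`none` leaves them unreduced); the default is averaging. Averaging removes the dependence on batch size, so the learning_rate does not have to be adjusted whenever batch_size changes. Here is the mathematical analysis:\
Take MSE loss as an example: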

$$
Loss=\begin{cases}
\sum_{i=1}^N (\hat{y_i}-y_i)^2,  & reduction=sum \\
\frac{1}{N}\sum_{i=1}^N (\hat{y_i}-y_i)^2, & reduction=mean
\end{cases}
$$
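where batch_size $=N$, $\hat{y_i}=f(x_i; W)$, $x_i$ is the $i$-th sample, and $f(\cdot)$ is the model with parameters $W$.\
Taking the partial derivative with respect to the parameters $W$:\
$$
\frac{\partial{Loss}}{\partial{W}}=\begin{cases}
\sum_{i=1}^N 2(\hat{y_i}-y_i)\frac{\partial{\hat{y_i}}}{\partial{W}},  & reduction=sum \\
\frac{1}{N}\sum_{i=1}^N 2(\hat{y_i}-y_i)\frac{\partial{\hat{y_i}}}{\partial{W}}, & reduction=mean
\end{cases}
$$
With reduction=sum, the gradient is a sum of $N$ per-sample terms, so its magnitude grows roughly in proportion to batch_size. With reduction=mean, the gradient is the average of those terms, so its magnitude stays roughly the same as batch_size changes. That is why the default reduction is mean.\
Verification in code: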

In [25]:
# setup
batch_size=128
feature_num = 10
model = nn.Linear(feature_num, 1)
x = paddle.randn([batch_size, feature_num])
y = paddle.randn([batch_size, 1])

# mean
criterion = nn.MSELoss(reduction='mean')
out = model(x)
loss = criterion(out, y)
loss.backward()
print(model.weight.grad.abs().sum())
# batch_size=10, 10 runs
# 12.3  12.4  6.8  5.9  14  6  9  9  5.9  23.6
# batch_size=128, recorded runs
# 9   9   4.8   10.7   5.7  10  8.4  7.4  8.6

# sum
model.clear_gradients()
criterion = nn.MSELoss(reduction='sum')
out = model(x)
loss = criterion(out, y)
loss.backward()
print(model.weight.grad.abs().sum())
# batch_size=10, 10 runs
# 122.9  123.6  68  58.9  143  61  91  90  58.6  236.6
# batch_size=128, recorded runs
# 1160   1182   619   1380  732  1366.7  1076  956  1107

Tensor(shape=[1], dtype=float32, place=CPUPlace, stop_gradient=False,
       [8.65152359])
Tensor(shape=[1], dtype=float32, place=CPUPlace, stop_gradient=False,
       [1107.39501953])


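The experiment above shows:\
<font color=red>**reduction=mean keeps the scale of the loss and its gradient from changing with batch_size, which makes gradient updates more stable, so the learning_rate does not need to be adjusted for batch_size**</font>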
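
The same factor-of-batch_size effect can be reproduced by hand, without any framework. The sketch below assumes a hypothetical one-parameter model $\hat{y} = w \cdot x$ and made-up data; it is meant only to make the scaling visible, not to mirror the Paddle experiment exactly.

```python
def mse_grads(w, xs, ys):
    """Gradient of the MSE loss w.r.t. the single weight w of y_hat = w * x.

    Returns (gradient under reduction='sum', gradient under reduction='mean').
    """
    grad_sum = sum(2 * (w * x - y) * x for x, y in zip(xs, ys))
    return grad_sum, grad_sum / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 9.0]

small = mse_grads(w, xs, ys)
# Repeat the same samples 32 times: batch_size grows from 4 to 128
large = mse_grads(w, xs * 32, ys * 32)

print(small)  # (-88.0, -22.0)
print(large)  # (-2816.0, -22.0)
```

With the data distribution held fixed, the `sum` gradient grows 32 times along with the batch, while the `mean` gradient is unchanged.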
