#### Dice Loss for Data-imbalanced NLP Tasks



* Many NLP tasks, such as tagging and machine reading comprehension, face a severe data imbalance issue: negative examples significantly outnumber positive ones, and the huge number of easy-negative examples overwhelms training. The most commonly used cross-entropy criterion is actually accuracy-oriented, which creates a discrepancy between training and test: at training time, each training instance contributes equally to the objective function, while at test time the F1 score is more concerned with positive examples.



* In this paper, we propose to use dice loss in replacement of the standard cross-entropy objective for data-imbalanced NLP tasks. Dice loss is based on the Sørensen–Dice coefficient (Sorensen, 1948) or Tversky index (Tversky, 1977), which attaches similar importance to false positives and false negatives, and is more immune to the data-imbalance issue. To further alleviate the dominating influence from easy-negative examples in training, we propose to associate training examples with dynamically adjusted weights to deemphasize easy-negative examples. Experimental results show that this strategy narrows down the gap between the F1 score in evaluation and the dice loss in training.


* With the proposed training objective, we observe significant performance boosts over a wide range of data imbalanced NLP tasks. Notably, we are able to achieve SOTA results on CTB5, CTB6 and UD1.4 for the part of speech tagging task, and competitive or even better results on CoNLL03, OntoNotes5.0, MSRA and OntoNotes4.0 for the named entity recognition task along with the machine reading comprehension and paraphrase identification tasks.


### 1. Introduction

* Data imbalance is a common issue in a variety of NLP tasks such as tagging and machine reading comprehension. Table 1 gives concrete examples: for the Named Entity Recognition (NER) task (Sang and De Meulder, 2003; Nadeau and Sekine, 2007), most tokens are background tokens with tagging class O. Specifically, tokens with tagging class O are 5 times as numerous as those with entity labels in the CoNLL03 dataset and 8 times as numerous in the OntoNotes5.0 dataset. The data imbalance issue is more severe for MRC tasks (Rajpurkar et al., 2016; Nguyen et al., 2016; Rajpurkar et al., 2018; Dasigi et al., 2019), where the negative-positive ratio is 50-200.


![avatar](图片/1.png)

<center>Table 1: Number of positive and negative examples and their ratios for different data-imbalanced NLP tasks.</center>

* Data imbalance results in the following two issues: (1) the training-test discrepancy: without balancing the labels, the learning process tends to converge to a point that is strongly biased towards the class with the majority label. This creates a discrepancy between training and test: at training time, each training instance contributes equally to the objective function, while at test time the F1 score is more concerned with positive examples; (2) the overwhelming effect of easy-negative examples. As pointed out by Meng et al. (2019), a significantly large number of negative examples also means that the number of easy-negative examples is large. The huge number of easy examples tends to overwhelm training, so the model does not sufficiently learn to distinguish between positive examples and hard-negative examples. The cross-entropy objective (CE for short) or maximum likelihood (MLE) objective, which is widely adopted as the training objective for data-imbalanced NLP tasks (Lample et al., 2016; Wu et al., 2019; Devlin et al., 2018; Yu et al., 2018a; McCann et al., 2018; Ma and Hovy, 2016; Chen et al., 2017), handles neither of the issues.


* To handle the first issue, we propose to replace CE or MLE with losses based on the Sørensen–Dice coefficient (Sorensen, 1948) or Tversky index (Tversky, 1977). The Sørensen–Dice coefficient, dice loss for short, is the harmonic mean of precision and recall. It attaches equal importance to false positives (FPs) and false negatives (FNs) and is thus more immune to data-imbalanced datasets. The Tversky index extends dice loss by using a weight that trades off precision and recall, which can be thought of as an approximation of the $F_{\beta}$ score, and thus comes with more flexibility. Therefore, we use dice loss or Tversky index to replace CE loss to address the first issue.


* Using only dice loss or Tversky index is not enough, since they are unable to address the dominating influence of easy-negative examples. This is intrinsically because dice loss is actually a soft version of the F1 score. Taking the binary classification task as an example, at test time an example will be classified as negative as long as its probability is smaller than 0.5, but training will push the value to 0 as much as possible. This gap is not a big issue for balanced datasets, but is extremely detrimental if a big proportion of training examples are easy-negative ones: easy-negative examples can easily dominate training since their probabilities can be pushed to 0 fairly easily. Meanwhile, the model can hardly distinguish between hard-negative examples and positive ones. Inspired by the idea of focal loss (Lin et al., 2017) in computer vision, we propose a dynamic weight adjusting strategy, which associates each training example with a weight in proportion to $(1 - p)$, and this weight dynamically changes as training proceeds. This strategy helps deemphasize confident examples during training as their probability $p$ approaches 1, making the model attentive to hard-negative examples, and thus alleviates the dominating effect of easy-negative examples. Combining both strategies, we observe significant performance boosts on a wide range of data-imbalanced NLP tasks.


* The rest of this paper is organized as follows: related work is presented in Section 2. We describe different proposed losses in Section 3. Experimental results are presented in Section 4. We perform ablation studies in Section 5, followed by a brief conclusion in Section 6.


### 2. Related Work

#### 2.1. Data Resampling

* The idea of weighting training examples has a long history. Importance sampling (Kahn and Marshall, 1953) assigns weights to different samples and changes the data distribution. Boosting algorithms such as AdaBoost (Kanduri et al., 2018) select harder examples to train subsequent classifiers. Similarly, hard example mining (Malisiewicz et al., 2011) downsamples the majority class and exploits the most difficult examples. Oversampling (Chen et al., 2010; Chawla et al., 2002) is used to balance the data distribution. Another line of data resampling is to dynamically control the weights of examples as training proceeds. For example, focal loss (Lin et al., 2017) used a soft weighting scheme that emphasizes harder examples during training. In self-paced learning (Kumar et al., 2010), example weights are obtained through optimizing the weighted training loss which encourages learning easier examples first. At each training step, self-paced learning algorithm optimizes model parameters and example weights jointly. Other works (Chang et al., 2017; Katharopoulos and Fleuret, 2018) adjusted the weights of different training examples based on training loss. Besides, recent work (Jiang et al., 2017; Fan et al., 2018) proposed to learn a separate network to predict sample weights.


#### 2.2. Data Imbalance Issue in Computer Vision

* The background-object label imbalance issue is severe and thus well studied in the field of object detection (Li et al., 2015; Girshick, 2015; He et al., 2015; Girshick et al., 2013; Ren et al., 2015). The idea of hard negative mining (HNM) (Girshick et al., 2013) has gained much attention recently. Pang et al. (2019) proposed a novel method called IoU-balanced sampling, and Chen et al. (2019) designed a ranking model that replaces the conventional classification task with an average-precision loss to alleviate the class imbalance issue. The efforts made on object detection have greatly inspired us to solve the data imbalance issue in NLP.



* Sudre et al. (2017) addressed the severe class imbalance issue for the image segmentation task. They proposed to use the class re-balancing property of the Generalized Dice Loss as the training objective for unbalanced tasks. Shen et al. (2018) investigated the influence of Dice-based loss for multi-class organ segmentation using a dataset of abdominal CT volumes. Kodym et al. (2018) proposed to use the batch soft Dice loss function to train the CNN network for the task of segmentation of organs at risk (OAR) of medical images. Shamir et al. (2019) extended the definition of the classical Dice coefficient to facilitate the direct comparison of a ground truth binary image with a probabilistic map. In this paper, we introduce dice loss into NLP tasks as the training objective and propose a dynamic weight adjusting strategy to address the dominating influence of easy-negative examples.


### 3. Losses

#### 3.1. Notation

* For illustration purposes, we use the binary classification task to demonstrate how different losses work. The mechanism can be easily extended to multi-class classification. Let $X$ denote a set of training instances. Each instance $x_i \in X$ is associated with a golden binary label $y_i = [y_{i0}, y_{i1}]$ denoting the ground-truth class that $x_i$ belongs to, and $p_i = [p_{i0}, p_{i1}]$ denotes the predicted probabilities of the two classes, where $y_{i0}, y_{i1} \in \{0, 1\}$, $p_{i0}, p_{i1} \in [0, 1]$ and $p_{i1} + p_{i0} = 1$.



#### 3.2 Cross-Entropy Loss

* The vanilla cross-entropy (CE) loss is given by:

<center>$CE = -\frac{1}{N}\sum\limits_{i}\sum\limits_{j\in\{0,1\}} y_{ij}\log p_{ij}  ~~~~~~~~$(1)</center>

* As can be seen from Eq.1, each $x_i$ contributes equally to the final objective. Two strategies are normally used to address the case where we do not want all $x_i$ to be treated equally: associating different classes with different weighting factors $\alpha$, or resampling the datasets. For the former, Eq.1 is adjusted as follows:


<center>$\text{Weighted CE} = -\frac{1}{N}\sum\limits_{i} \alpha_{i}\sum\limits_{j\in\{0,1\}} y_{ij}\log p_{ij}~~~~~~~~$(2)</center>

* where $\alpha_{i} \in [0, 1]$ may be set by the inverse class frequency or treated as a hyperparameter to set by cross validation. In this work, we use $lg(\frac{n-n_t}{n_t} + K)$ to calculate the coefficient $\alpha$, where $n_t$ is the number of samples with class $t$ and $n$ is the total number of samples in the training set. $K$ is a hyperparameter to tune. The data resampling strategy constructs a new dataset by sampling training examples from the original dataset based on human-designed criteria, e.g., extract equal training samples from each class. Both strategies are equivalent to changing the data distribution and thus are of the same nature. Empirically, these two methods are not widely used due to the trickiness of selecting $\alpha$ especially for multi-class classification tasks and that inappropriate selection can easily bias towards rare classes (Valverde et al., 2017).

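* The class-weighting scheme of Eq.2 can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes that $lg$ denotes the base-10 logarithm, that every example of class $t$ shares the same $\alpha$, and the toy probabilities and the default $K = 1$ are made up for the example.

```python
import math

def class_weight(n_total, n_class, k=1.0):
    """alpha_t = lg((n - n_t) / n_t + K); lg is read as log base 10 (assumption)."""
    return math.log10((n_total - n_class) / n_class + k)

def weighted_ce(probs, labels, alphas):
    """Eq. 2 for binary labels: probs[i] is p_i1, labels[i] is y_i1 in {0, 1}."""
    total = 0.0
    for p1, y1, alpha in zip(probs, labels, alphas):
        p_true = p1 if y1 == 1 else 1.0 - p1  # only the gold-class term survives
        total += alpha * math.log(p_true)
    return -total / len(probs)

# Toy data: one positive and three negatives.
labels = [1, 0, 0, 0]
probs = [0.7, 0.2, 0.1, 0.4]
n, n_pos = len(labels), sum(labels)
weights = {1: class_weight(n, n_pos), 0: class_weight(n, n - n_pos)}
alphas = [weights[y] for y in labels]
print(weights, weighted_ce(probs, labels, alphas))
```

With these toy counts the rare positive class receives the larger weight, which is the intended effect of the inverse-frequency style coefficient.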

#### 3.3 Dice Coefficient and Tversky Index

* The Sørensen–Dice coefficient (Sorensen, 1948; Dice, 1945), dice coefficient (DSC) for short, is an F1-oriented statistic used to gauge the similarity of two sets. Given two sets $A$ and $B$, the dice coefficient between them is given as follows:


<center>$DSC(A,B) = \frac{2|A\cap B|}{|A|+|B|}~~~~~~~~$(3)</center>

* In our case, $A$ is the set of all examples predicted as positive by a specific model, and $B$ is the set of all golden positive examples in the dataset. When applied to boolean data with the definitions of true positive (TP), false positive (FP), and false negative (FN), it can be written as follows:


<center>$DSC = \frac{2TP}{2TP+FN+FP}=F1~~~~~~~$(4)</center>
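
* The equivalence between the set form (Eq.3), the count form (Eq.4) and F1 can be checked numerically. The sketch below uses made-up example ids; the function names are only illustrative.

```python
def dice_from_sets(predicted, gold):
    """Eq. 3: 2|A n B| / (|A| + |B|) for two sets of example ids."""
    return 2 * len(predicted & gold) / (len(predicted) + len(gold))

def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

predicted = {1, 2, 3, 4}  # ids the model labels positive (toy data)
gold = {3, 4, 5}          # ids that are truly positive (toy data)
tp = len(predicted & gold)
fp = len(predicted - gold)
fn = len(gold - predicted)

dsc_sets = dice_from_sets(predicted, gold)
dsc_counts = 2 * tp / (2 * tp + fn + fp)  # Eq. 4
print(dsc_sets, dsc_counts, f1_from_counts(tp, fp, fn))
```

All three printed values agree (4/7 for this toy data), since $|A| = TP + FP$ and $|B| = TP + FN$.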

* For an individual example $x_i$, its corresponding $DSC$ is given as follows:


<center>$DSC(x_i) = \frac{2p_{i1} y_{i1}}{p_{i1}+y_{i1}}~~~~~~~~$(5)</center>

![avatar](图片/2.png)

<center>Table 2: Different losses and their formulas.</center>

* As can be seen, a negative example with $y_{i1} = 0$ does not contribute to the objective. For smoothing purposes, it is common to add a $\gamma$ factor to both the numerator and the denominator, giving the following form:


<center>$DSC(x_i) = \frac{2p_{i1} y_{i1} + \gamma}{p_{i1} + y_{i1} + \gamma}~~~~~~~~~$(6)</center>

* As can be seen, negative examples whose DSC is $\frac{\gamma}{p_{i1} + \gamma}$ , also contribute to the training. Additionally, Milletari et al. (2016) proposed to change the denominator to the square form for faster convergence, which leads to the following dice loss (DL):


![avatar](图片/3.png)

* Another version of DL is to directly compute set-level dice coefficient instead of the sum of individual dice coefficient. We choose the latter due to ease of optimization.

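* A minimal sketch of the per-example quantities discussed above. `dsc_example` follows Eq.6 directly. The dice loss equation itself appears only as an image in these notes, so `dice_loss` assumes the common form: the mean of $1 - DSC$ with $p_{i1}$ and $y_{i1}$ squared in the denominator. The default $\gamma = 1$ is an arbitrary choice for illustration.

```python
def dsc_example(p1, y1, gamma=1.0):
    """Eq. 6: smoothed dice coefficient of a single example."""
    return (2 * p1 * y1 + gamma) / (p1 + y1 + gamma)

def dice_loss(probs, labels, gamma=1.0):
    """Mean of 1 - DSC with squared terms in the denominator (assumed form)."""
    losses = [
        1 - (2 * p * y + gamma) / (p * p + y * y + gamma)
        for p, y in zip(probs, labels)
    ]
    return sum(losses) / len(losses)

# Without smoothing, a negative example gives 0 whatever p is.
print(dsc_example(0.2, 0, gamma=0.0))
# With smoothing, the same example gives gamma / (p + gamma), which depends on p.
print(dsc_example(0.2, 0, gamma=1.0))
print(dice_loss([0.9, 0.2], [1, 0]))
```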

* The Tversky index (TI), which can be thought of as an approximation of the $F_{\beta}$ score, extends the dice coefficient to a more general case. Given two sets $A$ and $B$, the Tversky index is computed as follows:


![avatar](图片/4.png)

* Tversky index offers the flexibility in controlling the tradeoff between false-negatives and false-positives. It degenerates to DSC if $\alpha = \beta = 0.5$. The Tversky loss (TL) is thus as follows:


![avatar](图片/5.png)
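
* The set-level Tversky index can be sketched as below. The formula is shown only as an image in these notes, so the code assumes the standard definition $|A \cap B| / (|A \cap B| + \alpha|A \setminus B| + \beta|B \setminus A|)$; the example sets are made up.

```python
def tversky_index(predicted, gold, alpha, beta):
    """Standard set-level Tversky index (assumed definition).

    alpha weights predicted-but-not-gold items (false positives),
    beta weights gold-but-not-predicted items (false negatives).
    """
    inter = len(predicted & gold)
    only_pred = len(predicted - gold)
    only_gold = len(gold - predicted)
    return inter / (inter + alpha * only_pred + beta * only_gold)

predicted = {1, 2, 3, 4}  # toy data: 2 true positives, 2 false positives
gold = {3, 4, 5}          # toy data: 1 false negative

dice = 2 * len(predicted & gold) / (len(predicted) + len(gold))
print(tversky_index(predicted, gold, 0.5, 0.5), dice)  # both equal the DSC
print(tversky_index(predicted, gold, 0.8, 0.2))        # penalizes FPs more
print(tversky_index(predicted, gold, 0.2, 0.8))        # penalizes FNs more
```

Because this toy prediction has more false positives than false negatives, raising $\alpha$ lowers the score more than raising $\beta$ does.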

#### 3.4 Self-adjusting Dice Loss

* Consider a simple case where the dataset consists of only one example $x_i$, which is classified as positive as long as $p_{i1}$ is larger than 0.5. The computation of $F1$ score is actually as follows:


![avatar](图片/6.png)

![avatar](图片/7.png)

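<center>Figure 1: An illustration of the derivatives of the four losses. The derivative of DSC approaches zero right after p exceeds 0.5, whereas for the other losses the derivative reaches 0 only when the probability is exactly 1, which means they push p towards 1 as much as possible.</center>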

* Comparing Eq.5 with Eq.10, we can see that Eq.5 is actually a soft form of F1, using a continuous $p$ rather than the binary $I(p_{i1} > 0.5)$. This gap is not a big issue for balanced datasets, but is extremely detrimental if a big proportion of training examples are easy-negative ones: easy-negative examples can easily dominate training since their probabilities can be pushed to 0 fairly easily. Meanwhile, the model can hardly distinguish between hard-negative examples and positive ones, which has a huge negative effect on the final F1 performance.


* To address this issue, we propose to multiply the soft probability $p$ with a decaying factor $(1 − p)$, changing Eq.10 to the following form:


![avatar](图片/9.png)

* One can think of $(1−p_{i1})$ as a weight associated with each example, which changes as training proceeds. The intuition behind changing $p_{i1}$ to $(1 − p_{i1})p_{i1}$ is to push down the weight of easy examples. For easy examples whose probabilities approach 0 or 1, $(1 − p_{i1})p_{i1}$ makes the model attach significantly less focus to them.

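* The effect of the decaying factor can be seen with a few numbers. The sketch below assumes the self-adjusted coefficient is the smoothed DSC of Eq.6 with $p_{i1}$ replaced by $(1 - p_{i1})p_{i1}$; the exact equation is shown only as an image in these notes, and $\gamma = 1$ is an arbitrary choice.

```python
def adjusted_dsc(p1, y1, gamma=1.0):
    """Smoothed DSC with p replaced by (1 - p) * p (assumed form)."""
    w = (1 - p1) * p1
    return (2 * w * y1 + gamma) / (w + y1 + gamma)

# The factor (1 - p) * p peaks at p = 0.5 and vanishes near 0 and 1,
# so confidently classified (easy) examples barely move the objective.
for p in (0.01, 0.5, 0.99):
    print(p, (1 - p) * p, adjusted_dsc(p, 0), adjusted_dsc(p, 1))
```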

* A close look at Eq.11 reveals that it actually mimics the idea of focal loss (FL for short) (Lin et al., 2017) for object detection in vision. Focal loss was proposed for one-stage object detectors to handle the foreground-background tradeoff encountered during training. It down-weights the loss assigned to well-classified examples by adding a $(1-p)^\gamma$ factor, leading the final loss to be $-(1-p)^\gamma \log p$.

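* As a small numerical illustration of the down-weighting, the sketch below compares plain cross-entropy with the focal loss $-(1-p)^\gamma \log p$ stated above, where $p$ is the probability assigned to the gold class. The value $\gamma = 2$ is only an illustrative choice.

```python
import math

def cross_entropy(p):
    """-log p for the gold-class probability p."""
    return -math.log(p)

def focal_loss(p, gamma=2.0):
    """-(1 - p)^gamma * log p."""
    return -((1 - p) ** gamma) * math.log(p)

# The ratio focal / CE is exactly (1 - p)^gamma, so it shrinks quickly
# as the example becomes well classified.
for p in (0.6, 0.9, 0.99):
    print(p, cross_entropy(p), focal_loss(p), focal_loss(p) / cross_entropy(p))
```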

* In Table 2, we summarize all the aforementioned losses. Figure 1 gives an explanation from the perspective of derivatives: the derivative of $DSC$ approaches zero right after $p$ exceeds 0.5, which suggests the model attends less to examples once they are correctly classified. For the other losses, the derivatives reach 0 only if the probability is exactly 1, which means they will push $p$ to 1 as much as possible.


### 4 Experiments

* We evaluated the proposed method on four NLP tasks, part-of-speech tagging, named entity recognition, machine reading comprehension and paraphrase identification. Hyperparameters are tuned on the corresponding development set of each dataset. More experiment details including datasets and hyperparameters are shown in supplementary material.


#### 4.1 Part-of-Speech Tagging

* Settings. Part-of-speech (POS) tagging is the task of assigning a part-of-speech label (e.g., noun, verb, adjective) to each word in a given text. In this paper, we choose BERT (Devlin et al., 2018) as the backbone and conduct experiments on three widely used Chinese POS datasets, Chinese Treebank (Xue et al., 2005) 5.0/6.0 and UD1.4, and on English datasets including Wall Street Journal (WSJ) and the dataset proposed by Ritter et al. (2011). We report the span-level micro-averaged precision, recall and F1 for evaluation.


* Baselines. We used the following baselines:
* Joint-POS: Shao et al. (2017) jointly learns Chinese word segmentation and POS.
* Lattice-LSTM: Zhang and Yang (2018) constructs a word-character lattice network.
* Bert-Tagger: Devlin et al. (2018) treats part-of-speech as a tagging task.


![avatar](图片/10.png)

<center>Table 3: Experimental results on the Chinese POS datasets CTB5, CTB6 and UD1.4.</center>

* Results. Table 3 presents the experimental results on Chinese datasets. As can be seen, the proposed DSC loss outperforms the best baseline results by a large margin, i.e., outperforming BERT-tagger by +1.86 in terms of F1 score on CTB5, +1.80 on CTB6 and +2.19 on UD1.4. To the best of our knowledge, we achieve SOTA performance on the three datasets. Focal loss obtains only a small performance improvement on CTB5 and CTB6, and the dice loss obtains a huge gain on CTB5 but not on CTB6, which indicates that these losses are not consistently robust in solving the data imbalance issue.
* Table 4 presents the experimental results for English datasets.


![avatar](图片/11.png)

#### 4.2 Named Entity Recognition

* Settings. Named entity recognition (NER) is the task of detecting the span and semantic category of entities within a chunk of text. Our implementation uses the current state-of-the-art model proposed by Li et al. (2019) as the backbone, and changes the MLE loss to DSC loss. The datasets that we use include OntoNotes4.0 (Pradhan et al., 2011), MSRA (Levow, 2006), CoNLL2003 (Sang and De Meulder, 2003) and OntoNotes5.0 (Pradhan et al., 2013). We report span-level micro-averaged precision, recall and F1.

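* Span-level micro-averaged precision, recall and F1 can be computed by pooling span counts over all sentences before dividing. The sketch below is one common way to do this and is not taken from the paper's evaluation script; the `(start, end, label)` span representation and the toy spans are assumptions made for the example.

```python
def micro_prf(pred_spans, gold_spans):
    """Micro-averaged span-level precision, recall and F1.

    Each argument is a list with one set of (start, end, label) spans per
    sentence. A predicted span counts as correct only on an exact match.
    """
    tp = sum(len(p & g) for p, g in zip(pred_spans, gold_spans))
    n_pred = sum(len(p) for p in pred_spans)
    n_gold = sum(len(g) for g in gold_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

pred = [{(0, 1, "PER"), (3, 4, "LOC")}, {(2, 2, "ORG")}]
gold = [{(0, 1, "PER")}, {(2, 2, "ORG"), (5, 6, "LOC")}]
print(micro_prf(pred, gold))
```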

* Baselines. We use the following baselines:
* ELMo: a tagging model with pretraining from Peters et al. (2018).
* Lattice-LSTM: Zhang and Yang (2018) constructs a word-character lattice, only used in Chinese datasets.
* CVT: Clark et al. (2018) uses Cross-View Training (CVT) to improve the representations of a Bi-LSTM encoder.
* Bert-Tagger: Devlin et al. (2018) treats NER as a tagging task.
* Glyce-BERT: Wu et al. (2019) combines Chinese glyph information with BERT pretraining.
* BERT-MRC: Li et al. (2019) formulates NER as a machine reading comprehension task and achieves SOTA results on Chinese and English NER benchmarks.


* Results. Table 5 shows experimental results on NER datasets. DSC outperforms BERT-MRC (Li et al., 2019) by +0.29, +0.96, +0.97 and +2.36 on CoNLL2003, OntoNotes5.0, MSRA and OntoNotes4.0, respectively. To the best of our knowledge, we set new SOTA performance on all four NER datasets.


![avatar](图片/12.png)

#### 4.3 Machine Reading Comprehension

* Settings. The task of machine reading comprehension (MRC) (Seo et al., 2016; Wang et al., 2016; Wang and Jiang, 2016; Wang et al., 2016; Shen et al., 2017; Chen et al., 2017) is to predict the answer span in the passage given a question and the passage. We followed the standard protocols in Seo et al. (2016), in which the start and end indexes of the answer are predicted. We report Exact Match (EM) as well as F1 score on the validation set. We use three datasets for this task: SQuAD v1.1, SQuAD v2.0 (Rajpurkar et al., 2016, 2018) and Quoref (Dasigi et al., 2019).

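* For readers unfamiliar with the two MRC metrics, the sketch below shows a simplified version of Exact Match and token-level F1 for a single prediction. It is an assumption-laden illustration: it only lowercases and splits on whitespace, whereas official evaluation scripts typically apply further answer normalization (such as removing punctuation and articles) and take the maximum over several reference answers.

```python
from collections import Counter

def exact_match(prediction, gold):
    """1.0 if the two strings match after trivial normalization, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction, gold):
    """F1 over the bag of whitespace-separated tokens."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris "))
print(token_f1("the city of Paris", "Paris"))
```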

* Baselines. We used the following baselines:
* QANet: Yu et al. (2018b) builds a model based on convolutions and self-attention. Convolutions are used to model local interactions and self-attention is used to model global interactions.
* BERT: Devlin et al. (2018) scores each candidate span and the maximum scoring span is used as a prediction.
* XLNet: Yang et al. (2019) proposes a generalized autoregressive pretraining method that enables learning bidirectional contexts.


![avatar](图片/13.png)

* Results. Table 6 shows the experimental results for the MRC task. With either BERT or XLNet, our proposed DSC loss obtains a significant performance boost on both EM and F1. For SQuAD v1.1, our proposed method outperforms XLNet by +1.25 in terms of F1 score and +0.84 in terms of EM. For SQuAD v2.0, the proposed method achieves 87.65 on EM and 89.51 on F1. On QuoRef, the proposed method surpasses XLNet by +1.46 on EM and +1.41 on F1.


#### 4.4 Paraphrase Identification

* Settings. Paraphrase identification (PI) is the task of identifying whether two sentences have the same meaning or not. We conduct experiments on two widely used datasets: MRPC (Dolan and Brockett, 2005) and QQP. F1 score is reported for comparison. We use BERT (Devlin et al., 2018) and XLNet (Yang et al., 2019) as baselines.


* Results. Table 7 shows the results. We find that replacing the training objective with DSC introduces a performance boost for both settings, +0.58 for MRPC and +0.73 for QQP.


![avatar](图片/14.png)

### 5 Ablation Studies

#### 5.1 Datasets Imbalanced to Different Extents

* It is interesting to see how differently the proposed objectives affect datasets imbalanced to different extents. We use the paraphrase identification dataset QQP (37% positive and 63% negative) for these studies. To construct datasets with different degrees of imbalance, we used the original QQP dataset to construct synthetic training sets with different positive-negative ratios. Models are trained on these different synthetic sets and then tested on the same original test set.


![avatar](图片/15.png)

<center>Table 8: The effect of different data augmentation strategies for QQP on F1 score.</center>

* Original training set (original): the original dataset with 363,871 examples, with 37% being positive and 63% being negative.
* Positive augmentation (+positive): we created a balanced dataset by adding positive examples. We first randomly chose positive training examples in the original training set as templates. Then we used spaCy to retrieve entity mentions and replaced them with new ones by linking mentions to their corresponding entities in DBpedia. The augmented set contains 458,477 examples, with 50% being positive and 50% being negative.
* Negative augmentation (+negative): we created a more imbalanced dataset. The size of the newly constructed training set and the data augmentation technique are exactly the same as for +positive, except that we chose negative training examples as templates. The augmented training set contains 458,477 examples, with 21% being positive and 79% being negative.
* Negative downsampling (- negative) We down-sampled negative examples in the original training set to get a balanced training set. The down-sampled set contains 269,165 examples, with 50% being positive and 50% being negative.
* Positive and negative augmentation (+ positive & +negative) We augmented the original training data with additional positive and negative examples with the data distribution staying the same. The augmented dataset contains 458,477 examples, with 50% being positive and 50% being negative.

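
* Of the strategies listed above, negative downsampling is the simplest to reproduce. The sketch below is a hypothetical illustration, not the authors' procedure: it assumes examples are `(text, label)` pairs with label 1 for positive, samples negatives uniformly at random, and uses a made-up toy set with the same 37%/63% split as the original QQP training set.

```python
import random

def downsample_negatives(examples, seed=0):
    """Keep all positives and an equally sized random subset of negatives."""
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] == 0]
    rng = random.Random(seed)
    kept = rng.sample(negatives, min(len(positives), len(negatives)))
    balanced = positives + kept
    rng.shuffle(balanced)
    return balanced

# Toy set: 37 positives and 63 negatives.
toy = [(f"pair {i}", 1 if i < 37 else 0) for i in range(100)]
balanced = downsample_negatives(toy)
print(len(balanced), sum(label for _, label in balanced))
```

The balanced set is smaller than the original one, which is the trade-off noted in the results for the -negative setting.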

* Results are shown in Table 8. We first look at the first line, with all results obtained using the MLE objective. We can see that + positive outperforms original, and +negative underperforms original. This is in line with our expectation since + positive creates a balanced dataset while +negative creates a more imbalanced dataset. Despite the fact that -negative creates a balanced dataset, the number of training data decreases, resulting in inferior performances.


![avatar](图片/16.png)


* DSC achieves the highest F1 score across all datasets. Specifically, for +positive, DSC achieves a minor improvement (+0.05 F1) over DL. In contrast, it significantly outperforms DL on the +negative dataset. This is in line with our expectation, since DSC helps more on more imbalanced datasets. The performance of FL and DL is not consistent across different datasets, while DSC consistently performs the best on all datasets.


#### 5.2 Dice Loss for Accuracy-oriented Tasks?

* We argue that the cross-entropy objective is actually accuracy-oriented, whereas the proposed losses behave as a soft version of the F1 score. To explore the effect of the dice loss on accuracy-oriented tasks such as text classification, we conduct experiments on the Stanford Sentiment Treebank (SST) datasets, including SST-2 and SST-5. We fine-tuned BERT-Large with different training objectives. Experimental results for SST are shown in Table 9. For SST-5, BERT with CE achieves 55.57 in terms of accuracy, while DL and DSC perform slightly worse (54.63 and 55.19, respectively). A similar phenomenon is observed for SST-2. These results verify that the proposed dice loss is not accuracy-oriented and should not be used for accuracy-oriented tasks.


![avatar](图片/17.png)

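<center>Table 9: The effect of DL and DSC on sentiment classification tasks. BERT+CE refers to fine-tuning BERT with cross-entropy as the training objective.</center>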

#### 5.3 Hyperparameters in the Tversky Index

* As mentioned in Section 3.3, the Tversky index ($TI$) offers flexibility in controlling the tradeoff between false negatives and false positives. In this subsection, we explore the effect of the hyperparameters (i.e., $\alpha$ and $\beta$) in $TI$ to test how they manipulate the tradeoff. We conduct experiments on the Chinese OntoNotes4.0 NER dataset and the English QuoRef MRC dataset. Experimental results are shown in Table 10. The highest F1 on Chinese OntoNotes4.0 is 84.67 when $\alpha$ is set to 0.6, while for QuoRef the highest F1 is 68.44 when $\alpha$ is set to 0.4. In addition, we can observe that the performance varies a lot as $\alpha$ changes on the different datasets, which shows that the hyperparameters $\alpha$ and $\beta$ actually play an important role in $TI$.


![avatar](图片/18.png)

<center>Table 10: The effect of hyperparameters in the Tversky index. We set $\beta = 1 - \alpha$, so only $\alpha$ is listed here.</center>

### 6 Conclusion

* In this paper, we propose a dice-based loss to narrow the gap between the training objective and the evaluation metric (F1 score). Experimental results show that the proposed loss function helps to achieve a significant performance boost without changing model architectures.
