# ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS

## Abstract

* Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations, longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT (Devlin et al., 2019). Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at https://github.com/google-research/google-research/tree/master/albert.

### 1 Introduction

* Full network pre-training (Dai & Le, 2015; Radford et al., 2018; Devlin et al., 2019; Howard & Ruder, 2018) has led to a series of breakthroughs in language representation learning. Many nontrivial NLP tasks, including those that have limited training data, have greatly benefited from these pre-trained models. One of the most compelling signs of these breakthroughs is the evolution of machine performance on a reading comprehension task designed for middle and high-school English exams in China, the RACE test (Lai et al., 2017): the paper that originally describes the task and formulates the modeling challenge reports then state-of-the-art machine accuracy at 44.1%; the latest published result reports their model performance at 83.2% (Liu et al., 2019); the work we present here pushes it even higher to 89.4%, a stunning 45.3% improvement that is mainly attributable to our current ability to build high-performance pretrained language representations. 

* Evidence from these improvements reveals that a large network is of crucial importance for achieving state-of-the-art performance (Devlin et al., 2019; Radford et al., 2019). It has become common practice to pre-train large models and distill them down to smaller ones (Sun et al., 2019; Turc et al., 2019) for real applications. Given the importance of model size, we ask: Is having better NLP models as easy as having larger models? 

* An obstacle to answering this question is the memory limitations of available hardware. Given that current state-of-the-art models often have hundreds of millions or even billions of parameters, it is easy to hit these limitations as we try to scale our models. Training speed can also be significantly hampered in distributed training, as the communication overhead is directly proportional to the number of parameters in the model. We also observe that simply growing the hidden size of a model such as BERT-large (Devlin et al., 2019) can lead to worse performance. Table 1 and Fig. 1 show a typical example, where we simply increase the hidden size of BERT-large to be 2x larger and get worse results with this BERT-xlarge model. 

![avatar](picture1.png)

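<center>Figure 1: Training loss (left) and dev masked LM accuracy (right) of BERT-large and BERT-xlarge (2x larger than BERT-large in terms of hidden size). The larger model has lower masked LM accuracy while showing no obvious sign of over-fitting.</center>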

![avatar](picture2.png)

<center>Table 1: Increasing the hidden size of BERT-large leads to worse performance on RACE.</center>

* Existing solutions to the aforementioned problems include model parallelization (Shoeybi et al., 2019) and clever memory management (Chen et al., 2016; Gomez et al., 2017). These solutions address the memory limitation problem, but not the communication overhead and model degradation problem. In this paper, we address all of the aforementioned problems, by designing A Lite BERT (ALBERT) architecture that has significantly fewer parameters than a traditional BERT architecture. 

* ALBERT incorporates two parameter reduction techniques that lift the major obstacles in scaling pre-trained models. The first one is a factorized embedding parameterization. By decomposing the large vocabulary embedding matrix into two small matrices, we separate the size of the hidden layers from the size of vocabulary embedding. This separation makes it easier to grow the hidden size without significantly increasing the parameter size of the vocabulary embeddings. The second technique is cross-layer parameter sharing. This technique prevents the parameter from growing with the depth of the network. Both techniques significantly reduce the number of parameters for BERT without seriously hurting performance, thus improving parameter-efficiency. An ALBERT configuration similar to BERT-large has 18x fewer parameters and can be trained about 1.7x faster. The parameter reduction techniques also act as a form of regularization that stabilizes the training and helps with generalization. 

* To further improve the performance of ALBERT, we also introduce a self-supervised loss for sentence-order prediction (SOP). SOP primarily focuses on inter-sentence coherence and is designed to address the ineffectiveness (Yang et al., 2019; Liu et al., 2019) of the next sentence prediction (NSP) loss proposed in the original BERT.

* As a result of these design decisions, we are able to scale up to much larger ALBERT configurations that still have fewer parameters than BERT-large but achieve significantly better performance. We establish new state-of-the-art results on the well-known GLUE, SQuAD, and RACE benchmarks for natural language understanding. Specifically, we push the RACE accuracy to 89.4%, the GLUE benchmark to 89.4, and the F1 score of SQuAD 2.0 to 92.2.

### 2 Related Work
#### 2.1 Scaling Up Representation Learning for Natural Language

* Learning representations of natural language has been shown to be useful for a wide range of NLP tasks and has been widely adopted (Mikolov et al., 2013; Le & Mikolov, 2014; Dai & Le, 2015; Peters et al., 2018; Devlin et al., 2019; Radford et al., 2018; 2019). One of the most significant changes in the last two years is the shift from pre-training word embeddings, whether standard (Mikolov et al., 2013; Pennington et al., 2014) or contextualized (McCann et al., 2017; Peters et al., 2018), to full-network pre-training followed by task-specific fine-tuning (Dai & Le, 2015; Radford et al., 2018; Devlin et al., 2019). In this line of work, it is often shown that larger model size improves performance. For example, Devlin et al. (2019) show that across three selected natural language understanding tasks, using larger hidden size, more hidden layers, and more attention heads always leads to better performance. However, they stop at a hidden size of 1024. We show that, under the same setting, increasing the hidden size to 2048 leads to model degradation and hence worse performance. Therefore, scaling up representation learning for natural language is not as easy as simply increasing model size. 

* In addition, it is difficult to experiment with large models due to computational constraints, especially in terms of GPU/TPU memory limitations. Given that current state-of-the-art models often have hundreds of millions or even billions of parameters, we can easily hit memory limits. To address this issue, Chen et al. (2016) propose a method called gradient checkpointing to reduce the memory requirement to be sublinear at the cost of an extra forward pass. Gomez et al. (2017) propose a way to reconstruct each layer’s activations from the next layer so that they do not need to store the intermediate activations. Both methods reduce the memory consumption at the cost of speed. In contrast, our parameter-reduction techniques reduce memory consumption and increase training speed.

#### 2.2 Cross-Layer Parameter Sharing

* The idea of sharing parameters across layers has been previously explored with the Transformer architecture (Vaswani et al., 2017), but this prior work has focused on training for standard encoder-decoder tasks rather than the pretraining/finetuning setting. Different from our observations, Dehghani et al. (2018) show that networks with cross-layer parameter sharing (Universal Transformer, UT) get better performance on language modeling and subject-verb agreement than the standard transformer. Very recently, Bai et al. (2019) propose a Deep Equilibrium Model (DQE) for transformer networks and show that DQE can reach an equilibrium point for which the input embedding and the output embedding of a certain layer stay the same. Our observations show that our embeddings are oscillating rather than converging. Hao et al. (2019) combine a parameter-sharing transformer with the standard one, which further increases the number of parameters of the standard transformer.

#### 2.3 Sentence Ordering Objectives

* ALBERT uses a pretraining loss based on predicting the ordering of two consecutive segments of text. Several researchers have experimented with pretraining objectives that similarly relate to discourse coherence. Coherence and cohesion in discourse have been widely studied and many phenomena have been identified that connect neighboring text segments (Hobbs, 1979; Halliday & Hasan, 1976; Grosz et al., 1995). Most objectives found effective in practice are quite simple. Skip-thought (Kiros et al., 2015) and FastSent (Hill et al., 2016) sentence embeddings are learned by using an encoding of a sentence to predict words in neighboring sentences. Other objectives for sentence embedding learning include predicting future sentences rather than only neighbors (Gan et al., 2017) and predicting explicit discourse markers (Jernite et al., 2017; Nie et al., 2019). Our loss is most similar to the sentence ordering objective of Jernite et al. (2017), where sentence embeddings are learned in order to determine the ordering of two consecutive sentences. Unlike most of the above work, however, our loss is defined on textual segments rather than sentences. BERT (Devlin et al., 2019) uses a loss based on predicting whether the second segment in a pair has been swapped with a segment from another document. We compare to this loss in our experiments and find that sentence ordering is a more challenging pretraining task and more useful for certain downstream tasks. Concurrently to our work, Wang et al. (2019) also try to predict the order of two consecutive segments of text, but they combine it with the original next sentence prediction in a three-way classification task rather than empirically comparing the two.

### 3 The Elements of ALBERT
* In this section, we present the design decisions for ALBERT and provide quantified comparisons against corresponding configurations of the original BERT architecture (Devlin et al., 2019).

#### 3.1 Model Architecture Choices

* The backbone of the ALBERT architecture is similar to BERT in that it uses a transformer encoder (Vaswani et al., 2017) with GELU nonlinearities (Hendrycks & Gimpel, 2016). We follow the BERT notation conventions and denote the vocabulary embedding size as E, the number of encoder layers as L, and the hidden size as H. Following Devlin et al. (2019), we set the feed-forward/filter size to be 4H and the number of attention heads to be H/64. 
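
* The sizing rules in the paragraph above (feed-forward size 4H, H/64 attention heads) can be written down as a small configuration helper. This is only an illustrative sketch: the class name `EncoderConfig` and its field names are assumptions made for this example, not identifiers from the released ALBERT code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical container for the sizes named in the text."""

    hidden_size: int      # H
    embedding_size: int   # E
    num_layers: int       # L

    @property
    def feed_forward_size(self) -> int:
        # The text sets the feed-forward/filter size to 4H.
        return 4 * self.hidden_size

    @property
    def num_attention_heads(self) -> int:
        # The text sets the number of attention heads to H/64.
        if self.hidden_size % 64 != 0:
            raise ValueError("hidden_size must be a multiple of 64")
        return self.hidden_size // 64


# H = 4096 with 12 layers and E = 128, the xxlarge-style setting described later.
xxlarge_like = EncoderConfig(hidden_size=4096, embedding_size=128, num_layers=12)
print(xxlarge_like.feed_forward_size, xxlarge_like.num_attention_heads)  # 16384 64
```

Deriving both values from H keeps the three quantities consistent when the hidden size is scaled.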

* There are three main contributions that ALBERT makes over the design choices of BERT.

* Factorized embedding parameterization. In BERT, as well as subsequent modeling improvements such as XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019), the WordPiece embedding size E is tied with the hidden layer size H, i.e., E ≡ H. This decision appears suboptimal for both modeling and practical reasons, as follows. 

* From a modeling perspective, WordPiece embeddings are meant to learn context-independent representations, whereas hidden-layer embeddings are meant to learn context-dependent representations. As experiments with context length indicate (Liu et al., 2019), the power of BERT-like representations comes from the use of context to provide the signal for learning such context-dependent representations. As such, untying the WordPiece embedding size E from the hidden layer size H allows us to make a more efficient usage of the total model parameters as informed by modeling needs, which dictate that H ≫ E. 

* From a practical perspective, natural language processing usually requires the vocabulary size V to be large. If E ≡ H, then increasing H increases the size of the embedding matrix, which has size V × E. This can easily result in a model with billions of parameters, most of which are only updated sparsely during training.

* Therefore, for ALBERT we use a factorization of the embedding parameters, decomposing them into two smaller matrices. Instead of projecting the one-hot vectors directly into the hidden space of size H, we first project them into a lower dimensional embedding space of size E, and then project them to the hidden space. By using this decomposition, we reduce the embedding parameters from O(V × H) to O(V × E + E × H). This parameter reduction is significant when H ≫ E. We choose to use the same E for all word pieces because they are much more evenly distributed across documents compared to whole-word embedding, where having different embedding sizes for different words is important (Grave et al., 2017; Baevski & Auli, 2018; Dai et al., 2019).
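
* To make the O(V × H) versus O(V × E + E × H) comparison concrete, the sketch below counts word-piece embedding parameters for both layouts. It uses V = 30,000 (the vocabulary size from Sec. 4.1) and E = 128 (the value chosen in Sec. 4.4). The function names are hypothetical, and position or segment embeddings and biases are deliberately ignored.

```python
def tied_embedding_params(vocab_size: int, hidden_size: int) -> int:
    """BERT-style layout: a single V x H matrix, because E is tied to H."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size: int, embedding_size: int, hidden_size: int) -> int:
    """ALBERT-style layout: a V x E lookup table followed by an E x H projection."""
    return vocab_size * embedding_size + embedding_size * hidden_size


V, E = 30_000, 128
for H in (768, 1024, 2048, 4096):
    tied = tied_embedding_params(V, H)
    factorized = factorized_embedding_params(V, E, H)
    print(f"H={H}: tied={tied:,}  factorized={factorized:,}  ratio={tied / factorized:.1f}x")
```

The tied count grows linearly with H, while the factorized count is dominated by the fixed V × E term, which is why the saving widens as the hidden size increases.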

* Cross-layer parameter sharing. For ALBERT, we propose cross-layer parameter sharing as another way to improve parameter efficiency. There are multiple ways to share parameters, e.g., only sharing feed-forward network (FFN) parameters across layers, or only sharing attention parameters. The default decision for ALBERT is to share all parameters across layers. All our experiments use this default decision unless otherwise specified. We compare this design decision against other strategies in our experiments in Sec. 4.5.
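
* The all-shared strategy can be illustrated with a toy encoder in which one set of weights is applied at every depth. The sketch below uses NumPy and replaces the real attention and FFN sub-layers with a single residual `tanh` layer, so it only demonstrates the sharing pattern and the parameter count, not ALBERT itself. All function names are assumptions for this example.

```python
import numpy as np


def init_layer(rng, hidden_size):
    """A toy stand-in for a transformer layer: one weight matrix and one bias."""
    return {
        "w": rng.normal(scale=0.02, size=(hidden_size, hidden_size)),
        "b": np.zeros(hidden_size),
    }


def apply_layer(params, x):
    # Residual update; a real layer would contain attention and an FFN here.
    return x + np.tanh(x @ params["w"] + params["b"])


def encode(layers, x):
    for params in layers:
        x = apply_layer(params, x)
    return x


def count_unique_params(layers):
    # Count each distinct parameter dict once, even if it is reused at many depths.
    unique = {id(p): p for p in layers}
    return sum(a.size for p in unique.values() for a in p.values())


rng = np.random.default_rng(0)
H, L = 64, 12
not_shared = [init_layer(rng, H) for _ in range(L)]  # BERT-style: L separate layers
all_shared = [init_layer(rng, H)] * L                # ALBERT-style: one layer reused L times

x = rng.normal(size=(4, H))
print(encode(not_shared, x).shape, encode(all_shared, x).shape)
print(count_unique_params(not_shared), count_unique_params(all_shared))
```

Both encoders perform the same amount of computation per forward pass; only the number of stored (and communicated) parameters differs, by a factor of L in this toy setting.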

* Similar strategies have been explored by Dehghani et al. (2018) (Universal Transformer, UT) and Bai et al. (2019) (Deep Equilibrium Models, DQE) for Transformer networks. Different from our observations, Dehghani et al. (2018) show that UT outperforms a vanilla Transformer. Bai et al. (2019) show that their DQEs reach an equilibrium point for which the input and output embedding of a certain layer stay the same. Our measurements of the L2 distances and cosine similarity show that our embeddings are oscillating rather than converging.

![avatar](picture3.png)

<center>Figure 2: The L2 distances and cosine similarity (in terms of degree) of the input and output embedding of each layer for BERT-large and ALBERT-large.</center>

* Figure 2 shows the L2 distances and cosine similarity of the input and output embeddings for each layer, using BERT-large and ALBERT-large configurations (see Table 2). We observe that the transitions from layer to layer are much smoother for ALBERT than for BERT. These results show that weight-sharing has an effect on stabilizing network parameters. Although there is a drop for both metrics compared to BERT, they nevertheless do not converge to 0 even after 24 layers. This shows that the solution space for ALBERT parameters is very different from the one found by DQE.
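
* The two per-layer measurements discussed above can be computed as follows. This is a minimal sketch that assumes the "cosine similarity in degrees" of Figure 2 means the angle obtained from the arccosine of the cosine similarity; the function names and the tiny example vectors are made up for illustration.

```python
import numpy as np


def l2_distance(layer_input, layer_output):
    """Euclidean distance between input and output embeddings, per token."""
    return np.linalg.norm(layer_input - layer_output, axis=-1)


def angle_degrees(layer_input, layer_output):
    """Angle between input and output embeddings, per token, in degrees."""
    dot = np.sum(layer_input * layer_output, axis=-1)
    norms = np.linalg.norm(layer_input, axis=-1) * np.linalg.norm(layer_output, axis=-1)
    cosine = np.clip(dot / norms, -1.0, 1.0)  # guard against rounding outside [-1, 1]
    return np.degrees(np.arccos(cosine))


# Two toy "tokens": one rotated by a right angle, one only rescaled.
layer_input = np.array([[1.0, 0.0], [3.0, 4.0]])
layer_output = np.array([[0.0, 1.0], [6.0, 8.0]])
print(l2_distance(layer_input, layer_output))
print(angle_degrees(layer_input, layer_output))
```

A layer that had reached a fixed point would give zero for both quantities, which is the behavior the text reports not observing for ALBERT.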

* Inter-sentence coherence loss. In addition to the masked language modeling (MLM) loss (Devlin et al., 2019), BERT uses an additional loss called next-sentence prediction (NSP). NSP is a binary classification loss for predicting whether two segments appear consecutively in the original text, as follows: positive examples are created by taking consecutive segments from the training corpus; negative examples are created by pairing segments from different documents; positive and negative examples are sampled with equal probability. The NSP objective was designed to improve performance on downstream tasks, such as natural language inference, that require reasoning about the relationship between sentence pairs. However, subsequent studies (Yang et al., 2019; Liu et al., 2019) found NSP’s impact unreliable and decided to eliminate it, a decision supported by an improvement in downstream task performance across several tasks. 

* We conjecture that the main reason behind NSP’s ineffectiveness is its lack of difficulty as a task, as compared to MLM. As formulated, NSP conflates topic prediction and coherence prediction in a single task. However, topic prediction is easier to learn compared to coherence prediction, and also overlaps more with what is learned using the MLM loss.

* We maintain that inter-sentence modeling is an important aspect of language understanding, but we propose a loss based primarily on coherence. That is, for ALBERT, we use a sentence-order prediction (SOP) loss, which avoids topic prediction and instead focuses on modeling inter-sentence coherence. The SOP loss uses as positive examples the same technique as BERT (two consecutive segments from the same document), and as negative examples the same two consecutive segments but with their order swapped. This forces the model to learn finer-grained distinctions about discourse-level coherence properties. As we show in Sec. 4.6, it turns out that NSP cannot solve the SOP task at all (i.e., it ends up learning the easier topic-prediction signal, and performs at random-baseline level on the SOP task), while SOP can solve the NSP task to a reasonable degree, presumably based on analyzing misaligned coherence cues. As a result, ALBERT models consistently improve downstream task performance for multi-sentence encoding tasks. 
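
* The difference between the two losses lies entirely in how negative examples are built, which the sketch below makes explicit. It is an illustration only: the function names, the 0/1 label convention, and the 50/50 sampling for SOP are assumptions (the text states equal sampling probability for NSP), and segments are plain strings rather than token sequences.

```python
import random


def sop_example(segment_a, segment_b, rng):
    """segment_a and segment_b are consecutive segments from the same document."""
    if rng.random() < 0.5:
        return (segment_a, segment_b), 1  # positive: original order
    return (segment_b, segment_a), 0      # negative: same two segments, order swapped


def nsp_example(segment_a, segment_b, other_doc_segment, rng):
    """other_doc_segment is a segment drawn from a different document."""
    if rng.random() < 0.5:
        return (segment_a, segment_b), 1       # positive: consecutive segments
    return (segment_a, other_doc_segment), 0   # negative: pair from different documents


def format_input(pair):
    """Render a pair in the "[CLS] x1 [SEP] x2 [SEP]" layout used for model inputs."""
    first, second = pair
    return f"[CLS] {first} [SEP] {second} [SEP]"


rng = random.Random(0)
pair, label = sop_example("He opened the door.", "The room was empty.", rng)
print(label, format_input(pair))
```

An SOP negative shares its topic with the positive, so the only usable signal is ordering, whereas an NSP negative can often be detected from the topic mismatch alone.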

#### 3.2 Model Setup

* We present the differences between BERT and ALBERT models with comparable hyperparameter settings in Table 2. Due to the design choices discussed above, ALBERT models have much smaller parameter size compared to corresponding BERT models. 

![avatar](picture4.png)

<center>Table 2: The configurations of the main BERT and ALBERT models analyzed in this paper.</center>

* For example, ALBERT-large has about 18x fewer parameters compared to BERT-large, 18M versus 334M. If we set BERT to have an extra-large size with H = 2048, we end up with a model that has 1.27 billion parameters and under-performs (Fig. 1). In contrast, an ALBERT-xlarge configuration with H = 2048 has only 60M parameters, while an ALBERT-xxlarge configuration with H = 4096 has 233M parameters, i.e., around 70% of BERT-large’s parameters. Note that for ALBERT-xxlarge, we mainly report results on a 12-layer network because a 24-layer network (with the same configuration) obtains similar results but is computationally more expensive.
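
* The ratios quoted above follow directly from the parameter counts given in the text. The short check below reproduces them; the dictionary keys are just labels for this example, and the counts are the rounded figures from the paragraph, in millions.

```python
# Parameter counts in millions, as quoted in the text.
params_m = {
    "BERT-large": 334,
    "BERT-xlarge": 1270,
    "ALBERT-large": 18,
    "ALBERT-xlarge": 60,
    "ALBERT-xxlarge": 233,
}

reduction = params_m["BERT-large"] / params_m["ALBERT-large"]
fraction = params_m["ALBERT-xxlarge"] / params_m["BERT-large"]
print(f"ALBERT-large is {reduction:.1f}x smaller than BERT-large")
print(f"ALBERT-xxlarge has {fraction:.0%} of BERT-large's parameters")
```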

* This improvement in parameter efficiency is the most important advantage of ALBERT’s design choices. Before we can quantify this advantage, we need to introduce our experimental setup in more detail.

### 4 Experimental Results
#### 4.1 Experimental Setup

* To keep the comparison as meaningful as possible, we follow the BERT (Devlin et al., 2019) setup in using the BOOKCORPUS (Zhu et al., 2015) and English Wikipedia (Devlin et al., 2019) for pretraining baseline models. These two corpora consist of around 16GB of uncompressed text. We format our inputs as “[CLS] $x_1$ [SEP] $x_2$ [SEP]”, where $x_1 = x_{1,1}, x_{1,2} \cdots$ and $x_2 = x_{2,1}, x_{2,2} \cdots$ are two segments. We always limit the maximum input length to 512, and randomly generate input sequences shorter than 512 with a probability of 10%. Like BERT, we use a vocabulary size of 30,000, tokenized using SentencePiece (Kudo & Richardson, 2018) as in XLNet (Yang et al., 2019).
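
* The length rule in the paragraph above (at most 512 tokens, with a 10% chance of a shorter sequence) can be sketched as a small sampler. The constant and function names are hypothetical, and drawing the shorter length uniformly between 2 and 511 is an assumption of this example, since the text does not say how shorter lengths are chosen.

```python
import random

MAX_SEQ_LENGTH = 512   # maximum input length from the text
SHORT_SEQ_PROB = 0.1   # probability of generating a shorter sequence


def sample_target_length(rng):
    """Pick the target length for one pretraining input."""
    if rng.random() < SHORT_SEQ_PROB:
        # Assumption: shorter lengths are drawn uniformly.
        return rng.randint(2, MAX_SEQ_LENGTH - 1)
    return MAX_SEQ_LENGTH


rng = random.Random(0)
lengths = [sample_target_length(rng) for _ in range(10_000)]
short_fraction = sum(n < MAX_SEQ_LENGTH for n in lengths) / len(lengths)
print(f"fraction of short sequences: {short_fraction:.3f}")
```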

* We generate masked inputs for the MLM targets using n-gram masking (Joshi et al., 2019), with the length of each n-gram mask selected randomly. The probability for the length n is given by:

<center>$p(n)=\frac{1/n}{\sum^{N}_{k=1}{1/k}}$</center>

* We set the maximum length of n-gram (i.e., n) to be 3 (i.e., the MLM target can consist of up to a 3-gram of complete words, such as “White House correspondents”).
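
* With the maximum length set to N = 3, the formula for p(n) gives p(1) = 6/11, p(2) = 3/11 and p(3) = 2/11, so shorter masks are preferred. The sketch below evaluates the formula exactly with fractions and then samples mask lengths from it; the function names are assumptions for this example.

```python
import random
from fractions import Fraction


def ngram_length_probs(max_n):
    """p(n) = (1/n) / sum_{k=1..N} (1/k), returned as exact fractions."""
    weights = {n: Fraction(1, n) for n in range(1, max_n + 1)}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


def sample_ngram_length(rng, max_n=3):
    probs = ngram_length_probs(max_n)
    lengths = list(probs)
    return rng.choices(lengths, weights=[float(p) for p in probs.values()], k=1)[0]


probs = ngram_length_probs(3)
print({n: str(p) for n, p in probs.items()})  # {1: '6/11', 2: '3/11', 3: '2/11'}

rng = random.Random(0)
print([sample_ngram_length(rng) for _ in range(10)])
```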

* All the model updates use a batch size of 4096 and a LAMB optimizer with learning rate 0.00176 (You et al., 2019). We train all models for 125,000 steps unless otherwise specified. Training was done on Cloud TPU V3. The number of TPUs used for training ranged from 64 to 1024, depending on model size. 

* The experimental setup described in this section is used for all of our own versions of BERT as well as ALBERT models, unless otherwise specified.

#### 4.2 Evaluation Benchmarks
##### 4.2.1 Intrinsic Evaluation

* To monitor the training progress, we create a development set based on the development sets from SQuAD and RACE using the same procedure as in Sec. 4.1. We report accuracies for both MLM and sentence classification tasks. Note that we only use this set to check how the model is converging; it has not been used in a way that would affect the performance of any downstream evaluation, such as via model selection. 

##### 4.2.2 Downstream Evaluation

* Following Yang et al. (2019) and Liu et al. (2019), we evaluate our models on three popular benchmarks: The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018), two versions of the Stanford Question Answering Dataset (SQuAD; Rajpurkar et al., 2016; 2018), and the ReAding Comprehension from Examinations (RACE) dataset (Lai et al., 2017). For completeness, we provide a description of these benchmarks in Appendix A.1. As in Liu et al. (2019), we perform early stopping on the development sets, on which we report all comparisons except for our final comparisons based on the task leaderboards, for which we also report test set results.

#### 4.3 Overall Comparison between BERT and ALBERT

* We are now ready to quantify the impact of the design choices described in Sec. 3, specifically the ones around parameter efficiency. The improvement in parameter efficiency showcases the most important advantage of ALBERT’s design choices, as shown in Table 3: with only around 70% of BERT-large’s parameters, ALBERT-xxlarge achieves significant improvements over BERT-large, as measured by the difference on development set scores for several representative downstream tasks: SQuAD v1.1 (+1.9%), SQuAD v2.0 (+3.1%), MNLI (+1.4%), SST-2 (+2.2%), and RACE (+8.4%).

* We also observe that BERT-xlarge gets significantly worse results than BERT-base on all metrics. This indicates that a model like BERT-xlarge is more difficult to train than those that have smaller parameter sizes. Another interesting observation is the speed of data throughput at training time under the same training configuration (same number of TPUs). Because of less communication and fewer computations, ALBERT models have higher data throughput compared to their corresponding BERT models. The slowest one is the BERT-xlarge model, which we use as a baseline. As the models get larger, the differences between BERT and ALBERT models become bigger, e.g., ALBERT-xlarge can be trained 2.4x faster than BERT-xlarge. 

![avatar](picture5.png)

<center>Table 3: Dev set results for models pretrained over BOOKCORPUS and Wikipedia for 125k steps. Here and everywhere else, the Avg column is computed by averaging the scores of the downstream tasks to its left (the two numbers of F1 and EM for each SQuAD are first averaged).</center>

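
* The rule for the Avg column described in the Table 3 caption can be sketched as below: each SQuAD version first contributes the mean of its F1 and EM, and the result is then averaged with the remaining task scores. The function name is hypothetical and the scores are made-up placeholders chosen for easy arithmetic, not values from the paper.

```python
def average_score(squad_v1, squad_v2, other_scores):
    """squad_v1 and squad_v2 are (F1, EM) pairs; other_scores are single task scores."""
    squad_means = [sum(pair) / 2 for pair in (squad_v1, squad_v2)]
    all_scores = squad_means + list(other_scores)
    return sum(all_scores) / len(all_scores)


# Placeholder numbers only: SQuAD means are 85.0 and 75.0, then three other tasks.
avg = average_score(squad_v1=(90.0, 80.0), squad_v2=(80.0, 70.0), other_scores=[80.0, 90.0, 70.0])
print(avg)  # 80.0
```

Averaging F1 and EM first keeps each SQuAD version from counting twice in the final mean.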

* Next, we perform ablation experiments that quantify the individual contribution of each of the design choices for ALBERT.

#### 4.4 Factorized Embedding Parameterization

* Table 4 shows the effect of changing the vocabulary embedding size E using an ALBERT-base configuration setting (see Table 2), using the same set of representative downstream tasks. Under the non-shared condition (BERT-style), larger embedding sizes give better performance, but not by much. Under the all-shared condition (ALBERT-style), an embedding of size 128 appears to be the best. Based on these results, we use an embedding size E = 128 in all future settings, as a necessary step to do further scaling.

![avatar](picture6.png)

<center>Table 4: The effect of vocabulary embedding size on the performance of ALBERT-base.</center>

#### 4.5 Cross-Layer Parameter Sharing

* Table 5 presents experiments for various cross-layer parameter-sharing strategies, using an ALBERT-base configuration (Table 2) with two embedding sizes (E = 768 and E = 128). We compare the all-shared strategy (ALBERT-style), the not-shared strategy (BERT-style), and intermediate strategies in which only the attention parameters are shared (but not the FFN ones) or only the FFN parameters are shared (but not the attention ones).

* The all-shared strategy hurts performance under both conditions, but the drop is less severe for E = 128 (-1.5 on Avg) than for E = 768 (-2.5 on Avg). In addition, most of the performance drop appears to come from sharing the FFN-layer parameters, while sharing the attention parameters results in no drop when E = 128 (+0.1 on Avg), and a slight drop when E = 768 (-0.7 on Avg).

* There are other strategies for sharing parameters across layers. For example, we can divide the L layers into N groups of size M, where the layers within each size-M group share parameters. Overall, our experimental results show that the smaller the group size M is, the better the performance we get. However, decreasing the group size M also dramatically increases the number of overall parameters. We choose the all-shared strategy as our default.
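The group-wise scheme can be illustrated with a small sketch that maps each of the L layers to the parameter set it uses. Assigning contiguous layers to the same group is an assumption made here; the text does not say how layers are grouped. Function names are hypothetical.

```python
def layer_groups(num_layers, group_size):
    """Return, for each layer, the index of the parameter set it uses.

    Assumes contiguous grouping: layers 0..M-1 share set 0, and so on.
    """
    if group_size <= 0 or num_layers % group_size != 0:
        raise ValueError("group_size must be positive and divide num_layers")
    return [layer // group_size for layer in range(num_layers)]


def distinct_param_sets(num_layers, group_size):
    """Number of separately stored layer parameter sets (N = L / M)."""
    return len(set(layer_groups(num_layers, group_size)))


print(layer_groups(12, 4))          # three groups of four layers
print(distinct_param_sets(12, 12))  # M = L: all-shared (ALBERT-style)
print(distinct_param_sets(12, 1))   # M = 1: not-shared (BERT-style)
```

Because the number of stored parameter sets is L / M, halving M doubles the per-layer parameter budget, which matches the trade-off described above.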

![avatar](picture7.png)

<center>Table 5: The effect of cross-layer parameter-sharing strategies, ALBERT-base configuration.</center>

#### 4.6 Sentence Order Prediction (SOP)

* We compare head-to-head three experimental conditions for the additional inter-sentence loss: none (XLNet- and RoBERTa-style), NSP (BERT-style), and SOP (ALBERT-style), using an ALBERT-base configuration. Results are shown in Table 6, both for intrinsic tasks (accuracy on the MLM, NSP, and SOP tasks) and for downstream tasks.

![avatar](picture8.png)

<center>Table 6: The effect of sentence-prediction loss, NSP vs. SOP, on intrinsic and downstream tasks.</center>

* The results on the intrinsic tasks reveal that the NSP loss brings no discriminative power to the SOP task (52.0% accuracy, similar to the random-guess performance for the “None” condition). This allows us to conclude that NSP ends up modeling only topic shift. In contrast, the SOP loss does solve the NSP task relatively well (78.9% accuracy), and the SOP task even better (86.5% accuracy). Even more importantly, the SOP loss appears to consistently improve downstream task performance for multi-sentence encoding tasks (around +1% for SQuAD1.1, +2% for SQuAD2.0, +1.7% for RACE), for an Avg score improvement of around +1%.
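The difference between the two losses is easiest to see in how training pairs are built. The sketch below is an illustration, not the authors' data pipeline: SOP negatives swap two consecutive segments from the same document, while NSP negatives pair the first segment with one from elsewhere. The function names and the 50% sampling rate are assumptions.

```python
import random


def make_sop_example(seg_a, seg_b, rng):
    """seg_a and seg_b are consecutive segments of one document, in order."""
    if rng.random() < 0.5:
        return (seg_a, seg_b), 1  # positive: original order kept
    return (seg_b, seg_a), 0      # negative: same segments, order swapped


def make_nsp_example(seg_a, seg_b, random_segment, rng):
    """random_segment is assumed to come from a different document."""
    if rng.random() < 0.5:
        return (seg_a, seg_b), 1          # positive: true next segment
    return (seg_a, random_segment), 0     # negative: unrelated segment


rng = random.Random(0)
print(make_sop_example("He opened the door.", "Then he walked in.", rng))
print(make_nsp_example("He opened the door.", "Then he walked in.",
                       "Stock prices fell on Monday.", rng))
```

An SOP negative always shares its topic with the positive, so topic cues alone cannot solve it. This is consistent with the observation above that an NSP-trained model scores near chance on SOP.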

#### 4.7 Effect of Network Depth and Width

* In this section, we check how depth (number of layers) and width (hidden size) affect the performance of ALBERT. Table 7 shows the performance of an ALBERT-large configuration (see Table 2) using different numbers of layers. Networks with 3 or more layers are trained by fine-tuning using the parameters from the depth before (e.g., the 12-layer network parameters are fine-tuned from the checkpoint of the 6-layer network parameters).<sup>4</sup> A similar technique was used by Gong et al. (2019). If we compare a 3-layer ALBERT model with a 1-layer ALBERT model, although they have the same number of parameters, the performance increases significantly. However, there are diminishing returns when continuing to increase the number of layers: the results of a 12-layer network are relatively close to the results of a 24-layer network, and the performance of a 48-layer network appears to decline.

![avatar](picture9.png)

<center>Table 7: The effect of increasing the number of layers for an ALBERT-large configuration.</center>

* A similar phenomenon, this time for width, can be seen in Table 8 for a 3-layer ALBERT-large configuration. As we increase the hidden size, we get an increase in performance with diminishing returns. At a hidden size of 6144, the performance appears to decline significantly. We note that none of these models appear to overfit the training data, and they all have higher training and development loss compared to the best-performing ALBERT configurations.

![avatar](picture10.png)

<center>Table 8: The effect of increasing the hidden-layer size for an ALBERT-large 3-layer configuration.</center>

#### 4.8 What If We Train for the Same Amount of Time?

* The speed-up results in Table 3 indicate that data throughput for BERT-large is about 3.17x higher than for ALBERT-xxlarge. Since longer training usually leads to better performance, we perform a comparison in which, instead of controlling for data throughput (number of training steps), we control for the actual training time (i.e., let the models train for the same number of hours). In Table 9, we compare the performance of a BERT-large model after 400k training steps (after 34h of training), roughly equivalent to the amount of time needed to train an ALBERT-xxlarge model for 125k training steps (32h of training).
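As a quick sanity check on the step counts, the sketch below converts the 3.17x throughput ratio into an equal-time step budget and compares it with the ratio implied by the quoted wall-clock figures. The function names are hypothetical; only the numbers come from the paragraph above.

```python
def equivalent_steps(reference_steps, throughput_ratio):
    """Steps the faster model completes while the slower one does reference_steps."""
    return reference_steps * throughput_ratio


def steps_per_hour_ratio(fast_steps, fast_hours, slow_steps, slow_hours):
    """Throughput ratio implied by two (steps, hours) measurements."""
    return (fast_steps / fast_hours) / (slow_steps / slow_hours)


bert_large_steps = equivalent_steps(125_000, 3.17)
implied_ratio = steps_per_hour_ratio(400_000, 34, 125_000, 32)
print(f"{bert_large_steps:,.0f} steps")  # close to the 400k used in Table 9
print(f"{implied_ratio:.2f}x")           # about 3.0x from the quoted hours
```

The two estimates agree only approximately, which fits the text's description of the training times as "roughly equivalent".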

* After training for roughly the same amount of time, ALBERT-xxlarge is significantly better than BERT-large: +1.5% better on Avg, with the difference on RACE as high as +5.2%.

![avatar](picture11.png)

<center>Table 9: The effect of controlling for training time, BERT-large vs ALBERT-xxlarge configurations.</center>

#### 4.9 Do Very Wide ALBERT Models Need to Be Deep Too?

* In Section 4.7, we show that for ALBERT-large (H=1024), the difference between a 12-layer and a 24-layer configuration is small. Does this result still hold for much wider ALBERT configurations, such as ALBERT-xxlarge (H=4096)?

![avatar](picture12.png)

<center>Table 10: The effect of a deeper network using an ALBERT-xxlarge configuration.</center>

* The answer is given by the results from Table 10. The difference between 12-layer and 24-layer ALBERT-xxlarge configurations in terms of downstream accuracy is negligible, with the Avg score being the same. We conclude that, when sharing all cross-layer parameters (ALBERT-style), there is no need for models deeper than a 12-layer configuration.

#### 4.10 Additional Training Data and Dropout Effects

* The experiments done up to this point use only the Wikipedia and BOOKCORPUS datasets, as in Devlin et al. (2019). In this section, we report measurements on the impact of the additional data used by both XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019).

* Fig. 3a plots the dev set MLM accuracy under two conditions, without and with additional data, with the latter condition giving a significant boost. We also observe performance improvements on the downstream tasks in Table 11, except for the SQuAD benchmarks (which are Wikipedia-based, and therefore are negatively affected by out-of-domain training material).

![avatar](picture13.png)

<center>Figure 3: The effects of adding data and removing dropout during training.</center>

![avatar](picture14.png)

<center>Table 11: The effect of additional training data using the ALBERT-base configuration.</center>

* We also note that, even after training for 1M steps, our largest models still do not overfit to their training data. As a result, we decide to remove dropout to further increase our model capacity. The plot in Fig. 3b shows that removing dropout significantly improves MLM accuracy. Intermediate evaluation on ALBERT-xxlarge at around 1M training steps (Table 12) also confirms that removing dropout helps the downstream tasks. There is empirical (Szegedy et al., 2017) and theoretical (Li et al., 2019) evidence showing that a combination of batch normalization and dropout in Convolutional Neural Networks may have harmful results. To the best of our knowledge, we are the first to show that dropout can hurt performance in large Transformer-based models. However, the underlying network structure of ALBERT is a special case of the Transformer, and further experimentation is needed to see whether this phenomenon appears with other Transformer-based architectures.

![avatar](picture15.png)

<center>Table 12: The effect of removing dropout, measured for an ALBERT-xxlarge configuration.</center>

#### 4.11 Current State-of-the-Art on NLU Tasks

* The results we report in this section make use of the training data used by Devlin et al. (2019), as well as the additional data used by Liu et al. (2019) and Yang et al. (2019). We report state-of-the-art results under two settings for fine-tuning: single-model and ensembles. In both settings, we only do single-task fine-tuning.<sup>5</sup> Following Liu et al. (2019), on the development set we report the median result over five runs.

* The single-model ALBERT configuration incorporates the best-performing settings discussed: an ALBERT-xxlarge configuration (Table 2) using combined MLM and SOP losses, and no dropout. The checkpoints that contribute to the final ensemble model are selected based on development set performance; the number of checkpoints considered for this selection ranges from 6 to 17, depending on the task. For the GLUE (Table 13) and RACE (Table 14) benchmarks, we average the model predictions for the ensemble models, where the candidates are fine-tuned from different training steps using the 12-layer and 24-layer architectures. For SQuAD (Table 14), we average the prediction scores for those spans that have multiple probabilities; we also average the scores of the “unanswerable” decision.
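Prediction averaging for a classification-style task such as RACE can be sketched as below. This is an assumption-laden illustration: the text says only that model predictions are averaged, so averaging per-class probabilities (rather than logits) and taking the argmax is a guess at the details. The probabilities are hypothetical.

```python
def ensemble_average(per_model_probs):
    """Average class probabilities over models.

    per_model_probs: one list of class probabilities per checkpoint.
    """
    num_models = len(per_model_probs)
    num_classes = len(per_model_probs[0])
    return [
        sum(probs[c] for probs in per_model_probs) / num_models
        for c in range(num_classes)
    ]


def ensemble_predict(per_model_probs):
    """Index of the class with the highest averaged probability."""
    averaged = ensemble_average(per_model_probs)
    return max(range(len(averaged)), key=averaged.__getitem__)


# Three hypothetical checkpoints scoring one four-option question.
checkpoints = [
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.3, 0.4, 0.1],
    [0.1, 0.5, 0.3, 0.1],
]
print(ensemble_predict(checkpoints))
```

In this toy case the second checkpoint alone would pick option 2, but the averaged prediction follows the other two and picks option 1.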

* Both single-model and ensemble results indicate that ALBERT improves the state-of-the-art significantly for all three benchmarks, achieving a GLUE score of 89.4, a SQuAD 2.0 test F1 score of 92.2, and a RACE test accuracy of 89.4. The latter appears to be a particularly strong improvement, a jump of +17.4% absolute points over BERT (Devlin et al., 2019), +7.6% over XLNet (Yang et al., 2019), +6.2% over RoBERTa (Liu et al., 2019), and +5.3% over DCMN+ (Zhang et al., 2019), an ensemble of multiple models specifically designed for reading comprehension tasks. Our single model achieves an accuracy of 86.5%, which is still 2.4% better than the state-of-the-art ensemble model.

![avatar](picture16.png)

<center>Table 13: State-of-the-art results on the GLUE benchmark. For single-task single-model results, we report ALBERT at 1M steps (comparable to RoBERTa) and at 1.5M steps. The ALBERT ensemble uses models trained with 1M, 1.5M, and other numbers of steps.</center>

![avatar](picture17.png)

<center>Table 14: State-of-the-art results on the SQuAD and RACE benchmarks.</center>

### 5 Discussion

* While ALBERT-xxlarge has fewer parameters than BERT-large and gets significantly better results, it is computationally more expensive due to its larger structure. An important next step is thus to speed up the training and inference of ALBERT through methods like sparse attention (Child et al., 2019) and block attention (Shen et al., 2018). An orthogonal line of research, which could provide additional representation power, includes hard example mining (Mikolov et al., 2013) and more efficient language modeling training (Yang et al., 2019). Additionally, although we have convincing evidence that sentence order prediction is a more consistently useful learning task that leads to better language representations, we hypothesize that there could be more dimensions not yet captured by the current self-supervised training losses that could create additional representation power for the resulting representations.