### TinyBERT: Distilling BERT for Natural Language Understanding

* Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive and memory intensive, so it is difficult to run them effectively on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the rich knowledge encoded in a large “teacher” BERT can be well transferred to a small “student” TinyBERT. Moreover, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture both the general-domain and the task-specific knowledge in BERT.
* TinyBERT is empirically effective and achieves more than 96% of the performance of its teacher BERT$_{BASE}$ on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT is also significantly better than state-of-the-art baselines on BERT distillation, with only ∼28% of their parameters and ∼31% of their inference time.

#### 1. Introduction

* Pre-training language models and then fine-tuning them on downstream tasks has become a new paradigm for natural language processing (NLP). Pre-trained language models (PLMs), such as BERT (Devlin et al., 2018), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019) and SpanBERT (Joshi et al., 2019), have achieved great success in many NLP tasks (e.g., the GLUE benchmark (Wang et al., 2018) and the challenging multi-hop reasoning task (Ding et al., 2019)). However, PLMs usually have an extremely large number of parameters and need a long inference time, which makes them difficult to deploy on edge devices such as mobile phones. Moreover, recent studies (Kovaleva et al., 2019) also demonstrate that there is redundancy in PLMs. Therefore, it is both crucial and possible to reduce the computational overhead and model storage of PLMs while keeping their performance.

* Many model compression techniques (Han et al., 2015a) have been proposed to accelerate deep model inference and reduce model size while maintaining accuracy. The most commonly used techniques include quantization (Gong et al., 2014), weight pruning (Han et al., 2015b), and knowledge distillation (KD) (Romero et al., 2014). In this paper we focus on knowledge distillation, an idea proposed by Hinton et al. (2015) in a teacher-student framework. KD aims to transfer the knowledge embedded in a large teacher network to a small student network. The student network is trained to reproduce the behaviors of the teacher network. Based on this framework, we propose a novel distillation method specifically for Transformer-based models (Vaswani et al., 2017), and use BERT as an example to investigate KD methods for large-scale PLMs.

* KD has been extensively studied in NLP (Kim & Rush, 2016; Hu et al., 2018), while designing KD methods for BERT has been less explored. The pre-training-then-fine-tuning paradigm first pre-trains BERT on a large-scale unsupervised text corpus and then fine-tunes it on a task-specific dataset, which greatly increases the difficulty of BERT distillation. Thus we need to design an effective KD strategy for both stages. To build a competitive TinyBERT, we first propose a new Transformer distillation method to distill the knowledge embedded in the teacher BERT. Specifically, we design several loss functions to fit different representations from BERT layers: 1) the output of the embedding layer; 2) the hidden states and attention matrices derived from the Transformer layer; 3) the logits output by the prediction layer. The attention-based fitting is inspired by recent findings (Clark et al., 2019) that the attention weights learned by BERT can capture substantial linguistic knowledge, which motivates transferring this linguistic knowledge from the teacher BERT to the student TinyBERT. However, it is ignored by existing KD methods for BERT, such as Distilled BiLSTM$_{SOFT}$ (Tang et al., 2019), BERT-PKD (Sun et al., 2019) and DistilBERT. Then, we propose a novel two-stage learning framework consisting of general distillation and task-specific distillation. At the general distillation stage, the original BERT without fine-tuning acts as the teacher model. The student TinyBERT learns to mimic the teacher’s behavior by executing the proposed Transformer distillation on a large-scale general-domain corpus. We obtain a general TinyBERT that can be fine-tuned for various downstream tasks. At the task-specific distillation stage, we perform data augmentation to provide more task-specific data for teacher-student learning, and then re-execute the Transformer distillation on the augmented data. Both stages are essential to improve the performance and generalization capability of TinyBERT. A detailed comparison between the proposed method and other existing methods is summarized in Table 1. The Transformer distillation and the two-stage learning framework are the two key ideas of the proposed method.

* Table 1: A summary of KD methods for BERT. Abbreviations: INIT (initializing the student BERT with some layers of the pre-trained teacher BERT), DA (conducting data augmentation for task-specific training data). Embd, Attn, Hidn, and Pred represent the knowledge from embedding layers, attention matrices, hidden states, and final prediction layers, respectively.

![avatar](图片/1.png)

* The main contributions of this work are as follows: 1) We propose a new Transformer distillation method so that the linguistic knowledge encoded in the teacher BERT can be well transferred to TinyBERT. 2) We propose a novel two-stage learning framework that performs the proposed Transformer distillation at both the pre-training and fine-tuning stages, which ensures that TinyBERT can capture both the general-domain and task-specific knowledge of the teacher BERT. 3) We show experimentally that our TinyBERT can achieve more than 96% of the performance of the teacher BERT$_{BASE}$ on GLUE tasks, while having far fewer parameters (∼13.3%) and less inference time (∼10.6%), and that it significantly outperforms other state-of-the-art baselines on BERT distillation.

#### 2. Preliminaries

* We first describe the formulation of the Transformer (Vaswani et al., 2017) and Knowledge Distillation (Hinton et al., 2015). Our proposed Transformer distillation is a KD method specially designed for Transformer-based models.

##### 2.1 Transformer Layer

* Most recent pre-trained language models (e.g., BERT, XLNet and RoBERTa) are built with Transformer layers, which can capture long-term dependencies between input tokens through the self-attention mechanism. Specifically, a standard Transformer layer includes two main sub-layers: multi-head attention (MHA) and a fully connected feed-forward network (FFN).

* **Multi-Head Attention (MHA)**. The calculation of attention function depends on the three components of queries, keys and values, which are denoted as matrices **Q**, **K** and **V** respectively. The attention function can be formulated as follows:

![avatar](图片/2.png)

* where $d_k$ is the dimension of the keys and acts as a scaling factor, and $A$ is the attention matrix calculated from the compatibility of $Q$ and $K$ by a dot-product operation. The final function output is calculated as a weighted sum of the values $V$, and the weights are computed by applying the **softmax()** operation to each column of matrix $A$. According to Clark et al. (2019), the attention matrices in BERT can capture substantial linguistic knowledge, and thus play an essential role in our proposed distillation method.
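
* The attention function described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes the usual form $A = QK^T/\sqrt{d_k}$ with one query per row of $Q$, so each query's scores are normalized over the keys. The helper names `softmax` and `attention` and the toy matrices are assumptions made for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Return (A, output) for row-major matrices given as lists of lists."""
    d_k = len(K[0])
    # A[i][j]: scaled dot product of query i with key j (unnormalized scores).
    A = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
          for k_row in K] for q_row in Q]
    # Assumption: normalize each query's scores over all keys.
    weights = [softmax(row) for row in A]
    # Each output row is a weighted sum of the value rows.
    out = [[sum(w * v_row[c] for w, v_row in zip(w_row, V))
            for c in range(len(V[0]))] for w_row in weights]
    return A, out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
A, out = attention(Q, K, V)
```

* The function returns the unnormalized matrix `A` alongside the output because the distillation method described later fits `A` itself rather than its softmax.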

* Multi-head attention is defined by concatenating the attention heads from different representation subspaces as follows:


![avatar](图片/3.png)

* where $h$ is the number of attention heads, and $head_i$ denotes the $i$-th attention head, which is calculated by the **Attention()** function with inputs from different representation subspaces. The matrix $W$ acts as a linear transformation.
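
* A minimal sketch of the concatenate-then-project step follows. It assumes the per-head inputs are obtained by slicing already-projected $Q$, $K$ and $V$ into $h$ equal column blocks; the per-head projection matrices of a real Transformer are omitted, and all function names are hypothetical.

```python
import math

def matmul(X, Y):
    """Multiply two row-major matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def split_heads(X, h):
    """Slice the columns of X into h equally wide blocks (an assumption)."""
    width = len(X[0]) // h
    return [[row[i * width:(i + 1) * width] for row in X] for i in range(h)]

def multi_head(Q, K, V, W, h):
    heads = [attention(q, k, v) for q, k, v in
             zip(split_heads(Q, h), split_heads(K, h), split_heads(V, h))]
    # Concatenate the head outputs token by token, then apply W.
    concat = [sum((head[r] for head in heads), []) for r in range(len(Q))]
    return matmul(concat, W)

X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
out = multi_head(X, X, X, identity, h=2)
```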

* **Position-wise Feed-Forward Network (FFN).** The Transformer layer also contains a fully connected feed-forward network, which is formulated as follows:

![avatar](图片/4.png)

* We can see that the FFN contains two linear transformations and one ReLU activation.

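
* The two linear transformations and the ReLU can be written out for a single token vector. The sketch assumes the standard form $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$; the weights below are toy values chosen only so that the arithmetic is easy to follow.

```python
def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network applied to one token vector x."""
    # First linear transformation followed by ReLU.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    # Second linear transformation back to the model dimension.
    return [sum(hj * w for hj, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

x = [1.0, -2.0]
W1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]   # shape 2 x 3
b1 = [0.0, 0.0, 0.0]
W2 = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # shape 3 x 2
b2 = [0.5, -0.5]
y = ffn(x, W1, b1, W2, b2)
```

* Here $xW_1 = [1, -2, -3]$, the ReLU keeps only the first entry, and the second transformation yields `[2.5, -0.5]`.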

##### 2.2 Knowledge Distillation

* **KD** aims to transfer the knowledge of a large teacher network $T$ to a small student network $S$. The student network is trained to mimic the behaviors of the teacher network. Let $f^T$ and $f^S$ represent the behavior functions of the teacher and student networks, respectively. A behavior function transforms network inputs into some informative representation, and it can be defined as the output of any layer in the network. In the context of Transformer distillation, the output of the MHA layer or FFN layer, or some intermediate representation (such as the attention matrix $A$), can be used as the behavior function. Formally, KD can be modeled as minimizing the following objective function:

![avatar](图片/5.png)

* where $L(·)$ is a loss function that evaluates the difference between teacher and student networks, $x$ is the text input and $X$ denotes the training dataset. Thus the key research problem becomes how to define effective behavior functions and loss functions. Different from previous KD methods, we also need to consider how to perform KD at the pre-training stage of BERT in addition to the task-specific training stage.

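
* The objective above sums a loss between student and teacher behavior over the training set. A small sketch, assuming the form $\sum_{x \in X} L(f^S(x), f^T(x))$ with mean squared error as the loss; the behavior functions here are toy stand-ins, not real networks.

```python
def mse(a, b):
    """Mean squared error between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def kd_objective(dataset, f_student, f_teacher, loss=mse):
    """Sum the loss between student and teacher behavior over the dataset."""
    return sum(loss(f_student(x), f_teacher(x)) for x in dataset)

# Toy behavior functions (hypothetical): each maps an input to a 2-d vector.
teacher = lambda x: [2.0 * x, x + 1.0]
student = lambda x: [2.0 * x, x]

total = kd_objective([0.0, 1.0, 2.0], student, teacher)
```

* Each input contributes an error of 0.5 (the second component is off by one), so `total` is 1.5.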

#### 3. Method

* In this section, we propose a novel distillation method for Transformer-based models, and present a two-stage learning framework for our model distilled from BERT, which is called TinyBERT.


##### 3.1 Transformer Distillation

* The proposed Transformer distillation is a KD method specially designed for Transformer networks. Figure 1 displays an overview of the proposed KD method. In this work, both the student and teacher networks are built with Transformer layers. For clarity, we first formulate the problem before introducing our method.

![avatar](图片/6.png)

<center>Figure 1: An overview of Transformer distillation: (a) the framework of Transformer distillation, (b) the details of Transformer-layer distillation, consisting of Attn<sub>loss</sub> (attention-based distillation) and Hidn<sub>loss</sub> (hidden-states-based distillation).</center>

* **Problem Formulation.** Assuming that the student model has $M$ Transformer layers and the teacher model has $N$ Transformer layers, we choose $M$ layers from the teacher model for the Transformer-layer distillation. The function $n = g(m)$ is used as a mapping function from student layers to teacher layers, which means that the $m$-th layer of the student model learns from the $n$-th layer of the teacher model. The embedding-layer distillation and the prediction-layer distillation are also considered. We set 0 to be the index of the embedding layer and $M + 1$ to be the index of the prediction layer, and the corresponding layer mappings are defined as $0 = g(0)$ and $N + 1 = g(M + 1)$ respectively. The effect of the choice of mapping function on performance is studied in the experiment section. Formally, the student acquires knowledge from the teacher by minimizing the following objective:

![avatar](图片/7.png)

* where $L_{layer}$ refers to the loss function of a given model layer (e.g., Transformer layer or embedding layer) and $\lambda_m$ is the hyper-parameter that represents the importance of the m-th layer’s distillation.

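
* The weighted sum over mapped layer pairs can be sketched as follows, assuming the objective has the form $\sum_{m=0}^{M+1} \lambda_m L_{layer}(S_m, T_{g(m)})$. Layers are reduced to single numbers and the layer loss to a squared difference purely for illustration; `layer_map` and `model_loss` are hypothetical names.

```python
def layer_map(m, M, N, g):
    """Map student layer index m to a teacher layer index."""
    if m == 0:          # embedding layer maps to embedding layer
        return 0
    if m == M + 1:      # prediction layer maps to prediction layer
        return N + 1
    return g(m)         # Transformer layers use the chosen mapping

def model_loss(student_layers, teacher_layers, layer_loss, g, lambdas=None):
    """Weighted sum of per-layer losses over indices 0 .. M + 1."""
    M = len(student_layers) - 2
    N = len(teacher_layers) - 2
    lambdas = lambdas or [1.0] * (M + 2)
    return sum(
        lambdas[m] * layer_loss(student_layers[m],
                                teacher_layers[layer_map(m, M, N, g)])
        for m in range(M + 2)
    )

# Toy setting: M = 2 student layers, N = 4 teacher layers, g(m) = 2m.
student = [0.0, 1.0, 2.0, 3.0]
teacher = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
squared = lambda s, t: (s - t) ** 2
loss = model_loss(student, teacher, squared, g=lambda m: 2 * m)
```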

* **Transformer-layer Distillation.** The proposed Transformer-layer distillation includes attention-based distillation and hidden-states-based distillation, as shown in Figure 1 (b). The attention-based distillation is motivated by the recent finding that attention weights learned by BERT can capture rich linguistic knowledge (Clark et al., 2019). This kind of linguistic knowledge includes syntax and coreference information, which is essential for natural language understanding. Thus we propose attention-based distillation so that this linguistic knowledge can be transferred from the teacher BERT to the student TinyBERT. Specifically, the student learns to fit the matrices of multi-head attention in the teacher network, and the objective is defined as:

![avatar](图片/8.png)

* where $h$ is the number of attention heads, $A_i ∈ R^{l×l}$ refers to the attention matrix corresponding to the i-th head of teacher or student, $l$ is the input text length, and **MSE()** means the mean squared error loss function. In this work, the (unnormalized) attention matrix $A_i$ is used as the fitting target instead of its softmax output $softmax(A_i)$, since our experiments show that the former setting has a faster convergence rate and better performances.

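
* The attention-based objective averages a per-head MSE. A minimal sketch, assuming the form $\frac{1}{h}\sum_{i=1}^{h} \mathrm{MSE}(A_i^S, A_i^T)$ and that student and teacher have the same number of heads; the matrices are toy values and the function names are hypothetical.

```python
def mse(X, Y):
    """Mean squared error between two equally shaped matrices."""
    squared = [(x - y) ** 2 for rx, ry in zip(X, Y) for x, y in zip(rx, ry)]
    return sum(squared) / len(squared)

def attention_loss(student_heads, teacher_heads):
    """Average the per-head MSE between unnormalized attention matrices."""
    if len(student_heads) != len(teacher_heads):
        raise ValueError("student and teacher need the same number of heads")
    h = len(student_heads)
    return sum(mse(a_s, a_t)
               for a_s, a_t in zip(student_heads, teacher_heads)) / h

# Two heads, sequence length 2 (toy values).
student_A = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
teacher_A = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
loss = attention_loss(student_A, teacher_A)
```

* The first head is off by one everywhere and the second matches exactly, so the averaged loss is 0.5.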

![avatar](图片/9.png)

<center>Figure 2: The illustration of TinyBERT learning</center>


* In addition to the attention based distillation, we also distill the knowledge from the output of Transformer layer (as shown in Figure 1 (b)), and the objective is as follows:


![avatar](图片/10.png)

* where the matrices $H^S ∈ R^{l×d′}$ and $H^T ∈ R^{l×d}$ refer to the hidden states of the student and teacher networks respectively, which are calculated by Equation 4. The scalar values $d$ and $d′$ denote the hidden sizes of the teacher and student models, and $d′$ is often smaller than $d$ to obtain a smaller student network. The matrix $W_h ∈ R^{d′×d}$ is a learnable linear transformation, which transforms the hidden states of the student network into the same space as the teacher network’s states.
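
* Because the student's hidden size $d′$ differs from the teacher's $d$, the student states are projected with $W_h$ before the comparison. The sketch below assumes the objective is $\mathrm{MSE}(H^S W_h, H^T)$; in training $W_h$ would be learned, whereas here it is a fixed toy matrix.

```python
def matmul(X, Y):
    """Multiply two row-major matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def mse(X, Y):
    squared = [(x - y) ** 2 for rx, ry in zip(X, Y) for x, y in zip(rx, ry)]
    return sum(squared) / len(squared)

def hidden_loss(H_s, W_h, H_t):
    """Project student states (l x d') into teacher space (l x d), then MSE."""
    return mse(matmul(H_s, W_h), H_t)

H_s = [[1.0], [2.0]]            # l = 2 tokens, d' = 1
W_h = [[1.0, 2.0]]              # d' x d with d = 2
H_t = [[1.0, 2.0], [2.0, 2.0]]  # l x d
loss = hidden_loss(H_s, W_h, H_t)
```

* The projected student states are `[[1, 2], [2, 4]]`, which differ from the teacher's in one entry by 2, giving a loss of 4 / 4 = 1.0.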

* **Embedding-layer Distillation.** We also perform embedding-layer distillation, which is similar to the hidden states based distillation and formulated as:


![avatar](图片/11.png)

* where the matrices $E^S$ and $E^T$ refer to the embeddings of the student and teacher networks, respectively. In this paper, they have the same shape as the hidden state matrices. The matrix $W_e$ is a linear transformation playing a similar role to $W_h$.

* **Prediction-layer Distillation.** In addition to imitating the behaviors of intermediate layers, we also use knowledge distillation to fit the predictions of the teacher model (Hinton et al., 2015). Specifically, we penalize the soft cross-entropy loss between the student network’s logits and the teacher’s logits:

![avatar](图片/12.png)

* where $z^S$ and $z^T$ are the logits vectors predicted by the student and teacher respectively, **log_softmax()** denotes the log-likelihood, and $t$ is the temperature value. In our experiments, we find that $t = 1$ performs well.
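
* The soft cross-entropy can be sketched as below. One detail is an assumption: the temperature is applied to both the teacher and the student logits here, which makes no difference at the reported setting $t = 1$. The function names are hypothetical.

```python
import math

def log_softmax(z):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(z)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in z))
    return [v - log_total for v in z]

def prediction_loss(z_student, z_teacher, t=1.0):
    """Soft cross-entropy between teacher and student distributions."""
    # Assumption: temperature t scales both sets of logits.
    p_teacher = [math.exp(v) for v in log_softmax([z / t for z in z_teacher])]
    log_p_student = log_softmax([z / t for z in z_student])
    return -sum(p * lp for p, lp in zip(p_teacher, log_p_student))

matched = prediction_loss([0.0, 2.0], [0.0, 2.0])
mismatched = prediction_loss([2.0, 0.0], [0.0, 2.0])
```

* When the student reproduces the teacher's logits the loss reduces to the entropy of the teacher distribution, which is its minimum; any mismatch increases it.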

* Using the above distillation objectives (i.e. Equations 7, 8, 9 and 10), we can unify the distillation loss of the corresponding layers between the teacher and the student network:


![avatar](图片/13.png)

* In our experiments, we first perform intermediate-layer distillation $(M ≥ m ≥ 0)$, then perform prediction-layer distillation $(m = M + 1)$.
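
* The unified per-layer loss and the two-phase order can be sketched as a simple dispatch on the layer index. This assumes the piecewise form implied by the text: the embedding loss at $m = 0$, the sum of hidden-state and attention losses for $0 < m ≤ M$, and the prediction loss at $m = M + 1$. The loss values passed in are placeholders.

```python
def layer_loss(m, M, embd=0.0, hidn=0.0, attn=0.0, pred=0.0):
    """Select the distillation loss that applies to student layer index m."""
    if m == 0:
        return embd
    if 0 < m <= M:
        return hidn + attn
    if m == M + 1:
        return pred
    raise ValueError("layer index must lie in 0 .. M + 1")

def distillation_phases(M):
    """Layer indices trained in the first and second phase, respectively."""
    intermediate = list(range(0, M + 1))   # embedding + Transformer layers
    prediction = [M + 1]                   # prediction layer only
    return intermediate, prediction

phases = distillation_phases(4)
```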

##### 3.2 TinyBERT Learning

* The application of BERT usually consists of two learning stages: pre-training and fine-tuning. The rich knowledge learned by BERT in the pre-training stage is of great importance and should also be transferred. Therefore, we propose a novel two-stage learning framework consisting of general distillation and task-specific distillation, as illustrated in Figure 2. General distillation helps the student TinyBERT learn the rich knowledge embedded in the teacher BERT, which plays an important role in improving the generalization capability of TinyBERT. Task-specific distillation teaches the student the task-specific knowledge. With this two-step distillation, we can further reduce the gap between the teacher and student models.

* Table 2: Results are evaluated on the test set of GLUE official benchmark. All models are learned in a single-task manner. “-” means the result is not reported.


![avatar](图片/14.png)

* Table 3: The model sizes and inference time for baselines and TinyBERT. The number of layers does not include the embedding and prediction layers.


![avatar](图片/15.png)

* **General Distillation.** In general distillation, we use the original BERT without fine-tuning as the teacher and a large-scale text corpus as the learning data. By performing the Transformer distillation on general-domain text, we obtain a general TinyBERT that can be fine-tuned for downstream tasks. However, due to the significant reductions in the hidden/embedding size and the number of layers, the general TinyBERT performs relatively worse than BERT.

* **Task-specific Distillation.** Previous studies show that complex models such as fine-tuned BERTs suffer from over-parametrization on domain-specific tasks (Kovaleva et al., 2019). Thus, it is possible for small models to achieve performance comparable to that of the BERTs. To this end, we propose to derive competitive fine-tuned TinyBERTs through task-specific distillation. In the task-specific distillation, we re-perform the proposed Transformer distillation on an augmented task-specific dataset (as shown in Figure 2). Specifically, the fine-tuned BERT is used as the teacher and a data augmentation method is proposed to expand the task-specific training set. By learning from more task-related examples, the generalization capability of the student model can be further improved. In this work, we combine a pre-trained language model, BERT, with GloVe (Pennington et al., 2014) word embeddings to do word-level replacement for data augmentation. Specifically, we use the language model to predict word replacements for single-piece words (Wu et al., 2019), and use the word embeddings to retrieve the most similar words as replacements for multiple-piece words. Some hyperparameters are defined to control the replacement ratio of a sentence and the size of the augmented dataset. More details of the data augmentation procedure are discussed in Appendix A.
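
* The word-level replacement idea can be illustrated with a short sketch. The exact procedure is given in Appendix A, which is not reproduced in these notes, so everything below is an assumption: the candidate providers are stubs standing in for BERT predictions and GloVe nearest neighbours, and the names `p_replace` and `n_samples` are hypothetical stand-ins for the hyperparameters that control the replacement ratio and the amount of augmented data.

```python
import random

def augment(words, lm_candidates, embedding_candidates, is_single_piece,
            p_replace=0.4, n_samples=2, seed=0):
    """Return n_samples augmented copies of a tokenized sentence."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_samples):
        new_words = []
        for i, word in enumerate(words):
            # Single-piece words: ask the language model for replacements.
            # Multiple-piece words: fall back to embedding neighbours.
            if is_single_piece(word):
                candidates = lm_candidates(words, i)
            else:
                candidates = embedding_candidates(word)
            if candidates and rng.random() < p_replace:
                new_words.append(rng.choice(candidates))
            else:
                new_words.append(word)
        augmented.append(new_words)
    return augmented

# Stub providers (hypothetical); a real setup would query BERT and GloVe.
lm_stub = lambda words, i: ["good"]
glove_stub = lambda word: ["film"]
short_word = lambda word: len(word) <= 5

sentence = ["a", "great", "blockbuster"]
always = augment(sentence, lm_stub, glove_stub, short_word, p_replace=1.0)
never = augment(sentence, lm_stub, glove_stub, short_word, p_replace=0.0)
```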

* The above two learning stages are complementary to each other: the general distillation provides a good initialization for the task-specific distillation, while the task-specific distillation further improves TinyBERT by focusing on learning the task-specific knowledge. Although there is a big gap between BERT and TinyBERT in model size, by performing the proposed two-stage distillation, the TinyBERT can achieve competitive performances in various NLP tasks. The proposed Transformer distillation and two-stage learning framework are the two most important components of the proposed distillation method.


#### 4. Experiments

* In this section, we evaluate the effectiveness and efficiency of TinyBERT on a variety of tasks with different model settings.


##### 4.1 Model Setup

* We instantiate a tiny student model (number of layers $M=4$, hidden size $d'=312$, feed-forward/filter size $d'_i=1200$ and head number $h=12$) that has a total of 14.5M parameters. Unless otherwise specified, this student model is referred to as TinyBERT. The original BERT$_{BASE}$ (number of layers $N=12$, hidden size $d=768$, feed-forward/filter size $d_i=3072$ and head number $h=12$) is used as the teacher model and contains 109M parameters. We use $g(m) = 3 \times m$ as the layer mapping function, so TinyBERT learns from every third layer of BERT$_{BASE}$. The learning weight $\lambda$ of each layer is set to 1, which performs well for the learning of our TinyBERT.
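
* The headline size and speed figures follow directly from the numbers quoted in the text. The short calculation below only checks that arithmetic; the variable names are chosen for this example.

```python
student_params = 14.5e6   # TinyBERT, from the text
teacher_params = 109e6    # BERT-Base, from the text
speedup = 9.4             # reported inference speedup

compression = teacher_params / student_params        # how many times smaller
param_fraction = 100 * student_params / teacher_params
time_fraction = 100 / speedup

print(f"{compression:.1f}x smaller, "
      f"{param_fraction:.1f}% parameters, "
      f"{time_fraction:.1f}% inference time")
```

* This reproduces the 7.5x, ∼13.3% and ∼10.6% figures stated in the abstract and the contribution list.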

![avatar](图片/16.png)

##### 4.2 Experimental Results on GLUE

* We evaluate TinyBERT on the General Language Understanding Evaluation (GLUE) (Wang et al., 2018) benchmark, which is a collection of diverse natural language understanding tasks. The details of experiment settings are described in Appendix B. The evaluation results are presented in Table 2 and the efficiencies of model size and inference time are also evaluated in Table 3.


* The experimental results demonstrate that: 1) There is a large performance gap between BERT$_{SMALL}$ and BERT$_{BASE}$ due to the big reduction in model size. 2) TinyBERT is consistently better than BERT$_{SMALL}$ on all the GLUE tasks and achieves a large improvement of 6.3% on average. This indicates that the proposed KD learning framework can effectively improve the performance of small models regardless of the downstream task. 3) TinyBERT significantly outperforms the state-of-the-art KD baselines (i.e., BERT-PKD and DistilBERT) by a margin of at least 3.9%, even with only ∼28% of the parameters and ∼31% of the inference time of the baselines (see Table 3). 4) Compared with the teacher BERT$_{BASE}$, TinyBERT is 7.5x smaller and 9.4x faster, while maintaining competitive performance. 5) TinyBERT has a model efficiency comparable to that of Distilled BiLSTM$_{SOFT}$ (slightly larger in size but faster in inference) and obtains substantially better performance on all tasks reported by the BiLSTM baseline. 6) For the challenging CoLA dataset (the task of predicting linguistic acceptability judgments), all the distilled small models have a relatively larger performance gap with the teacher model. TinyBERT achieves a significant improvement over the strong baselines, and its performance can be further improved by using a deeper and wider model to capture more complex linguistic knowledge, as illustrated in the next subsection. We also provide more complete comparisons, using the same student architecture as the baselines, in Appendix E.

* Moreover, BERT-PKD and DistilBERT initialize their student models with some layers of a well pre-trained teacher BERT (see Table 1), which forces the student models to keep the same Transformer-layer (or embedding-layer) size settings as their teacher BERT. In our two-stage distillation framework, TinyBERT is initialized by general distillation, so it has the advantage of being more flexible in model size selection.

##### 4.3 Effects of Model Size

* We evaluate how much improvement can be achieved by increasing the model size of TinyBERT on several typical GLUE tasks, where MNLI and MRPC are used in the ablation studies of Devlin et al. (2018), and CoLA is the most difficult task in GLUE. Specifically, three wider and deeper variants are proposed and their evaluation results on the development set are displayed in Table 4. We can observe that: 1) All three TinyBERT variants consistently outperform the original smallest TinyBERT, which indicates that the proposed KD method works for student models of various sizes. 2) For the CoLA task, the improvement is slight when only the number of layers (from 49.7 to 50.6) or the hidden size (from 49.7 to 50.5) is increased. To achieve more dramatic improvements, the student model should become both deeper and wider (from 49.7 to 54.0). 3) Another interesting observation is that the smallest 4-layer TinyBERT can even outperform the 6-layer baselines, which further confirms the effectiveness of the proposed KD method.

<center>Table 5: Ablation studies of different procedures (i.e., TD, GD, and DA) of the two-stage learning framework. The variants are validated on the dev set.</center>


![avatar](图片/17.png)

<center>Table 6: Ablation studies of different distillation objectives in the TinyBERT learning. The variants are validated on the dev set.</center>

![avatar](图片/18.png)

<center>Table 7: Results (dev) of different mapping strategies.</center>

![avatar](图片/19.png)

##### 4.4 Ablation Studies

* In this section, we conduct ablation studies to investigate the contributions of: 1) different procedures of the proposed two-stage TinyBERT learning framework (see Figure 2), and 2) different distillation objectives (see Equation 11).

* **Effects of different learning procedures.** The proposed two-stage TinyBERT learning framework (see Figure 2) consists of three key procedures: TD (Task-specific Distillation), GD (General Distillation) and DA (Data Augmentation). The effects of different learning procedures are analyzed and presented in Table 5. The results indicate that all the three procedures are crucial for the proposed KD method. The TD and DA has comparable effects in all the four tasks. We can also find the task-specific procedures (TD and DA) are more helpful than the pre-training procedure (GD) in all the four tasks. Another interesting observation is that GD has more effect on CoLA than on MNLI and MRPC. We conjecture that the ability of linguistic generalization (Warstadt et al., 2018) learned by GD plays a more important role in the downstream CoLA task (linguistic acceptability judgments).


* **不同学习程序的效果。** 所提出的两阶段TinyBERT学习框架（见图2）包括三个关键步骤：TD（任务特定蒸馏）、GD（一般蒸馏）和DA（数据扩充）。表5分析了不同学习过程的效果。结果表明，这三个步骤对所提出的KD方法都是至关重要的。TD和DA在这四个任务中的效果相当。我们还发现任务特定程序（TD和DA）在所有四个任务中都比预训练程序（GD）更有帮助。另一个有趣的观察结果是GD对CoLA的影响大于对MNLI和MRPC的影响。我们推测GD学习到的语言泛化能力（Warstadt et al.，2018）在下游的CoLA任务（语言可接受性判断）中扮演着更重要的角色。

* **Effects of different distillation objectives.** We investigate the effects of the distillation objectives on TinyBERT learning. Several baselines are proposed, including TinyBERT learning without the Transformer-layer distillation (No Trm), without the embedding-layer distillation (No Emb) and without the prediction-layer distillation (No Pred). The results are shown in Table 6 and indicate that all the proposed distillation objectives are useful for TinyBERT learning. The performance drops significantly, from 75.3 to 56.3, under the No Trm setting, which indicates that Transformer-layer distillation is the key to TinyBERT learning. Furthermore, we study the contributions of attention (No Attn) and hidden states (No Hidn) in the Transformer-layer distillation. We find that attention-based distillation has a bigger effect than hidden-states-based distillation on TinyBERT learning. Meanwhile, these two kinds of knowledge distillation are complementary to each other, which allows TinyBERT to obtain competitive results.

* **不同蒸馏目标的影响。** 我们研究了蒸馏目标对TinyBERT学习的影响。我们提出了几个基线，分别是去掉Transformer层蒸馏（No Trm）、去掉嵌入层蒸馏（No Emb）和去掉预测层蒸馏（No Pred）的TinyBERT学习。结果如表6所示，表明所有提出的蒸馏目标对TinyBERT学习都是有用的。在No Trm设置下，性能从75.3显著下降到56.3，说明Transformer层蒸馏是TinyBERT学习的关键。此外，我们还研究了注意力（No Attn）和隐藏状态（No Hidn）在Transformer层蒸馏中的贡献。我们发现基于注意力的蒸馏比基于隐藏状态的蒸馏对TinyBERT学习的影响更大。同时，这两种知识蒸馏是相辅相成的，使TinyBERT获得了有竞争力的结果。
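
The ablation variants above can be read as switching individual loss terms off. The following is a minimal sketch of that bookkeeping, not the paper's training code: the names `LossTerms`, `ABLATIONS` and `combined_loss` are hypothetical, the per-term values are assumed to be already computed scalars, and the per-layer weights and the order in which the terms are optimized are ignored.

```python
from dataclasses import dataclass


@dataclass
class LossTerms:
    """Already-computed scalar losses for one batch (illustrative values only)."""
    emb: float   # embedding-layer distillation
    attn: float  # attention-based distillation (Transformer layer)
    hidn: float  # hidden-states-based distillation (Transformer layer)
    pred: float  # prediction-layer distillation


# Which terms each ablation variant drops. "no_trm" removes both
# Transformer-layer terms, matching the No Trm baseline described above.
ABLATIONS = {
    "full": set(),
    "no_trm": {"attn", "hidn"},
    "no_emb": {"emb"},
    "no_pred": {"pred"},
    "no_attn": {"attn"},
    "no_hidn": {"hidn"},
}


def combined_loss(terms: LossTerms, variant: str = "full") -> float:
    """Sum the loss terms that the given ablation variant keeps (equal weights assumed)."""
    dropped = ABLATIONS[variant]
    return sum(value for name, value in vars(terms).items() if name not in dropped)


if __name__ == "__main__":
    terms = LossTerms(emb=0.5, attn=1.0, hidn=2.0, pred=0.25)
    for variant in ABLATIONS:
        print(variant, combined_loss(terms, variant))
```

Keeping the variants in one table makes it easy to check that No Trm is exactly the union of No Attn and No Hidn.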

#### 4.5、映射函数的效果

* We investigate the effects of different mapping functions $n = g(m)$ on the TinyBERT learning. Our original TinyBERT, as described in Section 4.1, uses the uniform-strategy, and we compare it with two typical baselines: the top-strategy ($g(m) = m + N - M$, $0 < m \le M$) and the bottom-strategy ($g(m) = m$, $0 < m \le M$).

* 我们研究了不同映射函数$n = g(m)$对TinyBERT学习的影响。我们最初的TinyBERT如第4.1节所述使用统一策略（uniform-strategy），并与两个典型基线进行比较，包括顶部策略（top-strategy，$g(m) = m + N - M$，$0 < m \le M$）和底部策略（bottom-strategy，$g(m) = m$，$0 < m \le M$）。
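
To make the three mapping functions concrete, here is a small sketch that lists which teacher layer $n$ each student layer $m$ is paired with. The top and bottom formulas are the ones given above. The uniform formula $g(m) = m \cdot N / M$ is an assumption of this sketch (it yields $g(m) = 3m$ for a 4-layer student and a 12-layer teacher), and the function names are illustrative.

```python
def uniform_strategy(m: int, N: int, M: int) -> int:
    """Assumed uniform mapping: spread the M student layers evenly over N teacher layers."""
    if N % M != 0:
        raise ValueError("this sketch assumes N is a multiple of M")
    return m * (N // M)


def top_strategy(m: int, N: int, M: int) -> int:
    """g(m) = m + N - M: student layers learn from the top M teacher layers."""
    return m + N - M


def bottom_strategy(m: int, N: int, M: int) -> int:
    """g(m) = m: student layers learn from the bottom M teacher layers."""
    return m


def layer_map(strategy, N: int, M: int) -> dict:
    """Return {student layer m: teacher layer g(m)} for 0 < m <= M."""
    return {m: strategy(m, N, M) for m in range(1, M + 1)}


if __name__ == "__main__":
    N, M = 12, 4  # 12-layer teacher, 4-layer student
    print(layer_map(uniform_strategy, N, M))  # {1: 3, 2: 6, 3: 9, 4: 12}
    print(layer_map(top_strategy, N, M))      # {1: 9, 2: 10, 3: 11, 4: 12}
    print(layer_map(bottom_strategy, N, M))   # {1: 1, 2: 2, 3: 3, 4: 4}
```

Only the uniform map touches both low and high teacher layers, which is the property the comparison in Table 7 credits for its better results.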

* The comparison results are presented in Table 7. We find that the top-strategy performs better than the bottom-strategy in MNLI, while being worse in MRPC and CoLA tasks, which confirms the observations that different tasks depend on the different kinds of knowledge from BERT layers. Since the uniform-strategy acquires the knowledge from bottom to top layers of BERTBASE, it achieves better performances than the other two baselines in all the four tasks. Adaptively choosing layers for a specific task is a challenging problem and we leave it as the future work.


* 比较结果见表7。我们发现，在MNLI中，顶层策略的表现优于底层策略，而在MRPC和CoLA任务中表现较差，这证实了不同任务依赖于BERT层的不同知识的观察结果。由于统一策略从底层到顶层获取了BERTBASE的知识，因此在这四个任务中，它的性能都优于其他两个基线。为特定任务自适应地选择层是一个具有挑战性的问题，我们将其留待以后的工作。

* **Other Experiments.** We also evaluate TinyBERT on the question answering tasks, and study whether we can use BERTSMALL as the initialization of the general TinyBERT. The experiments are detailed in Appendix C and D.

* **其他实验。** 我们还在问答任务上对TinyBERT进行了评估，并研究是否可以使用BERTSMALL作为一般TinyBERT的初始化。实验详情见附录C和D。

#### 5、结论和今后的工作

* In this paper, we first introduce a new KD method for Transformer-based distillation, and then propose a two-stage framework for TinyBERT learning. Extensive experiments show that TinyBERT achieves competitive performance while significantly reducing the model size and shortening the inference time of the original BERTBASE, which provides an effective way to deploy BERT-based NLP applications on edge devices.


* 本文首先介绍了一种新的基于Transformer的蒸馏KD方法，然后提出了一个两阶段TinyBERT学习框架。大量的实验表明，TinyBERT在显著减小原BERT_BASE模型大小和缩短推理时间的同时，取得了具有竞争力的性能，为在边缘设备上部署基于BERT的NLP应用提供了一种有效的途径。

* In future work, we will study how to effectively transfer the knowledge from wider and deeper teachers (e.g., BERTLARGE and XLNetLARGE) to the student TinyBERT. The joint learning of distillation and quantization/pruning would be another promising direction to further compress pre-trained language models.

* 在今后的工作中，我们将研究如何有效地将知识从更宽、更深的teacher模型（如BERTLARGE和XLNetLARGE）传递给student TinyBERT。蒸馏和量化/剪枝的联合学习将是进一步压缩预训练语言模型的另一个有前途的方向。

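# TinyBERT highlights

* 1. Unlike previous KD methods, TinyBERT has to consider how to perform KD during BERT's pre-training stage as well as during the task-specific training stage.
* 2. Transformer distillation: embedding, Attn + hidn, pred.
* 3. Layer mapping function: some tasks do better with a bottom-layer mapping, others with a top-layer mapping.
* 4. Task-specific distillation: it is unclear exactly how the corpus is augmented with GloVe and BERT.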
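
On note 4, one plausible reading of word-level data augmentation is sketched below: for each position, fetch a set of replacement candidates and substitute one with some probability. This is an illustration under stated assumptions, not the paper's procedure. The names `augment`, `toy_candidates` and `TOY_NEIGHBOURS` are hypothetical, and the toy lookup table merely stands in for the two real candidate sources (BERT predictions for a masked position and GloVe nearest neighbours), neither of which is called here.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical signature: given the sentence and a position, return replacement candidates.
CandidateFn = Callable[[Sequence[str], int], List[str]]


def augment(
    tokens: Sequence[str],
    candidates_for: CandidateFn,
    p_replace: float,
    n_aug: int,
    rng: random.Random,
) -> List[List[str]]:
    """Create n_aug noisy copies of tokens by randomly swapping in candidate words."""
    augmented = []
    for _ in range(n_aug):
        new_tokens = list(tokens)
        for i in range(len(tokens)):
            candidates = candidates_for(tokens, i)
            # Replace this position with probability p_replace, if any candidate exists.
            if candidates and rng.random() < p_replace:
                new_tokens[i] = rng.choice(candidates)
        augmented.append(new_tokens)
    return augmented


# Toy stand-in for the real candidate sources (a masked language model or
# word-embedding neighbours); the entries are made up for illustration.
TOY_NEIGHBOURS = {"good": ["great", "fine"], "movie": ["film"]}


def toy_candidates(tokens: Sequence[str], i: int) -> List[str]:
    return TOY_NEIGHBOURS.get(tokens[i], [])


if __name__ == "__main__":
    rng = random.Random(0)
    for sentence in augment(["a", "good", "movie"], toy_candidates, 0.5, 3, rng):
        print(" ".join(sentence))
```

Passing the candidate source as a function keeps the sampling loop independent of whichever model supplies the candidates.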