## Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

### Nils Reimers and Iryna Gurevych

### Abstract

BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) have set a new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS). However, they require that both sentences are fed into the network, which causes a massive computational overhead: finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering.

In this publication, we present Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy of BERT. We evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where they outperform other state-of-the-art sentence embedding methods.

### 1 Introduction

In this publication, we present Sentence-BERT (SBERT), a modification of the BERT network using siamese and triplet networks that is able to derive semantically meaningful sentence embeddings. This enables BERT to be used for certain new tasks that were not feasible with BERT until now. These tasks include large-scale semantic similarity comparison, clustering, and information retrieval via semantic search.

BERT set new state-of-the-art performance on various sentence classification and sentence-pair regression tasks. BERT uses a cross-encoder: two sentences are passed to the transformer network and the target value is predicted. However, this setup is unsuitable for various pair regression tasks because there are too many possible combinations. Finding the pair with the highest similarity in a collection of n = 10,000 sentences requires n·(n-1)/2 = 49,995,000 inference computations with BERT. On a modern V100 GPU, this requires about 65 hours. Similarly, finding which of the over 40 million existing questions on Quora is the most similar to a new question could be modeled as a pair-wise comparison with BERT; however, answering a single query would require over 50 hours.
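The pair count quoted above follows directly from the formula n·(n-1)/2. The short sketch below reproduces the arithmetic; the per-pair time is only a back-of-the-envelope figure derived from the roughly 65 hours stated in the text, not a measured value, and the names `num_pairs` and `ms_per_pair` are illustrative.

```python
def num_pairs(n: int) -> int:
    """Number of unordered pairs among n sentences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


pairs = num_pairs(10_000)
print(pairs)  # 49995000

# Rough average cost per cross-encoder comparison implied by ~65 hours in total.
ms_per_pair = 65 * 3600 * 1000 / pairs
print(round(ms_per_pair, 2))  # 4.68
```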

A common method to address clustering and semantic search is to map each sentence to a vector space such that semantically similar sentences are close. Researchers have started to input individual sentences into BERT and to derive fixed-size sentence embeddings. The most commonly used approach is to average the BERT output layer (known as BERT embeddings) or to use the output of the first token (the [CLS] token). As we will show, this common practice yields rather bad sentence embeddings, often worse than averaging GloVe embeddings (Pennington et al., 2014).

To alleviate this issue, we developed SBERT. The siamese network architecture enables fixed-size vectors to be derived for input sentences. Using a similarity measure like cosine-similarity or Manhattan / Euclidean distance, semantically similar sentences can be found. These similarity measures can be computed extremely efficiently on modern hardware, allowing SBERT to be used for semantic similarity search as well as for clustering. The effort for finding the most similar sentence pair in a collection of 10,000 sentences is reduced from 65 hours with BERT to the computation of 10,000 sentence embeddings (\~5 seconds with SBERT) and computing cosine-similarity (~0.01 seconds). By using optimized index structures, finding the most similar Quora question can be reduced from 50 hours to a few milliseconds (Johnson et al., 2017).
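Once every sentence has a fixed-size vector, the pair search reduces to comparing vectors. The sketch below illustrates this with plain Python on a handful of made-up toy vectors (they are not real SBERT embeddings). It still loops over all pairs; in practice the comparison would be a batched matrix operation or an index lookup, which is where the speed-up described above comes from.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_pair(embeddings):
    """Return the index pair (i, j), i < j, with the highest cosine similarity."""
    return max(
        combinations(range(len(embeddings)), 2),
        key=lambda ij: cosine_similarity(embeddings[ij[0]], embeddings[ij[1]]),
    )


# Toy vectors standing in for precomputed sentence embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
best = most_similar_pair(toy_embeddings)
print(best)  # (0, 1)
```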

We fine-tune SBERT on NLI data, which creates sentence embeddings that significantly outperform other state-of-the-art sentence embedding methods like InferSent (Conneau et al., 2017) and Universal Sentence Encoder (Cer et al., 2018). On seven Semantic Textual Similarity (STS) tasks, SBERT achieves an improvement of 11.7 points compared to InferSent and 5.5 points compared to Universal Sentence Encoder. On SentEval (Conneau and Kiela, 2018), an evaluation toolkit for sentence embeddings, we achieve an improvement of 2.1 and 2.6 points, respectively.


SBERT can be adapted to a specific task. It sets new state-of-the-art performance on a challenging argument similarity dataset (Misra et al., 2016) and on a triplet dataset to distinguish sentences from different sections of a Wikipedia article (Dor et al., 2018).


The paper is structured as follows: Section 3 presents SBERT. Section 4 evaluates SBERT on common STS tasks and on the challenging Argument Facet Similarity (AFS) corpus (Misra et al., 2016). Section 5 evaluates SBERT on SentEval. In Section 6, we perform an ablation study to test some design aspects of SBERT. In Section 7, we compare the computational efficiency of SBERT sentence embeddings with that of other state-of-the-art sentence embedding methods.

### 2 Related Work

We first introduce BERT, then we discuss state-of-the-art sentence embedding methods. BERT (Devlin et al., 2018) is a pretrained transformer network (Vaswani et al., 2017), which set new state-of-the-art results for various NLP tasks, including question answering, sentence classification, and sentence-pair regression. The input for BERT for sentence-pair regression consists of the two sentences, separated by a special [SEP] token. Multi-head attention over 12 (base-model) or 24 layers (large-model) is applied and the output is passed to a simple regression function to derive the final label. Using this setup, BERT set a new state-of-the-art performance on the Semantic Textual Similarity (STS) benchmark (Cer et al., 2017). RoBERTa (Liu et al., 2019) showed that the performance of BERT can be further improved by small adaptations to the pre-training process. We also tested XLNet (Yang et al., 2019), but it led in general to worse results than BERT.

A large disadvantage of the BERT network structure is that no independent sentence embeddings are computed, which makes it difficult to derive sentence embeddings from BERT. To bypass this limitation, researchers passed single sentences through BERT and then derived a fixed-size vector by either averaging the outputs (similar to average word embeddings) or by using the output of the special CLS token (for example: May et al. (2019); Zhang et al. (2019); Qiao et al. (2019)). These two options are also provided by the popular bert-as-a-service repository. To our knowledge, there has so far been no evaluation of whether these methods lead to useful sentence embeddings.

Sentence embeddings are a well-studied area with dozens of proposed methods. Skip-Thought (Kiros et al., 2015) trains an encoder-decoder architecture to predict the surrounding sentences. InferSent (Conneau et al., 2017) uses labeled data of the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015) and the Multi-Genre NLI dataset (Williams et al., 2018) to train a siamese BiLSTM network with max-pooling over the output. Conneau et al. showed that InferSent consistently outperforms unsupervised methods like Skip-Thought. Universal Sentence Encoder (Cer et al., 2018) trains a transformer network and augments unsupervised learning with training on SNLI. Hill et al. (2016) showed that the task on which sentence embeddings are trained significantly impacts their quality. Previous work (Conneau et al., 2017; Cer et al., 2018) found that the SNLI datasets are suitable for training sentence embeddings. Yang et al. (2018) presented a method to train on conversations from Reddit using siamese DAN and siamese transformer networks, which yielded good results on the STS benchmark dataset.

![picture1.png](attachment:picture1.png)

Figure 1: SBERT architecture with classification objective function, e.g., for fine-tuning on SNLI dataset. The two BERT networks have tied weights (siamese network structure).


Humeau et al. (2019) address the run-time overhead of the cross-encoder from BERT and present a method (poly-encoders) to compute a score between m context vectors and pre-computed candidate embeddings using attention. This idea works for finding the highest scoring sentence in a larger collection. However, poly-encoders have the drawback that the score function is not symmetric and the computational overhead is too large for use-cases like clustering, which would require $O(n^2)$ score computations.

Previous neural sentence embedding methods started the training from a random initialization. In this publication, we use the pre-trained BERT and RoBERTa networks and only fine-tune them to yield useful sentence embeddings. This significantly reduces the needed training time: SBERT can be tuned in less than 20 minutes, while yielding better results than comparable sentence embedding methods.

### 3 Model

SBERT adds a pooling operation to the output of BERT / RoBERTa to derive a fixed-size sentence embedding. We experiment with three pooling strategies: using the output of the CLS-token, computing the mean of all output vectors (MEAN-strategy), and computing a max-over-time of the output vectors (MAX-strategy). The default configuration is MEAN.
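The three pooling strategies can be sketched on a small matrix of per-token output vectors. This is a minimal illustration in plain Python, assuming the first row is the CLS-token output and that there are no padding tokens (a real implementation would have to mask padding before averaging or taking the maximum). The function names are illustrative and not taken from any library.

```python
def cls_pooling(token_vectors):
    """Use the output vector of the first (CLS) token as the sentence embedding."""
    return list(token_vectors[0])


def mean_pooling(token_vectors):
    """Average all token output vectors dimension by dimension (MEAN-strategy)."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def max_pooling(token_vectors):
    """Take the maximum over tokens for each dimension (MAX-strategy)."""
    return [max(column) for column in zip(*token_vectors)]


# Three tokens with two-dimensional outputs; row 0 plays the role of CLS.
tokens = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
print(cls_pooling(tokens))   # [1.0, 4.0]
print(mean_pooling(tokens))  # [2.0, 2.0]
print(max_pooling(tokens))   # [3.0, 4.0]
```

Whatever the sentence length, each strategy returns a vector with the same dimension as a single token output, which is what makes the embeddings comparable.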

![picture2.png](attachment:picture2.png)

Figure 2: SBERT architecture at inference, for example, to compute similarity scores. This architecture is also used with the regression objective function.


In order to fine-tune BERT / RoBERTa, we create siamese and triplet networks (Schroff et al., 2015) to update the weights such that the produced sentence embeddings are semantically meaningful and can be compared with cosine-similarity.


The network structure depends on the available training data. We experiment with the following structures and objective functions.



Classification Objective Function. We concatenate the sentence embeddings $u$ and $v$ with the element-wise difference $|u - v|$ and multiply the result with the trainable weight $W_t \in \mathbb{R}^{3n \times k}$:

$$o = \mathrm{softmax}(W_t(u, v, |u - v|))$$


where n is the dimension of the sentence embeddings and k the number of labels. We optimize cross-entropy loss. This structure is depicted in Figure 1.
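As a rough illustration of the classification objective, the sketch below builds the concatenated feature vector (u, v, |u - v|), applies a weight matrix, and takes a softmax. It assumes the weight is stored as k rows of length 3n (the transpose of the 3n×k shape above) and uses arbitrary toy numbers; all names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_output(u, v, weight):
    """Compute softmax(W (u, v, |u - v|)) with weight given as k rows of length 3n."""
    features = list(u) + list(v) + [abs(a - b) for a, b in zip(u, v)]
    logits = [sum(w * f for w, f in zip(row, features)) for row in weight]
    return softmax(logits)


def cross_entropy(probs, label):
    """Cross-entropy loss for a single example with an integer gold label."""
    return -math.log(probs[label])


# Toy setup: embedding dimension n = 2, k = 3 labels, arbitrary weights.
u, v = [1.0, 0.0], [0.0, 1.0]
toy_weight = [
    [0.5, 0.0, 0.0, 0.5, 0.1, 0.1],
    [0.0, 0.5, 0.5, 0.0, 0.2, 0.2],
    [0.1, 0.1, 0.1, 0.1, 0.9, 0.9],
]
probs = classification_output(u, v, toy_weight)
loss = cross_entropy(probs, label=2)
```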


Regression Objective Function. The cosinesimilarity between the two sentence embeddings u and v is computed (Figure 2). We use meansquared-error loss as the objective function.

回归目标函数。计算两个句子嵌入u和v之间的余弦相似性（图2）。我们使用均方误差损失作为目标函数。
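A minimal sketch of the regression objective follows: the cosine-similarity of each embedding pair is compared with a gold score using mean squared error. It assumes the gold scores have already been scaled to the range of the cosine-similarity (for example, 0-5 labels divided by 5); that scaling is an assumption of this sketch and is not stated in the text above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def regression_loss(embedding_pairs, gold_scores):
    """Mean squared error between cosine similarities and gold scores."""
    errors = [
        (cosine_similarity(u, v) - gold) ** 2
        for (u, v), gold in zip(embedding_pairs, gold_scores)
    ]
    return sum(errors) / len(errors)


# Toy pairs: identical vectors (cosine 1.0) and orthogonal vectors (cosine 0.0).
toy_pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
print(regression_loss(toy_pairs, [1.0, 0.0]))  # 0.0
print(regression_loss(toy_pairs, [0.5, 0.0]))  # 0.125
```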

Triplet Objective Function. Given an anchor sentence a, a positive sentence p, and a negative sentence n, triplet loss tunes the network such that the distance between a and p is smaller than the distance between a and n. Mathematically, we minimize the following loss function:


$$\max(\lVert s_a - s_p \rVert - \lVert s_a - s_n \rVert + \epsilon, 0)$$

with $s_x$ the sentence embedding for a/n/p, $\lVert \cdot \rVert$ a distance metric, and $\epsilon$ a margin. The margin $\epsilon$ ensures that $s_p$ is at least $\epsilon$ closer to $s_a$ than $s_n$. As metric we use Euclidean distance and we set $\epsilon = 1$ in our experiments.
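The triplet objective can be written out for a single triplet as follows. This is a small sketch using Euclidean distance and a margin of 1, as stated above; the vectors are toy values and the function names are illustrative.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(||a - p|| - ||a - n|| + margin, 0) for one triplet."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


# Positive is close to the anchor and negative is far away: no loss.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))  # 0.0
# Roles swapped: the positive is farther away than the negative.
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))  # 5.0
```

The loss is zero only when the positive is at least the margin closer to the anchor than the negative, so well-separated triplets contribute no gradient.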

### 3.1 Training Details

We train SBERT on the combination of the SNLI (Bowman et al., 2015) and the Multi-Genre NLI (Williams et al., 2018) datasets. SNLI is a collection of 570,000 sentence pairs annotated with the labels contradiction, entailment, and neutral. MultiNLI contains 430,000 sentence pairs and covers a range of genres of spoken and written text. We fine-tune SBERT with a 3-way softmax-classifier objective function for one epoch. We used a batch-size of 16, Adam optimizer with learning rate 2e-5, and a linear learning rate warm-up over 10% of the training data. Our default pooling strategy is MEAN.
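The warm-up described above can be sketched as a function of the training step. The step count is derived from the rounded dataset sizes in the text (about 1,000,000 pairs at batch size 16). What happens after the warm-up is not specified here, so this sketch simply holds the learning rate constant afterwards; that choice, and the function name, are assumptions for illustration only.

```python
def warmup_learning_rate(step, total_steps, peak_lr=2e-5, warmup_fraction=0.1):
    """Linearly increase the learning rate to peak_lr over the warm-up steps.

    After warm-up the rate is held at peak_lr (an assumption of this sketch).
    """
    warmup_steps = max(1, int(round(total_steps * warmup_fraction)))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


# Roughly one epoch over SNLI + MultiNLI at batch size 16.
total_steps = (570_000 + 430_000) // 16
print(total_steps)  # 62500

early = warmup_learning_rate(0, total_steps)
later = warmup_learning_rate(3_000, total_steps)
after = warmup_learning_rate(10_000, total_steps)
```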

![picture3.png](attachment:picture3.png)

Table 1: Spearman rank correlation ρ between the cosine similarity of sentence representations and the gold labels for various Textual Similarity (STS) tasks. Performance is reported by convention as ρ × 100. STS12-STS16: SemEval 2012-2016, STSb: STSbenchmark, SICK-R: SICK relatedness dataset.


### 4 Evaluation - Semantic Textual Similarity

We evaluate the performance of SBERT for common Semantic Textual Similarity (STS) tasks. State-of-the-art methods often learn a (complex) regression function that maps sentence embeddings to a similarity score. However, these regression functions work pair-wise and due to the combinatorial explosion those are often not scalable if the collection of sentences reaches a certain size. Instead, we always use cosine-similarity to compare the similarity between two sentence embeddings. We ran our experiments also with negative Manhatten and negative Euclidean distances as similarity measures, but the results for all approaches remained roughly the same.

我们评估了SBERT在常见语义文本相似性（STS）任务中的性能。最先进的方法通常学习一个（复杂的）回归函数，该函数将句子嵌入到相似度得分。然而，这些重新生成函数是成对工作的，并且由于二进制的爆炸，如果句子集合达到一定的大小，这些函数通常是不可伸缩的。相反，我们总是用余弦相似度来比较两个句子嵌入词之间的相似度。我们也用负Manhatten距离和负欧几里德距离作为相似性度量，但所有应用研究的结果大致相同。

### 4.1 Unsupervised STS

We evaluate the performance of SBERT for STS without using any STS-specific training data. We use the STS tasks 2012 - 2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016), the STS benchmark (Cer et al., 2017), and the SICK-Relatedness dataset (Marelli et al., 2014). These datasets provide labels between 0 and 5 on the semantic relatedness of sentence pairs. We showed in (Reimers et al., 2016) that Pearson correlation is badly suited for STS. Instead, we compute the Spearman's rank correlation between the cosine-similarity of the sentence embeddings and the gold labels. The setup for the other sentence embedding methods is equivalent: the similarity is computed by cosine-similarity. The results are depicted in Table 1.
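To make the evaluation metric concrete, the sketch below computes Spearman's rank correlation as the Pearson correlation of the ranks, using average ranks for ties. The similarity scores and gold labels are made-up toy values, chosen so that the ordering agrees perfectly while the relationship is not linear.

```python
import math


def average_ranks(values):
    """1-based ranks of the values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy cosine similarities and gold labels (0-5) with identical ordering.
toy_cosine = [0.9, 0.2, 0.5, 0.7]
toy_gold = [4.8, 0.5, 2.0, 3.0]
rho = spearman(toy_cosine, toy_gold)
r = pearson(toy_cosine, toy_gold)
```

Here `rho` is 1 up to floating-point error because the rankings match, while `r` is below 1 because the scores are not linearly related, which shows how the two measures can diverge.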

The results show that directly using the output of BERT leads to rather poor performance. Averaging the BERT embeddings achieves an average correlation of only 54.81, and using the CLS-token output only achieves an average correlation of 29.19. Both are worse than computing average GloVe embeddings.

Using the described siamese network structure and fine-tuning mechanism substantially improves the correlation, outperforming both InferSent and Universal Sentence Encoder. The only dataset where SBERT performs worse than Universal Sentence Encoder is SICK-R. Universal Sentence Encoder was trained on various datasets, including news, question-answer pages and discussion forums, which appear to be more suitable to the data of SICK-R. In contrast, SBERT was pre-trained only on Wikipedia (via BERT) and on NLI data.

While RoBERTa was able to improve the performance for several supervised tasks, we only observe minor differences between SBERT and SRoBERTa for generating sentence embeddings.

### 4.2 Supervised STS

The STS benchmark (STSb) (Cer et al., 2017) is a popular dataset to evaluate supervised STS systems. The data includes 8,628 sentence pairs from the three categories captions, news, and forums. It is divided into train (5,749), dev (1,500) and test (1,379). BERT set a new state-of-the-art performance on this dataset by passing both sentences to the network and using a simple regression method for the output.

![picture4.png](attachment:picture4.png)

Table 2: Evaluation on the STS benchmark test set.

BERT systems were trained with 10 random seeds and 4 epochs. SBERT was fine-tuned on the STSb dataset, SBERT-NLI was pretrained on the NLI datasets, then fine-tuned on the STSb dataset.


We use the training set to fine-tune SBERT using the regression objective function. At prediction time, we compute the cosine-similarity between the sentence embeddings. All systems are trained with 10 random seeds to counter variances (Reimers and Gurevych, 2018).

The results are depicted in Table 2. We experimented with two setups: only training on STSb, and first training on NLI, then training on STSb. We observe that the latter strategy leads to a slight improvement of 1-2 points. This two-step approach had an especially large impact for the BERT cross-encoder, which improved the performance by 3-4 points. We do not observe a significant difference between BERT and RoBERTa.




### 4.3 Argument Facet Similarity

We evaluate SBERT on the Argument Facet Similarity (AFS) corpus by Misra et al. (2016). The AFS corpus annotated 6,000 sentential argument pairs from social media dialogs on three controversial topics: gun control, gay marriage, and death penalty. The data was annotated on a scale from 0 (“different topic”) to 5 (“completely equivalent”). The similarity notion in the AFS corpus is fairly different from the similarity notion in the STS datasets from SemEval. STS data is usually descriptive, while AFS data are argumentative excerpts from dialogs. To be considered similar, arguments must not only make similar claims, but also provide similar reasoning. Further, the lexical gap between the sentences in AFS is much larger. Hence, simple unsupervised methods as well as state-of-the-art STS systems perform badly on this dataset (Reimers et al., 2019).

We evaluate SBERT on this dataset in two scenarios: 1) As proposed by Misra et al., we evaluate SBERT using 10-fold cross-validation. A drawback of this evaluation setup is that it is not clear how well approaches generalize to different topics. Hence, 2) we evaluate SBERT in a cross-topic setup. Two topics serve for training and the approach is evaluated on the left-out topic. We repeat this for all three topics and average the results.
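The cross-topic setup amounts to a leave-one-topic-out loop over the three AFS topics. The sketch below only enumerates the splits and averages whatever score a caller-supplied function returns; `score_fn` is a hypothetical placeholder for training on two topics and evaluating on the third, not part of any described implementation.

```python
TOPICS = ["gun control", "gay marriage", "death penalty"]


def cross_topic_splits(topics):
    """Yield (training_topics, held_out_topic) for each topic in turn."""
    for held_out in topics:
        training = [topic for topic in topics if topic != held_out]
        yield training, held_out


def average_cross_topic_score(score_fn, topics):
    """Average score_fn(training_topics, held_out_topic) over all splits."""
    scores = [score_fn(training, held_out) for training, held_out in cross_topic_splits(topics)]
    return sum(scores) / len(scores)


for training, held_out in cross_topic_splits(TOPICS):
    print(f"train on {training}, evaluate on {held_out!r}")
```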

SBERT is fine-tuned using the Regression Objective Function. The similarity score is computed using cosine-similarity based on the sentence embeddings. We also provide the Pearson correlation r to make the results comparable to Misra et al. However, we showed (Reimers et al., 2016) that Pearson correlation has some serious drawbacks and should be avoided for comparing STS systems. The results are depicted in Table 3.

Unsupervised methods like tf-idf, average GloVe embeddings or InferSent perform rather badly on this dataset with low scores. Training SBERT in the 10-fold cross-validation setup gives a performance that is nearly on-par with BERT.

However, in the cross-topic evaluation, we observe a performance drop of SBERT by about 7 points Spearman correlation. To be considered similar, arguments should address the same claims and provide the same reasoning. BERT is able to use attention to directly compare both sentences (e.g. word-by-word comparison), while SBERT must map individual sentences from an unseen topic to a vector space such that arguments with similar claims and reasons are close. This is a much more challenging task, which appears to require more than just two topics for training to work on-par with BERT.


### 4.4 Wikipedia Sections Distinction

Dor et al. (2018) use Wikipedia to create a thematically fine-grained train, dev and test set for sentence embedding methods. Wikipedia articles are separated into distinct sections focusing on certain aspects. Dor et al. assume that sentences in the same section are thematically closer than sentences in different sections. They use this to create a large dataset of weakly labeled sentence triplets: the anchor and the positive example come from the same section, while the negative example comes from a different section of the same article. For example, from the Alice Arnold article: anchor: "Arnold joined the BBC Radio Drama Company in 1988."; positive: "Arnold gained media attention in May 2012."; negative: "Balding and Arnold are keen amateur golfers."

![picture5.png](attachment:picture5.png)

Table 3: Average Pearson correlation r and average Spearman's rank correlation ρ on the Argument Facet Similarity (AFS) corpus (Misra et al., 2016). Misra et al. propose 10-fold cross-validation. We additionally evaluate in a cross-topic scenario: methods are trained on two topics and are evaluated on the third topic.




We use the dataset from Dor et al. We use the Triplet Objective, train SBERT for one epoch on the approximately 1.8 million training triplets, and evaluate it on the 222,957 test triplets. Test triplets are from a distinct set of Wikipedia articles. As evaluation metric, we use accuracy: is the positive example closer to the anchor than the negative example?

Results are presented in Table 4. Dor et al. fine-tuned a BiLSTM architecture with triplet loss to derive sentence embeddings for this dataset. As the table shows, SBERT clearly outperforms the BiLSTM approach by Dor et al.
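The accuracy metric for the triplet evaluation can be sketched as the fraction of triplets in which the positive is closer to the anchor than the negative. The sketch assumes Euclidean distance for the comparison (the distance used at evaluation time is not stated in this section) and uses toy vectors rather than real embeddings.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if euclidean(anchor, positive) < euclidean(anchor, negative)
    )
    return correct / len(triplets)


# Two toy triplets: the first is ranked correctly, the second is not.
toy_triplets = [
    ([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]),
    ([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]),
]
print(triplet_accuracy(toy_triplets))  # 0.5
```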






### 5 Evaluation - SentEval

SentEval (Conneau and Kiela, 2018) is a popular toolkit to evaluate the quality of sentence embeddings. Sentence embeddings are used as features for a logistic regression classifier. The logistic regression classifier is trained on various tasks in a 10-fold cross-validation setup and the prediction accuracy is computed for the test-fold.


![picture6.png](attachment:picture6.png)

Table 4: Evaluation on the Wikipedia section triplets dataset (Dor et al., 2018). SBERT trained with triplet loss for one epoch.




The purpose of SBERT sentence embeddings is not to be used for transfer learning for other tasks. Here, we think fine-tuning BERT as described by Devlin et al. (2018) for new tasks is the more suitable method, as it updates all layers of the BERT network. However, SentEval can still give an impression of the quality of our sentence embeddings for various tasks.

We compare the SBERT sentence embeddings to other sentence embedding methods on the following seven SentEval transfer tasks:

• MR: Sentiment prediction for movie review snippets on a five-star scale (Pang and Lee, 2005).

• CR: Sentiment prediction of customer product reviews (Hu and Liu, 2004).

• SUBJ: Subjectivity prediction of sentences from movie reviews and plot summaries (Pang and Lee, 2004).

• MPQA: Phrase level opinion polarity classification from newswire (Wiebe et al., 2005).

• SST: Stanford Sentiment Treebank with binary labels (Socher et al., 2013).

• TREC: Fine grained question-type classification from TREC (Li and Roth, 2002).

• MRPC: Microsoft Research Paraphrase Corpus from parallel news sources (Dolan et al., 2004).


The results can be found in Table 5. SBERT is able to achieve the best performance in 5 out of 7 tasks. The average performance increases by about 2 percentage points compared to InferSent as well as the Universal Sentence Encoder. Even though transfer learning is not the purpose of SBERT, it outperforms other state-of-the-art sentence embedding methods on this task.

![picture7.png](attachment:picture7.png)

Table 5: Evaluation of SBERT sentence embeddings using the SentEval toolkit. SentEval evaluates sentence embeddings on different sentence classification tasks by training a logistic regression classifier using the sentence embeddings as features. Scores are based on a 10-fold cross-validation.




It appears that the sentence embeddings from SBERT capture sentiment information well: we observe large improvements for all sentiment tasks (MR, CR, and SST) from SentEval in comparison to InferSent and Universal Sentence Encoder.

The only dataset where SBERT is significantly worse than Universal Sentence Encoder is the TREC dataset. Universal Sentence Encoder was pre-trained on question-answering data, which appears to be beneficial for the question-type classification task of the TREC dataset.

Average BERT embeddings or using the CLS-token output from a BERT network achieved bad results for various STS tasks (Table 1), worse than average GloVe embeddings. However, for SentEval, average BERT embeddings and the BERT CLS-token output achieve decent results (Table 5), outperforming average GloVe embeddings. The reason for this is the different setups. For the STS tasks, we used cosine-similarity to estimate the similarities between sentence embeddings. Cosine-similarity treats all dimensions equally. In contrast, SentEval fits a logistic regression classifier to the sentence embeddings. This allows certain dimensions to have a higher or lower impact on the classification result.

We conclude that average BERT embeddings / CLS-token output from BERT return sentence embeddings that are unsuitable for use with cosine-similarity or with Manhattan / Euclidean distance. For transfer learning, they yield slightly worse results than InferSent or Universal Sentence Encoder. However, using the described fine-tuning setup with a siamese network structure on NLI datasets yields sentence embeddings that achieve a new state-of-the-art for the SentEval toolkit.




### 6 Ablation Study

We have demonstrated strong empirical results for the quality of SBERT sentence embeddings. In this section, we perform an ablation study of different aspects of SBERT in order to get a better understanding of their relative importance.

We evaluate different pooling strategies (MEAN, MAX, and CLS). For the classification objective function, we evaluate different concatenation methods. For each possible configuration, we train SBERT with 10 different random seeds and average the performances.
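The three pooling strategies can be sketched on a small matrix of token vectors. This is a minimal illustration, assuming the `[CLS]` vector sits at index 0 and that the input contains no padding tokens; a real implementation would mask padding before pooling. The function names and the sample values are illustrative only.

```python
def mean_pool(token_vectors):
    """MEAN: average each dimension over all token vectors."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def max_pool(token_vectors):
    """MAX: take the maximum over time for each dimension."""
    return [max(column) for column in zip(*token_vectors)]

def cls_pool(token_vectors):
    """CLS: use the output vector of the first ([CLS]) token."""
    return list(token_vectors[0])

# Hypothetical token outputs for one sentence (3 tokens, 2 dimensions).
tokens = [
    [0.5, -1.0],  # [CLS]
    [1.0, 2.0],
    [3.0, 0.0],
]

print("MEAN:", mean_pool(tokens))
print("MAX: ", max_pool(tokens))
print("CLS: ", cls_pool(tokens))
```

All three functions map a variable number of token vectors to one fixed-size sentence vector, which is what allows sentences of different lengths to be compared.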

The objective function (classification vs. regression) depends on the annotated dataset. For the classification objective function, we train SBERT-base on the SNLI and the Multi-NLI dataset. For the regression objective function, we train on the training set of the STS benchmark dataset. Performances are measured on the development split of the STS benchmark dataset. Results are shown in Table 6.


![picture8.png](attachment:picture8.png)

Table 6: SBERT trained on NLI data with the classification objective function and on the STS benchmark (STSb) with the regression objective function. Configurations are evaluated on the development set of the STSb using cosine-similarity and Spearman's rank correlation. For the concatenation methods, we only report scores with the MEAN pooling strategy.
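For readers unfamiliar with the metric, Spearman's rank correlation is the Pearson correlation of the ranks of two score lists. The sketch below is a plain-Python version with average ranks for ties; the predicted similarities and gold scores are invented for illustration and are not taken from the STS benchmark.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical cosine similarities and gold similarity labels.
predicted = [0.9, 0.2, 0.5, 0.7]
gold = [4.8, 1.0, 2.5, 3.0]
rho = spearman(predicted, gold)
print(f"Spearman rho = {rho:.3f}")
```

Because only the ordering matters, the predicted cosine similarities do not need to be on the same scale as the gold labels.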




When trained with the classification objective function on NLI data, the pooling strategy has a rather minor impact. The impact of the concatenation mode is much larger. InferSent (Conneau et al., 2017) and Universal Sentence Encoder (Cer et al., 2018) both use (u, v, |u − v|, u ∗ v) as input for a softmax classifier. However, in our architecture, adding the element-wise u ∗ v decreased the performance.

The most important component is the element-wise difference |u − v|. Note that the concatenation mode is only relevant for training the softmax classifier. At inference, when predicting similarities for the STS benchmark dataset, only the sentence embeddings u and v are used in combination with cosine-similarity. The element-wise difference measures the distance between the dimensions of the two sentence embeddings, ensuring that similar pairs are closer and dissimilar pairs are further apart.
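The concatenation modes discussed above can be sketched as a small feature builder. This is an illustration only: the function name, its flags, and the example vectors are assumptions, and the softmax classifier that would consume these features during training is omitted.

```python
import math

def concat_features(u, v, use_diff=True, use_product=False):
    """Build the classifier input from two sentence embeddings.

    (u, v)               -> use_diff=False, use_product=False
    (u, v, |u - v|)      -> use_diff=True,  use_product=False
    (u, v, |u - v|, u*v) -> use_diff=True,  use_product=True
    """
    features = list(u) + list(v)
    if use_diff:
        features += [abs(a - b) for a, b in zip(u, v)]
    if use_product:
        features += [a * b for a, b in zip(u, v)]
    return features

def cosine(u, v):
    """Cosine similarity, the only operation needed at inference time."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-dimensional sentence embeddings.
u = [1.0, 2.0]
v = [0.5, 4.0]

train_input = concat_features(u, v)                    # length 3n
full_input = concat_features(u, v, use_product=True)   # length 4n
similarity = cosine(u, v)                              # used for STS prediction

print(train_input)
print(full_input)
print(round(similarity, 4))
```

The concatenated vector exists only during training; at inference the classifier is discarded and the two embeddings are compared directly.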

When trained with the regression objective function, we observe that the pooling strategy has a large impact. There, the MAX strategy performs significantly worse than the MEAN or CLS-token strategy. This is in contrast to Conneau et al. (2017), who found it beneficial for the BiLSTM-layer of InferSent to use MAX instead of MEAN pooling.


### 7 Computational Efficiency

Sentence embeddings may need to be computed for millions of sentences; hence, a high computation speed is desired. In this section, we compare SBERT to average GloVe embeddings, InferSent (Conneau et al., 2017), and Universal Sentence Encoder (Cer et al., 2018).

For our comparison, we use the sentences from the STS benchmark (Cer et al., 2017). We compute average GloVe embeddings using a simple for-loop with Python dictionary lookups and NumPy. InferSent is based on PyTorch. For Universal Sentence Encoder, we use the TensorFlow Hub version, which is based on TensorFlow. SBERT is based on PyTorch. For improved computation of sentence embeddings, we implemented a smart batching strategy: sentences with similar lengths are grouped together and are only padded to the longest element in a mini-batch. This drastically reduces computational overhead from padding tokens.[4]
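The smart batching idea can be sketched in a few lines. This is not the authors' implementation: the function names, the `[PAD]` placeholder, and the comparison baseline (batching in the original order) are assumptions chosen to make the padding savings visible.

```python
PAD = "[PAD]"

def make_batches(sentences, batch_size, sort_by_length):
    """Split tokenized sentences into mini-batches, padding each batch
    only up to the longest sentence it contains."""
    ordered = sorted(sentences, key=len) if sort_by_length else list(sentences)
    batches = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        longest = max(len(sentence) for sentence in batch)
        batches.append([s + [PAD] * (longest - len(s)) for s in batch])
    return batches

def count_padding(batches):
    """Count the padding tokens that the model would have to process."""
    return sum(token == PAD for batch in batches for s in batch for token in s)

# Hypothetical tokenized sentences with lengths 2, 9, 3, and 8.
sentences = [["w"] * n for n in (2, 9, 3, 8)]

plain = count_padding(make_batches(sentences, 2, sort_by_length=False))
smart = count_padding(make_batches(sentences, 2, sort_by_length=True))
print(f"padding tokens without sorting: {plain}, with sorting: {smart}")
```

Sorting changes the order in which sentences are encoded, so a complete implementation would also record the original indices and restore that order once the embeddings are computed.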

Performances were measured on a server with an Intel i7-5820K CPU @ 3.30GHz, an Nvidia Tesla V100 GPU, CUDA 9.2, and cuDNN. The results are depicted in Table 7.


![picture9.png](attachment:picture9.png)

Table 7: Computation speed (sentences per second) of sentence embedding methods. Higher is better.




On CPU, InferSent is about 65% faster than SBERT. This is due to the much simpler network architecture. InferSent uses a single BiLSTM layer, while BERT uses 12 stacked transformer layers. However, an advantage of transformer networks is the computational efficiency on GPUs. There, SBERT with smart batching is about 9% faster than InferSent and about 55% faster than Universal Sentence Encoder. Smart batching achieves a speed-up of 89% on CPU and 48% on GPU. Averaging GloVe embeddings is, by a large margin, the fastest method to compute sentence embeddings.






### 8 Conclusion

We showed that BERT out-of-the-box maps sentences to a vector space that is rather unsuitable to be used with common similarity measures like cosine-similarity. The performance for seven STS tasks was below the performance of average GloVe embeddings.

To overcome this shortcoming, we presented Sentence-BERT (SBERT). SBERT fine-tunes BERT in a siamese / triplet network architecture. We evaluated the quality on various common benchmarks, where it could achieve a significant improvement over state-of-the-art sentence embeddings methods. Replacing BERT with RoBERTa did not yield a significant improvement in our experiments.

SBERT is computationally efficient. On a GPU, it is about 9% faster than InferSent and about 55% faster than Universal Sentence Encoder. SBERT can be used for tasks that are computationally infeasible to model with BERT. For example, clustering 10,000 sentences with hierarchical clustering requires about 65 hours with BERT, as around 50 million sentence combinations must be computed. With SBERT, we were able to reduce the effort to about 5 seconds.
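The figure of around 50 million combinations follows from counting unordered sentence pairs, n(n − 1)/2. The short calculation below reproduces it and, as a back-of-the-envelope estimate rather than a measured value, derives the per-pair cost implied by the quoted 65 hours.

```python
n_sentences = 10_000

# Every unordered pair of sentences must pass through BERT once.
n_pairs = n_sentences * (n_sentences - 1) // 2

# Implied average cost per pair if all pairs take 65 hours in total
# (derived from the quoted figures, not a measurement).
total_seconds = 65 * 3600
ms_per_pair = total_seconds / n_pairs * 1000

print(f"pairs: {n_pairs:,}")
print(f"implied cost per pair: {ms_per_pair:.2f} ms")
```

With sentence embeddings, by contrast, each of the 10,000 sentences is encoded only once, and the pairwise comparisons reduce to inexpensive cosine-similarity computations.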

### Acknowledgments

This work has been supported by the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1 and grant GU 798/17-1). It has been co-funded by the German Federal Ministry of Education and Research (BMBF) under the promotional references 03VP02540 (ArgumenText).




