# Sentence Transformers: Multilingual Sentence Embeddings using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch
https://github.com/UKPLab/sentence-transformers


Out of the box, BERT / RoBERTa / XLM-RoBERTa produce rather poor sentence embeddings.

This repository fine-tunes BERT / RoBERTa / DistilBERT / ALBERT / XLNet with __a siamese or triplet network structure__ to produce semantically meaningful sentence embeddings that can be used in unsupervised scenarios: semantic textual similarity via cosine similarity, clustering, and semantic search.

We provide an increasing number of state-of-the-art pretrained models that can be used to derive sentence embeddings. See Pretrained Models. Details of the implemented approaches can be found in our publication: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (EMNLP 2019).

You can use this code to easily train your own sentence embeddings that are tuned for your specific task. We provide various dataset readers, and you can tune sentence embeddings with different loss functions, depending on the structure of your dataset. For further details, see Train your own Sentence Embeddings.



## Install

In [2]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading https://files.pythonhosted.org/packages/ee/71/acfb3f1016f83d90590130dc2ee0d8cd36b005aa7afa45b465837b711070/sentence-transformers-0.3.3.tar.gz (65kB)
     |████████████████████████████████| 71kB 336kB/s eta 0:00:01
Collecting tokenizers==0.8.1.rc1 (from transformers>=3.0.2->sentence-transformers)
  Using cached https://files.pythonhosted.org/packages/a3/c8/b07f4346b36ca83988a4a59c081156ec2c96aad5b4c448c75deea4f53356/tokenizers-0.8.1rc1-cp37-cp37m-macosx_10_10_x86_64.whl
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Stored in directory: /Users/luoyonggui/Library/Caches/pip/wheels/75/d6/0a/cab163b21d0597cc1580bc344487b11ad405e0d1d314725f2b
Successfully built sentence-transformers
Installing collected packages: sentence-transformers, tokenizers
  Found existing installation: tokenizers 0.8.1
    Uninstalling tokenizers-0.8.1:
      Successfull

In [3]:
!pip install tokenizers==0.8.1

Collecting tokenizers==0.8.1
  Using cached https://files.pythonhosted.org/packages/2b/3e/7cf9b5daa88371c96d9b63d31917e30ba93b1d89421aef79c00e806bc54d/tokenizers-0.8.1-cp37-cp37m-macosx_10_11_x86_64.whl
ERROR: transformers 3.0.2 has requirement tokenizers==0.8.1.rc1, but you'll have tokenizers 0.8.1 which is incompatible.
Installing collected packages: tokenizers
  Found existing installation: tokenizers 0.8.1rc1
    Uninstalling tokenizers-0.8.1rc1:
      Successfully uninstalled tokenizers-0.8.1rc1
Successfully installed tokenizers-0.8.1


# Paper: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

In [1]:
from IPython.display import IFrame
IFrame('https://arxiv.org/pdf/1908.10084', width=1200, height=550)

## Abstract
BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) has set a new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS). However, it requires that both sentences are fed into the network, which causes a massive computational overhead: Finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT.

The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering.

In this publication, we present Sentence-BERT (SBERT), a modification of the pretrained BERT network that use siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT.

We evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where it outperforms other state-of-the-art sentence embeddings methods.
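
The figure of about 50 million inference computations quoted in the abstract is consistent with counting the unordered sentence pairs, n(n-1)/2, in a collection of 10,000 sentences. A quick check follows; the helper name `count_pairs` is ours and is used only for illustration.

```python
def count_pairs(n: int) -> int:
    """Number of unordered pairs among n sentences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


n_sentences = 10_000
pairs = count_pairs(n_sentences)
# Each pair needs its own forward pass when both sentences are fed jointly.
print(f"{pairs:,} pairwise passes for {n_sentences:,} sentences")
```

This gives 49,995,000 pairs. A sentence-embedding approach instead needs only one encoding pass per sentence (10,000 here), followed by inexpensive vector comparisons.
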

## Model Structure

# Getting Started: Sentence Embeddings with a Pretrained Model

In [4]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')

I0821 14:41:04.881035 140736034558848 file_utils.py:39] PyTorch version 1.6.0 available.
I0821 14:41:09.233434 140736034558848 file_utils.py:55] TensorFlow version 2.2.0 available.
I0821 14:41:10.136711 140736034558848 SentenceTransformer.py:31] Load pretrained SentenceTransformer: bert-base-nli-mean-tokens
I0821 14:41:10.137421 140736034558848 SentenceTransformer.py:34] Did not find a '/' or '\' in the name. Assume to download model from server.
I0821 14:41:10.139036 140736034558848 SentenceTransformer.py:55] Downloading sentence transformer model from https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/bert-base-nli-mean-tokens.zip and saving it at /Users/luoyonggui/.cache/torch/sentence_transformers/public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_bert-base-nli-mean-tokens.zip
100%|██████████| 405M/405M [11:46<00:00, 573kB/s]    
I0821 14:53:03.760298 140736034558848 SentenceTransformer.py:69] Load SentenceTransformer from folder:

In [5]:
# Then provide some sentences to the model.

sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.', 
    'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

I0821 15:22:29.079473 140736034558848 SentenceTransformer.py:138] Start tokenization 3 sentences






In [8]:
# And that's it already. We now have a list of numpy arrays with the embeddings.
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    
    print("Embedding:", embedding.size)
    print("")

Sentence: This framework generates embeddings for each input sentence
Embedding: 768

Sentence: Sentences are passed as a list of string.
Embedding: 768

Sentence: The quick brown fox jumps over the lazy dog.
Embedding: 768
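
Embeddings like these are typically compared with cosine similarity, as mentioned in the introduction. The following is a minimal, library-independent sketch. `cosine_similarity` is a hypothetical helper written here for illustration, and the short toy vectors stand in for real 768-dimensional embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for real sentence embeddings (assumption).
toy_a = [1.0, 2.0, 3.0]
toy_b = [2.0, 4.0, 6.0]   # same direction as toy_a, similarity close to 1
toy_c = [3.0, 0.0, -1.0]  # orthogonal to toy_a, similarity 0

print(cosine_similarity(toy_a, toy_b))
print(cosine_similarity(toy_a, toy_c))
```

Under the assumption that each entry of `sentence_embeddings` is a one-dimensional numeric array, a call such as `cosine_similarity(sentence_embeddings[0], sentence_embeddings[1])` should work in the same way.
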



# Training

This framework allows you to fine-tune your own sentence embedding methods, so that you get task-specific sentence embeddings. You can choose from various options to obtain sentence embeddings that are well suited to your specific task.
## Dataset Download

First, download some datasets by running the script examples/datasets/get_data.py.

## Model Training from Scratch

training_nli.py fine-tunes BERT (and other transformer models) from the pre-trained model as provided by Google & Co. It tunes the model on Natural Language Inference (NLI) data. 

Given two sentences, the model should classify whether the two sentences entail each other, contradict each other, or are neutral to each other.

For this, the two sentences are passed to a transformer model to generate fixed-sized sentence embeddings. These sentence embeddings are then passed to a softmax classifier to derive the final label (entail, contradict, neutral). 

This generates sentence embeddings that are also useful for other tasks like clustering or semantic textual similarity.

First, we define a sequential model that maps a sentence to a fixed-size sentence embedding.
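
The name of the pretrained model used earlier, bert-base-nli-mean-tokens, suggests mean pooling over token vectors, which is one common way to turn a variable-length sentence into a fixed-size vector. The sketch below illustrates only that pooling step in plain Python. The function `mean_pool`, the toy token vectors, and the mask are assumptions made for illustration and are not the library's implementation.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("one mask entry is required per token vector")
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("at least one unmasked token is required")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three real tokens plus one padding position, each with 2 dimensions.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [3.0, 4.0]
```

Whatever the number of tokens, the pooled vector has the same dimensionality as a single token vector, which is what makes the resulting embeddings directly comparable.
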

## Loss Functions

We have implemented various loss functions that allow training of sentence embeddings from various datasets. These loss functions are in the package sentence_transformers.losses.

### SoftmaxLoss:   
Given the sentence embeddings of two sentences, trains a softmax-classifier. Useful for training on datasets like NLI.
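
As a rough illustration of this objective: the Sentence-BERT paper describes concatenating the two sentence embeddings u and v with their element-wise difference |u - v| and passing the result to a softmax classifier. The plain-Python sketch below follows that description. The weight matrix, the toy embeddings, and all function names are made up for illustration and do not reproduce the code in sentence_transformers.losses.

```python
import math
from typing import List, Sequence


def pair_features(u: Sequence[float], v: Sequence[float]) -> List[float]:
    """Concatenate u, v and the element-wise absolute difference |u - v|."""
    return list(u) + list(v) + [abs(a - b) for a, b in zip(u, v)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(u, v, weights, gold_label: int) -> float:
    """Cross-entropy of a linear softmax classifier on the pair features."""
    features = pair_features(u, v)
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    return -math.log(softmax(logits)[gold_label])


# Toy 2-dimensional embeddings and a 3 x 6 weight matrix, one row per label
# (entail, contradict, neutral). All values are illustrative assumptions.
u, v = [1.0, 0.0], [0.0, 1.0]
weights = [[0.0] * 6, [0.0] * 6, [0.0] * 6]
print(softmax_loss(u, v, weights, gold_label=0))  # log(3) for all-zero weights
```

With untrained, all-zero weights the classifier is uniform over the three labels, so the loss equals log(3); training would lower it by adjusting both the classifier weights and the sentence embeddings.
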
### CosineSimilarityLoss:   
Given a sentence pair and a gold similarity score (either between -1 and 1 or between 0 and 1), computes the cosine similarity between the sentence embeddings and minimizes the mean squared error loss.
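
The computation described here can be sketched in a few lines of plain Python. The helper names and the toy batch are hypothetical, and the sketch does not reproduce the library's implementation.

```python
import math
from typing import Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_mse_loss(pairs: Sequence[Tuple[Sequence[float], Sequence[float]]],
                    gold_scores: Sequence[float]) -> float:
    """Mean squared error between cosine similarities and gold scores."""
    errors = [(cosine(u, v) - gold) ** 2
              for (u, v), gold in zip(pairs, gold_scores)]
    return sum(errors) / len(errors)


# Two toy embedding pairs with gold similarity scores (illustrative values).
batch = [([1.0, 0.0], [1.0, 0.0]),   # identical direction, cosine = 1
         ([1.0, 0.0], [0.0, 1.0])]   # orthogonal, cosine = 0
gold = [1.0, 0.5]
print(cosine_mse_loss(batch, gold))  # (0^2 + 0.5^2) / 2 = 0.125
```
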
### TripletLoss:   
Given a triplet (anchor, positive example, negative example), minimizes the triplet loss.
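
A common formulation of the triplet loss is max(d(a, p) - d(a, n) + margin, 0), where d is a distance function. The sketch below assumes Euclidean distance and a margin of 1.0; both choices, as well as the function names, are assumptions for illustration rather than the library's defaults.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin: float = 1.0) -> float:
    """max(d(a, p) - d(a, n) + margin, 0) with Euclidean distance d."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(gap + margin, 0.0)


anchor, positive = [0.0, 0.0], [3.0, 4.0]   # distance 5
near_negative = [0.0, 1.0]                  # distance 1 from the anchor
far_negative = [0.0, 10.0]                  # distance 10 from the anchor
print(triplet_loss(anchor, positive, near_negative))  # 5 - 1 + 1 = 5.0
print(triplet_loss(anchor, positive, far_negative))   # 5 - 10 + 1 < 0, so 0.0
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only triplets that violate this condition contribute to training.
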
### BatchHardTripletLoss:   
Implements the batch hard triplet loss from the paper In Defense of the Triplet Loss for Person Re-Identification. Each batch must contain multiple examples from the same class. The loss then optimizes the distance between the most distant positive pair and the closest negative pair.
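
The batch hard selection can be sketched as follows: for every anchor in the batch, take the farthest example with the same label and the closest example with a different label, then apply the triplet loss to that pair of distances. The sketch assumes Euclidean distance, a margin of 1.0, and averaging over anchors; these choices and the function names are illustrative assumptions, not the library's code.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def batch_hard_triplet_loss(embeddings: Sequence[Sequence[float]],
                            labels: Sequence[int],
                            margin: float = 1.0) -> float:
    """Mean over anchors of max(hardest_pos - hardest_neg + margin, 0)."""
    losses = []
    indices = range(len(embeddings))
    for i in indices:
        pos = [euclidean(embeddings[i], embeddings[j])
               for j in indices if j != i and labels[j] == labels[i]]
        neg = [euclidean(embeddings[i], embeddings[j])
               for j in indices if labels[j] != labels[i]]
        if not pos or not neg:
            continue  # no valid positive or negative for this anchor
        # Hardest positive is the farthest one, hardest negative the closest.
        losses.append(max(max(pos) - min(neg) + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0


# One-dimensional toy embeddings: two examples per class (illustrative).
toy_embeddings = [[0.0], [1.0], [3.0], [5.0]]
toy_labels = [0, 0, 1, 1]
print(batch_hard_triplet_loss(toy_embeddings, toy_labels))  # 0.25
```

In the toy batch only the anchor at 3.0 violates the margin (hardest positive at distance 2, hardest negative at distance 2), which yields a loss of 1.0 for that anchor and a mean of 0.25 over the four anchors.
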
### MultipleNegativesRankingLoss:   
Each batch consists of positive pairs (a_i, b_i): for each a_i there is exactly one positive example b_i, and all other b_j in the batch are treated as negative examples. The loss was used in the papers Efficient Natural Language Response Suggestion for Smart Reply and Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model.
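
One way to read this loss is as a cross-entropy over in-batch candidates: each anchor is scored against every candidate in the batch, and the candidate at the same index is the correct answer. The sketch below assumes dot-product scoring and plain Python lists; the function names and toy vectors are hypothetical and only meant to illustrate the idea.

```python
import math
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def multiple_negatives_ranking_loss(anchors, positives) -> float:
    """Mean cross-entropy where positives[i] is the match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Score the anchor against every candidate in the batch.
        scores = [dot(anchor, candidate) for candidate in positives]
        peak = max(scores)
        log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scores))
        losses.append(log_sum_exp - scores[i])  # -log softmax(scores)[i]
    return sum(losses) / len(losses)


# Toy batch of two pairs (illustrative values).
toy_anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]     # each anchor matches its own positive
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # positives swapped

print(multiple_negatives_ranking_loss(toy_anchors, aligned))
print(multiple_negatives_ranking_loss(toy_anchors, misaligned))
```

The aligned batch yields the lower loss, because each anchor scores highest against its own positive. Larger batches provide more in-batch negatives per anchor without requiring any explicitly labeled negative examples.
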

