# SentenceTransformers

SentenceTransformers is a Python framework for state-of-the-art sentence, text, and image embeddings.

<https://www.sbert.net/>

## Installation

In [None]:
!pip install -U sentence-transformers

In [6]:
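import os
# Optional: route HTTPS traffic (including the model download) through a local proxy.
# Adjust the address for your network, or skip this cell if you do not need a proxy.
os.environ['HTTPS_PROXY'] = "127.0.0.1:7890"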

## Usage

`SentenceTransformer('all-MiniLM-L6-v2')` takes the name of the sentence-transformer model to load.

all-MiniLM-L6-v2 is a MiniLM model fine-tuned on a large dataset of more than 1 billion training pairs.

In [7]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentences we want to encode
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']

#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

#Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: This framework generates embeddings for each input sentence
Embedding: [-1.37173580e-02 -4.28514741e-02 -1.56286433e-02  1.40537620e-02
  3.95538062e-02  1.21796235e-01  2.94334069e-02 -3.17524038e-02
  3.54959220e-02 -7.93139860e-02  1.75878499e-02 -4.04369868e-02
  4.97259349e-02  2.54912414e-02 -7.18700811e-02  8.14968795e-02
  1.47072005e-03  4.79627512e-02 -4.50336263e-02 -9.92175043e-02
 -2.81769522e-02  6.45046532e-02  4.44670655e-02 -4.76217121e-02
 -3.52952443e-02  4.38671559e-02 -5.28566055e-02  4.33053763e-04
  1.01921499e-01  1.64072476e-02  3.26996371e-02 -3.45986560e-02
  1.21339587e-02  7.94871151e-02  4.58345981e-03  1.57778040e-02
 -9.68208257e-03  2.87625939e-02 -5.05806319e-02 -1.55793680e-02
 -2.87906714e-02 -9.62281693e-03  3.15556787e-02  2.27348860e-02
  8.71449038e-02 -3.85027379e-02 -8.84718224e-02 -8.75499845e-03
 -2.12343037e-02  2.08923351e-02 -9.02077779e-02 -5.25732152e-02
 -1.05639109e-02  2.88310386e-02 -1.61454994e-02  6.17839070e-03
 -1.23234
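
Printing every embedding in full produces very long output, and the listing above is cut off. As a lighter-weight check, here is a minimal sketch that summarizes an embedding array instead of printing it. It assumes `embeddings` is a 2-D NumPy array with one row per sentence, which is what `model.encode(sentences)` returns by default. The helper name `summarize_embeddings` and the stand-in array are hypothetical and used only for illustration.

```python
import numpy as np

def summarize_embeddings(embeddings: np.ndarray) -> dict:
    """Report row count, dimensionality, and per-row L2 norm of a 2-D embedding array."""
    embeddings = np.asarray(embeddings, dtype=float)
    n_sentences, dim = embeddings.shape
    norms = np.linalg.norm(embeddings, axis=1)
    return {"n_sentences": n_sentences, "dim": dim, "norms": norms.round(4).tolist()}

# Stand-in array with 3 rows of 4 values (made-up numbers, not model output).
fake = np.array([[3.0, 4.0, 0.0, 0.0],
                 [1.0, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 2.0]])
print(summarize_embeddings(fake))
```

Calling `summarize_embeddings(embeddings)` on the real output should report one row per input sentence. For all-MiniLM-L6-v2 the model card lists 384 dimensions.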

## Comparing Sentence Similarities

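Sentences (texts) are mapped so that sentences with similar meanings lie close together in the vector space. A common way to measure similarity in that space is cosine similarity.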
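
To make the measure concrete: cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite directions) to 1 (same direction). The sketch below computes it with plain NumPy. The function name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of the library.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two 1-D vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D vectors standing in for real sentence embeddings (illustrative values only).
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])     # same direction, different length
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # perpendicular vectors
print(parallel, orthogonal)
```

Because the result depends only on direction, scaling an embedding does not change its similarity to other embeddings.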

In [14]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

#Sentences are encoded by calling model.encode()
emb1 = model.encode("This is a red cat with a hat.")
emb2 = model.encode("Have you seen my red cat?")

cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)

Cosine-Similarity: tensor([[0.6153]])


If you have a list with more sentences, you can use the following code example:

In [18]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ['A man is eating food.',
             'A man is eating a piece of bread.',
             'The girl is carrying a baby.',
             'A man is riding a horse.',
             'A woman is playing violin.',
             'Two men pushed carts through the woods.',
             'A man is riding a white horse on an enclosed ground.',
             'A monkey is playing drums.',
             'Someone in a gorilla costume is playing a set of drums.'
             ]

#Encode all sentences
embeddings = model.encode(sentences)

#Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)

#Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim)-1):
    for j in range(i+1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

#Sort list by the highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], cos_sim[i][j]))

Top-5 most similar pairs:
A man is eating food. 	 A man is eating a piece of bread. 	 0.7553
A man is riding a horse. 	 A man is riding a white horse on an enclosed ground. 	 0.7369
A monkey is playing drums. 	 Someone in a gorilla costume is playing a set of drums. 	 0.6433
A woman is playing violin. 	 Someone in a gorilla costume is playing a set of drums. 	 0.2564
A man is eating food. 	 A man is riding a horse. 	 0.2474
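
The nested loop above visits each unordered pair once and then sorts the pairs by score. As an alternative sketch, the same selection can be done on a NumPy similarity matrix using the indices of the strict upper triangle. The helper `top_k_pairs` and the toy matrix are hypothetical, and the sketch assumes the similarity matrix is square and symmetric.

```python
import numpy as np

def top_k_pairs(sim, k: int = 5):
    """Return the k highest-scoring (score, i, j) triples with i < j."""
    sim = np.asarray(sim, dtype=float)
    # Strict upper triangle: every unordered pair exactly once, no self-pairs.
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    scores = sim[rows, cols]
    # Sort by descending score and keep the first k entries.
    order = np.argsort(-scores)[:k]
    return [(float(scores[n]), int(rows[n]), int(cols[n])) for n in order]

# Toy 3x3 similarity matrix (made-up values, not model output).
toy = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.3],
                [0.1, 0.3, 1.0]])
for score, i, j in top_k_pairs(toy, k=2):
    print(i, j, round(score, 2))
```

With the notebook's variables, something like `top_k_pairs(cos_sim.numpy(), k=5)` should select the same pairs, assuming `cos_sim` is a CPU tensor.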
