# **[Hugging Face: Sentence Similarity](https://huggingface.co/tasks/sentence-similarity)**



## **Sentence Similarity**

Sentence Similarity is the task of determining how similar two texts are. Sentence similarity models convert input texts into vectors (embeddings) that capture semantic information, then measure how close (similar) those vectors are to each other. This task is particularly useful for information retrieval and clustering/grouping.
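
As a minimal sketch of the "how close" step: cosine similarity is one common way to compare two embeddings, and it is the measure used in the examples later in this notebook. The three-dimensional vectors below are made up for illustration and do not come from any model; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not produced by any model)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))  # close to 1: vectors point the same way
print(cosine_similarity(cat, car))     # close to 0: vectors are nearly orthogonal
```

The score depends only on the direction of the vectors, not on their length, which is why it is a popular choice for comparing embeddings.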

### **Use Cases**

**Information Retrieval**

You can extract information from documents using Sentence Similarity models. The first step is to rank documents using Passage Ranking models. You can then take the top-ranked document and search it with a Sentence Similarity model, selecting the sentence that is most similar to the input query.
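
The second step, picking the sentence closest to the query, could look roughly like the sketch below. It assumes the sentences of the top-ranked document and the query have already been embedded. The helper name `most_similar_sentence`, the sentences, and the vectors are all hypothetical stand-ins for real model output.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def most_similar_sentence(query_vec, sentence_vecs):
    """Return (index, score) of the sentence vector closest to the query."""
    scores = [cos_sim(query_vec, vec) for vec in sentence_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Sentences from a hypothetical top-ranked document, with made-up embeddings
sentences = [
    "The library was founded in 1901.",
    "It opens at 9 am on weekdays.",
    "Parking is available nearby.",
]
sentence_vecs = [[0.1, 0.9, 0.2], [0.9, 0.2, 0.1], [0.2, 0.1, 0.9]]

# Made-up embedding for the query "When does the library open?"
query_vec = [0.8, 0.3, 0.0]

index, score = most_similar_sentence(query_vec, sentence_vecs)
print(sentences[index], score)
```

With a real model, the vectors would come from encoding the sentences and the query with the same model, so that they share one embedding space.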

### **The Sentence Transformers library**


The [Sentence Transformers](https://www.sbert.net/) library calculates embeddings of sentences, paragraphs, and entire documents. An embedding is a vector representation of a text, which makes it useful for measuring how similar two texts are.

You can find and use [hundreds of Sentence Transformers](https://huggingface.co/models?library=sentence-transformers&sort=downloads) models from the Hub by directly using the library, playing with the widgets in the browser or using the Inference API.

### **Example**

In [1]:
!pip install -U sentence-transformers --quiet



In [2]:
# Download the model files (optional: the next cell loads the model from the Hub by its ID)
!git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
#!git clone https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2

Cloning into 'all-MiniLM-L6-v2'...
remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 46 (delta 10), reused 21 (delta 10), pack-reused 25
Unpacking objects: 100% (46/46), 314.94 KiB | 1.92 MiB/s, done.
Filtering content: 100% (3/3), 260.15 MiB | 44.32 MiB/s, done.


In [12]:
from sentence_transformers import SentenceTransformer, util

model_name = 'sentence-transformers/all-MiniLM-L6-v2' # EN (384 dimensions)
#model_name = 'distiluse-base-multilingual-cased-v2' # multilingual (512 dimensions) (Pooling + Dense)
model = SentenceTransformer(model_name)

sentences = ["what is happiness?", "Happiness is a state of the spirit."]
embeddings = model.encode(sentences)

similarity = util.pytorch_cos_sim(embeddings[0], embeddings[1]).numpy()[0][0]
print(similarity)

0.7928971


### **Additional Resources**
- [SBert | Sentence Transformers Documentation](https://www.sbert.net/)
- [Hugging Face | Getting Started With Embeddings](https://huggingface.co/blog/getting-started-with-embeddings)
- [Hugging Face | Sentence Transformers in the Hugging Face Hub](https://huggingface.co/blog/sentence-transformers-in-the-hub)
- [Hugging Face | Using Sentence Transformers at Hugging Face](https://huggingface.co/docs/hub/sentence-transformers)
- [Hugging Face | Models for Sentence Similarity](https://huggingface.co/models?library=sentence-transformers&sort=downloads)
- [Hugging Face | Models for Sentence Similarity in Portuguese](https://huggingface.co/models?library=sentence-transformers&language=pt&sort=downloads)

### **Example in Portuguese**

In [8]:
# Download the model files
!git clone https://huggingface.co/rufimelo/bert-large-portuguese-cased-sts
#!git clone https://huggingface.co/ricardoz/BERTugues-base-portuguese-cased
#!git clone https://huggingface.co/PORTULAN/albertina-ptpt

Cloning into 'bert-large-portuguese-cased-sts'...
remote: Enumerating objects: 34, done.
remote: Total 34 (delta 0), reused 0 (delta 0), pack-reused 34
Unpacking objects: 100% (34/34), 289.16 KiB | 2.05 MiB/s, done.


In [10]:
from sentence_transformers import SentenceTransformer, util

model_name = 'rufimelo/bert-large-portuguese-cased-sts' # PT-BR (1024 dimensions)
#model_name = 'ricardoz/BERTugues-base-portuguese-cased' # PT-BR (768 dimensions)
#model_name = 'PORTULAN/albertina-ptpt' # PT-PT (1536 dimensions)
model = SentenceTransformer(model_name)

sentences = ["Tinha uma pedra no meio do caminho.", "Mas essa pedra não era o Português."]
sentence_embeddings = model.encode(sentences)

print("Sentence embeddings:")
print(sentence_embeddings)

sentence_similarity = util.pytorch_cos_sim(sentence_embeddings[0], sentence_embeddings[1]).numpy()[0][0]

print("Sentence similarity:")
print(sentence_similarity)

Sentence embeddings:
[[ 1.7217176  -0.48384118  0.06242486 ...  0.49255776 -1.1902417
   0.9546114 ]
 [ 0.05113917 -0.9153737  -0.1231728  ...  0.7539909  -0.73372877
   1.6433395 ]]
Sentence similarity:
0.4097103


## **Using Hugging Face Transformers**

In [None]:
!pip install -U transformers --quiet
!pip install -U torch --quiet

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
model_name = 'rufimelo/bert-large-portuguese-cased-sts'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Sentence embeddings:
tensor([[ 0.7264, -1.6596, -0.7420,  ...,  0.0246, -1.2221,  0.5228],
        [ 0.0238, -0.9716, -0.1902,  ..., -0.3143, -0.1059,  0.1850]])
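
To make the role of the attention mask concrete, here is a toy version of the same masked average in plain Python. The two-dimensional token vectors are made up, and `masked_mean` is a hypothetical helper for illustration, not part of any library. Padded positions (mask 0) are left out of both the sum and the count.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1; ignore padding (mask 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    for vec, keep in zip(token_vectors, attention_mask):
        for i in range(dim):
            total[i] += vec[i] * keep
    # Same guard against division by zero as torch.clamp(..., min=1e-9)
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in total]

tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last vector is padding
mask = [1, 1, 0]

print(masked_mean(tokens, mask))  # [2.0, 3.0]
```

An unmasked average over all three vectors would be pulled toward the padding values, which is why the mask matters when sentences in a batch have different lengths.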


In [None]:
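# Another option for the sentence embedding, following the original BERT paper:
# use the hidden state of the first ([CLS]) token.
# This call runs outside torch.no_grad(), so the output tensor carries a grad_fn.

sentence_embeddings = model(**encoded_input).last_hidden_state[:, 0]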

print("Sentence embeddings based on [CLS] token:")
print(sentence_embeddings)

Sentence embeddings based on [CLS] token:
tensor([[ 0.4050, -1.5686, -1.1648,  ..., -0.0514, -1.4019,  0.2716],
        [-0.2752, -0.5248, -0.8590,  ..., -0.4845, -0.2114, -0.5468]],
       grad_fn=<SelectBackward0>)
