# Text similarity with BERT (Hugging Face)

This notebook provides a small example of how to use BERT to measure the similarity of two texts.



First, we install the sentence-transformers framework (https://pypi.org/project/sentence-transformers/), which makes it easy to compute dense vector representations for sentences, paragraphs, etc.

The models are based on transformer networks such as BERT, RoBERTa, and XLM-RoBERTa. They achieve state-of-the-art performance in various tasks.

Text is embedded in a vector space such that similar texts lie close together and can be found efficiently using cosine similarity.
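
To make the notion of cosine similarity concrete, here is a minimal sketch in plain Python. The helper name `cosine_sim` and the toy 3-dimensional vectors are assumptions made for illustration; real sentence embeddings have hundreds of dimensions, and the cells below rely on library implementations instead.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (made up for illustration, not real embeddings).
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1, so similarity is close to 1
v3 = [3.0, 0.0, -1.0]  # orthogonal to v1, so similarity is 0

print(cosine_sim(v1, v2))
print(cosine_sim(v1, v3))
```

The score depends only on the angle between the vectors, not on their length, which is why `v1` and `v2` count as maximally similar even though `v2` is twice as long.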


In [None]:
!pip install -U sentence-transformers


Requirement already up-to-date: sentence-transformers in /usr/local/lib/python3.7/dist-packages (2.0.0)


We use the SentenceTransformer class to convert texts into embeddings. Here we load the model bert-base-nli-mean-tokens (https://huggingface.co/sentence-transformers/bert-base-nli-mean-tokens), which produces sentence embeddings (of low quality).

It maps texts to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')
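
The `mean-tokens` suffix in the model name refers to mean pooling: the sentence embedding is the average of the token embeddings produced by BERT, with padding positions ignored. The following is only a toy sketch of that pooling step in plain Python. The 2-dimensional token vectors and the `mean_pool` helper are made up for illustration; the library performs this step internally.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (real tokens)."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    dim = len(token_embeddings[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Three real tokens and one padding token (mask 0), each 2-dimensional.
tokens = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print(mean_pool(tokens, mask))  # [2.0, 2.0]
```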


We define our set of sentences and apply the model to them to obtain their embeddings (vectors).

In [None]:

sentences=["The hotel was very good, and not expensive",
           "The inn was especially nice, and not overprice"]

sentence_embeddings = model.encode(sentences)

We can now compare the two embeddings using cosine similarity. The high score shows that this approach can capture synonym relationships between words.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(
    [sentence_embeddings[0]],
    sentence_embeddings[1:]
)

array([[0.87287253]], dtype=float32)

In [None]:

sentences=["Where can I find the user guide?",
           "Where can I find the instructions?",
           "Where is the manual?",
           "Where can I find the true love?"]
sentence_embeddings = model.encode(sentences)
cosine_similarity(
    [sentence_embeddings[0]],
    sentence_embeddings[1:]
)

array([[0.90135634, 0.73777556, 0.6358726 ]], dtype=float32)

However, the model still assigns a high similarity to sentences that have opposite meanings but share most of their vocabulary.

In [None]:

sentences=["The hotel was very good, and not expensive",
           "The hotel was not very good, and not expensive"]
sentence_embeddings = model.encode(sentences)
cosine_similarity(
    [sentence_embeddings[0]],
    sentence_embeddings[1:]
)


array([[0.7723092]], dtype=float32)
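
To quantify how much vocabulary these two sentences share, a quick check that is not part of the original notebook is the Jaccard overlap of their word sets. The helpers `word_set` and `jaccard` are illustrative names, and the tokenization (lowercased alphabetic words) is a simplifying assumption.

```python
import re

def word_set(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

s1 = "The hotel was very good, and not expensive"
s2 = "The hotel was not very good, and not expensive"

print(jaccard(word_set(s1), word_set(s2)))  # 1.0
```

Both sentences use exactly the same set of words; only the extra "not" and its position differ. Any method that is mostly sensitive to which words occur will therefore find the two sentences very similar.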

The model 'bert-base-nli-mean-tokens' is deprecated. 
You can find new models at 
https://www.sbert.net/docs/pretrained_models.html


Use **paraphrase-mpnet-base-v2** for the best quality, and **paraphrase-MiniLM-L6-v2** if you want a quick model with high quality.





In [None]:
# First, we load the model.
# paraphrase-mpnet-base-v2 gives the best quality;
# paraphrase-MiniLM-L6-v2 is a faster model that still has high quality.


from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')


Now we encode each sentence with the model and measure similarity using cosine similarity. In particular, we compare the first sentence with the others.

In [None]:
sentences=["Where can I find the user guide?",
           "Where can I find the instructions?",
           "Where is the manual?",
           "Where can I find the true love?"]
sentence_embeddings = model.encode(sentences)
cosine_similarity(
    [sentence_embeddings[0]],
    sentence_embeddings[1:]
)

array([[0.4881479 , 0.44247025, 0.21037427]], dtype=float32)
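
The cells in this notebook repeat one pattern: encode a list of sentences, then compare the first embedding with the rest. As a sketch, that pattern could be wrapped in a small helper. The names `first_vs_rest`, `cosine_sim`, and `toy_encode` are hypothetical, and the hand-written 2-dimensional vectors only stand in for a real model so the example runs without a download. With a loaded model, `model.encode` would be passed in place of `toy_encode`.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two vectors given as sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def first_vs_rest(encode, sentences):
    """Score every sentence after the first against the first one."""
    vectors = encode(sentences)
    query = vectors[0]
    return [(s, cosine_sim(query, v)) for s, v in zip(sentences[1:], vectors[1:])]

# Stand-in encoder: made-up vectors, NOT real model output.
TOY_VECTORS = {
    "Where can I find the user guide?": [1.0, 0.0],
    "Where can I find the instructions?": [3.0, 4.0],
    "Where can I find the true love?": [0.0, 2.0],
}

def toy_encode(sentences):
    return [TOY_VECTORS[s] for s in sentences]

results = first_vs_rest(toy_encode, list(TOY_VECTORS))
for sentence, score in results:
    print(f"{score:.2f}  {sentence}")
```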

In [None]:

sentences=["The hotel was very good, and not expensive",
           "The hotel was not very good, and not expensive"]
           
sentence_embeddings = model.encode(sentences)
cosine_similarity(
    [sentence_embeddings[0]],
    sentence_embeddings[1:]
)

array([[0.87287253]], dtype=float32)

# Question Answering

The following model was trained on Google’s Natural Questions dataset, which contains 100k real queries from Google search together with the relevant passages from Wikipedia.

nq-distilbert-base-v1: MRR@10 of 72.36 on the NQ dev set (small)
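
The metric quoted above is presumably the mean reciprocal rank over the top 10 retrieved passages, reported as a percentage. As a hedged sketch of how such a score is typically computed (the function name `mrr_at_k` and the example data are made up, and this is not the evaluation script behind the reported number):

```python
def mrr_at_k(ranked_relevance, k=10):
    """Mean reciprocal rank at k.

    ranked_relevance holds one list per query with True/False flags for the
    retrieved passages in rank order (best first).
    """
    total = 0.0
    for flags in ranked_relevance:
        for rank, is_relevant in enumerate(flags[:k], start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_relevance)

# Three made-up queries: first relevant passage at rank 1, at rank 2,
# and at rank 11 (outside the top 10, so it contributes 0).
runs = [
    [True, False, False],
    [False, True, False],
    [False] * 10 + [True],
]
print(mrr_at_k(runs))  # 0.5
```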


In [None]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('nq-distilbert-base-v1')



In [None]:
query_embedding = model.encode('How many people live in London?')

# The passages are passed as [[title1, text1], [title2, text2], ...]
passage_embedding = model.encode([['London1', 'London has 9,787,426 inhabitants at the 2011 census.'],
                                  ['London2', 'The population of London is estimated to be just over nine million people in 2020.'],
                                  ['London3', "London is the capital and most populous city of England and the United Kingdom."]])


# util.pytorch_cos_sim(A, B) computes the cosine similarity between all vectors in A and all vectors in B.
print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))

Similarity: tensor([[0.6201, 0.7110, 0.5209]])
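
In a retrieval setting the scores are usually turned into a ranking. A minimal sketch, with the titles and scores copied by hand from the cell and output above (with the live tensor, the values could be obtained via its `tolist()` method instead):

```python
# Titles and scores copied from the cell and its printed output.
titles = ["London1", "London2", "London3"]
scores = [0.6201, 0.7110, 0.5209]

# Sort passages from most to least similar to the query.
ranking = sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)
for title, score in ranking:
    print(f"{title}: {score:.4f}")
```

Here the passage titled London2, which states the population directly, ranks first, while the passage that only describes London as the capital ranks last.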


# Question answering for Spanish

In [None]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('distiluse-base-multilingual-cased-v1')


In [None]:
# Query in Spanish: "What is the per capita income in Spain?"
query_embedding = model.encode('¿Cuál es la renta per capita en España?')

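# The passages are passed as [[title1, text1], [title2, text2], ...]
# The first three concern per capita GDP or consumption in Spain; the last one is about football and unrelated to the query.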
passage_embedding = model.encode([['Answer1', 'España tiene un PIB Per cápita trimestral de 6.070€ euros.'],
                                  ['Answer2', 'PIB per cápita actual (corresponde al 2019, último dato publicado por el INE): 26.426 euros.'],
                                  ['Answer3', "El consumo per cápita en España en 2020 fue dieciocho puntos inferior al de la media de la eurozona"],
                                  ['Answer3', "Luis Enrique es el mejor seleccionador que España puede tener”"]])


# util.pytorch_cos_sim(A, B) computes the cosine similarity between all vectors in A and all vectors in B.
print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))

Similarity: tensor([[0.4368, 0.3822, 0.4629, 0.4487]])
