# BERT For Measuring Text Similarity
High-performance semantic similarity with BERT
James Briggs



### BERT and sequence similarity!
A big part of NLP relies on similarity in high-dimensional spaces. Typically, an NLP solution takes some text, processes it into a large vector or array that represents that text, and then performs several transformations.
It’s high-dimensional magic.
Sentence similarity is one of the clearest examples of how powerful high-dimensional magic can be.

#### The logic is this:
1. Take a sentence and convert it into a vector.
2. Take many other sentences and convert them into vectors.
3. Find the sentences that have the smallest distance (Euclidean) or the smallest angle (highest cosine similarity) between them.

We now have a measure of semantic similarity between sentences. Easy!
At a high level, there’s not much else to it. But of course, we want to understand what is happening in a little more detail and implement this in Python too! So, let’s get started.
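As a rough illustration of the two measures mentioned above, here is a minimal sketch in plain Python. The helper names (`euclidean_distance`, `cosine_sim`) and the three-value toy vectors are made up for this example; real sentence vectors from BERT base would have 768 values.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_sim(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "sentence vectors" (made-up values, not real BERT output).
query = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]   # same direction as query, twice the length
other = [3.0, -1.0, 0.5]   # points in a different direction

print(euclidean_distance(query, scaled))  # non-zero: the lengths differ
print(cosine_sim(query, scaled))          # close to 1: the directions match
print(cosine_sim(query, other))           # much lower: the directions differ
```

Note how the two measures can disagree: `scaled` is some distance away from `query` in Euclidean terms, yet the angle between them is zero. Cosine similarity ignores vector length, which is one reason it is a common choice for comparing embeddings.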

#### Why BERT Helps
BERT is the MVP of NLP, and a big part of this is down to BERT’s ability to embed the meaning of words into densely packed vectors.
We call them dense vectors because every value within the vector carries information and has a reason for being that value. This is in contrast to sparse vectors, such as one-hot encoded vectors, where the majority of values are 0.
BERT is great at creating these dense vectors, and each encoder layer (BERT base has 12) outputs a set of dense vectors.
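To make the dense versus sparse contrast concrete, here is a small sketch. The five-word vocabulary, the `one_hot` helper, and the dense values are all invented for illustration; they are not produced by BERT.

```python
# Hypothetical five-word vocabulary, for illustration only.
vocabulary = ["coffin", "fish", "jelly", "leprechaun", "walnut"]

def one_hot(word, vocab):
    """Sparse encoding: a single 1 at the word's index, 0 everywhere else."""
    return [1 if w == word else 0 for w in vocab]

sparse_vector = one_hot("jelly", vocabulary)

# Made-up dense vector; a real BERT base token vector has 768 values.
dense_vector = [0.21, -0.47, 0.88, 0.05, -0.13]

sparse_nonzero = sum(1 for v in sparse_vector if v != 0)
dense_nonzero = sum(1 for v in dense_vector if v != 0)

print(sparse_vector, "non-zero values:", sparse_nonzero)
print(dense_vector, "non-zero values:", dense_nonzero)
```

The one-hot vector only says which word it is, while every position of the dense vector contributes something to the representation.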

BERT base network — with the hidden layer representations highlighted in green.
For BERT base, this will be a vector containing 768 values. Those 768 values contain our numerical representation of a single token, which we can use as a contextual word embedding.
Because there is one of these vectors for each token (output by each encoder), we are actually looking at a tensor of size (number of tokens) by 768.

We can take these tensors and transform them to create semantic representations of the input sequence. We can then take our similarity metrics and calculate the respective similarity between different sequences.
The simplest and most commonly extracted tensor is the last_hidden_state tensor, which is conveniently output by the BERT model.
Of course, this is a pretty large tensor (512x768 when the input is padded to BERT’s maximum of 512 tokens), and we want a single vector to apply our similarity measures to.
To do this, we need to convert our last_hidden_state tensor to a vector of 768 dimensions.
## Creating The Vector
To convert our last_hidden_state tensor into a single vector, we use a mean pooling operation.
Each of those 512 tokens has a respective 768 values. This pooling operation takes the mean of all token embeddings and compresses them into a single 768-dimensional vector, creating a ‘sentence vector’.
At the same time, we can’t just take the mean activation as is. We need to account for padding tokens, which should not be included in the mean.
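The masked mean described above might be sketched as follows. This uses NumPy with a tiny made-up tensor standing in for last_hidden_state (4 token positions and a hidden size of 3 rather than 512x768). The `mean_pool` helper is a hypothetical name, and the `attention_mask` array (1 for real tokens, 0 for padding) is an assumption about how the padding positions are marked.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    mask = attention_mask[:, None].astype(token_embeddings.dtype)  # (tokens, 1)
    summed = (token_embeddings * mask).sum(axis=0)                  # (hidden,)
    count = max(mask.sum(), 1e-9)  # guard against dividing by zero
    return summed / count

# Toy stand-in for last_hidden_state (values are made up).
token_embeddings = np.array([
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [9.0, 9.0, 9.0],  # padding position
    [9.0, 9.0, 9.0],  # padding position
])
attention_mask = np.array([1, 1, 0, 0])

sentence_vector = mean_pool(token_embeddings, attention_mask)
naive_mean = token_embeddings.mean(axis=0)

print(sentence_vector)  # mean of the first two rows only
print(naive_mean)       # skewed by the padding rows
```

Comparing the two printed vectors shows why the mask matters: the naive mean is pulled toward whatever values sit in the padding positions.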
## In Code
That’s great on the theory and logic behind the process — but how do we apply this in reality?
We’ll outline two approaches: the easy way and the slightly more complex way.
#### Easy: Sentence-Transformers
The easiest approach for us to implement everything we just covered is through the sentence-transformers library, which wraps most of this process into a few lines of code.
First, we install sentence-transformers using pip install sentence-transformers. This library uses HuggingFace’s transformers behind the scenes, so we can find sentence-transformers models on the HuggingFace model hub.

Reference:
https://towardsdatascience.com/bert-for-measuring-text-similarity-eec91c6bf9e1

More word manipulation tools:
https://pythonprogramming.net/wordnet-nltk-tutorial/

In [1]:
# Write a few sentences to encode (sentences 0 and 2 are similar):

sentences = [
    "Three years later, the coffin was still full of Jello.",
    "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
    "The person box was packed with jelly many dozens of months later.",
    "He found a leprechaun in his walnut shell."
]

In [2]:
#!pip install sentence-transformers



In [2]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')


In [3]:
# Encode the sentences into one 768-value vector each

sentence_embeddings = model.encode(sentences)
sentence_embeddings.shape

(4, 768)

In [4]:
from sklearn.metrics.pairwise import cosine_similarity

In [14]:
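# Calculate the cosine similarity between sentence 0 and each of the other sentences:

result = cosine_similarity(
    [sentence_embeddings[0]],
    sentence_embeddings[1:]
)

# result has a single row with one score per compared sentence; take the highest
max_similarity = max(result[0])

print(max_similarity)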

0.7219258


These similarities translate to the following.

Base sentence:
* Three years later, the coffin was still full of Jello.

| Index | Sentence | Similarity |
| --- | --- | --- |
| 1 | "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go." | 0.3309 |
| 2 | "The person box was packed with jelly many dozens of months later." | 0.7219 |
| 3 | "He found a leprechaun in his walnut shell." | 0.5547 |
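To list every score in ranked order rather than only the maximum, something like the following sketch could be used. It relies on NumPy alone, and the 2-D vectors are made-up stand-ins for `sentence_embeddings[0]` and `sentence_embeddings[1:]`. The `rank_by_cosine` helper is a hypothetical name, not part of sentence-transformers or scikit-learn.

```python
import numpy as np

def rank_by_cosine(base, candidates):
    """Return (index, score) pairs sorted from most to least similar to base."""
    base_unit = base / np.linalg.norm(base)
    cand_units = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = cand_units @ base_unit   # cosine similarity for each candidate row
    order = np.argsort(-scores)       # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Toy stand-ins for the base sentence vector and the vectors compared against it.
base = np.array([1.0, 0.0])
candidates = np.array([
    [0.0, 1.0],  # orthogonal to base
    [2.0, 0.0],  # same direction as base
    [1.0, 1.0],  # 45 degrees from base
])

ranking = rank_by_cosine(base, candidates)
for idx, score in ranking:
    print(idx, round(score, 4))
```

The returned indices are positions within `candidates`. When the candidates are `sentence_embeddings[1:]`, add 1 to map them back to positions in the original `sentences` list.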