# BERT For Measuring Text Similarity
High-performance semantic similarity with BERT
James Briggs



### BERT and sequence similarity!
A big part of NLP relies on similarity in high-dimensional spaces. Typically, an NLP solution will take some text, process it to create a large vector or array representing that text, and then perform several transformations on it.
It's high-dimensional magic.
Sentence similarity is one of the clearest examples of how powerful high-dimensional magic can be.

#### The logic is this:
* Take a sentence and convert it into a vector.
* Take many other sentences and convert them into vectors.
* Find the sentences that have the smallest distance (Euclidean) or the smallest angle (highest cosine similarity) between them.
* We now have a measure of semantic similarity between sentences. Easy!
At a high level, there’s not much else to it. But of course, we want to understand what is happening in a little more detail and implement this in Python too! So, let’s get started.
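
As a rough illustration of the two metrics mentioned above, here is a minimal sketch in plain Python. The three-dimensional vectors are made-up stand-ins for real sentence vectors, and the helper names `euclidean_distance` and `cosine_sim` are only illustrative.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_sim(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "sentence vectors" (made-up values, not BERT output)
query = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]   # the query scaled by 2
different = [3.0, -1.0, 0.5]

print(euclidean_distance(query, same_direction))  # non-zero: the points are far apart
print(cosine_sim(query, same_direction))          # about 1.0: the angle is zero
print(cosine_sim(query, different))               # much lower: different direction
```

Note that the scaled vector is some distance away from the query yet has a cosine similarity of about 1.0. Cosine similarity ignores vector length and looks only at direction, which is one reason it is a popular choice for comparing embeddings.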

#### Why BERT Helps
BERT is the MVP of NLP, and a big part of this is down to BERT's ability to embed the meaning of words into densely packed vectors.
We call them dense vectors because every position in the vector holds a meaningful, typically non-zero value. This is in contrast to sparse vectors, such as one-hot encoded vectors, where the majority of values are 0.
BERT is great at creating these dense vectors, and each encoder layer (there are 12 in BERT base) outputs a set of dense vectors.
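
To make the dense versus sparse contrast concrete, here is a small sketch. The five-word vocabulary and the dense values are invented for illustration; real BERT vectors have 768 dimensions and are learned, not hand-written.

```python
# Hypothetical five-word vocabulary, for illustration only
vocab = ["coffin", "jello", "fish", "toilet", "leprechaun"]

def one_hot(word):
    """Sparse encoding: a single 1 at the word's index, 0 everywhere else."""
    return [1.0 if w == word else 0.0 for w in vocab]

def zero_fraction(vec):
    """Share of positions in the vector that are exactly zero."""
    return sum(1 for v in vec if v == 0.0) / len(vec)

sparse_fish = one_hot("fish")

# A made-up dense vector: every position carries some information
dense_fish = [0.21, -0.47, 0.88, 0.05, -0.33]

print(sparse_fish)                 # [0.0, 0.0, 1.0, 0.0, 0.0]
print(zero_fraction(sparse_fish))  # 0.8
print(zero_fraction(dense_fish))   # 0.0
```

With a realistic vocabulary of tens of thousands of words, the one-hot vector would be almost entirely zeros, while the dense vector stays compact.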

BERT base network — with the hidden layer representations highlighted in green.
For BERT base, this will be a vector containing 768 values. Those 768 values contain our numerical representation of a single token, which we can use as a contextual word embedding.
Because there is one of these vectors for each token (output by each encoder layer), we are actually looking at a tensor of size 768 by the number of tokens.

We can take these tensors and transform them to create semantic representations of the input sequence. We can then take our similarity metrics and calculate the respective similarity between different sequences.
The simplest and most commonly extracted tensor is the last_hidden_state tensor, which is conveniently output by the BERT model.
Of course, this is a pretty large tensor, at 512x768, and we want a single vector to apply our similarity measures to.
To do this, we need to convert our last_hidden_state tensor to a vector of 768 dimensions.
## Creating The Vector
To convert our last_hidden_state tensor into a vector, we use a mean pooling operation.
Each of those 512 tokens has a respective 768 values. This pooling operation takes the mean of all token embeddings and compresses them into a single 768-dimensional vector, creating a 'sentence vector'.
At the same time, we can't just take the mean activation as is. We need to exclude the padding tokens, which carry no content and would otherwise skew the average.
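
A small NumPy sketch shows why padding has to be excluded. The array below is a made-up stand-in for one sentence's token embeddings (4 positions by 3 dimensions instead of 512 by 768), and the padding rows are given arbitrary values to make their effect visible.

```python
import numpy as np

# Toy token embeddings for one sentence: 4 positions x 3 dimensions.
# The last two positions stand in for padding tokens (made-up values).
token_embeddings = np.array([
    [1.0, 2.0, 3.0],   # real token
    [3.0, 4.0, 5.0],   # real token
    [9.0, 9.0, 9.0],   # padding
    [9.0, 9.0, 9.0],   # padding
])
attention_mask = np.array([1.0, 1.0, 0.0, 0.0])

# Naive mean: padding positions drag the result away from the real tokens
naive_mean = token_embeddings.mean(axis=0)

# Masked mean: zero out padding, then divide by the number of real tokens
masked_sum = (token_embeddings * attention_mask[:, None]).sum(axis=0)
masked_mean = masked_sum / attention_mask.sum()

print(naive_mean)    # values 5.5, 6.0, 6.5
print(masked_mean)   # values 2.0, 3.0, 4.0
```

The masked mean is the average of the two real rows only, which is the behaviour we want from the sentence vector.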
## In Code
That's great on the theory and logic behind the process, but how do we apply this in reality?
We'll outline two approaches: the easy way and the slightly more complex way.
### Easy: Sentence-Transformers
The easiest approach for us to implement everything we just covered is through the sentence-transformers library — which wraps most of this process into a few lines of code.
First, we install sentence-transformers using pip install sentence-transformers. This library uses HuggingFace's transformers behind the scenes, so its pretrained models are also available through the HuggingFace model hub.

reference :
https://towardsdatascience.com/bert-for-measuring-text-similarity-eec91c6bf9e1 

more word manipulation tools
https://pythonprogramming.net/wordnet-nltk-tutorial/ 

In [1]:
#Write a few sentences to encode (sentences 0 and 2 are both similar):

sentences = [
    "Three years later, the coffin was still full of Jello.",
    "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
    "The person box was packed with jelly many dozens of months later.",
    "He found a leprechaun in his walnut shell."
]

In [2]:
!pip install sentence-transformers


In [3]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')



In [4]:
#Encode the sentences

sentence_embeddings = model.encode(sentences)
sentence_embeddings.shape

(4, 768)

In [5]:
from sklearn.metrics.pairwise import cosine_similarity

In [6]:
#Let's calculate cosine similarity for sentence 0:

cosine_similarity(
    [sentence_embeddings[0]],
    sentence_embeddings[1:]
)

array([[0.33088917, 0.721926  , 0.55483633]], dtype=float32)

These similarities translate to:

Base sentence: "Three years later, the coffin was still full of Jello."

| Index | Sentence | Similarity |
| --- | --- | --- |
| 1 | "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go." | 0.3309 |
| 2 | "The person box was packed with jelly many dozens of months later." | 0.7219 |
| 3 | "He found a leprechaun in his walnut shell." | 0.5548 |
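
In practice we usually want the closest match, not the whole list of scores. The sketch below simply sorts the scores printed by the cosine_similarity cell; the variable names are illustrative, and the values are copied from that output, not recomputed.

```python
# Similarity scores copied from the cosine_similarity output for sentence 0
scores = [0.33088917, 0.721926, 0.55483633]
candidates = [
    "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
    "The person box was packed with jelly many dozens of months later.",
    "He found a leprechaun in his walnut shell.",
]

# Pair each candidate with its score and sort from most to least similar
ranked = sorted(zip(scores, candidates), reverse=True)

for score, sentence in ranked:
    print(f"{score:.4f}  {sentence}")
```

The "person box" sentence comes out on top, which matches the intent of the example: it is a paraphrase of the coffin sentence with almost no words in common.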

That was the easier, more abstract approach: seven lines of code to compare our sentences.

# Advanced approach
### Involved: Transformers And PyTorch
Before getting into the second approach, it is worth noting that it does the same thing as the first, but one level lower.
With this approach, we need to perform our own transformation of the last_hidden_state to create the sentence embedding. For this, we perform the mean pooling operation.

In [7]:

from transformers import AutoTokenizer, AutoModel
import torch

In [8]:
#First we initialize our model and tokenizer:

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')


In [10]:
# Then we tokenize the sentences just as before: 

sentences = [
    "Three years later, the coffin was still full of Jello.",
    "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
    "The person box was packed with jelly many dozens of months later.",
    "He found a leprechaun in his walnut shell."
]

# initialize dictionary to store tokenized sentences
tokens = {'input_ids': [], 'attention_mask': []}

for sentence in sentences:
    # encode each sentence and append to dictionary
    new_tokens = tokenizer.encode_plus(sentence, max_length=128,
                                       truncation=True, padding='max_length',
                                       return_tensors='pt')
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])

# reformat list of tensors into single tensor
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])

print('done')

done
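
To see what padding='max_length' and the attention mask amount to, here is a simplified, tokenizer-free sketch. The helper `pad_with_mask` and the token IDs are hypothetical; a real tokenizer also handles special tokens during truncation, which this sketch does not attempt.

```python
def pad_with_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token_ids to max_length and build the matching mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding   # 0 marks padding

# Hypothetical token IDs (101 and 102 stand in for [CLS] and [SEP])
input_ids, attention_mask = pad_with_mask([101, 2093, 2086, 102], max_length=8)

print(input_ids)       # [101, 2093, 2086, 102, 0, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Every sentence ends up with the same length, which is what lets torch.stack combine them into a single tensor, and the mask records which positions are real.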


In [12]:
#We process these tokens through our model:
outputs = model(**tokens)
outputs.keys()

#The dense vector representations of our text are contained within the outputs 'last_hidden_state' tensor, which we access like so:
embeddings = outputs.last_hidden_state
embeddings

tensor([[[-0.0692,  0.6230,  0.0354,  ...,  0.8033,  1.6314,  0.3281],
         [ 0.0367,  0.6842,  0.1946,  ...,  0.0848,  1.4747, -0.3008],
         [-0.0121,  0.6543, -0.0727,  ..., -0.0326,  1.7717, -0.6812],
         ...,
         [ 0.1953,  1.1085,  0.3390,  ...,  1.2826,  1.0114, -0.0728],
         [ 0.0902,  1.0288,  0.3297,  ...,  1.2940,  0.9865, -0.1113],
         [ 0.1240,  0.9737,  0.3933,  ...,  1.1359,  0.8768, -0.1043]],

        [[-0.3212,  0.8251,  1.0554,  ..., -0.1855,  0.1517,  0.3937],
         [-0.7146,  1.0297,  1.1217,  ...,  0.0331,  0.2382, -0.1563],
         [-0.2352,  1.1353,  0.8594,  ..., -0.4310, -0.0272, -0.2968],
         ...,
         [-0.5400,  0.3236,  0.7839,  ...,  0.0022, -0.2994,  0.2659],
         [-0.5643,  0.3187,  0.9576,  ...,  0.0342, -0.3030,  0.1878],
         [-0.5172,  0.3599,  0.9336,  ...,  0.0243, -0.2232,  0.1672]],

        [[-0.7576,  0.8399, -0.3792,  ...,  0.1271,  1.2514,  0.1365],
         [-0.6591,  0.7613, -0.4662,  ...,  0

In [13]:
embeddings.shape

torch.Size([4, 128, 768])

## Next step
After we have produced our dense vector embeddings, we need to perform a mean pooling operation to create a single vector encoding (the sentence embedding).
To do this mean pooling operation, we will need to multiply each value in our embeddings tensor by its respective attention_mask value, so that we ignore padding tokens.
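
The shape bookkeeping is the fiddly part of this step, so here is a NumPy sketch of the same idea on tiny, made-up arrays. NumPy is used only to keep the example lightweight; the dimensions 2, 4 and 3 stand in for 4, 128 and 768.

```python
import numpy as np

batch, seq_len, hidden = 2, 4, 3   # tiny stand-ins for 4, 128, 768

# Made-up embeddings: every value is 1.0 so the effect of the mask is easy to see
embeddings = np.ones((batch, seq_len, hidden))
attention_mask = np.array([
    [1, 1, 1, 0],   # sentence with 3 real tokens
    [1, 1, 0, 0],   # sentence with 2 real tokens
])

# Add a trailing axis, then repeat it across the hidden dimension
mask = np.broadcast_to(attention_mask[:, :, None], embeddings.shape).astype(float)

# Element-wise product zeroes out every padding position
masked_embeddings = embeddings * mask

print(mask.shape)                      # (2, 4, 3)
print(masked_embeddings.sum(axis=1))   # per-sentence sums over real tokens only
```

The mask starts with one value per token and ends up with one value per embedding dimension, so the multiplication lines up element by element.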

In [15]:
#To perform this operation, we first resize our attention_mask tensor:
attention_mask = tokens['attention_mask']
attention_mask.shape


torch.Size([4, 128])

In [16]:
mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
mask.shape
mask

tensor([[[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 

In [17]:
#Each vector above represents a single token's attention mask - each token now has a vector of size 
#768 representing its attention_mask status. Then we multiply the two tensors to apply the attention mask:
masked_embeddings = embeddings * mask
masked_embeddings.shape
masked_embeddings

tensor([[[-0.0692,  0.6230,  0.0354,  ...,  0.8033,  1.6314,  0.3281],
         [ 0.0367,  0.6842,  0.1946,  ...,  0.0848,  1.4747, -0.3008],
         [-0.0121,  0.6543, -0.0727,  ..., -0.0326,  1.7717, -0.6812],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000, -0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000, -0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000, -0.0000]],

        [[-0.3212,  0.8251,  1.0554,  ..., -0.1855,  0.1517,  0.3937],
         [-0.7146,  1.0297,  1.1217,  ...,  0.0331,  0.2382, -0.1563],
         [-0.2352,  1.1353,  0.8594,  ..., -0.4310, -0.0272, -0.2968],
         ...,
         [-0.0000,  0.0000,  0.0000,  ...,  0.0000, -0.0000,  0.0000],
         [-0.0000,  0.0000,  0.0000,  ...,  0.0000, -0.0000,  0.0000],
         [-0.0000,  0.0000,  0.0000,  ...,  0.0000, -0.0000,  0.0000]],

        [[-0.7576,  0.8399, -0.3792,  ...,  0.1271,  1.2514,  0.1365],
         [-0.6591,  0.7613, -0.4662,  ...,  0

In [19]:
#Then we sum the remainder of the embeddings along axis 1:
summed = torch.sum(masked_embeddings, 1)
summed.shape

torch.Size([4, 768])

In [20]:

#Then sum the number of values that must be given attention in each position of the tensor:
summed_mask = torch.clamp(mask.sum(1), min=1e-9)
summed_mask.shape
summed_mask

tensor([[15., 15., 15.,  ..., 15., 15., 15.],
        [22., 22., 22.,  ..., 22., 22., 22.],
        [15., 15., 15.,  ..., 15., 15., 15.],
        [14., 14., 14.,  ..., 14., 14., 14.]])
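
The clamp with min=1e-9 is a guard against dividing by zero if a row of the mask were ever all zeros. A NumPy sketch of the same guard, using made-up counts and sums (np.clip plays the role that torch.clamp plays in the cell above):

```python
import numpy as np

# Token counts per sentence; the last row simulates an all-padding input
token_counts = np.array([[15.0], [22.0], [0.0]])
summed = np.array([[30.0], [44.0], [0.0]])

# Raise any zero count to a tiny positive number so the division
# below is always defined; non-zero counts are left unchanged
safe_counts = np.clip(token_counts, 1e-9, None)

mean_pooled = summed / safe_counts
print(mean_pooled)   # values 2.0, 2.0, 0.0 with no NaN or inf
```

For normal inputs the clamp changes nothing, since every sentence has at least a few real tokens.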

In [21]:
#Finally, we calculate the mean as the sum of the embedding activations (summed)
#divided by the number of values that should be given attention in each position (summed_mask):
mean_pooled = summed / summed_mask
mean_pooled

tensor([[ 0.0745,  0.8637,  0.1795,  ...,  0.7734,  1.7247, -0.1803],
        [-0.3715,  0.9729,  1.0840,  ..., -0.2552, -0.2759,  0.0358],
        [-0.5030,  0.7950, -0.1240,  ...,  0.1441,  0.9704, -0.1791],
        [-0.2131,  1.0175, -0.8833,  ...,  0.7371,  0.1947, -0.3011]],
       grad_fn=<DivBackward0>)

Once we have our dense vectors, we can calculate the cosine similarity between each — which is the same logic we used before:


In [22]:
from sklearn.metrics.pairwise import cosine_similarity

In [23]:
# convert from PyTorch tensor to numpy array
mean_pooled = mean_pooled.detach().numpy()

# calculate
cosine_similarity(
    [mean_pooled[0]],
    mean_pooled[1:]
)

array([[0.3308891 , 0.721926  , 0.55483633]], dtype=float32)
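
For reference, cosine similarity can also be computed by hand as the dot product of length-normalised vectors. The sketch below is a minimal NumPy version, assuming row vectors with non-zero length. The helper name `cosine_similarity_matrix` and the two-dimensional toy vectors are illustrative and not part of the notebook.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the (n, n) matrix of cosine similarities between row vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms          # scale every row to length 1
    return unit @ unit.T            # dot products of unit vectors are cosines

# Made-up 2-dimensional vectors standing in for the 768-dimensional embeddings
toy = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

sims = cosine_similarity_matrix(toy)
print(np.round(sims, 4))
```

Row 0 of such a matrix, minus its first entry, corresponds to the kind of comparison made above between sentence 0 and the remaining sentences.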

## Final results

We return the same results as the first approach: rounded to four decimal places, both outputs give 0.3309, 0.7219 and 0.5548, with only a negligible floating-point difference in the last printed digit of the first value.
That's all for this introduction to measuring the semantic similarity of sentences using BERT, using both sentence-transformers and a lower-level implementation with PyTorch and transformers.
Thanks for reading!