# Calculating Similarity

When calculating similarity between our transformer-embedded vectors, we can use any of the *three* similarity metrics already covered.

But first, let's create some embeddings.

In [1]:
sentences = [
    "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
    "Three years later, the coffin was still full of Jello.",
    "The person box was packed with jelly many dozens of months later.",
    "Standing on one's head at job interviews forms a lasting impression.",
    "It took him a month to finish the meal.",
    "He found a leprechaun in his walnut shell.",
    "The aquatic, gill-bearing animal being had night visions of fleeing the spherical glass prison and into the potty where he saw his mates go."
]

# thanks to https://randomwordgenerator.com/sentence.php

In [2]:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')

# initialize dictionary that will contain tokenized sentences
tokens = {'input_ids': [], 'attention_mask': []}

for sentence in sentences:
    # tokenize sentence and append to dictionary lists
    new_tokens = tokenizer.encode_plus(sentence, max_length=128, truncation=True,
                                       padding='max_length', return_tensors='pt')
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])

# reformat list of tensors into single tensor
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])

In [3]:
tokens['input_ids'].shape

torch.Size([7, 128])

We process these tokens through our model:

In [4]:
outputs = model(**tokens)
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

The dense vector representations of our `sentences` are contained in the `last_hidden_state` tensor of `outputs`, which we access like so:

In [5]:
embeddings = outputs.last_hidden_state
embeddings[0]

tensor([[-0.3212,  0.8251,  1.0554,  ..., -0.1855,  0.1517,  0.3937],
        [-0.7146,  1.0297,  1.1217,  ...,  0.0331,  0.2382, -0.1563],
        [-0.2352,  1.1353,  0.8594,  ..., -0.4310, -0.0272, -0.2968],
        ...,
        [-0.5400,  0.3236,  0.7839,  ...,  0.0022, -0.2994,  0.2659],
        [-0.5643,  0.3187,  0.9576,  ...,  0.0342, -0.3030,  0.1878],
        [-0.5172,  0.3599,  0.9336,  ...,  0.0243, -0.2232,  0.1672]],
       grad_fn=<SelectBackward0>)

In [6]:
embeddings.shape

torch.Size([7, 128, 768])

After we have produced our dense vectors `embeddings`, we have one 768-dimensional vector per token, so we need to perform a *mean pooling* operation on them to create a single vector encoding per sentence (the **sentence embedding**). To do this mean pooling operation we will need to multiply each value in our `embeddings` tensor by its respective `attention_mask` value - so that we ignore padding tokens.

To perform this operation, we first expand our `attention_mask` tensor to the same shape as `embeddings`:

In [7]:
attention_mask = tokens['attention_mask']
attention_mask.shape

torch.Size([7, 128])

In [8]:
mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
mask.shape

torch.Size([7, 128, 768])

In [9]:
mask[0]

tensor([[1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

Each row above is the attention mask for a single token - each token now has a vector of size 768 representing its `attention_mask` status. Then we multiply the two tensors to apply the attention mask:

In [10]:
masked_embeddings = embeddings * mask
masked_embeddings.shape

torch.Size([7, 128, 768])

In [11]:
masked_embeddings[0]

tensor([[-0.3212,  0.8251,  1.0554,  ..., -0.1855,  0.1517,  0.3937],
        [-0.7146,  1.0297,  1.1217,  ...,  0.0331,  0.2382, -0.1563],
        [-0.2352,  1.1353,  0.8594,  ..., -0.4310, -0.0272, -0.2968],
        ...,
        [-0.0000,  0.0000,  0.0000,  ...,  0.0000, -0.0000,  0.0000],
        [-0.0000,  0.0000,  0.0000,  ...,  0.0000, -0.0000,  0.0000],
        [-0.0000,  0.0000,  0.0000,  ...,  0.0000, -0.0000,  0.0000]],
       grad_fn=<SelectBackward0>)
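To see the effect of the mask in isolation, here is a minimal sketch on a toy tensor. The shapes (one sentence, three token positions, hidden size 2) and the values are made up for illustration; only `unsqueeze`, `expand`, and the elementwise multiplication are taken from the cells above.

```python
import torch

# toy "embeddings": 1 sentence, 3 token positions, hidden size 2 (illustrative values)
toy_embeddings = torch.tensor([[[1.0, 2.0],
                                [3.0, 4.0],
                                [5.0, 6.0]]])
# the last position is padding, so its mask value is 0
toy_attention_mask = torch.tensor([[1, 1, 0]])

# add a trailing dimension, then repeat it across the hidden size
toy_mask = toy_attention_mask.unsqueeze(-1).expand(toy_embeddings.size()).float()

# real tokens keep their values, the padded position becomes all zeros
toy_masked = toy_embeddings * toy_mask
print(toy_mask.shape)
print(toy_masked)
```

The `-0.0000` entries in the notebook output are ordinary floating-point negative zeros, produced when a negative activation is multiplied by `0.`; they behave like zero in the sums that follow.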

Then we sum the remaining embeddings along axis `1` (the token axis):

In [12]:
summed = torch.sum(masked_embeddings, 1)
summed.shape

torch.Size([7, 768])

Then we count the number of values that must be given attention in each position of the tensor, clamping the count to a tiny minimum so that we never divide by zero:

In [13]:
summed_mask = torch.clamp(mask.sum(1), min=1e-9)
summed_mask.shape

torch.Size([7, 768])

In [14]:
summed_mask

tensor([[22., 22., 22.,  ..., 22., 22., 22.],
        [15., 15., 15.,  ..., 15., 15., 15.],
        [15., 15., 15.,  ..., 15., 15., 15.],
        ...,
        [12., 12., 12.,  ..., 12., 12., 12.],
        [14., 14., 14.,  ..., 14., 14., 14.],
        [31., 31., 31.,  ..., 31., 31., 31.]])
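The `torch.clamp(..., min=1e-9)` call matters only when a row of the mask sums to zero. A minimal sketch with a made-up mask follows, where the second row is entirely padding; this is an assumption for illustration and does not occur in the sentences used here.

```python
import torch

# two toy masks of shape (tokens=3, hidden=2); the second one is all padding
toy_mask = torch.tensor([[[1., 1.], [1., 1.], [0., 0.]],
                         [[0., 0.], [0., 0.], [0., 0.]]])

raw_counts = toy_mask.sum(1)                     # second row sums to 0
safe_counts = torch.clamp(raw_counts, min=1e-9)  # 0 is raised to 1e-9

# masked embedding sums (illustrative values); an all-padding row sums to 0
toy_summed = torch.tensor([[4., 6.],
                           [0., 0.]])

print(toy_summed / raw_counts)   # second row is nan, from 0 / 0
print(toy_summed / safe_counts)  # second row is 0, all values finite
```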

Finally, we calculate the mean as the sum of the embedding activations `summed` divided by the number of values that should be given attention in each position `summed_mask`:

In [15]:
mean_pooled = summed / summed_mask

In [16]:
mean_pooled

tensor([[-0.3715,  0.9729,  1.0840,  ..., -0.2552, -0.2759,  0.0358],
        [ 0.0745,  0.8637,  0.1795,  ...,  0.7734,  1.7247, -0.1803],
        [-0.5030,  0.7950, -0.1240,  ...,  0.1441,  0.9704, -0.1791],
        ...,
        [-0.2019,  0.0597,  0.8603,  ..., -0.0100,  0.8431, -0.0841],
        [-0.2131,  1.0175, -0.8833,  ...,  0.7371,  0.1947, -0.3011],
        [-0.0152,  0.6374,  1.1590,  ..., -0.2850, -0.2945,  0.1262]],
       grad_fn=<DivBackward0>)

And that is how we calculate our dense sentence vectors: one 768-dimensional embedding per sentence.
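The pooling steps above can be collected into one helper. The following is a sketch rather than part of the original notebook: the function name `mean_pool` and the toy inputs are hypothetical, and the result is easy to check by hand against a plain mean over the non-padding tokens.

```python
import torch

def mean_pool(embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over the positions where attention_mask is 1."""
    # expand the mask from (batch, tokens) to (batch, tokens, hidden)
    mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
    # zero out padding positions, then sum over the token axis
    summed = torch.sum(embeddings * mask, 1)
    # number of real tokens per position, clamped to avoid division by zero
    counts = torch.clamp(mask.sum(1), min=1e-9)
    return summed / counts

# toy batch: 2 sentences, 3 token positions, hidden size 2 (illustrative values)
toy_embeddings = torch.tensor([[[1., 2.], [3., 4.], [9., 9.]],
                               [[2., 0.], [8., 8.], [8., 8.]]])
toy_attention_mask = torch.tensor([[1, 1, 0],
                                   [1, 0, 0]])

pooled = mean_pool(toy_embeddings, toy_attention_mask)
print(pooled)  # rows [2., 3.] and [2., 0.]: the padded 8s and 9s are ignored
```

With the notebook's variables, the equivalent call would be `mean_pool(outputs.last_hidden_state, tokens['attention_mask'])`.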

In [17]:
from sklearn.metrics.pairwise import cosine_similarity
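For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal NumPy sketch of that definition follows; the helper name `cosine_sim` and the toy vectors are made up for illustration, and this is not a description of how scikit-learn implements it internally.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])
w = np.array([0.0, 2.0])

print(cosine_sim(u, v))  # about 0.707: the vectors are 45 degrees apart
print(cosine_sim(u, w))  # 0.0: the vectors are orthogonal
```

Because the score depends only on the angle between the vectors, scaling either vector leaves it unchanged.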

Let's calculate cosine similarity for sentence `0`:

In [18]:
# convert from PyTorch tensor to numpy array
mean_pooled = mean_pooled.detach().numpy()

# calculate
cosine_similarity(
    [mean_pooled[0]],
    mean_pooled[1:]
)

array([[0.33088917, 0.24826953, 0.2923194 , 0.20174849, 0.2950728 ,
        0.8143991 ]], dtype=float32)

These similarities translate to:

| Index | Sentence | Similarity |
| --- | --- | --- |
| 0 | "Three years later, the coffin was still full of Jello." | 0.331 |
| 1 | "The person box was packed with jelly many dozens of months later." | 0.248 |
| 2 | "Standing on one's head at job interviews forms a lasting impression." | 0.292 |
| 3 | "It took him a month to finish the meal." | 0.202 |
| 4 | "He found a leprechaun in his walnut shell." | 0.295 |
| 5 | "The aquatic, gill-bearing animal being had night visions of fleeing the spherical glass prison and into the potty where he saw his mates go." | 0.814 |

So, as intended, the most similar sentence is the one at index **5** of the result (the last entry in `sentences`) - which carries the same meaning as our first sentence, without using the same words:

`"The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go."`
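To pick out the best match programmatically rather than by eye, the scores can be ranked with `np.argsort`. This sketch hard-codes the similarity values printed earlier so that it runs on its own; in the notebook you would pass the array returned by `cosine_similarity` instead. The variable names are illustrative.

```python
import numpy as np

# similarity scores copied from the notebook output (sentence 0 vs. the other six)
scores = np.array([0.33088917, 0.24826953, 0.2923194,
                   0.20174849, 0.2950728, 0.8143991])

# indices into `scores`, sorted from most to least similar
ranking = np.argsort(scores)[::-1]
best = int(ranking[0])

print(best)      # 5
print(best + 1)  # 6, the position of the match in the original `sentences` list
```

The offset of one is needed because sentence `0` was excluded from the comparison, so index `i` in the result corresponds to `sentences[i + 1]`.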