# Calculating Similarity

To calculate similarity between our transformer-embedded vectors, we can use any of the *three* similarity metrics already covered.

But first, let's create some embeddings.

In [1]:
sentences = [
    "Three years later, the coffin was still full of Jello.",
    "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
    "The person box was packed with jelly many dozens of months later.",
    "Standing on one's head at job interviews forms a lasting impression.",
    "It took him a month to finish the meal.",
    "He found a leprechaun in his walnut shell."
]

# thanks to https://randomwordgenerator.com/sentence.php

In [3]:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('../../models/bert-base-cased-squad2')
model = AutoModel.from_pretrained('../../models/bert-base-cased-squad2')

# initialize dictionary that will contain tokenized sentences
tokens = {'input_ids': [], 'attention_mask': []}

for sentence in sentences:
    # tokenize sentence and append to dictionary lists
    # calling the tokenizer directly replaces the deprecated encode_plus method
    new_tokens = tokenizer(sentence, max_length=128, truncation=True,
                           padding='max_length', return_tensors='pt')
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])

# reformat list of tensors into single tensor
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])

Some weights of the model checkpoint at ../../models/bert-base-cased-squad2 were not used when initializing BertModel: ['qa_outputs.bias', 'qa_outputs.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
tokens['input_ids'].shape

torch.Size([6, 128])

We process these tokens through our model:

In [5]:
outputs = model(**tokens)
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

The dense vector representations of our text are contained in the `last_hidden_state` tensor of `outputs`, which we access like so:

In [6]:
embeddings = outputs.last_hidden_state
embeddings

tensor([[[-0.0316,  0.3213,  1.0839,  ..., -0.0315,  0.3153, -0.3369],
         [-0.2773,  0.3218,  1.0824,  ..., -0.1640, -0.7640, -0.0356],
         [ 0.3556, -0.1141,  1.7803,  ...,  0.4573,  0.0182, -1.3971],
         ...,
         [ 0.0284, -0.7136,  1.3945,  ...,  0.1442, -0.0084, -1.7079],
         [ 0.0802, -0.6769,  1.3921,  ...,  0.1274,  0.0058, -1.6575],
         [ 0.2010, -0.7192,  1.3515,  ...,  0.1734, -0.0769, -1.5365]],

        [[-0.3922,  0.3952,  0.7426,  ..., -0.0303,  0.5689, -0.2515],
         [-0.0624, -0.9280,  1.2167,  ...,  0.2665,  0.1439, -0.4486],
         [ 0.2753, -0.4714,  1.0930,  ...,  0.2404,  1.1289, -0.4512],
         ...,
         [-0.0721, -0.7183,  1.3501,  ...,  0.1265, -0.2407, -1.4680],
         [-0.0512, -0.6419,  1.3827,  ...,  0.0417, -0.1954, -1.5097],
         [-0.0870, -0.6302,  1.3772,  ...,  0.0955, -0.1894, -1.5086]],

        [[ 0.0136,  0.2703,  1.2669,  ..., -0.0415,  0.5272, -0.3850],
         [-0.4467, -0.7345,  1.0505,  ...

In [7]:
embeddings.shape

torch.Size([6, 128, 768])

After we have produced our dense vectors `embeddings`, we need to perform a *mean pooling* operation on them to create a single vector encoding for each sentence (the **sentence embedding**). To do this, we multiply each value in our `embeddings` tensor by its respective `attention_mask` value, so that padding tokens are ignored.

To perform this operation, we first expand our `attention_mask` tensor to match the shape of `embeddings`:

In [8]:
attention_mask = tokens['attention_mask']
attention_mask.shape

torch.Size([6, 128])

In [9]:
mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
mask.shape

torch.Size([6, 128, 768])
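
To see what `unsqueeze(-1).expand(...)` does, here is a minimal sketch on a toy tensor. The shapes (2 sentences, 3 token positions, hidden size 4) and the name `toy_mask` are made up for illustration and are not part of the notebook above.

```python
import torch

# toy attention mask: 2 sentences, 3 token positions each (1 = real token, 0 = padding)
toy_mask = torch.tensor([[1, 1, 0],
                         [1, 0, 0]])

# add a trailing axis -> (2, 3, 1), then repeat it across a hidden size of 4 -> (2, 3, 4)
expanded = toy_mask.unsqueeze(-1).expand(2, 3, 4).float()

print(expanded.shape)
print(expanded[0])  # rows of ones for real tokens, a row of zeros for the padding token
```

Every token position ends up with a full row of ones or zeros, which is what lets the mask be multiplied element-wise against the embeddings.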

In [10]:
mask

tensor([[[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         ...

Each vector above represents the attention mask of a single token: each token now has a vector of size 768 representing its *attention_mask* status. We then multiply the two tensors to apply the attention mask:

In [11]:
masked_embeddings = embeddings * mask
masked_embeddings.shape

torch.Size([6, 128, 768])

In [12]:
masked_embeddings

tensor([[[-0.0316,  0.3213,  1.0839,  ..., -0.0315,  0.3153, -0.3369],
         [-0.2773,  0.3218,  1.0824,  ..., -0.1640, -0.7640, -0.0356],
         [ 0.3556, -0.1141,  1.7803,  ...,  0.4573,  0.0182, -1.3971],
         ...,
         [ 0.0000, -0.0000,  0.0000,  ...,  0.0000, -0.0000, -0.0000],
         [ 0.0000, -0.0000,  0.0000,  ...,  0.0000,  0.0000, -0.0000],
         [ 0.0000, -0.0000,  0.0000,  ...,  0.0000, -0.0000, -0.0000]],

        [[-0.3922,  0.3952,  0.7426,  ..., -0.0303,  0.5689, -0.2515],
         [-0.0624, -0.9280,  1.2167,  ...,  0.2665,  0.1439, -0.4486],
         [ 0.2753, -0.4714,  1.0930,  ...,  0.2404,  1.1289, -0.4512],
         ...,
         [-0.0000, -0.0000,  0.0000,  ...,  0.0000, -0.0000, -0.0000],
         [-0.0000, -0.0000,  0.0000,  ...,  0.0000, -0.0000, -0.0000],
         [-0.0000, -0.0000,  0.0000,  ...,  0.0000, -0.0000, -0.0000]],

        [[ 0.0136,  0.2703,  1.2669,  ..., -0.0415,  0.5272, -0.3850],
         [-0.4467, -0.7345,  1.0505,  ...

Then we sum the remaining embeddings along axis `1`:

In [13]:
summed = torch.sum(masked_embeddings, 1)
summed.shape

torch.Size([6, 768])

Then we count the number of values that must be given attention in each position of the tensor, clamping to a small minimum so that we never divide by zero:

In [14]:
summed_mask = torch.clamp(mask.sum(1), min=1e-9)
summed_mask.shape

torch.Size([6, 768])

In [15]:
summed_mask

tensor([[15., 15., 15.,  ..., 15., 15., 15.],
        [22., 22., 22.,  ..., 22., 22., 22.],
        [16., 16., 16.,  ..., 16., 16., 16.],
        [16., 16., 16.,  ..., 16., 16., 16.],
        [12., 12., 12.,  ..., 12., 12., 12.],
        [17., 17., 17.,  ..., 17., 17., 17.]])

Finally, we calculate the mean as the sum of the embedding activations `summed` divided by the number of values that should be given attention in each position `summed_mask`:

In [16]:
mean_pooled = summed / summed_mask

In [17]:
mean_pooled

tensor([[-0.0594, -0.2442,  1.1076,  ...,  0.3828, -0.0689, -0.7749],
        [ 0.1026, -0.1534,  0.6878,  ...,  0.2588,  0.3581, -0.7802],
        [-0.0840,  0.0138,  1.1218,  ...,  0.0100, -0.0371, -0.6318],
        [-0.0884, -0.4308,  1.1361,  ..., -0.0542, -0.1450, -0.7998],
        [ 0.0566, -0.4325,  0.7826,  ...,  0.2553,  0.0868, -0.4280],
        [ 0.1879, -0.2350,  1.1156,  ...,  0.0360,  0.0325, -0.8093]],
       grad_fn=<DivBackward0>)

And that is how we calculate our dense sentence vectors, which we can now compare for similarity.
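
The pooling steps above can be collected into a single helper. This is a minimal sketch rather than a library function: the name `mean_pool` and the toy tensors are assumptions introduced here for illustration.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over real (non-padding) tokens only."""
    # broadcast the (batch, seq_len) mask to (batch, seq_len, hidden)
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    # zero out padding positions and sum over the token axis
    summed = torch.sum(last_hidden_state * mask, dim=1)
    # count real tokens per sentence; clamp avoids division by zero
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts

# toy input: 1 sentence, 3 token positions (the last is padding), hidden size 2
toy_hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_attention = torch.tensor([[1, 1, 0]])

pooled = mean_pool(toy_hidden, toy_attention)
print(pooled)  # tensor([[2., 3.]]) - the padding vector does not affect the mean
```

With the variables from this notebook, the equivalent call would be `mean_pool(outputs.last_hidden_state, tokens['attention_mask'])`.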

In [18]:
from sklearn.metrics.pairwise import cosine_similarity

Let's calculate cosine similarity for sentence `0`:

In [19]:
# convert from PyTorch tensor to numpy array
mean_pooled = mean_pooled.detach().numpy()

# calculate
cosine_similarity(
    [mean_pooled[0]],
    mean_pooled[1:]
)

array([[0.914305  , 0.93545866, 0.88925207, 0.87824136, 0.9339597 ]],
      dtype=float32)

These similarities translate to:

| Index | Sentence | Similarity |
| --- | --- | --- |
| 1 | "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go." | 0.9143 |
| 2 | "The person box was packed with jelly many dozens of months later." | 0.9355 |
| 3 | "Standing on one's head at job interviews forms a lasting impression." | 0.8893 |
| 4 | "It took him a month to finish the meal." | 0.8782 |
| 5 | "He found a leprechaun in his walnut shell." | 0.9340 |


So, as intended, the most similar sentence is the one at index **2**, which carries the same meaning as our first sentence without using the same words:

`"Three years later, the coffin was still full of Jello."`
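
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below writes that out in plain Python and picks the best match. The function name `cosine` and the 3-dimensional vectors are hypothetical stand-ins, not real sentence embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hypothetical vectors standing in for one query embedding and three candidates
query = [1.0, 2.0, 3.0]
candidates = [[2.0, 4.0, 6.0], [3.0, 2.0, 1.0], [-1.0, -2.0, -3.0]]

scores = [cosine(query, c) for c in candidates]
# position of the highest score within `candidates`
best = max(range(len(scores)), key=scores.__getitem__)

print(scores)
print(best)
```

Note that `best` is a position within `candidates`. In the notebook, the candidates are `mean_pooled[1:]`, so position `i` in the result corresponds to sentence `i + 1`.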