# Calculating Similarity

When calculating similarity between our transformer-embedded vectors, we can use any of the *three* similarity metrics already covered.

But first, let's create some embeddings.

In [23]:
sentences = [
    "Karachi",
    "cosmopolitan",
    "marketplace",
    "city",
    "largest",
    "advertise"
]

# thanks to https://randomwordgenerator.com/sentence.php

In [24]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained('../../models/bert-base-cased-squad2')
model = AutoModelForQuestionAnswering.from_pretrained('../../models/bert-base-cased-squad2')

# initialize dictionary that will contain tokenized sentences
tokens = {'input_ids': [], 'attention_mask': []}

for sentence in sentences:
    # tokenize sentence and append to dictionary lists
    new_tokens = tokenizer.encode_plus(sentence, max_length=128, truncation=True,
                                       padding='max_length', return_tensors='pt')
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])

# reformat list of tensors into single tensor
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])

In [25]:
tokens['input_ids'].shape

torch.Size([6, 128])

We process these tokens through our model:

In [26]:
outputs = model(**tokens, output_hidden_states=True)
outputs.keys()

odict_keys(['start_logits', 'end_logits', 'hidden_states'])

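The dense vector representations of our sentences are contained in the `outputs` **'hidden_states'** tuple. This question-answering model returns `start_logits` and `end_logits` rather than a `last_hidden_state` tensor, so we request all hidden states and take the last one. For a 12-layer BERT base model, index `0` holds the embedding layer output and index `12` holds the output of the final encoder layer, which we access like so: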

In [27]:
embeddings = outputs.hidden_states[12]
embeddings

tensor([[[ 0.0228,  0.3132, -0.7590,  ...,  0.4847, -0.0601, -1.1486],
         [ 0.2346, -0.6184,  0.4789,  ..., -0.3706,  0.4964, -1.1067],
         [-0.4202,  0.1938,  0.3223,  ...,  0.4876,  0.5470, -1.6546],
         ...,
         [ 0.2705, -0.7980,  1.0575,  ...,  0.1251,  0.5117, -1.2836],
         [ 0.2530, -0.8059,  1.0278,  ...,  0.0902,  0.5184, -1.3027],
         [ 0.2663, -0.7753,  0.9917,  ...,  0.0626,  0.5782, -1.2521]],

        [[ 0.6517, -0.4943,  0.4826,  ..., -0.5239,  1.0242,  0.4323],
         [ 0.3876, -0.3390,  0.5466,  ..., -0.7689,  0.2167, -0.4809],
         [ 0.0898,  0.0048,  1.3895,  ..., -0.6799,  0.6728, -0.6674],
         ...,
         [ 0.4997, -0.4367,  1.4845,  ..., -0.0255,  0.2526, -1.3542],
         [ 0.5141, -0.4033,  1.4899,  ..., -0.0335,  0.3059, -1.3596],
         [ 0.5047, -0.3955,  1.4584,  ..., -0.0819,  0.3089, -1.3398]],

        [[ 0.0577, -0.1037,  1.3791,  ..., -0.7789,  0.5880, -0.0669],
         [ 0.2101, -0.6814,  1.1493,  ..., -0

In [28]:
embeddings.shape

torch.Size([6, 128, 768])

After we have produced our dense vectors `embeddings`, we need to perform a *mean pooling* operation on them to create a single vector encoding (the **sentence embedding**). To do this, we multiply each value in our `embeddings` tensor by its respective `attention_mask` value, so that padding tokens are ignored.

To perform this operation, we first expand our `attention_mask` tensor to match the shape of `embeddings`:

In [29]:
attention_mask = tokens['attention_mask']
attention_mask.shape

torch.Size([6, 128])

In [30]:
mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
mask.shape

torch.Size([6, 128, 768])
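
To make the `unsqueeze` and `expand` step easier to follow, here is a minimal sketch on a toy batch. The sizes (2 sequences, 4 token positions, hidden size 3) and the `toy_` names are made up for illustration and are not part of the notebook above.

```python
import torch

# Toy batch (assumed sizes): 2 sequences, 4 token positions, hidden size 3
toy_mask = torch.tensor([[1, 1, 0, 0],
                         [1, 1, 1, 0]])
toy_embeddings = torch.ones(2, 4, 3)

# unsqueeze(-1) adds a trailing axis:        (2, 4)    -> (2, 4, 1)
# expand(...) repeats it along that axis:    (2, 4, 1) -> (2, 4, 3)
toy_expanded = toy_mask.unsqueeze(-1).expand(toy_embeddings.size()).float()

print(toy_expanded.shape)  # torch.Size([2, 4, 3])
print(toy_expanded[0])     # two rows of ones, then two rows of zeros
```

Every real token ends up with a row of ones and every padding token with a row of zeros, which is the pattern visible in the full-size `mask`.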

In [31]:
mask

tensor([[[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 

Each vector above represents a single token's attention mask: each token now has a vector of size 768 representing its *attention_mask* status. Then we multiply the two tensors to apply the attention mask:

In [32]:
masked_embeddings = embeddings * mask
masked_embeddings.shape

torch.Size([6, 128, 768])

In [33]:
masked_embeddings

tensor([[[ 0.0228,  0.3132, -0.7590,  ...,  0.4847, -0.0601, -1.1486],
         [ 0.2346, -0.6184,  0.4789,  ..., -0.3706,  0.4964, -1.1067],
         [-0.4202,  0.1938,  0.3223,  ...,  0.4876,  0.5470, -1.6546],
         ...,
         [ 0.0000, -0.0000,  0.0000,  ...,  0.0000,  0.0000, -0.0000],
         [ 0.0000, -0.0000,  0.0000,  ...,  0.0000,  0.0000, -0.0000],
         [ 0.0000, -0.0000,  0.0000,  ...,  0.0000,  0.0000, -0.0000]],

        [[ 0.6517, -0.4943,  0.4826,  ..., -0.5239,  1.0242,  0.4323],
         [ 0.3876, -0.3390,  0.5466,  ..., -0.7689,  0.2167, -0.4809],
         [ 0.0898,  0.0048,  1.3895,  ..., -0.6799,  0.6728, -0.6674],
         ...,
         [ 0.0000, -0.0000,  0.0000,  ..., -0.0000,  0.0000, -0.0000],
         [ 0.0000, -0.0000,  0.0000,  ..., -0.0000,  0.0000, -0.0000],
         [ 0.0000, -0.0000,  0.0000,  ..., -0.0000,  0.0000, -0.0000]],

        [[ 0.0577, -0.1037,  1.3791,  ..., -0.7789,  0.5880, -0.0669],
         [ 0.2101, -0.6814,  1.1493,  ..., -0

Then we sum the remaining embeddings along axis `1`:

In [34]:
summed = torch.sum(masked_embeddings, 1)
summed.shape

torch.Size([6, 768])

Then we sum the number of values that must be given attention in each position of the tensor, clamping the result to a tiny minimum so that the later division can never divide by zero:

In [35]:
summed_mask = torch.clamp(mask.sum(1), min=1e-9)
summed_mask.shape

torch.Size([6, 768])

In [36]:
summed_mask

tensor([[3., 3., 3.,  ..., 3., 3., 3.],
        [4., 4., 4.,  ..., 4., 4., 4.],
        [3., 3., 3.,  ..., 3., 3., 3.],
        [3., 3., 3.,  ..., 3., 3., 3.],
        [3., 3., 3.,  ..., 3., 3., 3.],
        [5., 5., 5.,  ..., 5., 5., 5.]])
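
The `torch.clamp(..., min=1e-9)` call only matters when a row of the mask is entirely zero. A small sketch of that edge case, using a hypothetical mask with one fully padded row (no such row occurs in the batch above):

```python
import torch

# Hypothetical mask: the second row is all padding
toy_mask = torch.tensor([[1., 1., 0.],
                         [0., 0., 0.]])
toy_sum = torch.tensor([4., 0.])   # pretend masked sums for one hidden dimension

counts = toy_mask.sum(1)                     # second entry is exactly 0
safe_counts = torch.clamp(counts, min=1e-9)  # the 0 is raised to 1e-9

unsafe = toy_sum / counts       # 0 / 0 gives nan in the second entry
safe = toy_sum / safe_counts    # 0 / 1e-9 gives 0 instead

print(unsafe)
print(safe)
```

Rows that contain at least one real token are unaffected, because their counts are already far above the clamp threshold.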

Finally, we calculate the mean as the sum of the embedding activations `summed` divided by the number of values that should be given attention in each position `summed_mask`:

In [37]:
mean_pooled = summed / summed_mask

In [38]:
mean_pooled

tensor([[-0.0543, -0.0371,  0.0141,  ...,  0.2006,  0.3278, -1.3033],
        [ 0.2214, -0.2275,  0.7728,  ..., -0.5768,  0.7151, -0.3456],
        [ 0.0468, -0.3281,  1.1293,  ..., -0.4582,  0.4110, -0.6053],
        [ 0.1323, -0.7955,  0.7883,  ..., -0.2233,  0.2522, -1.1343],
        [ 0.1689,  0.3100,  0.3806,  ..., -0.6839,  0.2257, -0.1748],
        [ 0.1389,  0.1010,  1.2612,  ..., -0.1814,  0.0524, -0.8526]],
       grad_fn=<DivBackward0>)

And that is how we calculate our dense sentence embeddings.
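
The pooling steps above can be collected into one function. This is a minimal sketch: the name `mean_pool` and the toy tensors are illustrative assumptions, and the body simply repeats the operations from the cells above.

```python
import torch

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over real (non-padding) tokens only.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    # Expand the mask so it lines up with every hidden dimension
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Zero out padding positions, then sum over the sequence axis
    summed = torch.sum(token_embeddings * mask, 1)
    # Count real tokens per sequence; clamp to avoid division by zero
    counts = torch.clamp(mask.sum(1), min=1e-9)
    return summed / counts

# Toy check: the third token is padding and must not affect the mean
toy_embeddings = torch.tensor([[[1., 2.], [3., 4.], [100., 100.]]])
toy_mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(toy_embeddings, toy_mask)
print(pooled)  # tensor([[2., 3.]])
```

With the tensors from this notebook, the equivalent call would be `mean_pool(embeddings, tokens['attention_mask'])`.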

In [39]:
from sklearn.metrics.pairwise import cosine_similarity
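
For reference, cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. The sketch below computes it by hand with NumPy on two made-up 2-D vectors; `cosine_sim` is a hypothetical helper name, not a scikit-learn function.

```python
import numpy as np

def cosine_sim(a, b):
    # dot(a, b) / (||a|| * ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])

print(cosine_sim(u, v))      # about 0.7071 (a 45 degree angle)
print(cosine_sim(u, 2 * u))  # 1.0, since scaling a vector does not change the angle
```

Because only the angle between vectors matters, sentence embeddings of very different magnitude can still score as highly similar.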

Let's calculate cosine similarity for sentence `0`:

In [40]:
# convert from PyTorch tensor to numpy array
mean_pooled = mean_pooled.detach().numpy()

In [41]:
# calculate
sim = cosine_similarity(
    [mean_pooled[0]],
    mean_pooled[1:]
)

print(sentences)

['Karachi', 'cosmopolitan', 'marketplace', 'city', 'largest', 'advertise']


In [42]:
print(sim)

[[0.6263751  0.59248245 0.6986703  0.536074   0.6119157 ]]
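
To rank the candidates without reading the array by eye, the scores can be sorted with `np.argsort`. This sketch copies the values from the `print(sim)` and `print(sentences)` outputs above into literals so that it runs on its own; in the notebook you would use `sim[0]` and `sentences[1:]` instead.

```python
import numpy as np

# Values copied from the printed outputs above
candidates = ['cosmopolitan', 'marketplace', 'city', 'largest', 'advertise']
scores = np.array([0.6263751, 0.59248245, 0.6986703, 0.536074, 0.6119157])

# argsort is ascending, so reverse it to get the highest score first
order = np.argsort(scores)[::-1]

for i in order:
    print(f"{candidates[i]}: {scores[i]:.4f}")
```

The first line printed is the candidate closest to sentence `0`.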


These similarities translate to:

| Index | Sentence | Similarity |
| --- | --- | --- |
| 1 | "cosmopolitan" | 0.6264 |
| 2 | "marketplace" | 0.5925 |
| 3 | "city" | 0.6987 |
| 4 | "largest" | 0.5361 |
| 5 | "advertise" | 0.6119 |


So the most similar entry is that in index **3**, which is closely related in meaning to our first entry, `"Karachi"`, without sharing any of its characters:

`"city"`