# Building Dense Vectors Using Transformers

We will be using the [`sentence-transformers/stsb-distilbert-base`](https://huggingface.co/sentence-transformers/stsb-distilbert-base) model to build our dense vectors.

In [2]:
from transformers import AutoTokenizer, AutoModel
import torch

First we initialize our model and tokenizer:

In [4]:
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/stsb-distilbert-base')
model = AutoModel.from_pretrained('sentence-transformers/stsb-distilbert-base')

Then we tokenize a sentence just as we have been doing before:

In [3]:
text = "hello world what a time to be alive!"

tokens = tokenizer.encode_plus(text, max_length=128,
                               truncation=True, padding='max_length',
                               return_tensors='pt')

We process these tokens through our model:

In [4]:
outputs = model(**tokens)
outputs

BaseModelOutput(last_hidden_state=tensor([[[-0.9489,  0.6905, -0.2188,  ...,  0.0161,  0.5874, -0.1449],
         [-0.6643,  1.1984, -0.1346,  ...,  0.4839,  0.6338, -0.5003],
         [-0.3289,  0.6412,  0.2473,  ..., -0.0965,  0.4298,  0.0515],
         ...,
         [-0.7853,  0.8094, -0.2639,  ...,  0.2177,  0.3335,  0.1107],
         [-0.7528,  0.6285, -0.0088,  ...,  0.1024,  0.4585,  0.1720],
         [-1.0754,  0.4878, -0.3458,  ...,  0.2764,  0.5604,  0.1236]]],
       grad_fn=<NativeLayerNormBackward>), hidden_states=None, attentions=None)

The dense vector representations of our `text` are contained in the `last_hidden_state` tensor of `outputs`, which we access like so:

In [5]:
embeddings = outputs.last_hidden_state
embeddings

tensor([[[-0.9489,  0.6905, -0.2188,  ...,  0.0161,  0.5874, -0.1449],
         [-0.6643,  1.1984, -0.1346,  ...,  0.4839,  0.6338, -0.5003],
         [-0.3289,  0.6412,  0.2473,  ..., -0.0965,  0.4298,  0.0515],
         ...,
         [-0.7853,  0.8094, -0.2639,  ...,  0.2177,  0.3335,  0.1107],
         [-0.7528,  0.6285, -0.0088,  ...,  0.1024,  0.4585,  0.1720],
         [-1.0754,  0.4878, -0.3458,  ...,  0.2764,  0.5604,  0.1236]]],
       grad_fn=<NativeLayerNormBackward>)

In [6]:
embeddings.shape

torch.Size([1, 128, 768])

After we have produced our dense vectors `embeddings`, we need to perform a *mean pooling* operation on them to create a single vector encoding (the **sentence embedding**). To do this, we multiply each value in our `embeddings` tensor by its respective `attention_mask` value, so that padding tokens (positions where the mask is `0`) do not contribute to the mean.
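To see why the mask matters, here is a minimal sketch on a hand-made tensor. The names `toy_embeddings` and `toy_attention_mask` are illustrative and not part of the notebook, and the values are invented. It relies on broadcasting rather than an explicit `expand`, which should give the same result for this case.

```python
import torch

# Toy "embeddings": 1 sentence, 3 token positions, 2 dimensions.
# The last row stands in for a padding token with arbitrary values.
toy_embeddings = torch.tensor([[[1.0, 2.0],
                                [3.0, 4.0],
                                [9.0, 9.0]]])
# 1 marks a real token, 0 marks padding.
toy_attention_mask = torch.tensor([[1, 1, 0]])

# Zero out the padding row, then divide by the number of real tokens.
toy_mask = toy_attention_mask.unsqueeze(-1).float()  # shape [1, 3, 1]
masked_mean = (toy_embeddings * toy_mask).sum(1) / toy_mask.sum(1)

# A plain mean over all positions lets the padding row leak in.
naive_mean = toy_embeddings.mean(1)

print(masked_mean)  # averages only the first two rows: 2.0 and 3.0
print(naive_mean)   # pulled upward by the padding row
```

The masked mean depends only on the two real tokens, whereas the naive mean changes whenever the padding values change.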

To perform this operation, we first expand our `attention_mask` tensor so that its shape matches that of `embeddings`:

In [7]:
attention_mask = tokens['attention_mask']
attention_mask.shape

torch.Size([1, 128])

In [8]:
mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
mask.shape

torch.Size([1, 128, 768])
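The two calls in that cell can be followed on a much smaller tensor. This is a sketch with made-up sizes (3 tokens, 4 dimensions), and the `toy_` names are hypothetical.

```python
import torch

toy_attention_mask = torch.tensor([[1, 1, 0]])  # shape [1, 3]
toy_embeddings = torch.zeros(1, 3, 4)           # shape [1, 3, 4]

# unsqueeze(-1) appends a trailing axis of size 1: [1, 3] -> [1, 3, 1]
column = toy_attention_mask.unsqueeze(-1)

# expand(...) repeats each mask value across the embedding axis: -> [1, 3, 4]
toy_mask = column.expand(toy_embeddings.size()).float()

print(column.shape)    # torch.Size([1, 3, 1])
print(toy_mask.shape)  # torch.Size([1, 3, 4])
print(toy_mask[0])     # two rows of ones, then one row of zeros
```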

In [9]:
attention_mask

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]])

In [20]:
mask[0][0].shape

torch.Size([768])

In [10]:
mask

tensor([[[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]])

Each vector above represents the attention mask of a single token: each token now has a vector of size 768 representing its `attention_mask` status. Then we multiply the two tensors to apply the attention mask:

In [11]:
masked_embeddings = embeddings * mask
masked_embeddings.shape

torch.Size([1, 128, 768])

In [12]:
masked_embeddings

tensor([[[-0.9489,  0.6905, -0.2188,  ...,  0.0161,  0.5874, -0.1449],
         [-0.6643,  1.1984, -0.1346,  ...,  0.4839,  0.6338, -0.5003],
         [-0.3289,  0.6412,  0.2473,  ..., -0.0965,  0.4298,  0.0515],
         ...,
         [-0.0000,  0.0000, -0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [-0.0000,  0.0000, -0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [-0.0000,  0.0000, -0.0000,  ...,  0.0000,  0.0000,  0.0000]]],
       grad_fn=<MulBackward0>)

Then we sum the remaining embeddings along axis `1`:

In [13]:
summed = torch.sum(masked_embeddings, 1)
summed.shape

torch.Size([1, 768])

Then we sum the mask along axis `1` to count the number of values that must be given attention in each position of the tensor:

In [14]:
summed_mask = torch.clamp(mask.sum(1), min=1e-9)
summed_mask.shape

torch.Size([1, 768])
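The notebook does not say why `torch.clamp(..., min=1e-9)` is used. A likely reason is to guard against division by zero when a sequence has no attended tokens. The sketch below illustrates that assumption with a hypothetical all-zero mask; the `empty_` names are not from the notebook.

```python
import torch

# Hypothetical edge case: a sequence whose attention mask is all zeros.
empty_mask = torch.zeros(1, 3, 4)
empty_summed = torch.zeros(1, 4)  # the masked embeddings then also sum to zero

unclamped = empty_mask.sum(1)                       # all zeros
clamped = torch.clamp(empty_mask.sum(1), min=1e-9)  # tiny positive floor

print(empty_summed / unclamped)  # 0 / 0 yields nan
print(empty_summed / clamped)    # 0 / 1e-9 yields 0

# Ordinary counts such as 11 are far above the floor and pass through unchanged.
print(torch.clamp(torch.tensor([11.0]), min=1e-9))
```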

In [17]:
summed_mask

tensor([[11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11.,
         11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11.,
         11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11.,
         11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11.,
         11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11.,
         11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11.,
         11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11.,
         11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11.,
         11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11.,
         11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11.,
         11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11.,
         11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11., 11.,
         11., 11., 11., 11., 11., 11., 11., 11., 11., ...]])

Finally, we calculate the mean as the sum of the embedding activations `summed` divided by the number of values that should be given attention in each position `summed_mask`:

In [15]:
mean_pooled = summed / summed_mask

In [16]:
mean_pooled

tensor([[-3.8485e-01,  7.8107e-01, -1.7720e-01, -1.4125e+00, -2.3358e-01,
          9.0891e-01, -7.8390e-02,  6.0347e-01,  6.7886e-02, -3.9842e-01,
          3.9223e-02, -4.6774e-01, -7.1848e-01, -1.1863e-01, -7.1194e-02,
          6.6017e-03, -1.4093e-01,  3.1271e-01, -6.5574e-01, -1.6470e-01,
         -1.0026e-01, -3.8357e-01,  6.1278e-02, -7.3818e-01, -5.9918e-01,
          2.8855e-01,  8.6372e-01,  5.8388e-01, -3.5059e-02,  4.3197e-01,
         -5.0111e-01, -4.3498e-01,  2.3498e-01, -3.7127e-01, -1.0044e+00,
          1.0000e+00, -2.1000e+00, -3.2251e-01, -1.6085e-01, -7.3701e-01,
          5.4928e-01, -1.2066e-01,  7.2698e-01, -5.0327e-02, -1.7545e+00,
          8.0573e-01, -5.0553e-01, -4.7172e-01, -1.6727e-01,  5.9727e-01,
          5.6203e-01, -3.6104e-01, -1.6429e-01, -5.5215e-01, -5.0417e-01,
          5.6187e-01, -1.1415e+00,  1.0771e+00,  5.5689e-01, -7.0632e-02,
         -2.6932e-01, -6.8905e-01,  1.8093e-01,  3.1045e-01,  3.9036e-02,
          3.1064e-01, -4.4495e-01, -4.

And that is how we calculate the dense vector (the sentence embedding) that we use for similarity comparisons.
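The steps above can be collected into a single helper. This is a sketch, not code from the notebook: `mean_pool` is a hypothetical name, and random tensors stand in for real model output so that it runs without downloading a model. The cosine similarity at the end is one common way to compare two sentence embeddings and is an assumption about how these vectors would be used.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over the positions where attention_mask is 1."""
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    summed = torch.sum(last_hidden_state * mask, 1)
    summed_mask = torch.clamp(mask.sum(1), min=1e-9)
    return summed / summed_mask

# Random stand-ins for model output: 2 sentences, 5 tokens, 8 dimensions.
torch.manual_seed(0)
fake_hidden = torch.randn(2, 5, 8)
fake_attention = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])

pooled = mean_pool(fake_hidden, fake_attention)
print(pooled.shape)  # torch.Size([2, 8])

# Compare the two pooled vectors; the result lies between -1 and 1.
similarity = torch.nn.functional.cosine_similarity(pooled[0], pooled[1], dim=0)
print(similarity)
```

With the real model, `fake_hidden` and `fake_attention` would be replaced by `outputs.last_hidden_state` and `tokens['attention_mask']`.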