# Building Dense Vectors Using Transformers

We will be using the [`sentence-transformers/bert-base-nli-mean-tokens`](https://huggingface.co/sentence-transformers/bert-base-nli-mean-tokens) model to build our dense vectors.

In [1]:
from transformers import AutoTokenizer, AutoModel
import torch

First we initialize our model and tokenizer:

In [3]:
model_name = 'sentence-transformers/bert-base-nli-mean-tokens'

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

Then we tokenize a sentence just as we have been doing before:

In [5]:
text = "hello world what a time to be alive!"

tokens = tokenizer.encode_plus(text, max_length=128,
                               truncation=True, padding='max_length',
                               return_tensors='pt')

We process these tokens through our model:

In [7]:
outputs = model(**tokens)
outputs[0]

tensor([[[ 3.0681e-01, -7.8805e-02,  1.7431e+00,  ..., -2.5349e-02,
          -1.1080e-01,  4.8311e-02],
         [ 7.1302e-01,  1.0437e-01,  1.8346e+00,  ...,  1.1343e-01,
          -7.5564e-02,  1.2668e-01],
         [ 8.1722e-01,  1.1321e-01,  1.5408e+00,  ..., -3.8067e-01,
           8.7477e-02, -1.9020e-01],
         ...,
         [ 5.4669e-01,  1.7181e-01,  1.1392e+00,  ...,  3.8548e-02,
          -1.5396e-01,  2.3015e-01],
         [ 3.4457e-01,  1.3151e-01,  1.1324e+00,  ..., -1.4211e-03,
          -1.7517e-01,  1.5220e-01],
         [ 3.2320e-01,  3.3350e-03,  1.1888e+00,  ...,  1.6736e-02,
          -2.0864e-01,  8.9316e-02]]], grad_fn=<NativeLayerNormBackward0>)

### Last Hidden State Tensor

The dense vector representations of the tokens in our `text` are held in the **`last_hidden_state`** tensor of `outputs` (the same tensor we saw as `outputs[0]`), which we access like so:

In [8]:
embeddings = outputs.last_hidden_state
embeddings

tensor([[[ 3.0681e-01, -7.8805e-02,  1.7431e+00,  ..., -2.5349e-02,
          -1.1080e-01,  4.8311e-02],
         [ 7.1302e-01,  1.0437e-01,  1.8346e+00,  ...,  1.1343e-01,
          -7.5564e-02,  1.2668e-01],
         [ 8.1722e-01,  1.1321e-01,  1.5408e+00,  ..., -3.8067e-01,
           8.7477e-02, -1.9020e-01],
         ...,
         [ 5.4669e-01,  1.7181e-01,  1.1392e+00,  ...,  3.8548e-02,
          -1.5396e-01,  2.3015e-01],
         [ 3.4457e-01,  1.3151e-01,  1.1324e+00,  ..., -1.4211e-03,
          -1.7517e-01,  1.5220e-01],
         [ 3.2320e-01,  3.3350e-03,  1.1888e+00,  ...,  1.6736e-02,
          -2.0864e-01,  8.9316e-02]]], grad_fn=<NativeLayerNormBackward0>)

In [9]:
embeddings.shape

torch.Size([1, 128, 768])

### Mean pooling 

After we have produced our dense vectors `embeddings`, we need to perform a *mean pooling* operation on them to create a single vector encoding (the **sentence embedding**). To do this, we multiply each value in our `embeddings` tensor by its respective `attention_mask` value, so that padding tokens do not contribute to the mean.
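Written as a formula, the operation looks roughly like the following. The symbols are introduced here for illustration only: $\mathbf{h}_i$ is the 768-dimensional embedding of token position $i$, $m_i \in \{0, 1\}$ is its `attention_mask` value, $n = 128$ is the padded sequence length, and $\epsilon$ is a small constant (`1e-9` in the code further down) that guards against division by zero.

$$
\mathbf{s} = \frac{\sum_{i=1}^{n} m_i \, \mathbf{h}_i}{\max\left(\sum_{i=1}^{n} m_i,\ \epsilon\right)}
$$

The numerator is the sum of the embeddings of the real tokens, and the denominator is the number of real tokens, so $\mathbf{s}$ is their average.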

To perform this operation, we first expand our `attention_mask` tensor so that its shape matches that of `embeddings`:

In [10]:
attention_mask = tokens['attention_mask']
attention_mask.shape

torch.Size([1, 128])

In [14]:
# Add an additional dimension to tensor
attention_mask.unsqueeze(-1).shape

torch.Size([1, 128, 1])

In [13]:
# Expand to match the size of our embeddings
attention_mask.unsqueeze(-1).expand(embeddings.shape).shape

torch.Size([1, 128, 768])

In [15]:
# Convert to float
mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()

In [16]:
attention_mask

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]])

In [17]:
mask[0][0].shape

torch.Size([768])

In [18]:
mask

tensor([[[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]])

Each vector above is the attention mask for a single token - each token now has a vector of size 768 representing its *attention_mask* status (all ones for a real token, all zeros for padding). Then we multiply the two tensors to apply the attention mask:

In [19]:
masked_embeddings = embeddings * mask
masked_embeddings.shape

torch.Size([1, 128, 768])

In [20]:
masked_embeddings

tensor([[[ 0.3068, -0.0788,  1.7431,  ..., -0.0253, -0.1108,  0.0483],
         [ 0.7130,  0.1044,  1.8346,  ...,  0.1134, -0.0756,  0.1267],
         [ 0.8172,  0.1132,  1.5408,  ..., -0.3807,  0.0875, -0.1902],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000, -0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ..., -0.0000, -0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000, -0.0000,  0.0000]]],
       grad_fn=<MulBackward0>)
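The real tensors are too large to read in full, so here is a minimal sketch of the same masking steps on a toy tensor. The names `toy_embeddings`, `toy_attention_mask`, `toy_mask` and `toy_masked`, and all the values, are made up for illustration and are not part of the notebook's data.

```python
import torch

# Toy batch: 1 sentence, 3 token positions, hidden size 2 (illustrative values)
toy_embeddings = torch.tensor([[[1.0, 2.0],
                                [3.0, 4.0],
                                [5.0, 6.0]]])

# The last position is treated as padding
toy_attention_mask = torch.tensor([[1, 1, 0]])

# Same unsqueeze / expand / float steps as used for the real mask
toy_mask = toy_attention_mask.unsqueeze(-1).expand(toy_embeddings.size()).float()

# Element-wise multiplication zeroes out the padding position
toy_masked = toy_embeddings * toy_mask
print(toy_masked)
```

The first two rows come through unchanged, while the padded third row becomes zeros, which is the same pattern visible at the end of `masked_embeddings`.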

In [21]:
# For comparison, the original embeddings: the padding positions toward the end are still
# non-zero here, whereas in masked_embeddings they have been set to zero by our attention mask
embeddings

tensor([[[ 3.0681e-01, -7.8805e-02,  1.7431e+00,  ..., -2.5349e-02,
          -1.1080e-01,  4.8311e-02],
         [ 7.1302e-01,  1.0437e-01,  1.8346e+00,  ...,  1.1343e-01,
          -7.5564e-02,  1.2668e-01],
         [ 8.1722e-01,  1.1321e-01,  1.5408e+00,  ..., -3.8067e-01,
           8.7477e-02, -1.9020e-01],
         ...,
         [ 5.4669e-01,  1.7181e-01,  1.1392e+00,  ...,  3.8548e-02,
          -1.5396e-01,  2.3015e-01],
         [ 3.4457e-01,  1.3151e-01,  1.1324e+00,  ..., -1.4211e-03,
          -1.7517e-01,  1.5220e-01],
         [ 3.2320e-01,  3.3350e-03,  1.1888e+00,  ...,  1.6736e-02,
          -2.0864e-01,  8.9316e-02]]], grad_fn=<NativeLayerNormBackward0>)

Then we sum the remaining embeddings along axis `1`:

### Take mean of all embeddings to get a 1x768 vector

#### First take the sum along the axis with length = 128 (axis 1)

In [22]:
summed = torch.sum(masked_embeddings, 1)
summed.shape

torch.Size([1, 768])

#### Then sum the number of values that must be given attention in each position of the tensor:

In [23]:
counts = torch.clamp(mask.sum(1), min=1e-9)
counts.shape

torch.Size([1, 768])
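The `torch.clamp(..., min=1e-9)` call appears to be a guard against dividing by zero. A small sketch of what it protects against, assuming a hypothetical sequence whose mask is entirely zero (something a normally tokenized sentence should not produce, since special tokens are attended to):

```python
import torch

# Hypothetical mask for a sequence with no attended tokens: 4 positions, hidden size 3
empty_mask = torch.zeros(1, 4, 3)

raw_counts = empty_mask.sum(1)                    # every count is 0
safe_counts = torch.clamp(raw_counts, min=1e-9)   # every count is raised to 1e-9

# With an all-zero mask the masked sum is also all zeros
summed_zero = torch.zeros(1, 3)

unsafe_mean = summed_zero / raw_counts    # 0 / 0 gives nan
safe_mean = summed_zero / safe_counts     # 0 / 1e-9 gives 0
print(unsafe_mean)
print(safe_mean)
```

For any sequence with at least one attended token the counts are 1 or more, so the clamp leaves them unchanged.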

Finally, we calculate the mean as the sum of the embedding activations `summed` divided by the number of values that should be given attention in each position `counts`:

In [25]:
mean_pooled = summed / counts

mean_pooled

tensor([[ 6.4371e-01, -1.6713e-01,  1.6938e+00,  1.0773e-01,  3.9651e-01,
         -1.0034e-01, -4.7002e-02,  7.9962e-01,  9.9557e-02, -9.4254e-01,
         -2.3453e-01, -2.0168e-02,  1.9673e-01,  3.9952e-01,  1.1500e-01,
          3.4273e-01, -6.4042e-01, -5.6996e-01,  2.7601e-01, -6.6139e-01,
         -7.0416e-01, -2.9305e-01, -1.5104e-01, -2.2795e-01,  9.0203e-01,
         -5.3868e-01, -1.0533e-01, -6.9518e-01, -3.7914e-01,  2.6069e-01,
         -8.8313e-01, -1.1394e-01,  1.0662e+00, -5.9388e-01, -4.5752e-01,
          1.4677e+00, -6.4333e-01, -1.7422e-01,  1.2722e-01, -4.2277e-01,
          6.6559e-01, -1.9796e-01,  1.1205e+00,  2.3012e-01, -1.3350e+00,
          4.1537e-01, -5.1203e-01,  4.1024e-01,  2.7452e-01, -8.5214e-01,
          1.2710e-01, -8.0763e-01, -6.3740e-01, -3.1229e-01, -4.9854e-01,
          3.2761e-01,  3.9804e-01, -4.4230e-01, -8.0381e-02,  4.9029e-01,
          6.1305e-01, -5.9101e-01,  5.5611e-01,  1.7230e-01, -1.0206e+00,
          1.7572e-01,  1.0949e+00, -3.

And that is how we calculate our sentence vector, the dense vector we can use for similarity comparisons.
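For reuse, the steps above can be collected into one helper. This is a minimal sketch rather than a library API: the function name `mean_pool` is hypothetical, and the random `hidden` and `attn` tensors merely stand in for `outputs.last_hidden_state` and `tokens['attention_mask']` so that the block runs without downloading a model.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over the positions where attention_mask is 1."""
    # (batch, seq_len) -> (batch, seq_len, hidden), as floats
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    # Sum the embeddings of the attended tokens along the sequence axis
    summed = torch.sum(last_hidden_state * mask, 1)
    # Number of attended tokens per position, clamped to avoid division by zero
    counts = torch.clamp(mask.sum(1), min=1e-9)
    return summed / counts

# Stand-in tensors: batch of 2 sequences, 5 positions, hidden size 8
hidden = torch.randn(2, 5, 8)
attn = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]])

pooled = mean_pool(hidden, attn)
print(pooled.shape)
```

With the tensors from this notebook, the equivalent call would be `mean_pool(outputs.last_hidden_state, tokens['attention_mask'])`.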