# Sentence Embeddings
Part of the material is taken from https://github.com/BramVanroy/bert-for-inference

In [34]:
!pip install transformers
!pip install sentence-transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)





# BERT
This section gives an example of how [BERT](https://arxiv.org/abs/1810.04805) can be used to extract sentence embeddings while at the same time giving some information about the model.

In [35]:
import torch
from transformers import AutoModel, AutoTokenizer

In [36]:
bert_checkpoint = 'bert-base-uncased'

In [37]:
sentences = [
    "BERT provides contextual embeddings for each word in a sentence.",
    "Sentence-BERT can be used to find similar sentences."
]

## The tokenizer

A deep learning model works with tensors, which are multi-dimensional arrays of numbers (a vector is a
one-dimensional tensor). To get started, then, the input text (a string) needs to be converted into numbers
that the model can use. This is done by the tokenizer.

The BERT tokenizer uses [WordPiece](https://arxiv.org/pdf/1609.08144.pdf), a tokenization technique that splits a sentence into tokens that may not correspond to whole words. Each word can be further divided into subwords or word pieces. This approach addresses the issue of out-of-vocabulary (OOV) tokens. Since each word must be associated with a vector, a word that was not encountered during training cannot be directly associated with an embedding. By assigning embeddings to word pieces instead of whole words, this issue is greatly reduced. In the rare case that even a word piece is not present in BERT’s vocabulary, it is assigned a default embedding, corresponding to the UNK (i.e., unknown) token.
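The word pieces generated by the BERT tokenizer can be recognized by the `##` prefix, which marks a piece that continues a word that has been split: the first piece of the word has no prefix, while every following piece starts with `##` (e.g., `em`, `##bed`, `##ding`, `##s`).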
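To make the splitting concrete, here is a minimal sketch of greedy longest-match-first word piece splitting for a single word. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and only for illustration; the real tokenizer uses the full `bert-base-uncased` vocabulary and also handles lowercasing, punctuation and whitespace.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into word pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the ## prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches: the whole word falls back to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real bert-base-uncased vocabulary.
toy_vocab = {"em", "##bed", "##ding", "##s", "context", "##ual", "word"}

print(wordpiece_tokenize("embeddings", toy_vocab))
print(wordpiece_tokenize("contextual", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary, "embeddings" is split into the same four pieces that appear in the tokenizer output of the next cells, while "xyz" has no matching piece and falls back to `[UNK]`.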

In [38]:
# Initialize the tokenizer with a pretrained model
tokenizer = AutoTokenizer.from_pretrained(bert_checkpoint)

### Single sentence

In [39]:
inputs1 = tokenizer(sentences[0], return_tensors='pt')
inputs1

{'input_ids': tensor([[  101, 14324,  3640,  6123,  8787,  7861,  8270,  4667,  2015,  2005,
          2169,  2773,  1999,  1037,  6251,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [40]:
tokens1 = tokenizer.convert_ids_to_tokens(inputs1['input_ids'][0])
tokens1

['[CLS]',
 'bert',
 'provides',
 'context',
 '##ual',
 'em',
 '##bed',
 '##ding',
 '##s',
 'for',
 'each',
 'word',
 'in',
 'a',
 'sentence',
 '.',
 '[SEP]']

In [41]:
len(tokens1)

17

In [42]:
inputs2 = tokenizer(sentences[1], return_tensors='pt')
inputs2

{'input_ids': tensor([[  101,  6251,  1011, 14324,  2064,  2022,  2109,  2000,  2424,  2714,
         11746,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [43]:
tokens2 = tokenizer.convert_ids_to_tokens(inputs2['input_ids'][0])
tokens2

['[CLS]',
 'sentence',
 '-',
 'bert',
 'can',
 'be',
 'used',
 'to',
 'find',
 'similar',
 'sentences',
 '.',
 '[SEP]']

In [44]:
len(tokens2)

13

You will probably have noticed the so-called "special tokens" [CLS] and [SEP]. These tokens are added automatically by
the tokenizer, so we don't have to worry about them. The first one is a classification token that has been
pretrained together with the model and is inserted for classification tasks. Instead of averaging all
token outputs to obtain a sentence representation, a common choice is to take the output of [CLS], which is then
treated as a representation of the whole sentence. [SEP], on the other hand, marks the end of a segment and separates
multiple segments. With a single sentence it simply closes the input; it matters more for tasks such as next sentence
prediction, where it separates the current sentence from the next one. It is especially important to remember the
[CLS] token, as it can play a major role in classification and regression tasks.
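The layout of the special tokens can be sketched as follows. This is a simplified illustration, not the tokenizer's actual code: `add_special_tokens` is a hypothetical helper that works on token strings, and it also shows the segment IDs (the `token_type_ids` seen in the outputs above) that distinguish the first from the second segment when a sentence pair is encoded.

```python
def add_special_tokens(tokens_a, tokens_b=None):
    """Wrap one or two token lists in [CLS]/[SEP] and build segment IDs."""
    # Single segment: [CLS] A [SEP], all in segment 0.
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    token_type_ids = [0] * len(tokens)
    if tokens_b is not None:
        # Second segment: B [SEP], all in segment 1.
        tokens = tokens + tokens_b + ["[SEP]"]
        token_type_ids = token_type_ids + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


single = add_special_tokens(["hello", "world"])
pair = add_special_tokens(["hello", "world"], ["good", "bye"])
print(single)
print(pair)
```

For a single sentence every segment ID is 0, which is why `token_type_ids` contains only zeros in this notebook.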

### Batch of sentences

A language model is fed a batch of sentences (in our case, a batch of 2 sentences) in which all sentences are aligned to the same length.
This alignment is performed by a padding operation. The two most popular padding techniques are: padding shorter texts up to the length of the longest text in the batch, or setting a fixed maximum sequence length (typically 512, the model's limit) and padding all items up to this
length. The latter approach is easier to implement but is less memory-efficient and computationally heavier. The
choice, as always, is yours.
If the sentences have different lengths and no padding method is specified, the tokenizer raises an error when `return_tensors='pt'` is set, because sequences of unequal length cannot be stacked into a single tensor.

In [45]:
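# With padding=True (pad to the longest sentence in the batch) this call succeeds;
# omitting the padding argument here would raise an error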
tokenizer(sentences, return_tensors='pt', truncation=True, padding=True)

{'input_ids': tensor([[  101, 14324,  3640,  6123,  8787,  7861,  8270,  4667,  2015,  2005,
          2169,  2773,  1999,  1037,  6251,  1012,   102],
        [  101,  6251,  1011, 14324,  2064,  2022,  2109,  2000,  2424,  2714,
         11746,  1012,   102,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}

In [46]:
# Padding by longest sentence in the batch
longest_inputs = tokenizer(sentences, return_tensors='pt', padding='longest')
longest_inputs

{'input_ids': tensor([[  101, 14324,  3640,  6123,  8787,  7861,  8270,  4667,  2015,  2005,
          2169,  2773,  1999,  1037,  6251,  1012,   102],
        [  101,  6251,  1011, 14324,  2064,  2022,  2109,  2000,  2424,  2714,
         11746,  1012,   102,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}

In [47]:
len(longest_inputs.input_ids[0])

17

The attention mask is a tensor used to identify which tokens are meaningful and which are padding. This helps the model focus on the actual content of the input while ignoring any padding added to ensure uniform sequence lengths within a batch.

This tensor contains the following values:
- 1: It indicates a token that is not padding and should be attended to by the model.
- 0: It indicates a padding token, which should be ignored by the model during computation.
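The relation between padding and the attention mask can be sketched in plain Python. `pad_batch` is a hypothetical helper written for illustration, not part of `transformers`; it assumes a padding ID of 0 (as in the outputs above) and leaves sequences that are already longer than the target length untouched.

```python
def pad_batch(batch_ids, pad_id=0, max_length=None):
    """Pad lists of token IDs to a common length and build attention masks."""
    # Default target: the longest sequence in the batch.
    if max_length is None:
        max_length = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max(max_length - len(ids), 0)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 for real tokens, 0 for padding.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Token IDs of the two example sentences, copied from the outputs above.
batch = [
    [101, 14324, 3640, 6123, 8787, 7861, 8270, 4667, 2015, 2005,
     2169, 2773, 1999, 1037, 6251, 1012, 102],
    [101, 6251, 1011, 14324, 2064, 2022, 2109, 2000, 2424, 2714,
     11746, 1012, 102],
]

ids_longest, mask_longest = pad_batch(batch)
ids_fixed, mask_fixed = pad_batch(batch, max_length=20)
print(mask_longest[1])
print(mask_fixed[0])
```

The sum of each mask row equals the number of real tokens in that sentence (17 and 13), regardless of how much padding was added.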

In [48]:
# Padding by maximum length
max_inputs = tokenizer(sentences, return_tensors='pt', padding='max_length', max_length=20)
max_inputs

{'input_ids': tensor([[  101, 14324,  3640,  6123,  8787,  7861,  8270,  4667,  2015,  2005,
          2169,  2773,  1999,  1037,  6251,  1012,   102,     0,     0,     0],
        [  101,  6251,  1011, 14324,  2064,  2022,  2109,  2000,  2424,  2714,
         11746,  1012,   102,     0,     0,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])}

In [49]:
len(max_inputs.input_ids[0])

20

In [50]:
# What happens if max_length is smaller than the sentences?
# Without truncation=True this call would raise an error; with it, each sentence is cut to max_length
tokenizer(sentences, return_tensors='pt', padding='max_length', max_length=10, truncation=True)

{'input_ids': tensor([[  101, 14324,  3640,  6123,  8787,  7861,  8270,  4667,  2015,   102],
        [  101,  6251,  1011, 14324,  2064,  2022,  2109,  2000,  2424,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [51]:
# Truncation solves the problem: each sentence is cut to max_length (the final [SEP], ID 102, is kept)
trunc_inputs = tokenizer(sentences, return_tensors='pt', padding='max_length', max_length=10, truncation=True)
trunc_inputs

{'input_ids': tensor([[  101, 14324,  3640,  6123,  8787,  7861,  8270,  4667,  2015,   102],
        [  101,  6251,  1011, 14324,  2064,  2022,  2109,  2000,  2424,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [52]:
len(trunc_inputs.input_ids[0])

10
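Note that in the truncated output each row still ends with ID 102 ([SEP]): the content tokens are cut, not the special tokens. A minimal sketch of this behaviour for a single sentence is shown below; `truncate_ids` is a hypothetical helper and assumes the input already starts with [CLS] and ends with [SEP].

```python
def truncate_ids(ids, max_length):
    """Cut a [CLS] ... [SEP] sequence to max_length, keeping both special tokens."""
    if len(ids) <= max_length:
        return list(ids)
    cls_id, sep_id = ids[0], ids[-1]
    content = ids[1:-1]
    # Two positions are reserved for [CLS] and [SEP].
    return [cls_id] + content[:max_length - 2] + [sep_id]


# Token IDs of the first example sentence, copied from the outputs above.
sentence_ids = [101, 14324, 3640, 6123, 8787, 7861, 8270, 4667, 2015, 2005,
                2169, 2773, 1999, 1037, 6251, 1012, 102]

truncated = truncate_ids(sentence_ids, max_length=10)
print(truncated)
```

The result matches the first row of `trunc_inputs['input_ids']` shown above.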

In [53]:
# Padding by model maximum length (we don't manually specify the length)
inputs = tokenizer(sentences, return_tensors='pt', padding='max_length')

In [54]:
len(inputs.input_ids[0])

512

## The model
Now that we have preprocessed our input string into a tensor of IDs, we can feed this to the model. Remember that the
IDs are the IDs of a token in the tokenizer's vocabulary. The model "knows" which words are being processed because it
"knows" which token belongs to which ID.

To get started, we first need to initialize the model. Just like the tokenizer, the model is pretrained which makes it
very easy for us to just use the pretrained language model to get some token or sentence representations out of it.
Note how we use the same pretrained model as the tokenizer uses (`bert-base-uncased`). This is the smaller BERT model
that has been trained on lower case text. Because the model has been trained on lower case text, it does not know cased
text. You may have noticed that the tokenizer automatically lowercases the text for us. Whether to use a cased or
uncased language model really depends on the task. If you think that casing matters (e.g. for NER), you may want to
opt for a cased model, otherwise casing might just add noise.

The model returns several outputs. One of them is the last hidden state, of shape `(batch_size, sequence_length, hidden_size)`, which contains an embedding for each token.

Graphics cards (GPUs) are much better at doing operations on tensors than a CPU is. Therefore, we wish to run our
computations on the GPU if one is available. Note that this requires a GPU, CUDA, and a
GPU-accelerated torch version. To speed up the computation, we move our model to the correct device:
if a GPU is available we move the model `.to()` the GPU, otherwise it stays on the CPU. It is important to remember
that the model and the data to process need to be on the same device. This means that we have to move our
tokenized data to the same device as the model, too.

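Finally, we also set the model to evaluation mode (`.eval()`). This disables training-only behaviour such as dropout, so the pretrained model produces deterministic outputs. Note that `.eval()` by itself does not freeze the weights or turn off gradient tracking.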

In [55]:
model = AutoModel.from_pretrained(bert_checkpoint)
# Set the device to GPU (cuda) if available, otherwise stick with CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = model.to(device)
inputs = inputs.to(device)

_ = model.eval()

## Token embeddings
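The model has been initialized and the input strings have been converted into tensors, so we can run a forward pass. We wrap the call in `torch.no_grad()` because we only need the outputs, not gradients, which saves memory and time.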

In [56]:
with torch.no_grad():
    out = model(**inputs)

In [57]:
out.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

In [58]:
# We only want the last_hidden_state
token_embeddings = out.last_hidden_state
token_embeddings

tensor([[[-0.4208, -0.2507, -0.5199,  ..., -0.0421, -0.3106,  0.5707],
         [ 0.5584, -0.1401,  0.0563,  ..., -0.0178,  0.0344,  0.1316],
         [-0.4197,  0.1562, -0.2669,  ..., -0.0750, -0.9624,  0.2741],
         ...,
         [ 0.0903,  0.0786,  0.0605,  ...,  0.3320, -0.4488, -0.1018],
         [-0.2519, -0.3576,  0.2042,  ...,  0.0694, -0.2591, -0.2165],
         [-0.1488, -0.1902,  0.2266,  ...,  0.0328, -0.3447, -0.1769]],

        [[-0.4421, -0.6379, -0.3509,  ..., -0.4756, -0.1228,  0.9048],
         [ 0.1877, -0.3324,  0.0039,  ...,  0.1608,  0.4673, -0.1423],
         [ 0.1484,  0.0621, -0.2109,  ...,  0.0318, -0.2136,  0.5368],
         ...,
         [-0.4046, -0.8245, -0.2647,  ...,  0.0262, -0.0338, -0.0711],
         [-0.2051, -0.4144,  0.2916,  ...,  0.0751, -0.0623, -0.3122],
         [-0.1398, -0.2971,  0.3165,  ...,  0.0152, -0.1268, -0.1836]]],
       device='cuda:0')

In [59]:
token_embeddings.shape

torch.Size([2, 512, 768])

In [60]:
token_embeddings[1][:10].shape

torch.Size([10, 768])

The `token_embeddings` variable has a size of `(batch_size, sequence_length, 768)`.
In our case, that is `(2, 512, 768)` because we have two sentences (batch size of 2), and our sentences were padded to the maximum model length (i.e., 512). `768` is the number of hidden dimensions.

## Sentence embeddings

Let's say that we want to retrieve a sentence embedding. In other words, we want to reduce the size of `(2, 512, 768)` to `(2, 768)` where `2` is the batch size and `768` is the number of hidden dimensions. There are many ways to make a sentence abstraction of tokens, and it often depends on the given task.

### Averaging token embeddings

First of all, we will try averaging the embeddings of all the tokens in each sentence.
Remember that the sentences have been padded, so we want to ignore the embeddings of padding tokens when computing the average.

In [61]:
attention_mask = inputs['attention_mask']
attention_mask

tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0')

In [62]:
sentence_embeddings = torch.sum(token_embeddings * attention_mask.unsqueeze(-1), dim=1) / attention_mask.sum(dim=1, keepdim=True)
print(sentence_embeddings.shape)
print(sentence_embeddings[0].size())

torch.Size([2, 768])
torch.Size([768])


**We now have a vector of 768 features for each sentence representing our sentence embeddings.**
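To see why the mask matters, the same masked average can be worked through by hand on a tiny example. This is a plain-Python sketch with made-up numbers: `masked_mean` is a hypothetical helper, the "embeddings" have only 2 dimensions, and the third position plays the role of a padding token whose output vector is not zero.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping masked (padding) positions."""
    hidden_size = len(token_embeddings[0])
    n_real = sum(attention_mask)
    sums = [0.0] * hidden_size
    for vector, mask_value in zip(token_embeddings, attention_mask):
        if mask_value == 1:
            for j, value in enumerate(vector):
                sums[j] += value
    return [s / n_real for s in sums]


# Made-up values: two real tokens followed by one padding position.
toy_embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]

with_mask = masked_mean(toy_embeddings, toy_mask)
without_mask = masked_mean(toy_embeddings, [1, 1, 1])
print(with_mask)
print(without_mask)
```

With the mask, the average is computed over the two real tokens only; without it, the padding position dominates the result. The tensor expression in the cell above does the same thing for the whole batch at once: multiplying by the unsqueezed mask zeroes out the padding positions, and dividing by the mask sum divides by the number of real tokens.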

### CLS token embedding

In [63]:
# Take the hidden state at position 0 of each sentence, i.e. the output of the [CLS] token
sentence_embeddings = token_embeddings[:, 0, :]
print(sentence_embeddings.shape)
print(sentence_embeddings[0].size())

torch.Size([2, 768])
torch.Size([768])


## Saving and loading results

It is likely that you will want to use the generated feature vectors in another model or task, so it is useful to save
them to your hard drive. You can easily save a tensor with `torch.save` and load it in another script with `torch.load`.

In [64]:
# save our created sentence representations
torch.save(sentence_embeddings.cpu(), 'my_sent_embs.pt')

# load them again
my_sent_embs = torch.load('my_sent_embs.pt')
print(my_sent_embs.size())

torch.Size([2, 768])


# SBERT

In [65]:
from sentence_transformers import SentenceTransformer

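# Initialize an SBERT model (the checkpoint is downloaded on first use)
model = SentenceTransformer('all-mpnet-base-v2')

# Generate embeddings; encode() tokenizes, pads and pools internally
# and returns one vector per sentence
sentence_embeddings = model.encode(sentences)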

print(sentence_embeddings.shape)

(2, 768)
