Loading the model:

In [2]:
from transformers import AutoConfig, AutoTokenizer, AutoModel

plm_name = "google-bert/bert-base-uncased"

config = AutoConfig.from_pretrained(plm_name)
lmtokenizer = AutoTokenizer.from_pretrained(plm_name)
lm = AutoModel.from_pretrained(
    plm_name,
    output_attentions=False,
    add_pooling_layer=False,
    ignore_mismatched_sizes=False,
)


Test tokenization:

In [3]:
from pprint import pprint

texts = [
    "Learn the various meanings of recompilation",
    "This is another good example."
]

spec_flag = True
print(lmtokenizer.tokenize(texts[0], add_special_tokens=True))
# encoded_text = lmtokenizer(texts, add_special_tokens=True, padding=True, return_tensors='pt')

# pprint(encoded_text)

['[CLS]', 'learn', 'the', 'various', 'meanings', 'of', 'rec', '##omp', '##ilation', '[SEP]']
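
The split of `recompilation` into `rec`, `##omp`, `##ilation` can be illustrated with a small greedy longest-match-first routine. This is only a simplified sketch of the WordPiece idea, not the library's implementation: the function name `wordpiece_split` and the toy vocabulary are hypothetical, and the real tokenizer also handles lowercasing, punctuation and special tokens.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (the real one has 30522 entries).
toy_vocab = {"learn", "the", "re", "rec", "##comp", "##omp", "##ilation"}

print(wordpiece_split("recompilation", toy_vocab))  # ['rec', '##omp', '##ilation']
print(wordpiece_split("learn", toy_vocab))          # ['learn']
```

Note that `rec` is chosen over the shorter `re` even though both are in the toy vocabulary, because the longest matching prefix always wins.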


Get the vocabulary of the model and print its size:

In [4]:
vocab = lmtokenizer.get_vocab()
len(vocab)

30522

Get the id of the token `cat`:

In [5]:
vocab['cat']

4937

Check which tokens are stored at indices 2000 to 2009:

In [6]:
# Invert the vocabulary (token -> id) into an id -> token lookup table
id2token = {idx: token for token, idx in vocab.items()}

for i in range(2000, 2010):
    print(id2token[i])

to
was
he
is
as
for
on
with
that
it


Apply the tokenizer to a couple of texts to get the token indices (with padding, so that we get a matrix) together with the attention mask matrix. The result is a dict with keys `input_ids` and `attention_mask`:

In [7]:
from pprint import pprint

texts = [
    "Learn the various meanings of recompilation",
    "This is another good example."
]

spec_flag = True
# print(lmtokenizer.tokenize(texts[0], add_special_tokens=True))

encoded_text = lmtokenizer(texts, add_special_tokens=False, padding=True, return_tensors='pt', return_token_type_ids=False)

pprint(encoded_text)

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0]]),
 'input_ids': tensor([[ 4553,  1996,  2536, 15383,  1997, 28667, 25377, 29545],
        [ 2023,  2003,  2178,  2204,  2742,  1012,     0,     0]])}
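
To make the relation between padding and the attention mask explicit, here is a dependency-free sketch that rebuilds both matrices from the unpadded id sequences. The helper name `pad_batch` is hypothetical, and the padding id `0` is assumed from the output above (it is the id of `[PAD]` in this vocabulary).

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad id sequences to a common length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    # Append pad_id until every sequence reaches max_len.
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padding position.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Token ids copied from the output above, with the padding removed.
ids = [
    [4553, 1996, 2536, 15383, 1997, 28667, 25377, 29545],
    [2023, 2003, 2178, 2204, 2742, 1012],
]

input_ids, attention_mask = pad_batch(ids)
print(input_ids[1])       # [2023, 2003, 2178, 2204, 2742, 1012, 0, 0]
print(attention_mask[1])  # [1, 1, 1, 1, 1, 1, 0, 0]
```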


Apply the LM to the input texts by passing the `input_ids` and the `attention_mask` as arguments. This returns contextual embeddings: each input token is encoded as a vector of dimension `d` (the dimension of the latent representations of the LM encoder). As expected, the result is a tensor of shape `(N, T, d)`, where `N` is the batch size (the number of input texts in the batch), `T` is the maximum length (in tokens) of the input texts, and `d` is the model embedding dimension:

In [8]:
lm_output= lm(input_ids=encoded_text['input_ids'], attention_mask=encoded_text['attention_mask'])

embs = lm_output.last_hidden_state

embs.size()

torch.Size([2, 8, 768])

Access the vector of the 4th token of the 2nd text and display its dimension:

In [28]:
embs[1,3].size()

torch.Size([768])

Compute the mean vector of each text (i.e. the mean along the token dimension, `dim=1`), which gives a matrix of 2 vectors of dimension `d`. Warning: this is not the best way to compute the mean embedding of a text, because the embeddings of padding tokens (which are not real tokens of the input text) are included in the mean:

In [30]:
embs.mean(dim=1).size()

torch.Size([2, 768])

We can discard the padding tokens when computing the mean embeddings by setting to zero all the values where the mask is 0, summing over the token dimension, and dividing by the number of real (non-padding) tokens in each text:

In [6]:
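import torch

# Broadcast the (N, T) attention mask to (N, T, d) and cast it to float
expanded_mask = encoded_text['attention_mask'].unsqueeze(-1).expand(embs.size()).float()

# Number of real (non-padding) tokens per text; clamp guards against division by zero
lengths = torch.clamp(expanded_mask.sum(-2), min=1e-12)

# Zero out the padding positions, sum over the token axis, divide by the true length
(embs * expanded_mask).sum(dim=-2) / lengths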


tensor([[ 0.1553,  0.1637,  0.0750,  ...,  0.1541, -0.0194,  0.0119],
        [ 0.0246, -0.3282,  0.0557,  ..., -0.1639,  0.3721,  0.2535]],
       grad_fn=<DivBackward0>)
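
The effect of the mask is easier to see on numbers small enough to check by hand. The sketch below uses plain Python lists rather than tensors. The helper names `naive_mean` and `masked_mean` are hypothetical and the values are made up for illustration: one text with 4 positions and `d = 2`, where the last two positions are padding.

```python
def naive_mean(vectors):
    """Average over every position, padding included."""
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


def masked_mean(vectors, mask):
    """Average only the positions where mask == 1."""
    dim = len(vectors[0])
    # Number of real tokens; max(..., 1) avoids dividing by zero.
    length = max(sum(mask), 1)
    return [sum(v[j] * m for v, m in zip(vectors, mask)) / length for j in range(dim)]


# Made-up toy values: two real tokens followed by two padding positions.
vectors = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [10.0, 10.0]]
mask = [1, 1, 0, 0]

print(naive_mean(vectors))         # [6.0, 6.5]
print(masked_mean(vectors, mask))  # [2.0, 3.0]
```

The naive mean is pulled towards the padding vectors, while the masked mean equals the average of the two real token vectors.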