In [10]:
from transformers import BertForMaskedLM, BertTokenizer

BERT large uncased, in its _whole word masking_ variant, was trained on BookCorpus and English Wikipedia with two objectives: NSP (Next Sentence Prediction) and MLM (Masked Language Modeling). Let's load it and its tokenizer:

In [11]:
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking', cache_dir = 'hf_cache')
unmasking_model = BertForMaskedLM.from_pretrained('bert-large-uncased-whole-word-masking', cache_dir = 'hf_cache')

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The `FillMaskPipeline` convenience object wraps the tokenizer and the model behind a single call.

In [19]:
from transformers import FillMaskPipeline

unmasker = FillMaskPipeline(model = unmasking_model, tokenizer = tokenizer, tokenizer_kwargs = {"truncation": True})

The following cell masks one token of a sentence and asks BERT for its 20 most likely replacements (`top_k = 20`). Each candidate comes with a confidence score for the proposed token.

In [90]:
from typing import List

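def mask_token(tokens: List[str], idx: int) -> str:
    # Copy the list so the caller's tokens are not modified.
    tokens = tokens[:]
    tokens[idx] = '[MASK]'
    # Joining with spaces is only safe when no token is a '##' subword piece.
    return ' '.join(tokens)

original_sentence = 'modern shared room near Harvard.'
original_sentence_tokens = tokenizer.tokenize(original_sentence)

# Index 2 is the token 'room'.
masked_sentence = mask_token(original_sentence_tokens, 2)
# Ask for the 20 highest-scoring replacements.
candidates = unmasker(masked_sentence, top_k = 20)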
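
Each element of `candidates` is a dictionary whose keys include `score`, `token_str` and `sequence` (the real pipeline output also carries a `token` id). Some proposals are punctuation or very unlikely words, so it can help to filter the list before going further. The following is a minimal sketch under stated assumptions: `sample_candidates` is a hand-written stand-in whose scores are invented for illustration, and `keep_word_candidates` and its `min_score` threshold are hypothetical names, not part of `transformers`.

```python
from typing import Dict, List

# Hand-written stand-in for the pipeline output: the scores are invented
# for illustration and are NOT values produced by BERT.
sample_candidates: List[Dict] = [
    {"score": 0.30, "token_str": "housing", "sequence": "modern shared housing near harvard."},
    {"score": 0.05, "token_str": ",", "sequence": "modern shared, near harvard."},
    {"score": 0.20, "token_str": "rooms", "sequence": "modern shared rooms near harvard."},
    {"score": 0.001, "token_str": "history", "sequence": "modern shared history near harvard."},
]

def keep_word_candidates(candidates: List[Dict], min_score: float = 0.01) -> List[Dict]:
    """Hypothetical helper: keep alphabetic candidates above a threshold, best first."""
    kept = [
        c for c in candidates
        if c["token_str"].isalpha() and c["score"] >= min_score
    ]
    return sorted(kept, key=lambda c: c["score"], reverse=True)

filtered = keep_word_candidates(sample_candidates)
for c in filtered:
    print(f'{c["score"]:.3f} {c["token_str"]}')
```

The threshold is arbitrary here; a suitable value would have to be chosen by looking at the actual score distribution.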

Some candidates fit better than others. To estimate how well each one fits, we take the sentence with the mask replaced by the candidate and measure the cosine similarity between its embeddings and those of the original sentence.

In [91]:
from transformers import BertModel
import torch

embeddings_model = BertModel.from_pretrained('bert-large-uncased-whole-word-masking', cache_dir = 'hf_cache')

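def get_meaned_embeddings(sentence: str):
    # Tokenize by hand; note that no [CLS]/[SEP] special tokens are added here.
    tokens = tokenizer.tokenize(sentence)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # Add a batch dimension: shape (1, seq_len).
    input_ids = torch.tensor(input_ids).unsqueeze(0)
    with torch.no_grad():
        outputs = embeddings_model(input_ids)
        # Drop the batch dimension: shape (seq_len, hidden_size).
        embeddings = outputs.last_hidden_state[0]

    # mean(1) averages over the hidden dimension, giving one scalar per token
    # (shape (seq_len,)), so both sentences must have the same number of tokens.
    # mean(0) would instead pool over tokens into one hidden_size vector.
    return embeddings.mean(1)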

cos_sim = torch.nn.CosineSimilarity(dim = 0)

for candidate in candidates:
    sim = cos_sim(get_meaned_embeddings(candidate['sequence']), get_meaned_embeddings(original_sentence))
    print(sim, candidate['sequence'])

tensor(0.9997) modern shared is near harvard.
tensor(0.9992) modern shared campus near harvard.
tensor(0.9712) modern shared houses near harvard.
tensor(0.9996) modern shared property near harvard.
tensor(0.9993) modern shared buildings near harvard.
tensor(0.9992) modern shared building near harvard.
tensor(0.9996) modern shared house near harvard.
tensor(0.9990) modern shared residence near harvard.
tensor(0.9984) modern shared, near harvard.
tensor(0.9996) modern shared was near harvard.
tensor(0.9990) modern shared it near harvard.
tensor(0.9704) modern shared housing near harvard.
tensor(0.9995) modern shared located near harvard.
tensor(0.9802) modern shared space near harvard.
tensor(0.9996) modern shared apartments near harvard.
tensor(0.9996) modern shared lived near harvard.
tensor(0.9994) modern shared school near harvard.
tensor(0.9997) modern shared history near harvard.
tensor(0.9999) modern shared lives near harvard.
tensor(0.9997) modern shared rooms near harvard.
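
Almost all scores above sit around 0.999, which makes the candidates hard to tell apart. One possible contributor is the pooling axis: `get_meaned_embeddings` calls `mean(1)` on a `(seq_len, hidden_size)` tensor, which collapses each token's vector to a single number. A common alternative is to average over the token axis, so that each sentence becomes one `hidden_size` vector. The sketch below illustrates the difference with plain Python lists and toy numbers (invented for illustration, not BERT outputs); the helper names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows = tokens, columns = hidden features

def mean_over_tokens(embeddings: Matrix) -> List[float]:
    """Average the rows: one value per hidden feature (like tensor.mean(0))."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]

def mean_over_hidden(embeddings: Matrix) -> List[float]:
    """Average the columns: one value per token (like tensor.mean(1))."""
    return [sum(row) / len(row) for row in embeddings]

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "sentences": 2 tokens x 3 hidden features, invented for illustration.
sentence_a = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
sentence_b = [[0.0, 1.0, 2.0], [0.0, 3.0, 0.0]]

sim_tokens = cosine_similarity(mean_over_tokens(sentence_a), mean_over_tokens(sentence_b))
sim_hidden = cosine_similarity(mean_over_hidden(sentence_a), mean_over_hidden(sentence_b))
print(sim_tokens, sim_hidden)
```

In this toy case, pooling over tokens gives a similarity of roughly 0.2, while pooling over the hidden axis gives roughly 1.0 because the per-token averages happen to coincide. Whether switching the axis would spread out the scores for the BERT embeddings above is something to verify by rerunning the cell, not something this sketch demonstrates.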
