
Multiple token prediction with MLM #4306

Closed
jannisborn opened this issue May 12, 2020 · 4 comments

@jannisborn (Contributor) commented May 12, 2020

🚀 Feature request

It would be great if the fill_mask interface could predict multiple masked tokens at a time.

Motivation

I didn't find any related issue. Using transformers on chemical data, queries like fill_mask('CCCO<mask>C') work fine. But for fill_mask('CC<mask>CO<mask>C') I obtain:

~/miniconda3/envs/paccmann/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
    553                 values, predictions = topk.values.numpy(), topk.indices.numpy()
    554             else:
--> 555                 masked_index = (input_ids == self.tokenizer.mask_token_id).nonzero().item()
    556                 logits = outputs[i, masked_index, :]
    557                 probs = logits.softmax(dim=0)

ValueError: only one element tensors can be converted to Python scalars
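
For context, .item() only converts one-element tensors, and with two masks in the input nonzero() returns two indices. A minimal repro sketch (the token ids below are made up for illustration):

    import torch

    mask_token_id = 3                                   # hypothetical mask id
    input_ids = torch.tensor([5, 3, 7, 3, 9])           # input containing two masks
    positions = (input_ids == mask_token_id).nonzero()  # tensor([[1], [3]])
    positions.item()  # ValueError: only one element tensors can be converted to Python scalars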

Your contribution

The user can easily implement an auto-regressive solution (sketched below), and since fill_mask returns the top-k tokens, one could even choose between greedy search, beam search, or probabilistic sampling. But a one-shot approach would be preferable, since it minimizes the chance of accumulating errors the way auto-regression does. Rather than outsourcing this to the user, I'd prefer a solution integrated into the package.
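
A rough sketch of that greedy auto-regressive workaround: fill the leftmost mask with its argmax prediction, then re-run the model until no masks remain. The roberta-base checkpoint is only an illustrative choice here:

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForMaskedLM.from_pretrained("roberta-base")

    def fill_masks_greedy(text):
        input_ids = tokenizer.encode(text, return_tensors="pt")
        while (input_ids == tokenizer.mask_token_id).any():
            with torch.no_grad():
                logits = model(input_ids)[0]
            # positions of the masks that are still unfilled
            mask_pos = (input_ids[0] == tokenizer.mask_token_id).nonzero()[:, 0]
            # replace only the leftmost mask, then re-run on the updated input
            first = mask_pos[0]
            input_ids[0, first] = logits[0, first].argmax()
        return tokenizer.decode(input_ids[0], skip_special_tokens=True)

    print(fill_masks_greedy("The capital of France <mask> the Eiffel <mask>."))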

As a minimum request, I would like to propose raising a proper error message instead of the ValueError above.

@julien-c (Member) commented

An error message would be great, want to submit a PR?

For multiple masked token support, I'm not entirely sure. The sampling might be very use case-specific.

stale bot commented Jul 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Jul 11, 2020
stale bot closed this as completed Jul 18, 2020
@stephen-pilli commented

I'm facing a situation where I have to fetch probabilities from the BERT MLM for multiple words in a single sentence.

Original : "Mountain Dew is an energetic drink"
Masked : "[MASK] is an energetic drink"

But the BERT MLM task doesn't consider two tokens at a time for a single [MASK].
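
One hedged way to score such a multi-token span (not code from this thread; span_pseudo_prob is a hypothetical helper and bert-base-uncased an assumed checkpoint): put one [MASK] per subtoken of the span and multiply the per-position probabilities, accepting that this ignores any dependence between the subtokens:

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    def span_pseudo_prob(template, span):
        # expand the single [MASK] to one mask per subtoken of the span
        span_ids = tokenizer.encode(span, add_special_tokens=False)
        text = template.replace("[MASK]", " ".join(["[MASK]"] * len(span_ids)))
        input_ids = tokenizer.encode(text, return_tensors="pt")
        with torch.no_grad():
            probs = model(input_ids)[0].softmax(dim=-1)
        mask_pos = (input_ids[0] == tokenizer.mask_token_id).nonzero()[:, 0]
        # naive independence: multiply each subtoken's probability at its mask
        p = 1.0
        for pos, tok_id in zip(mask_pos.tolist(), span_ids):
            p *= probs[0, pos, tok_id].item()
        return p

    print(span_pseudo_prob("[MASK] is an energetic drink", "Mountain Dew"))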

@BigSalmon2 commented

@stephen-pilli

Here's how to mask multiple tokens.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    # <mask> is the RoBERTa-style mask token, so a RoBERTa checkpoint is assumed
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForMaskedLM.from_pretrained("roberta-base")

    sentence = "The capital of France <mask> contains the Eiffel <mask>."

    token_ids = tokenizer.encode(sentence, return_tensors='pt')
    token_ids_tk = tokenizer.tokenize(sentence)  # tokenize() takes no return_tensors argument
    print(token_ids_tk)

    # positions of every mask token in the input
    masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
    masked_pos = [mask.item() for mask in masked_position]
    print(masked_pos)

    with torch.no_grad():
        output = model(token_ids)

    # (seq_len, vocab_size) logits over the vocabulary at each position
    last_hidden_state = output[0].squeeze()

    print("sentence :", sentence)

    list_of_list = []
    for mask_index in masked_pos:
        mask_hidden_state = last_hidden_state[mask_index]
        # ids of the top-100 candidate tokens for this mask position
        idx = torch.topk(mask_hidden_state, k=100, dim=0)[1]
        words = [tokenizer.decode([i.item()]).strip() for i in idx]
        list_of_list.append(words)
        print(words)

    # greedy guess: the top candidate at each mask, chosen independently
    best_guess = ""
    for j in list_of_list:
        best_guess = best_guess + " " + j[0]
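
Note that best_guess picks each mask's top candidate independently, so nothing enforces that the filled tokens are jointly coherent.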
