
Multiple token prediction with MLM #4306

Closed
jannisborn opened this issue May 12, 2020 · 4 comments

@jannisborn (Contributor) commented May 12, 2020

🚀 Feature request

It would be great if the fill_mask interface could predict multiple masked tokens at a time.

Motivation

I didn't find any related issue. Using transformers on chemical data, queries like fill_mask('CCCO<mask>C') work fine. But for fill_mask('CC<mask>CO<mask>C') I obtain:

~/miniconda3/envs/paccmann/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
    553                 values, predictions = topk.values.numpy(), topk.indices.numpy()
    554             else:
--> 555                 masked_index = (input_ids == self.tokenizer.mask_token_id).nonzero().item()
    556                 logits = outputs[i, masked_index, :]
    557                 probs = logits.softmax(dim=0)

ValueError: only one element tensors can be converted to Python scalars
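
For context, .item() only converts one-element tensors, and with two masks in the input nonzero() returns two indices. A minimal repro sketch (the token ids below are made up for illustration):

    import torch

    mask_token_id = 3                                   # hypothetical mask id
    input_ids = torch.tensor([5, 3, 7, 3, 9])           # input containing two masks
    positions = (input_ids == mask_token_id).nonzero()  # tensor([[1], [3]])
    positions.item()  # ValueError: only one element tensors can be converted to Python scalars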

Your contribution

The user can easily implement an auto-regressive solution (sketched below), and since fill_mask returns the top-k tokens, one could even choose between greedy search, beam search, or probabilistic sampling. But a one-shot approach would be preferable, since it minimizes the chance of accumulating errors the way auto-regression does. Rather than outsourcing this to the user, I'd prefer a solution integrated into the package.
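
A rough sketch of that greedy auto-regressive workaround: fill the leftmost mask with its argmax prediction, then re-run the model until no masks remain. The roberta-base checkpoint is only an illustrative choice here:

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForMaskedLM.from_pretrained("roberta-base")

    def fill_masks_greedy(text):
        input_ids = tokenizer.encode(text, return_tensors="pt")
        while (input_ids == tokenizer.mask_token_id).any():
            with torch.no_grad():
                logits = model(input_ids)[0]
            # positions of the masks that are still unfilled
            mask_pos = (input_ids[0] == tokenizer.mask_token_id).nonzero()[:, 0]
            # replace only the leftmost mask, then re-run on the updated input
            first = mask_pos[0]
            input_ids[0, first] = logits[0, first].argmax()
        return tokenizer.decode(input_ids[0], skip_special_tokens=True)

    print(fill_masks_greedy("The capital of France <mask> the Eiffel <mask>."))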

As a minimum request, I would like to propose raising a proper error message instead of the ValueError above.

@julien-c (Member) commented

An error message would be great, want to submit a PR?

For multiple masked token support, I'm not entirely sure. The sampling might be very use case-specific.

stale bot commented Jul 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Jul 11, 2020
stale bot closed this as completed Jul 18, 2020
@stephen-pilli commented

I'm facing a situation where I have to fetch probabilities from the BERT MLM for multiple words in a single sentence.

Original : "Mountain Dew is an energetic drink"
Masked : "[MASK] is an energetic drink"

But the BERT MLM task doesn't consider two tokens at a time for a single [MASK].
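
One hedged way to score such a multi-token span (not code from this thread; span_pseudo_prob is a hypothetical helper and bert-base-uncased an assumed checkpoint): put one [MASK] per subtoken of the span and multiply the per-position probabilities, accepting that this ignores any dependence between the subtokens:

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    def span_pseudo_prob(template, span):
        # expand the single [MASK] to one mask per subtoken of the span
        span_ids = tokenizer.encode(span, add_special_tokens=False)
        text = template.replace("[MASK]", " ".join(["[MASK]"] * len(span_ids)))
        input_ids = tokenizer.encode(text, return_tensors="pt")
        with torch.no_grad():
            probs = model(input_ids)[0].softmax(dim=-1)
        mask_pos = (input_ids[0] == tokenizer.mask_token_id).nonzero()[:, 0]
        # naive independence: multiply each subtoken's probability at its mask
        p = 1.0
        for pos, tok_id in zip(mask_pos.tolist(), span_ids):
            p *= probs[0, pos, tok_id].item()
        return p

    print(span_pseudo_prob("[MASK] is an energetic drink", "Mountain Dew"))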

@BigSalmon2 commented

@stephen-pilli

Here's how to mask multiple tokens.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    # <mask> is the RoBERTa-style mask token, so a RoBERTa checkpoint is assumed
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForMaskedLM.from_pretrained("roberta-base")

    sentence = "The capital of France <mask> contains the Eiffel <mask>."

    token_ids = tokenizer.encode(sentence, return_tensors='pt')
    token_ids_tk = tokenizer.tokenize(sentence)  # tokenize() takes no return_tensors argument
    print(token_ids_tk)

    # positions of every mask token in the input
    masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
    masked_pos = [mask.item() for mask in masked_position]
    print(masked_pos)

    with torch.no_grad():
        output = model(token_ids)

    # (seq_len, vocab_size) logits over the vocabulary at each position
    last_hidden_state = output[0].squeeze()

    print("sentence :", sentence)

    list_of_list = []
    for mask_index in masked_pos:
        mask_hidden_state = last_hidden_state[mask_index]
        # ids of the top-100 candidate tokens for this mask position
        idx = torch.topk(mask_hidden_state, k=100, dim=0)[1]
        words = [tokenizer.decode([i.item()]).strip() for i in idx]
        list_of_list.append(words)
        print(words)

    # greedy guess: the top candidate at each mask, chosen independently
    best_guess = ""
    for j in list_of_list:
        best_guess = best_guess + " " + j[0]
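
Note that best_guess picks each mask's top candidate independently, so nothing enforces that the filled tokens are jointly coherent.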
