Can I use BERT / gpt-2 for text generation #2311

Closed
orenpapers opened this issue Dec 25, 2019 · 13 comments

@orenpapers

❓ Questions & Help

I want to get a list of possible completions and their probabilities.
For example,
For the sentence "I put the glass of the _"
I want to get a vector with word and probabilities from a pre-trained model, such as :
desk = 0.1
table = 0.2
car = 0.05
shirt = 0.001
Is that possible?

@patrickvonplaten
Contributor

You could do something like this when using GPT-2:

from transformers import GPT2LMHeadModel, GPT2Tokenizer
from torch.nn import functional as F
import torch

model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')

# encode input context
input_ids = torch.tensor(tokenizer.encode('I put the glass of the')).unsqueeze(0)
# get logits of last predicted token
next_word_logits = model(input_ids)[0][0, -1].detach()
next_word_probs = F.softmax(next_word_logits, dim=0)

next_words = ['desk', 'table', 'car', 'shirt']
next_words_probs = []
# encode tokens for which prob is to be estimated
next_word_ids = [tokenizer.encode(next_word) for next_word in next_words]

for next_word_id in next_word_ids:
    next_word_input_ids = input_ids.clone()
    next_word_prob = next_word_probs[next_word_id[0]].item()
    # We need a while loop here because a single word can be composed of multiple tokens
    # 'desk' is encoded to 2 tokens so that we have to call the model another time
    while len(next_word_id) > 1:
        next_word_input_ids = torch.cat((next_word_input_ids, torch.tensor([next_word_id[0]]).unsqueeze(0)), dim=1)
        # get logits of last predicted token
        next_word_logits = model(next_word_input_ids)[0][0, -1].detach()
        # multiply prob of next token to prob of previous tokens
        next_word_prob *= F.softmax(next_word_logits, dim=0)[next_word_id[1]].item()
        # remove first token since already used
        next_word_id = next_word_id[1:]
    next_words_probs.append(next_word_prob)

# print result
for next_word, next_word_prob in zip(next_words, next_words_probs):
    print('{} = {}'.format(next_word, next_word_prob))

@shashankMadan-designEsthetics

Yes, it is possible. You need to take the top-k of lm_logits (which is output[0] in the case of GPT). The logits cover all 50,257 entries of the vocabulary (the vocab size). Applying top-k gives you indices and values: the values are your scores (e.g. 0.8, 0.1), and the indices correspond to entries in the 50,257-token vocabulary, which you can decode with the tokenizer's decode method.
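
A minimal sketch of this top-k idea, assuming the same gpt2-medium checkpoint and the tuple-returning model output used in the snippet above:

import torch
from torch.nn import functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')

# encode input context
input_ids = torch.tensor(tokenizer.encode('I put the glass of the')).unsqueeze(0)
# logits over the full ~50k-token vocabulary at the last position
next_token_logits = model(input_ids)[0][0, -1].detach()
next_token_probs = F.softmax(next_token_logits, dim=0)

# top 10 candidate next tokens with their probabilities
top_probs, top_ids = torch.topk(next_token_probs, k=10)
for prob, token_id in zip(top_probs.tolist(), top_ids.tolist()):
    print('{!r} = {:.4f}'.format(tokenizer.decode([token_id]), prob))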

@orenpapers
Author

@patrickvonplaten Amazing thanks!
And if I want the rank of these words among all the words in the vocabulary?
e.g. desk is the most probable word, table is ranked #12, etc.?

@patrickvonplaten
Contributor

Since GPT-2's output is based on byte-pair-encoding tokens and not on words, you would have to define your own vocabulary. Having defined your vocabulary, I would simply calculate the probability for each word using the above procedure and then sort the tensor.
To better understand how byte-pair-encoding works this might help.
To sort the tensor this might help.
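
A rough sketch of this ranking step, assuming the next_words and next_words_probs lists from the earlier snippet (any custom word list filled the same way would work):

import torch

word_probs = torch.tensor(next_words_probs)
# sort descending: index 0 of the sorted order is the most probable word (rank 1)
sorted_probs, sorted_indices = torch.sort(word_probs, descending=True)
for rank, idx in enumerate(sorted_indices.tolist(), start=1):
    print('rank {}: {} = {:.6f}'.format(rank, next_words[idx], sorted_probs[rank - 1].item()))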

@orenpapers
Author

@patrickvonplaten Thanks! Do you think it would be possible to do this for all (or at least most) English words on my personal Mac?

@patrickvonplaten
Contributor

Yeah, I think that should definitely be feasible.
Many words will consist of two tokens or fewer and will therefore need at most one additional forward pass (the first forward pass is the same for all words and only needs to be computed once).

So if you have a vocabulary of, say, 300,000 words, I'd estimate that you would have to compute around 200,000 forward passes. You can estimate how long a forward pass takes by averaging the computation time over 100 runs of calculating the probability for the word 'desk'.

Concerning memory, there should not be a problem.
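
As a sketch of the timing suggestion above (assuming the model and input_ids from the earlier GPT-2 snippet), you could average over 100 forward passes:

import time
import torch

with torch.no_grad():
    start = time.time()
    for _ in range(100):
        _ = model(input_ids)  # roughly the cost of one extra forward pass per word
    avg_seconds = (time.time() - start) / 100

print('average forward pass: {:.4f} s'.format(avg_seconds))
# multiply by the estimated number of additional passes (e.g. ~200,000) for a rough total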

@patrickvonplaten
Contributor

And the final vector giving the probabilities over your defined vocabulary should be normalized to make it a probability distribution.

@orenpapers
Author

@patrickvonplaten You mean using softmax?

@stale

stale bot commented Mar 2, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Mar 2, 2020
@patrickvonplaten
Contributor

I was thinking of just normalizing like this:
https://stackoverflow.com/questions/26785354/normalizing-a-list-of-numbers-in-python

but you could also use softmax again - depends on what you want and what works better for you!
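
For example, a sketch of the two options, assuming the next_words_probs list from the earlier snippet:

import torch
from torch.nn import functional as F

word_probs = torch.tensor(next_words_probs)

# option 1: divide by the sum, as in the linked StackOverflow answer
normalized = word_probs / word_probs.sum()

# option 2: apply softmax directly to the scores
# (gives a flatter distribution than option 1)
softmaxed = F.softmax(word_probs, dim=0)

print(normalized.sum().item(), softmaxed.sum().item())  # both sum to (approximately) 1.0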

@orenpapers
Author

@patrickvonplaten Is it possible with a pre-trained BERT model?
Thanks!

@patrickvonplaten
Contributor

You might take a look at masked language modeling :-) https://huggingface.co/transformers/usage.html#masked-language-modeling
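
A minimal sketch of that masked-LM approach, assuming bert-base-uncased and the tuple-returning model output used elsewhere in this thread:

import torch
from torch.nn import functional as F
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# mask the position we want to predict
text = 'I put the glass of the {} .'.format(tokenizer.mask_token)
input_ids = torch.tensor([tokenizer.encode(text)])
mask_index = input_ids[0].tolist().index(tokenizer.mask_token_id)

with torch.no_grad():
    logits = model(input_ids)[0]  # shape (1, seq_len, vocab_size)
probs = F.softmax(logits[0, mask_index], dim=0)

# top 5 candidate tokens for the masked position
top_probs, top_ids = torch.topk(probs, k=5)
for prob, token_id in zip(top_probs.tolist(), top_ids.tolist()):
    print('{} = {:.4f}'.format(tokenizer.decode([token_id]), prob))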

@orenkobo

@patrickvonplaten Nice! Thanks for the pointer!
And let's say I want to check a specific word in a masked location (e.g. what is the probability of the word "package" in the sequence "HuggingFace is creating a { } that the community uses to"?). Is this possible?
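
One possible way to do that, as a sketch building on the masked-LM snippet above (it assumes the candidate is a single wordpiece in BERT's vocabulary):

import torch
from torch.nn import functional as F
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# mask the position where the candidate word would go
text = 'HuggingFace is creating a {} that the community uses to'.format(tokenizer.mask_token)
input_ids = torch.tensor([tokenizer.encode(text)])
mask_index = input_ids[0].tolist().index(tokenizer.mask_token_id)

with torch.no_grad():
    probs = F.softmax(model(input_ids)[0][0, mask_index], dim=0)

# this only works directly if the candidate is a single wordpiece in BERT's vocab;
# a multi-piece word would have to be scored piece by piece instead
package_id = tokenizer.convert_tokens_to_ids('package')
print('package = {:.4f}'.format(probs[package_id].item()))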
