using BERT as a language Model #37

Closed · mdasadul opened this issue Nov 19, 2018 · 18 comments

Comments
@mdasadul

mdasadul commented Nov 19, 2018

I was trying to use BERT as a language model to assign a score (it could be a perplexity score) to a given sentence, something like
P("He is go to school") = 0.008
P("He is going to school") = 0.08
which would indicate that the second sentence is more probable than the first. Is there a way to get a score like this?

Thanks

@thomwolf
Member

I don't think you can do that with BERT. The masked LM loss is not a language modeling loss; it doesn't compose with the chain rule the way a standard language modeling loss does.
Please see the discussion on that in the TensorFlow repo here.
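(As a gloss on the chain-rule point, not part of the original reply:) a left-to-right LM defines a proper joint probability,

P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}),

so per-token log-probabilities sum to a true sentence log-probability, whereas BERT's masked-LM conditionals P(w_i \mid w_1, \dots, w_{i-1}, w_{i+1}, \dots, w_n) condition on both sides of each position, and their product is only a pseudo-likelihood, not a proper joint probability.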

@mdasadul
Author

Hello @thomwolf, I can see it is possible to assign a score with BERT by masking each word sequentially and then scoring the sentence as the sum of the per-word scores. Here is how people have been doing it in TensorFlow. I am trying the following:

import numpy as np
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

# Load pre-trained model (weights)
with torch.no_grad():
    model = BertForMaskedLM.from_pretrained('bert-large-cased')
    model.eval()
    # Load pre-trained model tokenizer (vocabulary)
    tokenizer = BertTokenizer.from_pretrained('bert-large-cased')

def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sentence_loss = 0.
    for i, word in enumerate(tokenize_input):
        tokenize_input[i] = '[MASK]'
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
        word_loss = model(mask_input, masked_lm_labels=tensor_input).data.numpy()
        sentence_loss += word_loss
        # print("Word: %s : %f" % (word, np.exp(-word_loss)))
    return np.exp(sentence_loss / len(tokenize_input))

score("There is a book on the table")
88.899999

Is this the right way to assign a score using BERT?

@zhangyichang

(quoting @mdasadul's comment and code above)

No, that's not quite right: you mask each word but never restore it, so the input keeps accumulating [MASK] tokens as the loop runs.
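A minimal sketch of the corrected loop (my sketch, using the same pytorch_pretrained_bert API as the snippet above): mask one position at a time on a copy of the token list, so the original tokens are never overwritten.

import numpy as np
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

# Load pre-trained model (weights) and tokenizer (vocabulary)
model = BertForMaskedLM.from_pretrained('bert-large-cased')
model.eval()
tokenizer = BertTokenizer.from_pretrained('bert-large-cased')

def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sentence_loss = 0.
    for i in range(len(tokenize_input)):
        # Copy the token list so only position i is masked in this iteration
        masked_tokens = list(tokenize_input)
        masked_tokens[i] = '[MASK]'
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(masked_tokens)])
        with torch.no_grad():
            # With masked_lm_labels set, the model returns the cross-entropy loss
            word_loss = model(mask_input, masked_lm_labels=tensor_input).item()
        sentence_loss += word_loss
    return np.exp(sentence_loss / len(tokenize_input))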

@orenpapers

@mdasadul Did you manage to do it?

@mdasadul
Author

mdasadul commented May 27, 2020 via email

@orenpapers

@mdasadul Do you mean this one?
https://twitter.com/mdasaduluofa/status/1181917072999231489/photo/1
I see this is for GPT-2; do you have code for BERT?

@mdasadul
Author

mdasadul commented Jun 1, 2020

It should be similar. The following code is for DistilBERT:

import math
import json

import torch
import torch.multiprocessing as mp
from torch.multiprocessing import TimeoutError, Pool, set_start_method, Queue
from transformers import DistilBertTokenizer, DistilBertForMaskedLM
from flask import Flask, request

try:
    set_start_method('spawn')
except RuntimeError:
    pass

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def load_model():
    model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased').to(device)
    model.eval()
    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
    return tokenizer, model

tokenizer, model = load_model()

def score(sentence):
    # Guard against empty/one-word inputs and inputs longer than the model's limit
    if len(sentence.strip().split()) <= 1:
        return 10000
    tokenize_input = tokenizer.tokenize(sentence)
    if len(tokenize_input) > 512:
        return 10000
    input_ids = torch.tensor(tokenizer.encode(tokenize_input)).unsqueeze(0).to(device)
    with torch.no_grad():
        loss = model(input_ids, masked_lm_labels=input_ids)[0]
    return math.exp(loss.item() / len(tokenize_input))
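For completeness, a usage sketch for the score function above (the example sentence is my own, not from the thread); a lower value means the model finds the text more predictable:

print(score("There is a book on the table"))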

@orenpapers

orenpapers commented Jun 1, 2020

@mdasadul I get the error:
TypeError: forward() got an unexpected keyword argument 'masked_lm_labels'
Also, can you please explain why the following steps are necessary:

  1. unsqueeze(0)
  2. add torch.no_grad()
  3. add model.eval()

@nlp-sudo

nlp-sudo commented Jul 7, 2020

The score is equivalent to perplexity, so the lower the score, the better the sentence, right?

@mdasadul
Author

mdasadul commented Jul 7, 2020 via email

@orenschonlab

@mdasadul I get the error:

    return math.exp(loss.item() / len(tokenize_input))
ValueError: only one element tensors can be converted to Python scalars

Any idea why?

@mdasadul
Author

mdasadul commented Mar 14, 2021 via email

@orenschonlab

@mdasadul I have a sentence with more than one word and still get the error.
The sentence is 'Harry had never believed he would'
and input_ids is tensor([[ 101, 4302, 2018, 2196, 3373, 2002, 2052, 102]])

@EricFillion

Below is an example from the official docs on how to use GPT-2 to determine perplexity.

https://huggingface.co/transformers/perplexity.html

@orenschonlab

@EricFillion But how can it be used for a sentence, not for a dataset?
Meaning I want the perplexity of the sentence:
Harry had never believed he would

@mdasadul
Author

@orenschonlab Try the code below:

import sys

import numpy as np
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model (weights) and tokenizer (vocabulary)
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def score(sentence):
    tokenize_input = tokenizer.encode(sentence)
    tensor_input = torch.tensor([tokenize_input])
    # With labels == inputs, the first output is the mean cross-entropy loss
    with torch.no_grad():
        loss = model(tensor_input, labels=tensor_input)[0]
    return np.exp(loss.item())

if __name__ == '__main__':
    for line in sys.stdin:
        if line.strip() != '':
            print(line.strip() + '\t' + str(score(line.strip())))
        else:
            break

@EricFillion

@EricFillion But how can it be used for a sentence, not for a dataset?
Meaning I want the perplexity of the sentence:
Harry had never believed he would

I just played around with the code @mdasadul posted above. It works perfectly and is nice and concise, and it output the same scores as the official documentation for short inputs.

If you're still interested in using the method from the official documentation, you can replace "'\n\n'.join(test['text'])" with the text whose perplexity you want to determine. You'll also want to add ".item()" to ppl to convert the tensor to a float.
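For a single sentence, that adaptation of the linked guide reduces to roughly the sketch below (my sketch: plain 'gpt2', no sliding-window stride, and a made-up example sentence):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Instead of '\n\n'.join(test['text']), encode just the sentence of interest
encodings = tokenizer("Harry had never believed he would", return_tensors='pt')

with torch.no_grad():
    # With labels == input_ids, the first output is the mean cross-entropy loss
    loss = model(encodings.input_ids, labels=encodings.input_ids)[0]

ppl = torch.exp(loss).item()  # .item() converts the tensor to a float
print(ppl)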

@kaisugi
Contributor

kaisugi commented Jul 23, 2021

This repo is quite useful. It supports Huggingface models.

https://github.com/awslabs/mlm-scoring
