using BERT as a language Model #37

Closed · mdasadul opened this issue Nov 19, 2018 · 18 comments

Comments
@mdasadul

mdasadul commented Nov 19, 2018

I was trying to use BERT as a language model to assign a score (it could be a perplexity score) to a given sentence, something like
P("He is go to school") = 0.008
P("He is going to school") = 0.08
which would indicate that the second sentence is more probable than the first. Is there a way to get a score like this?

Thanks

@thomwolf
Member

I don't think you can do that with BERT. The masked LM loss is not a language modeling loss; it doesn't compose with the chain rule the way a standard language modeling loss does.
Please see the discussion on that in the TensorFlow repo here.
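(As a gloss on the chain-rule point, not part of the original reply:) a left-to-right LM defines a proper joint probability,

P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}),

so per-token log-probabilities sum to a true sentence log-probability, whereas BERT's masked-LM conditionals P(w_i \mid w_1, \dots, w_{i-1}, w_{i+1}, \dots, w_n) condition on both sides of each position, and their product is only a pseudo-likelihood, not a proper joint probability.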

@mdasadul
Author

Hello @thomwolf, I can see it is possible to assign a score with BERT by masking each word sequentially and then scoring the sentence as the sum of the per-word scores. Here is how people have been doing it in TensorFlow. I am trying the following:

import numpy as np
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

# Load pre-trained model (weights)
with torch.no_grad():
    model = BertForMaskedLM.from_pretrained('bert-large-cased')
    model.eval()
    # Load pre-trained model tokenizer (vocabulary)
    tokenizer = BertTokenizer.from_pretrained('bert-large-cased')

def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sentence_loss = 0.
    for i, word in enumerate(tokenize_input):
        tokenize_input[i] = '[MASK]'
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
        word_loss = model(mask_input, masked_lm_labels=tensor_input).data.numpy()
        sentence_loss += word_loss
        # print("Word: %s : %f" % (word, np.exp(-word_loss)))
    return np.exp(sentence_loss / len(tokenize_input))

score("There is a book on the table")
88.899999

Is this the right way to assign a score using BERT?

@zhangyichang

(quoting @mdasadul's comment and code above)

No, that's not quite right: you mask each word but never restore it, so the input keeps accumulating [MASK] tokens as the loop runs.
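A minimal sketch of the corrected loop (my sketch, using the same pytorch_pretrained_bert API as the snippet above): mask one position at a time on a copy of the token list, so the original tokens are never overwritten.

import numpy as np
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

# Load pre-trained model (weights) and tokenizer (vocabulary)
model = BertForMaskedLM.from_pretrained('bert-large-cased')
model.eval()
tokenizer = BertTokenizer.from_pretrained('bert-large-cased')

def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sentence_loss = 0.
    for i in range(len(tokenize_input)):
        # Copy the token list so only position i is masked in this iteration
        masked_tokens = list(tokenize_input)
        masked_tokens[i] = '[MASK]'
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(masked_tokens)])
        with torch.no_grad():
            # With masked_lm_labels set, the model returns the cross-entropy loss
            word_loss = model(mask_input, masked_lm_labels=tensor_input).item()
        sentence_loss += word_loss
    return np.exp(sentence_loss / len(tokenize_input))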

@orenpapers

@mdasadul Did you manage to do it?

@mdasadul
Author

mdasadul commented May 27, 2020 via email

@orenpapers

@mdasadul Do you mean this one?
https://twitter.com/mdasaduluofa/status/1181917072999231489/photo/1
I see this is for GPT-2; do you have code for BERT?

@mdasadul
Author

mdasadul commented Jun 1, 2020

It should be similar. The following code is for DistilBERT:

import math
import json

import torch
import torch.multiprocessing as mp
from torch.multiprocessing import TimeoutError, Pool, set_start_method, Queue
from transformers import DistilBertTokenizer, DistilBertForMaskedLM
from flask import Flask, request

try:
    set_start_method('spawn')
except RuntimeError:
    pass

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def load_model():
    model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased').to(device)
    model.eval()
    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
    return tokenizer, model

tokenizer, model = load_model()

def score(sentence):
    # Guard against empty/one-word inputs and inputs longer than the model's limit
    if len(sentence.strip().split()) <= 1:
        return 10000
    tokenize_input = tokenizer.tokenize(sentence)
    if len(tokenize_input) > 512:
        return 10000
    input_ids = torch.tensor(tokenizer.encode(tokenize_input)).unsqueeze(0).to(device)
    with torch.no_grad():
        loss = model(input_ids, masked_lm_labels=input_ids)[0]
    return math.exp(loss.item() / len(tokenize_input))
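For completeness, a usage sketch for the score function above (the example sentence is my own, not from the thread); a lower value means the model finds the text more predictable:

print(score("There is a book on the table"))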

@orenpapers

orenpapers commented Jun 1, 2020

@mdasadul I get the error:
TypeError: forward() got an unexpected keyword argument 'masked_lm_labels'
Also, can you please explain why the following steps are necessary:

  1. unsqueeze(0)
  2. add torch.no_grad()
  3. add model.eval()

@nlp-sudo

nlp-sudo commented Jul 7, 2020

The score is equivalent to perplexity, so the lower the score, the better the sentence, right?

@mdasadul
Author

mdasadul commented Jul 7, 2020 via email

@orenschonlab

@mdasadul I get the error:

    return math.exp(loss.item() / len(tokenize_input))
ValueError: only one element tensors can be converted to Python scalars

Any idea why?

@mdasadul
Author

mdasadul commented Mar 14, 2021 via email

@orenschonlab

@mdasadul I have a sentence with more than one word and still get the error.
The sentence is 'Harry had never believed he would'
and input_ids is tensor([[ 101, 4302, 2018, 2196, 3373, 2002, 2052, 102]])

@EricFillion

Below is an example from the official docs on how to use GPT-2 to determine perplexity.

https://huggingface.co/transformers/perplexity.html

@orenschonlab

@EricFillion But how can it be used for a sentence, not for a dataset?
Meaning I want the perplexity of the sentence:
Harry had never believed he would

@mdasadul
Author

@orenschonlab Try the code below:

import sys

import numpy as np
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model (weights) and tokenizer (vocabulary)
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def score(sentence):
    tokenize_input = tokenizer.encode(sentence)
    tensor_input = torch.tensor([tokenize_input])
    # With labels == inputs, the first output is the mean cross-entropy loss
    with torch.no_grad():
        loss = model(tensor_input, labels=tensor_input)[0]
    return np.exp(loss.item())

if __name__ == '__main__':
    for line in sys.stdin:
        if line.strip() != '':
            print(line.strip() + '\t' + str(score(line.strip())))
        else:
            break

@EricFillion

@EricFillion But how can it be used for a sentence, not for a dataset?
Meaning I want the perplexity of the sentence:
Harry had never believed he would

I just played around with the code @mdasadul posted above. It works perfectly and is nice and concise, and it output the same scores as the official documentation for short inputs.

If you're still interested in using the method from the official documentation, you can replace "'\n\n'.join(test['text'])" with the text whose perplexity you want to determine. You'll also want to add ".item()" to ppl to convert the tensor to a float.
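For a single sentence, that adaptation of the linked guide reduces to roughly the sketch below (my sketch: plain 'gpt2', no sliding-window stride, and a made-up example sentence):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Instead of '\n\n'.join(test['text']), encode just the sentence of interest
encodings = tokenizer("Harry had never believed he would", return_tensors='pt')

with torch.no_grad():
    # With labels == input_ids, the first output is the mean cross-entropy loss
    loss = model(encodings.input_ids, labels=encodings.input_ids)[0]

ppl = torch.exp(loss).item()  # .item() converts the tensor to a float
print(ppl)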

@kaisugi
Contributor

kaisugi commented Jul 23, 2021

This repo is quite useful. It supports Huggingface models.

https://github.com/awslabs/mlm-scoring
