Weird behavior of BertLMHeadModel and RobertaForCausalLM #13818
Comments
Hey @veronica320! I hope everything is going well with you. From what I have seen, you are using a pre-trained model from Hugging Face's Hub to instantiate the LM head, is that correct? Those pre-trained models were originally trained with a masked language modeling objective, so when they are applied to a causal language modeling task without fine-tuning, they can have a hard time decoding complete sequences. I would suggest attaching the LM head model as you have been doing, but instead of directly trying to predict/score with it, fine-tune (train) it for a few iterations on your dataset with a causal language modeling objective. GPT-2, at least the pre-trained version, was trained with exactly that objective, which is why it works out of the box. Best regards,
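For illustration, a minimal sketch of such a fine-tuning loop is below; the tiny in-memory corpus, epoch count, and learning rate are placeholder assumptions, not recommendations:

```python
from torch.optim import AdamW
from transformers import AutoTokenizer, BertLMHeadModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# is_decoder=True turns on the causal attention mask
model = BertLMHeadModel.from_pretrained("bert-base-uncased", is_decoder=True)
optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()

texts = ["I love you.", "The cat sat on the mat."]  # placeholder corpus
for epoch in range(3):
    for text in texts:
        batch = tokenizer(text, return_tensors="pt")
        # With labels == input_ids the model computes the shifted
        # (next-token) causal LM loss internally.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```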
Hi @gugarosa, thanks a lot! Do you happen to know if there are any such fine-tuned checkpoints for BERT/RoBERTa that I can use directly? I was hoping to get a language model trained on generic English text (e.g. BERT/RoBERTa's pretraining data) and evaluate it directly on my data. Given the size of their pretraining data, is it realistic to do the fine-tuning myself? EDIT: Actually, would you recommend any other models (e.g. BertForMaskedLM?) or evaluation metrics (other than perplexity) instead? Our end goal is just to "score sentences" with BERT/RoBERTa. Thanks again for your help!
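One commonly used metric for "scoring sentences" with BertForMaskedLM is pseudo-log-likelihood: mask each token in turn and sum the log-probabilities the model assigns to the true tokens. A minimal sketch (the `pll_score` helper name is illustrative):

```python
import torch
from transformers import AutoTokenizer, BertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pll_score(text):
    """Sum of log-probabilities of each true token when it is masked."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids[0]
    total = 0.0
    for i in range(1, input_ids.shape[0] - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        total += log_probs[input_ids[i]].item()
    return total  # higher = more plausible

print(pll_score("I love you."), pll_score("I love is."))
```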
Hi all, could you move this discussion to the forums so it can benefit the whole community? We keep the issues for bugs and feature requests only :-)
Yes, I made a post here. Would appreciate it if you could give more suggestions! |
Hi there,
Thanks for putting together this awesome repo!
I ran into two problems when trying to use encoder-based models (e.g. BERT, RoBERTa) for causal language modeling, i.e. scoring the conditional likelihood of texts given previous texts. Namely:

1. BertLMHeadModel scores an implausible ending as more likely than a plausible one.
2. RobertaForCausalLM ranks the endings correctly, but returns extremely large perplexity values.

I would appreciate it if you could kindly help! Description below:
Environment info
transformers version: 4.8.2
Who can help
@sgugger @LysandreJik @patrickvonplaten
(I'm tagging the main contributors to the relevant lines from git blame; apologies if not the right people!)
Information
Models I am using (Bert, XLNet ...): BERT, RoBERTa
The problem arises when using:
Please see the code snippet under "to reproduce".
The task I am working on is:
As mentioned before, I'm hoping to use encoder-based models (e.g. BERT, RoBERTa) for causal language modeling, i.e. scoring the conditional likelihood of texts given previous texts.
To reproduce
Steps to reproduce the behavior:
I'm following this doc and this issue, which were written for GPT-2, and trying to adapt them for BERT and RoBERTa.
Code snippet:
This is a minimal class that I wrote; you can run it directly after substituting the `cache_dir` variable.
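Roughly, the class looks like this (a minimal sketch; the `Scorer` name, the `-100` label masking of context positions, and the hard-coded sentences are illustrative details):

```python
import torch
from transformers import AutoTokenizer, BertLMHeadModel, RobertaForCausalLM


class Scorer:
    def __init__(self, model_name, cache_dir):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
        cls = RobertaForCausalLM if "roberta" in model_name else BertLMHeadModel
        # is_decoder=True enables the causal attention mask
        self.model = cls.from_pretrained(model_name, is_decoder=True, cache_dir=cache_dir)
        self.model.eval()

    def score(self, context, ending):
        """Perplexity of `ending` conditioned on `context`."""
        context_len = self.tokenizer(context, return_tensors="pt").input_ids.shape[1] - 1
        input_ids = self.tokenizer(context + ending, return_tensors="pt").input_ids
        labels = input_ids.clone()
        # Ignore the [CLS] and context positions so that only the ending
        # tokens contribute to the loss (-100 is the loss's ignore index).
        labels[:, :context_len] = -100
        with torch.no_grad():
            loss = self.model(input_ids, labels=labels).loss
        return torch.exp(loss).item()


if __name__ == "__main__":
    cache_dir = None  # substitute your own cache directory
    scorer = Scorer("bert-base-uncased", cache_dir)
    score1 = scorer.score("I love", " you.")  # plausible ending
    score2 = scorer.score("I love", " is.")   # implausible ending
    print(score1, score2)
```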
Expected behavior
The models are expected to score ending1 as more likely than ending2, so score1 should be smaller than score2.
However, when `model_name` is `"bert-base-uncased"`, the output has score1 larger than score2, which means BERT thinks "I love is." is more plausible than "I love you."?
When `model_name` is `"roberta-base"`, the model correctly scores ending1 as more likely, but the perplexity values are extremely large.
Could you please take a look? Thanks in advance for any help!
Best,
Veronica