
Weird behavior of BertLMHeadModel and RobertaForCausalLM #13818

Closed

veronica320 opened this issue Sep 30, 2021 · 4 comments


veronica320 commented Sep 30, 2021

Hi there,

Thanks for putting together this awesome repo!

I ran into two problems when trying to use encoder-based models (e.g. BERT, RoBERTa) for causal language modeling, i.e. scoring the conditional likelihood of a text given a preceding text. Namely:

  • RoBERTa has super large perplexity values, and
  • BERT cannot correctly compare the relative perplexity of simple sentences.

Would appreciate it if you could kindly help! Description below:

Environment info

  • transformers version: 4.8.2
  • Platform: linux
  • Python version: 3.7.9
  • PyTorch version (GPU?): 1.5.0 (gpu)
  • Tensorflow version (GPU?): N/A
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@sgugger @LysandreJik @patrickvonplaten
(I'm tagging the main contributors to the relevant lines from git blame; apologies if not the right people!)

Information

Models I am using (Bert, XLNet ...): BERT, RoBERTa

The problem arises when using:

  • my own modified scripts: (give details below)
    Please see the code snippet under "to reproduce".

The task I am working on is:

  • my own task or dataset: (give details below)
    As mentioned before, I'm hoping to use encoder-based models (e.g. BERT, RoBERTa) for causal language modeling, i.e. scoring the conditional likelihood of texts given previous texts.

To reproduce

Steps to reproduce the behavior:
I'm following this doc and this issue, which were written for GPT-2, and trying to adapt them for BERT and RoBERTa.

  1. Load the pretrained models
  2. Feed the prompt and the ending into the model
  3. Get the loss, and exponentiate it

Code snippet:
This is a minimal class that I wrote. You can run it directly after substituting the cache_dir variable.

import os
import numpy as np
import csv
import math

cuda = "0"
os.environ["CUDA_VISIBLE_DEVICES"] = cuda
import torch

from transformers import GPT2LMHeadModel, GPT2TokenizerFast, GPT2Config
from transformers import BertTokenizer, BertLMHeadModel, BertConfig
from transformers import RobertaTokenizer, RobertaForCausalLM, RobertaConfig

cache_dir=".cache/transformers"

class PPLScorer():
	"""A LM scorer for the conditional perplexity of an ending given a prompt."""

	def __init__(self, model_name):
		"""Initialize model and tokenizer."""

		self.model_name = model_name
		if "gpt2" in model_name:
			with torch.no_grad():
				self.LM = GPT2LMHeadModel.from_pretrained(model_name, cache_dir=cache_dir).to("cuda:0")
				self.LM.eval()
			self.tokenizer = GPT2TokenizerFast.from_pretrained(model_name, cache_dir=cache_dir)

		elif "roberta" in model_name:
			with torch.no_grad():
				config = RobertaConfig.from_pretrained(model_name)
				config.is_decoder = True # We'd like to use it as a standalone decoder
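				# Note: is_decoder=True only enables the causal attention mask;
				# the pre-trained weights themselves still come from masked language modeling.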
				self.LM = RobertaForCausalLM.from_pretrained(model_name, config=config, cache_dir=cache_dir).to("cuda:0")
				self.LM.eval()
			self.tokenizer = RobertaTokenizer.from_pretrained(model_name, cache_dir=cache_dir)

		elif "bert" in model_name:
			with torch.no_grad():
				config = BertConfig.from_pretrained(model_name)
				config.is_decoder = True # We'd like to use it as a standalone decoder
				self.LM = BertLMHeadModel.from_pretrained(model_name, config=config, cache_dir=cache_dir).to("cuda:0")
				self.LM.eval()
			self.tokenizer = BertTokenizer.from_pretrained(model_name, cache_dir=cache_dir)

		else:
			raise ValueError(f"Unknown model name: {model_name}.")

	def get_input_ids(self, sentence, is_ending=False):
		"""Tokenize a sentence and return its subword tokens (converted to ids by the caller)."""

		tokens = self.tokenizer.tokenize(sentence)
		# The GPT-2 and RoBERTa BPE tokenizers prefix a 'Ġ' to tokens that follow a space.
		# Since the ending is tokenized separately, we add it to its first token manually.
		if "gpt" in self.model_name or "roberta" in self.model_name:
			if is_ending:
				tokens[0] = 'Ġ' + tokens[0]
		return tokens


	def score_conditional(self, prompt, ending):
		"""Get the conditional likelihood of the ending given the prompt."""

		prompt_tokens = self.get_input_ids(prompt, is_ending=False)
		ending_tokens = self.get_input_ids(ending, is_ending=True)

		all_tokens = prompt_tokens + ending_tokens
		input_ids = torch.tensor([self.tokenizer.convert_tokens_to_ids(all_tokens)]).to("cuda:0")
		target_ids = input_ids.clone()
        
		# ignore the loss on the prompt tokens
		target_ids[:, :len(prompt_tokens)] = -100
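		# With labels provided, outputs[0] is the mean negative log-likelihood over the
		# non-masked (ending) tokens, so exp() below yields a per-token perplexity.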

		with torch.no_grad():
			outputs = self.LM(input_ids, labels=target_ids)
			neg_log_likelihood = outputs[0].item()

		ppl = math.exp(neg_log_likelihood)

		return ppl

if __name__ == "__main__":
	# You can modify model_name 
	model_name = ["bert-base-uncased",
				  "roberta-base",
				  "gpt2",
				  ][1]

	conditional_LM = PPLScorer(model_name)

	# Test 1    
	prompt = "I love"
	ending1 = "you."
	ending2 = "is."

	score1 = conditional_LM.score_conditional(prompt, ending1)
	score2 = conditional_LM.score_conditional(prompt, ending2)
	print(score1, score2)
	if score1 < score2:
		print("Ending 1 is more likely.")
	elif score1 > score2:
		print("Ending 2 is more likely.")
	else:
		print("Equally likely.")

Expected behavior

The models are expected to score ending1 as more likely than ending2, so score1 (its perplexity) should be smaller than score2.
However,

  1. When model_name is "bert-base-uncased", the output is:

801.6779910371988 432.06698212552516
Ending 2 is more likely.

which means BERT thinks "I love is." is more plausible than "I love you."?

  1. When model_name is "roberta-base", the output is:

7402846.771924554 950510861.61753
Ending 1 is more likely.

Though it correctly scores ending1 as more likely, the perplexity values are super large.

  3. We also tried a couple of other sentences and model variants (e.g. bert-large, roberta-large), but the problems persist. In contrast, GPT-2-based models have no such issue: the comparison is always correct, and the perplexity values are usually in the tens to hundreds.

Could you please take a look? Thanks in advance for any help!

Best,
Veronica

gugarosa (Contributor) commented Oct 1, 2021

Hey @veronica320! I hope everything is going well with you.

From what I can see, you are instantiating the LM head from a pre-trained checkpoint on the Hugging Face Hub, is that correct?

It might be the case that these pre-trained models were originally trained with a masked language modeling objective, so when they are applied to a causal language modeling task without fine-tuning, they have a hard time decoding complete sequences.

I would suggest attaching the LM head as you have been doing, but instead of scoring directly, first fine-tune (train) the model for a few iterations on your dataset with a causal language modeling objective, as sketched below.

GPT-2, or at least the pre-trained gpt2 checkpoint, does not have this problem because it was pre-trained with a causal language modeling objective, which is essentially what you are trying to achieve.
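For reference, here is a rough, untested sketch of what such fine-tuning could look like (the toy texts list and the hyperparameters are placeholders; substitute your own dataset):

import torch
from transformers import RobertaConfig, RobertaForCausalLM, RobertaTokenizer

# Placeholder corpus; replace with your own training texts.
texts = ["I love you.", "The weather is nice today."]

config = RobertaConfig.from_pretrained("roberta-base")
config.is_decoder = True  # use RoBERTa as a standalone decoder
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForCausalLM.from_pretrained("roberta-base", config=config).to("cuda:0")
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):  # "a few iterations"
    for text in texts:
        enc = tokenizer(text, return_tensors="pt").to("cuda:0")
        # Passing labels=input_ids makes the model compute the causal LM loss internally.
        outputs = model(**enc, labels=enc["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()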

Best regards,
Gustavo.

veronica320 (Author) commented Oct 1, 2021

Hi @gugarosa, thanks a lot! Do you happen to know if there are any such fine-tuned checkpoints for BERT/RoBERTa that I can use directly?

I was hoping to get a language model trained on generic English text (e.g. BERT/RoBERTa's pretraining data) and evaluate it directly on my data. Given the size of their pretraining corpora, is it realistic to do the fine-tuning myself?

EDIT: Actually, would you recommend any other models (e.g. BertForMaskedLM?) or evaluation metrics (other than perplexity) instead? Our end goal is just to "score sentences" with BERT/RoBERTa.
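For concreteness, this is roughly the kind of masked-LM (pseudo-perplexity) scoring I had in mind with BertForMaskedLM; just a quick sketch, not verified:

import math
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").to("cuda:0")
model.eval()

def pseudo_perplexity(sentence):
    """Mask each token in turn and average its masked-LM negative log-likelihood."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"].to("cuda:0")
    nlls = []
    for i in range(1, input_ids.size(1) - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        nlls.append(-log_probs[input_ids[0, i]].item())
    return math.exp(sum(nlls) / len(nlls))

print(pseudo_perplexity("I love you."))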

Thanks again for your help!

sgugger (Collaborator) commented Oct 1, 2021

Hi all, could you move this discussion to the forums so it can benefit the whole community? We keep the issues for bugs and feature requests only :-)
Thank you!

veronica320 (Author) commented:

Yes, I made a post here. Would appreciate it if you could give more suggestions!
