
Weird behavior of BertLMHeadModel and RobertaForCausalLM #13818

Closed

veronica320 opened this issue Sep 30, 2021 · 4 comments


veronica320 commented Sep 30, 2021

Hi there,

Thanks for putting together this awesome repo!

I ran into two problems when trying to use encoder-based models (e.g. BERT, RoBERTa) for causal language modeling, i.e. scoring the conditional likelihood of a text given a preceding text. Namely:

  • RoBERTa has super large perplexity values, and
  • BERT cannot correctly compare the relative perplexity of simple sentences.

Would appreciate it if you could kindly help! Description below:

Environment info

  • transformers version: 4.8.2
  • Platform: linux
  • Python version: 3.7.9
  • PyTorch version (GPU?): 1.5.0 (gpu)
  • Tensorflow version (GPU?): N/A
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@sgugger @LysandreJik @patrickvonplaten
(I'm tagging the main contributors to the relevant lines from git blame; apologies if not the right people!)

Information

Models I am using (Bert, XLNet ...): BERT, RoBERTa

The problem arises when using:

  • my own modified scripts: (give details below)
    Please see the code snippet under "to reproduce".

The task I am working on is:

  • my own task or dataset: (give details below)
    As mentioned before, I'm hoping to use encoder-based models (e.g. BERT, RoBERTa) for causal language modeling, i.e. scoring the conditional likelihood of texts given previous texts.

To reproduce

Steps to reproduce the behavior:
I'm following this doc and this issue, which were written for GPT-2, and trying to adapt them for BERT and RoBERTa.

  1. Load the pretrained models
  2. Feed the prompt and the ending into the model
  3. Get the loss, and exponentiate it

Code snippet:
This is a minimal class that I wrote. You can run it directly after substituting the cache_dir variable.

import os
import numpy as np
import csv
import math

cuda = "0"
os.environ["CUDA_VISIBLE_DEVICES"] = cuda
import torch

from transformers import GPT2LMHeadModel, GPT2TokenizerFast, GPT2Config
from transformers import BertTokenizer, BertLMHeadModel, BertConfig
from transformers import RobertaTokenizer, RobertaForCausalLM, RobertaConfig

cache_dir=".cache/transformers"

class PPLScorer():
	"""A LM scorer for the conditional perplexity of an ending given a prompt."""

	def __init__(self, model_name):
		"""Initialize model and tokenizer."""

		self.model_name = model_name
		if "gpt2" in model_name:
			with torch.no_grad():
				self.LM = GPT2LMHeadModel.from_pretrained(model_name, cache_dir=cache_dir).to("cuda:0")
				self.LM.eval()
			self.tokenizer = GPT2TokenizerFast.from_pretrained(model_name, cache_dir=cache_dir)

		elif "roberta" in model_name:
			with torch.no_grad():
				config = RobertaConfig.from_pretrained(model_name)
				config.is_decoder = True # We'd like to use it as a standalone decoder
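				# Note: is_decoder=True only enables the causal attention mask;
				# the pre-trained weights themselves still come from masked language modeling.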
				self.LM = RobertaForCausalLM.from_pretrained(model_name, config=config, cache_dir=cache_dir).to("cuda:0")
				self.LM.eval()
			self.tokenizer = RobertaTokenizer.from_pretrained(model_name, cache_dir=cache_dir)

		elif "bert" in model_name:
			with torch.no_grad():
				config = BertConfig.from_pretrained(model_name)
				config.is_decoder = True # We'd like to use it as a standalone decoder
				self.LM = BertLMHeadModel.from_pretrained(model_name, config=config, cache_dir=cache_dir).to("cuda:0")
				self.LM.eval()
			self.tokenizer = BertTokenizer.from_pretrained(model_name, cache_dir=cache_dir)

		else:
			raise ValueError(f"Unknown model name: {model_name}.")

	def get_input_ids(self, sentence, is_ending=False):
		"""Tokenize a sentence and return its subword tokens (converted to ids by the caller)."""

		tokens = self.tokenizer.tokenize(sentence)
		# The GPT-2 and RoBERTa BPE tokenizers prefix a 'Ġ' to tokens that follow a space.
		# Since the ending is tokenized separately, we add it to its first token manually.
		if "gpt" in self.model_name or "roberta" in self.model_name:
			if is_ending:
				tokens[0] = 'Ġ' + tokens[0]
		return tokens


	def score_conditional(self, prompt, ending):
		"""Get the conditional likelihood of the ending given the prompt."""

		prompt_tokens = self.get_input_ids(prompt, is_ending=False)
		ending_tokens = self.get_input_ids(ending, is_ending=True)

		all_tokens = prompt_tokens + ending_tokens
		input_ids = torch.tensor([self.tokenizer.convert_tokens_to_ids(all_tokens)]).to("cuda:0")
		target_ids = input_ids.clone()
        
		# ignore the loss on the prompt tokens
		target_ids[:, :len(prompt_tokens)] = -100
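		# With labels provided, outputs[0] is the mean negative log-likelihood over the
		# non-masked (ending) tokens, so exp() below yields a per-token perplexity.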

		with torch.no_grad():
			outputs = self.LM(input_ids, labels=target_ids)
			neg_log_likelihood = outputs[0].item()

		ppl = math.exp(neg_log_likelihood)

		return ppl

if __name__ == "__main__":
	# You can modify model_name 
	model_name = ["bert-base-uncased",
				  "roberta-base",
				  "gpt2",
				  ][1]

	conditional_LM = PPLScorer(model_name)

	# Test 1    
	prompt = "I love"
	ending1 = "you."
	ending2 = "is."

	score1 = conditional_LM.score_conditional(prompt, ending1)
	score2 = conditional_LM.score_conditional(prompt, ending2)
	print(score1, score2)
	if score1 < score2:
		print("Ending 1 is more likely.")
	elif score1 > score2:
		print("Ending 2 is more likely.")
	else:
		print("Equally likely.")

Expected behavior

The models are expected to score ending1 as more likely than ending2, so score1 (its perplexity) should be smaller than score2.
However,

  1. When model_name is "bert-base-uncased", the output is:

801.6779910371988 432.06698212552516
Ending 2 is more likely.

which means BERT thinks "I love is." is more plausible than "I love you."?

  1. When model_name is "roberta-base", the output is:

7402846.771924554 950510861.61753
Ending 1 is more likely.

Though it correctly scores ending1 as more likely, the perplexity values are super large.

  3. We also tried a couple of other sentences and model variants (e.g. bert-large, roberta-large), but the problems persist. In contrast, GPT-2-based models have no such issue: the comparison is always correct, and the perplexity values are usually in the tens to hundreds.

Could you please take a look? Thanks in advance for any help!

Best,
Veronica

gugarosa (Contributor) commented Oct 1, 2021

Hey @veronica320! I hope everything is going well with you.

From what I can see, you are instantiating the LM head from a pre-trained checkpoint on the Hugging Face Hub, is that correct?

It might be the case that these pre-trained models were originally trained with a masked language modeling objective, so when they are applied to a causal language modeling task without fine-tuning, they have a hard time decoding complete sequences.

I would suggest attaching the LM head as you have been doing, but instead of scoring directly, first fine-tune (train) the model for a few iterations on your dataset with a causal language modeling objective, as sketched below.

GPT-2, or at least the pre-trained gpt2 checkpoint, does not have this problem because it was pre-trained with a causal language modeling objective, which is essentially what you are trying to achieve.
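For reference, here is a rough, untested sketch of what such fine-tuning could look like (the toy texts list and the hyperparameters are placeholders; substitute your own dataset):

import torch
from transformers import RobertaConfig, RobertaForCausalLM, RobertaTokenizer

# Placeholder corpus; replace with your own training texts.
texts = ["I love you.", "The weather is nice today."]

config = RobertaConfig.from_pretrained("roberta-base")
config.is_decoder = True  # use RoBERTa as a standalone decoder
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForCausalLM.from_pretrained("roberta-base", config=config).to("cuda:0")
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):  # "a few iterations"
    for text in texts:
        enc = tokenizer(text, return_tensors="pt").to("cuda:0")
        # Passing labels=input_ids makes the model compute the causal LM loss internally.
        outputs = model(**enc, labels=enc["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()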

Best regards,
Gustavo.

veronica320 (Author) commented Oct 1, 2021

Hi @gugarosa, thanks a lot! Do you happen to know if there are any such fine-tuned checkpoints for BERT/RoBERTa that I can use directly?

I was hoping to get a language model trained on generic English text (e.g. BERT/RoBERTa's pretraining data) and evaluate it directly on my data. Given the size of their pretraining corpora, is it realistic to do the fine-tuning myself?

EDIT: Actually, would you recommend any other models (e.g. BertForMaskedLM?) or evaluation metrics (other than perplexity) instead? Our end goal is just to "score sentences" with BERT/RoBERTa.
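For concreteness, this is roughly the kind of masked-LM (pseudo-perplexity) scoring I had in mind with BertForMaskedLM; just a quick sketch, not verified:

import math
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").to("cuda:0")
model.eval()

def pseudo_perplexity(sentence):
    """Mask each token in turn and average its masked-LM negative log-likelihood."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"].to("cuda:0")
    nlls = []
    for i in range(1, input_ids.size(1) - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        nlls.append(-log_probs[input_ids[0, i]].item())
    return math.exp(sum(nlls) / len(nlls))

print(pseudo_perplexity("I love you."))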

Thanks again for your help!

sgugger (Collaborator) commented Oct 1, 2021

Hi all, could you move this discussion to the forums so it can benefit the whole community? We keep the issues for bugs and feature requests only :-)
Thank you!

veronica320 (Author) commented:

Yes, I made a post here. Would appreciate it if you could give more suggestions!
