BART Large generate predictions are wonky #15559
Comments
I have encountered the same problem, and I have also found that …
@patil-suraj - we had another issue thread about this somewhere, no? I can't find it anymore though :-/
@patrickvonplaten
Found the original issue: #9731. Looking a bit into the commit history here: https://huggingface.co/facebook/bart-large/commits/main, it looks like the mask token problem actually only existed for … @patil-suraj - could you double-check this real quick?
I can confirm that bart-large generates very strange output.

```python
from transformers import BartTokenizer, FlaxBartForConditionalGeneration

model_name = "facebook/bart-large"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = FlaxBartForConditionalGeneration.from_pretrained(model_name)
model.params = model.to_bf16(model.params)  # convert to bfloat16 (for TPU)

sentences = (
    'She waded to the bank and picked up her shoes and stockings.',
    'The bank is increasing the amount they lend to small companies.',
)

inputs = tokenizer(sentences, padding=True, return_tensors='jax')
output = model.generate(inputs.input_ids)
print(tokenizer.batch_decode(output.sequences, skip_special_tokens=True, clean_up_tokenization_spaces=False))
```
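(Side note, as a sketch rather than code from this thread: the workaround discussed further down - forcing `<s>` as the first generated token - can presumably be applied in Flax as well, assuming Flax `generate` honors `config.forced_bos_token_id` the way the PyTorch version does.)

```python
# Sketch only: assumes FlaxBartForConditionalGeneration.generate reads
# forced_bos_token_id from the model config, like the PyTorch generate().
from transformers import BartTokenizer, FlaxBartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = FlaxBartForConditionalGeneration.from_pretrained("facebook/bart-large")
model.config.forced_bos_token_id = 0  # force <s> as the first generated token

inputs = tokenizer("The bank is increasing the amount they lend.", return_tensors="jax")
output = model.generate(inputs.input_ids)
print(tokenizer.batch_decode(output.sequences, skip_special_tokens=True))
```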
To me this seems to be unrelated to #9731; …
I see, sorry, you're right - I looked too quickly indeed. Will take a deeper look in the coming days.
Any updates?
Hey @StephAO, you are passing …

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", forced_bos_token_id=0)

batch = tokenizer("My friends are <mask> but they eat too many carbs.", return_tensors="pt")
generated_ids = model.generate(batch["input_ids"])
print(tokenizer.decode(generated_ids[0]))
```

This gives sensible outputs: …

Where did you see code that …?
Good catch. I am actually unsure where/if I saw code that passed …

Lastly, it is still confusing to me why this is required for bart-large, but not for bart-base. Either way, thank you for the update!
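(One quick way to see what the two checkpoints ship with - just a sketch that inspects the configs, not part of the original comment - is to compare them directly:)

```python
from transformers import AutoConfig

# Compare what bart-base and bart-large declare for the BOS-forcing option.
for name in ("facebook/bart-base", "facebook/bart-large"):
    cfg = AutoConfig.from_pretrained(name)
    print(name, "forced_bos_token_id =", getattr(cfg, "forced_bos_token_id", None))
```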
Thanks for the great feedback! Actually, if you would be interested, it would be amazing to open a PR to fix the docs - both the mask filling example and the implementation notes :-) Otherwise I'm happy to open a PR for it as well!
@patrickvonplaten Thanks for the explanation, it helps a lot! Actually, I think it does not only affect mask infilling, but also the fine-tuning procedure (at least on my side). Without the …
Hmm, the problem is that I'm not sure whether this should be done or not …

This is also why only … @sshleifer, sorry to ping you here - do you remember by any chance if …? Also cc @patil-suraj, any ideas?
Sam will have a better answer, but IIRC the …
In my experiments, it actually doesn't matter. It depends on how the … But now the issue is, in BART-like models the …

transformers/src/transformers/models/bart/modeling_bart.py, lines 1397 to 1398 (at 05c237e)

And this will always add …

```python
from transformers import BartTokenizer
from transformers.models.bart.modeling_bart import shift_tokens_right

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

labels = tokenizer("This is a test", return_tensors="pt").input_ids
decoder_input_ids = shift_tokens_right(labels, tokenizer.pad_token_id, tokenizer.eos_token_id)

decoder_input_ids
# tensor([[   2,    0,  713,   16,   10, 1296]])

tokenizer.batch_decode(decoder_input_ids)
# ['</s><s>This is a test']
```

but …
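(For readers who don't want to open the source file: a minimal sketch of what that helper does, mirroring the behavior shown above rather than being copied from the library source:)

```python
import torch

def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int, decoder_start_token_id: int):
    # Shift everything one position to the right and put the decoder start token in front.
    shifted = input_ids.new_zeros(input_ids.shape)
    shifted[:, 1:] = input_ids[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    # Labels often use -100 for ignored positions; replace those with the pad token.
    shifted.masked_fill_(shifted == -100, pad_token_id)
    return shifted
```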
Thanks for the summary @patil-suraj - that's super helpful! So as I understand it, the BartTokenizer will always add the BOS token to the beginning:

```python
from transformers import BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large")
print(tok.decode(tok("hello").input_ids))
# <s>hello</s>
```

This means for Seq2Seq, if someone follows any of our official examples (both accelerate and Trainer), the labels are … with the decoder input ids then (since they are the labels shifted to the right): …

Now, we know from Sam's comment here that it is not necessary to add the BOS token (…). But the problem now is that while we quietly add BOS to the labels in all of our example scripts for fine-tuning because of BartTokenizer's behavior, we don't "force-add" the BOS token when evaluating the model with … On the other hand, one could also argue that the model should learn to always generate … As a conclusion, @patil-suraj and I think we should actually add … Since those two checkpoints are highly used, keen to hear your opinion @LysandreJik @sgugger here.
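(To make the mismatch concrete - a small sketch, not part of the original comment - one can compare the decoder inputs used at training time with what generation starts from:)

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

labels = tok("hello", return_tensors="pt").input_ids            # <s> hello </s>
train_decoder_inputs = model.prepare_decoder_input_ids_from_labels(labels=labels)
print(tok.batch_decode(train_decoder_inputs))                   # ['</s><s>hello']

# During fine-tuning the decoder therefore always sees <s> right after </s>,
# but generate() only seeds the decoder with decoder_start_token_id (</s>)
# and leaves the next token free unless forced_bos_token_id is set.
print(model.config.decoder_start_token_id, model.config.bos_token_id)  # 2 0
```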
Are we 100% sure that the change would only make predictions better in all circumstances?
If someone fine-tunes with a BOS as the second token (behind …)
I'm not talking about fine-tuning, I'm talking about users relying on this model in production right now.
This is a pretrained checkpoint - so I highly doubt anybody uses this model + …
Can this problem cause a model pretrained from scratch to show poor loss and accuracy scores? I already pre-trained a model with TensorFlow Keras and BART and it showed a good logit accuracy (~80%) at base scale. However, I was struggling for months to use BART-large with the same experimental setup; the logit accuracy for BART-large never goes beyond ~4%. I tried everything, including decreasing and increasing the batch size, learning rate, etc., but with no luck.
@salrowili Yeah, I can confirm it will. Compared with …
Did you manage to fix it with the forced_bos_token_id=0 solution?
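(In case it helps anyone fine-tuning: a minimal sketch - not code from this thread - of applying that workaround once on the config, so every later generate() call, e.g. from predict_with_generate, picks it up:)

```python
from transformers import BartForConditionalGeneration

# Set the option once on the config so that all subsequent generate() calls
# force <s> as the first generated token, matching what the model saw in training.
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
model.config.forced_bos_token_id = model.config.bos_token_id  # 0 for BART
```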
According to my efforts now, I can observe the following facts: …

However, some experimental runs will fail when switching the fine-tuning environment (e.g., from single-card to multi-card). And it often comes with a very long evaluation time, since the model is trying to produce a sequence such as …

One failure case can be as below, and the …
Thanks for sharing these interesting findings. I have also conducted a small experiment involving fine-tuning SQuAD with both BART-base and BART-large with PyTorch XLA on a TPU-8 unit with Google Colab (attached). It's based on this colab: https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb . Although I use Google Colab Pro, which has more memory (35GB), I think the free Google Colab gives a TPU with 25GB, so everyone who is interested in replicating this experiment can do it, but you may need to reduce the value of "per_device_train_batch_size" to use less memory (per_device_train_batch_size * 8 = total batch size). If you run into an out-of-memory error, reduce per_device_train_batch_size. If you get a SIG error, restart the colab and re-run all cells needed to restore variables and import packages. At the beginning the training will be slow since XLA compilation needs more time, especially for bart-large (~10 mins). The largest batch size is 16 for bart-large (per_device_train_batch_size=2).
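(As a rough sketch of the kind of setup described above - the hyperparameter values here are purely illustrative, not the ones from the notebook:)

```python
from transformers import Seq2SeqTrainingArguments

# On a TPU-8 the effective batch size is per_device_train_batch_size * 8;
# lower per_device_train_batch_size if you hit out-of-memory errors.
args = Seq2SeqTrainingArguments(
    output_dir="bart-large-squad",        # hypothetical output path
    per_device_train_batch_size=2,        # ~largest that fit for bart-large in this setup
    predict_with_generate=True,
)
```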
@salrowili Thanks for sharing! I think currently this bug will affect all NLG tasks (e.g., summarization), but not NLU tasks (e.g., classification and extractive machine reading comprehension). I have some ideas about why this happens, and would like to share them here later once confirmed.
Thanks! I am still curious about: …
Thank you @SivilTaram for this detailed explanation. Wonderful effort. Could this problem be related to mBART? Because mBART has only a large-scale version, not a base scale, and I think the first token is designed for the language code. See #9811.
Thanks a lot for all the work, guys! I think at this point we can only add …
Also maybe we can ask the official authors what they think about your findings.
Yes, I agree that my finding on perturbing …
Good point @salrowili! I have no idea for now, but I agree that the first token was originally designed to serve mBART rather than BART.
What is your loss score on both, not EM? I am curious to know.
This might have been a false alarm, let me triple check.
Yeah, false alarm - my bad, folks.
Actually, right now with the same transformers BartForConditionalGeneration instance, AllenAI's beam search gives me near 100% EM and Hugging Face's gives me very low EM, with a much lower per-token accuracy (~65%). Both with num_beams of 4 and max length of 500; trying to figure out the difference...
OK, found the cause of the difference: the huggingface …
Can also confirm that gradients of …
Good job! This reinforces my belief that it is …
We just found out that … This might be a reason for this behavior here.
There is a similar issue with … In short, no matter what the encoder input sequences are, the decoder sequence … I am not very convinced by @SivilTaram's reasoning on the norms of …
I have verified the same situation occurs for … Furthermore, the following code snippet confirms the same for …

The results: …
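(The snippet and results referred to above were lost in this copy. A rough way to probe the behavior being described - that the first predicted token barely depends on the encoder input - could look like this; it is a sketch, not the snippet from the comment:)

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

texts = ["The bank raised interest rates.", "She walked along the river bank."]
for text in texts:
    enc = tok(text, return_tensors="pt")
    # Feed only the decoder start token (</s>) and inspect the first prediction.
    dec = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(input_ids=enc.input_ids, decoder_input_ids=dec).logits
    first_id = logits[0, -1].argmax().item()
    print(text, "->", tok.convert_ids_to_tokens([first_id]))
```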
For the record: here are some output examples:

```
example idx: 1
max diff in lm logits: 1.621246337890625e-05
--------------------
predicted token ids: [1640, 1640, 13989, 212, 4038]
predicted tokens: ['(', '(', 'ĠMLS', 'th', 'Ġanniversary']
document tokens: ['<s>', '(', 'CNN', ')', 'On', 'Ġthe', 'Ġ6', 'th', 'Ġof', 'ĠApril', 'Ġ1996', ',', 'ĠSan', 'ĠJose', 'ĠClash', 'Ġand']
========================================
example idx: 10
max diff in lm logits: 7.033348083496094e-06
--------------------
predicted token ids: [11770, 11770, 16, 6308, 5678]
predicted tokens: ['March', 'March', 'Ġis', 'Ġcontains', 'Ġlinks']
document tokens: ['<s>', 'March', 'Ġ10', ',', 'Ġ2015', 'Ġ.', 'ĠWe', "'re", 'Ġtruly', 'Ġinternational', 'Ġin', 'Ġscope', 'Ġon', 'ĠTuesday', '.', 'ĠWe']
========================================
example idx: 20
max diff in lm logits: 7.62939453125e-06
--------------------
predicted token ids: [41650, 41650, 11, 1429, 224]
predicted tokens: ['Tok', 'Tok', 'Ġin', 'ĠJapan', 'Ġsay']
document tokens: ['<s>', 'Tok', 'yo', 'Ġ(', 'CNN', ')', 'Police', 'Ġin', 'ĠJapan', 'Ġsay', 'Ġthey', 'Ġhave', 'Ġarrested', 'Ġa', 'Ġ40', '-']
========================================
example idx: 24
max diff in lm logits: 9.5367431640625e-06
--------------------
predicted token ids: [23122, 23122, 52, 9, 10]
predicted tokens: ['London', 'London', 'Ġwe', 'Ġof', 'Ġa']
document tokens: ['<s>', 'London', 'Ġ(', 'CNN', ')', 'A', 'Ġphoto', 'Ġof', 'Ġa', 'Ġwe', 'asel', 'Ġh', 'itching', 'Ġa', 'Ġsurprise', 'Ġlift']
========================================
example idx: 36
max diff in lm logits: 8.106231689453125e-06
--------------------
predicted token ids: [4030, 4030, 2238, 18, 7396]
predicted tokens: ['New', 'New', 'ĠKorean', "'s", 'izes']
document tokens: ['<s>', 'New', 'ĠDelhi', ',', 'ĠIndia', 'Ġ(', 'CNN', ')', 'The', 'ĠNorth', 'ĠKorean', 'Ġambassador', 'Ġin', 'ĠBangladesh', 'Ġissued', 'Ġan']
```

The script used:

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import datasets

summarization_name_mapping = {
    "cnn_dailymail": ("article", "highlights"),
    "xsum": ("document", "summary"),
}


def get_dataset(dataset_name, dataset_config, tokenizer, n_samples):
    max_source_length = 1024
    max_target_length = 128
    padding = True
    ignore_pad_token_for_loss = True
    padding = "max_length"
    prefix = ""
    max_train_samples = n_samples
    max_eval_samples = n_samples
    preprocessing_num_workers = 8

    raw_datasets = datasets.load_dataset(dataset_name, dataset_config)
    text_column, summary_column = summarization_name_mapping[dataset_name]

    def foo(x):
        if x == tokenizer.cls_token_id:
            return 1
        elif x == tokenizer.pad_token_id:
            return -1
        else:
            return 0

    def preprocess_function(examples):
        # remove pairs where at least one record is None
        inputs, targets = [], []
        for i in range(len(examples[text_column])):
            if examples[text_column][i] and examples[summary_column][i]:
                inputs.append(examples[text_column][i])
                targets.append(examples[summary_column][i])

        inputs = [prefix + inp for inp in inputs]
        model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

        # Tokenize targets with the `text_target` keyword argument
        labels = tokenizer(text_target=targets, max_length=max_target_length, padding=padding, truncation=True)

        # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
        # padding in the loss.
        if padding == "max_length" and ignore_pad_token_for_loss:
            labels["input_ids"] = [
                [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
            ]

        model_inputs["labels"] = labels["input_ids"]

        if tokenizer.__class__.__name__.startswith("LED"):
            model_inputs["global_attention_mask"] = [[foo(y) for y in x] for x in model_inputs["input_ids"]]

        return model_inputs

    train_dataset = raw_datasets["train"]
    eval_dataset = raw_datasets["validation"]

    train_dataset = train_dataset.select(range(max_train_samples))
    eval_dataset = eval_dataset.select(range(max_eval_samples))

    train_dataset = train_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=preprocessing_num_workers,
        remove_columns=[text_column, summary_column, 'id'],
        desc="Running tokenizer on train dataset",
    )

    eval_dataset = eval_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=preprocessing_num_workers,
        remove_columns=[text_column, summary_column, 'id'],
        desc="Running tokenizer on validation dataset",
    )

    return train_dataset, eval_dataset


def check_model(train_dataset, eval_dataset, model, tokenizer, n_samples, text_column, summary_column):
    for idx, eval_example in enumerate(eval_dataset):
        input_ids = eval_example["input_ids"]

        decoder_input_ids = model.prepare_decoder_input_ids_from_labels(labels=torch.tensor([eval_example["labels"]], dtype=torch.int32))
        decoder_input_ids = decoder_input_ids.numpy().tolist()
        eval_example["decoder_input_ids"] = decoder_input_ids[0]  # remove batch dim
        eval_example.pop("labels")

        decoder_input_ids = eval_example.pop("decoder_input_ids")
        eval_example["decoder_input_ids"] = [2, 0] + decoder_input_ids[2:5]

        for k in eval_example:
            eval_example[k] = torch.tensor([eval_example[k]], dtype=torch.int32)

        output = model(**eval_example)

        print(f"example idx: {idx}")
        print(f'max diff in lm logits: {np.amax(np.abs((output.logits[0, 0] - output.logits[0, 1]).detach().to("cpu").numpy()))}')
        print(f"-" * 20)

        pred_ids = torch.argmax(output.logits, dim=-1).detach().to("cpu").numpy().tolist()[0]
        print(f'predicted token ids: {pred_ids}')
        pred_tokens = tokenizer.convert_ids_to_tokens(pred_ids)
        print(f'predicted tokens: {pred_tokens}')
        document_tokens = tokenizer.convert_ids_to_tokens(input_ids)
        print(f'document tokens: {document_tokens[:16]}')

        print(f"=" * 40)


def run(checkpoint_name, dataset_name, dataset_config=None, n_samples=100):
    text_column, summary_column = summarization_name_mapping[dataset_name]

    tokenizer = AutoTokenizer.from_pretrained(checkpoint_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_name)

    train_dataset, eval_dataset = get_dataset(dataset_name, dataset_config=dataset_config, tokenizer=tokenizer, n_samples=n_samples)

    check_model(train_dataset, eval_dataset, model, tokenizer, n_samples, text_column, summary_column)


run("facebook/bart-large", "cnn_dailymail", "3.0.0", n_samples=100)
# run("facebook/bart-large", "xsum", None, n_samples=10)
# run("allenai/led-large-16384", "cnn_dailymail", "3.0.0", n_samples=10)
# run("allenai/led-large-16384", "xsum", None, n_samples=10)
```
I just saw this thread - sorry for all the pain here! The problem was caused by an unfortunate config bug in the original BART-large training run, which caused decoder sequences to start with an extra </s> token. I assume that got fixed in BART-base, which is why it's behaving differently.
Thank you so much for clarifying it @mikelewis0 - well appreciated!
I think you misinterpreted the cause. The extra …
Environment info

transformers version: 4.16.2 (issue exists on 4.9.2)

Who can help

@patrickvonplaten @sshleifer

Information

Essentially re-opening issue #8005: BART-large does not mask-fill properly (whereas BART-base has entirely reasonable outputs). The previous fix of setting force_bos_token_to_be_generated = True is no longer viable since the option no longer exists in the BART config. It also seems like adjust_logits_during_generation (where force_bos_token_to_be_generated was used) is no longer implemented in the BART model.

To reproduce

Steps to reproduce the behavior: …
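(The original reproduction steps were not captured in this copy. Based on the discussion above, a minimal mask-filling repro would look roughly like the following - a sketch, not the reporter's exact code:)

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

batch = tokenizer("My friends are <mask> but they eat too many carbs.", return_tensors="pt")
generated_ids = model.generate(batch["input_ids"])
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
# bart-large tends to produce degenerate output here unless forced_bos_token_id=0 is set,
# while bart-base fills the mask sensibly.
```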