[BUG] DataCollatorForSeq2Seq with PaddingStrategy.MAX_LENGTH may not pad labels #30521

Closed
muzhi1991 opened this issue Apr 27, 2024 · 4 comments · Fixed by #30556
Comments

@muzhi1991 (Contributor) commented Apr 27, 2024

It seems that when the padding strategy is set to MAX_LENGTH, the labels are not padded to the same max_length as the inputs.

Test case below:

from transformers import DataCollatorForSeq2Seq
from transformers.utils import PaddingStrategy

# tokenizer: any pretrained tokenizer loaded beforehand, e.g. with AutoTokenizer.from_pretrained(...)
inputs = [{'input_ids': [151644, 8948, 198], 'attention_mask': [1, 1, 1], 'labels': [-100, -100, -100]},
          {'input_ids': [151644, 8948, 198, 2610], 'attention_mask': [1, 1, 1, 1], 'labels': [-100, -100, -100, -100]},
          {'input_ids': [151644, 8948, 198, 2610, 525], 'attention_mask': [1, 1, 1, 1, 1], 'labels': [-100, -100, -100, -100, -100]}]
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    padding=PaddingStrategy.MAX_LENGTH,
    max_length=10,
)
res = data_collator(inputs)

print(res['input_ids'].shape, res['labels'].shape)

Results:
torch.Size([3, 10]) torch.Size([3, 5])

Expected results:
torch.Size([3, 10]) torch.Size([3, 10])

Should the following code take the configured padding strategy into account when computing the label pad length?

max_label_length = max(len(l) for l in labels)

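For reference, a tiny standalone illustration of why the labels stop at length 5 (a sketch only, not the collator's actual code): the label pad length is taken from the longest label in the batch, so max_length is never consulted for labels.

# Sketch of the current label-padding behaviour, assuming right-side padding
# with label_pad_token_id = -100 (the collator's default).
label_pad_token_id = -100
labels = [[-100] * 3, [-100] * 4, [-100] * 5]

max_label_length = max(len(l) for l in labels)  # -> 5, max_length=10 is ignored
padded = [l + [label_pad_token_id] * (max_label_length - len(l)) for l in labels]
print([len(l) for l in padded])  # [5, 5, 5]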
muzhi1991 changed the title from "DataCollatorForSeq2Seq with PaddingStrategy.MAX_LENGTH may not pad labels" to "[BUG] DataCollatorForSeq2Seq with PaddingStrategy.MAX_LENGTH may not pad labels" on Apr 27, 2024
@vasqu (Contributor) commented Apr 27, 2024

Thanks for raising this issue! Yeah, that seems like a valid bug imo. The padding strategy isn't respected with max_length.

I'd change these lines:

if labels is not None:
max_label_length = max(len(l) for l in labels)

to something like:

no_padding = self.padding == False or self.padding == PaddingStrategy.DO_NOT_PAD
if labels is not None and not no_padding:
    max_padding = self.padding == PaddingStrategy.MAX_LENGTH and self.max_length is not None
    max_label_length = max(len(l) for l in labels) if not max_padding else self.max_length

DO_NOT_PAD is also not respected, but it doesn't matter too much since it leads to the same end state as longest. So the first line might be unnecessary; it just saves some computation, I guess.
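For reference, a quick standalone sanity check of that computation (a sketch only; the strategies are written as their string values and proposed_max_label_length is just an illustrative helper, not anything in transformers):

def proposed_max_label_length(labels, padding, max_length=None):
    # Mirror the proposed logic: skip label padding when no padding is requested,
    # use max_length for MAX_LENGTH, otherwise fall back to the longest label.
    no_padding = padding is False or padding == 'do_not_pad'
    if labels is None or no_padding:
        return None  # label padding would be skipped entirely
    max_padding = padding == 'max_length' and max_length is not None
    return max_length if max_padding else max(len(l) for l in labels)

labels = [[-100] * 3, [-100] * 4, [-100] * 5]
print(proposed_max_label_length(labels, 'max_length', max_length=10))  # 10
print(proposed_max_label_length(labels, 'longest'))                    # 5
print(proposed_max_label_length(labels, False))                        # None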

Running this for a similar example to yours:

from transformers import BartTokenizer, DataCollatorForSeq2Seq
from transformers.utils import PaddingStrategy

tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
inputs = [{'input_ids': [151644, 8948, 198],'attention_mask': [1, 1, 1],'labels': [1, -100, -100]},
          {'input_ids': [151644, 8948, 198, 2610],'attention_mask': [1, 1, 1, 1],'labels': [2, 5, -100, -100]},
          {'input_ids': [151644, 8948, 198, 2610, 525], 'attention_mask': [1, 1, 1, 1, 1],'labels': [3, 4, 6, -100, -100]}]

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    padding=PaddingStrategy.MAX_LENGTH,
    max_length=10,
)
res = data_collator(inputs)
print(res['input_ids'].shape, res['labels'].shape)

Output: torch.Size([3, 10]) torch.Size([3, 10])

@amyeroberts (Collaborator)

cc @Rocketknight1 as you appear to be the most recent person to touch the data collators :)

@Rocketknight1 (Member)

Transformers maintainer here: Yes, this looks like a bug! Also @vasqu your solution looks good - would you be willing to make a PR to add it?

@vasqu (Contributor) commented Apr 29, 2024

@Rocketknight1 opened a PR at #30556, including some tests for the seq2seq collator since there weren't any.
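A minimal sketch of the kind of regression test that could cover this (hypothetical, not necessarily the test added in #30556):

from transformers import AutoTokenizer, DataCollatorForSeq2Seq

def test_seq2seq_collator_pads_labels_to_max_length():
    # With padding='max_length', labels should come out with the same second
    # dimension as input_ids.
    tokenizer = AutoTokenizer.from_pretrained('facebook/bart-base')
    features = [
        {'input_ids': [0, 1, 2], 'attention_mask': [1, 1, 1], 'labels': [1, -100, -100]},
        {'input_ids': [0, 1, 2, 3, 4], 'attention_mask': [1, 1, 1, 1, 1], 'labels': [3, 4, -100, -100, -100]},
    ]
    collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, padding='max_length', max_length=10)
    batch = collator(features)
    assert batch['input_ids'].shape == (2, 10)
    assert batch['labels'].shape == (2, 10)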
