Packing in SFT #805

Closed · wdykas opened this issue Sep 21, 2023 · 20 comments

@wdykas commented Sep 21, 2023

I understand how packing is used in pretraining, but I was looking for some clarification on how samples can be packed for SFT with ConstantLengthDataset. I see that an EOS token is placed between samples (https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L567), but how are attention masks handled so that samples don't attend to each other's context, or is the EOS token alone enough of a signal?

Could you also point me to the code for how multiple answers are handled in generation for loss calculations?

@younesbelkada (Collaborator)

Hi @wdykas !

Packing is a common practice and a trick to enable pre-training / fine-tuning on more sequences. I think adding the EOS token is a sufficient signal for the model.


The trick seems to have been introduced in the T5 paper: https://arxiv.org/pdf/1910.10683.pdf
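
Roughly, the packing idea looks like this (a simplified sketch, not TRL's exact ConstantLengthDataset code; the tokenizer is just a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal-LM tokenizer

def pack_examples(texts, seq_length=1024):
    # Build one long token buffer: <tokens_1> EOS <tokens_2> EOS ...
    buffer = []
    for text in texts:
        buffer.extend(tokenizer(text)["input_ids"])
        buffer.append(tokenizer.eos_token_id)
    # Slice the buffer into constant-length examples; the tail that does not
    # fill a full chunk is dropped, and chunk boundaries can fall mid-sample.
    return [
        buffer[i : i + seq_length]
        for i in range(0, len(buffer) - seq_length + 1, seq_length)
    ]

# Each packed element is a list of exactly seq_length token ids.
packed = pack_examples(["first sample text ...", "second sample text ..."], seq_length=8)
```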

> Could you also point me to the code for how multiple answers are handled in generation for loss calculations?

Can you elaborate more on this question? For the loss calculation we do not perform generation; the model just computes the next-token prediction for every position in a single forward pass.

@wdykas (Author) commented Sep 25, 2023

Sorry, I meant more along the lines of: how can we calculate the loss on the tokens without cross-contamination in attention between sequences? Is there code where you create special attention masks? This seemed to be the issue opened here: huggingface/transformers#25452

@lvwerra (Member) commented Sep 26, 2023

Since the sequences are generally not correlated, there is no issue with cross-contamination. Otherwise you would need to pass a dedicated attention mask, but this is not currently done.

@wdykas (Author) commented Sep 26, 2023

Thanks for the answer! One last quick question: a problem I see in this implementation is that we create a long continuous buffer of text separated by [EOS], and then we create examples of seq_len tokens by naively slicing that buffer (https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L579C35-L579C48) without considering where the EOS tokens fall.
Doesn't this mean that some examples, if not many, will start in the middle of a sequence that was cut off?

@lvwerra (Member) commented Sep 29, 2023

Yes, that's the idea of packing. If you want to avoid that, you can disable packing; then each example will start at the beginning of a sequence and be filled with padding tokens up to the max sequence length in the batch.
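
For illustration, without packing each example is padded to the longest sequence in the batch and the attention_mask marks the padded positions (a minimal sketch; the tokenizer is just a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

batch = tokenizer(
    ["a short sample", "a somewhat longer second sample ..."],
    padding="longest",  # pad to the longest sequence in the batch
    return_tensors="pt",
)
# attention_mask is 0 on the padding positions, so they are ignored by attention.
print(batch["input_ids"].shape, batch["attention_mask"])
```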

@yuanenming

[Figure: sequence packing without cross-contamination between the packed samples]

I think it would be better to add a feature that packs sequences without any kind of cross-contamination, like in the figure above, specifically for the supervised fine-tuning stage. In the SFT stage we usually have short conversations (e.g. << 4096 tokens), which may cause a "training-inference gap" if we don't add the attention mask.

Note that there is also an implicit effect of attending over very long sequences (if we do not add the attention mask) on the entropy [1].

[1] https://kexue.fm/archives/8823
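
For illustration, a contamination-free mask for one packed row could be a block-diagonal causal mask, assuming the per-sample lengths are known (a minimal sketch, not something TRL currently builds):

```python
import torch

def block_causal_mask(sample_lengths):
    """Causal mask restricted to each packed sample (block-diagonal), so
    tokens can only attend to earlier tokens of their own sample."""
    total = sum(sample_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for length in sample_lengths:
        block = torch.tril(torch.ones(length, length, dtype=torch.bool))
        mask[start : start + length, start : start + length] = block
        start += length
    return mask  # True = allowed to attend

# Three samples of lengths 3, 2 and 4 packed into one row of 9 tokens.
print(block_causal_mask([3, 2, 4]).int())
```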

@spartan-minhbui

> [quotes @yuanenming's comment above]

Thank you for the answer!

@Qubitium commented Oct 10, 2023

> Since the sequences are generally not correlated, there is no issue with cross-contamination. Otherwise you would need to pass a dedicated attention mask, but this is not currently done.

I am very confused about the current SFT packing implementation.

My dataset is small, less than 1k rows, and very correlated.

As such, based on what I have gathered in this issue, the current packing code causes my full fine-tune training results to go off the rails in terms of quality versus packing=False.

So my questions are:

  1. How do I manually tweak the attention mask to fix the strong correlation between sequences within a packed output row? Is there example code I can reference?

  2. How can training quality survive the cut-offs where a pre-packed sequence is split across the end of one output row and the beginning of the next, given that the current code naively splits the buffer into fixed-size outputs?

Thanks.

@lvwerra (Member) commented Oct 11, 2023

Packing is mainly a throughput/efficiency technique at scale. For 1000 samples, why don't you just turn packing off?

@Qubitium

> Packing is mainly a throughput/efficiency technique at scale. For 1000 samples, why don't you just turn packing off?

Because GPU waste is waste at any scale. Packing lets a small/micro training run complete in half the time versus packing off. Imagine 1 minute of training time (packing=on) vs. 2 minutes (packing=off), repeated N times over the course of a day for a developer trying to maximize work/time efficiency. We have verified the training-time difference with packing=on, but result quality is an issue for us.

I don't think packing should be reserved for large datasets. In many cases, 1000 rows for fine-tuning is actually overkill; quality of data trumps row count. So 1k may look tiny, but for a laser-focused training run it is quite large.

It's just that the current implementation is not optimized for correlated/small datasets, and/or we are not using it properly, hence my questions on the two points that are causing confusion for us.

@github-actions (bot) commented Nov 5, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@hanyin88

Thanks for sharing this very important topic.

Another likely very relevant consideration is that Hugging Face's BetterTransformer (Fast Kernels in llama-recipes) does not support attention_mask:

> The PyTorch-native scaled_dot_product_attention operator can only dispatch to Flash Attention if no attention_mask is provided. Thus, by default in training mode, the BetterTransformer integration drops the mask support and can only be used for training that does not require a padding mask for batched training. BetterTransformer is not suited for the fine-tuning of models on tasks that require a padding mask.
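
Concretely, PyTorch's SDPA can only dispatch to the flash kernel when no explicit attn_mask is passed, so a per-sample packing or padding mask forces a slower fallback (a small illustration, not code from llama-recipes; the shapes are arbitrary):

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
q = k = v = torch.randn(1, 8, 128, 64, device=device, dtype=dtype)

# Eligible for the Flash Attention kernel (on supported GPUs): no attn_mask,
# causal masking expressed only via is_causal.
out_fast = F.scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=True)

# Passing an explicit mask (e.g. a padding or block-diagonal mask) makes SDPA
# fall back to a non-flash implementation.
mask = torch.tril(torch.ones(128, 128, dtype=torch.bool, device=device))
out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```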

@khai-meetkai

You can use the correct implementation of packing without attention cross-contamination from this repo: https://github.com/MeetKai/functionary/tree/main/functionary/train/packing

@gotzmann

@khai-meetkai will it work with Unsloth? https://github.com/unslothai/unsloth

@cassanof

> You can use the correct implementation of packing without attention cross-contamination from this repo: https://github.com/MeetKai/functionary/tree/main/functionary/train/packing

Not correct. You are using padding...

@Qubitium

> You can use the correct implementation of packing without attention cross-contamination from this repo: https://github.com/MeetKai/functionary/tree/main/functionary/train/packing
>
> Not correct. You are using padding...

It is padding packed entries that are shorter than the max length. Or are you referring to something else? Packing does not mean you can't use padding.

@cassanof

> You can use the correct implementation of packing without attention cross-contamination from this repo: https://github.com/MeetKai/functionary/tree/main/functionary/train/packing
>
> Not correct. You are using padding...
>
> It is padding packed entries that are shorter than the max length. Or are you referring to something else? Packing does not mean you can't use padding.

I think this thread is about the ConstantLengthDatasetLoader, but correct me if I'm wrong. One of the advantages of using CLDL is that you don't use any padding, so training could be more stable in some circumstances / downstream tasks. Yes, you can pack with padding too, but it wouldn't be the same thing.

@Qubitium

> I think this thread is about the ConstantLengthDatasetLoader, but correct me if I'm wrong. One of the advantages of using CLDL is that you don't use any padding, so training could be more stable in some circumstances / downstream tasks. Yes, you can pack with padding too, but it wouldn't be the same thing.

OK, I see your point. I would argue that constant-length packing with no padding, which naively splits sequences just to fill the max length, is throwing some dataset rows out the window. If your training set is millions of rows this may not matter, but for smaller datasets you most definitely want to pack as much as possible into the max length without naive splitting and pad the rest (see the sketch below).

Also, can you point to papers or articles showing that padding in general affects the quality/stability of training? I have never trained without padding, so I have no experience in this regard.
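
For reference, the "pack whole samples greedily, then pad the remainder" grouping mentioned above could be sketched like this (a hypothetical helper, not what ConstantLengthDataset does):

```python
def greedy_pack(sample_lengths, max_len):
    """Group whole samples into rows of at most max_len tokens (first-fit in
    order), so no sample is split across rows; the leftover space in each row
    would then be filled with padding tokens."""
    rows, current, current_len = [], [], 0
    for idx, length in enumerate(sample_lengths):
        if length > max_len:
            raise ValueError(f"sample {idx} is longer than max_len")
        if current_len + length > max_len:
            rows.append(current)
            current, current_len = [], 0
        current.append(idx)
        current_len += length
    if current:
        rows.append(current)
    return rows  # each row is a list of sample indices

print(greedy_pack([900, 500, 600, 300, 100], max_len=1024))
# -> [[0], [1], [2, 3, 4]]
```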

@Peter-Devine commented Mar 1, 2024

> you would need to pass a dedicated attention mask, but this is not currently done

Is this done now? I'd like to know if TRL's implementation attends to all samples at once. I know packages such as Axolotl support multipacking.

Thanks!

@yuleiqin commented Mar 2, 2024

Any updates? I am also interested in using packing for instruction fine-tuning on long-context examples.

In that case, I would combine both short- and long-context examples. Packing is useful for maximizing training efficiency (GPU memory utilization) in each batch. However, the current implementation of packing might not be optimal for SFT datasets:

  1. Each sample within one concatenated example should not attend to the others (sketched below).
  2. No sample should be cut in the middle, because the answer should follow its instruction without any loss of information.

Is there any news on a solution? Thanks in advance for your help.
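
For point 1, implementations that avoid cross-contamination usually also reset position_ids at each sample boundary, so every packed sample is encoded as if it started its own sequence; a minimal sketch of that bookkeeping (hypothetical helper, not TRL code):

```python
import torch

def packed_position_ids(sample_lengths):
    """Position ids that restart at 0 for every packed sample."""
    return torch.cat([torch.arange(length) for length in sample_lengths])

print(packed_position_ids([3, 2, 4]))
# tensor([0, 1, 2, 0, 1, 0, 1, 2, 3])
```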
