Packing in SFT #805

Closed · wdykas opened this issue Sep 21, 2023 · 20 comments

@wdykas commented Sep 21, 2023

I understand how packing is used in pretraining, but I was looking for some clarification on how samples can be packed for SFT with ConstantLengthDataset. I see that an EOS token is placed between samples (https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L567), but how are attention masks handled so that samples don't attend to each other's context, or is the EOS token alone enough of a signal?

Could you also point me to the code for how multiple answers are handled in generation for loss calculations?

@younesbelkada (Collaborator)

Hi @wdykas !

Packing is a common practice and a trick to enable pre-training / fine-tuning on more sequences. I think adding the EOS token is a sufficient signal for the model.


The trick seems to have been introduced in the T5 paper: https://arxiv.org/pdf/1910.10683.pdf
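
Roughly, the packing idea looks like this (a simplified sketch, not TRL's exact ConstantLengthDataset code; the tokenizer is just a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal-LM tokenizer

def pack_examples(texts, seq_length=1024):
    # Build one long token buffer: <tokens_1> EOS <tokens_2> EOS ...
    buffer = []
    for text in texts:
        buffer.extend(tokenizer(text)["input_ids"])
        buffer.append(tokenizer.eos_token_id)
    # Slice the buffer into constant-length examples; the tail that does not
    # fill a full chunk is dropped, and chunk boundaries can fall mid-sample.
    return [
        buffer[i : i + seq_length]
        for i in range(0, len(buffer) - seq_length + 1, seq_length)
    ]

# Each packed element is a list of exactly seq_length token ids.
packed = pack_examples(["first sample text ...", "second sample text ..."], seq_length=8)
```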

> Could you also point me to the code for how multiple answers are handled in generation for loss calculations?

Can you elaborate more on this question? For the loss calculation we do not perform generation; the model just computes the next-token prediction for every position in a single forward pass.

@wdykas (Author) commented Sep 25, 2023

Sorry, I meant more along the lines of: how can we calculate the loss on the tokens without cross-contamination in attention between sequences? Is there code where you create special attention masks? This seemed to be the issue opened here: huggingface/transformers#25452

@lvwerra (Member) commented Sep 26, 2023

Since the sequences are generally not correlated, there is no issue with cross-contamination. Otherwise you would need to pass a dedicated attention mask, but this is not currently done.

@wdykas (Author) commented Sep 26, 2023

Thanks for the answer! One last quick question: a problem I see in this implementation is that we create a long continuous buffer of text separated by [EOS], and then we create examples of seq_len tokens by naively slicing that buffer (https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L579C35-L579C48) without considering where the EOS tokens fall.
Doesn't this mean that some examples, if not many, will start in the middle of a sequence that was cut off?

@lvwerra (Member) commented Sep 29, 2023

Yes, that's the idea of packing. If you want to avoid that, you can disable packing; then each example will start at the beginning of a sequence and be filled with padding tokens up to the max sequence length in the batch.
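
For illustration, without packing each example is padded to the longest sequence in the batch and the attention_mask marks the padded positions (a minimal sketch; the tokenizer is just a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

batch = tokenizer(
    ["a short sample", "a somewhat longer second sample ..."],
    padding="longest",  # pad to the longest sequence in the batch
    return_tensors="pt",
)
# attention_mask is 0 on the padding positions, so they are ignored by attention.
print(batch["input_ids"].shape, batch["attention_mask"])
```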

@yuanenming

[Figure: sequence packing without cross-contamination between the packed samples]

I think it would be better to add a feature that packs sequences without any kind of cross-contamination, like in the figure above, specifically for the supervised fine-tuning stage. In the SFT stage we usually have short conversations (e.g. << 4096 tokens), which may cause a "training-inference gap" if we don't add the attention mask.

Note that there is also an implicit effect of attending over very long sequences (if we do not add the attention mask) on the entropy [1].

[1] https://kexue.fm/archives/8823
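
For illustration, a contamination-free mask for one packed row could be a block-diagonal causal mask, assuming the per-sample lengths are known (a minimal sketch, not something TRL currently builds):

```python
import torch

def block_causal_mask(sample_lengths):
    """Causal mask restricted to each packed sample (block-diagonal), so
    tokens can only attend to earlier tokens of their own sample."""
    total = sum(sample_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for length in sample_lengths:
        block = torch.tril(torch.ones(length, length, dtype=torch.bool))
        mask[start : start + length, start : start + length] = block
        start += length
    return mask  # True = allowed to attend

# Three samples of lengths 3, 2 and 4 packed into one row of 9 tokens.
print(block_causal_mask([3, 2, 4]).int())
```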

@spartan-minhbui

> [quotes @yuanenming's comment above]

Thank you for the answer!

@Qubitium commented Oct 10, 2023

> Since the sequences are generally not correlated, there is no issue with cross-contamination. Otherwise you would need to pass a dedicated attention mask, but this is not currently done.

I am very confused about the current SFT packing implementation.

My dataset is small, less than 1k rows, and very correlated.

As such, based on what I have gathered in this issue, the current packing code causes my full fine-tune training results to go off the rails in terms of quality versus packing=False.

So my questions are:

  1. How do I manually tweak the attention mask to fix the strong correlation between sequences within a packed output row? Is there example code I can reference?

  2. How can training quality survive the cut-offs where a pre-packed sequence is split across the end of one output row and the beginning of the next, given that the current code naively splits the buffer into fixed-size outputs?

Thanks.

@lvwerra (Member) commented Oct 11, 2023

Packing is mainly a throughput/efficiency technique at scale. For 1000 samples, why don't you just turn packing off?

@Qubitium

> Packing is mainly a throughput/efficiency technique at scale. For 1000 samples, why don't you just turn packing off?

Because GPU waste is waste at any scale. Packing lets a small/micro training run complete in half the time versus packing off. Imagine 1 minute of training time (packing=on) vs. 2 minutes (packing=off), repeated N times over the course of a day for a developer trying to maximize work/time efficiency. We have verified the training-time difference with packing=on, but result quality is an issue for us.

I don't think packing should be reserved for large datasets. In many cases, 1000 rows for fine-tuning is actually overkill; quality of data trumps row count. So 1k may look tiny, but for a laser-focused training run it is quite large.

It's just that the current implementation is not optimized for correlated/small datasets, and/or we are not using it properly, hence my questions on the two points that are causing confusion for us.

@github-actions (bot) commented Nov 5, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@hanyin88

Thanks for sharing this very important topic.

Another likely very relevant consideration is that Hugging Face's BetterTransformer (Fast Kernels in llama-recipes) does not support attention_mask:

> The PyTorch-native scaled_dot_product_attention operator can only dispatch to Flash Attention if no attention_mask is provided. Thus, by default in training mode, the BetterTransformer integration drops the mask support and can only be used for training that does not require a padding mask for batched training. BetterTransformer is not suited for the fine-tuning of models on tasks that require a padding mask.
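
Concretely, PyTorch's SDPA can only dispatch to the flash kernel when no explicit attn_mask is passed, so a per-sample packing or padding mask forces a slower fallback (a small illustration, not code from llama-recipes; the shapes are arbitrary):

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
q = k = v = torch.randn(1, 8, 128, 64, device=device, dtype=dtype)

# Eligible for the Flash Attention kernel (on supported GPUs): no attn_mask,
# causal masking expressed only via is_causal.
out_fast = F.scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=True)

# Passing an explicit mask (e.g. a padding or block-diagonal mask) makes SDPA
# fall back to a non-flash implementation.
mask = torch.tril(torch.ones(128, 128, dtype=torch.bool, device=device))
out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```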

@khai-meetkai

You can use the correct implementation of packing without attention cross-contamination from this repo: https://github.com/MeetKai/functionary/tree/main/functionary/train/packing

@gotzmann

@khai-meetkai will it work with Unsloth? https://github.com/unslothai/unsloth

@cassanof

> You can use the correct implementation of packing without attention cross-contamination from this repo: https://github.com/MeetKai/functionary/tree/main/functionary/train/packing

Not correct. You are using padding...

@Qubitium

> You can use the correct implementation of packing without attention cross-contamination from this repo: https://github.com/MeetKai/functionary/tree/main/functionary/train/packing
>
> Not correct. You are using padding...

It is padding packed entries that are shorter than the max length. Or are you referring to something else? Packing does not mean you can't use padding.

@cassanof

> You can use the correct implementation of packing without attention cross-contamination from this repo: https://github.com/MeetKai/functionary/tree/main/functionary/train/packing
>
> Not correct. You are using padding...
>
> It is padding packed entries that are shorter than the max length. Or are you referring to something else? Packing does not mean you can't use padding.

I think this thread is about the ConstantLengthDatasetLoader, but correct me if I'm wrong. One of the advantages of using CLDL is that you don't use any padding, so training could be more stable in some circumstances / downstream tasks. Yes, you can pack with padding too, but it wouldn't be the same thing.

@Qubitium

> I think this thread is about the ConstantLengthDatasetLoader, but correct me if I'm wrong. One of the advantages of using CLDL is that you don't use any padding, so training could be more stable in some circumstances / downstream tasks. Yes, you can pack with padding too, but it wouldn't be the same thing.

OK, I see your point. I would argue that constant-length packing with no padding, which naively splits sequences just to fill the max length, is throwing some dataset rows out the window. If your training set is millions of rows this may not matter, but for smaller datasets you most definitely want to pack as much as possible into the max length without naive splitting and pad the rest (see the sketch below).

Also, can you point to papers or articles showing that padding in general affects the quality/stability of training? I have never trained without padding, so I have no experience in this regard.
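
For reference, the "pack whole samples greedily, then pad the remainder" grouping mentioned above could be sketched like this (a hypothetical helper, not what ConstantLengthDataset does):

```python
def greedy_pack(sample_lengths, max_len):
    """Group whole samples into rows of at most max_len tokens (first-fit in
    order), so no sample is split across rows; the leftover space in each row
    would then be filled with padding tokens."""
    rows, current, current_len = [], [], 0
    for idx, length in enumerate(sample_lengths):
        if length > max_len:
            raise ValueError(f"sample {idx} is longer than max_len")
        if current_len + length > max_len:
            rows.append(current)
            current, current_len = [], 0
        current.append(idx)
        current_len += length
    if current:
        rows.append(current)
    return rows  # each row is a list of sample indices

print(greedy_pack([900, 500, 600, 300, 100], max_len=1024))
# -> [[0], [1], [2, 3, 4]]
```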

@Peter-Devine commented Mar 1, 2024

> you would need to pass a dedicated attention mask, but this is not currently done

Is this done now? I'd like to know if TRL's implementation attends to all samples at once. I know packages such as Axolotl support multipacking.

Thanks!

@yuleiqin commented Mar 2, 2024

Any updates? I am also interested in using packing for instruction fine-tuning on long-context examples.

In that case, I would combine both short- and long-context examples. Packing is useful for maximizing training efficiency (GPU memory utilization) in each batch. However, the current implementation of packing might not be optimal for SFT datasets:

  1. Each sample within one concatenated example should not attend to the others (sketched below).
  2. No sample should be cut in the middle, because the answer should follow its instruction without any loss of information.

Is there any news on a solution? Thanks in advance for your help.
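
For point 1, implementations that avoid cross-contamination usually also reset position_ids at each sample boundary, so every packed sample is encoded as if it started its own sequence; a minimal sketch of that bookkeeping (hypothetical helper, not TRL code):

```python
import torch

def packed_position_ids(sample_lengths):
    """Position ids that restart at 0 for every packed sample."""
    return torch.cat([torch.arange(length) for length in sample_lengths])

print(packed_position_ids([3, 2, 4]))
# tensor([0, 1, 2, 0, 1, 0, 1, 2, 3])
```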
