
Implement LongLoRA trick for efficient tuning of long-context models #958

Closed
casper-hansen opened this issue Sep 25, 2023 · 19 comments

@casper-hansen commented Sep 25, 2023

Feature request

The authors of LongLoRA explore a trick you can toggle on during training and toggle off during inference. The key takeaways are:

  • LoRA perplexity deteriorates as context length increases; LongLoRA solves this and matches FP16 perplexity at longer contexts
  • It can train on longer contexts much faster than plain LoRA while keeping the same VRAM usage as LoRA


Paper: https://arxiv.org/pdf/2309.12307.pdf
Code: https://github.com/dvlab-research/LongLoRA
Trick is implemented here: https://github.com/dvlab-research/LongLoRA/blob/4cbb61b93ac8c7ea4db7ffc24a824f3fb153554d/llama_attn_replace.py

Motivation

The key motivation is to enable a LoRA method that scales better when fine-tuning models with longer context sizes. We know this to be expensive both in compute hours and memory requirements, but LongLoRA seems to optimize for exactly this. Using LongLoRA, we can train models faster while matching FP16 perplexity at longer context sizes.

Your contribution

The trick is implemented as shown below. It is a two-step process.

Shift short attention:

def shift(qkv, bsz, q_len, group_size, num_heads, head_dim):
    # qkv: (bsz, num_heads, q_len, head_dim); roll the second half of the heads
    # by half a group along the sequence dimension
    qkv[:, num_heads // 2:] = qkv[:, num_heads // 2:].roll(-group_size // 2, dims=2)
    # split the sequence into groups of `group_size` tokens and fold the groups
    # into the batch dimension: -> (bsz * (q_len // group_size), num_heads, group_size, head_dim)
    qkv = qkv.transpose(1, 2).reshape(bsz * (q_len // group_size), group_size, num_heads, head_dim).transpose(1, 2)
    return qkv

Unpacking after attention computations:

# output is (bsz, q_len, num_heads, head_dim) here; roll the shifted half of the heads back into place
output[:, :, self.num_heads//2:] = output[:, :, self.num_heads//2:].roll(group_size//2, dims=1)
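
For illustration, here is a minimal, self-contained sketch of the full round trip with dummy tensors. It uses PyTorch's scaled_dot_product_attention in place of the flash-attention kernel from the repo, and it simplifies the causal-mask handling for the shifted heads, so treat it as a toy version of the idea rather than the reference implementation:

import torch
import torch.nn.functional as F

# toy shapes; the paper typically uses a group size of q_len // 4
bsz, num_heads, q_len, head_dim = 2, 8, 1024, 64
group_size = q_len // 4

def shift(qkv, bsz, q_len, group_size, num_heads, head_dim):
    # same shift as above: roll half the heads, then fold groups into the batch dim
    qkv[:, num_heads // 2:] = qkv[:, num_heads // 2:].roll(-group_size // 2, dims=2)
    qkv = qkv.transpose(1, 2).reshape(bsz * (q_len // group_size), group_size, num_heads, head_dim).transpose(1, 2)
    return qkv

q, k, v = (torch.randn(bsz, num_heads, q_len, head_dim) for _ in range(3))
q, k, v = (shift(t, bsz, q_len, group_size, num_heads, head_dim) for t in (q, k, v))

# attention now runs per group: (bsz * n_groups, num_heads, group_size, head_dim)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# unpack the groups back into the sequence dimension...
out = out.transpose(1, 2).reshape(bsz, q_len, num_heads, head_dim)
# ...and roll the shifted half of the heads back into place
out[:, :, num_heads // 2:] = out[:, :, num_heads // 2:].roll(group_size // 2, dims=1)

At inference time the trick is simply not applied, so the trained weights are used with standard full attention.
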
@BenjaminBossan (Member)

This looks very promising, especially if it really just requires a couple of lines of extra code. Thanks for bringing this to our attention. Do you want to work on adding this?

@casper-hansen (Author)

@BenjaminBossan I am currently occupied with a lot of other work, and I do not know where to start with implementing this in PEFT, so I would hope one of the great HF MLEs can pick this up🙏

@BenjaminBossan (Member)

Haha, let's see if flattery does the trick for you ;-)

I was mainly asking because it was not clear to me from your post if you wanted to add it or just suggested it to be added, not to push you.

@casper-hansen (Author) commented Sep 26, 2023

Haha, maybe it will :)

I mainly posted here because I saw it could have a large benefit for all users looking to train LLMs with LoRA/QLoRA and long contexts.

@marcasty

I'd like to add this feature if one of the great HF MLEs hasn't picked it up yet ;)

@BenjaminBossan (Member)

We haven't started on it yet, so feel free to take a shot. Don't hesitate to ask questions or to create a draft PR early, so that we can give quick feedback.

@marcasty

I read some code and have a few thoughts:

  1. Should there be a boolean argument to LoraConfig() in peft/tuners/lora/config.py to specify whether LongLoRA is to be used?
  2. The authors of LongLoRA implemented this for Llama by updating the forward pass in transformers.models.llama.modeling_llama.LlamaAttention and then applying LoRA to that model via LoraConfig. It almost feels like a transformers project rather than a PEFT one.

So does supporting this feature require custom LongLoRA attention implementations for each model?

@BenjaminBossan (Member)

2. The authors of LongLoRA implemented this for Llama by updating the forward pass in transformers.models.llama.modeling_llama.LlamaAttention and then applying LoRA to that model via LoraConfig. It almost feels like a transformers project rather than a PEFT one.

So does supporting this feature require custom LongLoRA attention implementations for each model?

Interesting. In theory, we can do that in PEFT as well; whether that's a good idea or not depends on what exactly is happening in the modified forward method.

Looking at the code here, it looks quite difficult to implement this in a generic way. In particular, these new methods (actually, functions) don't call super().forward, i.e. they need to replicate exactly what happens inside the method they replace. Perhaps it would be possible to extract the generic parts and put them into a solution that is not model-specific, but that would require some deep digging to figure out.
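
To make the pattern concrete, here is a minimal sketch of what such a class-level patch looks like. The name longlora_forward is illustrative, not an actual symbol from the repo, and the delegating body exists only to keep the sketch runnable; the real replacement has to copy the entire body of LlamaAttention.forward and weave the shift/un-shift into it:

import transformers.models.llama.modeling_llama as modeling_llama

_original_forward = modeling_llama.LlamaAttention.forward

def longlora_forward(self, *args, **kwargs):
    # In the LongLoRA repo this body is a full re-implementation of
    # LlamaAttention.forward, with the shift inserted before and the un-shift
    # inserted after the attention computation. Simply delegating, as done here
    # to keep the sketch runnable, cannot express those mid-forward changes.
    return _original_forward(self, *args, **kwargs)

# class-level monkey-patch: every LlamaAttention instance now uses the new forward
modeling_llama.LlamaAttention.forward = longlora_forward
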

@casper-hansen (Author)

@BenjaminBossan So in other words, it would be difficult to implement in PEFT because you have to hijack the attention module and add the appropriate calls in the right places? Another solution is for other training libraries to implement custom forward methods and implement it in those, e.g. as they do in axolotl for llama or mistral models. The problem is that this causes a lot of overhead when trying to support this feature.

@BenjaminBossan (Member)

So in other words, it would be difficult to implement in PEFT because you have to hijack the attention module and add the appropriate calls in the right places?

I think the answer is "it depends". We could do the same thing as in that repo, i.e. provide highly model-specific solutions and then only enable LongLoRA on those models. But I would argue that there would be little benefit to using PEFT if it just did the exact same, narrow thing.

It would be much better to have a more generic implementation, but whether that works depends on the changes that were introduced. I cannot comment on that, because I haven't looked into the changes in detail. But if they could be generalized and abstracted, they would be great additions to PEFT.

To give a toy example, if the changes amount to adding 1 to the output of the attention layer, that would be possible to add to PEFT in a generalized fashion. If the changes require 10 modifications on different lines in the forward method, that could not be generalized. Does that make sense?
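
To illustrate the first case: a change that only touches a module's output can be attached generically with a PyTorch forward hook, without knowing anything about the model's forward code. A toy sketch, with nn.MultiheadAttention standing in for a model's attention layer:

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

def add_one(module, inputs, output):
    # nn.MultiheadAttention returns (attn_output, attn_weights)
    attn_out, weights = output
    return attn_out + 1, weights

handle = attn.register_forward_hook(add_one)

x = torch.randn(2, 16, 64)
out, _ = attn(x, x, x)  # the hook has added 1 to the attention output
handle.remove()

The LongLoRA shift, by contrast, has to happen between the qkv projection and the attention kernel inside the forward itself, which is why a hook alone does not cover it.
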

Another solution is for other training libraries to implement custom forward methods and implement it in those

Could you elaborate on how that is different? Thanks.

@casper-hansen (Author)

Yes, I do agree. It looks like you need several modifications that are neither at the start nor the end of the forward. That makes it quite hard to generalize.

Could you elaborate on how that is different? Thanks.

Training libraries like axolotl implement custom forward functions for some models in order to implement sample packing and enable features like flash attention. They also build on top of PEFT and other Huggingface libraries.

@marcasty commented Oct 2, 2023

The task that motivates this paper is extending the context length of transformer language models. Does PEFT support extending the context length (I couldn't find it)? The code we're discussing only covers Shift Short Attention, so implementing just that would allow easy, efficient long-context fine-tuning but not efficient context extension. I think the Shift Short Attention implementation is the bigger fish to fry, but this seemed worth mentioning to appreciate the scope of the feature.

@BenjaminBossan (Member)

Training libraries like axolotl implement custom forward functions for some models in order to implement sample packing and enable features like flash attention. They also build on top of PEFT and other Huggingface libraries.

Okay, so your suggestion is to offload the task of implementing the custom forward methods to the training libs? Yes, there is a bit of a blurry line as to whether these types of changes are in the scope of PEFT or not. I'd say yes, as it's not strictly related to specific training methods, but it could be done either way.

Does PEFT support extending the context length (I couldn't find it)?

No, the idea would be for PEFT to make the modification to the layers; the training itself on longer contexts would be up to the user.
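
As a hedged sketch of that user-side step (not something PEFT or LongLoRA provides out of the box): on Llama-style models, the positional range can be extended via RoPE scaling when loading the model, after which the adapter is applied as usual. The model id and scaling factor below are placeholders:

from transformers import AutoModelForCausalLM

# linear RoPE scaling (position interpolation); a factor of 2.0 doubles the
# usable context of the base model. Values here are illustrative only.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    rope_scaling={"type": "linear", "factor": 2.0},
)
# the shifted attention patch discussed above would still need to be applied
# separately, and PEFT would then wrap `model` with a LoraConfig as usual
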

@casper-hansen (Author)

Okay, so your suggestion is to offload the task of implementing the custom forward methods to the training libs?

If not upstreamed in PEFT, this would be the alternative. Ideally, it can be implemented directly in PEFT though - it looks quite nice and I think everyone training long-context models would appreciate this feature.

@teknium1

dead?

@matteoguarrera

I'd like to add this feature if one of the great HF MLEs hasn't picked it up yet ;)

You still working on that? I wanna help

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@github-actions github-actions bot closed this as completed Dec 4, 2023
@seanxuu commented Jan 29, 2024

I'd like to add this feature if one of the great HF MLEs hasn't picked it up yet ;)

You still working on that? I wanna help

same question

@Gauravbhai4

Can we implement this on the Longformer model?
