
Implement LongLoRA trick for efficient tuning of long-context models #958

Closed
casper-hansen opened this issue Sep 25, 2023 · 19 comments

@casper-hansen commented Sep 25, 2023

Feature request

The authors of LongLoRA explore a trick you can toggle on during training and toggle off during inference. The key takeaways are:

  • LoRA perplexity deteriorates as context length increases; LongLoRA solves this and matches FP16 perplexity at longer contexts
  • It can train on longer contexts much faster than plain LoRA while keeping the same VRAM usage as LoRA


Paper: https://arxiv.org/pdf/2309.12307.pdf
Code: https://github.com/dvlab-research/LongLoRA
Trick is implemented here: https://github.com/dvlab-research/LongLoRA/blob/4cbb61b93ac8c7ea4db7ffc24a824f3fb153554d/llama_attn_replace.py

Motivation

The key motivation is to enable a LoRA method that scales better when fine-tuning models with longer context sizes. We know this to be expensive both in compute hours and memory requirements, but LongLoRA seems to optimize for exactly this. Using LongLoRA, we can train models faster while matching FP16 perplexity at longer context sizes.

Your contribution

The trick is implemented as shown below. It is a two-step process.

Shift short attention:

def shift(qkv, bsz, q_len, group_size, num_heads, head_dim):
    # qkv: (bsz, num_heads, q_len, head_dim); roll the second half of the heads
    # by half a group along the sequence dimension
    qkv[:, num_heads // 2:] = qkv[:, num_heads // 2:].roll(-group_size // 2, dims=2)
    # split the sequence into groups of `group_size` tokens and fold the groups
    # into the batch dimension: -> (bsz * (q_len // group_size), num_heads, group_size, head_dim)
    qkv = qkv.transpose(1, 2).reshape(bsz * (q_len // group_size), group_size, num_heads, head_dim).transpose(1, 2)
    return qkv

Unpacking after attention computations:

# output is (bsz, q_len, num_heads, head_dim) here; roll the shifted half of the heads back into place
output[:, :, self.num_heads//2:] = output[:, :, self.num_heads//2:].roll(group_size//2, dims=1)
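
For illustration, here is a minimal, self-contained sketch of the full round trip with dummy tensors. It uses PyTorch's scaled_dot_product_attention in place of the flash-attention kernel from the repo, and it simplifies the causal-mask handling for the shifted heads, so treat it as a toy version of the idea rather than the reference implementation:

import torch
import torch.nn.functional as F

# toy shapes; the paper typically uses a group size of q_len // 4
bsz, num_heads, q_len, head_dim = 2, 8, 1024, 64
group_size = q_len // 4

def shift(qkv, bsz, q_len, group_size, num_heads, head_dim):
    # same shift as above: roll half the heads, then fold groups into the batch dim
    qkv[:, num_heads // 2:] = qkv[:, num_heads // 2:].roll(-group_size // 2, dims=2)
    qkv = qkv.transpose(1, 2).reshape(bsz * (q_len // group_size), group_size, num_heads, head_dim).transpose(1, 2)
    return qkv

q, k, v = (torch.randn(bsz, num_heads, q_len, head_dim) for _ in range(3))
q, k, v = (shift(t, bsz, q_len, group_size, num_heads, head_dim) for t in (q, k, v))

# attention now runs per group: (bsz * n_groups, num_heads, group_size, head_dim)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# unpack the groups back into the sequence dimension...
out = out.transpose(1, 2).reshape(bsz, q_len, num_heads, head_dim)
# ...and roll the shifted half of the heads back into place
out[:, :, num_heads // 2:] = out[:, :, num_heads // 2:].roll(group_size // 2, dims=1)

At inference time the trick is simply not applied, so the trained weights are used with standard full attention.
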
@BenjaminBossan (Member)

This looks very promising, especially if it really just requires a couple of lines of extra code. Thanks for bringing this to our attention. Do you want to work on adding this?

@casper-hansen (Author)

@BenjaminBossan I am currently occupied with a lot of other work, and I do not know where to start with implementing this in PEFT, so I would hope one of the great HF MLEs can pick this up🙏

@BenjaminBossan (Member)

Haha, let's see if flattery does the trick for you ;-)

I was mainly asking because it was not clear to me from your post if you wanted to add it or just suggested it to be added, not to push you.

@casper-hansen (Author) commented Sep 26, 2023

Haha, maybe it will :)

I mainly posted here because I saw it could have a large benefit for all users looking to train LLMs with LoRA/QLoRA and long contexts.

@marcasty

I'd like to add this feature if one of the great HF MLEs hasn't picked it up yet ;)

@BenjaminBossan (Member)

We haven't started on it yet, so feel free to take a shot. Don't hesitate to ask questions or to create a draft PR early, so that we can give quick feedback.

@marcasty

I read some code and have a few thoughts:

  1. Should there be a boolean argument to LoraConfig() in peft/tuners/lora/config.py to specify whether LongLoRA is to be used?
  2. The authors of LongLoRA implemented this for Llama by updating the forward pass in transformers.models.llama.modeling_llama.LlamaAttention and then applying LoRA to that model via LoraConfig. It almost feels like a transformers project rather than a PEFT one.

So does supporting this feature require custom LongLoRA attention implementations for each model?

@BenjaminBossan (Member)

2. The authors of LongLoRA implemented this for Llama by updating the forward pass in transformers.models.llama.modeling_llama.LlamaAttention and then applying LoRA to that model via LoraConfig. It almost feels like a transformers project rather than a PEFT one.

So does supporting this feature require custom LongLoRA attention implementations for each model?

Interesting. In theory, we can do that in PEFT as well; whether that's a good idea or not depends on what exactly is happening in the modified forward method.

Looking at the code here, it looks quite difficult to implement this in a generic way. In particular, these new methods (actually, functions) don't call super().forward, i.e. they need to replicate exactly what happens inside the method they replace. Perhaps it would be possible to extract the generic parts and put them into a solution that is not model-specific, but that would require some deep digging to figure out.
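
To make the pattern concrete, here is a minimal sketch of what such a class-level patch looks like. The name longlora_forward is illustrative, not an actual symbol from the repo, and the delegating body exists only to keep the sketch runnable; the real replacement has to copy the entire body of LlamaAttention.forward and weave the shift/un-shift into it:

import transformers.models.llama.modeling_llama as modeling_llama

_original_forward = modeling_llama.LlamaAttention.forward

def longlora_forward(self, *args, **kwargs):
    # In the LongLoRA repo this body is a full re-implementation of
    # LlamaAttention.forward, with the shift inserted before and the un-shift
    # inserted after the attention computation. Simply delegating, as done here
    # to keep the sketch runnable, cannot express those mid-forward changes.
    return _original_forward(self, *args, **kwargs)

# class-level monkey-patch: every LlamaAttention instance now uses the new forward
modeling_llama.LlamaAttention.forward = longlora_forward
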

@casper-hansen (Author)

@BenjaminBossan So in other words, it would be difficult to implement in PEFT because you have to hijack the attention module and add the appropriate calls in the right places? Another solution is for other training libraries to implement custom forward methods and implement it in those, e.g. as they do in axolotl for llama or mistral models. The problem is that this causes a lot of overhead when trying to support this feature.

@BenjaminBossan (Member)

So in other words, it would be difficult to implement in PEFT because you have to hijack the attention module and add the appropriate calls in the right places?

I think the answer is "it depends". We could do the same thing as in that repo, i.e. provide highly model-specific solutions and then only enable LongLoRA on those models. But I would argue that there would be little benefit to using PEFT if it just did the exact same, narrow thing.

It would be much better to have a more generic implementation, but whether that works depends on the changes that were introduced. I cannot comment on that, because I haven't looked into the changes in detail. But if they could be generalized and abstracted, they would be great additions to PEFT.

To give a toy example, if the changes amount to adding 1 to the output of the attention layer, that would be possible to add to PEFT in a generalized fashion. If the changes require 10 modifications on different lines in the forward method, that could not be generalized. Does that make sense?
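
To illustrate the first case: a change that only touches a module's output can be attached generically with a PyTorch forward hook, without knowing anything about the model's forward code. A toy sketch, with nn.MultiheadAttention standing in for a model's attention layer:

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

def add_one(module, inputs, output):
    # nn.MultiheadAttention returns (attn_output, attn_weights)
    attn_out, weights = output
    return attn_out + 1, weights

handle = attn.register_forward_hook(add_one)

x = torch.randn(2, 16, 64)
out, _ = attn(x, x, x)  # the hook has added 1 to the attention output
handle.remove()

The LongLoRA shift, by contrast, has to happen between the qkv projection and the attention kernel inside the forward itself, which is why a hook alone does not cover it.
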

Another solution is for other training libraries to implement custom forward methods and implement it in those

Could you elaborate on how that is different? Thanks.

@casper-hansen (Author)

Yes, I do agree. It looks like you need several modifications that are neither at the start nor the end of the forward. That makes it quite hard to generalize.

Could you elaborate on how that is different? Thanks.

Training libraries like axolotl implement custom forward functions for some models in order to implement sample packing and enable features like flash attention. They also build on top of PEFT and other Huggingface libraries.

@marcasty commented Oct 2, 2023

The task that motivates this paper is extending the context length of transformer language models. Does PEFT support extending the context length (I couldn't find it)? The code we're discussing only covers Shift Short Attention, so implementing just that would allow easy, efficient long-context fine-tuning but not efficient context extension. I think the Shift Short Attention implementation is the bigger fish to fry, but this seemed worth mentioning to appreciate the scope of the feature.

@BenjaminBossan (Member)

Training libraries like axolotl implement custom forward functions for some models in order to implement sample packing and enable features like flash attention. They also build on top of PEFT and other Huggingface libraries.

Okay, so your suggestion is to offload the task of implementing the custom forward methods to the training libs? Yes, there is a bit of a blurry line as to whether these types of changes are in the scope of PEFT or not. I'd say yes, as it's not strictly related to specific training methods, but it could be done either way.

Does PEFT support extending the context length (I couldn't find it)?

No, the idea would be for PEFT to make the modification to the layers; the training itself on longer contexts would be up to the user.
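
As a hedged sketch of that user-side step (not something PEFT or LongLoRA provides out of the box): on Llama-style models, the positional range can be extended via RoPE scaling when loading the model, after which the adapter is applied as usual. The model id and scaling factor below are placeholders:

from transformers import AutoModelForCausalLM

# linear RoPE scaling (position interpolation); a factor of 2.0 doubles the
# usable context of the base model. Values here are illustrative only.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    rope_scaling={"type": "linear", "factor": 2.0},
)
# the shifted attention patch discussed above would still need to be applied
# separately, and PEFT would then wrap `model` with a LoraConfig as usual
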

@casper-hansen (Author)

Okay, so your suggestion is to offload the task of implementing the custom forward methods to the training libs?

If not upstreamed in PEFT, this would be the alternative. Ideally, it can be implemented directly in PEFT though - it looks quite nice and I think everyone training long-context models would appreciate this feature.

@teknium1

dead?

@matteoguarrera

I'd like to add this feature if one of the great HF MLEs hasn't picked it up yet ;)

You still working on that? I wanna help

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@github-actions github-actions bot closed this as completed Dec 4, 2023
@seanxuu commented Jan 29, 2024

I'd like to add this feature if one of the great HF MLEs hasn't picked it up yet ;)

You still working on that? I wanna help

same question

@Gauravbhai4

Can we implement this on the Longformer model?
