
Implement NTK-Aware scaled and dynamically scaled RoPE for PositionRotaryEmbedding #529

Closed
wants to merge 3 commits

Conversation

@iantbutler01 commented Jul 3, 2023

What does this PR do?

Implements NTK-Aware scaled and dynamically scaled RoPE for the PositionRotaryEmbedding to allow models to scale beyond their default max_tokens.

https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/

Fixes #512
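
For readers landing here from the links above, here is a minimal sketch of the NTK-aware trick, assuming the standard RoPE inverse-frequency computation; the helper name and the alpha value are illustrative, not this PR's actual code:

import torch

def ntk_scaled_inv_freq(dim: int, base: float = 10000.0, alpha: float = 4.0, device=None):
    """Hypothetical helper: stretch the RoPE base so low-frequency dimensions are
    effectively interpolated while high-frequency ones are barely changed."""
    scaled_base = base * alpha ** (dim / (dim - 2))
    return 1.0 / (scaled_base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim))

Swapping this into the rotary embedding in place of the stock inv_freq is, roughly, what lets the model attend beyond its original training context.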

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@iantbutler01 (Author) commented Jul 3, 2023

I've tested fixed NTK-aware scaling on a project I'm working on and was successfully generating at 2400 tokens, which is about the limit my RTX 6000 Ada can handle in VRAM with Falcon 40B Instruct, and generation was entirely coherent beyond the original 2048-token context.

I still need to test dynamic scaling and clean up the PR further to comply with guidelines and the checklist, but wanted to open this up in the meantime.

@ssmi153 (Contributor) commented Jul 14, 2023

Just a note that Huggingface Transformers natively supports this now: huggingface/transformers@34d9409. Does this make it easier to implement here?

@iantbutler01 iantbutler01 changed the title [Draft] Implement NTK-Aware scaled and dynamically scaled RoPE for PositionRotaryEmbedding Implement NTK-Aware scaled and dynamically scaled RoPE for PositionRotaryEmbedding Jul 17, 2023
@iantbutler01 (Author) commented Jul 17, 2023

@ssmi153 Not particularly; most of the attention modules in this repo are custom to support flash attention. The work in transformers is good to review against my implementation, and that's about it from what I see.

@Narsil (Collaborator) left a comment

Thanks for the PR.

  • We need only 1 new CLI argument. There's already a LOT of arguments, so let's try to keep them to a bare minimum for new features.
  • Overall, can we remove a lot of the complexity?
    From what I read, dynamic scaling seems just better than static scaling, so let's just use dynamic scaling, no?
  • The current code has a lot of pathways; can we keep them to a minimum?
  • Keep the code as close to the original as possible.
  • Nothing should be directly in a custom_modeling file. This behavior, it seems, should be entirely agnostic of the modeling code.

This can go in the config, for instance (like quantize), and live in models/flash_llama.py (that is not modeling code but code wrapping the model itself; it will probably be factored away at some point, but it would be a good place for now).

I'm happy to make those changes if you want, as they are mostly stylistic choices rather than business logic.
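
A rough sketch of the wiring being suggested, under the assumption that a single env var or config value is read at the wrapper level and handed down as one extra argument; all names here are illustrative, not the repo's actual API:

import os

# Hypothetical: one knob, read once where the model is wrapped (e.g. models/flash_llama.py),
# so the custom_modeling files never need to know about scaling.
ROPE_SCALE_FACTOR = float(os.getenv("ROPE_SCALE_FACTOR", "1.0"))

def wrap_rotary(rotary_emb, scale_factor: float = ROPE_SCALE_FACTOR):
    # the embedding object only receives a single extra value
    rotary_emb.scale_factor = scale_factor
    return rotary_emb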

Comment on lines +38 to +40
#[clap(default_value = "2048", long, env)]
max_batch_prefill_tokens: u32,
#[clap(default_value = "16000", long, env)]
#[clap(default_value = "8192", long, env)]
Collaborator

This does not belong in this PR.

We can discuss changing the defaults, but that's a separate concern.

Author

Oh yup, fair. I don't want to change them; I meant to clean this out. I'll remove!

if os.getenv("ROPE_DYNAMIC_SCALING", False).lower() == "true":
    ROPE_DYNAMIC_SCALING = True
else:
    ROPE_DYNAMIC_SCALING = False
Collaborator

Nothing should be model specific.

@@ -369,7 +369,7 @@ def forward(self, hidden_states, residual=None):
import rotary_emb

class PositionRotaryEmbedding(nn.Module):
    def __init__(self, inv_freq):
    def __init__(self, inv_freq, scale_factor=1, dynamic_scaling=False, max_seq_len=2048, dim=None, base=None):
Collaborator

Can we have at most 1 extra argument?

A lot of information should be extractable directly from inv_freq.

Author

Yup, I can try to simplify this

if self.dynamic_scaling:
    scale_factor = (self.scale_factor * length / self.original_max_seq_len) - (self.scale_factor - 1)
    max_seq_len = self.original_max_seq_len * scale_factor
    self.inv_freq = self._get_inv_freq(self.dim, self.base, inv_freq.device, scale_factor)
Collaborator

This is not really OK, I think.

You're ditching the original self.inv_freq entirely, which unfortunately for us is sometimes different from the calculation proposed (that's why not all models use the static computation and some load it instead).

Llama most notably has a different saved inv_freq (not sure why, but it's indeed the case).

Author

Part of dynamic scaling is calculating the new inv_freq; looking at the dynamic scaling implementation in Transformers, I don't see them preserving this value either.

Author

What would you suggest alternatively?

Collaborator

I was thinking interpolation when I wrote this.

Now that I reflect more, it would make the code even more complex, which is not the desired effect.

Can we maybe move the scaling factor out of get_inv_freq and keep it directly here (since it just seems to be rescaling the base)?

Collaborator

And so let's keep rewriting inv_freq. It has some undesirable effects on those models, but the other way is even worse.

Author

That sounds reasonable, I'll make this change after work.
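
A sketch of what was just agreed, using the scale-factor formula from the diff above: keep rewriting inv_freq, but apply the factor by rescaling the base directly rather than threading it through _get_inv_freq. Names are illustrative; this is not the code that was merged.

import torch

def dynamic_ntk_inv_freq(dim, base, length, original_max_seq_len, scale_factor, device=None):
    # same growth rule as in the diff: the effective factor tracks the current length
    dynamic_factor = (scale_factor * length / original_max_seq_len) - (scale_factor - 1)
    dynamic_factor = max(dynamic_factor, 1.0)  # never shrink below the stock base
    # rescale the base in place instead of passing the factor down a helper
    scaled_base = base * dynamic_factor ** (dim / (dim - 2))
    return 1.0 / (scaled_base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim))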

@iantbutler01 (Author) commented Jul 18, 2023

@Narsil thanks for the review!

So my only reason for suggesting we keep static scaling is that it's much easier to reason about the VRAM usage of a statically scaled context window. If you have a model with a default 2048-token context and scale it by 4 to a max of 8192, you can estimate the VRAM consumption of that maximum much more easily.

Otherwise I agree: from what I've read as well, dynamic scaling performs better.

That said, it's like you mentioned: having both adds complexity! If you still feel that's not enough of a reason to keep static scaling, I'll remove it 👍
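
To make the VRAM point above concrete, a back-of-the-envelope sketch with made-up model dimensions (32 layers, 32 KV heads, head_dim 128, fp16); real models, especially multi-query ones like Falcon, will differ:

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    # keys and values are both cached per layer, hence the factor of 2
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

print(kv_cache_bytes(2048) / 2**30)  # 1.0 GiB at the default 2048-token context
print(kv_cache_bytes(8192) / 2**30)  # 4.0 GiB when statically scaled by 4

With a static factor the worst case is known up front; with dynamic scaling it depends on how far generation actually runs.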

@M-Chris commented Jul 19, 2023

Would love to see this exposed now that the huggingface/transformers#24653 merge is complete.

I understand there are various complexities with flash attention, and open questions around whether flash attention 2 will be implemented, but maybe this can still follow the same implementation as the transformers merge to reduce confusion and use the rope_scaling={"type": "dynamic", "factor": 2.0} argument, or at least a variation thereof?

--rope_scaling '{"type": "dynamic", "factor": 2.0}', or however you see fit?

🙌
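
For illustration, parsing a flag shaped like the transformers argument could look something like the sketch below; the flag name and validation are assumptions, not TGI's actual CLI (which lives in the Rust launcher):

import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--rope-scaling", type=str, default=None,
                    help='e.g. \'{"type": "dynamic", "factor": 2.0}\'')
args = parser.parse_args(['--rope-scaling', '{"type": "dynamic", "factor": 2.0}'])

rope_scaling = json.loads(args.rope_scaling) if args.rope_scaling else None
if rope_scaling is not None:
    assert rope_scaling.get("type") in ("linear", "dynamic")
    assert float(rope_scaling.get("factor", 1.0)) >= 1.0
print(rope_scaling)  # {'type': 'dynamic', 'factor': 2.0}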

@gante (Member) commented Jul 19, 2023

@iantbutler01 @Narsil some information for decision-making 🤗

Overall, can we remove a lot of the complexity?
From what I read, dynamic scaling seems just better than static scaling, so let's just use dynamic scaling, no?

The current state of scaling techniques in transformers:

  • dynamic scaling is the best for scaling without fine-tuning
  • linear scaling is the best for scaling with fine-tuning

What's out there that I'm adding next in transformers:

  • The folks behind dynamic scaling have created a technique that is the best in both regimes, before and after fine-tuning (see this comment)

So... perhaps we can jump straight into the best technique in TGI? :D It should only need one flag in practice, the scale
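
For context, a rough sketch of the "by-parts" idea referenced above, as its authors describe it: high-frequency dimensions are left alone, low-frequency ones are linearly interpolated, and a ramp blends the two regimes in between. The thresholds and all names here are assumptions, not a definitive implementation:

import math
import torch

def ntk_by_parts_inv_freq(dim, base, scale, original_max_position, alpha=1.0, beta=32.0, device=None):
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim))
    wavelengths = 2 * math.pi / inv_freq
    # how many full rotations each dimension completes within the original context
    rotations = original_max_position / wavelengths
    # ramp: 0 -> fully interpolate (divide the frequency by the scale), 1 -> leave untouched
    ramp = ((rotations - alpha) / (beta - alpha)).clamp(0.0, 1.0)
    return (1.0 - ramp) * (inv_freq / scale) + ramp * inv_freq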

@iantbutler01 (Author)
@gante That sounds good to me! I'll work on this over the next few days.

@M-Chris commented Jul 21, 2023

FWIW I tested against Llama via a simple overwrite on the Docker image for the latest TGI 0.9.3, using a flash attn v2 compatible GPU, and it works well 😄

I pulled this PR branch and layered it into the overwrite, with a few adjustments and assumptions:

  • Assumed SCALE is the only env var
  • Assumed dynamic as the only option, since this is inference-only
  • Applied the changes to FlashLlamaAttention, since rotary_emb has moved since the original PR

A couple of thoughts/concerns:

  • The user will be required to set --max-total-tokens, --max-input-length, and --max-batch-prefill-tokens themselves. Maybe that's something you already have a plan to address, iantbutler01? Or maybe it's fine if users adjust these as they see fit.
  • Not sure if there is a specific reason to hold back the current version of transformers? (I deployed it in the Docker image as a test and it doesn't seem to have any adverse effect.)

Here is the Dockerfile used to quickly overwrite things for testing, if it's helpful:

FROM ghcr.io/huggingface/text-generation-inference:latest

WORKDIR /usr/src

RUN apt-get update -y

# Newer transformers (with native rope_scaling support) plus quantization deps
RUN pip install transformers==4.31.0
RUN pip install scipy
RUN pip install bitsandbytes==0.40.2
RUN pip install --upgrade accelerate

# Overwrite the installed server package with the patched sources from this branch
COPY ./app_overwrites/server/text_generation_server/. /opt/conda/lib/python3.9/site-packages/text_generation_server

I'd be happy to share anything else if needed 🍻

@iantbutler01 (Author)
@gante Looking at @jquesnelle's repo and the comment you linked to, it looks like there is actually both a standard by-parts method and a dynamic by-parts method. So it looks like the improvement you were talking about applies to both types of NTK-aware scaling? In that case I'm inclined to make this PR just the dynamic by-parts method to save on some of the complexity.

@iantbutler01 (Author) commented Jul 25, 2023

@Narsil @gante I spent some time tonight working to implement the dynamic by-parts method I mentioned in my last comment. I'm coming to realize that, with this new method and the comment here: #512 (comment) suggesting there are now models that have been fine-tuned with scaling, the complexity here has a real chance of ballooning.

Even the by-parts method on its own is gnarlier and requires supporting a whole bunch of parameters on the attention module.

Before I continue, at the risk of the complexity putting this in review hell, I'd like some guidance on how you all think I should proceed. Personally, if I add the dynamic by-parts method linked in my previous comment, it will have effectively set up the ability to implement the other methods here anyway, but maybe a follow-up PR for those makes sense.

@gante (Member) commented Jul 25, 2023

@iantbutler01 you raised good points: as users fine-tune their models with rope scaling, they may lose compatibility with TGI (depending on how we decide to do things here). And yes, let's settle on a path that avoids review hell!

I'd suggest separating the two use cases and making two separate decisions/PRs:

  1. loading models with rope_scaling, such as the one linked here.
    • Would NOT be in this PR
    • Should consist of reading model.config and loading the right RoPE class (see the sketch after this list).
    • Support for each scaling strategy could be added progressively, depending on demand. For instance, all current relevant cases of fine-tuned RoPE scaling rely on linear scaling, so there is only 1 small modification needed in practice.
  2. enabling dynamic RoPE scaling
    • Would be in this PR
    • Triggered by (e.g.) a simple env var
    • Implement NTK-by-parts, since it is the best-performing dynamic method so far

@Narsil @iantbutler01 WDYT?
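
A minimal sketch of decision (1), assuming the config follows the transformers rope_scaling convention; the function and the returned tuple are illustrative only:

from types import SimpleNamespace

def resolve_rope_scaling(config):
    """Return (strategy, factor); strategy 'none' means the stock RoPE path."""
    scaling = getattr(config, "rope_scaling", None) or {}
    strategy = scaling.get("type", "none")
    factor = float(scaling.get("factor", 1.0))
    if strategy not in ("none", "linear", "dynamic"):
        raise ValueError(f"Unsupported rope_scaling type: {strategy}")
    return strategy, factor

# e.g. a fine-tuned checkpoint advertising linear scaling in its config
config = SimpleNamespace(rope_scaling={"type": "linear", "factor": 4.0})
print(resolve_rope_scaling(config))  # ('linear', 4.0)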

@iantbutler01 (Author)
@gante I am fine with that approach; it's basically what I started last night, but I wanted to make sure that's what everyone had in mind.

@iantbutler01 (Author)
Given the license change I am no longer comfortable contributing my work.

@OliverFM
@iantbutler01 would you be willing to contribute this change to a fork?
I was planning on adding support for speculative decoding (see issue #729), but will no longer do that given the license changes.

I am strongly considering maintaining a fork of the repo from the commit before the license change. I would be adding support for speculative decoding there.

@Narsil Narsil mentioned this pull request Jul 31, 2023
Narsil added a commit that referenced this pull request Jul 31, 2023
# What does this PR do?

- Adds RoPE NTK scaling.

Done because #529 was closed.
Took some code from huggingface/transformers#24653.

- `--rope-scaling` and `--rope-factor` are added separately. I considered having a single flag and parsing something like "linear:4.0" or "dynamic", but decided against it because it would push more parsing + validation a bit everywhere (both in the launcher and the server).

Fixes #512




verdant621 added a commit to verdant621/text-generation-inference that referenced this pull request Oct 19, 2023
cr313 added a commit to cr313/text-generation-inference-load-test that referenced this pull request Apr 19, 2024