NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. #512

Closed
flozi00 opened this issue Jun 30, 2023 · 33 comments · Fixed by #741

Comments

@flozi00
Contributor

flozi00 commented Jun 30, 2023

Feature request

Longer context, up to 8k tokens; the linked discussion and notebook show promising results.

Motivation

Discussion: https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/

Colab Notebook: https://colab.research.google.com/drive/1VI2nhlyKvd5cw4-zHvAIk00cAVj2lCCC#scrollTo=d2ceb547

Your contribution

As it's only a 3-line code change, it would be pretty easy to integrate (a rough sketch is included below).

I will start training a model and provide an example demo.
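For reference, here is a minimal sketch of the NTK-aware adjustment described in the linked discussion, assuming a standard RoPE setup where inv_freq is derived from a base of 10000; the alpha value and helper name below are illustrative, not TGI code:

```python
import torch

def ntk_scaled_inv_freq(dim: int, base: float = 10000.0, alpha: float = 8.0,
                        device: str = "cpu") -> torch.Tensor:
    # NTK-aware trick from the linked discussion: instead of interpolating
    # positions, scale the RoPE base so the low frequencies are stretched.
    base = base * alpha ** (dim / (dim - 2))
    return 1.0 / (base ** (torch.arange(0, dim, 2, device=device).float() / dim))

# Replace the rotary embedding's inv_freq buffer with the scaled version
# (dim is the per-head dimension, e.g. 128 for LLaMA-7B).
inv_freq = ntk_scaled_inv_freq(dim=128, alpha=8.0)
```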

@Narsil
Collaborator

Narsil commented Jun 30, 2023

Oh nice. And if you want to write a PR, that would be awesome too.

Please be mindful that the TGI code doesn't do batching the same way as transformers, meaning the change will most likely be slightly more complex.
Also, lots of models actually define this buffer directly in their weights instead of instantiating it, which unfortunately leads to some downstream differences in generation.

@flozi00
Contributor Author

flozi00 commented Jun 30, 2023

[image: results plot]

The purple one is trained with the 3-line fix given in the Colab.

@iantbutler01

@Narsil Just wanted to chime in here and say I'm working on an implementation and PR for this

@flozi00
Contributor Author

flozi00 commented Jul 1, 2023

@iantbutler01 let me know if you need support at any point.
At the moment I am focused on training such models rather than on the integration into TGI.

@iantbutler01

I've opened a draft PR, #529

I've tested the fixed NTK-aware scaling and it seems to work. I still need to test dynamic scaling and clean up the PR to comply with the contributor guidelines, but I wanted to at least start the discussion.

@GemsFord

@iantbutler01 does this method only support LLaMA models? If yes, why did you add the support in flash_rw_modeling.py?

@iantbutler01

@GemsFord The method should work for any model using rotary embeddings; it's agnostic. My main use is Falcon 40B, which I've been running locally and testing these changes with.

@GemsFord

@iantbutler01 Thanks for adding support for Falcon. I use that too, which is why I asked. I am waiting for your PR to get merged.

@iantbutler01

Yup, I plan to clean this up and make it ready for review this weekend. I was on vacation and am now catching back up on my work, but I will have time this weekend. As far as I can tell the implementation works, so it's just a matter of cleaning up and then going through review feedback.

@flozi00
Contributor Author

flozi00 commented Jul 13, 2023

@iantbutler01

Nice, I don't think that affects this work unless they implemented it in a flash-attention-enabled module. I'll definitely check it out to make sure my implementation here is correct, though.

@flozi00
Contributor Author

flozi00 commented Jul 13, 2023

Most interesting is the dynamic NTK-aware RoPE being added.
Maybe adding the dynamic version would be an option for TGI too?
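For context, the dynamic variant recomputes the RoPE base from the current sequence length instead of using a fixed alpha, roughly as in the transformers PR mentioned above; a hedged sketch (parameter names are illustrative):

```python
import torch

def dynamic_ntk_inv_freq(dim: int, seq_len: int, max_position_embeddings: int = 2048,
                         base: float = 10000.0, scaling_factor: float = 2.0,
                         device: str = "cpu") -> torch.Tensor:
    # Dynamic NTK scaling: only adjust the base once the current sequence
    # length exceeds the training context, growing it with seq_len.
    if seq_len > max_position_embeddings:
        base = base * (
            (scaling_factor * seq_len / max_position_embeddings) - (scaling_factor - 1)
        ) ** (dim / (dim - 2))
    return 1.0 / (base ** (torch.arange(0, dim, 2, device=device).float() / dim))
```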

@iantbutler01

That's already in my PR :D

@flozi00
Contributor Author

flozi00 commented Jul 13, 2023

Great 😀👍

@keelezibel

Hi, are there any instructions on how to use this once the PR is merged? Also, I was wondering why there would be a desync between the transformers lib and this repo, since it would be too expensive to run LLMs with the transformers lib alone instead of an inference server.

@iantbutler01

@Narsil I've updated the PR to remove the draft status. I think I'm ready for review; just pinging you because you were the earliest responder from HF on this thread.

@andreaskoepf

andreaskoepf commented Jul 24, 2023

The associated PR #529 seems to add post-hoc RoPE scaling (for models trained without scaling). Now that linear & dynamic RoPE scaling got merged into transformers (huggingface/transformers#24653), more models will be fine-tuned with scaled RoPE. For example, we (Open-Assistant) uploaded today a first experiment, llama2-13b-orca-8k-3319, which was fine-tuned with an 8k context using simple linear scaling. It has the following in its config.json and can be used out of the box with transformers 4.31.0:

  "rope_scaling": {
    "factor": 2.0,
    "type": "linear"
  },

Will support for these kinds of fine-tuned models also be added to TGI? Will a separate PR be required for this?

Currently those models can simply be loaded with TGI, but since the rope scaling is not active the output is gibberish. Until rope-scaled models are supported, it might be good to generate an error or warning when rope_scaling is not None in the model's configuration.

Or will the rope scaling of the HF transformers LLaMA implementation automatically be used once the TGI transformers dependency in requirements.txt is updated (currently it is still transformers==4.29.2)?
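For reference, "linear" scaling with factor 2.0 simply compresses the positions fed to RoPE so the 8k fine-tuning range maps back onto the original 4k position range; a minimal sketch of the idea (not the actual transformers or TGI code):

```python
import torch

def linear_scaled_rope_angles(seq_len: int, inv_freq: torch.Tensor,
                              factor: float = 2.0) -> torch.Tensor:
    # Linear (position-interpolation) scaling: divide positions by the factor,
    # e.g. positions 0..8191 / 2.0 -> the original 0..4095 range.
    t = torch.arange(seq_len, dtype=torch.float32) / factor
    return torch.outer(t, inv_freq)  # angles used to build the cos/sin caches
```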

@Narsil
Collaborator

Narsil commented Jul 31, 2023

Two separate things, but we'll align with that, yes.

@Narsil
Collaborator

Narsil commented Jul 31, 2023

@andreaskoepf Can you provide an example where the rope scaling fails?

I'm trying a few dummy examples, but I'm not sure whether what I'm doing is correct or not, as the model output doesn't seem particularly bad either way (I'm guessing I'm not entering large enough prompts).

Narsil mentioned this issue Jul 31, 2023
@Narsil
Collaborator

Narsil commented Jul 31, 2023

@andreaskoepf the PR linked should fix it.

Narsil added a commit that referenced this issue Jul 31, 2023
# What does this PR do?


- Adds RoPE NTK scaling.

  Done because #529 was closed. Took some code from huggingface/transformers#24653.

- `--rope-scaling` and `--rope-factor` are added separately. I considered having a single flag and parsing something like ("linear:4.0", or "dynamic"), but decided against it because it would push more parsing + validation a bit everywhere (both in the launcher and the server).


Fixes #512




@yadamonk

@Narsil So we can now use models like the llama2 orca 8k mentioned by @andreaskoepf?

@Narsil
Collaborator

Narsil commented Jul 31, 2023

You should be able to!

I was able to get coherent results on 6k-token prompts with that model.
I'm still waiting on confirmation from someone who knows what to expect from that particular model (my test references are on non-fine-tuned LLaMA v1-7B, which I'm sure works; for the fine-tuned model the output looks OK, but without any reference points to compare against it's kind of hard to judge).

@flozi00
Contributor Author

flozi00 commented Jul 31, 2023

I tried to test using GPTQ weights. On v1.0 everything is fine; with the latest container I get:

File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 184, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 136, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 185, in get_model
    return FlashLlama(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 67, in __init__
    model = FlashLlamaForCausalLM(config, weights)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 456, in __init__
    self.model = FlashLlamaModel(config, weights)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 394, in __init__
    [

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 395, in <listcomp>
    FlashLlamaLayer(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 331, in __init__
    self.self_attn = FlashLlamaAttention(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 206, in __init__
    self.query_key_value = TensorParallelColumnLinear.load_multi(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 264, in load_multi
    weight = weights.get_multi_weights_col(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 134, in get_multi_weights_col
    bits, groupsize = self._get_gptq_params()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 220, in _get_gptq_params
    raise e

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 213, in _get_gptq_params
    bits = self.get_tensor("gptq_bits").item()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 66, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 53, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")

RuntimeError: weight gptq_bits does not exist
 rank=0
Error: ShardCannotStart

@Narsil
Collaborator

Narsil commented Jul 31, 2023

What model is that?

@flozi00
Contributor Author

flozi00 commented Jul 31, 2023

flozi00/Llama-2-7b-german-assistant-v2-4bit-autogptq

The only commit that touched that part of the code after the 1.0 release is #738.

@Narsil
Collaborator

Narsil commented Jul 31, 2023

@Narsil
Collaborator

Narsil commented Jul 31, 2023

Should be OK after this, could you confirm?

@flozi00
Contributor Author

flozi00 commented Jul 31, 2023

Another issue found:

def _create_inv_freq(dim, base, device):

is defined here

https://github.com/huggingface/text-generation-inference/blob/15fc64668f8d3dd407768286e5a0536aeb78c2e1/server/text_generation_server/utils/layers.py#L486C24-L486C39

but it is used from the other class, where it is not accessible.

So dynamic scaling is not working and raises a "function not defined" error; linear scaling with a quantized model is working. I can see that it has problems with the stop tokens now, so the model generates whole conversations, but I think that can be solved with some configuration.
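A minimal sketch of the kind of fix needed, assuming the helper only has to live at module scope so that both the regular and the dynamic rotary-embedding classes can reach it (illustrative, not the actual layers.py code):

```python
import torch

def _create_inv_freq(dim: int, base: float, device) -> torch.Tensor:
    # Module-level helper: reachable from every rotary-embedding class,
    # not just the one it was originally nested in.
    return 1.0 / (
        base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim)
    )
```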

@flozi00
Contributor Author

flozi00 commented Jul 31, 2023

solving that typo here

#745

@Narsil
Collaborator

Narsil commented Jul 31, 2023

Shoot, I just merged my PR, which is the same :)

Edit: accepted yours, so you'll end up in the contributors!
Thanks.

@flozi00
Contributor Author

flozi00 commented Jul 31, 2023

Thanks a lot :)
I love that: at most Hugging Face projects the core team is so fast 🚀

@flozi00
Contributor Author

flozi00 commented Jul 31, 2023

I can confirm that dynamic scaling is working now too.

@MUZAMMILPERVAIZ

MUZAMMILPERVAIZ commented Aug 28, 2023

What should the rope scaling factor be for a 32k context, 0.125?
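(For reference, in the transformers rope_scaling convention the linear factor is the ratio of the target context to the original context, so it is greater than 1; a quick check, assuming a LLaMA-2 base context of 4096:)

```python
original_context = 4096      # LLaMA-2 pretraining context length (assumption)
target_context = 32 * 1024   # desired 32k context
factor = target_context / original_context
print(factor)  # 8.0 -- 0.125 would be the inverse (how much positions are compressed)
```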
