self extend / longlm #186
Conversation
@tgaddair could you please take a look at the failing Rust test?
Tested and working :)
Nice! I have some doubts about the way we're computing attention. Happy to discuss further.
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

def self_extend_forward(
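For context, the quoted lines follow the standard rotary-embedding helpers used in Hugging Face-style model code. A minimal sketch of that formulation (not the PR diff itself) looks like this:

import torch

def rotate_half(x):
    # Split the last dimension in half and rotate: (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    # Apply RoPE to queries and keys using precomputed cos/sin tables.
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed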
This signature is quite different from what the current implementation without self-extend uses. Instead of using flash attention and paged attention, it uses the more conventional attention computation with past_key_values. As such, because the rest of the FlashMistral, etc. classes don't pass in past_key_values, my expectation is that the attention computation will be incorrect during the decode phase.
I think what we need is a variation on this function that works with the existing Flash / Paged Attention computation.
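To illustrate the contrast being described, here is a rough sketch of the two signature styles; the argument names below are illustrative assumptions, not the exact lorax API:

# Conventional attention: decode state is carried in past_key_values and each
# step attends over the accumulated key/value tensors.
def self_extend_forward(hidden_states, position_ids, past_key_value=None, **kwargs):
    ...

# Flash / paged attention (the style the FlashMistral classes use): decode
# state lives in a paged KV cache addressed by block tables and slots, so no
# past_key_values tensor is ever passed in.
def flash_paged_forward(hidden_states, cos, sin, kv_cache, block_tables, slots, input_lengths, **kwargs):
    ...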
I tried with and without self extend and the generated response was the same.
The actual implementation is taken from the authors, I think (https://github.com/datamllab/LongLM).
I definitely think that we should add the flash attention version, but I don't know if it would make sense to wait until they release it with tested results.
From the paper:

Limitation: The limitation of the proposed Self-Extend includes the lack of implementation of Flash Attention (Dao et al., 2022) and the performance degradation with too large group size, which means the context window still cannot be extended to infinity with current SelfExtend. Meanwhile, like many regular tasks, there is still no consensus at present about how to do evaluation for long context tasks, which may cause problematic evaluation results.

Future Work: For future work, we will implement Flash Attention for Self-Extend to enhance its efficiency. We are also interested in testing SelfExtend on models using other positional encoding. Larger models, longer context and more challenging tasks will be tested if we can have access to more computational resources in the future. In the meantime, more sophisticated mapping methods will be considered as the replacement of the simple FLOOR operation, so as to achieve better long context understanding abilities and longer extended context window length.
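For reference, the FLOOR operation mentioned above is the core of SelfExtend's position mapping: tokens inside a neighbor window keep their exact relative positions, while more distant tokens are mapped onto grouped positions so relative distances stay within the pretraining range. A minimal sketch of that idea (illustrative names, not the LongLM code):

import torch

def self_extend_position_ids(position_ids, group_size):
    # Exact positions for the neighbor (local) attention pass.
    neighbor_pos = position_ids
    # FLOOR-grouped positions for the long-range pass; the real method also
    # re-aligns the grouped query positions at the window boundary, which is
    # omitted here for brevity.
    grouped_pos = position_ids // group_size
    return neighbor_pos, grouped_pos

def merge_scores(neighbor_scores, grouped_scores, q_pos, k_pos, neighbor_window):
    # Keep exact-position scores where the query-key distance is within the
    # neighbor window; fall back to grouped-position scores elsewhere.
    dist = q_pos[:, None] - k_pos[None, :]
    return torch.where(dist.abs() <= neighbor_window, neighbor_scores, grouped_scores)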
Hey @flozi00, it looks like there was a bug in the code causing the self extend code path to never be executed. I fixed the issue with plumbing through the self_extend_attention param, and now there are some errors showing up. So my suspicion is that the answers were identical because we were executing the non-extended code in both cases.
Here's the current error:
File "/data/lorax/server/lorax_server/models/flash_mistral.py", line 427, in forward
logits = model.forward(
File "/data/lorax/server/lorax_server/models/custom_modeling/flash_mistral_modeling.py", line 565, in forward
hidden_states = self.model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/data/lorax/server/lorax_server/models/custom_modeling/flash_mistral_modeling.py", line 508, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/data/lorax/server/lorax_server/models/custom_modeling/flash_mistral_modeling.py", line 431, in forward
attn_output = self.self_attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
TypeError: self_extend_forward() got multiple values for argument 'group_size_1'
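For what it's worth, this kind of TypeError usually means the caller passes a value positionally that the callee also receives as a keyword. A contrived example (not the lorax code) that reproduces the same message:

def self_extend_forward(query, key, group_size_1=8, group_size_2=1024):
    ...

# The third positional argument already binds to group_size_1, so passing the
# keyword as well collides with it and raises the same TypeError.
self_extend_forward("q", "k", 8, group_size_1=8)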
Oh, okay.
Will take care of this tomorrow.
Thanks for finding this missing piece.
"Another good news: the flash attention version will come in days!" Will wait for that :) |
Hi, I don't have to set up the vars because I am editing branches from this repo and not in a fork |
What does this PR do?
Fixes # (issue)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.