
Confused with the four attention mechanisms and their performance mentioned in the paper #33

Closed
weizhenhuan opened this issue Oct 10, 2023 · 4 comments


@weizhenhuan

Nice idea, and it really works well! Thanks for your nice work. But I have some questions. The paper mentions four attention mechanisms: dense attention fails because it mismatches the training-phase length once the output grows longer than the training length, and window attention fails because it evicts the initial tokens' KV cache. But for sliding window attention with re-computation and for streaming attention, I have some questions.

  • Sliding window attention with re-computation just recomputes the KV states from the L recent tokens; theoretically it should have the same PPL as window attention, because they use the same tokens' KV (this is clear to see in the picture), and it just saves the KV cache space. However, sliding attention has better PPL in the picture. Did I misunderstand sliding attention?
  • The idea of streaming attention is to use the initial tokens' KV and the L recent tokens' KV, and the reason why it uses the initial tokens' KV is clear in the paper. Compared with streaming attention, dense attention also uses these tokens' KV, so it should not have the softmax-shift problem either. Dense attention even uses more tokens, so it should have better PPL because it has a longer context, while streaming attention should have higher inference speed and a longer output length because it uses fewer tokens. But in the picture, streaming attention has better PPL than dense attention. (I've put a toy sketch of which tokens each mechanism attends to right after these bullets.)
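
To make my question concrete, here is how I understand which past tokens each mechanism attends to when predicting token t. This is just a toy Python sketch of my reading of the paper, with made-up cache sizes; please correct me if this is wrong:

```python
def attended_positions(method, t, L=1024, n_sink=4):
    """My understanding of which past positions each mechanism uses to predict token t.
    L and n_sink are placeholder cache sizes, not the paper's exact settings."""
    if method == "dense":
        return list(range(t))                                   # everything so far
    if method == "window":
        return list(range(max(0, t - L), t))                    # cached last L tokens
    if method == "sliding_recompute":
        return list(range(max(0, t - L), t))                    # same tokens, re-encoded from scratch
    if method == "streaming":
        sinks = list(range(min(n_sink, t)))                     # initial "sink" tokens
        recent = list(range(max(n_sink, t - L), t))             # recent window
        return sinks + recent

t = 65_000
print(len(attended_positions("dense", t)))              # 65000
print(len(attended_positions("window", t)))             # 1024
print(len(attended_positions("sliding_recompute", t)))  # 1024
print(len(attended_positions("streaming", t)))          # 1028
```

Note that window attention and sliding window with re-computation attend over exactly the same set of tokens here, which is why I expected them to have the same PPL.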

Thanks for your nice work again. Hope to get a reply.
[attached image]

@Guangxuan-Xiao
Collaborator

Guangxuan-Xiao commented Oct 11, 2023

Thank you for your interest in our paper; I appreciate your insightful questions. Here are my clarifications:

  1. Sliding Window with Re-computation vs. Window Attention:

    • Assume we have a token sequence [a, b, c, d, e, f, g] and the model's window size is 4. For predicting token 'g', the sliding window with re-computation truncates the text sequence to [d, e, f], treating it as a whole new sequence before inputting it into the language model and predicting only token 'g'. Here, token 'd' is at position 0 (seeing nothing), 'e' at position 1 (seeing only 'd'), and 'f' at position 2 (seeing 'd' and 'e').
    • In contrast, window attention reuses the previously computed KV states. So, while predicting 'g', the reused tokens [d, e, f]'s KV come from prior computations: 'd' was computed at position 3 (seeing a, b, c), 'e' at position 3 (seeing b, c, d), and 'f' at position 3 (seeing c, d, e). The critical distinction is that in sliding window with re-computation, some key states are computed as if they belonged to initial tokens, whereas in window attention, all previous tokens' KV were computed as if they were middle tokens (see the toy sketch after this list).
  2. StreamingLLM vs. Dense Attention:
    The superior performance of StreamingLLM over dense attention is attributed to the fact that dense attention struggles to generalize to sequences exceeding its training length. In the figure, we show the language modeling perplexity on a book containing 65K tokens. The perplexity of dense attention becomes problematic because the Llama-2-13B model we used was pre-trained with a chunk size of 4096, causing its perplexity to deteriorate for sequences surpassing 4K. For a deeper dive, you might find work on length extrapolation, such as the ALiBi paper (https://arxiv.org/abs/2108.12409), insightful.
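
To make the positional difference described in point 1 concrete, here is a toy sketch (purely illustrative, not the repository's code) listing, for the example above, which tokens each reused or re-computed key could see when it was produced:

```python
# Toy illustration (not the repo's code): which tokens each cached key "saw"
# when it was computed, for window size W = 4 and the sequence a..g.
tokens = list("abcdefg")
W = 4

def window_attention_context(step):
    """Window attention reuses cached keys: each key was computed when its own
    token was the newest entry of the rolling window (position W - 1)."""
    window = tokens[step - W + 1:step]  # cached keys used to predict tokens[step]
    return [(t, tokens[max(0, tokens.index(t) - W + 1):tokens.index(t)]) for t in window]

def recomputation_context(step):
    """Sliding window with re-computation re-encodes the truncated text from
    position 0, so the first token of the window sees nothing at all."""
    window = tokens[step - W + 1:step]
    return [(t, window[:i]) for i, t in enumerate(window)]

step = tokens.index("g")  # predicting 'g'
print(window_attention_context(step))
# [('d', ['a', 'b', 'c']), ('e', ['b', 'c', 'd']), ('f', ['c', 'd', 'e'])]
print(recomputation_context(step))
# [('d', []), ('e', ['d']), ('f', ['d', 'e'])]
```

Note how in the re-computation case 'd' is computed with no predecessors, so it behaves like an initial token; the cached keys in window attention never look like that.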

I hope this addresses your confusion.

Thanks,
Guangxuan

@Guangxuan-Xiao
Collaborator

Guangxuan-Xiao commented Oct 12, 2023

I'd like to clarify that the issue isn't related to absolute or relative positional encoding. Our current results were obtained with models that use relative position encodings, such as Llama and MPT. The core of the matter lies in whether the context keys have been computed with or without previous tokens. Attention sinks refer to keys computed without prior tokens. Hence, such keys are present in the sliding-window-with-re-computation baseline. However, in the window attention baseline, all context keys are computed from numerous preceding tokens, and the model is trained to recognize that these aren't attention sinks.
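
For concreteness, the KV-cache policy discussed in this thread (keep the first few sink tokens plus a rolling recent window) can be sketched roughly as follows. This is a minimal, hypothetical snippet, not the repository's actual cache implementation, and the n_sink / window defaults are only example values:

```python
import torch

def select_streaming_kv(keys, values, n_sink=4, window=1020):
    """Sketch of a keep-sinks-plus-recent-window cache policy.

    keys, values: [batch, heads, seq_len, head_dim]
    Keeps the first n_sink entries (attention sinks) and the most recent
    `window` entries; everything in between is evicted.
    """
    seq_len = keys.size(2)
    if seq_len <= n_sink + window:
        return keys, values
    keep = torch.cat([
        torch.arange(n_sink),                     # initial tokens kept as sinks
        torch.arange(seq_len - window, seq_len),  # recent tokens
    ])
    return keys[:, :, keep, :], values[:, :, keep, :]

# Example: a 5000-token cache is trimmed to 4 sinks + 1020 recent entries.
k = torch.randn(1, 32, 5000, 128)
v = torch.randn(1, 32, 5000, 128)
k, v = select_streaming_kv(k, v)
print(k.shape)  # torch.Size([1, 32, 1024, 128])
```

The values 4 and 1020 are just placeholders; the point of the thread is that keeping a handful of initial-token keys stabilizes the attention distribution, so perplexity stays flat even on the 65K-token book mentioned above.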

@weizhenhuan
Author

Got it! Thanks for your kind reply.

@BitCalSaul

@weizhenhuan Hey, zhenhuan, I am also interested in Figure 1 and spent some time figuring it out. I put my thoughts in #42, and I'd appreciate it if you could correct me if I've made any mistakes. Thank you!
