
question about re-computation #51

Closed
ysanimals opened this issue Oct 24, 2023 · 4 comments

Comments

@ysanimals

Hello. After reading the paper and the code, I have a question about the Sliding Window w/ Re-computation method: is there any code or paper about this method for reference? Thank you a lot.

@Guangxuan-Xiao
Collaborator

Hi,

Thank you for your interest in our work!

The "Sliding Window with Recomputation" method is commonly used in language modeling evaluations to calculate perplexity given a certain number of context tokens. Huggingface provides a comprehensive tutorial on this topic. You can check it out here: https://huggingface.co/docs/transformers/perplexity.

Best,
Guangxuan

@rituraj2847

rituraj2847 commented Oct 25, 2023

Hi Guangxuan. Could you please explain how sliding window with recomputation leads to O(TL^2) complexity? IMO, computing the keys and values for the first L-1 tokens should only depend on the weight matrices and the dimension of the KVs. How does this computation take O(L^2) time per token? I don't see a need for computing attention scores for the initial L-1 tokens, which is what would cost O(L^2). Thank you.

@Bhuvanesh09

@rituraj2847 Your argument would be valid if there were only one layer in the transformer, i.e., if we only needed a single forward pass per token through the K and V weight matrices [O(L)]. Unfortunately, the output of the attention block of the first layer (obtained through (Q @ K.T) @ V) is the input embedding of the second layer.
Hence, each time you change the positional encoding of the tokens, generating one new token takes O(L^2) operations, because self-attention has to be recomputed for all the tokens in the window. Since the sequence length is T, the overall complexity becomes O(TL^2).
Only in the last layer can we avoid computing self-attention among all the initial tokens; there it suffices to attend from the query vector of the last token to the initial tokens.
I also had the same question as you and had to chalk out my understanding of transformers on a piece of paper in order to understand this.
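A rough counting sketch of that argument (window size, sequence length, and layer count below are made-up numbers): every new token triggers a full forward pass over the L-token window, and each layer of that pass pays quadratic self-attention cost.

```python
# Back-of-the-envelope count of attention-score computations under
# sliding window with re-computation, ignoring constant factors.
def recomputation_attention_cost(T: int, L: int, num_layers: int) -> int:
    total = 0
    for _ in range(T):              # one full window re-computation per generated token
        for _ in range(num_layers):
            total += L * L          # self-attention over the L-token window in this layer
    return total                    # ~ num_layers * T * L^2, i.e. O(T L^2)

# Example: T = 4096 generated tokens, window L = 1024, 32 layers.
print(recomputation_attention_cost(4096, 1024, 32))
```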

@rituraj2847

rituraj2847 commented Oct 25, 2023 via email
