I'm (A Bit) Suspicious of Table 3. #44

Closed · FrederickGeek8 opened this issue Oct 18, 2023 · 1 comment

@FrederickGeek8

Hi there, thanks for writing such an interesting paper. When I heard of your paper, I immediately had the thought that it might be related to Evan Miller's Attention Is Off By One blog post. And I was right! I was excited to see your experiments, but I paused when I came to Table 3, which describes the results of pre-training, from scratch, identical language models using vanilla Attention, Attention with a Zero Sink, and Attention with a Learnable Sink, with 0-4 sink tokens prepended.
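
For anyone reading along who hasn't seen these variants, here is how I understand the three formulations, as a rough PyTorch sketch (function names are mine, not from your code, and this only captures the gist of the sink mechanisms):

```python
import torch

def vanilla_attn(scores):
    # Standard attention: softmax over the real keys only; weights sum to 1.
    return torch.softmax(scores, dim=-1)

def zero_sink_attn(scores):
    # Zero Sink ("softmax1" in Evan Miller's post): behave as if there were one
    # extra key with logit 0 and an all-zero value, so a head can put
    # probability mass on a slot that contributes nothing to the output.
    pad = torch.zeros(*scores.shape[:-1], 1)
    return torch.softmax(torch.cat([pad, scores], dim=-1), dim=-1)[..., 1:]

def learnable_sink_attn(scores, sink_logit):
    # Learnable Sink: same idea, but the sink's logit is a trained parameter
    # (in the paper it is an actual prepended token; this is just a sketch).
    pad = sink_logit.expand(*scores.shape[:-1], 1)
    return torch.softmax(torch.cat([pad, scores], dim=-1), dim=-1)[..., 1:]
```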

Maybe it's because of some strange sense of "moral truth" I have about the Zero Sink, but I was a little surprised that it didn't do better experimentally. But then I looked closer at your $0 + 1024$ experiments and was a little confused by the results presented.

In the table description, you say:

Cache config x+y denotes adding x initial tokens with y recent tokens.

Based on that definition, shouldn't the $0 + 1024$ case with 0 sink tokens make the three formulations equivalent? If so, where do the wildly different perplexity results for that experiment come from? Perhaps I'm misunderstanding the table's description.
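
For reference, here is a toy sketch of how I'm reading that caption (my own code, not from this repository), just to show which positions an x+y config would keep:

```python
# Toy reading of "cache config x + y": keep the first x tokens plus the
# y most recent tokens of the sequence.
def kept_positions(seq_len: int, x: int, y: int) -> list[int]:
    sinks = list(range(min(x, seq_len)))
    recent = list(range(max(seq_len - y, x), seq_len))
    return sinks + recent

print(kept_positions(seq_len=8, x=4, y=2))  # [0, 1, 2, 3, 6, 7]
print(kept_positions(seq_len=8, x=0, y=4))  # [4, 5, 6, 7]
```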

Thank you for fielding my questions!

@Guangxuan-Xiao
Collaborator

Hi,

Thank you for diving deep into our paper and bringing up this insightful question! In the 0+1024 configuration, the zero sink gets evicted once the input length exceeds the cache size. Essentially, the model is trained with the SoftMax1 function but runs inference with the standard SoftMax. This discrepancy is what leads to the surge in perplexity. I hope this clarifies your query.
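
To illustrate with made-up numbers (a toy sketch, not code or results from the paper): once the sink is evicted, the same logits that summed to well under 1 under SoftMax1 get renormalized to exactly 1 by the standard SoftMax, which is a distribution the model never saw during training.

```python
import torch

logits = torch.tensor([-4.0, -5.0, -4.5])  # made-up, uniformly low scores

# Training-time view: zero sink present, i.e. SoftMax1-style normalization.
with_sink = torch.softmax(torch.cat([torch.zeros(1), logits]), dim=-1)[1:]

# Inference-time view after the sink is evicted: standard SoftMax.
without_sink = torch.softmax(logits, dim=-1)

print(with_sink.sum())    # ~0.03 -- the head can effectively abstain
print(without_sink.sum()) # 1.0   -- the same logits are forced to full attention
```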

Guangxuan
