Hi there, thanks for writing such an interesting paper. When I heard about your paper, I immediately suspected it might be related to Evan Miller's Attention is Off By One blog post. And I was right! I was excited to see your experiments, but then I came to Table 3, which describes the results of pre-training, from scratch, identical language models corresponding to vanilla Attention, Attention with a Zero Sink, and Attention with a Learnable Sink with 0-4 sink tokens prepended.
Maybe it's because of some strange sense of "moral truth" I have about the Zero Sink, but I was a little surprised that it didn't do better experimentally. Then I looked closer, noticed your $0 + 1024$ experiments, and was a little confused by the results presented.
In the table description you say
Cache config x+y denotes adding x initial tokens with y recent tokens.
Based on that definition, shouldn't the $0 + 1024$ case with 0 sink tokens make the 3 formulations equivalent? Where do the wildly different perplexity results come from for that experiment? Perhaps I'm misunderstanding the description of this table.
Thank you for fielding my questions!
Thank you for diving deep into our paper and raising this insightful question! Regarding the 0+1024 configuration: it means the zero sink can be evicted once the input length exceeds the cache size. Essentially, this is equivalent to training a model with the SoftMax1 function but running inference with the standard SoftMax. That train/inference discrepancy is what causes the unexpected surge in perplexity. I hope this clarifies your question.
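To make the equivalence (and the breakdown) concrete, here is a small NumPy sketch — my own illustration, not code from the paper. It shows that a standard softmax over a zero-logit sink plus the real scores reproduces SoftMax1 on the real positions, so the two match only while the sink stays in the cache; once the sink is evicted, plain softmax forces the weights back to summing to 1 and the distribution the model was trained around shifts.

```python
import numpy as np

def softmax(x):
    # Standard softmax: attention weights always sum to exactly 1.
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax1(x):
    # SoftMax1 ("attention is off by one"): an implicit +1 term
    # (a zero logit) in the denominator lets the weights sum to
    # less than 1, acting as a built-in zero sink.
    m = max(x.max(), 0.0)
    e = np.exp(x - m)
    return e / (np.exp(-m) + e.sum())

scores = np.array([1.0, 2.0, 3.0])  # toy attention logits

# Sink still in the cache: softmax over [0, scores], dropping the
# sink's weight, is identical to SoftMax1 on the real positions.
with_sink = softmax(np.concatenate([[0.0], scores]))[1:]
assert np.allclose(with_sink, softmax1(scores))

# Sink evicted (the 0+1024 case once the input exceeds the cache):
# plain softmax renormalizes the weights to sum to 1, unlike the
# sub-unit mass the SoftMax1-trained model expects.
print(softmax(scores).sum())   # exactly 1.0
print(softmax1(scores).sum())  # strictly less than 1.0
```

So the "3 equivalent formulations" intuition holds only as long as the zero-logit position is retained; eviction silently swaps SoftMax1 for standard SoftMax at inference time.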