Beit why no k bias in attention? #510

cqliheping · 2021-11-10T10:46:27Z

The k bias is always zero in the code. Is there a reason for this? It differs from the usual implementation.

Line 124 in 421cffe

    
           qkv_bias = torch.cat((self.q_bias, torch.zeros_like(self.v_bias, requires_grad=False), self.v_bias))

In my tests, the k bias has little effect on performance when fine-tuning, but I have not tested pre-training.

donglixp · 2021-11-10T11:19:37Z

Both variants (i.e., with or without key.bias) give the same calculation results, because the key bias is cancelled by the softmax function.

Softmax(q,k) = exp(q.weight * key.weight + q.bias * key.weight + q.weight * key.bias + q.bias * key.bias) / Z

Because the query is the same for all the keys, the term (q.weight * key.bias + q.bias * key.bias) is the same across all the keys, so it can be cancelled without affecting the softmax results.

exp(a)/(exp(a)+ exp(b)) == exp(a+C)/(exp(a+C)+ exp(b+C))
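
For anyone who wants to check the cancellation numerically, here is a minimal sketch. It assumes a single attention head with the usual 1/sqrt(d) scaling and uses only the Python standard library. The helper names (`linear`, `softmax`, `attention_weights`) and the random toy inputs are made up for illustration and are not taken from the BEiT code.

```python
import math
import random


def linear(x, weight, bias):
    """Return weight @ x + bias for a vector x (weight is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(tokens, w_q, b_q, w_k, b_k):
    """Attention weights for one head: one softmax row per query token."""
    queries = [linear(x, w_q, b_q) for x in tokens]
    keys = [linear(x, w_k, b_k) for x in tokens]
    scale = 1.0 / math.sqrt(len(b_q))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]


rng = random.Random(0)  # fixed seed so the toy inputs are reproducible
dim, num_tokens = 4, 5


def rand_vec(n):
    return [rng.uniform(-1, 1) for _ in range(n)]


def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]


tokens = rand_mat(num_tokens, dim)
w_q, w_k = rand_mat(dim, dim), rand_mat(dim, dim)
b_q, b_k = rand_vec(dim), rand_vec(dim)
zero_bias = [0.0] * dim

# Same weights and query bias; only the key bias differs between the two runs.
with_k_bias = attention_weights(tokens, w_q, b_q, w_k, b_k)
without_k_bias = attention_weights(tokens, w_q, b_q, w_k, zero_bias)

max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(with_k_bias, without_k_bias)
    for a, b in zip(row_a, row_b)
)
print(f"max abs difference in attention weights: {max_diff:.2e}")
```

The printed difference should be at the level of floating-point rounding error. Within each row the query is fixed, so the key bias adds the same constant to every score, which is exactly the `C` in the identity above.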

donglixp closed this as completed Nov 10, 2021

donglixp self-assigned this Nov 10, 2021

addf400 mentioned this issue Dec 6, 2021

BEiT zero v_bias #558

Closed

liyz15 mentioned this issue Dec 9, 2021

Why don't train the k_bias pengzhiliang/MAE-pytorch#52

Closed
