If the self-attention contains some bias, e.g. (Q*K + bias_1 + bias_2)*V, can we still apply YOSO or Nystromformer to speed up the computation? If so, can you give me a hint on how to deal with this case?
I assume that you meant softmax(Q*K + bias_1 + bias_2)*V; otherwise you could compute the expression directly using the distributive law. One approach is to construct a new set of queries and keys, Q' = [Q, bias_1, bias_2] and K' = [K, 1, 1], such that Q'*K' = Q*K + bias_1 + bias_2. The regular attention approximation can then be applied to the new Q' and K'.
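Below is a minimal sketch (not from the YOSO or Nystromformer code) of this augmentation trick in PyTorch. It assumes bias_1 and bias_2 are per-query bias columns of shape (n, 1) that broadcast over the key dimension; other bias shapes would need a symmetric construction (e.g. appending a column of ones to Q and the bias column to K instead).

```python
import torch

# Hypothetical shapes: n queries, m keys, head dimension d.
n, m, d = 8, 16, 32
Q = torch.randn(n, d)
K = torch.randn(m, d)
bias_1 = torch.randn(n, 1)  # assumed per-query bias, broadcast over keys
bias_2 = torch.randn(n, 1)

# Augment the queries with the bias columns and the keys with matching ones,
# so the extra inner-product terms reproduce the additive biases.
Q_prime = torch.cat([Q, bias_1, bias_2], dim=-1)   # (n, d + 2)
ones = torch.ones(m, 1)
K_prime = torch.cat([K, ones, ones], dim=-1)       # (m, d + 2)

# Check: Q' K'^T == Q K^T + bias_1 + bias_2 (biases broadcast over keys).
scores_biased = Q @ K.T + bias_1 + bias_2
scores_augmented = Q_prime @ K_prime.T
assert torch.allclose(scores_biased, scores_augmented, atol=1e-5)

# Any attention approximation (e.g. Nystromformer or YOSO) can now be
# applied to Q_prime and K_prime in place of Q and K.
```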