Parameter-wise, I think projecting to a lower dimension multiple times is equivalent to "first project, then split". E.g., 100d -> 100d -> split into 4 × 25d uses 100 × 100 parameters, while four separate 100d -> 25d projections use 100 × 25 × 4 parameters — the same count. And "split and concat" looks more elegant. By the way, I also tried both methods in my own experiments and found no difference in final performance. @ty5491003
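To make the equivalence concrete, here is a minimal NumPy sketch (not the repo's code; all names here are made up for illustration). It shows that one big `d_model -> n_head * d_k` projection followed by a split produces exactly the same per-head outputs as `n_head` separate `d_model -> d_k` projections built from the same weight rows:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_head, d_k = 100, 4, 25

# One big projection matrix: (n_head * d_k) x d_model = 100 x 100 parameters.
W_big = rng.standard_normal((n_head * d_k, d_model))
x = rng.standard_normal((8, d_model))  # a batch of 8 token vectors

# "First project, then split": project to 100d, split into 4 heads of 25d.
big_out = x @ W_big.T                                   # shape (8, 100)
heads_from_split = np.split(big_out, n_head, axis=-1)   # 4 arrays of (8, 25)

# "Multiple projections": 4 separate 25 x 100 matrices (100*25*4 parameters),
# here taken as row-blocks of W_big so the weights match exactly.
per_head_W = np.split(W_big, n_head, axis=0)            # 4 arrays of (25, 100)
heads_separate = [x @ W.T for W in per_head_W]

# The two formulations give identical per-head outputs.
for a, b in zip(heads_from_split, heads_separate):
    assert np.allclose(a, b)
```

So the only difference is bookkeeping: whether the same 10,000 parameters live in one matrix or in four.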
As the paper describes, and as some other implementations do:

```python
self.w_qs = nn.Linear(d_model, n_head * d_k)
```

so the projected data is larger. But in this project it is different: each head is formed from only a part of Q/K/V. Can anyone help explain why it uses "split" and "concat" to get the multiple heads?

Thanks!
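For reference, the "split and concat" pattern the question is asking about can be sketched as follows. This is a simplified NumPy illustration under my own assumptions (single sequence, no output projection, no masking), not the project's actual code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attn(x, Wq, Wk, Wv, n_head):
    """x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_model)."""
    # Full-width projections, then SPLIT each into n_head slices:
    # every head sees only a d_model // n_head slice of Q/K/V.
    q_heads = np.split(x @ Wq, n_head, axis=-1)
    k_heads = np.split(x @ Wk, n_head, axis=-1)
    v_heads = np.split(x @ Wv, n_head, axis=-1)

    outs = []
    for qi, ki, vi in zip(q_heads, k_heads, v_heads):
        d_k = qi.shape[-1]
        attn = softmax(qi @ ki.T / np.sqrt(d_k))  # scaled dot-product attention
        outs.append(attn @ vi)

    # CONCAT the per-head outputs back to (seq_len, d_model).
    return np.concatenate(outs, axis=-1)
```

Because each head's query/key/value is just a slice of the full projection, this is the same computation as giving each head its own smaller projection matrix — only the weight layout differs.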