why "split" to get multi-head? #55

Closed
LifangD opened this issue Oct 18, 2018 · 3 comments

Comments


LifangD commented Oct 18, 2018

The paper, and some other implementations, project to all heads at once:
self.w_qs = nn.Linear(d_model, n_head * d_k)
so the projected data is larger.
But in this project, it is:

       Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0) # (h*N, T_q, C/h) 
       K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0) # (h*N, T_k, C/h) 
       V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0) # (h*N, T_k, C/h)

It looks like each head is formed from only a part of Q/K/V.
Can anyone explain why it uses "split" and "concat" to get the multiple heads?
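
For reference, the shape manipulation in the snippet above can be reproduced with NumPy. This is only a sketch: the sizes are hypothetical, and it assumes `Q` stands for an already-projected tensor of shape (N, T_q, C).

```python
import numpy as np

# Hypothetical sizes chosen only for illustration
N, T_q, C, num_heads = 2, 5, 8, 4

# Stand-in for the projected queries, shape (N, T_q, C)
Q = np.arange(N * T_q * C, dtype=np.float32).reshape(N, T_q, C)

# NumPy analogue of tf.concat(tf.split(Q, num_heads, axis=2), axis=0)
Q_ = np.concatenate(np.split(Q, num_heads, axis=2), axis=0)
print(Q_.shape)  # (h*N, T_q, C/h)

# Head i of batch item n ends up at row i*N + n and holds
# channels i*C/h up to (i+1)*C/h of the original tensor.
d_head = C // num_heads
head, n = 1, 0
assert np.array_equal(Q_[head * N + n],
                      Q[n, :, head * d_head:(head + 1) * d_head])
```

So each head does see only a slice of the channels, but a slice of the projected tensor rather than of the raw input.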

Thanks!

LifangD closed this as completed Oct 26, 2018

ty5491003 commented Mar 18, 2019

I noticed you closed this issue. Did you find the answer? I have the same question.
@LifangD thanks.


LifangD commented Mar 18, 2019

I think it comes down to the parameterization: projecting several times to a lower dimension is equivalent to "project first, then split". For example, with project-then-split, 100d -> 100d -> split into 4 x 25d, the number of parameters is 100 * 100; with multiple projections (4 times), each 100d -> 25d, the number of parameters is 100 * 25 * 4, which is the same. "Split and concat" also looks more elegant. By the way, I tried both methods in my own experiments and found no difference in final performance. @ty5491003
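
The equivalence can be checked numerically. The following is a minimal NumPy sketch, not the code of this project: the random weights and the three input vectors are made up for illustration, and biases are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_heads = 100, 4
d_head = d_model // num_heads                 # 25

x = rng.standard_normal((3, d_model))         # 3 hypothetical input vectors
W = rng.standard_normal((d_model, d_model))   # one projection: 100 * 100 weights

# Option A: project once (100d -> 100d), then split into 4 heads of 25d
heads_a = np.split(x @ W, num_heads, axis=-1)

# Option B: 4 separate projections (100d -> 25d each),
# using the column blocks of W as the per-head weight matrices
W_blocks = np.split(W, num_heads, axis=1)     # 4 matrices of shape (100, 25)
heads_b = [x @ W_i for W_i in W_blocks]

same = all(np.allclose(a, b) for a, b in zip(heads_a, heads_b))
params_a = W.size                             # 100 * 100
params_b = sum(W_i.size for W_i in W_blocks)  # 100 * 25 * 4
print(same, params_a, params_b)
```

Each column block of the single large matrix acts as one head's own projection, so the two forms compute the same heads with the same number of parameters.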

@ty5491003

I got it, thanks :laughing:
