First of all, thanks a lot for your post *Transformers from scratch*; it is one of the best and most complete explanations of the topic that I've read.
I understood your implementation as projecting an input X of emb dimensions into heads vectors, each also of emb dimensions (roughly as in the sketch below).
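A minimal PyTorch sketch of how I read that "wide" scheme; the `emb` / `heads` names follow your post, but the code itself is my paraphrase, not your exact implementation:

```python
import torch.nn as nn
import torch.nn.functional as F

class WideSelfAttention(nn.Module):
    # every head gets its own full emb-dimensional projection of the input
    def __init__(self, emb, heads=8):
        super().__init__()
        self.emb, self.heads = emb, heads
        self.tokeys     = nn.Linear(emb, emb * heads, bias=False)
        self.toqueries  = nn.Linear(emb, emb * heads, bias=False)
        self.tovalues   = nn.Linear(emb, emb * heads, bias=False)
        self.unifyheads = nn.Linear(heads * emb, emb)

    def forward(self, x):                                             # x: (batch, time, emb)
        b, t, e = x.size()
        h = self.heads
        # each projection yields h full emb-sized vectors per token
        keys    = self.tokeys(x).view(b, t, h, e).transpose(1, 2)     # (b, h, t, e)
        queries = self.toqueries(x).view(b, t, h, e).transpose(1, 2)
        values  = self.tovalues(x).view(b, t, h, e).transpose(1, 2)
        dot = queries @ keys.transpose(-2, -1) / (e ** 0.5)           # (b, h, t, t)
        out = F.softmax(dot, dim=-1) @ values                         # (b, h, t, e)
        out = out.transpose(1, 2).reshape(b, t, h * e)                # (b, t, h*emb)
        return self.unifyheads(out)                                   # (b, t, emb)
```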
However, other implementations define something such as this (again my paraphrase of the idea, not the verbatim code):
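```python
import torch.nn as nn
import torch.nn.functional as F

class NarrowSelfAttention(nn.Module):
    # one d_model -> d_model projection, split into h chunks of d_k = d_model // h
    def __init__(self, d_model, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.toqueries = nn.Linear(d_model, d_model, bias=False)
        self.tokeys    = nn.Linear(d_model, d_model, bias=False)
        self.tovalues  = nn.Linear(d_model, d_model, bias=False)
        self.unify     = nn.Linear(d_model, d_model)

    def forward(self, x):                                               # x: (batch, time, d_model)
        b, t, _ = x.size()
        h, d_k = self.h, self.d_k
        # project once, then carve the result into h sub-vectors of size d_k
        queries = self.toqueries(x).view(b, t, h, d_k).transpose(1, 2)  # (b, h, t, d_k)
        keys    = self.tokeys(x).view(b, t, h, d_k).transpose(1, 2)
        values  = self.tovalues(x).view(b, t, h, d_k).transpose(1, 2)
        dot = queries @ keys.transpose(-2, -1) / (d_k ** 0.5)           # (b, h, t, t)
        out = F.softmax(dot, dim=-1) @ values                           # (b, h, t, d_k)
        out = out.transpose(1, 2).reshape(b, t, h * d_k)                # (b, t, d_model)
        return self.unify(out)
```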
which I understood as projecting an input X of emb dimensions and splitting it into heads sub-vectors of d_k dimensions each (d_k = emb / heads) in order to perform the attention operation. This is based on The Annotated Transformer code.
Did I understand your implementation correctly? If so, are these implementations equivalent? Is there a significant difference between them in terms of "architectural power"?
Sorry for the long question.
Thanks for the question. It looks like you found a mistake in my version. From Vaswani et al.:

> Instead of performing a single attention function with d_model-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively.
In other words, they split the embedding into h chunks and feed each chunk to a separate attention head. In my version, I replicate the full embedding h times and feed the whole thing to each attention head. That means my version has more weights (and in theory more expressive power, but I suspect that extra power isn't necessary for the model's performance).
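To make the difference in weights concrete, here is a quick back-of-the-envelope count over the q/k/v and output projections (biases ignored, sizes chosen just as an example):

```python
# weight counts for the q/k/v and output projections, biases ignored
emb, h = 256, 8

wide   = 3 * emb * (emb * h) + (h * emb) * emb   # replicate: emb -> h*emb, then h*emb -> emb
narrow = 3 * emb * emb + emb * emb               # split: emb -> emb, then emb -> emb

print(wide, narrow, wide // narrow)              # 2097152 262144 8: h times the weights
```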
I'll dig into this a bit more and update the blog post and the code.