Comparing SelfAttention classes #8

sidneyaraujomelo opened this issue Sep 25, 2019 · 2 comments

@sidneyaraujomelo sidneyaraujomelo commented Sep 25, 2019


First of all, thanks a lot for your post Transformers from scratch. It is one of the best and most complete explanations of the topic that I've read.
I have a question regarding your implementation of the SelfAttention class, especially the computation of the queries, keys, and values for all heads. For example:

self.tokeys = nn.Linear(emb, emb * heads, bias=False)

I understood that as projecting an input X with emb dimensions into heads vectors of emb dimensions each.

However, other implementations define something such as:
dk = emb // heads
self.tokeys = nn.Linear(emb, emb)
and then perform something like:
keys = self.tokeys(x).view(nbatches, -1, heads, dk).transpose(1, 2)

which I understood as projecting an input X with emb dimensions and separating heads subvectors of dk dimensions in order to perform the attention operation. This is based on The Annotated Transformer code.
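The splitting step described above can be sketched in plain Python on a single token's projected vector. This is only an illustration: `split_heads` is a hypothetical helper, not part of either implementation, and the list of integers stands in for the output of `self.tokeys(x)` with emb = 8.

```python
def split_heads(projected, heads):
    """Split one projected vector of length emb into `heads` chunks of dk = emb // heads."""
    emb = len(projected)
    if emb % heads != 0:
        raise ValueError("emb must be divisible by heads")
    dk = emb // heads
    # Head h receives the contiguous slice [h * dk, (h + 1) * dk).
    return [projected[h * dk:(h + 1) * dk] for h in range(heads)]


# Stand-in for the projection of one token (assumed emb = 8, heads = 4, so dk = 2).
projected = list(range(8))
chunks = split_heads(projected, heads=4)
print(chunks)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

On a batched tensor, the `view(nbatches, -1, heads, dk)` call performs the same grouping along the last dimension.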

Did I understand your implementation correctly? If yes, are these implementations equivalent? Is there a significant difference between them in terms of "architectural power"?

Sorry for the long question.
Thanks in advance!

@pbloem pbloem self-assigned this Sep 26, 2019
@pbloem pbloem commented Sep 26, 2019

Hi Sidney,

Thanks for the question. Looks like you found a mistake in my version. From Vaswani et al.:

In this work we employ $h=8$ parallel attention layers, or heads. For each of these we use $d_k=d_v=d_{\text{model}}/h=64$. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

In other words, they split the embedding into h chunks and feed each chunk separately to an attention head. In my version I replicate the embedding h times and feed the whole thing to each attention head. That means that my version has more weights (and theoretically more power, but I guess that power isn't necessary for the performance of the model).
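To make "more weights" concrete, here is a rough count for a single key projection under both layouts. It is a sketch under assumptions: the function names are made up for illustration, both layers are taken to have no bias, and emb = 512 is derived from the quoted h = 8 and d_k = 64.

```python
def wide_params(emb, heads):
    # Weights in nn.Linear(emb, emb * heads, bias=False): every head sees all emb dimensions.
    return emb * (emb * heads)


def narrow_params(emb, heads):
    # Weights in nn.Linear(emb, emb, bias=False): the output is later split
    # into `heads` chunks of emb // heads dimensions, so the count ignores heads.
    if emb % heads != 0:
        raise ValueError("emb must be divisible by heads")
    return emb * emb


emb, heads = 512, 8  # assumed: 8 heads of 64 dimensions each
print(wide_params(emb, heads))    # 2097152
print(narrow_params(emb, heads))  # 262144
```

The wide layout uses exactly `heads` times as many weights per projection, and the same factor applies to the query and value projections.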

I'll dig into this a bit more and update the blog post and the code.

@sidneyaraujomelo sidneyaraujomelo commented Sep 26, 2019

Thanks a lot, Peter! I'll stay tuned for the updates.
