# Comparing SelfAttention classes #8

Closed
sidneyaraujomelo opened this issue Sep 25, 2019 · 2 comments
### sidneyaraujomelo commented Sep 25, 2019

Hello. First of all, thanks a lot for your post Transformers from scratch; it is one of the best and most complete explanations of the topic that I've read.

I have a question about your implementation of the `SelfAttention` class, especially the computation of the queries, keys, and values for all heads. For example:

`self.tokeys = nn.Linear(emb, emb * heads, bias=False)`

I understood this as projecting an input X with `emb` dimensions into `heads` vectors of `emb` dimensions each. However, other implementations define something such as:

`self.dk = emb // heads` and `self.tokeys = nn.Linear(emb, emb)`

and then perform something like:

`keys = self.tokeys(x).view(nbatches, -1, heads, self.dk).transpose(1, 2)`

which I understood as projecting an input X with `emb` dimensions and then splitting the result into `heads` subvectors of `dk` dimensions each before performing the attention operation. This is based on The Annotated Transformer code.

Did I understand your implementation correctly? If so, are these implementations equivalent? Is there a significant difference between them in terms of "architectural power"? Sorry for the long question. Thanks in advance!
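To make the comparison concrete, here is a minimal sketch of how I read the two projection styles (illustrative names and sizes, assuming batch-first input of shape `(batch, seq, emb)`):

```python
import torch
import torch.nn as nn

# Hypothetical sizes, just to compare the two styles side by side.
batch, seq, emb, heads = 4, 10, 256, 8
x = torch.randn(batch, seq, emb)

# "Wide" style (your post, as I understand it): project emb -> emb * heads,
# so every head works with a full emb-dimensional key.
tokeys_wide = nn.Linear(emb, emb * heads, bias=False)
keys_wide = tokeys_wide(x).view(batch, seq, heads, emb).transpose(1, 2)
print(keys_wide.shape)    # torch.Size([4, 8, 10, 256])

# "Narrow" style (The Annotated Transformer): project emb -> emb, then split
# the result into heads chunks of size dk = emb // heads.
dk = emb // heads
tokeys_narrow = nn.Linear(emb, emb)
keys_narrow = tokeys_narrow(x).view(batch, seq, heads, dk).transpose(1, 2)
print(keys_narrow.shape)  # torch.Size([4, 8, 10, 32])
```

In both cases the heads end up on their own axis; the difference is whether each head sees the full `emb` dimensions or only an `emb // heads` slice.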
pbloem self-assigned this Sep 26, 2019

### pbloem commented Sep 26, 2019

Hi Sidney,

Thanks for the question. It looks like you found a mistake in my version. From Vaswani et al.:

> In this work we employ $h=8$ parallel attention layers, or heads. For each of these we use $d_k=d_v=d_{\text{model}}/h=64$. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

In other words, they split the embedding into $h$ chunks and feed each chunk separately to an attention head. In my version, I replicate the embedding $h$ times and feed the whole thing to each attention head. That means my version has more weights (and theoretically more power, but I guess that power isn't necessary for the performance of the model). I'll dig into this a bit more and update the blog post and the code.
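For what it's worth, here is a quick sketch of what that means for the parameter count of a single projection (illustrative sizes, not the actual numbers from the repo):

```python
import torch.nn as nn

# Hypothetical sizes, just to show the factor-of-heads difference in weights
# for one projection matrix.
emb, heads = 256, 8

wide = nn.Linear(emb, emb * heads, bias=False)  # replicate: each head gets a full emb-sized projection
narrow = nn.Linear(emb, emb, bias=False)        # split: each head gets an emb // heads slice

print(sum(p.numel() for p in wide.parameters()))    # 524288 = emb * emb * heads
print(sum(p.numel() for p in narrow.parameters()))  # 65536  = emb * emb
```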

### sidneyaraujomelo commented Sep 26, 2019

 Thanks a lot, Peter! I'll stay tuned for the updates.