I have a feeling you have already found the answer, since you asked this question last December; nevertheless, I would like to answer it.
In the basic implementation of the attention mechanism you have three separate weight matrices for query, key, and value, so in order to obtain q, k, and v you apply these three matrices separately to the input x (in self-attention). Here is an example.
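A minimal PyTorch sketch of this approach (the module and names below are my own illustration, not taken from the LoRA codebase):

```python
import torch
import torch.nn as nn

class SeparateQKV(nn.Module):
    """Self-attention projections with three independent weight matrices."""
    def __init__(self, d_model: int):
        super().__init__()
        # One nn.Linear per projection; each holds its own weight matrix.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model); apply each matrix to the same input.
        q = self.w_q(x)
        k = self.w_k(x)
        v = self.w_v(x)
        return q, k, v
```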
This approach is covered by the Linear class.
The other approach is to have a single matrix that stores the weights for all three projections (query, key, and value). You can apply this one big combined matrix once (which helps with parallelization on the GPU) and then split the output into three chunks to obtain the queries, keys, and values. Here is an example.
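Again as an illustrative PyTorch sketch (names are mine, not from the repo):

```python
import torch
import torch.nn as nn

class MergedQKV(nn.Module):
    """Self-attention projections fused into a single weight matrix."""
    def __init__(self, d_model: int):
        super().__init__()
        # One weight matrix of shape (3 * d_model, d_model) replaces w_q, w_k, w_v.
        self.w_qkv = nn.Linear(d_model, 3 * d_model)

    def forward(self, x: torch.Tensor):
        # A single matmul, then split the last dimension into three equal chunks.
        q, k, v = self.w_qkv(x).chunk(3, dim=-1)
        return q, k, v
```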
The MergedLinear class was created for this approach.
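To make the mapping concrete, here is a sketch of how the two classes are typically applied, following the usage pattern shown in the loralib README (d_model, r, and the enable_lora pattern are illustrative choices, not prescribed values):

```python
import loralib as lora

d_model = 768  # illustrative size

# Separate projections: each nn.Linear is swapped for a lora.Linear.
q_proj = lora.Linear(d_model, d_model, r=16)

# Fused qkv projection (as some Transformer implementations use):
# MergedLinear adapts only the selected sub-matrices, here q and v but not k.
qkv_proj = lora.MergedLinear(d_model, 3 * d_model, r=16,
                             enable_lora=[True, False, True])
```

In other words, Linear handles the common case of one projection per weight matrix, while MergedLinear exists so that a fused qkv weight can still have LoRA applied to only some of its sub-matrices.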
Hi,
Thank you for this really nice paper.
This is not an issue but a general question: why are there both a Linear and a MergedLinear class?
Thank you,
Maxime.