Why Linear and MergedLinear? #41

Open
maximedb opened this issue Dec 13, 2022 · 1 comment

@maximedb
Hi,

Thank you for this really nice paper.

This is not an issue but a general question: why are there both a Linear and a MergedLinear class?

Thank you,

Maxime.

@Andrei-Aksionov

Hello @maximedb

I have a feeling that you've already found the answer, since you asked this question last December, but I'd like to answer it nevertheless.

In the basic implementation of the attention mechanism, you have three separate weight matrices for query, key, and value, so to obtain q, k, and v you apply these three matrices separately to the input x (in self-attention).
This approach is covered by the Linear class.
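Here is a minimal sketch of that layout in plain PyTorch (the names and dimensions like `d_model` are made up for illustration; this isn't the repo's code):

```python
import torch
import torch.nn as nn

d_model = 64  # made-up dimension, just for illustration

# Three separate projection matrices, one per role. Each is an ordinary
# linear layer, so each can be replaced by a LoRA Linear independently.
w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)

x = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model)
q, k, v = w_q(x), w_k(x), w_v(x)  # three separate matmuls on the same input
```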

The other approach is to have a single matrix that stores the weights for all three projections (query, key, and value). You can apply this big combined matrix once (which helps with parallelization on the GPU) and then split the output into three chunks to get the queries, keys, and values.
The MergedLinear class covers this approach.
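And a minimal sketch of the fused layout, again in plain PyTorch with made-up names:

```python
import torch
import torch.nn as nn

d_model = 64  # made-up dimension, just for illustration

# One weight matrix stores the query, key and value projections stacked
# together, so a single matmul produces all three outputs at once.
qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)

x = torch.randn(2, 10, d_model)          # (batch, seq_len, d_model)
q, k, v = qkv_proj(x).chunk(3, dim=-1)   # one matmul, then split into chunks
```

If I remember the library's API correctly, this fused layout is also why MergedLinear takes an `enable_lora` argument: with one stacked weight you can still adapt only some of the projections, e.g. `lora.MergedLinear(d_model, 3 * d_model, r=8, enable_lora=[True, False, True])` adapts q and v while leaving k frozen.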
