I have a feeling you have already found the answer, since you asked this question last December; nevertheless, I would like to answer it.
In the basic implementation of the attention mechanism you have three separate weight matrices for query, key, and value, so in order to obtain q, k, and v you apply these three matrices separately to the input x (in self-attention). Here is an example.
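A minimal PyTorch sketch of this approach (the module and names below are my own illustration, not taken from the LoRA codebase):

```python
import torch
import torch.nn as nn

class SeparateQKV(nn.Module):
    """Self-attention projections with three independent weight matrices."""
    def __init__(self, d_model: int):
        super().__init__()
        # One nn.Linear per projection; each holds its own weight matrix.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model); apply each matrix to the same input.
        q = self.w_q(x)
        k = self.w_k(x)
        v = self.w_v(x)
        return q, k, v
```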
This approach is covered by the Linear class.
The other approach is to have a single matrix that stores the weights for all three projections (query, key, and value). You can apply this one big combined matrix once (which helps with parallelization on the GPU) and then split the output into three chunks to obtain the queries, keys, and values. Here is an example.
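Again as an illustrative PyTorch sketch (names are mine, not from the repo):

```python
import torch
import torch.nn as nn

class MergedQKV(nn.Module):
    """Self-attention projections fused into a single weight matrix."""
    def __init__(self, d_model: int):
        super().__init__()
        # One weight matrix of shape (3 * d_model, d_model) replaces w_q, w_k, w_v.
        self.w_qkv = nn.Linear(d_model, 3 * d_model)

    def forward(self, x: torch.Tensor):
        # A single matmul, then split the last dimension into three equal chunks.
        q, k, v = self.w_qkv(x).chunk(3, dim=-1)
        return q, k, v
```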
The MergedLinear class was created for this approach.
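To make the mapping concrete, here is a sketch of how the two classes are typically applied, following the usage pattern shown in the loralib README (d_model, r, and the enable_lora pattern are illustrative choices, not prescribed values):

```python
import loralib as lora

d_model = 768  # illustrative size

# Separate projections: each nn.Linear is swapped for a lora.Linear.
q_proj = lora.Linear(d_model, d_model, r=16)

# Fused qkv projection (as some Transformer implementations use):
# MergedLinear adapts only the selected sub-matrices, here q and v but not k.
qkv_proj = lora.MergedLinear(d_model, 3 * d_model, r=16,
                             enable_lora=[True, False, True])
```

In other words, Linear handles the common case of one projection per weight matrix, while MergedLinear exists so that a fused qkv weight can still have LoRA applied to only some of its sub-matrices.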
Hi,
Thank you for this really nice paper.
This is not an issue but a general question: why are there both a Linear and a MergedLinear class?
Thank you,
Maxime.