Memory consumption #17

GregorKobsik opened this issue Feb 14, 2022 · 1 comment


@GregorKobsik

Could you provide some additional information about the memory consumption of your Graph Transformer?

You state that sparse attention benefits both computation time and memory consumption, but you do not provide actual measurements of the latter in your evaluation, nor do you state clearly whether and how your implementation is able to take advantage of it.

Some peak-memory measurements from your experiments, as an addendum to the evaluation of computation times (e.g. Table 1), could be beneficial to others, too. In my case, the quadratic growth of memory consumption w.r.t. the sequence length prevents an efficient use of Transformers for some tasks where connectivity information is given and could simply be modeled by masking out (-Inf) attention scores in the attention matrix.
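
For illustration, here is a minimal sketch of the masked dense attention I have in mind (plain PyTorch, single head, identity projections for brevity; the function name and sizes are my own, not taken from your code). The full N x N score matrix is materialized regardless of how sparse the mask is:

```python
import torch

def masked_dense_attention(h, adj):
    """h: (N, d) node features; adj: (N, N) boolean adjacency (True = attend)."""
    N, d = h.shape
    q, k, v = h, h, h                          # identity projections, for brevity
    scores = q @ k.T / d ** 0.5                # (N, N): the quadratic-memory term
    scores = scores.masked_fill(~adj, float("-inf"))
    attn = torch.softmax(scores, dim=-1)       # still (N, N)
    return attn @ v

N, d = 2048, 64
h = torch.randn(N, d)
adj = torch.rand(N, N) < 0.01                  # ~1% of edges present
adj |= torch.eye(N, dtype=torch.bool)          # self-loops, so no row is all -inf
out = masked_dense_attention(h, adj)           # peak memory dominated by the (N, N) scores
print(out.shape)                               # torch.Size([2048, 64])
```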

Some exemplary or artificial data could also be interesting, e.g. a (mean) number of nodes n = {128, 1024, 2048, 4096} and a (mean) number of edges per node e = {4, 8, 16, 64, 128}, to get an impression of the resource consumption of your Graph Transformer (sparse graph) vs. an NLP Transformer (full graph with masking).
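
As a rough indication of the gap I would expect (pure arithmetic on the number of attention entries stored per head, assuming E = n * e; these are not measurements):

```python
# Entries stored per attention head: dense (masked) N x N matrix
# vs. one score per edge in a sparse formulation.
for n in (128, 1024, 2048, 4096):
    for e in (4, 8, 16, 64, 128):
        dense, sparse = n * n, n * e
        print(f"n={n:5d} e={e:4d}  dense={dense:>12,}  sparse={sparse:>10,}  "
              f"ratio={dense / sparse:7.1f}x")
```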

(I could probably run the experiments myself, but I suppose your evaluation pipeline is already set up, and data provided by the original authors would also be more precise and more trustworthy to other researchers.)

@vijaydwivedi75
Member

Hi @GregorKobsik,

When we use the sparse Graph Transformer (i.e. the original sparse adjacency matrix), the memory consumption is O(E). With the fully-connected graph, the memory consumption becomes O(N^2), where N and E denote the number of nodes and edges, respectively. Because these orders of memory consumption are straightforward, we did not report the actual memory consumed during training/evaluation.

In the case of real-world (sparse) graphs, E is significantly smaller than N^2, hence the memory consumed by a Graph Transformer operating on the sparse graph is smaller than that of one operating on the fully-connected graph (like the NLP Transformer).
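
A hedged sketch of edge-wise attention in plain PyTorch (a simplified stand-in, not the repository's DGL-based implementation; the function name and the scatter-based softmax are illustrative): only one score per edge is stored, so memory scales with E rather than N^2:

```python
import torch

def sparse_graph_attention(h, edge_index):
    """h: (N, d) node features; edge_index: (2, E) with rows (src, dst)."""
    N, d = h.shape
    src, dst = edge_index                           # each of shape (E,)
    scores = (h[src] * h[dst]).sum(-1) / d ** 0.5   # one score per edge: O(E)
    exp = (scores - scores.max()).exp()             # shift for numerical stability
    denom = torch.zeros(N).scatter_add_(0, dst, exp)         # per-node normalizer
    alpha = exp / denom[dst].clamp(min=1e-12)                 # softmax over incoming edges
    out = torch.zeros_like(h).index_add_(0, dst, alpha.unsqueeze(-1) * h[src])
    return out

N, d = 2048, 64
E = N * 8                                            # ~8 edges per node
h = torch.randn(N, d)
edge_index = torch.randint(0, N, (2, E))
print(sparse_graph_attention(h, edge_index).shape)   # torch.Size([2048, 64])
```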
