
Running machine translation using different GNNs #536

Closed
smith-co opened this issue Mar 29, 2022 · 10 comments

@smith-co

❓ Questions and Help

I am running the NMT example on the same dataset with GNN variants:

  • GCN
  • GGNN
  • GraphSage

While the execution runs fine with GCN, I get an Out-of-Memory (OOM) error for GGNN and GraphSage. Can anyone help me with this?

@AlanSwift
Contributor

Please try a smaller batch_size, or try another GPU with more memory.
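
Beyond lowering batch_size, gradient accumulation is a common way to shrink the per-step activation footprint while keeping the same effective batch size. Below is a minimal sketch assuming a standard PyTorch training loop; `model`, `loader`, `optimizer`, and `compute_loss` are hypothetical placeholders for the NMT example's own training objects, not graph4nlp APIs.

```python
# Minimal gradient-accumulation sketch: train on smaller micro-batches but only
# step the optimizer every ACCUM_STEPS batches, so the effective batch size is
# loader batch_size * ACCUM_STEPS. All names below are placeholders.
ACCUM_STEPS = 4

def train_one_epoch(model, loader, optimizer, compute_loss):
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader):
        loss = compute_loss(model, batch) / ACCUM_STEPS
        loss.backward()                       # gradients accumulate across micro-batches
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```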

@smith-co
Author

@AlanSwift I already tried a smaller batch size. What I find surprising is:

  • It runs for GCN and GAT.
  • But it gives Out-of-Memory (OOM) for GGNN and GraphSage.

It's the same dataset, yet GGNN and GraphSage fail to run while GCN and GAT work.

So GGNN/GraphSage need more resources for some reason? I'm super interested to know why.

@AlanSwift
Contributor

We haven't investigated the memory efficiency of DGL :).
It seems that GGNN and GraphSage need more GPU memory.
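
For what it's worth, a quick way to compare the encoders on the same input is to probe peak GPU memory around a single forward pass. This is only a sketch: `gnn_encoder` and `batch_graph` are placeholders for the graph4nlp encoder and a batched input graph; only the torch.cuda calls are standard PyTorch.

```python
import torch

def peak_memory_gib(gnn_encoder, batch_graph):
    """Rough probe of one forward pass's peak GPU memory, in GiB.

    `gnn_encoder` and `batch_graph` are placeholders; only the torch.cuda
    calls below are standard PyTorch APIs.
    """
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        gnn_encoder(batch_graph)
    return torch.cuda.max_memory_allocated() / 1024 ** 3
```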

@smith-co
Author

smith-co commented Apr 1, 2022

@AlanSwift I get this OOM error at runtime for GGNN:

  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/graph4nlp_cu111-0.4.0-py3.9.egg/graph4nlp/pytorch/models/graph2seq.py", line 226, in forward
    return self.encoder_decoder(batch_graph=batch_graph, oov_dict=oov_dict, tgt_seq=tgt_seq)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/graph4nlp_cu111-0.4.0-py3.9.egg/graph4nlp/pytorch/models/graph2seq.py", line 173, in encoder_decoder
    batch_graph = self.gnn_encoder(batch_graph)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/graph4nlp_cu111-0.4.0-py3.9.egg/graph4nlp/pytorch/modules/graph_embedding/ggnn.py", line 557, in forward
    h = self.models(dgl_graph, (feat_in, feat_out), etypes, edge_weight)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/graph4nlp_cu111-0.4.0-py3.9.egg/graph4nlp/pytorch/modules/graph_embedding/ggnn.py", line 442, in forward
    return self.model(graph, node_feats, etypes, edge_weight)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/graph4nlp_cu111-0.4.0-py3.9.egg/graph4nlp/pytorch/modules/graph_embedding/ggnn.py", line 210, in forward
    graph_in.apply_edges(
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/dgl_cu111-0.7a210520-py3.9-linux-x86_64.egg/dgl/heterograph.py", line 4300, in apply_edges
    edata = core.invoke_edge_udf(g, eid, etype, func)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/dgl_cu111-0.7a210520-py3.9-linux-x86_64.egg/dgl/core.py", line 85, in invoke_edge_udf
    return func(ebatch)
  File "/mnt/volume1/anaconda3/envs/ggnn/lib/python3.9/site-packages/graph4nlp_cu111-0.4.0-py3.9.egg/graph4nlp/pytorch/modules/graph_embedding/ggnn.py", line 212, in <lambda>
    "W_e*h": self.linears_in[i](edges.src["h"])
RuntimeError: CUDA out of memory. Tried to allocate 1.12 GiB (GPU 3; 14.76 GiB total capacity; 11.83 GiB already allocated; 447.75 MiB free; 12.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Any idea?
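
For context, the failing call is the edge UDF that builds "W_e*h", which materializes one row per edge (and the layer does this for both in- and out-edges). A back-of-envelope estimate, with purely illustrative sizes rather than values from the failing run, shows how quickly that tensor grows:

```python
# Rough estimate of the per-edge "W_e*h" tensor the traceback's apply_edges UDF
# materializes. All sizes are illustrative assumptions, not measured values.
num_edges      = 1_500_000  # edges in the batched graph (assumed)
hidden_dim     = 512        # GGNN hidden size (assumed)
n_directions   = 2          # the layer builds in-edge and out-edge messages
bytes_per_fp32 = 4

gib = num_edges * hidden_dim * n_directions * bytes_per_fp32 / 1024 ** 3
print(f"~{gib:.2f} GiB just for the per-edge messages of a single GGNN step")
```

As the error message itself suggests, setting PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 can reduce fragmentation, but it will not help if the per-edge tensor simply does not fit.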

@smith-co
Author

smith-co commented Apr 1, 2022

@AlanSwift I came across this discussion on the DGL forum: Memory consumption of the GGNN module

@AlanSwift
Contributor

It seems that DGL sacrifices memory efficiency for time efficiency. We will pay attention to this problem. Thank you for letting us know!
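
To illustrate the trade-off on a toy graph (this is not the graph4nlp code, just a sketch of the two DGL patterns): an edge UDF like the one in the traceback forces the per-edge message tensor to be stored, whereas DGL's built-in message/reduce functions can be fused so that tensor is never materialized.

```python
import dgl
import dgl.function as fn
import torch

# Toy graph; sizes are arbitrary.
g = dgl.rand_graph(1_000, 10_000)
g.ndata["h"] = torch.randn(1_000, 256)

# Pattern from the traceback: an edge UDF makes DGL materialize a
# (num_edges, hidden) message tensor before aggregation.
g.apply_edges(lambda edges: {"m": edges.src["h"]})
g.update_all(fn.copy_e("m", "m"), fn.sum("m", "h_udf"))

# Built-in message/reduce functions let DGL fuse both steps, so the per-edge
# messages are never stored explicitly. (GGNN's per-edge-type weights W_e make
# this harder in practice; the transform would have to move to the node side.)
g.update_all(fn.copy_u("h", "m"), fn.sum("m", "h_builtin"))
```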

@smith-co
Author

smith-co commented Apr 6, 2022

@AlanSwift can you please provide a fix or suggestion? 🙏

@nashid

nashid commented Apr 19, 2022

@AlanSwift, this is interesting. I also faced the same problem. Do you have any solution to this?

@nashid

nashid commented May 12, 2022

@AlanSwift do you have a plan to address the GGNN implementation limitation?

@AlanSwift
Contributor

Currently, this is not in my plan, since it is related to DGL.
