Failed to train using GPU with M3GNet model #94

SmallBearC · 2023-06-24T08:53:53Z

I am able to train normally using the CPU, but training keeps failing when I use the GPU. I don't know where the problem is, so I hope someone can help me. I would be very grateful.
My script is as follows:

from __future__ import annotations

import os
import shutil

import numpy as np
import pytorch_lightning as pl
from dgl.data.utils import split_dataset
from pymatgen.core import Structure
from matgl.ext.pymatgen import Structure2Graph, get_element_list
from matgl.graph.data import M3GNetDataset, MEGNetDataset, MGLDataLoader, collate_fn, collate_fn_efs
from matgl.models import M3GNet, MEGNet
from matgl.utils.training import ModelLightningModule, PotentialLightningModule
import torch

if __name__ == '__main__':
    stru0 = Structure.from_file("../../strus/0/POSCAR")
    stru1 = Structure.from_file("../../strus/1/POSCAR")
    structures = [stru0, stru1] * 10
    energies = np.zeros(len(structures))
    forces = [np.zeros((len(s), 3)).tolist() for s in structures]
    stresses = [np.zeros((3, 3)).tolist()] * len(structures)
    element_types = get_element_list([stru0, stru1])
    converter = Structure2Graph(element_types=element_types, cutoff=5.0)
    dataset = M3GNetDataset(
        threebody_cutoff=4.0,
        structures=structures,
        converter=converter,
        energies=energies,
        forces=forces,
        stresses=stresses,
    )
    train_data, val_data, test_data = split_dataset(
        dataset,
        frac_list=[0.8, 0.1, 0.1],
        shuffle=True,
        random_state=42,
    )
    train_loader, val_loader, test_loader = MGLDataLoader(
        train_data=train_data,
        val_data=val_data,
        test_data=test_data,
        collate_fn=collate_fn_efs,
        batch_size=32,
        num_workers=8,
    )
    model = M3GNet(
        element_types=element_types,
        is_intensive=False,
    )
    lit_model = PotentialLightningModule(model=model)
    torch.set_default_device('cuda')
    torch.multiprocessing.set_start_method('spawn', force=True)
    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model=lit_model, train_dataloaders=train_loader, val_dataloaders=val_loader)

The error message I received is as follows:

Traceback (most recent call last):
File "/home/lycui/test/mgl/test/train.py", line 59, in <module>
trainer.fit(model=lit_model, train_dataloaders=train_loader, val_dataloaders=val_loader)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 531, in fit
call._call_and_handle_interrupt(
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 570, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 975, in _run
results = self._run_stage()
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1016, in _run_stage
self._run_sanity_check()
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_sanity_check
val_loop.run()
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/loops/utilities.py", line 177, in _decorator
return loop_run(self, *args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 115, in run
self._evaluation_step(batch, batch_idx, dataloader_idx)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 375, in _evaluation_step
output = call._call_strategy_hook(trainer, hook_name, *step_kwargs.values())
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 287, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 379, in validation_step
return self.model.validation_step(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/utils/training.py", line 59, in validation_step
results, batch_size = self.step(batch) # type: ignore
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/utils/training.py", line 329, in step
e, f, s, _ = self(g=g, state_attr=state_attr, l_g=l_g)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/utils/training.py", line 317, in forward
e, f, s, h = self.model(g=g, l_g=l_g, state_attr=state_attr)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/apps/pes.py", line 75, in forward
total_energies = self.data_std * self.model(g=g, state_attr=state_attr, l_g=l_g) + self.data_mean
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/models/_m3gnet.py", line 227, in forward
expanded_dists = self.bond_expansion(g.edata["bond_dist"])
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/layers/_bond.py", line 65, in forward
bond_basis = self.rbf(bond_dist)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/layers/_basis.py", line 104, in __call__
return self._call_sbf(r)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/layers/_basis.py", line 120, in _call_sbf
func(r[:, None] * root[None, :] / self.cutoff) * factor / torch.abs(func_add1(root[None, :]))
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/utils/_device.py", line 62, in __torch_function__
return func(*args, **kwargs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
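
One way to narrow down an error like this is to list the device of every tensor the model holds. The following is only a sketch: `tensor_devices` and the `Toy` module are hypothetical names made up for illustration, and the "meta" device stands in for "cuda" so the snippet runs without a GPU. It relies on the fact that `Module.to()` moves parameters and registered buffers but not tensors stored as plain attributes. Whether the matgl basis layer keeps `root` as a plain attribute is an assumption, not something this thread confirms.

```python
from __future__ import annotations

import torch
from torch import nn

# Stand-in device so the sketch runs without a GPU; use "cuda" on a GPU machine.
DEVICE = "meta"


def tensor_devices(module: nn.Module) -> dict[str, str]:
    """Map every tensor found in the module tree to its device string."""
    found = {}
    for mod_name, sub in module.named_modules():
        prefix = f"{mod_name}." if mod_name else ""
        for name, param in sub.named_parameters(recurse=False):
            found[prefix + name] = str(param.device)
        for name, buf in sub.named_buffers(recurse=False):
            found[prefix + name] = str(buf.device)
        # Plain tensor attributes live in __dict__ and are not moved by .to().
        for name, value in vars(sub).items():
            if isinstance(value, torch.Tensor):
                found[prefix + name] = str(value.device)
    return found


class Toy(nn.Module):
    """Hypothetical module with a parameter, a buffer, and a plain tensor."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)
        self.register_buffer("scale", torch.ones(2))
        self.root = torch.arange(3.0)  # plain attribute, left behind by .to()


toy = Toy().to(DEVICE)
report = tensor_devices(toy)
for name, device in sorted(report.items()):
    print(f"{name}: {device}")
```

Any entry that still reports `cpu` after the move is a candidate for the tensor that triggers the mixed-device error.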

shyuep · 2023-06-25T00:32:19Z

Your torch default device needs to be set in the very first call, not after all the other model setup has occurred. Alternatively, you can wrap all of your code in a with torch.device("cuda") block.
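
The ordering point can be illustrated with a small sketch that uses plain PyTorch only, not matgl. It assumes PyTorch 2.0 or later, and it falls back to the "meta" device as a stand-in when no GPU is present so that it still runs. The names `early` and `late` are made up for illustration.

```python
import torch
from torch import nn

# Use the GPU when present; otherwise "meta" acts as a stand-in device.
DEVICE = "cuda" if torch.cuda.is_available() else "meta"

# Created BEFORE the default device is changed: parameters stay on the CPU.
early = nn.Linear(3, 1)

torch.set_default_device(DEVICE)

# Created AFTER the default device is changed: parameters land on DEVICE.
late = nn.Linear(3, 1)

early_dev = early.weight.device.type
late_dev = late.weight.device.type
print(f"early layer: {early_dev}, late layer: {late_dev}")

# Restore the default so later code in the same process is unaffected.
torch.set_default_device("cpu")
```

Changing the default device does not move objects that already exist, which is why the call has to come before the dataset, loaders, and model are built.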

shyuep · 2023-06-25T00:33:02Z

I should add that you should post such questions in the Discussions section, not as an Issue on GitHub. Issues are meant for actual bug reports.

SmallBearC · 2023-06-25T01:28:17Z

Thank you very much for your guidance. I have tried the following:

from __future__ import annotations
import torch
torch.set_default_device('cuda')
#torch.device('cuda')
import os
import shutil
import numpy as np
import pytorch_lightning as pl
from dgl.data.utils import split_dataset
from pymatgen.core import Structure
from matgl.ext.pymatgen import Structure2Graph, get_element_list
from matgl.graph.data import M3GNetDataset, MEGNetDataset, MGLDataLoader, collate_fn, collate_fn_efs
from matgl.models import M3GNet, MEGNet
from matgl.utils.training import ModelLightningModule, PotentialLightningModule
#mp.set_start_method('spawn')
#torch.set_default_device('cpu')
if __name__ == '__main__':
#    torch.device('cuda')
    stru0 = Structure.from_file("../../strus/0/POSCAR")
    stru1 = Structure.from_file("../../strus/1/POSCAR")
    ......

Or

from __future__ import annotations
import torch
#torch.set_default_device('cuda')
#torch.device('cuda')
import os
import shutil
import numpy as np
import pytorch_lightning as pl
from dgl.data.utils import split_dataset
from pymatgen.core import Structure
from matgl.ext.pymatgen import Structure2Graph, get_element_list
from matgl.graph.data import M3GNetDataset, MEGNetDataset, MGLDataLoader, collate_fn, collate_fn_efs
from matgl.models import M3GNet, MEGNet
from matgl.utils.training import ModelLightningModule, PotentialLightningModule
#mp.set_start_method('spawn')
#torch.set_default_device('cpu')
if __name__ == '__main__':
    torch.set_default_device('cuda')
    torch.device('cuda')
    stru0 = Structure.from_file("../../strus/0/POSCAR")
    stru1 = Structure.from_file("../../strus/1/POSCAR")
    ......

But both gave the same error:

Original Traceback (most recent call last):
  File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/dgl/data/utils.py", line 425, in __getitem__
    return self.dataset[self.indices[item]]
  File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/graph/data.py", line 321, in __getitem__
    self.state_attr[idx],
RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

When I use torch.device('cuda') as follows:

from __future__ import annotations
import torch
torch.device('cuda')
#torch.device('cuda')
import os
import shutil
import numpy as np
import pytorch_lightning as pl
from dgl.data.utils import split_dataset
from pymatgen.core import Structure
from matgl.ext.pymatgen import Structure2Graph, get_element_list
from matgl.graph.data import M3GNetDataset, MEGNetDataset, MGLDataLoader, collate_fn, collate_fn_efs
from matgl.models import M3GNet, MEGNet
from matgl.utils.training import ModelLightningModule, PotentialLightningModule
#mp.set_start_method('spawn')
#torch.set_default_device('cpu')
if __name__ == '__main__':
#    torch.device('cuda')
    stru0 = Structure.from_file("../../strus/0/POSCAR")
    stru1 = Structure.from_file("../../strus/1/POSCAR")
    ......

Or

from __future__ import annotations
import torch
#torch.set_default_device('cuda')
#torch.device('cuda')
import os
import shutil
import numpy as np
import pytorch_lightning as pl
from dgl.data.utils import split_dataset
from pymatgen.core import Structure
from matgl.ext.pymatgen import Structure2Graph, get_element_list
from matgl.graph.data import M3GNetDataset, MEGNetDataset, MGLDataLoader, collate_fn, collate_fn_efs
from matgl.models import M3GNet, MEGNet
from matgl.utils.training import ModelLightningModule, PotentialLightningModule
#mp.set_start_method('spawn')
#torch.set_default_device('cpu')
if __name__ == '__main__':
    torch.device('cuda')
    torch.device('cuda')
    stru0 = Structure.from_file("../../strus/0/POSCAR")
    stru1 = Structure.from_file("../../strus/1/POSCAR")
    ......

I get this error:

  File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/models/_m3gnet.py", line 224, in forward
    g.edata["bond_vec"] = bond_vec
  File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/dgl/view.py", line 227, in __setitem__
    self._graph._set_e_repr(self._etid, self._edges, {key: val})
  File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/dgl/heterograph.py", line 4460, in _set_e_repr
    raise DGLError(
dgl._ffi.base.DGLError: Cannot assign edge feature "bond_vec" on device cpu to a graph on device cuda:0. Call DGLGraph.to() to copy the graph to the same device.

So I'm confused and don't know what to do. Here are the versions of some of my packages:

dgl 1.1.0+cu117
cryptography 41.0.1
matgl 0.6.2
pytorch-lightning 2.0.4
torch 2.0.1
tqdm 4.65.0
pymatgen 2023.5.31
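
In the torch.device('cuda') attempts above, the call is written as a bare statement. In plain PyTorch that only constructs a device object and discards it, so the default device is unchanged. The device takes effect only when it is used as a context manager or passed to torch.set_default_device. The sketch below shows the difference, assuming PyTorch 2.0 or later and using "meta" as a stand-in for "cuda" so that it runs without a GPU. The variable names are illustrative.

```python
import torch

# "meta" is a stand-in so this runs without a GPU; substitute "cuda" on a GPU machine.
DEVICE = "meta"

# Bare statement: builds a device object and throws it away.
torch.device(DEVICE)
bare = torch.zeros(2)        # still created on the existing default (CPU)

# Context manager form: tensors created inside the block go to DEVICE.
with torch.device(DEVICE):
    scoped = torch.zeros(2)

# Outside the block the previous default applies again.
after = torch.zeros(2)

print(bare.device.type, scoped.device.type, after.device.type)
```

This would be consistent with the DGLError above, where a feature created on the CPU is assigned to a graph on cuda:0, although the thread does not confirm that this is the only cause.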

chiku-parida · 2023-07-31T12:30:10Z

Were you able to solve this problem? Please let me know.
I tried declaring CUDA at the beginning, as shown below, but the error still persists...

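'''
import torch

# Pick the GPU when one is available; this only stores a name and does not
# change where tensors are created.
if torch.cuda.is_available():
    device = 'cuda'
else:
    device = 'cpu'

print(f'The available device is {device}')
'''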

txy159 · 2023-11-27T05:54:54Z

Hi @SmallBearC @chiku-parida, were you able to solve this problem? I also get the same errors.

shyuep closed this as completed Jun 25, 2023

mstapelberg mentioned this issue Nov 17, 2023

[Bug]: Multi-GPU Training not Working in 0.7.1 and 0.8.5 #195

Closed

4 tasks
