Failed to train using GPU with M3GNet model #94
Comments
Your torch default device needs to be set in the very first call, not after all the other model setup has occurred. Alternatively, you can wrap all your code in a with torch.device("cuda"): block.
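As a minimal sketch of the two options mentioned in this reply (not matgl-specific code): the first sets the default device before anything is created, the second scopes allocation with the context manager. The CPU fallback and the names DEVICE, layer, scoped and outside are assumptions added here so the sketch also runs on a machine without a GPU.

```python
import torch

# Assumption for illustration: fall back to CPU when no GPU is present.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Option 1: set the default device before any tensor or module is created.
torch.set_default_device(DEVICE)
layer = torch.nn.Linear(4, 2)   # parameters are allocated on DEVICE
x = torch.ones(3, 4)            # factory functions also allocate on DEVICE
y = layer(x)

# Option 2: leave the global default on CPU and scope allocation instead.
torch.set_default_device("cpu")
with torch.device(DEVICE):
    scoped = torch.zeros(2)     # allocated on DEVICE inside the block
outside = torch.zeros(2)        # allocated on CPU outside the block
```

Anything created before the set_default_device call, or outside the with block, stays where it was allocated, which is why the order of the calls matters.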
I should add that you should post such questions in the Discussions section, not as an Issue on GitHub. Issues are meant for actual bug reports.
Thank you very much for your guidance.
I have tried as follows:
from __future__ import annotations
import torch
torch.set_default_device('cuda')
#torch.device('cuda')
import os
import shutil
import numpy as np
import pytorch_lightning as pl
from dgl.data.utils import split_dataset
from pymatgen.core import Structure
from matgl.ext.pymatgen import Structure2Graph, get_element_list
from matgl.graph.data import M3GNetDataset, MEGNetDataset, MGLDataLoader, collate_fn, collate_fn_efs
from matgl.models import M3GNet, MEGNet
from matgl.utils.training import ModelLightningModule, PotentialLightningModule
#mp.set_start_method('spawn')
#torch.set_default_device('cpu')
if __name__ == '__main__':
    # torch.device('cuda')
    stru0 = Structure.from_file("../../strus/0/POSCAR")
    stru1 = Structure.from_file("../../strus/1/POSCAR")
    ......
Or
from __future__ import annotations
import torch
#torch.set_default_device('cuda')
#torch.device('cuda')
import os
import shutil
import numpy as np
import pytorch_lightning as pl
from dgl.data.utils import split_dataset
from pymatgen.core import Structure
from matgl.ext.pymatgen import Structure2Graph, get_element_list
from matgl.graph.data import M3GNetDataset, MEGNetDataset, MGLDataLoader, collate_fn, collate_fn_efs
from matgl.models import M3GNet, MEGNet
from matgl.utils.training import ModelLightningModule, PotentialLightningModule
#mp.set_start_method('spawn')
#torch.set_default_device('cpu')
if __name__ == '__main__':
    torch.set_default_device('cuda')
    torch.device('cuda')
    stru0 = Structure.from_file("../../strus/0/POSCAR")
    stru1 = Structure.from_file("../../strus/1/POSCAR")
    ......
But it all had the same error:
Original Traceback (most recent call last):
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/dgl/data/utils.py", line 425, in __getitem__
return self.dataset[self.indices[item]]
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/graph/data.py", line 321, in __getitem__
self.state_attr[idx],
RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
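A note on reading this traceback: it fails inside _worker_loop, which means the error is raised in a DataLoader worker subprocess rather than in the main process. One hedged way to check whether the worker processes are involved is to load data in the main process only (num_workers=0). The sketch below uses a plain torch DataLoader with a made-up TensorDataset, not MGLDataLoader, and whether this resolves the matgl case is an assumption to verify.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data standing in for the real dataset.
features = torch.arange(8.0).reshape(8, 1)
targets = torch.zeros(8)
data = TensorDataset(features, targets)

# num_workers=0 loads every batch in the main process, so no worker
# subprocess ever has to touch CUDA.
loader = DataLoader(data, batch_size=4, num_workers=0)
batch_shapes = [tuple(xb.shape) for xb, _ in loader]
```

If training proceeds with num_workers=0, the problem is likely in how the workers interact with CUDA rather than in the model itself.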
When I use torch.device('cuda') as follows:
from __future__ import annotations
import torch
torch.device('cuda')
#torch.device('cuda')
import os
import shutil
import numpy as np
import pytorch_lightning as pl
from dgl.data.utils import split_dataset
from pymatgen.core import Structure
from matgl.ext.pymatgen import Structure2Graph, get_element_list
from matgl.graph.data import M3GNetDataset, MEGNetDataset, MGLDataLoader, collate_fn, collate_fn_efs
from matgl.models import M3GNet, MEGNet
from matgl.utils.training import ModelLightningModule, PotentialLightningModule
#mp.set_start_method('spawn')
#torch.set_default_device('cpu')
if __name__ == '__main__':
    # torch.device('cuda')
    stru0 = Structure.from_file("../../strus/0/POSCAR")
    stru1 = Structure.from_file("../../strus/1/POSCAR")
    ......
Or
from __future__ import annotations
import torch
#torch.set_default_device('cuda')
#torch.device('cuda')
import os
import shutil
import numpy as np
import pytorch_lightning as pl
from dgl.data.utils import split_dataset
from pymatgen.core import Structure
from matgl.ext.pymatgen import Structure2Graph, get_element_list
from matgl.graph.data import M3GNetDataset, MEGNetDataset, MGLDataLoader, collate_fn, collate_fn_efs
from matgl.models import M3GNet, MEGNet
from matgl.utils.training import ModelLightningModule, PotentialLightningModule
#mp.set_start_method('spawn')
#torch.set_default_device('cpu')
if __name__ == '__main__':
    torch.device('cuda')
    torch.device('cuda')
    stru0 = Structure.from_file("../../strus/0/POSCAR")
    stru1 = Structure.from_file("../../strus/1/POSCAR")
    ......
I get this error:
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/models/_m3gnet.py", line 224, in forward
g.edata["bond_vec"] = bond_vec
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/dgl/view.py", line 227, in __setitem__
self._graph._set_e_repr(self._etid, self._edges, {key: val})
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/dgl/heterograph.py", line 4460, in _set_e_repr
raise DGLError(
dgl._ffi.base.DGLError: Cannot assign edge feature "bond_vec" on device cpu to a graph on device cuda:0. Call DGLGraph.to() to copy the graph to the same device.
So I'm confused and don't know what to do.
Here are the versions of some of the packages:
dgl 1.1.0+cu117
cryptography 41.0.1
matgl 0.6.2
pytorch-lightning 2.0.4
torch 2.0.1
tqdm 4.65.0
pymatgen 2023.5.31
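Regarding the torch.device('cuda') attempts above: called as a bare statement, torch.device only constructs a device object and does not change where tensors are allocated, so those two scripts still create everything on the CPU. A small sketch of the difference follows; the names dev, target and moved are illustrative, and the CPU fallback is an assumption so that it runs without a GPU.

```python
import torch

# Building a device object has no side effects: nothing is moved and the
# default allocation device is unchanged. This works even without a GPU.
dev = torch.device("cuda")
t = torch.zeros(2)              # still allocated on the default device (CPU)

# To actually use the device, pass it explicitly.
# Guarded so the sketch also runs on CPU-only machines.
target = dev if torch.cuda.is_available() else torch.device("cpu")
moved = t.to(target)
```

This would be consistent with the DGLError above, where one piece of data is reported on cpu while the graph is on cuda:0.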
Were you able to solve this problem? Please let me know.
print(f'The available device is {device}')
Hi @SmallBearC @chiku-parida, were you able to solve this problem? I also get the same errors.
I am able to train normally using the CPU, but training on the GPU keeps failing. I don't know where the problem is, so I hope someone can help me. I would be very grateful.
My script is as follows:
from __future__ import annotations
import os
import shutil
import numpy as np
import pytorch_lightning as pl
from dgl.data.utils import split_dataset
from pymatgen.core import Structure
from matgl.ext.pymatgen import Structure2Graph, get_element_list
from matgl.graph.data import M3GNetDataset, MEGNetDataset, MGLDataLoader, collate_fn, collate_fn_efs
from matgl.models import M3GNet, MEGNet
from matgl.utils.training import ModelLightningModule, PotentialLightningModule
import torch
if __name__ == '__main__':
    stru0 = Structure.from_file("../../strus/0/POSCAR")
    stru1 = Structure.from_file("../../strus/1/POSCAR")
    structures = [stru0, stru1] * 10
    energies = np.zeros(len(structures))
    forces = [np.zeros((len(s), 3)).tolist() for s in structures]
    stresses = [np.zeros((3, 3)).tolist()] * len(structures)
    element_types = get_element_list([stru0, stru1])
    converter = Structure2Graph(element_types=element_types, cutoff=5.0)
    dataset = M3GNetDataset(
        threebody_cutoff=4.0,
        structures=structures,
        converter=converter,
        energies=energies,
        forces=forces,
        stresses=stresses,
    )
    train_data, val_data, test_data = split_dataset(
        dataset,
        frac_list=[0.8, 0.1, 0.1],
        shuffle=True,
        random_state=42,
    )
    train_loader, val_loader, test_loader = MGLDataLoader(
        train_data=train_data,
        val_data=val_data,
        test_data=test_data,
        collate_fn=collate_fn_efs,
        batch_size=32,
        num_workers=8,
    )
    model = M3GNet(
        element_types=element_types,
        is_intensive=False,
    )
    lit_model = PotentialLightningModule(model=model)
    torch.set_default_device('cuda')
    torch.multiprocessing.set_start_method('spawn', force=True)
    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model=lit_model, train_dataloaders=train_loader, val_dataloaders=val_loader)
The error message I received is as follows:
Traceback (most recent call last):
File "/home/lycui/test/mgl/test/train.py", line 59, in <module>
trainer.fit(model=lit_model, train_dataloaders=train_loader, val_dataloaders=val_loader)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 531, in fit
call._call_and_handle_interrupt(
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 570, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 975, in _run
results = self._run_stage()
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1016, in _run_stage
self._run_sanity_check()
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_sanity_check
val_loop.run()
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/loops/utilities.py", line 177, in _decorator
return loop_run(self, *args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 115, in run
self._evaluation_step(batch, batch_idx, dataloader_idx)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 375, in _evaluation_step
output = call._call_strategy_hook(trainer, hook_name, *step_kwargs.values())
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 287, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 379, in validation_step
return self.model.validation_step(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/utils/training.py", line 59, in validation_step
results, batch_size = self.step(batch) # type: ignore
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/utils/training.py", line 329, in step
e, f, s, _ = self(g=g, state_attr=state_attr, l_g=l_g)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/utils/training.py", line 317, in forward
e, f, s, h = self.model(g=g, l_g=l_g, state_attr=state_attr)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/apps/pes.py", line 75, in forward
total_energies = self.data_std * self.model(g=g, state_attr=state_attr, l_g=l_g) + self.data_mean
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/models/_m3gnet.py", line 227, in forward
expanded_dists = self.bond_expansion(g.edata["bond_dist"])
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/layers/_bond.py", line 65, in forward
bond_basis = self.rbf(bond_dist)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/layers/_basis.py", line 104, in __call__
return self._call_sbf(r)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/layers/_basis.py", line 120, in _call_sbf
func(r[:, None] * root[None, :] / self.cutoff) * factor / torch.abs(func_add1(root[None, :]))
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/utils/_device.py", line 62, in __torch_function__
return func(*args, **kwargs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
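The last matgl frame in this traceback multiplies r by root in _call_sbf. One possible reading, stated here as an assumption rather than a confirmed diagnosis, is that a tensor such as root was created on the CPU when the model was built (before torch.set_default_device('cuda') was called) and is stored as an ordinary attribute, so it does not follow the model when the model is moved to the GPU. The helper below is a hypothetical debugging aid, and ToyBasis is a made-up stand-in, not matgl code; it lists tensors that a module holds as plain attributes rather than as parameters or registered buffers.

```python
import torch
from torch import nn


def plain_tensor_attrs(module: nn.Module) -> list[tuple[str, str]]:
    """List tensors stored as ordinary attributes (not parameters or buffers)."""
    found = []
    for prefix, sub in module.named_modules():
        # Parameters and buffers live in private dicts, so any tensor found
        # directly in vars(sub) is a plain attribute that .to() will not move.
        for name, value in vars(sub).items():
            if isinstance(value, torch.Tensor):
                path = f"{prefix}.{name}" if prefix else name
                found.append((path, str(value.device)))
    return found


class ToyBasis(nn.Module):  # hypothetical stand-in, not matgl code
    def __init__(self):
        super().__init__()
        self.roots = torch.arange(1.0, 4.0)            # plain attribute
        self.register_buffer("scale", torch.ones(1))   # registered buffer

    def forward(self, r):
        return r[:, None] * self.roots[None, :] * self.scale


toy = ToyBasis()
report = plain_tensor_attrs(toy)  # reports "roots" but not "scale"
```

Running such a helper on the real model after it has been moved to the GPU would show whether any tensor is still reported on cpu.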