Failed to train using GPU with M3GNet model #94

Closed
SmallBearC opened this issue Jun 24, 2023 · 5 comments

SmallBearC commented Jun 24, 2023

Training works normally on the CPU, but it keeps failing when I use the GPU. I don't know where the problem is, so I hope someone can help me. I would be very grateful.
My script is as follows:

from __future__ import annotations

import os
import shutil

import numpy as np
import pytorch_lightning as pl
from dgl.data.utils import split_dataset
from pymatgen.core import Structure
from matgl.ext.pymatgen import Structure2Graph, get_element_list
from matgl.graph.data import M3GNetDataset, MEGNetDataset, MGLDataLoader, collate_fn, collate_fn_efs
from matgl.models import M3GNet, MEGNet
from matgl.utils.training import ModelLightningModule, PotentialLightningModule
import torch

if __name__ == '__main__':
    stru0 = Structure.from_file("../../strus/0/POSCAR")
    stru1 = Structure.from_file("../../strus/1/POSCAR")
    structures = [stru0, stru1] * 10
    energies = np.zeros(len(structures))
    forces = [np.zeros((len(s), 3)).tolist() for s in structures]
    stresses = [np.zeros((3, 3)).tolist()] * len(structures)
    element_types = get_element_list([stru0, stru1])
    converter = Structure2Graph(element_types=element_types, cutoff=5.0)
    dataset = M3GNetDataset(
        threebody_cutoff=4.0,
        structures=structures,
        converter=converter,
        energies=energies,
        forces=forces,
        stresses=stresses,
    )
    train_data, val_data, test_data = split_dataset(
        dataset,
        frac_list=[0.8, 0.1, 0.1],
        shuffle=True,
        random_state=42,
    )
    train_loader, val_loader, test_loader = MGLDataLoader(
        train_data=train_data,
        val_data=val_data,
        test_data=test_data,
        collate_fn=collate_fn_efs,
        batch_size=32,
        num_workers=8,
    )
    model = M3GNet(
        element_types=element_types,
        is_intensive=False,
    )
    lit_model = PotentialLightningModule(model=model)
    torch.set_default_device('cuda')
    torch.multiprocessing.set_start_method('spawn', force=True)
    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model=lit_model, train_dataloaders=train_loader, val_dataloaders=val_loader)

The error message I received is as follows:

Traceback (most recent call last):
File "/home/lycui/test/mgl/test/train.py", line 59, in <module>
trainer.fit(model=lit_model, train_dataloaders=train_loader, val_dataloaders=val_loader)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 531, in fit
call._call_and_handle_interrupt(
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 570, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 975, in _run
results = self._run_stage()
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1016, in _run_stage
self._run_sanity_check()
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_sanity_check
val_loop.run()
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/loops/utilities.py", line 177, in _decorator
return loop_run(self, *args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 115, in run
self._evaluation_step(batch, batch_idx, dataloader_idx)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 375, in _evaluation_step
output = call._call_strategy_hook(trainer, hook_name, *step_kwargs.values())
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 287, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 379, in validation_step
return self.model.validation_step(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/utils/training.py", line 59, in validation_step
results, batch_size = self.step(batch) # type: ignore
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/utils/training.py", line 329, in step
e, f, s, _ = self(g=g, state_attr=state_attr, l_g=l_g)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/utils/training.py", line 317, in forward
e, f, s, h = self.model(g=g, l_g=l_g, state_attr=state_attr)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/apps/pes.py", line 75, in forward
total_energies = self.data_std * self.model(g=g, state_attr=state_attr, l_g=l_g) + self.data_mean
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/models/_m3gnet.py", line 227, in forward
expanded_dists = self.bond_expansion(g.edata["bond_dist"])
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/layers/_bond.py", line 65, in forward
bond_basis = self.rbf(bond_dist)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/layers/_basis.py", line 104, in __call__
return self._call_sbf(r)
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/matgl/layers/_basis.py", line 120, in _call_sbf
func(r[:, None] * root[None, :] / self.cutoff) * factor / torch.abs(func_add1(root[None, :]))
File "/home/lycui/anaconda3/envs/mgl/lib/python3.9/site-packages/torch/utils/_device.py", line 62, in __torch_function__
return func(*args, **kwargs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

shyuep (Contributor) commented Jun 25, 2023

The call that sets your torch default device needs to be the very first call, not something that happens after all the other model setup has occurred. Alternatively, you can wrap all of your code in a with torch.device("cuda"): block.
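
A minimal sketch of that ordering, assuming PyTorch 2.0 or later (where torch.set_default_device exists and torch.device can be used as a context manager). ToyBasis is a hypothetical stand-in, not matgl code; it only illustrates that a tensor created inside a constructor lands on whatever default device is active at that moment. The sketch falls back to the CPU when no GPU is visible.

```python
import torch

# Assumption: PyTorch >= 2.0, which provides torch.set_default_device
# and lets torch.device(...) act as a context manager.
device = "cuda" if torch.cuda.is_available() else "cpu"


class ToyBasis:
    """Hypothetical stand-in for a layer that keeps a plain tensor attribute."""

    def __init__(self) -> None:
        # Created on whatever the default device is at construction time.
        self.root = torch.linspace(1.0, 3.0, steps=3)

    def __call__(self, r: torch.Tensor) -> torch.Tensor:
        # Raises a device-mismatch error if r and self.root live on different devices.
        return r[:, None] * self.root[None, :]


# Option 1: make this the first torch call, before any model or dataset setup.
torch.set_default_device(device)
basis = ToyBasis()
out = basis(torch.ones(2))
print(out.device)

# Option 2: build everything inside a device context instead.
with torch.device(device):
    basis_ctx = ToyBasis()
    out_ctx = basis_ctx(torch.ones(2))
print(out_ctx.device)
```

In the script from the original post, the equivalent change would be to move the torch.set_default_device('cuda') line above the converter, dataset, and model construction.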

shyuep (Contributor) commented Jun 25, 2023

I should add that such questions belong in the Discussions section, not in a GitHub Issue. Issues are meant for actual bug reports.

shyuep closed this as completed Jun 25, 2023
SmallBearC (Author) commented Jun 25, 2023 via email

chiku-parida commented

Were you able to solve this problem? Please let me know.
I tried declaring CUDA at the beginning, as shown below, but the error still persists:

'''
if torch.cuda.is_available():
    device = 'cuda'
else:
    device = 'cpu'

print(f'The available device is {device}')
'''
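
One thing worth checking about the snippet above: if that is all that runs, it only assigns a Python string and never hands it to torch, so the default device stays unchanged. A minimal sketch of the difference, assuming PyTorch 2.0 or later and a fresh Python process (the variable names before and after are illustrative only):

```python
import torch

# Assumption: PyTorch >= 2.0, where torch.set_default_device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Assigning the string alone changes nothing: new tensors still default to the CPU.
before = torch.zeros(2)
print(before.device)

# The string has to be passed to torch, and before the model is constructed.
torch.set_default_device(device)
after = torch.zeros(2)
print(after.device)
```

On a machine with a GPU, the first tensor reports cpu and the second reports a cuda device; without a GPU both stay on the CPU.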

txy159 commented Nov 27, 2023

Hi @SmallBearC @chiku-parida, were you able to solve this problem? I also get the same error.
