Describe the bug
After setting restart: True and append: True in the .yaml file, I get the following error:
Torch device: cuda
Successfully loaded the data set of type NpzDataset(3973)...
Successfully built the network...
! Restarting training ...
Traceback (most recent call last):
File "trainandtest.py", line 176, in <module>
trainer.train()
File "/u/vdavi/.local/lib/python3.7/site-packages/nequip/train/trainer.py", line 673, in train
while not self.stop_cond:
File "/u/vdavi/.local/lib/python3.7/site-packages/nequip/train/trainer.py", line 765, in stop_cond
if self.iepoch >= self.max_epochs:
AttributeError: 'Trainer' object has no attribute 'iepoch'
Looking through your code, it seems that the method from_dict (part of the Trainer class) isn't called. That method appears to be where the variable iepoch is set. I'm not sure how to fix this myself. For reference, the file I'm looking at is /nequip/train/trainer.py.
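A toy illustration of the suspected failure mode, for readers who want to see it in isolation. This is only a sketch under stated assumptions: `ToyTrainer`, `check_stop`, and the state keys are made up for illustration and are not nequip's code. The assumption is that `iepoch` is assigned only on the `from_dict` path, so an object built any other way fails as soon as `stop_cond` is read.

```python
class ToyTrainer:
    """Hypothetical stand-in for a trainer; NOT nequip's Trainer class."""

    def __init__(self, max_epochs):
        self.max_epochs = max_epochs
        # Assumption for this sketch: iepoch is deliberately not set here.

    @classmethod
    def from_dict(cls, state):
        # In this sketch, restoring from saved state is the only path
        # that assigns iepoch.
        trainer = cls(max_epochs=state["max_epochs"])
        trainer.iepoch = state["iepoch"]
        return trainer

    @property
    def stop_cond(self):
        # Raises AttributeError if iepoch was never assigned.
        return self.iepoch >= self.max_epochs


def check_stop(trainer):
    """Return stop_cond, or the error message if iepoch is missing."""
    try:
        return trainer.stop_cond
    except AttributeError as err:
        return str(err)


if __name__ == "__main__":
    # Built without from_dict: reading stop_cond fails.
    print(check_stop(ToyTrainer(max_epochs=30)))
    # Built via from_dict: stop_cond can be evaluated.
    print(check_stop(ToyTrainer.from_dict({"max_epochs": 30, "iepoch": 20})))
```

If the real Trainer behaves like this toy class, the error would point to the restart path never restoring the saved state, rather than to a problem in the environment.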
To Reproduce
I don't think this is too relevant here. All I've done is add restart: True and append: True to the .yaml file shown in the Developer's Tutorial. Please let me know if you think I should include it here and I will edit this part.
Expected behavior
I want my training session to restart and append to the previous files. I also have a question about your terminology. You have "restart" and "requeuing" options. What I want is to resume a training session after it was terminated due to time constraints. That is, if training was stopped at epoch 20 but I wanted to run 30 epochs, I want to resume the training from where it stopped. Is this part of restarting or requeuing?
Environment:
I'll edit this part soon. For now, I can say that I've run it on my personal computer and on the clusters (definitely different environments) and get the same error, so I think this isn't package dependent but related to trainer.py.
OS: Linux
python version 3.8.10
python environment (commands are given for python interpreter):
nequip version 0.3.2
e3nn version 0.3.3
pytorch version (import torch; torch.__version__)
(if relevant) GPU support with CUDA
cuda Version according to nvcc (nvcc --version)
cuda version according to PyTorch (import torch; torch.version.cuda)
Additional context
Please check my comments regarding resuming a training session. Thanks!
Hi @davidleocadio, which command did you run for the restart? It will not work if you run nequip-train with a restart config; you'll have to use nequip-restart for a restart. I also see you're running nequip 0.3.2. Can you try this with the latest version, 0.3.3?
Regarding restart vs. requeue: requeue is an option that automatically figures out whether this is your first run or a restarted run. If it's your first run, it trains a new model; if it's a restarted run, it restarts the existing one. Requeue is helpful for queueing systems in which your job might get interrupted and restarted; in that case you don't need to take care of anything.
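The decision described above (fresh run the first time, restart otherwise) can be sketched roughly as follows. This is a minimal illustration, not nequip's implementation: the function `choose_mode` and the checkpoint file name `trainer.pth` are assumptions made for the example.

```python
import os
import tempfile


def choose_mode(workdir, checkpoint_name="trainer.pth"):
    """Return "restart" if a saved checkpoint exists in workdir, else "fresh".

    The checkpoint file name is a hypothetical default for this sketch.
    """
    checkpoint = os.path.join(workdir, checkpoint_name)
    return "restart" if os.path.isfile(checkpoint) else "fresh"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # First submission: nothing saved yet, so train a new model.
        print(choose_mode(workdir))
        # Simulate an interrupted job that left a checkpoint behind.
        open(os.path.join(workdir, "trainer.pth"), "w").close()
        # Resubmission: the checkpoint is found, so resume from it.
        print(choose_mode(workdir))
```

The point of this design is that the same command can be resubmitted to the queue unchanged, since the run itself decides whether there is anything to resume.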
I find requeue quite useful. Here's the command I usually use:
nequip-requeue config.yaml
and then config.yaml has the following lines (you can also find an example under configs/requeue.yaml):
Hi @simonbatzner. I was training my networks using the procedure and code outlined in the Developer's Tutorial, and I was under the impression that I could keep training the networks that way and simply modify config.yaml with requeue: True and append: True.
But now I realize it's meant to work by running the commands you describe above.
Also, with nequip 0.3.3, it all seems to work now.