
🐛 [BUG] Fail to restart and append: 'Trainer' object has no attribute 'iepoch' #83

Closed
davidleocadio opened this issue Sep 30, 2021 · 2 comments
Labels
bug Something isn't working

Comments


davidleocadio commented Sep 30, 2021

Describe the bug
After specifying restart: True and append: True in the .yaml, I get the following error:

Torch device: cuda
Successfully loaded the data set of type NpzDataset(3973)...
Successfully built the network...
! Restarting training ...
Traceback (most recent call last):
  File "trainandtest.py", line 176, in <module>
    trainer.train()
  File "/u/vdavi/.local/lib/python3.7/site-packages/nequip/train/trainer.py", line 673, in train
    while not self.stop_cond:
  File "/u/vdavi/.local/lib/python3.7/site-packages/nequip/train/trainer.py", line 765, in stop_cond
    if self.iepoch >= self.max_epochs:
AttributeError: 'Trainer' object has no attribute 'iepoch'

Looking through your code, it seems like the method from_dict (which is part of the Trainer class) isn't called. This method is where the variable iepoch is set. I'm not sure how to fix this myself. What I'm looking at is /nequip/train/trainer.py, btw.

To Reproduce
I don't think this is too relevant here. All I've done is add restart: True and append: True to the .yaml file shown in the Developer's tutorial (the exact lines are below). Please let me know if you think I should include more here and I will edit this part.
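
Concretely, the only lines added on top of the tutorial .yaml were:

restart: True
append: True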

Expected behavior
I want my training session to restart and append to the previous files. In any case, an additional question I have is about your terminology: you have "restart" and "requeuing" options. What I want is to resume a training session after it was terminated due to time constraints, i.e. if it was stopped at epoch 20 but I wanted to run 30 epochs, I want to resume the training. Is this part of restarting or requeuing?

Environment:

I'll edit this part soon, but for now: I've run it on my personal computer and on clusters (different environments for sure) and get the same error, so I think this isn't package dependent but rather related to trainer.py.

  • OS: Linux
  • python version 3.8.10
  • python environment (commands are given for python interpreter):
    • nequip version 0.3.2
    • e3nn version 0.3.3
    • pytorch version (import torch; torch.__version__)
  • (if relevant) GPU support with CUDA
    • cuda Version according to nvcc (nvcc --version)
    • cuda version according to PyTorch (import torch; torch.version.cuda)

Additional context
Please check my comments regarding resuming a training session. Thanks!

davidleocadio added the bug label on Sep 30, 2021
simonbatzner (Collaborator) commented

Hi @davidleocadio, which command did you run for the restart? It will not work if you run nequip-train with a restart config; you'll have to use nequip-restart for a restart. I also see you're running nequip 0.3.2; can you try this with the latest version, 0.3.3?
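
For illustration, a rough sketch of the two invocations being contrasted; the argument shown for nequip-restart is only a placeholder (it may expect the saved session rather than the yaml), so check the nequip-restart usage/help for the exact argument:

nequip-train config.yaml        # fresh training run; not the right entry point for a restart
nequip-restart <existing-run>   # resume a previous run (placeholder argument)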

Regarding restart vs. requeue: requeue is an option that automatically figures out whether it's your first run or a restarted run. If it's your first run, it trains a new model; if it's a restarted run, it restarts an existing one. Requeue is helpful for queueing systems in which you might get interrupted and your job is restarted; in this case you don't need to take care of anything.

I find it quite useful; here's the setup I usually use:

nequip-requeue config.yaml

and then config.yaml has the following lines (you can also find an example under configs/requeue.yaml):

root: example-root
run_name: example-run_name
workdir: example-workdir

requeue: true
append: true    
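
A sketch of how this plays out under a scheduler, using the placeholder names from the config above (requeue's detection of an existing run works as described above):

# first submission: no existing run yet, so a new model is trained
nequip-requeue config.yaml

# the job is killed by the scheduler; resubmit the identical command.
# requeue detects the existing run and resumes it, appending to the previous files (append: true)
nequip-requeue config.yaml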

davidleocadio (Author) commented

Hi @simonbatzner. I was training my networks using the procedure and code outlined in the Developer's Tutorial, and I was under the impression I could keep training the networks that way and simply modify the config.yaml with requeue: True; append: True.

But now I realize it's meant to work by training the networks the way you describe above.

Also, with nequip 0.3.3, it all seems to work now.

Thanks for your help.
