
🐛 [BUG] Fail to restart and append: 'Trainer' object has no attribute 'iepoch' #83

Closed
davidleocadio opened this issue Sep 30, 2021 · 2 comments
Labels
bug Something isn't working

Comments


davidleocadio commented Sep 30, 2021

Describe the bug
After specifying restart: True and append: True in the .yaml, I get the following error:

Torch device: cuda
Successfully loaded the data set of type NpzDataset(3973)...
Successfully built the network...
! Restarting training ...
Traceback (most recent call last):
  File "trainandtest.py", line 176, in <module>
    trainer.train()
  File "/u/vdavi/.local/lib/python3.7/site-packages/nequip/train/trainer.py", line 673, in train
    while not self.stop_cond:
  File "/u/vdavi/.local/lib/python3.7/site-packages/nequip/train/trainer.py", line 765, in stop_cond
    if self.iepoch >= self.max_epochs:
AttributeError: 'Trainer' object has no attribute 'iepoch'

Looking through your code, it seems like the method from_dict (which is part of the Trainer class) isn't called. This method is where the variable iepoch is set. I'm not sure how to fix this myself. What I'm looking at is /nequip/train/trainer.py, btw.

To Reproduce
I don't think this is too relevant here. All I've done is add restart: True and append: True to the .yaml file shown in the Developer's tutorial (the exact lines are below). Please let me know if you think I should include more here and I will edit this part.
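
Concretely, the only lines added on top of the tutorial .yaml were:

restart: True
append: True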

Expected behavior
I want my training session to restart and append to the previous files. In any case, an additional question I have is about your terminology: you have "restart" and "requeuing" options. What I want is to resume a training session after it was terminated due to time constraints, i.e. if it was stopped at epoch 20 but I wanted to run 30 epochs, I want to resume the training. Is this part of restarting or requeuing?

Environment:

I'll edit this part soon, but for now: I've run it on my personal computer and on clusters (different environments for sure) and get the same error, so I think this isn't package dependent but rather related to trainer.py.

  • OS: Linux
  • python version 3.8.10
  • python environment (commands are given for python interpreter):
    • nequip version 0.3.2
    • e3nn version 0.3.3
    • pytorch version (import torch; torch.__version__)
  • (if relevant) GPU support with CUDA
    • cuda Version according to nvcc (nvcc --version)
    • cuda version according to PyTorch (import torch; torch.version.cuda)

Additional context
Please check my comments regarding resuming a training session. Thanks!

davidleocadio added the bug label on Sep 30, 2021
simonbatzner (Collaborator) commented

Hi @davidleocadio, which command did you run for the restart? It will not work if you run nequip-train with a restart config; you'll have to use nequip-restart for a restart. I also see you're running nequip 0.3.2; can you try this with the latest version, 0.3.3?
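
For illustration, a rough sketch of the two invocations being contrasted; the argument shown for nequip-restart is only a placeholder (it may expect the saved session rather than the yaml), so check the nequip-restart usage/help for the exact argument:

nequip-train config.yaml        # fresh training run; not the right entry point for a restart
nequip-restart <existing-run>   # resume a previous run (placeholder argument)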

Regarding restart vs. requeue: requeue is an option that automatically figures out whether it's your first run or a restarted run. If it's your first run, it trains a new model; if it's a restarted run, it restarts an existing one. Requeue is helpful for queueing systems in which you might get interrupted and your job is restarted; in this case you don't need to take care of anything.

I find it quite useful; here's the setup I usually use:

nequip-requeue config.yaml

and then config.yaml has the following lines (you can also find an example under configs/requeue.yaml):

root: example-root
run_name: example-run_name
workdir: example-workdir

requeue: true
append: true    
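
A sketch of how this plays out under a scheduler, using the placeholder names from the config above (requeue's detection of an existing run works as described above):

# first submission: no existing run yet, so a new model is trained
nequip-requeue config.yaml

# the job is killed by the scheduler; resubmit the identical command.
# requeue detects the existing run and resumes it, appending to the previous files (append: true)
nequip-requeue config.yaml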

davidleocadio (Author) commented

Hi @simonbatzner. I was training my networks using the procedure and code outlined in the Developer's Tutorial, and I was under the impression I could keep training the networks that way and simply modify the config.yaml with requeue: True; append: True.

But now I realize it's meant to work by training the networks the way you describe above.

Also, with nequip 0.3.3, it all seems to work now.

Thanks for your help.
