Max Recursion Depth 🐛 [BUG] #67
Comments
I found that when training on a set of 10 configurations, the error occurs after 780 epochs (around 18 min). The zip contains the data and the config file used to reproduce the error.
Hi @keano130, thanks for reaching out! This is very strange. We've seen this in our group only once before, and I assumed it was some kind of corruption, but if you've seen it in a different computing environment it's definitely not. @Nicola89, could you post your version information from when you saw this so we can compare to @keano130's? At this point I don't really have any suspicions about the source of this, but I will look into it and let you know if I find anything or have questions. Are you able to successfully restart the training session? Thanks!
Sure! When I encountered this error I was using the following env: python=3.8.11. I can also export and attach the conda env here if that is helpful. I concur with @keano130 about when the bug happens, i.e., deep into training. I am attaching the error file of the test aspirin run where this happened.
Were either of you running under wandb? One possibility is that this is a memory leak leading to an eventual OOM error that just isn't very informative.
I was running wandb; both GPU and system memory consumption were far from the maximum in my case.
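(For anyone else debugging this without wandb: below is a minimal sketch of how one might log memory once per epoch to rule out a slow leak or silent OOM. The helper name and the use of psutil are assumptions for illustration; this is not part of nequip.)

```python
# Hypothetical helper for ruling out a memory leak / silent OOM during training;
# not part of nequip -- call it once per epoch from your own training loop or hook.
import os

import psutil  # assumed installed (`pip install psutil`)
import torch


def log_memory(epoch: int) -> None:
    """Print process RSS and peak CUDA allocation for the current epoch."""
    rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    msg = f"epoch {epoch}: RSS {rss_gib:.2f} GiB"
    if torch.cuda.is_available():
        peak_gib = torch.cuda.max_memory_allocated() / 1024**3
        msg += f", peak CUDA alloc {peak_gib:.2f} GiB"
        torch.cuda.reset_peak_memory_stats()  # so the next epoch's peak is fresh
    print(msg, flush=True)
```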
After the error, it is possible to just restart the training, and it trains normally until around 800 more epochs have passed, at which point it fails again.
Interesting, thanks for the info @keano130. Does it fail in the exact same way, with the exact same stack trace? A workaround appears to be enabling …
It is always the exact same stack trace; the number of epochs before the error occurs is variable. Thank you for the workaround, but since restarting works, I think I'll keep the flexibility of the training model file; for now I stop most runs before 800 epochs anyhow.
Update on my side: restarting gives the same problem after a similar number of epochs (2923 vs. 2964).
OK, I have submitted a PR to e3nn that should implement a workaround to resolve this issue: e3nn/e3nn#297. If this is a problem for you, please try installing my branch from the linked PR.
e3nn has made a new release incorporating the bugfix (https://github.com/e3nn/e3nn/releases/tag/0.3.5), so you can now get around this just by installing e3nn >= 0.3.5.
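(For reference, a typical way to pick up the fixed release in a pip-based environment would be something like the following; conda users would update e3nn through conda instead.)

```
pip install --upgrade "e3nn>=0.3.5"
```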
Describe the bug
When training a nequip neural network on molecules of 304 atoms, training starts normally, but after a while (around 840 epochs) I get the following error.
To Reproduce
I am still trying to create a minimal working example; at the moment this error only happens after training for about 10 hours.
Environment (please complete the following information):
Is this a problem that has occurred before?