Max Recursion Depth 🐛 [BUG] #67

Closed
keano130 opened this issue Aug 20, 2021 · 11 comments
Labels: bug (Something isn't working)

Comments

@keano130

Describe the bug
While training a nequip neural network on molecules of 304 atoms, training starts normally, but after a while (around 840 epochs) I get the following error.


RecursionError: maximum recursion depth exceeded while calling a Python object
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/kyukon/scratch/gent/vo/000/gvo00003/vsc43723/my_nequip/my_neq/lib/python3.8/site-packages/torch/fx/graph_module.py", line 505, in wrapped_call
    return cls_call(self, *args, **kwargs)
  File "/kyukon/scratch/gent/vo/000/gvo00003/vsc43723/my_nequip/my_neq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "<eval_with_key_4055>", line 10, in forward
    new_zeros = x.new_zeros(add);  add = None
SystemError: <method 'new_zeros' of 'torch._C._TensorBase' objects> returned a result with an error set
Call using an FX-traced Module, line 10 of the traced Module's generated forward function:
    add = getitem_1 + (32,);  getitem_1 = None
    new_zeros = x.new_zeros(add);  add = None

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    getattr_3 = x.shape
    getitem_2 = getattr_3[slice(None, -1, None)];  getattr_3 = None
RecursionError: maximum recursion depth exceeded while calling a Python object
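
For reference, the failing lines of the generated forward seem equivalent to roughly the following plain PyTorch; this is only a sketch of my reading of the trace, where getitem_1 appears to be x.shape[:-1] and 32 is the appended feature dimension:

```python
import torch

x = torch.randn(10, 16)        # stand-in input; real shapes come from the model

getitem_1 = x.shape[:-1]       # leading dims of x (a torch.Size, i.e. a tuple)
add = getitem_1 + (32,)        # append a feature dimension of 32
new_zeros = x.new_zeros(add)   # allocate zeros with x's dtype and device
```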

To Reproduce
I am still trying to create a minimal working example; at the moment this error only happens after training for about 10 hours.

Environment (please complete the following information):

  • OS: Ubuntu
  • python version 3.8.6
  • python environment (commands are given for python interpreter):
    • nequip version 0.3.3
    • e3nn version 0.3.3
    • pytorch version 1.9.0+cu111
  • (if relevant) GPU support with CUDA
    • CUDA version according to nvcc: 11.1.105
    • CUDA version according to PyTorch: 11.1

Is this a problem that has occurred before?

@keano130 added the bug label on Aug 20, 2021
@keano130
Author

I found that when training on a set of 10 configurations, the error occurs after 780 epochs (around 18 minutes). My GPU is an NVIDIA GeForce GTX 1050 Ti.

The zip contains the data and the config file used to reproduce the error:

Runtime_error_small.zip

@Linux-cpp-lisp
Collaborator

Hi @keano130 ,

Thanks for reaching out!

This is very strange. We've seen this in our group only once before and I assumed it was some kind of corruption, but if you've seen it in a different computing environment it's definitely not.

@Nicola89 could you post your version information from when you saw this so we can compare to @keano130's?

At this point I don't really have any suspicions about the source of this, but I will look into it and let you know if I find anything or have questions.

Are you able to successfully restart the training session using nequip-restart? Does the bug recur after restarting?

Thanks!

@Nicola89

Sure! When I encountered this error I was using the following env:

python=3.8.11
cudatoolkit=11.1.74
pytorch=1.9.0 (cu111)
pytorch-geometric=1.7.2
e3nn=0.3.3
nequip=0.3.3

I can also export the conda env and attach it here if that is helpful. I concur with @keano130 about when the bug happens, i.e., deep into training. I am attaching the error file of the test aspirin run where this happened.
recursion_depth.err.zip

@Linux-cpp-lisp
Collaborator

Were either of you running under wandb, and if you were, could you check the GPU & system memory consumption?

One possibility is that this is a memory leak leading to an eventual OOM error that just isn't very informative, since new_zeros is an alloc... although in that circumstance you wouldn't expect it to consistently fail on this new_zeros...
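
If it's easier than digging through the wandb dashboards, something like this minimal sketch would show whether allocated memory drifts upward across epochs; torch.cuda.memory_allocated and max_memory_allocated are standard PyTorch calls, and the loop placement is only illustrative:

```python
import torch

def log_cuda_memory(epoch: int) -> None:
    # Bytes currently held by tensors, and the peak since the start of the run,
    # on the default CUDA device.
    alloc = torch.cuda.memory_allocated() / 1024**2
    peak = torch.cuda.max_memory_allocated() / 1024**2
    print(f"epoch {epoch}: allocated {alloc:.1f} MiB, peak {peak:.1f} MiB")

# Illustrative placement: a steady upward drift of "allocated" across epochs
# would point to a leak rather than a one-off allocation failure.
# for epoch in range(n_epochs):
#     train_one_epoch(...)   # hypothetical training step
#     log_cuda_memory(epoch)
```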

@keano130
Author

I was running wandb; both GPU and system memory consumption were far from the maximum in my case.

@keano130
Author

After the error, it is possible to simply restart the training; it then trains normally for around 800 more epochs, after which it fails again.

@Linux-cpp-lisp
Collaborator

Interesting, thanks for the info @keano130. Does it fail in exactly the same way, with the exact same stack trace?

A workaround appears to be enabling compile_model: True in your config. This compiles the model down to TorchScript for training. (Please note that if you do this the trained model file is somewhat less flexible / useful, since you can't go poking around in the Python module tree later, although I think the parameters can be loaded from it into a Python model if you really need to.)
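
For context, here is roughly what that means for a generic module, and how the parameters could still be pulled back into a Python model afterwards; this is a plain-PyTorch sketch, not nequip's actual code path:

```python
import torch
import torch.nn as nn

# Stand-in model; in nequip the real model is built from the config.
model = nn.Sequential(nn.Linear(16, 32), nn.SiLU(), nn.Linear(32, 1))

scripted = torch.jit.script(model)      # the TorchScript module used for training
torch.jit.save(scripted, "model.pth")

# The saved TorchScript file is harder to introspect than the Python module tree,
# but its parameters can still be loaded back into an equivalent Python model:
loaded = torch.jit.load("model.pth")
fresh = nn.Sequential(nn.Linear(16, 32), nn.SiLU(), nn.Linear(32, 1))
fresh.load_state_dict(loaded.state_dict())
```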

@keano130
Author

It is always the exact same stack trace; the number of epochs before the error occurs is variable.

Thank you for the workaround, but since restarting works, I think I'll keep the flexibility of the trained model file; for now I stop most runs before 800 epochs anyway.

@Nicola89

Update on my side: restarting gives the same problem after a similar number of epochs (2923 vs 2964).

@Linux-cpp-lisp
Collaborator

OK, I have submitted a PR to e3nn that should implement a workaround to resolve this issue: e3nn/e3nn#297

If this is a problem for you, please try to install my branch from the linked PR.

@Linux-cpp-lisp
Collaborator

e3nn has made a new release incorporating the bugfix: https://github.com/e3nn/e3nn/releases/tag/0.3.5

So you can now get around this just by installing e3nn==0.3.5.
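
After upgrading, a quick sanity check that the fixed version is the one actually being imported (assuming e3nn exposes __version__, which recent releases do):

```python
import e3nn
import torch

print(e3nn.__version__)    # should now be 0.3.5 or later
print(torch.__version__)   # e.g. 1.9.0+cu111 in the environments above
```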
