Replies: 3 comments 13 replies
-
Hi @grsmirnov,
Can you elaborate on your exact results? PyTorch has been making dramatic changes to its JIT lately, and these have a tendency to break in strange ways due to upstream bugs in their code... In particular, is your problem resolved by using PyTorch 1.11?
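For reference, a quick way to confirm which PyTorch and CUDA builds are actually in the environment is the minimal sketch below; it uses only standard `torch` introspection calls and nothing NequIP-specific.

```python
# Check the installed PyTorch / CUDA versions before switching builds.
# Plain torch calls only; nothing here is specific to NequIP or Allegro.
import torch

print(torch.__version__)           # installed PyTorch version, e.g. "1.11.0"
print(torch.version.cuda)          # CUDA toolkit PyTorch was built against
print(torch.cuda.is_available())   # True if a GPU is visible to PyTorch
```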
-
Side note @grsmirnov, starting a new thread for it: sounds like you have been using [...]
-
Hm... using your attached example (which only contains [...]). Please let me know if you have any more insights into this strange issue, and thanks for your comprehensive posts so far!
-
I encountered some weird behavior during training with Allegro+NequIP.
Briefly: there is a huge slowdown over time, and GPU utilization becomes negligible on some datasets.
I am attaching a small example (a toy problem; the energies and forces are actually from an EAM potential):
The parameters are mostly the defaults and are the same in all three cases; the only difference is the dataset. What is even more confusing: stopping and restarting the run restores the training speed for a short time (you can see the distinctive spike on the GPU utilization graph).
I have done some profiling with viztracer, and it seems that the problem lies in the gradient computations: the forward and backward calls take more and more time. I have also tested different combinations of hyperparameters and of NequIP, PyTorch, and CUDA versions, and the situation is almost the same every time. Do you have any ideas on how to overcome this problem? Is it some kind of bug, or an unavoidable consequence of my dataset?
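For reference, below is a minimal sketch of the kind of per-step timing that can show whether the forward and backward calls really are drifting upward over training. The `model`, `batch`, and `optimizer` names are hypothetical stand-ins, not the actual NequIP/Allegro training-loop objects.

```python
# Hedged sketch: time forward and backward separately each step so a growing
# per-step cost is visible. `model`, `batch`, and `optimizer` are hypothetical
# stand-ins, not the real NequIP/Allegro trainer internals.
import time
import torch

def timed_step(model, batch, optimizer):
    torch.cuda.synchronize()      # make GPU timings meaningful
    t0 = time.perf_counter()

    loss = model(batch)           # forward pass (assumed to return a scalar loss)

    torch.cuda.synchronize()
    t1 = time.perf_counter()

    loss.backward()               # backward pass (gradient/force computation)

    torch.cuda.synchronize()
    t2 = time.perf_counter()

    optimizer.step()
    optimizer.zero_grad()
    return t1 - t0, t2 - t1       # forward time, backward time
```

A handful of such steps can also be wrapped in VizTracer's context manager (`with VizTracer(output_file="trace.json"): ...`) to get the same forward/backward breakdown in the trace viewer.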