Replies: 3 comments 13 replies
-
Hi @grsmirnov,
Can you elaborate on your exact results? PyTorch has been making dramatic changes to its JIT lately, and these have a tendency to break in strange ways due to upstream bugs in their code... In particular, is your problem resolved by using PyTorch 1.11?
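For reference, a quick way to confirm which PyTorch and CUDA builds are actually in the environment is the minimal sketch below; it uses only standard `torch` introspection calls and nothing NequIP-specific.

```python
# Check the installed PyTorch / CUDA versions before switching builds.
# Plain torch calls only; nothing here is specific to NequIP or Allegro.
import torch

print(torch.__version__)           # installed PyTorch version, e.g. "1.11.0"
print(torch.version.cuda)          # CUDA toolkit PyTorch was built against
print(torch.cuda.is_available())   # True if a GPU is visible to PyTorch
```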
-
Side note @grsmirnov, starting a new thread for it: sounds like you have been using [...]
-
Hm... using your attached example (which only contains [...]). Please let me know if you have any more insights into this strange issue, and thanks for your comprehensive posts so far!
-
I encountered some weird behavior during training with Allegro+NequIP.
Briefly: there is a huge slowdown over time, and GPU utilization becomes negligible on some datasets.
I am attaching a small example (a toy problem; the energies and forces are actually from an EAM potential):
The parameters are mostly the defaults and are the same in all three cases; the only difference is the dataset. What is even more confusing: stopping and restarting the run restores the training speed for a short time (you can see the distinctive spike on the GPU utilization graph).
I have done some profiling with viztracer, and it seems that the problem lies in the gradient computations: the forward and backward calls take more and more time. I have also tested different combinations of hyperparameters and of NequIP, PyTorch, and CUDA versions, and the situation is almost the same every time. Do you have any ideas on how to overcome this problem? Is it some kind of bug, or an unavoidable consequence of my dataset?
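For reference, below is a minimal sketch of the kind of per-step timing that can show whether the forward and backward calls really are drifting upward over training. The `model`, `batch`, and `optimizer` names are hypothetical stand-ins, not the actual NequIP/Allegro training-loop objects.

```python
# Hedged sketch: time forward and backward separately each step so a growing
# per-step cost is visible. `model`, `batch`, and `optimizer` are hypothetical
# stand-ins, not the real NequIP/Allegro trainer internals.
import time
import torch

def timed_step(model, batch, optimizer):
    torch.cuda.synchronize()      # make GPU timings meaningful
    t0 = time.perf_counter()

    loss = model(batch)           # forward pass (assumed to return a scalar loss)

    torch.cuda.synchronize()
    t1 = time.perf_counter()

    loss.backward()               # backward pass (gradient/force computation)

    torch.cuda.synchronize()
    t2 = time.perf_counter()

    optimizer.step()
    optimizer.zero_grad()
    return t1 - t0, t2 - t1       # forward time, backward time
```

A handful of such steps can also be wrapped in VizTracer's context manager (`with VizTracer(output_file="trace.json"): ...`) to get the same forward/backward breakdown in the trace viewer.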