Multiple simulations on the same GPU are interfering with each other! #13
Comments
I'm having trouble running your script, because I can't import both […]

I use PyTorch from the main Anaconda channel: […]
As mentioned in #9 (comment), […]
Right, but it's compiled with the pre-C++11 ABI. When you download LibTorch from https://pytorch.org/get-started/locally/ you're given a choice of two packages, one for each ABI, but the conda packages only provide the old ABI.
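For anyone checking which ABI a given PyTorch build uses, the build records it and exposes it from Python. A minimal sketch; `torch.compiled_with_cxx11_abi()` is a standard PyTorch utility:

```python
import torch

# The plugin must be compiled with a -D_GLIBCXX_USE_CXX11_ABI setting that
# matches the PyTorch/LibTorch build it links against.
print(torch.compiled_with_cxx11_abi())  # False -> pre-C++11 (old) ABI, True -> C++11 ABI
print(torch.__version__)
```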
Good to know. I'll try that version. Thanks!
@peastman Have you managed to reproduce the issue? Any ideas on how to fix it?
I have tried different versions of the dependencies and compilers: […]

@peastman we would appreciate it if you could help solve this. It is important and is blocking our progress.
I'm still struggling to get it to compile. I switched to the conda build of PyTorch, but when I try to compile the plugin I get pages of link errors like: […]
I haven't seen such errors. My first guess: you are missing MKL.
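One quick way to check whether a given PyTorch build was compiled against MKL, as a sketch; `torch.__config__.show()` is a standard PyTorch introspection call:

```python
import torch

# Prints the compile-time configuration; the BLAS/LAPACK lines indicate
# whether this build links MKL.
print(torch.__config__.show())
```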
I have tried to run one MD step under […]. I guess most of them are false positives, but these are a bit suspicious: […]
It's there, but perhaps I need to do something extra to add it to the library path. This is a difference between the standard and conda builds. When I look at the version of LibTorch downloaded from the PyTorch website, it doesn't have that dependency: […]

But the conda build does: […]
I finally managed to get it to compile. Now I get this error when I try to run the script: […]

None of these problems happen when using the official PyTorch builds. I've already spent hours on this with no clear end in sight. I'm going to switch back to the official build, then rework your script to split out creating the […]
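The comment is cut off here, but since `TorchForce` loads a serialized TorchScript module from a file, the natural split is to generate that file in its own script. A minimal sketch, in the spirit of the toy harmonic module from the openmm-torch examples:

```python
import torch

class ForceModule(torch.nn.Module):
    """A toy potential: harmonic attraction of every particle to the origin."""
    def forward(self, positions):
        # positions: an (N, 3) tensor in nanometers; returns energy in kJ/mol
        return torch.sum(positions ** 2)

# Script and save the module once; the simulation script only loads the file.
module = torch.jit.script(ForceModule())
module.save("model.pt")
```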
I can reproduce the results you're seeing, but it doesn't seem to have anything to do with multiple simulations. It just depends on the platform. With OpenCL or CPU I get excellent energy conservation: […]

With CUDA the energy conservation is worse: […]

Furthermore, with CUDA the numbers vary from one run to the next, which shouldn't be the case. I'll keep investigating.
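A self-contained sketch of this kind of platform comparison, assuming the modern `openmm` namespace (`simtk.openmm` on installs from that era), the `openmmtorch` Python bindings, and the hypothetical `model.pt` saved above:

```python
import openmm as mm
import openmm.unit as unit
from openmmtorch import TorchForce

def total_energy_drift(platform_name, steps=1000):
    # Ten free particles governed only by the toy Torch potential.
    system = mm.System()
    for _ in range(10):
        system.addParticle(1.0)
    system.addForce(TorchForce("model.pt"))  # hypothetical file from the sketch above

    integrator = mm.VerletIntegrator(0.001 * unit.picoseconds)
    platform = mm.Platform.getPlatformByName(platform_name)
    context = mm.Context(system, integrator, platform)
    context.setPositions([mm.Vec3(0.1 * i, 0.0, 0.0) for i in range(10)] * unit.nanometers)

    def total(ctx):
        state = ctx.getState(getEnergy=True)
        return state.getPotentialEnergy() + state.getKineticEnergy()

    start = total(context)
    integrator.step(steps)  # a symplectic integrator should conserve total energy closely
    return total(context) - start

for name in ("CUDA", "OpenCL", "CPU"):
    print(name, total_energy_drift(name))
```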
Thanks for your effort, @peastman! Just keep in mind that the problem is caused not only by a second simulation but also by other processes on the GPU (e.g. VMD, Chrome, etc.).
The error seems to be happening when we copy the forces over from PyTorch's buffers to OpenMM's buffers. Usually it works correctly, but once in a while the forces come out as zero. This seems like a race condition, but scattering calls to […]
After changing that, the trajectory is identical to the one from OpenCL: […]

It's slower than before, but that's probably mostly because it's a trivial model. For a real system the overhead would be much less significant.
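The actual fix lives in the plugin's C++/CUDA layer, but the underlying pattern (make sure the kernels producing the forces have completed before anything else consumes the buffer) can be illustrated from the Python side; `torch.cuda.synchronize()` is the real API, and the rest is an illustrative sketch:

```python
import torch

device = torch.device("cuda")
positions = torch.randn(1000, 3, device=device, requires_grad=True)

# Energy and forces are computed by kernels that are launched asynchronously.
energy = torch.sum(positions ** 2)
(forces,) = torch.autograd.grad(energy, positions)
forces = -forces

# Without an explicit synchronization, code running on another CUDA stream
# (as a consumer like OpenMM's copy would) could read the force buffer
# before these kernels have finished, occasionally seeing stale or zeroed data.
torch.cuda.synchronize()
```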
The fix is in #15. Could you try it out and see if it now works for you?
Regarding #15, I think the problem is not a race condition but the multiple-context problem: […]

I guess a more correct fix would be: […]
In addition to the previous comments: […]
Original issue description:

If two or more simulations are running on the same GPU and using `TorchForce`, they produce incorrect results. Other GPU processes (e.g. VMD) also affect `TorchForce`, but the effects are less deterministic. Typically this manifests as random "explosions" of the system.

Versions: master, built using the recipe from "Recipe for a conda package" #9.

The problem can be reproduced with: […]
If only one simulation is running, the total energy is conserved as expected: […]

If two simulations are running, the energy conservation degrades: […]
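The original reproduction script is not preserved in this copy of the thread. In the same spirit, assuming the single-simulation drift sketch from earlier in the thread is saved as `repro.py` (a hypothetical name), the two-simulation case can be driven like this:

```python
import subprocess

# Launch two copies of the single-simulation script on the same GPU and
# let them run concurrently; each prints its own total-energy trace.
procs = [subprocess.Popen(["python", "repro.py"]) for _ in range(2)]
for p in procs:
    p.wait()
```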
For reference, I tried the same setup with `CustomExternalForce`. No problems have been observed!
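For the control experiment, the same toy harmonic potential can be expressed natively; a sketch under the same assumptions as above:

```python
import openmm as mm

system = mm.System()
for i in range(10):
    system.addParticle(1.0)

# The same harmonic well as the Torch module, implemented as a native
# CustomExternalForce; this version showed no energy-conservation problems.
force = mm.CustomExternalForce("x*x + y*y + z*z")
for i in range(10):
    force.addParticle(i, [])
system.addForce(force)
```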