Push primary context before invoking PyTorch #78
Conversation
Thank you for the work, Peter! I'll try to compile openmm-torch from master and test it soon!
@@ -154,5 +157,7 @@ double CudaCalcTorchForceKernel::execute(ContextImpl& context, bool includeForce
         if (!outputsForces)
             posTensor.grad().zero_();
     }
+    cuCtxPopCurrent(&primary);
+    cuDevicePrimaryCtxRelease(cu.getDevice());
     return energyTensor.item<double>(); // This implicitly synchronizes the PyTorch context
The context is being used after it was popped
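Presumably the fix would be to read the energy while the primary context is still current, and only then pop and release it. A sketch, reusing the names from the diff above (the thread doesn't spell out the fix, so this ordering is an assumption):

double energy = energyTensor.item<double>(); // synchronizes and copies the result while the primary context is still current
cuCtxPopCurrent(&primary);
cuDevicePrimaryCtxRelease(cu.getDevice());
return energy;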
As far as I can tell, simple operations on tensors don't cause any problems. The only things that fail if we don't set the context first are forward() and backward().
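A hedged illustration of that distinction; module and posTensor are placeholder names for this example, not taken from the quoted hunk:

// Simple tensor arithmetic reportedly works with whatever context is current:
torch::Tensor scaled = posTensor * 2;
// forward() and backward() reportedly need the primary context to be current:
CUcontext primary;
cuDevicePrimaryCtxRetain(&primary, cu.getDevice());
cuCtxPushCurrent(primary);
torch::Tensor energyTensor = module.forward({posTensor}).toTensor();
energyTensor.backward();
cuCtxPopCurrent(&primary);
cuDevicePrimaryCtxRelease(cu.getDevice());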
This is a collaborative project. We have already established that we are using the PR review process to ensure the quality of the code. It is unacceptable that this was merged without a proper review! Yesterday I just glanced over the code and made a single comment, but note that it wasn't marked as an approval. I had the impression that the PR was still in progress and there would be an opportunity to review it later. As the lead developers, it is our responsibility to demonstrate good development practice and build trust within the community. I feel that this case and #72 are disrespectful.
A few issues with the code:
cuCtxPopCurrent(&primary);
cuDevicePrimaryCtxRelease(cu.getDevice());
return energyTensor.item<double>(); // This implicitly synchronizes the PyTorch context
I merged it so that @dominicrufa could test it. He isn't able to build locally, and he asked me to merge it so we could make a dev build he can test. Once he tests it, we can make further changes as necessary.
It looks like the last build was ~8 days ago here. Do we have a trigger to update the conda-installable package?
As far as I can tell from my testing, that isn't the case. It always seems to create its tensors on the correct context. Also, we don't create the ContextSelector in
This also doesn't seem to be the case, as far as I can tell from testing. We would have to dig deeply into the PyTorch source to really understand what it's doing. Based on everything I've observed, though,
Those functions just increment/decrement a single counter. They take negligible time.
Right now we have to trigger dev builds manually.
This is close to production-level code, as we intend to release it with OpenMM 8. So, debugging on
Fixes openmm/openmm#3588.
It seems that the CUDA runtime API, which PyTorch uses, doesn't play well with the stack of CUDA contexts. When you invoke a PyTorch function, it expects the primary context to already be current. If it isn't, then things often still work but sometimes fail with obscure errors. This change ensures the primary context will be current.
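For reference, the overall shape of the change appears to be the following sketch, with the body elided. Only the pop/release half is visible in the quoted hunk, so the placement of the matching retain/push at the start of execute() is an assumption based on the PR title:

double CudaCalcTorchForceKernel::execute(ContextImpl& context, bool includeForces, bool includeEnergy) {
    CUcontext primary;
    cuDevicePrimaryCtxRetain(&primary, cu.getDevice()); // take a reference to the device's primary context
    cuCtxPushCurrent(primary);                          // make it current before any PyTorch call
    // ... build input tensors, run forward(), and backward() if forces are requested ...
    cuCtxPopCurrent(&primary);                          // restore whatever context was current before
    cuDevicePrimaryCtxRelease(cu.getDevice());          // drop the reference taken above
    return energyTensor.item<double>();                 // note: the review above points out this reads the tensor after the pop
}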
@Hong-Rui can you test this and verify it works for you?
@dominicrufa I suspect this may also resolve openmm/openmm-ml#32. Can you try it out and see?