
Push primary context before invoking PyTorch #78

Merged: 1 commit merged into openmm:master on Jun 15, 2022

Conversation

peastman (Member)

Fixes openmm/openmm#3588.

It seems that the CUDA runtime API, which PyTorch uses, doesn't play well with the stack of CUDA contexts. When you invoke a PyTorch function, it expects the primary context to already be current. If it isn't, then things often still work but sometimes fail with obscure errors. This change ensures the primary context will be current.
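
For illustration, a minimal standalone sketch of the pattern this change introduces (the function and variable names here are hypothetical; the real code lives in CudaCalcTorchForceKernel::execute and works with OpenMM's CudaContext):

    #include <cuda.h>
    #include <torch/script.h>

    // Make the device's primary context current before calling into PyTorch,
    // then restore whatever context was current afterwards.
    void runModuleWithPrimaryContext(CUdevice device, torch::jit::script::Module& module,
                                     torch::Tensor& posTensor) {
        CUcontext primary;
        cuDevicePrimaryCtxRetain(&primary, device);   // pin the device's primary context
        cuCtxPushCurrent(primary);                    // make it current for the runtime API

        // PyTorch, which uses the CUDA runtime API, now sees the context it expects.
        torch::Tensor energy = module.forward({posTensor}).toTensor();
        energy.backward();                            // these calls are the ones reported to fail without the push above

        cuCtxPopCurrent(&primary);                    // restore the previous context
        cuDevicePrimaryCtxRelease(device);
    }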

@Hong-Rui can you test this and verify it works for you?

@dominicrufa I suspect this may also resolve openmm/openmm-ml#32. Can you try it out and see?

Hong-Rui

Thank you for the work, Peter! I'll try to compile openmm-torch from master and test it soon!

@@ -154,5 +157,7 @@ double CudaCalcTorchForceKernel::execute(ContextImpl& context, bool includeForce
if (!outputsForces)
posTensor.grad().zero_();
}
cuCtxPopCurrent(&primary);
cuDevicePrimaryCtxRelease(cu.getDevice());
return energyTensor.item<double>(); // This implicitly synchronizes the PyTorch context
raimis (Contributor)

The context is being used after it was popped

peastman (Member, Author)

As far as I can tell, simple operations on tensors don't cause any problems. The only things that fail if we don't set the context first are forward() and backward().

@raimis raimis added this to Regular in Accelerated NNP in OpenMM via automation Jun 14, 2022
@raimis raimis moved this from Regular to Blockers in Accelerated NNP in OpenMM Jun 14, 2022
@peastman peastman merged commit a74c888 into openmm:master Jun 15, 2022
Accelerated NNP in OpenMM automation moved this from Blockers to Done Jun 15, 2022
@peastman peastman deleted the primary branch June 15, 2022 00:13
raimis (Contributor) commented Jun 15, 2022

@peastman

This is a collaborative project. We have already established that we use the PR review process to ensure the quality of the code. It is unacceptable that this was merged without a proper review!

Yesterday, I just glanced over the code and made a single comment, but note that it wasn't marked as an approval. I had the impression that the PR was still in progress and there would be an opportunity to review it later.

As lead developers, we have a responsibility to demonstrate good development practice and build trust within the community. I feel that this case and #72 are disrespectful.

raimis (Contributor) commented Jun 15, 2022

A few issues with the code:

  • The primary context is used in CudaCalcTorchForceKernel::execute, but in CudaCalcTorchForceKernel::initialize PyTorch might be initializing and allocating tensors on a different context.
  • As mentioned in my comment above, the context is used after being popped. As we saw in Improved coordination of CUDA contexts with PyTorch #47, the context switching and coordination is critical; otherwise memory is corrupted and incorrect/random results are produced.
    cuCtxPopCurrent(&primary);
    cuDevicePrimaryCtxRelease(cu.getDevice());
    return energyTensor.item<double>(); // This implicitly synchronizes the PyTorch context
  • For performance, we don't need cuDevicePrimaryCtxRetain and cuDevicePrimaryCtxRelease at each execution. The primary context should remain retained while PyTorch is used, so they can be called in the CudaCalcTorchForceKernel constructor and destructor, respectively (see the sketch after this list).
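
A rough sketch of that suggestion, using only the CUDA driver API (the class shape and member names are illustrative, not the actual kernel code):

    #include <cuda.h>

    // Illustrative only: retain the primary context once per kernel lifetime,
    // and just push/pop it around each execution.
    class TorchKernelSketch {
    public:
        explicit TorchKernelSketch(CUdevice device) : device(device) {
            cuDevicePrimaryCtxRetain(&primary, device);   // once, in the constructor
        }
        ~TorchKernelSketch() {
            cuDevicePrimaryCtxRelease(device);            // once, in the destructor
        }
        double execute() {
            cuCtxPushCurrent(primary);                    // per call
            double energy = 0.0;  // ... run PyTorch forward()/backward() here ...
            CUcontext popped;
            cuCtxPopCurrent(&popped);                     // per call
            return energy;
        }
    private:
        CUdevice device;
        CUcontext primary;
    };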

raimis added a commit that referenced this pull request Jun 15, 2022
raimis added a commit that referenced this pull request Jun 15, 2022
peastman (Member, Author)

I merged it so that @dominicrufa could test it. He isn't able to build locally, and he asked me to merge it so we could make a dev build he can test. Once he tests it, we can make further changes as necessary.

dominicrufa

It looks like the last build was ~8 days ago here. Do we have a trigger to update the conda-installable package?

peastman (Member, Author)

The primary context is used in CudaCalcTorchForceKernel::execute, but in CudaCalcTorchForceKernel::initialize PyTorch might be initializing and allocating tensors on a different context.

As far as I can tell from my testing, that isn't the case. It always seems to create its tensors on the correct context. Also, we don't create the ContextSelector in initialize() until after we're all done creating PyTorch objects, so the current context at that point will be whatever PyTorch has set up.
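
For what it's worth, one way to probe that claim outside the plugin (a hypothetical standalone check, not code from this PR) is to look at which driver context is current before and after PyTorch allocates its first CUDA tensor:

    #include <cuda.h>
    #include <torch/torch.h>
    #include <cstdio>

    int main() {
        cuInit(0);
        CUcontext before = nullptr, after = nullptr;
        cuCtxGetCurrent(&before);                 // likely NULL before any runtime API call
        // Allocating a CUDA tensor forces PyTorch to initialize its CUDA state.
        torch::Tensor t = torch::zeros({8}, torch::device(torch::kCUDA));
        cuCtxGetCurrent(&after);                  // expected: the device's primary context
        std::printf("context before: %p  after: %p\n", (void*) before, (void*) after);
        return 0;
    }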

As mentioned in my comment above, the context is used after being popped.

This also doesn't seem to be the case, as far as I can tell from testing. We would have to dig deeply into the PyTorch source to really understand what it's doing. Based on everything I've observed, though, forward() and backward() are the only functions where PyTorch doesn't set up its context correctly.

For performance, we don't need cuDevicePrimaryCtxRetain and cuDevicePrimaryCtxRelease at each execution.

Those functions just increment/decrement a single counter. They take negligible time.

peastman (Member, Author)

do we have a trigger to update the conda-installable?

Right now we have to trigger dev builds manually.

raimis (Contributor) commented Jun 15, 2022

@dominicrufa

peastman (Member, Author)

Be sure to specify a74c888 as the revision to build from. In the current head, it's been reverted by #79.

@raimis do you want to create a new PR setting the context the way you think is right?

raimis (Contributor) commented Jun 15, 2022

I merged it so that @dominicrufa could test it. He isn't able to build locally, and he asked me (openmm/openmm-ml#32 (comment)) to merge it so we could make a dev build he can test. Once he tests it, we can make further changes as necessary.

This is close to production-level code, as we intend to release it with OpenMM 8. So debugging on master doesn't seem to be the best approach.
