
Push primary context before invoking PyTorch #78

Merged: 1 commit merged into openmm:master on Jun 15, 2022

Conversation

peastman (Member)

Fixes openmm/openmm#3588.

It seems that the CUDA runtime API, which PyTorch uses, doesn't play well with the stack of CUDA contexts. When you invoke a PyTorch function, it expects the primary context to already be current. If it isn't, then things often still work but sometimes fail with obscure errors. This change ensures the primary context will be current.
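
For illustration, a minimal standalone sketch of the pattern this change introduces (the function and variable names here are hypothetical; the real code lives in CudaCalcTorchForceKernel::execute and works with OpenMM's CudaContext):

    #include <cuda.h>
    #include <torch/script.h>

    // Make the device's primary context current before calling into PyTorch,
    // then restore whatever context was current afterwards.
    void runModuleWithPrimaryContext(CUdevice device, torch::jit::script::Module& module,
                                     torch::Tensor& posTensor) {
        CUcontext primary;
        cuDevicePrimaryCtxRetain(&primary, device);   // pin the device's primary context
        cuCtxPushCurrent(primary);                    // make it current for the runtime API

        // PyTorch, which uses the CUDA runtime API, now sees the context it expects.
        torch::Tensor energy = module.forward({posTensor}).toTensor();
        energy.backward();                            // these calls are the ones reported to fail without the push above

        cuCtxPopCurrent(&primary);                    // restore the previous context
        cuDevicePrimaryCtxRelease(device);
    }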

@Hong-Rui can you test this and verify it works for you?

@dominicrufa I suspect this may also resolve openmm/openmm-ml#32. Can you try it out and see?

Hong-Rui

Thank you for the work, Peter! I'll try to compile openmm-torch from master and test it soon!

@@ -154,5 +157,7 @@ double CudaCalcTorchForceKernel::execute(ContextImpl& context, bool includeForce
if (!outputsForces)
posTensor.grad().zero_();
}
cuCtxPopCurrent(&primary);
cuDevicePrimaryCtxRelease(cu.getDevice());
return energyTensor.item<double>(); // This implicitly synchronizes the PyTorch context
raimis (Contributor)

The context is being used after it was popped

peastman (Member, Author)

As far as I can tell, simple operations on tensors don't cause any problems. The only things that fail if we don't set the context first are forward() and backward().

@raimis raimis added this to Regular in Accelerated NNP in OpenMM via automation Jun 14, 2022
@raimis raimis moved this from Regular to Blockers in Accelerated NNP in OpenMM Jun 14, 2022
@peastman peastman merged commit a74c888 into openmm:master Jun 15, 2022
Accelerated NNP in OpenMM automation moved this from Blockers to Done Jun 15, 2022
@peastman peastman deleted the primary branch June 15, 2022 00:13
raimis (Contributor) commented Jun 15, 2022

@peastman

This is a collaborative project. We have already established that we use the PR review process to ensure the quality of the code. It is unacceptable that this was merged without a proper review!

Yesterday, I just glanced over the code and made a single comment, but note that it wasn't marked as an approval. I had the impression that the PR was still in progress and there would be an opportunity to review it later.

As lead developers, we have a responsibility to demonstrate good development practice and build trust within the community. I feel that this case and #72 are disrespectful.

raimis (Contributor) commented Jun 15, 2022

A few issues with the code:

  • The primary context is used in CudaCalcTorchForceKernel::execute, but in CudaCalcTorchForceKernel::initialize PyTorch might be initializing and allocating tensors on a different context.
  • As mentioned in my comment above, the context is used after being popped. As we saw in Improved coordination of CUDA contexts with PyTorch #47, the context switching and coordination is critical; otherwise memory is corrupted and incorrect/random results are produced.
    cuCtxPopCurrent(&primary);
    cuDevicePrimaryCtxRelease(cu.getDevice());
    return energyTensor.item<double>(); // This implicitly synchronizes the PyTorch context
  • For performance, we don't need cuDevicePrimaryCtxRetain and cuDevicePrimaryCtxRelease at each execution. The primary context should remain retained while PyTorch is used, so they can be called in the CudaCalcTorchForceKernel constructor and destructor, respectively (see the sketch after this list).
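
A rough sketch of that suggestion, using only the CUDA driver API (the class shape and member names are illustrative, not the actual kernel code):

    #include <cuda.h>

    // Illustrative only: retain the primary context once per kernel lifetime,
    // and just push/pop it around each execution.
    class TorchKernelSketch {
    public:
        explicit TorchKernelSketch(CUdevice device) : device(device) {
            cuDevicePrimaryCtxRetain(&primary, device);   // once, in the constructor
        }
        ~TorchKernelSketch() {
            cuDevicePrimaryCtxRelease(device);            // once, in the destructor
        }
        double execute() {
            cuCtxPushCurrent(primary);                    // per call
            double energy = 0.0;  // ... run PyTorch forward()/backward() here ...
            CUcontext popped;
            cuCtxPopCurrent(&popped);                     // per call
            return energy;
        }
    private:
        CUdevice device;
        CUcontext primary;
    };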

raimis added a commit that referenced this pull request Jun 15, 2022
raimis added a commit that referenced this pull request Jun 15, 2022
peastman (Member, Author)

I merged it so that @dominicrufa could test it. He isn't able to build locally, and he asked me to merge it so we could make a dev build he can test. Once he tests it, we can make further changes as necessary.

dominicrufa

It looks like the last build was ~8 days ago here. Do we have a trigger to update the conda-installable package?

peastman (Member, Author)

The primary context is used in CudaCalcTorchForceKernel::execute, but in CudaCalcTorchForceKernel::initialize PyTorch might be initializing and allocating tensors on a different context.

As far as I can tell from my testing, that isn't the case. It always seems to create its tensors on the correct context. Also, we don't create the ContextSelector in initialize() until after we're all done creating PyTorch objects, so the current context at that point will be whatever PyTorch has set up.
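
For what it's worth, one way to probe that claim outside the plugin (a hypothetical standalone check, not code from this PR) is to look at which driver context is current before and after PyTorch allocates its first CUDA tensor:

    #include <cuda.h>
    #include <torch/torch.h>
    #include <cstdio>

    int main() {
        cuInit(0);
        CUcontext before = nullptr, after = nullptr;
        cuCtxGetCurrent(&before);                 // likely NULL before any runtime API call
        // Allocating a CUDA tensor forces PyTorch to initialize its CUDA state.
        torch::Tensor t = torch::zeros({8}, torch::device(torch::kCUDA));
        cuCtxGetCurrent(&after);                  // expected: the device's primary context
        std::printf("context before: %p  after: %p\n", (void*) before, (void*) after);
        return 0;
    }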

As mentioned in my comment above, the context is used after being popped.

This also doesn't seem to be the case, as far as I can tell from testing. We would have to dig deeply into the PyTorch source to really understand what it's doing. Based on everything I've observed, though, forward() and backward() are the only functions where PyTorch doesn't set up its context correctly.

For performance, we don't need cuDevicePrimaryCtxRetain and cuDevicePrimaryCtxRelease at each execution.

Those functions just increment/decrement a single counter. They take negligible time.

peastman (Member, Author)

do we have a trigger to update the conda-installable?

Right now we have to trigger dev builds manually.

raimis (Contributor) commented Jun 15, 2022

@dominicrufa

peastman (Member, Author)

Be sure to specify a74c888 as the revision to build from. In the current head, it's been reverted by #79.

@raimis do you want to create a new PR setting the context the way you think is right?

raimis (Contributor) commented Jun 15, 2022

I merged it so that @dominicrufa could test it. He isn't able to build locally, and he asked me (openmm/openmm-ml#32 (comment)) to merge it so we could make a dev build he can test. Once he tests it, we can make further changes as necessary.

This is close to production-level code, as we intend to release it with OpenMM 8. So debugging on master doesn't seem to be the best approach.
