
Optimized computing long range correction for CustomNonbondedForce #3236

Merged 1 commit into openmm:master on Sep 20, 2021

Conversation

@peastman (Member) commented Sep 7, 2021

Fixes #3229. This PR includes three different optimizations.

  1. Any part of the computation that doesn't depend on context parameters is precomputed and saved. It only needs to be repeated if you call updateParametersInContext(), not if you change a global parameter.
  2. The computation of integrals is parallelized across all the cores in your CPU.
  3. On the CUDA and OpenCL platforms, the computation is driven by a separate worker thread so it can be done at the same time the GPU is computing the forces.

I can't give a single number for how much faster it is. That depends both on the details of your system and on the relative speed of your CPU and GPU. In the best case where computing the coefficient on the CPU takes less time than computing the forces on the GPU, changing a global parameter will now have negligible cost. In other cases it may still be quite expensive, but a lot less expensive than it was before.
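
To illustrate the distinction, here is a minimal sketch using a made-up two-particle system (this is not code from the PR, and the parameter name lambda_sterics is arbitrary):

import openmm as mm
import openmm.unit as unit

# Two argon-like particles in a periodic box.
system = mm.System()
system.setDefaultPeriodicBoxVectors(mm.Vec3(4, 0, 0), mm.Vec3(0, 4, 0), mm.Vec3(0, 0, 4))
for _ in range(2):
    system.addParticle(39.9)

# A Lennard-Jones interaction scaled by an assumed global parameter.
energy = ("4*epsilon*lambda_sterics*((sigma/r)^12 - (sigma/r)^6);"
          " sigma=0.5*(sigma1+sigma2); epsilon=sqrt(epsilon1*epsilon2)")
force = mm.CustomNonbondedForce(energy)
force.addGlobalParameter("lambda_sterics", 1.0)
force.addPerParticleParameter("sigma")
force.addPerParticleParameter("epsilon")
for _ in range(2):
    force.addParticle([0.34, 0.996])
force.setNonbondedMethod(mm.CustomNonbondedForce.CutoffPeriodic)
force.setCutoffDistance(1.0*unit.nanometer)
force.setUseLongRangeCorrection(True)
system.addForce(force)

integrator = mm.LangevinMiddleIntegrator(300*unit.kelvin, 1/unit.picosecond, 0.002*unit.picoseconds)
context = mm.Context(system, integrator)
context.setPositions([mm.Vec3(0, 0, 0), mm.Vec3(0.5, 0, 0)])

# Cheap path after this PR: only the parameter-dependent part of the correction is
# recomputed, and on CUDA/OpenCL it overlaps with the force computation on the GPU.
context.setParameter("lambda_sterics", 0.5)

# Expensive path: per-particle parameters changed, so the precomputed integrals
# must be rebuilt.
force.setParticleParameters(0, [0.34, 0.5])
force.updateParametersInContext(context)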

@peastman (Member Author)

@zhang-ivy have you had a chance to try this out?

@zhang-ivy (Contributor)

Thanks for the PR and sorry for the delay! I will test this out within the next couple of days.

@zhang-ivy (Contributor)

@peastman I'm trying to test out the changes in this PR, but it doesn't look like the nightlies are up to date: https://anaconda.org/omnia-dev/openmm/files -- it says they were last updated 14 days ago?

@peastman (Member Author)

I'm not sure what's going on with that. But it wouldn't help anyway: this hasn't been merged, so it wouldn't be in the nightly build. Are you able to build from source?

@zhang-ivy (Contributor) commented Sep 17, 2021

Ah right, I forgot that it hadn't been merged. I've spent a couple hours trying to build from source and I'm having trouble. I followed the instructions here, but when I try to import openmm, I get:

>>> import openmm
Segmentation fault

I noticed all of the CUDA tests (and one of the Reference tests) are failing when I run make test:

38/450 Test  #38: TestReferenceRandom ..............................   Passed    0.99 sec
        Start  39: TestReferenceSettle
 39/450 Test  #39: TestReferenceSettle ..............................   Passed    0.04 sec
        Start  40: TestReferenceVariableLangevinIntegrator
 40/450 Test  #40: TestReferenceVariableLangevinIntegrator ..........   Passed   21.91 sec
        Start  41: TestReferenceVariableVerletIntegrator
 41/450 Test  #41: TestReferenceVariableVerletIntegrator ............   Passed    4.07 sec
        Start  42: TestReferenceVerletIntegrator
 42/450 Test  #42: TestReferenceVerletIntegrator ....................***Failed    0.53 sec
        Start  43: TestReferenceVirtualSites
 43/450 Test  #43: TestReferenceVirtualSites ........................   Passed    0.05 sec
        Start  44: TestCudaAndersenThermostatSingle
 44/450 Test  #44: TestCudaAndersenThermostatSingle .................***Failed    0.38 sec
        Start  45: TestCudaAndersenThermostatMixed
 45/450 Test  #45: TestCudaAndersenThermostatMixed ..................***Failed    0.27 sec
        Start  46: TestCudaAndersenThermostatDouble
 46/450 Test  #46: TestCudaAndersenThermostatDouble .................***Failed    0.25 sec

@peastman (Member Author)

Try running a test individually:

./TestCudaAndersenThermostat

Does it give you an error message with more information?

@zhang-ivy (Contributor) commented Sep 17, 2021

Here's the error I get:

(long-range-peter) bash-4.2$ ./TestCudaAndersenThermostat
exception: Error launching CUDA compiler: 256
In file included from /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_runtime.h:83,
                 from <command-line>:
/usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_config.h:129:2: error: #error -- unsupported GNU version! gcc versions later than 7 are not supported!
  129 | #error -- unsupported GNU version! gcc versions later than 7 are not supported!
      |  ^~~~~

My gcc version is:

gcc (GCC) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Originally, I had cudatoolkit 10.2.89 (h8f6ccaa_8, conda-forge) installed in my env and was running module load cuda/10.2. I thought that upgrading my cudatoolkit to 11.2, running module load cuda/11.3 (we don't have 11.2 on our cluster), and re-running make install would fix this, but it doesn't -- I'm getting the same error as above.

@peastman (Member Author)

Apparently nvcc doesn't like the version of gcc you have. That isn't something I've encountered. If you have an older compiler available, you can set the environment variable CUDA_HOST_COMPILER to point to it.

Hopefully this won't impact running actual programs, since it will use the runtime compiler instead of nvcc.

@jchodera (Member) commented Sep 17, 2021 via email

@peastman (Member Author)

I was hoping you could test it first. I don't want to merge it until we know this approach will actually work.

@jchodera (Member)

Since it's a challenge to build working OpenMM installs from scratch, could we enable builds of key branches (pushed to corresponding labels) on https://github.com/omnia-md/conda-dev-recipes ?

@peastman (Member Author)

Could you try running your code against the version you just built? Like I said, there's a good chance the errors affecting the test cases won't affect other programs, since it will use the runtime compiler instead of nvcc.

@zhang-ivy (Contributor)

I don't think I'll be able to test with my current build because I'm seeing this error:

>>> import openmm
Segmentation fault

@peastman merged commit 779400a into openmm:master on Sep 20, 2021
@peastman deleted the longrange branch on September 20, 2021 at 02:21
@peastman (Member Author)

I merged it. Now we just need to figure out why no nightly builds have been uploaded for a couple of weeks.

@jchodera (Member)

Now we just need to figure out why no nightly builds have been uploaded for a couple of weeks.

Sigh. Looks like the last build/upload on Sep 3 was successful, but I can't log into Azure to maintain/check the builds.
This would be a good time for you to get @jaimergp to transfer the omnia build account to you, perhaps, since only OpenMM is being built with this machinery.

@jaimergp (Contributor)

Aren't we all admins in there?

@jchodera (Member) commented Sep 20, 2021 via email

@zhang-ivy (Contributor) commented Sep 20, 2021

Are the nightly builds not uploading because the build recipe hard-codes the version number? We ran into this issue before, here.

@zhang-ivy (Contributor)

Actually, it seems like the hard-coded version number here is correct, so that's not the problem.

@peastman (Member Author)

The status at https://github.com/omnia-md/conda-dev-recipes doesn't show any builds having even been attempted since Sept. 2.

@zhang-ivy (Contributor) commented Sep 20, 2021

The status at https://github.com/omnia-md/conda-dev-recipes doesn't show any builds having even been attempted since Sept. 2.

@jchodera @jaimergp any idea why this is happening?

@jaimergp (Contributor)

Repo inactivity might be a factor. I know GitHub will stop running cron jobs if there have been no recent commits. Azure might be doing the same, and we missed the notification. I triggered a manual run and resynced the cron hooks, so we are back on schedule. This should bring back the nightly builds.

@jchodera, the URL is https://dev.azure.com/OmniaMD/conda-dev-recipes/. You have admin credentials with your choderalab.org account, but might need to change the "directory" to Virginia Tech in your user menu (top right).

@zhang-ivy (Contributor)

@jaimergp Thanks so much for your help! Excited to try out the PR (I'll be able to do that tomorrow).

@zhang-ivy (Contributor) commented Sep 22, 2021

@peastman @jchodera I've tested out this PR on a perses-generated hybrid system for RBD:ACE2. (The same system/state files that I attached here).
Just as a reminder, the hybrid system handles the nonbonded interactions with a NonbondedForce that contains the non-alchemical atoms and a CustomNonbondedForce that contains the alchemical atoms. By default, getUseDispersionCorrection() is True for the NonbondedForce and getUseLongRangeCorrection() is False for the CustomNonbondedForce.

Here are the experiments I ran:

  • Experiment 1: setUseDispersionCorrection(True) for only the NonbondedForce, with no long range correction for the CustomNonbondedForces (i.e., the default). It takes 8.1 seconds to run 100 steps of MD (where I change a global param every step).
  • Experiment 2: setUseLongRangeCorrection(True) for both CustomNonbondedForces, without the changes in this PR. It takes 36.1 seconds to run 100 steps of MD (where I change a global param every step).
  • Experiment 3: setUseLongRangeCorrection(True) for both CustomNonbondedForces, with the changes in this PR. It takes 12.9 seconds to run 100 steps of MD (where I change a global param every step).

It seems that for my system, this PR offers a 5-6 fold speed up for computing the long range correction in the CustomNonbondedForce.

I can't remember what the desired speed-up is; @jchodera, can you let me know if you think this is sufficient?
For now, I'll assume that this is enough and proceed with my rest over the protocol implementation.

Here is the notebook I used to run these experiments: https://gist.github.com/zhang-ivy/db2b94cda201541edd05f5cf333efe76
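
The core of the timing loop is roughly the following (a paraphrase rather than the exact notebook code; the global parameter name and file names are placeholders):

import time
import openmm as mm
import openmm.unit as unit

with open("hybrid_system.xml") as f:
    system = mm.XmlSerializer.deserialize(f.read())
with open("minimized_state.xml") as f:
    state = mm.XmlSerializer.deserialize(f.read())

integrator = mm.LangevinMiddleIntegrator(300*unit.kelvin, 1/unit.picosecond, 0.002*unit.picoseconds)
platform = mm.Platform.getPlatformByName("CUDA")
context = mm.Context(system, integrator, platform)
context.setState(state)

start = time.time()
for step in range(100):
    # Changing a global parameter every step forces the long range correction to be
    # recomputed; comment this line out to measure the baseline cost.
    context.setParameter("lambda_sterics", step / 100)
    integrator.step(1)
context.getState(getEnergy=True)  # make sure pending GPU work finishes before stopping the clock
print(f"100 steps took {time.time() - start:.3f} s")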

@peastman (Member Author)

What kind of hardware are you running it on? When I tested your system, 100 steps of MD while not recomputing the long range correction only took 0.33 seconds, not 8.1 seconds. Is this really the same system?

@zhang-ivy (Contributor) commented Sep 23, 2021

I'm running with CUDA on a GeForce GTX 1080Ti. I linked to my code above -- maybe you could take a look and compare with yours? Or if you share your code I can check myself -- maybe I am doing something wrong there.

Note that in my "Experiment 1" above, even though the long range correction is off, the dispersion correction is on, which may explain why it takes so long?

@zhang-ivy (Contributor) commented Sep 29, 2021

@peastman : Hmm, I'm still seeing that my simulation times are 10x slower than yours, even when I use your code.

I'm running your above code snippet (using one of the nightly builds from last week) on a GeForce RTX 2080 and using the attached rbd_ace2_hybrid_system.xml and rbd_ace2_state_minimized.xml:

  • When I comment out the context.setParameter() call, it takes 3.035 seconds
  • When I do not comment out context.setParameter(), it takes 5.297 seconds

rbd_ace2_example_minimized.zip

Titan Vs can't be 10x faster than 2080s, right? If so, I'm not sure what I'm doing wrong 🤔

@peastman (Member Author)

I have no idea. The GPUs should be fairly similar in speed. The RTX 2080 has only about 2/3 as many compute units, but also a slightly higher clock rate. Nothing that would explain that big a difference.

Try increasing it to 1000 steps. Do the reported times increase by 10x?

@zhang-ivy (Contributor) commented Sep 30, 2021

@peastman : When I increased to 1000 steps, I started getting OpenMMException: Particle coordinate is nan during step 240. This error occurs at step 240 regardless of whether I comment in/out the context.setParameter() line. Additionally, if I change the timestep of the integrator to 2 femtoseconds, the error still happens at step 240, which is very puzzling..

^ I realized I wasn't saving my minimized state properly (was actually saving the pre-minimized state), which is what was causing the nans.
I also realized I didn't have cudatoolkit installed for some reason...

Now, when I use the proper minimized state + cudatoolkit installed, I'm seeing the following simulation times:

  • When I comment out the context.setParameter() call and run 100 steps: it takes 0.389 seconds
  • When I do not comment out context.setParameter() and run 100 steps: it takes 5.467 seconds
  • When I comment out the context.setParameter() call and run 1000 steps: it takes 3.371 seconds
  • When I do not comment out context.setParameter() and run 1000 steps: it takes 53.930 seconds

This makes me think there's a chance my openmm installation doesn't have the changes in your PR. Last week, after Jaime fixed the nightly builds, I installed openmm with mamba create -n long-range-peter -c omnia-dev/label/cuda102 openmm. When I run conda list openmm, I get:

openmm                    7.6              py38_cuda102_1    omnia-dev/label/cuda102

Given that all of the changes in this PR are in the C++ layer, is there an easy way for me to double check that my openmm installation actually has the changes in this PR?

@peastman (Member Author)

Sure, here it is.

minimized_state.zip

@zhang-ivy (Contributor) commented Sep 30, 2021

@peastman : Thanks, but actually I ended up not needing your minimized_state.xml (see explanation in my edited message above).

TLDR from my previous message: I'm now seeing that without context.setParameter(), it takes ~0.3 seconds to run 100 steps, but with context.setParameter(), it's still taking ~ 5 seconds. My first thought was to check whether my openmm installation actually has the changes in this PR -- is there an easy way to do that?

@peastman (Member Author)

In Python you can print out openmm.version.git_revision. That will tell you exactly what version you have.
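
For example:

import openmm
print(openmm.version.version)        # release number, e.g. 7.6
print(openmm.version.git_revision)   # exact commit the build was made from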

@zhang-ivy (Contributor)

Ok, I have 779400a1e3fdc988e84fa7aed760ac95d81c179c installed, which matches the hash for this PR. I'm not sure why the speed ups you implemented in this PR aren't working properly for me.
I'm meeting with @mikehenry tomorrow to see if he can reproduce the behavior I'm seeing and can provide an update afterwards.

@peastman (Member Author)

What kind of CPU are you using? How many cores? Is anything else running at the same time that's using the CPU?

Try monitoring it with top while it's running. It should show high CPU usage on all cores.

What speed do you get with the version that doesn't include the optimizations?

Regardless though, it seems suspiciously slow. Even when I just skipped the unnecessary calculations, before I implemented any parallelization at all, that sped it up to 1.8 seconds. Unless you have an ancient CPU, it shouldn't be 3x slower than mine on single threaded code.

@zhang-ivy (Contributor)

@peastman : I'm running these experiments on our HPC cluster, so it depends on which node I get assigned.

I just ran the experiments again with a GeForce GTX 1080Ti and an Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (24 cores). I'm not sure if anything else is running on the CPU at the same time -- but I'm guessing it's possible, as I'm using mode=shared, so I think a single node can be shared by multiple jobs. Maybe this is the reason for the slowdown?
Btw, I'm using CUDA 10.2 here, not the latest version.

Here are the latest experiments, with the CPU/GPU mentioned above:

  • With optimization (calling context.setParameter()): 3.3659 s
  • With optimization (not calling context.setParameter()): 0.664 s
  • Without optimization (calling context.setParameter()): 21.133 s
  • Without optimization (not calling context.setParameter()): 0.646 s

@ijpulidos ran some of these experiments on his local machine (which has an external GPU and a built-in GPU):

External GPU: GeForce GTX Titan X, CPU: Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
  • With optimization (calling context.setParameter()): 1.59 s
  • With optimization (not calling context.setParameter()): 1.29 s

Built-in GPU: Quadro T500, CPU: Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
  • With optimization (calling context.setParameter()): 6.02 s
  • With optimization (not calling context.setParameter()): 5.80 s

While Iván's experiments show much slower simulation times than what we had previously seen, there isn't a big difference between the times with and without the context.setParameter() call. This further suggests that the slowdown I'm seeing is related to the nodes being shared with other jobs.

@peastman (Member Author) commented Oct 1, 2021

This seems to further suggest that the slowdown I'm seeing is related to the nodes being shared by other jobs.

That would make sense, if it's having to compete with other processes for CPU time. And it's still more than 6x faster than without the optimization, so it clearly is helping.

@jchodera (Member) commented Oct 3, 2021

@zhang-ivy : When running these tests on lilac, it's probably critical to (1) request the appropriate number of thread-slots with

# number of thread slots to request
#BSUB -n 8
# rusage[mem=7/task] : 'mem=X/task' specifies per thread-slot request for GB of memory
# span[hosts=1] : pack all thread-slots onto same physical node
# affinity[core(1):distribute=pack]" : pack onto as few cores as possible
#BSUB -R "rusage[mem=7/task] span[hosts=1] affinity[core(1):distribute=pack]"

and (2) set the environment variable OPENMM_CPU_THREADS to the same number of thread-slots (here, 8). This will ensure you have the full set of thread-slots available, packed onto the smallest number of CPU cores.

(Note that you can use /job or /host instead of /task if you want to specify total memory per job or host rather than per thread; see the LSF docs).
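
For example, something like this near the top of the Python script should work (assuming the variable only needs to be set in the environment before the Context is created):

import os

# Match OPENMM_CPU_THREADS to the number of LSF thread-slots requested above.
os.environ["OPENMM_CPU_THREADS"] = "8"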

@jchodera (Member) commented Oct 3, 2021

That would make sense, if it's having to compete with other processes for CPU time. And it's still more than 6x faster than without the optimization, so it clearly is helping.

We're really hoping for a slowdown of at most 10-20% when updating parameters in the context, compared to not updating the parameters. I still wonder if there's another way to tackle this by parallelizing the numerical integration on the GPU with a simpler numerical integration scheme.

@peastman (Member Author) commented Oct 3, 2021

See my description up at the top. You can't characterize this as a percent slowdown. The CPU and GPU are doing different calculations at the same time. If the CPU finishes first, it's a 0% slowdown. If the GPU finishes first, it could be a large percent slowdown. How long each one takes depends on the speed of your CPU, the speed of your GPU, and the details of the system.

Just comparing the different numbers above for different hardware, I had a slowdown of 70%, while @ijpulidos had slowdowns of 23% and 4% on two different GPUs. And those are all for the same system.

If you're able to update the parameter only every ten steps, all of those would be well below the target range of 10-20%.

@zhang-ivy (Contributor)

I think we need to update the parameter every step, not every 10 steps, right @jchodera?

@jchodera (Member) commented Oct 5, 2021

For replica exchange, we can update only infrequently, but for nonequilibrium switching, we lose a huge amount of efficiency if we can't update every step.

@jchodera (Member) commented Oct 5, 2021

What could work is to only update the anisotropic dispersion correction relatively infrequently (e.g. every 10 steps), but update the CustomNonbondedForce energy every step. Is there an easy way to toggle when the dispersion correction updates happen?

@peastman (Member Author) commented Oct 5, 2021

There might be a very easy way to do that. It depends on your integrator and simulation protocol. The correction only affects the energy, not forces, so we could make it only recalculate the correction when you request the energy. In most simulations, that happens very infrequently. You could keep changing the parameter every step, and the forces would change appropriately, but it wouldn't update the correction until the next time you needed the energy.
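
Concretely, the usage pattern that change would make cheap looks something like this sketch (the parameter name is just an example):

def switch_parameter(context, integrator, nsteps, param="lambda_sterics"):
    """Drive a global parameter from 0 to 1, one small increment per timestep.

    The forces respond to every parameter update, but the long range correction
    would only be recomputed when the energy is requested at the end."""
    for i in range(nsteps):
        context.setParameter(param, (i + 1) / nsteps)
        integrator.step(1)
    return context.getState(getEnergy=True).getPotentialEnergy()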

By the way, the correction is isotropic, not anisotropic.

@peastman (Member Author) commented Oct 6, 2021

I made the change in #3275. It's a very small change, and it might make a significant difference. Or it might not. It depends on the details of your simulation.

@zhang-ivy (Contributor)

@peastman Just spoke with @jchodera -- is there a way to put the long range correction computation in a separate force group? If so, it should be easy for us to use your PR #3275 to compute the long range energy less frequently, which would speed up my nonequilibrium switching simulations significantly.

@peastman (Member Author)

You can't separate out the long range correction from the rest of the CustomNonbondedForce. Is it actually necessary to compute the energy every step?

@jchodera (Member)

The CustomNonbondedForce energy computed from the energy expression consists of the very short-range interactions that we need to compute every timestep. The long-range correction, however, varies very slowly with the timestep and global parameters, and need not be computed frequently. If we could assign the computation of the long-range dispersion correction to a different force group (much like NonbondedForce's reciprocal-space contribution), then we could take advantage of infrequent updates to the long-range correction.
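
The pattern we would like to be able to use is the standard force-group one (sketch only; it presumes the correction could live in its own Force, which is precisely what we can't do today):

def assign_groups(short_range_force, correction_force):
    short_range_force.setForceGroup(0)   # evaluated every timestep
    correction_force.setForceGroup(1)    # slowly varying; evaluate infrequently

def correction_energy(context):
    # Evaluate only force group 1, analogous to querying NonbondedForce's
    # reciprocal-space contribution separately.
    return context.getState(getEnergy=True, groups={1}).getPotentialEnergy()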

@peastman (Member Author)

Could you describe exactly what you're computing the energy for? For example, does it even matter whether it includes the long range correction? Or is that only needed for the barostat?

@jchodera (Member)

Sure! Here's the basic concept for alchemically modifying Lennard-Jones interactions:

We have four atom groups:

  • environment atoms do not have their LJ change
  • core atoms interpolate LJ parameters from old to new values as lambda goes from 0 to 1
  • unique_old atoms have their interactions scaled down to 0 and are "lifted" into the 4th (w) dimension (or otherwise alchemically softened) as lambda goes from 0 to 1
  • unique_new atoms have their interactions scaled up from 0 and are "lowered" from the 4th (w) dimension (or otherwise alchemically softened) as lambda goes from 0 to 1

We may additionally introduce REST scaling, which further scales down the interactions within the non-environment atoms and between environment and non-environment atoms by different scaling factors.

We remove all LJ interactions from NonbondedForce and put them into a CustomNonbondedForce with an energy expression that contains these modified interactions, labeling each atom with per-particle properties that denote which group it appears in.
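
As a heavily simplified sketch of that setup (the energy expression and parameter names below are placeholders, not the actual perses code):

import openmm as mm

expr = (
    "4*epsilon*((sigma/reff)^12 - (sigma/reff)^6);"
    # lift alchemical atoms into a 4th dimension w as lambda_sterics changes
    "reff = sqrt(r^2 + w^2); w = (w_scale1 + w_scale2)*lambda_sterics;"
    # interpolate LJ parameters between old and new values
    "sigma = (1-lambda_sterics)*sigma_old + lambda_sterics*sigma_new;"
    "epsilon = (1-lambda_sterics)*epsilon_old + lambda_sterics*epsilon_new;"
    "sigma_old = 0.5*(sigma_old1 + sigma_old2); sigma_new = 0.5*(sigma_new1 + sigma_new2);"
    "epsilon_old = sqrt(epsilon_old1*epsilon_old2); epsilon_new = sqrt(epsilon_new1*epsilon_new2)"
)
force = mm.CustomNonbondedForce(expr)
force.addGlobalParameter("lambda_sterics", 0.0)
for name in ("sigma_old", "sigma_new", "epsilon_old", "epsilon_new", "w_scale"):
    force.addPerParticleParameter(name)
# environment and core atoms get w_scale = 0; unique atoms get a nonzero w_scale
# (the sign/schedule of the lifting is omitted in this sketch)
force.setNonbondedMethod(mm.CustomNonbondedForce.CutoffPeriodic)
force.setUseLongRangeCorrection(True)   # the slowly varying term under discussion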

The long-range correction is essential for two reasons:

  • In order to ensure that the densities are correct when running NPT simulations with MonteCarloBarostat, since omitting it can cause large pressure/density anomalies (0.9 - 1.3%, depending on the cutoff)
  • The (slowly-varying) contribution to the potential energy also makes an important contribution to the free energy of binding, ~0.1 kcal/mol per heavy atom for a 9A cutoff.

These effects are well-documented in this paper.

In our nonequilibrium simulations, we achieve optimal thermodynamic efficiency (dissipate the least amount of heat, and therefore achieve the lowest error) if we take a geodesic in thermodynamic control space that minimizes thermodynamic length; this achieves the lowest variance estimate. To do this, we need to take small steps in lambda every timestep, accumulating the energy difference at fixed positions every time we update lambda in a symmetric integrator step (BAOAB or OBABO).

We could reformulate these as a multiple timestep method where we update the slowly-varying components of the potential only once in the middle of a complex timestep that includes other substeps, but we would need to split out the slowly-varying components. Since the alchemical region is a ligand tightly nestled in a binding site, we need to update all of the nearby interactions every timestep.

I suppose we could create two different CustomNonbondedForce classes, one with the long-range correction and one without, and only compute the one with the correction infrequently? It's too bad we can't use a much larger cutoff as well, since we could then account for the anisotropic dispersion correction in that paper at the same time!

We'd be happy to brainstorm other efficient ways to account for this if you have some time!

@peastman (Member Author)

I suppose we could create two different CustomNonbondedForce classes, one with the long-range correction and one without, and only compute the one with the correction infrequently?

Let's follow up on that idea. I'm not sure if the following is entirely correct, and I also haven't thought much about whether it would be practical to implement. This is just brainstorming.

The correction doesn't affect the forces, so you don't need it for integration. And since it doesn't depend on the positions, you don't need to accumulate it every timestep. It's only required for two purposes.

  1. When the barostat decides whether to accept a step, it needs to take the correction into account.
  2. You also need to compute the correction at the very start of the simulation and again at the very end, so you can see how much it has changed and include it in your result.

Everything else can be done with a copy of the CustomNonbondedForce that doesn't include the correction. Does that sound right?
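
In code, that might look roughly like the following untested sketch. It copies the force via an XmlSerializer round-trip, assumes the other forces sit in group 0, assumes a recent enough OpenMM for Integrator.setIntegrationForceGroups, and ignores the barostat question for now.

import openmm as mm

def add_correction_free_copy(system, integrator, force_index):
    """Add a correction-free copy of the CustomNonbondedForce and arrange force
    groups so that dynamics never evaluates the copy carrying the correction.
    Call this before creating the Context."""
    full = system.getForce(force_index)      # keeps the long range correction
    full.setForceGroup(1)
    fast = mm.XmlSerializer.deserialize(mm.XmlSerializer.serialize(full))
    fast.setUseLongRangeCorrection(False)
    fast.setForceGroup(2)
    system.addForce(fast)
    # Integrate using everything except the correction-carrying force.
    integrator.setIntegrationForceGroups({0, 2})

def long_range_correction(context):
    """Correction = energy of the full force minus energy of its correction-free copy.
    Evaluate only at the start and end of the switching protocol.
    Note: a plain getPotentialEnergy() over all groups would count the short-range
    part twice, so total energies need to be assembled from the groups."""
    e_full = context.getState(getEnergy=True, groups={1}).getPotentialEnergy()
    e_fast = context.getState(getEnergy=True, groups={2}).getPotentialEnergy()
    return e_full - e_fast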

The (slowly-varying) contribution to the potential energy also makes an important contribution to the free energy of binding, ~0.1 kcal/mol per heavy atom for a 9A cutoff.

How much of that comes from the coefficient changing, and how much is from the box volume changing?


Successfully merging this pull request may close these issues: Move computation of CustomNonbondedForce.setUseLongRangeCorrection() to GPU kernels? (#3229)