
Move computation of CustomNonbondedForce.setUseLongRangeCorrection() to GPU kernels? #3229

Open · zhang-ivy opened this issue Aug 27, 2021 · 30 comments · Fixed by #3236

@zhang-ivy (Contributor) commented Aug 27, 2021

This is related to #3166, where I am trying to write a class that creates an alchemical hybrid system for running relative free energy calculations. The hybrid system will handle:

  1. alchemical modification of atoms
  2. REST scaling
  3. 4th dimension softcore

In a previous discussion, we decided that to handle nonbondeds, we will use:

  1. a CustomNonbondedForce to handle direct space PME electrostatics interactions,
  2. a CustomNonbondedForce to handle direct space PME steric interactions,
  3. a CustomBondForce to handle direct space PME electrostatics and steric exceptions, and
  4. a NonbondedForce to handle reciprocal space PME electrostatics.

We want to call setUseLongRangeCorrection(True) for the sterics CustomNonbondedForce but not for the electrostatics one, because the electrostatics long range corrections are handled by the NonbondedForce. However, setUseLongRangeCorrection() is currently computed on the CPU, not the GPU, so the computation is very slow and should not be triggered at every timestep. When we run our free energy calculations, though, we are scaling the parameters at every timestep. In the past, we've addressed this issue by using 2 CustomNonbondedForces instead of one: one contains the particles that will be alchemically scaled (with setUseLongRangeCorrection(False)) and the other contains the atoms without any scaling (with setUseLongRangeCorrection(True)).

To avoid adding an extra CustomNonbondedForce (which would make the implementation of this already complicated class more complicated), would it be possible to move the computation of long range corrections to GPU kernels?

@peastman (Member):

This is closely related to #3054. That one relates to NonbondedForce and this one to CustomNonbondedForce.

@peastman (Member):

This could be challenging to move to the GPU. There are other ways of speeding it up, though. Computing the coefficient for the correction involves the following steps.

  1. Sort particles into classes based on unique combinations of values for the per-particle parameters. The cost of this scales with the number of particles in the system.
  2. Parse and compile the energy expression. Usually this is fast, but if the expression involves large tabulated functions, fitting splines could make it expensive.
  3. Integrate the energy for every pair of particle classes. The cost scales with the square of the number of classes. We have to do the integration numerically, which makes it slow.
  4. If the force is set to compute parameter derivatives, we have to repeat steps 2 and 3 for each one.

The first two steps could be precomputed. We would only have to repeat them if you changed a per-particle parameter, not a global one. Computing the integrals could be parallelized very efficiently, which would divide the time by the number of cores in your CPU.

Could you provide an example (as a serialized System and State) for a typical system where you are using this method and where it is currently too slow?
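Steps 1 and 3 above can be sketched in plain Python. The real implementation is C++ in CustomNonbondedForceImpl::calcLongRangeCorrection; this toy version uses a fixed trapezoid grid instead of the adaptive scheme, and checks itself against the closed-form Lennard-Jones tail.

```python
import math
from collections import defaultdict

def sort_into_classes(per_particle_params):
    """Step 1: group particle indices by unique per-particle parameter tuples."""
    classes = defaultdict(list)
    for i, params in enumerate(per_particle_params):
        classes[tuple(params)].append(i)
    return classes

def tail_integral(u, cutoff, rmax=50.0, n=200_000):
    """Step 3 for one pair of classes: integrate 4*pi*r^2*u(r) from the
    cutoff out to an effective infinity with a plain trapezoid rule."""
    h = (rmax - cutoff) / n
    total = 0.5 * (cutoff**2 * u(cutoff) + rmax**2 * u(rmax))
    for k in range(1, n):
        r = cutoff + k * h
        total += r * r * u(r)
    return 4.0 * math.pi * h * total

# Sanity check against the analytic Lennard-Jones tail correction.
sigma, eps, rc = 0.3, 1.0, 1.0
u = lambda r: 4 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)
analytic = 16 * math.pi * eps * (sigma**12 / (9 * rc**9) - sigma**6 / (3 * rc**3))
```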

@zhang-ivy (Contributor, Author):

The particles are already sorted into classes, so step 1 is already done.
I'm not sure how to implement 2 and 4, but we can discuss more on Monday. Importantly, we should discuss whether your suggested approach is simpler than our current workaround of just using an additional CustomNonbondedForce to split the alchemical atoms vs. the non-alchemical atoms.

I haven't actually encountered the slowness myself, as our current implementation of alchemical hybrid systems uses the workaround I described above. I've attached an example below of alanine dipeptide in solvent (not a typical system, but the system I'm currently using for testing). The example contains a system generated using my new implementation (with nonbondeds handled as I described in my initial post). Note that the system currently contains 1 CustomNonbondedForce (for both electrostatics and sterics) because I haven't yet refactored it to contain 2 CustomNonbondedForces. But regardless, I would expect the computation of long range corrections to be slow in the CustomNonbondedForce provided in the example.

Here is an example: example_for_peter.zip

@peastman (Member):

None of this is something you would need to do. I'm describing the current behavior of CustomNonbondedForceImpl::calcLongRangeCorrection() and ways I could make it faster. Having some realistic tests will be essential so I can figure out where the time is really being spent. For example, step 1 will clearly be negligible for alanine dipeptide, but for a system with 200,000 atoms it might be a major bottleneck.

@jchodera (Member):

> Computing the integrals could be parallelized very efficiently, which would divide the time by the number of cores in your CPU.

Couldn't this be integrated on the GPU by a CUDA/OpenCL kernel just as easily? The kernel could be perfectly parallelized over all pairs of atom classes, would have no thread divergence if you integrate from some fixed min to max at defined step size, and would be totally independent of all other energy kernels. The only major issue would be the double-precision accumulator needed to not lose accuracy, though there may be clever tricks one could play by assuming the large-r values will be smallest and integrating from large to small r to avoid underflow.

@peastman (Member):

The integration is done with an adaptive method that repeatedly subdivides the grid until the results converge. We're integrating over the range [cutoff, infinity], which requires care to do accurately.
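A toy version of that idea, assuming the substitution r = cutoff/t (which maps [cutoff, infinity) onto (0, 1]) plus grid doubling until two successive midpoint-rule estimates agree; the actual adaptive subdivision in OpenMM differs in its details:

```python
import math

def adaptive_tail_integral(u, cutoff, rel_tol=1e-8):
    """Integrate 4*pi*r^2*u(r) over [cutoff, infinity).

    Substituting r = cutoff/t turns the infinite interval into t in (0, 1];
    the midpoint rule never evaluates the endpoints, so t = 0 is safe for
    any u(r) that decays faster than 1/r^3."""
    def f(t):
        r = cutoff / t
        # r^2 * u(r) * dr = r^2 * u(r) * (cutoff / t^2) * dt
        return r * r * u(r) * cutoff / (t * t)

    n, prev = 8, None
    while True:
        h = 1.0 / n
        est = h * sum(f((k + 0.5) * h) for k in range(n))
        if prev is not None and abs(est - prev) <= rel_tol * max(1.0, abs(est)):
            return 4.0 * math.pi * est
        prev, n = est, 2 * n

# Same Lennard-Jones sanity check: this tail has a closed form.
sigma, eps, rc = 0.3, 1.0, 1.0
u = lambda r: 4 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)
analytic = 16 * math.pi * eps * (sigma**12 / (9 * rc**9) - sigma**6 / (3 * rc**3))
```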

@zhang-ivy (Contributor, Author) commented Aug 30, 2021

@peastman Here's a system and state XML for a typical use case of mine (it's the RBD:ACE2 complex). It has 185,378 atoms.
rbd_ace2_example.zip
Note that the CustomNonbondedForce has the long range correction set to True, which means the long range correction is on for computing sterics of alchemical atoms.

You can change the global parameter lambda_sterics_insert, which controls the sterics of the alchemical new atoms: at lambda = 0 the sterics are off, and at lambda = 1 the sterics are on.

@peastman (Member):

Thanks! I've done a bit of preliminary analysis on it. Running 100 steps normally takes about 0.33 seconds. If I modify the global parameter every step, it increases to about 15 seconds. So it's currently about 500x slower than we need it to be.

About 90% of the time is spent in an unexpected place: computing the total number of interactions between each pair of atom classes. Normally that would be easy. Just multiply the number of atoms in the two classes. But your system uses interaction groups, and in that case it loops over every interaction in every group. I could easily speed that up, or just not worry about it since it doesn't really need to be repeated every step.
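The difference is easy to see in a sketch (hypothetical helper functions, not OpenMM code): without interaction groups the per-class-pair count is just a product of class sizes, while with groups every listed pair must be visited.

```python
from collections import Counter

def pair_counts_simple(class_of):
    """No interaction groups: the count for classes (a, b) is n_a*n_b,
    or n_a*(n_a-1)/2 within a class -- O(#classes^2) work, independent
    of the number of particles."""
    sizes = Counter(class_of)
    keys = sorted(sizes)
    counts = {}
    for i, a in enumerate(keys):
        for b in keys[i:]:
            if a == b:
                counts[(a, b)] = sizes[a] * (sizes[a] - 1) // 2
            else:
                counts[(a, b)] = sizes[a] * sizes[b]
    return counts

def pair_counts_groups(class_of, groups):
    """With interaction groups: loop over every interacting pair in every
    group -- the per-step hot spot described above."""
    counts = Counter()
    for set1, set2 in groups:
        seen = set()
        for i in set1:
            for j in set2:
                if i == j:
                    continue
                pair = (min(i, j), max(i, j))
                if pair in seen:
                    continue
                seen.add(pair)
                counts[tuple(sorted((class_of[i], class_of[j])))] += 1
    return dict(counts)
```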

Identifying classes and compiling the expression together account for about 1.2 seconds of the execution time. Those also don't need to be repeated every step.

Computing the integrals takes about 1.5 seconds of the 15 seconds total. Computing them in parallel should make that several times faster. And if I then allow that to run on the CPU at the same time the GPU is computing the other forces, that just might get it down to about where we need it to be.

So no promises, but it does look plausible.
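The parallelization over integrals can be sketched with a thread pool over class pairs. This shows the structure only: the real code is C++ worker threads, pure Python gains little under the GIL, and `lj_tail` with its parameters is illustrative.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations_with_replacement

def lj_tail(params1, params2, cutoff=1.0, rmax=50.0, n=20_000):
    """Trapezoid integral of 4*pi*r^2*u(r) for one pair of classes,
    combining (sigma, epsilon) with Lorentz-Berthelot rules."""
    sig = 0.5 * (params1[0] + params2[0])
    eps = math.sqrt(params1[1] * params2[1])
    u = lambda r: 4 * eps * ((sig / r) ** 12 - (sig / r) ** 6)
    h = (rmax - cutoff) / n
    total = 0.5 * (cutoff**2 * u(cutoff) + rmax**2 * u(rmax))
    for k in range(1, n):
        r = cutoff + k * h
        total += r * r * u(r)
    return 4.0 * math.pi * h * total

# Each (i, j) class pair is an independent task, so they can be farmed
# out to a pool and collected in order.
classes = [(0.30, 0.50), (0.25, 0.80), (0.35, 0.20)]
pairs = list(combinations_with_replacement(range(len(classes)), 2))
with ThreadPoolExecutor() as pool:
    integrals = dict(zip(pairs, pool.map(
        lambda p: lj_tail(classes[p[0]], classes[p[1]]), pairs)))
```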

@zhang-ivy (Contributor, Author):

Great, thanks for the update!

@zhang-ivy (Contributor, Author):

@peastman Just following up to see if you have another update on this!

@peastman (Member) commented Sep 7, 2021

I'm working on it right now. Running a short simulation with your system takes 0.33 seconds. If I change the global parameter at every step, it takes 15 seconds. The first thing I did was to separate out all the computations that don't depend on the global parameter so they don't have to be repeated. With that change, it takes 1.8 seconds, which is just about what I expected.

Next I tried parallelizing the integral calculation. That speeds it up to just over 1 second, which is a smaller speedup than I was hoping for. So now I need to analyze it and figure out why it isn't faster and whether I can get a better parallel efficiency.

@peastman (Member) commented Sep 7, 2021

Try #3236 and see how it works for you.

@zhang-ivy (Contributor, Author):

@peastman : Can we reopen this issue, given my most recent results?

Also, just a summary of our chat in December:

  • We ruled out two potential solutions: 1) John's analytical LRC integral PR, and 2) running simulations at either endstate to determine the coefficients and then linearly interpolating between them.
  • We decided that moving LRC computation to GPU kernels would be a long term goal.

peastman reopened this Jan 27, 2022
@peastman (Member):

Done.

@jchodera (Member) commented Apr 7, 2022

#3552 should also help address this.

@jchodera (Member):

@zhang-ivy : #3552 should have sped this up. It's worth checking the performance again to see how things look.

@peastman (Member):

#3520 and #3552 should both have helped. More optimization still to come, but it would be good to see where we are right now. Try running a typical, real world example and see how much the performance is affected by updating the correction coefficient.

@peastman (Member) commented May 2, 2022

@zhang-ivy have you had a chance to try this out?

@zhang-ivy (Contributor, Author):

@peastman : Not yet, but I can aim to have it done by next Monday.

@zhang-ivy (Contributor, Author):

@peastman @jchodera : I tested out #3552 and #3520 by creating a fresh env with the most recent nightly build (last updated 13 days ago, so it includes both of the aforementioned PRs). The experiments I ran were the same as the ones from this past January.

First, I just wanted to recapitulate the results from January, so I used the env from January and re-ran the experiment involving changing the global parameter (units = seconds):
[screenshot: Screen Shot 2022-05-08 at 2 23 15 PM]

Next, using the same GPU, I used the new env with #3552 and #3520 and ran the same experiment involving changing the global parameter (units = seconds):
[screenshot: Screen Shot 2022-05-08 at 2 38 43 PM]

Takeaways:

  • The new PRs do offer a speedup with LRC on (especially for getState()), but the LRC-on experiments are still much slower (~10x) than the ones with LRC off.

Note that I am running these experiments with 1 thread on a GeForce GTX 1080Ti.

@jchodera (Member) commented May 8, 2022 via email

@peastman (Member) commented May 8, 2022

Correct. The code is parallel, so if you only allow it one thread that will make it much slower.

@zhang-ivy (Contributor, Author) commented May 9, 2022 via email

@jchodera (Member) commented May 9, 2022 via email

@peastman (Member) commented May 9, 2022

Or even get it close. There are more optimizations still to come. This is just to see how close we are.

@zhang-ivy (Contributor, Author):

Ok, I've repeated my above experiment (with changing lambda every step) using 2 and 8 threads.

Here is 1 thread (again):
[screenshot: Screen Shot 2022-05-09 at 9 22 03 AM]

Here are 2 threads:
[screenshot: Screen Shot 2022-05-09 at 9 22 14 AM]

Here are 8 threads:
[screenshot: Screen Shot 2022-05-09 at 9 25 17 AM]

Takeaways:

  • Using more threads clearly shrinks the fold difference between LRC-on and LRC-off getState() times: with 1 thread it is ~10x, and with 8 threads it is 2-3x.
  • For step() times, more threads yield a smaller speedup (less dramatic than for getState()), but the LRC-on vs. LRC-off difference was smaller for step() to begin with.

@peastman (Member) commented May 9, 2022

Thanks! For reference, could you describe the system you're simulating? Number of atoms, number of unique atom types. Also, do the above numbers reflect a microbenchmark or are they your actual simulation protocol? We really want to know what performance is like in your real application.

@zhang-ivy (Contributor, Author):

This is the RBD:ACE2 system.
Number of atoms: 185391.
Number of unique atom types (for REST system with real-real moved and computed values, aka the last row):

  • in CustomNonbondedForce for electrostatics: 331
  • in CustomNonbondedForce for sterics: 54

The numbers above reflect a microbenchmark. We developed these tests to only require OpenMM (not perses or openmmtools), so that we exclude any slowness potentially contributed by perses or openmmtools. I would assume that the numbers above are a lower bound for the fold change difference in LRC on vs LRC off getState() calls. The fold change is likely going to stay the same or increase when we introduce the other libraries, so I would argue that we should focus on trying to optimize the openmm code for now, and then if the fold change increases upon introducing perses and openmmtools code, we can investigate those problems separately.

@peastman (Member) commented May 9, 2022

> I would assume that the numbers above are a lower bound for the fold change difference in LRC on vs LRC off getState() calls.

That isn't necessarily a safe assumption. For example, you aren't actually calling getState() on every step in your real code, are you? I thought energies were only computed internally within the integrator? And if Perses does introduce extra slowdowns, that will make the relative effect of this calculation smaller, not larger. Optimizing for microbenchmarks rather than real world use can easily lead you astray.

@zhang-ivy (Contributor, Author):

The integrator we are using does call getState() every step, because it has to compute the work, which is the difference in potential energies before and after the step.
