
Move computation of CustomNonbondedForce.setUseLongRangeCorrection() to GPU kernels? #3229

Open · zhang-ivy opened this issue Aug 27, 2021 · 30 comments · Fixed by #3236

@zhang-ivy (Contributor) commented Aug 27, 2021

This is related to #3166, where I am trying to write a class that creates an alchemical hybrid system for running relative free energy calculations. The hybrid system will handle:

  1. alchemical modification of atoms
  2. REST scaling
  3. 4th dimension softcore

In a previous discussion, we decided that to handle nonbondeds, we will use:

  1. a CustomNonbondedForce to handle direct space PME electrostatics interactions,
  2. a CustomNonbondedForce to handle direct space PME steric interactions,
  3. a CustomBondForce to handle direct space PME electrostatics and steric exceptions, and
  4. a NonbondedForce to handle reciprocal space PME electrostatics.

We want to call setUseLongRangeCorrection(True) for the sterics CustomNonbondedForce but not for the electrostatics one, because the electrostatics long range corrections are handled by the NonbondedForce. However, setUseLongRangeCorrection() is currently computed on the CPU, not the GPU, so the computation is very slow and should not be triggered at every timestep. When we run our free energy calculations, though, we are scaling the parameters at every timestep. In the past, we've addressed this issue by using 2 CustomNonbondedForces instead of one: one contains the particles that will be alchemically scaled (with setUseLongRangeCorrection(False)) and the other contains the atoms without any scaling (with setUseLongRangeCorrection(True)).

To avoid adding an extra CustomNonbondedForce (which would make the implementation of this already complicated class more complicated), would it be possible to move the computation of long range corrections to GPU kernels?

@peastman (Member):

This is closely related to #3054. That one relates to NonbondedForce and this one to CustomNonbondedForce.

@peastman (Member):

This could be challenging to move to the GPU. There are other ways of speeding it up, though. Computing the coefficient for the correction involves the following steps.

  1. Sort particles into classes based on unique combinations of values for the per-particle parameters. The cost of this scales with the number of particles in the system.
  2. Parse and compile the energy expression. Usually this is fast, but if the expression involves large tabulated functions, fitting splines could make it expensive.
  3. Integrate the energy for every pair of particle classes. The cost scales with the square of the number of classes. We have to do the integration numerically, which makes it slow.
  4. If the force is set to compute parameter derivatives, we have to repeat steps 2 and 3 for each one.

The first two steps could be precomputed. We would only have to repeat them if you changed a per-particle parameter, not a global one. Computing the integrals could be parallelized very efficiently, which would divide the time by the number of cores in your CPU.

Could you provide an example (as a serialized System and State) for a typical system where you are using this method and where it is currently too slow?
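Steps 1 and 3 above can be sketched in plain Python. The real implementation is C++ in CustomNonbondedForceImpl::calcLongRangeCorrection; this toy version uses a fixed trapezoid grid instead of the adaptive scheme, and checks itself against the closed-form Lennard-Jones tail.

```python
import math
from collections import defaultdict

def sort_into_classes(per_particle_params):
    """Step 1: group particle indices by unique per-particle parameter tuples."""
    classes = defaultdict(list)
    for i, params in enumerate(per_particle_params):
        classes[tuple(params)].append(i)
    return classes

def tail_integral(u, cutoff, rmax=50.0, n=200_000):
    """Step 3 for one pair of classes: integrate 4*pi*r^2*u(r) from the
    cutoff out to an effective infinity with a plain trapezoid rule."""
    h = (rmax - cutoff) / n
    total = 0.5 * (cutoff**2 * u(cutoff) + rmax**2 * u(rmax))
    for k in range(1, n):
        r = cutoff + k * h
        total += r * r * u(r)
    return 4.0 * math.pi * h * total

# Sanity check against the analytic Lennard-Jones tail correction.
sigma, eps, rc = 0.3, 1.0, 1.0
u = lambda r: 4 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)
analytic = 16 * math.pi * eps * (sigma**12 / (9 * rc**9) - sigma**6 / (3 * rc**3))
```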

@zhang-ivy (Contributor, Author):

The particles are already sorted into classes, so step 1 is already done.
I'm not sure how to implement 2 and 4, but we can discuss more on Monday. Importantly, we should discuss whether your suggested approach is simpler than our current workaround of just using an additional CustomNonbondedForce to split the alchemical atoms vs. the non-alchemical atoms.

I haven't actually encountered the slowness myself, as our current implementation of alchemical hybrid systems uses the workaround I described above. I've attached an example below of alanine dipeptide in solvent (not a typical system, but the system I'm currently using for testing). The example contains a system generated using my new implementation (with nonbondeds handled as I described in my initial post). Note that the system currently contains 1 CustomNonbondedForce (for both electrostatics and sterics) because I haven't yet refactored it to contain 2 CustomNonbondedForces. But regardless, I would expect the computation of long range corrections to be slow in the CustomNonbondedForce provided in the example.

Here is an example: example_for_peter.zip

@peastman (Member):

None of this is something you would need to do. I'm describing the current behavior of CustomNonbondedForceImpl::calcLongRangeCorrection() and ways I could make it faster. Having some realistic tests will be essential so I can figure out where the time is really being spent. For example, step 1 will clearly be negligible for alanine dipeptide, but for a system with 200,000 atoms it might be a major bottleneck.

@jchodera (Member):

> Computing the integrals could be parallelized very efficiently, which would divide the time by the number of cores in your CPU.

Couldn't this be integrated on the GPU by a CUDA/OpenCL kernel just as easily? The kernel could be perfectly parallelized over all pairs of atom classes, would have no thread divergence if you integrate from some fixed min to max at defined step size, and would be totally independent of all other energy kernels. The only major issue would be the double-precision accumulator needed to not lose accuracy, though there may be clever tricks one could play by assuming the large-r values will be smallest and integrating from large to small r to avoid underflow.

@peastman (Member):

The integration is done with an adaptive method that repeatedly subdivides the grid until the results converge. We're integrating over the range [cutoff, infinity], which requires care to do accurately.
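A toy version of that idea, assuming the substitution r = cutoff/t (which maps [cutoff, infinity) onto (0, 1]) plus grid doubling until two successive midpoint-rule estimates agree; the actual adaptive subdivision in OpenMM differs in its details:

```python
import math

def adaptive_tail_integral(u, cutoff, rel_tol=1e-8):
    """Integrate 4*pi*r^2*u(r) over [cutoff, infinity).

    Substituting r = cutoff/t turns the infinite interval into t in (0, 1];
    the midpoint rule never evaluates the endpoints, so t = 0 is safe for
    any u(r) that decays faster than 1/r^3."""
    def f(t):
        r = cutoff / t
        # r^2 * u(r) * dr = r^2 * u(r) * (cutoff / t^2) * dt
        return r * r * u(r) * cutoff / (t * t)

    n, prev = 8, None
    while True:
        h = 1.0 / n
        est = h * sum(f((k + 0.5) * h) for k in range(n))
        if prev is not None and abs(est - prev) <= rel_tol * max(1.0, abs(est)):
            return 4.0 * math.pi * est
        prev, n = est, 2 * n

# Same Lennard-Jones sanity check: this tail has a closed form.
sigma, eps, rc = 0.3, 1.0, 1.0
u = lambda r: 4 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)
analytic = 16 * math.pi * eps * (sigma**12 / (9 * rc**9) - sigma**6 / (3 * rc**3))
```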

@zhang-ivy (Contributor, Author) commented Aug 30, 2021

@peastman Here's a system and state XML for a typical use case of mine (it's the RBD:ACE2 complex). It has 185,378 atoms.
rbd_ace2_example.zip
Note that the CustomNonbondedForce has the long range correction set to True, which means the long range correction is on for computing sterics of alchemical atoms.

You can change the global parameter lambda_sterics_insert, which controls the sterics of the alchemical new atoms: at lambda = 0 the sterics are off, and at lambda = 1 the sterics are on.

@peastman (Member):

Thanks! I've done a bit of preliminary analysis on it. Running 100 steps normally takes about 0.33 seconds. If I modify the global parameter every step, it increases to about 15 seconds. So it's currently about 500x slower than we need it to be.

About 90% of the time is spent in an unexpected place: computing the total number of interactions between each pair of atom classes. Normally that would be easy. Just multiply the number of atoms in the two classes. But your system uses interaction groups, and in that case it loops over every interaction in every group. I could easily speed that up, or just not worry about it since it doesn't really need to be repeated every step.
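The difference is easy to see in a sketch (hypothetical helper functions, not OpenMM code): without interaction groups the per-class-pair count is just a product of class sizes, while with groups every listed pair must be visited.

```python
from collections import Counter

def pair_counts_simple(class_of):
    """No interaction groups: the count for classes (a, b) is n_a*n_b,
    or n_a*(n_a-1)/2 within a class -- O(#classes^2) work, independent
    of the number of particles."""
    sizes = Counter(class_of)
    keys = sorted(sizes)
    counts = {}
    for i, a in enumerate(keys):
        for b in keys[i:]:
            if a == b:
                counts[(a, b)] = sizes[a] * (sizes[a] - 1) // 2
            else:
                counts[(a, b)] = sizes[a] * sizes[b]
    return counts

def pair_counts_groups(class_of, groups):
    """With interaction groups: loop over every interacting pair in every
    group -- the per-step hot spot described above."""
    counts = Counter()
    for set1, set2 in groups:
        seen = set()
        for i in set1:
            for j in set2:
                if i == j:
                    continue
                pair = (min(i, j), max(i, j))
                if pair in seen:
                    continue
                seen.add(pair)
                counts[tuple(sorted((class_of[i], class_of[j])))] += 1
    return dict(counts)
```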

Identifying classes and compiling the expression together account for about 1.2 seconds of the execution time. Those also don't need to be repeated every step.

Computing the integrals takes about 1.5 seconds of the 15 seconds total. Computing them in parallel should make that several times faster. And if I then allow that to run on the CPU at the same time the GPU is computing the other forces, that just might get it down to about where we need it to be.

So no promises, but it does look plausible.
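The parallelization over integrals can be sketched with a thread pool over class pairs. This shows the structure only: the real code is C++ worker threads, pure Python gains little under the GIL, and `lj_tail` with its parameters is illustrative.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations_with_replacement

def lj_tail(params1, params2, cutoff=1.0, rmax=50.0, n=20_000):
    """Trapezoid integral of 4*pi*r^2*u(r) for one pair of classes,
    combining (sigma, epsilon) with Lorentz-Berthelot rules."""
    sig = 0.5 * (params1[0] + params2[0])
    eps = math.sqrt(params1[1] * params2[1])
    u = lambda r: 4 * eps * ((sig / r) ** 12 - (sig / r) ** 6)
    h = (rmax - cutoff) / n
    total = 0.5 * (cutoff**2 * u(cutoff) + rmax**2 * u(rmax))
    for k in range(1, n):
        r = cutoff + k * h
        total += r * r * u(r)
    return 4.0 * math.pi * h * total

# Each (i, j) class pair is an independent task, so they can be farmed
# out to a pool and collected in order.
classes = [(0.30, 0.50), (0.25, 0.80), (0.35, 0.20)]
pairs = list(combinations_with_replacement(range(len(classes)), 2))
with ThreadPoolExecutor() as pool:
    integrals = dict(zip(pairs, pool.map(
        lambda p: lj_tail(classes[p[0]], classes[p[1]]), pairs)))
```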

@zhang-ivy (Contributor, Author):

Great, thanks for the update!

@zhang-ivy (Contributor, Author):

@peastman Just following up to see if you have another update on this!

@peastman (Member) commented Sep 7, 2021

I'm working on it right now. Running a short simulation with your system takes 0.33 seconds. If I change the global parameter at every step, it takes 15 seconds. The first thing I did was to separate out all the computations that don't depend on the global parameter so they don't have to be repeated. With that change, it takes 1.8 seconds, which is just about what I expected.

Next I tried parallelizing the integral calculation. That speeds it up to just over 1 second, which is a smaller speedup than I was hoping for. So now I need to analyze it and figure out why it isn't faster and whether I can get a better parallel efficiency.

@peastman (Member) commented Sep 7, 2021

Try #3236 and see how it works for you.

@zhang-ivy (Contributor, Author):

@peastman : Can we reopen this issue, given my most recent results?

Also, just a summary of our chat in December:

  • We ruled out two potential solutions: 1) John's analytical LRC integral PR, and 2) running simulations at either endstate to determine the coefficients and then linearly interpolating between them.
  • We decided that moving LRC computation to GPU kernels would be a long term goal.

peastman reopened this Jan 27, 2022
@peastman (Member):

Done.

@jchodera (Member) commented Apr 7, 2022

#3552 should also help address this.

@jchodera (Member):

@zhang-ivy : #3552 should have sped this up. It's worth checking the performance again to see how things look.

@peastman (Member):

#3520 and #3552 should both have helped. More optimization still to come, but it would be good to see where we are right now. Try running a typical, real world example and see how much the performance is affected by updating the correction coefficient.

@peastman (Member) commented May 2, 2022

@zhang-ivy have you had a chance to try this out?

@zhang-ivy (Contributor, Author):

@peastman : Not yet, but I can aim to have it done by next Monday.

@zhang-ivy (Contributor, Author):

@peastman @jchodera : I tested out #3552 and #3520 by creating a fresh env with the most recent nightly build (last updated 13 days ago, so it includes both of the aforementioned PRs). The experiments I ran were the same as the ones from this past January.

First, I just wanted to recapitulate the results from January, so I used the env from January and re-ran the experiment involving changing the global parameter (units = seconds):
[screenshot: Screen Shot 2022-05-08 at 2 23 15 PM]

Next, using the same GPU, I used the new env with #3552 and #3520 and ran the same experiment involving changing the global parameter (units = seconds):
[screenshot: Screen Shot 2022-05-08 at 2 38 43 PM]

Takeaways:

  • The new PRs do offer a speedup with LRC on (especially for getState()), but the LRC-on experiments are still much slower (~10x) than the ones with LRC off.

Note that I am running these experiments with 1 thread on a GeForce GTX 1080Ti.

@jchodera (Member) commented May 8, 2022 via email

@peastman (Member) commented May 8, 2022

Correct. The code is parallel, so if you only allow it one thread that will make it much slower.

@zhang-ivy (Contributor, Author) commented May 9, 2022 via email

@jchodera (Member) commented May 9, 2022 via email

@peastman (Member) commented May 9, 2022

Or even get it close. There are more optimizations still to come. This is just to see how close we are.

@zhang-ivy (Contributor, Author):

Ok, I've repeated my above experiment (with changing lambda every step) using 2 and 8 threads.

Here is 1 thread (again):
[screenshot: Screen Shot 2022-05-09 at 9 22 03 AM]

Here are 2 threads:
[screenshot: Screen Shot 2022-05-09 at 9 22 14 AM]

Here are 8 threads:
[screenshot: Screen Shot 2022-05-09 at 9 25 17 AM]

Takeaways:

  • Using more threads clearly shrinks the fold difference between LRC-on and LRC-off getState() times: with 1 thread it is ~10x, and with 8 threads it is 2-3x.
  • For step() times, more threads yield a smaller speedup (less dramatic than for getState()), but the LRC-on vs. LRC-off difference was smaller for step() to begin with.

@peastman (Member) commented May 9, 2022

Thanks! For reference, could you describe the system you're simulating? Number of atoms, number of unique atom types. Also, do the above numbers reflect a microbenchmark or are they your actual simulation protocol? We really want to know what performance is like in your real application.

@zhang-ivy (Contributor, Author):

This is the RBD:ACE2 system.
Number of atoms: 185391.
Number of unique atom types (for REST system with real-real moved and computed values, aka the last row):

  • in CustomNonbondedForce for electrostatics: 331
  • in CustomNonbondedForce for sterics: 54

The numbers above reflect a microbenchmark. We developed these tests to only require OpenMM (not perses or openmmtools), so that we exclude any slowness potentially contributed by perses or openmmtools. I would assume that the numbers above are a lower bound for the fold change difference in LRC on vs LRC off getState() calls. The fold change is likely going to stay the same or increase when we introduce the other libraries, so I would argue that we should focus on trying to optimize the openmm code for now, and then if the fold change increases upon introducing perses and openmmtools code, we can investigate those problems separately.

@peastman (Member) commented May 9, 2022

> I would assume that the numbers above are a lower bound for the fold change difference in LRC on vs LRC off getState() calls.

That isn't necessarily a safe assumption. For example, you aren't actually calling getState() on every step in your real code, are you? I thought energies were only computed internally within the integrator? And if Perses does introduce extra slowdowns, that will make the relative effect of this calculation smaller, not larger. Optimizing for microbenchmarks rather than real world use can easily lead you astray.

@zhang-ivy (Contributor, Author):

The integrator we are using does call getState() every step, because it has to compute the work, which is the difference in potential energies before and after the step.
