# Move computation of CustomNonbondedForce.setUseLongRangeCorrection() to GPU kernels? (#3229)
This is closely related to #3054. That one relates to NonbondedForce and this one to CustomNonbondedForce. |
This could be challenging to move to the GPU. There are other ways of speeding it up, though. Computing the coefficient for the correction involves the following steps:

1. Sort the particles into classes based on their per-particle parameters.
2. Compile the energy expression.
3. Compute the pairwise integral for each pair of classes.

The first two steps could be precomputed. We would only have to repeat them if you changed a per-particle parameter, not a global one. Computing the integrals could be parallelized very efficiently, which would divide the time by the number of cores in your CPU. Could you provide an example (as a serialized System and State) for a typical system where you are using this method and where it is currently too slow? |
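The parallelization described above (one independent tail integral per pair of atom classes) can be sketched in plain Python. Everything here is illustrative: the r^-6 integrand, the geometric-mean combining rule, and the class parameters are stand-ins for whatever the compiled energy expression actually is, not OpenMM's real correction code.

```python
# Sketch: parallelize the per-class-pair tail integrals across CPU cores.
# The integrand (a generic r^-6 dispersion tail times the r^2 volume
# element) and the per-class epsilon values are illustrative stand-ins.
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations_with_replacement

def tail_integral(args):
    """Composite trapezoid rule for eps_ij * r^-6 * r^2 over [cutoff, rmax]."""
    eps_i, eps_j, cutoff, rmax, n = args
    eps_ij = (eps_i * eps_j) ** 0.5          # illustrative combining rule
    h = (rmax - cutoff) / n
    total = 0.0
    for k in range(n + 1):
        r = cutoff + k * h
        w = 0.5 if k in (0, n) else 1.0      # half weight at the endpoints
        total += w * eps_ij * r ** -6 * r ** 2
    return total * h

def all_pair_integrals(class_eps, cutoff=1.0, rmax=20.0, n=20000):
    """One independent integral per (class_i, class_j) pair, run in parallel."""
    pairs = list(combinations_with_replacement(range(len(class_eps)), 2))
    jobs = [(class_eps[i], class_eps[j], cutoff, rmax, n) for i, j in pairs]
    with ProcessPoolExecutor() as pool:      # one task per class pair
        results = list(pool.map(tail_integral, jobs))
    return dict(zip(pairs, results))

if __name__ == "__main__":
    integrals = all_pair_integrals([0.1, 0.2, 0.4])
    print(len(integrals))                    # 6 class pairs for 3 classes
```

Because each class pair's integral is independent, this scales naturally with core count; the same decomposition is what would map onto GPU threads if the computation were moved into a kernel.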
The particles are already sorted into classes, so step 1 is already done. I haven't actually encountered the slowness myself, as our current implementation of alchemical hybrid systems uses the workaround I described above. I've attached an example below of alanine dipeptide in solvent (not a typical system, but the system I'm currently using for testing). The example contains a system generated using my new implementation (with nonbondeds handled as I described in my initial post). Note that the system currently contains 1 […]. Here is an example. |
None of this is something you would need to do. I'm describing the current behavior of |
Couldn't this be integrated on the GPU by a CUDA/OpenCL kernel just as easily? The kernel could be perfectly parallelized over all pairs of atom classes, would have no thread divergence if you integrate from some fixed min to max at defined step size, and would be totally independent of all other energy kernels. The only major issue would be the double-precision accumulator needed to not lose accuracy, though there may be clever tricks one could play by assuming the large- |
The integration is done with an adaptive method that repeatedly subdivides the grid until the results converge. We're integrating over the range [cutoff, infinity], which requires care to do accurately. |
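One common way to integrate accurately over [cutoff, infinity) is to substitute u = 1/r, which maps the semi-infinite range onto the finite interval (0, 1/cutoff], and then subdivide the grid until successive estimates agree. A minimal sketch of that idea (the r^-6 integrand is illustrative, and this simple grid-doubling loop is not OpenMM's actual adaptive integrator):

```python
# Sketch: integrate f(r) over [cutoff, infinity) by substitution.
# With u = 1/r:  ∫_c^∞ f(r) dr = ∫_0^{1/c} f(1/u) / u^2 du,
# a finite interval we can subdivide until the estimate converges.
# Valid when f decays faster than 1/r^2, so the u=0 endpoint is zero.
def integrate_to_infinity(f, cutoff, tol=1e-9, max_doublings=24):
    umax = 1.0 / cutoff

    def g(u):
        return 0.0 if u == 0.0 else f(1.0 / u) / (u * u)

    n = 16
    prev = None
    while True:
        h = umax / n
        est = h * (0.5 * g(0.0) + 0.5 * g(umax)
                   + sum(g(k * h) for k in range(1, n)))
        if prev is not None and abs(est - prev) < tol:
            return est                       # two grids agree: converged
        prev = est
        n *= 2                               # refine and try again
        if n > 16 << max_doublings:
            raise RuntimeError("integral did not converge")

# Example with a known answer: ∫_rc^∞ r^-6 dr = 1 / (5 * rc^5)
rc = 1.2
val = integrate_to_infinity(lambda r: r ** -6, rc)
```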
@peastman Here's a system and state xml that is a typical use case for me (it's the RBD:ACE2 complex). It's 185,378 atoms. You can change the global parameter |
Thanks! I've done a bit of preliminary analysis on it. Running 100 steps normally takes about 0.33 seconds. If I modify the global parameter every step, it increases to about 15 seconds. So it's currently about 500x slower than we need it to be.

About 90% of the time is spent in an unexpected place: computing the total number of interactions between each pair of atom classes. Normally that would be easy: just multiply the number of atoms in the two classes. But your system uses interaction groups, and in that case it loops over every interaction in every group. I could easily speed that up, or just not worry about it, since it doesn't really need to be repeated every step. Identifying classes and compiling the expression together account for about 1.2 seconds of the execution time. Those also don't need to be repeated every step.

Computing the integrals takes about 1.5 seconds of the 15 seconds total. Computing them in parallel should make that several times faster. And if I then allow that to run on the CPU at the same time the GPU is computing the other forces, that just might get it down to about where we need it to be. So no promises, but it does look plausible. |
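The bookkeeping described here (counting interactions between each pair of atom classes) can be illustrated in a few lines. The class assignments and the count-each-pair-once convention below are simplifications for the sketch, not OpenMM's implementation:

```python
# Sketch: number of interactions between each pair of atom classes.
# Without interaction groups it is just a product of class sizes; with
# groups, the naive approach loops over every allowed pair of atoms.
from collections import Counter
from itertools import combinations_with_replacement

def counts_from_class_sizes(atom_class):
    """O(num_classes^2): multiply class populations."""
    sizes = Counter(atom_class)
    counts = {}
    for a, b in combinations_with_replacement(sorted(sizes), 2):
        if a == b:
            counts[(a, a)] = sizes[a] * (sizes[a] - 1) // 2
        else:
            counts[(a, b)] = sizes[a] * sizes[b]
    return counts

def counts_from_groups(atom_class, groups):
    """O(total pairs in the groups): loop over every interaction in every group."""
    counts = Counter()
    for set1, set2 in groups:
        for i in set1:
            for j in set2:
                if i < j:  # count each pair once (simplified convention)
                    key = tuple(sorted((atom_class[i], atom_class[j])))
                    counts[key] += 1
    return dict(counts)
```

When a single interaction group spans the whole system, the second function visits every pair of atoms, which is why it dominated the profile; the first function gives the same per-class counts in time independent of the number of atoms.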
Great, thanks for the update! |
@peastman Just following up to see if you have another update on this! |
I'm working on it right now. Running a short simulation with your system takes 0.33 seconds. If I change the global parameter at every step, it takes 15 seconds. The first thing I did was to separate out all the computations that don't depend on the global parameter so they don't have to be repeated. With that change, it takes 1.8 seconds, which is just about what I expected. Next I tried parallelizing the integral calculation. That speeds it up to just over 1 second, which is a smaller speedup than I was hoping for. So now I need to analyze it and figure out why it isn't faster and whether I can get a better parallel efficiency. |
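As a rough sanity check, Amdahl's law shows why parallelizing only the integral portion gives a limited speedup. The 1.8 s figure is quoted just above; carrying over the ~1.5 s integral cost from the earlier profiling is an assumption, as is the 8-core count:

```python
# Back-of-the-envelope Amdahl's-law check (illustrative numbers only).
def amdahl_time(total, parallel_part, cores):
    """Expected wall time if only parallel_part of `total` scales with cores."""
    serial = total - parallel_part
    return serial + parallel_part / cores

# After moving the precomputable work out of the loop, 100 steps took
# ~1.8 s; assume ~1.5 s of that is the integrals (earlier profiling).
ideal_8_cores = amdahl_time(1.8, 1.5, 8)   # ~0.49 s if scaling were perfect
```

Under these assumptions perfect scaling would predict roughly 0.49 s, so the observed "just over 1 second" suggests the integral loop is only achieving about a 2x effective speedup, consistent with the comment that parallel efficiency needs further analysis.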
Try #3236 and see how it works for you. |
@peastman : Can we reopen this issue, given my most recent results? Also, just a summary of our chat in December: |
Done. |
#3552 should also help address this. |
@zhang-ivy : #3552 should have sped this up. It's worth checking the performance again to see how things look. |
@zhang-ivy have you had a chance to try this out? |
@peastman : Not yet, but I can aim to have it done by next Monday. |
@peastman @jchodera : I tested out #3552 and #3520 by creating a fresh env with the most recent nightly build (last updated 13 days ago, so it includes both of the aforementioned PRs). The experiments I ran were the same as the ones from this past January . First, I just wanted to recapitulate the results from January, so I used the env from January and re-ran the experiment involving changing the global parameter (units = seconds): Next, using the same GPU, I used the new env with #3552 and #3520 and ran the same experiment involving changing the global parameter (units = seconds): Takeaways:
Note that I am running these experiments with 1 thread on a GeForce GTX 1080Ti. |
@peastman: Would we need to run on multiple (8?) CPU threads to see a significant speedup here? |
Correct. The code is parallel, so if you only allow it one thread that will make it much slower. |
Ok, I can try on multiple threads tomorrow, but I think our standard use case is to request only one thread for these jobs. Depending on how busy the cluster is, it can be very slow to request more than one. |
We can work with the HPC folks if up to 8 threads actually restores performance. The crucial question is whether we are still seeing appreciable slowdowns even when we use 8 threads. If so, we will still need to move this to the GPU. |
Or even get it close. There are more optimizations still to come. This is just to see how close we are. |
Ok, I've repeated my above experiment (with changing lambda every step) using 2 and 8 threads. Here are the results with 2 threads: Takeaways: |
Thanks! For reference, could you describe the system you're simulating? Number of atoms, number of unique atom types. Also, do the above numbers reflect a microbenchmark or are they your actual simulation protocol? We really want to know what performance is like in your real application. |
This is the RBD:ACE2 system.
The numbers above reflect a microbenchmark. We developed these tests to only require OpenMM (not perses or openmmtools), so that we exclude any slowness potentially contributed by |
That isn't necessarily a safe assumption. For example, you aren't actually calling |
The integrator we are using does call |
This is related to #3166 where I am trying to write a class that creates an alchemical hybrid system for running relative free energy calculations. The hybrid system will handle:
In a previous discussion, we decided that to handle nonbondeds, we will use:

- a `CustomNonbondedForce` to handle direct space PME electrostatics interactions,
- a `CustomNonbondedForce` to handle direct space PME steric interactions,
- a `CustomBondForce` to handle direct space PME electrostatics and steric exceptions, and
- a `NonbondedForce` to handle reciprocal space PME electrostatics.

We want to call `setUseLongRangeCorrection(True)` for the sterics `CustomNonbondedForce`, but not the electrostatics, because the electrostatics long range corrections are being handled in the `NonbondedForce`. However, the correction set by `setUseLongRangeCorrection()` is currently computed on CPUs, not GPUs, so the computation is very slow and should not be triggered at every timestep. However, when we run our free energy calculations, we are scaling the parameters at every timestep. In the past, we've addressed this issue by using 2 `CustomNonbondedForce`s instead of one, where one of the forces contains the particles that will be alchemically scaled (with `setUseLongRangeCorrection(False)`) and the other contains the atoms without any scaling (with `setUseLongRangeCorrection(True)`).

To avoid adding an extra `CustomNonbondedForce` (which would make the implementation of this already complicated class more complicated), would it be possible to move the computation of long range corrections to GPU kernels?
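The two-force workaround described above exploits the structure of the correction: it is a sum over class pairs, so pairs that never involve an alchemically scaled particle contribute a constant that need not be recomputed when the global parameter changes. A pure-Python sketch of that split (the dispersion-tail formula, the combining rule, and all parameters are illustrative, not OpenMM's actual correction code):

```python
# Sketch: split the long-range correction into a fixed part (unscaled
# classes only) and a lambda-dependent part (pairs touching scaled
# classes). Generic dispersion-tail form for illustration only.
from itertools import combinations_with_replacement

def pair_term(n_i, n_j, eps_i, eps_j, cutoff):
    eps_ij = (eps_i * eps_j) ** 0.5          # illustrative combining rule
    integral = 1.0 / (3.0 * cutoff ** 3)     # ∫_rc^∞ r^-6 * r^2 dr
    return n_i * n_j * eps_ij * integral

def correction(classes, lam, cutoff=1.0):
    """classes: list of (count, epsilon, is_scaled); lam scales scaled classes."""
    fixed = 0.0
    scaled = 0.0
    for i, j in combinations_with_replacement(range(len(classes)), 2):
        n_i, e_i, s_i = classes[i]
        n_j, e_j, s_j = classes[j]
        e_i_eff = lam * e_i if s_i else e_i
        e_j_eff = lam * e_j if s_j else e_j
        term = pair_term(n_i, n_j, e_i_eff, e_j_eff, cutoff)
        if s_i or s_j:
            scaled += term    # must be recomputed when lam changes
        else:
            fixed += term     # can be computed once and cached
    return fixed + scaled
```

In a typical alchemical setup the scaled region is a small fraction of the system, so the `fixed` sum dominates the work; caching it is essentially what putting the unscaled atoms in their own force with `setUseLongRangeCorrection(True)` achieves.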