Optimized computing long range correction for CustomNonbondedForce #3236
Conversation
@zhang-ivy have you had a chance to try this out?
Thanks for the PR and sorry for the delay! I will test this out within the next couple of days.
@peastman I'm trying to test out the changes in this PR, but it doesn't look like the nightlies are up to date: https://anaconda.org/omnia-dev/openmm/files -- it says it was last updated 14 days ago?
I'm not sure what's going on with that. But it wouldn't help anyway: this hasn't been merged, so it wouldn't be in the nightly build. Are you able to build from source?
Ah right, I forgot that it hadn't been merged. I've spent a couple of hours trying to build from source and I'm having trouble. I followed the instructions here, but when I try to import openmm, I get:
I noticed all of the CUDA tests (and one of the Reference tests) are failing when I run
Try running a test individually:
Does it give you an error message with more information?
Here's the error I get:
My gcc version is:
Originally, I had
Apparently nvcc doesn't like the version of gcc you have. That isn't something I've encountered. If you have an older compiler around that you can use, you can set the environment variable
Hopefully this won't impact running actual programs, since it will use the runtime compiler instead of nvcc.
If we merge this, Ivy could just test the nightly build!
I was hoping you could test it first. I don't want to merge it until we know this approach will actually work.
Since it's a challenge to build working OpenMM installs from scratch, could we enable builds of key branches (pushed to corresponding labels) on https://github.com/omnia-md/conda-dev-recipes?
Could you try running your code against the version you just built? Like I said, there's a good chance the errors affecting the test cases won't affect other programs, since it will use the runtime compiler instead of nvcc.
I don't think I'll be able to test with my current build because I'm seeing this error:
I merged it. Now we just need to figure out why no nightly builds have been uploaded for a couple of weeks.
Sigh. Looks like the last build/upload on Sep 3 was successful, but I can't log into Azure to maintain/check the builds.
Aren't we all admins in there?
For some reason, none of my accounts allow me access to Azure.
Are the nightly builds not uploading because the build recipe hard-codes the version number? We ran into this issue before here
Actually, it seems like the hard-coded version number here is correct, so that's not the problem.
The status at https://github.com/omnia-md/conda-dev-recipes doesn't show any builds having even been attempted since Sept. 2.
Repo inactivity might be a factor. I know GH will stop running cron jobs if no recent commits have been applied. Azure might be doing the same and we missed the notification. I triggered a manual run and resynced the cron hooks, so we are back on schedule. This should bring back the nightly builds. @jchodera, the URL is https://dev.azure.com/OmniaMD/conda-dev-recipes/. You have admin credentials with your choderalab.org account, but might need to change the "directory" to Virginia Tech in your user menu (top right).
@jaimergp Thanks so much for your help! Excited to try out the PR (I'll be able to do that tomorrow).
@peastman @jchodera I've tested out this PR on a perses-generated hybrid system for RBD:ACE2 (the same system/state files that I attached here). Here are the experiments I ran:
It seems that for my system, this PR offers a 5-6x speedup for computing the long range correction in the CustomNonbondedForce. I can't remember what the desired speedup is; @jchodera, can you let me know if you think this is sufficient? Here is the notebook I used to run these experiments: https://gist.github.com/zhang-ivy/db2b94cda201541edd05f5cf333efe76
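[For later readers: a minimal timing sketch along the lines of what that notebook measures could look like the following. This is an assumption-laden reconstruction, not the actual notebook; the file names and the global parameter name "lambda_sterics" are placeholders.]

```python
import time
import openmm

# Load a serialized system and minimized state (placeholder file names).
with open("system.xml") as f:
    system = openmm.XmlSerializer.deserialize(f.read())
with open("state.xml") as f:
    state = openmm.XmlSerializer.deserialize(f.read())

integrator = openmm.VerletIntegrator(0.002)  # 2 fs step, in OpenMM's default ps units
context = openmm.Context(system, integrator)
context.setState(state)

def time_steps(n_steps, update_parameter):
    """Time n_steps of MD, optionally changing a global parameter every step."""
    context.getState(getEnergy=True)  # flush pending work before starting the clock
    start = time.time()
    for i in range(n_steps):
        if update_parameter:
            context.setParameter("lambda_sterics", i / n_steps)  # placeholder name
        integrator.step(1)
    context.getState(getEnergy=True)  # block until the GPU finishes
    return time.time() - start

print("no parameter updates: %.2f s" % time_steps(100, False))
print("updating every step:  %.2f s" % time_steps(100, True))
```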
What kind of hardware are you running it on? When I tested your system, 100 steps of MD while not recomputing the long range correction only took 0.33 seconds, not 8.1 seconds. Is this really the same system?
I'm running with CUDA on a
Note that in my "Experiment 1" above, even though the long range correction is off, the dispersion correction is on, which may explain why it takes so long?
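[To make the distinction concrete, these are the two independent switches being discussed; a sketch assuming `system` is the OpenMM System holding both forces:]

```python
import openmm

for force in system.getForces():
    if isinstance(force, openmm.NonbondedForce):
        # the analytic dispersion correction on the standard force
        force.setUseDispersionCorrection(False)
    elif isinstance(force, openmm.CustomNonbondedForce):
        # the numerically integrated long range correction this PR optimizes
        force.setUseLongRangeCorrection(False)
```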
@peastman: Hmm, I'm still seeing that my simulation times are 10x slower than yours, even when I use your code. I'm running your above code snippet (using one of the nightly builds from last week) on a
rbd_ace2_example_minimized.zip
Titan Vs can't be 10x faster than 2080s, right? If so, I'm not sure what I'm doing wrong 🤔
I have no idea. The GPUs should be fairly similar in speed. The RTX 2080 has only about 2/3 as many compute units, but also a slightly higher clock rate. Nothing that would explain that big a difference. Try increasing it to 1000 steps. Do the reported times increase by 10x?
^ I realized I wasn't saving my minimized state properly (I was actually saving the pre-minimized state), which is what was causing the NaNs. Now, when I use the proper minimized state + cudatoolkit installed, I'm seeing the following simulation times:
This makes me think that there is a chance that my openmm installation doesn't have the changes in your PR. I installed openmm like this:
Given that all of the changes in this PR are in the C++ layer, is there an easy way for me to double-check that my openmm installation actually has the changes in this PR?
Sure, here it is.
@peastman: Thanks, but actually I ended up not needing your TLDR from my previous message: I'm now seeing that without
In Python you can print out
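[The exact attribute was cut off above; for later readers, one way to check what is actually loaded is the following sketch. `git_revision` is only populated in builds that embed it:]

```python
import openmm

print(openmm.version.version)              # version string of the installed package
print(openmm.version.git_revision)         # commit the build came from, when embedded
print(openmm.version.openmm_library_path)  # where the loaded libOpenMM lives
```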
Ok, I have
What kind of CPU are you using? How many cores? Is anything else running at the same time that's using the CPU? Try monitoring it with
What speed do you get with the version that doesn't include the optimizations? Regardless though, it seems suspiciously slow. Even when I just skipped the unnecessary calculations, before I implemented any parallelization at all, that sped it up to 1.8 seconds. Unless you have an ancient CPU, it shouldn't be 3x slower than mine on single-threaded code.
@peastman: I'm running these experiments on our HPC, so it depends on which node I get assigned. I just ran the experiments again with
Here are the latest experiments, with the CPU/GPU mentioned above:
@ijpulidos ran some of these experiments on his local machine (which has an external GPU and a built-in GPU):
(Built-in) GPU: Quadro T500, CPU: Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
While Iván's experiments show much slower simulation times than what we had previously seen, there isn't a big difference when comparing the time required with vs. without the
That would make sense, if it's having to compete with other processes for CPU time. And it's still more than 6x faster than without the optimization, so it clearly is helping.
@zhang-ivy: When running these tests on lilac, it's probably critical to (1) request the appropriate number of thread-slots with:

```
# number of thread slots to request
#BSUB -n 8
# rusage[mem=7/task] : 'mem=X/task' specifies per thread-slot request for GB of memory
# span[hosts=1] : pack all thread-slots onto same physical node
# affinity[core(1):distribute=pack] : pack onto as few cores as possible
#BSUB -R "rusage[mem=7/task] span[hosts=1] affinity[core(1):distribute=pack]"
```

and (2) to also set the environment variable (Note that you can use
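[The variable name is cut off above. If, and this is an assumption on my part, it refers to OPENMM_CPU_THREADS, which caps how many threads OpenMM's CPU code spawns, it has to be set before the Context is created, e.g.:]

```python
import os

# Assumption: the truncated advice refers to OPENMM_CPU_THREADS. Match it to the
# number of thread slots requested from the scheduler (8 in the #BSUB lines above).
os.environ["OPENMM_CPU_THREADS"] = "8"
```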
We're really hoping for at most a 10-20% slowdown when updating parameters in the context relative to not updating the parameters. I still wonder if there's another way to tackle this, by parallelizing the numerical integration on the GPU with a simpler numerical integration scheme.
See my description up at the top. You can't characterize this as a percent slowdown. The CPU and GPU are doing different calculations at the same time. If the CPU finishes first, it's a 0% slowdown. If the GPU finishes first, it could be a large percent slowdown. How long each one takes depends on the speed of your CPU, the speed of your GPU, and the details of the system. Just comparing the different numbers above for different hardware, I had a slowdown of 70%, while @ijpulidos had slowdowns of 23% and 4% on two different GPUs. And those are all for the same system. If you're able to update the parameter only every ten steps, all of those would be well below the target range of 10-20%.
I think we need to update the parameter every step, not every 10 steps, right @jchodera?
For replica exchange, we can update only infrequently, but for nonequilibrium switching, we lose a huge amount of efficiency if we can't update every step.
What could work is to only update the anisotropic dispersion correction relatively infrequently (e.g. every 10 steps), but update the
There might be a very easy way to do that. It depends on your integrator and simulation protocol. The correction only affects the energy, not forces, so we could make it only recalculate the correction when you request the energy. In most simulations, that happens very infrequently. You could keep changing the parameter every step, and the forces would change appropriately, but it wouldn't update the correction until the next time you needed the energy. By the way, the correction is isotropic, not anisotropic.
I made the change in #3275. It's a very small change, and it might make a significant difference. Or it might not. It depends on the details of your simulation.
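[In usage terms, the pattern that change targets is roughly the following; a sketch reusing the placeholder names from the timing example above:]

```python
for i in range(n_steps):
    # Forces reflect the new parameter value immediately...
    context.setParameter("lambda_sterics", i / n_steps)
    integrator.step(1)

# ...but with #3275 the long range correction is only recomputed here,
# when the energy is actually requested.
state = context.getState(getEnergy=True)
```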
@peastman Just spoke with @jchodera -- is there a way to put the long range correction computation in a separate force group? If so, it should be easy for us to use your PR #3275 to compute the long range energy less frequently, which would speed up my nonequilibrium switching simulations significantly.
You can't separate out the long range correction from the rest of the CustomNonbondedForce. Is it actually necessary to compute the energy every step?
The
Could you describe exactly what you're computing the energy for? For example, does it even matter whether it includes the long range correction? Or is that only needed for the barostat?
Sure! Here's the basic concept for alchemically modifying Lennard-Jones interactions: We have four atom groups:
We may additionally introduce REST scaling, which further scales down the interactions within the non-
We remove all LJ interactions from
The long-range correction is essential for two reasons:
These effects are well-documented in this paper. In our nonequilibrium simulations, we achieve optimal thermodynamic efficiency (we dissipate the least amount of heat, and therefore achieve the lowest error) if we take a geodesic in thermodynamic control space that minimizes thermodynamic length; this achieves the lowest-variance estimate. To do this, we need to take small steps in
We could reformulate these as a multiple-timestep method where we update the slowly-varying components of the potential only once in the middle of a complex timestep that includes other substeps, but we would need to split out the slowly-varying components. Since the alchemical region is a ligand tightly nestled in a binding site, we need to update all of the nearby interactions every timestep. I suppose we could create two different
We'd be happy to brainstorm other efficient ways to account for this if you have some time!
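[For concreteness, a generic softcore Lennard-Jones CustomNonbondedForce of the kind described above can be set up roughly as follows. The energy expression is a common textbook softcore form, not the exact one perses uses, and all parameter names are illustrative:]

```python
import openmm

energy = ("4*epsilon*lambda_sterics*(x^2 - x); "
          "x = 1/(alpha*(1-lambda_sterics) + (r/sigma)^6); "
          "sigma = 0.5*(sigma1+sigma2); epsilon = sqrt(epsilon1*epsilon2)")
force = openmm.CustomNonbondedForce(energy)
force.addGlobalParameter("lambda_sterics", 1.0)  # the parameter driven every step
force.addGlobalParameter("alpha", 0.5)           # softcore strength
force.addPerParticleParameter("sigma")
force.addPerParticleParameter("epsilon")
force.setNonbondedMethod(openmm.CustomNonbondedForce.CutoffPeriodic)
force.setCutoffDistance(1.0)            # nm
force.setUseLongRangeCorrection(True)   # the correction this whole thread is about
```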
Let's follow up on that idea. I'm not sure if the following is entirely correct, and I also haven't thought much about whether it would be practical to implement. This is just brainstorming. The correction doesn't affect the forces, so you don't need it for integration. And since it doesn't depend on the positions, you don't need to accumulate it every timestep. It's only required for two purposes.
Everything else can be done with a copy of the CustomNonbondedForce that doesn't include the correction. Does that sound right?
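[A rough construction of that brainstorm might look like the following. This is a sketch, not an implementation from this thread: `custom_nb`, `system`, and the group number are placeholders, and evaluating the copy recomputes all the pairwise terms too, so the correction alone would have to be obtained by subtraction.]

```python
from openmm import XmlSerializer

# Per-step force: identical interactions, correction switched off.
custom_nb.setUseLongRangeCorrection(False)

# Occasionally-evaluated copy: same interactions plus the correction, in its own group.
with_correction = XmlSerializer.deserialize(XmlSerializer.serialize(custom_nb))
with_correction.setUseLongRangeCorrection(True)
with_correction.setForceGroup(31)
system.addForce(with_correction)

# When the full energy is needed (reporting, barostat-style checks), evaluate
# group 31 and subtract the correction-free copy's energy to isolate the correction.
```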
How much of that comes from the coefficient changing, and how much is from the box volume changing?
Fixes #3229. This PR includes three different optimizations.
updateParametersInContext(), not if you change a global parameter.
I can't give a single number for how much faster it is. That depends both on the details of your system and on the relative speed of your CPU and GPU. In the best case, where computing the coefficient on the CPU takes less time than computing the forces on the GPU, changing a global parameter will now have negligible cost. In other cases it may still be quite expensive, but a lot less expensive than it was before.
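[In code, the distinction reads roughly like this; a sketch in which `context`, `force`, and the parameter names are placeholders:]

```python
# Cheap path after this PR: the coefficient is recomputed on the CPU
# while the GPU is busy with the rest of the step.
context.setParameter("lambda_sterics", 0.5)

# Recomputation path: editing per-particle parameters requires an explicit update,
# which always recomputes the coefficient.
force.setParticleParameters(0, [0.3, 0.5])  # (particle index, [sigma, epsilon])
force.updateParametersInContext(context)
```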