Allow long range correction to use default parameter values #3301
@peastman @zhang-ivy: I wanted to summarize the results of yesterday's discussion about efficiently treating long-range dispersion corrections in alchemical systems a bit more clearly, since I think this may point to a slightly different feature request.

To generate an alchemical system, @zhang-ivy's current implementation removes all Lennard-Jones interactions from NonbondedForce and moves them into a CustomNonbondedForce. One possibility for making this scheme performant is to do the following (which requires new features):
It appears we may be able to avoid implementing the first two features if, instead of moving all environment-environment Lennard-Jones interactions out of NonbondedForce, we keep them there. (I believe @zhang-ivy elected to move everything to CustomNonbondedForce.)

In the meantime, we will also carry out a quick test of what drives the contributions to the dispersion correction when alchemical modifications are made. For both a protein mutation (e.g. large-to-small, such as Trp/Arg -> Gly) and a reasonably large small molecule transformation, we can try the following:
Splitting the Lennard-Jones interaction between NonbondedForce and CustomNonbondedForce will hurt your performance. If all atoms have epsilon=0, NonbondedForce ifdefs out the part of the kernel that computes LJ. But if it's zero for some atoms and nonzero for others, you will pay the cost of both computations for every atom pair. I also don't see what you gain from moving it back to NonbondedForce. How is that any better than using CustomNonbondedForce, with this feature added so it won't recalculate the coefficient when you change a global parameter? |
There are several considerations for performance here, beyond not requiring the API for
What do you think about the potential for including support for |
A couple of questions:
A final thought: I'm a bit hesitant to move the environment-environment interactions to a |
There's a much easier way to determine where the change is coming from. The correction energy is given by E_C = C/volume. You can compute C as E_C*volume, where E_C is just the change in energy when you turn on the correction. C depends only on lambda, nothing else, so you only need to compute it twice, once for lambda=0 and once for lambda=1. Finally, compute the average volume in the starting and ending states. |
That's a great point!
We need sufficient statistics to compute good standard errors here too since the volume distributions will be pretty wide compared to the difference in means. Notably, the average is also over the inverse volumes, not the volumes.
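For reference, the relations in play here, with C the correction coefficient and V the box volume:

E_corr(lambda, V) = C(lambda) / V,   so   C(lambda) = E_corr * V

and the average correction over an NPT trajectory is C(lambda) * <1/V>, not C(lambda) / <V>.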
The logic is definitely extremely complex already, as seen from your detailed index of |
Why would it be unavoidable? It can be made to work either way. It's just a question of what is simpler. If keeping all LJ interactions in a single custom force is easier to implement than splitting it up, I don't see a problem with it. |
We don't yet have a solution to make that approach performant while also correctly including the long-range corrections for the alchemical region. |
Why would the performance be any different? The cost of recomputing the coefficient depends on the number of atom types, not the number of atoms. |
Right now, the only way to accurately include the contribution from the alchemical atoms is to compute the long-range correction every timestep, which we have established becomes the bottleneck in performance. If we moved the long-range dispersion correction back to NonbondedForce, the contribution from the alchemical atoms would be omitted.

We are exploring whether approximations may allow us to, in some cases, compute reasonable corrections for omitting the changes in alchemical atoms if we add some additional features to
You can do that just as easily whichever force the LJ interactions are in. |
As we discussed, the only way to compute the LJ long-range correction in a performant manner if these live in the
My understanding was that requiring these two new features to be implemented did not qualify as "just as easily" compared with simply moving the dispersion correction to NonbondedForce. Or am I misunderstanding the suggestion?
What you're proposing is more complicated than just moving the dispersion correction to NonbondedForce. If you do that, it won't be correct because it will be computed as if the alchemical region didn't exist at all. The atoms in it will all have epsilon set to 0. You then proposed approximating them with a CustomExternalForce, requiring the addition of a new feature. But since that approximation just interpolates between two values, it could just as easily represent the entire correction. It makes no difference whether a correction is being computed by NonbondedForce, CustomNonbondedForce, or neither. The CustomExternalForce can include any or all of them.

On the other hand, it's also possible that approximation isn't needed. If the change in the coefficient is sufficiently small, we can just ignore it. In that case it could be computed either with NonbondedForce (less accurately, no new features required) or CustomNonbondedForce (more accurately, the new feature described at the top of this issue required).

Another possibility is that the change in the coefficient is small, but not quite small enough to completely ignore. In that case you can run the simulation holding it constant and just correct for it at the end.
We'll definitely collect representative data and see!
I'm still having a hard time following the points of disagreement here, but it is sounding like we concur the addition of

You're right that the entire dispersion correction could be moved to a CustomExternalForce.

One other potentially poor idea: Since the time-consuming step in updating the long-range correction in
There's no need to split it up. Using CustomExternalForce is just a hack. You really just want a single term adding a single fixed energy. You could just as easily use CustomBondForce (with a single bond) or any other.
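For illustration, here is a minimal sketch of that idea, assuming an existing System `system`, precomputed endstate coefficients C0 and C1 (in energy*volume units), a representative box volume, and an example global parameter name `lambda_sterics` (all of these are assumptions of the sketch, not values from this thread):

```python
from openmm import CustomBondForce

# Illustrative placeholder values (made up for the sketch).
C0 = -1500.0        # correction coefficient at lambda=0, kJ/mol * nm^3
C1 = -1450.0        # correction coefficient at lambda=1, kJ/mol * nm^3
avg_volume = 340.0  # representative box volume, nm^3

# A single "bond" whose energy is the fixed, lambda-interpolated correction C(lambda)/V.
correction = CustomBondForce("((1-lambda_sterics)*C0 + lambda_sterics*C1)/avg_volume")
correction.addGlobalParameter("lambda_sterics", 0.0)
correction.addGlobalParameter("C0", C0)
correction.addGlobalParameter("C1", C1)
correction.addGlobalParameter("avg_volume", avg_volume)
correction.addBond(0, 1, [])  # particle indices are irrelevant; the energy does not depend on r
system.addForce(correction)
```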
We still need to find out whether it's needed. |
This would get very complicated, especially if there are multiple distinct alchemical regions. When we think of each atom as belonging to a single alchemical region or the environment, it's much easier to track contributions by atom. What do you think of the alternative of specifying an algebraic expression for the integral in CustomNonbondedForce?
In that case, the hack I suggested isn't appropriate. More generally, the long range correction is not a per-atom energy. It's a per-pair energy.
That would be very dangerous. It would be really easy for people to make mistakes and get the wrong result. It's also a really complicated integral, especially if you use a switching function. Check out the code for evaluating it for standard LJ, especially the comment at the top! (openmm/openmmapi/src/NonbondedForceImpl.cpp, lines 239 to 274 at d83c272)
That's why the default is to use "safe but slow" numerical integration. Setting the function manually should only be used if you have an especially simple integral or, in our case, an excellent algebraic approximation. The error can also easily be checked by computing the energy with and without the algebraic expression.
Yes, with the switching function, it is quite complicated. For unswitched LJ, a reasonable simplification is to use only the dispersive (r^{-6}) term, where the integral is simply

\int_{r_c}^{\infty} 4 \pi r^2 \, [-4 \epsilon (\sigma/r)^6] \, dr = -\frac{16 \pi \epsilon \sigma^6}{3 r_c^3}

which is not very scary at all. The combining rules (e.g. epsilon = sqrt(epsilon1*epsilon2), sigma = (sigma1+sigma2)/2) would also be included in the full energy expression.
In our particular case, our softcore form equals the LJ potential at lambda = 0 and 1, and we can simply consider the linear interpolation of these to incorporate the effects of the dispersion correction, using a scheme like this (where we assume the particle parameters are
Assuming each integration-by-quadrature would normally require 100 evaluations of the potential, specifying an algebraic approximation of the integral would give a speedup of ~100x. Notably, even if we didn't drop the repulsive (r^{-12}) term, which is already >10,000 times smaller than the attractive term for a 10 Å cutoff, the expression would not be much more complicated:

\int_{r_c}^{\infty} 4 \pi r^2 \cdot 4 \epsilon [(\sigma/r)^{12} - (\sigma/r)^6] \, dr = 16 \pi \epsilon \left[ \frac{\sigma^{12}}{9 r_c^9} - \frac{\sigma^6}{3 r_c^3} \right]
Sure, switching functions make things more complicated, but we can still do a great deal if we're not using switching functions. |
That seems like a really hard way of implementing it! If you're just going to use a linear interpolation, why don't you just precompute the coefficient for lambda=0 and lambda=1, then interpolate between them? No need to integrate anything by hand, and no need to ever recompute during the simulation. But we're getting ahead of ourselves. We still don't know if any of this is even necessary. Let's get the numbers first to find out how much the coefficient changes.
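In symbols, with C(lambda) denoting the dispersion-correction coefficient, the suggestion is just

C(lambda) ≈ (1 - lambda) * C(0) + lambda * C(1)

evaluated from the two precomputed endstate coefficients, with no integration during the simulation.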
As I noted in the previous comment, it's not actually hard---it's just a bit of symbolic integration. It took me longer to write the comment than do the integration.
It's actually much harder for us to develop the code to precompute the complete coefficients we need using the same scheme that is already built into CustomNonbondedForce.

By contrast, adding a method to optionally specify a custom expression to be used in place of the numerical integral means we only have to keep the integral expression compatible with the energy expression. This allows us wide latitude for experimentation without worrying about keeping multiple pieces of code in sync with each other.
We're benchmarking the magnitude for a couple of illustrative use cases, but if we can implement a feature like I proposed that allows us to efficiently get exact or near-exact results for all cases, this is vastly preferred over approximations. |
To be concrete, I'm proposing we add a method `CustomNonbondedForce.useLongRangeCorrectionExpression(expression)`, which would cause the accumulation of interactions to use a single Lepton expression evaluation instead of calling the numerical integration method.
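A hypothetical usage sketch of the proposed method (neither `useLongRangeCorrectionExpression` nor the `rc`/`pi` variables exist in OpenMM today, and the normalization of the integral is only illustrative):

```python
from openmm import CustomNonbondedForce

# Pair energy with Lorentz-Berthelot combining rules (illustrative).
force = CustomNonbondedForce(
    "4*epsilon*((sigma/r)^12-(sigma/r)^6); "
    "sigma=0.5*(sigma1+sigma2); epsilon=sqrt(epsilon1*epsilon2)")
force.addPerParticleParameter("sigma")
force.addPerParticleParameter("epsilon")
force.setUseLongRangeCorrection(True)

# Proposed (hypothetical) API: supply the closed-form integral of the pair energy
# beyond the cutoff, so each class pair costs one Lepton evaluation instead of a
# numerical quadrature. rc and pi are assumed to be made available to the expression.
force.useLongRangeCorrectionExpression(
    "16*pi*epsilon*(sigma^12/(9*rc^9) - sigma^6/(3*rc^3)); "
    "sigma=0.5*(sigma1+sigma2); epsilon=sqrt(epsilon1*epsilon2)")
```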
You don't need to write any custom code at all for computing it! Just call |
I've run this experiment with barnase:barstar (mutation: K27A) (cutoff = 1 nm) with the following code (with the openmmtools Langevin integrator):

```python
volumes_0 = []
energies_0_without_correction = []
energies_0_with_correction = []
system.getForce(7).setUseDispersionCorrection(False)
system.getForce(8).setUseLongRangeCorrection(False)
for _ in tqdm_notebook(range(1000)):
    integrator.step(250)

    # Get box volume and energy without correction
    state = context.getState(getEnergy=True)
    volumes_0.append(state.getPeriodicBoxVolume().value_in_unit_system(unit.md_unit_system))
    energies_0_without_correction.append(state.getPotentialEnergy().value_in_unit_system(unit.md_unit_system))

    # Get energy with correction
    system.getForce(7).setUseDispersionCorrection(True)
    system.getForce(8).setUseLongRangeCorrection(True)
    state = context.getState(getEnergy=True)
    energies_0_with_correction.append(state.getPotentialEnergy().value_in_unit_system(unit.md_unit_system))

    # Turn correction off
    system.getForce(7).setUseDispersionCorrection(False)
    system.getForce(8).setUseLongRangeCorrection(False)
```

For lambda = 1 box volumes:

I also ran these experiments for the following tyk2 transformation (with cutoff = 1.0 nm):

Lambda = 1 box volumes:

@jchodera @peastman: Please let me know if I ran the experiment correctly and what your thoughts are. It seems to me that the changes in long range correction are dominated by the coefficient changing, not the box volume.
Something's wrong with your calculation of the coefficient. It should depend only on lambda and be independent of time. The code should be something like this.

```python
customNonbondedForce.setUseLongRangeCorrection(False)
customNonbondedForce.setForceGroup(1)  # so we can query just this force
...
# Create the context and set initial conditions
...
e1 = context.getState(getEnergy=True, groups={1}).getPotentialEnergy()
customNonbondedForce.setUseLongRangeCorrection(True)
context.reinitialize(preserveState=True)  # otherwise it won't see the change to the force
e2 = context.getState(getEnergy=True, groups={1}).getPotentialEnergy()
volume = context.getState().getPeriodicBoxVolume()
print('coefficient:', (e2-e1)/volume)
```

Run that code twice, once with lambda=0 and once with lambda=1.
For computing statistics of the volume, could you omit the first few hundred snapshots? That way it isn't influenced by the early part where it's still equilibrating. For example, `np.mean(1/np.array(volumes_0[300:]))`.
Sorry, that should have been `print('coefficient:', (e2-e1)*volume)` (multiplying by the volume instead of dividing by it).
I've fixed the volume computation (to discard the first 300 snapshots) and the coefficient computation (to be computed only once at each endstate using the code you supplied):
Can you give more digits of the volume? You only included one significant digit, which isn't enough to show the difference. The standard deviation is less than 1e-5, and the standard error is even smaller, so the means should be accurate to a lot of digits. |
No it doesn't. It causes no measurable slowdown at all in my environment.
You asked for "a couple of more days". That was six days ago. |
@peastman : I think John was talking about the slowdown that's still present in your results here when we don't use one atom class. We can work on somewhat reducing the complexity of the custom expression and the number of atom classes, but these free energy calculation factories are complicated, so it is not realistic to expect an implementation that uses a super simple expression / one atom class. Therefore, I think there's room for discussion of how to optimize the long range correction computation such that the slowdown (after simplifying the expression and reducing atom classes as much as we can) is roughly ~1.1x (a goal I think we had previously agreed upon). |
@jchodera and I debugged the long range correction slowdown that's present with 1 atom class. Here is the script and system/state XMLs we used (it has minor modifications compared to the one I sent previously):

With one atom class:
Takeaways:
With many atom classes:
Takeaways:
Let's continue the discussion of how to speed this up on Thursday. Here are the potential solutions we've talked about so far:
That still doesn't make sense. With only one atom class, there should be no slowdown whatsoever. Even on a slow CPU, updating the coefficient should take a tiny fraction of the time required for the GPU to compute energy.
I strongly suspect this is the real problem. Even with only one work thread, you still have to wait for the thread to start up. If there are idle cores it will start immediately, but if it has to preempt another process, that could take several ms before it starts running. You need to be running all your benchmarks when there is nothing else running on the computer. If it has to compete with other processes, the numbers don't really tell us anything. |
My production runs will all be on our HPC cluster (or perhaps Folding@home), where the nodes are being shared. Therefore, it's actually more useful to run the benchmarks in a setting where there is competition from other processes, because this is the setting in which we are going to be doing the science. I would assume many other OpenMM users run their simulations on node-sharing clusters as well.
Most clusters let you reserve a number of cores. If you're having to share cores with other processes, a lot of things will be much slower. That's not because OpenMM is slow. It's because another process is running instead. |
@peastman: LSF manages the shared cluster and provides control over requests and thread affinities.

```
bsub -W 04:00 -m "ly-gpu lx-gpu lw-gpu ld-gpu lv-gpu lu-gpu lg-gpu lt-gpu" -q gpuqueue -n 20 -gpu "num=1:j_exclusive=yes:mode=shared" -R "rusage[mem=14/host] span[ptile=20] affinity[core(1):distribute=pack(socket=1)]" -env "all" -Is /bin/bash
```

Here, we are assigned 20 task slots, each of which gets a core, all of which have to be packed onto the same socket (which has 20 cores). I can also use

I have verified that this difference in the single-class case is reproducible in my environment on this machine (which has an RTX 2080 Ti):
We have the whole socket dedicated to this job. I don't think this is because another thread is contending for processor resources. It might be some other resource, but it's not going to be processor resources. |
At the top of CustomNonbondedForceImpl.cpp add the line

```cpp
#include "openmm/internal/timer.h"
```

At the top of the method add

```cpp
double start = getCurrentTime();
```

And at the end add

```cpp
double end = getCurrentTime();
printf("%g\n", end-start);
```

How long does it report the method is taking? For me, all times after the first iteration are around 0.0002, almost an order of magnitude less than the step times you quoted above. This is on a Core i7-7740X, which is a four year old midrange processor, so nothing particularly fast.
I suggest we postpone the meeting today, since we're still diagnosing the issue above. |
I'll run this experiment on a few systems today. |
I'm trying to follow the build directions here on our lilac cluster, but am running into
Presumably, OpenMM has been configured with the wrong C++ or CUDA compiler somehow, but we don't actually say anything helpful about which compilers we support in the build instructions, or even how to tell if your compiler is supported. |
That means you're using a very old compiler that doesn't fully support C++11. See #2545. |
Indeed! Let's update the installation instructions with more information on (1) what versions we don't support, and (2) what it will look like if you run into this issue. I am tracking lessons learned for updating documentation here: #3378 (comment) |
OK, I've finally managed to build from scratch, install in my conda environment, and recapitulate the data we previously observed on a single socket:
The first numbers show the timing for each call to |
I'm unclear if this is relevant, but your machine (Core i7-7740X) has a base clock of 4.3 GHz; our cluster is using an Intel Xeon Gold 6230 with base clock at 2.10 GHz. |
Here are the many-class timings---the real-life scenario we need to run without slowdown. We requested a whole 10-core (20-thread) socket, the node is relatively unloaded, and I'm pinning the thread affinities to the socket we are allocated.
As you can see, even with 8 threads, we're still 13x slower than without the long-range correction, and the entire time is being eaten by |
Interesting. Here is what is supposed to happen. In openmm/platforms/common/src/CommonKernels.cpp (line 2300 at 92c6dbb):
The task executes openmm/platforms/common/src/CommonKernels.cpp (lines 1839 to 1841 at 92c6dbb).
Finally, the CommonCalcCustomNonbondedForceKernel has registered a post-computation to get executed after all other work (openmm/platforms/common/src/CommonKernels.cpp, lines 1813 to 1824 at 92c6dbb).
It calls |
Does this use the same |
Also, if you have any suggestions for how I can instrument the threads or computations to track what is starting/stopping/sleeping when, I can do that! |
No, each context has a single worker thread that's independent of any thread pool. See the ComputeContext::WorkThread class. It's intended for uses like this, where you don't specifically want to parallelize a CPU calculation (if you want that, you'll use the ThreadPool), but you don't want it to block the main thread from launching kernels.
Try printing out the value of
This should give us a picture of where the time is being spent and where delays are happening. If everything is working as intended, very little time should pass between the start of |
Thanks! Here's the output: Single-class test case
Multiple-class test case
Thanks! That's very helpful. Here's the output from a single iteration.
We can see the following time intervals:
Based on that, we can draw these conclusions:
This is a very small system (14,550 atoms). The above should be quite different on a larger system where the step time is dominated by GPU rather than CPU. In principle we could start the long range correction calculation earlier, which might let it overlap more. But that's a dangerous thing to do. We don't want to risk it competing with the main thread that's launching kernels. |
Is that for the multi-class? I thought this consumed an order of magnitude more time. |
Single class. |
How do the single class findings translate into the multi class prospects for keeping this to <10% slowdown? The single class use case is totally irrelevant to practical usage.
Right, it was just for diagnostic purposes to understand what's going on. In #3301 (comment) I compared the "vanilla" and "rest" versions of the system, and one of the differences I found was that the latter has many more atom classes (49 vs. 28). Since the number of integrals is the square of the number of classes, that makes it a lot more expensive (roughly (49/28)^2 ≈ 3x more). I raised the question of whether all those classes were really needed. If the number could be reduced, it would produce a big speedup. I don't think the answer to that question has been determined yet.
@peastman: John included timing data for both single class and multi class here -- I think he was wondering whether the same conclusions you made about single class hold true for multi class. From a cursory glance, I don't think they do, but I would be interested to see what your conclusions are.
I haven't had a chance to dive deep into this yet, but John and I discussed this, and we would guess that we won't be able to reduce the number of atom classes by much -- there are going to be many atom classes due to differences in the actual atom type parameters, how they should be scaled by rest, and how they should be scaled alchemically. |
That example is unrealistic in two ways: first in having only one class, and second in having a tiny number of atoms. Really we want to profile a realistic case, with the actual number of classes you'll have (after doing whatever you can to reduce the number) and with a number of atoms that is typical of the real systems you'll be simulating. That will tell us where we currently are on performance, and how much cheaper the coefficient calculation needs to become. |
This relates to #3229 and #3236. CustomNonbondedForce computes its long range correction based on the current values of global parameters. That means it must be recomputed every time a parameter changes, which can be slow. In contrast, NonbondedForce computes it based on the default values of parameters, so it never changes. It could be useful to allow CustomNonbondedForce to use that behavior too. We could add a method
setLongRangeCorrectionUsesDefaultParameters()
that tells it whether to compute the correction based on the current or default values of parameters.
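A hypothetical sketch of how the proposed method might be used, assuming an existing CustomNonbondedForce `force`, a `context` built from the system containing it, and an example global parameter named `lambda_sterics`:

```python
# Hypothetical: setLongRangeCorrectionUsesDefaultParameters() is the method proposed
# above and does not currently exist in OpenMM.
force.setUseLongRangeCorrection(True)
force.setLongRangeCorrectionUsesDefaultParameters(True)

# The coefficient would then be computed once from the default values of the global
# parameters, so changing a parameter mid-simulation would no longer trigger an
# expensive recomputation of the correction.
context.setParameter("lambda_sterics", 0.5)  # example global parameter name
```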