Allow long range correction to use default parameter values #3301

Open
peastman opened this issue Oct 27, 2021 · 99 comments

@peastman
Member

This relates to #3229 and #3236. CustomNonbondedForce computes its long range correction based on the current values of global parameters. That means it must be recomputed every time a parameter changes, which can be slow. In contrast, NonbondedForce computes it based on the default values of parameters, so it never changes. It could be useful to allow CustomNonbondedForce to use that behavior too. We could add a method setLongRangeCorrectionUsesDefaultParameters() that tells it whether to compute the correction based on the current or default values of parameters.
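
As a rough sketch of how the proposed method might be used from Python (the final line is the hypothetical new API; everything else is the existing CustomNonbondedForce interface, and the energy expression is only illustrative):

import openmm as mm

force = mm.CustomNonbondedForce(
    '4*epsilon*((sigma/r)^12 - (sigma/r)^6);'
    ' sigma=0.5*(sigma1+sigma2); epsilon=lambda_sterics*sqrt(epsilon1*epsilon2)')
force.addPerParticleParameter('sigma')
force.addPerParticleParameter('epsilon')
force.addGlobalParameter('lambda_sterics', 1.0)   # default value
force.setNonbondedMethod(mm.CustomNonbondedForce.CutoffPeriodic)
force.setUseLongRangeCorrection(True)
# Proposed (hypothetical) method: compute the correction coefficient once from the
# default values of the global parameters, so that later calls to
# context.setParameter('lambda_sterics', ...) do not trigger a recomputation.
force.setLongRangeCorrectionUsesDefaultParameters(True)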

@jchodera
Member

@peastman @zhang-ivy: I wanted to summarize the results of our discussion about efficiently treating long-range dispersion corrections in alchemical systems yesterday a bit more clearly, since I think this may point to a slightly different feature request.

To generate an alchemical system, @zhang-ivy's current implementation removes all Lennard-Jones interactions from NonbondedForce and implements them as a CustomNonbondedForce that also computes long-range corrections. This is problematic, because with the long-range corrections on, any change in alchemical parameters will trigger recomputation of the corrections, dominating the integration time in nonequilibrium switching. If we turn the long-range corrections off, we lose all the long-range dispersion corrections, producing highly erroneous densities.

One possibility for making this scheme performant is to do the following (which requires new features):

  • Have CustomNonbondedForce compute the long-range correction, but don't update it during the simulation
  • Omit the alchemical atoms from the computation of the long-range correction, instead replacing them with a CustomExternalForce that computes the correction for the alchemical atoms
  • Extend CustomExternalForce to include volume as a parameter to enable the contribution from the alchemical atoms to be encoded as an energy expression like ((1-lambda_sterics)*coefficient_initial + lambda_sterics*coefficient_final)/volume. We note that the long-range correction really only needs the dispersion attraction part of Lennard-Jones 12-6 (-4*epsilon*(sigma/r)^6), and that the long-range correction for this term for an alchemical atom i can be expressed in terms of precomputed coefficients for the initial and final states that can be linearly interpolated. The only time-dependent quantity aside from lambda that is required for computation is the volume. A sketch follows this list.
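
A minimal sketch of what that could look like if a volume variable were exposed (the volume variable does not exist in CustomExternalForce today; alchemical_atom_indices and the precomputed per-atom coefficients are assumed to come from elsewhere):

import openmm as mm

# Hypothetical: 'volume' is not currently an allowed variable in CustomExternalForce expressions.
lrc = mm.CustomExternalForce(
    '((1-lambda_sterics)*coefficient_initial + lambda_sterics*coefficient_final)/volume')
lrc.addGlobalParameter('lambda_sterics', 0.0)
lrc.addPerParticleParameter('coefficient_initial')
lrc.addPerParticleParameter('coefficient_final')
for index, (c0, c1) in zip(alchemical_atom_indices, precomputed_coefficients):
    lrc.addParticle(index, [c0, c1])
system.addForce(lrc)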

It appears we may be able to avoid implementing the first two features if, instead of moving all environment-environment Lennard-Jones interactions from NonbondedForce to CustomNonbondedForce, we keep them all in NonbondedForce and use it to compute the long-range dispersion correction. The CustomNonbondedForce then only needs to handle the alchemical-environment interactions, and it can skip the long-range correction entirely, since that contribution can be moved to the new CustomExternalForce (if the volume term is added).

I believe @zhang-ivy elected to move everything to CustomNonbondedForce to be able to decouple alchemical and REST regions, but if we consider the union of the REST and alchemical regions as the "alchemical region" that is being modified in some way above, then this scheme (NonbondedForce + CustomNonbondedForce) should still work.

--

In the meantime, we will also carry out a quick test of what drives the contributions to the dispersion correction when alchemical modifications are made. For both a protein mutation (e.g. large-to-small, such as Trp/Arg -> Gly) and a reasonably large small molecule transformation, we can try the following:

  • Try this for each cutoff 9A, 10A, 12A
  • Generate 1 ns (1000 snapshots, 1 ps apart) simulating the alchemical system at lambda=0 and 1 using our current hybrid alchemical factory
  • For each snapshot, as we're generating it, compute the energy at both lambda=0 and 1, with all long-range dispersion corrections off and on, so that we can inspect how the distribution (and especially mean) of the long-range correction (U_LRC) shifts between lambda=0 and 1. If <U_LRC(lambda=0)>_1 ~= <U_LRC(lambda=0)>_0 and <U_LRC(lambda=1)>_0 ~= <U_LRC(lambda=1)>_1, then the coefficients (not the volume distribution shift) are chiefly responsible for differences in <U_LRC>; if not, the volume distribution shift is chiefly responsible
  • Store state.getPeriodicBoxVolume() for each snapshot as well so we can inspect the volume distribution shift between lambda=0 and 1

@peastman
Member Author

Splitting the Lennard-Jones interaction between NonbondedForce and CustomNonbondedForce will hurt your performance. If all atoms have epsilon=0, NonbondedForce ifdefs out the part of the kernel that computes LJ. But if it's zero for some atoms and nonzero for others, you will pay the cost of both computations for every atom pair.

I also don't see what you gain from moving it back to NonbondedForce. How is that any better than using CustomNonbondedForce, with this feature added so it won't recalculate the coefficient when you change a global parameter?

@jchodera
Member

There are several considerations for performance here, beyond not requiring the API for CustomNonbondedForce to be extended in two new ways:

  • The CustomNonbondedForce would only include interactions between the alchemical region (M particles) and the environment (N-M particles) using interaction groups. This reduces the cost to M(N-M)/2 interactions, where M << N---even fewer if the pairlist is used in the interaction group computation.
  • The CustomNonbondedForce energy expression is also very complicated, containing multiple special cases for scaling of some interactions for REST and softcore interactions for alchemical-environment interactions. We save the need to evaluate this complex expression with many per-particle parameters if we compute those interactions separately.
  • This cleanly separates the part of the potential that depends on global parameters from the part of the potential that does not, so that they can be put into different force groups. When computing the contribution to the protocol work, only the Custom*Force terms that change---corresponding to smaller numbers of atoms---need to be recomputed.

What do you think about the potential for including support for volume in the CustomExternalForce energy expression?

@zhang-ivy
Contributor

zhang-ivy commented Oct 29, 2021

If <U_LRC(lambda=0)>_1 ~= <U_LRC(lambda=0)>_0 and <U_LRC(lambda=1)>_0 ~= <U_LRC(lambda=1)>_1, then the coefficients (not the volume distribution shift) are chiefly responsible for differences in <U_LRC>; if not, the volume distribution shift is chiefly responsible

A couple of questions:

  1. Can you clarify what you mean by <U_LRC(lambda=0)>_1? I would assume you mean the energy of the long range correction with lambda = 0 positions and using the lambda = 1 system?
  2. Why would seeing <U_LRC(lambda=0)>_1 ~= <U_LRC(lambda=0)>_0 and <U_LRC(lambda=1)>_0 ~= <U_LRC(lambda=1)>_1 mean that the coefficients are responsible? Aren't both the volume and coefficient included in the U_LRC?
  3. If we see that the volume is responsible for most of U_LRC, then we might turn to @jchodera's suggested approach of using a CustomExternalForce with the changing volume term + precomputed coefficients. If we see that the coefficients are responsible, then none of the suggestions in this thread (so far) will be sufficient, correct?

A final thought: I'm a bit hesitant to move the environment-environment interactions to a NonbondedForce. If I do, that would mean that in order to properly address long range corrections for nonequilibrium switching, I would need to introduce 2 more nonbonded-related forces (NonbondedForce and CustomExternalForce) on top of a system that already has 4 nonbonded-related forces. The more complicated the code gets, the harder it is to debug. Also, it makes it extremely difficult for anyone new to the code to understand what's going on. If this is the only way to implement this, then it's better to have something working than not, but I fear that the logic is getting too complex.

@peastman
Member Author

There's a much easier way to determine where the change is coming from. The correction energy is given by E_C = C/volume. You can compute C as E_C*volume, where E_C is just the change in energy when you turn on the correction. C depends only on lambda, nothing else, so you only need to compute it twice, once for lambda=0 and once for lambda=1. Finally, compute the average volume in the starting and ending states.

@jchodera
Member

jchodera commented Oct 29, 2021

There's a much easier way to determine where the change is coming from. The correction energy is given by E_C = C/volume. You can compute C as E_C*volume, where E_C is just the change in energy when you turn on the correction. C depends only on lambda, nothing else, so you only need to compute it twice, once for lambda=0 and once for lambda=1.

That's a great point!

Finally, compute the average volume in the starting and ending states.

We need sufficient statistics to compute good standard errors here too since the volume distributions will be pretty wide compared to the difference in means. Notably, the average is also over the inverse volumes, not the volumes.
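
A minimal sketch of that bookkeeping (the variable name and the number of discarded equilibration snapshots are placeholders):

import numpy as np

inv_v = 1.0 / np.asarray(volumes_nm3[300:])   # drop unequilibrated snapshots; average the inverse volumes
mean_inv_v = inv_v.mean()
stderr_inv_v = inv_v.std(ddof=1) / np.sqrt(len(inv_v))   # naive standard error; correlated samples need a correction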

@jchodera
Member

A final thought: I'm a bit hesitant to move the environment-environment interactions to a NonbondedForce. If I do, that would mean that in order to properly address long range corrections for nonequilibrium switching, I would need to introduce 2 more nonbonded-related forces (NonbondedForce and CustomExternalForce) on top of a system that already has 4 nonbonded-related forces. The more complicated the code gets, the harder it is to debug. Also, it makes it extremely difficult for anyone new to the code to understand what's going on. If this is the only way to implement this, then it's better to have something working than not, but I fear that the logic is getting too complex.

The logic is definitely extremely complex already, as seen from your detailed index of Force terms required to implement this before the addition of NonbondedForce and CustomExternalForce. I think this might be unavoidable now, but it's a good topic to further engage with @peastman to see if we can simplify much of this logic to a smaller number of C++-level classes that unify what would otherwise require a whole set of Custom*Force classes.

@peastman
Member Author

Why would it be unavoidable? It can be made to work either way. It's just a question of what is simpler. If keeping all LJ interactions in a single custom force is easier to implement than splitting it up, I don't see a problem with it.

@jchodera
Member

Why would it be unavoidable? It can be made to work either way. It's just a question of what is simpler. If keeping all LJ interactions in a single custom force is easier to implement than splitting it up, I don't see a problem with it.

We don't yet have a solution to make that approach performant while also correctly including the long-range corrections for the alchemical region.

@peastman
Member Author

Why would the performance be any different? The cost of recomputing the coefficient depends on the number of atom types, not the number of atoms.

@jchodera
Member

Right now, the only way to accurately include the contribution from the alchemical atoms is to compute the long-range correction every timestep, which we have established becomes the bottleneck in performance.

If we moved the long-range dispersion correction back to NonbondedForce for the environment atoms, adding volume to the allowed energy expression variables could allow us to correctly compute the alchemical atom contributions in CustomExternalForce.

We are exploring whether approximations may allow us, in some cases, to compute reasonable corrections for omitting the changes in the alchemical atoms if we add some additional features to CustomNonbondedForce. But even then, it's not necessarily going to be fully accurate in all cases.

@peastman
Member Author

If we moved the long-range dispersion correction back to NonbondedForce for the environment atoms, adding volume to the allowed energy expression variables could allow us to correctly compute the alchemical atom contributions in CustomExternalForce.

You can do that just as easily whichever force the LJ interactions are in.

@jchodera
Member

You can do that just as easily whichever force the LJ interactions are in.

As we discussed, the only way to compute the LJ long-range correction in a performant manner if these live in the CustomNonbondedForce @zhang-ivy constructed to handle all LJ interactions is to add the two features I suggested:

  • Allow CustomNonbondedForce to omit contributions from a subset of atoms from the long-range correction
  • Allow CustomNonbondedForce to include the long-range correction in energy computations, but not recompute the coefficients each timestep

My understanding was that requiring these two new features be implemented did not qualify for "just as easily" as simply moving the dispersion correction to NonbondedForce, which would not require new features.

Or am I misunderstanding the suggestion?

@peastman
Member Author

My understanding was that requiring these two new features be implemented did not qualify for "just as easily" as simply moving the dispersion correction to NonbondedForce, which would not require new features.

What you're proposing is more complicated than just moving the dispersion correction to NonbondedForce. If you do that, it won't be correct because it will be computed as if the alchemical region didn't exist at all. The atoms in it will all have epsilon set to 0. You then proposed approximating them with a CustomExternalForce, requiring the addition of a new feature. But since that approximation just interpolates between two values, it could just as easily represent the entire correction. It makes no difference whether a correction is being computed by NonbondedForce, CustomNonbondedForce, or neither. The CustomExternalForce can include any or all of them.

On the other hand, it's also possible that approximation isn't needed. If the change in the coefficient is sufficiently small, we can just ignore it. In that case it could be computed either with NonbondedForce (less accurately, no new features required) or CustomNonbondedForce (more accurately, the new feature described at the top of this issue required).

Another possibility is that the change in the coefficient is small, but not quite small enough to completely ignore. In that case you can run the simulation holding it constant and just correct for it at the end.
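
A minimal sketch of that end-of-run correction, assuming the endpoint coefficients C0 and C1 have been precomputed and that the coefficient varies roughly linearly in lambda (the names and the interpolation are assumptions, not an existing API):

import numpy as np

def lrc_correction(lam, volumes, C0, C1, C_fixed):
    # The correction energy is C(lambda)/V, so if the run used a fixed coefficient C_fixed,
    # the energy to add back for each snapshot is (C(lambda) - C_fixed)/V.
    C_lam = (1.0 - lam) * C0 + lam * C1
    return (C_lam - C_fixed) / np.asarray(volumes)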

@jchodera
Member

On the other hand, it's also possible that approximation isn't needed. If the change in the coefficient is sufficiently small, we can just ignore it. In that case it could be computed either with NonbondedForce (less accurately, no new features required) or CustomNonbondedForce (more accurately, the new feature described at the top of this issue required).
Another possibility is that the change in the coefficient is small, but not quite small enough to completely ignore. In that case you can run the simulation holding it constant and just correct for it at the end.

We'll definitely collect representative data and see!

What you're proposing is more complicated than just moving the dispersion correction to NonbondedForce. If you do that, it won't be correct because it will be computed as if the alchemical region didn't exist at all. The atoms in it will all have epsilon set to 0. You then proposed approximating them with a CustomExternalForce, requiring the addition of a new feature. But since that approximation just interpolates between two values, it could just as easily represent the entire correction. It makes no difference whether a correction is being computed by NonbondedForce, CustomNonbondedForce, or neither. The CustomExternalForce can include any or all of them.

I'm still having a hard time following the points of disagreement here, but it is sounding like we concur the addition of volume to CustomExternalForce expression variables would be a useful feature to add because of the flexibility it affords in implementing things like dispersion corrections?

You're right that the entire dispersion correction could be moved to CustomExternalForce in principle, but there are not necessarily unambiguous implementations that account for the environment-environment contributions to the dispersion correction---it's not clear how this would be split up among atom contributions. Would environment atoms each compute their own contribution to just the environment-environment dispersion correction, leaving the alchemical atoms to contribute to the alchemical-environment dispersion interactions?

One other potentially poor idea: Since the time-consuming step in updating the long-range correction in CustomNonbondedForce every timestep is calling CustomNonbondedForceImpl::integrateInteraction to numerically integrate every unique interaction class, what if we instead allowed the user to optionally specify an analytical form of that integral as a lepton expression? As noted earlier, we can substitute a simple analytical expression that---at intermediate lambda---may not be exact, but whose values at lambda=0 and 1 are correct. This would still incur some evaluation overhead every timestep, but could be fast enough to no longer be the bottleneck since all numerical integration is avoided.

@peastman
Member Author

it's not clear how this would be split up among atom contributions

There's no need to split it up. Using CustomExternalForce is just a hack. You really just want a single term adding a single fixed energy. You could just as easily use CustomBondForce (with a single bond) or any other.
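
A minimal sketch of that hack, assuming the endpoint correction energies E0 and E1 have been precomputed (e.g. at a reference volume, so volume fluctuations are neglected; the names are placeholders):

import openmm as mm

lrc = mm.CustomBondForce('(1-lambda_sterics)*E0 + lambda_sterics*E1')
lrc.addGlobalParameter('lambda_sterics', 0.0)
lrc.addGlobalParameter('E0', E0)   # precomputed correction energy at lambda=0, kJ/mol (placeholder)
lrc.addGlobalParameter('E1', E1)   # precomputed correction energy at lambda=1, kJ/mol (placeholder)
lrc.addBond(0, 1, [])              # the bond itself is irrelevant; it just carries the energy term
system.addForce(lrc)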

it is sounding like we concur the addition of volume to CustomExternalForce expression variables would be a useful feature to add

We still need to find out whether it's needed.

@jchodera
Member

There's no need to split it up. Using CustomExternalForce is just a hack. You really just want a single term adding a single fixed energy. You could just as easily use CustomBondForce (with a single bond) or any other.

This would get very complicated, especially if there are multiple distinct alchemical regions. When we think of each atom as belonging to a single alchemical region or the environment, it's much easier to track contributions by atom.

What do you think of the alternative of specifying an algebraic expression for the integral in CustomNonbondedForce? This would allow us to get the exact results, potentially with no performance penalty:

One other potentially poor idea: Since the time-consuming step in updating the long-range correction in CustomNonbondedForce every timestep is calling CustomNonbondedForceImpl::integrateInteraction to numerically integrate every unique interaction class, what if we instead allowed the user to optionally specify an analytical form of that integral as a lepton expression? As noted earlier, we can substitute a simple analytical expression that---at intermediate lambda---may not be exact, but whose values at lambda=0 and 1 are correct. This would still incur some evaluation overhead every timestep, but could be fast enough to no longer be the bottleneck since all numerical integration is avoided.

@peastman
Member Author

When we think of each atom as belonging to a single alchemical region or the environment, it's much easier to track contributions by atom.

In that case, the hack I suggested isn't appropriate. More generally, the long range correction is not a per-atom energy. It's a per-pair energy.

What do you think of the alternative of specifying an algebraic expression for the integral in CustomNonbondedForce?

That would be very dangerous. It would be really easy for people to make mistakes and get the wrong result. It's also a really complicated integral, especially if you use a switching function. Check out the code for evaluating it for standard LJ, especially the comment at the top!

double NonbondedForceImpl::evalIntegral(double r, double rs, double rc, double sigma) {
    // Compute the indefinite integral of the LJ interaction multiplied by the switching function.
    // This is a large and somewhat horrifying expression, though it does grow on you if you look
    // at it long enough. Perhaps it could be simplified further, but I got tired of working on it.
    double A = 1/(rc-rs);
    double A2 = A*A;
    double A3 = A2*A;
    double sig2 = sigma*sigma;
    double sig6 = sig2*sig2*sig2;
    double rs2 = rs*rs;
    double rs3 = rs*rs2;
    double r2 = r*r;
    double r3 = r*r2;
    double r4 = r*r3;
    double r5 = r*r4;
    double r6 = r*r5;
    double r9 = r3*r6;
    return sig6*A3*((
        sig6*(
            + rs3*28*(6*rs2*A2 + 15*rs*A + 10)
            - r*rs2*945*(rs2*A2 + 2*rs*A + 1)
            + r2*rs*1080*(2*rs2*A2 + 3*rs*A + 1)
            - r3*420*(6*rs2*A2 + 6*rs*A + 1)
            + r4*756*(2*rs*A2 + A)
            - r5*378*A2)
        -r6*(
            + rs3*84*(6*rs2*A2 + 15*rs*A + 10)
            - r*rs2*3780*(rs2*A2 + 2*rs*A + 1)
            + r2*rs*7560*(2*rs2*A2 + 3*rs*A + 1))
        )/(252*r9)
        - log(r)*10*(6*rs2*A2 + 6*rs*A + 1)
        + r*15*(2*rs*A2 + A)
        - r2*3*A2
    );
}

@jchodera
Member

jchodera commented Oct 31, 2021

That would be very dangerous. It would be really easy for people to make mistakes and get the wrong result.

That's why the default is to use "safe but slow" numerical integration. Setting the function manually should only be used if you have an especially simple integral or, in our case, an excellent algebraic approximation. The error can also easily be checked by computing the energy with and without the algebraic expression.

It's also a really complicated integral, especially if you use a switching function. Check out the code for evaluating it for standard LJ, especially the comment at the top!

Yes, with the switching function, it is quite complicated.

For unswitched LJ, a reasonable simplification is to use only the dispersive (r^{-6}) term, where the integral is simply

Integrate[(4*pi*r^2) * (4*epsilon*(sigma/r)^6), {r, r_cutoff, Infinity}]
= (16*pi/3) * epsilon * sigma^3 * (sigma/r_cutoff)^3

which is not very scary at all. The combining rules would also be included in the full energy expression:

- (16*pi/3) * epsilon * sigma^3 * (sigma/r_cutoff)^3;
epsilon = sqrt(epsilon1*epsilon2);
sigma = 0.5*(sigma1+sigma2);
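
A quick symbolic check of the integral above (assuming sympy is available):

import sympy as sp

r, rc, sigma, epsilon = sp.symbols('r r_cutoff sigma epsilon', positive=True)
integrand = 4*sp.pi*r**2 * 4*epsilon*(sigma/r)**6   # magnitude of the dispersive term
result = sp.integrate(integrand, (r, rc, sp.oo))
expected = sp.Rational(16, 3)*sp.pi*epsilon*sigma**3*(sigma/rc)**3
print(sp.simplify(result - expected))   # prints 0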

In our particular case, our softcore form equals the LJ potential at lambda = 0 and 1, and we can simply consider the linear interpolation of these to incorporate the effects of the dispersion correction, using a scheme like this (where we assume the particle parameters are epsilon_old, epsilon_new, sigma_old, sigma_new):

- (16*pi/3) * epsilon * sigma^3 * (sigma/r_cutoff)^3;
epsilon = sqrt(epsilon1*epsilon2);
sigma = 0.5*(sigma1+sigma2);
epsilon1 = (1-lambda_sterics)*epsilon_old1 + lambda_sterics*epsilon_new1;
epsilon2 = (1-lambda_sterics)*epsilon_old2 + lambda_sterics*epsilon_new2;
sigma1 = (1-lambda_sterics)*sigma_old1 + lambda_sterics*sigma_new1;
sigma2 = (1-lambda_sterics)*sigma_old2 + lambda_sterics*sigma_new2;

Assuming each integration-by-quadrature would normally require 100 evaluations of the potential, specifying an algebraic approximation of the integral would give a speedup of ~ 100x.

Notably, even if we didn't drop the repulsive (r^{-12}) term, which is already >10,000x smaller than the attractive term for a 10A cutoff, the expression would not be much more complicated:

(16*pi/3) * epsilon * sigma^3 * ((1/3)*(sigma/r_cutoff)^9 - (sigma/r_cutoff)^3);

Sure, switching functions make things more complicated, but we can still do a great deal if we're not using switching functions.

@peastman
Member Author

peastman commented Nov 1, 2021

That seems like a really hard way of implementing it! If you're just going to use a linear interpolation, why don't you just precompute the coefficient for lambda=0 and lambda=1, then interpolate between them? No need to integrate anything by hand, and no need to ever recompute during the simulation.

But we're getting ahead of ourselves. We still don't know if any of this is even necessary. Let's get the numbers first to find out how much the coefficient changes.

@jchodera
Member

jchodera commented Nov 2, 2021

That seems like a really hard way of implementing it!

As I noted in the previous comment, it's not actually hard---it's just a bit of symbolic integration. It took me longer to write the comment than do the integration.

If you're just going to use a linear interpolation, why don't you just precompute the coefficient for lambda=0 and lambda=1, then interpolate between them? No need to integrate anything by hand, and no need to ever recompute during the simulation.

It's actually much harder for us to develop the code to precompute the complete coefficients we need using the same scheme that is already built into CustomNonbondedForce and implement this in a separate Custom*Force. It's even harder to make sure this pre-computation code will always stay in sync with any other changes we make to the CustomNonbondedForce parameters, exclusions, exceptions, or splitting of forces.

By contrast, adding a method to optionally specify a custom expression to be used in place of the numerical integral means we only have to keep the integral expression compatible with the energy expression. This allows us wide latitude for experimentation without worrying about keeping multiple pieces of code in sync with each other.

But we're getting ahead of ourselves. We still don't know if any of this is even necessary. Let's get the numbers first to find out how much the coefficient changes.

We're benchmarking the magnitude for a couple of illustrative use cases, but if we can implement a feature like I proposed that allows us to efficiently get exact or near-exact results for all cases, this is vastly preferred over approximations.

@jchodera
Member

jchodera commented Nov 2, 2021

To be concrete, I'm proposing we add a method

CustomNonbondedForce.useLongRangeCorrectionExpression(expression)

which would cause the accumulation of interactions to use a single Lepton expression evaluation instead of calling the numerical integration method.
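
A hypothetical sketch of the usage (the method does not exist today; 'force' is assumed to be the alchemical CustomNonbondedForce, and the expression is the unswitched dispersion-only integral from the earlier comment):

# Hypothetical API -- not part of OpenMM today.
force.useLongRangeCorrectionExpression(
    '-(16*pi/3) * epsilon * sigma^3 * (sigma/r_cutoff)^3;'
    ' epsilon = sqrt(epsilon1*epsilon2);'
    ' sigma = 0.5*(sigma1+sigma2);'
    ' epsilon1 = (1-lambda_sterics)*epsilon_old1 + lambda_sterics*epsilon_new1;'
    ' epsilon2 = (1-lambda_sterics)*epsilon_old2 + lambda_sterics*epsilon_new2;'
    ' sigma1 = (1-lambda_sterics)*sigma_old1 + lambda_sterics*sigma_new1;'
    ' sigma2 = (1-lambda_sterics)*sigma_old2 + lambda_sterics*sigma_new2')
force.setUseLongRangeCorrection(True)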

@peastman
Member Author

peastman commented Nov 3, 2021

You don't need to write any custom code at all for computing it! Just call getState(getEnergy=True) twice, once with the correction enabled and once with it disabled. Subtract, multiply by the volume, and you've got the coefficient.

@zhang-ivy
Contributor

zhang-ivy commented Nov 8, 2021

There's a much easier way to determine where the change is coming from. The correction energy is given by E_C = C/volume. You can compute C as E_C*volume, where E_C is just the change in energy when you turn on the correction. C depends only on lambda, nothing else, so you only need to compute it twice, once for lambda=0 and once for lambda=1. Finally, compute the average volume in the starting and ending states.

I've run this experiment with barnase:barstar (mutation: K27A) (cutoff = 1 nm) with the following code (with the openmmtools langevin integrator):

volumes_0 = []
energies_0_without_correction = []
energies_0_with_correction = []

system.getForce(7).setUseDispersionCorrection(False)
system.getForce(8).setUseLongRangeCorrection(False)

for _ in tqdm_notebook(range(1000)):
    integrator.step(250) 
    
    # Get box volume and energy without correction
    state = context.getState(getEnergy=True)
    volumes_0.append(state.getPeriodicBoxVolume().value_in_unit_system(unit.md_unit_system))
    energies_0_without_correction.append(state.getPotentialEnergy().value_in_unit_system(unit.md_unit_system))
    
    # Get energy with correction
    system.getForce(7).setUseDispersionCorrection(True)
    system.getForce(8).setUseLongRangeCorrection(True)
    state = context.getState(getEnergy=True)
    energies_0_with_correction.append(state.getPotentialEnergy().value_in_unit_system(unit.md_unit_system))
    
    # Turn correction off
    system.getForce(7).setUseDispersionCorrection(False)
    system.getForce(8).setUseLongRangeCorrection(False)

For lambda = 0 box volumes: [histogram screenshot]

Lambda = 0 coefficients: [histogram screenshot]

For lambda = 1 box volumes: [histogram screenshot]

Lambda = 1 coefficients: [histogram screenshot]

I also ran these experiments for the following tyk2 transformation (with cutoff = 1.0 nm): [atom mapping image]

Lambda = 0 box volumes: [histogram screenshot]

Lambda = 0 coefficients: [histogram screenshot]

Lambda = 1 box volumes: [histogram screenshot]

Lambda = 1 coefficients: [histogram screenshot]

@jchodera @peastman: Please let me know if I ran the experiment correctly and what your thoughts are. It seems to me that the changes in long range correction are dominated by the coefficient changing, not the box volume.
I didn't run these experiments with multiple cutoff values, but let me know if you still think that's necessary / whether I've done the above experiments correctly.

@peastman
Member Author

peastman commented Nov 8, 2021

Something's wrong with your calculation of the coefficient. It should depend only on lambda and be independent of time. The code should be something like this.

customNonbondedForce.setUseLongRangeCorrection(False)
customNonbondedForce.setForceGroup(1) # so we can query just this force
...
# Create the context and set initial conditions
...
e1 = context.getState(getEnergy=True, groups={1}).getPotentialEnergy()
customNonbondedForce.setUseLongRangeCorrection(True)
context.reinitialize(preserveState=True) # otherwise it won't see the change to the force
e2 = context.getState(getEnergy=True, groups={1}).getPotentialEnergy()
volume = context.getState().getPeriodicBoxVolume()
print('coefficient:', (e2-e1)/volume)

Run that code twice, once with lambda=0 and once with lambda=1.

@peastman
Member Author

peastman commented Nov 8, 2021

For computing statistics of the volume, could you omit the first few hundred snapshots? That way it isn't influenced by the early part where it's still equilibrating. For example,

np.mean(1/np.array(volumes_0[300:]))

@peastman
Member Author

peastman commented Nov 8, 2021

Sorry, that should have been

print('coefficient:', (e2-e1)*volume)

(multiplying by the volume instead of dividing by it).

@zhang-ivy
Contributor

zhang-ivy commented Nov 9, 2021

I've fixed the volume computation (to discard the first 300 snapshots) and the coefficient computation (so that it is only computed once at each endstate, using the code you supplied):

| System | Endstate | Mean 1/volume (nm**-3) | Std dev of 1/volume (nm**-3) | Coefficient (nm**3 kJ/mol) |
| --- | --- | --- | --- | --- |
| RBD:ACE2 K27A | 0 | 0.003003176444112776 | 6.920737491158719e-06 | 636266136.5993224 |
| RBD:ACE2 K27A | 1 | 0.003004076862678824 | 7.172201668744358e-06 | 636138745.5262223 |
| Tyk2 complex | 0 | 0.0020253639646206404 | 3.7006908511059544e-06 | 1391444606.7349782 |
| Tyk2 complex | 1 | 0.002024878962158837 | 4.267036202637575e-06 | 1391370281.631637 |

@peastman
Member Author

peastman commented Nov 9, 2021

Can you give more digits of the volume? You only included one significant digit, which isn't enough to show the difference. The standard deviation is less than 1e-5, and the standard error is even smaller, so the means should be accurate to a lot of digits.

@peastman
Member Author

We're working on solving @zhang-ivy's issue with system dependence for the one-class test case, but even then, this doesn't solve the fact that this causes a huge slow-down even on your environment.

No it doesn't. It causes no measurable slowdown at all in my environment.

Since you've moved ahead with the release while I was traveling even after I asked for a few more days for us to sort this out

You asked for "a couple of more days". That was six days ago.

@zhang-ivy
Contributor

zhang-ivy commented Dec 13, 2021

No it doesn't. It causes no measurable slowdown at all in my environment.

@peastman : I think John was talking about the slowdown that's still present in your results here when we don't use one atom class. We can work on somewhat reducing the complexity of the custom expression and the number of atom classes, but these free energy calculation factories are complicated, so it is not realistic to expect an implementation that uses a super simple expression / one atom class. Therefore, I think there's room for discussion of how to optimize the long range correction computation such that the slowdown (after simplifying the expression and reducing atom classes as much as we can) is roughly ~1.1x (a goal I think we had previously agreed upon).

@zhang-ivy
Contributor

zhang-ivy commented Dec 14, 2021

@jchodera and I debugged the long range correction slowdown that's present with 1 atom class.

Here is the script and system/state xmls we used (it has minor modifications compared to the one I sent previously):
20211214_experiments_with_john 2.zip

with one atom class:

(long-range-7.7-nightly) bash-4.2$ unset OPENMM_CPU_THREADS
(long-range-7.7-nightly) bash-4.2$ python test_long_range.py # this experiment uses all threads, including those used by other jobs. since the node is being used for other jobs, there is thread contention
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
mean:  0.007426023483276367 # long range correction on
mean:  0.001016855239868164 # long range correction off
(long-range-7.7-nightly) bash-4.2$ OPENMM_CPU_THREADS=4 python test_long_range.py
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
mean:  0.001648545265197754
mean:  0.000995039939880371
(long-range-7.7-nightly) bash-4.2$ OPENMM_CPU_THREADS=1 python test_long_range.py
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
mean:  0.0016747713088989258
mean:  0.0009831905364990235

Takeaways:

  • The 7x slowdown in the first experiment seems to come from thread contention. When we use too many CPU threads, some of these threads are being used by other jobs and there is slowdown because we have to wait for the slowest thread to finish.
  • The ~2x slowdown in the next 2 experiments likely comes from having a slow CPU (with a fast GPU), whereas @peastman likely has a fast CPU. This is still slower than we'd like -- ideally we would only see a 1.1x slowdown.

with many atom classes:

(long-range-7.7-nightly) bash-4.2$ OPENMM_CPU_THREADS=1 python test_long_range.py
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
mean:  0.0393261194229126
mean:  0.001032257080078125
(long-range-7.7-nightly) bash-4.2$ OPENMM_CPU_THREADS=2 python test_long_range.py
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
mean:  0.03467965126037598
mean:  0.0009989261627197266
(long-range-7.7-nightly) bash-4.2$ OPENMM_CPU_THREADS=4 python test_long_range.py
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
mean:  0.019467639923095702
mean:  0.001039409637451172
(long-range-7.7-nightly) bash-4.2$ OPENMM_CPU_THREADS=8 python test_long_range.py
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
mean:  0.014207744598388672
mean:  0.0010085582733154296

Takeaways:

  • For many atom classes, the slowdown varies from 40x (1 thread) to 14x (8 threads).

Let's continue the discussion of how to speed this up on Thursday. Here are the potential solutions we've talked about so far:

@peastman
Member Author

That still doesn't make sense. With only one atom class, there should be no slowdown whatsoever. Even on a slow CPU, updating the coefficient should take a tiny fraction of the time required for the GPU to compute energy.

When we use too many CPU threads, some of these threads are being used by other jobs and there is slowdown because we have to wait for the slowest thread to finish.

I strongly suspect this is the real problem. Even with only one work thread, you still have to wait for the thread to start up. If there are idle cores it will start immediately, but if it has to preempt another process, that could take several ms before it starts running. You need to be running all your benchmarks when there is nothing else running on the computer. If it has to compete with other processes, the numbers don't really tell us anything.

@zhang-ivy
Contributor

You need to be running all your benchmarks when there is nothing else running on the computer. If it has to compete with other processes, the numbers don't really tell us anything.

My production runs will all be on our hpc (or perhaps Folding@home), where the nodes are being shared. Therefore, it's actually more useful to run the benchmarks in the setting where there is competition from other processes, because this is the setting that we are going to be doing the science in. I would assume many other openmm users use node-sharing clusters to run their simulations as well.

@peastman
Member Author

Most clusters let you reserve a number of cores. If you're having to share cores with other processes, a lot of things will be much slower. That's not because OpenMM is slow. It's because another process is running instead.

@jchodera
Member

Most clusters let you reserve a number of cores. If you're having to share cores with other processes, a lot of things will be much slower. That's not because OpenMM is slow. It's because another process is running instead.

@peastman: LSF manages the shared cluster and provides control over requests and thread affinities.
We can request, for example, an entire socket on one of our Intel Xeon Gold 6230 CPU @ 2.10GHz nodes with task affinity requests:

bsub -W 04:00 -m "ly-gpu lx-gpu lw-gpu ld-gpu lv-gpu lu-gpu lg-gpu lt-gpu" -q gpuqueue -n 20 -gpu "num=1:j_exclusive=yes:mode=shared" -R "rusage[mem=14/host] span[ptile=20] affinity[core(1):distribute=pack(socket=1)]" -env "all" -Is /bin/bash

Here, we are assigned 20 task slots, each of which gets a core, all of which have to be packed onto the same socket (which has 20 cores).

I can also use os.sched_setaffinity to set the affinity to the assigned socket's processor cores.
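
For reference, a minimal sketch of that (the core IDs are placeholders for whatever socket LSF assigns):

import os

assigned_cores = range(0, 20)             # placeholder: the cores of the allocated socket
os.sched_setaffinity(0, assigned_cores)   # pid 0 means the current process
print(os.sched_getaffinity(0))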

I have verified this difference in the single-class case is reproducible in my environment on this machine (which has an RTX 2080 Ti):

$ OPENMM_CPU_THREADS=1 python test_long_range.py
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
mean:  0.001719212532043457
mean:  0.0010221481323242187
(openmm-nightly) [chodera@lilac:20211214_experiments_with_john]$ OPENMM_CPU_THREADS=2 python test_long_range.py
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
mean:  0.0016939163208007813
mean:  0.0010236263275146484

We have the whole socket dedicated to this job. I don't think this is because another thread is contending for processor resources. It might be some other resource, but it's not going to be processor resources.

@peastman
Member Author

At the top of CustomNonbondedForceImpl.cpp add the line

#include "openmm/internal/timer.h"

At the top of calcLongRangeCorrection() add

double start = getCurrentTime();

And at the end add

double end = getCurrentTime();
printf("%g\n", end-start);

How long does it report the method is taking? For me, all times after the first iteration are around 0.0002, almost an order of magnitude less than the step times you quoted above. This is on a Core i7-7740X, a four-year-old midrange processor, so nothing particularly fast.

@peastman
Member Author

I suggest we postpone the meeting today, since we're still diagnosing the issue above.

@jchodera
Member

I'll run this experiment on a few systems today.

@jchodera
Member

I'm trying to follow the build directions here on our lilac cluster, but am running into

[ 27%] Building CXX object platforms/cuda/sharedTarget/CMakeFiles/OpenMMCUDA.dir/__/src/CudaArray.cpp.o
In file included from /home/chodera/github/openmm/openmm/platforms/cuda/../common/include/openmm/common/ComputeProgram.h:30:0,
                 from /home/chodera/github/openmm/openmm/platforms/cuda/../common/include/openmm/common/ComputeContext.h:38,
                 from /home/chodera/github/openmm/openmm/platforms/cuda/../common/include/openmm/common/ExpressionUtilities.h:30,
                 from /home/chodera/github/openmm/openmm/platforms/cuda/./include/CudaExpressionUtilities.h:30,
                 from /home/chodera/github/openmm/openmm/platforms/cuda/./include/CudaContext.h:45,
                 from /home/chodera/github/openmm/openmm/platforms/cuda/src/CudaArray.cpp:28:
/home/chodera/github/openmm/openmm/platforms/cuda/../common/include/openmm/common/ComputeKernel.h:63:29: error: ‘is_trivially_copyable’ is not a member of ‘std’
     typename std::enable_if<std::is_trivially_copyable<T>::value, void>::type addArg(const T& value) {
                             ^
/home/chodera/github/openmm/openmm/platforms/cuda/../common/include/openmm/common/ComputeKernel.h:63:29: error: ‘is_trivially_copyable’ is not a member of ‘std’
/home/chodera/github/openmm/openmm/platforms/cuda/../common/include/openmm/common/ComputeKernel.h:63:57: error: template argument 1 is invalid
     typename std::enable_if<std::is_trivially_copyable<T>::value, void>::type addArg(const T& value) {
                                                         ^
/home/chodera/github/openmm/openmm/platforms/cuda/../common/include/openmm/common/ComputeKernel.h:63:58: error: expected identifier before ‘::’ token
     typename std::enable_if<std::is_trivially_copyable<T>::value, void>::type addArg(const T& value) {
                                                          ^
/home/chodera/github/openmm/openmm/platforms/cuda/../common/include/openmm/common/ComputeKernel.h:63:65: error: invalid type in declaration before ‘,’ token
     typename std::enable_if<std::is_trivially_copyable<T>::value, void>::type addArg(const T& value) {
                                                                 ^
/home/chodera/github/openmm/openmm/platforms/cuda/../common/include/openmm/common/ComputeKernel.h:63:65: error: invalid use of ‘::’
/home/chodera/github/openmm/openmm/platforms/cuda/../common/include/openmm/common/ComputeKernel.h:88:29: error: ‘is_trivially_copyable’ is not a member of ‘std’
     typename std::enable_if<std::is_trivially_copyable<T>::value, void>::type setArg(int index, const T& value) {
                             ^
/home/chodera/github/openmm/openmm/platforms/cuda/../common/include/openmm/common/ComputeKernel.h:88:29: error: ‘is_trivially_copyable’ is not a member of ‘std’
/home/chodera/github/openmm/openmm/platforms/cuda/../common/include/openmm/common/ComputeKernel.h:88:57: error: template argument 1 is invalid
     typename std::enable_if<std::is_trivially_copyable<T>::value, void>::type setArg(int index, const T& value) {
                                                         ^
/home/chodera/github/openmm/openmm/platforms/cuda/../common/include/openmm/common/ComputeKernel.h:88:58: error: expected identifier before ‘::’ token
     typename std::enable_if<std::is_trivially_copyable<T>::value, void>::type setArg(int index, const T& value) {
                                                          ^
/home/chodera/github/openmm/openmm/platforms/cuda/../common/include/openmm/common/ComputeKernel.h:88:65: error: invalid type in declaration before ‘,’ token
     typename std::enable_if<std::is_trivially_copyable<T>::value, void>::type setArg(int index, const T& value) {
                                                                 ^
/home/chodera/github/openmm/openmm/platforms/cuda/../common/include/openmm/common/ComputeKernel.h:88:65: error: invalid use of ‘::’
make[2]: *** [platforms/cuda/sharedTarget/CMakeFiles/OpenMMCUDA.dir/build.make:91: platforms/cuda/sharedTarget/CMakeFiles/OpenMMCUDA.dir/__/src/CudaArray.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:4330: platforms/cuda/sharedTarget/CMakeFiles/OpenMMCUDA.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

Presumably, OpenMM has been configured with the wrong C++ or CUDA compiler somehow, but we don't actually say anything helpful about which compilers we support in the build instructions, or even how to tell if your compiler is supported.

@peastman
Member Author

That means you're using a very old compiler that doesn't fully support C++11. See #2545.

@jchodera
Member

Indeed!

Let's update the installation instructions with more information on (1) what versions we don't support, and (2) what it will look like if you run into this issue.

I am tracking lessons learned for updating documentation here: #3378 (comment)

@jchodera
Member

OK, I've finally managed to build from scratch, install in my conda environment, and recapitulate the data we previously observed on a single socket:

0.000259876
0.000191927
0.000767946
0.000771046
0.000782967
0.000781059
0.000786066
0.000748873
0.000751972
0.000771046
0.000767946
mean:  0.00177309513092041
mean:  0.001011204719543457

The first numbers show the timing for each call to calcLongRangeCorrection, and the two means report the step time with and without the long-range correction. The calls generally take ~0.0008 s, which corresponds to the difference between the two means. In other words, it looks like this computation is not happening while asynchronously waiting on the GPU---it's adding on top of the GPU-only step time.

@jchodera
Member

For me, all times after the first iteration are around 0.0002, almost an order of magnitude less than the step times you quoted above. This is on a Core i7-7740X, a four-year-old midrange processor, so nothing particularly fast.

I'm unclear if this is relevant, but your machine (Core i7-7740X) has a base clock of 4.3 GHz; our cluster is using an Intel Xeon Gold 6230 with base clock at 2.10 GHz.

@jchodera
Member

Here are the many-class timings---the real-life scenario we need to run without slowdown. We requested a whole 10-core (20-thread) socket, the node is relatively unloaded, and I'm pinning the thread affinities to the socket we are allocated.

(openmm-dev) [chodera@lilac:20211214_experiments_with_john]$ export OPENMM_CPU_THREADS=1 ; python test_long_range.py
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
calcLongRangeCorrection: 0.037389 s elapsed with 1 threads
calcLongRangeCorrection: 0.0216608 s elapsed with 1 threads
calcLongRangeCorrection: 0.046005 s elapsed with 1 threads
calcLongRangeCorrection: 0.026686 s elapsed with 1 threads
calcLongRangeCorrection: 0.0444121 s elapsed with 1 threads
calcLongRangeCorrection: 0.044178 s elapsed with 1 threads
calcLongRangeCorrection: 0.047236 s elapsed with 1 threads
calcLongRangeCorrection: 0.0253401 s elapsed with 1 threads
calcLongRangeCorrection: 0.0282278 s elapsed with 1 threads
calcLongRangeCorrection: 0.038533 s elapsed with 1 threads
calcLongRangeCorrection: 0.0453181 s elapsed with 1 threads
mean:  0.0376945972442627
mean:  0.0010146617889404297
(openmm-dev) [chodera@lilac:20211214_experiments_with_john]$ export OPENMM_CPU_THREADS=2 ; python test_long_range.py
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
calcLongRangeCorrection: 0.0256419 s elapsed with 2 threads
calcLongRangeCorrection: 0.0142262 s elapsed with 2 threads
calcLongRangeCorrection: 0.030798 s elapsed with 2 threads
calcLongRangeCorrection: 0.0304649 s elapsed with 2 threads
calcLongRangeCorrection: 0.0274291 s elapsed with 2 threads
calcLongRangeCorrection: 0.026356 s elapsed with 2 threads
calcLongRangeCorrection: 0.0220499 s elapsed with 2 threads
calcLongRangeCorrection: 0.0333619 s elapsed with 2 threads
calcLongRangeCorrection: 0.0289369 s elapsed with 2 threads
calcLongRangeCorrection: 0.0269229 s elapsed with 2 threads
calcLongRangeCorrection: 0.021894 s elapsed with 2 threads
mean:  0.027163147926330566
mean:  0.0010322809219360351
(openmm-dev) [chodera@lilac:20211214_experiments_with_john]$ export OPENMM_CPU_THREADS=4 ; python test_long_range.py
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
calcLongRangeCorrection: 0.015636 s elapsed with 4 threads
calcLongRangeCorrection: 0.0116291 s elapsed with 4 threads
calcLongRangeCorrection: 0.0199699 s elapsed with 4 threads
calcLongRangeCorrection: 0.0201321 s elapsed with 4 threads
calcLongRangeCorrection: 0.0201581 s elapsed with 4 threads
calcLongRangeCorrection: 0.020077 s elapsed with 4 threads
calcLongRangeCorrection: 0.0202491 s elapsed with 4 threads
calcLongRangeCorrection: 0.020298 s elapsed with 4 threads
calcLongRangeCorrection: 0.0199671 s elapsed with 4 threads
calcLongRangeCorrection: 0.020221 s elapsed with 4 threads
calcLongRangeCorrection: 0.0205522 s elapsed with 4 threads
mean:  0.020234084129333495
mean:  0.001031637191772461
(openmm-dev) [chodera@lilac:20211214_experiments_with_john]$ export OPENMM_CPU_THREADS=8 ; python test_long_range.py
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
calcLongRangeCorrection: 0.0097959 s elapsed with 8 threads
calcLongRangeCorrection: 0.00831795 s elapsed with 8 threads
calcLongRangeCorrection: 0.012059 s elapsed with 8 threads
calcLongRangeCorrection: 0.0131459 s elapsed with 8 threads
calcLongRangeCorrection: 0.0130041 s elapsed with 8 threads
calcLongRangeCorrection: 0.0127192 s elapsed with 8 threads
calcLongRangeCorrection: 0.0120091 s elapsed with 8 threads
calcLongRangeCorrection: 0.013195 s elapsed with 8 threads
calcLongRangeCorrection: 0.0131209 s elapsed with 8 threads
calcLongRangeCorrection: 0.011822 s elapsed with 8 threads
calcLongRangeCorrection: 0.0127001 s elapsed with 8 threads
mean:  0.01311037540435791
mean:  0.0010306119918823242

As you can see, even with 8 threads, we're still 13x slower than without the long-range correction, and the entire time is being eaten by calcLongRangeCorrection being 13x slower than the GPU execution time.

@peastman
Member Author

it looks like this computation is not happening while asynchronously waiting on the GPU

Interesting. Here is what is supposed to happen. In CommonCalcCustomNonbondedForceKernel::execute(), if it sees the coefficient needs to be recalculated, it creates a LongRangeTask object and schedules it to be executed on a separate work thread.

cc.getWorkThread().addTask(new LongRangeTask(cc, context.getOwner(), longRangeCorrectionData, longRangeCoefficient, longRangeCoefficientDerivs, forceCopy));

The task executes computeLongRangeCorrection().

void execute() {
    CustomNonbondedForceImpl::calcLongRangeCorrection(*force, data, context, longRangeCoefficient, longRangeCoefficientDerivs, cc.getThreadPool());
}

Finally, the CommonCalcCustomNonbondedForceKernel has registered a post-computation to get executed after all other work.

double computeForceAndEnergy(bool includeForces, bool includeEnergy, int groups) {
    if ((groups&(1<<force->getForceGroup())) == 0)
        return 0;
    cc.getWorkThread().flush();
    Vec3 a, b, c;
    cc.getPeriodicBoxVectors(a, b, c);
    double volume = a[0]*b[1]*c[2];
    map<string, double>& derivs = cc.getEnergyParamDerivWorkspace();
    for (int i = 0; i < longRangeCoefficientDerivs.size(); i++)
        derivs[force->getEnergyParameterDerivativeName(i)] += longRangeCoefficientDerivs[i]/volume;
    return longRangeCoefficient/volume;
}

It calls cc.getWorkThread().flush() to wait until the task has finished and computes the energy based on the result. So the question is why this isn't working correctly for you (but it does for me). Is something else forcing synchronization between the threads sooner?

@jchodera
Member

Does this use the same ThreadPool for the Platform? If so, doesn't using OPENMM_CPU_THREADS initialize this thread pool to have a single thread, requiring these to run sequentially? Or do you always have enough threads to allow the GPU thread to sleep while the CPU thread does work even if OPENMM_CPU_THREADS=1?

@jchodera
Member

Also, if you have any suggestions for how I can instrument the threads or computations to track what is starting/stopping/sleeping when, I can do that!

@peastman
Member Author

Does this use the same ThreadPool for the Platform?

No, each context has a single worker thread that's independent of any thread pool. See the ComputeContext::WorkThread class. It's intended for uses like this, where you don't specifically want to parallelize a CPU calculation (if you want that, you'll use the ThreadPool), but you don't want it to block the main thread from launching kernels.

Also, if you have any suggestions for how I can instrument the threads or computations to track what is starting/stopping/sleeping when, I can do that!

Try printing out the value of getCurrentTime() at the following points:

  • At the very start of CudaCalcForcesAndEnergyKernel::beginComputation() (found in CudaKernels.cpp)
  • In CommonCalcCustomNonbondedForceKernel::execute() immediately before the line that adds the LongRangeTask
  • Again immediately after that line
  • At the start and end of LongRangeTask::execute()
  • Immediately before and after the call to cc.getWorkThread().flush() in LongRangePostComputation::computeForceAndEnergy()
  • At the very end of CudaCalcForcesAndEnergyKernel::finishComputation()

This should give us a picture of where the time is being spent and where delays are happening. If everything is working as intended, very little time should pass between the start of beginComputation() and the call to flush(). It should block for the amount of time required to compute the coefficient. From then until the end of finishComputation() will take however much additional time is required to finish computing the energy. How is it different from that for you?

@jchodera
Member

jchodera commented Dec 17, 2021

Thanks! Here's the output:

Single-class test case
(openmm-dev) [chodera@lilac:20211214_experiments_with_john]$ python test_long_range.py
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
       1639761143.964407 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761144.316511 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761144.316556 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761144.316692 : LongRangeTask::execute() start
calcLongRangeCorrection: 0.00114202 s elapsed with 1 threads
       1639761144.317948 : LongRangeTask::execute() end
       1639761144.487151 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
       1639761144.487179 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761144.487250 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761144.487260 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761144.487645 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
       1639761144.487655 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761144.487979 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761144.488197 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761144.488326 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761144.488342 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761144.488357 : LongRangeTask::execute() start
       1639761144.488526 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.000869989 s elapsed with 1 threads
       1639761144.489315 : LongRangeTask::execute() end
       1639761144.489354 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761144.489413 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761144.589805 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761144.590180 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761144.590228 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761144.590294 : LongRangeTask::execute() start
       1639761144.590350 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.000874996 s elapsed with 1 threads
       1639761144.591400 : LongRangeTask::execute() end
       1639761144.591511 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761144.591652 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761144.692382 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761144.692772 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761144.692820 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761144.692841 : LongRangeTask::execute() start
       1639761144.693138 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.000937939 s elapsed with 1 threads
       1639761144.693941 : LongRangeTask::execute() end
       1639761144.694049 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761144.694227 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761144.795002 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761144.795362 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761144.795409 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761144.795478 : LongRangeTask::execute() start
       1639761144.795566 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.000874996 s elapsed with 1 threads
       1639761144.796566 : LongRangeTask::execute() end
       1639761144.796674 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761144.796808 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761144.897524 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761144.897903 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761144.897951 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761144.898018 : LongRangeTask::execute() start
       1639761144.898075 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.000950813 s elapsed with 1 threads
       1639761144.899153 : LongRangeTask::execute() end
       1639761144.899262 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761144.899395 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761145.000155 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761145.000543 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761145.000592 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761145.000659 : LongRangeTask::execute() start
       1639761145.000712 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.000895977 s elapsed with 1 threads
       1639761145.001713 : LongRangeTask::execute() end
       1639761145.001825 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761145.001964 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761145.102715 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761145.103073 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761145.103120 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761145.103186 : LongRangeTask::execute() start
       1639761145.103240 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.000916004 s elapsed with 1 threads
       1639761145.104321 : LongRangeTask::execute() end
       1639761145.104484 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761145.104621 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761145.205378 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761145.205757 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761145.205805 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761145.205871 : LongRangeTask::execute() start
       1639761145.206061 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.000900984 s elapsed with 1 threads
       1639761145.206885 : LongRangeTask::execute() end
       1639761145.206991 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761145.207124 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761145.307883 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761145.308253 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761145.308301 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761145.308369 : LongRangeTask::execute() start
       1639761145.308424 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.000874043 s elapsed with 1 threads
       1639761145.309357 : LongRangeTask::execute() end
       1639761145.309482 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761145.309618 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761145.410373 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761145.410750 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761145.410799 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761145.410867 : LongRangeTask::execute() start
       1639761145.411052 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.000905037 s elapsed with 1 threads
       1639761145.411884 : LongRangeTask::execute() end
       1639761145.411990 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761145.412129 : CudaCalcForcesAndEnergyKernel::finishComputation completed
mean:  0.002057528495788574
       1639761147.328006 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761147.330366 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761147.330388 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761147.331099 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761147.331251 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761147.331760 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761147.432031 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761147.432755 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761147.533184 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761147.533884 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761147.634310 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761147.635038 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761147.735569 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761147.736370 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761147.837020 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761147.837814 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761147.938496 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761147.939212 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761148.039995 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761148.040804 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761148.141571 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761148.142372 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761148.243140 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761148.243961 : CudaCalcForcesAndEnergyKernel::finishComputation completed
mean:  0.0010401248931884766
Multiple-class test case
(openmm-dev) [chodera@lilac:20211214_experiments_with_john]$ python test_long_range.py
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
       1639761248.361807 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761248.362779 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761248.362803 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761248.362896 : LongRangeTask::execute() start
       1639761248.364175 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.039037 s elapsed with 1 threads
       1639761248.402066 : LongRangeTask::execute() end
       1639761248.402183 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761248.402313 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761248.402337 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761248.402701 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
       1639761248.402728 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761248.403044 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761248.403337 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761248.403614 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761248.403645 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761248.403657 : LongRangeTask::execute() start
       1639761248.403859 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.044256 s elapsed with 1 threads
       1639761248.448049 : LongRangeTask::execute() end
       1639761248.448157 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761248.448299 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761248.549093 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761248.549494 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761248.549543 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761248.549609 : LongRangeTask::execute() start
       1639761248.549807 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.042762 s elapsed with 1 threads
       1639761248.592524 : LongRangeTask::execute() end
       1639761248.592634 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761248.592778 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761248.693531 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761248.693898 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761248.693945 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761248.694011 : LongRangeTask::execute() start
       1639761248.694206 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.04406 s elapsed with 1 threads
       1639761248.738230 : LongRangeTask::execute() end
       1639761248.738337 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761248.738485 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761248.839245 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761248.839654 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761248.839702 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761248.839768 : LongRangeTask::execute() start
       1639761248.839962 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.0275352 s elapsed with 1 threads
       1639761248.867423 : LongRangeTask::execute() end
       1639761248.867566 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761248.867703 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761248.968496 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761248.968860 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761248.968907 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761248.968974 : LongRangeTask::execute() start
       1639761248.969153 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.025527 s elapsed with 1 threads
       1639761248.994663 : LongRangeTask::execute() end
       1639761248.994772 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761248.994911 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761249.095669 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761249.096045 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761249.096093 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761249.096159 : LongRangeTask::execute() start
       1639761249.096353 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.0421522 s elapsed with 1 threads
       1639761249.138431 : LongRangeTask::execute() end
       1639761249.138614 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761249.138755 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761249.239535 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761249.239902 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761249.239950 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761249.240017 : LongRangeTask::execute() start
       1639761249.240214 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.0249038 s elapsed with 1 threads
       1639761249.265042 : LongRangeTask::execute() end
       1639761249.265151 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761249.265287 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761249.366076 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761249.366447 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761249.366520 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761249.366587 : LongRangeTask::execute() start
       1639761249.366793 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.0273151 s elapsed with 1 threads
       1639761249.394062 : LongRangeTask::execute() end
       1639761249.394173 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761249.394312 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761249.495092 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761249.495484 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761249.495533 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761249.495599 : LongRangeTask::execute() start
       1639761249.495796 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.0275979 s elapsed with 1 threads
       1639761249.523355 : LongRangeTask::execute() end
       1639761249.523484 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761249.523623 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761249.624379 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761249.624771 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
       1639761249.624820 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
       1639761249.624886 : LongRangeTask::execute() start
       1639761249.625085 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.0243421 s elapsed with 1 threads
       1639761249.649350 : LongRangeTask::execute() end
       1639761249.649481 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
       1639761249.649619 : CudaCalcForcesAndEnergyKernel::finishComputation completed
mean:  0.034261345863342285
       1639761251.513418 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761251.515748 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761251.515767 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761251.516429 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761251.516580 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761251.517080 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761251.617660 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761251.618491 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761251.719183 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761251.719962 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761251.820714 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761251.821527 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761251.922281 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761251.923113 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761252.023879 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761252.024732 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761252.125514 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761252.126322 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761252.227096 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761252.227797 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761252.328562 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761252.329380 : CudaCalcForcesAndEnergyKernel::finishComputation completed
       1639761252.430144 : CudaCalcForcesAndEnergyKernel::beginComputation()
       1639761252.430985 : CudaCalcForcesAndEnergyKernel::finishComputation completed
mean:  0.0010822534561157227

@peastman
Member Author

Thanks! That's very helpful. Here's the output from a single iteration.

1639761144.692382 : CudaCalcForcesAndEnergyKernel::beginComputation()
1639761144.692772 : CommonCalcCustomNonbondedForceKernel::execute(): adding LongRangeTask
1639761144.692820 : CommonCalcCustomNonbondedForceKernel::execute(): added LongRangeTask
1639761144.692841 : LongRangeTask::execute() start
1639761144.693138 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy calling cc.getWorkThread().flush();
calcLongRangeCorrection: 0.000937939 s elapsed with 1 threads
1639761144.693941 : LongRangeTask::execute() end
1639761144.694049 : CommonCalcCustomNonbondedForceKernel::computeForceAndEnergy called cc.getWorkThread().flush();
1639761144.694227 : CudaCalcForcesAndEnergyKernel::finishComputation completed

We can see the following time intervals:

  • Start of the energy evaluation until it registers the LongRangeTask: 0.4 ms
  • Register the task until it starts executing on the worker thread: 0.06 ms
  • Total time from start of evaluation until calling flush() (indicating all other CPU work has finished): 0.75 ms
  • Time that flush() blocks for: 0.9 ms
  • End of flush() until the energy calculation is complete: 0.18 ms
  • Total time for the energy evaluation: 1.8 ms

Based on that, we can draw these conclusions:

  • The CPU calculation is getting run in parallel with the GPU. flush() blocks for a significant amount of time, and all GPU kernels have already been launched at this point.
  • Even without the long range correction, this system is dominated by CPU overhead. Just launching kernels and doing other setup creates a 0.75 ms delay from when the evaluation starts until it has finished all other work and is just waiting for the long range correction calculation. That's out of the 1 ms total time for the whole evaluation you reported without the long range correction. There just isn't enough GPU work for the calculation to overlap with.

This is a very small system (14,550 atoms). The above should be quite different on a larger system where the step time is dominated by GPU rather than CPU.

In principle we could start the long range correction calculation earlier, which might let it overlap more. But that's a dangerous thing to do. We don't want to risk it competing with the main thread that's launching kernels.
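For reference, the interval breakdown above follows directly from the raw timestamps; a small script like the following (with the timestamps copied from the quoted iteration) reproduces it:

```python
# Reproduce the interval breakdown from the timestamps quoted above.
t = {
    "begin":      1639761144.692382,  # beginComputation()
    "add_task":   1639761144.692772,  # adding LongRangeTask
    "task_start": 1639761144.692841,  # LongRangeTask::execute() start
    "call_flush": 1639761144.693138,  # calling cc.getWorkThread().flush()
    "flush_done": 1639761144.694049,  # flush() returned
    "finish":     1639761144.694227,  # finishComputation completed
}

def ms(a, b):
    return (t[b] - t[a]) * 1000.0

print(f"start -> register task : {ms('begin', 'add_task'):.2f} ms")
print(f"register -> task start : {ms('add_task', 'task_start'):.2f} ms")
print(f"start -> call flush()  : {ms('begin', 'call_flush'):.2f} ms")
print(f"flush() blocked for    : {ms('call_flush', 'flush_done'):.2f} ms")
print(f"flush() end -> finish  : {ms('flush_done', 'finish'):.2f} ms")
print(f"total                  : {ms('begin', 'finish'):.2f} ms")
```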

@jchodera
Member

> Thanks! That's very helpful. Here's the output from a single iteration.

Is that for the multi-class case? I thought that one consumed an order of magnitude more time.

@peastman
Member Author

Single class.

@jchodera
Member

jchodera commented Dec 17, 2021 via email

@peastman
Member Author

Right, it was just for diagnostic purposes, to understand what's going on. In #3301 (comment) I compared the "vanilla" and "rest" versions of the system, and one of the differences I found was that the latter has many more atom classes (49 vs. 28). Since the number of integrals scales as the square of the number of classes, that makes it a lot more expensive (49² = 2401 pair integrals versus 28² = 784, roughly three times as many). I raised the question of whether all those classes were really needed. If the number could be reduced, it would produce a big speedup. I don't think the answer to that question has been determined yet.

@zhang-ivy
Contributor

zhang-ivy commented Dec 17, 2021

@peastman : John included timing data for both the single-class and multi-class cases here -- I think he was wondering whether the conclusions you drew for the single-class case also hold for the multi-class case. From a cursory glance, I don't think they do, but I'd be interested to see what your conclusions are.

> I raised the question of whether all those classes were really needed. If the number could be reduced, it would produce a big speedup. I don't think the answer to that question has been determined yet.

I haven't had a chance to dive deep into this yet, but John and I discussed it, and we suspect we won't be able to reduce the number of atom classes by much -- there will be many atom classes because they must distinguish the underlying atom-type parameters, how each atom should be scaled by REST, and how it should be scaled alchemically.

@peastman
Member Author

> I think he was wondering whether the conclusions you drew for the single-class case also hold for the multi-class case.

That example is unrealistic in two ways: first in having only one class, and second in having a tiny number of atoms. Really we want to profile a realistic case, with the actual number of classes you'll have (after doing whatever you can to reduce the number) and with a number of atoms that is typical of the real systems you'll be simulating. That will tell us where we currently are on performance, and how much cheaper the coefficient calculation needs to become.
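A rough sketch of the kind of measurement that would tell us that is below. It assumes you already have a realistic alchemical System and its positions, and that your CustomNonbondedForce defines a global parameter (the name "lambda_sterics" from this thread is used as a placeholder); none of these are defined here, so it is a template rather than a complete script.

```python
# Rough profiling sketch (the Context you pass in must be built from your own
# realistic alchemical system; "lambda_sterics" is a placeholder name).
import time
import openmm as mm  # used in the commented usage example below

def average_energy_time(context, n=20):
    """Average wall-clock time of n energy evaluations with fixed parameters."""
    context.getState(getEnergy=True)              # warm-up evaluation
    start = time.time()
    for _ in range(n):
        context.getState(getEnergy=True)
    return (time.time() - start) / n

def average_sweep_time(context, parameter, n=20):
    """Average time per evaluation when the global parameter changes each step,
    which forces the long range correction coefficients to be recomputed."""
    start = time.time()
    for i in range(n):
        context.setParameter(parameter, i / float(n))
        context.getState(getEnergy=True)
    return (time.time() - start) / n

# Example usage (with a Context built from your realistic system):
#     context = mm.Context(system, mm.VerletIntegrator(0.001))
#     context.setPositions(positions)
#     print("fixed parameters :", average_energy_time(context))
#     print("parameter sweep  :", average_sweep_time(context, "lambda_sterics"))
```

Comparing the two numbers, with the realistic number of atom classes, would show directly how much of the per-step cost comes from recomputing the coefficients.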
