
Automatically assigned tmin may become too small #180

Open
annulen opened this issue Nov 12, 2023 · 53 comments

annulen commented Nov 12, 2023

I'm using the criteria drms 6e-4 dmax 1e-3 for optimization (taken from ORCA's TightOpt), and I've run into a situation where the optimization got stuck in an endless loop with the energy slowly climbing. Quality was always < 0, but no steps were rejected, with the reasoning Not rejecting step - trust below tmin = 6.000e-05.

The unreasonably low tmin caught my attention (by the way, I'm using the tip of the master branch). The corresponding code in params.py says that

  1. For DFT tmin should actually be not smaller than 1.0e-4 because of gradient errors.
  2. tmin should be smaller than Convergence_drms to avoid rejection of valid steps.

Both criteria look sensible to me. However, the code that computes tmin is written as

self.tmin = kwargs.get('tmin', min(1.0e-4, self.Convergence_drms*0.1))

which explicitly contradicts criterion (1), as tmin can easily drop below 1e-4.

I think the proper default value of tmin would be better chosen as

min(max(1.0e-4, self.Convergence_drms*0.1), self.Convergence_drms)

With this logic, tmin may only become smaller than 1.0e-4 when Convergence_drms < 1.0e-4, but it is still limited from above by Convergence_drms*0.1 for large values of Convergence_drms. For my case, it gives tmin = 1.0e-4.
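As a standalone sketch (a hypothetical helper mirroring the proposed one-liner; not the actual params.py code), the clamping behaves like this:

```python
import math

def default_tmin(convergence_drms, floor=1.0e-4):
    """Proposed default: at least `floor` (the assumed DFT gradient
    noise level), but never above Convergence_drms itself."""
    return min(max(floor, convergence_drms * 0.1), convergence_drms)

# My case (drms = 6e-4): clamped up to the floor instead of 6e-5.
assert default_tmin(6e-4) == 1.0e-4
# drms tighter than the floor: bounded from above by drms itself.
assert default_tmin(5e-5) == 5e-5
# Loose drms: bounded by drms * 0.1, as in the current code.
assert math.isclose(default_tmin(1e-2), 1e-3)
```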

I can make a PR if you think this is the right way.


leeping commented Nov 12, 2023

Setting tmin properly is a nontrivial problem; in fact, it was recently changed (#177). The underlying problem is that many quantum chemistry codes have small inconsistencies between the gradient and the energy, on the order of 1e-4, for DFT and potentially other methods as well. When this error exists, geomeTRIC can take a step in the wrong direction and the step will be low quality.

As a result of the low quality step, the trust radius is decreased (the step is not necessarily rejected, as that only happens for very low quality steps). The reason for lowering the trust radius is that the first order Taylor expansion becomes exact in the limit of small step sizes, so the step quality should eventually become good. However, when the gradient and energy are inconsistent, there is a chance the step quality never becomes good. A long (perhaps endless) sequence of low-quality steps is what results.
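The trust-radius mechanics described above can be sketched roughly as follows (an illustrative model, not geomeTRIC's actual code; the thresholds and scaling factors are placeholders):

```python
def update_trust(trust, quality, tmin, tmax=0.3):
    """Illustrative trust-radius update driven by step quality, i.e.
    the ratio of the actual to the model-predicted energy change.
    Quality ~ 1 means the Taylor expansion was accurate."""
    if quality > 0.75:           # good step: expand the trust region
        return min(trust * 2.0, tmax)
    if quality < 0.25:           # poor step: shrink, but keep a floor;
        # with an inconsistent energy/gradient the quality may never
        # recover, so without the floor the step size collapses.
        return max(trust * 0.5, tmin)
    return trust                 # okay step: leave unchanged
```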

The current setting of tmin exists because of the following scenario: suppose the convergence criteria for the RMS and max gradient are satisfied, but an energy/gradient inconsistency causes an endless loop before the other three criteria (RMS & max step & energy change) can be met. In that case the trust radius can be reduced until it is well below Convergence_drms, causing the optimization to converge on the RMS / max displacement criteria (and presumably the energy change will have fallen below its criterion in the meantime). This was designed to prevent exactly the behavior you are describing.

However, if the RMS / max gradient criteria have not been met yet, for example if the energy and gradient have a large inconsistency on the order of 1e-3 or greater, then one could still end up with an endless sequence of low-quality steps. In other words, your description makes me suspect that the source of the problem is the electronic structure method, not the geometry optimization. It's true that if tmin were set larger, one would have a better chance of "jumping" out of the bad region and ending up in a region of the PES where the energy and gradient are consistent again; in my experience this never happens.

I have two suggestions for the next step -

(1) If you could share the output file and ORCA input file with me, I could help to diagnose further.

(2) You can try the finite difference tool in geomeTRIC/tools/finite_difference_grad.py. For your input structure, use one that was produced during the endless loop of bad steps. Use the command line options --step 1e-3 --stencil 5 to produce a high-quality numerical gradient. If you use Work Queue, you may be able to speed things up further.

If the energy and gradient are consistent, then you should see agreement of 1e-5 or lower for the analytic and numerical gradient. If the inconsistency is larger than 1e-3 then that is probably the root cause of your failed optimization.
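The consistency check itself boils down to comparing the analytic gradient against a central-difference one; a minimal sketch of the 5-point stencil idea (the energy_at callable is a hypothetical stand-in for a single-point QM energy evaluation):

```python
import numpy as np

def numerical_gradient(energy_at, x, step=1e-3):
    """5-point central-difference gradient (O(step**4) error)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = step
        g[i] = (energy_at(x - 2*e) - 8*energy_at(x - e)
                + 8*energy_at(x + e) - energy_at(x + 2*e)) / (12*step)
    return g

# Sanity check on an analytic test function f(x) = sum(x**4):
x = np.array([0.5, -1.0, 2.0])
num = numerical_gradient(lambda y: float(np.sum(y**4)), x)
assert np.allclose(num, 4 * x**3, atol=1e-6)
```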

PS: It is possible that we could somehow detect poor convergence behavior such as these endless loops and switch to a pure energy-based algorithm (such as the Powell direction-set method or the Nelder-Mead simplex method). That would require writing more than just the algorithm itself: since these algorithms are designed for when gradients are unavailable, the gradient convergence criteria should no longer be used, and we would also need to implement the capability in the engine to compute just the energy (and not the gradient) for efficiency reasons. This would be a major contribution to the code and would require quite a bit of testing to ensure it works.


annulen commented Nov 12, 2023

Thanks for your help.

I'm pretty sure ORCA has some numerical noise in the energy and gradients because of the RIJCOSX approximation, especially the COSX part and the way it is implemented (a different grid is used for the final SCF iteration & gradient).

However, I'm not sure we should really dig into gradient quality now, because all possible solutions would slow down the calculations significantly. Even if the gradient quality is not stellar, the optimization converges fine with tmin = 1e-4. It's tmin that causes trouble here, not the gradient. Of course, as tmin is reduced, the relative effect of the gradient noise only grows.

What do you think about my code? It sets tmin below 1e-4 only when drms < 1e-4. Or is it actually important for tmin to be 10 times smaller than drms in all cases?

self.tmin = kwargs.get('tmin', min(max(1.0e-4, self.Convergence_drms * 0.1), self.Convergence_drms))


leeping commented Nov 12, 2023

It's possible your code could be a good solution, but I can't decide without understanding the behavior that causes the tmin = 1e-4 optimization to converge when a smaller tmin does not. Could you share the optimization log file?


leeping commented Nov 12, 2023

To answer your other question:

Or is it actually important for tmin to be 10 times smaller than drms in all cases?

I think it's important for tmin to have the ability to become smaller than drms in order to cause convergence in the situation where:
(1) There is some energy / gradient inconsistency
(2) The gradient RMS/max criteria are met
(3) The optimization is taking poor-quality steps even for very small trust radii, preventing the other three criteria from being met

PS: If tmin is set to 1e-4, that is less than twice as large as the default setting of 6e-5 using your settings. If I could see the log file, that would help me understand why one setting of tmin helps convergence but the other does not.


annulen commented Nov 12, 2023

Here are the logs. However, I'm working with a modified version of geomeTRIC. All my modifications can be seen at annulen/geomeTRIC@master...smart_reset, and I believe they have no influence on convergence (except for the tmin change).

This is failed optimization: PMDA_chair_dimer2_3.log
This one converged:
PMDA_chair_dimer2.log

UPD: just to clarify, my change is responsible for the "Resetting Hessian to guess" message that appears near the beginning and doesn't disrupt anything later, so it should not create any differences in convergence. But if you insist, I can repeat with unmodified master; it's a small system chosen for quick tests.


annulen commented Nov 12, 2023

I think it's important for tmin to have the ability to become smaller than drms in order to cause convergence in the situation where:
(1) There is some energy / gradient inconsistency
(2) The gradient RMS/max criteria are met
(3) The optimization is taking poor-quality steps even for very small trust radii, preventing the other three criteria from being met

Out of curiosity: why is it so important to decrease the trust radius? By default ORCA uses RFO steps with a static trust radius of 0.3, and the steps nevertheless decrease and eventually converge (of course it's still far from perfect, that's why I'm here :)

If it were possible, I'd prefer to keep the trust radius above 0.005 a.u. to make sure that gradient quality is not an issue.


annulen commented Nov 12, 2023

I think I understand it now. Most quantum chemistry programs implement algorithms with built-in step limiting. The most popular ones are rational function (RF or RFO) and quasi-Newton (QN); some programs also implement GDIIS and Newton-Raphson (NR). For example, by default ORCA uses RF and switches to QN when the RF step is too small, while Firefly uses GDIIS and later switches to NR.

However, geomeTRIC instead takes a full step of the trust-radius size, and step limiting depends completely on its ability to reduce the radius. If this is the case, I guess I'll have to implement at least RF and QN before I can start using geomeTRIC in production.


leeping commented Nov 12, 2023

The algorithm geomeTRIC uses for energy minimization has always been a quasi-Newton (QN) method. For transition state optimization it uses RFO. In both cases the trust radius is used to limit the step size. It is possible for the step to be smaller than the trust radius, in fact that happens very often, though it may not have happened in your calculations because the trust radius became very small.
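A trust-radius-limited quasi-Newton step can be sketched like this (a simplified Cartesian-space illustration, not geomeTRIC's actual implementation, which works in internal coordinates with more careful step-size control):

```python
import numpy as np

def qn_step(gradient, hessian, trust):
    """Quasi-Newton step dx = -H^-1 g, scaled back only when it
    exceeds the trust radius; shorter steps pass through unchanged."""
    dx = -np.linalg.solve(hessian, gradient)
    norm = np.linalg.norm(dx)
    return dx * (trust / norm) if norm > trust else dx

g = np.array([0.3, 0.4])
H = np.eye(2)
# Newton step has length 0.5; a trust radius of 0.25 scales it down,
# while a trust radius of 1.0 leaves it untouched.
assert np.isclose(np.linalg.norm(qn_step(g, H, 0.25)), 0.25)
assert np.allclose(qn_step(g, H, 1.0), [-0.3, -0.4])
```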

You can read more about the theory here: https://geometric.readthedocs.io/en/latest/how-it-works.html

I looked at your output files. Starting around step 20, the step quality is no longer good and the trust radius is reduced to 1e-4 (PMDA_chair_dimer2.log) or 6e-5 (PMDA_chair_dimer2_3.log). In both calculations, the gradient continues to decrease even as the energy increases. At step 40, the calculation with tmin=1e-4 converges. The calculation with tmin=6e-5 is killed after 46 steps; however, its behavior is in fact very similar to the tmin=1e-4 case, but due to the smaller value of tmin the steps are smaller and it takes longer for the gradient to decrease to the point where the criteria are met. I think if you had allowed the calculation to run for ~20 more steps, the tmin=6e-5 calculation would have converged as well.

One potential way to resolve this issue is to change how the trust radius is adjusted: we can consider the step state "Good", even if the step quality is bad, when the energy convergence criterion has already been met (which is true in your case), even if the energy change is not as expected. The following may work - add these lines immediately above the line linked here:

        if Converged_energy: step_state = StepState.Good
        elif Quality > 0.75: step_state = StepState.Good
        # if Quality > 0.75: step_state = StepState.Good

https://github.com/leeping/geomeTRIC/blob/master/geometric/optimize.py#L560


leeping commented Nov 12, 2023

This might be better:

        if Quality > 0.75: step_state = StepState.Good
        elif Converged_energy or (Quality > (0.5 if params.transition else 0.25)): step_state = StepState.Okay
        # elif Quality > (0.5 if params.transition else 0.25): step_state = StepState.Okay
        elif Quality > 0.0: step_state = StepState.Poor


leeping commented Nov 12, 2023

To answer your previous question:

Out of curiosity: why is it so important to decrease trust radius? By default ORCA is using RFO steps with static trust radius 0.3, and steps are nevertheless decreasing and eventually converge (but of course it's still far from perfect, that's why I'm here:)

As I said before, the reason for decreasing the trust radius is to ensure the second-order Taylor expansion of the energy is a good approximation. Otherwise the optimizer could take bad steps (for example, Step 5 in PMDA_chair_dimer2.log is a case where the trust radius decrease is working as intended). When the energy and gradient are consistent, the trust radius is naturally large at the end of the optimization even though the steps become very small.


annulen commented Nov 12, 2023

Just for your information, I can understand diffs (and find them easier to use than instructions like "add X before line Y"). If I understood correctly, your last change expressed that way would look like

diff --git a/geometric/optimize.py b/geometric/optimize.py
index 0521f1c..6ea6d84 100644
--- a/geometric/optimize.py
+++ b/geometric/optimize.py
@@ -565,7 +565,8 @@ class Optimizer(object):
         # 2020-03-10: Step quality thresholds are hard-coded here.
         colors = {}
         if Quality > 0.75: step_state = StepState.Good
-        elif Quality > (0.5 if params.transition else 0.25): step_state = StepState.Okay
+        elif Converged_energy or (Quality > (0.5 if params.transition else 0.25)): step_state = StepState.Okay
+        #elif Quality > (0.5 if params.transition else 0.25): step_state = StepState.Okay
         elif Quality > 0.0: step_state = StepState.Poor
         else:
             colors['energy'] = "\x1b[92m" if Converged_energy else "\x1b[91m"


leeping commented Nov 12, 2023

That's correct.


annulen commented Nov 12, 2023

As I said before, the reason for decreasing the trust radius is to ensure the second-order Taylor expansion of the energy is a good approximation. Otherwise the optimizer could take bad steps (for example, Step 5 in PMDA_chair_dimer2.log is a case where the trust radius decrease is working as intended). When the energy and gradient are consistent, the trust radius is naturally large at the end of the optimization even though the steps become very small.

Thanks for the explanation. I was thinking of the trust radius just as a safety net for the optimizer in case something goes wrong, and was really surprised to see it drop so low.


leeping commented Nov 12, 2023

Yes, I think one insight I gained from this is that the trust radius only works properly as a safety net if one assumes the energy and gradient are consistent. Otherwise it does not work as intended and it can actually slow things down. There are situations like yours in which the convergence could be sped up if the trust radius had not decreased as much as it did. I'd be interested to know if the fix above solves the problem.


annulen commented Nov 12, 2023

Ok, then I'll reproduce the same state of the code that I had when producing PMDA_chair_dimer2_3.log, plus your patch on top, so we can make a better comparison.


annulen commented Nov 12, 2023

While that's running, another question out of curiosity: why are you using QN instead of RFO for energy minimization? I thought RFO was more reliable, especially with a bad input geometry. It's the default at least in ORCA and Molpro.


leeping commented Nov 12, 2023

For energy minimizations, RFO and QN are very similar and with a reasonable default parameter choice for RFO, they are equivalent (see attached paper). The transition state optimization uses partitioned RFO to maximize along one mode and minimize the others.

WIREs Comput Mol Sci - 2011 - Schlegel - Geometry optimization.pdf


annulen commented Nov 12, 2023

For energy minimizations, RFO and QN are very similar and with a reasonable default parameter choice for RFO, they are equivalent (see attached paper).

Thanks!

The transition state optimization uses partitioned RFO to maximize along one mode and minimize the others.

For the record, ORCA also uses P-RFO for TS searches by default.


annulen commented Nov 12, 2023

elif Converged_energy or (Quality > (0.5 if params.transition else 0.25)): step_state = StepState.Okay

Thinking more about your patch, I'm starting to doubt it would help for my bigger system. When I optimized a 9-molecule cluster with ORCA, I reached a situation where the energy changes were minimal but the gradients didn't decrease, and I had to reduce the trust radius manually to get any reasonable steps. Your change completely stops updating the trust radius after the energy has converged, so I'm afraid geomeTRIC could get stuck the same way on the 9-molecule cluster.


annulen commented Nov 12, 2023

To clarify, I haven't tried the 9-molecule cluster with geomeTRIC yet, only with ORCA's built-in optimizer.


leeping commented Nov 12, 2023

You could make it more aggressive in the following way: this will increase the trust radius as long as the energy convergence criterion is met. We would need to perform more testing to ensure this doesn't break something else, though.

diff --git a/geometric/optimize.py b/geometric/optimize.py
index 0521f1c..6ea6d84 100644
--- a/geometric/optimize.py
+++ b/geometric/optimize.py
@@ -565,7 +565,8 @@ class Optimizer(object):
         # 2020-03-10: Step quality thresholds are hard-coded here.
         colors = {}
-        if Quality > 0.75: step_state = StepState.Good
+        if Converged_energy or Quality > 0.75: step_state = StepState.Good
+        # if Quality > 0.75: step_state = StepState.Good
         elif Quality > (0.5 if params.transition else 0.25): step_state = StepState.Okay
         elif Quality > 0.0: step_state = StepState.Poor
         else:
             colors['energy'] = "\x1b[92m" if Converged_energy else "\x1b[91m"


annulen commented Nov 12, 2023

On the other hand I've never set trust radius values lower than 1e-3 in ORCA, so maybe it won't be an issue.


annulen commented Nov 12, 2023

What do you think: could it be a good idea to use different convergence criteria depending on whether the energy goes up or down? When I was dealing with ORCA's optimizer, it always seemed weird to me to get a "converged" message after a series of positive energy changes (with a decreasing gradient). IMO, if the energy keeps climbing, it's by no means converged.


leeping commented Nov 12, 2023

My first thought is that I don't think it's a good idea to apply different criteria based on the direction of the energy change. That has not been done in the computational literature as far as I know, and if we start to do that, the community might think it's being done to "declare victory" prematurely. I haven't seen many cases where the energy goes up monotonically as you are observing; in a different calculation you might see the energy fluctuate up and down.

I don't think ORCA is applying different criteria based on the direction of the energy change, but it would be good to know precisely what criteria it is using. If the energy goes up but the absolute value of the energy change in each step is below the convergence criterion, it is possible for the optimization to converge even if the energy goes up by a nontrivial amount (say, 1e-5) over a number of steps (say, 20 steps). It is valid for you to state that the calculation is by no means converged - after all, it had a lower-energy structure 20 steps ago - but there is simply no way to fix the situation other than by addressing the energy / gradient consistency.

In my opinion, energy/gradient consistency is not addressed carefully in the published literature, and there is always an incentive to cut corners in accuracy in favor of speed, so as a result the default settings of many codes are in a "gray area" where things are not exactly correct but results are still mostly reproducible.


annulen commented Nov 12, 2023

My first thought is that I don't think it's a good idea to apply different criteria based on the direction of the energy change. That has not been done in the computational literature as far as I know, and if we start to do that, the community might think it's being done to "declare victory" prematurely.

My idea was actually the opposite: don't declare victory if we are going up, unless the change is really small, e.g. 1e-6 when going down and 1e-7 when going up. But if nobody has done it yet, then it's probably a bad idea.

I don't think ORCA is applying different criteria based on the direction of the energy change, but it would be good to know precisely what criteria it is using.

The convergence criteria that I'm using now exactly replicate the criteria that I previously used in ORCA:

--converge energy 1e-6 grms 3.33e-6 gmax 1e-5 drms 6e-4 dmax 1e-3

where all criteria are mandatory. The default behavior of ORCA is different: it has a few "profiles" and interprets the criteria more liberally (e.g., if energy, drms, and dmax have converged and grms is way below the requirement, then gmax being a couple of times higher than needed can be ignored). I hated that "liberal" behavior and turned it off using an undocumented option I found in the binary that makes ORCA behave like geomeTRIC in that regard (EnforceStrictConvergence true). The energy, drms, and dmax values above exactly match ORCA's "TightOpt" profile, while grms and gmax are tighter; I set them to be consistent with the criteria a colleague used in a different program.
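In geomeTRIC terms, this "strict" behavior amounts to a plain conjunction of all five tests (an illustrative sketch using the threshold values from the --converge flags above, not either program's actual code):

```python
def strict_converged(de, grms, gmax, drms, dmax):
    """All five criteria mandatory, with the thresholds from
    --converge above (no 'liberal' trade-offs between them)."""
    return (abs(de) < 1e-6
            and grms < 3.33e-6
            and gmax < 1e-5
            and drms < 6e-4
            and dmax < 1e-3)

assert strict_converged(5e-7, 1e-6, 5e-6, 1e-4, 5e-4)
# gmax slightly over: not converged, no matter how small grms is.
assert not strict_converged(5e-7, 1e-8, 2e-5, 1e-4, 5e-4)
```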


leeping commented Nov 12, 2023

Thanks - I think I understand your motivation for wanting to have a tighter energy criterion if the energy goes up. It is because the optimization converges if the energy is increasing by a small amount each step, but we know it's not really converged.

However, there are some complicating factors for this idea - one is that very similar criteria are also used in constrained geometry optimization and transition state optimization. In those calculations, some amount of energy increase is expected if required to satisfy the constraint or maximize the energy along the reaction coordinate. Therefore, one would have to turn this on only for unconstrained energy minimizations.

The other reason is that I think the "energy steadily increasing" behavior only happens when the energy / gradient are inconsistent when close to convergence, and one cannot expect to get rigorous optimization results if such an inconsistency exists. Probably the best we can do is find an approximate solution and tweak the step size control and convergence criteria so that the optimization doesn't take forever in a frustrating way.


annulen commented Nov 12, 2023

In case you think I've made things too tight: that's not the case. The grms and gmax values that I actually reached in ORCA with the 9-molecule cluster were higher; I never saw either of them shown as converged.

Oh, and ORCA uses the gradient in internal coordinates to evaluate the grms and gmax criteria, so the behavior differs depending on the chosen coordinate system.


leeping commented Nov 12, 2023

The inconsistency you are seeing might be grid related. Using a larger grid (such as (99, 590)) might improve your success rate when using your thresholds, which are relatively tight. But without going to a larger grid, I think that implementing one of the "diffs" above should result in convergence even if the steady energy increase is unsettling.


annulen commented Nov 12, 2023

Probably the best we can do is find an approximate solution and tweak the step size control and convergence criteria so that the optimization doesn't take forever in a frustrating way.

Actually, it's not only about frustration; I haven't told you the complete story yet.

As you might guess, the 9-molecule cluster has lots of H-bonds in it, and the dihedral angles that include them have very shallow potential energy surfaces. Changing them roughly corresponds to spatial rotation of individual molecules of the cluster. ORCA by default uses redundant internal coordinates, which I thought was the main source of trouble. On the other hand, optimization in Cartesian coordinates behaved even worse, and I thought that was caused by a poor implementation of ORCA's Cartesian optimizer (which also lacks some features, so it was not a completely unreasonable assumption).


annulen commented Nov 12, 2023

So my hope was that TRIC coordinates would allow a quicker path to convergence, and that tight gradient criteria would ensure the molecules end up oriented optimally.


annulen commented Nov 12, 2023

In case you are interested, here is an example output from ORCA with default grids:
orca.out.txt

Note that there is a DFT grid and three COSX grids. GRIDX 2 is used for the SCF iterations and GRIDX 3 for the final integration and the gradient. I tried the DEFGRID3 NOFINALGRIDX setting, which improves the DFT grid and switches all COSX calculations to GRIDX 3 (consistency ensured!). Initially it seemed to help, but CPU and memory costs increased, and later the same problems appeared, so I decided it was a placebo effect caused by slightly different PES shapes that got things unstuck initially.


annulen commented Nov 12, 2023

Here is a log with your patch from #180 (comment)

PMDA_chair_dimer2.log

As you can see, it starts exactly the same as PMDA_chair_dimer2_3.log linked above, but diverges from it starting at step 26, and eventually converges.


annulen commented Nov 12, 2023

And indeed, the trust radius stops at 1.010e-03, which I consider reasonable. Now I'd like to find out whether Converged_energy is really the right criterion for this task, or whether I should just set --tmin 1e-3 everywhere instead.


leeping commented Nov 12, 2023

Thanks! I still think that checking the energy / gradient consistency using finite differences is a good way to find the true cause of the issue. Within ORCA there is the NUMGRAD option; I believe you can use it to compute a single-point numerical gradient. I bet that if the max gradient error could be made less than 1e-5, you would no longer observe this strange behavior.

I think one of the two "diffs" in this thread, where Converged_energy causes the step state to be set to Okay or Good, could prevent this issue from reappearing in the future, but I'm not sure which option to use. As you said, the less aggressive option (Okay) does not change the trust radius anymore, so it is stuck at the value it had before the energy convergence criterion was met. How about the more aggressive option (Good)?


annulen commented Nov 12, 2023

How about the more aggressive option (Good)?

Will try it next. BTW, here is a trajectory file from the latest optimization:
PMDA_chair_dimer2_optim.xyz.txt

It seems like some steps contain a rotation of the system as a whole. Grids are not rotationally invariant, so that may cause additional jitter. I guess it could be a good idea to always freeze the translational coordinates of one of the fragments, so that whole-system rotations won't happen.


leeping commented Nov 12, 2023

Thanks, I can see the slight whole-system rotation at the end. This has been observed in several other software packages as well. GeomeTRIC deals with this in the following way:

Step   24 : Displace = 2.160e-03/3.679e-03 (rms/max) Trust = 2.020e-03 (-) Grad = 4.498e-05/1.750e-04 (rms/max) E (change) = -122.0442482526 (+1.694e-06) Quality = -0.579
Poor-quality step dominated by net translation/rotation detected; will project out net forces and torques past this point.

It isn't a perfect solution but I think it works well enough.
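The translational half of that projection is simple to sketch (an illustration of the idea only; geomeTRIC also removes the net torque, which additionally involves the moment-of-inertia tensor and is omitted here):

```python
import numpy as np

def project_net_force(grad):
    """Subtract the mean from an (N, 3) Cartesian gradient so the
    net force (and hence any net translation it drives) is zero."""
    return grad - grad.mean(axis=0)

g = np.array([[1.0, 0.0, 0.0],
              [3.0, 0.0, 0.0]])
# After projection, the components sum to zero in each direction.
assert np.allclose(project_net_force(g).sum(axis=0), 0.0)
```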

Using your input file I was able to create an equivalent input for Q-Chem. I have tightened the grid in that calculation and I will let you know what the result looks like.


annulen commented Nov 12, 2023

It isn't a perfect solution but I think it works well enough.

However, it often kicks in late in the optimization. Pinning down the center of mass of one of the fragments looks like a reasonable constraint that could be used for any system right from the start by default (e.g. pin fragment 1, and allow the user to choose a different pinned fragment as an option separate from other constraints). Yes, it decreases the degrees of freedom from 3N to 3N-3, but I don't see how it could harm the optimization.

Using your input file I was able to create an equivalent input for Q-Chem. I have tightened the grid in that calculation and I will let you know what the result looks like.

Note the devil in the details: ORCA uses RIJCOSX by default. RIJCOSX is RI (resolution of identity, aka density fitting) for the Coulomb integrals and COSX (chain of spheres, actually a numerical integration scheme) for the exchange integrals. I don't know if Q-Chem supports those, especially COSX, and without them stability may be much better. However, they allow DFT to scale better, so larger clusters get bigger time savings.


leeping commented Nov 12, 2023

However, it often kicks in late in the optimization. Pinning down the center of mass of one of the fragments looks like a reasonable constraint that could be used for any system right from the start by default (e.g. pin fragment 1, and allow the user to choose a different pinned fragment as an option separate from other constraints). Yes, it decreases the degrees of freedom from 3N to 3N-3, but I don't see how it could harm the optimization.

With one center of mass pinned down, the system can still undergo a net rotation, although the axis of rotation will change location. Even if you pinned two centers of mass, there is still the possibility of rotation about the axis connecting the centers of mass.

Note the devil in the details: ORCA uses RIJCOSX by default. RIJCOSX is RI (resolution of identity, aka density fitting) for the Coulomb integrals and COSX (chain of spheres, actually a numerical integration scheme) for the exchange integrals. I don't know if Q-Chem supports those, especially COSX, and without them stability may be much better. However, they allow DFT to scale better, so larger clusters get bigger time savings.

Thanks, I did note ORCA's use of the RIJCOSX approximation. I am thinking that if the optimization also fails in Q-Chem, it could point to a problem with the optimization algorithm. Otherwise, it could point to the root cause being an energy/gradient inconsistency possibly caused by one of the grids. I consider it part of the process of elimination to find the root issue.


annulen commented Nov 13, 2023

Here are results with (Good) patch:
PMDA_chair_dimer2.log
PMDA_chair_dimer2_optim.xyz.txt

The results are rather... curious. When the trust radius started growing after step 26, Quality quickly went down to large negative values; however, it converged at step 34 with a step of Quality 24.7.

Of course, the final energy of -122.0442461097 is nowhere near as good as it was at step 22 (-122.0442516550).

I've also tried out my idea of pinning the center of the first molecule. My constraints file looked like

$freeze
trans-xyz 1-20

It seems like it did what I intended and reduced the whole-system rotations; however, the overall convergence got worse:
PMDA_chair_dimer2.log. So you were right, and it wasn't such a bright idea after all.


leeping commented Nov 13, 2023

Here's the Q-Chem result. I think it's better behaved than what you've been seeing so far; toward the end of the optimization, the energy changes are negative and the step size drops below the convergence criteria by itself. There is still some net rotation that gets projected out at step 27.

PMDA_qchem.tar.gz

leeping commented Nov 13, 2023

Thanks for sharing your result. Here's what I see happening: when the energy increases but the size of the increase is below the energy convergence criterion, the step is counted as "Good" and the trust radius increases (steps 25-27). But then the step is large enough for the energy increase to exceed the convergence criterion, causing the step to be rejected (step 28). This indicates to me that the more "moderate" approach, i.e. the (Okay) patch, was the better solution.

annulen commented Nov 13, 2023

This indicates to me that the more "moderate" approach, i.e. (Okay) patch was the better solution.

I guess it should actually depend on the trust radius value. If it's considered "too low" (e.g. < 1e-3), then it may be increased; otherwise it should stay the same.

Or I could simply use a higher tmin setting, like --tmin=1e-3. FWIW, my motivation in reporting this was to improve geomeTRIC's defaults, not to get help with my particular case. Anyway, thank you very much for your help.

As for the Q-Chem results: from the energy values I assume it runs the DFT calculation without the RI/COSX approximations, and it's no surprise that those gradients are more precise. I can try to run finite_difference_grad.py if you are still interested in that. But it would be more useful for me to find some way to deal with the existing gradients. Maybe switching to RFO could help?

In ORCA I also tried GDIIS, which I suppose is even more stable; however, it didn't make any substantial improvement there. Though I guess it could be a nice-to-have addition to geomeTRIC, even as a "checkbox" feature :)

annulen commented Nov 13, 2023

Also, I'd like to note that small systems like this one usually converge relatively well in ORCA with the same settings. I just picked a particular system that exposed pathologically bad convergence, which looked suspiciously similar to my experience with a 9-molecule cluster.

leeping commented Nov 13, 2023

I think for your calculation which has the behavior of steadily increasing energy, setting --tmin=1e-3 should help. It was helpful for me to find out about this, because I think the (Okay) patch might improve things in general. I appreciate the issue report and discussions!

I think running finite_difference_grad.py on your input file could be helpful, just to prove to both of us that an error in the gradient was the root cause of the issue. I think that in order to have reliable convergence behavior in the optimizer, the analytic and numerical gradients should agree to within 1e-5; unfortunately, this will increase the cost of the single point calculation.

I don't think one could play with tmin or anything else to guarantee good convergence if the energy and gradient are inconsistent. There are many different ways the inconsistency could manifest; in this case the gradient decreased as the energy slowly increased, but in the next case it could be totally different.

I do know about GDIIS, but I don't understand the mathematics of how it works. I have some understanding of DIIS for electronic structure and QN/RFO for geometry optimization, but unless there is a convincing case where GDIIS performs better, there are some higher-priority tasks to work on. :) For example, getting v1.1 released...

annulen commented Nov 13, 2023

I think I've found an issue with your (Okay) patch. The optimization may now never hit "Poor-quality step dominated by net translation/rotation detected", because that message requires step_state in (StepState.Poor, StepState.Reject), and after the energy has converged the steps are always Okay or Good.

I think it could make sense to enable translation and rotation subtraction automatically once the energy has converged, without waiting for a sufficiently bad step.

UPD: I was going to try with this patch, but just realized it was a bad idea:

diff --git a/geometric/optimize.py b/geometric/optimize.py
index 9b0e1a3..6ee08ba 100644
--- a/geometric/optimize.py
+++ b/geometric/optimize.py
@@ -640,7 +642,7 @@ class Optimizer(object):
         # if step_state in (StepState.Okay, StepState.Poor, StepState.Reject) and params.transition:
         #     logger.info("LPW: Recalculating Hessian\n")
         #     self.recalcHess = True
-        if step_state in (StepState.Poor, StepState.Reject):
+        if step_state in (StepState.Okay, StepState.Poor, StepState.Reject):
             new_trust = max(params.tmin, min(self.trust, self.cnorm)/2)
             # if (Converged_grms or Converged_gmax) or (params.molcnv and Converged_molpro_gmax):
             #     new_trust = max(new_trust, self.params.Convergence_dmax if self.params.usedmax else self.params.Convergence_drms)

annulen commented Nov 13, 2023

Trying this now

diff --git a/geometric/optimize.py b/geometric/optimize.py
index 9b0e1a3..c09f42c 100644
--- a/geometric/optimize.py
+++ b/geometric/optimize.py
@@ -565,7 +565,8 @@ class Optimizer(object):
         # 2020-03-10: Step quality thresholds are hard-coded here.
         colors = {}
         if Quality > 0.75: step_state = StepState.Good
-        elif Quality > (0.5 if params.transition else 0.25): step_state = StepState.Okay
+        elif Converged_energy or (Quality > (0.5 if params.transition else 0.25)): step_state = StepState.Okay
+        #elif Quality > (0.5 if params.transition else 0.25): step_state = StepState.Okay
         elif Quality > 0.0: step_state = StepState.Poor
         else:
             colors['energy'] = "\x1b[92m" if Converged_energy else "\x1b[91m"
@@ -646,13 +648,6 @@ class Optimizer(object):
             #     new_trust = max(new_trust, self.params.Convergence_dmax if self.params.usedmax else self.params.Convergence_drms)
             self.trustprint = "\x1b[91m-\x1b[0m" if new_trust < self.trust else "="
             self.trust = new_trust
-            # A poor quality step that is dominated by overall translation/rotation
-            # is a sign that projecting out the net force and torque may be needed
-            if self.params.subfrctor == 1 and ((rms_displacement_noalign / rms_displacement) > self.lowq_tr_thre):
-                self.lowq_tr_count += 1
-                if self.lowq_tr_count == self.lowq_tr_limit :
-                    logger.info("Poor-quality step dominated by net translation/rotation detected; ")
-                    logger.info("will project out net forces and torques past this point.\n")
         elif step_state == StepState.Good:
             new_trust = min(params.tmax, np.sqrt(2)*self.trust)
             self.trustprint = "\x1b[92m+\x1b[0m" if new_trust > self.trust else "="
@@ -660,6 +655,15 @@ class Optimizer(object):
         elif step_state == StepState.Okay:
             self.trustprint = "="
 
+        if step_state != StepState.Good:
+            # A poor quality step that is dominated by overall translation/rotation
+            # is a sign that projecting out the net force and torque may be needed
+            if self.params.subfrctor == 1 and ((rms_displacement_noalign / rms_displacement) > self.lowq_tr_thre):
+                self.lowq_tr_count += 1
+                if self.lowq_tr_count == self.lowq_tr_limit :
+                    logger.info("Poor-quality step dominated by net translation/rotation detected; ")
+                    logger.info("will project out net forces and torques past this point.\n")
+
         if step_state == StepState.Reject:
             if hasattr(self, 'X_rj') and np.allclose(self.X_rj, self.X, atol=1e-6):
                 logger.info("\x1b[93mA previously rejected step was repeated; accepting to avoid infinite loop\x1b[0m\n")

annulen commented Nov 13, 2023

I've done an optimization with the patch above, using the defgrid3 nofinalgridx ORCA settings, which both improve the DFT grid and make COSX use the same large grid at all stages (without nofinalgridx it would use an even larger grid for the final energy and gradient evaluation).

Sample ORCA output with grids:
orca.out.txt
Optimization log:
PMDA_chair_dimer2.log
Trajectory:
PMDA_chair_dimer2_optim.xyz.txt

Convergence runs much more smoothly. However, many steps were needed to refine the structure to satisfy all criteria.
Notable things:

  • The resulting energy is very close to the minimum energy seen throughout all steps: -122.0441848052 vs. -122.0441848168 at step 49
  • It's really impressive how good the guessed Hessian is! There was a reset after step 46 and I expected a huge misstep, but it made a really good step instead!

leeping commented Nov 13, 2023

Looking at your output, I think the behavior is better. There are still some problems: around step 20 the step quality starts to get low, and between steps 31-58 there are some fluctuations up and down. However, the converged structure does have the lowest energy within a margin of error of 1e-6. The (Okay) patch appears to have helped, so I think it should be included in master.

I see that your patch essentially makes "subfrctor" more aggressive, so that it can also be applied when the step state is "Okay". There is currently a command line option, --subfrctor 2, that turns the behavior on from the beginning; you probably already know that.

The reason why subfrctor is not always active is because there are finite field calculations where one expects overall reorientation of the molecule - I have corresponded with another group that uses geomeTRIC for this purpose. Therefore I don't think it should be turned on by default for "Okay" steps. However, I think it should be turned on for "Okay" steps when the step quality is low (i.e. those steps that used to be Poor/Reject but are now being set to Okay because of the first patch.)

I think this patch would implement the behavior I described.

diff --git a/geometric/optimize.py b/geometric/optimize.py
index 78d8454..6a8b999 100644
--- a/geometric/optimize.py
+++ b/geometric/optimize.py
@@ -565,7 +565,8 @@ class Optimizer(object):
         # 2020-03-10: Step quality thresholds are hard-coded here.
         colors = {}
         if Quality > 0.75: step_state = StepState.Good
-        elif Quality > (0.5 if params.transition else 0.25): step_state = StepState.Okay
+        elif Converged_energy or (Quality > (0.5 if params.transition else 0.25)):
+            step_state = StepState.Okay
         elif Quality > 0.0: step_state = StepState.Poor
         else:
             colors['energy'] = "\x1b[92m" if Converged_energy else "\x1b[91m"
@@ -634,6 +635,13 @@ class Optimizer(object):
         # if step_state in (StepState.Okay, StepState.Poor, StepState.Reject) and params.transition:
         #     logger.info("LPW: Recalculating Hessian\n")
         #     self.recalcHess = True
+        def activate_subfrctor():
+            if self.params.subfrctor == 1 and ((rms_displacement_noalign / rms_displacement) > self.lowq_tr_thre):
+                self.lowq_tr_count += 1
+                if self.lowq_tr_count == self.lowq_tr_limit :
+                    logger.info("Poor-quality step dominated by net translation/rotation detected; ")
+                    logger.info("will project out net forces and torques past this point.\n")
+
         if step_state in (StepState.Poor, StepState.Reject):
             new_trust = max(params.tmin, min(self.trust, self.cnorm)/2)
             # if (Converged_grms or Converged_gmax) or (params.molcnv and Converged_molpro_gmax):
@@ -642,17 +650,14 @@ class Optimizer(object):
             self.trust = new_trust
             # A poor quality step that is dominated by overall translation/rotation
             # is a sign that projecting out the net force and torque may be needed
-            if self.params.subfrctor == 1 and ((rms_displacement_noalign / rms_displacement) > self.lowq_tr_thre):
-                self.lowq_tr_count += 1
-                if self.lowq_tr_count == self.lowq_tr_limit :
-                    logger.info("Poor-quality step dominated by net translation/rotation detected; ")
-                    logger.info("will project out net forces and torques past this point.\n")
+            activate_subfrctor()
         elif step_state == StepState.Good:
             new_trust = min(params.tmax, np.sqrt(2)*self.trust)
             self.trustprint = "\x1b[92m+\x1b[0m" if new_trust > self.trust else "="
             self.trust = new_trust
         elif step_state == StepState.Okay:
             self.trustprint = "="
+            if (Quality < (0.5 if params.transition else 0.25)): activate_subfrctor()

         if step_state == StepState.Reject:
             if hasattr(self, 'X_rj') and np.allclose(self.X_rj, self.X, atol=1e-6):

annulen commented Nov 13, 2023

Looking at your output, I think the behavior is better. There are still some problems because around step 20 the step quality starts to get low, and between step 31-58 there are some fluctuations up and down. However, the converged structure does have the lowest energy with a margin of error of 1e-6. The (Okay) patch appears to have helped so I think that should be included in the master.

I didn't actually check the defgrid3 nofinalgridx version without the (Okay) patch, but I strongly suspect that the trust radius dropping down to 6e-5 would still cause trouble.

I see that your patch essentially makes the "subfrctor" more aggressive so that it can be applied when the step state is "Okay". There is currently a command line option that turns the behavior on from the beginning as --subfrctor 2, you probably already know that.

Actually, I haven't explored all the options yet, as I only started using geomeTRIC this Saturday. However, from what I see, --subfrctor 2 does something quite different: it removes rotations always, not only when a step is Okay but also when it's Good. But why change anything if the steps are already Good?

BTW, when I encounter subfrctor I'm inclined to read it as "subfractor" which could be a good name for some Sci-Fi device :)

The reason why subfrctor is not always active is because there are finite field calculations where one expects overall reorientation of the molecule

I see. However, I think this feature should be made more prominent, as most people optimize geometries under rotationally invariant conditions.
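
For readers unfamiliar with what subfrctor does, the underlying operation is to project the six rigid-body modes (net translation and net rotation) out of the gradient. A self-contained sketch of that idea, assuming unweighted Cartesian modes about the geometric center (geomeTRIC's actual implementation may differ, e.g. in weighting):

```python
import numpy as np

def project_out_net_force_torque(xyz, grad):
    """Remove net-translation and net-torque components from grad.
    xyz, grad: (N, 3) arrays of coordinates and gradient."""
    n = xyz.shape[0]
    disp = xyz - xyz.mean(axis=0)
    modes = []
    for ax in range(3):                      # three translations
        t = np.zeros((n, 3))
        t[:, ax] = 1.0
        modes.append(t.ravel())
    for ax in range(3):                      # three rotations about the center
        e = np.zeros(3)
        e[ax] = 1.0
        modes.append(np.cross(e, disp).ravel())
    # Orthonormalize the modes and project them out of the gradient
    Q, _ = np.linalg.qr(np.array(modes).T)
    g = grad.ravel()
    return (g - Q @ (Q.T @ g)).reshape(n, 3)
```

After the projection, the summed force and the torque about the center are both zero, so a step along the projected gradient cannot translate or rotate the system as a whole. (For a linear molecule the mode matrix is rank-deficient and the QR step would need care.)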

annulen commented Nov 13, 2023

I've finished running finite_difference_grad.py on the starting geometry (as it's the single point shared by all optimization runs, even if not the most interesting one). I used the defgrid3 nofinalgridx settings, as in the most recent run. My command was

finite_difference_grad.py --ase-class=ase.calculators.orca.ORCA --ase-kwargs="{\"orcasimpleinput\":\"B3LYP D3BJ defgrid3 nofinalgridx\", \"orcablocks\":`cat blocks.inp | jq -Rsa .`}" PMDA_chair_dimer2.xyz ase

Results are here:
grads.txt

As you can see, there are a lot of differences larger than 1e-5, though not by much. I expect the results to be worse with the default grid settings; I will report on that later.
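
For anyone wanting to reproduce this kind of check, the core of a finite-difference gradient test is just a central difference on the energy, compared elementwise against the analytic gradient. A minimal sketch (the real finite_difference_grad.py calls out to a QM engine for energy_fn):

```python
import numpy as np

def fd_gradient(energy_fn, x, h=1e-3):
    """Central-difference gradient of a scalar energy function:
    g_i = (E(x + h*e_i) - E(x - h*e_i)) / (2*h)."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp.flat[i] += h
        xm.flat[i] -= h
        g.flat[i] = (energy_fn(xp) - energy_fn(xm)) / (2.0 * h)
    return g

# Flag coordinates where analytic and numerical gradients disagree, e.g.:
# bad = np.abs(g_analytic - g_numeric) > 1e-5
```

Each gradient component costs two single-point energies, which is why this check is expensive for large systems but still worthwhile as a one-off diagnostic.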

I think in order to have reliable convergence behavior in the optimizer, the analytic and numerical gradients should agree to within 1e-5, unfortunately, this will increase the cost of the single point calculation.

Well, defgrid3 nofinalgridx is probably the most accurate level I can reasonably afford for large systems. It looks like some "denoising" of the gradients and energies is needed. If I understood correctly, GDIIS interpolates both gradients and energies using the previous optimization history, so it might serve as a "noise gate" for numerical errors. If you think that's a promising direction, I can try to look into implementing it. Or if you have other ideas, I can look into them as well; I'm not at all hell-bent on GDIIS specifically.
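
For reference, the core of GDIIS is small: given a history of geometries x_i and gradients g_i, form error vectors e_i = -H^-1 g_i (the Newton steps), find coefficients c_i minimizing |sum_i c_i e_i|^2 subject to sum_i c_i = 1, and take x* = sum_i c_i (x_i + e_i). The averaging over history is what gives it some noise tolerance. A hedged sketch of the textbook scheme, not a claim about ORCA's implementation:

```python
import numpy as np

def gdiis_step(xs, gs, hinv):
    """One GDIIS extrapolation.
    xs, gs: lists of 1-D coordinate/gradient arrays (the history).
    hinv:   approximate inverse Hessian."""
    errs = [-(hinv @ g) for g in gs]          # Newton-step error vectors
    m = len(errs)
    # Standard DIIS linear system with a Lagrange-multiplier row/column
    B = np.zeros((m + 1, m + 1))
    for i in range(m):
        for j in range(m):
            B[i, j] = np.dot(errs[i], errs[j])
    B[m, :m] = 1.0
    B[:m, m] = 1.0
    rhs = np.zeros(m + 1)
    rhs[m] = 1.0
    c = np.linalg.solve(B, rhs)[:m]           # coefficients sum to 1
    return sum(ci * (xi + ei) for ci, xi, ei in zip(c, xs, errs))
```

On an exactly quadratic surface with the exact inverse Hessian, every x_i + e_i is already the minimizer, so the extrapolation lands on it; the interesting behavior appears when the Hessian is approximate and the gradients are noisy.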

annulen commented Nov 17, 2023

For energy minimizations, RFO and QN are very similar and with a reasonable default parameter choice for RFO, they are equivalent (see attached paper).

However, it seems that in ORCA implementation there is a difference between them. Here is a quote from manual:

In any case, each individual element of ∆q is restricted to magnitude MaxStep and the total length of the step is restricted to Trust. In the RFO case, this is achieved by minimizing the predicted energy on the hypersphere of radius Trust which also modifies the direction of the step while in the quasi-Newton step, the step vector is simply scaled down.

In other words, when the step is larger than the trust radius, the QN step is simply scaled down, while the RFO step is re-optimized within the radius to get the lowest predicted energy.

leeping commented Nov 17, 2023

In QN it is also possible to minimize the energy on the hypersphere of radius Trust; that is what geomeTRIC does. I am not sure why the QN optimizer in ORCA decides to scale the step instead.
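
The difference between the two treatments fits in a few lines: simple scaling keeps the Newton direction, while minimizing on the hypersphere level-shifts the Hessian until the step length equals the trust radius, which changes the direction and yields a lower model energy. A sketch assuming a positive-definite Hessian (neither ORCA's nor geomeTRIC's actual code):

```python
import numpy as np

def scaled_newton_step(g, hess, trust):
    """Take the Newton step; if too long, scale it down (keeps direction)."""
    dx = -np.linalg.solve(hess, g)
    norm = np.linalg.norm(dx)
    return dx if norm <= trust else dx * (trust / norm)

def restricted_step(g, hess, trust, tol=1e-10):
    """Minimize the quadratic model on the trust hypersphere by bisecting
    on the level shift lam in (H + lam*I) dx = -g."""
    dx = -np.linalg.solve(hess, g)
    if np.linalg.norm(dx) <= trust:
        return dx
    eye = np.eye(len(g))
    lo, hi = 0.0, 1.0
    while np.linalg.norm(np.linalg.solve(hess + hi * eye, -g)) > trust:
        hi *= 2.0                    # grow the shift until inside the sphere
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(np.linalg.solve(hess + mid * eye, -g)) > trust:
            lo = mid
        else:
            hi = mid
    return np.linalg.solve(hess + hi * eye, -g)
```

With an ill-conditioned Hessian the two steps differ noticeably: the restricted step leans further along the soft direction and achieves a lower predicted energy at the same step length.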

@corinwagen

Hey folks, is there an accepted solution to these situations? Reading through this, I'm not 100% sure what to do (I'm in a similar situation here).
