
Fix negative scaling factors for L-BFGS #392

Merged · 4 commits · Feb 13, 2024

Conversation

@rcurtin (Member) commented Feb 9, 2024

While investigating #390, I came across an interesting finding: L-BFGS was computing negative scaling values for the Hessian scaling step. (These should never be negative!) Specifically, the gradient norm was being computed as negative, and this then snowballed into a convergence issue. As it turns out, the convergence issue only seems to manifest on non-x86_64 architectures when using arma::fmat, likely due to numerical precision specifics and other things along those lines. Here's an example run of the Johnson844LovaszThetaFMatSDP test, printing the negative gradient norms:

$ ./ensmallen_tests Johnson844LovaszThetaFMatSDP
ensmallen version: 2.21.0 (Bent Antenna)
armadillo version: 9.800.1 (Horizon Scraper)
random seed: 0
Filters: Johnson844LovaszThetaFMatSDP
Scaling factor less than zero! -0.0215062
Scaling factor less than zero! -2.41463e-09
===============================================================================
All tests passed (1122 assertions in 1 test case)

So, the first question is: @conradsnicta, do you consider it a responsibility of Armadillo to provide a nonnegative norm? If so, I can add an extra if (norm < 0) return 0 type check to the norm() function and open an MR. If not, I'll just keep that in mind and check throughout ensmallen to make sure we don't depend on norm() returning nonnegative values.

Inside of ensmallen, I chose to fix the issue by avoiding division by the gradient norm (or other quantities) when they are very small or negative. On the M1 MacBook I have sitting around, this fixes the test failure and convergence succeeds.

I also fixed some compilation warnings, and made AugLagrangian not output the entire set of coordinates when ENS_PRINT_INFO is enabled (the coordinates could be huge!).

When the norm of the gradient gets very small (but nonzero), we can end up with
very large scaling factors.  To avoid this, we now use a tolerance before
dividing.  In addition, this tolerance handles when the computed norm is
negative (which can happen due to precision issues).
@conradsnicta (Contributor) commented:

@rcurtin That's a weird one. I can add a workaround to Armadillo so that norm is always >= 0, but that would still mean that ensmallen and mlpack have to take into account older versions of Armadillo without the workaround.


@mlpack-bot (bot) left a comment:


Second approval provided automatically after 24 hours. 👍

@rcurtin (Member, Author) commented Feb 13, 2024

@rcurtin That's a weird one. I can add a workaround to Armadillo so that norm is always >= 0, but that would still mean that ensmallen and mlpack have to take into account older versions of Armadillo without the workaround.

@conradsnicta Up to you---maybe it would be nice to do that from the Armadillo perspective, but you are right that ensmallen and mlpack would still need to take it into account regardless.

@rcurtin rcurtin merged commit 0c64b08 into mlpack:master Feb 13, 2024
4 checks passed
@rcurtin rcurtin deleted the lbfgs-fix-negative-scaling-factors branch February 13, 2024 14:21
@conradsnicta (Contributor) commented:

@rcurtin @barracuda156

TLDR: I suspect the issue is ultimately with the BLAS library (Accelerate framework?) provided on arm64/aarch64 (Apple M1), which in turn means this might be a bug in macOS.

Long story below.

I checked the relevant code paths for the norm() function in Armadillo, and I'm struggling to see how the norm can be negative.

In most cases (non-tiny vectors and matrices), Armadillo ends up calling either snrm2() or dnrm2() from BLAS, depending on whether the element type is 32-bit or 64-bit float.

In case the result from snrm2() or dnrm2() is zero, Armadillo assumes there is a potential underflow and uses a robustified norm calculation. This uses absolute values and squared values during calculation, so a negative value is unlikely to appear there.

This implies that the negative value comes out of snrm2() or dnrm2(), as provided by the BLAS implementation (OpenBLAS, or the Accelerate framework) on arm64/aarch64 (Apple M1).

A while back I encountered weirdness with the sdot() function under the Accelerate framework. If I recall correctly, sdot() was returning a 64-bit float instead of a 32-bit float, contrary to the official Netlib specification: https://netlib.org/lapack/explore-html/d0/d16/sdot_8f.html
This caused the results from sdot() under macOS to be essentially garbage.

It's entirely possible that we have the same problem with snrm2().

Perhaps there are too many herbs being consumed near and around Cupertino?

@barracuda156 commented:

@conradsnicta It should be possible to compare cases of using OpenBLAS vs Accelerate (and maybe some alternative BLAS implementations too).

Could you sum up a specific procedure for what needs to be tested?

OpenBLAS upstream is very responsive, if there is an issue there, that could be fixed pretty fast, I believe.

@conradsnicta (Contributor) commented:

@barracuda156 I don't think the bug would be in OpenBLAS, as the developers follow the API specified by Netlib BLAS and LAPACK.

Looking more closely at Armadillo's source code, it has many other workarounds that are active under macOS whenever a BLAS or LAPACK function is meant to return a 32-bit float but erroneously returns a 64-bit float instead. I think the banality of these bugs made me forget just how many of them there were.

Since snrm2() is specified by Netlib to return a 32-bit float, I highly suspect that we're running into the same class of bugs with the Accelerate framework under macOS. (For clarity and posterity, Accelerate framework = Apple's implementation of BLAS and LAPACK).

I've extended Armadillo to also have workarounds for snrm2() and sasum() under macOS. This will be part of the upcoming 12.8.1 release.

I don't really have further bandwidth (nor patience) to keep looking into this.

@rcurtin (Member, Author) commented Feb 26, 2024

Wow, thanks for digging here, and thanks for adding the workaround. I think the workarounds are all that's worthwhile here---it would take a lot of effort to get this in front of the right people in the Accelerate framework to fix it (and probably also to convince them that there is a problem).
