
Fix negative scaling factors for L-BFGS #392

Merged · 4 commits · Feb 13, 2024

Conversation

@rcurtin (Member) commented Feb 9, 2024

While investigating #390, I came across an interesting finding: L-BFGS was computing negative scaling values for the Hessian scaling step. (These should never be negative!) Specifically, the gradient norm was being computed as negative, and this then snowballed into a convergence issue. As it turns out, the convergence issue only seems to manifest on non-x86_64 architectures when using arma::fmat, likely due to numerical precision specifics and other things along those lines. Here's an example run of the Johnson844LovaszThetaFMatSDP test, printing the negative gradient norms:

$ ./ensmallen_tests Johnson844LovaszThetaFMatSDP
ensmallen version: 2.21.0 (Bent Antenna)
armadillo version: 9.800.1 (Horizon Scraper)
random seed: 0
Filters: Johnson844LovaszThetaFMatSDP
Scaling factor less than zero! -0.0215062
Scaling factor less than zero! -2.41463e-09
===============================================================================
All tests passed (1122 assertions in 1 test case)

So, the first question is: @conradsnicta, do you consider it a responsibility of Armadillo to provide a nonnegative norm? If so, I can add an extra if (norm < 0) return 0 type check to the norm() function and open an MR. If not, I'll just keep that in mind and check throughout ensmallen to make sure we don't depend on norm() returning nonnegative values.

Inside of ensmallen, I chose to fix the issue by avoiding division by the gradient norm (or other quantities) when they are very small or negative. On the M1 MacBook I have sitting around, this fixes the test failure and convergence succeeds.

I also fixed some compilation warnings, and made AugLagrangian not output the entire set of coordinates when ENS_PRINT_INFO is enabled (the coordinates could be huge!).

When the norm of the gradient gets very small (but nonzero), we can end up with
very large scaling factors.  To avoid this, we now use a tolerance before
dividing.  In addition, this tolerance handles when the computed norm is
negative (which can happen due to precision issues).
@conradsnicta (Contributor) commented:

@rcurtin That's a weird one. I can add a workaround to Armadillo so that norm is always >= 0, but that would still mean that ensmallen and mlpack have to take into account older versions of Armadillo without the workaround.


@mlpack-bot (bot) left a comment:


Second approval provided automatically after 24 hours. 👍

@rcurtin (Member, Author) commented Feb 13, 2024

@rcurtin That's a weird one. I can add a workaround to Armadillo so that norm is always >= 0, but that would still mean that ensmallen and mlpack have to take into account older versions of Armadillo without the workaround.

@conradsnicta Up to you---maybe it would be nice to do that from the Armadillo perspective, but you are right that ensmallen and mlpack would still need to take it into account regardless.

@rcurtin rcurtin merged commit 0c64b08 into mlpack:master Feb 13, 2024
4 checks passed
@rcurtin rcurtin deleted the lbfgs-fix-negative-scaling-factors branch February 13, 2024 14:21
@conradsnicta (Contributor) commented:

@rcurtin @barracuda156

TLDR: I suspect the issue is ultimately with the BLAS library (Accelerate framework?) provided on arm64/aarch64 (Apple M1), which in turn means this might be a bug in macOS.

Long story below.

I checked the relevant code paths for the norm() function in Armadillo, and I'm struggling to see how the norm can be negative.

In most cases (non-tiny vectors and matrices), Armadillo ends up calling either snrm2() or dnrm2() from BLAS, depending on whether the element type is 32-bit or 64-bit float.

In case the result from snrm2() or dnrm2() is zero, Armadillo assumes there is a potential underflow and uses a robustified norm calculation. This uses absolute values and squared values during calculation, so a negative value is unlikely to appear there.

This implies that the negative value comes out of snrm2() or dnrm2(), as provided by the BLAS implementation (OpenBLAS, or the Accelerate framework) on arm64/aarch64 (Apple M1).

A while back I encountered weirdness with the sdot() function under the Accelerate framework. If I recall correctly, sdot() was returning a 64-bit float instead of a 32-bit float, contrary to the official Netlib specification: https://netlib.org/lapack/explore-html/d0/d16/sdot_8f.html
This caused the results from sdot() under macOS to be essentially garbage.

It's entirely possible that we have the same problem with snrm2().

Perhaps there are too many herbs being consumed near and around Cupertino?

@barracuda156 commented:

@conradsnicta It should be possible to compare cases of using OpenBLAS vs Accelerate (and maybe some alternative BLAS implementations too).

Could you sum up a specific procedure for what needs to be tested?

OpenBLAS upstream is very responsive, if there is an issue there, that could be fixed pretty fast, I believe.

@conradsnicta (Contributor) commented:

@barracuda156 I don't think the bug would be in OpenBLAS, as the developers follow the API specified by Netlib BLAS and LAPACK.

Looking more closely at Armadillo's source code, it has many other workarounds that are active under macOS whenever a BLAS or LAPACK function is meant to return a 32-bit float but erroneously returns a 64-bit float instead. I think the banality of these bugs made me forget just how many of them there were.

Since snrm2() is specified by Netlib to return a 32-bit float, I highly suspect that we're running into the same class of bugs with the Accelerate framework under macOS. (For clarity and posterity, Accelerate framework = Apple's implementation of BLAS and LAPACK).

I've extended Armadillo to also have workarounds for snrm2() and sasum() under macOS. This will be part of the upcoming 12.8.1 release.

I don't really have further bandwidth (nor patience) to keep looking into this.

@rcurtin (Member, Author) commented Feb 26, 2024

Wow, thanks for digging here, and thanks for adding the workaround. I think the workarounds are all that's worthwhile here---it would take a lot of effort to get this in front of the right people in the Accelerate framework to fix it (and probably also to convince them that there is a problem).
