Fix negative scaling factors for L-BFGS #392
Conversation
When the norm of the gradient gets very small (but nonzero), we can end up with very large scaling factors. To avoid this, we now check against a tolerance before dividing. In addition, this tolerance also handles the case where the computed norm is negative (which can happen due to precision issues).
@rcurtin That's a weird one. I can add a workaround to Armadillo so that norm is always >= 0, but that would still mean that ensmallen and mlpack have to take into account older versions of Armadillo without the workaround.
Second approval provided automatically after 24 hours. 👍
@conradsnicta Up to you---maybe it would be nice to do that from the Armadillo perspective, but you are right that ensmallen and mlpack would still need to take it into account regardless.
TLDR: I suspect the issue is ultimately with the BLAS library (Accelerate framework?) provided on arm64/aarch64 (Apple M1), which in turn means this might be a bug in macOS. Long story below.

I checked the relevant code paths for the norm() function in Armadillo, and I'm struggling to see how the norm can be negative. In most cases (non-tiny vectors and matrices), Armadillo ends up calling either snrm2() or dnrm2() from BLAS, depending on whether the element type is a 32-bit or 64-bit float. In case the result from snrm2() or dnrm2() is zero, Armadillo assumes there is a potential underflow and uses a robustified norm calculation. This uses absolute values and squared values during calculation, so a negative value is unlikely to appear there.

This implies that the negative value comes out of snrm2() or dnrm2(), as provided by BLAS (or OpenBLAS, or the Accelerate framework) on arm64/aarch64 (Apple M1). A while back I encountered weirdness with the sdot() function under the Accelerate framework. If I recall correctly, sdot() was returning a 64-bit float instead of a 32-bit float, contrary to the official Netlib specification: https://netlib.org/lapack/explore-html/d0/d16/sdot_8f.html

It's entirely possible that we have the same problem with snrm2(). Perhaps there are too many herbs being consumed near and around Cupertino?
@conradsnicta It should be possible to compare cases of using OpenBLAS vs. Accelerate (and maybe some alternative BLAS implementations too). Could you summarize a specific procedure for what needs to be tested? OpenBLAS upstream is very responsive; if there is an issue there, it could be fixed pretty quickly, I believe.
@barracuda156 I don't think the bug would be in OpenBLAS, as the developers follow the API specified by Netlib BLAS and LAPACK. Looking more closely at Armadillo's source code, it has many other workarounds that are active under macOS whenever a BLAS or LAPACK function is meant to return a 32-bit float but erroneously returns a 64-bit float instead. I think the banality of these bugs made me forget just how many of them there were.

Since snrm2() is specified by Netlib to return a 32-bit float, I highly suspect that we're running into the same class of bugs with the Accelerate framework under macOS. (For clarity and posterity, the Accelerate framework is Apple's implementation of BLAS and LAPACK.)

I've extended Armadillo to also have workarounds for snrm2() and sasum() under macOS. This will be part of the upcoming 12.8.1 release. I don't really have further bandwidth (nor patience) to keep looking into this.
Wow, thanks for digging here, and thanks for adding the workaround. I think the workarounds are all that's worthwhile here---it would take a lot of effort to get this in front of the right people in the Accelerate framework to fix it (and probably also to convince them that there is a problem). |
While investigating #390, I came across an interesting finding: L-BFGS was computing negative scaling values for the Hessian scaling step. (These should not be negative!) Specifically, the gradient norm was being computed as negative, and this then snowballed into a convergence issue. As it turns out, the convergence issue only seems to manifest on non-x86_64 architectures when using `arma::fmat`, likely due to numerical precision specifics and other things along those lines. Here's an example run of the `Johnson844LovaszThetaFMatSDP` test, printing negative gradient norms:

So, first question is: @conradsnicta, do you consider it a responsibility of Armadillo to provide a nonnegative norm? If so, I can add an extra `if (norm < 0) return 0` type check to the `norm()` function and open an MR. If not, I'll just keep that in mind and check throughout ensmallen to make sure we don't depend on `norm()` returning nonnegative values.

Inside of ensmallen, I chose to fix the issue by avoiding division by the gradient norm (or other quantities) when they are very small or negative. On the M1 MacBook I have sitting around, this fixes the test failure and convergence succeeds.
I also fixed some compilation warnings, and made `AugLagrangian` not output the entire set of coordinates when `ENS_PRINT_INFO` is enabled (the coordinates could be huge!).