TestCond.test_nan test failure for latest OpenBLAS #18914
The failure of SVD to converge was a common problem; ISTR it was mostly on macOS, but I may be wrong.
We see that in Fedora as well: https://koschei.fedoraproject.org/build/10273521. openblas was updated from 0.3.14 to 0.3.15.
I can reproduce this with OpenBLAS.
The failure can be reduced to:
Hi,

(1) We changed the behavior of LAPACK for a few routines to return directly with a negative INFO when the input matrix has at least one NaN in it. For example, in your code, for DGESDD, because the input matrix has a NaN, LAPACK now returns directly with INFO=-4. (INFO=-4 because A is the fourth parameter in the interface of DGESDD.) A negative INFO means that the input is not valid.

(2) We made this change when we found a cheap (free?) opportunity to do so in our codes. (I am not sure whether we did a full LAPACK code review or just edited xGESDD; the idea was to make similar changes wherever possible.) Many LAPACK drivers (e.g. xGESDD) start by computing the norm of A, ||A|| (because we might want to rescale the problem, for example), and so, if there is a NaN in A, ||A|| is NaN. The new behavior is to (1) check whether ||A|| is NaN and (2) if it is, return directly.

(3) We will improve the documentation to mention the direct return in xGESDD.

(4) We plan to make similar changes to other LAPACK routines as we edit them. In short, the idea is that if we can quickly/cheaply/freely detect a NaN in the input matrix, we will use that information and trigger a quick return.

(5) It is great that numpy has some matrices with NaN input in its test suite. Wonderful to see.

Julien.
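The norm-based check described above can be sketched in a few lines. This is a hypothetical illustration in NumPy, not actual LAPACK code; `nan_quick_check` is a made-up name:

```python
import numpy as np

def nan_quick_check(a):
    """Sketch of the check described above: many LAPACK drivers already
    compute a norm of A before factorizing, so testing that norm for NaN
    detects any NaN in A essentially for free.
    (Hypothetical helper, not an actual LAPACK/NumPy function.)"""
    norm = np.max(np.abs(a))  # analogous to DLANGE with the max-abs norm
    return bool(np.isnan(norm))  # any NaN in A makes the norm NaN

a = np.array([[1.0, np.nan], [1.0, 1.0]])
print(nan_quick_check(a))  # True: the driver would return early with INFO < 0
```

The point is that no extra pass over the matrix is needed: the norm is computed anyway, so the NaN test costs a single `isnan` on a scalar.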
Thanks for the detailed reply @langou!
The default behavior in NumPy functions is to propagate NaN's. That's the case here as well:

```python
In [1]: a = np.array([[1, np.nan], [1, 1]])

In [2]: np.linalg.svd(a)
Out[2]:
(array([[nan, nan],
        [nan, nan]]),
 array([nan, nan]),
 array([[nan, nan],
        [nan, nan]]))
```

For many functions that's desired behavior for end users; for linear algebra functions it typically isn't. However, checking all input for NaN's is expensive, so I don't think the issue is that LAPACK is doing something unreasonable here. It's just tricky to support many versions of different LAPACK libraries if behavior changes without warning.
This makes perfect sense.
That sounds good.
I actually like that. Then eventually we can remove the expensive checks from SciPy (if we ever get to a point where LAPACK > 3.9 is the minimum version we support).
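For reference, the "expensive check" being discussed looks roughly like this. This is a sketch in the spirit of SciPy's `check_finite` handling; `checked_svd` is a hypothetical wrapper, not an existing NumPy/SciPy API:

```python
import numpy as np

def checked_svd(a):
    """Hypothetical wrapper: validate the input before handing it to LAPACK,
    so the error is raised consistently regardless of LAPACK version."""
    a = np.asarray(a, dtype=float)
    # O(m*n) scan of the whole array -- this is the cost being discussed
    if not np.isfinite(a).all():
        raise ValueError("array must not contain infs or NaNs")
    return np.linalg.svd(a)
```

The scan is cheap relative to the O(n^3) factorization for large square matrices, but it is pure overhead for the common all-finite case, which is why dropping it once LAPACK does the check is attractive.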
I see, when numpy propagates, numpy propagates! :) That makes sense though, in particular for an operation such as SVD.

Side comment: it is interesting to see the various levels of NaN propagation; it depends on the operation. "NaN propagation" could also mean "we are not converting existing NaNs to non-NaN values". For example, take AXPY: y = alpha * x + y, with alpha a scalar and x and y vectors. If alpha is zero and there is a NaN in x, then "NaN propagation" makes sure that this NaN in x propagates to y. (So an implementation that checks whether alpha is zero and returns directly when it is would not propagate NaNs.) In this AXPY case, NaN propagation is about "not erasing NaNs", as opposed to "creating some" (the SVD case).
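The AXPY point above can be made concrete with a small sketch (hypothetical helper names; plain NumPy stands in for a BLAS implementation):

```python
import numpy as np

def axpy_no_shortcut(alpha, x, y):
    # Always performs alpha * x + y, so a NaN in x reaches y even when alpha == 0.
    return alpha * x + y

def axpy_shortcut(alpha, x, y):
    # A tempting optimization that breaks NaN propagation: skipping the
    # update when alpha == 0 silently erases the NaN that x should contribute.
    if alpha == 0.0:
        return y
    return alpha * x + y

x = np.array([1.0, np.nan])
y = np.array([2.0, 3.0])
print(axpy_no_shortcut(0.0, x, y))  # [ 2. nan] -- NaN propagates
print(axpy_shortcut(0.0, x, y))     # [2. 3.]   -- NaN silently lost
```

Since 0.0 * NaN is NaN under IEEE 754, the unshortcut version propagates while the shortcut does not, which is exactly the distinction drawn above.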
Thanks, this is good to know who does what. I think Matlab also checks for NaNs and Infs in the inputs before processing operations, so Matlab is the same as SciPy then. LAPACK has the LAPACKE interface; with LAPACKE, users can have the library check for NaNs and do direct returns as well.
This is a fair comment.
:) I see the next flood of corner-case issues coming there :)
The upstream LAPACK issue can be closed I think. We just need to decide what to do with a case like this:
I think it's best to go with (1); doing the expensive check seems like too high a price to pay. We'll then keep this inconsistency for many years; we're at LAPACK >= 3.2.1 for NumPy and >= 3.4.1 for SciPy, and upgrading will be slow. Other opinions?
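To illustrate what living with option (1) looks like at a call site, here is a hedged sketch of a shim that maps both behaviors (NaN singular values from older LAPACK, `LinAlgError` from newer Reference LAPACK) to the same result. `svd_vals_compat` is a hypothetical name, not a proposed API:

```python
import numpy as np

def svd_vals_compat(a):
    """Hypothetical shim: older LAPACK returns NaN singular values for a NaN
    input; newer Reference LAPACK makes the call fail, surfacing in NumPy as
    LinAlgError("SVD did not converge"). Map both outcomes to NaN output."""
    a = np.asarray(a, dtype=float)
    try:
        return np.linalg.svd(a, compute_uv=False)
    except np.linalg.LinAlgError:
        if np.isnan(a).any():
            # Emulate the old NaN-propagation behavior.
            return np.full(min(a.shape), np.nan)
        raise  # a genuine convergence failure on finite input
```

The `isnan` scan only runs on the error path, so the common all-finite case pays nothing extra.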
I ran into this issue as well, but indirectly: the PyTorch test suite compares its own
which results in the same "SVD did not converge" error. It's a little bit less direct than inputting an array that already has a NaN though. In this case, the failing call is at line 1779 in b235f9e, where invx is a matrix of NaN's. It is then the second norm on line 1780 in b235f9e, the nuc norm, that fails.
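The indirect failure path can be illustrated without the PyTorch test itself (a minimal sketch; the all-NaN `invx` stands in for the inverse computed in the test):

```python
import numpy as np

# Sketch of the indirect failure path: the condition-number test computes
# norm(x, p) * norm(inv(x), p); when inv(x) comes out as all-NaN, the
# nuclear-norm step runs an SVD on a NaN matrix. With older LAPACK that
# yields NaN; with newer Reference LAPACK the SVD call errors out instead.
invx = np.full((2, 2), np.nan)  # stand-in for the NaN inverse from the test
try:
    result = np.linalg.norm(invx, 'nuc')  # nuclear norm is computed via SVD
except np.linalg.LinAlgError:
    result = np.nan  # newer LAPACK path: "SVD did not converge"
```

Either way the mathematically sensible answer is NaN; the difference is only whether the caller has to catch an exception to get there.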
@rgommers @langou thanks for a very clear discussion, I understand why it is happening now.

As for the solution: @rgommers, to pitch in on your proposed solutions, at first sight I would probably also vote for the most 'performant' approach, i.e. your solution (1) to document it. But let me just clarify if I understand that correctly: by solution (1) you mean to put in the

Also, I guess that since the behavior is then undefined, it makes sense to remove the

I'm also not entirely sure what should be done about cases such as the one in the
Yes indeed.
That does make sense, yes. The test suite should pass also for configs we don't have in CI.
Agreed.
New MKL 2022.0 seems to use the latest Reference LAPACK, and now:

```python
# With MKL 2021.4.0
In [1]: import numpy as np
In [2]: a = np.full((3, 3), float('nan'))
In [3]: np.linalg.svd(a, compute_uv=False)
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 5 was incorrect on entry to DLASCL.
Out[3]: array([nan, nan, nan])

# With new MKL 2022
In [1]: import numpy as np
In [2]: a = np.full((3, 3), float('nan'))
In [3]: np.linalg.svd(a, compute_uv=False)
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-3-48fccc58801d> in <module>
----> 1 np.linalg.svd(a, compute_uv=False)

~/.conda/envs/pytorch-cuda-dev/lib/python3.9/site-packages/numpy/core/overrides.py in svd(*args, **kwargs)

~/.conda/envs/pytorch-cuda-dev/lib/python3.9/site-packages/numpy/linalg/linalg.py in svd(a, full_matrices, compute_uv, hermitian)
   1658
   1659     signature = 'D->d' if isComplexType(t) else 'd->d'
-> 1660     s = gufunc(a, signature=signature, extobj=extobj)
   1661     s = s.astype(_realType(result_t), copy=False)
   1662     return s

~/.conda/envs/pytorch-cuda-dev/lib/python3.9/site-packages/numpy/linalg/linalg.py in _raise_linalgerror_svd_nonconvergence(err, flag)
     95
     96 def _raise_linalgerror_svd_nonconvergence(err, flag):
---> 97     raise LinAlgError("SVD did not converge")
     98
     99 def _raise_linalgerror_lstsq(err, flag):

LinAlgError: SVD did not converge
```
This is on a macOS system, but @mattip ran into it yesterday as well during a demo, and that was on Linux I believe. I cannot reproduce it on my macOS or Linux setups, nor on Gitpod. All of those use openblas from conda-forge, so the problem may be coming from another BLAS implementation or be hardware-specific.