TestCond.test_nan test failure for latest OpenBLAS #18914
The failure of SVD to converge was a common problem; ISTR it was mostly on macOS, but I may be wrong.
We see that in Fedora as well: https://koschei.fedoraproject.org/build/10273521. openblas was updated from 0.3.14 to 0.3.15.
I can reproduce this with OpenBLAS.
The failure can be reduced to:
Hi,

(1) We changed the behavior of LAPACK for a few routines to return directly with a negative INFO when the input matrix has at least one NaN in it. For example, in your code, for DGESDD, because the input matrix has a NaN, LAPACK now returns directly with INFO=-4. (INFO=-4 because A is the fourth parameter in the interface of DGESDD.) A negative INFO means that the input is not valid.

(2) We made this change when we found a cheap (free?) opportunity to do so in our codes. (I am not sure whether we did a full LAPACK code review or just edited xGESDD; the idea was to make similar changes wherever possible.) Many LAPACK drivers (e.g. xGESDD) start by computing the norm of A, ||A|| (because we might want to rescale the problem, for example), and so, if there is a NaN in A, ||A|| is NaN. The new behavior is to (1) check whether ||A|| is NaN and (2) if it is, return directly.

(3) We will improve the documentation to mention the direct return in xGESDD.

(4) We plan to make similar changes to other LAPACK routines as we edit them. In short, the idea is that if we can quickly/cheaply/freely detect a NaN in the input matrix, we will use that information and trigger a quick return.

(5) It is great that numpy has some matrices with NaN input in its test suite. Wonderful to see.

Julien.
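The norm-based check described above can be sketched in a few lines. This is a hypothetical illustration in NumPy, not actual LAPACK code; `nan_quick_check` is a made-up name:

```python
import numpy as np

def nan_quick_check(a):
    """Sketch of the check described above: many LAPACK drivers already
    compute a norm of A before factorizing, so testing that norm for NaN
    detects any NaN in A essentially for free.
    (Hypothetical helper, not an actual LAPACK/NumPy function.)"""
    norm = np.max(np.abs(a))  # analogous to DLANGE with the max-abs norm
    return bool(np.isnan(norm))  # any NaN in A makes the norm NaN

a = np.array([[1.0, np.nan], [1.0, 1.0]])
print(nan_quick_check(a))  # True: the driver would return early with INFO < 0
```

The point is that no extra pass over the matrix is needed: the norm is computed anyway, so the NaN test costs a single `isnan` on a scalar.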
Thanks for the detailed reply @langou!
The default behavior in NumPy functions is to propagate NaN's. That's the case here as well:

```python
In [1]: a = np.array([[1, np.nan], [1, 1]])

In [2]: np.linalg.svd(a)
Out[2]:
(array([[nan, nan],
        [nan, nan]]),
 array([nan, nan]),
 array([[nan, nan],
        [nan, nan]]))
```

For many functions that's desired behavior for end users; for linear algebra functions it typically isn't. However, checking all input for NaN's is expensive, so I don't think the issue is that LAPACK is doing something unreasonable here. It's just tricky to support many versions of different LAPACK libraries if behavior changes without warning.
This makes perfect sense.
That sounds good.
I actually like that. Then eventually we can remove the expensive checks from SciPy (if we ever get to a point where LAPACK > 3.9 is the minimum version we support).
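For reference, the "expensive check" being discussed looks roughly like this. This is a sketch in the spirit of SciPy's `check_finite` handling; `checked_svd` is a hypothetical wrapper, not an existing NumPy/SciPy API:

```python
import numpy as np

def checked_svd(a):
    """Hypothetical wrapper: validate the input before handing it to LAPACK,
    so the error is raised consistently regardless of LAPACK version."""
    a = np.asarray(a, dtype=float)
    # O(m*n) scan of the whole array -- this is the cost being discussed
    if not np.isfinite(a).all():
        raise ValueError("array must not contain infs or NaNs")
    return np.linalg.svd(a)
```

The scan is cheap relative to the O(n^3) factorization for large square matrices, but it is pure overhead for the common all-finite case, which is why dropping it once LAPACK does the check is attractive.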
I see, when numpy propagates, numpy propagates! :) That makes sense though, in particular for an operation such as SVD.

Side comment: it is interesting to see the various levels of NaN propagation; it depends on the operation. "NaN propagation" could also mean "we are not converting existing NaNs to non-NaN values". For example, take AXPY: y = alpha * x + y, with alpha a scalar and x and y vectors. If alpha is zero and there is a NaN in x, then "NaN propagation" makes sure that this NaN in x propagates to y. (So an implementation that checks whether alpha is zero and returns directly when it is would not propagate NaNs.) In this AXPY case, NaN propagation is about "not erasing NaNs", as opposed to "creating some" (the SVD case).
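The AXPY point above can be made concrete with a small sketch (hypothetical helper names; plain NumPy stands in for a BLAS implementation):

```python
import numpy as np

def axpy_no_shortcut(alpha, x, y):
    # Always performs alpha * x + y, so a NaN in x reaches y even when alpha == 0.
    return alpha * x + y

def axpy_shortcut(alpha, x, y):
    # A tempting optimization that breaks NaN propagation: skipping the
    # update when alpha == 0 silently erases the NaN that x should contribute.
    if alpha == 0.0:
        return y
    return alpha * x + y

x = np.array([1.0, np.nan])
y = np.array([2.0, 3.0])
print(axpy_no_shortcut(0.0, x, y))  # [ 2. nan] -- NaN propagates
print(axpy_shortcut(0.0, x, y))     # [2. 3.]   -- NaN silently lost
```

Since 0.0 * NaN is NaN under IEEE 754, the unshortcut version propagates while the shortcut does not, which is exactly the distinction drawn above.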
Thanks, this is good to know who does what. I think Matlab also checks for NaNs and Infs in the inputs before processing operations, so Matlab is the same as SciPy then. LAPACK has the LAPACKE interface; with LAPACKE, users can have the library check for NaNs and do direct returns as well.
This is a fair comment.
:) I see the next flood of corner-case issues coming there :)
The upstream LAPACK issue can be closed I think. We just need to decide what to do with a case like this:
I think it's best to go with (1); doing the expensive check seems like too high a price to pay. We'll then keep this inconsistency for many years; we're at LAPACK >= 3.2.1 for NumPy and >= 3.4.1 for SciPy, and upgrading will be slow. Other opinions?
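To illustrate what living with option (1) looks like at a call site, here is a hedged sketch of a shim that maps both behaviors (NaN singular values from older LAPACK, `LinAlgError` from newer Reference LAPACK) to the same result. `svd_vals_compat` is a hypothetical name, not a proposed API:

```python
import numpy as np

def svd_vals_compat(a):
    """Hypothetical shim: older LAPACK returns NaN singular values for a NaN
    input; newer Reference LAPACK makes the call fail, surfacing in NumPy as
    LinAlgError("SVD did not converge"). Map both outcomes to NaN output."""
    a = np.asarray(a, dtype=float)
    try:
        return np.linalg.svd(a, compute_uv=False)
    except np.linalg.LinAlgError:
        if np.isnan(a).any():
            # Emulate the old NaN-propagation behavior.
            return np.full(min(a.shape), np.nan)
        raise  # a genuine convergence failure on finite input
```

The `isnan` scan only runs on the error path, so the common all-finite case pays nothing extra.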
I ran into this issue as well, but indirectly: the PyTorch test suite compares its own
which results in the same "SVD did not converge" error. It's a little bit less direct than inputting an array that already has a NaN though. In this case, the failing call is at line 1779 in b235f9e, where invx is a matrix of NaN's. It is then the second norm on line 1780 in b235f9e, the nuc norm, that fails.
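The indirect failure path can be illustrated without the PyTorch test itself (a minimal sketch; the all-NaN `invx` stands in for the inverse computed in the test):

```python
import numpy as np

# Sketch of the indirect failure path: the condition-number test computes
# norm(x, p) * norm(inv(x), p); when inv(x) comes out as all-NaN, the
# nuclear-norm step runs an SVD on a NaN matrix. With older LAPACK that
# yields NaN; with newer Reference LAPACK the SVD call errors out instead.
invx = np.full((2, 2), np.nan)  # stand-in for the NaN inverse from the test
try:
    result = np.linalg.norm(invx, 'nuc')  # nuclear norm is computed via SVD
except np.linalg.LinAlgError:
    result = np.nan  # newer LAPACK path: "SVD did not converge"
```

Either way the mathematically sensible answer is NaN; the difference is only whether the caller has to catch an exception to get there.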
@rgommers @langou thanks for a very clear discussion, I understand why it is happening now.

As for the solution: @rgommers, to pitch in on your proposed solutions, at first sight I would probably also vote for the most 'performant' approach, i.e. your solution (1) to document it. But let me just clarify if I understand that correctly: by solution (1) you mean to put in the

Also, I guess that since the behavior is then undefined, it makes sense to remove the

I'm also not entirely sure what should be done about cases such as the one in the
Yes indeed.
That does make sense, yes. The test suite should pass also for configs we don't have in CI.
Agreed.
New MKL 2022.0 seems to use the latest Reference LAPACK, and now:

```python
# With MKL 2021.4.0
In [1]: import numpy as np
In [2]: a = np.full((3, 3), float('nan'))
In [3]: np.linalg.svd(a, compute_uv=False)
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
Intel MKL ERROR: Parameter 5 was incorrect on entry to DLASCL.
Out[3]: array([nan, nan, nan])

# With new MKL 2022
In [1]: import numpy as np
In [2]: a = np.full((3, 3), float('nan'))
In [3]: np.linalg.svd(a, compute_uv=False)
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-3-48fccc58801d> in <module>
----> 1 np.linalg.svd(a, compute_uv=False)

~/.conda/envs/pytorch-cuda-dev/lib/python3.9/site-packages/numpy/core/overrides.py in svd(*args, **kwargs)

~/.conda/envs/pytorch-cuda-dev/lib/python3.9/site-packages/numpy/linalg/linalg.py in svd(a, full_matrices, compute_uv, hermitian)
   1658
   1659     signature = 'D->d' if isComplexType(t) else 'd->d'
-> 1660     s = gufunc(a, signature=signature, extobj=extobj)
   1661     s = s.astype(_realType(result_t), copy=False)
   1662     return s

~/.conda/envs/pytorch-cuda-dev/lib/python3.9/site-packages/numpy/linalg/linalg.py in _raise_linalgerror_svd_nonconvergence(err, flag)
     95
     96 def _raise_linalgerror_svd_nonconvergence(err, flag):
---> 97     raise LinAlgError("SVD did not converge")
     98
     99 def _raise_linalgerror_lstsq(err, flag):

LinAlgError: SVD did not converge
```
This is on a macOS system, but @mattip ran into it yesterday as well during a demo, and that was on Linux I believe. I cannot reproduce it on my macOS or Linux setups, nor on Gitpod. All of those use openblas from conda-forge, so the problem may be coming from another BLAS implementation or be hardware-specific.