Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestCond.test_nan test failure for latest OpenBLAS #18914

Open
rgommers opened this issue May 5, 2021 · 12 comments
Open

TestCond.test_nan test failure for latest OpenBLAS #18914

rgommers opened this issue May 5, 2021 · 12 comments

Comments

@rgommers
Copy link
Member

rgommers commented May 5, 2021

This is on a macOS system, but @mattip ran into it yesterday as well during a demo, and that was on Linux I believe. I cannot reproduce it on my macOS or Linux setups, nor on Gitpod. All those use openblas from conda-forge, the problem may be coming from another BLAS implementation or be hardware-specific.

=================================== FAILURES ===================================
______________________________ TestCond.test_nan _______________________________

self = <numpy.linalg.tests.test_linalg.TestCond object at 0x13d00e670>

    def test_nan(self):
        # nans should be passed through, not converted to infs
        ps = [None, 1, -1, 2, -2, 'fro']
        p_pos = [None, 1, 2, 'fro']
    
        A = np.ones((2, 2))
        A[0,1] = np.nan
        for p in ps:
>           c = linalg.cond(A, p)

A          = array([[ 1., nan],
       [ 1.,  1.]])
p          = None
p_pos      = [None, 1, 2, 'fro']
ps         = [None, 1, -1, 2, -2, 'fro']
self       = <numpy.linalg.tests.test_linalg.TestCond object at 0x13d00e670>

numpy/linalg/tests/test_linalg.py:777: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
<__array_function__ internals>:5: in cond
    ???
        args       = (array([[ 1., nan],
       [ 1.,  1.]]), None)
        kwargs     = {}
        relevant_args = (array([[ 1., nan],
       [ 1.,  1.]]),)
numpy/linalg/linalg.py:1765: in cond
    s = svd(x, compute_uv=False)
        p          = None
        x          = array([[ 1., nan],
       [ 1.,  1.]])
<__array_function__ internals>:5: in svd
    ???
        args       = (array([[ 1., nan],
       [ 1.,  1.]]),)
        kwargs     = {'compute_uv': False}
        relevant_args = (array([[ 1., nan],
       [ 1.,  1.]]),)
numpy/linalg/linalg.py:1672: in svd
    s = gufunc(a, signature=signature, extobj=extobj)
        _nx        = <module 'numpy' from '/Users/username/Documents/GitHub/numpy/build/testenv/lib/python3.9/site-packages/numpy/__init__.py'>
        a          = array([[ 1., nan],
       [ 1.,  1.]])
        compute_uv = False
        extobj     = [8192, 1536, <function _raise_linalgerror_svd_nonconvergence at 0x117e80670>]
        full_matrices = True
        gufunc     = <ufunc 'svd_n'>
        hermitian  = False
        m          = 2
        n          = 2
        result_t   = <class 'numpy.float64'>
        signature  = 'd->d'
        t          = <class 'numpy.float64'>
        wrap       = <built-in method __array_prepare__ of numpy.ndarray object at 0x13c341510>
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

err = 'invalid value', flag = 8

    def _raise_linalgerror_svd_nonconvergence(err, flag):
>       raise LinAlgError("SVD did not converge")
E       numpy.linalg.LinAlgError: SVD did not converge

err        = 'invalid value'
flag       = 8

numpy/linalg/linalg.py:97: LinAlgError
=========================== short test summary info ============================
FAILED numpy/linalg/tests/test_linalg.py::TestCond::test_nan - numpy.linalg.L...
= 1 failed, 14429 passed, 611 skipped, 1237 deselected, 18 xfailed, 3 xpassed in 363.18s (0:06:03) =
@charris
Copy link
Member

charris commented May 5, 2021

The failure of SVD to converge was a common problem, ISTR it was mostly on the MAC, but I may be wrong.

@hroncok
Copy link
Contributor

hroncok commented May 7, 2021

We see that in Fedora as well.

https://koschei.fedoraproject.org/build/10273521

openblas was updated from 0.3.14 to 0.3.15

@rgommers
Copy link
Member Author

rgommers commented May 8, 2021

I can reproduce this with OpenBLAS 0.3.15 from conda-forge; the test passes with 0.3.12 and 0.3.13. I'll report upstream. gh-18943 xfail's the test for now.

@rgommers
Copy link
Member Author

rgommers commented May 8, 2021

The failure can be reduced to:

>>> a = np.array([[1, np.nan], [1, 1]])
>>> np.linalg.svd(a)
...
LinAlgError: SVD did not converge

@langou
Copy link

langou commented May 12, 2021

Hi,

(1) we changed the behavior of LAPACK for a few functionality to have a direct return with a negative INFO when the input matrix has at least one NaN in it. For example, in your code, for DGESDD, because the input matrix has a NaN, LAPACK is now directly returning with INFO=-4. (INFO=-4 because A is the fourth parameter in the interface of DGESDD.) Negative INFO means that the input is not valid.

(2) I have no idea what the LAPACK behavior for a = np.array([[1, np.nan], [1, 1]]); np.linalg.svd(a) was and what the numpy expected LAPACK to do. I think a direct return with INFO=-4 is clean. Different opinions welcome.

(2) we changed this behavior when we found cheap (free?) opportunity to do in our codes. (I am not sure if we did a full LAPACK code review, or just edited xGEDD. The idea was to do similar changes wherever possible.) Many LAPACK drivers (e.g. xGESDD) starts by computing the norm of A, || A ||, (because we might want to rescale the problem for example), and so, if there is NaN in A, ||A|| is NaN, and so, the new change is to (1) check whether ||A|| is NaN or not and (2) If ||A|| is NaN, we return directly. New behavior.
See: Reference-LAPACK/lapack#471

(3) We will improve the documentation to mention the direct return in xGESDD.

(4) We plan to do similar more changes to other LAPACK routines as we edit. In short the idea is that if we can quickly/cheaply/freely detect a NaN in the input matrix, we will use that information and trigger a quick return.

(5) It is great that numpy has some matrices with NaN in input in its test suite. Wonderful to see.

Julien.

@rgommers
Copy link
Member Author

Thanks for the detailed reply @langou!

have no idea what the LAPACK behavior for a = np.array([[1, np.nan], [1, 1]]); np.linalg.svd(a) was and what the numpy expected LAPACK to do. I think a direct return with INFO=-4 is clean. Different opinions welcome.

The default behavior in NumPy functions is to propagate NaN's. That's the case here as well:

In [1]: a = np.array([[1, np.nan], [1, 1]])

In [2]: np.linalg.svd(a)
Out[2]: 
(array([[nan, nan],
        [nan, nan]]),
 array([nan, nan]),
 array([[nan, nan],
        [nan, nan]]))

For many functions that's desired behavior for end users. For linear algebra functions typically it's not. However, checking all input for nan's is expensive, so numpy.linalg functions don't do it. In contrast, scipy.linalg functions have a check_finite=True keyword, so it will not call LAPACK at all but just raise an exception immediately; this costs ~10% overhead on each function call.

I think the issue is not that LAPACK is doing something unreasonable here I think. It's just tricky to support many versions of different LAPACK libraries if behavior changes without warning.

we changed this behavior when we found cheap (free?) opportunity to do in our codes.

This makes perfect sense.

We will improve the documentation to mention the direct return in xGESDD.

That sounds good.

We plan to do similar more changes to other LAPACK routines as we edit. In short the idea is that if we can quickly/cheaply/freely detect a NaN in the input matrix, we will use that information and trigger a quick return.

I actually like that. Then eventually we can remove the expensive checks from SciPy (if we ever get to a point where LAPACK > 3.9 is the minimum version we support).

@langou
Copy link

langou commented May 15, 2021

In [1]: a = np.array([[1, np.nan], [1, 1]])

In [2]: np.linalg.svd(a)
Out[2]: 
(array([[nan, nan],
        [nan, nan]]),
 array([nan, nan]),
 array([[nan, nan],
        [nan, nan]]))

I see, when numpy propagates, numpy propagates! :)

That makes sense though. In particular for an operation such as SVD.

Side comment: It is interesting to see the various levels of NaN propagation. It depends on the operation. "NaN propagation" could also mean "we are not converting existing NaNs to a non-NaN values". For example, if we do AXPY: y = alpha * x + y with alpha scalar, x and y vectors, and if alpha is zero, and if there is a NaN in x, then "NaN propagation" makes sure that this NaN in x propagates to y. (So then an implementation that checks whether alpha is zero or not, and direct return when alpha is zero would not propagate NaNs.) So, in this AXPY case, NaN propagation is about "not erasing NaNs" as opposed to "creating some" (SVD case).

  • "a non-NaN": a NaNaN. :)

@langou
Copy link

langou commented May 15, 2021

However, checking all input for nan's is expensive, so numpy.linalg functions don't do it. In contrast, scipy.linalg functions have a check_finite=True keyword, so it will not call LAPACK at all but just raise an exception immediately; this costs ~10% overhead on each function call.

Thanks. This is good to know who does what. I think Matlab is also checking for NaNs and Infs in the inputs before processing operations. So Matlab is same as scipy then. LAPACK as the LAPACKE interface. With the LAPACKE interface, users can have LAPACKE check for NaNs and do direct returns as well.

I think the issue is not that LAPACK is doing something unreasonable here I think. It's just tricky to support many versions of different LAPACK libraries if behavior changes without warning.

This is a fair comment.

@rgommers rgommers changed the title TestCond.test_nan test failure on machines TestCond.test_nan test failure for latest OpenBLAS May 16, 2021
@rgommers
Copy link
Member Author

"a non-NaN": a NaNaN. :)

:) I see the next flood of corner case issues coming there:)

I think the issue is not that LAPACK is doing something unreasonable here I think. It's just tricky to support many versions of different LAPACK libraries if behavior changes without warning.

The upstream LAPACK issue can be closed I think. We just need to decide what to do with a case like this:

  1. Document behavior may differ between LAPACK implementations and live with it.
  2. Do the expensive check and always propagate nan's.
  3. Do the expensive check and always raise a LinalgError.

I think it's best to go with (1); doing the expensive check seems like too high a price to pay. We'll then keep this inconsistency for many years; we're at LAPACK >= 3.2.1 for NumPy and >= 3.4.1 for SciPy - upgrading will be slow.

Other opinions?

@casparvl
Copy link

casparvl commented Nov 2, 2021

I ran into this issue as well, but indirectly: the PyTorch test suite compares its own torch.linalg.cond results to np.linag.cond using the following case:

input=np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 0.]])
np.linalg.cond(input,'nuc')

which results in the same "SVD did not converge" error. It's a little bit less direct than inputting an array that already has a NaN though. In this case, this call fails

invx = _umath_linalg.inv(x, signature=signature)
, the error is ignored, and invx is a matrix of NaN's. It is then the second norm on this line
r = norm(x, p, axis=(-2, -1)) * norm(invx, p, axis=(-2, -1))
that causes the "SVD did not converge" error. Note that this only happens for the nuc norm.

@rgommers @langou thanks for a very clear discussion, I understand why it is happening now.

As for the solution: @rgommers to pitch in on your proposed solutions, at first sight I would probably also vote for the most 'performant' approach, i.e. your solution to document it (1). But let me just clarify if I understand that correctly: by solution (1) you mean to put in the np.linalg.cond docs that if the input contains a NaN, the output is undefined (rather than NaN propagation) and may or may not raise an error? I guess, seeing my case above, it should be added that the same applies for non-invertible matrices, which is a less obvious case... (and there maybe other functions in linalg in which intermediate calls result in NaN and therefore trigger this behaviour)

Also, I guess that since the behavior is then undefined, it makes sense to remove the test.nan from the test suite or keep it but test for all possible expected outputs (i.e. either NaN propagation or an "SVD did not converge" error)?

I'm also not entirely sure what should be done about cases such as the one in the PyTorch test suite. I guess maybe they should adapt their input to an invertable matrix, without NaN's, since undefined behaviour is not something one wants in a test suite :)

@rgommers
Copy link
Member Author

(1) you mean to put in the np.linalg.cond docs that if the input contains a NaN, the output is undefined (rather than NaN propagation) and may or may not raise an error?

Yes indeed.

Also, I guess that since the behavior is then undefined, it makes sense to remove the test.nan from the test suite or keep it but test for all possible expected outputs (i.e. either NaN propagation or an "SVD did not converge" error)?

That does make sense, yes. The test suite should pass also for configs we don't have in CI.

I'm also not entirely sure what should be done about cases such as the one in the PyTorch test suite. I guess maybe they should adapt their input to an invertable matrix, without NaN's, since undefined behaviour is not something one wants in a test suite :)

Agreed.

@IvanYashchuk
Copy link
Contributor

New MKL 2022.0 seem to use the latest Reference LAPACK and now np.linalg.cond and np.linalg.norm paths that depend on SVD have different behavior depending on MKL version:

# With MKL 2021.4.0
In [1]: import numpy as np

In [2]: a = np.full((3, 3), float('nan'))

In [3]: np.linalg.svd(a, compute_uv=False)

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

Intel MKL ERROR: Parameter 5 was incorrect on entry to DLASCL.
Out[3]: array([nan, nan, nan])

# With new MKL 2022
In [1]: import numpy as np

In [2]: a = np.full((3, 3), float('nan'))

In [3]: np.linalg.svd(a, compute_uv=False)
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-3-48fccc58801d> in <module>
----> 1 np.linalg.svd(a, compute_uv=False)

~/.conda/envs/pytorch-cuda-dev/lib/python3.9/site-packages/numpy/core/overrides.py in svd(*args, **kwargs)

~/.conda/envs/pytorch-cuda-dev/lib/python3.9/site-packages/numpy/linalg/linalg.py in svd(a, full_matrices, compute_uv, hermitian)
   1658
   1659         signature = 'D->d' if isComplexType(t) else 'd->d'
-> 1660         s = gufunc(a, signature=signature, extobj=extobj)
   1661         s = s.astype(_realType(result_t), copy=False)
   1662         return s

~/.conda/envs/pytorch-cuda-dev/lib/python3.9/site-packages/numpy/linalg/linalg.py in _raise_linalgerror_svd_nonconvergence(err, flag)
     95
     96 def _raise_linalgerror_svd_nonconvergence(err, flag):
---> 97     raise LinAlgError("SVD did not converge")
     98
     99 def _raise_linalgerror_lstsq(err, flag):

LinAlgError: SVD did not converge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants