Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect behavior with numpy array equality against a scalar unicode string #7859

Open
Anthony-C31 opened this issue Feb 23, 2022 · 11 comments
Labels
bug - incorrect behavior Bugs: incorrect behavior

Comments

@Anthony-C31
Copy link

Anthony-C31 commented Feb 23, 2022

Hello,
I have a different behavior when executing the following code with or with numba:

import numpy as np
import numba as nb
table = np.zeros(2, dtype=([('field1', np.uint8), ('field2', '<U1')]))
table[0] = (1, 'A')
table[1] = (2, 'B')
@nb.njit
def foo(table):
    print(table['field1'] == 1)
    print(table['field2'] == 'A')
    return
foo(table)

Without numba: (prints two vectors of boolean, OK)
[ True False]
[ True False]

With numba: (second print returns just a boolean, NOK)
[ True False]
False

Thanks for your help!
Anthony

@sklam
Copy link
Member

sklam commented Feb 23, 2022

Bug confirmed. The problem is in the second ==. It's currently typed as (unaligned array([unichr x 1], 1d, A), unicode_type) -> Literal[bool](False). Need to check which definition it has selected.

@sklam sklam added the bug - incorrect behavior Bugs: incorrect behavior label Feb 23, 2022
@sklam sklam changed the title Incorrect behavior with numpy structured array Incorrect behavior with numpy array equality against a scalar unicode string Feb 24, 2022
@sklam
Copy link
Member

sklam commented Feb 24, 2022

Debug summary:

  • Current problem occurs because the following unicode equality is taken:
    elif a_unicode ^ b_unicode:
    # one of the things is unicode, everything compares False
    def eq_impl(a, b):
    return False
    This definition is hard-wired to False because it thinks it is comparing unicode to non-unicode.
  • What should happen is the Array __eq__ should have recognized this.
  • Related problem np.asarray("A") is not supported.

Minimal reproducer:

import numpy as np
import numba as nb


@nb.njit
def foo(a):
    return a == 'A'

a = np.array(['A', 'B'], dtype=np.dtype('<U1'))
x = foo(a)
y = foo.py_func(a)
np.testing.assert_equal(x, y)

@dlee992
Copy link
Contributor

dlee992 commented Apr 12, 2022

I would like to contribute a bug fix for this issue, if you haven't started working on this. I have some experience about Numba source code when I develop SDC. @gmarkall @sklam

@dlee992
Copy link
Contributor

dlee992 commented Apr 28, 2022

Hi, @sklam @stuartarchibald . I am trying to fix this issue. Right now, I found two string-related types, CharSeq and UnicodeCharSeq defined for numpy array, and Unicode type defined in a separate file. Does Numba have any method to convert_to/from between these types? e.g., convert Unicode to UnicodeCharSeq.

I think we need to add some conditions to handle SS->?/UU->? in method ufunc_find_matching_loop. And add some codegen methods for equal operation between these types, similar with method int_eq_impl(context, builder, sig, args).

I guess complete supporting for ufunc with string perhaps needs many code additions/changes. I will firstly propose one PR to only focus on fixing this issue. Then, try to fullfil all related parts with this general problem.

Any useful info for me?

@sklam
Copy link
Member

sklam commented Apr 28, 2022

@dlee992, yes, i agree adding equal ufunc for strings and unicode may fix this issue. I am guessing bulk of the work is extending ufunc_find_matching_loop for SS->?/UU->?. For writing the equivalent to int_eq_impl for unicode, take a look at complex_div_impl:

numba/numba/cpython/numbers.py

Lines 1060 to 1089 in e5bbae7

def complex_div_impl(context, builder, sig, args):
def complex_div(a, b):
# This is CPython's algorithm (in _Py_c_quot()).
areal = a.real
aimag = a.imag
breal = b.real
bimag = b.imag
if not breal and not bimag:
raise ZeroDivisionError("complex division by zero")
if abs(breal) >= abs(bimag):
# Divide tops and bottom by b.real
if not breal:
return complex(NAN, NAN)
ratio = bimag / breal
denom = breal + bimag * ratio
return complex(
(areal + aimag * ratio) / denom,
(aimag - areal * ratio) / denom)
else:
# Divide tops and bottom by b.imag
if not bimag:
return complex(NAN, NAN)
ratio = breal / bimag
denom = breal * ratio + bimag
return complex(
(a.real * ratio + a.imag) / denom,
(a.imag * ratio - a.real) / denom)
res = context.compile_internal(builder, complex_div, sig, args)
return impl_ret_untracked(context, builder, sig.return_type, res)

See how it trigger the compiler within itself. The unicode equal ufunc can likely reuse the unicode equality like this:

def unicode_eq_impl(context, builder, sig, args):
    def eq(a, b):
        return a == b

    res = context.compile_internal(builder, eq, sig, args)
    return impl_ret_untracked(context, builder, sig.return_type, res)

@dlee992
Copy link
Contributor

dlee992 commented Apr 29, 2022

@sklam , interesting thing:

In [2]: np.equal.types
Out[2]:
['??->?',
 'bb->?',
 'BB->?',
 'hh->?',
 'HH->?',
 'ii->?',
 'II->?',
 'll->?',
 'LL->?',
 'qq->?',
 'QQ->?',
 'ee->?',
 'ff->?',
 'dd->?',
 'gg->?',
 'FF->?',
 'DD->?',
 'GG->?',
 'OO->?',
 'mm->?',
 'MM->?',
 'OO->O',
 'OO->?']

np.equal.types doesn't have UU->? or SS->?. But ufunc_find_matching_loop select candidate from np.equal.types. Emmm, strange. BTW, which choice is for CharSeq/UnicodeCharSeq/Unicode?

@sklam
Copy link
Member

sklam commented May 3, 2022

Looks like numpy is not using the ufunc for unicode/string array equality.

In [3]: a=np.asarray("ABC")

In [4]: a
Out[4]: array('ABC', dtype='<U3')

In [5]: np.equal(a, a)
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
<ipython-input-5-e945fc627bdf> in <module>
----> 1 np.equal(a, a)

UFuncTypeError: ufunc 'equal' did not contain a loop with signature matching types (dtype('<U3'), dtype('<U3')) -> dtype('bool')

It's not obvious from a quick search where the numpy code for a == a is.

@stuartarchibald
Copy link
Contributor

It's not obvious from a quick search where the numpy code for a == a is.

It's probably this:
https://github.com/numpy/numpy/blob/26c1e04b3af572ed4ee1cc2f9ac3faca27858dc5/numpy/core/src/multiarray/arrayobject.c#L1275
it's in the rich comparison slot for the array type:
https://github.com/numpy/numpy/blob/26c1e04b3af572ed4ee1cc2f9ac3faca27858dc5/numpy/core/src/multiarray/arrayobject.c#L1709

Note the remarks about string arrays and ufuncs not being defined, there's a special case:
https://github.com/numpy/numpy/blob/26c1e04b3af572ed4ee1cc2f9ac3faca27858dc5/numpy/core/src/multiarray/arrayobject.c#L1281-L1309

?

@sklam
Copy link
Member

sklam commented May 4, 2022

Good find @stuartarchibald. So Numba should follow that and special case the handling for == for string types and not go down the ufunc route.

@stuartarchibald
Copy link
Contributor

#7884 might help.

@dlee992
Copy link
Contributor

dlee992 commented Jun 15, 2022

I discussed with Seberg at this PR numpy/numpy#21041, about unifying string ufuncs into the universal way. Still in discussion and need more NumPy PRs for ufunc.types stuff used in Numba ufunc_find_matching_loop. It's kind of needing long effort, since it seems NumPy internal has two style loops to handle ufunc stuff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug - incorrect behavior Bugs: incorrect behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants