Printing a Numpy structured array containing a string column throws an error #5777

Open
2 tasks done
marnixhoh opened this issue May 29, 2020 · 4 comments


marnixhoh commented May 29, 2020

This works:

from numba import njit
import numpy as np

values_dtype = np.dtype([
    ('one', 'U10'),
    ('two', 'f8')
])

def my_test():
    values = np.empty(2, dtype=values_dtype)
    values['one'][0] = 'test'
    print(values['one'][0])
    return values
result = my_test()
print(result)

But when jitting the function, it doesn't:

from numba import njit
import numpy as np

values_dtype = np.dtype([
    ('one', 'U10'),
    ('two', 'f8')
])

@njit
def my_test():
    values = np.empty(2, dtype=values_dtype)
    values['one'][0] = 'test'
    print(values['one'][0])
    return values
result = my_test()
print(result)

Note that the error is thrown by the print statement:
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)

numba: 0.49.1
numpy: 1.18.4
python: 3.7.1
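
For context on the error message: NumPy stores each character of a `U` field as 4 bytes of UTF-32, so a `U10` field occupies 40 bytes, and reading it back means decoding those bytes. The sketch below uses only the Python standard library (no Numba or NumPy) to show that decoding 4 bytes that do not form a valid code point produces this kind of error. The `junk` value is an arbitrary illustration, not the actual contents of the uninitialised buffer.

```python
# A valid string round-trips through UTF-32 (4 bytes per character).
good = 'test'.encode('utf-32-le')
decoded = good.decode('utf-32-le')

# Four arbitrary bytes whose value (0x7FFFFFFF) is above the largest
# Unicode code point (0x10FFFF), standing in for uninitialised memory.
junk = (0x7FFFFFFF).to_bytes(4, 'little')

try:
    junk.decode('utf-32-le')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(len(good), decoded)
print(message)
```

If uninitialised memory happens to contain such a value, any attempt to turn the field into a Python string would be expected to fail in a similar way.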

@stuartarchibald
Contributor

Thanks for the report, I can reproduce.

@stuartarchibald
Contributor

Using np.zeros instead of np.empty seems to fix it:

from numba import njit
import numpy as np

values_dtype = np.dtype([
    ('one', 'U10'),
    ('two', 'f8')
])

@njit
def my_test():
    values = np.zeros(2, dtype=values_dtype)
    values['one'][0] = 'test'
    print(values['one'][0])
    return values
result = my_test()

print(my_test.py_func().dtype)
print(my_test().dtype)
print(my_test())

It's probably junk in the memory returned by np.empty, so the string is not null-terminated.
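
A minimal sketch of why `np.zeros` sidesteps the problem, written in plain NumPy without `@njit`. It only illustrates the zero-filled layout, assuming the junk-memory hypothesis is the cause, and does not exercise Numba's own code path. The names `first`, `second`, and `raw_second` are introduced here for illustration.

```python
import numpy as np

values_dtype = np.dtype([
    ('one', 'U10'),
    ('two', 'f8')
])

# np.zeros guarantees every byte of the buffer starts as 0, so each
# 'U10' field reads back as an empty string until it is assigned.
values = np.zeros(2, dtype=values_dtype)
values['one'][0] = 'test'

first = values['one'][0]    # the field that was explicitly written
second = values['one'][1]   # the field that was never written

# Raw bytes of the untouched second record: 40 bytes for 'U10' plus 8 for 'f8'.
raw_second = values[1:2].tobytes()

print(repr(first), repr(second))
print(values_dtype.itemsize, raw_second == bytes(values_dtype.itemsize))
```

With `np.empty`, the bytes of the second record would be whatever was already in memory, which is the case the report runs into.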

@marnixhoh
Author

@stuartarchibald I also wanted to share (in case it is helpful) that when you instantiate a structured array containing a unicode/string field and then try to print it inside the jitted function, Python as a whole crashes. (I am running these tests in a Jupyter notebook on a Mac.)

For example:

from numba import njit
import numpy as np

values_dtype = np.dtype([
    ('one', 'U10'),
    ('two', 'f8')
])

@njit
def my_test():
    values = np.empty(2, dtype=values_dtype)
    print(values)
    return values
result = my_test()

@stuartarchibald
Contributor

Thanks. It's probably an invalid/out-of-bounds read: the memory returned by np.empty is junk, and the string being read is not null-terminated. Valgrind confirms the invalid read:

$ valgrind --tool=memcheck --suppressions=../cpython/Misc/valgrind-python.supp --suppressions=contrib/valgrind-numba.supp python issue5777_2.py
<snip>
==31941== Invalid read of size 8
==31941==    at 0x22F8D4: PyUnicode_AsUTF8AndSize (unicodeobject.c:3818)
==31941==    by 0x226A911D: ???
==31941==    by 0xCA1C01F: ???
==31941==    by 0x1F2C684F: ???
==31941==    by 0x1EB7664F: ???
==31941==  Address 0x8 is not stack'd, malloc'd or (recently) free'd
==31941== 
==31941== 
==31941== Process terminating with default action of signal 11 (SIGSEGV)
==31941==  Access not within mapped region at address 0x8
==31941==    at 0x22F8D4: PyUnicode_AsUTF8AndSize (unicodeobject.c:3818)
==31941==    by 0x226A911D: ???
==31941==    by 0xCA1C01F: ???
==31941==    by 0x1F2C684F: ???
==31941==    by 0x1EB7664F: ???

Using np.zeros instead would be fine:

from numba import njit
import numpy as np

values_dtype = np.dtype([
    ('one', 'U10'),
    ('two', 'f8')
])

@njit(debug=True)
def my_test():
    values = np.zeros(2, dtype=values_dtype)
    print(values)
    return values
result = my_test()
