Printing a Numpy structured array containing a string column throws an error #5777

marnixhoh · 2020-05-29T16:21:05Z

This works:

from numba import njit
import numpy as np

values_dtype = np.dtype([
    ('one', 'U10'),
    ('two', 'f8')
])

def my_test():
    values = np.empty(2, dtype=values_dtype)
    values['one'][0] = 'test'
    print(values['one'][0])
    return values
result = my_test()
print(result)

But when jitting the function, it doesn't:

from numba import njit
import numpy as np

values_dtype = np.dtype([
    ('one', 'U10'),
    ('two', 'f8')
])

@njit
def my_test():
    values = np.empty(2, dtype=values_dtype)
    values['one'][0] = 'test'
    print(values['one'][0])
    return values
result = my_test()
print(result)

Note that the error is thrown by the print statement:
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)

numba: 0.49.1
numpy: 1.18.4
python: 3.7.1

I am using the latest released version of Numba (the most recent is visible in
the change log: https://github.com/numba/numba/blob/master/CHANGE_LOG).
I have included a minimal working reproducer (if you are unsure how
to write one, see http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports).


stuartarchibald · 2020-05-29T16:41:47Z

Thanks for the report, I can reproduce.

stuartarchibald · 2020-05-29T16:47:58Z

Switching from np.empty to np.zeros seems to fix it:

from numba import njit
import numpy as np

values_dtype = np.dtype([
    ('one', 'U10'),
    ('two', 'f8')
])

@njit
def my_test():
    values = np.zeros(2, dtype=values_dtype)
    values['one'][0] = 'test'
    print(values['one'][0])
    return values
result = my_test()

print(my_test.py_func().dtype)
print(my_test().dtype)
print(my_test())

It's probably junk in memory: np.empty does not initialize the buffer, so the string is not null-terminated.
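To illustrate why junk memory can produce this exact error message, here is a minimal pure-Python sketch. It is not Numba's or NumPy's actual code path; the helper `decode_fixed_width_utf32` and the byte patterns are hypothetical stand-ins. It only assumes that a 'U10' field is stored as 10 UTF-32 code points (40 bytes), zero-padded at the end.

```python
def decode_fixed_width_utf32(raw: bytes) -> str:
    """Decode a fixed-width UTF-32-LE field and drop the trailing zero padding."""
    return raw.decode("utf-32-le").rstrip("\x00")


FIELD_BYTES = 10 * 4  # assumed layout of a 'U10' field: 10 code points, 4 bytes each

# An all-zero buffer, like the one np.zeros returns, decodes to an empty string.
zeroed = bytes(FIELD_BYTES)

# A properly written value is the encoded text followed by zero padding.
written = "test".encode("utf-32-le").ljust(FIELD_BYTES, b"\x00")

# Hypothetical stand-in for uninitialized memory: 0xFFFFFFFF is not a valid code point.
junk = b"\xff" * FIELD_BYTES

print(repr(decode_fixed_width_utf32(zeroed)))
print(repr(decode_fixed_width_utf32(written)))
try:
    decode_fixed_width_utf32(junk)
except UnicodeDecodeError as exc:
    print(exc)
```

Real uninitialized memory can hold anything, so the failure may be intermittent: some leftover bytes happen to decode, others do not.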

marnixhoh · 2020-05-30T08:17:10Z

@stuartarchibald I also wanted to share (just in case it is helpful) that when you instantiate the structured array containing a unicode/string field and then try to print it inside the jitted function, Python as a whole crashes. (I am using a Jupyter notebook for these tests on a Mac.)

For example:

from numba import njit
import numpy as np

values_dtype = np.dtype([
    ('one', 'U10'),
    ('two', 'f8')
])

@njit
def my_test():
    values = np.empty(2, dtype=values_dtype)
    print(values)
    return values
result = my_test()

stuartarchibald · 2020-06-01T10:26:45Z

Thanks. It's probably an invalid/out-of-bounds read: np.empty returns uninitialized memory, so the string being read is junk and is not null-terminated. Valgrind confirms this:

$ valgrind --tool=memcheck --suppressions=../cpython/Misc/valgrind-python.supp --suppressions=contrib/valgrind-numba.supp python issue5777_2.py
<snip>
==31941== Invalid read of size 8
==31941==    at 0x22F8D4: PyUnicode_AsUTF8AndSize (unicodeobject.c:3818)
==31941==    by 0x226A911D: ???
==31941==    by 0xCA1C01F: ???
==31941==    by 0x1F2C684F: ???
==31941==    by 0x1EB7664F: ???
==31941==  Address 0x8 is not stack'd, malloc'd or (recently) free'd
==31941== 
==31941== 
==31941== Process terminating with default action of signal 11 (SIGSEGV)
==31941==  Access not within mapped region at address 0x8
==31941==    at 0x22F8D4: PyUnicode_AsUTF8AndSize (unicodeobject.c:3818)
==31941==    by 0x226A911D: ???
==31941==    by 0xCA1C01F: ???
==31941==    by 0x1F2C684F: ???
==31941==    by 0x1EB7664F: ???

Using np.zeros instead would be fine:

from numba import njit
import numpy as np

values_dtype = np.dtype([
    ('one', 'U10'),
    ('two', 'f8')
])

@njit(debug=True)
def my_test():
    values = np.zeros(2, dtype=values_dtype)
    print(values)
    return values
result = my_test()

stuartarchibald added bug lowpriority labels May 29, 2020

stuartarchibald removed the lowpriority label May 29, 2020
