ENH: rec.array load from bytes should refuse object dtype #23132

Open
lixi-zhou opened this issue Jan 30, 2023 · 3 comments

@lixi-zhou

lixi-zhou commented Jan 30, 2023

Describe the issue:

If a rec array contains a field with object dtype, then after converting it to bytes and saving it to disk, the data cannot be loaded back by another script. The same code works if saving and loading happen within the same script.

Please see the attached reproduction code. First, run a.py; it works without issue. Then run b.py; it crashes with a segmentation fault.

Reproduce the code example:

# =======================a.py======================
import numpy as np
arr_dtype = [('keys', 'O'), ('data', '<i8')]
a = np.rec.array([('abcd', 0), ('abbe', 1), ('abbe', 2), ('ded', 3), ('ads', 4)], 
      dtype=arr_dtype)
with open('test.data', 'wb') as f:
  f.write(a.tobytes())
with open('test.data', 'rb') as f:
  a_bytes = f.read()
b = np.rec.array(a_bytes, dtype=arr_dtype)
print(b)

# =======================b.py======================
import numpy as np
arr_dtype = [('keys', 'O'), ('data', '<i8')]
with open('test.data', 'rb') as f:
  a_bytes = f.read()
b = np.rec.array(a_bytes, dtype=arr_dtype)
print(b)

Error message:

# After running a.py, the data loads without issue. The script outputs:
[('abcd', 0) ('abbe', 1) ('abbe', 2) ('ded', 3) ('ads', 4)]

# After running b.py, the script outputs:
Segmentation fault

Runtime information:

1.24.1
3.8.13 (default, Mar 28 2022, 11:38:47)
[GCC 7.5.0]
[{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
'found': ['SSSE3',
'SSE41',
'POPCNT',
'SSE42',
'AVX',
'F16C',
'FMA3',
'AVX2',
'AVX512F',
'AVX512CD',
'AVX512_SKX',
'AVX512_CLX'],
'not_found': ['AVX512_KNL',
'AVX512_KNM',
'AVX512_CNL',
'AVX512_ICL']}},
{'architecture': 'SkylakeX',
'filepath': '/home/xxx/miniconda3/envs/py38/lib/python3.8/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so',
'internal_api': 'openblas',
'num_threads': 8,
'prefix': 'libopenblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.21'}]

Context for the issue:

No response

@lixi-zhou lixi-zhou changed the title BUG: rec.array load from bytes with 'OBJECT' dtype cause segmentation bug BUG: rec.array load from bytes with 'OBJECT' dtype cause segmentation fault Jan 30, 2023
@seberg
Member

seberg commented Jan 31, 2023

It really can't work to load what tobytes() writes for object arrays.
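
To illustrate why this cannot round-trip, here is a minimal sketch, assuming CPython (where `id()` returns an object's memory address). It suggests that the bytes written by `tobytes()` for an object array are just pointer values, not the string contents. The variable names are made up for illustration.

```python
import numpy as np

# Keep the objects referenced in a list so they stay alive for the whole demo.
values = ["abcd", "abbe", "ded"]
a = np.array(values, dtype=object)

raw = a.tobytes()

# Reinterpret the raw bytes as pointer-sized unsigned integers.
pointers = np.frombuffer(raw, dtype=np.uintp)

# On CPython, id() is the address of the object, so the two lists match.
addresses = [id(v) for v in a]
print(pointers.tolist() == addresses)

# One pointer per element, regardless of how long each string is.
print(len(raw) == a.size * np.dtype(np.uintp).itemsize)
```

This would explain the difference between a.py and b.py: in a.py the original objects are still alive at those addresses, while in a fresh process the same addresses point at unrelated memory.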

np.rec.array() should refuse to try, and fail gracefully, though. It isn't even a low-level API but a user-facing one, and a user-facing API has to ensure safety.
(Maybe we should fail even more generally, but just forbidding it there seems fine also.)
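
Until something like that exists in NumPy itself, a user-side guard is one option. The following is only a sketch: `rec_array_from_bytes` is a hypothetical helper, not a NumPy function, and the error message is invented. It relies on `dtype.hasobject` to detect object fields before handing the bytes to `np.rec.array()`.

```python
import numpy as np

def rec_array_from_bytes(buf, dtype):
    """Hypothetical wrapper: build a record array from raw bytes,
    refusing any dtype that contains Python object fields."""
    dt = np.dtype(dtype)
    if dt.hasobject:
        raise ValueError(
            "cannot build a record array with object fields from raw bytes"
        )
    return np.rec.array(buf, dtype=dt)

# Fixed-width bytes fields hold their data inline, so they round-trip safely.
plain_dtype = [("keys", "S4"), ("data", "<i8")]
plain = np.rec.array([(b"abcd", 0), (b"abbe", 1)], dtype=plain_dtype)
restored = rec_array_from_bytes(plain.tobytes(), plain_dtype)
print(restored["keys"].tolist(), restored["data"].tolist())

# The dtype from the report is rejected instead of being interpreted as pointers.
try:
    rec_array_from_bytes(bytes(16), [("keys", "O"), ("data", "<i8")])
except ValueError as exc:
    print("refused:", exc)
```

Fields are read with `restored["data"]` rather than `restored.data`, because attribute access on an array named `data` returns the underlying buffer, not the field.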

@lixi-zhou
Author

> It really can't work to load what tobytes() writes for object arrays.
>
> np.rec.array() should refuse to try and fail gracefully, though. It isn't even a low-level API, but user-facing and user-facing API has to ensure safety. (Maybe we should fail even more generally, but just forbidding it there seems fine also.)

Will loading object arrays be supported, or just forbidden? Loading works when it directly follows the tobytes() call in the same script.

@seberg
Member

seberg commented Jan 31, 2023

Loading objects from tobytes() is fundamentally unsafe (so much so, it is arguable we should maybe refuse the .tobytes() call).
It cannot be supported. What you experience as "works" is still fundamentally unsafe in many ways.

This is very much a: don't do it, unless you know how it works (i.e. you know how object lifetime management works in Python, which practically means knowing the C-API well). Even if you do, you probably don't have a good enough reason to actually do it and make sure it is safe.

You can use np.save, which uses pickle for objects, or even just pickle directly.
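
As a rough sketch of both alternatives, using the dtype from the report: an in-memory `io.BytesIO` buffer stands in for the file on disk here, purely to keep the example self-contained. Note that `np.load` needs `allow_pickle=True` for object data, which should only be used on files from a trusted source.

```python
import io
import pickle

import numpy as np

arr_dtype = [("keys", "O"), ("data", "<i8")]
a = np.rec.array([("abcd", 0), ("abbe", 1), ("ded", 3)], dtype=arr_dtype)

# Option 1: np.save / np.load. Object data is stored via pickle,
# so loading must opt in with allow_pickle=True.
buf = io.BytesIO()
np.save(buf, a)
buf.seek(0)
b = np.load(buf, allow_pickle=True)

# Option 2: pickle the array directly.
c = pickle.loads(pickle.dumps(a))

print(b["keys"].tolist(), b["data"].tolist())
print(c["keys"].tolist(), c["data"].tolist())
```

Both approaches serialize the objects themselves rather than their addresses, so the result can be loaded from a different process.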

np.rec.array() is weird for not rejecting things here, as, for example, np.fromstring() or np.frombuffer() does.
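
For comparison, a small sketch of the `np.frombuffer()` behaviour mentioned above, using the dtype from the report. The exact wording of the error message may differ between NumPy versions, so it is printed rather than assumed.

```python
import numpy as np

arr_dtype = [("keys", "O"), ("data", "<i8")]
raw = bytes(16)  # placeholder bytes; their content never gets interpreted

refused = False
try:
    np.frombuffer(raw, dtype=arr_dtype)
except ValueError as exc:
    # frombuffer refuses dtypes that contain Python objects.
    refused = True
    print("refused:", exc)
```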

@seberg seberg changed the title BUG: rec.array load from bytes with 'OBJECT' dtype cause segmentation fault ENH: rec.array load from bytes should refuse object dtype Feb 2, 2023