
BUG: unable to read .npy file with non-eval'able metadata #23169

Closed
sjdv1982 opened this issue Feb 7, 2023 · 4 comments · Fixed by #23371

@sjdv1982

sjdv1982 commented Feb 7, 2023

Describe the issue:

I have a .npy file generated from an h5py dataset. The data is a pure NumPy array (no subclass) and the file was written by NumPy, but the dtype contains metadata set by h5py.
In this particular case, the metadata contains a <class 'bytes'> expression, which cannot be handled by safe_eval. I don't care about the metadata, but it renders the entire file unreadable.

 File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/format.py", line 776, in read_array
    shape, fortran_order, dtype = _read_array_header(
  File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/format.py", line 632, in _read_array_header
    raise ValueError(msg.format(header)) from e
ValueError: Cannot parse header: "{'descr': [('paper_ref', ('|O', {'vlen': <class 'bytes'>})), ('path', ('|O', {'vlen': <class 'bytes'>}))], 'fortran_order': False, 'shape': (), }                                    \n"

I would consider this a bug in np.save. Serialization of a dtype descriptor that results in an unreadable header should fail loudly. It would be nice if np.save could then be retried with the metadata stripped.

Reproduce the code example:

import numpy as np
np.load("file.npy")

Error message:

Traceback (most recent call last):
  File "/home/sjoerd/bayesian_inference_fm/aptool-dump.py", line 65, in dump
    arr = np.load(child_target + ".npy")
  File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/npyio.py", line 432, in load
    return format.read_array(fid, allow_pickle=allow_pickle,
  File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/format.py", line 776, in read_array
    shape, fortran_order, dtype = _read_array_header(
  File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/format.py", line 632, in _read_array_header
    raise ValueError(msg.format(header)) from e
ValueError: Cannot parse header: "{'descr': [('paper_ref', ('|O', {'vlen': <class 'bytes'>})), ('path', ('|O', {'vlen': <class 'bytes'>}))], 'fortran_order': False, 'shape': (), }                                    \n"
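For reference, the failure can be reproduced end to end without h5py by attaching metadata to a field dtype. This is a hedged sketch (the in-memory buffer, the zero-filled array, and the metadata key are our own choices, not from the report), and the outcome depends on the NumPy version: around 1.24 the metadata leaks into the header and np.load raises ValueError, while later releases warn at save time and strip the metadata so the round trip succeeds.

```python
import io
import warnings
import numpy as np

# A structured dtype whose field dtype carries metadata, mimicking what
# h5py attaches (the key 'BLAH' is an arbitrary stand-in).
field = np.dtype('<f8', metadata={'BLAH': 10})
arr = np.zeros(3, dtype=[('x', field)])

buf = io.BytesIO()
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # newer NumPy warns that metadata is dropped
    np.save(buf, arr)

buf.seek(0)
try:
    loaded = np.load(buf)
    print('round trip OK, dtype:', loaded.dtype)   # newer NumPy
except ValueError as e:
    print('np.load failed:', e)                    # NumPy <= 1.24
```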

Runtime information:

NumPy version: 1.24.0
Python version: 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:39:03) [GCC 11.3.0]

show_runtime:

WARNING: threadpoolctl not found in system! Install it by pip install threadpoolctl. Once installed, try np.show_runtime again for more detailed build information
[{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
                      'found': ['SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX',
                                'F16C', 'FMA3', 'AVX2'],
                      'not_found': ['AVX512F', 'AVX512CD', 'AVX512_KNL',
                                    'AVX512_KNM', 'AVX512_SKX', 'AVX512_CLX',
                                    'AVX512_CNL', 'AVX512_ICL']}}]

Context for the issue:

A pure NumPy array (no subclasses) saved with np.save should be readable by np.load. This issue gives a counterexample.

I could try to write a patch to fix this issue, if you are interested.

@seberg
Member

seberg commented Feb 7, 2023

Thanks for the report. To note, this should be a duplicate of gh-14142 and maybe gh-15488. I am not sure we should be storing metadata at all, since metadata is far too flexible for a simple format like .npy (the object/pickle mode aside, clearly; pickles with a high enough protocol are probably usable if you are fine with pickles).

I thought we gave a warning on storing now, but I am not sure. A warning plus metadata stripping is the solution that seemed most plausible to me (IIRC). But I am not sure you would like that.

I could try to write a patch to fix this issue, if you are interested.

Yes, that would be nice. There was some history of a PR trying to relax np.dtype(...) to accept the currently stored stuff. That had its issues, however (I would have to read through it again).

@sjdv1982
Author

sjdv1982 commented Feb 7, 2023

Thanks for your quick reply. I have re-read the issues you mentioned.

But if I understand correctly, there are two separate issues. Please correct me if I am wrong.

  1. Some dtype descr headers written to .npy contain metadata fields that cannot be parsed by safe_eval.
  2. There is no canonical way that metadata is stored in a descr header. The way it is done (as the second field of the tuple) is degenerate with a shape description (ENH: better handle dtype creation with metadata #15962 (comment)).

Problem #2 is revealed by the following snippet.

import numpy as np
from numpy.lib.utils import safe_eval

mytype = np.dtype('<f8', metadata={"BLAH": 10})
descr = np.dtype([('x', mytype)]).descr
print(descr)
print(safe_eval(repr(descr)))  # no problem #1 here
np.dtype(descr)  # problem #2: ValueError: invalid shape in fixed-type tuple.

The solution to problem #1 is unambiguous: np.save must be amended, because descr headers must be safe_eval'able, otherwise they are unloadable. If h5py wishes to add metadata, it ought to respect that. If you like, I could write a patch to np.save to enforce this. But it won't be of any practical use until problem #2 is solved too.

As for problem #2, I don't know how it can be solved, other than np.load guessing whether the second tuple item is a shape or metadata. If I understand correctly, this is what #15962 does?
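The shape-or-metadata guess described above can be illustrated with a small hypothetical helper (our own name, not NumPy API): in a descr field tuple like ('|O', extra), a dict in the second position can only be metadata, whereas an int or a tuple of ints is a shape.

```python
# Illustrative sketch only: classify the second element of a descr field tuple.
def classify_field_extra(extra):
    """Guess whether `extra` in ('dtype_str', extra) is a shape or metadata."""
    if isinstance(extra, dict):
        return 'metadata'   # e.g. {'vlen': <class 'bytes'>} from h5py
    if isinstance(extra, (int, tuple)):
        return 'shape'      # e.g. 3 or (3, 2)
    return 'unknown'

print(classify_field_extra({'BLAH': 10}))  # metadata
print(classify_field_extra((3, 2)))        # shape
```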

@seberg
Member

seberg commented Feb 9, 2023

If I understand correctly, this is what #15962 does?

Yes, but in a weird way. @ninousf just opened the PR to drop metadata. That still seems like a decent solution to me (I have to look at the PR in detail). Maybe it would work for you if np.save() gave a warning "dropped metadata" and then saves things without the metadata?

@sjdv1982
Author

sjdv1982 commented Feb 9, 2023

@ninousf just opened the PR to drop metadata. That still seems like a decent solution to me (I have to look at the PR in detail). Maybe it would work for you if np.save() gave a warning "dropped metadata" and then saves things without the metadata?

For me, that would work. I don't care about the metadata, and more importantly, I control both ends of the conversion pipeline (np.save and np.load), so I have no worries about forward/backward compatibility. I can't speak for others who wish to load legacy .npy files, though.
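For anyone in the same position who controls the writing side, dropping the metadata client-side before np.save can be sketched roughly as follows (strip_metadata is a hypothetical helper, not NumPy API; it assumes simple structured dtypes without nested shapes, offsets, or padding):

```python
import numpy as np

def strip_metadata(dt):
    """Rebuild a dtype without per-field metadata (simple structured dtypes only)."""
    if dt.names is None:
        return np.dtype(dt.str)  # dt.str carries no metadata
    return np.dtype([(name, strip_metadata(dt[name])) for name in dt.names])

field = np.dtype('<f8', metadata={'BLAH': 10})
dirty = np.dtype([('x', field), ('y', '<i4')])
clean = strip_metadata(dirty)
print(clean['x'].metadata)  # None

# Cast before saving so the written header contains no metadata.
arr = np.zeros(2, dtype=dirty).astype(clean)
```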
