
BUG: unable to read .npy file with non-eval'able metadata #23169

Closed
sjdv1982 opened this issue Feb 7, 2023 · 4 comments · Fixed by #23371

@sjdv1982

sjdv1982 commented Feb 7, 2023

Describe the issue:

I have a .npy file generated from an h5py dataset. The data is a pure NumPy array (no subclass) and the file was written by NumPy, but the dtype contains metadata set by h5py.
In this particular case, the metadata contains a <class 'bytes'> expression, which cannot be handled by safe_eval. I don't care about the metadata, but it renders the entire file unreadable.

 File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/format.py", line 776, in read_array
    shape, fortran_order, dtype = _read_array_header(
  File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/format.py", line 632, in _read_array_header
    raise ValueError(msg.format(header)) from e
ValueError: Cannot parse header: "{'descr': [('paper_ref', ('|O', {'vlen': <class 'bytes'>})), ('path', ('|O', {'vlen': <class 'bytes'>}))], 'fortran_order': False, 'shape': (), }                                    \n"

I would consider this a bug in np.save. Serialization of a dtype descriptor that results in an unreadable header should fail loudly. It would be nice if np.save could then be retried with the metadata stripped.

Reproduce the code example:

import numpy as np
np.load("file.npy")

Error message:

Traceback (most recent call last):
  File "/home/sjoerd/bayesian_inference_fm/aptool-dump.py", line 65, in dump
    arr = np.load(child_target + ".npy")
  File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/npyio.py", line 432, in load
    return format.read_array(fid, allow_pickle=allow_pickle,
  File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/format.py", line 776, in read_array
    shape, fortran_order, dtype = _read_array_header(
  File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/format.py", line 632, in _read_array_header
    raise ValueError(msg.format(header)) from e
ValueError: Cannot parse header: "{'descr': [('paper_ref', ('|O', {'vlen': <class 'bytes'>})), ('path', ('|O', {'vlen': <class 'bytes'>}))], 'fortran_order': False, 'shape': (), }                                    \n"
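For reference, the failure can be reproduced end to end without h5py by attaching metadata to a field dtype. This is a hedged sketch (the in-memory buffer, the zero-filled array, and the metadata key are our own choices, not from the report), and the outcome depends on the NumPy version: around 1.24 the metadata leaks into the header and np.load raises ValueError, while later releases warn at save time and strip the metadata so the round trip succeeds.

```python
import io
import warnings
import numpy as np

# A structured dtype whose field dtype carries metadata, mimicking what
# h5py attaches (the key 'BLAH' is an arbitrary stand-in).
field = np.dtype('<f8', metadata={'BLAH': 10})
arr = np.zeros(3, dtype=[('x', field)])

buf = io.BytesIO()
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # newer NumPy warns that metadata is dropped
    np.save(buf, arr)

buf.seek(0)
try:
    loaded = np.load(buf)
    print('round trip OK, dtype:', loaded.dtype)   # newer NumPy
except ValueError as e:
    print('np.load failed:', e)                    # NumPy <= 1.24
```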

Runtime information:

NumPy version: 1.24.0
Python version: 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:39:03) [GCC 11.3.0]

show_runtime:

WARNING: threadpoolctl not found in system! Install it by pip install threadpoolctl. Once installed, try np.show_runtime again for more detailed build information
[{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
                      'found': ['SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX',
                                'F16C', 'FMA3', 'AVX2'],
                      'not_found': ['AVX512F', 'AVX512CD', 'AVX512_KNL',
                                    'AVX512_KNM', 'AVX512_SKX', 'AVX512_CLX',
                                    'AVX512_CNL', 'AVX512_ICL']}}]

Context for the issue:

A pure NumPy array (no subclasses) saved with np.save should be readable by np.load. This issue gives a counterexample.

I could try to write a patch to fix this issue, if you are interested.

@seberg
Member

seberg commented Feb 7, 2023

Thanks for the report. To note, this should be a duplicate of gh-14142 and maybe gh-15488. I am not sure we should be storing metadata at all, since metadata is far too flexible for a simple format like .npy (the object/pickle mode aside, clearly; pickles with a high enough protocol are probably usable if you are fine with pickles).

I thought we gave a warning on storing now, but I am not sure. A warning plus metadata stripping is the solution that seemed most plausible to me (IIRC). But I am not sure you would like that.

I could try to write a patch to fix this issue, if you are interested.

Yes, that would be nice. There was some history of a PR trying to relax np.dtype(...) to accept the currently stored stuff. That had its issues, however (I would have to read through it again).

@sjdv1982
Author

sjdv1982 commented Feb 7, 2023

Thanks for your quick reply. I have re-read the issues you mentioned.

But if I understand correctly, there are two separate issues. Please correct me if I am wrong.

  1. Some dtype descr headers written to .npy contain metadata fields that cannot be parsed by safe_eval.
  2. There is no canonical way that metadata is stored in a descr header. The way it is done (as the second field of the tuple) is degenerate with a shape description (ENH: better handle dtype creation with metadata #15962 (comment)).

Problem #2 is revealed by the following snippet.

import numpy as np
from numpy.lib.utils import safe_eval

mytype = np.dtype('<f8', metadata={"BLAH": 10})
descr = np.dtype([('x', mytype)]).descr
print(descr)
print(safe_eval(repr(descr)))  # no problem #1 here
np.dtype(descr)  # problem #2: ValueError: invalid shape in fixed-type tuple.

The solution to problem #1 is unambiguous: np.save must be amended, because descr headers must be safe_eval'able, otherwise they are unloadable. If h5py wishes to add metadata, it ought to respect that. If you like, I could write a patch to np.save to enforce this. But it won't be of any practical use until problem #2 is solved too.

As for problem #2, I don't know how it can be solved, other than np.load guessing whether the second tuple item is a shape or metadata. If I understand correctly, this is what #15962 does?
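The shape-or-metadata guess described above can be illustrated with a small hypothetical helper (our own name, not NumPy API): in a descr field tuple like ('|O', extra), a dict in the second position can only be metadata, whereas an int or a tuple of ints is a shape.

```python
# Illustrative sketch only: classify the second element of a descr field tuple.
def classify_field_extra(extra):
    """Guess whether `extra` in ('dtype_str', extra) is a shape or metadata."""
    if isinstance(extra, dict):
        return 'metadata'   # e.g. {'vlen': <class 'bytes'>} from h5py
    if isinstance(extra, (int, tuple)):
        return 'shape'      # e.g. 3 or (3, 2)
    return 'unknown'

print(classify_field_extra({'BLAH': 10}))  # metadata
print(classify_field_extra((3, 2)))        # shape
```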

@seberg
Member

seberg commented Feb 9, 2023

If I understand correctly, this is what #15962 does?

Yes, but in a weird way. @ninousf just opened the PR to drop metadata. That still seems like a decent solution to me (I have to look at the PR in detail). Maybe it would work for you if np.save() gave a warning "dropped metadata" and then saves things without the metadata?

@sjdv1982
Author

sjdv1982 commented Feb 9, 2023

@ninousf just opened the PR to drop metadata. That still seems like a decent solution to me (I have to look at the PR in detail). Maybe it would work for you if np.save() gave a warning "dropped metadata" and then saves things without the metadata?

For me, that would work. I don't care about the metadata, and more importantly, I control both ends of the conversion pipeline (np.save and np.load), so I have no worries about forward/backward compatibility. I can't speak for others who wish to load legacy .npy files, though.
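For anyone in the same position who controls the writing side, dropping the metadata client-side before np.save can be sketched roughly as follows (strip_metadata is a hypothetical helper, not NumPy API; it assumes simple structured dtypes without nested shapes, offsets, or padding):

```python
import numpy as np

def strip_metadata(dt):
    """Rebuild a dtype without per-field metadata (simple structured dtypes only)."""
    if dt.names is None:
        return np.dtype(dt.str)  # dt.str carries no metadata
    return np.dtype([(name, strip_metadata(dt[name])) for name in dt.names])

field = np.dtype('<f8', metadata={'BLAH': 10})
dirty = np.dtype([('x', field), ('y', '<i4')])
clean = strip_metadata(dirty)
print(clean['x'].metadata)  # None

# Cast before saving so the written header contains no metadata.
arr = np.zeros(2, dtype=dirty).astype(clean)
```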
