Structured array drops field-titles when being 'sliced' by field-names #9625

Closed
axeloide opened this issue Aug 29, 2017 · 10 comments


@axeloide

axeloide commented Aug 29, 2017

Is the following behaviour of numpy (v1.13.1) a bug or by design?

>>> import numpy as np
>>> a = np.zeros((1,), dtype=[(('title 1', 'x'), '|i1'), (('title 2', 'y'), '>f4')])

>>> a.dtype.descr
[(('title 1', 'x'), '|i1'), (('title 2', 'y'), '>f4')]

>>> a[['y','x']].dtype.descr
[('y', '>f4'), ('x', '|i1')]

I would have expected the last expression to have returned this instead:
[(('title 2', 'y'), '>f4'), (('title 1', 'x'), '|i1')]

Why are the field-titles missing on a view obtained by indexing?
It is kind of surprising that some aspects of the structured-array dtype just get dropped.

@eric-wieser
Member

I vaguely recall that field titles are a deprecated feature, so it's not too surprising that they are not well supported.

@axeloide
Author

axeloide commented Aug 31, 2017

Are field-titles officially or unofficially deprecated? :-)

Just in case it's the latter, let me try to advocate for field-titles, by explaining my use-case:

I acquire large data series with dozens of fields and find it very useful to have concise field names (which improves code readability) while also having field documentation in human-readable form. It is handy that both names and titles are defined where the data originates and simply get passed along through the pipeline: acquisition --> processing --> storage...
Losing titles along that pipeline is especially inconvenient, because they tend to become more useful the further down the pipeline the data travels, e.g. for numpy arrays that have been serialized for storage or exchange. As an example, it takes just these two lines of code to get any ndarray persisted to a database or sent over the wire:

serialized_array.dtype = repr(numpy_array.dtype.descr)
serialized_array.buffer = numpy_array.tostring()

And any consumer/receiver of that data has the benefit of a fully documented dtype, where the titles document important aspects of a field, like the physical unit (e.g. mm or inches?).

BTW: Recreating a numpy array that has been serialized as shown above is done in just a single line of code:
numpy_array = np.frombuffer(serialized_array.buffer, dtype=eval(serialized_array.dtype))
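
The round trip described above can be sketched end to end as follows. This is only an illustration under a few assumptions: the helper names `serialize` and `deserialize` are hypothetical, `ast.literal_eval` is swapped in for `eval` so that only plain Python literals are accepted from the stored text, and `tobytes()` is used as the current spelling of `tostring()`. Note that the array shape is not stored, and that `.descr` does not round-trip for every dtype.

```python
import ast

import numpy as np


def serialize(arr):
    """Return (dtype_text, raw_bytes) for a structured array (hypothetical helper)."""
    return repr(arr.dtype.descr), arr.tobytes()


def deserialize(dtype_text, raw_bytes):
    """Rebuild a 1-D, read-only array from the pair produced by serialize()."""
    # literal_eval only accepts Python literals (lists, tuples, strings, ...),
    # so arbitrary code in the stored text is rejected instead of executed.
    descr = ast.literal_eval(dtype_text)
    return np.frombuffer(raw_bytes, dtype=np.dtype(descr))


original = np.zeros(3, dtype=[(('title 1', 'x'), '|i1'), (('title 2', 'y'), '>f4')])
original['x'] = [1, 2, 3]
original['y'] = [0.5, 1.5, 2.5]

dtype_text, raw = serialize(original)
restored = deserialize(dtype_text, raw)

# A field's entry in dtype.fields is (dtype, offset, title) when a title exists.
print(restored.dtype.fields['x'])
```

The titles survive here because they are part of `dtype.descr` on the full array; the loss reported in this issue only happens after multi-field indexing.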

@ahaldane
Member

ahaldane commented Sep 3, 2017

We still support titles, but my impression is they are sometimes forgotten so they can be a bit buggy.

Multi-field indexing using titles behaves strangely in other ways too:

>>> a = np.zeros(4, dtype=[(('title', 'b'), 'i'), ('c', 'i')])
>>> a[['title', 'c']]
array([(0, 0), (0, 0), (0, 0), (0, 0)],
      dtype=[('title', '<i4'), ('c', '<i4')])
>>> a[['b', 'c']]
array([(0, 0), (0, 0), (0, 0), (0, 0)],
      dtype=[('b', '<i4'), ('c', '<i4')])
>>> a[['title', 'b']]
array([(0, 0), (0, 0), (0, 0), (0, 0)],
      dtype=[('title', '<i4'), ('b', '<i4')])

I've been working on multi-field indexing in another PR (#6053); I'll see if I can get titles to work more sensibly.

@ahaldane
Member

ahaldane commented Sep 3, 2017

Also, I have proposed docs for structured arrays which don't deprecate titles, but I do say they are "obsolete" and that "their use is discouraged". See #9056. Is that too strong?

@ahaldane
Member

ahaldane commented Sep 3, 2017

@axeloide, since you are an actual user of titles with multi-field indexing, can you comment on how you think the code in my last example should behave?

Here are two possibilities:

Behavior 1 (easier to implement):

>>> a = np.zeros(4, dtype=[(('title', 'b'), 'i'), ('c', 'i')])
>>> a[['title', 'c']]
KeyError: No field named 'title'
>>> a[['b', 'c']]
array([(0, 0), (0, 0), (0, 0), (0, 0)],
      dtype=[(('title', 'b'), '<i4'), ('c', '<i4')])
>>> a[['title', 'b']]
KeyError: No field named 'title'

Behavior 2:

>>> a = np.zeros(4, dtype=[(('title', 'b'), 'i'), ('c', 'i')])
>>> a[['title', 'c']]
array([(0, 0), (0, 0), (0, 0), (0, 0)],
      dtype=[(('title', 'b'), '<i4'), ('c', '<i4')])
>>> a[['b', 'c']]
array([(0, 0), (0, 0), (0, 0), (0, 0)],
      dtype=[(('title', 'b'), '<i4'), ('c', '<i4')])
>>> a[['title', 'b']]
KeyError: duplicate field name 'b'
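
Independently of which behavior numpy ends up with, something close to Behavior 1 can be approximated in user code. The sketch below is not numpy API: `select_fields_keep_titles` is a hypothetical helper that rejects titles as keys, rebuilds a packed dtype that carries the titles along, and copies the selected fields (so it returns a copy, not a view).

```python
import numpy as np


def select_fields_keep_titles(arr, names):
    """Copy the named fields into a new packed array, keeping their titles."""
    fields = arr.dtype.fields
    descr = []
    for name in names:
        # dtype.fields also has the titles as keys, so check against
        # dtype.names to accept real field names only.
        if name not in arr.dtype.names:
            raise KeyError("No field named %r" % (name,))
        info = fields[name]  # (dtype, offset) or (dtype, offset, title)
        title = info[2] if len(info) > 2 else None
        key = (title, name) if title is not None else name
        descr.append((key, info[0]))
    out = np.empty(arr.shape, dtype=descr)
    for name in names:
        out[name] = arr[name]
    return out


a = np.zeros(2, dtype=[(('title 1', 'x'), '|i1'), (('title 2', 'y'), '>f4')])
a['x'] = [1, 2]
sub = select_fields_keep_titles(a, ['y', 'x'])
print(sub.dtype.descr)
```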

@ahaldane
Member

ahaldane commented Sep 3, 2017

Also, one last note: We are planning to merge #6053 which will probably affect you, since it changes the way multi-field indexing works. If you are using numpy 1.13 you should be getting lots of FutureWarnings about it.

@axeloide
Author

axeloide commented Sep 4, 2017

@ahaldane, thanks a lot for addressing this in your commit! 👍

As you have probably guessed, I would indeed expect behaviour 1. Making titles also work as indices seems like one feature too many and, in my opinion, violates the principle of "separation of concerns". In my naive understanding, names are unique keys and titles are optional metadata that piggyback on them and only serve documentation purposes.

Also thanks for the "FutureWarnings". They already made me check my usages and I'm OK with views instead of copies, it is actually what I had expected anyway.

As for your changes to the docs stating that titles are an obsolete feature: yes, that is way too strong!
I really consider titles a very useful feature, so much so, that I invested more effort in explaining why here: https://stackoverflow.com/q/45939506/2239469
I'd be willing to contribute to the documentation of titles: I find the pyplot example illustrative (title as axis labels).

@eric-wieser
Member

eric-wieser commented Sep 4, 2017

I really consider titles a very useful feature, so much so, that I invested more effort in explaining why here:

Even if this metadata is useful, I'd argue that it's in the wrong place - I think it should be attached to the field type, not to the field name; so it belongs on the dtypes, not the containing np.void. So for instance, I'd favor something like:

x_dt = np.dtype(np.float32, metadata='elevation / m')
T_dt = np.dtype(np.float32, metadata='temperature / K')
new_dt = np.dtype([('x', x_dt), ('T', T_dt)])

some_data = np.array(..., new_dt)
some_x = some_data['x']

# some_x still has the metadata attached - using titles forces it to be discarded
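
For comparison, numpy's `np.dtype` constructor does accept a `metadata` keyword, but it expects a dictionary rather than a bare string. A minimal sketch of the idea above using that keyword follows; the `'label'` key is an arbitrary choice made for this example, and whether the metadata survives every array operation is not something this sketch claims.

```python
import numpy as np

# The metadata keyword takes a dict; the 'label' key is just a convention
# picked for this illustration.
x_dt = np.dtype(np.float32, metadata={'label': 'elevation / m'})
T_dt = np.dtype(np.float32, metadata={'label': 'temperature / K'})
new_dt = np.dtype([('x', x_dt), ('T', T_dt)])

some_data = np.zeros(4, dtype=new_dt)

# The per-field dtype objects stored in the structured dtype keep their metadata.
x_label = new_dt.fields['x'][0].metadata['label']
T_label = new_dt.fields['T'][0].metadata['label']
print(x_label, '|', T_label)

# Inspect what a single-field view reports for its dtype metadata.
some_x = some_data['x']
print(some_x.dtype.metadata)
```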

@eric-wieser
Member

eric-wieser commented Sep 4, 2017

Note also that using .descr is pretty fragile, and fails to roundtrip on more exotic dtypes:

>>> dt = np.dtype((int, 3))
>>> dt
dtype(('<i4', (3,)))
>>> np.dtype(dt.descr)
dtype([('f0', 'V12')])

While it seems I'm wrong about titles being obsolete (and was just remembering @ahaldane's PR), I would definitely recommend avoiding dtype.descr, and given that all internal uses are slowly being removed to fix bugs, would consider it obsolete.
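
Code that still relies on `.descr` can at least check whether a given dtype survives the trip before trusting it. The helper below is a hypothetical guard written for this illustration, not part of numpy; it simply rebuilds the dtype from its `.descr` and compares the result with the original.

```python
import numpy as np


def descr_roundtrips(dt):
    """Return True if np.dtype(dt.descr) reproduces dt (hypothetical helper)."""
    dt = np.dtype(dt)
    try:
        rebuilt = np.dtype(dt.descr)
    except (TypeError, ValueError):
        return False
    return rebuilt == dt


# Structured dtype with titles, as in the original report.
titled = np.dtype([(('title 1', 'x'), '|i1'), (('title 2', 'y'), '>f4')])
# Subarray dtype, as in the example above (int32 chosen explicitly).
subarray = np.dtype((np.int32, 3))

titled_ok = descr_roundtrips(titled)
subarray_ok = descr_roundtrips(subarray)
print(titled_ok, subarray_ok)
```

Unstructured dtypes fail this check in general, because their `.descr` is a one-element field list with an empty name, which `np.dtype` turns into a structured dtype.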

@axeloide
Author

axeloide commented Sep 4, 2017

@eric-wieser I agree with both your remarks. Having metadata/titles on the field-type instead would also be great.
I was being too lazy/confused when using repr(numpy_array.dtype.descr); I guess it would be much more reliable to use repr(numpy_array.dtype) instead.
