Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: dtype objects with non-empty metadata display the metadata in their repr #5973

Closed
wants to merge 1 commit into from

Conversation

embray
Copy link
Contributor

@embray embray commented Jun 16, 2015

Before:

>>> dt = np.dtype('int32', metadata={'a': 1})
>>> dt
dtype('int32')
>>> dt == np.dtype('int32')
True
>>> dt is np.dtype('int32')
False

The last two examples demonstrate that although the new dtype object
reprs identically to dtype('int32') it is suprisingly not the same.
Displaying its metadata makes this clear, and is supported by the
general recommendation of reprs that can recreate the object they
repr.

After:

>>> dt = np.dtype('int32', metadata={'a': 1})
>>> dt
dtype('int32', metadata={'a': 1})

I couldn't find any prior discussion about this, so I guess it just hasn't come up. It doesn't break any existing tests, and makes the repr of dtypes with metadata more accurate, so I don't see the harm. I can add a new test if desired.

repr

Before:

    >>> dt = np.dtype('int32', metadata={'a': 1})
    >>> dt
    dtype('int32')
    >>> dt == np.dtype('int32')
    True
    >>> dt is np.dtype('int32')
    False

The last two examples demonstrate that although the new dtype object
reprs identically to dtype('int32') it is suprisingly not the same.
Displaying its metadata makes this clear, and is supported by the
general recommendation of reprs that can recreate the object they
repr.

After:

    >>> dt = np.dtype('int32', metadata={'a': 1})
    >>> dt
    dtype('int32', metadata={'a': 1})
@ahaldane
Copy link
Member

Is metadata documented anywhere? It would be good to document it here, here and in the numpy.dtype docstring.

dtype.base should be documented too

@embray
Copy link
Contributor Author

embray commented Jun 16, 2015

Huh, yeah, apparently it's pretty undocumented. I don't think it gets used much (perhaps for that reason). c_metadata is used for datetime dtypes, but the python-level metadata still has potential use for "custom" dtypes.

@embray
Copy link
Contributor Author

embray commented Jun 16, 2015

I see, it was used originally for datetime, but the datetime usage of it was replaced to use the new c_metadata in eb40102; so the metadata dict is no longer used anywhere internally by Numpy.

@embray
Copy link
Contributor Author

embray commented Jun 18, 2015

By the way, just to be clear, I do have an imminent use case for this: I'm working on an ndarray subclass that will handle encoded text a little more nicely, so that bytes arrays can still return unicode string scalars and also work better with item assignment with unicode strings. It's not 100% perfect but still better than nothing. I want to be able to use the dtype.metadata dict to include text encoding information. There are other options for that (my current approach actually transparently makes per-encoding subclasses of the array type). But having encoding as part of the dtype has many advantages. I can do this already too, but since it will be important to interpreting the array ("why am I getting unicode strings when it says the dtype is '|S<n>'?") it will be nicer to users to have this.

I'll work on touching up the documentation and then I hope we can get this merged.

@njsmith
Copy link
Member

njsmith commented Jun 19, 2015

Would it be a terrible imposition to store the metadata in your special
array class instead of in the dtype?
.
We desperately need to rewrite the dtype system so that it is easy to
actually define new dtypes that do this sort of thing properly and just
work with regular ndarrays. I'm sure you're well aware of how unpleasant it
is right now to do what you're trying to do.
.
The main challenge in doing this is not breaking backcompat.
.
Right now it is very likely that there is literally nobody using the
.metadata field on string or other dtypes. So if you start using it, then
it substantially increases the chances that we will have to put non-trivial
effort into preserving the .metadata field on new dtypes, managing a full
deprecation cycle, etc., which will take away effort from actually
productive things like making it easy to do what you actually want to do
:-).
.
I would actually prefer to deprecate .metadata now rather than documenting
it. It has no meaningful semantics at all. There is no guarantee that it
will be preserved through any operations. Arguably int32 shouldn't repr
it, because while it is a bit of data stored in the struct, it is not part
of the semantics of int32; it's more like the refcnt or something, it only
exists as a more-or-less accidental internal implementation detail. That it
is even possible to set at all is entirely an accident, not a designed API.

On Thu, Jun 18, 2015 at 10:03 AM, Erik Bray notifications@github.com
wrote:

By the way, just to be clear, I do have an imminent use case for this: I'm
working on an ndarray subclass that will handle encoded text a little more
nicely, so that bytes arrays can still return unicode string scalars and
also work better with item assignment with unicode strings. It's not 100%
perfect but still better than nothing. I want to be able to use the
dtype.metadata dict to include text encoding information. There are
other options for that (my current approach actually transparently makes
per-encoding subclasses of the array type). But having encoding as part of
the dtype has many advantages. I can do this already too, but since it will
be important to interpreting the array ("why am I getting unicode strings
when it says the dtype is '|S'?") it will be nicer to users to have this.

I'll work on touching up the documentation and then I hope we can get this
merged.


Reply to this email directly or view it on GitHub
#5973 (comment).

Nathaniel J. Smith -- http://vorpus.org

@embray
Copy link
Contributor Author

embray commented Jun 19, 2015

Would it be a terrible imposition to store the metadata in your special array class instead of in the dtype?

It's not a terrible imposition, because I already have a prototype that does this. However, it does lead to a great deal of awkwardness. For example, if I have an array of bytes, and I want to view it as an array of ASCII-encoded text, the current interface does not have a good way to pass additional arguments to ndarray.view() that should be used in creating the view.

I have a generic encoded_text_array type, but it's useless without specifying an encoding. The hack I have currently is to do something like array.view(encoded_text_array.with_encoding('ascii')), where with_encoding actually returns a new subclass that has the specific encoding baked into it. This isn't all bad, but it's still a hack. Being able to attach additional metadata to a dtype would be a much nicer way to handle this (especially since dtype objects don't allow arbitrary attributes). And it turns out it's not too hard to preserve this information between most operations of interest, as has been demonstrated with the datetime types. The only difference between this and datetime is that datetime types store their additional metadata now via the c_metadata member of PyArray_Descr structure, which makes it impossible to implement new specialized dtypes from pure Python.

An alternative would be to allow subclassing dtypes, but this comes with its own (similar) problems. Albeit problems that are already largely solved for ndarray subclasses.

@njsmith
Copy link
Member

njsmith commented Jun 24, 2015

So IIUC, right now you write

arr = get_bytes_array()
text_arr = arr.view(encoded_text_array.with_encoding("ascii"))

and you would like to instead write

arr = get_bytes_array()
arr.dtype.metadata["encoding"] = "ascii"
text_arr = arr.view(encoded_text_array)

Is that right?

I imagine that in practice you'll probably want to provide some sort of helper for this, so that it instead becomes:

arr = get_bytes_array()
text_arr = as_encoded(arr, "ascii")

though I guess this does have a very surprising flaw:

arr = get_bytes_array()
# mutates arr.dtype
text_arr1 = as_encoded(arr, "latin1")
# mutates arr.dtype, overwriting previous change
text_arr2 = as_encoded(arr, "utf8")
# now text_arr1 is also using utf8 encoding

My suggestion would be to instead write

arr = get_bytes_array()
text_arr = arr.view(encoded_text_array)
text_arr.encoding = "ascii"

which is the same number of lines, just as easy to wrap into a utility function, and avoids surprising aliasing.

What do you think? Would this solve your problem?

An alternative would be to allow subclassing dtypes

This is exactly the option that I'm hoping to avoid putting extra roadblocks in front of :-).

@embray
Copy link
Contributor Author

embray commented Jun 24, 2015

@njsmith That's a good analysis of what I'm up against. You are right, now that I see it that way, that in the current implementation just being able to store the encoding in the dtype metadata does not offer any particular advantage, since I still need a custom array subclass in order to interpret that dtype properly.

What I really want, it seems, is to be able to define a new dtype (call it "Text", say), that can take as an option an encoding. Then, somewhat in analogy to the datetime dtype, I could do something like:

>>> arr = get_array_bytes()
>>> text = arr.view(np.dtype('T[ascii]'))

I guess what I had in mind for now is something like:

>>> arr.view(encoded_text_array, np.dtype('S', metadata={'encoding': 'ascii'}))

but that's probably even more cumbersome than my current solution, even if the implementation requires less metaclass hackery.

All that said, there may also still be advantages to having a custom array class too. For example, I do want to be able to support automatic stripping of strings (particularly rstrip) just as the chararray class has now. I also want to have a method or attribute by which users can determine the maximum string length they can store in the array (for variable width encodings).

Anyways, I'll relent on this PR for now. If you have any specific ideas for what you have in mind either for the future of dtypes, or for text arrays let me know--I'm willing to contribute to implementations of either of these things. In particular, if my experience trying to implement a custom dtype in C has taught me anything, it's that having a Pythonic interface for defining new dtypes has the potential to be amazing (and it's not the C part I had trouble with--I'm a pretty adept C programmer--it was just hard to assemble all the pieces required and determine exactly what needs to happen where...)

@ahaldane
Copy link
Member

numpy sort of allows subclassing of dtypes if dtype is void. It seems somewhat hacky, but you might play around with something like this:

class my_unicode(np.void):
    def __repr__(self):
        return unicode(self.view(('S', self.dtype.itemsize))).strip() + u"Hello world"

a = array(["foo  ", "bar"])
b = a.view(dtype((my_unicode, 'V5')))
b[0]
 --> 'fooHello world'

@embray
Copy link
Contributor Author

embray commented Jun 26, 2015

Interesting, I hadn't thought of that. Seems like a quirk I might not want to rely on too heavily, but then again this is how the record type works too.

@njsmith
Copy link
Member

njsmith commented Jun 30, 2015

If you have any specific ideas for what you have in mind either for the future of dtypes, or for text arrays let me know--I'm willing to contribute to implementations of either of these things.

I'm trying to get my notes on these things together now actually, ahead of the numpy dev meeting at scipy (July 7). Any chance you'll be there? And either way the idea is that the discussion there will hopefully lead to some more concrete plan for going forward...

@embray
Copy link
Contributor Author

embray commented Jun 30, 2015

I won't be at SciPy, alas.

@embray
Copy link
Contributor Author

embray commented Oct 7, 2015

Going to go ahead and close this, since it seems the plan is to consider dtype.metadata deprecated and to discourage its use for anything.

I would argue that it would be convenient for the new dtype system to have some place to store type-specific metadata, but since that could still end up looking different from what's in there now it's not really relevant I suppose.

@embray embray closed this Oct 7, 2015
@njsmith
Copy link
Member

njsmith commented Oct 7, 2015

The idea of the new dtype system is that your dtype gets to be an arbitrary
python class, so it can have whatever attributes you want :-).

On Wed, Oct 7, 2015 at 9:34 AM, Erik Bray notifications@github.com wrote:

Going to go ahead and close this, since it seems the plan is to consider
dtype.metadata deprecated and to discourage its use for anything.

I would argue that it would be convenient for the new dtype system to have
some place to store type-specific metadata, but since that could still end
up looking different from what's in there now it's not really relevant I
suppose.


Reply to this email directly or view it on GitHub
#5973 (comment).

Nathaniel J. Smith -- http://vorpus.org

@embray
Copy link
Contributor Author

embray commented Oct 8, 2015

That'll be pretty nice. Will it still have some standard dtype base class, or at least an ABC? That is, is the idea to just make dtype subclassable (and I guess fix whatever issues are associated with that)?

@njsmith
Copy link
Member

njsmith commented Oct 8, 2015

Yes, you'd subclass dtype, and the stuff that's in the ArrFuncs struct
would move to being methods.
.
It'll have to be a concrete base class, not an ABC, because we need some
logic to make these C level slots (similar to add versus nb_add) for
speed.
On Oct 8, 2015 3:24 PM, "Erik Bray" notifications@github.com wrote:

That'll be pretty nice. Will it still have some standard dtype base
class, or at least an ABC? That is, is the idea to just make dtype
subclassable (and I guess fix whatever issues are associated with that)?


Reply to this email directly or view it on GitHub
#5973 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants