ENH: dtype objects with non-empty metadata display the metadata in their repr #5973
Before:

>>> dt = np.dtype('int32', metadata={'a': 1})
>>> dt
dtype('int32')
>>> dt == np.dtype('int32')
True
>>> dt is np.dtype('int32')
False

The last two examples demonstrate that although the new dtype object has the same repr as dtype('int32'), it is surprisingly not the same object. Displaying its metadata makes this clear, and follows the general recommendation that a repr should be able to recreate the object it represents.

After:

>>> dt = np.dtype('int32', metadata={'a': 1})
>>> dt
dtype('int32', metadata={'a': 1})
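For illustration, the proposed repr can be approximated in pure Python. This is a minimal sketch, not the change made in this PR: `repr_with_metadata` is a hypothetical helper, and it assumes a simple (non-structured) dtype whose `name` can be passed back to `np.dtype`.

```python
import numpy as np


def repr_with_metadata(dt):
    """Hypothetical helper: repr a simple dtype, including any metadata."""
    text = "dtype({!r}".format(dt.name)
    # dt.metadata is None when no metadata was supplied.
    if dt.metadata:
        text += ", metadata={!r}".format(dict(dt.metadata))
    return text + ")"


with_meta = np.dtype('int32', metadata={'a': 1})
plain = np.dtype('int32')

print(repr_with_metadata(with_meta))
print(repr_with_metadata(plain))
# The two dtypes compare equal even though only one carries metadata.
print(with_meta == plain)
```

The string produced for the metadata case has the same shape as the "After" output above, so it could be passed back to `np.dtype` to rebuild an equivalent object.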
Huh, yeah, apparently it's pretty undocumented. I don't think it gets used much (perhaps for that reason). c_metadata is used for datetime dtypes, but the Python-level metadata still has potential use for "custom" dtypes.
I see, it was used originally for datetime, but the datetime usage of it was replaced to use the new c_metadata in eb40102; so the metadata dict is no longer used anywhere internally by NumPy.
By the way, just to be clear, I do have an imminent use case for this: I'm working on an ndarray subclass that will handle encoded text a little more nicely, so that bytes arrays can still return unicode string scalars and also work better with item assignment with unicode strings. It's not 100% perfect but still better than nothing. I want to be able to use the I'll work on touching up the documentation and then I hope we can get this merged.
Would it be a terrible imposition to store the metadata in your special On Thu, Jun 18, 2015 at 10:03 AM, Erik Bray notifications@github.com
Nathaniel J. Smith -- http://vorpus.org
It's not a terrible imposition, because I already have a prototype that does this. However, it does lead to a great deal of awkwardness. For example, if I have an array of bytes, and I want to view it as an array of ASCII-encoded text, the current interface does not have a good way to pass additional arguments to I have a generic An alternative would be to allow subclassing dtypes, but this comes with its own (similar) problems. Albeit problems that are already largely solved for ndarray subclasses.
So IIUC, right now you write
and you would like to instead write
Is that right? I imagine that in practice you'll probably want to provide some sort of helper for this, so that it instead becomes:
though I guess this does have a very surprising flaw:
My suggestion would be to instead write
which is the same number of lines, just as easy to wrap into a utility function, and avoids surprising aliasing. What do you think? Would this solve your problem?
This is exactly the option that I'm hoping to avoid putting extra roadblocks in front of :-).
@njsmith That's a good analysis of what I'm up against. You are right, now that I see it that way, that in the current implementation just being able to store the encoding in the dtype metadata does not offer any particular advantage, since I still need a custom array subclass in order to interpret that dtype properly. What I really want, it seems, is to be able to define a new dtype (call it "Text", say), that can take as an option an encoding. Then, somewhat in analogy to the datetime dtype, I could do something like:

>>> arr = get_array_bytes()
>>> text = arr.view(np.dtype('T[ascii]'))

I guess what I had in mind for now is something like:

>>> arr.view(encoded_text_array, np.dtype('S', metadata={'encoding': 'ascii'}))

but that's probably even more cumbersome than my current solution, even if the implementation requires less metaclass hackery.

All that said, there may also still be advantages to having a custom array class too. For example, I do want to be able to support automatic stripping of strings (particularly rstrip) just as the

Anyways, I'll relent on this PR for now. If you have any specific ideas for what you have in mind either for the future of dtypes, or for text arrays let me know--I'm willing to contribute to implementations of either of these things. In particular, if my experience trying to implement a custom dtype in C has taught me anything, it's that having a Pythonic interface for defining new dtypes has the potential to be amazing (and it's not the C part I had trouble with--I'm a pretty adept C programmer--it was just hard to assemble all the pieces required and determine exactly what needs to happen where...)
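The metadata-based variant discussed above can be sketched as follows. This is only an illustration under stated assumptions: `decode_items` is a hypothetical helper, the `'encoding'` key is a convention invented for this example, and NumPy itself does nothing with the metadata, which is why some wrapper (a function or an array subclass) is still needed to interpret it.

```python
import numpy as np


def decode_items(arr):
    """Hypothetical helper: decode a bytes array using the encoding
    stored in its dtype metadata (falling back to ASCII)."""
    meta = arr.dtype.metadata or {}
    encoding = meta.get('encoding', 'ascii')  # assumed default
    return [item.decode(encoding) for item in arr.tolist()]


raw = np.array([b'abc', b'de'], dtype='S3')

# View the same bytes through a dtype that carries the encoding as metadata.
text_dtype = np.dtype('S3', metadata={'encoding': 'ascii'})
text = raw.view(text_dtype)

print(text.dtype.metadata['encoding'])
print(decode_items(text))
```

The view shares memory with `raw`; only the dtype object attached to it differs, so the encoding travels with the array without a custom dtype.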
numpy sort of allows subclassing of dtypes if dtype is void. It seems somewhat hacky, but you might play around with something like this:
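One way this could look, as a sketch rather than the snippet originally posted: a subclass of `np.void` is passed as the first element of a `(type, fields)` tuple, so that the resulting dtype hands out scalars of that subclass. `TaggedVoid` and its `describe` method are hypothetical names; `np.record` is also a subclass of `np.void`.

```python
import numpy as np


class TaggedVoid(np.void):
    """Hypothetical scalar type with one extra method."""

    def describe(self):
        return "TaggedVoid with fields {}".format(self.dtype.names)


# (scalar type, field description): the dtype keeps the fields
# but reports TaggedVoid as its scalar type.
dt = np.dtype((TaggedVoid, [('x', 'i4'), ('y', 'f8')]))
arr = np.zeros(2, dtype=dt)

print(dt.type is TaggedVoid)
print(type(arr[0]).__name__)
print(arr[0].describe())
```

Behaviour defined on the subclass is then available on every scalar taken from the array, although the array itself is still a plain ndarray.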
Interesting, I hadn't thought of that. Seems like a quirk I might not want to rely on too heavily, but then again this is how the record type works too.
I'm trying to get my notes on these things together now actually, ahead of the numpy dev meeting at scipy (July 7). Any chance you'll be there? And either way the idea is that the discussion there will hopefully lead to some more concrete plan for going forward...
I won't be at SciPy, alas.
Going to go ahead and close this, since it seems the plan is to consider I would argue that it would be convenient for the new dtype system to have some place to store type-specific metadata, but since that could still end up looking different from what's in there now it's not really relevant I suppose.
The idea of the new dtype system is that your dtype gets to be an arbitrary On Wed, Oct 7, 2015 at 9:34 AM, Erik Bray notifications@github.com wrote:
Nathaniel J. Smith -- http://vorpus.org
That'll be pretty nice. Will it still have some standard
Yes, you'd subclass dtype, and the stuff that's in the ArrFuncs struct
I couldn't find any prior discussion about this, so I guess it just hasn't come up. It doesn't break any existing tests, and makes the repr of dtypes with metadata more accurate, so I don't see the harm. I can add a new test if desired.