The epic dtype cleanup plan #2899

Closed
1 of 2 tasks
njsmith opened this issue Jan 10, 2013 · 5 comments
Comments

@njsmith
Member

njsmith commented Jan 10, 2013

[this is very drafty notes-to-self that I'm hitting save on now because I was dumb and typed it into a webform where it might get lost, and because I'm lazy and don't want to pull it out to stick in some file on my hard-drive where I'll lose it. So no guarantees any of this actually makes sense or anything, but since we've been running into a number of nasty aspects of dtypes recently I wanted to write down a bunch of ideas together so we can try and organize them somewhat. Feel free to comment even at this early date. Also, don't tell anyone, but part of the secret goal here is enabling the NA dtype work too (though not all of this is a prerequisite for that by any means).]

End goal: dtypes basically function like Python objects with respect to subclassing etc.; isinstance/issubclass work in a useful way; parametric dtypes are not horrible; user-defined dtypes are on closer-to-equal footing with built-in dtypes.

  • Pass dtypes to ufunc inner loops (What should be the calling convention for ufunc inner loop signatures? #12518)
  • Make dtypes immutable and make sure that we don't actually allocate new ones on ufunc return or when passing cast data to inner loops, but instead just incref-and-go. (immutability is a prerequisite for this.)
  • Move all the special case operations for void/string/datetime dtypes out of the core ufunc code and into ufunc loops.
  • Merge ufunc.so and multiarray.so, the split just ends up making us contort code to let them interface with each other (ENH: implement nep 0015: merge multiarray and umath #10915)
  • Make overflow/etc. error-flag clearing/setting/checking into methods on dtypes, and use this to split the integer and float error-flag handling. (Stop mixing up integer and floating point overflow #2898 makes this possible, but leaves the problem of arranging for the correct functions to be called on the correct loops; this is the clean way to actually call the correct pieces at the correct time, while as a bonus also allowing custom dtypes to do their own error signaling, or not, as they prefer.) (This is an example of a place where having to export this interface between multiarray.so, where dtypes live, and ufunc.so, where ufuncs live, would be pointlessly cumbersome.) (For the global mechanism this would replace, see the np.errstate sketch just after this list.)
  • Make dtype (the class) inherit from type and define __instancecheck__ and __subclasscheck__ methods, so isinstance(foo, dtype_instance) can do something useful. (This will completely change the dtype struct's memory layout, though: type is a huge object, and all of dtype's fields will get shifted down in memory.) (A pure-Python sketch of what this would buy us follows the note after this list.)
  • Give all the dtype functions self arguments and turn them into the equivalent of cpdef methods. (Not sure how we make this work -- follow the MRO when doing dtype operations? Or, in dtype.__new__, copy parent fn ptrs onto undefined child fn ptrs and, ha ha, no, multiple inheritance is not supported?)
  • Make it possible to add new dtype methods without breaking the memory representation in the future (right now they have to go in the middle of the dtype struct...)
  • Make it possible to define new dtypes in Python. (A set of C-level dtype methods that just call python-level methods)
  • Re-arrange all the weird stuff in the current dtype memory representation so the parts that are specific to a single dtype go into that subclass (e.g. the field names, the datetime unit, etc., should not be part of the core dtype object)
  • ...figure out how to do this without completely messing up existing code. Really only code that actually messes with dtypes should be affected (like code that defines new dtypes). Binary compatibility is much harder than source compatibility -- can we tell people to recompile? Possibility: make sure we can distinguish a pointer-to-old-dtype-object from a pointer-to-new-dtype-object (should be easy, all PyObject*'s are compatible enough to let you get their type), and at all the public entry points do if (!(arg = CheckOrConvertThisDtypeThing(arg))) return -1;, where that function sets up a new-style dtype in place of the old-style one while raising a DeprecationWarning. This might not be so horrible even, though we would want to make sure we did all the binary-compatibility-breaking changes at once. (This still won't help for user code that reaches into dtype objects by hand, though, as opposed to just creating them. It's possible that all user dtypes do this, so eh.)
  • Rationalize the dtype constructor arguments. Ideally move the horrible 'build a dtype from strings and tuples and sealing wax' stuff to Python code.
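
For context on the error-flag item above: at the time of this issue (and still largely today) the error state is a single global, dtype-agnostic setting that only the floating-point loops consult, while integer overflow just wraps silently. Here is a minimal illustration using the existing public np.errstate API; it shows the mechanism the bullet wants to replace with per-dtype methods, not the proposed interface itself:

import numpy as np

# Floating-point overflow goes through the global error state...
with np.errstate(over='raise'):
    try:
        np.array([1e300]) * 1e10                 # overflows float64
    except FloatingPointError as exc:
        print("float overflow caught:", exc)

# ...but integer overflow never consults it and silently wraps:
with np.errstate(over='raise'):
    print(np.array([2**31 - 1], dtype=np.int32) + 1)   # [-2147483648], no error raised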

Note: there is an argument for getting all the dtype struct changes out of the way before exposing them to ufunc inner loops, since that will open the door to more user code that depends on the contents of dtype structs. Of course if the problem is already as bad as it could possibly be, then this doesn't matter :-). And letting ufuncs get at dtypes is probably the single biggest win on this list, so it would be nice to prioritize it.
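
To make the isinstance/issubclass goal above a little more concrete, here is a pure-Python sketch of the shape of the machinery. Every name in it (DTypeMeta, DType, SignedInteger, Int64, Float64, FakeArray) is hypothetical and not NumPy API; the point is only that once the dtype class is itself a subclass of type, the __instancecheck__ hook can make isinstance() answer questions about an array's dtype:

# Hypothetical sketch only -- none of these classes exist in NumPy.
class DTypeMeta(type):
    def __instancecheck__(cls, obj):
        # Accept either a dtype instance or anything carrying a .dtype,
        # so isinstance(arr, SignedInteger) asks about arr's dtype.
        dt = getattr(obj, "dtype", obj)
        return issubclass(type(dt), cls)

class DType(metaclass=DTypeMeta):
    pass

class SignedInteger(DType):
    pass

class Int64(SignedInteger):
    pass

class Float64(DType):
    pass

class FakeArray:
    def __init__(self, dtype):
        self.dtype = dtype

arr = FakeArray(dtype=Int64())
assert isinstance(arr, SignedInteger)    # checks arr.dtype, not type(arr)
assert not isinstance(arr, Float64)
assert issubclass(Int64, SignedInteger)  # ordinary subclass relationships still hold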

@teoliphant
Member

Cool writeup!

On Jan 9, 2013, at 7:21 PM, njsmith wrote:

[quoting the issue description above; the relevant item:]

Make dtype (the class) inherit from type and define __instancecheck__ and __subclasscheck__ methods, so isinstance(foo, dtype_instance) can do something useful. (This will completely change the dtype struct's memory layout though, type is a huge object, and all of dtype's fields will get shifted down in memory.)
In the process you might think about making the dtype "class" have instances that are the array scalars, instead of having separate array-scalar types. It might get tricky, though, to have the int, float, and complex dtype equivalents dual-inherit the way they do now.

-Travis


@njsmith
Member Author

njsmith commented Jan 10, 2013

@teoliphant: Another of the secret goals is enabling us to eventually get rid of scalar types entirely in favor of 0-d ndarrays, as explored in this long email: http://mail.scipy.org/pipermail/numpy-discussion/2013-January/064995.html

That's a good point about dual inheritance though -- I wonder how important that is. Certainly it would take some extreme trickery to have

>>> isinstance(np.array(1), int)
True
>>> isinstance(np.array(1.0), int)
False
>>> isinstance(np.array(1), float)
False
>>> isinstance(np.array(1.0), float)
True

We could have ndarray_int, ndarray_float types that were just ndarray-plus-a-nominal-inheritance-from-{int,float}, but the problem is that a single array over its lifetime can convert from 0-d to multi-dimensional, and can convert from float to int and vice-versa.

OTOH CPython does allow you to change the type of an object in-place, so this is possible...

In [15]: class Foo(object):
   ....:     pass
   ....: 

In [16]: class Bar(object):
   ....:     pass
   ....: 

In [17]: foo = Foo()

In [18]: foo.__class__ = Bar

In [19]: foo
Out[19]: <__main__.Bar at 0x29357d0>

Really in the long run it would be nice if the one-and-only-one way to do these kinds of checks was something like isinstance(np.asarray(obj).dtype, my_favorite_dtype), instead of playing wacky games with lying inheritance hierarchies and abstract base classes and stuff. But right now we have all these methods that sort of work, sometimes, if you're lucky (isinstance(obj, int), isinstance(obj, np.float64)), and we need to deal with that history...
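
For reference, the checks that do work reliably today route through np.issubdtype and the abstract scalar-type hierarchy rather than plain isinstance on dtypes; a quick illustration with current, real NumPy API:

import numpy as np

print(np.issubdtype(np.array(1).dtype, np.integer))     # True
print(np.issubdtype(np.array(1.0).dtype, np.integer))   # False
print(np.issubdtype(np.array(1.0).dtype, np.floating))  # True

# ...whereas isinstance() against the builtin types says nothing useful
# about the array's dtype:
print(isinstance(np.array(1), int))                     # False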

@rkern rkern closed this as completed Jan 10, 2013
@rkern rkern reopened this Jan 10, 2013
@rkern
Member

rkern commented Jan 10, 2013

Sorry. Stupid buttons.

@jpaalasm
Contributor

Hi. I was doing some dtype stuff today and noticed some weirdness in dtype construction (http://docs.scipy.org/doc/numpy/user/basics.rec.html#defining-structured-arrays).

The dtype constructor argument is interpreted differently depending on whether it is a list or tuple. Ouch!

This is very confusing, as sequence types (list, tuple) are usually interpreted the same way. Is there a backwards-incompatible release on the horizon where this could be fixed?
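
A concrete instance of the asymmetry described above, using plain np.dtype calls (real behavior in current NumPy, per the structured-array docs linked earlier):

import numpy as np

# A list of (name, format) tuples builds a structured dtype with named fields...
print(np.dtype([('x', 'i4'), ('y', 'f8')]))
# -> dtype([('x', '<i4'), ('y', '<f8')])

# ...while a tuple is read as (base_dtype, shape) and builds a sub-array dtype:
print(np.dtype(('i4', (2,))))
# -> dtype(('<i4', (2,)))

# So the outer container, not just its contents, decides which kind of dtype you get.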

@seberg
Member

seberg commented Jan 7, 2020

I am going to close this. I went through the list of changes above, and either there are more specific open issues or they are on the TODO list (e.g. the current state of gh-14422 includes the goal of making dtypes proper Python objects). It also links here (and if not, I will fix that).
I think it is clear that dtypes should be full-fledged Python objects; the one point here that may need a bit more thought is what to do about error handling. However, that also got discussed in quite a lot of depth in gh-12518.

pydata/xarray#1262 is an interesting issue on xarray to list as motivation.

@seberg seberg closed this as completed Jan 7, 2020