The epic dtype cleanup plan #2899

Closed
1 of 2 tasks
njsmith opened this issue Jan 10, 2013 · 5 comments
Comments

@njsmith
Member

njsmith commented Jan 10, 2013

[this is very drafty notes-to-self that I'm hitting save on now because I was dumb and typed it into a webform where it might get lost, and because I'm lazy and don't want to pull it out to stick in some file on my hard-drive where I'll lose it. So no guarantees any of this actually makes sense or anything, but since we've been running into a number of nasty aspects of dtypes recently I wanted to write down a bunch of ideas together so we can try and organize them somewhat. Feel free to comment even at this early date. Also, don't tell anyone, but part of the secret goal here is enabling the NA dtype work too (though not all of this is a prerequisite for that by any means).]

End goal: dtypes basically function like Python objects with respect to subclassing etc.; isinstance/issubclass work in a useful way; parametric dtypes are not horrible; user-defined dtypes are on closer-to-equal footing with built-in dtypes.

  • Pass dtypes to ufunc inner loops (What should be the calling convention for ufunc inner loop signatures? #12518)
  • Make dtypes immutable and make sure that we don't actually allocate new ones on ufunc return or when passing cast data to inner loops, but instead just incref-and-go. (immutability is a prerequisite for this.)
  • Move all the special case operations for void/string/datetime dtypes out of the core ufunc code and into ufunc loops.
  • Merge ufunc.so and multiarray.so, the split just ends up making us contort code to let them interface with each other (ENH: implement nep 0015: merge multiarray and umath #10915)
  • Make overflow/etc. error-flag clearing/setting/checking into methods on dtypes, and use this to split the integer and float error-flag handling. (Stop mixing up integer and floating point overflow #2898 makes this possible, but leaves the problem of arranging for the correct functions to be called on the correct loops; this is the clean way to actually call the correct pieces at the correct time, while as a bonus also allowing custom dtypes to do their own error signaling, or not, as they prefer.) (This is an example of a place where having to export this interface between multiarray.so, where dtypes live, and ufunc.so, where ufuncs live, would be pointlessly cumbersome.) (For the global mechanism this would replace, see the np.errstate sketch just after this list.)
  • Make dtype (the class) inherit from type and define __instancecheck__ and __subclasscheck__ methods, so isinstance(foo, dtype_instance) can do something useful. (This will completely change the dtype struct's memory layout, though: type is a huge object, and all of dtype's fields will get shifted down in memory.) (A pure-Python sketch of what this would buy us follows the note after this list.)
  • Give all the dtype functions self arguments and turn them into the equivalent of cpdef methods. (Not sure how we make this work -- follow the MRO when doing dtype operations? Or, in dtype.__new__, copy parent fn ptrs onto undefined child fn ptrs and, ha ha, no, multiple inheritance is not supported?)
  • Make it possible to add new dtype methods without breaking the memory representation in the future (right now they have to go in the middle of the dtype struct...)
  • Make it possible to define new dtypes in Python. (A set of C-level dtype methods that just call python-level methods)
  • Re-arrange all the weird stuff in the current dtype memory representation so the parts that are specific to a single dtype go into that subclass (e.g. the field names, the datetime unit, etc., should not be part of the core dtype object)
  • ...figure out how to do this without completely messing up existing code. Really only code that actually messes with dtypes should be affected (like code that defines new dtypes). Binary compatibility is much harder than source compatibility -- can we tell people to recompile? Possibility: make sure we can distinguish a pointer-to-old-dtype-object from a pointer-to-new-dtype-object (should be easy, all PyObject*'s are compatible enough to let you get their type), and at all the public entry points do if (!(arg = CheckOrConvertThisDtypeThing(arg))) return -1;, where that function sets up a new-style dtype in place of the old-style one while raising a DeprecationWarning. This might not be so horrible even, though we would want to make sure we did all the binary-compatibility-breaking changes at once. (This still won't help for user code that reaches into dtype objects by hand, though, as opposed to just creating them. It's possible that all user dtypes do this, so eh.)
  • Rationalize the dtype constructor arguments. Ideally move the horrible 'build a dtype from strings and tuples and sealing wax' stuff to Python code.
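
For context on the error-flag item above: at the time of this issue (and still largely today) the error state is a single global, dtype-agnostic setting that only the floating-point loops consult, while integer overflow just wraps silently. Here is a minimal illustration using the existing public np.errstate API; it shows the mechanism the bullet wants to replace with per-dtype methods, not the proposed interface itself:

import numpy as np

# Floating-point overflow goes through the global error state...
with np.errstate(over='raise'):
    try:
        np.array([1e300]) * 1e10                 # overflows float64
    except FloatingPointError as exc:
        print("float overflow caught:", exc)

# ...but integer overflow never consults it and silently wraps:
with np.errstate(over='raise'):
    print(np.array([2**31 - 1], dtype=np.int32) + 1)   # [-2147483648], no error raised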

Note: there is an argument for getting all the dtype struct changes out of the way before exposing them to ufunc inner loops, since that will open the door to more user code that depends on the contents of dtype structs. Of course if the problem is already as bad as it could possibly be, then this doesn't matter :-). And letting ufuncs get at dtypes is probably the single biggest win on this list, so it would be nice to prioritize it.
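
To make the isinstance/issubclass goal above a little more concrete, here is a pure-Python sketch of the shape of the machinery. Every name in it (DTypeMeta, DType, SignedInteger, Int64, Float64, FakeArray) is hypothetical and not NumPy API; the point is only that once the dtype class is itself a subclass of type, the __instancecheck__ hook can make isinstance() answer questions about an array's dtype:

# Hypothetical sketch only -- none of these classes exist in NumPy.
class DTypeMeta(type):
    def __instancecheck__(cls, obj):
        # Accept either a dtype instance or anything carrying a .dtype,
        # so isinstance(arr, SignedInteger) asks about arr's dtype.
        dt = getattr(obj, "dtype", obj)
        return issubclass(type(dt), cls)

class DType(metaclass=DTypeMeta):
    pass

class SignedInteger(DType):
    pass

class Int64(SignedInteger):
    pass

class Float64(DType):
    pass

class FakeArray:
    def __init__(self, dtype):
        self.dtype = dtype

arr = FakeArray(dtype=Int64())
assert isinstance(arr, SignedInteger)    # checks arr.dtype, not type(arr)
assert not isinstance(arr, Float64)
assert issubclass(Int64, SignedInteger)  # ordinary subclass relationships still hold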

@teoliphant
Member

Cool writeup!

On Jan 9, 2013, at 7:21 PM, njsmith wrote:

[quoting the issue description above; the relevant item:]

Make dtype (the class) inherit from type and define __instancecheck__ and __subclasscheck__ methods, so isinstance(foo, dtype_instance) can do something useful. (This will completely change the dtype struct's memory layout though, type is a huge object, and all of dtype's fields will get shifted down in memory.)
In the process you might think about making the dtype "class" have instances that are the array scalars, instead of having separate array-scalar types. It might get tricky, though, to have the int, float, and complex dtype equivalents dual-inherit the way they do now.

-Travis


@njsmith
Member Author

njsmith commented Jan 10, 2013

@teoliphant: Another of the secret goals is enabling us to eventually get rid of scalar types entirely in favor of 0-d ndarrays, as explored in this long email: http://mail.scipy.org/pipermail/numpy-discussion/2013-January/064995.html

That's a good point about dual inheritance though -- I wonder how important that is. Certainly it would take some extreme trickery to have

>>> isinstance(np.array(1), int)
True
>>> isinstance(np.array(1.0), int)
False
>>> isinstance(np.array(1), float)
False
>>> isinstance(np.array(1.0), float)
True

We could have ndarray_int, ndarray_float types that were just ndarray-plus-a-nominal-inheritance-from-{int,float}, but the problem is that a single array over its lifetime can convert from 0-d to multi-dimensional, and can convert from float to int and vice-versa.

OTOH CPython does allow you to change the type of an object in-place, so this is possible...

In [15]: class Foo(object):
   ....:     pass
   ....: 

In [16]: class Bar(object):
   ....:     pass
   ....: 

In [17]: foo = Foo()

In [18]: foo.__class__ = Bar

In [19]: foo
Out[19]: <__main__.Bar at 0x29357d0>

Really in the long run it would be nice if the one-and-only-one way to do these kinds of checks was something like isinstance(np.asarray(obj).dtype, my_favorite_dtype), instead of playing wacky games with lying inheritance hierarchies and abstract base classes and stuff. But right now we have all these methods that sort of work, sometimes, if you're lucky (isinstance(obj, int), isinstance(obj, np.float64)), and we need to deal with that history...
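
For reference, the checks that do work reliably today route through np.issubdtype and the abstract scalar-type hierarchy rather than plain isinstance on dtypes; a quick illustration with current, real NumPy API:

import numpy as np

print(np.issubdtype(np.array(1).dtype, np.integer))     # True
print(np.issubdtype(np.array(1.0).dtype, np.integer))   # False
print(np.issubdtype(np.array(1.0).dtype, np.floating))  # True

# ...whereas isinstance() against the builtin types says nothing useful
# about the array's dtype:
print(isinstance(np.array(1), int))                     # False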

@rkern rkern closed this as completed Jan 10, 2013
@rkern rkern reopened this Jan 10, 2013
@rkern
Member

rkern commented Jan 10, 2013

Sorry. Stupid buttons.

@jpaalasm
Contributor

Hi. I was doing some dtype stuff today and noticed some weirdness in dtype construction (http://docs.scipy.org/doc/numpy/user/basics.rec.html#defining-structured-arrays).

The dtype constructor argument is interpreted differently depending on whether it is a list or tuple. Ouch!

This is very confusing, as sequence types (list, tuple) are usually interpreted the same way. Is there a backwards-incompatible release on the horizon where this could be fixed?
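
A concrete instance of the asymmetry described above, using plain np.dtype calls (real behavior in current NumPy, per the structured-array docs linked earlier):

import numpy as np

# A list of (name, format) tuples builds a structured dtype with named fields...
print(np.dtype([('x', 'i4'), ('y', 'f8')]))
# -> dtype([('x', '<i4'), ('y', '<f8')])

# ...while a tuple is read as (base_dtype, shape) and builds a sub-array dtype:
print(np.dtype(('i4', (2,))))
# -> dtype(('<i4', (2,)))

# So the outer container, not just its contents, decides which kind of dtype you get.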

@seberg
Member

seberg commented Jan 7, 2020

I am going to close this. I went through the list of changes above, and either there are more specific open issues or they are on the TODO list (e.g. the current state of gh-14422 includes the goal of making dtypes proper Python objects). It also links here (and if not, I will fix that).
I think it is clear that dtypes should be full-fledged Python objects; the one point here that may need a bit more thought is what to do about error handling. However, that also got discussed in quite a lot of depth in gh-12518.

pydata/xarray#1262 is an interesting issue on xarray to list as motivation.

@seberg seberg closed this as completed Jan 7, 2020