ENH: refactor zero-filling and expose dtype API slot for it #23591

ngoldbaum · 2023-04-14T19:33:34Z

This refactors zero-filling to be controlled by dtype-specific logic. It also moves all zero-filling logic inside PyArray_NewFromDescr_int and removes the _zerofill helper, which now has no consumers.

I experimented with making the API slot only provide the initial zero value and let numpy fill the buffer with that value by doing e.g. repeated memcpy calls, but this ended up not being sufficient, since in principle each zero value might contain embedded references that must be unique, so we need control over filling the entire array buffer. It also makes it slightly simpler to handle object and void dtypes, since I can just move the existing logic from _zerofill to the new object_fill_zero_value function I added.

I didn't see the need to add a test for this to the scaled float dtype, since the legacy dtypes should use all the code paths I modified, but am happy to add such a test if that's desired.

seberg

Thanks for looking into this, I think it has the potential to clean up a quite a bit more mess. Unfortunately, I really would like to try to stick to that design goal of not using array objects (especially for something that is very fundamental).

seberg · 2023-04-14T19:57:46Z

numpy/core/include/numpy/_dtype_api.h

+ *
+ * @returns -1 or 0 indicating error and arr being successfully filled.
+ */
+typedef int (fill_initial_function)(PyArrayObject *arr);


I made sure we never use array objects anywhere in the new DType API. That is to allow swapping out the array object or using NumPy dtypes for non-array objects in principle, and was one of the early wishes not just from me, but also for example Travis.

There are two possible paths:

Do a strided buffer fill (could be a set up like the traverse/clearing function or in principle a ufunc core). Those add the get_travers_loop() indirection since that is needed for structured dtypes at least.

We fill a single element buffer then use that to fill the rest. (I used that approach for the ufunc identity.)

OK, fair enough! I think it might be possible to write this using a traversal loop. I will poke around next week.

Cool, thanks. I don't have any opinion right now about whats better. The first is neat, but the second seems simpler and close to how identities currently work...

Where in numpy is the ufunc identity buffer setup? I’ll take a look at that first since I’m not familiar with it.

Its the get_reduction_initial_function *get_reduction_initial; and defined here:

numpy/numpy/core/include/numpy/_dtype_api.h

Lines 178 to 200 in fd50e7f

/**

* Query an ArrayMethod for the initial value for use in reduction.

*

* @param context The arraymethod context, mainly to access the descriptors.

* @param reduction_is_empty Whether the reduction is empty. When it is, the

* value returned may differ. In this case it is a "default" value that

* may differ from the "identity" value normally used. For example:

* - `0.0` is the default for `sum([])`. But `-0.0` is the correct

* identity otherwise as it preserves the sign for `sum([-0.0])`.

* - We use no identity for object, but return the default of `0` and `1`

* for the empty `sum([], dtype=object)` and `prod([], dtype=object)`.

* This allows `np.sum(np.array(["a", "b"], dtype=object))` to work.

* - `-inf` or `INT_MIN` for `max` is an identity, but at least `INT_MIN`

* not a good *default* when there are no items.

* @param initial Pointer to initial data to be filled (if possible)

*

* @returns -1, 0, or 1 indicating error, no initial value, and initial being

* successfully filled. Errors must not be given where 0 is correct, NumPy

* may call this even when not strictly necessary.

*/

typedef int (get_reduction_initial_function)(

PyArrayMethod_Context *context, npy_bool reduction_is_empty,

char *initial);

The actual implementation is more annoying, since its set up to fall back to the one on the ufunc:

numpy/numpy/core/src/umath/legacy_array_method.c

Line 266 in 1bbe582

get_initial_from_ufunc(

(which is all a bit annoying and I wouldn't mind making it nicer, but it helps support complicated reduce initials.)

But of course what we would need here is much simpler, since it only needs the descr.

I agree with not having things like this on the dtype. Having a ufunc would seem to make a lot of sense to me. Could be an actual zeros(out=array, or be more like the strange private _ones_like that already exists (no idea why! but it is covered by astropy units...). For zero filling it would then be zeros_like(array, out=array). But another logical concept is that for dtypes that do not have strange non-identical zeros, one effectively just uses whatever np.copyto does.

(Sorry, I have not really looked at this PR, just really read the comments here... Ignore if this is all irrelevant.)

I had thought about a ufunc, but overall still lean against it.

My thoughts are roughly:

Creating a zero filled array seems very core and doing ufunc dispatching is a bit of work for something simple. It feels a bit much if I think of an exposing API like PyArray_FillBuffer(dtype, ptr, num, stride, NPY_ZEROVAL) (that is partially a lie, because we made sure ufuncs are now set up in a way to make exactly this fairly easy!).

Its just seems like more hassle for the DType author to implement for something pretty basic (but of course users will implement many ufuncs anyway).

Another path would be an ArrayMethod (i.e. ufunc loop) that lives on the DType (just like casts are ufuncs loops/ArrayMethods that don't live on a ufunc object).

For clearing (dealloc+NULLint of array values) I had for a simplified traversal loop getter. I don't like splitting from ArrayMethod, but it seemed like a lot of set up for something very basic. I do translate this to zero().

Now ufuncs seem like they should have some advantages (maybe for subclasses?), but I can't really put my finger on anything concrete right now except a certain beauty of aiming that as much as possible should a ufunc.

N.B.: I suspect this will continue to include "zero", "one", and also "empty" (we do need this for object() although I am not sure we will need it generally).
In principle one could continue down the rabbit hole with the typical ufunc.identity values, and NA filling might be a thing at some point.

Yeah, it is no more than thinking that there is a certain beauty and simplicity if a DType interacts with arrays only via ufuncs. Of course, regardless the DType has to have a way to tell what "zero" is (though there is an obvious default).

numpy/core/src/multiarray/ctors.c

numpy/core/src/multiarray/dtypemeta.c

ngoldbaum · 2023-04-18T17:20:36Z

I re-implemented the API hook to use a traversal loop. Instead of using PyArray_FillObjectArray I now manually insert pointers to the zero PyObject* for object arrays and use the _fillobject helper used by PyArray_FillObjectArray for void arrays. In principle I could have avoided _fillobject but it seemed better to avoid duplicating the somewhat complicated recursive logic it uses.

I considered following the initial reduction API slot but decided to use a traversal loop instead both to ease implementation for me (the existing approach is basically a traversal loop, so it was straightforward to translate) and to allow more flexibility for the dtype author. I suspect that this is a niche case so in my opinion it's OK to require a little extra boilerplate if that buys dtype authors more flexibility.

seberg · 2023-04-18T17:40:22Z

Thanks, I will have a look tomorrow.

In principle I could have avoided _fillobject but it seemed better to avoid duplicating the somewhat complicated recursive logic it uses.

We should definitely change it to use a pattern like the clear traversal loop (and hopefully remove the other place at some point). But, we can follow up on that too.

(EDIT: To explain this, we should not figure out the structure again on every elment, its very slow. And after we have it, this should become the one true way to initialize zero, and hopefully _fillobject will eventually be gone, rather than this way around.)

ngoldbaum · 2023-04-18T18:09:40Z

we should not figure out the structure again on every elment, its very slow

Good point! I know next to nothing about structured dtypes so got scared of trying to heavily refactor _fillobject. I agree that it’s doing a lot of work that could be amortized over the whole array.

seberg

Thanks, happy to go this route.

One more thing to note: Your object filling loop does not DECREF anything that may be inside the array before.

We can do this, it just means:

We need to make a note of the choice that the loop cannot be called on memory that holds references. (You cannot use this to zerofill a used/fully initialized array, you would have to dealloc/clear it first!)
In theory we could choose to avoid the initial memory zeroing due to references always. But if we want to keep that open we would need to ask the zero-fill function to guarantee an initialized state on failure (in other words: calling clear on the memory must be OK). Or find another safe scheme.
(Not sure it matters in practice, for large arrays, we will touch all memory twice here, but the memory churn could be improved by us via chunking, and most zeroed arrays are likely used more than twice anyway.)

As I said, I can help with the void loop in a follow-up, or we can work together on it here (it may not be super hard by looking at the clear loop. We might even be able to practically re-use it).

numpy/core/src/multiarray/ctors.c

seberg · 2023-04-19T08:09:02Z

numpy/core/src/multiarray/ctors.c

+        NPY_traverse_info fill_zero_info;
+        /* float errors do not matter and we do not release GIL */
+        NPY_ARRAYMETHOD_FLAGS zero_flags = PyArrayMethod_MINIMAL_FLAGS;
+        NPY_traverse_info_init(&fill_zero_info);


We need to pair this with the free/clear function, unfortunately (this will be important for structured dtypes).

One path may be to not support/check for fill_zero_info.func == NULL here, but only use get_fill_zero_loop == NULL to make the decision about calloc. That would move this whole logic down to after the calloc choice was made. (structured dtypes won't be able to signal calloc is good enough, but 🤷)

Alternatively, we just need to add the cleanup either in fail: of before all relevant goto fails.

Oh oops, I missed the existence of NPY_traverse_info_xfree looking at the traversal code, thank you very much for catching this.

I ended up refactoring so the cleanup goes in the fail block. This does lead to allocating and initializing the traverse info struct in the fast path, but it's a tiny struct so I don't think that's a problem. Let me know if you'd like further massaging on this.

numpy/core/src/multiarray/dtypemeta.c

… array

ngoldbaum · 2023-04-19T19:07:28Z

We need to make a note of the choice that the loop cannot be called on memory that holds references.

Added a comment in the loop itself and added a warning in the documentation for the API hook in dtypemeta.h.

I can help with the void loop in a follow-up

I'd prefer this, if that's alright. This is already 3 or 4 levels of yak shaving from where I started working last week :)

ngoldbaum · 2023-04-19T19:13:56Z

Also I should highlight 4403c01 which made some changes to the setup for the traversal info struct. For zeroing we don't need the descriptor at all (the array creation routine already has it), so it's OK to just leave it as NULL.

seberg · 2023-04-20T10:44:03Z

OK, thanks. Lets put this in and I will fix the void loop in a follow-up.

(If anyone prefers another API, e.g. to ask for a single value only. Happy to entertain that also.)

This reorganizes the generic traversal functions for structured dtypes a bit (before it wasn't quite generic). Then uses that for zerofilling. We could get the two a bit closer by also supporting `func==NULL` explicitly for clearing. (I have not includes this here.) The old `FillObjectArray` is still in use now, this is not ideal and the new approach should be duplicated to add an "emptyfill" (same semantics, normally just memset to 0, but for objects we place an explicit `None` object). This is a necessary follow-up to numpygh-23591.

github-actions bot added the 01 - Enhancement label Apr 14, 2023

seberg reviewed Apr 14, 2023

View reviewed changes

ngoldbaum mentioned this pull request Apr 14, 2023

Initial support for missing data numpy/numpy-user-dtypes#56

Merged

ngoldbaum force-pushed the refactor-zero branch from 22079f1 to 94d4fd7 Compare April 18, 2023 16:56

ngoldbaum added 2 commits April 18, 2023 11:03

MAINT: correct typos in dtype API documentation comments

8a428fd

ENH: Refactor special zero-filling to be managed by the DType

726676a

ngoldbaum force-pushed the refactor-zero branch from 94d4fd7 to 3ffef9d Compare April 18, 2023 17:04

MAINT: refactor zero-filling to use a traversal loop

f8babe2

ngoldbaum force-pushed the refactor-zero branch from 3ffef9d to f8babe2 Compare April 18, 2023 17:12

seberg reviewed Apr 19, 2023

View reviewed changes

ngoldbaum added 4 commits April 19, 2023 09:21

MNT: respond to review comments

e533144

MAINT: support traversal loops without initializing the descriptor

4403c01

MAINT: housecleaning for traversal loop setup and implementations

4fef2c3

DOC: warn against using zero-filling loop with a previously allocated…

6a7bf10

… array

seberg merged commit e20f110 into numpy:main Apr 20, 2023
59 checks passed

seberg mentioned this pull request Apr 20, 2023

MAINT: Add a proper implementation for structured zerofill #23620

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: refactor zero-filling and expose dtype API slot for it #23591

ENH: refactor zero-filling and expose dtype API slot for it #23591

ngoldbaum commented Apr 14, 2023

seberg left a comment

seberg Apr 14, 2023

ngoldbaum Apr 14, 2023

seberg Apr 14, 2023

ngoldbaum Apr 14, 2023

seberg Apr 14, 2023

mhvk Apr 15, 2023

seberg Apr 15, 2023

mhvk Apr 15, 2023

ngoldbaum commented Apr 18, 2023

seberg commented Apr 18, 2023 •

edited

ngoldbaum commented Apr 18, 2023

seberg left a comment

seberg Apr 19, 2023

ngoldbaum Apr 19, 2023

ngoldbaum Apr 19, 2023

ngoldbaum commented Apr 19, 2023

ngoldbaum commented Apr 19, 2023

seberg commented Apr 20, 2023

	/**
	* Query an ArrayMethod for the initial value for use in reduction.
	*
	* @param context The arraymethod context, mainly to access the descriptors.
	* @param reduction_is_empty Whether the reduction is empty. When it is, the
	* value returned may differ. In this case it is a "default" value that
	* may differ from the "identity" value normally used. For example:
	* - `0.0` is the default for `sum([])`. But `-0.0` is the correct
	* identity otherwise as it preserves the sign for `sum([-0.0])`.
	* - We use no identity for object, but return the default of `0` and `1`
	* for the empty `sum([], dtype=object)` and `prod([], dtype=object)`.
	* This allows `np.sum(np.array(["a", "b"], dtype=object))` to work.
	* - `-inf` or `INT_MIN` for `max` is an identity, but at least `INT_MIN`
	* not a good default when there are no items.
	* @param initial Pointer to initial data to be filled (if possible)
	*
	* @returns -1, 0, or 1 indicating error, no initial value, and initial being
	* successfully filled. Errors must not be given where 0 is correct, NumPy
	* may call this even when not strictly necessary.
	*/
	typedef int (get_reduction_initial_function)(
	PyArrayMethod_Context *context, npy_bool reduction_is_empty,
	char *initial);

ENH: refactor zero-filling and expose dtype API slot for it #23591

ENH: refactor zero-filling and expose dtype API slot for it #23591

Conversation

ngoldbaum commented Apr 14, 2023

seberg left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ngoldbaum commented Apr 18, 2023

seberg commented Apr 18, 2023 • edited

ngoldbaum commented Apr 18, 2023

seberg left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ngoldbaum commented Apr 19, 2023

ngoldbaum commented Apr 19, 2023

seberg commented Apr 20, 2023

seberg commented Apr 18, 2023 •

edited