
MAINT: Refactor reductions to use NEP 43 style dispatching/promotion #18905

Merged: 13 commits into numpy:main on Oct 19, 2021

Conversation

@seberg (Member) commented May 4, 2021

Refactor reductions to use NEP 43 style dispatching/promotion

This also implements new promoters for the logical functions.
The reasoning is that any logical function can always be handled
by first casting to boolean without changing the meaning.
This is true (unless the output is object), and already effectively
used for np.all/np.any.

TODOs:

  • Check the output of code coverage.
  • The caching mechanism probably fails to cache well for some reductions; try to fix that.
  • np.fmod.reduce([True, False]) used to fail, unlike the non-reduce version (just check briefly, probably OK).

OLD versions:

This is now a "work towards this" PR to show all the changes to refactor ufuncs (or most of them) to implement NEP 43 style dispatching. This includes reductions.

The cache still grows indefinitely and does not do reference counting. Both seem OK to me for now. ufunc.at is not implemented here (but that is not very important).

Original notes

I marked this as a work in progress; at this point it still has some rough edges, but it works fine. Good chunks won't really be tested much yet, for the simple reason that we only have the old dtypes available and those do not have any special isinstance() checks or similar.

The long list of things that are still a bit shaky (but maybe not all that bad):

  • The cache doesn't currently increment references. At this time that doesn't matter, but in the future, it may be that a DType used as part of a resolved and then cached signature is deleted again. So we need to do that (or add some "weak" mechanism).
  • As of now, reductions, accumulation, ufunc.at and reduceat do not use the new machinery. I will probably do that next. I will keep it this way to not blow up the diff even more.
  • I do not actually use any "new" promoters, yet. It would be nice to start using them when possible, so that we can warn about the corner cases that might change when we pull the plug on the legacy one. But, it isn't too important, since it is pretty easy to "flip the switch" on value based promotion also in the legacy version. I had considered trying to move them into the new mechanism, but that would mean passing around operands there. So right now I like the optimistic course of keeping it special.
  • As mentioned, it has yet to be tested when it comes to complex paths used in the future (i.e. promotion is untested, as are some loop resolution features). But the paths that can be used should be tested decently, I think. (I have added these tests to the result_type PR.)
  • The cache uses a custom hash-map that can currently only grow. That is probably OK, but I could modify the mechanism to grow only if e.g. many hash collisions occur, and limit the size otherwise.

In case someone will ask: The new code is typically faster, but if it has to fall back to legacy behaviour (mostly because there is value-based casting involved), it will be slower. The slowdown seems smaller than or similar to the speed-up that 1.20.x has over 1.19.x due to my buffer changes, so I am NOT worried about it. Let's get rid of the unnecessary value-based complexity instead.

@seberg seberg marked this pull request as draft May 4, 2021 19:23
@seberg seberg force-pushed the ufunc-refactor-2021 branch 3 times, most recently from 4160f06 to 27d54f3 Compare May 5, 2021 01:11
@seberg seberg force-pushed the ufunc-refactor-2021 branch 2 times, most recently from 7e221c6 to ca6cafe Compare May 9, 2021 01:22
@seberg seberg marked this pull request as ready for review May 11, 2021 00:40
@seberg seberg changed the title WIP: Refactor ufuncs around array-methods, promoters, and a new loop list/cache MAINT,API: Refactor ufuncs around array-methods, promoters, and a new loop list/cache May 11, 2021
@seberg seberg removed the 25 - WIP label May 13, 2021
@seberg seberg force-pushed the ufunc-refactor-2021 branch 4 times, most recently from fc0ca82 to d92c22a Compare May 21, 2021 01:31
@seberg (Member Author) commented May 21, 2021

(Note that I should "backport" some fixups for the dispatching code from my continuation on the reductions here)

@mhvk I have one question for you, unfortunately. Running the astropy tests on the branch: main...seberg:ufunc-refactor-2021-reductions (which includes this, but also fixed up reductions and probably fixes some smaller issues here)... I am running into problems with astropy and pyerfa (SciPy seems happy). (I guess pyerfa was always going to be one of the trickiest ones.)

I think the problem is that previously the type_resolver was in charge of checking whether the inputs can be cast safely. But I modified this: the type resolver should only be in charge of adapting flexible/parametric dtypes; it could indicate an unsafe operation (i.e. if the operation itself is a cast), but normally does not. (In most cases, the type resolver is very lightweight and ensures that the inner-loop can handle the dtypes.)

Instead, the ufunc machinery finds out the exact loop/operation dtypes using the type-resolver and then checks against casting. Now I think pyerfa often has only a single loop and will just force cast all inputs to that? That might even only be a mild nuisance (string input for unicode operation), but it also seems that you usually use a structured dtype for the loop. The main error is:

numpy.core._exceptions._UFuncInputCastingError: Cannot cast ufunc 'atoiq' input 0 from dtype('<U1') to dtype([('type', 'S12')], align=True) with casting rule 'same_kind'

A similar one is:

UFuncTypeError(<ufunc 'dtf2d'>, 'same_kind', dtype('S3'), dtype([('type', 'S12')], align=True), 0)

That one is in the dtf2d ufunc, but it is basically the same type of cast (it is embedded in a larger error).

For logical_or, etc. I opted to just add a flag to say: "always force casting inputs is OK here". But even then, I would not be sure how to add that flag for the pyerfa ufuncs.
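For reference, the failing check can be reproduced with `np.can_cast` (a sketch, not from the PR; the structured dtype here just mirrors the one from the error message):

```python
import numpy as np

# Structured dtype like the one pyerfa registers its loop with.
loop_dtype = np.dtype([("type", "S12")], align=True)

# The new machinery checks the input casts with the default "same_kind"
# rule, which rejects string -> structured-void:
print(np.can_cast(np.dtype("U1"), loop_dtype, casting="same_kind"))  # False

# pyerfa's own resolver effectively relied on unsafe casting:
print(np.can_cast(np.dtype("U1"), loop_dtype, casting="unsafe"))     # True
```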

@mhvk (Contributor) commented May 21, 2021

@seberg - I guess there are two questions here:

  1. Will it be the case that ufuncs compiled on an older numpy will still work with new numpy? If so, then I'm happy to adjust to a new and improved scheme, which will be tested but not used in a release until our minimum version of numpy goes to 1.21.
  2. With the new machinery, how should I go about defining the erfa ufuncs that currently need overrides?

Just to be sure: I use my own type resolver for the following reasons (see https://github.com/liberfa/pyerfa/blob/c67e478c231faa1b54b52c995fc96877b1271062/erfa/ufunc.c.templ#L540; note the copied-over code before it):

/*
 * UFUNC TYPE RESOLVER
 *
 * We provide our own type resolver, since the default one,
 * PyUFunc_DefaultTypeResolver from
 * numpy/core/src/umath/ufunc_type_resolution.c, has problems:
 * 1. It only looks for userloops if any of the operands have a user
 *    type, which does not work if the inputs are normal and no explicit
 *    output is given (see https://github.com/numpy/numpy/issues/11109).
 * 2. It only allows "safe" casting of inputs, which annoyingly prevents
 *    passing in a python int for int32 input.
 * The resolver below solves both, and speeds up the process by
 * explicitly assuming that a ufunc has only one function built in,
 * either a regular one or a userloop (for structured dtype).
 *
 * Combines code from linear_search_type_resolver and
 * linear_search_userloop_type_resolver from
 * numpy/core/src/umath/ufunc_type_resolution.c
 */

From what you write, the problem is related to the second item, though not so much the python int to int32, but rather casting of strings. Currently, it is essentially impossible to create a ufunc for string input, so instead I have structured dtypes that contain just a single field with a fixed-length string, and then I rely on unsafe casting to convert string input to that structured dtype (see
https://github.com/liberfa/pyerfa/blob/c67e478c231faa1b54b52c995fc96877b1271062/erfa/ufunc.c.templ#L339-L423) and in particular the following comment:

            /*
             * MHvK: we do our own check on casting, since by default
             * all items can cast to structured dtypes (see gh-11114),
             * which is not OK. So, we only allow VOID->same VOID,
             * and STRING -> VOID-of-STRING (which works well; we
             * recognize VOID-of-STRING by the dtype element size;
             * it would be rather costly to go look at dtype->fields).
             */

Maybe it is good to look at dtf2d specifically: the generated code from the template is:

    dtype_def = Py_BuildValue("[(s, s)]", "type", "S12");
    PyArray_DescrAlignConverter(dtype_def, &dt_type);
...
    ufunc = (PyUFuncObject *)PyUFunc_FromFuncAndDataAndSignature(
        NULL, NULL, NULL,
        0, 7, 3, PyUFunc_None,
        "dtf2d",
        "UFunc wrapper for eraDtf2d",
        0, NULL);
    if (ufunc == NULL) {
        goto fail;
    }
    dtypes[0] = dt_type;
    dtypes[1] = dt_int;
    dtypes[2] = dt_int;
    dtypes[3] = dt_int;
    dtypes[4] = dt_int;
    dtypes[5] = dt_int;
    dtypes[6] = dt_double;
    dtypes[7] = dt_double;
    dtypes[8] = dt_double;
    dtypes[9] = dt_int;
    status = PyUFunc_RegisterLoopForDescr(
        ufunc, dt_type,
        ufunc_loop_dtf2d, dtypes, NULL);
    if(status != 0){
        Py_DECREF(ufunc);
        goto fail;
    }
    ufunc->type_resolver = &ErfaUFuncTypeResolver;
    PyDict_SetItemString(d, "dtf2d", (PyObject *)ufunc);

With the new scheme, can one separately override the casting? Should that be done by the dtype, or by the ufunc? More generally, are string dtypes now possible as input for ufuncs?

@mhvk (Contributor) commented May 21, 2021

p.s. I'm sorry, but I've lost track a bit. Is there a description in some NEP, or some particular place in the code, that would help me understand how the new scheme should work?

@seberg (Member Author) commented May 22, 2021

Thanks for the points! I have to let it sink a bit and think a bit more. Please don't feel compelled to work through all of the below!

One thing I can offer is to try to port one or both of these ufuncs to the experimental-user-dtypes. That will force me to know more clearly what pyerfa needs and make it easier for you to try things out and understand the new system.

From what you write, the problem is related to the second item, though not so much the python int to int32, but rather casting of strings.

Hmmm, I am confused why python ints are a problem. But I guess they are, at least when there are only Python scalars involved? From a casting perspective those should be same-kind though, which is the default casting.

With the new scheme, can one separately override the casting? Should that be done by the dtype, or by the ufunc? More generally, are string dtypes now possible as input for ufuncs?

  • There are no limitations for string DTypes or voids (but see end for caveats).
  • The casting: I have not anticipated overriding casting safety on the ufunc (yet), except for a single force-cast-inputs flag, which is probably too blanket (I need this to make logical_and.reduce work seamlessly).
  • Should that be done by the dtype, or by the ufunc? DTypes can define how they cast, but right now ufuncs use (mostly) unmodified casting.

About what to do here:

It seems the loops that have problems with casting are always defined with structured voids. For those loops we always fall back to the old type resolver (it is impossible to skip it, as the fields might be modified/checked). So I can probably get away with always skipping the casting check. (If it's enough to skip the input casts, that might be nice, but it doesn't really matter.)

If that is not good enough for some reason, expanding the old API slightly to allow you to set the "force input cast" flags should probably work. This should give you time to move to the new API without worrying about it.

About the new system:

It's part of NEP 43. But it's maybe a bit long, and I am slightly changing it. (Some things just become clearer closer to the finish line...)

I will try to give an example below of how it might look. But I am not sure if the casting, promotion, and type resolution are clear enough from this:

def inner_loop(context, args, dims, strides, auxdata):
    dtypes = context.descriptors  # if you want to use string lengths.
    # do work.

def resolve_dtypes(self, DTypes, descriptors):
    """DTypes are the classes for all operands; descriptors are the ones that were passed in."""
    # The loop always works with a fixed-size structured dtype (always the same, currently); could be adapted!
    return "equivalent_casting", ("S12", "S12", result_dtype)  # (self is still opaque, so it doesn't quite work)

# Something like this, on the C-level:
atoiq.register_loop(
        dtypes=(np.str_, np.str_, np.void),
        strided_loop=inner_loop,
        resolve_dtypes=resolve_dtypes,
        force_cast_inputs=True,  # do we need this, for pyerfa?! Is it too strong?
        )

def promote(ufunc, DTypes, signature):
    """ Assume this ufunc has only a single useful loop, so always "promote" to that. """
    desired = (np.str_, np.str_, np.void)
    # Return the "desired" DTypes, unless the user passed `dtype=` or `signature=` (can't override user wish)
    return tuple(d if s is None else s for d, s in zip(desired, signature))

atoiq.register_promoter(
    dtypes=(np.dtype, np.dtype, np.dtype),  # always use this one as a last try (most generic)
    promoter=promote)
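The signature-merging logic of the sketched promoter can be exercised on its own (pure Python, with hypothetical string stand-ins for the DType classes):

```python
def promote(desired, signature):
    # Return the "desired" DTypes, unless the user passed `dtype=` or
    # `signature=` (a user-supplied entry must not be overridden).
    return tuple(d if s is None else s for d, s in zip(desired, signature))

desired = ("str_", "str_", "void")
# No user signature: fall through to the desired loop.
print(promote(desired, (None, None, None)))      # ('str_', 'str_', 'void')
# User fixed the middle operand; their choice wins.
print(promote(desired, (None, "bytes_", None)))  # ('str_', 'bytes_', 'void')
```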

There are some limitations, mainly with string and void, though:

  • There can only be a single "void" loop!
  • Void inputs are weird :(. np.array("asdf").astype(np.void).dtype == "V16". So if the user passes in strings, NumPy can't forward them as a structured DType to your loop. So you must add an SS->V loop (as above). If you want mixed loops, you may need to duplicate that and also add it as SV->V and VS->V. All of those loops could share all of their code, but must be added explicitly currently!
  • You must add a promote function to help NumPy pick the right loop; there is no more "pick the first safe loop" mechanism. The new default would probably be np.result_type(...) on DTypes. Note that the promote function could register/add a new loop to the ufunc. (NEP 43 currently suggests it will return a new loop; I like that, but I think it should be a later addition when the need arises. It has some complexities about object lifetime.)
  • The casting-safety is inconvenient for unicode input that should be force-cast to string :(. I wonder if we should consider U to S casts as "same_kind" to effectively fix it, but it may be a too-large change. (Since an error is raised if the string is non-ascii, that seems not that bad to me.) A blanket "use forcecasting" flag may be a bit too strong here (but it's not terrible; if the promotion is designed with care, the user could only run into it by using signature=..., casting="safe"). I am pretty sure we could add API to deal with this in the future, but I am not sure what it should look like.
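The "void inputs are weird" point from the list above, spelled out:

```python
import numpy as np

# A unicode scalar: '<U4' stores 4 bytes per character, 16 bytes total.
a = np.array("asdf")
print(a.dtype)                  # <U4

# Casting to np.void keeps only the raw bytes; no field structure survives,
# so NumPy cannot forward this as a structured DType to a loop:
print(a.astype(np.void).dtype)  # |V16
```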

Again, since I am not quite sure how well this works for you, I may have to try it for you if that helps. For now, I expect I can make the "old" mechanism work.

@seberg (Member Author) commented May 22, 2021

To confirm: Skipping the casting check for loops that have flexible/parametric legacy dtypes (structured ones) makes the astropy test suite pass.

I guess the main issue in the future is probably the unicode -> string cast. The use of structured dtypes would worry me slightly, but since you only use a single loop as far as I can tell, it seems unproblematic.

@mhvk (Contributor) commented May 22, 2021

@seberg - thanks for the examples. That new process looks really good, I'm quite excited to try to use it!

I also checked your branch with pyerfa itself (which tests all the erfa wrappers, so is more complete than what indirectly gets tested in astropy), and confirm that all the failures are from casts from string to void-with-one-string-entry. Now the only reason I had those was of course because I could not figure out how to support strings at all in the old framework, and with unsafe casting this happened to work...
https://github.com/liberfa/pyerfa/blob/c67e478c231faa1b54b52c995fc96877b1271062/erfa/ufunc.c.templ#L378-L386:

            else if (dtypes[i]->elsize == 1 || dtypes[i]->elsize == 12) {
                /* string structured array; string argument is OK */
                if (!((op_descr_type_num == NPY_STRING &&
                       op_descr->elsize <= dtype_elsize) ||
                      (op_descr_type_num == NPY_UNICODE &&
                       op_descr->elsize >> 2 <= dtype_elsize))) {
                    return 0;
                }
            }

It is obviously totally fine if this no longer works, as long as I have a work-around!

@mhvk (Contributor) commented May 22, 2021

@seberg - on getting review for this PR: for me at least it would be really helpful to update the documentation as things are implemented and have some concrete examples of ufuncs that use the new type resolution. Or is this not yet possible? E.g., a new _umath_tests.c - as far as I'm concerned with examples that make no use of "smart" macros, but things you would put in a tutorial as an example. Could be something like a string parser to show how to use strings! (Am again having to use void to hold strings in astropy/astropy#11768...). It may also be a good place to ensure the new type resolution code gets more coverage...

I'm not sure what the plan for review is, but if at all possible, do split this PR into pieces! Even with your nice & liberal comments, it is very difficult to get an overall picture. Some suggestions for making this PR more manageable, as I went through trying to get a sense:

  • Split off any formatting, small moves of code, and things like PyArray_TupleFromItems to separate, small clean-up PRs.
  • Move the legacy stuff to legacy_array_methods in a separate PR without any functional change.
  • Can the masked inner strided loops be split off to a separate PR as well? It seems unrelated.
  • Maybe the hash implementation could be split off and tested separately also? (it seems rather sad that it re-implements python's tuple hash; no way to re-use that?!)
  • Postpone the python type resolver for a follow-up PR (which includes tests)

Should add that I think this really is quite exciting! I think (g)ufunc are such a nice idea and it will be wonderful to have them be easier/more logical.

@mhvk (Contributor) commented May 23, 2021

@seberg - a note with probably insufficient thought: if you are using the tuple hash, might it be possible to avoid duplicating the hashing by actually creating tuples and calculating their hashes, and then using the well-optimized standard dict lookup for finding them?
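A rough sketch of that idea in pure Python (hypothetical names; the real cache is a C hash map keyed on DType classes):

```python
class DispatchCache:
    """Cache resolved loops, keyed on a tuple of DType classes, so that
    Python's own tuple hashing and dict lookup do all the work."""

    def __init__(self):
        self._cache = {}

    def lookup(self, dtypes, resolve):
        try:
            return self._cache[dtypes]
        except KeyError:
            loop = self._cache[dtypes] = resolve(dtypes)  # slow path
            return loop

cache = DispatchCache()
# Stand-in "DType classes" and resolver, purely for illustration:
loop = cache.lookup((int, float), resolve=lambda dts: f"loop for {dts}")
print(loop)  # loop for (<class 'int'>, <class 'float'>)
```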

@seberg (Member Author) commented May 23, 2021

About the many small changes: there is probably quite a lot that crept in that should not have, or that could be split out pretty well :/.

Masked inner strided-loops: it seemed weird to split off, but it may be possible (that would effectively mean that any masked ufunc must keep using the old machinery).
Right now the masked changes are: 1. the masked strided-loop is moved (not much modified), 2. the masked resolution is deleted, 3. the masked code is largely unified with the non-masked loop (this is pretty massive, because it also makes the ufunc_object.c diff more confusing).

The hashing, dispatching/promotion, and legacy method things can maybe be partially broken out. It probably means using them solely in new unit-tests, but that is fine with me.
(I am considering if the promotion could be skipped to use the "legacy arraymethod" in the ufuncs, but dispatch only between the two versions of that.)

The hashing itself was based on a try from one year ago. IIRC it was around 5-10% overhead difference, which isn't super much (mainly skipping tuple creation and the "tuple hash" inlines the identity hash, after all). But I am really not sure what (if any) speed advantage exists in this code.
One thing I like about a custom version is that it seems easier to modify, e.g. to work with a maximum cache size (say, do not cache more than 32 loops), or based on cache-miss statistics. I.e. a dict isn't really a good "cache"...
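For illustration, a size-capped variant along those lines (a pure-Python sketch with hypothetical names; the actual implementation is a custom C hash map):

```python
from collections import OrderedDict

class BoundedCache:
    """Loop cache capped at max_size entries, evicting oldest-first."""

    def __init__(self, max_size=32):
        self.max_size = max_size
        self._cache = OrderedDict()

    def get(self, key, compute):
        if key in self._cache:
            return self._cache[key]
        value = compute(key)
        if len(self._cache) >= self.max_size:
            self._cache.popitem(last=False)  # evict the oldest entry
        self._cache[key] = value
        return value

cache = BoundedCache(max_size=2)
for key in ["a", "b", "c"]:
    cache.get(key, compute=str.upper)
print(list(cache._cache))  # ['b', 'c'] -- 'a' was evicted
```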

@mhvk (Contributor) commented May 24, 2021

@seberg - thanks for considering the splitting up. If hashing did not have a large impact, postponing that would be an easy gain for simpler review here! Otherwise, just splitting off the move of things to legacy_array_methods seems a prime candidate.

@seberg seberg marked this pull request as draft June 15, 2021 19:30
@seberg seberg force-pushed the ufunc-refactor-2021 branch 2 times, most recently from da7e9e4 to f3a8595 Compare June 15, 2021 19:34
@seberg seberg changed the title MAINT,API: Refactor ufuncs around array-methods, promoters, and a new loop list/cache MAINT: Refactor reductions to use NEP 43 style dispatching/promotion Aug 9, 2021
@seberg (Member Author) commented Aug 11, 2021

This should be pretty much ready for review, now only with the changes for reductions left. That said, it is bigger than I thought, so I am considering splitting out things again (i.e. can do the logical promoters first, even if I only really need them for reductions).

@mhvk mhvk left a comment

This looks very nice indeed - even just the few added tests show the big improvement!

My comments are all tiny, and may have suffered from really being a little too tired for a proper review.... But I like how it is becoming even clearer that the DType change was a really good idea!

Review comments were left on:

  • numpy/core/src/umath/dispatching.c
  • numpy/core/src/umath/reduction.h
  • numpy/core/src/umath/ufunc_object.c (several threads)
  • numpy/core/src/umath/ufunc_type_resolution.c
@seberg (Member Author) commented Sep 7, 2021

This should be ready for a round of review. I think the main problem is probably now for me to see if I have a serious flaw in the caching mechanism.

This also implements new promoters for the logical functions. The reasoning is that any logical function can always be handled by first casting to boolean without changing the meaning. This is true (unless the output is object), and already effectively used for `np.all`/`np.any`.

Mainly to see if this is indeed the problem, but the assert is not necessary, and it might be we end up with the wrong dtype when the input is a compatible (but different) integer already? This did not actually trigger any problems, probably because the reduce fallback mode later just catches it anyway...

On the upside, this probably fixes most of the caching issues ;) At least the reduceat path seems to have been untested before.

The slight change in code layout (added assert) is just to make the code slightly easier to read, since otherwise it looks like there is an additional `else` branch when `out` is given.

More fixups are coming; the biggest change is that the error message is now improved when a reduction makes no sense, such as `np.subtract.reduce(np.array([1, 2, 3], dtype="M8[s]"))`, where input and output cannot have the same descriptor.

(Some more fixups still to go)

In particular, this covers a casting error that currently cannot be hit for normal ufuncs, because they already check casting during the legacy dtype resolution (which is called first).
@seberg seberg marked this pull request as ready for review October 12, 2021 20:35
@seberg (Member Author) commented Oct 13, 2021

I would love to move this ahead. While I think the caching is not ideal especially for some reduction cases, the things that I tried (logical ufuncs) slowed down so little that it just does not matter. Partially probably because reductions just have huge overhead, but basically I don't think it is a problem right now even if it is not ideal and could be refined...

@mattip mattip left a comment

Regarding the code in resolve_implementation_info: I wonder if there is a way we could write specific unit tests for it, and document what it is meant to do in pseudo-code (or a pure Python implementation?).

* private for now (also it is very limited anyway).
* There is one "exception". NA aware dtypes cannot cast to bool
* (hopefully), so the `??->?` loop should error even with this flag.
* But a second NA fallback loop will be necessary.
Member:
Could you add to the comment "the ??->? loop should error even with this flag" which test hits this code path?

@@ -267,8 +267,39 @@ resolve_implementation_info(PyUFuncObject *ufunc,
* the subclass should be considered a better match
* (subclasses are always more specific).
*/
/* Whether this (normally output) dtype was specified at all */
Member:
Is there a case where op_dtypes[i] is NULL for input? Maybe assert i >= nin?

Member Author:
Yeah, I am using it currently for reductions. But that is the only possibility, so I could assert that fairly constrained case (i >= nin || (i == 1 && nin == 2 && nout == 1)).

strcmp(ufunc->name, "logical_or") == 0 ||
strcmp(ufunc->name, "logical_and") == 0 ||
strcmp(ufunc->name, "logical_xor") == 0)) {
/*
Member:
This seems fragile, but it will be refactored and removed at some point, correct?

Member Author:
Yes, although I admit "at some point" is possibly not very soon. Although, maybe these could be some of the first functions to be moved to the new API.
(The prime use-case would be moving np.equal to be full featured enough that == needs no – or almost no – special cases).

@mattip (Member) commented Oct 17, 2021

I think this is a nice cleanup. Besides resolve_implementation_info, the rest seems straightforward.

@mattip mattip merged commit 405c6ee into numpy:main Oct 19, 2021
@mattip (Member) commented Oct 19, 2021

Thanks @seberg

@seberg seberg deleted the ufunc-refactor-2021 branch October 19, 2021 15:17