Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Reorganize string promotion, add object_fallback=False #19101

Closed
wants to merge 1 commit into from

Conversation

seberg
Copy link
Member

@seberg seberg commented May 26, 2021

(semi private)

This reorganizes promotion during array-coercion to allow flagging for
an object fallback mode. Here only "exposed" through:

np.core._multiarray_umath._discover_array_parameters(
        [1, "2"], object_fallback=True)

Note that as of now, it is a bit stricter then previously. The object fallback
actually hides fewer errors, not more! (e.g. say promotion fails for two
different structured voids, or datetimes).

But that would be trivial to allow for now...

We could probably thread this flag through by (ab)using the array-flags
that PyArray_FromAny allows passing.


@jorisvandenbossche I am thinking about something like this to fix the issue. But its fairly annoying, at least unless we aim to make this a full new keyword argument to np.array as well (or similar)...

@numpy/numpy We had discussed a few times to add something like np.array(..., dtype="allow_object"). This is practically that, and I can make it work (the code organization isn't pretty, but most of it is due to the FutureWarning).

…i private)

This reorganizes promotion during array-coercion to allow flagging for
an object fallback mode.  Here only "exposed" through:

    np.core._multiarray_umath._discover_array_parameters(
            [1, "2"], object_fallback=True)

Note that as of now, it is a bit stricter then previously.  The object fallback
actually hides fewer errors, not more!  (e.g. say promotion fails for two
different structured voids, or datetimes).

But that would be trivial to allow for now...

We could probably thread this flag through by (ab)using the array-flags
that `PyArray_FromAny` allows passing.
@seberg seberg changed the title ENH: Reorganize string promotion and add object_fallback=False (sem… ENH: Reorganize string promotion and add object_fallback=False May 26, 2021
@seberg seberg marked this pull request as draft May 26, 2021 01:58
Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! A small comment is that it might make sense to use npy_bool object_fallback everywhere instead of switching between int and bool - calls with npy_false immediately make clear that it is a bool being pass around.

p.s. Not sure whether the coverage misses are false positives, but obviously would be good to ensure the added code is covered!

@@ -35,13 +72,19 @@
*
* TODO: Before exposure, we should review the return value (e.g. no error
* when no common DType is found).
* Further, the `object_fallback` is probably only useful for the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are fairly sure that in some places this will no longer be needed after the string deprecation, maybe be more explicit about it, otherwise in all likelihood this will stay around for much longer than needed!

* when no common DType is found. This is currently only needed for the
* string and number promotion deprecation!
*
* TODO: After this deprecation is over, the `object_fallback` may well be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be best if this could just change from "may well be useless!" to "should be removed"

@seberg
Copy link
Member Author

seberg commented May 26, 2021

On this one, Matti suggested in the meeting (if I got it right), that we could make the parameter-discovery function public but not the flag: I.e. the function would always use the object-fallback "future" behaviour.
I am unsure how much that helps pandas right now. (We also have to duplicate some work. Even if we assign to the empty, allocated array – but maybe that doesn't matter too much.)

Or we allow this flag in np.array() – maybe even only semi-public for transition!? Matti doesn't like this, I can't say I like it much, but:

  • It was previously brought up (I could never find the issue with the discussion)
  • I may dislike messing with warning states even more. (Unless we make it permantent, it is not a solution for end-users, but rather meant only as a workaround for pandas. End-users are expected to know what dtype they want, and in general we would be optimistic that few pandas users actually need this – or we also need to delay the change itself in any case.)

The alternative, or parallel(?) question is what to do about the deprecation:

We could delay the deprecation: I am not too fond of it, since string promotion is bound to create annoying corner cases once we get string ufuncs, but 🤷. Unless we are planning to just skip the deprecation now or in the next release.

Maybe as an example of where this oddity can strike:

np.array(["1", "2"]).searchsorted(2)  # same as `.searchsorted("2")`

(Which currently would gives a FutureWarning. Oddly, it seems it would actually work in the "future" path)

@mattip
Copy link
Member

mattip commented May 27, 2021

We have a nep about ragged arrays, and in one of the issues or PRs around this something like np.array(..., dtype=maybe_object was suggested. I don't like that, especially for library code that may end up creating object arrays where the user makes a naive mistake in combining input types.

I prefer that

  • we officially expose np.core._multiarray_umath._discover_array_parameters as a user-facing API
  • it will return a DataClass with shape and dtype and not warn nor raise if it returns object dtype. It is up to the user to decide whether object is ok for them:
if getattr(np, 'discover_array_parameters', None):
    params = np.discover_array_parameters(...)
    if params.dtype == object:
        # perhaps warn, raise, or ignore
    ret = np.array(..., dtype = params.dtype)
else:
    ret = np.array(...)
    if ret.dtype == object:
        # perhaps warn, raise, or ignore
  • maybe controversial: document that we do not commit to consistent behaviour in corner cases where value-based casting promotion may change in the future, and specifically give the cases that may change.

Would discover_array_parameters help pandas enough to move forward?

@seberg
Copy link
Member Author

seberg commented May 27, 2021

Thanks Matti. I am happy with that proposal. The question is really if that will help pandas enough. And I am still wondering if Pandas can just accept passing this warning on to the end-user in a few places.

maybe controversial: document that we do not commit to consistent behaviour in corner cases where value-based casting promotion may change in the future, and specifically give the cases that may change.

Luckily, array coercion ignores value based promotion: np.array([np.float32(3.), 4.]) and all variations thereof, will always consider the 4. to have the default float64 dtype. The only weird part is the integer "ladder" of long → long long → unsigned long long → object.

@mhvk
Copy link
Contributor

mhvk commented May 27, 2021

The idea of separate routine that inspects input and returns a DataClass seems nice (especialy if it can avoid just creating an array anyway). Also agree that it should always return something, but it might be good to have some options. E.g., it would be great it if the user could tell it to treat only lists as containers for multiple elements, and tuples strictly as indicating a structured dtype.

But I'm a bit confused about the relation to both this PR and NEP 34, i.e., does anything influence string promotion, and will dtype=object remain deprecated for ragged arrays?

@seberg
Copy link
Member Author

seberg commented May 27, 2021

Maybe a bit about the current code layout:

  1. We have on function that discovers array parameters and prepares filling the array. This is effectively np.discover_array_parameters here.
    • The extra work it caches internally for step 2 is effectively calling list() and caching the result of things like obj.__array__(). np.discover_array_parameters creates that cache currently, but then immediately deletes it.
  2. We have a second function that actually allocates the result and fills the array.

Currently, even if you do result_arr[...] = obj, we go through both steps (but the first should be a bit faster). So we can avoid creating the array, but (currently) may do some extra work anyway.

E.g., it would be great it if the user could tell it to treat only lists as containers for multiple elements

The first discovery function already has a bunch of flags and we could expose those/add more. Adding one to check strictly for lists should be pretty straight forward.

tuples strictly as indicating a structured dtype

We have an internal flag that says "a tuple is considered element". We can't infer the correct structured dtype really, so if you want a structured dtype, you need to pass dtype=structured_void. We can still expose it, setting it would mainly disallow using tuples with most (all?) other dtype, so it would be a weaker "list-only".

But I'm a bit confused about the relation to both this PR and NEP 34, i.e., does anything influence string promotion, and will dtype=object remain deprecated for ragged arrays?

Currently you must write np.array([1, [2]], dtype=object) for the ragged array case. The overlap is only that dtype=allow_object could also enable the legacy behaviour: np.array([1, [2]], dtype=allow_object) works, but np.array([1, 2[, dtype=allow_object) is not object array.

But allow_object could also be limited to undefined/invalid promotions specifically, making it a bit less random maybe.

@mhvk
Copy link
Contributor

mhvk commented May 27, 2021

Thanks! Now understanding the path currently taken more, I like @mattip's suggestion even more of exposing the dtype/shape inference function. And that also means we might as well postpone discussion of any allow_object flag - I can see that it would have very little use if one can do the introspection. (And on structured type inference, I think that may still be possible, but would then be a new part in the inference function, so is also a separate issue.)

@seberg
Copy link
Member Author

seberg commented May 27, 2021

(This is off topic, about how tuple inference would be expected to work now)

I think that may still be possible, but would then be a new part in the inference function

You can already do as much of this as I am comfortable with, probably. But its not quite public API yet. The current design idea is that you create a new (possibly abstract!) StructuredDType. This could see the tuple and infer the descriptor:

element_descriptor = StructuredDType.discover_descriptor_from_pyobject(tuple)

The user must then pass dtype=StructuredDType. But after that your DType can define what to do. (This is currently unsupported in np.array() but supported in np.discover_array_parameters()). The main problem with our current void, is that it can mean too many things at once.

@mhvk
Copy link
Contributor

mhvk commented May 27, 2021

That off-topic piece would be very useful!

@charris charris added the 09 - Backport-Candidate PRs tagged should be backported label May 28, 2021
@charris charris changed the title ENH: Reorganize string promotion and add object_fallback=False ENH: Reorganize string promotion, add object_fallback=False Jun 4, 2021
@charris
Copy link
Member

charris commented Jun 4, 2021

@seberg Is this still a work in progress?

@seberg
Copy link
Member Author

seberg commented Jun 4, 2021

No, unfortunately not. It relies on the string promotion behaviour, so will collide with that delay/reversal (unless we make that a backport-only).

I could expose the function first and not worry about string promotion for now (since its a new function and string promotion warning/error should follow soon anyway).

@seberg seberg removed the 09 - Backport-Candidate PRs tagged should be backported label Jun 4, 2021
@seberg
Copy link
Member Author

seberg commented Jun 4, 2021

@charris ah, sorry, removed the backport candidate tag, if we undo string promotion, there is no real point.

@charris
Copy link
Member

charris commented Jun 4, 2021

I will build 1.21.0rc2 next Sunday. If we are going to undo anything it would be good to get it in.

@seberg
Copy link
Member Author

seberg commented Dec 20, 2022

Going to close this for now. I am thinking of picking this up as two separate PRs (mainly, I would like to do something about our broken promotion...):

  1. Exposing the discovery function is probably useful anyway and, in a sense, overdue. It could try to distinguish object due to a weird object or due to a promotion error (although, not sure how to do it best. Returning None is an option but forces the user to check if the size is empty, which may also be None).

  2. For string and datetime promotion deprecation/futurewarning things are complicated. My current thought is that a with np._future_promotion(): might be helpful. Basically, a better way to raise the FutureWarning and get an object array out (since that is what we do for promotion errors currently).
    The param discovery function could possibly always live in that future, but not sure it matters. Entering a context isn't super quick, but quicker and more precise than setting up warning filters.

In either case, there may be some good stuff here, but I doubt we will put this in very similar to how it is.

@seberg seberg closed this Dec 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants