ENH: Reorganize string promotion, add `object_fallback=False` #19101

seberg · 2021-05-26T01:58:09Z

(semi private)

This reorganizes promotion during array-coercion to allow flagging for
an object fallback mode. Here only "exposed" through:

np.core._multiarray_umath._discover_array_parameters(
        [1, "2"], object_fallback=True)

Note that as of now, it is a bit stricter than previously. The object fallback
actually hides fewer errors, not more! (e.g. say promotion fails for two
different structured voids, or datetimes).

But that would be trivial to allow for now...

We could probably thread this flag through by (ab)using the array-flags
that PyArray_FromAny allows passing.

@jorisvandenbossche I am thinking about something like this to fix the issue. But it's fairly annoying, at least unless we aim to make this a full new keyword argument to np.array as well (or similar)...

@numpy/numpy We have discussed a few times adding something like np.array(..., dtype="allow_object"). This is practically that, and I can make it work (the code organization isn't pretty, but most of that is due to the FutureWarning).


mhvk

Looks good! A small comment is that it might make sense to use npy_bool object_fallback everywhere instead of switching between int and bool - calls with npy_false immediately make clear that it is a bool being passed around.

p.s. Not sure whether the coverage misses are false positives, but obviously would be good to ensure the added code is covered!

mhvk · 2021-05-26T13:42:05Z

numpy/core/src/multiarray/common_dtype.c

@@ -35,13 +72,19 @@
 *
 * TODO: Before exposure, we should review the return value (e.g. no error
 *       when no common DType is found).
+ *       Further, the `object_fallback` is probably only useful for the


If you are fairly sure that in some places this will no longer be needed after the string deprecation, maybe be more explicit about it, otherwise in all likelihood this will stay around for much longer than needed!

mhvk · 2021-05-26T13:42:55Z

numpy/core/src/multiarray/convert_datatype.c

+ * when no common DType is found.  This is currently only needed for the
+ * string and number promotion deprecation!
+ *
+ * TODO: After this deprecation is over, the `object_fallback` may well be


Would be best if this could just change from "may well be useless!" to "should be removed"

seberg · 2021-05-26T23:12:07Z

On this one, Matti suggested in the meeting (if I got it right), that we could make the parameter-discovery function public but not the flag: I.e. the function would always use the object-fallback "future" behaviour.
I am unsure how much that helps pandas right now. (We also have to duplicate some work. Even if we assign to the empty, allocated array – but maybe that doesn't matter too much.)

Or we allow this flag in np.array() – maybe even only semi-public for transition!? Matti doesn't like this, I can't say I like it much, but:

- It was previously brought up (I could never find the issue with the discussion).
- I may dislike messing with warning states even more. (Unless we make it permanent, it is not a solution for end-users, but rather meant only as a workaround for pandas. End-users are expected to know what dtype they want, and in general we would be optimistic that few pandas users actually need this – or we also need to delay the change itself in any case.)

The alternative, or parallel(?) question is what to do about the deprecation:

We could delay the deprecation: I am not too fond of it, since string promotion is bound to create annoying corner cases once we get string ufuncs, but 🤷. Unless we are planning to just skip the deprecation now or in the next release.

Maybe as an example of where this oddity can strike:

np.array(["1", "2"]).searchsorted(2)  # same as `.searchsorted("2")`

(This currently gives a FutureWarning. Oddly, it seems it would actually work in the "future" path.)

mattip · 2021-05-27T06:20:21Z

We have a NEP about ragged arrays, and in one of the issues or PRs around this something like np.array(..., dtype=maybe_object) was suggested. I don't like that, especially for library code that may end up creating object arrays where the user makes a naive mistake in combining input types.

I prefer that

- we officially expose np.core._multiarray_umath._discover_array_parameters as a user-facing API
- it will return a DataClass with shape and dtype and not warn nor raise if it returns object dtype. It is up to the user to decide whether object is ok for them:

import numpy as np

# np.discover_array_parameters is the proposed public name; it does not exist yet.
if getattr(np, 'discover_array_parameters', None):
    params = np.discover_array_parameters(...)
    if params.dtype == object:
        # perhaps warn, raise, or ignore
        pass
    ret = np.array(..., dtype=params.dtype)
else:
    ret = np.array(...)
    if ret.dtype == object:
        # perhaps warn, raise, or ignore
        pass

- maybe controversial: document that we do not commit to consistent behaviour in corner cases where value-based ~~casting~~ promotion may change in the future, and specifically give the cases that may change.

Would discover_array_parameters help pandas enough to move forward?
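
To make the proposal above concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `np.discover_array_parameters` does not exist in released NumPy, so the hypothetical helper `discover_params` emulates it by actually coercing the input (which the real function is meant to avoid), and `ArrayParams` is an invented name for the returned DataClass.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class ArrayParams:
    """Hypothetical result type: just the discovered shape and dtype."""
    shape: tuple
    dtype: np.dtype


def discover_params(obj):
    """Emulate discovery by coercing `obj`, falling back to object if ragged."""
    try:
        arr = np.array(obj)
    except ValueError:
        # Newer NumPy refuses ragged input unless dtype=object is explicit.
        arr = np.array(obj, dtype=object)
    return ArrayParams(shape=arr.shape, dtype=arr.dtype)


regular = discover_params([[1, 2], [3, 4]])
ragged = discover_params([1, [2]])
# The caller, not NumPy, decides whether an object result is acceptable.
needs_attention = ragged.dtype == object
```

The point of the design is visible in the last line: the helper reports an object dtype without warning or raising, and the calling library chooses what to do with that information.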

seberg · 2021-05-27T16:44:19Z

Thanks Matti. I am happy with that proposal. The question is really if that will help pandas enough. And I am still wondering if Pandas can just accept passing this warning on to the end-user in a few places.

> maybe controversial: document that we do not commit to consistent behaviour in corner cases where value-based ~~casting~~ promotion may change in the future, and specifically give the cases that may change.

Luckily, array coercion ignores value based promotion: np.array([np.float32(3.), 4.]) and all variations thereof, will always consider the 4. to have the default float64 dtype. The only weird part is the integer "ladder" of long → long long → unsigned long long → object.
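
A short illustration of both points, using only public NumPy calls. The variable names are arbitrary, and the platform-dependent lower rungs of the integer ladder are deliberately left out.

```python
import numpy as np

# A Python float is always treated as float64 during array coercion,
# whatever its value, so the float32 scalar is promoted to float64.
mixed = np.array([np.float32(3.0), 4.0])

# The top of the integer "ladder": a value too large for int64 becomes
# uint64, and a value too large for uint64 becomes an object array.
fits_uint64 = np.array([2**63])
too_large = np.array([2**64])
```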

mhvk · 2021-05-27T16:49:47Z

The idea of a separate routine that inspects input and returns a DataClass seems nice (especially if it can avoid just creating an array anyway). Also agree that it should always return something, but it might be good to have some options. E.g., it would be great if the user could tell it to treat only lists as containers for multiple elements, and tuples strictly as indicating a structured dtype.

But I'm a bit confused about the relation to both this PR and NEP 34, i.e., does anything influence string promotion, and will dtype=object remain deprecated for ragged arrays?

seberg · 2021-05-27T17:24:06Z

Maybe a bit about the current code layout:

1. We have one function that discovers array parameters and prepares filling the array. This is effectively np.discover_array_parameters here.
   - The extra work it caches internally for step 2 is effectively calling list() and caching the result of things like obj.__array__(). np.discover_array_parameters creates that cache currently, but then immediately deletes it.
2. We have a second function that actually allocates the result and fills the array.

Currently, even if you do result_arr[...] = obj, we go through both steps (but the first should be a bit faster). So we can avoid creating the array, but (currently) may do some extra work anyway.
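
The two-step layout described above can be pictured with a pure-Python toy. This is not NumPy's C implementation; `discover_shape`, `fill`, and the dictionary cache are invented names, and the toy only handles lists and tuples.

```python
def discover_shape(obj, cache):
    """Step 1: work out the shape, caching each list() conversion."""
    if not isinstance(obj, (list, tuple)):
        return ()  # anything else is treated as a scalar element
    items = list(obj)
    cache[id(obj)] = items  # reused by step 2 instead of iterating again
    sub_shapes = {discover_shape(item, cache) for item in items}
    if len(sub_shapes) > 1:
        raise ValueError("ragged input is not handled in this toy")
    return (len(items),) + (sub_shapes.pop() if sub_shapes else ())


def fill(obj, cache, out):
    """Step 2: copy the elements into `out`, reading sequences from the cache."""
    if id(obj) in cache:
        for item in cache[id(obj)]:
            fill(item, cache, out)
    else:
        out.append(obj)


data = [[1, 2], [3, 4]]
cache = {}
shape = discover_shape(data, cache)
flat = []
fill(data, cache, flat)
```

A discovery-only entry point corresponds to calling `discover_shape` and then throwing `cache` away, which is the extra work mentioned above.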

> E.g., it would be great if the user could tell it to treat only lists as containers for multiple elements

The first discovery function already has a bunch of flags and we could expose those/add more. Adding one to check strictly for lists should be pretty straightforward.

> tuples strictly as indicating a structured dtype

We have an internal flag that says "a tuple is considered an element". We can't really infer the correct structured dtype, so if you want a structured dtype, you need to pass dtype=structured_void. We can still expose it; setting it would mainly disallow using tuples with most (all?) other dtypes, so it would be a weaker "list-only".

> But I'm a bit confused about the relation to both this PR and NEP 34, i.e., does anything influence string promotion, and will dtype=object remain deprecated for ragged arrays?

Currently you must write np.array([1, [2]], dtype=object) for the ragged array case. The overlap is only that dtype=allow_object could also enable the legacy behaviour: np.array([1, [2]], dtype=allow_object) works, but np.array([1, 2], dtype=allow_object) is not an object array.

But allow_object could also be limited to undefined/invalid promotions specifically, making it a bit less random maybe.

mhvk · 2021-05-27T17:35:39Z

Thanks! Now understanding the path currently taken more, I like @mattip's suggestion even more of exposing the dtype/shape inference function. And that also means we might as well postpone discussion of any allow_object flag - I can see that it would have very little use if one can do the introspection. (And on structured type inference, I think that may still be possible, but would then be a new part in the inference function, so is also a separate issue.)

seberg · 2021-05-27T18:13:49Z

(This is off topic, about how tuple inference would be expected to work now)

> I think that may still be possible, but would then be a new part in the inference function

You can already do as much of this as I am comfortable with, probably. But it's not quite public API yet. The current design idea is that you create a new (possibly abstract!) StructuredDType. This could see the tuple and infer the descriptor:

element_descriptor = StructuredDType.discover_descriptor_from_pyobject(tuple)

The user must then pass dtype=StructuredDType. But after that your DType can define what to do. (This is currently unsupported in np.array() but supported in np.discover_array_parameters()). The main problem with our current void, is that it can mean too many things at once.
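
As a loose illustration of the idea only: the class below is a plain Python stand-in, not NumPy's DType machinery, and the rule of one field per tuple entry is an assumption made for the sketch.

```python
import numpy as np


class StructuredDType:
    """Hypothetical stand-in; not NumPy's DType class machinery."""

    @classmethod
    def discover_descriptor_from_pyobject(cls, obj):
        if not isinstance(obj, tuple):
            raise TypeError("expected a tuple element")
        # One field per tuple entry, named f0, f1, ... like NumPy's defaults.
        fields = [(f"f{i}", np.array(item).dtype) for i, item in enumerate(obj)]
        return np.dtype(fields)


element_descriptor = StructuredDType.discover_descriptor_from_pyobject((1, 2.5))
# With the descriptor known, the tuple is a single structured element.
record = np.array((1, 2.5), dtype=element_descriptor)
```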

mhvk · 2021-05-27T20:03:10Z

That off-topic piece would be very useful!

charris · 2021-06-04T16:45:16Z

@seberg Is this still a work in progress?

seberg · 2021-06-04T17:45:25Z

No, unfortunately not. It relies on the string promotion behaviour, so will collide with that delay/reversal (unless we make that a backport-only).

I could expose the function first and not worry about string promotion for now (since it's a new function and the string promotion warning/error should follow soon anyway).

seberg · 2021-06-04T17:50:12Z

@charris ah, sorry, removed the backport candidate tag, if we undo string promotion, there is no real point.

charris · 2021-06-04T17:56:56Z

I will build 1.21.0rc2 next Sunday. If we are going to undo anything it would be good to get it in.

seberg · 2022-12-20T15:39:35Z

Going to close this for now. I am thinking of picking this up as two separate PRs (mainly, I would like to do something about our broken promotion...):

- Exposing the discovery function is probably useful anyway and, in a sense, overdue. It could try to distinguish object due to a weird object from object due to a promotion error (although I am not sure how best to do it. Returning None is an option but forces the user to check if the size is empty, which may also be None).
- For the string and datetime promotion deprecation/FutureWarning, things are complicated. My current thought is that a with np._future_promotion(): might be helpful. Basically, a better way to raise the FutureWarning and get an object array out (since that is what we do for promotion errors currently).
  The param discovery function could possibly always live in that future, but I am not sure it matters. Entering a context isn't super quick, but it is quicker and more precise than setting up warning filters.

In either case, there may be some good stuff here, but I doubt we will put this in very similar to how it is.

seberg added 00 - Bug 01 - Enhancement labels May 26, 2021

seberg changed the title ~~ENH: Reorganize string promotion and add object_fallback=False (sem…~~ ENH: Reorganize string promotion and add object_fallback=False May 26, 2021

seberg marked this pull request as draft May 26, 2021 01:58

seberg mentioned this pull request May 26, 2021

np.ma.arrray(['string', np.ma.masked]) gives FutureWarning after #18116 (breaks astropy) #18425

Open

mhvk reviewed May 26, 2021


seberg mentioned this pull request May 26, 2021

DOC: FutureWarning from string promotion #19078

Closed

charris added the 09 - Backport-Candidate PRs tagged should be backported label May 28, 2021

charris changed the title ~~ENH: Reorganize string promotion and add object_fallback=False~~ ENH: Reorganize string promotion, add object_fallback=False Jun 4, 2021

seberg removed the 09 - Backport-Candidate PRs tagged should be backported label Jun 4, 2021

seberg closed this Dec 20, 2022
