EA: revisit interface #32586

Closed
jbrockmendel opened this issue Mar 10, 2020 · 53 comments
Labels
API Design ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action


@jbrockmendel
Member

jbrockmendel commented Mar 10, 2020

This is as good a time as any to revisit the "experimental" EA interface.

My read of the Issues and recollection of threads suggests there are three main groups of topics:

Clarification of the Interface
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  1. _values_for_argsort and _values_for_factorize (see the sketch after this list)
    • Do we need both? The docs say both should be order-preserving.
    • Is it safe to return a view? (Categorical._values_for_argsort makes a copy for no obvious reason)
    • What else can they be used for internally? e.g. in CLN: use _values_for_argsort for join_non_unique, join_monotonic #32467, _values_for_argsort is used for ExtensionIndex join_non_unique and join_monotonic
  2. What characteristics should _ndarray_values have? Is it needed? (DEPR: _ndarray_values vs to_numpy vs __array__ #32412) _ndarray_values has since been removed
  3. What should _from_sequence accept?
    • Should it accept only sequences that are unambiguously of this dtype?
    • In particular, should DTA/TDA/PA not accept i8 values?
  4. What should fillna accept? (Should ExtensionBlock.fillna unbox Series / Index? #22954, EA: fillna should accept same type #32414)
    4.5) Require that __iter__ return native types? API: ExtensionArrays and conversion to "native" types (eg in tolist, to_dict, iteration, ..) #29738
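
For reference, a rough sketch of what the base class currently defaults to for these two methods (paraphrased; the real defaults live in pandas/core/arrays/base.py):

import numpy as np

def _values_for_argsort(self) -> np.ndarray:
    # any ndarray whose sort order matches the EA's sort order
    return np.array(self)

def _values_for_factorize(self):
    # (values, na_sentinel): the values must be hashable,
    # na_sentinel marks the missing entries
    return self.astype(object), np.nan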

Ndarray Compat
^^^^^^^^^^^^^^
5) Headaches have been caused by trivial ndarray methods not being on EA
- #31199 size
- #32342 "T" (just the most recent; this has come up a lot)
- #24583 ravel
6) For arithmetic we're going to need something like either tile or broadcast_to

Methods Needed/Wanted For Index/Series/DataFrame/Block
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
7) Suggested Methods (partial list)
- #27264 duplicated
- [x] #23437 _empty
- #28955 apply
- #23179 map
- #22680 hasnans
- [x] #27081 equals
- [x] #24144 _where
- [x] _putmask would be helpful for ExtensionIndex

I suggest we discuss these in order. Before jumping in, is there anything vital missing from this list? (this is only a small subset of the issues on the tracker)

cc @pandas-dev/pandas-core @xhochy

@jorisvandenbossche jorisvandenbossche added API Design ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action labels Mar 10, 2020
@TomAugspurger
Contributor

_values_for_argsort and _values_for_factorize

Do we need both?

I went back through #20361, but it isn't very illuminating. My early version used _values_for_argsort, but 30941cb changed it to _values_for_factorize. I think the main difference (aside from the return type including the NA value) is that the values in _values_for_factorize need to be hashable.

What characteristics should _ndarray_values have? Is it needed? (#32412)

This has always been nebulous, at least to me. I think it's primarily used in indexing engines, but I may be out of date.

What should _from_sequence accept?

Joris has argued for a while now that it should only accept instances of ExtensionDtype.type or NA values. We had a recent issue (perhaps around groupby?) that was caused by _from_sequence being too permissive.

What should fillna accept? (#22954, #32414)

I think my preference is to restrict this to instances of ExtensionDtype.type. It's somewhat strange for .fillna() to result in an object dtype. The obvious exception is something like IntegerArray.fillna(1.5), which may reasonably be cast to float.

Will comment more on the "additional methods" later, but my opening thought is that we should come up with a guiding principle (e.g. "strive for minimalism" or "drop-in replacement for 1D ndarray") and apply that to cases, rather than deciding case by case.

@jbrockmendel
Member Author

What characteristics should _ndarray_values have? Is it needed? (#32412)

This has always been nebulous, at least to me. I think it's primarily used in indexing engines, but I may be out of date.

By my count we have 6 places left where we use _ndarray_values. #32646 removes two, #32538 removes one, #32476 removes one, and #32641 deprecates one. The last one I'll be making a PR for shortly. Unless there is an objection, I'd like to get rid of _ndarray_values once we no longer use it.

I think the main difference (aside from the return type including the NA value) is that the values in _values_for_factorize need to be hashable.

That is referenced indirectly in the _values_for_factorize docstring; it should probably be made explicit.

The ExtensionIndex code is making an implicit assumption about _values_for_argsort that should probably be made explicit: the ordering should be preserved not just within left._values_for_argsort() but also relative to right._values_for_argsort() (when right.dtype matches left.dtype).

Unless we can find a compelling reason to keep both _values_for_argsort and _values_for_factorize, I'd advocate dropping one of them.

What should _from_sequence accept?

Joris has argued for a while now that it should only accept instances of ExtensionDtype.type or NA values

I'm leaning more and more in that direction, just found another problem caused by being too lenient, xref #32668. I expect DTA/TDA/PA are the worst offenders, will experiment with tightening those up and see if it breaks the world.

As a heuristic, maybe something like "cls._from_sequence(values) works if and only if pd.array(values) would return cls". The heuristic breaks down for non-internal EAs until infer_dtype can handle them.
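
To illustrate the heuristic with the nullable integer array (a sketch; output abbreviated):

In [1]: values = [1, 2, None]

In [2]: arr = pd.array(values)  # infers the nullable Int64 dtype

In [3]: type(arr)._from_sequence(values, dtype=arr.dtype)
Out[3]:
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64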

Will comment more on the "additional methods" later, but my opening thought is that we should come up with a guiding principle (e.g. "strive for minimalism" or "drop-in replacement for 1D ndarray") and apply that to cases, rather than deciding case by case.

+1 on "drop-in replacement for 1D ndarray"; my understanding is that the E in EA is about "extending np.ndarray"

A couple other potential guiding principles (not mutually exclusive):

  • The minimal set of methods so that Block/ExtensionIndex methods don't have to roundtrip EA->ndarray->EA
  • The minimal set of methods so that we're not special-casing Categorical all over the place.

@jorisvandenbossche
Member

@jbrockmendel thanks for gathering all those

I think the main difference (aside from the return type including the NA value) is that the values in _values_for_factorize need to be hashable.

Some additional, assorted thoughts about this aspect:

  • _values_for_factorize works with NA sentinel (and I don't think the NA sentinel needs to sort correctly), while _values_for_argsort does not.
    Do we assume that missing values in argsort are smallest? At least in sorting.py::nargsort, we also still compute a mask, so there we don't make any assumption about it.
  • For _values_for_factorize, I think we should also consider making it an option to return (values, mask) in addition to (values, na_sentinel), in case this is easier/cheaper to provide (which is especially the case for the masked arrays; this will need support for masks in the factorize algos, though)
  • In GeoPandas, we implement _values_for_factorize, because geometries can be factorized, but we do not implement _values_for_argsort, because we consider geometries as "not sortable" (the hashed values from _values_for_factorize of course have a certain order, but eg Series.sort_values raises an error currently). Having those two methods makes this distinction possible.
    That said: we could probably indicate "not-orderable" in another way as well, if needed. And I am also not fully sure we did the best thing in GeoPandas, as being able to sort still has value for non-orderable values (similar to how an unordered categorical has no min/max, but can still be sorted).

It's certainly good to think about whether we can combine the two.

Joris has argued for a while now that it should only accept instances of ExtensionDtype.type or NA values

I'm leaning more and more in that direction, just found another problem caused by being too lenient, xref #32668. I expect DTA/TDA/PA are the worst offenders, will experiment with tightening those up and see if it breaks the world.

Yes, I think that in many places where we now use _from_sequence, we should be more strict and only allow actual scalars (a related issue here is #31108).
However, it's probably still useful to have the less strict version as well. Eg in pd.array(..) with a specified dtype, you can be more liberal in the conversion.

So one option might be to keep _from_sequence as is, to be used for example in pd.array, and introduce a new _from_scalars that has the stricter behaviour (only accepting actual scalars). And by default that could use _from_sequence for backwards compatibility.
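
A minimal sketch of that first option (hypothetical; _from_scalars does not exist yet, and the default here just defers to the permissive constructor):

class ExtensionArray:
    @classmethod
    def _from_scalars(cls, scalars, *, dtype):
        # strict: implementations should accept only instances of
        # dtype.type (or NA); the base class falls back to the
        # permissive _from_sequence for backwards compatibility
        return cls._from_sequence(scalars, dtype=dtype)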

Another option might be to let this more general conversion be handled by the new astype machinery (#22384), since you could see a general conversion from list-like as in pd.array as an astype from object to the specific dtype. And then we can keep _from_sequence for the strict conversion.
(but the first option is probably easier short term, since refactoring astype is a bigger topic)

Aside, we should probably have defined methods like _from_sequence/_from_scalars on the dtype instead of the array.

What should fillna accept? (#22954, #32414)

I think my preference is to restrict this to instances of ExtensionDtype.type. It's somewhat strange for .fillna() to result in an object dtype. The obvious exception is something like IntegerArray.fillna(1.5), which may reasonable be cast to float.

Personally, I think we should adopt the rule that fillna strictly fills values, and thus never changes the dtype.
If that is the case, then I think it is up to the array to accept anything it wants as fill value, as long as that can be converted to something that fits in the array (or being strict and only accept scalars, if the EA implementor prefers that).

It might be that we need something like _can_hold_element in other places, to know which values to accept.

By my count we have 6 places left where we use _ndarray_values ... Unless there is an objection, I'd like to get rid of _ndarray_values once we no longer use it.

I am not that optimistic we can actually fully get rid of it without some replacement (there are cases where we need an ndarray). Although in some cases _values_for_factorize could maybe be used instead, eg for the Index engine once we support general EAs in Index.


For the "bigger" topics discussed here (like the factorize/argsort values), it might be worth to start separate the discussion?

@jbrockmendel
Member Author

It looks like we have a special Categorical._values_for_rank used in rank; do we need this more generally?

@jorisvandenbossche
Member

It looks like we have a special Categorical._values_for_rank used in rank; do we need this more generally?

Or putting this the other way around: could rank rather use _values_for_argsort instead? (I suppose this _values_for_rank from Categorical predates the EA interface work)

@xhochy
Contributor

xhochy commented Mar 24, 2020

_from_sequence(values)

As a heuristic, maybe something like "cls._from_sequence(values) works if and only if pd.array(values) would return cls". The heuristic breaks down for non-internal EAs until infer_dtype can handle them.

This would break fletcher quite a bit, as nearly all scalar types we support are also types pandas supports natively, so pd.array would always return the pandas-internal array type, not the fletcher one.

Also, there was a mention here that _from_sequence should only work for ExtensionDtype.type; this is also problematic. For example, in the datetime case one would like to support datetime.datetime, np.datetime64 and pd.Timestamp. It would be good to just leave the check for valid scalars in _from_sequence inside the ExtensionDtype.

Methods defined on pd.Series/DataFrame

I think I would be able to supply a long list here had I made further progress with fletcher. The important thing is that it should be optional to override these in the ExtensionArray. For a lot of possible ExtensionArray implementations, the defaults are probably better than what someone would write, but in the Arrow case it would quite often make sense to override them, as there someone already went the extra mile and wrote a more efficient implementation tailored to Arrow's bitmask setup.

One more thing that troubles me currently is that the dt and str accessors only work on the pandas-internal types. It would also be nice to optionally implement some of their methods in the ExtensionArray.

Test Setup

I really like this. It was amazingly helpful when implementing the interface.

(also: thanks for the mention; I sadly don't have the bandwidth to watch the pandas repo but am always interested in giving input on EAs)


@TomAugspurger
Contributor

Also, there was a mention here that _from_sequence should only work for ExtensionDtype.type; this is also problematic.

Is it possible for you to define .type as a metaclass that registers each scalar type, as in https://github.com/ContinuumIO/cyberpandas/blob/dbce13f94a75145a59d7a7981a8a07571a2e5eb6/cyberpandas/ip_array.py#L22-L29?

@jorisvandenbossche
Member

Also, there was a mention here that _from_sequence should only work for ExtensionDtype.type; this is also problematic. For example, in the datetime case one would like to support datetime.datetime, np.datetime64 and pd.Timestamp. It would be good to just leave the check for valid scalars in _from_sequence inside the ExtensionDtype.

Can you clarify how that would be problematic?
And to be clear, if we do this change, we would only call _from_sequence in places where we know the data came from the array. Not in general construction functions like pd.array(.., dtype=). For that, something less strict is still needed (see my comment above about either introducing a separate _from_scalars, or either letting this general conversion be handled by astype).

@xhochy
Contributor

xhochy commented Mar 24, 2020

Also, there was a mention here that _from_sequence should only work for ExtensionDtype.type; this is also problematic.

Is it possible for you to define .type as a metaclass that registers each scalar type, as in https://github.com/ContinuumIO/cyberpandas/blob/dbce13f94a75145a59d7a7981a8a07571a2e5eb6/cyberpandas/ip_array.py#L22-L29?

Yes and no: we could specify it currently as the union of dict, list, int, float, bytes, str, datetime, np.datetime64, pd.Timestamp, and some more. But I'm not sure whether this would bring any valuable information into the interface, as it is a very broad range of types.

fletcher is the extreme case here: we have FletcherContinuousArray._from_sequence, which is not actually bound to a specific dtype yet. Depending on the scalars you pass in, you get a different ExtensionDtype instance.

Can you clarify how that would be problematic?
And to be clear, if we do this change, we would only call _from_sequence in places where we know the data came from the array.

Currently, all results of an operation on a FletcherArray are again a FletcherArray, though in some cases with a different dtype. For example we have __eq__(fletcher[int], fletcher[int]) -> fletcher[bool] or __truediv__(fletcher[int], fletcher[int]) -> fletcher[float], always carrying the valid mask into the results.

@jorisvandenbossche
Member

jorisvandenbossche commented Mar 24, 2020

But I'm not sure whether this would bring any valuable information into the interface, as it is a very broad range of types.

And I also don't think this should be the goal / point of ExtensionDtype.type

fletcher is the extreme case here: we have FletcherContinuousArray._from_sequence, which is not actually bound to a specific dtype yet. Depending on the scalars you pass in, you get a different ExtensionDtype instance.

_from_sequence gets the dtype object passed, so even if the implementation is not bound to a dtype-specific class, it can use the dtype information to return the correct object.

Currently, all results of an operation on a FletcherArray are again a FletcherArray.

That's not where _from_sequence would be used. There are occasions where you do not return a FletcherArray, such as getitem with a scalar indexer or a reduction function, right?
It is only for those cases where _from_sequence would be used (again, the strict version would only be used for that, for general construction we still need something else).

@jorisvandenbossche
Member

To put it another way: generally speaking, _from_sequence (or a new, stricter _from_scalars) should be used when, for some reason, we go through object dtype (or a list of scalars, or something equivalent).
So that means that EA._from_sequence(np.asarray(EA, dtype=object), dtype=EA.dtype) (or EA._from_sequence(list(EA), dtype=EA.dtype)) should be able to reconstruct the original array. For such a case, I don't think you need to be able to handle datetime.datetime, np.datetime64, etc?
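
In test form, that contract might look like this (a sketch using pandas' testing helper):

import numpy as np
import pandas as pd

def check_object_roundtrip(ea):
    # a strict _from_sequence must be able to reconstruct an array
    # that went through object dtype
    result = type(ea)._from_sequence(
        np.asarray(ea, dtype=object), dtype=ea.dtype
    )
    pd.testing.assert_extension_array_equal(result, ea)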

@xhochy
Contributor

xhochy commented Mar 24, 2020

So that means that EA._from_sequence(np.asarray(EA, dtype=object), dtype=EA.dtype) (or EA._from_sequence(list(EA), dtype=EA.dtype)) should be able to reconstruct the original array. For such a case, I don't think you need to be able to handle datetime.datetime, np.datetime64, etc?

Yes

@jorisvandenbossche
Member

So for the "_from_sequence strictness" discussion, assuming we agree that we need a strict version for certain use cases, I mentioned two options above, to be summarized as:

  1. Keep _from_sequence as is, and add a new _from_scalars method that is more strict (that in the base class can call _from_sequence initially for backwards compatibility). We can use _from_scalars in those cases where we need the strict version, and keep using _from_sequence elsewhere (eg in pd.array(.., dtype=))
  2. Update the expectation in our spec that _from_sequence should only accept a sequence of scalars of the array's type (so make _from_sequence the strict method), and use the astype machinery for construction. Basically, the current flexible _from_sequence would then be equivalent to casting an object ndarray (or generally any type) to your EA dtype.

Are there preferences? (or other options?)

From a backwards compatibility point of view, I think both are similar (in both cases you need to update a method (_from_scalars or _from_sequence), and in both cases initially the flexible version will still be used as fallback until the update is done).

The second option of course requires an update to the astype machinery (#22384), which doesn't exist today, but which is on the other hand something we need to do eventually anyway.

@TomAugspurger
Contributor

I think my preference is for _from_sequence to be the "parse this sequence into an array" and for _from_scalars to be the strict version.

@jbrockmendel
Member Author

Are there preferences? (or other options?)

I would rather avoid adding another constructor (I'm already not wild about _from_sequence_of_strings).

IIRC a big part of the reason why _from_sequence became less strict than originally intended is that DTA/TDA/PA._from_sequence took on a bunch of cases that the DTI/TDI/PI constructors handle, in particular ndarray[i8], which by strict rules shouldn't be accepted. Most of the internal usages of those methods have now been removed, so it should be viable to tighten those three _from_sequence methods. If we do that, what will be left in the way of making _from_sequence strict(er)?

@jorisvandenbossche
Member

We still need a "non-strict" version for general construction (so for the same reason that our internal datetime-like ones were not strict). For example, _from_sequence is also used in pd.array(.., dtype=) and in astype.

@jbrockmendel
Member Author

For example, _from_sequence is also used in pd.array(.., dtype=) and in astype.

Maybe something like fromtype that astype can defer to? There should also be some relationship between fromtype and _from_factorized

@jorisvandenbossche
Member

Well, see my option number 2 above (#32586 (comment)), which proposes to defer this to the astype machinery. So that will involve some way to have a casting method from a certain dtype to the EA dtype in question.

There should also be some relationship between fromtype and _from_factorized

I don't think this is necessarily required. The values for factorize can be completely internal to the array (not to be used by users).

@jorisvandenbossche
Member

jorisvandenbossche commented Mar 27, 2020

Regarding the "_values_for_factorize / _values_for_argsort / can they be used for more internally" discussion.
I just realized that, ideally, we actually should not use them internally, according to our current EA interface (so we should fix the few cases where we started using them in joining code).
The spec about factorize is clear that there are two ways to override its behaviour: implement _values_for_factorize/_from_factorized, or implement factorize itself:

# Implementer note: There are two ways to override the behavior of
# pandas.factorize:
# 1. _values_for_factorize and _from_factorized.
#    Specify the values passed to pandas' internal factorization
#    routines, and how to convert from those values back to the
#    original ExtensionArray.
# 2. ExtensionArray.factorize.
#    Complete control over factorization.

Of course _values_for_factorize still has a default (astype(object)) implementation, but that can be very non-optimal for external EAs. So ideally, for anything factorize-related, we should actually always call the EA.factorize() method.
Eg in the merging code where it was recently introduced, we should check if we can use factorize directly (which I suppose is possible, since we need to factorize there for the merge algo). In hash_array, where we have used it for longer, it might be less obvious to avoid (I didn't check the code).

Fletcher is an example of an external EA that implements factorize and not _values_for_factorize.
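
Concretely, the two routes look roughly like this (a sketch; route 2 is shown delegating to pd.factorize for brevity, whereas a real override like fletcher's would use a storage-native algorithm such as Arrow's dictionary encoding):

import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

class MyArray(ExtensionArray):
    # (other required interface methods elided)

    # Route 1: supply factorizable values plus the inverse transform
    def _values_for_factorize(self):
        return np.asarray(self, dtype=object), np.nan

    @classmethod
    def _from_factorized(cls, uniques, original):
        return cls._from_sequence(uniques, dtype=original.dtype)

    # Route 2: complete control over factorization
    def factorize(self, na_sentinel=-1):
        codes, uniques = pd.factorize(
            np.asarray(self, dtype=object), na_sentinel=na_sentinel
        )
        return codes, type(self)._from_sequence(uniques, dtype=self.dtype)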

@jbrockmendel
Member Author

jbrockmendel commented Mar 27, 2020

The spec about factorize is clear that there are two ways to override its behaviour: implement _values_for_factorize/_from_factorized, or implement factorize itself:

This is a fair point, especially in light of the fletcher example.

But then why do we need _values_for_factorize and _from_factorized at all? We could require factorize and document that the vff/ff pattern is a simple way to implement it.

so we should fix the few cases we started using them in joining code

If you're referring to ExtensionIndex._get_engine_target, that uses _values_for_argsort

@jorisvandenbossche
Member

But then why do we need _values_for_factorize and _from_factorized at all? We could require factorize and document that the vff/ff pattern is a simple way to implement it.

Yes, that's also possible, and I think we discussed that at some point, but we opted for also having _values_for_factorize to avoid everyone needing to copy-paste (and slightly edit) the base class factorize implementation. It's only a few lines of code, so not that much of a problem, but it is using internal API. So we would need to expose the functionality of _factorize_array somewhere (eg in pd.factorize()).

If you're referring to ExtensionIndex._get_engine_target, that uses _values_for_argsort

No, I am referring to where _values_for_factorize is used:

elif is_extension_array_dtype(lk.dtype) and is_dtype_equal(lk.dtype, rk.dtype):
    lk, _ = lk._values_for_factorize()
    rk, _ = rk._values_for_factorize()

Now, I think for argsort / _values_for_argsort we meant to have the same dual option, so the discussion also applies there.

@jorisvandenbossche
Member

So we should think about whether we can always get by with EA.factorize() / EA.argsort(), or whether we have use cases for actually getting an ndarray of hashable/sortable values (that are not the integers from factorize), to see whether we should keep something like _values_for_factorize/_values_for_argsort in the EA interface but make it required (or make one of the two required).

@jbrockmendel
Member Author

No, I am referring to where _values_for_factorize is used [in core.reshape.merge]

Got it, thanks. IIRC the first draft of the PR that did that used _values_for_argsort, so that remains an option. You're right though that using a public method would be preferable.

if we have use cases for actually getting an ndarray of hashable/sortable values (and that are not the integers from factorize),

I think _get_engine_target is pretty close to what you're describing. Or did you have something else in mind?

Using _values_for_argsort (_vfa) for _get_engine_target (_get) makes some implicit assumptions about _vfa that we should consider making explicit:

  • _vfa maintains order both within an array and across arrays of the same dtype, so (ea1 <= ea2) == (ea1._vfa() <= ea2._vfa())
    • A _vfa that would not satisfy this might look something like bad_vfa = good_vfa() - good_vfa().min()
  • _vfa commutes with __getitem__, so ea._vfa()[key] == ea[key]._vfa()
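
Spelled out as (hypothetical) checks, ignoring NA handling:

import numpy as np

def check_vfa_assumptions(ea1, ea2, key=slice(None, None, 2)):
    assert ea1.dtype == ea2.dtype and len(ea1) == len(ea2)
    # ordering is preserved across arrays of the same dtype
    assert np.array_equal(
        np.asarray(ea1 <= ea2, dtype=bool),
        ea1._values_for_argsort() <= ea2._values_for_argsort(),
    )
    # _values_for_argsort commutes with __getitem__
    assert np.array_equal(
        ea1._values_for_argsort()[key],
        ea1[key]._values_for_argsort(),
    )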

@crepererum
Contributor

Sorry for joining this late, but here are some comments from the perspective of rle-array, which is a very simple example of how EAs can be used for transparent and efficient compression. Some of this is more a question or a wish, so please take it with a grain of salt. Also, thanks for the whole initiative :)

Types

It would be nice if we could indeed be a little more concrete in many places and/or provide some more helpers in the layer around EAs. For example:

# `ea` is some ExtensionArray
assert isinstance(ea, ExtensionArray)

# type conversion should be handled by EA?
ea[np.array([True, False], dtype=object)]

# what about NAs?
ea[np.array([True, False, None], dtype=object)]

# and different types of ints
ea[np.array([0, 1], dtype=np.int64)]
ea[np.array([0, 1], dtype=np.uint64)]
ea[np.array([0, 1], dtype=np.int8)]

# should floats be cast if they are exact integers?
ea[np.array([0, 1], dtype=np.float64)]

# what about the massive amount of slices?
ea[:]
ea[2:15:2]
ea[-15:-1:-2]
ea[::-1]

Same goes for many of the other _-methods mentioned in the discussion above.

What does E in EA mean?

For example, rle-array returns run-length-encoded bools for comparisons. Does this really work everywhere?

# `ea` is some ExtensionArray
assert isinstance(ea, ExtensionArray)

mask = ea < 20
assert isinstance(mask, ExtensionArray)

df[mask] = ...

I sometimes have the impression that the builtin EAs are somewhat special (e.g. BooleanArray) and you cannot really extend all types used by pandas.

Sane defaults but powerful overwrites

I second what @xhochy already said: it would be nice if the base EA class provided many methods based on a very small interface that the subclass must implement, but which could at the same time be overridden. rle-array would definitely use this power to make many implementations way more efficient.

That said, it is important that the test suite keeps testing the actual downstream EA and not the example/default implementations in the base class.

Object Casting

It used to be the case that EA data was silently cast to object ndarrays by some DataFrame methods. Not sure if this is still the case, but for the RLE use case this means users might suddenly run out of memory. In general, it would be nice if EAs could control the casting a little more strictly. rle-array issues a performance warning every time this happens.

EAs first vs second class

At least pre-1.0 pandas had bad groupby-aggregate performance when it came to exotic EAs, because it tried 2 different fast paths before falling back to the actual Python "slow path" (which for RLE can be quite fast). It would be great if EAs could avoid this trial-and-error and directly choose the correct code path.

Memory Shuffles

This is very special to RLE, but groupby operations in pandas use unstable sorting (even elements within the same group are re-shuffled), which works great for plain-old-numpy-arrays but is very painful for non-array-based EAs.

NumPy Interop

Would be great to have a test suite like this one and some help from the EA baseclass.

Special Methods (.dt, .str)

Also, as @xhochy already wrote, it would be nice if this could be factored into some kind of base class or auto-implementation as well, so EA authors don't need to re-implement every single date and string speciality.

Number of EA types

For me, it is not really clear at which granularity EAs should be implemented. rle-array uses a single class for all payload types, which, as far as I understand the EA docs, should be OK. fletcher does the same, but the builtin types use multiple array classes, and it is not clear to me why this is done.

Views

Pandas checks if views are correctly implemented (at least to a certain degree). But: views are really complex, especially when you have an entire tree of views (multiple nesting levels and branches). Do we really require that EAs implement the whole protocol?

@jbrockmendel
Member Author

we should come up with a guiding principle (e.g. "strive for minimalism" or "drop-in replacement for 1D ndarray") and apply that to cases, rather than deciding case by case.

One more suggestion for these: The minimal logic/methods such that we don't need ExtensionBlock subclasses for pandas-internal EAs (cc @jorisvandenbossche IIRC you advocated something like this at the last sprint)

@jorisvandenbossche
Member

To split off another topic from this issue, I opened a dedicated issue for the _values_for_argsort vs _values_for_factorize vs _ndarray_values discussion -> #33276

@jbrockmendel
Member Author

re _concat_same_type: there are three broad cases that SomeEA._concat_same_type may see:

  1. matching types and matching dtypes.
  2. matching types but different dtypes (PA with different freqs, DTA with different tzs, Categorical with different categories, ...)
  3. At least one instance of our class, but no assurance on the others (e.g. Categorical._concat_same_type aliases concat_categorical, which handles all kinds of things).

ATM DTA and PA raise on anything but case 1. Categorical allows any of these. IntervalArray requires matching closed but doesn't otherwise check for matching dtypes (I doubt this is intentional).

I expect long-term we'll want EA to have some ability to determine how it concats with other types, which will require some form of delegation like what Tom described. More immediately I think we should clarify whether _concat_same_type is expected to handle case 2 or just case 1.

@jorisvandenbossche
Member

For reference, regarding the concatenation (concat, append, etc) of ExtensionArrays, there is a new proposal -> #33607
This introduces a mechanism for the ExtensionDtype to determine a common dtype from a set of input dtypes. All arrays get cast to this common dtype, and then the all-same-dtype arrays get concatenated (with the existing EA._concat_same_type in the case of extension arrays, otherwise np.concatenate).
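
The mechanism, roughly (a sketch; the hook name follows the proposal, and the other required dtype attributes are elided):

from pandas.api.extensions import ExtensionDtype

class MyDtype(ExtensionDtype):
    def _get_common_dtype(self, dtypes):
        # given the dtypes of all arrays being concatenated, return the
        # dtype everything should be cast to, or None to fall back to object
        if all(isinstance(t, type(self)) for t in dtypes):
            return self
        return None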

@jbrockmendel
Member Author

This came up in #36400: I propose adding default implementations of where and putmask to EA.

@jorisvandenbossche
Member

Regarding putmask, what's the use case for it that cannot be covered by eg array[mask] = values?

@jbrockmendel
Member Author

Regarding putmask, what's the use case for it that cannot be covered by eg array[mask] = values ?

That would be convenient, but putmask allows values that are listlike with len(values) != len(array).

@jorisvandenbossche
Member

But still the correct length for the mask? Like:

In [30]: arr = np.array([1, 2, 3])

In [31]: arr[arr > 1] = [10, 10]

In [32]: arr
Out[32]: array([ 1, 10, 10])

@jbrockmendel
Member Author

But still the correct length for mask ? Like:

No. AFAICT it's unrestricted; numpy will either truncate or repeat the values.
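
For example (the values are indexed by position in the flat array, repeating as needed):

In [1]: arr = np.array([1, 2, 3, 4, 5])

In [2]: np.putmask(arr, arr > 2, [10, 20])

In [3]: arr
Out[3]: array([ 1,  2, 10, 20, 10])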

@jorisvandenbossche
Member

OK, but do we have any internal use case for having the values truncated or repeated?

@jbrockmendel
Member Author

We have Index.putmask and Block.putmask, both of which expect np.putmask-like mechanics

@jorisvandenbossche
Member

We have Index.putmask and Block.putmask, both of which expect np.putmask-like mechanics

Can you give an example where this specific numpy behaviour of truncating/repeating array-like values is used?

Doing a search for putmask, I get the following usages.

BlockManager.putmask is used in:

  • Series.update: other value is already aligned
  • NDFrame._where: not fully sure if other value is already aligned, but would expect so?

Block.putmask is used in (apart from BlockManager.putmask):

  • Block.fillna: I think the other value here is always a scalar? (you can't do fillna on eg a Series with an array-like)
  • Block.replace/_replace_coerce: not fully sure about this one (but as a user I wouldn't expect my replacement value to be repeated)

Index.putmask is used in:

  • Index.fillna: here the value is a scalar
  • Index.droplevel: value is also a scalar (np.nan)

Index.putmask is itself also a public method, but one we hardly use ourselves (and I also personally never used it in practice or seen someone use it in analysis code). It's also only available on Index (eg not on Series). I suppose it tries to mimic numpy, but it also deviates in several ways (it's a method while a numpy array doesn't have such a method, it returns a copy instead of working inplace, it upcasts instead of raising). I think these could also be arguments to actually deprecate it as a public Index method (which is independent of the question of adding it to EA, of course).

@jorisvandenbossche
Member

So putmask actually has another aspect of its behaviour that was not yet clear to me: it also masks the other values.
So that is what makes it different from array[mask] = values.

In [43]: arr = np.array([1, 2, 3, 4])

In [44]: np.putmask(arr, arr > 2, np.array([11, 12, 13, 14]))

In [45]: arr
Out[45]: array([ 1,  2, 13, 14])

which is basically equivalent (in case the other values have the same length as the array) to: array[mask] = values[mask].

And I suppose it is this behaviour of putmask that might be used in eg update or where or replace.

Now, even with that, does that make putmask basically:

def putmask(array: ExtensionArray, mask, values):
    # treat an aligned array-like as np.putmask does: mask the values too
    if isinstance(values, (np.ndarray, ExtensionArray)) and len(values) == len(array):
        array[mask] = values[mask]
    else:
        array[mask] = values

If it's basically the above, then I am not sure it warrants a method in the EA interface? (Would there be a reason that an EA author would want to override this?)

@jbrockmendel
Member Author

As mentioned on the call today, there are a handful of methods missing from EA if we want to stop special-casing pandas-internal EAs.

  1. replace - CategoricalBlock dispatches to Categorical.replace. I'm pretty sure this is the cause of BUG: pd.Series.replace(to_replace=...) does not preserve the original dtype of the series #33484.
  2. _format_native_types - DTBlock/TDBlock dispatch to_native_types to the EA's _format_native_types
  3. _can_hold_element - ATM EABlock._can_hold_element always returns True
    For many of our EAs we now have a _validate_setitem_value that we could use to implement _can_hold_element (see the sketch after this list)
  4. is_view
  5. quantile
  6. fillna
    DTBlock casts to object on failure (as do non-EA blocks). We should have a single behavior.
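
For item 3, a hypothetical default in terms of the existing validator:

def _can_hold_element(self, element) -> bool:
    # sketch: reuse the setitem validator to decide whether a value
    # fits this array's dtype without casting
    try:
        self._validate_setitem_value(element)
        return True
    except (ValueError, TypeError):
        return False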

@jbrockmendel
Member Author

I'd like to move forward on de-special-casing our internal EAs, particularly Categorical.replace. To do that, I'd refactor parts of Block.replace and use that as the basis for a default ExtensionArray.replace implementation.

Any thoughts/objections?

@jreback
Contributor

jreback commented Nov 7, 2020

+1 on adding methods where appropriate; replace makes a lot of sense

@jorisvandenbossche
Member

replace indeed sounds like a good candidate for an EA method. I think, eg for ArrowStringArray, we will also want a custom implementation.

@jbrockmendel
Member Author

I'm actually finding that if we get a correct _can_hold_element, we may not need a replace (though this may depend on getting a correct putmask; I'm not yet 100% sure _can_hold_element gives us that for free).

@jorisvandenbossche
Member

I didn't fully understand it on the call, but my question was: doesn't Categorical.replace still hold custom logic apart from _can_hold_element? (Related to the fact that it can work on its categories, to be more efficient, instead of on the actual values, but therefore needs to check whether the replacement is already present in the categories or not, etc.)
Or, as you mention, replace could be implemented generally in terms of other methods like putmask, but then it is still putmask that we might need to add to the interface?

(I am still wondering if a "putmask" can ever be as efficient as the current Categorical-specific "replace", though)

@jbrockmendel
Member Author

I didn't fully understand it on the call

I have a branch that implements what i was describing, will make a draft PR shortly for exposition.

doesn't Categorical.replace still hold custom logic apart from _can_hold_element? (related to the fact that it can work on its categories to be more efficient, instead of on the actual values, but therefore needs to check whether the replacement is present in the categories already or not, etc ...)

It does, but the implementation via putmask I have in mind does something similar. I haven't checked perf.

Or, as you mention, replace could be implemented generally in terms of other methods like putmask, but so then it is still putmask that we might need to add to the interface?

Sort of. The existing ExtensionBlock.putmask is pretty reasonable* as a general case, assuming block.values[mask] = other is valid, i.e. if block._can_hold_element(other). But Block.putmask doesn't make that assumption, and has casting logic for when it fails. So the missing piece is not EA.putmask, but "what do we cast to when simple putmask fails?" (detected either via a try/except or a _can_hold_element check that returns False)

* I'm assuming away the intricacies of how np.putmask handles repeating or truncating when other has a mismatched length. For all our internal usages, which are what I really care about, this assumption is benign.

@jbrockmendel
Member Author

AFAICT the only thing discussed here that is really still up in the air is the strictness of _from_sequence, for which the discussion has moved to #33254. Closing.
