ExtensionArray meta-issue #19696

TomAugspurger · 2018-02-14T13:29:04Z

jbrockmendel · 2018-02-22T05:37:12Z

Should these have a _typ attribute? If so, should SparseArray._typ be changed from "array" to something more specific?

TomAugspurger · 2018-02-22T11:47:16Z

We're adding an ABCExtensionArray and `EA._type` with the value 'extension' in #19520. SparseArray being _array isn't the perfect name, but I don't think that matters, do you? The more important name is the ABC.

…

On Wed, Feb 21, 2018 at 11:37 PM, jbrockmendel ***@***.***> wrote: Should these have a _typ attribute? If so, should SparseArray._typ be changed from "array" to something more specific? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#19696 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIkHC59w-YBE7ATTy1l6gtL-GgsSgks5tXP0MgaJpZM4SFUFN> .

TomAugspurger · 2018-02-23T12:29:47Z

Quick status update. The two big ones are in, and now work can mostly proceed in parallel (aside from #19863 which a couple of my followups depend on).

I'm currently rebasing all my branches on master and will make PRs today.

I'll also go through all the TODOs we've added an make issues / checkboxes for those.

If anyone is interested in working on these, PeriodArray will likely be the most sizable PR that I don't really have a start on, aside from a now out of date branch.

jbrockmendel · 2018-02-23T15:44:47Z

re PeriodArray, were in spitting distance of being ready to refactor arith/cmp methods out from DatetimeIndexOpsMixin/DatetimeIndexTimedeltaIndex/PeriodIndex. If you can wait for #19847, #19835, and #19800, I can get the ball rolling and save you some work.

jbrockmendel · 2018-03-18T17:05:48Z

Refactor arithmetic methods from DatetimeIndexOpsMixin/DatetimeIndex/TimedeltaIndex/PeriodIndex into array classes (ENH: implement DatetimeLikeArray #19902)
Refactor comparison methods from DTI/TDI/PI into array classes

TomAugspurger · 2018-03-20T21:48:59Z

Porting from #19902 (comment)

When do we want to call it and release 0.23? I can get __setitem__ finished up tonight or tomorrow. Groupby is I think close to being done (we have the API issue for argsort and a similar one for factorize. Actually supporting groupby doesn't seem difficult). Once those are done, I would call the interface sufficiently complete (aside perhaps from support for arithmetic ops, which I haven't really looked into. I'm not sure how difficult that will be, but I imagine it'll be some effort).

The big remaining question then is moving other array types to EAs

Period
Interval
DatetimeTZ
Sparse

Do we want to block 0.23 for those? Do we have an estimate for how long that would take?

cc @shoyer @jreback @jorisvandenbossche @chris-b1.

jbrockmendel · 2018-03-20T23:34:36Z

Do we have an estimate for how long that would take?

For the datetime/timedelta/period arrays (or the arithmetic/comparison bits of them), if/when #19902 goes through, I can get the next two PRs in that sequence out pretty quickly. But after those are in there will be some non-trivial design decisions to be made. The end of March would be an optimistic goal.

TomAugspurger · 2018-03-21T15:23:31Z

#19696 (comment)

cc @shoyer @jreback @jorisvandenbossche @chris-b1 (don't think you got pinged on this, since I did it in an edit).

jorisvandenbossche · 2018-03-23T12:35:50Z

I think ideally, we would already have converted Interval, Period, etc to use ExtensionArrays before doing a release, as this will exercise and stress-test the interface a lot.
But on the other hand, for projects like cyberpandas and geopandas, it would also be easier to already have a release that supports a basic extension array interface, making it easier for users to test it.
And we can be clear that the extension array interface for the extension authors will still be subject to change (which does not mean things will change for their users). So that would be in favor of releasing more quickly.

jorisvandenbossche · 2018-03-23T12:42:06Z

Another topic we need to discuss (although actually trying to implement it might make a discussion more tangible), is whether we want our current Index classes to subclass the ExtensionArrays (eg PeriodIndex subclassing PeriodArray) or to be composed of an ExtensionArray (PeriodIndex._data being a PeriodArray).

I think we previously thought to do composition, but @TomAugspurger said here #19902 (comment): (also @jreback mentioned this in #19957 (comment))

I've been going back and forth on which approach is best here. I'm slightly coming around to the idea of subclassing, but haven't 100% settled yet.

@TomAugspurger what do you see as advantages of using inheritance instead of composition?

I didn't really think about in detail (code-wise), but: Series uses composition, so not doing this for Index might reduce ability to share code. CategoricalIndex is already an existing example of Index with ExtensionArray that does composition. Other index classes with numeric or object dtypes (the ones having a numpy array under the hood) will keep using composition.

jbrockmendel · 2018-03-23T15:02:23Z

I didn't really think about in detail (code-wise), but: Series uses composition, so not doing this for Index might reduce ability to share code

I think Series might be the wrong comparison: FooBlock should be the analogous object to FooIndex, shouldn't it?

jreback · 2018-03-25T18:31:56Z

So I am a big +1 on making Index a subclass of EA. This will unify the external interface with the internal one (e.g. the user facing API). And will allow us to make sure the impl shares code (e.g. mainly this will involve changing some bespoke impl details of Index to conform to the new patterns of EA).

We already have a de-facto implementation of this for Datetime w/tz, meaning we are using it like an EA. subclassing DTI will just formalize this. We also get the Period & II impl for free if we do this. We could also change the categorical impl to actually use CI and again get some more code sharing / harmony.

The point of all this is to really dogfood EA internally in pandas. This will simplify code and promote a unified interface.

Then once the above is done, we can actually remove all of the Block types that are not simple ones. These would then merely become EA implementations. Again this would simplify the implementation details and friction that we currently have.

jorisvandenbossche · 2018-03-26T06:56:23Z

So I am a big +1 on making Index a subclass of EA. This will unify the external interface with the internal one (e.g. the user facing API).

What do you mean with unify interfaces? Which interfaces? You mean the ExtensionArray API and the Index API? I don't think those should be unified. The ExtensionArray interface is deliberately much smaller.

We already have a de-facto implementation of this for Datetime w/tz, meaning we are using it like an EA.

As far as I know, DatetimeIndex is an Index composed of a array holding the values, not subclassing anything. It is true that a Series of tz-aware data holds a DatetimeIndex, but they can both perfectly hold a DatetimeArray, without the Index subclassing from it.
Again, this might be a good approach, but what you raise are not actual reasons to do it IMO.

The point of all this is to really dogfood EA internally in pandas.

We can perfectly dogfood EA internally in pandas using composition instead of subclassing. So this is not relevant for the discussion (it is not about whether we want to use EAs internally in pandas, but how to use them).

Then once the above is done, we can actually remove all of the Block types that are not simple ones.

That's already the goal of the ExtensionBlock. Again that is AFAIK no argument in favor or against subclassing.

jbrockmendel · 2018-03-26T19:10:51Z

The main advantage I see to subclassing is avoiding series._values._foo_values.values._bar_values...

jorisvandenbossche · 2018-03-26T19:31:04Z

@jbrockmendel you give an example with Series. But I don't think we are actually contemplating making Series a subclass of extension array? Or are we, which would be a huge change in design? (in my head at least it was only about Index)
But to the point (considering Index): it will be no more the case than it is today, but only more consistently (Index is now composed of an array, but how this is done varies among different subtypes), and we also actually have different array representions that won't go away (the extension array, underlying codes, coerced to numpy array in some way, ..), so we will have such attributes anyhow.

jbrockmendel · 2018-03-26T20:36:05Z

@jorisvandenbossche You're right, a more apt example would have been series._data.blocks[0].values.data._values.... The point is that there is a cost to layers of unwrapping as well as variants of values/_values/_ndarray_values/data/_data/... and if a layer of unwrapping can be avoided without adding complications elsewhere it would be nice.

TomAugspurger · 2018-03-30T11:25:03Z

we also actually have different array representions that won't go away (the extension array, underlying codes, coerced to numpy array in some way, ..),

@jreback, @jbrockmendel do you have thoughts on this? NumericIndex is still going to compose an ndarray as the ._data attribute. Shouldn't our extension index classes do the same, but with an EA? I think declaring that ._data is a Union[ndarray, ExtensionArray] makes sense.

jbrockmendel · 2018-03-30T16:59:51Z

is still going to compose an ndarray as the ._data attribute. Shouldn't our extension index classes do the same

Assuming we are going with the Index-subclasses-EA approach, then yes. That said two preferences on naming conventions:

Block.values should also be brought into alignment with whatever name is chosen for this.
It would be really nice if all references to BlockManager could be found by grepping for \._data.

TomAugspurger · 2018-03-30T17:09:32Z

Assuming we are going with the Index-subclasses-EA approach, then yes

I don't follow. Doesn't the fact that Int64Index composing an ndarray as ._values argue for CategoricalIndex composing a Categorical as ._values?

jorisvandenbossche · 2018-03-30T18:08:30Z

Assuming we are going with the Index-subclasses-EA approach, then yes

Tom's question was rather about not assuming this.

jbrockmendel · 2018-03-30T18:58:16Z

I don't follow. Doesn't the fact that Int64Index composing an ndarray as ._values argue for CategoricalIndex composing a Categorical as ._values?

Tom's question was rather about not assuming this.

My answer then has to be a convex combination of "I don't have an opinion" and "I don't have an informed opinion."

jbrockmendel · 2018-04-11T16:52:56Z

Should this issue have a ExtensionArray label?

jbrockmendel · 2018-04-11T16:56:02Z

New thought on inheritance vs composition: there are cache_readonly attributes on Index subclasses that would need to be propertys on mutable Array classes. With inheritance we can just override the relevant properties on the subclass; getting the same affect with composition is less obvious.

TomAugspurger · 2018-04-11T18:43:24Z

With inheritance we can just override the relevant properties on the subclass; getting the same affect with composition is less obvious.

FWIW, https://github.com/pandas-dev/pandas/pull/20611/files didn't require any changes to the read only attributes (e.g. _hasnans). It's using the same code as before.

jbrockmendel · 2018-04-11T19:36:52Z

I don't know the IntervalIndex code that well, but it looks like is_non_overlapping_monotonic is a cache_readonly on the Index and a property on the Array. AFAICT the Array class never actually uses that attribute. To make the analogy hold, I think we need to imagine a hypothetical in which a bunch of IntervalArray methods access is_non_overlapping_monotonic.

jorisvandenbossche · 2018-04-12T07:53:33Z

To make the analogy hold, I think we need to imagine a hypothetical in which a bunch of IntervalArray methods access is_non_overlapping_monotonic.

Can you explain this further? What is the problem with the Array class using it for the Index having cached it?

The Index caching such properties of the underlying values is in principle somewhat brittle in case somebody changes the underlying values of the Index in place; but, users are not 'supposed' to do that, and you can currently already break some of the cached properties of Index in a similar way.

Further, if we want, we can actually also think about caching expensive properties on Arrays as well, if we want. We just need to make sure to clear the cache once its underlying values (left, right, closed in case of Interval) are changed.

jorisvandenbossche · 2018-04-12T16:09:00Z

@jreback @TomAugspurger to come back to our discussion on hangout

As I understand it now (Jeff, correct me if I am wrong), one of the main arguments you make is:

Index needs to be an ExtensionArray (= adhere to the "ExtensionArray interface", not necessarily by subclassing it), to ensure:
a) we have a guaranteed consistency in the public methods (like the IndexOpsMixin for Series/Index guarantees that a subset of their methods are consistent)
b) to avoid internally special casing between Index and Arrays

I agree that those two points are important to have, but I think the main discussion is how to achieve this. IMO it is not needed for Index to be an ExtensionArray to have those points.

But let's use some very concrete examples to discuss this, so using some methods that are currently part of the ExtensionArray interface

Factorization: we have a ExtensionArray._values_for_factorize as a way to implement array-specific factorization (eg for Categorical this are the codes), so that pd.factorize(array) works correctly (to be correct, pd.factorize only needs to EA.factorize() method, but the interface also provides the other methods to implement them).
And of course, it would be nice that pd.factorize(index[EA]) works consistently and without special casing. I think this is already the case, without the need that Index itself has this _values_for_factorize method.

The factorize implementation does this:

pandas/pandas/core/algorithms.py

Lines 606 to 611 in 2794474

    
           if is_extension_array_dtype(values): 
        
               values = getattr(values, '_values', values) 
        
               labels, uniques = values.factorize(na_sentinel=na_sentinel) 
        
               dtype = original.dtype 
        
           else: 
        
               values, dtype, _ = _ensure_data(values)

So basically it checks for extension dtypes, in that case gets the extension array, and calls its own factorize method. In the other case, it is converted to a numpy array and uses the corresponding hashtable.

The above code will work perfectly fine for Index of which the dtype is an extension dtype, without that it has the full extension array interface of _values_for_factorize and without that there is a special casing for "index". The special casing is for extension arrays vs numpy arrays (everything else like Series, Index, list, .. boils down to one of those two code paths).

Missing values: we have an ExtensionArray.isna. Yes, we want that Index.isna and ExtensionArray.isna behave consistently. But IMO it does not make sense that Index and ExtensionArray share an implementation from a mixin like we have IndexOpsMixin for Index/Series.
The reason for this is that isna will depend on the physical layout of your extension array (eg in my case of GeometryArray, my data our pointers and use data == 0 for isna), and one of the reasons to have an ExtensionArray interface is because the rest of pandas (Series, Index) does not need to be aware of the exact physical implementation of the array, but just knows that it can call isna to get the desired result. So in that sense, Index[EA].isna will need to call EA.isna.

In both those examples, I think we indeed want that Index and ExtensionArray have a consistent public interface, I fully agree on that. But I don't think we want them both having the ExtensionArray interface (because this interface is more than the public factorize, unique, isna, ... methods, but also the extension array specific things like _values_for_factorize, _from_factorized, ... which are IMO not needed for the Index class)

--> So let's think about how to make the public API consistent and having this consistency guaranteed by the code, without making Index an actual extension array?

Just brainstorming here, but we could have a ArrayLikeMixin where we put abstract methods for all the public methods of ExtensionArray that we want to be consistent on Index. Those methods will all need to be only abstract without any code, because the actual implementation will not be shared between Index and ExtensionArray. But by having this, we can ensure that both Index and ExtensionArray need to implement those methods. It could maybe also be a good home for shared docstrings.
(I don't think it guarantees consistent signatures, as you can always override an abstract method with a different signature, but we could add a test that manually checks for such consistency, or it might be that tools like LGTM check for this).

@jreback does the above explanation makes some sense? (which does not mean you have to agree :)) Do you understand our standpoint?
Thanks to the discussion, I now understand your point of that we have to pay attention to a consistent interface for Index/Array, as that is indeed an important point (eg Index.factorize and EA.factorize already have a slightly different signature). I just disagree that ensuring this consistency means that Index needs to have the full extension array interface.

TomAugspurger · 2018-04-12T16:34:10Z

One more specific case, this time in Series / DataFrame constructors which @jreback mentioned on the call.

Indexes have names, while arrays don't. So Series.__init__ will have to have an isinstance(data, Index) check to extract the name.

jbrockmendel · 2018-04-12T16:42:52Z

Can you explain this further? What is the problem with the Array class using it for the Index having cached it?

dti  = pd.date_range('2016-01-01', periods=3)
other = dti - dti.shift(1)

>>> dti + other
>>> dti - other

Under composition, both the addition and the subtraction ops involve lookups of dti._values._hasnans, which is not cached. This is where the analogy breaks down, since is_non_overlapping_monotonic is never used by the IntervalArray, only the IntervalIndex.

and you can currently already break some of the cached properties of Index in a similar way.

Yah, if a user does that they're on their own. This usage would render the cache invalid in all scenarios being discussed.

Further, if we want, we can actually also think about caching expensive properties on Arrays as well, if we want.

Yah, its doable, just less-obvious. FWIW I'm coming around to liking the composition approach more and more.

BTW sorry I missed the hangout. UTC-7. I'll take a look at the minutes.

jorisvandenbossche · 2018-04-12T16:54:34Z

I'll take a look at the minutes.

We didn't take much notes, I think the main thing that came out of the discussion is my long comment above.

jreback · 2018-04-13T11:25:00Z

@jorisvandenbossche on your comments above: #19696 (comment)

+1 on this. pretty good summarization. Yes I think just like we expose ExtensionArray, ideally would have a ExtensionArrayInterface (or ArrayInterface) that pretty much mixes into EA, Index, Series, DataFrame (maybe) to have a common interface. This is similar to how we use PandasObject currently (in fact need __unicode__ and __sizeof__ in a EAI among other routines).

These could provide really just an interface or a trivial imiplementation (__unicode__), but not facilities for the private implementation (e.g. _values_for_factorized).

Would for sure make use consistent in method names and to some extent signatures across all pandas objects. Could also test for this.

jorisvandenbossche · 2018-04-13T11:48:22Z

deally would have a ExtensionArrayInterface (or ArrayInterface) that pretty much mixes into EA, Index, Series, DataFrame (maybe) to have a common interface.

Just to avoid confusion in terminology: what we currently call the "extension array interface" is what is defined and explained in the ExtensionArray metaclass (so including the the _values_for_factorize et al methods; the interface that external developers need to follow to implement arrays), so this is not an interface we want to have consistent with Index, Series et al.
So would rather name it something like "PandasArrayLike Mixin/ABC"

There are also some questions regarding to this to what extent we actually want them completely consistent. For example, ExtensionArray is 1D, so we did not add an axis argument to its relevant methods (similarly to how this is also not the case for Index). And like this there are other examples as well.

jbrockmendel · 2018-07-14T18:53:21Z

Should ExtensionArray share code with IndexOpsMixin? A lot of it is implemented in terms of _ndarray_values (and most of the rest in terms of _values or values)

jorisvandenbossche · 2018-07-14T20:46:43Z

I don't think it should share code. I think IndexOpsMixin will regularly wrap code that is implemented in ExtensionArray. That only means we should make sure that Index and Series get to that with the same attribute.

jbrockmendel · 2020-04-03T16:51:08Z

Is the IntervalArray checkbox check-able?

jorisvandenbossche · 2020-04-04T12:48:01Z

I think so, yes (at least, IntervalArray is an ExtensionArray now)

jbrockmendel · 2022-10-18T21:29:00Z

closable?

mroeschke · 2023-03-30T18:51:38Z

Looks like most of the check boxes from the original issue are checked so I think we can close. We can open separate issues for the EA interface as they come up

TomAugspurger added API Design Internals Related to non-user accessible pandas implementation Master Tracker High level tracker for similar issues labels Feb 14, 2018

TomAugspurger mentioned this issue Feb 20, 2018

DF.stack() on single column datetime with timezone causes loss of timezone #19420

Closed

jorisvandenbossche mentioned this issue Feb 23, 2018

ENH: Allow storing ExtensionArrays in containers #19520

Merged

This was referenced Feb 26, 2018

ENH: ExtensionArray.setitem #19907

Merged

ENH: ExtensionArray.fillna #19909

Merged

jbrockmendel mentioned this issue Mar 16, 2018

ENH: implement DatetimeLikeArray #19902

Merged

jorisvandenbossche mentioned this issue Mar 23, 2018

ENH: Sorting of ExtensionArrays #19957

Merged

This was referenced Mar 29, 2018

release date question #20530

Closed

RLS: 0.23.0 #20531

Closed

TomAugspurger mentioned this issue Apr 4, 2018

REF: IntervalIndex[IntervalArray] #20611

Merged

TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Apr 11, 2018

jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Oct 18, 2022

asfimport mentioned this issue Jan 14, 2020

[Python] Add API to map Arrow types (including extension types) to pandas ExtensionArray instances for to_pandas conversions apache/arrow#18536

Closed

mroeschke closed this as completed Mar 30, 2023

ExtensionArray meta-issue #19696

ExtensionArray meta-issue #19696

Comments

TomAugspurger commented Feb 14, 2018 • edited Loading

jbrockmendel commented Feb 22, 2018

TomAugspurger commented Feb 22, 2018 via email

TomAugspurger commented Feb 23, 2018

jbrockmendel commented Feb 23, 2018

jbrockmendel commented Mar 18, 2018 • edited Loading

TomAugspurger commented Mar 20, 2018 • edited Loading

jbrockmendel commented Mar 20, 2018

TomAugspurger commented Mar 21, 2018

jorisvandenbossche commented Mar 23, 2018

jorisvandenbossche commented Mar 23, 2018

jbrockmendel commented Mar 23, 2018

jreback commented Mar 25, 2018

jorisvandenbossche commented Mar 26, 2018

jbrockmendel commented Mar 26, 2018

jorisvandenbossche commented Mar 26, 2018

jbrockmendel commented Mar 26, 2018

TomAugspurger commented Mar 30, 2018

jbrockmendel commented Mar 30, 2018

TomAugspurger commented Mar 30, 2018

jorisvandenbossche commented Mar 30, 2018

jbrockmendel commented Mar 30, 2018

jbrockmendel commented Apr 11, 2018

jbrockmendel commented Apr 11, 2018

TomAugspurger commented Apr 11, 2018

jbrockmendel commented Apr 11, 2018

jorisvandenbossche commented Apr 12, 2018

jorisvandenbossche commented Apr 12, 2018 • edited Loading

TomAugspurger commented Apr 12, 2018

jbrockmendel commented Apr 12, 2018

jorisvandenbossche commented Apr 12, 2018

jreback commented Apr 13, 2018

jorisvandenbossche commented Apr 13, 2018

jbrockmendel commented Jul 14, 2018

jorisvandenbossche commented Jul 14, 2018

jbrockmendel commented Apr 3, 2020

jorisvandenbossche commented Apr 4, 2020

jbrockmendel commented Oct 18, 2022

mroeschke commented Mar 30, 2023

TomAugspurger commented Feb 14, 2018 •

edited

Loading

jbrockmendel commented Mar 18, 2018 •

edited

Loading

TomAugspurger commented Mar 20, 2018 •

edited

Loading

jorisvandenbossche commented Apr 12, 2018 •

edited

Loading