Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExtensionArray meta-issue #19696

Closed
14 of 15 tasks
TomAugspurger opened this issue Feb 14, 2018 · 38 comments
Closed
14 of 15 tasks

ExtensionArray meta-issue #19696

TomAugspurger opened this issue Feb 14, 2018 · 38 comments
Labels
API Design Closing Candidate May be closeable, needs more eyeballs ExtensionArray Extending pandas with custom dtypes or arrays. Internals Related to non-user accessible pandas implementation Master Tracker High level tracker for similar issues

Comments

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Feb 14, 2018

Just so things don't get lost

@TomAugspurger TomAugspurger added API Design Internals Related to non-user accessible pandas implementation Master Tracker High level tracker for similar issues labels Feb 14, 2018
@jbrockmendel
Copy link
Member

Should these have a _typ attribute? If so, should SparseArray._typ be changed from "array" to something more specific?

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Feb 22, 2018 via email

@TomAugspurger
Copy link
Contributor Author

Quick status update. The two big ones are in, and now work can mostly proceed in parallel (aside from #19863 which a couple of my followups depend on).

I'm currently rebasing all my branches on master and will make PRs today.

I'll also go through all the TODOs we've added an make issues / checkboxes for those.

If anyone is interested in working on these, PeriodArray will likely be the most sizable PR that I don't really have a start on, aside from a now out of date branch.

@jbrockmendel
Copy link
Member

re PeriodArray, were in spitting distance of being ready to refactor arith/cmp methods out from DatetimeIndexOpsMixin/DatetimeIndexTimedeltaIndex/PeriodIndex. If you can wait for #19847, #19835, and #19800, I can get the ball rolling and save you some work.

@jbrockmendel
Copy link
Member

jbrockmendel commented Mar 18, 2018

  • Refactor arithmetic methods from DatetimeIndexOpsMixin/DatetimeIndex/TimedeltaIndex/PeriodIndex into array classes (ENH: implement DatetimeLikeArray #19902)
  • Refactor comparison methods from DTI/TDI/PI into array classes

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Mar 20, 2018

Porting from #19902 (comment)

When do we want to call it and release 0.23? I can get __setitem__ finished up tonight or tomorrow. Groupby is I think close to being done (we have the API issue for argsort and a similar one for factorize. Actually supporting groupby doesn't seem difficult). Once those are done, I would call the interface sufficiently complete (aside perhaps from support for arithmetic ops, which I haven't really looked into. I'm not sure how difficult that will be, but I imagine it'll be some effort).

The big remaining question then is moving other array types to EAs

  • Period
  • Interval
  • DatetimeTZ
  • Sparse

Do we want to block 0.23 for those? Do we have an estimate for how long that would take?

cc @shoyer @jreback @jorisvandenbossche @chris-b1.

@jbrockmendel
Copy link
Member

Do we have an estimate for how long that would take?

For the datetime/timedelta/period arrays (or the arithmetic/comparison bits of them), if/when #19902 goes through, I can get the next two PRs in that sequence out pretty quickly. But after those are in there will be some non-trivial design decisions to be made. The end of March would be an optimistic goal.

@TomAugspurger
Copy link
Contributor Author

#19696 (comment)

cc @shoyer @jreback @jorisvandenbossche @chris-b1 (don't think you got pinged on this, since I did it in an edit).

@jorisvandenbossche
Copy link
Member

I think ideally, we would already have converted Interval, Period, etc to use ExtensionArrays before doing a release, as this will exercise and stress-test the interface a lot.
But on the other hand, for projects like cyberpandas and geopandas, it would also be easier to already have a release that supports a basic extension array interface, making it easier for users to test it.
And we can be clear that the extension array interface for the extension authors will still be subject to change (which does not mean things will change for their users). So that would be in favor of releasing more quickly.

@jorisvandenbossche
Copy link
Member

Another topic we need to discuss (although actually trying to implement it might make a discussion more tangible), is whether we want our current Index classes to subclass the ExtensionArrays (eg PeriodIndex subclassing PeriodArray) or to be composed of an ExtensionArray (PeriodIndex._data being a PeriodArray).

I think we previously thought to do composition, but @TomAugspurger said here #19902 (comment): (also @jreback mentioned this in #19957 (comment))

I've been going back and forth on which approach is best here. I'm slightly coming around to the idea of subclassing, but haven't 100% settled yet.

@TomAugspurger what do you see as advantages of using inheritance instead of composition?

I didn't really think about in detail (code-wise), but: Series uses composition, so not doing this for Index might reduce ability to share code. CategoricalIndex is already an existing example of Index with ExtensionArray that does composition. Other index classes with numeric or object dtypes (the ones having a numpy array under the hood) will keep using composition.

@jbrockmendel
Copy link
Member

I didn't really think about in detail (code-wise), but: Series uses composition, so not doing this for Index might reduce ability to share code

I think Series might be the wrong comparison: FooBlock should be the analogous object to FooIndex, shouldn't it?

@jreback
Copy link
Contributor

jreback commented Mar 25, 2018

So I am a big +1 on making Index a subclass of EA. This will unify the external interface with the internal one (e.g. the user facing API). And will allow us to make sure the impl shares code (e.g. mainly this will involve changing some bespoke impl details of Index to conform to the new patterns of EA).

We already have a de-facto implementation of this for Datetime w/tz, meaning we are using it like an EA. subclassing DTI will just formalize this. We also get the Period & II impl for free if we do this. We could also change the categorical impl to actually use CI and again get some more code sharing / harmony.

The point of all this is to really dogfood EA internally in pandas. This will simplify code and promote a unified interface.

Then once the above is done, we can actually remove all of the Block types that are not simple ones. These would then merely become EA implementations. Again this would simplify the implementation details and friction that we currently have.

@jorisvandenbossche
Copy link
Member

So I am a big +1 on making Index a subclass of EA. This will unify the external interface with the internal one (e.g. the user facing API).

What do you mean with unify interfaces? Which interfaces? You mean the ExtensionArray API and the Index API? I don't think those should be unified. The ExtensionArray interface is deliberately much smaller.

We already have a de-facto implementation of this for Datetime w/tz, meaning we are using it like an EA.

As far as I know, DatetimeIndex is an Index composed of a array holding the values, not subclassing anything. It is true that a Series of tz-aware data holds a DatetimeIndex, but they can both perfectly hold a DatetimeArray, without the Index subclassing from it.
Again, this might be a good approach, but what you raise are not actual reasons to do it IMO.

The point of all this is to really dogfood EA internally in pandas.

We can perfectly dogfood EA internally in pandas using composition instead of subclassing. So this is not relevant for the discussion (it is not about whether we want to use EAs internally in pandas, but how to use them).

Then once the above is done, we can actually remove all of the Block types that are not simple ones.

That's already the goal of the ExtensionBlock. Again that is AFAIK no argument in favor or against subclassing.

@jbrockmendel
Copy link
Member

The main advantage I see to subclassing is avoiding series._values._foo_values.values._bar_values...

@jorisvandenbossche
Copy link
Member

@jbrockmendel you give an example with Series. But I don't think we are actually contemplating making Series a subclass of extension array? Or are we, which would be a huge change in design? (in my head at least it was only about Index)
But to the point (considering Index): it will be no more the case than it is today, but only more consistently (Index is now composed of an array, but how this is done varies among different subtypes), and we also actually have different array representions that won't go away (the extension array, underlying codes, coerced to numpy array in some way, ..), so we will have such attributes anyhow.

@jbrockmendel
Copy link
Member

@jorisvandenbossche You're right, a more apt example would have been series._data.blocks[0].values.data._values.... The point is that there is a cost to layers of unwrapping as well as variants of values/_values/_ndarray_values/data/_data/... and if a layer of unwrapping can be avoided without adding complications elsewhere it would be nice.

This was referenced Mar 29, 2018
@TomAugspurger
Copy link
Contributor Author

we also actually have different array representions that won't go away (the extension array, underlying codes, coerced to numpy array in some way, ..),

@jreback, @jbrockmendel do you have thoughts on this? NumericIndex is still going to compose an ndarray as the ._data attribute. Shouldn't our extension index classes do the same, but with an EA? I think declaring that ._data is a Union[ndarray, ExtensionArray] makes sense.

@jbrockmendel
Copy link
Member

is still going to compose an ndarray as the ._data attribute. Shouldn't our extension index classes do the same

Assuming we are going with the Index-subclasses-EA approach, then yes. That said two preferences on naming conventions:

  • Block.values should also be brought into alignment with whatever name is chosen for this.
  • It would be really nice if all references to BlockManager could be found by grepping for \._data.

@TomAugspurger
Copy link
Contributor Author

Assuming we are going with the Index-subclasses-EA approach, then yes

I don't follow. Doesn't the fact that Int64Index composing an ndarray as ._values argue for CategoricalIndex composing a Categorical as ._values?

@jorisvandenbossche
Copy link
Member

Assuming we are going with the Index-subclasses-EA approach, then yes

Tom's question was rather about not assuming this.

@jbrockmendel
Copy link
Member

I don't follow. Doesn't the fact that Int64Index composing an ndarray as ._values argue for CategoricalIndex composing a Categorical as ._values?

Tom's question was rather about not assuming this.

My answer then has to be a convex combination of "I don't have an opinion" and "I don't have an informed opinion."

@jbrockmendel
Copy link
Member

Should this issue have a ExtensionArray label?

@jbrockmendel
Copy link
Member

New thought on inheritance vs composition: there are cache_readonly attributes on Index subclasses that would need to be propertys on mutable Array classes. With inheritance we can just override the relevant properties on the subclass; getting the same affect with composition is less obvious.

@TomAugspurger TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Apr 11, 2018
@TomAugspurger
Copy link
Contributor Author

With inheritance we can just override the relevant properties on the subclass; getting the same affect with composition is less obvious.

FWIW, https://github.com/pandas-dev/pandas/pull/20611/files didn't require any changes to the read only attributes (e.g. _hasnans). It's using the same code as before.

@jbrockmendel
Copy link
Member

I don't know the IntervalIndex code that well, but it looks like is_non_overlapping_monotonic is a cache_readonly on the Index and a property on the Array. AFAICT the Array class never actually uses that attribute. To make the analogy hold, I think we need to imagine a hypothetical in which a bunch of IntervalArray methods access is_non_overlapping_monotonic.

@jorisvandenbossche
Copy link
Member

To make the analogy hold, I think we need to imagine a hypothetical in which a bunch of IntervalArray methods access is_non_overlapping_monotonic.

Can you explain this further? What is the problem with the Array class using it for the Index having cached it?

The Index caching such properties of the underlying values is in principle somewhat brittle in case somebody changes the underlying values of the Index in place; but, users are not 'supposed' to do that, and you can currently already break some of the cached properties of Index in a similar way.

Further, if we want, we can actually also think about caching expensive properties on Arrays as well, if we want. We just need to make sure to clear the cache once its underlying values (left, right, closed in case of Interval) are changed.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 12, 2018

@jreback @TomAugspurger to come back to our discussion on hangout

As I understand it now (Jeff, correct me if I am wrong), one of the main arguments you make is:

Index needs to be an ExtensionArray (= adhere to the "ExtensionArray interface", not necessarily by subclassing it), to ensure:
a) we have a guaranteed consistency in the public methods (like the IndexOpsMixin for Series/Index guarantees that a subset of their methods are consistent)
b) to avoid internally special casing between Index and Arrays

I agree that those two points are important to have, but I think the main discussion is how to achieve this. IMO it is not needed for Index to be an ExtensionArray to have those points.

But let's use some very concrete examples to discuss this, so using some methods that are currently part of the ExtensionArray interface

  • Factorization: we have a ExtensionArray._values_for_factorize as a way to implement array-specific factorization (eg for Categorical this are the codes), so that pd.factorize(array) works correctly (to be correct, pd.factorize only needs to EA.factorize() method, but the interface also provides the other methods to implement them).
    And of course, it would be nice that pd.factorize(index[EA]) works consistently and without special casing. I think this is already the case, without the need that Index itself has this _values_for_factorize method.

    The factorize implementation does this:

    if is_extension_array_dtype(values):
    values = getattr(values, '_values', values)
    labels, uniques = values.factorize(na_sentinel=na_sentinel)
    dtype = original.dtype
    else:
    values, dtype, _ = _ensure_data(values)

    So basically it checks for extension dtypes, in that case gets the extension array, and calls its own factorize method. In the other case, it is converted to a numpy array and uses the corresponding hashtable.

    The above code will work perfectly fine for Index of which the dtype is an extension dtype, without that it has the full extension array interface of _values_for_factorize and without that there is a special casing for "index". The special casing is for extension arrays vs numpy arrays (everything else like Series, Index, list, .. boils down to one of those two code paths).

  • Missing values: we have an ExtensionArray.isna. Yes, we want that Index.isna and ExtensionArray.isna behave consistently. But IMO it does not make sense that Index and ExtensionArray share an implementation from a mixin like we have IndexOpsMixin for Index/Series.
    The reason for this is that isna will depend on the physical layout of your extension array (eg in my case of GeometryArray, my data our pointers and use data == 0 for isna), and one of the reasons to have an ExtensionArray interface is because the rest of pandas (Series, Index) does not need to be aware of the exact physical implementation of the array, but just knows that it can call isna to get the desired result. So in that sense, Index[EA].isna will need to call EA.isna.

In both those examples, I think we indeed want that Index and ExtensionArray have a consistent public interface, I fully agree on that. But I don't think we want them both having the ExtensionArray interface (because this interface is more than the public factorize, unique, isna, ... methods, but also the extension array specific things like _values_for_factorize, _from_factorized, ... which are IMO not needed for the Index class)

--> So let's think about how to make the public API consistent and having this consistency guaranteed by the code, without making Index an actual extension array?

Just brainstorming here, but we could have a ArrayLikeMixin where we put abstract methods for all the public methods of ExtensionArray that we want to be consistent on Index. Those methods will all need to be only abstract without any code, because the actual implementation will not be shared between Index and ExtensionArray. But by having this, we can ensure that both Index and ExtensionArray need to implement those methods. It could maybe also be a good home for shared docstrings.
(I don't think it guarantees consistent signatures, as you can always override an abstract method with a different signature, but we could add a test that manually checks for such consistency, or it might be that tools like LGTM check for this).

@jreback does the above explanation makes some sense? (which does not mean you have to agree :)) Do you understand our standpoint?
Thanks to the discussion, I now understand your point of that we have to pay attention to a consistent interface for Index/Array, as that is indeed an important point (eg Index.factorize and EA.factorize already have a slightly different signature). I just disagree that ensuring this consistency means that Index needs to have the full extension array interface.

@TomAugspurger
Copy link
Contributor Author

One more specific case, this time in Series / DataFrame constructors which @jreback mentioned on the call.

Indexes have names, while arrays don't. So Series.__init__ will have to have an isinstance(data, Index) check to extract the name.

@jbrockmendel
Copy link
Member

Can you explain this further? What is the problem with the Array class using it for the Index having cached it?

dti  = pd.date_range('2016-01-01', periods=3)
other = dti - dti.shift(1)

>>> dti + other
>>> dti - other

Under composition, both the addition and the subtraction ops involve lookups of dti._values._hasnans, which is not cached. This is where the analogy breaks down, since is_non_overlapping_monotonic is never used by the IntervalArray, only the IntervalIndex.

and you can currently already break some of the cached properties of Index in a similar way.

Yah, if a user does that they're on their own. This usage would render the cache invalid in all scenarios being discussed.

Further, if we want, we can actually also think about caching expensive properties on Arrays as well, if we want.

Yah, its doable, just less-obvious. FWIW I'm coming around to liking the composition approach more and more.

BTW sorry I missed the hangout. UTC-7. I'll take a look at the minutes.

@jorisvandenbossche
Copy link
Member

I'll take a look at the minutes.

We didn't take much notes, I think the main thing that came out of the discussion is my long comment above.

@jreback
Copy link
Contributor

jreback commented Apr 13, 2018

@jorisvandenbossche on your comments above: #19696 (comment)

+1 on this. pretty good summarization. Yes I think just like we expose ExtensionArray, ideally would have a ExtensionArrayInterface (or ArrayInterface) that pretty much mixes into EA, Index, Series, DataFrame (maybe) to have a common interface. This is similar to how we use PandasObject currently (in fact need __unicode__ and __sizeof__ in a EAI among other routines).

These could provide really just an interface or a trivial imiplementation (__unicode__), but not facilities for the private implementation (e.g. _values_for_factorized).

Would for sure make use consistent in method names and to some extent signatures across all pandas objects. Could also test for this.

@jorisvandenbossche
Copy link
Member

deally would have a ExtensionArrayInterface (or ArrayInterface) that pretty much mixes into EA, Index, Series, DataFrame (maybe) to have a common interface.

Just to avoid confusion in terminology: what we currently call the "extension array interface" is what is defined and explained in the ExtensionArray metaclass (so including the the _values_for_factorize et al methods; the interface that external developers need to follow to implement arrays), so this is not an interface we want to have consistent with Index, Series et al.
So would rather name it something like "PandasArrayLike Mixin/ABC"

There are also some questions regarding to this to what extent we actually want them completely consistent. For example, ExtensionArray is 1D, so we did not add an axis argument to its relevant methods (similarly to how this is also not the case for Index). And like this there are other examples as well.

@jbrockmendel
Copy link
Member

Should ExtensionArray share code with IndexOpsMixin? A lot of it is implemented in terms of _ndarray_values (and most of the rest in terms of _values or values)

@jorisvandenbossche
Copy link
Member

I don't think it should share code. I think IndexOpsMixin will regularly wrap code that is implemented in ExtensionArray. That only means we should make sure that Index and Series get to that with the same attribute.

@jbrockmendel
Copy link
Member

Is the IntervalArray checkbox check-able?

@jorisvandenbossche
Copy link
Member

I think so, yes (at least, IntervalArray is an ExtensionArray now)

@jbrockmendel
Copy link
Member

closable?

@mroeschke
Copy link
Member

Looks like most of the check boxes from the original issue are checked so I think we can close. We can open separate issues for the EA interface as they come up

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Closing Candidate May be closeable, needs more eyeballs ExtensionArray Extending pandas with custom dtypes or arrays. Internals Related to non-user accessible pandas implementation Master Tracker High level tracker for similar issues
Projects
None yet
Development

No branches or pull requests

5 participants