Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: Implement NDArrayBackedExtensionArray #33660

Merged
merged 6 commits into from
Apr 25, 2020

Conversation

jbrockmendel
Copy link
Member

Many EAs are thin wrappers around np.ndarray. We can both de-duplicate a bunch of code and make things easier on downstream authors by implementing NDArrayBackedExtensionArray as a base class for such EAs.

This PR only implements NDArrayBackedExtensionArray.take, but there is quite a bit more that can be shared in follow-ups:

  • copy, delete, repeat
  • mix in to PandasArray
  • with small changes, __getitem__, __setitem__, reductions, ...

The only change in logic this PR makes is to make Categorical.take raise a ValueError instead of a TypeError, matching DTA/TDA/PA.

@jreback jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Apr 20, 2020
# NB: A bunch of Interval tests fail if we use ._data
return self.asi8

def _from_backing_data(self, arr: np.ndarray):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a new method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

------
ValueError
"""
raise AbstractMethodError(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BaseMaskedArray is in core/arrays/masked.py. If the purpose of this is to create a common base class for another subset of extensionarrays, can we make the module names for the base classes more consistent.

We could also restructure the extension array tests with a similar hierarchy, so that could influence the naming.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to other naming ideas; what do you have in mind?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have core/arrays/numpy_ for PandasArray, so the simple numpy name is taken. The difference though is a PandasArray can be instantiated (It is also possible to instantiate a BaseMaskedArray directly) whereas NDArrayBackedExtensionArray is abstract. maybe we need a subdirectory of core/arrays for the base classes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could reasonably put the mixin in arrays.numpy_

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Categorical.take docsting explicitly stated that it raises a TypeError.

While I agree that a ValueError is more appropriate, do we consider than an API change? We'll want a release note regardless.

@@ -1780,85 +1781,17 @@ def fillna(self, value=None, method=None, limit=None):

return self._constructor(codes, dtype=self.dtype, fastpath=True)

def take(self, indexer, allow_fill: bool = False, fill_value=None):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably want to restore this docstring.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restored with TypeError->ValueError and whatsnew


def _from_backing_data(self, arr: np.ndarray):
# Note: we do not retain `freq`
return type(self)(arr, dtype=self.dtype) # type: ignore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unlike Frame and Series, EAs do not have the concept of _constructor. could this be useful?

the exception is Categorical which uses _constructor. so maybe either extend to all EAs or can we remove from Categorical for consistency.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC Categorical has _constructor because it subclasses PandasObject.

_from_backing_data is specifically about round-tripping or commutativity; not really a good fit for _constructor.

return self._codes

def _from_backing_data(self, arr: np.ndarray):
return self._constructor(arr, dtype=self.dtype, fastpath=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see other comment re _constructor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC Categorical has _constructor because it subclasses PandasObject.

_from_backing_data is specifically about round-tripping or commutativity; not really a good fit for _constructor.

in PandasObject _constructor is type(self), in Categorical, _constructor is Categorical (i.e type(self)). so I think it may worth investigating whether we can remove _constructor from Categorical and use type(self) here to be consistent with datetimelike._from_backing_data.

can u add return types for _from_backing_data, both here and in datetimelike


_ndarray: np.ndarray

def _from_backing_data(self: _T, arr: np.ndarray) -> _T:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is typing self needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my understanding is that is how we indicate that the return type is "same type as self", but id defer to @simonjayhawkins on this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. T is a typevar, i.e. can take on different types (could be subtypes of a type or union of types) so adding to self binds the typevar and the return type is the same as self.

This can sometimes cause problems in Mixins, but here the 'Mixin' is IMO an abstract base class.

@jreback jreback added this to the 1.1 milestone Apr 25, 2020
@jreback jreback merged commit c4ebf21 into pandas-dev:master Apr 25, 2020
@jbrockmendel jbrockmendel deleted the ndarray-backed branch April 25, 2020 21:30
rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants