REF: Implement NDArrayBackedExtensionArray #33660

jbrockmendel · 2020-04-19T21:04:25Z

Many EAs are thin wrappers around np.ndarray. We can both de-duplicate a bunch of code and make things easier on downstream authors by implementing NDArrayBackedExtensionArray as a base class for such EAs.

This PR only implements NDArrayBackedExtensionArray.take, but there is quite a bit more that can be shared in follow-ups:

copy, delete, repeat
mix in to PandasArray
with small changes, __getitem__, __setitem__, reductions, ...

The only change in logic this PR makes is to make Categorical.take raise a ValueError instead of a TypeError, matching DTA/TDA/PA.
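For readers skimming the thread, the shape of the refactor can be sketched roughly as follows. This is a minimal illustration, not the pandas implementation: a plain Python list stands in for np.ndarray so the snippet runs without dependencies, and the names ListBackedArray, Codes, and _validate_fill_value are assumptions made for the example. The point is that take lives once on the base class, while each subclass only says how to rebuild itself from backing data and how to validate a fill value (raising ValueError, as described above).

```python
from typing import Any, List, Sequence, TypeVar

_T = TypeVar("_T", bound="ListBackedArray")


class ListBackedArray:
    """Hypothetical base class; a list plays the role of the np.ndarray."""

    _data: List[Any]

    def _from_backing_data(self: _T, data: List[Any]) -> _T:
        # Subclasses rebuild an instance of their own type around new data.
        raise NotImplementedError

    def _validate_fill_value(self, fill_value: Any) -> Any:
        # Subclasses convert a user-facing fill value to its stored form,
        # raising ValueError if it cannot be stored.
        raise NotImplementedError

    def take(
        self: _T,
        indices: Sequence[int],
        allow_fill: bool = False,
        fill_value: Any = None,
    ) -> _T:
        if allow_fill:
            fill = self._validate_fill_value(fill_value)
            if any(i < -1 for i in indices):
                raise ValueError("with allow_fill=True, -1 is the only negative index")
            # -1 marks a missing position that receives the fill value.
            taken = [fill if i == -1 else self._data[i] for i in indices]
        else:
            # Ordinary indexing: negative indices count from the end.
            taken = [self._data[i] for i in indices]
        return self._from_backing_data(taken)


class Codes(ListBackedArray):
    """Toy categorical: integer codes pointing into a list of categories."""

    def __init__(self, codes, categories):
        self._data = list(codes)
        self.categories = list(categories)

    def _from_backing_data(self, data):
        return type(self)(data, self.categories)

    def _validate_fill_value(self, fill_value):
        if fill_value is None:
            return -1  # code used here for "missing"
        if fill_value not in self.categories:
            raise ValueError(f"fill_value {fill_value!r} is not a category")
        return self.categories.index(fill_value)
```

A second subclass (for example a datetime-like toy storing integers) would only need its own _from_backing_data and _validate_fill_value; take would be inherited unchanged.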

jreback · 2020-04-20T00:42:08Z

pandas/core/arrays/datetimelike.py

+        # NB: A bunch of Interval tests fail if we use ._data
+        return self.asi8
+
+    def _from_backing_data(self, arr: np.ndarray):


is this a new method?

simonjayhawkins · 2020-04-20T14:16:46Z

pandas/core/arrays/_mixins.py

+        ------
+        ValueError
+        """
+        raise AbstractMethodError(self)


BaseMaskedArray is in core/arrays/masked.py. If the purpose of this is to create a common base class for another subset of extension arrays, can we make the module names for the base classes more consistent?

We could also restructure the extension array tests with a similar hierarchy, so that could influence the naming.

I'm open to other naming ideas; what do you have in mind?

We already have core/arrays/numpy_ for PandasArray, so the simple numpy name is taken. The difference, though, is that a PandasArray can be instantiated (it is also possible to instantiate a BaseMaskedArray directly), whereas NDArrayBackedExtensionArray is abstract. Maybe we need a subdirectory of core/arrays for the base classes.

We could reasonably put the mixin in arrays.numpy_.

sounds good.

TomAugspurger

The Categorical.take docstring explicitly stated that it raises a TypeError.

While I agree that a ValueError is more appropriate, do we consider that an API change? We'll want a release note regardless.

TomAugspurger · 2020-04-20T15:00:00Z

pandas/core/arrays/categorical.py

@@ -1780,85 +1781,17 @@ def fillna(self, value=None, method=None, limit=None):

        return self._constructor(codes, dtype=self.dtype, fastpath=True)

-    def take(self, indexer, allow_fill: bool = False, fill_value=None):
-        """


Probably want to restore this docstring.

restored with TypeError->ValueError and whatsnew

simonjayhawkins · 2020-04-21T10:18:52Z

pandas/core/arrays/datetimelike.py

+
+    def _from_backing_data(self, arr: np.ndarray):
+        # Note: we do not retain `freq`
+        return type(self)(arr, dtype=self.dtype)  # type: ignore


Unlike Frame and Series, EAs do not have the concept of _constructor. Could this be useful?

The exception is Categorical, which uses _constructor. So maybe we either extend it to all EAs or remove it from Categorical for consistency.

IIUC Categorical has _constructor because it subclasses PandasObject.

_from_backing_data is specifically about round-tripping or commutativity; not really a good fit for _constructor.
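The round-tripping point can be made concrete with a small sketch. It is a toy under stated assumptions, not pandas code: ToyDatetimeArray is a hypothetical class, a list of integers stands in for the i8 ndarray, and dtype and freq are plain attributes. It mirrors the diff above in that the dtype is carried over while freq is not.

```python
from typing import List, Optional


class ToyDatetimeArray:
    """Hypothetical stand-in: integer timestamps, a dtype label, optional freq."""

    def __init__(
        self,
        i8: List[int],
        dtype: str = "toy-datetime",
        freq: Optional[str] = None,
    ) -> None:
        self._i8 = list(i8)
        self.dtype = dtype
        self.freq = freq

    @property
    def _ndarray(self) -> List[int]:
        # The backing data that shared methods operate on.
        return self._i8

    def _from_backing_data(self, arr: List[int]) -> "ToyDatetimeArray":
        # As in the diff: dtype is retained, freq is not.
        return type(self)(arr, dtype=self.dtype)


orig = ToyDatetimeArray([0, 10, 20], freq="10ns")
round_tripped = orig._from_backing_data(orig._ndarray)
```

Here round_tripped has the same type, backing data, and dtype as orig, but its freq is reset. That is a narrower contract than a general-purpose _constructor, which is the distinction drawn in the reply above.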

simonjayhawkins · 2020-04-21T10:21:08Z

pandas/core/arrays/categorical.py

+        return self._codes
+
+    def _from_backing_data(self, arr: np.ndarray):
+        return self._constructor(arr, dtype=self.dtype, fastpath=True)


see other comment re _constructor.

IIUC Categorical has _constructor because it subclasses PandasObject.

_from_backing_data is specifically about round-tripping or commutativity; not really a good fit for _constructor.

In PandasObject, _constructor is type(self); in Categorical, _constructor is Categorical (i.e. type(self)). So I think it may be worth investigating whether we can remove _constructor from Categorical and use type(self) here, to be consistent with datetimelike._from_backing_data.

Can you add return types for _from_backing_data, both here and in datetimelike?

jorisvandenbossche · 2020-04-21T20:29:24Z

pandas/core/arrays/_mixins.py

+
+    _ndarray: np.ndarray
+
+    def _from_backing_data(self: _T, arr: np.ndarray) -> _T:


is typing self needed?

My understanding is that this is how we indicate that the return type is "same type as self", but I'd defer to @simonjayhawkins on this.

Yes. _T is a TypeVar, i.e. it can take on different types (subtypes of a type, or a union of types), so annotating self with it binds the TypeVar and the return type is the same as the type of self.

This can sometimes cause problems in Mixins, but here the 'Mixin' is IMO an abstract base class.
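The self-typing pattern discussed in this thread can be shown in isolation. The sketch below uses only the standard typing module; Base, Child, reversed, and only_on_child are hypothetical names for illustration, and the TypeVar is given a bound, which is an assumption about style rather than something stated in the diff.

```python
from typing import List, TypeVar

_T = TypeVar("_T", bound="Base")


class Base:
    def __init__(self, values: List[int]) -> None:
        self.values = values

    def _from_backing_data(self: _T, values: List[int]) -> _T:
        # Annotating `self` with the TypeVar ties the return type to the
        # class of the object the method is called on.
        return type(self)(values)

    def reversed(self: _T) -> _T:
        return self._from_backing_data(self.values[::-1])


class Child(Base):
    def only_on_child(self) -> int:
        return len(self.values)


child = Child([1, 2, 3]).reversed()
# A type checker infers `Child` for `child`, so this call type-checks.
# With a plain `-> "Base"` return annotation it would be flagged.
n = child.only_on_child()
```

At runtime the annotations change nothing; the benefit is only that static checkers see subclass methods on the result of inherited methods.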

REF: Implement NDArrayBackedExtensionArray

85ea4fa

jbrockmendel mentioned this pull request Apr 19, 2020

REF: use array_algos shift for Categorical.shift #33663

Merged

jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Apr 20, 2020

jreback reviewed Apr 20, 2020

Merge branch 'master' of https://github.com/pandas-dev/pandas into nd…

d43950c

…array-backed

simonjayhawkins reviewed Apr 20, 2020

TomAugspurger reviewed Apr 20, 2020

jbrockmendel added 2 commits April 20, 2020 08:25

restore docstring, whatsnew

aad9970

Merge branch 'master' of https://github.com/pandas-dev/pandas into nd…

95db99f

…array-backed

simonjayhawkins reviewed Apr 21, 2020

jbrockmendel added 2 commits April 21, 2020 07:51

Merge branch 'master' of https://github.com/pandas-dev/pandas into nd…

42ca6a7

…array-backed

annotate

fc304a0

jorisvandenbossche reviewed Apr 21, 2020

jorisvandenbossche approved these changes Apr 24, 2020

jreback added this to the 1.1 milestone Apr 25, 2020

jreback merged commit c4ebf21 into pandas-dev:master Apr 25, 2020

jbrockmendel deleted the ndarray-backed branch April 25, 2020 21:30

rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020

REF: Implement NDArrayBackedExtensionArray (pandas-dev#33660)

73204a8

simonjayhawkins mentioned this pull request Sep 2, 2020

TYP: misc typing in core\indexes\base.py #35991

Merged

jbrockmendel mentioned this pull request Nov 10, 2020

API: consistently raise TypeError for invalid-typed fill_value #37733

Merged


Comment headers, in the order listed on the page: jbrockmendel (commented Apr 19, 2020); jreback (Apr 20, 2020); jbrockmendel (Apr 20, 2020); simonjayhawkins (Apr 20, 2020); jbrockmendel (Apr 20, 2020); simonjayhawkins (Apr 20, 2020); jbrockmendel (Apr 20, 2020); simonjayhawkins (Apr 21, 2020); TomAugspurger (left a comment); TomAugspurger (Apr 20, 2020); jbrockmendel (Apr 20, 2020); simonjayhawkins (Apr 21, 2020); jbrockmendel (Apr 21, 2020); simonjayhawkins (Apr 21, 2020); simonjayhawkins (Apr 21, 2020); jorisvandenbossche (Apr 21, 2020); jbrockmendel (Apr 21, 2020); simonjayhawkins (Apr 22, 2020).

