Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: Implement NDArrayBackedExtensionArray #33660

Merged
merged 6 commits into from
Apr 25, 2020
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
62 changes: 62 additions & 0 deletions pandas/core/arrays/_mixins.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
from typing import Any, Sequence, TypeVar

import numpy as np

from pandas.errors import AbstractMethodError

from pandas.core.algorithms import take
from pandas.core.arrays.base import ExtensionArray

_T = TypeVar("_T", bound="NDArrayBackedExtensionArray")


class NDArrayBackedExtensionArray(ExtensionArray):
"""
ExtensionArray that is backed by a single NumPy ndarray.
"""

_ndarray: np.ndarray

def _from_backing_data(self: _T, arr: np.ndarray) -> _T:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is typing self needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my understanding is that is how we indicate that the return type is "same type as self", but id defer to @simonjayhawkins on this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. T is a typevar, i.e. can take on different types (could be subtypes of a type or union of types) so adding to self binds the typevar and the return type is the same as self.

This can sometimes cause problems in Mixins, but here the 'Mixin' is IMO an abstract base class.

"""
Construct a new ExtensionArray `new_array` with `arr` as its _ndarray.

This should round-trip:
self == self._from_backing_data(self._ndarray)
"""
raise AbstractMethodError(self)

# ------------------------------------------------------------------------

def take(
self: _T,
indices: Sequence[int],
allow_fill: bool = False,
fill_value: Any = None,
) -> _T:
if allow_fill:
fill_value = self._validate_fill_value(fill_value)

new_data = take(
self._ndarray, indices, allow_fill=allow_fill, fill_value=fill_value,
)
return self._from_backing_data(new_data)

def _validate_fill_value(self, fill_value):
"""
If a fill_value is passed to `take` convert it to a representation
suitable for self._ndarray, raising ValueError if this is not possible.

Parameters
----------
fill_value : object

Returns
-------
fill_value : native representation

Raises
------
ValueError
"""
raise AbstractMethodError(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BaseMaskedArray is in core/arrays/masked.py. If the purpose of this is to create a common base class for another subset of extensionarrays, can we make the module names for the base classes more consistent.

We could also restructure the extension array tests with a similar hierarchy, so that could influence the naming.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to other naming ideas; what do you have in mind?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have core/arrays/numpy_ for PandasArray, so the simple numpy name is taken. The difference though is a PandasArray can be instantiated (It is also possible to instantiate a BaseMaskedArray directly) whereas NDArrayBackedExtensionArray is abstract. maybe we need a subdirectory of core/arrays for the base classes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could reasonably put the mixin in arrays.numpy_

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good.

93 changes: 13 additions & 80 deletions pandas/core/arrays/categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -49,9 +49,10 @@
from pandas.core import ops
from pandas.core.accessor import PandasDelegate, delegate_names
import pandas.core.algorithms as algorithms
from pandas.core.algorithms import _get_data_algo, factorize, take, take_1d, unique1d
from pandas.core.algorithms import _get_data_algo, factorize, take_1d, unique1d
from pandas.core.array_algos.transforms import shift
from pandas.core.arrays.base import ExtensionArray, _extension_array_shared_docs
from pandas.core.arrays._mixins import NDArrayBackedExtensionArray
from pandas.core.arrays.base import _extension_array_shared_docs
from pandas.core.base import NoNewAttributesMixin, PandasObject, _shared_docs
import pandas.core.common as com
from pandas.core.construction import array, extract_array, sanitize_array
Expand Down Expand Up @@ -210,7 +211,7 @@ def contains(cat, key, container):
"""


class Categorical(ExtensionArray, PandasObject):
class Categorical(NDArrayBackedExtensionArray, PandasObject):
"""
Represent a categorical variable in classic R / S-plus fashion.

Expand Down Expand Up @@ -1250,7 +1251,7 @@ def shift(self, periods, fill_value=None):

def _validate_fill_value(self, fill_value):
"""
Convert a user-facing fill_value to a representation to use with our
Convert a user-facing fill_value to a representation to use with our
underlying ndarray, raising ValueError if this is not possible.

Parameters
Expand Down Expand Up @@ -1780,85 +1781,17 @@ def fillna(self, value=None, method=None, limit=None):

return self._constructor(codes, dtype=self.dtype, fastpath=True)

def take(self, indexer, allow_fill: bool = False, fill_value=None):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably want to restore this docstring.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restored with TypeError->ValueError and whatsnew

Take elements from the Categorical.

Parameters
----------
indexer : sequence of int
The indices in `self` to take. The meaning of negative values in
`indexer` depends on the value of `allow_fill`.
allow_fill : bool, default False
How to handle negative values in `indexer`.

* False: negative values in `indices` indicate positional indices
from the right. This is similar to
:func:`numpy.take`.

* True: negative values in `indices` indicate missing values
(the default). These values are set to `fill_value`. Any other
other negative values raise a ``ValueError``.

.. versionchanged:: 1.0.0

Default value changed from ``True`` to ``False``.

fill_value : object
The value to use for `indices` that are missing (-1), when
``allow_fill=True``. This should be the category, i.e. a value
in ``self.categories``, not a code.

Returns
-------
Categorical
This Categorical will have the same categories and ordered as
`self`.
# ------------------------------------------------------------------
# NDArrayBackedExtensionArray compat

See Also
--------
Series.take : Similar method for Series.
numpy.ndarray.take : Similar method for NumPy arrays.

Examples
--------
>>> cat = pd.Categorical(['a', 'a', 'b'])
>>> cat
[a, a, b]
Categories (2, object): [a, b]

Specify ``allow_fill==False`` to have negative indices mean indexing
from the right.

>>> cat.take([0, -1, -2], allow_fill=False)
[a, b, a]
Categories (2, object): [a, b]

With ``allow_fill=True``, indices equal to ``-1`` mean "missing"
values that should be filled with the `fill_value`, which is
``np.nan`` by default.

>>> cat.take([0, -1, -1], allow_fill=True)
[a, NaN, NaN]
Categories (2, object): [a, b]

The fill value can be specified.

>>> cat.take([0, -1, -1], allow_fill=True, fill_value='a')
[a, a, a]
Categories (2, object): [a, b]

Specifying a fill value that's not in ``self.categories``
will raise a ``TypeError``.
"""
indexer = np.asarray(indexer, dtype=np.intp)
@property
def _ndarray(self) -> np.ndarray:
return self._codes

if allow_fill:
# convert user-provided `fill_value` to codes
fill_value = self._validate_fill_value(fill_value)
def _from_backing_data(self, arr: np.ndarray):
return self._constructor(arr, dtype=self.dtype, fastpath=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see other comment re _constructor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC Categorical has _constructor because it subclasses PandasObject.

_from_backing_data is specifically about round-tripping or commutativity; not really a good fit for _constructor.

in PandasObject _constructor is type(self), in Categorical, _constructor is Categorical (i.e type(self)). so I think it may worth investigating whether we can remove _constructor from Categorical and use type(self) here to be consistent with datetimelike._from_backing_data.

can u add return types for _from_backing_data, both here and in datetimelike


codes = take(self._codes, indexer, allow_fill=allow_fill, fill_value=fill_value)
return self._constructor(codes, dtype=self.dtype, fastpath=True)
# ------------------------------------------------------------------

def take_nd(self, indexer, allow_fill: bool = False, fill_value=None):
# GH#27745 deprecate alias that other EAs dont have
Expand Down
31 changes: 19 additions & 12 deletions pandas/core/arrays/datetimelike.py
Original file line number Diff line number Diff line change
Expand Up @@ -39,8 +39,9 @@
from pandas.core.dtypes.missing import is_valid_nat_for_dtype, isna

from pandas.core import missing, nanops, ops
from pandas.core.algorithms import checked_add_with_arr, take, unique1d, value_counts
from pandas.core.algorithms import checked_add_with_arr, unique1d, value_counts
from pandas.core.array_algos.transforms import shift
from pandas.core.arrays._mixins import NDArrayBackedExtensionArray
from pandas.core.arrays.base import ExtensionArray, ExtensionOpsMixin
import pandas.core.common as com
from pandas.core.construction import array, extract_array
Expand Down Expand Up @@ -425,7 +426,9 @@ def _with_freq(self, freq):
return self


class DatetimeLikeArrayMixin(ExtensionOpsMixin, AttributesMixin, ExtensionArray):
class DatetimeLikeArrayMixin(
ExtensionOpsMixin, AttributesMixin, NDArrayBackedExtensionArray
):
"""
Shared Base/Mixin class for DatetimeArray, TimedeltaArray, PeriodArray

Expand All @@ -437,6 +440,20 @@ class DatetimeLikeArrayMixin(ExtensionOpsMixin, AttributesMixin, ExtensionArray)
_generate_range
"""

# ------------------------------------------------------------------
# NDArrayBackedExtensionArray compat

@property
def _ndarray(self) -> np.ndarray:
# NB: A bunch of Interval tests fail if we use ._data
return self.asi8

def _from_backing_data(self, arr: np.ndarray):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a new method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

# Note: we do not retain `freq`
return type(self)(arr, dtype=self.dtype) # type: ignore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unlike Frame and Series, EAs do not have the concept of _constructor. could this be useful?

the exception is Categorical which uses _constructor. so maybe either extend to all EAs or can we remove from Categorical for consistency.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC Categorical has _constructor because it subclasses PandasObject.

_from_backing_data is specifically about round-tripping or commutativity; not really a good fit for _constructor.


# ------------------------------------------------------------------

@property
def ndim(self) -> int:
return self._data.ndim
Expand Down Expand Up @@ -711,16 +728,6 @@ def _validate_fill_value(self, fill_value):
)
return fill_value

def take(self, indices, allow_fill=False, fill_value=None):
if allow_fill:
fill_value = self._validate_fill_value(fill_value)

new_values = take(
self.asi8, indices, allow_fill=allow_fill, fill_value=fill_value
)

return type(self)(new_values, dtype=self.dtype)

@classmethod
def _concat_same_type(cls, to_concat, axis: int = 0):

Expand Down