Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: ExtensionArray.searchsorted #24350

Merged
merged 6 commits into from
Dec 28, 2018
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -990,6 +990,7 @@ update the ``ExtensionDtype._metadata`` tuple to match the signature of your
- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` (:issue:`21185`)
- ``ExtensionDtype`` has gained the ability to instantiate from string dtypes, e.g. ``decimal`` would instantiate a registered ``DecimalDtype``; furthermore
the ``ExtensionDtype`` has gained the method ``construct_array_type`` (:issue:`21185`)
- :meth:`~pandas.api.types.ExtensionArray.searchsorted` has been added (:issue:`24350`)
- An ``ExtensionArray`` with a boolean dtype now works correctly as a boolean indexer. :meth:`pandas.api.types.is_bool_dtype` now properly considers them boolean (:issue:`22326`)
- Added ``ExtensionDtype._is_numeric`` for controlling whether an extension dtype is considered numeric (:issue:`22290`).
- The ``ExtensionArray`` constructor, ``_from_sequence`` now take the keyword arg ``copy=False`` (:issue:`21185`)
Expand Down
49 changes: 49 additions & 0 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -64,6 +64,7 @@ class ExtensionArray(object):
* unique
* factorize / _values_for_factorize
* argsort / _values_for_argsort
* searchsorted

The remaining methods implemented on this class should be performant,
as they only compose abstract methods. Still, a more efficient
Expand Down Expand Up @@ -505,6 +506,54 @@ def unique(self):
uniques = unique(self.astype(object))
return self._from_sequence(uniques, dtype=self.dtype)

def searchsorted(self, v, side="left", sorter=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally don't like one-letter parameter names. Use values instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like it either, but theres value in matching NumPy here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, meant value, as that is the name used in Series.searchsorted and various other searchsorted implementations in pandas (Categorical.searchsorted at least, haven’t checked all impl.).

I think it`s inconsistent to follow numpy naming here, when pandas’s is better:-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use value currently in both Index and Series. let's be consistent here (in fact I think we went thru a deprecation cycle on those a while back).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you see Stephan's comment in
#24350 (comment)?

Ideally, np.searchsorted(extension_array), would always work. If we do np.searshsorted(v=extension_array) I think a TypeError will be raised.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comment above; this would be an inconsistency in naming
which is much worse that have np.searchedsorted(ea) not working which is just convenience anyhow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not just a convenience. @shoyer do you have thoughts here?

Subclasses implementing __array_function__ will be allowed to override ExtensionArray.searchsorted with the correct function signature, but it'd be nice if things worked out of the box.

"""
Find indices where elements should be inserted to maintain order.

.. versionadded:: 0.25.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.24.0


Find the indices into a sorted array `self` (a) such that, if the
corresponding elements in `v` were inserted before the indices, the
order of `self` would be preserved.

Assuming that `a` is sorted:

====== ============================
`side` returned index `i` satisfies
====== ============================
left ``self[i-1] < v <= self[i]``
right ``self[i-1] <= v < self[i]``
====== ============================

Parameters
----------
v : array_like
Values to insert into `self`.
side : {'left', 'right'}, optional
If 'left', the index of the first suitable location found is given.
If 'right', return the last such index. If there is no suitable
index, return either 0 or N (where N is the length of `self`).
sorter : 1-D array_like, optional
Optional array of integer indices that sort array a into ascending
order. They are typically the result of argsort.

Returns
-------
indices : array of ints
Array of insertion points with the same shape as `v`.

See Also
--------
numpy.searchsorted : Similar method from NumPy.
"""
# Note: the base tests provided by pandas only test the basics.
# We do not test
# 1. Values outside the range of the `data_for_sorting` fixture
jreback marked this conversation as resolved.
Show resolved Hide resolved
# 2. Values between the values in the `data_for_sorting` fixture
# 3. Missing values.
arr = self.astype(object)
return arr.searchsorted(v, side=side, sorter=sorter)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to astype to object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need an ndarray. I suppose we could do np.asarray(self), which will convert to the best possible ndarray? But that could be lossy and so you wouldn't get the correct answer.

So yes, I think we do need object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is going to be very inefficient, so subclasses would almost certainly need to override. I would rather add this as an abstract method then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is similar to other methods. #24433


def _values_for_factorize(self):
# type: () -> Tuple[ndarray, Any]
"""
Expand Down
10 changes: 10 additions & 0 deletions pandas/core/arrays/sparse.py
Original file line number Diff line number Diff line change
Expand Up @@ -1166,6 +1166,16 @@ def _take_without_fill(self, indices):

return taken

def searchsorted(self, v, side="left", sorter=None):
msg = "searchsorted requires high memory usage."
warnings.warn(msg, PerformanceWarning, stacklevel=2)
if not is_scalar(v):
v = np.asarray(v)
v = np.asarray(v)
return np.asarray(self, dtype=self.dtype.subtype).searchsorted(
v, side, sorter
)

def copy(self, deep=False):
if deep:
values = self.sp_values.copy()
Expand Down
20 changes: 20 additions & 0 deletions pandas/tests/extension/base/methods.py
Original file line number Diff line number Diff line change
Expand Up @@ -231,6 +231,26 @@ def test_hash_pandas_object_works(self, data, as_frame):
b = pd.util.hash_pandas_object(data)
self.assert_equal(a, b)

def test_searchsorted(self, data_for_sorting):
b, c, a = data_for_sorting
arr = type(data_for_sorting)._from_sequence([a, b, c])
assert arr.searchsorted(a) == 0
assert arr.searchsorted(a, side="right") == 1

assert arr.searchsorted(b) == 1
assert arr.searchsorted(b, side="right") == 2

assert arr.searchsorted(c) == 2
assert arr.searchsorted(c, side="right") == 3

result = arr.searchsorted(arr.take([0, 2]))
expected = np.array([0, 2], dtype=np.intp)
tm.assert_numpy_array_equal(result, expected)

# sorter
sorter = np.array([1, 2, 0])
assert data_for_sorting.searchsorted(a, sorter=sorter) == 0

@pytest.mark.parametrize("as_frame", [True, False])
def test_where_series(self, data, na_value, as_frame):
assert data[0] != data[1]
Expand Down
4 changes: 4 additions & 0 deletions pandas/tests/extension/json/test_json.py
Original file line number Diff line number Diff line change
Expand Up @@ -232,6 +232,10 @@ def test_where_series(self, data, na_value):
# with shapes (4,) (4,) (0,)
super().test_where_series(data, na_value)

@pytest.mark.skip(reason="Can't compare dicts.")
def test_searchsorted(self, data_for_sorting):
super(TestMethods, self).test_searchsorted(data_for_sorting)


class TestCasting(BaseJSON, base.BaseCastingTests):
@pytest.mark.skip(reason="failing on np.array(self, dtype=str)")
Expand Down
4 changes: 4 additions & 0 deletions pandas/tests/extension/test_categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -189,6 +189,10 @@ def test_combine_add(self, data_repeated):
def test_fillna_length_mismatch(self, data_missing):
super().test_fillna_length_mismatch(data_missing)

def test_searchsorted(self, data_for_sorting):
if not data_for_sorting.ordered:
raise pytest.skip(reason="searchsorted requires ordered data.")


class TestCasting(base.BaseCastingTests):
pass
Expand Down
4 changes: 4 additions & 0 deletions pandas/tests/extension/test_sparse.py
Original file line number Diff line number Diff line change
Expand Up @@ -286,6 +286,10 @@ def test_combine_first(self, data):
pytest.skip("TODO(SparseArray.__setitem__ will preserve dtype.")
super(TestMethods, self).test_combine_first(data)

def test_searchsorted(self, data_for_sorting):
with tm.assert_produces_warning(PerformanceWarning):
super(TestMethods, self).test_searchsorted(data_for_sorting)


class TestCasting(BaseSparseTests, base.BaseCastingTests):
pass
Expand Down