ENH: ExtensionArray.searchsorted #24350

Merged
merged 6 commits into from
Dec 28, 2018

TomAugspurger
Contributor

No description provided.

@TomAugspurger TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Dec 19, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Dec 19, 2018
@pep8speaks

pep8speaks commented Dec 19, 2018

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 28, 2018 at 19:54 Hours UTC

@codecov

codecov bot commented Dec 19, 2018

Codecov Report

Merging #24350 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #24350      +/-   ##
==========================================
+ Coverage   92.29%   92.29%   +<.01%     
==========================================
  Files         162      162              
  Lines       51806    51816      +10     
==========================================
+ Hits        47815    47825      +10     
  Misses       3991     3991

Flag       Coverage Δ
#multiple  90.7% <100%> (ø) ⬆️
#single    42.99% <20%> (-0.01%) ⬇️

Impacted Files                Coverage Δ
pandas/core/arrays/base.py    97.46% <100%> (+0.04%) ⬆️
pandas/core/arrays/sparse.py  92.15% <100%> (+0.06%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c230f29...58418ab.

@codecov

codecov bot commented Dec 19, 2018

Codecov Report

Merging #24350 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #24350      +/-   ##
==========================================
- Coverage    92.3%    92.3%   -0.01%     
==========================================
  Files         165      165              
  Lines       52176    52186      +10     
==========================================
+ Hits        48161    48170       +9     
- Misses       4015     4016       +1

Flag       Coverage Δ
#multiple  90.72% <100%> (ø) ⬆️
#single    42.96% <27.27%> (ø) ⬆️

Impacted Files                Coverage Δ
pandas/core/arrays/base.py    98.23% <100%> (+0.03%) ⬆️
pandas/core/base.py           97.7% <100%> (ø) ⬆️
pandas/core/arrays/sparse.py  92.17% <100%> (+0.06%) ⬆️
pandas/util/testing.py        87.75% <0%> (-0.1%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d1b2a52...a91fcec.

@TomAugspurger TomAugspurger mentioned this pull request Dec 19, 2018
12 tasks
"""
Find indices where elements should be inserted to maintain order.

.. versionadded:: 0.25.0
Contributor

0.24.0

pandas/core/arrays/base.py
# 2. Values between the values in the `data_for_sorting` fixture
# 3. Missing values.
arr = self.astype(object)
return arr.searchsorted(v, side=side, sorter=sorter)
Contributor

do we need to astype to object?

Contributor Author

We need an ndarray. I suppose we could do np.asarray(self), which will convert to the best possible ndarray? But that could be lossy and so you wouldn't get the correct answer.

So yes, I think we do need object.
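
To make the lossiness concern concrete, here is a minimal sketch. It assumes hypothetical high-precision `decimal.Decimal` values (not taken from this PR) and compares a search over an object ndarray with a search over a float64 conversion of the same data:

```python
from decimal import Decimal

import numpy as np

# Hypothetical data: three distinct Decimals that all round to 1.0 as float64.
values = [
    Decimal("1.00000000000000000001"),
    Decimal("1.00000000000000000002"),
    Decimal("1.00000000000000000003"),
]
needle = values[2]

# Object dtype keeps the Decimal objects, so comparisons stay exact.
as_object = np.array(values, dtype=object)
exact = as_object.searchsorted(needle, side="left")

# float64 collapses all three values to 1.0, so the insertion point shifts.
as_float = np.array([float(x) for x in values], dtype="float64")
lossy = as_float.searchsorted(float(needle), side="left")

print(int(exact), int(lossy))
```

Under these assumptions the object search finds the last position, while the float64 search reports the first, because the three values are no longer distinguishable after conversion.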

Contributor

this is going to be very inefficient, so subclasses would almost certainly need to override. I would rather add this as an abstract method then.

Contributor Author

This is similar to other methods. #24433
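
A rough sketch of the trade-off being discussed: a generic object-based default that always works, and a subclass that overrides it to search its native storage. `GenericArray` and `IntBackedArray` are hypothetical stand-ins, not pandas classes, and the int64 backing store is an assumption made for illustration:

```python
import numpy as np


class GenericArray:
    """Hypothetical stand-in for an extension array (not pandas' real class)."""

    def __init__(self, data):
        self._data = np.asarray(data, dtype="int64")

    def astype(self, dtype):
        return self._data.astype(dtype)

    def searchsorted(self, value, side="left", sorter=None):
        # Generic fallback: works for any element type, but it copies the
        # data into an object ndarray and compares Python objects.
        arr = self.astype(object)
        return arr.searchsorted(value, side=side, sorter=sorter)


class IntBackedArray(GenericArray):
    def searchsorted(self, value, side="left", sorter=None):
        # Override: search the native int64 buffer directly, no object copy.
        return self._data.searchsorted(value, side=side, sorter=sorter)


data = [10, 20, 30]
generic = GenericArray(data).searchsorted([5, 25, 35])
fast = IntBackedArray(data).searchsorted([5, 25, 35])
print(list(generic), list(fast))
```

Both calls should agree on the insertion points; the override only changes how the search is carried out, which is why a slow default plus optional override is a workable alternative to an abstract method.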

@@ -505,6 +506,54 @@ def unique(self):
uniques = unique(self.astype(object))
return self._from_sequence(uniques, dtype=self.dtype)

def searchsorted(self, v, side="left", sorter=None):
Contributor

I personally don't like one-letter parameter names. Use values instead?

Contributor Author

I don't like it either, but there's value in matching NumPy here.

Contributor

Sorry, meant value, as that is the name used in Series.searchsorted and various other searchsorted implementations in pandas (Categorical.searchsorted at least, haven’t checked all impl.).

I think it's inconsistent to follow numpy naming here, when pandas’s is better :-)

Contributor

we use value currently in both Index and Series. let's be consistent here (in fact I think we went thru a deprecation cycle on those a while back).

Contributor Author

Did you see Stephan's comment in #24350 (comment)?

Ideally, np.searchsorted(extension_array) would always work. If we do np.searchsorted(v=extension_array), I think a TypeError will be raised.

Contributor

See my comment above; this would be an inconsistency in naming, which is much worse than having np.searchsorted(ea) not work, which is just a convenience anyhow.

Contributor Author

It's not just a convenience. @shoyer do you have thoughts here?

Subclasses implementing __array_function__ will be allowed to override ExtensionArray.searchsorted with the correct function signature, but it'd be nice if things worked out of the box.

@TomAugspurger
Contributor Author

TomAugspurger commented Dec 20, 2018 via email

@shoyer
Member

shoyer commented Dec 20, 2018

if I were writing an ExtensionArray that also wanted to implement __array_function__, is it OK for the name of positional arguments to differ?

__array_function__ passes on the exact positional and keyword arguments with which the function was called. So in practice this would mean that anyone who uses keyword arguments with NumPy's names would get a TypeError, unless you add keyword-only arguments for NumPy's names, too. Using positional arguments would be fine, though.
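
The naming problem can be illustrated without NumPy at all. In this sketch, `forward` is a hypothetical stand-in for any dispatcher that passes arguments through unchanged, and `ValueOnly` and `BothNames` are made-up classes; the `sorter` argument is accepted but ignored to keep the example short:

```python
import bisect


class ValueOnly:
    """Hypothetical array whose method uses the pandas name, ``value``."""

    def __init__(self, data):
        self._data = sorted(data)

    def searchsorted(self, value, side="left", sorter=None):
        # ``sorter`` is ignored in this sketch; the data is kept sorted.
        find = bisect.bisect_left if side == "left" else bisect.bisect_right
        return find(self._data, value)


class BothNames(ValueOnly):
    """Same behaviour, plus a keyword-only alias for NumPy's name, ``v``."""

    def searchsorted(self, value=None, side="left", sorter=None, *, v=None):
        if v is not None:
            value = v
        return super().searchsorted(value, side=side, sorter=sorter)


def forward(arr, *args, **kwargs):
    # Stand-in for a dispatcher that forwards the caller's arguments as-is.
    return arr.searchsorted(*args, **kwargs)


print(forward(ValueOnly([1, 3, 5]), 4))     # positional call works
print(forward(BothNames([1, 3, 5]), v=4))   # NumPy-style keyword works
try:
    forward(ValueOnly([1, 3, 5]), v=4)      # NumPy-style keyword, pandas name
except TypeError as exc:
    print("TypeError:", exc)
```

The keyword-only alias is one possible way to accept both spellings; whether that extra parameter is worth the added signature complexity is the judgment call under discussion here.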

@TomAugspurger
Contributor Author

TomAugspurger commented Dec 20, 2018 via email

@TomAugspurger
Contributor Author

Changed v to value.

# 2. Values between the values in the `data_for_sorting` fixture
# 3. Missing values.
arr = self.astype(object)
return arr.searchsorted(value, side=side, sorter=sorter)
Contributor

IIRC you have an issue open to show a warning here for EAs that don't redefine this?

@jreback jreback merged commit 7617ed1 into pandas-dev:master Dec 28, 2018
@jreback
Contributor

jreback commented Dec 28, 2018

thanks @TomAugspurger

@TomAugspurger
Contributor Author

TomAugspurger commented Dec 28, 2018 via email

@TomAugspurger TomAugspurger deleted the ea-searchsorted branch January 2, 2019 20:17
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
5 participants