Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: change IntervalIndex.contains to work elementwise #17753

Merged
merged 10 commits into from Jul 1, 2019

Conversation

@jorisvandenbossche
Copy link
Member

commented Oct 2, 2017

xref #16316

This is more a proof-of-concept, to see if there is agreement on the behaviour change (I suppose the implementation itself can be easily vectorized based on the left/right attributes).

cc @zfrenchee @shoyer @buyology @jreback

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Oct 9, 2017

Any comments on this one?
(I still need to fix implementation and tests, but would like to check first whether there is some agreement on this?)

return True
except KeyError:
return False
return np.array([key in interval for interval in self], dtype='bool')

This comment has been minimized.

Copy link
@jreback

jreback Oct 10, 2017

Contributor

I think you can just do
return self.get_indexer(key) != -1

@jreback

This comment has been minimized.

Copy link
Contributor

commented Oct 10, 2017

non-withstanding my comment on the impl. I actually don't think this is a good idea. The reason for .contains in the first place is to allow us to control what __contains__ does (and for II have it mean something differnt), so the default of scalar in index is preserved.

But you are now returning a boolean array, which is contrary to a single boolean result. I agree we need a method to return a boolean for each element, but .contains shouldn't be how its spelled.

@shoyer

This comment has been minimized.

Copy link
Member

commented Oct 10, 2017

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Oct 10, 2017

Also, if you want the return value of the current contains, you can always do idx.contains(key).any() with the proposed new behaviour to get the same (although then it might not be as efficient as before).

We can indeed discuss what would be the best name for this method (the one that returns boolean array), and it might not necessarily be 'contains'. But if we decide for 'contains', the fact that it is different from __contains__ should not hold us back IMO (the user interface is quite different: key in index vs index.contains(key), and that can be easily documented).

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Oct 10, 2017

But if we decide for 'contains', the fact that it is different from __contains__ should not hold us back IMO (the user interface is quite different: key in index vs index.contains(key), and that can be easily documented).

Agreed with @jorisvandenbossche here.

@jschendel

This comment has been minimized.

Copy link
Member

commented Oct 13, 2017

It seems like this would be inconsistent with every other type of index, as index.contains(key) is not elementwise for them (albeit for most elementwise wouldn't make sense). Even PeriodIndex, which seems the most similar to IntervalIndex, doesn't do elementwise.

I recall there being plans to add an interval accessor. Perhaps that would be a more consistent place for an elementwise contains, something like index.iv.contains(key)?

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Oct 13, 2017

It seems like this would be inconsistent with every other type of index, as index.contains(key) is not elementwise for them

I am not aware of a general existing contains method for Index ? (so I don't see how this can be inconsistent)

It is true that DatetimeIndex/PeriodIndex have one, but this was added together with the introduction of IntervalIndex, and is thus rather coupled. Indeed, when changing it for IntervalIndex, we need to change it for DatetimeIndex/PeriodIndex as well (or for DatetimeIndex, we could also remove it, as this has no notion of intervals).

I recall there being plans to add an interval accessor

We have no accessors for any other index, so that would also be inconsistent. Such an accessor would be to access the special methods when the data are in a Series (but I may misremember the previous discussion, do you know where this took place?)

Perhaps that would be a more consistent place for an elementwise contains, something like index.iv.contains(key)?

IMO it would be unfortunate that index.iv.contains and index.contains did something completely different.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Oct 13, 2017

In general: note that the proposed behaviour is consistent with how 'contains' works on an individual Interval. So in that sense the fact that a IntervalIndex.contains method works element-wise and does what it means for the individual elements, feels very natural and is similar to the many datetime-related methods/attributes both on Timestamp as on DatetimeIndex.

@jschendel

This comment has been minimized.

Copy link
Member

commented Oct 13, 2017

I am not aware of a general existing contains method for Index ?

There is a contains method for Index:

In [2]: idx = pd.Index(['foo', 'bar', 'baz'])

In [3]: idx.contains('foo')
Out[3]: True

In [4]: idx.contains('a')
Out[4]: False

IMO it would be unfortunate that index.iv.contains and index.contains did something completely different.

Note that this already happens with the string accessor:

In [5]: idx.str.contains('a')
Out[5]: array([False,  True,  True], dtype=bool)

I do see your point though, and am not trying to imply that this would be fine simply because there is precedence for it. Just pointing out that it's not an unbroken rule.

We have no accessors for any other index, so that would also be inconsistent. Such an accessor would be to access the special methods when the data are in a Series (but I may misremember the previous discussion, do you know where this took place?)

I was thinking of #16401. Seems like the proposal was just for columns, so I guess I was mistaken in thinking that it would also work on an IntervalIndex, similar to how the string accessor works on both columns and indexes?

I probably wasn't clear in my original post, but I'm not against this change (nor do I feel like I've been around long enough to cast +1/-1 votes on stuff like this anyways). I was actually quite surprised when I first used IntervalIndex.contains and it didn't return elementwise. Just wanted to make sure there wasn't a more appropriate place, and that the consistency ramifications were fully understood.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Oct 14, 2017

The entire point of introducing .contains was to avoid conflating a very well-defined behavior for __contains__ (generally); this is only defined diffferently for IntervalIndex. So not really against changing the behavior of it. It by-definition should be able to act differently on II, that's the point.

But I think the intent was still a scalar return value, and not an array. The idea was to allow value in II to work.

maybe if we had/have .contain and .contains this could be more clear (e.g. former is a scalar, latter is an array). Not sure this is the best spelling though.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Oct 19, 2017

There is a contains method for Index:

Whoops it seemed I really looked bad, I thought I was sure to have checked it :-) @jschendel Thanks for pointing this out. That is a bit unfortunate for the changes I am proposing here.

similar to how the string accessor works on both columns and indexes

Ah, yes, string accessor is indeed available on Index. I was thinking about the dt and cat accessors. But you are right, neither precedent necessarily needs to determine what we think best for IntervalIndex.

The entire point of introducing .contains was to avoid conflating a very well-defined behavior for __contains__ (generally);

I don't fully understand this point, as Index.contains is (more or less) an exact alias for __contains__ for most index types. Apart for IntervalIndex, where it does something slightly different (not a pure __contains__), and this difference in behaviour of .contains between all indices and IntervalIndex is also confusing IMO.
I somehow like the fact that you can also do a 'contains' operation with a method instead of key in index, but it adds a duplicate method which is not available for something else (as the behaviour proposed here)

@jreback

This comment has been minimized.

Copy link
Contributor

commented Dec 2, 2017

closing as stale

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Dec 4, 2017

I don't agree with closing this :-), but it is true the discussion died.

Let me summarize my reasoning for this PR:

  • Currenlty Index.contains(key) does the same as Index.__contains__(key) / key in index. This is therefore duplicating functionality that is IMO not needed (in other PRs we are actively removing methods).
  • For all index types .contains() currenlty does an "exact search" of the label, IntervalIndex does not (also checks if passed values is contained in any of the individual values of the index), making this method inconsistent between index types.
  • The proposed IntervalIndex.contains is an element-wise version of the 'contains' operation on an individual Interval object (curently val in Interval(), but we could also add a Interval.contains() method).
  • I don't know of other interval-like implementations that have something like this that we could use as reference, but in spatial topology this predicate is called "contains" (https://en.wikipedia.org/wiki/DE-9IM, although the rules for the edges are a bit unintuitive, and you could also say the equivalent is "covers").

So my proposal is:

  • Remove (deprecate) .contains() for all other Index types
  • Only keep it for IntervalIndex and change it here to work element-wise

And also if we do not agree on the above, I think we need the functionality of this PR, so then we need to come up with another name than .contains() (and then "covers" might be reasonable one, although I find "contains" more logical).

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 21, 2018

ok with implementing this. pls rebase / update.

@jschendel

This comment has been minimized.

Copy link
Member

commented Feb 3, 2018

@jorisvandenbossche : Any chance this gets implemented relatively soon? I'm looking for an IntervalIndex method to test with #19502, but there aren't any existing methods that fit the bill, and this looks like it should work for that purpose. I'd be willing to take over the implementation if you have other priorities that take precedence over this.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jul 7, 2018

@jschendel can you review and rebase if you think good?

@jschendel

This comment has been minimized.

Copy link
Member

commented Jul 7, 2018

@jreback : Sure, but will wait until IntervalArray is done so as to not add additional conflicts there.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Nov 1, 2018

closing as stale. if you'd like to continue pls ping.

@jreback jreback closed this Nov 1, 2018

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Nov 1, 2018

IntervalArray is done in the meantime, so we can rebase this now

@pep8speaks

This comment has been minimized.

Copy link

commented Nov 1, 2018

Hello @jorisvandenbossche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-30 16:37:27 UTC
@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Nov 1, 2018

@jschendel If you want to take over, feel free. Otherwise I will take this up next week.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Dec 9, 2018

closing as stale. if someone would like to revive, pls ping.

jorisvandenbossche added some commits Jun 29, 2019

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Jun 29, 2019

I think it would be nice to include this along the other IntervalIndexing changes in #27100

(still need to add whatsnew docs + tests for deprecated methods)

@jorisvandenbossche jorisvandenbossche added this to the 0.25.0 milestone Jun 29, 2019

@jschendel jschendel added the Deprecate label Jun 30, 2019

Show resolved Hide resolved pandas/core/indexes/interval.py Outdated
Show resolved Hide resolved pandas/core/indexes/interval.py Outdated
Show resolved Hide resolved pandas/tests/indexes/interval/test_interval.py Outdated
Show resolved Hide resolved pandas/core/indexes/base.py Outdated

@jorisvandenbossche jorisvandenbossche referenced this pull request Jul 1, 2019

Closed

RLS: 0.25.0 #24950

@@ -615,6 +615,7 @@ Other deprecations
- :attr:`Series.imag` and :attr:`Series.real` are deprecated. (:issue:`18262`)
- :meth:`Series.put` is deprecated. (:issue:`18262`)
- :meth:`Index.item` and :meth:`Series.item` is deprecated. (:issue:`18262`)
- :meth:`Index.contains` is deprecated. Use ``key in index`` (``__contains__``) instead (:issue:`17753`).

This comment has been minimized.

Copy link
@jreback

jreback Jul 1, 2019

Contributor

so this is fine, but I would add a sub-section on the usage for IntervalIndex itself

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Jul 1, 2019

@jschendel
Copy link
Member

left a comment

lgtm, thanks!

def contains(self, other):
if isinstance(other, Interval):
raise NotImplementedError(
'contains not implemented for two intervals'

This comment has been minimized.

Copy link
@jschendel

jschendel Jul 1, 2019

Member

I should be able to implement this in the next day or two if we want to get it in for 0.25.0; it should be relatively straight-forward and we have good testing infrastructure from overlaps that could be partially reused.

@jreback

jreback approved these changes Jul 1, 2019

@jreback jreback merged commit 6c65879 into pandas-dev:master Jul 1, 2019

14 checks passed

codecov/patch 100% of diff hit (target 50%)
Details
codecov/project 91.86% (target 82%)
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
pandas-dev.pandas Build #20190630.46 had test failures
Details
pandas-dev.pandas (Checks) Checks succeeded
Details
pandas-dev.pandas (Docs) Docs succeeded
Details
pandas-dev.pandas (Linux py35_compat) Linux py35_compat succeeded
Details
pandas-dev.pandas (Linux py36_locale_slow) Linux py36_locale_slow succeeded
Details
pandas-dev.pandas (Linux py36_locale_slow_old_np) Linux py36_locale_slow_old_np succeeded
Details
pandas-dev.pandas (Linux py37_locale) Linux py37_locale succeeded
Details
pandas-dev.pandas (Linux py37_np_dev) Linux py37_np_dev succeeded
Details
pandas-dev.pandas (Windows py36_np15) Windows py36_np15 succeeded
Details
pandas-dev.pandas (Windows py37_np141) Windows py37_np141 succeeded
Details
pandas-dev.pandas (macOS py35_macos) macOS py35_macos succeeded
Details
@jreback

This comment has been minimized.

Copy link
Contributor

commented Jul 1, 2019

@jorisvandenbossche jorisvandenbossche deleted the jorisvandenbossche:intervalindex-contains branch Jul 2, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.