Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: change IntervalIndex.contains to work elementwise #17753

Merged
merged 10 commits into from
Jul 1, 2019

Conversation

jorisvandenbossche
Copy link
Member

xref #16316

This is more a proof-of-concept, to see if there is agreement on the behaviour change (I suppose the implementation itself can be easily vectorized based on the left/right attributes).

cc @zfrenchee @shoyer @buyology @jreback

@jorisvandenbossche
Copy link
Member Author

Any comments on this one?
(I still need to fix implementation and tests, but would like to check first whether there is some agreement on this?)

return True
except KeyError:
return False
return np.array([key in interval for interval in self], dtype='bool')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just do
return self.get_indexer(key) != -1

@jreback
Copy link
Contributor

jreback commented Oct 10, 2017

non-withstanding my comment on the impl. I actually don't think this is a good idea. The reason for .contains in the first place is to allow us to control what __contains__ does (and for II have it mean something differnt), so the default of scalar in index is preserved.

But you are now returning a boolean array, which is contrary to a single boolean result. I agree we need a method to return a boolean for each element, but .contains shouldn't be how its spelled.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Interval Interval data type labels Oct 10, 2017
@shoyer
Copy link
Member

shoyer commented Oct 10, 2017 via email

@jorisvandenbossche
Copy link
Member Author

Also, if you want the return value of the current contains, you can always do idx.contains(key).any() with the proposed new behaviour to get the same (although then it might not be as efficient as before).

We can indeed discuss what would be the best name for this method (the one that returns boolean array), and it might not necessarily be 'contains'. But if we decide for 'contains', the fact that it is different from __contains__ should not hold us back IMO (the user interface is quite different: key in index vs index.contains(key), and that can be easily documented).

@TomAugspurger
Copy link
Contributor

But if we decide for 'contains', the fact that it is different from __contains__ should not hold us back IMO (the user interface is quite different: key in index vs index.contains(key), and that can be easily documented).

Agreed with @jorisvandenbossche here.

@jschendel
Copy link
Member

It seems like this would be inconsistent with every other type of index, as index.contains(key) is not elementwise for them (albeit for most elementwise wouldn't make sense). Even PeriodIndex, which seems the most similar to IntervalIndex, doesn't do elementwise.

I recall there being plans to add an interval accessor. Perhaps that would be a more consistent place for an elementwise contains, something like index.iv.contains(key)?

@jorisvandenbossche
Copy link
Member Author

It seems like this would be inconsistent with every other type of index, as index.contains(key) is not elementwise for them

I am not aware of a general existing contains method for Index ? (so I don't see how this can be inconsistent)

It is true that DatetimeIndex/PeriodIndex have one, but this was added together with the introduction of IntervalIndex, and is thus rather coupled. Indeed, when changing it for IntervalIndex, we need to change it for DatetimeIndex/PeriodIndex as well (or for DatetimeIndex, we could also remove it, as this has no notion of intervals).

I recall there being plans to add an interval accessor

We have no accessors for any other index, so that would also be inconsistent. Such an accessor would be to access the special methods when the data are in a Series (but I may misremember the previous discussion, do you know where this took place?)

Perhaps that would be a more consistent place for an elementwise contains, something like index.iv.contains(key)?

IMO it would be unfortunate that index.iv.contains and index.contains did something completely different.

@jorisvandenbossche
Copy link
Member Author

In general: note that the proposed behaviour is consistent with how 'contains' works on an individual Interval. So in that sense the fact that a IntervalIndex.contains method works element-wise and does what it means for the individual elements, feels very natural and is similar to the many datetime-related methods/attributes both on Timestamp as on DatetimeIndex.

@jschendel
Copy link
Member

I am not aware of a general existing contains method for Index ?

There is a contains method for Index:

In [2]: idx = pd.Index(['foo', 'bar', 'baz'])

In [3]: idx.contains('foo')
Out[3]: True

In [4]: idx.contains('a')
Out[4]: False

IMO it would be unfortunate that index.iv.contains and index.contains did something completely different.

Note that this already happens with the string accessor:

In [5]: idx.str.contains('a')
Out[5]: array([False,  True,  True], dtype=bool)

I do see your point though, and am not trying to imply that this would be fine simply because there is precedence for it. Just pointing out that it's not an unbroken rule.

We have no accessors for any other index, so that would also be inconsistent. Such an accessor would be to access the special methods when the data are in a Series (but I may misremember the previous discussion, do you know where this took place?)

I was thinking of #16401. Seems like the proposal was just for columns, so I guess I was mistaken in thinking that it would also work on an IntervalIndex, similar to how the string accessor works on both columns and indexes?

I probably wasn't clear in my original post, but I'm not against this change (nor do I feel like I've been around long enough to cast +1/-1 votes on stuff like this anyways). I was actually quite surprised when I first used IntervalIndex.contains and it didn't return elementwise. Just wanted to make sure there wasn't a more appropriate place, and that the consistency ramifications were fully understood.

@jreback
Copy link
Contributor

jreback commented Oct 14, 2017

The entire point of introducing .contains was to avoid conflating a very well-defined behavior for __contains__ (generally); this is only defined diffferently for IntervalIndex. So not really against changing the behavior of it. It by-definition should be able to act differently on II, that's the point.

But I think the intent was still a scalar return value, and not an array. The idea was to allow value in II to work.

maybe if we had/have .contain and .contains this could be more clear (e.g. former is a scalar, latter is an array). Not sure this is the best spelling though.

@jorisvandenbossche
Copy link
Member Author

There is a contains method for Index:

Whoops it seemed I really looked bad, I thought I was sure to have checked it :-) @jschendel Thanks for pointing this out. That is a bit unfortunate for the changes I am proposing here.

similar to how the string accessor works on both columns and indexes

Ah, yes, string accessor is indeed available on Index. I was thinking about the dt and cat accessors. But you are right, neither precedent necessarily needs to determine what we think best for IntervalIndex.

The entire point of introducing .contains was to avoid conflating a very well-defined behavior for __contains__ (generally);

I don't fully understand this point, as Index.contains is (more or less) an exact alias for __contains__ for most index types. Apart for IntervalIndex, where it does something slightly different (not a pure __contains__), and this difference in behaviour of .contains between all indices and IntervalIndex is also confusing IMO.
I somehow like the fact that you can also do a 'contains' operation with a method instead of key in index, but it adds a duplicate method which is not available for something else (as the behaviour proposed here)

@jreback
Copy link
Contributor

jreback commented Dec 2, 2017

closing as stale

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Dec 4, 2017

I don't agree with closing this :-), but it is true the discussion died.

Let me summarize my reasoning for this PR:

  • Currenlty Index.contains(key) does the same as Index.__contains__(key) / key in index. This is therefore duplicating functionality that is IMO not needed (in other PRs we are actively removing methods).
  • For all index types .contains() currenlty does an "exact search" of the label, IntervalIndex does not (also checks if passed values is contained in any of the individual values of the index), making this method inconsistent between index types.
  • The proposed IntervalIndex.contains is an element-wise version of the 'contains' operation on an individual Interval object (curently val in Interval(), but we could also add a Interval.contains() method).
  • I don't know of other interval-like implementations that have something like this that we could use as reference, but in spatial topology this predicate is called "contains" (https://en.wikipedia.org/wiki/DE-9IM, although the rules for the edges are a bit unintuitive, and you could also say the equivalent is "covers").

So my proposal is:

  • Remove (deprecate) .contains() for all other Index types
  • Only keep it for IntervalIndex and change it here to work element-wise

And also if we do not agree on the above, I think we need the functionality of this PR, so then we need to come up with another name than .contains() (and then "covers" might be reasonable one, although I find "contains" more logical).

@jreback
Copy link
Contributor

jreback commented Jan 21, 2018

ok with implementing this. pls rebase / update.

@jschendel
Copy link
Member

@jorisvandenbossche : Any chance this gets implemented relatively soon? I'm looking for an IntervalIndex method to test with #19502, but there aren't any existing methods that fit the bill, and this looks like it should work for that purpose. I'd be willing to take over the implementation if you have other priorities that take precedence over this.

@jreback
Copy link
Contributor

jreback commented Jul 7, 2018

@jschendel can you review and rebase if you think good?

@jschendel
Copy link
Member

@jreback : Sure, but will wait until IntervalArray is done so as to not add additional conflicts there.

@jreback
Copy link
Contributor

jreback commented Nov 1, 2018

closing as stale. if you'd like to continue pls ping.

@jreback jreback closed this Nov 1, 2018
@jorisvandenbossche
Copy link
Member Author

IntervalArray is done in the meantime, so we can rebase this now

@pep8speaks
Copy link

pep8speaks commented Nov 1, 2018

Hello @jorisvandenbossche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-30 16:37:27 UTC

@jorisvandenbossche
Copy link
Member Author

@jschendel If you want to take over, feel free. Otherwise I will take this up next week.

@jreback
Copy link
Contributor

jreback commented Dec 9, 2018

closing as stale. if someone would like to revive, pls ping.

@jorisvandenbossche
Copy link
Member Author

I think it would be nice to include this along the other IntervalIndexing changes in #27100

(still need to add whatsnew docs + tests for deprecated methods)

@jorisvandenbossche jorisvandenbossche added this to the 0.25.0 milestone Jun 29, 2019
@jschendel jschendel added the Deprecate Functionality to remove in pandas label Jun 30, 2019
pandas/core/indexes/interval.py Outdated Show resolved Hide resolved
pandas/core/indexes/interval.py Outdated Show resolved Hide resolved
pandas/tests/indexes/interval/test_interval.py Outdated Show resolved Hide resolved
@jorisvandenbossche jorisvandenbossche mentioned this pull request Jul 1, 2019
@@ -615,6 +615,7 @@ Other deprecations
- :attr:`Series.imag` and :attr:`Series.real` are deprecated. (:issue:`18262`)
- :meth:`Series.put` is deprecated. (:issue:`18262`)
- :meth:`Index.item` and :meth:`Series.item` is deprecated. (:issue:`18262`)
- :meth:`Index.contains` is deprecated. Use ``key in index`` (``__contains__``) instead (:issue:`17753`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is fine, but I would add a sub-section on the usage for IntervalIndex itself

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jul 1, 2019 via email

Copy link
Member

@jschendel jschendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks!

def contains(self, other):
if isinstance(other, Interval):
raise NotImplementedError(
'contains not implemented for two intervals'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should be able to implement this in the next day or two if we want to get it in for 0.25.0; it should be relatively straight-forward and we have good testing infrastructure from overlaps that could be partially reused.

@jreback jreback merged commit 6c65879 into pandas-dev:master Jul 1, 2019
@jreback
Copy link
Contributor

jreback commented Jul 1, 2019

thanks @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche deleted the intervalindex-contains branch July 2, 2019 12:41
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Indexing Related to indexing on series/frames, not to indexes themselves Interval Interval data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants