API: add 'level' kwarg to 'Index.isin' method #7892

Merged
merged 1 commit into from Aug 4, 2014

Conversation

Projects
None yet
3 participants
Contributor

immerrr commented Jul 31, 2014

closes #7890

This PR adds level kwarg for Index objects as discussed (briefly in #7890):

Index.isin(iterable, level=None)

Summary

  • for non-MultiIndex classes:
    • no functional changes in the actual membership test
    • valid level values are None, 0, -1 and self.name
    • if level doesn't match the name, it's a KeyError (was AssertionError before)
  • for MultiIndex classes:
    • if level=None, elements of self.values (tuples) are used
    • otherwise elements of corresponding level are used
    • valid level values are the same as in MultiIndex.get_level_values method, -self.nlevels..(self.nlevels - 1) plus all unique index names

TODO

  • add to documentation
  • optimize MultiIndex case
Contributor

immerrr commented Jul 31, 2014

Also, probably related to #3268

jreback added this to the 0.15.0 milestone Jul 31, 2014

@jreback jreback commented on the diff Jul 31, 2014

pandas/core/index.py
@@ -687,13 +687,29 @@ def _engine(self):
# property, for now, slow to look up
return self._engine_type(lambda: self.values, len(self))
+ def _validate_index_level(self, level):
+ """
+ Validate index level.
@jreback

jreback Jul 31, 2014

Contributor

his already exists as _get_level_number (oh I see you are extending it), ok then

Contributor

immerrr commented Jul 31, 2014

Performance patch is in, here's a synthetic benchmark:

In [7]: idx = pd.MultiIndex.from_product([np.arange(10000), ['a', 'b', 'c']])

In [8]: timeit idx.isin(['a'], level=1)
10000 loops, best of 3: 104 µs per loop

In [9]: timeit idx.get_level_values(1).isin(['a'])
1000 loops, best of 3: 1.37 ms per loop

In [10]: timeit idx.isin([1], level=0)
1000 loops, best of 3: 1.02 ms per loop

In [11]: timeit idx.get_level_values(0).isin([1])
100 loops, best of 3: 2.84 ms per loop

@jreback jreback and 1 other commented on an outdated diff Jul 31, 2014

pandas/core/index.py
+ passed set of values
+
+ Parameters
+ ----------
+ values : set or sequence of values
+ level : int or level name
+
+ Returns
+ -------
+ is_contained : ndarray (boolean dtype)
+ """
+
+ if level is None:
+ return lib.ismember(self._array_values(), set(values))
+ else:
+ num = self._get_level_number(level)
@jreback

jreback Jul 31, 2014

Contributor

this currently breaks if level is a list, you want to support that? (and pass tuples for the isin values)

@immerrr

immerrr Jul 31, 2014

Contributor

I didn't go for that in this PR, I tend to be conservative about API and would wait for an actual use case where

idx.isin(vals1, lev1) & idx.isin(vals2, lev2)

is clearly inferior to

idx.isin([vals1, vals2], [lev1, lev2])
Contributor

jreback commented Aug 1, 2014

looks good....doc and then can merge

FYI, if you want to revise/update isin section gr8! (I would also make it a top-level, rather than below boolean indexing)

Contributor

immerrr commented Aug 1, 2014

re: doc

other than that I've already added?

Contributor

jreback commented Aug 1, 2014

I think it makes sense to combine those 2 sections (isin of Index objects) and isin of PandasObject as they are really the same (and confusing to look in 2 sections). http://pandas-docs.github.io/pandas-docs-travis/indexing.html#indexing-with-isin

I would simply change the isin section (that I am pointing), make it a top-level (e.g. same as Boolean Indexing), them put the examples there

Contributor

immerrr commented Aug 3, 2014

I'm a bit concerned about introducing functionality of classes that themselves have not yet been described, because Index & MultiIndex sections are way below the "isin" section on containers while they probably should be introduced with the rest in dsintro.rst.

Although, I see some of that happening for docs on where that do mention Multiindex, so you may be ok with that, but just to be sure, are you?

Contributor

jreback commented Aug 3, 2014

I agree with all of that

index/MultiIndex should have an intro at the top of indexing (side issue)

but I think splitting isin up (as it is now) is a bigger problem

so I would consolidate and out as a big bullet after boolean indexing

feel free to do a short intro if u would like as well (for index/mi)

@jorisvandenbossche jorisvandenbossche and 2 others commented on an outdated diff Aug 3, 2014

pandas/core/index.py
@@ -2157,12 +2178,17 @@ def isin(self, values):
Parameters
----------
values : set or sequence of values
+ level : {0, None}
+
+ `level` argument is provided for compatibility with MultiIndex.
@jorisvandenbossche

jorisvandenbossche Aug 3, 2014

Owner

two remarks:

  • can you follow the numpy docstring format?
  • I think we should write the docstring for the 'general' case (so it is also applicable for MultiIndex), as this is the docstring that will be rendered in the html docs.

So roughly something like:

level : int, string, default None
    Level number or name to specify the index level to use if the index is a MultiIndex.
@immerrr

immerrr Aug 3, 2014

Contributor

can you follow the numpy docstring format?

Sure, although I thought I was close. What do you think stands out?

I think we should write the docstring for the 'general' case

I was trying to state explicitly that this argument means nothing in particular for simple Index classes while adding a meaningful docstring for MultiIndex.isin, but I do understand that some might like to have it generic and in one place.

Also, is it stated somewhere that level names must be (or rather, are expected to be) strings?

@jreback

jreback Aug 3, 2014

Contributor

level can be a name or int (level number)

@jorisvandenbossche

jorisvandenbossche Aug 3, 2014

Owner

can you follow the numpy docstring format?

Sure, although I thought I was close. What do you think stands out?

It is the exact formatting that is important for the correct rendering (see https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt). It should be formatted as:

arg : type
    Explanation

(so no blank line, and the explanation should be indented)

I think we should write the docstring for the 'general' case

I was trying to state explicitly that this argument means nothing in particular for simple Index classes while adding a meaningful docstring for MultiIndex.isin, but I do understand that some might like to have it generic and in one place.

The docstring in MultiIndex.isin should also be meaningful. But it is just that in the online documentation, there are references to Index.isin, so I think the all relevant information should also be there (to not have to link to both Index.isin and MultiIndex.isin)

Also, is it stated somewhere that level names must be (or rather, are expected to be) strings?

Ah, no, it can be any object I think, but int and string will be the normal ones?

@immerrr

immerrr Aug 4, 2014

Contributor

(so no blank line, and the explanation should be indented)

hah, ok, that particular line was meant as a note, hence non-indented and separated, but ok, let's make it a parameter description :)

@jorisvandenbossche

jorisvandenbossche Aug 4, 2014

Owner

yep, but 'notes' like this don't work in the 'Parameter' section (it is then interpreted as another parameter, the format is quit strict), therefore your have the 'Notes' section, as you did now :-)

@immerrr immerrr API: add 'level' kwarg to 'Index.isin' method
PERF: optimize MultiIndex.isin when level != None

DOC: update doc bits on Index.isin and MultiIndex.isin

DOC: merge isin sections for container & index classes

DOC: make isin a separate section

It's no longer a subsection of boolean indexing.
7708590
Contributor

jreback commented Aug 4, 2014

looks good...

@jorisvandenbossche anything else?

no, looking good!

@jreback jreback added a commit that referenced this pull request Aug 4, 2014

@jreback jreback Merge pull request #7892 from immerrr/add-level-kwarg-for-index-isin
API: add 'level' kwarg to 'Index.isin' method
0646ad5

@jreback jreback merged commit 0646ad5 into pandas-dev:master Aug 4, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Details
Contributor

jreback commented Aug 4, 2014

thanks

let's check docs after it's built and review for correctness

immerrr deleted the immerrr:add-level-kwarg-for-index-isin branch Nov 2, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment