Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Add __contains__ to CategoricalIndex #21369

Merged
merged 2 commits into from
Jun 14, 2018

Conversation

topper-123
Copy link
Contributor

Currently, membership checks in CategoricalIndex is very slow as explained in #21022. This PR fixes the issue for CategoricalIndex, while #21022 contains the fix for Categorical. The difference between the two cases is the use of _engine for CategoricalIndex, which makes this even faster than the Catagorical solution in #21022.

Tests exist already and can be found in tests/indexes/test_category.py::TestCategoricalIndex::test_contains.

ASV:

      before           after         ratio
     [0c65c57a]       [986779ab]
-      2.49±0.2ms       3.26±0.2μs     0.00  categoricals.Contains.time_contains

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@topper-123 topper-123 force-pushed the categorical_contains branch 2 times, most recently from 7878db0 to 35a68d7 Compare June 7, 2018 22:40
@codecov
Copy link

codecov bot commented Jun 7, 2018

Codecov Report

Merging #21369 into master will increase coverage by <.01%.
The diff coverage is 90%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21369      +/-   ##
==========================================
+ Coverage    91.9%    91.9%   +<.01%     
==========================================
  Files         153      153              
  Lines       49606    49610       +4     
==========================================
+ Hits        45589    45593       +4     
  Misses       4017     4017
Flag Coverage Δ
#multiple 90.3% <90%> (ø) ⬆️
#single 41.89% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/category.py 97.09% <90%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 576d5c6...ccfba1b. Read the comment docs.

@gfyoung gfyoung added Performance Memory or execution speed performance Categorical Categorical Data Type Indexing Related to indexing on series/frames, not to indexes themselves labels Jun 8, 2018
@@ -324,20 +324,19 @@ def _reverse_indexer(self):
@Appender(_index_shared_docs['__contains__'] % _index_doc_kwargs)
def __contains__(self, key):
hash(key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a really silly question, but what does this line do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just ensures that mutables are not passes into the function....

I think it should probably be removed, but that line is in various other places as well, so maybe a seperate PR, that goes through all similar cases? Or I can just remove it.

Copy link
Member

@gfyoung gfyoung Jun 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, you removed it in another part of the diff, hence why I'm asking. That being said, I like your suggestion. Let's investigate for another time then, in which case I would put back the other one that you deleted.

if self.categories._defer_to_indexing:
return key in self.categories

return key in self.values
Copy link
Member

@gfyoung gfyoung Jun 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remind me: why do we NOT need to check membership in self.values anymore?

Copy link
Contributor Author

@topper-123 topper-123 Jun 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For indices, their indexing engine (i.e. ._engine) has a __contains__ method which does the same thing but is faster (does caching etc. probably, haven't looked into the details of the code).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for clarifying!

@topper-123 topper-123 force-pushed the categorical_contains branch 6 times, most recently from 0fea48b to 5c18b8a Compare June 9, 2018 14:58
Copy link
Member

@gfyoung gfyoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

cc @jreback

return False
if is_scalar(loc):
return loc in self._engine
else: # if self.categories is IntervalIndex, loc is an array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put a blank line between things, e.g.

if isna(...):
     ....

try:
   ....
except:
  ....

if is_scalar(...):
   ...

# no else needed here
return ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also can you put comments before each case (not everythin needs a comment), but i find this hard to grok in its current form.

@topper-123 topper-123 force-pushed the categorical_contains branch 2 times, most recently from 6ed9dd6 to cfb8f6d Compare June 13, 2018 16:42
@topper-123 topper-123 force-pushed the categorical_contains branch 3 times, most recently from e616770 to 2a87671 Compare June 13, 2018 17:53
@topper-123
Copy link
Contributor Author

topper-123 commented Jun 13, 2018

I've updated the PR.

I've set it to be part of 0.23.2, if that's alright.

@jreback jreback added this to the 0.23.2 milestone Jun 14, 2018
@jreback jreback merged commit bf1c3dc into pandas-dev:master Jun 14, 2018
@jreback
Copy link
Contributor

jreback commented Jun 14, 2018

thanks @topper-123

@jorisvandenbossche
Copy link
Member

This is changing the implementation quite substantially, so let's move this to 0.24.0.txt?

@jorisvandenbossche
Copy link
Member

Any comments on my comment above about keeping this for 0.24.0 ?

@topper-123
Copy link
Contributor Author

Quility-wise this is ok to go into 23.2 IMO, the PRs are really not that complex, IMO, it's much faster and it doesn't change any APIs.

Also, my main motivation for writing this was speeding up slicing dataframes with a CategoricalIndex (see #20395), which previously was very slow (still is slow, but better than before, and now faster than fancy indexing, at least). I think a lot of people will appreciate this speedup.

@jreback
Copy link
Contributor

jreback commented Jun 27, 2018

we tagged for 0.23.2 (and note is there)
it would be slightly tricky to change as there is another related change - both in 0.23.2 or 0.24
either way is fine

@jorisvandenbossche
Copy link
Member

The main reason that I am raising this is that __contains__ checking has quite some implications (which is of course also the reason you speeded it up!), and I think it is rather easy to miss a small exotic corner case where the new implementation might differ.
To be clear, @topper-123 , I am not questioning the quality of this PR! I just know from experience, also for us, that it is easy to miss an unintended API change (which we might not even decide to fix if it is debateble behaviour, but then that is still better to be left as 0.24.0). Since this is only about performance improvement (and not a regression), I would play on safe and give it more time in 0.24.0.

it would be slightly tricky to change as there is another related change - both in 0.23.2 or 0.24

Yep, it would be both in 0.23.2 or both in 0.24.0 (the other I actually tagged first as 0.24.0, but that was changed before merging)

@topper-123
Copy link
Contributor Author

Ok, your call, I won't object.

@jorisvandenbossche
Copy link
Member

OK, left it for 0.24 (moved the whatsnew already to v0.24.0.txt)

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants