Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Add __contains__ to CategoricalIndex #21369

Merged
merged 2 commits into from Jun 14, 2018

Conversation

Projects
None yet
4 participants
@topper-123
Copy link
Contributor

commented Jun 7, 2018

  • progress towards #20395
  • xref #21022
  • tests added / passed
  • benchmark added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Currently, membership checks in CategoricalIndex is very slow as explained in #21022. This PR fixes the issue for CategoricalIndex, while #21022 contains the fix for Categorical. The difference between the two cases is the use of _engine for CategoricalIndex, which makes this even faster than the Catagorical solution in #21022.

Tests exist already and can be found in tests/indexes/test_category.py::TestCategoricalIndex::test_contains.

ASV:

      before           after         ratio
     [0c65c57a]       [986779ab]
-      2.49±0.2ms       3.26±0.2μs     0.00  categoricals.Contains.time_contains

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@topper-123 topper-123 force-pushed the topper-123:categorical_contains branch from 19ba6ec to 30dc1a3 Jun 7, 2018

@topper-123 topper-123 referenced this pull request Jun 7, 2018

Closed

PERF: __contains__ method for Categorical #21022

4 of 4 tasks complete

@topper-123 topper-123 force-pushed the topper-123:categorical_contains branch 2 times, most recently from 7878db0 to 35a68d7 Jun 7, 2018

@codecov

This comment has been minimized.

Copy link

commented Jun 7, 2018

Codecov Report

Merging #21369 into master will increase coverage by <.01%.
The diff coverage is 90%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21369      +/-   ##
==========================================
+ Coverage    91.9%    91.9%   +<.01%     
==========================================
  Files         153      153              
  Lines       49606    49610       +4     
==========================================
+ Hits        45589    45593       +4     
  Misses       4017     4017
Flag Coverage Δ
#multiple 90.3% <90%> (ø) ⬆️
#single 41.89% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/category.py 97.09% <90%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 576d5c6...ccfba1b. Read the comment docs.

@@ -324,20 +324,19 @@ def _reverse_indexer(self):
@Appender(_index_shared_docs['__contains__'] % _index_doc_kwargs)
def __contains__(self, key):
hash(key)

This comment has been minimized.

Copy link
@gfyoung

gfyoung Jun 8, 2018

Member

This might be a really silly question, but what does this line do?

This comment has been minimized.

Copy link
@topper-123

topper-123 Jun 8, 2018

Author Contributor

It just ensures that mutables are not passes into the function....

I think it should probably be removed, but that line is in various other places as well, so maybe a seperate PR, that goes through all similar cases? Or I can just remove it.

This comment has been minimized.

Copy link
@gfyoung

gfyoung Jun 8, 2018

Member

Well, you removed it in another part of the diff, hence why I'm asking. That being said, I like your suggestion. Let's investigate for another time then, in which case I would put back the other one that you deleted.

if self.categories._defer_to_indexing:
return key in self.categories

return key in self.values

This comment has been minimized.

Copy link
@gfyoung

gfyoung Jun 8, 2018

Member

Remind me: why do we NOT need to check membership in self.values anymore?

This comment has been minimized.

Copy link
@topper-123

topper-123 Jun 8, 2018

Author Contributor

For indices, their indexing engine (i.e. ._engine) has a __contains__ method which does the same thing but is faster (does caching etc. probably, haven't looked into the details of the code).

This comment has been minimized.

Copy link
@gfyoung

gfyoung Jun 8, 2018

Member

Awesome, thanks for clarifying!

@topper-123 topper-123 force-pushed the topper-123:categorical_contains branch 6 times, most recently from 0fea48b to 5c18b8a Jun 8, 2018

@gfyoung

gfyoung approved these changes Jun 9, 2018

Copy link
Member

left a comment

Nice!

cc @jreback

return False
if is_scalar(loc):
return loc in self._engine
else: # if self.categories is IntervalIndex, loc is an array

This comment has been minimized.

Copy link
@jreback

jreback Jun 13, 2018

Contributor

can you put a blank line between things, e.g.

if isna(...):
     ....

try:
   ....
except:
  ....

if is_scalar(...):
   ...

# no else needed here
return ...

This comment has been minimized.

Copy link
@jreback

jreback Jun 13, 2018

Contributor

also can you put comments before each case (not everythin needs a comment), but i find this hard to grok in its current form.

@topper-123 topper-123 force-pushed the topper-123:categorical_contains branch 2 times, most recently from 6ed9dd6 to cfb8f6d Jun 13, 2018

@topper-123 topper-123 force-pushed the topper-123:categorical_contains branch 3 times, most recently from e616770 to 2a87671 Jun 13, 2018

@topper-123 topper-123 force-pushed the topper-123:categorical_contains branch from 2a87671 to ccfba1b Jun 13, 2018

@topper-123

This comment has been minimized.

Copy link
Contributor Author

commented Jun 13, 2018

I've updated the PR.

I've set it to be part of 0.23.2, if that's alright.

@jreback jreback added this to the 0.23.2 milestone Jun 14, 2018

@jreback jreback merged commit bf1c3dc into pandas-dev:master Jun 14, 2018

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jreback

This comment has been minimized.

Copy link
Contributor

commented Jun 14, 2018

thanks @topper-123

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Jun 14, 2018

This is changing the implementation quite substantially, so let's move this to 0.24.0.txt?

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Jun 26, 2018

Any comments on my comment above about keeping this for 0.24.0 ?

@topper-123

This comment has been minimized.

Copy link
Contributor Author

commented Jun 27, 2018

Quility-wise this is ok to go into 23.2 IMO, the PRs are really not that complex, IMO, it's much faster and it doesn't change any APIs.

Also, my main motivation for writing this was speeding up slicing dataframes with a CategoricalIndex (see #20395), which previously was very slow (still is slow, but better than before, and now faster than fancy indexing, at least). I think a lot of people will appreciate this speedup.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jun 27, 2018

we tagged for 0.23.2 (and note is there)
it would be slightly tricky to change as there is another related change - both in 0.23.2 or 0.24
either way is fine

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Jun 27, 2018

The main reason that I am raising this is that __contains__ checking has quite some implications (which is of course also the reason you speeded it up!), and I think it is rather easy to miss a small exotic corner case where the new implementation might differ.
To be clear, @topper-123 , I am not questioning the quality of this PR! I just know from experience, also for us, that it is easy to miss an unintended API change (which we might not even decide to fix if it is debateble behaviour, but then that is still better to be left as 0.24.0). Since this is only about performance improvement (and not a regression), I would play on safe and give it more time in 0.24.0.

it would be slightly tricky to change as there is another related change - both in 0.23.2 or 0.24

Yep, it would be both in 0.23.2 or both in 0.24.0 (the other I actually tagged first as 0.24.0, but that was changed before merging)

@topper-123

This comment has been minimized.

Copy link
Contributor Author

commented Jun 27, 2018

Ok, your call, I won't object.

alimcmaster1 added a commit to alimcmaster1/pandas that referenced this pull request Jun 28, 2018

alimcmaster1 added a commit to alimcmaster1/pandas that referenced this pull request Jun 28, 2018

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.23.2, 0.24.0 Jul 2, 2018

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Jul 2, 2018

OK, left it for 0.24 (moved the whatsnew already to v0.24.0.txt)

Sup3rGeo added a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.