Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: avoid unnecessary recoding in CategoricalIndex._create_categorical #21506

Conversation

topper-123
Copy link
Contributor

This issue was found when looking for a solution to #20395. I've found that CategoricalIndex._create_categorical makes an unnecessary call to Categorical._set_dtype when passed a dtype that is equal to self.dtype:

>>> n = 100_000
>>> ci = pd.CategoricalIndex(list('a'*n + 'b'*n + 'c'*n))
>>> %timeit ci._create_categorical(ci, ci)
197 µs  # master and this PR
>>> %timeit ci._create_categorical(ci, ci, dtype=ci.dtype)
1.92 ms  # master
197 µs  # this PR

Internally, some operations in Pandas pass self.dtype to CategoricalIndex._create_categorical and onwards to _set_dtype, which is a very slow code path.

By avoiding calling _set_dtype unnecessarily, some operations become faster. For example:

>>> df = pd.DataFrame(dict(A=range(n*3)), index=ci)
>>> %timeit df.loc['b']
3.55 ms # master
2.14 ms  # this PR

As Categoricalndex._create_categorical is called directly or indirectly by various methods (e.g. CategoricalIndex._shallow_copy), there are probably other places where this speedup is relevant also.

@codecov
Copy link

codecov bot commented Jun 16, 2018

Codecov Report

Merging #21506 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21506      +/-   ##
==========================================
+ Coverage   91.92%   91.92%   +<.01%     
==========================================
  Files         153      153              
  Lines       49587    49589       +2     
==========================================
+ Hits        45583    45585       +2     
  Misses       4004     4004
Flag Coverage Δ
#multiple 90.32% <100%> (ø) ⬆️
#single 41.89% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/category.py 97.11% <100%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9e982e1...0db3e02. Read the comment docs.

@gfyoung gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Jun 16, 2018
@gfyoung
Copy link
Member

gfyoung commented Jun 16, 2018

@topper-123 : Nice! Can you post asv benchmark timings (to make sure there were no regressions introduced) if you have a chance?

@topper-123
Copy link
Contributor Author

Hmm, there seem to be introduced some performance regressions by this PR that I don't quite understand. I will close this PR until I've figured it out, so as not to clutter PR quoue...

@topper-123 topper-123 closed this Jun 16, 2018
@topper-123 topper-123 deleted the CategoricalIndex._shallow_copy branch September 20, 2018 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants